Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

M. Jehanzeb Mirza

Overflow Prevention Enhances Long-Context Recurrent LLMs

May 12, 2025

Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes

Abstract:A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.

Via

Access Paper or Ask Questions

Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

Apr 19, 2025

Johannes Spoecklberger, Wei Lin, Pedro Hermosilla, Sivan Doveh, Horst Possegger, M. Jehanzeb Mirza

Abstract:Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains. For example, achieving an average improvement of 6.5 mIoU (over all tasks), when compared with the previous state-of-the-art.

Via

Access Paper or Ask Questions

Teaching VLMs to Localize Specific Objects from In-context Examples

Nov 20, 2024

Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle(+2 more)

Figure 1 for Teaching VLMs to Localize Specific Objects from In-context Examples

Figure 2 for Teaching VLMs to Localize Specific Objects from In-context Examples

Figure 3 for Teaching VLMs to Localize Specific Objects from In-context Examples

Figure 4 for Teaching VLMs to Localize Specific Objects from In-context Examples

Abstract:Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances few-shot localization performance without sacrificing generalization, as demonstrated on several benchmarks tailored to personalized localization. This work is the first to explore and benchmark personalized few-shot localization for VLMs, laying a foundation for future research in context-driven vision-language applications. The code for our project is available at https://github.com/SivanDoveh/IPLoc

Via

Access Paper or Ask Questions

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Oct 15, 2024

Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Chosen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky(+1 more)

Abstract:The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models we propose LiveXiv: A scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.

Via

Access Paper or Ask Questions

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Oct 08, 2024

M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki(+5 more)

Figure 1 for GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Figure 2 for GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Figure 3 for GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Figure 4 for GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Abstract:In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Langugage Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text prompts preferred by the downstream VLM. Furthermore, we also explicitly steer the LLM generation process in each optimization step by specifically adding an offset difference vector of the embeddings from the positive and negative solutions found by the LLM, in previous optimization steps, to the intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models -- showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models.

* Code: https://github.com/jmiemirza/GLOV

Via

Access Paper or Ask Questions

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Jun 12, 2024

Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuhene, Trevor Darrel(+4 more)

Figure 1 for ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Figure 2 for ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Figure 3 for ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Figure 4 for ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Abstract:Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.

* The first three authors contributed equally

Via

Access Paper or Ask Questions

Into the Fog: Evaluating Multiple Object Tracking Robustness

Apr 12, 2024

Nadezda Kirillova, M. Jehanzeb Mirza, Horst Possegger, Horst Bischof

Abstract:State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a pipeline for physic-based volumetric fog simulation in arbitrary real-world MOT dataset utilizing frame-by-frame monocular depth estimation and a fog formation optical model. Moreover, we enhance our simulation by rendering of both homogeneous and heterogeneous fog effects. We propose to use the dark channel prior method to estimate fog (smoke) color, which shows promising results even in night and indoor scenes. We present the leading tracking benchmark MOTChallenge (MOT17 dataset) overlaid by fog (smoke for indoor scenes) of various intensity levels and conduct a comprehensive evaluation of SOTA MOT methods, revealing their limitations under fog and fog-similar challenges.

Via

Access Paper or Ask Questions

Towards Multimodal In-Context Learning for Vision & Language Models

Mar 19, 2024

Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

Figure 1 for Towards Multimodal In-Context Learning for Vision & Language Models

Figure 2 for Towards Multimodal In-Context Learning for Vision & Language Models

Figure 3 for Towards Multimodal In-Context Learning for Vision & Language Models

Figure 4 for Towards Multimodal In-Context Learning for Vision & Language Models

Abstract:Inspired by the emergence of Large Language Models (LLMs) that can truly understand human language, significant progress has been made in aligning other, non-language, modalities to be `understandable' by an LLM, primarily via converting their samples into a sequence of embedded language-like tokens directly fed into the LLM (decoder) input stream. However, so far limited attention has been given to transferring (and evaluating) one of the core LLM capabilities to the emerging VLMs, namely the In-Context Learning (ICL) ability, or in other words to guide VLMs to desired target downstream tasks or output structure using in-context image+text demonstrations. In this work, we dive deeper into analyzing the capabilities of some of the state-of-the-art VLMs to follow ICL instructions, discovering them to be somewhat lacking. We discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot (ICL) demonstrations, likely due to their lack of `direct' ICL instruction tuning. To test this conjecture, we propose a simple, yet surprisingly effective, strategy of extending a common VLM alignment framework with ICL support, methodology, and curriculum. We explore, analyze, and provide insights into effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines and a variety of ICL benchmarks. We also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.

Via

Access Paper or Ask Questions

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Mar 19, 2024

M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuhene, Horst Possegger

Abstract:Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively

* Project Page (Code and Data): https://jmiemirza.github.io/Meta-Prompting/

Via

Access Paper or Ask Questions

TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Sep 13, 2023

M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof

Figure 1 for TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Figure 2 for TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Figure 3 for TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Figure 4 for TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Abstract:Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be significantly further improved taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines.

* Code is available at: https://github.com/jmiemirza/TAP

Via

Access Paper or Ask Questions