Makarand Tapaswi

CVIT, IIIT Hyderabad

One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition

Apr 25, 2026

Balaji Darur, Amanmeet Garg, Makarand Tapaswi

Abstract:Video Situation Recognition (VidSitu) addresses the challenging problem of "who did what to whom, with what, how, and where" in a video. It tests thorough video understanding by requiring identification of salient actions and associated short descriptions for event roles across multiple events. Grounding with VidSitu requires spatio-temporal localization of key entities across shots and varied appearances. We posit that coherent video understanding requires consistent identification of entities that play different roles. We propose Multimodal Entity Coreference (MEC) to unite entity descriptions in text with grounding across the video. Towards this, we introduce CineMEC, a multi-stage approach that unites event role mention groups with visual clusters of entities, without explicit grounding supervision during training. Our approach is designed to exploit the synergy between visual grounding and captioning, where improving one influences the other and vice versa. For evaluation, we extend the VidSitu dataset with grounding annotations. While previous work focuses primarily on descriptions, CineMEC improves consistency across both: captioning (+2.5% CIDEr, +7% LEA) and visual grounding (+18% HOTA).

* Accepted to CVPR 2026 Findings. Project Page: https://katha-ai.github.io/projects/cinemec/

Steerable Visual Representations

Apr 02, 2026

Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

Abstract:Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.

* preprint

STRinGS: Selective Text Refinement in Gaussian Splatting

Dec 08, 2025

Abhinav Raundhal, Gaurav Behera, P J Narayanan, Ravi Kiran Sarvadevabhatla, Makarand Tapaswi

Abstract: Text in the form of signs, labels, or instructions is a critical element of real-world scenes, as it can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) achieve high visual fidelity but struggle to preserve fine-grained text details. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure, OCR Character Error Rate (CER), to evaluate efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset, STRinGS-360, with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.

* Accepted to WACV 2026. Project Page: https://STRinGS-official.github.io
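
To make the CER measure mentioned in the abstract concrete, here is a minimal sketch of the standard character error rate: the edit distance between a reference string and an OCR output, divided by the reference length. The abstract does not say which OCR engine or text normalization the authors use, so the function names (`levenshtein`, `cer`, `relative_improvement`) and the way relative improvement is computed are illustrative assumptions, not the paper's implementation.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character from ref
                            curr[j - 1] + 1,      # insert a character into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of an error metric relative to a baseline (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - improved) / baseline


# Hypothetical example: text on a sign versus what an OCR model reads in a render.
print(cer("EXIT", "EX1T"))
```

Lower CER means the rendered text is more readable to an OCR model; a relative improvement compares two such error rates rather than subtracting them.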

MALeR: Improving Compositional Fidelity in Layout-Guided Generation

Nov 08, 2025

Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi

Abstract:Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.

* ACM TOG Dec 2025, SIGGRAPH Asia. Project page: https://katha-ai.github.io/projects/maler/

What You See is What You Ask: Evaluating Audio Descriptions

Oct 01, 2025

Divy Kala, Eshika Khandelwal, Makarand Tapaswi

Abstract:Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.

* EMNLP 2025 Main Track Long Paper

Investigating Mechanisms for In-Context Vision Language Binding

May 28, 2025

Darshana Saravanan, Makarand Tapaswi, Vineet Gandhi

Abstract:To understand a prompt, Vision-Language models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image to phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.

* Accepted to MIV at CVPRW 2025 (Oral)

The Sound of Water: Inferring Physical Properties from Pouring Liquids

Nov 18, 2024

Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman

Abstract: We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and (v) demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.

* 25 pages, 17 figures. Project page at https://bpiyush.github.io/pouring-water-website
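
As a rough illustration of how pitch can relate to liquid level, the sketch below treats the air above the liquid as an idealized quarter-wavelength resonator (closed at the liquid surface, open at the top), so the fundamental frequency is f0 = c / (4 L) for an air column of length L. This textbook model, the cylindrical container, the omission of end corrections, and all names here are assumptions made for illustration; the abstract does not state the paper's actual derivation.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for dry air at about 20 °C


def air_column_length(f0_hz: float, c: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Length of the air column implied by a pitch, using f0 = c / (4 L)."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    return c / (4.0 * f0_hz)


def liquid_level(container_height_m: float, f0_hz: float,
                 c: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Liquid height = container height minus air column length (clamped at 0)."""
    level = container_height_m - air_column_length(f0_hz, c)
    return max(0.0, level)


# Hypothetical 20 cm tall cylinder: as it fills, the air column shortens and pitch rises.
for f0 in (500.0, 857.5, 1500.0):
    print(f0, round(liquid_level(0.20, f0), 4))
```

Under this idealization the pitch rises as the container fills, which is why a pitch estimate over time carries information about the level and the fill rate.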

IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark

Nov 12, 2024

Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal

Abstract: Recent evaluations of LLMs on coreference resolution have revealed that traditional output formats and evaluation metrics do not fully capture the models' referential understanding. To address this, we introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format, commonly used for evaluating LLMs. IdentifyMe features long narratives and employs heuristics to exclude easily identifiable mentions, creating a more challenging task. The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance. We evaluate both closed- and open-source LLMs on IdentifyMe and observe a significant performance gap (20-30%) between the state-of-the-art sub-10B open models and closed ones. We observe that pronominal mentions, which have limited surface information, are typically much harder for models to resolve than nominal mentions. Additionally, we find that LLMs often confuse entities when their mentions overlap in nested structures. The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs while also indicating room for further improvement.

* 9 pages, 5 figures

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Sep 04, 2024

Manu Gaur, Darshan Singh S, Makarand Tapaswi

Abstract: Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that better leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to +7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
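
The self-retrieval idea above can be sketched as a retrieval check: a generated caption counts as a hit if, among the target image and its distractors (99 in the RD100 setting), the caption is most similar to the target. The sketch below assumes captions and images have already been embedded into a shared space by some model (for example a CLIP-style encoder) and scored with cosine similarity; the embedding model, the scoring function, and the names `self_retrieval_hit` and `recall_at_1` are illustrative assumptions, not details given in the abstract.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def self_retrieval_hit(caption_vec: Vector, image_vecs: List[Vector],
                       target_index: int) -> bool:
    """True if the caption retrieves its own image from the candidate pool."""
    scores = [cosine(caption_vec, v) for v in image_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best == target_index


def recall_at_1(trials: List[Tuple[Vector, List[Vector], int]]) -> float:
    """Fraction of captions that rank their own image first."""
    hits = sum(self_retrieval_hit(c, imgs, t) for c, imgs, t in trials)
    return hits / len(trials)


# Toy pool with 3 candidates instead of 100; index 1 is the target image.
caption = [1.0, 0.0]
pool = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(self_retrieval_hit(caption, pool, target_index=1))
```

A generic caption tends to match many images in the pool about equally well, so it fails this check more often than a caption that mentions distinguishing details.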

Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Jun 20, 2024

Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi

Abstract:The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task's broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

* 16 pages, 6 figures
