Abstract: While Large Vision Language Models (LVLMs) have become highly capable at reasoning over human prompts and visual inputs, they are still prone to producing responses that contain misinformation. Identifying incorrect responses that are not grounded in evidence has become a crucial task in building trustworthy AI. Explainability methods such as gradient-based relevancy maps on LVLM outputs can provide insight into the decision process of models; however, these methods are often computationally expensive and not suited for on-the-fly validation of outputs. In this work, we propose FastRM, an efficient method for predicting the explainable relevancy maps of LVLMs. Experimental results show that employing FastRM leads to a 99.8% reduction in compute time for relevancy map generation and a 44.4% reduction in memory footprint for the evaluated LVLM, making explainable AI more efficient and practical, thereby facilitating its deployment in real-world applications.
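For context, the sketch below illustrates the kind of gradient-based relevancy map over image patches that the abstract describes as computationally expensive (a full backward pass per generated token); it is a generic baseline, not FastRM itself, and `model`, `input_embeds`, and `image_token_mask` are illustrative placeholders rather than the paper's actual API.

```python
# Minimal sketch of a gradient-based relevancy map over image patches for one
# LVLM output token -- the costly baseline that a fast predictor would replace.
import torch

def gradient_relevancy(model, input_embeds, image_token_mask, target_token_id):
    """Return per-position relevancy scores via gradient x activation."""
    input_embeds = input_embeds.clone().detach().requires_grad_(True)
    logits = model(inputs_embeds=input_embeds).logits       # (1, seq_len, vocab)
    score = logits[0, -1, target_token_id]                  # score of the generated token
    score.backward()                                        # full backward pass (expensive)
    grads = input_embeds.grad                               # (1, seq_len, hidden)
    relevancy = (grads * input_embeds).sum(dim=-1).relu()   # gradient x input
    return relevancy[0, image_token_mask]                   # keep only image-patch positions
```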
Abstract: Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input, such as an image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LMMs that directly removes biased representations during text generation, so that the model avoids generating outputs related to protected attributes, or even representing them internally. Our proposed method is training-free; given a single image and a list of target attributes, we can ablate the corresponding representations with just one step of gradient descent on the image itself. Our experiments show that not only can we minimize the propensity of LMMs to generate text related to protected attributes, but we can also improve sentiment and even use synthetic data to inform the ablation while retaining language modeling capabilities on real data such as COCO or FACET. Furthermore, we find that generations from the debiased LMM exhibit accuracy similar to a baseline biased model, showing that debiasing can be achieved without sacrificing model performance.
Abstract: Large Vision Language Models (LVLMs) such as LLaVA have demonstrated impressive capabilities as general-purpose chatbots that can engage in conversations about a provided input image. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a novel debiasing framework for LVLMs that directly ablates biased attributes during text generation to avoid generating text related to protected attributes, or even representing them internally. Our method requires no training and only a relatively small set of representative biased outputs (~1000 samples). Our experiments show that not only can we minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation while retaining captioning performance on real data such as COCO. Furthermore, we find that generations from the debiased LVLM exhibit accuracy similar to a baseline biased model, showing that debiasing can be achieved without sacrificing model performance.
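As a rough illustration of training-free representation ablation of the kind described in the two debiasing abstracts above, the sketch below estimates an attribute direction from a small set of biased versus neutral activations and projects it out of a decoder layer's hidden states at generation time; the difference-of-means estimate and the forward-hook mechanism are one plausible realization, not necessarily the papers' exact procedure.

```python
# Hedged sketch of attribute-direction ablation via a PyTorch forward hook.
import torch

def attribute_direction(biased_acts, neutral_acts):
    """Estimate a protected-attribute direction from two small activation sets (N, hidden)."""
    d = biased_acts.mean(dim=0) - neutral_acts.mean(dim=0)
    return d / d.norm()

def add_ablation_hook(layer, direction):
    """Remove the component along `direction` from the layer's hidden states."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ direction).unsqueeze(-1) * direction   # project out the attribute direction
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return layer.register_forward_hook(hook)
```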
Abstract: In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360-degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/IntelLabs/MMPano. The video presentation is available at https://youtu.be/XDMNEzH4-Ec?list=PLG9Zyvu7iBa0-a7ccNLO8LjcVRAoMn57s.
Abstract: In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches that are instrumental in generating an answer, and to assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
Abstract: We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B-parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to surpass current comparably sized SOTA models. Closer analysis of performance shows mixed effects: skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code, and weights for the LLaVA-Gemma models.
Abstract: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
Abstract: Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performance on various downstream VL tasks, notably 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1, and 95.64% TR@1 on Flickr30K. Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.
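The sketch below shows one plausible form of a "manager"-style module as described above: a learned softmax gate that adaptively aggregates hidden states from several pre-trained uni-modal encoder layers before cross-modal fusion. It is a minimal illustration under these assumptions, not the exact ManagerTower design, and all names are placeholders.

```python
# Hedged sketch of layer-wise adaptive aggregation of uni-modal representations.
import torch
import torch.nn as nn

class Manager(nn.Module):
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_layers))      # one weight per uni-modal layer
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, layer_states):                           # list of (B, seq, hidden) tensors
        stacked = torch.stack(layer_states, dim=0)              # (L, B, seq, hidden)
        weights = torch.softmax(self.gate, dim=0)               # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(0)    # weighted sum over layers
        return self.proj(fused)
```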
Abstract: This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at https://t.ly/tdi2.
Abstract: Video retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which entails a huge manual effort. In this paper, we propose MKTVR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of video retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual video-text pairs. We then use this data to learn a video-text representation in which English and non-English text queries are mapped to a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on four English video retrieval datasets: MSRVTT, MSVD, DiDeMo, and Charades. Experimental results demonstrate that our approach achieves state-of-the-art results on all datasets, outperforming previous models. Finally, we also evaluate our model on a multilingual video-retrieval dataset encompassing six languages and show that our model outperforms previous multilingual video retrieval models in a zero-shot setting.
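The sketch below illustrates the two ingredients the abstract above describes: constructing pseudo multilingual caption pairs with an off-the-shelf machine translation model, and a symmetric contrastive loss that pulls video and (multilingual) text embeddings into a shared space. The specific translation model, the loss formulation, and all function names are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: machine-translated pseudo pairs + symmetric contrastive alignment.
import torch
import torch.nn.functional as F
from transformers import pipeline

# Example off-the-shelf English-to-German translation model (assumption).
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

def pseudo_pairs(english_captions):
    """Create (English, translated) caption pairs describing the same video."""
    return [(c, translator(c)[0]["translation_text"]) for c in english_captions]

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched video/text embeddings."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature
    labels = torch.arange(len(video_emb), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```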