Title: Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Nov 17, 2024

Zeping Yu, Sophia Ananiadou

Abstract: Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multi-modal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms between VQA and textual QA (TQA) in color answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features exhibit significant interpretability when projecting the visual embeddings into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM Vicuna during visual instruction tuning. Based on these findings, we develop an interpretability tool to help users and researchers identify important visual locations for final predictions, aiding in the understanding of visual hallucination. Our method demonstrates faster and more effective results compared to existing interpretability approaches. Code: https://github.com/zepingyu0512/llava-mechanism
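
Finding (b) concerns projecting visual embeddings into the embedding space. As a rough illustration of what such a projection can look like, the sketch below labels each visual embedding with the vocabulary token whose embedding is closest by cosine similarity. This is an assumption made for illustration, not the authors' implementation (see the linked repository for that): the function names `cosine` and `nearest_tokens`, the three-dimensional toy vectors, and the choice of cosine similarity are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_tokens(visual_embedding, token_embeddings, k=2):
    """Return the k tokens whose embeddings are most similar to one visual embedding."""
    scored = [(cosine(visual_embedding, vec), tok)
              for tok, vec in token_embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tok for _, tok in scored[:k]]


# Toy data (hypothetical): a 3-token vocabulary and two "image patch" embeddings.
toy_vocab = {
    "red": [1.0, 0.0, 0.0],
    "green": [0.0, 1.0, 0.0],
    "blue": [0.0, 0.0, 1.0],
}
toy_patches = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]

# Label every patch with its single closest token.
labels = [nearest_tokens(patch, toy_vocab, k=1)[0] for patch in toy_patches]
print(labels)  # ['red', 'blue']
```

In a real model the token embeddings would come from the language model's embedding matrix and the patch vectors from the vision projector's output. Both are far higher-dimensional than this toy example.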

* preprint
