Title: The Narrow Gate: Localized Image-Text Communication in Vision-Language Models

Dec 09, 2024

Alessandro Serra, Francesco Ortu, Emanuele Panizon, Lucrezia Valeriani, Lorenzo Basile, Alessio Ansuini, Diego Doimo, Alberto Cazzaniga

Abstract: Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, specifically focusing on how visual information is processed and transferred to the textual domain. We compare VLMs that generate both images and text with those that output only text, highlighting key differences in information flow. We find that in models with multimodal outputs, image and text embeddings are more separated within the residual stream. Additionally, models vary in how information is exchanged from visual to textual tokens. VLMs that only output text exhibit a distributed communication pattern, where information is exchanged through multiple image tokens. In contrast, models trained for image and text generation rely on a single token that acts as a narrow gate for the visual information. We demonstrate that ablating this single token significantly deteriorates performance on image understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.
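To make the idea of ablating a single token more concrete, here is a minimal toy sketch of attention knockout, one common way to cut communication between positions. It is not the authors' code, and the abstract does not state that this is the exact procedure used in the paper. The sequence layout, the `GATE` and `TEXT` positions, the uniform scores, and the two-dimensional value vectors are all hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    # Numerically stable softmax; a score of -inf receives exactly zero weight.
    m = max(x for x in row if x != -math.inf)
    exps = [0.0 if x == -math.inf else math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(scores, values, blocked=frozenset()):
    """Causal single-head attention with optional knockout of (query, key) edges."""
    dim = len(values[0])
    outputs = []
    for q, row in enumerate(scores):
        # Mask future positions (causal) and any explicitly blocked edge.
        masked = [
            -math.inf if (k > q or (q, k) in blocked) else s
            for k, s in enumerate(row)
        ]
        weights = softmax(masked)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy sequence: positions 0-1 are ordinary image tokens,
# position 2 plays the role of the gate token, position 3 is a text token.
GATE, TEXT = 2, 3

# Uniform attention scores, purely for illustration.
scores = [[0.0] * 4 for _ in range(4)]

# Assumption: only the gate token carries signal in the second value dimension,
# standing in for "image semantics" that must reach the text token.
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

baseline = attention(scores, values)
ablated = attention(scores, values, blocked={(TEXT, GATE)})

print("text token output, baseline:", baseline[TEXT])
print("text token output, ablated: ", ablated[TEXT])
```

In this toy setup, blocking the single edge from the text token to the gate token removes the second value component from the text token's output, while the outputs at all other positions are unchanged. A real VLM has many layers and heads, so an actual ablation would need to apply the mask inside the model's attention modules.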
