Title: Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

Nov 08, 2023

Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata

Abstract: Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, the question, and the predicted answer as inputs. The LLM is guided to select tokens that describe the regions in the input image that the VQA model attended to. Crucially, we determine the similarity between candidate text and the attended regions by exploiting the text-image matching capabilities of the underlying VQA model. Our framework allows the drop-in replacement of different guiding sources (e.g. attribution instead of attention maps) or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X. Our code is available at: https://github.com/ExplainableML/ZS-A2T.

* Published in GCPR 2023
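
The guided token selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that); it only shows the general idea of re-ranking a language model's top candidate tokens with an external matching score. All names here (`guided_greedy_decode`, `toy_lm`, `toy_match`, `ATTENDED`, `top_k`, `weight`) are hypothetical, and the toy functions stand in for a real LLM and for the VQA model's text-image matching on the attended regions.

```python
import math
from typing import Callable, Dict, List


def guided_greedy_decode(
    lm_next_logprobs: Callable[[List[str]], Dict[str, float]],
    match_score: Callable[[List[str]], float],
    prompt: List[str],
    top_k: int = 3,
    weight: float = 1.0,
    max_new_tokens: int = 5,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy decoding where each step re-ranks the LM's top-k tokens.

    Assumption: the combined score is LM log-probability plus a weighted
    matching score. The exact scoring used by ZS-A2T may differ.
    """
    generated: List[str] = []
    for _ in range(max_new_tokens):
        # Next-token log-probabilities given the prompt and the text so far.
        logprobs = lm_next_logprobs(prompt + generated)
        # Keep only the k tokens the language model itself finds most likely.
        candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]
        # Pick the candidate whose continuation best matches the guidance.
        best = max(
            candidates,
            key=lambda tok: logprobs[tok] + weight * match_score(generated + [tok]),
        )
        if best == eos:
            break
        generated.append(best)
    return generated


# --- Toy stand-ins (hypothetical, for illustration only) ---

# Words assumed to describe the image regions the VQA model attended to.
ATTENDED = {"red", "ball"}


def toy_lm(tokens: List[str]) -> Dict[str, float]:
    """A context-independent 'language model' that prefers the word 'blue'."""
    return {
        "blue": math.log(0.4),
        "red": math.log(0.3),
        "ball": math.log(0.2),
        "<eos>": math.log(0.1),
    }


def toy_match(tokens: List[str]) -> float:
    """Stand-in for text-image matching: counts distinct attended words."""
    return float(len(set(tokens) & ATTENDED))


guided = guided_greedy_decode(toy_lm, toy_match, ["question", "answer"],
                              top_k=3, weight=1.0, max_new_tokens=2)
unguided = guided_greedy_decode(toy_lm, toy_match, ["question", "answer"],
                                top_k=3, weight=0.0, max_new_tokens=2)
```

With these toy stand-ins, `guided` is `["red", "ball"]`, while `unguided` (weight 0.0) falls back to the language model's own preference, `["blue", "blue"]`. Swapping `match_score` for a different guidance source mirrors the drop-in replacement the abstract mentions.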
