Title: Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?

Oct 17, 2024

Shailaja Keyur Sampat, Maitreya Patel, Yezhou Yang, Chitta Baral

Abstract: The ability to learn about new objects from a small amount of visual data, and to produce convincing linguistic justification for the presence or absence of certain concepts (that collectively compose the object) in novel scenarios, is an important characteristic of human cognition. This is possible due to abstraction of the attributes/properties that an object is composed of; e.g., an object 'bird' can be identified by the presence of a beak, feathers, legs, wings, etc. Inspired by this aspect of human reasoning, in this work we present a zero-shot framework for fine-grained visual concept learning that leverages a large language model and a Visual Question Answering (VQA) system. Specifically, we prompt GPT-3 to obtain a rich linguistic description of the visual objects in the dataset. We convert the obtained concept descriptions into a set of binary questions. We pose these questions, along with the query image, to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate performance comparable to existing zero-shot visual classification methods and few-shot concept learning approaches, without substantial computational overhead, while being fully explainable from the reasoning perspective.
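The pipeline the abstract describes (concept descriptions, then binary questions, then aggregated VQA answers) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the question template, the fraction-of-"yes" aggregation rule, the function names, and the `toy_vqa` stand-in are all hypothetical. In a real setup the attribute lists would come from an LLM such as GPT-3 and `vqa` would wrap an actual VQA model.

```python
from typing import Any, Callable, Dict, List

# A VQA callable takes (image, question) and returns an answer string.
VQAFn = Callable[[Any, str], str]


def attributes_to_questions(attributes: List[str]) -> List[str]:
    """Turn each attribute phrase into a yes/no question (hypothetical template)."""
    return [f"Does the object in the image have {attr}?" for attr in attributes]


def score_concept(image: Any, attributes: List[str], vqa: VQAFn) -> float:
    """Fraction of attribute questions answered 'yes' (assumed aggregation rule)."""
    questions = attributes_to_questions(attributes)
    if not questions:
        return 0.0
    yes_count = sum(1 for q in questions if vqa(image, q).strip().lower() == "yes")
    return yes_count / len(questions)


def classify(image: Any, concept_attributes: Dict[str, List[str]], vqa: VQAFn) -> str:
    """Return the concept whose attribute questions get the highest 'yes' fraction."""
    scores = {
        concept: score_concept(image, attrs, vqa)
        for concept, attrs in concept_attributes.items()
    }
    return max(scores, key=scores.get)


def toy_vqa(image: Any, question: str) -> str:
    """Stand-in for a VQA model: the 'image' is just a set of visible attribute phrases."""
    return "yes" if any(part in question for part in image) else "no"


# Hand-written attribute lists standing in for LLM-generated descriptions.
concepts = {
    "bird": ["a beak", "feathers", "wings"],
    "fish": ["fins", "scales", "gills"],
}
image = {"a beak", "feathers", "wings"}

print(classify(image, concepts, toy_vqa))
```

Because every decision reduces to a list of question and answer pairs, the per-question answers can be reported alongside the predicted label, which is the sense in which such a pipeline is explainable.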

* 14 pages, 7 figures
