Title: ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

Oct 09, 2023

Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang

Abstract: In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We divide VCR problems into two categories: visual commonsense understanding (VCU) and visual commonsense inference (VCI). For VCU, which involves perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. In VCI, by contrast, the goal is to infer conclusions beyond the image content, and VLMs struggle. We find that a baseline in which VLMs pass their perception results (image captions) to LLMs improves performance on VCI. However, we identify a limitation of this passive perception: the VLMs often miss crucial context information, which leads to incorrect or uncertain reasoning by the LLMs. To mitigate this issue, we propose a collaborative approach in which LLMs, when uncertain about their reasoning, actively direct VLMs to focus on and gather the visual elements relevant to potential commonsense inferences. In our method, named ViCor, pre-trained LLMs serve three roles: problem classifiers that determine the problem category, VLM commanders that use the VLMs differently depending on that classification, and visual commonsense reasoners that answer the question. The VLMs perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain supervised fine-tuning.
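The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` and `VLM` callable types, the prompt wording, the `UNCERTAIN` marker, and the `Answer` record are all hypothetical names introduced here. Routing VCU questions straight to the VLM is also an assumption, made only because the abstract reports that VLMs generalize well on VCU.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any callables with these shapes can be plugged in.
LLM = Callable[[str], str]        # prompt -> generated text
VLM = Callable[[str, str], str]   # (image reference, instruction) -> generated text


@dataclass
class Answer:
    text: str
    category: str          # "VCU" or "VCI"
    asked_vlm_again: bool  # True if the LLM directed the VLM to look again


def answer_question(image: str, question: str, llm: LLM, vlm: VLM) -> Answer:
    # Role 1: the LLM acts as a problem classifier.
    category = llm(f"Classify as VCU or VCI: {question}").strip().upper()
    if category not in ("VCU", "VCI"):
        category = "VCI"  # assumption: fall back to the inference path

    if category == "VCU":
        # Assumption: literal-content questions are handed to the VLM directly.
        return Answer(vlm(image, question), "VCU", False)

    # VCI baseline: the VLM captions the image, then the LLM reasons over it.
    caption = vlm(image, "Describe the image.")
    reply = llm(
        f"Caption: {caption}\nQuestion: {question}\n"
        "Answer, or reply UNCERTAIN if the caption is insufficient."
    )
    if "UNCERTAIN" not in reply.upper():
        return Answer(reply, "VCI", False)

    # Role 2: the LLM acts as a VLM commander and names what to look for.
    focus = llm(f"Which visual detail would help answer: {question}")
    evidence = vlm(image, focus)

    # Role 3: the LLM acts as the visual commonsense reasoner.
    final = llm(
        f"Caption: {caption}\nEvidence: {evidence}\n"
        f"Question: {question}\nAnswer."
    )
    return Answer(final, "VCI", True)
```

Passing the models in as plain callables keeps the sketch independent of any particular model API; the uncertainty check here is a simple string match and stands in for whatever confidence signal a real system would use.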
