Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Oct 18, 2024

Jingqi Zhou, Sheng Wang, Jingwei Dong, Lei Li, Jiahui Gao, Lingpeng Kong, Chuan Wu

Figure 1 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Figure 2 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Figure 3 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Figure 4 for ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Share this with someone who'll enjoy it:

Abstract:Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

View paper on

Share this with someone who'll enjoy it:

Title:ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Paper and Code