Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Mar 25, 2024

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li

Figure 1 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Figure 2 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Figure 3 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Figure 4 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Share this with someone who'll enjoy it:

Abstract:This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available to foster further research in this direction.

* Code: https://github.com/deepcs233/Visual-CoT

View paper on

Share this with someone who'll enjoy it:

Title:Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Paper and Code