Title: VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

May 28, 2024

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Zhongyu Wei

Abstract: While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. Additionally, we construct an instruction dataset to facilitate LMMs in adapting to reasoning with VoCoT. By introducing VoCoT into the prevalent open-source LMM architecture, we introduce VolCano. With only 7B parameters and limited input resolution, VolCano demonstrates excellent performance across various scenarios, surpassing SOTA models, including GPT-4V, in tasks requiring complex reasoning. Our code, data and model will be available at https://github.com/RupertLuo/VoCoT.
