Title: CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Feb 06, 2024

Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong (+1 more)

Abstract: Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, which in turn results in failures on meticulous visual problems and in unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems through a series of manipulations, where each manipulation is an operation on the visual input, drawn either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zooming in). This mechanism encourages VLMs to generate faithful responses backed by evidential visual reasoning, and lets users trace the causes of errors along the interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with the data swiftly yields competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.
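To make the idea of a chain of manipulations concrete, the following is a minimal toy sketch in Python. It is not the CogCoM implementation or its API: the names `ground`, `crop_and_zoom`, and `Chain`, the nested-list image format, and the value-matching "grounding" rule are all hypothetical stand-ins chosen for illustration. It only shows how a sequence of operations on a visual input might be applied and recorded so that the path can be inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values (assumption)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def ground(image: Image, value: int) -> Box:
    """Toy stand-in for grounding: bounding box of all pixels equal to `value`."""
    hits = [(r, c) for r, row in enumerate(image)
            for c, px in enumerate(row) if px == value]
    if not hits:
        raise ValueError(f"value {value} not found in image")
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def crop_and_zoom(image: Image, box: Box, factor: int) -> Image:
    """Crop `box`, then upscale by an integer factor by repeating pixels."""
    top, left, bottom, right = box
    zoomed: Image = []
    for row in image[top:bottom]:
        wide = [px for px in row[left:right] for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


@dataclass
class Chain:
    """Hypothetical container that logs each manipulation as it is applied."""
    image: Image
    trace: List[str] = field(default_factory=list)

    def grounding(self, value: int) -> Box:
        box = ground(self.image, value)
        self.trace.append(f"grounding(value={value}) -> {box}")
        return box

    def zoom_in(self, box: Box, factor: int) -> None:
        self.image = crop_and_zoom(self.image, box, factor)
        self.trace.append(f"zoom_in(box={box}, factor={factor})")


image = [
    [0, 0, 0, 0],
    [0, 7, 7, 0],
    [0, 0, 0, 0],
]
chain = Chain(image)
box = chain.grounding(7)        # step 1: locate the region of interest
chain.zoom_in(box, factor=2)    # step 2: enlarge that region
for step in chain.trace:        # the recorded path can be read back step by step
    print(step)
```

In this sketch the trace plays the role of the interpretable path mentioned in the abstract: if the final answer is wrong, each recorded step can be checked in order. In an actual VLM the manipulations would be produced by the model itself rather than called by hand.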

* 17 pages, 7 figures
