Title: Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Oct 01, 2023

Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang

Abstract: Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in many vision-language tasks. Nevertheless, most MLLMs still lack the Referential Comprehension (RC) ability to identify a specific object or area in an image, which limits their application in fine-grained perception tasks. This paper proposes a novel method to enhance the RC capability of MLLMs. Our model represents the referring object in the image by the coordinates of its bounding box and converts those coordinates into text in a specific format, so the model can treat them as natural language. Moreover, we construct the instruction tuning dataset, with a variety of designed RC tasks, at low cost by reusing the annotations of existing datasets. To further boost the RC ability of the model, we propose a self-consistent bootstrapping method that extends the dense object annotations of a dataset into high-quality referring-expression-bounding-box pairs. The model is trained end-to-end with a parameter-efficient tuning framework that allows both modalities to benefit from multi-modal instruction tuning. This framework requires fewer trainable parameters and less training data. Experimental results on conventional vision-language and RC tasks demonstrate the superior performance of our method. For instance, our model exhibits a 12.0% absolute accuracy improvement over Instruct-BLIP on VSR and surpasses Kosmos-2 by 24.7% on RefCOCO_val under zero-shot settings. We also attain the top position on the MMBench leaderboard. The models, datasets, and code are publicly available at https://github.com/SY-Xuan/Pink
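To make the coordinates-as-text idea concrete, here is a minimal Python sketch. It is not taken from the paper or its repository. The abstract does not spell out the "specific format", so the normalized `[xmin,ymin,xmax,ymax]` string with three decimals is an assumption, and the helper names (`box_to_text`, `text_to_box`, `iou`, `is_self_consistent`) are hypothetical. The last helper shows one plausible reading of a self-consistency filter: keep a generated referring expression only if grounding it again yields a box that overlaps the original box. The 0.5 IoU threshold is likewise an assumption.

```python
import re

# Assumed text format: "[xmin,ymin,xmax,ymax]" with values normalized to [0, 1].
_BOX_RE = re.compile(
    r"\[\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*\]"
)


def box_to_text(box, width, height, decimals=3):
    """Serialize a pixel-space box (x0, y0, x1, y1) as a normalized text string."""
    x0, y0, x1, y1 = box
    normalized = [x0 / width, y0 / height, x1 / width, y1 / height]
    # Clamp so boxes that slightly overshoot the image still serialize cleanly.
    normalized = [min(max(v, 0.0), 1.0) for v in normalized]
    return "[" + ",".join(f"{v:.{decimals}f}" for v in normalized) + "]"


def text_to_box(text, width, height):
    """Parse the first box string found in text back into pixel coordinates."""
    match = _BOX_RE.search(text)
    if match is None:
        return None
    x0, y0, x1, y1 = (float(g) for g in match.groups())
    return (x0 * width, y0 * height, x1 * width, y1 * height)


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_self_consistent(original_box, regrounded_box, threshold=0.5):
    """Hypothetical filter: keep a pair only if the re-grounded box matches."""
    return iou(original_box, regrounded_box) >= threshold


if __name__ == "__main__":
    text = box_to_text((100, 50, 300, 200), width=400, height=400)
    print(text)
    print(text_to_box(f"The cat is at {text}.", width=400, height=400))
```

Because the box is plain text, it can sit inside an instruction or an answer like any other token sequence, which is what lets a language model read and emit locations without a dedicated detection head.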
