Title: Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Nov 30, 2023

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs at finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and a visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guessing, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs for fine-grained understanding, achieving significant improvements on SPEC without compromising zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach.

* 10 pages, 5 figures
