Title: Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities

Aug 13, 2024

Shivam Chandhok, Wan-Cyuan Fan, Leonid Sigal

Abstract: Vision-Language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable but, at the same time, to lack some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks: object classification, understanding spatial arrangement, and the ability to delineate individual object instances (through counting), by constructing a series of tests that probe which components of the design, specifically, may be lacking. Importantly, we go significantly beyond current benchmarks, which simply measure the final performance of a VLM, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder (image embeddings), as well as on the intermediate vision-language projection used to bridge the image-encoder and LLM-decoder output in many SoTA models (e.g., LLaVA, BLIP, InstructBLIP). In doing so, we uncover nascent shortcomings in VLM responses and make a number of important observations that could help train and develop more effective VLMs in the future.
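To make the idea of a probe trained on frozen features concrete, here is a minimal sketch of a linear (softmax-regression) probe. It is not the authors' setup: the features are synthetic Gaussian clusters standing in for image-encoder or projection outputs, and the names `make_synthetic_features`, `train_linear_probe`, and `probe_accuracy` are hypothetical helpers introduced only for illustration. In a real experiment one would presumably swap in embeddings extracted from each stage of the model and compare probe accuracy across stages.

```python
import numpy as np


def make_synthetic_features(num_samples=300, dim=16, num_classes=3, seed=0):
    """Synthetic stand-in for frozen embeddings: one Gaussian cluster per class.

    Assumption: real features would come from a visual encoder or projection layer.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=3.0, size=(num_classes, dim))
    labels = rng.integers(0, num_classes, size=num_samples)
    features = centers[labels] + rng.normal(size=(num_samples, dim))
    return features, labels


def train_linear_probe(features, labels, num_classes, lr=0.01, epochs=300):
    """Fit a softmax-regression probe with full-batch gradient descent.

    Only the probe's weights are learned; the features are treated as fixed.
    """
    num_samples, dim = features.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / num_samples  # gradient of mean cross-entropy
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def probe_accuracy(features, labels, weights, bias):
    """Fraction of samples whose highest-scoring class matches the label."""
    predictions = (features @ weights + bias).argmax(axis=1)
    return float((predictions == labels).mean())


if __name__ == "__main__":
    feats, labs = make_synthetic_features()
    # Hold out the last 100 samples to measure how linearly decodable the labels are.
    w, b = train_linear_probe(feats[:200], labs[:200], num_classes=3)
    print("held-out probe accuracy:", probe_accuracy(feats[200:], labs[200:], w, b))
```

Running the same probe on features from different stages (encoder, projection) and comparing against the full model's answers is one way to see where task-relevant information is lost; the choice of a linear probe here is an assumption, since the abstract does not specify the probe architecture.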

* Under Submission
