Title: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Feb 12, 2024

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh


Abstract: Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance, a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization from language, and targeted challenge sets that probe properties such as hallucination; evaluations that provide calibrated, fine-grained insight into a VLM's capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and quantifying the tradeoffs of using base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible code for VLM training, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open-source VLMs.

* 22 pages, 11 figures. Training code and models: https://github.com/TRI-ML/prismatic-vlms. Evaluation code: https://github.com/TRI-ML/vlm-evaluation
