Title: Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Jul 23, 2024

Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E Turner

Abstract: Vision-language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs rely heavily on pretrained and frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate to the VLM responses, resulting in sub-optimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Theoretical grounding, generality, and computational efficiency characterize our approach.
