Title: Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Oct 28, 2024

Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

Abstract: Search engines enable the retrieval of unknown information through text queries. However, traditional methods fall short when it comes to understanding unfamiliar visual content, such as identifying an object that the model has never seen before. This challenge is particularly pronounced for large vision-language models (VLMs): if the model has not been exposed to the object depicted in an image, it struggles to generate reliable answers to the user's question about that image. Moreover, as new objects and events continuously emerge, frequently updating VLMs is impractical because of the heavy computational burden. To address this limitation, we propose Vision Search Assistant, a novel framework that facilitates collaboration between VLMs and web agents. This approach leverages VLMs' visual understanding capabilities and web agents' real-time information access to perform open-world Retrieval-Augmented Generation via the web. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system. Extensive experiments on both open-set and closed-set QA benchmarks demonstrate that Vision Search Assistant significantly outperforms the compared models and can be widely applied to existing VLMs.
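
The collaboration described in the abstract (a VLM describes the image, a web agent retrieves current information, and the VLM answers using that retrieved context) can be illustrated with a minimal sketch. This is not the authors' implementation; the released code is in the linked repository. The names `Snippet`, `answer_with_web_context`, `describe`, `search`, and `generate` are hypothetical placeholders, and the three callables stand in for whatever VLM and search backend one would actually plug in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snippet:
    """One retrieved web result (hypothetical structure, not from the paper)."""
    url: str
    text: str


def answer_with_web_context(
    image: bytes,
    question: str,
    describe: Callable[[bytes], str],              # assumed: VLM captioning call
    search: Callable[[str], List[Snippet]],        # assumed: web-agent search call
    generate: Callable[[bytes, str, str], str],    # assumed: VLM answer call
    top_k: int = 3,
) -> str:
    """Answer a question about an image using web-retrieved context."""
    # 1. Use the VLM's visual understanding to turn the image into text.
    description = describe(image)
    # 2. Combine the description with the user's question into a text query.
    query = f"{description} {question}".strip()
    # 3. Let the web agent fetch up-to-date information; keep the first top_k results.
    snippets = search(query)[:top_k]
    # 4. Join the retrieved text into a context string, keeping each source URL.
    context = "\n".join(f"[{s.url}] {s.text}" for s in snippets)
    # 5. Ask the VLM again, now conditioned on the image, question, and context.
    return generate(image, question, context)
```

Passing the model and search calls in as functions keeps the sketch independent of any particular VLM or search API, which mirrors the abstract's claim that the framework can be applied to existing VLMs.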

* Code is available at https://github.com/cnzzx/VSA
