Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

May 23, 2024

Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

Figure 1 for UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

Figure 2 for UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

Figure 3 for UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

Figure 4 for UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

Share this with someone who'll enjoy it:

Abstract:Large vision-language models (LVLMs) are ignorant of the up-to-date knowledge, such as LLaVA series, because they cannot be updated frequently due to the large amount of resources required, and therefore fail in many cases. For example, if a LVLM was released on January 2024, and it wouldn't know the detailed plot of the new movie Dune 2, which wasn't released until February 2024. To solve the problem, a promising solution is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated in some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning them remain a mystery. In this paper, we propose a plug-and-play framework, for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed UDKAG. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline to automatically generate news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of website/content for VQA samples to construct the training set. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4V by about 25% in accuracy.

* 12 pages, 6 figures, a framework to augment large vision-language models with up-to-date knowledge

View paper on

Share this with someone who'll enjoy it:

Title:UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

Paper and Code