Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Oct 31, 2024

Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen

Figure 1 for Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Figure 2 for Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Figure 3 for Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Figure 4 for Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Share this with someone who'll enjoy it:

Abstract:Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded finegrained textual description (referred to as "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g. +6.9% improvement in OVEN entity task), underscoring the importance of high-quality training data in this domain.

* NeurIPS 2024

View paper on

Share this with someone who'll enjoy it:

Title:Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Paper and Code