Title: Grounding Language Models for Visual Entity Recognition

Feb 28, 2024

Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, Vicente Ordonez

Abstract: We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive multimodal large language model with retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling at queries that require visually situated reasoning. Our method learns to distinguish similar entities within a vast label space by training contrastively on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits of the recently proposed Oven-Wiki benchmark: accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
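The idea of retrieved candidates "removing invalid decoding paths" can be illustrated with a prefix trie over the candidate entity names. The following is a minimal sketch, not AutoVER's actual implementation: the names `PrefixTrie`, `constrained_greedy_decode`, and `score_fn` are hypothetical, tokens are whole words rather than subword IDs, the scores are made-up toy values standing in for a language model, and greedy search is assumed in place of whatever decoding strategy the paper uses.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker


class PrefixTrie:
    """Stores candidate token sequences so valid continuations can be looked up."""

    def __init__(self, sequences: Sequence[Sequence[str]]) -> None:
        self.root: Dict[str, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node[END] = {}  # mark that a candidate ends here

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Return the tokens that keep `prefix` on a path to some candidate."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # prefix already left every candidate path
            node = node[tok]
        return list(node.keys())


def constrained_greedy_decode(
    trie: PrefixTrie,
    score_fn: Callable[[Sequence[str], str], float],
    max_len: int = 16,
) -> List[str]:
    """Greedy decoding where only trie-allowed tokens compete at each step."""
    prefix: List[str] = []
    for _ in range(max_len):
        allowed = trie.allowed_next(prefix)
        if not allowed:
            break
        # Tokens outside `allowed` are never scored, so they cannot be chosen.
        best = max(allowed, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            break
        prefix.append(best)
    return prefix


# Toy stand-in for model scores; "poodle" scores highest but is not a candidate.
TOY_SCORES = {"poodle": 5.0, "bridge": 4.0, "golden": 3.0, "gate": 2.5,
              "labrador": 2.0, "retriever": 1.0, END: 0.5}


def toy_score(prefix: Sequence[str], tok: str) -> float:
    return TOY_SCORES.get(tok, 0.0)


candidates = [["golden", "retriever"],
              ["golden", "gate", "bridge"],
              ["labrador", "retriever"]]
trie = PrefixTrie(candidates)
result = constrained_greedy_decode(trie, toy_score)
print(" ".join(result))
```

Because every step chooses only among tokens the trie allows, the decoded sequence follows a path to one of the retrieved candidates, even when an unconstrained decoder would prefer a token outside the candidate list.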
