Title: Generalizable Entity Grounding via Assistance of Large Language Model

Feb 04, 2024

Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang

Abstract: In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
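
The abstract does not say how entity masks are encoded into a colormap, so the following is only a minimal sketch of one plausible reading: each class-agnostic entity mask is painted into a single image with its own color, and the masks can later be recovered from that image. The names `entity_color`, `masks_to_colormap`, and `colormap_to_masks`, the index-to-RGB scheme, and the assumption of non-overlapping binary masks are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Mask = List[List[int]]        # binary mask: 1 where the entity is present
Color = Tuple[int, int, int]  # RGB triple
BACKGROUND: Color = (0, 0, 0)


def entity_color(index: int) -> Color:
    """Map a 0-based entity index to a distinct, non-black RGB color.

    Hypothetical scheme: treat (index + 1) as a 24-bit integer and split
    it into three bytes. A real system would likely pick colors that are
    visually well separated; this only guarantees uniqueness.
    """
    code = index + 1
    if code >= 1 << 24:
        raise ValueError("too many entities for a 24-bit color code")
    return ((code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF)


def masks_to_colormap(masks: List[Mask]) -> List[List[Color]]:
    """Paint every entity mask into one image, one color per entity.

    Assumes all masks share the same size and do not overlap; if they do
    overlap, later masks overwrite earlier ones.
    """
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    colormap = [[BACKGROUND for _ in range(width)] for _ in range(height)]
    for index, mask in enumerate(masks):
        color = entity_color(index)
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    colormap[y][x] = color
    return colormap


def colormap_to_masks(colormap: List[List[Color]], num_entities: int) -> List[Mask]:
    """Recover one binary mask per entity by matching its color."""
    masks = []
    for index in range(num_entities):
        color = entity_color(index)
        masks.append([[1 if pixel == color else 0 for pixel in row]
                      for row in colormap])
    return masks


# Two toy 2x4 entity masks that do not overlap.
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 1, 0]],
]
colormap = masks_to_colormap(masks)
recovered = colormap_to_masks(colormap, num_entities=len(masks))
```

In this toy setting the round trip is lossless for non-overlapping masks, which illustrates why a single color-coded image can stand in for a set of entity masks. How the paper's pipeline actually feeds such a colormap to the vision encoder is not described in the abstract.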
