Title: VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Dec 04, 2021

Renrui Zhang, Longtian Qiu, Wei Zhang, Ziyao Zeng

Abstract: Contrastive Vision-Language Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus conduct zero-shot recognition in open-vocabulary scenarios. However, there exists a semantic gap between the specific application and the generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the text feature to adaptively explore informative regions of the image and aggregate the visual feature through a cross-attention mechanism. In this way, the visual-guided texts become more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate its effectiveness. The code will be released soon.
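
The cross-attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the abstract says the code has not yet been released). It assumes that class text features act as queries and spatial image features act as keys and values, that the attended visual feature is added back to the text feature as a residual, and that matching uses cosine similarity as in CLIP. The learned projection layers a real model would have are omitted, and all function names, shapes, and the random inputs are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def visual_guided_texts(text_feats, region_feats):
    """Let each text feature attend over image regions (assumed design).

    text_feats:   (K, D) one feature per class prompt, used as queries.
    region_feats: (N, D) one feature per image region, used as keys and values.
    Returns (K, D) text features enriched with attended visual context.
    """
    d = text_feats.shape[-1]
    # Scaled dot-product attention weights of each text over all regions: (K, N).
    attn = softmax(text_feats @ region_feats.T / np.sqrt(d), axis=-1)
    # Aggregate region features per text, then add as a residual (assumption).
    return text_feats + attn @ region_feats


def classify(image_feat, text_feats, region_feats):
    """Match a global image feature (D,) against the visual-guided texts."""
    guided = l2_normalize(visual_guided_texts(text_feats, region_feats))
    logits = guided @ l2_normalize(image_feat)  # cosine similarities, shape (K,)
    return softmax(logits)


# Hypothetical stand-ins for CLIP encoder outputs: 3 classes, 49 regions, dim 8.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(3, 8))
region_feats = rng.normal(size=(49, 8))
image_feat = region_feats.mean(axis=0)  # crude global feature, for illustration only

probs = classify(image_feat, text_feats, region_feats)
print(probs)
```

In this sketch the text features change per image, because the attention weights depend on that image's regions. That per-image dependence is what the abstract means by texts becoming more semantically correlated with the image.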
