Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Nov 28, 2022

Fang Peng, Xiaoshan Yang, Changsheng Xu

Figure 1 for SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Figure 2 for SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Figure 3 for SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Figure 4 for SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Share this with someone who'll enjoy it:

Abstract:Although significant progress has been made in few-shot learning, most of existing few-shot learning methods require supervised pre-training on a large amount of samples of base classes, which limits their generalization ability in real world application. Recently, large-scale self-supervised vision-language models (e.g., CLIP) have provided a new paradigm for transferable visual representation learning. However, the pre-trained VLPs may neglect detailed visual information that is difficult to describe by language sentences, but important for learning an effective classifier in few-shot classification. To address the above problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation. The implicit knowledge distillation is designed to transfer the fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.

View paper on

Share this with someone who'll enjoy it:

Title:SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Paper and Code