Title: Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Jan 18, 2023

Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan


Abstract: The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.

* Project website: https://linzhiqiu.github.io/papers/cross_modal/
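
The abstract's central idea, treating class names as extra one-shot training samples for a linear classifier in a shared embedding space, can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are synthetic stand-ins for the outputs of a model such as CLIP, and every function name, noise scale, and hyperparameter here is an assumption chosen for illustration only.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is common for shared-space embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_synthetic_data(num_classes=5, dim=64, shots=1, seed=0):
    """Create stand-in image and text features (hypothetical, not real CLIP outputs).

    Each class has a prototype; image and text features are noisy copies of it,
    mimicking two modalities mapped into one representation space.
    """
    rng = np.random.default_rng(seed)
    prototypes = l2_normalize(rng.normal(size=(num_classes, dim)))
    image_labels = np.repeat(np.arange(num_classes), shots)
    image_noise = 0.5 * rng.normal(size=(num_classes * shots, dim)) / np.sqrt(dim)
    image_feats = l2_normalize(prototypes[image_labels] + image_noise)
    text_noise = 0.3 * rng.normal(size=(num_classes, dim)) / np.sqrt(dim)
    text_feats = l2_normalize(prototypes + text_noise)
    text_labels = np.arange(num_classes)
    return image_feats, image_labels, text_feats, text_labels


def train_linear_classifier(features, labels, num_classes, lr=0.5, steps=300):
    """Fit a softmax linear classifier with plain full-batch gradient descent."""
    n, dim = features.shape
    weights = np.zeros((dim, num_classes))
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = features.T @ (probs - one_hot) / n  # cross-entropy gradient
        weights -= lr * grad
    return weights


def cross_modal_fit(image_feats, image_labels, text_feats, text_labels, num_classes):
    """Pool samples from both modalities into one training set, then fit."""
    features = np.concatenate([image_feats, text_feats], axis=0)
    labels = np.concatenate([image_labels, text_labels], axis=0)
    return train_linear_classifier(features, labels, num_classes)


def predict(weights, features):
    """Return the highest-scoring class index for each feature row."""
    return np.argmax(features @ weights, axis=1)
```

In this sketch the only difference from unimodal training is the concatenation step in `cross_modal_fit`: the text feature for each class name is simply one more labeled sample. With real features, `image_feats` and `text_feats` would come from the image and text encoders of a multimodal model rather than from `make_synthetic_data`.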
