Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

Apr 16, 2022

Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Rama Chellappa, Shih-Fu Chang

Figure 1 for Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

Figure 2 for Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

Figure 3 for Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

Figure 4 for Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

Share this with someone who'll enjoy it:

Abstract:We study multimodal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection. Most of previous works focus on either few-shot or zero-shot object detection, ignoring the complementarity of visual and semantic information. We first show that meta-learning and prompt-based learning, the most commonly-used methods for few-shot learning and zero-shot transferring from pre-trained vision-language models to downstream tasks, are conceptually similar. They both reformulate the objective of downstream tasks the same as the pre-training tasks, and mostly without tuning the parameters of pre-trained models. Based on this observation, we propose to combine meta-learning with prompt-based learning for multimodal FSOD without fine-tuning, by learning transferable class-agnostic multimodal FSOD models over many-shot base classes. Specifically, to better exploit the pre-trained vision-language models, the meta-learning based cross-modal prompting is proposed to generate soft prompts and further used to extract the semantic prototype, conditioned on the few-shot visual examples. Then, the extracted semantic prototype and few-shot visual prototype are fused to generate the multimodal prototype for detection. Our models can efficiently fuse the visual and semantic information at both token-level and feature-level. We comprehensively evaluate the proposed multimodal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.

* 22 pages

View paper on

Share this with someone who'll enjoy it:

Title:Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

Paper and Code