Title: MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

Aug 20, 2024

Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang

Abstract: Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
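The pipeline described in the abstract (a feature pyramid built from one single-scale feature map, followed by a linear-time sequence model over all scales) can be illustrated with a rough sketch. This is not the authors' implementation. The average pooling, the pooling factors `(1, 2, 4)`, the function names, and the fixed `decay` are all assumptions made for illustration, and `linear_scan` is a plain linear recurrence standing in for Mamba's input-dependent selective scan.

```python
from typing import List, Sequence

Token = List[float]            # one C-dimensional feature vector
Grid = List[List[Token]]       # H x W grid of tokens (a single-scale feature map)


def avg_pool(grid: Grid, k: int) -> Grid:
    """Non-overlapping k x k average pooling; assumes H and W are divisible by k."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    pooled: Grid = []
    for i in range(0, h, k):
        row: List[Token] = []
        for j in range(0, w, k):
            cell = [0.0] * c
            for di in range(k):
                for dj in range(k):
                    for ch in range(c):
                        cell[ch] += grid[i + di][j + dj][ch]
            row.append([v / (k * k) for v in cell])
        pooled.append(row)
    return pooled


def feature_pyramid(grid: Grid, factors: Sequence[int] = (1, 2, 4)) -> List[Grid]:
    """Build coarser scales from the last single-scale map (assumed: by pooling)."""
    return [avg_pool(grid, k) for k in factors]


def flatten_scales(pyramid: List[Grid]) -> List[Token]:
    """Concatenate the tokens of every scale into one sequence, finest scale first."""
    return [tok for level in pyramid for row in level for tok in row]


def linear_scan(tokens: List[Token], decay: float = 0.9) -> List[Token]:
    """Simplified recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t.

    Cost grows linearly with sequence length. Mamba instead makes the
    state-space parameters depend on the input; that part is omitted here.
    """
    state = [0.0] * len(tokens[0])
    outputs: List[Token] = []
    for x in tokens:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    feature_map = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]  # toy 4 x 4 x 2 map
    sequence = flatten_scales(feature_pyramid(feature_map))
    mixed = linear_scan(sequence)
    print(len(sequence), len(mixed))
```

The point of the sketch is the cost profile: the multi-scale sequence is longer than the single-scale one, but a single pass over it touches each token once, in contrast to the quadratic pairwise interactions of full self-attention.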

* 8 pages
