Title: HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models

Apr 07, 2024

Yimu Wang, Shuai Yuan, Xiangru Jian, Wei Pang, Mushi Wang, Ning Yu

Abstract: While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited due to low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method, which generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by recent advances in visual and language generative models, we propose a more powerful augmentation method through textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method, where we use LLMs and VGMs to generate and add new relevant information to the original data. Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
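
The simple augmentation described in the abstract (randomly duplicating or dropping subwords and frames) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `self_similar_augment`, the probabilities `p_drop` and `p_dup`, and the rule that keeps at least one element are all assumptions made for the example. The same routine is applied to a list of subword tokens and to a list of frame indices.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def self_similar_augment(
    items: Sequence[T],
    p_drop: float = 0.1,  # assumed value; the abstract gives no rates
    p_dup: float = 0.1,   # assumed value; the abstract gives no rates
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Randomly drop or duplicate elements while preserving their order."""
    rng = rng or random.Random()
    out: List[T] = []
    for item in items:
        r = rng.random()
        if r < p_drop:
            continue              # drop this subword or frame
        out.append(item)          # keep it
        if r < p_drop + p_dup:
            out.append(item)      # duplicate it
    # Assumption: never return an empty sequence for a non-empty input.
    if not out and items:
        out.append(items[rng.randrange(len(items))])
    return out


# Hypothetical inputs: subword tokens of a caption and sampled frame indices.
tokens = ["a", "man", "is", "play", "##ing", "guitar"]
frames = list(range(12))
rng = random.Random(0)
aug_tokens = self_similar_augment(tokens, rng=rng)
aug_frames = self_similar_augment(frames, rng=rng)
```

Because elements are only dropped or repeated in place, the augmented sequence stays close to the original, which is one way to read the "self-similar data" in the abstract.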
