Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Dec 01, 2024

Peijun Bao, Chenqi Kong, Zihao Shao, Boon Poh Ng, Meng Hwa Er, Alex C. Kot

Figure 1 for Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Figure 2 for Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Figure 3 for Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Figure 4 for Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Share this with someone who'll enjoy it:

Abstract:Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Morp.

View paper on

Share this with someone who'll enjoy it:

Title:Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Paper and Code