Title: AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

Jun 11, 2024

Xing Zhang, Jiaxi Gu, Haoyu Zhao, Shicong Wang, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu, Yu-Gang Jiang

Abstract: Temporal Video Grounding (TVG) aims to localize a moment in an untrimmed video given a language description. Because annotating TVG data is labor-intensive, TVG under limited supervision has received attention in recent years. The great success of vision-language pre-training has led TVG to follow the traditional "pre-training + fine-tuning" paradigm. However, the pre-training process suffers from a lack of temporal modeling and fine-grained alignment, because the nature of the data differs between pre-training and testing. In addition, the large gap between pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. Specifically, AutoTVG consists of a novel Captioned Moment Generation (CMG) module, which generates captioned moments from untrimmed videos, and TVGNet, which uses a regression head to predict localization results. Experimental results on Charades-STA and ActivityNet Captions show that, for zero-shot temporal video grounding, AutoTVG achieves performance highly competitive with in-distribution methods under out-of-distribution testing, and is superior to existing pre-training frameworks while using much less training data.
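As a rough illustration of what "localizing a moment" means in practice, the sketch below scores a predicted (start, end) span against an annotated one using temporal intersection-over-union. The abstract does not state which metric the paper uses, so treat temporal IoU here as an assumption based on common practice in TVG work; the function name `temporal_iou` and the example timestamps are hypothetical and not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments, each given as (start, end) in seconds.

    Hypothetical helper for illustration only; not code from AutoTVG.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Length of the overlap between the two spans (zero if they are disjoint).
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))

    # Union length = sum of both span lengths minus the overlap.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Guard against degenerate zero-length spans.
    return intersection / union if union > 0 else 0.0


# A predicted moment of 2s-8s against an annotated moment of 4s-10s:
# overlap is 4s, union is 8s.
score = temporal_iou((2.0, 8.0), (4.0, 10.0))
print(score)  # 0.5
```

A prediction is typically counted as correct when this score exceeds a chosen threshold; the threshold values are a reporting choice and are not specified in the abstract.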

* Technical Report