Title: Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

Sep 29, 2024

Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, Liqiang Nie

Abstract: Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully leverage useful information in multimodal video content (frames, tags, ASR transcripts, etc.) to refine the original annotations. Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. Then, we pre-train on it and fine-tune on human refinement examples for a stronger model. These processes are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel noise control method that requires weaker assumptions on noise distribution, thereby proving more effective in large datasets with theoretical guarantees. The combination of iterative refinement and AdaTaiLr can achieve better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
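
The iterative refinement loop described in the abstract (annotate, pre-train, fine-tune, repeat) can be sketched roughly as follows. This is only an illustrative outline and not the authors' implementation: the function `data_flywheel`, its parameters, and the callables `annotate`, `pretrain`, and `finetune` are hypothetical names introduced here. The AdaTaiLr noise control is not reproduced, since the abstract does not specify it; in this sketch it would presumably live inside whatever `pretrain` does.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
# Each sample is a (video_id, annotation) pair; this layout is an assumption.
Dataset = List[Tuple[str, str]]


def data_flywheel(
    dataset: Dataset,
    model: Model,
    annotate: Callable[[Model, str, str], str],
    pretrain: Callable[[Model, Dataset], Model],
    finetune: Callable[[Model, Dataset], Model],
    human_examples: Dataset,
    rounds: int = 3,
) -> Tuple[Model, Dataset]:
    """Hypothetical outline of the annotate -> pre-train -> fine-tune loop."""
    for _ in range(rounds):
        # 1. Use the current model to produce refined (synthetic) annotations.
        dataset = [(vid, annotate(model, vid, text)) for vid, text in dataset]
        # 2. Pre-train on the refined dataset (noise control would apply here).
        model = pretrain(model, dataset)
        # 3. Fine-tune on human refinement examples to get a stronger annotator.
        model = finetune(model, human_examples)
    return model, dataset
```

Passing the three stages in as callables keeps the sketch independent of any particular video-language model or training library; each round simply feeds the stronger model back in as the next annotator.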

* Under peer review
