Title: MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Dec 01, 2021

Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem

Abstract: The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours.

* 12 Pages, 6 Figures, 7 Tables

