Title: Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Jun 16, 2022

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Abstract: Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. Specifically, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train these modules using Web-scraped multi-modal data, and (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
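Step (iii) of the abstract, answering by filling in a masked token, can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the function names `build_prompt` and `pick_answer`, and the toy logits are all assumptions made for illustration. In a real system the logits would come from the frozen BiLM (conditioned on the video through the trainable modules) at the mask position; here they are hard-coded, and each candidate answer is assumed to be a single token.

```python
import math

MASK = "[MASK]"

def build_prompt(question: str) -> str:
    # Hypothetical template: the abstract only states that the masked
    # text is the answer to the question, not the exact wording used.
    return f"Question: {question.rstrip('?')}? Answer: {MASK}."

def pick_answer(mask_logits: dict[str, float],
                candidates: list[str]) -> tuple[str, dict[str, float]]:
    # Softmax restricted to the candidate answers, computed from the
    # model's logits at the mask position (numerically stabilised).
    scores = [mask_logits[c] for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(candidates, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Toy logits standing in for the frozen model's output at the mask
# position; these values are invented for the example.
toy_logits = {"guitar": 2.0, "piano": 0.5, "drums": -1.0}

prompt = build_prompt("What instrument is the man playing?")
answer, probs = pick_answer(toy_logits, ["guitar", "piano", "drums"])
print(prompt)
print(answer)
```

Restricting the softmax to a fixed answer vocabulary keeps the comparison between candidates well defined; handling multi-token answers would need a different scoring scheme, which this sketch does not attempt.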

* 23 pages; 5 figures

View paper on OpenReview
