Title: Clover: Towards A Unified Video-Language Alignment and Fusion Model

Jul 16, 2022

Jingjia Huang, Yinan Li, Jiashi Feng, Xiaoshuai Sun, Rongrong Ji


Abstract: Building a universal video-language model that solves various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge in machine learning. Towards this goal, most recent attempts train models, usually consisting of uni-modal and cross-modal feature encoders, with supervised or pair-wise contrastive pretext tasks. Though they offer attractive generality, the resulting models have to compromise between efficiency and performance. We argue that these flaws are caused by their pre-training strategies: they cannot align and fuse features from different modalities well at the same time. We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal video-language model that solves multiple video understanding tasks without compromising either performance or efficiency. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from masked samples and a novel pair-wise ranking loss. Clover demonstrates outstanding generality. It establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
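
For readers unfamiliar with the "pair-wise contrastive pretext tasks" the abstract refers to, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired video and text embeddings. It is not Clover's tri-modal alignment task or its pair-wise ranking loss, neither of which is specified in the abstract. The function names, the toy embeddings, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(scores, idx):
    """Log-probability of entry `idx` under a softmax over `scores`."""
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[idx] - log_z


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic pair-wise contrastive loss (illustrative, hypothetical helper).

    Row i of `video_embs` is assumed to be paired with row i of `text_embs`;
    every other row in the batch acts as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Temperature-scaled cosine similarity between every video and every text.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: each video should pick out its own caption.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each caption should pick out its own video.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy data: two clips and two captions in a 2-D embedding space.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(videos, matched_texts)
loss_swapped = symmetric_contrastive_loss(videos, swapped_texts)
print(loss_matched < loss_swapped)
```

A loss of this kind only pulls paired uni-modal embeddings together. It does not train a cross-modal fusion encoder, which is the gap the abstract says Clover's pre-training task is meant to close.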

* In peer review
