Title: Video Question Answering Using CLIP-Guided Visual-Text Attention

Mar 08, 2023

Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, Xudong Jiang

Abstract: Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using TimeSformer and text features using BERT from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose Cross-domain Learning to extract the attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.

* Submitted to the 2023 IEEE International Conference on Image Processing (ICIP 2023)
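
The abstract describes attention between target-domain features (TimeSformer, BERT) and general-domain features (CLIP). As a rough illustration only, the sketch below applies plain scaled dot-product cross-attention between two feature sets using NumPy. It is not the authors' implementation: the random arrays stand in for real encoder outputs, and the function names, feature sizes, the pairing of which features attend to which, and the mean-pool-and-concatenate fusion are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention of `queries` over `keys_values`.

    queries:     (n_q, d) array
    keys_values: (n_k, d) array, used as both keys and values
    Returns the attended features (n_q, d) and the weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights


# Hypothetical stand-ins for encoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
dim = 8
video_target = rng.normal(size=(4, dim))    # e.g. TimeSformer clip features
text_target = rng.normal(size=(6, dim))     # e.g. BERT token features
visual_general = rng.normal(size=(4, dim))  # e.g. CLIP visual features
text_general = rng.normal(size=(6, dim))    # e.g. CLIP text features

# Assumed pairing: target-domain features attend to the general-domain
# features of the other modality.
video_guided, w_video = cross_attention(video_target, text_general)
text_guided, w_text = cross_attention(text_target, visual_general)

# Assumed fusion: mean-pool each attended set and concatenate.
fused = np.concatenate([video_guided.mean(axis=0), text_guided.mean(axis=0)])
print(fused.shape)
```

In a real system the fused vector would feed an answer classifier, and learned projection matrices would normally precede the attention step. Both are omitted here to keep the sketch short.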
