Title: Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering

Oct 12, 2024

Ting Yu, Kunhao Fu, Shuhui Wang, Qingming Huang, Jun Yu

Abstract: Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite advancements in multi-modal pre-trained models and video-language foundation models, these systems often struggle with domain-specific VideoQA due to their generalized pre-training objectives. Addressing this gap necessitates bridging the divide between broad cross-modal knowledge and the specific inference demands of VideoQA tasks. To this end, we introduce HeurVidQA, a framework that leverages domain-specific entity-action heuristics to refine pre-trained video-language foundation models. Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning. By delivering fine-grained heuristics, we improve the model's ability to identify and interpret key entities and actions, thereby enhancing its reasoning capabilities. Extensive evaluations across multiple VideoQA datasets demonstrate that our method significantly outperforms existing models, underscoring the importance of integrating domain-specific knowledge into video-language models for more accurate and context-aware VideoQA.

* IEEE Transactions on Circuits and Systems for Video Technology, 2024
