Title: Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Nov 17, 2024

Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong


Abstract: Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting, where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world, where test videos inevitably contain actions beyond the trained categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem, which aims to detect any action in test videos with a model trained on a fixed set of action categories. To achieve this open-vocabulary capability, we propose a novel method, OpenMixer, that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR). Specifically, OpenMixer is built from spatial and temporal OpenMixer blocks (S-OMB and T-OMB) and a dynamically fused alignment (DFA) module. Together, the three components combine the strong generalization of pre-trained VLMs with the end-to-end learning of the DETR design. Moreover, we establish OVAD benchmarks under various settings, and the experimental results show that OpenMixer outperforms the baselines in detecting both seen and unseen actions. We release the code, models, and dataset splits at https://github.com/Cogito2012/OpenMixer.
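To make the open-vocabulary idea concrete, here is a minimal sketch of how a detector query could be scored against category names that were never seen in training. It is not the OpenMixer implementation: it only illustrates the generic VLM-style recipe of comparing a visual feature with text embeddings by cosine similarity. The function names, the temperature value, and the toy 3-dimensional embeddings are all assumptions made for illustration; in practice the embeddings would come from a pre-trained VLM's text and visual encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def open_vocab_scores(query_feat, text_embeds, temperature=0.07):
    """Softmax over cosine similarities between one query feature and
    the text embedding of every candidate category name.

    `temperature` is a hypothetical value chosen for this sketch.
    """
    q = l2_normalize(query_feat)
    names = list(text_embeds)
    logits = [
        sum(a * b for a, b in zip(q, l2_normalize(text_embeds[name]))) / temperature
        for name in names
    ]
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy embeddings (hypothetical). "dive" stands in for a category that was
# not part of the training vocabulary; it can still be scored because the
# classifier is just a similarity lookup against text embeddings.
text_embeds = {
    "run": [1.0, 0.0, 0.0],
    "jump": [0.0, 1.0, 0.0],
    "dive": [0.0, 0.0, 1.0],
}
query_feat = [0.1, 0.2, 0.9]

scores = open_vocab_scores(query_feat, text_embeds)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the category set is only a dictionary of text embeddings, adding a new action at test time means adding one more entry rather than retraining a classification head.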

* WACV 2025 Accepted
