Title: Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

Aug 12, 2024

Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi

Abstract: Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.

* Accepted to ACM MM 2024
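
The abstract mentions a probabilistic embedding space and contrastive learning "based on statistical similarities" but gives no formulas. As a rough illustration only, the sketch below assumes each snippet is embedded as a diagonal Gaussian (a mean vector and a standard-deviation vector) and uses the closed-form squared 2-Wasserstein distance as the statistical similarity inside an InfoNCE-style loss. The distribution family, the distance, the loss form, and the names `wasserstein2_sq` and `contrastive_loss` are all assumptions made for this example; they are not taken from the paper or from the PVLR repository.

```python
import math


def wasserstein2_sq(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances it reduces to
    ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2,
    where sigma_* are per-dimension standard deviations.
    (Hypothetical choice of distance; the abstract does not name one.)
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return mean_term + std_term


def contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE-style loss over Gaussian embeddings (illustrative only).

    Each embedding is a (mu, sigma) pair. Similarity is the negative
    squared distance, scaled by a temperature.
    """
    def sim(p, q):
        return -wasserstein2_sq(p[0], p[1], q[0], q[1]) / temperature

    logits = [sim(anchor, positive)] + [sim(anchor, n) for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# Toy 2-D embeddings: the positive matches the anchor, the negative is far away.
anchor = ([0.0, 0.0], [1.0, 1.0])
positive = ([0.0, 0.0], [1.0, 1.0])
negative = ([3.0, 4.0], [1.0, 1.0])
loss = contrastive_loss(anchor, positive, [negative])
```

In this toy setup the loss is close to zero because the positive coincides with the anchor while the negative is far away; moving the negative closer to the anchor increases it.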
