Abstract: Online temporal action localization (On-TAL) is the task of identifying multiple action instances in a streaming video. Since existing methods take as input only a fixed-size video segment per iteration, they are limited in capturing long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR utilizes a memory queue that selectively preserves past segment features, allowing the model to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of an ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperforms existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
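The memory-queue mechanism described above can be illustrated with a minimal sketch. The code below is a hypothetical PyTorch implementation, not the authors' MATR: the class and parameter names (`MemoryAugmentedLocalizer`, `memory_size`, the two regression heads) are assumptions, and it appends every segment summary to the queue rather than reproducing the paper's selective preservation. It only shows the overall flow: encode the current segment, predict the end time from it, and query the memory queue for long-term context to estimate the start time.

```python
import torch
import torch.nn as nn
from collections import deque


class MemoryAugmentedLocalizer(nn.Module):
    """Hypothetical sketch of a memory-augmented streaming localizer.
    Not the published MATR model; names and heads are assumptions."""

    def __init__(self, feat_dim=512, memory_size=32):
        super().__init__()
        # bounded queue of past segment summaries; oldest entries are evicted
        self.memory = deque(maxlen=memory_size)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.end_head = nn.Linear(feat_dim, 1)    # end time from current segment
        self.cross_attn = nn.MultiheadAttention(
            feat_dim, num_heads=8, batch_first=True)
        self.start_head = nn.Linear(feat_dim, 1)  # start time from memory context

    def forward(self, segment_feats):
        # segment_feats: (1, T, D) features of the current input segment
        cur = self.encoder(segment_feats)            # encode current segment
        end = self.end_head(cur.mean(dim=1))         # predict end of ongoing action
        if len(self.memory) > 0:
            mem = torch.stack(list(self.memory), dim=1)  # (1, M, D) past context
            ctx, _ = self.cross_attn(cur, mem, mem)      # query memory for long-term context
            start = self.start_head(ctx.mean(dim=1))
        else:
            start = self.start_head(cur.mean(dim=1))     # no history yet
        # push the current segment summary into the queue (paper does this selectively)
        self.memory.append(cur.mean(dim=1).detach())
        return start, end


model = MemoryAugmentedLocalizer().eval()
with torch.no_grad():
    for step in range(5):                # pretend stream of 5 segments
        seg = torch.randn(1, 16, 512)    # 16 frame features per segment
        start, end = model(seg)
```

At inference the model is called once per incoming segment, so the queue accumulates context across the stream while its fixed capacity bounds memory cost.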
Abstract: Group activity detection (GAD) is the task of simultaneously identifying the members of each group and classifying each group's activity in a video. While GAD has recently been studied, there is still much room for improvement in both datasets and methodology, as existing ones have limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical evaluation scenarios and metrics, in addition to being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that handles an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed. Both our dataset and code base will be released to the public to promote future research on GAD.
Abstract: Group activity recognition is the task of understanding the activity performed by a group of people as a whole in a multi-person video. Existing models for this task are often impractical in that they demand ground-truth bounding box labels of actors even at test time or rely on off-the-shelf object detectors. Motivated by this, we propose a novel model for group activity recognition that depends neither on bounding box labels nor on an object detector. Our Transformer-based model localizes and encodes partial contexts of a group activity by leveraging the attention mechanism, and represents a video clip as a set of partial context embeddings. The embedding vectors are then aggregated into a single group representation that reflects the entire context of the activity while capturing the temporal evolution of each partial context. Our method achieves outstanding performance on two benchmarks, the Volleyball and NBA datasets, surpassing not only the state of the art trained with the same level of supervision but also some existing models relying on stronger supervision.
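As a rough illustration of the detector-free design above, here is a hedged PyTorch sketch, not the published model: the names (`PartialContextGAR`, `num_queries`) are made up, and simple mean pooling stands in for the paper's aggregation that captures temporal evolution. A set of learnable query tokens attends to clip features through a Transformer decoder, so each query embeds one partial context of the activity, and the embeddings are pooled into a single group representation for classification.

```python
import torch
import torch.nn as nn


class PartialContextGAR(nn.Module):
    """Hypothetical detector-free group activity recognition sketch:
    learnable queries attend to clip features to encode partial contexts,
    which are pooled into one group representation."""

    def __init__(self, feat_dim=512, num_queries=8, num_classes=9):
        super().__init__()
        # one learnable query per partial context
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (B, S, D) flattened spatio-temporal features of a clip
        B = clip_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, D)
        partial_ctx = self.decoder(q, clip_feats)        # each query attends to the clip
        group_repr = partial_ctx.mean(dim=1)             # aggregate partial contexts
        return self.classifier(group_repr)               # group activity logits


logits = PartialContextGAR()(torch.randn(2, 196, 512))  # (2, 9)
```

Since the queries are learned end-to-end rather than tied to detected boxes, this kind of design needs neither bounding box labels nor an object detector at test time, which is the point the abstract makes.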