Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Minho Shim

Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Jan 22, 2026

Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi

Abstract:We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.

* The code is available at https://github.com/KHU-VLL/RCORE

Via

Access Paper or Ask Questions

Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jul 10, 2025

Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

Figure 1 for Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Figure 2 for Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Figure 3 for Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Figure 4 for Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Abstract:Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.

* Accepted at ICCV2025; Project page: https://www.jshyun.me/projects/sttm

Via

Access Paper or Ask Questions

Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

Apr 17, 2025

WonJun Moon, Cheol-Ho Cho, Woojin Jun, Minho Shim, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Jae-Pil Heo

Abstract:In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.

Via

Access Paper or Ask Questions

Classification Matters: Improving Video Action Detection with Class-Specific Attention

Jul 29, 2024

Jinsung Lee, Taeoh Kim, Inwoong Lee, Minho Shim, Dongyoon Wee, Minsu Cho, Suha Kwak

Figure 1 for Classification Matters: Improving Video Action Detection with Class-Specific Attention

Figure 2 for Classification Matters: Improving Video Action Detection with Class-Specific Attention

Figure 3 for Classification Matters: Improving Video Action Detection with Class-Specific Attention

Figure 4 for Classification Matters: Improving Video Action Detection with Class-Specific Attention

Abstract:Video action detection (VAD) aims to detect actors and classify their actions in a video. We figure that VAD suffers more from classification rather than localization of actors. Hence, we analyze how prevailing methods form features for classification and find that they prioritize actor regions, yet often overlooking the essential contextual information necessary for accurate classification. Accordingly, we propose to reduce the bias toward actor and encourage paying attention to the context that is relevant to each action class. By assigning a class-dedicated query to each action class, our model can dynamically determine where to focus for effective classification. The proposed model demonstrates superior performance on three challenging benchmarks with significantly fewer parameters and less computation.

* 31 pages, accepted to ECCV 2024

Via

Access Paper or Ask Questions

Masked Autoencoder for Unsupervised Video Summarization

Jun 02, 2023

Minho Shim, Taeoh Kim, Jinhyung Kim, Dongyoon Wee

Figure 1 for Masked Autoencoder for Unsupervised Video Summarization

Figure 2 for Masked Autoencoder for Unsupervised Video Summarization

Figure 3 for Masked Autoencoder for Unsupervised Video Summarization

Figure 4 for Masked Autoencoder for Unsupervised Video Summarization

Abstract:Summarizing a video requires a diverse understanding of the video, ranging from recognizing scenes to evaluating how much each frame is essential enough to be selected as a summary. Self-supervised learning (SSL) is acknowledged for its robustness and flexibility to multiple downstream tasks, but the video SSL has not shown its value for dense understanding tasks like video summarization. We claim an unsupervised autoencoder with sufficient self-supervised learning does not need any extra downstream architecture design or fine-tuning weights to be utilized as a video summarization model. The proposed method to evaluate the importance score of each frame takes advantage of the reconstruction score of the autoencoder's decoder. We evaluate the method in major unsupervised video summarization benchmarks to show its effectiveness under various experimental settings.

Via

Access Paper or Ask Questions

Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Mar 30, 2023

Pilhyeon Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Hyeran Byun

Figure 1 for Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Figure 2 for Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Figure 3 for Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Figure 4 for Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection

Abstract:Temporal action detection aims to predict the time intervals and the classes of action instances in the video. Despite the promising performance, existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality. Specifically, instead of direct distillation, we propose to separately learn RGB and motion representations, which are in turn combined to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while preserving RGB information intact. In addition, we introduce a local attentive fusion to better exploit the multimodal complementarity. It is designed to preserve the local discriminability of the features that is important for action localization. Extensive experiments on the benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.

* Accepted by CVPR 2023

Via

Access Paper or Ask Questions

Exploring Temporally Dynamic Data Augmentation for Video Recognition

Jun 30, 2022

Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee

Figure 1 for Exploring Temporally Dynamic Data Augmentation for Video Recognition

Figure 2 for Exploring Temporally Dynamic Data Augmentation for Video Recognition

Figure 3 for Exploring Temporally Dynamic Data Augmentation for Video Recognition

Figure 4 for Exploring Temporally Dynamic Data Augmentation for Video Recognition

Abstract:Data augmentation has recently emerged as an essential component of modern training recipes for visual recognition tasks. However, data augmentation for video recognition has been rarely explored despite its effectiveness. Few existing augmentation recipes for video recognition naively extend the image augmentation methods by applying the same operations to the whole video frames. Our main idea is that the magnitude of augmentation operations for each frame needs to be changed over time to capture the real-world video's temporal variations. These variations should be generated as diverse as possible using fewer additional hyper-parameters during training. Through this motivation, we propose a simple yet effective video data augmentation framework, DynaAugment. The magnitude of augmentation operations on each frame is changed by an effective mechanism, Fourier Sampling that parameterizes diverse, smooth, and realistic temporal variations. DynaAugment also includes an extended search space suitable for video for automatic data augmentation methods. DynaAugment experimentally demonstrates that there are additional performance rooms to be improved from static augmentations on diverse video models. Specifically, we show the effectiveness of DynaAugment on various video datasets and tasks: large-scale video recognition (Kinetics-400 and Something-Something-v2), small-scale video recognition (UCF- 101 and HMDB-51), fine-grained video recognition (Diving-48 and FineGym), video action segmentation on Breakfast, video action localization on THUMOS'14, and video object detection on MOT17Det. DynaAugment also enables video models to learn more generalized representation to improve the model robustness on the corrupted videos.

* Technical Report

Via

Access Paper or Ask Questions

Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

Apr 08, 2022

Jinhyung Kim, Taeoh Kim, Minho Shim, Dongyoon Han, Dongyoon Wee, Junmo Kim

Figure 1 for Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

Figure 2 for Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

Figure 3 for Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

Figure 4 for Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

Abstract:Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of generated views. In this paper, we propose frequency augmentation (FreqAug), a spatio-temporal data augmentation method in the frequency domain for video representation learning. FreqAug stochastically removes undesirable information from the video by filtering out specific frequency components so that learned representation captures essential features of the video for various downstream tasks. Specifically, FreqAug pushes the model to focus more on dynamic features rather than static features in the video via dropping spatial or temporal low-frequency components. In other words, learning invariance between remaining frequency components results in high-frequency enhanced representation with less static bias. To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations. Transferring the improved representation to five video action recognition and two temporal action localization downstream tasks shows consistent improvements over baselines.

Via

Access Paper or Ask Questions