Manjin Kim

Learning Correlation Structures for Vision Transformers

Apr 05, 2024

Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

Abstract: We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.

* Accepted to CVPR 2024
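
The core idea in the abstract above, convolving query-key correlations before turning them into attention weights, can be illustrated with a toy 1D sketch. This is not the authors' StructSA implementation: the function names, the single fixed (not learned) kernel, the unscaled dot product, and the aggregation of single value vectors rather than local value contexts are all simplifying assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def conv1d_same(seq, kernel):
    """Zero-padded 1D cross-correlation; output has the same length as seq.
    Assumes an odd-length kernel."""
    r = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out

def structural_attention_1d(queries, keys, values, kernel):
    """Hypothetical toy: for each query, correlate with every key, run a
    convolution over the resulting correlation sequence, and use the
    softmax of the result to average the value vectors."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        corr = [sum(a * b for a, b in zip(q, k)) for k in keys]
        logits = conv1d_same(corr, kernel)  # pattern over neighbouring correlations
        attn = softmax(logits)
        outputs.append([sum(attn[j] * values[j][d] for j in range(len(values)))
                        for d in range(dim)])
    return outputs
```

With the identity kernel `[0.0, 1.0, 0.0]` this toy reduces to ordinary (unscaled) dot-product attention; any other kernel makes each attention logit depend on the correlations at neighbouring key positions as well.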

Future Transformer for Long-term Action Anticipation

May 27, 2022

Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, Minsu Cho

Abstract: The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions. Unlike the previous autoregressive models, the proposed method learns to predict the whole sequence of future actions in parallel decoding, enabling more accurate and fast inference for long-term anticipation. We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads, achieving state-of-the-art results.

* Accepted to CVPR 2022

Relational Self-Attention: What's Missing in Attention for Video Understanding

Nov 02, 2021

Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

Abstract: Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.

* Accepted to NeurIPS 2021

Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition

Feb 14, 2021

Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

Abstract: Spatio-temporal convolution often fails to learn motion dynamics in videos and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something-V1 & V2, Diving-48, and FineGym, the proposed method achieves the state-of-the-art results.

* 8 pages, 5 figures
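
The abstract's description of STSS, each local region represented by its similarities to neighbors in space and time, can be sketched in plain Python as follows. This is a minimal illustration, not the SELFY block: the function names, the use of cosine similarity, the nested-list layout `feats[t][y][x]`, and the zero value for out-of-range neighbors are assumptions made here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def stss(feats, t_radius=1, s_radius=1):
    """feats[t][y][x] is a feature vector. Returns (out, offsets) where
    out[t][y][x][i] is the similarity between position (t, y, x) and the
    neighbour at offsets[i] = (dt, dy, dx); out-of-range neighbours give 0.0."""
    T, H, W = len(feats), len(feats[0]), len(feats[0][0])
    offsets = [(dt, dy, dx)
               for dt in range(-t_radius, t_radius + 1)
               for dy in range(-s_radius, s_radius + 1)
               for dx in range(-s_radius, s_radius + 1)]
    out = []
    for t in range(T):
        frame = []
        for y in range(H):
            row = []
            for x in range(W):
                sims = []
                for dt, dy, dx in offsets:
                    tt, yy, xx = t + dt, y + dy, x + dx
                    if 0 <= tt < T and 0 <= yy < H and 0 <= xx < W:
                        sims.append(cosine(feats[t][y][x], feats[tt][yy][xx]))
                    else:
                        sims.append(0.0)
                row.append(sims)
            frame.append(row)
        out.append(frame)
    return out, offsets
```

The output replaces each appearance vector with a vector of relational values, one per space-time offset, which is the kind of volume a downstream model could learn a motion representation from.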

MotionSqueeze: Neural Motion Feature Learning for Video Understanding

Jul 20, 2020

Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho

Abstract: Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information typically using optical flows extracted by a separate off-the-shelf method. As the frame-by-frame optical flows require heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets.

* Accepted to ECCV 2020
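
The step the abstract describes, establishing correspondences across frames and converting them into motion features, can be illustrated with a small non-learned stand-in. This is only a sketch under stated assumptions and not the MotionSqueeze module: the function names, the dot-product matching score, the local search window, and the hard argmax (chosen here for simplicity; it is not differentiable) are all hypothetical choices.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def displacement_field(frame_a, frame_b, radius=1):
    """frame[y][x] is a feature vector. For each position in frame_a, search a
    (2*radius+1)^2 window in frame_b and return the (dy, dx) offset with the
    highest dot-product score. Ties keep the zero displacement."""
    H, W = len(frame_a), len(frame_a[0])
    field = []
    for y in range(H):
        row = []
        for x in range(W):
            # Start from "no motion" so ambiguous matches default to (0, 0).
            best = (0, 0)
            best_score = dot(frame_a[y][x], frame_b[y][x])
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        score = dot(frame_a[y][x], frame_b[yy][xx])
                        if score > best_score:
                            best, best_score = (dy, dx), score
            row.append(best)
        field.append(row)
    return field
```

The resulting per-position displacements play the role of a coarse flow-like map computed from intermediate features rather than from an external optical flow method.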
