Abstract: The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than from appearance, we strive to train our network using optical flow, but avoid its computation during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal granularity of videos during self-training of the appearance model, which was overlooked in previous works. As a result, we obtain a strong motion-augmented representation model suited for video downstream tasks such as action recognition and clip retrieval. On small-scale video datasets, MotionFit outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7%, and semi-supervised learning by 9%-18% using the same amount of class labels.
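The knowledge-transfer step above can be summarized in a minimal PyTorch sketch, assuming generic `motion_model` (trained on optical flow) and `appearance_model` (RGB) networks and a plain single-clip cross-entropy loss; the function names and data loaders are illustrative assumptions and do not reproduce the paper's multi-clip loss or temporal-granularity handling.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_labels(motion_model, flow_loader, device="cpu"):
    """Run the frozen flow-based motion model over unlabeled clips and keep
    its argmax predictions as pseudo-labels for the large video collection."""
    motion_model.eval()
    pseudo = []
    with torch.no_grad():
        for flow_clips in flow_loader:                     # optical-flow clips, no labels
            logits = motion_model(flow_clips.to(device))
            pseudo.append(logits.argmax(dim=1).cpu())
    return torch.cat(pseudo)

def distill_step(appearance_model, rgb_clips, pseudo_labels, optimizer, device="cpu"):
    """One transfer step: fit the RGB appearance model to the motion model's
    pseudo-labels, so no optical flow is needed at inference time."""
    appearance_model.train()
    logits = appearance_model(rgb_clips.to(device))
    loss = F.cross_entropy(logits, pseudo_labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```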
Abstract: This paper strives to recognize individual actions and group activities from videos. While existing solutions for this challenging problem explicitly model spatial and temporal relationships based on the locations of individual actors, we propose an actor-transformer model able to learn and selectively extract the information relevant for group activity recognition. We feed the transformer with rich actor-specific static and dynamic representations, expressed by features from a 2D pose network and a 3D CNN, respectively. We empirically study different ways to combine these representations and show their complementary benefits. Our experiments show what is important to transform and how it should be transformed. Moreover, actor-transformers achieve state-of-the-art results on two publicly available benchmarks for group activity recognition, outperforming the previous best published results by a considerable margin.
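As a rough illustration of the fusion described above, the sketch below feeds pre-extracted per-actor pose and 3D-CNN features, concatenated per actor, into a standard PyTorch transformer encoder; the class name, feature dimensions, single-layer encoder, and mean pooling are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActorTransformer(nn.Module):
    """Toy actor-transformer: fuse static (pose) and dynamic (3D CNN) actor
    features, let the actors attend to each other, then classify the group
    activity from the pooled representation."""
    def __init__(self, pose_dim=256, motion_dim=512, d_model=256,
                 num_heads=8, num_layers=1, num_activities=8):
        super().__init__()
        self.fuse = nn.Linear(pose_dim + motion_dim, d_model)   # early fusion by concatenation
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_activities)

    def forward(self, pose_feats, motion_feats):
        # pose_feats: (B, num_actors, pose_dim), motion_feats: (B, num_actors, motion_dim)
        tokens = self.fuse(torch.cat([pose_feats, motion_feats], dim=-1))
        tokens = self.encoder(tokens)               # actor-to-actor self-attention
        return self.classifier(tokens.mean(dim=1))  # pool over actors -> activity logits
```

For example, `ActorTransformer()(torch.randn(2, 12, 256), torch.randn(2, 12, 512))` returns group-activity logits of shape (2, 8) for two clips with twelve actors each.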
Abstract: For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, measuring physical properties from visual observations is challenging due to the high number of causally underlying physical parameters -- including material properties and external forces. In this paper, we propose to measure the latent physical properties of cloth in the wind without ever having seen a real example before. Our solution is an iterative refinement procedure with simulation at its core. The algorithm gradually updates the physical model parameters by running a simulation of the observed phenomenon and comparing the current simulation to a real-world observation. The correspondence is measured using an embedding function that maps physically similar examples to nearby points. We consider a case study of cloth in the wind, with curling flags as our leading example -- a seemingly simple phenomenon, but one that is physically highly involved. Based on the physics of cloth and its visual manifestation, we propose an instantiation of the embedding function. For this mapping, modeled as a deep network, we introduce a spectral layer that decomposes a video volume into its temporal spectral power and corresponding frequencies. Our experiments demonstrate that the proposed method compares favorably to prior work on the task of measuring cloth material properties and external wind force from a real-world video.
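The refinement procedure can be sketched as a loop around a black-box simulator and the learned embedding; `simulate` and `embed` below are hypothetical stand-ins for the cloth simulator and the embedding network, and the simple coordinate search is only a placeholder for the paper's actual parameter-update scheme.

```python
import numpy as np

def refine_parameters(real_embedding, simulate, embed, init_params,
                      n_iters=20, step=0.1, tol=1e-3):
    """Iteratively adjust physical parameters (e.g. material stiffness, wind
    force) so the embedded simulation moves closer to the embedded real video."""
    params = np.asarray(init_params, dtype=float)
    best = np.linalg.norm(embed(simulate(params)) - real_embedding)
    for _ in range(n_iters):
        for i in range(len(params)):
            for delta in (step, -step):                 # probe each parameter in both directions
                candidate = params.copy()
                candidate[i] += delta
                dist = np.linalg.norm(embed(simulate(candidate)) - real_embedding)
                if dist + tol < best:                   # keep only clear improvements
                    params, best = candidate, dist
    return params, best
```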
Abstract: For many of the physical phenomena around us, we have developed sophisticated models explaining their behavior. Nevertheless, inferring specifics from visual observations is challenging due to the high number of causally underlying physical parameters -- including material properties and external forces. This paper addresses the problem of inferring such latent physical properties from observations. Our solution is an iterative refinement procedure with simulation at its core. The algorithm gradually updates the physical model parameters by running a simulation of the observed phenomenon and comparing the current simulation to a real-world observation. The physical similarity is computed using an embedding function that maps physically similar examples to nearby points. As a tangible example, we concentrate on flags curling in the wind -- a seemingly simple phenomenon but physically highly involved. Based on its underlying physical model and visual manifestation, we propose an instantiation of the embedding function. For this mapping, modeled as a deep network, we introduce a spectral decomposition layer that decomposes a video volume into its temporal spectral power and corresponding frequencies. In experiments, we demonstrate our method's ability to recover intrinsic and extrinsic physical parameters from both simulated and real-world video.
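The spectral decomposition layer mentioned above can be sketched in a few lines of PyTorch, assuming a (B, C, T, H, W) video volume and a known frame rate; normalization details and how the power spectrum is consumed by later layers are omitted here.

```python
import torch
import torch.nn as nn

class SpectralDecompositionLayer(nn.Module):
    """Decompose a video volume along the temporal axis into its spectral
    power and the corresponding frequencies (in Hz)."""
    def __init__(self, fps=25.0):
        super().__init__()
        self.fps = fps

    def forward(self, x):
        # x: (B, C, T, H, W) -- pixels or per-pixel features over time
        spectrum = torch.fft.rfft(x, dim=2)          # complex FFT over the time axis
        power = spectrum.abs().pow(2)                # temporal spectral power, (B, C, T//2+1, H, W)
        freqs = torch.fft.rfftfreq(x.shape[2], d=1.0 / self.fps).to(x.device)
        return power, freqs                          # freqs: (T//2+1,)
```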
Abstract: This paper strives for pixel-level segmentation of actors and their actions in video content. Different from existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This allows us to distinguish between fine-grained actors in the same super-category, to identify actor and action instances, and to segment pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage for traditional actor and action segmentation compared to the state-of-the-art.
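To illustrate how a sentence can guide a fully-convolutional segmentation model, the toy per-frame encoder-decoder below tiles a pooled sentence embedding over the spatial feature grid before decoding a mask; the actual model operates on video clips with 3D convolutions and a richer text encoder, so every layer size and name here is an assumption.

```python
import torch
import torch.nn as nn

class SentenceGuidedSegmenter(nn.Module):
    """Toy encoder-decoder: fuse a sentence embedding with frame features by
    tiling it over the spatial grid, then decode per-pixel mask logits."""
    def __init__(self, in_channels=3, text_dim=300, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                  # downsample the frame 4x
            nn.Conv2d(in_channels, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(                  # upsample fused features back to full resolution
            nn.ConvTranspose2d(hidden + text_dim, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 1, 4, stride=2, padding=1))

    def forward(self, frame, sentence_emb):
        # frame: (B, 3, H, W); sentence_emb: (B, text_dim), e.g. pooled word embeddings
        feats = self.encoder(frame)
        b, _, h, w = feats.shape
        text = sentence_emb[:, :, None, None].expand(b, -1, h, w)
        return self.decoder(torch.cat([feats, text], dim=1))   # (B, 1, H, W) mask logits
```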