Title: Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Aug 31, 2020

Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, Yun-hui Liu

Abstract: This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to yield these statistical summaries given the video frames as inputs. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field, and needs only impressions of rough spatial locations to understand the visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with several 3D backbone networks, i.e., C3D, 3D-ResNet and R(2+1)D. The results show that our approach outperforms existing approaches across the three backbone networks on various downstream video analysis tasks, including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is made publicly available at: https://github.com/laura-wang/video_repres_sts.
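
To make the idea of a motion-statistics label concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes a precomputed per-pixel motion field of shape (H, W, 2), a single 4x4 grid as the spatial partitioning pattern, and 8 direction bins. The function name `motion_statistics_labels`, the grid size, and the bin count are all hypothetical choices made for illustration.

```python
import numpy as np


def motion_statistics_labels(flow, grid=(4, 4), num_bins=8):
    """Derive two coarse labels from a motion field (illustrative only).

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) motion vectors.
    Returns (block_index, direction_bin):
      block_index   - index of the grid block containing the largest motion
      direction_bin - quantized direction of that motion vector
    The grid size and number of bins are assumptions, not values from the paper.
    """
    h, w, _ = flow.shape
    magnitude = np.hypot(flow[..., 0], flow[..., 1])

    # Pixel position with the largest motion magnitude.
    y, x = np.unravel_index(np.argmax(magnitude), magnitude.shape)

    # Encode a rough location as a grid block instead of exact coordinates.
    rows, cols = grid
    block_row = min(y * rows // h, rows - 1)
    block_col = min(x * cols // w, cols - 1)
    block_index = int(block_row * cols + block_col)

    # Quantize the motion direction (angle of (dx, dy)) into equal-width bins.
    angle = np.arctan2(flow[y, x, 1], flow[y, x, 0]) % (2 * np.pi)
    direction_bin = int(angle // (2 * np.pi / num_bins)) % num_bins

    return block_index, direction_bin


# Toy example: a single moving pixel in an otherwise static 8x8 field.
flow = np.zeros((8, 8, 2))
flow[6, 1] = (-1.0, 0.2)
labels = motion_statistics_labels(flow)
```

In a pretext task of this kind, labels such as these would serve as classification targets for the network, so no manual annotation is required. Using block indices rather than pixel coordinates reflects the abstract's point that rough locations are easier to learn than exact ones.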

* 14 pages. An extension of our previous work at arXiv:1904.03597
