Alex N. Wang

PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Aug 20, 2024

Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, Mengye Ren

Abstract: Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.

* Project page: https://poodle-ssl.github.io
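
The abstract describes two objectives: invariance on pooled representations and equivariance of dense features to optical flow warping. The NumPy sketch below illustrates that general idea only and is not the authors' implementation. Every name in it (`warp_by_flow`, `dense_loss`, `pooled_loss`, `combined_loss`, `dense_weight`) is hypothetical, and the nearest-neighbour sampling, cosine distance, global average pooling, and single feature scale are simplifying assumptions that the abstract does not specify.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-8):
    """Return 1 - cosine similarity, computed along the last axis."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a_n * b_n, axis=-1)


def warp_by_flow(feat2, flow):
    """Sample feat2 at (x + dx, y + dy) for every location (x, y) of frame 1.

    feat2: (H, W, C) feature map of the second frame.
    flow:  (H, W, 2) displacements (dx, dy) from frame 1 to frame 2.
    Assumption: nearest-neighbour sampling, chosen only for brevity.
    Returns the warped features and a mask of in-bounds locations.
    """
    h, w, _ = feat2.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x2 = np.rint(xs + flow[..., 0]).astype(int)
    y2 = np.rint(ys + flow[..., 1]).astype(int)
    valid = (x2 >= 0) & (x2 < w) & (y2 >= 0) & (y2 < h)
    # Clip so indexing is safe; clipped locations are excluded via the mask.
    x2c = np.clip(x2, 0, w - 1)
    y2c = np.clip(y2, 0, h - 1)
    return feat2[y2c, x2c], valid


def dense_loss(feat1, feat2, flow):
    """Flow-equivariance term: frame-1 features should match the
    flow-aligned frame-2 features at every valid location."""
    warped, valid = warp_by_flow(feat2, flow)
    dist = cosine_distance(feat1, warped)
    return float(dist[valid].mean()) if valid.any() else 0.0


def pooled_loss(feat1, feat2):
    """Invariance term on globally average-pooled features (assumed pooling)."""
    return float(cosine_distance(feat1.mean(axis=(0, 1)), feat2.mean(axis=(0, 1))))


def combined_loss(feat1, feat2, flow, dense_weight=1.0):
    """Sum of both terms; the weighting is a hypothetical hyperparameter."""
    return pooled_loss(feat1, feat2) + dense_weight * dense_loss(feat1, feat2, flow)
```

In this toy setup, shifting a feature map two columns to the right and supplying a constant flow of `(2, 0)` drives the dense term close to zero, while a zero flow on the same pair leaves it large.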

Self-supervised learning of video representations from a child's perspective

Feb 01, 2024

A. Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, Brenden M. Lake

Abstract: Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases? Recent advances in collecting large-scale, longitudinal, developmentally realistic video datasets and generic self-supervised learning (SSL) algorithms are allowing us to begin to tackle this nature vs. nurture question. However, existing work typically focuses on image-based SSL algorithms and visual capabilities that can be learned from static images (e.g. object recognition), thus ignoring temporal aspects of the world. To close this gap, here we train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two-year period in their early development (6-31 months). The resulting models are highly effective at facilitating the learning of action concepts from a small number of labeled examples; they have favorable data size scaling properties; and they display emergent video interpolation capabilities. Video models also learn more robust object representations than image-based models trained with the exact same data. These results suggest that important temporal aspects of a child's internal model of the world may be learnable from their visual experience using highly generic learning algorithms and without strong inductive biases.

* 7 pages, 6 figures; code & models available from https://github.com/eminorhan/video-models
