Title: HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data

Jul 25, 2024

A. Emin Orhan

Abstract: We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models in downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations compared to models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
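To illustrate the kind of input masking that spatiotemporal masked autoencoders rely on, here is a minimal sketch of random spacetime patch masking. It is not taken from the HVM-1 code. The function name `random_spacetime_mask`, the 16-frame clip length, the 2x16x16 patch size, and the 90% mask ratio are all assumptions chosen for illustration; only the 224x224 resolution comes from the abstract.

```python
import random


def random_spacetime_mask(num_frames, height, width,
                          patch_t=2, patch_size=16,
                          mask_ratio=0.9, seed=0):
    """Split a clip into spacetime patches and pick a random subset to hide.

    All default values are illustrative assumptions, not HVM-1 settings.
    Returns two sorted lists of patch indices: (visible, masked).
    """
    # Number of patches along each axis of the (time, height, width) grid.
    grid_t = num_frames // patch_t
    grid_h = height // patch_size
    grid_w = width // patch_size
    num_patches = grid_t * grid_h * grid_w

    # Shuffle the patch indices and hide the first `mask_ratio` fraction.
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)

    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical clip: 16 frames at the 224x224 resolution named in the abstract.
visible, masked = random_spacetime_mask(num_frames=16, height=224, width=224)
print(len(visible), len(masked))
```

In a masked autoencoder, only the visible patches would be passed to the encoder, and a decoder would be trained to reconstruct the masked ones. The sketch covers the index selection step only.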

* 10 pages, 5 figures, 1 table; code & models available from https://github.com/eminorhan/hvm-1
