Rahul Ravishankar

An Empirical Study of Autoregressive Pre-training from Videos

Jan 09, 2025

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik

Abstract: We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/

Scaling Properties of Diffusion Models for Perceptual Tasks

Nov 13, 2024

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

Abstract: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io.
