Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Benjamin Lundell

How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions

Apr 16, 2025

Aditya Prakash, Benjamin Lundell, Dmitry Andreychuk, David Forsyth, Saurabh Gupta, Harpreet Sawhney

Abstract:We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.

* CVPR 2025, Project page: https://ap229997.github.io/projects/latentact

Via

Access Paper or Ask Questions

STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Jan 02, 2023

Anshul Shah, Benjamin Lundell, Harpreet Sawhney, Rama Chellappa

Figure 1 for STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Figure 2 for STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Figure 3 for STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Figure 4 for STEPs: Self-Supervised Key Step Extraction from Unlabeled Procedural Videos

Abstract:We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key steps extraction. We employ self-supervised representation learning via a training strategy that adapts off-the-shelf video features using a temporal module. Training implements self-supervised learning losses involving multiple cues such as appearance, motion and pose trajectories extracted from videos to learn generalizable representations. Our method extracts key steps via a tunable algorithm that clusters the representations extracted from procedural videos. We quantitatively evaluate our approach with key step localization and also demonstrate the effectiveness of the extracted representations on related downstream tasks like phase classification. Qualitative results demonstrate that the extracted key steps are meaningful to succinctly represent the procedural tasks.

Via

Access Paper or Ask Questions