Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Changan Chen

Interspatial Attention for Efficient 4D Human Video Generation

May 21, 2025

Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein

Abstract:Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)--based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variation autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.

* Project page: https://dsaurus.github.io/isa4d/

Via

Access Paper or Ask Questions

Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation

Apr 07, 2025

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

Abstract:Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks - particularly on views that exhibit severe occlusion.

Via

Access Paper or Ask Questions

SocialGen: Modeling Multi-Human Social Interaction with Language Models

Mar 28, 2025

Heng Yu, Juze Zhang, Changan Chen, Tiange Xiang, Yusu Fang, Juan Carlos Niebles, Ehsan Adeli

Abstract:Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. In this paper, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals, to address this crucial yet challenging problem. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle the challenges of data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.

Via

Access Paper or Ask Questions

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

Dec 13, 2024

Changan Chen, Juze Zhang, Shrinidhi K. Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli

Abstract:Human communication is inherently multimodal, involving a combination of verbal and non-verbal cues such as speech, facial expressions, and body gestures. Modeling these behaviors is essential for understanding human interaction and for creating virtual characters that can communicate naturally in applications like games, films, and virtual reality. However, existing motion generation models are typically limited to specific input modalities -- either speech, text, or motion data -- and cannot fully leverage the diversity of available data. In this paper, we propose a novel framework that unifies verbal and non-verbal language using multimodal language models for human motion understanding and generation. This model is flexible in taking text, speech, and motion or any combination of them as input. Coupled with our novel pre-training strategy, our model not only achieves state-of-the-art performance on co-speech gesture generation but also requires much less data for training. Our model also unlocks an array of novel tasks such as editable gesture generation and emotion prediction from motion. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications, and language models offer a powerful approach to achieving this goal. Project page: languageofmotion.github.io.

* Project page: languageofmotion.github.io

Via

Access Paper or Ask Questions

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Jun 13, 2024

Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwarth, Kristen Grauman

Figure 1 for Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Figure 2 for Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Figure 3 for Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Figure 4 for Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Abstract:Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

* Project page: https://vision.cs.utexas.edu/projects/action2sound

Via

Access Paper or Ask Questions

HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Jun 11, 2024

Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman

Figure 1 for HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Figure 2 for HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Figure 3 for HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Figure 4 for HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Abstract:We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits -- especially when object swapping results in object shape or functionality changes. To bridge this gap, we present HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. Designed in two stages, the first stage focuses on object swapping in a single frame with HOI awareness; the model learns to adjust the interaction patterns, such as the hand grasp, based on changes in the object's properties. The second stage extends the single-frame edit across the entire sequence; we achieve controllable motion alignment with the original video by: (1) warping a new sequence from the stage-I edited frame based on sampled motion points and (2) conditioning video generation on the warped sequence. Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.

* Project website: https://vision.cs.utexas.edu/projects/HOI-Swap/

Via

Access Paper or Ask Questions

Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

May 05, 2024

Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman

Abstract:Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. The sound differs from light in that it spans across much wider frequencies and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between the simulation and the real world by training AFP models that only take a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves the performance on the real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely from simulation, and transferring them to the real world.

Via

Access Paper or Ask Questions

ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Apr 24, 2024

Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman

Figure 1 for ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Figure 2 for ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Figure 3 for ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Figure 4 for ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Abstract:An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to intelligently select acoustic data sampling locations. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment in which a mobile agent equipped with visual and acoustic sensors jointly constructs the environment acoustic model and the occupancy map on-the-fly. We introduce ActiveRIR, a reinforcement learning (RL) policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions, yielding a high quality acoustic model of the environment from a minimal set of acoustic samples. We train our policy with a novel RL reward based on information gain in the environment acoustic model. Evaluating on diverse unseen indoor environments from a state-of-the-art acoustic simulation platform, ActiveRIR outperforms an array of methods--both traditional navigation agents based on spatial novelty and visual exploration as well as existing state-of-the-art methods.

* Project page: https://vision.cs.utexas.edu/projects/active_rir/

Via

Access Paper or Ask Questions

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Apr 08, 2024

Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

Figure 1 for SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Figure 2 for SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Figure 3 for SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Figure 4 for SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Abstract:We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.

* Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

Via

Access Paper or Ask Questions

F$^3$Loc: Fusion and Filtering for Floorplan Localization

Mar 05, 2024

Changan Chen, Rui Wang, Christoph Vogel, Marc Pollefeys

Figure 1 for F$^3$Loc: Fusion and Filtering for Floorplan Localization

Figure 2 for F$^3$Loc: Fusion and Filtering for Floorplan Localization

Figure 3 for F$^3$Loc: Fusion and Filtering for Floorplan Localization

Figure 4 for F$^3$Loc: Fusion and Filtering for Floorplan Localization

Abstract:In this paper we propose an efficient data-driven solution to self-localization within a floorplan. Floorplan data is readily available, long-term persistent and inherently robust to changes in the visual appearance. Our method does not require retraining per map and location or demand a large database of images of the area of interest. We propose a novel probabilistic model consisting of an observation and a novel temporal filtering module. Operating internally with an efficient ray-based representation, the observation module consists of a single and a multiview module to predict horizontal depth from images and fuses their results to benefit from advantages offered by either methodology. Our method operates on conventional consumer hardware and overcomes a common limitation of competing methods that often demand upright images. Our full system meets real-time requirements, while outperforming the state-of-the-art by a significant margin.

* 10 pages, 11 figure, accepted to CVPR 2024

Via

Access Paper or Ask Questions