Kevin Xie

University of Toronto, Nvidia

VideoPanda: Video Panoramic Diffusion with Multi-view Attention

Apr 15, 2025

Kevin Xie, Amirmojtaba Sabour, Jiahui Huang, Despoina Paschalidou, Greg Klar, Umar Iqbal, Sanja Fidler, Xiaohui Zeng

Abstract: High-resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions: text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360$^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/ for results.
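The random subsampling of duration and camera views mentioned in the abstract could look roughly like the sketch below. It is only an illustration: the function name `subsample_clip`, the choice of a contiguous frame window, the uniform sampling, and all sizes are assumptions, not details taken from the paper.

```python
import random


def subsample_clip(total_frames, total_views, max_frames, max_views, rng):
    """Pick a contiguous frame window and a random subset of camera views.

    Hypothetical helper: the contiguous window and uniform sampling are
    assumptions made for this sketch.
    """
    # Random clip length, capped by the training budget and the clip itself.
    num_frames = rng.randint(1, min(max_frames, total_frames))
    start = rng.randint(0, total_frames - num_frames)
    frame_ids = list(range(start, start + num_frames))

    # Random subset of views, returned in sorted order.
    num_views = rng.randint(1, min(max_views, total_views))
    view_ids = sorted(rng.sample(range(total_views), num_views))
    return frame_ids, view_ids


rng = random.Random(0)
frame_ids, view_ids = subsample_clip(
    total_frames=48, total_views=8, max_frames=16, max_views=4, rng=rng
)
```

Each training step would then load only the selected frames and views, which bounds memory use regardless of the full clip length.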

* Project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Mar 18, 2025

NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen (+30 more)

Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and is useful in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
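As a rough illustration of weighting conditional inputs differently at different spatial locations, the sketch below blends per-modality feature maps with per-location weights. The function `blend_controls`, the plain nested-list maps, and the rule that rescales weights only when they sum to more than 1 are assumptions of this sketch, not the released model's code.

```python
def blend_controls(features, weights):
    """Combine per-modality feature maps using per-location weights.

    features, weights: dict mapping modality name -> 2D list (H x W) of floats.
    Weights at a location are rescaled only if they sum to more than 1
    (an assumption of this sketch).
    """
    names = list(features)
    height = len(features[names[0]])
    width = len(features[names[0]][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(weights[n][y][x] for n in names)
            scale = 1.0 / total if total > 1.0 else 1.0
            out[y][x] = sum(
                scale * weights[n][y][x] * features[n][y][x] for n in names
            )
    return out


# Two modalities on a 1 x 2 map: depth dominates on the left, edge on the right.
features = {"depth": [[1.0, 1.0]], "edge": [[3.0, 3.0]]}
weights = {"depth": [[1.0, 0.5]], "edge": [[0.0, 1.5]]}
blended = blend_controls(features, weights)
```

At the left location only depth contributes; at the right location the weights sum to 2, so they are rescaled to 0.25 and 0.75 before blending.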

L4GM: Large 4D Gaussian Reconstruction Model

Jun 14, 2024

Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim(+1 more)

Abstract: We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input, in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep our L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model which produces intermediate 3D Gaussian representations. We show that L4GM, trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.

* Project page: https://research.nvidia.com/labs/toronto-ai/l4gm

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Mar 22, 2024

Kevin Xie, Jonathan Lorraine, Tianshi Cao, Jun Gao, James Lucas, Antonio Torralba, Sanja Fidler, Xiaohui Zeng

Abstract: Recent text-to-3D generation approaches produce impressive 3D results but require time-consuming optimization that can take up to an hour per prompt. Amortized methods like ATT3D optimize multiple prompts simultaneously to improve efficiency, enabling fast text-to-3D synthesis. However, they cannot capture high-frequency geometry and texture details and struggle to scale to large prompt sets, so they generalize poorly. We introduce LATTE3D, addressing these limitations to achieve fast, high-quality generation on a significantly larger prompt set. Key to our method is 1) building a scalable architecture and 2) leveraging 3D data during optimization through 3D-aware diffusion priors, shape regularization, and model initialization to achieve robustness to diverse and complex training prompts. LATTE3D amortizes both neural field and textured surface generation to produce highly detailed textured meshes in a single forward pass. LATTE3D generates 3D objects in 400ms, and can be further enhanced with fast test-time optimization.

* See the project website at https://research.nvidia.com/labs/toronto-ai/LATTE3D/

Generating Transferable Adversarial Simulation Scenarios for Self-Driving via Neural Rendering

Sep 27, 2023

Yasasa Abeysirigoonawardena, Kevin Xie, Chuhan Chen, Salar Hosseini, Ruiting Chen, Ruiqi Wang, Florian Shkurti

Abstract: Self-driving software pipelines include components that are learned from a significant number of training examples, yet it remains challenging to evaluate the overall system's safety and generalization performance. As real-world deployment of autonomous vehicles scales up, it is critically important to automatically find simulation scenarios where the driving policies will fail. We propose a method that efficiently generates adversarial simulation scenarios for autonomous driving by solving an optimal control problem that aims to maximally perturb the policy from its nominal trajectory. Given an image-based driving policy, we show that we can inject new objects in a neural rendering representation of the deployment scene, and optimize their texture in order to generate adversarial sensor inputs to the policy. We demonstrate that adversarial scenarios discovered purely in the neural renderer (surrogate scene) can often be successfully transferred to the deployment scene, without further optimization. We demonstrate that this transfer occurs in both simulated and real environments, provided the learned surrogate scene is sufficiently close to the deployment scene.

* Conference paper submitted to CoRL 23

ATT3D: Amortized Text-to-3D Object Synthesis

Jun 06, 2023

Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas

Abstract: Text-to-3D modelling has seen exciting progress by combining generative text-to-image models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently achieved high-quality results but requires a lengthy, per-prompt optimization to create 3D objects. To address this, we amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately. With this, we share computation across a prompt set, training in less time than per-prompt optimization. Our framework, Amortized text-to-3D (ATT3D), enables knowledge-sharing between prompts to generalize to unseen setups and smooth interpolations between text for novel assets and simple animations.

* 22 pages, 20 figures

Physics-based Human Motion Estimation and Synthesis from Videos

Sep 21, 2021

Kevin Xie, Tingwu Wang, Umar Iqbal, Yunrong Guo, Sanja Fidler, Florian Shkurti

Abstract: Human motion synthesis is an important problem with applications in graphics, gaming and simulation environments for robotics. Existing methods require accurate motion capture data for training, which is costly to obtain. Instead, we propose a framework for training generative models of physically plausible human motion directly from monocular RGB videos, which are much more widely available. At the core of our method is a novel optimization formulation that corrects imperfect image-based pose estimations by enforcing physics constraints and reasons about contacts in a differentiable way. This optimization yields corrected 3D poses and motions, as well as their corresponding contact forces. Results show that our physically-corrected motions significantly outperform prior work on pose estimation. We can then use these to train a generative model to synthesize future motion. We demonstrate both qualitatively and quantitatively significantly improved motion estimation, synthesis quality and physical plausibility achieved by our method on the large-scale Human3.6M dataset as compared to prior kinematic and physics-based methods. By enabling learning of motion synthesis from video, our method paves the way for large-scale, realistic and diverse motion synthesis.

* To appear in ICCV 2021

KAMA: 3D Keypoint Aware Body Mesh Articulation

Apr 27, 2021

Umar Iqbal, Kevin Xie, Yunrong Guo, Jan Kautz, Pavlo Molchanov

Abstract: We present KAMA, a 3D Keypoint Aware Mesh Articulation approach that allows us to estimate a human body mesh from the positions of 3D body keypoints. To this end, we learn to estimate 3D positions of 26 body keypoints and propose an analytical solution to articulate a parametric body model, SMPL, via a set of straightforward geometric transformations. Since keypoint estimation directly relies on image cues, our approach offers significantly better alignment to image content when compared to state-of-the-art approaches. Our proposed approach does not require any paired mesh annotations and is able to achieve state-of-the-art mesh fittings through 3D keypoint regression only. Results on the challenging 3DPW and Human3.6M datasets demonstrate that our approach yields state-of-the-art body mesh fittings.

* Additional qualitative results: https://youtu.be/mPikZEIpUE0

gradSim: Differentiable simulation for system identification and visuomotor control

Apr 06, 2021

Krishna Murthy Jatavallabhula, Miles Macklin, Florian Golemo, Vikram Voleti, Linda Petrini, Martin Weiss, Breandan Considine, Jerome Parent-Levesque, Kevin Xie, Kenny Erleben(+4 more)

Abstract: We consider the problem of estimating an object's physical properties such as mass, friction, and elasticity directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current solutions require precise 3D labels which are labor-intensive to gather, and infeasible to create for many systems such as deformable solids or cloth. We present gradSim, a framework that overcomes the dependence on 3D supervision by leveraging differentiable multiphysics simulation and differentiable rendering to jointly model the evolution of scene dynamics and image formation. This novel combination enables backpropagation from pixels in a video sequence through to the underlying physical attributes that generated them. Moreover, our unified computation graph, spanning from the dynamics through the rendering process, enables learning in challenging visuomotor control tasks, without relying on state-based (3D) supervision, while obtaining performance competitive with or better than techniques that rely on precise 3D labels.
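The idea of recovering a physical parameter by differentiating through a simulator can be illustrated with a toy example. The sketch below is not gradSim: it observes positions directly instead of pixels, uses a 1D point mass under a constant force, and propagates a hand-written derivative through the time steps instead of using automatic differentiation or a differentiable renderer. The function names, the inverse-mass parameterization, and all constants are assumptions chosen for illustration.

```python
def rollout(inv_mass, force, dt, steps):
    """Semi-implicit Euler rollout of a 1D point mass starting at rest.

    Returns the positions and their derivatives with respect to inv_mass,
    propagated step by step alongside the state.
    """
    x = v = 0.0
    dx = dv = 0.0  # sensitivities d(x)/d(inv_mass), d(v)/d(inv_mass)
    xs, dxs = [], []
    for _ in range(steps):
        v += force * inv_mass * dt
        dv += force * dt
        x += v * dt
        dx += dv * dt
        xs.append(x)
        dxs.append(dx)
    return xs, dxs


def estimate_mass(observed, force, dt, init_mass=1.0, lr=0.1, iters=100):
    """Gradient descent on squared position error, optimizing inverse mass."""
    inv_mass = 1.0 / init_mass
    for _ in range(iters):
        xs, dxs = rollout(inv_mass, force, dt, len(observed))
        grad = sum(2.0 * (x - obs) * dx for x, obs, dx in zip(xs, observed, dxs))
        inv_mass -= lr * grad
    return 1.0 / inv_mass


# Synthetic "observations" from a ground-truth mass of 4.0.
observed, _ = rollout(1.0 / 4.0, force=2.0, dt=0.1, steps=10)
estimated = estimate_mass(observed, force=2.0, dt=0.1)
```

Optimizing the inverse mass keeps the loss quadratic in the parameter, so plain gradient descent with a fixed step size converges here; a direct mass parameterization would need a more careful step size.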

* ICLR 2021. Project page (and a dynamic web version of the article): https://gradsim.github.io

Skill Transfer via Partially Amortized Hierarchical Planning

Nov 27, 2020

Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, Florian Shkurti

Abstract: To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time while the skills it conditions on are chosen using online planning. We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks and show improved sample efficiency in single tasks as well as in transfer from one task to another, as compared to competitive baselines. Videos are available at: https://sites.google.com/view/partial-amortization-hierarchy/home
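The split between an amortized low-level policy and skills chosen by online planning can be sketched on a toy problem. Everything below is hypothetical: the 1D world, the three named skills, the fixed lookup-table `policy` standing in for a learned network, the exact `model` standing in for a learned world model, and the exhaustive search standing in for whatever planner the paper uses.

```python
from itertools import product

SKILLS = ("left", "stay", "right")


def policy(state, skill):
    """Stand-in for the learned (amortized) skill-conditioned policy."""
    return {"left": -1, "stay": 0, "right": 1}[skill]


def model(state, action):
    """Stand-in for a learned world model: predicts the next state."""
    return state + action


def plan_skill(state, goal, horizon=3):
    """Search over skill sequences in the model and return the first skill."""
    best_cost, best_first = None, None
    for seq in product(SKILLS, repeat=horizon):
        s, cost = state, 0
        for skill in seq:
            s = model(s, policy(s, skill))
            cost += abs(s - goal)  # distance to goal accumulated over the rollout
        if best_cost is None or cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def run_episode(start, goal, steps):
    """Replan the skill at every step; the policy turns it into an action."""
    states = [start]
    for _ in range(steps):
        skill = plan_skill(states[-1], goal)
        states.append(model(states[-1], policy(states[-1], skill)))
    return states


trajectory = run_episode(start=0, goal=3, steps=5)
```

Only the skill choice is recomputed at test time, so moving the goal changes the plan without retraining the low-level policy.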

* First two authors contributed equally. Preprint. NeurIPS 2020 Deep RL Workshop and under review
