Mike Chen

Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models

Jun 11, 2025

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen (+6 more)

Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as autonomous vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain and capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps mitigate long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning. We open-source our pipeline toolkit, dataset, and model weights through NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams

* Only the core contributors are listed. The full list of contributors can be found in Appendix A of the paper.

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Mar 18, 2025

NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen (+30 more)

Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable: it allows different conditional inputs to be weighted differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
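
The idea of weighting control inputs differently at different spatial locations can be illustrated with a small sketch. This is not the Cosmos-Transfer implementation: the function name `blend_controls`, the 2D-list layout, and the rule of rescaling weights only where they sum to more than 1 are all assumptions made for illustration.

```python
def blend_controls(features, weights):
    """Blend per-modality control maps using per-location weights.

    features, weights: dict mapping a modality name (e.g. "depth") to an
    H x W nested list of floats. Both names and layout are hypothetical.
    """
    names = list(features)
    height = len(features[names[0]])
    width = len(features[names[0]][0])
    blended = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(weights[n][y][x] for n in names)
            # Assumed rule: rescale only where the weights exceed 1 in total.
            scale = 1.0 / total if total > 1.0 else 1.0
            blended[y][x] = sum(
                scale * weights[n][y][x] * features[n][y][x] for n in names
            )
    return blended


# Two modalities on a 1 x 2 map: equal weights at the first location,
# edge-only (weight 0.5) at the second.
features = {"depth": [[1.0, 1.0]], "edge": [[3.0, 3.0]]}
weights = {"depth": [[1.0, 0.0]], "edge": [[1.0, 0.5]]}
result = blend_controls(features, weights)
print(result)
```

At the first location the two weights sum to 2 and are rescaled to 0.5 each; at the second only the edge map contributes, which is the kind of location-dependent control the abstract describes.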

InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Dec 05, 2024

Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler (+1 more)

Abstract: We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scale or lack geometric and appearance consistency along generated sequences. In contrast, we leverage recent advances in scalable 3D representations and video models to achieve large dynamic scene generation that allows flexible control through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model for unbounded voxel world generation. Then, we repurpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.

* Project Page: https://research.nvidia.com/labs/toronto-ai/infinicube/

SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

Oct 26, 2024

Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, Jiahui Huang

Abstract: We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images. Our method encodes reconstructed scenes using a novel representation, VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold. To reconstruct a VoxSplat from images, we employ a hierarchical voxel latent diffusion model conditioned on the input images, followed by a feedforward appearance prediction model. The diffusion model generates high-resolution grids progressively in a coarse-to-fine manner, and the appearance network predicts a set of Gaussians within each voxel. From as few as 3 non-overlapping input images, SCube can generate millions of Gaussians with a 1024^3 voxel grid spanning hundreds of meters in 20 seconds. Past works tackling scene reconstruction from images either rely on per-scene optimization and fail to reconstruct the scene away from input views (thus requiring dense view coverage as input) or leverage geometric priors based on low-resolution models, which produce blurry results. In contrast, SCube leverages high-resolution sparse networks and produces sharp outputs from few views. We show the superiority of SCube over prior art on 3D reconstruction using the Waymo self-driving dataset and demonstrate its applications, such as LiDAR simulation and text-to-scene generation.

* NeurIPS 2024. Project page: https://research.nvidia.com/labs/toronto-ai/scube/
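
To give a sense of why a sparse scaffold matters at the 1024^3 resolution quoted in the SCube abstract, the back-of-the-envelope sketch below compares a dense grid with a sparse one. The storage of one 4-byte value per voxel and the 1% occupancy figure are assumptions for illustration only; neither number comes from the paper.

```python
GRID_SIDE = 1024            # grid resolution per axis, from the abstract
BYTES_PER_VOXEL = 4         # assumption: a single float32 value per voxel
OCCUPANCY_PERCENT = 1       # hypothetical share of occupied voxels

# A dense grid stores every voxel, occupied or not.
dense_voxels = GRID_SIDE ** 3
dense_gib = dense_voxels * BYTES_PER_VOXEL / 2**30

# A sparse grid stores only occupied voxels (index overhead ignored here).
sparse_voxels = dense_voxels * OCCUPANCY_PERCENT // 100
sparse_mib = sparse_voxels * BYTES_PER_VOXEL / 2**20

print(f"dense:  {dense_voxels:,} voxels, {dense_gib:.1f} GiB")
print(f"sparse: {sparse_voxels:,} voxels, {sparse_mib:.1f} MiB")
```

Under these assumptions a single dense channel already occupies 4 GiB, while the sparse version needs roughly a hundredth of that, before counting any per-voxel features or Gaussians.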

Changing the Narrative Perspective: From Deictic to Anaphoric Point of View

Mar 06, 2021

Mike Chen, Razvan Bunescu

Abstract: We introduce the task of changing the narrative point of view, where characters are assigned a narrative perspective that is different from the one originally used by the writer. The resulting shift in the narrative point of view alters the reading experience and can be used as a tool in fiction writing or to generate types of text ranging from educational to self-help and self-diagnosis. We introduce a benchmark dataset containing a wide range of types of narratives annotated with changes in point of view from deictic (first or second person) to anaphoric (third person) and describe a pipeline for processing raw text that relies on a neural architecture for mention selection. Evaluations on the new benchmark dataset show that the proposed architecture substantially outperforms the baselines by generating mentions that are less ambiguous and more natural.

* To appear in Information Processing & Management, Special Issue on Creative Language Processing
