Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shuhan Tan

Promptable Closed-loop Traffic Simulation

Sep 09, 2024

Shuhan Tan, Boris Ivanovic, Yuxiao Chen, Boyi Li, Xinshuo Weng, Yulong Cao, Philipp Krähenbühl, Marco Pavone

Figure 1 for Promptable Closed-loop Traffic Simulation

Figure 2 for Promptable Closed-loop Traffic Simulation

Figure 3 for Promptable Closed-loop Traffic Simulation

Abstract:Simulation stands as a cornerstone for safe and efficient autonomous driving development. At its core a simulation system ought to produce realistic, reactive, and controllable traffic patterns. In this paper, we propose ProSim, a multimodal promptable closed-loop traffic simulation framework. ProSim allows the user to give a complex set of numerical, categorical or textual prompts to instruct each agent's behavior and intention. ProSim then rolls out a traffic scenario in a closed-loop manner, modeling each agent's interaction with other traffic participants. Our experiments show that ProSim achieves high prompt controllability given different user prompts, while reaching competitive performance on the Waymo Sim Agents Challenge when no prompt is given. To support research on promptable traffic simulation, we create ProSim-Instruct-520k, a multimodal prompt-scenario paired driving dataset with over 10M text prompts for over 520k real-world driving scenarios. We will release code of ProSim as well as data and labeling tools of ProSim-Instruct-520k at https://ariostgx.github.io/ProSim.

* Accepted to CoRL 2024. Website available at https://ariostgx.github.io/ProSim

Via

Access Paper or Ask Questions

Wolf: Captioning Everything with a World Summarization Framework

Jul 26, 2024

Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion(+10 more)

Figure 1 for Wolf: Captioning Everything with a World Summarization Framework

Figure 2 for Wolf: Captioning Everything with a World Summarization Framework

Figure 3 for Wolf: Captioning Everything with a World Summarization Framework

Figure 4 for Wolf: Captioning Everything with a World Summarization Framework

Abstract:We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.

Via

Access Paper or Ask Questions

Language Conditioned Traffic Generation

Jul 16, 2023

Shuhan Tan, Boris Ivanovic, Xinshuo Weng, Marco Pavone, Philipp Kraehenbuehl

Abstract:Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at https://ariostgx.github.io/lctgen.

* Technical Report. Website available at https://ariostgx.github.io/lctgen

Via

Access Paper or Ask Questions

EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Jan 05, 2023

Shuhan Tan, Tushar Nagarajan, Kristen Grauman

Abstract:Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.

* Tech report. Project page: https://vision.cs.utexas.edu/projects/egodistill

Via

Access Paper or Ask Questions

TrafficGen: Learning to Generate Diverse and Realistic Traffic Scenarios

Oct 12, 2022

Lan Feng, Quanyi Li, Zhenghao Peng, Shuhan Tan, Bolei Zhou

Figure 1 for TrafficGen: Learning to Generate Diverse and Realistic Traffic Scenarios

Figure 2 for TrafficGen: Learning to Generate Diverse and Realistic Traffic Scenarios

Figure 3 for TrafficGen: Learning to Generate Diverse and Realistic Traffic Scenarios

Figure 4 for TrafficGen: Learning to Generate Diverse and Realistic Traffic Scenarios

Abstract:Diverse and realistic traffic scenarios are crucial for evaluating the AI safety of autonomous driving systems in simulation. This work introduces a data-driven method called TrafficGen for traffic scenario generation. It learns from the fragmented human driving data collected in the real world and then can generate realistic traffic scenarios. TrafficGen is an autoregressive generative model with an encoder-decoder architecture. In each autoregressive iteration, it first encodes the current traffic context with the attention mechanism and then decodes a vehicle's initial state followed by generating its long trajectory. We evaluate the trained model in terms of vehicle placement and trajectories and show substantial improvements over baselines. TrafficGen can be also used to augment existing traffic scenarios, by adding new vehicles and extending the fragmented trajectories. We further demonstrate that importing the generated scenarios into a simulator as interactive training environments improves the performance and the safety of driving policy learned from reinforcement learning. More project resource is available at https://metadriverse.github.io/trafficgen

Via

Access Paper or Ask Questions

SceneGen: Learning to Generate Realistic Traffic Scenes

Jan 16, 2021

Shuhan Tan, Kelvin Wong, Shenlong Wang, Sivabalan Manivasagam, Mengye Ren, Raquel Urtasun

Figure 1 for SceneGen: Learning to Generate Realistic Traffic Scenes

Figure 2 for SceneGen: Learning to Generate Realistic Traffic Scenes

Figure 3 for SceneGen: Learning to Generate Realistic Traffic Scenes

Figure 4 for SceneGen: Learning to Generate Realistic Traffic Scenes

Abstract:We consider the problem of generating realistic traffic scenes automatically. Existing methods typically insert actors into the scene according to a set of hand-crafted heuristics and are limited in their ability to model the true complexity and diversity of real traffic scenes, thus inducing a content gap between synthesized traffic scenes versus real ones. As a result, existing simulators lack the fidelity necessary to train and test self-driving vehicles. To address this limitation, we present SceneGen, a neural autoregressive model of traffic scenes that eschews the need for rules and heuristics. In particular, given the ego-vehicle state and a high definition map of surrounding area, SceneGen inserts actors of various classes into the scene and synthesizes their sizes, orientations, and velocities. We demonstrate on two large-scale datasets SceneGen's ability to faithfully model distributions of real traffic scenes. Moreover, we show that SceneGen coupled with sensor simulation can be used to train perception models that generalize to the real world.

Via

Access Paper or Ask Questions

Improving the Fairness of Deep Generative Models without Retraining

Dec 09, 2020

Shuhan Tan, Yujun Shen, Bolei Zhou

Figure 1 for Improving the Fairness of Deep Generative Models without Retraining

Figure 2 for Improving the Fairness of Deep Generative Models without Retraining

Figure 3 for Improving the Fairness of Deep Generative Models without Retraining

Figure 4 for Improving the Fairness of Deep Generative Models without Retraining

Abstract:Generative Adversarial Networks (GANs) have recently advanced face synthesis by learning the underlying distribution of observed data. However, it will lead to a biased image generation due to the imbalanced training data or the mode collapse issue. Prior work typically addresses the fairness of data generation by balancing the training data that correspond to the concerned attributes. In this work, we propose a simple yet effective method to improve the fairness of image generation for a pre-trained GAN model without retraining. We utilize the recent work of GAN interpretation to identify the directions in the latent space corresponding to the target attributes, and then manipulate a set of latent codes with balanced attribute distributions over output images. We learn a Gaussian Mixture Model (GMM) to fit a distribution of the latent code set, which supports the sampling of latent codes for producing images with a more fair attribute distribution. Experiments show that our method can substantially improve the fairness of image generation, outperforming potential baselines both quantitatively and qualitatively. The images generated from our method are further applied to reveal and quantify the biases in commercial face classifiers and face super-resolution model.

* 19 pages, 18 figures. Project page: https://genforce.github.io/fairgen/

Via

Access Paper or Ask Questions

LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World

Jun 16, 2020

Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan Tan, Bin Yang, Wei-Chiu Ma, Raquel Urtasun

Figure 1 for LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World

Figure 2 for LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World

Figure 3 for LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World

Figure 4 for LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World

Abstract:We tackle the problem of producing realistic simulations of LiDAR point clouds, the sensor of preference for most self-driving vehicles. We argue that, by leveraging real data, we can simulate the complex world more realistically compared to employing virtual worlds built from CAD/procedural models. Towards this goal, we first build a large catalog of 3D static maps and 3D dynamic objects by driving around several cities with our self-driving fleet. We can then generate scenarios by selecting a scene from our catalog and "virtually" placing the self-driving vehicle (SDV) and a set of dynamic objects from the catalog in plausible locations in the scene. To produce realistic simulations, we develop a novel simulator that captures both the power of physics-based and learning-based simulation. We first utilize ray casting over the 3D scene and then use a deep neural network to produce deviations from the physics-based simulation, producing realistic LiDAR point clouds. We showcase LiDARsim's usefulness for perception algorithms-testing on long-tail events and end-to-end closed-loop evaluation on safety-critical scenarios.

* CVPR 2020 (Oral)

Via

Access Paper or Ask Questions

Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment

Oct 23, 2019

Shuhan Tan, Xingchao Peng, Kate Saenko

Figure 1 for Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment

Figure 2 for Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment

Figure 3 for Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment

Figure 4 for Generalized Domain Adaptation with Covariate and Label Shift CO-ALignment

Abstract:Unsupervised knowledge transfer has a great potential to improve the generalizability of deep models to novel domains. Yet the current literature assumes that the label distribution is domain-invariant and only aligns the covariate or vice versa. In this paper, we explore the task of Generalized Domain Adaptation (GDA): How to transfer knowledge across different domains in the presence of both covariate and label shift? We propose a covariate and label distribution CO-ALignment (COAL) model to tackle this problem. Our model leverages prototype-based conditional alignment and label distribution estimation to diminish the covariate and label shifts, respectively. We demonstrate experimentally that when both types of shift exist in the data, COAL leads to state-of-the-art performance on several cross-domain benchmarks.

* 17 pages

Via

Access Paper or Ask Questions

Weakly Supervised Open-set Domain Adaptation by Dual-domain Collaboration

Apr 30, 2019

Shuhan Tan, Jiening Jiao, Wei-Shi Zheng

Figure 1 for Weakly Supervised Open-set Domain Adaptation by Dual-domain Collaboration

Figure 2 for Weakly Supervised Open-set Domain Adaptation by Dual-domain Collaboration

Figure 3 for Weakly Supervised Open-set Domain Adaptation by Dual-domain Collaboration

Figure 4 for Weakly Supervised Open-set Domain Adaptation by Dual-domain Collaboration

Abstract:In conventional domain adaptation, a critical assumption is that there exists a fully labeled domain (source) that contains the same label space as another unlabeled or scarcely labeled domain (target). However, in the real world, there often exist application scenarios in which both domains are partially labeled and not all classes are shared between these two domains. Thus, it is meaningful to let partially labeled domains learn from each other to classify all the unlabeled samples in each domain under an open-set setting. We consider this problem as weakly supervised open-set domain adaptation. To address this practical setting, we propose the Collaborative Distribution Alignment (CDA) method, which performs knowledge transfer bilaterally and works collaboratively to classify unlabeled data and identify outlier samples. Extensive experiments on the Office benchmark and an application on person reidentification show that our method achieves state-of-the-art performance.

* CVPR 2019

Via

Access Paper or Ask Questions