Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xuewu Lin

DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving

May 26, 2025

Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Yining Shi, Chuang Zhang, Sifa Zheng

Abstract:Camera sensor simulation serves as a critical role for autonomous driving (AD), e.g. evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines, proposing an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but also can be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability across spatial-level (camera parameters variations) and temporal-level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be avaliable at https://github.com/swc-17/DriveCamSim for facilitating future research.

Via

Access Paper or Ask Questions

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

May 22, 2025

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Yiwei Jin, Keyu Li, Zhizhong Su

Abstract:A key challenge in robot manipulation lies in developing policy models with strong spatial understanding, the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graphbased modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks that outperform existing baselines.

Via

Access Paper or Ask Questions

Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Jan 28, 2025

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

Figure 1 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Figure 2 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Figure 3 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Figure 4 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Abstract:Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we revisit mixture models for generating multimodal agent behaviors, which can cover the mainstream methods including continuous mixture models and GPT-like discrete models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the unified mixture model~(UniMM) framework, we recognize critical configurations from both model and data perspectives. We conduct a systematic examination of various model configurations, including positive component matching, continuous regression, prediction horizon, and the number of components. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

Via

Access Paper or Ask Questions

BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Nov 22, 2024

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Zhizhong Su

Abstract:In embodied intelligence systems, a key component is 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point cloud, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.

Via

Access Paper or Ask Questions

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

May 31, 2024

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, Sifa Zheng

Abstract:The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g. perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird's eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy , which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-arts by a large margin in performance of all tasks, while achieving much higher training and inference efficiency. Code will be avaliable at https://github.com/swc-17/SparseDrive for facilitating future research.

Via

Access Paper or Ask Questions

EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Dec 15, 2023

Longzhong Lin, Xuewu Lin, Tianwei Lin, Lichao Huang, Rong Xiong, Yue Wang

Figure 1 for EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Figure 2 for EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Figure 3 for EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Figure 4 for EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Abstract:Motion prediction is a crucial task in autonomous driving, and one of its major challenges lands in the multimodality of future behaviors. Many successful works have utilized mixture models which require identification of positive mixture components, and correspondingly fall into two main lines: prediction-based and anchor-based matching. The prediction clustering phenomenon in prediction-based matching makes it difficult to pick representative trajectories for downstream tasks, while the anchor-based matching suffers from a limited regression capability. In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multimodal motion prediction based on mixture models. We enable anchors to evolve and redistribute themselves under specific scenes for an enlarged regression capacity. Furthermore, we select distinct anchors before matching them with the ground truth, which results in impressive scoring performance. Our approach enhances all metrics compared to the baseline MTR, particularly with a notable relative reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on the Waymo Open Motion Dataset. Code is available at https://github.com/Longzhong-Lin/EDA.

* Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI2024)

Via

Access Paper or Ask Questions

Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Nov 20, 2023

Xuewu Lin, Zixiang Pei, Tianwei Lin, Lichao Huang, Zhizhong Su

Figure 1 for Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Figure 2 for Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Figure 3 for Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Figure 4 for Sparse4D v3: Advancing End-to-End 3D Detection and Tracking

Abstract:In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance ID during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0\%, 2.2\%, and 7.6\% in mAP, NDS, and AMOTA, achieving 46.9\%, 56.1\%, and 49.0\%, respectively. Our best model achieved 71.9\% NDS and 67.7\% AMOTA on the nuScenes test set. Code will be released at \url{https://github.com/linxuewu/Sparse4D}.

Via

Access Paper or Ask Questions

Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

May 24, 2023

Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, Zhizhong Su

Figure 1 for Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Figure 2 for Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Figure 3 for Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Figure 4 for Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Abstract:Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach, Sparse4Dv2, further enhances the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at \url{https://github.com/linxuewu/Sparse4D}.

Via

Access Paper or Ask Questions

Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion

Nov 19, 2022

Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, Zhizhong Su

Abstract:Bird-eye-view (BEV) based methods have made great progress recently in multi-view 3D detection task. Comparing with BEV based methods, sparse based methods lag behind in performance, but still have lots of non-negligible merits. To push sparse 3D detection further, in this work, we introduce a novel method, named Sparse4D, which does the iterative refinement of anchor boxes via sparsely sampling and fusing spatial-temporal features. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected to multi-view/scale/timestamp image features to sample corresponding features; (2) Hierarchy Feature Fusion: we hierarchically fuse sampled features of different view/scale, different timestamp and different keypoints to generate high-quality instance feature. In this way, Sparse4D can efficiently and effectively achieve 3D detection without relying on dense view transformation nor global attention, and is more friendly to edge devices deployment. Furthermore, we introduce an instance-level depth reweight module to alleviate the ill-posed issue in 3D-to-2D projection. In experiment, our method outperforms all sparse based methods and most BEV based methods on detection task in the nuScenes dataset.

Via

Access Paper or Ask Questions

Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking

Apr 10, 2021

Xuewu Lin, Yu-ang Guo, Jianqiang Wang

Figure 1 for Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking

Figure 2 for Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking

Figure 3 for Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking

Figure 4 for Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking

Abstract:Multi-object tracking (MOT) has made great progress in recent years, but there are still some problems. Most MOT algorithms follow tracking-by-detection framework, which separates detection and tracking into two independent parts. Early tracking-by-detection algorithms need to do two feature extractions for detection and tracking. Recently, some algorithms make the feature extraction into one network, but the tracking part still relies on data association and needs complex post-processing for life cycle management. Those methods do not combine detection and tracking well. In this paper, we present a novel network to realize joint multi-object detection and tracking in an end-to-end way, called Global Correlation Network (GCNet). Different from most object detection methods, GCNet introduces the global correlation layer for regression of absolute size and coordinates of bounding boxes instead of offsets prediction. The pipeline of detection and tracking by GCNet is conceptually simple, which does not need non-maximum suppression, data association, and other complicated tracking strategies. GCNet was evaluated on a multi-vehicle tracking dataset, UA-DETRAC, and demonstrates promising performance compared to the state-of-the-art detectors and trackers.

* 9 pages, 5 figures, under review

Via

Access Paper or Ask Questions