Chaokang Jiang

GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving

Aug 08, 2025

Jian Wang, Chaokang Jiang, Haitao Xu

Abstract: Diffusion-based models are redefining the state of the art in end-to-end autonomous driving, yet their performance is increasingly hampered by a reliance on transformer-based fusion. These architectures face fundamental limitations: quadratic computational complexity restricts the use of high-resolution features, and a lack of spatial priors prevents them from effectively modeling the inherent structure of Bird's Eye View (BEV) representations. This paper introduces GMF-Drive (Gated Mamba Fusion for Driving), an end-to-end framework that overcomes these challenges through two principled innovations. First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format encoding shape descriptors and statistical features, preserving critical 3D geometric details. Second, we propose a novel hierarchical gated mamba fusion (GM-Fusion) architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM leverages directional sequencing and adaptive fusion mechanisms to capture long-range dependencies with linear complexity, while explicitly respecting the unique spatial properties of the driving scene. Extensive experiments on the challenging NAVSIM benchmark demonstrate that GMF-Drive achieves new state-of-the-art performance, significantly outperforming DiffusionDrive. Comprehensive ablation studies validate the efficacy of each component, demonstrating that task-specific SSMs can surpass a general-purpose transformer in both performance and efficiency for autonomous driving.

* 7 pages, 4 figures
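
As a rough illustration of why a state-space scan has linear complexity, the sketch below runs a scalar linear recurrence over a sequence and applies it to the rows of a small grid in both directions. It is a generic toy recurrence, not the BEV-SSM or GM-Fusion module from the paper; the function names `ssm_scan` and `bidirectional_row_scan`, the fixed coefficients, and the averaging of the two directions are all assumptions made for illustration.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    The loop visits each element once, so the cost is linear in len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_row_scan(grid, a=0.9):
    """Scan every row left-to-right and right-to-left, then average.

    Averaging is a hypothetical stand-in for a learned fusion of the
    two scan directions.
    """
    out = []
    for row in grid:
        forward = ssm_scan(row, a=a)
        backward = ssm_scan(row[::-1], a=a)[::-1]
        out.append([(f + g) / 2.0 for f, g in zip(forward, backward)])
    return out
```

With `a=0.5`, an impulse at the start of a row decays geometrically along the forward scan, which is how a single pass can carry information across long distances without pairwise attention.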

MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models

May 23, 2024

Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, Hesheng Wang

Abstract: Point cloud videos effectively capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing 3D world we live in. Although static 3D point cloud processing has witnessed significant advancements, designing an effective 4D point cloud video backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Moreover, recent state-of-the-art 4D backbones predominantly rely on transformer-based architectures, which commonly suffer from large computational costs due to their quadratic complexity, particularly when processing long video sequences. To address these challenges, we propose a novel 4D point cloud video understanding backbone based on the recently advanced State Space Models (SSMs). Specifically, our backbone begins by disentangling space and time in raw 4D sequences, and then establishes spatio-temporal correlations using our newly developed Intra-frame Spatial Mamba and Inter-frame Temporal Mamba blocks. The Intra-frame Spatial Mamba module is designed to encode locally similar or related geometric structures within a certain temporal searching stride, which can effectively capture short-term dynamics. Subsequently, these locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which globally integrates point features across the entire video with linear complexity, further establishing long-range motion dependencies. Experimental results on human action recognition and 4D semantic segmentation tasks demonstrate the superiority of our proposed method. In particular, for long video sequences, our proposed Mamba-based method has an 87.5% GPU memory reduction, a 5.36 times speed-up, and much higher accuracy (up to +10.4%) compared with transformer-based counterparts on the MSR-Action3D dataset.

NeuroGauss4D-PCI: 4D Neural Fields and Gaussian Deformation Fields for Point Cloud Interpolation

May 23, 2024

Chaokang Jiang, Dalong Du, Jiuming Liu, Siting Zhu, Zhenqiang Liu, Zhuang Ma, Zhujin Liang, Jie Zhou

Abstract: Point Cloud Interpolation confronts challenges from point sparsity, complex spatiotemporal dynamics, and the difficulty of deriving complete 3D point clouds from sparse temporal information. This paper presents NeuroGauss4D-PCI, which excels at modeling complex non-rigid deformations across varied dynamic scenes. The method begins with an iterative Gaussian cloud soft clustering module, offering structured temporal point cloud representations. The proposed temporal radial basis function Gaussian residual utilizes Gaussian parameter interpolation over time, enabling smooth parameter transitions and capturing temporal residuals of Gaussian distributions. Additionally, a 4D Gaussian deformation field tracks the evolution of these parameters, creating continuous spatiotemporal deformation fields. A 4D neural field transforms low-dimensional spatiotemporal coordinates $(x, y, z, t)$ into a high-dimensional latent space. Finally, we adaptively and efficiently fuse the latent features from neural fields and the geometric features from Gaussian deformation fields. NeuroGauss4D-PCI outperforms existing methods in point cloud frame interpolation, delivering leading performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive), with scalability to auto-labeling and point cloud densification tasks. The source code is released at https://github.com/jiangchaokang/NeuroGauss4D-PCI.

* Under review

3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling

Mar 01, 2024

Chaokang Jiang, Guangming Wang, Jiuming Liu, Hesheng Wang, Zhuang Ma, Zhenqiang Liu, Zhujin Liang, Yi Shan, Dalong Du

Abstract: Learning 3D scene flow from LiDAR point clouds presents significant difficulties, including poor generalization from synthetic datasets to real scenes, scarcity of real-world 3D labels, and poor performance on real sparse LiDAR point clouds. We present a novel approach from the perspective of auto-labelling, aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR point clouds. Specifically, we employ the assumption of rigid body motion to simulate potential object-level rigid movements in autonomous driving scenarios. By updating different motion attributes for multiple anchor boxes, the rigid motion decomposition is obtained for the whole scene. Furthermore, we developed a novel 3D scene flow data augmentation method for global and local motion. By perfectly synthesizing target point clouds based on augmented motion parameters, we easily obtain lots of 3D scene flow labels in point clouds highly consistent with real scenarios. On multiple real-world datasets including LiDAR KITTI, nuScenes, and Argoverse, our method outperforms all previous supervised and unsupervised methods without requiring manual labelling. Impressively, our method achieves a tenfold reduction in EPE3D metric on the LiDAR KITTI dataset, reducing it from $0.190m$ to a mere $0.008m$ error.

* Accepted by CVPR2024! 10 pages, 6 figures
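
The rigid-motion assumption in the 3DSFLabelling abstract implies that, for points moving with one rigid body, the scene flow label is simply the displaced position minus the original one. The sketch below shows that relation for a single rigid motion (a rotation about the vertical axis plus a translation) together with the usual definition of EPE3D as the mean Euclidean error between flow vectors. It is a minimal illustration, not the paper's pipeline: the function names, the yaw-only rotation, and the single-object setup are assumptions, and anchor-box optimisation and augmentation are omitted.

```python
import math


def rigid_flow(points, yaw, translation):
    """Flow induced by a rigid motion: flow = R(yaw) @ p + t - p.

    points      : iterable of (x, y, z) tuples in metres
    yaw         : rotation about the z axis in radians (assumed yaw-only)
    translation : (tx, ty, tz) in metres
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    flows = []
    for x, y, z in points:
        new_x = cos_y * x - sin_y * y + tx
        new_y = sin_y * x + cos_y * y + ty
        new_z = z + tz
        flows.append((new_x - x, new_y - y, new_z - z))
    return flows


def epe3d(predicted, reference):
    """End-point error: mean Euclidean distance between flow vectors."""
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

Because the target cloud is synthesised from known motion parameters, the flow returned by `rigid_flow` can serve directly as a pseudo label, and `epe3d` can score any predicted flow against it.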

DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model

Nov 29, 2023

Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, Hesheng Wang

Abstract: Scene flow estimation, which aims to predict per-point 3D displacements of dynamic scenes, is a fundamental task in the computer vision field. However, previous works commonly suffer from unreliable correlation caused by locally constrained searching ranges, and struggle with accumulated inaccuracy arising from the coarse-to-fine structure. To alleviate these problems, we propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. Iterative diffusion-based refinement is designed to enhance the correlation robustness and resilience to challenging cases, e.g., dynamics, noisy inputs, and repetitive patterns. To restrain the generation diversity, three key flow-related features are leveraged as conditions in our diffusion model. Furthermore, we also develop an uncertainty estimation module within diffusion to evaluate the reliability of estimated scene flow. Our DifFlow3D achieves state-of-the-art performance, with 6.7% and 19.1% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably, our method achieves an unprecedented millimeter-level accuracy (0.0089 m in EPE3D) on the KITTI dataset. Additionally, our diffusion-based refinement paradigm can be readily integrated as a plug-and-play module into existing scene flow networks, significantly increasing their estimation accuracy. Code will be released later.

RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration

Mar 22, 2023

Jiuming Liu, Guangming Wang, Zhe Liu, Chaokang Jiang, Marc Pollefeys, Hesheng Wang

Abstract: Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale registration methods are rarely explored. Challenges mainly arise from the huge number of points, complex distribution, and outliers of outdoor LiDAR scans. In addition, most existing registration works adopt a two-stage paradigm: they first find correspondences by extracting discriminative local features, and then leverage estimators (e.g., RANSAC) to filter outliers, which are highly dependent on well-designed descriptors and post-processing choices. To address these problems, we propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment without any further post-processing. Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers by extracting point features globally. Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes. Furthermore, to effectively reduce mismatches, a bijective association transformer is designed for regressing the initial transformation. Extensive experiments on KITTI and NuScenes datasets demonstrate that our RegFormer achieves state-of-the-art performance in terms of both accuracy and efficiency.

3D Scene Flow Estimation on Pseudo-LiDAR: Bridging the Gap on Estimating Point Motion

Sep 27, 2022

Chaokang Jiang, Guangming Wang, Yanzi Miao, Hesheng Wang

Abstract: 3D scene flow characterizes how the points at the current time flow to the next time in the 3D Euclidean space, which makes it possible to infer autonomously the non-rigid motion of all objects in the scene. Previous methods for estimating scene flow from images are limited in that they split the holistic nature of 3D scene flow by estimating optical flow and disparity separately. Learning 3D scene flow from point clouds also faces the difficulties of the gap between synthesized and real data and the sparsity of LiDAR point clouds. In this paper, the generated dense depth map is utilized to obtain explicit 3D coordinates, which achieves direct learning of 3D scene flow from 2D images. The stability of the predicted scene flow is improved by introducing the dense nature of 2D pixels into the 3D space. Outliers in the generated 3D point cloud are removed by statistical methods to weaken the impact of noisy points on the 3D scene flow estimation task. Disparity consistency loss is proposed to achieve more effective unsupervised learning of 3D scene flow. The proposed method of self-supervised learning of 3D scene flow on real-world images is compared with a variety of methods for learning on the synthesized dataset and learning on LiDAR point clouds. The comparisons of multiple scene flow metrics are shown to demonstrate the effectiveness and superiority of introducing pseudo-LiDAR point cloud to scene flow estimation.

* 9 pages, 5 figures; This paper has been accepted by IEEE Transactions on Industrial Informatics
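
The abstract above says outliers in the generated point cloud are removed "by statistical methods" without naming one. A common choice for this kind of filtering is to drop points whose mean distance to their nearest neighbours is unusually large, and the sketch below shows that idea. Whether the paper uses this exact rule is an assumption; the function name and the `k` and `std_ratio` parameters are hypothetical, and the brute-force neighbour search is only suitable for small clouds.

```python
import math
import statistics


def remove_statistical_outliers(points, k=2, std_ratio=1.0):
    """Drop points whose mean k-nearest-neighbour distance is too large.

    A point is kept when its mean distance to its k nearest neighbours
    is at most (global mean + std_ratio * global standard deviation).
    Requires len(points) > k. Brute force, O(n^2 log n).
    """
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)

    threshold = statistics.mean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= threshold]
```

For example, four points on a unit square plus one point far away yield a cloud in which only the distant point exceeds the threshold and is discarded.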

FFPA-Net: Efficient Feature Fusion with Projection Awareness for 3D Object Detection

Sep 15, 2022

Chaokang Jiang, Guangming Wang, Jinxing Wu, Yanzi Miao, Hesheng Wang

Abstract: Promising complementarity exists between the texture features of color images and the geometric information of LiDAR point clouds. However, many challenges remain for efficient and robust feature fusion in the field of 3D object detection. In this paper, first, unstructured 3D point clouds are filled in the 2D plane and 3D point cloud features are extracted faster using projection-aware convolution layers. Further, the corresponding indexes between different sensor signals are established in advance in the data preprocessing, which enables faster cross-modal feature fusion. To address the misalignment between LiDAR points and image pixels, two new plug-and-play fusion modules, LiCamFuse and BiLiCamFuse, are proposed. In LiCamFuse, soft query weights that perceive the Euclidean distance of bimodal features are proposed. In BiLiCamFuse, a fusion module with dual attention is proposed to deeply correlate the geometric and textural features of the scene. The quantitative results on the KITTI dataset demonstrate that the proposed method achieves better feature-level fusion. In addition, the proposed network shows a shorter running time compared to existing methods.

* 7 pages, 4 figures; under review

Unsupervised Learning of 3D Scene Flow with 3D Odometry Assistance

Sep 11, 2022

Guangming Wang, Zhiheng Feng, Chaokang Jiang, Hesheng Wang

Abstract: Scene flow represents the 3D motion of each point in the scene, which explicitly describes the distance and the direction of each point's movement. Scene flow estimation is used in various applications such as autonomous driving, activity recognition, and virtual reality. Because it is challenging to annotate scene flow with ground truth for real-world data, no real-world dataset is available that provides a large amount of data with ground truth for scene flow estimation. Therefore, many works use synthesized data to pre-train their network and real-world LiDAR data to finetune. Unlike the previous unsupervised learning of scene flow in point clouds, we propose to use odometry information to assist the unsupervised learning of scene flow and use real-world LiDAR data to train our network. Supervised odometry provides more accurate shared cost volume for scene flow. In addition, the proposed network has mask-weighted warp layers to get a more accurate predicted point cloud. The warp operation means applying an estimated pose transformation or scene flow to a source point cloud to obtain a predicted point cloud and is the key to refining scene flow from coarse to fine. When performing warp operations, the points in different states use different weights for the pose transformation and scene flow transformation. We classify the states of points as static, dynamic, and occluded, where the static masks are used to divide static and dynamic points, and the occlusion masks are used to divide occluded points. The mask-weighted warp layer indicates that static masks and occlusion masks are used as weights when performing warp operations. Our designs are proven to be effective in ablation experiments. The experiment results show the promising prospect of an odometry-assisted unsupervised learning method for 3D scene flow in real-world data.

* 12 pages, 9 figures, under review
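
The warp operation described in the abstract above applies either a pose transformation or a scene flow to each source point, with masks deciding the weighting. One simple way to read that is a per-point blend between the two warps, sketched below. The blending formula, the function name `mask_weighted_warp`, and the use of a single static mask in [0, 1] are assumptions for illustration; the occlusion mask and the network that predicts the masks are left out.

```python
def mask_weighted_warp(points, flows, rotation, translation, static_mask):
    """Blend an ego-motion warp and a scene-flow warp per point.

    points, flows : lists of (x, y, z) tuples
    rotation      : 3x3 nested list, translation : (tx, ty, tz)
    static_mask   : weights in [0, 1]; 1 means static (use the pose
                    transform), 0 means dynamic (use the scene flow)
    """
    warped = []
    for p, f, m in zip(points, flows, static_mask):
        # Pose warp: R @ p + t
        posed = tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        # Flow warp: p + flow
        flowed = tuple(p[c] + f[c] for c in range(3))
        # Hypothetical blend of the two warps weighted by the static mask.
        warped.append(tuple(m * posed[c] + (1.0 - m) * flowed[c] for c in range(3)))
    return warped
```

With a mask of 1 a point follows the estimated ego-motion only, with a mask of 0 it follows its own flow, and intermediate values interpolate between the two predictions.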

Pseudo-LiDAR for Visual Odometry

Sep 04, 2022

Huiying Deng, Guangming Wang, Zhiheng Feng, Chaokang Jiang, Xinrui Wu, Yanzi Miao, Hesheng Wang

Abstract: Among existing methods, LiDAR odometry shows superior performance, but visual odometry is still widely used for its price advantage. Conventionally, the task of visual odometry mainly relies on the input of continuous images. However, it is very complicated for the odometry network to learn the epipolar geometry information provided by the images. In this paper, the concept of pseudo-LiDAR is introduced into the odometry to solve this problem. The pseudo-LiDAR point cloud back-projects the depth map generated by the image into the 3D point cloud, which changes the way of image representation. Compared with the stereo images, the pseudo-LiDAR point cloud generated by the stereo matching network can get the explicit 3D coordinates. Since the 6 Degrees of Freedom (DoF) pose transformation occurs in 3D space, the 3D structure information provided by the pseudo-LiDAR point cloud is more direct than the image. Compared with sparse LiDAR, the pseudo-LiDAR has a denser point cloud. In order to make full use of the rich point cloud information provided by the pseudo-LiDAR, a projection-aware dense odometry pipeline is adopted. Most previous LiDAR-based algorithms sampled 8192 points from the point cloud as input to the odometry network. The projection-aware dense odometry pipeline takes all the pseudo-LiDAR point clouds generated from the images except for the error points as the input to the network. While making full use of the 3D geometric information in the images, the semantic information in the images is also used in the odometry task. The fusion of 2D-3D is achieved in an image-only odometry. Experiments on the KITTI dataset prove the effectiveness of our method. To the best of our knowledge, this is the first visual odometry method using pseudo-LiDAR.

* 8 pages, 7 figures
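
The back-projection step mentioned in the abstract above, which turns a depth map into a pseudo-LiDAR point cloud, follows from the standard pinhole camera model. The sketch below shows it for a depth map stored as nested lists. The function name, the intrinsics parameters (`fx`, `fy`, `cx`, `cy`), and the `max_depth` cut-off used to discard invalid pixels are assumptions for illustration; the paper's stereo matching network and its handling of error points are not reproduced.

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, max_depth=80.0):
    """Back-project a depth map into camera-frame 3D points.

    depth        : list of rows, each a list of depths in metres
    fx, fy       : focal lengths in pixels
    cx, cy       : principal point in pixels
    max_depth    : hypothetical cut-off; pixels with non-positive depth
                   or depth beyond it are skipped
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0 or z > max_depth:
                continue
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Every valid pixel produces one 3D point, which is why a pseudo-LiDAR cloud is much denser than a typical LiDAR sweep of the same scene.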
