Abstract:3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. Through aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. Code is available at https://github.com/hustvl/GaussTR.
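The abstract above only outlines the training objective. As a rough, hypothetical sketch of how such foundation-model alignment could be set up (module names such as `gauss_transformer`, `splat_render_features`, and `foundation_encoder` are placeholders, not the released GaussTR API):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a GaussTR-style self-supervised training step.
# `gauss_transformer`, `splat_render_features`, and `foundation_encoder`
# are placeholder callables, not the released API.
def training_step(images, cameras, gauss_transformer, splat_render_features,
                  foundation_encoder):
    # Feed-forward prediction of a sparse set of 3D Gaussians
    # (means, scales, rotations, opacities, per-Gaussian features).
    gaussians = gauss_transformer(images, cameras)

    # Splat the predicted Gaussians back into per-pixel feature maps.
    rendered_feats = splat_render_features(gaussians, cameras)    # (B, C, H, W)

    # Target features from a frozen 2D foundation model (e.g. a CLIP-like encoder).
    with torch.no_grad():
        target_feats = foundation_encoder(images)                 # (B, C, H, W)

    # Self-supervised alignment: make the rendered features match the 2D teacher,
    # so no explicit 3D occupancy annotations are required.
    loss = 1.0 - F.cosine_similarity(rendered_feats, target_feats, dim=1).mean()
    return loss
```

Because the supervision comes entirely from the frozen 2D encoder, no 3D occupancy labels are needed, which matches the self-supervised, open-vocabulary setting described in the abstract.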
Abstract:3D scene reconstruction is a foundational problem in computer vision. Despite recent advancements in Neural Implicit Representations (NIR), existing methods often lack editability and compositional flexibility, limiting their use in scenarios requiring high interactivity and object-level manipulation. In this paper, we introduce the Gaussian Object Carver (GOC), a novel, efficient, and scalable framework for object-compositional 3D scene reconstruction. GOC leverages 3D Gaussian Splatting (GS), enriched with monocular geometry priors and multi-view geometry regularization, to achieve high-quality and flexible reconstruction. Furthermore, we propose a zero-shot Object Surface Completion (OSC) model, which uses 3D priors from 3D object data to reconstruct unobserved surfaces, ensuring object completeness even in occluded areas. Experimental results demonstrate that GOC improves reconstruction efficiency and geometric fidelity. It holds promise for advancing the practical application of digital twins in embodied AI, AR/VR, and interactive simulation environments.
Abstract:Recently, 3D Gaussian Splatting (3DGS) has achieved significant performance on indoor surface reconstruction and open-vocabulary segmentation. This paper presents GLS, a unified framework for surface reconstruction and open-vocabulary segmentation based on 3DGS. GLS extends both tasks by exploring the correlation between them. For indoor surface reconstruction, we introduce a surface normal prior as a geometric cue to guide the rendered normal, and use the normal error to optimize the rendered depth. For open-vocabulary segmentation, we employ 2D CLIP features to guide instance features and utilize DEVA masks to enhance their view consistency. Extensive experiments demonstrate the effectiveness of jointly optimizing surface reconstruction and open-vocabulary segmentation, where GLS surpasses state-of-the-art approaches on each task on the MuSHRoom, ScanNet++, and LERF-OVS datasets. Code will be available at https://github.com/JiaxiongQ/GLS.
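As an illustration of the two geometric cues mentioned above, the sketch below supervises the rendered normal with a monocular normal prior and ties the rendered depth to the same prior through normals derived from the depth. The exact form of the losses is an assumption for illustration, not the GLS code.

```python
import torch
import torch.nn.functional as F

def geometry_losses(rendered_normal, rendered_depth, prior_normal, intrinsics):
    """Assumed form of the two geometric terms (not the GLS release):
    (1) supervise the rendered normal with a monocular normal prior;
    (2) constrain the rendered depth through normals derived from it."""
    # (1) Angular error between rendered and prior normals.
    l_normal = (1.0 - F.cosine_similarity(rendered_normal, prior_normal, dim=1)).mean()

    # (2) Back-project depth to 3D points, take finite-difference normals,
    #     and compare them against the same prior.
    B, _, H, W = rendered_depth.shape
    device = rendered_depth.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()       # (3, H, W)
    rays = torch.einsum("bij,jhw->bihw", torch.inverse(intrinsics), pix)  # (B, 3, H, W)
    points = rays * rendered_depth                                        # (B, 3, H, W)
    dx = points[..., :, 1:] - points[..., :, :-1]
    dy = points[..., 1:, :] - points[..., :-1, :]
    n_from_depth = F.normalize(torch.cross(dx[..., :-1, :], dy[..., :, :-1], dim=1), dim=1)
    l_depth = (1.0 - F.cosine_similarity(n_from_depth,
                                         prior_normal[..., :-1, :-1], dim=1)).mean()
    return l_normal + l_depth
```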
Abstract:In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to their inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
Abstract:In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we observe improvements of 3.0\%, 2.2\%, and 7.6\% in mAP, NDS, and AMOTA, reaching 46.9\%, 56.1\%, and 49.0\%, respectively. Our best model achieved 71.9\% NDS and 67.7\% AMOTA on the nuScenes test set. Code will be released at \url{https://github.com/linxuewu/Sparse4D}.
Abstract:Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach, Sparse4Dv2, further enhances the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at \url{https://github.com/linxuewu/Sparse4D}.
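The recurrent formulation described above can be pictured with a minimal sketch (an illustration of the idea, not the released Sparse4Dv2 code): only the previous frame's sparse instance features are kept as memory, so the per-frame cost stays O(1) regardless of sequence length.

```python
import torch

class RecurrentSparseFusion(torch.nn.Module):
    """Illustrative sketch (not the released Sparse4Dv2 code) of recurrent
    temporal fusion: instead of re-sampling features from T past frames,
    only the sparse instance features from the previous frame are propagated,
    giving O(1) cost per frame."""

    def __init__(self, dim=256):
        super().__init__()
        self.fuse = torch.nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, curr_instances, prev_instances=None):
        # curr_instances: (B, N, C) instance features decoded from the current frame.
        # prev_instances: (B, N, C) sparse features carried over from frame t-1
        #                 (assumed already compensated for ego motion).
        if prev_instances is None:
            return curr_instances
        fused, _ = self.fuse(curr_instances, prev_instances, prev_instances)
        return curr_instances + fused


# Frame-by-frame usage: memory holds only the previous frame's sparse features.
# fusion = RecurrentSparseFusion()
# memory = None
# for frame_feats in video_instance_features:      # each (B, N, C)
#     memory = fusion(frame_feats, memory)
```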
Abstract:Bird's-eye-view (BEV) based methods have recently made great progress in the multi-view 3D detection task. Compared with BEV-based methods, sparse-based methods lag behind in performance but still have many non-negligible merits. To push sparse 3D detection further, in this work we introduce a novel method, named Sparse4D, which iteratively refines anchor boxes by sparsely sampling and fusing spatio-temporal features. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected onto multi-view/scale/timestamp image features to sample corresponding features; (2) Hierarchical Feature Fusion: we hierarchically fuse the sampled features across different views/scales, different timestamps, and different keypoints to generate high-quality instance features. In this way, Sparse4D can efficiently and effectively achieve 3D detection without relying on dense view transformation or global attention, and is more friendly to edge-device deployment. Furthermore, we introduce an instance-level depth reweighting module to alleviate the ill-posed issue in 3D-to-2D projection. In experiments, our method outperforms all sparse-based methods and most BEV-based methods on the detection task of the nuScenes dataset.
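For concreteness, a hypothetical sketch of step (1) is shown below: per-anchor 3D keypoints are projected into one camera view and image features are bilinearly sampled at the projected locations. Names and shapes are assumptions for illustration, not the released Sparse4D code.

```python
import torch
import torch.nn.functional as F

def sample_keypoint_features(feat_map, keypoints_3d, proj_matrix):
    """Hypothetical sketch of the sparse sampling idea (not the Sparse4D release):
    project per-anchor 3D keypoints into one camera view and bilinearly sample
    image features at the projected pixel locations.

    feat_map:     (B, C, H, W) feature map of one view/scale/timestamp
    keypoints_3d: (B, N, K, 3) K keypoints for each of N anchors
    proj_matrix:  (B, 3, 4)    camera projection matrix for this view
    """
    B, N, K, _ = keypoints_3d.shape
    ones = torch.ones(B, N, K, 1, device=keypoints_3d.device)
    pts_h = torch.cat([keypoints_3d, ones], dim=-1)                   # (B, N, K, 4)
    uvw = torch.einsum("bij,bnkj->bnki", proj_matrix, pts_h)          # (B, N, K, 3)
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-5)                 # pixel coords

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    H, W = feat_map.shape[-2:]
    grid = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat_map, grid.reshape(B, N * K, 1, 2),
                            align_corners=True)                       # (B, C, N*K, 1)
    return sampled.reshape(B, -1, N, K)                               # (B, C, N, K)
```

Sparse4D repeats this sampling over views, scales, and timestamps and then fuses the results hierarchically, as summarized in point (2) above.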
Abstract:As a critical cue for understanding human intention, human gaze provides a key signal for Human-Computer Interaction (HCI) applications. Appearance-based gaze estimation, which directly regresses the gaze vector from eye images, has made great progress recently based on Convolutional Neural Network (ConvNet) architectures and open-source large-scale gaze datasets. However, encoding model-based knowledge into CNN models to further improve gaze estimation performance remains a topic that needs to be explored. In this paper, we propose HybridGazeNet (HGN), a unified framework that explicitly encodes the geometric eyeball model into the appearance-based CNN architecture. Composed of a multi-branch network and an uncertainty module, HybridGazeNet is trained using a hybridized strategy. Experiments on multiple challenging gaze datasets show that HybridGazeNet has better accuracy and generalization ability compared with existing SOTA methods. The code will be released later.
Abstract:Significant progress has been made in facial landmark detection with the development of Convolutional Neural Networks. The widely used algorithms can be classified into coordinate regression methods and heatmap-based methods. However, the former loses spatial information, resulting in poor performance, while the latter suffers from a large output size or high post-processing complexity. This paper proposes a new solution, Gaussian Vector, to preserve the spatial information as well as reduce the output size and simplify the post-processing. Our method provides novel vector supervision and introduces a Band Pooling Module to convert the heatmap into a pair of vectors for each landmark. This is a simple and effective plug-and-play component. Moreover, a Beyond Box Strategy is proposed to handle landmarks outside the face bounding box. We evaluate our method on 300W, COFW, WFLW, and JD-landmark; the results significantly surpass previous works, demonstrating the effectiveness of our approach.
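As a rough illustration of the Band Pooling idea (the band width and pooling choices here are assumptions, not the paper's exact design), each landmark heatmap can be collapsed into an x-vector and a y-vector:

```python
import torch
import torch.nn.functional as F

class BandPooling(torch.nn.Module):
    """Illustrative sketch of Band Pooling (details such as the band width
    and pooling operators are assumptions, not the paper's exact module):
    each landmark heatmap is collapsed into an x-vector and a y-vector by
    pooling over narrow bands along each axis."""

    def __init__(self, band_width=3):
        super().__init__()
        self.band_width = band_width

    def forward(self, heatmaps):
        # heatmaps: (B, L, H, W), one channel per landmark.
        k = self.band_width
        # Average over a band of rows/columns, then take the max across bands,
        # yielding one vector per axis for each landmark.
        rows = F.avg_pool2d(heatmaps, kernel_size=(k, 1), stride=1)   # (B, L, H-k+1, W)
        cols = F.avg_pool2d(heatmaps, kernel_size=(1, k), stride=1)   # (B, L, H, W-k+1)
        vec_x = rows.max(dim=2).values                                # (B, L, ~W)
        vec_y = cols.max(dim=3).values                                # (B, L, ~H)
        return vec_x, vec_y
```

The landmark coordinate along each axis can then be decoded from the corresponding vector, which keeps spatial information while shrinking the output from an H×W map to roughly H+W values per landmark.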
Abstract:To improve the discriminative and generalization ability of lightweight networks for face recognition, we propose an efficient variable group convolutional network called VarGFaceNet. Variable group convolution was introduced by VarGNet to resolve the conflict between small computational cost and the imbalance of computational intensity inside a block. We employ variable group convolution to design a network that supports large-scale face identification while reducing computational cost and parameters. Specifically, we use a head setting to preserve essential information at the start of the network and propose a particular embedding setting to reduce the parameters of the fully-connected embedding layer. To enhance interpretation ability, we employ an equivalence of angular distillation loss to guide our lightweight network, and we apply recursive knowledge distillation to relieve the discrepancy between the teacher model and the student model. Winning the deepglint-light track of the LFR (2019) challenge demonstrates the effectiveness of our model and approach. An implementation of VarGFaceNet will be released at https://github.com/zma-c-137/VarGFaceNet soon.
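A minimal sketch of the variable group convolution building block described above (the channels-per-group value is an assumption for illustration; this is not the VarGFaceNet release):

```python
import torch.nn as nn

def var_group_conv(in_ch, out_ch, channels_per_group=8, stride=1):
    """Illustrative sketch of variable group convolution: the number of
    channels per group is kept constant (assumed to be 8 here), so the number
    of groups varies with the layer width. Both in_ch and out_ch must be
    divisible by the resulting group count."""
    groups = in_ch // channels_per_group
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=1, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch),
    )
```

Keeping the channels per group fixed means the group count grows with the layer width, which is the balance of computational intensity inside a block that the abstract refers to.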