Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jieqi Shi

GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation

Sep 30, 2024

Yangtao Chen, Zixuan Chen, Junhui Yin, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yang Gao

Abstract:Robots' ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in understanding novel tasks, thereby mitigating this issue. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, often leading to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, allowing auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose Discovery to identify key sub-goals from demonstrations. Inference differs from training, as there are no demonstrations available, so we use pre-trained foundation models to bridge the gap and identify sub-goals for the current task. In both phases, GravMaps are generated from sub-goals, providing flexible 3D spatial guidance compared to fixed 3D positions. Empirical evaluations on RLBench show that GravMAD significantly outperforms state-of-the-art methods, with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. These results demonstrate GravMAD's strong multi-task learning and generalization in 3D manipulation. Video demonstrations are available at: https://gravmad.github.io.

* Under review

Via

Access Paper or Ask Questions

VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Jun 26, 2024

Hao Li, Jingfeng Li, Dingwen Zhang, Chenming Wu, Jieqi Shi, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

Figure 1 for VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Figure 2 for VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Figure 3 for VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Figure 4 for VDG: Vision-Only Dynamic Gaussian for Driving Simulation

Abstract:Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views. Existing methods, however, heavily rely on pre-computed poses and Gaussian initialization by Structure from Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised VO into our pose-free dynamic Gaussian method (VDG) to boost pose and depth initialization and static-dynamic decomposition. Moreover, VDG can work with only RGB image input and construct dynamic scenes at a faster speed and larger scenes compared with the pose-free dynamic view-synthesis method. We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over the state-of-the-art dynamic view synthesis methods. Additional video and source code will be posted on our project page at https://3d-aigc.github.io/VDG.

Via

Access Paper or Ask Questions

FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models

Feb 07, 2024

Chuhao Liu, Ke Wang, Jieqi Shi, Zhijian Qiao, Shaojie Shen

Abstract:Semantic mapping based on the supervised object detectors is sensitive to image distribution. In real-world environments, the object detection and segmentation performance can lead to a major drop, preventing the use of semantic mapping in a wider domain. On the other hand, the development of vision-language foundation models demonstrates a strong zero-shot transferability across data distribution. It provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping from object detection generated from foundation models. We propose a probabilistic label fusion method to predict close-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our work incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method in ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task. It outperforms the traditional semantic mapping method significantly.

* Accepted by IEEE RA-L

Via

Access Paper or Ask Questions

Are All Point Clouds Suitable for Completion? Weakly Supervised Quality Evaluation Network for Point Cloud Completion

Mar 03, 2023

Jieqi Shi, Peiliang Li, Xiaozhi Chen, Shaojie Shen

Abstract:In the practical application of point cloud completion tasks, real data quality is usually much worse than the CAD datasets used for training. A small amount of noisy data will usually significantly impact the overall system's accuracy. In this paper, we propose a quality evaluation network to score the point clouds and help judge the quality of the point cloud before applying the completion model. We believe our scoring method can help researchers select more appropriate point clouds for subsequent completion and reconstruction and avoid manual parameter adjustment. Moreover, our evaluation model is fast and straightforward and can be directly inserted into any model's training or use process to facilitate the automatic selection and post-processing of point clouds. We propose a complete dataset construction and model evaluation method based on ShapeNet. We verify our network using detection and flow estimation tasks on KITTI, a real-world dataset for autonomous driving. The experimental results show that our model can effectively distinguish the quality of point clouds and help in practical tasks.

* ICRA 2023

Via

Access Paper or Ask Questions

Efficient Implicit Neural Reconstruction Using LiDAR

Feb 28, 2023

Dongyu Yan, Xiaoyang Lyu, Jieqi Shi, Yi Lin

Figure 1 for Efficient Implicit Neural Reconstruction Using LiDAR

Figure 2 for Efficient Implicit Neural Reconstruction Using LiDAR

Figure 3 for Efficient Implicit Neural Reconstruction Using LiDAR

Figure 4 for Efficient Implicit Neural Reconstruction Using LiDAR

Abstract:Modeling scene geometry using implicit neural representation has revealed its advantages in accuracy, flexibility, and low memory usage. Previous approaches have demonstrated impressive results using color or depth images but still have difficulty handling poor light conditions and large-scale scenes. Methods taking global point cloud as input require accurate registration and ground truth coordinate labels, which limits their application scenarios. In this paper, we propose a new method that uses sparse LiDAR point clouds and rough odometry to reconstruct fine-grained implicit occupancy field efficiently within a few minutes. We introduce a new loss function that supervises directly in 3D space without 2D rendering, avoiding information loss. We also manage to refine poses of input frames in an end-to-end manner, creating consistent geometry without global point cloud registration. As far as we know, our method is the first to reconstruct implicit scene representation from LiDAR-only input. Experiments on synthetic and real-world datasets, including indoor and outdoor scenes, prove that our method is effective, efficient, and accurate, obtaining comparable results with existing methods using dense input.

* 6+2 pages, 8 figures, Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA) 2023

Via

Access Paper or Ask Questions

You Only Label Once: 3D Box Adaptation from Point Cloud to Image via Semi-Supervised Learning

Nov 17, 2022

Jieqi Shi, Peiliang Li, Xiaozhi Chen, Shaojie Shen

Abstract:The image-based 3D object detection task expects that the predicted 3D bounding box has a ``tightness'' projection (also referred to as cuboid), which fits the object contour well on the image while still keeping the geometric attribute on the 3D space, e.g., physical dimension, pairwise orthogonal, etc. These requirements bring significant challenges to the annotation. Simply projecting the Lidar-labeled 3D boxes to the image leads to non-trivial misalignment, while directly drawing a cuboid on the image cannot access the original 3D information. In this work, we propose a learning-based 3D box adaptation approach that automatically adjusts minimum parameters of the 360$^{\circ}$ Lidar 3D bounding box to perfectly fit the image appearance of panoramic cameras. With only a few 2D boxes annotation as guidance during the training phase, our network can produce accurate image-level cuboid annotations with 3D properties from Lidar boxes. We call our method ``you only label once'', which means labeling on the point cloud once and automatically adapting to all surrounding cameras. As far as we know, we are the first to focus on image-level cuboid refinement, which balances the accuracy and efficiency well and dramatically reduces the labeling effort for accurate cuboid annotation. Extensive experiments on the public Waymo and NuScenes datasets show that our method can produce human-level cuboid annotation on the image without needing manual adjustment.

Via

Access Paper or Ask Questions

Temporal Point Cloud Completion with Pose Disturbance

Feb 07, 2022

Jieqi Shi, Lingyun Xu, Peiliang Li, Xiaozhi Chen, Shaojie Shen

Figure 1 for Temporal Point Cloud Completion with Pose Disturbance

Figure 2 for Temporal Point Cloud Completion with Pose Disturbance

Figure 3 for Temporal Point Cloud Completion with Pose Disturbance

Figure 4 for Temporal Point Cloud Completion with Pose Disturbance

Abstract:Point clouds collected by real-world sensors are always unaligned and sparse, which makes it hard to reconstruct the complete shape of object from a single frame of data. In this work, we manage to provide complete point clouds from sparse input with pose disturbance by limited translation and rotation. We also use temporal information to enhance the completion model, refining the output with a sequence of inputs. With the help of gated recovery units(GRU) and attention mechanisms as temporal units, we propose a point cloud completion framework that accepts a sequence of unaligned and sparse inputs, and outputs consistent and aligned point clouds. Our network performs in an online manner and presents a refined point cloud for each frame, which enables it to be integrated into any SLAM or reconstruction pipeline. As far as we know, our framework is the first to utilize temporal information and ensure temporal consistency with limited transformation. Through experiments in ShapeNet and KITTI, we prove that our framework is effective in both synthetic and real-world datasets.

* 8 pages; Accepted by RAL with ICRA 2022

Via

Access Paper or Ask Questions

Graph-Guided Deformation for Point Cloud Completion

Nov 11, 2021

Jieqi Shi, Lingyun Xu, Liang Heng, Shaojie Shen

Figure 1 for Graph-Guided Deformation for Point Cloud Completion

Figure 2 for Graph-Guided Deformation for Point Cloud Completion

Figure 3 for Graph-Guided Deformation for Point Cloud Completion

Figure 4 for Graph-Guided Deformation for Point Cloud Completion

Abstract:For a long time, the point cloud completion task has been regarded as a pure generation task. After obtaining the global shape code through the encoder, a complete point cloud is generated using the shape priorly learnt by the networks. However, such models are undesirably biased towards prior average objects and inherently limited to fit geometry details. In this paper, we propose a Graph-Guided Deformation Network, which respectively regards the input data and intermediate generation as controlling and supporting points, and models the optimization guided by a graph convolutional network(GCN) for the point cloud completion task. Our key insight is to simulate the least square Laplacian deformation process via mesh deformation methods, which brings adaptivity for modeling variation in geometry details. By this means, we also reduce the gap between the completion task and the mesh deformation algorithms. As far as we know, we are the first to refine the point cloud completion task by mimicing traditional graphics algorithms with GCN-guided deformation. We have conducted extensive experiments on both the simulated indoor dataset ShapeNet, outdoor dataset KITTI, and our self-collected autonomous driving dataset Pandar40. The results show that our method outperforms the existing state-of-the-art algorithms in the 3D point cloud completion task.

* RAL with IROS 2021

Via

Access Paper or Ask Questions

Tracking from Patterns: Learning Corresponding Patterns in Point Clouds for 3D Object Tracking

Oct 20, 2020

Jieqi Shi, Peiliang Li, Shaojie Shen

Figure 1 for Tracking from Patterns: Learning Corresponding Patterns in Point Clouds for 3D Object Tracking

Figure 2 for Tracking from Patterns: Learning Corresponding Patterns in Point Clouds for 3D Object Tracking

Figure 3 for Tracking from Patterns: Learning Corresponding Patterns in Point Clouds for 3D Object Tracking

Abstract:A robust 3D object tracker which continuously tracks surrounding objects and estimates their trajectories is key for self-driving vehicles. Most existing tracking methods employ a tracking-by-detection strategy, which usually requires complex pair-wise similarity computation and neglects the nature of continuous object motion. In this paper, we propose to directly learn 3D object correspondences from temporal point cloud data and infer the motion information from correspondence patterns. We modify the standard 3D object detector to process two lidar frames at the same time and predict bounding box pairs for the association and motion estimation tasks. We also equip our pipeline with a simple yet effective velocity smoothing module to estimate consistent object motion. Benifiting from the learned correspondences and motion refinement, our method exceeds the existing 3D tracking methods on both the KITTI and larger scale Nuscenes dataset.

* ECCV2020 Workshop on Perception for Autonomous Driving(PAD2020)
* 4 pages, ECCV2020 Workshop on Perception for Autonomous Driving(PAD2020)

Via

Access Paper or Ask Questions

Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking

Apr 20, 2020

Peiliang Li, Jieqi Shi, Shaojie Shen

Figure 1 for Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking

Figure 2 for Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking

Figure 3 for Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking

Figure 4 for Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking

Abstract:Directly learning multiple 3D objects motion from sequential images is difficult, while the geometric bundle adjustment lacks the ability to localize the invisible object centroid. To benefit from both the powerful object understanding skill from deep neural network meanwhile tackle precise geometry modeling for consistent trajectory estimation, we propose a joint spatial-temporal optimization-based stereo 3D object tracking method. From the network, we detect corresponding 2D bounding boxes on adjacent images and regress an initial 3D bounding box. Dense object cues (local depth and local coordinates) that associating to the object centroid are then predicted using a region-based network. Considering both the instant localization accuracy and motion consistency, our optimization models the relations between the object centroid and observed cues into a joint spatial-temporal error function. All historic cues will be summarized to contribute to the current estimation by a per-frame marginalization strategy without repeated computation. Quantitative evaluation on the KITTI tracking dataset shows our approach outperforms previous image-based 3D tracking methods by significant margins. We also report extensive results on multiple categories and larger datasets (KITTI raw and Argoverse Tracking) for future benchmarking.

* cvpr2020

Via

Access Paper or Ask Questions