Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiantao Gao

VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Nov 22, 2024

Haiming Zhang, Wending Zhou, Yiyao Zhu, Xu Yan, Jiantao Gao, Dongfeng Bai, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen Li

Figure 1 for VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Figure 2 for VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Figure 3 for VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Figure 4 for VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Abstract:This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.

Via

Access Paper or Ask Questions

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Jan 16, 2024

Xu Yan, Haiming Zhang, Yingjie Cai, Jingming Guo, Weichao Qiu, Bin Gao, Kaiqiang Zhou, Yue Zhao, Huan Jin, Jiantao Gao(+6 more)

Figure 1 for Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Figure 2 for Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Figure 3 for Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Figure 4 for Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Abstract:The rise of large foundation models, trained on extensive datasets, is revolutionizing the field of AI. Models such as SAM, DALL-E2, and GPT-4 showcase their adaptability by extracting intricate patterns and performing effectively across diverse tasks, thereby serving as potent building blocks for a wide range of AI applications. Autonomous driving, a vibrant front in AI applications, remains challenged by the lack of dedicated vision foundation models (VFMs). The scarcity of comprehensive training data, the need for multi-sensor integration, and the diverse task-specific architectures pose significant obstacles to the development of VFMs in this field. This paper delves into the critical challenge of forging VFMs tailored specifically for autonomous driving, while also outlining future directions. Through a systematic analysis of over 250 papers, we dissect essential techniques for VFM development, including data preparation, pre-training strategies, and downstream task adaptation. Moreover, we explore key advancements such as NeRF, diffusion models, 3D Gaussian Splatting, and world models, presenting a comprehensive roadmap for future research. To empower researchers, we have built and maintained https://github.com/zhanghm1995/Forge_VFM4AD, an open-access repository constantly updated with the latest advancements in forging VFMs for autonomous driving.

* Github Repo: https://github.com/zhanghm1995/Forge_VFM4AD

Via

Access Paper or Ask Questions

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Dec 19, 2023

Haiming Zhang, Xu Yan, Dongfeng Bai, Jiantao Gao, Pan Wang, Bingbing Liu, Shuguang Cui, Zhen Li

Figure 1 for RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Figure 2 for RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Figure 3 for RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Figure 4 for RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Abstract:3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying features or logits alignment, proposed and widely used in bird's-eyeview (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VLMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D benchmark.

* Accepted by AAAI 2024

Via

Access Paper or Ask Questions

Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Oct 09, 2022

Xu Yan, Heshen Zhan, Chaoda Zheng, Jiantao Gao, Ruimao Zhang, Shuguang Cui, Zhen Li

Figure 1 for Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Figure 2 for Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Figure 3 for Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Figure 4 for Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Abstract:Although recent point cloud analysis achieves impressive progress, the paradigm of representation learning from a single modality gradually meets its bottleneck. In this work, we take a step towards more discriminative 3D point cloud representation by fully taking advantages of images which inherently contain richer appearance information, e.g., texture, color, and shade. Specifically, this paper introduces a simple but effective point cloud cross-modality training (PointCMT) strategy, which utilizes view-images, i.e., rendered or projected 2D images of the 3D object, to boost point cloud analysis. In practice, to effectively acquire auxiliary knowledge from view images, we develop a teacher-student framework and formulate the cross modal learning as a knowledge distillation problem. PointCMT eliminates the distribution discrepancy between different modalities through novel feature and classifier enhancement criteria and avoids potential negative transfer effectively. Note that PointCMT effectively improves the point-only representation without architecture modification. Sufficient experiments verify significant gains on various datasets using appealing backbones, i.e., equipped with PointCMT, PointNet++ and PointMLP achieve state-of-the-art performance on two benchmarks, i.e., 94.4% and 86.7% accuracy on ModelNet40 and ScanObjectNN, respectively. Code will be made available at https://github.com/ZhanHeshen/PointCMT.

* To appear in NIPS2022

Via

Access Paper or Ask Questions

2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Jul 10, 2022

Xu Yan, Jiantao Gao, Chaoda Zheng, Chao Zheng, Ruimao Zhang, Shenghui Cui, Zhen Li

Figure 1 for 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Figure 2 for 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Figure 3 for 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Figure 4 for 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Abstract:As camera and LiDAR sensors capture complementary information used in autonomous driving, great efforts have been made to develop semantic segmentation algorithms through multi-modality data fusion. However, fusion-based approaches require paired data, i.e., LiDAR point clouds and camera images with strict point-to-pixel mappings, as the inputs in both training and inference, which seriously hinders their application in practical scenarios. Thus, in this work, we propose the 2D Priors Assisted Semantic Segmentation (2DPASS), a general training scheme, to boost the representation learning on point clouds, by fully taking advantage of 2D images with rich appearance. In practice, by leveraging an auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD), 2DPASS acquires richer semantic and structural information from the multi-modal data, which are then online distilled to the pure 3D network. As a result, equipped with 2DPASS, our baseline shows significant improvement with only point cloud inputs. Specifically, it achieves the state-of-the-arts on two large-scale benchmarks (i.e. SemanticKITTI and NuScenes), including top-1 results in both single and multiple scan(s) competitions of SemanticKITTI.

* ECCV2022

Via

Access Paper or Ask Questions

Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

Aug 10, 2021

Chaoda Zheng, Xu Yan, Jiantao Gao, Weibing Zhao, Wei Zhang, Zhen Li, Shuguang Cui

Figure 1 for Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

Figure 2 for Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

Figure 3 for Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

Figure 4 for Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

Abstract:Current 3D single object tracking approaches track the target based on a feature comparison between the target template and the search area. However, due to the common occlusion in LiDAR scans, it is non-trivial to conduct accurate feature comparisons on severe sparse and incomplete shapes. In this work, we exploit the ground truth bounding box given in the first frame as a strong cue to enhance the feature description of the target object, enabling a more accurate feature comparison in a simple yet effective way. In particular, we first propose the BoxCloud, an informative and robust representation, to depict an object using the point-to-box relation. We further design an efficient box-aware feature fusion module, which leverages the aforementioned BoxCloud for reliable feature matching and embedding. Integrating the proposed general components into an existing model P2B, we construct a superior box-aware tracker (BAT). Experiments confirm that our proposed BAT outperforms the previous state-of-the-art by a large margin on both KITTI and NuScenes benchmarks, achieving a 12.8% improvement in terms of precision while running ~20% faster.

* Accepted by ICCV 2021

Via

Access Paper or Ask Questions

PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

Apr 30, 2021

Weibing Zhao, Xu Yan, Jiantao Gao, Ruimao Zhang, Jiayan Zhang, Zhen Li, Song Wu, Shuguang Cui

Figure 1 for PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

Figure 2 for PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

Figure 3 for PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

Figure 4 for PointLIE: Locally Invertible Embedding for Point Cloud Sampling and Recovery

Abstract:Point Cloud Sampling and Recovery (PCSR) is critical for massive real-time point cloud collection and processing since raw data usually requires large storage and computation. In this paper, we address a fundamental problem in PCSR: How to downsample the dense point cloud with arbitrary scales while preserving the local topology of discarding points in a case-agnostic manner (i.e. without additional storage for point relationship)? We propose a novel Locally Invertible Embedding for point cloud adaptive sampling and recovery (PointLIE). Instead of learning to predict the underlying geometry details in a seemingly plausible manner, PointLIE unifies point cloud sampling and upsampling to one single framework through bi-directional learning. Specifically, PointLIE recursively samples and adjusts neighboring points on each scale. Then it encodes the neighboring offsets of sampled points to a latent space and thus decouples the sampled points and the corresponding local geometric relationship. Once the latent space is determined and that the deep model is optimized, the recovery process could be conducted by passing the recover-pleasing sampled points and a randomly-drawn embedding to the same network through an invertible operation. Such a scheme could guarantee the fidelity of dense point recovery from sampled points. Extensive experiments demonstrate that the proposed PointLIE outperforms state-of-the-arts both quantitatively and qualitatively. Our code is released through https://github.com/zwb0/PointLIE.

* IJCAI 2021
* To appear in IJCAI 2021

Via

Access Paper or Ask Questions

Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

Dec 07, 2020

Xu Yan, Jiantao Gao, Jie Li, Ruimao Zhang, Zhen Li, Rui Huang, Shuguang Cui

Figure 1 for Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

Figure 2 for Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

Figure 3 for Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

Figure 4 for Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion

Abstract:LiDAR point cloud analysis is a core task for 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in the single sweep LiDAR point cloud, the accurate semantic segmentation is non-trivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single sweep point cloud can be achieved by any appealing network and then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames in the LiDAR sequence as supervision, the optimized SSC module has learned the contextual shape priors from sequential LiDAR data, completing the sparse single sweep point cloud to the dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between SS and SSC tasks, i.e., promoting the interaction of incomplete local geometry of point cloud and complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both SemanticKITTI and SemanticPOSS benchmarks, i.e., 4% and 3% improvement correspondingly.

* To appear in AAAI 2021. Codes are available at https://github.com/yanx27/JS3C-Net

Via

Access Paper or Ask Questions