Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yunsong Zhou

SimGen: Simulator-conditioned Driving Scene Generation

Jun 13, 2024

Yunsong Zhou, Michael Simon, Zhenghao Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou

Figure 1 for SimGen: Simulator-conditioned Driving Scene Generation

Figure 2 for SimGen: Simulator-conditioned Driving Scene Generation

Figure 3 for SimGen: Simulator-conditioned Driving Scene Generation

Figure 4 for SimGen: Simulator-conditioned Driving Scene Generation

Abstract:Controllable synthetic data generation can substantially lower the annotation cost of training data in autonomous driving research and development. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, the trained models can only generate images based on the real-world layout data from the validation set of the same dataset, where overfitting might happen. In this work, we introduce a simulator-conditioned scene generation framework called SimGen that can learn to generate diverse driving scenes by mixing data from the simulator and the real world. It uses a novel cascade diffusion pipeline to address challenging sim-to-real gaps and multi-condition conflicts. A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. SimGen achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator. We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation. Code, data, and models will be made available.

Via

Access Paper or Ask Questions

Embodied Understanding of Driving Scenarios

Mar 07, 2024

Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

Figure 1 for Embodied Understanding of Driving Scenarios

Figure 2 for Embodied Understanding of Driving Scenarios

Figure 3 for Embodied Understanding of Driving Scenarios

Figure 4 for Embodied Understanding of Driving Scenarios

Abstract:Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving and formulate appropriate rubrics. Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans. ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities. Besides, the model employs time-aware token selection to accurately inquire about temporal cues. We instantiate ELM on the reformulated multi-faced benchmark, and it surpasses previous state-of-the-art approaches in all aspects. All code, data, and models will be publicly shared.

* 43 pages, 16 figures

Via

Access Paper or Ask Questions

Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

Mar 06, 2024

Quan Liu, Hongzi Zhu, Zhenxi Wang, Yunsong Zhou, Shan Chang, Minyi Guo

Figure 1 for Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

Figure 2 for Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

Figure 3 for Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

Figure 4 for Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension

Abstract:Registration of point clouds collected from a pair of distant vehicles provides a comprehensive and accurate 3D view of the driving scenario, which is vital for driving safety related applications, yet existing literature suffers from the expensive pose label acquisition and the deficiency to generalize to new data distributions. In this paper, we propose EYOC, an unsupervised distant point cloud registration method that adapts to new point cloud distributions on the fly, requiring no global pose labels. The core idea of EYOC is to train a feature extractor in a progressive fashion, where in each round, the feature extractor, trained with near point cloud pairs, can label slightly farther point cloud pairs, enabling self-supervision on such far point cloud pairs. This process continues until the derived extractor can be used to register distant point clouds. Particularly, to enable high-fidelity correspondence label generation, we devise an effective spatial filtering scheme to select the most representative correspondences to register a point cloud pair, and then utilize the aligned point clouds to discover more correct correspondences. Experiments show that EYOC can achieve comparable performance with state-of-the-art supervised methods at a lower training cost. Moreover, it outwits supervised methods regarding generalization performance on new data distributions.

* In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Via

Access Paper or Ask Questions

Density-invariant Features for Distant Point Cloud Registration

Aug 08, 2023

Quan Liu, Hongzi Zhu, Yunsong Zhou, Hongyang Li, Shan Chang, Minyi Guo

Figure 1 for Density-invariant Features for Distant Point Cloud Registration

Figure 2 for Density-invariant Features for Distant Point Cloud Registration

Figure 3 for Density-invariant Features for Distant Point Cloud Registration

Figure 4 for Density-invariant Features for Distant Point Cloud Registration

Abstract:Registration of distant outdoor LiDAR point clouds is crucial to extending the 3D vision of collaborative autonomous vehicles, and yet is challenging due to small overlapping area and a huge disparity between observed point densities. In this paper, we propose Group-wise Contrastive Learning (GCL) scheme to extract density-invariant geometric features to register distant outdoor LiDAR point clouds. We mark through theoretical analysis and experiments that, contrastive positives should be independent and identically distributed (i.i.d.), in order to train densityinvariant feature extractors. We propose upon the conclusion a simple yet effective training scheme to force the feature of multiple point clouds in the same spatial location (referred to as positive groups) to be similar, which naturally avoids the sampling bias introduced by a pair of point clouds to conform with the i.i.d. principle. The resulting fully-convolutional feature extractor is more powerful and density-invariant than state-of-the-art methods, improving the registration recall of distant scenarios on KITTI and nuScenes benchmarks by 40.9% and 26.9%, respectively. Code is available at https://github.com/liuQuan98/GCL.

* In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Via

Access Paper or Ask Questions

APR: Online Distant Point Cloud Registration Through Aggregated Point Cloud Reconstruction

May 08, 2023

Quan Liu, Yunsong Zhou, Hongzi Zhu, Shan Chang, Minyi Guo

Abstract:For many driving safety applications, it is of great importance to accurately register LiDAR point clouds generated on distant moving vehicles. However, such point clouds have extremely different point density and sensor perspective on the same object, making registration on such point clouds very hard. In this paper, we propose a novel feature extraction framework, called APR, for online distant point cloud registration. Specifically, APR leverages an autoencoder design, where the autoencoder reconstructs a denser aggregated point cloud with several frames instead of the original single input point cloud. Our design forces the encoder to extract features with rich local geometry information based on one single input point cloud. Such features are then used for online distant point cloud registration. We conduct extensive experiments against state-of-the-art (SOTA) feature extractors on KITTI and nuScenes datasets. Results show that APR outperforms all other extractors by a large margin, increasing average registration recall of SOTA extractors by 7.1% on LoKITTI and 4.6% on LoNuScenes. Code is available at https://github.com/liuQuan98/APR.

* in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2023

Via

Access Paper or Ask Questions

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Mar 23, 2023

Yunsong Zhou, Hongzi Zhu, Quan Liu, Shan Chang, Minyi Guo

Figure 1 for MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Figure 2 for MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Figure 3 for MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Figure 4 for MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

Abstract:Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Existing transformer-based offline Mono3D models adopt grid-based vision tokens, which is suboptimal when using coarse tokens due to the limited available computational power. In this paper, we propose an online Mono3D framework, called MonoATT, which leverages a novel vision transformer with heterogeneous tokens of varying shapes and sizes to facilitate mobile Mono3D. The core idea of MonoATT is to adaptively assign finer tokens to areas of more significance before utilizing a transformer to enhance Mono3D. To this end, we first use prior knowledge to design a scoring network for selecting the most important areas of the image, and then propose a token clustering and merging network with an attention mechanism to gradually merge tokens around the selected areas in multiple stages. Finally, a pixel-level feature map is reconstructed from heterogeneous tokens before employing a SOTA Mono3D detector as the underlying detection core. Experiment results on the real-world KITTI dataset demonstrate that MonoATT can effectively improve the Mono3D accuracy for both near and far objects and guarantee low latency. MonoATT yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.

* in the Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Via

Access Paper or Ask Questions

MoGDE: Boosting Mobile Monocular 3D Object Detection with Ground Depth Estimation

Mar 23, 2023

Yunsong Zhou, Quan Liu, Hongzi Zhu, Yunzhe Li, Shan Chang, Minyi Guo

Abstract:Monocular 3D object detection (Mono3D) in mobile settings (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Due to the near-far disparity phenomenon of monocular vision and the ever-changing camera pose, it is hard to acquire high detection accuracy, especially for far objects. Inspired by the insight that the depth of an object can be well determined according to the depth of the ground where it stands, in this paper, we propose a novel Mono3D framework, called MoGDE, which constantly estimates the corresponding ground depth of an image and then utilizes the estimated ground depth information to guide Mono3D. To this end, we utilize a pose detection network to estimate the pose of the camera and then construct a feature map portraying pixel-level ground depth according to the 3D-to-2D perspective geometry. Moreover, to improve Mono3D with the estimated ground depth, we design an RGB-D feature fusion network based on the transformer structure, where the long-range self-attention mechanism is utilized to effectively identify ground-contacting points and pin the corresponding ground depth to the image feature map. We conduct extensive experiments on the real-world KITTI dataset. The results demonstrate that MoGDE can effectively improve the Mono3D accuracy and robustness for both near and far objects. MoGDE yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark.

* 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. arXiv admin note: text overlap with arXiv:2303.13018

Via

Access Paper or Ask Questions

Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

Jun 30, 2021

Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, Qinhong Jiang

Figure 1 for Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

Figure 2 for Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

Figure 3 for Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

Figure 4 for Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

Abstract:Monocular 3D object detection is an important task in autonomous driving. It can be easily intractable where there exists ego-car pose change w.r.t. ground plane. This is common due to the slight fluctuation of road smoothness and slope. Due to the lack of insight in industrial application, existing methods on open datasets neglect the camera pose information, which inevitably results in the detector being susceptible to camera extrinsic parameters. The perturbation of objects is very popular in most autonomous driving cases for industrial products. To this end, we propose a novel method to capture camera pose to formulate the detector free from extrinsic perturbation. Specifically, the proposed framework predicts camera extrinsic parameters by detecting vanishing point and horizon change. A converter is designed to rectify perturbative features in the latent space. By doing so, our 3D detector works independent of the extrinsic parameter variations and produces accurate results in realistic cases, e.g., potholed and uneven roads, where almost all existing monocular detectors fail to handle. Experiments demonstrate our method yields the best performance compared with the other state-of-the-arts by a large margin on both KITTI 3D and nuScenes datasets.

Via

Access Paper or Ask Questions

On Geometric Structure of Activation Spaces in Neural Networks

Apr 02, 2019

Yuting Jia, Haiwen Wang, Shuo Shao, Huan Long, Yunsong Zhou, Xinbing Wang

Figure 1 for On Geometric Structure of Activation Spaces in Neural Networks

Figure 2 for On Geometric Structure of Activation Spaces in Neural Networks

Figure 3 for On Geometric Structure of Activation Spaces in Neural Networks

Figure 4 for On Geometric Structure of Activation Spaces in Neural Networks

Abstract:In this paper, we investigate the geometric structure of activation spaces of fully connected layers in neural networks and then show applications of this study. We propose an efficient approximation algorithm to characterize the convex hull of massive points in high dimensional space. Based on this new algorithm, four common geometric properties shared by the activation spaces are concluded, which gives a rather clear description of the activation spaces. We then propose an alternative classification method grounding on the geometric structure description, which works better than neural networks alone. Surprisingly, this data classification method can be an indicator of overfitting in neural networks. We believe our work reveals several critical intrinsic properties of modern neural networks and further gives a new metric for evaluating them.

Via

Access Paper or Ask Questions