Abstract:Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K dataset. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling flexible deployment on new datasets. Our analysis, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs in understanding urban traffic scenarios and highlights their potential as an efficient tool for data-driven advancements in autonomous driving.
Abstract:Accurate localization is a critical component of mobile autonomous systems, especially in Global Navigation Satellite Systems (GNSS)-denied environments where traditional methods fail. In such scenarios, environmental sensing is essential for reliable operation. However, approaches such as LiDAR odometry and Simultaneous Localization and Mapping (SLAM) suffer from drift over long distances, especially in the absence of loop closures. Map-based localization offers a robust alternative, but the challenge lies in creating and georeferencing maps without GNSS support. To address this issue, we propose a method for creating georeferenced maps without GNSS by using publicly available data, such as building footprints and surface models derived from sparse aerial scans. Our approach integrates these data with onboard LiDAR scans to produce dense, accurate, georeferenced 3D point cloud maps. By combining an Iterative Closest Point (ICP) scan-to-scan and scan-to-map matching strategy, we achieve high local consistency without suffering from long-term drift. Thus, we eliminate the reliance on GNSS for the creation of georeferenced maps. The results demonstrate that LiDAR-only mapping can produce accurate georeferenced point cloud maps when augmented with existing map priors.
Abstract:Today's autonomous vehicles rely on a multitude of sensors to perceive their environment. To improve the perception or create redundancy, the sensors' alignment relative to each other must be known. With Multi-LiCa, we present a novel approach for this alignment, i.e., calibration. It is an automatic motion- and targetless approach for extrinsic multi LiDAR-to-LiDAR calibration that needs neither additional sensor modalities nor an initial transformation input. We propose a two-step process with feature-based matching for the coarse alignment and a GICP-based fine registration in combination with a cost-based matching strategy. Our approach can be applied to any number of sensors and positions, provided there is a partial overlap between the fields of view of the individual sensors. We show that our pipeline generalizes better to different sensor setups and scenarios and is on par with or better than existing approaches in calibration accuracy. The presented framework is integrated in ROS 2 but can also be used as a standalone application. To build upon our work, our source code is available at: https://github.com/TUMFTM/Multi_LiCa.
Abstract:This paper explores pedestrian trajectory prediction in urban traffic while focusing on both model accuracy and real-world applicability. While promising approaches exist, they are often not publicly available, revolve around pedestrian datasets excluding traffic-related information, or resemble architectures that are either not real-time capable or not robust. To address these limitations, we first introduce a dedicated benchmark based on Argoverse 2, specifically targeting pedestrians in urban settings. Following this, we present Snapshot, a modular, feed-forward neural network that outperforms the current state of the art while utilizing significantly less information. Despite its agent-centric encoding scheme, Snapshot demonstrates scalability, real-time performance, and robustness to varying motion histories. Moreover, by integrating Snapshot into a modular autonomous driving software stack, we showcase its real-world applicability.
Abstract:Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public roads is one of the main duties that requires an accurate perception of the environment. To achieve this, machine learning methods rely on large datasets, but to this day, no such datasets are available for autonomous trucks. In this work, we present MAN TruckScenes, the first multimodal dataset for autonomous trucking. MAN TruckScenes allows the research community to come into contact with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments for the first time. It comprises more than 740 scenes of 20 s each within a multitude of different environmental conditions. The sensor set includes 4 cameras, 6 lidar, 6 radar sensors, 2 IMUs, and a high-precision GNSS. The dataset's 3D bounding boxes were manually annotated and carefully reviewed to achieve a high quality standard. Bounding boxes are available for 27 object classes, 15 attributes, and a range of more than 230 m. The scenes are tagged according to 34 distinct scene tags, and all objects are tracked throughout the scene to promote a wide range of applications. Additionally, MAN TruckScenes is the first dataset to provide 4D radar data with 360° coverage and is thereby the largest radar dataset with annotated 3D bounding boxes. Finally, we provide extensive dataset analysis and baseline results. The dataset, development kit and more are available online.
Abstract:State-of-the-art LiDAR calibration frameworks mainly use non-probabilistic registration methods such as Iterative Closest Point (ICP) and its variants. These methods suffer from biased results due to their pair-wise registration procedure as well as their sensitivity to initialization and parameterization. This often leads to misalignments in the calibration process. Probabilistic registration methods compensate for these drawbacks by specifically modeling the probabilistic nature of the observations. This paper presents GMMCalib, an automatic target-based extrinsic calibration approach for multi-LiDAR systems. Using an implementation of a Gaussian Mixture Model (GMM)-based registration method that allows joint registration of multiple point clouds, this data-driven approach is compared to ICP algorithms. We perform simulation experiments using the digital twin of the EDGAR research vehicle and validate the results in a real-world environment. We also address the local minima problem of local registration methods for extrinsic sensor calibration and use a distance-based metric to evaluate the calibration results. Our results show that an increase in robustness against sensor miscalibrations can be achieved by using GMM-based registration algorithms. The code is open source and available on GitHub.
Abstract:Reliable detection and tracking of surrounding objects are indispensable for comprehensive motion prediction and planning of autonomous vehicles. Due to the limitations of individual sensors, the fusion of multiple sensor modalities is required to improve the overall detection capabilities. Additionally, robust motion tracking is essential for reducing the effect of sensor noise and improving state estimation accuracy. The reliability of the autonomous vehicle software becomes even more relevant in complex, adversarial high-speed scenarios at the vehicle handling limits in autonomous racing. In this paper, we present a modular multi-modal sensor fusion and tracking method for high-speed applications. The method is based on the Extended Kalman Filter (EKF) and is capable of fusing heterogeneous detection inputs to track surrounding objects consistently. A novel delay compensation approach reduces the influence of the perception software latency and outputs an updated object list. It is the first fusion and tracking method validated in high-speed real-world scenarios at the Indy Autonomous Challenge 2021 and the Autonomous Challenge at CES (AC@CES) 2022, proving its robustness and computational efficiency on embedded systems. It does not require any labeled data and achieves position tracking residuals below 0.1 m. The related code is available as open-source software at https://github.com/TUMFTM/FusionTracking.
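The abstract above does not describe how its delay compensation works. As a generic illustration only, the following minimal sketch shows one common way to compensate for a known perception latency: running a Kalman-style prediction step forward over the delay. The one-dimensional constant-velocity model, the function name `compensate_delay`, and all numeric values are assumptions made for this example and are not taken from the paper or its repository.

```python
def compensate_delay(state, cov, delay, process_noise=(0.0, 0.0)):
    """Propagate a (position, velocity) estimate forward by `delay` seconds.

    Assumed model (hypothetical, for illustration): constant velocity,
    F = [[1, delay], [0, 1]], so that x' = F x and P' = F P F^T + Q.
    """
    pos, vel = state
    (p00, p01), (p10, p11) = cov
    q_pos, q_vel = process_noise

    # State prediction: position advances by velocity * delay.
    new_state = (pos + vel * delay, vel)

    # Covariance prediction, written out for the 2x2 case.
    new_cov = (
        (p00 + delay * (p01 + p10) + delay * delay * p11 + q_pos,
         p01 + delay * p11),
        (p10 + delay * p11,
         p11 + q_vel),
    )
    return new_state, new_cov


# Example: an object detected 0.1 s ago at 10 m, moving at 50 m/s.
state = (10.0, 50.0)
cov = ((0.04, 0.0), (0.0, 1.0))
new_state, new_cov = compensate_delay(state, cov, delay=0.1)
```

In this sketch the position uncertainty grows with the delay because the velocity uncertainty is integrated over the latency, which is why shorter perception latencies yield tighter compensated estimates.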
Abstract:While current research and development of autonomous driving primarily focuses on developing new features and algorithms, the transfer from isolated software components into an entire software stack has been covered sparsely. Besides that, due to the complexity of autonomous software stacks and public road traffic, the optimal validation of entire stacks is an open research problem. Our paper targets these two aspects. We present our autonomous research vehicle EDGAR and its digital twin, a detailed virtual duplication of the vehicle. While the vehicle's setup is closely related to the state of the art, its virtual duplication is a valuable contribution as it is crucial for a consistent validation process from simulation to real-world tests. In addition, different development teams can work with the same model, making integration and testing of the software stacks much easier, significantly accelerating the development process. The real and virtual vehicles are embedded in a comprehensive development environment, which is also introduced. All parameters of the digital twin are provided open-source at https://github.com/TUMFTM/edgar_digital_twin.
Abstract:In this paper, the state of the art in the field of pedestrian trajectory prediction is evaluated alongside the constant velocity model (CVM) with respect to its applicability in autonomous vehicles. The evaluation is conducted on the widely-used ETH/UCY dataset where the Average Displacement Error (ADE) and the Final Displacement Error (FDE) are reported. To align with requirements in real-world applications, modifications are made to the input features of the initially proposed models. An ablation study is conducted to examine the influence of the observed motion history on the prediction performance, thereby establishing a better understanding of its impact. Additionally, the inference time of each model is measured to evaluate its scalability when confronted with varying numbers of agents. The results demonstrate that simple models remain competitive when generating single trajectories, and certain features commonly thought of as useful have little impact on the overall performance across different architectures. Based on these findings, recommendations are proposed to guide the future development of trajectory prediction algorithms.
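To make the baseline and metrics named in the abstract above concrete, the following minimal sketch implements a constant velocity model and the two displacement errors. It assumes 2D positions sampled at a fixed time step and takes the velocity from the last two observations; the function names and the toy trajectory are hypothetical and not taken from the paper.

```python
import math


def cvm_predict(history, horizon):
    """Constant velocity model: repeat the last observed per-step displacement."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0  # displacement per time step
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]


def ade(pred, truth):
    """Average Displacement Error: mean Euclidean error over all predicted steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """Final Displacement Error: Euclidean error at the last predicted step."""
    return math.dist(pred[-1], truth[-1])


# Toy example: the pedestrian walks straight, then drifts sideways.
history = [(0.0, 0.0), (1.0, 0.0)]
truth = [(2.0, 0.0), (3.0, 1.0), (4.0, 2.0)]
pred = cvm_predict(history, horizon=3)
ade_value = ade(pred, truth)
fde_value = fde(pred, truth)
```

Here the prediction continues along the x-axis, so the per-step errors are 0 m, 1 m, and 2 m, giving an ADE of 1.0 m and an FDE of 2.0 m.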
Abstract:Reliable perception has to be robust against challenging environmental conditions. Therefore, recent efforts have focused on the use of radar sensors in addition to camera and lidar sensors for perception applications. However, the sparsity of radar point clouds and the poor data availability remain challenging for current perception methods. To address these challenges, a novel graph neural network is proposed that uses not only the information of the points themselves but also the relationships between the points. The model is designed to consider both point features and point-pair features, embedded in the edges of the graph. Furthermore, a general approach for achieving transformation invariance is proposed which is robust against unseen scenarios and also counteracts the limited data availability. The transformation invariance is achieved by an invariant data representation rather than an invariant model architecture, making it applicable to other methods. The proposed RadarGNN model outperforms all previous methods on the RadarScenes dataset. In addition, the effects of different invariances on the object detection and semantic segmentation quality are investigated. The code is made available as open-source software at https://github.com/TUMFTM/RadarGNN.
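The idea of obtaining invariance from the data representation rather than from the model can be illustrated with a small sketch. It uses the Euclidean distance between two points as a point-pair (edge) feature, which does not change when the whole point cloud is rotated and translated. This is only an assumed example of an invariant feature; the abstract does not state which features RadarGNN actually uses, and the function names below are hypothetical.

```python
import math


def pairwise_distances(points):
    """Edge features: Euclidean distance between every pair of 2D points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def rigid_transform(points, angle, tx, ty):
    """Rotate all points by `angle` (radians) and translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


points = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.0)]
moved = rigid_transform(points, angle=0.7, tx=5.0, ty=-2.0)

original_edges = pairwise_distances(points)
moved_edges = pairwise_distances(moved)

# Largest change of any edge feature under the rigid transformation.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(original_edges, moved_edges)
    for a, b in zip(row_a, row_b)
)
```

Because the edge features are identical (up to floating-point error) before and after the transformation, any model consuming only such features sees the same input regardless of the sensor pose, without requiring an invariant architecture.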