Abstract:Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both the intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised mask modeling pre-training strategy to improve multi-modal representation learning and data efficiency through three novel objectives. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate the efficacy of ProFusion3D. Moreover, ProFusion3D is robust to sensor failure, demonstrating strong performance when only one modality is available.
Abstract:Semantic segmentation models are typically trained on a fixed set of classes, limiting their applicability in open-world scenarios. Class-incremental semantic segmentation aims to update models with emerging new classes while preventing catastrophic forgetting of previously learned ones. However, existing methods impose strict rigidity on old classes, reducing their effectiveness in learning new incremental classes. In this work, we propose Taxonomy-Oriented Poincaré-regularized Incremental-Class Segmentation (TOPICS) that learns feature embeddings in hyperbolic space following explicit taxonomy-tree structures. This supervision provides plasticity for old classes, updating ancestors based on new classes while integrating new classes at fitting positions. Additionally, we maintain implicit class relational constraints on the geometric basis of the Poincaré ball. This ensures that the latent space can continuously adapt to new constraints while maintaining a robust structure to combat catastrophic forgetting. We also establish eight realistic incremental learning protocols for autonomous driving scenarios, where novel classes can originate from known classes or the background. Extensive evaluations of TOPICS on the Cityscapes and Mapillary Vistas 2.0 benchmarks demonstrate that it achieves state-of-the-art performance. We make the code and trained models publicly available at http://topics.cs.uni-freiburg.de.
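The abstract above relies on the geometry of the Poincaré ball without spelling it out. As a minimal sketch, assuming the standard unit ball with curvature -1 (the abstract does not state which curvature or distance TOPICS uses), the geodesic distance between two embeddings can be computed as follows. The function name `poincare_distance` is hypothetical and chosen only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumes curvature -1; this is the textbook formula, not necessarily the
    exact variant used by the authors.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    # The same Euclidean gap of 0.1 is much longer near the boundary than
    # near the origin, which is what leaves room for tree-like hierarchies.
    print(poincare_distance([0.0, 0.0], [0.1, 0.0]))
    print(poincare_distance([0.8, 0.0], [0.9, 0.0]))
```

Because distances grow rapidly towards the boundary, general classes can sit near the origin and their descendants closer to the rim, which is one common motivation for embedding taxonomies in hyperbolic space.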
Abstract:Sensor setups of robotic platforms commonly include both camera and LiDAR as they provide complementary information. However, fusing these two modalities typically requires a highly accurate calibration between them. In this paper, we propose MDPCalib which is a novel method for camera-LiDAR calibration that requires neither human supervision nor any specific target objects. Instead, we utilize sensor motion estimates from visual and LiDAR odometry as well as deep learning-based 2D-pixel-to-3D-point correspondences that are obtained without in-domain retraining. We represent the camera-LiDAR calibration as a graph optimization problem and minimize the costs induced by constraints from sensor motion and point correspondences. In extensive experiments, we demonstrate that our approach yields highly accurate extrinsic calibration parameters and is robust to random initialization. Additionally, our approach generalizes to a wide range of sensor setups, which we demonstrate by employing it on various robotic platforms including a self-driving perception car, a quadruped robot, and a UAV. To make our calibration method publicly accessible, we release the code on our project website at http://calibration.cs.uni-freiburg.de.
Abstract:LiDARs are widely used for mapping and localization in dynamic environments. However, their high cost limits their widespread adoption. On the other hand, monocular localization in LiDAR maps using inexpensive cameras is a cost-effective alternative for large-scale deployment. Nevertheless, most existing approaches struggle to generalize to new sensor setups and environments, requiring retraining or fine-tuning. In this paper, we present CMRNext, a novel approach for camera-LiDAR matching that is independent of sensor-specific parameters, generalizable, and can be used in the wild for monocular localization in LiDAR maps and camera-LiDAR extrinsic calibration. CMRNext exploits recent advances in deep neural networks for matching cross-modal data and standard geometric techniques for robust pose estimation. We reformulate the point-pixel matching problem as an optical flow estimation problem and solve the Perspective-n-Point problem based on the resulting correspondences to find the relative pose between the camera and the LiDAR point cloud. We extensively evaluate CMRNext on six different robotic platforms, including three publicly available datasets and three in-house robots. Our experimental evaluations demonstrate that CMRNext outperforms existing approaches on both tasks and effectively generalizes to previously unseen environments and sensor setups in a zero-shot manner. We make the code and pre-trained models publicly available at http://cmrnext.cs.uni-freiburg.de.
Abstract:Localization is paramount for autonomous robots. While camera and LiDAR-based approaches have been extensively investigated, they are affected by adverse illumination and weather conditions. Therefore, radar sensors have recently gained attention due to their intrinsic robustness to such conditions. In this paper, we propose RaLF, a novel deep neural network-based approach for localizing radar scans in a LiDAR map of the environment, by jointly learning to address both place recognition and metric localization. RaLF is composed of radar and LiDAR feature encoders, a place recognition head that generates global descriptors, and a metric localization head that predicts the 3-DoF transformation between the radar scan and the map. We tackle the place recognition task by learning a shared embedding space between the two modalities via cross-modal metric learning. Additionally, we perform metric localization by predicting pixel-level flow vectors that align the query radar scan with the LiDAR map. We extensively evaluate our approach on multiple real-world driving datasets and show that RaLF achieves state-of-the-art performance for both place recognition and metric localization. Moreover, we demonstrate that our approach can effectively generalize to different cities and sensor setups than the ones used during training. We make the code and trained models publicly available at http://ralf.cs.uni-freiburg.de.
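The abstract above describes predicting pixel-level flow vectors and a 3-DoF transformation, but not how one is turned into the other. The sketch below shows one generic option under stated assumptions: each flow vector is treated as a 2D point correspondence, and the planar rigid transform (x, y, yaw) is fitted in closed form by least squares. This is not claimed to be RaLF's procedure (a robust estimator such as RANSAC could equally be used), and `fit_rigid_2d` is a hypothetical name.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares 2D rigid transform (tx, ty, yaw) mapping src points onto dst.

    src and dst are equal-length sequences of (x, y) pairs, e.g. radar pixels
    and the same pixels displaced by their predicted flow vectors.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matching point pairs")
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(q[0] for q in dst) / n
    cy_d = sum(q[1] for q in dst) / n
    s_dot = 0.0
    s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - cx_s, py - cy_s  # centred source point
        bx, by = qx - cx_d, qy - cy_d  # centred target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    yaw = math.atan2(s_cross, s_dot)
    c, s = math.cos(yaw), math.sin(yaw)
    # The translation aligns the rotated source centroid with the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return tx, ty, yaw


if __name__ == "__main__":
    pixels = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
    flow = [(1.0, 2.0)] * len(pixels)  # toy flow field: pure translation
    moved = [(x + u, y + v) for (x, y), (u, v) in zip(pixels, flow)]
    print(fit_rigid_2d(pixels, moved))
```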
Abstract:Safety and efficiency are paramount in healthcare facilities where the lives of patients are at stake. Despite the adoption of robots to assist medical staff in challenging tasks such as complex surgeries, human expertise is still indispensable. The next generation of autonomous healthcare robots hinges on their capacity to perceive and understand their complex and frenetic environments. While deep learning models are increasingly used for this purpose, they require extensive annotated training data which is impractical to obtain in real-world healthcare settings. To bridge this gap, we present Syn-Mediverse, the first hyper-realistic multimodal synthetic dataset of diverse healthcare facilities. Syn-Mediverse contains over 48,000 images from a simulated industry-standard optical tracking camera and provides more than 1.5M annotations spanning five different scene understanding tasks including depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation. We demonstrate the complexity of our dataset by evaluating the performance on a broad range of state-of-the-art baselines for each task. To further advance research on scene understanding of healthcare facilities, along with the public dataset we provide an online evaluation benchmark available at http://syn-mediverse.cs.uni-freiburg.de.
Abstract:Visual odometry is a fundamental task for many applications on mobile devices and robotic platforms. Since such applications are oftentimes not limited to predefined target domains and learning-based vision systems are known to generalize poorly to unseen environments, methods for continual adaptation during inference time are of significant interest. In this work, we introduce CoVIO for online continual learning of visual-inertial odometry. CoVIO effectively adapts to new domains while mitigating catastrophic forgetting by exploiting experience replay. In particular, we propose a novel sampling strategy to maximize image diversity in a fixed-size replay buffer that targets the limited storage capacity of embedded devices. We further provide an asynchronous version that decouples the odometry estimation from the network weight update step enabling continuous inference in real time. We extensively evaluate CoVIO on various real-world datasets demonstrating that it successfully adapts to new domains while outperforming previous methods. The code of our work is publicly available at http://continual-slam.cs.uni-freiburg.de.
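The abstract above mentions a sampling strategy that maximizes image diversity in a fixed-size replay buffer but gives no details. The following is only an illustrative sketch of that general idea, not CoVIO's actual strategy: every image is assumed to be summarized by a feature vector, and when the buffer overflows, the entry whose nearest neighbour is closest (the most redundant one) is evicted. The class name `DiversityReplayBuffer` and the Euclidean feature distance are assumptions.

```python
import math


class DiversityReplayBuffer:
    """Fixed-size buffer that evicts the most redundant entry when full."""

    def __init__(self, capacity):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        self.items = []  # list of (feature tuple, payload) pairs

    def add(self, feature, payload):
        self.items.append((tuple(feature), payload))
        if len(self.items) <= self.capacity:
            return
        # Over capacity: find the entry with the smallest distance to its
        # nearest neighbour and drop it. O(n^2), fine for small buffers.
        drop = min(range(len(self.items)), key=self._nearest_neighbour_distance)
        del self.items[drop]

    def _nearest_neighbour_distance(self, index):
        feature = self.items[index][0]
        return min(
            math.dist(feature, other)
            for j, (other, _) in enumerate(self.items)
            if j != index
        )


if __name__ == "__main__":
    buffer = DiversityReplayBuffer(capacity=3)
    for name, feature in [("a", [0.0]), ("b", [10.0]), ("c", [20.0]), ("d", [10.1])]:
        buffer.add(feature, name)
    # "b" and "d" are near-duplicates, so only one of them is kept.
    print([payload for _, payload in buffer.items])
```

The quadratic eviction step is acceptable for the small buffers that embedded devices can hold; a larger buffer would call for an approximate nearest-neighbour search.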
Abstract:A key component of graph-based SLAM systems is the ability to detect loop closures in a trajectory to reduce the drift accumulated over time from the odometry. Most LiDAR-based methods achieve this goal by using only the geometric information, disregarding the semantics of the scene. In this work, we introduce PADLoC, a LiDAR-based loop closure detection and registration architecture comprising a shared 3D convolutional feature extraction backbone, a global descriptor head for loop closure detection, and a novel transformer-based head for point cloud matching and registration. We present multiple methods for estimating the point-wise matching confidence based on diversity indices. Additionally, to improve forward-backward consistency, we propose the use of two shared matching and registration heads with their source and target inputs swapped by exploiting that the estimated relative transformations must be inverse of each other. Furthermore, we leverage panoptic information during training in the form of a novel loss function that reframes the matching problem as a classification task in the case of the semantic labels and as a graph connectivity assignment for the instance labels. We perform extensive evaluations of PADLoC on multiple real-world datasets demonstrating that it achieves state-of-the-art performance. The code of our work is publicly available at http://padloc.cs.uni-freiburg.de.
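The forward-backward consistency idea in the abstract above rests on a simple identity: if one head estimates the transformation from scan A to scan B and the other estimates B to A, their product should be the identity. A minimal sketch of such a residual, assuming 4x4 homogeneous matrices and a Frobenius-norm penalty (the abstract does not specify the exact loss PADLoC uses), could look like this. All function names are hypothetical.

```python
import math


def matmul4(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def inverse_consistency_residual(t_fwd, t_bwd):
    """Frobenius norm of (t_fwd @ t_bwd - I); zero when the two are exact inverses."""
    prod = matmul4(t_fwd, t_bwd)
    return math.sqrt(
        sum((prod[i][j] - (1.0 if i == j else 0.0)) ** 2 for i in range(4) for j in range(4))
    )


def yaw_transform(yaw, tx, ty, tz):
    """Homogeneous transform: rotation about the z-axis followed by a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


if __name__ == "__main__":
    forward = yaw_transform(0.4, 1.0, 2.0, 0.5)
    identity = yaw_transform(0.0, 0.0, 0.0, 0.0)
    # A backward estimate that ignores the motion entirely is penalized.
    print(inverse_consistency_residual(forward, identity))
```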
Abstract:While lifelong SLAM addresses the capability of a robot to adapt to changes within a single environment over time, in this paper we introduce the task of continual SLAM. Here, a robot is deployed sequentially in a variety of different environments and has to transfer its knowledge of previously experienced environments to thus far unseen environments, while avoiding catastrophic forgetting. This is particularly relevant in the context of vision-based approaches, where the relevant features vary widely between different environments. We propose a novel approach for solving the continual SLAM problem by introducing CL-SLAM. Our approach consists of a dual-network architecture that handles both short-term adaptation and long-term memory retention by incorporating a replay buffer. Extensive evaluations of CL-SLAM in three different environments demonstrate that it outperforms several baselines inspired by existing continual learning-based visual odometry methods. The code of our work is publicly available at http://continual-slam.cs.uni-freiburg.de.
Abstract:Scene understanding is a pivotal task for autonomous vehicles to safely navigate in the environment. Recent advances in deep learning enable accurate semantic reconstruction of the surroundings from LiDAR data. However, these models encounter a large domain gap when deployed on vehicles equipped with different LiDAR setups, which drastically decreases their performance. Fine-tuning the model for every new setup is infeasible due to the expensive and cumbersome process of recording and manually labeling new data. Unsupervised Domain Adaptation (UDA) techniques are thus essential to fill this domain gap and retain the performance of models on new sensor setups without the need for additional data labeling. In this paper, we propose AdaptLPS, a novel UDA approach for LiDAR panoptic segmentation that leverages task-specific knowledge and accounts for variation in the number of scan lines, mounting position, intensity distribution, and environmental conditions. We tackle the UDA task by employing two complementary domain adaptation strategies, data-based and model-based. While data-based adaptations reduce the domain gap by processing the raw LiDAR scans to resemble the scans in the target domain, model-based techniques guide the network in extracting features that are representative of both domains. Extensive evaluations on three pairs of real-world autonomous driving datasets demonstrate that AdaptLPS outperforms existing UDA approaches by up to 6.41 pp in terms of the PQ score.
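For readers unfamiliar with the PQ score reported above, the sketch below computes the standard panoptic quality for a single class: the summed IoU of matched segment pairs divided by TP + 0.5 FP + 0.5 FN, where a predicted and a ground-truth segment count as matched only if their IoU exceeds 0.5. The overall PQ is usually the mean over classes. The function name and the toy numbers are illustrative assumptions, not values from the paper.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Panoptic quality (PQ) for one class.

    matched_ious holds the IoU of every true-positive match, i.e. every
    predicted/ground-truth segment pair with IoU greater than 0.5.
    """
    for iou in matched_ious:
        if not 0.5 < iou <= 1.0:
            raise ValueError("a true-positive match requires 0.5 < IoU <= 1")
    num_true_positives = len(matched_ious)
    denominator = num_true_positives + 0.5 * num_false_positives + 0.5 * num_false_negatives
    if denominator == 0:
        return 0.0  # class absent from both prediction and ground truth
    return sum(matched_ious) / denominator


if __name__ == "__main__":
    # Toy example: two matched segments, one spurious prediction, one miss.
    pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
    print(f"PQ = {100.0 * pq:.2f}%")
```

A gain stated in "pp" (percentage points) is the absolute difference between two such scores once both are expressed as percentages.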