Abstract:While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them.
Abstract:Having prior knowledge of an environment boosts the localization and mapping accuracy of robots. Several approaches in the literature have utilized architectural plans in this regard. However, almost all of them overlook the deviations between actual as-built environments and as-planned architectural designs, introducing bias in the estimations. To address this issue, we present a novel localization and mapping method denoted as deviations-informed Situational Graphs or diS-Graphs that integrates prior knowledge from architectural plans even in the presence of deviations. It is based on Situational Graphs (S-Graphs) that merge geometric models of the environment with 3D scene graphs into a multi-layered jointly optimizable factor graph. Our diS-Graph extracts information from architectural plans by first modeling them as a hierarchical factor graph, which we will call an Architectural Graph (A-Graph). While the robot explores the real environment, it estimates an S-Graph from its onboard sensors. We then use a novel matching algorithm to register the A-Graph and S-Graph in the same reference, and merge both of them with an explicit model of deviations. Finally, an alternating graph optimization strategy allows simultaneous global localization and mapping, as well as deviation estimation between both the A-Graph and the S-Graph. We perform several experiments in simulated and real datasets in the presence of deviations. On average, our diS-Graphs outperforms the baselines by a margin of approximately 43% in simulated environments and by 7% in real environments, while being able to estimate deviations up to 35 cm and 15 degrees.
Abstract:We propose three novel metrics for evaluating the accuracy of a set of estimated camera poses given the ground truth: Translation Alignment Score (TAS), Rotation Alignment Score (RAS), and Pose Alignment Score (PAS). The TAS evaluates the translation accuracy independently of the rotations, and the RAS evaluates the rotation accuracy independently of the translations. The PAS is the average of the two scores, evaluating the combined accuracy of both translations and rotations. The TAS is computed in four steps: (1) Find the upper quartile of the closest-pair-distances, $d$. (2) Align the estimated trajectory to the ground truth using a robust registration method. (3) Collect all distance errors and obtain the cumulative frequencies for multiple thresholds ranging from $0.01d$ to $d$ with a resolution $0.01d$. (4) Add up these cumulative frequencies and normalize them such that the theoretical maximum is 1. The TAS has practical advantages over the existing metrics in that (1) it is robust to outliers and collinear motion, and (2) there is no need to adjust parameters on different datasets. The RAS is computed in a similar manner to the TAS and is also shown to be more robust against outliers than the existing rotation metrics. We verify our claims through extensive simulations and provide in-depth discussion of the strengths and weaknesses of the proposed metrics.
Abstract:Visual Place Recognition (VPR) plays a critical role in many localization and mapping pipelines. It consists of retrieving the closest sample to a query image, in a certain embedding space, from a database of geotagged references. The image embedding is learned to effectively describe a place despite variations in visual appearance, viewpoint, and geometric changes. In this work, we formulate how limitations in the Geographic Distance Sensitivity of current VPR embeddings result in a high probability of incorrectly sorting the top-k retrievals, negatively impacting the recall. In order to address this issue in single-stage VPR, we propose a novel mining strategy, CliqueMining, that selects positive and negative examples by sampling cliques from a graph of visually similar images. Our approach boosts the sensitivity of VPR embeddings at small distance ranges, significantly improving the state of the art on relevant benchmarks. In particular, we raise recall@1 from 75% to 82% in MSLS Challenge, and from 76% to 90% in Nordland. Models and code are available at https://github.com/serizba/cliquemining.
Abstract:3D Gaussian Splatting has emerged as a very promising scene representation, achieving state-of-the-art quality in novel view synthesis significantly faster than competing alternatives. However, its use of spherical harmonics to represent scene colors limits the expressivity of 3D Gaussians and, as a consequence, the capability of the representation to generalize as we move away from the training views. In this paper, we propose to encode the color information of 3D Gaussians into per-Gaussian feature vectors, which we denote as Feature Splatting (FeatSplat). To synthesize a novel view, Gaussians are first "splatted" into the image plane, then the corresponding feature vectors are alpha-blended, and finally the blended vector is decoded by a small MLP to render the RGB pixel values. To further inform the model, we concatenate a camera embedding to the blended feature vector, to condition the decoding also on the viewpoint information. Our experiments show that these novel model for encoding the radiance considerably improves novel view synthesis for low overlap views that are distant from the training views. Finally, we also show the capacity and convenience of our feature vector representation, demonstrating its capability not only to generate RGB values for novel views, but also their per-pixel semantic labels. We will release the code upon acceptance. Keywords: Gaussian Splatting, Novel View Synthesis, Feature Splatting
Abstract:In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multi-frame optimization or to marginalize older data, thus effectively utilizing past measurements. To assess the performance of our method, we conducted evaluations using both synthetic data from the ICL-NUIM dataset and real data sequences from the OpenLORIS-Scene dataset. Our results show that the fusion of these two sensors enhances the accuracy of camera motion estimation when compared to using only visual data.
Abstract:Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features) that 1) leverages multi-modality by cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that re-orders based on local feature matching the top-k candidates retrieved using a global representation. Our experiments, particularly on sequences captured on a planetary-analogous environment, show that UMF outperforms significantly previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. Code and models are available at https://github.com/DLR-RM/UMF
Abstract:We propose a robust method for point cloud registration that can handle both unknown scales and extreme outlier ratios. Our method, dubbed PCR-99, uses a deterministic 3-point sampling approach with two novel mechanisms that significantly boost the speed: (1) an improved ordering of the samples based on pairwise scale consistency, prioritizing the point correspondences that are more likely to be inliers, and (2) an efficient outlier rejection scheme based on triplet scale consistency, prescreening bad samples and reducing the number of hypotheses to be tested. Our evaluation shows that, up to 98% outlier ratio, the proposed method achieves comparable performance to the state of the art. At 99% outlier ratio, however, it outperforms the state of the art for both known-scale and unknown-scale problems. Especially for the latter, we observe a clear superiority in terms of robustness and speed.
Abstract:Estimating the relative camera pose from $n \geq 5$ correspondences between two calibrated views is a fundamental task in computer vision. This process typically involves two stages: 1) estimating the essential matrix between the views, and 2) disambiguating among the four candidate relative poses that satisfy the epipolar geometry. In this paper, we demonstrate a novel approach that, for the first time, bypasses the second stage. Specifically, we show that it is possible to directly estimate the correct relative camera pose from correspondences without needing a post-processing step to enforce the cheirality constraint on the correspondences. Building on recent advances in certifiable non-minimal optimization, we frame the relative pose estimation as a Quadratically Constrained Quadratic Program (QCQP). By applying the appropriate constraints, we ensure the estimation of a camera pose that corresponds to a valid 3D geometry and that is globally optimal when certified. We validate our method through exhaustive synthetic and real-world experiments, confirming the efficacy, efficiency and accuracy of the proposed approach. Code is available at https://github.com/javrtg/C2P.
Abstract:The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.