Abstract:Neural Radiance Fields (NeRFs) have revolutionized the field of novel view synthesis, demonstrating remarkable performance. However, the modeling and rendering of reflective objects remain challenging problems. Recent methods have shown significant improvements over the baselines in handling reflective scenes, albeit at the expense of efficiency. In this work, we aim to strike a balance between efficiency and quality. To this end, we investigate an implicit-explicit approach based on conventional volume rendering to enhance the reconstruction quality and accelerate the training and rendering processes. We adopt an efficient density-based grid representation and reparameterize the reflected radiance in our pipeline. Our proposed reflection-aware approach achieves a competitive quality efficiency trade-off compared to competing methods. Based on our experimental results, we propose and discuss hypotheses regarding the factors influencing the results of density-based methods for reconstructing reflective objects. The source code is available at https://github.com/gkouros/ref-dvgo.
Abstract:Most 6-DoF localization and SLAM systems use static landmarks but ignore dynamic objects because they cannot be usefully incorporated into a typical pipeline. Where dynamic objects have been incorporated, typical approaches have attempted relatively sophisticated identification and localization of these objects, limiting their robustness or general utility. In this research, we propose a middle ground, demonstrated in the context of autonomous vehicles, using dynamic vehicles to provide limited pose constraint information in a 6-DoF frame-by-frame PnP-RANSAC localization pipeline. We refine initial pose estimates with a motion model and propose a method for calculating the predicted quality of future pose estimates, triggered based on whether or not the autonomous vehicle's motion is constrained by the relative frame-to-frame location of dynamic vehicles in the environment. Our approach detects and identifies suitable dynamic vehicles to define these pose constraints to modify a pose filter, resulting in improved recall across a range of localization tolerances from $0.25m$ to $5m$, compared to a state-of-the-art baseline single image PnP method and its vanilla pose filtering. Our constraint detection system is active for approximately $35\%$ of the time on the Ford AV dataset and localization is particularly improved when the constraint detection is active.
Abstract:Can knowing where you are assist in perceiving objects in your surroundings, especially under adverse weather and lighting conditions? In this work we investigate whether a prior map can be leveraged to aid in the detection of dynamic objects in a scene without the need for a 3D map or pixel-level map-query correspondences. We contribute an algorithm which refines an initial set of candidate object detections and produces a refined subset of highly accurate detections using a prior map. We begin by using visual place recognition (VPR) to retrieve a reference map image for a given query image, then use a binary classification neural network that compares the query and mapping image regions to validate the query detection. Once our classification network is trained, on approximately 1000 query-map image pairs, it is able to improve the performance of vehicle detection when combined with an existing off-the-shelf vehicle detector. We demonstrate our approach using standard datasets across two cities (Oxford and Zurich) under different settings of train-test separation of map-query traverse pairs. We further emphasize the performance gains of our approach against alternative design choices and show that VPR suffices for the task, eliminating the need for precise ground truth localization.
Abstract:Recently, two-stage Deformable DETR introduced the query-based two-stage head, a new type of two-stage head different from the region-based two-stage heads of classical detectors as Faster R-CNN. In query-based two-stage heads, the second stage selects one feature per detection, called the query, as opposed to pooling a rectangular grid of features as in region-based detectors. In this work, we further improve the query-based head from Deformable DETR, significantly speeding up the convergence while increasing its performance. This is achieved by incorporating classical techniques such as anchor generation within the query-based paradigm. By combining the best of both the classical and the query-based worlds, our FQDet head peaks at 45.4 AP on the 2017 COCO validation set when using a ResNet-50+TPN backbone, only after training for 12 epochs using the 1x schedule. We outperform other high-performing two-stage heads such as e.g. Cascade R-CNN, while using the same backbone and while often being computationally cheaper. Additionally, when using the large ResNeXt-101-DCN+TPN backbone and multi-scale testing, our FQDet head achieves 52.9 AP on the 2017 COCO test-dev set after only 12 epochs of training. Code will be released.
Abstract:Pose estimation is usually tackled as either a bin classification problem or as a regression problem. In both cases, the idea is to directly predict the pose of an object. This is a non-trivial task because of appearance variations of similar poses and similarities between different poses. Instead, we follow the key idea that it is easier to compare two poses than to estimate them. Render-and-compare approaches have been employed to that end, however, they tend to be unstable, computationally expensive, and slow for real-time applications. We propose doing category-level pose estimation by learning an alignment metric using a contrastive loss with a dynamic margin and a continuous pose-label space. For efficient inference, we use a simple real-time image retrieval scheme with a reference set of renderings projected to an embedding space. To achieve robustness to real-world conditions, we employ synthetic occlusions, bounding box perturbations, and appearance augmentations. Our approach achieves state-of-the-art performance on PASCAL3D and OccludedPASCAL3D, as well as high-quality results on KITTI3D.
Abstract:This work describes the automatic registration of a large network (approximately 40) of fixed, ceiling-mounted environment cameras spread over a large area (approximately 800 squared meters) using a mobile calibration robot equipped with a single upward-facing fisheye camera and a backlit ArUco marker for easy detection. The fisheye camera is used to do visual odometry (VO), and the ArUco marker facilitates easy detection of the calibration robot in the environment cameras. In addition, the fisheye camera is also able to detect the environment cameras. This two-way, bidirectional detection constrains the pose of the environment cameras to solve an optimization problem. Such an approach can be used to automatically register a large-scale multi-camera system used for surveillance, automated parking, or robotic applications. This VO based multicamera registration method is extensively validated using real-world experiments, and also compared against a similar approach which uses an LiDAR - an expensive, heavier and power hungry sensor.
Abstract:6-DoF visual localization systems utilize principled approaches rooted in 3D geometry to perform accurate camera pose estimation of images to a map. Current techniques use hierarchical pipelines and learned 2D feature extractors to improve scalability and increase performance. However, despite gains in typical recall@0.25m type metrics, these systems still have limited utility for real-world applications like autonomous vehicles because of their `worst' areas of performance - the locations where they provide insufficient recall at a certain required error tolerance. Here we investigate the utility of using `place specific configurations', where a map is segmented into a number of places, each with its own configuration for modulating the pose estimation step, in this case selecting a camera within a multi-camera system. On the Ford AV benchmark dataset, we demonstrate substantially improved worst-case localization performance compared to using off-the-shelf pipelines - minimizing the percentage of the dataset which has low recall at a certain error tolerance, as well as improved overall localization performance. Our proposed approach is particularly applicable to the crowdsharing model of autonomous vehicle deployment, where a fleet of AVs are regularly traversing a known route.
Abstract:We propose a novel and pragmatic framework for traffic scene perception with roadside cameras. The proposed framework covers a full-stack of roadside perception pipeline for infrastructure-assisted autonomous driving, including object detection, object localization, object tracking, and multi-camera information fusion. Unlike previous vision-based perception frameworks rely upon depth offset or 3D annotation at training, we adopt a modular decoupling design and introduce a landmark-based 3D localization method, where the detection and localization can be well decoupled so that the model can be easily trained based on only 2D annotations. The proposed framework applies to either optical or thermal cameras with pinhole or fish-eye lenses. Our framework is deployed at a two-lane roundabout located at Ellsworth Rd. and State St., Ann Arbor, MI, USA, providing 7x24 real-time traffic flow monitoring and high-precision vehicle trajectory extraction. The whole system runs efficiently on a low-power edge computing device with all-component end-to-end delay of less than 20ms.
Abstract:Engineering design is traditionally performed by hand: an expert makes design proposals based on past experience, and these proposals are then tested for compliance with certain target specifications. Testing for compliance is performed first by computer simulation using what is called a discipline model. Such a model can be implemented by a finite element analysis, multibody systems approach, etc. Designs passing this simulation are then considered for physical prototyping. The overall process may take months, and is a significant cost in practice. We have developed a Bayesian optimization system for partially automating this process by directly optimizing compliance with the target specification with respect to the design parameters. The proposed method is a general framework for computing a generalized inverse of a high-dimensional non-linear function that does not require e.g. gradient information, which is often unavailable from discipline models. We furthermore develop a two-tier convergence criterion based on (i) convergence to a solution optimally satisfying all specified design criteria, or (ii) convergence to a minimum-norm solution. We demonstrate the proposed approach on a vehicle chassis design problem motivated by an industry setting using a state-of-the-art commercial discipline model. We show that the proposed approach is general, scalable, and efficient, and that the novel convergence criteria can be implemented straightforwardly based on existing concepts and subroutines in popular Bayesian optimization software packages.
Abstract:Uncertainty pervades through the modern robotic autonomy stack, with nearly every component (e.g., sensors, detection, classification, tracking, behavior prediction) producing continuous or discrete probabilistic distributions. Trajectory forecasting, in particular, is surrounded by uncertainty as its inputs are produced by (noisy) upstream perception and its outputs are predictions that are often probabilistic for use in downstream planning. However, most trajectory forecasting methods do not account for upstream uncertainty, instead taking only the most-likely values. As a result, perceptual uncertainties are not propagated through forecasting and predictions are frequently overconfident. To address this, we present a novel method for incorporating perceptual state uncertainty in trajectory forecasting, a key component of which is a new statistical distance-based loss function which encourages predicting uncertainties that better match upstream perception. We evaluate our approach both in illustrative simulations and on large-scale, real-world data, demonstrating its efficacy in propagating perceptual state uncertainty through prediction and producing more calibrated predictions.