School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney, Sydney Institute for Robotics and Intelligent Systems
Abstract:Neural Radiance Fields (NeRFs) provide a high-fidelity, continuous scene representation that can realistically represent complex behaviour of light. Despite recent works like Ref-NeRF improving geometry through physics-inspired models, the ability of a NeRF to overcome shape-radiance ambiguity and converge to a representation consistent with real geometry remains limited. We demonstrate how curriculum learning of a surface light field model helps a NeRF converge towards a more geometrically accurate scene representation. We introduce four additional regularisation terms that impose geometric smoothness, consistency of normals, and a separation of Lambertian and specular appearance at surfaces in the scene, conforming to physical models. Our approach yields improvements of 14.4% in normal accuracy for positionally encoded NeRFs and 9.2% for grid-based models compared to current reflection-based NeRF variants, together with a separated view-dependent appearance, conditioning the NeRF towards a geometric representation consistent with the captured scene. We demonstrate compatibility of our method with existing NeRF variants, as a key step in enabling radiance-based representations for geometry-critical applications.
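A rough sketch of how such regularisation terms might be combined during training is shown below; the term formulations, the curriculum ramp, and all names (geometry_regularisers, neighbour_normals, ramp_steps) are illustrative assumptions rather than the paper's exact four losses.

```python
import torch

def geometry_regularisers(pred_normals, grad_normals, neighbour_normals,
                          specular_rgb, weights, step, ramp_steps=20000):
    """Illustrative only: formulations and weights are assumptions, not the
    paper's exact regularisation terms.

    pred_normals / grad_normals: predicted and density-gradient normals per sample (N, 3)
    neighbour_normals: gradient normals at slightly perturbed sample points (N, 3)
    specular_rgb: separated specular appearance component (N, 3)
    weights: volume-rendering weights per sample (N,)
    """
    # consistency of normals: predicted normals should agree with the density field
    l_consistency = (weights * (1.0 - (pred_normals * grad_normals).sum(-1))).mean()
    # geometric smoothness: normals should vary slowly across nearby points
    l_smooth = (weights * (grad_normals - neighbour_normals).norm(dim=-1)).mean()
    # Lambertian/specular separation: keep the specular component sparse so the
    # diffuse term explains view-independent appearance
    l_specular = (weights * specular_rgb.norm(dim=-1)).mean()
    # curriculum: ramp the geometric terms in gradually over training
    ramp = min(1.0, step / ramp_steps)
    return ramp * (l_consistency + 0.1 * l_smooth) + 0.01 * l_specular
```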
Abstract:Segmented light field images can serve as a powerful representation in many computer vision tasks that exploit the geometry and appearance of objects, such as object pose tracking. In the light field domain, segmentation presents an additional objective of recognizing the same segment across all the views. Segment Anything Model 2 (SAM 2) can produce semantically meaningful segments for monocular images and videos. However, applying SAM 2 directly to light fields is highly ineffective because it leaves the domain's structural constraints unexploited. In this work, we present a novel light field segmentation method that adapts SAM 2 to the light field domain without retraining or modifying the model. By exploiting light field domain constraints, the method produces high-quality, view-consistent light field masks, outperforming the SAM 2 video tracking baseline while running 7 times faster, at real-time speed. We achieve this by exploiting epipolar geometry cues to propagate masks between the views, probing the SAM 2 latent space to estimate their occlusion, and further prompting SAM 2 for their refinement.
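As a rough illustration of the epipolar propagation step, the sketch below warps a centre-view mask into a neighbouring sub-aperture view of a regularly sampled light field; the function name, the per-pixel disparity input, and the nearest-pixel rounding are assumptions for illustration, and the occlusion estimation and SAM 2 refinement described above are not shown.

```python
import numpy as np

def propagate_mask(mask_centre, disparity, du, dv):
    """Warp a centre-view mask to a neighbouring sub-aperture view using the
    epipolar constraint of a regularly sampled light field (illustrative sketch).

    mask_centre: (H, W) boolean mask in the centre view
    disparity:   (H, W) per-pixel disparity of the centre view (pixels per view step)
    du, dv:      integer view offsets from the centre view
    """
    H, W = mask_centre.shape
    ys, xs = np.nonzero(mask_centre)
    # Under the two-plane parameterisation, a point with disparity d shifts
    # linearly with the view offset along the epipolar line.
    xs_new = np.round(xs + disparity[ys, xs] * du).astype(int)
    ys_new = np.round(ys + disparity[ys, xs] * dv).astype(int)
    valid = (xs_new >= 0) & (xs_new < W) & (ys_new >= 0) & (ys_new < H)
    mask_new = np.zeros_like(mask_centre)
    mask_new[ys_new[valid], xs_new[valid]] = True
    return mask_new  # coarse proposal, to be refined by prompting SAM 2
```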
Abstract:Drones have revolutionized the fields of aerial imaging, mapping, and disaster recovery. However, the deployment of drones in low-light conditions is constrained by the image quality produced by their on-board cameras. In this paper, we present a learning architecture for improving 3D reconstructions in low-light conditions by finding features in a burst. Our approach enhances visual reconstruction by detecting and describing high-quality true features and fewer spurious features in low signal-to-noise-ratio images. We demonstrate that our method is capable of handling challenging scenes in millilux illumination, making it a significant step towards drones operating at night and in extremely low-light applications such as underground mining and search and rescue operations.
Abstract:The performance of robots in their applications heavily depends on the quality of sensory input. However, designing sensor payloads and their parameters for specific robotic tasks is an expensive process that requires well-established sensor knowledge and extensive experiments with physical hardware. With cameras playing a pivotal role in robotic perception, we introduce a novel end-to-end optimization approach for co-designing a camera with specific robotic tasks by combining derivative-free and gradient-based optimizers. The proposed method leverages recent computer graphics techniques and physical camera characteristics to prototype the camera in software, simulate operational environments and tasks for robots, and optimize the camera design based on the desired tasks in a cost-effective way. We validate the accuracy of our camera simulation by comparing it with physical cameras, and demonstrate the design of cameras with stronger performance than common off-the-shelf alternatives. Our approach supports the optimization of both continuous and discrete camera parameters, accommodates manufacturing constraints, and can be generalized to a broad range of camera design scenarios including multiple cameras and unconventional cameras. This work advances the fully automated design of cameras for specific robotic tasks.
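The nested structure of combining a derivative-free outer search with a gradient-based inner optimisation can be illustrated with a toy sketch; the task_loss placeholder, parameter names, and the exhaustive outer loop below are assumptions, standing in for simulating the camera and scoring the robotic task.

```python
import torch

def task_loss(continuous_params, discrete_choice):
    """Placeholder for 'simulate the camera, run the robotic task, score it'.
    Purely illustrative: a smooth function of focal length and aperture."""
    focal, aperture = continuous_params
    base = {"sensor_a": 1.0, "sensor_b": 0.6}[discrete_choice]
    return base * (focal - 2.0) ** 2 + (aperture - 1.4) ** 2

def optimise_camera(discrete_options, inner_steps=200, lr=0.05):
    best = None
    for choice in discrete_options:              # derivative-free outer loop (here: exhaustive)
        params = torch.tensor([1.0, 1.0], requires_grad=True)
        opt = torch.optim.Adam([params], lr=lr)
        for _ in range(inner_steps):             # gradient-based inner loop
            opt.zero_grad()
            loss = task_loss(params, choice)
            loss.backward()
            opt.step()
        if best is None or loss.item() < best[0]:
            best = (loss.item(), choice, params.detach().clone())
    return best

print(optimise_camera(["sensor_a", "sensor_b"]))
```

In practice the outer loop would use a proper derivative-free optimiser (e.g. evolutionary search) over discrete choices and manufacturing constraints, while the inner loop differentiates the simulated task performance with respect to continuous camera parameters.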
Abstract:The majority of image processing approaches assume images are in, or can be rectified to, a perspective projection. However, in many applications it is beneficial to use non-conventional cameras, such as fisheye cameras, that have a larger field of view (FOV). The issue is that these large-FOV images cannot be rectified to a perspective projection without significant cropping of the original image. To address this issue we propose Rectified Convolutions (RectConv): a new approach for adapting pre-trained convolutional networks to operate on non-perspective images, without any retraining. Replacing the convolutional layers of the network with RectConv layers allows the network to see both rectified patches and the entire FOV. We demonstrate RectConv adapting multiple pre-trained networks to perform segmentation and detection on fisheye imagery from two publicly available datasets. Our approach requires no additional data or training, and operates directly on the native image as captured from the camera. We believe this work is a step toward adapting the vast resources available for perspective images to operate across a broad range of camera geometries.
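The layer-replacement mechanics can be sketched as follows; RectConvLike is a hypothetical stand-in that only shows how pretrained convolution weights are reused unchanged, and the camera-dependent rectification grid that makes each patch locally perspective is assumed to be supplied (and omitted here).

```python
import torch
import torch.nn as nn
import torchvision

class RectConvLike(nn.Module):
    """Hypothetical stand-in for a RectConv layer: wraps a pretrained Conv2d and,
    if a rectification grid is supplied, resamples the feature map before
    convolving. With no grid it behaves exactly like the original layer."""
    def __init__(self, conv: nn.Conv2d, sample_grid=None):
        super().__init__()
        self.conv = conv                 # pretrained weights, untouched
        self.sample_grid = sample_grid   # (1, H, W, 2) camera-dependent offsets, assumed given

    def forward(self, x):
        if self.sample_grid is not None:
            grid = self.sample_grid.expand(x.shape[0], -1, -1, -1)
            x = nn.functional.grid_sample(x, grid, align_corners=False)
        return self.conv(x)

def swap_convs(module):
    """Recursively replace every Conv2d with the rectified variant, no retraining."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, RectConvLike(child))
        else:
            swap_convs(child)

# e.g. adapt a pretrained segmentation network (weights omitted to keep this self-contained)
model = torchvision.models.segmentation.deeplabv3_resnet50(weights=None).eval()
swap_convs(model)
```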
Abstract:Vision is a popular and effective sensor for robotics, from which we can derive rich information about the environment: the geometry and semantics of the scene, as well as the age, gender, identity, activity and even emotional state of humans within that scene. This raises important questions about the reach, lifespan, and potential misuse of this information. This paper is a call to action to consider privacy in the context of robotic vision. We propose a specific form of privacy preservation in which no images are captured or could be reconstructed by an attacker, even with full remote access. We present a set of principles by which such systems can be designed, and, through a case study in localisation, demonstrate in simulation a specific implementation that delivers an important robotic capability in an inherently privacy-preserving manner. This is a first step, and we hope to inspire future work that expands the range of applications open to sighted robotic systems.
Abstract:There are a multitude of emerging imaging technologies that could benefit robotics. However, the need for bespoke models, calibration, and low-level processing represents a key barrier to their adoption. In this work we present NOCaL, Neural odometry and Calibration using Light fields, a semi-supervised learning architecture capable of interpreting previously unseen cameras without calibration. NOCaL learns to estimate camera parameters, relative pose, and scene appearance. It employs a scene-rendering hypernetwork pretrained on a large number of existing cameras and scenes, and adapts to previously unseen cameras using a small supervised training set to enforce metric scale. We demonstrate NOCaL on rendered and captured imagery using conventional cameras, showing calibration-free odometry and novel view synthesis. This work represents a key step toward automating the interpretation of general camera geometries and emerging imaging technologies.
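The hypernetwork idea can be sketched in miniature: a small network maps a per-camera embedding to the weights of a rendering MLP, so a previously unseen camera is handled by fitting only a new embedding. All names and layer sizes below (RenderHyperNet, embed_dim, the 6-D ray input) are illustrative assumptions, not NOCaL's actual architecture.

```python
import math
import torch
import torch.nn as nn

class RenderHyperNet(nn.Module):
    """Toy hypernetwork: camera embedding -> weights of a small rendering MLP."""
    def __init__(self, embed_dim=32, hidden=64, in_dim=6, out_dim=3):
        super().__init__()
        self.shapes = [(hidden, in_dim), (hidden,), (out_dim, hidden), (out_dim,)]
        n_params = sum(math.prod(s) for s in self.shapes)
        self.hyper = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_params))

    def forward(self, camera_embedding, rays):
        # generate the rendering MLP's weights from the camera embedding
        flat, params, i = self.hyper(camera_embedding), [], 0
        for s in self.shapes:
            n = math.prod(s)
            params.append(flat[i:i + n].reshape(s))
            i += n
        w1, b1, w2, b2 = params
        h = torch.relu(rays @ w1.T + b1)       # rays: (N, 6) origin + direction
        return torch.sigmoid(h @ w2.T + b2)    # predicted colour per ray

net = RenderHyperNet()
colours = net(torch.zeros(32), torch.randn(1024, 6))  # embedding would be learned per camera
```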
Abstract:Robots operating at night using conventional vision cameras face significant challenges in reconstruction due to noise-limited images. Previous work has demonstrated that burst-imaging techniques can be used to partially overcome this issue. In this paper, we develop a novel feature detector that operates directly on image bursts, enhancing vision-based reconstruction under extremely low-light conditions. Our approach finds keypoints with well-defined scale and apparent motion within each burst by jointly searching in a multi-scale and multi-motion space. Because we describe these features at a stage where the images have a higher signal-to-noise ratio, the detected features are more accurate than the state-of-the-art on conventional noisy images and burst-merged images, and exhibit high precision, recall, and matching performance. We show improved feature performance and camera pose estimates, and demonstrate improved structure-from-motion performance using our feature detector in challenging light-constrained scenes. Our feature finder provides a significant step towards robots operating in low-light scenarios and applications including night-time operations.
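A toy version of the joint scale and apparent-motion search might look like the sketch below, which aligns a burst under each candidate motion, averages it to raise the signal-to-noise ratio, and scores a Laplacian-of-Gaussian response over scales; the 1-D integer motion family, the response threshold, and the function names are simplifying assumptions (the actual detector searches 2-D, sub-pixel motion and also describes the features from the aligned burst).

```python
import numpy as np
from scipy.ndimage import shift as nd_shift, gaussian_laplace, maximum_filter

def burst_keypoints(burst, scales=(1.0, 2.0, 4.0), motions=range(-2, 3), thresh=0.02):
    """Toy joint scale/motion search over a burst of (H, W) frames in [0, 1]."""
    H, W = burst[0].shape
    best_response = np.full((H, W), -np.inf)
    best_scale = np.zeros((H, W))
    best_motion = np.zeros((H, W))
    for m in motions:
        # align the frames under the hypothesised apparent motion, then merge:
        # averaging raises SNR before the feature response is computed
        merged = np.mean([nd_shift(f, (0, -m * i)) for i, f in enumerate(burst)], axis=0)
        for s in scales:
            response = np.abs(gaussian_laplace(merged, sigma=s)) * s ** 2  # scale-normalised
            better = response > best_response
            best_response[better] = response[better]
            best_scale[better], best_motion[better] = s, m
    peaks = (best_response == maximum_filter(best_response, size=5)) & (best_response > thresh)
    ys, xs = np.nonzero(peaks)
    return [(x, y, best_scale[y, x], best_motion[y, x]) for x, y in zip(xs, ys)]
```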
Abstract:This work addresses the problems of semantic segmentation and image super-resolution by jointly considering the performance of both when training a Generative Adversarial Network (GAN). We propose a novel architecture and domain-specific feature loss, allowing super-resolution to operate as a pre-processing step that increases the performance of downstream computer vision tasks, specifically semantic segmentation. We demonstrate this approach using Nearmap's aerial imagery dataset, which covers hundreds of urban areas at 5-7 cm per pixel resolution. We show the proposed approach improves perceived image quality as well as quantitative segmentation accuracy across all prediction classes, yielding average accuracy improvements of 11.8% and 108% at 4x and 32x super-resolution respectively, compared with state-of-the-art single-network methods. This work demonstrates that jointly considering image-based and task-specific losses can improve the performance of both, and advances the state-of-the-art in semantic-aware super-resolution of aerial imagery.
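A minimal sketch of such a joint generator objective is shown below, assuming a discriminator and a segmentation network are available; the loss weights and the exact form of the domain-specific feature loss are assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def generator_loss(sr_image, hr_image, seg_labels, discriminator, seg_net,
                   w_pix=1.0, w_adv=1e-3, w_seg=1e-1):
    """Joint objective: image fidelity + adversarial realism + a segmentation
    task loss computed on the super-resolved output (illustrative weights)."""
    l_pix = F.l1_loss(sr_image, hr_image)                        # image-based loss
    logits = discriminator(sr_image)
    l_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    l_seg = F.cross_entropy(seg_net(sr_image), seg_labels)       # task-specific loss
    return w_pix * l_pix + w_adv * l_adv + w_seg * l_seg
```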
Abstract:Images captured under extremely low-light conditions are noise-limited, which can cause existing robotic vision algorithms to fail. In this paper we develop an image processing technique for aiding 3D reconstruction from images acquired in low-light conditions. Our technique, based on burst photography, uses direct methods for image registration within bursts of short-exposure images to improve the robustness and accuracy of feature-based structure-from-motion (SfM). We demonstrate improved SfM performance in challenging light-constrained scenes, including quantitative evaluations that show improved feature performance and camera pose estimates. Additionally, we show that our method converges to correct reconstructions more frequently than the state-of-the-art. Our method is a significant step towards allowing robots to operate in low-light conditions, with potential applications including robots operating in environments such as underground mines and at night.
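The direct registration step can be illustrated with a simple phase-correlation sketch: each short-exposure frame is registered to the first frame of the burst and the aligned frames are averaged to raise the signal-to-noise ratio before features are extracted for SfM. The pure-translation, integer-pixel model and function names below are simplifying assumptions, not the paper's alignment method.

```python
import numpy as np

def registration_shift(ref, img):
    """Return the (dy, dx) shift to apply to img to align it with ref,
    estimated directly from pixel intensities via phase correlation."""
    F = np.fft.fft2(ref) * np.conj(np.fft.fft2(img))
    corr = np.fft.ifft2(F / (np.abs(F) + 1e-9)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    H, W = ref.shape
    return (dy if dy <= H // 2 else dy - H, dx if dx <= W // 2 else dx - W)

def merge_burst(burst):
    """Register each short-exposure frame to the first, then average to raise SNR."""
    frames = [f.astype(np.float64) for f in burst]
    merged = frames[0].copy()
    for frame in frames[1:]:
        dy, dx = registration_shift(frames[0], frame)
        merged += np.roll(frame, (dy, dx), axis=(0, 1))
    return merged / len(frames)
```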