Abstract: The concept of function and affordance is a critical aspect of 3D scene understanding and supports task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene, so that the varying functional affordance integrates with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) capturing the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we build a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grandchild nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground-truth local spatial regions with region-specific affordances as well as object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG, using a multi-task learning framework that jointly performs room classification and defines spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and demonstrates one approach for applying transformer models to 3D scene understanding and to the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
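As a reading aid only, the following is a minimal sketch of how the three-level 3DHSG described in this abstract could be represented in code. The class names, fields, and example labels are illustrative assumptions and are not taken from the paper or its released dataset.

```python
# Minimal sketch (not the paper's code): a three-level 3DHSG with a room node,
# region nodes carrying region-specific affordances, and object nodes carrying
# object-specific affordances. All names and example values are illustrative.
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class ObjectNode:
    label: str                 # semantic label, e.g. "knife"
    centroid: np.ndarray       # 3D location from the segmented point cloud
    affordances: List[str]     # object-specific affordances, e.g. ["cut", "grasp"]


@dataclass
class RegionNode:
    affordances: List[str]                              # region-specific affordances
    objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    room_label: str                                     # predicted room class
    regions: List[RegionNode] = field(default_factory=list)


# Example: a kitchen with one "food preparation" region containing a knife.
room = RoomNode(
    room_label="kitchen",
    regions=[RegionNode(
        affordances=["food preparation"],
        objects=[ObjectNode("knife", np.array([0.4, 1.2, 0.9]), ["cut", "grasp"])],
    )],
)
print(room.room_label, [o.label for r in room.regions for o in r.objects])
```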
Abstract: Neural Radiance Fields (NeRFs) provide a high-fidelity, continuous scene representation that can realistically represent complex behaviour of light. Although recent works such as Ref-NeRF improve geometry through physics-inspired models, the ability of a NeRF to overcome shape-radiance ambiguity and converge to a representation consistent with the real geometry remains limited. We demonstrate how curriculum learning of a surface light field model helps a NeRF converge towards a more geometrically accurate scene representation. We introduce four additional regularisation terms that impose geometric smoothness, consistency of normals, and a separation of Lambertian and specular appearance at geometry in the scene, conforming to physical models. Our approach yields improvements of 14.4% to normals on positionally encoded NeRFs and 9.2% on grid-based models compared to current reflection-based NeRF variants. This includes a separated view-dependent appearance, conditioning a NeRF to have a geometric representation consistent with the captured scene. We demonstrate compatibility of our method with existing NeRF variants, as a key step in enabling radiance-based representations for geometry-critical applications.
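The sketch below only illustrates how geometric regularisers of the kind named in this abstract (normal consistency, smoothness, and a Lambertian/specular split) might be combined into a weighted training loss. The term definitions and weights are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative combination of geometric regularisation terms with a photometric
# loss; all definitions and weights are assumed, not taken from the paper.
import numpy as np


def normal_consistency(pred_normals, density_normals):
    """Penalise disagreement between predicted normals and normals derived
    from the density field (both [N, 3], assumed unit length)."""
    return np.mean(1.0 - np.sum(pred_normals * density_normals, axis=-1))


def smoothness(normals, neighbour_normals):
    """Penalise variation of normals across neighbouring samples."""
    return np.mean(np.linalg.norm(normals - neighbour_normals, axis=-1))


def specular_energy(specular_rgb):
    """Keep the specular residual small so most appearance is explained
    by the Lambertian branch."""
    return np.mean(specular_rgb ** 2)


def total_loss(photometric, pred_n, dens_n, nbr_n, spec, w=(0.05, 0.01, 0.001)):
    return (photometric
            + w[0] * normal_consistency(pred_n, dens_n)
            + w[1] * smoothness(pred_n, nbr_n)
            + w[2] * specular_energy(spec))
```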
Abstract: This paper presents DynORecon, a Dynamic Object Reconstruction system that leverages the information provided by Dynamic SLAM to simultaneously generate a volumetric map of observed moving entities while estimating free space to support navigation. By capitalising on the motion estimations provided by Dynamic SLAM, DynORecon continuously refines the representation of dynamic objects to eliminate residual artefacts from past observations and incrementally reconstructs each object, seamlessly integrating new observations to capture previously unseen structures. Our system is highly efficient (~20 FPS) and produces accurate (~10 cm) reconstructions of dynamic objects on simulated and real-world outdoor datasets.
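The following sketch is an assumed illustration, not DynORecon's implementation: it shows how per-frame object poses from Dynamic SLAM can be used to fuse observations of a moving object into a single object-centric point cloud, so past observations do not leave artefacts in the world-frame map.

```python
# Sketch: accumulate observations of a moving object in its body frame using
# estimated object poses (illustrative only).
import numpy as np


def to_object_frame(points_world, T_world_object):
    """Transform Nx3 world-frame points into the object's body frame,
    given the object's 4x4 pose T_world_object at the observation time."""
    T_obj_world = np.linalg.inv(T_world_object)
    homog = np.hstack([points_world, np.ones((points_world.shape[0], 1))])
    return (homog @ T_obj_world.T)[:, :3]


def fuse(observations):
    """observations: list of (points_world, T_world_object) pairs over time.
    Returns one accumulated object-frame cloud."""
    return np.vstack([to_object_frame(p, T) for p, T in observations])
```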
Abstract: Most simultaneous localisation and mapping (SLAM) systems have traditionally assumed a static world, which does not align with real-world scenarios. To enable robots to safely navigate and plan in dynamic environments, it is essential to employ representations capable of handling moving objects. Dynamic SLAM is an emerging field in SLAM research as it improves the overall system accuracy while providing additional estimation of object motions. State-of-the-art literature informs two main formulations for Dynamic SLAM, representing dynamic object points in either the world or the object coordinate frame. While expressing object points in a local reference frame may seem intuitive, it may not necessarily lead to the most accurate and robust solutions. This paper presents a thorough analysis of various Dynamic SLAM formulations, identifying the best approach to address the problem. To this end, we introduce a front-end agnostic framework using GTSAM that can be used to evaluate various Dynamic SLAM formulations.
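To make the two point parameterisations mentioned in this abstract concrete, here is a plain NumPy sketch rather than the GTSAM-based framework the paper introduces. The symbols and numerical values are illustrative assumptions.

```python
# Sketch of the two Dynamic SLAM point parameterisations (illustrative values).
# T_wo_k: object pose in the world at time k; H_k: the object's SE(3) motion
# between k-1 and k, expressed in the world frame.
import numpy as np

T_wo_k = np.eye(4); T_wo_k[:3, 3] = [2.0, 0.0, 0.0]   # object pose at time k
H_k = np.eye(4);    H_k[:3, 3] = [0.5, 0.0, 0.0]       # object motion k-1 -> k

# (a) Object-frame formulation: the point is a constant coordinate in the
# object's body frame; its world position follows the object pose.
p_obj = np.array([0.1, 0.2, 0.0, 1.0])
p_world_from_obj = T_wo_k @ p_obj

# (b) World-frame formulation: the point is stored in world coordinates at each
# time step, and consecutive observations are linked by the motion H_k.
p_world_prev = np.array([1.6, 0.2, 0.0, 1.0])
p_world_k = H_k @ p_world_prev

print(p_world_from_obj[:3], p_world_k[:3])   # both yield [2.1, 0.2, 0.0] here
```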
Abstract: The problem of tracking self-motion as well as the motion of objects in the scene using information from a camera is known as multi-body visual odometry and is a challenging task. This paper proposes a robust solution that achieves accurate estimation and consistent trackability for dynamic multi-body visual odometry. A compact and effective framework is proposed, leveraging recent advances in semantic instance-level segmentation and accurate optical flow estimation. A novel formulation that jointly optimizes SE(3) motion and optical flow is introduced, improving the quality of the tracked points and the motion estimation accuracy. The proposed approach is evaluated on the Virtual KITTI dataset and tested on the real KITTI dataset, demonstrating its applicability to autonomous driving applications. For the benefit of the community, we make the source code public.
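One way to read "jointly optimizes SE(3) motion and optical flow" is a stacked residual in which one term ties a current-frame feature to its flow-predicted location and another ties it to the reprojection of the previous 3D point moved by the object's SE(3) motion. The sketch below is such an assumed residual; the camera model, weights and variable names are not the paper's.

```python
# Assumed joint flow + motion-reprojection residual (illustrative only).
import numpy as np

K = np.array([[718.0,   0.0, 607.0],
              [  0.0, 718.0, 185.0],
              [  0.0,   0.0,   1.0]])   # pinhole intrinsics (illustrative)


def project(K, p_cam):
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]


def joint_residual(u_prev, flow, u_curr, p_world_prev, H, T_cw,
                   w_flow=1.0, w_motion=1.0):
    """u_prev, u_curr: 2D feature locations; flow: predicted optical flow;
    p_world_prev: homogeneous 3D point at k-1; H: object SE(3) motion;
    T_cw: world-to-camera transform at time k."""
    r_flow = u_curr - (u_prev + flow)                      # flow consistency
    p_world_curr = H @ p_world_prev                        # move point by motion
    r_motion = u_curr - project(K, (T_cw @ p_world_curr)[:3])
    return np.hstack([w_flow * r_flow, w_motion * r_motion])
```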
Abstract: The scene rigidity assumption, also known as the static world assumption, is common in SLAM algorithms. Most existing algorithms that operate in complex dynamic environments simplify the problem by removing moving objects from consideration or tracking them separately. Such strong assumptions limit the deployment of autonomous mobile robotic systems in a wide range of important real-world applications involving highly dynamic and unstructured environments. This paper presents VDO-SLAM, a robust object-aware dynamic SLAM system that exploits semantic information to enable motion estimation of rigid objects in the scene without any prior knowledge of the objects' shapes or motion models. The proposed approach integrates dynamic and static structures in the environment into a unified estimation framework, resulting in accurate robot pose and spatio-temporal map estimation. We provide a way to extract velocity estimates from the pose change of moving objects in the scene, providing important functionality for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets. Results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the source code is available.
Abstract: The static world assumption is standard in most simultaneous localisation and mapping (SLAM) algorithms. Increased deployment of autonomous systems in unstructured dynamic environments is driving a need to identify moving objects and estimate their velocity in real time. Most existing SLAM-based approaches rely on a database of 3D models of objects or impose significant motion constraints. In this paper, we propose a new feature-based, model-free, object-aware dynamic SLAM algorithm that exploits semantic segmentation to allow estimation of the motion of rigid objects in a scene without the need to estimate object poses or have any prior knowledge of their 3D models. The algorithm generates a map of dynamic and static structure and has the ability to extract the velocities of rigid moving objects in the scene. Its performance is demonstrated on simulated, synthetic and real-world datasets.
Abstract: Accurate estimation of the environment structure simultaneously with the robot pose is a key capability of autonomous robotic vehicles. Classical simultaneous localization and mapping (SLAM) algorithms rely on the static world assumption to formulate the estimation problem; however, the real world has a significant amount of dynamics that can be exploited for more accurate localization and a more versatile representation of the environment. In this paper we propose a technique to integrate the motion of dynamic objects into the SLAM estimation problem without the need to estimate the pose or the geometry of the objects. To this end, we introduce a novel representation of the pose change of rigid bodies in motion and show the benefits of integrating such information when performing SLAM in dynamic environments. Our experiments show consistent improvement in robot localization and mapping accuracy when using a simple constant-motion assumption, even for objects whose motion slightly violates this assumption.
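The constant-motion assumption mentioned in this abstract can be expressed as a penalty on the difference between consecutive frame-to-frame rigid-body motions. The sketch below is an illustrative residual on SE(3), not the paper's code; it uses SciPy's matrix logarithm for brevity.

```python
# Assumed constant-motion residual: small when the object's frame-to-frame
# motion is unchanged between consecutive time steps.
import numpy as np
from scipy.linalg import logm


def se3_log(T):
    """Return a 6-vector (rotation part, translation part) of log(T)
    for a 4x4 SE(3) matrix."""
    L = np.real(logm(T))
    omega = np.array([L[2, 1], L[0, 2], L[1, 0]])   # skew-symmetric part
    return np.hstack([omega, L[:3, 3]])


def constant_motion_residual(H_prev, H_curr):
    """Discrepancy between the motions at consecutive frames."""
    return se3_log(np.linalg.inv(H_prev) @ H_curr)
```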
Abstract: Single-image haze removal is challenging due to the limited information contained in a single image. Previous solutions largely rely on handcrafted priors to compensate for this deficiency. Recent convolutional neural network (CNN) models have been used to learn haze-related priors, but they ultimately work as advanced image filters. In this paper we propose a novel semantic approach towards single-image haze removal. Unlike existing methods, we infer color priors based on extracted semantic features. We argue that semantic context can be exploited to give informative cues for (a) learning a color prior on the clean image and (b) estimating ambient illumination. This design allows our model to recover clean images from challenging cases with strong ambiguity, e.g. saturated illumination color and sky regions in the image. In experiments, we validate our approach on synthetic and real hazy images, where our method shows superior performance over state-of-the-art approaches, suggesting that semantic information facilitates the haze removal task.
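For context, single-image haze removal typically builds on the standard atmospheric scattering model I(x) = J(x) t(x) + A (1 - t(x)): given an estimated transmission t and ambient illumination A, the clean image J is recovered by inverting this model. The sketch below shows only this inversion; how t and A are estimated from semantic features is the paper's contribution and is not reproduced here.

```python
# Inversion of the standard atmospheric scattering model (generic, not the
# paper's pipeline). I: HxWx3 hazy image in [0,1]; t: HxW transmission map;
# A: length-3 ambient illumination (airlight).
import numpy as np


def dehaze(I, t, A, t_min=0.1):
    t = np.clip(t, t_min, 1.0)[..., None]   # avoid division by near-zero transmission
    J = (I - A) / t + A
    return np.clip(J, 0.0, 1.0)
```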
Abstract: Maximum likelihood estimation (MLE) is a well-known estimation method used in many robotic and computer vision applications. Under a Gaussian assumption, MLE converts to a nonlinear least squares (NLS) problem. Efficient solutions to NLS exist and are based on iteratively solving sparse linear systems until convergence. In general, the existing solutions provide only an estimate of the mean state vector, the resulting covariance being computationally too expensive to recover. Nevertheless, in many simultaneous localisation and mapping (SLAM) applications, knowing only the mean vector is not enough. Data association, obtaining reduced state representations, active decisions and next-best-view are only a few of the applications that require fast state covariance recovery. Furthermore, computer vision and robotic applications are in general performed online. In this case, the state is updated and recomputed every step and its size is continuously growing; therefore, the estimation process may become highly computationally demanding. This paper introduces a general framework for incremental MLE called SLAM++, which fully benefits from the incremental nature of online applications and provides efficient estimation of both the mean and the covariance of the estimate. Based on that, we propose a strategy for maintaining a sparse and scalable state representation for large-scale mapping, which uses information-theoretic measures to integrate only informative and non-redundant contributions to the state representation. SLAM++ differs from existing implementations by performing all matrix operations by blocks. This leads to extremely fast matrix manipulation and arithmetic operations. Even though this paper tests SLAM++ efficiency on SLAM problems, its applicability remains general.
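As background for what "recovering the covariance" after an NLS solve amounts to: the covariance is the inverse of the information matrix J^T J, and selected columns can be obtained by substitution through its Cholesky factor rather than by forming the full dense inverse. The dense NumPy sketch below only illustrates this relationship; SLAM++ performs the recovery incrementally and block-wise, which is not reproduced here.

```python
# Illustrative (dense, non-incremental) recovery of selected covariance columns
# from the measurement Jacobian at convergence.
import numpy as np


def covariance_columns(J, cols):
    """J: measurement Jacobian at convergence; cols: state indices of interest.
    Returns the corresponding columns of the covariance (J^T J)^-1."""
    info = J.T @ J                       # information matrix (assumed full rank)
    R = np.linalg.cholesky(info).T       # upper-triangular factor, info = R^T R
    e = np.zeros((info.shape[0], len(cols)))
    e[cols, np.arange(len(cols))] = 1.0  # unit vectors selecting the columns
    y = np.linalg.solve(R.T, e)          # forward substitution
    return np.linalg.solve(R, y)         # back substitution -> covariance columns
```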