Abstract:We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real-world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions.
Abstract:The joint optimization of the sensor trajectory and 3D map is a crucial characteristic of bundle adjustment (BA), essential for autonomous driving. This paper presents $\nu$-DBA, a novel framework implementing geometric dense bundle adjustment (DBA) using 3D neural implicit surfaces for map parametrization, which optimizes both the map surface and trajectory poses using geometric error guided by dense optical flow prediction. Additionally, we fine-tune the optical flow model with per-scene self-supervision to further improve the quality of the dense mapping. Our experimental results on multiple driving scene datasets demonstrate that our method achieves superior trajectory optimization and dense reconstruction accuracy. We also investigate the influences of photometric error and different neural geometric priors on the performance of surface reconstruction and novel view synthesis. Our method stands as a significant step towards leveraging neural implicit representations in dense bundle adjustment for more accurate trajectories and detailed environmental mapping.
Abstract:Neural implicit representations have emerged as a promising solution for providing dense geometry in Simultaneous Localization and Mapping (SLAM). However, existing methods in this direction fall short in terms of global consistency and low latency. This paper presents NGEL-SLAM to tackle the above challenges. To ensure global consistency, our system leverages a traditional feature-based tracking module that incorporates loop closure. Additionally, we maintain a global consistent map by representing the scene using multiple neural implicit fields, enabling quick adjustment to the loop closure. Moreover, our system allows for fast convergence through the use of octree-based implicit representations. The combination of rapid response to loop closure and fast convergence makes our system a truly low-latency system that achieves global consistency. Our system enables rendering high-fidelity RGB-D images, along with extracting dense and complete surfaces. Experiments on both synthetic and real-world datasets suggest that our system achieves state-of-the-art tracking and mapping accuracy while maintaining low latency.