Abstract: Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks: single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, surpassing the state of the art by a 7.3% increase in task completion rate on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards practical applications of MLLMs in embodied AI. Project page: https://flame-sjtu.github.io
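A minimal sketch of the three-phase tuning schedule described in this abstract, under placeholder assumptions: the toy model, random tensors, and loader names below stand in for the actual Flamingo-style MLLM and the synthesized street-view description, trajectory summarization, and VLN corpora; only the phase ordering reflects the abstract.

```python
# Sketch only: a stand-in model and random data illustrate the phase ordering,
# not the real FLAME architecture or datasets.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))  # placeholder for the MLLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def make_dummy_loader(n=64):
    # Random (feature, label) pairs standing in for (observation, text) samples.
    x = torch.randn(n, 32)
    y = torch.randint(0, 8, (n,))
    return DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

# Phase names follow the abstract; the data behind them is purely illustrative.
phases = {
    "single_perception_tuning": make_dummy_loader(),    # street view description
    "multiple_perception_tuning": make_dummy_loader(),  # trajectory summarization
    "end_to_end_vln_training": make_dummy_loader(),     # VLN dataset (e.g., Touchdown)
}

for phase_name, loader in phases.items():
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"finished phase: {phase_name}")
```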
Abstract: Place recognition is indispensable for drift-free localization systems. Due to variations in the environment, place recognition using a single modality has limitations. In this paper, we propose a bi-modal place recognition method that extracts a compound global descriptor from two modalities, vision and LiDAR. Specifically, we build an elevation image generated from the point cloud modality as a discriminative structural representation. Based on the 3D information, we derive the correspondences between 3D points and image pixels, by which the pixel-wise visual features can be inserted into the elevation map grids. In this way, we fuse the structural features and visual features in a consistent bird's-eye view frame, yielding a semantic feature representation with sensible geometry, namely CORAL. Comparisons on the Oxford RobotCar dataset show that CORAL achieves superior performance against other state-of-the-art methods. We also demonstrate that our network generalizes to other scenes and sensor configurations using cross-city datasets.
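An illustrative sketch (not the authors' implementation) of the fusion step described above: LiDAR points are projected into the image using assumed camera intrinsics `K` and extrinsics `T_cam_lidar`, the corresponding pixel-wise features are sampled, and both elevation and visual features are scattered into a shared bird's-eye-view grid; the grid size and cell resolution are placeholder values.

```python
# Sketch of 3D-point-to-pixel correspondence and BEV fusion; K, T_cam_lidar,
# grid_size, and cell are assumed placeholders, not values from the paper.
import numpy as np

def fuse_bev(points_lidar, image_feats, K, T_cam_lidar, grid_size=100, cell=0.5):
    """points_lidar: (N, 3) xyz in the LiDAR frame; image_feats: (H, W, C) per-pixel features."""
    H, W, C = image_feats.shape

    # 1) Project LiDAR points into the camera image to get pixel correspondences.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    valid = pts_cam[:, 2] > 0.1                        # keep points in front of the camera
    uv = (K @ pts_cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)

    # 2) Rasterize into a BEV grid: per-cell max height (elevation) plus visual features.
    elevation = np.full((grid_size, grid_size), -np.inf)
    visual = np.zeros((grid_size, grid_size, C))
    pts = points_lidar[valid][in_img]
    feats = image_feats[uv[in_img, 1], uv[in_img, 0]]  # pixel-wise features for each 3D point
    gx = np.clip((pts[:, 0] / cell + grid_size // 2).astype(int), 0, grid_size - 1)
    gy = np.clip((pts[:, 1] / cell + grid_size // 2).astype(int), 0, grid_size - 1)
    for x, y, z, f in zip(gx, gy, pts[:, 2], feats):
        if z > elevation[y, x]:                        # keep the highest point per cell
            elevation[y, x] = z
            visual[y, x] = f
    elevation[np.isinf(elevation)] = 0.0
    return elevation, visual                           # compound BEV representation
```

The per-cell max-height rule and feature selection here are one simple rasterization choice; the paper's actual aggregation may differ.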