Title: Tell Me Where You Are: Multimodal LLMs Meet Place Recognition

Jun 25, 2024

Zonglin Lyu, Juexiao Zhang, Mingxuan Lu, Yiming Li, Chen Feng

Abstract: Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their performance in place recognition is still underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe our work can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.
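
The two-stage design described in the abstract (vision-based retrieval to propose candidates, then language-based reasoning to pick one) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors stand in for VFM outputs, cosine similarity is an assumed retrieval metric, and `retrieve_candidates`, `recognize_place`, `describe`, and `decide` are hypothetical names. The `describe` and `decide` callables are placeholders for the MLLM prompts, whose actual wording and model are not given here.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_candidates(query: Vector, database: Sequence[Vector], k: int) -> List[int]:
    """Stage 1 (assumed metric): indices of the k database entries most similar to the query."""
    ranked = sorted(
        range(len(database)),
        key=lambda i: cosine(query, database[i]),
        reverse=True,
    )
    return ranked[:k]


def recognize_place(
    query: Vector,
    database: Sequence[Vector],
    k: int,
    describe: Callable[[int], str],
    decide: Callable[[List[str]], int],
) -> int:
    """Stage 2 (hypothetical interface): describe each candidate pairwise, then choose one.

    describe(idx) stands in for an MLLM call comparing the query image with candidate idx.
    decide(descriptions) stands in for an MLLM call returning a position in the candidate list.
    """
    candidates = retrieve_candidates(query, database, k)
    descriptions = [describe(idx) for idx in candidates]
    return candidates[decide(descriptions)]


if __name__ == "__main__":
    # Toy features; real ones would come from a vision foundation model.
    db = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    stub_describe = lambda idx: f"differences between query and candidate {idx}"
    stub_decide = lambda descriptions: 0  # always keep the top retrieval result
    print(recognize_place([1.0, 0.0], db, 2, stub_describe, stub_decide))
```

With the stub `decide` above, the pipeline reduces to plain nearest-neighbour retrieval; the reranking step only changes the outcome once `decide` is backed by a model that actually reasons over the descriptions.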
