Title: Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

Jul 08, 2024

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Xiaodan Liang, Kwan-Yee K. Wong

Abstract: LLM-based agents have demonstrated impressive zero-shot performance in the vision-language navigation (VLN) task. However, these zero-shot methods focus only on high-level task planning, selecting nodes in predefined navigation graphs for movement and overlooking low-level control in realistic navigation scenarios. To bridge this gap, we propose AO-Planner, a novel affordances-oriented planning framework for the continuous VLN task. AO-Planner integrates various foundation models to achieve affordances-oriented motion planning and action decision-making, both performed in a zero-shot manner. Specifically, we employ a visual affordances prompting (VAP) approach, in which the visible ground is segmented using SAM to provide navigational affordances; based on these, the LLM selects potential next waypoints and generates low-level paths towards the selected waypoints. We further introduce a high-level agent, PathAgent, to identify the most probable pixel-based path and convert it into 3D coordinates to carry out low-level motion. Experimental results on the challenging R2R-CE benchmark demonstrate that AO-Planner achieves state-of-the-art zero-shot performance (5.5% improvement in SPL). Our method establishes an effective connection between the LLM and the 3D world, circumventing the difficulty of directly predicting world coordinates and presenting novel prospects for employing foundation models in low-level motion control.
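The abstract mentions converting a pixel-based path into 3D coordinates but gives no details. The sketch below shows one common way such a conversion can be done: pinhole back-projection using a depth value per pixel. It is an illustration under stated assumptions, not the authors' implementation. The function names, the `depth_lookup` callback, and the camera parameters (image size, horizontal field of view, square pixels, principal point at the image center) are all hypothetical.

```python
import math


def pixel_to_camera_xyz(u, v, depth, width, height, hfov_deg):
    """Back-project pixel (u, v) with metric depth into camera-frame (x, y, z).

    Assumes an ideal pinhole camera with square pixels and the principal
    point at the image center (assumptions, not taken from the paper).
    """
    # Focal length in pixels derived from the horizontal field of view.
    f = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    cx, cy = width / 2.0, height / 2.0
    x = (u - cx) * depth / f  # positive to the right of the optical axis
    y = (v - cy) * depth / f  # positive downward in image convention
    z = depth                 # distance along the optical axis
    return (x, y, z)


def pixel_path_to_3d(path, depth_lookup, width, height, hfov_deg):
    """Convert a list of (u, v) pixel waypoints into camera-frame 3D points.

    depth_lookup is a hypothetical callback returning the depth in meters
    at a given pixel, e.g. a read from a depth image.
    """
    return [
        pixel_to_camera_xyz(u, v, depth_lookup(u, v), width, height, hfov_deg)
        for (u, v) in path
    ]


if __name__ == "__main__":
    # Toy example: constant depth of 1 m everywhere, 640x480 image, 90 degree HFOV.
    waypoints = [(320, 240), (400, 300), (480, 360)]
    for point in pixel_path_to_3d(waypoints, lambda u, v: 1.0, 640, 480, 90.0):
        print(point)
```

Working with pixels first and lifting them to 3D afterwards matches the motivation stated in the abstract: the language model only has to reason over image coordinates, and the geometric conversion is handled separately.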
