Shanshan Guo

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation

Jun 17, 2023

Xiwen Liang, Liang Ma, Shanshan Guo, Jianhua Han, Hang Xu, Shikui Ma, Xiaodan Liang

Figures 1-4 for MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation

Abstract: Given a natural language instruction, a general-purpose robot has to comprehend it and find the target object or location based on visual observations, even in unexplored environments. Most agents rely on massive, diverse training data to generalize better, which requires expensive labor. These agents often focus on common objects and a small number of tasks, and thus are not intelligent enough to handle different types of instructions. To facilitate research in open-set vision-and-language navigation, we propose a benchmark named MO-VLN, which aims to test the effectiveness and generalization of an agent in the multi-task setting. First, we develop a 3D simulator with realistic scenes rendered using Unreal Engine 5, containing more realistic lighting and details. The simulator contains three scenes of high value in industry: a cafe, a restaurant, and a nursing house. In addition, our simulator involves multiple uncommon objects, such as a takeaway cup and medical adhesive tape, which are more complicated than those in existing environments. Inspired by the recent success of large language models (e.g., ChatGPT, Vicuna), we construct diverse, high-quality instruction data without human annotation. Our benchmark MO-VLN provides four tasks: 1) goal-conditioned navigation given a specific object category (e.g., "fork"); 2) goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball"); 3) step-by-step instruction following; 4) finding an abstract object based on a high-level instruction (e.g., "I am thirsty").

* 18 pages
