Title: DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects

Oct 03, 2024

Zhaowei Wang, Hongming Zhang, Tianqing Fang, Ye Tian, Yue Yang, Kaixin Ma, Xiaoman Pan, Yangqiu Song, Dong Yu

Abstract: Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. Although large-scale scene datasets, faster simulators, and stronger models have driven substantial progress, previous studies mainly focus on a limited set of scene types and target objects. In this paper, we study a new task: navigating to diverse target objects across a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce chain-of-thought (CoT) explanation traces of the action prediction, which improve performance when tuning LVLMs. Our extensive experiments show that a performant LVLM-based agent can be built through imitation learning on the shortest paths constructed by a BFS planner, without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. We also carry out various analyses showing the generalization ability of our agent.

* Work in Progress
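
The abstract mentions imitation learning on shortest paths constructed by a BFS planner. As a rough illustration of that idea only, the sketch below runs breadth-first search on a small occupancy grid and returns one shortest path. The grid encoding (0 = free, 1 = blocked), the 4-connected moves, and the name `bfs_shortest_path` are assumptions made for this example; the listing does not describe the authors' actual planner or scene representation.

```python
from collections import deque


def bfs_shortest_path(grid, start, goal):
    """Return one shortest 4-connected path from start to goal, or None.

    Hypothetical helper for illustration: grid is a list of rows,
    where 0 marks a free cell and 1 marks a blocked cell.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal is unreachable


# Example: a wall in the middle row forces a detour around the right side.
demo_grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
demo_path = bfs_shortest_path(demo_grid, (0, 0), (2, 0))
```

Because BFS expands cells in order of distance from the start, the first time the goal is dequeued the recovered path has the minimum number of steps. In an imitation-learning setup, consecutive cells on such a path could be turned into per-step action labels, though how the paper does this is not stated here.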
