Title: Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Mar 22, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong

Abstract: This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
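
The projection step mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes the projection is a single linear map, and the function name `project`, the dimensions, and the random weights are all hypothetical, chosen only to show how 3D visual feature vectors could be mapped to the width of a text embedding space.

```python
import random


def project(features, weight, bias):
    """Apply a linear map to each feature vector.

    features: list of vectors, each of length d_in (hypothetical 3D visual features)
    weight:   d_in x d_out matrix given as a list of rows
    bias:     vector of length d_out
    Returns a list of vectors, each of length d_out.
    """
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [
        [sum(x * w for x, w in zip(vec, col)) + b for col, b in zip(columns, bias)]
        for vec in features
    ]


# Hypothetical sizes: 8 visual feature vectors of width 4, text embedding width 6.
random.seed(0)
num_features, visual_dim, text_dim = 8, 4, 6
features = [[random.gauss(0, 1) for _ in range(visual_dim)] for _ in range(num_features)]
weight = [[random.gauss(0, 0.02) for _ in range(text_dim)] for _ in range(visual_dim)]
bias = [0.0] * text_dim

# Each visual feature now has the same width as a text token embedding.
visual_tokens = project(features, weight, bias)
print(len(visual_tokens), len(visual_tokens[0]))
```

In a trained model the weights would be learned during feature alignment rather than sampled at random; the sketch only shows the change of shape from the visual feature width to the text embedding width.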
