Title: Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making

Jul 16, 2023

Ruipu Luo, Jiwen Zhang, Zhongyu Wei

Abstract: Vision language decision making (VLDM) is a challenging multimodal task. The agent has to understand complex human instructions and complete compositional tasks involving environment navigation and object manipulation. However, the long action sequences involved in VLDM make the task difficult to learn. From an environment perspective, we find that task episodes can be divided into fine-grained units, each containing a navigation phase and an interaction phase. Since the environment within a unit stays unchanged, we propose a novel hybrid-training framework that enables active exploration in the environment and reduces exposure bias. This framework leverages the unit-grained configurations and is model-agnostic. Specifically, we design a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory. Through extensive experiments on the TEACH benchmark, we demonstrate that our proposed framework outperforms existing state-of-the-art methods on all evaluation metrics. Overall, our work introduces a novel approach to the VLDM task: breaking it down into smaller, manageable units and training with a hybrid framework. This provides a more flexible and effective solution for multimodal decision making.
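The unit decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the action names, the `NAVIGATION` set, and the `split_into_units` helper are hypothetical, and the splitting rule (a new unit begins when navigation resumes after at least one interaction) is an assumption drawn only from the statement that each unit contains a navigation phase followed by an interaction phase.

```python
from typing import List

# Hypothetical navigation vocabulary, for illustration only;
# the benchmark's real action set may differ.
NAVIGATION = {"Forward", "Backward", "Turn Left", "Turn Right"}


def split_into_units(actions: List[str]) -> List[List[str]]:
    """Group a flat action sequence into units.

    Assumed rule: a unit is a run of navigation actions followed by a
    run of interaction actions. A new unit starts when a navigation
    action appears after at least one interaction action.
    """
    units: List[List[str]] = []
    current: List[str] = []
    seen_interaction = False
    for action in actions:
        is_navigation = action in NAVIGATION
        if is_navigation and seen_interaction:
            # Navigation resumed after interacting: close the current unit.
            units.append(current)
            current = []
            seen_interaction = False
        current.append(action)
        if not is_navigation:
            seen_interaction = True
    if current:
        units.append(current)
    return units


episode = ["Forward", "Turn Left", "Pickup", "Forward", "Place", "Slice"]
units = split_into_units(episode)
```

Under this assumed rule, the example episode splits into two units: one ending with `Pickup` and one ending with `Slice`.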
