Title: Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Jul 08, 2024

Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, Rudolf Lioutikov

Abstract: This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods learn from only a single goal modality, e.g. either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is simultaneously trained on multimodal goal instructions. This state representation aligns image- and language-based goal embeddings and encodes sufficient information to predict future states. The representation is trained via two self-supervised auxiliary objectives, enhancing the performance of the presented transformer backbone. MDT shows exceptional performance on 164 tasks provided by the challenging CALVIN and LIBERO benchmarks, including a LIBERO version that contains less than $2\%$ language annotations. Furthermore, MDT establishes a new record on the CALVIN manipulation challenge, demonstrating an absolute performance improvement of $15\%$ over prior state-of-the-art methods that require large-scale pretraining and contain $10\times$ more learnable parameters. MDT shows its ability to solve long-horizon manipulation from sparsely annotated data in both simulated and real-world environments. Demonstrations and code are available at https://intuitive-robots.github.io/mdt_policy/.

* RSS 2024
