Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences from text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, remains largely unexplored. This capability is essential for aligning diverse modalities and also supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to seamlessly integrate multiple diffusion transformers (DiTs) across different modalities. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
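To make the central idea concrete, the sketch below illustrates one plausible reading of a "mutual block" that couples a motion-token stream and a text-token stream from two modality-specific DiTs via bidirectional cross-attention. This is a minimal illustration only: the module names, dimensions, and the exact coupling mechanism are assumptions, not the architecture described in the paper.

```python
# Minimal sketch (illustrative assumptions, not PackDiT's actual implementation):
# a "mutual block" that lets motion-DiT and text-DiT hidden states exchange
# information through cross-attention in both directions.
import torch
import torch.nn as nn


class MutualBlock(nn.Module):
    """Couples a motion-token stream and a text-token stream with cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.motion_norm = nn.LayerNorm(dim)
        self.text_norm = nn.LayerNorm(dim)
        # Motion tokens attend to text tokens, and vice versa.
        self.motion_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_motion = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, motion_tokens: torch.Tensor, text_tokens: torch.Tensor):
        m = self.motion_norm(motion_tokens)
        t = self.text_norm(text_tokens)
        # Residual cross-attention updates for each modality stream.
        motion_out = motion_tokens + self.motion_to_text(m, t, t, need_weights=False)[0]
        text_out = text_tokens + self.text_to_motion(t, m, m, need_weights=False)[0]
        return motion_out, text_out


# Example with hypothetical per-modality DiT hidden states.
motion_tokens = torch.randn(2, 196, 512)  # (batch, motion frames, hidden dim)
text_tokens = torch.randn(2, 77, 512)     # (batch, text tokens, hidden dim)
motion_tokens, text_tokens = MutualBlock(512)(motion_tokens, text_tokens)
```

In this reading, each modality keeps its own DiT backbone, and mutual blocks inserted between layers are the only points where the two streams interact, which is consistent with the abstract's claim of integrating multiple DiTs across modalities.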