Runhua Shi

Audio-driven Gesture Generation via Deviation Feature in the Latent Space

Mar 27, 2025

Jiahui Chen, Yang Huan, Runhua Shi, Chanfan Ding, Xiaoqi Mo, Siyu Xiong, Yinong He

Abstract: Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.

* 6 pages, 5 figures

Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation

Sep 26, 2024

Huan Yang, Jiahui Chen, Chaofan Ding, Runhua Shi, Siyu Xiong, Qingqi Hong, Xiaoqi Mo, Xinhan Di

Abstract: Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation of gestures in co-speech, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation, which is crucial for generating realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% for FGD, DIV, and FVD, 8.1% for PSNR, and 2.5% for SSIM over the current state-of-the-art methods.

* 5 pages, 5 figures, conference
