Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Minglei Li

Human Motion Video Generation: A Survey

Sep 04, 2025

Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang(+5 more)

Abstract:Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in Our Repository https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.

* IEEE Transactions on Pattern Analysis and Machine Intelligence 2025
* Accepted by TPAMI. Github Repo: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation IEEE Access: https://ieeexplore.ieee.org/document/11106267

Via

Access Paper or Ask Questions

SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Sep 05, 2024

Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu

Figure 1 for SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Figure 2 for SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Figure 3 for SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Figure 4 for SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Abstract:Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as intermediate representation. Specifically, given the mask of image employed by a parsing network, we first leverage the speech to drive the mask and generate talking segmentation. Then we disentangle semantic regions of image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frame. In this way, most of textures are fully preserved. Moreover, our approach can inherently achieve background separation and facilitate mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g. hair, lip, eyebrows), our approach enables facial editing seamlessly when generating talking face video. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.

* 10 pages, 7 figures, 3 tables

Via

Access Paper or Ask Questions

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Aug 03, 2024

Jintao Tan, Xize Cheng, Lingyu Xiong, Lei Zhu, Xiandong Li, Xianjia Wu, Kai Gong, Minglei Li, Yi Cai

Figure 1 for Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Figure 2 for Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Figure 3 for Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Figure 4 for Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Abstract:Audio-driven talking head generation is a significant and challenging task applicable to various fields such as virtual avatars, film production, and online conferences. However, the existing GAN-based models emphasize generating well-synchronized lip shapes but overlook the visual quality of generated frames, while diffusion-based models prioritize generating high-quality frames but neglect lip shape matching, resulting in jittery mouth movements. To address the aforementioned problems, we introduce a two-stage diffusion-based model. The first stage involves generating synchronized facial landmarks based on the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos. Extensive experiments demonstrate that our model yields the best performance.

Via

Access Paper or Ask Questions

Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

Jun 06, 2024

Minglei Li, Peng Ye, Yongqi Huang, Lin Zhang, Tao Chen, Tong He, Jiayuan Fan, Wanli Ouyang

Figure 1 for Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

Figure 2 for Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

Figure 3 for Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

Figure 4 for Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

Abstract:Parameter-efficient fine-tuning (PEFT) has become increasingly important as foundation models continue to grow in both popularity and size. Adapter has been particularly well-received due to their potential for parameter reduction and adaptability across diverse tasks. However, striking a balance between high efficiency and robust generalization across tasks remains a challenge for adapter-based methods. We analyze existing methods and find that: 1) parameter sharing is the key to reducing redundancy; 2) more tunable parameters, dynamic allocation, and block-specific design are keys to improving performance. Unfortunately, no previous work considers all these factors. Inspired by this insight, we introduce a novel framework named Adapter-X. First, a Sharing Mixture of Adapters (SMoA) module is proposed to fulfill token-level dynamic allocation, increased tunable parameters, and inter-block sharing at the same time. Second, some block-specific designs like Prompt Generator (PG) are introduced to further enhance the ability of adaptation. Extensive experiments across 2D image and 3D point cloud modalities demonstrate that Adapter-X represents a significant milestone as it is the first to outperform full fine-tuning in both 2D image and 3D point cloud modalities with significantly fewer parameters, i.e., only 0.20% and 1.88% of original trainable parameters for 2D and 3D classification tasks. Our code will be publicly available.

Via

Access Paper or Ask Questions

Multi-Channel Multi-Step Spectrum Prediction Using Transformer and Stacked Bi-LSTM

May 29, 2024

Guangliang Pan, Jie Li, Minglei Li

Figure 1 for Multi-Channel Multi-Step Spectrum Prediction Using Transformer and Stacked Bi-LSTM

Figure 2 for Multi-Channel Multi-Step Spectrum Prediction Using Transformer and Stacked Bi-LSTM

Figure 3 for Multi-Channel Multi-Step Spectrum Prediction Using Transformer and Stacked Bi-LSTM

Figure 4 for Multi-Channel Multi-Step Spectrum Prediction Using Transformer and Stacked Bi-LSTM

Abstract:Spectrum prediction is considered as a key technology to assist spectrum decision. Despite the great efforts that have been put on the construction of spectrum prediction, achieving accurate spectrum prediction emphasizes the need for more advanced solutions. In this paper, we propose a new multichannel multi-step spectrum prediction method using Transformer and stacked bidirectional LSTM (Bi- LSTM), named TSB. Specifically, we use multi-head attention and stacked Bi-LSTM to build a new Transformer based on encoder-decoder architecture. The self-attention mechanism composed of multiple layers of multi-head attention can continuously attend to all positions of the multichannel spectrum sequences. The stacked Bi-LSTM can learn these focused coding features by multi-head attention layer by layer. The advantage of this fusion mode is that it can deeply capture the long-term dependence of multichannel spectrum data. We have conducted extensive experiments on a dataset generated by a real simulation platform. The results show that the proposed algorithm performs better than the baselines.

Via

Access Paper or Ask Questions

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Apr 02, 2024

Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu

Figure 1 for Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Figure 2 for Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Figure 3 for Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Figure 4 for Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Abstract:Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech, and performs generation in the latent motion space, followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details of certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.

* 22 pages, 8 figures, CVPR 2024

Via

Access Paper or Ask Questions

Syllable based DNN-HMM Cantonese Speech to Text System

Feb 13, 2024

Timothy Wong, Claire Li, Sam Lam, Billy Chiu, Qin Lu, Minglei Li, Dan Xiong, Roy Shing Yu, Vincent T. Y. Ng

Abstract:This paper reports our work on building up a Cantonese Speech-to-Text (STT) system with a syllable based acoustic model. This is a part of an effort in building a STT system to aid dyslexic students who have cognitive deficiency in writing skills but have no problem expressing their ideas through speech. For Cantonese speech recognition, the basic unit of acoustic models can either be the conventional Initial-Final (IF) syllables, or the Onset-Nucleus-Coda (ONC) syllables where finals are further split into nucleus and coda to reflect the intra-syllable variations in Cantonese. By using the Kaldi toolkit, our system is trained using the stochastic gradient descent optimization model with the aid of GPUs for the hybrid Deep Neural Network and Hidden Markov Model (DNN-HMM) with and without I-vector based speaker adaptive training technique. The input features of the same Gaussian Mixture Model with speaker adaptive training (GMM-SAT) to DNN are used in all cases. Experiments show that the ONC-based syllable acoustic modeling with I-vector based DNN-HMM achieves the best performance with the word error rate (WER) of 9.66% and the real time factor (RTF) of 1.38812.

* 7 pages, 3 figures, LREC 2016

Via

Access Paper or Ask Questions

Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers

Dec 25, 2023

Peng Ye, Yongqi Huang, Chongjun Tu, Minglei Li, Tao Chen, Tong He, Wanli Ouyang

Abstract:Fine-tuning pre-trained foundation models has gained significant popularity in various research fields. Existing methods for fine-tuning can be roughly divided into two categories, namely Parameter-Efficient Fine-Tuning and High-Performance Fine-Tuning. The former aims at improving efficiency, while the latter focuses on enhancing performance. Beyond these methods, we demonstrate that Partial Fine-Tuning can be an innovative and promising direction capable of concurrently enhancing both efficiency and accuracy. We first validate eight manually-defined partial fine-tuning strategies across kinds of datasets and vision transformer architectures, and find that some partial fine-tuning strategies (e.g., ffn only or attention only) can achieve better performance with fewer tuned parameters than full fine-tuning, and selecting appropriate layers is critical to partial fine-tuning. Thus, we propose a novel fine-tuned angle metric to guide the selection of appropriate layers for partial fine-tuning, making it flexible to be adapted to various scenarios for more practicable partial fine-tuning. Additionally, we show that partial fine-tuning can serve as a new dimension for Model Soups, improving both the model performance and generalization with fewer tuned parameters. Comprehensive experiments on a wide range of datasets and models validate the great potential of partial fine-tuning.

Via

Access Paper or Ask Questions

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Dec 23, 2023

Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, changpeng yang, Zhou Zhao

Abstract:Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents delays and cascading errors associated with model cascading. However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence needs to be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, \textbf{TransFace}, which can directly translate audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model to convert audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, to re-synthesize synchronized audio-visual speech from discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, ensuring isometric talking head translation and preventing duplicate reference frames. Experiments demonstrate that our proposed Unit2Lip model significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T and 100% isochronous translations.

Via

Access Paper or Ask Questions

Practical Deep Dispersed Watermarking with Synchronization and Fusion

Oct 23, 2023

Hengchang Guo, Qilong Zhang, Junwei Luo, Feng Guo, Wenbin Zhang, Xiaodong Su, Minglei Li

Figure 1 for Practical Deep Dispersed Watermarking with Synchronization and Fusion

Figure 2 for Practical Deep Dispersed Watermarking with Synchronization and Fusion

Figure 3 for Practical Deep Dispersed Watermarking with Synchronization and Fusion

Figure 4 for Practical Deep Dispersed Watermarking with Synchronization and Fusion

Abstract:Deep learning based blind watermarking works have gradually emerged and achieved impressive performance. However, previous deep watermarking studies mainly focus on fixed low-resolution images while paying less attention to arbitrary resolution images, especially widespread high-resolution images nowadays. Moreover, most works usually demonstrate robustness against typical non-geometric attacks (\textit{e.g.}, JPEG compression) but ignore common geometric attacks (\textit{e.g.}, Rotate) and more challenging combined attacks. To overcome the above limitations, we propose a practical deep \textbf{D}ispersed \textbf{W}atermarking with \textbf{S}ynchronization and \textbf{F}usion, called \textbf{\proposed}. Specifically, given an arbitrary-resolution cover image, we adopt a dispersed embedding scheme which sparsely and randomly selects several fixed small-size cover blocks to embed a consistent watermark message by a well-trained encoder. In the extraction stage, we first design a watermark synchronization module to locate and rectify the encoded blocks in the noised watermarked image. We then utilize a decoder to obtain messages embedded in these blocks, and propose a message fusion strategy based on similarity to make full use of the consistency among messages, thus determining a reliable message. Extensive experiments conducted on different datasets convincingly demonstrate the effectiveness of our proposed {\proposed}. Compared with state-of-the-art approaches, our blind watermarking can achieve better performance: averagely improve the bit accuracy by 5.28\% and 5.93\% against single and combined attacks, respectively, and show less file size increment and better visual quality. Our code is available at https://github.com/bytedance/DWSF.

* Accpeted by ACM MM 2023

Via

Access Paper or Ask Questions