Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Peng Chang

EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

Apr 11, 2025

Renda Li, Xiaohua Qi, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei HanJing Xiao

Abstract:Audio-driven cospeech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. In order to improve the generation effect, previous works adopted complex input and training strategies and required a large amount of data sets for pre-training, which brought inconvenience to practical applications. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without the need for additional training of temporal modules.The entire model makes use of existing pre-trained weights, and only a few thousand frames of data are needed for each character at a time to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.

Via

Access Paper or Ask Questions

Data-free Knowledge Distillation with Diffusion Models

Apr 01, 2025

Xiaohua Qi, Renda Li, Long Peng, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao

Abstract:Recently Data-Free Knowledge Distillation (DFKD) has garnered attention and can transfer knowledge from a teacher neural network to a student neural network without requiring any access to training data. Although diffusion models are adept at synthesizing high-fidelity photorealistic images across various domains, existing methods cannot be easiliy implemented to DFKD. To bridge that gap, this paper proposes a novel approach based on diffusion models, DiffDFKD. Specifically, DiffDFKD involves targeted optimizations in two key areas. Firstly, DiffDFKD utilizes valuable information from teacher models to guide the pre-trained diffusion models' data synthesis, generating datasets that mirror the training data distribution and effectively bridge domain gaps. Secondly, to reduce computational burdens, DiffDFKD introduces Latent CutMix Augmentation, an efficient technique, to enhance the diversity of diffusion model-generated images for DFKD while preserving key attributes for effective knowledge transfer. Extensive experiments validate the efficacy of DiffDFKD, yielding state-of-the-art results exceeding existing DFKD approaches. We release our code at https://github.com/xhqi0109/DiffDFKD.

* Accepted by ICME2025

Via

Access Paper or Ask Questions

SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

Mar 17, 2025

Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson

Abstract:Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but experiencing lower synchronization compared to their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT, as conditioning input into the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3 for direct comparison with other SOTA models. Experiments on LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods, while preserving their high-fidelity characteristics.

* Accepted to WACV 2025

Via

Access Paper or Ask Questions

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

Aug 28, 2023

Hengyuan Zhang, Peng Chang, Zongcheng Ji

Abstract:In the realm of natural language processing, the understanding of tabular data has perpetually stood as a focal point of scholarly inquiry. The emergence of expansive language models, exemplified by the likes of ChatGPT, has ushered in a wave of endeavors wherein researchers aim to harness these models for tasks related to table-based question answering. Central to our investigative pursuits is the elucidation of methodologies that amplify the aptitude of such large language models in discerning both the structural intricacies and inherent content of tables, ultimately facilitating their capacity to provide informed responses to pertinent queries. To this end, we have architected a distinctive module dedicated to the serialization of tables for seamless integration with expansive language models. Additionally, we've instituted a corrective mechanism within the model to rectify potential inaccuracies. Experimental results indicate that, although our proposed method trails the SOTA by approximately 11.7% in overall metrics, it surpasses the SOTA by about 1.2% in tests on specific datasets. This research marks the first application of large language models to table-based question answering tasks, enhancing the model's comprehension of both table structures and content.

Via

Access Paper or Ask Questions

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Apr 15, 2023

Ruchao Fan, Wei Chu, Peng Chang, Abeer Alwan

Figure 1 for A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Figure 2 for A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Figure 3 for A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Figure 4 for A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Abstract:Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs with the acoustical boundary information offered by the CTC alignment. TAE can be obtained in parallel, resulting in a parallel generation of output tokens. During training, Viterbi-alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch in the training and testing processes. Experimental results show that the CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.

* Published in IEEE Transactions on Audio, Speech, and Language Processing

Via

Access Paper or Ask Questions

Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

May 06, 2022

Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass

Figure 1 for Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

Figure 2 for Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

Figure 3 for Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

Figure 4 for Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

Abstract:Automatic pronunciation assessment is an important technology to help self-directed language learners. While pronunciation quality has multiple aspects including accuracy, fluency, completeness, and prosody, previous efforts typically only model one aspect (e.g., accuracy) at one granularity (e.g., at the phoneme-level). In this work, we explore modeling multi-aspect pronunciation assessment at multiple granularities. Specifically, we train a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning. Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.

* Accepted at ICASSP 2022. Code at https://github.com/YuanGongND/gopt Interactive Colab demo at https://colab.research.google.com/github/YuanGongND/gopt/blob/master/colab/GOPT_GPU.ipynb . ICASSP 2022

Via

Access Paper or Ask Questions

An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Jul 22, 2021

Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan

Figure 1 for An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Figure 2 for An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Figure 3 for An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Figure 4 for An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Abstract:Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods, are 3.1%/7.2% on Librispeech test clean/other sets and the CER is 5.4% on the Aishell1 test set, achieving a 7%~21% relative WER/CER improvement. For the analyses, we plot attention weight distributions in the decoders to visualize the relationships between token-level acoustic embeddings. When the acoustic embeddings are visualized, we find that they have a similar behavior to word embeddings, which explains why the improved CASS-NAT performs similarly to AT.

* Accepted to Interspeech2021

Via

Access Paper or Ask Questions

CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Oct 28, 2020

Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao

Figure 1 for CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Figure 2 for CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Figure 3 for CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Figure 4 for CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Abstract:We propose a CTC alignment-based single step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains the information of (a) the number of tokens for decoder input, and (b) the time span of acoustics for each token. The information are used to extract acoustic representation for each token in parallel, referred to as token-level acoustic embedding which substitutes the word embedding in autoregressive transformer (AT) to achieve parallel generation in decoder. During inference, an error-based alignment sampling method is proposed to be applied to the CTC output space, reducing the WER and retaining the parallelism as well. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on Librispeech test clean/other dataset without an external LM, and a CER of 5.8% on Aishell1 Mandarin corpus, respectively1. Compared to the AT baseline, the CASS-NAT has a performance reduction on WER, but is 51.2x faster in terms of RTF. When decoding with an oracle CTC alignment, the lower bound of WER without LM reaches 2.3% on the test-clean set, indicating the potential of the proposed method.

* Submitted to ICASSP2021

Via

Access Paper or Ask Questions

Model-Based Manipulation of Linear Flexible Objects with Visual Curvature Feedback

Jul 16, 2020

Peng Chang, Taskin Padir

Figure 1 for Model-Based Manipulation of Linear Flexible Objects with Visual Curvature Feedback

Figure 2 for Model-Based Manipulation of Linear Flexible Objects with Visual Curvature Feedback

Figure 3 for Model-Based Manipulation of Linear Flexible Objects with Visual Curvature Feedback

Figure 4 for Model-Based Manipulation of Linear Flexible Objects with Visual Curvature Feedback

Abstract:Manipulation of deformable objects is a desired skill in making robots ubiquitous in manufacturing, service, healthcare, and security. Deformable objects are common in our daily lives, e.g., wires, clothes, bed sheets, etc., and are significantly more difficult to model than rigid objects. In this study, we investigate vision-based manipulation of linear flexible objects such as cables. We propose a geometric modeling method that is based on visual feedback to develop a general representation of the linear flexible object that is subject to gravity. The model characterizes the shape of the object by combining the curvatures on two projection planes. In this approach, we achieve tracking of the position and orientation (pose) of a cable-like object, the pose of its tip, and the pose of the selected grasp point on the object, which enables closed-loop manipulation of the object. We demonstrate the feasibility of our approach by completing the Plug Task used in the 2015 DARPA Robotics Challenge Finals, which involves unplugging a power cable from one socket and plugging it into another. Experiments show that we can successfully complete the task autonomously within 30 seconds.

* This paper is accepted for The 2020 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2020)

Via

Access Paper or Ask Questions

Sim2Real2Sim: Bridging the Gap Between Simulation and Real-World in Flexible Object Manipulation

Feb 10, 2020

Peng Chang, Taskin Padir

Figure 1 for Sim2Real2Sim: Bridging the Gap Between Simulation and Real-World in Flexible Object Manipulation

Figure 2 for Sim2Real2Sim: Bridging the Gap Between Simulation and Real-World in Flexible Object Manipulation

Figure 3 for Sim2Real2Sim: Bridging the Gap Between Simulation and Real-World in Flexible Object Manipulation

Figure 4 for Sim2Real2Sim: Bridging the Gap Between Simulation and Real-World in Flexible Object Manipulation

Abstract:This paper addresses a new strategy called Simulation-to-Real-to-Simulation (Sim2Real2Sim) to bridge the gap between simulation and real-world, and automate a flexible object manipulation task. This strategy consists of three steps: (1) using the rough environment with the estimated models to develop the methods to complete the manipulation task in the simulation; (2) applying the methods from simulation to real-world and comparing their performance; (3) updating the models and methods in simulation based on the differences between the real world and the simulation. The Plug Task from the 2015 DARPA Robotics Challenge Finals is chosen to evaluate our Sim2Real2Sim strategy. A new identification approach for building the model of the linear flexible objects is derived from real-world to simulation. The automation of the DRC plug task in both simulation and real-world proves the success of the Sim2Real2Sim strategy. Numerical experiments are implemented to validate the simulated model.

* This paper is accepted for the IEEE International Conference on Robotic Computing (IRC) 2020

Via

Access Paper or Ask Questions