Xinyuan Zhou

DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

Oct 26, 2023

Yongxin Zhu, Zhujin Gao, Xinyuan Zhou, Zhongyi Ye, Linli Xu

Abstract: While diffusion generative models have achieved great success on image generation tasks, how to incorporate them efficiently and effectively into speech generation, especially translation tasks, remains a non-trivial problem. Specifically, due to the low information density of speech data, the transformed discrete speech unit sequence is much longer than the corresponding text transcription, posing significant challenges to existing auto-regressive models. Furthermore, it is suboptimal to naively apply discrete diffusion to the speech unit sequence while disregarding the structure of the continuous space, which degrades generation performance significantly. In this paper, we propose a novel diffusion model that applies the diffusion forward process in the continuous speech representation space, while employing the diffusion backward process in the discrete speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space in the diffusion process and integrate the continuous and discrete diffusion models. We conduct extensive experiments on the textless direct speech-to-speech translation task, where the proposed method achieves results comparable to the computationally intensive auto-regressive baselines (500 steps on average) with significantly fewer decoding steps (50 steps).

* Accepted to the EMNLP 2023 main conference
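
To make the continuous-forward, discrete-backward idea in the abstract concrete, here is a minimal toy sketch. It is not the authors' implementation. The codebook `CODEBOOK`, the function names, and the two-dimensional embeddings are hypothetical, and a nearest-neighbour lookup stands in for the learned denoising network. The forward step uses the standard Gaussian diffusion form `sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps`, which is assumed here rather than taken from the paper.

```python
import math
import random

# Hypothetical toy codebook: discrete unit id -> continuous embedding.
# Real speech-unit codebooks are much larger; these values are illustrative only.
CODEBOOK = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (-1.0, 0.0), 3: (0.0, -1.0)}


def forward_noise(x0, alpha_bar, rng):
    """Gaussian forward step in the continuous space.

    Returns sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, with eps ~ N(0, I).
    """
    return tuple(
        math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for v in x0
    )


def nearest_unit(x):
    """Stand-in for a learned denoiser: return the closest codebook unit id."""
    return min(CODEBOOK, key=lambda u: math.dist(CODEBOOK[u], x))


def noise_and_decode(units, alpha_bar, seed=0):
    """Noise each unit's embedding, then map the result back to discrete units."""
    rng = random.Random(seed)
    noisy = [forward_noise(CODEBOOK[u], alpha_bar, rng) for u in units]
    return [nearest_unit(x) for x in noisy]


if __name__ == "__main__":
    # alpha_bar close to 1 means little noise; smaller values corrupt more units.
    print(noise_and_decode([0, 1, 2, 3], alpha_bar=0.9))
```

Because the noise is added to embeddings rather than to unit ids, a lightly corrupted unit tends to land near its original codebook entry or a close neighbour, which is one way to read the abstract's point about preserving the structure of the continuous space.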

CNN-based Discriminative Training for Domain Compensation in Acoustic Event Detection with Frame-wise Classifier

Mar 26, 2021

Tiantian Tang, Xinyuan Zhou, Yanhua Long, Yijie Li, Jiaen Liang

Abstract: Domain mismatch is a noteworthy issue in acoustic event detection tasks, as target-domain data is difficult to access in most real applications. In this study, we propose a novel CNN-based discriminative training framework as a domain compensation method to handle this issue. It uses a parallel CNN-based discriminator to learn a pair of high-level intermediate acoustic representations. Together with a binary discriminative loss, the discriminators are forced to maximally exploit the discrimination of heterogeneous acoustic information in each audio clip with target events, which results in robust paired representations that can discriminate the target events and the background/domain variations separately. Moreover, to better learn the transient characteristics of target events, a frame-wise classifier is designed to perform the final classification. In addition, a two-stage training procedure with CNN-based discriminator initialization is proposed to enhance system training. All experiments are performed on the DCASE 2018 Task 3 datasets. Results show that our proposal significantly outperforms the official baseline in AUC on cross-domain conditions, by a relative 1.8% to 12.1%, without any performance degradation on in-domain evaluation conditions.
