Title: ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Feb 16, 2022

Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Abstract: Expressive text-to-speech (TTS) has recently become a hot research topic, with most work focusing on modeling prosody in speech. Prosody modeling faces several challenges: 1) the pitch extracted in previous prosody modeling work inevitably contains errors, which hurts prosody modeling; 2) different attributes of prosody (e.g., pitch, duration, and energy) depend on each other and jointly produce natural prosody; and 3) due to the high variability of prosody and the limited amount of high-quality data for TTS training, the distribution of prosody cannot be fully shaped. To tackle these issues, we propose ProsoSpeech, which enhances prosody using quantized latent vectors pre-trained on large-scale unpaired and low-quality text and speech data. Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into the latent prosody vector (LPV). Then we introduce an LPV predictor, which predicts the LPV given the word sequence. We pre-train the LPV predictor on large-scale text and low-quality speech data and fine-tune it on the high-quality TTS dataset. Finally, our model can generate expressive speech conditioned on the predicted LPV. Experimental results show that ProsoSpeech can generate speech with richer prosody than baseline methods.

* Accepted by ICASSP 2022
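
The abstract describes quantizing word-level prosody information into latent prosody vectors (LPVs). As a rough illustration of the quantization step only, the sketch below maps each word-level latent vector to its nearest entry in a codebook. This is not the authors' implementation: the function names, the vector dimension, the codebook size, and the random data are all hypothetical, and the encoder that would produce the latents from the low-frequency band of the speech is not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    Returns the quantized vectors and the chosen codebook indices.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in latents
    ]
    return [codebook[i] for i in indices], indices


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
DIM, CODEBOOK_SIZE, NUM_WORDS = 4, 8, 5

random.seed(0)
# Stand-in for a learned codebook of prosody vectors.
codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
# Stand-in for one continuous latent per word from a prosody encoder.
word_latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_WORDS)]

quantized, indices = quantize(word_latents, codebook)
print(indices)  # one codebook index per word
```

In a setup like this, a predictor trained on text would only need to output one discrete index per word, which is one plausible reason a quantized prosody representation is convenient to pre-train on large text corpora.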
