State-of-the-art non-autoregressive text-to-speech (TTS) models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For expressive speech datasets, however, we observe characteristic audio distortions. We demonstrate that such artefacts are introduced into the vocoder reconstruction by over-smooth mel-spectrogram predictions, which are induced by the choice of mean-squared-error (MSE) loss for training the mel-spectrogram decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of the training distribution, which might not lie close to a natural sample if the distribution remains multimodal after conditioning on all input signals. To alleviate this problem, we introduce TVC-GMM, a mixture model of Trivariate-Chain Gaussian distributions, to model the residual multimodality. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, particularly for expressive datasets, as shown by both objective and subjective evaluation.
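To illustrate the core idea of replacing the MSE point estimate with a probabilistic decoder head, the following is a minimal, hypothetical sketch of a mixture-density output layer trained with negative log-likelihood. It simplifies the approach to an independent univariate Gaussian mixture per mel bin; the full TVC-GMM instead models trivariate chains of adjacent time-frequency bins. The class and function names (`GMMHead`, `gmm_nll`) and the hyperparameters (`hidden_dim`, `n_components`) are assumptions for illustration, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMHead(nn.Module):
    """Predicts a K-component Gaussian mixture per mel bin instead of a point estimate."""

    def __init__(self, hidden_dim: int, n_mels: int = 80, n_components: int = 4):
        super().__init__()
        self.n_mels = n_mels
        self.n_components = n_components
        # Per mel bin and component: mixture logit, mean, and log standard deviation.
        self.proj = nn.Linear(hidden_dim, n_mels * n_components * 3)

    def forward(self, decoder_out: torch.Tensor):
        # decoder_out: (batch, frames, hidden_dim)
        b, t, _ = decoder_out.shape
        params = self.proj(decoder_out).view(b, t, self.n_mels, self.n_components, 3)
        logits, mean, log_std = params.unbind(dim=-1)
        return logits, mean, log_std


def gmm_nll(logits, mean, log_std, target):
    """Negative log-likelihood of the target mel-spectrogram under the predicted mixture."""
    # target: (batch, frames, n_mels); broadcast over the component dimension.
    target = target.unsqueeze(-1)
    log_prob = (
        -0.5 * ((target - mean) / log_std.exp()) ** 2
        - log_std
        - 0.5 * math.log(2 * math.pi)
    )
    log_weights = F.log_softmax(logits, dim=-1)
    # Mixture log-likelihood: log-sum-exp over components, then average.
    return -torch.logsumexp(log_weights + log_prob, dim=-1).mean()
```

At inference time, sampling a component and then a value from its Gaussian (rather than taking the conditional mean, as MSE training implies) is what allows the predicted spectrogram to retain fine structure instead of collapsing to an over-smooth average.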