Title: WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Mar 31, 2022

Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby

Abstract: Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they prevent a data-driven approach from reaching its full potential through learned hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to use annotated speech datasets of lower quality to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as Wav2Vec 2.0 embeddings are time-aligned and speaker-independent. This results in better generalization to out-of-vocabulary words as well as to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also has useful properties that enable tasks like voice conversion or zero-shot synthesis.
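The two-stage data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: both stages are toy stand-ins that only reproduce the interface (text to embedding frames, embedding frames to waveform). The function names, the embedding width of 768, and the 320 samples per frame (a 20 ms stride at 16 kHz) are assumptions made for illustration and are not stated in the abstract.

```python
import math
import random

# Assumed values for illustration only; the abstract does not specify them.
FRAME_DIM = 768          # hypothetical embedding width per frame
SAMPLES_PER_FRAME = 320  # hypothetical: 20 ms frame stride at 16 kHz


def text_to_embeddings(text, frames_per_char=5, seed=0):
    """Stage 1 stand-in: map text to a sequence of embedding frames.

    A real first stage would be a trained model; here random vectors
    are produced so that only the output shape is meaningful.
    """
    rng = random.Random(seed)
    n_frames = len(text) * frames_per_char
    return [[rng.gauss(0.0, 1.0) for _ in range(FRAME_DIM)]
            for _ in range(n_frames)]


def embeddings_to_waveform(frames):
    """Stage 2 stand-in: map embedding frames to audio samples.

    Each frame is expanded to SAMPLES_PER_FRAME samples, which mirrors
    the time alignment between embedding frames and audio.
    """
    wave = []
    for frame in frames:
        level = math.tanh(sum(frame) / len(frame))  # keep samples in (-1, 1)
        wave.extend([level] * SAMPLES_PER_FRAME)
    return wave


def synthesize(text):
    """Chain the two stages: text -> embedding frames -> waveform."""
    return embeddings_to_waveform(text_to_embeddings(text))


if __name__ == "__main__":
    audio = synthesize("hi")
    print(len(audio), "samples")
```

Because the two stages communicate only through the embedding frames, each one can be trained on different data, which is the property the abstract relies on when it pairs lower-quality annotated speech for the first stage with untranscribed audio for the second.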

* Submitted to Interspeech'22
