Piotr Dura

WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Mar 31, 2022

Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby

Abstract: Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not allow a data-driven approach to exploit its full potential by learning hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to use annotated speech datasets of lower quality to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as Wav2Vec 2.0 embeddings are time-aligned and speaker-independent. This results in better generalization both to out-of-vocabulary words and to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.

* Submitted to Interspeech'22
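
The time alignment mentioned in the abstract can be made concrete with a small sketch. It assumes the convolutional feature-encoder layout of the publicly released Wav2Vec 2.0 models (seven layers, 16 kHz input); the abstract does not say which Wav2Vec 2.0 variant the authors use, so the constants and function names below are illustrative assumptions, not the authors' code. The point is that each embedding frame covers a fixed step of audio, which is what lets a second-stage model be trained on (embedding, waveform) pairs without transcriptions.

```python
# Assumption: (kernel, stride) of each conv layer in the released
# Wav2Vec 2.0 feature encoder. Not taken from the WavThruVec abstract.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # Hz, the input rate the released models expect


def total_stride() -> int:
    """Number of input samples between two consecutive embedding frames."""
    stride = 1
    for _, layer_stride in CONV_LAYERS:
        stride *= layer_stride
    return stride


def num_frames(num_samples: int) -> int:
    """Number of embedding frames produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to produce a single frame
        length = (length - kernel) // stride + 1
    return length


frame_period_ms = 1000 * total_stride() / SAMPLE_RATE

print(total_stride())       # 320 samples per frame step
print(frame_period_ms)      # 20.0 ms per frame
print(num_frames(16_000))   # 49 frames for one second of audio
```

Under these assumptions, a vocoder-like second stage only needs to upsample each frame by a fixed factor of 320 samples, independent of the text or the speaker.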

VoiceMe: Personalized voice generation in TTS

Mar 29, 2022

Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby

Abstract: Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains difficult to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit photos of faces, art portraits, and cartoons well. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well recovered in the voice, and (3) people consistently move towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide range of applications, including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairments.

* Submitted to Interspeech'22
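
The abstract does not spell out the sampling procedure, so the following is only a generic sketch of how a human-in-the-loop search over a speaker embedding could be organized: one dimension is varied at a time, a rater picks the best-sounding candidate, and the choice becomes the starting point for the next step. All names here (`explore_speaker_space`, `simulated_rater`, the 4-dimensional toy embedding) are hypothetical, and the rater is simulated by distance to a hidden target; in a real system each candidate would be rendered by the TTS model and judged by a participant.

```python
import random
from typing import Callable, List

Vector = List[float]


def explore_speaker_space(
    start: Vector,
    choose: Callable[[List[Vector]], int],
    rounds: int,
    candidates_per_step: int = 5,
    span: float = 1.0,
    seed: int = 0,
) -> Vector:
    """Coordinate-wise search: vary one dimension per step, keep the rater's pick."""
    rng = random.Random(seed)
    current = list(start)
    for step in range(rounds):
        dim = step % len(current)  # cycle through embedding dimensions
        options = []
        for _ in range(candidates_per_step):
            candidate = list(current)
            candidate[dim] = current[dim] + rng.uniform(-span, span)
            options.append(candidate)
        options.append(list(current))  # the rater may also keep the current voice
        current = options[choose(options)]
    return current


def distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical stand-in for a human rater: prefers the candidate closest
# to a hidden "ideal voice" for the face being shown.
hidden_target = [0.5, -1.0, 2.0, 0.0]


def simulated_rater(options: List[Vector]) -> int:
    return min(range(len(options)), key=lambda i: distance(options[i], hidden_target))


start = [0.0, 0.0, 0.0, 0.0]
result = explore_speaker_space(start, simulated_rater, rounds=40)
print(distance(start, hidden_target), distance(result, hidden_target))
```

Because the current embedding is always among the options, the simulated rater never moves further from its target; with real participants the trajectory would be noisier, which is one reason to aggregate choices across many raters.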
