Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kamil Pokora

Creating New Voices using Normalizing Flows

Dec 22, 2023

Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa

Figure 1 for Creating New Voices using Normalizing Flows

Figure 2 for Creating New Voices using Normalizing Flows

Figure 3 for Creating New Voices using Normalizing Flows

Figure 4 for Creating New Voices using Normalizing Flows

Abstract:Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities. Firstly, we create an approach for TTS and VC, and then we comprehensively evaluate our methods and baselines in terms of intelligibility, naturalness, speaker similarity, and ability to create new voices. We use both objective and subjective metrics to benchmark our techniques on 2 evaluation tasks: zero-shot and new voice speech synthesis. The goal of the former task is to measure the precision of the conversion to an unseen voice. The goal of the latter is to measure the ability to create new voices. Extensive evaluations demonstrate that the proposed approach systematically allows to obtain state-of-the-art performance in zero-shot speech synthesis and creates various new voices, unobserved in the training set. We consider this work to be the first attempt to synthesize new voices based on mel-spectrograms and normalizing flows, along with a comprehensive analysis and comparison of the TTS and VC modes.

* Interspeech 2022, 2958-2962
* Interspeech 2022

Via

Access Paper or Ask Questions

Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

Sep 15, 2023

Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai, Sebastian Cygert, Kamil Pokora, Georgi Tinchev, Ziyao Zhang, Kayoko Yanagisawa

Abstract:In this work, we introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of 4 stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model. Finally, the last stage entails the training of a locale-independent vocoder. Our evaluations show that the proposed paradigm outperforms state-of-the-art approaches which are based on training a large multilingual TTS model. In addition, our experiments demonstrate the robustness of our approach with different model architectures, languages, speakers and amounts of data. Moreover, our solution is especially beneficial in low-resource settings.

* Accepted at ICONIP 2023

Via

Access Paper or Ask Questions

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Jul 31, 2023

Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski(+3 more)

Figure 1 for Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Figure 2 for Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Figure 3 for Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Figure 4 for Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Abstract:Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody models.

* 5 pages, 2 figures, 5 tables. Interspeech 2023

Via

Access Paper or Ask Questions

On granularity of prosodic representations in expressive text-to-speech

Jan 26, 2023

Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafal Sienkiewicz, Daniel Korzekwa, Viacheslav Klimkov

Figure 1 for On granularity of prosodic representations in expressive text-to-speech

Figure 2 for On granularity of prosodic representations in expressive text-to-speech

Figure 3 for On granularity of prosodic representations in expressive text-to-speech

Figure 4 for On granularity of prosodic representations in expressive text-to-speech

Abstract:In expressive speech synthesis it is widely adopted to use latent prosody representations to deal with variability of the data during training. Same text may correspond to various acoustic realizations, which is known as a one-to-many mapping problem in text-to-speech. Utterance, word, or phoneme-level representations are extracted from target signal in an auto-encoding setup, to complement phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity and phoneme-level tend to introduce instabilities when predicted from text. Word-level representations impose balance between capacity and predictability. As a result, we close the gap in naturalness by 90% between synthetic speech and recordings on LibriTTS dataset, without sacrificing intelligibility.

* Accepted to IEEE SLT 2022

Via

Access Paper or Ask Questions

Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows

Nov 10, 2022

Abdelhamid Ezzerg, Thomas Merritt, Kayoko Yanagisawa, Piotr Bilinski, Magdalena Proszewska, Kamil Pokora, Renard Korzeniowski, Roberto Barra-Chicote, Daniel Korzekwa

Figure 1 for Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows

Figure 2 for Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows

Figure 3 for Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows

Figure 4 for Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows

Abstract:Regional accents of the same language affect not only how words are pronounced (i.e., phonetic content), but also impact prosodic aspects of speech such as speaking rate and intonation. This paper investigates a novel flow-based approach to accent conversion using normalizing flows. The proposed approach revolves around three steps: remapping the phonetic conditioning, to better match the target accent, warping the duration of the converted speech, to better suit the target phonemes, and an attention mechanism that implicitly aligns source and target speech sequences. The proposed remap-warp-attend system enables adaptation of both phonetic and prosodic aspects of speech while allowing for source and converted speech signals to be of different lengths. Objective and subjective evaluations show that the proposed approach significantly outperforms a competitive CopyCat baseline model in terms of similarity to the target accent, naturalness and intelligibility.

* IEEE Spoken Language Technology Workshop 2022

Via

Access Paper or Ask Questions

Text-free non-parallel many-to-many voice conversion using normalising flows

Mar 15, 2022

Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

Figure 1 for Text-free non-parallel many-to-many voice conversion using normalising flows

Figure 2 for Text-free non-parallel many-to-many voice conversion using normalising flows

Figure 3 for Text-free non-parallel many-to-many voice conversion using normalising flows

Figure 4 for Text-free non-parallel many-to-many voice conversion using normalising flows

Abstract:Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a large challenge. This is particularly challenging in the scenario where at inference-time we have no knowledge of the text being read, i.e., text-free VC. To mitigate this, we investigate information-preserving VC approaches. Normalising flows have gained attention for text-to-speech synthesis, however have been under-explored for VC. Flows utilize invertible functions to learn the likelihood of the data, thus provide a lossless encoding of speech. We investigate normalising flows for VC in both text-conditioned and text-free scenarios. Furthermore, for text-free VC we compare pre-trained and jointly-learnt priors. Flow-based VC evaluations show no degradation between text-free and text-conditioned VC, resulting in improvements over the state-of-the-art. Also, joint-training of the prior is found to negatively impact text-free VC quality.

Via

Access Paper or Ask Questions

Enhancing audio quality for expressive Neural Text-to-Speech

Aug 13, 2021

Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

Figure 1 for Enhancing audio quality for expressive Neural Text-to-Speech

Figure 2 for Enhancing audio quality for expressive Neural Text-to-Speech

Figure 3 for Enhancing audio quality for expressive Neural Text-to-Speech

Figure 4 for Enhancing audio quality for expressive Neural Text-to-Speech

Abstract:Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and the use of Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.

* 6 pages, 4 figures, 2 tables, SSW 2021

Via

Access Paper or Ask Questions

Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Jun 25, 2021

Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa, Thomas Merritt

Figure 1 for Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Figure 2 for Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Figure 3 for Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Figure 4 for Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Abstract:Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (~10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more, we significantly outperform it. The following improvements are proposed: 1) changing from an autoregressive, attention-based TTS model to a non-autoregressive model replacing attention with an external duration model and 2) an additional Conditional Generative Adversarial Network (cGAN) based fine-tuning step.

* 6 pages, 5 figures. Accepted to Speech Synthesis Workshop (SSW) 2021

Via

Access Paper or Ask Questions