Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
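To make the two-stage design concrete, below is a minimal PyTorch sketch of a speechcode language model followed by a convolutional decoder. The module names (SpeechcodeLM, ConvDecoder), layer sizes, and toy codebook size are assumptions for illustration only, not the BASE TTS implementation; the speaker-disentangled tokenizer and byte-pair-encoded speechcodes described in the abstract are not reproduced here.

```python
# Minimal sketch, NOT the BASE TTS implementation: module names, sizes and the
# toy codebook are illustrative assumptions. Speaker disentanglement and BPE
# compression of speechcodes are omitted.
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    """Autoregressive Transformer mapping text tokens to discrete speechcodes."""
    def __init__(self, n_text=256, n_codes=1024, d_model=512, n_layers=4):
        super().__init__()
        self.text_emb = nn.Embedding(n_text, d_model)
        self.code_emb = nn.Embedding(n_codes, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, text_ids, code_ids):
        memory = self.text_emb(text_ids)               # encoded text prompt
        tgt = self.code_emb(code_ids)                  # previously emitted speechcodes
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(h)                            # logits over the next speechcode

class ConvDecoder(nn.Module):
    """Convolutional decoder turning speechcodes into waveform chunks (streamable)."""
    def __init__(self, n_codes=1024, d_model=512, upsample=256):
        super().__init__()
        self.emb = nn.Embedding(n_codes, d_model)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(d_model, 128, upsample, stride=upsample),
            nn.GELU(),
            nn.Conv1d(128, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, codes):
        x = self.emb(codes).transpose(1, 2)            # (B, d_model, T)
        return self.net(x).squeeze(1)                  # (B, T * upsample) waveform

text = torch.randint(0, 256, (1, 32))
codes = torch.randint(0, 1024, (1, 64))
logits = SpeechcodeLM()(text, codes)                   # next-speechcode prediction
audio = ConvDecoder()(codes)                           # decode codes to audio
print(logits.shape, audio.shape)
```

Because the decoder is convolutional and operates on already-emitted speechcodes, waveform chunks can in principle be produced incrementally as codes arrive, which is what makes the second stage streamable.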
Abstract: Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to human recordings. However, not all speaking styles are easy to model: highly expressive voices remain challenging even for recent TTS architectures, since there appears to be a trade-off between expressiveness in the generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly expressive voice without the use of additional data. The proposed techniques include: tuning the granularity of the autoregressive loop during training; using Generative Adversarial Networks in acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
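As a rough illustration of how the latter two techniques could fit together, the sketch below combines a VAE-style reference encoder with an adversarial discriminator term in an acoustic-model loss. All module names, layer sizes, and loss weights are assumptions for illustration, not the paper's architecture; the tuning of the autoregressive loop's granularity is not shown.

```python
# Illustrative sketch only: how a VAE-style reference encoder and a GAN
# discriminator might be combined in acoustic-model training. Names, sizes and
# loss weights are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceVAE(nn.Module):
    """Encodes a reference mel-spectrogram into a latent prosody vector."""
    def __init__(self, n_mels=80, z_dim=16):
        super().__init__()
        self.enc = nn.GRU(n_mels, 128, batch_first=True)
        self.mu = nn.Linear(128, z_dim)
        self.logvar = nn.Linear(128, z_dim)

    def forward(self, mel):
        _, h = self.enc(mel)                               # final hidden state
        mu, logvar = self.mu(h[-1]), self.logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

class MelDiscriminator(nn.Module):
    """Scores whether a mel-spectrogram looks natural (adversarial signal)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 5, padding=2),
        )

    def forward(self, mel):
        return self.net(mel.transpose(1, 2)).mean(dim=(1, 2))

def acoustic_loss(pred_mel, target_mel, kl, disc, kl_w=1e-2, adv_w=1e-1):
    recon = F.l1_loss(pred_mel, target_mel)                # spectrogram reconstruction
    adv = -disc(pred_mel).mean()                           # push predictions to fool the critic
    return recon + kl_w * kl + adv_w * adv

mel = torch.randn(2, 120, 80)
z, kl = ReferenceVAE()(mel)
loss = acoustic_loss(mel + 0.01 * torch.randn_like(mel), mel, kl, MelDiscriminator())
print(z.shape, loss.item())
```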
Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recorded data from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (~10 hours) model using only 15 minutes of target speaker data, and with 30 minutes or more we significantly outperform it. The following improvements are proposed: 1) replacing the autoregressive, attention-based TTS model with a non-autoregressive model in which attention is replaced by an external duration model, and 2) adding a Conditional Generative Adversarial Network (cGAN) based fine-tuning step.
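A minimal sketch of the first proposed change, replacing attention-based alignment with an external duration model plus a length regulator, is given below. Module names and dimensions are assumptions for illustration, not the paper's architecture; the cGAN fine-tuning step is not shown.

```python
# Illustrative sketch only: an external duration model and length regulator in
# place of attention-based alignment. Names and sizes are assumptions.
import torch
import torch.nn as nn

class DurationModel(nn.Module):
    """Predicts how many acoustic frames each phoneme should span."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, phone_enc):                          # (n_phones, d_model)
        return self.net(phone_enc).squeeze(-1).clamp(min=1.0).round().long()

def length_regulate(phone_enc, durations):
    """Upsample phoneme encodings to frame rate by repeating each one."""
    return phone_enc.repeat_interleave(durations, dim=0)   # (n_frames, d_model)

class FrameDecoder(nn.Module):
    """Non-autoregressive decoder: all mel frames are produced in parallel."""
    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, n_mels))

    def forward(self, frames):
        return self.net(frames)

phones = torch.randn(12, 256)                  # 12 encoded phonemes
durs = DurationModel()(phones)                 # external duration prediction
mel = FrameDecoder()(length_regulate(phones, durs))
print(durs.sum().item(), mel.shape)            # total frames, (n_frames, 80)
```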
Abstract: We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our universal vocoder offers real-time, high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not seen during training. We show that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall. We also show that the proposed vocoder outperforms several existing neural vocoder architectures in terms of naturalness and universality. These findings are consistent when we further test on more than 300 open-source voices.
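The sketch below illustrates the general idea of an Audio Encoder style conditioning network: a small encoder summarises the conditioning mel-spectrogram into an embedding that is fed, alongside the mel frames, to a (much simplified) vocoder. Names, sizes, and the toy vocoder itself are assumptions for illustration, not the Parallel WaveNet based system described above.

```python
# Illustrative sketch only: an "Audio Encoder" style conditioning network and a
# toy conditioned vocoder. Names and sizes are assumptions, not the Parallel
# WaveNet setup from the paper.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Compresses the conditioning mel-spectrogram into a global embedding."""
    def __init__(self, n_mels=80, d_emb=64):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, d_emb, kernel_size=5, padding=2)

    def forward(self, mel):                                # (B, T, n_mels)
        h = torch.relu(self.conv(mel.transpose(1, 2)))
        return h.mean(dim=2)                               # (B, d_emb) utterance summary

class TinyVocoder(nn.Module):
    """Maps mel frames plus the global embedding to waveform samples."""
    def __init__(self, n_mels=80, d_emb=64, upsample=256):
        super().__init__()
        self.up = nn.ConvTranspose1d(n_mels + d_emb, 64, upsample, stride=upsample)
        self.out = nn.Conv1d(64, 1, kernel_size=7, padding=3)

    def forward(self, mel, emb):
        cond = emb.unsqueeze(1).expand(-1, mel.size(1), -1)   # broadcast per frame
        x = torch.cat([mel, cond], dim=-1).transpose(1, 2)
        return torch.tanh(self.out(torch.relu(self.up(x)))).squeeze(1)

mel = torch.randn(1, 100, 80)
emb = AudioEncoder()(mel)
audio = TinyVocoder()(mel, emb)
print(emb.shape, audio.shape)                              # (1, 64), (1, 100 * 256)
```

Conditioning the vocoder on a learned summary of the input audio, rather than only on speaker identity, is what lets a single model generalise to voices and styles not seen during training.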