Title: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Jan 05, 2023

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li (+3 more)

Abstract: We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than the data used by existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

* Work in progress
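
The abstract's framing of TTS as conditional language modeling over discrete codec codes can be illustrated with a toy autoregressive sampling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_next_token_probs` is a hypothetical stand-in for a trained model, `CODEBOOK_SIZE` is an illustrative value, and the integer token sequences are made up. It only shows the shape of the idea: each new codec token is sampled conditioned on the text tokens, the codes of the acoustic prompt, and the codes generated so far.

```python
import random
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024  # assumed codec vocabulary size, for illustration only

NextTokenFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def toy_next_token_probs(text_tokens: Sequence[int],
                         codec_context: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a trained model.

    Returns a probability distribution over codec tokens that depends on
    the text tokens and on all codec tokens seen so far.
    """
    # Derive a seed from the conditioning so the toy model is deterministic.
    rng = random.Random(hash((tuple(text_tokens), tuple(codec_context))))
    weights = [rng.random() for _ in range(CODEBOOK_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(text_tokens: Sequence[int],
             prompt_codes: Sequence[int],
             num_new: int,
             next_token_probs: NextTokenFn,
             rng: random.Random) -> List[int]:
    """Sample codec tokens one at a time: c_t ~ p(c_t | text, prompt, c_<t)."""
    codes = list(prompt_codes)  # codes of the short enrolled recording
    for _ in range(num_new):
        probs = next_token_probs(text_tokens, codes)
        token = rng.choices(range(CODEBOOK_SIZE), weights=probs, k=1)[0]
        codes.append(token)
    # Return only the newly generated continuation, not the prompt itself.
    return codes[len(prompt_codes):]


if __name__ == "__main__":
    text = [12, 7, 99]        # made-up text/phoneme token ids
    prompt = [5, 801, 43, 2]  # made-up codec codes for the acoustic prompt
    new_codes = generate(text, prompt, 8, toy_next_token_probs, random.Random(0))
    print(new_codes)
```

In a real system the generated codes would be passed to the codec's decoder to produce a waveform; that step is omitted here because it depends on the specific codec.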
