Title: NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

May 10, 2022

Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He (+4 more)

Abstract: Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test giving p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.

* 19 pages, 3 figures, 8 tables
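
The abstract judges human-level quality by a CMOS close to zero together with a Wilcoxon signed-rank test that finds no significant difference. The following is a minimal sketch of that kind of check, not the authors' evaluation code. The function names, the integer score scale, and the sample scores are all hypothetical, and the p-value uses the normal approximation without continuity correction, which is only reasonable for larger samples.

```python
import math
from statistics import mean


def cmos(scores):
    """Mean of per-sentence comparative scores (system minus human)."""
    return mean(scores)


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied absolute values share an
    average rank. Returns (W_plus, p_value).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)

    # Mean and tie-corrected variance of W_plus under the null hypothesis.
    mu = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mu) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, p_value


# Hypothetical per-sentence comparative ratings, for illustration only.
example_scores = [1, -1, 0, 2, -2, 0, 1, -1, 0, -1, 1, 0]
w, p = wilcoxon_signed_rank(example_scores)
print(f"CMOS = {cmos(example_scores):+.2f}, W+ = {w}, p = {p:.3f}")
```

Under this reading, a large p-value means the test cannot tell the synthesized and recorded speech apart at the chosen significance level. It does not prove the two are identical.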
