Abstract: Neural-network-based end-to-end Text-to-Speech (TTS) has greatly improved the quality of synthesized speech. However, how to efficiently use massive spontaneous speech without manual transcription remains an open problem. In this paper, we propose MHTTS, a fast multi-speaker TTS system that is robust to transcription errors and to spontaneous-style speech data. Specifically, we introduce a multi-head model and transfer text information from a high-quality corpus with manual transcription to spontaneous speech with imperfectly recognized transcription by training on both jointly. MHTTS has three advantages: 1) Our system synthesizes better-quality multi-speaker voice with faster inference. 2) Our system can transfer correct text information to data with imperfect transcription, whether simulated via corruption or produced by an Automatic Speech Recogniser (ASR). 3) Our system can exploit massive real spontaneous speech with imperfect transcription to synthesize expressive voice.
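As a rough illustration of the joint-training idea, the sketch below pairs a shared text encoder with per-corpus output heads: one trained on manually transcribed data, one on ASR-transcribed spontaneous data. This is a minimal sketch under stated assumptions, not the MHTTS architecture; all module choices, dimensions, and the two-head split are illustrative, and duration modeling is omitted.

```python
# Minimal sketch of joint training with corpus-specific heads (not the
# authors' code; every name, dimension, and the two-head split are
# illustrative assumptions, and duration modeling is omitted).
import torch
import torch.nn as nn

class MultiHeadAcousticModel(nn.Module):
    def __init__(self, vocab_size=80, d_model=256, n_mels=80, n_heads=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shared encoder learns text information from both corpora.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=3)
        # One projection head per corpus: head 0 sees clean manual
        # transcripts, head 1 absorbs ASR transcription noise.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, n_mels) for _ in range(n_heads))

    def forward(self, tokens, head_id):
        hidden = self.encoder(self.embed(tokens))
        return self.heads[head_id](hidden)

model = MultiHeadAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

# One joint training step over both corpora (dummy tensors).
optimizer.zero_grad()
for head_id in (0, 1):  # 0: manual transcripts, 1: ASR transcripts
    tokens = torch.randint(0, 80, (4, 32))  # (batch, text length)
    mels = torch.randn(4, 32, 80)           # (batch, frames, mel bins)
    criterion(model(tokens, head_id), mels).backward()
optimizer.step()
```

The intuition the sketch encodes is that the shared encoder is anchored by the clean corpus, so text information transfers to the noisy corpus while its head soaks up transcription errors.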
Abstract: A lightweight end-to-end acoustic system is crucial for deploying text-to-speech models. Finding one that produces good audio with low latency and few errors remains an open problem. In this paper, we propose a new non-autoregressive, fully parallel acoustic system that combines a new attention structure with a recently proposed convolutional structure. Compared with the most popular end-to-end text-to-speech systems, our acoustic system produces audio of equal or better quality with fewer errors and achieves at least a 10x inference speed-up.
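The abstract does not specify the attention or convolutional structures, so the sketch below only illustrates the non-autoregressive property that makes the speed-up possible: all mel frames are produced in one parallel forward pass via predicted token durations, with no frame-by-frame loop. The duration-based length regulator and the convolutional stack here are generic stand-ins, not the paper's blocks.

```python
# Sketch of a non-autoregressive, fully parallel acoustic model. The
# conv stack and duration-based length regulator are assumed stand-ins;
# the point shown is single-pass, loop-free mel generation.
import torch
import torch.nn as nn

class ParallelAcousticModel(nn.Module):
    def __init__(self, vocab_size=80, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2))
        self.duration = nn.Linear(d_model, 1)  # log-frames per token
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, tokens):
        h = self.embed(tokens)                          # (B, T, D)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)
        # Predict a duration per token and expand (length regulation),
        # so decoding is parallel over all output frames at once.
        dur = torch.clamp(torch.round(
            torch.exp(self.duration(h)).squeeze(-1)), min=1).long()
        frames = [h[b].repeat_interleave(dur[b], dim=0)
                  for b in range(h.size(0))]
        frames = nn.utils.rnn.pad_sequence(frames, batch_first=True)
        return self.mel_out(frames)                     # (B, frames, mels)

tokens = torch.randint(0, 80, (2, 16))
mel = ParallelAcousticModel()(tokens)  # one parallel pass, no AR loop
print(mel.shape)
```

Because no frame depends on the previously generated frame, inference cost is one network pass regardless of utterance length, which is where claims like a 10x speed-up over autoregressive systems typically come from.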