Katsuya Uenoyama

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

Oct 24, 2017

Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara

Figures 1-4 for Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

Abstract: This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units. Recurrent neural networks (RNN) have recently become a standard technique for modeling sequential data, and they have been used in some cutting-edge neural TTS techniques. However, training an RNN component often requires a very powerful computer or a very long time, typically several days or weeks. Other recent studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques because of its high parallelizability. The objective of this paper is to show an alternative neural TTS system, based only on CNN, that can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional TTS could be sufficiently trained overnight (15 hours), using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.

* submitted to ICASSP 2018
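
The parallelizability argument in the abstract can be illustrated with a toy sketch. This is not the paper's architecture: the function names (`causal_conv1d`, `simple_rnn`), the scalar weights, and the input values are all hypothetical, chosen only to contrast the data dependencies of a convolution with those of a recurrent unit.

```python
import math

def causal_conv1d(x, kernel):
    """Toy causal 1-D convolution: y[t] depends only on x[t-k+1..t]."""
    k = len(kernel)

    def output_at(t):
        # Each output reads only the inputs, never another output,
        # so every time step could be computed independently (in parallel).
        return sum(kernel[j] * x[t - j] for j in range(k) if t - j >= 0)

    return [output_at(t) for t in range(len(x))]

def simple_rnn(x, w_in, w_rec):
    """Toy scalar RNN: h[t] = tanh(w_in * x[t] + w_rec * h[t-1])."""
    h, states = 0.0, []
    for x_t in x:
        # Step t needs the hidden state from step t-1,
        # so this loop is inherently sequential along the time axis.
        h = math.tanh(w_in * x_t + w_rec * h)
        states.append(h)
    return states

# Hypothetical input sequence and weights, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
conv_out = causal_conv1d(x, kernel=[0.5, 0.25])
rnn_out = simple_rnn(x, w_in=0.1, w_rec=0.5)
```

In the convolution, `output_at(t)` could be evaluated for all `t` at once, which is the property GPUs exploit during training. The recurrent loop cannot be reordered that way, because each state is computed from the previous one.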
