Abstract: Recent advances in text-to-speech (TTS) synthesis, such as Tacotron and WaveRNN, have made it possible to construct a fully neural network based TTS system by coupling the two components together. Such a system is conceptually simple: it takes only grapheme or phoneme input, uses a Mel-spectrogram as an intermediate feature, and directly generates speech samples. The system achieves quality equal or close to that of natural speech. However, the high computational cost of such systems and issues with robustness have limited their use in real-world speech synthesis applications and products. In this paper, we present key modeling improvements and optimization strategies that enable deploying these models not only on GPU servers but also on mobile devices. The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
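To make the two-stage pipeline described above concrete, the following is a minimal, illustrative sketch (not the authors' implementation): a Tacotron-like acoustic model maps phoneme IDs to a Mel-spectrogram, and a WaveRNN-like vocoder maps the Mel-spectrogram to waveform samples. All class names, layer choices, and dimensions here are assumptions for illustration; a real Tacotron uses attention to align input symbols to output frames, which this toy omits.

```python
# Minimal sketch of a two-stage neural TTS pipeline:
#   phonemes -> Mel-spectrogram -> waveform samples.
# Hypothetical toy modules; not the paper's architecture.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Phoneme IDs -> Mel-spectrogram (stands in for Tacotron)."""
    def __init__(self, n_phonemes=80, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):            # (B, T_in)
        h, _ = self.rnn(self.embed(phoneme_ids))
        # Toy simplification: one mel frame per input symbol
        # (a real system expands length via attention/durations).
        return self.proj(h)                    # (B, T_in, n_mels)

class ToyVocoder(nn.Module):
    """Mel-spectrogram -> waveform (stands in for WaveRNN)."""
    def __init__(self, n_mels=80, hidden=256, hop=300):
        # hop=300 samples/frame -> 80 frames/s at 24 kHz output.
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, hop)     # hop samples per frame

    def forward(self, mels):                   # (B, T_frames, n_mels)
        h, _ = self.rnn(mels)
        return self.proj(h).flatten(1)         # (B, T_frames * hop)

phonemes = torch.randint(0, 80, (1, 12))       # a dummy phoneme sequence
audio = ToyVocoder()(ToyAcousticModel()(phonemes))
print(audio.shape)                             # torch.Size([1, 3600])
```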
Abstract: In this paper, we describe a statistical parametric speech synthesis approach with a unit-level acoustic representation. In conventional deep neural network based speech synthesis, the input text features are repeated for the entire duration of a phoneme to map text to speech parameters. This mapping is learnt at the frame level, which is the de facto acoustic representation. However, much of this computational requirement can be drastically reduced if every unit is represented by a fixed-dimensional representation. Using a recurrent neural network based auto-encoder, we show that it is indeed possible to map units of varying duration to a single vector. We then use this unit-level acoustic representation to synthesize speech with a deep neural network based statistical parametric speech synthesis technique. Results show that the proposed approach synthesizes speech of the same quality as the conventional frame-based approach at a greatly reduced computational cost.
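The core idea, compressing a variable-length sequence of acoustic frames into one fixed-dimensional vector with an RNN auto-encoder, can be sketched as follows. This is a minimal illustration under assumed dimensions (60 acoustic parameters per frame, a 128-dimensional unit code), not the paper's exact model.

```python
# Sketch of an RNN auto-encoder that maps a unit of varying
# duration (T frames) to a single fixed-dimensional code vector.
# Dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class UnitAutoEncoder(nn.Module):
    def __init__(self, n_params=60, code_dim=128):
        super().__init__()
        self.encoder = nn.GRU(n_params, code_dim, batch_first=True)
        self.decoder = nn.GRU(code_dim, code_dim, batch_first=True)
        self.proj = nn.Linear(code_dim, n_params)

    def encode(self, frames):                  # (1, T, n_params), T varies
        _, h = self.encoder(frames)
        return h[-1]                           # (1, code_dim): one vector per unit

    def decode(self, code, n_frames):
        # Feed the unit code at every output step, then project to frames.
        steps = code.unsqueeze(1).repeat(1, n_frames, 1)
        out, _ = self.decoder(steps)
        return self.proj(out)                  # (1, n_frames, n_params)

model = UnitAutoEncoder()
unit = torch.randn(1, 37, 60)                  # one unit: 37 frames of 60 params
code = model.encode(unit)                      # fixed 128-dim representation
recon = model.decode(code, unit.size(1))
loss = nn.functional.mse_loss(recon, unit)     # train to reconstruct the unit
```

Once trained, only `encode` is needed at synthesis-preparation time, so the text-to-acoustics mapping can be learnt per unit rather than per frame, which is where the computational saving comes from.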
Abstract: In this paper we propose a technique for spectral envelope estimation using maximum values in the sub-bands of the Fourier magnitude spectrum (MSASB). Most other methods in the literature parametrize the spectral envelope in the cepstral domain, e.g., as a Mel-generalized cepstrum. Such cepstral-domain representations, although compact, are not readily interpretable. Our method overcomes this difficulty by parametrizing in the spectral domain itself. In our experiments, the spectral envelope estimated using the MSASB method was incorporated into the STRAIGHT vocoder. Both objective and subjective results of analysis-by-synthesis indicate that the proposed method is comparable to STRAIGHT. We also evaluate the effectiveness of the proposed parametrization in a statistical parametric speech synthesis framework using deep neural networks.
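The basic operation, taking the maximum magnitude within each sub-band of a frame's Fourier spectrum, can be sketched as below. The equal-width linear-frequency banding, window choice, and band count are assumptions for illustration; the paper's exact band layout may differ.

```python
# Sketch of MSASB-style envelope estimation: the envelope is the
# per-sub-band maximum of the FFT magnitude spectrum of one frame.
# Band layout (equal-width, linear frequency) is an assumption.
import numpy as np

def msasb_envelope(frame, n_fft=1024, n_bands=64):
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    edges = np.linspace(0, len(mag), n_bands + 1, dtype=int)
    return np.array([mag[lo:hi].max()
                     for lo, hi in zip(edges[:-1], edges[1:])])

# One 25 ms frame of a synthetic signal at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 220 * t) + 0.3 * np.random.randn(t.size)
env = msasb_envelope(frame)
print(env.shape)  # (64,) -- one max value per sub-band
```

Because each of the 64 values is directly a spectral magnitude at a known frequency band, the representation stays interpretable in a way cepstral coefficients are not.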