Title: R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

Jun 30, 2022

Kyle Kastner, Aaron Courville

Abstract: This paper introduces R-MelNet, a two-part autoregressive architecture for neural text-to-speech synthesis, with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, the model produces low-resolution mel-spectral features, which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half-precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We describe a number of implementation details that are critical for stable half-precision training, including an approximate, numerically stable mixture-of-logistics attention. Using a stochastic inference scheme that draws multiple samples per step, the resulting model generates highly varied audio while enabling text-based and audio-based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single-speaker TTS dataset demonstrate the effectiveness of our approach.
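
To make the mixture-of-logistics attention mentioned in the abstract more concrete, here is a minimal NumPy sketch of the generic technique: each attention weight is the probability mass that a mixture of logistic distributions assigns to a unit-width bin around an encoder position. This is not the paper's approximation, which the abstract does not specify. The function names, the clipping range for the log-scales, and the tanh-based sigmoid are all assumptions chosen for illustration.

```python
import numpy as np


def stable_sigmoid(x):
    # Logistic function written via tanh, which avoids overflow in exp()
    # for large-magnitude inputs (illustrative choice, not from the paper).
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def mol_attention(weight_logits, means, log_scales, num_positions):
    """Attention weights over encoder positions for one decoder step.

    weight_logits, means, log_scales: 1-D arrays with one entry per
    mixture component (hypothetical parameterization).
    """
    # Softmax over mixture weights, with max-subtraction for stability.
    w = np.exp(weight_logits - np.max(weight_logits))
    w = w / w.sum()

    # Clip log-scales so the scale can neither vanish nor explode
    # (the range [-7, 7] is an assumption).
    scales = np.exp(np.clip(log_scales, -7.0, 7.0))

    # Encoder positions as a column, mixture components as a row.
    j = np.arange(num_positions, dtype=np.float64)[:, None]

    # Mass of each logistic component inside the bin [j - 0.5, j + 0.5].
    upper = stable_sigmoid((j + 0.5 - means[None, :]) / scales[None, :])
    lower = stable_sigmoid((j - 0.5 - means[None, :]) / scales[None, :])

    # Weighted sum over components gives one weight per position.
    return ((upper - lower) * w[None, :]).sum(axis=1)


if __name__ == "__main__":
    alpha = mol_attention(
        weight_logits=np.array([0.0, 0.0]),
        means=np.array([3.0, 3.0]),
        log_scales=np.array([-1.0, -1.0]),
        num_positions=10,
    )
    print(alpha.round(4))
```

Because the weights come from differences of a cumulative distribution function, they are non-negative and sum to at most one. Reduced-precision training typically needs extra care beyond this sketch, since the difference of two nearly equal sigmoid values loses accuracy when the scale is small.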
