Thilo Koehler

Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Jan 19, 2024

Prabhav Agrawal, Thilo Koehler, Zhiping Xiu, Prashant Serai, Qing He

Abstract: Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run in real time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT) and is therefore an order of magnitude faster than any neural vocoder. A DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations of the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders, with a high average MOS of 4.36, while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, runs at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only real-time factor (RTF) of 0.003 and an overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.

* Accepted for ICASSP 2024
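
To make the RTF figures above concrete, here is a minimal sketch that assumes the usual definition of real-time factor (synthesis time divided by the duration of the audio produced). The function name and the 10-second utterance are illustrative assumptions, not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean synthesis runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical 10-second utterance, using the RTFs reported in the abstract.
audio_seconds = 10.0
vocoder_seconds = 0.003 * audio_seconds  # time implied by the vocoder-only RTF
overall_seconds = 0.044 * audio_seconds  # time implied by the overall RTF

print(f"vocoder-only: {vocoder_seconds:.3f} s of compute")
print(f"overall:      {overall_seconds:.3f} s of compute")
```

Under this definition, an overall RTF of 0.044 corresponds to well under half a second of single-threaded compute for ten seconds of speech.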

Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

Apr 01, 2021

Qing He, Zhiping Xiu, Thilo Koehler, Jilong Wu

Abstract: Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate the encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an O($L$) increase in both latency and real-time factor (RTF) with respect to input length $L$. In other words, longer inputs lead to longer delay and slower synthesis speed, limiting the use of these models in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to ground truth 4.48), low latency, and low RTF at the same time. Meanwhile, both the latency and the RTF of the proposed system stay constant regardless of input length, making it ideal for real-time applications.

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Nov 25, 2020

Bichen Wu, Qing He, Peizhao Zhang, Thilo Koehler, Kurt Keutzer, Peter Vajda

Abstract: A growing number of applications can benefit from edge-based text-to-speech (TTS). However, most existing TTS models are too computationally expensive and not flexible enough to be deployed on the wide variety of edge devices, which differ just as widely in computational capacity. To address this, we propose FBWave, a family of efficient and scalable neural vocoders that can achieve optimal performance-efficiency trade-offs for different edge devices. FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models. It produces high-quality audio and supports streaming during inference while remaining highly computationally efficient. Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing multiply-accumulate operations (MACs) by 40x. More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality. Audio demos are available at https://bichenwu09.github.io/vocoder_demos.

G2G: TTS-Driven Pronunciation Learning for Graphemic Hybrid ASR

Oct 22, 2019

Duc Le, Thilo Koehler, Christian Fuegen, Michael L. Seltzer

Abstract: Grapheme-based acoustic modeling has recently been shown to outperform phoneme-based approaches in both hybrid and end-to-end automatic speech recognition (ASR), even on non-phonemic languages like English. However, graphemic ASR still has problems with rare long-tail words that do not follow the standard spelling conventions seen in training, such as entity names. In this work, we present a novel method to train a statistical grapheme-to-grapheme (G2G) model on text-to-speech data that can rewrite an arbitrary character sequence into more phonetically consistent forms. We show that using G2G to provide alternative pronunciations during decoding reduces word error rate (WER) by 3% to 11% relative over a strong graphemic baseline and closes the gap on rare name recognition with an equivalent phonetic setup. Unlike many previously proposed methods, our method does not require any change to the acoustic model training procedure. This work reaffirms the efficacy of grapheme-based modeling and shows that specialized linguistic knowledge, when available, can be leveraged to improve graphemic ASR.
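
The "3% to 11% relative" figure refers to a relative, not absolute, change in WER. As a minimal sketch of that distinction, assuming the standard definition of relative reduction, the snippet below uses a hypothetical function name and made-up WER values that do not come from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only: a baseline WER of 10.0%
# improved to 9.0% is a 1.0-point absolute drop but a 10% relative drop.
baseline, improved = 10.0, 9.0
print(f"absolute reduction: {baseline - improved:.1f} points")
print(f"relative reduction: {relative_wer_reduction(baseline, improved):.0%}")
```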
