Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Prabhav Agrawal

Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Jan 19, 2024

Prabhav Agrawal, Thilo Koehler, Zhiping Xiu, Prashant Serai, Qing He

Abstract:Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run real-time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and therefore, is a magnitude faster than any neural vocoder. A DSP vocoder often gets a lower audio quality due to consuming over-smoothed acoustic model predictions of approximate representations for the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders with a high average MOS of 4.36 while being efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.

* Accepted for ICASSP 2024

Via

Access Paper or Ask Questions

Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Oct 28, 2022

Jason Fong, Yun Wang, Prabhav Agrawal, Vimal Manohar, Jilong Wu, Thilo Köhler, Qing He

Figure 1 for Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Figure 2 for Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Figure 3 for Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Figure 4 for Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Abstract:Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model training on data from the target speaker, which is a costly process that may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach which avoids finetuning altogether, and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance embeddings and a reference encoder improve the continuity of speaker identity and prosody between the edited synthetic speech and unedited original recording in the zero-shot setting.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions