Kainan Peng

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

Feb 11, 2025

Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan(+3 more)

Abstract: Voice imitation that targets specific speech attributes, such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to disentangle timbre and style effectively, which makes controllable generation difficult, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given either text or the content tokens of speech as input, an autoregressive transformer, prompted by a style reference, generates the content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at https://versavoice.github.io.

* Accepted by ICLR 2025
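The idea of using codebook size as an information bottleneck can be illustrated with a minimal sketch. This is not the Vevo tokenizer: the function names, the toy two-dimensional features, and the codebook sizes below are hypothetical, and a real VQ-VAE learns its codebook rather than taking it as a fixed list. The sketch only shows nearest-entry quantization and how the per-token capacity grows with vocabulary size.

```python
import math


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""

    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def bits_per_token(vocab_size):
    """Upper bound on the information one token can carry, in bits."""
    return math.log2(vocab_size)


# Hypothetical toy data: three 2-D "hidden features" and a 2-entry codebook.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_frames = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.4]]
toy_tokens = quantize(toy_frames, toy_codebook)

# Hypothetical vocabulary sizes, chosen only to contrast a tight and a loose bottleneck.
for size in (32, 4096):
    print(size, bits_per_token(size))
```

A smaller vocabulary forces each token to discard more detail from the continuous features, which is the lever the abstract describes for separating content from style and timbre.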

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Apr 11, 2024

Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.

Non-parallel Accent Conversion using Pseudo Siamese Disentanglement Network

Dec 12, 2022

Dongya Jia, Qiao Tian, Jiaxin Li, Yuanzhe Chen, Kainan Peng, Mingbo Ma, Yuping Wang, Yuxuan Wang

Abstract: The main goal of accent conversion (AC) is to convert the accent of speech into a target accent while preserving the content and timbre. Previous reference-based methods rely on reference utterances in the inference phase, which limits their practical application. Moreover, previous reference-free methods mostly require parallel data in the training phase. In this paper, we propose a reference-free method based on non-parallel data from the perspective of feature disentanglement. A Pseudo Siamese Disentanglement Network (PSDN) is proposed to disentangle the accent information from the content representation and model the target accent. In addition, a timbre augmentation method is proposed to improve timbre retention for speakers without target-accent data. Experimental results show that the proposed system can convert the accent of native American English speech into an Indian accent with higher accentedness (3.47) than the baseline (2.75) and the input (1.19). The naturalness of the converted speech is also comparable to that of the input.

WaveFlow: A Compact Flow-based Model for Raw Audio

Jan 10, 2020

Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

Abstract: In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. WaveFlow handles the long-range structure of waveforms with a dilated 2-D convolutional architecture, while modeling the local variations using compact autoregressive functions. It provides a unified view of likelihood-based models for raw audio, including WaveNet and WaveGlow as special cases. WaveFlow can generate high-fidelity speech as WaveNet does, while synthesizing several orders of magnitude faster, as it requires only a few sequential steps to generate waveforms with hundreds of thousands of time-steps. Furthermore, it can close the significant likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has 15× fewer parameters than WaveGlow and can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time on a V100 GPU without engineered inference kernels.
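To make the reported speed concrete, the short sketch below converts the abstract's figures (22.05 kHz audio, 42.6× faster than real-time) into a sample throughput and a wall-clock synthesis time. Only those two numbers come from the abstract; the helper names and the 10-second clip length are illustrative assumptions.

```python
SAMPLE_RATE_HZ = 22_050      # from the abstract: 22.05 kHz audio
REAL_TIME_SPEEDUP = 42.6     # from the abstract: 42.6x faster than real-time


def samples_per_second(sample_rate_hz, speedup):
    """Waveform samples generated per second of wall-clock time."""
    return sample_rate_hz * speedup


def synthesis_seconds(audio_seconds, speedup):
    """Wall-clock time needed to synthesize a clip of the given duration."""
    return audio_seconds / speedup


throughput = samples_per_second(SAMPLE_RATE_HZ, REAL_TIME_SPEEDUP)
clip_time = synthesis_seconds(10.0, REAL_TIME_SPEEDUP)  # hypothetical 10 s clip

print(f"{throughput:,.0f} samples per second")
print(f"{clip_time:.3f} s to synthesize 10 s of audio")
```

Under these figures the throughput is a little under one million samples per second, and a 10-second clip takes roughly a quarter of a second to generate.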

Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Nov 07, 2019

Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, Liang Huang

Abstract: Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, with neural methods becoming capable of producing audio with near human-level naturalness. However, these efforts still suffer from two types of latency: (a) the computational latency (synthesis time), which grows linearly with the sentence length even with parallel approaches, and (b) the input latency in scenarios where the input text is incrementally generated (such as in simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we devise the first neural incremental TTS approach based on the recently proposed prefix-to-prefix framework. We synthesize speech in an online fashion, playing a segment of audio while generating the next, resulting in an O(1) rather than O(n) latency. Experiments on English TTS show that our approach achieves speech naturalness similar to that of full-sentence methods, while using only a fraction of the time and a constant (1-2 word) latency.

* 11 pages
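The O(1) versus O(n) latency contrast can be sketched with a toy scheduler. This is an illustration of a fixed-lookahead policy under assumed names (`incremental_schedule`, `words_before_first_audio`, `lookahead`), not the paper's model: it only tracks which words must have arrived before each word's audio can be produced.

```python
def incremental_schedule(words, lookahead=1):
    """Yield (word_to_synthesize, words_seen_so_far) under a fixed lookahead.

    Audio for word i is produced once words up to i + lookahead are available,
    so synthesis can start before the sentence is complete.
    """
    for i, word in enumerate(words):
        seen = min(i + lookahead + 1, len(words))
        yield word, words[:seen]


def words_before_first_audio(n_words, lookahead=None):
    """Number of input words needed before any audio can be played.

    lookahead=None models a full-sentence system, which waits for all n words.
    """
    if lookahead is None:
        return n_words
    return min(lookahead + 1, n_words)


sentence = "the quick brown fox jumps".split()
for word, context in incremental_schedule(sentence, lookahead=1):
    print(word, "<-", context)
```

With a fixed lookahead, the wait before the first audio segment depends only on the lookahead, whereas a full-sentence system waits for every word, which is the constant-versus-linear distinction the abstract draws.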

Multi-Speaker End-to-End Speech Synthesis

Jul 09, 2019

Jihyun Park, Kexin Zhao, Kainan Peng, Wei Ping

Abstract: In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristics of different voices, low-dimensional trainable speaker embeddings are shared across each component of ClariNet and trained together with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.

Parallel Neural Text-to-Speech

Jun 05, 2019

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

Abstract: In this work, we propose a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and obtains about 46.7 times speed-up over Deep Voice 3 at synthesis while maintaining comparable speech quality using a WaveNet vocoder. Interestingly, it has even fewer attention errors than the autoregressive model on the challenging test sentences. Furthermore, we build the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder. Our system can synthesize speech from text through a single feed-forward pass. We also explore a novel approach to train the IAF from scratch as a generative model for raw waveform, which avoids the need for distillation from a separately trained WaveNet.

* v2: we improve the synthesis speed of ParaNet by sharing the attention masking at all attention layers

Neural Voice Cloning with a Few Samples

Oct 12, 2018

Sercan O. Arik, Jitong Chen, Kainan Peng, Wei Ping, Yanqi Zhou

Abstract: Voice cloning is a highly desired feature for personalized speech interfaces. Neural network based speech synthesis has been shown to generate high-quality speech for a large number of speakers. In this paper, we introduce a neural voice cloning system that takes a few audio samples as input. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding is based on training a separate model to directly infer a new speaker embedding from cloning audios, which is then used with a multi-speaker generative model. In terms of naturalness of the speech and its similarity to the original speaker, both approaches can achieve good performance, even with very few cloning audios. While speaker adaptation can achieve better naturalness and similarity, the cloning time and required memory for the speaker encoding approach are significantly lower, making it favorable for low-resource deployment.

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Jul 30, 2018

Wei Ping, Kainan Peng, Jitong Chen

Abstract: In this work, we propose an alternative solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a novel regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we propose the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2018). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.

* Fix a few typos in v2
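For readers unfamiliar with why a closed form is available, the sketch below evaluates the standard KL divergence between two univariate Gaussians, KL(q || p) = log(σ_p/σ_q) + (σ_q² + (μ_q − μ_p)²) / (2σ_p²) − 1/2. It covers only this textbook formula; the regularization term mentioned in the abstract is not reproduced here, and the function name is an assumption for illustration.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for univariate Gaussians q and p.

    q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2); both sigmas must be positive.
    """
    if sigma_q <= 0 or sigma_p <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )


# Identical distributions have zero divergence.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))
# Shifting the mean by one standard deviation, with equal variances.
print(gaussian_kl(1.0, 1.0, 0.0, 1.0))
```

Because the expression needs no sampling, it can be evaluated directly from the predicted means and standard deviations, which is the property the abstract credits for simpler and more efficient distillation.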

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Feb 22, 2018

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller

Abstract: We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.

* Published as a conference paper at ICLR 2018. (v3 changed paper title)
