Zero-shot voice conversion (VC) aims to convert the source speaker's timbre to that of an arbitrary, unseen target speaker while preserving the linguistic content. Mainstream zero-shot VC approaches rely on pre-trained recognition models to disentangle linguistic content from speaker representation, which leaves residual timbre in the decoupled linguistic content and limits the expressiveness of the speaker representation. In this study, we propose CoDiff-VC, an end-to-end zero-shot voice conversion framework that integrates a speech codec with a diffusion model to produce high-fidelity waveforms. Our approach employs a single-codebook codec to extract linguistic content from the source speech and introduces Mix-Style layer normalization (MSLN) to perturb the original timbre, thereby strengthening content disentanglement. We further incorporate multi-scale speaker timbre modeling to maintain timbre consistency and improve the similarity of fine voice details. To improve speech quality and speaker similarity, we introduce dual classifier-free guidance, which provides both content and timbre guidance during the generation process. Objective and subjective experiments confirm that CoDiff-VC substantially improves speaker similarity while generating natural, higher-quality speech.
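Since MSLN is only named here, the sketch below illustrates how a statistics-mixing normalization of this kind could perturb source timbre in the content branch. It is a minimal PyTorch illustration under stated assumptions: the tensor layout, the Beta mixing prior, and all module and parameter names are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of Mix-Style layer normalization (MSLN) for timbre perturbation.
# Assumptions: content features shaped (batch, channels, time); Beta(alpha, alpha)
# mixing prior; per-utterance statistics over the time axis act as a style proxy.
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    def __init__(self, alpha: float = 0.1, eps: float = 1e-6):
        super().__init__()
        self.beta = torch.distributions.Beta(alpha, alpha)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) content features, e.g. from a codec encoder.
        if not self.training:
            return x

        # Per-utterance mean/std over time carry residual speaker/style cues.
        mu = x.mean(dim=2, keepdim=True)
        sig = (x.var(dim=2, keepdim=True) + self.eps).sqrt()
        x_norm = (x - mu) / sig  # style-normalized content

        # Mix each utterance's statistics with those of a randomly paired
        # utterance, so the re-applied "style" no longer matches the source.
        lam = self.beta.sample((x.size(0), 1, 1)).to(x.device)
        perm = torch.randperm(x.size(0), device=x.device)
        mu_mix = lam * mu + (1 - lam) * mu[perm]
        sig_mix = lam * sig + (1 - lam) * sig[perm]

        return x_norm * sig_mix + mu_mix


# Usage: perturb codec-derived content features during training only.
msln = MixStyleLayerNorm(alpha=0.1).train()
content = torch.randn(8, 256, 200)   # (batch, channels, frames), dummy data
perturbed = msln(content)
```

In this reading, the mixing makes speaker statistics unreliable in the content pathway, so the decoder must draw timbre from the dedicated speaker representation instead.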