Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:neural concatenative singing voice conversion: rethinking concatenation-based approach for one-shot singing voice conversion

Dec 08, 2023

Binzhu Sha, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

Figure 1 for neural concatenative singing voice conversion: rethinking concatenation-based approach for one-shot singing voice conversion

Figure 2 for neural concatenative singing voice conversion: rethinking concatenation-based approach for one-shot singing voice conversion

Figure 3 for neural concatenative singing voice conversion: rethinking concatenation-based approach for one-shot singing voice conversion

Share this with someone who'll enjoy it:

Abstract:Any-to-any singing voice conversion is confronted with a significant challenge of ``timbre leakage'' issue caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces a novel neural concatenative singing voice conversion (NeuCoSVC) framework. The NeuCoSVC framework comprises a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a waveform synthesizer. Specifically, the SSL extractor condenses the audio into a sequence of fixed-dimensional SSL features. The harmonic signal generator produces both raw and filtered harmonic signals as the pitch information by leveraging a linear time-varying (LTV) filter. Finally, the audio generator reconstructs the audio waveform based on the SSL features, as well as the harmonic signals and the loudness information. During inference, the system performs voice conversion by substituting source SSL features with their nearest counterparts from a matching pool, which comprises SSL representations extracted from the target audio, while the raw harmonic signals and the loudness are extracted from the source audio and are kept unchanged. Since the utilized SSL features in the conversion stage are directly from the target audio, the proposed framework has great potential to address the ``timbre leakage'' issue caused by previous disentanglement-based approaches. Experimental results confirm that the proposed system delivers much better performance than the speaker embedding approach (disentanglement-based) in the context of one-shot SVC across intra-language, cross-language, and cross-domain evaluations.

View paper on

Share this with someone who'll enjoy it:

Title:neural concatenative singing voice conversion: rethinking concatenation-based approach for one-shot singing voice conversion

Paper and Code