Giuseppe Ruggiero

Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion

Sep 25, 2024

Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro

Abstract: The creation of artificial polyglot voices remains a challenging task, despite considerable progress in recent years. This paper investigates self-supervised learning for voice conversion to create native-sounding polyglot voices. We introduce a novel cross-lingual any-to-one voice conversion system that is able to preserve the source accent without the need for multilingual data from the target speaker. In addition, we show a novel cross-lingual fine-tuning strategy that further improves the accent and reduces the training data requirements. Objective and subjective evaluations with English, Spanish, French and Mandarin Chinese confirm that our approach improves on state-of-the-art methods, enhancing the speech intelligibility and overall quality of the converted speech, especially in cross-lingual scenarios. Audio samples are available at https://giuseppe-ruggiero.github.io/a2o-vc-demo/

* Full paper accepted at EMNLP 2024


Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning

Feb 10, 2021

Giuseppe Ruggiero, Enrico Zovato, Luigi Di Caro, Vincent Pollet

Abstract: Deep learning models are becoming predominant in many fields of machine learning. Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. A TTS deep neural network is usually trained on a corpus of several hours of recorded speech from a single speaker. Producing the voice of a speaker other than the one learned is expensive and requires considerable effort, since a new dataset must be recorded and the model retrained. This is the main reason why TTS models are usually single-speaker. The proposed approach aims to overcome these limitations by building a system that can model a multi-speaker acoustic space. This allows the generation of speech similar to the voices of different target speakers, even if they were not observed during training.
