Meiying Melissa Chen

Generating Novel and Realistic Speakers for Voice Conversion

Nov 10, 2025

Meiying Melissa Chen, Zhenyu Wang, Zhiyao Duan

Abstract: Voice conversion (VC) models modify timbre while preserving paralinguistic features, enabling applications like dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users desire conversion to entirely novel, unseen voices. To address this, we introduce a lightweight method, SpeakerVAE, to generate novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models, without co-training or fine-tuning of the base VC system. We evaluated our approach with state-of-the-art VC models: FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.
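
The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' SpeakerVAE: it assumes a single-level latent space with a standard normal prior (the paper uses a deep hierarchical VAE), and the decoder here is a hypothetical toy function with random weights standing in for a trained model. The names `make_toy_decoder` and `sample_novel_speakers`, and the latent and embedding sizes, are illustrative assumptions.

```python
import math
import random


def make_toy_decoder(latent_dim, embed_dim, seed=42):
    """Build a stand-in decoder (hypothetical; a real one would be trained)."""
    rng = random.Random(seed)
    # Random projection weights in place of learned decoder parameters.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
               for _ in range(latent_dim)]

    def decode(z):
        # Project the latent vector and squash it with tanh.
        raw = [math.tanh(sum(z[i] * weights[i][j] for i in range(latent_dim)))
               for j in range(embed_dim)]
        # Normalize to unit length, as is common for speaker embeddings.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        return [v / norm for v in raw]

    return decode


def sample_novel_speakers(decode, latent_dim, n_speakers, seed=0):
    """Draw latents from the assumed N(0, I) prior and decode each one."""
    rng = random.Random(seed)
    speakers = []
    for _ in range(n_speakers):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        speakers.append(decode(z))
    return speakers


decoder = make_toy_decoder(latent_dim=4, embed_dim=8)
novel_speakers = sample_novel_speakers(decoder, latent_dim=4, n_speakers=3)
```

In a plug-in setup like the one the abstract describes, each sampled embedding would be passed to the VC model in place of an embedding extracted from a target utterance, leaving the base VC system unchanged.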

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Jun 15, 2024

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

Abstract: Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.
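
As a rough illustration of how a three-dimensional control space yields 125 combinations, the sketch below enumerates a grid of GTR settings. The abstract does not say how many levels each dimension has; five levels per dimension (5 × 5 × 5 = 125) is an assumption made here, as are the integer level labels and the dictionary keys.

```python
from itertools import product

# Assumption: each GTR dimension has 5 discrete levels, labeled 1..5.
LEVELS = range(1, 6)
N_SENTENCES = 20  # stated in the abstract

# One entry per (glottalization, tenseness, resonance) combination.
gtr_grid = [
    {"glottalization": g, "tenseness": t, "resonance": r}
    for g, t, r in product(LEVELS, repeat=3)
]

# Upper bound on recordings if every sentence is read in every combination.
max_utterances = len(gtr_grid) * N_SENTENCES
```

Under these assumptions the grid has 125 entries, and 20 sentences per entry would give at most 2,500 utterances; the abstract itself does not state the total utterance count.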

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Jun 15, 2024

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan
