Title: ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Dec 16, 2024

Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller

Figures 1 to 4 for ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Abstract: Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which integrates a set of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.

* Accepted by AAAI 2025
