Ziqian Dai

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Jun 16, 2022

Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu

Abstract: Prosodic boundaries play an important role in text-to-speech (TTS) synthesis, affecting both naturalness and readability. However, acquiring prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. The model is pre-trained on text and speech data separately and then jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. Experimental results from both automatic and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of the automatic prosodic boundary annotations is comparable to that of human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems trained with manual ones.

* Accepted by INTERSPEECH 2022
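
To make the {speech, text, prosody} triplet format mentioned in the abstract more concrete, here is a minimal Python sketch of how one such training example might be represented. Everything in it is an assumption made for illustration and is not taken from the paper: the `ProsodyTriplet` class, the `render` helper, the file name, the integer label scheme (0 for no boundary, larger numbers for stronger breaks), and the `#n` rendering convention.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label scheme: 0 means "no boundary after this token";
# larger integers stand for stronger prosodic breaks.
NO_BOUNDARY = 0


@dataclass
class ProsodyTriplet:
    """One {speech, text, prosody} example (illustrative layout only)."""
    speech_path: str       # placeholder path to the audio clip
    tokens: List[str]      # the text, already split into tokens
    boundaries: List[int]  # boundary label following each token

    def __post_init__(self) -> None:
        # Each token needs exactly one boundary label.
        if len(self.tokens) != len(self.boundaries):
            raise ValueError("need exactly one boundary label per token")


def render(triplet: ProsodyTriplet) -> str:
    """Return the text with boundary labels written inline as '#n'."""
    parts = []
    for token, label in zip(triplet.tokens, triplet.boundaries):
        parts.append(token if label == NO_BOUNDARY else f"{token}#{label}")
    return " ".join(parts)


example = ProsodyTriplet(
    speech_path="clip_0001.wav",  # hypothetical file name
    tokens=["the", "weather", "is", "nice", "today"],
    boundaries=[0, 1, 0, 2, 3],
)
print(render(example))  # prints: the weather#1 is nice#2 today#3
```

In a setup like this, an automatic annotator would fill in the `boundaries` field from the audio and text, where a human annotator would otherwise supply it by hand.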
