Title: M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

May 03, 2023

Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, Jianhua Tao, Jianqing Sun, Jiaen Liang

Abstract: Conversational text-to-speech (TTS) aims to synthesize a reply with prosody appropriate to the conversation history. However, modeling the conversation comprehensively remains a challenge: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, because acoustic features also carry prosodic information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the conversation history and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody performance and naturalness in CMOS tests.

* 5 pages, 1 figure, 2 tables. Accepted by ICASSP 2023
