Yookyung Shin

Cross-speaker Emotion Transfer by Manipulating Speech Style Latents

Mar 15, 2023

Suhee Jo, Younggun Lee, Yookyung Shin, Yeongtae Hwang, Taesu Kim

Abstract: In recent years, emotional text-to-speech has shown considerable progress. However, it requires a large amount of labeled data, which is not easily accessible. Even if it is possible to acquire an emotional speech dataset, there is still a limitation in controlling emotion intensity. In this work, we propose a novel method for cross-speaker emotion transfer and manipulation using vector arithmetic in latent style space. By leveraging only a few labeled samples, we generate emotional speech from reading-style speech without losing the speaker identity. Furthermore, emotion strength is readily controllable using a scalar value, providing an intuitive way for users to manipulate speech. Experimental results show the proposed method affords superior performance in terms of expressiveness, naturalness, and controllability, preserving speaker identity.

* Accepted to ICASSP 2023
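
The abstract above mentions vector arithmetic in a latent style space with a scalar emotion strength. A minimal sketch of that general idea follows; it is not the authors' implementation. The function names, the use of mean differences to obtain an emotion direction, and the toy three-dimensional latents are all assumptions made for illustration.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def emotion_direction(emotional_latents, neutral_latents):
    """Hypothetical emotion direction: mean emotional latent minus mean neutral latent."""
    emotional_mean = mean_vector(emotional_latents)
    neutral_mean = mean_vector(neutral_latents)
    return [e - n for e, n in zip(emotional_mean, neutral_mean)]


def apply_emotion(style_latent, direction, strength=1.0):
    """Shift a style latent along the direction; `strength` is the scalar intensity."""
    return [s + strength * d for s, d in zip(style_latent, direction)]


# Toy latents standing in for a few labeled samples (assumed values).
neutral_samples = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
happy_samples = [[1.0, 1.0, 1.0], [3.0, 1.0, 1.0]]
direction = emotion_direction(happy_samples, neutral_samples)

# A reading-style latent from a different (hypothetical) target speaker.
target_latent = [0.5, -0.5, 0.25]
half_strength = apply_emotion(target_latent, direction, strength=0.5)
```

In this sketch, setting `strength` to 0 leaves the target latent unchanged, and larger values move it further along the assumed emotion direction.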

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Jul 13, 2022

Yookyung Shin, Younggun Lee, Suhee Jo, Yeongtae Hwang, Taesu Kim

Abstract: Expressive text-to-speech has shown improved performance in recent years. However, the style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still want to control speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that models the semantic relationship between text description embedding and speech style embedding with a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen styles.

* Accepted to Interspeech 2022
