Marlene Staib

Papercup Technologies Ltd

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis

Jun 15, 2021

Devang S Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King

Abstract: Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: $F_{0}$, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, predicted from text, or predicted and then modified. Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with a reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can further increase naturalness significantly.

* To be published in Interspeech 2021. 5 pages, 4 figures
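
The "predicted then subsequently modified" workflow in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the per-phone granularity, the `PhoneProsody` record, its field names and units, and the `scale_f0` helper are all hypothetical. The point is only that explicit features allow an edit confined to a chosen time span.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class PhoneProsody:
    """Hypothetical per-phone prosody record (names and units are assumptions)."""
    phone: str
    f0_hz: float       # fundamental frequency
    energy: float      # arbitrary units
    duration_s: float  # seconds


def scale_f0(seq: List[PhoneProsody], start: int, end: int,
             factor: float) -> List[PhoneProsody]:
    """Return a copy of seq with F0 scaled by factor for phones in [start, end)."""
    return [
        replace(p, f0_hz=p.f0_hz * factor) if start <= i < end else p
        for i, p in enumerate(seq)
    ]


# Stand-in for values predicted from text.
predicted = [
    PhoneProsody("h", 100.0, 0.5, 0.06),
    PhoneProsody("e", 120.0, 0.8, 0.10),
    PhoneProsody("l", 110.0, 0.6, 0.07),
]

# Raise the pitch of the second phone only; the others are left untouched.
modified = scale_f0(predicted, start=1, end=2, factor=1.5)
```

Energy and duration could be edited the same way; a synthesis model conditioned on these values would then render the modified sequence.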

ADEPT: A Dataset for Evaluating Prosody Transfer

Jun 15, 2021

Alexandra Torresquintero, Tian Huey Teh, Christopher G. R. Wallis, Marlene Staib, Devang S Ram Mohan, Vivian Hu, Lorenzo Foglianti, Jiameng Gao, Simon King

Abstract: Text-to-speech is now able to achieve near-human naturalness and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it. We introduce a dataset of prosodically varied reference natural speech samples for evaluating prosody transfer. The samples include global variations reflecting emotion and interpersonal attitude, and local variations reflecting topical emphasis, propositional attitude, syntactic phrasing and marked tonicity. The corpus only includes prosodic variations that listeners are able to distinguish with reasonable accuracy, and we report these figures as a benchmark against which text-to-speech prosody transfer can be compared. We conclude the paper with a demonstration of our proposed evaluation methodology, using the corpus to evaluate two text-to-speech models that perform prosody transfer.

* 5 pages, 1 figure, accepted to Interspeech 2021

Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning

Aug 07, 2020

Devang S Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao

Abstract: Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a reinforcement-learning-based framework to train an agent to make this decision. We compare our performance against that of deterministic, rule-based systems. Our results demonstrate that our agent successfully balances the trade-off between the latency of audio generation and the quality of synthesised audio. More broadly, we show that neural sequence-to-sequence models can be adapted to run in an incremental manner.

* To be published in Interspeech 2020. 5 pages, 4 figures
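
To make the idea of interleaving read and synthesise actions concrete, here is a minimal sketch of one possible deterministic, rule-based schedule of the kind the abstract mentions as a point of comparison. The specific "read k characters ahead" rule, the action names `READ` and `SPEAK`, and the function `wait_k_schedule` are assumptions for illustration; the abstract does not specify which rules were used, and the learned agent would choose these actions itself.

```python
from typing import List

READ, SPEAK = "READ", "SPEAK"


def wait_k_schedule(num_chars: int, num_frames: int, k: int) -> List[str]:
    """Hypothetical fixed rule: stay k characters ahead of the audio.

    Reads while fewer than k + (frames spoken) characters have been read,
    otherwise speaks one frame. Once the input is exhausted, it speaks the
    remaining frames; once the audio is done, it reads any remaining input.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read = spoken = 0
    while read < num_chars or spoken < num_frames:
        if read < num_chars and (read < k + spoken or spoken == num_frames):
            actions.append(READ)
            read += 1
        else:
            actions.append(SPEAK)
            spoken += 1
    return actions


# Three input characters, four output frames, lookahead of two characters.
schedule = wait_k_schedule(num_chars=3, num_frames=4, k=2)

# Latency proxy: how many characters are read before the first audio frame.
initial_latency = schedule.index(SPEAK)
```

A larger `k` delays the first audio frame but gives the synthesiser more context, which is the latency/quality trade-off the abstract describes.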

Phonological Features for 0-shot Multilingual Speech Synthesis

Aug 06, 2020

Marlene Staib, Tian Huey Teh, Alexandra Torresquintero, Devang S Ram Mohan, Lorenzo Foglianti, Raphael Lenain, Jiameng Gao

Abstract: Code-switching, the intra-utterance use of multiple languages, is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological features derived from the International Phonetic Alphabet (IPA), such as vowel height and frontness, consonant place and manner. This allows the model topology to stay unchanged for different languages, and enables new, previously unseen feature combinations to be interpreted by the model. We show that this allows us to generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.

* 5 pages, to be presented at INTERSPEECH 2020
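
The idea of replacing phoneme identities with phonological features can be sketched as follows. This is a simplified illustration, not the paper's feature set: the three-feature vowel inventory, the `FEATURES` and `VOWELS` tables, and the `encode` function are assumptions chosen to keep the example small. It shows why the input dimensionality stays fixed and how an unseen sound is a new combination of familiar feature values.

```python
from typing import Dict, List

# Hypothetical, heavily reduced feature inventory (vowels only).
FEATURES: Dict[str, List[str]] = {
    "height": ["close", "mid", "open"],
    "backness": ["front", "central", "back"],
    "rounding": ["unrounded", "rounded"],
}

# IPA vowels described by their feature values.
VOWELS: Dict[str, Dict[str, str]] = {
    "i": {"height": "close", "backness": "front", "rounding": "unrounded"},
    "u": {"height": "close", "backness": "back", "rounding": "rounded"},
    "a": {"height": "open", "backness": "front", "rounding": "unrounded"},
    # Close front rounded vowel: absent from English, present in e.g. German.
    "y": {"height": "close", "backness": "front", "rounding": "rounded"},
}


def encode(phone: str) -> List[int]:
    """Concatenate one one-hot block per feature into a fixed-length vector."""
    description = VOWELS[phone]
    vector: List[int] = []
    for name, values in FEATURES.items():
        vector.extend(1 if value == description[name] else 0 for value in values)
    return vector


i_vec, u_vec, y_vec = encode("i"), encode("u"), encode("y")
```

Here `y_vec` has the same length as every other vector, shares its height and backness blocks with `i_vec`, and shares its rounding block with `u_vec`. A model trained only on "i", "u" and "a" has therefore seen each active input dimension of "y", just never in this combination.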
