Title: Fine-grained style control in Transformer-based Text-to-speech Synthesis

Oct 12, 2021

Li-Wei Chen, Alexander Rudnicky

Abstract: In this paper, we present a novel architecture for fine-grained style control in Transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our cross-attention blocks, which fuse and align content and style. Because the fusion is performed alongside a skip connection, the cross-attention block provides a good inductive bias for gradually infusing the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating the LST during training and by using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
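The two mechanisms named in the abstract, random truncation of the LST sequence and cross-attention fusion with a skip connection, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`random_truncate`, `cross_attention_with_skip`), the contiguous-slice truncation policy, the `min_len` parameter, and the absence of learned projections, multiple heads, and layer normalization are all assumptions made to keep the example small.

```python
import math
import random


def random_truncate(style_tokens, min_len=1, rng=random):
    """Keep a random contiguous slice of the style-token sequence.

    Assumption: the abstract only says the LST are "randomly truncated";
    the contiguous-slice policy and min_len are illustrative choices.
    """
    n = len(style_tokens)
    length = rng.randint(min(min_len, n), n)
    start = rng.randint(0, n - length)
    return style_tokens[start:start + length]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_with_skip(phonemes, style_tokens):
    """Single-head scaled dot-product cross-attention plus a skip connection.

    Queries are phoneme vectors; keys and values are the style tokens.
    Hypothetical simplification: no learned query/key/value projections.
    """
    dim = len(phonemes[0])
    fused = []
    for q in phonemes:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in style_tokens
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[d] for w, v in zip(weights, style_tokens))
            for d in range(dim)
        ]
        # Skip connection: the phoneme vector is kept and the style is added.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Toy data (made up for illustration): 3 phoneme vectors, 4 style tokens.
rng = random.Random(0)
phonemes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
style = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4], [0.1, 0.0]]

kept = random_truncate(style, rng=rng)
fused = cross_attention_with_skip(phonemes, kept)
```

In this sketch the output always has one vector per phoneme, however many style tokens survive truncation, because attention aligns the variable-length style sequence to the phoneme sequence. The residual addition means each block only nudges the phoneme representation toward the style, which is one way to read the "gradually infuse" wording in the abstract.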

* Submitted to ICASSP 2022
