Title: Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Jul 09, 2019

Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

Abstract: Recent work has explored sequence-to-sequence latent variable models for expressive speech synthesis (supporting control and transfer of prosody and style), but has not presented a coherent framework for understanding the trade-offs between the competing methods. In this paper, we propose embedding capacity as a unified method of analyzing the behavior of latent variable models of speech, comparing existing heuristic (non-variational) methods to variational methods that are able to explicitly constrain capacity using an upper bound on representational mutual information. In our proposed model (Capacitron), we show that by adding conditional dependencies to the variational posterior such that it matches the form of the true posterior, the same model can be used for high-precision prosody transfer, text-agnostic style transfer, and generation of natural-sounding prior samples. For multi-speaker models, Capacitron is able to preserve target speaker identity during inter-speaker prosody transfer and when drawing samples from the latent prior. Lastly, we introduce a method for decomposing embedding capacity hierarchically across two sets of latents, allowing a portion of the latent variability to be specified and the remaining variability sampled from a learned prior.
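As a rough illustration of what "explicitly constraining capacity" can look like in practice, the sketch below computes the KL divergence between a diagonal Gaussian posterior and a standard normal prior, then penalizes its deviation from a target capacity with a multiplier. The abstract does not state the objective, so everything here is an assumption made for illustration: the Gaussian forms, the Lagrangian-style penalty, and the names `gaussian_kl_to_standard_normal`, `capacity_constrained_loss`, `beta`, and `capacity` are all hypothetical. The only link to the abstract is that a KL term of this kind, averaged over the data, upper-bounds the mutual information between the data and the latent embedding.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), in nats.

    Assumes a diagonal Gaussian posterior and a standard normal prior;
    both are illustrative choices, not taken from the abstract.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var)
    )


def capacity_constrained_loss(recon_nll, kl, beta, capacity):
    """Reconstruction loss plus a penalty on (KL - capacity).

    `beta` plays the role of a non-negative multiplier and `capacity`
    is the target KL in nats. Both names are hypothetical.
    """
    return recon_nll + beta * (kl - capacity)


# A posterior identical to the prior carries no information about the input.
kl_zero = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 with unit variance costs 0.5 nats.
kl_shifted = gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

loss = capacity_constrained_loss(recon_nll=10.0, kl=5.0, beta=2.0, capacity=3.0)
```

In a sketch like this, the penalty vanishes when the KL term sits exactly at the target capacity, which is one way a single scalar can trade reconstruction precision against how much the embedding is allowed to encode.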

* Submitted to NeurIPS 2019
