Title: Continuous Speech Synthesis using per-token Latent Diffusion

Oct 21, 2024

Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel

Abstract: The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach uses semantic tokens to provide contextual information and to determine the stopping condition. We propose three continuous variants of our method, each extending a popular discrete speech synthesis technique. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
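The control flow the abstract describes (an autoregressive loop that samples one continuous latent per step with a diffusion head, while a semantic token decides when to stop) can be illustrated with a toy sketch. This is not the SALAD implementation: `toy_backbone`, `toy_denoiser`, the `EOS` id, the two-frames-per-character stopping rule, and the simple denoising schedule are all hypothetical stand-ins chosen only to make the loop runnable.

```python
import math
import random

EOS = 0  # hypothetical end-of-sequence semantic token id


def toy_denoiser(x, cond, t):
    """Stand-in for a learned diffusion head (assumption, not the paper's model).

    It treats `cond` as the clean latent, so the predicted noise is x - cond.
    The step index `t` is unused in this toy version.
    """
    return [xi - ci for xi, ci in zip(x, cond)]


def sample_latent(cond, steps, rng):
    """Sample one continuous latent: start from noise, denoise iteratively."""
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in range(steps, 0, -1):
        eps = toy_denoiser(x, cond, t)
        # Remove a 1/t fraction of the predicted noise; the last step (t == 1)
        # removes whatever noise remains.
        x = [xi - ei / t for xi, ei in zip(x, eps)]
    return x


def toy_backbone(text, latents):
    """Stand-in for the autoregressive transformer (assumption).

    Returns a conditioning vector for the diffusion head and a semantic token.
    Hypothetical stopping rule: two frames per character, then EOS.
    """
    i = len(latents)
    cond = [math.sin(i), math.cos(i)]
    semantic = EOS if i >= 2 * len(text) else 1 + (ord(text[i // 2]) % 7)
    return cond, semantic


def generate(text, steps=4, max_len=100, seed=0):
    """Generate a variable-length sequence of continuous latents."""
    rng = random.Random(seed)
    latents, semantic_tokens = [], []
    while len(latents) < max_len:
        cond, semantic = toy_backbone(text, latents)
        if semantic == EOS:  # the semantic token determines the stopping point
            break
        latents.append(sample_latent(cond, steps, rng))
        semantic_tokens.append(semantic)
    return latents, semantic_tokens


if __name__ == "__main__":
    latents, tokens = generate("hi")
    print(len(latents), tokens)
```

The point of the sketch is the division of labor: the discrete semantic stream carries the stop decision, so the continuous latents never need a dedicated end-of-sequence value. In a real system both stand-in functions would be learned networks.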

* Preprint, Under review
