Title: Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

Jul 02, 2024

Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang

Abstract: In this paper, we propose reverse inference optimization (RIO), a simple and effective method that uses reinforcement learning from human feedback (RLHF) to improve the robustness of autoregressive zero-shot text-to-speech (TTS) systems. To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a concept termed reverse inference, based on the Bayesian principle: high-quality generated speech should itself work well as a prompt for subsequent generation with the same TTS model. RIO uses reverse inference as the criterion for selecting RLHF exemplars from speech samples generated by the TTS system itself, which steers the subsequent optimization towards greater TTS robustness. The RIO framework comprises three stages: sampling, automatic annotating, and learning. It requires neither a reward model nor pairwise preference data, and it significantly improves the stability of zero-shot TTS by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO improves both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO also reduces the incidence of bad outputs to nearly zero percent, rivalling the robustness obtained when ground-truth speech is used as the prompt.
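The automatic-annotating stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` parameter, the positive/negative split, and the toy `synthesize` and `score` stand-ins are all assumptions made for illustration. In practice `synthesize` would be the TTS model and `score` an objective quality measure; the abstract does not specify which measure drives the selection.

```python
from typing import Any, Callable, List, Sequence, Tuple


def reverse_inference_score(
    sample: Any,
    synthesize: Callable[[str, Any], Any],
    score: Callable[[Any], float],
    probe_texts: Sequence[str],
) -> float:
    """Average quality of speech generated when `sample` is used as the prompt."""
    outputs = [synthesize(text, sample) for text in probe_texts]
    return sum(score(out) for out in outputs) / len(outputs)


def select_exemplars(
    samples: Sequence[Any],
    synthesize: Callable[[str, Any], Any],
    score: Callable[[Any], float],
    probe_texts: Sequence[str],
    threshold: float,
) -> Tuple[List[Any], List[Any]]:
    """Split self-generated samples into positive and negative exemplars.

    A sample counts as positive when it works well as a prompt, i.e. its
    reverse-inference score reaches the (hypothetical) threshold.
    """
    positives: List[Any] = []
    negatives: List[Any] = []
    for sample in samples:
        r = reverse_inference_score(sample, synthesize, score, probe_texts)
        (positives if r >= threshold else negatives).append(sample)
    return positives, negatives


# Toy stand-ins (assumptions): "speech" is a string, and a glitchy prompt
# propagates its glitch into everything generated from it.
def toy_synthesize(text: str, prompt: str) -> str:
    return f"{prompt}|{text}"


def toy_score(speech: str) -> float:
    return 0.0 if "glitch" in speech else 1.0


good, bad = select_exemplars(
    samples=["clean_a", "glitch_b", "clean_c"],
    synthesize=toy_synthesize,
    score=toy_score,
    probe_texts=["hello world", "a second sentence"],
    threshold=0.5,
)
```

In this toy setup, samples that carry a defect yield poor downstream generations and land in `bad`, while the rest land in `good`; the two pools would then feed the learning stage.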

* 12 pages, Work in progress
