Title: Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training

Aug 06, 2024

Hawraz A. Ahmad, Tarik A. Rashid

Abstract: Recent advancements in text-to-speech (TTS) models have aimed to streamline the two-stage process into a single-stage training approach. However, many single-stage models still lag behind in audio quality, particularly when handling Kurdish text and speech. There is a critical need to improve text-to-speech conversion for the Kurdish language, especially the Sorani dialect, which is underrepresented in recent TTS advancements. This study introduces an end-to-end TTS model for efficiently generating high-quality Kurdish audio. The proposed method leverages a variational autoencoder (VAE) that is pre-trained for audio waveform reconstruction and augmented by adversarial training. Training aligns the prior distribution established by the pre-trained encoder with the posterior distribution of the text encoder in the latent space. Additionally, a stochastic duration predictor is incorporated to give synthesized Kurdish speech diverse rhythms. Together, these components enable real-time generation of natural Kurdish speech with flexible pitch and rhythm. In a subjective human evaluation on a custom dataset, the proposed approach achieves a mean opinion score (MOS) of 3.94, outperforming a one-stage system and other two-stage systems.
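The abstract does not state which loss is used to align the two latent distributions. In VAE-based models this is commonly a KL divergence between Gaussians, so the following is only a minimal sketch under that assumption, with both distributions taken to be diagonal Gaussians. The function name `gaussian_kl` and its parameters are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def gaussian_kl(
    mu_q: Sequence[float],
    log_sigma_q: Sequence[float],
    mu_p: Sequence[float],
    log_sigma_p: Sequence[float],
) -> float:
    """KL(q || p) for two diagonal Gaussians, summed over latent dimensions.

    Assumption (not from the paper): q and p are diagonal Gaussians,
    each described by a mean and a log standard deviation per dimension.
    """
    total = 0.0
    for mq, lsq, mp, lsp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lsq)  # sigma_q^2
        var_p = math.exp(2.0 * lsp)  # sigma_p^2
        # Closed-form KL divergence between two univariate Gaussians.
        total += (lsp - lsq) + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total


if __name__ == "__main__":
    # Identical distributions have zero divergence.
    print(gaussian_kl([0.0, 1.0], [0.0, 0.5], [0.0, 1.0], [0.0, 0.5]))
    # Unit-variance Gaussians whose means differ by 1.
    print(gaussian_kl([0.0], [0.0], [1.0], [0.0]))
```

Minimizing a term of this kind pulls one latent distribution toward the other; the divergence is zero only when the means and variances match in every dimension.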

* 19 pages
