Young-Sun Joo

MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech

Oct 04, 2024

Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo

Abstract: Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have certain limitations: they require a large amount of training data, which increases costs, and often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that is able to perform TTS or speech style transfer in zero-shot and cross-lingual conditions. MultiVerse requires much less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt for modeling filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach combining prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate the remarkable zero-shot multi-task TTS performance of MultiVerse and show that MultiVerse not only achieves zero-shot TTS performance comparable to data-driven TTS systems with much less data, but also significantly outperforms other zero-shot TTS systems trained with the same small amount of data. In particular, our novel prosody modeling technique significantly contributes to MultiVerse's ability to generate speech with high prosody similarity to the given prompts. Our samples are available at https://nc-ai.github.io/speech/publications/multiverse/index.html

* Accepted to EMNLP 2024 Findings

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

Jun 28, 2022

Taejun Bak, Junmo Lee, Hanbin Bae, Jinhyeok Yang, Jae-Sung Bae, Young-Sun Joo

Abstract: Neural vocoders based on the generative adversarial network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency band, most GAN-based neural vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we observed that multi-scale analysis focused on the low-frequency band causes unintended artifacts, e.g., aliasing and imaging artifacts, and that these artifacts degrade the quality of the synthesized speech waveform. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based neural vocoders and propose a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band waveforms while avoiding aliasing. The experimental results show that Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech. In particular, Avocodo can even reproduce high-quality waveforms of unseen speakers.
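
The aliasing mentioned in the abstract can be illustrated with a small sketch. This is not Avocodo's pseudo quadrature mirror filter bank; it only shows, under assumed values (an 8 kHz sample rate and a 3 kHz test tone, both chosen for illustration), what happens when a waveform is downsampled without any anti-aliasing filter. The helper `dominant_frequency` is a hypothetical name introduced here.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest DFT bin below Nyquist."""
    n = len(signal)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        # Naive DFT coefficient for bin k (fine for a few hundred samples).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


# Assumed illustration values, not taken from the paper.
sample_rate = 8000
tone_hz = 3000
signal = [math.sin(2 * math.pi * tone_hz * i / sample_rate) for i in range(400)]

# Naive 2x downsampling: keep every other sample, no low-pass filter.
naive = signal[::2]
original_hz = dominant_frequency(signal, sample_rate)
aliased_hz = dominant_frequency(naive, sample_rate // 2)

print(original_hz, aliased_hz)
```

After naive downsampling the new Nyquist frequency is 2 kHz, so the 3 kHz tone folds back and appears at a lower frequency. A discriminator that inspects such a waveform would see energy that is not present in that band of the original signal, which is the kind of artifact a band-splitting filter bank is meant to avoid.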

Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

Apr 12, 2022

Hanbin Bae, Young-Sun Joo

Abstract: The recently developed pitch-controllable text-to-speech (TTS) model, i.e., FastPitch, is conditioned on pitch contours. However, the quality of the synthesized speech degrades considerably for pitch values that deviate significantly from the average pitch; i.e., the ability to control pitch is limited. To address this issue, we propose two algorithms to improve the robustness of FastPitch. First, we propose a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation. Pitch-shifted speech samples sound more natural when using the proposed algorithm because the speaker's vocal timbre is maintained. Moreover, we propose a training algorithm that defines FastPitch using pitch-augmented speech datasets with different pitch ranges for the same sentence. The experimental results demonstrate that the proposed algorithms improve the pitch controllability of FastPitch.

* Submitted to INTERSPEECH 2022

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

Apr 08, 2022

Jae-Sung Bae, Jinhyeok Yang, Tae-Jun Bak, Young-Sun Joo

Abstract: This paper proposes a hierarchical and multi-scale variational autoencoder-based non-autoregressive text-to-speech model (HiMuV-TTS) to generate natural speech with diverse speaking styles. Recent advances in non-autoregressive TTS (NAR-TTS) models have significantly improved the inference speed and robustness of synthesized speech. However, the diversity of speaking styles and the naturalness of the speech still need to be improved. To solve this problem, we propose the HiMuV-TTS model, which first determines the global-scale prosody and then determines the local-scale prosody by conditioning on the global-scale prosody and the learned text representation. In addition, we improve the quality of speech by adopting the adversarial training technique. Experimental results verify that the proposed HiMuV-TTS model can generate more diverse and natural speech than TTS models with single-scale variational autoencoders, and can represent different prosody information at each scale.

* Submitted to INTERSPEECH 2022

Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech

Jun 29, 2021

Jae-Sung Bae, Tae-Jun Bak, Young-Sun Joo, Hoon-Young Cho

Abstract: In this paper, we propose methods for improving the modeling performance of a Transformer-based non-autoregressive text-to-speech (TNA-TTS) model. Although the text encoder and audio decoder handle different types and lengths of data (i.e., text and audio), TNA-TTS models are not designed with these differences in mind. Therefore, to improve the modeling performance of the TNA-TTS model, we propose a hierarchical Transformer structure-based text encoder and audio decoder that are designed to accommodate the characteristics of each module. For the text encoder, we constrain each self-attention layer so that the encoder focuses on a text sequence from the local to the global scope. Conversely, the audio decoder constrains its self-attention layers to focus in the reverse direction, i.e., from the global to the local scope. Additionally, we further improve the pitch modeling accuracy of the audio decoder by providing sentence-level and word-level pitch as conditions. Various objective and subjective evaluations verified that the proposed method outperformed the baseline TNA-TTS.

* Accepted to INTERSPEECH 2021
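
One way to picture the local-to-global constraint described in the abstract is as a per-layer attention window that widens with depth. The sketch below is only an illustration under assumptions: the abstract does not say how the constraint is implemented, and the doubling window schedule and the names `window_mask` and `layer_windows` are hypothetical, not the authors' method.

```python
def window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def layer_windows(num_layers, seq_len, local_to_global=True):
    """Hypothetical schedule: the window doubles per layer, capped at the full sequence."""
    windows = [min(2 ** layer, seq_len - 1) for layer in range(num_layers)]
    # Reversing the schedule gives the global-to-local order described for the decoder.
    return windows if local_to_global else windows[::-1]


# Encoder-style masks: narrow windows first, full attention last.
encoder_masks = [window_mask(8, w) for w in layer_windows(4, 8)]
# Decoder-style masks: full attention first, narrow windows last.
decoder_masks = [window_mask(8, w) for w in layer_windows(4, 8, local_to_global=False)]

for layer, mask in enumerate(encoder_masks):
    print(layer, sum(mask[0]))  # number of positions the first token can attend to
```

In a real model, masks like these would be applied to the attention scores before the softmax so that disallowed positions receive zero weight.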

A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music

Mar 04, 2021

Hanbin Bae, Jae-Sung Bae, Young-Sun Joo, Young-Ik Kim, Hoon-Young Cho

Abstract: Recently, it has become easier to obtain speech data from various media such as the internet or YouTube, but directly utilizing them to train a neural text-to-speech (TTS) model is difficult. The proportion of clean speech is insufficient and the remainder includes background music, which makes training difficult even with the global style token (GST). Therefore, we propose the following method to successfully train an end-to-end TTS model with limited broadcast data. First, the background music is removed from the speech by introducing a music filter. Second, the GST-TTS model with an auxiliary quality classifier is trained with the filtered speech and a small amount of clean speech. In particular, the quality classifier makes the embedding vector of the GST layer focus on representing the speech quality (filtered or clean) of the input speech. The experimental results verified that the proposed method synthesized speech of much higher quality than conventional methods.

* Accepted at ICASSP 2021
