Abstract: Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis. However, these systems have certain limitations: they require a large amount of training data, which increases costs, and they often overlook prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system able to perform TTS or speech style transfer under zero-shot and cross-lingual conditions. MultiVerse requires far less training data than traditional data-driven approaches. To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt to model filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling approach that combines prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate MultiVerse's strong zero-shot multi-task TTS performance: it not only matches the zero-shot TTS quality of data-driven systems while using far less data, but also significantly outperforms other zero-shot TTS systems trained on the same small amount of data. In particular, our novel prosody modeling technique contributes substantially to MultiVerse's ability to generate speech whose prosody closely matches the given prompts. Our samples are available at https://nc-ai.github.io/speech/publications/multiverse/index.html
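As a rough illustration of the prompt-based disentanglement described above, the sketch below separates a speech prompt into filter-related and source-related embeddings. The module structure, sizes, and names (PromptDisentangler, filter_head, source_head) are assumptions for illustration, not MultiVerse's actual architecture.

```python
# Illustrative sketch only: derive separate filter-related (timbre) and
# source-related (prosody) embeddings from a mel-spectrogram prompt.
import torch
import torch.nn as nn

class PromptDisentangler(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.backbone = nn.GRU(n_mels, dim, batch_first=True)
        self.filter_head = nn.Linear(dim, dim)  # filter-related (e.g. timbre/formant) embedding
        self.source_head = nn.Linear(dim, dim)  # source-related (e.g. pitch/prosody) embedding

    def forward(self, prompt_mel):
        _, h = self.backbone(prompt_mel)   # summarize the prompt sequence
        h = h[-1]                          # (batch, dim)
        return self.filter_head(h), self.source_head(h)

# Example: a ~3-second prompt as (batch=1, frames=258, n_mels=80).
filter_emb, source_emb = PromptDisentangler()(torch.randn(1, 258, 80))
```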
Abstract: Neural vocoders based on generative adversarial networks (GANs) have been widely used because they generate high-quality speech waveforms with fast inference and lightweight networks. Since perceptually important speech components are primarily concentrated in the low-frequency band, most GAN-based neural vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we observed that multi-scale analysis focused on the low-frequency band causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the quality of the synthesized speech waveform. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based neural vocoders and propose a GAN-based neural vocoder, called Avocodo, that synthesizes high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate waveforms from different perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter (PQMF) bank to obtain downsampled multi-band waveforms while avoiding aliasing. The experimental results show that Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech. In particular, Avocodo can even reproduce high-quality waveforms of unseen speakers.
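A minimal sketch of the PQMF analysis idea mentioned above: a cosine-modulated prototype low-pass filter splits the waveform into critically sampled sub-bands, avoiding the aliasing introduced by naive downsampling. The number of taps, Kaiser-window beta, and cutoff ratio below are illustrative values, not Avocodo's exact filter design.

```python
# Sketch of a pseudo-QMF (PQMF) analysis bank under illustrative filter settings.
import numpy as np
from scipy.signal import firwin

def pqmf_analysis(x, num_bands=4, taps=62, cutoff=0.142, beta=9.0):
    """Split waveform x into `num_bands` critically sampled sub-band signals."""
    # Prototype low-pass filter (Kaiser window); cutoff is relative to Nyquist.
    h_proto = firwin(taps + 1, cutoff, window=("kaiser", beta))
    n = np.arange(taps + 1)
    subbands = []
    for k in range(num_bands):
        # Cosine modulation shifts the prototype to the k-th sub-band.
        h_k = 2.0 * h_proto * np.cos(
            (2 * k + 1) * (np.pi / (2 * num_bands)) * (n - taps / 2)
            + (-1) ** k * np.pi / 4
        )
        band = np.convolve(x, h_k, mode="same")
        subbands.append(band[::num_bands])  # critical downsampling by num_bands
    return np.stack(subbands)

# Example: 1 s of 22.05 kHz audio -> 4 sub-band signals, each downsampled by 4.
bands = pqmf_analysis(np.random.randn(22050))  # shape (4, 5513)
```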
Abstract: Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize the speech of a speaker with limited training data. Fine-tuning the multi-speaker model on the target speaker's data can achieve better quality; however, a gap to real speech samples remains, and the resulting model is speaker-dependent. In this work, we propose GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for the feature matching loss used in adversarial training. In subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models and showed a better MOS score than a speaker-specific fine-tuned FastSpeech2.
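One plausible reading of an automatically scaled feature matching loss is sketched below: the feature matching term is rescaled so that its magnitude tracks another training loss (e.g., the mel reconstruction loss). The scaling rule and function names here are assumptions for illustration, not GANSpeech's published formulation.

```python
# Hedged sketch of a feature matching loss with automatic scaling (assumed rule:
# match the FM loss magnitude to a reference loss such as the mel L1 loss).
import torch

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between discriminator feature maps for real and generated mels."""
    loss = 0.0
    for r, f in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(r.detach() - f))
    return loss / len(real_feats)

def scaled_feature_matching_loss(real_feats, fake_feats, reference_loss):
    """Rescale the FM loss so its magnitude follows `reference_loss`."""
    fm = feature_matching_loss(real_feats, fake_feats)
    scale = (reference_loss / (fm + 1e-8)).detach()  # the scale carries no gradient
    return scale * fm

# Example with dummy feature maps from a 3-layer discriminator.
real = [torch.randn(2, 64, 100) for _ in range(3)]
fake = [torch.randn(2, 64, 100, requires_grad=True) for _ in range(3)]
loss = scaled_feature_matching_loss(real, fake, torch.tensor(1.3))
```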
Abstract: Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models, and prosodic speech can be generated by conditioning on these acoustic features. However, speech synthesized with a large pitch-shift scale suffers from degraded audio quality and distorted speaker characteristics. To address this problem, we propose a feed-forward Transformer-based TTS model designed according to source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. By modeling each feature separately, the model's tendency to learn an entangled relationship between the two features is mitigated.
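The parallel text/acoustic structure can be pictured with the toy sketch below: a formant (filter) branch driven by text-derived hidden states and an excitation (source) branch driven by pitch-derived hidden states, whose outputs are combined into the mel prediction. Layer types, sizes, and the additive combination are illustrative assumptions rather than FastPitchFormant's exact configuration.

```python
# Illustrative source-filter-inspired decoder with two parallel branches.
import torch
import torch.nn as nn

class ParallelSourceFilterDecoder(nn.Module):
    def __init__(self, hidden=256, n_mels=80):
        super().__init__()
        self.formant_generator = nn.Sequential(      # filter branch: text-derived features
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_mels))
        self.excitation_generator = nn.Sequential(   # source branch: pitch-derived features
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_mels))

    def forward(self, text_hidden, pitch_hidden):
        formant = self.formant_generator(text_hidden)        # spectral-envelope-like term
        excitation = self.excitation_generator(pitch_hidden)  # pitch/excitation term
        return formant + excitation                           # combined mel prediction

# Example with dummy upsampled hidden sequences (batch=2, frames=100, dim=256).
mel = ParallelSourceFilterDecoder()(torch.randn(2, 100, 256), torch.randn(2, 100, 256))  # (2, 100, 80)
```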