Seung-won Park

Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques

Apr 02, 2021

Kang-wook Kim, Seung-won Park, Myun-chul Joe

Abstract: In this paper, we pose the current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine the best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces the GTA finetuning in VC, which significantly improves the quality and the speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both the naturalness and the speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech and similar quality can be achieved with any-to-many models. Audio samples are available at https://mindslab-ai.github.io/assem-vc/

* Submitted to Interspeech 2021

Generating Novel Glyph without Human Data by Learning to Communicate

Oct 09, 2020

Seung-won Park

Abstract: In this paper, we present Neural Glyph, a system that generates novel glyphs without any training data. The generator and the classifier are trained to communicate via visual symbols as a medium, which forces the generator to come up with a set of distinctive symbols. Our method results in glyphs that resemble human-made glyphs, which may imply that the visual appearance of existing glyphs can be attributed to the constraints of communication via writing. Important tricks that enable this framework are described, and the code is made available.

* Submitted to NeurIPS 2020 workshop on Machine Learning for Creativity and Design; 6 pages with 4 figures

Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data

May 07, 2020

Seung-won Park, Doo-young Kim, Myun-chul Joe

Abstract: We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and utilize ASR to automate the transcription with minimal reduction of the performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

* Submitted to Interspeech 2020
