Kang-wook Kim

Talking Face Generation with Multilingual TTS

May 13, 2022

Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, Kang-wook Kim

Abstract: In this work, we propose a joint system that combines talking face generation with text-to-speech and can generate multilingual talking face videos from text input alone. Our system synthesizes natural multilingual speech while maintaining the speaker's vocal identity, and produces lip movements synchronized to the synthesized speech. We demonstrate the generalization capabilities of our system on four languages (Korean, English, Japanese, and Chinese), each from a different language family. We also compare the outputs of our talking face generation model with those of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present the system as a neural dubber so that users can more easily make use of its multilingual capability.

* Accepted to CVPR Demo Track (2022)

FS-NCSR: Increasing Diversity of the Super-Resolution Space via Frequency Separation and Noise-Conditioned Normalizing Flow

Apr 20, 2022

Ki-Ung Song, Dongseok Shim, Kang-wook Kim, Jae-young Lee, Younggeun Kim

Abstract: Super-resolution is an inherently ill-posed problem: a single low-resolution (LR) image can correspond to multiple high-resolution (HR) images. Recent flow-based approaches address this ill-posedness by learning the super-resolution space and predicting diverse HR outputs. Unfortunately, the diversity of the super-resolution outputs is still unsatisfactory, and the outputs of flow-based models often suffer from undesired artifacts that lower their quality. In this paper, we propose FS-NCSR, which uses frequency separation and noise conditioning to produce more diverse and higher-quality super-resolution outputs than existing flow-based approaches. Because the sharpness and fine detail of an image rely on its high-frequency information, FS-NCSR estimates only the high-frequency information of the high-resolution outputs, without redundant low-frequency components. As a result, FS-NCSR significantly improves the diversity score without significant image quality degradation compared to NCSR, the winner of the NTIRE 2021 challenge.

* CVPRW 2022. The first three authors contributed equally.
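
The frequency separation idea mentioned in the FS-NCSR abstract can be illustrated with a small sketch. The abstract does not say which filter the authors use, so the box blur below is an assumption chosen for simplicity, and the function names (`box_blur`, `separate_frequencies`, `recombine`) are hypothetical. The sketch only shows that an image splits into a low-frequency part and a high-frequency residual that add back to the original; it does not reproduce the normalizing flow or noise conditioning.

```python
def box_blur(img, radius=1):
    """Low-pass filter (assumed for illustration): mean over a
    (2*radius+1) x (2*radius+1) window, with edge pixels clamped."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def separate_frequencies(img, radius=1):
    """Split an image into a low-frequency part and a high-frequency residual."""
    low = box_blur(img, radius)
    high = [[p - l for p, l in zip(row, low_row)]
            for row, low_row in zip(img, low)]
    return low, high


def recombine(low, high):
    """Add the two components back together to recover the image."""
    return [[l + h for l, h in zip(low_row, high_row)]
            for low_row, high_row in zip(low, high)]


if __name__ == "__main__":
    # A tiny grayscale image with one sharp vertical edge.
    image = [
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    low, high = separate_frequencies(image)
    for row in high:
        print([round(v, 3) for v in row])
```

In this toy decomposition the high-frequency residual is nonzero only around the edge, which is the kind of detail the abstract says the model concentrates on estimating.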

Controllable and Interpretable Singing Voice Decomposition via Assem-VC

Oct 25, 2021

Kang-wook Kim, Junhyeok Lee

Abstract: We propose a singing decomposition system that encodes time-aligned linguistic content, pitch, and source speaker identity via Assem-VC. With the decomposed speaker-independent information and the target speaker's embedding, we can synthesize the singing voice of the target speaker. As a result, we create a perfectly synced duet between the user's singing voice and the target singer's converted singing voice.

* Accepted to NeurIPS Workshop on ML for Creativity and Design 2021 (Oral)

Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques

Apr 02, 2021

Kang-wook Kim, Seung-won Park, Myun-chul Joe

Abstract: In this paper, we frame current state-of-the-art voice conversion (VC) systems as two-encoder-one-decoder models. After comparing these models, we combine their best features and propose Assem-VC, a new state-of-the-art any-to-many non-parallel VC system. This paper also introduces GTA finetuning to VC, which significantly improves the quality and speaker similarity of the outputs. Assem-VC outperforms the previous state-of-the-art approaches in both naturalness and speaker similarity on the VCTK dataset. As an objective result, the degree of speaker disentanglement of features such as phonetic posteriorgrams (PPG) is also explored. Our investigation indicates that many-to-many VC results are no longer distinct from human speech, and that similar quality can be achieved with any-to-many models. Audio samples are available at https://mindslab-ai.github.io/assem-vc/

* Submitted to Interspeech 2021
