Title: Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Jan 18, 2024

Yimin Deng, Huaizhen Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

Abstract: Voice conversion transfers speaker identity while preserving linguistic content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from the input audio can represent content well. In addition, speaker-style modeling with pre-trained models makes the process more complex. To tackle these issues, we introduce a new method named "CTVC", which uses disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to build a closer connection between frame-level hidden features and phoneme-level linguistic information. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves the sound quality and similarity of converted results.

* Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)
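
The abstract does not spell out how the similarity-based compression module works. As a rough illustration of the general idea only (pooling frame-level features into fewer, roughly phoneme-sized units), the sketch below merges consecutive frames whose cosine similarity stays above a threshold and averages each run. The function names `cosine` and `compress_frames`, the threshold value, and the merge rule are all assumptions made for this example; they are not taken from the paper and should not be read as the CTVC implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def compress_frames(frames, threshold=0.9):
    """Hypothetical compression rule (an assumption, not the paper's module):
    start a new segment whenever a frame's similarity to the previous frame
    drops below `threshold`, then represent each segment by its mean vector."""
    segments = []
    current = []
    for frame in frames:
        if current and cosine(current[-1], frame) < threshold:
            segments.append(mean_vector(current))
            current = []
        current.append(frame)
    if current:
        segments.append(mean_vector(current))
    return segments


# Toy input: four 2-D "frames" forming two clearly different runs.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
compressed = compress_frames(frames)
```

With the toy input, the first two frames and the last two frames each collapse into one averaged vector, so the sequence length drops from four to two. A real system would operate on learned hidden features rather than hand-written vectors.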
