Abstract: While expressive speech synthesis and voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we here address the control of perceptual voice qualities (PVQs) recognized by phonetic experts, which are speech properties at a lower level of abstraction. The ability to manipulate PVQs can be a valuable teaching tool for speech pathologists in training or for voice actors. In this paper, we integrate a conditional continuous-normalizing-flow-based method into a text-to-speech system to modify perceptual voice attributes on a continuous scale. Unlike previous approaches, our system avoids direct manipulation of acoustic correlates and instead learns from examples. We demonstrate the system's capability by manipulating four voice qualities: roughness, breathiness, resonance, and weight. Phonetic experts evaluated these modifications for both seen and unseen speaker conditions. The results highlight both the system's strengths and areas for improvement.
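A toy sketch of the conditional continuous-normalizing-flow idea behind this kind of control (not the paper's actual architecture): a neural velocity field conditioned on a continuous perceptual-voice-quality value is integrated over time to transport a style embedding before it is consumed by the TTS decoder. All module names, dimensions, and the fixed-step Euler solver below are illustrative assumptions.

```python
# Illustrative conditional CNF transport of a style embedding; sizes and names are assumed.
import torch
import torch.nn as nn

class ConditionalVelocityField(nn.Module):
    def __init__(self, dim: int, cond_dim: int = 1, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, t, cond):
        t_col = t.expand(z.shape[0], 1)              # broadcast scalar time to the batch
        return self.net(torch.cat([z, cond, t_col], dim=-1))

@torch.no_grad()
def transport(field, z0, cond, steps: int = 50):
    """Fixed-step Euler integration of dz/dt = f(z, t, cond) from t=0 to t=1."""
    z, dt = z0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        z = z + dt * field(z, t, cond)
    return z

field = ConditionalVelocityField(dim=64)
z = torch.randn(4, 64)                               # stand-in style/speaker embeddings
breathiness = torch.full((4, 1), 0.8)                # target PVQ value on a continuous scale
z_modified = transport(field, z, breathiness)        # would then condition the TTS decoder
```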
Abstract: Unsupervised speech disentanglement aims at separating fast-varying from slowly varying components of a speech signal. In this contribution, we take a closer look at the embedding vector representing the slowly varying signal components, commonly named the speaker embedding vector. We ask which properties of a speaker's voice are captured and investigate to what extent individual embedding vector components are responsible for them, using the concept of Shapley values. Our findings show that certain speaker-specific acoustic-phonetic properties can be predicted fairly well from the speaker embedding, while the more abstract voice quality features we investigated cannot.
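A minimal sketch of this kind of Shapley-value analysis, assuming utterance-level speaker embeddings and a scalar acoustic-phonetic target such as mean F0 (the data below is synthetic and only stands in for real embeddings and measurements): fit a regressor from embedding components to the property, then attribute its predictions back to individual dimensions.

```python
# Sketch: rank speaker-embedding dimensions by their Shapley attribution for a
# predicted acoustic-phonetic property. Synthetic placeholder data throughout.
import numpy as np
import shap                                    # pip install shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))        # stand-in for 64-dim speaker embeddings
mean_f0 = embeddings[:, 3] * 20 + 120 + rng.normal(scale=5, size=500)   # toy target

# Regressor predicting the acoustic property from the embedding
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(embeddings, mean_f0)

# TreeExplainer yields exact Shapley values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(embeddings[:100])   # (100, 64) attribution matrix

# Embedding dimensions most responsible for the property, by mean |attribution|
importance = np.abs(shap_values).mean(axis=0)
print("most responsible dimensions:", np.argsort(importance)[::-1][:5])
```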
Abstract: Disentanglement is the task of learning representations that identify and separate factors that explain the variation observed in data. Disentangled representations are useful for increasing the generalizability, explainability, and fairness of data-driven models. Little is known about how well such disentanglement works for speech representations. A major challenge when tackling disentanglement for speech representations is that the generative factors underlying the speech signal are unknown. In this work, we investigate to what degree speech representations encoding speaker identity can be disentangled. To quantify disentanglement, we identify acoustic features that are highly speaker-variant and can serve as proxies for the factors of variation underlying speech. We find that disentanglement of the speaker embedding is limited when trained with standard objectives promoting disentanglement, but can be improved over vanilla representation learning to some extent.
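One simple way to select such speaker-variant proxy features, sketched under the assumption of utterance-level acoustic measurements with speaker labels (feature name and toy data are assumptions, not the paper's setup), is to rank candidates by a between- versus within-speaker variance ratio:

```python
# Sketch: score candidate acoustic features by how speaker-variant they are.
import numpy as np

def speaker_variance_ratio(values: np.ndarray, speakers: np.ndarray) -> float:
    """F-ratio-style score: between-speaker variance over within-speaker variance."""
    overall_mean = values.mean()
    between, within = 0.0, 0.0
    for spk in np.unique(speakers):
        v = values[speakers == spk]
        between += len(v) * (v.mean() - overall_mean) ** 2
        within += ((v - v.mean()) ** 2).sum()
    return between / max(within, 1e-12)

# Toy example: utterance-level mean F0 for three speakers (strongly speaker-variant)
speakers = np.repeat(np.arange(3), 50)
mean_f0 = np.concatenate([np.random.normal(m, 5, 50) for m in (110, 180, 230)])
print(speaker_variance_ratio(mean_f0, speakers))   # large value -> good proxy feature
```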
Abstract: Disentangling the speaker and content attributes of a speech signal into separate latent representations, followed by decoding the content with an exchanged speaker representation, is a popular approach to voice conversion that can be trained with non-parallel and unlabeled speech data. However, previous approaches perform disentanglement only implicitly, via some sort of information bottleneck or normalization, where it is usually hard to find a good trade-off between voice conversion and content reconstruction. Further, previous works usually do not consider adapting the speaking rate to the target speaker, or they place major restrictions on the data or use case. Therefore, the contribution of this work is twofold. First, we employ an explicit and fully unsupervised disentanglement approach, which has previously only been used for representation learning, and show that it yields both superior voice conversion and superior content reconstruction. Second, we investigate simple and generic approaches to linearly adapt the length of a speech signal, and hence the speaking rate, to a target speaker, and show that the proposed adaptation increases the speaking-rate similarity with respect to the target speaker.
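A hedged sketch of the kind of linear length adaptation named in the second contribution, assuming frame-level features and known source/target speaking rates (function and variable names are illustrative, not from the paper): the time axis is rescaled by the ratio of the rates via linear interpolation.

```python
# Sketch: linearly rescale the time axis of frame-level features to match a target speaking rate.
import numpy as np

def linear_length_adaptation(features: np.ndarray, rate_src: float, rate_tgt: float) -> np.ndarray:
    """Resample (T, D) frame-level features along time by the factor rate_src / rate_tgt."""
    T, D = features.shape
    T_new = max(1, int(round(T * rate_src / rate_tgt)))   # faster target rate -> fewer frames
    src_pos = np.linspace(0.0, T - 1, num=T_new)           # fractional source frame positions
    out = np.empty((T_new, D), dtype=features.dtype)
    for d in range(D):
        out[:, d] = np.interp(src_pos, np.arange(T), features[:, d])
    return out

# Example: slow source speaker (4 syl/s) adapted to a faster target speaker (5 syl/s)
mel = np.random.randn(200, 80)
mel_adapted = linear_length_adaptation(mel, rate_src=4.0, rate_tgt=5.0)   # ~160 frames
```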