Sarina Meyer

First Steps Towards Voice Anonymization for Code-Switching Speech

Jul 02, 2025

Sarina Meyer, Ekaterina Kolos, Ngoc Thang Vu

Abstract: The goal of voice anonymization is to modify an audio recording such that the true identity of its speaker is hidden. Research on this task is typically limited to the same English read speech datasets, so the efficacy of current methods for other types of speech data remains unknown. In this paper, we present the first investigation of voice anonymization for the multilingual phenomenon of code-switching speech. We prepare two corpora for this task and propose adaptations to a multilingual anonymization model to make it applicable for code-switching speech. By testing the anonymization performance of this and two language-independent methods on the datasets, we find that only the multilingual system performs well in terms of privacy and utility preservation. Furthermore, we observe challenges in performing utility evaluations on this data because of its spontaneous character and the limited code-switching support by the multilingual speech recognition model.

* accepted at Interspeech 2025

Probing the Feasibility of Multilingual Speaker Anonymization

Jul 03, 2024

Sarina Meyer, Florian Lux, Ngoc Thang Vu

Abstract: In speaker anonymization, speech recordings are modified in a way that the identity of the speaker remains hidden. While this technology could help to protect the privacy of individuals around the globe, current research restricts this by focusing almost exclusively on English data. In this study, we extend a state-of-the-art anonymization system to nine languages by transforming language-dependent components to their multilingual counterparts. Experiments testing the robustness of the anonymized speech against privacy attacks and speech deterioration show an overall success of this system for all languages. The results suggest that speaker embeddings trained on English data can be applied across languages, and that the anonymization performance for a language is mainly affected by the quality of the speech synthesis component used for it.

* accepted at Interspeech 2024

Meta Learning Text-to-Speech Synthesis in over 7000 Languages

Jun 10, 2024

Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

Abstract: In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

* accepted at Interspeech 2024

The VoicePrivacy 2024 Challenge Evaluation Plan

Apr 03, 2024

Natalia Tomashenko, Xiaoxiao Miao, Pierre Champion, Sarina Meyer, Xin Wang, Emmanuel Vincent, Michele Panariello, Nicholas Evans, Junichi Yamagishi, Massimiliano Todisco

Abstract: The task of the challenge is to develop a voice anonymization system for speech data which conceals the speaker's voice identity while protecting linguistic content and emotional states. The organizers provide development and evaluation datasets and evaluation scripts, as well as baseline anonymization systems and a list of training resources formed on the basis of the participants' requests. Participants apply their developed anonymization systems, run evaluation scripts and submit evaluation results and anonymized speech data to the organizers. Results will be presented at a workshop held in conjunction with Interspeech 2024 to which all participants are invited to present their challenge systems and to submit additional workshop papers.

* arXiv admin note: substantial text overlap with arXiv:2203.12468

The IMS Toucan System for the Blizzard Challenge 2023

Oct 26, 2023

Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu

Abstract: For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN-based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final waveform. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.

* Published at the Blizzard Challenge Workshop 2023, co-located with the Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023

Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions

Oct 26, 2023

Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu

Abstract: Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system, conditioned on embeddings of real humans during training, without sacrificing privacy during inference.

* Published at ISCA Interspeech 2023 https://www.isca-speech.org/archive/interspeech_2023/lux23_interspeech.html

VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

Sep 14, 2023

Sarina Meyer, Xiaoxiao Miao, Ngoc Thang Vu

Abstract: Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic has been continually increasing. However, the comparison and combination of different anonymization approaches remains challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which make the evaluation more powerful and reduce its computation time by 65 to 95%, depending on the metric. Our code is fully open source.

* Submitted to OJSP-ICASSP 2024

Modeling Speaker-Listener Interaction for Backchannel Prediction

Apr 10, 2023

Daniel Ortega, Sarina Meyer, Antje Schweitzer, Ngoc Thang Vu

Abstract: We present our latest findings on backchannel modeling novelly motivated by the canonical use of the minimal responses Yeah and Uh-huh in English and their corresponding tokens in German, and the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, their effects on the speaker's subsequent talk, and the consequent dynamic speaker-listener interaction. Therefore, we propose a neural-based acoustic backchannel classifier on minimal responses by processing acoustic features from the speaker's speech, capturing and imitating listeners' backchanneling behavior, and encoding speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a proper interaction encoding strategy, i.e., combining the speaker and listener embeddings, leads to the best performance on both datasets in terms of F1-score.

* Published in IWSDS 2023

Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy

Oct 20, 2022

Sarina Meyer, Pascal Tilli, Pavel Denisov, Florian Lux, Julia Koch, Ngoc Thang Vu

Abstract: In order to protect the privacy of speech data, speaker anonymization aims to hide the identity of a speaker by changing the voice in speech recordings. This typically comes with a privacy-utility trade-off between protection of individuals and usability of the data for downstream applications. One of the challenges in this context is to create non-existent voices that sound as natural as possible. In this work, we propose to tackle this issue by generating speaker embeddings using a generative adversarial network with Wasserstein distance as cost function. By incorporating these artificial embeddings into a speech-to-text-to-speech pipeline, we outperform previous approaches in terms of privacy and utility. According to standard objective metrics and human evaluation, our approach generates intelligible and content-preserving yet privacy-protecting versions of the original recordings.

* IEEE Spoken Language Technology Workshop 2022

Speaker Anonymization with Phonetic Intermediate Representations

Jul 11, 2022

Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli, Ngoc Thang Vu

Abstract: In this work, we propose a speaker anonymization pipeline that leverages high-quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings. Using phones as the intermediate representation ensures near-complete elimination of speaker identity information from the input while preserving the original phonetic content as much as possible. Our experimental results on LibriSpeech and VCTK corpora reveal two key findings: 1) although automatic speech recognition produces imperfect transcriptions, our neural speech synthesis system can handle such errors, making our system feasible and robust, and 2) combining speaker embeddings from different resources is beneficial and their appropriate normalization is crucial. Overall, our final best system significantly outperforms the baselines provided in the Voice Privacy Challenge 2020 in terms of privacy robustness against a lazy-informed attacker while maintaining high intelligibility and naturalness of the anonymized speech.

* Accepted at Interspeech 2022
