Tim Polzehl

Private kNN-VC: Interpretable Anonymization of Converted Speech

May 23, 2025

Carlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller

Abstract:Speaker anonymization seeks to conceal a speaker's identity while preserving the utility of their speech. The achieved privacy is commonly evaluated with a speaker recognition model trained on anonymized speech. Although this represents a strong attack, it is unclear which aspects of speech are exploited to identify the speakers. Our research sets out to unveil these aspects. It starts with kNN-VC, a powerful voice conversion model that performs poorly as an anonymization system, presumably because of prosody leakage. To test this hypothesis, we extend kNN-VC with two interpretable components that anonymize the duration and variation of phones. These components increase privacy significantly, proving that the studied prosodic factors encode speaker identity and are exploited by the privacy attack. Additionally, we show that changes in the target selection algorithm considerably influence the outcome of the privacy attack.

* Accepted by Interspeech 2025

BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

May 20, 2025

Yassine El Kheir, Tim Polzehl, Sebastian Möller

Abstract:We propose BiCrossMamba-ST, a robust framework for speech deepfake detection that leverages a dual-branch spectro-temporal architecture powered by bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech. In addition, our proposed framework leverages a convolution-based 2D attention map to focus on specific spectro-temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba-ST achieves significant performance improvements, a 67.74% and 26.3% relative gain over state-of-the-art AASIST on ASVSpoof LA21 and ASVSpoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVSpoof DF21. Code and models will be made publicly available.

* Accepted at Interspeech 2025
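
The relative gains quoted in the abstract above can be read as relative reductions of an error metric. The following is a minimal sketch of that arithmetic; it assumes the metric is one where lower is better (the abstract does not name it), and the function name and the example numbers are hypothetical, not values from the paper.

```python
def relative_gain(baseline_error: float, new_error: float) -> float:
    """Relative reduction of a lower-is-better error metric, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers for illustration only: halving the error is a 50% relative gain.
print(relative_gain(2.0, 1.0))
```

A relative gain is independent of the absolute error level, so it cannot be converted back into an absolute error without knowing the baseline value.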

Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection

Feb 05, 2025

Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller

Abstract:This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse contexts, including multilingual datasets (English, Chinese, Spanish), partial, song, and scene-based deepfake scenarios. By systematically evaluating the contributions of different transformer layers, we uncover critical insights into model behavior and performance. Our findings reveal that lower layers consistently provide the most discriminative features, while higher layers capture less relevant information. Notably, all models achieve competitive equal error rate (EER) scores even when employing a reduced number of layers. This indicates that we can reduce computational costs and increase the inference speed of detecting deepfakes by utilizing only a few lower layers. This work enhances our understanding of SSL models in deepfake detection, offering valuable insights applicable across varied linguistic and contextual settings. Our trained models and code are publicly available: https://github.com/Yaselley/SSL_Layerwise_Deepfake.

* 13 pages, 3 figures, 3 tables. Accepted to NAACL Findings 2025
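
The equal error rate (EER) mentioned in the abstract above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping thresholds over the observed scores. It assumes that higher scores mean "more likely bona fide" and that labels are 1 for bona fide and 0 for spoof; the function name and the toy data are hypothetical and unrelated to the paper's code.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from detection scores (labels: 1 = bona fide, 0 = spoof)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # False rejection: a bona fide sample scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        # False acceptance: a spoofed sample scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data: one bona fide sample (0.4) scores below one spoofed sample (0.6).
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.5
```

With few samples the two error rates may never meet exactly, which is why the sketch returns their average at the closest threshold rather than interpolating.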

Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example

Oct 20, 2024

Suhita Ghosh, Melanie Jouaiti, Arnab Das, Yamini Sinha, Tim Polzehl, Ingo Siegert, Sebastian Stober

Abstract:Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content. Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains, which is essential for remote health monitoring. To address this gap, we propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example. The proposed method, trained with novel losses, aids in disentangling linguistic, prosodic, and domain representations, enabling the model to adapt to uncommon speech patterns. Objective and subjective evaluations show that DDSP-QbE significantly outperforms the voice conversion state-of-the-art concerning intelligibility, prosody, and domain preservation across diverse datasets, pathologies, and speakers while maintaining quality and speaker anonymity. Experts validate domain preservation by analysing twelve clinically pertinent domain attributes.

* Accepted in Interspeech 2024

Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion

Sep 14, 2023

Suhita Ghosh, Arnab Das, Yamini Sinha, Ingo Siegert, Tim Polzehl, Sebastian Stober

Abstract:Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.

* Accepted in Interspeech 2023

StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings

Sep 14, 2023

Arnab Das, Suhita Ghosh, Tim Polzehl, Sebastian Stober

Abstract:Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC, is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, which is pertinent to preserving emotion. Specifically, there is an emotion leakage from the reference audio used to capture the speaker embeddings while training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method which exploits emotion supervision through latent emotion representations. The objective and subjective evaluations prove the efficacy of the proposed strategy over diverse datasets, emotions, gender, etc.

* Accepted in 12th Speech Synthesis Workshop (SSW), Satellite event in Interspeech 2023

Speaker adaptation for Wav2vec2 based dysarthric ASR

Apr 02, 2022

Murali Karthick Baskar, Tim Herzig, Diana Nguyen, Mireia Diez, Tim Polzehl, Lukáš Burget, Jan "Honza" Černocký

Abstract:Dysarthric speech recognition has posed major challenges due to the lack of training data and heavy mismatch in speaker characteristics. Recent ASR systems have benefited from readily available pretrained models such as wav2vec2 to improve recognition performance. Speaker adaptation using fMLLR and xvectors has provided major gains for dysarthric speech with very little adaptation data. However, the integration of wav2vec2 with fMLLR features or xvectors during wav2vec2 fine-tuning is yet to be explored. In this work, we propose a simple adaptation network for fine-tuning wav2vec2 using fMLLR features. The adaptation network is also flexible enough to handle other speaker-adaptive features such as xvectors. Experimental analysis shows steady improvements using our proposed approach across all impairment severity levels, attaining 57.72% WER for high severity on the UASpeech dataset. We also performed experiments on a German dataset to substantiate the consistency of our proposed approach across diverse domains.

* Submitted to INTERSPEECH 2022
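
The WER figure in the abstract above is a word error rate: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name and the example sentences are hypothetical, and it does not reproduce the paper's scoring setup (text normalization, for instance, is not handled).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion of a reference word
                curr[j - 1] + 1,                     # insertion of a hypothesis word
                prev[j - 1] + (ref_word != hyp_word),  # substitution, or match at no cost
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 2))  # 16.67
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.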

Does Summary Evaluation Survive Translation to Other Languages?

Sep 16, 2021

Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller

Abstract:The creation of a large summarization quality dataset is a considerable, expensive, time-consuming effort, requiring careful planning and setup. It includes producing human-written and machine-generated summaries and evaluation of the summaries by humans, preferably by linguistic experts, and by automatic evaluation tools. If such an effort is made in one language, it would be beneficial to be able to use it in other languages. To investigate how much we can trust the translation of such a dataset without repeating human annotations in another language, we translated an existing English summarization dataset, the SummEval dataset, to four different languages and analyzed the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language. Our results reveal that although translation changes the absolute value of automatic scores, the scores keep the same rank order and approximately the same correlations with human annotations.

* 6 pages, 2 figures, 2 tables, 1 appendix
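
The claim above that scores change in absolute value but keep their rank order can be checked with a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of average ranks. This is an illustrative choice: the abstract does not state which correlation measure the authors used, and the function names and score lists are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical metric scores: lower after translation, but in the same order.
english_scores = [0.2, 0.5, 0.9, 0.4]
translated_scores = [0.1, 0.3, 0.6, 0.25]
print(spearman(english_scores, translated_scores))  # 1.0
```

A coefficient near 1 means the translated scores order the summaries the same way as the originals, even if every individual value has shifted.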

Towards Human-Free Automatic Quality Evaluation of German Summarization

May 13, 2021

Neslihan Iskender, Oleg Vasilyev, Tim Polzehl, John Bohannon, Sebastian Möller

Abstract:Evaluating large summarization corpora using humans has proven to be expensive from both the organizational and the financial perspective. Therefore, many automatic evaluation metrics have been developed to measure the summarization quality in a fast and reproducible way. However, most of the metrics still rely on humans and need gold standard summaries generated by linguistic experts. Since BLANC does not require golden summaries and supposedly can use any underlying language model, we consider its application to the evaluation of summarization in German. This work demonstrates how to adjust the BLANC metric to a language other than English. We compare BLANC scores with the crowd and expert ratings, as well as with commonly used automatic metrics on a German summarization data set. Our results show that BLANC in German is especially good in evaluating informativeness.

* 6 pages, 2 figures

Improving Automatic Emotion Recognition from speech using Rhythm and Temporal feature

Mar 07, 2013

Mayank Bhargava, Tim Polzehl

Abstract:This paper is devoted to improving automatic emotion recognition from speech by incorporating rhythm and temporal features. Research on automatic emotion recognition so far has mostly been based on applying features like MFCCs, pitch and energy or intensity. The idea focuses on borrowing rhythm features from linguistic and phonetic analysis and applying them to the speech signal on the basis of acoustic knowledge only. In addition, we exploit a set of temporal and loudness features. A segmentation unit is first employed to separate the voiced, unvoiced and silence parts, and features are explored on the different segments. Thereafter, different classifiers are used for classification. After selecting the top features using an IGR filter, we achieve a recognition rate of 80.60% on the Berlin Emotion Database for the speaker-dependent framework.

* Appeared in ICECIT-2012, Srinivasa Ramanujan Institute of Technology, Anantapur