Abner Hernandez

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Apr 04, 2022

Abner Hernandez, Paula Andrea Pérez-Toro, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

Abstract: Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, privacy protection is an increasing concern that must be addressed. The current study investigates the use of voice conversion as a method for anonymizing voices. In particular, we train several voice conversion models using self-supervised speech representations including Wav2Vec2.0, HuBERT and UniSpeech. Converted voices retain a low word error rate, within 1% of the original voice. Equal error rate increases from 1.52% to 46.24% on the LibriSpeech test set and from 3.75% to 45.84% on speakers from the VCTK corpus, which indicates degraded speaker verification performance. Lastly, we conduct experiments on dysarthric speech data to show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices for discriminating between healthy and pathological speech.

* Submitted for review at Interspeech 2022
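
For readers unfamiliar with the equal error rate (EER) reported in the abstract above, the following is a minimal sketch of how the metric can be computed from speaker verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the score lists, and the simple threshold sweep are all assumptions chosen for illustration. An EER near 50% means the verifier is close to chance at telling speakers apart.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of similarity scores.

    genuine:  scores for same-speaker trials (higher = more similar)
    impostor: scores for different-speaker trials
    A trial is accepted when its score is >= the threshold.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine) | set(impostor))

    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials wrongly rejected.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where FAR and FRR are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.4]
    impostor_scores = [0.1, 0.2, 0.3, 0.6]
    print(f"EER: {equal_error_rate(genuine_scores, impostor_scores):.2%}")
```

The sweep over observed scores is a coarse approximation; evaluation toolkits typically interpolate between thresholds, which can yield slightly different values on small trial lists.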

Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition

Apr 04, 2022

Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Nöth, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

Abstract: State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, performance on impaired speech remains an issue. The current study explores the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech. Dysarthric speech recognition is particularly difficult because several aspects of speech, such as articulation, prosody and phonation, can be impaired. Specifically, we train an acoustic model with features extracted from Wav2Vec, HuBERT, and the cross-lingual XLSR model. Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance. In particular, features from the multilingual model led to lower WERs than filterbanks (Fbank) or models trained on a single language. Improvements were observed in English speakers with dysarthria caused by cerebral palsy (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria (PC-GITA corpus) and Italian speakers with paralysis-based dysarthria (EasyCall corpus). Compared to using Fbank features, XLSR-based features reduced WERs by 6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpora, respectively.

* Submitted for review at Interspeech 2022
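
As background for the word error rate (WER) figures in the abstract above, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are assumptions for illustration and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.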

SliTraNet: Automatic Detection of Slide Transitions in Lecture Videos using Convolutional Neural Networks

Feb 07, 2022

Aline Sindel, Abner Hernandez, Seung Hee Yang, Vincent Christlein, Andreas Maier

Abstract: With the increasing amount of online learning material on the web, searching for specific content in lecture videos can be time-consuming. Therefore, automatic slide extraction from lecture videos can help give a brief overview of the main content and support students in their studies. For this task, we propose a deep learning method to detect slide transitions in lecture videos. We first process each frame of the video with a heuristic-based approach using a 2-D convolutional neural network to predict transition candidates. Then, we increase the complexity by employing two 3-D convolutional neural networks to refine the transition candidates. Evaluation results demonstrate the effectiveness of our method in finding slide transitions.

* 6 pages, 5 figures, 1 table, accepted to OAGM Workshop 2021
