Ladislav Mosner

Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

Nov 03, 2022

Sofoklis Kakouros, Themos Stafylakis, Ladislav Mosner, Lukas Burget

Abstract: When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal and how to best quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech, enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered, as the relevant emotion information is likely to appear piecewise and not uniformly across the signal. For the labels, we need to take into account that there is a substantial degree of noise that comes from the subjective human annotations. In this paper, we propose a novel approach to attentive pooling based on correlations between the representations' coefficients, combined with label smoothing, a method aiming to reduce the confidence of the classifier on the training labels. We evaluate our proposed approach on the benchmark dataset IEMOCAP and demonstrate high performance surpassing that reported in the literature. The code to reproduce the results is available at github.com/skakouros/s3prl_attentive_correlation.
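
The label smoothing mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common uniform variant, where a fraction epsilon of the probability mass is spread evenly over all classes; the abstract does not state which variant or which epsilon the authors use, so the function names and the default epsilon=0.1 below are illustrative assumptions and are not taken from the linked repository.

```python
import math

def smooth_labels(target, num_classes, epsilon=0.1):
    """Uniform label smoothing (assumed variant): the target class keeps
    1 - epsilon, and epsilon is spread evenly over all classes."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon if k == target else 0.0) + uniform
            for k in range(num_classes)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy between the smoothed target distribution and the
    classifier's softmax output."""
    log_probs = log_softmax(logits)
    q = smooth_labels(target, len(logits), epsilon)
    return -sum(qk * lp for qk, lp in zip(q, log_probs))

# Four emotion classes, true class index 2 (hypothetical example values).
soft_target = smooth_labels(2, 4, epsilon=0.1)
loss = smoothed_cross_entropy([0.2, -1.0, 3.0, 0.5], target=2, epsilon=0.1)
```

With epsilon set to 0 the loss reduces to ordinary cross-entropy, so the smoothing strength can be tuned without changing the rest of the training loop.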

* Submitted to IEEE-ICASSP 2023

Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Oct 15, 2022

Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

Abstract: Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, in particular the first- and second-order statistics of the representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised models, based on the correlations between the coefficients of the representations (correlation pooling). We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.
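
As a rough illustration of the idea, the sketch below contrasts mean pooling with a simple form of correlation pooling: it computes the Pearson correlation between every pair of channels across time and keeps the upper triangle as the utterance-level vector. This is only an assumption about the general technique; the abstract does not give the exact formulation (for example any dimensionality reduction, normalization, or layer weighting), and the function names are hypothetical rather than taken from the linked repository.

```python
import math

def mean_pooling(frames):
    """Average each channel over time: T x D frames -> D values."""
    T, D = len(frames), len(frames[0])
    return [sum(f[d] for f in frames) / T for d in range(D)]

def correlation_pooling(frames, eps=1e-8):
    """Pearson correlation between every channel pair across time.

    frames: list of T frames, each a list of D floats.
    Returns the D*(D-1)/2 upper-triangular correlations (diagonal omitted),
    so the output size does not depend on the utterance length T.
    """
    T, D = len(frames), len(frames[0])
    means = mean_pooling(frames)
    centered = [[f[d] - means[d] for d in range(D)] for f in frames]
    stds = [math.sqrt(sum(c[d] ** 2 for c in centered) / T) for d in range(D)]
    pooled = []
    for i in range(D):
        for j in range(i + 1, D):
            cov = sum(c[i] * c[j] for c in centered) / T
            # eps guards against division by zero for constant channels.
            pooled.append(cov / (stds[i] * stds[j] + eps))
    return pooled

# Toy input: 3 frames, 3 channels. Channel 1 is a scaled copy of channel 0,
# and channel 2 is its negation.
frames = [[1.0, 2.0, -1.0], [2.0, 4.0, -2.0], [3.0, 6.0, -3.0]]
corr_vector = correlation_pooling(frames)
mean_vector = mean_pooling(frames)
```

Unlike the mean, the correlation vector is invariant to per-channel shifts and scalings, which is one plausible reason the two pooling methods could carry complementary information when fused.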

* Accepted at IEEE-SLT 2022

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Oct 03, 2022

Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

Abstract: In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and a learning rate schedule to stabilize the fine-tuning process and further boost performance. Multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters. We regularize towards the parameters of the pre-trained model and we set different learning rates for each layer of the pre-trained model during fine-tuning. The experimental results show that our method can significantly shorten the training time to 4 hours and achieve SOTA performance: 0.59%, 0.79% and 1.77% EER on Vox1-O, Vox1-E and Vox1-H, respectively.
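
The two fine-tuning ingredients named in the abstract, per-layer learning rates and regularization towards the pre-trained parameters, can be sketched as follows. The abstract does not say how the per-layer rates are chosen or which penalty is used, so the geometric decay and the squared L2 distance below are assumptions, and all names and values are hypothetical.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.9):
    """Assumed scheme: the top layer gets base_lr and each lower layer
    gets the rate of the layer above it multiplied by `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def l2_to_pretrained(params, pretrained, weight):
    """Assumed penalty: weight * sum of squared distances between the
    current parameters and their pre-trained values."""
    return weight * sum((p - p0) ** 2 for p, p0 in zip(params, pretrained))

def sgd_step(params, grads, pretrained, lr, weight):
    """One SGD step on (task loss + penalty); the gradient of the penalty
    with respect to each parameter is 2 * weight * (p - p0)."""
    return [p - lr * (g + 2.0 * weight * (p - p0))
            for p, g, p0 in zip(params, grads, pretrained)]

# Three transformer layers, listed bottom to top (hypothetical values).
rates = layerwise_learning_rates(base_lr=1e-3, num_layers=3, decay=0.5)
penalty = l2_to_pretrained([1.0, 2.0], [1.0, 0.0], weight=0.5)
updated = sgd_step([1.0, 2.0], [0.0, 0.0], [1.0, 0.0], lr=0.1, weight=0.5)
```

In this sketch the penalty pulls a drifting parameter back towards its pre-trained value even when the task gradient is zero, while smaller rates in the lower layers keep the generic low-level representations closer to their starting point.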

* Accepted by SLT2022

Analyzing speaker verification embedding extractors and back-ends under language and channel mismatch

Mar 19, 2022

Anna Silnova, Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Pavel Matejka, Lukas Burget, Ondrej Glembek, Niko Brummer

Abstract: In this paper, we analyze the behavior and performance of speaker embeddings and the back-end scoring model under domain and language mismatch. We present our findings regarding ResNet-based speaker embedding architectures and show that reduced temporal stride yields improved performance. We then consider a PLDA back-end and show how a combination of a small speaker subspace, a language-dependent PLDA mixture, and nuisance-attribute projection can have a drastic impact on the performance of the system. In addition, we present an efficient way of scoring and fusing class posterior logit vectors, recently shown to perform well for the speaker verification task. The experiments are performed using the NIST SRE 2021 setup.

* Submitted to Odyssey 2022, under review
