Line H. Clemmensen

Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift

Apr 07, 2025

Maja J. Hjuler, Line H. Clemmensen, Sneha Das

Abstract: We introduce EmoLIME, a version of local interpretable model-agnostic explanations (LIME) for black-box Speech Emotion Recognition (SER) models. To the best of our knowledge, this is the first attempt to apply LIME in SER. EmoLIME generates high-level interpretable explanations and identifies which specific frequency ranges are most influential in determining emotional states. The approach aids in interpreting complex, high-dimensional embeddings such as those generated by end-to-end speech models. We evaluate EmoLIME qualitatively, quantitatively, and statistically across three emotional speech datasets, using classifiers trained on both hand-crafted acoustic features and Wav2Vec 2.0 embeddings. We find that EmoLIME exhibits stronger robustness across different models than across datasets with distribution shifts, highlighting its potential for more consistent explanations in SER tasks within a dataset.

* Published in the proceedings of ICASSP 2025

On Crowdsourcing-design with Comparison Category Rating for Evaluating Speech Enhancement Algorithms

Jun 02, 2023

Angélica S. Z. Suárez, Clément Laroche, Line H. Clemmensen, Sneha Das

Abstract: Speech enhancement techniques improve the quality or the intelligibility of an audio signal by removing unwanted noise. They are used as preprocessing in numerous applications such as speech recognition, hearing aids, broadcasting and telephony. The evaluation of such algorithms often relies on reference-based objective metrics that have been shown to correlate poorly with human perception. To evaluate audio quality as perceived by human observers, it is thus fundamental to resort to subjective quality assessment. In this paper, a user evaluation based on crowdsourcing (subjective) and the Comparison Category Rating (CCR) method is compared against the DNSMOS, ViSQOL and 3QUEST (objective) metrics. The overall quality scores of three speech enhancement algorithms from real time communications (RTC) are used in the comparison using the P.808 toolkit. Results indicate that while the CCR scale allows participants to identify differences between processed and unprocessed audio samples, two groups of preferences emerge: some users rate positively by focusing on noise suppression processing, while others rate negatively by focusing mainly on speech quality. We further present results on the parameters, size considerations and speaker variations that are critical and should be considered when designing the CCR-based crowdsourcing evaluation.

* Published at ICASSP 2023

Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages

Mar 28, 2022

Sneha Das, Nicklas Leander Lund, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract: Speech emotion recognition (SER) refers to the technique of inferring the emotional state of an individual from speech signals. SER systems continue to garner interest due to their wide applicability. Although the domain is mainly founded on signal processing, machine learning, and deep learning, generalizing over languages remains a challenge. However, developing generalizable and transferable models is critical due to a lack of sufficient resources in terms of data and labels for languages beyond the most commonly spoken ones. To improve performance over languages, we propose a denoising autoencoder with semi-supervision using a continuous metric loss based on either activation or valence. The novelty of this work lies in our proposal of continuous metric learning, which is among the first proposals on the topic to the best of our knowledge. Furthermore, to address the lack of activation and valence labels in the transfer datasets, we annotate the signal samples with activation and valence levels corresponding to a dimensional model of emotions, which were then used to evaluate the quality of the embedding over the transfer datasets. We show that the proposed semi-supervised model consistently outperforms the baseline unsupervised method, which is a conventional denoising autoencoder, in terms of emotion classification accuracy as well as correlation with respect to the dimensional variables. Further evaluation of classification accuracy with respect to the reference, a BERT-based speech representation model, shows that the proposed method is comparable to the reference method in classifying specific emotion classes at a much lower complexity.

* Preprint of paper accepted to be presented at the Northern Lights Deep Learning Conference (NLDL), 2022. The labels are available at: https://bit.ly/3rg6VsA

Towards Transferable Speech Emotion Representation: On loss functions for cross-lingual latent representations

Mar 28, 2022

Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract: In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques, which provide transfer learning possibilities. However, generalizing over languages, corpora and recording conditions is still an open challenge. In this work we address this gap by exploring loss functions that aid in transferability, specifically to non-tonal languages. We propose a variational autoencoder (VAE) with KL annealing and a semi-supervised VAE to obtain more consistent latent embedding distributions across data sets. To ensure transferability, the distribution of the latent embedding should be similar across non-tonal languages (data sets). We start by presenting a low-complexity SER based on a denoising autoencoder (DAE), which achieves an unweighted classification accuracy of over 52.09% for four-class emotion classification. This performance is comparable to that of similar baseline methods. Following this, we employ a VAE, the semi-supervised VAE and the VAE with KL annealing to obtain a more regularized latent space. We show that while the DAE has the highest classification accuracy among the methods, the semi-supervised VAE has a comparable classification accuracy and a more consistent latent embedding distribution over data sets.

* Preprint of paper accepted to be presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022. Source code at https://bit.ly/34CgkSZ. arXiv admin note: text overlap with arXiv:2105.02055
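
The KL annealing mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state which schedule the authors use, so the linear warm-up below, the function names (`kl_weight`, `gaussian_kl`, `annealed_vae_loss`) and the `warmup_steps` parameter are all assumptions made for illustration, not the paper's implementation.

```python
import math
from typing import Sequence


def kl_weight(step: int, warmup_steps: int, max_weight: float = 1.0) -> float:
    """Linear KL annealing (assumed schedule): ramp the weight from 0 to max_weight."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def annealed_vae_loss(recon_loss: float, mu: Sequence[float],
                      log_var: Sequence[float], step: int, warmup_steps: int) -> float:
    """Reconstruction loss plus the KL term scaled by the current annealing weight."""
    return recon_loss + kl_weight(step, warmup_steps) * gaussian_kl(mu, log_var)
```

Early in training the KL term contributes little, so the encoder can first learn a useful reconstruction; the regularization toward the prior is then phased in as the weight approaches its maximum.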

Data Representativity for Machine Learning and AI Systems

Mar 09, 2022

Line H. Clemmensen, Rune D. Kjærsgaard

Abstract: Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in the models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper analyzes data representativity in scientific literature related to AI and sampling, and gives a brief overview of statistical sampling methodology from disciplines like sampling of physical materials, experimental design, survey analysis, and observational studies. Different notions of a 'representative sample' exist in past and present literature. In particular, the contrast between the notion of a representative sample in the sense of coverage of the input space, versus a representative sample as a miniature of the target population, is of relevance when building AI systems. Using empirical demonstrations on US Census data, we demonstrate that the former is useful for providing equality and demographic parity, and is more robust to distribution shifts, whereas the latter notion is useful in situations where the purpose is to make historical inference or draw inference about the underlying population in general, or make better predictions for the majority in the underlying population. We propose a framework of questions for creating and documenting data, with data representativity in mind, as an addition to existing datasheets for datasets. Finally, we would also like to call for caution regarding implicit, in addition to explicit, use of a notion of data representativeness without specific clarification.

Towards Interpretable and Transferable Speech Emotion Recognition: Latent Representation Based Analysis of Features, Methods and Corpora

May 05, 2021

Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract: In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques. However, generalizing over languages, corpora and recording conditions is still an open challenge in the field. Furthermore, due to the black-box nature of deep learning algorithms, a newer challenge is the lack of interpretation and transparency in the models and the decision-making process. This is critical when the SER systems are deployed in applications that influence human lives. In this work we address this gap by providing an in-depth analysis of the decision-making process of the proposed SER system. Towards that end, we present low-complexity SER based on undercomplete and denoising autoencoders that achieve an average classification accuracy of over 55% for four-class emotion classification. Following this, we investigate the clustering of emotions in the latent space to understand the influence of the corpora on the model behavior and to obtain a physical interpretation of the latent embedding. Lastly, we explore the role of each input feature towards the performance of the SER.

A generalized linear joint trained framework for semi-supervised learning of sparse features

Jun 02, 2020

Juan C. Laria, Line H. Clemmensen, Bjarne K. Ersbøll

Abstract: The elastic-net is among the most widely used types of regularization algorithms, commonly associated with the problem of supervised generalized linear model estimation via penalized maximum likelihood. Its nice properties originate from a combination of $\ell_1$ and $\ell_2$ norms, which endow this method with the ability to select variables taking into account the correlations between them. In the last few years, semi-supervised approaches that use both labeled and unlabeled data have become an important component in statistical research. Despite this interest, however, few studies have investigated semi-supervised elastic-net extensions. This paper introduces a novel solution for semi-supervised learning of sparse features in the context of generalized linear model estimation: the generalized semi-supervised elastic-net (s2net), which extends the supervised elastic-net method, with a general mathematical formulation that covers, but is not limited to, both regression and classification problems. We develop a flexible and fast implementation for s2net in R, and its advantages are illustrated using both real and synthetic data sets.
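
For reference, the supervised elastic-net that this abstract takes as its starting point is commonly written as the penalized maximum-likelihood problem below. The notation is an assumption chosen for illustration ($\ell(\beta)$ is the GLM log-likelihood over $n$ labeled samples, $\lambda$ the overall penalty strength, $\alpha$ the mixing weight between the two norms), and this is only the supervised baseline, not the s2net objective, which the abstract does not spell out.

```latex
\hat{\beta} = \arg\min_{\beta}\; -\frac{1}{n}\,\ell(\beta)
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right),
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression; intermediate values give the mix of sparsity and grouping of correlated variables that the abstract refers to.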

Forest Floor Visualizations of Random Forests

Jul 04, 2016

Soeren H. Welling, Hanne H. F. Refsgaard, Per B. Brockhoff, Line H. Clemmensen

Abstract: We propose a novel methodology, forest floor, to visualize and interpret random forest (RF) models. RF is a popular and useful tool for non-linear multivariate classification and regression, which yields a good trade-off between robustness (low variance) and adaptiveness (low bias). Direct interpretation of an RF model is difficult, as the explicit ensemble model of hundreds of deep trees is complex. Nonetheless, it is possible to visualize an RF model fit by its mapping from feature space to prediction space. The user is thereby first presented with the overall geometrical shape of the model structure, and when needed one can zoom in on local details. Dimensional reduction by projection is used to visualize high-dimensional shapes. The traditional method to visualize RF model structure, partial dependence plots, achieves this by averaging multiple parallel projections. We suggest first using feature contributions, a method to decompose trees by splitting features, and then subsequently performing projections. The advantage of forest floor over partial dependence plots is that interactions are not masked by averaging. As a consequence, it is possible to locate interactions which are not visualized in a given projection. Furthermore, we introduce a goodness-of-visualization measure, the use of colour gradients to identify interactions, and an out-of-bag cross-validated variant of feature contributions.

* 25 pages, 12 figures, supplementary materials. v2->v3: minor proofing, moderated comments on ICE-plots, replaced \psi-operator with the subset named H in equation 13 and 14 to improve simplicity
