Rosanna Milner

Empirical Interpretation of the Relationship Between Speech Acoustic Context and Emotion Recognition

Jun 30, 2023

Anna Ollerenshaw, Md Asif Jalal, Rosanna Milner, Thomas Hain

Abstract: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech. Variations of consonant-vowel (CV) phonemic boundaries can enrich acoustic context with linguistic cues, which impacts SER. In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration. However, phone boundaries within speech are not discrete events; therefore, the perceived emotion state should also be distributed over potentially continuous time-windows. This research explores the implication of acoustic context and phone boundaries on local markers for SER using an attention-based approach. The benefits of using a distributed approach to speech emotion understanding are supported by the results of cross-corpora analysis experiments. In these experiments, phones and words are mapped to the attention vectors along with the fundamental frequency to observe the overlapping distributions and thereby the relationship between acoustic context and emotion. This work aims to bridge psycholinguistic theory research with computational modelling for SER.
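One way to picture the mapping of phones to attention vectors is to sum the frame-level attention weights that fall inside each phone's time boundaries. The sketch below is only an illustration under assumptions: the function name `attention_per_phone`, the `(label, start_frame, end_frame)` alignment format, and the toy values are all hypothetical and are not taken from the paper.

```python
def attention_per_phone(attention, alignment):
    """Sum frame-level attention mass inside each aligned phone segment.

    attention: list of non-negative weights, one per frame
               (assumed here to sum to 1 over the utterance).
    alignment: list of (label, start_frame, end_frame), end exclusive.
    """
    totals = []
    for label, start, end in alignment:
        # Reject segments that fall outside the attention vector.
        if not 0 <= start < end <= len(attention):
            raise ValueError(f"bad segment for {label!r}: [{start}, {end})")
        totals.append((label, sum(attention[start:end])))
    return totals


# Hypothetical toy data: 6 frames covering 3 phones.
weights = [0.05, 0.10, 0.40, 0.25, 0.15, 0.05]
phones = [("k", 0, 2), ("ae", 2, 4), ("t", 4, 6)]
per_phone = attention_per_phone(weights, phones)
```

In this toy case most of the attention mass lands on the vowel segment; with real alignments the same aggregation could be plotted alongside the fundamental frequency track.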

A cross-corpus study on speech emotion recognition

Jul 05, 2022

Rosanna Milner, Md Asif Jalal, Raymond W. M. Ng, Thomas Hain

Abstract: For speech emotion datasets, it has been difficult to acquire large quantities of reliable data, and acted emotions may be exaggerated compared to the less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detecting natural emotions. Cross-corpus research has mostly considered cross-lingual and even cross-age datasets, and difficulties arise from different methods of annotating emotions, causing a drop in performance. To be consistent, four adult English datasets covering acted, elicited and natural emotions are considered. A state-of-the-art model is proposed to accurately investigate the degradation of performance. The system involves a bi-directional LSTM with an attention mechanism to classify emotions across datasets. Experiments study the effects of training models in a cross-corpus and multi-domain fashion, and results show the transfer of information is not successful. Out-of-domain models, followed by adapting to the missing dataset, and domain adversarial training (DAT) are shown to be more suitable for generalising to emotions across datasets. This shows positive information transfer from acted datasets to those with more natural emotions and the benefits from training on different corpora.
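The abstract does not specify the form of the attention mechanism. A common choice for utterance-level classification is attention pooling: score each recurrent hidden state, turn the scores into weights with a softmax, and take the weighted sum as the utterance representation. The sketch below assumes simple dot-product scoring against a query vector; the names `attention_pool`, `hidden` and `query`, and the toy values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Pool a sequence of hidden states into one vector.

    hidden_states: list of equal-length vectors (e.g. BLSTM outputs per frame).
    query: vector of the same length (assumed learnt in a real model).
    Returns (pooled_vector, attention_weights).
    """
    # Dot-product score for every time step (an assumed scoring function).
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical toy sequence of three 2-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
pooled, weights = attention_pool(hidden, query)
```

The pooled vector would then feed an emotion classifier; the weights themselves give one value per frame, which is what makes this kind of model convenient to inspect.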

* IEEE Workshop on Automatic Speech Recognition and Understanding 2019

The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

Dec 21, 2015

Oscar Saz, Mortaza Doulaty, Salil Deena, Rosanna Milner, Raymond W. M. Ng, Madina Hasan, Yulan Liu, Thomas Hain

Abstract: We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art of automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topics are investigated in this work: data selection techniques for training with unreliable data, automatic speech segmentation of broadcast media shows, acoustic modelling and adaptation in highly variable environments, and language modelling of multi-genre shows. The final system operates in multiple passes, using an initial unadapted decoding stage to refine segmentation, followed by three adapted passes: a hybrid DNN pass with input features normalised by speaker-based cepstral normalisation, another hybrid stage with input features normalised by speaker feature-MLLR transformations, and finally a bottleneck-based tandem stage with noise and speaker factorisation. The combination of these three system outputs provides a final error rate of 27.5% on the official development set, consisting of 47 multi-genre shows.
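As an illustration of speaker-based cepstral normalisation, the sketch below applies mean and variance normalisation per speaker, so that each speaker's features have roughly zero mean and unit variance in every dimension. Whether the Sheffield system normalises the variance as well as the mean is not stated in the abstract, so that is an assumption here, as are the function name `speaker_cmvn` and the toy data.

```python
import math
from collections import defaultdict


def speaker_cmvn(frames, speakers, eps=1e-8):
    """Per-speaker cepstral mean and variance normalisation.

    frames: list of feature vectors (e.g. cepstral coefficients per frame).
    speakers: parallel list of speaker ids, one per frame.
    eps: small constant guarding against division by zero.
    """
    # Group frames by speaker to estimate per-speaker statistics.
    by_speaker = defaultdict(list)
    for vec, spk in zip(frames, speakers):
        by_speaker[spk].append(vec)

    stats = {}
    for spk, vecs in by_speaker.items():
        n, dim = len(vecs), len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        std = [
            math.sqrt(sum((v[d] - mean[d]) ** 2 for v in vecs) / n)
            for d in range(dim)
        ]
        stats[spk] = (mean, std)

    # Normalise every frame with its own speaker's statistics.
    normalised = []
    for vec, spk in zip(frames, speakers):
        mean, std = stats[spk]
        normalised.append([(x - m) / (s + eps) for x, m, s in zip(vec, mean, std)])
    return normalised


# Hypothetical toy features: two speakers with very different offsets.
frames = [[1.0, 10.0], [3.0, 14.0], [100.0, -5.0], [104.0, -1.0]]
speakers = ["A", "A", "B", "B"]
normalised = speaker_cmvn(frames, speakers)
```

After normalisation the two speakers' frames occupy the same range despite their different raw offsets, which is the point of adapting the input features per speaker.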

* IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), 13-17 Dec 2015, Scottsdale, Arizona, USA
