Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Raymond W. M. Ng

A cross-corpus study on speech emotion recognition

Jul 05, 2022

Rosanna Milner, Md Asif Jalal, Raymond W. M. Ng, Thomas Hain

Figure 1 for A cross-corpus study on speech emotion recognition

Figure 2 for A cross-corpus study on speech emotion recognition

Figure 3 for A cross-corpus study on speech emotion recognition

Figure 4 for A cross-corpus study on speech emotion recognition

Abstract:For speech emotion datasets, it has been difficult to acquire large quantities of reliable data and acted emotions may be over the top compared to less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detecting natural emotions. Cross-corpus research has mostly considered cross-lingual and even cross-age datasets, and difficulties arise from different methods of annotating emotions causing a drop in performance. To be consistent, four adult English datasets covering acted, elicited and natural emotions are considered. A state-of-the-art model is proposed to accurately investigate the degradation of performance. The system involves a bi-directional LSTM with an attention mechanism to classify emotions across datasets. Experiments study the effects of training models in a cross-corpus and multi-domain fashion and results show the transfer of information is not successful. Out-of-domain models, followed by adapting to the missing dataset, and domain adversarial training (DAT) are shown to be more suitable to generalising to emotions across datasets. This shows positive information transfer from acted datasets to those with more natural emotions and the benefits from training on different corpora.

* IEEE Workshop on Automatic Speech Recognition and Understanding 2019
* ASRU 2019

Via

Access Paper or Ask Questions

Automatic Genre and Show Identification of Broadcast Media

Jun 10, 2016

Mortaza Doulaty, Oscar Saz, Raymond W. M. Ng, Thomas Hain

Figure 1 for Automatic Genre and Show Identification of Broadcast Media

Figure 2 for Automatic Genre and Show Identification of Broadcast Media

Figure 3 for Automatic Genre and Show Identification of Broadcast Media

Figure 4 for Automatic Genre and Show Identification of Broadcast Media

Abstract:Huge amounts of digital videos are being produced and broadcast every day, leading to giant media archives. Effective techniques are needed to make such data accessible further. Automatic meta-data labelling of broadcast media is an essential task for multimedia indexing, where it is standard to use multi-modal input for such purposes. This paper describes a novel method for automatic detection of media genre and show identities using acoustic features, textual features or a combination thereof. Furthermore the inclusion of available meta-data, such as time of broadcast, is shown to lead to very high performance. Latent Dirichlet Allocation is used to model both acoustics and text, yielding fixed dimensional representations of media recordings that can then be used in Support Vector Machines based classification. Experiments are conducted on more than 1200 hours of TV broadcasts from the British Broadcasting Corporation (BBC), where the task is to categorise the broadcasts into 8 genres or 133 show identities. On a 200-hour test set, accuracies of 98.6% and 85.7% were achieved for genre and show identification respectively, using a combination of acoustic and textual features with meta-data.

* Proc. of 17th Interspeech (2016), San Francisco, California, USA

Via

Access Paper or Ask Questions

The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

Dec 21, 2015

Oscar Saz, Mortaza Doulaty, Salil Deena, Rosanna Milner, Raymond W. M. Ng, Madina Hasan, Yulan Liu, Thomas Hain

Figure 1 for The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

Figure 2 for The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

Figure 3 for The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

Figure 4 for The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

Abstract:We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art of automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topics are investigated in this work: Data selection techniques for training with unreliable data, automatic speech segmentation of broadcast media shows, acoustic modelling and adaptation in highly variable environments, and language modelling of multi-genre shows. The final system operates in multiple passes, using an initial unadapted decoding stage to refine segmentation, followed by three adapted passes: a hybrid DNN pass with input features normalised by speaker-based cepstral normalisation, another hybrid stage with input features normalised by speaker feature-MLLR transformations, and finally a bottleneck-based tandem stage with noise and speaker factorisation. The combination of these three system outputs provides a final error rate of 27.5% on the official development set, consisting of 47 multi-genre shows.

* IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), 13-17 Dec 2015, Scottsdale, Arizona, USA

Via

Access Paper or Ask Questions

Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

Nov 16, 2015

Mortaza Doulaty, Oscar Saz, Raymond W. M. Ng, Thomas Hain

Figure 1 for Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

Figure 2 for Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

Figure 3 for Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

Figure 4 for Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

Abstract:This paper presents a new method for the discovery of latent domains in diverse speech data, for the use of adaptation of Deep Neural Networks (DNNs) for Automatic Speech Recognition. Our work focuses on transcription of multi-genre broadcast media, which is often only categorised broadly in terms of high level genres such as sports, news, documentary, etc. However, in terms of acoustic modelling these categories are coarse. Instead, it is expected that a mixture of latent domains can better represent the complex and diverse behaviours within a TV show, and therefore lead to better and more robust performance. We propose a new method, whereby these latent domains are discovered with Latent Dirichlet Allocation, in an unsupervised manner. These are used to adapt DNNs using the Unique Binary Code (UBIC) representation for the LDA domains. Experiments conducted on a set of BBC TV broadcasts, with more than 2,000 shows for training and 47 shows for testing, show that the use of LDA-UBIC DNNs reduces the error up to 13% relative compared to the baseline hybrid DNN models.

* IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), 13-17 Dec 2015, Scottsdale, Arizona, USA

Via

Access Paper or Ask Questions

The USFD Spoken Language Translation System for IWSLT 2014

Sep 13, 2015

Raymond W. M. Ng, Mortaza Doulaty, Rama Doddipatla, Wilker Aziz, Kashif Shah, Oscar Saz, Madina Hasan, Ghada AlHarbi, Lucia Specia, Thomas Hain

Figure 1 for The USFD Spoken Language Translation System for IWSLT 2014

Figure 2 for The USFD Spoken Language Translation System for IWSLT 2014

Figure 3 for The USFD Spoken Language Translation System for IWSLT 2014

Figure 4 for The USFD Spoken Language Translation System for IWSLT 2014

Abstract:The University of Sheffield (USFD) participated in the International Workshop for Spoken Language Translation (IWSLT) in 2014. In this paper, we will introduce the USFD SLT system for IWSLT. Automatic speech recognition (ASR) is achieved by two multi-pass deep neural network systems with adaptation and rescoring techniques. Machine translation (MT) is achieved by a phrase-based system. The USFD primary system incorporates state-of-the-art ASR and MT techniques and gives a BLEU score of 23.45 and 14.75 on the English-to-French and English-to-German speech-to-text translation task with the IWSLT 2014 data. The USFD contrastive systems explore the integration of ASR and MT by using a quality estimation system to rescore the ASR outputs, optimising towards better translation. This gives a further 0.54 and 0.26 BLEU improvement respectively on the IWSLT 2012 and 2014 evaluation data.

* Proc. of 11th International Workshop on Spoken Language Translation (SLT 2014) 86-91, Lake Tahoe, USA, December 4th and 5th, 2014

Via

Access Paper or Ask Questions