Md. Shah Fahad

Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition

Aug 07, 2019

Shruti Gupta, Md. Shah Fahad, Akshay Deepak

Abstract: Convolutional neural networks (CNNs) are widely used for speech emotion recognition (SER). In such cases, the short-time Fourier transform (STFT) spectrogram is the most popular representation of speech fed as input to the CNN. However, the uncertainty principle of the STFT prevents it from achieving high time and frequency resolution simultaneously. The recently proposed single frequency filtering (SFF) spectrogram promises to be a better alternative because it provides high time and frequency resolution simultaneously. In this work, we explore the SFF spectrogram as an alternative representation of speech for SER. We modify the SFF spectrogram by averaging the amplitudes of all the samples between two successive glottal closure instant (GCI) locations. The duration between two successive GCI locations gives the pitch period, which motivates the name pitch-synchronous SFF spectrogram for the modified representation. The GCI locations were detected using the zero frequency filtering approach. The proposed pitch-synchronous SFF spectrogram produced accuracy values of 63.95% (unweighted) and 70.4% (weighted) on the IEMOCAP dataset. These correspond to improvements of +7.35% (unweighted) and +4.3% (weighted) over the state-of-the-art result on the STFT spectrogram using a CNN. In particular, the proposed method correctly recognized 22.7% of the happy emotion samples, whereas this number was 0% for the state-of-the-art result. These results also promise a much wider use of the proposed pitch-synchronous SFF spectrogram in other speech-based applications.

* 11 pages and less than 20 figures
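
The pitch-synchronous averaging step described in the abstract can be illustrated with a minimal sketch. It assumes the SFF amplitude envelopes are already available as a NumPy array with one row per frequency and one column per sample, and that the GCI locations are given as increasing sample indices. The function name `pitch_synchronous_average` and the half-open interval convention (each interval includes its starting GCI sample and excludes the next one) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def pitch_synchronous_average(envelopes, gci):
    """Average each frequency row over every interval between successive GCIs.

    envelopes: array of shape (n_freqs, n_samples), e.g. SFF amplitude envelopes.
    gci: strictly increasing sample indices of glottal closure instants.
    Returns an array of shape (n_freqs, len(gci) - 1), one column per pitch cycle.
    """
    envelopes = np.asarray(envelopes, dtype=float)
    gci = np.asarray(gci, dtype=int)
    if gci.size < 2:
        raise ValueError("need at least two GCI locations")
    if np.any(np.diff(gci) <= 0):
        raise ValueError("GCI locations must be strictly increasing")
    # Assumption: interval k covers samples gci[k] .. gci[k + 1] - 1.
    columns = [
        envelopes[:, start:end].mean(axis=1)
        for start, end in zip(gci[:-1], gci[1:])
    ]
    return np.stack(columns, axis=1)


# Toy input: 2 frequency rows, 6 samples, GCIs at samples 0, 2 and 6.
toy_envelopes = np.arange(12).reshape(2, 6)
toy_result = pitch_synchronous_average(toy_envelopes, [0, 2, 6])
print(toy_result)
```

Because pitch cycles differ in length, the resulting columns are not uniformly spaced in time; each column corresponds to one glottal cycle rather than one fixed-length frame.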

DNN-HMM based Speaker Adaptive Emotion Recognition using Proposed Epoch and MFCC Features

Jun 04, 2018

Md. Shah Fahad, Jainath Yadav, Gyadhar Pradhan, Akshay Deepak

Abstract: Speech is produced when a time-varying vocal tract system is excited by a time-varying excitation source. Therefore, the information present in speech, such as message, emotion, language, and speaker, is due to the combined effect of both the excitation source and the vocal tract system. However, excitation source features have seen very little use in emotion recognition. In our earlier work, we proposed a novel method to extract glottal closure instants (GCIs), also known as epochs. In this paper, we explore epoch features, namely instantaneous pitch, phase, and strength of epochs, for discriminating emotions. We combine the excitation source features with the well-known Mel-frequency cepstral coefficient (MFCC) features to develop an emotion recognition system with improved performance. DNN-HMM speaker adaptive models were developed using MFCC, epoch, and combined features. The IEMOCAP emotional database was used to evaluate the models. The average accuracy of the emotion recognition system is 59.25% with MFCC features alone and 54.52% with epoch features alone. Recognition performance improves to 64.2% when MFCC and epoch features are combined.
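
One of the epoch features named in the abstract, instantaneous pitch, can be sketched as the reciprocal of the time between successive epochs. The snippet below is an illustrative sketch only: the function name `instantaneous_pitch`, the use of sample indices for epoch locations, and the 16 kHz sample rate in the toy example are assumptions, and the paper's own extraction method is not reproduced here.

```python
import numpy as np


def instantaneous_pitch(epochs, sample_rate):
    """Return one pitch value in Hz for each pair of successive epochs.

    epochs: strictly increasing epoch (GCI) locations in samples.
    sample_rate: sampling rate of the speech signal in Hz.
    """
    epochs = np.asarray(epochs, dtype=float)
    if epochs.size < 2:
        raise ValueError("need at least two epoch locations")
    periods = np.diff(epochs) / sample_rate  # pitch periods in seconds
    if np.any(periods <= 0):
        raise ValueError("epoch locations must be strictly increasing")
    return 1.0 / periods


# Toy input: epochs 160, 160 and 80 samples apart at an assumed 16 kHz rate.
toy_pitch = instantaneous_pitch([0, 160, 320, 400], sample_rate=16000)
print(toy_pitch)
```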
