Abstract: This paper presents a novel deep neural architecture tailored for Speech Emotion Recognition (SER). The architecture capitalises on dense interconnections among multiple layers of bidirectional dilated convolutions, whose outputs are dynamically fused by a linear kernel to yield the final emotion class prediction. We denote this architecture TBDM-Net: Temporally-Aware Bi-directional Dense Multi-Scale Network. We conduct a comprehensive performance evaluation of TBDM-Net, including an ablation study, across six widely used datasets for unimodal speech emotion recognition. Additionally, we explore the influence of gender-informed emotion prediction by appending either gold or predicted gender labels to the architecture's inputs or predictions. The implementation of TBDM-Net is available at: https://github.com/adrianastan/tbdm-net
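For concreteness, below is a minimal PyTorch sketch of the core idea described above: a densely connected stack of bidirectional dilated convolutions whose per-layer outputs are fused by a learned linear combination. All layer counts, channel sizes, dilation rates, and the pooling/classifier head are illustrative assumptions, not the published TBDM-Net configuration; the actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class BiDilatedConvBlock(nn.Module):
    """One bidirectional dilated convolution layer: the input is convolved
    forward and time-reversed, and the two directions are concatenated."""
    def __init__(self, in_channels, out_channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # preserve sequence length
        self.fwd = nn.Conv1d(in_channels, out_channels, kernel_size,
                             padding=pad, dilation=dilation)
        self.bwd = nn.Conv1d(in_channels, out_channels, kernel_size,
                             padding=pad, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                         # x: (batch, channels, time)
        forward = self.fwd(x)
        backward = self.bwd(x.flip(-1)).flip(-1)  # convolve reversed signal, restore order
        return self.act(torch.cat([forward, backward], dim=1))

class DenseMultiScaleNet(nn.Module):
    """Densely connected stack of bidirectional dilated convolutions whose
    per-layer outputs are fused by a learned linear combination (assumed
    hyperparameters throughout)."""
    def __init__(self, in_channels=40, hidden=64, num_layers=4, num_classes=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for i in range(num_layers):
            # dilation doubles per layer to cover multiple temporal scales
            self.layers.append(BiDilatedConvBlock(channels, hidden, dilation=2 ** i))
            channels += 2 * hidden                # dense connectivity: concat all previous outputs
        # linear fusion of the per-layer outputs (one learned weight per layer)
        self.fusion = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                         # x: (batch, in_channels, time)
        features, outputs = x, []
        for layer in self.layers:
            out = layer(features)
            outputs.append(out)
            features = torch.cat([features, out], dim=1)   # dense skip connections
        fused = sum(w * o for w, o in zip(self.fusion, outputs))  # linear fusion kernel
        pooled = fused.mean(dim=-1)               # average over time
        return self.classifier(pooled)

# quick shape check on a batch of 40-dim frame-level features, 200 frames each
model = DenseMultiScaleNet()
logits = model(torch.randn(8, 40, 200))
print(logits.shape)  # torch.Size([8, 4])
```

The dense connectivity means each layer sees the concatenation of the input and all earlier layers' outputs, while the increasing dilation gives each layer a different temporal receptive field; the learned fusion weights then decide how much each scale contributes to the prediction.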
Abstract: Features derived from large speech models have recently outperformed signal-based features across multiple downstream tasks, even when the networks are not finetuned for the target task. In this paper we present an analysis of several signal-based and neural-model-derived features for speech emotion recognition. We use pretrained models and explore the abstractions of emotion inherent in their representations. Simple classification methods are used so as not to interfere with or add knowledge to the task. We show that, even without finetuning, the representations of some of these large neural speech models can encode information that enables performance close to, and even beyond, state-of-the-art results across six standard speech emotion recognition datasets.
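As a hedged illustration of this probing setup, the sketch below extracts frozen utterance-level embeddings from a pretrained speech model and fits a simple linear classifier on top. The specific choice of wav2vec 2.0, mean pooling, and logistic regression is an assumption made for the example; the paper's analysis covers several signal-based and neural feature sets.

```python
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# load a pretrained speech model; its weights stay frozen (no finetuning)
name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

def utterance_embedding(waveform, sr=16000):
    """Mean-pool the model's last hidden states into one fixed-size vector."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# toy data in place of a real SER corpus: 8 random one-second "utterances"
waves = [np.random.randn(16000).astype(np.float32) for _ in range(8)]
labels = [0, 1, 0, 1, 2, 2, 3, 3]  # placeholder emotion classes

X = np.stack([utterance_embedding(w) for w in waves])
# a simple probe: the classifier adds no task knowledge beyond a linear readout
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))
```

Because the probe is linear, any emotion information it recovers must already be present in the frozen representations, which is the premise of the analysis above.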