Title: Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text

Jul 26, 2022

Yoonhyung Lee, Seunghyun Yoon, Kyomin Jung

Figures 1-4 for Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text

Abstract: In this paper, we propose a novel speech emotion recognition model called Cross Attention Network (CAN) that uses aligned audio and text signals as inputs. It is inspired by the fact that humans recognize speech as a combination of simultaneously produced acoustic and textual signals. First, our method segments the audio and the underlying text signals into an equal number of steps in an aligned way, so that the same time step in each sequence covers the same time span of the utterance. Together with this technique, we apply cross attention to aggregate the sequential information from the aligned signals. In the cross attention, each modality is first aggregated independently by applying a global attention mechanism to that modality. Then, the attention weights of each modality are applied directly to the other modality in a crossed way, so that the CAN gathers the audio and text information from the same time steps based on each modality. In experiments conducted on the standard IEMOCAP dataset, our model outperforms the state-of-the-art systems by relative margins of 2.66% and 3.18% in weighted and unweighted accuracy, respectively.
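The crossed aggregation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes both modalities are already encoded into sequences with the same number of aligned time steps, that attention scores are a dot product with a learned query vector per modality (`q_audio` and `q_text` are hypothetical names), and that the four aggregated vectors are simply concatenated. The paper's actual scoring function and fusion step may differ.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = np.exp(scores - np.max(scores))
    return shifted / shifted.sum()


def cross_attention(audio, text, q_audio, q_text):
    """Aggregate two aligned sequences with their own and each other's weights.

    audio:   (T, d_audio) audio hidden states, one row per aligned time step
    text:    (T, d_text) text hidden states, one row per aligned time step
    q_audio: (d_audio,) hypothetical learned query for the audio attention
    q_text:  (d_text,) hypothetical learned query for the text attention
    """
    if audio.shape[0] != text.shape[0]:
        raise ValueError("audio and text must have the same number of time steps")

    # Global attention weights, computed independently for each modality.
    w_audio = softmax(audio @ q_audio)  # shape (T,)
    w_text = softmax(text @ q_text)     # shape (T,)

    # Each modality aggregated with its own weights.
    audio_by_audio = w_audio @ audio
    text_by_text = w_text @ text

    # Crossed aggregation: each modality's weights applied to the other one.
    # This is only meaningful because step t covers the same time span in both.
    text_by_audio = w_audio @ text
    audio_by_text = w_text @ audio

    # Assumed fusion: concatenate the four summaries into one feature vector.
    return np.concatenate([audio_by_audio, text_by_text, text_by_audio, audio_by_text])
```

Under these assumptions, calling `cross_attention` on a 5-step utterance with 4-dimensional audio states and 3-dimensional text states returns a 14-dimensional vector that a downstream classifier could consume.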

* Proc. Interspeech 2020, 2717-2721 * 5 pages, accepted by INTERSPEECH 2020
