
Title: Hybrid Fusion Based Interpretable Multimodal Emotion Recognition with Insufficient Labelled Data

Aug 24, 2022

Puneet Kumar, Sarthak Malik, Balasubramanian Raman



Abstract: This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), which classifies the emotions reflected by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important visual, spoken, and textual features that lead to the prediction of a particular emotion class. VISTA Net fuses the information from the image, speech, and text modalities using a hybrid of early and late fusion. It automatically adjusts the weights of their intermediate outputs when computing the weighted average, without human intervention. The KAAP technique computes the contribution of each modality, and of its corresponding features, toward predicting a particular emotion class. To mitigate the scarcity of multimodal emotion datasets labeled with discrete emotion classes, we have constructed a large-scale dataset, IIT-R MMEmoRec, consisting of real-life images, corresponding speech and text, and emotion labels ('angry', 'happy', 'hate', and 'sad'). VISTA Net achieves 95.99% emotion recognition accuracy when the image, speech, and text modalities are considered together, which is higher than its accuracy when only one or two of the modalities are used.
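The weighted averaging of intermediate outputs and the per-modality contribution described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the example probabilities, and the raw weights are hypothetical, the weights are fixed here rather than learned, and the contribution score is a simple leave-one-modality-out comparison standing in for KAAP, whose exact procedure the abstract does not give.

```python
import math

# The four emotion labels named in the abstract.
EMOTIONS = ["angry", "happy", "hate", "sad"]


def softmax(values):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(modality_probs, raw_weights):
    """Weighted average of per-modality class probabilities.

    modality_probs: dict of modality name -> list of class probabilities.
    raw_weights: dict of modality name -> unnormalised score (assumed to be
    learned in a real system; fixed by hand in this sketch).
    """
    names = sorted(modality_probs)
    weights = softmax([raw_weights[name] for name in names])
    n_classes = len(modality_probs[names[0]])
    fused = [0.0] * n_classes
    for name, weight in zip(names, weights):
        for i, prob in enumerate(modality_probs[name]):
            fused[i] += weight * prob
    return fused


def modality_contributions(modality_probs, raw_weights, target):
    """Leave-one-modality-out drop in the fused probability of one class.

    A stand-in for a contribution score, NOT the KAAP technique itself.
    """
    full = weighted_fusion(modality_probs, raw_weights)[target]
    contributions = {}
    for name in modality_probs:
        rest = {k: v for k, v in modality_probs.items() if k != name}
        rest_weights = {k: raw_weights[k] for k in rest}
        without = weighted_fusion(rest, rest_weights)[target]
        contributions[name] = full - without
    return contributions


# Hypothetical per-modality outputs for one sample.
example_probs = {
    "image": [0.10, 0.70, 0.10, 0.10],
    "speech": [0.20, 0.50, 0.20, 0.10],
    "text": [0.25, 0.25, 0.25, 0.25],
}
# Equal raw scores give each modality the same weight.
example_weights = {"image": 0.0, "speech": 0.0, "text": 0.0}

fused = weighted_fusion(example_probs, example_weights)
predicted = EMOTIONS[fused.index(max(fused))]
contributions = modality_contributions(
    example_probs, example_weights, EMOTIONS.index("happy")
)
```

In this toy sample the fused prediction is 'happy', and removing the image modality lowers the 'happy' probability the most, so the image receives the largest contribution score. Passing the raw scores through a softmax keeps the weights positive and summing to one, which is one common way to let a model adjust them during training.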
