Douglas Chai

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Aug 01, 2023

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar

Abstract: In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for combining audio-image and video modalities, with the primary aim of improving the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T is the distillation of substantial representations from the audio modality and their conversion into the image domain. This audio-image representation is then fused with the video modality to form a unified representation. The approach seeks to exploit the contextual richness inherent in both audio and video modalities, thereby improving action recognition. Compared with existing state-of-the-art strategies that focus solely on audio or video modalities, MAiVAR-T demonstrates superior performance. Our extensive empirical evaluations on a benchmark action recognition dataset corroborate the model's performance and underscore the potential gains from integrating audio and video modalities for action recognition.

* 6 pages, 7 figures, 4 tables. Peer reviewed and accepted at the 11th European Workshop on Visual Information Processing (EUVIP), 11th-14th September 2023, Gjøvik, Norway. arXiv admin note: text overlap with arXiv:2103.15691 by other authors

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Sep 11, 2022

Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar

Abstract: Currently, action recognition is predominantly performed on video data as processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task. To this end, we propose Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses them with video representations to achieve better performance than either modality individually on a large-scale action recognition dataset.

* Peer reviewed & accepted at IEEE VCIP 2022 (http://www.vcip2022.org/)
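
The abstract above describes turning audio into an image representation and fusing it with video features. The following is only a minimal sketch of that general idea, not the authors' pipeline: it assumes a log-magnitude spectrogram as the audio image and plain concatenation as the fusion step, neither of which is confirmed by the listing. The function names, the frame and hop sizes, and the zero-filled video feature vector are all hypothetical placeholders.

```python
import numpy as np

def log_spectrogram(waveform, frame_len=256, hop=128):
    """Return a (freq_bins, frames) log-magnitude spectrogram as a 2-D array.

    Assumption: a spectrogram stands in for the paper's audio-image
    representation; the actual representation may differ.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (frames, freq_bins)
    return np.log1p(magnitude).T                     # (freq_bins, frames)

def fuse_by_concatenation(audio_features, video_features):
    """Hypothetical late fusion: join two 1-D feature vectors end to end."""
    return np.concatenate([audio_features, video_features])

# One second of a 440 Hz tone sampled at 8 kHz as stand-in audio.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440 * t)

audio_image = log_spectrogram(waveform)

# Stand-ins for learned features: a per-frequency average of the audio
# image, and a zero vector where a video encoder's output would go.
audio_features = audio_image.mean(axis=1)
video_features = np.zeros(512)
fused = fuse_by_concatenation(audio_features, video_features)
```

In a real system the audio image and the video frames would each pass through a trained encoder before fusion; the averaging and zero vector here only keep the sketch self-contained.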
