Mahrukh Awan

Efficient Audio-Visual Fusion for Video Classification

Nov 08, 2024

Mahrukh Awan, Asmar Nadeem, Armin Mustafa

Abstract: We present Attend-Fusion, a novel and efficient approach to audio-visual fusion for video classification. Our method addresses the challenge of exploiting both audio and visual modalities while keeping the model architecture compact. Through extensive experiments on the YouTube-8M dataset, we demonstrate that Attend-Fusion achieves performance competitive with larger baseline models at significantly reduced model complexity.

* CVMP Short Paper

Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification

Aug 26, 2024

Mahrukh Awan, Asmar Nadeem, Muhammad Junaid Awan, Armin Mustafa, Syed Sameed Husain

Abstract: Exploiting both audio and visual modalities for video classification is challenging: existing methods require large model architectures, leading to high computational complexity and resource requirements, while smaller architectures struggle to reach comparable performance. In this paper, we propose Attend-Fusion, an audio-visual (AV) fusion approach that introduces a compact model architecture specifically designed to capture intricate audio-visual relationships in video data. Through extensive experiments on the challenging YouTube-8M dataset, we demonstrate that Attend-Fusion achieves an F1 score of 75.64% with only 72M parameters, which is comparable to the performance of larger baseline models such as Fully-Connected Late Fusion (75.96% F1 score, 341M parameters). Attend-Fusion thus achieves similar performance to the larger baseline while reducing the model size by nearly 80%, highlighting its efficiency in terms of model complexity. Our work demonstrates that Attend-Fusion effectively combines audio and visual information for video classification, achieving competitive performance with a significantly smaller model. This approach opens new possibilities for deploying high-performance video understanding systems in resource-constrained environments across various applications.
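
As a quick check on the "nearly 80%" figure, the short sketch below recomputes the size reduction and the F1 gap from the numbers quoted in the abstract. The function and variable names are illustrative assumptions and do not come from the paper.

```python
# Recompute the comparison quoted in the abstract.
# All names here are hypothetical; only the four numbers come from the abstract.

def parameter_reduction(small_params_m: float, large_params_m: float) -> float:
    """Fraction of parameters saved by the smaller model relative to the larger one."""
    return 1.0 - small_params_m / large_params_m

attend_fusion_params_m = 72.0   # Attend-Fusion, millions of parameters
late_fusion_params_m = 341.0    # Fully-Connected Late Fusion, millions of parameters
attend_fusion_f1 = 75.64        # F1 score in percent
late_fusion_f1 = 75.96          # F1 score in percent

reduction = parameter_reduction(attend_fusion_params_m, late_fusion_params_m)
f1_gap = late_fusion_f1 - attend_fusion_f1

print(f"Parameter reduction: {reduction:.1%}")       # 78.9%
print(f"F1 gap: {f1_gap:.2f} percentage points")     # 0.32
```

Under these numbers the smaller model gives up about 0.32 F1 percentage points in exchange for roughly a 78.9% cut in parameters.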
