Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL). Existing methods over-specialize on each individual task, overlooking the fact that these different types of instances often co-occur in the same video and together constitute the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of the TAL, SED, and AVEL tasks for the first time. UniAV can leverage the diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations across datasets in size, domain, and video duration, as well as by the distinct characteristics of each task, we propose to uniformly encode the visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture the unique knowledge of each task. In addition, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances, including previously unseen ones, by simply changing the prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving performance on par with or superior to state-of-the-art task-specific methods across the ActivityNet 1.3, DESED, and UnAV-100 benchmarks.
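To make the prompt-based detection idea concrete, the sketch below shows one plausible form of a language-aware classifier in the spirit of the abstract: per-snippet audio-visual features are scored against text embeddings of class prompts, so the label space can be swapped at inference time by changing the prompts. This is a minimal illustration, not the paper's actual implementation; the `LanguageAwareClassifier` class, the projection layer, and the stand-in text encoder used in the usage example are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageAwareClassifier(nn.Module):
    """Hypothetical sketch: score temporal snippet features against text
    embeddings of class prompts, so new instance types can be detected at
    inference time simply by providing new prompts."""

    def __init__(self, text_encoder, feat_dim: int, text_dim: int):
        super().__init__()
        self.text_encoder = text_encoder           # assumed frozen pre-trained text encoder
        self.proj = nn.Linear(feat_dim, text_dim)  # project fused A/V features into the text space
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    @torch.no_grad()
    def encode_prompts(self, prompts):
        # prompts: list of strings, e.g. ["a video of playing guitar", ...]
        return F.normalize(self.text_encoder(prompts), dim=-1)      # (C, text_dim)

    def forward(self, snippet_feats, prompts):
        # snippet_feats: (B, T, feat_dim) fused audio-visual features per time step
        text_emb = self.encode_prompts(prompts)                     # (C, text_dim)
        vis_emb = F.normalize(self.proj(snippet_feats), dim=-1)     # (B, T, text_dim)
        # per-snippet class logits via scaled cosine similarity
        return self.logit_scale * vis_emb @ text_emb.t()            # (B, T, C)


# Toy usage with a random stand-in for the pre-trained text encoder.
dummy_encoder = lambda prompts: torch.randn(len(prompts), 512)
clf = LanguageAwareClassifier(dummy_encoder, feat_dim=256, text_dim=512)
logits = clf(torch.randn(2, 128, 256), ["a video of playing guitar", "dog barking"])
print(logits.shape)  # torch.Size([2, 128, 2])
```

The key design choice this sketch reflects is that classification is reduced to similarity matching in a shared embedding space, which is what allows a single classifier head to serve TAL, SED, and AVEL label sets, and to generalize to unseen categories by prompt substitution alone.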