Title: Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

Sep 12, 2024

Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang

Abstract: Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. As a result, these methods inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and encourage the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to uni-modal features by leveraging cross-modal local-correlated properties without any extra annotations. This pushes the uni-modal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further design a Cross-modal Dynamic Perception (CDP) layer in a cross-modal feature pyramid to understand the local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
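
The contrast the abstract draws between dense cross-modal attention and locality-restricted aggregation can be illustrated with a small sketch. This is not LOCO's LCC or CDP, whose details the abstract does not give. The function name `local_cross_attention` and the fixed `window` parameter are hypothetical, and a fixed window is a simplifying assumption (the abstract describes a data-driven mechanism). Each audio time step attends only to visual time steps within `window` positions of it; `window=None` falls back to dense attention over the whole video.

```python
import math


def local_cross_attention(audio, visual, window=None):
    """Illustrative audio-to-visual attention with an optional temporal window.

    audio, visual: lists of feature vectors (lists of floats), one per time
    step, assumed to be temporally aligned and of equal length.
    window: hypothetical parameter. Each audio step t attends to visual steps
    in [t - window, t + window]. None means dense attention over all steps.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual must have the same number of time steps")

    fused = []
    for t, query in enumerate(audio):
        if window is None:
            steps = list(range(len(visual)))
        else:
            lo = max(0, t - window)
            hi = min(len(visual), t + window + 1)
            steps = list(range(lo, hi))

        # Scaled dot-product scores against the visual steps in range.
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, visual[s])) / scale for s in steps
        ]

        # Numerically stable softmax over the selected steps only.
        top = max(scores)
        exps = [math.exp(score - top) for score in scores]
        total = sum(exps)
        weights = [e / total for e in exps]

        # Weighted sum of the visual features inside the window.
        dim = len(visual[0])
        fused.append(
            [sum(w * visual[s][d] for w, s in zip(weights, steps)) for d in range(dim)]
        )
    return fused
```

With `window=0` each audio step sees only its own visual frame, and with a window at least as long as the video the result matches dense attention. Anything in between limits how much temporally distant content can leak into a given time step, which is the kind of irrelevant aggregation the abstract attributes to dense attention.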
