Title: Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Sep 14, 2024

Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou

Abstract: Most existing audio-text retrieval (ATR) approaches typically rely on a single-level interaction to associate audio and text, limiting their ability to align different modalities and leading to suboptimal matches. In this work, we present a novel ATR framework that leverages two-stream Transformers in conjunction with a Hierarchical Alignment (THA) module to identify multi-level correspondences of different Transformer blocks between audio and text. Moreover, current ATR methods mainly focus on learning a global-level representation, missing out on intricate details to capture audio occurrences that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to grasp fine-grained audio-text semantic correlations. Additionally, we develop a confidence-aware (CA) module to estimate the confidence of each latent factor pair and adaptively aggregate cross-modal latent factors to achieve local semantic alignment. Experiments show that our THA effectively boosts ATR performance, with the DCR approach further contributing to consistent performance gains.

* Accepted by Interspeech 2024
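The abstract names three ideas without giving formulas: matching audio and text at several Transformer levels, splitting each embedding into compact latent factors, and weighting factor pairs by an estimated confidence. The sketch below is one possible reading of those ideas in NumPy, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the equal-size split into factors, cosine similarity as the matching score, the linear confidence scorer `conf_w`, and the plain average over levels.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def split_into_factors(embedding, num_factors):
    """Reshape a (D,) embedding into (K, D // K) latent factors.

    Assumption: factors are equal-size contiguous chunks of the embedding.
    """
    dim = embedding.shape[-1]
    if dim % num_factors != 0:
        raise ValueError("embedding size must be divisible by num_factors")
    return embedding.reshape(num_factors, dim // num_factors)


def confidence_weighted_similarity(audio_emb, text_emb, num_factors, conf_w):
    """Aggregate per-factor cosine similarities with softmax confidences.

    `conf_w` is a hypothetical weight vector of size 2 * (D // K) that scores
    each concatenated (audio factor, text factor) pair.
    """
    a = l2_normalize(split_into_factors(audio_emb, num_factors))
    t = l2_normalize(split_into_factors(text_emb, num_factors))
    factor_sims = np.sum(a * t, axis=-1)             # shape (K,)
    logits = np.concatenate([a, t], axis=-1) @ conf_w  # shape (K,)
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    weights = exp / exp.sum()
    return float(np.sum(weights * factor_sims)), weights


def multi_level_similarity(audio_levels, text_levels):
    """Average cosine similarity over paired per-level embeddings.

    Assumption: one pooled (D,) embedding per Transformer block and modality.
    """
    sims = [
        float(np.dot(l2_normalize(a), l2_normalize(t)))
        for a, t in zip(audio_levels, text_levels)
    ]
    return float(np.mean(sims))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=16)
    text = rng.normal(size=16)
    conf_w = rng.normal(size=8)  # 2 * (16 // 4)
    score, weights = confidence_weighted_similarity(audio, text, 4, conf_w)
    print("aggregated similarity:", score)
    print("factor confidences:", weights)
```

In a trained system the confidence scorer and the encoders would be learned jointly, typically with a contrastive retrieval loss; here `conf_w` is random and only shows how the weights enter the aggregation.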
