Title: Aligning Audio-Visual Joint Representations with an Agentic Workflow

Oct 31, 2024

Shentong Mo, Yibing Song

Figures 1-4 for Aligning Audio-Visual Joint Representations with an Agentic Workflow

Abstract: Visual content and its accompanying audio signals naturally form a joint representation that improves audio-visual (AV) applications. While existing studies develop various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually underestimated. We observe that an audio signal may contain background noise interference, and that audio and video streams may be out of sync. This loose data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, AVAgent uses a multi-modal LLM to convert the audio and visual data into language descriptions separately (i.e., tool use). Then, AVAgent reasons about whether the paired data is well aligned and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how well the modified audio signals match the visual content and to provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically, forming an agentic workflow in which audio signals are gradually aligned to visual content. Existing methods can then directly leverage the AV data aligned by our agentic workflow to improve AV joint representations. Experimental results demonstrate state-of-the-art performance of the proposed approach against previous baselines on diverse downstream tasks.
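
The cyclic tool use, planning, and reflection steps described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `AudioClip`, `describe_audio`, `plan_actions`, `apply_action`, `score_alignment`, and `align` are all hypothetical names, the multi-modal LLM and the VLM are replaced by simple rule-based stand-ins, and the visual side is folded into the scorer for brevity.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class AudioClip:
    """Toy stand-in for an audio signal (hypothetical, not from the paper)."""
    noise_level: float  # 0.0 means no background noise
    offset_sec: float   # 0.0 means synchronized with the video


def describe_audio(clip: AudioClip) -> str:
    """Tool use: stand-in for an LLM turning audio into a text description."""
    parts = []
    if clip.noise_level > 0.1:
        parts.append("noisy")
    if abs(clip.offset_sec) > 0.05:
        parts.append("out of sync")
    return ", ".join(parts) or "clean and in sync"


def plan_actions(audio_desc: str) -> List[str]:
    """Planning: choose predefined edit actions based on the description."""
    actions = []
    if "noisy" in audio_desc:
        actions.append("denoise")
    if "out of sync" in audio_desc:
        actions.append("resync")
    return actions


def apply_action(clip: AudioClip, action: str) -> AudioClip:
    """Execute one predefined action on the toy clip."""
    if action == "denoise":
        return replace(clip, noise_level=clip.noise_level * 0.5)
    if action == "resync":
        return replace(clip, offset_sec=0.0)
    raise ValueError(f"unknown action: {action}")


def score_alignment(clip: AudioClip) -> float:
    """Reflection: stand-in for a VLM scoring how well audio matches video."""
    return 1.0 / (1.0 + clip.noise_level + abs(clip.offset_sec))


def align(clip: AudioClip, max_rounds: int = 5,
          target: float = 0.9) -> Tuple[AudioClip, List[float]]:
    """Repeat describe -> plan -> edit -> score until the target is reached."""
    scores = [score_alignment(clip)]
    for _ in range(max_rounds):
        if scores[-1] >= target:
            break
        actions = plan_actions(describe_audio(clip))
        if not actions:
            break
        candidate = clip
        for action in actions:
            candidate = apply_action(candidate, action)
        new_score = score_alignment(candidate)
        if new_score <= scores[-1]:
            break  # feedback says the edit did not help, so keep the old clip
        clip = candidate
        scores.append(new_score)
    return clip, scores


aligned, scores = align(AudioClip(noise_level=0.8, offset_sec=0.3))
print(aligned, [round(s, 3) for s in scores])
```

In this sketch the score acts as the feedback signal: an edit is kept only if it raises the score, and the loop stops once the target is met, no action is planned, or the round budget runs out.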
