Title: Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models

Oct 15, 2024

Saksham Singh Kushwaha, Jianbo Ma, Mark R. P. Thomas, Yapeng Tian, Avery Bruni

Abstract: Spatial audio is a crucial component in creating immersive experiences. Traditional simulation-based approaches to generate spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task of generating first-order Ambisonics (FOA) given a sound category and sound source spatial location. We propose Diff-SAGe, an end-to-end, flow-based diffusion-transformer model for this task. Diff-SAGe utilizes a complex spectrogram representation for FOA, preserving the phase information crucial for accurate spatial cues. Additionally, a multi-conditional encoder integrates the input conditions into a unified representation, guiding the generation of FOA waveforms from noise. Through extensive evaluations on two datasets, we demonstrate that our method consistently outperforms traditional simulation-based baselines across both objective and subjective metrics.
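
To make the FOA format and the role of phase more concrete, here is a minimal sketch. It is not the Diff-SAGe model. It is only a textbook plane-wave encoding of a mono signal into four FOA channels, roughly what a simple simulation-based approach might do. The channel order (W, Y, Z, X) and the SN3D gains are an assumed convention, and the function names `encode_foa` and `estimate_direction` are hypothetical. The last lines illustrate why a complex spectrum matters: a source in front and a source behind give identical magnitude spectra, and only the sign (phase) of the X channel tells them apart.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as a plane wave in FOA.

    Assumed convention (not taken from the paper): channel order
    W, Y, Z, X with SN3D normalization.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    return gains[:, None] * mono[None, :]  # shape (4, num_samples)

def estimate_direction(foa):
    """Recover azimuth and elevation (degrees) from the time-averaged
    products of W with X, Y, Z (a simple intensity-vector estimate)."""
    w, y, z, x = foa
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    az = np.rad2deg(np.arctan2(iy, ix))
    el = np.rad2deg(np.arctan2(iz, np.hypot(ix, iy)))
    return az, el

rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)  # stand-in for one second of audio

foa = encode_foa(mono, azimuth_deg=120.0, elevation_deg=30.0)
az, el = estimate_direction(foa)

# Front (0 degrees) versus back (180 degrees): magnitudes match,
# but the complex X channel flips sign.
spec_front = np.fft.rfft(encode_foa(mono, 0.0, 0.0), axis=-1)
spec_back = np.fft.rfft(encode_foa(mono, 180.0, 0.0), axis=-1)
same_magnitude = np.allclose(np.abs(spec_front), np.abs(spec_back))
x_sign_flipped = np.allclose(spec_back[3], -spec_front[3])
```

In this toy setting the location is fully determined by the relative gains and signs across channels, which is the kind of information a magnitude-only representation would discard.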
