Title: Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

May 28, 2024

Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Abstract: In this study, we aim to construct an audio-video generative model at minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides each single-modal model so that the two cooperatively generate samples that are well aligned across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module that adjusts the scores estimated separately by the base models to match the score of the joint distribution over audio and video. We show theoretically that this guidance can be computed from the gradient of the optimal discriminator that distinguishes real audio-video pairs from fake ones generated independently by the base models. On the basis of this analysis, we construct the joint guidance module by training this discriminator. Additionally, we adopt a loss function that makes the gradient of the discriminator act as a noise estimator, as in standard diffusion models, which stabilizes that gradient. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multi-modal alignment with a relatively small number of parameters.
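
The link between the discriminator and the joint score can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes two scalar "modalities" x and y drawn from a correlated bivariate Gaussian, treats the standard-normal marginals as the "base models", and writes the optimal discriminator in closed form instead of training one. Noise levels and diffusion time steps are left out entirely. All names (`RHO`, `discriminator_logit`, `guided_score_x`, and so on) are hypothetical. With equal numbers of real and fake pairs, the optimal discriminator D* satisfies log(D*/(1 - D*)) = log p(x, y) - log q(x) - log q(y), so adding the gradient of that logit to the base score should recover the joint score.

```python
import math

RHO = 0.8  # assumed correlation between the two toy "modalities"


def log_joint(x, y):
    """Log density of a zero-mean, unit-variance bivariate Gaussian with correlation RHO."""
    det = 1.0 - RHO ** 2
    quad = (x * x - 2.0 * RHO * x * y + y * y) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def log_marginal(v):
    """Log density of a standard normal; stands in for one independent base model."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * v * v


def discriminator_logit(x, y):
    """log(D* / (1 - D*)) for the optimal real-vs-fake discriminator (equal priors).

    Real pairs follow the joint density; fake pairs follow the product of marginals.
    """
    return log_joint(x, y) - log_marginal(x) - log_marginal(y)


def grad_x(f, x, y, eps=1e-5):
    """Central finite-difference derivative of f with respect to x."""
    return (f(x + eps, y) - f(x - eps, y)) / (2.0 * eps)


def guided_score_x(x, y):
    """Base (marginal) score for x plus the discriminator-gradient correction."""
    base_score = -x  # score of a standard normal
    return base_score + grad_x(discriminator_logit, x, y)


def joint_score_x(x, y):
    """Analytic x-component of the joint score, used only as a reference."""
    return -(x - RHO * y) / (1.0 - RHO ** 2)


if __name__ == "__main__":
    x, y = 0.5, -1.2
    print("guided score:", guided_score_x(x, y))
    print("joint score: ", joint_score_x(x, y))
```

In this toy setting the two printed values agree up to floating-point error. In the setting the abstract describes, the discriminator is a trained network evaluated on noisy audio-video pairs, so the correction is only as accurate as the learned discriminator.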
