There has been growing interest in generating sound for silent videos, largely because of its practical value in streamlining video post-production. However, existing video-to-sound methods attempt to generate audio directly from visual representations, which is challenging because visual and audio representations are difficult to align. In this paper, we present SonicVisionLM, a novel framework for generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, given a silent video our approach first uses a VLM to identify events in the video and suggest sounds that match its content. This shift transforms the challenging task of aligning image and audio into the better-studied sub-problems of aligning image to text and text to audio with popular diffusion models. To improve the quality of audio recommendations with LLMs, we collected an extensive dataset that maps text descriptions to specific sound effects and developed temporally controlled audio adapters. Our approach surpasses current state-of-the-art methods for video-to-audio conversion, achieving better synchronization with the visuals and improved alignment between the audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/
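The two-stage pipeline described above (VLM-suggested sound descriptions followed by text-to-audio generation) can be summarized with the minimal sketch below. The function names `vlm_describe_events` and `text_to_audio`, the `SoundEvent` structure, and the dummy outputs are hypothetical placeholders for illustration only; they are not the released SonicVisionLM API or its actual models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundEvent:
    description: str   # text suggested by the VLM, e.g. "glass shattering"
    start_s: float     # onset within the video, in seconds
    end_s: float       # offset within the video, in seconds


def vlm_describe_events(video_path: str) -> List[SoundEvent]:
    """Stage 1 (placeholder): a VLM watches the silent video and proposes
    textual sound descriptions with rough timestamps."""
    return [
        SoundEvent("footsteps on a wooden floor", 0.0, 2.5),
        SoundEvent("door creaking open", 2.5, 4.0),
    ]


def text_to_audio(event: SoundEvent, sample_rate: int = 16_000) -> List[float]:
    """Stage 2 (placeholder): a text-to-audio diffusion model, steered by a
    temporally controlled adapter, renders audio for one described event."""
    num_samples = int((event.end_s - event.start_s) * sample_rate)
    return [0.0] * num_samples  # silence stands in for generated audio


def generate_soundtrack(video_path: str) -> List[float]:
    """Concatenate per-event clips into a single track aligned with the video."""
    track: List[float] = []
    for event in vlm_describe_events(video_path):
        track.extend(text_to_audio(event))
    return track


if __name__ == "__main__":
    samples = generate_soundtrack("silent_clip.mp4")
    print(f"generated {len(samples)} audio samples")
```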