Title: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

Jan 03, 2024

David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo

Abstract: Most existing video diffusion models (VDMs) are conditioned on text alone. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
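
To make the idea of a decoupled cross-attention layer more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so this assumes one common reading, in which video tokens attend to text tokens and to image tokens in two separate attention passes and the two results are summed. The function names, the `image_scale` weight, and the omission of learned query/key/value projections are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors.

    Learned projections are omitted (an assumption made for brevity).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def decoupled_cross_attention(video_tokens, text_tokens, image_tokens,
                              image_scale=1.0):
    """Hypothetical sketch: separate attention passes for text and image.

    The summation and the `image_scale` weight are assumptions; the
    abstract only states that the cross-attention is decoupled.
    """
    text_out = attend(video_tokens, text_tokens, text_tokens)
    image_out = attend(video_tokens, image_tokens, image_tokens)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]
```

Under these assumptions, setting `image_scale=0.0` reduces the layer to ordinary text-only cross-attention, which suggests one way such a design could keep the behaviour of a text-conditioned model while adding an image branch.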

* project page: https://showlab.github.io/Moonshot/
