Title: DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

May 21, 2024

Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

Figures 1-4 for DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Abstract: Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject and suffer from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (the action-binding problem), and so fail to achieve satisfactory multi-subject generation performance. To tackle these problems, we propose DisenStudio, a novel framework that can generate text-guided videos for multiple customized subjects, given a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. The model is then customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee that the subjects appear and preserve their visual attributes, and the third strategy helps the model maintain its temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate that our proposed DisenStudio significantly outperforms existing methods on various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
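The abstract names a spatial-disentangled cross-attention mechanism but does not describe how it works. As a rough illustration of the general idea of tying image regions to particular text tokens, here is a minimal sketch of generic region-masked cross-attention in plain Python. It is not the paper's implementation: the function names `build_mask` and `masked_cross_attention`, the region layout, and the notion of "shared" tokens are all assumptions made for this example.

```python
import math


def build_mask(position_region, region_tokens, shared_tokens, num_tokens):
    """Build a boolean mask of shape [positions][tokens].

    Hypothetical layout: each spatial position belongs to one region id,
    each region may see its own tokens plus tokens shared by all regions.
    """
    mask = []
    for region in position_region:
        visible = set(shared_tokens) | set(region_tokens.get(region, ()))
        mask.append([t in visible for t in range(num_tokens)])
    return mask


def masked_cross_attention(queries, keys, values, allowed):
    """Scaled dot-product cross-attention restricted by a boolean mask.

    queries: one vector per spatial position
    keys, values: one vector per text token
    allowed[p][t]: True if position p may attend to token t
    """
    dim = len(keys[0])
    outputs = []
    for query, mask in zip(queries, allowed):
        # Masked-out tokens get a score of -inf, so their weight is 0.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            if ok else -math.inf
            for key, ok in zip(keys, mask)
        ]
        top = max(scores)
        if top == -math.inf:
            raise ValueError("every position needs at least one visible token")
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: two positions, two tokens, one token per region.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
mask = build_mask(
    position_region=[0, 1],
    region_tokens={0: [0], 1: [1]},
    shared_tokens=[],
    num_tokens=2,
)
result = masked_cross_attention(queries, keys, values, mask)
```

In this toy setup each position can see only the token assigned to its region, so its output equals that token's value vector. A real text-to-video model would apply such a mask inside batched attention layers across frames, which this sketch does not attempt.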
