Title: FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Aug 15, 2024

Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin

Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, which guides the generation of every frame identically and provides no frame-specific textual guidance. This restricts the model's capacity to comprehend the temporal logic conveyed in prompts and to generate videos with coherent motion. To tackle this limitation, we introduce FancyVideo, a video generator that improves the existing text-control mechanism with the Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. First, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark and facilitates the synthesis of dynamic and consistent videos. Video results are available at https://fancyvideo.github.io/, and we will make our code and model weights publicly available.
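As a rough illustration of the data flow described in the abstract, the following Python sketch places three stand-in operations at the beginning, middle, and end of a single cross-attention step. It is not the authors' implementation: the abstract does not say how TII, TAR, or TFB are computed, so mean pooling and a three-frame moving average are used here purely as placeholder assumptions, and every function name is hypothetical.

```python
import math


def avg(items):
    """Element-wise mean of equally shaped nested lists (or of plain floats)."""
    if isinstance(items[0], list):
        return [avg(list(group)) for group in zip(*items)]
    return sum(items) / len(items)


def smooth_over_frames(x):
    """Placeholder temporal op: average each frame with its immediate neighbours."""
    return [avg(x[max(0, f - 1): f + 2]) for f in range(len(x))]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_frame_attention(latents, text):
    """latents[f][n] and text[t] are vectors of the same length d.

    Returns (per-frame text conditions, attended latent features).
    """
    d = len(text[0])

    # Beginning (TII stand-in, ASSUMPTION): add each frame's mean latent
    # to the shared text tokens, giving one text condition per frame.
    text_per_frame = []
    for frame in latents:
        pooled = avg(frame)
        text_per_frame.append([[t + p for t, p in zip(tok, pooled)] for tok in text])

    # Correlation matrix between latent tokens and per-frame text tokens.
    logits = [
        [[dot(q, k) / math.sqrt(d) for k in cond] for q in frame]
        for frame, cond in zip(latents, text_per_frame)
    ]

    # Middle (TAR stand-in, ASSUMPTION): refine the matrix along the time axis.
    logits = smooth_over_frames(logits)

    # Standard attention read-out using the refined weights.
    out = []
    for frame_logits, cond in zip(logits, text_per_frame):
        frame_out = []
        for row in frame_logits:
            w = softmax(row)
            frame_out.append([sum(wi * tok[j] for wi, tok in zip(w, cond)) for j in range(d)])
        out.append(frame_out)

    # End (TFB stand-in, ASSUMPTION): smooth the output features over time.
    return text_per_frame, smooth_over_frames(out)


# Toy input: 3 frames, 2 latent tokens per frame, 2 text tokens, d = 2.
demo_latents = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
demo_text = [[1.0, 0.0], [0.0, 1.0]]
demo_conditions, demo_out = cross_frame_attention(demo_latents, demo_text)
```

The point of the sketch is the contrast with plain spatial cross-attention: there, `demo_text` would be reused unchanged for every frame, whereas here `demo_conditions` holds a different text condition for each frame.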
