Baqiao Liu

HARIVO: Harnessing Text-to-Image Models for Video Generation

Oct 10, 2024

Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

Abstract: We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen Stable Diffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO

* ECCV 2024
