Title: Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation

May 23, 2023

Susung Hong, Junyoung Seo, Sunghwan Hong, Heeseong Shin, Seungryong Kim

Abstract: In the paradigm of AI-generated content (AIGC), there has been increasing attention to extending pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks struggle to maintain a consistent narrative and to handle rapid shifts in scene composition or object placement when given only a single user prompt. This paper introduces a new framework, dubbed DirecT2V, which leverages instruction-tuned large language models (LLMs) to generate frame-by-frame descriptions from a single abstract user prompt. DirecT2V uses LLM directors to divide the user input into a separate prompt for each frame, which allows time-varying content to be included and facilitates consistent video generation. To maintain temporal consistency and prevent object collapse, we propose a novel value mapping method and dual-softmax filtering. Extensive experimental results validate the effectiveness of the DirecT2V framework in producing visually coherent and consistent videos from abstract user prompts, addressing the challenges of zero-shot video generation.
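
The "LLM director" step described in the abstract, dividing one user prompt into a separate prompt per frame, can be illustrated with a minimal sketch. This is not the authors' implementation: the instruction wording, the `Frame i: ...` answer format, and the names `build_director_instruction`, `parse_frame_prompts`, `direct_frames`, and `fake_llm` are all assumptions made for illustration, and `fake_llm` is a hard-coded stand-in for a real instruction-tuned model.

```python
import re
from typing import Callable, List

# Hypothetical answer format: one line per frame, "Frame i: description".
FRAME_LINE = re.compile(r"^\s*Frame\s+(\d+)\s*:\s*(.+?)\s*$", re.IGNORECASE)


def build_director_instruction(user_prompt: str, num_frames: int) -> str:
    """Build an instruction asking an LLM for one description per frame.

    The wording is an assumption, not the prompt used in the paper.
    """
    return (
        f'Describe each of the {num_frames} frames of a video for the prompt: '
        f'"{user_prompt}".\n'
        f"Answer with exactly {num_frames} lines of the form "
        f"'Frame i: description'."
    )


def parse_frame_prompts(llm_output: str, num_frames: int) -> List[str]:
    """Extract per-frame prompts from the LLM answer, ordered by frame index."""
    frames = {}
    for line in llm_output.splitlines():
        match = FRAME_LINE.match(line)
        if match:
            index = int(match.group(1))
            if 1 <= index <= num_frames:
                frames[index] = match.group(2)
    missing = [i for i in range(1, num_frames + 1) if i not in frames]
    if missing:
        raise ValueError(f"LLM answer is missing frames: {missing}")
    return [frames[i] for i in range(1, num_frames + 1)]


def direct_frames(
    user_prompt: str, num_frames: int, llm: Callable[[str], str]
) -> List[str]:
    """Turn one abstract user prompt into a list of frame-level prompts."""
    instruction = build_director_instruction(user_prompt, num_frames)
    return parse_frame_prompts(llm(instruction), num_frames)


def fake_llm(instruction: str) -> str:
    """Stand-in for a real LLM call; returns a fixed, well-formed answer."""
    return (
        "Frame 1: A cat crouches on a sofa.\n"
        "Frame 2: The cat leaps into the air.\n"
        "Frame 3: The cat lands on the floor."
    )


frame_prompts = direct_frames("a cat jumping off a sofa", 3, fake_llm)
for prompt in frame_prompts:
    print(prompt)
```

In a pipeline of the kind the abstract describes, each resulting frame prompt would then condition the T2I model for its frame. The value mapping and dual-softmax filtering steps are not sketched here, since the abstract does not define them.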

* The code and demo will be available at https://github.com/KU-CVLAB/DirecT2V
