Title: TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

May 07, 2024

Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang

Abstract: Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and the scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., 'a red panda climbing a tree') and the second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and are visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is https://talc-mst2v.github.io/.
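
The time-aligned conditioning described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that frames are split evenly across scenes, that conditioning is a plain single-head cross-attention with no learned projections, and that the text and visual features share one dimension. The names `frame_to_scene` and `time_aligned_cross_attention` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_to_scene(num_frames, num_scenes):
    """Map each frame index to a scene index.

    Assumption: frames are divided as evenly as possible, in order.
    """
    return [f * num_scenes // num_frames for f in range(num_frames)]


def time_aligned_cross_attention(frame_feats, scene_embs):
    """Condition each frame only on the caption of its own scene.

    frame_feats: array of shape (F, P, D), P visual tokens per frame.
    scene_embs:  list of S arrays, each of shape (T_s, D), holding the
                 text-token embeddings of one scene description.
    Returns the conditioned features (F, P, D) and the frame-to-scene map.
    """
    num_frames, _, dim = frame_feats.shape
    assignment = frame_to_scene(num_frames, len(scene_embs))
    out = np.empty_like(frame_feats)
    for f, s in enumerate(assignment):
        text = scene_embs[s]                          # (T_s, D)
        scores = frame_feats[f] @ text.T / np.sqrt(dim)  # (P, T_s)
        out[f] = softmax(scores, axis=-1) @ text      # (P, D)
    return out, assignment


# Usage: 8 frames, 2 scene captions with random stand-in embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4, 16))
captions = [rng.normal(size=(6, 16)), rng.normal(size=(9, 16))]
conditioned, mapping = time_aligned_cross_attention(frames, captions)
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this sketch the first four frames attend only to the first caption and the last four only to the second, in contrast to a single merged prompt that every frame would attend to.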

* 23 pages, 12 figures, 8 tables
