Omer Bar-Tal

DynVFX: Augmenting Real Videos with Dynamic Content

Feb 05, 2025

Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel

Abstract: We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained Vision Language Model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.

* Project page: https://dynvfx.github.io

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Oct 04, 2024

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning

Abstract: Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design, without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement a token merging strategy that reduces the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose VDCscore, a new LLM-assisted metric for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.

* Code, docs, weights, benchmark, and training data are all available at https://rese1f.github.io/aurora-web/
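
The token merging idea mentioned in the abstract can be illustrated with a small sketch. This is not the AuroraCap implementation: it assumes a simplified greedy rule (repeatedly average the most similar pair of adjacent tokens) in place of whatever matching scheme the paper uses, and the function names `cosine` and `merge_tokens` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, target):
    """Greedily merge adjacent tokens until at most `target` remain.

    Assumption: plain averaging of the most similar adjacent pair.
    This is a simplification for illustration only.
    """
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target, 1):
        # Index of the adjacent pair with the highest similarity.
        best = max(range(len(tokens) - 1),
                   key=lambda i: cosine(tokens[i], tokens[i + 1]))
        merged = [(x + y) / 2.0 for x, y in zip(tokens[best], tokens[best + 1])]
        tokens[best:best + 2] = [merged]
    return tokens

# Four toy "visual tokens": two near-duplicate pairs.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
reduced = merge_tokens(tokens, 2)
print(reduced)
```

In this toy case the two duplicate pairs collapse into one token each, halving the sequence length while keeping both distinct directions.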

Lumiere: A Space-Time Diffusion Model for Video Generation

Feb 05, 2024

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj (+7 more)

Abstract: We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

* Webpage: https://lumiere-video.github.io/ | Video: https://www.youtube.com/watch?v=wxLr02Dz2Sc

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Dec 03, 2023

Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel

Abstract: We present a new method for text-driven motion transfer: synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable only to limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.

* Project page: https://diffusion-motion-transfer.github.io/

Disentangling Structure and Appearance in ViT Feature Space

Nov 20, 2023

Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel

Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available on our project page: splice-vit.github.io.

* Accepted to ACM Transactions on Graphics. arXiv admin note: substantial text overlap with arXiv:2201.00424

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Jul 23, 2023

Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/
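
As a rough illustration of propagating features along inter-frame correspondences, the sketch below assigns each token of a frame the edited feature of its nearest keyframe token. It is a toy under stated assumptions, not the TokenFlow implementation: correspondences are taken as nearest neighbors in squared Euclidean distance against a single keyframe, and the names `nearest` and `propagate` are hypothetical.

```python
def nearest(query, candidates):
    """Index of the candidate vector closest to `query` (squared L2 distance)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((q - c) ** 2 for q, c in zip(query, candidates[j])),
    )

def propagate(frame_feats, key_feats, edited_key_feats):
    """Give each frame token the edited feature of its matching keyframe token.

    Correspondences are computed on the ORIGINAL features, then used to
    copy the EDITED keyframe features, so all frames share edited tokens.
    """
    return [edited_key_feats[nearest(f, key_feats)] for f in frame_feats]

# Original keyframe tokens and their edited counterparts (toy values).
key_feats = [[0.0, 0.0], [1.0, 1.0]]
edited_key_feats = [[9.0, 9.0], [5.0, 5.0]]

# Tokens of another frame; each lies close to one keyframe token.
frame_feats = [[0.9, 1.1], [0.1, -0.1]]
propagated = propagate(frame_feats, key_feats, edited_key_feats)
print(propagated)
```

Because every frame reuses the same edited keyframe features, tokens that correspond across frames end up identical after editing, which is the kind of feature-space consistency the abstract refers to.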

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Feb 16, 2023

Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel

Abstract: Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image and fast adaptation to new tasks remain an open challenge, currently addressed mostly by costly and lengthy re-training and fine-tuning or by ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation using a pre-trained text-to-image diffusion model, without any further training or fine-tuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high-quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io
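
One simple way to picture binding several generation processes to a shared canvas is to run a model on overlapping windows and reconcile the overlaps by averaging. The sketch below assumes uniform weights and a 1-D canvas, and uses fixed stand-in numbers in place of real denoiser outputs; the name `fuse_windows` is hypothetical and this is not the authors' code.

```python
def fuse_windows(length, windows, window_outputs):
    """Average per-window outputs back onto one shared canvas.

    windows: list of (start, end) index ranges over the canvas.
    window_outputs: one list of values per window (stand-ins for the
    result of a single denoising step applied to that crop).
    """
    total = [0.0] * length
    count = [0] * length
    for (start, _end), values in zip(windows, window_outputs):
        for offset, v in enumerate(values):
            total[start + offset] += v
            count[start + offset] += 1
    # Positions covered by several windows receive the mean of their values.
    return [t / c if c else 0.0 for t, c in zip(total, count)]

# Two windows of width 4 over a canvas of width 6, overlapping on indices 2-3.
windows = [(0, 4), (2, 6)]
outputs = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
fused = fuse_windows(6, windows, outputs)
print(fused)
```

Repeating this fusion at every denoising step keeps the windows in agreement where they overlap, which is what lets a fixed-size model produce a wider canvas such as a panorama.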

Text2LIVE: Text-Driven Layered Image and Video Editing

Apr 05, 2022

Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, Tali Dekel

Abstract: We present a method for zero-shot, text-driven appearance manipulation in natural images and videos. Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., object's texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner. We train a generator using an internal dataset of training examples, extracted from a single input (image or video and target text prompt), while leveraging an external pre-trained CLIP model to establish our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the original input. This allows us to constrain the generation process and maintain high fidelity to the original input via novel text-driven losses that are applied directly to the edit layer. Our method neither relies on a pre-trained generator nor requires user-provided edit masks. We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.

* Project page: https://text2live.github.io
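
The edit-layer idea (color plus opacity composited over the input) corresponds to standard alpha compositing. The sketch below shows only that compositing step on a flat list of RGB pixels, with hand-picked values standing in for a generator's output; the function name `composite` is hypothetical.

```python
def composite(original, edit_color, edit_alpha):
    """Alpha-composite an edit layer over the original pixels.

    original, edit_color: lists of (r, g, b) tuples with floats in [0, 1].
    edit_alpha: one opacity in [0, 1] per pixel.
    Output pixel = alpha * edit_color + (1 - alpha) * original.
    """
    out = []
    for src, col, a in zip(original, edit_color, edit_alpha):
        out.append(tuple(a * c + (1.0 - a) * s for s, c in zip(src, col)))
    return out

original = [(0.2, 0.4, 0.6), (1.0, 1.0, 1.0)]
edit_color = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
edit_alpha = [0.0, 0.5]  # first pixel untouched, second half-blended

result = composite(original, edit_color, edit_alpha)
print(result)
```

Wherever the opacity is zero the original pixel passes through unchanged, which is why an edit expressed as a layer can stay localized and preserve the rest of the input.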

Splicing ViT Features for Semantic Appearance Transfer

Jan 02, 2022

Narek Tumanyan, Omer Bar-Tal, Shai Bagon, Tali Dekel

Abstract: We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework - a pivotal component in tackling this task - our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model which serves as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, untwisting them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and can generate high-resolution results, e.g., work in HD. We demonstrate high-quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose, and appearance.
