Tavi Halperin

CAFA: a Controllable Automatic Foley Artist

Apr 15, 2025

Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, Yossi Adi

Abstract: Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advancements in automatic video-to-audio models have increased the demand for greater user control in the process. One possible approach is to incorporate text to guide audio generation. While supported by existing methods, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a modality adapter mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, besides its superior quality in terms of semantic alignment and audio-visual synchronization, the proposed method enables high textual controllability, as demonstrated in subjective and objective evaluations.

* Renamed paper to "CAFA: a Controllable Automatic Foley Artist" from "Controllable Automatic Foley Artist". Updated link to demo page

Controllable Automatic Foley Artist

Apr 09, 2025

Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, Yossi Adi

V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data

Jun 20, 2024

Rotem Shalev-Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit H. Bermano, Ohad Fried

Abstract:Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.

Diffusing Colors: Image Colorization with Text Guided Diffusion

Dec 07, 2023

Nir Zabari, Aharon Azulay, Alexey Gorkor, Tavi Halperin, Ohad Fried

Abstract:The colorization of grayscale images is a complex and subjective task with significant challenges. Despite recent progress in employing large-scale datasets with deep neural networks, difficulties with controllability and visual quality persist. To tackle these issues, we present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts. This integration not only produces colorization outputs that are semantically appropriate but also greatly improves the level of control users have over the colorization process. Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence. We leverage a pretrained generative Diffusion Model, and show that we can finetune it for the colorization task without losing its generative power or attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable level of vividness based on the specific scene semantics. Our approach holds potential particularly for color enhancement and historical image colorization.

* SIGGRAPH Asia 2023

Temporally stable video segmentation without video annotations

Oct 17, 2021

Aharon Azulay, Tavi Halperin, Orestis Vantzos, Nadav Bornstein, Ofir Bibi

Abstract:Temporally consistent dense video annotations are scarce and hard to collect. In contrast, image segmentation datasets (and pre-trained models) are ubiquitous, and easier to label for any novel task. In this paper, we introduce a method to adapt still image segmentation models to video in an unsupervised manner, by using an optical flow-based consistency measure. To ensure that the inferred segmented videos appear more stable in practice, we verify that the consistency measure is well correlated with human judgement via a user study. Training a new multi-input multi-output decoder using this measure as a loss, together with a technique for refining current image segmentation datasets and a temporal weighted-guided filter, we observe stability improvements in the generated segmented videos with minimal loss of accuracy.
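
The abstract above does not spell out its optical flow-based consistency measure, so the following is only a minimal sketch of one plausible form: warp the previous frame's labels to the current frame using a backward flow field and count how often the labels agree. The function name `temporal_consistency`, the integer-valued flow, and the agreement ratio are all assumptions made for illustration, not the authors' definition.

```python
def temporal_consistency(prev_labels, curr_labels, backward_flow):
    """Fraction of pixels whose label agrees with the flow-warped previous frame.

    prev_labels, curr_labels: 2D lists of class ids with the same shape.
    backward_flow[y][x]: hypothetical integer (dy, dx) pointing from pixel
    (y, x) in the current frame to its source pixel in the previous frame.
    """
    height, width = len(curr_labels), len(curr_labels[0])
    agree = 0
    valid = 0
    for y in range(height):
        for x in range(width):
            dy, dx = backward_flow[y][x]
            sy, sx = y + dy, x + dx
            # Skip pixels whose source falls outside the previous frame.
            if 0 <= sy < height and 0 <= sx < width:
                valid += 1
                if curr_labels[y][x] == prev_labels[sy][sx]:
                    agree += 1
    return agree / valid if valid else 0.0


# An object (label 1) moves one pixel to the right between frames.
prev = [[1, 0, 0], [1, 0, 0]]
curr = [[0, 1, 0], [0, 1, 0]]
flow_right = [[(0, -1)] * 3 for _ in range(2)]  # source is one pixel to the left
flow_zero = [[(0, 0)] * 3 for _ in range(2)]    # ignores the motion

score_with_flow = temporal_consistency(prev, curr, flow_right)
score_without_flow = temporal_consistency(prev, curr, flow_zero)
```

In this toy case the segmentation is perfectly consistent once the motion is accounted for, while comparing the frames without flow penalizes the moving object. One minus such a score could serve as a loss term, though whether the paper does this is not stated in the abstract.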

Endless Loops: Detecting and Animating Periodic Patterns in Still Images

May 19, 2021

Tavi Halperin, Hanit Hakim, Orestis Vantzos, Gershon Hochman, Netai Benaim, Lior Sassy, Michael Kupchik, Ofir Bibi, Ohad Fried

Abstract:We present an algorithm for producing a seamless animated loop from a single image. The algorithm detects periodic structures, such as the windows of a building or the steps of a staircase, and generates a non-trivial displacement vector field that maps each segment of the structure onto a neighboring segment along a user- or auto-selected main direction of motion. This displacement field is used, together with suitable temporal and spatial smoothing, to warp the image and produce the frames of a continuous animation loop. Our cinemagraphs are created in under a second on a mobile device. Over 140,000 users downloaded our app and exported over 350,000 cinemagraphs. Moreover, we conducted two user studies that show that users prefer our method for creating surreal and structured cinemagraphs compared to more manual approaches and compared to previous methods.

* ACM Trans. Graph., Vol. 40, No. 4, Article 142. Publication date: August 2021
* SIGGRAPH 2021. Project page: https://pub.res.lightricks.com/endless-loops/ . Video: https://youtu.be/8ZYUvxWuD2Y
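
To illustrate why shifting a periodic pattern by one period yields a seamless loop, here is a heavily simplified sketch. It assumes a single uniform period (in pixels) along one direction, whereas the abstract describes a non-trivial, spatially varying displacement field with smoothing. The function name `loop_displacements` and its parameters are hypothetical.

```python
def loop_displacements(period_px, direction, num_frames):
    """Per-frame displacement vectors for a uniform periodic pattern.

    Frame t is shifted by t / num_frames of one period along `direction`,
    so the frame that would follow the last one is shifted by a full
    period and looks identical to frame 0, which closes the loop.
    """
    dx, dy = direction
    norm = (dx * dx + dy * dy) ** 0.5
    ux, uy = dx / norm, dy / norm  # unit vector along the motion direction
    displacements = []
    for t in range(num_frames):
        shift = period_px * t / num_frames
        displacements.append((shift * ux, shift * uy))
    return displacements


# A pattern repeating every 10 pixels horizontally, animated over 5 frames.
frames = loop_displacements(10, (1, 0), 5)
```

Each entry gives the offset at which the source image would be sampled for that frame. A real implementation would also need to detect the period and direction and to blend regions that are not periodic.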

Clear Skies Ahead: Towards Real-Time Automatic Sky Replacement in Video

Mar 06, 2019

Tavi Halperin, Harel Cain, Ofir Bibi, Michael Werman

Abstract: Digital videos such as those captured by a smartphone often exhibit exposure inconsistencies, a poorly exposed sky, or simply suffer from an uninteresting or plain-looking sky. Professionals may edit these videos using advanced and time-consuming tools unavailable to most users, to replace the sky with a more expressive or imaginative sky. In this work, we propose an algorithm for automatic replacement of the sky region in a video with a different sky, providing nonprofessional users with a simple yet efficient tool to seamlessly replace the sky. The method is fast, achieving close to real-time performance on mobile devices, and the user's involvement can remain as limited as simply selecting the replacement sky.

* Eurographics 2019. Supplementary video: https://youtu.be/1uZ46YzX-pI

Neural separation of observed and unobserved distributions

Nov 30, 2018

Tavi Halperin, Ariel Ephrat, Yedid Hoshen

Abstract: Separating mixed distributions is a long-standing challenge for machine learning and signal processing. Applications include single-channel multi-speaker separation (the cocktail party problem), singing voice separation, and separating reflections from images. Most current methods either rely on making strong assumptions on the source distributions (e.g. sparsity, low rank, repetitiveness) or rely on having training samples of each source in the mixture. In this work, we tackle the scenario of extracting an unobserved distribution additively mixed with a signal from an observed (arbitrary) distribution. We introduce a new method, Neural Egg Separation: an iterative method that learns to separate the known distribution from progressively finer estimates of the unknown distribution. In some settings, Neural Egg Separation is initialization sensitive; we therefore introduce GLO Masking, which ensures a good initialization. Extensive experiments show that our method outperforms current methods that use the same level of supervision and often achieves similar performance to full supervision.

Dynamic Temporal Alignment of Speech to Lips

Aug 19, 2018

Tavi Halperin, Ariel Ephrat, Shmuel Peleg

Abstract: Many speech segments in movies are re-recorded in a studio during postproduction, to compensate for poor sound quality as recorded on location. Manual alignment of the newly recorded speech with the original lip movements is a tedious task. We present an audio-to-video alignment method for automating speech-to-lips alignment, stretching and compressing the audio signal to match the lip movements. This alignment is based on deep audio-visual features, mapping the lips video and the speech signal to a shared representation. Using this shared representation, we compute the lip-sync error between every short speech period and every video frame, followed by the determination of the optimal corresponding frame for each short sound period over the entire video clip. We demonstrate successful alignment both quantitatively, using a human perception-inspired metric, and qualitatively. The strongest advantage of our audio-to-video approach is in cases where the original voice is unclear, and where a constant shift of the sound cannot give a perfect alignment. In these cases, state-of-the-art methods will fail.
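
The step of choosing the best frame for each short sound period from a matrix of pairwise errors can be sketched with classic dynamic time warping. This is an illustration under stated assumptions rather than the authors' exact procedure: the function name `align_audio_to_frames`, the allowed step pattern, and the toy cost values are hypothetical, and in practice the costs would come from the learned audio-visual features.

```python
from math import inf


def align_audio_to_frames(cost):
    """Monotonic minimum-cost alignment between audio segments and video frames.

    cost[i][j]: hypothetical sync error between audio segment i and frame j.
    Returns (total_cost, path), where path is a list of (segment, frame) pairs
    running from (0, 0) to the last segment and last frame.
    """
    n, m = len(cost), len(cost[0])
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,                 # repeat frame j
                acc[i][j - 1] if j > 0 else inf,                 # repeat segment i
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,   # advance both
            )
            acc[i][j] = cost[i][j] + best_prev

    # Backtrack from the end to recover the chosen correspondences.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


# Toy example: two audio segments, three frames; segment 0 spans frames 0 and 1.
total, path = align_audio_to_frames([[0, 0, 5], [5, 5, 0]])
```

Because the path may repeat a segment or a frame, it encodes local stretching and compression of the audio, which a single constant shift cannot express.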

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Feb 09, 2018

Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Abstract:Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions, and a well-known audio-only method.

* Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzoI
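
The abstract says the video-derived speech predictions are applied as a filter on the noisy audio but does not give the filter's form. One simple possibility, shown here purely as an assumption, is a binary time-frequency mask: keep the noisy spectrogram bins where the predicted speech magnitude is at least a threshold and zero the rest. The function name `filter_with_prediction` and the threshold value are hypothetical.

```python
def filter_with_prediction(noisy_mag, predicted_mag, threshold):
    """Mask a noisy magnitude spectrogram using predicted speech magnitudes.

    noisy_mag, predicted_mag: 2D lists (time x frequency) of the same shape.
    A bin of the noisy input is kept only where the predicted speech
    magnitude is at least `threshold`; all other bins are set to zero.
    """
    return [
        [n if p >= threshold else 0.0 for n, p in zip(noisy_row, pred_row)]
        for noisy_row, pred_row in zip(noisy_mag, predicted_mag)
    ]


noisy = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[0.9, 0.1], [0.0, 0.5]]
filtered = filter_with_prediction(noisy, predicted, threshold=0.5)
```

The masked magnitudes would then be combined with the noisy signal's phase and inverted back to a waveform. A soft ratio mask is another common choice.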
