Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hidir Yesiltepe

Dynamic View Synthesis as an Inverse Problem

Jun 09, 2025

Hidir Yesiltepe, Pinar Yanardag

Figure 1 for Dynamic View Synthesis as an Inverse Problem

Figure 2 for Dynamic View Synthesis as an Inverse Problem

Figure 3 for Dynamic View Synthesis as an Inverse Problem

Figure 4 for Dynamic View Synthesis as an Inverse Problem

Abstract:In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.

* Project Page: https://inverse-dvs.github.io/

Via

Access Paper or Ask Questions

LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers

May 29, 2025

Yusuf Dalva, Hidir Yesiltepe, Pinar Yanardag

Abstract:We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical `photoshop-with-LoRAs' tool and opens new avenues for compositional visual storytelling and rapid creative iteration.

* Project Webpage: https://lorashop.github.io/

Via

Access Paper or Ask Questions

MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

Dec 06, 2024

Tuna Han Salih Meral, Hidir Yesiltepe, Connor Dunlop, Pinar Yanardag

Abstract:Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, showcasing a notable advancement in generative AI. However, these models generally lack fine-grained control over motion patterns, limiting their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfers across various contexts. Our approach does not require training and works on test-time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle with comprehensive scene changes while maintaining consistent motion, MotionFlow successfully handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.

* Project Page: https://motionflow-diffusion.github.io

Via

Access Paper or Ask Questions

MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Dec 06, 2024

Hidir Yesiltepe, Tuna Han Salih Meral, Connor Dunlop, Pinar Yanardag

Figure 1 for MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Figure 2 for MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Figure 3 for MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Figure 4 for MotionShop: Zero-Shot Motion Transfer in Video Diffusion Models with Mixture of Score Guidance

Abstract:In this work, we propose the first motion transfer approach in diffusion transformer through Mixture of Score Guidance (MSG), a theoretically-grounded framework for motion transfer in diffusion models. Our key theoretical contribution lies in reformulating conditional score to decompose motion score and content score in diffusion models. By formulating motion transfer as a mixture of potential energies, MSG naturally preserves scene composition and enables creative scene transformations while maintaining the integrity of transferred motion patterns. This novel sampling operates directly on pre-trained video diffusion models without additional training or fine-tuning. Through extensive experiments, MSG demonstrates successful handling of diverse scenarios including single object, multiple objects, and cross-object motion transfer as well as complex camera motion transfer. Additionally, we introduce MotionBench, the first motion transfer dataset consisting of 200 source videos and 1000 transferred motions, covering single/multi-object transfers, and complex camera motions.

* Project page: https://motionshop-diffusion.github.io

Via

Access Paper or Ask Questions

Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Jun 20, 2024

Matthew Zheng, Enis Simsar, Hidir Yesiltepe, Federico Tombari, Joel Simon, Pinar Yanardag

Figure 1 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Figure 2 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Figure 3 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Figure 4 for Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Abstract:Text-to-image models are becoming increasingly popular, revolutionizing the landscape of digital art creation by enabling highly detailed and creative visual content generation. These models have been widely employed across various domains, particularly in art generation, where they facilitate a broad spectrum of creative expression and democratize access to artistic creation. In this paper, we introduce \texttt{STYLEBREEDER}, a comprehensive dataset of 6.8M images and 1.8M prompts generated by 95K users on Artbreeder, a platform that has emerged as a significant hub for creative exploration with over 13M users. We introduce a series of tasks with this dataset aimed at identifying diverse artistic styles, generating personalized content, and recommending styles based on user interests. By documenting unique, user-generated styles that transcend conventional categories like 'cyberpunk' or 'Picasso,' we explore the potential for unique, crowd-sourced styles that could provide deep insights into the collective creative psyche of users worldwide. We also evaluate different personalization methods to enhance artistic expression and introduce a style atlas, making these models available in LoRA format for public use. Our research demonstrates the potential of text-to-image diffusion models to uncover and promote unique artistic expressions, further democratizing AI in art and fostering a more diverse and inclusive artistic community. The dataset, code and models are available at https://stylebreeder.github.io under a Public Domain (CC0) license.

Via

Access Paper or Ask Questions

The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Jun 01, 2024

Hidir Yesiltepe, Yusuf Dalva, Pinar Yanardag

Figure 1 for The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Figure 2 for The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Figure 3 for The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Figure 4 for The Curious Case of End Token: A Zero-Shot Disentangled Image Editing using CLIP

Abstract:Diffusion models have become prominent in creating high-quality images. However, unlike GAN models celebrated for their ability to edit images in a disentangled manner, diffusion-based text-to-image models struggle to achieve the same level of precise attribute manipulation without compromising image coherence. In this paper, CLIP which is often used in popular text-to-image diffusion models such as Stable Diffusion is capable of performing disentangled editing in a zero-shot manner. Through both qualitative and quantitative comparisons with state-of-the-art editing methods, we show that our approach yields competitive results. This insight may open opportunities for applying this method to various tasks, including image and video editing, providing a lightweight and efficient approach for disentangled editing.

Via

Access Paper or Ask Questions

GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

Mar 28, 2024

Yusuf Dalva, Hidir Yesiltepe, Pinar Yanardag

Figure 1 for GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

Figure 2 for GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

Figure 3 for GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

Figure 4 for GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

Abstract:The rapid advancement in image generation models has predominantly been driven by diffusion models, which have demonstrated unparalleled success in generating high-fidelity, diverse images from textual prompts. Despite their success, diffusion models encounter substantial challenges in the domain of image editing, particularly in executing disentangled edits-changes that target specific attributes of an image while leaving irrelevant parts untouched. In contrast, Generative Adversarial Networks (GANs) have been recognized for their success in disentangled edits through their interpretable latent spaces. We introduce GANTASTIC, a novel framework that takes existing directions from pre-trained GAN models-representative of specific, controllable attributes-and transfers these directions into diffusion-based models. This novel approach not only maintains the generative quality and diversity that diffusion models are known for but also significantly enhances their capability to perform precise, targeted image edits, thereby leveraging the best of both worlds.

* Project page: https://gantastic.github.io

Via

Access Paper or Ask Questions

MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models

Mar 28, 2024

Hidir Yesiltepe, Kiymet Akdemir, Pinar Yanardag

Figure 1 for MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models

Figure 2 for MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models

Figure 3 for MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models

Figure 4 for MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models

Abstract:Diffusion-based text-to-image models have rapidly gained popularity for their ability to generate detailed and realistic images from textual descriptions. However, these models often reflect the biases present in their training data, especially impacting marginalized groups. While prior efforts to debias language models have focused on addressing specific biases, such as racial or gender biases, efforts to tackle intersectional bias have been limited. Intersectional bias refers to the unique form of bias experienced by individuals at the intersection of multiple social identities. Addressing intersectional bias is crucial because it amplifies the negative effects of discrimination based on race, gender, and other identities. In this paper, we introduce a method that addresses intersectional bias in diffusion-based text-to-image models by modifying cross-attention maps in a disentangled manner. Our approach utilizes a pre-trained Stable Diffusion model, eliminates the need for an additional set of reference images, and preserves the original quality for unaltered concepts. Comprehensive experiments demonstrate that our method surpasses existing approaches in mitigating both single and intersectional biases across various attributes. We make our source code and debiased models for various attributes available to encourage fairness in generative models and to support further research.

Via

Access Paper or Ask Questions

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Dec 07, 2023

Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag

Figure 1 for RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Figure 2 for RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Figure 3 for RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Figure 4 for RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Abstract:Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset and videos can be found in https://rave-video.github.io.

* Project webpage: https://rave-video.github.io , Github: http://github.com/rehg-lab/RAVE

Via

Access Paper or Ask Questions