Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rameen Abdal

Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA

Jul 23, 2025

Rameen Abdal, Or Patashnik, Ekaterina Deyneka, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman

Abstract:Recent advances in text-to-video generation have enabled high-quality synthesis from text and image prompts. While the personalization of dynamic concepts, which capture subject-specific appearance and motion from a single video, is now feasible, most existing methods require per-instance fine-tuning, limiting scalability. We introduce a fully zero-shot framework for dynamic concept personalization in text-to-video models. Our method leverages structured 2x2 video grids that spatially organize input and output pairs, enabling the training of lightweight Grid-LoRA adapters for editing and composition within these grids. At inference, a dedicated Grid Fill module completes partially observed layouts, producing temporally coherent and identity preserving outputs. Once trained, the entire system operates in a single forward pass, generalizing to previously unseen dynamic concepts without any test-time optimization. Extensive experiments demonstrate high-quality and consistent results across a wide range of subjects beyond trained concepts and editing scenarios.

* Project Page and Video : https://snap-research.github.io/zero-shot-dynamic-concepts/

Via

Access Paper or Ask Questions

Improving the Diffusability of Autoencoders

Feb 20, 2025

Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin

Abstract:Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256.

* 26 pages, 22 figures, 9 tables

Via

Access Paper or Ask Questions

Dynamic Concepts Personalization from Single Videos

Feb 20, 2025

Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman

Figure 1 for Dynamic Concepts Personalization from Single Videos

Figure 2 for Dynamic Concepts Personalization from Single Videos

Figure 3 for Dynamic Concepts Personalization from Single Videos

Figure 4 for Dynamic Concepts Personalization from Single Videos

Abstract:Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.

* Webpage: https://snap-research.github.io/dynamic_concepts/

Via

Access Paper or Ask Questions

Interpreting the Weight Space of Customized Diffusion Models

Jun 13, 2024

Amil Dravid, Yossi Gandelsman, Kuan-Chieh Wang, Rameen Abdal, Gordon Wetzstein, Alexei A. Efros, Kfir Aberman

Figure 1 for Interpreting the Weight Space of Customized Diffusion Models

Figure 2 for Interpreting the Weight Space of Customized Diffusion Models

Figure 3 for Interpreting the Weight Space of Customized Diffusion Models

Figure 4 for Interpreting the Weight Space of Customized Diffusion Models

Abstract:We investigate the space of weights spanned by a large collection of customized diffusion models. We populate this space by creating a dataset of over 60,000 models, each of which is a base model fine-tuned to insert a different person's visual identity. We model the underlying manifold of these weights as a subspace, which we term weights2weights. We demonstrate three immediate applications of this space -- sampling, editing, and inversion. First, as each point in the space corresponds to an identity, sampling a set of weights from it results in a model encoding a novel identity. Next, we find linear directions in this space corresponding to semantic edits of the identity (e.g., adding a beard). These edits persist in appearance across generated samples. Finally, we show that inverting a single image into this space reconstructs a realistic identity, even if the input image is out of distribution (e.g., a painting). Our results indicate that the weight space of fine-tuned diffusion models behaves as an interpretable latent space of identities.

* Project Page: https://snap-research.github.io/weights2weights

Via

Access Paper or Ask Questions

Gaussian Shell Maps for Efficient 3D Human Generation

Nov 29, 2023

Rameen Abdal, Wang Yifan, Zifan Shi, Yinghao Xu, Ryan Po, Zhengfei Kuang, Qifeng Chen, Dit-Yan Yeung, Gordon Wetzstein

Figure 1 for Gaussian Shell Maps for Efficient 3D Human Generation

Figure 2 for Gaussian Shell Maps for Efficient 3D Human Generation

Figure 3 for Gaussian Shell Maps for Efficient 3D Human Generation

Figure 4 for Gaussian Shell Maps for Efficient 3D Human Generation

Abstract:Efficient generation of 3D digital humans is important in several industries, including virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures, however, typically rely on volume representations, which are slow to render, thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi shell--based scaffold. In this setting, a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of $512 \times 512$ pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ and DeepFashion.

* Project page : https://rameenabdal.github.io/GaussianShellMaps/

Via

Access Paper or Ask Questions

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Jan 06, 2023

Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, Menglei Chai, Aliaksandr Siarohin, Peter Wonka, Sergey Tulyakov

Figure 1 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Figure 2 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Figure 3 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Figure 4 for 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

Abstract:Modern 3D-GANs synthesize geometry and texture by training on large-scale datasets with a consistent structure. Training such models on stylized, artistic data, with often unknown, highly variable geometry, and camera information has not yet been shown possible. Can we train a 3D GAN on such artistic data, while maintaining multi-view consistency and texture quality? To this end, we propose an adaptation framework, where the source domain is a pre-trained 3D-GAN, while the target domain is a 2D-GAN trained on artistic datasets. We then distill the knowledge from a 2D generator to the source 3D generator. To do that, we first propose an optimization-based method to align the distributions of camera parameters across domains. Second, we propose regularizations necessary to learn high-quality texture, while avoiding degenerate geometric solutions, such as flat shapes. Third, we show a deformation-based technique for modeling exaggerated geometry of artistic domains, enabling -- as a byproduct -- personalized geometric editing. Finally, we propose a novel inversion method for 3D-GANs linking the latent spaces of the source and the target domains. Our contributions -- for the first time -- allow for the generation, editing, and animation of personalized artistic 3D avatars on artistic datasets.

* Project Page: https://rameenabdal.github.io/3DAvatarGAN/

Via

Access Paper or Ask Questions

Video2StyleGAN: Disentangling Local and Global Variations in a Video

May 30, 2022

Rameen Abdal, Peihao Zhu, Niloy J. Mitra, Peter Wonka

Figure 1 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Figure 2 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Figure 3 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Figure 4 for Video2StyleGAN: Disentangling Local and Global Variations in a Video

Abstract:Image editing using a pretrained StyleGAN generator has emerged as a powerful paradigm for facial editing, providing disentangled controls over age, expression, illumination, etc. However, the approach cannot be directly adopted for video manipulations. We hypothesize that the main missing ingredient is the lack of fine-grained and disentangled control over face location, face pose, and local facial expressions. In this work, we demonstrate that such a fine-grained control is indeed achievable using pretrained StyleGAN by working across multiple (latent) spaces (namely, the positional space, the W+ space, and the S space) and combining the optimization results across the multiple spaces. Building on this enabling component, we introduce Video2StyleGAN that takes a target image and driving video(s) to reenact the local and global locations and expressions from the driving video in the identity of the target image. We evaluate the effectiveness of our method over multiple challenging scenarios and demonstrate clear improvements over alternative approaches.

* Video : https://youtu.be/oUeXFyfdE1A

Via

Access Paper or Ask Questions

CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Dec 09, 2021

Rameen Abdal, Peihao Zhu, John Femiani, Niloy J. Mitra, Peter Wonka

Figure 1 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Figure 2 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Figure 3 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Figure 4 for CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Abstract:The success of StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images. However, such editing operations are either trained with semantic supervision or described using human guidance. In another development, the CLIP architecture has been trained with internet-scale image and text pairings and has been shown to be useful in several zero-shot learning settings. In this work, we investigate how to effectively link the pretrained latent spaces of StyleGAN and CLIP, which in turn allows us to automatically extract semantically labeled edit directions from StyleGAN, finding and naming meaningful edit operations without any additional human guidance. Technically, we propose two novel building blocks; one for finding interesting CLIP directions and one for labeling arbitrary directions in CLIP latent space. The setup does not assume any pre-determined labels and hence we do not require any additional supervised text/attributes to build the editing framework. We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible, and reveals interesting and non-trivial edit directions.

Via

Access Paper or Ask Questions

Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Oct 15, 2021

Peihao Zhu, Rameen Abdal, John Femiani, Peter Wonka

Figure 1 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Figure 2 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Figure 3 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Figure 4 for Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

Abstract:We present a new method for one shot domain adaptation. The input to our method is trained GAN that can produce images in domain A and a single reference image I_B from domain B. The proposed algorithm can translate any output of the trained GAN from domain A to domain B. There are two main advantages of our method compared to the current state of the art: First, our solution achieves higher visual quality, e.g. by noticeably reducing overfitting. Second, our solution allows for more degrees of freedom to control the domain gap, i.e. what aspects of image I_B are used to define the domain B. Technically, we realize the new method by building on a pre-trained StyleGAN generator as GAN and a pre-trained CLIP model for representing the domain gap. We propose several new regularizers for controlling the domain gap to optimize the weights of the pre-trained StyleGAN generator to output images in domain B instead of domain A. The regularizers prevent the optimization from taking on too many attributes of the single reference image. Our results show significant visual improvements over the state of the art as well as multiple applications that highlight improved control.

* Video: https://youtu.be/RLBJ-mem9gM

Via

Access Paper or Ask Questions

Barbershop: GAN-based Image Compositing using Segmentation Masks

Jun 02, 2021

Peihao Zhu, Rameen Abdal, John Femiani, Peter Wonka

Figure 1 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Figure 2 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Figure 3 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Figure 4 for Barbershop: GAN-based Image Compositing using Segmentation Masks

Abstract:Seamlessly blending features from multiple images is extremely challenging because of complex relationships in lighting, geometry, and partial occlusion which cause coupling between different parts of the image. Even though recent work on GANs enables synthesis of realistic hair or faces, it remains difficult to combine them into a single, coherent, and plausible image rather than a disjointed set of image patches. We present a novel solution to image blending, particularly for the problem of hairstyle transfer, based on GAN-inversion. We propose a novel latent space for image blending which is better at preserving detail and encoding spatial information, and propose a new GAN-embedding algorithm which is able to slightly modify images to conform to a common segmentation mask. Our novel representation enables the transfer of the visual properties from multiple reference images including specific details such as moles and wrinkles, and because we do image blending in a latent-space we are able to synthesize images that are coherent. Our approach avoids blending artifacts present in other approaches and finds a globally consistent image. Our results demonstrate a significant improvement over the current state of the art in a user study, with users preferring our blending solution over 95 percent of the time.

* Project page: https://zpdesu.github.io/Barbershop/ Video: https://youtu.be/ZU-yrAvoJfQ

Via

Access Paper or Ask Questions