Michael Gharbi

PromptRL: Prompt Matters in RL for Flow-Based Image Generation

Feb 01, 2026

Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park

Abstract:Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at https://github.com/G-U-N/UniRL.

Image Neural Field Diffusion Models

Jun 11, 2024

Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi

Abstract:Diffusion models have shown an impressive ability to model complex data distributions, with several key advantages over GANs, such as stable training, better coverage of the training distribution's modes, and the ability to solve inverse problems without extra training. However, most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models. To achieve this, a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method, inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models followed by super-resolution models, and can solve inverse problems with conditions applied at different scales efficiently.

* Project page: https://yinboc.github.io/infd/

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Mar 19, 2024

Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi

Abstract:We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts. Yet, it adapts it to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects.

* Project page: https://magic-fixup.github.io/
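
The Magic Fixup abstract above describes training samples as pairs of source and target frames taken from the same video at randomly chosen time intervals. A minimal sketch of that pairing step follows. It is not the authors' code: the function name `sample_frame_pair` and the `max_gap` cap on the interval are hypothetical, introduced only for illustration, and the warping and diffusion stages are omitted.

```python
import random


def sample_frame_pair(num_frames, max_gap, rng):
    """Pick (source, target) frame indices from one video.

    num_frames: number of frames in the video (must be at least 2).
    max_gap:    hypothetical upper bound on the time interval, in frames.
    rng:        a random.Random instance, so sampling is reproducible.
    """
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least 2 frames and max_gap >= 1")
    # Random time interval between the two frames, capped by the video length.
    gap = rng.randint(1, min(max_gap, num_frames - 1))
    # Random start position such that the target frame stays inside the video.
    source = rng.randint(0, num_frames - 1 - gap)
    return source, source + gap


rng = random.Random(0)
source_idx, target_idx = sample_frame_pair(num_frames=30, max_gap=5, rng=rng)
```

In a full pipeline under these assumptions, the source frame would then be warped toward the target and the model supervised to recover the target frame.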

VecFusion: Vector Font Generation with Diffusion

Dec 16, 2023

Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Michael Gharbi, Oliver Wang, Alec Jacobson, Evangelos Kalogerakis

Abstract:We present VecFusion, a new neural architecture that can generate vector fonts with varying topological structures and precise control point positions. Our approach is a cascaded diffusion model which consists of a raster diffusion model followed by a vector diffusion model. The raster model generates low-resolution, rasterized fonts with auxiliary control point information, capturing the global style and shape of the font, while the vector model synthesizes vector fonts conditioned on the low-resolution raster fonts from the first stage. To synthesize long and complex curves, our vector diffusion model uses a transformer architecture and a novel vector representation that enables the modeling of diverse vector geometry and the precise prediction of control points. Our experiments show that, in contrast to previous generative models for vector graphics, our new cascaded vector diffusion model generates higher quality vector fonts, with complex structures and diverse styles.

Any-resolution Training for High-resolution Image Synthesis

Apr 14, 2022

Lucy Chai, Michael Gharbi, Eli Shechtman, Phillip Isola, Richard Zhang

Abstract:Generative models operate at fixed resolution, even though natural images come in a variety of sizes. As high-resolution details are downsampled away, and low-resolution images are discarded altogether, precious supervision is lost. We argue that every pixel matters and create datasets with variable-size images, collected at their native resolutions. Taking advantage of this data is challenging; high-resolution processing is costly, and current architectures can only process fixed-resolution data. We introduce continuous-scale training, a process that samples patches at random scales to train a new generator with variable output resolutions. First, conditioning the generator on a target scale allows us to generate higher-resolution images than previously possible, without adding layers to the model. Second, by conditioning on continuous coordinates, we can sample patches that still obey a consistent global layout, which also allows for scalable training at higher resolutions. Controlled FFHQ experiments show our method takes advantage of the multi-resolution training data better than discrete multi-scale approaches, achieving better FID scores and cleaner high-frequency details. We also train on other natural image domains including churches, mountains, and birds, and demonstrate arbitrary scale synthesis with both coherent global layouts and realistic local details, going beyond 2K resolution in our experiments. Our project page is available at: https://chail.github.io/anyres-gan/.
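
The abstract above mentions sampling patches at random scales and conditioning the generator on a target scale and on continuous coordinates. The sketch below illustrates one way such a patch could be sampled. It is an assumption-laden illustration rather than the paper's implementation: the function name `sample_patch_coords`, the uniform scale distribution, and the convention that coordinates are normalized to [0, 1] are all hypothetical.

```python
import random


def sample_patch_coords(patch_size, min_scale, max_scale, rng):
    """Sample a square patch at a random scale and position.

    Returns (scale, coords), where coords[i][j] is the (x, y) centre of
    patch pixel (row i, column j) in normalized image coordinates.
    Assumes 1 <= min_scale <= max_scale; scale 1 covers the whole image.
    """
    scale = rng.uniform(min_scale, max_scale)  # assumed uniform sampling
    extent = 1.0 / scale                       # fraction of the image covered
    # Random top-left corner that keeps the patch inside the image.
    x0 = rng.uniform(0.0, 1.0 - extent)
    y0 = rng.uniform(0.0, 1.0 - extent)
    step = extent / patch_size                 # spacing between pixel centres
    coords = [
        [(x0 + (j + 0.5) * step, y0 + (i + 0.5) * step)
         for j in range(patch_size)]
        for i in range(patch_size)
    ]
    return scale, coords


rng = random.Random(0)
scale, coords = sample_patch_coords(patch_size=4, min_scale=1.0,
                                    max_scale=8.0, rng=rng)
```

Under these assumptions, a generator would receive both `scale` and `coords` as conditioning, so patches drawn at different zoom levels still refer to one shared global layout.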

MarioNette: Self-Supervised Sprite Learning

Apr 29, 2021

Dmitriy Smirnov, Michael Gharbi, Matthew Fisher, Vitor Guizilini, Alexei A. Efros, Justin Solomon

Abstract:Visual content often contains recurring elements. Text is made up of glyphs from the same font, animations, such as cartoons or video games, are composed of sprites moving around the screen, and natural videos frequently have repeated views of objects. In this paper, we propose a deep learning approach for obtaining a graphically disentangled representation of recurring elements in a completely self-supervised manner. By jointly learning a dictionary of texture patches and training a network that places them onto a canvas, we effectively deconstruct sprite-based content into a sparse, consistent, and interpretable representation that can be easily used in downstream tasks. Our framework offers a promising approach for discovering recurring patterns in image collections without supervision.

Im2Vec: Synthesizing Vector Graphics without Vector Supervision

Feb 04, 2021

Pradyumna Reddy, Michael Gharbi, Michal Lukac, Niloy J. Mitra

Abstract:Vector graphics are widely used to represent fonts, logos, digital artworks, and graphic designs. But, while a vast body of work has focused on generative algorithms for raster images, only a handful of options exist for vector graphics. One can always rasterize the input graphic and resort to image-based generative approaches, but this negates the advantages of the vector representation. The current alternative is to use specialized models that require explicit supervision on the vector graphics representation at training time. This is not ideal because large-scale high-quality vector-graphics datasets are difficult to obtain. Furthermore, the vector representation for a given design is not unique, so models that supervise on the vector representation are unnecessarily constrained. Instead, we propose a new neural network that can generate complex vector graphics with varying topologies, and only requires indirect supervision from readily-available raster training images (i.e., with no vector counterparts). To enable this, we use a differentiable rasterization pipeline that renders the generated vector shapes and composites them together onto a raster canvas. We demonstrate our method on a range of datasets, and provide comparison with state-of-the-art SVG-VAE and DeepSVG, both of which require explicit vector graphics supervision. Finally, we also demonstrate our approach on the MNIST dataset, for which no ground-truth vector representation is available. Source code, datasets, and more results are available at http://geometry.cs.ucl.ac.uk/projects/2020/Im2Vec/

Spatially-Adaptive Pixelwise Networks for Fast Image Translation

Dec 05, 2020

Tamar Rott Shaham, Michael Gharbi, Richard Zhang, Eli Shechtman, Tomer Michaeli

Abstract:We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation. We design the generator to be an extremely lightweight function of the full-resolution image. In fact, we use pixel-wise networks; that is, each pixel is processed independently of others, through a composition of simple affine transformations and nonlinearities. We take three important steps to equip such a seemingly simple function with adequate expressivity. First, the parameters of the pixel-wise networks are spatially varying so they can represent a broader function class than simple 1x1 convolutions. Second, these parameters are predicted by a fast convolutional network that processes an aggressively low-resolution representation of the input. Third, we augment the input image with a sinusoidal encoding of spatial coordinates, which provides an effective inductive bias for generating realistic novel high-frequency image content. As a result, our model is up to 18x faster than state-of-the-art baselines. We achieve this speedup while generating comparable visual quality across different image resolutions and translation domains.
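
The three steps listed in the abstract above (spatially varying per-pixel parameters, parameters defined on a low-resolution grid, and a sinusoidal encoding of pixel coordinates) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' architecture: the names `positional_encoding`, `pixelwise_mlp`, and `translate` are hypothetical, the low-resolution parameter grid is passed in directly instead of being predicted by a convolutional network, and the nearest-neighbour lookup into that grid is an assumption.

```python
import math


def positional_encoding(x, y, num_freqs=2):
    """Sinusoidal encoding of normalized pixel coordinates x, y in [0, 1]."""
    feats = []
    for k in range(num_freqs):
        freq = (2 ** k) * math.pi
        feats += [math.sin(freq * x), math.cos(freq * x),
                  math.sin(freq * y), math.cos(freq * y)]
    return feats


def pixelwise_mlp(features, layers):
    """Per-pixel MLP: affine layers with ReLU between them (none after the last)."""
    h = features
    for i, (weights, bias) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h


def translate(image, param_grid, num_freqs=2):
    """Process every pixel independently with spatially varying parameters.

    image:      H x W nested list of per-pixel channel lists.
    param_grid: low-resolution grid; each cell holds a list of
                (weights, bias) layers used by the pixels it covers.
    """
    height, width = len(image), len(image[0])
    grid_h, grid_w = len(param_grid), len(param_grid[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            # Nearest-neighbour lookup of this pixel's parameters (assumption).
            layers = param_grid[r * grid_h // height][c * grid_w // width]
            x = c / max(width - 1, 1)
            y = r / max(height - 1, 1)
            feats = list(image[r][c]) + positional_encoding(x, y, num_freqs)
            row.append(pixelwise_mlp(feats, layers))
        out.append(row)
    return out
```

Because each output pixel depends only on its own input value, its coordinates, and the parameters of its grid cell, the expensive computation is confined to whatever produces `param_grid` at low resolution.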

A Dataset of Multi-Illumination Images in the Wild

Oct 17, 2019

Lukas Murmann, Michael Gharbi, Miika Aittala, Fredo Durand

Abstract:Collections of images under a single, uncontrolled illumination have enabled the rapid advancement of core computer vision tasks like classification, detection, and segmentation. But even with modern learning techniques, many inverse problems involving lighting and material understanding remain too severely ill-posed to be solved with single-illumination datasets. To fill this gap, we introduce a new multi-illumination dataset of more than 1000 real scenes, each captured under 25 lighting conditions. We demonstrate the richness of this dataset by training state-of-the-art models for three challenging applications: single-image illumination estimation, image relighting, and mixed-illuminant white balance.

* ICCV 2019
