Dustin Podell

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Mar 05, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel(+7 more)

Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
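As a rough illustration of the "straight line" between data and noise mentioned in the abstract, the sketch below interpolates linearly between a data sample and a Gaussian noise sample and computes the constant velocity along that line. The function names are hypothetical. The logit-normal timestep sampler is an assumption of this sketch: it is one possible way to draw intermediate timesteps more often than a uniform sampler would, and the abstract itself does not name a specific scheme. This is not the authors' implementation.

```python
import math
import random


def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity along that line: d z_t / dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def sample_timestep_logit_normal(rng, mean=0.0, std=1.0):
    """Assumed weighting (not stated in the abstract): t = sigmoid(u) with
    u ~ N(mean, std), which draws mid-range t more often than uniform."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                       # toy "data" vector
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise sample
t = sample_timestep_logit_normal(rng)       # timestep in (0, 1)
z_t = interpolate(x0, noise, t)             # noisy training input
target = velocity_target(x0, noise)         # regression target for a velocity model
```

In a training loop, a network would typically take `z_t` and `t` as input and be regressed onto `target`; the toy vectors here stand in for image latents.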

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Jul 04, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach

Abstract: We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
