Axel Sauer

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Mar 18, 2024

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach

Abstract: Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Mar 05, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel (+7 more)

Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
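The "straight line" between data and noise mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the function names, the convention that t = 0 is data and t = 1 is noise, and the plain Euler sampler are all assumptions made for illustration, and the paper's noise-sampling scheme and architecture are not reproduced here.

```python
def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1).

    Assumed convention; other sources may flip the direction of t.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Time derivative of interpolate(): constant along the line."""
    return [b - a for a, b in zip(x0, noise)]


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate from t=1 (noise) back to t=0 (data) with Euler steps.

    velocity_fn(x, t) stands in for a trained network (hypothetical here).
    """
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    data = [1.0, -2.0]
    eps = [0.5, 0.25]
    # An oracle that returns the exact straight-line velocity for this pair.
    oracle = lambda x, t: velocity_target(data, eps)
    print(euler_sample(oracle, eps, num_steps=1))
```

With an exact, constant velocity, a single Euler step already lands on the data point, which is the intuition behind why straight paths are attractive for few-step sampling. A learned velocity field is only approximately straight, so in practice more steps may still help.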

Adversarial Diffusion Distillation

Nov 28, 2023

Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach

Abstract: We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models. Code and weights available under https://github.com/Stability-AI/generative-models and https://huggingface.co/stabilityai/.

StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

Jan 23, 2023

Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila

Abstract: Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.

* Project page: https://sites.google.com/view/stylegan-t/

VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids

Jun 17, 2022

Katja Schwarz, Axel Sauer, Michael Niemeyer, Yiyi Liao, Andreas Geiger

Abstract: State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D consistent results with high visual fidelity.

StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

Feb 01, 2022

Axel Sauer, Katja Schwarz, Andreas Geiger

Abstract: Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of $1024^2$ at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes.

* Project Page: https://sites.google.com/view/stylegan-xl/

Projected GANs Converge Faster

Nov 01, 2021

Axel Sauer, Kashyap Chitta, Jens Müller, Andreas Geiger

Abstract: Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. They need careful regularization, vast amounts of compute, and expensive hyper-parameter sweeps. We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space. Motivated by the finding that the discriminator cannot fully exploit features from deeper layers of the pretrained model, we propose a more effective strategy that mixes features across channels and resolutions. Our Projected GAN improves image quality, sample efficiency, and convergence speed. It is further compatible with resolutions of up to one Megapixel and advances the state-of-the-art Fréchet Inception Distance (FID) on twenty-two benchmark datasets. Importantly, Projected GANs match the previously lowest FIDs up to 40 times faster, cutting the wall-clock time from 5 days to less than 3 hours given the same computational resources.

* To appear in NeurIPS 2021. Project Page: https://sites.google.com/view/projected-gan/

Zoomorphic Gestures for Communicating Cobot States

Feb 22, 2021

Vanessa Sauer, Axel Sauer, Alexander Mertens

Abstract: Communicating the robot state is vital to creating an efficient and trustworthy collaboration between humans and collaborative robots (cobots). Standard approaches for robot-to-human communication face difficulties in industry settings, e.g., because of high noise levels or certain visibility requirements. Therefore, this paper presents zoomorphic gestures based on dog body language as a possible alternative for communicating the state of appearance-constrained cobots. For this purpose, we conduct a visual communication benchmark comparing zoomorphic gestures, abstract gestures, and light displays. We investigate the modalities regarding intuitive understanding, user experience, and user preference. In a first user study (n = 93), we evaluate our proposed design guidelines for all visual modalities. A second user study (n = 214) constituting the benchmark indicates that intuitive understanding and user experience are highest for both gesture-based modalities. Furthermore, zoomorphic gestures are considerably preferred over other modalities. These findings indicate that zoomorphic gestures with their playful nature are especially suitable for novel users and may decrease initial inhibitions.

* © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Counterfactual Generative Networks

Jan 15, 2021

Axel Sauer, Andreas Geiger

Abstract: Neural networks are prone to learning shortcuts -- they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances in deep generative modeling, we propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision. By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background; hence, they allow for generating counterfactual images. We demonstrate the ability of our model to generate such images on MNIST and ImageNet. Further, we show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task, despite being synthetic. Lastly, our generative model can be trained efficiently on a single GPU, exploiting common pre-trained models as inductive biases.

* Published at ICLR 2021

How to Make Deep RL Work in Practice

Nov 10, 2020

Nirnai Rao, Elie Aljalbout, Axel Sauer, Sami Haddadin

Abstract: In recent years, challenging control problems became solvable with deep reinforcement learning (RL). To be able to use RL for large-scale real-world applications, a certain degree of reliability in their performance is necessary. Reported results of state-of-the-art algorithms are often difficult to reproduce. One reason for this is that certain implementation details influence the performance significantly. Commonly, these details are not highlighted as important techniques to achieve state-of-the-art performance. Additionally, techniques from supervised learning are often used by default but influence the algorithms in a reinforcement learning setting in different and not well-understood ways. In this paper, we investigate the influence of certain initialization, input normalization, and adaptive learning techniques on the performance of state-of-the-art RL algorithms. We make suggestions as to which of those techniques to use by default and highlight areas that could benefit from a solution specifically tailored to RL.

* Published as a workshop paper at the Deep RL workshop, NeurIPS 2020
