Katja Schwarz

FlowR: Flowing from Sparse to Dense 3D Reconstructions

Apr 02, 2025

Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Varma Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, Peter Kontschieder

Abstract: 3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of some applications, e.g. Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These methods are often conditioned only on a handful of reference input views and thus do not fully exploit the available 3D information, leading to inconsistent generation results and reconstruction artifacts. To tackle this problem, we propose a multi-view, flow matching model that learns a flow to connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with novel, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.

* Project page is available at https://tobiasfshr.github.io/pub/flowr
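
The abstract's figure of 91K tokens for 45 views at 540x960 can be sanity-checked with a little arithmetic. The sketch below assumes that each token covers a 16x16 pixel patch; that patch size is an inference from the numbers, not something the abstract states, and `token_count` is a hypothetical helper name.

```python
# Assumption (not stated in the abstract): one token per 16x16 pixel patch.
PIXELS_PER_TOKEN_SIDE = 16


def token_count(views: int, height: int, width: int,
                patch: int = PIXELS_PER_TOKEN_SIDE) -> float:
    """Number of tokens if every token covers a patch x patch pixel area."""
    return views * (height / patch) * (width / patch)


tokens = token_count(views=45, height=540, width=960)
print(f"{tokens:,.0f} tokens")  # prints "91,125 tokens"
```

Under that assumption the count comes out at 91,125, which matches the quoted 91K. Note that 540 is not divisible by 16, so the real model presumably pads, crops, or tokenizes differently; the calculation only shows that the stated numbers are mutually consistent.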

A Recipe for Generating 3D Worlds From a Single Image

Mar 20, 2025

Katja Schwarz, Denys Rozumnyi, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder

Abstract: We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: https://katjaschwarz.github.io/worlds/

Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Mar 17, 2025

Katja Schwarz, Norman Mueller, Peter Kontschieder

Abstract: Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) -- a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet++, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet++. Project page: https://katjaschwarz.github.io/ggs/

Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes

Sep 04, 2024

Stefano Esposito, Anpei Chen, Christian Reiser, Samuel Rota Bulò, Lorenzo Porzi, Katja Schwarz, Christian Richardt, Michael Zollhöfer, Peter Kontschieder, Andreas Geiger

Abstract: High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.
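
To make the idea of rendering a small, fixed number of semi-transparent layers from outermost to innermost more concrete, here is a minimal sketch of standard front-to-back alpha compositing for a single pixel. It is not the paper's renderer; the function name `composite_layers` and the toy colors and opacities are hypothetical. Because the layer order is fixed, no per-pixel sorting step is needed.

```python
import numpy as np


def composite_layers(colors, alphas):
    """Front-to-back alpha compositing of K layers for one pixel.

    colors: (K, 3) RGB per layer, ordered outermost to innermost.
    alphas: (K,) opacity per layer in [0, 1], same order.
    Returns the blended RGB and the light that passes through all layers.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching layer i: product of (1 - alpha_j) for all j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    pixel = (weights[:, None] * colors).sum(axis=0)
    remaining = 1.0 - weights.sum()
    return pixel, remaining


# Toy example: three layers (red, green, blue); the innermost one is opaque.
pixel, remaining = composite_layers(
    colors=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    alphas=[0.5, 0.5, 1.0],
)
print(pixel, remaining)
```

With these toy values the layer weights are 0.5, 0.25 and 0.25, so the blended color is (0.5, 0.25, 0.25) and no light remains for the background. The cost per pixel is bounded by the number of layers K, which corresponds to property (P1) in the abstract.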

MultiDiff: Consistent Novel View Synthesis from a Single Image

Jun 26, 2024

Norman Müller, Katja Schwarz, Barbara Roessle, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, Peter Kontschieder

Abstract: We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple, plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in the form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation that are prone to drifts and error accumulation, MultiDiff jointly synthesizes a sequence of frames yielding high-quality and multi-view consistent results -- even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel, structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.

* Project page: https://sirwyver.github.io/MultiDiff Video: https://youtu.be/zBC4z4qXW_4 - CVPR 2024

WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space

Nov 22, 2023

Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis

Abstract: Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data. See https://katjaschwarz.github.io/wildfusion for videos of our 3D results.

NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models

Apr 19, 2023

Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler

Abstract: Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.

* CVPR 2023

ARAH: Animatable Volume Rendering of Articulated Human SDFs

Oct 18, 2022

Shaofei Wang, Katja Schwarz, Andreas Geiger, Siyu Tang

Abstract: Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.

* Accepted to ECCV 2022. Project page: https://neuralbodies.github.io/arah/

VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids

Jun 17, 2022

Katja Schwarz, Axel Sauer, Michael Niemeyer, Yiyi Liao, Andreas Geiger

Abstract: State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D consistent results with high visual fidelity.

StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

Feb 01, 2022

Axel Sauer, Katja Schwarz, Andreas Geiger

Abstract: Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of $1024^2$ at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes.

* Project Page: https://sites.google.com/view/stylegan-xl/
