Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nando Metzger

Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis

May 14, 2025

Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, Konrad Schindler

Abstract:The success of deep learning in computer vision over the past decade has hinged on large labeled datasets and strong pretrained models. In data-scarce settings, the quality of these pretrained models becomes crucial for effective transfer learning. Image classification and self-supervised learning have traditionally been the primary methods for pretraining CNNs and transformer-based architectures. Recently, the rise of text-to-image generative models, particularly those using denoising diffusion in a latent space, has introduced a new class of foundational models trained on massive, captioned image datasets. These models' ability to generate realistic images of unseen content suggests they possess a deep understanding of the visual world. In this work, we present Marigold, a family of conditional generative models and a fine-tuning protocol that extracts the knowledge from pretrained latent diffusion models like Stable Diffusion and adapts them for dense image analysis tasks, including monocular depth estimation, surface normals prediction, and intrinsic decomposition. Marigold requires minimal modification of the pre-trained latent diffusion model's architecture, trains with small synthetic datasets on a single GPU over a few days, and demonstrates state-of-the-art zero-shot generalization. Project page: https://marigoldcomputervision.github.io

* Journal extension of our CVPR 2024 paper, featuring new tasks, improved efficiency, high-resolution capabilities, and enhanced accessibility

Via

Access Paper or Ask Questions

Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Dec 18, 2024

Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov

Figure 1 for Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Figure 2 for Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Figure 3 for Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Figure 4 for Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion

Abstract:Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when applied to images outside the training domain or when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as an image-conditional depth map generation guided by sparse measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monocular depth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth; rather than as inpainting (sparse) depth, guided by an image. Project website: https://MarigoldDepthCompletion.github.io/

Via

Access Paper or Ask Questions

BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Jul 25, 2024

Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher Schroers

Figure 1 for BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Figure 2 for BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Figure 3 for BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Figure 4 for BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Abstract:By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficiently precise details. Although recent diffusion-based MDE approaches exhibit appealing detail extraction ability, they still struggle in geometrically challenging scenes due to the difficulty of gaining robust geometric priors from diverse datasets. To leverage the complementary merits of both worlds, we propose BetterDepth to efficiently achieve geometrically correct affine-invariant MDE performance while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth context is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure the faithfulness of BetterDepth to depth conditioning while learning to capture fine-grained scene details. By efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.

Via

Access Paper or Ask Questions

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Dec 04, 2023

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler

Abstract:Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.

Via

Access Paper or Ask Questions

Neural Fields with Thermal Activations for Arbitrary-Scale Super-Resolution

Nov 29, 2023

Alexander Becker, Rodrigo Caye Daudt, Nando Metzger, Jan Dirk Wegner, Konrad Schindler

Abstract:Recent approaches for arbitrary-scale single image super-resolution (ASSR) have used local neural fields to represent continuous signals that can be sampled at different rates. However, in such formulation, the point-wise query of field values does not naturally match the point spread function (PSF) of a given pixel. In this work we present a novel way to design neural fields such that points can be queried with a Gaussian PSF, which serves as anti-aliasing when moving across resolutions for ASSR. We achieve this using a novel activation function derived from Fourier theory and the heat equation. This comes at no additional cost: querying a point with a Gaussian PSF in our framework does not affect computational cost, unlike filtering in the image domain. Coupled with a hypernetwork, our method not only provides theoretically guaranteed anti-aliasing, but also sets a new bar for ASSR while also being more parameter-efficient than previous methods.

Via

Access Paper or Ask Questions

High-resolution Population Maps Derived from Sentinel-1 and Sentinel-2

Nov 23, 2023

Nando Metzger, Rodrigo Caye Daudt, Devis Tuia, Konrad Schindler

Abstract:Detailed population maps play an important role in diverse fields ranging from humanitarian action to urban planning. Generating such maps in a timely and scalable manner presents a challenge, especially in data-scarce regions. To address it we have developed POPCORN, a population mapping method whose only inputs are free, globally available satellite images from Sentinel-1 and Sentinel-2; and a small number of aggregate population counts over coarse census districts for calibration. Despite the minimal data requirements our approach surpasses the mapping accuracy of existing schemes, including several that rely on building footprints derived from high-resolution imagery. E.g., we were able to produce population maps for Rwanda with 100m GSD based on less than 400 regional census counts. In Kigali, those maps reach an $R^2$ score of 66% w.r.t. a ground truth reference map, with an average error of only $\pm$10 inhabitants/ha. Conveniently, POPCORN retrieves explicit maps of built-up areas and of local building occupancy rates, making the mapping process interpretable and offering additional insights, for instance about the distribution of built-up, but unpopulated areas, e.g., industrial warehouses. Moreover, we find that, once trained, the model can be applied repeatedly to track population changes; and that it can be transferred to geographically similar regions, e.g., from Uganda to Rwanda). With our work we aim to democratize access to up-to-date and high-resolution population maps, recognizing that some regions faced with particularly strong population dynamics may lack the resources for costly micro-census campaigns.

* 17 pages, 10 tables, 7 Figures

Via

Access Paper or Ask Questions

Guided Depth Super-Resolution by Deep Anisotropic Diffusion

Nov 22, 2022

Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler

Abstract:Performing super-resolution of a depth image using the guidance from an RGB image is a problem that concerns several fields, such as robotics, medical imaging, and remote sensing. While deep learning methods have achieved good results in this problem, recent work highlighted the value of combining modern methods with more formal frameworks. In this work, we propose a novel approach which combines guided anisotropic diffusion with a deep convolutional network and advances the state of the art for guided depth super-resolution. The edge transferring/enhancing properties of the diffusion are boosted by the contextual reasoning capabilities of modern networks, and a strict adjustment step guarantees perfect adherence to the source image. We achieve unprecedented results in three commonly used benchmarks for guided depth super-resolution. The performance gain compared to other methods is the largest at larger scales, such as x32 scaling. Code for the proposed method will be made available to promote reproducibility of our results.

Via

Access Paper or Ask Questions

Fine-grained Population Mapping from Coarse Census Counts and Open Geodata

Nov 08, 2022

Nando Metzger, John E. Vargas-Muñoz, Rodrigo C. Daudt, Benjamin Kellenberger, Thao Ton-That Whelan, Ferda Ofli, Muhammad Imran, Konrad Schindler, Devis Tuia

Abstract:Fine-grained population maps are needed in several domains, like urban planning, environmental monitoring, public health, and humanitarian operations. Unfortunately, in many countries only aggregate census counts over large spatial units are collected, moreover, these are not always up-to-date. We present POMELO, a deep learning model that employs coarse census counts and open geodata to estimate fine-grained population maps with 100m ground sampling distance. Moreover, the model can also estimate population numbers when no census counts at all are available, by generalizing across countries. In a series of experiments for several countries in sub-Saharan Africa, the maps produced with POMELOare in good agreement with the most detailed available reference counts: disaggregation of coarse census counts reaches R2 values of 85-89%; unconstrained prediction in the absence of any counts reaches 48-69%.

Via

Access Paper or Ask Questions

Forecasting Urban Development from Satellite Images

Apr 27, 2022

Nando Metzger

Figure 1 for Forecasting Urban Development from Satellite Images

Figure 2 for Forecasting Urban Development from Satellite Images

Figure 3 for Forecasting Urban Development from Satellite Images

Figure 4 for Forecasting Urban Development from Satellite Images

Abstract:Forecasting where and when new buildings will emerge is a rather unexplored niche topic, but relevant in disciplines such as urban planning, agriculture, resource management, and even autonomous flight. In this work, we present a method that accomplishes this task using satellite images and a custom neural network training procedure. In stage A, a DeepLapv3+ backbone is pretrained through a Siamese network architecture aimed at solving a building change detection task. In stage B, we transfer the backbone into a change forecasting model that relies solely on the initial input image. We also transfer the backbone into a forecasting model predicting the correct time range of the future change. For our experiments, we use the SpaceNet7 dataset with 960 km2 spatial extension and 24 monthly frames. We found that our training strategy consistently outperforms the traditional pretraining on the ImageNet dataset. Especially with longer forecasting ranges of 24 months, we observe F1 scores of 24% instead of 16%. Furthermore, we found that our method performed well in forecasting the times of future building constructions. Hereby, the strengths of our custom pretraining become especially apparent when we increase the difficulty of the task by predicting finer time windows.

* 7 Pages short-paper, Master Thesis, 2021

Via

Access Paper or Ask Questions

DSM Refinement with Deep Encoder-Decoder Networks

Dec 14, 2020

Nando Metzger

Figure 1 for DSM Refinement with Deep Encoder-Decoder Networks

Figure 2 for DSM Refinement with Deep Encoder-Decoder Networks

Figure 3 for DSM Refinement with Deep Encoder-Decoder Networks

Figure 4 for DSM Refinement with Deep Encoder-Decoder Networks

Abstract:3D city models can be generated from aerial images. However, the calculated DSMs suffer from noise, artefacts, and data holes that have to be manually cleaned up in a time-consuming process. This work presents an approach that automatically refines such DSMs. The key idea is to teach a neural network the characteristics of urban area from reference data. In order to achieve this goal, a loss function consisting of an L1 norm and a feature loss is proposed. These features are constructed using a pre-trained image classification network. To learn to update the height maps, the network architecture is set up based on the concept of deep residual learning and an encoder-decoder structure. The results show that this combination is highly effective in preserving the relevant geometric structures while removing the undesired artefacts and noise.

Via

Access Paper or Ask Questions