Maximilian Durner

Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Aug 06, 2025

Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, Maximilian Durner

Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

* ICCV 2025

How Important are Data Augmentations to Close the Domain Gap for Object Detection in Orbit?

Oct 21, 2024

Maximilian Ulmer, Leonard Klüpfel, Maximilian Durner, Rudolph Triebel

Abstract: We investigate the efficacy of data augmentations to close the domain gap in spaceborne computer vision, crucial for autonomous operations like on-orbit servicing. As the use of computer vision in space increases, challenges such as hostile illumination and low signal-to-noise ratios significantly hinder performance. While learning-based algorithms show promising results, their adoption is limited by the need for extensive annotated training data and the domain gap that arises from differences between synthesized and real-world imagery. This study explores domain generalization in terms of data augmentations -- classical color and geometric transformations, corruptions, and noise -- to enhance model performance across the domain gap. To this end, we conduct a large-scale experiment using a hyperparameter optimization pipeline that samples hundreds of different configurations and searches for the best set to bridge the domain gap. As a reference task, we use 2D object detection and evaluate on the SPEED+ dataset that contains real hardware-in-the-loop satellite images in its test set. Moreover, we evaluate four popular object detectors, including Mask R-CNN, Faster R-CNN, YOLO-v7, and the open set detector GroundingDINO, and highlight their trade-offs between performance, inference speed, and training time. Our results underscore the vital role of data augmentations in bridging the domain gap, improving model performance, robustness, and reliability for critical space applications. As a result, we propose two novel data augmentations specifically developed to emulate the visual effects observed in orbital imagery. We conclude by recommending the most effective augmentations for advancing computer vision in challenging orbital environments. Code for training detectors and hyperparameter search will be made publicly available.
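
The configuration search described in the abstract above can be illustrated with a small sketch. This is not the authors' pipeline: it substitutes plain random search for whatever optimizer the paper uses, and the augmentation names, the `sample_config` and `random_search` helpers, and the `evaluate` callback are all hypothetical. In practice `evaluate` would train a detector with the sampled augmentations and return a validation score on real images.

```python
import random

# Hypothetical augmentation names; the paper's actual set is not listed here.
AUGMENTATIONS = ("color_jitter", "gaussian_noise", "blur", "affine")


def sample_config(rng):
    """Sample one configuration: each augmentation is either off (0.0)
    or enabled with a random strength in [0, 1]."""
    return {
        name: (rng.uniform(0.0, 1.0) if rng.random() < 0.5 else 0.0)
        for name in AUGMENTATIONS
    }


def random_search(evaluate, trials, seed=0):
    """Evaluate `trials` random configurations and keep the best-scoring one.

    `evaluate` is a user-supplied function mapping a configuration to a
    score where higher is better.
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

A fixed seed keeps the search reproducible, which matters when comparing detectors on the same sampled configurations.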

Unknown Object Grasping for Assistive Robotics

Apr 23, 2024

Elle Miller, Maximilian Durner, Matthias Humt, Gabriel Quere, Wout Boerdijk, Ashok M. Sundaram, Freek Stulp, Jorn Vogel

Abstract: We propose a novel pipeline for unknown object grasping in shared robotic autonomy scenarios. State-of-the-art methods for fully autonomous scenarios are typically learning-based approaches, optimised for a specific end-effector, that generate grasp poses directly from sensor input. In the domain of assistive robotics, we seek instead to utilise the user's cognitive abilities for enhanced satisfaction, grasping performance, and alignment with their high-level task-specific goals. Given a pair of stereo images, we perform unknown object instance segmentation and generate a 3D reconstruction of the object of interest. In shared control, the user then guides the robot end-effector across a virtual hemisphere centered around the object to their desired approach direction. A physics-based grasp planner finds the most stable local grasp on the reconstruction, and finally the user is guided by shared control to this grasp. In experiments on the DLR EDAN platform, we report a grasp success rate of 87% for 10 unknown objects, and demonstrate the method's capability to grasp objects in structured clutter and from shelves.

* 7 pages, 9 figures
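
The virtual hemisphere mentioned in the abstract above can be pictured with a little geometry. The sketch below is an assumption-laden illustration, not the paper's controller: the function name `hemisphere_pose`, the azimuth/elevation parameterisation, and the choice of an upper hemisphere are all hypothetical. It maps two user-controlled angles to an end-effector position on the hemisphere and an approach direction pointing at the object centre.

```python
import math


def hemisphere_pose(center, radius, azimuth, elevation):
    """Return (position, approach_direction) for a point on the upper
    hemisphere of the given radius around `center`.

    Angles are in radians. Elevation is clamped to [0, pi/2] so the point
    never drops below the object's horizontal plane.
    """
    if radius <= 0.0:
        raise ValueError("radius must be positive")
    elevation = min(max(elevation, 0.0), math.pi / 2)
    cx, cy, cz = center
    position = (
        cx + radius * math.cos(elevation) * math.cos(azimuth),
        cy + radius * math.cos(elevation) * math.sin(azimuth),
        cz + radius * math.sin(elevation),
    )
    # Unit vector from the end-effector position towards the object centre.
    approach = tuple((c - p) / radius for c, p in zip(center, position))
    return position, approach
```

Clamping the elevation is one simple way to keep user input on the hemisphere; a real system would also need to handle reachability and collisions.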

Density-based Feasibility Learning with Normalizing Flows for Introspective Robotic Assembly

Jul 06, 2023

Jianxiang Feng, Matan Atad, Ismael Rodríguez, Maximilian Durner, Stephan Günnemann, Rudolph Triebel

Abstract: Machine Learning (ML) models in Robotic Assembly Sequence Planning (RASP) need to be introspective on the predicted solutions, i.e. whether they are feasible or not, to circumvent potential efficiency degradation. Previous works need both feasible and infeasible examples during training. However, the infeasible ones are hard to collect in sufficient numbers when re-training is required for swift adaptation to new product variants. In this work, we propose a density-based feasibility learning method that requires only feasible examples. Concretely, we formulate the feasibility learning problem as Out-of-Distribution (OOD) detection with Normalizing Flows (NF), which are powerful generative models for estimating complex probability distributions. Empirically, the proposed method is demonstrated on robotic assembly use cases and outperforms other single-class baselines in detecting infeasible assemblies. We further investigate the internal working mechanism of our method and show that a large memory saving can be obtained based on an advanced variant of NF.

* Accepted to the RSS 2023 Robotic Assembly Workshop
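
The idea in the abstract above, learning a density from feasible examples only and flagging low-density inputs as infeasible, can be sketched in a few lines. To stay self-contained, this sketch swaps the Normalizing Flow for a diagonal Gaussian, which is far less expressive; the function names, the feature vectors, and the 5% quantile threshold are all assumptions made for illustration.

```python
import math
from statistics import fmean, pvariance


def fit_diag_gaussian(samples):
    """Fit per-dimension mean and variance on feasible examples only."""
    dims = list(zip(*samples))
    means = [fmean(d) for d in dims]
    # Floor the variance so constant features do not cause division by zero.
    variances = [max(pvariance(d), 1e-6) for d in dims]
    return means, variances


def log_density(x, model):
    """Log-density of x under the fitted diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def fit_threshold(samples, model, quantile=0.05):
    """Pick a log-density threshold at a low quantile of the training scores."""
    scores = sorted(log_density(s, model) for s in samples)
    return scores[int(quantile * (len(scores) - 1))]


def is_feasible(x, model, threshold):
    """Treat inputs whose density falls below the threshold as OOD."""
    return log_density(x, model) >= threshold
```

The structure (fit on one class, score by log-density, threshold) carries over when the Gaussian is replaced by a learned flow; only `fit_diag_gaussian` and `log_density` would change.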

6D Object Pose Estimation from Approximate 3D Models for Orbital Robotics

Mar 31, 2023

Maximilian Ulmer, Maximilian Durner, Martin Sundermeyer, Manuel Stoiber, Rudolph Triebel

Abstract: We present a novel technique to estimate the 6D pose of objects from single images where the 3D geometry of the object is only given approximately and not as a precise 3D model. To achieve this, we employ a dense 2D-to-3D correspondence predictor that regresses 3D model coordinates for every pixel. In addition to the 3D coordinates, our model also estimates the pixel-wise coordinate error to discard correspondences that are likely wrong. This allows us to generate multiple 6D pose hypotheses of the object, which we then refine iteratively using a highly efficient region-based approach. We also introduce a novel pixel-wise posterior formulation by which we can estimate the probability for each hypothesis and select the most likely one. As we show in experiments, our approach is capable of dealing with extreme visual conditions including overexposure, high contrast, or low signal-to-noise ratio. This makes it a powerful technique for the particularly challenging task of estimating the pose of tumbling satellites for in-orbit robotic applications. Our method achieves state-of-the-art performance on the SPEED+ dataset and has won the SPEC2021 post-mortem competition.

* preprint
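
Two steps from the abstract above lend themselves to a short illustration: discarding correspondences with a high predicted error, and turning per-hypothesis scores into probabilities. The sketch below is generic and does not reproduce the paper's pixel-wise posterior formulation; the data layout, the function names, the error threshold, and the uniform prior over hypotheses are assumptions.

```python
import math


def filter_correspondences(correspondences, max_error):
    """Keep 2D-3D pairs whose predicted coordinate error is small enough.

    `correspondences` is a list of (pixel, point3d, predicted_error) tuples.
    """
    return [(px, pt) for px, pt, err in correspondences if err <= max_error]


def hypothesis_posteriors(log_likelihoods):
    """Normalise per-hypothesis log-likelihoods into probabilities,
    assuming a uniform prior over hypotheses."""
    peak = max(log_likelihoods)  # subtract the peak for numerical stability
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]
```

Selecting the most likely hypothesis is then a matter of taking the index of the largest posterior.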

Efficient and Feasible Robotic Assembly Sequence Planning via Graph Representation Learning

Mar 21, 2023

Matan Atad, Jianxiang Feng, Ismael Rodríguez, Maximilian Durner, Rudolph Triebel

Abstract: Automatic Robotic Assembly Sequence Planning (RASP) can significantly improve productivity and resilience in modern manufacturing along with the growing need for greater product customization. One of the main challenges in realizing such automation resides in efficiently finding solutions from a growing number of potential sequences for increasingly complex assemblies. Besides, costly feasibility checks are always required for the robotic system. To address this, we propose a holistic graphical approach including a graph representation called Assembly Graph for product assemblies and a policy architecture, Graph Assembly Processing Network, dubbed GRACE, for assembly sequence generation. We then use GRACE to extract meaningful information from the graph input and predict assembly sequences in a step-by-step manner. In experiments, we show that our approach can predict feasible assembly sequences across product variants of aluminum profiles based on data collected in simulation of a dual-armed robotic system. We further demonstrate that our method is capable of detecting infeasible assemblies, substantially alleviating the undesirable impacts from false predictions, and hence facilitating near-term real-world deployment. Code and training data will be open-sourced.

* Under review

Bridging the Last Mile in Sim-to-Real Robot Perception via Bayesian Active Learning

Sep 29, 2021

Jianxiang Feng, Jongseok Lee, Maximilian Durner, Rudolph Triebel

Abstract: Learning from synthetic data is popular in a variety of robotic vision tasks such as object detection, because a large amount of data can be generated without annotations by humans. However, when relying only on synthetic data, we encounter the well-known problem of the simulation-to-reality (Sim-to-Real) gap, which is hard to resolve completely in practice. For such cases, real human-annotated data is necessary to bridge this gap, and in our work we focus on how to acquire this data efficiently. Therefore, we propose a Sim-to-Real pipeline that relies on deep Bayesian active learning and aims to minimize the manual annotation efforts. We devise a learning paradigm that autonomously selects the data that is considered useful for the human expert to annotate. To achieve this, a Bayesian Neural Network (BNN) object detector providing reliable uncertainty estimates is adapted to infer the informativeness of the unlabeled data, in order to perform active learning. In our experiments on two object detection data sets, we show that the labeling effort required to bridge the reality gap can be reduced to a small amount. Furthermore, we demonstrate the practical effectiveness of this idea in a grasping task on an assistive robot.

* under review
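
The selection step in the abstract above, picking the unlabeled samples a Bayesian model is least sure about, can be sketched as follows. This is a generic illustration rather than the paper's acquisition function: it uses predictive entropy over class probabilities from several stochastic forward passes, and the function names and the dictionary-based pool layout are assumptions.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy of the class distribution averaged over stochastic passes.

    `mc_probs` is a list of probability vectors, one per forward pass.
    """
    passes = len(mc_probs)
    classes = len(mc_probs[0])
    mean = [sum(p[k] for p in mc_probs) / passes for k in range(classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


def select_for_annotation(pool, budget):
    """Return the ids of the `budget` most uncertain samples.

    `pool` maps a sample id to its list of per-pass probability vectors.
    """
    ranked = sorted(
        pool,
        key=lambda sample_id: predictive_entropy(pool[sample_id]),
        reverse=True,
    )
    return ranked[:budget]
```

The selected ids would go to the human annotator, after which the detector is retrained and the loop repeats until the labeling budget is spent.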

Introspective Robot Perception using Smoothed Predictions from Bayesian Neural Networks

Sep 27, 2021

Jianxiang Feng, Maximilian Durner, Zoltan-Csaba Marton, Ferenc Balint-Benczedi, Rudolph Triebel

Abstract: This work focuses on improving uncertainty estimation in the field of object classification from RGB images and demonstrates its benefits in two robotic applications. We employ a Bayesian Neural Network (BNN), and evaluate two practical inference techniques to obtain better uncertainty estimates, namely Concrete Dropout (CDP) and Kronecker-factored Laplace Approximation (LAP). We show a performance increase using more reliable uncertainty estimates as unary potentials within a Conditional Random Field (CRF), which is able to incorporate contextual information as well. Furthermore, the obtained uncertainties are exploited to achieve domain adaptation in a semi-supervised manner, which requires less manual effort in annotating data. We evaluate our approach on two public benchmark datasets that are relevant for robot perception tasks.

* International Symposium on Robotics Research (ISRR), Hanoi, Vietnam, 2019

Unknown Object Segmentation from Stereo Images

Mar 11, 2021

Maximilian Durner, Wout Boerdijk, Martin Sundermeyer, Werner Friedl, Zoltan-Csaba Marton, Rudolph Triebel

Abstract: Although instance-aware perception is a key prerequisite for many autonomous robotic applications, most methods only partially solve the problem by focusing solely on known object categories. However, for robots interacting in dynamic and cluttered environments, this is not realistic and severely limits the range of potential applications. Therefore, we propose a novel object instance segmentation approach that does not require any semantic or geometric information of the objects beforehand. In contrast to existing works, we do not explicitly use depth data as input, but rely on the insight that slight viewpoint changes, which for example are provided by stereo image pairs, are often sufficient to determine object boundaries and thus to segment objects. Focusing on the versatility of stereo sensors, we employ a transformer-based architecture that maps directly from the pair of input images to the object instances. This has the major advantage that instead of a noisy and potentially incomplete depth map as an input, on which the segmentation is computed, we use the original image pair to infer the object instances and a dense depth map. In experiments in several different application domains, we show that our Instance Stereo Transformer (INSTR) algorithm outperforms current state-of-the-art methods that are based on depth maps. Training code and pretrained models will be made available.

* 8 pages, 5 figures, 6 tables, code will be made available

"What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences

Nov 06, 2020

Wout Boerdijk, Martin Sundermeyer, Maximilian Durner, Rudolph Triebel

Abstract: We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator. Our method successively learns an agnostic foreground segmentation followed by a distinction between manipulator and object solely by observing the motion between consecutive RGB frames. In contrast to previous approaches, we propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge. Furthermore, while the motion of the manipulator and the object are substantial cues for our algorithm, we present means to robustly deal with distraction objects moving in the background, as well as with completely static scenes. Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data. By extensive experimental evaluation we demonstrate the superiority of our framework and provide detailed insights on its capability of dealing with the aforementioned extreme cases of motion. We also show that training a semantic segmentation network with the automatically labeled data achieves results on par with manually annotated training data. Code and pretrained models will be made publicly available.

* 8 pages, 6 figures
