Abstract: Oysters are a keystone species in coastal ecosystems, offering significant economic, environmental, and cultural benefits. However, current monitoring practices are often destructive, typically involving dredging to physically collect and count oysters. A non-destructive alternative is manual identification from video footage collected by divers, which is time-consuming, labor-intensive, and requires expert input. An alternative to human monitoring is the deployment of trained object detection models that perform real-time, on-edge oyster detection in the field. One such platform is the Aqua2 robot. Effective training of these models requires extensive high-quality data, which is difficult to obtain in marine settings. To address this challenge, we introduce a novel method that leverages Stable Diffusion to generate high-quality synthetic data for the marine domain. We exploit diffusion models to create photorealistic marine imagery, using ControlNet inputs to ensure consistency with the segmentation ground-truth masks, the geometry of the scene, and the target domain of real underwater oyster images. The resulting dataset is used to train a YOLOv10-based vision model, achieving a state-of-the-art 0.657 mAP@50 for oyster detection on the Aqua2 platform. The proposed system not only improves oyster habitat monitoring, but also paves the way for autonomous surveillance in various marine tasks, benefiting aquaculture and conservation efforts.
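A minimal sketch of the data-generation and training pipeline this abstract describes, assuming the `diffusers` and `ultralytics` packages; the ControlNet checkpoint, model IDs, prompt, file paths, and dataset config below are illustrative placeholders, not the authors' exact setup.

```python
# Sketch: generate ControlNet-conditioned synthetic underwater frames, then train a YOLO detector.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Segmentation-conditioned ControlNet on top of a Stable Diffusion backbone (IDs are examples).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

# Ground-truth segmentation mask used as the conditioning input; boxes for training derive from it.
mask = Image.open("masks/oyster_0001.png")
image = pipe(
    prompt="photorealistic underwater seabed densely covered with oysters, turbid water",
    image=mask, num_inference_steps=30).images[0]
image.save("synthetic/oyster_0001.png")

# Train a YOLOv10 detector on the synthetic set (YOLO-format dataset described in oysters.yaml).
from ultralytics import YOLO
detector = YOLO("yolov10n.pt")
detector.train(data="oysters.yaml", epochs=100, imgsz=640)
```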
Abstract: Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior object-specific knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by grounding language input in image input. In our work, we take advantage of the zero-shot capabilities of VLMs and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object from the relevancy map of a language-embedded NeRF (LERF) reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance and category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
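A minimal sketch of the second stage described above: per-point relevancy scores from the language-embedded NeRF are thresholded to obtain a coarse object location, and the pose is refined by registering the object model against that region. Open3D and classical point-to-point ICP are assumed as the registration method; the arrays, file names, and threshold value are illustrative.

```python
import numpy as np
import open3d as o3d

scene_xyz = np.load("scene_points.npy")        # (N, 3) points exported from the NeRF reconstruction
relevancy = np.load("relevancy_scores.npy")    # (N,) relevancy of each point for the language prompt
object_xyz = np.load("object_model_points.npy")

coarse = scene_xyz[relevancy > 0.55]           # activation threshold, one of the tuned hyperparameters

target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(coarse))
source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(object_xyz))

# Initialize with the centroid offset of the coarse detection, then refine with point-to-point ICP.
init = np.eye(4)
init[:3, 3] = coarse.mean(axis=0) - object_xyz.mean(axis=0)
result = o3d.pipelines.registration.registration_icp(
    source, target, max_correspondence_distance=0.02, init=init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
print("estimated 6D pose:\n", result.transformation)
```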
Abstract: Estimating 2D-3D correspondences between RGB images and 3D space is a fundamental problem in 6D object pose estimation. Recent pose estimators use dense correspondence maps and Perspective-n-Point (PnP) algorithms to estimate object poses. The accuracy of pose estimation depends heavily on the quality of the dense correspondence maps and their ability to withstand occlusion, clutter, and challenging material properties. Currently, dense correspondence maps are estimated using image-to-image translation models based on GANs, autoencoders, or direct regression models. However, recent advancements in image-to-image translation have made diffusion models the superior choice when evaluated on benchmark datasets. In this study, we compare image-to-image translation networks based on GANs and diffusion models for the downstream task of 6D object pose estimation. Our results demonstrate that the diffusion-based image-to-image translation model outperforms the GAN-based one, revealing potential for further improvements in 6D object pose estimation models.
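A minimal sketch of the downstream pose step both translation networks feed into: a predicted dense correspondence map (per-pixel 3D object coordinates) and a visibility mask are converted to a 6D pose with RANSAC-based PnP. OpenCV is assumed; the array names, file paths, and RANSAC settings are illustrative.

```python
import numpy as np
import cv2

coord_map = np.load("pred_xyz_map.npy")   # (H, W, 3) predicted 3D object coordinates per pixel
mask = np.load("pred_mask.npy") > 0.5     # (H, W) predicted object visibility
K = np.load("camera_intrinsics.npy")      # (3, 3) camera matrix

# Collect the 2D-3D correspondences at visible object pixels.
vs, us = np.nonzero(mask)
image_points = np.stack([us, vs], axis=1).astype(np.float64)
object_points = coord_map[vs, us].astype(np.float64)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, distCoeffs=None,
    reprojectionError=3.0, iterationsCount=150, flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)
print("rotation:\n", R, "\ntranslation:", tvec.ravel())
```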
Abstract: Recent advances in machine learning have greatly benefited object detection and 6D pose estimation for robotic grasping. However, textureless and metallic objects still pose a significant challenge due to fewer visual cues and the texture bias of CNNs. To address this issue, we propose a texture-agnostic approach that learns from CAD models and emphasizes object shape features. To enforce this focus on shape, textures are randomized during the rendering of the training data. By treating texture as noise, the need for real-world object instances or their final appearance during training data generation is eliminated. Texture agnosticism also increases robustness against image perturbations such as imaging noise, motion blur, and brightness changes, which are common in robotics applications. For evaluation, we use the TLESS and ITODD datasets, which were created specifically for industrial robotics settings and feature textureless and metallic objects. Code and datasets are publicly available at github.com/hoenigpeter/randomized_texturing.
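A minimal sketch of texture-as-noise training-data rendering, assuming `trimesh` and `pyrender` as the rendering stack: each CAD model is re-rendered with a freshly randomized surface appearance so the learner must rely on shape cues. Per-vertex color noise stands in here for the paper's randomized image textures, and the file path, intrinsics, and camera distance are illustrative and depend on the model's units.

```python
import numpy as np
import trimesh
import pyrender

mesh = trimesh.load("cad_model.ply", force="mesh")
renderer = pyrender.OffscreenRenderer(640, 480)
cam_pose = trimesh.transformations.translation_matrix([0.0, 0.0, 0.6])  # camera 0.6 m in front of the object

for i in range(10):
    # Re-randomize the surface appearance for every rendered training sample.
    noise = np.random.randint(0, 256, size=(len(mesh.vertices), 4), dtype=np.uint8)
    mesh.visual = trimesh.visual.ColorVisuals(mesh=mesh, vertex_colors=noise)

    scene = pyrender.Scene(bg_color=np.random.rand(4))
    scene.add(pyrender.Mesh.from_trimesh(mesh))
    scene.add(pyrender.IntrinsicsCamera(fx=572.4, fy=573.6, cx=320.0, cy=240.0), pose=cam_pose)
    scene.add(pyrender.PointLight(intensity=np.random.uniform(5.0, 30.0)), pose=cam_pose)

    color, depth = renderer.render(scene)   # `color` plus the known object pose form one training sample
```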
Abstract: Real-time applications for autonomous operations depend largely on fast and robust vision-based localization systems. Since image processing requires handling large amounts of data, its computational cost often limits the resources available to other processes. Traditional marker-based localization systems are therefore widely used, as they are easy to integrate and achieve reliable accuracy. However, classical marker-based systems depend on standard cameras with low frame rates, whose accuracy suffers from motion blur during fast motion. In contrast, event-based cameras provide high temporal resolution and a high dynamic range, which can be exploited for fast localization tasks even under challenging visual conditions. This paper proposes a simple but effective event-based pose estimation system using active LED markers (ALM) for fast and accurate pose estimation. The proposed algorithm operates in real time with a latency below \SI{0.5}{\milli\second} while maintaining output rates of \SI{3}{\kilo\hertz}. Experimental results in static and dynamic scenarios demonstrate the performance of the proposed approach in terms of computational speed and absolute accuracy, using the OptiTrack system as the measurement reference.
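A minimal sketch of the final geometric step: once the active LED markers have been detected and identified in the event stream (e.g., by their blinking frequencies), their image coordinates are matched to the known 3D marker layout and the pose follows from PnP. The event-side detection is omitted; the coordinates, marker layout, and intrinsics below are illustrative, and OpenCV is assumed.

```python
import numpy as np
import cv2

# Known 3D positions of four LEDs in the marker frame (meters), indexed by LED id.
led_layout = np.array([[-0.05, -0.05, 0.0],
                       [ 0.05, -0.05, 0.0],
                       [ 0.05,  0.05, 0.0],
                       [-0.05,  0.05, 0.0]], dtype=np.float64)

# LED centers extracted from the event camera for the current time slice, in the same id order.
led_pixels = np.array([[312.4, 201.8],
                       [402.1, 205.3],
                       [398.7, 289.9],
                       [308.0, 286.2]], dtype=np.float64)

K = np.array([[900.0, 0.0, 320.0],
              [0.0, 900.0, 240.0],
              [0.0, 0.0, 1.0]])

ok, rvec, tvec = cv2.solvePnP(led_layout, led_pixels, K, distCoeffs=None,
                              flags=cv2.SOLVEPNP_IPPE)  # planar marker, four coplanar points
R, _ = cv2.Rodrigues(rvec)
print("marker pose: R=\n", R, "\nt=", tvec.ravel())
```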
Abstract: As robotic systems increasingly encounter complex and unconstrained real-world scenarios, there is a demand to recognize diverse objects. State-of-the-art 6D object pose estimation methods rely on object-specific training and therefore do not generalize to unseen objects. Recent novel object pose estimation methods address this issue with task-specific fine-tuned CNNs for deep template matching. This adaptation for pose estimation still requires expensive data rendering and training procedures; MegaPose, for example, is trained on a dataset of two million images showing 20,000 different objects to reach such generalization capabilities. To overcome this shortcoming, we introduce ZS6D for zero-shot novel object 6D pose estimation. Visual descriptors, extracted using pre-trained Vision Transformers (ViTs), are used to match rendered templates against query images of objects and to establish local correspondences. These local correspondences enable the derivation of geometric correspondences, which are used to estimate the object's 6D pose with RANSAC-based PnP. This approach showcases that image descriptors extracted by pre-trained ViTs are well suited to achieve a notable improvement over two state-of-the-art novel object 6D pose estimation methods, without the need for task-specific fine-tuning. Experiments are performed on LMO, YCBV, and TLESS: compared with one of the two methods, we improve the Average Recall on all three datasets, and compared with the second method, we improve on two datasets.
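A minimal sketch of the template-matching core, using a pre-trained DINO ViT loaded via torch.hub as one example of such a backbone: descriptors are extracted for a query crop and a set of rendered templates, and the best template is selected by cosine similarity. The template paths and query crop are illustrative, and the subsequent patch-level correspondences and RANSAC-based PnP are omitted.

```python
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").eval()
prep = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                  T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])

@torch.no_grad()
def describe(path):
    """Return the L2-normalized ViT descriptor of an image crop (CLS token, 384-D for ViT-S/8)."""
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(model(x), dim=-1)

query = describe("query_crop.png")
templates = [f"templates/render_{i:03d}.png" for i in range(60)]
scores = torch.cat([query @ describe(t).T for t in templates]).squeeze()
best = int(scores.argmax())
print("best template:", templates[best], "cosine similarity:", float(scores[best]))
```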
Abstract: Tracking an object's 6D pose while either the object itself or the observing camera is moving is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptually complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement, using a reinforcement learning (RL) agent that jointly solves both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers.
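A minimal sketch of the tracking-loop structure this abstract describes: each new depth frame's object point cloud is aligned against the previous frame (frame-to-frame), and a low-confidence alignment triggers frame-to-model recovery against the object model. Classical point-to-point ICP stands in here for the learned RL alignment agent; Open3D is assumed, and the file names and thresholds are illustrative.

```python
import numpy as np
import open3d as o3d

def align(source, target, init):
    """One depth-only alignment step (ICP as a stand-in for the RL agent's action sequence)."""
    result = o3d.pipelines.registration.registration_icp(
        source, target, 0.02, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation, result.fitness

model = o3d.io.read_point_cloud("object_model.ply")
pose = np.eye(4)                                   # object pose in the first frame
prev = o3d.io.read_point_cloud("frame_0000.ply")

for i in range(1, 100):
    curr = o3d.io.read_point_cloud(f"frame_{i:04d}.ply")
    delta, fitness = align(prev, curr, np.eye(4))  # frame-to-frame registration
    pose = delta @ pose
    if fitness < 0.3:                              # low confidence triggers reinitialization ...
        pose, _ = align(model, curr, pose)         # ... via frame-to-model refinement
    prev = curr
```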
Abstract: Object pose estimation is a core perception task that enables, for example, object grasping and scene understanding. Widely available, inexpensive, high-resolution RGB sensors, together with CNNs that allow for fast inference on this modality, make monocular approaches especially well suited for robotics applications. We observe that previous surveys on object pose estimation establish the state of the art for varying modalities, single- and multi-view settings, and datasets and metrics that consider a multitude of applications. We argue, however, that those works' broad scope hinders the identification of open challenges that are specific to monocular approaches and the derivation of promising future directions for their application in robotics. By providing a unified view of recent publications from both robotics and computer vision, we find that occlusion handling, novel pose representations, and formalizing and improving category-level pose estimation are still fundamental challenges that are highly relevant for robotics. Moreover, to further improve robotic performance, large object sets, novel objects, refractive materials, and uncertainty estimates are central, largely unsolved open challenges. Addressing them requires advances in ontological reasoning, deformability handling, scene-level reasoning, and realistic datasets, as well as a reduced ecological footprint of algorithms.
Abstract: Object pose estimation is important for object manipulation and scene understanding. In order to improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the template closest to a query image; this template retrieval implicitly provides the object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art in novel object pose estimation still uses CNN-based approaches. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. In detail, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time, such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion, and object symmetries, using masked cosine similarity. The presented results not only demonstrate that Vision Transformers improve the matching accuracy over CNNs, but also that in some cases pre-trained Vision Transformers do not need fine-tuning to do so. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
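A minimal sketch of the masked cosine similarity used at test time: spatial descriptors of a query image and a rendered template are compared only at locations covered by the template's object mask, so background clutter in the query does not dominate the score. The feature shapes and random stand-in tensors are illustrative placeholders for either a CNN or a ViT feature extractor.

```python
import torch
import torch.nn.functional as F

def masked_cosine_similarity(query_feats, template_feats, template_mask):
    """query_feats, template_feats: (C, H, W); template_mask: (H, W) bool."""
    q = F.normalize(query_feats.flatten(1), dim=0)     # (C, H*W), unit norm per location
    t = F.normalize(template_feats.flatten(1), dim=0)
    m = template_mask.flatten().float()
    per_location = (q * t).sum(dim=0)                  # cosine similarity at every spatial location
    return (per_location * m).sum() / m.sum().clamp(min=1)

# Example with random stand-in features: pick the best of several templates for one query.
query = torch.randn(256, 32, 32)
templates = [torch.randn(256, 32, 32) for _ in range(12)]
masks = [torch.rand(32, 32) > 0.5 for _ in range(12)]
scores = torch.stack([masked_cosine_similarity(query, t, m) for t, m in zip(templates, masks)])
print("best matching template index:", int(scores.argmax()))
```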
Abstract: Object pose estimation is a non-trivial task that enables robotic manipulation, bin picking, augmented reality, and scene understanding, to name a few use cases. Monocular object pose estimation gained considerable momentum with the rise of high-performing deep learning-based solutions and is particularly interesting for the community since sensors are inexpensive and inference is fast. Prior works establish the comprehensive state of the art for diverse pose estimation problems. Their broad scopes make it difficult to identify promising future directions. We narrow the scope to the problem of single-shot monocular 6D object pose estimation, which is commonly used in robotics, and are thus able to identify such trends. By reviewing recent publications in robotics and computer vision, the state of the art is established at the union of both fields. Following that, we identify promising research directions in order to help researchers formulate relevant research ideas and effectively advance the state of the art. Our findings include that current methods are sophisticated enough to overcome the domain shift and that occlusion handling remains a fundamental challenge. We also highlight novel object pose estimation and the handling of challenging materials as central challenges for advancing robotics.