Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hermann Blum

Spot-On: A Mixed Reality Interface for Multi-Robot Cooperation

May 28, 2025

Tim Engelbracht, Petar Lukovic, Tjark Behrens, Kai Lascheit, René Zurbrügg, Marc Pollefeys, Hermann Blum, Zuria Bauer

Abstract:Recent progress in mixed reality (MR) and robotics is enabling increasingly sophisticated forms of human-robot collaboration. Building on these developments, we introduce a novel MR framework that allows multiple quadruped robots to operate in semantically diverse environments via a MR interface. Our system supports collaborative tasks involving drawers, swing doors, and higher-level infrastructure such as light switches. A comprehensive user study verifies both the design and usability of our app, with participants giving a "good" or "very good" rating in almost all cases. Overall, our approach provides an effective and intuitive framework for MR-based multi-robot collaboration in complex, real-world scenarios.

Via

Access Paper or Ask Questions

Loop Closure from Two Views: Revisiting PGO for Scalable Trajectory Estimation through Monocular Priors

Mar 20, 2025

Tian Yi Lim, Boyang Sun, Marc Pollefeys, Hermann Blum

Abstract:(Visual) Simultaneous Localization and Mapping (SLAM) remains a fundamental challenge in enabling autonomous systems to navigate and understand large-scale environments. Traditional SLAM approaches struggle to balance efficiency and accuracy, particularly in large-scale settings where extensive computational resources are required for scene reconstruction and Bundle Adjustment (BA). However, this scene reconstruction, in the form of sparse pointclouds of visual landmarks, is often only used within the SLAM system because navigation and planning methods require different map representations. In this work, we therefore investigate a more scalable Visual SLAM (VSLAM) approach without reconstruction, mainly based on approaches for two-view loop closures. By restricting the map to a sparse keyframed pose graph without dense geometry representations, our '2GO' system achieves efficient optimization with competitive absolute trajectory accuracy. In particular, we find that recent advancements in image matching and monocular depth priors enable very accurate trajectory optimization from two-view edges. We conduct extensive experiments on diverse datasets, including large-scale scenarios, and provide a detailed analysis of the trade-offs between runtime, accuracy, and map size. Our results demonstrate that this streamlined approach supports real-time performance, scales well in map size and trajectory duration, and effectively broadens the capabilities of VSLAM for long-duration deployments to large environments.

Via

Access Paper or Ask Questions

FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction

Mar 10, 2025

Dennis Rotondi, Fabio Scaparro, Hermann Blum, Kai O. Arras

Abstract:The concept of 3D scene graphs is increasingly recognized as a powerful semantic and hierarchical representation of the environment. Current approaches often address this at a coarse, object-level resolution. In contrast, our goal is to develop a representation that enables robots to directly interact with their environment by identifying both the location of functional interactive elements and how these can be used. To achieve this, we focus on detecting and storing objects at a finer resolution, focusing on affordance-relevant parts. The primary challenge lies in the scarcity of data that extends beyond instance-level detection and the inherent difficulty of capturing detailed object features using robotic sensors. We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline. Through our experiments, we demonstrate that our approach achieves functional element segmentation comparable to state-of-the-art 3D models and that our augmentation enables task-driven affordance grounding with higher accuracy than the current solutions.

Via

Access Paper or Ask Questions

FrontierNet: Learning Visual Cues to Explore

Jan 08, 2025

Boyang Sun, Hanzhi Chen, Stefan Leutenegger, Cesar Cadena, Marc Pollefeys, Hermann Blum

Figure 1 for FrontierNet: Learning Visual Cues to Explore

Figure 2 for FrontierNet: Learning Visual Cues to Explore

Figure 3 for FrontierNet: Learning Visual Cues to Explore

Figure 4 for FrontierNet: Learning Visual Cues to Explore

Abstract:Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for tasks such as mapping, object discovery, and environmental assessment. Existing methods, such as frontier-based methods, rely heavily on 3D map operations, which are limited by map quality and often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose a image-only frontier-based exploration system, with FrontierNet as a core component developed in this work. FrontierNet is a learning-based model that (i) detects frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent exploration systems, achieving a 16% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments.

Via

Access Paper or Ask Questions

Lost & Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations

Nov 28, 2024

Tjark Behrens, René Zurbrügg, Marc Pollefeys, Zuria Bauer, Hermann Blum

Figure 1 for Lost & Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations

Figure 2 for Lost & Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations

Figure 3 for Lost & Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations

Figure 4 for Lost & Found: Updating Dynamic 3D Scene Graphs from Egocentric Observations

Abstract:Recent approaches have successfully focused on the segmentation of static reconstructions, thereby equipping downstream applications with semantic 3D understanding. However, the world in which we live is dynamic, characterized by numerous interactions between the environment and humans or robotic agents. Static semantic maps are unable to capture this information, and the naive solution of rescanning the environment after every change is both costly and ineffective in tracking e.g. objects being stored away in drawers. With Lost & Found we present an approach that addresses this limitation. Based solely on egocentric recordings with corresponding hand position and camera pose estimates, we are able to track the 6DoF poses of the moving object within the detected interaction interval. These changes are applied online to a transformable scene graph that captures object-level relations. Compared to state-of-the-art object pose trackers, our approach is more reliable in handling the challenging egocentric viewpoint and the lack of depth information. It outperforms the second-best approach by 34% and 56% for translational and orientational error, respectively, and produces visibly smoother 6DoF object trajectories. In addition, we illustrate how the acquired interaction information in the dynamic scene graph can be employed in the context of robotic applications that would otherwise be unfeasible: We show how our method allows to command a mobile manipulator through teach & repeat, and how information about prior interaction allows a mobile manipulator to retrieve an object hidden in a drawer. Code, videos and corresponding data are accessible at https://behretj.github.io/LostAndFound.

* Webpage: https://behretj.github.io/LostAndFound

Via

Access Paper or Ask Questions

DepthSplat: Connecting Gaussian Splatting and Depth

Oct 17, 2024

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, Marc Pollefeys

Figure 1 for DepthSplat: Connecting Gaussian Splatting and Depth

Figure 2 for DepthSplat: Connecting Gaussian Splatting and Depth

Figure 3 for DepthSplat: Connecting Gaussian Splatting and Depth

Figure 4 for DepthSplat: Connecting Gaussian Splatting and Depth

Abstract:Gaussian splatting and single/multi-view depth estimation are typically studied in isolation. In this paper, we present DepthSplat to connect Gaussian splatting and depth estimation and study their interactions. More specifically, we first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features, leading to high-quality feed-forward 3D Gaussian splatting reconstructions. We also show that Gaussian splatting can serve as an unsupervised pre-training objective for learning powerful depth models from large-scale unlabelled datasets. We validate the synergy between Gaussian splatting and depth estimation through extensive ablation and cross-task transfer experiments. Our DepthSplat achieves state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets in terms of both depth estimation and novel view synthesis, demonstrating the mutual benefits of connecting both tasks. Our code, models, and video results are available at https://haofeixu.github.io/depthsplat/.

* Project page: https://haofeixu.github.io/depthsplat/

Via

Access Paper or Ask Questions

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Oct 17, 2024

Guangda Ji, Silvan Weder, Francis Engelmann, Marc Pollefeys, Hermann Blum

Figure 1 for ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Figure 2 for ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Figure 3 for ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Figure 4 for ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Abstract:The performance of neural networks scales with both their size and the amount of data they have been trained on. This is shown in both language and image generation. However, this requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward the state-of-the-art performance on ScanNet and ScanNet200 dataset with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.

Via

Access Paper or Ask Questions

HoloSpot: Intuitive Object Manipulation via Mixed Reality Drag-and-Drop

Oct 14, 2024

Pablo Soler Garcia, Petar Lukovic, Lucie Reynaud, Andrea Sgobbi, Federica Bruni, Martin Brun, Marc Zünd, Riccardo Bollati, Marc Pollefeys, Hermann Blum(+1 more)

Abstract:Human-robot interaction through mixed reality (MR) technologies enables novel, intuitive interfaces to control robots in remote operations. Such interfaces facilitate operations in hazardous environments, where human presence is risky, yet human oversight remains crucial. Potential environments include disaster response scenarios and areas with high radiation or toxic chemicals. In this paper we present an interface system projecting a 3D representation of a scanned room as a scaled-down 'dollhouse' hologram, allowing users to select and manipulate objects using a straightforward drag-and-drop interface. We then translate these drag-and-drop user commands into real-time robot actions based on the recent Spot-Compose framework. The Unity-based application provides an interactive tutorial and a user-friendly experience, ensuring ease of use. Through comprehensive end-to-end testing, we validate the system's capability in executing pick-and-place tasks and a complementary user study affirms the interface's intuitive controls. Our findings highlight the advantages of this interface in improving user experience and operational efficiency. This work lays the groundwork for a robust framework that advances the potential for seamless human-robot collaboration in diverse applications. Paper website: https://holospot.github.io/

* 6 pages, 8 figures, submitted to ICRA 2025

Via

Access Paper or Ask Questions

SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection

Sep 18, 2024

Tim Engelbracht, René Zurbrügg, Marc Pollefeys, Hermann Blum, Zuria Bauer

Abstract:Despite increasing research efforts on household robotics, robots intended for deployment in domestic settings still struggle with more complex tasks such as interacting with functional elements like drawers or light switches, largely due to limited task-specific understanding and interaction capabilities. These tasks require not only detection and pose estimation but also an understanding of the affordances these elements provide. To address these challenges and enhance robotic scene understanding, we introduce SpotLight: A comprehensive framework for robotic interaction with functional elements, specifically light switches. Furthermore, this framework enables robots to improve their environmental understanding through interaction. Leveraging VLM-based affordance prediction to estimate motion primitives for light switch interaction, we achieve up to 84% operation success in real world experiments. We further introduce a specialized dataset containing 715 images as well as a custom detection model for light switch detection. We demonstrate how the framework can facilitate robot learning through physical interaction by having the robot explore the environment and discover previously unknown relationships in a scene graph representation. Lastly, we propose an extension to the framework to accommodate other functional interactions such as swing doors, showcasing its flexibility. Videos and Code: timengelbracht.github.io/SpotLight/

* timengelbracht.github.io/SpotLight/

Via

Access Paper or Ask Questions

Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Jul 22, 2024

Luca Di Giammarino, Boyang Sun, Giorgio Grisetti, Marc Pollefeys, Hermann Blum, Daniel Barath

Figure 1 for Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Figure 2 for Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Figure 3 for Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Figure 4 for Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

Abstract:Accurate localization in diverse environments is a fundamental challenge in computer vision and robotics. The task involves determining a sensor's precise position and orientation, typically a camera, within a given space. Traditional localization methods often rely on passive sensing, which may struggle in scenarios with limited features or dynamic environments. In response, this paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy. Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications. Our results demonstrate that our method performs better than the existing one, targeting similar problems and generalizing on synthetic and real data. We also release an open-source implementation to benefit the community.

* www.github.com/rvp-group/learning-where-to-look

Via

Access Paper or Ask Questions