Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sebastian Houben

Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Sankt Augustin, Germany

From Language to Action: Can LLM-Based Agents Be Used for Embodied Robot Cognition?

Mar 03, 2026

Shinas Shaji, Fabian Huppertz, Alex Mitrevski, Sebastian Houben

Abstract:In order to flexibly act in an everyday environment, a robotic agent needs a variety of cognitive capabilities that enable it to reason about plans and perform execution recovery. Large language models (LLMs) have been shown to demonstrate emergent cognitive aspects, such as reasoning and language understanding; however, the ability to control embodied robotic agents requires reliably bridging high-level language to low-level functionalities for perception and control. In this paper, we investigate the extent to which an LLM can serve as a core component for planning and execution reasoning in a cognitive robot architecture. For this purpose, we propose a cognitive architecture in which an agentic LLM serves as the core component for planning and reasoning, while components for working and episodic memories support learning from experience and adaptation. An instance of the architecture is then used to control a mobile manipulator in a simulated household environment, where environment interaction is done through a set of high-level tools for perception, reasoning, navigation, grasping, and placement, all of which are made available to the LLM-based agent. We evaluate our proposed system on two household tasks (object placement and object swapping), which evaluate the agent's reasoning, planning, and memory utilisation. The results demonstrate that the LLM-driven agent can complete structured tasks and exhibits emergent adaptation and memory-guided planning, but also reveal significant limitations, such as hallucinations about the task success and poor instruction following by refusing to acknowledge and complete sequential tasks. These findings highlight both the potential and challenges of employing LLMs as embodied cognitive controllers for autonomous robots.

* Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA)

Via

Access Paper or Ask Questions

Reliable Robotic Task Execution in the Face of Anomalies

Oct 27, 2025

Bharath Santhanam, Alex Mitrevski, Santosh Thoduka, Sebastian Houben, Teena Hassan

Abstract:Learned robot policies have consistently been shown to be versatile, but they typically have no built-in mechanism for handling the complexity of open environments, making them prone to execution failures; this implies that deploying policies without the ability to recognise and react to failures may lead to unreliable and unsafe robot behaviour. In this paper, we present a framework that couples a learned policy with a method to detect visual anomalies during policy deployment and to perform recovery behaviours when necessary, thereby aiming to prevent failures. Specifically, we train an anomaly detection model using data collected during nominal executions of a trained policy. This model is then integrated into the online policy execution process, so that deviations from the nominal execution can trigger a three-level sequential recovery process that consists of (i) pausing the execution temporarily, (ii) performing a local perturbation of the robot's state, and (iii) resetting the robot to a safe state by sampling from a learned execution success model. We verify our proposed method in two different scenarios: (i) a door handle reaching task with a Kinova Gen3 arm using a policy trained in simulation and transferred to the real robot, and (ii) an object placing task with a UFactory xArm 6 using a general-purpose policy model. Our results show that integrating policy execution with anomaly detection and recovery increases the execution success rate in environments with various anomalies, such as trajectory deviations and adversarial human interventions.

* Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Via

Access Paper or Ask Questions

Enhancing Video-Based Robot Failure Detection Using Task Knowledge

Aug 26, 2025

Santosh Thoduka, Sebastian Houben, Juergen Gall, Paul G. Plöger

Abstract:Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.

* Accepted at ECMR 2025

Via

Access Paper or Ask Questions

GG-BBQ: German Gender Bias Benchmark for Question Answering

Jul 22, 2025

Shalaka Satheesh, Katrin Klug, Katharina Beckh, Héctor Allende-Cid, Sebastian Houben, Teena Hassan

Abstract:Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model's predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.

* Accepted to the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP), taking place on August 1st 2025, as part of ACL 2025 in Vienna

Via

Access Paper or Ask Questions

Parameter-Efficient Fine-Tuning of Vision Foundation Model for Forest Floor Segmentation from UAV Imagery

May 13, 2025

Mohammad Wasil, Ahmad Drak, Brennan Penfold, Ludovico Scarton, Maximilian Johenneken, Alexander Asteroth, Sebastian Houben

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly used for reforestation and forest monitoring, including seed dispersal in hard-to-reach terrains. However, a detailed understanding of the forest floor remains a challenge due to high natural variability, quickly changing environmental parameters, and ambiguous annotations due to unclear definitions. To address this issue, we adapt the Segment Anything Model (SAM), a vision foundation model with strong generalization capabilities, to segment forest floor objects such as tree stumps, vegetation, and woody debris. To this end, we employ parameter-efficient fine-tuning (PEFT) to fine-tune a small subset of additional model parameters while keeping the original weights fixed. We adjust SAM's mask decoder to generate masks corresponding to our dataset categories, allowing for automatic segmentation without manual prompting. Our results show that the adapter-based PEFT method achieves the highest mean intersection over union (mIoU), while Low-rank Adaptation (LoRA), with fewer parameters, offers a lightweight alternative for resource-constrained UAV platforms.

* Accepted to the Novel Approaches for Precision Agriculture and Forestry with Autonomous Robots IEEE ICRA Workshop - 2025

Via

Access Paper or Ask Questions

End-to-End Multi-Task Policy Learning from NMPC for Quadruped Locomotion

May 13, 2025

Anudeep Sajja, Shahram Khorshidi, Sebastian Houben, Maren Bennewitz

Figure 1 for End-to-End Multi-Task Policy Learning from NMPC for Quadruped Locomotion

Figure 2 for End-to-End Multi-Task Policy Learning from NMPC for Quadruped Locomotion

Figure 3 for End-to-End Multi-Task Policy Learning from NMPC for Quadruped Locomotion

Figure 4 for End-to-End Multi-Task Policy Learning from NMPC for Quadruped Locomotion

Abstract:Quadruped robots excel in traversing complex, unstructured environments where wheeled robots often fail. However, enabling efficient and adaptable locomotion remains challenging due to the quadrupeds' nonlinear dynamics, high degrees of freedom, and the computational demands of real-time control. Optimization-based controllers, such as Nonlinear Model Predictive Control (NMPC), have shown strong performance, but their reliance on accurate state estimation and high computational overhead makes deployment in real-world settings challenging. In this work, we present a Multi-Task Learning (MTL) framework in which expert NMPC demonstrations are used to train a single neural network to predict actions for multiple locomotion behaviors directly from raw proprioceptive sensor inputs. We evaluate our approach extensively on the quadruped robot Go1, both in simulation and on real hardware, demonstrating that it accurately reproduces expert behavior, allows smooth gait switching, and simplifies the control pipeline for real-time deployment. Our MTL architecture enables learning diverse gaits within a unified policy, achieving high $R^{2}$ scores for predicted joint targets across all tasks.

Via

Access Paper or Ask Questions

Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

Apr 28, 2025

Mohamed Hassan, Mohammad Wasil, Sebastian Houben

Figure 1 for Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

Figure 2 for Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

Figure 3 for Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

Figure 4 for Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

Abstract:Contrastive learning (CL) approaches have gained great recognition as a very successful subset of self-supervised learning (SSL) methods. SSL enables learning from unlabeled data, a crucial step in the advancement of deep learning, particularly in computer vision (CV), given the plethora of unlabeled image data. CL works by comparing different random augmentations (e.g., different crops) of the same image, thus achieving self-labeling. Nevertheless, randomly augmenting images and especially random cropping can result in an image that is semantically very distant from the original and therefore leads to false labeling, hence undermining the efficacy of the methods. In this research, two novel parameterized cropping methods are introduced that increase the robustness of self-labeling and consequently increase the efficacy. The results show that the use of these methods significantly improves the accuracy of the model by between 2.7\% and 12.4\% on the downstream task of classifying CIFAR-10, depending on the crop size compared to that of the non-parameterized random cropping method.

Via

Access Paper or Ask Questions

Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

Mar 12, 2025

Julian Spravil, Sebastian Houben, Sven Behnke

Figure 1 for Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

Figure 2 for Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

Figure 3 for Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

Figure 4 for Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models

Abstract:Cross-lingual transfer enables vision-language models (VLMs) to perform vision tasks in various languages with training data only in one language. Current approaches rely on large pre-trained multilingual language models. However, they face the curse of multilinguality, sacrificing downstream task performance for multilingual capabilities, struggling with lexical ambiguities, and falling behind recent advances. In this work, we study the scaling laws of systematic generalization with monolingual VLMs for multilingual tasks, focusing on the impact of model size and seen training samples. We propose Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters combining the pre-trained VLM Florence-2 and the large language model Gemma-2. Florenz is trained with varying compute budgets on a synthetic dataset that features intentionally incomplete language coverage for image captioning, thus, testing generalization from the fully covered translation task. We show that not only does indirectly learning unseen task-language pairs adhere to a scaling law, but also that with our data generation pipeline and the proposed Florenz model family, image captioning abilities can emerge in a specific language even when only data for the translation task is available. Fine-tuning on a mix of downstream datasets yields competitive performance and demonstrates promising scaling trends in multimodal machine translation (Multi30K, CoMMuTE), lexical disambiguation (CoMMuTE), and image captioning (Multi30K, XM3600, COCO Karpathy).

Via

Access Paper or Ask Questions

GO-VMP: Global Optimization for View Motion Planning in Fruit Mapping

Mar 05, 2025

Allen Isaac Jose, Sicong Pan, Tobias Zaenker, Rohit Menon, Sebastian Houben, Maren Bennewitz

Abstract:Automating labor-intensive tasks such as crop monitoring with robots is essential for enhancing production and conserving resources. However, autonomously monitoring horticulture crops remains challenging due to their complex structures, which often result in fruit occlusions. Existing view planning methods attempt to reduce occlusions but either struggle to achieve adequate coverage or incur high robot motion costs. We introduce a global optimization approach for view motion planning that aims to minimize robot motion costs while maximizing fruit coverage. To this end, we leverage coverage constraints derived from the set covering problem (SCP) within a shortest Hamiltonian path problem (SHPP) formulation. While both SCP and SHPP are well-established, their tailored integration enables a unified framework that computes a global view path with minimized motion while ensuring full coverage of selected targets. Given the NP-hard nature of the problem, we employ a region-prior-based selection of coverage targets and a sparse graph structure to achieve effective optimization outcomes within a limited time. Experiments in simulation demonstrate that our method detects more fruits, enhances surface coverage, and achieves higher volume accuracy than the motion-efficient baseline with a moderate increase in motion cost, while significantly reducing motion costs compared to the coverage-focused baseline. Real-world experiments further confirm the practical applicability of our approach.

* Allen Isaac Jose and Sicong Pan have equal contribution. Submitted to IROS 2025

Via

Access Paper or Ask Questions

Bayesian Inverse Graphics for Few-Shot Concept Learning

Sep 12, 2024

Octavio Arriaga, Jichen Guo, Rebecca Adam, Sebastian Houben, Frank Kirchner

Abstract:Humans excel at building generalizations of new concepts from just one single example. Contrary to this, current computer vision models typically require large amount of training samples to achieve a comparable accuracy. In this work we present a Bayesian model of perception that learns using only minimal data, a prototypical probabilistic program of an object. Specifically, we propose a generative inverse graphics model of primitive shapes, to infer posterior distributions over physically consistent parameters from one or several images. We show how this representation can be used for downstream tasks such as few-shot classification and pose estimation. Our model outperforms existing few-shot neural-only classification algorithms and demonstrates generalization across varying lighting conditions, backgrounds, and out-of-distribution shapes. By design, our model is uncertainty-aware and uses our new differentiable renderer for optimizing global scene parameters through gradient descent, sampling posterior distributions over object parameters with Markov Chain Monte Carlo (MCMC), and using a neural based likelihood function.

* Neural-Symbolic Learning and Reasoning. NeSy 2024. Lecture Notes in Computer Science, vol 14979, pages 141-166

Via

Access Paper or Ask Questions