Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Steffen Udluft

Is Q-learning an Ill-posed Problem?

Feb 21, 2025

Philipp Wissmann, Daniel Hein, Steffen Udluft, Thomas Runkler

Abstract:This paper investigates the instability of Q-learning in continuous environments, a challenge frequently encountered by practitioners. Traditionally, this instability is attributed to bootstrapping and regression model errors. Using a representative reinforcement learning benchmark, we systematically examine the effects of bootstrapping and model inaccuracies by incrementally eliminating these potential error sources. Our findings reveal that even in relatively simple benchmarks, the fundamental task of Q-learning - iteratively learning a Q-function from policy-specific target values - can be inherently ill-posed and prone to failure. These insights cast doubt on the reliability of Q-learning as a universal solution for reinforcement learning problems.

* Accepted at ESANN 2025

Via

Access Paper or Ask Questions

TEA: Trajectory Encoding Augmentation for Robust and Transferable Policies in Offline Reinforcement Learning

Nov 28, 2024

Batıkan Bora Ormancı, Phillip Swazinna, Steffen Udluft, Thomas A. Runkler

Abstract:In this paper, we investigate offline reinforcement learning (RL) with the goal of training a single robust policy that generalizes effectively across environments with unseen dynamics. We propose a novel approach, Trajectory Encoding Augmentation (TEA), which extends the state space by integrating latent representations of environmental dynamics obtained from sequence encoders, such as AutoEncoders. Our findings show that incorporating these encodings with TEA improves the transferability of a single policy to novel environments with new dynamics, surpassing methods that rely solely on unmodified states. These results indicate that TEA captures critical, environment-specific characteristics, enabling RL agents to generalize effectively across dynamic conditions.

Via

Access Paper or Ask Questions

Neural-ANOVA: Model Decomposition for Interpretable Machine Learning

Aug 22, 2024

Steffen Limmer, Steffen Udluft, Clemens Otte

Abstract:The analysis of variance (ANOVA) decomposition offers a systematic method to understand the interaction effects that contribute to a specific decision output. In this paper we introduce Neural-ANOVA, an approach to decompose neural networks into glassbox models using the ANOVA decomposition. Our approach formulates a learning problem, which enables rapid and closed-form evaluation of integrals over subspaces that appear in the calculation of the ANOVA decomposition. Finally, we conduct numerical experiments to illustrate the advantages of enhanced interpretability and model validation by a decomposition of the learned interaction effects.

* 8 pages, 4 figures, 5 tables

Via

Access Paper or Ask Questions

Why long model-based rollouts are no reason for bad Q-value estimates

Jul 16, 2024

Philipp Wissmann, Daniel Hein, Steffen Udluft, Volker Tresp

Abstract:This paper explores the use of model-based offline reinforcement learning with long model rollouts. While some literature criticizes this approach due to compounding errors, many practitioners have found success in real-world applications. The paper aims to demonstrate that long rollouts do not necessarily result in exponentially growing errors and can actually produce better Q-value estimates than model-free methods. These findings can potentially enhance reinforcement learning techniques.

* Accepted at ESANN 2024

Via

Access Paper or Ask Questions

Model-based Offline Quantum Reinforcement Learning

Apr 14, 2024

Simon Eisenmann, Daniel Hein, Steffen Udluft, Thomas A. Runkler

Abstract:This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark. The model and the policy to be optimized are each implemented as variational quantum circuits. The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function. This model-based approach allows, in principle, full realization on a quantum computer during the optimization phase and gives hope that a quantum advantage can be achieved as soon as sufficiently powerful quantum computers are available.

Via

Access Paper or Ask Questions

Learning Control Policies for Variable Objectives from Offline Data

Aug 11, 2023

Marc Weber, Phillip Swazinna, Daniel Hein, Steffen Udluft, Volkmar Sterzing

Figure 1 for Learning Control Policies for Variable Objectives from Offline Data

Figure 2 for Learning Control Policies for Variable Objectives from Offline Data

Figure 3 for Learning Control Policies for Variable Objectives from Offline Data

Figure 4 for Learning Control Policies for Variable Objectives from Offline Data

Abstract:Offline reinforcement learning provides a viable approach to obtain advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime, without need for collecting additional observation batches or re-training.

* 8 pages, 7 figures

Via

Access Paper or Ask Questions

Automatic Trade-off Adaptation in Offline RL

Jun 16, 2023

Phillip Swazinna, Steffen Udluft, Thomas Runkler

Abstract:Recently, offline RL algorithms have been proposed that remain adaptive at runtime. For example, the LION algorithm \cite{lion} provides the user with an interface to set the trade-off between behavior cloning and optimality w.r.t. the estimated return at runtime. Experts can then use this interface to adapt the policy behavior according to their preferences and find a good trade-off between conservatism and performance optimization. Since expert time is precious, we extend the methodology with an autopilot that automatically finds the correct parameterization of the trade-off, yielding a new algorithm which we term AutoLION.

* Oral Presentation @ ESANN 2023

Via

Access Paper or Ask Questions

Safe Policy Improvement Approaches and their Limitations

Aug 01, 2022

Philipp Scholl, Felix Dietrich, Clemens Otte, Steffen Udluft

Figure 1 for Safe Policy Improvement Approaches and their Limitations

Figure 2 for Safe Policy Improvement Approaches and their Limitations

Figure 3 for Safe Policy Improvement Approaches and their Limitations

Figure 4 for Safe Policy Improvement Approaches and their Limitations

Abstract:Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety critical applications as it improves the behavior policy with a high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also check the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary such that the safety bounds become useful in practice.

* 27 pages. arXiv admin note: substantial text overlap with arXiv:2201.12175

Via

Access Paper or Ask Questions

Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Jun 09, 2022

Simon Wiedemann, Daniel Hein, Steffen Udluft, Christian Mendl

Figure 1 for Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Figure 2 for Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Figure 3 for Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Figure 4 for Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Abstract:We present a full implementation and simulation of a novel quantum reinforcement learning (RL) method and mathematically prove a quantum advantage. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate. The results confirm that QPE provides a quantum advantage in RL problems.

Via

Access Paper or Ask Questions

User-Interactive Offline Reinforcement Learning

May 21, 2022

Phillip Swazinna, Steffen Udluft, Thomas Runkler

Figure 1 for User-Interactive Offline Reinforcement Learning

Figure 2 for User-Interactive Offline Reinforcement Learning

Figure 3 for User-Interactive Offline Reinforcement Learning

Figure 4 for User-Interactive Offline Reinforcement Learning

Abstract:Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby overcoming both of the above mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as stopping at any time when the policy deteriorates or the behavior is too far from the familiar one.

Via

Access Paper or Ask Questions