Daniel Hein

Is Q-learning an Ill-posed Problem?

Feb 21, 2025

Philipp Wissmann, Daniel Hein, Steffen Udluft, Thomas Runkler

Abstract: This paper investigates the instability of Q-learning in continuous environments, a challenge frequently encountered by practitioners. Traditionally, this instability is attributed to bootstrapping and regression model errors. Using a representative reinforcement learning benchmark, we systematically examine the effects of bootstrapping and model inaccuracies by incrementally eliminating these potential error sources. Our findings reveal that even in relatively simple benchmarks, the fundamental task of Q-learning - iteratively learning a Q-function from policy-specific target values - can be inherently ill-posed and prone to failure. These insights cast doubt on the reliability of Q-learning as a universal solution for reinforcement learning problems.

* Accepted at ESANN 2025
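
To make "iteratively learning a Q-function from target values" concrete, here is a minimal tabular sketch of bootstrapped Q-iteration. It is not the paper's setup: the two-state deterministic MDP, the function name `q_iteration`, and all parameters are assumptions chosen for illustration. In this small discrete case the iteration converges, whereas the abstract concerns continuous environments with regression models.

```python
def q_iteration(transitions, n_states, n_actions, gamma=0.9, sweeps=200):
    """Repeatedly overwrite Q(s, a) with the bootstrapped target r + gamma * max_a' Q(s', a')."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        new_q = [row[:] for row in q]
        for s, a, r, s_next in transitions:
            # Target value computed from the previous Q estimate (bootstrapping).
            new_q[s][a] = r + gamma * max(q[s_next])
        q = new_q
    return q

# Hypothetical toy MDP: action 0 stays, action 1 switches state;
# landing in state 1 yields reward 1, landing in state 0 yields reward 0.
toy_transitions = [
    (0, 0, 0.0, 0),
    (0, 1, 1.0, 1),
    (1, 0, 1.0, 1),
    (1, 1, 0.0, 0),
]

q_table = q_iteration(toy_transitions, n_states=2, n_actions=2)
```

Each sweep fits the table to targets that depend on the previous table, which is the bootstrapping step the abstract names as a traditional suspect for instability.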

Why long model-based rollouts are no reason for bad Q-value estimates

Jul 16, 2024

Philipp Wissmann, Daniel Hein, Steffen Udluft, Volker Tresp

Abstract: This paper explores the use of model-based offline reinforcement learning with long model rollouts. While some literature criticizes this approach due to compounding errors, many practitioners have found success in real-world applications. The paper aims to demonstrate that long rollouts do not necessarily result in exponentially growing errors and can actually produce better Q-value estimates than model-free methods. These findings can potentially enhance reinforcement learning techniques.

* Accepted at ESANN 2024

Model-based Offline Quantum Reinforcement Learning

Apr 14, 2024

Simon Eisenmann, Daniel Hein, Steffen Udluft, Thomas A. Runkler

Abstract: This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark. The model and the policy to be optimized are each implemented as variational quantum circuits. The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function. This model-based approach allows, in principle, full realization on a quantum computer during the optimization phase and gives hope that a quantum advantage can be achieved as soon as sufficiently powerful quantum computers are available.

Learning Control Policies for Variable Objectives from Offline Data

Aug 11, 2023

Marc Weber, Phillip Swazinna, Daniel Hein, Steffen Udluft, Volkmar Sterzing

Abstract: Offline reinforcement learning provides a viable approach to obtain advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime, without need for collecting additional observation batches or re-training.

* 8 pages, 7 figures

Quantum Policy Iteration via Amplitude Estimation and Grover Search -- Towards Quantum Advantage for Reinforcement Learning

Jun 09, 2022

Simon Wiedemann, Daniel Hein, Steffen Udluft, Christian Mendl

Abstract: We present a full implementation and simulation of a novel quantum reinforcement learning (RL) method and mathematically prove a quantum advantage. Our approach shows in detail how to combine amplitude estimation and Grover search into a policy evaluation and improvement scheme. We first develop quantum policy evaluation (QPE) which is quadratically more efficient compared to an analogous classical Monte Carlo estimation and is based on a quantum mechanical realization of a finite Markov decision process (MDP). Building on QPE, we derive a quantum policy iteration that repeatedly improves an initial policy using Grover search until the optimum is reached. Finally, we present an implementation of our algorithm for a two-armed bandit MDP which we then simulate. The results confirm that QPE provides a quantum advantage in RL problems.
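
The phrase "quadratically more efficient" can be read through the standard error scalings of the two estimators. The relations below are the textbook scalings for Monte Carlo averaging and for quantum amplitude estimation, not formulas quoted from the paper; the symbols $N$ (number of samples or oracle calls) and $\epsilon$ (estimation error) are notation assumed here.

```latex
\begin{align*}
  \text{Classical Monte Carlo:} \quad
    & \epsilon = O\!\left(\frac{1}{\sqrt{N}}\right)
      \;\Longrightarrow\; N = O\!\left(\frac{1}{\epsilon^{2}}\right) \\
  \text{Amplitude estimation:} \quad
    & \epsilon = O\!\left(\frac{1}{N}\right)
      \;\Longrightarrow\; N = O\!\left(\frac{1}{\epsilon}\right)
\end{align*}
```

Under these scalings, reaching a target error of $\epsilon = 10^{-3}$ takes on the order of $10^{6}$ classical samples but on the order of $10^{3}$ oracle calls, up to constant factors.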

Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning

Jan 14, 2022

Phillip Swazinna, Steffen Udluft, Daniel Hein, Thomas Runkler

Abstract: Offline reinforcement learning (RL) algorithms are often designed with environments such as MuJoCo in mind, in which the planning horizon is extremely long and no noise exists. We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real-world problems, including complex noise and partially observable states. We find that on the IB, hybrid approaches face severe difficulties and that simpler algorithms, such as rollout-based algorithms or model-free algorithms with simpler regularizers, perform best on the datasets.

* Submitted to IFAC Conference on Intelligent Control and Automation Sciences (ICONS) 2022

Trustworthy AI for Process Automation on a Chylla-Haase Polymerization Reactor

Aug 30, 2021

Daniel Hein, Daniel Labisch

Abstract: In this paper, genetic programming reinforcement learning (GPRL) is utilized to generate human-interpretable control policies for a Chylla-Haase polymerization reactor. Such continuously stirred tank reactors (CSTRs) with jacket cooling are widely used in the chemical industry, in the production of fine chemicals, pigments, polymers, and medical products. Despite appearing rather simple, controlling CSTRs in real-world applications is quite a challenging problem to tackle. GPRL utilizes already existing data from the reactor and fully automatically generates a set of optimized simplistic control strategies, so-called policies, that the domain expert can choose from. Note that these policies are white-box models of low complexity, which makes them easy to validate and implement in the target control system, e.g., SIMATIC PCS 7. However, despite its low complexity, the automatically generated policy yields high performance in terms of reactor temperature control deviation, which we empirically evaluate on the original reactor template.

* Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '21 (2021)

Behavior Constraining in Weight Space for Offline Reinforcement Learning

Jul 12, 2021

Phillip Swazinna, Steffen Udluft, Daniel Hein, Thomas Runkler

Abstract: In offline reinforcement learning, a policy needs to be learned from a single pre-collected dataset. Typically, policies are thus regularized during training to behave similarly to the data generating policy, by adding a penalty based on a divergence between action distributions of generating and trained policy. We propose a new algorithm, which constrains the policy directly in its weight space instead, and demonstrate its effectiveness in experiments.

* Accepted at ESANN 2021
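
One way to picture a weight-space constraint is as a projection that keeps the trained policy's parameters within a fixed distance of the parameters of a behavior-cloned policy. The abstract does not specify the mechanism, so the sketch below is an assumption throughout: the Euclidean ball, the helper `project_to_ball`, and the example weight vectors are hypothetical and not taken from the paper.

```python
import math

def project_to_ball(weights, anchor, radius):
    """Return the point closest to `weights` within an L2 ball of `radius` around `anchor`."""
    diff = [w - a for w, a in zip(weights, anchor)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        # Already inside the allowed region: leave the weights unchanged.
        return list(weights)
    scale = radius / norm
    # Shrink the offset from the anchor so it lies exactly on the ball's surface.
    return [a + scale * d for a, d in zip(anchor, diff)]

# Hypothetical numbers: weights of a behavior-cloned policy and of a candidate
# produced by an unconstrained policy update.
behavior_weights = [0.2, -0.1, 0.4]
candidate_weights = [1.2, -0.1, 0.4]
constrained_weights = project_to_ball(candidate_weights, behavior_weights, radius=0.5)
```

In contrast to the divergence penalties mentioned in the abstract, a constraint of this kind acts on parameters rather than on action distributions.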

Interpretable Control by Reinforcement Learning

Jul 20, 2020

Daniel Hein, Steffen Limmer, Thomas A. Runkler

Abstract: In this paper, three recently introduced reinforcement learning (RL) methods are used to generate human-interpretable policies for the cart-pole balancing benchmark. The novel RL methods learn human-interpretable policies in the form of compact fuzzy controllers and simple algebraic equations. The representations as well as the achieved control performances are compared with two classical controller design methods and three non-interpretable RL methods. All eight methods utilize the same previously generated data batch and produce their controller offline - without interaction with the real benchmark dynamics. The experiments show that the novel RL methods are able to automatically generate well-performing policies which are at the same time human-interpretable. Furthermore, one of the methods is applied to automatically learn an equation-based policy for a hardware cart-pole demonstrator by using only human-player-generated batch data. The solution generated in the first attempt already represents a successful balancing policy, which demonstrates the method's applicability to real-world problems.

Generating Interpretable Fuzzy Controllers using Particle Swarm Optimization and Genetic Programming

Apr 29, 2018

Daniel Hein, Steffen Udluft, Thomas A. Runkler

Abstract: Autonomously training interpretable control strategies, called policies, using pre-existing plant trajectory data is of great interest in industrial applications. Fuzzy controllers have been used in industry for decades as interpretable and efficient system controllers. In this study, we introduce a fuzzy genetic programming (GP) approach called fuzzy GP reinforcement learning (FGPRL) that can select the relevant state features, determine the size of the required fuzzy rule set, and automatically adjust all the controller parameters simultaneously. Each GP individual's fitness is computed using model-based batch reinforcement learning (RL), which first trains a model using available system samples and subsequently performs Monte Carlo rollouts to predict each policy candidate's performance. We compare FGPRL to an extended version of a related method called fuzzy particle swarm reinforcement learning (FPSRL), which uses swarm intelligence to tune the fuzzy policy parameters. Experiments using an industrial benchmark show that FGPRL is able to autonomously learn interpretable fuzzy policies with high control performance.

* Accepted at Genetic and Evolutionary Computation Conference 2018 (GECCO '18)
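
The fitness computation described in the abstract (roll a candidate policy through a learned model and score it by its predicted return) can be sketched as follows. This is a minimal illustration, not the FGPRL implementation: the function names, the `model(state, action) -> (next_state, reward)` interface, and the one-dimensional toy dynamics standing in for a trained model are all assumptions, and the toy model is deterministic, so only the start states are sampled.

```python
import random

def rollout_return(model, policy, start_state, horizon, gamma=0.99):
    """Discounted return of one rollout of `policy` through `model`."""
    state, total, discount = start_state, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)
        total += discount * reward
        discount *= gamma
    return total

def policy_fitness(model, policy, start_states, horizon, gamma=0.99):
    """Average rollout return over sampled start states; higher is better."""
    returns = [rollout_return(model, policy, s, horizon, gamma) for s in start_states]
    return sum(returns) / len(returns)

# Toy stand-in for a learned model: the action shifts a scalar state,
# and the reward penalizes distance from the setpoint 0.
def toy_model(state, action):
    next_state = state + action
    return next_state, -abs(next_state)

def toward_zero(state):
    """Candidate policy that halves the distance to the setpoint each step."""
    return -0.5 * state

def do_nothing(state):
    """Baseline candidate that never acts."""
    return 0.0

rng = random.Random(0)
start_states = [rng.uniform(-1.0, 1.0) for _ in range(20)]
fitness_good = policy_fitness(toy_model, toward_zero, start_states, horizon=10)
fitness_idle = policy_fitness(toy_model, do_nothing, start_states, horizon=10)
```

A population-based search such as GP would call a function like `policy_fitness` once per individual and prefer candidates with higher values; here `toward_zero` scores higher than `do_nothing`.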
