Daniel Pirutinsky

Accelerating the Computation of UCB and Related Indices for Reinforcement Learning

Sep 28, 2019

Wesley Cowan, Michael N. Katehakis, Daniel Pirutinsky

Abstract: In this paper we derive an efficient method for computing the indices associated with an asymptotically optimal upper confidence bound algorithm (MDP-UCB) of Burnetas and Katehakis (1997). The method requires solving only a system of two non-linear equations in two unknowns, irrespective of the cardinality of the state space of the Markovian decision process (MDP). In addition, we develop a similar acceleration for computing the indices of the MDP-Deterministic Minimum Empirical Divergence (MDP-DMED) algorithm developed in Cowan et al. (2019), based on ideas from Honda and Takemura (2011), that involves solving a single equation in one variable. We provide experimental results demonstrating the computational time savings and regret performance of these algorithms. In these comparisons we also consider the Optimistic Linear Programming (OLP) algorithm (Tewari and Bartlett, 2008) and a method based on posterior sampling (MDP-PS).

* A version of some of the algorithms and comparisons appeared in a previous technical note by Cowan, Katehakis, and Pirutinsky (2019), arXiv:1909.06019.

Reinforcement Learning: a Comparison of UCB Versus Alternative Adaptive Policies

Sep 13, 2019

Wesley Cowan, Michael N. Katehakis, Daniel Pirutinsky

Abstract: In this paper we consider the basic version of Reinforcement Learning (RL) that involves computing optimal data-driven (adaptive) policies for Markovian decision processes with unknown transition probabilities. We provide a brief survey of the state of the art in the area, and we compare the performance of the classic UCB policy of Burnetas and Katehakis (1997) with a new policy developed herein, which we call MDP-Deterministic Minimum Empirical Divergence (MDP-DMED), and a method based on posterior sampling (MDP-PS).