José Niño-Mora

Optimal Control of Fluid Restless Multi-armed Bandits: A Machine Learning Approach

Feb 06, 2025

Dimitris Bertsimas, Cheol Woo Kim, José Niño-Mora

Abstract: We propose a machine learning approach to the optimal control of fluid restless multi-armed bandits (FRMABs) with state equations that are either affine or quadratic in the state variables. By deriving fundamental properties of FRMAB problems, we design an efficient machine-learning-based algorithm. Using this algorithm, we solve multiple instances with varying initial states to generate a comprehensive training set. We then learn a state feedback policy using Optimal Classification Trees with hyperplane splits (OCT-H). We test our approach on machine maintenance, epidemic control, and fisheries control problems. Our method yields high-quality state feedback policies and achieves a speed-up of up to 26 million times compared to a direct numerical algorithm for fluid problems.

Multi-Action Restless Bandits with Weakly Coupled Constraints: Simultaneous Learning and Control

Dec 04, 2024

Jing Fu, Bill Moran, José Niño-Mora

Abstract: We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices under different actions. The bandit processes of the same group share the same state and action spaces and, given the same action, the same transition matrix. All the bandit processes across the groups are subject to multiple weakly coupled constraints over their state and action variables. Unlike past studies that focused on the offline case, we consider the online case without assuming full knowledge of the transition matrices and reward functions a priori, and propose an effective scheme that enables simultaneous learning and control. We prove the convergence of the relevant processes in both the timeline and the number of bandit processes, referred to as convergence in the time and the magnitude dimensions. Moreover, we prove that the relevant processes converge exponentially fast in the magnitude dimension, leading to an exponentially diminishing performance deviation between the proposed online algorithms and offline optimality.

* 70 pages, 0 figures
