Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

May 15, 2019

Wang Chi Cheung

Figure 1 for Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

Share this with someone who'll enjoy it:

Abstract:We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The problem models applications such as multi-objective optimization, maximum entropy exploration, and constrained optimization in Markovian environments. In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes. In particular, stationary policies are in general sub-optimal. We propose a no-regret algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off. By delaying the gradient updates, our procedure produces a non-stationary policy that diversifies the outcomes for optimizing the objective. The procedure is compatible with a variety of OCO tools.

* 54 pages, 1 figure

View paper on

Share this with someone who'll enjoy it:

Title:Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

Paper and Code