Title: Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Oct 09, 2023

Trevor McInroe, Stefano V. Albrecht, Amos Storkey

Abstract: Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm that is well matched to a real-world RL deployment process: in few real settings would one deploy an offline policy with no test runs and tuning. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but this unnecessarily limits policy performance if the behavior policy is far from optimal. Instead, we forgo policy constraints and frame OtO RL as an exploration problem: we must maximize the benefit of the online data collection. We study major online RL exploration paradigms, adapting them to work well in the OtO setting. These adapted methods contribute several strong baselines. We also introduce an algorithm for planning to go out of distribution (PTGOOD), which targets online exploration in relatively high-reward regions of the state-action space that are unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy. In that way, the limited interaction budget is used effectively. We show that PTGOOD significantly improves agent returns during online fine-tuning and finds the optimal policy in as few as 10k online steps in Walker and in as few as 50k in complex control tasks like Humanoid. We also find that PTGOOD avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.

* 9 pages, 12 figures, preprint
