SAMG: State-Action-Aware Offline-to-Online Reinforcement Learning with Offline Model Guidance

Oct 24, 2024

Liyu Zhang, Haochi Wu, Xu Wan, Quan Kong, Ruilong Deng, Mingyang Sun

Abstract: The offline-to-online (O2O) paradigm in reinforcement learning (RL) uses models pre-trained on offline datasets for subsequent online fine-tuning. However, conventional O2O RL algorithms typically need to retain the large offline datasets and keep retraining on them to mitigate the effects of out-of-distribution (OOD) data, which limits their efficiency in exploiting online samples. To address this challenge, we introduce a new paradigm called SAMG: State-Action-Conditional Offline-to-Online Reinforcement Learning with Offline Model Guidance. Rather than training directly on offline data, SAMG freezes the pre-trained offline critic and uses it to provide an offline value for each state-action pair, which delivers the offline information in compact form and eliminates the need to retrain with offline data. These offline values are then combined with the online target critic through a Bellman equation weighted by a policy state-action-aware coefficient. This coefficient, derived from a conditional variational auto-encoder (C-VAE), aims to capture the reliability of the offline data at the state-action level. SAMG can be easily integrated with existing Q-function-based O2O RL algorithms. Theoretical analysis shows good optimality and lower estimation error for SAMG. Empirical evaluations demonstrate that SAMG outperforms four state-of-the-art O2O RL algorithms on the D4RL benchmark.
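
As a rough illustration of what a coefficient-weighted Bellman target could look like, the sketch below blends a standard one-step online target with the value given by a frozen offline critic. The abstract does not state the exact equation, so the convex-combination form, the function name `blended_td_target`, and all parameter names are assumptions made for illustration, not the authors' formulation. The coefficient is passed in as a plain number; in SAMG it would be derived from the C-VAE.

```python
def blended_td_target(reward, done, q_online_next, q_offline, coeff, gamma=0.99):
    """Hypothetical coefficient-weighted target for one transition.

    reward        -- immediate reward r
    done          -- True if the transition ends the episode
    q_online_next -- online target critic's value at the next state-action pair
    q_offline     -- frozen offline critic's value for the current state-action pair
    coeff         -- assumed reliability weight in [0, 1] for this state-action pair
    gamma         -- discount factor
    """
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("coeff must lie in [0, 1]")

    # Standard one-step TD target from the online target critic.
    bootstrap = 0.0 if done else 1.0
    online_target = reward + gamma * bootstrap * q_online_next

    # Assumed blend: lean on the frozen offline value where the offline
    # data is judged reliable, and on the online target elsewhere.
    return (1.0 - coeff) * online_target + coeff * q_offline


# With coeff = 0 the target is purely online; with coeff = 1 it is the offline value.
print(blended_td_target(1.0, False, 2.0, 3.0, coeff=0.5, gamma=0.5))
```

Under this assumed form, no offline transitions are sampled during fine-tuning: the offline information enters only through `q_offline`, which matches the abstract's description of freezing the critic instead of retraining on the dataset.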