Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) approaches that reuse previously collected data can improve sample efficiency. However, off-policy learning becomes challenging as the discrepancy grows between the distribution induced by the policy of interest and those of the policies that collected the data. Although well-studied importance sampling and off-policy policy gradient techniques have been proposed to compensate for this discrepancy, they usually require collections of long trajectories, which increases computational complexity and induces additional problems such as vanishing or exploding gradients. Moreover, their generalization to continuous action domains is strictly limited, as they require action probabilities and are therefore unsuitable for deterministic policies. To overcome these limitations, we introduce an alternative off-policy correction algorithm for continuous action spaces, Actor-Critic Off-Policy Correction (AC-Off-POC), which mitigates the potential drawbacks introduced by previously collected data. Through a novel discrepancy measure computed from the agent's most recent action decisions on the states of a randomly sampled batch of transitions, the approach requires neither actual nor estimated action probabilities for any policy and performs an adequate one-step importance sampling. Theoretical results show that the introduced approach achieves a contraction mapping with a unique fixed point, which allows "safe" off-policy learning. Our empirical results suggest that AC-Off-POC consistently improves on the state-of-the-art and attains higher returns in fewer steps than competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.
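To illustrate the general idea only, the following is a minimal sketch, not the authors' exact formulation: it assumes a deterministic policy, a hypothetical squared-distance discrepancy between the actions stored in a sampled batch and the actions the current policy would take on the same states, and an exponential mapping of that discrepancy to a learning-rate weight. All names (`off_policy_weight`, `beta`, `base_lr`) and the toy data are illustrative assumptions.

```python
# Sketch of a probability-free, one-step off-policy correction for a
# deterministic policy. NOT the paper's exact method; names are illustrative.
import numpy as np

def off_policy_weight(policy, states, behavior_actions, beta=1.0):
    """Map the gap between the current policy's actions and the actions
    stored in the sampled batch to a scalar weight in (0, 1]."""
    current_actions = policy(states)            # actions the current policy would take
    gap = np.mean(np.sum((current_actions - behavior_actions) ** 2, axis=-1))
    return np.exp(-beta * gap)                  # small gap -> weight close to 1

# Toy deterministic policy and a sampled batch of off-policy transitions.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2))                 # fixed weights: states (B,4) -> actions (B,2)
policy = lambda s: np.tanh(s @ W)
states = rng.standard_normal((64, 4))
behavior_actions = np.clip(policy(states) + 0.3 * rng.standard_normal((64, 2)), -1.0, 1.0)

weight = off_policy_weight(policy, states, behavior_actions)
base_lr = 3e-4
effective_lr = weight * base_lr                 # scale the Q-learning / policy-update step size
print(f"discrepancy weight: {weight:.3f}, effective lr: {effective_lr:.2e}")
```

The property mirrored here is the one stated in the abstract: the correction is computed solely from the current policy's action decisions on the sampled states, so no stored or estimated action probabilities are needed, and the resulting weight can be used to schedule the learning rate of both the critic and the actor updates.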