Qing-Shan Jia

Query-Policy Misalignment in Preference-Based Reinforcement Learning

May 27, 2023

Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

Abstract: Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by the cost of human feedback. To improve feedback efficiency, most existing PbRL methods select queries that maximally improve the overall quality of the reward model; counter-intuitively, we find that this does not necessarily lead to improved performance. To explain this, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not align with the RL agent's interests, thus offering little help for policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy queries and a specially designed hybrid experience replay, which together enforce bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. Comprehensive experiments show that our method achieves substantial gains in both human feedback efficiency and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.

* The first two authors contributed equally
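
The two ingredients named in the abstract, near on-policy queries and a hybrid experience replay, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the names `HybridReplayBuffer`, `near_on_policy_queries`, `recent_size`, `recent_fraction`, and `recent_k` are hypothetical, and the abstract does not say how "near on-policy" or "hybrid" are defined. Here they are assumed to mean "drawn from the newest trajectories" and "half recent, half uniform", respectively.

```python
import random
from collections import deque


class HybridReplayBuffer:
    """Toy replay buffer that mixes recent and uniformly sampled transitions.

    `recent_size` and `recent_fraction` are hypothetical knobs for this sketch.
    """

    def __init__(self, capacity=10000, recent_size=100, recent_fraction=0.5, seed=0):
        self.data = deque(maxlen=capacity)
        self.recent_size = recent_size
        self.recent_fraction = recent_fraction
        self.rng = random.Random(seed)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        everything = list(self.data)
        recent = everything[-self.recent_size:]
        n_recent = int(batch_size * self.recent_fraction)
        # First part of the batch: only the newest (near on-policy) transitions.
        batch = [self.rng.choice(recent) for _ in range(n_recent)]
        # Remainder: uniform over the whole buffer, as in ordinary replay.
        batch += [self.rng.choice(everything) for _ in range(batch_size - n_recent)]
        return batch


def near_on_policy_queries(trajectories, num_queries, segment_len, recent_k, rng):
    """Build preference queries (segment pairs) from the `recent_k` newest trajectories.

    Assumes every trajectory has at least `segment_len` steps.
    """
    pool = trajectories[-recent_k:]

    def segment():
        traj = rng.choice(pool)
        start = rng.randrange(len(traj) - segment_len + 1)
        return traj[start:start + segment_len]

    return [(segment(), segment()) for _ in range(num_queries)]


# Usage: 50 dummy trajectories of (trajectory_id, timestep) pairs.
rng = random.Random(0)
trajectories = [[(i, t) for t in range(20)] for i in range(50)]
queries = near_on_policy_queries(trajectories, num_queries=4, segment_len=5,
                                 recent_k=3, rng=rng)

buffer = HybridReplayBuffer(recent_size=100, recent_fraction=0.5)
for step in range(1000):
    buffer.add(step)
batch = buffer.sample(10)
```

In this sketch the human is only asked about segments the current policy is likely to revisit, and the recent half of each replay batch keeps policy updates close to the region where the reward model has just been queried.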

Mind the Gap: Offline Policy Optimization for Imperfect Rewards

Feb 03, 2023

Jianxiong Li, Xiao Hu, Haoran Xu, Jingjing Liu, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

Abstract: The reward function is essential in reinforcement learning (RL), serving as the guiding signal that incentivizes agents to solve given tasks, but it is also notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss on RL agents. In this study, we propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards.

* Accepted at ICLR 2023. The first two authors contributed equally
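
Read schematically, the bi-level structure described in the abstract could be written as below. All symbols are assumptions introduced for illustration: $\Delta r$ is the reward correction term, $\tilde r$ the imperfect reward, $d^{\pi}$, $d^{E}$, and $d^{D}$ the visitation distributions of the learned policy, the expert data, and the offline dataset, $D(\cdot\,\|\,\cdot)$ some divergence, and $\alpha > 0$ a pessimism weight. The abstract does not fix the divergence or the sign convention of the correction, so this is a sketch rather than the paper's exact objective.

```latex
\begin{aligned}
\min_{\Delta r}\quad & D\!\left(d^{\pi^{*}_{\Delta r}} \,\middle\|\, d^{E}\right) \\
\text{s.t.}\quad & \pi^{*}_{\Delta r} \in \arg\max_{\pi}\;
  \mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[\tilde r(s,a) + \Delta r(s,a)\right]
  \;-\; \alpha\, D\!\left(d^{\pi} \,\middle\|\, d^{D}\right)
\end{aligned}
```

The upper level matches the induced policy's visitation distribution to the expert's, while the penalty in the lower level keeps the policy close to the data, which is one common way to express pessimism in offline RL.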

Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method

Oct 31, 2021

Kuo Li, Qing-Shan Jia

Abstract: We discuss the problem of decentralized multi-agent reinforcement learning (MARL) in this work. In our setting, the global state, action, and reward are assumed to be fully observable, while each agent keeps its local policy private, so it cannot be shared with others. The agents are connected by a communication graph, over which they can exchange information with their neighbors. The agents make individual decisions and cooperate to reach a higher accumulated reward. Towards this end, we first propose a decentralized actor-critic (AC) setting. Then, policy evaluation and policy improvement algorithms are designed for Markov Decision Processes (MDPs) with discrete and continuous state-action spaces, respectively. Furthermore, a convergence analysis is given for the discrete-space case, which guarantees that the policy improves by alternating between policy evaluation and policy improvement. To validate the effectiveness of the algorithms, we design experiments and compare them with previous algorithms, e.g., Q-learning (Watkins and Dayan, 1992) and MADDPG (Lowe et al., 2017). The results show that our algorithms perform better in terms of both learning speed and final performance. Moreover, the algorithms can be executed in an off-policy manner, which greatly improves data efficiency compared with on-policy algorithms.

An Actor-Critic Method for Simulation-Based Optimization

Oct 31, 2021

Kuo Li, Qing-Shan Jia, Jiaqi Yan

Abstract: We focus on a simulation-based optimization problem of choosing the best design from the feasible space. Although the simulation model can be queried with finite samples, its internal processing rule cannot be utilized in the optimization process. We formulate the sampling process as a policy search problem and give a solution from the perspective of Reinforcement Learning (RL). Concretely, the Actor-Critic (AC) framework is applied, where the critic serves as a surrogate model to predict the performance of unknown designs, whereas the actor encodes the sampling policy to be optimized. We design the updating rule and propose two algorithms for the cases where the feasible space is continuous and discrete, respectively. Experiments are designed to validate the effectiveness of the proposed algorithms, including two toy examples, which intuitively explain the algorithms, and two more complex tasks, i.e., an adversarial attack task and an RL task, which validate their effectiveness in large-scale problems. The results show that the proposed algorithms can successfully deal with these problems. In particular, in the RL task, our methods give a new perspective on robot control by treating the task as a simulation model and solving it by optimizing the policy-generating process, while existing works commonly optimize the policy itself directly.
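
For the discrete case, the division of labor between critic and actor can be illustrated with a small sketch. It assumes a black-box `simulate(design, rng)` function (hypothetical, as are `optimize`, `noisy_simulator`, and all parameter values) and uses a standard gradient-bandit style update with a running-mean critic. This is a stand-in chosen for simplicity, not the updating rule proposed in the paper.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def optimize(simulate, num_designs, steps=3000, actor_lr=0.1, seed=0):
    """Actor-critic style search over a finite set of designs."""
    rng = random.Random(seed)
    prefs = [0.0] * num_designs    # actor: logits of the sampling policy
    values = [0.0] * num_designs   # critic: surrogate estimate of each design's performance
    counts = [0] * num_designs
    for _ in range(steps):
        probs = softmax(prefs)
        design = rng.choices(range(num_designs), weights=probs)[0]
        outcome = simulate(design, rng)  # one noisy query of the black box
        # Critic update: running mean of observed outcomes for this design.
        counts[design] += 1
        values[design] += (outcome - values[design]) / counts[design]
        # Actor update: policy-gradient step with the critic as baseline.
        baseline = sum(p * v for p, v in zip(probs, values))
        advantage = values[design] - baseline
        for d in range(num_designs):
            grad = (1.0 - probs[d]) if d == design else -probs[d]
            prefs[d] += actor_lr * advantage * grad
    return softmax(prefs), values


def noisy_simulator(design, rng):
    """Hypothetical black box: true mean performance plus Gaussian noise."""
    true_means = [0.1, 0.5, 0.9, 0.3]
    return true_means[design] + rng.gauss(0.0, 0.1)


policy, critic = optimize(noisy_simulator, num_designs=4)
best = max(range(4), key=lambda d: policy[d])
```

The optimizer only ever sees sampled outcomes, never the simulator's internals, which matches the black-box assumption in the abstract; the returned `policy` concentrates on designs the critic rates highly.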
