Title: Beyond Reward: Offline Preference-guided Policy Optimization

May 25, 2023

Yachen Kang, Diyuan Shi, Jinxin Liu, Li He, Donglin Wang



Abstract: This study focuses on the topic of offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions. Instead, the agent is provided with pre-existing offline trajectories and human preferences between pairs of trajectories to extract the dynamics and task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would involve using preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires the separate learning of a scalar reward function, which is assumed to be an information bottleneck. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO further integrates a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available at https://github.com/bkkgbkjb/OPPO.
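For readers unfamiliar with the naive two-stage baseline the abstract contrasts against, the following is a minimal sketch of its first stage, preference-based reward learning. It is not the OPPO method and is not taken from the authors' repository. It assumes a Bradley-Terry preference model and a linear reward over per-step features; the names `fit_reward`, `preference_prob`, and `theta` are hypothetical and chosen only for illustration.

```python
import numpy as np

def trajectory_return(theta, traj):
    # traj has shape (T, d): one feature vector per step.
    # Assumption: the reward is linear in the features, r_t = traj[t] @ theta.
    return float(np.sum(traj @ theta))

def preference_prob(theta, traj_a, traj_b):
    # Bradley-Terry model (assumed here): P(a preferred over b) = sigmoid(R(a) - R(b)).
    diff = trajectory_return(theta, traj_a) - trajectory_return(theta, traj_b)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(theta, pairs):
    # pairs is a list of (preferred, rejected) trajectories.
    # Mean negative log-likelihood of the observed preferences.
    eps = 1e-12
    return float(-np.mean([np.log(preference_prob(theta, a, b) + eps) for a, b in pairs]))

def fit_reward(pairs, dim, lr=0.1, steps=200):
    # Plain gradient descent on the preference loss.
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for a, b in pairs:
            p = preference_prob(theta, a, b)
            # d/dtheta of -log sigmoid(diff) is -(1 - p) * (sum of a's features - sum of b's features).
            grad += -(1.0 - p) * (a.sum(axis=0) - b.sum(axis=0))
        theta -= lr * grad / len(pairs)
    return theta

# Synthetic usage: preference labels come from a hidden linear reward.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -0.5, 0.25])
pairs = []
for _ in range(50):
    a, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
    if trajectory_return(true_theta, a) < trajectory_return(true_theta, b):
        a, b = b, a  # put the preferred trajectory first
    pairs.append((a, b))

initial_loss = preference_loss(np.zeros(3), pairs)
learned_theta = fit_reward(pairs, dim=3)
final_loss = preference_loss(learned_theta, pairs)
```

In the two-stage baseline, a learned scalar reward like this one would then relabel the offline trajectories before an off-the-shelf offline RL algorithm is run. The abstract's argument is that this intermediate scalar reward may act as an information bottleneck, which OPPO avoids by modeling preferences directly.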

