Title: Best Policy Learning from Trajectory Preference Feedback

Jan 31, 2025

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Abstract: We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards. This approach is useful for post-training optimization of generative AI models during multi-turn user interactions, where preference feedback is more robust than handcrafted reward models. In this setting, learning is driven by both an offline preference dataset -- collected from a rater of unknown 'competence' -- and online data collected with pure exploration. Since offline datasets may exhibit out-of-distribution (OOD) biases, principled online data collection is necessary. To address this, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling, that maintains independent posteriors over the true reward model and transition dynamics. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret of $\mathsf{PSPL}$. Since the exact algorithm can be computationally impractical, we also provide an approximate version that outperforms existing baselines.
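
To make "noisy binary preferences over trajectory pairs" concrete, the following is a minimal sketch of one common way such feedback is modeled. It is not taken from the paper: the logistic (Bradley-Terry style) form, the scalar `competence` parameter, the tabular `reward` dictionary, and all function names are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def trajectory_return(trajectory, reward):
    """Sum per-step rewards along a trajectory of (state, action) pairs.

    `reward` is a hypothetical tabular reward: a dict keyed by (state, action).
    """
    return sum(reward[(state, action)] for state, action in trajectory)


def preference_probability(traj_a, traj_b, reward, competence):
    """Probability that the rater prefers traj_a over traj_b.

    Assumed model: logistic in the return gap, scaled by `competence`.
    competence = 0 gives a coin flip; larger values give a more reliable rater.
    """
    gap = trajectory_return(traj_a, reward) - trajectory_return(traj_b, reward)
    return sigmoid(competence * gap)


def sample_preference(traj_a, traj_b, reward, competence, rng):
    """Draw one noisy binary label: 1 if traj_a is preferred, else 0."""
    p = preference_probability(traj_a, traj_b, reward, competence)
    return 1 if rng.random() < p else 0


# Toy example with two actions in a single state (illustrative values only).
reward = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
traj_a = [("s0", "right"), ("s0", "right")]  # return 2.0
traj_b = [("s0", "left"), ("s0", "right")]   # return 1.0

rng = random.Random(0)
labels = [sample_preference(traj_a, traj_b, reward, 2.0, rng) for _ in range(1000)]
print("fraction preferring traj_a:", sum(labels) / len(labels))
```

Under this assumed model, a rater with low `competence` produces labels that carry little information about the underlying reward, which is one way to read the abstract's point that the offline dataset alone may be unreliable and that further online data collection is needed.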