Title: Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

Dec 12, 2024

Paria Rashidinejad, Yuandong Tian

Abstract: Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of preference optimization and develop a novel technique that dynamically updates preference labels toward certain "stationary labels", resulting in diminishing gradients for untrustworthy samples. Empirically, POWER with dynamic labels (POWER-DL) consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and 11.5 points on Arena-Hard over DPO, while also improving or maintaining performance on downstream tasks such as mathematical reasoning. Strong theoretical guarantees and empirical results demonstrate the promise of POWER-DL in mitigating reward hacking.
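The remark about diminishing gradients can be illustrated with a generic soft-label logistic preference loss. This is only a sketch under stated assumptions, not the paper's POWER-DL objective or its label update rule: the names `soft_label_loss` and `grad_wrt_margin` are hypothetical, `margin` stands in for whatever reward difference the model assigns to the chosen versus rejected response, and `label` is a soft preference label in (0, 1). For this generic loss, the gradient with respect to the margin is `sigmoid(margin) - label`, so it shrinks to zero as the label approaches the model's own predicted preference.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_label_loss(margin: float, label: float) -> float:
    """Cross-entropy between a soft preference label and sigmoid(margin).

    Hypothetical helper for illustration; not the loss used in the paper.
    """
    p = sigmoid(margin)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def grad_wrt_margin(margin: float, label: float) -> float:
    """Analytic derivative of soft_label_loss with respect to margin."""
    return sigmoid(margin) - label


# With a hard label of 1.0 the sample keeps pushing the margin up;
# as the label moves toward sigmoid(margin), the gradient shrinks.
margin = 0.0
for label in (1.0, 0.75, 0.5):
    print(label, grad_wrt_margin(margin, label))
```

How the labels are actually moved, and what the "stationary labels" are, is specified in the paper itself; the loop above only varies the label by hand to show the effect on the gradient.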

* 46 pages, 3 figures
