Title: Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

May 31, 2024

Yueqin Yin, Zhendong Wang, Yujia Xie, Weizhu Chen, Mingyuan Zhou

[Figures 1-4 for Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment]

Abstract: Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To overcome this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model in conjunction with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with insights from historical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks, including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization, and outperforms offline self-play methods like SPIN. Our code is available at https://github.com/yinyueqin/SAPO
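
The two off-policy ingredients named in the abstract, an EMA model and a replay buffer, can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that): the names `ema_update` and `ReplayBuffer`, the `decay` value, and the use of a plain dict of floats in place of real model weights are all assumptions made for illustration.

```python
import random
from collections import deque


def ema_update(ema_params, policy_params, decay=0.99):
    """Move the EMA copy toward the current policy, in place.

    Hypothetical helper: each parameter becomes
    decay * ema + (1 - decay) * policy.
    Plain floats stand in for model weight tensors.
    """
    for name, value in policy_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


class ReplayBuffer:
    """Fixed-capacity store of (prompt, chosen, rejected) triples.

    Hypothetical class: the oldest entries are evicted first once
    capacity is reached, so sampled batches mix recent and older pairs.
    """

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, prompt, chosen, rejected):
        self._items.append((prompt, chosen, rejected))

    def sample(self, batch_size, rng=random):
        # Never request more items than the buffer currently holds.
        k = min(batch_size, len(self._items))
        return rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)
```

In a training loop of this general shape, one would presumably generate a rejected response with the EMA model, `add` the resulting triple to the buffer, `sample` a batch for the preference loss, and call `ema_update` after each policy step. The exact schedule used by SAPO is not specified in the abstract.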
