Title: Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Feb 20, 2024

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

Abstract: Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP significantly outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance. Notably, Smaug-72B is nearly 2% better than any other open-source model on the HuggingFace Open LLM Leaderboard and becomes the first open-source LLM to surpass an average accuracy of 80%.
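
The failure mode described in the abstract can be illustrated with a small numeric sketch. The DPO loss below is written in its commonly cited form (negative log-sigmoid of the scaled log-ratio margin). The `dpop_style_loss` function is only an assumed illustration of a penalty that discourages the preferred log-probability from falling below the reference model's; its exact form, the function names, the weight `lam`, and every number used here are hypothetical and are not taken from this page, so they should be checked against the paper before use.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and
    dispreferred completions; ref_* are the reference model's values.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def dpop_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, lam=5.0):
    """Assumed DPOP-style variant (illustrative, not the authors' code).

    Adds a penalty that is non-zero only when the policy assigns the
    preferred completion a lower log-probability than the reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    penalty = max(0.0, ref_logp_w - logp_w)
    return -log_sigmoid(beta * (margin - lam * penalty))


# Hypothetical log-probabilities: the reference model scores both
# completions equally.
ref_w, ref_l = -10.0, -10.0

# Before training the policy equals the reference.
before = (ref_w, ref_l)
# After training BOTH log-probabilities dropped, the dispreferred one more.
after = (-12.0, -20.0)

dpo_before = dpo_loss(*before, ref_w, ref_l)
dpo_after = dpo_loss(*after, ref_w, ref_l)
dpop_before = dpop_style_loss(*before, ref_w, ref_l)
dpop_after = dpop_style_loss(*after, ref_w, ref_l)

print(f"DPO  loss: {dpo_before:.3f} -> {dpo_after:.3f}")
print(f"DPOP-style loss: {dpop_before:.3f} -> {dpop_after:.3f}")
```

In this toy setting the DPO loss goes down even though the preferred completion became less likely, because only the relative margin matters. The penalised variant goes up instead, which is the behaviour a loss that avoids this failure mode would need.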
