Title: Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Apr 23, 2024

Amir Saeidi, Shivanshu Verma, Chitta Baral

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets and exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and that employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.
