Shiming Xie

Minor DPO reject penalty to increase training robustness

Aug 22, 2024

Shiming Xie, Hong Chen, Fred Yu, Zeye Sun, Xiuyu Wu, Yingfan Hu

Abstract: Learning from human preference is a paradigm used in the fine-tuning step of large language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks. Previously, this relied on reinforcement learning from human feedback (RLHF) to optimize the LLM policy so that it aligns with these preferences without drifting too far from the original model. Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified, RL-free method. Using preference pairs of chosen and rejected data, DPO models the relative log probability as an implicit reward function and optimizes the LLM policy directly with a simple binary cross-entropy objective. DPO is straightforward and easy to understand, and it performs efficiently and well in most cases. In this article, we analyze the working mechanism of $\beta$ in DPO, disclose its syntax difference between the RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification. With these insights, we propose MinorDPO, which is better aligned with the original RL algorithm and increases the stability of the preference optimization process.
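
The binary cross-entropy objective that the abstract attributes to DPO can be illustrated with a minimal sketch of the standard DPO loss for a single preference pair. This is not the MinorDPO loss, which the abstract does not specify. The function and parameter names below are hypothetical, and the inputs are assumed to be summed sequence log probabilities under the trained policy and the frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; the abstract gives no value
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    The implicit reward of a response is beta times the log ratio of
    policy probability to reference probability.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary cross-entropy with the chosen response as the positive class.
    return -log_sigmoid(chosen_reward - rejected_reward)


# When the policy equals the reference model, both implicit rewards are
# zero and the loss equals log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the chosen log probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

In this form, $\beta$ scales both log ratios before they are compared, so a larger value makes the loss more sensitive to any deviation from the reference model.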

* 8 pages, 19 figures

Minor SFT loss for LLM fine-tune to increase performance and reduce model deviation

Aug 20, 2024

Shiming Xie, Hong Chen, Fred Yu, Zeye Sun, Xiuyu Wu

Abstract: Instruct LLMs provide a paradigm for aligning large language models (LLMs) with human preferences. The paradigm consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). It is also used in downstream scenarios to adapt LLMs to specific corpora and applications. Compared to SFT, much more effort has focused on RLHF, and several algorithms have been proposed, such as PPO, DPO, IPO, KTO, and MinorDPO. Meanwhile, most efforts for SFT focus on how to collect, filter, and mix high-quality data. In this article, with insights from DPO and MinorDPO, we propose a training metric for SFT that measures the discrepancy between the optimized model and the original model, and a loss function, MinorSFT, that can increase training effectiveness and reduce the discrepancy between the optimized LLM and the original LLM.

* 8 pages, 5 figures
