Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

May 20, 2024

Ermo Hua, Biqing Qi, Kaiyan Zhang, Yue Yu, Ning Ding, Xingtai Lv, Kai Tian, Bowen Zhou

Figure 1 for Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

Figure 2 for Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

Figure 3 for Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

Figure 4 for Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

Share this with someone who'll enjoy it:

Abstract:Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two fundamental processes for enhancing the capabilities of Language Models (LMs) post pre-training, aligning them better with human preferences. Although SFT advances in training efficiency, RLHF delivers better alignment, thus they are often combined. However, common practices simply apply them sequentially without unifying their optimization targets, resulting in a trade-off between fitting different objectives, and ignoring the opportunities to bridge the paradigm gap and take the strength from both. To obtain a unified understanding, we interpret SFT and RLHF using two sub-processes -- Preference Estimation and Transition Optimization -- defined at token level within the Markov Decision Process (MDP) framework. This modeling shows that SFT is only a specialized case of RLHF with inferior estimation and optimization. RLHF evaluates the quality of model's entire generated answer, whereas SFT only scores predicted tokens based on preceding tokens from target answers. Therefore, SFT overestimates the ability of model, leading to inferior optimization. Building on this view, we introduce Intuitive Fine-tuning (IFT) to integrate SFT and RLHF into a single process. IFT captures LMs' intuitive sense of the entire answers through a temporal residual connection, while using a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably or even superiorly to sequential recipes of SFT and some typical alignment methods across several tasks, particularly those requires generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT.

View paper on

Share this with someone who'll enjoy it:

Title:Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process

Paper and Code