Title: RLTHF: Targeted Human Feedback for LLM Alignment

Feb 19, 2025

Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha (+4 more)

Abstract: Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF uses a reward model's reward distribution to identify hard-to-annotate samples that the LLM has mislabeled, then iteratively enhances alignment by integrating strategic human corrections while leveraging the LLM's correctly labeled samples. Evaluations on the HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF's strategic data curation.
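
The selection step described in the abstract can be illustrated with a rough sketch. The abstract does not specify the exact selection rule, so everything below is an assumption made for illustration: the `Pair` layout, the helper names, the use of the reward margin between the labeled "chosen" and "rejected" responses as the signal, and the rule of sending the lowest-margin pairs to human annotators. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical layout: (prompt, response_a, response_b, label),
# where label 0 means response_a is preferred and 1 means response_b.
Pair = Tuple[str, str, str, int]
RewardFn = Callable[[str, str], float]


def reward_margin(pair: Pair, reward_fn: RewardFn) -> float:
    """Reward of the labeled-chosen response minus the labeled-rejected one."""
    prompt, a, b, label = pair
    chosen, rejected = (a, b) if label == 0 else (b, a)
    return reward_fn(prompt, chosen) - reward_fn(prompt, rejected)


def select_for_human_review(
    pairs: List[Pair], reward_fn: RewardFn, budget_fraction: float
) -> List[int]:
    """Return indices of the pairs with the lowest reward margin.

    Assumption: a low or negative margin means the reward model disagrees
    with the current (LLM-produced) label, so the pair is a likely mislabel.
    """
    margins = [reward_margin(p, reward_fn) for p in pairs]
    k = int(len(pairs) * budget_fraction)
    order = sorted(range(len(pairs)), key=lambda i: margins[i])
    return order[:k]


def apply_human_corrections(
    pairs: List[Pair],
    indices: List[int],
    human_label_fn: Callable[[str, str, str], int],
) -> List[Pair]:
    """Replace the labels of the selected pairs with human labels."""
    corrected = list(pairs)
    for i in indices:
        prompt, a, b, _ = corrected[i]
        corrected[i] = (prompt, a, b, human_label_fn(prompt, a, b))
    return corrected
```

In an iterative setup, one would retrain the reward model on the corrected pairs and repeat the selection until the annotation budget (for example, the 6-7% reported in the abstract) is spent; that outer loop is omitted here because the abstract gives no details about it.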
