Title: Self-Generated Critiques Boost Reward Modeling for Language Models

Nov 25, 2024

Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang (+3 more)

Abstract: Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in natural language. We hypothesize that predicting both critiques and the scalar reward improves reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: it first generates and filters high-quality critiques, then jointly fine-tunes on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of the generated critiques in rectifying flawed reasoning steps, with gains of 2.5%-3.2% in reasoning accuracy.

* 20 pages
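
The two-stage process described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the filtering rule or the training objective, so the agreement-based filter, the Bradley-Terry pairwise loss, the `critique_weight` mixing parameter, and all names below are assumptions chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Critique:
    text: str
    score: float  # hypothetical: quality score the critique assigns to a response


def filter_critiques(chosen, rejected):
    """Stage 1 (assumed rule): keep critique pairs whose scores agree with
    the human preference label, i.e. the chosen response scores higher."""
    return [(c, r) for c in chosen for r in rejected if c.score > r.score]


def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss, -log(sigmoid(r_chosen - r_rejected)),
    written in a numerically stable softplus form."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def critique_nll(token_logprobs):
    """Mean negative log-likelihood of the critique tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def joint_loss(r_chosen, r_rejected, token_logprobs, critique_weight=0.5):
    """Stage 2 (assumed form): weighted sum of the critique-generation loss
    and the reward-prediction loss."""
    return (critique_weight * critique_nll(token_logprobs)
            + (1.0 - critique_weight) * preference_loss(r_chosen, r_rejected))


if __name__ == "__main__":
    kept = filter_critiques(
        [Critique("clear and correct", 8.0), Critique("misses the point", 3.0)],
        [Critique("contains an arithmetic error", 5.0)],
    )
    print(len(kept), "critique pair(s) kept")
    print(joint_loss(1.2, 0.4, [-0.5, -1.0, -0.2]))
```

In this sketch, setting `critique_weight` to 0 reduces the objective to a standard scalar reward model, while larger values put more emphasis on critique generation; how the two terms are actually balanced in Critic-RM is not stated in the abstract.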
