Title: DRLC: Reinforcement Learning with Dense Rewards from LLM Critic

Jan 14, 2024

Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng

[Figures 1-4 for DRLC: Reinforcement Learning with Dense Rewards from LLM Critic]

Abstract: Reinforcement learning (RL) can align language models with non-differentiable reward signals, such as human preferences. However, a major challenge arises from the sparsity of these reward signals: typically, there is only one reward for the entire generation. This sparsity can lead to inefficient and unstable learning. In this paper, we introduce a novel framework that leverages the critique ability of LLMs to produce dense rewards throughout the learning process. Our approach incorporates a critic language model alongside the policy model. The critic is prompted with the task description, the question, the policy model's output, and the environment's reward signal, and it provides token- or span-level dense rewards that reflect the quality of each segment of the output. We assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. Experimental results show that incorporating artificial dense rewards in training yields consistent performance gains over the PPO baseline with holistic rewards. Furthermore, in a setting where the same model serves as both policy and critic, we demonstrate that "self-critique" rewards also boost learning efficiency.
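
To make the idea of dense rewards concrete, the following is a minimal sketch of how span-level critic scores might be spread over tokens and combined with the single environment reward. It is not the authors' implementation: the function name `span_rewards_to_token_rewards`, the `(start, end, score)` span format, the choice to add the environment reward on the final token, and the example values are all assumptions made for illustration.

```python
from typing import List, Tuple


def span_rewards_to_token_rewards(
    tokens: List[str],
    critic_spans: List[Tuple[int, int, float]],
    env_reward: float,
) -> List[float]:
    """Build a per-token reward list from critic span scores.

    Hypothetical format: each span is (start, end, score) over token
    indices, with `end` exclusive. The sparse environment reward is
    added to the last token (an assumption, not taken from the paper).
    """
    rewards = [0.0] * len(tokens)
    for start, end, score in critic_spans:
        # Clamp the span so out-of-range indices are ignored.
        for i in range(max(start, 0), min(end, len(tokens))):
            rewards[i] += score
    if tokens:
        rewards[-1] += env_reward
    return rewards


# Illustrative values only: the critic marks "truly wonderful" as good.
tokens = ["The", "movie", "was", "truly", "wonderful", "."]
critic_spans = [(3, 5, 1.0)]
rewards = span_rewards_to_token_rewards(tokens, critic_spans, env_reward=0.5)
print(rewards)  # [0.0, 0.0, 0.0, 1.0, 1.0, 0.5]
```

With a holistic reward alone, only the final position would carry a signal; here the tokens inside the critic's span also receive credit, which is the kind of per-segment feedback the abstract describes.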
