Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Process Supervision-Guided Policy Optimization for Code Generation

Oct 23, 2024

Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan

Figure 1 for Process Supervision-Guided Policy Optimization for Code Generation

Figure 2 for Process Supervision-Guided Policy Optimization for Code Generation

Figure 3 for Process Supervision-Guided Policy Optimization for Code Generation

Figure 4 for Process Supervision-Guided Policy Optimization for Code Generation

Share this with someone who'll enjoy it:

Abstract:Reinforcement Learning (RL) with unit test feedback has enhanced large language models (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our approach increases our in-house LLM's pass rate from 28.2% to 29.8% on LiveCodeBench and from 31.8% to 35.8% on our internal benchmark. Our experimental results highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for long-horizon scenarios.

* 14 pages, 5 figures

View paper on

Share this with someone who'll enjoy it:

Title:Process Supervision-Guided Policy Optimization for Code Generation

Paper and Code