Title: Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values

Feb 19, 2025

Hongbo Zhang, Han Cui, Guangsheng Bao, Linyi Yang, Jun Wang, Yue Zhang

Abstract: We introduce Direct Value Optimization (DVO), a reinforcement learning framework for enhancing large language models on complex reasoning tasks. Unlike traditional methods that rely on preference labels, DVO uses value signals at individual reasoning steps and optimizes the model with a mean squared error loss. The key benefit of DVO is its fine-grained supervision, which removes the need for labor-intensive human annotation. Target values in DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a strong alternative in scenarios that lack explicit human preference information.
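
To make the idea of step-level value supervision concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a target value for each reasoning step can be approximated by averaging the final-answer rewards of rollouts continued from that step (a plain Monte Carlo stand-in for the Monte Carlo Tree Search or outcome value model named in the abstract), and that the model exposes one predicted value per step. The function names `estimate_step_targets` and `step_value_mse` and all numbers are hypothetical.

```python
from typing import List, Sequence


def estimate_step_targets(rollout_rewards: Sequence[Sequence[float]]) -> List[float]:
    """Approximate a target value per reasoning step.

    rollout_rewards[t] holds the final-answer rewards (e.g. 1.0 for correct,
    0.0 for incorrect) of rollouts continued after step t. Averaging them is
    a simplified assumption standing in for tree search or a value model.
    """
    targets = []
    for rewards in rollout_rewards:
        if not rewards:
            raise ValueError("each step needs at least one rollout reward")
        targets.append(sum(rewards) / len(rewards))
    return targets


def step_value_mse(predicted: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and target step values."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical three-step reasoning chain with four rollouts per step.
rollouts = [
    [1.0, 0.0, 1.0, 1.0],  # rollouts continued after step 1
    [1.0, 1.0, 1.0, 1.0],  # rollouts continued after step 2
    [0.0, 0.0, 1.0, 0.0],  # rollouts continued after step 3
]
targets = estimate_step_targets(rollouts)
predicted = [0.5, 1.0, 0.25]  # hypothetical model outputs
loss = step_value_mse(predicted, targets)
print(targets, loss)
```

In this sketch every step contributes its own error term, which is the sense in which supervision is finer-grained than a single preference label per response. How the paper derives predicted values from the policy is not stated in the abstract, so that part is left as a plain list of numbers here.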

* preprint
