Title: CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks

Sep 13, 2024

Tianlong Wang, Xueting Han, Jing Bai

Abstract: Post-training large language models (LLMs) to develop reasoning capabilities has proven effective across diverse domains, such as mathematical reasoning and code generation. However, existing methods primarily focus on improving task-specific reasoning and have not adequately addressed the model's ability to generalize across a broader range of reasoning tasks. To tackle this challenge, we introduce Critical Planning Step Learning (CPL), which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns step-level planning preferences to improve the model's planning capabilities and, consequently, its general reasoning capabilities. Furthermore, although existing preference learning approaches such as Direct Preference Optimization (DPO) are effective in many scenarios for aligning LLMs, they struggle with complex multi-step reasoning tasks because they cannot capture fine-grained supervision at each step. We propose Step-level Advantage Preference Optimization (Step-APO), which integrates an advantage estimate for step-level preference pairs, obtained via MCTS, into DPO. This enables the model to learn critical intermediate planning steps more effectively, thereby further improving its generalization in reasoning tasks. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also improves performance on out-of-domain reasoning benchmarks such as ARC-C (+4.0%), BBH (+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).
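To make the idea of folding an advantage estimate into DPO more concrete, here is a minimal sketch for a single step-level preference pair. The `dpo_loss` function is the standard DPO objective. The `step_margin_loss` function is only an illustrative assumption: it treats the advantage gap between the preferred and dispreferred step as a target margin, which is one plausible way to use MCTS advantages and is not taken from the paper. All function and parameter names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are log-probabilities under the policy being trained;
    ref_logp_* are log-probabilities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def step_margin_loss(logp_w: float, logp_l: float,
                     ref_logp_w: float, ref_logp_l: float,
                     adv_w: float, adv_l: float,
                     beta: float = 0.1) -> float:
    """Hypothetical step-level variant (an assumption, not the paper's formula).

    The advantage gap (adv_w - adv_l), e.g. estimated from MCTS values,
    is subtracted from the implicit reward margin, so pairs with a larger
    gap require a larger margin before the loss becomes small.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin - (adv_w - adv_l))
```

With equal advantages for both steps, `step_margin_loss` reduces to `dpo_loss`; as the advantage gap grows, the same policy margin yields a larger loss, so steps that MCTS judges more decisive receive a stronger training signal.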
