Title: Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

Feb 18, 2025

Shubham Parashar, Blake Olson, Sambhav Khurana, Eric Li, Hongyi Ling, James Caverlee, Shuiwang Ji

Abstract: We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.
