Title: EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

Feb 18, 2025

Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yaofeng Sun, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang (+2 more)

Figures 1 to 4 for EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

Abstract: Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs, underpins a broad range of applications, including software refactoring, testing, and optimization. We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models (LLMs). We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. These pairs are systematically generated through program analysis, compiler scheduling, and superoptimization, covering nontrivial structural transformations that demand deep semantic reasoning beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In the most challenging categories, the best accuracies are 62.3% and 68.8%, only modestly above the 50% random baseline for binary classification, indicating significant room for improvement in current models' code reasoning capabilities.
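To make the task definition concrete, here is a minimal Python sketch of what "identical outputs for all possible inputs" means for a pair of programs. It is an illustration only and is not taken from the EquiBench dataset or the authors' tooling: the three functions and the `find_counterexample` helper are hypothetical names introduced here. Note that bounded testing of this kind can only refute equivalence by finding a differing input; it cannot prove equivalence over all inputs, which is why the task calls for semantic reasoning.

```python
from typing import Callable, Iterable, Optional


def sum_loop(n: int) -> int:
    """Sum of 1..n computed with an explicit loop (0 when n <= 0)."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Structurally different but equivalent: closed form, guarded for n <= 0."""
    if n <= 0:
        return 0
    return n * (n + 1) // 2


def sum_off_by_one(n: int) -> int:
    """Looks similar to sum_loop but is NOT equivalent: it stops at n - 1."""
    return sum(range(1, n))


def find_counterexample(
    f: Callable[[int], int],
    g: Callable[[int], int],
    inputs: Iterable[int],
) -> Optional[int]:
    """Return the first input where f and g disagree, or None if none is found.

    A None result only means no disagreement was seen on the inputs tried;
    it is not a proof of equivalence.
    """
    for x in inputs:
        if f(x) != g(x):
            return x
    return None


if __name__ == "__main__":
    domain = range(-50, 200)
    print(find_counterexample(sum_loop, sum_closed_form, domain))
    print(find_counterexample(sum_loop, sum_off_by_one, domain))
```

In this sketch the first pair would be labeled "equivalent" and the second "inequivalent", which matches the binary classification framing in the abstract.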
