Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Aug 23, 2024

Ruiyang Xu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, Le Sun

Figure 1 for CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Figure 2 for CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Figure 3 for CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Figure 4 for CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Share this with someone who'll enjoy it:

Abstract:Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models' (LLMs) coding capabilities. However, there is an unignorable programming language bias in existing code benchmarks -- over 95% code generation benchmarks are dominated by Python, leaving the LLMs' capabilities in other programming languages such as Java and C/C++ unknown. Moreover, coding task bias is also crucial. Most benchmarks focus on code generation capability, while benchmarks for code reasoning (given input, reasoning output; and given output, reasoning input), an essential coding capability, are insufficient. Yet, constructing multi-lingual benchmarks can be expensive and labor-intensive, and codes in contest websites such as Leetcode suffer from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. In particular, the construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, which iteratively generates and repairs based on execution feedback. Also, to cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulated various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket has less correlation with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.

* 13pages

View paper on

Share this with someone who'll enjoy it:

Title:CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Paper and Code