Title: Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs

Jan 11, 2024

Ziyu Li, Donghwan Shin

Abstract: Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages, which have enabled various applications in software engineering, such as requirement engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess the code understanding performance of LLMs, especially for the subtle inconsistencies that may arise between code and its semantics described in natural language. In this paper, we propose a novel method to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions, by introducing code mutations to existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We then use these pairs to test the ability of LLMs to correctly detect the inconsistencies. We propose a new LLM testing method, called Mutation-based Consistency Testing (MCT), and conduct a case study on two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X, which consists of six programming languages (Python, C++, Java, Go, JavaScript, and Rust). We compare the performance of the LLMs across different types of code mutations and programming languages and analyze the results. We find that the LLMs show significant variation in their code understanding performance and that they have different strengths and weaknesses depending on the mutation type and language.

* This is an author preprint. The published version will be included in the proceedings of CAIN 2024 (co-located with ICSE 2024).
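To make the idea of an inconsistent code-description pair concrete, here is a minimal Python sketch of one mutation type named in the abstract, operator replacement. It is an illustration under stated assumptions, not the authors' tool: the operator mapping `SWAPS`, the class `RelationalOperatorMutator`, the helpers `mutate` and `build_pairs`, and the sample function `below_threshold` are all hypothetical names chosen for this example. The sketch relies only on the standard `ast` module (`ast.unparse` requires Python 3.9 or later) and stops at producing labeled pairs; how such pairs are presented to an LLM is not specified here.

```python
import ast

# Hypothetical mapping for this sketch: each relational operator is
# swapped for its closest neighbor, which shifts a boundary condition.
SWAPS = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt}


class RelationalOperatorMutator(ast.NodeTransformer):
    """Replace one relational operator in the tree, then stop mutating."""

    def __init__(self):
        self.mutated = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.mutated:
            for i, op in enumerate(node.ops):
                swap = SWAPS.get(type(op))
                if swap is not None:
                    node.ops[i] = swap()
                    self.mutated = True
                    break
        return node


def mutate(source):
    """Return mutated source code, or None if no operator could be replaced."""
    tree = ast.parse(source)
    mutator = RelationalOperatorMutator()
    tree = mutator.visit(tree)
    if not mutator.mutated:
        return None
    return ast.unparse(ast.fix_missing_locations(tree))


def build_pairs(description, source):
    """Return (description, code, is_consistent) triples for one task."""
    pairs = [(description, source, True)]
    mutant = mutate(source)
    if mutant is not None:
        pairs.append((description, mutant, False))
    return pairs


# Hypothetical task in the style of a code generation benchmark entry.
DESCRIPTION = "Return True if every number in the list is below threshold t."
ORIGINAL = '''def below_threshold(numbers, t):
    for n in numbers:
        if n >= t:
            return False
    return True
'''

pairs = build_pairs(DESCRIPTION, ORIGINAL)
for text, code, consistent in pairs:
    print("consistent" if consistent else "inconsistent")
    print(code)
```

In this sketch the mutant turns `n >= t` into `n > t`, so a list containing a value equal to the threshold is now accepted even though the description says every number must be below it. A model under test would be expected to flag the second pair as inconsistent and the first as consistent.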
