Title: Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

Nov 01, 2023

Mohammed Latif Siddiq, Joanna C. S. Santos

Abstract: With the growing popularity of Large Language Models (LLMs; e.g., GitHub Copilot and ChatGPT) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers be more productive, prior empirical studies have shown that LLMs can generate insecure code. Two factors contribute to insecure code generation. First, existing datasets used to evaluate LLMs do not adequately represent genuine, security-sensitive software engineering tasks. Instead, they are often based on competitive programming challenges or classroom-type coding tasks. In real-world applications, the generated code is integrated into larger codebases, introducing potential security risks. There is a clear absence of benchmarks that focus on evaluating the security of the generated code. Second, existing evaluation metrics primarily focus on the functional correctness of the generated code while ignoring security considerations. Metrics such as pass@k gauge the probability of obtaining correct code within the top k suggestions. Other popular metrics like BLEU, CodeBLEU, ROUGE, and METEOR similarly emphasize functional accuracy, neglecting security implications. In light of these research gaps, this paper describes SALLM, a framework to systematically benchmark LLMs' abilities to generate secure code. The framework has three major components: a novel dataset of security-centric Python prompts, an evaluation environment to test the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
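To make the pass@k metric mentioned in the abstract concrete, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per prompt and c of them pass the functional tests. The function name `pass_at_k` and its parameters are illustrative assumptions, not taken from the paper, and the abstract does not define SALLM's own security metrics, so this covers only the functional-correctness metric it contrasts against.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled generations is correct.

    n: total generations sampled for a prompt (assumed name)
    c: how many of those n passed the functional tests (assumed name)
    k: number of suggestions the user is assumed to look at
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the paper.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

Note that nothing in this computation inspects whether a passing sample is vulnerable, which is the gap the abstract points to.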

* 16 pages
