Mark Harman

University College London, Facebook London

Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges

Apr 23, 2025

Mark Harman, Peter O'Hearn, Shubho Sengupta

Abstract:Despite decades of research and practice in automated software testing, several fundamental concepts remain ill-defined and under-explored, yet offer enormous potential real-world impact. We show that these concepts raise exciting new challenges in the context of Large Language Models for software test generation. More specifically, we formally define and investigate the properties of hardening and catching tests. A hardening test is one that seeks to protect against future regressions, while a catching test is one that catches such a regression or a fault in new functionality introduced by a code change. Hardening tests can be generated at any time and may become catching tests when a future regression is caught. We also define and motivate the Catching 'Just-in-Time' (JiTTest) Challenge, in which tests are generated 'just-in-time' to catch new faults before they land into production. We show that any solution to Catching JiTTest generation can also be repurposed to catch latent faults in legacy code. We enumerate possible outcomes for hardening and catching tests and JiTTests, and discuss open research problems, deployment options, and initial results from our work on automated LLM-based hardening at Meta. This paper was written to accompany the keynote by the authors at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025. Author order is alphabetical. The corresponding author is Mark Harman.

* To Appear as keynote paper at FSE 2025

LLMs Love Python: A Study of LLMs' Bias for Programming Languages and Libraries

Mar 21, 2025

Lukas Twist, Jie M. Zhang, Mark Harman, Don Syme, Joost Noppen, Detlef Nauck

Abstract:Programming language and library choices are crucial to software reliability and security. Poor or inconsistent choices can lead to increased technical debt, security vulnerabilities, and even catastrophic failures in safety-critical systems. As Large Language Models (LLMs) play an increasing role in code generation, it is essential to understand how they make these decisions. However, little is known about their preferences when selecting programming languages and libraries for different coding tasks. To fill this gap, this study provides the first in-depth investigation into LLM preferences for programming languages and libraries used when generating code. We assess the preferences of eight diverse LLMs by prompting them to complete various coding tasks, including widely-studied benchmarks and the more practical task of generating the initial structural code for new projects (a crucial step that often determines a project's language or library choices). Our findings reveal that LLMs heavily favour Python when solving language-agnostic problems, using it in 90%-97% of cases for benchmark tasks. Even when generating initial project code where Python is not a suitable language, it remains the most-used language in 58% of instances. Moreover, LLMs contradict their own language recommendations in 83% of project initialisation tasks, raising concerns about their reliability in guiding language selection. Similar biases toward well-established libraries further create serious discoverability challenges for newer open-source projects. These results highlight the need to improve LLMs' adaptability to diverse programming contexts and to develop mechanisms for mitigating programming language and library bias.

* 12 pages, 1 figure

Rethinking the Influence of Source Code on Test Case Generation

Sep 14, 2024

Dong Huang, Jie M. Zhang, Mingzhe Du, Mark Harman, Heming Cui

Abstract:Large language models (LLMs) have been widely applied to assist test generation with the source code under test provided as the context. This paper aims to answer the question: If the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs in generating correct, high-coverage, and bug-revealing tests. For instance, in the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. For the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: using it on mature code may help protect against future regression, but on early-stage immature code, it may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code in generating reliable and bug-revealing tests.

* 23 pages

An Empirical Study on Fairness Improvement with Multiple Protected Attributes

Jul 25, 2023

Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Mark Harman

Abstract:Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on precision and recall when handling multiple protected attributes is about 5 times and 8 times that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.
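
To make concrete what measuring fairness for more than one protected attribute looks like, here is a small Python sketch. The abstract does not say which fairness metrics the study uses, so the choice of statistical parity difference is an assumption, and the predictions and the attributes `sex` and `race` are made-up illustrative data, not results from the paper.

```python
from typing import Sequence

def positive_rate(predictions: Sequence[int], groups: Sequence[int], value: int) -> float:
    """Share of positive predictions among individuals whose group flag equals value."""
    selected = [p for p, g in zip(predictions, groups) if g == value]
    return sum(selected) / len(selected)

def statistical_parity_difference(predictions: Sequence[int], groups: Sequence[int]) -> float:
    """Positive rate of the unprivileged group (0) minus that of the privileged group (1)."""
    return positive_rate(predictions, groups, 0) - positive_rate(predictions, groups, 1)

# Made-up binary predictions and two made-up protected attributes (1 = privileged).
predictions = [1, 1, 0, 0, 1, 0, 1, 0]
sex = [1, 1, 1, 1, 0, 0, 0, 0]
race = [1, 0, 1, 0, 1, 0, 1, 0]

for name, attribute in [("sex", sex), ("race", race)]:
    print(name, statistical_parity_difference(predictions, attribute))
```

On this toy data the same predictions are balanced with respect to one attribute and skewed with respect to the other, so a report limited to a single attribute would not show the disparity.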

Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey

Jul 15, 2022

Max Hort, Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Mark Harman

Abstract:This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 234 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technology they apply. We investigate how existing bias mitigation methods are evaluated in the literature. In particular, we consider datasets, metrics and benchmarking. Based on the gathered insights (e.g., What is the most popular fairness metric? How many datasets are used for evaluating bias mitigation methods?), we hope to support practitioners in making informed choices when developing and evaluating new bias mitigation methods.

* 22 pages, 4 figures

A Comprehensive Empirical Study of Bias Mitigation Methods for Software Fairness

Jul 07, 2022

Zhenpeng Chen, Jie M. Zhang, Federica Sarro, Mark Harman

Abstract:Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical evaluation of 17 representative bias mitigation methods, evaluated with 12 Machine Learning (ML) performance metrics, 4 fairness metrics, and 24 types of fairness-performance trade-off assessment, applied to 8 widely-adopted benchmark software decision/prediction tasks. The empirical coverage is comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures compared to previous work on this important operational software characteristic. We find that (1) the bias mitigation methods significantly decrease the values reported by all ML performance metrics (including those not considered in previous work) in a large proportion of the scenarios studied (42% to 75% according to different ML performance metrics); (2) the bias mitigation methods achieve fairness improvement in only approximately 50% of all scenarios and metrics (ranging between 29% and 59% according to the metric used to assess bias/fairness); (3) the bias mitigation methods have a poor fairness-performance trade-off or even lead to decreases in both fairness and ML performance in 37% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, and fairness and ML performance metrics, and there is no 'silver bullet' bias mitigation method demonstrated to be effective for all scenarios studied. The best bias mitigation method that we find outperforms other methods in only 29% of the scenarios. We have made publicly available the scripts and data used in this study in order to allow for future replication and extension of our work.

Leveraging Automated Unit Tests for Unsupervised Code Translation

Oct 13, 2021

Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample

Abstract:With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java → Python and Python → C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
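
The filtering idea described in this abstract, keeping only translations that pass unit tests, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' pipeline: tests are represented as plain input/expected-output pairs, candidates are ordinary Python functions standing in for model-generated translations, and all names (`passes_all`, `filter_translations`, `translation_a` and so on) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected result)

def passes_all(candidate: Callable, tests: Sequence[TestCase]) -> bool:
    """Return True only if the candidate gives the expected result on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as an invalid translation
    return True

def filter_translations(candidates: Sequence[Callable], tests: Sequence[TestCase]) -> List[Callable]:
    """Keep only the candidates that pass the whole test suite."""
    return [c for c in candidates if passes_all(c, tests)]

# Hypothetical candidate translations of a "largest element" function.
def translation_a(xs):
    return max(xs)

def translation_b(xs):
    best = 0  # bug: wrong when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best

def translation_c(xs):
    return xs[len(xs)]  # bug: index out of range

# Assumed test cases capturing the behaviour of the source function.
tests = [(([3, 1, 2],), 3), (([-5, -2],), -2), (([7],), 7)]

kept = filter_translations([translation_a, translation_b, translation_c], tests)
print([c.__name__ for c in kept])
```

Only the surviving candidates would then be paired with the source function to form the tested parallel data used for fine-tuning.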

Ownership at Large -- Open Problems and Challenges in Ownership Management

Apr 15, 2020

John Ahlgren, Maria Eugenia Berezin, Kinga Bojarczuk, Elena Dulskyte, Inna Dvortsova, Johann George, Natalija Gucevska, Mark Harman, Shan He, Ralf Lämmel (+3 more)

Abstract:Software-intensive organizations rely on large numbers of software assets of different types, e.g., source-code files, tables in the data warehouse, and software configurations. Who is the most suitable owner of a given asset changes over time, e.g., due to reorganization and individual function changes. New forms of automation can help suggest more suitable owners for any given asset at a given point in time. By such efforts on ownership health, accountability of ownership is increased. The problem of finding the most suitable owners for an asset is essentially a program comprehension problem: how do we automatically determine who would be best placed to understand, maintain, evolve (and thereby assume ownership of) a given asset? This paper introduces the Facebook Ownesty system, which uses a combination of ultra large scale data mining and machine learning and has been deployed at Facebook as part of the company's ownership management approach. Ownesty processes many millions of software assets (e.g., source-code files) and it takes into account workflow and organizational aspects. The paper sets out open problems and challenges on ownership for the research community with advances expected from the fields of software engineering, programming languages, and machine learning.

* Author order is alphabetical. Contact author: Ralf Lämmel (rlaemmel@acm.org). The subject of the paper is covered by the contact author's keynote at the same conference

WES: Agent-based User Interaction Simulation on Real Infrastructure

Apr 11, 2020

John Ahlgren, Maria Eugenia Berezin, Kinga Bojarczuk, Elena Dulskyte, Inna Dvortsova, Johann George, Natalija Gucevska, Mark Harman, Ralf Lämmel, Erik Meijer (+2 more)

Abstract:We introduce the Web-Enabled Simulation (WES) research agenda, and describe FACEBOOK's WW system. We describe the application of WW to reliability, integrity and privacy at FACEBOOK, where it is used to simulate social media interactions on an infrastructure consisting of hundreds of millions of lines of code. The WES agenda draws on research from many areas of study, including Search Based Software Engineering, Machine Learning, Programming Languages, Multi Agent Systems, Graph Theory, Game AI, and AI Assisted Game Play. We conclude with a set of open problems and research challenges to motivate wider investigation.

* Author order is alphabetical. Correspondence to Mark Harman (markharman@fb.com). This paper appears in GI 2020: 8th International Workshop on Genetic Improvement

Machine Learning Testing: Survey, Landscapes and Horizons

Jun 19, 2019

Jie M. Zhang, Mark Harman, Lei Ma, Yang Liu

Abstract:This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 128 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.
