Swinburne University of Technology
Abstract:Context: This paper provides an in-depth examination of the generation and evaluation of Metamorphic Relations (MRs) using GPT models developed by OpenAI, with a particular focus on the capabilities of GPT-4 in software testing environments. Objective: The aim is to examine the quality of MRs produced by GPT-3.5 and GPT-4 for a specific System Under Test (SUT) adopted from an earlier study, and to introduce and apply an improved set of evaluation criteria for a diverse range of SUTs. Method: The initial phase evaluates MRs generated by GPT-3.5 and GPT-4 using criteria from a prior study, followed by an application of an enhanced evaluation framework on MRs created by GPT-4 for a diverse range of nine SUTs, varying from simple programs to complex systems incorporating AI/ML components. A custom-built GPT evaluator, alongside human evaluators, assessed the MRs, enabling a direct comparison between automated and human evaluation methods. Results: The study finds that GPT-4 outperforms GPT-3.5 in generating accurate and useful MRs. With the advanced evaluation criteria, GPT-4 demonstrates a significant ability to produce high-quality MRs across a wide range of SUTs, including complex systems incorporating AI/ML components. Conclusions: GPT-4 exhibits advanced capabilities in generating MRs suitable for various applications. The research underscores the growing potential of AI in software testing, particularly in the generation and evaluation of MRs, and points towards the complementarity of human and AI skills in this domain.
Abstract:While ChatGPT is a well-known artificial intelligence chatbot being used to answer human's questions, one may want to discover its potential in advancing software testing. We examine the capability of ChatGPT in advancing the intelligence of software testing through a case study on metamorphic testing (MT), a state-of-the-art software testing technique. We ask ChatGPT to generate candidates of metamorphic relations (MRs), which are basically necessary properties of the object program and which traditionally require human intelligence to identify. These MR candidates are then evaluated in terms of correctness by domain experts. We show that ChatGPT can be used to generate new correct MRs to test several software systems. Having said that, the majority of MR candidates are either defined vaguely or incorrect, especially for systems that have never been tested with MT. ChatGPT can be used to advance software testing intelligence by proposing MR candidates that can be later adopted for implementing tests; but human intelligence should still inevitably be involved to justify and rectify their correctness.
Abstract:Software testing is often hindered where it is impossible or impractical to determine the correctness of the behaviour or output of the software under test (SUT), a situation known as the oracle problem. An example of an area facing the oracle problem is automatic image classification, using machine learning to classify an input image as one of a set of predefined classes. An approach to software testing that alleviates the oracle problem is metamorphic testing (MT). While traditional software testing examines the correctness of individual test cases, MT instead examines the relations amongst multiple executions of test cases and their outputs. These relations are called metamorphic relations (MRs): if an MR is found to be violated, then a fault must exist in the SUT. This paper examines the problem of classifying images containing visually hidden markers called Artcodes, and applies MT to verify and enhance the trained classifiers. This paper further examines two MRs, Separation and Occlusion, and reports on their capability in verifying the image classification using one-way analysis of variance (ANOVA) in conjunction with three other statistical analysis methods: t-test (for unequal variances), Kruskal-Wallis test, and Dunnett's test. In addition to our previously-studied classifier, that used Random Forests, we introduce a new classifier that uses a support vector machine, and present its MR-augmented version. Experimental evaluations across a number of performance metrics show that the augmented classifiers can achieve better performance than non-augmented classifiers. This paper also analyses how the enhanced performance is obtained.
Abstract:Training multiple-layered deep neural networks (DNNs) is difficult. The standard practice of using a large number of samples for training often does not improve the performance of a DNN to a satisfactory level. Thus, a systematic training approach is needed. To address this need, we introduce an innovative approach of using metamorphic relations (MRs) as "knowledge carriers" to train DNNs. Based on the concept of metamorphic testing and MRs (which play the role of a test oracle in software testing), we make use of the notion of metamorphic group of inputs as concrete instances of MRs (which are abstractions of knowledge) to train a DNN in a systematic and effective manner. To verify the viability of our training approach, we have conducted a preliminary experiment to compare the performance of two DNNs: one trained with MRs and the other trained without MRs. We found that the DNN trained with MRs has delivered a better performance, thereby confirming that our approach of using MRs as knowledge carriers to train DNNs is promising. More work and studies, however, are needed to solidify and leverage this approach to generate widespread impact on effective DNN training.