Abstract:System testing is essential in any software development project to ensure that the final products meet the requirements. Creating comprehensive test cases for system testing from requirements is often challenging and time-consuming. This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents. In this study, we collected the SRS documents of five software engineering projects containing functional and non-functional requirements, which were implemented, tested, and delivered by respective developer teams. For generating test case designs, we used ChatGPT-4o Turbo model. We employed prompt-chaining, starting with an initial context-setting prompt, followed by prompts to generate test cases for each use case. We assessed the quality of the generated test case designs through feedback from the same developer teams as mentioned above. Our experiments show that about 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant. Notably, 15 percent of the valid test cases were previously not considered by developers in their testing. We also tasked ChatGPT with identifying redundant test cases, which were subsequently validated by the respective developers to identify false positives and to uncover any redundant test cases that may have been missed by the developers themselves. This study highlights the potential of leveraging LLMs for test generation from the Requirements Specification document and also for assisting developers in quickly identifying and addressing redundancies, ultimately improving test suite quality and efficiency of the testing procedure.
Abstract:We introduce mEdIT, a multi-lingual extension to CoEdIT -- the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such as Grammatik korrigieren (German) or Parafrasee la oraci\'on (Spanish). We build mEdIT by curating data from multiple publicly available human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance on many multi-lingual text editing benchmarks against other multilingual LLMs. We also find that mEdIT generalizes effectively to new languages over multilingual baselines. We publicly release our data, code, and trained models at https://github.com/vipulraheja/medit.
Abstract:As the text generation capabilities of large language models become increasingly prominent, recent studies have focused on controlling particular aspects of the generated text to make it more personalized. However, most research on controllable text generation focuses on controlling the content or modeling specific high-level/coarse-grained attributes that reflect authors' writing styles, such as formality, domain, or sentiment. In this paper, we focus on controlling fine-grained attributes spanning multiple linguistic dimensions, such as lexical and syntactic attributes. We introduce a novel benchmark to train generative models and evaluate their ability to generate personalized text based on multiple fine-grained linguistic attributes. We systematically investigate the performance of various large language models on our benchmark and draw insights from the factors that impact their performance. We make our code, data, and pretrained models publicly available.
Abstract:This study evaluates the effectiveness of various large language models (LLMs) in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most effective for different tasks. Our research systematically assesses some of the publicly available LLMs such as Google Bard, ChatGPT, GitHub Copilot Chat, and Microsoft Copilot across diverse tasks commonly encountered by undergraduate computer science students. These tasks include code generation, explanation, project ideation, content generation, class assignments, and email composition. Evaluation for these tasks was carried out by junior and senior students in computer science, and provides insights into the models' strengths and limitations. This study aims to guide students in selecting suitable LLMs for any specific task and offers valuable insights on how LLMs can be used constructively by students and instructors.
Abstract:Generating unit tests is a crucial task in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research aims to experimentally investigate the effectiveness of LLMs, specifically exemplified by ChatGPT, for generating unit test scripts for Python programs, and how the generated test cases compare with those generated by an existing unit test generator (Pynguin). For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code. The generated test cases are evaluated based on criteria such as coverage, correctness, and readability. Our results show that ChatGPT's performance is comparable with Pynguin in terms of coverage. At the same time, ChatGPT's ability to generate tests is superior to Pynguin, as the latter is not able to generate test cases for Category 1. We also find that about 39% and 28% of assertions generated by ChatGPT for Category 2 and 3, respectively, were incorrect. Our results also show that there is minimal overlap in missed statements between ChatGPT and Pynguin, thus, suggesting that a combination of both tools may enhance unit test generation performance. Finally, prompt engineering improved ChatGPT's performance, achieving an average 28% coverage improvement in Category 2 and 15% improvement in Category 3 after about 4 iterations.
Abstract:This paper presents a comprehensive comparative analysis of Large Language Models (LLMs) for generation of code documentation. Code documentation is an essential part of the software writing process. The paper evaluates models such as GPT-3.5, GPT-4, Bard, Llama2, and Starchat on various parameters like Accuracy, Completeness, Relevance, Understandability, Readability and Time Taken for different levels of code documentation. Our evaluation employs a checklist-based system to minimize subjectivity, providing a more objective assessment. We find that, barring Starchat, all LLMs consistently outperform the original documentation. Notably, closed-source models GPT-3.5, GPT-4, and Bard exhibit superior performance across various parameters compared to open-source/source-available LLMs, namely LLama 2 and StarChat. Considering the time taken for generation, GPT-4 demonstrated the longest duration, followed by Llama2, Bard, with ChatGPT and Starchat having comparable generation times. Additionally, file level documentation had a considerably worse performance across all parameters (except for time taken) as compared to inline and function level documentation.
Abstract:The emergence of Large language models (LLMs) is expected to have a major impact on education. This paper explores the potential of using ChatGPT, an LLM, as a virtual Teaching Assistant (TA) in an Introductory Programming Course. We evaluate ChatGPT's capabilities by comparing its performance with that of human TAs in some TA functions. The TA functions which we focus on include (1) solving programming assignments, (2) grading student code submissions, and (3) providing feedback to undergraduate students in an introductory programming course. Firstly, we investigate how closely ChatGPT's solutions align with those submitted by students. This analysis goes beyond code correctness and also considers code quality. Secondly, we assess ChatGPT's proficiency in grading student code submissions using a given grading rubric and compare its performance with the grades assigned by human TAs. Thirdly, we analyze the quality and relevance of the feedback provided by ChatGPT. This evaluation considers how well ChatGPT addresses mistakes and offers suggestions for improvement in student solutions from both code correctness and code quality perspectives. We conclude with a discussion on the implications of integrating ChatGPT into computing education for automated grading, personalized learning experiences, and instructional support.
Abstract:Large language models (LLMs) such as ChatGPT and Google Bard have garnered significant attention in the academic community. Previous research has evaluated these LLMs for various applications such as generating programming exercises and solutions. However, these evaluations have predominantly been conducted by instructors and researchers, not considering the actual usage of LLMs by students. This study adopts a student-first approach to comprehensively understand how undergraduate computer science students utilize ChatGPT, a popular LLM, released by OpenAI. We employ a combination of student surveys and interviews to obtain valuable insights into the benefits, challenges, and suggested improvements related to ChatGPT. Our findings suggest that a majority of students (over 57%) have a convincingly positive outlook towards adopting ChatGPT as an aid in coursework-related tasks. However, our research also highlights various challenges that must be resolved for long-term acceptance of ChatGPT amongst students. The findings from this investigation have broader implications and may be applicable to other LLMs and their role in computing education.
Abstract:In recent times, large language models (LLMs) have shown impressive performance on various document-level tasks such as document classification, summarization, and question-answering. However, research on understanding their capabilities on the task of self-contradictions in long documents has been very limited. In this work, we introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains, varying document lengths, self-contradictions types, and scope. We then analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context. We release the dataset and all the code associated with the experiments.
Abstract:We present Speakerly, a new real-time voice-based writing assistance system that helps users with text composition across various use cases such as emails, instant messages, and notes. The user can interact with the system through instructions or dictation, and the system generates a well-formatted and coherent document. We describe the system architecture and detail how we address the various challenges while building and deploying such a system at scale. More specifically, our system uses a combination of small, task-specific models as well as pre-trained language models for fast and effective text composition while supporting a variety of input modes for better usability.