Abstract: Large language models (LLMs) have been widely applied to assist test generation, with the source code under test provided as context. This paper aims to answer the question: if the source code under test is incorrect, will LLMs be misguided when generating tests? The effectiveness of test cases is measured by their accuracy, coverage, and bug detection effectiveness. Our evaluation results with five open- and six closed-source LLMs on four datasets demonstrate that incorrect code can significantly mislead LLMs, degrading the correctness, coverage, and bug-revealing ability of the generated tests. For instance, on the HumanEval dataset, LLMs achieve 80.45% test accuracy when provided with task descriptions and correct code, but only 57.12% when given task descriptions and incorrect code. On the APPS dataset, prompts with correct code yield tests that detect 39.85% of the bugs, while prompts with incorrect code detect only 19.61%. These findings have important implications for the deployment of LLM-based testing: applying it to mature code may help protect against future regressions, but applying it to early-stage, immature code may simply bake in errors. Our findings also underscore the need for further research to improve LLMs' resilience against incorrect code when generating reliable and bug-revealing tests.
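The metrics named in this abstract (test accuracy and bug detection effectiveness) suggest a simple scoring recipe. The sketch below is a minimal, hypothetical illustration of how one might score an LLM-generated test suite against a reference solution and a buggy variant; the helper names and the toy task are my own assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch (not the paper's harness): a generated test suite is
# "accurate" if it accepts the correct reference solution, and "bug-revealing"
# if it additionally rejects a buggy variant of the code under test.

def passes(impl, test_cases):
    """Return True if `impl` satisfies every (input, expected) pair."""
    return all(impl(x) == expected for x, expected in test_cases)

def evaluate_generated_tests(test_cases, reference_impl, buggy_impl):
    accurate = passes(reference_impl, test_cases)
    bug_revealing = accurate and not passes(buggy_impl, test_cases)
    return {"accurate": accurate, "bug_revealing": bug_revealing}

# Toy example: tests generated for an absolute-value task (hypothetical).
reference = abs
buggy = lambda x: x  # buggy variant forgets to negate negative inputs
llm_tests = [(-3, 3), (0, 0), (5, 5)]
print(evaluate_generated_tests(llm_tests, reference, buggy))
# -> {'accurate': True, 'bug_revealing': True}
```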
Abstract: Despite advancements in evaluating Large Language Models (LLMs) for code synthesis, benchmarks have predominantly focused on functional correctness, overlooking the importance of code efficiency. We present Mercury, the first benchmark dedicated to assessing the code efficiency of LLM code synthesis. Mercury consists of 1,889 programming tasks covering diverse difficulty levels, alongside test case generators that produce unlimited cases for comprehensive evaluation. Unlike existing benchmarks, Mercury integrates a novel metric, Beyond@K, which measures normalized code efficiency based on historical submissions, providing a new evaluation indicator for code synthesis that encourages generating functionally correct and computationally efficient code, mirroring real-world software development standards. Our findings reveal that while LLMs demonstrate a remarkable capability to generate functionally correct code, a substantial gap remains in the efficiency of their output, underscoring a new frontier for LLM research and development.
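The abstract names the Beyond@K metric but does not give its formula. The sketch below is one plausible reading under my own simplifying assumptions: a generated solution that fails the tests scores 0, a passing solution is scored by how much of the historical runtime range it beats, and Beyond@K takes the best of K samples. The exact normalization is defined in the Mercury paper, not here.

```python
# Hedged sketch of a Beyond@K-style efficiency score (my assumptions, not
# Mercury's exact definition).

def beyond_score(runtime, hist_runtimes, passed):
    """Normalized efficiency in [0, 1] against historical submission runtimes."""
    if not passed:
        return 0.0
    lo, hi = min(hist_runtimes), max(hist_runtimes)
    if hi == lo:                          # degenerate runtime distribution
        return 1.0 if runtime <= lo else 0.0
    clipped = min(max(runtime, lo), hi)   # clip into the historical range
    return (hi - clipped) / (hi - lo)     # fraction of the range beaten

def beyond_at_k(samples, hist_runtimes):
    """samples: list of (runtime_seconds, passed_all_tests) for K generations."""
    return max(beyond_score(r, hist_runtimes, ok) for r, ok in samples)

# Hypothetical task whose historical submissions ran between 0.2s and 2.0s.
print(beyond_at_k([(0.5, True), (1.8, True), (0.3, False)], [0.2, 0.9, 2.0]))
# -> 0.8333...  (best passing sample, at 0.5s, beats ~83% of the runtime range)
```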
Abstract: GUI testing plays a significant role in the software engineering (SE) community. Most existing frameworks are intrusive and support only specific platforms. As application scenarios diversify, many embedded systems and customized operating systems on different devices cannot support existing intrusive GUI testing frameworks. Some approaches adopt robotic arms to replace interface invocation on the mobile apps under test and use computer vision technologies to identify GUI elements. However, several challenges remain unsolved. First, existing approaches assume that GUI screens are fixed, so they cannot adapt to diverse systems with different screen conditions. Second, existing approaches use XY-plane robotic arms, which cannot flexibly simulate testing operations. Third, existing approaches ignore compatibility bugs and focus only on crash bugs. A more practical approach is required for the non-intrusive scenario. We propose RoboTest, a practical non-intrusive GUI testing framework with visual robotic arms. RoboTest integrates novel GUI screen and widget detection algorithms that adapt to screens of different sizes and then extract GUI widgets from the detected screens. A set of testing operations is then applied with a 4-DOF robotic arm, which effectively and flexibly simulates human testing operations. During app exploration, RoboTest integrates a Principle of Proximity-guided exploration strategy, choosing widgets close to the previous targets to reduce robotic arm movement overhead and improve exploration efficiency. Beyond crash bugs, RoboTest can effectively detect compatibility bugs by comparing the GUIs produced by the same test operations on different devices. We evaluate RoboTest with 20 mobile apps, along with a case study on an embedded system. The results show that RoboTest can effectively, efficiently, and generally explore apps under test (AUTs) to find bugs while reducing exploration time overhead.
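The Principle of Proximity-guided exploration described above amounts to preferring the unexplored widget nearest to the previous target so the arm travels as little as possible. The following is a hedged sketch of that selection step only; the widget representation, field names, and coordinates are hypothetical placeholders, not RoboTest's actual code.

```python
# Sketch of a proximity-guided pick: among unexplored widgets detected on the
# current screen, choose the one closest to the previous target to minimize
# 4-DOF arm movement. Widget centers are (x, y) pixel coordinates.

import math

def pick_next_widget(widgets, previous_target):
    """widgets: list of dicts with a 'center' (x, y) and an 'explored' flag."""
    candidates = [w for w in widgets if not w["explored"]]
    if not candidates:
        return None  # screen exhausted; the caller can navigate elsewhere
    return min(candidates, key=lambda w: math.dist(w["center"], previous_target))

screen = [
    {"id": "btn_ok",     "center": (120, 600), "explored": False},
    {"id": "btn_cancel", "center": (360, 600), "explored": False},
    {"id": "menu",       "center": (40, 80),   "explored": True},
]
print(pick_next_widget(screen, previous_target=(100, 580))["id"])  # -> 'btn_ok'
```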
Abstract: The vast number of parameters in large language models (LLMs) endows them with remarkable capabilities, allowing them to excel in a variety of natural language processing tasks. However, this complexity also presents challenges, making LLMs difficult to train and inhibiting their ability to continuously assimilate new knowledge, which may lead to inaccuracies in their outputs. To mitigate these issues, this paper presents DynaMind, a novel continual learning framework designed for LLMs. DynaMind incorporates memory mechanisms to assimilate new knowledge and modular operators to enhance the model inference process with the newly assimilated knowledge, consequently improving the accuracy of LLM outputs. Benchmark experiments demonstrate DynaMind's effectiveness in overcoming these challenges. The code and demo of DynaMind are available on GitHub: https://github.com/Elfsong/DynaMind.
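To make the idea of "memory mechanisms plus operators that enhance inference" concrete, here is a highly simplified sketch of memory-augmented inference in the spirit of such a continual-learning framework. The actual mechanisms and modular operators live in the linked repository; the lexical retrieval, class names, and prompt layout below are my own placeholder assumptions.

```python
# Hedged sketch: store newly assimilated facts outside the model weights,
# retrieve the most relevant ones at query time, and prepend them to the
# prompt before the model call.

from difflib import SequenceMatcher

class MemoryStore:
    """Holds newly assimilated knowledge snippets outside the model weights."""
    def __init__(self):
        self.facts = []

    def add(self, fact: str):
        self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2):
        # Toy lexical similarity; a real system would use learned embeddings.
        return sorted(self.facts,
                      key=lambda f: SequenceMatcher(None, f, query).ratio(),
                      reverse=True)[:k]

def answer(query: str, memory: MemoryStore, llm=lambda prompt: prompt):
    """Compose retrieved memory into the prompt before calling the model."""
    context = "\n".join(memory.retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)  # `llm` is a stand-in for an actual model call

memory = MemoryStore()
memory.add("Project Atlas renamed its default branch to 'main'.")  # hypothetical fact
print(answer("Which branch is Project Atlas's default?", memory))
```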
Abstract: Scene graphs provide structured semantic understanding beyond images. For downstream tasks such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only distil complex image information but also correct the bias of visual models using semantic-level relations, giving them broad application prospects. However, the heavy labour cost of constructing graph annotations may hinder the application of panoptic scene graph generation (PSG) in practical scenarios. Inspired by the observation that people usually identify the subject and object first and then determine the relationship between them, we propose to decouple the scene graph generation task into two sub-tasks: 1) an image segmentation task to pick out the qualified objects, and 2) a restricted auto-regressive text generation task to generate the relation between given objects. Therefore, in this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model, which achieves 31 points on the OpenPSG dataset and outperforms strong baselines by 16 points (ResNet-50) and 5 points (CLIP), respectively.
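The two-sub-task decomposition described above can be illustrated with a toy pipeline: segmentation proposes objects, then a relation is generated for each subject-object pair from a restricted vocabulary. The stubs below are placeholders I invented to show the control flow; they stand in for the paper's segmentation and image-to-text models and are not ISRG's code.

```python
# Hedged sketch of the two-stage decomposition: segment objects first, then
# generate a relation for each object pair from a closed relation vocabulary.

def segment_objects(image):
    """Stage 1: pick out qualified objects (hard-coded stand-ins here)."""
    return [{"label": "person", "mask": None}, {"label": "horse", "mask": None}]

RELATION_VOCAB = ["riding", "beside", "feeding", "looking at"]  # restricted set

def generate_relation(image, subject, obj, score=lambda img, s, o, r: len(r)):
    """Stage 2: restricted generation, approximated here by scoring every
    relation in the closed vocabulary and taking the argmax. The default
    `score` is a dummy placeholder for a real image-to-text model."""
    return max(RELATION_VOCAB, key=lambda r: score(image, subject, obj, r))

def scene_graph(image):
    objects = segment_objects(image)
    triples = []
    for i, subj in enumerate(objects):
        for obj in objects[i + 1:]:
            triples.append((subj["label"], generate_relation(image, subj, obj), obj["label"]))
    return triples

print(scene_graph(image=None))  # -> [('person', 'looking at', 'horse')]
```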