Abstract:Realignment techniques are often employed to enhance cross-lingual transfer in multilingual language models; however, they can sometimes degrade performance in languages that differ significantly from the fine-tuned source language. This paper introduces AlignFreeze, a method that freezes either the lower or the upper half of a model's layers during realignment. Through controlled experiments on 4 tasks, 3 models, and 35 languages, we find that realignment affects all layers but can be most detrimental to the lower ones. Freezing the lower layers can prevent this performance degradation. In particular, AlignFreeze improves Part-of-Speech (PoS) tagging performance in languages where full realignment fails: with XLM-R, it provides accuracy improvements of more than one standard deviation in seven more languages than full realignment.
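A minimal sketch of the layer-freezing idea described above, assuming an XLM-R encoder from the HuggingFace transformers library; the realignment objective itself (e.g., an alignment loss over translated word pairs) is omitted, and the treatment of the embedding layer is an assumption rather than a detail taken from the paper.

```python
# Sketch only: freeze the lower half of XLM-R's encoder layers before a
# realignment step, as AlignFreeze describes for its "lower half" variant.
from transformers import AutoModel

model = AutoModel.from_pretrained("xlm-roberta-base")
layers = model.encoder.layer      # 12 transformer layers in the base model
cutoff = len(layers) // 2         # lower half: layers 0..5

for layer in layers[:cutoff]:
    for param in layer.parameters():
        param.requires_grad = False   # excluded from realignment updates

# Assumption: the embedding layer is grouped with the lower layers here;
# the paper may handle embeddings differently.
for param in model.embeddings.parameters():
    param.requires_grad = False

# The realignment loss would then be applied as usual; only the unfrozen
# (upper) layers receive gradient updates.
```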
Abstract:Flaky tests exhibit non-deterministic behavior during execution: they may pass or fail without any changes to the program under test. Detecting and classifying these flaky tests is crucial for maintaining the robustness of automated test suites and ensuring overall reliability and confidence in the testing process. However, flaky test detection and classification are challenging due to the variability in test behavior, which can depend on environmental conditions and subtle code interactions. Large Language Models (LLMs) offer promising approaches to address this challenge, with fine-tuning and few-shot learning (FSL) emerging as viable techniques. With enough data, fine-tuning a pre-trained LLM can achieve high accuracy, making it suitable for organizations with more resources. Alternatively, we introduce FlakyXbert, an FSL approach that employs a Siamese network architecture to train efficiently with limited data. To understand the performance and cost differences between these two methods, we compare fine-tuning on larger datasets with FSL in scenarios restricted to smaller datasets. Our evaluation involves two existing flaky test datasets, FlakyCat and IDoFT. Our results suggest that while fine-tuning can achieve high accuracy, FSL provides a cost-effective approach with competitive accuracy, which is especially beneficial for organizations or projects with limited historical data available for training. These findings underscore the viability of both fine-tuning and FSL in flaky test detection and classification, with each suited to different organizational needs and resource availability.
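The following is an illustrative sketch of a Siamese setup for few-shot classification of test code, in the spirit of the FSL approach above; the encoder choice (CodeBERT) and the contrastive pair loss are assumptions for illustration, not details taken from the FlakyXbert paper.

```python
# Hedged sketch: embed test-case source with a code encoder and train on
# pairs, pulling same-category tests together and pushing others apart.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("microsoft/codebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def embed(test_source: str) -> torch.Tensor:
    """Encode a test case's source code into a single vector (CLS pooling)."""
    inputs = tokenizer(test_source, return_tensors="pt",
                       truncation=True, max_length=512)
    return encoder(**inputs).last_hidden_state[:, 0]

def pair_loss(anchor_src: str, other_src: str,
              same_class: bool, margin: float = 1.0) -> torch.Tensor:
    """Contrastive loss on one pair: minimize distance for same-class pairs,
    enforce at least `margin` separation for different-class pairs."""
    d = F.pairwise_distance(embed(anchor_src), embed(other_src))
    if same_class:
        return d.pow(2).mean()
    return F.relu(margin - d).pow(2).mean()
```

At inference time, a new test would typically be assigned the category of its nearest labeled support examples in the learned embedding space, which is what makes the approach workable with only a handful of examples per flakiness category.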
Abstract:Data augmentation has become a standard practice in software engineering to address limited or imbalanced data sets, particularly in specialized domains like test classification and bug detection where data can be scarce. Although techniques such as SMOTE and mutation-based augmentation are widely used in software testing and debugging applications, a rigorous understanding of how augmented training data impacts model bias is lacking. It is especially critical to consider bias in scenarios where augmented data sets are used not only for training but also for testing models. Through a comprehensive case study of flaky test classification, we demonstrate how to test for bias and how the inclusion of augmented samples in test sets can affect model evaluation.
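A small sketch of the evaluation concern raised above: augmenting only the training split with SMOTE so that the held-out evaluation set contains no synthetic samples. The data, classifier, and split are placeholders chosen for illustration, not the study's actual setup.

```python
# Hedged illustration: SMOTE applied to the training split only, so reported
# metrics are computed on real (non-augmented) samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced data standing in for extracted test features.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Synthetic minority samples are generated from the training split only.
X_train_aug, y_train_aug = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_train_aug, y_train_aug)
print(classification_report(y_test, clf.predict(X_test)))  # real samples only
```

If synthetic samples were instead mixed into X_test, the reported scores could reflect how well the model fits the augmentation procedure rather than the underlying task, which is the kind of bias the case study examines.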
Abstract:The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common approach to assess LLMs with respect to their ability to solve problem-specific tasks, as well as to compare different versions of an LLM on those tasks over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool in assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination resulting from the leakage of benchmark tasks and solutions into the training data set. This barrier is compounded by the black-box nature of LLM training data, which makes it difficult to know whether data leakage has even occurred. To address the data leakage problem, we propose a new benchmark construction method in which a benchmark is composed of template tasks that can be instantiated into new concrete tasks using combinatorial test design. Concrete tasks for the same template task must be different enough that data leakage has minimal impact, yet similar enough that the tasks are interchangeable with respect to performance evaluation. To assess our benchmark construction method, we propose HumanEval_T, an alternative benchmark to HumanEval that was constructed using template tasks and combinatorial test design.
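A hedged sketch of the template-task idea: a HumanEval-style prompt with parameter slots that is instantiated by combining values from small value pools. The template text, parameter names, and exhaustive enumeration are invented for illustration; the actual HumanEval_T templates and its combinatorial design (e.g., a pairwise covering array instead of the full cross product) may differ.

```python
# Sketch: instantiate concrete tasks from a parameterized template task.
from itertools import product

TEMPLATE = (
    "def {func_name}(values: list[{elem_type}]) -> {elem_type}:\n"
    '    """Return the {aggregate} of the numbers in values."""\n'
)

# Hypothetical value pools for each template parameter.
PARAMETERS = {
    "func_name": ["aggregate_values", "reduce_list"],
    "elem_type": ["int", "float"],
    "aggregate": ["sum", "product", "maximum"],
}

def instantiate_all(template: str, parameters: dict) -> list[str]:
    """Exhaustively instantiate the template; a covering array could be
    substituted here to keep the number of concrete tasks small."""
    names = list(parameters)
    return [template.format(**dict(zip(names, combo)))
            for combo in product(*parameters.values())]

concrete_tasks = instantiate_all(TEMPLATE, PARAMETERS)
print(len(concrete_tasks))  # 2 * 2 * 3 = 12 concrete tasks from one template
```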
Abstract:Human-computer interaction (HCI) has been a widely researched area for many years, with continuous advancements in technology leading to new techniques that change the way we interact with computers. With the recent advent of powerful computers, systems can recognize human actions and respond accordingly, transforming how users interact with machines. The purpose of this paper is to provide a comparative analysis of algorithms used for recognizing user faces and gestures in the context of computer vision and HCI, evaluating their performance in terms of accuracy, robustness, and efficiency. The goal is to inform the design and development of interactive systems that are more intuitive, efficient, and user-friendly.