Peter
Abstract:Large Language Models excel in tasks like natural language understanding and text generation. Prompt engineering plays a critical role in leveraging LLM effectively. However, LLMs black-box nature hinders its interpretability and effective prompting engineering. A wide range of model explanation approaches have been developed for deep learning models, However, these local explanations are designed for single-output tasks like classification and regression,and cannot be directly applied to LLMs, which generate sequences of tokens. Recent efforts in LLM explanation focus on natural language explanations, but they are prone to hallucinations and inaccuracies. To address this, we introduce OurTool, a framework for multi-granularity prompt explanations by aggregating token-level insights. OurTool introduces two token-level explanation approaches: 1.an aggregation-based approach combining local explanation techniques, and 2. a perturbation-based approach with novel techniques to evaluate token masking impact. OurTool supports both white-box and black-box explanations and extends explanations to higher granularity levels, enabling flexible analysis. We evaluate OurTool in case studies such as sentiment analysis, showing the perturbation-based approach performs best using semantic similarity to assess perturbation impact. Furthermore, we conducted a user study to confirm OurTool's accuracy and practical value, and demonstrate its potential to enhance LLM interpretability.
Abstract:Detection of Graphical User Interface (GUI) elements is a crucial task for automatic code generation from images and sketches, GUI testing, and GUI search. Recent studies have leveraged both old-fashioned and modern computer vision (CV) techniques. Oldfashioned methods utilize classic image processing algorithms (e.g. edge detection and contour detection) and modern methods use mature deep learning solutions for general object detection tasks. GUI element detection, however, is a domain-specific case of object detection, in which objects overlap more often, and are located very close to each other, plus the number of object classes is considerably lower, yet there are more objects in the images compared to natural images. Hence, the studies that have been carried out on comparing various object detection models, might not apply to GUI element detection. In this study, we evaluate the performance of the four most recent successful YOLO models for general object detection tasks on GUI element detection and investigate their accuracy performance in detecting various GUI elements.
Abstract:Detecting vulnerabilities is a crucial task for maintaining the integrity, availability, and security of software systems. Utilizing DL-based models for vulnerability detection has become commonplace in recent years. However, such deep learning-based vulnerability detectors (DLVD) suffer from a shortage of sizable datasets to train effectively. Data augmentation can potentially alleviate the shortage of data, but augmenting vulnerable code is challenging and requires designing a generative solution that maintains vulnerability. Hence, the work on generating vulnerable code samples has been limited and previous works have only focused on generating samples that contain single statements or specific types of vulnerabilities. Lately, large language models (LLMs) are being used for solving various code generation and comprehension tasks and have shown inspiring results, especially when fused with retrieval augmented generation (RAG). In this study, we explore three different strategies to augment vulnerabilities both single and multi-statement vulnerabilities, with LLMs, namely Mutation, Injection, and Extension. We conducted an extensive evaluation of our proposed approach on three vulnerability datasets and three DLVD models, using two LLMs. Our results show that our injection-based clustering-enhanced RAG method beats the baseline setting (NoAug), Vulgen, and VGX (two SOTA methods), and Random Oversampling (ROS) by 30.80\%, 27.48\%, 27.93\%, and 15.41\% in f1-score with 5K generated vulnerable samples on average, and 53.84\%, 54.10\%, 69.90\%, and 40.93\% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples at as cheap as US$ 1.88.
Abstract:The shuffle model of differential privacy (DP) has recently emerged as a powerful one for decentralized computation without fully trustable parties. Since it anonymizes and permutes messages from clients through a shuffler, the privacy can be amplified and utility can be improved. However, the shuffling procedure in turn restricts its applications only to statistical tasks that are permutation-invariant. This work explores the feasibility of shuffle privacy amplification for prevalent non-statistical computations: spatial crowdsourcing, combinatorial optimization, location-based social systems, and federated learning with incentives, which suffer either computationally intractability or intolerable utility loss in existing approaches (e.g., secure MPC and local DP). We proposes a new paradigm of shuffle model that can provide critical security functionalities like message authorization and result access control, meanwhile maintaining the most of privacy amplification effects. It incurs almost the same computation/communication costs as the non-private setting, and permits the server to run arbitrary algorithms on (noisy) client information in plaintext. Our novel technique is introducing statistically random identity into DP and force identical random distribution on all clients, so as to support secure functionalities even after message shuffling and to maintain privacy amplification simultaneously. Given that existing DP randomizers fails in the new shuffle model, we also propose a new mechanism and prove its optimality therein. Experimental results on spatial crowdsourcing, location-based social system, and federated learning with incentives, show that our paradigm and mechanism is fast as non-private settings, while reducing up to 90% error and increasing utility performance indicates by 100%-300% relatively, and can be practical under reasonable privacy budget.
Abstract:Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need for training specific control models and proactive intervention during text generation, that lead to quality degradation and increased computational overhead. To mitigate those limitations, we propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real-time. LLMSafeGuard integrates an external validator into the beam search algorithm during decoding, rejecting candidates that violate safety constraints while allowing valid ones to proceed. We introduce a similarity based validation approach, simplifying constraint introduction and eliminating the need for control model training. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening LLMs only when necessary. We evaluate LLMSafeGuard on two tasks, detoxification and copyright safeguarding, and demonstrate its superior performance over SOTA baselines. For instance, LLMSafeGuard reduces the average toxic score of. LLM output by 29.7% compared to the best baseline meanwhile preserving similar linguistic quality as natural output in detoxification task. Similarly, in the copyright task, LLMSafeGuard decreases the Longest Common Subsequence (LCS) by 56.2% compared to baselines. Moreover, our context-wise timing selection strategy reduces inference time by at least 24% meanwhile maintaining comparable effectiveness as validating each time step. LLMSafeGuard also offers tunable parameters to balance its effectiveness and efficiency.
Abstract:Logs are important in modern software development with runtime information. Log parsing is the first step in many log-based analyses, that involve extracting structured information from unstructured log data. Traditional log parsers face challenges in accurately parsing logs due to the diversity of log formats, which directly impacts the performance of downstream log-analysis tasks. In this paper, we explore the potential of using Large Language Models (LLMs) for log parsing and propose LLMParser, an LLM-based log parser based on generative LLMs and few-shot tuning. We leverage four LLMs, Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B in LLMParsers. Our evaluation of 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (a 96% average parsing accuracy). We further conduct a comprehensive empirical analysis on the effect of training size, model size, and pre-training LLM on log parsing accuracy. We find that smaller LLMs may be more effective than more complex LLMs; for instance where Flan-T5-base achieves comparable results as LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained using logs from other systems does not always improve parsing accuracy. While using pre-trained Flan-T5-base shows an improvement in accuracy, pre-trained LLaMA results in a decrease (decrease by almost 55% in group accuracy). In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research direction of LLM-based log parsers.
Abstract:Over the past decade, structured illumination microscopy (SIM) has found its niche in super-resolution (SR) microscopy due to its fast imaging speed and low excitation intensity. However, due to the significantly higher light dose compared to wide-field microscopy and the time-consuming post-processing procedures, long-term, real-time, super-resolution observation of living cells is still out of reach for most SIM setups, which inevitably limits its routine use by cell biologists. Here, we describe square lattice SIM (SL-SIM) for long-duration live cell imaging by using the square lattice optical field as illumination, which allows continuous super-resolved observation over long periods of time. In addition, by extending the previous joint spatial-frequency reconstruction concept to SL-SIM, a high-speed reconstruction strategy is validated in the GPU environment, whose reconstruction time is even shorter than image acquisition time, thus enabling real-time observation. We have demonstrated the potential of SL-SIM on various biological applications, ranging from microtubule cytoskeleton dynamics to the interactions of mitochondrial cristae and DNAs in COS7 cells. The inherent lower light dose and user-friendly workflow of the SL-SIM could help make long-duration, real-time and super-resolved observations accessible to biological laboratories.
Abstract:Technical question and answering (Q&A) sites such as Stack Overflow have become an important source for software developers to seek knowledge. However, code snippets on Q&A sites are usually uncompilable and semantically incomplete for compilation due to unresolved types and missing dependent libraries, which raises the obstacle for users to reuse or analyze Q&A code snippets. Prior approaches either are not designed for synthesizing compilable code or suffer from a low compilation success rate. To address this problem, we propose ZS4C, a lightweight approach to perform zero-shot synthesis of compilable code from incomplete code snippets using Large Language Model (LLM). ZS4C operates in two stages. In the first stage, ZS4C utilizes an LLM, i.e., ChatGPT, to identify missing import statements for a given code snippet, leveraging our designed task-specific prompt template. In the second stage, ZS4C fixes compilation errors caused by incorrect import statements and syntax errors through collaborative work between ChatGPT and a compiler. We thoroughly evaluated ZS4C on a widely used benchmark called StatType-SO against the SOTA approach SnR. Compared with SnR, ZS4C improves the compilation rate from 63% to 87.6%, with a 39.3% improvement. On average, ZS4C can infer more accurate import statements than SnR, with an improvement of 6.6% in the F1.
Abstract:Context: Navigating the knowledge of Stack Overflow (SO) remains challenging. To make the posts vivid to users, SO allows users to write and edit posts with Markdown or HTML so that users can leverage various formatting styles (e.g., bold, italic, and code) to highlight the important information. Nonetheless, there have been limited studies on the highlighted information. Objective: We carried out the first large-scale exploratory study on the information highlighted in SO answers in our recent study. To extend our previous study, we develop approaches to automatically recommend highlighted content with formatting styles using neural network architectures initially designed for the Named Entity Recognition task. Method: In this paper, we studied 31,169,429 answers of Stack Overflow. For training recommendation models, we choose CNN and BERT models for each type of formatting (i.e., Bold, Italic, Code, and Heading) using the information highlighting dataset we collected from SO answers. Results: Our models based on CNN architecture achieve precision ranging from 0.71 to 0.82. The trained model for automatic code content highlighting achieves a recall of 0.73 and an F1 score of 0.71, outperforming the trained models for other formatting styles. The BERT models have even lower recalls and F1 scores than the CNN models. Our analysis of failure cases indicates that the majority of the failure cases are missing identification (i.e., the model misses the content that is supposed to be highlighted) due to the models tend to learn the frequently highlighted words while struggling to learn less frequent words. Conclusion: Our findings suggest that it is possible to develop recommendation models for highlighting information for answers with different formatting styles on Stack Overflow.
Abstract:It is common practice to discretize continuous defect counts into defective and non-defective classes and use them as a target variable when building defect classifiers (discretized classifiers). However, this discretization of continuous defect counts leads to information loss that might affect the performance and interpretation of defect classifiers. Another possible approach to build defect classifiers is through the use of regression models then discretizing the predicted defect counts into defective and non-defective classes (regression-based classifiers). In this paper, we compare the performance and interpretation of defect classifiers that are built using both approaches (i.e., discretized classifiers and regression-based classifiers) across six commonly used machine learning classifiers (i.e., linear/logistic regression, random forest, KNN, SVM, CART, and neural networks) and 17 datasets. We find that: i) Random forest based classifiers outperform other classifiers (best AUC) for both classifier building approaches; ii) In contrast to common practice, building a defect classifier using discretized defect counts (i.e., discretized classifiers) does not always lead to better performance. Hence we suggest that future defect classification studies should consider building regression-based classifiers (in particular when the defective ratio of the modeled dataset is low). Moreover, we suggest that both approaches for building defect classifiers should be explored, so the best-performing classifier can be used when determining the most influential features.