Abstract:Converting webpage design into functional UI code is a critical step for building websites, which can be labor-intensive and time-consuming. To automate this design-to-code transformation process, various automated methods using learning-based networks and multi-modal large language models (MLLMs) have been proposed. However, these studies were merely evaluated on a narrow range of static web pages and ignored dynamic interaction elements, making them less practical for real-world website deployment. To fill in the blank, we present the first systematic investigation of MLLMs in generating interactive webpages. Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark that contains 97 unique web pages and 213 distinct interactions, spanning 15 webpage types and 30 interaction categories. We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations, thereby summarizing six findings accordingly. Our experimental results highlight the limitations of MLLMs in generating fine-grained interactive features and managing interactions with complex transformations and subtle visual modifications. We further analyze failure cases and their underlying causes, identifying 10 common failure types and assessing their severity. Additionally, our findings reveal three critical influencing factors, i.e., prompts, visual saliency, and textual descriptions, that can enhance the interaction generation performance of MLLMs. Based on these findings, we elicit implications for researchers and developers, providing a foundation for future advancements in this field. Datasets and source code are available at https://github.com/WebPAI/Interaction2Code.
Abstract:Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods. To the best of our knowledge, DCGen is the first segment-aware prompt-based approach for generating UI code directly from screenshots.
Abstract:Configurable software systems are prone to configuration errors, resulting in significant losses to companies. However, diagnosing these errors is challenging due to the vast and complex configuration space. These errors pose significant challenges for both experienced maintainers and new end-users, particularly those without access to the source code of the software systems. Given that logs are easily accessible to most end-users, we conduct a preliminary study to outline the challenges and opportunities of utilizing logs in localizing configuration errors. Based on the insights gained from the preliminary study, we propose an LLM-based two-stage strategy for end-users to localize the root-cause configuration properties based on logs. We further implement a tool, LogConfigLocalizer, aligned with the design of the aforementioned strategy, hoping to assist end-users in coping with configuration errors through log analysis. To the best of our knowledge, this is the first work to localize the root-cause configuration properties for end-users based on Large Language Models~(LLMs) and logs. We evaluate the proposed strategy on Hadoop by LogConfigLocalizer and prove its efficiency with an average accuracy as high as 99.91%. Additionally, we also demonstrate the effectiveness and necessity of different phases of the methodology by comparing it with two other variants and a baseline tool. Moreover, we validate the proposed methodology through a practical case study to demonstrate its effectiveness and feasibility.
Abstract:Large Language Models (LLMs) have achieved remarkable success in code completion, as evidenced by their essential roles in developing code assistant services such as Copilot. Being trained on in-file contexts, current LLMs are quite effective in completing code for single source files. However, it is challenging for them to conduct repository-level code completion for large software projects that require cross-file information. Existing research on LLM-based repository-level code completion identifies and integrates cross-file contexts, but it suffers from low accuracy and limited context length of LLMs. In this paper, we argue that Integrated Development Environments (IDEs) can provide direct, accurate and real-time cross-file information for repository-level code completion. We propose IDECoder, a practical framework that leverages IDE native static contexts for cross-context construction and diagnosis results for self-refinement. IDECoder utilizes the rich cross-context information available in IDEs to enhance the capabilities of LLMs of repository-level code completion. We conducted preliminary experiments to validate the performance of IDECoder and observed that this synergy represents a promising trend for future exploration.
Abstract:System logs play a critical role in maintaining the reliability of software systems. Fruitful studies have explored automatic log-based anomaly detection and achieved notable accuracy on benchmark datasets. However, when applied to large-scale cloud systems, these solutions face limitations due to high resource consumption and lack of adaptability to evolving logs. In this paper, we present an accurate, lightweight, and adaptive log-based anomaly detection framework, referred to as SeaLog. Our method introduces a Trie-based Detection Agent (TDA) that employs a lightweight, dynamically-growing trie structure for real-time anomaly detection. To enhance TDA's accuracy in response to evolving log data, we enable it to receive feedback from experts. Interestingly, our findings suggest that contemporary large language models, such as ChatGPT, can provide feedback with a level of consistency comparable to human experts, which can potentially reduce manual verification efforts. We extensively evaluate SeaLog on two public datasets and an industrial dataset. The results show that SeaLog outperforms all baseline methods in terms of effectiveness, runs 2X to 10X faster and only consumes 5% to 41% of the memory resource.
Abstract:Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, inference capability, or both? We argue that due to the large scale of commonsense knowledge, it is infeasible to annotate a large enough training set for each task to cover all commonsense for learning. Thus we should separate the commonsense knowledge acquisition and inference over commonsense knowledge as two separate tasks. In this work, we focus on investigating models' commonsense inference capabilities from two perspectives: (1) Whether models can know if the knowledge they have is enough to solve the task; (2) Whether models can develop commonsense inference capabilities that generalize across commonsense tasks. We first align commonsense tasks with relevant knowledge from commonsense knowledge bases and ask humans to annotate whether the knowledge is enough or not. Then, we convert different commonsense tasks into a unified question answering format to evaluate models' generalization capabilities. We name the benchmark as Commonsense Inference with Knowledge-in-the-loop Question Answering (CIKQA).
Abstract:Causality knowledge is crucial for many artificial intelligence systems. Conventional textual-based causality knowledge acquisition methods typically require laborious and expensive human annotations. As a result, their scale is often limited. Moreover, as no context is provided during the annotation, the resulting causality knowledge records (e.g., ConceptNet) typically do not take the context into consideration. To explore a more scalable way of acquiring causality knowledge, in this paper, we jump out of the textual domain and investigate the possibility of learning contextual causality from the visual signal. Compared with pure text-based approaches, learning causality from the visual signal has the following advantages: (1) Causality knowledge belongs to the commonsense knowledge, which is rarely expressed in the text but rich in videos; (2) Most events in the video are naturally time-ordered, which provides a rich resource for us to mine causality knowledge from; (3) All the objects in the video can be used as context to study the contextual property of causal relations. In detail, we first propose a high-quality dataset Vis-Causal and then conduct experiments to demonstrate that with good language and visual representation models as well as enough training signals, it is possible to automatically discover meaningful causal knowledge from the videos. Further analysis also shows that the contextual property of causal relations indeed exists, taking which into consideration might be crucial if we want to use the causality knowledge in real applications, and the visual signal could serve as a good resource for learning such contextual causality.