Ruben Martins

Hypergraph-Guided Regex Filter Synthesis for Event-Based Anomaly Detection

Sep 08, 2025

Margarida Ferreira, Victor Nicolet, Luan Pham, Joey Dodds, Daniel Kroening, Ines Lynce, Ruben Martins

Abstract:We propose HyGLAD, a novel algorithm that automatically builds a set of interpretable patterns that model event data. These patterns can then be used to detect event-based anomalies in a stationary system, where any deviation from past behavior may indicate malicious activity. The algorithm infers equivalence classes of entities with similar behavior observed from the events, and then builds regular expressions that capture the values of those entities. As opposed to deep-learning approaches, the regular expressions are directly interpretable, which also translates to interpretable anomalies. We evaluate HyGLAD against all 7 unsupervised anomaly detection methods from DeepOD on five datasets from real-world systems. The experimental results show that on average HyGLAD outperforms existing deep-learning methods while being an order of magnitude more efficient in training and inference (single CPU vs GPU). Precision improved by 1.2x and recall by 1.3x compared to the second-best baseline.

Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models

Jun 09, 2024

Aidan Z. H. Yang, Haoye Tian, He Ye, Ruben Martins, Claire Le Goues

Abstract:Software security vulnerabilities allow attackers to perform malicious activities to disrupt software operations. Recent Transformer-based language models have significantly advanced vulnerability detection, surpassing the capabilities of static analysis based deep learning models. However, language models trained solely on code tokens do not capture either the explanation of vulnerability type or the data flow structure information of code, both of which are crucial for vulnerability detection. We propose a novel technique that integrates a multitask sequence-to-sequence LLM with program control flow graphs encoded as a graph neural network to achieve sequence-to-classification vulnerability detection. We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul), with an F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset. By training LLMs and GNNs simultaneously using a combination of code and explanatory metrics of a vulnerable program, MSIVD represents a promising direction for advancing LLM-based vulnerability detection that generalizes to unseen data. Based on our findings, we further discuss the necessity for new labelled security vulnerability datasets, as recent LLMs have seen or memorized prior datasets' held-out evaluation data.

Large Language Models for Test-Free Fault Localization

Oct 03, 2023

Aidan Z. H. Yang, Ruben Martins, Claire Le Goues, Vincent J. Hellendoorn

Abstract:Fault Localization (FL) aims to automatically localize buggy lines of code, a key first step in many manual and automatic debugging tasks. Previous FL techniques assume the provision of input tests, and often require extensive program analysis, program instrumentation, or data preprocessing. Prior work on deep learning for automated program repair (APR) struggles to learn from small datasets and produces limited results on real-world programs. Inspired by the ability of large language models (LLMs) of code to adapt to new tasks based on very few examples, we investigate the applicability of LLMs to line-level fault localization. Specifically, we propose to overcome the left-to-right nature of LLMs by fine-tuning a small set of bidirectional adapter layers on top of the representations learned by LLMs to produce LLMAO, the first language-model-based fault localization approach that locates buggy lines of code without any test coverage information. We fine-tune LLMs with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs such as the Defects4J corpus. We observe that our technique achieves substantially more confidence in fault localization when built on the larger models, with bug localization performance scaling consistently with the LLM size. Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%-54.4%, and Top-5 results by 14.4%-35.6%. LLMAO is also the first FL technique trained using a language model architecture that can detect security vulnerabilities down to the code line level.

MELT: Mining Effective Lightweight Transformations from Pull Requests

Aug 28, 2023

Daniel Ramos, Hailie Mitchell, Inês Lynce, Vasco Manquinho, Ruben Martins, Claire Le Goues

Abstract:Software developers often struggle to update APIs, leading to manual, time-consuming, and error-prone processes. We introduce MELT, a new approach that generates lightweight API migration rules directly from pull requests in popular library repositories. Our key insight is that pull requests merged into open-source libraries are a rich source of information sufficient to mine API migration rules. By leveraging code examples mined from the library source and automatically generated code examples based on the pull requests, we infer transformation rules in Comby, a language for structural code search and replace. Since inferred rules from single code examples may be too specific, we propose a generalization procedure to make the rules more applicable to client projects. MELT rules are syntax-driven, interpretable, and easily adaptable. Moreover, unlike previous work, our approach enables rule inference to seamlessly integrate into the library workflow, removing the need to wait for client code migrations. We evaluated MELT on pull requests from four popular libraries, successfully mining 461 migration rules from code examples in pull requests and 114 rules from auto-generated code examples. Our generalization procedure increases the number of matches for mined rules by 9x. We applied these rules to client projects and ran their tests, which led to an overall decrease in the number of warnings and fixed some test cases, demonstrating MELT's effectiveness in real-world scenarios.

UpMax: User partitioning for MaxSAT

May 25, 2023

Pedro Orvalho, Vasco Manquinho, Ruben Martins

Abstract:It has been shown that Maximum Satisfiability (MaxSAT) problem instances can be effectively solved by partitioning the set of soft clauses into several disjoint sets. The partitioning methods can be based on clause weights (e.g., stratification) or based on graph representations of the formula. Afterwards, a merge procedure is applied to guarantee that an optimal solution is found. This paper proposes a new framework called UpMax that decouples the partitioning procedure from the MaxSAT solving algorithms. As a result, new partitioning procedures can be defined independently of the MaxSAT algorithm to be used. Moreover, this decoupling also allows users that build new MaxSAT formulas to propose partition schemes based on knowledge of the problem to be solved. We illustrate this approach using several problems and show that partitioning has a large impact on the performance of unsatisfiability-based MaxSAT algorithms.

* 17 pages, 6 figures, 2 tables. https://github.com/forge-lab/upmax
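
The weight-based partitioning mentioned in the abstract can be pictured with a short sketch. This is a simplified illustration and not the UpMax implementation: the clause representation (a weight paired with DIMACS-style integer literals) and the function name `partition_by_weight` are assumptions made for the example.

```python
from collections import defaultdict

# Assumed representation: a soft clause is (weight, literals),
# where literals are DIMACS-style signed integers.
SoftClause = tuple[int, tuple[int, ...]]


def partition_by_weight(soft_clauses: list[SoftClause]) -> list[list[SoftClause]]:
    """Split soft clauses into disjoint partitions, one per distinct weight, heaviest first."""
    buckets: dict[int, list[SoftClause]] = defaultdict(list)
    for clause in soft_clauses:
        buckets[clause[0]].append(clause)
    # Order the partitions so that the heaviest soft clauses come first.
    return [buckets[weight] for weight in sorted(buckets, reverse=True)]
```

Because the partitions are plain lists, a user-defined scheme based on problem knowledge could return the same shape of result and be handed to whichever solving algorithm is used.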

FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions

Dec 28, 2020

Margarida Ferreira, Miguel Terra-Neves, Miguel Ventura, Inês Lynce, Ruben Martins

Abstract:Form validators based on regular expressions are often used on digital forms to prevent users from inserting data in the wrong format. However, writing these validators can pose a challenge to some users. We present FOREST, a regular expression synthesizer for digital form validations. FOREST produces a regular expression that matches the desired pattern for the input values and a set of conditions over capturing groups that ensure the validity of integer values in the input. Our synthesis procedure is based on enumerative search and uses a Satisfiability Modulo Theories (SMT) solver to explore and prune the search space. We propose a novel representation for regular expressions synthesis, multi-tree, which induces patterns in the examples and uses them to split the problem through a divide-and-conquer approach. We also present a new SMT encoding to synthesize capture conditions for a given regular expression. To increase confidence in the synthesized regular expression, we implement user interaction based on distinguishing inputs. We evaluated FOREST on real-world form-validation instances using regular expressions. Experimental results show that FOREST successfully returns the desired regular expression in 72% of the instances and outperforms REGEL, a state-of-the-art regular expression synthesizer.
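
To make the pairing of a regular expression with conditions over capturing groups concrete, here is a minimal hand-written sketch. The date format, the pattern, and the integer bounds are illustrative assumptions; none of it is output produced by FOREST.

```python
import re

# Hypothetical validator for dates written as DD/MM/YYYY.
# The regex fixes the format; the capture conditions bound the integers.
DATE_PATTERN = re.compile(r"(\d{2})/(\d{2})/(\d{4})")

# One (low, high) bound per capturing group: day, month, year.
CAPTURE_BOUNDS = [(1, 31), (1, 12), (1900, 2100)]


def is_valid(value: str) -> bool:
    """Accept a value only if it matches the regex and every captured integer is in range."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    return all(
        low <= int(group) <= high
        for group, (low, high) in zip(match.groups(), CAPTURE_BOUNDS)
    )
```

The regex alone would accept `32/08/1996`; the capture conditions are what reject it.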

Reflections on "Incremental Cardinality Constraints for MaxSAT"

Oct 10, 2019

Ruben Martins, Saurabh Joshi, Vasco Manquinho, Ines Lynce

Abstract:To celebrate the first 25 years of the International Conference on Principles and Practice of Constraint Programming (CP) the editors invited the authors of the most cited paper of each year to write a commentary on their paper. This report describes our reflections on the CP 2014 paper "Incremental Cardinality Constraints for MaxSAT" and its impact on the Maximum Satisfiability community and beyond.

* 10 pages, 1 algorithm, 1 table, 4 figures, article invited as part of "Virtual Volume" for 25th anniversary of CP

Approximation Strategies for Incomplete MaxSAT

Jun 19, 2018

Saurabh Joshi, Prateek Kumar, Ruben Martins, Sukrut Rao

Abstract:Incomplete MaxSAT solving aims to quickly find a solution that attempts to minimize the sum of the weights of the unsatisfied soft clauses without providing any optimality guarantees. In this paper, we propose two approximation strategies for improving incomplete MaxSAT solving. In one of the strategies, we cluster the weights and approximate them with a representative weight. In another strategy, we break up the problem of minimizing the sum of weights of unsatisfiable clauses into multiple minimization subproblems. Experimental results show that approximation strategies can be used to find better solutions than the best incomplete solvers in the MaxSAT Evaluation 2017.

* 10 pages, 3 algorithms, 1 figure, International Conference on Principles and Practice of Constraint Programming (CP) 2018
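
The first strategy, replacing clustered weights with a representative weight, might look roughly like the sketch below. The clustering rule (cut the sorted distinct weights at the k - 1 widest gaps) and the choice of representative (the rounded mean of the distinct weights in a cluster) are assumptions chosen for brevity, not necessarily what the paper uses.

```python
def cluster_weights(weights: list[int], k: int) -> dict[int, int]:
    """Map each distinct weight to a representative weight using at most k clusters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    distinct = sorted(set(weights))
    if len(distinct) <= k:
        return {w: w for w in distinct}
    # Index i stands for the gap between distinct[i] and distinct[i + 1];
    # keep the k - 1 widest gaps as cluster boundaries.
    widest = sorted(
        range(len(distinct) - 1),
        key=lambda i: distinct[i + 1] - distinct[i],
        reverse=True,
    )[: k - 1]
    mapping: dict[int, int] = {}
    start = 0
    for end in sorted(widest) + [len(distinct) - 1]:
        cluster = distinct[start : end + 1]
        # Representative: mean of the distinct weights in the cluster, rounded.
        representative = round(sum(cluster) / len(cluster))
        for w in cluster:
            mapping[w] = representative
        start = end + 1
    return mapping
```

Fewer distinct weights give the solver a simpler objective, at the cost of optimizing an approximation of the original one.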

Relating Complexity-theoretic Parameters with SAT Solver Performance

Jun 26, 2017

Edward Zulkoski, Ruben Martins, Christoph Wintersteiger, Robert Robere, Jia Liang, Krzysztof Czarnecki, Vijay Ganesh

Abstract:Over the years complexity theorists have proposed many structural parameters to explain the surprising efficiency of conflict-driven clause-learning (CDCL) SAT solvers on a wide variety of large industrial Boolean instances. While some of these parameters have been studied empirically, until now there has not been a unified comparative study of their explanatory power on a comprehensive benchmark. We correct this state of affairs by conducting a large-scale empirical evaluation of CDCL SAT solver performance on nearly 7000 industrial and crafted formulas against several structural parameters such as backdoors, treewidth, backbones, and community structure. Our study led us to several results. First, we show that while such parameters only weakly correlate with CDCL solving time, certain combinations of them yield much better regression models. Second, we show how some parameters can be used as a "lens" to better understand the efficiency of different solving heuristics. Finally, we propose a new complexity-theoretic parameter, which we call learning-sensitive with restarts (LSR) backdoors, that extends the notion of learning-sensitive (LS) backdoors to incorporate restarts and discuss algorithms to compute them. We mathematically prove that for certain classes of instances minimal LSR-backdoors are exponentially smaller than minimal LS-backdoors.

Generalized Totalizer Encoding for Pseudo-Boolean Constraints

Jul 21, 2015

Saurabh Joshi, Ruben Martins, Vasco Manquinho

Abstract:Pseudo-Boolean constraints, also known as 0-1 Integer Linear Constraints, are used to model many real-world problems. A common approach to solve these constraints is to encode them into a SAT formula. The runtime of the SAT solver on such a formula is sensitive to the manner in which the given pseudo-Boolean constraints are encoded. In this paper, we propose generalized Totalizer encoding (GTE), which is an arc-consistency preserving extension of the Totalizer encoding to pseudo-Boolean constraints. Unlike some other encodings, the number of auxiliary variables required for GTE does not depend on the magnitudes of the coefficients. Instead, it depends on the number of distinct combinations of these coefficients. We show the superiority of GTE with respect to other encodings when large pseudo-Boolean constraints have a low number of distinct coefficients. Our experimental results also show that GTE remains competitive even when the pseudo-Boolean constraints do not have this characteristic.

* 10 pages, 2 figures, 2 tables. To be published in 21st International Conference on Principles and Practice of Constraint Programming 2015
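
The claim that the number of auxiliary variables follows the distinct coefficient combinations, not the coefficient magnitudes, can be checked with a small counting sketch. It assumes a constraint of the form sum(coeffs[i] * x[i]) <= k, a balanced binary tree over the terms, one output variable per distinct sum at each internal node, and all sums above k collapsed into the single value k + 1. It only counts variables and does not generate clauses; the function names are made up for the example.

```python
def node_sums(coeffs: list[int], k: int) -> tuple[set[int], int]:
    """Return (distinct output sums of this subtree, auxiliary variables used inside it)."""
    if len(coeffs) == 1:
        # A leaf is the input literal itself, so it needs no auxiliary variable.
        return {min(coeffs[0], k + 1)}, 0
    mid = len(coeffs) // 2
    left, left_aux = node_sums(coeffs[:mid], k)
    right, right_aux = node_sums(coeffs[mid:], k)
    # A parent sum is a left sum, a right sum, or one of each;
    # every sum above k collapses to the single overflow value k + 1.
    sums = left | right | {min(a + b, k + 1) for a in left for b in right}
    return sums, left_aux + right_aux + len(sums)


def count_auxiliary_variables(coeffs: list[int], k: int) -> int:
    """Count output variables over all internal nodes of the tree."""
    return node_sums(coeffs, k)[1]
```

Under these assumptions, `[2, 2, 2, 2]` with `k = 8` and `[200, 200, 200, 200]` with `k = 800` need the same number of variables, while `[1, 2, 4, 8]` with `k = 15` needs more because every combination of terms yields a different sum.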
