Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emily First

Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Apr 07, 2025

Yousef Alhessi, Sólrún Halla Einarsdóttir, George Granberry, Emily First, Moa Johansson, Sorin Lerner, Nicholas Smallbone

Abstract:Automatically conjecturing useful, interesting and novel lemmas would greatly improve automated reasoning tools and lower the bar for formalizing mathematics in proof assistants. It is however a very challenging task for both neural and symbolic approaches. We present the first steps towards a practical neuro-symbolic lemma conjecturing tool, Lemmanaid, that combines Large Language Models (LLMs) and symbolic methods, and evaluate it on proof libraries for the Isabelle proof assistant. We train an LLM to generate lemma templates that describe the shape of a lemma, and use symbolic methods to fill in the details. We compare Lemmanaid against an LLM trained to generate complete lemma statements as well as previous fully symbolic conjecturing methods. Our results indicate that neural and symbolic techniques are complementary. By leveraging the best of both symbolic and neural methods we can generate useful lemmas for a wide range of input domains, facilitating computer-assisted theory development and formalization.

Via

Access Paper or Ask Questions

Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification

Dec 18, 2024

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First

Figure 1 for Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification

Figure 2 for Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification

Figure 3 for Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification

Figure 4 for Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification

Abstract:Formal verification using proof assistants, such as Coq, enables the creation of high-quality software. However, the verification process requires significant expertise and manual effort to write proofs. Recent work has explored automating proof synthesis using machine learning and large language models (LLMs). This work has shown that identifying relevant premises, such as lemmas and definitions, can aid synthesis. We present Rango, a fully automated proof synthesis tool for Coq that automatically identifies relevant premises and also similar proofs from the current project and uses them during synthesis. Rango uses retrieval augmentation at every step of the proof to automatically determine which proofs and premises to include in the context of its fine-tuned LLM. In this way, Rango adapts to the project and to the evolving state of the proof. We create a new dataset, CoqStoq, of 2,226 open-source Coq projects and 196,929 theorems from GitHub, which includes both training data and a curated evaluation benchmark of well-maintained projects. On this benchmark, Rango synthesizes proofs for 32.0% of the theorems, which is 29% more theorems than the prior state-of-the-art tool Tactician. Our evaluation also shows that Rango adding relevant proofs to its context leads to a 47% increase in the number of theorems proven.

* In Proceedings of the 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, April 2025

Via

Access Paper or Ask Questions

Cobblestone: Iterative Automation for Formal Verification

Oct 25, 2024

Saketh Ram Kasibatla, Arpan Agarwal, Yuriy Brun, Sorin Lerner, Talia Ringer, Emily First

Abstract:Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but it is expensive. Writing proofs manually requires both significant effort and expertise. Recent research has used machine learning to automatically synthesize proofs, reducing verification effort, but these tools are able to prove only a fraction of the desired software properties. We introduce Cobblestone, a new proof-synthesis approach that improves on the state of the art by taking advantage of partial progress in proof synthesis attempts. Unlike prior tools, Cobblestone can produce multiple unsuccessful proofs using a large language model (LLM), identify the working portions of those proofs, and combine them into a single, successful proof, taking advantage of internal partial progress. We evaluate Cobblestone on two benchmarks of open-source Coq projects, controlling for training data leakage in LLM datasets. Fully automatically, Cobblestone can prove 48% of the theorems, while Proverbot9001, the previous state-of-the-art, learning-based, proof-synthesis tool, can prove 17%. Cobblestone establishes a new state of the art for fully automated proof synthesis tools for Coq. We also evaluate Cobblestone in a setting where it is given external partial proof progress from oracles, serving as proxies for a human proof engineer or another tool. When the theorem is broken down into a set of subgoals and Cobblestone is given a set of relevant lemmas already proven in the project, it can prove up to 58% of the theorems. We qualitatively study the theorems Cobblestone is and is not able to prove to outline potential future research directions to further improve proof synthesis, including developing interactive, semi-automated tools. Our research shows that tools can make better use of partial progress made during proof synthesis to more effectively automate formal verification.

* 13 pages, 10 figures

Via

Access Paper or Ask Questions

Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving

Apr 10, 2024

Chenyang An, Zhibo Chen, Qihao Ye, Emily First, Letian Peng, Jiayun Zhang, Zihan Wang, Sorin Lerner, Jingbo Shang

Figure 1 for Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving

Figure 2 for Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving

Figure 3 for Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving

Figure 4 for Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving

Abstract:Recent advances in Automated Theorem Proving have shown the effectiveness of leveraging a (large) language model that generates tactics (i.e. proof steps) to search through proof states. The current model, while trained solely on successful proof paths, faces a discrepancy at the inference stage, as it must sample and try various tactics at each proof state until finding success, unlike its training which does not incorporate learning from failed attempts. Intuitively, a tactic that leads to a failed search path would indicate that similar tactics should receive less attention during the following trials. In this paper, we demonstrate the benefit of training models that additionally learn from failed search paths. Facing the lack of such trial-and-error data in existing open-source theorem-proving datasets, we curate a dataset on intuitionistic propositional logic theorems and formalize it in Lean, such that we can reliably check the correctness of proofs. We compare our model trained on relatively short trial-and-error information (TrialMaster) with models trained only on the correct paths and discover that the former solves more unseen theorems with lower trial searches.

* Submitted to ACL on Feb.15th 2024

Via

Access Paper or Ask Questions

Baldur: Whole-Proof Generation and Repair with Large Language Models

Mar 16, 2023

Emily First, Markus N. Rabe, Talia Ringer, Yuriy Brun

Figure 1 for Baldur: Whole-Proof Generation and Repair with Large Language Models

Figure 2 for Baldur: Whole-Proof Generation and Repair with Large Language Models

Figure 3 for Baldur: Whole-Proof Generation and Repair with Large Language Models

Figure 4 for Baldur: Whole-Proof Generation and Repair with Large Language Models

Abstract:Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

Via

Access Paper or Ask Questions