Aidan M. Swope

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

Jun 27, 2023

Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar

Abstract: Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables interaction with the proof environment programmatically. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection: a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): the first LLM-based prover that is augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 96,962 theorems and proofs extracted from Lean's math library. It features a challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research.
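
The premise-selection idea in the abstract above (rank only the premises accessible at the current proof state by similarity to that state) can be illustrated with a toy sketch. This is not ReProver's retriever: the bag-of-tokens `embed` function stands in for a learned encoder, and the names `retrieve_premises`, `premises`, and `accessible` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-token counts (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_premises(state, premises, accessible, k=2):
    """Return the names of the k accessible premises most similar to the proof state."""
    query = embed(state)
    scored = [
        (cosine(query, embed(statement)), name)
        for name, statement in premises.items()
        if name in accessible  # inaccessible premises are filtered out before ranking
    ]
    # Sort by descending similarity, breaking ties by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical premise library; "future_lemma" is not accessible from the current theorem.
premises = {
    "add_comm": "a + b = b + a",
    "mul_comm": "a * b = b * a",
    "add_zero": "a + 0 = a",
    "future_lemma": "a + b = b + a",
}
accessible = {"add_comm", "mul_comm", "add_zero"}
top = retrieve_premises("⊢ a + b = b + a", premises, accessible, k=2)
```

Filtering by accessibility before ranking shrinks the candidate set, which is one plausible reading of why the abstract says this makes retrieval more effective; how ReProver mines hard negatives is not shown here.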

Representation Learning for Remote Sensing: An Unsupervised Sensor Fusion Approach

Aug 11, 2021

Aidan M. Swope, Xander H. Rudelis, Kyle T. Story

Abstract: In the application of machine learning to remote sensing, labeled data is often scarce or expensive, which impedes the training of powerful models like deep convolutional neural networks. Although unlabeled data is abundant, recent self-supervised learning approaches are ill-suited to the remote sensing domain. In addition, most remote sensing applications currently use only a small subset of the multi-sensor, multi-channel information available, motivating the need for fused multi-sensor representations. We propose a new self-supervised training objective, Contrastive Sensor Fusion, which exploits coterminous data from multiple sources to learn useful representations of every possible combination of those sources. This method uses information common across multiple sensors and bands by training a single model to produce a representation that remains similar when any subset of its input channels is used. Using a dataset of 47 million unlabeled coterminous image triplets, we train an encoder to produce semantically meaningful representations from any possible combination of channels from the input sensors. These representations outperform fully supervised ImageNet weights on a remote sensing classification task and improve as more sensors are fused. Our code is available at https://storage.cloud.google.com/public-published-datasets/csf_code.zip.
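
As a rough illustration of the objective described in the abstract above, the sketch below builds a view of a multi-channel image by dropping a random subset of channels, then scores paired embeddings with an InfoNCE-style contrastive loss. The choice of InfoNCE, the temperature value, and the function names `drop_channels` and `info_nce` are assumptions made for illustration; the abstract does not specify the loss, and the encoder that would map views to embeddings is omitted.

```python
import math
import random


def drop_channels(image, keep_prob, rng):
    """Zero out a random subset of channels, always keeping at least one.

    `image` is a list of channels, each a list of pixel values.
    """
    n = len(image)
    keep = [rng.random() < keep_prob for _ in range(n)]
    if not any(keep):
        keep[rng.randrange(n)] = True
    return [list(ch) if k else [0.0] * len(ch) for ch, k in zip(image, keep)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def info_nce(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss: embedding i in emb_a should match embedding i in emb_b."""
    total = 0.0
    for i, a in enumerate(emb_a):
        logits = [cosine(a, b) / temperature for b in emb_b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(emb_a)


# Two views of one 3-channel image; a real pipeline would encode each view (encoder not shown).
rng = random.Random(0)
image = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
view_1 = drop_channels(image, 0.5, rng)
view_2 = drop_channels(image, 0.5, rng)

# Hand-written embeddings: matching pairs give a low loss, mismatched pairs a high one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pushes embeddings of two channel subsets of the same scene together and embeddings of different scenes apart, which is one way to obtain a representation that stays similar under any subset of input channels.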

* 9 pages, 5 figures. Work completed in 2019 and submitted to ICLR in 2020. Source code available at: https://github.com/descarteslabs/contrastive_sensor_fusion. Data available at: https://storage.cloud.google.com/public-published-datasets/osm_example_dataset.zip?folder=true&organizationId=272688069953
