
Title: MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

Oct 17, 2024

Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan



Abstract: Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained. Empirical investigations of such questions are impeded by two major flaws of current evaluations: (i) much of the evaluation data is contaminated, in the sense that it has already been seen during training, and (ii) benchmark datasets do not capture how problem proofs may be arbitrarily complex in various ways. As a step towards addressing these issues, we present a framework for evaluating LLMs on problems that have arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problems that follow fixed proof specifications, along with chain-of-thought reasoning annotations, enabling systematic studies on generalization with respect to arithmetic proof complexity. We apply MathGAP to analyze how in-context learning interacts with generalization to problems that have more complex proofs. We find that among the models tested, most show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for GPT-4o. Surprisingly, providing in-context examples from the same distribution as the test set is not always beneficial for performance. In particular, zero-shot prompting as well as demonstrating a diverse range of examples that are less complex than the test data sometimes yield similar or higher accuracies.

* Preprint
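
To make the idea of generating problems from a fixed proof specification more concrete, here is a minimal sketch in Python. It is not MathGAP's implementation: the function name `make_linear_problem`, the sentence templates, and the restriction to a purely linear chain of additions are all assumptions made for illustration. It only shows how a single parameter (`depth`) can control the number of sequential inference steps, and how step-by-step annotations can be emitted alongside the problem text.

```python
import random


def make_linear_problem(depth, seed=0):
    """Build a toy word problem whose solution needs `depth` sequential additions.

    Hypothetical helper for illustration only; not part of MathGAP.
    Returns (problem_text, reasoning_steps, final_answer).
    """
    rng = random.Random(seed)  # seeded so the same spec yields the same problem
    total = rng.randint(1, 9)
    sentences = [f"Alice has {total} apples."]
    steps = [f"Alice starts with {total} apples."]
    for _ in range(depth):
        gain = rng.randint(1, 9)
        sentences.append(f"Alice gets {gain} more apples.")
        # Each loop iteration adds one inference step to the annotation.
        steps.append(f"{total} + {gain} = {total + gain}")
        total += gain
    sentences.append("How many apples does Alice have now?")
    return " ".join(sentences), steps, total


if __name__ == "__main__":
    problem, steps, answer = make_linear_problem(depth=3, seed=1)
    print(problem)
    for step in steps:
        print(" ", step)
    print("Answer:", answer)
```

Raising `depth` yields problems that need longer chains of reasoning than any shorter example shown in a prompt, which is the kind of controlled complexity shift the abstract describes. The wider and nonlinear proof structures mentioned there would need a tree-shaped specification rather than this single chain.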
