Title: MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Oct 07, 2024

Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo

Abstract: Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks like Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay demands models with both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.

* Work-in-Progress
