Title: U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Dec 04, 2024

Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga

Abstract: The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, and 20% of its problems are multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset for evaluating LLMs' capabilities in judging solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge achieving an F1-score of 80% on $\mu$-MATH.
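
To make the judge metric concrete, here is a minimal sketch of how an F1-score for an LLM judge could be computed from its verdicts. The abstract does not say which F1 variant is used, so this assumes plain binary F1 with "solution is correct" as the positive class; the function name `f1_score` and the sample labels are hypothetical and not taken from the paper.

```python
def f1_score(gold, predicted):
    """Binary F1 with True ("solution is correct") as the positive class.

    gold:      reference verdicts, one bool per solution
    predicted: the judge's verdicts, one bool per solution
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # judge accepts a correct solution
    fp = sum(1 for g, p in pairs if not g and p)    # judge accepts a wrong solution
    fn = sum(1 for g, p in pairs if g and not p)    # judge rejects a correct solution
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 instead of dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: five solutions with reference and judge verdicts.
gold = [True, True, False, True, False]
judge = [True, False, False, True, True]
score = f1_score(gold, judge)
print(f"F1 = {score:.3f}")
```

On the toy data the judge makes one false acceptance and one false rejection against two true positives, so precision and recall are both 2/3. A macro-averaged variant would compute the same quantity with each class as the positive class in turn and average the two results.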
