Abstract: Recent work has shown that small changes to benchmark questions can reduce LLMs' measured reasoning and recall performance. We explore two such changes, pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models these changes predictably reduce performance, effectively restoring headroom to a saturated benchmark and unsaturating it again. We suggest this approach can be used to resurrect old benchmarks.
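A minimal sketch of the two perturbations, assuming MMLU-style items with hypothetical `question`, `choices`, and `answer` (index) fields; this illustrates the general idea under those assumptions, not the exact construction used in our experiments.

```python
import itertools
import random

def pair_questions(item_a, item_b):
    """Combine two multiple-choice items into one compound item.

    The answer options are the Cartesian product of the originals,
    so the model must answer both sub-questions correctly.
    """
    question = (
        "Answer both sub-questions.\n"
        f"(1) {item_a['question']}\n"
        f"(2) {item_b['question']}"
    )
    choices = [
        f"(1) {ca}  (2) {cb}"
        for ca, cb in itertools.product(item_a["choices"], item_b["choices"])
    ]
    # Correct option index under row-major ordering of the product.
    answer = item_a["answer"] * len(item_b["choices"]) + item_b["answer"]
    return {"question": question, "choices": choices, "answer": answer}

def add_distractors(item, distractor_pool, k=4, seed=0):
    """Append k extra incorrect options drawn from a pool of distractors."""
    rng = random.Random(seed)
    pool = [c for c in distractor_pool if c not in item["choices"]]
    new_choices = item["choices"] + rng.sample(pool, k)
    return {**item, "choices": new_choices}  # correct index is unchanged
```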