Title: S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

Oct 23, 2023

Fangyu Lei, Qian Liu, Yiming Huang, Shizhu He, Jun Zhao, Kang Liu

Figures 1-4 for S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

Abstract: The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like reasoning and long-context understanding. However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 100K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. As a synthetic benchmark, S3Eval enables the creation of any number of evaluation examples that are theoretically invisible to LLMs, mitigating the test set contamination issue. The synthetic nature of S3Eval provides users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval performance and scores on real-world benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval for evaluation of LLMs. The in-depth analysis also uncovers additional insights, including a performance drop when the answer is sparsely distributed or located in the middle of the context, as well as some counter-intuitive trends in model performance.

* Work in progress
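
To make the idea of a controllable synthetic evaluation concrete, here is a minimal sketch of a generator for a toy key-value lookup task. It is not the S3Eval generator, whose task design is not described in the abstract; the function `make_example` and its parameters `num_rows` (a stand-in for context length) and `answer_position` (where the answer sits in the context) are hypothetical names chosen for illustration.

```python
import random
import string


def random_token(rng, length=6):
    """Return a random lowercase string drawn from the given RNG."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_example(num_rows, answer_position, seed=0):
    """Build one synthetic lookup example (illustrative only, not S3Eval itself).

    num_rows        -- number of key/value rows; more rows means a longer context
    answer_position -- float in [0, 1]; 0 puts the target row first, 1 puts it last
    seed            -- makes generation reproducible
    """
    if num_rows < 1:
        raise ValueError("num_rows must be at least 1")
    if not 0.0 <= answer_position <= 1.0:
        raise ValueError("answer_position must be in [0, 1]")

    rng = random.Random(seed)

    # Draw unique random keys so the question has exactly one correct answer.
    keys, seen = [], set()
    while len(keys) < num_rows:
        key = random_token(rng)
        if key not in seen:
            seen.add(key)
            keys.append(key)

    rows = [(key, rng.randint(0, 9999)) for key in keys]

    # Pick the row the question will ask about, based on the requested position.
    target_index = round(answer_position * (num_rows - 1))
    target_key, target_value = rows[target_index]

    context = "\n".join(f"{key} | {value}" for key, value in rows)
    question = f"What value is paired with key '{target_key}'?"
    return {
        "context": context,
        "question": question,
        "answer": str(target_value),
        "target_index": target_index,
    }


if __name__ == "__main__":
    example = make_example(num_rows=5, answer_position=0.5, seed=42)
    print(example["context"])
    print(example["question"])
```

Because every example is freshly sampled, a model cannot have seen it during training, and sweeping `num_rows` or `answer_position` gives a simple way to probe how accuracy changes with context length or answer location, in the spirit of the analysis the abstract describes.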
