Abstract: Current benchmarks for evaluating large language models (LLMs) suffer from issues such as restricted evaluation content, untimely updates, and a lack of optimization guidance. In this paper, we propose a new paradigm for measuring LLMs: Benchmarking-Evaluation-Assessment. This paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital": by conducting a "physical examination" on LLMs, it uses specific task-solving as the evaluation content, performs deep attribution of the problems found within LLMs, and provides recommendations for optimization.