Jiangnan Hang

Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates

Aug 22, 2024

Yusuke Sakai, Adam Nohejl, Jiangnan Hang, Hidetaka Kamigaito, Taro Watanabe

Abstract: The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
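
The abstract does not give the formula for the Sharpe score, so the following is only a minimal sketch of the general idea: a mean score across instruction templates that is discounted by the spread between them. The function name `sharpe_like_score`, the parameter `alpha`, and the form mean / (alpha * std + 1) are assumptions made for illustration, not the paper's stated definition.

```python
from statistics import mean, pstdev


def sharpe_like_score(template_scores, alpha=1.0):
    """Variance-penalized average of per-template scores.

    Hypothetical form, assumed for illustration only:
        mean / (alpha * std + 1)
    where std is the population standard deviation across templates.
    """
    if not template_scores:
        raise ValueError("need at least one template score")
    mu = mean(template_scores)
    sigma = pstdev(template_scores)  # 0.0 when all templates agree
    return mu / (alpha * sigma + 1.0)


# Two made-up models with the same average accuracy over three templates.
stable_model = [0.8, 0.8, 0.8]
unstable_model = [0.6, 0.8, 1.0]

stable = sharpe_like_score(stable_model)
unstable = sharpe_like_score(unstable_model)
print(f"stable: {stable:.3f}, unstable: {unstable:.3f}")
```

Under this assumed form, a model whose accuracy is identical for every template keeps its mean unchanged, while a model with the same mean but larger template-to-template swings receives a lower score. Setting `alpha=0` reduces the score to the plain mean.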

* 19 pages, 7 figures
