Title: L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Jul 31, 2023

Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu

Abstract: Recently, there has been growing interest in extending the context length of instruction-following models in order to effectively process single-turn long input (e.g. summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have shown significant strides in handling extremely lengthy input, open-sourced models are still in the early stages of experimentation. It also remains unclear whether extending the context can offer substantial gains over traditional methods such as retrieval, and to what extent it improves upon their regular counterparts in practical downstream tasks. To address this challenge, we propose instituting standardized evaluation for long context language models. Concretely, we develop L-Eval, which contains 411 long documents and over 2,000 human-labeled query-response pairs encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind commercial models, they still exhibit impressive performance compared with their regular versions. LLaMA2-13B achieves the best results on both open-ended tasks (42% win rate vs. turbo-16k-0613) and closed-ended tasks with only 4k context length. We release our new evaluation suite, code, and all generation results, including predictions from all open-sourced LCLMs, GPT4-32k, and Claude-100k, at https://github.com/OpenLMLab/LEval.
