Title: LongGenbench: Benchmarking Long-Form Generation in Long Context LLMs

Sep 11, 2024

Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee

Abstract: The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect for applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, LongGenbench, which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on LongGenbench, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.

* work in progress. arXiv admin note: text overlap with arXiv:2404.06654 by other authors
