Cyril Chhun

Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation

May 22, 2024

Cyril Chhun, Fabian M. Suchanek, Chloé Clavel

Abstract: Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning and deep understanding. Meanwhile, Large Language Models (LLM) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.

* TACL, pre-MIT Press publication version

Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation

Aug 25, 2022

Cyril Chhun, Pierre Colombo, Chloé Clavel, Fabian M. Suchanek

Abstract: Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.

* 43 pages, 38 figures. To appear in Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022)
