Maurice Fürstenberg

Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Nov 25, 2024

Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci

Abstract: The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models (LLMs), offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC = .80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, because the models tend to assign higher scores than teachers, they require further refinement to better capture aspects of content quality.
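The agreement figure quoted in the abstract is a Spearman rank correlation between model scores and teacher scores. As a rough illustration of what that statistic measures, the sketch below computes it in plain Python. The function names and the sample scores are hypothetical and are not taken from the paper; the authors' actual analysis pipeline and their ICC variant are not described here, so ICC is deliberately left out.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical overall scores for five essays (illustrative values only).
    teacher_scores = [3, 5, 2, 4, 4]
    model_scores = [4, 6, 4, 5, 6]
    print(round(spearman(teacher_scores, model_scores), 3))
```

Because the statistic depends only on rank order, a model that scores every essay somewhat higher than the teachers can still reach a high correlation, which is consistent with the abstract reporting both good alignment and a tendency toward higher scores.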

* Accepted at LAK '25
