Title: ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

May 08, 2024

Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Abstract: Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM rating sometimes maintained, but more often lowered, the inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by comparing the correlation between LLM-generated ratings and majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured the correlation between the original gold labels and majority-voted labels obtained from a limited human pool plus an LLM as an additional rater. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases, LLMs were neutral to detrimental when there were three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: https://github.com/a-brassard/ACORN.

* 18 pages, 7 figures, under review. Data available here: https://github.com/a-brassard/ACORN
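
The comparison described in the abstract (majority-voted human ratings against LLM ratings, scored with Spearman's rank correlation) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the toy ratings are made up, the function names are hypothetical, and breaking majority-vote ties toward the lower rating is an assumption the abstract does not specify.

```python
from collections import Counter
from math import sqrt


def majority_vote(ratings):
    """Most frequent rating; ties go to the lower rating (an assumption)."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up toy data: three human ratings per explanation for one quality
# aspect, and one LLM rating per explanation.
human_ratings = [[1, 1, 2], [3, 3, 3], [2, 3, 2], [5, 4, 5], [4, 4, 5]]
llm_ratings = [1, 3, 3, 5, 4]

gold = [majority_vote(r) for r in human_ratings]
rho = spearman(gold, llm_ratings)
print(f"Spearman correlation with majority-voted labels: {rho:.2f}")
```

Repeating this per quality aspect and averaging the resulting correlations would give a summary figure of the kind reported above, under the same assumptions.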
