Title: Factual Consistency Evaluation of Summarisation in the Era of Large Language Models

Feb 21, 2024

Zheheng Luo, Qianqian Xie, Sophia Ananiadou

Abstract: Factual inconsistency with source documents in automatically generated summaries can lead to misinformation or pose risks. Existing factual consistency (FC) metrics are constrained by their performance, efficiency, and explainability. Recent advances in large language models (LLMs) have demonstrated remarkable potential in text evaluation, but their effectiveness in assessing FC in summarisation remains underexplored. Prior research has mostly focused on proprietary LLMs, leaving essential factors that affect their assessment capabilities unexplored. Additionally, current FC evaluation benchmarks are restricted to news articles, casting doubt on the generality of the FC methods tested on them. In this paper, we first address the gap by introducing TreatFact, a dataset of LLM-generated summaries of clinical texts, annotated for FC by domain experts. Moreover, we benchmark 11 LLMs for FC evaluation across news and clinical domains and analyse the impact of model size, prompts, pre-training and fine-tuning data. Our findings reveal that proprietary models prevail on the task, while open-source LLMs lag behind. Nevertheless, there is potential for enhancing the performance of open-source LLMs through increasing model size, expanding pre-training data, and developing well-curated fine-tuning data. Experiments on TreatFact suggest that both previous methods and LLM-based evaluators are unable to capture factual inconsistencies in clinical summaries, posing a new challenge for FC evaluation.

* 5 figures
