Long-form clinical summarization of hospital admissions has real-world significance because of its potential to help both clinicians and patients. The faithfulness of summaries is critical to their safe use in clinical settings. To better understand the limitations of abstractive systems, as well as the suitability of existing evaluation metrics, we benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient's Brief Hospital Course. We create a corpus of patient hospital admissions and summaries for a cohort of HIV patients, each with complex medical histories. Annotators are presented with summaries and source notes and asked to categorize manually highlighted summary elements (clinical entities like conditions and medications, as well as actions like ``following up'') into one of three categories: ``Incorrect,'' ``Missing,'' and ``Not in Notes.'' We meta-evaluate a broad set of proposed faithfulness metrics and, across metrics, explore the importance of domain adaptation (e.g., the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble of pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text correlate well with human judgments yet overly rely on summary extractiveness. As a practical guide to long-form clinical narrative summarization, we find that most metrics correlate best with human judgments when provided with one summary sentence at a time and a minimal amount of relevant source context.