Jay DeYoung

Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

May 23, 2023

Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey E. Kuehl, Erin Bransom, Byron C. Wallace

Abstract: Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as ROUGE. Better automated evaluation metrics are needed, but few resources exist to assess metrics when they are proposed. Therefore, we introduce a dataset of human-assessed summary quality facets and pairwise preferences to encourage and support the development of better automated evaluation methods for literature review MDS. We take advantage of community submissions to the Multi-document Summarization for Literature Review (MSLR) shared task to compile a diverse and representative sample of generated summaries. We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries, with other automated metrics (including several we propose in this work), and with aspects of human-assessed summary quality. We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, but in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
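The comparison of system rankings described in this abstract can be illustrated with a rank correlation. The following is a minimal sketch, not the authors' analysis code: it assumes Kendall's tau-a as the correlation measure, and the system names and scores are invented purely for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system scores (assumptions, not data from the paper).
metric_scores = {"sys_a": 0.31, "sys_b": 0.28, "sys_c": 0.25, "sys_d": 0.22}
human_scores = {"sys_a": 2.1, "sys_b": 2.9, "sys_c": 3.4, "sys_d": 3.0}

systems = sorted(metric_scores)
tau = kendall_tau(
    [metric_scores[s] for s in systems],
    [human_scores[s] for s in systems],
)
print(f"Kendall tau between metric and human rankings: {tau:.3f}")
```

A negative value means the automated metric tends to rank systems in the opposite order from the human annotators, which is the kind of anti-correlation the abstract reports.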

* ACL 2023; GitHub: https://github.com/allenai/mslr-annotated-dataset

Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs

May 05, 2023

Somin Wadhwa, Jay DeYoung, Benjamin Nye, Silvio Amir, Byron C. Wallace

Abstract: Results from Randomized Controlled Trials (RCTs) establish the comparative effectiveness of interventions, and are in turn critical inputs for evidence-based care. However, results from RCTs are presented in (often unstructured) natural language articles describing the design, execution, and outcomes of trials; clinicians must manually extract findings pertaining to interventions and outcomes of interest from such articles. This onerous manual process has motivated work on (semi-)automating extraction of structured evidence from trial reports. In this work we propose and evaluate a text-to-text model built on instruction-tuned Large Language Models (LLMs) to jointly extract Interventions, Outcomes, and Comparators (ICO elements) from clinical abstracts, and infer the associated results reported. Manual (expert) and automated evaluations indicate that framing evidence extraction as a conditional generation task and fine-tuning LLMs for this purpose realizes considerable (~20 point absolute F1 score) gains over the previous SOTA. We perform ablations and error analyses to assess aspects that contribute to model performance, and to highlight potential directions for further improvements. We apply our model to a collection of published RCTs through mid-2022, and release a searchable database of structured findings (anonymously for now): bit.ly/joint-relations-extraction-mlhc

* Under Review

Do Multi-Document Summarization Models Synthesize?

Jan 31, 2023

Jay DeYoung, Stephanie C. Martinez, Iain J. Marshall, Byron C. Wallace

Abstract: Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key property or aspect. For example, a synopsis of film reviews all written about a particular movie should reflect the average critic consensus. As a more consequential example, consider narrative summaries that accompany biomedical systematic reviews of clinical trial results. These narratives should fairly summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this type of synthesis? To assess this we perform a suite of experiments that probe the degree to which conditional generation models trained for summarization using standard methods yield outputs that appropriately synthesize inputs. We find that existing models do partially perform synthesis, but do so imperfectly. In particular, they are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., the ratio of positive to negative movie reviews). We propose a simple, general method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate. This approach improves model synthesis performance. We hope that highlighting the need for synthesis (in some summarization settings) motivates further research into multi-document summarization methods and learning objectives that explicitly account for the need to synthesize.
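The select-or-abstain step described in this abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the mean is assumed as the aggregate measure, the function name `select_or_abstain`, the `tolerance` threshold, and the toy lexicon scorer are all hypothetical, and the candidates are hard-coded rather than generated by a model.

```python
from statistics import mean

# Toy lexicon standing in for a real property classifier (assumption).
POSITIVE = {"great", "good", "effective"}
NEGATIVE = {"bad", "poor", "ineffective"}


def toy_sentiment(text):
    """Fraction of sentiment words that are positive; 0.5 if none are found."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)


def select_or_abstain(candidates, input_scores, score_fn, tolerance):
    """Return the candidate whose score is closest to the mean input score.

    Returns None (abstains) if there are no candidates or if even the best
    candidate is further than `tolerance` from the target.
    """
    if not candidates:
        return None
    target = mean(input_scores)
    best = min(candidates, key=lambda c: abs(score_fn(c) - target))
    if abs(score_fn(best) - target) > tolerance:
        return None
    return best


# Hypothetical candidate summaries and per-review sentiment labels.
candidates = [
    "a great and good film",
    "good acting but poor pacing",
    "a bad and poor film",
]
input_scores = [1.0, 0.0, 1.0, 0.0]  # two positive and two negative reviews

chosen = select_or_abstain(candidates, input_scores, toy_sentiment, tolerance=0.2)
print(chosen)
```

With evenly split inputs the mixed candidate is selected; if every candidate were far from the target, the function would return `None` instead of emitting a misleading summary.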

* 22 pages, 13 figures, 22 tables. ACL-formatted paper; expanded version of a rejected ICLR submission (https://openreview.net/forum?id=1PTeB4MWCfU). Paper de-anonymized ahead of ICLR de-anonymization due to ACL policies/additional conference submission

Entity Anchored ICD Coding

Aug 15, 2022

Jay DeYoung, Han-Chin Shing, Luyang Kong, Christopher Winestock, Chaitanya Shivade

Abstract: Medical coding is a complex task, requiring assignment of a subset of over 72,000 ICD codes to a patient's notes. Modern natural language processing approaches to these tasks have been challenged by the length of the input and size of the output space. We limit our model inputs to a small window around medical entities found in our documents. From those local contexts, we build contextualized representations of both ICD codes and entities, and aggregate over these representations to form document-level predictions. In contrast to existing methods which use a representation fixed either in size or by codes seen in training, we represent ICD codes by encoding the code description with local context. We discuss metrics appropriate to deploying coding systems in practice. We show that our approach is superior to existing methods in both standard and deployable measures, including performance on rare and unseen codes.
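The idea of restricting model inputs to a small window around each detected entity can be sketched as below. This is only an illustration of the windowing step, assuming whitespace tokens and entity spans given as half-open token offsets; the function name `entity_windows`, the window size, and the example note are hypothetical, and the paper's actual entity detection and encoding are not shown.

```python
def entity_windows(tokens, entity_spans, window=2):
    """Return the local token context around each entity span.

    `entity_spans` holds (start, end) token offsets, end exclusive.
    Each window includes up to `window` tokens on either side of the entity,
    clipped to the document boundaries.
    """
    windows = []
    for start, end in entity_spans:
        lo = max(0, start - window)
        hi = min(len(tokens), end + window)
        windows.append(tokens[lo:hi])
    return windows


# Hypothetical clinical note and entity offsets (assumptions for illustration).
tokens = "patient admitted with acute chest pain and a history of type 2 diabetes".split()
entity_spans = [(4, 6), (10, 13)]  # "chest pain", "type 2 diabetes"

windows = entity_windows(tokens, entity_spans, window=2)
for w in windows:
    print(" ".join(w))
```

Only these short contexts, rather than the full note, would then be passed to an encoder, which keeps the input length bounded regardless of document size.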

* Accepted to American Medical Informatics Association (AMIA) 2022 Annual Symposium

MS2: Multi-Document Summarization of Medical Studies

Apr 15, 2021

Jay DeYoung, Iz Beltagy, Madeleine van Zuylen, Bailey Kuehl, Lucy Lu Wang

Abstract: To assess the effectiveness of any medical intervention, researchers must conduct a time-intensive and highly manual literature review. NLP systems can help to automate or assist in parts of this expensive process. In support of this goal, we release MS^2 (Multi-Document Summarization of Medical Studies), a dataset of over 470k documents and 20k summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain. We experiment with a summarization system based on BART, with promising early results. We formulate our summarization inputs and targets in both free text and structured forms and modify a recently proposed metric to assess the quality of our system's generated summaries. Data and models are available at https://github.com/allenai/ms2

* 8 pages of content, 20 pages including references and appendix. See https://github.com/allenai/ms2/ for code, https://ai2-s2-ms2.s3-us-west-2.amazonaws.com/ms_data_2021-04-12.zip for data (1.8G, zipped)

Understanding Clinical Trial Reports: Extracting Medical Entities and Their Relations

Oct 08, 2020

Benjamin E. Nye, Jay DeYoung, Eric Lehman, Ani Nenkova, Iain J. Marshall, Byron C. Wallace

Abstract: The best evidence concerning comparative treatment effectiveness comes from clinical trials, the results of which are reported in unstructured articles. Medical experts must manually extract information from articles to inform decision-making, which is time-consuming and expensive. Here we consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification), and (b) inferring the reported results for the former with respect to the latter (relation extraction). We introduce new data for this task, and evaluate models that have recently achieved state-of-the-art results on similar tasks in Natural Language Processing. We then propose a new method, motivated by how trial results are typically presented, that outperforms these purely data-driven baselines. Finally, we run a fielded evaluation of the model with a non-profit seeking to identify existing drugs that might be re-purposed for cancer, showing the potential utility of end-to-end evidence extraction systems.

Evidence Inference 2.0: More Data, Better Models

May 14, 2020

Jay DeYoung, Eric Lehman, Ben Nye, Iain J. Marshall, Byron C. Wallace

Abstract: How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract only (as opposed to full-texts) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at http://evidence-inference.ebm-nlp.com/.

* Accepted as a workshop paper at BioNLP. Updated results from SciBERT to Biomed RoBERTa

ERASER: A Benchmark to Evaluate Rationalized NLP Models

Nov 08, 2019

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, Byron C. Wallace

Abstract: State-of-the-art models in NLP are now predominantly based on deep neural networks that are generally opaque in terms of how they come to specific predictions. This limitation has led to increased interest in designing more interpretable deep models for NLP that can reveal the 'reasoning' underlying model outputs. But work in this direction has been conducted on different datasets and tasks with correspondingly unique aims and metrics; this makes it difficult to track progress. We propose the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark to advance research on interpretable models in NLP. This benchmark comprises multiple datasets and tasks for which human annotations of "rationales" (supporting evidence) have been collected. We propose several metrics that aim to capture how well the rationales provided by models align with human rationales, and also how faithful these rationales are (i.e., the degree to which provided rationales influenced the corresponding predictions). Our hope is that releasing this benchmark facilitates progress on designing more interpretable NLP systems. The benchmark, code, and documentation are available at: www.eraserbenchmark.com.

* https://github.com/jayded/eraserbenchmark http://www.eraserbenchmark.com/

Inferring Which Medical Treatments Work from Reports of Clinical Trials

Apr 04, 2019

Eric Lehman, Jay DeYoung, Regina Barzilay, Byron C. Wallace

Abstract: How do we know if a particular medical treatment actually works? Ideally one would consult all available evidence from relevant clinical trials. Unfortunately, such results are primarily disseminated in natural language scientific articles, imposing substantial burden on those trying to make sense of them. In this paper, we present a new task and corpus for making this unstructured evidence actionable. The task entails inferring reported findings from a full-text article describing a randomized controlled trial (RCT) with respect to a given intervention, comparator, and outcome of interest, e.g., inferring if an article provides evidence supporting the use of aspirin to reduce risk of stroke, as compared to placebo. We present a new corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs. Results using a suite of models, ranging from heuristic (rule-based) approaches to attentive neural architectures, demonstrate the difficulty of the task, which we believe largely owes to the lengthy, technical input texts. To facilitate further work on this important, challenging problem we make the corpus, documentation, a website and leaderboard, and code for baselines and evaluation available at http://evidence-inference.ebm-nlp.com/.

* Accepted to NAACL 2019

Events Beyond ACE: Curated Training for Events

Sep 24, 2018

Ryan Gabbard, Jay DeYoung, Marjorie Freedman

Abstract: We explore a human-driven approach to annotation, curated training (CT), in which annotation is framed as teaching the system by using interactive search to identify informative snippets of text to annotate, unlike traditional approaches which either annotate preselected text or use active learning. A trained annotator performed 80 hours of CT for the thirty event types of the NIST TAC KBP Event Argument Extraction evaluation. Combining this annotation with ACE results in a 6% reduction in error, and the learning curve of CT plateaus more slowly than for full-document annotation. Three NLP researchers performed CT for one event type and showed much sharper learning curves, with all three exceeding ACE performance in less than ninety minutes, suggesting that CT can provide further benefits when the annotator deeply understands the system.
