Heba Elfardy

Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics

Jan 30, 2026

Yilun Hua, Giuseppe Castellucci, Peter Schulam, Heba Elfardy, Kevin Small

Abstract: Retrieval Augmented Generation (RAG)'s success depends on the utility the LLM derives from the content used for grounding. Quantifying content utility does not have a definitive specification and existing metrics ignore model-specific capabilities and/or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM's generation confidence based on entropy. Despite having no annotation requirements, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query-rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements by up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.
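
The abstract does not give the GroGU formula, so the following is only a rough sketch of the general idea of an entropy-based confidence signal, not the authors' definition. It assumes, purely for illustration, that utility is the drop in mean next-token entropy when a document is added to the prompt. The function names and the toy probability values are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over all generated token positions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_reduction_utility(
    without_doc: Sequence[Sequence[float]],
    with_doc: Sequence[Sequence[float]],
) -> float:
    """Assumed utility: how much the document lowers mean entropy."""
    return mean_entropy(without_doc) - mean_entropy(with_doc)


# Made-up next-token distributions over a 4-word vocabulary.
without_doc = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
with_doc = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]
utility = entropy_reduction_utility(without_doc, with_doc)
```

Under this assumption, a positive value means the model generates with more confidence when the document is present. No reference answer is needed, which matches the reference-free property the abstract describes.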

SumREN: Summarizing Reported Speech about Events in News

Dec 02, 2022

Revanth Gangi Reddy, Heba Elfardy, Hou Pong Chan, Kevin Small, Heng Ji

Abstract: A primary objective of news articles is to establish the factual record for an event, frequently achieved by conveying both the details of the specified event (i.e., the 5 Ws; Who, What, Where, When and Why regarding the event) and how people reacted to it (i.e., reported statements). However, existing work on news summarization almost exclusively focuses on the event details. In this work, we propose the novel task of summarizing the reactions of different speakers, as expressed by their reported statements, to a given event. To this end, we create a new multi-document summarization benchmark, SUMREN, comprising 745 summaries of reported statements from various public figures obtained from 633 news articles discussing 132 events. We propose an automatic silver training data generation approach for our task, which helps smaller models like BART achieve GPT-3 level performance on this task. Finally, we introduce a pipeline-based framework for summarizing reported speech, which we empirically show to generate summaries that are more abstractive and factual than baseline query-focused summarization approaches.

* Accepted at AAAI 2023

PLAtE: A Large-scale Dataset for List Page Web Extraction

May 24, 2022

Aidan San, Jan Bakus, Colin Lockard, David Ciemiewicz, Yangfeng Ji, Sandeep Atluri, Kevin Small, Heba Elfardy

Abstract: Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items. PLAtE encompasses both the tasks of: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 53,905 items from 6,810 pages, making it the first large-scale list page web extraction dataset. We construct PLAtE by collecting list pages from Common Crawl, then annotating them on Mechanical Turk. Quantitative and qualitative analyses are performed to demonstrate PLAtE has high-quality annotations. We establish strong baseline performance on PLAtE with a SOTA model achieving an F1-score of 0.750 for attribute classification and 0.915 for segmentation, indicating opportunities for future research innovations in web extraction.

Answer Consolidation: Formulation and Benchmarking

Apr 29, 2022

Wenxuan Zhou, Qiang Ning, Heba Elfardy, Kevin Small, Muhao Chen

Abstract: Current question answering (QA) systems primarily consider the single-answer scenario, where each question is assumed to be paired with one correct answer. However, in many real-world QA applications, multiple answer scenarios arise where consolidating answers into a comprehensive and non-redundant set of answers is a more efficient user interface. In this paper, we formulate the problem of answer consolidation, where answers are partitioned into multiple groups, each representing different aspects of the answer set. Then, given this partitioning, a comprehensive and non-redundant set of answers can be constructed by picking one answer from each group. To initiate research on answer consolidation, we construct a dataset consisting of 4,699 questions and 24,006 sentences and evaluate multiple models. Despite a promising performance achieved by the best-performing supervised models, we still believe this task has room for further improvements.

* NAACL 2022
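
The final step described in the abstract above, picking one answer from each group once the answers have been partitioned, can be illustrated with a minimal sketch. It assumes the group assignment is already given (how the groups are predicted is the hard part and is not shown), and it simply keeps the first answer in each group. The function name and the example answers are hypothetical.

```python
from typing import Dict, List


def consolidate(answers: List[str], group_ids: List[int]) -> List[str]:
    """Keep the first answer seen for each group, in order of first appearance."""
    chosen: Dict[int, str] = {}
    for answer, gid in zip(answers, group_ids):
        # setdefault stores the answer only if the group has no pick yet.
        chosen.setdefault(gid, answer)
    return list(chosen.values())


# Made-up answers to a question such as "Why is my plant wilting?"
answers = [
    "It may be getting too much water.",
    "Overwatering is a common cause.",
    "It might need more sunlight.",
]
group_ids = [0, 0, 1]  # assumed partition: the first two answers share an aspect
consolidated = consolidate(answers, group_ids)
```

Choosing the first answer per group is an arbitrary design choice made here for brevity; any representative from each group would satisfy the non-redundancy requirement.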

Hidden Biases in Unreliable News Detection Datasets

Apr 20, 2021

Xiang Zhou, Heba Elfardy, Christos Christodoulopoulos, Thomas Butler, Mohit Bansal

Abstract: Automatic unreliable news detection is a research problem with great potential impact. Recently, several papers have shown promising results on large-scale news datasets with models that only use the article itself without resorting to any fact-checking mechanism or retrieving any supporting evidence. In this work, we take a closer look at these datasets. While they all provide valuable resources for future research, we observe a number of problems that may lead to results that do not generalize in more realistic settings. Specifically, we show that selection bias during data collection leads to undesired artifacts in the datasets. In addition, while most systems train and predict at the level of individual articles, overlapping article sources in the training and evaluation data can provide a strong confounding factor that models can exploit. In the presence of this confounding factor, the models can achieve good performance by directly memorizing the site-label mapping instead of modeling the real task of unreliable news detection. We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap. Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task. We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.

* EACL 2021 (11 pages, 3 figures, 8 tables)
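
The clean split with no train/test source overlap mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes each article is a dictionary with a "source" field (a hypothetical schema), assigns whole sources to one side of the split, and leaves out the date-based part of the recommendation.

```python
import random
from typing import Dict, List, Tuple

Article = Dict[str, str]


def split_by_source(
    articles: List[Article], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Article], List[Article]]:
    """Split so that no news source appears in both train and test."""
    sources = sorted({a["source"] for a in articles})
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(sources)
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])
    train = [a for a in articles if a["source"] not in test_sources]
    test = [a for a in articles if a["source"] in test_sources]
    return train, test


# Made-up articles from four hypothetical sites.
articles = [
    {"source": f"site{i % 4}.example", "text": f"article {i}"} for i in range(12)
]
train, test = split_by_source(articles, test_fraction=0.25)
```

Because articles are assigned by source rather than individually, a model cannot score well on the test set merely by memorizing which site carries which label.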
