Shahar Levy

More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG

Mar 06, 2025

Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky

Abstract: Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen.

* Preprint


SEAM: A Stochastic Benchmark for Multi-Document Tasks

Jun 23, 2024

Gili Lior, Avi Caciularu, Arie Cattan, Shahar Levy, Ori Shapira, Gabriel Stanovsky

Abstract: Various tasks, such as summarization, multi-hop question answering, or coreference resolution, are naturally phrased over collections of real-world documents. Such tasks present a unique set of challenges, revolving around the lack of coherent narrative structure across documents, which often leads to contradiction, omission, or repetition of information. Despite their real-world application and challenging properties, there is currently no benchmark that specifically measures the abilities of large language models (LLMs) on multi-document tasks. To bridge this gap, we present SEAM (a Stochastic Evaluation Approach for Multi-document tasks), a conglomerate benchmark over a diverse set of multi-document datasets, setting conventional evaluation criteria, input-output formats, and evaluation protocols. In particular, SEAM addresses the sensitivity of LLMs to minor prompt variations through repeated evaluations, where in each evaluation we sample uniformly at random the values of arbitrary factors (e.g., the order of documents). We evaluate different LLMs on SEAM, finding that multi-document tasks pose a significant challenge for LLMs, even for state-of-the-art models with 70B parameters. In addition, we show that the stochastic approach uncovers underlying statistical trends which cannot be observed in a static benchmark. We hope that SEAM will spur progress via consistent and meaningful evaluation of multi-document tasks.
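The repeated-evaluation idea in this abstract can be illustrated with a minimal sketch. It is not the SEAM implementation: the function name `stochastic_eval`, the `score_fn` callback, and the choice of document order as the only sampled factor are all assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def stochastic_eval(
    documents: Sequence[str],
    score_fn: Callable[[List[str]], float],  # hypothetical: scores one model run
    n_runs: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Score the same instance n_runs times, each with a uniformly random
    document order, and return the mean and population standard deviation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        order = list(documents)  # copy so the caller's sequence is not mutated
        rng.shuffle(order)       # sample one permutation uniformly at random
        scores.append(score_fn(order))
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    # Toy stand-in for a model that only succeeds when the key document is first.
    docs = ["key", "distractor-1", "distractor-2"]
    mean, std = stochastic_eval(docs, lambda d: 1.0 if d[0] == "key" else 0.0, n_runs=50)
    print(f"mean={mean:.2f} std={std:.2f}")
```

Reporting the spread alongside the mean is what lets an order-sensitive model show up as unstable, which a single fixed ordering would hide.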


Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation

Sep 10, 2021

Shahar Levy, Koren Lazar, Gabriel Stanovsky

Abstract: Recent works have found evidence of gender bias in models of machine translation and coreference resolution using mostly synthetic diagnostic datasets. While these quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding it mitigates bias on a held-out set. Our dataset and models are publicly available at www.github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation and mitigation techniques in realistic settings.

* Accepted to Findings of EMNLP 2021
