Abstract:Generating long sequences of tokens given a long-context input imposes a heavy computational burden on large language models (LLMs). One of the computational bottlenecks is computing attention over the long input sequence at each generation step. In this paper, we propose Recycled Attention, an inference-time method which alternates between full-context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that performed full attention and attend only to the top-K most attended tokens, reducing the cost of data movement and attention computation. Compared to previously proposed inference-time acceleration methods which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step. We evaluate our method on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and on long-context language modeling tasks. Applied to off-the-shelf LLMs, our method achieves speedups comparable to baselines which only consider the local context, while improving performance by 2x. We further explore two ideas to improve the performance-efficiency trade-off: (1) dynamically deciding when to perform a recycled or full attention step based on query similarities, and (2) continued pre-training of the model with Recycled Attention.
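The core loop of Recycled Attention can be illustrated with a minimal sketch (not the authors' implementation): every few steps a full-attention pass records which cached tokens received the most attention, and the intermediate steps attend only to that top-K subset. The stride, K, and the single-head NumPy setup below are illustrative assumptions, and the sketch ignores that newly generated tokens would also be appended to the KV cache.

```python
# Minimal sketch of the recycled-attention loop (illustrative, not the
# authors' implementation). `stride`, `k`, and the single-head NumPy setup
# are assumptions; growth of the KV cache during decoding is omitted.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(q, K, V, idx=None):
    """One decoding step: full attention if idx is None, otherwise
    attend only to the recycled subset of cached tokens."""
    if idx is not None:
        K, V = K[idx], V[idx]
    scores = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return scores @ V, scores

def decode(queries, K, V, stride=4, k=8):
    """Every `stride` steps, run full attention and recycle its top-k
    most-attended token indices for the intermediate (cheap) steps."""
    recycled_idx, outputs = None, []
    for t, q in enumerate(queries):
        if t % stride == 0:                          # full-attention step
            out, scores = attention_step(q, K, V)
            recycled_idx = np.argsort(scores)[-k:]   # top-k attended tokens
        else:                                        # recycled-attention step
            out, _ = attention_step(q, K, V, recycled_idx)
        outputs.append(out)
    return np.stack(outputs)

rng = np.random.default_rng(0)
K, V = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
queries = rng.normal(size=(12, 16))
print(decode(queries, K, V).shape)  # (12, 16)
```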
Abstract:Recent work on fact-checking addresses a realistic setting where models incorporate evidence retrieved from the web to decide the veracity of claims. A bottleneck in this pipeline is retrieving relevant evidence: traditional methods may surface documents directly related to a claim, but fact-checking complex claims requires further inference. For instance, a document about how a vaccine was developed is relevant to addressing claims about what it might contain, even if it does not address them directly. We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for this setting. By leveraging the AVeriTeC dataset, which annotates subquestions for claims with human-written answers from evidence documents, we fine-tune Contriever with a contrastive objective based on multiple training signals, including distillation from GPT-4, evaluating subquestion answers, and gold labels in the dataset. We evaluate our model on both retrieval and end-to-end veracity judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in veracity classification accuracy. We also show that our gains transfer to FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to make inferences.
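A hedged sketch of the kind of contrastive objective described above: an InfoNCE-style loss that pulls a claim's (or subquestion's) embedding toward the annotated positive evidence and away from negatives. The embedding dimension, temperature, and toy vectors are illustrative assumptions, not CFR's exact training recipe.

```python
# Hedged sketch of a contrastive (InfoNCE-style) retrieval objective over
# claim/subquestion embeddings and evidence embeddings. Temperature, the
# embedding dimension, and the toy vectors are illustrative assumptions.
import numpy as np

def info_nce(query, positive, negatives, temperature=0.05):
    """Loss is low when the positive evidence is the closest candidate."""
    sims = np.array([query @ positive] + [query @ n for n in negatives])
    sims = sims / temperature
    log_softmax = sims - (np.log(np.exp(sims - sims.max()).sum()) + sims.max())
    return -log_softmax[0]  # the positive sits at index 0

rng = np.random.default_rng(0)
d = 32
query = rng.normal(size=d)
positive = query + 0.1 * rng.normal(size=d)         # gold evidence: near the query
negatives = [rng.normal(size=d) for _ in range(4)]  # unrelated documents
print(info_nce(query, positive, negatives))         # small loss value
```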
Abstract:Vision language models can now generate long-form answers to questions about images - long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop an ontology of functional roles for sentences in LFVQA, annotate them, and demonstrate that long-form answers contain information beyond the direct answer, such as explanations and suggestions. We further conduct automatic and human evaluations of long-form answers with BLV and sighted people. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate the ability of VQA models to abstain from answering unanswerable questions across multiple prompting strategies.
Abstract:Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer, and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs' instruction-following capabilities for knowledge-intensive writing tasks.
Abstract:We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs by comparing answers generated by different models given the same evidence documents, and how the quality of the retrieved document set impacts the answers generated by the same LM. We study various attributes of generated answers (e.g., fluency, length, variance) with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis reveals how retrieval augmentation impacts long, knowledge-rich text generation by LMs and provides directions for future work.
Abstract:Retrieving documents and prepending them in-context at inference time improves the performance of language models (LMs) on a wide range of tasks. However, these documents, often spanning hundreds of words, make inference substantially more expensive. We propose compressing the retrieved documents into textual summaries prior to in-context integration. This not only reduces the computational costs but also relieves the burden on LMs of identifying relevant information in long retrieved documents. We present two compressors -- an extractive compressor which selects useful sentences from the retrieved documents, and an abstractive compressor which generates summaries by synthesizing information from multiple documents. Both compressors are trained to improve LMs' performance on end tasks when the generated summaries are prepended to the LMs' input, while keeping the summary concise. If the retrieved documents are irrelevant to the input or offer no additional information to the LM, our compressor can return an empty string, implementing selective augmentation. We evaluate our approach on language modeling and open-domain question answering tasks. We achieve a compression rate as low as 6% with minimal loss in performance on both tasks, significantly outperforming off-the-shelf summarization models. We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
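A toy sketch of the extractive-compression-with-selective-augmentation idea: keep the sentences most relevant to the query and return an empty string when nothing clears a relevance threshold. The TF-IDF scoring, top_k, and min_sim values below are stand-ins for the trained compressor described above.

```python
# Toy sketch of an extractive compressor with selective augmentation:
# keep the query-relevant sentences, return "" when nothing is relevant.
# TF-IDF scoring, `top_k`, and `min_sim` stand in for the trained compressor.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compress(query, sentences, top_k=2, min_sim=0.1):
    vec = TfidfVectorizer().fit([query] + sentences)
    sims = cosine_similarity(vec.transform([query]), vec.transform(sentences))[0]
    keep = [i for i in sims.argsort()[::-1][:top_k] if sims[i] >= min_sim]
    return " ".join(sentences[i] for i in sorted(keep))  # empty string if nothing kept

retrieved = ["The Eiffel Tower is located in Paris.",
             "Construction of the tower was completed in 1889.",
             "Bananas are rich in potassium."]
print(compress("When was the Eiffel Tower built?", retrieved))
print(repr(compress("Who wrote Hamlet?", retrieved)))  # '' -> selective augmentation
```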
Abstract:Long-form question answering systems provide rich information by presenting paragraph-level answers, often containing optional background or auxiliary information. While such comprehensive answers are helpful, not all of the information is required to answer the question (e.g., users with domain knowledge do not need an explanation of the background). Can we provide a concise version of the answer by summarizing it while still addressing the question? We conduct a user study on summarized answers generated from state-of-the-art models and our newly proposed extract-and-decontextualize approach. We find that a large proportion (over 90%) of long-form answers in the ELI5 domain can be adequately summarized by at least one system, while complex and implicit answers remain challenging to compress. We observe that decontextualization improves the quality of the extractive summary, demonstrating its potential for the summarization task. To promote future work, we provide an extractive summarization dataset covering 1K long-form answers along with our user study annotations. Together, we present the first study on summarizing long-form answers, taking a step toward QA agents that can provide answers at multiple granularities.
Abstract:Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation. We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices. We hire domain experts in seven areas to provide preference judgments over pairs of answers, along with free-form justifications for their choices. We present a careful analysis of experts' evaluation, which focuses on new aspects such as the comprehensiveness of the answer. Next, we examine automatic text generation metrics, finding that no existing metrics are predictive of human preference judgments. However, some metrics correlate with fine-grained aspects of answers (e.g., coherence). We encourage future work to move away from a single "overall score" of the answer and adopt a multi-faceted evaluation, targeting aspects such as factuality and completeness. We publicly release all of our annotations and code to spur future work into LFQA evaluation.
Abstract:Exemplification is a process by which writers explain or clarify a concept by providing an example. While common in all forms of writing, exemplification is particularly useful in the task of long-form question answering (LFQA), where a complicated answer can be made more understandable through simple examples. In this paper, we provide the first computational study of exemplification in QA, performing a fine-grained annotation of different types of examples (e.g., hypotheticals, anecdotes) in three corpora. We show not only that state-of-the-art LFQA models struggle to generate relevant examples, but also that standard evaluation metrics such as ROUGE are insufficient to judge exemplification quality. We propose to treat exemplification as a retrieval problem in which a partially-written answer is used to query a large set of human-written examples extracted from a corpus. Our approach enables reliable ranking-based automatic metrics that correlate well with human evaluation. A human evaluation shows that our model's retrieved examples are more relevant than examples generated by a state-of-the-art LFQA model.
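A small sketch of exemplification-as-retrieval: the partially written answer acts as a query over a pool of human-written examples, and retrieval quality is scored with a ranking metric such as mean reciprocal rank. TF-IDF similarity stands in for the paper's retriever; the example pool and gold index below are invented for illustration.

```python
# Sketch of exemplification-as-retrieval: rank a pool of human-written
# examples by similarity to a partially written answer, then score with a
# ranking metric (mean reciprocal rank). TF-IDF stands in for the retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_examples(partial_answer, example_pool):
    vec = TfidfVectorizer().fit([partial_answer] + example_pool)
    sims = cosine_similarity(vec.transform([partial_answer]),
                             vec.transform(example_pool))[0]
    return sims.argsort()[::-1]  # example indices, best first

def mean_reciprocal_rank(rankings, gold_indices):
    return sum(1.0 / (list(r).index(g) + 1)
               for r, g in zip(rankings, gold_indices)) / len(gold_indices)

pool = ["For example, water expands when it freezes.",
        "Think of a crowded elevator as an analogy.",
        "Anecdotally, my bike tire pops on hot days."]
ranking = rank_examples("Gases expand when heated, for example ...", pool)
print(mean_reciprocal_rank([ranking], [0]))  # 1.0 when the gold example ranks first
```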
Abstract:Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets: ELI5, WebGPT and Natural Questions. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers and annotate 3.9k sentences in 640 answer paragraphs. We find that different answer collection methods manifest in different discourse structures. We further analyze model-generated answers, finding that annotators agree with each other less when annotating model-generated answers than when annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.
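As a toy illustration of the automatic analysis such annotations enable, the sketch below trains a linear classifier over TF-IDF features to predict a sentence's functional role. The role inventory and training sentences are invented for illustration and are not the paper's annotated data or six-role ontology.

```python
# Toy sentence-level functional-role classifier: a linear model over TF-IDF
# features. The roles and sentences below are invented for illustration and
# are not the paper's six-role ontology or annotated data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["Yes, the sky appears blue on a clear day.",
             "This is because air scatters shorter wavelengths more strongly.",
             "For example, sunsets look red when the light travels farther.",
             "In short, scattering explains the color."]
roles = ["answer", "explanation", "example", "summary"]  # hypothetical labels

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(sentences, roles)
print(classifier.predict(["For instance, the moon can look orange near the horizon."]))
```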