Mingda Chen

Improving Factuality with Explicit Working Memory

Dec 24, 2024

Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Sun, Luke Zettlemoyer, Gargi Gosh, Wen-tau Yih

Abstract: Large language models can generate factually inaccurate content, a problem known as hallucination. Recent works have built upon retrieval-augmented generation (RAG) to improve factuality through iterative prompting, but these methods are limited by the traditional RAG design. To address these challenges, we introduce EWE (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources. The memory is refreshed based on online fact-checking and retrieval feedback, allowing EWE to rectify false claims during the generation process and ensure more accurate and reliable outputs. Our experiments demonstrate that EWE outperforms strong baselines on four fact-seeking long-form generation datasets, increasing the factuality metric, VeriScore, by 2 to 10 points absolute without sacrificing the helpfulness of the responses. Further analysis reveals that the design of rules for memory updates, configurations of memory units, and the quality of the retrieval datastore are crucial factors influencing model performance.
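
The loop the abstract describes (draft text, check it against external feedback, refresh a memory, and correct the draft) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `generate_with_memory`, the `fact_store` dictionary that plays the role of retrieval and fact-checking, and the fixed-size `deque` used as the memory are all assumptions made for illustration.

```python
from collections import deque


def generate_with_memory(drafts, fact_store, memory_size=2):
    """Toy loop: check each drafted claim, refresh memory on a miss, rewrite."""
    memory = deque(maxlen=memory_size)  # oldest feedback is evicted first
    output = []
    for topic, claim in drafts:
        # Hypothetical stand-in for retrieval plus online fact-checking.
        evidence = fact_store.get(topic)
        if evidence is not None and claim != evidence:
            memory.append((topic, evidence))  # refresh memory with the feedback
            claim = dict(memory)[topic]       # "regenerate" by reading the memory
        output.append(f"{topic} {claim}")
    return output, list(memory)


# Hand-made drafts: the first one contains a false claim.
drafts = [("Paris is the capital of", "Germany"), ("Water boils at", "100 C")]
fact_store = {"Paris is the capital of": "France", "Water boils at": "100 C"}
text, memory = generate_with_memory(drafts, fact_store)
print(text)
print(memory)
```

In this sketch only the flagged sentence is rewritten, and the memory keeps the most recent corrections up to `memory_size`. A real system would replace the dictionary lookup with a retriever and a fact-checker, and the rewrite step with model decoding.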

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Oct 08, 2023

Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis (+2 more)

Abstract: Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in the 0-shot setting and +1.4% in the 5-shot setting on average.

* 24 pages

xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages

Jun 22, 2023

Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, Holger Schwenk

Abstract: We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.

* The first two authors contributed equally; ACL 2023 short; Code and data are available at https://github.com/facebookresearch/LASER
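
To make the idea of a similarity-based proxy concrete, here is a minimal sketch of a nearest-neighbor retrieval error rate, and of how adding a hard-to-distinguish candidate can raise it. This is an illustration under stated assumptions, not the xSIM or xSIM++ implementation: plain cosine similarity, the function name `retrieval_error_rate`, and the hand-made 2-D vectors are all hypothetical. The abstract's synthetic examples come from rule-based edits of English sentences, which the extra vector below only imitates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_error_rate(src_vecs, tgt_vecs, gold):
    """Fraction of sources whose most similar target is not the gold target."""
    errors = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        if best != gold[i]:
            errors += 1
    return errors / len(src_vecs)


# Hand-made embeddings (assumption): source i should align with target i.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
gold = [0, 1]
base_error = retrieval_error_rate(src, tgt, gold)

# Add one synthetic hard negative that lies very close to the first source.
tgt_aug = tgt + [[1.0, 0.01]]
aug_error = retrieval_error_rate(src, tgt_aug, gold)
print(base_error, aug_error)
```

With the original candidates every source retrieves its gold target, so the toy proxy cannot separate encoders of different quality; the added near-duplicate makes the task harder and exposes the error.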

Efficient Open Domain Multi-Hop Question Answering with Few-Shot Data Synthesis

May 23, 2023

Mingda Chen, Xilun Chen, Wen-tau Yih

Abstract: Few-shot learning for open domain multi-hop question answering typically relies on large language models (LLMs). While powerful, LLMs are inefficient at inference time. We propose a data synthesis framework for multi-hop question answering that allows for improving smaller language models with fewer than 10 human-annotated question-answer pairs. The framework is built upon data generation functions parameterized by LLMs and prompts, which require minimal hand-crafted features. Empirically, we synthesize millions of multi-hop questions and claims. After finetuning language models on the synthetic data, we evaluate the models on popular benchmarks for multi-hop question answering and fact verification. Our experimental results show that finetuning on the synthetic data improves model performance significantly, allowing our finetuned models to be competitive with prior models while being almost one-third the size in terms of parameter counts.

BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric

Dec 16, 2022

Mingda Chen, Paul-Ambroise Duquenne, Pierre Andrews, Justine Kao, Alexandre Mourachko, Holger Schwenk, Marta R. Costa-jussà

Abstract: End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output, and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows that combining speech and text as inputs to BLASER does not increase the correlation with human scores, but the best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.

Leveraging Natural Supervision for Language Representation Learning and Generation

Jul 21, 2022

Mingda Chen

Abstract: Recent breakthroughs in Natural Language Processing (NLP) have been driven by language models trained on a massive amount of plain text. While these models are powerful, how to derive supervision from textual resources is still an open question. For example, language model pretraining often neglects the rich, freely available structures in textual data. In this thesis, we describe three lines of work that seek to improve the training and evaluation of neural models using naturally occurring supervision. We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks. Specifically, we alter the sentence prediction loss to make it better suited to other pretraining losses and more challenging to solve. We design an intermediate finetuning step that uses self-supervised training to promote models' ability in cross-task generalization. Then we describe methods to leverage the structures in Wikipedia and paraphrases. In particular, we propose training losses to exploit hyperlinks, article structures, and article category graphs for entity-, discourse-, and entailment-related knowledge. We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations. We extend the framework for a novel generation task that controls the syntax of output text with a sentential exemplar. Lastly, we discuss our work on tailoring textual resources for establishing challenging evaluation tasks. We introduce three datasets by defining novel tasks using various fan-contributed websites, including a long-form data-to-text generation dataset, a screenplay summarization dataset, and a long-form story generation dataset. These datasets have unique characteristics offering challenges to future work in their respective task settings.

* PhD Thesis

Improving In-Context Few-Shot Learning via Self-Supervised Training

May 03, 2022

Mingda Chen, Jingfei Du, Ramakanth Pasunuru, Todor Mihaylov, Srini Iyer, Veselin Stoyanov, Zornitsa Kozareva

Abstract: Self-supervised pretraining has made few-shot learning possible for many NLP tasks. But the pretraining objectives are not typically adapted specifically for in-context few-shot learning. In this paper, we propose to use self-supervision in an intermediate training stage between pretraining and downstream few-shot usage with the goal of teaching the model to perform in-context few-shot learning. We propose and evaluate four self-supervised objectives on two benchmarks. We find that the intermediate self-supervision stage produces models that outperform strong baselines. An ablation study shows that several factors affect the downstream performance, such as the amount of training data and the diversity of the self-supervised objectives. Human-annotated cross-task supervision and self-supervision are complementary. Qualitative analysis suggests that the self-supervised-trained models are better at following task requirements.

* NAACL 2022

TVRecap: A Dataset for Generating Stories with Character Descriptions

Sep 18, 2021

Mingda Chen, Kevin Gimpel

Abstract: We introduce TVRecap, a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved. Unlike other story generation datasets, TVRecap contains stories that are authored by professional screenwriters and that feature complex interactions among multiple characters. Generating stories in TVRecap requires drawing relevant information from the lengthy provided documents about characters based on the brief summary. In addition, by swapping the input and output, TVRecap can serve as a challenging testbed for abstractive summarization. We create TVRecap from fan-contributed websites, which allows us to collect 26k episode recaps with 1868.7 tokens on average. Empirically, we take a hierarchical story generation approach and find that the neural model that uses oracle content selectors for character descriptions demonstrates the best performance on automatic metrics, showing the potential of our dataset to inspire future research on story generation with constraints. Qualitative analysis shows that the best-performing model sometimes generates content that is unfaithful to the short summaries, suggesting promising directions for future work.

SummScreen: A Dataset for Abstractive Screenplay Summarization

Apr 14, 2021

Mingda Chen, Zewei Chu, Sam Wiseman, Kevin Gimpel

Abstract: We introduce SummScreen, a summarization dataset composed of pairs of TV series transcripts and human-written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.

Generating Wikipedia Article Sections from Diverse Data Sources

Dec 29, 2020

Mingda Chen, Sam Wiseman, Kevin Gimpel

Abstract: Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high-quality text, but they sometimes struggle with coherence.
