Kaheer Suleman

Investigating Failures to Generalize for Coreference Resolution Models

Mar 16, 2023

Ian Porada, Alexandra Olteanu, Kaheer Suleman, Adam Trischler, Jackie Chi Kit Cheung

Abstract: Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with existing differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we distinguish between and break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates, among others. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models, and future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.

The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems

Dec 15, 2022

Akshatha Arodi, Martin Pömsl, Kaheer Suleman, Adam Trischler, Alexandra Olteanu, Jackie Chi Kit Cheung

Abstract: Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution tasks that require reasoning over multiple facts. Our dataset is organized into subtasks that differ in terms of which knowledge sources contain relevant facts. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources.

* 19 pages

Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications

May 13, 2022

Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, Alexandra Olteanu

Abstract: There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult. Compounding this difficulty is the need to assess varying quality criteria depending on the deployment setting. While the landscape of NLG evaluation has been well-mapped, practitioners' goals, assumptions, and constraints -- which inform decisions about what, when, and how to evaluate -- are often partially or implicitly stated, or not stated at all. Combining a formative semi-structured interview study of NLG practitioners (N=18) with a survey study of a broader sample of practitioners (N=61), we surface goals, community practices, assumptions, and constraints that shape NLG evaluations, examining their implications and how they embody ethical considerations.

* Camera Ready for NAACL 2022 (Main Conference)

TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

Oct 02, 2021

Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, Siva Reddy

Abstract: In a conversational question answering scenario, a questioner seeks to extract information about a topic through a series of interdependent questions and answers. As the conversation progresses, they may switch to related topics, a phenomenon commonly observed in information-seeking search sessions. However, current datasets for conversational question answering are limiting in two ways: 1) they do not contain topic switches; and 2) they assume the reference text for the conversation is given, i.e., the setting is not open-domain. We introduce TopiOCQA (pronounced Tapioca), an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history. We evaluate several baselines by combining state-of-the-art document retrieval methods with neural reader models. Our best model achieves an F1 of 51.9 and a BLEU score of 42.1, which fall short of human performance by 18.3 points and 17.6 points, respectively, indicating the difficulty of our dataset. Our dataset and code will be available at https://mcgill-nlp.github.io/topiocqa

Modeling Event Plausibility with Consistent Conceptual Abstraction

Apr 20, 2021

Ian Porada, Kaheer Suleman, Adam Trischler, Jackie Chi Kit Cheung

Abstract: Understanding natural language requires common sense, one aspect of which is the ability to discern the plausibility of events. While distributional models -- most recently pre-trained, Transformer language models -- have demonstrated improvements in modeling event plausibility, their performance still falls short of humans'. In this work, we show that Transformer-based plausibility models are markedly inconsistent across the conceptual classes of a lexical hierarchy, inferring that "a person breathing" is plausible while "a dentist breathing" is not, for example. We find this inconsistency persists even when models are softly injected with lexical knowledge, and we present a simple post-hoc method of forcing model consistency that improves correlation with human plausibility judgements.

* NAACL-HLT 2021

An Analysis of Dataset Overlap on Winograd-Style Tasks

Nov 09, 2020

Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung

Abstract: The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these training corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the corpora on which state-of-the-art models are (pre)trained, and that a significant drop in classification accuracy occurs when we evaluate models on instances with minimal overlap. Based on these results, we develop the KnowRef-60K dataset, which consists of over 60k pronoun disambiguation problems scraped from web data. KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning and exhibits a significantly lower proportion of overlaps with current pretraining corpora.

* 11 pages with references, accepted at COLING 2020

Can a Gorilla Ride a Camel? Learning Semantic Plausibility from Text

Nov 13, 2019

Ian Porada, Kaheer Suleman, Jackie Chi Kit Cheung

Abstract: Modeling semantic plausibility requires commonsense knowledge about the world and has been used as a testbed for exploring various knowledge representations. Previous work has focused specifically on modeling physical plausibility and shown that distributional methods fail when tested in a supervised setting. At the same time, distributional models, namely large pretrained language models, have led to improved results for many natural language understanding tasks. In this work, we show that these pretrained language models are in fact effective at modeling physical plausibility in the supervised setting. We therefore present the more difficult problem of learning to model physical plausibility directly from text. We create a training set by extracting attested events from a large corpus, and we provide a baseline for training on these attested events in a self-supervised manner and testing on a physical plausibility task. We believe results could be further improved by injecting explicit commonsense knowledge into a distributional model.

* Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing. (2019) 123-129
* Accepted at COIN@EMNLP 2019

Improving Neural Question Generation using World Knowledge

Sep 10, 2019

Deepak Gupta, Kaheer Suleman, Mahmoud Adada, Andrew McNamara, Justin Harris

Abstract: In this paper, we propose a method for incorporating world knowledge (linked entities and fine-grained entity types) into a neural question generation model. This world knowledge helps to encode additional information related to the entities present in the passage required to generate human-like questions. We evaluate our models on both SQuAD and MS MARCO to demonstrate the usefulness of the world knowledge features. The proposed world-knowledge-enriched question generation model outperforms the vanilla neural question generation model by 1.37 and 1.59 absolute BLEU-4 points on the SQuAD and MS MARCO test sets, respectively.

Playing log(N)-Questions over Sentences

Aug 13, 2019

Peter Potash, Kaheer Suleman

Abstract: We propose a two-agent game wherein a questioner must be able to conjure discerning questions between sentences, incorporate responses from an answerer, and keep track of a hypothesis state. The questioner must be able to understand the information required to make its final guess, while also being able to reason over the game's text environment based on the answerer's responses. We experiment with an end-to-end model where both agents can learn simultaneously to play the game, showing that simultaneously achieving high game accuracy and producing meaningful questions can be a difficult trade-off.

* 5 pages

On the Evaluation of Common-Sense Reasoning in Natural Language Understanding

Nov 05, 2018

Paul Trichelair, Ali Emami, Jackie Chi Kit Cheung, Adam Trischler, Kaheer Suleman, Fernando Diaz

Abstract: The NLP and ML communities have long been interested in developing models capable of common-sense reasoning, and recent works have significantly improved the state of the art on benchmarks like the Winograd Schema Challenge (WSC). Despite these advances, the complexity of tasks designed to test common-sense reasoning remains under-analyzed. In this paper, we make a case study of the Winograd Schema Challenge and, based on two new measures of instance-level complexity, design a protocol that both clarifies and qualifies the results of previous work. Our protocol accounts for the WSC's limited size and variable instance difficulty, properties common to other common-sense benchmarks. Accounting for these properties when assessing model results may prevent unjustified conclusions.

* 4 pages
