Marie-Catherine de Marneffe

LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

May 28, 2025

Pingjun Hong, Beiduo Chen, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Abstract: There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation, where annotators agree on the same label but provide divergent reasoning, poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators' reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LiTEx, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy's reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy's usefulness in explanation generation, demonstrating that conditioning generation on LiTEx yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

* 21 pages, 6 figures

Explanation sensitivity to the randomness of large language models: the case of journalistic text classification

Oct 07, 2024

Jeremie Bogaert, Marie-Catherine de Marneffe, Antonin Descampe, Louis Escouflaire, Cedrick Fairon, Francois-Xavier Standaert

Abstract: Large language models (LLMs) perform very well in several natural language processing tasks but raise explainability challenges. In this paper, we examine the effect of random elements in the training of LLMs on the explainability of their predictions. We do so on a task of opinionated journalistic text classification in French. Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations. We therefore claim that characterizing the explanations' statistical distribution is needed for the explainability of LLMs. We then explore a simpler model based on textual features which offers stable explanations but is less accurate. Hence, this simpler model corresponds to a different tradeoff between accuracy and explainability. We show that it can be improved by inserting features derived from CamemBERT's explanations. We finally discuss new research directions suggested by our results, in particular regarding the origin of the sensitivity observed in the training randomness.

* Traitement Automatique des Langues 64, 2023, ATALA, Paris
* This paper is a faithful translation of a paper which was peer-reviewed and published in the French journal Traitement Automatique des Langues, n. 64

VariErr NLI: Separating Annotation Error from Human Label Variation

Mar 04, 2024

Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Abstract: Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation scheme with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,574 validity judgments on 1,933 explanations for 500 re-annotated NLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform compared to GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

* 13 pages, under review

Ecologically Valid Explanations for Label Variation in NLI

Oct 20, 2023

Nan-Jiang Jiang, Chenhao Tan, Marie-Catherine de Marneffe

Abstract: Human label variation, or annotation disagreement, exists in many natural language processing (NLP) tasks, including natural language inference (NLI). To gain direct evidence of how NLI label variation arises, we build LiveNLI, an English dataset of 1,415 ecologically valid explanations (annotators explain the NLI labels they chose) for 122 MNLI items (at least 10 explanations per item). The LiveNLI explanations confirm that people can systematically vary on their interpretation and highlight within-label variation: annotators sometimes choose the same label for different reasons. This suggests that explanations are crucial for navigating label interpretations in general. We few-shot prompt large language models to generate explanations but the results are inconsistent: they sometimes produce valid and informative explanations, but they also generate implausible ones that do not support the label, highlighting directions for improvement.

* Findings at EMNLP 2023. Overlap with previous version arXiv:2304.12443

Understanding and Predicting Human Label Variation in Natural Language Inference through Explanation

Apr 24, 2023

Nan-Jiang Jiang, Chenhao Tan, Marie-Catherine de Marneffe

Abstract: Human label variation (Plank 2022), or annotation disagreement, exists in many natural language processing (NLP) tasks. To be robust and trusted, NLP models need to identify such variation and be able to explain it. To this end, we created the first ecologically valid explanation dataset with diverse reasoning, LiveNLI. LiveNLI contains annotators' highlights and free-text explanations for the label(s) of their choice for 122 English Natural Language Inference items, each with at least 10 annotations. We used its explanations for chain-of-thought prompting, and found there is still room for improvement in GPT-3's ability to predict label distribution with in-context learning.

Investigating Reasons for Disagreement in Natural Language Inference

Sep 07, 2022

Nan-Jiang Jiang, Marie-Catherine de Marneffe

Abstract: We investigate how disagreement in natural language inference (NLI) annotation arises. We developed a taxonomy of disagreement sources with 10 categories spanning 3 high-level classes. We found that some disagreements are due to uncertainty in the sentence meaning, others to annotator biases and task artifacts, leading to different interpretations of the label distribution. We explore two modeling approaches for detecting items with potential disagreement: a 4-way classification with a "Complicated" label in addition to the three standard NLI labels, and a multilabel classification approach. We found that the multilabel classification is more expressive and gives better recall of the possible interpretations in the data.

* accepted at TACL, pre-MIT Press publication version
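
The two target encodings contrasted in the abstract above can be illustrated with a small sketch. This is not the authors' code: the function names, the `min_votes` threshold, and the rule for when an item counts as "Complicated" are assumptions made only to show how a 4-way target collapses information that a multilabel target keeps.

```python
from collections import Counter

# The three standard NLI labels, in a fixed order for the multilabel vector.
NLI_LABELS = ("entailment", "neutral", "contradiction")


def multilabel_target(votes, min_votes=2):
    """Return a 0/1 vector over NLI_LABELS.

    A label is switched on when at least `min_votes` annotators chose it.
    The threshold is a hypothetical choice for this sketch.
    """
    counts = Counter(votes)
    return [1 if counts[label] >= min_votes else 0 for label in NLI_LABELS]


def four_way_target(votes, min_votes=2):
    """Return a single class for the item.

    If exactly one label is supported, that label is returned; otherwise
    (several supported labels, or none) the item is marked "Complicated".
    """
    target = multilabel_target(votes, min_votes)
    supported = [label for label, on in zip(NLI_LABELS, target) if on]
    return supported[0] if len(supported) == 1 else "Complicated"


split_votes = ["entailment"] * 3 + ["neutral"] * 2
clear_votes = ["contradiction"] * 4 + ["neutral"]
print(multilabel_target(split_votes), four_way_target(split_votes))
print(multilabel_target(clear_votes), four_way_target(clear_votes))
```

For the split item, the multilabel vector still records which two labels are plausible, while the 4-way target only says that the item is "Complicated".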

He Thinks He Knows Better than the Doctors: BERT for Event Factuality Fails on Pragmatics

Jul 02, 2021

Nanjiang Jiang, Marie-Catherine de Marneffe

Abstract: We investigate how well BERT performs on predicting factuality in several existing English datasets, encompassing various linguistic constructions. Although BERT obtains strong performance on most datasets, it does so by exploiting common surface patterns that correlate with certain factuality labels, and it fails on instances where pragmatic reasoning is necessary. Contrary to what the high performance suggests, we are still far from having a robust system for factuality prediction.

* to be published in TACL, pre-MIT Press publication version

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Apr 22, 2020

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

* LREC 2020

"i have a feeling trump will win": Forecasting Winners and Losers from User Predictions on Twitter

Sep 01, 2017

Sandesh Swamy, Alan Ritter, Marie-Catherine de Marneffe

Abstract: Social media users often make explicit predictions about upcoming events. Such statements vary in the degree of certainty the author expresses toward the outcome: "Leonardo DiCaprio will win Best Actor" vs. "Leonardo DiCaprio may win" or "No way Leonardo wins!". Can popular beliefs on social media predict who will win? To answer this question, we build a corpus of tweets annotated for veridicality on which we train a log-linear classifier that detects positive veridicality with high precision. We then forecast uncertain outcomes using the wisdom of crowds, by aggregating users' explicit predictions. Our method for forecasting winners is fully automated, relying only on a set of contenders as input. It requires no training data of past outcomes and outperforms sentiment and tweet volume baselines on a broad range of contest prediction tasks. We further demonstrate how our approach can be used to measure the reliability of individual accounts' predictions and retrospectively identify surprise outcomes.

* Accepted at EMNLP 2017 (long paper)
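
The aggregation step described in the abstract above can be sketched roughly as follows. This is an illustrative assumption, not the paper's scoring function: the `forecast_winner` name, the input format (one contender and one classifier probability per tweet), and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict


def forecast_winner(predictions, threshold=0.5):
    """Pick the contender with the largest share of confident predictions.

    `predictions` is an iterable of (contender, p_positive) pairs, where
    p_positive stands in for a veridicality classifier's probability that
    the tweet's author believes the contender will win. Both the input
    format and the threshold are assumptions for this sketch.
    """
    counts = defaultdict(int)
    for contender, p_positive in predictions:
        if p_positive >= threshold:
            counts[contender] += 1  # count only tweets judged positive

    if not counts:
        return None, {}

    total = sum(counts.values())
    shares = {contender: n / total for contender, n in counts.items()}
    winner = max(shares, key=shares.get)
    return winner, shares


tweets = [("A", 0.9), ("A", 0.8), ("B", 0.7), ("B", 0.2)]
winner, shares = forecast_winner(tweets)
print(winner, shares)
```

Only the set of contenders and the classified tweets are needed here, which matches the abstract's point that no training data of past outcomes is required.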
