Cassandra L. Jacobs

Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

Oct 15, 2024

Cassandra L. Jacobs, Loïc Grobol, Alvin Tsang

Abstract: In this work, we compare the generative behavior of several language models at the next-token prediction level with human productions in the cloze task. We find that while large models trained for longer are typically better estimators of human productions, they reliably under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements for or models of the cloze task.

Incorporating Annotator Uncertainty into Representations of Discourse Relations

Aug 14, 2023

S. Magalí López Cortez, Cassandra L. Jacobs

Abstract: Annotation of discourse relations is known to be a difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators' uncertainty when annotating discourse relations in spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute distributed representations of discourse relations from co-occurrence statistics that incorporate information about confidence scores and dialogue context. We perform a hierarchical clustering analysis using these representations and show that weighting discourse relation representations with information about confidence and dialogue context coherently models our annotators' uncertainty about discourse relation labels.

The distribution of discourse relations within and across turns in spontaneous conversation

Jul 07, 2023

S. Magalí López Cortez, Cassandra L. Jacobs

Abstract: Time pressure and topic negotiation may impose constraints on how people leverage discourse relations (DRs) in spontaneous conversational contexts. In this work, we adapt a system of DRs for written language to spontaneous dialogue using crowdsourced annotations from novice annotators. We then test whether discourse relations are used differently across several types of multi-utterance contexts. We compare the patterns of DR annotation within and across speakers and within and across turns. Ultimately, we find that different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. Additionally, we find that the discourse relation annotations are of sufficient quality to be predicted from embeddings of discourse units.

* Proceedings of Computational Approaches to Discourse 2023, collocated with the 2023 meeting of the Association for Computational Linguistics, Toronto, Canada

Lost in Space Marking

Aug 02, 2022

Cassandra L. Jacobs, Yuval Pinter

Abstract: We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one trained on raw text benefits from marking word ends. Our findings generalize across domains.

* Submission to SIGMORPHON 2021

Will it Unblend?

Sep 18, 2020

Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

Abstract: Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory.

* Findings of EMNLP 2020

NYTWIT: A Dataset of Novel Words in the New York Times

Mar 06, 2020

Yuval Pinter, Cassandra L. Jacobs, Max Bittker

Abstract: We present the New York Times Word Innovation Types dataset, or NYTWIT, a collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding). We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems. We hope this resource will prove useful for linguists and NLP practitioners by providing a real-world environment of novel word appearance.
