Laure Thompson

Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time

Aug 13, 2024

Marisa Hudspeth, Brendan O'Connor, Laure Thompson

Abstract: Existing Latin treebanks draw from Latin's long written tradition, spanning 17 centuries and a variety of cultures. Recent efforts have begun to harmonize these treebanks' annotations to better train and evaluate morphological taggers. However, the heterogeneity of these treebanks must be carefully considered to build effective and reliable data. In this work, we review existing Latin treebanks to identify the texts they draw from, identify their overlap, and document their coverage across time and genre. We additionally design automated conversions of their morphological feature annotations into the conventions of standard Latin grammar. From this, we build new time-period data splits that draw from the existing treebanks, which we use to perform a broad cross-time analysis for POS and morphological feature tagging. We find that BERT-based taggers outperform existing taggers while also being more robust to cross-domain shifts.

Modeling Exemplification in Long-form Question Answering via Retrieval

May 19, 2022

Shufan Wang, Fangyuan Xu, Laure Thompson, Eunsol Choi, Mohit Iyyer

Abstract: Exemplification is a process by which writers explain or clarify a concept by providing an example. While common in all forms of writing, exemplification is particularly useful in the task of long-form question answering (LFQA), where a complicated answer can be made more understandable through simple examples. In this paper, we provide the first computational study of exemplification in QA, performing a fine-grained annotation of different types of examples (e.g., hypotheticals, anecdotes) in three corpora. We show that not only do state-of-the-art LFQA models struggle to generate relevant examples, but also that standard evaluation metrics such as ROUGE are insufficient to judge exemplification quality. We propose to treat exemplification as a retrieval problem in which a partially-written answer is used to query a large set of human-written examples extracted from a corpus. Our approach allows reliable ranking-type automatic metrics that correlate well with human evaluation. A human evaluation shows that our model's retrieved examples are more relevant than examples generated from a state-of-the-art LFQA model.

* 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics
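
To make the retrieval framing concrete, here is a minimal sketch of ranking candidate examples against a partially written answer and scoring the result with a ranking-type metric. Everything in it is an assumption for illustration: the bag-of-words cosine similarity stands in for whatever learned retriever the paper uses, reciprocal rank is just one possible ranking metric, and the toy sentences and function names are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_examples(partial_answer, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    query = bow(partial_answer)
    scores = [cosine(query, bow(c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank(ranking, gold_index):
    """1 / (1-based position of the human-written example in the ranking)."""
    return 1.0 / (ranking.index(gold_index) + 1)


# Hypothetical toy data: one partial answer and three candidate examples.
partial = "inflation reduces the purchasing power of money over time"
candidates = [
    "for example a cat may sleep most of the day",
    "for example a loaf of bread that cost one dollar may cost two dollars "
    "later as money loses purchasing power",
    "for example the river floods every spring",
]

ranking = rank_examples(partial, candidates)
rr = reciprocal_rank(ranking, gold_index=1)
print(ranking, rr)
```

Because the gold example is known for each query, a metric like this can be computed automatically over a held-out set, which is the practical appeal of casting the task as ranking rather than free-form generation.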

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Sep 13, 2021

Shufan Wang, Laure Thompson, Mohit Iyyer

Abstract: Phrase representations derived from BERT often do not exhibit complex phrasal compositionality, as the model relies instead on lexical similarity to determine semantic relatedness. In this paper, we propose a contrastive fine-tuning objective that enables BERT to produce more powerful phrase embeddings. Our approach (Phrase-BERT) relies on a dataset of diverse phrasal paraphrases, which is automatically generated using a paraphrase generation model, as well as a large-scale dataset of phrases in context mined from the Books3 corpus. Phrase-BERT outperforms baselines across a variety of phrase-level similarity tasks, while also demonstrating increased lexical diversity between nearest neighbors in the vector space. Finally, as a case study, we show that Phrase-BERT embeddings can be easily integrated with a simple autoencoder to build a phrase-based neural topic model that interprets topics as mixtures of words and phrases by performing a nearest neighbor search in the embedding space. Crowdsourced evaluations demonstrate that this phrase-based topic model produces more coherent and meaningful topics than baseline word and phrase-level topic models, further validating the utility of Phrase-BERT.

* EMNLP 2021 Conference Camera Ready
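
As a rough illustration of what a contrastive objective over phrase embeddings can look like, the sketch below computes a triplet margin loss on toy vectors. The abstract does not specify the exact loss, so the triplet form, the Euclidean distance, the margin value, and the 2-D vectors are all assumptions made here for illustration, not a description of Phrase-BERT's training code.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: a phrase, its paraphrase, and an unrelated phrase.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # paraphrase already closer
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])  # paraphrase farther away
print(easy, hard)
```

In the second case the loss is positive, so a gradient step would pull the paraphrase toward the anchor and push the unrelated phrase away, even when the unrelated phrase shares more surface vocabulary.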

Topic Modeling with Contextualized Word Representation Clusters

Oct 23, 2020

Laure Thompson, David Mimno

Abstract: Clustering token-level contextualized word representations produces output that shares many similarities with topic models for English text collections. Unlike clusterings of vocabulary-level word embeddings, the resulting models more naturally capture polysemy and can be used as a way of organizing documents. We evaluate token clusterings trained from several different output layers of popular contextualized language models. We find that BERT and GPT-2 produce high-quality clusterings, but RoBERTa does not. These cluster models are simple, reliable, and can perform as well as, if not better than, LDA topic models, maintaining high topic quality even when the number of topics is large relative to the size of the local collection.
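
The idea of clustering token occurrences rather than vocabulary entries can be sketched as follows. This is a toy under stated assumptions: the 2-D vectors are hypothetical stand-ins for hidden states from a contextualized model, k-means is assumed as the clustering algorithm since the abstract does not name one, and the initial centroids are fixed by hand so the run is deterministic.

```python
from collections import Counter


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, initial_centroids, iters=10):
    """Plain Lloyd's algorithm; returns one cluster id per point."""
    centroids = [list(c) for c in initial_centroids]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [
            min(range(len(centroids)), key=lambda k: squared_distance(pt, centroids[k]))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign


# Hypothetical token occurrences: (word type, contextual vector).
tokens = [
    ("bank", [0.9, 0.1]), ("loan", [1.0, 0.0]), ("money", [0.8, 0.2]),
    ("bank", [0.1, 0.9]), ("river", [0.0, 1.0]), ("shore", [0.2, 0.8]),
]
vectors = [vec for _, vec in tokens]
assign = kmeans(vectors, initial_centroids=[vectors[0], vectors[3]])

# Summarize each cluster by its word-type counts, as one would list a topic's words.
topics = {}
for (word, _), cluster in zip(tokens, assign):
    topics.setdefault(cluster, Counter())[word] += 1
print(assign, topics)
```

The two occurrences of "bank" land in different clusters because each occurrence has its own vector, which is how a token-level clustering can separate senses that a single vocabulary-level embedding would merge.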
