Thierry Poibeau

Lattice

Annotating References to Mythological Entities in French Literature

Dec 24, 2024

Thierry Poibeau

Abstract: In this paper, we explore the relevance of large language models (LLMs) for annotating references to Roman and Greek mythological entities in modern and contemporary French literature. We present an annotation scheme and demonstrate that recent LLMs can be directly applied to follow this scheme effectively, although not without occasionally making significant analytical errors. Additionally, we show that LLMs (and, more specifically, ChatGPT) are capable of offering interpretative insights into the use of mythological references by literary authors. However, we also find that LLMs struggle to accurately identify relevant passages in novels (when used as an information retrieval engine), often hallucinating and generating fabricated examples, an issue that raises significant ethical concerns. Nonetheless, when used carefully, LLMs remain valuable tools for performing annotations with high accuracy, especially for tasks that would be difficult to annotate comprehensively on a large scale through manual methods alone.

* CHR (Computational Humanities Research), Digital Methods for Mythological Research Workshop, Dec 2024, Aarhus, Denmark

An Incremental Clustering Baseline for Event Detection on Twitter

Dec 16, 2024

Marjolaine Ray, Qi Wang, Frédérique Mélanie-Becquet, Thierry Poibeau, Béatrice Mazoyer

Abstract: Event detection in text streams is a crucial task for the analysis of online media and social networks. One of the current challenges in this field is establishing a performance standard while maintaining an acceptable level of computational complexity. In our study, we use an incremental clustering algorithm combined with recent advancements in sentence embeddings. Our objective is to compare our findings with previous studies, specifically those by Cao et al. (2024) and Mazoyer et al. (2020). Our results demonstrate significant improvements and could serve as a relevant baseline for future research in this area.

* Proceedings of the Workshop on the Future of Event Detection (FuturED), ACL, Nov 2024, Miami, United States. pp.18-24
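
The abstract above mentions incremental clustering over sentence embeddings without giving details. As a rough illustration only, the general idea can be sketched as follows. This is not the authors' implementation: the function names, the cosine-similarity criterion, the running-mean centroid update, and the `threshold` value are all assumptions, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def incremental_cluster(vectors, threshold=0.8):
    """Assign each vector, in arrival order, to the most similar existing
    cluster, or open a new cluster when no centroid is similar enough.

    `threshold` is a hypothetical parameter chosen for illustration.
    """
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of members in each cluster
    labels = []     # cluster index assigned to each input vector
    for vec in vectors:
        best, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best >= 0 and best_sim >= threshold:
            # Update the running mean of the matched cluster.
            n = sizes[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)
            ]
            sizes[best] = n + 1
            labels.append(best)
        else:
            # No cluster is close enough: start a new one.
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy "embeddings": two messages near one direction, two near another.
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(incremental_cluster(stream))
```

Each item is compared only against the current centroids, so the stream is processed in a single pass, which is one reason this family of methods is attractive when computational cost matters.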

How to Evaluate Coreference in Literary Texts?

Dec 30, 2023

Ana-Isabel Duron-Tejedor, Pascal Amsili, Thierry Poibeau

Abstract: In this short paper, we examine the main metrics used to evaluate textual coreference and we detail some of their limitations. We show that a single score cannot represent the full complexity of the problem at stake, and is thus uninformative, or even misleading. We propose a new way of evaluating coreference, taking into account the context (in our case, the analysis of fiction, especially novels). More specifically, we propose to distinguish long coreference chains (corresponding to main characters) from short ones (corresponding to secondary characters) and singletons (isolated elements). This way, we hope to get more interpretable and thus more informative results through evaluation.

* Presented as a poster at the conference CHR2023 (non archival)
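
The distinction drawn in the abstract above (long chains, short chains, singletons) can be illustrated with a minimal sketch. The function name, the representation of a chain as a list of mention strings, and the `long_min` cut-off are assumptions made for illustration; the abstract does not state where the boundary between long and short chains lies.

```python
def bucket_chains(chains, long_min=10):
    """Group coreference chains by length.

    `chains` is a list of chains, each a list of mentions.
    `long_min` is a hypothetical cut-off: chains with at least this many
    mentions count as "long"; chains with one mention are singletons.
    """
    buckets = {"long": [], "short": [], "singleton": []}
    for chain in chains:
        if len(chain) == 1:
            buckets["singleton"].append(chain)
        elif len(chain) >= long_min:
            buckets["long"].append(chain)
        else:
            buckets["short"].append(chain)
    return buckets


# Toy example with a low cut-off so that all three buckets are populated.
chains = [
    ["Emma", "she", "her", "Emma Bovary"],  # frequently mentioned character
    ["Charles", "he"],                       # less frequent character
    ["the pharmacist"],                      # mentioned once
]
buckets = bucket_chains(chains, long_min=3)
print({name: len(group) for name, group in buckets.items()})
```

An evaluation metric could then be computed separately on each bucket instead of being averaged into one overall score.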

On the Correspondence between Compositionality and Imitation in Emergent Neural Communication

May 22, 2023

Emily Cheng, Mathieu Rita, Thierry Poibeau

Abstract: Compositionality is a hallmark of human language that not only enables linguistic generalization, but also potentially facilitates acquisition. When simulating language emergence with neural networks, compositionality has been shown to improve communication performance; however, its impact on imitation learning has yet to be investigated. Our work explores the link between compositionality and imitation in a Lewis game played by deep neural agents. Our contributions are twofold: first, we show that the learning algorithm used to imitate is crucial: supervised learning tends to produce more average languages, while reinforcement learning introduces a selection pressure toward more compositional languages. Second, our study reveals that compositional languages are easier to imitate, which may induce the pressure toward compositional languages in RL imitation settings.

* Findings of ACL 2023; 5 pages + 8 pages of supplementary materials

Modern French Poetry Generation with RoBERTa and GPT-2

Dec 06, 2022

Mika Hämäläinen, Khalid Alnajjar, Thierry Poibeau

Abstract: We present a novel neural model for modern poetry generation in French. The model consists of two pretrained neural models that are fine-tuned for the poem generation task. The encoder of the model is based on RoBERTa, while the decoder is based on GPT-2. This way the model can benefit from the superior natural language understanding performance of RoBERTa and the good natural language generation performance of GPT-2. Our evaluation shows that the model can create French poetry successfully. On a 5-point scale, the lowest score of 3.57 was given by human judges to typicality and emotionality of the output poetry, while the best score of 3.79 was given to understandability.

* ICCC 2022

Video Games as a Corpus: Sentiment Analysis using Fallout New Vegas Dialog

Dec 05, 2022

Mika Hämäläinen, Khalid Alnajjar, Thierry Poibeau

Abstract: We present a method for extracting a multilingual sentiment-annotated dialog data set from Fallout New Vegas. The game developers have preannotated every line of dialog in the game with one of 8 different sentiments: anger, disgust, fear, happy, neutral, pained, sad and surprised. The game has been translated into English, Spanish, German, French and Italian. We conduct experiments on multilingual, multilabel sentiment analysis on the extracted data set using multilingual BERT, XLMRoBERTa and language-specific BERT models. In our experiments, multilingual BERT outperformed XLMRoBERTa for most of the languages; language-specific models were also slightly better than multilingual BERT for most of the languages. The best overall accuracy was 54%, achieved by using multilingual BERT on Spanish data. The extracted data set presents a challenging task for sentiment analysis. We have released the data, including the testing and training splits, openly on Zenodo. The data set has been shuffled for copyright reasons.

* FDG 2022

Automatic Generation of Factual News Headlines in Finnish

Dec 05, 2022

Maximilian Koppatz, Khalid Alnajjar, Mika Hämäläinen, Thierry Poibeau

Abstract: We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a model using several corpora. The model is then fine-tuned for the headline generation task using a massive news corpus. The system is evaluated by 3 expert journalists working in a Finnish media house. The results showcase the usability of the presented approach as a headline suggestion tool to facilitate the news production process.

* INLG 2022

Word Order Matters when you Increase Masking

Nov 08, 2022

Karim Lasri, Alessandro Lenci, Thierry Poibeau

Abstract: Word order, an essential property of natural languages, is injected in Transformer-based neural language models using position encoding. However, recent experiments have shown that explicit position encoding is not always useful, since some models without such a feature have managed to achieve state-of-the-art performance on some tasks. To better understand this phenomenon, we examine the effect of removing position encodings on the pre-training objective itself (i.e., masked language modelling), to test whether models can reconstruct position information from co-occurrences alone. We do so by controlling the amount of masked tokens in the input sentence, as a proxy to affect the importance of position information for the task. We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task. These findings point towards a direct relationship between the amount of masking and the ability of Transformers to capture order-sensitive aspects of language using position encoding.

* Accepted at EMNLP 2022 (main conference)

Subject Verb Agreement Error Patterns in Meaningless Sentences: Humans vs. BERT

Sep 21, 2022

Karim Lasri, Olga Seminck, Alessandro Lenci, Thierry Poibeau

Abstract: Both humans and neural language models are able to perform subject-verb number agreement (SVA). In principle, semantics shouldn't interfere with this task, which only requires syntactic knowledge. In this work we test whether meaning interferes with this type of agreement in English in syntactic structures of various complexities. To do so, we generate both semantically well-formed and nonsensical items. We compare the performance of BERT-base to that of humans, obtained with a psycholinguistic online crowdsourcing experiment. We find that BERT and humans are both sensitive to our semantic manipulation: they fail more often when presented with nonsensical items, especially when their syntactic structure features an attractor (a noun phrase between the subject and the verb that does not have the same number as the subject). We also find that the effect of meaningfulness on SVA errors is stronger for BERT than for humans, showing higher lexical sensitivity of the former on this task.

* COLING 2022 Main Conference (The 29th international conference on computational linguistics)

Probing for the Usage of Grammatical Number

Apr 21, 2022

Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, Ryan Cotterell

Abstract: A central quest of probing is to uncover how pre-trained models encode a linguistic property within their representations. An encoding, however, might be spurious, i.e., the model might not rely on it when making predictions. In this paper, we try to find encodings that the model actually uses, introducing a usage-based probing setup. We first choose a behavioral task which cannot be solved without using the linguistic property. Then, we attempt to remove the property by intervening on the model's representations. We contend that, if an encoding is used by the model, its removal should harm the performance on the chosen behavioral task. As a case study, we focus on how BERT encodes grammatical number, and on how it uses this encoding to solve the number agreement task. Experimentally, we find that BERT relies on a linear encoding of grammatical number to produce the correct behavioral output. We also find that BERT uses a separate encoding of grammatical number for nouns and verbs. Finally, we identify in which layers information about grammatical number is transferred from a noun to its head verb.

* ACL 2022 (Main Conference). The discussion section was inadvertently removed before the article was published on arXiv
