Walter Daelemans

Tilburg University

WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization

Mar 31, 2025

Ine Gevers, Victor De Marez, Luna De Bruyne, Walter Daelemans

Abstract: In this study, we take a closer look at how Winograd schema challenges can be used to evaluate common sense reasoning in LLMs. Specifically, we evaluate generative models of different sizes on the popular WinoGrande benchmark. We release WinoWhat, a new corpus in which each instance of the WinoGrande validation set is paraphrased. Additionally, we evaluate the performance on the challenge across five common sense knowledge categories, giving more fine-grained insights on what types of knowledge are more challenging for LLMs. Surprisingly, all models perform significantly worse on WinoWhat, implying that LLM reasoning capabilities are overestimated on WinoGrande. To verify whether this is an effect of benchmark memorization, we match benchmark instances to LLM training data and create two test suites. We observe that memorization has a minimal effect on model performance on WinoGrande.

BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language

Dec 11, 2024

Nikolay Banar, Ehsan Lotfi, Walter Daelemans

Abstract: Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content reduces its utility for underrepresented languages in IR, including Dutch. To address this limitation and encourage the development of Dutch IR models, we introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method. Our experiments show that BM25 remains a competitive baseline, and is only outperformed by the larger dense models trained for retrieval. When combined with reranking models, BM25 achieves performance on par with the best dense ranking models. In addition, we explored the impact of translation on the data by back-translating a selection of datasets to English, and observed a performance drop for both dense and lexical methods, indicating the limitations of translation for creating benchmarks. BEIR-NL is publicly available on the Hugging Face hub.

* To be presented at BUCC 2025 (COLING)
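
The lexical BM25 baseline mentioned in the BEIR-NL abstract can be illustrated with a minimal sketch. This is a generic Okapi BM25 scorer, not the authors' evaluation code; the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the three toy Dutch documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25).

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical, not taken from BEIR-NL).
docs = [
    "de kat zit op de mat".split(),
    "de hond rent in het park".split(),
    "een kat en een hond spelen".split(),
]
scores = bm25_scores("kat mat".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the first document matches both query terms and ranks first, while the second matches neither and scores zero. In a retrieve-then-rerank pipeline of the kind the abstract describes, a ranking like this would be passed to a reranking model.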

Bilingual BSARD: Extending Statutory Article Retrieval to Dutch

Dec 10, 2024

Ehsan Lotfi, Nikolay Banar, Nerses Yuzbashyan, Walter Daelemans

Abstract: Statutory article retrieval plays a crucial role in making legal information more accessible to both laypeople and legal professionals. Multilingual countries like Belgium present unique challenges for retrieval models due to the need for handling legal issues in multiple languages. Building on the Belgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the bilingual version of this dataset, bBSARD. The dataset contains parallel Belgian statutory articles in both French and Dutch, along with legal questions from BSARD and their Dutch translation. Using bBSARD, we conduct extensive benchmarking of retrieval models available for Dutch and French. Our benchmarking setup includes lexical models, zero-shot dense models, and fine-tuned small foundation models. Our experiments show that BM25 remains a competitive baseline compared to many zero-shot dense models in both languages. We also observe that while proprietary models outperform open alternatives in the zero-shot setting, they can be matched or surpassed by fine-tuning small language-specific models. Our dataset and evaluation code are publicly available.

* To be presented at RegNLP-2025 (COLING)

Bag of Lies: Robustness in Continuous Pre-training BERT

Jun 14, 2024

Ine Gevers, Walter Daelemans

Abstract: This study aims to acquire more insights into the continuous pre-training phase of BERT regarding entity knowledge, using the COVID-19 pandemic as a case study. Since the pandemic emerged after the last update of BERT's pre-training data, the model has little to no entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID. To test the robustness of continuous pre-training, we experiment with several adversarial methods to manipulate the input data, such as training on misinformation and shuffling the word order until the input becomes nonsensical. Surprisingly, our findings reveal that these methods do not degrade, and sometimes even improve, the model's downstream performance. This suggests that continuous pre-training of BERT is robust against misinformation. Furthermore, we are releasing a new dataset, consisting of original texts from academic publications in the LitCovid repository and their AI-generated false counterparts.

PersonalityChat: Conversation Distillation for Personalized Dialog Modeling with Facts and Traits

Jan 14, 2024

Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann, Walter Daelemans

Abstract: The new wave of Large Language Models (LLMs) has offered an efficient tool to curate sizeable conversational datasets. So far, studies have mainly focused on task-oriented or generic open-domain dialogs, and have not fully explored the ability of LLMs in following complicated prompts. In this work, we focus on personalization, and employ LLMs to curate a dataset which is difficult and costly to crowd-source: PersonalityChat is a synthetic conversational dataset based upon the popular PersonaChat dataset, but conditioned on both personas and (Big-5) personality traits. Evaluating models fine-tuned on this dataset, we show that the personality trait labels can be used for trait-based personalization of generative dialogue models. We also perform a head-to-head comparison between PersonalityChat and PersonaChat, and show that training on the distilled dataset results in more fluent and coherent dialog agents in the small-model regime.

* GEM workshop @ EMNLP23

Open-Domain Dialog Evaluation using Follow-Ups Likelihood

Sep 12, 2022

Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann, Walter Daelemans

Abstract: Automatic evaluation of open-domain dialogs remains an unsolved problem. Moreover, existing methods do not correlate strongly with human annotations. This paper presents a new automated evaluation method using follow-ups: we measure the probability that a language model will continue the conversation with a fixed set of follow-ups (e.g., "not really relevant here", "what are you trying to say"). When compared against twelve existing methods, our new evaluation achieves the highest correlation with human evaluations.

* Accepted at COLING 2022
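
The follow-up idea in the abstract above can be sketched roughly as follows. This is not the paper's implementation: the function `follow_up_score`, the averaging over follow-ups, and especially `toy_log_likelihood` (a word-overlap heuristic standing in for a real language model) are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Sequence

# Fixed follow-ups quoted in the abstract; both signal a confusing response.
FOLLOW_UPS = ["not really relevant here", "what are you trying to say"]


def follow_up_score(
    context: str,
    follow_ups: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Average log-likelihood of the follow-ups given the dialog context."""
    return sum(log_likelihood(context, f) for f in follow_ups) / len(follow_ups)


def _words(turn: str) -> set:
    return {w.strip(".,?!") for w in turn.lower().split()}


def toy_log_likelihood(context: str, follow_up: str) -> float:
    """Hypothetical stand-in for a language model (assumption, not a real LM).

    It pretends a confused follow-up is likely (-2.0) when the last turn
    shares no word with the previous turn, and unlikely (-8.0) otherwise.
    """
    turns = context.split("\n")
    shared = _words(turns[-2]) & _words(turns[-1])
    return -8.0 if shared else -2.0


good = "Do you like hiking?\nYes, hiking is my favourite hobby."
bad = "Do you like hiking?\nThe stock market closed early."
good_score = follow_up_score(good, FOLLOW_UPS, toy_log_likelihood)
bad_score = follow_up_score(bad, FOLLOW_UPS, toy_log_likelihood)
```

Because these follow-ups express confusion, a lower likelihood suggests a better response under this sketch. In practice `log_likelihood` would be replaced by the conditional log-probability from an actual language model.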

CoNTACT: A Dutch COVID-19 Adapted BERT for Vaccine Hesitancy and Argumentation Detection

Mar 14, 2022

Jens Lemmens, Jens Van Nooten, Tim Kreutz, Walter Daelemans

Abstract: We present CoNTACT: a Dutch language model adapted to the domain of COVID-19 tweets. The model was developed by continuing the pre-training phase of RobBERT (Delobelle, 2020) by using 2.8M Dutch COVID-19 related tweets posted in 2021. In order to test the performance of the model and compare it to RobBERT, the two models were tested on two tasks: (1) binary vaccine hesitancy detection and (2) detection of arguments for vaccine hesitancy. For both tasks, not only Twitter but also Facebook data was used to show cross-genre performance. In our experiments, CoNTACT showed statistically significant gains over RobBERT in all experiments for task 1. For task 2, we observed substantial improvements in virtually all classes in all experiments. An error analysis indicated that the domain adaptation yielded better representations of domain-specific terminology, causing CoNTACT to make more accurate classification decisions.

Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations

Jan 17, 2022

Chris Emmery, Ákos Kádár, Grzegorz Chrupała, Walter Daelemans

Abstract: A limited number of studies investigate the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora. The perturbed data, models, and code are available for reproduction at https://github.com/cmry/augtox

* Submitted to LREC 2022

MFAQ: a Multilingual FAQ Dataset

Oct 05, 2021

Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann, Walter Daelemans

Abstract: In this paper, we present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower-resource languages seem to learn from one another, as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model on simple word changes. We publicly release our dataset, model and training script.

* Accepted at MRQA workshop (EMNLP 2021)
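
The MRR (mean reciprocal rank) metric cited in the MFAQ abstract can be computed as in the sketch below. It assumes exactly one relevant answer per question, and the function name `mean_reciprocal_rank` and the answer identifiers are hypothetical; the paper's own evaluation script may differ.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Mean of 1/rank of the relevant answer over all queries.

    rankings: one ranked list of answer ids per query (best first).
    relevant: the single relevant answer id per query (an assumption here).
    A query whose relevant answer is not retrieved contributes 0.
    """
    total = 0.0
    for ranked_ids, gold in zip(rankings, relevant):
        for rank, answer_id in enumerate(ranked_ids, start=1):
            if answer_id == gold:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical retrieval output for three questions.
rankings = [
    ["a1", "a7", "a3"],  # relevant answer at rank 1 -> 1
    ["a4", "a2", "a9"],  # relevant answer at rank 2 -> 1/2
    ["a5", "a6", "a8"],  # relevant answer missing   -> 0
]
relevant = ["a1", "a2", "a0"]
mrr = mean_reciprocal_rank(rankings, relevant)  # (1 + 1/2 + 0) / 3 = 0.5
```

A higher MRR means the correct answer tends to appear nearer the top of the ranked list, which is the sense in which the abstract compares multilingual and language-specific models.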

Teach Me What to Say and I Will Learn What to Pick: Unsupervised Knowledge Selection Through Response Generation with Pretrained Generative Models

Oct 05, 2021

Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann, Walter Daelemans

Abstract: Knowledge Grounded Conversation Models (KGCM) are usually based on a selection/retrieval module and a generation module, trained separately or simultaneously, with or without having access to a gold knowledge option. With the introduction of large pre-trained generative models, the selection and generation parts have become more and more entangled, shifting the focus towards enhancing knowledge incorporation (from multiple sources) instead of trying to pick the best knowledge option. These approaches, however, depend on knowledge labels and/or a separate dense retriever for their best performance. In this work we study the unsupervised selection abilities of pre-trained generative models (e.g. BART) and show that by adding a score-and-aggregate module between encoder and decoder, they are capable of learning to pick the proper knowledge through minimising the language modelling loss (i.e. without having access to knowledge labels). Trained as such, our model - K-Mine - shows competitive selection and generation performance against models that benefit from knowledge labels and/or a separate dense retriever.

* Accepted at ConvAI workshop (EMNLP 2021)
