Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Miryam de Lhoneux

Form and Meaning in Intrinsic Multilingual Evaluations

Jan 15, 2026

Wessel Poelman, Miryam de Lhoneux

Abstract:Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.

* EACL 2026: Main Conference

Via

Access Paper or Ask Questions

On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Nov 11, 2025

Kushal Tatariya, Wessel Poelman, Miryam de Lhoneux

Figure 1 for On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Figure 2 for On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Figure 3 for On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Figure 4 for On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Abstract:Language model architectures are predominantly first created for English and subsequently applied to other languages. It is an open question whether this architectural bias leads to degraded performance for languages that are structurally different from English. We examine one specific architectural choice: positional encodings, through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis posits a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice-versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pretrain monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks. Contrary to previous findings, we do not observe a clear interaction between position encodings and morphological complexity or word order flexibility, as measured by various proxies. Our results show that the choice of tasks, languages, and metrics are essential for drawing stable conclusions

* IJCNLP-AACL: Main Conference

Via

Access Paper or Ask Questions

The Roles of English in Evaluating Multilingual Language Models

Dec 11, 2024

Wessel Poelman, Miryam de Lhoneux

Abstract:Multilingual natural language processing is getting increased attention, with numerous models, benchmarks, and methods being released for many languages. English is often used in multilingual evaluation to prompt language models (LMs), mainly to overcome the lack of instruction tuning data in other languages. In this position paper, we lay out two roles of English in multilingual LM evaluations: as an interface and as a natural language. We argue that these roles have different goals: task performance versus language understanding. This discrepancy is highlighted with examples from datasets and evaluation setups. Numerous works explicitly use English as an interface to boost task performance. We recommend to move away from this imprecise method and instead focus on furthering language understanding.

* NoDaLiDa 2025

Via

Access Paper or Ask Questions

How Good is Your Wikipedia?

Nov 08, 2024

Kushal Tatariya, Artur Kulmizev, Wessel Poelman, Esther Ploeger, Marcel Bollmann, Johannes Bjerva, Jiaming Luo, Heather Lent, Miryam de Lhoneux

Figure 1 for How Good is Your Wikipedia?

Figure 2 for How Good is Your Wikipedia?

Figure 3 for How Good is Your Wikipedia?

Figure 4 for How Good is Your Wikipedia?

Abstract:Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.

Via

Access Paper or Ask Questions

Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Oct 15, 2024

Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam de Lhoneux

Figure 1 for Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Figure 2 for Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Figure 3 for Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Figure 4 for Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Abstract:Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the model's visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of pixel-based language models.

* 9 pages, Accepted to EMNLP 2025 Main

Via

Access Paper or Ask Questions

Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?

Oct 14, 2024

Zeno Vandenbulcke, Lukas Vermeire, Miryam de Lhoneux

Figure 1 for Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?

Figure 2 for Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?

Figure 3 for Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?

Figure 4 for Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?

Abstract:POS tagging plays a fundamental role in numerous applications. While POS taggers are highly accurate in well-resourced settings, they lag behind in cases of limited or missing training data. This paper focuses on POS tagging for languages with limited data. We seek to identify the characteristics of datasets that make them favourable for training POS tagging models without using any labelled training data from the target language. This is a zero-shot approach. We compare the accuracies of a multilingual large language model (mBERT) fine-tuned on one or more languages related to the target language. Additionally, we compare these results with models trained directly on the target language itself. We do this for three target low-resource languages. Our research highlights the importance of accurate dataset selection for effective zero-shot POS tagging. Particularly, a strong linguistic relationship and high-quality datasets ensure optimal results. For extremely low-resource languages, zero-shot models prove to be a viable option.

* To appear at the 4th Multilingual NLP workshop collocated with EMNLP 2024

Via

Access Paper or Ask Questions

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

Aug 08, 2024

François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

Abstract:The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages. Additionally, we introduce Hydra LLMs, models with multiple swappable language modeling heads and embedding tables, which further extend the capabilities of our trans-tokenization strategy. By designing a Hydra LLM based on the multilingual model TowerInstruct, we developed a state-of-the-art machine translation model for Tatar, in a zero-shot manner, completely bypassing the need for high-quality parallel data. This breakthrough is particularly significant for low-resource languages like Tatar, where high-quality parallel data is hard to come by. By lowering the data and time requirements for training high-quality models, our trans-tokenization strategy allows for the development of LLMs for a wider range of languages, especially those with limited resources. We hope that our work will inspire further research and collaboration in the field of cross-lingual vocabulary transfer and contribute to the empowerment of languages on a global scale.

* Accepted at COLM 2024

Via

Access Paper or Ask Questions

A Principled Framework for Evaluating on Typologically Diverse Languages

Jul 06, 2024

Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, Anders Schlichtkrull, Miryam de Lhoneux, Johannes Bjerva

Abstract:Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world's languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, 'typologically diverse' language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.

Via

Access Paper or Ask Questions

What is 'Typological Diversity' in NLP?

Feb 14, 2024

Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva

Figure 1 for What is 'Typological Diversity' in NLP?

Figure 2 for What is 'Typological Diversity' in NLP?

Figure 3 for What is 'Typological Diversity' in NLP?

Figure 4 for What is 'Typological Diversity' in NLP?

Abstract:The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being 'typologically diverse'. In this work, we systematically investigate NLP research that includes claims regarding 'typological diversity'. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers. Crucially, we show that skewed language selection can lead to overestimated multilingual performance. We recommend future work to include an operationalization of 'typological diversity' that empirically justifies the diversity of language samples.

Via

Access Paper or Ask Questions

Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Feb 05, 2024

Kushal Tatariya, Heather Lent, Johannes Bjerva, Miryam de Lhoneux

Figure 1 for Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Figure 2 for Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Figure 3 for Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Figure 4 for Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Abstract:Emotion classification is a challenging task in NLP due to the inherent idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolinguistic studies have shown that Hinglish speakers switch to Hindi when expressing negative emotions and to English when expressing positive emotions. To understand if language models can learn these associations, we study the effect of language on emotion prediction across 3 PLMs on a Hinglish emotion classification dataset. Using LIME and token level language ID, we find that models do learn these associations between language choice and emotional expression. Moreover, having code-mixed data present in the pre-training can augment that learning when task-specific data is scarce. We also conclude from the misclassifications that the models may overgeneralise this heuristic to other infrequent examples where this sociolinguistic phenomenon does not apply.

* 5 pages, Accepted to SIGTYP 2024 @ EACL

Via

Access Paper or Ask Questions