Catherine Arnett

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Jul 08, 2025

Catherine Arnett, Marisa Hudspeth, Brendan O'Connor

Abstract:While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.

* 6 pages, 3 figures. Accepted to the Tokenization Workshop at ICML 2025

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

May 30, 2025

Sander Land, Catherine Arnett

Abstract:Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.

* 9 pages, 2 figures. For associated code, see https://github.com/sanderland/script_bpe
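
The "partial UTF-8 sequences" problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the SCRIPT-BPE implementation; the helper names `is_valid_utf8` and `split_is_character_aligned` are hypothetical and only show why a token boundary placed inside a multi-byte character yields byte strings that cannot be decoded on their own.

```python
def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if the byte string decodes as complete UTF-8 text."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def split_is_character_aligned(raw: bytes, cut: int) -> bool:
    """Check whether splitting `raw` at byte offset `cut` keeps characters whole."""
    return is_valid_utf8(raw[:cut]) and is_valid_utf8(raw[cut:])


# Two CJK characters, each encoded as three bytes in UTF-8.
raw = "日本".encode("utf-8")

# Only the cut between the two characters leaves both halves decodable.
aligned_cuts = [c for c in range(1, len(raw)) if split_is_character_aligned(raw, c)]
print(len(raw), aligned_cuts)
```

A byte-level tokenizer is free to place a boundary at any of the other offsets, which is the kind of token a character-integrity constraint is meant to rule out.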

On the Acquisition of Shared Grammatical Representations in Bilingual Language Models

Mar 05, 2025

Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen

Abstract:While crosslingual transfer is crucial to contemporary language models' multilingual capabilities, how it occurs is not well understood. In this paper, we ask what happens to a monolingual language model when it begins to be trained on a second language. Specifically, we train small bilingual models for which we control the amount of data for each language and the order of language exposure. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We first replicate previous crosslingual structural priming results and find that after controlling for training data quantity and language exposure, there are asymmetrical effects across language pairs and directions. We argue that this asymmetry may shape hypotheses about human structural priming effects. We also find that structural priming effects are less robust for less similar language pairs, highlighting potential limitations of crosslingual transfer learning and shared representations for typologically diverse languages.

Why do language models perform worse for morphologically complex languages?

Nov 21, 2024

Catherine Arnett, Benjamin K. Bergen

Abstract:Language models perform differently across languages. It has been previously suggested that morphological typology may explain some of this variability (Cotterell et al., 2018). We replicate previous analyses and find additional new evidence for a performance gap between agglutinative and fusional languages, where fusional languages, such as English, tend to have better language modeling performance than morphologically more complex languages like Turkish. We then propose and test three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. To test the morphological alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and supporting datasets for 22 languages. We find some evidence that tokenization quality explains the performance gap, but none for the role of morphological alignment. Instead we find that the performance gap is most reduced when training datasets are of equivalent size across language types, but only when scaled according to the so-called "byte-premium" -- the different encoding efficiencies of different languages and orthographies. These results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology. Differences in performance can be attributed to disparities in dataset size. These results bear on ongoing efforts to improve performance for low-performing and under-resourced languages.

* 9 pages

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Oct 29, 2024

Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais

Abstract:Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight model creators. At the same time, researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts which have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently at a larger scale. Finally, we describe the balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Sep 06, 2024

Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Abstract:Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce the downstream performance, and in several cases improves it.

* 9 pages

Goldfish: Monolingual Language Models for 350 Languages

Aug 19, 2024

Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

Abstract:For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. However, using FLORES perplexity as a metric, we find that these models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B). To facilitate research that focuses on low-resource languages, we pre-train and release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages. The Goldfish reach lower FLORES perplexities than BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages, despite each Goldfish model being over 10x smaller. However, the Goldfish significantly underperform larger multilingual models on reasoning benchmarks, suggesting that for low-resource languages, multilinguality primarily improves general reasoning abilities rather than basic text generation. We release models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) of text data where available. The Goldfish models are available as baselines, fine-tuning sources, or augmentations to existing models in low-resource NLP research, and they are further useful for crosslinguistic studies requiring maximally comparable models across languages.

Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics

Apr 30, 2024

James A. Michaelov, Catherine Arnett, Benjamin K. Bergen

Abstract:Transformers have supplanted Recurrent Neural Networks as the dominant architecture for both natural language processing tasks and, despite criticisms of cognitive implausibility, for modelling the effect of predictability on online human language comprehension. However, two recently developed recurrent neural network architectures, RWKV and Mamba, appear to perform natural language tasks comparably to or better than transformers of equivalent scale. In this paper, we show that contemporary recurrent models are now also able to match - and in some cases, exceed - performance of comparably sized transformers at modeling online human language comprehension. This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.

Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

Mar 20, 2024

Catherine Arnett, Pamela D. Rivière, Tyler A. Chang, Sean Trott

Abstract:The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses demonstrating that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular and plural nouns. Our results suggest that morphologically-aligned tokenization is a viable tokenization approach, and existing models already generalize some morphological patterns to new items. However, our results indicate that morphological tokenization is not strictly required for performance.

A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

Mar 01, 2024

Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen

Abstract:How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
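
The byte premium definition above (a ratio of UTF-8 byte counts for content-matched text) can be sketched in a few lines. This is an illustration only, not the authors' released tool: the function names `utf8_len` and `byte_premium` are hypothetical, and the two short greetings are a toy stand-in for a real parallel corpus.

```python
def utf8_len(text: str) -> int:
    """Number of bytes needed to store `text` in UTF-8."""
    return len(text.encode("utf-8"))


def byte_premium(text_a: str, text_b: str) -> float:
    """Ratio of UTF-8 bytes in `text_a` to bytes in content-matched `text_b`."""
    bytes_b = utf8_len(text_b)
    if bytes_b == 0:
        raise ValueError("reference text must not be empty")
    return utf8_len(text_a) / bytes_b


# Toy content-matched pair (assumed for illustration, not from the paper's data).
russian = "Привет, мир"   # Cyrillic letters take 2 bytes each in UTF-8
english = "Hello, world"  # ASCII characters take 1 byte each

premium = byte_premium(russian, english)
print(utf8_len(russian), utf8_len(english), round(premium, 3))
```

A ratio above 1 means the first language needs more bytes for the same content, so equal-sized datasets measured in bytes would hold less content for that language. A real estimate would be computed over a large parallel corpus rather than a single sentence pair.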
