Catherine Arnett

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Oct 29, 2024

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Sep 06, 2024

Goldfish: Monolingual Language Models for 350 Languages

Aug 19, 2024

Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics

Apr 30, 2024

Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

Mar 20, 2024

A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

Mar 01, 2024

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Nov 15, 2023

When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

Nov 15, 2023

Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

Oct 11, 2023