Zébulon Goriely

BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Apr 04, 2025

Zébulon Goriely

Abstract: Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction-error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.

* 17 pages, 10 figures, submitted to CoNLL 2025
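
As a rough illustration of the idea that prediction error peaks at the start of words, the sketch below places a boundary before any phoneme whose surprisal is a local maximum. It is not the paper's method: the function name `segment_at_peaks`, the peak rule, and the surprisal values in the usage note are all hypothetical, and in practice the surprisals would come from a trained phoneme-level language model.

```python
from typing import List


def segment_at_peaks(phonemes: List[str], surprisal: List[float]) -> List[List[str]]:
    """Split a phoneme sequence before every local surprisal peak.

    Assumption (not from the paper): a "peak" is a position whose surprisal
    is strictly higher than the previous one and at least as high as the next.
    """
    if len(phonemes) != len(surprisal):
        raise ValueError("phonemes and surprisal must have the same length")

    words: List[List[str]] = []
    current: List[str] = []
    for i, phoneme in enumerate(phonemes):
        # The final position has no right neighbour, so compare against -inf.
        nxt = surprisal[i + 1] if i + 1 < len(surprisal) else float("-inf")
        # The first phoneme always starts a word, so it is never tested as a peak.
        is_peak = i > 0 and surprisal[i] > surprisal[i - 1] and surprisal[i] >= nxt
        if is_peak and current:
            words.append(current)
            current = []
        current.append(phoneme)
    if current:
        words.append(current)
    return words
```

With the made-up surprisals `[3.0, 1.0, 4.0, 2.0, 1.5]` for the phonemes `["ð", "ə", "d", "ɔ", "g"]`, the only peak falls on `d`, so the sequence is split into `ð ə` and `d ɔ g`.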

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Oct 30, 2024

Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

Abstract: Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
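
To make "a continuous stream of phonemes" concrete, here is a minimal sketch of such a conversion. It is an assumption-laden toy, not the authors' pipeline: `TOY_LEXICON`, `text_to_phoneme_stream`, and the optional `boundary_token` are hypothetical names, and a real system would use a full grapheme-to-phoneme tool rather than a three-word dictionary.

```python
import re
from typing import Dict, List, Optional

# Hypothetical toy pronunciation lexicon, for illustration only.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["ð", "ə"],
    "dog": ["d", "ɔ", "g"],
    "runs": ["ɹ", "ʌ", "n", "z"],
}


def text_to_phoneme_stream(
    text: str,
    lexicon: Dict[str, List[str]],
    boundary_token: Optional[str] = None,
) -> List[str]:
    """Convert orthographic text into a flat list of phonemes.

    Punctuation and capitalisation are dropped. If boundary_token is None,
    word boundaries are removed entirely, giving a continuous stream;
    otherwise the token is appended after every word.
    """
    words = re.findall(r"[a-z']+", text.lower())
    stream: List[str] = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r} in the lexicon")
        stream.extend(lexicon[word])
        if boundary_token is not None:
            stream.append(boundary_token)
    return stream
```

Calling `text_to_phoneme_stream("The dog runs.", TOY_LEXICON)` yields the nine phonemes `ð ə d ɔ g ɹ ʌ n z` with no indication of where words begin or end; passing a `boundary_token` keeps that information for comparison.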

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies

Oct 30, 2024

Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery

Abstract: Curriculum Learning has been a popular strategy to improve the cognitive plausibility of Small-Scale Language Models (SSLMs) in the BabyLM Challenge. However, it has not led to considerable improvements over non-curriculum models. We assess whether theoretical linguistic acquisition theories can be used to specify more fine-grained curriculum learning strategies, creating age-ordered corpora of Child-Directed Speech for four typologically distant language families to implement SSLMs and acquisition-inspired curricula cross-lingually. Comparing the success of three objective curricula (Growing, Inwards and MMM) that precisely replicate the predictions of acquisition theories on a standard SSLM architecture, we find fine-grained acquisition-inspired curricula can outperform non-curriculum baselines and performance benefits of curricula strategies in SSLMs can be derived by specifying fine-grained language-specific curricula that precisely replicate language acquisition theories.

* BabyLM Shared Task 2024 (Accepted, Poster), co-located in EMNLP 2024
