Title: From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Oct 30, 2024

Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

Abstract: Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
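
To make the idea of a "continuous stream of phonemes" concrete, here is a minimal sketch of what such a conversion step might look like. It is not the authors' pipeline: the lexicon `TOY_LEXICON`, the function `text_to_phonemes`, and the `<WB>` boundary marker are all hypothetical names introduced for illustration, and the hand-written IPA entries cover only the example sentence. A real system would presumably rely on a full grapheme-to-phoneme tool rather than a lookup table.

```python
import re

# Hypothetical toy lexicon: hand-written IPA for a few words, for illustration only.
TOY_LEXICON = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}

# Hypothetical marker used only when word boundaries are kept.
WORD_BOUNDARY = "<WB>"


def text_to_phonemes(text, keep_word_boundaries=False):
    """Map orthographic text to a flat list of phoneme symbols.

    Punctuation and casing are discarded. With keep_word_boundaries=False,
    the result is one continuous stream with no trace of where words end.
    """
    words = re.findall(r"[a-z']+", text.lower())
    stream = []
    for word in words:
        phonemes = TOY_LEXICON.get(word)
        if phonemes is None:
            # A real pipeline would fall back to a grapheme-to-phoneme model here.
            raise KeyError(f"no pronunciation for {word!r} in the toy lexicon")
        stream.extend(phonemes)
        if keep_word_boundaries:
            stream.append(WORD_BOUNDARY)
    return stream


continuous = text_to_phonemes("The cat sat.")
segmented = text_to_phonemes("The cat sat.", keep_word_boundaries=True)
print(" ".join(continuous))
print(" ".join(segmented))
```

Comparing the two outputs shows what is lost in the continuous setting: without boundary markers, a model has to infer where words begin and end from the phoneme sequence alone.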
