Jinbiao Yang

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Mar 01, 2024

Jinbiao Yang

Abstract: Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word tokenizers, they struggle with non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers, more than mere technical tools, should draw inspiration from cognitive science research on human language processing. The study then introduces the "Principle of Least Effort" from cognitive science, which holds that humans naturally seek to reduce cognitive effort, and discusses the benefits of this principle for tokenizer development. Based on this principle, the paper proposes the Less-is-Better (LiB) model as a new approach to tokenization for large language models (LLMs). The LiB model can autonomously learn an integrated vocabulary consisting of subwords, words, and MWEs, which effectively reduces both the number of tokens and the number of types. Comparative evaluations show that the LiB tokenizer outperforms existing word and BPE tokenizers, presenting an innovative method for tokenizer development and hinting that future cognitive science-based tokenizers may be more efficient.

Lexical representation explains cortical entrainment during speech comprehension

Jan 10, 2018

Stefan Frank, Jinbiao Yang

Abstract: Results from a recent neuroimaging study on spoken sentence comprehension have been interpreted as evidence for cortical entrainment to hierarchical syntactic structure. We present a simple computational model that predicts the power spectra from this study, even though the model's linguistic knowledge is restricted to the lexical level, and word-level representations are not combined into higher-level units (phrases or sentences). Hence, the cortical entrainment results can also be explained by the lexical properties of the stimuli, without recourse to hierarchical syntax.

* Submitted for publication
