Marco Cognetta

Tokenization as Finite-State Transduction

Oct 21, 2024

Marco Cognetta, Naoaki Okazaki

Abstract: Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows for simultaneously constraining the model outputs to match a specified pattern while also adhering to the underlying tokenizer's canonical tokenization.

* 10 pages + 5 pages in appendix
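
To make the distinction between "all possible tokenizations" and a "canonical tokenization" concrete, the following is a minimal sketch in plain Python. It does not build a finite-state transducer and is not the construction from the paper. It only enumerates every segmentation of a single string under a toy vocabulary and compares that set with the greedy longest-match (MaxMatch) result. The function names and the vocabulary are assumptions made for illustration.

```python
from typing import List, Set


def all_tokenizations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Enumerate every way to split `text` into tokens drawn from `vocab`."""
    if not text:
        return [[]]  # exactly one tokenization of the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            for rest in all_tokenizations(text[end:], vocab):
                results.append([prefix] + rest)
    return results


def max_match(text: str, vocab: Set[str]) -> List[str]:
    """Greedy left-to-right longest-match tokenization (MaxMatch style)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a token matches.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no token in vocab matches at position {i}")
    return tokens


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"a", "b", "c", "ab", "bc"}
candidates = all_tokenizations("abc", toy_vocab)
canonical = max_match("abc", toy_vocab)
```

Here `candidates` holds three segmentations of `"abc"`, while `canonical` is the single one that MaxMatch selects. A guided-generation constraint that accepts any of the three is looser than one that accepts only the canonical segmentation.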

Distributional Properties of Subword Regularization

Aug 21, 2024

Marco Cognetta, Vilém Zouhar, Naoaki Okazaki

Abstract: Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.

* 4 pages + 4 page appendix. 3 figures
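
One simple way to draw a tokenization uniformly at random is to count, for each position, how many tokenizations the remaining suffix has, and then pick each next token with probability proportional to that count. The sketch below illustrates this idea with a dynamic program over a single word. It is an assumption-laden illustration, not the algorithm from the paper, and the function names are hypothetical.

```python
import random
from typing import List, Set


def count_suffix_tokenizations(text: str, vocab: Set[str]) -> List[int]:
    """counts[i] = number of ways to tokenize text[i:] with tokens from vocab."""
    n = len(text)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) tokenization
    for i in range(n - 1, -1, -1):
        counts[i] = sum(
            counts[j] for j in range(i + 1, n + 1) if text[i:j] in vocab
        )
    return counts


def sample_uniform_tokenization(
    text: str, vocab: Set[str], rng: random.Random
) -> List[str]:
    """Sample one tokenization of `text` uniformly over all valid ones."""
    counts = count_suffix_tokenizations(text, vocab)
    if counts[0] == 0:
        raise ValueError("text cannot be tokenized with this vocabulary")
    tokens = []
    i = 0
    while i < len(text):
        # Choosing end j with weight counts[j] makes every complete
        # tokenization equally likely (probability 1 / counts[0]).
        ends = [
            j for j in range(i + 1, len(text) + 1)
            if text[i:j] in vocab and counts[j] > 0
        ]
        weights = [counts[j] for j in ends]
        j = rng.choices(ends, weights=weights, k=1)[0]
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"a", "b", "c", "ab", "bc"}
sample = sample_uniform_tokenization("abc", toy_vocab, random.Random(0))
```

By contrast, a dropout-style scheme makes a sequence of local random decisions, which in general does not spread probability evenly over the set of valid tokenizations.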

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

Mar 30, 2024

Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

Abstract: We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as a means to reduce model size and for improving model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance, and is even prone to incurring heavy degradation.

* 15 pages
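
The trimming step described above can be sketched as follows, assuming each merged subword records the two components it was built from and that a frequency table for the tokenized corpus is available. The names `parents`, `freqs`, and `threshold` are hypothetical, and the procedure in any particular library may differ in its details.

```python
from typing import Dict, List, Tuple


def trim_token(
    token: str,
    parents: Dict[str, Tuple[str, str]],
    freqs: Dict[str, int],
    threshold: int,
) -> List[str]:
    """Replace a rare subword by its merge components, recursively."""
    # Frequent tokens and atomic symbols (no recorded merge) are kept as-is.
    if freqs.get(token, 0) >= threshold or token not in parents:
        return [token]
    left, right = parents[token]
    return (
        trim_token(left, parents, freqs, threshold)
        + trim_token(right, parents, freqs, threshold)
    )


def trim_sequence(
    tokens: List[str],
    parents: Dict[str, Tuple[str, str]],
    freqs: Dict[str, int],
    threshold: int,
) -> List[str]:
    """Apply threshold trimming to every token of a tokenized sequence."""
    out: List[str] = []
    for tok in tokens:
        out.extend(trim_token(tok, parents, freqs, threshold))
    return out


# Toy merge table and frequencies (hypothetical, for illustration only).
parents = {
    "lo": ("l", "o"),
    "low": ("lo", "w"),
    "er": ("e", "r"),
    "lower": ("low", "er"),
}
freqs = {"lower": 2, "low": 50, "er": 100, "lo": 3}
trimmed = trim_sequence(["lower", "lo"], parents, freqs, threshold=10)
```

In this toy setting the rare token `lower` falls back to `low` and `er`, and the rare token `lo` falls back to single characters. Trimming therefore shrinks the effective vocabulary at the cost of longer token sequences.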

Two Counterexamples to Tokenization and the Noiseless Channel

Feb 29, 2024

Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki

Abstract: In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen. The Rényi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase Rényi efficiency while decreasing the downstream model performance. These counterexamples expose cases where Rényi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.

* 9 pages, 2 figures, to appear in LREC-COLING 2024, de-texified metadata
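
As a rough illustration of the metric discussed above, the sketch below computes the Rényi entropy of a unigram token distribution and normalizes it by the log of the vocabulary size. This normalization is an assumed simplification of the efficiency definition, the vocabulary is taken to be the set of observed token types, and the default order `alpha=2.0` is an arbitrary choice for the example rather than a value taken from the paper.

```python
import math
from collections import Counter
from typing import List, Sequence


def renyi_entropy(probs: Sequence[float], alpha: float) -> float:
    """Renyi entropy of order alpha (natural log); Shannon entropy at alpha=1."""
    probs = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def renyi_efficiency(tokens: List[str], alpha: float = 2.0) -> float:
    """Renyi entropy of the unigram distribution divided by log(vocab size).

    Assumption: the vocabulary is the set of token types seen in `tokens`.
    """
    counts = Counter(tokens)
    if len(counts) < 2:
        raise ValueError("need at least two distinct token types")
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return renyi_entropy(probs, alpha) / math.log(len(counts))


# A uniform unigram distribution is maximally efficient under this measure;
# a skewed one scores lower.
uniform_score = renyi_efficiency(["a", "b", "c", "d"])
skewed_score = renyi_efficiency(["a", "a", "a", "b"])
```

Because the score depends only on the unigram distribution, a tokenizer can be altered to flatten that distribution without becoming more useful to a downstream model, which is the kind of gap the counterexamples exploit.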
