Abstract:Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows for simultaneously constraining the model outputs to match a specified pattern while also adhering to the underlying tokenizer's canonical tokenization.
Abstract:Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
Abstract:We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as a means to reduce model size and for improving model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance, and is even prone to incurring heavy degradation.
Abstract:In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), R\'enyi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest R\'enyi efficiency of the unigram distribution should be chosen. The R\'enyi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that R\'enyi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase R\'enyi efficiency while decreasing the downstream model performance. These counterexamples expose cases where R\'enyi efficiency fails as an intrinsic tokenization metric and thus give insight for building more accurate predictors.