Title: Constructing a BPE Tokenization DFA

May 13, 2024

Martin Berglund, Willeke Martens, Brink van der Merwe

Abstract: Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways.
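
For readers unfamiliar with byte pair encoding, the sketch below illustrates one common formulation of BPE tokenization: start from single characters and repeatedly apply the highest-priority applicable merge rule at its leftmost position. It is not the DFA construction from the paper. The function name `bpe_tokenize` and the toy merge list are assumptions made for illustration only, and real tokenizers may differ in details such as pre-tokenization and byte-level handling.

```python
def bpe_tokenize(text, merges):
    """Tokenize `text` with a priority-ordered list of merge rules.

    `merges` is a list of (left, right) token pairs; earlier entries have
    higher priority. This is an illustrative sketch, not a library API.
    """
    tokens = list(text)  # start from single-character tokens
    while True:
        for left, right in merges:  # try rules from highest priority down
            # look for the leftmost adjacent pair matching this rule
            for i in range(len(tokens) - 1):
                if tokens[i] == left and tokens[i + 1] == right:
                    tokens[i:i + 2] = [left + right]  # merge the pair
                    break
            else:
                continue  # rule not applicable; try the next rule
            break  # a merge was applied; restart from the top rule
        else:
            return tokens  # no rule applies anywhere; tokenization is done


# Hypothetical toy dictionary, highest priority first.
toy_merges = [("a", "b"), ("ab", "c"), ("b", "c")]
print(bpe_tokenize("abcbc", toy_merges))  # ['abc', 'bc']
```

With this toy dictionary, the sequence `['a', 'bc']` spells the text `abc` but is not what the procedure produces (it yields `['abc']`). Presumably an automaton that operates directly on tokenizations has to tell such sequences apart, which a plain character-level automaton cannot do.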
