# Language Models over Canonical Byte-Pair Encodings

Tim Vieira¹ Tianyu Liu¹ Clemente Pasti¹ Yahya Emara¹ Brian DuSell¹ Benjamin LeBrun² Mario Giulianelli¹ Juan Luis Gastaldi¹ Timothy J. O'Donnell³ ² ⁴ Ryan Cotterell¹

¹ETH Zürich ²Mila ³McGill University ⁴Canada CIFAR AI Chair. Correspondence to: Tim Vieira.

**Abstract.** Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string. These are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora. Code: github.com/genlm/canonical-icml-2025

## 1. Introduction

Modern language models are probability distributions over character strings (denoted Σ*) that are parameterized as
*Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).*

Figure 1. The canonical and noncanonical encodings of the string "Hello, world". The diagram shows the top-8 token encodings of the string, ordered by probability (descending from top to bottom), as there are hundreds of them for this short string. Note that the canonical token encoding is the most likely one (i.e., it is at the top), which is reassuring, as it is the most representative of the training data.

distributions over a different space of token strings (denoted Δ*). In the case of the prevalent byte-pair encoding (BPE; Gage, 1994), each token may be regarded as a multi-character (or multi-byte) chunk of a UTF-8-encoded string. Such a language model is trained by encoding its training data into token strings, which has the benefit of compressing the large corpus. The BPE encoding is a pair of functions: an encoding function τ: Σ* → Δ* that maps a given input character string σ to its canonical encoding δ = τ(σ), and a decoding function κ: Δ* → Σ* that maps any token string δ back to a character string σ = κ(δ). The result of training on a corpus of token strings is a language model p that generates a token string δ and decodes it to a character string σ. We define the set of canonical token strings D as those that appear as the canonical encoding of at least one character string, i.e., D ≜ {τ(σ) | σ ∈ Σ*}. These are the only strings that are possible in the tokenized training data of p. The set of noncanonical token strings is, thus, Δ* \ D. Even though p has never seen and will never see a noncanonical token string in its training data, it will still place nonzero probability mass on them due to model architecture and training limitations. Fig. 1 shows an example of a string with its canonical encoding and its exponentially many noncanonical encodings.
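To make the encode/decode asymmetry concrete, here is a toy sketch (ours, not the paper's implementation; the merge list is made up and far smaller than any real tokenizer's) of a minimal BPE encoder τ, the concatenation decoder κ, and a brute-force enumeration of every token string that decodes to a given character string:

```python
# Toy BPE sketch with a hypothetical merge list (not the tokenizer used in the paper).
MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
RANK = {pair: i for i, pair in enumerate(MERGES)}

def encode(text):
    """Canonical encoding tau: repeatedly apply the lowest-rank applicable merge."""
    toks = list(text)
    while True:
        candidates = [(RANK[p], i) for i, p in enumerate(zip(toks, toks[1:])) if p in RANK]
        if not candidates:
            return toks
        _, i = min(candidates)  # lowest-rank merge, leftmost occurrence on ties
        toks = toks[:i] + [toks[i] + toks[i + 1]] + toks[i + 2:]

def decode(toks):
    """Decoding kappa is plain concatenation."""
    return "".join(toks)

# Every unit BPE can ever produce: merge inputs and merge outputs.
VOCAB = {t for a, b in MERGES for t in (a, b, a + b)}

def encodings(text):
    """All token strings over VOCAB that decode to `text` -- canonical or not."""
    if not text:
        return [[]]
    return [[text[:k]] + rest
            for k in range(1, len(text) + 1) if text[:k] in VOCAB
            for rest in encodings(text[k:])]

canonical = encode("hello")      # the one encoding a trained model would ever see
everything = encodings("hello")  # all decoding-equivalent token strings
```

In this toy setting, `encodings("hello")` returns six decoding-equivalent token strings, only one of which, `encode("hello")`, can ever appear in tokenized training data; a token-level model that spreads mass over the other five makes exactly the kind of canonicality mistake described above.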
**Fixing canonicality mistakes.** Removing noncanonical strings from the support of the estimated distribution p can improve the likelihood of the correct token strings. We describe two families of methods for doing so:

- **Canonicality by conditioning:** We explore efficient test-time inference methods for conditionally generating text that satisfies the canonicality constraint without retraining.
- **Canonicality by construction:** We explore methods that impose canonicality constraints directly in the language model's parameterized architecture and give a method to fine-tune its parameters.

In addition to these novel methods, this paper presents the following contributions:

- We prove that our methods can only improve the fit to the true distribution over tokens.
- We show empirically that fixing canonicality mistakes using our methods improves the likelihood of held-out data for several models and corpora.
- As a side effect of pursuing the goal of generating only canonical token strings, we discovered an efficient, incremental canonicality test for BPE (see App. B) that is significantly simpler than prior work (e.g., Berglund et al., 2024; Cognetta & Okazaki, 2024), as it does not require automata theory to understand or implement.

## 2. Preliminaries

### 2.1. Strings

An alphabet is a non-empty, finite set of symbols (e.g., bytes, characters, or tokens) from which we can build strings. A string is a finite-length sequence of symbols from a given alphabet. Let Γ be an alphabet, and let Γ* denote the set of all strings over Γ. We use |γ| to denote the length of a string γ and ε to denote the string of zero length. Let γ γ′ denote the concatenation of γ and γ′.

[…] This requires ℓ(δ) > 0 for all δ ∈ D, which is ensured when our prefix canonicality test 1{δ ∈ D} is exact. More generally, it is safe to allow for false positives but not false negatives.¹³

¹³Our experiments use transformer-based language models.

**Why train?**
Suppose the original model p is equal to some pθ in our family; let θ(0) denote its parameters. Clearly, if we set θ = θ(0), the model ℓθ is no different from ℓ. The reason we optimize θ beyond θ(0) is that the training objective for θ(0) pressured the parameters to model canonicality, but that pressure is now gone because the canonicalized architecture (Def. 3) enforces the constraint for all θ. Thus, the parameters that were previously used to model canonicality preferences can be repurposed to model other textual phenomena in the training data.

**Training pθ.** Given a training corpus σ(1), …, σ(M) of character strings drawn i.i.d. from p over Σ*, let δ(1), …, δ(M) be their corresponding (canonical) tokenizations. We define L(p), the log-loss (average negative log-likelihood) of the training corpus under a language model p:

$$L(p) \;\overset{\text{def}}{=}\; -\frac{1}{M}\sum_{m=1}^{M} \log p(\delta^{(m)}) \tag{16}$$

Log-loss is a reasonable (and common) training objective, as minimizing L(ℓθ) also minimizes KL(p ∥ ℓθ) in the limit of infinite data (with perfect optimization). However, training a large language model with L from scratch is prohibitively costly. Instead, we fine-tune an existing (noncanonical) large language model. We use the following fine-tuning objective, which strikes a balance between fitting the log-loss objective and maintaining some fidelity to the original language model p:

$$F_\lambda(\theta) \;\overset{\text{def}}{=}\; (1-\lambda)\, L(\ell_\theta) + \lambda\, \mathrm{KL}(\ell_\theta \,\|\, p) \tag{17}$$

Here, we use the KL divergence between ℓθ and the original unconstrained model distribution p as a regularizer (e.g., Christiano et al., 2017; Stiennon et al., 2020; Ziegler et al., 2020; Korbak et al., 2022). The regularization parameter 0 ≤ λ ≤ 1 trades fidelity to the original model (higher λ) against fidelity to the fine-tuning data (lower λ).

**Optimization.** Our optimization algorithm is based on stochastic gradient descent. To optimize Fλ(θ), on each step, we randomly choose with probability λ which of the two terms to perform a gradient step on:

1.
For the log-loss updates, we sample a minibatch of examples from the fine-tuning corpus and use the sample average of that minibatch's gradient as an efficient approximation to the gradient over the entire corpus.

2. We estimate the gradient of the KL regularization term using Amini et al.'s (2025) Rao–Blackwellized method with samples from the current ℓθ.

Our experiments initialize ℓθ with the same parameters as p.

## 5. Experiments

This section evaluates our proposed methods, canonicality by conditioning (global and local; §3) and canonicality by construction (§4), by measuring their impact on real datasets and language models. This evaluation complements the theoretical guarantees of Proposition 1 and Proposition 3 by quantifying the log-loss (L; Eq. (16)) on held-out data.

**Datasets.** We experiment on the following two datasets:

- Penn Treebank (PTB; Marcus et al., 1993): test split; 3,761 strings, 82k words, 439k characters
- WikiText (Merity et al., 2017): test split; 4,358 strings, 234k words, 1,286k characters

**Models.** We experiment with the following models:

- GPT-2 models (Radford et al., 2019) of increasing size (small, medium, and large)
- Llama models (Llama Team, 2024) of increasing size (3.2-1B, 3.2-3B, and 3.1-8B)

**Pre-tokenization.** Our efficient bigram-based canonicality test 1{δδ′ ∈ D} (App. B) makes the simplifying assumption that the tokenizer τ is based solely on BPE. In practice, however, systems apply an additional pre-tokenization step that breaks text into chunks called pre-tokens (e.g., based on spaces and punctuation). Most pre-tokenizers are based on hand-crafted regular expressions; the GPT-2 and Llama models each use a distinct pre-tokenization regular expression. Since these models make use of pre-tokenization, our incremental test occasionally makes mistakes. Particularly concerning are instances of false negatives, i.e., cases where we rule out a bigram as noncanonical when it is in fact canonical. False negatives are problematic for estimation (Footnote 12).
These errors are caused by the interaction of the pre-tokenizer and the tokenizer.¹⁴ To work around these rare exceptions, we determined a (small) set of overrides for our canonicality test by identifying the bigrams on which we made a false-negative judgment on a corpus of canonical strings. We note that this workaround occasionally introduces false positives.¹⁵

¹⁴For example, under the GPT-2 tokenizer, the token bigram ⟨\n, \n⟩ (token 198 twice) may or may not be canonical, depending on the context. For instance, in τ(Hi,\n\n I) it is canonical, because the pre-tokenizer creates separate pre-tokens for each newline, preventing BPE from merging them. However, in τ(Hi,\n\n) it is not canonical, because the pre-tokenizer does not create separate pre-tokens for the newlines, allowing BPE to merge them.

### 5.2. Canonicality by Conditioning

**Methodology.** Fig. 3 reports the log-loss of the local (ℓ) and global (g) methods on each dataset {δ(m)} for m = 1, …, M, as well as that of p, which serves as our baseline. Below are the details of how we estimated the log-loss of each method.

Baseline: The log-loss for the baseline method is

$$L(p) = -\frac{1}{M}\sum_{m=1}^{M} \log p(\delta^{(m)}) \tag{18a}$$

Global: The log-loss for the global method is

$$L(g) = L(p) + \log Z \;\approx\; L(p) + \log \hat{Z} \tag{18b}$$

where Ẑ is computed by Eq. (13) using 2000 samples from ℓ. Note that log Ẑ ≤ 0; thus, the global method can only improve the log-loss: L(g) ≤ L(p).¹⁶

Local: The log-loss for the local method is

$$L(\ell) = L(p) + \underbrace{\frac{1}{M}\sum_{m=1}^{M} \log w_\ell(\delta^{(m)})}_{\overset{\text{def}}{=}\;\widehat{W}} \tag{18c}$$

Note that $\frac{1}{M}\sum_{m=1}^{M} \log w_\ell(\delta^{(m)}) \le 0$; thus, the local method can only improve the log-loss: L(ℓ) ≤ L(p).

**Relationship to theory.** Proposition 1 shows that the reduction in KL divergence is equal to −log Z, which is equal to the difference in log-loss L(p) − L(g). Proposition 3 shows that the reduction in KL divergence is equal to −E_{δ∼p}[log wℓ(δ)], which is equal to the difference in log-loss L(p) − L(ℓ) (in expectation).

**Observations.**
For the global method, the change in the log-loss is independent of the dataset, because log Ẑ is independent of the dataset. For the local method, however, the change depends on the dataset, because Ŵ depends on the dataset.

- Statistical significance: For the global method, the log-loss reduction is positive and constant for every string; therefore, it is trivially statistically significantly better than the baseline method according to the paired-permutation test (i.e., p = 0). For the local method, the log-loss reduction is positive for all strings, but the amount varies across strings; nonetheless, it is trivially statistically significantly better than the baseline method according to the paired-permutation test (i.e., p = 0).
- Local vs. global: On the PTB dataset, we found that the global method is significantly better than the local method for all models. However, on the WikiText dataset, the local method is better in all cases except GPT-2-medium.

¹⁵A complete solution to the pre-tokenization challenges would build a transducer that accurately models the pre-tokenizer, which we would then compose with a transducer implementing BPE (see App. B.4). We leave this investigation for future work.

¹⁶In the case of the GPT-2 models, we restricted strings to length ≤ 1024, as these models tend to ramble. We analyze this choice further in Fig. 4.

| Dataset | Model | Baseline | Local | Global |
|---|---|---|---|---|
| PTB | GPT-2 small | 201.0 | 200.7 | **199.1** |
| PTB | GPT-2 medium | 195.1 | 194.5 | **193.1** |
| PTB | GPT-2 large | 189.4 | 188.9 | **188.2** |
| PTB | Llama 1B | 171.2 | 171.1 | **169.7** |
| PTB | Llama 3B | 165.0 | 165.0 | **164.2** |
| PTB | Llama 8B | 161.5 | 161.5 | **160.1** |
| WikiText | GPT-2 small | 369.2 | **367.0** | 367.3 |
| WikiText | GPT-2 medium | 334.1 | 333.2 | **332.2** |
| WikiText | GPT-2 large | 320.8 | **319.1** | 319.6 |
| WikiText | Llama 1B | 286.7 | **284.4** | 285.2 |
| WikiText | Llama 3B | 264.6 | **262.0** | 263.7 |
| WikiText | Llama 8B | 248.2 | **245.8** | 246.8 |

Figure 3. Log-loss (L; bits/string) for the baseline (p), local (ℓ), and global (g) methods across two datasets and six models. Bolding indicates the best number in its row. See text for discussion of statistical significance.

**Takeaways.**
On principle (§3.1), we maintain that the global method (based on conditioning) is the correct solution, as it is the only distribution that preserves the relative probabilities of canonical strings. Additionally, our experiments support the local method as a practical approximation to the global distribution, which we expected to be the case because the probability allocated to each noncanonical token is generally quite small, meaning that the warping is not large (see the discussion of Item 3 in §3.2.2). Interestingly, on some datasets (e.g., WikiText), the warping induced by the local method with respect to the global distribution can be advantageous, i.e., its reduction in log-loss with respect to the WikiText distribution is larger than that of the global method.

**Log-canonicality rate vs. length.** We found that the GPT-2 models generate very long strings, so we investigated the tradeoffs in the log-canonicality rate when truncating the lengths of the strings generated by each model. Fig. 4 shows the effect of length on the log-canonicality rate estimate log Ẑ, using 2000 samples taken from ℓ. We see that for the Llama models, the log-canonicality rate stabilizes after 256 tokens, but for the GPT-2 models, it continues decreasing. Thus, for practical reasons, our experiments with GPT-2 are based on a length limit of 1024; bear in mind that this means the actual log-canonicality rate is likely to be much smaller, so the reduction in log-loss for GPT-2 is larger than Fig. 3 indicates.

Figure 4. Log-canonicality rate vs. (tokenized) maximum string length (128, 256, 512, 1024), including 95% confidence intervals, for each model (GPT-2 S/M/L and Llama 1B/3B/8B).

**Supplementary error analysis.** App. E analyzes the most frequent canonicality errors made by each model.

### 5.3. Canonicality by Construction

**Methodology.**
We fine-tuned two language models, GPT-2 S and GPT-2 M,¹⁷ on the PTB train set and on a subset of the WikiText train set with 50K strings and 4.2M words. We consider fine-tuning the canonicalized architecture (ℓθ) and the original architecture (pθ) using the training criterion Fλ for λ ∈ {0.001, 0.01, 0.1, 0.2}.¹⁸ As baselines, we consider the original model p and the locally constrained model ℓ, as these models serve as the initializations for fine-tuning. Fig. 5 reports the log-loss L (bits/string) on the held-out datasets for each method mentioned above.

**Observations.** In all cases (i.e., models, datasets, and methods), we observe a large improvement in log-loss from fine-tuning. The reduction in log-loss is larger in the case of PTB. Provided that λ is small enough, we see a consistent reduction in log-loss when comparing ℓθ and pθ; however, the difference between ℓθ and pθ is comparatively smaller than what we saw in Fig. 3. Generally, the smaller values of λ work best, indicating

¹⁷Unfortunately, we are unable to fine-tune larger models due to computational constraints. However, the models used here serve as a proof of concept.

¹⁸Each model is trained for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5 and linear learning-rate decay. For efficiency, we use bfloat16 to represent the model parameters. We use a minibatch of size 8 for estimating the gradient of each term of the Fλ objective.

| Dataset | Model | Arch. | N/A | .001 | .01 | .1 | .2 |
|---|---|---|---|---|---|---|---|
| PTB | GPT-2 S | pθ | 201.4 | 143.7 | 143.7 | 143.6 | 146.8 |
| PTB | GPT-2 S | ℓθ | 200.7 | 143.6 | 143.6 | 143.6 | 146.8 |
| PTB | GPT-2 M | pθ | 195.0 | 128.6 | 128.9 | 128.6 | 132.1 |
| PTB | GPT-2 M | ℓθ | 194.5 | 128.3 | 128.7 | 128.4 | 132.0 |
| WikiText | GPT-2 S | pθ | 369.2 | 331.7 | 331.7 | 331.7 | 335.7 |
| WikiText | GPT-2 S | ℓθ | 367.0 | 331.7 | 331.7 | 331.7 | 335.6 |
| WikiText | GPT-2 M | pθ | 334.1 | 293.9 | 294.0 | 293.9 | 298.7 |
| WikiText | GPT-2 M | ℓθ | 333.2 | 293.7 | 293.8 | 293.8 | 298.8 |

Figure 5. Log-loss (bits/string) of fine-tuned models on held-out portions of PTB and WikiText. The column labeled N/A reports the log-loss of the baselines p and ℓ. The other columns correspond to the value of the regularization parameter λ used in the fine-tuning loss.
The rows labeled pθ correspond to the original architecture, and the rows labeled ℓθ to the canonicalized architecture.

that regularization towards the original model should not be done too strongly.

**Takeaways.** Unsurprisingly, fine-tuning for the specific dataset is useful regardless of whether we use the original or the canonicalized architecture. Fine-tuning with the canonicalized architecture performs slightly better, but the improvements appear to be small. It is possible that training the canonicalized architecture from scratch on a huge dataset would yield better results than our proof-of-concept experiment.

## 6. Conclusion

We have demonstrated that enforcing canonicality in token-level language models eliminates a systematic misallocation of probability mass, leading to improved likelihoods on held-out data across multiple models and corpora. Our proposed methods, canonicality by conditioning and canonicality by construction, provide practical solutions that either refine inference or modify the model architecture to ensure that only canonical token strings are assigned positive probability. In addition to the empirical benefits, our theoretical results establish that correcting these mistakes strictly improves model fit. Moreover, our discovery of an efficient incremental test for BPE canonicality simplifies prior approaches, making it more accessible for practical deployment. These findings underscore the importance of aligning token-based probability models with their training distributions, thereby paving the way for more accurate language modeling.

## Acknowledgments

The authors would like to thank John Terilla, Marco Cognetta, Ben Lipkin, Luca Malagutti, and Manuel de Prada Corral for their helpful feedback and discussions.

## Impact Statement

**Potential benefits.** By introducing methods to enforce canonicality, we improve a given model's fidelity to the underlying distribution that generated the text.
These contributions have broad implications for the reliability of language models, particularly in applications where precise probability estimates are crucial. By eliminating avoidable modeling errors, we advance the goal of more reliable and robust language models.

**Potential harms.** Constraining tokenization may inadvertently reduce robustness to rare, noisy, or adversarial inputs by ruling out alternative tokenizations of the same character string. While the canonicality constraint does not remove any character strings from the language, it does restrict the model to assigning nonzero probability only to the single canonical tokenization of each string. This may disproportionately affect rare or unconventional spellings, as their probability mass may be less sharply concentrated on a single tokenization. Future work should provide a deeper analysis of whether canonicality enforcement helps or hurts in these settings, to ensure fair and robust language-model behavior across diverse linguistic contexts.

## References

Amini, A., Vieira, T., and Cotterell, R. Better estimation of the KL divergence between language models, 2025. URL https://arxiv.org/abs/2504.10637.

Anthropic. Introducing Claude, 2023a. URL https://www.anthropic.com/news/introducing-claude.

Anthropic. Claude 2, 2023b. URL https://www.anthropic.com/news/claude-2.

Berglund, M., Martens, W., and van der Merwe, B. Constructing a BPE tokenization DFA, 2024. URL https://arxiv.org/abs/2405.07671.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the International Conference on Machine Learning, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.
Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. GPT-NeoX-20B: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.

Chatterjee, S. and Diaconis, P. The sample size required in importance sampling, 2017. URL https://arxiv.org/abs/1511.01437.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.

Cognetta, M. and Okazaki, N. Tokenization as finite-state transduction, 2024. URL https://arxiv.org/abs/2410.15696.

Gage, P. A new algorithm for data compression. C Users Journal, 12(2), 1994. ISSN 0898-9788. URL https://web.archive.org/web/20230319172720/https://www.derczynski.com/papers/archive/BPE_Gage.pdf.

Gastaldi, J. L., Terilla, J., Malagutti, L., DuSell, B., Vieira, T., and Cotterell, R. The foundations of tokenization: Statistical and computational concerns. In Proceedings of the International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2407.11606.

Gemma Team. Gemma: Open models based on Gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295.
Groeneveld, D., Beltagy, I., Walsh, E., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N., and Hajishirzi, H. OLMo: Accelerating the science of language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.841/.

Korbak, T., Perez, E., and Buckley, C. RL with KL penalties is better viewed as Bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP, 2022. URL https://aclanthology.org/2022.findings-emnlp.77/.

Lew, A. K., Zhi-Xuan, T., Grand, G., and Mansinghka, V. K. Sequential Monte Carlo steering of large language models using probabilistic programs, 2023. URL https://arxiv.org/abs/2306.03081.

Lindmark, V. Analyzing the effect of DFA compression in token DFAs, 2024. URL https://www.diva-portal.org/smash/get/diva2:1878423/FULLTEXT01.pdf.

Llama Team. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=Bkg6RiCqY7.

Loula, J., LeBrun, B., Du, L., Lipkin, B., Pasti, C., Grand, G., Liu, T., Emara, Y., Freedman, M., Eisner, J., Cotterell, R., Mansinghka, V., Lew, A. K., Vieira, T., and O'Donnell, T. J. Syntactic and semantic control of large language models via sequential Monte Carlo. In Proceedings of the International Conference on Learning Representations, 2025.
URL https://openreview.net/forum?id=xoXn62FzD0.

Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004/.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In Proceedings of INTERSPEECH, 2010. URL https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html.

Mistral AI. Mistral 7B, 2023. URL https://mistral.ai/news/announcing-mistral-7b/.

Mistral AI. Au Large, 2024a. URL https://mistral.ai/news/mistral-large/.

Mistral AI. Mistral NeMo, 2024b. URL https://mistral.ai/news/mistral-nemo/.

OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/index/chatgpt/.

OpenAI. GPT-4 technical report, 2024a. URL https://arxiv.org/abs/2303.08774.

OpenAI. Hello GPT-4o, 2024b. URL https://openai.com/index/hello-gpt-4o/.

Park, K., Wang, J., Berg-Kirkpatrick, T., Polikarpova, N., and D'Antoni, L. Grammar-aligned decoding. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=5G7ve8E1Lu.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016. URL https://aclanthology.org/P16-1162.

Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(4), 1948. URL https://doi.org/10.1002/j.1538-7305.1948.tb00917.x.
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf.

Sundermeyer, M., Ney, H., and Schlüter, R. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3), 2015. URL https://ieeexplore.ieee.org/document/7050391.

Team OLMo. 2 OLMo 2 Furious, 2025. URL https://arxiv.org/abs/2501.00656.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models, 2023a. URL https://arxiv.org/abs/2302.13971.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.
N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. URL https://arxiv.org/abs/1706.03762.

Vieira, T., LeBrun, B., Giulianelli, M., Gastaldi, J. L., DuSell, B., Terilla, J., O'Donnell, T. J., and Cotterell, R. From language models over tokens to language models over characters. In Proceedings of the International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2412.03719.

Zhao, S., Brekelmans, R., Makhzani, A., and Grosse, R. Probabilistic inference in language models via twisted sequential Monte Carlo, 2024. URL https://arxiv.org/abs/2404.17546.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences, 2020. URL https://arxiv.org/abs/1909.08593.

Zouhar, V., Meister, C., Gastaldi, J., Du, L., Vieira, T., Sachan, M., and Cotterell, R. A formal perspective on byte-pair encoding. In Findings of the Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-acl.38.

## A. Limitations and Future Work

**Future experimental work.** A limitation of our experimental design is that we only evaluate our approach's ability to estimate probabilities. We do not have any experiments evaluating qualitative differences in sample quality from the canonicalized methods.

**Models overregularize.** Language models often place a higher probability on a noncanonical encoding of a string than on the canonical one. Here is an interesting example from the WikiText corpus (Merity et al., 2017):

    xs = """\n\n = Robert Boulter = \n\n\n\n\n Robert Boulter is an English film , \
    television and theatre actor . He had a guest @-@ starring role on the television \
    series The Bill in 2000 .
    """.encode("utf-8")

In the case of GPT-2, the conditional probability of the canonical tokenization is only 2.7%, and it is ranked 4th in the (partially enumerated) set of encodings of this string.

**UTF-8 encoding.** The very first step in representing text is the choice of string encoding. UTF-8 is an encoding that maps strings of characters into bytes; it is used to store almost every webpage on the internet. The scheme uses a variable-width encoding in which each Unicode character occupies between 1 and 4 bytes, with more common Unicode characters requiring fewer bytes. The UTF-8 encoding imposes additional validity constraints on token strings. In future work, we may extend our work on BPE encoding validity one step further to ensure that the byte string that a BPE string decodes to is a valid UTF-8 string. For better or worse, UTF-8 decoding errors are typically suppressed, e.g., by the Hugging Face transformers library.

**Implementation note.** Byte-pair encoding (as the name suggests) typically takes bytes as inputs. Note that popular interfaces like Hugging Face's try to make things easier for users by hiding the UTF-8 encoding step that maps strings to bytes, exposing only a string interface. We require finer-grained control to implement the round-trip test correctly; we must avoid encoding and decoding from the text's encoding (e.g., UTF-8). Therefore, we use a custom implementation of byte-pair encoding. Note that the transformers library does not provide methods to work directly with byte strings. For example, "🤗".encode("utf-8") = b'\xf0\x9f\xa4\x97' is the encoding of the hugging-face emoji. However, if we try to decode a prefix of the byte string, b'\xf0\x9f\xa4'.decode("utf-8"), we get an error, as it is an invalid UTF-8 string. Note that if we use b'\xf0\x9f\xa4'.decode("utf-8", errors="ignore") or b'\xf0\x9f\xa4'.decode("utf-8", "replace"), the errors may go unnoticed.
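The round-trip behavior described above can be checked directly with Python's built-in codecs (a standalone snippet; no external libraries involved):

```python
# The hugging-face emoji (U+1F917) occupies 4 bytes under UTF-8.
data = "\N{HUGGING FACE}".encode("utf-8")
assert data == b"\xf0\x9f\xa4\x97"

# Strict decoding of a 3-byte prefix fails: it is not a valid UTF-8 string.
try:
    b"\xf0\x9f\xa4".decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# With error suppression, the malformed prefix silently decodes to the empty
# string (or a replacement character), so the problem goes unnoticed.
assert b"\xf0\x9f\xa4".decode("utf-8", errors="ignore") == ""
assert b"\xf0\x9f\xa4".decode("utf-8", errors="replace") == "\ufffd"
```

This is why a byte-level round-trip test must use strict decoding: the suppressing error handlers turn an invalid token string into an innocuous-looking character string.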
Many packages (e.g., transformers and tiktoken) suppress these errors.

Other tokenization models. We limited our experiments to BPE, as it is by far the most commonly used tokenization model (Footnote 5). However, many other tokenizers exist, such as longest-match tokenization (used by WordPiece) and other finite-state transduction-based tokenizers. These other tokenizers can plug directly into our framework as long as they provide efficient membership tests for D. In the case of finite-state-transduction methods (which include longest-match tokenizers), the membership test is a straightforward operation on the automaton. We note that the prefix-closure property D̄ = D does not hold in general; in particular, it does not hold for longest-match tokenizers.

More sophisticated inference algorithms. Future work may wish to explore sequential importance resampling (SIR) or other sequential Monte Carlo (SMC) methods, which have shown great promise for constrained generation from language models (e.g., Lew et al., 2023; Zhao et al., 2024). Sequential importance resampling, for example, provides the computational benefit of not waiting until a complete string has been sampled from the local model before resampling. It works by evolving many string prefixes in parallel: when any given string is stuck in a low-probability region, it can be probabilistically replaced by another string in the collection. Many SMC algorithms, including SIR, are GPU-friendly.

Distillation. Another interesting direction related to fine-tuning is the distillation of the (approximately) globally canonicalized model g. We could do this by minimizing KL(g ∥ ℓθ) using stochastic gradient descent.

Precision vs. recall. It can be challenging to correctly capture all of the constraints that the inputs must satisfy. In this paper, we explored the canonicality constraints of BPE models.
We provided efficient methods for testing canonicality, and we needed to validate them against reference canonicality procedures (as a kind of regression test). It may be impossible to capture every nuance of a particular implementation perfectly. And, in some cases, the reference implementations even have bugs or undesirable behaviors. For example, the pre-tokenization system used by GPT-2 was designed to split the possessive marker ('s) from a phrase, as in OpenAI's system ↦ OpenAI |'s| system; however, the rule has the following behavior, which is likely not intentional: OpenAI Inc.'s system ↦ OpenAI | Inc |.'| s | system. It fails to split the possessive correctly because the regular expression used by the pre-tokenizer prioritizes the punctuation run .' over 's in this example. More generally, reasoning about a complicated software system using formal models will always present a trade-off between fidelity and analytical tractability.

Evaluation on downstream tasks. Our study deliberately centers on a task-agnostic metric: KL divergence. We proved that our canonicalization methods are guaranteed to improve this metric (Propositions 1 and 3), and we quantified the effect empirically (§5). However, many language-model applications are judged by task-specific metrics that may be misaligned with KL divergence; thus, an improvement in KL divergence does not guarantee an improvement in the task-specific metrics. Assessing how canonicalization influences concrete reasoning tasks under each task's own evaluation metric remains an important direction for future work.

B. Efficient Membership Tests for BPE

This section gives a formal definition of byte-pair encoding (BPE) and novel algorithms for the following membership tests pertaining to it:

- the canonical set D (see Theorem 1)
- the canonical prefix set D̄ (see Lemma 3)
- an incremental membership test for single-token extensions, i.e., δδ′ ∈ D given that δ ∈ D (see Proposition 5)

Our key insight is that a token string is canonical if and only if all of its constituent bigrams are canonical; see Theorem 1. Thus, our membership test runs in time linear in the length of the string: we iterate over all bigrams in the string and check their canonicality; we call this the bigram test. Moreover, our test is simple to implement, unlike similar tests given in prior work (e.g., Berglund et al., 2024; Cognetta & Okazaki, 2024). A consequence of Theorem 1 is that D̄ = D (see Lemma 3), which means that membership in the set of canonical prefixes reduces to membership in the canonical set.19

B.1. Byte-Pair Encoding

Byte-pair encoding (BPE; Gage, 1994; Sennrich et al., 2016) is currently the most commonly used tokenization method, as it has been used in most recent state-of-the-art language models.20 We formalize BPE as follows.

Definition 4. Byte-pair encoding is a pair (Σ, M) that defines a tokenization model (Σ, Δ, τ, κ). We describe the assumptions on (Σ, M) and the construction of (Σ, Δ, τ, κ) below:

- Σ is a base alphabet of characters (or bytes).
- M = ⟨(σ′(1), σ″(1)), ..., (σ′(M), σ″(M))⟩ is the merge list, where σ′(m), σ″(m) ∈ Σ+ for m = 1, ..., M.
- We define the subword alphabet S def= Σ ∪ {σ′σ″ | (σ′, σ″) ∈ M}.
- We define the token alphabet Δ such that for each δ ∈ Δ, there is a unique object (e.g., an identifier) for each distinct element of the subword alphabet S. The reason for this technical detail is to avoid confusion between the concatenation of subwords and the concatenation of tokens.
- We define the base encoder τ̄ : S → Δ and base decoder κ̄ : Δ → S as bijective maps between the subword alphabet S and the set of base tokens Δ.
- We define the decoder κ : Δ* → Σ* to be the pointwise extension of the base decoder: κ(δ1 ··· δN) def= κ̄(δ1) ··· κ̄(δN).
- We define the encoder τ : Σ* → Δ* by the pseudocode below, which is a deterministic string-rewriting process that applies a sequence of merge rules to the current string of tokens until no more merges are available.21,22

def τ(σ1 ··· σN):
    δ ← τ̄(σ1) ··· τ̄(σN)
    while True:
        δ′ ← rewrite(δ)
        if δ′ = δ: return δ
        δ ← δ′

def rewrite(δ1 ··· δN):
    for (σ′, σ″) in M:
        for n in [1, ..., N−1]:
            if (κ̄(δn), κ̄(δn+1)) = (σ′, σ″):
                return δ1 ··· δn−1 τ̄(σ′σ″) δn+2 ··· δN
    return δ1 ··· δN

Note that the merge rules are applied in a specific deterministic order to any given string; thus, there is a unique way to derive the output tokenization from the input string.

19We note that, in principle, the bigram canonicality tests can all be precomputed and cached for maximal efficiency. However, we found that to be impractical for models with large token vocabularies. In App. B.3, we provide a very efficient method for testing bigram canonicality.

20See Footnote 5.

21We note that τ can be implemented much more efficiently using a careful choice of data structure (see, e.g., Zouhar et al., 2023).

22We note that Berglund et al. (2024) refer to this incarnation of BPE as the SentencePiece BPE algorithm.

B.2. Simple and Efficient Membership Tests

Definition 5. We define a bigram δδ′ (with δ, δ′ ∈ Δ) to be canonical if and only if

φ(δδ′) = δδ′, (19)

where φ is the canonicalization function (§2.6).

Definition 6. We define the set of bigram-canonical strings B as the strings composed of only canonical bigrams:

B def= {δ ∈ Δ* : every bigram of δ is canonical}. (20)

In general, it is possible for a token in the token vocabulary not to be canonical. Such tokens can always be removed from the token vocabulary without changing the result of the BPE tokenizer.
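As a concrete (and deliberately tiny) illustration of Def. 4, the following Python sketch implements rewrite and τ for a hypothetical merge list; tokens are represented directly by their subword strings, so the base encoder and decoder are implicit:

```python
MERGES = [("a", "b"), ("ab", "c")]  # toy merge list, in priority order

def rewrite(delta):
    # Apply the single highest-priority merge; position breaks ties
    # in favor of the leftmost match, as in Def. 4.
    for left, right in MERGES:
        for n in range(len(delta) - 1):
            if (delta[n], delta[n + 1]) == (left, right):
                return delta[:n] + [left + right] + delta[n + 2:]
    return delta  # no merge applies: delta is a fixpoint

def tau(sigma):
    # Start from base tokens (one per character) and rewrite to a fixpoint.
    delta = list(sigma)
    while True:
        new = rewrite(delta)
        if new == delta:
            return delta
        delta = new

print(tau("abcab"))  # ['abc', 'ab']
```

Because rewrite performs exactly one merge per call, chosen by (merge rank, position), the output tokenization is derived deterministically, as noted above.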
We assume, for simplicity, that the token vocabulary contains no such tokens.

Assumption 1. For all δ ∈ Δ, φ(δ) = δ.

Definition 7. We define the truncated canonicalization function φT as follows:

def φT(δ):
    σ1 ··· σN ← κ(δ)            # decode token string
    δ(0) ← τ̄(σ1) ··· τ̄(σN)     # initialize with base tokens
    for t = 0, ..., T−1:
        δ(t+1) ← rewrite(δ(t))
        if δ(t+1) = δ(t): return δ(t)
    return δ(T)

Remark 1. It is straightforward to verify that φ∞ = φ, as the code above is equivalent to that of τ(κ(·)) with the added option of truncating the number of iterations performed in τ's fixpoint loop. Setting T = ∞ disables the truncation; thus, φ∞ = φ.

Lemma 1 (Canonicalization process synchronization). Suppose that abc ∈ D, or that both ab, bc ∈ D. Suppose further that b is nonempty. Then, there exists a sequence {(at, bt, ct, dt, et)}∞t=0 such that the following equations hold for all t ≥ 0:

φt(abc) = φat(a) φbt(b) φct(c)
φdt(ab) = φat(a) φbt(b)
φet(bc) = φbt(b) φct(c) (21)

Proof. We prove Lemma 1 by induction on t, explicitly constructing the sequence {(at, bt, ct, dt, et)}∞t=0. Let Pt denote the claim that the equations above hold at step t, and let T denote the number of distinct iterations of φ(abc).

Base case (P0). It is straightforward to verify that (a0, b0, c0, d0, e0) = (0, 0, 0, 0, 0) satisfies P0, because each canonicalization call returns precisely the base tokenization of its argument. Thus, P0 holds.

Induction step. Suppose P0, ..., Pt hold. We seek to show that Pt+1 holds. Since Pt holds, there must exist (at, bt, ct, dt, et) satisfying the equations in Pt. To prove that Pt+1 holds, we will construct (at+1, bt+1, ct+1, dt+1, et+1) such that the equations in Pt+1 hold.
We first observe that at time t, φt(abc) = a(t)b(t)c(t) where a(t) is the tokenization at step t of the substring over the characters in a in the canonicalization process for abc b(t) is the tokenization at step t of the substring over the characters in b in the canonicalization process for abc c(t) is the tokenization at step t of the substring over the characters in c in the canonicalization process for abc The only way for this observation to be false is for a merge to straddle the boundary between a(t) and b(t) or b(t) and c(t), which is impossible because of our premise that abc D or both ab, bc D. More specifically, if there ever were a merge between a(t) and b(t) then ab and abc could not be canonical, as the final tokenization would not respect the boundary between a and b. Similarly, if there ever were a merge between b(t) and c(t), then by analogous reasoning, bc and abc could not be canonical. We now turn to case analysis: 1. Suppose t < T, then the (t+1)th step of φ(abc) applies to the highest-priority23 bigram of a(t)b(t)c(t). Consider the following subcases characterizing the possible positions for this merge: (a) The merge is in a(t). Then, (at+1, bt+1, ct+1, dt+1, et+1) = (at + 1, bt, ct, dt + 1, et) satisfies Pt+1 because It must also be the highest-priority merge in a(t), which is step at+1 of φ(a). It must also be the highest-priority merge in a(t)b(t), which is step dt+1 of φ(ab). The other canonicalization processes are unaffected, so they copy their position on this step. (b) The merge is in b(t). Then, (at+1, bt+1, ct+1, dt+1, et+1) = (at, bt + 1, ct, dt + 1, et + 1) satisfies Pt+1 because It must also be the highest-priority merge in b(t), which is step bt+1 of φ(b). It must also be the highest-priority merge in a(t)b(t), which is step dt+1 of φ(ab). It must also be the highest-priority merge in b(t)c(t), which is step et+1 of φ(bc). The other canonicalization processes are unaffected, so they copy their position on this step. 
(c) The merge is in c(t). Then, (at+1, bt+1, ct+1, dt+1, et+1) = (at, bt, ct + 1, dt, et + 1) satisfies Pt+1 because it must also be the highest-priority merge in c(t), which is step ct+1 of φ(c), and it must also be the highest-priority merge in b(t)c(t), which is step et+1 of φ(bc). The other canonicalization processes are unaffected, so they copy their position on this step.

2. Suppose t ≥ T. Then no merges exist, and (at+1, bt+1, ct+1, dt+1, et+1) = (at, bt, ct, dt, et) satisfies Pt+1 because φt(abc) = φT(abc) = a(t)b(t)c(t) = a(T)b(T)c(T), by definition of T, which implies that no further changes are possible: all processes must copy their position on this step, and continue to do so forever.

Therefore, Pt holds for all t ≥ 0 by the principle of induction.

23I.e., the bigram with the lexicographically lowest pair of (position in the merge list, position in the current token string). Note that there can be no ties under this ordering.

Example 1. This example seeks to illustrate Lemma 1. Consider three token strings a, b, and c from the GPT-2 tokenizer with κ(abc) = re-tokenized. Our worked example focuses on only the first equation in Pt (repeated below for convenience), as the other two equations behave very similarly:

φt(abc) = φat(a) φbt(b) φct(c)

Before moving forward with the example, we note that each token during the canonicalization process is rendered as a tree, because this is a convenient representation for showing which merges were performed to build it.24 (The derivation trees for the inputs a, b, and c appear as figures in the original.) Each row in the table below corresponds to a time step t, and the content of each column is the status of the canonicalization process at that time. We mark cells that are copied from the cell immediately above them with a dash (–).
[Table omitted: it shows the processes φt(abc), φat(a), φbt(b), and φct(c) as derivation-tree strings for t = 0, ..., 8, with copied cells marked by a dash.]

We see that at every step t, the equation φt(abc) = φat(a) φbt(b) φct(c) is satisfied for an appropriate choice of (at, bt, ct). Specifically, we count the number of non-copy actions in each row; in the example, for t = 8, we have (at, bt, ct) = (1, 4, 3). This example also allows us to see the case analysis used in the proof (for the location of the merge) in action.

24We give a more formal presentation of these trees in App. B.3.

Lemma 2 (Interlocking Canonicalization). For all a ∈ Δ*, b ∈ Δ+, c ∈ Δ*:

abc ∈ D ⟺ ab ∈ D ∧ bc ∈ D (22)

Proof. Fix a ∈ Δ*, b ∈ Δ+, and c ∈ Δ* arbitrarily. We consider each direction of the bi-implication separately.

Case (⟹). Suppose that abc ∈ D. Then, Lemma 1 gives us a sequence {(at, bt, ct, dt, et)}∞t=0 such that the following equations hold for all t ≥ 0:

φt(abc) = φat(a) φbt(b) φct(c)
φdt(ab) = φat(a) φbt(b)
φet(bc) = φbt(b) φct(c) (23a)

Now, consider (aT, bT, cT, dT, eT), which gives

φT(abc) = φaT(a) φbT(b) φcT(c)
φdT(ab) = φaT(a) φbT(b)
φeT(bc) = φbT(b) φcT(c) (23b)

which implies the following because, at time T, each of the canonicalization processes has reached its respective fixpoint:

φ(abc) = φ(a) φ(b) φ(c)
φ(ab) = φ(a) φ(b)
φ(bc) = φ(b) φ(c) (23c)

which implies the following because each call to φ must return the canonicalization of its respective argument:

φ(abc) = abc
φ(ab) = ab
φ(bc) = bc

Therefore, ab and bc are also canonical.

Case (⟸). Suppose ab, bc ∈ D. Then, following an argument identical to the (⟹) case, we see that abc ∈ D.

Thus, the bi-implication holds; thus, Lemma 2 holds.
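Lemma 2 can also be checked mechanically on a small example. The sketch below uses a hypothetical three-merge BPE (tokens are represented by their subwords, so κ is string concatenation, and δ ∈ D iff τ(κ(δ)) = δ, per Proposition 6) and verifies the bi-implication over all short triples of tokens:

```python
from itertools import product

MERGES = [("a", "b"), ("ab", "c"), ("b", "c")]  # hypothetical merge list

def tau(sigma):
    # BPE encoder of Def. 4: repeatedly apply the highest-priority
    # merge (leftmost on ties) until a fixpoint is reached.
    delta = list(sigma)
    while True:
        for left, right in MERGES:
            hits = [n for n in range(len(delta) - 1)
                    if (delta[n], delta[n + 1]) == (left, right)]
            if hits:
                n = hits[0]  # leftmost occurrence
                delta = delta[:n] + [left + right] + delta[n + 2:]
                break
        else:
            return delta

def canonical(delta):
    # delta is canonical iff phi(delta) = tau(kappa(delta)) = delta.
    return tau("".join(delta)) == delta

# All vocabulary tokens are canonical, so Assumption 1 holds here.
VOCAB = ["a", "b", "c", "ab", "bc", "abc"]
assert all(canonical([d]) for d in VOCAB)

# Check abc ∈ D  ⟺  ab ∈ D and bc ∈ D over all one-token a, b, c.
for a, b, c in product(VOCAB, repeat=3):
    assert canonical([a, b, c]) == (canonical([a, b]) and canonical([b, c]))
print("Lemma 2 verified on", len(VOCAB) ** 3, "triples")
```

This only exercises the lemma with single-token a, b, and c; the lemma itself covers arbitrary token strings.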
Theorem 1 establishes that a BPE token string is canonical if and only if all of its bigrams are canonical.25

Theorem 1. D = B.

Proof. We seek to prove, for all δ ∈ Δ*, that δ ∈ D ⟺ δ ∈ B. We do this using induction on prefixes of δ and Lemma 2. Choose δ = δ1 ··· δN arbitrarily. Let the induction hypothesis P(i) be that δ≤i ∈ D ⟺ δ≤i ∈ B.

Base cases. We first prove P(0) and, if |δ| ≥ 1, P(1).

Case P(0). To prove P(0), we seek to show that δ≤0 = ε ∈ D ⟺ δ≤0 = ε ∈ B. Both sides of the bi-implication are always true: ε ∈ D because φ(ε) = τ(κ(ε)) = τ(ε) = ε; and ε ∈ B because it contains no bigrams, so membership is vacuously true. Therefore, P(0) is true.

Case P(1). To prove that P(1) is true when |δ| ≥ 1, we seek to show that δ≤1 = δ1 ∈ D ⟺ δ≤1 = δ1 ∈ B. Again, both sides of the bi-implication are always true. By Assumption 1, we know that δ1 ∈ D is true. Since δ1 has no bigrams, δ1 ∈ B is vacuously true. Therefore, P(1) is true.

Induction step. Assume P(i), i.e., that δ≤i ∈ D ⟺ δ≤i ∈ B, for some i ≥ 1. We seek to prove P(i + 1), i.e., that δ≤i+1 ∈ D ⟺ δ≤i+1 ∈ B. We prove each direction of the bi-implication separately.

Forward. We seek to prove δ≤i+1 ∈ D ⟹ δ≤i+1 ∈ B. Assume δ≤i+1 = δ≤i−1 δi δi+1 ∈ D. By Lemma 2, we know that δ≤i−1 δi ∈ D and δi δi+1 ∈ D. Since δ≤i−1 δi ∈ D, by P(i), we know that δ≤i−1 δi ∈ B. Since δ≤i−1 δi ∈ B and δi δi+1 ∈ D, all of the bigrams in δ≤i−1 δi δi+1 = δ≤i+1 are canonical, so δ≤i+1 ∈ B.

Backward. We seek to prove δ≤i+1 ∈ B ⟹ δ≤i+1 ∈ D. Assume δ≤i+1 = δ≤i−1 δi δi+1 ∈ B. Then δ≤i−1 δi ∈ B, because it contains only a subset of the bigrams in δ≤i−1 δi δi+1. Then, by P(i), we know that δ≤i−1 δi ∈ D. From δ≤i−1 δi δi+1 ∈ B and Def. 6, we know that δi δi+1 ∈ D. Since δ≤i−1 δi ∈ D and δi δi+1 ∈ D, by Lemma 2, we know that δ≤i−1 δi δi+1 = δ≤i+1 ∈ D.

By induction, we know P(N) is true, so δ ∈ D ⟺ δ ∈ B, and the proof is complete.

25Note that a string of length at most one has no bigrams; thus, the property is vacuously satisfied, i.e., all length-zero and length-one token strings are canonical.
The latter is thanks to Assumption 1.

Lemma 3. For BPE,26 the set of canonical strings is prefix-closed, i.e., D̄ = D.

Proof. We prove D ⊆ D̄ and D̄ ⊆ D.

Part 1 (D ⊆ D̄). Obvious by definition of D̄.

Part 2 (D̄ ⊆ D).
1. Suppose δ ∈ D̄.
2. ⟹ by definition of D̄, there is a δ′ ∈ D such that δ is a prefix of δ′.
3. ⟹ by Theorem 1, the bigrams of δ′ are all canonical.
4. ⟹ since δ is a prefix of δ′, its bigrams are a subset of those of δ′, so they are also all canonical.
5. ⟹ since the bigrams of δ are all canonical, by Theorem 1, δ ∈ D.

Therefore, D̄ = D.

26More generally, Lemma 3 holds for any tokenization model with a bigram-based canonicality test (i.e., δ ∈ D ⟺ all bigrams of δ are canonical).

B.3. An Even Faster Bigram Test

BPE derivations. Our canonicality test involves inspecting the merge structure within a tokenization of a string. For that purpose, we define the derivation of a given string, γ(σ), as a string of trees t1 ··· tN, where each tn (for n = 1, ..., N) is a binary tree. For our purposes, a (binary) tree is a recursive data type composed of either a pair of trees or a base token. The pseudocode below shows how τ can be augmented to produce these derivation trees for a given string:

def γ(σ1 ··· σN):
    t ← τ̄(σ1) ··· τ̄(σN)
    while True:
        t′ ← step_derivation(t)
        if t′ = t: return t
        t ← t′

def step_derivation(t1 ··· tN):
    for (σ′, σ″) in M:
        for n in [1, ..., N−1]:
            if (κ(tn), κ(tn+1)) = (σ′, σ″):
                return t1 ··· tn−1 ⟨tn, tn+1⟩ tn+2 ··· tN
    return t1 ··· tN

Note that γ(·) is an adaptation of τ: merging a pair of adjacent trees builds the pair ⟨tn, tn+1⟩ rather than replacing the pair with a single token. We have also extended κ to return the subword denoted by a derivation. For each token δ ∈ Δ, we associate with it a unique canonical derivation γ(δ) def= γ(κ(δ)). However, we note that when |γ(δ)| > 1, the token δ does not canonically tokenize to itself; Assumption 1 ensures |γ(δ)| = 1 for all δ ∈ Δ.

Additional notation.
For notational convenience, we define the following:

For any ℓ ∈ Σ ∪ (Σ+ × Σ+), let M[ℓ] denote the following: if ℓ ∈ Σ, return its position in some fixed, arbitrary ordering of Σ; if ℓ ∈ Σ+ × Σ+, return the rank of the merge ℓ in the merge list plus |Σ| (as an offset), and ∞ if ℓ is not in M.

We extend M[t] to trees t as follows. For a leaf t ∈ Δ: M[t] def= M[κ(t)]. For a tree of the form t = ⟨s, s′⟩: M[t] def= M[(κ(s), κ(s′))].

We define the following ordering relations on merge rules and subtrees: ℓ < r ⟺ M[ℓ] < M[r]; we define ℓ > r, ℓ ≤ r, and ℓ ≥ r analogously.

Given a token δ ∈ Δ, we define the left spine S(γ(δ)) of γ(δ) to be the ordered set containing the root node and its left descendants27 (ordered from root to leaf); and we define the right spine R(γ(δ)) analogously (for right descendants). In a given tree, we define π as the function that returns the parent node of any of its subtrees.

Intuition. At a high level, we say that a token δ″ conflicts with a bigram δδ′ if and only if δ″'s merge applies with an earlier rank over an overlapping span of the same character string. More formally, we say that δ″ (nontrivially) overlaps with δ and δ′ if and only if

κ(δ) = σ1 ··· σn (24a)
κ(δ′) = σn+1 ··· σN (24b)
κ(δδ′) = σ1 ··· σn σn+1 ··· σN (24c)
κ(δ″) = σi ··· σj where 1 ≤ i ≤ n < j ≤ N (24d)

However, for δ″ to conflict with δδ′, it would need to include a merge over the span σi ··· σj that blocks at least one merge in δ or δ′ from occurring in φ(δδ′).

[Diagram omitted: the derivation trees γ(δ) and γ(δ′) drawn as triangles over the spans σ1 ··· σn and σn+1 ··· σN.]

27We say that a node p is a left descendant of a node q if p is the left child of q or a left descendant of the left child of q. We define right descendants in an analogous manner.
We have denoted the canonical derivation as a trapezoid, as it is generally the union of multiple triangles; there may be 1, 2, or more tokens in the canonicalized bigram φ(δδ′).28 In orange, we have drawn a third triangle for the conflicting token δ″. We note that γ(δ″) must intersect γ(δ) at a node s along its right spine R(γ(δ)), and it must intersect γ(δ′) at a node s′ along its left spine S(γ(δ′)). In the diagram, s and s′ represent the subtrees at the respective points of intersection with the hypothetical conflicting tree. The orange arc from s to s′ shows that any two nodes along the left and right edges could, in principle, form a conflicting tree. Crucially, however, the rank of the conflicting merge pair must precede the competing merge pairs present in the existing tokens δ and δ′. We make this more precise below.

Definition 8. A token δ″ conflicts with the bigram δδ′ if and only if γ(δ″) is a subtree of γ(φ(δδ′)) (which may be a string of trees) but not a subtree of either γ(δ) or γ(δ′).

We can find a conflicting token by exploiting the structure of the tokenization algorithm.

Definition 9. Let δ″ be a token such that γ(δ″) = ⟨s, s′⟩ for some s ∈ R(γ(δ)) and s′ ∈ S(γ(δ′)). We say that δ″ is a minimal conflict for the bigram δδ′ if and only if π(s) > ⟨s, s′⟩ ≤ π(s′).

The reason the δ″ identified by Def. 9 is minimal is that its definition identifies the earliest possible deviation, i.e., the earliest merge rule that could apply on the shared string σ1 ··· σN that is not among the existing merges in δ and δ′. Note that the asymmetry (< versus ≤) is precisely what breaks ties between equal-rank merges in favor of the leftmost match, as prescribed in the procedure for τ provided in Def. 4. Below is an algorithm for finding the minimal conflicting tree between δ and δ′, if one exists.
def find_conflict(δ, δ′):
    t ← γ(δ); L ← ∞                        # rank of π(t); ∞ at the root
    while True:
        t′ ← γ(δ′); R ← ∞                  # rank of π(t′); ∞ at the root
        while True:
            if L > M[⟨t, t′⟩] ≤ R: return ⟨t, t′⟩   # conflict
            if κ(t′) ∈ Σ: break            # t′ is a leaf
            R ← M[t′]; ⟨t′, _⟩ ← t′        # descend the left spine
        if κ(t) ∈ Σ: break                 # t is a leaf
        L ← M[t]; ⟨_, t⟩ ← t               # descend the right spine
    return None

The find_conflict algorithm finds the minimal conflict between a pair of tokens δ and δ′ (if one exists). We note that a useful consequence of Assumption 1 is that we do not need to check subtrees off of the spines for conflicts, making our procedure more efficient than computing the canonicalized string from scratch, e.g., by running τ on the decoded string κ(δδ′). Instead, our algorithm searches for a minimal conflicting tree: either it stops once it has found such a tree, or it has exhausted the limited options for one. Finding the minimal conflict (if it exists) is also faster than computing the entire canonical tokenization φ(δδ′), as it only requires finding the first difference rather than completely building the canonicalized bigram.29

Definition 10. We define Φ as the following relation on pairs of tokens:

Φ(δ, δ′) def= (find_conflict(δ, δ′) = None) (25)

28Please refer back to Fig. 2 for examples.

29We note that the find_conflict algorithm can be efficiently vectorized so that it computes the complete vector of valid next tokens very efficiently. We will provide a PyTorch implementation in our public code release (upon publication).

The proposition below establishes that Φ provides a correct membership test for canonical bigrams.

Proposition 5. Φ(δ, δ′) ⟺ δδ′ ∈ B.

Proof. Observe that (Φ(δ, δ′) ⟺ δδ′ ∈ B) is equivalent to (¬Φ(δ, δ′) ⟹ δδ′ ∉ B) ∧ (δδ′ ∉ B ⟹ ¬Φ(δ, δ′)). We will prove the proposition by proving each conjunct separately.

Part 1 (¬Φ(δ, δ′) ⟹ δδ′ ∉ B).
1. Suppose ¬Φ(δ, δ′). Then, there exists a conflicting pair ⟨s, s′⟩ = find_conflict(δ, δ′), i.e., ⟨s, s′⟩ satisfies π(s) > ⟨s, s′⟩ ≤ π(s′).
[Schematic omitted: the conflicting merge ⟨s, s′⟩ straddling the boundary between γ(δ) and γ(δ′).]

2. Thus, the conflicting merge ⟨s, s′⟩ would have been preferred by τ if it were run on the character string κ(δδ′): the conflicting merge blocks both of the merges below it, because π(s) = ⟨µℓ, s⟩ > ⟨s, s′⟩ and π(s′) = ⟨s′, µr⟩ ≥ ⟨s, s′⟩.30
3. The existence of this intermediate step means that it is impossible for δδ′ to be a canonical bigram; thus, δδ′ ∉ B.

Part 2 (δδ′ ∉ B ⟹ ¬Φ(δ, δ′)).
1. Suppose δδ′ ∉ B. Then, φ(δδ′) ≠ δδ′, by definition.
2. Then, there must exist a conflicting merge ⟨s, s′⟩ as a subtree in φ(δδ′) that blocks δδ′ from being built. Now, because both δ and δ′ are canonical in isolation (Assumption 1), the conflict ⟨s, s′⟩ must straddle the boundary between δ and δ′. This implies that s must be a subtree along the right edge of γ(δ) and s′ must be a subtree along the left edge of γ(δ′), as those are precisely the only ways that the straddling merge can occur under our circumstances. Lastly, for ⟨s, s′⟩ to block δ and δ′, it would need to satisfy π(s) > ⟨s, s′⟩ ≤ π(s′).
3. This characterization is equivalent to ¬Φ(δ, δ′).

30Note that ⟨s, s′⟩ < π(s) enforces the leftmost-merge preference in BPE's encoding procedure τ, i.e., in the event of a tie between overlapping merges of the same rank, the leftmost merge is taken.

B.4. Connections to Canonicality Algorithms

Definition 11. We can describe the set of canonical tokenizations with the following finite-state automaton:

states = Δ
alphabet = Δ
transitions = {δ →δ′ δ′ | δδ′ ∈ B}
initial = Δ
accepting = Δ

Note that we never construct the automaton.

Generality. Our method works under a broader set of assumptions than Berglund et al. (2024) and Lindmark (2024), as they require the merge list to be proper. This condition is, in fact, restrictive: the Llama tokenizer (used in our experiments) is not proper, whereas the GPT-2 tokenizer is.

Practicality. The automaton has V nodes and up to V² edges, where V = |Δ|.
In practice, the machine is very close to having V² edges (even after DFA minimization!) because the transitions do not rule out many tokens. This means that the explicit machine is enormous: for GPT-2, it has 50K states and nearly 2,500,000,000 edges, making it far too large to build. Storing the edges that are absent would be significantly more efficient. On the other hand, our version never builds the machine, so its memory complexity is not a problem. It is also possible to use a more sophisticated automaton representation that allows failure transitions; we refer the reader to Cognetta & Okazaki (2024) and Lindmark (2024) for further discussion of this technique.

C. Supporting Proofs and Lemmata

Proposition 6. Suppose that the tokenization model (Σ, Δ, τ, κ) is exact. Then,

∀δ ∈ Δ*: δ ∈ D ⟺ φ(δ) = δ (26)

Proof. Fix an arbitrary δ ∈ Δ*. We consider each direction separately.

Part 1 (δ ∈ D ⟹ φ(δ) = δ):
1. Suppose δ ∈ D.
2. ⟹ ∃σ ∈ Σ*: δ = τ(σ), by definition of D.
3. ⟹ κ(τ(σ)) = σ, by the exactness assumption.
4. ⟹ τ(κ(τ(σ))) = τ(σ), because τ is a function.
5. ⟹ τ(κ(δ)) = δ, substituting τ(σ) ↦ δ.
6. ⟹ φ(δ) = δ, by definition of φ.
7. Thus, this direction holds.

Part 2 (φ(δ) = δ ⟹ δ ∈ D):
1. Suppose φ(δ) = δ.
2. ⟹ δ = τ(κ(δ)).
3. ⟹ ∃σ ∈ Σ*: δ = τ(σ); specifically, σ = κ(δ).
4. ⟹ δ ∈ D, by definition of D.
5. Thus, this direction holds.

Since each direction has been proven, the proposition holds.

Corollary 1.

D = {φ(δ) | δ ∈ Δ*} (27)

Proposition 7. Let (τ, κ) be exact. Then, (τ, κ) is a bijection between Σ* and D.

Proof. The bijection follows directly from the following: exactness means ∀σ ∈ Σ*: κ(τ(σ)) = σ, and Proposition 6 ensures that ∀δ ∈ D: τ(κ(δ)) = δ. Thus, we have a bijection, and the proposition holds.

Proposition 8. For an exact tokenization model (Σ, Δ, τ, κ), the canonicalization operator is idempotent, i.e.,

φ(δ) = δ′ ⟹ φ(δ′) = δ′ for all δ, δ′ ∈ Δ* (28)

Proof. The proposition is equivalent to φ(φ(δ)) = φ(δ) for all δ ∈ Δ*. Fix δ, δ′ ∈ Δ* arbitrarily.
1. Suppose φ(δ) = δ′.
2. ⟹ τ(κ(δ)) = δ′, by definition of φ.
3. ⟹ τ(κ(τ(κ(δ)))) = τ(κ(δ′)), because τ and κ are functions.
4. ⟹ τ(κ(δ)) = τ(κ(δ′)), by exactness (the inner κ ∘ τ cancels).
5. ⟹ τ(κ(δ)) = τ(κ(φ(δ))), substituting δ′ ↦ φ(δ).
6. ⟹ φ(δ) = φ(φ(δ)), by definition of φ.

D. Proofs for Section 4 (Canonicality by Conditioning)

Proposition 1. Assuming that the true distribution over tokens p* is canonical, the globally canonicalized model g guarantees the following reduction in KL divergence:

KL(p* ∥ p) − KL(p* ∥ g) = −log Z ≥ 0 (7)

Proof.

KL(p* ∥ g) = Σδ p*(δ) log (p*(δ) / g(δ)) (29a)
= Σδ p*(δ) log p*(δ) − Σδ p*(δ) log (p(δ) 1{δ ∈ D} / Z) (29b)
= Σδ p*(δ) log p*(δ) − Σδ p*(δ) log p(δ) − Σδ p*(δ) log 1{δ ∈ D} + Σδ p*(δ) log Z (29c)
= Σδ p*(δ) log (p*(δ) / p(δ)) + log Z (29d)
= KL(p* ∥ p) + log Z (29e)

where (29c) = (29d) uses that p* is canonical, so Σδ p*(δ) log 1{δ ∈ D} = 0. Now, the proposition follows by basic algebra.

Proposition 2.

g(δ) ∝ ℓ(δ) · wℓ(δ), where wℓ(δ) def= (p(δ) / ℓ(δ)) · 1{δ ∈ D} (10b)

ℓ(δ) = ℓ(EOS | δ) ℓ(δt | δ