# Language Models over Canonical Byte-Pair Encodings

Tim Vieira¹ Tianyu Liu¹ Clemente Pasti¹ Yahya Emara¹ Brian DuSell¹ Benjamin LeBrun² Mario Giulianelli¹ Juan Luis Gastaldi¹ Timothy J. O'Donnell³ ² ⁴ Ryan Cotterell¹

¹ETH Zürich ²Mila ³McGill University ⁴Canada CIFAR AI Chair. Correspondence to: Tim Vieira.

**Abstract.** Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string. These are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora. Code: github.com/genlm/canonical-icml-2025

## 1. Introduction

Modern language models are probability distributions over character strings (denoted Σ*) that are parameterized as
*Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).*

Figure 1. The canonical and noncanonical encodings of the string "Hello, world". The diagram shows the top-8 token encodings of the string, ordered by probability (descending from top to bottom), as there are hundreds of them for this short string. Note that the canonical token encoding is the most likely one (i.e., it is at the top), which is reassuring, as it is the most representative of the training data.

distributions over a different space of token strings (denoted Δ*). In the case of the prevalent byte-pair encoding (BPE; Gage, 1994), each token may be regarded as a multi-character (or multi-byte) chunk of a UTF-8-encoded string. Such a language model is trained by encoding its training data into token strings, which has the benefit of compressing the large corpus. The BPE encoding is a pair of functions: an encoding function τ: Σ* → Δ* that maps a given input character string σ to its canonical encoding δ = τ(σ), and a decoding function κ: Δ* → Σ* that maps any token string δ back to a character string σ = κ(δ). The result of training on a corpus of token strings is a language model p that generates a token string δ and decodes it to a character string σ. We define the set of canonical token strings D as those that appear as the canonical encoding of at least one character string, i.e., D ≜ {τ(σ) | σ ∈ Σ*}. These are the only strings that are possible in the tokenized training data of p. The set of noncanonical token strings is, thus, Δ* \ D. Even though p has never seen and will never see a noncanonical token string in its training data, it will still place nonzero probability mass on them due to model architecture and training limitations. Fig. 1 shows an example of a string with its canonical encoding and its exponentially many noncanonical encodings.
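To make the encode/decode asymmetry concrete, here is a toy sketch (ours, not the paper's implementation; the merge list is made up and far smaller than any real tokenizer's) of a minimal BPE encoder τ, the concatenation decoder κ, and a brute-force enumeration of every token string that decodes to a given character string:

```python
# Toy BPE sketch with a hypothetical merge list (not the tokenizer used in the paper).
MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
RANK = {pair: i for i, pair in enumerate(MERGES)}

def encode(text):
    """Canonical encoding tau: repeatedly apply the lowest-rank applicable merge."""
    toks = list(text)
    while True:
        candidates = [(RANK[p], i) for i, p in enumerate(zip(toks, toks[1:])) if p in RANK]
        if not candidates:
            return toks
        _, i = min(candidates)  # lowest-rank merge, leftmost occurrence on ties
        toks = toks[:i] + [toks[i] + toks[i + 1]] + toks[i + 2:]

def decode(toks):
    """Decoding kappa is plain concatenation."""
    return "".join(toks)

# Every unit BPE can ever produce: merge inputs and merge outputs.
VOCAB = {t for a, b in MERGES for t in (a, b, a + b)}

def encodings(text):
    """All token strings over VOCAB that decode to `text` -- canonical or not."""
    if not text:
        return [[]]
    return [[text[:k]] + rest
            for k in range(1, len(text) + 1) if text[:k] in VOCAB
            for rest in encodings(text[k:])]

canonical = encode("hello")      # the one encoding a trained model would ever see
everything = encodings("hello")  # all decoding-equivalent token strings
```

In this toy setting, `encodings("hello")` returns six decoding-equivalent token strings, only one of which, `encode("hello")`, can ever appear in tokenized training data; a token-level model that spreads mass over the other five makes exactly the kind of canonicality mistake described above.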
**Fixing canonicality mistakes.** Removing noncanonical strings from the support of the estimated distribution p can improve the likelihood of the correct token strings. We describe two families of methods for doing so:

- **Canonicality by conditioning:** We explore efficient test-time inference methods for conditionally generating text that satisfies the canonicality constraint without retraining.
- **Canonicality by construction:** We explore methods that impose canonicality constraints directly in the language model's parameterized architecture and give a method to fine-tune its parameters.

In addition to these novel methods, this paper presents the following contributions:

- We prove that our methods can only improve the fit to the true distribution over tokens.
- We show empirically that fixing canonicality mistakes using our methods improves the likelihood of held-out data for several models and corpora.
- As a side effect of pursuing the goal of generating only canonical token strings, we discovered an efficient, incremental canonicality test for BPE (see App. B) that is significantly simpler than prior work (e.g., Berglund et al., 2024; Cognetta & Okazaki, 2024), as it does not require automata theory to understand or implement.

## 2. Preliminaries

### 2.1. Strings

An alphabet is a non-empty, finite set of symbols (e.g., bytes, characters, or tokens) from which we can build strings. A string is a finite-length sequence of symbols from a given alphabet. Let Γ be an alphabet, and let Γ* denote the set of all strings over Γ. We use |γ| to denote the length of a string γ and ε to denote the string of zero length. Let γ γ′ denote the concatenation of γ and γ′.

[…] This requires ℓ(δ) > 0 for all δ ∈ D, which is ensured when our prefix canonicality test 1{δ ∈ D} is exact. More generally, it is safe to allow for false positives but not false negatives.¹³

¹³Our experiments use transformer-based language models.

**Why train?**
Suppose the original model p is equal to some pθ in our family; let θ(0) denote its parameters. Clearly, if we set θ = θ(0), the model ℓθ is no different from ℓ. The reason we optimize θ beyond θ(0) is that the training objective for θ(0) pressured the parameters to model canonicality, but that pressure is now gone because the canonicalized architecture (Def. 3) enforces the constraint for all θ. Thus, the parameters that were previously used to model canonicality preferences can be repurposed to model other textual phenomena in the training data.

**Training pθ.** Given a training corpus σ(1), …, σ(M) of character strings drawn i.i.d. from p over Σ*, let δ(1), …, δ(M) be their corresponding (canonical) tokenizations. We define L(p), the log-loss (average negative log-likelihood) of the training corpus under a language model p:

$$L(p) \;\overset{\text{def}}{=}\; -\frac{1}{M}\sum_{m=1}^{M} \log p(\delta^{(m)}) \tag{16}$$

Log-loss is a reasonable (and common) training objective, as minimizing L(ℓθ) also minimizes KL(p ∥ ℓθ) in the limit of infinite data (with perfect optimization). However, training a large language model with L from scratch is prohibitively costly. Instead, we fine-tune an existing (noncanonical) large language model. We use the following fine-tuning objective, which strikes a balance between fitting the log-loss objective and maintaining some fidelity to the original language model p:

$$F_\lambda(\theta) \;\overset{\text{def}}{=}\; (1-\lambda)\, L(\ell_\theta) + \lambda\, \mathrm{KL}(\ell_\theta \,\|\, p) \tag{17}$$

Here, we use the KL divergence between ℓθ and the original unconstrained model distribution p as a regularizer (e.g., Christiano et al., 2017; Stiennon et al., 2020; Ziegler et al., 2020; Korbak et al., 2022). The regularization parameter 0 ≤ λ ≤ 1 trades fidelity to the original model (higher λ) against fidelity to the fine-tuning data (lower λ).

**Optimization.** Our optimization algorithm is based on stochastic gradient descent. To optimize Fλ(θ), on each step, we randomly choose with probability λ which of the two terms to perform a gradient step on:

1.
For the log-loss updates, we sample a minibatch of examples from the fine-tuning corpus and use the sample average of that minibatch's gradient as an efficient approximation to the gradient over the entire corpus.

2. We estimate the gradient of the KL regularization term using Amini et al.'s (2025) Rao–Blackwellized method with samples from the current ℓθ.

Our experiments initialize ℓθ with the same parameters as p.

## 5. Experiments

This section evaluates our proposed methods, canonicality by conditioning (global and local; §3) and canonicality by construction (§4), by measuring their impact on real datasets and language models. This evaluation complements the theoretical guarantees of Proposition 1 and Proposition 3 by quantifying the log-loss (L; Eq. (16)) on held-out data.

**Datasets.** We experiment on the following two datasets:

- Penn Treebank (PTB; Marcus et al., 1993): test split; 3,761 strings, 82k words, 439k characters
- WikiText (Merity et al., 2017): test split; 4,358 strings, 234k words, 1,286k characters

**Models.** We experiment with the following models:

- GPT-2 models (Radford et al., 2019) of increasing size (small, medium, and large)
- Llama models (Llama Team, 2024) of increasing size (3.2-1B, 3.2-3B, and 3.1-8B)

**Pre-tokenization.** Our efficient bigram-based canonicality test 1{δδ′ ∈ D} (App. B) makes the simplifying assumption that the tokenizer τ is based solely on BPE. In practice, however, systems apply an additional pre-tokenization step that breaks text into chunks called pre-tokens (e.g., based on spaces and punctuation). Most pre-tokenizers are based on hand-crafted regular expressions; the GPT-2 and Llama models each use a distinct pre-tokenization regular expression. Since these models make use of pre-tokenization, our incremental test occasionally makes mistakes. Particularly concerning are instances of false negatives, i.e., cases where we rule out a bigram as noncanonical when it is in fact canonical. False negatives are problematic for estimation (Footnote 12).
These errors are caused by the interaction of the pre-tokenizer and the tokenizer.¹⁴ To work around these rare exceptions, we determined a (small) set of overrides for our canonicality test by identifying the bigrams on which we made a false-negative judgment on a corpus of canonical strings. We note that this workaround occasionally introduces false positives.¹⁵

¹⁴For example, under the GPT-2 tokenizer, the token bigram ⟨\n, \n⟩ (token 198 twice) may or may not be canonical, depending on the context. For instance, in τ(Hi,\n\n I) it is canonical, because the pre-tokenizer creates separate pre-tokens for each newline, preventing BPE from merging them. However, in τ(Hi,\n\n) it is not canonical, because the pre-tokenizer does not create separate pre-tokens for the newlines, allowing BPE to merge them.

### 5.2. Canonicality by Conditioning

**Methodology.** Fig. 3 reports the log-loss of the local (ℓ) and global (g) methods on each dataset {δ(m)} for m = 1, …, M, as well as that of p, which serves as our baseline. Below are the details of how we estimated the log-loss of each method.

Baseline: The log-loss for the baseline method is

$$L(p) = -\frac{1}{M}\sum_{m=1}^{M} \log p(\delta^{(m)}) \tag{18a}$$

Global: The log-loss for the global method is

$$L(g) = L(p) + \log Z \;\approx\; L(p) + \log \hat{Z} \tag{18b}$$

where Ẑ is computed by Eq. (13) using 2000 samples from ℓ. Note that log Ẑ ≤ 0; thus, the global method can only improve the log-loss: L(g) ≤ L(p).¹⁶

Local: The log-loss for the local method is

$$L(\ell) = L(p) + \underbrace{\frac{1}{M}\sum_{m=1}^{M} \log w_\ell(\delta^{(m)})}_{\overset{\text{def}}{=}\;\widehat{W}} \tag{18c}$$

Note that $\frac{1}{M}\sum_{m=1}^{M} \log w_\ell(\delta^{(m)}) \le 0$; thus, the local method can only improve the log-loss: L(ℓ) ≤ L(p).

**Relationship to theory.** Proposition 1 shows that the reduction in KL divergence is equal to −log Z, which is equal to the difference in log-loss L(p) − L(g). Proposition 3 shows that the reduction in KL divergence is equal to −E_{δ∼p}[log wℓ(δ)], which is equal to the difference in log-loss L(p) − L(ℓ) (in expectation).

**Observations.**
For the global method, the change in the log-loss is independent of the dataset, because log Ẑ is independent of the dataset. For the local method, however, the change depends on the dataset, because Ŵ depends on the dataset.

- Statistical significance: For the global method, the log-loss reduction is positive and constant for every string; therefore, it is trivially statistically significantly better than the baseline method according to the paired-permutation test (i.e., p = 0). For the local method, the log-loss reduction is positive for all strings, but the amount varies across strings; nonetheless, it is trivially statistically significantly better than the baseline method according to the paired-permutation test (i.e., p = 0).
- Local vs. global: On the PTB dataset, we found that the global method is significantly better than the local method for all models. However, on the WikiText dataset, the local method is better in all cases except GPT-2-medium.

¹⁵A complete solution to the pre-tokenization challenges would build a transducer that accurately models the pre-tokenizer, which we would then compose with a transducer implementing BPE (see App. B.4). We leave this investigation for future work.

¹⁶In the case of the GPT-2 models, we restricted strings to length ≤ 1024, as these models tend to ramble. We analyze this choice further in Fig. 4.

| Dataset | Model | Baseline | Local | Global |
|---|---|---|---|---|
| PTB | GPT-2 small | 201.0 | 200.7 | **199.1** |
| PTB | GPT-2 medium | 195.1 | 194.5 | **193.1** |
| PTB | GPT-2 large | 189.4 | 188.9 | **188.2** |
| PTB | Llama 1B | 171.2 | 171.1 | **169.7** |
| PTB | Llama 3B | 165.0 | 165.0 | **164.2** |
| PTB | Llama 8B | 161.5 | 161.5 | **160.1** |
| WikiText | GPT-2 small | 369.2 | **367.0** | 367.3 |
| WikiText | GPT-2 medium | 334.1 | 333.2 | **332.2** |
| WikiText | GPT-2 large | 320.8 | **319.1** | 319.6 |
| WikiText | Llama 1B | 286.7 | **284.4** | 285.2 |
| WikiText | Llama 3B | 264.6 | **262.0** | 263.7 |
| WikiText | Llama 8B | 248.2 | **245.8** | 246.8 |

Figure 3. Log-loss (L; bits/string) for the baseline (p), local (ℓ), and global (g) methods across two datasets and six models. Bolding indicates the best number in its row. See text for discussion of statistical significance.

**Takeaways.**
On principle (§3.1), we maintain that the global method (based on conditioning) is the correct solution, as it is the only distribution that preserves the relative probabilities of canonical strings. Additionally, our experiments support the local method as a practical approximation to the global distribution, which we expected to be the case because the probability allocated to each noncanonical token is generally quite small, meaning that the warping is not large (see the discussion of Item 3 in §3.2.2). Interestingly, on some datasets (e.g., WikiText), the warping induced by the local method with respect to the global distribution can be advantageous, i.e., its reduction in log-loss with respect to the WikiText distribution is larger than that of the global method.

**Log-canonicality rate vs. length.** We found that the GPT-2 models generate very long strings, so we investigated the tradeoffs in the log-canonicality rate when truncating the lengths of the strings generated by each model. Fig. 4 shows the effect of length on the log-canonicality rate estimate log Ẑ, using 2000 samples taken from ℓ. We see that for the Llama models, the log-canonicality rate stabilizes after 256 tokens, but for the GPT-2 models, it continues decreasing. Thus, for practical reasons, our experiments with GPT-2 are based on a length limit of 1024; bear in mind that this means the actual log-canonicality rate is likely to be much smaller, so the reduction in log-loss for GPT-2 is larger than Fig. 3 indicates.

Figure 4. Log-canonicality rate vs. (tokenized) maximum string length (128, 256, 512, 1024), including 95% confidence intervals, for each model (GPT-2 S/M/L and Llama 1B/3B/8B).

**Supplementary error analysis.** App. E analyzes the most frequent canonicality errors made by each model.

### 5.3. Canonicality by Construction

**Methodology.**
We fine-tuned two language models, GPT-2 S and GPT-2 M,¹⁷ on the PTB train set and on a subset of the WikiText train set with 50K strings and 4.2M words. We consider fine-tuning the canonicalized architecture (ℓθ) and the original architecture (pθ) using the training criterion Fλ for λ ∈ {0.001, 0.01, 0.1, 0.2}.¹⁸ As baselines, we consider the original model p and the locally constrained model ℓ, as these models serve as the initializations for fine-tuning. Fig. 5 reports the log-loss L (bits/string) on the held-out datasets for each method mentioned above.

**Observations.** In all cases (i.e., models, datasets, and methods), we observe a large improvement in log-loss from fine-tuning. The reduction in log-loss is larger in the case of PTB. Provided that λ is small enough, we see a consistent reduction in log-loss when comparing ℓθ and pθ; however, the difference between ℓθ and pθ is comparatively smaller than what we saw in Fig. 3. Generally, the smaller values of λ work best, indicating

¹⁷Unfortunately, we are unable to fine-tune larger models due to computational constraints. However, the models used here serve as a proof of concept.

¹⁸Each model is trained for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5 and linear learning-rate decay. For efficiency, we use bfloat16 to represent the model parameters. We use a minibatch of size 8 for estimating the gradient of each term of the Fλ objective.

| Dataset | Model | Arch. | N/A | .001 | .01 | .1 | .2 |
|---|---|---|---|---|---|---|---|
| PTB | GPT-2 S | pθ | 201.4 | 143.7 | 143.7 | 143.6 | 146.8 |
| PTB | GPT-2 S | ℓθ | 200.7 | 143.6 | 143.6 | 143.6 | 146.8 |
| PTB | GPT-2 M | pθ | 195.0 | 128.6 | 128.9 | 128.6 | 132.1 |
| PTB | GPT-2 M | ℓθ | 194.5 | 128.3 | 128.7 | 128.4 | 132.0 |
| WikiText | GPT-2 S | pθ | 369.2 | 331.7 | 331.7 | 331.7 | 335.7 |
| WikiText | GPT-2 S | ℓθ | 367.0 | 331.7 | 331.7 | 331.7 | 335.6 |
| WikiText | GPT-2 M | pθ | 334.1 | 293.9 | 294.0 | 293.9 | 298.7 |
| WikiText | GPT-2 M | ℓθ | 333.2 | 293.7 | 293.8 | 293.8 | 298.8 |

Figure 5. Log-loss (bits/string) of fine-tuned models on held-out portions of PTB and WikiText. The column labeled N/A reports the log-loss of the baselines p and ℓ. The other columns correspond to the value of the regularization parameter λ used in the fine-tuning loss.
The rows labeled pθ correspond to the original architecture, and the rows labeled ℓθ to the canonicalized architecture.

that regularization towards the original model should not be done too strongly.

**Takeaways.** Unsurprisingly, fine-tuning for the specific dataset is useful regardless of whether we use the original or the canonicalized architecture. Fine-tuning with the canonicalized architecture performs slightly better, but the improvements appear to be small. It is possible that training the canonicalized architecture from scratch on a huge dataset would yield better results than our proof-of-concept experiment.

## 6. Conclusion

We have demonstrated that enforcing canonicality in token-level language models eliminates a systematic misallocation of probability mass, leading to improved likelihoods on held-out data across multiple models and corpora. Our proposed methods, canonicality by conditioning and canonicality by construction, provide practical solutions that either refine inference or modify the model architecture to ensure that only canonical token strings are assigned positive probability. In addition to the empirical benefits, our theoretical results establish that correcting these mistakes strictly improves model fit. Moreover, our discovery of an efficient incremental test for BPE canonicality simplifies prior approaches, making it more accessible for practical deployment. These findings underscore the importance of aligning token-based probability models with their training distributions, thereby paving the way for more accurate language modeling.

## Acknowledgments

The authors would like to thank John Terilla, Marco Cognetta, Ben Lipkin, Luca Malagutti, and Manuel de Prada Corral for their helpful feedback and discussions.

## Impact Statement

**Potential benefits.** By introducing methods to enforce canonicality, we improve a given model's fidelity to the underlying distribution that generated the text.
These contributions have broad implications for the reliability of language models, particularly in applications where precise probability estimates are crucial. By eliminating avoidable modeling errors, we advance the goal of more reliable and robust language models.

**Potential harms.** Constraining tokenization may inadvertently reduce robustness to rare, noisy, or adversarial inputs by ruling out alternative tokenizations of the same character string. While the canonicality constraint does not remove any character strings from the language, it does restrict the model to assigning nonzero probability only to the single canonical tokenization of each string. This may disproportionately affect rare or unconventional spellings, as their probability mass may be less sharply concentrated on a single tokenization. Future work should provide a deeper analysis of whether canonicality enforcement helps or hurts in these settings, to ensure fair and robust language-model behavior across diverse linguistic contexts.

## References

Amini, A., Vieira, T., and Cotterell, R. Better estimation of the KL divergence between language models, 2025. URL https://arxiv.org/abs/2504.10637.

Anthropic. Introducing Claude, 2023a. URL https://www.anthropic.com/news/introducing-claude.

Anthropic. Claude 2, 2023b. URL https://www.anthropic.com/news/claude-2.

Berglund, M., Martens, W., and van der Merwe, B. Constructing a BPE tokenization DFA, 2024. URL https://arxiv.org/abs/2405.07671.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the International Conference on Machine Learning, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.
Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. GPT-NeoX-20B: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.

Chatterjee, S. and Diaconis, P. The sample size required in importance sampling, 2017. URL https://arxiv.org/abs/1511.01437.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.

Cognetta, M. and Okazaki, N. Tokenization as finite-state transduction, 2024. URL https://arxiv.org/abs/2410.15696.

Gage, P. A new algorithm for data compression. C Users Journal, 12(2), 1994. ISSN 0898-9788. URL https://web.archive.org/web/20230319172720/https://www.derczynski.com/papers/archive/BPE_Gage.pdf.

Gastaldi, J. L., Terilla, J., Malagutti, L., DuSell, B., Vieira, T., and Cotterell, R. The foundations of tokenization: Statistical and computational concerns. In Proceedings of the International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2407.11606.

Gemma Team. Gemma: Open models based on Gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295.
Groeneveld, D., Beltagy, I., Walsh, E., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N., and Hajishirzi, H. OLMo: Accelerating the science of language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.841/.

Korbak, T., Perez, E., and Buckley, C. RL with KL penalties is better viewed as Bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP, 2022. URL https://aclanthology.org/2022.findings-emnlp.77/.

Lew, A. K., Zhi-Xuan, T., Grand, G., and Mansinghka, V. K. Sequential Monte Carlo steering of large language models using probabilistic programs, 2023. URL https://arxiv.org/abs/2306.03081.

Lindmark, V. Analyzing the effect of DFA compression in token DFAs, 2024. URL https://www.diva-portal.org/smash/get/diva2:1878423/FULLTEXT01.pdf.

Llama Team. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations, 2019. URL https://openreview.net/pdf?id=Bkg6RiCqY7.

Loula, J., LeBrun, B., Du, L., Lipkin, B., Pasti, C., Grand, G., Liu, T., Emara, Y., Freedman, M., Eisner, J., Cotterell, R., Mansinghka, V., Lew, A. K., Vieira, T., and O'Donnell, T. J. Syntactic and semantic control of large language models via sequential Monte Carlo. In Proceedings of the International Conference on Learning Representations, 2025.
URL https://openreview.net/forum?id=xoXn62FzD0.

Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004/.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In Proceedings of INTERSPEECH, 2010. URL https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html.

Mistral AI. Mistral 7B, 2023. URL https://mistral.ai/news/announcing-mistral-7b/.

Mistral AI. Au Large, 2024a. URL https://mistral.ai/news/mistral-large/.

Mistral AI. Mistral NeMo, 2024b. URL https://mistral.ai/news/mistral-nemo/.

OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/index/chatgpt/.

OpenAI. GPT-4 technical report, 2024a. URL https://arxiv.org/abs/2303.08774.

OpenAI. Hello GPT-4o, 2024b. URL https://openai.com/index/hello-gpt-4o/.

Park, K., Wang, J., Berg-Kirkpatrick, T., Polikarpova, N., and D'Antoni, L. Grammar-aligned decoding. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=5G7ve8E1Lu.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016. URL https://aclanthology.org/P16-1162.

Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(4), 1948. URL https://doi.org/10.1002/j.1538-7305.1948.tb00917.x.
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf.

Sundermeyer, M., Ney, H., and Schlüter, R. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3), 2015. URL https://ieeexplore.ieee.org/document/7050391.

Team OLMo. 2 OLMo 2 Furious, 2025. URL https://arxiv.org/abs/2501.00656.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models, 2023a. URL https://arxiv.org/abs/2302.13971.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.
N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. URL https://arxiv.org/abs/1706.03762.

Vieira, T., LeBrun, B., Giulianelli, M., Gastaldi, J. L., DuSell, B., Terilla, J., O'Donnell, T. J., and Cotterell, R. From language models over tokens to language models over characters. In Proceedings of the International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2412.03719.

Zhao, S., Brekelmans, R., Makhzani, A., and Grosse, R. Probabilistic inference in language models via twisted sequential Monte Carlo, 2024. URL https://arxiv.org/abs/2404.17546.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences, 2020. URL https://arxiv.org/abs/1909.08593.

Zouhar, V., Meister, C., Gastaldi, J., Du, L., Vieira, T., Sachan, M., and Cotterell, R. A formal perspective on byte-pair encoding. In Findings of the Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-acl.38.

## A. Limitations and Future Work

**Future experimental work.** A limitation of our experimental design is that we only evaluate our approach's ability to estimate probabilities. We do not have any experiments evaluating qualitative differences in sample quality from the canonicalized methods.

**Models overregularize.** Language models often place a higher probability on a noncanonical encoding of a string than on the canonical one. Here is an interesting example from the WikiText corpus (Merity et al., 2017):

    xs = """\n\n = Robert Boulter = \n\n\n\n\n Robert Boulter is an English film , \
    television and theatre actor . He had a guest @-@ starring role on the television \
    series The Bill in 2000 .
    """.encode("utf-8")

In the case of GPT-2, the conditional probability of the canonical tokenization is only 2.7%, and it is ranked 4th in the (partially enumerated) set of encodings of this string.

**UTF-8 encoding.** The very first step in representing text is the choice of string encoding. UTF-8 is an encoding that maps strings of characters into bytes; it is used to store almost every webpage on the internet. The scheme uses a variable-width encoding in which each Unicode character occupies between 1 and 4 bytes, with more common Unicode characters requiring fewer bytes. The UTF-8 encoding imposes additional validity constraints on token strings. In future work, we may extend our work on BPE encoding validity one step further to ensure that the byte string that a BPE string decodes to is a valid UTF-8 string. For better or worse, UTF-8 decoding errors are typically suppressed, e.g., by the Hugging Face transformers library.

**Implementation note.** Byte-pair encoding (as the name suggests) typically takes bytes as inputs. Note that popular interfaces like Hugging Face's try to make things easier for users by hiding the UTF-8 encoding step that maps strings to bytes, exposing only a string interface. We require finer-grained control to implement the round-trip test correctly; we must avoid encoding and decoding from the text's encoding (e.g., UTF-8). Therefore, we use a custom implementation of byte-pair encoding. Note that the transformers library does not provide methods to work directly with byte strings. For example, "🤗".encode("utf-8") = b'\xf0\x9f\xa4\x97' is the encoding of the hugging-face emoji. However, if we try to decode a prefix of the byte string, b'\xf0\x9f\xa4'.decode("utf-8"), we get an error, as it is an invalid UTF-8 string. Note that if we use b'\xf0\x9f\xa4'.decode("utf-8", errors="ignore") or b'\xf0\x9f\xa4'.decode("utf-8", "replace"), the errors may go unnoticed.
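The round-trip behavior described above can be checked directly with Python's built-in codecs (a standalone snippet; no external libraries involved):

```python
# The hugging-face emoji (U+1F917) occupies 4 bytes under UTF-8.
data = "\N{HUGGING FACE}".encode("utf-8")
assert data == b"\xf0\x9f\xa4\x97"

# Strict decoding of a 3-byte prefix fails: it is not a valid UTF-8 string.
try:
    b"\xf0\x9f\xa4".decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# With error suppression, the malformed prefix silently decodes to the empty
# string (or a replacement character), so the problem goes unnoticed.
assert b"\xf0\x9f\xa4".decode("utf-8", errors="ignore") == ""
assert b"\xf0\x9f\xa4".decode("utf-8", errors="replace") == "\ufffd"
```

This is why a byte-level round-trip test must use strict decoding: the suppressing error handlers turn an invalid token string into an innocuous-looking character string.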
Many packages (e.g., transformers and tiktoken) suppress these errors.

Other tokenization models. We limited our experiments to BPE, as it is by far the most commonly used tokenization model (Footnote 5). However, many other tokenizers exist, such as longest-match tokenization (used by WordPiece) and other finite-state transduction-based tokenizers. These other tokenizers can plug directly into our framework as long as they provide efficient membership tests for D. In the case of finite-state-transduction methods (which include longest-match tokenizers), the membership test is a straightforward operation on the automaton. We note that the prefix-closure property D̄ = D does not hold in general; in particular, it does not hold for longest-match tokenizers.

More sophisticated inference algorithms. Future work may wish to explore sequential importance resampling (SIR) or other sequential Monte Carlo (SMC) methods, which have shown great promise for constrained generation from language models (e.g., Lew et al., 2023; Zhao et al., 2024). Sequential importance resampling, for example, provides the computational benefit of not waiting until a complete string has been sampled from the local model before resampling. It works by evolving many string prefixes in parallel: when any given string is stuck in a low-probability region, it can be probabilistically replaced by another string in the collection. Many SMC algorithms, including SIR, are GPU-friendly.

Distillation. Another interesting direction related to fine-tuning is the distillation of the (approximately) globally canonicalized model g. We could do this by minimizing KL(g ∥ ℓθ) using stochastic gradient descent.

Precision vs. recall. It can be challenging to correctly capture all of the constraints that the inputs must satisfy. In this paper, we explored the canonicality constraints of BPE models.
We provided efficient methods for testing canonicality, and we needed to validate them against reference canonicality procedures (as a kind of regression test). It may be impossible to capture every nuance of a particular implementation perfectly. And, in some cases, the reference implementations even have bugs or undesirable behaviors. For example, the pre-tokenization system used by GPT-2 was designed to split the possessive marker ('s) from a phrase, as in OpenAI's system ↦ OpenAI |'s| system; however, the rule has the following behavior, which is likely not intentional: OpenAI Inc.'s system ↦ OpenAI | Inc |.'| s | system. It fails to split the possessive correctly because the regular expression used by the pre-tokenizer prioritizes the punctuation run .' over 's in this example. More generally, reasoning about a complicated software system using formal models will always present a trade-off between fidelity and analytical tractability.

Evaluation on downstream tasks. Our study deliberately centers on a task-agnostic metric: KL divergence. We proved that our canonicalization methods are guaranteed to improve this metric (Propositions 1 and 3), and we quantified the effect empirically (§5). However, many language-model applications are judged by task-specific metrics that may be misaligned with KL divergence; thus, an improvement in KL divergence does not guarantee an improvement in the task-specific metrics. Assessing how canonicalization influences concrete reasoning tasks under each task's own evaluation metric remains an important direction for future work.

B. Efficient Membership Tests for BPE

This section gives a formal definition of byte-pair encoding (BPE) and novel algorithms for the following membership tests pertaining to it:

- the canonical set D (see Theorem 1)
- the canonical prefix set D̄ (see Lemma 3)
- an incremental membership test for single-token extensions, i.e., δδ′ ∈ D given that δ ∈ D (see Proposition 5)

Our key insight is that a token string is canonical if and only if all of its constituent bigrams are canonical; see Theorem 1. Thus, our membership test runs in time linear in the length of the string: we iterate over all bigrams in the string and check their canonicality; we call this the bigram test. Moreover, our test is simple to implement, unlike similar tests given in prior work (e.g., Berglund et al., 2024; Cognetta & Okazaki, 2024). A consequence of Theorem 1 is that D̄ = D (see Lemma 3), which means that membership in the set of canonical prefixes reduces to membership in the canonical set.19

B.1. Byte-Pair Encoding

Byte-pair encoding (BPE; Gage, 1994; Sennrich et al., 2016) is currently the most commonly used tokenization method, as it has been used in most recent state-of-the-art language models.20 We formalize BPE as follows.

Definition 4. Byte-pair encoding is a pair (Σ, M) that defines a tokenization model (Σ, Δ, τ, κ). We describe the assumptions on (Σ, M) and the construction of (Σ, Δ, τ, κ) below:

- Σ is a base alphabet of characters (or bytes).
- M = ⟨(σ′(1), σ″(1)), ..., (σ′(M), σ″(M))⟩ is the merge list, where σ′(m), σ″(m) ∈ Σ+ for m = 1, ..., M.
- We define the subword alphabet S def= Σ ∪ {σ′σ″ | (σ′, σ″) ∈ M}.
- We define the token alphabet Δ such that for each δ ∈ Δ, there is a unique object (e.g., an identifier) for each distinct element of the subword alphabet S. The reason for this technical detail is to avoid confusion between the concatenation of subwords and the concatenation of tokens.
- We define the base encoder τ̄ : S → Δ and base decoder κ̄ : Δ → S as bijective maps between the subword alphabet S and the set of base tokens Δ.
- We define the decoder κ : Δ* → Σ* to be the pointwise extension of the base decoder: κ(δ1 ··· δN) def= κ̄(δ1) ··· κ̄(δN).
- We define the encoder τ : Σ* → Δ* by the pseudocode below, which is a deterministic string-rewriting process that applies a sequence of merge rules to the current string of tokens until no more merges are available.21,22

def τ(σ1 ··· σN):
    δ ← τ̄(σ1) ··· τ̄(σN)
    while True:
        δ′ ← rewrite(δ)
        if δ′ = δ: return δ
        δ ← δ′

def rewrite(δ1 ··· δN):
    for (σ′, σ″) in M:
        for n in [1, ..., N−1]:
            if (κ̄(δn), κ̄(δn+1)) = (σ′, σ″):
                return δ1 ··· δn−1 τ̄(σ′σ″) δn+2 ··· δN
    return δ1 ··· δN

Note that the merge rules are applied in a specific deterministic order to any given string; thus, there is a unique way to derive the output tokenization from the input string.

19We note that, in principle, the bigram canonicality tests can all be precomputed and cached for maximal efficiency. However, we found that to be impractical for models with large token vocabularies. In App. B.3, we provide a very efficient method for testing bigram canonicality.

20See Footnote 5.

21We note that τ can be implemented much more efficiently using a careful choice of data structure (see, e.g., Zouhar et al., 2023).

22We note that Berglund et al. (2024) refer to this incarnation of BPE as the SentencePiece BPE algorithm.

B.2. Simple and Efficient Membership Tests

Definition 5. We define a bigram δδ′ (with δ, δ′ ∈ Δ) to be canonical if and only if

φ(δδ′) = δδ′, (19)

where φ is the canonicalization function (§2.6).

Definition 6. We define the set of bigram-canonical strings B as the strings composed of only canonical bigrams:

B def= {δ ∈ Δ* : every bigram of δ is canonical}. (20)

In general, it is possible for a token in the token vocabulary not to be canonical. Such tokens can always be removed from the token vocabulary without changing the result of the BPE tokenizer.
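As a concrete (and deliberately tiny) illustration of Def. 4, the following Python sketch implements rewrite and τ for a hypothetical merge list; tokens are represented directly by their subword strings, so the base encoder and decoder are implicit:

```python
MERGES = [("a", "b"), ("ab", "c")]  # toy merge list, in priority order

def rewrite(delta):
    # Apply the single highest-priority merge; position breaks ties
    # in favor of the leftmost match, as in Def. 4.
    for left, right in MERGES:
        for n in range(len(delta) - 1):
            if (delta[n], delta[n + 1]) == (left, right):
                return delta[:n] + [left + right] + delta[n + 2:]
    return delta  # no merge applies: delta is a fixpoint

def tau(sigma):
    # Start from base tokens (one per character) and rewrite to a fixpoint.
    delta = list(sigma)
    while True:
        new = rewrite(delta)
        if new == delta:
            return delta
        delta = new

print(tau("abcab"))  # ['abc', 'ab']
```

Because rewrite performs exactly one merge per call, chosen by (merge rank, position), the output tokenization is derived deterministically, as noted above.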
We assume, for simplicity, that the token vocabulary contains no such tokens.

Assumption 1. For all δ ∈ Δ, φ(δ) = δ.

Definition 7. We define the truncated canonicalization function φT as follows:

def φT(δ):
    σ1 ··· σN ← κ(δ)            # decode token string
    δ(0) ← τ̄(σ1) ··· τ̄(σN)     # initialize with base tokens
    for t = 0, ..., T−1:
        δ(t+1) ← rewrite(δ(t))
        if δ(t+1) = δ(t): return δ(t)
    return δ(T)

Remark 1. It is straightforward to verify that φ∞ = φ, as the code above is equivalent to that of τ(κ(·)) with the added option of truncating the number of iterations performed in τ's fixpoint loop. Setting T = ∞ disables the truncation; thus, φ∞ = φ.

Lemma 1 (Canonicalization process synchronization). Suppose that abc ∈ D, or that both ab, bc ∈ D. Suppose further that b is nonempty. Then, there exists a sequence {(at, bt, ct, dt, et)}∞t=0 such that the following equations hold for all t ≥ 0:

φt(abc) = φat(a) φbt(b) φct(c)
φdt(ab) = φat(a) φbt(b)
φet(bc) = φbt(b) φct(c) (21)

Proof. We prove Lemma 1 by induction on t, explicitly constructing the sequence {(at, bt, ct, dt, et)}∞t=0. Let Pt denote the claim that the equations above hold at step t, and let T denote the number of distinct iterations of φ(abc).

Base case (P0). It is straightforward to verify that (a0, b0, c0, d0, e0) = (0, 0, 0, 0, 0) satisfies P0, because each canonicalization call returns precisely the base tokenization of its argument. Thus, P0 holds.

Induction step. Suppose P0, ..., Pt hold. We seek to show that Pt+1 holds. Since Pt holds, there must exist (at, bt, ct, dt, et) satisfying the equations in Pt. To prove that Pt+1 holds, we will construct (at+1, bt+1, ct+1, dt+1, et+1) such that the equations in Pt+1 hold.
We first observe that at time t, φt(abc) = a(t)b(t)c(t) where a(t) is the tokenization at step t of the substring over the characters in a in the canonicalization process for abc b(t) is the tokenization at step t of the substring over the characters in b in the canonicalization process for abc c(t) is the tokenization at step t of the substring over the characters in c in the canonicalization process for abc The only way for this observation to be false is for a merge to straddle the boundary between a(t) and b(t) or b(t) and c(t), which is impossible because of our premise that abc D or both ab, bc D. More specifically, if there ever were a merge between a(t) and b(t) then ab and abc could not be canonical, as the final tokenization would not respect the boundary between a and b. Similarly, if there ever were a merge between b(t) and c(t), then by analogous reasoning, bc and abc could not be canonical. We now turn to case analysis: 1. Suppose t < T, then the (t+1)th step of φ(abc) applies to the highest-priority23 bigram of a(t)b(t)c(t). Consider the following subcases characterizing the possible positions for this merge: (a) The merge is in a(t). Then, (at+1, bt+1, ct+1, dt+1, et+1) = (at + 1, bt, ct, dt + 1, et) satisfies Pt+1 because It must also be the highest-priority merge in a(t), which is step at+1 of φ(a). It must also be the highest-priority merge in a(t)b(t), which is step dt+1 of φ(ab). The other canonicalization processes are unaffected, so they copy their position on this step. (b) The merge is in b(t). Then, (at+1, bt+1, ct+1, dt+1, et+1) = (at, bt + 1, ct, dt + 1, et + 1) satisfies Pt+1 because It must also be the highest-priority merge in b(t), which is step bt+1 of φ(b). It must also be the highest-priority merge in a(t)b(t), which is step dt+1 of φ(ab). It must also be the highest-priority merge in b(t)c(t), which is step et+1 of φ(bc). The other canonicalization processes are unaffected, so they copy their position on this step. 
(c) The merge is in c(t). Then, (at+1, bt+1, ct+1, dt+1, et+1) = (at, bt, ct + 1, dt, et + 1) satisfies Pt+1 because it must also be the highest-priority merge in c(t), which is step ct+1 of φ(c), and it must also be the highest-priority merge in b(t)c(t), which is step et+1 of φ(bc). The other canonicalization processes are unaffected, so they copy their position on this step.

2. Suppose t ≥ T. Then no merges exist, and (at+1, bt+1, ct+1, dt+1, et+1) = (at, bt, ct, dt, et) satisfies Pt+1 because φt(abc) = φT(abc) = a(t)b(t)c(t) = a(T)b(T)c(T), by definition of T, which implies that no further changes are possible: all processes must copy their position on this step, and continue to do so forever.

Therefore, Pt holds for all t ≥ 0 by the principle of induction.

23I.e., the bigram with the lexicographically lowest pair of (position in the merge list, position in the current token string). Note that there can be no ties under this ordering.

Example 1. This example seeks to illustrate Lemma 1. Consider three token strings a, b, and c from the GPT-2 tokenizer with κ(abc) = re-tokenized. Our worked example focuses on only the first equation in Pt (repeated below for convenience), as the other two equations behave very similarly:

φt(abc) = φat(a) φbt(b) φct(c)

Before moving forward with the example, we note that each token during the canonicalization process is rendered as a tree, because this is a convenient representation for showing which merges were performed to build it.24 (The derivation trees for the inputs a, b, and c appear as figures in the original.) Each row in the table below corresponds to a time step t, and the content of each column is the status of the canonicalization process at that time. We mark cells that are copied from the cell immediately above them with a dash (–).
[Table omitted: it shows the processes φt(abc), φat(a), φbt(b), and φct(c) as derivation-tree strings for t = 0, ..., 8, with copied cells marked by a dash.]

We see that at every step t, the equation φt(abc) = φat(a) φbt(b) φct(c) is satisfied for an appropriate choice of (at, bt, ct). Specifically, we count the number of non-copy actions in each row; in the example, for t = 8, we have (at, bt, ct) = (1, 4, 3). This example also allows us to see the case analysis used in the proof (for the location of the merge) in action.

24We give a more formal presentation of these trees in App. B.3.

Lemma 2 (Interlocking Canonicalization). For all a ∈ Δ*, b ∈ Δ+, c ∈ Δ*:

abc ∈ D ⟺ ab ∈ D ∧ bc ∈ D (22)

Proof. Fix a ∈ Δ*, b ∈ Δ+, and c ∈ Δ* arbitrarily. We consider each direction of the bi-implication separately.

Case (⟹). Suppose that abc ∈ D. Then, Lemma 1 gives us a sequence {(at, bt, ct, dt, et)}∞t=0 such that the following equations hold for all t ≥ 0:

φt(abc) = φat(a) φbt(b) φct(c)
φdt(ab) = φat(a) φbt(b)
φet(bc) = φbt(b) φct(c) (23a)

Now, consider (aT, bT, cT, dT, eT), which gives

φT(abc) = φaT(a) φbT(b) φcT(c)
φdT(ab) = φaT(a) φbT(b)
φeT(bc) = φbT(b) φcT(c) (23b)

which implies the following because, at time T, each of the canonicalization processes has reached its respective fixpoint:

φ(abc) = φ(a) φ(b) φ(c)
φ(ab) = φ(a) φ(b)
φ(bc) = φ(b) φ(c) (23c)

which implies the following because each call to φ must return the canonicalization of its respective argument:

φ(abc) = abc
φ(ab) = ab
φ(bc) = bc

Therefore, ab and bc are also canonical.

Case (⟸). Suppose ab, bc ∈ D. Then, following an argument identical to the (⟹) case, we see that abc ∈ D.

Thus, the bi-implication holds; thus, Lemma 2 holds.
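Lemma 2 can also be checked mechanically on a small example. The sketch below uses a hypothetical three-merge BPE (tokens are represented by their subwords, so κ is string concatenation, and δ ∈ D iff τ(κ(δ)) = δ, per Proposition 6) and verifies the bi-implication over all short triples of tokens:

```python
from itertools import product

MERGES = [("a", "b"), ("ab", "c"), ("b", "c")]  # hypothetical merge list

def tau(sigma):
    # BPE encoder of Def. 4: repeatedly apply the highest-priority
    # merge (leftmost on ties) until a fixpoint is reached.
    delta = list(sigma)
    while True:
        for left, right in MERGES:
            hits = [n for n in range(len(delta) - 1)
                    if (delta[n], delta[n + 1]) == (left, right)]
            if hits:
                n = hits[0]  # leftmost occurrence
                delta = delta[:n] + [left + right] + delta[n + 2:]
                break
        else:
            return delta

def canonical(delta):
    # delta is canonical iff phi(delta) = tau(kappa(delta)) = delta.
    return tau("".join(delta)) == delta

# All vocabulary tokens are canonical, so Assumption 1 holds here.
VOCAB = ["a", "b", "c", "ab", "bc", "abc"]
assert all(canonical([d]) for d in VOCAB)

# Check abc ∈ D  ⟺  ab ∈ D and bc ∈ D over all one-token a, b, c.
for a, b, c in product(VOCAB, repeat=3):
    assert canonical([a, b, c]) == (canonical([a, b]) and canonical([b, c]))
print("Lemma 2 verified on", len(VOCAB) ** 3, "triples")
```

This only exercises the lemma with single-token a, b, and c; the lemma itself covers arbitrary token strings.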
Theorem 1 establishes that a BPE token string is canonical if and only if all of its bigrams are canonical.25

Theorem 1. D = B.

Proof. We seek to prove, for all δ ∈ Δ*, that δ ∈ D ⟺ δ ∈ B. We do this using induction on prefixes of δ and Lemma 2. Choose δ = δ1 ··· δN arbitrarily. Let the induction hypothesis P(i) be that δ≤i ∈ D ⟺ δ≤i ∈ B.

Base cases. We first prove P(0) and, if |δ| ≥ 1, P(1).

Case P(0). To prove P(0), we seek to show that δ≤0 = ε ∈ D ⟺ δ≤0 = ε ∈ B. Both sides of the bi-implication are always true: ε ∈ D because φ(ε) = τ(κ(ε)) = τ(ε) = ε; and ε ∈ B because it contains no bigrams, so membership is vacuously true. Therefore, P(0) is true.

Case P(1). To prove that P(1) is true when |δ| ≥ 1, we seek to show that δ≤1 = δ1 ∈ D ⟺ δ≤1 = δ1 ∈ B. Again, both sides of the bi-implication are always true. By Assumption 1, we know that δ1 ∈ D is true. Since δ1 has no bigrams, δ1 ∈ B is vacuously true. Therefore, P(1) is true.

Induction step. Assume P(i), i.e., that δ≤i ∈ D ⟺ δ≤i ∈ B, for some i ≥ 1. We seek to prove P(i + 1), i.e., that δ≤i+1 ∈ D ⟺ δ≤i+1 ∈ B. We prove each direction of the bi-implication separately.

Forward. We seek to prove δ≤i+1 ∈ D ⟹ δ≤i+1 ∈ B. Assume δ≤i+1 = δ≤i−1 δi δi+1 ∈ D. By Lemma 2, we know that δ≤i−1 δi ∈ D and δi δi+1 ∈ D. Since δ≤i−1 δi ∈ D, by P(i), we know that δ≤i−1 δi ∈ B. Since δ≤i−1 δi ∈ B and δi δi+1 ∈ D, all of the bigrams in δ≤i−1 δi δi+1 = δ≤i+1 are canonical, so δ≤i+1 ∈ B.

Backward. We seek to prove δ≤i+1 ∈ B ⟹ δ≤i+1 ∈ D. Assume δ≤i+1 = δ≤i−1 δi δi+1 ∈ B. Then δ≤i−1 δi ∈ B, because it contains only a subset of the bigrams in δ≤i−1 δi δi+1. Then, by P(i), we know that δ≤i−1 δi ∈ D. From δ≤i−1 δi δi+1 ∈ B and Def. 6, we know that δi δi+1 ∈ D. Since δ≤i−1 δi ∈ D and δi δi+1 ∈ D, by Lemma 2, we know that δ≤i−1 δi δi+1 = δ≤i+1 ∈ D.

By induction, we know P(N) is true, so δ ∈ D ⟺ δ ∈ B, and the proof is complete.

25Note that a string of length at most one has no bigrams; thus, the property is vacuously satisfied, i.e., all length-zero and length-one token strings are canonical.
The latter is thanks to Assumption 1.

Lemma 3. For BPE,26 the set of canonical strings is prefix-closed, i.e., D̄ = D.

Proof. We prove D ⊆ D̄ and D̄ ⊆ D.

Part 1 (D ⊆ D̄). Obvious by definition of D̄.

Part 2 (D̄ ⊆ D).
1. Suppose δ ∈ D̄.
2. ⟹ by definition of D̄, there is a δ′ ∈ D such that δ is a prefix of δ′.
3. ⟹ by Theorem 1, the bigrams of δ′ are all canonical.
4. ⟹ since δ is a prefix of δ′, its bigrams are a subset of those of δ′, so they are also all canonical.
5. ⟹ since the bigrams of δ are all canonical, by Theorem 1, δ ∈ D.

Therefore, D̄ = D.

26More generally, Lemma 3 holds for any tokenization model with a bigram-based canonicality test (i.e., δ ∈ D ⟺ all bigrams of δ are canonical).

B.3. An Even Faster Bigram Test

BPE derivations. Our canonicality test involves inspecting the merge structure within a tokenization of a string. For that purpose, we define the derivation of a given string, γ(σ), as a string of trees t1 ··· tN, where each tn (for n = 1, ..., N) is a binary tree. For our purposes, a (binary) tree is a recursive data type composed of either a pair of trees or a base token. The pseudocode below shows how τ can be augmented to produce these derivation trees for a given string:

def γ(σ1 ··· σN):
    t ← τ̄(σ1) ··· τ̄(σN)
    while True:
        t′ ← step_derivation(t)
        if t′ = t: return t
        t ← t′

def step_derivation(t1 ··· tN):
    for (σ′, σ″) in M:
        for n in [1, ..., N−1]:
            if (κ(tn), κ(tn+1)) = (σ′, σ″):
                return t1 ··· tn−1 ⟨tn, tn+1⟩ tn+2 ··· tN
    return t1 ··· tN

Note that γ(·) is an adaptation of τ: merging a pair of adjacent trees builds the pair ⟨tn, tn+1⟩ rather than replacing the pair with a single token. We have also extended κ to return the subword denoted by a derivation. For each token δ ∈ Δ, we associate with it a unique canonical derivation γ(δ) def= γ(κ(δ)). However, we note that when |γ(δ)| > 1, the token δ does not canonically tokenize to itself; Assumption 1 ensures |γ(δ)| = 1 for all δ ∈ Δ.

Additional notation.
For notational convenience, we define the following:

For any ℓ ∈ Σ ∪ (Σ+ × Σ+), let M[ℓ] denote the following: if ℓ ∈ Σ, return its position in some fixed, arbitrary ordering of Σ; if ℓ ∈ Σ+ × Σ+, return the rank of the merge ℓ in the merge list plus |Σ| (as an offset), and ∞ if ℓ is not in M.

We extend M[t] to trees t as follows. For a leaf t ∈ Δ: M[t] def= M[κ(t)]. For a tree of the form t = ⟨s, s′⟩: M[t] def= M[(κ(s), κ(s′))].

We define the following ordering relations on merge rules and subtrees: ℓ < r ⟺ M[ℓ] < M[r]; we define ℓ > r, ℓ ≤ r, and ℓ ≥ r analogously.

Given a token δ ∈ Δ, we define the left spine S(γ(δ)) of γ(δ) to be the ordered set containing the root node and its left descendants27 (ordered from root to leaf); and we define the right spine R(γ(δ)) analogously (for right descendants). In a given tree, we define π as the function that returns the parent node of any of its subtrees.

Intuition. At a high level, we say that a token δ″ conflicts with a bigram δδ′ if and only if δ″'s merge applies with an earlier rank over an overlapping span of the same character string. More formally, we say that δ″ (nontrivially) overlaps with δ and δ′ if and only if

κ(δ) = σ1 ··· σn (24a)
κ(δ′) = σn+1 ··· σN (24b)
κ(δδ′) = σ1 ··· σn σn+1 ··· σN (24c)
κ(δ″) = σi ··· σj where 1 ≤ i ≤ n < j ≤ N (24d)

However, for δ″ to conflict with δδ′, it would need to include a merge over the span σi ··· σj that blocks at least one merge in δ or δ′ from occurring in φ(δδ′).

[Diagram omitted: the derivation trees γ(δ) and γ(δ′) drawn as triangles over the spans σ1 ··· σn and σn+1 ··· σN.]

27We say that a node p is a left descendant of a node q if p is the left child of q or a left descendant of the left child of q. We define right descendants in an analogous manner.
We have denoted the canonical derivation as a trapezoid, as it is generally the union of multiple triangles; there may be 1, 2, or more tokens in the canonicalized bigram φ(δδ′).28 In orange, we have drawn a third triangle for the conflicting token δ″. We note that γ(δ″) must intersect γ(δ) at a node s along its right spine R(γ(δ)), and it must intersect γ(δ′) at a node s′ along its left spine S(γ(δ′)). In the diagram, s and s′ represent the subtrees at the respective points of intersection with the hypothetical conflicting tree. The orange arc from s to s′ shows that any two nodes along the left and right edges could, in principle, form a conflicting tree. Crucially, however, the rank of the conflicting merge pair must precede the competing merge pairs present in the existing tokens δ and δ′. We make this more precise below.

Definition 8. A token δ″ conflicts with the bigram δδ′ if and only if γ(δ″) is a subtree of γ(φ(δδ′)) (which may be a string of trees) but not a subtree of either γ(δ) or γ(δ′).

We can find a conflicting token by exploiting the structure of the tokenization algorithm.

Definition 9. Let δ″ be a token such that γ(δ″) = ⟨s, s′⟩ for some s ∈ R(γ(δ)) and s′ ∈ S(γ(δ′)). We say that δ″ is a minimal conflict for the bigram δδ′ if and only if π(s) > ⟨s, s′⟩ ≤ π(s′).

The reason the δ″ identified by Def. 9 is minimal is that its definition identifies the earliest possible deviation, i.e., the earliest merge rule that could apply on the shared string σ1 ··· σN that is not among the existing merges in δ and δ′. Note that the asymmetry (< versus ≤) is precisely what breaks ties between equal-rank merges in favor of the leftmost match, as prescribed in the procedure for τ provided in Def. 4. Below is an algorithm for finding the minimal conflicting tree between δ and δ′, if one exists.
def find_conflict(δ, δ′):
    t ← γ(δ); L ← ∞                        # rank of π(t); ∞ at the root
    while True:
        t′ ← γ(δ′); R ← ∞                  # rank of π(t′); ∞ at the root
        while True:
            if L > M[⟨t, t′⟩] ≤ R: return ⟨t, t′⟩   # conflict
            if κ(t′) ∈ Σ: break            # t′ is a leaf
            R ← M[t′]; ⟨t′, _⟩ ← t′        # descend the left spine
        if κ(t) ∈ Σ: break                 # t is a leaf
        L ← M[t]; ⟨_, t⟩ ← t               # descend the right spine
    return None

The find_conflict algorithm finds the minimal conflict between a pair of tokens δ and δ′ (if one exists). We note that a useful consequence of Assumption 1 is that we do not need to check subtrees off of the spines for conflicts, making our procedure more efficient than computing the canonicalized string from scratch, e.g., by running τ on the decoded string κ(δδ′). Instead, our algorithm searches for a minimal conflicting tree: either it stops once it has found such a tree, or it has exhausted the limited options for one. Finding the minimal conflict (if it exists) is also faster than computing the entire canonical tokenization φ(δδ′), as it only requires finding the first difference rather than completely building the canonicalized bigram.29

Definition 10. We define Φ as the following relation on pairs of tokens:

Φ(δ, δ′) def= (find_conflict(δ, δ′) = None) (25)

28Please refer back to Fig. 2 for examples.

29We note that the find_conflict algorithm can be efficiently vectorized so that it computes the complete vector of valid next tokens very efficiently. We will provide a PyTorch implementation in our public code release (upon publication).

The proposition below establishes that Φ provides a correct membership test for canonical bigrams.

Proposition 5. Φ(δ, δ′) ⟺ δδ′ ∈ B.

Proof. Observe that (Φ(δ, δ′) ⟺ δδ′ ∈ B) is equivalent to (¬Φ(δ, δ′) ⟹ δδ′ ∉ B) ∧ (δδ′ ∉ B ⟹ ¬Φ(δ, δ′)). We will prove the proposition by proving each conjunct separately.

Part 1 (¬Φ(δ, δ′) ⟹ δδ′ ∉ B).
1. Suppose ¬Φ(δ, δ′). Then, there exists a conflicting pair ⟨s, s′⟩ = find_conflict(δ, δ′), i.e., ⟨s, s′⟩ satisfies π(s) > ⟨s, s′⟩ ≤ π(s′).
[Schematic omitted: the conflicting merge ⟨s, s′⟩ straddling the boundary between γ(δ) and γ(δ′).]

2. Thus, the conflicting merge ⟨s, s′⟩ would have been preferred by τ if it were run on the character string κ(δδ′): the conflicting merge blocks both of the merges below it, because π(s) = ⟨µℓ, s⟩ > ⟨s, s′⟩ and π(s′) = ⟨s′, µr⟩ ≥ ⟨s, s′⟩.30
3. The existence of this intermediate step means that it is impossible for δδ′ to be a canonical bigram; thus, δδ′ ∉ B.

Part 2 (δδ′ ∉ B ⟹ ¬Φ(δ, δ′)).
1. Suppose δδ′ ∉ B. Then, φ(δδ′) ≠ δδ′, by definition.
2. Then, there must exist a conflicting merge ⟨s, s′⟩ as a subtree in φ(δδ′) that blocks δδ′ from being built. Now, because both δ and δ′ are canonical in isolation (Assumption 1), the conflict ⟨s, s′⟩ must straddle the boundary between δ and δ′. This implies that s must be a subtree along the right edge of γ(δ) and s′ must be a subtree along the left edge of γ(δ′), as those are precisely the only ways that the straddling merge can occur under our circumstances. Lastly, for ⟨s, s′⟩ to block δ and δ′, it would need to satisfy π(s) > ⟨s, s′⟩ ≤ π(s′).
3. This characterization is equivalent to ¬Φ(δ, δ′).

30Note that ⟨s, s′⟩ < π(s) enforces the leftmost-merge preference in BPE's encoding procedure τ, i.e., in the event of a tie between overlapping merges of the same rank, the leftmost merge is taken.

B.4. Connections to Canonicality Algorithms

Definition 11. We can describe the set of canonical tokenizations with the following finite-state automaton:

states = Δ
alphabet = Δ
transitions = {δ →δ′ δ′ | δδ′ ∈ B}
initial = Δ
accepting = Δ

Note that we never construct the automaton.

Generality. Our method works under a broader set of assumptions than Berglund et al. (2024) and Lindmark (2024), as they require the merge list to be proper. This condition is, in fact, restrictive: the Llama tokenizer (used in our experiments) is not proper, whereas the GPT-2 tokenizer is.

Practicality. The automaton has V nodes and up to V² edges, where V = |Δ|.
In practice, the machine is very close to having V² edges (even after DFA minimization!) because the transitions do not rule out many tokens. This means that the explicit machine is enormous: for GPT-2, it has 50K states and nearly 2,500,000,000 edges, making it far too large to build. Storing the edges that are absent would be significantly more efficient. On the other hand, our version never builds the machine, so its memory complexity is not a problem. It is also possible to use a more sophisticated automaton representation that allows failure transitions; we refer the reader to Cognetta & Okazaki (2024) and Lindmark (2024) for further discussion of this technique.

C. Supporting Proofs and Lemmata

Proposition 6. Suppose that the tokenization model (Σ, Δ, τ, κ) is exact. Then,

∀δ ∈ Δ*: δ ∈ D ⟺ φ(δ) = δ (26)

Proof. Fix an arbitrary δ ∈ Δ*. We consider each direction separately.

Part 1 (δ ∈ D ⟹ φ(δ) = δ):
1. Suppose δ ∈ D.
2. ⟹ ∃σ ∈ Σ*: δ = τ(σ), by definition of D.
3. ⟹ κ(τ(σ)) = σ, by the exactness assumption.
4. ⟹ τ(κ(τ(σ))) = τ(σ), because τ is a function.
5. ⟹ τ(κ(δ)) = δ, substituting τ(σ) ↦ δ.
6. ⟹ φ(δ) = δ, by definition of φ.
7. Thus, this direction holds.

Part 2 (φ(δ) = δ ⟹ δ ∈ D):
1. Suppose φ(δ) = δ.
2. ⟹ δ = τ(κ(δ)).
3. ⟹ ∃σ ∈ Σ*: δ = τ(σ); specifically, σ = κ(δ).
4. ⟹ δ ∈ D, by definition of D.
5. Thus, this direction holds.

Since each direction has been proven, the proposition holds.

Corollary 1.

D = {φ(δ) | δ ∈ Δ*} (27)

Proposition 7. Let (τ, κ) be exact. Then, (τ, κ) is a bijection between Σ* and D.

Proof. The bijection follows directly from the following: exactness means ∀σ ∈ Σ*: κ(τ(σ)) = σ, and Proposition 6 ensures that ∀δ ∈ D: τ(κ(δ)) = δ. Thus, we have a bijection, and the proposition holds.

Proposition 8. For an exact tokenization model (Σ, Δ, τ, κ), the canonicalization operator is idempotent, i.e.,

φ(δ) = δ′ ⟹ φ(δ′) = δ′ for all δ, δ′ ∈ Δ* (28)

Proof. The proposition is equivalent to φ(φ(δ)) = φ(δ) for all δ ∈ Δ*. Fix δ, δ′ ∈ Δ* arbitrarily.
1. Suppose φ(δ) = δ′.
2. ⟹ τ(κ(δ)) = δ′, by definition of φ.
3. ⟹ τ(κ(τ(κ(δ)))) = τ(κ(δ′)), because τ and κ are functions.
4. ⟹ τ(κ(δ)) = τ(κ(δ′)), by exactness (the inner κ ∘ τ cancels).
5. ⟹ τ(κ(δ)) = τ(κ(φ(δ))), substituting δ′ ↦ φ(δ).
6. ⟹ φ(δ) = φ(φ(δ)), by definition of φ.

D. Proofs for Section 4 (Canonicality by Conditioning)

Proposition 1. Assuming that the true distribution over tokens p* is canonical, the globally canonicalized model g guarantees the following reduction in KL divergence:

KL(p* ∥ p) − KL(p* ∥ g) = −log Z ≥ 0 (7)

Proof.

KL(p* ∥ g) = Σδ p*(δ) log (p*(δ) / g(δ)) (29a)
= Σδ p*(δ) log p*(δ) − Σδ p*(δ) log (p(δ) 1{δ ∈ D} / Z) (29b)
= Σδ p*(δ) log p*(δ) − Σδ p*(δ) log p(δ) − Σδ p*(δ) log 1{δ ∈ D} + Σδ p*(δ) log Z (29c)
= Σδ p*(δ) log (p*(δ) / p(δ)) + log Z (29d)
= KL(p* ∥ p) + log Z (29e)

where (29c) = (29d) uses that p* is canonical, so Σδ p*(δ) log 1{δ ∈ D} = 0. Now, the proposition follows by basic algebra.

Proposition 2.

g(δ) ∝ ℓ(δ) · wℓ(δ), where wℓ(δ) def= (p(δ) / ℓ(δ)) · 1{δ ∈ D} (10b)

ℓ(δ) = ℓ(EOS | δ) ℓ(δt | δ