Published in Transactions on Machine Learning Research (12/2024)

Training LLMs over Neurally Compressed Text

Brian Lester (a), Jaehoon Lee (b), Alex Alemi (a), Jeffrey Pennington (a), Adam Roberts (a), Jascha Sohl-Dickstein (b), Noah Constant (a)
(a) Google DeepMind, (b) Anthropic. Work done while at Google DeepMind.
{brianlester, nconstant}@google.com
Reviewed on OpenReview: https://openreview.net/forum?id=pRvhMSV48t

Abstract

In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, often reducing latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.

1 Introduction

Today's large language models (LLMs) are almost exclusively trained over subword tokens. The tokenizers used to produce these tokens, often BPE (Gage, 1994; Sennrich et al., 2016) or Unigram (Kudo, 2018), as implemented by the SentencePiece library (Kudo & Richardson, 2018), are compressors that typically achieve ~4× compression over natural language text (Xue et al., 2022).[1] While these tokenizers hide the character-level makeup of each token from the LLM (Xue et al., 2022; Liu et al., 2023), this downside is widely seen as outweighed by the significant benefits of compression. Compared to raw byte-level models, an LLM trained over subword tokens sees ~4× more text per token, allowing it to model longer-distance dependencies, ingest more pretraining data, and predict more text at inference time, all without increasing compute.[2]

Given these advantages, a question arises: could we compress text further to achieve even greater gains? It is well known that autoregressive language models can be turned into lossless text compressors, and recent work has shown that LLMs can easily achieve 12× compression over English text (Delétang et al., 2024).[3] Can we simply train an LLM over this neurally compressed text?

[1] We refer here to token-level compression rate, i.e., the length reduction between a raw UTF-8 byte sequence and the corresponding sequence of subword tokens. If instead we measure the number of bits required to encode the two sequences, subword compression typically delivers ~2× or less compression, depending on vocabulary size, which typically ranges from 32k to 256k. See Section 3.4 for discussion.
[2] The increased cost of the input embedding and final softmax layers due to increased vocabulary size is negligible for all but the smallest models.

[3] Specifically, the authors show that Chinchilla 70B (Hoffmann et al., 2022) can compress 2048-byte subspans of enwik9 at a 12× bit-level compression rate.

Figure 1: An overview of our approach for training an LLM (M2) over neurally compressed text. First, M1 is trained as a standard byte-level language model: given a leftward context, M1 assigns a probability to each possible following byte. Next, corpus text is compressed into a bitstream using M1 as a compressor. Specifically, the probabilities that M1 assigns at each text position are fed into a compression algorithm like Arithmetic Coding that supports using dynamic symbol probabilities. Finally, this bitstream is chunked into tokens (e.g., 8-bit chunks), and M2 is trained as a language model over compressed text.

In this paper we explore various options for doing so, focusing primarily on the idea of using Arithmetic Coding (AC) (Witten et al., 1987), which is known to reach the near-optimal compression rate for a particular model that assigns probabilities to text continuations. Figure 1 presents our high-level approach. First, a small language model M1 is trained over raw byte sequences. Next, this frozen model is used to compress pretraining corpus text by applying a standard compression algorithm like AC. The resulting compressed bitstream is then chunked into tokens, which are used to train M2, a language model that directly reads and writes neural-compressed text.

Given a perfect probabilistic model of the raw byte sequence, the compression step would output a fully-compressed bitstream that would be indistinguishable from random noise, and hence unlearnable by M2. In reality, M1 can never be perfect (Zvonkin & Levin, 2007), so the M1-compressed output will still contain learnable patterns. We explore whether using compression powered by a relatively small M1 is able to remove the simple structure that M1 understands from the input (e.g., patterns of spelling, word frequency, and basic grammar) while retaining any higher-level structure that M1 fails to model (e.g., patterns requiring deeper reasoning and long range coherence). A larger M2 would then learn to model this higher-level structure, without needing to relearn the low-level structure removed by M1.[4] In theory, this process could be repeated by training an even-larger M3 model on text compressed by M2, and so on.

[4] Intuitively, training M2 could be seen as analogous to fitting the residuals of M1 (Friedman, 2001).

In practice, we find that text compressed via Arithmetic Coding is not readily learnable by a standard transformer-based LLM, with resulting models predicting tokens at chance. Interestingly, this result holds even when M1 is reduced to a context-free unigram model, suggesting that the challenge of modeling AC-compressed text stems from the difficulty of learning the AC compression and decompression process itself.
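To make concrete the kind of stateful procedure M2 would implicitly have to track, here is a minimal, illustrative sketch of the Arithmetic Coding loop. It uses exact rational arithmetic and a context-free stand-in for M1's distribution; practical coders, including any used in experiments like ours, instead work with fixed-precision integers and renormalization. Decoding mirrors this loop by repeatedly locating the sub-interval that contains the encoded number.

```python
from fractions import Fraction
import math

def ac_encode(symbols, cdf):
    """Arithmetic-code `symbols` (ints in [0, V)) given `cdf(prefix)`, which
    returns V+1 cumulative probabilities; `cdf` stands in for M1's
    context-dependent next-byte distribution."""
    low, high = Fraction(0), Fraction(1)
    for i, s in enumerate(symbols):
        c = cdf(symbols[:i])            # cumulative distribution given context
        width = high - low
        high = low + width * Fraction(c[s + 1])
        low = low + width * Fraction(c[s])
    # Emit the shortest bitstring whose dyadic interval fits inside [low, high).
    n = max(1, math.ceil(-math.log2(float(high - low))) + 1)
    code = math.ceil(low * 2**n)
    return [(code >> (n - 1 - j)) & 1 for j in range(n)]

# Toy usage with a fixed (context-free) 3-symbol distribution.
print(ac_encode([0, 2, 1, 0], lambda prefix: [0.0, 0.5, 0.75, 1.0]))
```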
We verify this hypothesis by showing that even the sub-tasks of AC-compressing and AC-decompressing text are not learned well beyond a few initial tokens.

To aid learnability, we propose compression via Equal-Info Windows, a simple technique that breaks text into contiguous windows and compresses them via Arithmetic Coding independently. Rather than splitting text into windows of equal text length, we track the number of bits output by the compressor, and close each window just before it exceeds a set information threshold (e.g., 32 bits of information). This has the advantage that when chunking the subsequent bitstream into M2 tokens, there is a stable mapping from N tokens to one window (e.g., four 8-bit tokens → one 32-bit window). At each window boundary, we reset both the AC algorithm and the M1 model context. This ensures that each window may be mapped back onto raw text without any additional information.

Through ablations on window size and M2 vocabulary size, we find that Equal-Info Windows make learning of AC-compressed text possible across a range of settings. However, we also observe that learning progresses gradually, starting with tokens at the left edge of each window; for longer windows, the model learns little about the tokens near the right edge. Our best-performing setting uses short 16-bit windows that each correspond to a single 16-bit M2 token. Despite resetting the compression algorithm every 16 bits, we still achieve ~5.3× token-level compression overall, which exceeds standard subword tokenizers.

Remarkably, our best M2 models outperform byte-level baselines on perplexity benchmarks (bits/byte) for a fixed computation budget (FLOPs/byte). This shows that learning over neurally compressed text can be effective. At the same time, our best M2 models underperform subword baselines. We suspect this is due at least in part to the relatively unstable mappings our neural tokenizers induce between words and tokens. By contrast, standard subword tokenizers induce essentially stable word-to-token mappings, which likely makes the token sequences they output well-suited for LLM training. We illustrate this contrast through qualitative examples. Whether a neural tokenizer can reach a high level of compression while maintaining high learnability for LLM training is an interesting question for future research.

Our main contributions are as follows:
(1) Outline advantages and challenges of training over neurally compressed text.
(2) Compare LLMs trained over different tokenizers along two axes: bits/byte and FLOPs/byte.
(3) Show that standard LLMs can't learn to model vanilla AC-compressed text.
(4) Show that GZip-compressed text is learnable by standard LLMs, but not competitive.
(5) Propose compression via Equal-Info Windows, and show that it enables learning over neurally compressed text.

2 Motivation and Background

2.1 Advantages of Training over Neurally Compressed Text

Training LLMs over compressed text is appealing for many reasons. We discuss three advantages in detail below.

Efficiency. The most straightforward advantage is efficiency. By compressing the same text into a shorter token sequence, the model can process more text for the same computational cost. In particular, a model trained over C× compressed text will see C× more text during training compared to a model trained over raw text, given an equal compute budget. Increasing the amount of data seen in pretraining is often an effective means of improving performance (Kaplan et al., 2020; Hoffmann et al., 2022).
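As a worked instance of this arithmetic, using illustrative rates (1 byte/token for raw bytes, the ~4× subword figure from above, and the ~5.3× of our best Equal-Info Windows setting):

```python
# Raw text seen under a fixed token budget, for illustrative compression
# rates. Exact rates vary by tokenizer and corpus.
budget_tokens = 1_000_000
for name, bytes_per_token in [("bytes", 1.0), ("subword", 4.0), ("EqualInfoAC", 5.3)]:
    mb = budget_tokens * bytes_per_token / 1e6
    print(f"{name:>12}: {mb:.1f} MB of raw text per {budget_tokens:,} tokens")
```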
Processing text more efficiently also confers benefits at inference time, reducing the serving cost for handling a request of a given prompt and continuation length. In addition to reducing the raw compute needed for inference, compression can also improve inference latency, since generating better-compressed output requires fewer sequential autoregressive steps.

Longer Context. A second advantage is that working with compressed text allows modeling longer contextual dependencies. In vanilla transformer-based models, computation for the self-attention layer scales quadratically with the sequence length, $O(n^2 d)$. This has limited the sequence lengths used by such models in practical settings to ~10k tokens.[5] If, via compression, each token represents (on average) C bytes of raw text, then the resulting LLM can model dependencies across C× longer distances compared to a raw text model operating over the same token sequence length. While the benefits of modeling longer context (beyond ~1,000 bytes) are modest when viewed merely as perplexity gains (Press et al., 2022), the ability to condition on long context is critical for many applications, such as retrieving content from a document, or answering a coding question given provided documentation.

[5] Exploring sub-quadratic attention mechanisms is an area of active research (Ainslie et al., 2020; Wang et al., 2020; Kitaev et al., 2020; Zaheer et al., 2020; Beltagy et al., 2020; Child et al., 2019, et alia). However, regardless of the cost of attention, compressing the input increases the effective context "for free".

Distribution of Compute. A third potential advantage of training over compressed text is that information will be spread more uniformly across the sequence. By the nature of compression, a text span that is relatively predictable (e.g., a boilerplate notice) will be more compressible than a span with high perplexity (e.g., a unique product serial number). When an LLM is trained over well-compressed text, each token will represent roughly an equal amount of information. Since the LLM allocates equal compute to each token, this amounts to allocating more compute for harder text spans. This adaptivity is similar in spirit to Adaptive Computation Time (ACT) (Graves, 2017), which learns to allocate additional compute at some sequence positions in an end-to-end manner, but with the advantage that in our case the computation remains dense: identical operations are applied at each position.[6]

[6] It should be noted that ACT learns to allocate more compute where it is useful, as opposed to merely where the predictions are hard. For example, ACT learns to not waste compute on inherently unpredictable text spans. We expect that as a heuristic, allocating more compute to higher-perplexity text spans is valuable, but leave this to future work to verify.

2.2 Challenges of Training over Compressed Text

Learnability. It is not at all obvious what types of compression are transparent enough to be learnable through a standard LLM training process. Strong compression can be seen as removing as much redundant or predictable information from a sequence as possible. Consequently, the bitstream output by a good compressor is inherently hard to distinguish from random noise. In this work, we explore the setting where M2 (the model trained over compressed text) has a larger capacity than M1 (the model used for compression). In principle, this setup should allow M2 to extract additional information from the signal even after M1 has compressed it. However, for strong enough M1 compression, the resulting bitstream may be too noisy to detect any signal.

As a prerequisite for M2 to effectively predict continuations of compressed text, we anticipate that it is necessary for M2 to have the ability to decompress bits → text and compress text → bits. These sub-tasks are challenging in their own right.
First, M2 needs to accurately simulate M1 in order to know the probabilities it assigns to the text, which determine the output of compression.[7] Training models to mimic other models can be difficult (Lester et al., 2022), and even in settings where models do learn to copy the behavior of another network (Hinton et al., 2015), this is often only when looking at which symbol was assigned the highest probability; the actual probabilities assigned often differ (Stanton et al., 2021). Second, M2 needs to learn the compression procedure itself. In our case, this means tracking the Arithmetic Coding algorithm, which requires maintaining high-precision numerical state across long contexts. We investigate these sub-tasks in detail in Section 5.2.

[7] For Arithmetic Coding, not only would M2 need to know the probabilities M1 assigns to the observed text, but it would also need to know the probabilities assigned to many unobserved symbols. This is because Arithmetic Coding operates over cumulative probabilities, i.e., the probability that the next symbol is "e" or any alphabetically preceding symbol.

A further learnability challenge is the high level of context sensitivity needed to interpret a bitstream of compressed text. When chunked into tokens, a particular bit subsequence (e.g., 10111001) can map onto the same token despite having no stable meaning across occurrences. We show examples in Section 6.1, where a token maps to many different underlying text forms, necessitating strong contextual understanding. While LLMs are robust to some level of polysemy, as highlighted by the success of Hash Embeddings (Tito Svenstrup et al., 2017), where multiple unrelated words share a single token representation, we suspect this has its limits.

Numerical Stability. An additional technical challenge is that compression methods can be sensitive to the precise model probabilities used. To achieve lossless compression in our setup, it is critical that the M1 probabilities match during compression and decompression. This can be hard to guarantee in practice, as there are many sources of numerical noise in LLM inference, especially when running on parallel hardware. An expanded discussion of numerical stability issues can be found in Section 3.7.

Multi-Model Inference. Finally, a specific challenge of training over neurally compressed text is that multiple models need to be stored and run side-by-side in order to perform inference. We assume that if M1 is relatively small, this additional overhead is not a significant drawback compared to a standard tokenizer, which is also a separate model that is needed to tokenize text input and detokenize LLM outputs. In evaluating our approach, we include M1 compute in our calculations of total inference cost (FLOPs/byte).
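A sketch of that accounting, under the common approximation of ~2 FLOPs per parameter per token (Kaplan et al., 2020); the helper itself is illustrative, with the 3m-parameter default for M1 matching the model size used later (Section 3.2):

```python
# Illustrative sketch of the FLOPs/byte accounting, using the common
# ~2 * params FLOPs-per-token approximation (Kaplan et al., 2020).
def inference_flops_per_byte(m2_params, bytes_per_token, m1_params=3e6):
    m1 = 2 * m1_params                    # M1 is run once per raw byte
    m2 = 2 * m2_params / bytes_per_token  # M2 is run once per compressed token
    return m1 + m2

# e.g., a 2b-parameter M2 over ~5.3x-compressed text, M1 included:
print(f"{inference_flops_per_byte(2e9, 5.3):.3g} FLOPs/byte")
```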
2.3 Compression

In this work, we focus on lossless compression, which aims to encode a sequence of input symbols, $x_{0:N} = \{x_0, x_1, \ldots, x_N\} \in X^{|V|}$, into a bitstream while minimizing the expected length of the bitstream. Compression methods are often factored into a modeling component and a coding component (Mahoney, 2013). The input sequence can be viewed as a sample from a true distribution $p$, $x_{0:N} \sim p$, with a standard autoregressive decomposition, $p(x_{0:N}) = \prod_{i=1}^{N} p(x_i \mid x_0, \ldots, x_{i-1})$. The modeling component aims to approximate $p$ with $\hat{p}$. While some compression algorithms assume static probabilities for each symbol, stronger algorithms are "adaptive", meaning that symbol probabilities may change based on context. In this work, we use context-aware transformer-based language models to represent $\hat{p}$.

The coding component of a compression algorithm converts the input sequence to a bitstream of length $\ell(x_{0:N})$. To maximize compression, we want a coding algorithm that minimizes the expected number of bits in the bitstream, $L := \mathbb{E}_{x_{0:N} \sim p}[\ell(x_{0:N})]$. This is done by assigning shorter bit sequences to common symbols and longer sequences to less common ones. The expected length is lower bounded by $L \geq H(p)$, where $H(p) := \mathbb{E}_{x \sim p}[-\log_2 p(x)]$ (Shannon, 1948). This means that, given a near-optimal coding algorithm, the achievable level of compression derives from how well the model $\hat{p}$ approximates $p$.

2.4 Arithmetic Coding

Arithmetic Coding (Rissanen, 1976; Pasco, 1977) uses a model $\hat{p}$ to compress a sequence $x_{0:N}$ to a bitstream, which is the binary expansion of a float $f \in [0, 1)$. The float $f$ is found by assigning successively smaller sub-intervals to each symbol $x_i \in x_{0:N}$, with the final interval enclosing $f$. An interval is made of an upper and lower bound, $I_i = [l_i, u_i)$, and its size is given by $u_i - l_i$. Starting with $I_0 = [0, 1)$, at each step of encoding, the interval for the symbol $x_i$ is created by partitioning the interval $I_{i-1}$ based on the cumulative distribution of $\hat{p}$ given the previous context, $\hat{p}_{\mathrm{cdf}}(x_i \mid x_{<i})$. [...]

3.1 Training Data

[...] Each document has an <EOS> token appended to it. We concatenate 128 documents together to generate a long sequence of text. Using UTF-8 byte-level tokenization, the average document length is 2,170 bytes, thus these long sequences have an average length of 277,760 bytes. Despite the document breaks, we consider these long sequences contiguous for the training of language models. These sequences are then split into individual examples, which are shuffled using the deterministic dataset functionality from SeqIO (Roberts et al., 2023).

3.2 Training M1

The model used for compression is a decoder-only Transformer model (Vaswani et al., 2017). It uses the 3m size seen in Table 4 and a context length of 1,024. We use a batch size of 128, an rsqrt decay learning rate schedule ($1/\sqrt{\mathrm{steps}}$) starting at 1.0 with 10,000 warmup steps, and a z-loss of 0.0001. The model is trained for 2,500,000 steps using the Adafactor (Shazeer & Stern, 2018) optimizer. The feed-forward layers use ReLU activations (Nair & Hinton, 2010; Fukushima, 1975), and we use distinct learnable relative attention embeddings (Shaw et al., 2018) at each layer. We use a deterministic SeqIO dataset and train using Jax (Bradbury et al., 2018), Flax (Heek et al., 2020), and T5X (Roberts et al., 2023). The final validation performance of the M1 model is 1.457 bits/byte, a standard measure of perplexity; see Section 3.8.
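For reference, a sketch of the bits/byte metric in the spirit of Section 3.8 (the exact evaluation details live there): per-token cross entropy, converted to bits and amortized over the raw bytes each token covers. The ~1.01 nats/token figure is simply back-derived from the reported 1.457 bits/byte for illustration.

```python
import math

# Sketch of the bits/byte metric: cross entropy per token, converted from
# nats to bits, divided by the average raw bytes each token represents.
def bits_per_byte(ce_nats_per_token, bytes_per_token):
    return ce_nats_per_token / math.log(2) / bytes_per_token

# A byte-level model (1 byte/token) at ~1.01 nats/token scores ~1.457:
print(f"{bits_per_byte(1.01, 1.0):.3f} bits/byte")
```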
M1 and M2 are both trained on the C4 training data, but the final validation data used to evaluate M2 is unseen during M1 training, so there is no information leakage. This is similar to how LLM tokenizers are often trained on the same dataset that the LLM is subsequently trained on.

3.3 Compression Methods

When compressing C4 training data, we use an example length of 10,240 bytes and apply one of the following compression techniques (see Appendix B for more methods we considered). This results in compressed examples that are, on average, much longer than our target sequence length of 512 M2 tokens. Thus, each example fills or nearly fills the model's context window with a compressed sequence made from contiguous raw bytes. We compress 51,200,000 examples using each method, allowing us to train each M2 model for 200,000 steps without repeating data.

Arithmetic Coding: In this setting, we use a decoder-only transformer language model to model $\hat{p}$; that is, when creating the interval $I_i(x_{0:N})$, the partitions for each possible character, $\hat{p}(x_i \mid x_{<i})$, [...]

Figure 7: Arithmetic compression and decompression cast as sequence-to-sequence tasks. To account for the strong performance that is possible by modeling only the output bytes in the decompression task, we also train a Byte LM on just the targets.

Table 5: Transformers struggle to learn Arithmetic Coding. In the sequence-to-sequence setting, a model that learns AC compression/decompression should have an accuracy of 100%. Our models perform much worse. When tasked with decompression in a sequence-to-sequence format, our transformer's improvement over pure language modeling of the targets was not statistically significant (p = 0.07). Thus, the model is not able to leverage the compressed input. Similarly, AC compression is only learned to 1.7% accuracy.

| Compression | Task | Accuracy | Cross Entropy |
| AC[v=256] | Decompress | 76.98 | 0.751 ± 0.005 |
| (Byte-Level LM) | | 76.86 | 0.755 ± 0.001 |
| EqualInfoAC[b=16, v=256] | Decompress | 95.30 | 0.141 ± 0.002 |
| (Byte-Level LM) | | 76.83 | 0.756 ± 0.001 |
| AC[v=256] | Compress | 1.7 | 2.489 |

5.2 Transformers Struggle to Learn Arithmetic Coding

Arithmetic Coding is a sequential algorithm that involves tracking multiple state variables as the input (byte) sequence is consumed. Each token in the output sequence represents multiple transformations of these variables, e.g., 8 transformations when using 8-bit token chunking. Theoretically, only 10 transformer layers are needed to have a computational path through the model layers that can process a sequence of 1,024 tokens as a chain, where each token conditions on the previous one. While most of our transformers have the capacity to model these sequences (only our 25m model has fewer layers), we see in practice that the Arithmetic Coding algorithm is still difficult to learn.

To directly diagnose the ability to track Arithmetic Coding, we format AC compression and decompression as sequence-to-sequence tasks, as shown in Fig. 7. The input provides the model with the true text, so we expect that a model that is able to learn Arithmetic Coding should achieve an accuracy of 100%. We compress sequences of 1,024 bytes using M1 and Arithmetic Coding.[13] We concatenate the bytes and AC output tokens to create the compression task. For the decompression task, we simply flip the order: AC output tokens first and then bytes. The target token IDs (bytes or tokens) are shifted by the input vocabulary size, ensuring that they have distinct values.

[13] We use shorter raw text sequences to keep the final sequence length of inputs + targets manageable.
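A hypothetical rendering of this example construction (the names and IDs here are illustrative, not the paper's actual data pipeline):

```python
# Sequence-to-sequence examples for the AC sub-tasks (Fig. 7). Loss,
# gradients, and metrics are computed on the target portion only.
INPUT_VOCAB = 256   # 8-bit AC output tokens; bytes also span 0..255

def decompression_example(ac_tokens, raw_bytes):
    # AC tokens first, then bytes; target byte IDs are shifted past the
    # input vocabulary so input and target IDs never collide.
    return list(ac_tokens) + [b + INPUT_VOCAB for b in raw_bytes]

def compression_example(ac_tokens, raw_bytes):
    # The compression task flips the order: bytes in, AC tokens as targets.
    return list(raw_bytes) + [t + INPUT_VOCAB for t in ac_tokens]
```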
We use a decoder-only transformer as our model with a causal attention mask, i.e., even during the input sequence, future tokens are hidden from the model. We train models with 113m parameters. Loss, gradients, and evaluation metrics are only computed on the target tokens.

In the decompression task, the target tokens are bytes. By ignoring the inputs and just modeling the outputs, the decompression model can achieve decent performance without actually leveraging the input data. To control for this, we also train a byte-level language model baseline on the same sequence-to-sequence data, excluding the input tokens. If the decompression model is actually learning to decompress Arithmetic Coding, we would expect stronger performance than the byte-level baseline. As we see in Table 5, the baseline model, which does not see the input tokens, has the same performance as the decompression model.[14] Clearly, the models trained for decompression are not actually learning to do decompression.

[14] The slight gain is statistically insignificant (p = 0.07).

In contrast, we see that models trained on EqualInfoAC[b=16, v=256] do appear to learn to perform decompression. An M2 model trained to decompress EqualInfoAC[b=16, v=256] compressed data quickly jumps to 90% accuracy within 9K updates, and continues to improve to 95% over training. The model is still improving as training ends, with no clear signs of overfitting, suggesting the expected 100% accuracy should be achievable with more data.[15] It clearly leverages the input, as a byte-level language model trained over the same target tokens only reaches an accuracy of 76%.

[15] This could also be explained by how the end of sequence is handled; see Appendix D.5.

The model trained for compression actually shows some signs of learning. Training a language model directly on the compressed output results in the model learning a uniform distribution over tokens, see Fig. 3. When the model is able to attend to the input text, we see that the performance in Table 5 is better than the uniform distribution (which would have a cross entropy loss of 5.545). While this method shows some hope for the learnability of Arithmetic Coding, the need to include the input sequence negates the main advantage of compression, i.e., applying the model to a shorter sequence. Additionally, the compressor's performance is far from the 100% it should be able to achieve. We also find training on these sequence-to-sequence datasets to be less stable than training on the language modeling datasets. In our experiments, large performance swings and divergence were relatively common.

5.3 Larger Vocabulary Helps Beyond Increasing the Compression Ratio

Our best results training over compressed text use EqualInfoAC with 16-bit windows and vocabulary size at either 65k (best) or 256 (second-best). One clear advantage of the v=65k model is that it has a 2× better token compression rate, so it sees twice as much raw text during training. To assess whether its performance gain is due entirely to this advantage, we train a 25m parameter M2 model over the same dataset, but reduce its sequence length from 512 to 256. This model trains on half as many tokens, but sees the same amount of underlying text as the v=256 model.[16] Table 6 shows that even in this setting, the model with the larger vocabulary is stronger.[17] In fact, most of the bits/byte gain (84% in absolute terms) is due to the structural change in tokenization, as opposed to the additional text seen.

[16] To compensate for the smaller number of tokens in a sample of 20 batches from the validation set when each example is 256 tokens, we compute our evaluation metrics over 40 batches.

[17] It may be possible to achieve further gains by increasing the token bit depth further. However, most deep learning frameworks do not support using unsigned data types for inputs, and the resulting large vocabulary size can cause a computational bottleneck in the final softmax layer.

Table 6: Most of the gain of increasing vocabulary from 256 to 65k remains even in the byte-matched setting, where the models train over the same number of raw bytes. Performance gains seen between settings are all statistically significant.

| Tokenization | Comparison | Bits/Byte |
| EqualInfoAC[b=16, v=256] | | 1.472 ± 0.004 |
| EqualInfoAC[b=16, v=65k] | byte matched | 1.287 ± 0.003 |
| EqualInfoAC[b=16, v=65k] | token matched | 1.251 ± 0.003 |
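Concretely, the vocabulary change is just a different chunking of the same bitstream (Section 3.4). A small sketch, using the example window bits from Figure 9:

```python
# Chunk a compressed bitstream into n-bit M2 tokens: 8-bit chunks give a
# 256-token vocabulary, 16-bit chunks give 65,536 and double the
# token-level compression rate.
def chunk_bits(bits, n):
    assert len(bits) % n == 0          # windows are padded to a multiple of n
    return [int("".join(map(str, bits[i:i + n])), 2)
            for i in range(0, len(bits), n)]

window = [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # one 16-bit window
print(chunk_bits(window, 8))    # two v=256 tokens:  [145, 248]
print(chunk_bits(window, 16))   # one  v=65k token:  [37368]
```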
One possible explanation for its strong performance is that the v=65k model uses exactly one token to represent each equal-info window. We'll see in the next section that in EqualInfoAC settings with multiple tokens per window, any non-initial tokens are highly context-dependent, and learning proceeds on a curriculum from the easy window-initial tokens to the harder window-final tokens.

6 Analysis

In this section we examine how neural compression based tokenizers differ from standard tokenizers, and conduct additional analysis on training dynamics and learnability of compressed data. This analysis leads us to several recommendations for future work developing new compression schemes that aim to be learnable by transformer models while delivering stronger compression than subword tokenizers.

6.1 AC-Based Tokenizations are Less Stable and Less Semantic than SentencePiece

We observe that text tokenized by our AC-based methods is not stable; that is, tokenizing similar input data often leads to vastly different token sequences. To quantify this, we measure the effect of adding different prefixes to a fixed test sentence. We sample 500 words from a list of 3,000 common English words[18] and prepend each to the example sentence in Table 7. We then tokenize these new sentences using SentencePiece, AC[v=256], and EqualInfoAC[b=16, v=256]. Finally, for each tokenizer, we calculate the Levenshtein edit distance (Levenshtein, 1965) between all pairs of tokenized sentences. Fig. 8 shows a histogram of these distances. We observe that SentencePiece-tokenized sentence pairs always have a low edit distance, as the effect of changing the prefix is limited to the prefix tokens. In contrast, the edit distance between AC-tokenized sentence pairs is consistently high.

[18] https://www.ef.com/ca/english-resources/english-vocabulary/top-3000-words/

Figure 8: Edit distance between tokenized sentence pairs that differ only by a single prefix word. With SentencePiece, changing the prefix has a minimal effect on tokenization, limited to the prefix itself. AC has the largest edit distances, as the prefix affects the probabilities M1 assigns to all subsequent text. EqualInfoAC falls in the middle, as many distinct prefixes lead to the same windowing of subsequent text, and each window is encoded in isolation.
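A sketch of this stability probe; `tokenize` is a stand-in for any of the three tokenizers, and the edit-distance implementation is a generic one rather than the paper's:

```python
import itertools

def levenshtein(a, b):
    # Standard dynamic-programming edit distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[-1] + 1,               # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def stability_histogram(prefixes, sentence, tokenize):
    # `tokenize` stands in for SentencePiece, AC[v=256], or
    # EqualInfoAC[b=16, v=256]; returns all pairwise edit distances.
    tokenized = [tokenize(p + " " + sentence) for p in prefixes]
    return [levenshtein(s, t) for s, t in itertools.combinations(tokenized, 2)]
```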
The presence of the prefix affects the probabilities M1 assigns throughout the entire sequence, resulting in widely varying token outputs. Edit distance with EqualInfoAC falls in between these two extremes. The degree to which the tokens after the prefix match depends on how the prefix aligns with equal-info window boundaries, which in turn affects the windowing of the rest of the sentence. Since there are a relatively small number of possible window alignments, many sentence pairs have highly overlapping token sequences. As AC tokenization is so much less stable than the other settings, we restrict further analysis to SentencePiece and EqualInfoAC.

While the performance of our EqualInfoAC[b=16, v=65k] model approaches that of the SentencePiece baseline, our edit distance analysis suggests that the two tokenization schemes differ in many regards. To better understand these differences in qualitative terms, we compare the tokenizations of a single sentence in Table 7.

Table 7: Comparing tokenization under SentencePiece vs. EqualInfoAC. SentencePiece gives a fairly stable text → token mapping. For instance, each occurrence of "elephants" maps to the same two-token sequence: [ elephant] [s]. By contrast, EqualInfoAC[b=16, v=65k] is less stable and less semantic. Each occurrence of "elephants" maps to different tokens, and most tokens fail to align with meaningful linguistic boundaries (e.g., word or morpheme).

Input Text: The three currently living species are: African savanna elephants, African forest elephants, and the Asian elephants.

SentencePiece Tokens: [The] [ three] [ currently] [ living] [ species] [ are] [:] [ African] [ ] [s] [a] [v] [anna] [ elephant] [s] [,] [ African] [ forest] [ elephant] [s] [,] [ and] [ the] [ Asian] [ elephant] [s] [.]

EqualInfoAC[b=16, v=65k] Tokens: [The th] [ree c] [urrently l] [iving ] [species] [ are] [: A] [frica] [n sav] [anna] [ ele] [pha] [nts, ] [Afr] [ican ] [forest ] [eleph] [ants, ] [and the ] [Asi] [an e] [lep] [hant] [s.]

First, we observe that SentencePiece produces a stable text → token mapping. For example, "elephants" appears three times in the sentence, and stably maps to the same two-token sequence in all cases: [ elephant] [s]. Similarly, both occurrences of "African" map to the same token: [ African]. In contrast, the EqualInfoAC tokenization is relatively unstable, with each occurrence of these words being segmented in a different way, and yielding different token sequences.

Second, we find that the SentencePiece tokenization is more "semantic", by which we mean that the segmentation it induces aligns better with meaningful linguistic units (words and morphemes). While there are some exceptions, e.g., "savanna" being tokenized as [s] [a] [v] [anna], the more common case is that whole words are parsed as single tokens (e.g., "currently"), or into meaningful morphemes (e.g., "elephant-s"). By comparison, EqualInfoAC tokenization appears to almost entirely disregard word and morpheme boundaries. As one example, we see "Asian elephants." tokenized as [Asi] [an e] [lep] [hant] [s.].

Despite these differences, there is an important similarity between SentencePiece and EqualInfoAC[b=16, v=65k]: they are both stable in the token → text direction. That is, a given token ID, e.g., token #500, will always map to the same output text.
This transparent decoding property likely makes it easier for a downstream model to learn over these tokens.[19] When we move to versions of EqualInfoAC that contain multiple tokens per window, such as EqualInfoAC[b=16, v=256], this transparency is destroyed for all non-initial tokens within a window. This is illustrated in Table 8. When the same token appears window-initially in different contexts, we see the window text has a stable prefix, e.g., token #151 always maps to the prefix "le-". However, when occurring as the second token within a two-token window, there are no apparent correspondences between window texts.[20] As the EqualInfoAC window length increases, the proportion of tokens that are stable decreases. This may explain the observed difficulty of learning over longer windows. The window text for all instances of these tokens can be seen in Appendix D.8.

[19] Padding to reach a specific window size can require extra computation to discern between padding and characters that compress to all zeros; however, we find in Appendix D.5 that it is not an issue for M2 models.

[20] A repeated text substring that happens to be aligned with a window multiple times is one of the few cases where the second token will represent the same text.

Table 8: Window-initial tokens have stable token → text mappings, while non-initial tokens have contextual meaning and are thus unstable. We tokenize 20 documents with EqualInfoAC[b=16, v=256] and show the full window text in a random sample of cases where a specific token appears at the first or second position within the window.

| Token | Window Position | Window Text |
| 151 | 1 | [lew ] / [lea] / [led] / [len] / [less] / [led] / [les] / [lew ] |
| 151 | 2 | [thoug] / [ust] / [ this] / [etti] / [npo] / [thoug] / [ un] / [imag] |
| 185 | 1 | [ord a] / [or k] / [ord] / [or f] / [or al] / [or a ] / [ore i] / [ora] |
| 185 | 2 | [ery] / [s may] / [cian] / [onte] / [h de] / [cri] / [opp] / [ides] |

Note that Table 8 examines window → text, as opposed to token → text, correspondences. This is because for multi-token windows, the mapping from tokens to text is not well defined. More specifically, each character maps to a particular subsequence of the compressed bitstream, but these may not align with token boundaries.[21]

[21] This can be a source of instability, even in window-initial tokens; see Appendix C.

Figure 9: An illustration of the mapping between characters (bottom), bits (middle) and tokens (top) in the EqualInfoAC[b=16, v=256] setting. Each equal-info window corresponds to 16 bits of AC output, which are chunked into two 8-bit M2 tokens from a vocabulary of 256. Colors indicate whether each character contributes to the first token, second token, or both tokens within a window. We note that window-initial characters are not well compressed, so the initial 8-bit token tends to only cover one or two characters.

Fig. 9 illustrates the mapping between characters, bits, and tokens. We find that many windows contain a character (shown in purple) whose bits are split across two 8-bit tokens. Fig. 9 also highlights that window-initial characters are not being well compressed, with the window-initial token often only covering one or two characters. This is due to our EqualInfoAC procedure fully resetting M1's context at every window boundary.
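A sketch of that windowing loop, building on the ac_encode sketch from Section 1. Re-encoding the whole window per symbol keeps the sketch short where a real implementation would encode incrementally, and zero-padding each window to its full bit budget is our reading of the padding discussed in Section 3.3 and Appendix D.5:

```python
def equal_info_compress(data, encode, window_bits=16):
    """Illustrative Equal-Info Window loop. `encode(symbols) -> bits` must
    arithmetic-code its argument from a *fresh* model context (e.g., the
    ac_encode sketch above), so emptying `window` resets both the AC state
    and M1's context. Assumes one symbol always fits within a window."""
    bitstream, window = [], []

    def close(window):
        bits = encode(window)
        assert len(bits) <= window_bits
        bitstream.extend(bits + [0] * (window_bits - len(bits)))  # zero-pad

    for symbol in data:
        # Close the window *just before* it would exceed its bit budget.
        if window and len(encode(window + [symbol])) > window_bits:
            close(window)
            window = []
        window.append(symbol)
    if window:
        close(window)
    return bitstream
```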
With no context, M1 cannot make confident predictions, leading to more bits being needed to represent the initial character. While this hurts the overall compression ratio, there is a benefit for learnability in that each window can be decoded in isolation. In Appendix A.2, we experiment with maintaining M1 context across two or more windows, and find that the improvement in compression ratio does not justify the degradation in bits/byte.

6.2 AC Decoding is Learned Step-by-Step

As Arithmetic Coding is a sequential (left-to-right) and contextual algorithm, the text represented by a given token will differ based on the previous token. As such, a model should perform better on a token if it has a strong understanding of the token before it. When using EqualInfoAC compression, each window represents an independent Arithmetic Coding document. As we move deeper into the window, more and more AC decompression must be done to understand the token.

To understand how a token's position within a window affects learning, we track the average accuracy at each position within the 8-token windows of a 403m parameter EqualInfoAC[b=64, v=256] model[22] during training. Fig. 10 shows both raw accuracy (left) as well as the increase over trivial accuracy (right), which we define as the maximum accuracy achieved in the first 2,000 steps of training. By that point in training, the model predictions no longer change drastically between updates, and trivial patterns, such as how end-of-window padding generally reduces the number of possible tokens at the ends of windows, have already been learned. Looking at accuracy increase highlights the sequential learning trend by discounting any part of accuracy that is text-independent. In particular, we note that window-final tokens have a non-uniform distribution due to the use of window-final padding bits (see our EqualInfoAC formulation in Section 3.3), which can be learned without any understanding of the text.

[22] The absolute accuracy of the EqualInfoAC[b=64, v=256] model is relatively poor, but its relatively long window provides the clearest illustration of these positional trends. We observe similar trends for EqualInfoAC[b=16, v=256], which has smaller windows but much stronger performance.

Figure 10: Earlier tokens within the 8-token window of an EqualInfoAC[b=64, v=256] model are learned earlier in training. As training progresses, the model unlocks the ability to model tokens deeper and deeper into the window. Panel (a) shows accuracy per token position; panel (b) shows the increase over trivial accuracy per token position, defined as the maximum accuracy achieved in the first 2,000 steps of training. (Note, window-final padding makes trivial accuracy higher for later positions.) For tokens #1-3, later tokens reach higher accuracy (3 > 2 > 1), likely due to the benefit of local context. For tokens #4-8, accuracy deteriorates, indicating that the model has trouble tracking the AC algorithm for more than ~32 bits.

We observe two interesting trends. First, there is a clear ordering as to when the model starts to make meaningful (non-trivial) progress on a given position. The initial token (#1) is learned first, followed fairly quickly by #2 and then #3. Later tokens are only unlocked after 10,000 training steps, suggesting that the ability to model these tokens builds on a foundation of understanding the preceding tokens within the window. The second trend concerns the accuracy reached at each position.
Here, we observe an increase in accuracy from #1 < #2 < #3, followed by a decrease from #3 < #4 < #5 and so on.[23] We interpret the increase across the first three positions as due to the benefit of extra leftward context.[24] This is akin to the initial byte in a word being harder to predict than the following bytes. The decreasing performance at tokens #4 and beyond suggests the model is unable to track AC decompression indefinitely. While the model clearly learns to decompress longer sequences as training progresses, reliably decoding past 32 bits of AC output appears to be a challenge.

[23] The final token, #8, also fits this trend when looking at the increase over trivial accuracy. The raw accuracy is higher than the previous tokens, #4-7, due to the skewed distribution induced by window-final padding.

[24] While this suggests that 24-bit windows could be effective, in Appendix A.1 we find they are not.

6.3 Learnable Distributions are Less Uniform

A well-known result in the compression literature is that there can be no recursive compression (Mahoney, 2013). The compression algorithm removes information captured by its model, resulting in a uniform output that appears random to the original model. However, our setting is not recursive compression. Instead, a separate and larger model is trained on the compressed output, which should be able to capture new patterns in the bitstream. Despite this, the output of compression using M1 appears very uniform, as evidenced by the minimal gains from modeling the unigram token distribution in Table 3. Therefore, it seems reasonable that this uniformity could make it hard for M2 to learn (as all patterns must be contextual).

We investigate this by plotting the KL divergence (Kullback & Leibler, 1951) between the observed empirical distribution and a uniform distribution for different segmentations of the bitstream. If the underlying distribution of bits were truly random and independent, then the distribution of unigrams for some bitstream segmentation should remain uniform, as $p(b_i, \ldots, b_{i+n}) = \prod_{j=i}^{i+n} p(b_j)$, and therefore the KL divergence should remain close to zero. On the other hand, if the distribution diverges from uniform, there is contextual information to be learned when training an LLM to model $p(b_{i+n} \mid b_i, \ldots, b_{i+n-1})$.

Figure 11: As the bitstream is grouped into larger units, the empirical distribution moves away from uniform. We plot the KL divergence of observed n-gram distributions from the uniform distribution, across various n-gram sizes, for RNG, Arithmetic Coding[v=256], Static AC[v=256], GZip[v=256], EqualInfoAC[b=16], and EqualInfoAC[b=64]. Panel (a): bit n-grams, counting all overlapping occurrences; panel (b): n-bit tokens following our M2 tokenization. While AC-compressed data would be difficult to distinguish from random data, we find there are still patterns to capture when using other compression schemes, particularly for GZip and shorter EqualInfoAC windows. Compared to the left plot, we find that the tokenized bitstream (see Section 3.4) has even more information for M2 to capture.
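The measurement itself is compact. A sketch for overlapping bit n-grams: against a uniform reference over $2^n$ outcomes, the divergence reduces to $n - \hat{H}$ bits.

```python
from collections import Counter
import math

def kl_from_uniform(bits, n):
    """KL(observed n-gram distribution || uniform) in bits, over overlapping
    bit n-grams. Equals n - H(observed). The MLE entropy estimate used here
    is negatively biased (Paninski, 2003); the Miller-Madow correction the
    paper applies is omitted for brevity."""
    counts = Counter(tuple(bits[i:i + n]) for i in range(len(bits) - n + 1))
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return n - entropy   # uniform over 2**n outcomes has entropy of n bits
```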
We segment the bitstream either into bit n-grams, where successive n-grams are allowed to overlap, or into n-bit tokens, following our M2 tokenization procedure (see Section 3.4). We only plot tokenization into bit depths that are factors of 16; otherwise, tokens would cross window boundaries in the EqualInfoAC[b=16] setting. As a baseline, we used the cryptographic secrets package in Python to generate bitstreams that should be truly random and independent. As such, the KL divergence should remain at 0 when segmented in the same way as the compressed data. The reason this does not hold in Fig. 11 is that the maximum likelihood estimate of entropy, $\hat{H} = -\sum_{x \in \hat{X}} \hat{p}(x) \log_2 \hat{p}(x)$, is negatively biased (Paninski, 2003). In Fig. 17 we see that when using a Miller-Madow estimator (Miller, 1955) to correct for this bias, the expected KL of 0 is well within sampling noise bounds. To account for noise in the entropy estimation, we plot 90th percentile intervals of the KL divergence between the observed entropy from 100 disjoint samples of the data and the uniform distribution.[25]

[25] As the number of bits in a segmentation grows, the vocabulary size increases exponentially, requiring many more samples. Thus we expect noise in the entropy estimate to grow with n. This holds, but it is obfuscated by the log scaling in Fig. 11. In fact, the magnitude of the noise for settings such as GZip and EqualInfoAC is larger than for AC or RNG. This noise behavior is seen in Fig. 16. See Appendix D.6 for more information on entropy estimation and bias correction.

The AC and RNG lines in Fig. 11 are very similar, and their sampling noise intervals have large overlaps. This suggests that the data generated by AC compression with M1 is difficult to distinguish from random data.[26] This is a possible explanation for why M2 models trained on AC data only learn to output a uniform distribution, as seen in Fig. 3.

[26] For n > 2, the AC entropy is statistically significantly less than the RNG entropy; however, differences in the mean entropy only start to appear after ~8 decimal places.

In Fig. 11, we see that GZip is the least uniform, which is expected as it has the worst compression rate among these settings. However, the segmentation into tokens does not result in much extra information. This is again suggestive that the differences between the coding components of GZip and Arithmetic Coding are important for learnability. It is also a possible explanation of why GZip is the one setting where using 16-bit tokens does not improve performance.

Similarly, Fig. 11 shows that EqualInfoAC[b=16] has the most information among the Arithmetic Coding approaches. Given that this is the most learnable setting, it suggests that non-uniformity of the bitstream may be important for learning. We also see a large increase when moving to 16-bit tokens, providing a further possible explanation for why a larger vocabulary is helpful (see Section 5.3). Finally, we note that Static AC has less information than EqualInfoAC[b=16], suggesting that weakening the coding component of Arithmetic Coding is a more effective way to retain information and increase learnability for M2.

7 Conclusion

We have shown there is promise in the idea of training LLMs over neurally compressed text. In the best case, this will allow training over text that is better compressed than standard subword token sequences, while maintaining learnability. This is an appealing prospect, as models that read and write more text per token are more efficient to train and serve, and can model longer dependencies.
While the very simplest approach does not work (training directly over a tokenized AC-encoded bitstream), we showed that a relatively simple modification, compression via Equal-Info Windows, already brings us within striking distance of popular tokenizers. When measured in terms of perplexity achievable at a fixed inference cost (FLOPs/byte), we find that our method outperforms raw byte-level models, and comes increasingly close to the performance of SentencePiece tokenization as scale increases to 2 billion parameters.

While bespoke compression methods have been developed around different modalities (e.g., text, audio, images, video) and different applications (e.g., delta-of-delta for regular repeating timestamps (Pelkonen et al., 2015)), to our knowledge, no efficient compression methods have been designed specifically for use as LLM tokenizers. We are optimistic that future work will create such methods. Compared to today's subword tokenizers, we expect these methods (i) will deliver higher compression rates, (ii) will come closer to equal information per token, thus allocating compute more effectively, and (iii) will give models a more direct view of the underlying raw text, thus helping on spelling and pronunciation tasks. As a tradeoff, we expect these neural tokenizers will be somewhat less stable in their text → token mapping, but perhaps not so unstable as our approach here. In particular, we think it is worth exploring methods under which a given word typically maps to a relatively small number (tens, not thousands) of relatable token sequences.

One direction we left unexplored is the idea of passing information between the compressing model (M1) and the LLM trained over compressed text (M2). Some additional signal of M1's internal state or output may be helpful for M2 to accurately simulate M1, which is a prerequisite to flawlessly encoding and decoding M1-compressed text. For hill-climbing in this space, we found it useful to iterate on the sequence-to-sequence sub-tasks of compression and decompression, which should, in theory, be learnable with high accuracy. Specifically, if future work can devise a strong (~10×) compressor that a transformer can be trained to accurately encode and decode, we expect that this will be an ideal candidate for tokenizing text for LLMs.

Acknowledgments

We would like to thank Ben Adlam, Grégoire Delétang, Rosanne Liu, and Colin Raffel for detailed comments on an earlier draft. We're also grateful to Peter Liu, both for helpful discussion, as well as for building some of the infrastructure that made our experiments easier to run. Finally, we thank Doug Eck, Noah Fiedel, and the PAGI team for ongoing guidance and feedback.

Bibliography

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding Long and Structured Inputs in Transformers. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 268-284, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.19. URL https://aclanthology.org/2020.emnlp-main.19.
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-Level Language Modeling with Deeper Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3159-3166, Jul. 2019. doi: 10.1609/aaai.v33i01.33013159. URL https://ojs.aaai.org/index.php/AAAI/article/view/4182.

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, and Nicolas Flores-Herr. Tokenizer choice for LLM training: Negligible or crucial? In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp. 3907-3924, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.247. URL https://aclanthology.org/2024.findings-naacl.247.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NeurIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. URL https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf.

Johannes Ballé, Sung Jin Hwang, and Eirikur Agustsson. TensorFlow Compression: Learned Data Compression, 2024. URL http://github.com/tensorflow/compression.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer, December 2020. URL http://arxiv.org/abs/2004.05150.

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. AudioLM: A Language Modeling Approach to Audio Generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2523-2533, 2023. doi: 10.1109/TASLP.2023.3288409. URL https://arxiv.org/abs/2209.03143.

Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4617-4624, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.414. URL https://aclanthology.org/2020.findings-emnlp.414.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, June 1993. ISSN 0891-2017. URL https://aclanthology.org/J93-2003/.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers, April 2019. URL http://arxiv.org/abs/1904.10509.

Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, and Noah Constant. Bridging the Gap for Tokenizer-Free Language Models. CoRR, abs/1908.10322, 2019. URL http://arxiv.org/abs/1908.10322.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical Multiscale Recurrent Neural Networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=S1di0sfgl.

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 244-250, 2021. doi: 10.1109/ASRU51503.2021.9688253. URL https://arxiv.org/abs/2108.06209.

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. Transactions of the Association for Computational Linguistics, 10:73-91, 2022. doi: 10.1162/tacl_a_00448. URL https://aclanthology.org/2022.tacl-1.5.

Gautier Dagan, Gabriel Synnaeve, and Baptiste Rozière. Getting the Most Out of Your Tokenizer for Pre-training and Domain Adaptation. In Forty-first International Conference on Machine Learning, ICML, 2024. URL https://openreview.net/forum?id=ZFYBnLljtT.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978-2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285.

Simon DeDeo, Robert X. D. Hawkins, Sara Klingenstein, and Tim Hitchcock. Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems. Entropy: An International and Interdisciplinary Journal of Entropy and Information Studies, 15(6):2246-2276, 2013. ISSN 1099-4300. doi: 10.3390/e15062246. URL https://www.mdpi.com/1099-4300/15/6/2246.

Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. Language Modeling Is Compression. In ICLR, 2024. URL https://arxiv.org/abs/2309.10668.

P. Deutsch. GZIP file format specification. RFC 1952, May 1996. URL https://www.rfc-editor.org/rfc/rfc1952.txt.

P. Deutsch and J-L. Gailly. ZLIB Compressed Data Format Specification. RFC 1950, May 1996. URL https://www.rfc-editor.org/rfc/rfc1950.txt.

Jarek Duda. Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding, January 2014. URL http://arxiv.org/abs/1311.2540.

B. Efron. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1):1-26, 1979. ISSN 00905364. URL http://www.jstor.org/stable/2958830.

Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5):1189-1232, 2001. doi: 10.1214/aos/1013203451. URL https://doi.org/10.1214/aos/1013203451.

Kunihiko Fukushima. Cognitron: A Self-Organizing Multilayered Neural Network. Biological Cybernetics, 20:121-136, 1975. doi: 10.1007/BF00342633. URL https://link.springer.com/article/10.1007/BF00342633.

Philip Gage. A New Algorithm for Data Compression. C Users Journal, 12(2):23-38, 1994. URL http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, December 2020. URL http://arxiv.org/abs/2101.00027.

Nathan Godey, Roman Castagné, Éric de la Clergerie, and Benoît Sagot. MANTa: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 2859-2870, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.207. URL https://aclanthology.org/2022.findings-emnlp.207.

Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, and Reut Tsarfaty. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance, March 2024. URL http://arxiv.org/abs/2403.06265.

Alex Graves. Adaptive Computation Time for Recurrent Neural Networks, February 2017. URL http://arxiv.org/abs/1603.08983.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A Neural Network Library and Ecosystem for JAX, 2020. URL http://github.com/google/flax.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, March 2015. URL http://arxiv.org/abs/1503.02531.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laurent Sifre. Training Compute-Optimal Large Language Models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=iBBcRUlOAPR.

Paul G. Howard and Jeffrey Scott Vitter. Analysis of Arithmetic Coding for Data Compression. Information Processing & Management, 28(6):749-763, 1992. ISSN 0306-4573. doi: 10.1016/0306-4573(92)90066-9. URL https://www.sciencedirect.com/science/article/pii/0306457392900669.

David A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9):1098-1101, 1952. doi: 10.1109/JRPROC.1952.273898. URL https://ieeexplore.ieee.org/document/4051119.

J. D. Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3):90-95, 2007. doi: 10.1109/MCSE.2007.55. URL https://ieeexplore.ieee.org/document/4160265.

Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, and Jimmy Lin. Less is More: Parameter-Free Text Classification with Gzip, December 2022. URL http://arxiv.org/abs/2212.09410.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models, January 2020. URL http://arxiv.org/abs/2001.08361.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The Efficient Transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB.

Taku Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66-75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1007. URL https://aclanthology.org/P18-1007.

Taku Kudo and John Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66-71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.

S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951. doi: 10.1214/aoms/1177729694. URL https://doi.org/10.1214/aoms/1177729694.

M. Oğuzhan Külekci. Compressed Context Modeling for Text Compression. In 2011 Data Compression Conference, pp. 373-382, 2011. doi: 10.1109/DCC.2011.44. URL https://ieeexplore.ieee.org/document/5749495.

Brian Lester, Joshua Yurtsever, Siamak Shakeri, and Noah Constant. Reducing Retraining by Recycling Parameter-Efficient Prompts. arXiv preprint arXiv:2208.05577, August 2022. URL https://arxiv.org/abs/2208.05577.

Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707-710, 1965. URL https://api.semanticscholar.org/CorpusID:60827152.

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 19274-19286. PMLR, 23-29 Jul 2023. URL https://proceedings.mlr.press/v202/leviathan23a.html.

Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, Rj Mical, Mohammad Norouzi, and Noah Constant. Character-Aware Models Improve Visual Text Rendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16270-16297, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.900. URL https://aclanthology.org/2023.acl-long.900.

Matt Mahoney. Data Compression Explained. 2013. URL https://mattmahoney.net/dc/dce.html.

Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, and Michael Gastpar. Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains. arXiv, abs/2402.04161, 2024. URL https://arxiv.org/abs/2402.04161.

George Miller. Note on the Bias of Information Estimates. In Information Theory in Psychology: Problems and Methods, pp. 95-100. Free Press, 1955. URL https://www.scienceopen.com/document?vid=357d299f-62fa-4bda-8dd2-e4d5b5abde5d.

Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML '10, pp. 807-814, Madison, WI, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf.

Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, 15(6):1191-1253, June 2003. ISSN 0899-7667. doi: 10.1162/089976603321780272. URL https://doi.org/10.1162/089976603321780272.

R. Pasco. Source Coding Algorithms for Fast Data Compression (Ph.D. Thesis Abstr.). IEEE Transactions on Information Theory, 23(4):548-548, 1977. doi: 10.1109/TIT.1977.1055739. URL https://ieeexplore.ieee.org/document/1055739.
Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. Gorilla: a Fast, Scalable, In-Memory Time Series Database. Proc. VLDB Endow., 8(12): 1816 1827, aug 2015. ISSN 2150-8097. doi: 10.14778/2824032.2824078. URL https://doi.org/10.14778/ 2824032.2824078. Published in Transactions on Machine Learning Research (12/2024) Ofir Press, Noah Smith, and Mike Lewis. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8s QPp GCv0. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research (JMLR 2020), 21(140):1 67, 2020. URL https://jmlr.org/ papers/v21/20-074.html. Nived Rajaraman, Jiantao Jiao, and Kannan Ramchandran. Toward a Theory of Tokenization in LLMs. Ar Xiv, abs/2404.08335, 2024. URL https://arxiv.org/abs/2404.08335. J. J. Rissanen. Generalized Kraft Inequality and Arithmetic Coding. IBM Journal of Research and Development, 20(3):198 203, 1976. doi: 10.1147/rd.203.0198. URL https://ieeexplore.ieee.org/ document/5391119. Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Kehang Han, Michelle Casbon, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling Up Models and Data with t5x and seqio. Journal of Machine Learning Research, 24(377):1 8, 2023. URL http://jmlr.org/papers/v24/23-0795.html. Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are Emergent Abilities of Large Language Models a Mirage? In Thirty-Seventh Conference on Neural Information Processing Systems, 2023. URL https: //arxiv.org/abs/2304.15004. Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner. Tokenization is more than compression, 2024. URL https://arxiv.org/abs/2402.18376. Mike Schuster and Kaisuke Nakajima. Japanese and Korean Voice Search. In International Conference on Acoustics, Speech and Signal Processing, pp. 5149 5152, 2012. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715 1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162. Claude Elwood Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379 423, 1948. URL http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative Position Representations. 
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464 468, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2074. URL https:// aclanthology.org/N18-2074. Noam Shazeer and Mitchell Stern. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4596 4604. PMLR, July 2018. URL https://proceedings.mlr.press/ v80/shazeer18a.html. Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alex Alemi, and Andrew Gordon Wilson. Does Knowledge Distillation Really Work? 2021. URL https://arxiv.org/abs/1911.09189. Published in Transactions on Machine Learning Research (12/2024) Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. Charformer: Fast Character Transformers via Gradientbased Subword Tokenization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Jt BRnrl OEFN. Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. Hash Embeddings for Efficient Word Representations. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ f0f6ba4b5e0000340312d33c212c3ae8-Paper.pdf. Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, and Srinivas Shakkottai. LLMZip: Lossless Text Compression using Large Language Models, June 2023. URL http://arxiv.org/abs/2306.04050. Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Neur IPS 17, pp. 6309 6318, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf. Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. Create Space, Scotts Valley, CA, 2009. ISBN 1441412697. Dusan Varis and Ondřej Bojar. Sequence Length is a Domain: Length-based Overfitting in Transformer Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8246 8257, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.650. URL https://aclanthology.org/2021.emnlp-main.650. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, pp. 5998 6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/ 7181-attention-is-all-you-need.pdf. Luke Vilnis, Yury Zemlyanskiy, Patrick Murray, Alexandre Tachard Passos, and Sumit Sanghai. Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 35120 35136. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/v202/vilnis23a.html. Pauli Virtanen, Ralf Gommers, Travis E. 
Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/s41592-019-0686-2. URL https://scipy.org/. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity, 2020. Michael L. Waskom. Seaborn: Statistical Data Visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021. Bernard Lewis Welch. The Generalization of Student s Problem when Several Different Population Variances are Involved. Biometrika, 34(1/2):28 35, 1947. ISSN 00063444. URL http://www.jstor.org/stable/ 2332510. Published in Transactions on Machine Learning Research (12/2024) Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic Coding for Data Compression. Communications of The Acm, 30(6):520 540, June 1987. ISSN 0001-0782. doi: 10.1145/214762.214771. URL https://doi.org/10.1145/214762.214771. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, JeffKlingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, CliffYoung, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, MacduffHughes, and Jeffrey Dean. Google s neural machine translation system: Bridging the gap between human and machine translation, 2016. URL https://arxiv.org/abs/1609.08144. Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. By T5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291 306, 2022. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17. Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are Transformers Universal Approximators of Sequence-to-Sequence Functions? In International Conference on Learning Representations, ICLR, 2020. URL https://openreview.net/forum?id=Byx RM0Ntvr. Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems, 33, 2020. G. K. Zipf. The Psycho-Biology of Language. Houghton Mifflin, 1935. J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, 23(3):337 343, 1977. doi: 10.1109/TIT.1977.1055714. URL https://courses.cs. duke.edu/spring03/cps296.5/papers/ziv_lempel_1977_universal_algorithm.pdf. A Zvonkin and L Levin. The Complexity of Finite Objects and the Development of the Concepts of Information and Randomness by Means of the Theory of Algorithms. Russian Mathematical Surveys, 25:83, 10 2007. 
A Additional Experiments

A.1 Equal-Info Windows with 24 bits

In Fig. 10, we saw that when trained on 64-bit Equal-Info windows, our model achieves the highest accuracy on the third 8-bit token. All else being equal, we expect to see higher accuracy on later tokens, since the model should benefit from additional leftward context. Thus, the stark accuracy decrease from token #3 (>8%) to token #4 (<4%) likely indicates a difficulty in tracking the AC algorithm beyond 24 bits. In this section, we explore training over 24-bit windows.

First, we train a 403m M2 model over EqualInfoAC[b=24, v=256]-compressed data, following the procedure from Section 3. This model achieves 1.394 bits/byte, which fits cleanly into the trend seen in Fig. 5: while it outperforms EqualInfoAC[b=32, v=256] (which includes the problematic token #4), it still underperforms EqualInfoAC[b=16, v=256]. We also train a model using EqualInfoAC[b=24, v=65k], and find it achieves 1.249 bits/byte, again fitting cleanly between the 16-bit and 32-bit settings. It is noteworthy that the model performs well despite the fact that the token bit-depth (16) is not a divisor of the window size (24), which results in M2 tokens that cross window boundaries. EqualInfoAC[b=24, v=256] has a compression ratio of 3.15, and EqualInfoAC[b=24, v=65k] has a compression ratio of 6.30.

A.2 Resetting M1 Every Window is Beneficial

We saw in Section 6.1 that the initial characters within each equal-info window are especially costly to encode. This is due to our procedure of resetting the M1 model's context at every window boundary. In this section, we experiment with maintaining M1 context over multiple windows. We expect this to improve the overall compression rate, but to negatively impact learnability, as a window can no longer be decoded in isolation.

As one extreme case, we consider never resetting M1. That is, we first use M1 to assign probabilities to all tokens in the input, and then use Equal-Info AC to compress the sequence into multiple fixed-size windows. In this setting we find that M2 does not learn beyond outputting a uniform distribution over tokens.27
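As a quick arithmetic check of this claim (using only values that appear in Table 9): a model stuck at the uniform distribution pays $\log_2 |V|$ bits per token, so its bits/byte is fixed entirely by the compression ratio. Plugging in the never-reset ratios,

$$\text{bits/byte} = \frac{\log_2 |V|}{\text{compression ratio}} = \frac{8}{5.33} = \frac{16}{10.66} \approx 1.50,$$

which matches the 1.501 bits/byte reported for both vocabularies in Table 9 (see also footnote 27).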
Figure 12: Resetting M1 with each Equal-Info window is beneficial. While reducing how often M1 is reset improves the compression ratio (see Table 9), bits/byte performance is much worse, as seen by these points being far above the scaling curves for our default settings. (x-axis: inference FLOPs/byte; series: EqualInfoAC[b=16, v=256] and EqualInfoAC[b=16, v=65k], with M1 reset every window, every other window, or never.)

At the other extreme, we consider resetting M1 every other time an Equal-Info window is emitted. We train 403m parameter M2 models in each of these settings, and show the results in Fig. 12 and Table 9. While resetting M1 every other window does not yield as strong performance as our standard EqualInfoAC[b=16, v=65k] setting, it is notable that it is still learning at all. This suggests that M2 is able to leverage context from compressed windows earlier in the sequence. While the improved compression ratio places this compression scheme further to the right in Fig. 4 than EqualInfoAC[b=16, v=65k] or SentencePiece, the reduced performance means M2 must be extremely large to reach the Pareto frontier. Exploring this compression scheme in a more traditional pre-training setting, where M2 is trained on trillions of tokens, would be interesting, as this model seems particularly under-fit. Similarly, exploring how many windows can be emitted before resetting M1 while the data remains learnable for M2 would be of interest as future work.

27 Consequently, models trained on compressed data that used 8-bit or 16-bit tokenization have the same performance in terms of bits/byte: 16-bit tokenization squares the number of possible symbols in the vocabulary, which doubles the negative log likelihood loss when predicting the uniform distribution, and this exactly cancels out the doubling of the compression rate that 16-bit tokenization yields.

Table 9: Resetting M1 less often yields a greater compression rate, but bits/byte performance degrades. Resetting M1 every other window results in more compressed, but still learnable, data. However, M2 models trained on that data do not perform as well as when M1 is reset every Equal-Info window. As seen in Fig. 3, never resetting M1 seems to have strong performance, but this is an artifact of its high compression rate; the model does not actually learn anything beyond a uniform distribution.

  Dataset                    M1 Reset            Compression Ratio  bits/byte
  EqualInfoAC[b=16, v=256]   every window        2.66               1.092
                             every other window  3.40               1.748
                             never               5.33               1.501
  EqualInfoAC[b=16, v=65k]   every window        5.31               1.015
                             every other window  6.80               1.697
                             never               10.66              1.501

A.3 Character Awareness and Spelling

Language models trained over subword tokens are not character-aware; that is, they lack direct access to the character-level makeup of their input text. As a result, these models tend to perform poorly on tasks relating to spelling (Xue et al., 2022; Liu et al., 2023). In this section, we explore whether our M2 models trained over neurally compressed text can offer an improvement in this regard.

We investigate spelling ability using an adapted version of the WikiSpell task (Liu et al., 2023). When fed a word as input, the model is required to spell it by outputting the component characters as separate tokens. We compare two of the 403m parameter models trained in Section 3: the EqualInfoAC[b=16, v=256] M2 model and the SentencePiece model. In each case, we fine-tune the model (which has been pre-trained for 200,000 steps on language modeling) on the WikiSpell task for 200 steps (5 epochs over the data), continuing the same learning rate schedule as pre-training.

For each model, we tokenize the WikiSpell input words with the same tokenizer used during pre-training. The choice of which token IDs to use for the output character sequence is less straightforward. For the M2 model, we use the ByT5 vocabulary (Xue et al., 2022), which overlaps in an arbitrary way with the 256 IDs used during pre-training. For the SentencePiece model, we consider two options. For a more direct comparison, we map each output character to an arbitrary ID shared with a non-character subword token in the vocabulary ("SentencePiece (Shared)"). As an alternative, we also consider mapping each output character to the ID of that character within the SentencePiece vocabulary ("SentencePiece (Characters)"). We expect this setting to perform better, as the model should already have some understanding of these characters and their relation to larger subwords from pre-training.
Fig. 13 shows that a SentencePiece model leveraging its pre-trained character representations has the best spelling performance on held-out evaluation samples. It is also the first to reach 100% training accuracy. The EqualInfoAC[b=16, v=256] model, which must share token meanings between the compressed input and the character output, outperforms a SentencePiece model with the same limitation on the evaluation set. Additionally, the model pre-trained over compressed text reaches 100% training accuracy before the analogous SentencePiece model. Interestingly, M2's inability to reach 100% accuracy on the evaluation set suggests that the model may not actually be performing decompression internally while doing language modeling over compressed text.28

28 We note that only 0.18% of WikiSpell dev examples (9 of 5,000) contain a character that is unseen during training. Thus, even if the output character IDs are chosen arbitrarily, a character-aware model should in theory be able to get nearly perfect performance.

Figure 13: Reusing SentencePiece character representations yields the strongest speller (and the fastest to train), but when forced to learn the output representations from scratch, a pre-trained EqualInfoAC[b=16, v=256] model is a better speller than a SentencePiece-based model. (a) Spelling performance on held-out words. (b) Spelling performance on the training data. (x-axis: fine-tuning steps, 0-200; series: SentencePiece (Characters), SentencePiece (Shared), EqualInfoAC[b=16, v=256].)

A.4 GZip Headers and Footers

GZip compressed documents have both a header (two bytes that identify the file type) and a footer (four bytes holding the Adler-32 checksum (Deutsch & Gailly, 1996) of the input). We train 25m M2 models on versions of the dataset where the header and footer are removed, as well as versions where they are kept. In Table 10, we see that including the header/footer offers a marginal improvement in bits/byte, at the cost of a marginal decrease in compression ratio. These differences are not large enough to overcome GZip's poor performance compared to other compression methods. We opt to use versions of GZip including the header/footer throughout.

Table 10: Removal of the GZip header and footer results in minimal performance differences.

  Method                   Compression Ratio  bits/byte
  GZip[v=256]              2.23               2.33
    without header/footer  2.24               2.35
  GZip[v=65k]              4.46               2.91
    without header/footer  4.47               2.92
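These framing bytes are easy to inspect directly. A minimal sketch, assuming the zlib container (Deutsch & Gailly, 1996), whose two-byte header and four-byte Adler-32 trailer match the framing described above; Python's standard zlib module implements it:

```python
import zlib

data = b"hello world"
blob = zlib.compress(data)

header, footer = blob[:2], blob[-4:]  # 2-byte header, 4-byte Adler-32 trailer
print(header.hex())                   # '789c' at the default compression level
print(int.from_bytes(footer, "big") == zlib.adler32(data))  # True: trailer is the checksum
```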
A.5 Avoiding End-of-Window Zeros

Our Equal-Info Windows algorithm described in Section 3.3 involves compressing text until exceeding a specified number of bits (e.g., 16 bits per window), then backtracking one character and padding with zero bits. As a character may encode to all zero bits, we need to ensure there is no ambiguity between padding bits and all-zero characters in order to achieve lossless compression. Our default solution, used in all experiments, is to greedily include the most characters possible in each window. Given the knowledge of greedy encoding, we can design a decoder that decodes windows unambiguously by using look-ahead, as detailed in Appendix D.5.

An alternative resolution to the problem of window-final all-zero characters is to simply avoid emitting these characters during compression, delaying their encoding until the next window. While this degrades the compression rate, it results in a much simpler decoding algorithm: with the knowledge that a window includes the least number of characters that can result in the given bitstream, we can decode windows without any look-ahead.

To test whether this alternative scheme ("Delay") improves learnability, we retrain Equal-Info M2 models following the procedure from Section 3, with the only difference being that, in generating M2's training data, we delay the emission of window-final all-zero characters. Applying this delayed version of windowed compression, we observe a reduction in compression rate from 2.66 to 2.20.

Figure 14: Comparison of our default greedy implementation of EqualInfoAC with a variant (Delay) where the emission of window-final all-zero characters is delayed. For EqualInfoAC[b=16, v=256], the default implementation performs better. For EqualInfoAC[b=16, v=65k], bits/byte improves slightly, but at the cost of compression rate. (Axes: inference FLOPs/byte and bytes/step; series: EqualInfoAC[b=16, v=256], EqualInfoAC[b=16, v=256] (Delay), EqualInfoAC[b=16, v=65k], EqualInfoAC[b=16, v=65k] (Delay).)

Fig. 14 shows the results of training M2 models over data compressed with each method. For EqualInfoAC[b=16, v=256], we find that our default implementation outperforms the delayed version along both axes. For EqualInfoAC[b=16, v=65k], the delayed implementation delivers a slight improvement in terms of bits/byte. However, this gain is offset by the reduction in compression ratio, so the two performance curves largely overlap. Numerical values can be found in Table 17.

Overall, our findings suggest that when trained over data encoded with our default (greedy) EqualInfoAC algorithm, M2 is able to distinguish trailing zeros that represent characters from those that represent padding. We opt to use our default implementation throughout, as it maximizes compression ratio.

B Other Compression Methods Considered

B.1 Equal-Text Windows

Equal-Text windows are a simpler alternative to Equal-Info windows. Rather than consuming a variable amount of text and outputting a fixed number of bits, Equal-Text windows feed a fixed amount of text into the Arithmetic Coder, which compresses to a variable number of bits. We anticipate that a downside of Equal-Text windows is that it would be difficult for M2 to decipher where one window ends and another begins, as window boundaries no longer align with a fixed number of tokens, and a boundary may appear token-internally. We considered adding delimiter tokens between windows to overcome this difficulty. However, we expect this would degrade the compression rate too much, especially for the short AC-compressed windows that we found most effective in Fig. 5. Further exploration of Equal-Text windows, especially to see whether the delimiters are actually required, would be interesting future work, as the Equal-Text Windows algorithm is much simpler than Equal-Info Windows.

B.2 Huffman Coding

We also considered using Huffman Coding (Huffman, 1952) as a baseline compression implementation. However, as most implementations use static probabilities for characters, the resulting compression rate would likely be too low to be competitive. With static Huffman Coding, there is a fixed mapping between bitstream subsequences and characters, which may improve learnability by M2 models.
However, because the coding component assigns each character a whole number of bits, the coding is less optimal than Arithmetic Coding.

Huffman Coding can be made adaptive by periodically updating the induced codebook based on newer data. When considering bit-level compression, adaptive Huffman Coding performs similarly to static Huffman Coding (Mahoney, 2013). However, when considering token-level compression, and given that the adaptive distribution would come from M1 rather than from unigram statistics of the recent data, training M2 models on adaptive Huffman Coding could be interesting future work. As Huffman coding is part of the GZip algorithm, we opted not to explore Huffman Coding on its own.

B.3 Asymmetric Numeral Systems

Another compression algorithm we considered was Asymmetric Numeral Systems (ANS) (Duda, 2014). ANS has strong coding performance and is amenable to adaptive probabilities. The internal state is only a single natural number, which may be easier for an LLM to track than the two real numbers used in Arithmetic Coding (AC). However, unlike AC, the encoding and decoding algorithms are stack-like: the encoder runs left-to-right, while the decoder runs right-to-left. We expected this would make streaming inference, where a single M2 token is generated and then decoded by M1 before another M2 token is generated, difficult. Thus we opted to explore AC over ANS in this work. However, the simpler state is appealing, and using ANS for compression would be of interest as future work.
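To make the stack-like behavior concrete, here is a toy range-ANS (rANS) sketch with a small, hypothetical symbol table (the frequencies and function names are ours, for illustration only). The encoder consumes symbols left-to-right, but the decoder pops them back out right-to-left:

```python
# Toy rANS: the entire coder state is one natural number.
FREQ = {"a": 6, "b": 1, "c": 1}   # hypothetical symbol frequencies
CUM = {"a": 0, "b": 6, "c": 7}    # cumulative frequency starts
TOTAL = sum(FREQ.values())        # = 8

def encode(symbols: str) -> int:
    x = 1
    for s in symbols:  # left-to-right
        x = (x // FREQ[s]) * TOTAL + CUM[s] + (x % FREQ[s])
    return x

def decode(x: int, n: int) -> str:
    out = []
    for _ in range(n):  # symbols come back out in reverse order
        r = x % TOTAL
        s = next(s for s in FREQ if CUM[s] <= r < CUM[s] + FREQ[s])
        x = FREQ[s] * (x // TOTAL) + r - CUM[s]
        out.append(s)
    return "".join(reversed(out))

state = encode("abac")
assert decode(state, 4) == "abac"  # lossless, but decoded back-to-front
```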
C Instability in the Initial Token of Multi-Token Windows

There are cases where the token-to-text mapping for the initial token in a multi-token EqualInfoAC window can be unstable. When a character's bitstream crosses the token boundary (the purple characters in Fig. 9), only some prefix of that bitstream contributes to the value of the initial token. It is possible that another character may produce a different bitstream with a shared prefix. If the token boundary comes before the point where the bitstreams differ, the two tokens will have the same value but represent different text. When this occurs, the text prefix remains stable, i.e., any characters whose bitstreams are entirely contained within the initial token will match, but the final character may differ. Thus the notion of mapping a compressed token to exact characters is not well defined, as there are often cases where a character is spread across two tokens. Note that this only occurs at token boundaries; EqualInfoAC[b=16, v=65k] is stable, as each token is a whole window, so no characters cross token boundaries. This is most likely a reason that EqualInfoAC[b=16, v=65k] outperforms EqualInfoAC[b=16, v=256] in our byte-controlled ablations (Section 5.3). Nevertheless, we consider EqualInfoAC stable enough to enable learnability by M2. Interestingly, Külekci (2011) points out this same issue: a fixed-size view of a variable-length stream can cause false equivalences when prefixes match. Similar to our findings, they find that models have some limited ability to deal with these situations.

D Details for Reproducibility

D.1 Variance

Sampling from the validation set was seeded. For a given seed, the same batches are sampled at each evaluation step within a training run. Similarly, when models of a different size are trained on the same compressed data, the same evaluation batches are sampled, allowing for fair comparison. As the Bytes and SentencePiece baselines use deterministic datasets, the validation seed is not used; instead, the start_step is incremented by 20 to get a new sample of 20 batches.

Model initialization and the order of the training data are controlled by the training seed; this seed was also changed during variance testing. During training, the dataset is checkpointed, and therefore each example is seen exactly once. The exact order of the training data is determined by the seed. As the Bytes and SentencePiece baselines use deterministic datasets, the training order is fixed.

Five models with 25m parameters were trained with different seeds (both validation and training) for each compression method and the two baselines. The mean and standard deviation can be found in Table 11. The variance is so low that we only report single values for most other experimental settings, such as larger models.

Table 11: Variance in performance is low. Even with maximal changes between runs (different evaluation samples, different training orders, and different parameter initializations), there is very little variance in final performance. Statistics were calculated over 5 different 25m parameter training runs for each method.

  Method                      bits/byte
  Bytes                       1.2899 ± 0.0020
  SentencePiece               1.1171 ± 0.0006
  AC[v=256]                   1.4573 ± 0.0001
  Static AC[v=256]            4.6936 ± 0.0005
  EqualInfoAC[b=16, v=256]    1.4724 ± 0.0044
  EqualInfoAC[b=32, v=256]    2.0457 ± 0.0058
  EqualInfoAC[b=64, v=256]    1.8494 ± 0.0052
  EqualInfoAC[b=128, v=256]   1.7121 ± 0.0003
  GZip[v=256]                 2.3374 ± 0.0061

Training models of size 403m and 2b over data compressed with EqualInfoAC[b=64, v=256] and EqualInfoAC[b=128, v=256], as well as a 2b model with EqualInfoAC[b=128, v=65k], occasionally diverged, collapsing to a simple model that just outputs the uniform distribution. The numbers for these settings exclude the divergent runs; this resulted in 7 re-runs in the most problematic case.

D.2 The Amount of Raw Text Bytes Seen by M2

Table 12 shows the number of tokens and bytes found in the training dataset for each compression method. During the data generation process, sequences of length 10,240, generated by concatenating 128 byte-tokenized C4 documents, are compressed. Some of these sequences, namely the final sequence created from the tail of the concatenated documents, are too short to be compressed to the target length of 512. Thus, the exact number of tokens in the dataset can vary slightly. With no padding, each dataset would have been trained on 26,214,400,000 tokens; all settings are close to this value, with the maximum deviation being EqualInfoAC[b=128, v=65k], which has 1.06% fewer tokens. All compression datasets are created from the same source sequences; thus the underlying byte sequences compressed by weaker methods are prefixes of the underlying sequences compressed by stronger methods.

D.3 Scaling Curves with Scaled Training Data

When scaling models, Hoffmann et al. (2022) argue that training data should be scaled linearly with model size. As such, when comparing settings with constant training FLOPs, a large part of the FLOPs budget should be used by adding more training data. We apply this technique to compensate for our 2b models being under-trained by plotting the scaling curves in Fig. 15, where the smaller models are trained with less data, proportional to their size. Models with 25m parameters only train for 3k steps, 113m for 11k, 403m for 40k, and 2b for 200k steps.
Otherwise, the settings match those in Fig. 3; the numerical values used in the graph can be found in Table 13. Scaling the training data adjusts the absolute slopes of the lines for all models that learn, while models that do not learn still only predict a uniform distribution. The trends between settings are unchanged, so we opt to plot the versions where training data is held constant across model sizes.

Table 12: Compression ratios (bytes/tokens) achieved by various methods. For EqualInfoAC, the compression ratio increases with increased window size. Small differences in the number of non-padding tokens are due to noise in the data generation process: some randomly selected documents are too short to be compressed to the target length of 512.

  Method                      Compression Ratio  Tokens          Bytes
  Bytes                       1.0                26,188,185,600  26,188,185,600
  SentencePiece               4.28               26,112,163,840  111,728,726,639
  AC[v=256]                   5.49               26,083,328,000  143,197,470,720
  Static AC[v=256]            1.73               26,175,078,400  45,282,885,632
  GZip[v=256]                 2.23               26,175,209,472  58,370,424,832
  EqualInfoAC[b=16, v=256]    2.66               26,154,106,880  69,569,924,301
  EqualInfoAC[b=32, v=256]    3.49               26,109,542,400  91,122,302,976
  EqualInfoAC[b=64, v=256]    4.16               26,110,853,120  108,621,148,979
  EqualInfoAC[b=128, v=256]   4.61               26,078,085,120  120,219,972,403
  AC[v=65k]                   10.98              25,952,256,000  284,955,770,880
  Static AC[v=65k]            3.46               26,133,135,360  90,420,648,346
  GZip[v=65k]                 4.47               26,122,649,600  116,768,243,712
  EqualInfoAC[b=16, v=65k]    5.31               26,091,192,320  138,544,231,219
  EqualInfoAC[b=32, v=65k]    6.97               26,049,249,280  181,563,267,482
  EqualInfoAC[b=64, v=65k]    8.33               26,004,684,800  216,619,024,384
  EqualInfoAC[b=128, v=65k]   9.22               25,936,527,360  239,134,782,259

D.4 Evaluation Details

In our experiments, different settings have different vocabulary sizes and tokenizations, and cover different amounts of underlying text due to variations in compression rate. Thus, they are not directly comparable using per-token metrics like cross-entropy, negative log likelihood loss, or perplexity. To address this, we convert our token-level negative log likelihood loss, $\ell$, to a byte-level negative log likelihood loss by dividing the loss by that compression method's token-level compression rate: $\ell_{\text{byte}} = \ell / (L_{iT}/L_{oT}) = \ell \cdot (L_{oT}/L_{iT})$. Note that we use per-byte metrics over per-character metrics, as there is ambiguity about what counts as a character when working with UTF-8 Unicode.

As is common in the evaluation of work related to compression, instead of the negative log likelihood loss $\ell_{\text{byte}}$ (in nats), we use bits/byte. This would require using log base two instead of the natural log during the negative log likelihood calculation, but the conversion can be done after the fact: $\text{bits/byte} = \log_2(e^{\ell_{\text{byte}}}) = \ell_{\text{byte}} / \ln(2)$. Note that this results in the same conversion used in Gao et al. (2020), $\text{bits/byte} = \ell_{\text{byte}}/\ln(2) = (L_{oT}/L_{iT})\,\ell/\ln(2)$, when the input tokens represent bytes.

As one of the main advantages of an M2 model that processes compressed text is that it needs to be run over fewer tokens, we also compare models based on the number of FLOPs required during inference. Different compression methods result in different sequence lengths for the M2 model to process, so we standardize our FLOPs measurement to the byte level to make it comparable across methods. We start with FLOPs/token approximated by 2 × num_params (not including embedding parameters), following Kaplan et al. (2020), and divide it by that method's token-level compression rate to get FLOPs/byte, just like the bits/byte conversion. For methods that require running an M1 model over each byte, the FLOPs/byte cost of the M1 model is added. While there is a computational cost to running GZip over the input text, we ignore it, as it is insubstantial compared to the cost of running model inference.
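A minimal sketch of both conversions (the helper names are ours; the illustrative loss value and the 3m-parameter M1 size are assumptions, chosen to roughly reproduce the 403m EqualInfoAC[b=16, v=65k] row of Table 15):

```python
import math

def bits_per_byte(loss_nats_per_token: float, compression_ratio: float) -> float:
    """Token-level NLL (nats/token) -> bits/byte: divide by the token-level
    compression rate L_iT / L_oT, then convert nats to bits."""
    return loss_nats_per_token / compression_ratio / math.log(2)

def inference_flops_per_byte(m2_params: float, compression_ratio: float,
                             m1_params: float = 0.0) -> float:
    """FLOPs/byte via the 2 * num_params FLOPs/token approximation (Kaplan
    et al., 2020), plus the per-byte cost of M1 when it must be run."""
    return 2 * m2_params / compression_ratio + 2 * m1_params

print(bits_per_byte(3.75, 5.31))                             # ~1.02 (hypothetical loss)
print(inference_flops_per_byte(403e6, 5.31, m1_params=3e6))  # ~157.8M
```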
Evaluation of language models is often done by running the model over the entire validation set, moving the sliding window formed by the model's context by a single token at each step. This yields stronger results by providing the most context possible when predicting each token. As we care about relative performance between methods, as opposed to absolute performance, we opt to evaluate the model on a sample of the C4 validation set. During evaluation, the model is run over 20 batches, resulting in predictions for 2,621,440 tokens. These tokens represent different amounts of text depending on the compression method, so it would have been impossible to run evaluation on the same bytes for all methods. We trained five 25m parameter models with different seeds and found that the final performance is very stable: the largest standard deviation was 0.0061. Thus, the variance introduced from sampling the validation set is negligible. See Appendix D.1 for more information.

Figure 15: Training language models over compressed text while scaling training data with model size results in steeper slopes. When scaling model size, it has been found that the training data should be scaled proportionally (Hoffmann et al., 2022). We apply this scaling technique by plotting values for smaller models at earlier training steps. The trends are similar to Fig. 3, even down to details like where the EqualInfoAC[b=16, v=256] line crosses the Bytes baseline (between the 25m and 113m parameter models). (x-axis: inference FLOPs/byte, 10^7 to 10^9; series: Bytes, SentencePiece, Arithmetic Coding, Static AC, EqualInfoAC[b=16], and GZip, each at v=256 and v=65k.)

D.5 End of Window and End of Input Sequence Handling

In the implementation of EqualInfoAC[b=W], each output window must end up being exactly W bits. Therefore, when compressing an additional character would result in a bitstream of more than W bits, the window is closed without that character and its bitstream is padded to W bits. In cases where the final character in the window only adds zeros to the bitstream, it is unclear at first glance whether that final character was included in the window, or whether it was omitted and the trailing zeros are all padding. However, the compression scheme is still lossless if we are consistent in our encoding. By always including the most input characters possible in each window, we know that, during decoding, if the addition of a final character (one which is compressed to all zeros) still results in the same compressed bitstream, then that final character is part of that window. The decoding algorithm also knows to stop adding characters to the input when the addition of a new character would generate more than W bits when compressed.29

29 The only exception is at the end of the sequence, when the amount of padding is dictated by running out of input characters instead of running out of room in the window. This can be solved by including an end-of-input symbol.
Table 13: Numerical values from Fig. 15. Values for the uniform distribution and FLOPs/byte can be found in Table 15.

  Dataset                     Size  Step  bits/byte
  Bytes                       25m   3k    1.62
                              113m  11k   1.36
                              403m  40k   1.18
                              2b    200k  1.03
  SentencePiece               25m   3k    1.35
                              113m  11k   1.15
                              403m  40k   1.00
                              2b    200k  0.87
  AC[v=256]                   25m   3k    1.46
                              113m  11k   1.46
                              403m  40k   1.46
                              2b    200k  1.46
  AC[v=65k]                   25m   3k    1.46
                              113m  11k   1.46
                              403m  40k   1.46
                              2b    200k  1.46
  Static AC[v=256]            25m   3k    4.62
                              113m  11k   4.62
                              403m  40k   4.62
                              2b    200k  4.62
  Static AC[v=65k]            25m   3k    4.62
                              113m  11k   4.62
                              403m  40k   4.61
                              2b    200k  4.61
  EqualInfoAC[b=16, v=256]    25m   3k    1.86
                              113m  11k   1.50
                              403m  40k   1.21
                              2b    200k  0.99
  EqualInfoAC[b=16, v=65k]    25m   3k    1.62
                              113m  11k   1.31
                              403m  40k   1.10
                              2b    200k  0.94
  GZip[v=256]                 25m   3k    2.95
                              113m  11k   2.48
                              403m  40k   1.97
                              2b    200k  1.56
  GZip[v=65k]                 25m   3k    3.30
                              113m  11k   3.08
                              403m  40k   2.72
                              2b    200k  2.26

This kind of padding is present in many Arithmetic Coding implementations and is generally solved either by giving the AC decoder the original input sequence length along with the compressed message, or by having the AC decoder use a special termination character. These fixes are hard to apply in our setting. Passing the number of tokens present in a window to M2 would be possible during training, but it would make inference much more complex, requiring a solution such as M2 generating fertility scores that specify how many characters the generated tokens represent (Brown et al., 1993). In order to include an end-of-input symbol for the AC decoder, the M1 model must be able to assign reasonable probabilities to that symbol; it would therefore need to appear at the end of each training example, hindering M1's ability to be applied to longer sequences.

As such, we achieve lossless compression within each window by allowing the AC decoder to be run multiple times, incrementing the sequence length until we find the sequence that, when compressed, no longer matches the compressed output, and then backtracking. As we do not include an AC decoder end-of-input symbol, when we reach the end of the input there is some ambiguity in the number of characters that are included in that window. However, as we compress extremely long sequences (over 10,000 characters), the final window in a trimmed M2 example rarely corresponds to the final input characters. Thus most sequences (99.4%) are decompressible using the approach above. In our validation data, just 0.0011% of windows represent the end of an input sequence. Additionally, the standard deviations of the loss across validation tokens are similar for models trained over compressed input and models trained directly on SentencePiece. Thus we believe that these ambiguous tokens do not affect our results.
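The windowing logic itself is compact. Below is a runnable sketch of the greedy packing and longest-match decoding rules just described, with a toy static prefix code standing in for the M1-driven Arithmetic Coder (the code table is hypothetical; the coder is not the point here). The demo input avoids the end-of-sequence ambiguity discussed above:

```python
W = 8  # bits per window (the paper uses 16 to 128)
CODE = {"a": "00", "b": "01", "c": "100", "d": "101", "e": "11"}  # toy prefix code

def compress(text: str) -> str:
    """Stand-in for the M1-driven AC coder: concatenate static per-char codes."""
    return "".join(CODE[ch] for ch in text)

def encode_windows(text: str) -> list[str]:
    """Greedily pack characters until the next one would overflow W bits,
    then close the window and pad it with zero bits."""
    windows, start = [], 0
    while start < len(text):
        end = start
        while end < len(text) and len(compress(text[start:end + 1])) <= W:
            end += 1
        windows.append(compress(text[start:end]).ljust(W, "0"))
        start = end
    return windows

def decode_window(window: str, alphabet: str = "abcde") -> str:
    """Longest-match decoding: keep growing the decoded string while its
    re-compression is still a prefix of the window. A window-final character
    that compresses to all zero bits is kept, mirroring the greedy encoder."""
    decoded, grew = "", True
    while grew:
        grew = False
        for ch in alphabet:
            bits = compress(decoded + ch)
            if len(bits) <= W and window.startswith(bits):
                decoded, grew = decoded + ch, True
                break
    return decoded

windows = encode_windows("badcab")
assert "".join(decode_window(w) for w in windows) == "badcab"
```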
D.6 Entropy Estimation

To account for noise in the entropy estimation, we partition the data into 100 disjoint samples. This results in each partition being a sample of ~2 billion symbols for n-grams and ~130 million for tokens. We then calculate the entropy for each partition, and compute the KL divergence from the uniform distribution at the 0.05, 0.50, and 0.95 entropy quantile points. These quantiles are then plotted on Fig. 11 to illustrate sampling noise: 90% of sampled entropies fall within these bounds. The log scaling of Fig. 11 hides some of the noise trends, namely that the noise grows with n and that settings like GZip and EqualInfoAC are noisier than AC and RNG. These trends are seen in Fig. 16, where the entropy has been normalized based on the mean entropy calculated across the partitions.

Figure 16: The amount of noise in the entropy estimate grows as the length of the bit segments grows. Larger segmentations of the bitstream result in larger vocabularies and therefore require larger sample sizes for accurate entropy estimates. For each setting, we plot the 5%, 50%, and 95% percentile intervals for the entropy, normalized by the average entropy across partitions. We see that the noise grows with n and that settings like EqualInfoAC[b=16] are noisier than AC, despite this not being apparent in Fig. 11. (Panels: (a) bit n-grams, counting all overlapping occurrences; (b) n-bit tokens, following our M2 tokenization. x-axis: segment length n, 2-16; y-axis: normalized entropy; series: RNG, Arithmetic Coding[v=256], Static AC[v=256], GZip[v=256], EqualInfoAC[b=16], EqualInfoAC[b=64].)

The maximum likelihood, or "plug-in", estimator of entropy, $\hat{H} = -\sum_{x \in X} \hat{p}(x) \log_2 \hat{p}(x)$, is negatively biased; in fact, all entropy estimators are biased (Paninski, 2003). The Miller-Madow estimator attempts to correct for this bias by adding the approximate bias caused by sampling to the plug-in estimator:30 $\hat{H}_{MM} = \hat{H} + \frac{\hat{|V|} - 1}{2m}$, where $m$ is the size of the sample used to estimate the entropy and $\hat{|V|}$ is the estimated vocabulary size. In some applications the vocabulary size may need to be estimated, for example to account for new words being added to languages, but in this case our vocabulary size is always $2^n$, where $n$ is the size of the current segmentation.

30 There are other methods for entropy bias correction, such as those used in DeDeo et al. (2013) based on bootstrapping (Efron, 1979); however, with the size of the C4 training data, the required resampling was not possible. Thus, we use Miller-Madow in this work.
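A minimal sketch of the two estimators and the KL-to-uniform computation (function names are ours; everything is computed in bits, so the Miller-Madow bias term is divided by ln 2):

```python
import numpy as np

def plugin_entropy(counts: np.ndarray) -> float:
    """Maximum-likelihood ('plug-in') entropy estimate, in bits."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def miller_madow_entropy(counts: np.ndarray, vocab_size: int) -> float:
    """Plug-in estimate plus the (|V| - 1) / (2m) bias term. The vocabulary
    size is known here (always 2^n), so it is passed in rather than estimated."""
    m = counts.sum()
    return plugin_entropy(counts) + (vocab_size - 1) / (2 * m * np.log(2))

def kl_from_uniform(counts: np.ndarray, vocab_size: int) -> float:
    """KL(P || U) = log2 |V| - H[P]; may dip below zero after bias correction."""
    return float(np.log2(vocab_size) - miller_madow_entropy(counts, vocab_size))

# n = 8 bit segments: a vocabulary of 2^8 = 256 values.
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(0, 256, size=1_000_000), minlength=256)
print(kl_from_uniform(counts, 256))  # near zero for random bits
```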
When we plot the KL divergence between the Miller-Madow estimated entropy and the uniform distribution, we see that the percentile interval for the RNG baseline now includes 0, the KL divergence we expect given that the data was generated from random and independent bits. As the bias correction is approximate, it is possible that, for a given sample, the correction will result in an entropy greater than the maximum entropy possible for the given vocabulary size. Given that the KL divergence between a distribution $P$ and the uniform distribution $U$ simplifies to the entropy of $U$ minus the entropy of $P$, $KL(P \| U) = H[U] - \hat{H}[P] = \log_2 |V| - \hat{H}[P]$, this results in a negative KL divergence, which is not allowed. These points get removed from the graph during log scaling, and the resulting 50th percentile line for RNG data looks strange. Therefore, we only plot points with positive KL divergence in Fig. 17.

Figure 17: Bias-corrected KL divergence between the observed and uniform distributions for different segmentations of the bitstream. This plot is similar to Fig. 11; however, the KL divergence calculations use the entropy of the observed distribution after applying the Miller-Madow bias correction. After applying bias correction, the expected 0 KL divergence for the RNG baseline is now within the 90th percentile bounds. However, the correction can result in an (incorrect) negative KL divergence, which is removed from the graph; thus the RNG 50th percentile is shown as a scatter plot rather than a broken line. In this setting it is clear that the 50th percentile for AC[v=65k] is above the 50th percentile for RNG; however, it is hard to disentangle the two, as their 5th percentile lines are similar. (Panels: (a) bit n-grams; (b) n-bit tokens. x-axis: segment length n, 2-16; y-axis: KL(observed || uniform); series as in Fig. 16.)

The Miller-Madow estimation of entropy makes it clear that the 0.50 KL divergence quantile for AC-compressed data is much higher than the 50th percentile for RNG data. Additionally, for n > 2, the AC entropy is statistically significantly less than the RNG entropy; however, differences in the mean entropy only start to appear after ~8 decimal places. This slight difference in means, coupled with the fact that the 5th percentiles are similar, means we cannot confidently assert that the model will be able to easily distinguish the AC-compressed data from random data. Given that we care about the differences between the entropies of data compressed with different methods, which are invariant to bias, and given the strange plots when values are less than 0, we opt to plot the plug-in estimator in Fig. 11 instead of the Miller-Madow estimator.

D.7 Analysis Implementation

Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021) were used to make all the included graphs. Statistical significance tests were done using Welch's t-test (Welch, 1947) via the function scipy.stats.ttest_ind_from_stats from SciPy (Virtanen et al., 2020). We used p < 0.05 as the statistical significance threshold.
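As an illustration of the test on summary statistics (using the means and standard deviations from Table 11, with 5 runs each):

```python
from scipy.stats import ttest_ind_from_stats

# Bytes vs. SentencePiece bits/byte, summary statistics from Table 11.
result = ttest_ind_from_stats(
    mean1=1.2899, std1=0.0020, nobs1=5,  # Bytes
    mean2=1.1171, std2=0.0006, nobs2=5,  # SentencePiece
    equal_var=False,                     # Welch's t-test
)
print(result.pvalue < 0.05)  # True: the difference is significant
```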
D.8 Window Text Patterns and Token Positions

We tokenize 20 documents of length 1,024 with EqualInfoAC[b=16, v=256] and find that all 256 possible token values occur multiple times, both as the first and as the second token within the window. When tokenized with EqualInfoAC[b=16, v=65k], 34.5% of attested tokens appear more than once. Table 14 shows all the window text for repeated tokens.

Table 14: The deduplicated window text from all instances of tokens that appear multiple times when we tokenized 20 documents of length 1,024 (20,480 compressed tokens) with EqualInfoAC[b=16, v=256].

  Token 185, window position 1:
    [or ] / [or a ] / [or ac] / [or al] / [or cr] / [or d] / [or f] / [or h] / [or hi] / [or i] / [or k] / [or ma] / [or pr] / [or r] / [or s] / [or se] / [or su] / [or t] / [or to] / [or v] / [or wha] / [or y] / [or yo] / [or, t] / [or-] / [or.] / [ora] / [orc] / [orce ] / [ord] / [ord a] / [order] / [ore a] / [ore e] / [ore ev] / [ore g] / [ore i]
  Token 185, window position 2:
    [ 4] / [ of F] / [ records ] / [. Lo] / [Alt] / [OI] / [ase ] / [at y] / [cian] / [cri] / [d. I] / [ery] / [h de] / [hen s] / [ides] / [n ne] / [oft] / [om i] / [onte] / [opp] / [pir] / [rev] / [reve] / [s may] / [tion a] / [y do] / [y t]
  Token 151, window position 1:
    [le] / [le s] / [le t] / [le. ] / [lea] / [lec] / [led] / [led ] / [led t] / [leg] / [lege] / [leh] / [lem ] / [leme] / [lems] / [len] / [ler] / [les] / [less] / [let] / [lett] / [level] / [lew ] / [ley] / [lf ]
  Token 151, window position 2:
    [ all ] / [ nut] / [ this] / [ un] / [. I w] / [Ni] / [as t] / [ceed ] / [choos] / [e Mi] / [e-li] / [etti] / [imag] / [ion a] / [k a] / [ne a] / [ng up] / [niversi] / [npo] / [nt pr] / [pi] / [rvices] / [s T] / [s your] / [s?] / [so c] / [stag] / [thou] / [thoug] / [ust] / [ust ]

D.9 Numerical Values

Table 15 includes the specific values used to create Fig. 3. Similarly, Table 16 includes the values used to create Fig. 5. The numerical values from Fig. 6 can be found across Table 15 and Table 16. Table 17 includes the numerical values from Fig. 14.

Table 15: Numerical values from Fig. 3. Methods that use 16-bit tokens (v=65k) have the same uniform distribution performance as the 8-bit versions (v=256). Note: FLOPs/byte values are reported in millions (M), even above one thousand million, to make comparison easier.

  Dataset                     Size     bits/byte  FLOPs/byte
  Bytes                       25m      1.29       50.00M
                              113m     1.16       226.00M
                              403m     1.08       806.00M
                              2b       1.03       4,000.00M
                              uniform  8.00       -
  SentencePiece               25m      1.12       11.69M
                              113m     1.01       52.82M
                              403m     0.94       188.37M
                              2b       0.87       934.84M
                              uniform  3.47       -
  AC[v=256]                   25m      1.46       15.11M
                              113m     1.46       47.17M
                              403m     1.46       152.81M
                              2b       1.46       734.60M
                              uniform  1.46       -
  AC[v=65k]                   25m      1.46       10.55M
                              113m     1.46       26.58M
                              403m     1.46       79.41M
                              2b       1.46       370.30M
  Static AC[v=256]            25m      4.61       28.90M
                              113m     4.61       130.64M
                              403m     4.61       465.90M
                              2b       4.61       2,310.00M
                              uniform  4.62       -
  Static AC[v=65k]            25m      4.60       14.45M
                              113m     4.60       65.32M
                              403m     4.62       232.95M
                              2b       4.62       1,160.00M
  EqualInfoAC[b=16, v=256]    25m      1.47       24.80M
                              113m     1.23       90.96M
                              403m     1.09       309.01M
                              2b       0.99       1,510.00M
                              uniform  3.01       -
  EqualInfoAC[b=16, v=65k]    25m      1.25       15.42M
                              113m     1.12       48.56M
                              403m     1.02       157.79M
                              2b       0.94       759.30M
  GZip[v=256]                 25m      2.34       22.42M
                              113m     1.93       101.35M
                              403m     1.69       361.43M
                              2b       1.56       1,790.00M
                              uniform  3.59       -
  GZip[v=65k]                 25m      2.92       11.19M
                              113m     2.69       50.56M
                              403m     2.48       180.31M
                              2b       2.26       894.85M

Table 16: Numerical values from Fig. 5. Values for EqualInfoAC[b=16, v=256] and EqualInfoAC[b=16, v=65k] can be found in Table 15. Note: EqualInfoAC[b=128, v=256] showed slight improvements beyond the significant digits shown here as the model scales.

  Dataset                     Size  bits/byte  FLOPs/byte
  EqualInfoAC[b=32, v=256]    25m   2.05       20.33M
                              113m  1.94       70.76M
                              403m  1.83       236.95M
                              2b    1.65       1,150.00M
  EqualInfoAC[b=32, v=65k]    25m   1.95       13.17M
                              113m  1.85       38.42M
                              403m  1.74       121.64M
                              2b    1.63       579.89M
  EqualInfoAC[b=64, v=256]    25m   1.85       18.02M
                              113m  1.82       60.33M
                              403m  1.80       199.75M
                              2b    1.79       967.54M
  EqualInfoAC[b=64, v=65k]    25m   1.82       12.00M
                              113m  1.80       33.13M
                              403m  1.79       102.76M
                              2b    1.76       486.19M
  EqualInfoAC[b=128, v=256]   25m   1.71       16.85M
                              113m  1.71       55.02M
                              403m  1.71       180.84M
                              2b    1.71       873.68M
  EqualInfoAC[b=128, v=65k]   25m   1.70       11.42M
                              113m  1.69       30.51M
                              403m  1.68       93.42M
                              2b    1.67       439.84M

Table 17: Numerical values from Fig. 14, comparing our standard implementation of EqualInfoAC with an implementation where window-final all-zero symbols are delayed ("Delay").

  Dataset                            Compression Ratio  Size  bits/byte  FLOPs/byte
  EqualInfoAC[b=16, v=256]           2.66               25m   1.47       24.80M
                                                        113m  1.23       90.96M
                                                        403m  1.09       309.01M
                                                        2b    0.99       1,510.00M
  EqualInfoAC[b=16, v=256] (Delay)   2.20               25m   1.50       28.73M
                                                        113m  1.25       108.73M
                                                        403m  1.11       372.36M
                                                        2b    1.00       1,820.00M
  EqualInfoAC[b=16, v=65k]           5.31               25m   1.25       15.42M
                                                        113m  1.12       48.56M
                                                        403m  1.02       157.79M
                                                        2b    0.94       759.30M
  EqualInfoAC[b=16, v=65k] (Delay)   4.40               25m   1.23       17.36M
                                                        113m  1.11       57.36M
                                                        403m  1.01       190.18M
                                                        2b    0.93       915.09M