Published as a conference paper at ICLR 2024

# LANGUAGE MODELING IS COMPRESSION

Grégoire Delétang*1, Anian Ruoss*1, Paul-Ambroise Duquenne2, Elliot Catt1, Tim Genewein1, Christopher Mattern1, Jordi Grau-Moya1, Li Kevin Wenliang1, Matthew Aitchison1, Laurent Orseau1, Marcus Hutter1, Joel Veness1

*Equal contribution. 1Google DeepMind. 2Meta AI & Inria. Correspondence to {gdelt, anianr}@google.com.

## ABSTRACT

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

## 1 INTRODUCTION

Information theory and machine learning are inextricably linked and have even been referred to as two sides of the same coin (MacKay, 2003). One particularly elegant connection is the essential equivalence between probabilistic models of data and lossless compression. The source coding theorem (Shannon, 1948) is the fundamental theorem describing this idea, i.e., the expected message length in bits of an optimal entropy encoder is equal to the negative log2-likelihood of the statistical model. In other words, maximizing the log2-likelihood (of the data) is equivalent to minimizing the number of bits required per message. Indeed, lossless compression with a probabilistic model can be achieved in a variety of different ways, including Huffman coding (Huffman, 1952), arithmetic coding (Pasco, 1977; Rissanen, 1976), and asymmetric numeral systems (Duda, 2009). Arithmetic coding, in particular, is known to be optimal in terms of coding length, meaning that the overall compression performance depends on the capabilities of the probabilistic model (see Fig. 1 for an overview of arithmetic coding).

Incidentally, in recent years, large pre-trained Transformers (Vaswani et al., 2017), so-called foundation models (Bommasani et al., 2021), have proven to be highly successful across a wide range of predictive tasks (Bubeck et al., 2023; Rae et al., 2021) and are thus promising candidates for use with arithmetic coding. Indeed, Transformer-based compression with arithmetic coding has produced state-of-the-art results both in the online (Bellard, 2021; Mao et al., 2022) and offline settings (Valmeekam et al., 2023). In the online setting, a pseudo-randomly initialized model is directly trained on the stream of data that is to be compressed, while the offline setting, which we consider in our work, trains the model on an external dataset before employing it to compress a (potentially different) data stream. Consequently, offline compression is performed in-context, with a fixed set of model parameters.
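As a minimal illustration of the identity between log-likelihood and code length (a sketch with an invented toy model; any autoregressive model that outputs conditional probabilities could be substituted), the snippet below computes the ideal code length of a sequence as the sum of negative log2-probabilities, i.e., its log loss in bits.

```python
import math

# Toy autoregressive model over the alphabet {A, I, X}. The conditional
# probabilities are invented for illustration only.
def toy_model(context: str) -> dict:
    if context.endswith("A"):
        return {"A": 0.2, "I": 0.6, "X": 0.2}
    return {"A": 0.45, "I": 0.35, "X": 0.2}

def code_length_bits(sequence: str, model) -> float:
    """Ideal code length in bits: sum over k of -log2 p(x_k | x_<k)."""
    return sum(-math.log2(model(sequence[:k])[x]) for k, x in enumerate(sequence))

# Maximizing the model's log2-likelihood of the sequence is exactly minimizing
# this number of bits; an optimal entropy coder (e.g., arithmetic coding)
# attains it up to a small constant overhead.
print(round(code_length_bits("AIX", toy_model), 2))  # ~4.21 bits
```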
Transformers have demonstrated impressive in-context learning abilities (Laskin et al., 2023; Brown et al., 2020; Wei et al., 2022; Genewein et al., 2023) and are thus ideally suited for offline compression.

Figure 1: Arithmetic encoding of AIXI with a probabilistic model P (blue) resulting in the binary code b0101010 (green). We iteratively divide the real interval I = [0, 1) according to the model's (conditional) probabilities and select the sub-interval corresponding to the observed symbol (e.g., I = [0, 0.45) for P(A)). We further refine I for each input symbol (indicated by the arrows), e.g., I = [0.09, 0.36) for P(I|A). To determine the encoded output, we iteratively split [0, 1) in half and assign a binary code to each sub-interval (shaded red areas). At every step we can output the binary code if I is fully contained in the corresponding binary interval (e.g., b0 for "A", but not for "AI", as it could be b00 or b01). At the end of the input, the code is b0101, which cannot be uniquely decoded (P(A|AIX), P(I|AIX), and P(X|AIX) all overlap with b0101). Thus, we further refine the binary code until its binary interval is fully contained in I (all calculations in Appendix A).

The context length is a key limiting factor in offline compression, as it dictates the maximum number of bytes a model can compress at a time. Transformers can only compress a few kilobytes (each token being coded with 2 or 3 bytes), while requiring substantial compute. Correspondingly, many challenging predictive tasks (e.g., algorithmic reasoning or long-term memory) require long contexts (Delétang et al., 2023), and thus extending these models' context lengths is a key challenge that is gaining increased attention (Zaheer et al., 2020; Guo et al., 2022; Bulatov et al., 2023). The in-context compression view provides insights into the failure modes of current foundation models.

**This Work** We advocate for using (lossless) compression to study foundation models. To that end, we conduct an extensive empirical investigation of the offline (in-context) compression capabilities of large language models, with the rationale that they have become readily available (Touvron et al., 2023a;b) and can thus be used for compression without the training overhead. We empirically demonstrate that these models, while (meta-)trained primarily on text, achieve competitive compression rates across different data modalities, outperforming domain-specific standard compressors (not accounting for model parameter size). Moreover, we shed new light on scaling laws (Kaplan et al., 2020), showing that they also hold true for compression but that measuring the adjusted compression rates instead of the log loss adds a twist: scaling beyond a certain point will deteriorate the compression performance, since the model parameters need to be accounted for in the compressed output. Finally, we advocate for framing (self-supervised) prediction through the lens of compression as it encompasses generalization: a model that compresses well generalizes well (Hutter, 2006).
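To make the mechanics of Figure 1 concrete, here is a minimal sketch of an arithmetic encoder (an illustration, not the implementation used in this paper): it refines the interval according to the model's conditional probabilities and then emits the shortest binary codeword whose dyadic interval lies inside the final interval. It uses float arithmetic for readability; practical coders use integer intervals with renormalization. The `model(context)` interface and the memoryless example model are assumptions for this sketch.

```python
def arithmetic_encode(sequence, model):
    """Toy arithmetic encoder (float precision, illustration only).

    `model(context)` is assumed to return a dict mapping each symbol to its
    conditional probability given `context`, in a fixed symbol order.
    """
    low, high = 0.0, 1.0
    for k, symbol in enumerate(sequence):
        width = high - low
        cumulative = 0.0
        for candidate, p in model(sequence[:k]).items():
            if candidate == symbol:
                low, high = low + width * cumulative, low + width * (cumulative + p)
                break
            cumulative += p

    # Emit the shortest bitstring whose dyadic interval [c_low, c_high) is
    # fully contained in the final interval [low, high), as in Figure 1.
    bits, c_low, c_high = "", 0.0, 1.0
    target = (low + high) / 2  # always subdivide towards the interval's midpoint
    while c_low < low or c_high > high:
        mid = (c_low + c_high) / 2
        if target < mid:
            bits, c_high = bits + "0", mid
        else:
            bits, c_low = bits + "1", mid
    return bits

# Example with an invented memoryless model (not the model of Figure 1):
model = lambda context: {"A": 0.45, "I": 0.35, "X": 0.2}
print(arithmetic_encode("AIX", model))  # -> '010110' for this toy model
```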
**Contributions** We empirically study the lossless compression capabilities of foundation models:

- We review how to compress with predictive models via arithmetic coding and call attention to the connection between current language modeling research and compression.
- We show that large language models achieve impressive compression rates (disregarding model parameter size) on modalities other than text. For example, Chinchilla 70B achieves compression rates of 43.4% on ImageNet patches and 16.4% on LibriSpeech samples, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
- We revisit scaling laws, showing that the dataset size provides a hard limit on model size in terms of compression performance and that model scaling is not a silver bullet.
- We leverage the compression-prediction equivalence to employ compressors as generative models and visually illustrate the performance of the underlying compressor.
- We demonstrate that tokenization, which can be viewed as a pre-compression, does not, in general, improve compression performance, but allows models to increase the information content in their context and is thus generally employed to improve prediction performance.

## 2 BACKGROUND

In this section, we review the necessary background on information theory and its relation to likelihood maximization. To that end, we consider streams of data $x_{1:n} := x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ from a finite set of symbols $\mathcal{X}$. We write $x_{\leq j} = x_{<j+1} := x_{1:j}$ for $j \leq n$.

Arithmetic coding encodes a sequence by iteratively refining an interval, starting from $I_0 = [0, 1)$ (cf. Fig. 1). For $k > 0$ (i.e., encoding $x_k$), we first partition the previous interval $I_{k-1} = [l_{k-1}, u_{k-1})$ into $N$ sub-intervals $I_k(x_1), I_k(x_2), \ldots$, one for each letter from $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$. The size of the sub-interval $I_k(y)$ that represents letter $y$ is $(u_{k-1} - l_{k-1}) \cdot \rho(y \mid x_{<k})$.
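As a minimal sketch of this partition step (an illustration under the stated assumptions, not this paper's code), the functions below split $I_{k-1}$ into the sub-intervals $I_k(y)$ of size $(u_{k-1} - l_{k-1}) \cdot \rho(y \mid x_{<k})$; the encoder keeps the sub-interval of the observed symbol, while the decoder picks the symbol whose sub-interval contains the number represented by the binary code.

```python
def partition(interval, conditional_probs):
    """Split I_{k-1} = [l, u) into sub-intervals I_k(y), one per symbol y,
    where |I_k(y)| = (u - l) * rho(y | x_<k). Float precision, illustration only.

    `conditional_probs` is assumed to be a dict {y: rho(y | x_<k)} in a fixed
    symbol order shared by encoder and decoder.
    """
    l, u = interval
    width, cumulative, sub_intervals = u - l, 0.0, {}
    for y, p in conditional_probs.items():
        sub_intervals[y] = (l + width * cumulative, l + width * (cumulative + p))
        cumulative += p
    return sub_intervals

def encode_step(interval, conditional_probs, observed_symbol):
    # Keep the sub-interval corresponding to the observed symbol x_k.
    return partition(interval, conditional_probs)[observed_symbol]

def decode_step(interval, conditional_probs, code_number):
    # Invert the encoding step: return the symbol whose sub-interval
    # contains the real number represented by the binary code.
    for y, (l, u) in partition(interval, conditional_probs).items():
        if l <= code_number < u:
            return y, (l, u)

# One step of each, with invented probabilities:
probs = {"A": 0.45, "I": 0.35, "X": 0.2}
interval = encode_step((0.0, 1.0), probs, "A")          # -> (0.0, 0.45), as in Fig. 1
symbol, interval = decode_step((0.0, 1.0), probs, 0.3)  # -> ("A", (0.0, 0.45))
```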