# Compute-Optimal LLMs Provably Generalize Better with Scale

Published as a conference paper at ICLR 2025

Marc Finzi (Carnegie Mellon University), Sanyam Kapoor (New York University), Diego Granziol (Pure Strength AI), Anming Gu (Boston University), Christopher De Sa (Cornell University), J. Zico Kolter (Carnegie Mellon University), Andrew Gordon Wilson (New York University)

## Abstract

Why do larger language models generalize better? To explore this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information-theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, showing that our bounds become stronger in a predictable way.

## 1 Introduction

Large language models (LLMs) have demonstrated a remarkable general-purpose problem-solving capacity across a wide range of complex tasks, including classical NLU (Brown, 2020), forecasting (Gruver et al., 2023), mathematics (Trinh et al., 2024), spatial reasoning (Patel & Pavlick, 2022), and many other areas.
For a large majority of individual tasks, model capabilities increase monotonically as the next-token prediction loss from the pretraining objective decreases. A conceptually useful story about the learning process involves the model accommodating predictive subprograms of progressively larger computational depth and complexity. During pretraining, shallow details like word frequencies, syntax, and grammar are absorbed first, followed by higher-level structures such as facts, relations, and idioms, eventually giving way to yet higher-level patterns.

For reasons not yet well understood, this process manifests in the pretraining objective as a power law for LLMs and other generative models on natural data. The frontier of best achievable performance given a fixed computational budget $C$ obeys a predictable power-law relationship $L(C) \propto C^{-\gamma}$ over many orders of magnitude (Kaplan et al., 2020), varying considerably with the kind of data (Henighan et al., 2020) but only weakly with model architecture and training method (Bahri et al., 2021). Effort in quantifying this relationship in a given domain, and how it varies as model size and dataset size are traded off, has been extremely valuable in guiding where resources are spent in constructing more capable AI models (Brown, 2020; Besiroglu et al., 2024; OpenAI, 2023; Dubey et al., 2024) and charting a path for the future.

\*Equal advising.

In this work, we target the why of scaling laws. While mathematically simple toy models or toy data are valuable, we aim to study the why of scaling laws on real models and real data by focusing on one contribution to the scaling law curve: the token-wise generalization gap. Constructing a generalization bound sensitive enough to capture the small differences between architectures and yet simple enough to write down in a short formula is
likely impossible; however, even the broad strokes of behavior, such as how generalization scales with compute, have not been addressed. Thus, here we focus on high-level understanding rather than algorithmic intervention. We can observe that in order for the generalization gap not to eventually dominate the contributions to the test loss (which has not yet been observed), it must be that the generalization gap decreases at least as fast as the approximation gap does. Deeply understanding the success of the LLM pretraining-scaling paradigm requires being able to predict that this would be the case.

In order to construct the relevant generalization bounds, we introduce a novel empirical Freedman-type concentration inequality (Freedman, 1975). Our generalization bound highlights three critical components: the ratio of parameters per token in compute-optimal scaling (which is roughly constant), the token-wise loss variance (which decreases with model size), and the performance gap between quantized and unquantized models (which also decreases with model size). As an alternative to quantization, we bound the information transfer between the dataset and the model, showing that the information content in the model grows sublinearly with model size, and thus that the complexity decreases with model size. These components collectively contribute to a predictable reduction in the generalization gap as models grow larger.

## 2 Background

### 2.1 Generalization Bounds

At a high level, we are interested in the expected test error (population risk) $\mathbb{E}_{X' \sim p_D}[R_{h(X)}(X')]$ for a given model (hypothesis) $h$ depending on the training set $X$ but evaluated on a test set $X'$ sampled from the data distribution $p_D$.
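These quantities can be made concrete with a toy Monte Carlo sketch: a fixed hypothesis is evaluated on its training set and on fresh draws from the data distribution, and the difference estimates the generalization gap. The loss function and data distribution below are arbitrary stand-ins for illustration, not anything from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-example loss of a fixed hypothesis h(X); illustrative only.
def loss(x):
    return (x - 0.1) ** 2

train = rng.normal(size=2_000)      # training set X
fresh = rng.normal(size=500_000)    # fresh draws X' ~ p_D

train_risk = loss(train).mean()          # empirical risk R_{h(X)}(X)
population_risk = loss(fresh).mean()     # Monte Carlo estimate of E_{X'}[R_{h(X)}(X')]
gap = population_risk - train_risk       # generalization gap estimate

print(train_risk, population_risk, gap)
```

Because this toy hypothesis was not fit to `train`, the gap here is pure sampling noise; for a model actually trained on `train`, the empirical risk would be biased low and the gap positive.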
One conceptually convenient way of breaking down this quantity is into the irreducible error, approximation gap, and generalization gap:¹

$$\mathbb{E}_{X' \sim p_D}[R_{h(X)}(X')] = \underbrace{R^*(X)}_{\text{Irreducible Error } E} + \underbrace{R_{h(X)}(X) - R^*(X)}_{\text{Approximation Gap } A} + \underbrace{\mathbb{E}_{X' \sim p_D}[R_{h(X)}(X')] - R_{h(X)}(X)}_{\text{Generalization Gap } G}.$$

The first term describes the entropy of natural text, e.g. the amount of truly random information content in the data, which cannot be further explained even when knowing the true data-generating process. The second term describes the approximation gap, capturing the extent to which the trained model is able to fit the training data. This term combines both model capacity, e.g. as described by universal approximation theorems (Cybenko, 1989), as well as optimization, via how well the training algorithm is able to find the given solution. Finally, we have the generalization gap, capturing the extent to which training and testing performance diverge on account of overfitting to the statistically irrelevant regularities in $X$. Though generalization bounds focus on the last term, all three quantities are of interest for understanding LLM behavior. Empirically, it has been observed that the generalization gap for LLMs (at least in the low-epoch regime) tends to be extremely small compared to the other two terms, and we aim to make sense of why this is the case.

Among the simplest generalization bounds is the finite hypothesis bound with a prior, applied to IID data (Shalev-Shwartz & Ben-David, 2014). With probability at least $1 - \delta$,

$$\mathbb{E}_{X' \sim p_D}[R_{h(X)}(X')] - R_{h(X)}(X) \le \Delta \sqrt{\frac{\log 1/P(h) + \log 1/\delta}{2m}},$$

where $m$ is the number of IID data points, $\Delta$ is an upper bound on the range of values the risk can take, and $P(h)$ is a prior distribution over hypotheses in a discrete hypothesis class $\mathcal{H}$. With a judicious choice of prior, $\log 1/P(h)$ can be related to the compressed size of the model measured in nats (Lotfi et al., 2022).

During text pretraining, the individual tokens are not sampled IID.
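Before turning to the non-IID case, a quick numerical illustration of the IID finite-hypothesis bound. It assumes the standard Hoeffding-plus-union-bound form $\Delta\sqrt{(\log 1/P(h) + \log 1/\delta)/(2m)}$; the sample counts and nat counts below are made-up round numbers, not figures from any real training run.

```python
import math

def finite_hypothesis_gap(log_inv_prior, m, delta=0.01, risk_range=1.0):
    """Generalization-gap bound for a discrete hypothesis class with prior
    P(h), m IID samples, confidence 1 - delta, and risk range Delta.
    Assumes the Hoeffding-style form Delta*sqrt((log 1/P(h) + log 1/delta)/(2m))."""
    return risk_range * math.sqrt(
        (log_inv_prior + math.log(1.0 / delta)) / (2.0 * m)
    )

# log 1/P(h) can be read as the compressed model size in nats
# (Lotfi et al., 2022). With far more nats than IID documents, the bound
# is vacuous unless the model is compressed extremely hard. Both inputs
# here are hypothetical round numbers:
loose = finite_hypothesis_gap(log_inv_prior=1e9, m=1e7)   # ~100 nats/document
tight = finite_hypothesis_gap(log_inv_prior=1e5, m=1e7)   # heavy compression
print(loose, tight)
```

The contrast shows why the bound is dominated by the ratio of model description length to the number of IID units: shrinking $\log 1/P(h)$ by four orders of magnitude shrinks the gap bound by two.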
Thus, a generalization bound requires treating entire documents (often thousands of tokens) as the elements over which the empirical risk is computed. Note that modern language models have hundreds of times more parameters than documents they were trained on. With the help of very extreme compression methods, and using smoothing to bound $\Delta$, it is possible to construct nonvacuous bounds (Lotfi et al., 2024a). However, the required compression (greater than 100×) is so severe that it cripples model performance.

¹We note this differs from the commonly referenced estimation-approximation error breakdown (Bottou & Bousquet, 2007) and the bias-variance decomposition (Brown & Ali, 2024); however, the train error and generalization gap breakdown is more useful for our purposes.

In a recent work, Lotfi et al. (2024c) explore breaking generalization down into token-wise generalization, i.e. how the loss varies when each individual predicted token is resampled under the data distribution while the context is kept the same. Splitting up the training dataset $X$ into the sequence of tokens $[X_k]_{k=1}^{D}$, the authors bound $\frac{1}{D}\sum_{k=1}^{D} \mathbb{E}[R_h(X_k \mid X_{<k})]$ [...]

**Theorem 3.1.** Let $[X_k]_{k=1}^{n}$ be a sequence of random variables adapted to a filtration $(\mathcal{F}_k)_{k=0}^{n}$, and let $[Y_k]_{k=1}^{n}$ be such that each $Y_k$ is $\mathcal{F}_{k-1}$-measurable and $Y_k - X_k \le \Delta$ for some $\Delta \ge 0$. Let $K$ be any finite subset of $(0, 1)$. Then, with probability at least $1 - \delta$ over the probability space (filtered by $(\mathcal{F}_k)_{k=0}^{n}$),

$$\frac{1}{n}\sum_{k=1}^{n}\big(\mathbb{E}[X_k \mid \mathcal{F}_{k-1}] - X_k\big) \le \Delta C + \Sigma, \qquad \text{where } C := \frac{1}{n}\log\frac{|K|}{\delta},$$

$$\Sigma = \Sigma\big(C, \Delta, \{Y_k - X_k\}_{k=1}^{n}, K\big) := \Delta \min_{s \in K}\left[\frac{C}{s} + \frac{1}{n}\sum_{k=1}^{n}\frac{v(s A_k)}{s}\right],$$

with $A_k := (Y_k - X_k)/\Delta$ and $v(x) = x - \log(1 + x)$.

The proof is provided in Appendix A.1. This concentration inequality states that the sequence of random variables concentrates on their conditional means, up to a term $\Sigma$ depending on the empirical variation of the loss values. We note that $\Sigma$ can be viewed as a variance term. As we show in Appendix A, using a small $K$, the variance proxy can be upper bounded as $\Sigma \le 2\sqrt{\frac{1}{n}\sum_{k=1}^{n}(X_k - Y_k)^2}$, explicitly related to the empirical variance but with the mean replaced by $Y_k$.
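A sketch of evaluating the variance term on synthetic per-token losses. The closed-form bound $2\sqrt{\frac{1}{n}\sum_k (X_k - Y_k)^2}$ comes directly from the text; for the grid-minimized form, the sketch assumes a Bernstein-style reading $\min_{s \in K}\big[C/s + \frac{1}{n}\sum_k v(sA_k)/s\big]$ with $v(x) = x - \log(1+x)$ and $A_k = |Y_k - X_k|$, which should be treated as an assumption rather than the paper's exact definition; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(loc=3.2, scale=0.3, size=n)   # synthetic per-token NLLs X_k
Y = X + rng.normal(scale=0.1, size=n)        # predictable proxies Y_k (synthetic)

# Closed-form upper bound on the variance term from the text:
sigma_simple = 2.0 * np.sqrt(np.mean((X - Y) ** 2))

# Grid-minimized Bernstein-style form (assumed reading of Sigma):
#   min_{s in K} [ C/s + (1/n) * sum_k v(s*A_k)/s ],
# with v(x) = x - log(1+x); delta and the grid size are arbitrary choices.
def sigma_grid(A, delta=0.05, grid_size=99):
    K = np.linspace(0.01, 0.99, grid_size)
    C = np.log(grid_size / delta) / len(A)   # complexity term log(|K|/delta)/n
    vals = [C / s + np.mean(s * A - np.log1p(s * A)) / s for s in K]
    return min(vals)

sigma_min = sigma_grid(np.abs(X - Y))
print(sigma_simple, sigma_min)   # the grid-minimized form is much tighter
```

When the deviations $X_k - Y_k$ are small, the optimizer picks a moderate $s$ and the minimized value scales like $\sqrt{C \cdot \frac{1}{n}\sum_k A_k^2}$, which is the Bernstein-like behavior the closed-form bound only captures loosely.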
Although the minimization form above is unwieldy, it produces significantly tighter estimates of $\Sigma$ (a factor of roughly 5× smaller). When the loss variation $\Sigma$ is small, concentration to the mean happens at a rate linear in the complexity $C$, rather than the slower $\sqrt{C}$ rate. The finite set $K$ serves merely as a device to control how finely $s$ is optimized, and can be set, for example, as a uniformly spaced grid in $(0, 1)$.

The concentration inequality we present here provides the core result for our generalization bounds, and to the best of our knowledge it is the first martingale concentration inequality to incorporate a variance term that can be evaluated on the original data. We can view this bound as aiming to achieve the benefits that Freedman's inequality has over Azuma's inequality while being fully empirical, replacing the population variance with a fully empirical proxy. Our approach is analogous to the fully empirical Bernstein bound derived in Maurer & Pontil (2009), but in the martingale rather than the IID setting. Unfortunately, the proof technique of Maurer & Pontil (2009) does not carry over to the martingale case, and instead we take a different approach. We derive our concentration inequality in Theorem A.5, making use of a proxy $Y_k$ that is $\mathcal{F}_{k-1}$-measurable but which can take the place of $\mathbb{E}[X_k \mid \mathcal{F}_{k-1}]$ in the variance. In practice, we choose this quantity to be the mean of the model NLL under resampling of the given token according to the model distribution in place of the data distribution.

### 3.2 Extending to a Discrete Hypothesis Class

From the concentration inequality in equation 1, we derive the following discrete hypothesis class generalization bound.

**Lemma 3.2.** Let $X_1, \ldots, X_n$ be a sequence of (possibly dependent) random variables. Let $R_h(X_k \mid X_{<k})$ [...] For any $\delta > 0$, we have, simultaneously for all $h \in \mathcal{H}$, with probability at least $1 - \delta$, $\frac{1}{n}\sum_{k} \mathbb{E}[R_h(X_k \mid X_{<k})$