# Scaling Laws for Precision

Published as a conference paper at ICLR 2025

Tanishq Kumar¹, Zachary Ankner*³,⁴, Benjamin F. Spector², Blake Bordelon¹, Niklas Muennighoff², Mansheej Paul⁴, Cengiz Pehlevan¹, Christopher Ré², Aditi Raghunathan⁵

¹Harvard University ²Stanford University ³MIT ⁴Databricks ⁵Carnegie Mellon University

*Equal contribution. Correspondence to tkumar@college.harvard.edu.

ABSTRACT

Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise precision-aware scaling laws for both training and inference. We propose that training in lower precision reduces the model's effective parameter count, allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute-optimal. We unify the scaling laws for post- and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.

1 INTRODUCTION

Scale has emerged as a central driver of progress in deep learning (Brown et al., 2020). Key work on scaling (Kaplan et al., 2020; Hoffmann et al., 2022) studied the tradeoff between model and dataset size needed to balance performance and compute. However, the precision in which models are trained and served is an important third factor that contributes to both cost and performance. Deep learning is trending towards lower precision: current frontier models like the Llama-3 series are trained in BF16 (Dubey et al., 2024), and there is widespread effort to move the pretraining paradigm to FP8 (Micikevicius et al., 2022). The next generation of hardware will support FP4, and advances in weight-only quantization have led to training in binary and ternary at scale (Ma et al., 2024; Wang et al., 2023). How far will these paradigms go? Specifically, we ask: What are the tradeoffs between precision, parameters, and data? How do they compare for pretraining and inference?

Studying scaling in precision is challenging because work on scaling laws generally aims to abstract away fine-grained implementation details in pursuit of universal functional forms, while work on quantization generally does the opposite, focusing on the details: how quantization is done, with what type, and to what part of the model. In seeking a balance, we consider a variety of plausible functional forms and choose one that abstracts implementation details of quantization away from loss scaling, allowing us to predict loss scaling in many situations of practical interest. This functional form posits that bit precision and parameter count interchangeably contribute to a model's effective parameter count, Neff, and that implementation details, such as which parts of a model are quantized to what precision, interact with loss scaling only through their effect on this quantity. Overall, we study the scaling of the effects of precision on loss as we vary data and parameters, both during and after training.
We first study how the degradation induced by post-train quantization scales with parameters and data. We find that the degradation increases with data, so that for a fixed model, training on additional data after a certain point can be actively harmful if the model will be quantized after training.

[Figure 1 panels: (left) post-quantization val loss vs. token/parameter ratio for INT3-INT6 post-train quantization and a no-PTQ baseline; (right) final val loss for compute-matched quantized-training runs in FP4 (1.76B), FP6 (1.17B), BF16 (440M), and FP32 (220M), with annotated losses 2.997 and 3.009.]
Figure 1: Schematic of key findings. (Left) Training a fixed model size to various data budgets in BF16 and quantizing weights at the end. We find that degradation due to post-train quantization increases with tokens seen during pretraining, so that eventually additional pretraining data can be harmful. (Right) Our scaling suggests training larger models in lower precision can be compute-optimal according to the cost model in Section 4.3. Weights, activations, and attention quantized; all models trained on the same data budget; details in Appendix J.

We then shift our focus to quantized training, examining both the quantization-aware-training (weights only) and low-precision training (weights, activations, attention all quantized) settings. Our scaling laws for pretraining suggest that the compute-optimal pretraining precision is in general independent of compute budget. Surprisingly, however, this independence ceases to hold if model size is constrained, in which case the compute-optimal precision grows slowly in compute. In all, we pretrain a suite of 465 language models in precisions ranging from 3 to 16 bits, as well as post-train quantize each to multiple precisions. For a language model with N parameters, trained on D tokens with training precision Ptrain and post-train weight precision Ppost, we ultimately find a unified scaling law that takes the following form:

L(N, D, Ptrain, Ppost) = A Neff^{-α} + B D^{-β} + E + δPTQ(Neff, D, Ptrain, Ppost)    (1)

where the first three terms capture training-time effects (the usual Chinchilla form with N replaced by Neff), δPTQ captures post-training effects (the loss degradation induced by post-training quantization before inference), and A, B, E, α, β are positive fitted constants. Altogether, our results for post-train quantization illustrate how more pretraining FLOPs do not always lead to better models at inference time, and our results for low-precision pretraining suggest that both the standard practice of training models in 16-bit and the race to extremely low (sub-4-bit) pretraining precision may be suboptimal.

2 BACKGROUND, RELATED WORK, AND SETUP

Notation. Throughout, D denotes dataset size in tokens and N denotes model size in parameters. Pw, Pa, Pkv refer to the bit precision, in integer type, of the weights, activations, and key-value cache ("attention")¹ during training, and Ppost refers to the precision we post-train quantize (PTQ) the weights to at the end for model inference. When P or Ptrain is used without reference to a part of the model, all three model parts are tied to the same precision. The inference-time loss degradation induced by post-train quantization is denoted δPTQ(N, D, Ptrain, Ppost), and it is defined as the change in loss from performing post-training quantization compared to the end of pretraining. We use "high precision" to mean 16-bit or above.

¹We study KV, rather than QKV, because understanding scaling in the KV cache alone is important for many inference settings. For pretraining claims in Section 4.3, we quantize the entire attention computation, including queries, finding that additionally quantizing the query vectors makes a negligible difference to scaling.
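To make the decomposition in Equation 1 concrete, here is a minimal sketch in Python of how the quantities just defined fit together. The constant values are hypothetical placeholders rather than our fitted values (those are listed in Appendix K), and δPTQ is stubbed to zero here; Sections 3 and 5 give its functional form.

```python
import math

# Hypothetical placeholder constants; the paper's fitted values are listed in Appendix K.
A, B, E = 400.0, 1800.0, 1.7
alpha, beta = 0.34, 0.28

def delta_ptq(n_eff: float, d: float, p_train: float, p_post: float) -> float:
    """Post-training degradation term of Equation 1. Stubbed to zero here; Section 3
    (Equation 2) and Section 5 (Equation 9) model it explicitly."""
    return 0.0

def loss(n_eff: float, d: float, p_train: float = 16, p_post: float = 16) -> float:
    """Equation 1: the usual Chinchilla form evaluated at the effective parameter
    count N_eff, plus the degradation incurred by post-train quantization."""
    return A * n_eff ** -alpha + B * d ** -beta + E + delta_ptq(n_eff, d, p_train, p_post)

print(loss(n_eff=220e6, d=26e9))  # reduces to the Chinchilla form when delta_PTQ = 0
```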
2.1 QUANTIZATION FUNDAMENTALS: HOW, WHAT, WHEN

The Problem: Compute- vs. Memory-Bound Workloads. Most deep learning workloads are bottlenecked either by compute, in the form of matrix multiplications, or by memory bandwidth, in the form of data movement between different parts of the GPU. Different types of workloads have different bottlenecks: most time during pretraining is spent doing large matrix multiplications, so it is compute-bound; in contrast, small-batch inference is bandwidth-bound by model weights, long-sequence decoding is bandwidth-bound by the KV cache, and so on. This motivates studying scaling in the training precision of the weights, activations, and KV cache both in isolation and in combination.

Quantization: How. Quantization of an operation typically refers to rounding the values in the matrices involved in some computation on the forward or backward pass, depending on what is quantized and when. Quantization is usually done to integer or floating-point type.

Quantization: What. Only weights (quantization-aware training). Quantizing only the weights during training does not offer any compute savings because matrix multiplications are still done in high precision. However, this is commonly done to allow weights to adapt to low precision so they can be served at very low precision at inference time, thereby alleviating memory bottlenecks (Ma et al., 2024; Wang et al., 2023). We will refer to this as quantization-aware training and defer additional discussion to Appendix D. Weights, activations, attention (low-precision training). Quantizing activations and attention in addition to weights allows for compute gains because matrix multiplications can be done in low precision (if the hardware supports it), since everything is in the same precision. We will refer to this setting as low-precision training to distinguish it from quantization-aware training.

Quantization: When. Quantization can be done during or after training. In practice, when seeking to reduce inference-time memory costs, one first attempts post-train quantization. If that degrades the model too much, quantization-aware training is used. Post-train quantization is typically only applied to model weights (Frantar et al., 2022; Dettmers et al., 2022; Lin et al., 2023; Xiao et al., 2023). To reduce pretraining costs, low-precision training is needed. We study scaling laws for post-training quantization in Section 3 and for quantized training in Section 4 (examining both quantization-aware training and low-precision training), and unify the two in Section 5. The numerical values of all our fitted constants can be found in Appendix K.
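As a concrete illustration of the "how" above, the sketch below performs simple symmetric round-to-nearest quantization of a weight tensor to a b-bit integer grid with per-tensor absmax scaling. This is a generic illustration of integer quantization, not the GPTQ procedure used later in the paper, and the tensor shape is arbitrary.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest quantization of a tensor to a `bits`-bit integer
    grid, then mapped back to float. Illustrative only (per-tensor absmax scaling)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit symmetric integers
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                            # dequantized values used downstream

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
for b in (3, 4, 6, 8):
    err = np.mean((w - quantize_dequantize(w, b)) ** 2)
    print(f"INT{b}: mean squared rounding error {err:.2e}")
```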
2.2 SCALING LAWS AND PARAMETRIC FITS

Scaling Laws. Hoffmann et al. (2022) model loss scaling using the functional form

L(N, D) = A N^{-α} + B D^{-β} + E

where A, B, α, β, E are positive fitted constants, finding that data and parameters should be scaled in roughly equal proportion as more compute becomes available. We will refer to the scaling prescription of Hoffmann et al. (2022) as Chinchilla-optimal, or just Chinchilla, and note this is often used colloquially to mean that D/N ≈ 20 is pretraining compute-optimal.

On the theoretical front, work on scaling laws (Bahri et al., 2024; Bordelon et al., 2024; Lin et al., 2024b) finds that noise added to various parts of the model or data affects loss in a predictable way. While previous works have explored the scaling behavior of post-training quantization in terms of total model bits (Dettmers & Zettlemoyer, 2023) and knowledge capacity (Allen-Zhu & Li, 2024), we focus instead on data scaling. We note that, in general, the exact fitted values of coefficients and exponents can vary drastically based on small implementation differences: Besiroglu et al. (2024) find different constants when attempting to replicate Hoffmann et al. (2022), and Sardana & Frankle (2023) fit coefficients A, B of different orders of magnitude. For this reason, we emphasize that our contribution is not the numerical values we fit, but the trends and functional forms we identify.

Overtraining. In practice, accounting for inference costs means training smaller models for substantially longer than Chinchilla-optimal (Sardana & Frankle, 2023; Gadre et al., 2024). For instance, Llama-3-8B is trained to D/N ≈ 2000 (Dubey et al., 2024) and the Gemma-2 series up to D/N > 1000 (Team et al., 2024). We refer to such models as overtrained in this paper, with the token/parameter ratio D/N being a key quantity throughout. Work on inference-time compute (Snell et al., 2024; Brown et al., 2024) and on synthetic and multimodal data (Yang et al., 2024; Fan et al., 2024; Bauer et al., 2024) suggests future models may be even more overtrained. Therefore, modern work on scale must consider ratios much larger than Chinchilla-optimal, and in this work we perform experiments up to D/N ≈ 10³ and analyze the predictions of our scaling law for up to D/N ≈ 10⁵. See Appendix B for additional related work.

[Figure 2 axes: token/parameter ratio (x) vs. post-quantization val loss (top row) and PTQ degradation δPTQ (bottom row), for INT3-INT6 quantization and a no-PTQ baseline, one column per model size.]
Figure 2: Loss degradation from PTQ increases with data. The top row is loss after PTQ; the bottom row is loss degradation compared to the end of training, before PTQ. The top row is thus the gray line in each plot plus the corresponding value in the bottom row. We can see that degradation grows with data; the bottom row is fitted with Equation 2. For D/N sufficiently large (left), loss can increase in data. Even at lower D/N, where post-quant loss continues to decrease with data, the value of data is reduced compared to the baseline. R² = 0.97 over all fitted points (bottom row).

We train and evaluate a suite of OLMo-style models on the Dolma V1.7 dataset (Groeneveld et al., 2024; Soldaini et al., 2024), using a standard Transformer++ implementation; see Appendix A for hyperparameters and ablations. Our experiments consist of a sweep of language model pretraining runs over N ∈ {30, 60, 110, 220} million parameters (non-embedding) and D ∈ {1.5, 3, 6, 13, 26} billion tokens. Our model sizes are relatively small because we train up to a very high D/N ≈ 10³ to study data scaling, and we set off over 20 runs at every (N, D): we sweep 8 values of precision for each of the weights, activations, and attention.

3 SCALING LAWS FOR POST-TRAIN QUANTIZATION

The easiest and most common quantization technique is post-train quantizing a model off the shelf (Chee et al., 2024; Huang et al., 2024; Dettmers et al., 2022; Lin et al., 2023; Xiao et al., 2023).
In this section, we consider models trained in BF16 and use GPTQ (Frantar et al., 2022) to post-train quantize them, replicating our findings with two other methods in Appendix F. We quantify the resulting loss degradation δPTQ, finding that post-train quantization scales poorly in data.

3.1 OVERTRAINED MODELS DEGRADE MORE WHEN POST-TRAIN QUANTIZED

We consider different model sizes (columns) trained on various data budgets (x-axis of each plot) and plot in Figure 2 both the loss after post-train quantization (top row) and the degradation incurred relative to the end of training (bottom row). We find that the degradation δPTQ increases with training data size across all model sizes, but that for a fixed dataset size larger models incur a smaller degradation. We additionally observe that δPTQ increases exponentially as we decrease the precision we quantize to. Based on these observations, we model δPTQ as taking the form

δPTQ(N, D, Ppost) = CT (D^{γD} / N^{γN}) e^{-Ppost/γpost}    (2)

where CT, γD, γN, γpost are positive fitted constants. As we find the fitted values of γD and γN to be similar (see Appendix K for numerical values), we can think of this as an approximate power law in the token/parameter ratio D/N. The intuition for this poor data scaling might be that as models train on more data, they compress more information into their weights, so that perturbations to the weights in the form of quantization are more harmful to loss, all else equal. We discuss formal theoretical interpretations in Appendix H.

This finding implies that for models that will be post-train quantized, there exists an amount of pretraining data beyond which additional data is actively harmful to performance at inference time (see top left, Figure 2). This can be defined as the point where additional data increases post-train degradation more than it decreases loss during pretraining. We solve analytically for this critical data size in Appendix E, and also analyze a cost model for workloads where inference cost is the primary concern. We summarize our first scaling finding as follows.

Finding 1. Overtrained language models are more sensitive to post-training quantization. For models trained in BF16 or above, we can model this loss degradation as

δPTQ(N, D, Ppost) = CT (D^{γD} / N^{γN}) e^{-Ppost/γpost}

where CT, γD, γN, γpost are positive fitted constants. This implies that when D/N is sufficiently large, or Ppost sufficiently small, loss after quantization can increase as models are pretrained for longer, as in Figure 2.

We will revisit and modify Equation 2 in Section 5 to account for the effects of training in low precision on δPTQ.
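The sketch below evaluates Equation 2 and numerically locates the critical dataset size beyond which additional pretraining data raises post-quantization loss, by combining the degradation with the Chinchilla data term. All constant values are hypothetical placeholders rather than our fitted ones, so only the qualitative behavior is meaningful.

```python
import math

# Hypothetical placeholder constants (the fitted values are in Appendix K).
B, beta = 1800.0, 0.28
C_T, gamma_D, gamma_N, gamma_post = 0.05, 0.5, 0.5, 2.0

def delta_ptq(n: float, d: float, p_post: float) -> float:
    """Equation 2: degradation grows in D, shrinks in N, decays exponentially in P_post."""
    return C_T * d ** gamma_D / n ** gamma_N * math.exp(-p_post / gamma_post)

def post_quant_data_terms(n: float, d: float, p_post: float) -> float:
    """The D-dependent part of post-quantization loss: Chinchilla data term + delta_PTQ."""
    return B * d ** -beta + delta_ptq(n, d, p_post)

def critical_tokens(n: float, p_post: float) -> float:
    """D* where more data starts hurting: d/dD [B*D^-beta + delta_PTQ(D)] = 0."""
    k = gamma_D * C_T * math.exp(-p_post / gamma_post) / n ** gamma_N
    return (beta * B / k) ** (1.0 / (beta + gamma_D))

n = 220e6
for p_post in (3, 4, 6):
    print(f"P_post={p_post}: critical data size D* ~ {critical_tokens(n, p_post):.2e} tokens")
```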
4 SCALING LAWS FOR QUANTIZED TRAINING

In this section we study pretraining with weights, activations, and KV cache in various precisions. Importantly, only training precision, not test-time precision, is varied in this section; we discuss the interaction between train- and test-time precision in Section 5. We sweep the training precisions of the weights, activations, and KV cache Pw, Pa, Pkv ∈ [3, 12] individually, as well as training BF16 baselines. We also pretrain models with arbitrary combinations of Pw, Pa, Pkv to validate our scaling laws. To perform quantization during training, we quantize the forward pass in integer type unless otherwise noted; see Appendix D for implementation details.

4.1 QUANTIZATION-AWARE-TRAINING: QUANTIZING WEIGHTS DURING TRAINING HAS A CONSISTENT AND PREDICTABLE EFFECT

We first examine the trade-off between weight precision Pw and parameters N while holding Pa = Pkv fixed at high precision. We fix D = 13B tokens and perform a grid sweep over combinations of N and Pw. We plot the resulting iso-loss contours, where we linearly interpolate the final loss values, in Figure 3. We observe that the bit precision of the weights can be traded off for the number of parameters, i.e., a model with smaller N but larger Pw can achieve the same loss as a model with larger N but smaller Pw. Additionally, we find that the gains from increasing the bit precision of the weights are large at lower precisions but saturate at higher precisions (typically around 6-7 bits per weight).

In line with the empirical trends in Figure 3, we find the best fit for the tradeoff between weight precision and parameters to be Neff(N, Pw) = N(1 - e^{-Pw/γw}), where γw is a fitted constant measuring the sensitivity of the model weights (alternative fits are explored in Appendix K). We therefore modify Chinchilla scaling to account for Neff by making the substitution N ↦ Neff(N, Pw), giving the modified form

L(N, D) = A [N(1 - e^{-Pw/γw})]^{-α} + B D^{-β} + E    (3)

where we recall that A, B, E, α, β are fitted positive constants in the usual Chinchilla scaling form, and γw is a fitted constant we introduce. We plot the predictions of our fit compared to observed values in Figure 4 for a range of (N, D).

[Figure 3 panels: Neff/N vs. precision (bits) for weights, activations, KV cache, and all tied; empirical and predicted iso-loss contours over N (millions) and weight precision.]
Figure 3: (Left) Neff/N from our final scaling law. Our fit of Neff(N, Pw) in this section is the first step towards this (blue). Empirical (center) and predicted (right) iso-loss contours illustrating the precision-parameter tradeoff. The y-axis is weight precision during quantized training. All runs plotted are trained on D = 13B tokens. Predictions are from a fitted version of Equation 3; darker lines correspond to lower loss.

[Figure 4 panels: final val loss vs. Pw (3-8 bits) at 3.3B, 13.1B, and 26.2B tokens for 30M, 60M, 110M, and 220M models.]
Figure 4: Predicting final validation losses L(N, D, Pw) for various N, D, Pw to test our proposed functional form. Points are experimental values; lines are predictions of a single parametric fit of the form in Equation 3. We train only two model sizes at 26B tokens due to compute constraints.
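A minimal sketch of Equation 3 follows: the effective parameter count shrinks as weight precision falls, and the loss is the usual Chinchilla form evaluated at Neff. The constants here are illustrative placeholders, not the fitted values from Appendix K.

```python
import math

A, B, E, alpha, beta = 400.0, 1800.0, 1.7, 0.34, 0.28   # placeholder Chinchilla constants
gamma_w = 3.5                                            # placeholder weight sensitivity

def n_eff_weights(n: float, p_w: float) -> float:
    """N_eff(N, P_w) = N * (1 - exp(-P_w / gamma_w)); gains saturate around 6-7 bits."""
    return n * (1.0 - math.exp(-p_w / gamma_w))

def loss_qat(n: float, d: float, p_w: float) -> float:
    """Equation 3: Chinchilla loss with N replaced by N_eff(N, P_w)."""
    return A * n_eff_weights(n, p_w) ** -alpha + B * d ** -beta + E

# Precision-parameter tradeoff: a smaller model with higher weight precision can
# land near a larger model trained with very low-precision weights.
print(loss_qat(n=110e6, d=13e9, p_w=8), loss_qat(n=220e6, d=13e9, p_w=3))
```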
4.2 LOW-PRECISION-TRAINING: THE EFFECTS OF QUANTIZING WEIGHTS, ACTIVATIONS, AND ATTENTION ARE COMPOSITIONAL AND MULTIPLICATIVE

Quantization-aware training does not change the cost of pretraining. This is because modern GPUs require the inputs to a matrix multiplication to have the same precision, i.e., Pw = Pa = Pkv (Micikevicius et al., 2022). To understand the interplay between precision and pretraining compute, we must now analyze the scaling behavior of Pa and Pkv as well. Note that in our training experiments we only quantize the forward pass, to ensure a fair comparison between quantization-aware training (weights only) and the additional quantization of activations and KV cache; see Appendix D.

Precision of activations and KV cache affects loss in a similar way. We first verify in Appendix Figure 20 that varying Pa and Pkv in isolation gives rise to scaling behavior that is best fit by a functional form analogous to the form for Pw (Equation 3; Figure 5, left). We refer to the scaling coefficients computed by varying the precision of just one part of the model at a time as marginally fitted constants, and those found by fitting on runs that include multiple model components in low precision at the same time as jointly fitted constants.

Constants fitted marginally and jointly make similarly good predictions. We now turn our attention to understanding the interactions between weights, activations, and attention. If the effects of quantizing weights, activations, and attention are independent, then a factorized, multiplicative interaction of the following form is a natural proposal:

Neff(P) = N(1 - e^{-Pw/γw})(1 - e^{-Pa/γa})(1 - e^{-Pkv/γkv})    (4)

We test whether this independence approximately holds by comparing the predictive power of a model with marginally fitted constants and a model with jointly fitted constants. We show the predictive power of both models in Figure 5(b, c), finding that both methods for fitting constants have approximately the same predictive power. These results suggest that the independence assumption is reasonable. We present further evidence that this factorized functional form is a strong fit to the data, and discuss alternative factorization schemes, in Appendix M.

[Figure 5 panel fits: Pw marginal sweep (MSE 0.0028, R² 0.9655); joint fit f(Pw, Pa, Pkv) (MSE 0.0086, R² 0.9006); combined marginals f(Pw)f(Pa)f(Pkv) (MSE 0.0089, R² 0.8973).]
Figure 5: (Left) Predicted loss based on fitted values with Equation 4. (Center) Fitting γ parameters jointly on sweeps with combinations of precisions vs. (right) fitting them on marginal sweeps where only one model part is in low precision at a time. Outliers are those at extremely low precision whose training runs are sometimes unstable.

Finding 2. The effects of quantizing the weights, activations, and KV cache during training are well modeled as independent and multiplicative, so that L(N, D, Pw, Pa, Pkv) = A Neff^{-α} + B D^{-β} + E, where Neff(Pw, Pa, Pkv) = N(1 - e^{-Pw/γw})(1 - e^{-Pa/γa})(1 - e^{-Pkv/γkv}), for which we fit constants γw, γa, γkv that reflect the different sensitivities of weights, activations, and KV cache. If the three precisions are set to the same value P, as in pretraining, this simplifies to Neff(P) ≈ N(1 - e^{-P/γ̄})³, where γ̄ is the average of the three parameters. We visualize this functional form with our fitted values in Figure 3 (left).

4.3 IMPLICATIONS FOR PRETRAINING

When training in a precision P, meaning Pw = Pa = Pkv = P, compute cost scales linearly in P (Abdelkhalik et al., 2022).² Hoffmann et al. (2022) performed all experiments in 16-bit precision and use a cost model of C = 6ND FLOPs. We generalize this to C = (6/16) N D P to account for the linear relation between compute and precision, which reduces to the Chinchilla cost function at P = 16.

²In practice, the gains are less than linear due to systems overhead.

We now examine three practically relevant variants of the following optimization problem:

min_{N, D, P} L(N, D, P) = A [N(1 - e^{-P/γ̄})³]^{-α} + B D^{-β} + E   subject to   C = (6/16) N D P

Since the derivations are algebraically involved, we work up to proportionality and verify proposed solutions numerically. See Appendix E for mathematical details. We note that the implications of our functional form hold no matter the scale at which future experiments are done, but the numerical values we predict depend on our fitted constants, which are fitted on smaller-scale, integer-type experiments.
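Below is a small sketch of the two quantities entering this optimization: the multiplicative Neff from Finding 2 and the precision-aware cost model C = (6/16)·N·D·P. The γ values are illustrative placeholders, not our fitted constants.

```python
import math

gammas = {"w": 3.5, "a": 4.5, "kv": 4.0}   # placeholder per-part sensitivities

def n_eff(n: float, p_w: float, p_a: float, p_kv: float) -> float:
    """Finding 2: independent, multiplicative effect of each part's training precision."""
    return (n
            * (1 - math.exp(-p_w / gammas["w"]))
            * (1 - math.exp(-p_a / gammas["a"]))
            * (1 - math.exp(-p_kv / gammas["kv"])))

def train_flops(n: float, d: float, p: float) -> float:
    """C = (6/16) * N * D * P, reducing to the Chinchilla 6*N*D cost at P = 16."""
    return 6.0 / 16.0 * n * d * p

# Tying all three precisions to P gives N_eff ~ N * (1 - exp(-P/gamma_bar))**3.
print(n_eff(220e6, 8, 8, 8), train_flops(220e6, 13e9, 8))
```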
4.3.1 IF YOU MUST TRAIN IN LOW PRECISION, INCREASE PARAMETERS BEFORE DATA

Minimizing L(N, D) with P fixed, subject to C ∝ NDP. With some algebra, we find that at precision P and compute budget C, the optimal allocations N*, D* of parameters and data relative to the Chinchilla-optimal NCh, DCh are given by

N*(P, C) ∝ NCh(C) [1 - e^{-P/γ̄}]^{-3α/(α+β)} P^{-β/(α+β)}   and   D*(P, C) ∝ DCh(C) [1 - e^{-P/γ̄}]^{3α/(α+β)} P^{-α/(α+β)}    (6)

which suggests that as the precision of training decreases at fixed compute, we should increase parameters and decrease data. The interpretation is that at very low precisions our effective parameter count vanishes, so increasing parameter count is compute-optimal since data egregiously outstrips effective parameters.

[Figure 6 panels: "Predicted: Quantized Training (INT)" and "Empirical: Quantized Training (FP)", final val loss vs. training precision for compute-matched models from INT4/FP4 (1.76B) to INT32/FP32 (220M); and "P*(D) for Various N", model precision vs. dataset size with an irreducible-loss reference.]
Figure 6: Scaling law predictions (left, fitted on integer type) vs. empirical values (center, floating-point type). Precision of weights, activations, and attention fixed to Ptrain. Predictions closely match the empirical trend, but are shifted up by a small amount since floating-point is a more expressive type and will incur lower loss at the same precision. (Right) When N is held fixed, compute-optimal precision increases approximately logarithmically with data. Markers correspond to predicted compute-optimal precision for Llama-3 (8B, 70B, 405B), denoted by (circle, triangle, star) at each IsoFLOP line, illustrating how compute-optimal precision increases in data when model size is held fixed.

4.3.2 COMPUTE-OPTIMAL PRETRAINING PRECISION IS IN GENERAL INDEPENDENT OF COMPUTE

Jointly minimizing L(N, D, P) with C ∝ NDP. This is the setting of pretraining without constraints on N, D, P except for a fixed compute budget. Solving this joint minimization problem gives an implicit equation for P*(C). Denoting u(P) = [1 - e^{-P/γ̄}]^{-3α}, we find (see Appendix E) that this equation takes the form

(3α/γ̄) u(P)^{(3α+1)/(3α)} e^{-P/γ̄} = P^{-1} u(P)    (7)

which reveals that in general the optimal pretraining precision is independent of compute budget. This suggests that compute-optimal precision should be held fixed at P* while N, D are scaled according to Equation 6. We find this P* to be around 7-8 bits when fitting our scaling law on runs with quantization done to integer type. This has two consequences: first, it means the de facto practice of training models in 16-bit may be suboptimal. Second, the race to low-precision training may have to stop before going below 4 bits, since this would force model sizes to become disproportionately (more than 4x) larger to maintain loss scaling (see Figure 3, left).
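The sketch below carries out the two optimizations above numerically rather than analytically: for each precision P it allocates N and D Chinchilla-style under C = (6/16)·N·D·P, then compares precisions to locate the compute-independent optimum. All constants are illustrative placeholders, so the printed numbers are not our fitted predictions.

```python
import math

A, B, E, alpha, beta = 400.0, 1800.0, 1.7, 0.34, 0.28   # illustrative placeholders
gamma_bar = 4.0                                          # placeholder average sensitivity

def best_loss_given_precision(compute: float, p: float) -> float:
    """Minimize A*(q*N)**-alpha + B*D**-beta over N, D subject to (6/16)*N*D*p = compute."""
    q = (1 - math.exp(-p / gamma_bar)) ** 3        # N_eff = q * N when all parts share P
    nd = compute / (6.0 / 16.0 * p)                # N*D budget implied by the cost model
    best = float("inf")
    for i in range(1, 2001):                       # log-spaced scan over N from 1e5 to 1e12
        n = 10 ** (5 + 7 * i / 2000)
        d = nd / n
        best = min(best, A * (q * n) ** -alpha + B * d ** -beta)
    return best

# The arg-min over precision barely moves as compute grows: P* is ~independent of C.
for c in (1e19, 1e21, 1e23):
    p_star = min(range(2, 17), key=lambda p: best_loss_given_precision(c, p))
    print(f"C = {c:.0e} FLOPs -> compute-optimal precision ~ {p_star} bits")
```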
We test our predictions in Figure 6 at a larger scale. We train compute-matched models at various parameter counts and precisions, ranging from FP4 to FP32 and 220M to 1.6B parameters. We train in floating-point type since that is standard in pretraining (Groeneveld et al., 2024; Deitke et al., 2024), though our scaling laws are fitted on integer type. We plot our predicted trend in Figure 6 (left) and the empirical values in the middle. We find that scaling fits on integer type are a strong fit until 4-bit precision, at which point the difference between the two types becomes more apparent. The matching of qualitative trends throughout, with the empirical optimum close to the predicted optimum of P* near 7-8 bits, suggests that similar scaling laws may exist across types. We initiate a similar analysis for floating-point type in Appendix ??.

4.3.3 BUT COMPUTE-OPTIMAL PRETRAINING PRECISION CAN INCREASE IN COMPUTE IF MODEL SIZE N IS CONSTRAINED

Minimizing L(D, P) with N fixed, subject to C ∝ NDP. A common use case in practice is to train a suite of models of various sizes on similar data; the Llama-3 and Gemma-2 series (Dubey et al., 2024; Team et al., 2024) are examples. In this setting, N is fixed in advance and only D, P are jointly optimized. Surprisingly, our scaling laws predict that models of differing sizes should not necessarily be trained in the same precision, and that compute-optimal precision scales as P*(C) ∝ log C. Since N is held constant and we show in Appendix E that log C ∝ log D, we can write P*(C) ∝ log(D/N). The intuition is that, for a fixed N, precision acts as a new lever to bring highly overtrained models closer to pretraining optimality³ by reducing D/Neff.

³An important subtlety here is that since models are overtrained for inference, we want to keep the cost of a forward pass, which is proportional to NP, fixed, not just N. While NP is the same for both a model of N₀ parameters in 16-bit and one with 2N₀ parameters in 8-bit, the latter has higher Neff with our fitted γ, so it will reach a lower pretraining loss on the same data with the same training/inference costs.

[Figure 7 panels: actual vs. predicted δPTQ(Neff, D, Ptrain, Ppost) (MSE 5.06e-02, R² 0.9041), and empirical vs. predicted δPTQ heatmaps over training precision Pw and post-training precision Ppost.]
Figure 7: Combined plots for predicting degradation. (Left) demonstrates the quality of our fit on all our runs, including all combinations of pre- and post-training precisions. (Center, right) illustrate visually that our unified degradation form can predict degradation when training and serving in any precision. Plots (center, right) vary Pw only, but fits in (left) include runs where Pa, Pkv are also jointly varied.

Finding 3. When N, D, P are optimized jointly, compute-optimal pretraining precision is independent of compute. 16-bit has many unnecessary bits, and 4-bit requires increasing the model size disproportionately to maintain loss scaling. Our fits imply that 7-8 bits are compute-optimal. In contrast, when N is fixed in advance, such as when training a model family on similar data, P*(C) ∝ log C. This suggests that for models that will be significantly overtrained, higher precision during training may be compute-optimal.
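A sketch of the fixed-N case: for a given model size and compute budget, sweep the training precision, back out D from the cost model, and keep the precision that minimizes predicted loss. Constants are the same illustrative placeholders used above, so only the qualitative trend (P* growing roughly like log C) is meaningful.

```python
import math

A, B, E, alpha, beta = 400.0, 1800.0, 1.7, 0.34, 0.28   # illustrative placeholders
gamma_bar = 4.0                                          # placeholder sensitivity

def predicted_loss(n: float, d: float, p: float) -> float:
    n_eff = n * (1 - math.exp(-p / gamma_bar)) ** 3
    return A * n_eff ** -alpha + B * d ** -beta + E

def best_precision_fixed_n(n: float, compute: float) -> int:
    """Sweep P; the cost model C = (6/16)*N*D*P then determines the data budget D."""
    def loss_at(p: int) -> float:
        d = compute / (6.0 / 16.0 * n * p)
        return predicted_loss(n, d, p)
    return min(range(3, 17), key=loss_at)

n = 1e9                                                  # model size fixed in advance
for c in (1e21, 1e22, 1e23, 1e24):
    print(f"C = {c:.0e}: compute-optimal precision = {best_precision_fixed_n(n, c)} bits")
```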
5 A UNIFIED SCALING LAW FOR PRECISION

In this section, we combine the two scaling laws presented so far into a unified functional form that predicts both training-time and post-training effects, including interactions between the two. We now treat δPTQ as a function δPTQ(N, D, Ptrain, Ppost) rather than just δPTQ(N, D, Ppost) as we did earlier in Section 3. We find two competing effects at play when predicting δPTQ, but overall, models trained in lower precision are more robust to post-train quantization in the sense of incurring lower degradation.

Two competing effects at play during post-train quantization. Intuitively, training any of Pw, Pa, Pkv in low precision forces the model to learn weights that are robust to quantization noise, so they degrade less under PTQ. However, the reduction N ↦ Neff implies that models trained in low precision will degrade more, because δPTQ scales as N^{-γN}, as we found in Section 3. We call this second effect the overtraining effect. In practice, the first, robustification effect wins out, so that models trained in lower precision overall degrade less when post-train quantized. We confirm in Figure 21 (top left, top center) that using Neff rather than N to predict degradation across training precisions gives a substantially stronger fit, verifying the competing overtraining effect.

Modifying δPTQ to account for training precision. We assume training precision is strictly greater than inference precision, and define degradation as identically zero if they are equal. We begin by studying how degradation scales with just the weight precision during training, Pw. Consider Figure 7 (center): we fix (N, D), and each cell of the heatmap represents the empirical degradation δPTQ(Pw, Ppost). We observe that degradation very quickly increases to its exponentially large value from Section 3 if there is any gap between training and inference-time precision. This motivates modifying our initial functional form fitted in Section 3 to

δPTQ(N, D, Pw, Ppost) = CT (D^{γD} / Neff^{γN}) e^{-Ppost/γpost} [1 - e^{-Cw(Pw - Ppost)}]    (8)

where the leading factors constitute the overtraining effect, the bracketed factor the robustification effect, and Cw is the only new fitted value. We can then extend this to include the precision effects of activations and attention in the natural way:

δPTQ(N, D, Pw, Pa, Pkv, Ppost) = CT (D^{γD} / Neff^{γN}) e^{-Ppost/γpost} ∏_{x∈{w,a,kv}} [1 - e^{-Cx(Px - Ppost)}]    (9)

We measure the fit of this functional form to the data in Figure 7, finding a strong fit with R² = 0.90 on over 1000 data points (each of our 465 pretraining runs post-train quantized to multiple precisions).

An interpretable, unified functional form. Now we simplify and interpret the resulting functional form. Consider training with only the weights in low precision, and take Cw = 1 for illustrative purposes so we can simplify Equation 9. Denote by σ²_tr := e^{-Pw/γw} the training noise reflecting the decrease in effective parameter count due to training weights in lower precision, and analogously by σ²_PTQ := e^{-Ppost/γpost} the post-train quantization noise. Then Equation 9 simplifies to

δPTQ(N, D, Ptrain, Ppost) = CT (σ²_PTQ - σ²_tr) (D^{γD} / Neff^{γN})    (10)

where the first factor is the robustification effect and the second the overtraining effect. We note this is the intuitive modification one might make to the form of the initial post-training quantization degradation we fitted in Section 3, in Finding 3.1, with a small competing-effects factor from Neff pushing in the opposite direction. It cleanly reflects the intuition that models are robustified to PTQ noise to the extent that they were trained with similar noise.
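Putting the pieces together, the sketch below evaluates the unified degradation of Equation 9 (overtraining effect times per-part robustification factors) on top of the training-time loss of Finding 2. All constants are hypothetical placeholders rather than our fitted values.

```python
import math

# Placeholder constants throughout (fitted values are in Appendix K).
A, B, E, alpha, beta = 400.0, 1800.0, 1.7, 0.34, 0.28
gammas = {"w": 3.5, "a": 4.5, "kv": 4.0}
C_T, gamma_D, gamma_N, gamma_post = 0.05, 0.5, 0.5, 2.0
C_robust = {"w": 1.0, "a": 1.0, "kv": 1.0}

def n_eff(n, p_w, p_a, p_kv):
    return (n * (1 - math.exp(-p_w / gammas["w"]))
              * (1 - math.exp(-p_a / gammas["a"]))
              * (1 - math.exp(-p_kv / gammas["kv"])))

def delta_ptq(n, d, p_w, p_a, p_kv, p_post):
    """Equation 9: overtraining effect (grows with D, shrinks with N_eff) times one
    robustification factor per model part; zero if served at the training precision."""
    ne = n_eff(n, p_w, p_a, p_kv)
    over = C_T * d ** gamma_D / ne ** gamma_N * math.exp(-p_post / gamma_post)
    robust = 1.0
    for part, p in (("w", p_w), ("a", p_a), ("kv", p_kv)):
        robust *= 1 - math.exp(-C_robust[part] * (p - p_post))
    return over * robust

def unified_loss(n, d, p_w, p_a, p_kv, p_post):
    """Equation 11: Chinchilla form at N_eff plus post-training degradation."""
    ne = n_eff(n, p_w, p_a, p_kv)
    return A * ne ** -alpha + B * d ** -beta + E + delta_ptq(n, d, p_w, p_a, p_kv, p_post)

# Training at 6-bit vs. 16-bit before 4-bit PTQ: the robustification factor shrinks
# delta_PTQ, while the smaller N_eff raises the training-loss term and the overtraining factor.
print(unified_loss(220e6, 26e9, p_w=16, p_a=16, p_kv=16, p_post=4))
print(unified_loss(220e6, 26e9, p_w=6, p_a=6, p_kv=6, p_post=4))
```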
Finding 4 (Unified Scaling Laws). Modeling low-precision effects during pretraining as independent and multiplicative noise that accumulates, and including post-training quantization degradation, the predicted loss for a language model with N parameters, trained on D tokens with training precisions Pw, Pa, Pkv and post-train quantized to weight precision Ppost, is

L(N, D, Pw, Pa, Pkv, Ppost) = A Neff^{-α} + B D^{-β} + E + δPTQ    (11)

where δPTQ(N, D, Pw, Pa, Pkv, Ppost) is in general as in Equation 9 and Neff(N, Pw, Pa, Pkv) is as in Finding 4.2.

6 CONCLUSION AND LIMITATIONS

We find that the common inference-time technique of post-train quantization can incur large degradation at very high data budgets, demonstrating a striking example of how more pretraining compute does not always imply stronger models at inference time. Seeking better data scaling, we study quantization-aware and low-precision training. We find that parameters and bit precision are well modeled as interchangeably controlling an effective parameter count of the model, which allows us to predict finite-precision loss effects accurately during both training and inference.

There are limitations to our analysis. First, we use a fixed architecture throughout to examine the effects of precision, parameters, and tokens in a controlled manner. In contrast, low-precision training often involves architectural tweaks (Ma et al., 2024; Zhu et al., 2024) that can close much of the gap from a vanilla full-precision model. Second, while compute costs do scale linearly with precision, the gains from halving precision are usually less than 2x due to systems overhead. Third, we only consider loss scaling, without downstream model evaluations. We emphasize that the trends we find aim to be suggestive rather than prescriptive, and we hope future work can more comprehensively examine these effects at larger model scale. In all, we find that the effects of precision on loss are predictable and consistent, with important and surprising implications.

7 ETHICS STATEMENT

We study the efficient training of language models, and as such do not see any new ethical concerns arising as a result of our work.

8 ACKNOWLEDGEMENTS

Tanishq Kumar thanks Tim Dettmers, Chris De Sa, Neil Band, and Luke Bailey for helpful comments and discussion, as well as Ludwig Schmidt for spotting an early typo. Blake Bordelon is supported by a Google PhD Fellowship. Cengiz Pehlevan is supported by NSF grant DMS-2134157, NSF CAREER Award IIS-2239780, and a Sloan Research Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. Aditi Raghunathan acknowledges support from the AI2050 program by Schmidt Sciences (Grant G2264481), Google Research Scholar, Apple, NSF, and Cisco. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize); NSF under Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under No. N000142312633 (Deep Signal Processing); Stanford HAI under No. 247183; and NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, and the Stanford Data Science Initiative (SDSI). Benjamin F.
Spector is supported by a Hertz Fellowship. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government. Emmanuel Abbe, Enric Boix-Adsera, Matthew S Brennan, Guy Bresler, and Dheeraj Nagaraj. The staircase property: How hierarchical structure can guide deep learning. Advances in Neural Information Processing Systems, 34:26989 27002, 2021. Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pp. 4782 4887. PMLR, 2022. Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed A Badawy. Demystifying the nvidia ampere architecture through microbenchmarking and instruction-level analysis. In 2022 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1 8. IEEE, 2022. Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pp. 265 279. PMLR, 2023. Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Ust un, and Sara Hooker. Intriguing properties of quantization at scale. Advances in Neural Information Processing Systems, 36:34278 34294, 2023. Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300 22312, 2022. Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. A survey on data selection for language models. ar Xiv preprint ar Xiv:2402.16827, 2024. Published as a conference paper at ICLR 2025 Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don t reach for the stars! ar Xiv preprint ar Xiv:2301.03988, 2023. Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. ar Xiv preprint ar Xiv:2404.05405, 2024. Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, and Cengiz Pehlevan. The onset of variance-limited behavior for networks in the lazy and rich regimes. ar Xiv preprint ar Xiv:2212.12147, 2022. Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. ar Xiv preprint ar Xiv:2309.16609, 2023. Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: Sgd learns parities near the computational limit. Advances in Neural Information Processing Systems, 35:21750 21764, 2022. Andr e Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, and Ian Foster. 
Comprehensive exploration of synthetic data generation: A survey. ar Xiv preprint ar Xiv:2401.02524, 2024. Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You. Chinchilla scaling: A replication attempt. ar Xiv preprint ar Xiv:2404.10102, 2024. Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit. ar Xiv preprint ar Xiv:2309.16620, 2023. Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. ar Xiv preprint ar Xiv:2402.01092, 2024. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R e, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. ar Xiv preprint ar Xiv:2407.21787, 2024. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165. Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher M De Sa. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36, 2024. Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818 2829, 2023. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240): 1 113, 2023. Published as a conference paper at ICLR 2025 Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pp. 4057 4086. PMLR, 2022. Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. ar Xiv preprint ar Xiv:1412.7024, 2014. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. ar Xiv preprint ar Xiv:2409.17146, 2024. Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning, pp. 7750 7774. PMLR, 2023. Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. ar Xiv preprint ar Xiv:2110.02861, 2021. 
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35: 30318 30332, 2022. Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. ar Xiv preprint ar Xiv:2306.03078, 2023. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024. Jesse Dodge, Maarten Sap, Ana Marasovi c, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. ar Xiv preprint ar Xiv:2104.08758, 2021. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. ar Xiv preprint ar Xiv:2407.21783, 2024. Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scaling laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7382 7392, 2024. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. ar Xiv preprint ar Xiv:2210.17323, 2022. Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks. ar Xiv preprint ar Xiv:2403.08540, 2024. Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instability in deep learning. ar Xiv preprint ar Xiv:2110.04369, 2021. Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. ar Xiv preprint ar Xiv:2402.00838, 2024. Alexander H agele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. ar Xiv preprint ar Xiv:2405.18392, 2024. Published as a conference paper at ICLR 2025 Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. ar Xiv preprint ar Xiv:2203.15556, 2022. Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xiaojuan Qi. Billm: Pushing the limit of post-training quantization for llms. ar Xiv preprint ar Xiv:2402.04291, 2024. Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, and Sanmi Koyejo. Scaling laws for downstream task performance of large language models. ar Xiv preprint ar Xiv:2402.04177, 2024. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704 2713, 2018. Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. ar Xiv preprint ar Xiv:2310.06825, 2023. Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pi oro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Kr ol, Tomasz Odrzyg o zd z, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts. ar Xiv preprint ar Xiv:2402.07871, 2024. Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili c, Daniel Hesslow, Roman Castagn e, Alexandra Sasha Luccioni, Franc ois Yvon, Matthias Gall e, et al. Bloom: A 176bparameter open-access multilingual language model. 2023. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. ar Xiv preprint ar Xiv:2406.11794, 2024. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you! ar Xiv preprint ar Xiv:2305.06161, 2023. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activationaware weight quantization for llm compression and acceleration. arxiv. MLSys 2024, 2023. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6: 87 100, 2024a. Licong Lin, Jingfeng Wu, Sham M Kakade, Peter L Bartlett, and Jason D Lee. Scaling laws in linear regression: Compute, parameters, and data. ar Xiv preprint ar Xiv:2406.08466, 2024b. Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. ar Xiv preprint ar Xiv:2407.01492, 2024. Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. ar Xiv preprint ar Xiv:2305.17888, 2023. Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2: The next generation. ar Xiv preprint ar Xiv:2402.19173, 2024. Published as a conference paper at ICLR 2025 Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, et al. Fingpt: Large generative models for a small language. ar Xiv preprint ar Xiv:2311.05640, 2023. Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are in 1.58 bits. ar Xiv preprint ar Xiv:2402.17764, 2024. 
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. ar Xiv preprint ar Xiv:1710.03740, 2017. Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning. ar Xiv preprint ar Xiv:2209.05433, 2022. Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. ar Xiv preprint ar Xiv:2211.01786, 2022. Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024a. Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts language models. ar Xiv preprint ar Xiv:2409.02060, 2024b. Quynh Nguyen, Marco Mondelli, and Guido F Montufar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks. In International Conference on Machine Learning, pp. 8119 8129. PMLR, 2021. Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. ar Xiv preprint ar Xiv:2404.05892, 2024. Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. ar Xiv preprint ar Xiv:2112.11446, 2021. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1 67, 2020. Yangjun Ruan, Chris J Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance. ar Xiv preprint ar Xiv:2405.10938, 2024. Nikhil Sardana and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws. ar Xiv preprint ar Xiv:2401.00448, 2023. Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, et al. What language model to train if you have one million gpu hours? ar Xiv preprint ar Xiv:2210.15424, 2022. Noam Shazeer. Glu variants improve transformer. ar Xiv preprint ar Xiv:2002.05202, 2020. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R e, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pp. 31094 31116. PMLR, 2023. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick Le Gresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. ar Xiv preprint ar Xiv:1909.08053, 2019. 
Published as a conference paper at ICLR 2025 Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ar Xiv preprint ar Xiv:2408.03314, 2024. Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. ar Xiv preprint ar Xiv:2402.00159, 2024. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri a Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ar Xiv preprint ar Xiv:2206.04615, 2022. Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: enhanced transformer with rotary position embedding. corr abs/2104.09864 (2021). ar Xiv preprint ar Xiv:2104.09864, 2021. Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Viji Srinivasan, and Kailash Gopalakrishnan. Ultra-low precision 4-bit training of deep neural networks. Advances in Neural Information Processing Systems, 33:1796 1807, 2020. Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies. ar Xiv preprint ar Xiv:2407.13623, 2024. Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? ar Xiv preprint ar Xiv:2207.10551, 2022a. Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q Tran, David R So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, et al. Transcending scaling laws with 0.1% extra compute. ar Xiv preprint ar Xiv:2210.11399, 2022b. Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. ar Xiv preprint ar Xiv:2312.11805, 2023. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram e, et al. Gemma 2: Improving open language models at a practical size. ar Xiv preprint ar Xiv:2408.00118, 2024. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ee Lacroix, Baptiste Rozi ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ar Xiv preprint ar Xiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ar Xiv preprint ar Xiv:2307.09288, 2023b. Ahmet Ust un, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. Aya model: An instruction finetuned open-access multilingual language model. ar Xiv preprint ar Xiv:2402.07827, 2024. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 
Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. BitNet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. Stable and low-precision training for large-scale vision-language models. Advances in Neural Information Processing Systems, 36:10271–10298, 2023a.
Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023b.
Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020.
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, and Tatsunori Hashimoto. Synthetic continued pretraining. arXiv preprint arXiv:2409.07431, 2024.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason K Eshraghian. Scalable MatMul-free language modeling. arXiv preprint arXiv:2406.02528, 2024.
Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633, 2023.

Figure 8: L(Pw), L(Pa), L(Pkv) for ablated hyperparameters, N = 30M, D = 1.5B. The trends persist: the first few bits reduce final val loss significantly, with diminishing/saturating returns quickly setting in at higher precision. We do not fit constants on these ablated runs.

A HYPERPARAMETER DETAILS AND ABLATIONS
We launch over 20 runs for each (N, D) combination to study scaling in precision, trained and validated on the Common Crawl split of the Dolma dataset (Soldaini et al., 2024). We use a standard causal Transformer++ implementation: SwiGLU activations (Shazeer, 2020), RoPE embeddings (Su et al., 2021), RMS LayerNorm, and Adam β values of (0.9, 0.95).
We adopt a cosine learning rate schedule with a 10% warmup period and a peak learning rate of 6e-4 for the smallest model, with learning rates scaled with width and depth according to depth-µP for the larger models (Yang et al., 2022; Bordelon et al., 2023). We use a sequence length of 1024 and a batch size of 256 throughout, with Adam ϵ of 1e-15, following (Wortsman et al., 2023b). We use weight decay of 0.1, as (Ahmadian et al., 2023) find that some results in the quantization literature may be artifacts of insufficient weight decay. We follow (Ma et al., 2024) in including a LayerNorm before projections because they find it is important for low-precision training to be stable. These are the hyperparameters and settings used for the main scaling law experiments.

To check robustness, we then ablate these hyperparameter choices, with results in Figure 8. In our ablation we use a sequence length of 512 with batch size 128, weight decay of 1e-3, Adam ϵ of 1e-10, a peak learning rate of 1e-4, and a warmup period of 3%. We train models with these alternative hyperparameters at various weight, activation, and KV cache precisions, and train and validate on C4 (Raffel et al., 2020; Dodge et al., 2021) instead. Though these ablations are at a rather smaller scale due to compute constraints, the loss curves follow the same trends as in the main text (a rapid decrease in final loss as precision first increases from 4 bits, then diminishing returns as we approach higher precision), suggesting the trends are robust to hyperparameter choices.

B ADDITIONAL RELATED WORK
Efficient training and inference. Low precision has been key to improving the efficiency of training and using LLMs (Micikevicius et al., 2017; Shoeybi et al., 2019; Wortsman et al., 2023a; Zhu et al., 2023). Prior works generally study either precision during training (Courbariaux et al., 2014; Dettmers et al., 2024; 2021; Sun et al., 2020; Liu et al., 2023) or the effects of changing the precision after training (post-training quantization) (Frantar et al., 2022; Lin et al., 2024a; Dettmers et al., 2022; Xiao et al., 2023; Sheng et al., 2023; Dettmers et al., 2023). In this work we study both the precision during training and after it, and unify the two from a scaling perspective. Other important works include recent popular work on quantization-aware training (Ma et al., 2024), where weights are quantized to extreme precisions (ternary) on the forward pass during training. This work is consistent with ours in that they can quantize weights so aggressively because weights are less sensitive than activations or the KV cache. Further, while we use a fixed architecture throughout to maintain a controlled comparison across precision, they use a nonstandard architecture, learning rate, and weight decay schedule specifically to make training with ternary weights stable.

Large language models and scaling. By scaling up the transformer architecture (Vaswani et al., 2017), a variety of large language models have been proposed (Brown et al., 2020; Rae et al., 2021; Touvron et al., 2023a;b; Dubey et al., 2024; Le Scao et al., 2023; Muennighoff et al., 2022; 2024b; Groeneveld et al., 2024; Jiang et al., 2023; Zhang et al., 2022; Allal et al., 2023; Li et al., 2023; Lozhkov et al., 2024; Luukkonen et al., 2023; Bai et al., 2023; Chowdhery et al., 2023; Team et al., 2023; Üstün et al., 2024; Deitke et al., 2024).
To improve our understanding of these models, various works have investigated their scaling properties (Ruan et al., 2024; Allen-Zhu & Li, 2024; Hägele et al., 2024). Many aspects are relevant to scaling, including the architecture (Tay et al., 2022a; Krajewski et al., 2024; Tao et al., 2024; Clark et al., 2022; Tay et al., 2022b; Scao et al., 2022; Peng et al., 2024), the modalities considered (Aghajanyan et al., 2023; Alabdulmohsin et al., 2022; Cherti et al., 2023), the performance metrics (Wei et al., 2022; Srivastava et al., 2022; Isik et al., 2024), the data composition (Li et al., 2024; Liu et al., 2024; Albalak et al., 2024), and data repetitions (Muennighoff et al., 2024a). Our work analyzes one such aspect that is key to better scaling: the numeric precision during and after training.

C ALTERNATIVE FUNCTIONAL FORMS
There are several plausible functional forms to try a priori. The key junctions are whether a form is 1) additive or multiplicative, 2) interacts with parameters/data or is independent of them, and 3) a power law or an exponential. We try a variety of combinations of these three and find the formulation in the main text to be one of the best fits, notably with the fewest fitted parameters. We emphasize that several fitted forms are likely to be reasonable fits to the data, and an important desideratum for choosing a functional fit is interpretability. Several scaling law papers find multiple fits plausible in terms of predictive power (Muennighoff et al., 2024a; Kaplan et al., 2020), and ultimately make a decision based on interpretability. We make these fit choices on sweeps of the form L(N, D, Pw) and discuss alternatives to the decomposition/factorization used to account for activations and KV cache in Appendix M, which assumes an effective parameter count formulation.

In this section, a power law refers to a term of the form $C_w P^{-\alpha_w}$, where $C_w, \alpha_w$ are fitted. In general, we find that modeling precision effects with power law fits on their own causes the fitted constants A, B to blow up, whereas this does not happen with exponential fits, suggesting the power law does not change sharply enough to match the change in loss induced by precision. We note that while fitting parameters using a double notion of effective parameters and effective data leads to a slightly better fit, it requires more fitted parameters, so we stick with the Neff formulation for simplicity and interpretability. When choosing between fits we validate on held-out data, and the R² values below reflect the fit on the held-out data. This is in contrast to our plots in the main text, where we have chosen a functional form and fit and plot on the same data, as is standard in scaling laws (Muennighoff et al., 2024a).

Functional Form                    Val R²   Number of Fitted Parameters
Neff                               0.82     3
Additive/independent power law     0.71     2
Deff                               0.74     3
Neff and Deff (tied)               0.79     3
Neff and Deff (not tied)           0.84     4
Multiplicative power law, N, P     0.75     2

Table 1: Comparison of functional forms, validation R², and number of fitted parameters.
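To make the comparison above concrete, the following is a minimal sketch (not the authors' fitting code) of how one might fit a saturating-exponential Neff form and an additive power-law alternative on pretraining runs and compare held-out R². The loss arrays, helper names, initial guesses, and the exact single-factor parameterization of Neff are illustrative assumptions.

```python
# Minimal sketch: compare candidate precision scaling forms on held-out runs.
# Assumes arrays of (N, D, P, loss) from pretraining sweeps; all constants are refit here.
import numpy as np
from scipy.optimize import curve_fit

def loss_neff(X, A, B, E, alpha, beta, gamma):
    # Chinchilla form with N replaced by a saturating Neff(P) = N * (1 - exp(-P / gamma)).
    N, D, P = X
    N_eff = N * (1.0 - np.exp(-P / gamma))
    return A * N_eff ** (-alpha) + B * D ** (-beta) + E

def loss_powerlaw(X, A, B, E, alpha, beta, C_p, alpha_p):
    # Additive, precision-independent alternative: an extra C_p * P^(-alpha_p) term.
    N, D, P = X
    return A * N ** (-alpha) + B * D ** (-beta) + C_p * P ** (-alpha_p) + E

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def compare(train, val):
    # train/val are dicts with arrays "N", "D", "P", "loss" (held-out split for val).
    Xtr = (train["N"], train["D"], train["P"])
    Xva = (val["N"], val["D"], val["P"])
    results = {}
    for name, fn, p0 in [
        ("Neff", loss_neff, [1e3, 1e3, 2.0, 0.5, 0.5, 2.0]),
        ("additive power law", loss_powerlaw, [1e3, 1e3, 2.0, 0.5, 0.5, 1.0, 1.0]),
    ]:
        popt, _ = curve_fit(fn, Xtr, train["loss"], p0=p0, maxfev=100_000)
        results[name] = r2(val["loss"], fn(Xva, *popt))
    return results  # held-out R^2 per functional form
```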
D QUANTIZATION IMPLEMENTATION DETAILS AND TYPES
Two canonical types for neural network quantization are floating-point (FP) and integer (INT) quantization. Despite their differences in representation, we hypothesize that the scaling behavior of floating-point and integer quantization can be described by similar functional forms; Figure 1(b) provides preliminary evidence for this.

D.1 INTEGER QUANTIZATION AND IMPLEMENTATION DETAILS
In integer quantization, continuous values are mapped to discrete integer values. Typically, this is done by scaling the original values according to a fixed scale factor. Mathematically, for a real number x, the quantized integer value $x_{\mathrm{int}}$ is computed as

$x_{\mathrm{int}} = \lfloor x / s \rceil$,

where $s$ is the scaling factor and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer representable in the specified number of bits. The value can then be dequantized back to an approximate real value by multiplying by $s$:

$x_{\mathrm{dequant}} = s \cdot x_{\mathrm{int}}$.

This process introduces quantization error, defined as the difference between the original value $x$ and the dequantized value $x_{\mathrm{dequant}}$. The goal of quantization is to minimize this error while still reducing the precision. One can think of this as rounding to the nearest point on a uniform lattice. More complicated quantization schemes involve selecting the lattice points in a data- or model-dependent manner.

Integer quantization, as implemented, uses a fixed-point scaling based on the maximum absolute value of the tensor, and then scales the values into the range $[Q_n, Q_p]$, where $Q_n = -2^{b-1}$ and $Q_p = 2^{b-1} - 1$, with $b$ being the number of bits. Integer quantization first rescales the inputs into the range specified by the number of bits by $s = \frac{Q_p}{\max(|x|)}$ for tensor-based scaling, or $s = \frac{Q_p}{\max(|x|, \mathrm{dim}=k)}$ for channel-based scaling. After scaling, the result is rounded to the nearest integer and then clamped to the range $[Q_n, Q_p]$. After matrix multiplication, the result is rescaled back into the original range.

We quantize only the forward pass in this work, to ensure a fair comparison between quantization-aware training (weights only) and low-precision training (weights, activations, KV cache). This is because the backward pass is not usually quantized during quantization-aware training (Ma et al., 2024), so comparing sensitivities of weights (forward only) to activations/KV cache (forward and backward) would not be a principled comparison. In production pretraining in low precision, the matrix multiplications on the backward pass are also quantized, leading to further compute savings. We leave a detailed analysis of how our observations change when accounting for the backward pass to future work. We use integer quantization throughout to fit our scaling laws, for simplicity.
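As a concrete illustration of the scheme above, the following is a minimal sketch of simulated ("fake") integer quantization of a tensor with per-tensor or per-channel absmax scaling. It is not the exact experimental code, and the function names are ours.

```python
# Minimal sketch of simulated (fake) integer quantization with absmax scaling.
# Not the exact experimental code; "per-channel" here means one scale per row (dim 0).
import torch

def fake_int_quant(x: torch.Tensor, bits: int, per_channel: bool = False) -> torch.Tensor:
    q_n, q_p = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if per_channel:
        # One scale per channel: max absolute value over all dims except dim 0.
        absmax = x.abs().amax(dim=tuple(range(1, x.ndim)), keepdim=True)
    else:
        # One scale for the whole tensor.
        absmax = x.abs().amax()
    scale = q_p / absmax.clamp(min=1e-12)            # s = Q_p / max(|x|)
    x_int = torch.clamp(torch.round(x * scale), q_n, q_p)
    return x_int / scale                              # dequantize back to the original range

# Example: 4-bit weights per channel, 4-bit activations per tensor.
w = torch.randn(256, 512)
a = torch.randn(32, 512)
w_q = fake_int_quant(w, bits=4, per_channel=True)
a_q = fake_int_quant(a, bits=4, per_channel=False)
```

During training, such a function would typically be applied on the forward pass only (with a straight-through estimator for gradients), consistent with the forward-only quantization described above.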
D.2 FLOATING-POINT QUANTIZATION
Floating-point quantization is slightly more sophisticated, aiming to make a non-uniform lattice that roughly matches the distribution of the weights, which are assumed to be Gaussian. A floating-point number is in general represented as

$x_{\mathrm{fp}} = (-1)^s \cdot m \cdot 2^{e}$,

where $s$ is the sign bit, $m$ is the mantissa, and $e$ is the exponent. In floating-point quantization, both the mantissa and exponent are quantized to reduce the bit width. For exponent/mantissa bit allocations and details of the exponent bias, we follow the guidelines from (Micikevicius et al., 2022) and quantize weights per-channel and activations per-tensor. Making a full scaling law for floating-point quantization is more involved than our integer treatment, because the effects of scaling mantissa vs exponent bits are not the same. In contrast, in integer quantization, each additional bit simply causes us to round onto a finer-grained lattice after rescaling, thereby reducing quantization error by a predictable amount. In floating-point quantization, altering the exponent affects the dynamic range, while altering the mantissa changes the precision within that range. This flexibility at once makes floating-point quantization more suitable for model training, but harder to analyze. We leave a commensurately detailed analysis of mantissa vs exponent bits, and of floating-point scaling more generally, to future work.

D.3 HARDWARE DETAILS
Weight-only quantization can accelerate inference because software can be written to move data between GPU memory levels (HBM–SRAM) in smaller units (types), so that a given bandwidth can move more data per second. This reduces the memory (IO) bottlenecks that often dominate during inference, even with high-batch workloads. However, we emphasize that the type, and therefore the speed, at which the GPU can natively perform matrix multiplications is determined by the hardware provider, so that even when Pw = Pa = Pqkv (including queries), compute savings are only achieved when these correspond to both a bit-width and a type that the GPU supports. We aim to study scaling in a fairly hardware-agnostic manner so that our work may be useful in the future, and make no claims about hardware details or optimality. We train all our models with fake (simulated) quantization on NVIDIA H100 GPUs to remain hardware agnostic, not taking advantage of any true low-precision computation. The only assumption is that when hardware does implement support for integer quantization, it is done in a way that involves some combination of rescaling and rounding, as is standard at the time of writing (Dettmers & Zettlemoyer, 2023; Dettmers et al., 2022; Wu et al., 2020; Jacob et al., 2018).

E DERIVATIONS
E.1 CRITICAL DATASET SIZE FOR PTQ
We seek a $D_{\mathrm{crit}}$ that satisfies $\frac{\partial L(D_{\mathrm{crit}})}{\partial D} = -\frac{\partial \delta_{\mathrm{PTQ}}(D_{\mathrm{crit}})}{\partial D}$. Taking both derivatives for the functional forms presented in the main text and equating their opposing effects, we get the equation

$\beta B D_{\mathrm{crit}}^{-\beta - 1} = \gamma_D C_T N^{-\gamma_N} e^{-P_{\mathrm{post}}/\gamma_{\mathrm{post}}} D_{\mathrm{crit}}^{\gamma_D - 1}$ (12)

which implies

$D_{\mathrm{crit}} = \left[\frac{\beta B N^{\gamma_N} e^{P_{\mathrm{post}}/\gamma_{\mathrm{post}}}}{\gamma_D C_T}\right]^{\frac{1}{\gamma_D + \beta}}$ (13)

is the predicted point after which pretraining on more data can increase the loss of a model that is post-train quantized. Note that this quantity explodes in $P_{\mathrm{post}}$, so that a truly unreasonable amount of data is required for longer pretraining to be harmful at commonly used precisions (e.g. 8-bit). However, we find that on overtrained models ($D/N \approx 10^3$), these overtraining-degradation effects become nontrivial around 5 bits, and dominant below that.

E.2 COMPUTE-OPTIMALITY CALCULATIONS
We set a constraint $C \propto NDP$ throughout. Working up to proportionality essentially rescales the compute constraint, so it does not affect the scaling trends we identify, which are our focus.

E.2.1 FIXED PRECISION COMPUTE-OPTIMAL SCALING
Under fixed precision, the loss takes the form

$L = u(P) A N^{-\alpha} + B D^{-\beta}$ (14)

where $u(P) = \left[1 - e^{-P/\gamma}\right]^{-3\alpha}$ is a fixed constant. The compute-optimal scaling when minimizing the loss over $N, D$ gives

$L = u(P) A N^{-\alpha} + B C^{-\beta} N^{\beta} P^{\beta}$ (15)

by replacing $D = \frac{C}{NP}$. Optimizing over $N$, we see that this is equivalent to the original Chinchilla optimization problem but with $A \mapsto A u(P)$ and $B \mapsto B P^{\beta}$. Performing this optimization, we find

$N^*(P, C) \propto \left[\frac{\alpha A u(P)}{\beta B P^{\beta}}\right]^{\frac{1}{\alpha+\beta}} C^{\frac{\beta}{\alpha+\beta}}, \qquad D^*(P, C) \propto \left[\frac{\beta B P^{\beta}}{\alpha A u(P)}\right]^{\frac{1}{\alpha+\beta}} \frac{C^{\frac{\alpha}{\alpha+\beta}}}{P}$ (16)

We can relate the above expressions to the original Chinchilla-optimal $N, D$ at full precision, $N_{\mathrm{Ch}}(C), D_{\mathrm{Ch}}(C)$:

$\frac{N^*(P, C)}{N_{\mathrm{Ch}}(C)} \propto \left[1 - e^{-P/\gamma}\right]^{-\frac{3\alpha}{\alpha+\beta}} P^{-\frac{\beta}{\alpha+\beta}} \quad \text{and} \quad \frac{D^*(P, C)}{D_{\mathrm{Ch}}(C)} \propto \left[1 - e^{-P/\gamma}\right]^{\frac{3\alpha}{\alpha+\beta}} P^{-\frac{\alpha}{\alpha+\beta}}$ (17)
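A minimal numerical sketch of the fixed-precision allocation in Equation 16, useful for inspecting the $N^*/N_{\mathrm{Ch}}$ and $D^*/D_{\mathrm{Ch}}$ ratios above. The constant values plugged in are placeholders, not the fitted values from Appendix K.

```python
# Sketch of the fixed-precision compute-optimal allocation of Eq. 16.
# Constants below are illustrative placeholders, not the paper's fitted values.
import numpy as np

A, B, alpha, beta, gamma = 1e3, 1e3, 0.5, 0.5, 2.0

def u(P):
    # Loss multiplier from training precision: u(P) = [1 - exp(-P/gamma)]^(-3*alpha).
    return (1.0 - np.exp(-P / gamma)) ** (-3.0 * alpha)

def optimal_allocation(C, P):
    # N*(P, C) = [alpha*A*u(P) / (beta*B*P^beta)]^(1/(alpha+beta)) * C^(beta/(alpha+beta))
    k = (alpha * A * u(P) / (beta * B * P ** beta)) ** (1.0 / (alpha + beta))
    N_star = k * C ** (beta / (alpha + beta))
    D_star = C / (N_star * P)          # from the compute constraint C = N * D * P
    return N_star, D_star

# Example: compare 4-bit and 16-bit allocations at the same compute budget.
for P in (4, 16):
    N_star, D_star = optimal_allocation(C=1e20, P=P)
    print(P, f"N*={N_star:.3e}", f"D*={D_star:.3e}")
```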
E.2.2 FIXED MODEL SIZE N
Now, we investigate the case where the model size N is fixed but precision and data are jointly optimized at fixed compute $C = NDP$. This optimization problem takes the form

$L = u(P) A N^{-\alpha} + B D^{-\beta}$ (18)

Under fixed compute we have $D = \frac{C}{NP}$, so replacing the second term gives

$L = u(P) A N^{-\alpha} + B C^{-\beta} N^{\beta} P^{\beta}$ (19)

where $N$ is a constant. We therefore have a single variable $P$ over which to minimize the above formula:

$\frac{\partial L}{\partial P} = u'(P) A N^{-\alpha} + \beta B C^{-\beta} N^{\beta} P^{\beta - 1} = 0$ (20)

First, we note that $u'(P)$ has the following form:

$u'(P) = -3\alpha \left[1 - e^{-P/\gamma}\right]^{-3\alpha - 1} \frac{1}{\gamma} e^{-P/\gamma} = -\frac{3\alpha}{\gamma} e^{-P/\gamma} \, u(P)^{\frac{3\alpha+1}{3\alpha}}$ (21)

We thus desire a solution to the implicit equation

$\frac{3\alpha}{\gamma} e^{-P/\gamma} \, u(P)^{\frac{3\alpha+1}{3\alpha}} A N^{-\alpha} = \beta B C^{-\beta} N^{\beta} P^{\beta - 1}$ (22)

We now aim to find an approximate asymptotic relationship between $P$ and $C$ as $C \to \infty$. Taking a logarithm of both sides, we find (neglecting additive constants that are independent of $C, P$)

$-(3\alpha + 1) \ln\!\left(1 - e^{-P/\gamma}\right) - \frac{P}{\gamma} \approx -\beta \ln C$ (23)

The correct dominant balance at large $C$ is to take $P \approx \beta\gamma \ln C$, as can be verified numerically. With the constraint $C = NPD$, we then have $D \approx \frac{C}{N \beta\gamma \ln C}$.

E.2.3 MINIMIZATION OVER N, D, P WITH FIXED COMPUTE
Recall our three-way loss function, given below. We separate Neff into terms involving (N, P) explicitly here, as it makes the math easier to follow.

$L(N, D, P) = A N^{-\alpha} u(P) + B D^{-\beta}, \qquad u(P) = \left[1 - e^{-P/\gamma}\right]^{-3\alpha}$ (24)

Under the constraint $C \propto NDP$, we can replace $D$ in terms of $C, N, P$, giving the loss expression

$L = A N^{-\alpha} u(P) + B N^{\beta} P^{\beta} C^{-\beta}$ (25)

$\frac{\partial L}{\partial N} = -\alpha A N^{-\alpha - 1} u(P) + \beta B N^{\beta - 1} P^{\beta} C^{-\beta} = 0$ (26)

$\frac{\partial L}{\partial P} = -\frac{3\alpha}{\gamma} A N^{-\alpha} u(P)^{\frac{3\alpha+1}{3\alpha}} e^{-P/\gamma} + \beta B N^{\beta} P^{\beta - 1} C^{-\beta} = 0$ (27)

Multiplying the first equation by $N$ and dividing the second equation by it reveals that the optimal $P$ satisfies a compute-independent implicit equation

$\frac{3}{\gamma} u(P)^{\frac{3\alpha+1}{3\alpha}} e^{-P/\gamma} = P^{-1} u(P)$ (28)

This exercise reveals that the compute-optimal strategy, when allowed to jointly optimize $N, D, P$, is to choose a fixed precision that satisfies the above equation and then to scale up $N, D$ with the prescription in Section E.2.1.

Figure 9: Numerically minimizing a model of inference-time costs with respect to N, D, P after accounting for post-train quantization degradation and its relation to overtraining. Panels show P*, D*, and N* against inference FLOPs C = 2NP, with Llama3-8B, Llama3-70B, and Llama3-405B marked for reference.

E.3 INFERENCE-TIME COST MODEL
For many practitioners, inference is the primary cost of training and serving models. Here, we present a preliminary analysis of an inference-time cost model. The key tension is that inference cost scales as NP, so that inference costs at a fixed pretraining loss can be reduced either by reducing model size (and overtraining more) or by quantizing post-training. We will assume here that P = Ppost refers to the precision weights will be quantized to. In practice, inference costs may depend to some extent on the precision of the KV cache and activations as well, but we make this assumption for tractability of the following mathematical model, and to get a sense of how overtraining and post-train quantization concerns play out at inference time. We can phrase this minimization problem in the following way:

$\min_{N, D, P} \; L(N, D, P) = A N^{-\alpha} + B D^{-\beta} + C_T D^{\gamma_D} N^{-\gamma_N} e^{-P/\gamma} \quad \text{subject to} \quad C = NP$ (29)

The system of first-order conditions that results from this constrained optimization problem is not in general analytically tractable, so we solve the above constrained optimization problem for P*(C), N*(C), D*(C) numerically via a simple grid search. We find that N*, D* grow as power laws in C while P* ∝ log C.
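A minimal sketch of the kind of grid search described above. The constants are placeholders rather than the fitted values, and the grid ranges and resolution are arbitrary choices.

```python
# Sketch: grid search for N*, D*, P* minimizing Eq. 29 under the constraint C = N * P.
# Constants are illustrative placeholders, not the paper's fitted values.
import numpy as np

A, B, alpha, beta = 1e3, 1e3, 0.5, 0.5
C_T, gamma_D, gamma_N, gamma_post = 0.05, 0.5, 0.35, 0.6

def loss(N, D, P):
    degradation = C_T * D ** gamma_D * N ** (-gamma_N) * np.exp(-P / gamma_post)
    return A * N ** (-alpha) + B * D ** (-beta) + degradation

def minimize_at_budget(C, P_grid=np.arange(1, 17), D_grid=np.logspace(8, 14, 200)):
    best = (np.inf, None, None, None)
    for P in P_grid:
        N = C / P                      # the inference-cost constraint C = N * P
        for D in D_grid:
            l = loss(N, D, P)
            if l < best[0]:
                best = (l, N, D, P)
    return best                        # (loss, N*, D*, P*)

# Example: trace out P*(C) over a range of inference budgets.
for C in np.logspace(10, 14, 5):
    _, N_star, D_star, P_star = minimize_at_budget(C)
    print(f"C={C:.1e}  N*={N_star:.2e}  D*={D_star:.2e}  P*={P_star}")
```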
The clumping of points in Figure 9 is an artifact of the numerics of the grid search; the fitted lines represent the overall log-linear (left) and log-log (middle, right) trends. It might be surprising that D* is not taken to infinity, since D does not appear in the cost function. The reason is that if it were, post-train degradation (the third term) would become large. It might also be surprising that D* changes with compute at all. The reason, once again, is the third term: as we allow more inference-time compute we use more N, and at a larger N we can tolerate a larger data budget for a given post-train quantization degradation, so being compute-optimal means taking advantage of this and training that larger parameter count on more data. The intuition for why P* ∝ log C might be as follows. Consider a situation in which P is independent of compute: the third term will come to bottleneck the loss as compute grows, because N and D are both scaled as power laws in compute, so eventually the effect of $e^{-P/\gamma}$ becomes non-negligible in comparison to the first two terms of the loss. To continue decreasing loss at this point, we must shrink this term at a rate commensurate with the other terms, which decay as power laws in compute. Since precision sits inside the exponential, this can be done by taking P ∝ log C. An important caveat is that, since we ignore pretraining costs here, the absolute values of predicted D* are much larger than would be realistic in any reasonable training regime, where pretraining costs do matter, even if less than inference costs. But the empirical trends in N*, P* showcase how overtraining with post-train quantization in mind can outperform vanilla overtraining that does not account for its effects on post-train quantization.

Figure 10: Replicating Section 3 results with AWQ. (Val loss after PTQ and degradation vs token/parameter ratio for N = 30M, 60M, 110M, 220M at INT3-INT6 and with no PTQ.)

Figure 11: Replicating Section 3 results with RTN. (Same layout as Figure 10.)

F REPLICATING PTQ SCALING WITH OTHER QUANTIZATION METHODS
Here we replicate the finding that degradation due to post-train quantization increases with the token/parameter ratio as $D^{\gamma_D}/N^{\gamma_N}$. We fit the same functional form as in the main text, but get slightly different values of the fitted constants, as expected. We replicate on AWQ (Lin et al., 2023) and round-to-nearest (RTN) quantization. The former is a modern and sophisticated technique; the latter is a simple and naïve approach to quantization. The fact that they, as well as GPTQ in the main text, share the same failure modes suggests that poor post-training quantization data scaling should be the default expectation for any newly proposed PTQ technique.

G PTQ: LEARNING RATE SCHEDULE ABLATION
Here, we ablate our learning rate schedule to use warmup with linear decay, as opposed to a cosine schedule, to check that the degradation is not an artifact of our choice of learning rate schedule. We do so on our 30M model due to compute constraints (Figure 12), finding that the degradation with token/parameter ratio persists, as expected.

Figure 12: Linear LR schedule ablation. (Val loss after PTQ vs token/parameter ratio for INT3-INT6 and no PTQ.)
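For reference, a minimal sketch of the two schedule shapes being compared: cosine with warmup (main runs) versus warmup with linear decay (this ablation). The peak learning rate and warmup fraction shown follow the values quoted in Appendix A for the smallest model, and the function names are ours.

```python
# Sketch of the two learning-rate schedules compared in this ablation.
# Peak LR and warmup fraction follow Appendix A; function names are ours.
import math

def cosine_with_warmup(step, total_steps, peak_lr=6e-4, warmup_frac=0.10):
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

def linear_with_warmup(step, total_steps, peak_lr=6e-4, warmup_frac=0.10):
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - progress)                         # linear decay to 0
```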
H WHY DO LANGUAGE MODELS GET MORE SENSITIVE WITH OVERTRAINING?
This section is speculative.

Sharpness. A canonical line of work in optimization demonstrates that model sharpness increases during learning until it hovers at a maximal value (the "edge of stability") (Cohen et al., 2021; Gilmer et al., 2021), so that movement along the top Hessian eigenvector degrades loss by more as training proceeds. Though sharpness is formally a worst-case sensitivity, we conjecture that similar results hold for the average case, such as the loss degradation induced by isotropic noise. It may be that sharpness during language model pretraining does not reach its maximal value for a long time, which is why sensitivity to noise seems to increase monotonically with D/N on realistic data budgets. Closely related is the largest eigenvalue of the neural tangent kernel (NTK), which captures the magnitude of the variance of the predictor under parameter noise. This quantity is known empirically to increase during training in a variety of settings, and is closely related to generalization guarantees (Nguyen et al., 2021; Atanasov et al., 2022).

Hierarchical learning strategies become more sensitive throughout training. Our expectation that overtrained language models may degrade more when quantized at inference time is motivated in part by the following results. The hierarchical nature of learning is by now well understood in some toy settings: in (Abbe et al., 2021), it is shown that staircase polynomials of increasing degree are learned faster than high-degree monomials, since neural networks combine existing features to learn new ones. In (Abbe et al., 2022) this result was strengthened to show that such hierarchical structure is both necessary and sufficient to learn sparse functions with SGD in two-layer neural networks. In this setting, damage to features encoding lower-order polynomials affects all higher-order ones, so that such networks become increasingly sensitive to fixed feature noise throughout learning. Another result of a similar flavor is that of (Barak et al., 2022), who explicitly require high-precision gradients for sparse parity to be learned, since sparse parity is learned by the amplification of a small initial signal. If language models learn hierarchically, it is possible that the features learned late into overtraining, as D/N grows, are reliant on base features, so that noise harms the base features and thereby significantly damages higher-order features.

I GRANULARITY ABLATIONS
Here, we ablate our choice of quantization granularity (per-tensor vs per-channel) relative to the main text, where we quantize weights per-channel and activations per-tensor. Per-tensor quantization keeps one scalar to rescale all values in a tensor into the quantization codebook range, while per-channel quantization keeps one scalar per channel dimension; the latter is therefore strictly more expressive, and thus incurs lower quantization loss than the former, at the cost of slightly more memory usage. Here, we ask: is the increased sensitivity of activations a result of them being inherently more sensitive, or due to the per-tensor design choice?

Figure 13: Quantization granularity ablation: all combinations of (training weight precision, training activation precision) × (per-tensor, per-channel). Dashed lines are per-channel and solid are per-tensor.
These results show that activations are generally more sensitive than weights, since their loss penalty at lower precision grows faster even when granularity is kept fixed across the two. In fact, quantizing activations per-channel is almost as hard as quantizing weights per-tensor. This is consistent with a broad line of work in quantization finding that activations comprise the central difficulty in quantization (Dettmers & Zettlemoyer, 2023; Ma et al., 2024).

J MAIN FIGURE DETAILS
The model on the left is N = 30M parameters, chosen because we could train it to the highest token/parameter ratio given our compute budget. On the right we train a suite of models with NP kept constant on 16B tokens (so that $C = \frac{6}{16} NDP$ is matched throughout under our cost model). We plot val loss on Dolma, as throughout the main text, and use floating-point (rather than integer) quantization to make the pretraining claims as realistic as possible.

K NUMERICAL FITS
Following (Muennighoff et al., 2024a), we tie α = β so they do not become very different, though this is not required. Distinct α, β only add expressivity to the model, and we have verified the plots look similar without tying. We also only use the full scaling law when specified in the text, since the law is developed piecewise through the text. For instance, Figures 3 and 4 solely fit Chinchilla with a substitution N ↦ Neff(Pw), because at that point Pa, Pkv have not been introduced. Figures 5, 6, and 7 use our full scaling law, for instance to make predictions. We emphasize that our numerical constants are unlikely to be directly useful because, as (Hoffmann et al., 2022; Sardana & Frankle, 2023) show, fitted constants depend heavily on the architecture and dataset used, which differ from setup to setup. Rather, the trends we identify are the key findings. With that said, our fitted constants are as follows. Note that we include biases in our exponent fits: for instance, when modelling Neff as a saturating exponential, we find that the different parts of a model cause numerical instability at different values of low precision, so even if they follow the same functional form, they may be translated (left/right-shifted) versions of each other. For instance, a fit of the form $e^{-x/\gamma_x}$ in the main text is really computed with an offset, $e^{-x/\gamma_x + n}$, but including biases everywhere clutters notation and obscures mathematical insight.

Constant   Value
A          4.299e3
α          0.4965
B          1.806e4
E          2.7648
γw         2.6745
nw         0.3037
γi         2.2102
ni         1.4072
γkv        0.9578
nkv        2.4185
CT         0.0598
γD         0.5068
γN         0.3439
γ          0.5907
b          1.1277

Table 2: Fitted constants and their values.

Figure 14: Sweeping L(P) for the three model parts (weights, activations, KV cache) at various N, D (220M params / 3.3B tokens and 110M params / 26.2B tokens).

L ARE WEIGHTS, ACTIVATIONS, AND KV CACHE EQUALLY SENSITIVE?
We find that training runs with Pa = 3 or Pkv = 3 are not numerically stable, and often diverge, while Pw = 3 is still well behaved. In particular, we find activations are more sensitive, though this could be because we quantize activations per-tensor and weights per-channel, rather than activations being inherently more sensitive. Consequently, we do not fit or validate on runs with activation or attention bits equal to 3.
We leave a more detailed analysis of fine-grained sensitivity across layers and types of parameters to future work. Figure 14 illustrates the empirical sensitivity by plotting L(P) for the three quantities across various runs (N, D).

Figure 15: Plotting what Neff looks like empirically, as a function of Pw, Pa, and Pkv. Each black point is a pretraining run; mathematical details of what is plotted are given below. Blue lines are parametric fits of a saturating exponential.

M EMPIRICAL NEFF
Consider a model trained with some arbitrary (N, D, Pw). Assuming a Chinchilla functional form with N ↦ Neff(Pw), we can write the difference between its loss and the loss of a full-precision model as

$L(N, D, P_w) - L(N, D, \infty) = A\left[N_{\mathrm{eff}}^{-\alpha} - N^{-\alpha}\right]$

since the terms involving B, D, E cancel. Note that Neff(Pw = ∞) = N by construction. In practice, we use a BF16 model as the infinite-precision model, finding no real difference if we use an FP32 model or even a functional fit estimating the Pw → ∞ limit based on our integer quantization loss results. Our goal is to plot what f(P) looks like, where Neff = N · f(P). Therefore, we can rearrange the above equation as follows:

$f(P) := \frac{N_{\mathrm{eff}}}{N} = \frac{1}{N}\left[\frac{L(N, D, P_w) - L(N, D, P_w = \infty)}{A} + N^{-\alpha}\right]^{-1/\alpha}$ (30)

Plotting this quantity using our fitted numerical values (see Appendix K) gives us the empirical tradeoff between precision and parameters. We can see that the tradeoff quickly saturates in P to a value near 1. While the functional form is the same for the three model parts, the fitted constants are different. For instance, runs with Pa = 3 or Pkv = 3 often diverged, and this was not the case for weight precision. Further, we can see that the KV cache is not sensitive to quantization at higher bit values, but very quickly becomes sensitive around 4-5 bit precision.

As far as the joint functional form for Neff(Pw, Pa, Pkv) is concerned, we acknowledge that alternative factorizations, which do not decompose the model into weights, activations, and KV cache, may fit equally well. For instance, decomposing the weights term into a product of layerwise effects has a reasonable fit, though it introduces more parameters, and a more coarse-grained version may not decompose the model into parts at all, but only consider tied precisions. We choose this factorized form because QAT considers weights only, and activations and attention are the two other parts that must then also be kept in low precision to see compute gains. Since practitioners often care about the KV cache on its own, we chose to decompose "activations and attention" as "activations and KV cache". We emphasize that our main point is not that this factorization is objectively correct, but that such a factorization, which assumes approximate independence, is possible in the first place.
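The rearrangement in Equation 30 is straightforward to apply directly to measured losses. Below is a minimal sketch of extracting the empirical f(P) = Neff/N from a sweep of losses at different training precisions, assuming fitted values of A and α are available; the array names and example numbers are illustrative only.

```python
# Sketch: empirical f(P) = Neff / N from Eq. 30, given losses from a precision sweep.
# A and alpha would come from the fitted scaling law (Appendix K); values here are placeholders.
import numpy as np

A, alpha = 4.3e3, 0.5          # placeholder fitted constants

def empirical_f(loss_at_p: np.ndarray, loss_full_precision: float, N: float) -> np.ndarray:
    # Invert L(N, D, Pw) - L(N, D, inf) = A * (Neff^-alpha - N^-alpha) for Neff.
    delta = loss_at_p - loss_full_precision
    n_eff = (delta / A + N ** (-alpha)) ** (-1.0 / alpha)
    return n_eff / N               # f(P) in [0, 1], saturating towards 1 as P grows

# Example with made-up losses from a 30M-parameter weight-precision sweep.
precisions = np.array([3, 4, 6, 8, 12])
losses = np.array([3.60, 3.45, 3.38, 3.36, 3.355])   # illustrative only
print(empirical_f(losses, loss_full_precision=3.35, N=30e6))
```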
N FLOATING-POINT EXPERIMENTS
The key difference between floating-point and integer types is that the former allocates some bits to the exponent representation and some to the mantissa, and these bits play different roles, unlike in integer types where every bit plays the same role in making the quantization lattice uniformly more fine-grained. We hypothesize that if exponent and mantissa bits are scaled jointly (i.e., increase together as total bit count does), the overall trend will still be predictable with a functional form like ours. To test this, we fit a parametric form like Equation 3 with the constants A, B, E, α = β listed in Table 2. The overall fit results in values of γw = 2.8111 and an exponent bias of b = 0.1240, showing the functional form is still a good fit to the data, even for floating point, under reasonably standard bit allocation schemes between mantissa and exponent. In the middle and right panels of Figure 16, we fit the same parametric form for particular values of (N, D) and visualize the quality of the resulting predictions. We use bit allocations of E2M0, E3M0, E4M1, E3M2, E4M2, E5M2, and E5M6 for 3, 4, 5, 6, 7, 8, and 12 bits, respectively, with one sign bit throughout.

Figure 16: Fitting an effective parameter form to floating-point precision for weight training. (Left) Quality of fit on 140 training runs in floating-point weight precision (MSE 0.0027, R² 0.9678). (Middle, right) Loss vs floating-point weight bits Pw for N = 220M, D = 13.1B and N = 30M, D = 6.7B.

Figure 17: Exponent-mantissa bit allocation sweep. The two types of bits have different scaling behavior, but both fit the saturating form where the first few bits reduce loss substantially, with diminishing returns after that.

Since exponent and mantissa bits in general play different roles (i.e., the effect of a bit on loss and dynamics depends strongly on whether it comes from the mantissa or the exponent), we expect our functional form does well here because mantissa and exponent allocations both increase jointly as precision rises, so overall the trends are predictable in a similar way. We check the roles of the two directly by sweeping ExM3 and E3Mx, confirming this intuition. This suggests that one route to fine-grained fits for arbitrary ExMy combinations is to decompose the effects of mantissa and exponent bits, for instance with a form like Neff(Pw,m, Pw,e, N). Since this is not needed for standard bit allocation choices, as we can see in Figure 16, we do not delve into this complexity.
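To make the ExMy notation concrete, the following is a minimal sketch of simulated floating-point quantization that snaps values onto a non-uniform ExMy lattice (one sign bit, e exponent bits, m mantissa bits). Special values (inf/NaN), per-channel scaling, and the exact exponent-bias conventions of (Micikevicius et al., 2022) are omitted; this is an illustrative sketch, not the experimental implementation.

```python
# Sketch: simulated floating-point quantization by snapping to the nearest value on an
# ExMy grid (e exponent bits, m mantissa bits, one sign bit). Special encodings are ignored.
import torch

def exmy_grid(e: int, m: int) -> torch.Tensor:
    bias = 2 ** (e - 1) - 1
    values = [0.0]
    for exp_field in range(1, 2 ** e):                 # normal numbers
        for frac in range(2 ** m):
            values.append((1 + frac / 2 ** m) * 2.0 ** (exp_field - bias))
    for frac in range(1, 2 ** m):                      # subnormals
        values.append((frac / 2 ** m) * 2.0 ** (1 - bias))
    grid = torch.tensor(sorted(set(values)))
    return torch.cat([-grid.flip(0), grid])            # mirror to include negatives

def fake_fp_quant(x: torch.Tensor, e: int, m: int) -> torch.Tensor:
    grid = exmy_grid(e, m).to(x.dtype)
    scale = grid.abs().max() / x.abs().max().clamp(min=1e-12)  # absmax rescale into range
    idx = torch.argmin((x.flatten()[:, None] * scale - grid[None, :]).abs(), dim=1)
    return (grid[idx] / scale).reshape(x.shape)

# Example: an E4M2-style allocation (1 sign + 4 exponent + 2 mantissa bits).
w = torch.randn(64, 64)
w_q = fake_fp_quant(w, e=4, m=2)
```

Because the resulting lattice is denser near zero and sparser at large magnitudes, this kind of grid roughly matches a Gaussian weight distribution, which is the motivation given in Appendix D.2.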
O ADDITIONAL PLOTS

Figure 18: Illustration of what finite-precision effects during training (Ptrain) and after training (Ppost) look like on learning curves (loss vs tokens, in billions).

Figure 19: Predicted vs actual δPTQ for several N, D. Panels show empirical and predicted inference-time degradation as a function of training precision Pw and post-train quantization precision, for N = 30M / D = 1.6B, N = 60M / D = 6.6B, N = 110M / D = 6.6B, and N = 220M / D = 6.6B.

Figure 20: Marginal sweeps for the precision of activations and KV cache, along with predictions from an Neff functional form analogous to Equation 3 fitted from scratch (activations sweep: MSE 0.0055, R² 0.9410; Pkv sweep: MSE 0.0003, R² 0.9965).

Figure 21: Combined plots for predicting degradation. (a) and (b) illustrate different fitting approaches to modeling degradation, demonstrating a stronger fit when N ↦ Neff is used (without Neff: MSE 9.24e-02, R² 0.8249; with Neff: MSE 5.06e-02, R² 0.9041). (c), (d), (e) illustrate that our unified degradation form can predict degradation when training and serving in any precision. Plots (c)-(e) are made for varied Pw, but the fits in (a) and (b) include runs where Pa, Pkv are also jointly varied.