# Distillation Scaling Laws

Dan Busbridge 1, Amitis Shidani 2, Floris Weers 1, Jason Ramapuram 1, Etai Littwin 1, Russ Webb 1

1 Apple. 2 University of Oxford, UK; work done during an internship at Apple. For a full breakdown of contributions see Appendix J. Correspondence to: Dan Busbridge.

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.

1. Introduction

The study of scaling laws (Hestness et al., 2017; Rosenfeld et al., 2020; Kaplan et al., 2020; Hoffmann et al., 2022) revealed that previously trained Language Models (LMs) could have been more capable had they followed a compute-optimal training paradigm, which determines the model size and the number of training tokens that give the best-performing model under a given compute budget. Many subsequent works have followed compute-optimal training (Dey et al., 2023; Muennighoff et al., 2023b). The size of compute-optimal models grows with compute (Hoffmann et al., 2022), which makes them challenging to use due to the growth in inference costs.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

In practice, this means compute-optimal models are slow, expensive to serve, consume more battery life, raise barriers to entry for academic study, and have a significant carbon footprint. With an inference volume of billions of tokens per day (OpenAI & Pilipiszyn, 2021), the inference cost of an LM is typically significantly larger than its pretraining cost (Chien et al., 2023; Wu et al., 2024a), and will only increase in an era of test-time compute scaling (Snell et al., 2024; Brown et al., 2024; Wu et al., 2024b). Unsustainable inference costs have led to an alternative training paradigm, overtraining (Gadre et al., 2024), where the amount of training data used is much greater than in the compute-optimal case, enabling small, capable models.

Figure 1. Extrapolations of the Distillation Scaling Law. The distillation scaling law (Equation 8) is fitted to students with high cross-entropy (LS > 2.3) for a range of teachers with cross-entropies LT. Solid lines represent predicted model behavior for unseen teachers for a given student configuration (interpolation), and dashed lines represent predicted model behavior beyond seen teachers and for low-cross-entropy students (LS ≤ 2.3). The black diagonal dashed line indicates where student and teacher cross-entropies are equal. Teachers with lower cross-entropy generally produce students with lower cross-entropy, until the capacity gap (see Figure 4 and Appendix B.3). As shown, a student can also outperform its teacher (see Figures 2, 3, and 41).
Overtrained models better satisfy compute optimality when compute is measured over a model's lifetime, rather than just the pretraining cost (Sardana et al., 2024). As supervised scaling laws follow power laws in model size and training data, diminishing returns in performance occur much sooner than in the compute-optimal case. To achieve reasonable capabilities, these models need to be trained on many trillions of tokens (Snell et al., 2024; Brown et al., 2024; Wu et al., 2024b), which is expensive and time-consuming.

We seek models that match the performance of small overtrained models but at lower training cost. A popular candidate is distillation (Hinton et al., 2015), where a capable teacher LM produces targets for a smaller student LM. When distillation is used for LM pretraining, we will call this distillation pretraining. There are many explanations for why distillation works, from dark knowledge transfer, where information is contained in the ratio of probabilities of incorrect classes (Hinton et al., 2015), to distillation acting as a form of regularization (Mobahi et al., 2020), or reducing noise in the learning process (Menon et al., 2020), among many other explanations. Despite the lack of consensus on why distillation works, distillation pretraining has produced more capable models than supervised pretraining in the Gemma and Gemini (Rivière et al., 2024), Minitron (Muralidharan et al., 2024; Sreenivas et al., 2024), and AFM (Gunter et al., 2024) families of LMs, in terms of both pretraining loss and downstream evaluations. Yet, at the same time, Liu et al. (2024) reported that distillation produces less capable models than supervised pretraining does.
With such significant compute resources being devoted to distillation pretraining of LMs, it is essential to understand how to correctly allocate these resources to produce the most capable models possible, and to understand whether gains are even possible compared to supervised pretraining when both methods have access to the same resources (Dehghani et al., 2021). To close this knowledge gap, we conduct a comprehensive, controlled study of distillation, with transformer students and teachers ranging from 143M to 12.6B parameters, trained on a few billion to 512B tokens. These experiments yield our distillation scaling law, which estimates student performance as a function of resources (the teacher, the student size, and the amount of distillation data). This resolves when distillation is and is not effective for producing models of a desired capability under practical resource constraints of interest. We find the following:

1. The cross-entropy of a student of size NS distilled on DS tokens from a teacher of size NT trained on DT tokens can be predicted using our distillation scaling law (Equation 8).

2. The teacher size NT and number of teacher training tokens DT determine the student cross-entropy only through the resulting teacher cross-entropy LT = LT(NT, DT) (Figure 3b).

3. The influence of the teacher cross-entropy on the student loss follows a power law that transitions between two behaviors depending on the relative learning capacities of the student and the teacher, reflecting a phenomenon in distillation called the capacity gap, where a stronger teacher produces a worse student. Our parameterization resolves outstanding questions about the capacity gap, showing that it is a gap in learning capacity (both hypothesis space and ability to optimize) between the teacher and student, and not only about their relative sizes, which is a special case.
Our results show that distillation cannot produce lower model cross-entropies than supervised learning when both learning processes are given enough data or compute. However, distillation is more efficient than supervised learning if both of the following are true:

1. The total compute or tokens used for the student is not larger than a student-size-dependent threshold given by our scaling law (Section 5.1).

2. A teacher already exists, or the teacher to be trained has uses beyond a single distillation (Section 5.3).

We hope the laws and analyses we provide will guide the community to produce even more capable models with lower inference cost and lower lifetime compute costs.

2. Background

Predicting model performance is essential when scaling, as it lets us understand i) the value of increasing the available compute (C), and ii) how that compute should be distributed, typically between model parameters (N) and data (D), in order to achieve a model with desired properties. These properties may be predicting the data distribution sufficiently well, measured in cross-entropy (L), or achieving a level of performance on downstream tasks of interest. Fortunately, cross-entropy is predictable, with substantial empirical and theoretical evidence that L follows a power law in parameters N and data D (measured in tokens):

L(N, D) = E + (A / N^α + B / D^β)^γ,   (1)

where L(N, D) is the model cross-entropy, E is the irreducible error, the remaining term captures the model's ability to mimic the data, and {E, A, B, α, β, γ} are task-specific positive coefficients estimated from n training runs {(Ni, Di, Li)}, i = 1, …, n. The choice of runs is critical; not all experiments enable identifying the coefficients of Equation 1. One could use compute-optimal models, whose size N and number of training tokens D give the lowest cross-entropy subject to a compute constraint C:

N*, D* = argmin_{N,D} L(N, D)  s.t.  FLOPs(N, D) = C.   (2)

This is tempting, as compute-optimal models offer the largest loss variation for a total experiment budget.
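To make Equation 1 concrete, here is a minimal sketch; the coefficients below are hypothetical placeholders, not the paper's fitted values:

```python
# Hedged sketch of the supervised scaling law (Equation 1) with hypothetical
# coefficients -- the paper's fitted values are not reproduced in this excerpt.
E, A, B, alpha, beta, gamma = 1.7, 420.0, 1100.0, 0.33, 0.28, 1.0

def supervised_loss(N, D):
    """Predicted cross-entropy L(N, D) = E + (A/N**alpha + B/D**beta)**gamma."""
    return E + (A / N**alpha + B / D**beta) ** gamma

# Doubling data at fixed model size gives diminishing loss reductions,
# motivating the compute-optimal trade-off of Equation 2.
N = 1e9
l20, l40, l80 = (supervised_loss(N, D) for D in (20e9, 40e9, 80e9))
assert l20 > l40 > l80 and (l20 - l40) > (l40 - l80)
```

The power-law form means each doubling of data (or parameters) buys a smaller absolute loss reduction than the last, which is why the constrained minimization of Equation 2 has a well-defined optimum.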
Unfortunately, compute-optimal models have a constant token-to-parameter ratio M ≡ D/N = const. (Hoffmann et al., 2022), removing a degree of freedom. To achieve reliable identification of scaling coefficients, Hoffmann et al. (2022) use two training strategies:

1. (Fixed model, varied data) The number of training tokens is varied for a fixed family of models.

2. (IsoFLOP profiles) Model size and training tokens are both varied subject to a total compute constraint.

Data from both strategies is then combined for the fit. See Appendix B for an extended background. The goal of this paper is to predict the cross-entropy LS of a student produced by distillation. This will reveal the value of increasing compute for distillation and, crucially, which distillation produces the student of a given size that achieves the lowest cross-entropy for a given compute budget.

3. Preliminaries

Notation. For a sequence x, x(i:j) = (x(i), x(i+1), …, x(j)) is a slice of the sequence, and x(<i) = x(1:i−1). In the distillation objective, τ > 0 is the distillation temperature. Combining the next-token prediction loss LNTP, the knowledge distillation loss LKD, and the Z-loss LZ results in a total token-level loss for the student:

LS(x(i), z(i)_T, z(i)_S) = (1 − λ) LNTP(x(i), z(i)_S) + λ LKD(z(i)_T, z(i)_S) + λZ LZ(z(i)_S),   (7)

where z(i)_T and z(i)_S are the teacher and student logits at position i.

4. Distillation Scaling Laws

Here we outline the steps taken to arrive at our distillation scaling law. First we describe the experimental setting (Section 4.1) and the experiments needed to determine the scaling coefficients (Section 4.2). Given the empirical observations, we discuss the form our distillation scaling law takes (Section 4.3), find the coefficients, and verify the law under extrapolation (Section 4.4).

4.1. Experimental Setup

All models are based on Gunter et al. (2024) and use decoupled weight decay (Loshchilov & Hutter, 2019) for regularization, as well as a simplified version of µP (Yang & …).
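The token-level objective of Equation 7 can be sketched as follows. The paper's exact definitions of LNTP, LKD, and LZ are not reproduced in this excerpt, so the standard forms below (cross-entropy on the ground-truth token, temperature-τ KL against the teacher, squared log-partition Z-loss) and the default weights are assumptions:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    return z - np.log(np.exp(z).sum())

def distillation_token_loss(x, z_T, z_S, lam=0.5, lam_Z=1e-4, tau=1.0):
    """Token-level student loss in the shape of Equation 7 (illustrative weights).

    x   : index of the ground-truth next token
    z_T : teacher logits at this position
    z_S : student logits at this position
    """
    L_ntp = -log_softmax(z_S)[x]                          # next-token prediction
    log_p_T = log_softmax(z_T / tau)                      # teacher at temperature tau
    log_p_S = log_softmax(z_S / tau)                      # student at temperature tau
    L_kd = np.sum(np.exp(log_p_T) * (log_p_T - log_p_S))  # KL(teacher || student)
    L_z = np.log(np.exp(z_S).sum()) ** 2                  # Z-loss (assumed form)
    return (1 - lam) * L_ntp + lam * L_kd + lam_Z * L_z

loss = distillation_token_loss(x=0, z_T=np.array([2.0, 0.5, -1.0]),
                               z_S=np.array([1.5, 0.7, -0.5]))
assert loss > 0.0
```

Setting λ = 1 with identical teacher and student logits drives the KD term to zero, which is a quick sanity check on any implementation of this loss.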
Figure 5. Scaling law fits. (a) The supervised scaling law (Equation 1) applied to the data in Figure 36a. (b) Our distillation scaling law (Equation 8) applied to the data in Figures 2 to 4. Orange points show predictions from a scaling law fitted on high-cross-entropy models, for which the grey region is extrapolation. Blue points show predictions from a scaling law fitted on all data.

Table 2. The four practical distillation settings we study, and how their compute accounting is implemented through Equation 9.

| Compute scenario | δ^Lgt_T | δ^Pre_T | Description |
| --- | --- | --- | --- |
| Best case (fully amortized teacher) | 0 | 0 | The teacher incurs no additional FLOPs, so we are free to choose the teacher cross-entropy L*_T that minimizes the student cross-entropy. |
| Teacher inference | 1 | 0 | We don't account for the teacher's training cost because the teacher already exists, or we intend to use the teacher as, e.g., a server model. We still pay to use it for distilling a student. |
| Teacher pretraining | 0 | 1 | The teacher needs training, but we store the logits for reuse, either during training or after training, for distilling into sufficiently many students. |
| Teacher pretraining + inference | 1 | 1 | The teacher needs training and we pay for distilling into one student; the worst-case scenario. |

Our compute accounting includes the student training cost, the teacher logit inference cost, and the teacher pretraining cost in the total compute budget (see Table 2). F(N) is the number of floating-point operations (FLOPs) a model with N parameters performs per token during a forward pass. F(N) ≈ 2N is often used, giving supervised training FLOPs ≈ 6ND. We cannot use the 2N approximation, as (i) using non-embedding parameters N induces systematic errors (Porian et al., 2024), and (ii) we are interested in small models with large context sizes, where the FLOP contribution from attention is significant.
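To make the accounting concrete, here is a sketch in Python. Equation 9 is referenced but not reproduced in this excerpt, so the specific form below (3× forward FLOPs per training token, 1× forward per teacher-inference token, scenario flags from Table 2) and all constants are assumptions rather than the paper's fitted values:

```python
def forward_flops(N, c1=0.0, c2=0.0):
    """Forward FLOPs per token.  With c1 = c2 = 0 this is the common 2N
    approximation; nonzero c1, c2 give a corrected form of the kind derived
    in Appendix H.1 (the constants here are illustrative, not fitted)."""
    return 2 * N * (1 + c1 * N ** (-1 / 3) + c2 * N ** (-2 / 3))

def total_flops(N_S, D_S, N_T, D_T, delta_logit_T, delta_pre_T):
    """Total compute for one distillation run under the Table 2 flags.

    Assumed accounting: training costs ~3x forward FLOPs per token
    (forward + backward); teacher inference costs 1x forward per token."""
    student_training    = 3 * forward_flops(N_S) * D_S
    teacher_inference   = delta_logit_T * forward_flops(N_T) * D_S
    teacher_pretraining = delta_pre_T * 3 * forward_flops(N_T) * D_T
    return student_training + teacher_inference + teacher_pretraining

# Best case (fully amortized teacher) vs. worst case (pretraining + inference).
best = total_flops(1e9, 1e12, 8e9, 2e12, delta_logit_T=0, delta_pre_T=0)
worst = total_flops(1e9, 1e12, 8e9, 2e12, delta_logit_T=1, delta_pre_T=1)
assert best < worst
```

The two δ flags simply switch the teacher terms on or off, so the four Table 2 scenarios are the four corners of (δ^Lgt_T, δ^Pre_T) ∈ {0, 1}².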
To resolve these issues, we derive a simple expression F(N) ≈ 2N(1 + c1 N^(−1/3) + c2 N^(−2/3)) for fixed-aspect-ratio models in Appendix H.1, and recommend the scaling community consider adopting this FLOP estimate.

5.1. Fixed Tokens or Compute (Best Case)

To build intuition for when distillation may (and may not) be beneficial, we ask: how well can distillation do in the best-case scenario, compared with supervised learning? We superimpose the data of Figures 2 and 3 onto contours of distilled cross-entropy LS compared to a supervised model given the same resources, L̃S (Figure 6).

Figure 6. Fixed-M Teacher/IsoFLOP students (data). The cross-entropy difference between best-case distillation and supervised learning, as determined by our supervised and distillation scaling laws (Figure 5), for six teacher sizes NT ∈ {546M, …, 7.75B} and a range of token budgets DS ∈ [1B, 10T]. The scatter points correspond to cross-entropies achieved by the runs in Figures 2 and 38a. Blue indicates distillation outperforms supervised learning (LS < L̃S), while red indicates supervised learning outperforms distillation (LS > L̃S). The white horizontal dashed line indicates the teacher size.

Supervised learning always outperforms distillation given enough student compute or tokens. For a modest token budget, distillation is favorable; however, when a large number of tokens is available, supervised learning outperforms distillation. This is expected: in the large-data regime, supervised learning can find the best solution permitted by the model size N (Equation 1), whereas distillation only finds this solution for the optimal teacher L*_T (see Appendix E.6), and is otherwise limited by the distillation process.
Although this finding appears to contradict the patient-teacher finding of Beyer et al. (2022), it does not, primarily due to differences in the supervised baselines (see Appendix D.1). A compute-constrained student version of Figure 6 and IsoFLOP Teacher/Fixed-M student contours are provided in Appendix D.2.

5.2. Fixed Tokens or Compute (Teacher Inference)

Next, we focus on the common scenario of planning to distill and deciding among an existing set of teachers {(L(i)_T, N(i)_T)}, i = 1, …, n. A larger teacher may provide a better learning signal (lower cross-entropy) but will also be more expensive to use because of the teacher logit cost (Equation 9, δ^Lgt_T = 1), inducing a trade-off. Given a target student size NS and budget DS or CTotal, the only degree of freedom is the choice of teacher. For a fixed data budget, as the student size increases, the teacher cross-entropy should be decreased as a power law. Here the compute cost from NT is not relevant, as we are considering a token budget. Student cross-entropy at different distillation token budgets is shown in Figure 7; an equivalent plot for different student sizes while varying tokens is shown in Appendix D.3.

Figure 7. Students given a teacher and token budget. Contours of student cross-entropy LS for a range of teachers and students across four distillation token budgets DS ∈ {250B, 1T, 4T, 16T}. The red line indicates the optimal teacher cross-entropy L*_T(NS, DS) = argmin_{LT} LS(NS, DS, LT) for each student size and distillation token budget.
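The teacher-selection logic of this section can be sketched as follows. The fitted law (Equation 8) is not reproduced in this excerpt, so `predicted_student_loss` below is a hypothetical stand-in that only mimics the qualitative shape: a minimum near the student's own achievable loss, worsening for teachers far stronger than the student can learn from (the capacity gap):

```python
def predicted_student_loss(L_T, N_S, D_S):
    """Hypothetical stand-in for the distillation scaling law (Equation 8 is
    not reproduced here).  A quadratic around a toy student 'floor' mimics
    the capacity-gap shape; the constants are invented for illustration."""
    student_floor = 1.8 + 60.0 / N_S**0.3 + 30.0 / D_S**0.25
    return student_floor + 0.2 * (L_T - student_floor) ** 2

def best_teacher(teachers, N_S, D_S):
    """Pick from an existing set of (L_T, N_T) teachers.  N_T matters only
    through the teacher cross-entropy L_T it produces."""
    return min(teachers, key=lambda t: predicted_student_loss(t[0], N_S, D_S))

teachers = [(2.4, 546e6), (2.1, 1.82e9), (1.9, 7.75e9)]
choice = best_teacher(teachers, N_S=1e9, D_S=1e12)  # lowest-L_T teacher wins here
# Capacity gap: for a small student, an even stronger teacher can be worse.
assert predicted_student_loss(1.0, 150e6, 1e12) > predicted_student_loss(2.0, 150e6, 1e12)
```

The selection reduces to a one-dimensional search over teacher cross-entropy, which is why the section treats the teacher as the only remaining degree of freedom.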
We see that the optimal teacher loss L*_T (red line) decreases as a power law with student size NS until LS matches L*_T, at which point there is an inflection in L*_T and the decrease in teacher loss sharpens with NS. This generalizes the observation of Zhang et al. (2023a) that "optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales," which is a special case of our finding when the teachers are compute optimal (Figure 36a). Note that our findings consistently show that it is the teacher cross-entropy LT that determines the student cross-entropy LS, not NT itself (which merely leads to a given LT). We investigate a fixed compute budget setting for teacher inference only in Appendix D.3.

5.3. Compute Optimal Distillation

We extend the analysis of Hoffmann et al. (2022) to distillation, giving compute-optimal distillation: determining how to produce the student of a desired size NS with the lowest cross-entropy given a compute budget C:

D*_S, N*_T, D*_T = argmin_{DS, NT, DT} LS(NS, DS, NT, DT)  s.t.  FLOPs = C.   (10)

To present the best and worst cases for incorporating teacher inference into the compute constraints, we consider all scenarios presented in Table 2. We also compare against the optimal supervised performance. To find the minima in Equation 10, we perform constrained numerical minimization using Sequential Least Squares Programming (SLSQP) (Kraft, 1988) in SciPy (Virtanen et al., 2019).

Supervised learning always matches optimal distillation at a sufficient compute budget, with the transition point favoring supervised learning increasing as student size grows. In Figure 8 we see that supervised learning always matches the best-case distillation setting at some total compute budget, as anticipated from the asymptotic analysis in Figure 40. The compute transition point at which supervised learning becomes preferable to distillation increases as a function of student size.
See also Figure 6. We also observe that smaller models are more likely to benefit from supervised pretraining, whereas larger models are more likely to benefit from distillation. When teacher training is included in the compute, the best student cross-entropy is always higher than in the supervised setting. This means that if the only aim is to produce the best model of a target size and you do not already have access to a teacher, supervised learning should be used instead of training a teacher and then distilling. Conversely, if the intention is to distill into a family of models, or to use the teacher as a server model, distillation may be computationally preferable to supervised learning. On reflection, this finding should be expected; otherwise it would imply that, for a given total end-to-end compute, distillation outperforms maximum-likelihood optimization.

Figure 8. Compute-optimal distilled student performance. The best cross-entropy that students of four sizes NS ∈ {300M, 1B, 3B, 10B} can achieve in the four distillation scenarios considered (Table 2) and in a supervised baseline, as total compute is varied.

Table 3. Optimal compute allocation trends.

| Student size | Compute (FLOPs) | Allocation |
| --- | --- | --- |
| Small (≲ 3B) | Small (≲ 10^21) | Mostly teacher pretraining. |
| Small (≲ 3B) | Large (≳ 10^25) | Evenly divided between student training and teacher inference; much less on teacher pretraining. |
| Large (≳ 10B) | Small (≲ 10^21) | Mostly standard student training. |
| Large (≳ 10B) | Large (≳ 10^25) | Equally divided between student training, teacher inference, and teacher pretraining. |
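The constrained minimization in Equation 10 can be sketched with SciPy's SLSQP. Since neither the fitted distillation law (Equation 8) nor the exact FLOP accounting (Equation 9) appears in this excerpt, the objective and constraint below are toy stand-ins; only the optimization pattern (log-space variables, equality-constrained SLSQP) is the point:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for compute-optimal distillation (Equation 10): choose
# (D_S, N_T, D_T) minimizing student cross-entropy at fixed student size N_S,
# subject to a total-compute constraint.  The loss and FLOP formulas below
# are invented for illustration.
N_S, C = 1e9, 1e22  # student parameters, total FLOP budget

def student_loss(v):
    D_S, N_T, D_T = np.exp(v)  # optimize in log-space for well-scaled variables
    L_T = 1.8 + 400.0 / N_T**0.32 + 900.0 / D_T**0.28  # toy teacher loss
    return 1.9 + 420.0 / N_S**0.33 + 1100.0 / D_S**0.28 + 0.1 * L_T

def log_flops(v):
    D_S, N_T, D_T = np.exp(v)
    # student training + teacher inference + teacher pretraining (toy accounting)
    return np.log(6 * N_S * D_S + 2 * N_T * D_S + 6 * N_T * D_T)

res = minimize(
    student_loss,
    x0=np.log([1e11, 1e9, 1e11]),
    method="SLSQP",
    constraints=[{"type": "eq", "fun": lambda v: log_flops(v) - np.log(C)}],
)
D_S_opt, N_T_opt, D_T_opt = np.exp(res.x)
```

Optimizing in log-space keeps the variables well-scaled, which SLSQP's finite-difference gradients need when parameters span many orders of magnitude.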
A detailed discussion of the compute-optimal configurations (D*_S, N*_T, D*_T) for all scenarios is provided in Appendix D.4. To build intuition for how the quantities interact, we take the most complex scenario, teacher pretraining + inference. A view of the optimal distillation setup as compute varies is presented in Figure 9. Student and teacher tokens scale as power laws, with student tokens scaling at a faster rate. The optimal teacher size increases initially until it is slightly larger than the student, after which it plateaus. This plateau occurs because inference with large teachers is expensive; as the number of student tokens grows, it becomes more efficient to overtrain the teacher.

Figure 9. Optimal configurations accounting for teacher pretraining and teacher logit inference costs. For student sizes NS ∈ {300M, 1B, 3B, 10B}, the student (NS, D*_S) and teacher (N*_T, D*_T) configurations minimizing the student cross-entropy L*_S subject to a total compute budget that accounts for both teacher pretraining and teacher logit inference costs.

The values in Figure 9 can be recombined to produce the compute terms in Equation 9, as shown in Appendix D.4, Figure 29. We summarize the trends in Table 3.

6. Conclusion

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. We then use our law to study practical distillation scenarios, and show that distillation is more efficient than supervised learning only if (i) the total compute or tokens used for distillation is not larger than a student-size-dependent threshold, and (ii) a teacher already exists, or the teacher to be trained has applications beyond its use in a single distillation.
Moreover, we use this law to determine optimal distillation configurations that can outperform supervised learning, enabling practitioners to select the best teacher for their use case. This work represents the largest controlled empirical study of distillation we are aware of, with systematic ablations of common distillation techniques. Just as supervised scaling laws have mitigated risks in supervised pretraining, our findings offer a roadmap for producing smaller, more powerful models with lower inference costs, reducing carbon footprints, and enhancing the feasibility of test-time scaling.

Acknowledgments

We thank Pierre Ablin, Samira Abnar, Samy Bengio, Miguel Sarabia del Castillo, Federico Danieli, Eeshan Gunesh Dhekane, Angeliki Giannou, Adam Goliński, Tom Gunter, Navdeep Jaitly, Tatiana Likhomanenko, Ian Magnusson, Preetum Nakkiran, Skyler Seto, Josh Susskind, Kunal Talwar, Barry Theobald, Vimal Thilak, Oncel Tuzel, Chong Wang, Jianyu Wang, Luca Zappella, and Shuangfei Zhai for their helpful feedback and critical discussions throughout the process of writing this paper; and Okan Akalin, Hassan Babaie, Peter Bukowinski, Denise Hui, Mubarak Seyed Ibrahim, David Koski, Li Li, Cindy Liu, Cesar Lopez Nataren, Ruoming Pang, Rajat Phull, Evan Samanas, Guillaume Seguin, Dan Swann, Shang-Chen Wu, Joe Zhou, Kelvin Zou, and the wider Apple infrastructure and Foundation Model teams for assistance with developing and running scalable, fault-tolerant code. Names are in alphabetical order by last name within each group.

Impact Statement

This work shows how to apply the framework of scaling laws to the distillation setting, and investigates distillation as a viable alternative to the overtraining paradigm for producing capable language models. Our findings demonstrate when distillation should and should not be performed, from a compute-efficiency perspective, compared to supervised learning. There are a number of benefits to this: 1.
As compute-optimal recipes for distillation are now known, there is greater opportunity for producing powerful models with lower inference costs. Lowering inference costs reduces the largest component of the total carbon footprint of language models (from training to inference).

2. When combined with established scaling laws, there is a larger space of models for which compute-optimal configurations are known. To produce models with a given capability, the compute, hardware, and climate costs are reduced compared to before, thanks to the identification of the optimal recipe.

3. Our distillation scaling law reduces compute usage by eliminating unnecessary experimentation across various hyperparameters and distillation settings. It is now understood that the primary driver of student cross-entropy is teacher cross-entropy, so teacher size and tokens can be removed as search dimensions.

4. Small, powerful models democratize the study of highly capable models, enabling broader participation in the study of their capabilities and safety aspects.

However, there are potential negative consequences:

1. Using distillation as part of a training pipeline introduces new sources of bias. Teacher models may contain bias from their pretraining data. Even if a student is distilled on unbiased data, the bias of the teacher will be inherited by the student.

2. Small, powerful language models are more efficient during inference, reducing the amount of resources needed for malicious actors to achieve their goals, such as generating targeted misinformation at scale.

References

Abdin, M. I., Aneja, J., Behl, H. S., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., Lee, J. R., Lee, Y. T., Li, Y., Liu, W., Mendes, C. C. T., Nguyen, A., Price, E., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Wang, X., Ward, R., Wu, Y., Yu, D., Zhang, C., and Zhang, Y. Phi-4 technical report. CoRR, abs/2412.08905, 2024a.
doi: 10.48550/ ARXIV.2412.08905. URL https://doi.org/10. 48550/ar Xiv.2412.08905. Abdin, M. I., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H. S., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J., Hewett, R. J., Huynh, J., Javaheripi, M., Jin, X., Kauffmann, P., Karampatziakis, N., Kim, D., Khademi, M., Kurilenko, L., Lee, J. R., Lee, Y. T., Li, Y., Liang, C., Liu, W., Lin, E., Lin, Z., Madan, P., Mitra, A., Modi, H., Nguyen, A., Norick, B., Patra, B., Perez-Becker, D., Portet, T., Pryzant, R., Qin, H., Radmilac, M., Rosset, C., Roy, S., Ruwase, O., Saarikivi, O., Saied, A., Salim, A., Santacroce, M., Shah, S., Shang, N., Sharma, H., Song, X., Tanaka, M., Wang, X., Ward, R., Wang, G., Witte, P., Wyatt, M., Xu, C., Xu, J., Yadav, S., Yang, F., Yang, Z., Yu, D., Zhang, C., Zhang, C., Zhang, J., Zhang, L. L., Zhang, Y., Zhang, Y., Zhang, Y., and Zhou, X. Phi-3 technical report: A highly capable language model locally on your phone. Co RR, abs/2404.14219, 2024b. doi: 10.48550/ARXIV.2404.14219. URL https://doi. org/10.48550/ar Xiv.2404.14219. Abnar, S., Shah, H., Busbridge, D., Ali, A. M. E., Susskind, J., and Thilak, V. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models, 2025. URL https://arxiv.org/abs/2501. 12370. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: training generalized multi-query transformer models from multi-head checkpoints. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 4895 4901. Association for Computational Linguistics, 2023. doi: 10.18653/ V1/2023.EMNLP-MAIN.298. URL https://doi. 
org/10.18653/v1/2023.emnlp-main.298. Aitchison, L. Why you don t overfit, and don t need bayes if you only train for one epoch. Co RR, abs/2411.14478, 2024. doi: 10.48550/ARXIV.2411. 14478. URL https://doi.org/10.48550/ ar Xiv.2411.14478. Amara, I., Sepahvand, N. M., Meyer, B. H., Gross, W. J., and Clark, J. J. BD-KD: balancing the divergences for online knowledge distillation. Co RR, abs/2212.12965, 2022. doi: 10.48550/ARXIV.2212. 12965. URL https://doi.org/10.48550/ ar Xiv.2212.12965. Apple. The axlearn library for deep learning., 2023. URL https://github.com/apple/axlearn. Accessed: 2025-02-11. Bahri, Y., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. Explaining neural scaling laws. Co RR, abs/2102.06701, 2021. URL https://arxiv.org/abs/2102. 06701. Barnett, M. An empirical study of scaling laws for transfer. Co RR, abs/2408.16947, 2024. doi: 10.48550/ARXIV. 2408.16947. URL https://doi.org/10.48550/ ar Xiv.2408.16947. Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 1821 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1533 1544. ACL, 2013. URL https://aclanthology.org/D13-1160/. Besiroglu, T., Erdil, E., Barnett, M., and You, J. Chinchilla scaling: A replication attempt. Co RR, abs/2404.10102, 2024. doi: 10.48550/ARXIV.2404. 10102. URL https://doi.org/10.48550/ ar Xiv.2404.10102. Distillation Scaling Laws Beyer, L., Zhai, X., Royer, A., Markeeva, L., Anil, R., and Kolesnikov, A. Knowledge distillation: A good teacher is patient and consistent. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10915 10924. IEEE, 2022. doi: 10.1109/CVPR52688. 2022.01065. URL https://doi.org/10.1109/ CVPR52688.2022.01065. Bhakthavatsalam, S., Khashabi, D., Khot, T., Mishra, B. 
D., Richardson, K., Sabharwal, A., Schoenick, C., Tafjord, O., and Clark, P. Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge. Co RR, abs/2102.03315, 2021. URL https://arxiv.org/ abs/2102.03315. Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., Gao, H., Gao, K., Gao, W., Ge, R., Guan, K., Guo, D., Guo, J., Hao, G., Hao, Z., He, Y., Hu, W., Huang, P., Li, E., Li, G., Li, J., Li, Y., Li, Y. K., Liang, W., Lin, F., Liu, A. X., Liu, B., Liu, W., Liu, X., Liu, X., Liu, Y., Lu, H., Lu, S., Luo, F., Ma, S., Nie, X., Pei, T., Piao, Y., Qiu, J., Qu, H., Ren, T., Ren, Z., Ruan, C., Sha, Z., Shao, Z., Song, J., Su, X., Sun, J., Sun, Y., Tang, M., Wang, B., Wang, P., Wang, S., Wang, Y., Wang, Y., Wu, T., Wu, Y., Xie, X., Xie, Z., Xie, Z., Xiong, Y., Xu, H., Xu, R. X., Xu, Y., Yang, D., You, Y., Yu, S., Yu, X., Zhang, B., Zhang, H., Zhang, L., Zhang, L., Zhang, M., Zhang, M., Zhang, W., Zhang, Y., Zhao, C., Zhao, Y., Zhou, S., Zhou, S., Zhu, Q., and Zou, Y. Deepseek LLM: scaling open-source language models with longtermism. Co RR, abs/2401.02954, 2024. doi: 10.48550/ARXIV.2401.02954. URL https:// doi.org/10.48550/ar Xiv.2401.02954. Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 7432 7439. AAAI Press, 2020. doi: 10.1609/AAAI.V34I05. 6239. URL https://doi.org/10.1609/aaai. v34i05.6239. Blasiok, J., Gopalan, P., Hu, L., and Nakkiran, P. When does optimizing a proper loss yield calibration? In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. 
(eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/e4165c96702bac5f4962b70f3cf2f136-Abstract-Conference.html.

Blondel, M. and Roulet, V. The elements of differentiable programming. CoRR, abs/2403.14606, 2024. doi: 10.48550/ARXIV.2403.14606. URL https://doi.org/10.48550/arXiv.2403.14606.

Brown, B. C. A., Juravsky, J., Ehrlich, R. S., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. CoRR, abs/2407.21787, 2024. doi: 10.48550/ARXIV.2407.21787. URL https://doi.org/10.48550/arXiv.2407.21787.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.

Bucila, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In Eliassi-Rad, T., Ungar, L. H., Craven, M., and Gunopulos, D. (eds.), Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, August 20–23, 2006, pp. 535–541. ACM, 2006. doi: 10.1145/1150402.1150464. URL https://doi.org/10.1145/1150402.1150464.

Burns, C., Izmailov, P., Kirchner, J.
H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., Sutskever, I., and Wu, J. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=ghNRg2mEgN.

Caballero, E., Gupta, K., Rish, I., and Krueger, D. Broken neural scaling laws. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=sckjveqlCZ.

Carrell, A. M., Mallinar, N., Lucas, J., and Nakkiran, P. The calibration generalization gap. CoRR, abs/2210.01964, 2022. doi: 10.48550/ARXIV.2210.01964. URL https://doi.org/10.48550/arXiv.2210.01964.

CERN. CERN data centre: Key information, March 2018. URL http://information-technology.web.cern.ch/sites/information-technology.web.cern.ch/files/CERNDataCentre_KeyInformation_02March2018V1.pdf. Accessed: 2025-01-29.

Chien, A. A., Lin, L., Nguyen, H., Rao, V., Sharma, T., and Wijayawardana, R. Reducing the carbon impact of generative AI inference (today and in 2035). In Porter, G., Anderson, T., Chien, A. A., Eilam, T., Josephson, C., and Park, J. (eds.), Proceedings of the 2nd Workshop on Sustainable Computer Systems, HotCarbon 2023, Boston, MA, USA, 9 July 2023, pp. 11:1–11:7. ACM, 2023. doi: 10.1145/3604930.3605705. URL https://doi.org/10.1145/3604930.3605705.

Cho, J. H. and Hariharan, B. On the efficacy of knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 – November 2, 2019, pp. 4793–4801. IEEE, 2019. doi: 10.1109/ICCV.2019.00489. URL https://doi.org/10.1109/ICCV.2019.00489.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.
W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res., 24:240:1–240:113, 2023. URL https://jmlr.org/papers/v24/22-1144.html.

Clark, A., de Las Casas, D., Guy, A., Mensch, A., Paganini, M., Hoffmann, J., Damoc, B., Hechtman, B. A., Cai, T., Borgeaud, S., van den Driessche, G., Rutherford, E., Hennigan, T., Johnson, M. J., Cassirer, A., Jones, C., Buchatskaya, E., Budden, D., Sifre, L., Osindero, S., Vinyals, O., Ranzato, M., Rae, J. W., Elsen, E., Kavukcuoglu, K., and Simonyan, K. Unified scaling laws for routed language models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 4057–4086. PMLR, 2022. URL https://proceedings.mlr.press/v162/clark22a.html.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.

Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning.
In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html.

DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Zhang, H., Ding, H., Xin, H., Gao, H., Li, H., Qu, H., Cai, J. L., Liang, J., Guo, J., Ni, J., Li, J., Wang, J., Chen, J., Chen, J., Yuan, J., Qiu, J., Li, J., Song, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Xu, L., Xia, L., Zhao, L., Wang, L., Zhang, L., Li, M., Wang, M., Zhang, M., Zhang, M., Tang, M., Li, M., Tian, N., Huang, P., Wang, P., Zhang, P., Wang, Q., Zhu, Q., Chen, Q., Du, Q., Chen, R. J., Jin, R. L., Ge, R., Zhang, R., Pan, R., Wang, R., Xu, R., Zhang, R., Chen, R., Li, S. S., Lu, S., Zhou, S., Chen, S., Wu, S., Ye, S., Ye, S., Ma, S., Wang, S., Zhou, S., Yu, S., Zhou, S., Pan, S., Wang, T., Yun, T., Pei, T., Sun, T., Xiao, W. L., and Zeng, W. DeepSeek-V3 technical report. CoRR, abs/2412.19437, 2024. doi: 10.48550/ARXIV.2412.19437. URL https://doi.org/10.48550/arXiv.2412.19437.

Dehghani, M., Arnab, A., Beyer, L., Vaswani, A., and Tay, Y. The efficiency misnomer. CoRR, abs/2110.12894, 2021. URL https://arxiv.org/abs/2110.12894.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp. 248–255. IEEE Computer Society, 2009. doi: 10.1109/CVPR.2009.5206848. URL https://doi.org/10.1109/CVPR.2009.5206848.

Dey, N., Gosal, G., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. CoRR, abs/2304.03208, 2023. doi: 10.48550/ARXIV.2304.03208. URL https://doi.org/10.48550/arXiv.2304.03208.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Rozière, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C., Marra, C., McConnell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C., Allonsius, D., Song, D., Pintz, D., Livshits, D., Esiobu, D., Choudhary, D., Mahajan, D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., AlBadawy, E., Lobanova, E., Dinan, E., Smith, E. M., Radenovic, F., Zhang, F., Synnaeve, G., Lee, G., Anderson, G. L., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Korevaar, H., Xu, H., Touvron, H., Zarov, I., Ibarra, I. A., Kloumann, I. M., Misra, I., Evtimov, I., Copet, J., Lee, J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J., Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J., Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Upasani, K., Plawiak, K., Li, K., Heafield, K., Stone, K., and et al. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407.21783.
URL https://doi.org/10.48550/arXiv.2407.21783.

Epoch AI. Key trends and figures in machine learning, 2023. URL https://epoch.ai/trends. Accessed: 2025-02-11.

Fan, W., Lu, S., Li, X., Zhan, D., and Gan, L. Revisit the essence of distilling knowledge through calibration. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=NZgbwzaOIx.

Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. Born-again neural networks. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 1602–1611. PMLR, 2018. URL http://proceedings.mlr.press/v80/furlanello18a.html.

Gadre, S. Y., Smyrnis, G., Shankar, V., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., Xin, R., Nezhurina, M., Vasiljevic, I., Jitsev, J., Dimakis, A. G., Ilharco, G., Song, S., Kollar, T., Carmon, Y., Dave, A., Heckel, R., Muennighoff, N., and Schmidt, L. Language models scale reliably with over-training and on downstream tasks. CoRR, abs/2403.08540, 2024. doi: 10.48550/ARXIV.2403.08540. URL https://doi.org/10.48550/arXiv.2403.08540.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.

Gunter, T., Wang, Z., Wang, C., Pang, R., Narayanan, A., Zhang, A., Zhang, B., Chen, C., Chiu, C., Qiu, D., Gopinath, D., Yap, D.
A., Yin, D., Nan, F., Weers, F., Yin, G., Huang, H., Wang, J., Lu, J., Peebles, J., Ye, K., Lee, M., Du, N., Chen, Q., Keunebroek, Q., Wiseman, S., Evans, S., Lei, T., Rathod, V., Kong, X., Du, X., Li, Y., Wang, Y., Gao, Y., Ahmed, Z., Xu, Z., Lu, Z., Rashid, A., Jose, A. M., Doane, A., Bencomo, A., Vanderby, A., Hansen, A., Jain, A., Anupama, A. M., Kamal, A., Wu, B., Brum, C., Maalouf, C., Erdenebileg, C., Dulhanty, C., Moritz, D., Kang, D., Jimenez, E., Ladd, E., Shi, F., Bai, F., Chu, F., Hohman, F., Kotek, H., Coleman, H. G., Li, J., Bigham, J. P., Cao, J., Lai, J., Cheung, J., Shan, J., Zhou, J., Li, J., Qin, J., Singh, K., Vega, K., Zou, K., Heckman, L., Gardiner, L., Bowler, M., Cordell, M., Cao, M., Hay, N., Shahdadpuri, N., Godwin, O., Dighe, P., Rachapudi, P., Tantawi, R., Frigg, R., Davarnia, S., Shah, S., Guha, S., Sirovica, S., Ma, S., Ma, S., Wang, S., Kim, S., Jayaram, S., Shankar, V., Paidi, V., Kumar, V., Wang, X., Zheng, X., and Cheng, W. Apple intelligence foundation language models. CoRR, abs/2407.21075, 2024. doi: 10.48550/ARXIV.2407.21075. URL https://doi.org/10.48550/arXiv.2407.21075.

Harutyunyan, H., Rawat, A. S., Menon, A. K., Kim, S., and Kumar, S. Supervision complexity and its role in knowledge distillation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=8jU7wy7N7mA.

Havrilla, A. and Liao, W. Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data. CoRR, abs/2411.06646, 2024. doi: 10.48550/ARXIV.2411.06646. URL https://doi.org/10.48550/arXiv.2411.06646.

Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. Aligning AI with shared human values. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021.
OpenReview.net, 2021a. URL https://openreview.net/forum?id=dNy_RKzJacY.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net, 2021b. URL https://openreview.net/forum?id=d7KBjmI3GmQ.

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder, N., Ziegler, D. M., Schulman, J., Amodei, D., and McCandlish, S. Scaling laws for autoregressive generative modeling. CoRR, abs/2010.14701, 2020. URL https://arxiv.org/abs/2010.14701.

Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. Scaling laws for transfer. CoRR, abs/2102.01293, 2021. URL https://arxiv.org/abs/2102.01293.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.

Hinton, G. E., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015. URL http://arxiv.org/abs/1503.02531.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022. doi: 10.48550/ARXIV.2203.15556. URL https://doi.org/10.48550/arXiv.2203.15556.

Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., Zhang, X., Thai, Z.
L., Zhang, K., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. MiniCPM: Unveiling the potential of small language models with scalable training strategies. CoRR, abs/2404.06395, 2024. doi: 10.48550/ARXIV.2404.06395. URL https://doi.org/10.48550/arXiv.2404.06395.

Ildiz, M. E., Gozeten, H. A., Taga, E. O., Mondelli, M., and Oymak, S. High-dimensional analysis of knowledge distillation: Weak-to-strong generalization and scaling laws. CoRR, abs/2410.18837, 2024. doi: 10.48550/ARXIV.2410.18837. URL https://doi.org/10.48550/arXiv.2410.18837.

Jain, A., Montanari, A., and Sasoglu, E. Scaling laws for learning with real and surrogate data. CoRR, abs/2402.04376, 2024. doi: 10.48550/ARXIV.2402.04376. URL https://doi.org/10.48550/arXiv.2402.04376.

Jelassi, S., Mohri, C., Brandfonbrener, D., Gu, A., Vyas, N., Anand, N., Alvarez-Melis, D., Li, Y., Kakade, S. M., and Malach, E. Mixture of parrots: Experts improve memorization more than reasoning. CoRR, abs/2410.19034, 2024. doi: 10.48550/ARXIV.2410.19034. URL https://doi.org/10.48550/arXiv.2410.19034.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de Las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B. CoRR, abs/2310.06825, 2023. doi: 10.48550/ARXIV.2310.06825. URL https://doi.org/10.48550/arXiv.2310.06825.

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Barzilay, R. and Kan, M. (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pp. 1601–1611. Association for Computational Linguistics, 2017. doi: 10.18653/V1/P17-1147.
URL https://doi.org/10.18653/v1/P17-1147.

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., Showk, S. E., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., and Kaplan, J. Language models (mostly) know what they know. CoRR, abs/2207.05221, 2022. doi: 10.48550/ARXIV.2207.05221. URL https://doi.org/10.48550/arXiv.2207.05221.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.

Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Su, J., Carreras, X., and Duh, K. (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pp. 1317–1327. The Association for Computational Linguistics, 2016. doi: 10.18653/V1/D16-1139. URL https://doi.org/10.18653/v1/d16-1139.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big Transfer (BiT): General visual representation learning. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), Computer Vision – ECCV 2020 – 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V, volume 12350 of Lecture Notes in Computer Science, pp. 491–507. Springer, 2020. doi: 10.1007/978-3-030-58558-7_29.
URL https://doi.org/10.1007/978-3-030-58558-7_29.

Kraft, D. A Software Package for Sequential Quadratic Programming. Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt Köln: Forschungsbericht. Wiss. Berichtswesen d. DFVLR, 1988. URL https://books.google.co.uk/books?id=4rKaGwAACAAJ.

Lee, D., Tian, Z., Zhao, Y., Cheung, K. C., and Zhang, N. L. Hard gate knowledge distillation - leverage calibration for robust and reliable language model. In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7–11, 2022, pp. 9793–9803. Association for Computational Linguistics, 2022. doi: 10.18653/V1/2022.EMNLP-MAIN.665. URL https://doi.org/10.18653/v1/2022.emnlp-main.665.

Li, Y., Bubeck, S., Eldan, R., Giorno, A. D., Gunasekar, S., and Lee, Y. T. Textbooks are all you need II: phi-1.5 technical report. CoRR, abs/2309.05463, 2023. doi: 10.48550/ARXIV.2309.05463. URL https://doi.org/10.48550/arXiv.2309.05463.

Liu, Z., Zhao, C., Iandola, F. N., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., and Chandra, V. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=EIGbXbxcUQ.

Lopez-Paz, D., Bottou, L., Schölkopf, B., and Vapnik, V. Unifying distillation and privileged information. In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.03643.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019.
URL https://openreview.net/forum?id=Bkg6RiCqY7.

Ludziejewski, J., Krajewski, J., Adamczewski, K., Pióro, M., Krutul, M., Antoniak, S., Ciebiera, K., Król, K., Odrzygózdz, T., Sankowski, P., Cygan, M., and Jaszczur, S. Scaling laws for fine-grained mixture of experts. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21–27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=yoqdlynCRs.

Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. Teacher's pet: Understanding and mitigating biases in distillation. Trans. Mach. Learn. Res., 2022. URL https://openreview.net/forum?id=ph3AYXpwEb.

Menon, A. K., Rawat, A. S., Reddi, S. J., Kim, S., and Kumar, S. Why distillation helps: A statistical perspective. CoRR, abs/2005.10419, 2020. URL https://arxiv.org/abs/2005.10419.

Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., Héliou, A., Tacchetti, A., Bulanova, A., Paterson, A., Tsai, B., Shahriari, B., Lan, C. L., Choquette-Choo, C. A., Crepy, C., Cer, D., Ippolito, D., Reid, D., Buchatskaya, E., Ni, E., Noland, E., Yan, G., Tucker, G., Muraru, G., Rozhdestvenskiy, G., Michalewski, H., Tenney, I., Grishchenko, I., Austin, J., Keeling, J., Labanowski, J., Lespiau, J., Stanway, J., Brennan, J., Chen, J., Ferret, J., Chiu, J., and et al. Gemma: Open models based on Gemini research and technology. CoRR, abs/2403.08295, 2024. doi: 10.48550/ARXIV.2403.08295. URL https://doi.org/10.48550/arXiv.2403.08295.

Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., and Lucic, M. Revisiting the calibration of modern neural networks. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W.
(eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, pp. 15682–15694, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/8420d359404024567b5aefda1231af24-Abstract.html.

Mirzadeh, S., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., and Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, pp. 5191–5198. AAAI Press, 2020. doi: 10.1609/AAAI.V34I04.5963. URL https://doi.org/10.1609/aaai.v34i04.5963.

Mobahi, H., Farajtabar, M., and Bartlett, P. L. Self-distillation amplifies regularization in Hilbert space. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/2288f691b58edecadcc9a8691762b4fd-Abstract.html.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models. CoRR, abs/2305.16264, 2023a. doi: 10.48550/ARXIV.2305.16264. URL https://doi.org/10.48550/arXiv.2305.16264.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Tazi, N., Piktus, A., Pyysalo, S., Wolf, T., and Raffel, C. A. Scaling data-constrained language models. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023b.
URL http://papers.nips.cc/paper_files/paper/2023/hash/9d89448b63ce1e2e8dc7af72c984c196-Abstract-Conference.html.

Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H. S., and Dokania, P. K. Calibrating deep neural networks using focal loss. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/aeb7b30ef1d024a76f21a1d40e30c302-Abstract.html.

Muralidharan, S., Sreenivas, S. T., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J., and Molchanov, P. Compact language models via pruning and knowledge distillation. CoRR, abs/2407.14679, 2024. doi: 10.48550/ARXIV.2407.14679. URL https://doi.org/10.48550/arXiv.2407.14679.

Nagarajan, V., Menon, A. K., Bhojanapalli, S., Mobahi, H., and Kumar, S. On student-teacher deviations in distillation: Does it pay to disobey? In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/12d286282e1be5431ea05262a21f415c-Abstract-Conference.html.

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. Efficient large-scale language model training on GPU clusters using Megatron-LM. In de Supinski, B. R., Hall, M. W., and Gamblin, T. (eds.), International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14–19, 2021, pp. 58. ACM, 2021. doi: 10.1145/3458817.3476209.
URL https://doi.org/10.1145/3458817.3476209.

Nguyen, T. Q. and Salazar, J. Transformers without tears: Improving the normalization of self-attention. In Niehues, J., Cattoni, R., Stüker, S., Negri, M., Turchi, M., Ha, T., Salesky, E., Sanabria, R., Barrault, L., Specia, L., and Federico, M. (eds.), Proceedings of the 16th International Conference on Spoken Language Translation, IWSLT 2019, Hong Kong, November 2–3, 2019. Association for Computational Linguistics, 2019. URL https://aclanthology.org/2019.iwslt-1.17.

Nilsback, M. and Zisserman, A. Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, ICVGIP 2008, Bhubaneswar, India, 16–19 December 2008, pp. 722–729. IEEE Computer Society, 2008. doi: 10.1109/ICVGIP.2008.47. URL https://doi.org/10.1109/ICVGIP.2008.47.

OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.

OpenAI and Pilipiszyn, A. GPT-3 powers the next generation of apps, 2021. URL http://website-url.com. Accessed on Jan 19, 2025.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7–12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016. doi: 10.18653/V1/P16-1144. URL https://doi.org/10.18653/v1/p16-1144.

Paquette, E., Paquette, C., Xiao, L., and Pennington, J. 4+3 phases of compute-optimal neural scaling laws. CoRR, abs/2405.15074, 2024. doi: 10.48550/ARXIV.2405.15074. URL https://doi.org/10.48550/arXiv.2405.15074.

Pareek, D., Du, S. S., and Oh, S. Understanding the gains from repeated self-distillation. CoRR, abs/2407.04600, 2024. doi: 10.48550/ARXIV.2407.04600.
URL https://doi.org/10.48550/arXiv.2407.04600.

Pearce, T. and Song, J. Reconciling Kaplan and Chinchilla scaling laws. CoRR, abs/2406.12907, 2024. doi: 10.48550/ARXIV.2406.12907. URL https://doi.org/10.48550/arXiv.2406.12907.

Peng, H., Lv, X., Bai, Y., Yao, Z., Zhang, J., Hou, L., and Li, J. Pre-training distillation for large language models: A design space exploration. CoRR, abs/2410.16215, 2024. doi: 10.48550/ARXIV.2410.16215. URL https://doi.org/10.48550/arXiv.2410.16215.

Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., and Carmon, Y. Resolving discrepancies in compute-optimal scaling of language models. CoRR, abs/2406.19146, 2024. doi: 10.48550/ARXIV.2406.19146. URL https://doi.org/10.48550/arXiv.2406.19146.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL https://jmlr.org/papers/v21/20-074.html.

Rawat, A. S., Sadhanala, V., Rostamizadeh, A., Chakrabarti, A., Jitkrittum, W., Feinberg, V., Kim, S., Harutyunyan, H., Saunshi, N., Nado, Z., Shivanna, R., Reddi, S. J., Menon, A. K., Anil, R., and Kumar, S. A little help goes a long way: Efficient LLM training by leveraging small LMs. CoRR, abs/2410.18779, 2024. doi: 10.48550/ARXIV.2410.18779. URL https://doi.org/10.48550/arXiv.2410.18779.

Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T. P., Alayrac, J., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., Antonoglou, I., Anil, R., Borgeaud, S., Dai, A.
M., Millican, K., Dyer, E., Glaese, M., Sottiaux, T., Lee, B., Viola, F., Reynolds, M., Xu, Y., Molloy, J., Chen, J., Isard, M., Barham, P., Hennigan, T., McIlroy, R., Johnson, M., Schalkwyk, J., Collins, E., Rutherford, E., Moreira, E., Ayoub, K., Goel, M., Meyer, C., Thornton, G., Yang, Z., Michalewski, H., Abbas, Z., Schucher, N., Anand, A., Ives, R., Keeling, J., Lenc, K., Haykal, S., Shakeri, S., Shyam, P., Chowdhery, A., Ring, R., Spencer, S., Sezener, E., and et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. CoRR, abs/2403.05530, 2024. doi: 10.48550/ARXIV.2403.05530. URL https://doi.org/10.48550/arXiv.2403.05530.

Rivière, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsitsulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., Grill, J., Neyshabur, B., Bachem, O., Walton, A., Severyn, A., Parrish, A., Ahmad, A., Hutchison, A., Abdagic, A., Carl, A., Shen, A., Brock, A., Coenen, A., Laforge, A., Paterson, A., Bastian, B., Piot, B., Wu, B., Royal, B., Chen, C., Kumar, C., Perry, C., Welty, C., Choquette-Choo, C. A., Sinopalnikov, D., Weinberger, D., Vijaykumar, D., Rogozinska, D., Herbison, D., Bandy, E., Wang, E., Noland, E., Moreira, E., Senter, E., Eltyshev, E., Visin, F., Rasskin, G., Wei, G., Cameron, G., Martins, G., Hashemi, H., Klimczak-Plucinska, H., Batra, H., Dhand, H., Nardini, I., Mein, J., Zhou, J., Svensson, J., Stanway, J., Chan, J., Zhou, J. P., Carrasqueira, J., Iljazi, J., Becker, J., Fernandez, J., van Amersfoort, J., Gordon, J., Lipschultz, J., Newlan, J., Ji, J., Mohamed, K., Badola, K., Black, K., Millican, K., McDonell, K., Nguyen, K., Sodhia, K., Greene, K., Sjösund, L. L., Usui, L., Sifre, L., Heuermann, L., Lago, L., and McNealus, L.
Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118, 2024. doi: 10.48550/ARXIV.2408.00118. URL https://doi.org/10.48550/arXiv.2408.00118.

Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6550.

Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryenvpEKDr.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. Commun. ACM, 64(9):99-106, 2021. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381.

Sardana, N., Portes, J. P., Doubov, S., and Frankle, J. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=0bmXrtTDUu.

Shazeer, N. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, abs/2408.03314, 2024. doi: 10.48550/ARXIV.2408.03314. URL https://doi.org/10.48550/arXiv.2408.03314.

Sreenivas, S.
T., Muralidharan, S., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J., and Molchanov, P. LLM pruning and distillation in practice: The Minitron approach. CoRR, abs/2408.11796, 2024. doi: 10.48550/ARXIV.2408.11796. URL https://doi.org/10.48550/arXiv.2408.11796.

Stanton, S., Izmailov, P., Kirichenko, P., Alemi, A. A., and Wilson, A. G. Does knowledge distillation really work? In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 6906-6919, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/376c6b9ff3bedbbea56751a84fffc10c-Abstract.html.

Stein, C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1:197-206, 1956.

Su, J., Ahmed, M. H. M., Lu, Y., Pan, S., Bo, W., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. doi: 10.1016/J.NEUCOM.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063.

Tian, Y., Krishnan, D., and Isola, P. Contrastive representation distillation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SkgpBJrtvS.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/ARXIV.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/ARXIV.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C., Polat, I., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy. SciPy 1.0: Fundamental algorithms for scientific computing in Python. CoRR, abs/1907.10121, 2019. URL http://arxiv.org/abs/1907.10121.

Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. In Derczynski, L., Xu, W., Ritter, A., and Baldwin, T. (eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017, Copenhagen, Denmark, September 7, 2017, pp. 94-106. Association for Computational Linguistics, 2017. doi: 10.18653/V1/W17-4413.
URL https://doi.org/10.18653/v1/w17-4413.

Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., Co-Reyes, J. D., Gur, I., Kumar, A., Novak, R., Pennington, J., Sohl-Dickstein, J., Xu, K., Lee, J., Gilmer, J., and Kornblith, S. Small-scale proxies for large-scale transformer training instabilities. CoRR, abs/2309.14322, 2023. doi: 10.48550/ARXIV.2309.14322. URL https://doi.org/10.48550/arXiv.2309.14322.

Wortsman, M., Liu, P. J., Xiao, L., Everett, K. E., Alemi, A. A., Adlam, B., Co-Reyes, J. D., Gur, I., Kumar, A., Novak, R., Pennington, J., Sohl-Dickstein, J., Xu, K., Lee, J., Gilmer, J., and Kornblith, S. Small-scale proxies for large-scale transformer training instabilities. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=d8w0pmvXbZ.

Wu, C., Acun, B., Raghavendra, R., and Hazelwood, K. M. Beyond efficiency: Scaling AI sustainably. IEEE Micro, 44(5):37-46, 2024a. doi: 10.1109/MM.2024.3409275. URL https://doi.org/10.1109/MM.2024.3409275.

Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. An empirical analysis of compute-optimal inference for problem-solving with language models. CoRR, abs/2408.00724, 2024b. doi: 10.48550/ARXIV.2408.00724. URL https://doi.org/10.48550/arXiv.2408.00724.

Yang, G. and Hu, E. J. Tensor programs IV: Feature learning in infinite-width neural networks. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 11727-11737. PMLR, 2021. URL http://proceedings.mlr.press/v139/yang21c.html.

Yang, G. and Littwin, E. Tensor programs IVb: Adaptive optimization in the infinite-width limit. CoRR, abs/2308.01814, 2023. doi: 10.48550/ARXIV.2308.01814. URL https://doi.org/10.48550/arXiv.2308.01814.

Yang, G., Hu, E.
J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. CoRR, abs/2203.03466, 2022. doi: 10.48550/ARXIV.2203.03466. URL https://doi.org/10.48550/arXiv.2203.03466.

Yang, G., Simon, J. B., and Bernstein, J. A spectral condition for feature learning. CoRR, abs/2310.17813, 2023. doi: 10.48550/ARXIV.2310.17813. URL https://doi.org/10.48550/arXiv.2310.17813.

Yang, G., Yu, D., Zhu, C., and Hayou, S. Tensor programs VI: Feature learning in infinite depth neural networks. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=17pVDnpwwl.

Yuan, M., Lang, B., and Quan, F. Student-friendly knowledge distillation. Knowl. Based Syst., 296:111915, 2024. doi: 10.1016/J.KNOSYS.2024.111915. URL https://doi.org/10.1016/j.knosys.2024.111915.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pp. 4791-4800. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P19-1472. URL https://doi.org/10.18653/v1/p19-1472.

Zhang, B. and Sennrich, R. Root mean square layer normalization. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 12360-12371, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html.

Zhang, C., Raghu, M., Kleinberg, J.
M., and Bengio, S. Pointer value retrieval: A new benchmark for understanding the limits of neural network generalization. CoRR, abs/2107.12580, 2021. URL https://arxiv.org/abs/2107.12580.

Zhang, C., Song, D., Ye, Z., and Gao, Y. Towards the law of capacity gap in distilling language models. CoRR, abs/2311.07052, 2023a. doi: 10.48550/ARXIV.2311.07052. URL https://doi.org/10.48550/arXiv.2311.07052.

Zhang, C., Yang, Y., Liu, J., Wang, J., Xian, Y., Wang, B., and Song, D. Lifting the curse of capacity gap in distilling language models. In Rogers, A., Boyd-Graber, J. L., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 4535-4553. Association for Computational Linguistics, 2023b. doi: 10.18653/V1/2023.ACL-LONG.249. URL https://doi.org/10.18653/v1/2023.acl-long.249.

Zhu, C., Xu, B., Wang, Q., Zhang, Y., and Mao, Z. On the calibration of large language models and alignment. CoRR, abs/2311.13240, 2023. doi: 10.48550/ARXIV.2311.13240. URL https://doi.org/10.48550/arXiv.2311.13240.

Appendices

A Limitations
B Extended background
  B.1 Knowledge Distillation
  B.2 Neural Scaling Laws
  B.3 The Knowledge Distillation Capacity Gap
C Teacher Student Capacity Gaps
  C.1 Kernel Regression
    C.1.1 Setup
    C.1.2 Distilling the Teacher
    C.1.3 U-shape in the student error
  C.2 MLPs on the Mapping Problem
    C.2.1 Problem Definition
    C.2.2 Experimental Findings
D Distillation scaling law applications (additional results)
  D.1 Experimental differences resolving the apparent contradiction with patient teachers
  D.2 Fixed tokens or compute (best case)
  D.3 Fixed size or compute (teacher inference)
  D.4 Compute optimal distillation
    D.4.1 Setup
    D.4.2 Cross-entropy
    D.4.3 Distillation (best case)
    D.4.4 Distillation (teacher inference)
    D.4.5 Distillation (teacher pretraining)
    D.4.6 Distillation (teacher pretraining + inference)
    D.4.7 Optimal teacher training and student distillation tokens
    D.4.8 Optimal teacher size
  D.5 Compute and data efficiency gains for distillation compared to supervised learning
E Additional Results
  E.1 Downstream evaluations
  E.2 Teachers used in distillation
  E.3 Fixed-M teacher/fixed-M students and the capacity gap
  E.4 Full distillation scaling law IsoFLOP profiles
  E.5 Distillation scaling law IsoFLOP optima
  E.6 Distillation with infinite data
  E.7 Weak-to-strong generalization
  E.8 Model calibration
    E.8.1 Teachers
    E.8.2 198M students trained on 20N tokens
    E.8.3 198M students trained on 128B tokens
F Scaling coefficients
  F.1 Supervised scaling law coefficient estimation
  F.2 Distillation scaling law coefficient estimation
  F.3 Scaling law coefficients parametric fit
G Distilling language models in practice
  G.1 Mixing coefficient (λ) sensitivity analysis
  G.2 Temperature (τ) sensitivity analysis
  G.3 Learning rate (η) sensitivity analysis, verification of µP for distillation
  G.4 Distribution truncation methods: Top-k and Top-p sensitivity
  G.5 Forward and reverse KL divergence
H Parameters and Floating Point Operation Estimation
  H.1 Alternative approximation for FLOPs per token as a function of N
  H.2 Model parameters
  H.3 FLOPs per token
I Model architecture
J Contributions

A. Limitations

This work has several limitations that we are aware of:

Our work is performed in the language modeling setting only. Although there is good evidence that the functional form of scaling laws applies across domains (Henighan et al., 2020), we cannot be absolutely certain that distillation behaves in the way we describe in this work in all domains.

We perform our analysis on the English subset of the C4 dataset (see Appendix I). This means that for our larger token runs, data has been repeated. Although Muennighoff et al. (2023b) showed that, on the C4 dataset, repeating data up to four times has negligible impact on loss compared to training on unique data, this was shown in the supervised setting, and we cannot be absolutely certain that the same applies in the distillation setting. A second downside of using the C4 dataset is that we are limited in our ability to analyze downstream evaluations of students resulting from distillation. Our performance on standard English-language downstream tasks closely follows cross-entropy; however, C4 is not as well suited for pretraining when the goal is to probe aspects like reasoning performance (see Appendix E.1).
We focused on distillation as originally defined in Hinton et al. (2015), where the teacher produces a full probability distribution for the student to target. We did this as it is a popular choice for training language models (Rivière et al., 2024; Gunter et al., 2024; Sreenivas et al., 2024). More colloquially, distillation has come to describe the more general process of using a teacher to produce a student. One popular approach for training language models is Sequence-Level Knowledge Distillation (Kim & Rush, 2016), where the teacher is sampled, e.g. with beam search, to produce sequences on which the student is trained in a supervised way. This technique, also called synthetic data generation or hard distillation, has been employed to great effect in the LLaMA families (Touvron et al., 2023a) and, most recently, the smaller models distilled from DeepSeek-R1 (DeepSeek-AI et al., 2024). On top of these distillation methods are many variations of objectives, such as intermediate layer matching (Romero et al., 2015), modified objectives (Tian et al., 2020), and beyond. While we anticipate that our broader findings should apply in these cases, we cannot be absolutely sure. In particular, we suggest that verifying the scaling properties of Sequence-Level Knowledge Distillation in a controlled, resource-constrained manner, as we have done here, is important future work.

Our work exclusively studies transformer-style architectures, for both the teacher and student. While supervised cross-entropy is primarily influenced by model size and the amount of training data (Kaplan et al., 2020), it is plausible that architectural differences might affect model confidence or knowledge transfer in ways not fully captured by cross-entropy. Evidence for this effect was shown in Furlanello et al. (2018), although in a limited-data setting where the teacher behaves as a regularizer and as a learning signal, significantly more complicated than our setting.
Consequently, a study on non-repeated data of i) the influence of architectural disparities, and ii) non-transformer architectures, could offer valuable insights.

Our work exclusively investigates training and distilling on the same data distribution. This was done to allow us to isolate and study algorithmic effects, rather than effects from data. Unfortunately, this study design misses one typical distillation workflow, where a user chooses an openly available model trained by another group on a (possibly unknown) source distribution $p_{\text{source}}$, and then distills it on their own target distribution $p_{\text{target}}$. We suspect the following may occur. Consider the case where the teacher is well-trained, that is, $\hat{p}_T(y|x) \approx p_{\text{source}}(y|x)$. The student trained under Equation 7 should then approximate the teacher distribution, i.e. $\hat{q}_S(y|x) \approx \hat{p}_T(y|x) \approx p_{\text{source}}(y|x)$; that is, on the intersection of the supports of $p_{\text{source}}(x)$ and $p_{\text{target}}(x)$, the student will learn to approximate the next-token distribution of the source domain, and not the target domain. Outside of this intersection, the teacher may behave out-of-domain and cease to provide meaningful signal for the student. Quantifying the scaling properties as a function of this teacher-student domain difference would be a valuable extension of our study.

Our Distillation Scaling Law (Equation 8) is not universal; that is, the coefficients we observe (Appendix F) are specific to our architecture and dataset choices and are not guaranteed to generalize to other architectures and datasets. Further, although the form of our scaling law has many desired limiting behaviors, it is not derived from first principles, as in e.g. Paquette et al. (2024). As such, we cannot fully guarantee the correctness of the law, and suggest a formal derivation of the scaling law as valuable future work.

B. Extended background

B.1. Knowledge Distillation

Bucila et al.
(2006) provided strong evidence that the knowledge gained by a large ensemble of models can be effectively transferred to a single smaller model. Later, Hinton et al. (2015) introduced knowledge distillation, where a smaller student network learns from a larger teacher network by mimicking its softened output probabilities, improving efficiency and generalization. Building on this, Stanton et al. (2021) studied both fidelity and student generalization, showing that while knowledge distillation often improves generalization, it frequently fails to achieve high fidelity, as student models do not fully match the teacher's predictive distribution. We study fidelity in terms of calibration in Appendix E.8, and show that when the learning signal is consistent with the calibration measure, the student in our setup is well-calibrated both with respect to the teacher and to the actual data. Addressing this, Beyer et al. (2022) demonstrated that knowledge distillation is most effective when the teacher is patient and consistent, providing stable targets over prolonged training to improve student generalization and fidelity. Our Language Model (LM) setup automatically satisfies consistency: both the teacher and student see the same data during the student's training. However, our conclusions differ from those of Beyer et al. (2022): although distilling a student for longer does improve its performance, unless the teacher is chosen perfectly, distillation becomes less effective than supervised learning in the patient setting; see Appendix D.2 for a discussion. Beyond empirical insights, Menon et al. (2020) established a bias-variance tradeoff for the student, quantifying how access to teacher logits can significantly enhance learning. Meanwhile, Pareek et al. (2024) investigated self-distillation, where the student and teacher share the same architecture and size, to assess the potential gains from repeatedly applying knowledge distillation.
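The softened-probability objective of Hinton et al. (2015) discussed above can be sketched per token as follows. This is a minimal illustration, not the paper's exact Equation 7: the function names, the temperature τ, and the λ-mixing convention between teacher matching and data cross-entropy are our assumptions, and scaling conventions for the temperature term vary across the literature.

```python
import math

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax over a vocabulary (numerically stabilized).
    m = max(l / tau for l in logits)
    exps = [math.exp(l / tau - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, target_idx, lam=1.0, tau=1.0):
    """Per-token loss mixing teacher matching and data cross-entropy.

    lam = 1.0 is pure distillation (the setting studied in the paper);
    lam = 0.0 recovers ordinary supervised next-token cross-entropy.
    """
    q = softmax(student_logits, tau)   # student distribution
    p = softmax(teacher_logits, tau)   # softened teacher targets
    # Teacher-matching term: cross-entropy H(p, q), i.e. forward KL up to H(p).
    kd = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    # Data term: negative log-likelihood of the observed next token (tau = 1).
    nll = -math.log(softmax(student_logits)[target_idx])
    return lam * kd + (1.0 - lam) * nll
```

At λ = 1 and τ = 1 this reduces to matching the unsoftened teacher distribution; notably, τ = 1 is the temperature the paper's sensitivity analysis (Appendix G.2) finds optimal in its non-repeated-data setting.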
While most studies assume the teacher is a larger model, recent work explores weak-to-strong generalization, where a weaker model distills knowledge into a stronger one. This concept, introduced by Burns et al. (2024) and studied in LMs, was further analyzed by Ildiz et al. (2024), who extended the theoretical analysis to high-dimensional data and over-parameterized regression. Their findings show that distillation can provably outperform training with strong labels under the same data budget, but does not improve the data scaling law. Our distillation scaling law (Equation 8) confirms this finding: for a fixed teacher cross-entropy, distillation does not improve the scaling law compared to the supervised one in Equation 1. Moreover, in many previous works, distillation happens with repeated data, that is, the student sees the same data as the teacher did during its training. In our setup, we do not repeat data between teacher training and distillation, which allows us to examine only the effect of distillation rather than the possible diminishing returns of repeated data; see Muennighoff et al. (2023a) for more details on the effect of repeating data.

B.2. Neural Scaling Laws

Predictable scaling trends in neural networks were first empirically observed by Hestness et al. (2017), and later by Kaplan et al. (2020), who established empirical scaling laws for language model performance based on cross-entropy, which led to Hoffmann et al. (2022) and the pursuit of compute-optimal training. Beyond the empirical studies, there have been many theoretical works which provide explanations for why scaling laws should exist (Bahri et al., 2021; Paquette et al., 2024; Havrilla & Liao, 2024). More recent works explore scaling laws across different distributions, closely related to knowledge distillation. Hernandez et al. (2021) derived a scaling law for transfer learning, analyzing effective data transfer in low-data regimes and diminishing returns in high-data regimes.
Similarly, Barnett (2024) empirically studied pretraining on one distribution to optimize downstream performance on another, showing that when the transfer gap is low, pretraining is a cost-effective strategy. Finally, Jain et al. (2024) theoretically analyzed how additional data from a surrogate model affects generalization, demonstrating that surrogate data can reduce test error even when unrelated, due to Stein's paradox (Stein, 1956), with test error following a scaling law. This setup is related to tuning the coefficient λ in our case, where we also observe a U-shaped behavior depending on the teacher and student sizes (see Figure 51a). However, we are interested in studying the effect of distillation only (λ = 1.0), which differs from their setup. While these works are closely related to knowledge distillation, since one can compare the distribution of the teacher logits to that of the student, they do not establish a distillation scaling law. Moreover, their setup differs from practical knowledge distillation, as it does not involve training a new student model using a teacher, but instead studies the effect of transferring training knowledge to a downstream task. Our work is the first to determine and verify a distillation scaling law and examine the regions where one should distill, as well as the regions where supervised pretraining outperforms distillation; see Figures 6, 7, and 14 in Appendix D.2 and Section 5.2. Finally, for improving inference cost at a given model capability, the scaling behavior of Mixture of Experts (MoE) models (Shazeer et al., 2017; Jelassi et al., 2024) has been investigated in the context of scaling laws (Clark et al., 2022; Ludziejewski et al., 2024; Abnar et al., 2025) as one alternative to knowledge distillation.

B.3.
The Knowledge Distillation Capacity Gap

Despite extensive research on knowledge distillation, a persistent challenge is the curse of the capacity gap, where a larger teacher does not necessarily produce a superior student compared to a smaller teacher. This occurs because a large gap in model capacity makes it harder for the student to effectively learn from the teacher's outputs. As a result, there exists an optimal teacher size along the scaling trajectory that maximizes student performance. Our distillation scaling law in Equation 8 confirms this, revealing a U-shaped trend in the scaling law and validating the existence of an optimal teacher. However, our results further indicate that the capacity gap is influenced not only by the size of the teacher but also by its training tokens and, more generally, its loss. A theoretical analysis in the kernel regression setup (Appendix C) supports these findings. Lukasik et al. (2022) showed that distillation gains are not uniform and can even degrade performance when small teacher errors are amplified by the student. Similarly, Nagarajan et al. (2023) found that deviations in predictive probabilities cause students to exaggerate the teacher's confidence levels. Several works (Peng et al., 2024; Zhang et al., 2023a; Rawat et al., 2024) observed the capacity gap in pretraining distillation for Large Language Models (LLMs), affecting both large-to-small and small-to-large distillation. Notably, Zhang et al. (2023a) proposed an empirical law of the capacity gap, showing that the optimal teacher scale follows an approximately linear relationship with the student's scale. However, our findings suggest that scaling alone is insufficient: one must account for the complexity of the effective hypothesis space (Equation 8), and we show that Zhang et al. (2023a) is a special case of our work when the teachers are compute-optimal from a supervised perspective (see Section 5.3). To address this issue, various strategies have been explored.
Yuan et al. (2024) studied temperature scaling, which simplifies the teacher's output into more learnable representations, aiding student generalization. We analyzed the effect of temperature and learning rate in distillation (Figures 52 and 53) and found that, contrary to existing literature, the optimal temperature is one. We hypothesize that this discrepancy arises because previous studies used repeated tokens, whereas our setup does not involve repeated data. Additionally, Cho & Hariharan (2019) found that early stopping of the teacher's training mitigates the capacity gap, while Mirzadeh et al. (2020) proposed progressive distillation, where knowledge is transferred through intermediate models to improve student learning. Further, Fan et al. (2024) examined the effect of distributional differences in knowledge distillation through the lens of calibration, and found that teacher miscalibration is a primary source of poor student performance and of a capacity gap. We study calibration in Appendix E.8 and show that our teachers are well-calibrated, and that poor calibration therefore cannot be the only source of the capacity gap. Lee et al. (2022) focus on the calibration of the student rather than the teacher, and develop a modified training procedure that swaps between teacher and data supervision, improving student generalization. Amara et al. (2022) investigated further modifications of the objective, using a sample-wise adaptive balance between forward and reverse KL divergence, reducing Expected Calibration Error (ECE) and reducing the capacity gap. From a theoretical perspective, Harutyunyan et al. (2023) analyzed the capacity gap in distillation using supervision complexity in kernel classifiers. Their findings highlight a trade-off between teacher accuracy, student margin with respect to teacher predictions, and teacher complexity, explaining why some teachers are easier for the student to learn from. Earlier, Lopez-Paz et al.
(2016) studied generalization error in distillation, proving that learning from a teacher can be beneficial under certain conditions, particularly when the teacher's capacity is small. Using similar techniques in LMs, Zhang et al. (2023b) demonstrated that among students of different capacities distilled from the same teacher, smaller students suffer from higher generalization error and lower performance, while larger teachers provide lower generalization error, reinforcing the trade-off in teacher-student capacity. Our distillation scaling law (Equation 8) also confirms this trend, and we observe the effect of the capacity gap in our scaling law terms; see Section 4.3 for more details. Foundation models were initially undertrained (Brown et al., 2020), then followed the compute-optimal scaling law carefully (Hoffmann et al., 2022; Pearce & Song, 2024; Besiroglu et al., 2024), and soon after started to be overtrained heavily (Sardana et al., 2024; Bi et al., 2024; Hu et al., 2024; Mesnard et al., 2024; Jiang et al., 2023). The LLaMA family (Touvron et al., 2023a;b; Dubey et al., 2024) and the Phi line (Li et al., 2023; Abdin et al., 2024b;a) follow the same trend, where smaller models are overtrained relative to the original Chinchilla scaling laws. In all these cases, the models are designed to be the best possible foundation model that is still cheap and fast to run on lower-end hardware. Besides overtraining, more recently, smaller foundation models tend to be distilled from larger models (Gunter et al., 2024; Rivière et al., 2024; Reid et al., 2024) to further increase performance. In these cases, the large model is either trained specifically to serve as a distillation teacher, or an existing model is reused. In both cases, there are no reports of how the exact teacher size is decided when taking total compute into account.
Determining the optimal allocation of compute in distillation is one of the primary contributions of our work (see Section 5.3).

C. Teacher-Student Capacity Gaps

In this section, we examine the capacity gap in two settings: kernel regression and a synthetic example using a Multi-Layer Perceptron (MLP) for a mapping problem. The kernel regression setup provides a theoretical and analytically tractable perspective on the capacity gap. The MLP-based synthetic example allows us to study the capacity gap in a more practical, learnable function approximation scenario. By analyzing these two setups, we aim to better understand the fundamental limitations of distillation when there is a significant mismatch between teacher and student capacities.

C.1. Kernel Regression

One of our main contributions is that the student loss follows a broken power law, where the transition between the two power-law regions occurs when the student becomes a stronger learner than the teacher (Equation 8). This implies that making the teacher too capable (relative to the student) reduces student performance. In this section we show how a capacity gap provably degrades student performance in the setting of kernel regression. While simple, we believe the underlying principle causing the student performance degradation in this case carries over to much more general settings involving neural networks.

C.1.1. SETUP

Let $\mathcal{H}$ denote a Hilbert space spanned by orthonormal basis functions $\{\phi_i\}_{i=1}^{\infty}$ such that $\langle \phi_i, \phi_j \rangle_{\mathcal{H}} = \delta_{ij}$. Let $f \in \mathcal{H}$ denote the target function, identified by a set of coefficients $\alpha = \{\alpha_i\}_{i=1}^{\infty} \subset \mathbb{R}$ with $\|\alpha\| = M < \infty$, such that:

$$f(x) = \sum_{i=1}^{\infty} \alpha_i \phi_i(x). \qquad (11)$$

Let $\mathcal{H}^m_t$, $\mathcal{H}^n_s$ denote the teacher and student Hilbert spaces respectively:

$$\mathcal{H}^m_t = \mathrm{Span}\{\phi_1, \phi_2, \ldots, \phi_m\}, \qquad (12)$$

$$\mathcal{H}^n_s = \mathrm{Span}\{\phi_1, \phi_2, \ldots, \phi_n\}, \qquad (13)$$

which are the hypothesis spaces of the teacher and student.
Note that while the Hilbert space $\mathcal{H}$ is spanned by an infinite orthonormal basis, the teacher and student spaces are finite, spanned by $m$ and $n$ basis functions respectively, where $|m - n|$ represents the teacher-student capacity gap. Training the teacher and student models involves solving the following constrained optimization problems:

$$g^* = \arg\min_{g \in \mathcal{H}^m_t} \|g - f\|_{\mathcal{H}} \quad \text{s.t.} \quad \|g\|_{\mathcal{H}} \le T, \qquad (14)$$

$$h^* = \arg\min_{h \in \mathcal{H}^n_s} \|h - g^*\|_{\mathcal{H}} \quad \text{s.t.} \quad \|h\|_{\mathcal{H}} \le D, \qquad (15)$$

where $g^*, h^*$ are the optimal teacher and student respectively, and $D \le T < M$. Note that we assume the teacher and student are exposed to an infinite amount of training data, hence our analysis is carried out entirely in function space.

Lemma C.1. The optimal teacher $g^*$ is given by:

$$g^*(x) = C(m, T) \sum_{i=1}^{m} \alpha_i \phi_i(x), \qquad C(m, T) = \begin{cases} 1 & \sqrt{\sum_{i=1}^{m} \alpha_i^2} \le T \\ \dfrac{T}{\sqrt{\sum_{i=1}^{m} \alpha_i^2}} & \text{otherwise.} \end{cases} \qquad (16)$$

The teacher error $e_{\text{teacher}}(m, T)$ is given by:

$$e_{\text{teacher}}(m, T) = \|g^* - f\|_{\mathcal{H}} = \sqrt{(C(m, T) - 1)^2 \sum_{i=1}^{m} \alpha_i^2 + \sum_{i=m+1}^{\infty} \alpha_i^2}. \qquad (17)$$

Proof. By construction we may assume the teacher model takes the form $g = \sum_{i=1}^{m} \beta_i \phi_i$, where $\sqrt{\sum_{i=1}^{m} \beta_i^2} \le T$. We can write the error of $g$ as:

$$e_{\text{teacher}}(m, T, \beta) = \left\| \sum_{i=1}^{m} (\beta_i - \alpha_i) \phi_i + \sum_{i=m+1}^{\infty} \alpha_i \phi_i \right\|_{\mathcal{H}} = \sqrt{\sum_{i=1}^{m} (\beta_i - \alpha_i)^2 + \sum_{i=m+1}^{\infty} \alpha_i^2}. \qquad (18)$$

Note that the minimizing coefficients $\beta^*$ of Equation 18 must take the form $\beta^* = C\alpha$ for some coefficient $C$. Considering the norm constraint on $g$, the constant $C$ takes the form in Equation 16. Plugging the resulting $g^*$ into the expression for $e_{\text{teacher}}(m, T, \beta^*)$ completes the proof.

Notably, and intuitively, the teacher error decreases monotonically as $m$, the teacher model capacity, increases.

C.1.2. DISTILLING THE TEACHER

We now pick our student function $h^*$ by mimicking the teacher subject to a norm constraint:

$$h^*(x) = \arg\min_{h \in \mathcal{H}^n_s} \|h - g^*\|_{\mathcal{H}} \quad \text{s.t.} \quad \|h\|_{\mathcal{H}} \le D. \qquad (19)$$

Lemma C.2. Let $k = \min(m, n)$ be the smaller of the teacher and student capacities. The optimal student $h^*$ is given by:

$$h^* = Q(m, k, T, D)\, C(m, T) \sum_{i=1}^{k} \alpha_i \phi_i, \qquad (20)$$

$$Q(m, k, T, D) = \begin{cases} 1 & C(m, T) \sqrt{\sum_{i=1}^{k} \alpha_i^2} \le D \\ \dfrac{D}{C(m, T) \sqrt{\sum_{i=1}^{k} \alpha_i^2}} & \text{otherwise.} \end{cases} \qquad (21)$$

The student error with respect to the target function is then:

$$e_{\text{student}}(m, n, T, D) = \|h^* - f\|_{\mathcal{H}} = \sqrt{(C(m, T)\, Q(m, k, T, D) - 1)^2 \sum_{i=1}^{k} \alpha_i^2 + \sum_{i=k+1}^{\infty} \alpha_i^2}. \qquad (22)$$

Proof. The proof follows the exact same logic as in Lemma C.1, i.e. we can assume the optimal student is given by $h^* = \sum_{i=1}^{n} \gamma_i \phi_i$. From the distillation loss, the optimal coefficients must match the teacher coefficients for the basis functions $\{\phi_i\}_{i=1}^{n}$, perhaps rescaled due to the norm constraint $\sqrt{\sum_{i=1}^{n} \gamma_i^2} \le D$. This rescaling then gives rise to the additional $Q(m, k, T, D)$ multiplier in Equation 21.

C.1.3. U-SHAPE IN THE STUDENT ERROR

We will prove that the map $m \mapsto e_{\text{student}}(m, n, T, D)$ is comprised of two distinct segments: i) one where the student error monotonically decreases for $m < n$, and ii) one where it monotonically increases for $m \ge n$, establishing a U-shape in the student error echoing the trend seen in Figures 3 and 4.

Case 1: $m < n$. (Student error is non-increasing in $m$)

Claim. For $1 \le m < n$, we have $e_{\text{student}}(m + 1, n, T, D) \le e_{\text{student}}(m, n, T, D)$. In words, when $m < n$, the error does not increase (and typically decreases) as the teacher capacity $m$ increases.

Let $\mathcal{H}^{m,T}_t \subseteq \mathcal{H}^m_t$ denote the space of functions in $\mathcal{H}^m_t$ that are norm-constrained by $T$, i.e.:

$$\mathcal{H}^{m,T}_t = \{f \in \mathcal{H}^m_t : \|f\|_{\mathcal{H}} \le T\}. \qquad (23)$$

Since $\mathcal{H}^{m,T}_t \subseteq \mathcal{H}^{m+1,T}_t$, it follows that $g^*_m \in \mathcal{H}^{m+1,T}_t$, which implies that the teacher error cannot increase as $m$ increases, hence it monotonically decreases. Now, let $h^*_m$ denote the optimal student given the teacher $g^*_m$. Since $D \le T$, then for any $m < n$, we can equivalently write the optimal student $h^*_m$ as the solution to the following optimization problem:

$$h^*_m = \arg\min_{h \in \mathcal{H}^n_s} \|h - g^*_m\|_{\mathcal{H}} \quad \text{s.t.} \quad \|h\|_{\mathcal{H}} \le D \qquad (24)$$

$$= \arg\min_{h \in \mathcal{H}^m_t} \|h - f\|_{\mathcal{H}} \quad \text{s.t.} \quad \|h\|_{\mathcal{H}} \le D, \qquad (25)$$

which corresponds exactly to the objective of finding the optimal teacher with a norm constraint set to $D$. Therefore, from the fact that the teacher error monotonically decreases, we can conclude that the student error monotonically decreases as well in the regime $m < n$.

Case 2: $m \ge n$.
(Student error eventually increases in $m$)

Claim. For $m \ge n$: $e_{\text{student}}(m + 1, n, T, D) \ge e_{\text{student}}(m, n, T, D)$. Hence once $m$ exceeds $n$, the student error cannot decrease any further, and the error eventually starts to rise.

Let $\beta^*_m = \{\beta_1, \ldots, \beta_m\}$ denote the coefficients of the optimal teacher $g^*_m$. Note that in the regime $m \ge n$, as long as $\sqrt{\sum_{i=1}^{n} \beta_i^2} \le D$ (i.e. the norm of the coefficients corresponding to the basis $\{\phi_1, \ldots, \phi_n\}$ is smaller than $D$), we have from Equation 21 that $Q(m, k, T, D) = 1$, which means that the optimal student does not change, hence its error remains constant. If however $\sqrt{\sum_{i=1}^{n} \beta_i^2} > D$, then we have from Equation 21:

$$1 > Q(m, k, T, D) \ge Q(m + 1, k, T, D), \qquad (26)$$

where the second inequality becomes strict if $\alpha^2_{m+1} > 0$. A strict inequality (i.e. $Q(m, k, T, D) > Q(m + 1, k, T, D)$) implies the optimal student is further scaled down due to the teacher having to "spread its capacity" over additional basis functions that are not learnable by the student, thereby strictly increasing its error. Hence for $m \ge n$, we get $e_{\text{student}}(m + 1, n, T, D) \ge e_{\text{student}}(m, n, T, D)$, demonstrating that the error increases monotonically with $m$ once $m \ge n$.

Conclusion (U-shaped trend). Combining these two cases:

$$\begin{cases} \text{for } 1 \le m < n: & e_{\text{student}}(m, n, T, D) \text{ is monotonically decreasing in } m, \\ \text{for } m \ge n: & e_{\text{student}}(m, n, T, D) \text{ is monotonically increasing in } m. \end{cases}$$

Therefore, as a function of $m$, the student error $e_{\text{student}}(m, n, T, D)$ first decreases (for $m < n$) and then increases (for $m \ge n$), giving a U-shape in student error due to a capacity gap between the teacher and the student.

Figure 10. Distillation in kernel regression. We randomly sample the coefficients $\alpha = \{\alpha_1, \ldots, \alpha_{1000}\}$ of the target function uniformly in the range $[-1, 1]$. We fix $T = 5$, $D = 4.5$ and compute the optimal student and teacher errors according to Lemmas C.1 and C.2 for various values of $n$ (dashed curves), and for $m \in \{1, \ldots, 1000\}$.
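The closed forms above are easy to check numerically. The sketch below reproduces the Figure 10 setup ($\alpha$ sampled uniformly in $[-1, 1]$, truncated at $M = 1000$ basis functions, $T = 5$, $D = 4.5$), implementing $C(m, T)$ and $Q(m, k, T, D)$ directly from Equations 16 and 21; the variable names and the particular choice $n = 200$ are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1000                                  # basis truncation, as in Figure 10
alpha = rng.uniform(-1.0, 1.0, size=M)    # target-function coefficients
T, D = 5.0, 4.5                           # teacher / student norm budgets

def teacher(m):
    """C(m, T) from Equation 16 and the teacher error from Equation 17."""
    head = np.sum(alpha[:m] ** 2)
    C = 1.0 if np.sqrt(head) <= T else T / np.sqrt(head)
    err = np.sqrt((C - 1.0) ** 2 * head + np.sum(alpha[m:] ** 2))
    return C, err

def student(m, n):
    """Q(m, k, T, D) from Equation 21 and the student error from Equation 22."""
    C, _ = teacher(m)
    k = min(m, n)
    head = np.sum(alpha[:k] ** 2)
    s = C * np.sqrt(head)
    Q = 1.0 if s <= D else D / s
    return np.sqrt((C * Q - 1.0) ** 2 * head + np.sum(alpha[k:] ** 2))

n = 200                                   # student capacity
errs = [student(m, n) for m in range(1, M + 1)]
best_m = 1 + int(np.argmin(errs))         # minimizing teacher capacity, near m = n
```

As the lemmas predict, the teacher error is monotonically non-increasing in $m$, while the student error bottoms out at a teacher capacity near the student's own ($m \approx n$) before rising again: the U-shape of Figure 10.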
The student error exhibits a U-shaped error curve as predicted, where the error starts to increase when $m \ge n$. The black solid line indicates the teacher error, which always decreases with increasing $m$.

We present an empirical verification of these conclusions in Figure 10. The above theoretical analysis points to an intuitive interpretation of the potentially adverse effect of a large teacher-student capacity gap: the degradation in student performance is due to the teacher learning basis functions that are unreachable by the student, at the expense of basis functions that are reachable by the student. In the following we provide empirical evidence in support of this picture in a controlled yet more realistic setting.

C.2. MLPs on the Mapping Problem

C.2.1. PROBLEM DEFINITION

Here we show a synthetic setting which exhibits the U-shape phenomenon. Matching the kernel regression analysis (Appendix C.1), we find that the synthetic problem must include a class of problems that are easy for the student to learn, and ones that are harder, in order for the U-shape to appear. The problem setting is the Mapping Problem, and is similar in spirit to Pointer Value Retrieval (Zhang et al., 2021). Here, the input is composed of small integers in {0, 1, 2}. The label for each sample is given by the code below, which shows the two cases: i) one where the label is simply given by a one-hot position, and ii) one where the label is given by the location of a matching element in the context portion of the input.
import numpy as np

def find(vector, value):
    """Find locations of value in vector."""
    return np.where(vector == value)[0]

def remove(vector, value):
    """Remove value from vector."""
    return np.delete(vector, find(vector, value))

def label(vector: np.ndarray, num_classes: int) -> np.ndarray:
    """Return the label in [0, num_classes) for vector."""
    assert len(vector) == 2 * num_classes
    one_hot = vector[num_classes:]
    context = vector[:num_classes]
    i = find(one_hot, 1)
    if context[i] == 0:
        return i
    else:
        # remapping
        c = context[i]
        return remove(find(context, c), i)

Examples:
-----------------------------
2020210001000000, label = 1
context [2 0 2 0 2 1 0 0]
one-hot [0 1 0 0 0 0 0 0]
-----------------------------
1210120000000100, label = 2
context [1 1 2 0 1 2 0 0]
one-hot [0 0 0 0 0 1 0 0]
-----------------------------
0122221201000000, label = 6
context [0 1 2 2 2 2 1 2]
one-hot [0 1 0 0 0 0 0 0]
-----------------------------

C.2.2. EXPERIMENTAL FINDINGS

We train MLPs with two hidden layers of equal width; all non-linearities are Rectified Linear Units (ReLUs). Teachers and students of different sizes are produced by varying the hidden layer width only. All models are trained with Adam (Kingma & Ba, 2015) using a peak learning rate of 3 × 10^-4 and a single-cycle cosine learning rate schedule with a linear warmup of 5% of the total training steps. A batch size of 512 is used for all models. Training samples are never repeated. Unless explicitly stated, models are trained on 500 × 512 samples, or 20N samples, where N is the number of model parameters, whichever is larger.

In Figure 11, we look at varying the size of the teacher. For the width-256 model, student performance improves as the teacher size increases to a point, and then worsens. This is observable in both the student cross-entropy (Figure 11a) and accuracy (Figure 11b).
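For reference, the soft-target objective used in this kind of distillation can be written in a few lines of numpy. This is a generic sketch, not the paper's exact objective (which is defined in the main text): λ mixes teacher and hard-label supervision, and τ is the distillation temperature (recall from our temperature study, Figures 52 and 53, that τ = 1 was optimal in our setting).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # numerical stabilization
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, lam=0.5, tau=1.0):
    """lam * CE(teacher soft targets, student) + (1 - lam) * CE(hard labels, student)."""
    p_teacher = softmax(teacher_logits / tau)
    log_p_student = np.log(softmax(student_logits / tau))
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * tau ** 2
    rows = np.arange(len(labels))
    hard = -np.log(softmax(student_logits)[rows, labels]).mean()
    return lam * soft + (1.0 - lam) * hard

# A student whose logits match the teacher's incurs the smallest possible
# soft-target loss (the teacher's own entropy):
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
aligned = distillation_loss(t, t, labels=np.zeros(4, dtype=int), lam=1.0)
mismatched = distillation_loss(rng.normal(size=(4, 8)), t,
                               labels=np.zeros(4, dtype=int), lam=1.0)
```

Since cross-entropy against the teacher distribution is minimized exactly when the student reproduces it, `aligned` comes out smaller than `mismatched` here.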
Aligning with theory and large-scale experiments, the student cannot learn if it is too small, and learns to match the teacher model when the student is large enough. In the intermediate regime, where distillation is often used, we see an optimal teacher size and a capacity gap phenomenon.

Figure 11. Student performance when varying teacher width. (a) Student cross-entropy as teacher width d_ffn is varied. (b) Student accuracy as teacher width d_ffn is varied. Bands show the (25%, 75%) values across four trials.

In Figure 12, a similar effect can be seen when a large teacher (d_ffn = 512) is trained on different amounts of data. This observation aligns with the idea that it is the teacher's completeness in modeling the problem that eventually harms the performance of a student with lesser capacity, and not only the teacher size.

Figure 12. Student performance when varying teacher training data. (a) Student cross-entropy as teacher training data is varied. (b) Student accuracy as teacher training data is varied. Bands show the (25%, 75%) values across four trials.

D. Distillation scaling law applications (additional results)

In this section, we present results referenced in Section 5. We explore the best-case scenario for distillation under fixed student tokens or compute, as well as under fixed teacher size or compute, while accounting for teacher inference. These results provide further insights into the optimal distillation strategies in different resource-constrained settings.

D.1.
Experimental differences resolving the apparent contradiction with patient teachers

Beyer et al. (2022) showed in computer vision that a good teacher is:

1. Patient: Distillation works best when training for a large number of epochs, and
2. Consistent: The teacher and the student see the same views of the data under an augmentation policy.

Our setting automatically satisfies consistency as there is no augmentation policy. There is a remaining question about patience, which in our scenario corresponds to the large D_S limit. We observe that for a given student size:

1. If the teacher is optimally chosen for the student, distilling on a large number of tokens produces the same result as training the model in a supervised way on the same number of tokens (Appendix E.6).
2. Otherwise, supervised learning outperforms distillation (Section 5.3).

The second statement implies that the student should not be trained for too long, appearing to contradict patient teachers. To resolve the contradiction, first we note that the models in Beyer et al. (2022) are trained on a large, diverse dataset, e.g. ImageNet-21k (Kolesnikov et al., 2020), and then fine-tuned on target datasets (e.g. Flowers102 (Nilsback & Zisserman, 2008), or ImageNet-1k (Deng et al., 2009)). Students are distilled on the target datasets and only access the teacher's training distribution indirectly, i.e.:

1. The students in Beyer et al. (2022) do not see the teacher training distribution directly, whereas ours do.
2. There is no supervised baseline where a supervised model has access to both ImageNet-21k and the target dataset.

The absence of a supervised baseline means that Beyer et al. (2022) were unable to observe the point at which supervised learning becomes preferred to distillation as a function of compute or training data. This was not the focus of their work.
In our setting, we do have a supervised baseline, and see that at some amount of compute, supervised learning becomes more efficient than (or equally efficient as) distillation, leading us to upper-bound the length one should distill for. We also see that distilling for longer improves the distilled model performance, i.e. patient teaching does work. However, we additionally note that patient teaching can be compute-suboptimal compared to supervised learning, depending on the specific setting (see Appendix D.4). Additional differences between our experimental setup and theirs, beyond the ones mentioned above, are summarized in Table 4.

Table 4. Experimental setting differences between Beyer et al. (2022) and ours.
- Data repetitions: many repetitions (Beyer et al.) vs. minimal repetitions (ours)
- Data diversity: low number of unique tokens vs. large number of unique tokens
- Domain: vision vs. language
- Objective: fewer categories, more unimodal vs. many categories, highly multimodal
- Architecture: different computer vision architectures vs. Maximal Update Parameterization (µP) optimized homogeneous transformers

D.2. Fixed tokens or compute (best case)

Distillation can outperform supervised learning given enough teacher training tokens or compute. As shown in Figures 13a and 13b, when the teacher size, student size, and number of student tokens are held constant, increasing the number of teacher training tokens makes distillation more favorable than supervised learning. This advantage arises because the teacher, with access to more training tokens, can better approximate the language distribution. As a result, the teacher's learned distribution becomes more informative for the student to follow, thus improving the student's performance. Note that for a fixed student size and compute, the teacher must be sufficiently large and well-trained; otherwise, supervised learning will outperform distillation.
Without adequate teacher size or training, the student may not benefit from the distillation process, leading to inferior performance compared to direct supervised learning. We also see that the scatter data matches up well with the contour colors, despite these contours being a difference of two scaling laws, providing a verification of our setup.

Supervised learning always outperforms distillation given enough student compute or tokens. The trend observed in Figure 14 mirrors that of Section 5.1. It demonstrates that, for a fixed teacher size and compute, supervised learning can outperform distillation when the student's compute is sufficiently large. With enough resources allocated to the student, it can learn more effectively from the data directly, making distillation less advantageous in comparison. This advantage only happens at a compute budget that grows with student size.

Figure 13. IsoFLOP teacher contours with fixed-M students. (a) Fixed data: for a given teacher size N_T and a given number of teacher tokens D_T, the difference between the loss achieved by distillation and supervised learning. Blue indicates distillation outperforms supervised learning, and red indicates when supervised learning outperforms distillation. The white horizontal dashed line indicates the student size. (b) Fixed compute: for a given teacher size N_T and a given teacher compute budget, the difference between the loss achieved by distillation and supervised learning.
Blue indicates distillation outperforms supervised learning, and red indicates when supervised learning outperforms distillation. The white horizontal dashed line indicates the student size.

Figure 14. Fixed-M teacher contours with IsoFLOP students (compute). For a given student size and student compute budget, the difference between the loss achieved by distillation and supervised learning. Blue indicates distillation outperforms supervised learning, and red indicates when supervised learning outperforms distillation. The white horizontal dashed line indicates the teacher size.

D.3. Fixed size or compute (teacher inference)

Fixed student size. For a fixed student size, as the number of student tokens increases, the optimal teacher cross-entropy decreases slightly; see Figure 15. This observation highlights an asymmetry between the growth of student size and student tokens (or their rates in the scaling law), as the behavior here differs from that observed in Section 5.1. Notably, when the student size is sufficiently large, such as N_S = 30B, increasing the student tokens initially leads to a decrease in the teacher's loss, followed by a saturation point and a slow decrease in the optimal teacher's loss.

Figure 15. Student performance given a teacher, varying distillation tokens.
For four distillation student sizes N_S ∈ {1B, 3B, 10B, 30B}, the validation loss achieved by students distilled on D_S ∈ [250B, 16T] tokens under a teacher with loss L_T ∈ [E, 2.5]. The red line indicates the value of the teacher loss resulting in the best-performing student, and the vertical dashed line indicates the number of tokens at which supervised pretraining outperforms distillation.

Fixed compute budget. Given an inference budget N_S, a set of teachers $\{(L_T^{(i)}, N_T^{(i)})\}_{i=1}^{n}$ and a total compute budget $C_{\text{Total}}$, the number of distillation tokens is determined from Equation 9:

$$D_S = C_{\text{Total}} \Big/ \left(3F(N_S) + \delta_T^{\text{Logits}} F(N_T)\right), \qquad (27)$$

where $F(N)$ is the forward floating point operations (FLOPs) per token of a model of size $N$ (see Appendix H). If $\delta_T^{\text{Logits}} = 0$ then there is no price to pay for a larger teacher, and the conclusions are identical to those of the fixed-token analysis of Section 5.2. In the worst-case scenario, $\delta_T^{\text{Logits}} = 1$, using a larger teacher means fewer distillation tokens are available for the student. Due to the capacity gap phenomenon, at small compute budgets this means it is actually better to use a large weak teacher rather than a large strong teacher. Once compute is sufficient to allow enough distillation tokens, a stronger teacher can be used for all student sizes (see Figure 16).
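Equation 27 is simple to operationalize. The sketch below assumes the common approximation F(N) ≈ 2N forward FLOPs per token (our exact F(N) is defined in Appendix H), and shows how charging for teacher logits (δ_T^Logits = 1) shrinks the student's token budget as the teacher grows:

```python
def forward_flops_per_token(n_params):
    # Assumed approximation F(N) ~ 2N; the exact F(N) is given in Appendix H.
    return 2.0 * n_params

def distillation_tokens(c_total, n_student, n_teacher, delta_logits=1):
    """D_S from Equation 27: C_Total / (3 F(N_S) + delta * F(N_T))."""
    return c_total / (3.0 * forward_flops_per_token(n_student)
                      + delta_logits * forward_flops_per_token(n_teacher))

c_total = 1e21
ds_free = distillation_tokens(c_total, 1e9, 10e9, delta_logits=0)  # amortized teacher
ds_paid = distillation_tokens(c_total, 1e9, 10e9, delta_logits=1)  # pay for logits
```

With the amortized teacher (δ = 0), D_S = C_Total / (6 N_S) ≈ 167B tokens in this example; charging for a 10B-parameter teacher's logits cuts that to ≈ 38B, illustrating the token cost of a large teacher at a fixed budget.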
Figure 16. Fixed compute distillation strategy. The student performance obtained for four total compute budgets C_Total ∈ {10^21, 10^22, 10^23, 10^24} FLOPs and four student sizes N_S ∈ {1B, 3B, 10B, 30B} under a teacher of size N_T ∈ [1B, 1T] and teacher loss L_T ∈ [E, 2.5]. The red line indicates the value of teacher loss L_T*(N_T) that results in the best student performance for each teacher size N_T.

Table 5. Scenarios considered in our scaling law applications. Same as Table 2.
- Best case (fully amortized teacher) (δ_T^Lgt = 0, δ_T^Pre = 0): The teacher produces no additional FLOPs, and so we are free to choose the teacher L_T* that minimizes the student cross-entropy.
- Teacher inference (δ_T^Lgt = 1, δ_T^Pre = 0): We don't account for the teacher cost because the teacher already exists, or we intend to use the teacher as e.g. a server model. We still need to pay to use it for distilling a student.
- Teacher pretraining (δ_T^Lgt = 0, δ_T^Pre = 1): The teacher needs training, but we store the logits for re-use, either during training, or after training, for distilling into sufficiently many students.
- Teacher pretraining + inference (δ_T^Lgt = 1, δ_T^Pre = 1): The teacher needs training, and we pay for distilling into one student; the worst-case scenario.

D.4. Compute optimal distillation

D.4.1. SETUP

The solutions resulting in the losses give guidance on how to scale depending on the use case, and are the result of the constrained optimization

$$D_S^*, N_T^*, D_T^* = \arg\min_{D_S, N_T, D_T} L_S(N_S, D_S, N_T, D_T) \quad \text{s.t.} \quad \text{FLOPs}(N_S, D_S, N_T, D_T) = C, \qquad (28)$$

where $L_S(N_S, D_S, N_T, D_T)$ is the distillation scaling law (Equation 8), and

$$\text{FLOPs}(N_S, D_S, N_T, D_T) \equiv \underbrace{3F(N_S)D_S}_{\text{Student Training}} + F(N_T)\Big(\underbrace{\delta_T^{\text{Lgt}} D_S}_{\text{Teacher Logits}} + \underbrace{\delta_T^{\text{Pre}}\, 3D_T}_{\text{Teacher Training}}\Big)$$

is the total number of floating point operations performed in the entire distillation setup. $F(N)$ is the forward FLOPs per token of a model of size $N$ (see Appendix H), and $\delta_T^{\text{Lgt}}, \delta_T^{\text{Pre}} \in [0, 1]$ indicate whether we account for the cost of teacher logit inference for the student targets and the teacher pretraining cost in the total compute budget. For convenience, we restate our compute scenarios of interest in Table 5.

We perform constrained numerical minimization using Sequential Least SQuares Programming (SLSQP) (Kraft, 1988) in SciPy (Virtanen et al., 2019). We allow numerical solutions for model sizes and tokens N_T, D_S, D_T ∈ [1M, 100P]. While this token upper limit is larger than available resources (Epoch AI, 2023), it simplifies discussions when comparing to supervised learning at large compute budgets, which otherwise, for smaller students, would only use a fraction of the available compute.

We begin by looking at the student cross-entropy achievable in each compute scenario alongside the corresponding teacher cross-entropies in Appendix D.4.2. We then investigate the compute-optimal distillation configurations for each scenario that produce those cross-entropies. We look at best case distillation in Appendix D.4.3, teacher inference in Appendix D.4.4, teacher pretraining in Appendix D.4.5, and teacher pretraining + inference in Appendix D.4.6.
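The minimization in Equation 28 is straightforward to set up with SciPy's SLSQP. The sketch below uses the FLOPs accounting above (with the assumed approximation F(N) ≈ 2N) for the teacher pretraining + inference scenario, but substitutes a made-up power-law surrogate for the fitted distillation scaling law of Equation 8, whose coefficients are not reproduced here; only the mechanics carry over. Optimizing in log-space keeps the problem well-conditioned.

```python
import numpy as np
from scipy.optimize import minimize

F = lambda n: 2.0 * n        # assumed F(N) ~ 2N forward FLOPs/token (cf. Appendix H)
N_S, C = 1e9, 1e21           # fixed student size and total compute budget
d_lgt, d_pre = 1, 1          # teacher pretraining + inference scenario

def flops(x):
    d_s, n_t, d_t = np.exp(x)            # work in log-space for conditioning
    return 3 * F(N_S) * d_s + F(n_t) * (d_lgt * d_s + d_pre * 3 * d_t)

def student_loss(x):
    d_s, n_t, d_t = np.exp(x)
    # Made-up surrogate for the distillation scaling law (Equation 8); the real
    # fitted coefficients are in the paper. Exponents/prefactors are illustrative.
    teacher_quality = 400.0 / n_t ** 0.3 + 6e3 / d_t ** 0.3
    return 1.8 + teacher_quality + 200.0 / N_S ** 0.3 + 4e3 / d_s ** 0.3

x0 = np.log([1e10, 1e9, 1e10])           # initial (D_S, N_T, D_T)
res = minimize(
    student_loss, x0, method="SLSQP",
    bounds=[(np.log(1e6), np.log(1e17))] * 3,                     # [1M, 100P]
    constraints=[{"type": "eq", "fun": lambda x: flops(x) / C - 1.0}],
)
d_s_opt, n_t_opt, d_t_opt = np.exp(res.x)  # compute-optimal allocation
```

The equality constraint is scaled to O(1) so SLSQP's default tolerances behave; the returned allocation exhausts the budget while trading student tokens against teacher size and teacher training.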
Finally, to aid comparisons across methods, we present the token and parameter configurations for all methods in Appendix D.4.7 and Appendix D.4.8 respectively. For completeness, in the following sections, some of the findings of Section 5.3 are restated.

D.4.2. CROSS-ENTROPY

In Figure 17 we show the student cross-entropies achieved in the compute optimal case for each scenario in Table 5, and the teacher cross-entropies that enable those student cross-entropies in Figure 18.

Distillation and supervised learning produce the same student at large compute. The first thing to note in Figure 17 is that at low compute, in the best case and teacher inference scenarios, distillation outperforms supervised learning, consistent with our expectations from distillation and the existing literature (see Appendix B.1). However, once the compute is large enough⁶, distillation and supervised learning produce models with the same cross-entropy, i.e. in general, distillation does not allow us to produce better models than supervised learning does; however, distillation does produce better models than supervised learning with modest resources. This behavior is consistent with the asymptotic analysis in Appendix E.6, and can be understood by noting that although distillation modifies the learning process the student undergoes, it does not alter the hypothesis space of the student: that space is tied to the student size N_S, is the same in the supervised and distillation settings, and can be explored in the limit of infinite compute or data.

⁶The level of compute at which this happens is larger for larger models; see Figure 17 for specific values.
Figure 17. Compute optimal distillation student cross-entropies. For eight student sizes, the optimal student validation cross-entropy L_S* in each of the distillation scenarios considered as the total compute is varied. The compute at which distillation and supervised learning produce similar models grows with student size.

Continuing the previous observation, we see in Figure 17 that the supervised cross-entropy approaches the best case and teacher inference student cross-entropies at a value of compute which increases with student size, meaning that larger students benefit from distillation over supervised learning up to larger compute budgets. This implies that if your target student size is small and your compute budget is large, then supervised learning is more likely to be beneficial than if your target student size is larger. This phenomenon happens because larger supervised models saturate in performance at larger values of D (Equation 1), and distillation accelerates progress towards this saturation with the correct choice of teacher (Equation 8), with more capable teachers producing more gains per token.

Including teacher training in compute produces student cross-entropies higher than in the supervised setting. In Figure 17, the supervised cross-entropy is always below that of the teacher pretraining and teacher pretraining + inference scenarios, except at very large compute budgets, when supervised learning and these distillation scenarios produce similar student cross-entropies.
This means that if your only aim is to produce the model of a target size with the lowest cross-entropy and you do not have access to a teacher, then you should choose supervised learning instead of training a teacher and then distilling. Conversely, if the intention is to distill into a family of models, or to use the teacher as a server model, distillation may be more computationally beneficial than supervised learning. This finding aligns with expectations; the alternative would imply that distillation can outperform direct maximum likelihood optimization given fixed compute.

The optimal teacher cross-entropy decreases with increasing total compute. As shown in Figure 18, the optimal teacher cross-entropy loss has a decreasing trend with respect to the total compute. However, in the best case scenario, at low compute for larger students, where the number of student tokens is lower than the Chinchilla rule of thumb, an inflection point occurs in the optimal teacher compute. We now turn to investigating the optimal distillation configurations that achieve these student cross-entropies.

Figure 18. Compute optimal distillation teacher cross-entropies. For eight student sizes, the optimal teacher validation loss L_T* resulting in the lowest student validation loss L_S* in each of the distillation scenarios considered (Table 5) as the total compute is varied.

D.4.3.
DISTILLATION (BEST CASE)

In the distillation (best case) scenario, $\delta_T^{\text{Lgt}} = \delta_T^{\text{Pre}} = 0$, which means that we only account for compute associated with the standard supervised learning case:

$$\text{FLOPs}(N_S, D_S, N_T, D_T) \equiv \underbrace{3F(N_S)D_S}_{\text{Student Training}}.$$

We call this the best case as the scenario reflects a freedom to choose the best distillation setting for a given student size N_S, with all of the compute being put into training the student for as long as possible (maximal D_S). In this sense, we can consider this the upper bound on performance for distillation in our experimental setting.

Figure 19. Compute optimal configuration contours for distillation (best case). The compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for the best case in Figure 17 for a range of student sizes. (N_T*, D_T*) are the supervised compute optimal combination giving rise to L_T* in Figure 18.

This scenario represents the setting where a teacher already exists, or we will use the teacher for another purpose, for example as a server model. In these scenarios, we do not need to worry about the teacher pretraining cost. Additionally, this teacher may be used to produce the logits for many different students, or we may have saved the logits from the teacher during its training. In these cases, the cost for producing the student logits can also be ignored.

The optimal quantities (D_S*, N_T*, D_T*) giving rise to the cross-entropies in Figure 17 are shown in Figures 19 and 20. In the best case scenario, L_T* is determined; however, N_T* and D_T* are not determined, because they do not enter into the compute constraint, yielding a one-dimensional family (N_T(L_T*, D_T), D_T) of valid solutions to the minimization problem (Equation 28).
To provide some guidance for producing L_T*, in Figure 18 we present the supervised compute optimal combination (N_T(L_T*, D_T), D_T), i.e. the combination that minimizes FLOPs F(N_T)D_T subject to L(N_T, D_T) = L_T*. Figure 20. Compute optimal configurations for distillation (best case). For eight student sizes, the compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for the best case in Figure 17. (N_T*, D_T*) are the supervised compute optimal combination giving rise to L_T* in Figure 18. This is a one-dimensional slice of Figure 19. In this scenario, all the compute goes into student tokens, and so in Figure 20 we see that the optimal student tokens D_S* increase with compute at the same rate as for the supervised model, which is higher for smaller students. The optimal teacher parameters N_T* and tokens D_T* move together to produce the L_T* in Figure 18. Again, the exact values of N_T*, D_T* in Figure 20 represent the supervised compute optimal solution for producing the L_T*, but are not the only solution in this compute scenario, since N_T, D_T are not uniquely determined by the compute constraint. D.4.4. DISTILLATION (TEACHER INFERENCE) In the distillation (teacher inference) scenario, \delta^{Lgt}_T = 1, \delta^{Pre}_T = 0, which means that we account for compute associated with the standard supervised learning case as well as the cost of producing the logits for the student

FLOPs(N_S, D_S, N_T, D_T) = \underbrace{3F(N_S)D_S}_{\text{Student Training}} + \underbrace{F(N_T)D_S}_{\text{Teacher Logits}}

This scenario represents the setting where a teacher already exists, but the logits for the distillation still need to be produced.
The optimal quantities (D_S*, N_T*, D_T*) giving rise to the cross-entropies in Figure 17 are shown in Figures 21 and 22. Figure 21. Compute optimal configuration contours for distillation (teacher inference). The compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for teacher inference in Figure 17. The teacher should be overtrained. In the teacher inference scenario, D_T does not contribute directly to compute, but only indirectly through the N_T needed to achieve a given L_T. To minimize N_T at a given L_T, the solution is to maximize D_T, as is seen in Figure 22; Figure 22. Compute optimal configurations for distillation (teacher inference). For eight student sizes, the compute optimal quantities (D_S*, N_T*, D_T*) producing the student cross-entropies for teacher inference in Figure 17. This is a one-dimensional slice of Figure 21. D_T* takes the largest value allowed in our numerical optimization, 10^17 tokens. Although not surprising, this demonstrates the benefit of producing overtrained teachers, instead of taking the tempting strategy of using compute optimal teachers followed by a long distillation process into a smaller student model. As compute is increased, relatively less should be spent on student training, and more on teacher logit inference. The compute allocations resulting from the optimal combination are shown in Figure 23. We see that in all cases, the student training fraction (blue) decreases as compute increases, whereas the teacher logits fraction (orange) increases.
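The overtraining result above can be made concrete with a supervised scaling law: at a fixed teacher cross-entropy, more teacher tokens means a smaller teacher, and it is the teacher's size that is paid for on every student token. A sketch using a Chinchilla-style law L(N, D) = E + A/N^α + B/D^β, with the Hoffmann et al. (2022) coefficient fits as illustrative stand-ins (not the fits from this paper):

```python
def teacher_size_for_loss(l_target, d_t,
                          E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Smallest N_T such that L(N, D) = E + A/N^alpha + B/D^beta hits
    l_target at d_t tokens. Coefficients are Hoffmann et al. (2022)
    fits, used here only as stand-ins."""
    residual = l_target - E - B / d_t**beta  # loss budget left for the N term
    if residual <= 0:
        return float("inf")                  # l_target unreachable at this d_t
    return (A / residual) ** (1 / alpha)

# overtraining (larger d_t) shrinks the teacher needed for the same L_T,
# which lowers the per-token logit cost F(N_T) paid on every student token
sizes = [teacher_size_for_loss(2.2, d) for d in (1e11, 1e12, 1e13)]
assert sizes[0] > sizes[1] > sizes[2]
```

The monotone decrease in required teacher size with teacher tokens is exactly why the numerical optimization pushes D_T to its allowed maximum in this scenario.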
This happens because as compute increases: i) the optimal student tokens D_S* increase at a rate approximately independent of compute, ii) the teacher size increases with compute to provide a stronger signal, while iii) the student size is fixed (see Figure 22). Figure 23. Compute optimal allocations for distillation (teacher inference). For eight student sizes, the compute optimal allocations corresponding to the terms in Equation 29 for the compute optimal values in Figure 22. D.4.5. DISTILLATION (TEACHER PRETRAINING) In the distillation (teacher pretraining) scenario, \delta^{Lgt}_T = 0, \delta^{Pre}_T = 1, which means that we account for compute associated with training the teacher, in addition to the standard training cost of the student, but not the cost of producing the logits

FLOPs(N_S, D_S, N_T, D_T) = \underbrace{3F(N_S)D_S}_{\text{Student Training}} + \underbrace{3F(N_T)D_T}_{\text{Teacher Training}}

This scenario represents the case where we want to figure out which teacher to produce in order to distill into sufficiently many different students, storing the teacher logits for reuse, effectively amortizing the cost of producing the logits. Here, contrary to the previous two scenarios (Appendices D.4.3 and D.4.4), the teacher size N_T and teacher tokens D_T contribute directly to the compute accounting (Equation 32). The optimal quantities (D_S*, N_T*, D_T*) giving rise to the cross-entropies in Figure 17 are shown in Figures 24 and 25. Figure 24.
Compute optimal configuration contours for distillation (teacher pretraining). The compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for teacher pretraining in Figure 17. Figure 25. Compute optimal configurations for distillation (teacher pretraining). For eight student sizes, the compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for teacher pretraining in Figure 17. This is a one-dimensional slice of Figure 24. The compute optimal teacher for distillation is a supervised compute optimal teacher. In Figure 25 we see that the M_T = D_T/N_T ratio of the teacher is constant for all values of compute, and can be compared to the ratio in Figure 19. This can be understood as follows: there is no inference cost to pay for making the teacher large; we are only minimizing the training compute budgets of two models, and the most efficient way to produce a teacher with a given cross-entropy L_T is a teacher that is compute-optimal in a supervised sense. Note that this conclusion is the opposite of the finding in Appendix D.4.4. There, the inference is expensive, and so the teacher should be overtrained. Here, teacher training is expensive, so teacher training should be compute optimal. As compute is increased, relatively less should be spent on teacher training, and more on student training. In Figure 26 we see the compute allocations for the configurations shown in Figure 25: the student training relative compute (blue) increases with increasing compute budget, while the teacher training (green) decreases with increasing compute budget.
This happens because, as in all compute scenarios, with increasing compute, the optimal student tokens D_S* increase (Figure 25). Teacher size and tokens also increase with increasing compute, providing a stronger signal for the student with more tokens to learn from. However, this increase in teacher size and tokens plateaus, while the student tokens continue to increase. This is because here the teacher is compute optimal, and so the amount of compute needed to improve the learning signal for the student is much less than the amount of compute needed to train the student to make use of that signal, due to the stronger diminishing returns with respect to D_S at a fixed N_S (Equation 8). Figure 26. Compute optimal allocations for distillation (teacher pretraining). For eight student sizes, the compute optimal allocations corresponding to the terms in Equation 29 for the compute optimal values in Figure 25. D.4.6. DISTILLATION (TEACHER PRETRAINING + INFERENCE) In the distillation (teacher pretraining + inference) scenario, \delta^{Lgt}_T = \delta^{Pre}_T = 1, which means that we account for all costs associated with distilling a single student

FLOPs(N_S, D_S, N_T, D_T) = \underbrace{3F(N_S)D_S}_{\text{Student Training}} + \underbrace{F(N_T)D_S}_{\text{Teacher Logits}} + \underbrace{3F(N_T)D_T}_{\text{Teacher Training}}

This scenario can be thought of as the compute optimal worst case for distillation, i.e. one teacher is trained only for the purposes of one student. As in Appendices D.4.4 and D.4.5, the teacher size N_T and teacher tokens D_T contribute directly to the compute accounting (Equation 33).
The optimal quantities (D_S*, N_T*, D_T*) giving rise to the cross-entropies in Figure 17 are shown in Figures 27 and 28. Compute optimal teachers should be used for lower compute budgets and overtrained teachers should be used for larger compute budgets. In Figure 28 we see a teacher configuration that interpolates between the teacher pretraining (Appendix D.4.5) and teacher inference (Appendix D.4.4) compute scenarios. At low compute, the optimal number of student tokens D_S* is not too large, which means there is little penalty to increasing the teacher size, resulting in an approximately supervised compute-optimal teacher given a teacher compute budget. Once the optimal number of student tokens becomes higher than the optimal number of teacher tokens, there is a significant penalty to increasing the teacher size. At this point, the teacher solution starts to approach the overtrained solution seen in teacher inference: the optimal teacher tokens continue to increase polynomially, but this is not accompanied by an increase in the teacher size. For sufficiently high compute, corresponding to a large number of student distillation tokens, the compute penalty for teacher size is so large that the optimal teacher size decreases with compute. Figure 27. Compute optimal configuration contours for distillation (teacher pretraining + inference). The compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for teacher pretraining + inference in Figure 17.
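The interpolating behavior in this scenario can be reproduced qualitatively by a coarse numerical search over (N_T, D_T), with D_S fixed by the budget constraint. This sketch makes two loud assumptions: F(N) ≈ 2N, and a placeholder surrogate in place of the fitted distillation law (Equation 8), whose coefficients are not reproduced here:

```python
import itertools
import math

def teacher_loss(n_t, d_t):
    # Chinchilla-style supervised law; Hoffmann et al. (2022) fits as stand-ins
    return 1.69 + 406.4 / n_t**0.34 + 410.7 / d_t**0.28

def student_loss(n_s, d_s, l_t):
    # Placeholder surrogate for the distillation law (Eq. 8): monotone
    # decreasing in n_s, d_s and increasing in l_t; NOT the fitted law
    return 0.05 * l_t + 1.7 + 400.0 / n_s**0.34 + 400.0 / d_s**0.28

def best_config(n_s, budget, grid):
    """Coarse grid search over (N_T, D_T); D_S is then fixed by the
    pretraining + inference budget C = 6 N_S D_S + 2 N_T D_S + 6 N_T D_T
    (taking F(N) ~ 2N as an assumption)."""
    best = (math.inf, None)
    for n_t, d_t in itertools.product(grid, grid):
        d_s = (budget - 6 * n_t * d_t) / (6 * n_s + 2 * n_t)
        if d_s <= 0:
            continue  # teacher pretraining alone exceeds the budget
        loss = student_loss(n_s, d_s, teacher_loss(n_t, d_t))
        if loss < best[0]:
            best = (loss, (d_s, n_t, d_t))
    return best

loss, (d_s, n_t, d_t) = best_config(n_s=300e6, budget=1e21,
                                    grid=[1e8, 3e8, 1e9, 3e9, 1e10])
```

With a finer grid and the actual fitted laws, this is the same constrained minimization our numerical optimization solves; the surrogate is only meant to show the mechanics of the search.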
Figure 28. Compute optimal configurations for distillation (teacher pretraining + inference). For eight student sizes, the compute optimal quantities (D_S*, N_T*, D_T*) giving rise to the student cross-entropies for teacher pretraining + inference in Figure 17. This is a one-dimensional slice of Figure 27. For small students, as compute grows, more should be spent on training the student and producing logits for the student. In Figure 29 we see the compute allocations for the configurations shown in Figure 28. Compute optimal smaller models tend to have smaller teachers, and the optimal teacher tokens always grow at a slower rate than the student tokens, and so the teacher training cost is relatively small. As compute grows, the student is distilled on more tokens, and the teacher always becomes slightly larger than the student, which gives rise to most compute being allocated to the standard student training compute component and to producing the logits for this training. For large students, as compute grows, more should be spent on training the teacher, until a transition happens, after which more should be spent on training the student and producing logits for the student. The explanation for this phenomenon is as above, except that the larger students need a more capable teacher to learn from as compute grows, and so initially compute needs to be used to produce the teachers required. After a certain amount of compute, the large number of optimal student distillation tokens moves the optimal solution towards an overtrained teacher scenario, with more compute being allocated to student training and logit production.
Figure 29. Compute optimal allocations for distillation (teacher pretraining + inference). For eight student sizes, the compute optimal allocations corresponding to the terms in Equation 29 for the compute optimal values in Figure 28. D.4.7. OPTIMAL TEACHER TRAINING AND STUDENT DISTILLATION TOKENS To aid in comparing the different compute strategies presented in Appendices D.4.3 to D.4.6, we now present each compute optimal value for all strategies, including supervised. Here, we show the compute-optimal distillation student tokens D_S* in Figure 30 and the compute-optimal teacher pretraining tokens D_T* in Figure 31. Figure 30. Compute optimal distillation student tokens. For eight student sizes, the compute optimal student tokens D_S* giving rise to the student cross-entropies for all compute scenarios, including supervised. In all scenarios, student tokens should be increased with compute, similarly to the supervised case. We see in Figure 30 that, as in Chinchilla (Hoffmann et al., 2022), supervised tokens are increased polynomially with compute.
Distillation (best case) follows the exact same allocation, as does distillation (teacher pretraining) at asymptotically large compute. All other methods follow the same rate of increase, but with scenario-dependent offsets. Figure 31. Compute optimal distillation teacher tokens. For eight student sizes, the compute optimal teacher tokens D_T* giving rise to the student cross-entropies for all compute scenarios. Optimal teacher tokens interpolate between scenarios based on compute allocation. In Figure 31 we can see more clearly the interpolation behavior discussed in Appendix D.4.6. At low compute, teacher pretraining and teacher pretraining + inference share optimal solutions because the number of student tokens D_S* is small. At high compute, teacher pretraining + inference approaches teacher inference, while teacher pretraining approaches the best case, as D_S* is large and the costs associated with teacher pretraining become less important. D.4.8. OPTIMAL TEACHER SIZE Figure 32. Compute optimal distillation teacher size.
For eight student sizes, the compute optimal teacher size N_T* giving rise to the student cross-entropies for all compute scenarios. Optimal teacher size interpolates between scenarios based on compute allocation. As with the optimal teacher tokens D_T* in Figure 31, the same mechanism causes interpolation behavior in the optimal teacher size (see Figure 32). D.5. Compute and data efficiency gains for distillation compared to supervised learning In this final section, we use the compute-optimal strategies developed through Appendices D.4.3 to D.4.6 to understand, for each distillation compute scenario (Table 5), whether it is more compute and/or data efficient to use distillation compared to supervised learning in order to produce a desired model (i.e. of a given size N_S with a desired performance, measured in cross-entropy L_S). In Figure 33 we show the amount of compute needed to distill a student of a given size to a given cross-entropy as a multiple of the compute that supervised learning needs to produce the same result. We do this for each of the distillation compute scenarios, whose optimal configurations are given in Appendices D.4.3 to D.4.6. In Figure 34 we show the same, except we show the number of tokens needed to distill a student of a given size to a given cross-entropy as a multiple of the number of tokens that supervised learning needs to produce the same result. Our distillation token accounting depends on the compute scenario:

D_{Dist.} = D_S + \delta^{Pre}_T D_T,   (34)

i.e. we only count teacher tokens if the teacher pretraining cost is also included in the compute cost (see Equation 29).
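Equation 34's token accounting is a one-liner; spelled out as a sketch (the function name is ours):

```python
def distillation_tokens(d_s, d_t, delta_pre_t):
    """Eq. 34: D_Dist. = D_S + delta^Pre_T * D_T. Teacher tokens are
    counted only when teacher pretraining is also charged in compute."""
    return d_s + delta_pre_t * d_t

# best case / teacher inference scenarios (delta^Pre_T = 0): only D_S counts
tokens_inference = distillation_tokens(5e11, 6e10, 0)
# teacher pretraining (+ inference) scenarios (delta^Pre_T = 1): D_T included
tokens_pretrain = distillation_tokens(5e11, 6e10, 1)
```

This is the denominator convention behind the data ratios in Figure 34: the same distillation run is charged a different token count depending on whether the teacher had to be trained.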
Figure 33. Compute optimal distillation compute ratios. For eight student sizes, the compute needed to distill a student of the indicated size to the indicated cross-entropy, as a multiple of the supervised compute needed to produce the same result. The horizontal dashed line indicates the break-even point, where supervised learning is as computationally efficient as the corresponding distillation compute scenario. Values greater (less) than one indicate distillation is more (less) expensive than supervised learning for producing a model of the indicated size and cross-entropy. The vertical dashed line indicates the lowest cross-entropy achievable by that student. When teacher training is discounted, distillation is often more efficient. In Figure 33, the best case (blue) and teacher inference (orange) compute scenarios are below the grey dashed line for cross-entropies slightly above the lowest possible cross-entropy (vertical grey dashed line), meaning less compute is needed for distillation than for supervised learning. This compute efficiency translates into data efficiency (see Figure 34). To produce the strongest student possible, supervised learning is more efficient. In Figures 33 and 34, the best case (blue) and teacher inference (orange) compute scenarios attain values larger than one as the target cross-entropy L_S approaches the limiting value L(N = N_S, D = ∞) for each student size N_S (vertical dashed line).
This suggests i) the existence of a more efficient training strategy where distillation is used as an initial training stage, with a transition to supervised learning based on a token or cross-entropy threshold, and ii) the potentially increased importance of data mixtures (λ < 1, see Appendix G.1) when distilling with significant token and/or compute budgets. We leave this for future work. In situations where teacher training is required, supervised learning is more efficient. As observed in Appendix D.4.2, for all student sizes, if teacher pretraining is included in the computational cost of producing a student, supervised learning is always more efficient than distilling. This can be seen from Figure 33, as the teacher pretraining (green) and teacher pretraining + inference (red) compute scenarios are above the grey dashed line, which means more compute is needed for distillation than for supervised learning in those compute scenarios. This compute inefficiency translates into data inefficiency (see Figure 34). Figure 34. Compute optimal distillation data ratios. For eight student sizes, the number of tokens needed to distill a student of the indicated size to the indicated cross-entropy, as a multiple of the supervised tokens needed to produce the same result. The horizontal dashed line indicates the break-even point, where supervised learning is as data efficient as the corresponding distillation compute scenario.
Values greater (less) than one indicate distillation is more (less) expensive than supervised learning for producing a model of the indicated size and cross-entropy. The vertical dashed line indicates the lowest cross-entropy achievable by that student. Distillation is more efficient for larger students. In Figure 33 we see that in the pretraining + inference scenario, producing an N_S = 500M student with a cross-entropy of 2.4 has roughly 3/4 the compute cost of producing the same model with supervised learning, whereas producing an N_S = 10B student with a cross-entropy of 2.2 has roughly 1/2 the compute cost of producing the same model with supervised learning. In terms of data (Figure 34), the 500M and 10B configurations use roughly 2/3 and 1/2 the number of tokens of their supervised counterparts, respectively. The efficiency gains from distillation are thus potentially greater for larger students, whether considering compute or data. E. Additional Results In this section, we provide an extensive list of studies, including downstream evaluations of distillation. We cover the models used as teachers, examine the Kullback-Leibler Divergence (KLD) between teacher and student at fixed token-to-size ratios, and present supplementary materials to Section 4.1. Additionally, we investigate the limiting behavior of our scaling law and weak-to-strong generalization, and conduct a model calibration study to assess fidelity. These analyses offer a comprehensive view of the factors influencing distillation performance and the behavior of our proposed scaling laws. E.1. Downstream evaluations In all settings, we optimize for and predict validation cross-entropy. To confirm that the validation cross-entropy is a good proxy for the downstream evaluations that are of ultimate interest, in Figure 35 we show evaluations for the supervised teachers and the distilled students on downstream evaluation tasks.
ARC Easy (Bhakthavatsalam et al., 2021), ARC Challenge (Bhakthavatsalam et al., 2021), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), WinoGrande (Sakaguchi et al., 2021) and LAMBADA OpenAI (Paperno et al., 2016) are zero-shot tasks. TriviaQA (Joshi et al., 2017) and WebQS (Berant et al., 2013) are one-shot tasks. TriviaQA evaluation is on the larger and more challenging Web split. CoreEn is the average of both the zero-shot and one-shot tasks. We have also included GSM8K (Cobbe et al., 2021) and MMLU (Hendrycks et al., 2021b;a). GSM8K is used in an 8-shot chain-of-thought setting, following LLaMA (Touvron et al., 2023a;b; Dubey et al., 2024). MMLU is used in a five-shot setting. These perform near-random for most of the models, and only show a slight upward trend for models with low cross-entropy. This near-random performance is due to the use of the C4 dataset in training, and we note that we do not aim for competitive downstream evaluation results. Finally, we note that the relation between cross-entropy and downstream performance for the supervised and distilled models is similar. We suspect this is because the student behaves like a low-variance expectation of a biased teacher in the KL-matching distillation scenario (Menon et al., 2020), and we anticipate that the relationship between cross-entropy and downstream performance may be different for alternative distillation strategies. All models are evaluated using an internal version of the open-source lm-evaluation-harness (Gao et al., 2024). Figure 35. Model downstream evaluations. Each scatter point is a different model.
The circular points correspond to distilled students, whose color indicates the cross-entropy of the teacher used for that distillation process. The red crosses correspond to the supervised models (i.e. the teachers). For a discussion of the individual metrics and datasets, see Appendix E.1. E.2. Teachers used in distillation In Figure 36 we show the cross-entropies of the models used as teachers in Section 4.2, and for fitting the supervised scaling law: i) eleven fixed-M-ratio models following the Chinchilla rule of thumb D/N = M ≈ 20 (Hoffmann et al., 2022), ii) six models trained on D = 512B tokens (Figure 36a), and iii) four IsoFLOP profiles (Figure 36b). Together this produces 74 runs corresponding to tuples of (N, D, L). Figure 36. Supervised IsoFLOPs. (a) The cross-entropy of supervised models trained with either a Chinchilla optimal M = D/N ≈ 20 or on 512B tokens. (b) The cross-entropy of supervised models trained with four IsoFLOP profiles C ∈ {3 × 10^19, 10^20, 3 × 10^20, 10^21}. (c) The optimal supervised parameters N*(C) = arg min_N L(C) for each IsoFLOP profile, and the loss L*(C) achieved by that model. Coefficient estimation (Appendix F.1) yields the scaling coefficients shown in Table 6, and a scaling law which has 1% relative prediction error, including when extrapolated from weaker to stronger models (see Figure 5a). E.3. Fixed-M teacher/fixed-M students and the capacity gap Figure 37.
Fixed-M Teacher/Fixed-M Student. Students of three sizes trained with different M_S = D_S/N_S ratios are distilled from teachers with M_T = D_T/N_T ≈ 20. This is a more complete version of Figure 3. In Figure 37, the capacity gap in knowledge distillation can be seen. Improving a teacher's performance does not always improve a student's, and beyond a certain point even reduces it. The KLD between teacher and student is an increasing function of teacher size in all cases, which means that as the teacher improves its own performance, the student finds the teacher more challenging to model, which eventually prevents the student from taking advantage of teacher gains. See Appendix E.8.2 for an investigation using calibration to understand where this mismatch occurs. E.4. Full distillation scaling law IsoFLOP profiles In Figure 38a we provide the full six fixed-M teacher/IsoFLOP student profiles, only two of which were shown in Figure 2. These experiments enable the reliable determination of α′, β′, γ′, A′ and B′. In Figure 38b we provide the full four IsoFLOP teacher/fixed-M student profiles, only two of which were shown in Figure 3. These experiments enable the reliable determination of c_0, c_1, f_1 and d_1. Weak-to-strong generalization occurs. For the weaker teachers (N_T ≤ 2.72B), we see that for students larger than the teacher (N_S > N_T) and for sufficiently large compute budgets, the student is able to outperform the teacher (see Appendix E.7 for a detailed one-dimensional slice). The horizontal dashed line in each pane shows the cross-entropy achieved by the teacher (Appendix E.2). A stronger teacher signal is needed in order for stronger students to outperform the supervised baseline. The horizontal dashed line in each pane shows the cross-entropy achieved by the student if trained using supervised learning (Appendix E.2). We see that weaker students benefit more from distillation, as e.g.
the 198M student has all observed data below this dashed line, meaning all distillations outperform the supervised baseline. However, for the 1.82B student, only the 10^21-FLOP teachers produce distilled students that outperform the supervised baseline. Figure 38. Distillation IsoFLOP profiles. (a) Fixed-M Teacher/Student IsoFLOP profiles: teachers of six sizes with M_T = D_T/N_T ≈ 20 are distilled into students with four IsoFLOP profiles, and a small number with C_S = 3 × 10^21. The horizontal grey and vertical black dashed lines indicate the teacher cross-entropy L_T and size N_T respectively. (b) IsoFLOP Teacher/Fixed-M Student profiles: students of four sizes trained with M_S = D_S/N_S = 20 are distilled from teachers with four IsoFLOP profiles. Horizontal (vertical) dashed lines indicate the student supervised cross-entropy \tilde{L}_S (student size N_S). E.5. Distillation scaling law IsoFLOP optima The optimal loss values of each IsoFLOP in Figure 38a are shown in Figure 39. Figure 39. IsoFLOP optima.
a) The optimal student parameters N*S = arg min_NS L(NS) that give the lowest student validation loss for each teacher-student combination shown in Figure 38a. The dashed lines correspond to the validation loss of the optimal supervised models trained with the four corresponding compute budgets. b) The optimal teacher parameters N*T = arg min_NT L(NT) that give the lowest student validation loss for each teacher-student combination shown in Figure 38b. The black dashed line corresponds to the validation loss of an M = D/N = 20 supervised model of the indicated student size. In both figures, the shaded region corresponds to where weak-to-strong generalization may occur, as NS > NT (see Appendix E.7).

E.6. Distillation with infinite data

From the supervised scaling law (Equation 1), a model with N parameters has a cross-entropy lower bound

L(N) ≥ L(N, D = ∞) = E + (A N^{-α})^γ, (35)

which represents the best solution to the training objective subject to constraints from that model's hypothesis space (Hoffmann et al., 2022) and is achieved when the number of training tokens is large (D → ∞). As the hypothesis space of a model is independent of the procedure used to find its solutions, we anticipate that the student with NS parameters has a cross-entropy lower bound that is the same as the supervised one (Equation 35). However, it is not immediately clear whether this holds in practice, since

LS(NS) ≥ LS(NS, DS = ∞, LT = L*T) (36)
= L*T + (L*T)^{-c0} (A′ NS^{-α′})^{γ′} (1 + (L*T / (d1 L(NS)))^{1/f1})^{-c1 f1}, (37)

where L*T = arg min_LT LS(NS, DS = ∞, LT) is the teacher cross-entropy that minimizes Equation 8. Upon checking numerically, we do find that Equation 35 is consistent with Equation 37 for a range of models N, NS ∈ [100M, 100B] (Figure 40). We stress that unlike our three motivations for the equation properties (Section 4.3), this infinite-data limit was not imposed by hand, and is only true for certain values of the scaling coefficients.
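The numerical consistency check described above can be sketched as follows, using the Table 6 point estimates. This is illustrative only: the minimizing teacher cross-entropy is found by a bounded scalar search, and the exact numbers depend on the fitted coefficients.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Point estimates from Table 6 (supervised and distillation columns).
A_SUP, ALPHA_SUP, GAMMA_SUP, E = 3355.0, 0.408, 0.452, 1.220
A_DIS, ALPHA_DIS, GAMMA_DIS = 2243.0, 0.321, 0.764
C0, C1, F1, D1 = 2.549, 522.6, 0.090, 1.315

def supervised_limit(n):
    """Equation 35: infinite-data cross-entropy bound for an N-parameter model."""
    return E + (A_SUP * n ** -ALPHA_SUP) ** GAMMA_SUP

def distill_limit(n_s, l_t):
    """Equation 37: infinite-data student bound for teacher cross-entropy l_t."""
    gap = (1.0 + (l_t / (D1 * supervised_limit(n_s))) ** (1.0 / F1)) ** (-C1 * F1)
    return l_t + l_t ** -C0 * (A_DIS * n_s ** -ALPHA_DIS) ** GAMMA_DIS * gap

# Best teacher for a 1B-parameter student, searched over a plausible loss range.
n_s = 1e9
best = minimize_scalar(lambda l: distill_limit(n_s, l), bounds=(1.3, 4.0), method="bounded")
```

At these point estimates the two bounds agree closely for NS = 1B, in line with the consistency shown in Figure 40.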
This lower bound consistency is evidence that our distillation scaling law has the desired behavior far outside the range of observed models, at least along the data and teacher axes. We also note that only the optimal teacher for each student size produces a student cross-entropy lower bound consistent with the supervised one. Any other choice produces higher student cross-entropies, either because the teacher is too weak, or due to the capacity gap.

(Figure 40 panel omitted: cross-entropy L (LS) against student parameters NS from 100M to 1T, comparing supervised L(NS, DS = ∞) and distillation LS(NS, DS = ∞, L*T).)

Figure 40. Scaling behavior in the infinite-data regime. For the optimal choice of teacher, the loss achieved by all student sizes under distillation is consistent with the loss achievable by supervised learning. This is not true for every choice of teacher, only the optimal one, which can be determined through numerical optimization of the provided distillation scaling laws (see Section 5).

E.7. Weak-to-strong generalization

In Figure 41 we see that weak-to-strong generalization (Burns et al., 2024; Ildiz et al., 2024) occurs only in the finite distillation data regime: when the number of tokens becomes sufficiently large, the student cross-entropy increases again, eventually matching the teacher cross-entropy. This can be understood in the following way: i) when the student is larger than the teacher, the student contains in its hypothesis space the function represented by the teacher; ii) when the student is shown the teacher outputs on enough of the data manifold, it eventually matches what the teacher does on the whole data manifold. We note this does not explain how and why the student outperforms its teacher; it only constrains the asymptotic (low and high distillation data) behaviors.

(Figure 41 panels omitted: student cross-entropy LS and KL divergence (teacher||student) against student tokens DS ∈ [8B, 512B], for NT = 1.82B, NS = 546M and NT = 546M, NS = 1.82B, with student supervised and teacher baselines.)

Figure 41.
Fixed M-Ratio Teacher, varying student data. We look at strong-to-weak (left) and weak-to-strong (right) distillation, varying distillation tokens DS ∈ [8B, 512B].

E.8. Model calibration

Calibration in LMs refers to the alignment between a model's confidence in its predictions and the actual correctness of those predictions. Well-calibrated models provide confidence scores that accurately reflect their probability of being correct, enabling more reliable decision-making. ECE is a common metric for quantifying miscalibration; it measures the difference between predicted confidence and actual accuracy across multiple confidence intervals:

ECE = Σ_{m=1}^{M} (|Bm| / NSamples) |Accuracy(Bm) - Confidence(Bm)|, (38)

where M is the number of bins, Bm is the set of samples whose confidence scores fall into the m-th bin, |Bm| denotes the number of samples in bin Bm, NSamples = Σ_{m=1}^{M} |Bm| is the total number of samples, and Accuracy(Bm) and Confidence(Bm) are the empirical accuracy and average confidence of the evaluated model in bin m respectively. Lower ECE indicates better model calibration. To measure ECE, we use M = 21 bins uniformly partitioned across the output probability space. Accuracy and confidence are computed in the standard manner: the predicted label is determined via the argmax over the output probabilities for each prediction, and the confidence is defined as the maximum probability assigned to the predicted label. Accuracy is then measured as the proportion of instances where the predicted label matches the ground truth. Notably, this approach focuses solely on the maximum-probability prediction, disregarding the calibration of lower-probability predictions. To assess calibration across the entire output distribution rather than just the top prediction, alternative metrics could be considered.

E.8.1. TEACHERS

In Figure 42, we see the ECE for different sizes of teachers.
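The ECE measurement described above (Equation 38, with M = 21 uniform bins and argmax confidence) can be sketched as follows; `confidences` and `correct` are assumed per-prediction arrays from an evaluation run:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=21):
    """ECE (Equation 38): occupancy-weighted |accuracy - confidence| gap
    over uniform-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each sample to one of n_bins uniform bins over [0, 1].
    idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by |B_m| / N_samples
    return ece
```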
For all models, ECE is between 0.4% and 0.7%, suggesting that the models' confidence estimates closely align with their actual accuracies. We also observe that the blue points, i.e., the teacher's actual accuracy for predictions falling into specific confidence intervals, closely follow the diagonal, indicating that the models are well-calibrated. This well-calibrated behavior can be surprising, as large models can be overconfident. For example, Mukhoti et al. (2020) indicate that the overconfidence of large models observed in (Minderer et al., 2021) arises from overfitting, regardless of training-set correctness.

(Figure 42 panels omitted; per-teacher ECE: 0.7% (198M), 0.6% (546M), 0.5% (975M), 0.5% (1.82B), 0.4% (2.72B), 0.5% (4.82B), 0.6% (7.75B).)

Figure 42. Teacher calibration. The calibration of teachers of seven different sizes. The x-axis shows the teacher probability assigned to the most confident class, and the y-axis is the empirical accuracy of predictions within each confidence bin. Blue points represent the teacher accuracy for predictions falling into specific confidence intervals. Orange points represent the proportion of samples in each confidence bin (helpful for understanding the sample distribution across confidence levels). The dashed line represents perfect calibration, where confidence matches empirical accuracy. The ECE (Equation 38) for each teacher is shown in the title of each panel.

The primary distinctions in our setup are that: i) our models are underparameterized (N < D), and ii) data is not repeated. Consequently, overfitting to the training set does not occur (Aitchison, 2024), so model overconfidence does not arise to the same extent as in many prior calibration studies.
Instead, in our setting, increasing the model size N or training tokens D improves the approximation of the seen distribution with minimal generalization gap, yielding better calibration (Carrell et al., 2022; Blasiok et al., 2023). Our observation of good calibration in large models aligns with prior findings on language model calibration (Zhu et al., 2023; Kadavath et al., 2022; Open AI, 2023).

E.8.2. 198M STUDENTS TRAINED ON 20N TOKENS

In this section we consider students trained on the teacher distribution, as in our main study. We also study students trained on the teacher top-1 distribution, as described in Appendix G.4, as the qualitative difference in behavior can inform student design. The calibration of a student can be evaluated in a number of ways:

1. We can compare student outputs against ground-truth data, as in Appendix E.8.1 for the teachers.
2. We can compare student outputs with the outputs of its teacher.

Calibration against ground-truth. First, let's consider comparison against ground-truth data. In Figure 43 we show student calibration with respect to the dataset labels for both teacher-distribution distillation and teacher top-1 distillation.

1. Distilled on the full teacher distribution. In Figure 43a, we observe that the student is well-calibrated against ground-truth data. Similar to the teacher calibration plot in Figure 42, we see a small discrepancy at very low and very high confidence values, and the ECE value is low.

2. Distilled on teacher top-1. In Figure 43b, we see that a student trained only on its teacher's top-1 prediction is not calibrated against ground-truth data. The blue points below the dashed line indicate an overconfident student, i.e., its predicted confidence is higher than the actual accuracy in that confidence range.
This is because training on top-1 pushes the student toward the single most plausible outcome, rather than toward all plausible outcomes with their correct frequencies. Confidence proportions are low for all bins other than the most confident bin, and the ECE is high, although it decreases with increasing teacher size NT. Figure 43 shows that training the student on the teacher's distribution results in a calibrated student, whereas training on the teacher top-1 does not. Indeed, optimizing against the teacher's top-1 is not a proper scoring metric: teacher top-1 is not an unbiased estimator of the data, while the teacher distribution is.

(Figure 43 panels omitted; per-teacher ECE with the teacher-distribution target: 0.1% (198M) up to 0.6% (7.75B); with the teacher top-1 target: 38.7% (198M) down to 21.6% (7.75B).)

Figure 43. Student calibration (data). Calibration of the student with respect to the actual data labels, trained with different teacher sizes (NT), on (a) the teacher distribution and (b) the teacher's top-1. For axis definitions and the figure legend, refer to Figure 42. Blue points below the dashed line indicate student overconfidence.

Calibration against teacher top-1. Next we investigate student calibration against the teacher. In Figure 44 we show student calibration with respect to the teacher's top-1 label. That is, the next-token label used for the accuracy computation is the most probable next token according to the teacher, instead of the label from the data, and the student's confidence is extracted as before.
Here, no next-token labels from the data are used at all. These teacher top-1 labels are also used for the ECE calculation, which is still computed using Equation 38.

1. Distilled on the full teacher distribution. We see in Figure 44a that when distilled from the full teacher distribution, the student is not calibrated against the teacher top-1. The blue points are above the dashed line, which means that the empirical accuracy is higher than the model's predicted confidence, i.e., with respect to the teacher top-1, the student is underconfident. This can be understood by noting that the top-1 objective is an easier objective than modeling the full vocabulary at each step.

2. Distilled on teacher top-1. In Figure 44b we observe that a student distilled from its teacher's top-1 is calibrated with respect to the teacher's top-1.

(Figure 44 panels omitted; per-teacher ECE with the teacher-distribution target: 38.5% (198M) down to 18.9% (7.75B); with the teacher top-1 target: 1.1% (198M) to 1.5% (7.75B).)

Figure 44. Student calibration (teacher top-1). Calibration of the student with respect to the teacher's top-1, trained with different teacher sizes (NT), on (a) the teacher distribution and (b) the teacher's top-1. For axis definitions and the figure legend, refer to Figure 42. Blue points above the dashed line indicate the student is underconfident.
Figure 44 shows that training the student on teacher top-1 results in calibration against teacher top-1, whereas a model trained on data, or distilled on the full teacher distribution, is not calibrated against teacher top-1. As above, this can be understood because the teacher's top-1 is now a proper scoring metric, and teacher top-1 is an unbiased estimator of itself.

Calibration against teacher distribution. Here we develop a modified calibration measure that helps us understand whether the student matches the teacher in a distributional sense. As we have two distributions to compare, we can ask, for a given confidence of one model, what the expected confidence of the other is. This leads to ECEDist, a distributional form of ECE:

ECEDist(A, B) = Σ_{m=1}^{M} (|Bm| / NSamples) |Confidence(Bm; A) - Confidence(Bm; B)|, (39)

which is similar in spirit to divergence measures like the KLD. Bm, |Bm|, and NSamples are defined as before, and Confidence(Bm; A) and Confidence(Bm; B) are the average confidences of model A and model B in bin m respectively. The bins Bm are always formed from the confidences of model B. In the current evaluation, we take A as the teacher and B as the student, so the average confidence of the teacher is measured within each student confidence bin.

1. Distilled on the full teacher distribution. In Figure 45a, we see that when the student is confident, it matches the teacher confidence. However, as the teacher model grows in size, when the student is less confident, it systematically underestimates its confidence. This suggests that the student has not effectively learned low-probability outcomes, or that these outcomes are particularly challenging for the student to replicate. The underconfidence in these regions may result from the distillation process not providing sufficient learning signal for these difficult cases, or from the inherent difficulty of capturing the uncertainty associated with low-confidence predictions.
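Equation 39 differs from Equation 38 only in the quantity compared within each bin; a minimal sketch, binning by model B's confidences as in the evaluation above:

```python
import numpy as np

def distributional_ece(conf_a, conf_b, n_bins=21):
    """ECE_Dist (Equation 39): bin samples by model B's confidence and compare
    the average confidences of models A and B within each bin."""
    conf_a = np.asarray(conf_a, dtype=float)
    conf_b = np.asarray(conf_b, dtype=float)
    # Bins are always formed from model B's confidences.
    idx = np.minimum((conf_b * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            ece += mask.mean() * abs(conf_a[mask].mean() - conf_b[mask].mean())
    return ece
```

Here A would be the teacher and B the student, so each bin holds samples with similar student confidence.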
This observation of confidence mismatch helps indicate which parts of the distribution the student finds challenging to model, giving rise to the increasing KLD and capacity gap observed in Figure 4 and Appendix E.3.

2. Distilled on teacher top-1. In Figure 45b, for small teachers, we observe student overconfidence. As the teacher increases in size, the student's overconfidence in low-confidence bins transitions to underconfidence. At the same time, the student's overconfidence in high-confidence bins improves, leading to an overall reduction in distributional ECE. This pattern of student overconfidence is similar to what we saw in Figure 43b, but the change in behavior in low-confidence bins as the teacher size varies is different. This shift in the student's calibration behavior, especially in low-confidence bins, aligns with the findings from Figure 45a and may highlight the difficulty the small student faces in learning rare events.

(Figure 45 panels omitted; per-teacher distributional ECE with the teacher-distribution target: 1.3% (198M) up to 9.0% (7.75B); with the teacher top-1 target: 37.4% (198M) down to 16.1% (7.75B).)

Figure 45. Student calibration (teacher distribution). Calibration of the student with respect to the teacher's distribution, trained with different teacher sizes (NT), on (a) the teacher distribution and (b) the teacher's top-1. For the ECE calculation on the full distribution, see Equation 39. For axis definitions and the figure legend, refer to Figure 42.
Blue points below the dashed line indicate student overconfidence, while points above the dashed line indicate underconfidence. We can also inspect the student confidences within a bin of teacher confidences and compute the distributional ECE (Equation 39) with the roles of teacher and student swapped (see Figure 46).

1. Distilled on the full teacher distribution. In Figure 46a we complete the picture from Figure 45a and see that the part of the distribution the student struggles to model is precisely where the teacher is most confident.

2. Distilled on teacher top-1. In Figure 46b we see that the student is systematically overconfident for all values of teacher confidence, except for the largest teachers, where the student is underconfident when those teachers are most confident.

(Figure 46 panels omitted; per-teacher distributional ECE with the teacher-distribution target: 1.7% (198M) up to 9.6% (7.75B); with the teacher top-1 target: 37.4% (198M) down to 18.1% (7.75B).)

Figure 46. Student calibration (under teacher confidence bins). Calibration of the student with respect to the teacher's confidence bins, trained with different teacher sizes (NT), on (a) the teacher distribution and (b) the teacher's top-1. For the ECE calculation on the full distribution, see Equation 39. For axis definitions and the figure legend, refer to Figure 42. Blue points below the dashed line indicate the teacher is less confident than the student.

E.8.3.
198M STUDENTS TRAINED ON 128B TOKENS

In this section, we study the effect of increasing the number of distillation tokens in Appendix E.8.2 from DS ≈ 20NS to DS = 128B. Here, we reserve discussion for the observed differences compared to Appendix E.8.2.

(Figure 47 panels omitted; per-teacher ECE with the teacher-distribution target: 0.1% (198M) to 0.4% (7.75B); with the teacher top-1 target: 42.1% (198M) down to 24.8% (7.75B).)

Figure 47. Student calibration (data). Calibration of the student with respect to the actual data labels with increased training tokens. Compare to Figure 43 for the effect of tokens, and refer to Figure 42 for legend and axis explanations.

Calibration against ground-truth. As the number of distillation tokens increases, we observe a consistent decrease in ECE when the student is trained on the teacher's distribution, as shown by the comparison between Figure 47a and Figure 43a across different teacher sizes. However, when the student is trained on the teacher's top-1 predictions, increasing the number of tokens negatively impacts ECE, as evidenced by the comparison between Figure 47b and Figure 43b. This suggests that the teacher's top-1 predictions are not a reliable, unbiased estimator of the actual data, and that increasing the number of training tokens only exacerbates this issue. See Appendix G.4 for further discussion.

Calibration against teacher top-1.
Increasing the number of distillation tokens leads to worse calibration between the student and the teacher's top-1 predictions when the student is trained on the full distribution. This change primarily occurs in the low-confidence bins and results in a higher ECE (compare Figure 48a and Figure 44a). However, when comparing the ECEs for the student trained on the teacher's top-1 predictions (Figures 44b and 48b), there is an improvement across all teacher sizes. When the student is trained and evaluated using the same metric, increasing the training tokens helps improve calibration, demonstrating consistency between the learning objective and the evaluation metric.

(Figure 48 panels omitted; per-teacher ECE with the teacher-distribution target: 42.3% (198M) down to 21.9% (7.75B); with the teacher top-1 target: 0.6% (198M) to 1.1% (7.75B).)

Figure 48. Student calibration (teacher top-1). Calibration of the student with respect to the teacher's top-1 when the number of training tokens is increased. Compare to Figure 44 for the effect of tokens, and refer to Figure 42 for legend and axis explanations.

Calibration against teacher distribution. A comparison between Figure 49a and Figure 45a shows that when the student is trained on the teacher's full distribution and evaluated against the full distribution using Equation 39, increasing the number of training tokens consistently improves calibration across all teacher sizes.
However, when the student is trained on the teacher's top-1 predictions, a comparison between Figure 49b and Figure 45b reveals uniformly worse calibration across all confidence bins.

(Figure 49 panels omitted; per-teacher distributional ECE with the teacher-distribution target: 0.7% (198M) up to 6.0% (7.75B); with the teacher top-1 target: 41.6% (198M) down to 20.0% (7.75B).)

Figure 49. Student calibration (teacher distribution). Calibration of the student with respect to the teacher's distribution as the number of training tokens increases. Compare to Figure 45 for the effect of tokens, and refer to Figure 42 for legend and axis explanations.

Similarly, when comparing within teacher confidence bins (Figure 50), increasing the number of distillation tokens from 20N to 128B primarily amplifies the phenomena observed at lower distillation token budgets, improving calibration in the case where a proper scoring metric is present (Figure 50a).

(Figure 50 panels omitted; per-teacher distributional ECE with the teacher-distribution target: 1.0% (198M) up to 6.7% (7.75B); with the teacher top-1 target: 41.8% (198M) down to 21.4% (7.75B).)

Figure 50.
Student calibration (under teacher confidence bins). Calibration of the student with respect to the teacher's confidence bins as the number of training tokens increases. Compare to Figure 46 for the effect of tokens.

In general, increasing the number of training tokens has a positive effect when the training target is an unbiased estimator of the actual data or of the measured calibration quantity (see Figures 47a, 48b, and 49a), reducing the ECE, while it has a negative impact when there is a mismatch between the learned and measured quantities (see Figures 47b, 48a, and 49b).

F. Scaling coefficients

In this section, we analyze the process of deriving the coefficients for our scaling laws. We follow the procedure outlined in Hoffmann et al. (2022) and Besiroglu et al. (2024), while incorporating our modified scaling laws.

F.1. Supervised scaling law coefficient estimation

First, let's tackle the supervised scaling law (Equation 1), restated for convenience:

L(N, D) = E + (A / N^α + B / D^β)^γ.

To aid numerical stability, we write this expression in log space. First note that for a, b > 0,

log(a + b) = log(exp log a + exp log b) = LSE(log a, log b), (41)

where LSE is the log-sum-exp operator. We can now proceed to write the supervised scaling law in log form:

log L(N, D; A, B, E, α, β, γ) = log[E + (A / N^α + B / D^β)^γ] (42)
= LSE[log E, γ log(A / N^α + B / D^β)] (43)
= LSE[log E, γ LSE(log A - α log N, log B - β log D)]. (44)

We make no assumptions about the relationships between the values (i.e., no parameter tying) and optimize

(A*, B*, E*, α*, β*, γ*) = arg min_{A, B, E, α, β, γ} Σ_i Huber_δ( log L(N^(i), D^(i); A, B, E, α, β, γ) - log L^(i) ) (45)

with a Huber δ = 10^{-4}, where N^(i), D^(i), and L^(i) are the model size, number of training tokens, and loss achieved by the i-th run. We fit on 73 samples over a grid of L-BFGS-B initializations given by: log A ∈ {0., 5., 10., 15., 20.}, log B ∈ {0., 5., 10., 15., 20.}, log E ∈ {-1., -0.5, 0., 0.5, 1., 1.5}, α ∈ {0., 0.5, 1., 1.5}, β ∈ {0., 0.5, 1., 1.5}, γ ∈ {0., 0.5, 1., 1.5}.
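A minimal sketch of this log-space fitting procedure follows; the data arrays are hypothetical, and in practice the optimization is repeated over the initialization grid above with the best optimum kept:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber, logsumexp

def log_supervised_loss(params, log_n, log_d):
    """Equation 44: log L = LSE[log E, gamma * LSE(log A - alpha log N,
    log B - beta log D)], evaluated over all runs at once."""
    log_a, log_b, log_e, alpha, beta, gamma = params
    inner = logsumexp([log_a - alpha * log_n, log_b - beta * log_d], axis=0)
    return logsumexp([np.full_like(log_n, log_e), gamma * inner], axis=0)

def fit_supervised(log_n, log_d, log_l, x0, delta=1e-4):
    """Equation 45: minimize the summed Huber loss of the log-space residuals."""
    def objective(p):
        return huber(delta, log_supervised_loss(p, log_n, log_d) - log_l).sum()
    return minimize(objective, x0, method="L-BFGS-B")
```

The same structure carries over to the distillation law below, with the extra LSE(0, ·) term for the teacher-dependent factor.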
The L > 2.2 case corresponds to 48 samples.

F.2. Distillation scaling law coefficient estimation

Next, let's address the distillation scaling law (Equation 8), restated for convenience:

LS(NS, DS, LT) = LT + LT^{-c0} (1 + (LT / (L̃S d1))^{1/f1})^{-c1 f1} (A′ / NS^{α′} + B′ / DS^{β′})^{γ′}.

As in Appendix F.1, to aid numerical stability during optimization, we write this in log space:

log LS(NS, DS, LT; θ) = LSE[ log LT, -c0 log LT - c1 f1 LSE(0, (1/f1)(log LT - log L̃S - log d1)) + γ′ LSE(log A′ - α′ log NS, log B′ - β′ log DS) ],

where θ = {A′, B′, α′, β′, γ′, c0, c1, f1, d1}. We make no assumptions about the relationships between the values and optimize

θ* = arg min_θ Σ_i Huber_δ( log LS(NS^(i), DS^(i), LT^(i); θ) - log LS^(i) ) (50)

with a Huber δ = 10^{-4}, where NS^(i), DS^(i), LT^(i), and LS^(i) are the student model size, number of distillation training tokens, teacher pretraining loss, and student validation loss on the data achieved by the i-th run. We fit on 697 samples over a grid of L-BFGS-B initializations given by: log A′ ∈ {0., 5., 10., 15., 20.}, log B′ ∈ {0., 5., 10., 15., 20.}, α′ ∈ {0., 0.5, 1.}, β′ ∈ {0., 0.5, 1.}, γ′ ∈ {0., 0.5, 1.}, c0 ∈ {0., 0.5, 1., 1.5}, c1 ∈ {0., 0.5, 1., 1.5}, f1 ∈ {0., 0.5, 1., 1.5}, log d1 ∈ {-1., -0.5, 0., 0.5, 1.}. The LS > 2.3 case corresponds to 551 samples.

F.3. Scaling law coefficients parametric fit

The fitting procedure outlined in Appendices F.1 and F.2, applied to the data described in Section 4.2, yields the scaling coefficients and associated confidence intervals shown in Table 6. Note that in the supervised case, our values of a and b are consistent with those of Hoffmann et al. (2022).

Table 6. Scaling law parameter estimates accompanied by 90% confidence intervals obtained by bootstrapping (4096 resamples) following the procedure of Besiroglu et al. (2024). a = β/(α + β) and b = α/(α + β) are the supervised compute-optimal scaling exponents for N and D respectively (Hoffmann et al., 2022).
Parameter    Supervised              Distillation
A / A′       3355 (3346, 3360)       2243 (2227, 2255)
B / B′       18186 (18157, 18236)    24181 (24084, 24266)
E            1.220 (1.190, 1.247)    —
α / α′       0.408 (0.405, 0.411)    0.321 (0.319, 0.324)
β / β′       0.431 (0.428, 0.433)    0.637 (0.634, 0.640)
γ / γ′       0.452 (0.442, 0.461)    0.764 (0.732, 0.788)
c0           —                       2.549 (2.425, 2.615)
c1           —                       522.6 (522.6, 522.6)
f1           —                       0.090 (0.088, 0.093)
d1           —                       1.315 (1.302, 1.327)
a / a′       0.513 (0.513, 0.513)    0.664 (0.662, 0.665)
b / b′       0.486 (0.486, 0.486)    0.335 (0.334, 0.337)
Runs         73                      697

We also note that our irreducible error term is lower than the one in Hoffmann et al. (2022). We suspect this is due to our use of µP (Yang & Hu, 2021; Yang & Littwin, 2023; Yang et al., 2022; Wortsman et al., 2023; Yang et al., 2023).

G. Distilling language models in practice

In the following analyses, we explore the sensitivity of student performance to modifications of the distillation hyperparameters. We demonstrate that the pure distillation setting (λ = 1, Appendix G.1), unit temperature (τ = 1, Appendix G.2), and learning rate η = 0.01 (Appendix G.3) under µP (Yang & Hu, 2021; Yang & Littwin, 2023; Yang et al., 2022; Wortsman et al., 2023; Yang et al., 2023) provide robust performance across model scales, while distribution truncation methods (Top-k, Top-p) degrade performance unless combined with ground-truth next-token prediction (Appendix G.4). Finally, we verify that forward KL divergence distillation, DKL(p̂T || q̂S), consistently outperforms reverse KL (Appendix G.5).

For ease of reference, we restate the components of the token-level loss for the student:

LNTP(x^(i), z^(i)) = - Σ_{a=1}^{V} e(x^(i))_a log σ_a(z^(i)), (Next-token prediction) (51)

LZ(z^(i)) = ||log Z(z^(i))||²₂ = ( log Σ_{a=1}^{V} exp(z^(i)_a) )², (Z-loss) (52)

LKD(z^(i)_T, z^(i)_S) = τ² Σ_{a=1}^{V} σ_a(z^(i)_T / τ) log[ σ_a(z^(i)_T / τ) / σ_a(z^(i)_S / τ) ], (Distillation loss) (53)

LS(x^(i), z^(i)_T, z^(i)_S) = (1 - λ) LNTP(x^(i), z^(i)_S) + λ LKD(z^(i)_T, z^(i)_S) + λZ LZ(z^(i)_S). (Student loss) (54)

See Section 2 for a discussion of each of the terms.

G.1.
Mixing coefficient (λ) sensitivity analysis

The distillation process combines two loss components: knowledge transfer from the teacher, λ LKD(z^(i)_T, z^(i)_S), and direct learning from data, (1 - λ) LNTP(x^(i), z^(i)_S), weighted by the mixing coefficient λ (Equation 7). Our distillation scaling law analysis is performed in the pure distillation setting (λ = 1). Here we show this simple choice provides robust performance across a wide range of configurations.

(Figure 51 panels omitted: (a) student cross-entropy LS against loss mixing coefficient λ for teachers NT ∈ {546M, ..., 7.75B} and students NS ∈ {198M, ..., 2.72B}; (b) optimal mixing coefficient λ* against teacher parameters NT.)

Figure 51. Mixing Coefficients λ. (a) Students of six sizes NS ∈ {198M, 266M, ..., 2.72B} trained with an M = DS/NS = 20 ratio are distilled from teachers of six sizes NT ∈ {546M, 975M, ..., 7.75B} trained with an M = DT/NT = 20 ratio, with different values of the loss mixing coefficient λ ∈ [0, 1]. λ = 0 and λ = 1 correspond to the supervised training and pure distillation cases respectively. (b) The mixing coefficients λ* = arg min_λ L(λ) that give the lowest student validation loss for each teacher-student combination shown in Figure 51a.

We examine various λ values across different teacher-student configurations in Figure 51a and find that while the optimal mixing coefficients λ* vary based on the specific teacher-student combination (Figure 51b), the student cross-entropy LS remains mostly flat for choices of λ > 0.5, with lower values of λ preferred only in cases where the teacher is particularly weak and the supervised signal is more informative.
From Figure 51a it is also possible to get a sense of when distillation (λ > 0) generally outperforms supervised learning (λ = 0) under the same token budget. To guide practitioners, Figure 51b shows empirically derived optimal mixing coefficients λ*, though the simplicity and robustness of pure distillation makes it a reliable default choice for practical use and study.

G.2. Temperature (τ) sensitivity analysis

In distillation, the temperature τ controls the entropy of teacher predictions by scaling the logits z_T^(i)/τ and z_S^(i)/τ in the knowledge distillation loss L_KD (Equations 7 and 53). This scaling modulates the transfer of dark knowledge (Hinton et al., 2015): the log-probability ratios between incorrect categories encode the teacher's understanding of the relationships between those categories. Our analysis across τ ∈ [0.5, 10] (Figure 52) reveals that higher temperatures (τ > 3) reduce performance by attenuating these ratios in σ_a(z_T^(i)/τ), particularly harming smaller students that rely heavily on this signal. Lower temperatures (τ < 1) similarly reduce effectiveness by concentrating probability mass on argmax tokens, diminishing the transfer of relationships between lower-ranked predictions. We find optimal performance at τ = 1 across all model scales, suggesting this temperature best preserves log-probability structure. Unlike the original distillation setting, which relied on dark knowledge to represent hierarchical relationships between incorrect classification predictions in the presence of a true label, language modeling is inherently ambiguous and complex, with many valid continuations. It is precisely this understanding of the ambiguity of language that we want to transfer to the student, which is supported by our finding that maintaining the teacher's original probability ratios (τ = 1) produces the lowest student cross-entropies.
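The attenuation of log-probability ratios follows directly from the softmax: for any two tokens a and b, log σ_a(z/τ) − log σ_b(z/τ) = (z_a − z_b)/τ. A small NumPy check, using made-up teacher logits, illustrates how τ reshapes the teacher distribution:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

z_t = np.array([4.0, 2.0, 1.0, 0.0])  # hypothetical teacher logits

# Higher tau flattens the distribution (higher entropy); lower tau sharpens it.
h_sharp, h_unit, h_flat = (entropy(softmax(z_t / tau)) for tau in (0.5, 1.0, 3.0))

# Log-probability ratios between tokens are scaled by 1/tau.
p = softmax(z_t / 3.0)
ratio = np.log(p[0] / p[1])  # equals (z_0 - z_1) / tau
```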
Figure 52. Temperature τ Sensitivity Analysis. Students of four sizes N_S ∈ {198M, 546M, 975M, 1.82B} trained with an M = D_S/N_S = 20 ratio are distilled from teachers of sizes N_T ∈ {546M, 1.82B, 4.82B, 7.75B} trained with an M = D_T/N_T = 20 ratio, with different distillation temperatures τ ∈ [0.5, 10].

G.3. Learning rate (η) sensitivity analysis, verification of µP for distillation

The peak learning rate η determines the scale of student parameter updates in distillation. In our experiments we use a simplified version of µP (Yang & Hu, 2021; Yang & Littwin, 2023; Yang et al., 2022; Wortsman et al., 2023; Yang et al., 2023), described as µP (simple) in Wortsman et al. (2024). In the supervised case, in addition to improving the performance lower bound compared to the standard parameterization, µP simplifies experimental settings as it enables hyperparameter transfer: the optimal peak learning rate η and initialization scales found for a reference model size can be reused when changing model size. Here we validate that the optimal peak learning rate η = 0.01 determined in the supervised case transfers to the distillation setting. Sweeping values η ∈ [0.001, 0.1] (Figure 53) reveals that µP achieves optimal performance at η = 0.01 uniformly across all configurations, from 198M to 1.82B parameter students and 546M to 7.75B parameter teachers, consistent with the optimal peak learning rate in the supervised setting. Performance varies smoothly and modestly around this optimum, with cross-entropy changing by less than 0.1 nats over one order of magnitude in learning rate.
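As an illustration of why hyperparameter transfer helps here: under µP with Adam, the effective learning rate of hidden weights is typically scaled down with width relative to a base model, so a single reference peak learning rate can be reused across a model family. The helper below is a hypothetical sketch of that width scaling only, not the µP (simple) implementation of Wortsman et al. (2024).

```python
def mup_hidden_lr(ref_lr, d_model, d_base):
    """Sketch: hidden-weight Adam LR scaled ~1/width relative to a base width.

    This is an illustrative helper, not our training code; the base width
    and the 1/width rule are assumptions of this sketch.
    """
    return ref_lr * d_base / d_model

# With a reference peak LR of 0.01 tuned at d_base = 1024, a 4096-wide model's
# hidden weights would train at an effective LR four times smaller.
lr = mup_hidden_lr(1e-2, d_model=4096, d_base=1024)
```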
This consistency validates µP's promise of scale-invariant training dynamics for distillation, confirming that the experimental setting for determining our distillation scaling law operates at, or sufficiently close to, the optimal learning rate in all of our settings. The observed moderate learning rate sensitivity in distillation partially alleviates the requirement for careful learning rate tuning, showing that in practice the reference learning rate found in the supervised setting can be safely reused in the distillation setting.

Figure 53. Learning Rate η Sensitivity Analysis. Students of four sizes N_S ∈ {198M, 546M, 975M, 1.82B} trained with an M = D_S/N_S = 20 ratio are distilled from teachers of sizes N_T ∈ {546M, 1.82B, 4.82B, 7.75B} trained with an M = D_T/N_T = 20 ratio, with different peak learning rates η ∈ [0.001, 0.1].

G.4. Distribution truncation methods: Top-k and Top-p sensitivity

We investigate how truncation of the teacher distribution affects student performance. For these methods, when the teacher produces a distribution p̂_T(x^(i) = a | x^(<i)) [...] enables efficient post-hoc distillation.

G.5. Forward and reverse KL divergence

We investigate both forward (mode covering) and reverse (mode seeking) Kullback-Leibler divergences for distillation from N_T = 1.82B to N_S = 546M. The forward KLD D_KL(p̂_T || q̂_S) (Equation 7) minimizes L_forward = H(p̂_T, q̂_S) − H(p̂_T), where H(p̂_T) is dropped during optimization as it depends only on the fixed teacher parameters. In contrast, the reverse KLD D_KL(q̂_S || p̂_T) requires explicitly computing the student's entropy, L_reverse = H(q̂_S, p̂_T) − H(q̂_S).

The forward KL achieves a lower data cross-entropy than the reverse KL (Table 7), with an average improvement of 0.28 nats.
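The two divergences can be written side by side. A minimal NumPy sketch, using arbitrary example distributions, shows the decompositions above, where only the reverse direction involves the student's entropy H(q̂_S):

```python
import numpy as np

def forward_kl(p_t, q_s):
    # KL(teacher || student) = H(p_t, q_s) - H(p_t); H(p_t) is constant in the student
    return np.sum(p_t * (np.log(p_t) - np.log(q_s)))

def reverse_kl(p_t, q_s):
    # KL(student || teacher) = H(q_s, p_t) - H(q_s); requires the student's entropy
    return np.sum(q_s * (np.log(q_s) - np.log(p_t)))

p_t = np.array([0.7, 0.2, 0.1])  # example teacher distribution
q_s = np.array([0.5, 0.3, 0.2])  # example student distribution
```

Both divergences are non-negative and vanish only when the distributions match, but they are not symmetric, which is why the two objectives can train students differently.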
This suggests that explicitly regularizing with respect to the student's entropy during training may not provide additional benefits for distillation quality. Given both the improved performance and the reduced computational overhead of forward KL (which avoids computing the student's entropy), we recommend using the standard forward KL for distillation.

Table 7. Forward vs reverse KL divergence for N_T = 1.82B to N_S = 546M distillation. Reverse KL is slightly more expensive with respect to vocabulary size V due to the entropy calculation.

Method     | Cross-Entropy | Computational Cost
Forward KL | 2.42          | O(V)
Reverse KL | 2.70          | O(2V)

H. Parameters and Floating Operation Estimation

Here we outline the number of parameters (Appendix H.2) and the number of FLOPs per token (Appendix H.3) for our experimental settings. The symbol notation is provided in Table 8. For our scaling laws we find, as in Kaplan et al. (2020), that using the number of non-embedding parameters provides the cleanest fit and extrapolation behavior. Our expressions for approximate compute (FLOPs per token) differ from prior work in that we are interested in small models that are capable. This means we are unable to ignore the context-dependent term that arises from the quadratic computational complexity of the attention mechanism. As our architectures have a fixed aspect ratio, there is a modified approximation we can use; this expression is discussed in Appendix H.1. For ease of reference, we provide a comparison of the expressions we use to commonly used existing expressions (Kaplan et al., 2020; Hoffmann et al., 2022; Narayanan et al., 2021), and provide comments on significant differences.

Table 8. The notation we use for parameter and FLOPs estimation.
Component                                                    | Notation
Sequence length/context size                                 | n_ctx
Vocabulary size                                              | n_vocab
Number of blocks/layers                                      | n_layers
Number of query heads                                        | n_heads
Number of key/value heads                                    | n_kv-heads
Model/embedding dimension                                    | d_model
Head dimension                                               | d_head
Feed-forward dimension                                       | d_ffn
Number of feed-forward linears                               | n_ffn
Group size in Grouped Query Attention (GQA), n_heads/n_kv-heads | g_size
Model aspect ratio, d_model/n_layers                         | ρ_model
Feed-forward ratio, d_ffn/d_model                            | ρ_ffn

H.1. Alternative approximation for FLOPs per token as a function of N

From Table 10, Equation 71, and Table 12 we can read our approximate values for non-embedding parameters and total compute (dropping contributions from normalization layers) as^8

N = n_layers d_model² (2 + 2/g_size + n_ffn ρ_ffn),

C_Forward = 2 n_layers d_model² (2 + 2/g_size + n_ffn ρ_ffn) + 2 n_layers n_ctx d_model + 2 n_vocab d_model   (58)
          = 2N + 2 n_layers n_ctx d_model + 2 n_vocab d_model.   (59)

^8 It was shown in Porian et al. (2024) that ignoring the embedding parameters and FLOPs can lead to systematic estimation bias for small models, and is one of the primary drivers of the different exponents reported in Kaplan et al. (2020) and Hoffmann et al. (2022). We find that the non-embedding parameters give tighter scaling behavior. However, in the fixed-aspect-ratio setting, we are able to use both the non-embedding parameters in the scaling law and the approximate total compute simultaneously, removing estimation bias. Indeed, in the supervised setting, our coefficients a and b are consistent with those from Hoffmann et al. (2022) (see Table 6).

Typically the term 2 n_layers n_ctx d_model would be dropped, and the embedding parameters included in the total parameters (Hoffmann et al., 2022) or discarded (Kaplan et al., 2020), yielding the expression C_Forward ≈ 2N and the familiar expression C = 6ND (Kaplan et al., 2020; Hoffmann et al., 2022).
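A quick sanity check of the non-embedding parameter expression against the models in Table 13 can be written directly. The defaults below follow our architecture (g_size = 1, n_ffn = 3, ρ_ffn = 8/3; Appendix I); the residual error comes from the normalization parameters the approximation drops and from d_ffn being rounded in the actual architectures.

```python
def non_embedding_params(n_layers, d_model, g_size=1, n_ffn=3, rho_ffn=8 / 3):
    """Approximate non-embedding parameter count N (Equation 71)."""
    return n_layers * d_model**2 * (2 + 2 / g_size + n_ffn * rho_ffn)

# 546M model from Table 13: n_layers = 14, d_model = 1792, tabulated N = 0.546B
n_est = non_embedding_params(14, 1792)
```

For the 1.82B model (n_layers = 21, d_model = 2688), where d_ffn = 7168 matches ρ_ffn d_model exactly, the approximation is tighter still.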
For our investigation we are interested in small, capable models, which may have a large context, and so neither of these terms can be ignored in general, at the peril of making a systematic error in the region of configuration space we are most interested in. Fortunately, our choice of fixed-aspect-ratio ρ_model = d_model/n_layers architectures allows a simple-to-use, more precise estimate. The trick is to use this fixed aspect ratio to produce approximations for n_layers and d_model as functions of N and ρ_model. With these approximated, the term 2 n_layers n_ctx d_model can be represented as a function of N. First define^9

ω ≡ 2 + 2/g_size + n_ffn ρ_ffn,   (61)
N = n_layers d_model² ω.   (62)

Then we can substitute in ρ_model ≡ d_model/n_layers so that

N = n_layers d_model² ω = n_layers³ ρ_model² ω,   (63)

and solve for n_layers and d_model:

n_layers = ( N / (ρ_model² ω) )^(1/3),   d_model = ( N ρ_model / ω )^(1/3).   (64)

The C_Forward term can then be represented as a function of N. The context-dependent term becomes

2 n_ctx n_layers d_model = 2 n_ctx n_layers² ρ_model = 2 ( N / (ρ_model² ω) )^(2/3) ρ_model n_ctx ≡ 2 n_ctx σ_1 N^(2/3),   (65)

σ_1 = ( 1 / (ρ_model² ω) )^(2/3) ρ_model = ( 1 / (ρ_model ω²) )^(1/3).   (66)

The vocabulary projection term becomes

2 n_vocab d_model = 2 n_vocab ( ρ_model / ω )^(1/3) N^(1/3) ≡ 2 n_vocab σ_2 N^(1/3),   (67)

σ_2 = ( ρ_model / ω )^(1/3).   (68)

In total,

C_Forward = 2N + 2 n_ctx σ_1 N^(2/3) + 2 n_vocab σ_2 N^(1/3) = 2N ( 1 + σ_1 n_ctx N^(-1/3) + σ_2 n_vocab N^(-2/3) ),   (69)

where σ_1 and σ_2 are independent of model and context size. In the large-N limit, or the small-n_ctx, small-n_vocab limit, this becomes the familiar C_Forward = 2N. The backward FLOPs per token is taken as twice the forward FLOPs (Blondel & Roulet, 2024):

C_Backward = 2 C_Forward.   (70)

Given the simplicity of the compute expression as a function of N, the better tightness of fit in the scaling law, the improved intuition that the model size more directly corresponds to work being done by the model, and the predictability of hyperparameters at larger scales, we recommend the scaling law community consider adopting fixed-aspect-ratio models.
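Equation 69 can be implemented and checked against Table 13 in a few lines. The defaults below follow our setting (ρ_model = 128, ω = 12, n_ctx = 4096); the vocabulary size of 32768 is an assumption inferred from the N_total − N gap in Table 13 rather than a value stated in this appendix.

```python
def fwd_flops_per_token(N, rho_model=128.0, omega=12.0, n_ctx=4096, n_vocab=32768):
    """Equation 69: C_fwd = 2N (1 + sigma1 n_ctx N^(-1/3) + sigma2 n_vocab N^(-2/3)).

    n_vocab = 32768 is inferred from Table 13's N_total - N gap (an assumption).
    """
    sigma1 = (1.0 / (rho_model * omega**2)) ** (1.0 / 3.0)  # Equation 66
    sigma2 = (rho_model / omega) ** (1.0 / 3.0)             # Equation 68
    return 2.0 * N * (1.0 + sigma1 * n_ctx * N ** (-1.0 / 3.0)
                      + sigma2 * n_vocab * N ** (-2.0 / 3.0))

# 546M model from Table 13: C_fwd-approx(2N + sigma) is tabulated as 1.415B,
# against a naive 2N estimate of 1.092B (a ~23% underestimate).
c_est = fwd_flops_per_token(0.546e9)
```

The correction terms matter most for small models, consistent with the shrinking relative errors down the Cfwd-approx(2N) column of Table 13.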
^9 In our setting (Appendix I), with g_size = 1, n_ffn = 3, and ρ_ffn = 8/3, ω takes the value

ω = 2 + 2/g_size + n_ffn ρ_ffn = 2 + 2 + 8 = 12.   (60)

H.2. Model parameters

In Table 9 we present our parameter counting compared to commonly used existing expressions (Kaplan et al., 2020; Hoffmann et al., 2022; Narayanan et al., 2021). We present a convenient substitution in Table 10, which can be easier to work with analytically. Our total expressions match the architecture we are using, which includes only gains for the normalization layers, whereas Narayanan et al. (2021) include both weights and biases. We account for the potential use of Grouped Query Attention (Ainslie et al., 2023) as well as the gated linear layers (Shazeer, 2020) that are becoming prevalent in modern architectures, including the one used in this work (Appendix I).

Table 9. Parameter counts for the embedding projector, a single transformer layer, the final normalization, and the output layer. Ours indicates the expressions we use in the paper for the total number of parameters (note that the quantity N that appears in our scaling laws is the number of non-embedding parameters, but still includes parameters associated with normalization layers). Approx. indicates taking the within-section total and dropping all terms that are not at least quadratic in one of d_model, n_vocab; it will be used for estimating the FLOPs per token from a given model size (Appendix H.1), and does not differ significantly from the number of non-embedding parameters.
Parameters | Kaplan et al. (2020) | Hoffmann et al. (2022) | Narayanan et al. (2021) | Ours (Total)
Embedding | (n_vocab + n_ctx) d_model | (n_vocab + n_ctx) d_model | (n_vocab + n_ctx) d_model | n_vocab d_model
Attention (one transformer layer):
Pre Norm | | | 2 d_model | d_model
QKNorm | | | | 2 d_head
QKV | 3 n_heads d_model d_head | 3 n_heads d_model d_head | 3 n_heads (d_model + 1) d_head | (n_heads + 2 n_kv-heads) d_model d_head
Project | n_heads d_head d_model | n_heads d_head d_model | (n_heads d_head + 1) d_model | n_heads d_head d_model
Total | 4 n_heads d_head d_model | 4 n_heads d_head d_model | 4 n_heads d_head d_model + 3 (n_heads d_head + d_model) | 2 (n_heads + n_kv-heads) d_head d_model + 2 d_head + d_model
Approx. | 4 n_heads d_head d_model | 4 n_heads d_head d_model | 4 n_heads d_head d_model + 3 (n_heads d_head + d_model) | 2 (n_heads + n_kv-heads) d_head d_model
Feed-forward (one transformer layer):
Pre Norm | | | 2 d_model | d_model
MLP | 2 d_model d_ffn | 2 d_model d_ffn | 2 d_model d_ffn + d_ffn + d_model | n_ffn d_model d_ffn
Total | 2 d_model d_ffn | 2 d_model d_ffn | 2 d_model d_ffn + d_ffn + 3 d_model | n_ffn d_model d_ffn + d_model
Approx. | 2 d_model d_ffn | 2 d_model d_ffn | 2 d_model d_ffn + d_ffn + 3 d_model | n_ffn d_model d_ffn
Output:
Norm | | | | d_model
Final logits | | | |

Table 10. Parameter counts displayed in Table 9 using the simplified notation n_heads d_head = d_model, d_ffn = ρ_ffn d_model, and n_heads = g_size n_kv-heads.

Parameters | Kaplan et al. (2020) | Hoffmann et al. (2022) | Narayanan et al. (2021) | Ours (Total)
Embedding | (n_vocab + n_ctx) d_model | (n_vocab + n_ctx) d_model | (n_vocab + n_ctx) d_model | n_vocab d_model
Attention (one transformer layer):
Pre Norm | | | 2 d_model | d_model
QKNorm | | | | 2 d_head
QKV | 3 d_model² | 3 d_model² | 3 (d_model² + d_model) | (1 + 2/g_size) d_model²
Project | d_model² | d_model² | d_model² + d_model | d_model²
Total | 4 d_model² | 4 d_model² | 4 d_model² + 6 d_model | 2 (1 + 1/g_size) d_model² + 2 d_head + d_model
Approx. | 4 d_model² | 4 d_model² | 4 d_model² + 6 d_model | 2 (1 + 1/g_size) d_model²
Feed-forward (one transformer layer):
Pre Norm | | | 2 d_model | d_model
MLP | 2 ρ_ffn d_model² | 2 ρ_ffn d_model² | 2 ρ_ffn d_model² + (1 + ρ_ffn) d_model | n_ffn ρ_ffn d_model²
Total | 2 ρ_ffn d_model² | 2 ρ_ffn d_model² | 2 ρ_ffn d_model² + (3 + ρ_ffn) d_model | n_ffn ρ_ffn d_model² + d_model
Approx. | 2 ρ_ffn d_model² | 2 ρ_ffn d_model² | 2 ρ_ffn d_model² + (3 + ρ_ffn) d_model | n_ffn ρ_ffn d_model²
Output:
Norm | | | | d_model
Final logits | | | |

This results in an approximation for the number of non-embedding parameters, dropping subleading terms,

N ≈ n_layers d_model² (2 + 2/g_size + n_ffn ρ_ffn),   (71)

which can be used to estimate forward FLOPs per token from the model size (Appendix H.1).

H.3. FLOPs per token

In Table 11 we present our counting of the total number of FLOPs performed per token during a forward pass, compared to commonly used existing expressions (Kaplan et al., 2020; Hoffmann et al., 2022; Narayanan et al., 2021). We present a convenient substitution in Table 12, which can be easier to work with analytically. Beyond the potential accounting for gated linear layers and grouped query attention, the most important discrepancy across methods is how the attention mechanism is handled. As was also noted in Porian et al. (2024), the expression used in Kaplan et al. (2020) is consistent with efficiently computing a causal attention mechanism (Dao et al., 2022; Dao, 2024), whereas Hoffmann et al. (2022) and Narayanan et al. (2021) are consistent with counting attention FLOPs for a bidirectional (non-causal) attention mechanism, where the masked component of the attention matrix (zero by construction) is still computed. We adopt the efficient expression assuming a causal computation, as this more closely reflects best practice.

Table 11. Forward FLOPs per token for the embedding projector, a single transformer layer, the final normalization, and the output layer. Ours indicates the expressions we use in the paper for the total (note that the quantity C_Forward that appears in compute constraints is the number of non-embedding floating operations). Approx. indicates taking the within-section total and dropping all terms that are not at least quadratic in one of d_model, n_vocab; it will be used for estimating the FLOPs per token from a given model size (Appendix H.1).
FLOPs | Kaplan et al. (2020) | Hoffmann et al. (2022) | Narayanan et al. (2021) | Ours (Total)
Embedding | 4 d_model | 2 n_vocab d_model | | 2 d_model
Attention (one transformer layer):
Pre Norm | | | |
QKNorm | | | |
QKV | 6 n_heads d_model d_head | 6 n_heads d_model d_head | 6 n_heads d_model d_head | 2 (n_heads + 2 n_kv-heads) d_model d_head
Logits | 2 n_heads n_ctx d_head | 2 n_heads n_ctx d_head | 2 n_heads n_ctx d_head | n_heads n_ctx d_head
Softmax | | 3 n_heads n_ctx | | 2.5 n_heads n_ctx
Values | | 2 n_heads n_ctx d_head | 2 n_heads n_ctx d_head | n_heads n_ctx d_head
Project | 2 n_heads d_head d_model | 2 n_heads d_head d_model | 2 n_heads d_head d_model | 2 n_heads d_head d_model
Total | 2 n_heads d_head (4 d_model + n_ctx) | 4 n_heads d_head (2 d_model + n_ctx) + 3 n_heads n_ctx | 4 n_heads d_head (2 d_model + n_ctx) | 4 n_heads d_head (d_model + n_ctx/2) + 4 n_kv-heads d_model d_head + 2.5 n_heads n_ctx
Approx. | 2 n_heads d_head (4 d_model + n_ctx) | 4 n_heads d_head (2 d_model + n_ctx) + 3 n_heads n_ctx | 4 n_heads d_head (2 d_model + n_ctx) | 4 n_heads d_head (d_model + n_ctx/2) + 4 n_kv-heads d_model d_head
Feed-forward (one transformer layer):
Pre Norm | | | |
MLP | 4 d_model d_ffn | 4 d_model d_ffn | 4 d_model d_ffn | 2 n_ffn d_model d_ffn
Output:
Norm | | | |
Final logits | 2 n_vocab d_model | 2 n_vocab d_model | 2 n_vocab d_model | 2 n_vocab d_model

Table 12. Forward FLOPs counts per token from Table 11, simplified using n_heads d_head = d_model, d_ffn = ρ_ffn d_model, and n_heads = g_size n_kv-heads.

FLOPs | Kaplan et al. (2020) | Hoffmann et al. (2022) | Narayanan et al. (2021) | Ours (Total)
Embedding | 4 d_model | 2 n_vocab d_model | | 2 d_model
Attention (one transformer layer):
Pre Norm | | | |
QKNorm | | | |
QKV | 6 d_model² | 6 d_model² | 6 d_model² | 2 (1 + 2/g_size) d_model²
Logits | 2 d_model n_ctx | 2 d_model n_ctx | 2 d_model n_ctx | d_model n_ctx
Softmax | | 3 n_heads n_ctx | | 2.5 n_heads n_ctx
Values | | 2 d_model n_ctx | 2 d_model n_ctx | d_model n_ctx
Project | 2 d_model² | 2 d_model² | 2 d_model² | 2 d_model²
Total | 8 d_model² + 2 n_ctx d_model | 8 d_model² + 4 n_ctx d_model + 3 n_heads n_ctx | 8 d_model² + 4 n_ctx d_model | (4 + 4/g_size) d_model² + 2 n_ctx d_model + 2.5 n_heads n_ctx
Approx. | 8 d_model² + 2 n_ctx d_model | 8 d_model² + 4 n_ctx d_model + 3 n_heads n_ctx | 8 d_model² + 4 n_ctx d_model | (4 + 4/g_size) d_model² + 2 n_ctx d_model
Feed-forward (one transformer layer):
Pre Norm | | | |
MLP | 4 ρ_ffn d_model² | 4 ρ_ffn d_model² | 4 ρ_ffn d_model² | 2 n_ffn ρ_ffn d_model²
Output:
Norm | | | |
Final logits | 2 n_vocab d_model | 2 n_vocab d_model | 2 n_vocab d_model | 2 n_vocab d_model

This results in an approximation for the number of non-embedding floating operations per token, dropping subleading terms,

C_Forward ≈ 2 n_layers d_model² (2 + 2/g_size + n_ffn ρ_ffn) + 2 n_layers n_ctx d_model + 2 n_vocab d_model,   (72)

which can be used to estimate forward FLOPs per token from the model size (Appendix H.1).

I. Model architecture

All models are based on Gunter et al. (2024) and are trained using AXLearn (Apple, 2023). All models use decoupled weight decay (Loshchilov & Hutter, 2019) of 10⁻⁴ for regularization, as well as a simplified version of µP (Yang & Hu, 2021; Yang & Littwin, 2023; Yang et al., 2022; Wortsman et al., 2023; Yang et al., 2023), following what is described as µP (simple) in Wortsman et al. (2024). Because of µP (simple), we fix the learning rate to 10⁻² across all model sizes. Multi-headed attention (MHA) is used (g_size = 1), with Pre-Normalization (Nguyen & Salazar, 2019) using RMSNorm (Zhang & Sennrich, 2019). We train all models with a sequence length of n_ctx = 4096, with RoPE (Su et al., 2024) positional embeddings (base frequency set to 500k). All model architectures in this work are presented in Table 13; they have a fixed aspect ratio ρ_model = d_model/n_layers = 128 and a fixed feed-forward ratio ρ_ffn = 8/3, coupled with a gated linear activation (n_ffn = 3).

Table 13. The models used in this work. The different parameter values and FLOPs per token are shown in billions. N is the number of non-embedding parameters and is the value we use in our scaling laws.
N_total counts all parameters in the model. C_fwd is the total number of forward FLOPs per token, given by the full total in Tables 11 and 12. C_fwd-approx(2N) is the estimated value of forward FLOPs per token based on the 2N approximation, accompanied by its relative error. C_fwd-approx(2N+σ) is the estimated value of forward FLOPs per token based on the approximation given in Equation 69, accompanied by its relative error; C_fwd-approx(2N+σ) is the one we use in this work.

Name  | N (B)  | N_total (B) | n_layers | d_model | d_ffn | C_fwd (B) | C_fwd-approx(2N) (B) | C_fwd-approx(2N+σ) (B)
103M  | 0.1028 | 0.1363 | 8  | 1024 | 2816  | 0.3411 | 0.2056 (-39.74%) | 0.3398 (-0.39%)
143M  | 0.1434 | 0.1811 | 9  | 1152 | 3072  | 0.4487 | 0.2867 (-36.10%) | 0.4471 (-0.34%)
198M  | 0.1983 | 0.2402 | 10 | 1280 | 3456  | 0.587  | 0.3965 (-32.44%) | 0.5853 (-0.29%)
266M  | 0.2657 | 0.3118 | 11 | 1408 | 3840  | 0.7524 | 0.5314 (-29.38%) | 0.7505 (-0.25%)
340M  | 0.3398 | 0.3901 | 12 | 1536 | 4096  | 0.9333 | 0.6796 (-27.19%) | 0.9312 (-0.22%)
435M  | 0.4348 | 0.4893 | 13 | 1664 | 4480  | 1.158  | 0.8695 (-24.91%) | 1.156 (-0.19%)
546M  | 0.546  | 0.6047 | 14 | 1792 | 4864  | 1.417  | 1.092 (-22.96%)  | 1.415 (-0.17%)
664M  | 0.6636 | 0.7265 | 15 | 1920 | 5120  | 1.692  | 1.327 (-21.54%)  | 1.689 (-0.15%)
810M  | 0.8096 | 0.8767 | 16 | 2048 | 5504  | 2.025  | 1.619 (-20.03%)  | 2.022 (-0.14%)
975M  | 0.9755 | 1.047  | 17 | 2176 | 5888  | 2.4    | 1.951 (-18.69%)  | 2.397 (-0.12%)
1.15B | 1.147  | 1.222  | 18 | 2304 | 6144  | 2.787  | 2.293 (-17.72%)  | 2.784 (-0.11%)
1.35B | 1.355  | 1.434  | 19 | 2432 | 6528  | 3.25   | 2.709 (-16.65%)  | 3.247 (-0.10%)
1.59B | 1.586  | 1.67   | 20 | 2560 | 6912  | 3.763  | 3.172 (-15.70%)  | 3.759 (-0.09%)
1.82B | 1.821  | 1.909  | 21 | 2688 | 7168  | 4.284  | 3.642 (-14.99%)  | 4.28 (-0.09%)
2.1B  | 2.102  | 2.194  | 22 | 2816 | 7552  | 4.899  | 4.203 (-14.21%)  | 4.895 (-0.08%)
2.41B | 2.41   | 2.506  | 23 | 2944 | 7936  | 5.571  | 4.819 (-13.49%)  | 5.567 (-0.07%)
2.72B | 2.718  | 2.819  | 24 | 3072 | 8192  | 6.246  | 5.436 (-12.96%)  | 6.241 (-0.07%)
3.08B | 3.082  | 3.187  | 25 | 3200 | 8576  | 7.034  | 6.165 (-12.36%)  | 7.03 (-0.06%)
3.48B | 3.478  | 3.587  | 26 | 3328 | 8960  | 7.887  | 6.956 (-11.81%)  | 7.883 (-0.06%)
3.87B | 3.87   | 3.983  | 27 | 3456 | 9216  | 8.736  | 7.74 (-11.40%)   | 8.731 (-0.05%)
4.33B | 4.329  | 4.446  | 28 | 3584 | 9600  | 9.72   | 8.658 (-10.93%)  | 9.715 (-0.05%)
4.82B | 4.823  | 4.944  | 29 | 3712 | 9984  | 10.78  | 9.646 (-10.49%)  | 10.77 (-0.05%)
5.31B | 5.309  | 5.434  | 30 | 3840 | 10240 | 11.82  | 10.62 (-10.16%)  | 11.81 (-0.05%)
5.87B | 5.873  | 6.003  | 31 | 3968 | 10624 | 13.02  | 11.75 (-9.78%)   | 13.01 (-0.04%)
6.48B | 6.476  | 6.611  | 32 | 4096 | 11008 | 14.3   | 12.95 (-9.43%)   | 14.29 (-0.04%)
7.07B | 7.066  | 7.204  | 33 | 4224 | 11264 | 15.56  | 14.13 (-9.16%)   | 15.55 (-0.04%)
7.75B | 7.747  | 7.889  | 34 | 4352 | 11648 | 17     | 15.49 (-8.85%)   | 16.99 (-0.04%)
8.47B | 8.47   | 8.617  | 35 | 4480 | 12032 | 18.52  | 16.94 (-8.55%)   | 18.52 (-0.03%)
9.17B | 9.173  | 9.324  | 36 | 4608 | 12288 | 20.01  | 18.35 (-8.33%)   | 20.01 (-0.03%)
10B   | 10.05  | 10.2   | 37 | 4736 | 12672 | 21.85  | 20.1 (-8.02%)    | 21.84 (-0.03%)
10.8B | 10.84  | 11     | 38 | 4864 | 13056 | 23.51  | 21.67 (-7.83%)   | 23.5 (-0.03%)
11.7B | 11.66  | 11.83  | 39 | 4992 | 13312 | 25.26  | 23.33 (-7.64%)   | 25.25 (-0.03%)
12.6B | 12.61  | 12.78  | 40 | 5120 | 13696 | 27.24  | 25.22 (-7.42%)   | 27.23 (-0.03%)

We rescale the gradients such that the maximum of the global norm is 1.0. A cosine learning rate schedule is used with warmup (2000 steps), with a final learning rate of one thousandth of the peak learning rate. A Z-loss (Chowdhery et al., 2023) of 10⁻⁴ is used for stability, slightly decreasing norm growth at the end of training. For all experiments, the English-only subset of the C4 dataset (Raffel et al., 2020) is used. The C4 dataset was chosen because of its wide usage in the research community: while C4 is big enough for larger-scale experiments, it is small enough to allow for reproduction of experiments. For all distillation trainings, the teacher is trained on a different split than the student. The C4 dataset has roughly 180B tokens in total, which results in 90B unique tokens for the teacher training and 90B unique tokens for the student training. Except for the largest models, all Chinchilla-optimal models do not repeat data. Models that overtrain on more than 90B tokens will have data repetition. Muennighoff et al.
(2023b) have shown (on the C4 dataset) that repeating data up to four times has negligible impact on loss compared to having unique data.

J. Contributions

All authors contributed to writing this paper, designing the experiments, and discussing results at each stage of the project.

Writing and framing. The majority of the writing was done by Dan Busbridge, Jason Ramapuram, and Amitis Shidani. The research direction was led by Dan Busbridge, with research framing, question identification, and prioritization done by all authors.

Scaling law experiments. Fixed aspect ratio models (Appendix I), FLOP counting methods (Appendix H.1), and model implementation done by Dan Busbridge, Amitis Shidani, and Floris Weers. Dataset preparation done by Floris Weers. IsoFLOP experimental design (Section 4.1) done by Dan Busbridge. Teacher training and distillations done by Dan Busbridge, Amitis Shidani, and Floris Weers. Longer training duration (512B token) teachers and students trained by Floris Weers.

Scaling law analysis. Original scaling law fitting code, based on Besiroglu et al. (2024), developed by Amitis Shidani. Generalized, JAX Just-In-Time (JIT) compilation compatible scaling law fitting code, and numerical minimization approaches for compute optimal analysis (Section 5 and Appendix D), done by Dan Busbridge. Functional form (Equation 8) developed by Dan Busbridge, in collaboration with Jason Ramapuram, Amitis Shidani, Russ Webb, and Floris Weers.

Scaling law downstream metrics. Implementations of calibration (Appendix E.8), Cumulative Distribution Function (CDF), and top-k metrics done by Amitis Shidani. Downstream model evaluations (Appendix E.1) done by Floris Weers.

Teacher-student capacity gaps. Kernel regression demonstration of the capacity gap phenomenon (Appendix C.1) done by Etai Littwin. MLP synthetic demonstration of the capacity gap phenomenon (Appendix C.2) done by Russ Webb.
Distilling language models in practice. Mixing coefficient sensitivity analysis (Appendix G.1) done by Dan Busbridge and Jason Ramapuram. Temperature (Appendix G.2) and learning rate (Figure 53) sensitivity analyses done by Dan Busbridge. Top-k and Top-p distribution truncation (Appendix G.4) implementation and analyses done by Jason Ramapuram. Mixing coefficient combined with truncation analysis (Appendix G.4) done by Jason Ramapuram. Reverse KL divergence (Appendix G.5) implementation and analysis done by Jason Ramapuram.