# MINILLM: KNOWLEDGE DISTILLATION OF LARGE LANGUAGE MODELS

Published as a conference paper at ICLR 2024

Yuxian Gu^{1,2}, Li Dong^2, Furu Wei^2, Minlie Huang^1
^1 The CoAI Group, Tsinghua University   ^2 Microsoft Research
guyx21@mails.tsinghua.edu.cn, {lidong1,fuwei}@microsoft.com, aihuang@tsinghua.edu.cn
(Contribution during an internship at Microsoft Research. Corresponding author.)

ABSTRACT

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or used to train small models that imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MINILLM. Extensive experiments in the instruction-following setting show that MINILLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable across different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/minillm.

Figure 1: Comparison of MINILLM with sequence-level KD (SeqKD)^1 in terms of the average Rouge-L score on the evaluation sets. Left: GPT-2-1.5B as the teacher and GPT-2 125M, 340M, 760M as the students. Middle: GPT-J 6B as the teacher and GPT-2 760M, GPT-2 1.5B, and GPT-Neo 2.7B as the students. Right: OPT 13B as the teacher and OPT 1.3B, 2.7B, 6.7B as the students.

^1 SeqKD (Kim & Rush, 2016; Chiang et al., 2023; Taori et al., 2023) directly trains the student model on the texts generated by the teacher model and is a widely used KD method for language models.

1 INTRODUCTION

With the rapid development of large language models (LLMs; Brown et al., 2020; Han et al., 2021; Bommasani et al., 2021; Chowdhery et al., 2022; OpenAI, 2023), a common technique for reducing their high computational resource demand is knowledge distillation (KD; Hinton et al., 2015), where we train a small student model with supervision from a large teacher model. Two categories of KD are commonly applied: black-box KD, where only the teacher-generated texts are accessible, and white-box KD, where the teacher model's output distribution or intermediate hidden states are also available (Jianping et al., 2021). Recently, black-box KD has shown promising results in fine-tuning small models on the prompt-response pairs generated by LLM APIs (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023; Peng et al., 2023).
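The SeqKD baseline in footnote 1 is straightforward to express in code: the teacher generates responses for the training prompts, and the student is fine-tuned on them with the ordinary next-token objective. Below is a minimal sketch using Hugging Face `transformers`; the model names, prompt list, and generation settings are placeholder assumptions for illustration, not details taken from the paper.

```python
# Sketch of sequence-level KD (SeqKD): fine-tune the student on teacher-generated responses.
# Model names, prompts, and sampling settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name, student_name = "gpt2-xl", "gpt2"   # hypothetical teacher/student pair
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)
optim = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Explain knowledge distillation in one sentence."]  # placeholder data

for prompt in prompts:
    # 1) The teacher generates a pseudo-target response for the prompt.
    with torch.no_grad():
        inp = tok(prompt, return_tensors="pt")
        gen = teacher.generate(**inp, max_new_tokens=64, do_sample=True, top_p=0.95)
    # 2) The student is trained with ordinary next-token prediction on (prompt, response),
    #    masking the prompt tokens out of the loss.
    labels = gen.clone()
    labels[:, : inp["input_ids"].shape[1]] = -100
    loss = student(input_ids=gen, labels=labels).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```

In a real pipeline the teacher generations are usually produced once offline and the student is trained on the resulting corpus for several epochs; the loop above only shows the data flow.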
With the emergence of more open-source LLMs (Zhang et al., 2022; Touvron et al., 2023), white-box KD becomes more valuable for both research communities and industry sectors, because student models receive better signals from the output distribution and hidden states of teacher models, thereby potentially reaching higher performance. However, white-box KD approaches have mostly been studied for small (< 1B parameters) language understanding models (Sanh et al., 2019; Wang et al., 2020), while white-box KD for LLMs is yet to be explored.

In this work, we investigate white-box KD of LLMs where the output distribution of the teacher model is available. We argue that the standard KD objectives (Kim & Rush, 2016; Song et al., 2020; Chiang et al., 2023; Taori et al., 2023) are sub-optimal for LLMs that perform tasks in a generative manner. Given the teacher distribution $p(y|x)$ and the student distribution $q_\theta(y|x)$ parameterized by $\theta$, standard KD objectives (including several variants for sequence-level models) essentially minimize the approximated forward Kullback-Leibler divergence (KLD) between the teacher and the student distribution, termed $\mathrm{KL}[p\|q_\theta]$, which forces $q_\theta$ to cover all modes of $p$. For text classification tasks, $\mathrm{KL}[p\|q_\theta]$ works well because the output space usually consists of a finite number of classes, such that both $p(y|x)$ and $q_\theta(y|x)$ have few modes. However, for open-ended text generation tasks, the output spaces are much more complex, and $p(y|x)$ can contain many more modes than $q_\theta(y|x)$ can express due to the limited model capacity. Minimizing forward KLD causes $q_\theta$ to assign unreasonably high probabilities to the void regions of $p$ (Malinin & Gales, 2019) and produces very unlikely samples under $p$ during free-run generation (Huszár, 2015).

Figure 2: We fit a Gaussian mixture distribution with a single Gaussian distribution using forward KLD and reverse KLD.

To alleviate this problem, we propose to minimize the reverse KLD, $\mathrm{KL}[q_\theta\|p]$, which is widely used in computer vision (Lee et al., 2023) and reinforcement learning (Czarnecki et al., 2019). Compared to $\mathrm{KL}[p\|q_\theta]$, minimizing $\mathrm{KL}[q_\theta\|p]$ causes $q_\theta$ to seek the major modes of $p$ and assign low probabilities to $p$'s void regions (Minka et al., 2005), as illustrated in Figure 2 and discussed in Section 2.1. In LLM text generation, this means that the student model avoids learning too many long-tail variants (Holtzman et al., 2020) in the teacher model's distribution and focuses on the correctness of the generated content, which is critical in practical scenarios that require truthfulness and reliability (Ji et al., 2023b).

To optimize $\min_\theta \mathrm{KL}[q_\theta\|p]$, as shown in Section 2.2, we derive the gradient of the objective with Policy Gradient (Sutton et al., 1999). To further stabilize and accelerate training, we propose (1) single-step decomposition to reduce variance, (2) teacher-mixed sampling to alleviate reward hacking, and (3) length normalization to eliminate the length bias. Finally, we introduce the overall KD algorithm in Section 2.3. Our student models are named MINILLM, indicating that our method is suitable for compressing large (generative) language models and works well in that setting. We apply our method to various generative language models (Radford et al., 2019; Zhang et al., 2022; Touvron et al., 2023) with sizes ranging from 120M to 13B in the instruction-following setting (Sanh et al., 2022; Wei et al., 2022a), which covers a large range of NLP tasks.
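The mode-covering versus mode-seeking contrast in Figure 2 can be reproduced by fitting a single Gaussian to a two-component mixture with Monte-Carlo gradients. The sketch below is only a toy illustration of that contrast; the mixture parameters, initialization, and step sizes are arbitrary choices rather than settings from the paper.

```python
# Toy version of Figure 2: fit a single Gaussian q to a Gaussian mixture p
# by minimizing forward KL[p||q] (mode-covering) or reverse KL[q||p] (mode-seeking).
import torch
import torch.distributions as D

def make_mixture():
    mix = D.Categorical(torch.tensor([0.5, 0.5]))
    comp = D.Normal(torch.tensor([0.0, 10.0]), torch.tensor([1.0, 1.0]))
    return D.MixtureSameFamily(mix, comp)

def fit(direction, steps=2000):
    p = make_mixture()
    mu = torch.tensor(4.0, requires_grad=True)        # asymmetric init
    log_sigma = torch.tensor(0.0, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
    for _ in range(steps):
        q = D.Normal(mu, log_sigma.exp())
        if direction == "forward":    # KL[p||q]: samples come from p
            y = p.sample((512,))
            loss = (p.log_prob(y) - q.log_prob(y)).mean()
        else:                         # KL[q||p]: reparameterized samples from q
            y = q.rsample((512,))
            loss = (q.log_prob(y) - p.log_prob(y)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.item(), log_sigma.exp().item()

print("forward KL fit:", fit("forward"))   # typically a broad q spanning both modes
print("reverse KL fit:", fit("reverse"))   # typically a narrow q locked onto one mode
```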
We use 5 datasets with Rouge-L (Lin, 2004), human judgment, and GPT-4 feedback for evaluation. Experiments show that MINILLM consistently outperforms standard KD baselines on all the datasets and scales up well from 120M to 13B models (see Figure 1). Further analysis shows that MINILLM yields lower exposure bias, better calibration, and higher performance on long-response generation.

2 METHOD

We consider conditional text generation where the model produces a response $y = \{y_t\}_{t=1}^{T}$ conditioned on a prompt $x$ sampled from the distribution $p_x$, which is typically how LLMs perform tasks. We formulate KD as an optimization problem that minimizes the difference between a fixed teacher model distribution $p(y|x)$ and a student model distribution $q_\theta(y|x)$ parameterized by $\theta$. The standard KD methods approximately^2 minimize the forward KLD:

$$\mathrm{KL}[p\,\|\,q_\theta] = \mathbb{E}_{x\sim p_x,\, y\sim p'} \log \frac{p'(y|x)}{q_\theta(y|x)},$$

where $p'$ can be the real data distribution (word-level KD) or the teacher distribution $p$ (sequence-level KD).

^2 We say "approximately" because for word-level KD, $y$ is sampled from the real distribution rather than the teacher distribution. For a sufficiently strong teacher model, we can consider the two distributions approximately the same.

Figure 3: Comparison between sequence-level KD (left) and MINILLM (right). Sequence-level KD forces the student to memorize all samples generated by the teacher model, while MINILLM improves its generated texts with the teacher model's feedback.

Though widely used, $\mathrm{KL}[p\,\|\,q_\theta]$ tends to overestimate the void regions of $p$ in text generation tasks when $q_\theta$ is insufficiently expressive (Ji et al., 2023a). KD for LLMs fits this case because LLMs perform tasks in a generative manner, such that the low-capacity student models cannot perfectly imitate the complex text generation distribution of the teacher models or humans.

2.1 MINILLM: KNOWLEDGE DISTILLATION WITH REVERSE KLD

We consider minimizing the reverse KLD between the student and teacher model distributions as the learning objective for MINILLM:

$$\theta = \arg\min_\theta \mathcal{L}(\theta) = \arg\min_\theta \mathrm{KL}[q_\theta\,\|\,p] = \arg\min_\theta \left[-\,\mathbb{E}_{x\sim p_x,\, y\sim q_\theta} \log \frac{p(y|x)}{q_\theta(y|x)}\right]. \quad (1)$$

Minimizing reverse KLD has been shown to cause mode-seeking behavior in generative modeling (Huszár, 2015; Nowozin et al., 2016; Chen et al., 2018; Lee et al., 2023), where $q_\theta$ assigns high probabilities to $p$'s large modes and ignores the small ones (illustrated in a toy experiment in Figure 2). In this work, we first study this property for KD of LLMs in text generation. Minimizing forward KLD causes $q_\theta$ to place large probability mass on the zero-probability regions of $p$, which corresponds to low-quality text generation in practice, while reverse KLD focuses on $p$'s major modes, which is crucial for the correctness and faithfulness of text generation. As illustrated in Figure 3, unlike sequence-level KD, MINILLM, which minimizes reverse KLD, does not force $q_\theta$ to fit all $y$ sampled from the teacher distribution $p$. Instead, it encourages the student to generate samples preferred by the teacher within its own capacity, which is more achievable. Interestingly, we also find another perspective for understanding MINILLM, motivated by Inverse Reinforcement Learning (Ziebart et al., 2008). We present the related derivation in Appendix A.1.
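Equation 1 can be estimated empirically by sampling responses from the student and scoring them under both models. The sketch below does only that evaluation; `sequence_logprob` and `reverse_kld_estimate` are hypothetical helper names, and the sampling settings are assumptions. Note that naively backpropagating through such an estimate would ignore the fact that the samples themselves depend on $\theta$, which is exactly why Section 2.2 derives a policy-gradient form of the gradient instead.

```python
# Monte-Carlo estimate of the reverse KLD KL[q_theta || p] in Equation 1:
# sample y ~ q_theta(.|x), then average log q_theta(y|x) - log p(y|x).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, input_ids, prompt_len):
    """Sum of token log-probabilities of the response part under `model`."""
    logits = model(input_ids).logits[:, :-1]              # position t predicts token t+1
    logp = F.log_softmax(logits, dim=-1)
    tgt = input_ids[:, 1:]
    token_logp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum(dim=-1)     # keep only response tokens

@torch.no_grad()
def reverse_kld_estimate(student, teacher, tokenizer, prompts, n_samples=4):
    vals = []
    for prompt in prompts:
        enc = tokenizer(prompt, return_tensors="pt")
        prompt_len = enc["input_ids"].shape[1]
        for _ in range(n_samples):
            y = student.generate(**enc, max_new_tokens=64, do_sample=True, top_p=1.0)
            log_q = sequence_logprob(student, y, prompt_len)
            log_p = sequence_logprob(teacher, y, prompt_len)
            vals.append((log_q - log_p).item())
    return sum(vals) / len(vals)
```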
2.2 OPTIMIZATION WITH POLICY GRADIENT

**Gradient Derivation**  We notice that the gradient of the objective function $\mathcal{L}(\theta)$ in Equation 1 can be derived with the Policy Gradient Theorem (Williams, 1992; Haarnoja et al., 2017):

$$\nabla \mathcal{L}(\theta) = -\,\mathbb{E}_{x\sim p_x,\, y\sim q_\theta(\cdot|x)} \sum_{t=1}^{T} (R_t - 1)\, \nabla \log q_\theta(y_t \mid y_{<t}, x),$$

where $R_t = \sum_{t'=t}^{T} \log \frac{p(y_{t'} \mid y_{<t'}, x)}{q_\theta(y_{t'} \mid y_{<t'}, x)}$ is the accumulated reward measuring how much the teacher prefers the generation from step $t$ onward.

**Calibration**  OpenAI (2023) has shown that models trained with policy optimization are likely to be poorly calibrated. We test the calibration of MINILLM and the KD baselines on two widely used text classification datasets, SST2 (Socher et al., 2013) and BoolQ (Clark et al., 2019), based on LLaMA-7B. We design zero-shot classification instructions (see Appendix C.2) and take the probability of the label words to compute the ECE scores (Nixon et al., 2019). From Table 2, we observe that the KD and SeqKD models are worse calibrated than the teacher model, which potentially explains their low performance on canonical benchmarks (Gudibande et al., 2023). We suspect that minimizing forward KLD causes the models to push high probabilities to void regions of the target distribution, which leads to significant distribution differences between the student and the teacher (see the example in Figure 2). In contrast, MINILLM focuses on accurately learning the major parts of the target distribution and narrows the ECE score gap between the student and the teacher.

Figure 7: The Rouge-L scores of the distilled models against the SFT models on different subsets of S-NI, split by the length of the golden responses.

|          | Dolly Dist-4 | Dolly Loss | SelfInst Dist-4 | SelfInst Loss |
|----------|--------------|------------|-----------------|---------------|
| Teacher  | 99.3         | 3.55       | 99.1            | 4.44          |
| KD       | 99.4         | 3.93       | 98.8            | 5.36          |
| SeqKD    | 99.3         | 3.91       | 98.8            | 5.22          |
| MINILLM  | 99.0         | 3.95       | 98.6            | 5.33          |

Table 3: The distinct 4-grams (Dist-4) and language modeling loss (Loss) on the test sets, based on the LLaMA family. MINILLM preserves generation diversity.

**Performance on Various Response Lengths**  We study the models' performance when the golden response lengths fall in different ranges. In Figure 7, we show the Rouge-L scores of different KD models against the SFT models on three S-NI subsets split by the length of the ground-truth responses. We can see that all methods achieve low scores on prompts that expect short responses (≤ 5 tokens), probably because most responses in our training set are long sentences, which introduces a distribution shift between training and evaluation (Peng et al., 2023). Furthermore, the output spaces of these prompts are relatively small, allowing the student model to cover most modes of the teacher, so reverse KLD and forward KLD perform similarly. For prompts with longer expected responses (≥ 6 tokens), the teacher distribution contains more modes than the student's due to the complex output spaces, which shows the advantage of MINILLM over standard KD models. Similar results on UnNI are shown in Appendix D.4.

**Generation Diversity**  Caccia et al. (2020) found that models optimized by minimizing reverse KLD are likely to lose modes, which affects generation diversity. We follow Pang & He (2021) and discuss generation diversity from three aspects: (i) generating multiple distinct responses given a prompt; (ii) generating linguistically complex responses; (iii) the ability to generate content that has high coverage of the real data distribution.
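The two proxies reported in Table 3 for aspects (ii) and (iii) are simple to compute. The sketch below shows one plausible implementation; whether Dist-4 is aggregated per response or over the whole corpus, and the tokenization used, are assumptions on our part rather than details from the paper.

```python
# Diversity proxies as in Table 3: distinct 4-gram proportion of generated responses
# (Dist-4) and average language-modeling loss on held-out reference texts.
import torch

def distinct_n(tokenized_responses, n=4):
    """Percentage of unique n-grams over all n-grams across the responses."""
    total, unique = 0, set()
    for toks in tokenized_responses:
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return 100.0 * len(unique) / max(total, 1)

@torch.no_grad()
def lm_loss(model, tokenizer, texts):
    """Average next-token cross-entropy of `model` on reference texts."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt")["input_ids"]
        losses.append(model(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

# Usage sketch (names are placeholders):
# responses = [tokenizer.tokenize(r) for r in generated_responses]
# print(distinct_n(responses), lm_loss(student, tokenizer, test_texts))
```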
For (i), we argue that for many NLP applications, generating one correct response is sufficient, especially in scenarios demanding high truthfulness and reliability (Ji et al., 2023b). For (ii) and (iii), we report the responses' distinct 4-gram proportion and the language modeling loss on the test sets in Table 3, using the base models from the LLaMA family (see Appendix C.4 for more details). We can see that MINILLM preserves the distinct 4-gram proportion in the generated responses and the language modeling loss on the test set.

3.4 ABLATION STUDIES ON OPTIMIZATION STRATEGIES

We evaluate the effectiveness of the three strategies proposed in Section 2.2 to stabilize and accelerate optimization by distilling a GPT-2-125M model from the GPT-2-1.5B model. More ablation studies can be found in Appendix D.5. In Table 4, we report the best Rouge-L scores on the validation set of each run and the evaluation results of the corresponding checkpoints. We also plot the reverse KLD between the student and the teacher during training in Figure 8, where the curves are smoothed over 32 steps. We can see that teacher-mixed sampling and length normalization are effective for stabilizing training. Although the reverse KLD also decreases without these strategies, we find that the models quickly learn to generate repeated, short, or meaningless strings that have high probabilities under the teacher distribution (see examples in Appendix E), which is known as reward hacking (Skalse et al., 2022). This also leads to the low generation performance in Table 4. From Figure 8, we also observe that the single-step decomposition effectively reduces the variance of the training process, which also results in higher scores on the validation and test sets.

|                    | Valid. R-L | Dolly R-L |
|--------------------|------------|-----------|
| MINILLM            | 27.4       | 24.6      |
| w/o Length Norm.   | 17.4       | 14.7      |
| w/o Teacher-Mixed  | 22.3       | 20.4      |
| w/o Single-Step    | 27.0       | 23.7      |

Table 4: The performance on the validation and test sets when different combinations of the MINILLM optimization strategies are applied.

Figure 8: The reverse KLD between the teacher and the students during MINILLM training when different optimization strategies are applied.

4 RELATED WORK

**Large Language Models**  Large language models (LLMs; Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; OpenAI, 2023; Anil et al., 2023) have shown superior performance by solving various NLP tasks in a generative manner. Recent works apply instruction tuning (Wei et al., 2022a; Sanh et al., 2022; Chung et al., 2022) or learning from human feedback (Ouyang et al., 2022; Bai et al., 2022) to further improve the alignment of LLMs with humans and to create general AI assistants (OpenAI, 2022; Google, 2023). There are also efforts to build open-source LLMs (Zhang et al., 2022; Touvron et al., 2023; Biderman et al., 2023) to facilitate research and industry development. Although appealing, the broad capacities of LLMs usually emerge with large model sizes (Kaplan et al., 2020; Wei et al., 2022b) that require massive computational resources. Therefore, model compression is critical for the practical deployment and further research of LLMs.

**Knowledge Distillation**  Knowledge distillation (KD; Hinton et al., 2015), a widely used model compression technique, aims at training a student model with the guidance of a teacher model (Rusu et al., 2015; Sanh et al., 2019; Jianping et al., 2021).
In the NLP community, many works apply KD to text classification tasks by mimicking the teacher model's output distribution (Song et al., 2020; Liang et al., 2021; Zhang et al., 2023), hidden states (Jiao et al., 2020; Sun et al., 2019), or attention scores (Wang et al., 2020; 2021). For text generation, the standard KD method is to approximately minimize the forward KLD between the student's and the teacher's generation distributions, either by using the teacher's output at each time step as supervision (Sanh et al., 2019) or by directly training on the teacher's generated texts (Kim & Rush, 2016; Taori et al., 2023; Chiang et al., 2023; Peng et al., 2023). In this paper, we minimize the reverse KLD, which is more suitable for LLMs when the teacher distribution is available. Concurrent works (Agarwal et al., 2023; Wen et al., 2023) also explore a broader range of distribution discrepancy metrics for KD.

**Distribution Discrepancy Metrics in Text Generation**  The distribution discrepancy metric plays a significant role in training text generation models. The forward KLD is widely used due to its simplicity when derived as the Maximum Likelihood Estimate (MLE) objective (Zhang & Zhao, 2019). However, previous works show that minimizing the forward KLD leads to zero-avoiding behavior, where models try to cover all modes of the target distribution and sacrifice the accuracy of the major modes (Huszár, 2015). Some works resort to other metrics to remedy this problem, such as reverse KLD (Jiang et al., 2020), Total Variation Distance (Ji et al., 2023a), and Optimal Transport (Li et al., 2020). Our paper tackles this problem in the scenario of knowledge distillation for LLMs.

5 CONCLUSION

In this work, we investigate the problem of distilling the knowledge of LLMs into small language models. We find that the standard distillation methods that minimize the forward KLD are sub-optimal in language generation scenarios, because the teacher's output distribution contains more modes than the student's, and forward KLD forces the student distribution to overestimate the low-probability regions of the teacher distribution. Therefore, we propose MINILLM, which minimizes the reverse KLD between the teacher and student distributions, and we design an algorithm to optimize this objective. Extensive experiments show that MINILLM produces more precise responses with higher overall quality than standard KD models. We also find that MINILLM has lower exposure bias, better calibration, and higher performance in long-text generation while maintaining good generation diversity.

ACKNOWLEDGEMENTS

This work was supported by the National Key Research and Development Program of China (No. 2021ZD0113304), the National Science Foundation for Distinguished Young Scholars (No. 62125604), and the NSFC projects (Key project No. 61936010).

REFERENCES

Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. GKD: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023. URL https://arxiv.org/abs/2306.13649.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023. URL https://arxiv.org/abs/2305.10403.

Kushal Arora, Layla El Asri, Hareesh Bahuleyan, and Jackie Cheung. Why exposure bias matters: An imitation learning perspective of error accumulation in language generation.
In Findings of ACL, 2022. URL https://aclanthology.org/2022.findings-acl.58.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. URL https://arxiv.org/abs/2204.05862.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of NeurIPS, 2015. URL https://proceedings.neurips.cc/paper/2015/hash/e995f98d56967d946471af29d7bf99f1-Abstract.html.

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023. URL https://arxiv.org/pdf/2304.01373.pdf.

Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow, March 2021. URL https://doi.org/10.5281/zenodo.5297715.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners. In Proceedings of NeurIPS, 2020. URL https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language GANs falling short. In ICLR, 2020. URL https://openreview.net/pdf?id=BJgza6VtPB.

Alex James Chan and Mihaela van der Schaar. Scalable Bayesian inverse reinforcement learning. In ICLR, 2021. URL https://openreview.net/forum?id=4qR3coiNaIv.

Liqun Chen, Shuyang Dai, Yunchen Pu, Erjin Zhou, Chunyuan Li, Qinliang Su, Changyou Chen, and Lawrence Carin. Symmetric variational autoencoder and connections to adversarial learning. In Proceedings of AISTATS, 2018. URL http://proceedings.mlr.press/v84/chen18b/chen18b.pdf.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. URL https://arxiv.org/abs/2210.11416.

Kamil Ciosek. Imitation learning by reinforcement learning. In ICLR, 2021. URL https://openreview.net/forum?id=1zwleytEpYx.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova.
BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT, 2019. URL https://aclanthology.org/N19-1300.

Wojciech M Czarnecki, Razvan Pascanu, Simon Osindero, Siddhant Jayakumar, Grzegorz Swirszcz, and Max Jaderberg. Distilling policy distillation. In Proceedings of AISTATS, 2019. URL http://proceedings.mlr.press/v89/czarnecki19a/czarnecki19a.pdf.

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In Proceedings of ICML, 2018. URL https://proceedings.mlr.press/v80/furlanello18a/furlanello18a.pdf.

Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus, 2019. URL http://Skylion007.github.io/OpenWebTextCorpus.

Google. Bard, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/.

Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717, 2023. URL https://arxiv.org/abs/2305.15717.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of ICML, 2017. URL http://proceedings.mlr.press/v70/haarnoja17a.html.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, et al. Pre-trained models: Past, present and future. AI Open, 2021. URL https://www.sciencedirect.com/science/article/pii/S26666510210002319.

Yongchang Hao, Yuxin Liu, and Lili Mou. Teacher forcing recovers reward functions for text generation. In Proceedings of NeurIPS, 2022. URL https://openreview.net/pdf?id=1_gypPuWUC3.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. URL https://arxiv.org/pdf/1503.02531.pdf.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In Proceedings of ICLR, 2020. URL https://openreview.net/pdf?id=rygGQyrFvH.

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. In Proceedings of ACL, 2023. URL https://aclanthology.org/2023.acl-long.806.

Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015. URL https://arxiv.org/abs/1511.05101.

Haozhe Ji, Pei Ke, Zhipeng Hu, Rongsheng Zhang, and Minlie Huang. Tailoring language generation models under total variation distance. In Proceedings of ICLR, 2023a. URL https://openreview.net/pdf?id=VELL0PlWfc.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 2023b. URL https://dl.acm.org/doi/10.1145/3571730.

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of ACL, 2020. URL https://aclanthology.org/2020.acl-main.197.pdf.

Gou Jianping, Yu Baosheng, Stephen J Maybank, and Tao Dacheng. Knowledge distillation: A survey. IJCV, 2021. URL https://link.springer.com/article/10.1007/s11263-021-01453-z.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.
TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, 2020. URL https://aclanthology.org/2020.findings-emnlp.372/.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of EMNLP, 2016. URL https://aclanthology.org/D16-1139.pdf.

Hyoje Lee, Yeachan Park, Hyun Seo, and Myungjoo Kang. Self-knowledge distillation via dropout. Computer Vision and Image Understanding, 2023. URL https://www.sciencedirect.com/science/article/abs/pii/S1077314223001005.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020. URL https://arxiv.org/pdf/2005.01643.pdf.

Jianqiao Li, Chunyuan Li, Guoyin Wang, Hao Fu, Yuhchen Lin, Liqun Chen, Yizhe Zhang, Chenyang Tao, Ruiyi Zhang, Wenlin Wang, et al. Improving text generation with student-forcing optimal transport. In Proceedings of EMNLP, 2020. URL https://aclanthology.org/2020.emnlp-main.735.pdf.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Kevin Knight, Ani Nenkova, and Owen Rambow (eds.), Proceedings of NAACL, 2016. URL https://aclanthology.org/N16-1014.

Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, and Lawrence Carin. MixKD: Towards efficient distillation of large-scale language models. In Proceedings of ICLR, 2021. URL https://openreview.net/forum?id=UFGEelJkLu5.

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out (ACL 2004), 2004. URL https://aclanthology.org/W04-1013.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692.

Andrey Malinin and Mark Gales. Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness. In Proceedings of NeurIPS, 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/7dd2ae7db7d18ee7c9425e38df1af5e2-Paper.pdf.

Tom Minka et al. Divergence measures and message passing. Technical report, Citeseer, 2005. URL https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2005-173.pdf.

Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of AAAI, 2020. URL https://ojs.aaai.org/index.php/AAAI/article/view/5963/5819.

Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR Workshops, 2019. URL https://openaccess.thecvf.com/content_CVPRW_2019/papers/Uncertainty%20and%20Robustness%20in%20Deep%20Visual%20Learning/Nixon_Measuring_Calibration_in_Deep_Learning_CVPRW_2019_paper.pdf.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of NeurIPS, 2016.
URL https://proceedings.neurips.cc/paper_files/paper/2016/hash/cedebb6e872f539bef8c3f919874e9d7-Abstract.html.

OpenAI. OpenAI: Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.

OpenAI. GPT-4 technical report, 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proceedings of NeurIPS, 2022. URL https://arxiv.org/abs/2203.02155.

Richard Yuanzhe Pang and He He. Text generation by learning from demonstrations. In Proceedings of ICLR, 2021. URL https://openreview.net/pdf?id=RovX-uQ1Hua.

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. URL https://arxiv.org/abs/2304.03277.

Doina Precup, Richard S Sutton, and Satinder P Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of ICML, 2000. URL https://dl.acm.org/doi/10.5555/645529.658134.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI technical report, 2019. URL http://www.persagen.com/files/misc/radford2019language.pdf.

Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015. URL https://arxiv.org/pdf/1511.06295.pdf.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019. URL https://arxiv.org/pdf/1910.01108.pdf.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, et al. Multitask prompted training enables zero-shot task generalization. In Proceedings of ICLR, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/pdf/1707.06347.pdf.

Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349, 2017. URL https://arxiv.org/abs/1709.02349.

Joar Max Viktor Skalse, Nikolaus HR Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. In Proceedings of NeurIPS, 2022. URL https://openreview.net/pdf?id=yb3HOXO3lX2.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, October 2013. URL https://aclanthology.org/D13-1170/.

Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu, and Tie-Yan Liu. LightPAFF: A two-stage distillation framework for pre-training and fine-tuning. arXiv preprint arXiv:2004.12817, 2020. URL https://arxiv.org/pdf/2004.12817.pdf.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In Proceedings of EMNLP, 2019. URL https://aclanthology.org/D19-1441.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of NeurIPS, 1999. URL https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022. URL https://arxiv.org/abs/2201.08239.

Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of IJCAI, 2018. URL https://www.ijcai.org/proceedings/2018/0687.pdf.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. URL https://arxiv.org/pdf/2302.13971.pdf.

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model, 2021. URL https://github.com/kingoflolz/mesh-transformer-jax.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of ACL, 2021. URL https://aclanthology.org/2021.findings-acl.188.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. In Proceedings of EMNLP, 2022. URL https://openreview.net/forum?id=QzZeM4K1yaO.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of ACL, 2023. URL https://aclanthology.org/2023.acl-long.754.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In Proceedings of ICLR, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022b. URL https://openreview.net/pdf?id=yzkSU5zdwD.

Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. f-divergence minimization for sequence-level knowledge distillation. In Proceedings of ACL, 2023. URL https://aclanthology.org/2023.acl-long.605.pdf.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 1992. URL https://link.springer.com/article/10.1007/BF00992696.

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. LaMini-LM: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402, 2023. URL https://arxiv.org/pdf/2304.14402.pdf.

Huan Zhang and Hai Zhao. Minimum divergence vs. maximum margin: an empirical comparison on seq2seq models. In Proceedings of ICLR, 2019. URL https://openreview.net/forum?id=H1xD9sR5Fm.

Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Jialu Liu, Michael Bendersky, Marc Najork, and Chao Zhang. Do not blindly imitate the teacher: Using perturbed loss for knowledge distillation. arXiv preprint arXiv:2305.05010, 2023. URL https://arxiv.org/pdf/2305.05010.pdf.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. URL https://arxiv.org/pdf/2205.01068.pdf.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of NeurIPS, 2023. URL https://neurips.cc/virtual/2023/poster/73434.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI, 2008. URL https://cdn.aaai.org/AAAI/2008/AAAI08-227.pdf.

A DERIVATIONS

A.1 A PERSPECTIVE OF MINILLM FROM INVERSE REINFORCEMENT LEARNING

In Section 2.1, we formulate KD as an optimization problem of minimizing the discrepancy between the teacher distribution and the student distribution and finally reach the objective of minimizing the reverse KLD. Alternatively, we can also regard KD as training the student model with the teacher model's guidance, which resembles an agent learning from the feedback of an environment. Following Pang & He (2021), we treat token generation as a Markov Decision Process. At each time step t, the student model chooses an action (token) yt from the action space (vocabulary) V, conditioning on the state (prefix) (y