Published as a conference paper at ICLR 2024

# KNOWLEDGE FUSION OF LARGE LANGUAGE MODELS

Fanqi Wan1, Xinting Huang2, Deng Cai2, Xiaojun Quan1, Wei Bi2, Shuming Shi2
1School of Computer Science and Engineering, Sun Yat-sen University, China
2Tencent AI Lab
wanfq@mail2.sysu.edu.cn, quanxj3@mail.sysu.edu.cn
{timxthuang,jcykcai,victoriabi,shumingshi}@tencent.com
Work was done during an internship at Tencent AI Lab. Corresponding authors.

ABSTRACT

While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures (Llama-2, MPT, and OpenLLaMA) across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at https://github.com/fanqiwan/FuseLLM.

1 INTRODUCTION

With the continuous success of large language models (LLMs) such as the GPT (Brown et al., 2020) and LLaMA (Touvron et al., 2023) series across a wide range of natural language processing (NLP) tasks, it has become a strategic imperative for corporations to create their own LLMs. However, the costs associated with LLM development are astronomical. In addition to requiring vast amounts of training data, advanced techniques, substantial computational resources, and skilled labor, the development process also exerts significant pressure on energy consumption and the environment (Rillig et al., 2023). While these LLMs exhibit structural and functional differences, they share similar capabilities across a spectrum of NLP tasks. Consequently, beyond the traditional approach of training an LLM from scratch, an alternative option is to combine existing LLMs into a new, more powerful one, which is termed knowledge fusion of LLMs in this paper. If successful, this fusion not only cuts the cost of initial training but also allows the integrated model to benefit from the strengths of all the LLMs. The new model can also be fine-tuned and adapted for various downstream tasks. Moreover, the fusion can also happen among fine-tuned LLMs that specialize in a specific task.

The endeavor to integrate the capabilities of multiple models has been a long-standing pursuit. For example, ensemble methods (Littlestone & Warmuth, 1994; Jiang et al., 2023) directly aggregate the outputs of different models to enhance prediction performance and robustness. However, this approach requires maintaining multiple trained models and executing each during inference, which is impractical for LLMs due to their substantial memory and inference time requirements. Likewise, this approach doesn't facilitate fine-tuning, which is essential for many LLMs.
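As a concrete point of reference, the sketch below illustrates what output-level ensembling can look like: per-step next-token distributions from models that share a vocabulary are combined by a weighted average, and every member model must stay loaded and be queried at inference time. This is a minimal illustration of the general idea rather than an implementation of any specific cited method; the function name, the shared-vocabulary requirement, and the uniform-weight default are our own simplifying assumptions.

```python
import torch

def ensemble_next_token_probs(prob_list, weights=None):
    """Output-level ensembling: weighted average of next-token distributions.

    prob_list: list of tensors of shape (batch, vocab), one per model, all
               produced over the same vocabulary (a simplifying assumption).
    weights:   optional per-model weights; defaults to a uniform average.
    """
    if weights is None:
        weights = [1.0 / len(prob_list)] * len(prob_list)
    fused = sum(w * p for w, p in zip(weights, prob_list))
    # Re-normalize in case the weights do not sum exactly to one.
    return fused / fused.sum(dim=-1, keepdim=True)
```

In practice the per-model weights could be tuned on validation data, but the memory and latency costs noted above remain, since all member models must be kept resident at inference time.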
Another approach is to directly merge several neural networks into a single network through parameter-wise arithmetic operations (Wortsman et al., 2022; Jin et al., 2022). This approach typically assumes uniform network architectures and attempts to establish mappings between the weights of distinct neural networks, which is often unattainable in the context of LLMs. Moreover, weight merging may lead to suboptimal results when substantial differences exist in the parameter space (Li et al., 2022).

Figure 1: Illustration of conventional model fusion techniques (ensemble and weight merging) and our knowledge fusion approach for LLMs (FUSELLM). Different animal icons represent different LLMs, with various species denoting LLMs possessing differing architectures. FUSELLM externalizes the knowledge from multiple LLMs and transfers their capabilities to a target LLM.

In this paper, we explore the fusion of LLMs from a probabilistic distribution perspective. For an input text, we argue that the probabilistic distributions generated by different source LLMs can reflect their inherent knowledge in understanding this text. Therefore, the proposed FUSELLM leverages the generative distributions of source LLMs to externalize both their collective knowledge and individual strengths and transfer them to the target LLM through lightweight continual training. To achieve this, we develop a new strategy for aligning tokenizations originating from different LLMs and explore two methods for fusing the probability distributions generated by these diverse LLMs. During the continual training, FUSELLM places significant emphasis on minimizing the divergence between the target LLM's probabilistic distributions and those of the source LLMs.
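To make this training objective concrete, the following sketch shows one plausible way to combine a standard causal language modeling loss with a divergence term toward distributions fused from several source LLMs. It assumes the source distributions have already been aligned to the target tokenizer, and it uses a minimum-cross-entropy selection rule as an illustrative fusion function; the function names, the weighting coefficient `lam`, the `pad_id` default, and this particular fusion rule are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fuse_source_distributions(source_probs, target_ids, pad_id=0):
    """Fuse per-step distributions from K source LLMs (illustrative sketch).

    source_probs: list of K tensors of shape (batch, seq_len, vocab), already
                  aligned to the target model's tokenizer and vocabulary.
    target_ids:   gold token ids of shape (batch, seq_len).
    For each sequence, keep the source whose distribution assigns the lowest
    cross-entropy to the gold tokens (one of several conceivable fusion rules).
    """
    stacked = torch.stack(source_probs, dim=0)                            # (K, B, L, V)
    gold = target_ids.unsqueeze(0).unsqueeze(-1).expand(
        stacked.size(0), -1, -1, 1)                                       # (K, B, L, 1)
    token_nll = -torch.log(stacked.gather(-1, gold).squeeze(-1) + 1e-12)  # (K, B, L)
    mask = (target_ids != pad_id).float()                                 # (B, L)
    seq_ce = (token_nll * mask).sum(-1) / mask.sum(-1).clamp(min=1)       # (K, B)
    best = seq_ce.argmin(dim=0)                                           # (B,)
    batch_idx = torch.arange(target_ids.size(0), device=target_ids.device)
    return stacked[best, batch_idx]                                       # (B, L, V)

def fusion_training_loss(target_logits, fused_probs, target_ids, lam=0.9, pad_id=0):
    """Weighted sum of the causal-LM loss and a KL term toward the fused distribution."""
    lm_loss = F.cross_entropy(
        target_logits.transpose(1, 2), target_ids, ignore_index=pad_id)
    log_p = F.log_softmax(target_logits, dim=-1)
    fusion_loss = F.kl_div(log_p, fused_probs, reduction="batchmean")
    return lam * lm_loss + (1.0 - lam) * fusion_loss
```

During continual training, `lam` would trade off fidelity to the gold text against fidelity to the fused source distributions; the tokenizer alignment and the two fusion strategies actually used are described in Section 3.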
To empirically demonstrate the effectiveness of FUSELLM, we examine a challenging yet general scenario of LLMs fusion, where the source models share minimal commonalities. Specifically, we focus on three popular open-source LLMs that possess distinct architectures and functionalities: Llama-2 (Touvron et al., 2023), OpenLLaMA (Geng & Liu, 2023), and MPT (Team, 2023). Evaluations across three benchmarks, which consist of a total of 42 tasks spanning reasoning, commonsense, and code generation, confirm that the target model trained by our method outperforms each source LLM and the baseline in most tasks. Moreover, we simulate the existence of functionally distinct LLMs with identical architecture by continually training a single base model on several domain-specific corpora. When evaluated based on perplexity, our method demonstrates superior potential in combining the capabilities of these structurally identical LLMs compared to traditional ensemble and weight merging methods.

To sum up, this paper explores a novel challenge called LLMs fusion, with the goal of creating a unified model that effectively utilizes the collective capabilities and unique strengths of diverse LLMs. As illustrated in Figure 1, our proposed approach distinguishes itself from traditional ensemble and weight merging techniques by prioritizing the fusion of multiple LLMs through knowledge externalization and transfer. This study yields several findings that may spark future research. Firstly, while we demonstrate the effectiveness of our method through lightweight continual training on a compact, high-quality corpus, the thoughtful selection of the training corpus can be a crucial consideration, particularly with regard to its relevance to downstream tasks. Secondly, in scenarios where the capabilities of source LLMs vary significantly, the fusion function appears to be crucial in effectively combining their respective strengths. Lastly, when compared to traditional model ensemble and merging techniques, the field of LLMs fusion appears to be a more promising avenue for exploration, especially in light of the diverse structures and substantial model sizes of LLMs.

2 RELATED WORK

Model Fusion. The integration of capabilities from diverse models has been a long-standing objective, with existing approaches mainly falling into two categories. Firstly, the traditional technique of model ensemble combines the outputs of multiple models to enhance overall system performance (Littlestone & Warmuth, 1994; Sagi & Rokach, 2018). Note that this technique doesn't involve the explicit merging of multiple models into a new one. Common methods for model ensemble typically employ weighted averaging (Littlestone & Warmuth, 1994) or majority voting (Monteith et al., 2011) to consolidate predictions from various models. Recently, Jiang et al. (2023) introduced an ensemble framework designed to leverage the diverse strengths of multiple open-source LLMs. This framework first employs a pairwise comparison method to detect subtle distinctions among candidate outputs. Then, it combines the top-ranked candidates to produce an enhanced output, capitalizing on their strengths while mitigating their weaknesses. Secondly, weight merging presents another approach that facilitates model fusion at the parameter level. Gupta et al. (2020) and Wortsman et al. (2022) merged weights from models with identical structures, obtained through different strategies or configurations, to achieve improved overall performance. Similarly, Cha et al. (2021), Rame et al. (2022), and Arpit et al. (2022) explored weighted averaging of models derived from different configurations to enhance out-of-distribution generalization. Furthermore, Jin et al. (2022) merged models designed for specific domains or tasks to create a generalist capable of addressing all domains or tasks. Going beyond parameter merging of entire models, Wang et al. (2022b), Huang et al. (2023), and Zhang et al. (2023) applied linear mathematical operations to adapter parameters to achieve superior generalization performance. In a nutshell, while model ensemble requires the parallel deployment of multiple models, weight merging is generally limited to models with identical architectures. In contrast, the approach proposed in this paper supports the fusion of multiple LLMs with diverse architectures by explicitly transferring their knowledge and capabilities to a target LLM.
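To make the contrast with FUSELLM's distribution-level transfer concrete, the sketch below shows the parameter-level alternative discussed above: a weighted average of the weights of models that share an identical architecture, in the spirit of the weight-merging line of work. It is a minimal illustration under that identical-architecture assumption; the helper name and the uniform-weight default are our own, and it does not reproduce any specific cited method.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Parameter-level weight merging for models with identical architectures.

    state_dicts: list of PyTorch state dicts with identical keys and shapes,
                 e.g. from models trained with different configurations.
    weights:     optional per-model coefficients; defaults to a uniform average.
    Returns a new state dict that can be loaded into the shared architecture.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name, param in state_dicts[0].items():
        # Average in float32 for numerical stability, then cast back.
        merged[name] = sum(
            w * sd[name].to(torch.float32) for w, sd in zip(weights, state_dicts)
        ).to(param.dtype)
    return merged
```

A model of the same architecture can then load the result via `model.load_state_dict(...)`; the key limitation, as noted above, is that this only applies when the architectures, and hence the parameter names and shapes, match exactly.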
Knowledge Distillation. Knowledge distillation (Hinton et al., 2015), initially proposed for model compression, involves training a student model under the guidance of one or more teacher models. In the NLP community, knowledge distillation has been widely applied to text classification tasks. These applications include training the student model to replicate the teacher's output distribution (Sanh et al., 2019; Turc et al., 2019), as well as features (Sun et al., 2019; Jiao et al., 2020) and relations (Wang et al., 2020) derived from intermediate layers of the teacher model. In the realm of text generation, the conventional approach focuses on minimizing the KL divergence between the student and teacher generation distributions. This is achieved by using the teacher's probability distributions at each time step as supervision (Khanuja et al., 2021; Gu et al., 2023; Agarwal et al., 2023) or by directly training on the teacher's generated texts (Peng et al., 2023; Xu et al., 2023). While our method shares a framework similar to multi-teacher knowledge distillation, there are two significant distinctions. First, in traditional knowledge distillation, the student models are typically constrained to be smaller in size than the teachers. In our scenario, however, there are no limitations on the size of the target model. Second, traditional knowledge distillation often results in the student models lagging behind the teachers in performance after distillation. In contrast, we anticipate that after the fusion, the target model will surpass any of the source models in performance.

3 KNOWLEDGE FUSION OF LLMS

The primary objective of LLMs fusion is to externalize the collective knowledge embedded within multiple source LLMs and integrate their capabilities into a target LLM. Given $K$ source LLMs $\{\mathcal{M}^s_j\}_{j=1}^{K}$ with varying architectures, each having undergone individual pre-training or fine-tuning on distinct datasets, the key idea behind our approach is to initially stimulate the LLMs to manifest their inherent knowledge by challenging them to predict the next token. The probabilistic distributions of these predictions are thoroughly assessed, and the most accurate predictions are utilized to continually train the target LLM $\mathcal{M}^t$ on a corpus $\mathcal{C}$ using the causal language modeling objective. In the following sections, we start with a brief introduction to the preliminaries, followed by a detailed explanation of our LLMs fusion framework. Finally, we delve into the implementation details.

3.1 PRELIMINARIES

Let $t$ denote a text sequence of length $N$ sampled from the corpus $\mathcal{C}$ and t