Published as a conference paper at ICLR 2025

ENSEMBLES OF LOW-RANK EXPERT ADAPTERS

Yinghao Li¹, Vianne Gao², Chao Zhang², Mohamad Ali Torkamani¹
¹Amazon Web Services  ²Amazon.com
{yinghli,gaov,zhanpcha,alitor}@amazon.com

ABSTRACT

The training and fine-tuning of large language models (LLMs) often involve diverse textual data from multiple sources, which poses challenges due to conflicting gradient directions, hindering optimization and specialization. These challenges can undermine model generalization across tasks, resulting in reduced downstream performance. Recent research suggests that fine-tuning LLMs on carefully selected, task-specific subsets of data can match or even surpass the performance of using the entire dataset. Building on these insights, we propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model's capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise and thereby reducing conflicts during optimization. Expert adapters are then trained on these clusters, utilizing the low-rank adaptation (LoRA) technique to ensure training efficiency and model scalability. During inference, ELREA combines predictions from the most relevant expert adapters based on the input data's gradient similarity to the training clusters, ensuring optimal adapter selection for each task. Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset, as well as other ensemble approaches with similar training and inference complexity, across a range of domain-specific tasks.
1 INTRODUCTION

While general-domain large language models (LLMs) such as GPT-4 (OpenAI, 2022; 2023) and Llama (Touvron et al., 2023) have shown remarkable efficacy in diverse applications, adapting these models through supervised fine-tuning to specific domains or tasks remains indispensable for achieving optimal performance. For example, instruction following requires subtle model adjustments on specialized datasets that the general pre-training corpus alone cannot provide (Ouyang et al., 2022). Significant resources have been invested in constructing varied, high-quality datasets tailored for LLM fine-tuning, such as Alpaca (Taori et al., 2023), the Pile (Gao et al., 2021), and Flan (Longpre et al., 2023). These efforts have fueled the development of specialized models that address complex tasks across fields such as medical diagnostics (Singhal et al., 2023), financial analytics (Yang et al., 2023), and scientific decision-making (Zhang et al., 2024b), or that provide reasoning for their results (Wei et al., 2022), all tasks once deemed challenging for automated systems.

Nonetheless, fine-tuning LLMs on a comprehensive dataset frequently encounters the issue of conflicting gradient directions arising from varied training data points (Wang et al., 2021; Xia et al., 2024; Chen et al., 2024). This phenomenon complicates model updates and can lead to suboptimal performance. Wang et al. (2023d) demonstrate that mixing diverse instructional datasets can sometimes yield worse outcomes than fine-tuning on a carefully selected subset of the data that directly addresses the task at hand. To enhance the relevance of training data to specific tasks, Xie et al. (2023) propose methods such as importance resampling, which aligns the training dataset more closely with the target task distribution. Another innovative approach, proposed by Xia et al.
(2024) and termed targeted instruction tuning, involves selecting a small percentage (about 5%) of the training data that most significantly influences task performance, based on the average gradients of tokens. This method has shown promise, achieving comparable or superior results to traditional full-dataset fine-tuning across various tasks. Xia et al. (2024) also show that selecting data points based on gradients yields better outcomes than selection based on sentence embeddings.

Despite these advancements, current data selection techniques for fine-tuning are predominantly target-driven, relying heavily on specific features of the target task (e.g., n-gram frequency, example answer embedding, gradient direction) to guide the selection process. This requirement for task-specific data features imposes significant limitations when adapting LLMs to new or emerging tasks, especially when relevant training data or features are unavailable. To address these challenges, we propose a novel framework, Ensembles of Low-Rank Expert Adapters (ELREA), which leverages Low-Rank Adaptation (LoRA; Hu et al., 2022; Dettmers et al., 2023) to create multiple expert adapters. These adapters are trained independently on data groups with similar gradient directions, and their predictions are ensembled during inference based on the gradient features of the input. Specifically, ELREA begins by fine-tuning a base adapter on the full dataset to capture a wide range of general knowledge. We then evaluate and cluster the gradients of individual data points relative to their influence on the base adapter, organizing them into similarly sized groups. On each cluster, we continue training a specialized LoRA expert initialized from the base adapter, allowing the training process to maintain a computational burden comparable to that of a single adapter trained on the entire dataset.
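A minimal sketch of this cluster-then-weight pipeline, assuming plain k-means over L2-normalised gradient features and a softmax over cosine similarities at inference; the feature dimensions, cluster count, and temperature `tau` below are illustrative stand-ins, not the configuration used in the paper:

```python
import numpy as np

def kmeans(feats, k, iters=50, seed=0):
    """Plain k-means over per-example gradient features."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = feats[labels == c].mean(axis=0)
    return labels, centroids

def expert_weights(test_feat, centroids, tau=0.1):
    """Softmax over cosine similarity between one test gradient feature
    and each cluster centroid (tau is an illustrative temperature)."""
    sims = centroids @ test_feat
    sims /= np.linalg.norm(centroids, axis=1) * np.linalg.norm(test_feat)
    z = np.exp((sims - sims.max()) / tau)
    return z / z.sum()

# Toy L2-normalised "gradient features" for 200 training instructions.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

labels, centroids = kmeans(feats, k=4)      # one expert per cluster
w = expert_weights(feats[0], centroids)     # weights over the 4 experts
```

In a full system, each cluster's data would be used to continue training one LoRA expert, and `w` would mix the experts' predictions at inference time.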
During inference, the expert adapters collaboratively determine the output, with each adapter dynamically weighted according to the alignment between the input and the clusters' gradient profiles. Compared with conventional Deep Ensembles (Lakshminarayanan et al., 2017; Havasi et al., 2021; Wang et al., 2023a), this weight calculation needs to be performed only once at the beginning of the recurrent generation process and can be reused in subsequent passes, incurring minimal computational overhead while achieving stronger performance. Unlike previous methods, ELREA focuses on the task-agnostic setup, i.e., a one-time training effort without the need for additional task-specific validation data, making it more suitable for real-world deployment of LLMs.

In summary, our contributions are threefold: 1) We introduce Ensembles of Low-Rank Expert Adapters (ELREA), a framework that integrates efficient parameter adaptation techniques into an ensemble model to address conflicting gradient directions in LLM fine-tuning. 2) By combining gradient features with clustering methods, we create expert adapters specialized for different gradient profiles, enabling the model to adapt to diverse tasks without relying on task-specific data features or validation data points. 3) We demonstrate that ELREA outperforms baseline LoRA adapters trained on the full dataset across various domain-specific applications, as well as other Mixture-of-Experts (MoE) and self-consistency methods.

2 PRELIMINARIES

2.1 LANGUAGE MODELS AND PARAMETER-EFFICIENT FINE-TUNING

Decoder-only LMs, pioneered by GPT (Radford et al., 2018), are built upon the decoder component of the Transformer architecture (Vaswani et al., 2017) and are among the most prevalent and thoroughly examined language models today.
A pre-trained LM, denoted as M, learns language patterns from an extensive text corpus D_pre-train through an unsupervised next-token-prediction (NTP) objective, which minimizes the negative log-likelihood (NLL) of each subsequent token x_t in a length-T sequence x ∈ V^T, consisting of tokens from the vocabulary V, conditioned on the preceding context x_{<t}:

\[ \mathcal{L}_{\mathrm{NTP}}(x) = -\sum_{t=1}^{T} \log p_{M}(x_t \mid x_{<t}). \]

    user
    Let $A = (2, 0)$, $B = (0, 2)$, $C = (-2, 0)$, and $D = (0, -2)$. Compute the greatest possible value of the product $PA \cdot PB \cdot PC \cdot PD$, where $P$ is a point on the circle $x^2 + y^2 = 9$.
    model

Listing 2: An example of expected model answer for dataset MATH-Combined.

    We use complex numbers. Let $a = 2$, $b = 2i$, $c = -2$, and $d = -2i$ be the complex numbers corresponding to $A$, $B$, $C$, and $D$, respectively. Let $p$ be the complex number corresponding to $P$, so that $|p| = \sqrt{9} = 3$. Then we have
    \[\begin{aligned} PA \cdot PB \cdot PC \cdot PD &= |p-2| \cdot |p-2i| \cdot |p+2| \cdot |p+2i| \\ &= |(p-2)(p+2)| \cdot |(p-2i)(p+2i)| \\ &= |p^2-4| \cdot |p^2+4| \\ &= |p^4-16|. \end{aligned}\]
    Since $|p| = 3$, we have $|p^4| = 3^4 = 81$, so by the triangle inequality, \[|p^4-16| \le |p^4| + |-16| = 81 + 16 = 97.\] Equality holds if and only if $p^4 = -81$, which occurs when $p = 3\left(\frac{\sqrt2}{2} + \frac{\sqrt2}{2}i\right)$. Therefore, the answer is $\boxed{97}$.

Listing 3: An example of BBH inference prompts.

    user
    Infer the date from context.

    Example 1:
    Q: Today is Christmas Eve of 1937. What is the date 10 days ago in MM/DD/YYYY?
    Options:
    (A) 12/14/2026
    (B) 12/14/1950
    (C) 12/14/2007
    (D) 12/14/1937
    (E) 07/14/1938
    (F) 12/14/1988
    A: Let's think step by step.
    If today is Christmas Eve of 1937, then today's date is December 24, 1937. 10 days before today is December 14, 1937, that is 12/14/1937. So the answer is (D).
    Example 2:
    Q: Tomorrow is 11/12/2019. What is the date one year ago from today in MM/DD/YYYY?
    Options:
    (A) 09/04/2018
    (B) 11/11/2018
    (C) 08/25/2018
    (D) 11/02/2018
    (E) 11/04/2018
    A: Let's think step by step.
    If tomorrow is 11/12/2019, then today is 11/11/2019. The date one year ago from today is 11/11/2018. So the answer is (B).

    Example 3:
    Q: Jane and John married on Jan 2, 1958. It is their 5-year anniversary today. What is the date tomorrow in MM/DD/YYYY?
    Options:
    (A) 01/11/1961
    (B) 01/03/1963
    (C) 01/18/1961
    (D) 10/14/1960
    (E) 01/03/1982
    (F) 12/03/1960
    A: Let's think step by step.
    If Jane and John married on Jan 2, 1958, and if it is their 5-year anniversary today, then today's date is Jan 2, 1963. The date tomorrow is Jan 3, 1963, that is 01/03/1963. So the answer is (B).

    Question:
    Q: Today is Christmas Eve of 1937. What is the date tomorrow in MM/DD/YYYY?
    Options:
    (A) 12/11/1937
    (B) 12/25/1937
    (C) 01/04/1938
    (D) 12/04/1937
    (E) 12/25/2006
    (F) 07/25/1937
    model

Listing 4: An example of MMLU inference prompts.

    user
    Please solve the following multi-choice problems.

    Example 1:
    What distinguishes coercive diplomacy from military force?

    Option A: Compellence is another term for coercive diplomacy, but covering a narrower set of criteria; compellence covers those threats aimed at initiating adversary action. A threat to coerce a state to give up part of its territory would count as coercive diplomacy, as long as that threat proactively initiates action before reactive diplomacy is taken.
    Option B: Coercive diplomacy constitutes the threats of limited force to induce the adversary's incentive to comply with the coercer's demands. It is an influence strategy that is intended to obtain compliance: the use of force to defeat an opponent first does not count. It leaves an element of choice with the target to comply, or to continue.
    Option C: Military force, or the threat of military force, utilises fear to achieve strategic objectives. Coercive diplomacy is differentiated from this approach, because it does not use fear as a tool for coercing an adversary.
    Option D: Coercive diplomacy is employed to use force but to limit its effects on the international community. Coercive diplomacy is an aggressive strategy that is intended to obtain compliance through defeat. It does not leave an element of choice with the target; the target is either forced to comply or to engage in conflict. It seeks to control by imposing compliance, removing any opportunity for negotiation or concession.

    Answer: B

    Example 2:
    Which of the following is the best lens through which to investigate the role of child soldiers?

    Option A: Child soldiers are victims of combat that need re-education and rehabilitation.
    Option B: Children and their mothers are not active subjects in warfare and are best considered as subjects in the private sphere.
    Option C: Children are most often innocent bystanders in war and are best used as signifiers of peace.
    Option D: Children have political subjecthood that is missed when they are considered as passive victims of warfare.

    Answer: D

    Example 3:
    In order to become securitized, a threat must be presented in which of these ways?

    Option A: As an existential threat that requires immediate and extraordinary action, posing a threat to the survival of the state or to societal security.
    Option B: As requiring immediate and extraordinary action by the state, threatening the survival of a referent object and therefore warranting the use of measures not normally employed in the political realm.
    Option C: As an urgent threat to the survival of the referent object, so serious that it legitimises the employment of extraordinary action in response.
    Option D: As an urgent threat to the survival of the audience that requires extraordinary or emergency measures.

    Answer: C

    Example 4:
    How can we best describe the relationship between the state-centric approach and the concept of human security?

    Option A: There are such wide divisions within the human security framework regarding the nature of threats and referent objects that no widely applicable comparisons between state-centric approaches and human security can be drawn.
    Option B: By adopting the framework of human security, the limitations of the realist state-centric approach become evident. Whilst human security defines the referent object as the person or population, state-centric approaches prioritise the security of the state, deprioritizing the pursuit of human security.
    Option C: The state-centric approach to security is a faction of human security, usually defined within the broad school of human security. By being state-centric this approach prioritises the individual as the referent object in security studies.
    Option D: Both the state-centric and human-centric approaches to security are mutually exclusive and offer a sufficient analytic framework with which to understand the international security system. It is therefore the role of security analysts to determine which of these substantial concepts is correct, and which should be discarded.

    Answer: B

    Example 5:
    What are the frameworks of analysis within which terrorism has been considered (as of 2020)?

    Option A: Competition between larger nations has resulted in some countries actively supporting terrorist groups to undermine the strength of rival states. Terrorist networks are extended patronage clubs maintained and paid for by their donor states and are conceptualised as being like state actors, to be dealt with using military force.
    Option B: Globalization has enabled the internationalization of terrorist activities by opening up their operational space, although coordination is still managed from a geographical base. This suggests that terrorist groups are nationally structured, which means that terrorism cannot be considered in terms of a war to be defeated militarily without having serious implications on the indigenous population.
    Option C: Terrorism can be viewed as a problem to be resolved by military means (war on terrorism), by normal police techniques (terrorism as crime), or as a medical problem with underlying causes and symptoms (terrorism as disease).
    Option D: Terrorism is viewed as a criminal problem. The criminalization of terrorism has two important implications. Firstly, it suggests that terrorism can be eradicated - terrorists can be caught and brought to trial by normal judicial proceedings, thereby removing the threat from society - and secondly, it suggests that preventative crime techniques are applicable to prevent its development.

    Answer: C

Table 5: Efficiency comparison on a toy dataset. Time is in seconds; memory is in GiB.

| Step | M + Q_base time | M + Q_base memory | ELREA time | ELREA memory |
|---|---|---|---|---|
| Fine-tuning base adapter Q_base on D_ft (§3.1) | 246 | 15.49 | 246 | 15.49 |
| Calculating training gradient features δ(x_ft, instr) (§3.3) | - | - | 68 | 24.76 |
| Calculating test gradient features δ_test (§3.4) | - | - | 14 | 24.76 |
| Fine-tuning experts on clusters (§3.3) | - | - | 246 | 15.49 |
| Fine-tuning total | 246 | - | 574 | - |
| Inference (§3.4) | 114 | 7.73 | 262 | 18.46 |

    Question:

    Which of these principles is not an element of the responsibility to protect?

    Option A: The responsibility to prevent.
    Option B: The responsibility to react.
    Option C: The responsibility to remain sovereign.
    Option D: The responsibility to rebuild.
    model

G EFFICIENCY ANALYSIS

Theoretical Analysis. Theoretically, the computational overhead of ELREA compared to using M + Q_base arises from the following aspects:

1) the computation of the gradients of all training and test instructions;
2) clustering the gradient features of the training data points and computing the weights of each test data point on the clusters;
3) additional training steps to fit LoRA experts on the training clusters;
4) additional computational resources required to perform the forward pass on all LoRA experts for each test data point.

In practice, step 2) takes only a few minutes with our clustering setup (§3.3 and §3.4), which is negligible compared to the entire training process and is therefore ignored in the following discussion. If implemented properly, step 1) can also be integrated into the training and inference process with relatively small overhead. With a naïve implementation, step 1) approximately equals the cost of training the model on the combination of training and test instructions (without answers) for one epoch, whose overhead depends on the average length of the instructions. For datasets such as OpenAssistant, MATH, GSM8k, and MathQA, whose average instruction length is much shorter than the answer length (Table 4), the overhead is minimal. In the worst-case scenario, step 1)'s overhead approximates the cost of training the model on the combination of the training and test sets for one epoch, which is still acceptable for most fine-tuning datasets. As the sum of the training cluster sizes equals the number of training data points, i.e., \(\sum_{c=1}^{C} |D_c| = |D_{ft}|\), the additional training steps in step 3) take the same amount of time as training the base adapter Q_base (§3.4) on D_ft, excluding CPU-disk I/O overhead, which is generally less than one minute in our experiments. The complexity of step 4), however, is harder to estimate, as it varies drastically with the implementation.
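As a toy illustration of step 1), the sketch below stands in for per-example gradients with random vectors and compresses them through a fixed random projection into compact, L2-normalised features, in the spirit of the low-dimensional gradient features ELREA clusters (the paper works with 8,192-dimensional features; every size and modeling choice below is an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_params, feat_dim = 100, 5000, 64  # toy sizes

# Stand-in for per-example loss gradients w.r.t. adapter parameters.
# In ELREA these would come from one backward pass per instruction.
grads = rng.normal(size=(n_examples, n_params))

# A fixed random projection (Johnson-Lindenstrauss style) compresses each
# flattened gradient to feat_dim numbers while roughly preserving angles,
# so gradient-direction similarity survives the compression.
proj = rng.normal(size=(n_params, feat_dim)) / np.sqrt(feat_dim)
feats = grads @ proj

# L2-normalise so cosine similarity reduces to a dot product.
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
```

Because the projection matrix is fixed, training and test features live in the same space and can be compared directly at inference time.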
In our implementation, we choose to duplicate the input instruction along the batch dimension by the number of experts (i.e., C + 1) and perform a forward pass on the backbone and all experts simultaneously. This has a cost similar to running the base adapter M + Q_base with a (C + 1)-fold inference batch size.

Empirical Results. To evaluate the efficiency of ELREA, we compare its computation time with that of the baseline model M + Q_base using the same set of hyper-parameters and device configuration, a single NVIDIA A100 80G GPU, except for the following specific parameters. We generate a toy dataset consisting of 2,000 training samples and 400 test samples as a smaller-scale but more controllable evaluation setup. Each sample contains 60 random lorem-ipsum words in both the instruction and the answer (around 200 tokens each), matching the lengths in Dolly-15k (Table 4). We designate C = 4 experts and set the LoRA ranks to r = 8. The model undergoes fine-tuning over 3 epochs, with batch sizes of 4 for both fine-tuning and inference. During inference, the model consistently predicts the next 20 tokens for all input instructions to ensure a fair comparison.

The results from our implementation, presented in Table 5, indicate that the fine-tuning time for ELREA is 574 seconds, approximately 2.3× that of the baseline M + Q_base's 246 seconds. Similarly, the inference time and memory consumption are about 2.3× and 2.4×, respectively. In contrast, a classic Deep Ensembles setup, in which each LoRA expert is trained independently from scratch on the entire dataset, would require 5× the baseline time for both fine-tuning and inference. Thus, ELREA offers significant efficiency and performance gains compared to this more traditional approach. Further efficiency improvements could be achieved by reducing the number of experts or the LoRA ranks, or by constructing gradient features from only the top-k Transformer blocks rather than the entire model. Moreover, we are exploring LoRA merging techniques in ongoing work to effectively combine similar expert adapters, thereby further reducing inference costs.

H FURTHER ANALYSIS ON DATA CLUSTERING

Figure 4: Distribution of data sources and categories within each cluster for the MATH-Combined and GLUR (general language understanding and reasoning) training sets at rank r = 8. Cluster indices are shown along the rows, while columns represent data sources and categories, formatted as {source dataset}-{category} for MATH-Combined and {source dataset} for GLUR. The color intensity reflects the sample count, with darker shades indicating higher counts. Each column is independently normalized, meaning scales may differ across columns. Color gradients are slightly curved to improve visibility for categories with fewer samples.

To better understand the distribution of data across clusters, we analyze the sources and categories within each cluster of the MATH-Combined dataset, as visualized in Figure 4. Here, "data source" refers to the individual datasets that comprise MATH-Combined (i.e., MATH, GSM8k, SVAMP, or MathQA) and GLUR (i.e., CoT, Dolly-15k, Flan V2, and OpenAssistant), and "category" refers to the finer-grained labels within these datasets. Notably, GSM8k is categorized uniformly under the single label gsm8k due to its lack of distinct category labels. Analysis of Figure 4 reveals distinct correlations between clusters and data sources.
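The per-column normalization described in the Figure 4 caption can be reproduced in a few lines; the (cluster, source) labels below are toy stand-ins for the real cluster assignments:

```python
import numpy as np

# Toy (cluster, source) labels standing in for the real assignments.
clusters = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
sources  = np.array(["MATH", "GSM8k", "MATH", "SVAMP", "MATH",
                     "GSM8k", "MATH", "SVAMP", "SVAMP", "GSM8k"])

src_names = sorted(set(sources))
counts = np.zeros((clusters.max() + 1, len(src_names)))
for c, s in zip(clusters, sources):
    counts[c, src_names.index(s)] += 1

# Normalise each column independently, as in Figure 4, so each source's
# distribution over clusters is rendered on its own scale.
norm = counts / counts.max(axis=0, keepdims=True)
```

Rendering `norm` as a heatmap (clusters as rows, sources as columns) gives the same kind of view as Figure 4, up to the curved color gradient.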
For instance, in MATH-Combined, clusters 2, 3, and 5 predominantly contain samples from MATH, whereas clusters 0, 1, 6, and 7 primarily feature contributions from MathQA. The clustering also appears to group together tasks requiring similar mathematical skills; for example, cluster 4 heavily includes SVAMP samples, which typically assess algebraic problem-solving capabilities, alongside significant portions of Algebra and Prealgebra from the MATH dataset.

Figure 5: Examples of data clusters from MATH-Combined, generated using different random seeds in cases where the clusters are non-identical. The entire dataset is used for clustering, but only 10% of the data is visualized for clarity. The 8,192-dimensional gradient features are projected into 2D space using t-SNE. The colors are randomly assigned; the same color does not necessarily imply the same cluster across different seeds.

Additionally, within individual sources, the clusters effectively distinguish between finer categories: cluster 2 mainly focuses on Geometry and Probability, whereas cluster 3 concentrates on Algebra. These insights suggest that the data representations successfully capture inherent structural differences, making the clustering both interpretable and meaningful. Such characteristics motivate the design of ELREA and contribute significantly to its efficacy. As mentioned in §3.3, the clustering process is robust to random seeds; i.e., different seeds yield similar clusters. In cases where the clusters are not identical, we visualize them using t-SNE in Figure 5, which shows sensible data partitioning and similar cluster structures across different seeds. Even when the cluster boundaries are not identical, the ensemble framework in ELREA effectively mitigates these differences through weighted aggregation of experts, ensuring robust performance across various cluster configurations.
Therefore, the clustering process is both stable and reliable, providing a strong foundation for the ELREA framework.
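One concrete way to quantify the claim that different seeds yield similar clusters is the adjusted Rand index (ARI), a standard agreement score between two clusterings that is invariant to cluster relabeling. Below is a self-contained sketch with toy labelings (illustrative only, not the paper's actual cluster assignments):

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two flat cluster labelings; 1.0 means the partitions
    agree exactly, up to a renaming of the cluster indices."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    ua, ub = np.unique(a), np.unique(b)
    # Contingency table: how many items fall in cluster i of `a`
    # and cluster j of `b` simultaneously.
    table = np.array([[np.logical_and(a == i, b == j).sum() for j in ub]
                      for i in ua])
    sum_ij = sum(comb(int(x), 2) for x in table.ravel())
    sum_a = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Two labelings that induce the same partition under different names.
a = np.array([0, 0, 1, 1, 2, 2, 3, 3])
b = np.array([2, 2, 3, 3, 0, 0, 1, 1])
print(adjusted_rand_index(a, b))  # 1.0
```

Running such a score on cluster assignments from different seeds would turn the visual similarity in Figure 5 into a single number per seed pair.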