# Sparse High Rank Adapters

Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Shreya Kadambi, Rafael Esteves, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

Qualcomm AI Research
{kbhardwa,pwhatmou,hteague,markusn}@qti.qualcomm.com

Equal contribution. Work done while employed at Qualcomm AI Research. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. Code: https://github.com/Qualcomm-AI-research/SHiRA.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving the others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model significantly outperforms LoRA while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on the Parameter-Efficient Finetuning (PEFT) library which trains at nearly the same speed as LoRA while consuming up to 16% lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases. To demonstrate rapid switching benefits during inference, we show that loading SHiRA on a base model can be 5x-16x faster than LoRA fusion on a CPU.

1 Introduction

Low Rank Adaptation (LoRA) [13] is an established technique to tune the behavior of large generative models such as Large Language Models (LLMs) [30, 29] and Stable Diffusion [24, 22]. As the name suggests, LoRA requires very few parameters since it trains low rank projection weights that consume very little memory during the finetuning process while producing excellent results. Moreover, these low rank weights can be fused analytically into the base model, thereby incurring no additional overhead during inference.

Despite its success, there are still several limitations of low rank adaptation methods. First, if LoRA parameters are fused into the corresponding pretrained base model weights, they modify the entire weight tensor. Therefore, deploying LoRA on large models such as LLaMA-1/2 (7B+ parameters) or Stable Diffusion (1.5B+ parameters) on mobile devices would require changing a large number of weights during inference. Consequently, for mobile scenarios, if an application requires rapid adapter switching, existing low rank methods would incur a significant memory and latency cost. This is a major deployment challenge because, unlike large GPUs, the local memory of small AI accelerators is limited and cannot store all weights at the same time.
Figure 1: Sparse High Rank Adapters (SHiRA): Changing about 1-2% of the weights of the pretrained generative model is often sufficient to achieve high performance. Due to its extreme sparsity, SHiRA enables rapid switching and also reduces concept loss during multi-adapter fusion. In contrast, LoRA modifies the majority of parameters when fused, thus prohibiting rapid switching on mobile devices, and also experiences concept loss during multi-adapter fusion. With LoRA, the elephant in the single paintings-adapter case has artifacts (extra/broken tusks), and the bird and knight in the multi-adapter case lose the paintings concept and keep only the blue fire effects. SHiRA does not experience these issues.

These challenges can be partially addressed by running LoRA in unfused mode; however, unfused inference can incur as much as 30% additional latency compared to the base model [1] (see section 2.1 for details). This increased inference time in the unfused mode and the time needed for adapter switching significantly hamper user experience; hence, this is an important problem which has been a focus of recent research by various industries [9]. Second, LoRA has a well-known limitation called concept loss when using multiple concurrent adapters, e.g., when combining multiple style transfer adapters. Specifically, it has been well documented [34, 26, 8] that a simple additive merging of multiple LoRA adapters leads to concept loss for one or more adapters. Finally, recent literature also contributes important theoretical and empirical knowledge on the value of high rank adapters. For instance, Kalajdzievski [16] shows that high rank adapters can greatly outperform low rank adapters when used with correct scaling factors. This calls for further investigation into whether other high rank adapters would significantly outperform LoRA.

In view of the above, we address the following key problems in this paper: (i) How can we perform rapid switching for fused adapters? (ii) Is there a simpler solution for multi-adapter fusion to reduce concept loss? (iii) Can we build high rank adapters that have high expressive power without significantly increasing the training or inference costs?

To this end, we propose Sparse High Rank Adapters (SHiRA), a single solution to all three problems above. SHiRA is a highly sparse but high rank adapter which relies on training only a very small subset of parameters from the original pretrained network. One of the crucial insights we demonstrate is that finetuning merely 1-2% of the parameters of the pretrained generative model is sufficient to achieve high performance on many adapter tasks (see Fig. 1). However, unlike LoRA layers that modify all parameters in the weight tensors in the fused mode, SHiRA keeps the percentage of parameters that need to be switched very low, thus enabling rapid switching at inference time. Moreover, since the pretrained weights are huge, SHiRA, being a very sparse adapter, greatly aids multi-adapter fusion by significantly reducing concept loss. Finally, we theoretically and empirically analyze the high rank vs. sparsity properties of SHiRA and why they help with adapter performance.
Overall, we make the following key contributions:

- We propose SHiRA, a new high rank adapter paradigm, to demonstrate that changing as few as 1-2% of the parameters of the original network is sufficient for adaptation. Our crucial insight is that even the most basic masking criteria (to identify the top 1-2% of parameters) enable SHiRA to significantly outperform LoRA on diverse vision and language tasks.
- SHiRA enables on-device rapid adapter switching and provides a natural multi-adapter fusion technique due to its high sparsity, thus significantly reducing concept loss. We also theoretically analyze SHiRA through the lens of high rank adaptation vs. sparsity.
- We conduct extensive experiments on LLMs (LLaMA-7B, LLaMAv2-7B) and LVMs (Stable Diffusion, SDXL), where we demonstrate that SHiRA significantly outperforms LoRA on both single- and multi-adapter tasks. On LLMs, we show that SHiRA achieves up to 2.7% better accuracy than LoRA on commonsense reasoning. SHiRA also complements advanced variants of LoRA such as DoRA [20] and can be easily applied on top of them.
- Finally, on the training side, we provide a PEFT-based latency- and memory-efficient implementation for SHiRA which trains nearly as fast as standard LoRA while consuming 16% lower peak GPU memory. Beyond PEFT, we provide a simple way to turn any trainer into SHiRA finetuning. For inference, we demonstrate that SHiRA weights can be loaded on a CPU up to 5x-16x faster than equivalent LoRA fusing, thereby enabling rapid switching.

The rest of this paper is organized as follows: section 2 presents the background and related work. We propose SHiRA in section 3 and describe its theoretical properties in section 4. We then conduct extensive experiments on SHiRA in section 5. Finally, we discuss the key findings in section 6 and conclude the paper in section 7.

2 Background and Related Work

2.1 Background: Edge Deployment Challenges for LoRA

There are three existing deployment options for LoRA: (i) fuse the adapter offline and then deploy on-device: this changes a large fraction of the weight tensors compared to the base model, which prohibits rapid switching since it would increase DRAM traffic considerably; (ii) keep the adapter unfused and run inference in the unfused mode: this can help with rapid switching but incurs significant additional (up to 30% higher) latency, as shown in [1], since LoRA branches remain in the forward pass during inference; (iii) use the Huggingface/Diffusers pipeline [1] (built for server-grade GPUs) for mobile inference. This pipeline consists of load → fuse → inference → unfuse → unload to switch adapters. Here, unfused LoRA-A and LoRA-B weights (see Fig. 2(a)) are first loaded into memory and then fused into the base model by computing Wnew = W + AB; this new weight is used for inference. To switch the adapter, we unfuse it as W = Wnew − AB and then unload the existing LoRA weights to load the new ones. We provide further evidence in Appendix A to demonstrate that such a pipeline is not feasible for edge devices. This is primarily because edge devices are memory-limited and not all weights of large generative models can be stored in local memory at the same time. Hence, loading and fusing need to happen layerwise on a mobile device, which results in massive inference latency costs.
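To make the cost of this cycle concrete, the following is a minimal sketch of the fuse/unfuse arithmetic for a single linear layer; the tensor names and shapes are illustrative, and this is not the Diffusers pipeline code:

```python
# Minimal sketch of the load -> fuse -> inference -> unfuse -> unload cycle for one layer.
import torch

def fuse(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # W: (n, m) pretrained weight, A: (n, r), B: (r, m); Wnew = W + AB touches every element of W.
    return W + A @ B

def unfuse(W_new: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # W = Wnew - AB must be recomputed before a different adapter can be loaded.
    return W_new - A @ B

n, m, r = 4096, 4096, 16
W = torch.randn(n, m)
A, B = torch.randn(n, r) * 0.01, torch.randn(r, m) * 0.01

W_fused = fuse(W, A, B)             # load + fuse
y = W_fused @ torch.randn(m)        # inference in the fused mode
W_base = unfuse(W_fused, A, B)      # unfuse + unload before switching to the next adapter
```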
2.2 Related Work

LoRA, its variants, and sparse adapters. Many LoRA variants exist in the literature: DoRA [20], LoRA+ [11], VeRA [17], LoRA-FA [35], and RS-LoRA [16], among many others. The crucial difference between this literature and our work is that we develop a high rank adapter without increasing training and inference costs. Also, for such methods, the final fused adapter still updates all elements of the pretrained weight tensor, thus prohibiting rapid switching. Moreover, for completeness, we also show that SHiRA is orthogonal to and can be applied on top of some of the latest, more advanced LoRA variants such as DoRA [20] while preserving the benefits of rapid switching. A few other LoRA variants have also explored a combination of sparsity and low rank adaptation. Examples include RoSA [21], SoRA [6], and SparseAdapters [12]. Among these, SparseAdapters [12] explores the use of popular pruning techniques (e.g., SNIP [19]) to prune adapters and improve their efficiency. SoRA [6] proposes an adaptive-rank version of LoRA by gating elements of the down and up projection layers and pruning the zero entries at inference. Finally, RoSA [21] combines a sparse adapter with a low rank one to achieve some high rank benefits. However, since it combines the sparse adapter with LoRA, the fused adapter weight still overwrites the entire pretrained weight tensor.

Partial Finetuning. Our work is most closely related to partial finetuning techniques that were mostly proposed in the pre-LoRA era [36, 28, 3, 33, 10]. These methods use a mix of fixed sparse masks [28] or learned masks [36, 10] to finetune a pretrained network. Note that these techniques have mostly been explored for relatively small language models, and not for recent LLMs and diffusion models. Since LoRA models exploded in popularity, it has been unclear whether other sparse finetuning techniques can achieve results comparable to LoRA on generic adapter tasks, particularly in the vision domain. One significant limitation of partial finetuning, as opposed to LoRA-based methods, is its high GPU memory consumption, making it impractical for large generative models. Consequently, the reduced memory consumption during finetuning was a key factor in LoRA's success and widespread adoption. To this end, we provide a memory- and latency-efficient PEFT-based implementation for SHiRA which trains as efficiently as LoRA, thus requiring significantly less memory than prior partial finetuning techniques. Further, we explore the effectiveness of sparse finetuning on both large language and vision models and provide a detailed analysis of rapid switching and multi-adapter fusion for high rank adapters.

Figure 2: (a) LoRA, when fused into the pretrained model, modifies all weights and prevents rapid adapter switching. (b) SHiRA does not require additional weights during training but finetunes very few pretrained weights. Our approach relies on a sparse mask for gradient masking during training. We show that finetuning as few as 1-2% of parameters is sufficient to achieve high accuracy.

A notable concurrent work is SpIEL [4], which scales partial finetuning to modern LLMs and also has a PEFT implementation that achieves comparable speed and memory to LoRA.
The main differences between SpIEL and SHiRA are as follows: (i) SpIEL works with dynamic masks while SHiRA uses a static mask. (ii) The dynamic mask in SpIEL requires users to install custom sparse linear layer kernels for their GPUs. In contrast, SHiRA does not require installing any custom kernels and works directly with native PyTorch. Hence, SHiRA's biggest advantage is its ease of training and inference deployment. (iii) We also analyze multi-adapter fusion properties, e.g., the impact of sparsity on orthogonality between adapters, which were not discussed in SpIEL. (iv) Finally, SHiRA demonstrates its effectiveness on both vision and language tasks, whereas SpIEL only discusses language tasks.

Multi-Adapter Fusion. Existing multi-adapter fusion methods focus on preventing concept loss [8, 34, 26]. However, these methods usually either use the base LoRA as is and perform some non-trivial postprocessing on the adapters [34, 26], or create minor variants [8]. In contrast, we introduce a new adapter for the concept loss problem in which multiple concepts naturally do not interfere with each other. In that respect, our work is orthogonal to the prior multi-adapter fusion work since our adapter can be further postprocessed using such techniques.

3 Proposed Approach

3.1 Sparse High Rank Adapters (SHiRA)

SHiRA exploits highly sparse trainable parameters in the pretrained model. In its simplest form, our adapter can be trained by masking gradients such that only a fraction of the original weights get updated. Specifically, we do not add any new weights to the forward pass like LoRA (see Fig. 2(a)) but rather make a small percentage of existing weights trainable (see Fig. 2(b), top). To this end, we first create an extremely sparse (~98-99% zeros) mask M ∈ {0, 1}^{n×m}, where n, m are the dimensions of the pretrained weight matrix. M is then used to mask the gradients during backpropagation using a Hadamard product (see Fig. 2(b), bottom). Thus, very few parameters get updated during training and our adapter consists of just those sparse weights. Concrete implementations of SHiRA, one based on gradient masking and another latency- and memory-efficient one based on PEFT, are discussed in section 3.3. We consider the following masks M (only 1-2% trainable parameters, see also Appendix B); a minimal gradient-masking sketch follows the list:

1. SHiRA-Struct: In this structured mask, certain rows or columns of the weight as well as its diagonal are set to be trainable. All other rows/columns are not trainable. The diagonal makes the mask high rank, whereas the structured trainable rows/columns (set to 1 to enable gradient flow to the corresponding parameters) lead to a rank-1 adapter. Thus, SHiRA-Struct is a combination of a high rank but very sparse adapter and a rank-1 adapter.
2. SHiRA-Rand: This mask is obtained by randomly setting 1-2% of parameters as trainable.
3. SHiRA-WM: Here we pick the top-K parameters to train based on their weight magnitudes (WM), i.e., the absolute value of the weight in each layer.
4. SHiRA-Grad: This is a gradient-based mask. We first collect gradients on a small calibration set and then pick the top 1-2% of weights that receive the highest gradient magnitudes.
5. SHiRA-SNIP: The SNIP metric from the pruning literature [19] combines the weight magnitude and gradient strategies, i.e., SNIP equals the magnitude of the gradient times the weight.
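As a concrete illustration of the gradient masking described above, the following is a minimal PyTorch sketch for a single linear layer with a SHiRA-WM-style mask; the hook-based approach and variable names are illustrative and do not reproduce the released PEFT implementation:

```python
# Minimal sketch of SHiRA-style gradient masking on one linear layer.
import torch
import torch.nn as nn

layer = nn.Linear(4096, 4096, bias=False)

# SHiRA-WM mask: keep only the top 1% of weights (by magnitude) trainable, ~99% zeros.
k = int(0.01 * layer.weight.numel())
threshold = layer.weight.detach().abs().flatten().topk(k).values.min()
mask = (layer.weight.detach().abs() >= threshold).float()

# Hadamard-mask the gradient so only the selected entries ever receive updates.
layer.weight.register_hook(lambda grad: grad * mask)

# Weight decay must be disabled (or masked as well); otherwise frozen entries would drift.
opt = torch.optim.AdamW(layer.parameters(), lr=1e-4, weight_decay=0.0)

x, target = torch.randn(8, 4096), torch.randn(8, 4096)
loss = (layer(x) - target).pow(2).mean()
loss.backward()
opt.step()   # masked-out entries see zero gradient and therefore stay unchanged
```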
Figure 3: (a) Rapid adapter switching: The sparse finetuned weights can be stored as weights and their indices; storing [sparse weights + indices] consumes much less memory than the pretrained weights. At inference time, these weights can be loaded onto the base model weights. Since only 1-2% of weights need to be overwritten, the adapter can be efficiently switched at inference, eliminating the need for a separate fusion stage. (b) Multi-adapter fusion: Concept loss can be reduced if multiple adapters do not significantly interfere with each other.

3.2 Rapid Adapter Switching, Multi-Adapter Fusion, and High Rank

Since very few base weights change during SHiRA training, we can simply extract them and store them as sparse weights and their indices (see Fig. 3(a)). Hence, SHiRA is comparable to LoRA in model size but overwrites only a fraction of the pretrained weights at inference time. In contrast, LoRA fuses into the base weights as Wnew = W + AB and changes the entire weight. Note that we do not actually need to fuse SHiRA; we just need to overwrite the modified values at the correct indices in the pretrained weight tensor. This enables rapid switching on resource-constrained devices. To verify that SHiRA indeed provides rapid switching benefits compared to LoRA, we provide an optimized implementation based on scatter_op to overwrite base model weights instead of fusing them like LoRA. We demonstrate that, on a CPU, weight loading for SHiRA adapters can be up to 5x-16x faster than equivalent LoRA fusing for inference (see Appendix C and Fig. 7).

Next, we discuss multi-adapter fusion in SHiRA. Given two adapters A1 and A2 with sparse masks M1 and M2, we ask the following questions: (i) What is the impact of sparsity on the relative interference between adapters in the multi-adapter setting? (ii) Is it possible to create masks that result in nearly orthogonal SHiRA weights so they do not significantly interfere with each other at inference time? Getting adapters that do not interfere with each other is essential to avoid concept loss. To this end, we define specific metrics in section 4.2 to analyze the orthogonality properties between adapter weights for various SHiRA strategies. We theoretically show that at least one of the SHiRA methods, i.e., SHiRA-Struct, can in fact create near-orthogonal adapters. We further experimentally demonstrate in section 5.2.2 that SHiRA-Struct indeed outperforms other methods for multi-adapter fusion. Finally, since we do not have any low rank weights in the forward pass, our proposed adapters can be high rank albeit highly sparse. We theoretically analyze the rank vs. sparsity properties in section 4.

3.3 Memory- and Latency-Efficient SHiRA Training

We have created two implementations for SHiRA: (i) a backward-hook-based gradient masking that turns any trainer into SHiRA finetuning (see Appendix D), and (ii) a PEFT-based implementation. As discussed in Appendix E, the PEFT-based SHiRA implementation consumes 16.63% lower peak GPU memory and trains at nearly the same speed as LoRA. In contrast, DoRA exhibits a 40.99% increase in memory and a 28.9% increase in training time compared to LoRA.
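The switching path of section 3.2 can be sketched as follows; the helper names are illustrative and this is not the released scatter_op implementation:

```python
# Minimal sketch of rapid adapter switching: the adapter is stored as sparse values plus
# their flat indices and is applied with a scatter op, so only ~1-2% of entries are touched.
import torch

def extract_adapter(W_finetuned: torch.Tensor, W_base: torch.Tensor):
    """Store the adapter as (flat indices, finetuned values, original base values)."""
    idx = (W_finetuned != W_base).view(-1).nonzero(as_tuple=True)[0]
    return idx, W_finetuned.view(-1)[idx].clone(), W_base.view(-1)[idx].clone()

def load_adapter(W: torch.Tensor, adapter) -> None:
    """Scatter the adapter values into the flattened base weight in place (no matmul)."""
    idx, vals, _ = adapter
    W.view(-1).scatter_(0, idx, vals)

def unload_adapter(W: torch.Tensor, adapter) -> None:
    """Restore the original base values at the adapter's indices before switching."""
    idx, _, base_vals = adapter
    W.view(-1).scatter_(0, idx, base_vals)

# Switching from adapter_a to adapter_b on a deployed weight W (inference mode, i.e.,
# requires_grad is False or inside torch.no_grad()):
#   unload_adapter(W, adapter_a); load_adapter(W, adapter_b)
```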
4 Theoretical Insights for SHiRA

4.1 Rank vs. Sparsity

Below we discuss parameter and learning complexity, parallels between LoRA and SHiRA, as well as SHiRA's optimization properties from the lens of rank and sparsity.

Lemma 4.1. The parameter complexity and learning complexity of SHiRA are equal to the number of non-zero elements in the adapter.

Appendix F.1 provides the proof. This lemma suggests that, despite the high rank property of SHiRA, it does not require significantly larger datasets to converge.

Lemma 4.2. For a given sparsity factor, LoRA is a rank-r approximation of SHiRA with approximation error bounded by σ_{r+1}, the (r+1)-th singular value of the SHiRA adapter.

The above lemma is proved in section F.2. As a consequence of this lemma, any rank-r LoRA adapter of size (m, n) can be seen as an approximation of a SHiRA adapter with mr + rn non-zero elements.

Lemma 4.3. The scaling factor for SHiRA is independent of the rank of the adapter and can be set to 1.

Please see the proof in Appendix F.3. Lemma 4.3 states that we do not need scaling factors to stabilize the training and, therefore, we do not need additional hyperparameters like α or independent learning rates for separate A and B matrices as in LoRA [13] or LoRA+ [11]. Of note, the scaling factor α can still be used at inference time to vary the intensity of the adapter.

4.2 Adapter Weight Orthogonality in Multi-Adapter Fusion

In this section, we provide theoretical and empirical insights by studying properties of SHiRA and LoRA adapter designs for multi-adapter fusion.

Lemma 4.4. Consider two adapters, W1 and W2. If one of the adapters, W1 or W2, lies in the null space of the other, then the adapters will not interfere multiplicatively.

The proof is given in Appendix F.4. The above lemma implies that two adapters can be efficiently fused without interference if they are orthogonal. In order to analyze the orthogonality between any two adapter weights, we define the following metrics:

Definition 1. Adapter Weight Orthogonality Magnitude (AWOM) is defined as the l2 norm of the product A1ᵀA2 for two sparse adapter weights A1, A2 ∈ R^{n×m}. AWOM enables us to understand how far the product A1ᵀA2 is from the zero matrix O ∈ R^{m×m} (O_ij = 0 for all i, j).

Definition 2. Adapter Weight Orthogonality Ratio (AWOR) is defined as the sparsity ratio of the product A1ᵀA2. Specifically, AWOR = 1 − ||A1ᵀA2||_0 / m², where m² is the number of elements in A1ᵀA2.

Together, AWOM and AWOR give us an idea of the relative orthogonality between adapter weights A1 and A2. Next, we analyze how at least one of the SHiRA strategies (i.e., SHiRA-Struct) can result in near-orthogonal adapters. Recall that SHiRA-Struct adapters train certain rows/columns and the diagonal elements while keeping all other parameters frozen. Hence, the final trained adapter (after subtracting the pretrained weight) contains a structured pattern of rows/columns and diagonal elements, everything else being zero. Now, without loss of generality, consider two SHiRA-Struct adapters for a layer with square m × m weights: A1 = I + S1 and A2 = I + S2, where S1 and S2 are row-wise patterns of trained weights for two different tasks, and I is the identity matrix. Also, S1 and S2 are non-overlapping, e.g., both have the same number of non-zero rows but are offset from each other such that they do not share any trained rows. Then, the following result holds:

Lemma 4.5. Non-overlapping SHiRA-Struct adapters are nearly orthogonal: the AWOR for non-overlapping SHiRA-Struct adapters is at most the sum of the sparsity of the individual adapters.

Since all SHiRA masks are highly sparse, A1ᵀA2 has many zeros, thus making the adapters nearly orthogonal. The proof is provided in Appendix F.5.
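The two metrics can be computed directly. Below is an illustrative sketch (not the paper's simulation code) that interprets the l2 norm as the Frobenius norm and also mimics non-overlapping SHiRA-Struct-style adapters:

```python
# Illustrative computation of AWOM and AWOR for randomly generated adapters.
import torch

def awom(A1: torch.Tensor, A2: torch.Tensor) -> float:
    """Frobenius (l2) norm of A1^T A2; lower means closer to orthogonal."""
    return torch.linalg.norm(A1.T @ A2).item()

def awor(A1: torch.Tensor, A2: torch.Tensor) -> float:
    """Sparsity ratio of A1^T A2: fraction of exactly-zero entries (1.0 = fully orthogonal)."""
    P = A1.T @ A2
    return 1.0 - P.count_nonzero().item() / P.numel()

def random_sparse(n: int, m: int, density: float = 0.01) -> torch.Tensor:
    """Unstructured sparse delta weight, e.g., a SHiRA-Rand/WM-style adapter."""
    return torch.randn(n, m) * (torch.rand(n, m) < density)

def struct_like(m: int, rows) -> torch.Tensor:
    """SHiRA-Struct-like delta: identity diagonal plus a small band of trained rows."""
    A = torch.eye(m)
    A[list(rows), :] = torch.randn(len(rows), m)
    return A

m = 1024
A1, A2 = random_sparse(m, m), random_sparse(m, m)
S1, S2 = struct_like(m, range(0, 8)), struct_like(m, range(8, 16))
print(awom(A1, A2), awor(A1, A2))   # unstructured sparse: low AWOM, fairly high AWOR
print(awom(S1, S2), awor(S1, S2))   # non-overlapping struct-like: AWOR close to 1
```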
We demonstrate the orthogonality properties of various adapters and report the simulation results in Fig. 4. For this experiment, we compute AWOM and AWOR for a variety of adapter designs: dense, sparse-LoRA [12] (sparse LoRA A and B weights), SHiRA-WM, and SHiRA-Struct based adapters.

Figure 4: Comparison of average AWOM (left) and AWOR (right) for 50 randomly initialized adapters. We compare different adapters, namely Dense, Sparse LoRA, SHiRA-WM, and SHiRA-Struct.

As shown in Fig. 4, both dense and sparse LoRA have low AWOR for adapters with larger dimensions, e.g., 4096 × 4096, which is typical in LLMs. This signifies that these adapter weights are non-orthogonal. On the contrary, SHiRA-WM achieves much higher AWOR than the LoRA variants. More interestingly, SHiRA-Struct is nearly orthogonal. Note that, due to high sparsity, AWOM also tends to be much lower for SHiRA adapters than for their dense counterparts. Combined with the fact that the AWOR of SHiRA adapters is 63-96% higher than that of LoRA, this suggests that A1ᵀA2 is closer to zero for SHiRA adapters, potentially bringing them closer to orthogonality and reducing interference. Finally, although we have shown interesting properties for SHiRA-Struct, it is still a rank-1 + diagonal adapter. Hence, we need to trade off single-adapter performance (which strongly depends on the adapter's expressive power) against multi-adapter fusion capabilities. For instance, next we will see that while SHiRA-Struct is good for vision, SHiRA-SNIP performs well across both LVMs and LLMs.

Remark 1. The orthogonality property shown here can lead to disentangled representations for adapter outputs before they merge into the base model. However, this property does not hold for other SHiRA masks that do not have a regular sparsity pattern like SHiRA-Struct, even though other SHiRA strategies are still more orthogonal than LoRA weights (e.g., see the SHiRA-WM AWOR in Fig. 4 (right)). Interestingly, for unstructured sparse masks like SHiRA-WM, SHiRA-Grad, SHiRA-SNIP, etc., both overlapping and non-overlapping adapters have similar orthogonality properties. We discuss this in more detail in section 5.3.2. Finally, this analysis only focuses on the orthogonality of adapter weights and not on the orthogonality of subspaces. We leave the subspace analysis of SHiRA to future work.

5 Experiments

5.1 Training Setup and Datasets

For the vision tasks, we use the Realistic Vision-v3 checkpoint of Stable Diffusion-v1.5 and finetune it using different adapters on two style transfer datasets collected using public domain images. The first dataset, called Bluefire, provides a blue fire effect to images. The second is a paintings dataset which gives a paintings effect (see Appendix G for more details). For both datasets, we conduct single- and multi-adapter experiments. To quantify image quality, we use the Human Preference Score v2 (HPSv2) [32]. In the language domain, we experiment with LLaMA-7B [29] and LLaMA2-7B [30] and evaluate them on various commonsense reasoning benchmarks such as HellaSwag, PIQA, SIQA, BoolQ, Arc-Easy, Arc-Challenge, OpenBookQA, and WinoGrande. Similar to our vision investigations, we conduct single- and multi-adapter experiments on LLMs as well.
Specifically, for language finetuning, we follow the setup adopted by [14, 20] for training and evaluating LoRA [13], DoRA [20], and SHiRA based finetuned models on downstream tasks. Finally, we also explore the generalizability of SHiRA to other popular LoRA models and applications such as SDXL [22] and DreamBooth [25]. Detailed training setups are provided in Appendix H.

5.2 Vision Results

5.2.1 Impact of Various SHiRA Masks

We first evaluate the image quality of SHiRA and LoRA on the Paintings and Bluefire datasets for both single- and multi-adapter use cases. Fig. 1 shows a comparison between SHiRA-SNIP and LoRA. As evident, by merely changing 2% of the pretrained weights, SHiRA generates high quality images for both finetuning tasks.

| Style | Method | %Params | HPSv2 (α = 1) | HPSv2 (α = 0.5) |
|---|---|---|---|---|
| Paintings | LoRA | 3.84 | 24.7 ± 1.8 | 31.3 ± 1.5 |
| Paintings | SHiRA-Struct | 1.99 | 31.2 ± 1.7 | 33.0 ± 1.8 |
| Paintings | SHiRA-Grad | 2.05 | 30.3 ± 1.8 | 32.3 ± 1.8 |
| Paintings | SHiRA-SNIP | 2.05 | 29.8 ± 1.8 | 31.6 ± 1.8 |
| Bluefire | LoRA | 3.84 | 32.6 ± 1.9 | 33.6 ± 1.6 |
| Bluefire | SHiRA-Struct | 1.99 | 34.2 ± 1.6 | 34.1 ± 1.5 |
| Bluefire | SHiRA-Grad | 2.05 | 34.2 ± 1.5 | 33.7 ± 1.7 |
| Bluefire | SHiRA-SNIP | 2.05 | 33.7 ± 1.7 | 33.7 ± 1.6 |

Table 1: HPSv2 scores (higher is better) of various adapters on Paintings and Bluefire. SHiRA-Struct outperforms all other methods.

Next, we compare various types of SHiRA masks in Fig. 5. Clearly, all SHiRA schemes produce impressive images for different prompts and significantly outperform LoRA. We further quantify the image quality using HPSv2 for each of the masks. The results are presented in Table 1. As evident, all variants of SHiRA consistently achieve superior or similar HPSv2 scores compared to LoRA, especially for larger α (see details on the scaling factor α in Appendix I). More results are provided in Appendices J and K: see Table 10 and Figs. 10, 11, and 12.

5.2.2 SHiRA Adapters Aid Multi-Adapter Fusion

As explained in section 4.2, the high sparsity of SHiRA reduces the AWOM and increases the AWOR metric by increasing the number of zeros in the A1ᵀA2 product, even for unstructured schemes such as SHiRA-WM, SHiRA-Grad, and SHiRA-SNIP. We hypothesized that this may lead to improved multi-adapter fusion performance. This was also pointed out by [26, 8, 31]: naively merging multiple LoRA adapters leads to poor performance and concept loss.

Figure 5: Comparison between different SHiRA masking methods for single- and multi-adapter image generation. For multi-adapter fusion, SHiRA-Struct outperforms all other adapters by generating exceptional images with high frequency details and good concept fusion (e.g., see the fox and flower).

We now validate the effectiveness of various SHiRA schemes on multi-adapter fusion. The right two columns in Fig. 1 and Fig. 5 show our results. SHiRA is clearly better at capturing both concepts than LoRA. For example, both the bird and knight images in Fig. 1 generated with LoRA lose most of the paintings concept. Similarly, for the fox image in Fig. 5, LoRA does not show a significant bluefire concept. In contrast, SHiRA-Struct and SHiRA-SNIP consistently perform well on many different prompts and produce exceptional images for multi-adapter fusion. Please refer to Appendix K.1 (Figs. 10, 11, 12, and 13) for additional results. For certain classes that were not included in the training set for either adapter (e.g., see Koala in Figs. 10, 12, and 13 in the Appendix), we observe that LoRA produces significant artifacts whereas SHiRA generates high quality images.
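For reference, a simplified sketch of the additive merging depicted in Fig. 3(b) is shown below; the delta-based storage and the helper names are assumptions for illustration and may differ from the released fusion code:

```python
# Simplified multi-adapter fusion: sparse SHiRA deltas are added onto the base weight
# with per-adapter strengths (alpha1, alpha2), mostly touching disjoint entries.
import torch

def fuse_multi_shira(W_base, adapters, alphas):
    """adapters: list of (flat_indices, delta_values), where delta = finetuned - base."""
    W = W_base.clone()
    flat = W.view(-1)
    for (idx, delta), alpha in zip(adapters, alphas):
        flat.index_add_(0, idx, alpha * delta)   # overlapping indices simply accumulate
    return W

# Usage (hypothetical adapters):
#   W_multi = fuse_multi_shira(W_base, [bluefire_adapter, paintings_adapter], [1.0, 1.0])
```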
5.3 Language Results

5.3.1 Single Adapter SHiRA Finetuning

Similar to the vision results, we demonstrate the effectiveness of SHiRA on language tasks. For our experiments, each adapter (i.e., weight-magnitude, gradient-magnitude, and SNIP based SHiRA) is trained on the combined 170K-sample commonsense reasoning dataset released by [14, 20]. Similar to [20], we train our SHiRA adapters for 3 epochs and compare them against the LoRA baselines. As shown in Table 2, the various SHiRA adapters outperform LoRA by 1.9-2.7% on average on LLaMA-7B. Importantly, SHiRA only modifies 1% of the base model weights as compared to the 66.72% (4.5B weights) changed by LoRA in the fused mode, thus enabling rapid switching on edge devices. Interestingly, we found that SHiRA-Struct does not perform well on language tasks, likely because it is a rank-1 + diagonal adapter and may not have sufficient expressive power.

Moreover, when compared to newer techniques like DoRA [20], our proposed work takes an orthogonal approach by finetuning very few parameters of the pretrained weights. This strategy allows for an efficient integration of our adapter with methods like DoRA to improve the expressiveness of the adapters. As we show in Table 2, our proposed adapter benefits from DoRA based finetuning and achieves nearly comparable performance (within 0.3%) to DoRA on average, with the added benefit of changing only 1% of parameters at inference time. In contrast, DoRA would lead to a 66.72% (4.5B weights, about 9GB of memory in FP16) parameter change in the fused mode. Therefore, SHiRA is orthogonal to other existing low rank methods and can be efficiently integrated with them.

| Model | %Params | %C | BoolQ | PIQA | Arc-e | Arc-c | WG | OBQA | HS | SIQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 0.83 | 66.72 | 68.9 | 80.7 | 77.8 | 61.3 | 78.8 | 74.8 | 78.1 | 77.4 | 74.7 (+0%) |
| SHiRA-Grad | 1.0 | 1.0 | 68.4 | 80.9 | 80.2 | 64.7 | 80.4 | 78.2 | 80.3 | 79.4 | 76.6 (+1.9%) |
| SHiRA-WM | 1.0 | 1.0 | 69.6 | 81.6 | 81.5 | 66.5 | 79.8 | 79.4 | 79.6 | 77.8 | 77.0 (+2.3%) |
| SHiRA-SNIP | 1.0 | 1.0 | 68.3 | 80.6 | 81.5 | 67.9 | 80.0 | 79.6 | 82.1 | 79.1 | 77.4 (+2.7%) |
| DoRA | 0.84 | 66.72 | 68.5 | 82.9 | 81.4 | 65.8 | 80.8 | 81.0 | 84.8 | 79.6 | 78.1 (+0%) |
| SHiRA-WM-DoRA* | 6.25 | 1.0 | 70.9 | 81.9 | 81.7 | 64.9 | 80.8 | 79.2 | 84.5 | 78.6 | 77.8 (-0.3%) |

Table 2: Evaluation of LLaMA-7B on commonsense reasoning (accuracy, higher is better). WG and HS denote WinoGrande and HellaSwag, respectively. %C represents the parameters changed in the fused mode. Green denotes improvement. *Trained by masking a high-rank DoRA with a WM mask of the top 1% weights, thus changing only 1% of the model during both training and inference.

| Model | %Params | %C | BoolQ | PIQA | Arc-e | Arc-c | WG | OBQA | HS | SIQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 0.83 | 66.72 | 69.90 | 79.9 | 79.8 | 64.7 | 82.6 | 81.0 | 83.6 | 79.5 | 77.61 (+0%) |
| DoRA | 0.84 | 66.72 | 71.8 | 83.7 | 83.7 | 68.2 | 82.6 | 82.4 | 89.1 | 76.0 | 79.68 (+2.07%) |
| SHiRA-SNIP | 1.0 | 1.0 | 70.42 | 81.71 | 83.25 | 68.6 | 80.51 | 81.0 | 89.78 | 79.01 | 79.28 (+1.67%) |

Table 3: Results for LLaMA2-7B on commonsense reasoning (accuracy, higher is better).

Finally, we experiment with LLaMA2-7B [30] and demonstrate that SHiRA-SNIP, which achieved the best results on LLaMA-7B, yields significant accuracy gains compared to LoRA and nearly the same accuracy as DoRA (within 0.4%, see Table 3).

5.3.2 Multi-Adapter Fusion on LLMs

We now extend our LLM experiments to the multi-adapter fusion setting. To this end, we create a new setup where we independently train multiple adapters on the training sets of individual commonsense reasoning benchmarks, i.e., one adapter each for BoolQ, PIQA, and Arc-Easy. In contrast, each adapter in section 5.3.1 was trained on a combined dataset containing 170K samples from all eight commonsense benchmarks, as proposed in [14, 20]. In the present section, the goal is to evaluate how much accuracy drop various adapters experience when we perform multi-adapter fusion. Due to its simplicity in constructing a mask, we use SHiRA-WM in the rest of this paper. Further, we explore two settings: overlapping and non-overlapping SHiRA-WM adapters. The overlapping mask consists of the top 1% of parameters being trained for all tasks. In the non-overlapping setting, the top 1% of weights are trained for the first task, the next top 1% for the second task, and so on.
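A minimal sketch of constructing overlapping vs. non-overlapping SHiRA-WM masks is shown below; the function and variable names are illustrative, not the released mask-generation code:

```python
# Build one binary weight-magnitude (WM) mask per task for a given layer weight.
import torch

def wm_masks(W: torch.Tensor, num_tasks: int, frac: float = 0.01, overlapping: bool = True):
    """Overlapping: every task trains the same top-1% weights by |W|.
    Non-overlapping: task t trains the t-th highest disjoint 1% slice."""
    k = int(frac * W.numel())
    order = W.abs().flatten().argsort(descending=True)   # indices sorted by |W|
    masks = []
    for t in range(num_tasks):
        flat_mask = torch.zeros(W.numel(), dtype=torch.bool)
        if overlapping:
            flat_mask[order[:k]] = True                   # same top-k for every task
        else:
            flat_mask[order[t * k:(t + 1) * k]] = True    # disjoint top-k slices per task
        masks.append(flat_mask.view_as(W))
    return masks

W = torch.randn(4096, 4096)
overlap_masks = wm_masks(W, num_tasks=3, overlapping=True)
disjoint_masks = wm_masks(W, num_tasks=3, overlapping=False)
```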
We compare the performance of both LoRA and SHiRA on the multi-adapter fusion of these three tasks. As shown in Table 4, both overlapping and non-overlapping multi-SHiRA outperform multi-LoRA on all three commonsense benchmarks. This is in line with our theoretical analysis in section 4.2, where we suggest that even unstructured sparse SHiRA adapters such as SHiRA-WM exhibit more orthogonal behavior than LoRA due to high sparsity (see the higher AWOR of SHiRA-WM in Fig. 4 (right)). In comparison, independently trained LoRA adapters have no such property and suffer greatly during multi-adapter fusion. As a result, we see that both SHiRA models outperform LoRA by more than 6.5% accuracy on average. Further analysis of the properties of these trained adapters is discussed in Appendix K.3 (see Table 13 and Fig. 9). Of note, this experiment also demonstrates the value of creating a good mask for single-adapter performance: non-overlapping masks achieve lower single-adapter accuracy than the corresponding overlapping masks since they train less important parameters. Hence, creating an optimal mask for SHiRA should be of significant interest to future research.

5.4 Content/Style Personalization: Generalizing SHiRA to SDXL and DreamBooth

Finally, we extend SHiRA to DreamBooth [25] using a much bigger vision model, SDXL [22]. We follow a setup similar to that adopted by [2]. Specifically, one content (vase) and two style (wooden sculpture and canvas) datasets with five images each were collected from the DreamBooth dataset [25] and public domains, respectively. These datasets were used to train various content and style adapters. For our experiments, we use SDXL [23] as our base model and train both LoRA and SHiRA adapters with comparable trainable parameters on individual single-concept datasets. During training, prompts containing special identifier tokens like "" or "