# RaSA: Rank-Sharing Low-Rank Adaptation

Published as a conference paper at ICLR 2025

Zhiwei He¹, Zhaopeng Tu², Xing Wang², Xingyu Chen¹, Zhijie Wang¹, Jiahao Xu², Tian Liang², Wenxiang Jiao², Zhuosheng Zhang¹, Rui Wang¹
¹Shanghai Jiao Tong University  ²Tencent AI Lab
¹{zwhe.cs,galaxychen,violetevergarden,zhangzs,wangrui12}@sjtu.edu.cn
²{zptu,brightxwang,jettexu,ttianliang,joelwxjiao}@tencent.com

Work was done when Zhiwei He, Xingyu Chen, and Zhijie Wang were interning at Tencent AI Lab. Zhaopeng Tu and Rui Wang are co-corresponding authors.

ABSTRACT

Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data, and scripts are available at: https://github.com/zwhe99/RaSA.

1 INTRODUCTION

Low-rank adaptation (LoRA, Hu et al. (2022)) has become a de facto parameter-efficient fine-tuning (PEFT) method for adapting large language models (LLMs) to specific downstream tasks. Its core idea is to constrain the parameter updates to be low-rank, which significantly reduces the number of trainable parameters and allows them to be merged back into the original model, thereby avoiding additional inference latency. Despite its advantages, recent studies have shown that LoRA still lags behind full fine-tuning (FFT), particularly in scenarios involving large training datasets and complex tasks such as mathematical reasoning and code generation (Jiang et al., 2024; Biderman et al., 2024).

A plausible explanation for this performance gap is that the low-rank constraint limits the expressive capacity of LoRA. For instance, Biderman et al. (2024) empirically found that the effective rank required for FFT is 10–100× higher than typical LoRA configurations, and Zeng & Lee (2024) theoretically demonstrated that a Transformer network (Vaswani et al., 2017) requires a rank at least half the size of the model dimension to approximate another model of similar size.

Although the limited number of trainable parameters results in limited expressive capacity, recent studies still indicate redundancy in LoRA's parameters. For example, Kopiczko et al. (2024); Song et al. (2024); Renduchintala et al. (2024); Li et al. (2024) further reduced the number of LoRA's parameters by sharing them across layers and modules with only slight performance loss. Brüel-Gabrielsson et al. (2024) compressed 1,000 LoRAs trained on different tasks by sharing their parameter spaces. This contradiction suggests that LoRA's parameters are still not being fully utilized.

Combining the above two observations, we propose Rank-Sharing Low-Rank Adaptation (RaSA), an approach that boosts the expressive capacity of LoRA by enabling partial rank sharing across layers.
Specifically, given an LLM with L layers, RaSA extracts k ranks from each layer's LoRA update to form a rank pool of Lk ranks, which is shared across all layers with layer-specific weighting. RaSA retains the core advantages of LoRA, keeping the same parameter overhead and allowing for easy merging back into the model. Moreover, since modern LLMs typically have deep architectures (i.e., large L), RaSA greatly increases the effective rank of the parameter update by (L − 1)k.

However, a higher rank does not necessarily lead to better expressive capacity. To rigorously assess the benefits of RaSA, we analyze its capacity to reconstruct high-rank matrices compared to LoRA. Theoretically, we prove that RaSA's minimum reconstruction error is bounded by that of LoRA. Empirically, we show that when k is relatively small, RaSA can be easily optimized to achieve a significantly lower reconstruction error than LoRA. Finally, we conducted experiments on mathematical reasoning and code generation, demonstrating that the lower reconstruction error translates to improved downstream task performance. Our contributions are summarized as follows:

- We propose RaSA, a novel extension of LoRA that allows partial rank sharing across layers, which significantly improves the method's efficiency and expressiveness (§2).
- We provide a comprehensive analysis, both theoretical and empirical, showcasing RaSA's superior capacity for matrix reconstruction (§3) and its resultant improved performance on rigorous downstream tasks (e.g., code and math) (§4).

Figure 1: Decomposition of the update matrix ΔW_i in (a) LoRA and (b) RaSA, where i is the layer index. In RaSA, part of each adapter is layer-specific while the pools B_S and A_S are shared across layers.

2.1 FORMULATION

Given a pre-trained weight matrix W ∈ R^{b×a}, LoRA constrains its update to a low-rank form by decomposing the update matrix ΔW ∈ R^{b×a} into a product of two rank-r matrices:

$$W + \Delta W = W + \frac{\alpha}{r} BA \qquad (B \in \mathbb{R}^{b \times r},\; A \in \mathbb{R}^{r \times a}), \tag{1}$$

where the rank r ≪ min(b, a) serves as a bottleneck dimension, reducing the number of trainable parameters, and α is a scaling factor. In an LLM with L layers, LoRA assigns distinct trainable matrices to each layer i: {B_i, A_i}_{i∈[L]} (Figure 1(a)).

RaSA, on the other hand, mitigates the low-rank bottleneck of LoRA through rank sharing. Specifically, RaSA takes k ranks out of each layer and shares them across all layers. This process can be conceptualized as follows:

1. Split the matrices B_i and A_i into layer-specific parts (B̃_i, Ã_i) and layer-shared parts (B̂_i, Â_i):

$$B_i = [\underbrace{\tilde{B}_i}_{\mathbb{R}^{b \times (r-k)}} \;\; \underbrace{\hat{B}_i}_{\mathbb{R}^{b \times k}}], \qquad A_i = [\underbrace{\tilde{A}_i^\top}_{\mathbb{R}^{a \times (r-k)}} \;\; \underbrace{\hat{A}_i^\top}_{\mathbb{R}^{a \times k}}]^\top. \tag{2}$$

2. Concatenate the layer-shared parts across all layers to form shared rank pools (B_S and A_S):

$$B_S = [\hat{B}_1 \; \hat{B}_2 \; \cdots \; \hat{B}_L] \in \mathbb{R}^{b \times Lk}, \qquad A_S = [\hat{A}_1^\top \; \hat{A}_2^\top \; \cdots \; \hat{A}_L^\top]^\top \in \mathbb{R}^{Lk \times a}. \tag{3}$$

Therefore, the update for layer i is given by:

$$W_i + \Delta W_i = W_i + \frac{\alpha}{r}\left(\tilde{B}_i \tilde{A}_i + B_S A_S\right) = W_i + [\tilde{B}_i \;\; B_S]\,\mathrm{diag}\!\left(\tfrac{\alpha}{r}, \dots, \tfrac{\alpha}{r}\right)\begin{bmatrix}\tilde{A}_i \\ A_S\end{bmatrix}. \tag{4}$$

To enable layer-specific weighting, we replace the constant diagonal matrix with a trainable diagonal matrix D_i = diag(d_1, d_2, ..., d_j, ..., d_{r−k+Lk}), yielding the final RaSA update (Figure 1(b)):

$$W_i + \Delta W_i = W_i + \underbrace{[\tilde{B}_i \;\; B_S]}_{\mathbb{R}^{b \times (r-k+Lk)}}\, D_i \,\underbrace{\begin{bmatrix}\tilde{A}_i \\ A_S\end{bmatrix}}_{\mathbb{R}^{(r-k+Lk) \times a}}. \tag{5}$$
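To make the construction concrete, the following PyTorch-style sketch assembles the RaSA update of Equation (5) for one linear module across L layers. The class name, wiring, and shapes are illustrative assumptions, not the authors' released implementation; tensor names mirror the notation above.

```python
import torch
import torch.nn as nn

class RaSALinear(nn.Module):
    """Sketch of a RaSA-adapted linear layer (Eq. 5). `shared_B`/`shared_A` are the
    rank pools B_S (b x Lk) and A_S (Lk x a), shared by all layers of one module type."""

    def __init__(self, base: nn.Linear, shared_B: nn.Parameter, shared_A: nn.Parameter,
                 r: int = 8, k: int = 1, alpha: float = 8.0, num_layers: int = 32):
        super().__init__()
        b, a = base.out_features, base.in_features
        self.base = base                                   # frozen pre-trained W_i
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.B_i = nn.Parameter(torch.zeros(b, r - k))     # layer-specific B̃_i, init to zero
        self.A_i = nn.Parameter(torch.empty(r - k, a))     # layer-specific Ã_i, Kaiming init
        nn.init.kaiming_uniform_(self.A_i, a=5 ** 0.5)
        self.B_S, self.A_S = shared_B, shared_A            # shared pools B_S, A_S
        # trainable diagonal D_i, initialized as in Eq. (6)
        d = torch.cat([torch.full((r - k,), 0.5 * alpha / (r - k)),
                       torch.full((num_layers * k,), 0.5 * alpha / (num_layers * k))])
        self.d_i = nn.Parameter(d)

    def forward(self, x):
        B = torch.cat([self.B_i, self.B_S], dim=1)         # [B̃_i  B_S]
        A = torch.cat([self.A_i, self.A_S], dim=0)         # [Ã_i ; A_S]
        delta_w = B @ torch.diag(self.d_i) @ A             # D_i applied at the bottleneck
        return self.base(x) + x @ delta_w.T

# Example wiring for one module type (shapes illustrative):
L_layers, r, k, b, a = 32, 8, 1, 4096, 4096
shared_B = nn.Parameter(torch.zeros(b, L_layers * k))      # B_S initialized to zero
shared_A = nn.Parameter(torch.empty(L_layers * k, a))      # A_S with Kaiming init
nn.init.kaiming_uniform_(shared_A, a=5 ** 0.5)
layer0 = RaSALinear(nn.Linear(a, b, bias=False), shared_B, shared_A, r, k, 8.0, L_layers)
```

After training, each ΔW_i = [B̃_i B_S] D_i [Ã_i; A_S] can be merged back into W_i exactly as in LoRA, so the sketch preserves the no-extra-inference-latency property described above.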
2.2 ANALYSIS & IMPLEMENTATION DETAILS

Rank of ΔW: Comparing Equations (1) and (5), RaSA increases the rank of ΔW from r to r − k + Lk. Since modern LLMs are deep, RaSA significantly boosts the model's expressive capacity by enabling a higher effective rank, which we discuss in detail in §3. Each layer in RaSA maintains the same rank for ΔW, which sets it apart from methods that dynamically assign ranks across layers, such as AdaLoRA (Zhang et al., 2023) and PriLoRA (Benedek & Wolf, 2024).

Additional Parameters: RaSA introduces the diagonal matrix D_i as additional parameters. Since D_i is diagonal and operates only at the bottleneck dimension, the added parameters are negligible. In practice, D_i contributes less than 0.001% of the total model parameters.

Initialization: Following LoRA, we use Kaiming initialization (He et al., 2015) for Ã_i and A_S, and initialize B̃_i and B_S to zero. For D_i, we differentiate between the layer-specific and layer-shared parts by scaling α proportionally to their respective ranks:

$$d_j = \begin{cases} \dfrac{1}{2}\cdot\dfrac{\alpha}{r-k} & \text{if } j \le r-k, \\[4pt] \dfrac{1}{2}\cdot\dfrac{\alpha}{Lk} & \text{if } j > r-k. \end{cases} \tag{6}$$

Same Dimension Assumption: RaSA assumes that all layers share the same dimensionality. This holds for the vast majority of models (e.g., Llama (Dubey et al., 2024) and Mistral (Jiang et al., 2023)).

3 RECONSTRUCTION ERROR ANALYSIS

While RaSA increases the effective rank of ΔW, a higher rank does not necessarily guarantee improved expressive capacity. For instance, a full-rank identity matrix can only perform the identity transformation. To assess the expressive capacity of LoRA and RaSA, we compare their abilities to reconstruct a set of high-rank matrices {M_i}_{i∈[L]}, where rank(M_i) = R > r. Under the Frobenius norm, the minimum reconstruction error (MRE) of LoRA is defined as:

$$e_{\mathrm{lora}} = \min_{B_i, A_i} \sum_{i=1}^{L} \lVert M_i - B_i A_i \rVert_F^2. \tag{7}$$

According to the Eckart–Young–Mirsky theorem (Eckart & Young, 1936), we can perform singular value decomposition (SVD) on M_i:

$$M_i = \sum_{j=1}^{R} \sigma_j^{(i)} u_j^{(i)} v_j^{(i)\top} \qquad (\sigma_1^{(i)} \ge \sigma_2^{(i)} \ge \cdots \ge \sigma_R^{(i)}). \tag{8}$$

LoRA's optimal approximation is given by the first r components of SVD(M_i), and e_lora becomes the sum of squares of the discarded singular values (those beyond the r-th one):

$$e_{\mathrm{lora}} = \sum_{i=1}^{L} \Big\lVert M_i - \sum_{j=1}^{r} \sigma_j^{(i)} u_j^{(i)} v_j^{(i)\top} \Big\rVert_F^2 = \sum_{i=1}^{L} \sum_{j=r+1}^{R} \big(\sigma_j^{(i)}\big)^2. \tag{9}$$

Similarly, when each layer shares k ranks out, we can define the MRE of RaSA as:

$$e_{\mathrm{rasa}}(k) = \min_{\tilde{B}_i, \tilde{A}_i, B_S, A_S, D_i} \sum_{i=1}^{L} \Big\lVert M_i - [\tilde{B}_i \;\; B_S]\, D_i \begin{bmatrix} \tilde{A}_i \\ A_S \end{bmatrix} \Big\rVert_F^2. \tag{10}$$

For simplicity, in this section we consider that D_i operates only on the shared matrices B_S and A_S (any scaling of the layer-specific part can be absorbed into B̃_i), which does not affect the value of e_rasa(k):

$$e_{\mathrm{rasa}}(k) = \min_{\tilde{B}_i, \tilde{A}_i, B_S, A_S, D_i} \sum_{i=1}^{L} \big\lVert M_i - (\tilde{B}_i \tilde{A}_i + B_S D_i A_S) \big\rVert_F^2. \tag{11}$$
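The LoRA baseline in Equation (9) is simply a rank-r truncated SVD per layer. The NumPy sketch below, with synthetic matrices standing in for the {M_i} (sizes and seed are illustrative), computes e_lora both ways as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
L, b, a, R, r = 4, 64, 64, 32, 8      # toy sizes; the paper uses real weight updates as M_i

# synthetic rank-R target matrices M_i
Ms = [rng.standard_normal((b, R)) @ rng.standard_normal((R, a)) for _ in range(L)]

e_lora_direct, e_lora_svals = 0.0, 0.0
for M in Ms:
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]     # best rank-r approximation (Eckart-Young)
    e_lora_direct += np.linalg.norm(M - M_r, "fro") ** 2
    e_lora_svals += float(np.sum(s[r:] ** 2))       # sum of squared discarded singular values

print(e_lora_direct, e_lora_svals)                   # the two quantities agree, as in Eq. (9)
```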
3.1 THEORETICAL ANALYSIS

Theorem 3.1. e_rasa(k) ≤ e_lora.

Proof. To prove this, we construct a feasible solution for RaSA that achieves the same reconstruction error as LoRA's minimum error. This is done by distributing the ranks shared across layers in RaSA such that they cover the same rank range as the optimal LoRA solution. For each layer i, we take the last k components (corresponding to the indices r − k + 1 through r) from LoRA's optimal approximation (Equation (9)), forming the following matrices:

$$U^{(i)} = \big[u^{(i)}_{r-k+1} \; u^{(i)}_{r-k+2} \; \cdots \; u^{(i)}_{r}\big], \quad V^{(i)} = \big[v^{(i)}_{r-k+1} \; v^{(i)}_{r-k+2} \; \cdots \; v^{(i)}_{r}\big], \quad \Sigma^{(i)} = \mathrm{diag}\big(\sigma^{(i)}_{r-k+1}, \dots, \sigma^{(i)}_{r}\big). \tag{12}$$

The shared matrices B_S and A_S are constructed by stacking U^{(i)} and V^{(i)} from each layer:

$$B_S = \big[U^{(1)} \; \cdots \; U^{(i)} \; \cdots \; U^{(L)}\big], \qquad A_S = \big[V^{(1)} \; \cdots \; V^{(i)} \; \cdots \; V^{(L)}\big]^\top. \tag{13}$$

Similarly, we define the diagonal matrix D_i for each layer i by placing the corresponding singular values Σ^{(i)} in their appropriate positions (the block belonging to layer i) and zeros elsewhere:

$$D_i = \mathrm{diag}\big(0, \dots, 0, \; \Sigma^{(i)}, \; 0, \dots, 0\big). \tag{14}$$

Finally, the matrices B̃_i and Ã_i are formed from the first r − k components of SVD(M_i):

$$\tilde{B}_i \tilde{A}_i = \sum_{j=1}^{r-k} \sigma^{(i)}_j u^{(i)}_j v^{(i)\top}_j. \tag{15}$$

Substituting Equations (13) to (15) into Equation (11), we derive the following:

$$\sum_{i=1}^{L} \big\lVert M_i - (\tilde{B}_i \tilde{A}_i + B_S D_i A_S) \big\rVert_F^2 = \sum_{i=1}^{L} \Big\lVert M_i - \Big( \sum_{j=1}^{r-k} \sigma^{(i)}_j u^{(i)}_j v^{(i)\top}_j + \sum_{j=r-k+1}^{r} \sigma^{(i)}_j u^{(i)}_j v^{(i)\top}_j \Big) \Big\rVert_F^2 = \sum_{i=1}^{L} \Big\lVert M_i - \sum_{j=1}^{r} \sigma^{(i)}_j u^{(i)}_j v^{(i)\top}_j \Big\rVert_F^2 = \sum_{i=1}^{L} \sum_{j=r+1}^{R} \big(\sigma^{(i)}_j\big)^2 = e_{\mathrm{lora}}. \tag{16}$$

Thus, we conclude that e_rasa(k) ≤ e_lora, proving that RaSA can achieve an equal or lower minimum reconstruction error than LoRA. ∎

3.2 EMPIRICAL ANALYSIS

While the previous theoretical analysis guarantees that RaSA can at least match the MRE of LoRA, it does not quantify how much RaSA improves upon LoRA. To provide a more intuitive understanding of how RaSA achieves lower reconstruction error, we turn to an optimization-based analysis using coordinate descent.

Figure 2: Reconstruction error curves of RaSA (r = 8, k = 1) during coordinate descent (up to 50 iterations), shown for different linear modules. The minimum reconstruction error of LoRA (Equation (9)) is plotted for comparison.

Empirical Validation: Specifically, we instantiate the set of high-rank matrices {M_i}_{i∈[L]} with the actual weight updates from model fine-tuning, {ΔW_i}_{i∈[L]}, and iteratively minimize the reconstruction error in Equation (11) by adjusting the parameters of RaSA (r = 8, k = 1), namely B̃_i, Ã_i, B_S, A_S, and D_i (details can be found in Appendix A; a simplified optimization sketch follows below). We apply this procedure to various kinds of linear modules within Llama-3.1-8B until convergence, and compute e_lora using Equation (9) as baseline values. Figure 2 shows that RaSA requires only around 10 iterations to achieve a significantly lower reconstruction error than LoRA's minimum. This pattern is consistent across all linear modules in the model, demonstrating the enhanced expressive capacity of RaSA.
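For readers who want to reproduce the trend in Figure 2 without the closed-form updates, the sketch below minimizes the objective of Equation (11) with a generic optimizer (Adam). It is a simplified stand-in for the coordinate-descent procedure actually used (derived in Appendix A); the function takes any list of target matrices, e.g. extracted ΔW_i tensors or synthetic data.

```python
import torch

def rasa_reconstruction_error(Ms, r=8, k=1, steps=500, lr=1e-2):
    """Minimize Eq. (11) by plain gradient descent; a simplified stand-in for the
    closed-form coordinate descent of Appendix A. Returns the final reconstruction error."""
    L = len(Ms)
    b, a = Ms[0].shape
    B = [torch.zeros(b, r - k, requires_grad=True) for _ in range(L)]    # layer-specific B̃_i
    A = [torch.randn(r - k, a, requires_grad=True) for _ in range(L)]    # layer-specific Ã_i
    BS = torch.randn(b, L * k, requires_grad=True)                       # shared pool B_S
    AS = torch.randn(L * k, a, requires_grad=True)                       # shared pool A_S
    D = [torch.ones(L * k, requires_grad=True) for _ in range(L)]        # diag(D_i)
    opt = torch.optim.Adam(B + A + D + [BS, AS], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(((M - (B[i] @ A[i] + BS @ torch.diag(D[i]) @ AS)) ** 2).sum()
                   for i, M in enumerate(Ms))
        loss.backward()
        opt.step()
    return loss.item()
```

The returned value can be compared directly against e_lora from Equation (9) computed on the same matrices.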
Selection of k: RaSA introduces only one additional hyper-parameter, k, which controls how many ranks are taken from each layer to be shared across all layers. When k = 0, RaSA reduces to LoRA, where no ranks are shared. On the other hand, when k = r, RaSA shares all ranks across layers, eliminating layer-specific low-rank updates and making the adaptation fully shared. While this maximizes the effective rank of the update, it may diminish layer diversity and the ability to capture layer-specific nuances. We traversed k from 0 to 8 and present the converged reconstruction error from the previous coordinate descent experiment in Figure 3. The results indicate that a small value of k, around r/8, achieves the minimum error. Further increasing k can lead to a rise in reconstruction error, even exceeding that of LoRA. This finding also indicates that current methods that share all ranks across all layers, such as VeRA (Kopiczko et al., 2024) and Tied-LoRA (Renduchintala et al., 2024), might be sub-optimal and challenging to optimize.

Figure 3: Reconstruction error comparison between RaSA and LoRA as a function of the shared rank parameter k (0, 1, 2, 4, 6, 8). The minimum reconstruction error of LoRA (Equation (9)) is plotted for comparison. The results are averaged across all linear modules in the model.

4 EXPERIMENT

Tasks: Our experiments generally align with those reported by Biderman et al. (2024). We applied all the methods to instruction fine-tuning and evaluated their performance on challenging tasks: code generation and mathematical reasoning. While Biderman et al. (2024) use HumanEval (Chen et al., 2021) and GSM8K (Cobbe et al., 2021) as test sets, these two benchmarks have become saturated with the rapid growth of LLMs. To provide a more rigorous evaluation, we adopted two more challenging benchmarks as test sets. Since these benchmarks lack validation sets, in addition to reporting the results from the last checkpoint, we also report the best results as a reference for the upper bound of each method. Prompt templates for evaluation are provided in Appendix B.

Code Generation: We used Magicoder-Evol-Instruct-110k (Wei et al., 2024) as the training data, a collection of programming question-answer (QA) pairs, which is a reproduced and decontaminated version of the WizardCoder data (Luo et al., 2024). We used HumanEval+ (Liu et al., 2023) as the test set, an extension of the HumanEval benchmark that scales the number of test cases by 80×. We used the Bigcode Evaluation Harness (Ben Allal et al., 2022) as the evaluation tool, sampling 50 solutions per problem with a temperature of 0.2, and report both Pass@1 and Pass@10 (see the Pass@k sketch at the end of this subsection).

Mathematical Reasoning: We used MetaMathQA (Yu et al., 2024) as the training data, which comprises 395K QA pairs derived from the training sets of GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), rewritten by GPT-3.5. We used MATH (Hendrycks et al., 2021) as the test set, which consists of 5K competition-level mathematics problems covering 7 subjects and 5 difficulty levels. We followed the evaluation protocol from the LM Evaluation Harness (Gao et al., 2024), using sympy to verify correctness and employing greedy search for generation.

Baselines: We compare RaSA to several representative PEFT methods:
- LoRA (Hu et al., 2022), which learns only a low-rank perturbation to the pretrained weight matrix.
- MoRA (Jiang et al., 2024), which uses block diagonal matrices instead of low-rank matrices.
- VeRA (Kopiczko et al., 2024), which fully shares the low-rank matrices across all layers with layer-specific weighting, and freezes the low-rank matrices during training to achieve extreme parameter-efficient fine-tuning. VeRA can therefore set a higher rank r than LoRA.

LLMs & Training Details: We conducted experiments on two open-sourced LLMs: Llama-3.1-8B (Dubey et al., 2024) and Mistral-0.3-7B (Jiang et al., 2023). Following common practice (Kopiczko et al., 2024; Jiang et al., 2024), we used pre-trained models rather than instruction-tuned ones. We applied the PEFT methods to all linear modules from attention (W_q, W_k, W_v, W_o) and feed-forward networks (W_up, W_down, W_gate). We set the model hyper-parameters based on the optimal configurations from Biderman et al. (2024), employing the decoupled LionW optimizer with a batch size of 192, and training for 8 epochs with a learning rate of 5e-4 by default. For RaSA, we set k = max(r/8, 1) based on the analysis in §3.2. More details are provided in Appendix C.
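The Pass@k numbers reported below are typically computed with the unbiased estimator of Chen et al. (2021) over the n = 50 sampled solutions per problem. The exact computation is handled by the Bigcode Evaluation Harness; the following is only a minimal sketch of the estimator itself.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): probability that at least one of
    k samples drawn from n generations (of which c pass all test cases) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g., a problem where 12 of 50 sampled solutions pass all HumanEval+ tests
print(pass_at_k(n=50, c=12, k=1), pass_at_k(n=50, c=12, k=10))
# the benchmark score is the mean of pass_at_k over all problems
```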
4.2 MAIN RESULTS

In this section, we compare RaSA and the baselines on two challenging domains: code and math.

Code Generation: Table 1 presents the results on the HumanEval+ test set. We compare RaSA and prior LoRA variants in terms of both efficiency and effectiveness. Although VeRA adds only 1.6M extra parameters for r = 1024, it results in a training time increase of between 13% and 16% over LoRA with r = 32. VeRA, however, is the least effective among all the variants due to its extreme strategy for parameter efficiency. Both MoRA and RaSA add a comparable number of additional parameters to LoRA, yet MoRA requires more time due to its use of block diagonal matrices. In terms of model performance, MoRA is on par with LoRA, aligning with the findings reported in the original paper (Jiang et al., 2024). Our proposed RaSA surpasses all baseline models in nearly all scenarios. Like LoRA, RaSA's performance improves with rank, and at rank 32, RaSA typically delivers the strongest performance for both the Llama and Mistral models. RaSA achieves a maximum HumanEval+ score of 59.5% Pass@1 with Llama-3.1-8B.

Table 1: Performance on the code generation task (i.e., HumanEval+). We italicize the best result for each rank, and bold the best result for each model. We also present the training cost for each setting in terms of trainable parameters (# Trainable Param.) and training time (Time). Note that for MoRA and RaSA, r does not correspond to the effective rank of the update matrix.

| r | Method | # Trainable Param. | Llama-3.1-8B Time | Llama-3.1-8B Pass@1 (Best / Last) | Llama-3.1-8B Pass@10 (Best / Last) | Mistral-0.3-7B Time | Mistral-0.3-7B Pass@1 (Best / Last) | Mistral-0.3-7B Pass@10 (Best / Last) |
|---|---|---|---|---|---|---|---|---|
| 1024 | VeRA | 1.6M | 11.3h | 48.8 / 48.8 | 66.5 / 64.2 | 12.5h | 42.5 / 39.5 | 57.3 / 54.4 |
| 8 | LoRA | 21.0M | 9.6h | 56.1 / 53.0 | 71.2 / 68.5 | 10.7h | 42.6 / 39.7 | 57.7 / 54.8 |
| 8 | MoRA | 21.0M | 12.0h | 54.6 / 52.1 | 68.4 / 66.9 | 13.4h | 45.2 / 38.6 | 64.4 / 48.6 |
| 8 | RaSA | 21.0M | 11.2h | 57.9 / 56.9 | 72.6 / 69.6 | 12.1h | 50.0 / 49.0 | 66.0 / 64.2 |
| 16 | LoRA | 41.9M | 9.8h | 54.5 / 53.4 | 68.9 / 67.6 | 10.7h | 46.0 / 40.6 | 61.2 / 54.9 |
| 16 | MoRA | 41.9M | 12.7h | 56.3 / 52.9 | 69.5 / 65.6 | 14.0h | 43.4 / 41.0 | 59.4 / 56.0 |
| 16 | RaSA | 42.0M | 11.2h | 57.3 / 56.4 | 72.1 / 68.1 | 12.1h | 53.6 / 51.3 | 68.5 / 63.7 |
| 32 | LoRA | 83.9M | 10.0h | 57.9 / 56.9 | 69.8 / 69.2 | 10.8h | 50.2 / 44.4 | 64.4 / 57.0 |
| 32 | MoRA | 83.9M | 12.4h | 55.6 / 53.0 | 69.0 / 68.3 | 14.0h | 42.2 / 42.2 | 56.4 / 56.0 |
| 32 | RaSA | 83.9M | 11.5h | 59.5 / 56.2 | 72.5 / 71.4 | 12.5h | 55.7 / 55.7 | 70.0 / 65.7 |

Table 2: Performance on the mathematical reasoning task (i.e., MATH). We also present the extra parameters (# Extra Param.) used by AdaLoRA and PriLoRA for estimating parameter importance.

| r | Method | # Trainable Param. | # Extra Param. | Llama-3.1-8B Time | Llama-3.1-8B Acc (Best / Last) | Mistral-0.3-7B Time | Mistral-0.3-7B Acc (Best / Last) |
|---|---|---|---|---|---|---|---|
| – | FFT | 7-8B | – | – | 35.5 / 34.6 | – | 28.1 / 26.6 |
| 1024 | VeRA | 1.6M | – | 22.4h | 27.4 / 25.6 | 17.6h | 19.9 / 19.4 |
| 8 | LoRA | 21.0M | – | 20.1h | 28.3 / 26.7 | 14.6h | 20.1 / 19.2 |
| 8 | MoRA | 21.0M | – | 23.6h | 29.2 / 28.9 | 19.6h | 21.4 / 21.4 |
| 8 | OLoRA | 21.0M | – | 29.3h | 28.4 / 27.8 | 31.1h | 22.5 / 22.5 |
| 8 | AdaLoRA | 31.5M | 63.0M | 27.9h | 30.4 / 28.9 | 17.7h | 22.5 / 21.6 |
| 8 | PriLoRA | 21.3M | 10.7M | 24.1h | 28.2 / 29.5 | 15.9h | 22.3 / 22.3 |
| 8 | RaSA | 21.0M | – | 23.8h | 30.3 / 29.1 | 15.9h | 24.3 / 23.8 |
| 16 | LoRA | 41.9M | – | 20.2h | 28.8 / 27.1 | 14.7h | 20.9 / 19.5 |
| 16 | MoRA | 41.9M | – | 24.5h | 30.2 / 26.5 | 21.1h | 20.5 / 19.4 |
| 16 | OLoRA | 41.9M | – | 29.6h | 28.6 / 28.4 | 35.9h | 22.5 / 22.2 |
| 16 | AdaLoRA | 62.9M | 125.8M | 28.1h | 29.7 / 29.4 | 17.6h | 23.5 / 23.2 |
| 16 | PriLoRA | 42.6M | 21.3M | 24.5h | 28.6 / 29.4 | 15.9h | 22.7 / 21.6 |
| 16 | RaSA | 42.0M | – | 24.4h | 31.4 / 29.8 | 15.8h | 25.9 / 25.1 |
| 32 | LoRA | 83.9M | – | 20.6h | 28.9 / 27.2 | 14.8h | 21.8 / 20.4 |
| 32 | MoRA | 83.9M | – | 24.7h | 28.6 / 25.8 | 20.5h | 18.4 / 18.4 |
| 32 | OLoRA | 83.9M | – | 29.9h | 29.0 / 28.7 | 31.3h | 23.8 / 23.5 |
| 32 | AdaLoRA | 125.9M | 251.8M | 28.7h | 30.2 / 29.4 | 17.9h | 23.5 / 23.5 |
| 32 | PriLoRA | 85.2M | 42.6M | 24.5h | 30.0 / 28.8 | 16.4h | 24.8 / 24.2 |
| 32 | RaSA | 83.9M | – | 24.3h | 31.7 / 29.6 | 16.5h | 26.1 / 25.1 |
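The trainable-parameter counts in Tables 1 and 2 follow directly from the adapted module shapes, roughly r × (d_in + d_out) per adapted linear module (the diagonal D_i of RaSA adds a negligible amount on top). A quick sanity check under assumed Llama-3.1-8B projection shapes (32 layers, hidden size 4096, grouped-query attention with 1024-dimensional key/value projections, 14336-dimensional MLP):

```python
# Assumed Llama-3.1-8B per-layer projection shapes: (in_features, out_features)
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}
num_layers, r = 32, 8

# LoRA and RaSA share the same adapter budget of r * (d_in + d_out) per adapted module
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
print(num_layers * per_layer / 1e6)   # ~21.0M, matching the r = 8 rows in Tables 1 and 2
```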
Mathematical Reasoning: We add FFT and further PEFT baselines for the math task:
- OLoRA (Büyükakyüz, 2024), which uses QR decomposition to initialize the LoRA adapters.
- AdaLoRA (Zhang et al., 2023), which dynamically allocates ranks among parameter matrices.
- PriLoRA (Benedek & Wolf, 2024), which allocates a different rank to each layer in an increasing manner and performs pruning throughout the training process.

The math results are presented in Table 2. FFT outperforms all PEFT methods, aligning with the findings from Biderman et al. (2024). Considering both training cost and accuracy, RaSA demonstrates consistent superiority over all PEFT baselines across various configurations. Mistral notably falls short of its Llama counterpart, exhibiting a performance deficit of approximately 8%, which RaSA narrows to around 5%. We also observe that directly increasing the hyper-parameter r yields only marginal performance gains, but at the cost of doubling the number of training parameters. In contrast, RaSA greatly outperforms LoRA with the same or even fewer parameters (RaSA at r = 8 surpasses LoRA at r = 32). This supports the notion introduced in §1 that LoRA's parameters are underutilized. RaSA, on the other hand, improves the utilization of parameters by sharing them across layers.

4.3 RASA LEARNS MORE AND FORGETS LESS THAN LORA

Figure 4: RaSA learns more and faster than LoRA. Training-loss curves over 8 epochs for LoRA and RaSA (r = 8, 16, 32) on (a) Llama-3.1-8B | Code, (b) Mistral-0.3-7B | Code, (c) Llama-3.1-8B | Math, and (d) Mistral-0.3-7B | Math. RaSA consistently outperforms LoRA with the same rank across models and tasks.

RaSA learns more and faster than LoRA: Figure 4 illustrates the training curves of the fine-tuning process. Generally, the training losses for both RaSA and LoRA decrease as the rank increases. Notably, RaSA consistently outperforms its LoRA counterpart in terms of both learning effectiveness and efficiency across all cases, aligning with our empirical analysis presented in §3.2. These results collectively underscore the efficacy and universal applicability of the proposed RaSA method. One interesting finding is that RaSA is especially effective for Mistral: RaSA achieves comparable or superior training outcomes to LoRA at a significantly lower rank of 8, compared to LoRA's rank of 32.

Figure 5: RaSA forgets less than LoRA. The y-axis shows the average prediction accuracy on three benchmarks used to evaluate forgetting, over 8 epochs of fine-tuning, for (a) Llama-3.1-8B | Code, (b) Mistral-0.3-7B | Code, (c) Llama-3.1-8B | Math, and (d) Mistral-0.3-7B | Math. Higher prediction accuracy denotes less forgetting.

RaSA forgets less than LoRA: We follow Biderman et al. (2024) and quantify forgetting as the degradation of base-model capabilities. Specifically, we calculate prediction accuracies on the following three benchmarks: (1) HellaSwag (Zellers et al., 2019): inferring the most plausible continuation of everyday events (70K problems); (2) WinoGrande (Sakaguchi et al., 2019): assessing commonsense reasoning (44K problems); (3) ARC-Challenge (Clark et al., 2018): complex reasoning and understanding of scientific concepts (7.8K problems). A sketch of this evaluation loop is given below.
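A minimal sketch of how such held-out accuracies could be collected with the LM Evaluation Harness (Gao et al., 2024) is shown below. The API surface and result keys vary across harness versions, so the call and key names are assumptions rather than the authors' actual evaluation script.

```python
import lm_eval  # EleutherAI lm-evaluation-harness (v0.4-style API; an assumption)

FORGETTING_TASKS = ["hellaswag", "winogrande", "arc_challenge"]

def forgetting_accuracy(model_path: str) -> float:
    """Average accuracy on the three held-out benchmarks used to measure forgetting."""
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path},dtype=bfloat16",
        tasks=FORGETTING_TASKS,
    )
    # result keys differ across harness versions (e.g. "acc,none" vs. "acc")
    accs = [results["results"][t].get("acc,none", results["results"][t].get("acc"))
            for t in FORGETTING_TASKS]
    return sum(accs) / len(accs)

# evaluated after each fine-tuning epoch to trace a forgetting curve as in Figure 5
```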
Figure 5 presents the forgetting curves averaged over the three benchmarks, clearly showing that RaSA experiences less forgetting than LoRA, and that RaSA's forgetting is less affected by rank changes than LoRA's. The difference in performance between r = 8 and r = 32 at epoch 8 is, on average, 2.83% for LoRA and 0.75% for RaSA, indicating a smaller performance variation for RaSA. LoRA is more prone to forgetting in math than in code, while RaSA displays greater domain robustness. Specifically, with r = 32, Mistral scores 58.7% in code and 52.8% in math using LoRA, whereas RaSA shows a reduced performance difference between the code (64.6%) and math (66.5%) domains, underscoring RaSA's robustness.

4.4 SCALING PERFORMANCE ANALYSIS

This section investigates the scaling characteristics of the RaSA approach by varying both the model size and the dataset size to assess its robustness.

Figure 6: MATH performance (accuracy) of scaled models, including Mixtral-8×7B, for LoRA vs. RaSA.

Model Scaling: Initially, we evaluate RaSA's performance at an expanded scale by examining larger models, namely Llama-3.1-70B and Mixtral-8×7B. Due to computational constraints, we employ a rank of r = 4 for both models, specifically in the domain of mathematical reasoning. For an equitable comparison, we present results for the smaller models configured with r = 4. Each model is trained over 2 epochs using both LoRA and RaSA, with performance measured in terms of Last accuracy. The results in Figure 6 reveal that increasing the model size substantially enhances performance for both LoRA and RaSA, across all model types. Noteworthy is the performance of the larger Llama and Mistral models using LoRA, achieving MATH accuracies of 40.4% and 32.6%, respectively. These results significantly exceed those of their smaller counterparts under identical configurations and even surpass the outcomes of variants with extended training (i.e., 8 epochs). Notably, RaSA consistently outperforms LoRA on these larger-scale models, underscoring RaSA's robustness in handling models of increased scale.

Figure 7: MATH performance (accuracy) with scaled training data (25%, 50%, 100%) for LoRA vs. RaSA.

Data Scaling: Subsequently, we explore the influence of training data size on RaSA's performance. We experiment with the Llama-3.1-8B model, applying a rank of r = 8 to facilitate efficient training. The examination involves randomly sampling 25% and 50% of the instances from the SFT data for the mathematical reasoning task. Each model is trained over 8 epochs, with performance assessed through the Last accuracy. As illustrated in Figure 7, LoRA's performance appears relatively insensitive to the volume of training data, with no noticeable improvement when the data is increased from 25% to 50%. This finding is consistent with the results in Biderman et al. (2024). In contrast, RaSA demonstrates a remarkable ability to enhance performance as the training data grows. Impressively, with just 25% of the training data, RaSA outperforms LoRA even when the latter utilizes the entire dataset, highlighting RaSA's exceptional efficiency in leveraging training data for performance improvement.

5 RELATED WORK

Parameter-Efficient Fine-Tuning (PEFT): PEFT methods aim to minimize the number of trainable parameters needed for fine-tuning large models, thus reducing memory and computational requirements.
Pioneering methods include adapter-based (Houlsby et al., 2019) and prompt-based (Lester et al., 2021; Li & Liang, 2021) approaches that introduce additional tunable adapters or prefix tokens to enable efficient fine-tuning while keeping the original model parameters fixed. However, these approaches can slow down inference due to the extra components introduced. LoRA overcomes this drawback by introducing low-rank matrices directly into the weight update during fine-tuning, effectively reducing trainable parameters without increasing inference latency. Due to its robust performance, LoRA and its variants have been widely used to adapt LLMs to specific tasks (Yu et al., 2024; Xu et al., 2023; Biderman et al., 2024; Chen et al., 2024; Meng et al., 2024; Liu et al., 2024). Benedek & Wolf (2024) and Zhang et al. (2023) show that the number of ranks required by each parameter matrix across the model's layers is not uniform. They therefore propose dynamically assigning ranks based on the importance of parameters during training. These rank-allocating approaches typically involve real-time estimation of parameter importance and pruning during the training process. In contrast, RaSA uses a shared rank pool combined with layer-specific weighting, eliminating the need for complex importance estimation or pruning. Biderman et al. (2024) conduct a comprehensive empirical study on LoRA and reveal that, while LoRA still lags behind FFT, it exhibits less catastrophic forgetting. We show that our proposed RaSA forgets even less than LoRA, and learns more and faster.

Parameter Redundancy of LoRA: Although LoRA has significantly reduced the number of trainable parameters, recent research suggests that it is possible to further minimize these parameters without compromising performance. Kopiczko et al. (2024) achieve a 99% reduction in LoRA parameters by fully sharing a pair of low-rank, frozen random matrices across all layers, adjusted with learnable scaling vectors. Koohpayegani et al. (2024) propose learning linear combinations of a set of random matrix bases, while Li et al. (2024) push this further by replacing the matrix bases with a vector bank. Song et al. (2024) and Renduchintala et al. (2024) explore the effects of different sharing and selective fine-tuning strategies. By sharing parameter spaces, Brüel-Gabrielsson et al. (2024) compress 1,000 LoRAs trained on different tasks, enabling more efficient serving. These findings collectively suggest that LoRA's parameters have not been fully utilized and that different LoRAs exhibit similarities across layers, modules, and even different tasks. Rather than focusing on extreme parameter reduction, this work aims to maintain the same parameter count while exploring how inter-layer sharing can enhance parameter utilization. We theoretically and empirically demonstrate that sharing ranks across layers leads to lower reconstruction error and thus better expressive capacity.

6 CONCLUSION

In this study, we introduced RaSA, a novel extension of LoRA based on partial rank sharing across layers. RaSA maintains LoRA's parameter efficiency and seamless integration into existing models while substantially increasing the model's expressiveness. Through theoretical analysis, we established RaSA's superior capability in matrix reconstruction compared to traditional LoRA, underpinning its improved performance in downstream tasks.
Empirical results on complex tasks such as code generation and mathematical reasoning demonstrate its effectiveness over LoRA in high-demand scenarios. Future research may explore further optimization of rank-sharing schemes and the potential of RaSA in a broader range of applications, paving the way for the development of even more powerful and efficient PEFT strategies.

ACKNOWLEDGMENTS

This paper is supported by the General Program of the National Natural Science Foundation of China (62176153), the CCF-Tencent Rhino-Bird Open Research Fund (RAGR20240107), and the Tencent AI Lab Fund (RBFR2024002). The authors gratefully acknowledge the financial support from these funding agencies.

REFERENCES

Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.

Nadav Benedek and Lior Wolf. PRILoRA: Pruned and rank-increasing low-rank adaptation. In Findings of the Association for Computational Linguistics: EACL 2024, pp. 252-263, 2024. URL https://aclanthology.org/2024.findings-eacl.18.

Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024.

Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, and Justin Solomon. Compress then serve: Serving thousands of LoRA adapters with little overhead, 2024. URL https://arxiv.org/abs/2407.00066.

Kerim Büyükakyüz. OLoRA: Orthonormal low-rank adaptation of large language models. arXiv preprint arXiv:2406.01775, 2024.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. LongLoRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=6PmJoRfdaK.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211-218, 1936.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, July 2024. URL https://zenodo.org/records/12608602.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034, 2015. doi: 10.1109/ICCV.2015.123.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790-2799. PMLR, 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, et al. MoRA: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

Soroush Abbasi Koohpayegani, Navaneet K L, Parsa Nooralinejad, Soheil Kolouri, and Hamed Pirsiavash. NOLA: Compressing LoRA using linear combination of random basis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TjfXcDgvzk.

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NjNfLdxr3A.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045-3059, 2021. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation.
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582-4597, 2021. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.

Yang Li, Shaobo Han, and Shihao Ji. VB-LoRA: Extreme parameter efficient fine-tuning with vector banks. arXiv preprint arXiv:2405.15179, 2024.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with Evol-Instruct. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=UnUwSIgK5W.

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.

Adithya Renduchintala, Tugrul Konuk, and Oleksii Kuchaiev. Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8694-8705, 2024. doi: 10.18653/v1/2024.naacl-long.481. URL https://aclanthology.org/2024.naacl-long.481.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.

Yurun Song, Junchen Zhao, Ian G Harris, and Sangeetha Abdu Jyothi. ShareLoRA: Parameter efficient and robust large language model fine-tuning via shared low-rank adaptation. arXiv preprint arXiv:2406.10785, 2024.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 52632-52657. PMLR, 2024. URL https://proceedings.mlr.press/v235/wei24h.html.

Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. A paradigm shift in machine translation: Boosting translation performance of large language models, 2023.

Shih-yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen.
DoRA: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=3d5CIRG1n2.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=N8N0hgNDRt.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791-4800, 2019. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472.

Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=likXVjmh3E.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=lq62uWRJjiY.

A COORDINATE DESCENT EXPERIMENT

This section details the derivation of the coordinate descent experiment discussed in §3.2, inspired by Brüel-Gabrielsson et al. (2024). Given the parameters of RaSA, {B̃_i, Ã_i, B_S, D_i, A_S}_{i∈[L]}, the reconstruction error of RaSA is defined as:

$$E = \sum_{i=1}^{L} \big\lVert M_i - (\tilde{B}_i \tilde{A}_i + B_S D_i A_S) \big\rVert_F^2. \tag{17}$$

Clearly, B̃_i and Ã_i are independent across layers. By applying the Eckart–Young–Mirsky theorem (Eckart & Young, 1936), we first compute the SVD of the residual matrix:

$$\mathrm{SVD}(M_i - B_S D_i A_S) = U \Sigma V^\top. \tag{18}$$

Therefore, the update rules for B̃_i and Ã_i are:

$$\tilde{B}_i = U_{[:, :r-k]} \, \Sigma^{\frac{1}{2}}_{[:r-k, :r-k]}, \qquad \tilde{A}_i = \Sigma^{\frac{1}{2}}_{[:r-k, :r-k]} \, V_{[:, :r-k]}^\top. \tag{19}$$

Let the low-rank decomposition of M_i be M_i = B̂_i Â_i, where B̂_i ∈ R^{b×R} and Â_i ∈ R^{R×a}. Next, we compute the following gradients:

$$\nabla_{B_S} E = -\sum_{i=1}^{L} 2\big(\hat{B}_i \hat{A}_i - \tilde{B}_i \tilde{A}_i - B_S D_i A_S\big) A_S^\top D_i^\top, \tag{20}$$

$$\nabla_{A_S} E = -\sum_{i=1}^{L} 2\, D_i^\top B_S^\top \big(\hat{B}_i \hat{A}_i - \tilde{B}_i \tilde{A}_i - B_S D_i A_S\big), \tag{21}$$

$$\nabla_{D_i} E = -2\, B_S^\top \big(\hat{B}_i \hat{A}_i - \tilde{B}_i \tilde{A}_i - B_S D_i A_S\big) A_S^\top, \tag{22}$$

$$\nabla_{\mathrm{diag}(D_i)} E = \mathrm{diag}\big(\nabla_{D_i} E\big). \tag{23}$$

By setting these gradients to zero, we obtain the following update rules:

$$B_S = \Big[\sum_{i=1}^{L} \big(\hat{B}_i \hat{A}_i - \tilde{B}_i \tilde{A}_i\big) A_S^\top D_i^\top\Big] \Big[\sum_{i=1}^{L} D_i A_S A_S^\top D_i^\top\Big]^{-1}, \tag{24}$$

$$A_S = \Big[\sum_{i=1}^{L} D_i^\top B_S^\top B_S D_i\Big]^{-1} \Big[\sum_{i=1}^{L} D_i^\top B_S^\top \big(\hat{B}_i \hat{A}_i - \tilde{B}_i \tilde{A}_i\big)\Big], \tag{25}$$

$$\mathrm{diag}(D_i) = \Big[\big(B_S^\top B_S\big) \odot \big(A_S A_S^\top\big)\Big]^{-1} \mathrm{diag}\Big(B_S^\top \big(\hat{B}_i \hat{A}_i - \tilde{B}_i \tilde{A}_i\big) A_S^\top\Big). \tag{26}$$

In coordinate descent, we iteratively apply Equations (19) and (24) to (26) until convergence.
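A compact NumPy sketch of these update rules is given below. The matrix sizes and synthetic M_i are illustrative; the residual (M_i − B̃_iÃ_i) is formed directly rather than through the factorization B̂_iÂ_i, and Equation (26) is applied in its Hadamard-product form.

```python
import numpy as np

rng = np.random.default_rng(0)
L, b, a, R, r, k = 4, 64, 64, 32, 8, 1
Ms = [rng.standard_normal((b, R)) @ rng.standard_normal((R, a)) for _ in range(L)]

B = [np.zeros((b, r - k)) for _ in range(L)]               # layer-specific B̃_i
A = [np.zeros((r - k, a)) for _ in range(L)]               # layer-specific Ã_i
BS = rng.standard_normal((b, L * k)) * 0.1                 # shared pool B_S
AS = rng.standard_normal((L * k, a)) * 0.1                 # shared pool A_S
D = [np.ones(L * k) for _ in range(L)]                     # diag(D_i)

for _ in range(50):
    # Eq. (19): best rank-(r-k) fit to the residual M_i - B_S D_i A_S
    for i in range(L):
        U, s, Vt = np.linalg.svd(Ms[i] - BS @ np.diag(D[i]) @ AS, full_matrices=False)
        B[i] = U[:, :r - k] @ np.diag(np.sqrt(s[:r - k]))
        A[i] = np.diag(np.sqrt(s[:r - k])) @ Vt[:r - k, :]
    Rs = [Ms[i] - B[i] @ A[i] for i in range(L)]            # residuals for the shared part
    # Eq. (24): closed-form update of B_S
    num = sum(Rs[i] @ AS.T @ np.diag(D[i]) for i in range(L))
    den = sum(np.diag(D[i]) @ AS @ AS.T @ np.diag(D[i]) for i in range(L))
    BS = num @ np.linalg.inv(den)
    # Eq. (25): closed-form update of A_S
    den = sum(np.diag(D[i]) @ BS.T @ BS @ np.diag(D[i]) for i in range(L))
    num = sum(np.diag(D[i]) @ BS.T @ Rs[i] for i in range(L))
    AS = np.linalg.inv(den) @ num
    # Eq. (26): per-layer diagonal weights via the Hadamard-product normal equations
    G = (BS.T @ BS) * (AS @ AS.T)
    for i in range(L):
        D[i] = np.linalg.solve(G, np.diag(BS.T @ Rs[i] @ AS.T))

err = sum(np.linalg.norm(Ms[i] - (B[i] @ A[i] + BS @ np.diag(D[i]) @ AS), "fro") ** 2
          for i in range(L))
print(err)   # decreases monotonically over iterations, as in Figure 2
```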
B PROMPT TEMPLATES

    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    {QUESTION}

    ### Response: Let's think step by step.

Figure 8: Evaluation prompt for mathematical reasoning.

    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    {QUESTION}

    ### Response:
    {IMPORT SECTION}
    {FUNCTION SIGNATURE}
    {DOCSTRING}

Figure 9: Evaluation prompt for code generation.

C TRAINING AND DATA DETAILS

Training: We mostly aligned our training configurations with the optimal configurations from Biderman et al. (2024). For LoRA, we used the decoupled LionW optimizer with a batch size of 192, training for 8 epochs with a learning rate of 5e-4. A cosine learning rate scheduler was applied, with the first 10% of training steps used for warmup, and weight decay set to zero. Training and evaluation were conducted using bfloat16 precision. While Biderman et al. (2024) set α = 32 for both the math and code tasks, we reduced α to 8 for the math task due to convergence issues observed with the Mistral model when using α = 32. RaSA training fully inherits all hyper-parameters from LoRA training. For MoRA, we used a learning rate of 3e-4, as reported in the original work (Jiang et al., 2024). For VeRA, following the original paper, we set the learning rate to 10 times that of LoRA, resulting in 5e-3 (Kopiczko et al., 2024). All experiments on the 7-8B models were conducted on one node with 8 A100-40G GPUs. For the 70B and MoE models, we used 8 nodes.

Data: During training, we grouped data by length, which significantly accelerated the training process. All math training data ends with "The answer is: {ANSWER}", helping answer extraction during evaluation.
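As an illustration of how that trailing pattern supports evaluation, the sketch below extracts the final answer and checks it against a reference with sympy. The regular expression and the equivalence check are illustrative assumptions, not the evaluation harness's exact implementation.

```python
import re
import sympy

def extract_answer(generation: str) -> str | None:
    """Pull the answer from a completion ending in 'The answer is: {ANSWER}'."""
    match = re.search(r"The answer is:\s*(.+?)\s*$", generation.strip())
    return match.group(1) if match else None

def is_correct(prediction: str, reference: str) -> bool:
    """Symbolic equivalence check with sympy, falling back to string comparison."""
    try:
        return sympy.simplify(sympy.sympify(prediction) - sympy.sympify(reference)) == 0
    except (sympy.SympifyError, TypeError):
        return prediction == reference

print(is_correct(extract_answer("... The answer is: 3/6"), "1/2"))   # True
```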