# LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Published as a conference paper at ICLR 2025

Zhengbo Wang1,2 Jian Liang2,3 Ran He2,3 Zilei Wang1 Tieniu Tan2,4
1 University of Science and Technology of China
2 NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
3 School of Artificial Intelligence, University of Chinese Academy of Sciences
4 Nanjing University
zhengbowang@mail.ustc.edu.cn, liangjian92@gmail.com

ABSTRACT

Low-rank adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning of foundation models. Despite its computational efficiency, LoRA still yields inferior performance compared to full fine-tuning. In this paper, we first uncover a fundamental connection between the optimization processes of LoRA and full fine-tuning: using LoRA for optimization is mathematically equivalent to full fine-tuning using a low-rank gradient for parameter updates, and this low-rank gradient can be expressed in terms of the gradients of the two low-rank matrices in LoRA. Leveraging this insight, we introduce LoRA-Pro, a method that enhances LoRA's performance by strategically adjusting the gradients of these low-rank matrices. This adjustment allows the low-rank gradient to more accurately approximate the full fine-tuning gradient, thereby narrowing the performance gap between LoRA and full fine-tuning. Furthermore, we theoretically derive the optimal solutions for adjusting the gradients of the low-rank matrices, applying them during fine-tuning in LoRA-Pro. We conduct extensive experiments across natural language understanding, dialogue generation, mathematical reasoning, code generation, and image classification tasks, demonstrating that LoRA-Pro substantially improves LoRA's performance, effectively narrowing the gap with full fine-tuning. Our code is publicly available at https://github.com/mrflogs/LoRA-Pro.

1 INTRODUCTION

Foundation models (Radford et al., 2021; Brown et al., 2020; Achiam et al., 2023; Kirillov et al., 2023; Rombach et al., 2022; Touvron et al., 2023) have become the cornerstone of modern deep learning. By undergoing pre-training on massive datasets, these models typically exhibit excellent generalization and versatility. Remarkably, some foundation models even demonstrate emergent properties (Hoffmann et al., 2022; Kaplan et al., 2020). Due to these advantages, foundation models have been widely applied to various downstream applications. Nevertheless, they still require additional fine-tuning when applied to downstream tasks, where the huge parameter size of foundation models results in high costs at this stage. To address this issue, recent research has focused on parameter-efficient fine-tuning (PEFT) methods (Hu et al., 2022; Houlsby et al., 2019; Lester et al., 2021). PEFT methods reduce the fine-tuning cost by keeping the foundation models frozen and only fine-tuning small, additional lightweight adapters. With the majority of parameters frozen, PEFT enables faster fine-tuning and requires fewer resources. Low-rank adaptation (Hu et al., 2022), also known as LoRA, is one of the most famous PEFT methods and has been widely adopted across various domains. Inspired by previous works (Aghajanyan et al., 2021; Li et al., 2018), LoRA hypothesizes that the changes in weights during model adaptation exhibit a low-rank structure.
To capture this, LoRA re-parameterizes these changes by expressing them as the product of two low-rank matrices: $W = W_0 + \Delta W \approx W_0 + sBA$, where $s$ is a scaling factor, and $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ are low-rank matrices with rank $r \ll \min(m, n)$. LoRA reduces the number of trainable parameters from $mn$ to $r(m + n)$, thereby decreasing the cost of fine-tuning. However, despite its efficiency, LoRA's fine-tuning performance often falls short compared to full fine-tuning (Hu et al., 2022; Liu et al., 2024; Ding et al., 2023).

In this paper, we propose a novel PEFT method, LoRA-Pro, aimed at bridging the gap between LoRA and full fine-tuning. To begin with, we uncover a crucial connection between the optimization processes of LoRA and full fine-tuning: using LoRA for optimization is equivalent to full fine-tuning using a low-rank gradient for parameter updates. In LoRA, we discover that the change in weight $W$ is connected to the changes in matrices $A$ and $B$, expressed as $dW = \frac{\partial W}{\partial A}^T dA + \frac{\partial W}{\partial B}^T dB$. This relationship implies that updating matrices $A$ and $B$ with gradients $g_A$ and $g_B$ is equivalent to updating $W$ with a low-rank equivalent gradient $\tilde{g}$ in full fine-tuning, where:

$$\tilde{g} = sBg_A + sg_BA. \quad (1)$$

Leveraging this insight, our goal is to bridge LoRA's gap with full fine-tuning by minimizing the discrepancy between the low-rank equivalent gradient $\tilde{g}$ and the full fine-tuning gradient $g$ through adjusting the gradients of matrices $A$ and $B$, i.e., $\min_{g_A, g_B} \|\tilde{g} - g\|_F^2$. Furthermore, we theoretically demonstrate that this optimization problem admits an optimal closed-form solution, as shown in Theorem 2.1. Notably, the optimal gradients for the low-rank matrices do not explicitly depend on the full fine-tuning gradient. Our main contributions are summarized as follows:

- We first uncover a crucial connection between LoRA and full fine-tuning in the optimization process: optimizing with LoRA is mathematically equivalent to full fine-tuning using a low-rank gradient for updating.
- We propose a novel PEFT method called LoRA-Pro. Our approach minimizes the distance between the true gradient and the low-rank gradient by adjusting the gradients of matrices A and B. We theoretically derive the optimal gradients and optimize using these gradients.
- Extensive experiments across tasks in natural language understanding, dialogue generation, mathematical reasoning, code generation, and image classification demonstrate the effectiveness of our method.

2 METHOD

In this section, we begin by revisiting LoRA (Hu et al., 2022) in Section 2.1. Following this, we conduct a comparison between LoRA and full fine-tuning and point out their connection in the optimization process in Section 2.2. Finally, in Section 2.3, we introduce LoRA-Pro as a solution to bridge the gap between LoRA and full fine-tuning.

2.1 REVISITING LOW-RANK ADAPTATION

First of all, let's dive back into Low-Rank Adaptation (LoRA) (Hu et al., 2022). LoRA's core idea revolves around recognizing the low-rank structure of the change matrix $\Delta W$ in the standard fine-tuning process. This insight allows LoRA (Hu et al., 2022) to re-parameterize the change matrix into the product of two low-rank matrices,

$$W = W_0 + \Delta W \approx W_0 + sBA. \quad (2)$$

Here, $W_0 \in \mathbb{R}^{m \times n}$ represents the pre-trained weight matrix, $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$ are the low-rank matrices, and $s$ is a scaling factor. For LoRA (Hu et al., 2022), $s = \frac{\alpha}{r}$, while for rsLoRA (Kalajdzievski, 2023), $s = \frac{\alpha}{\sqrt{r}}$. Here, $\alpha$ is a hyper-parameter and $r \ll \min(m, n)$ denotes the rank.
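For concreteness, the re-parameterization in Equation (2) can be sketched in a few lines of PyTorch (a minimal illustration of ours, not the paper's released implementation; the class name and initialization choices are our own):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer: W = W0 + s * B @ A, with W0 frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze W0 (and bias)
        m, n = base.out_features, base.in_features
        self.s = alpha / r                               # s = alpha / r (alpha / sqrt(r) for rsLoRA)
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)  # A in R^{r x n}
        self.B = nn.Parameter(torch.zeros(m, r))         # B in R^{m x r}, zero-initialized

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + s * x (BA)^T
        return self.base(x) + self.s * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # r*(m+n) = 8*(4096+4096) = 65,536, versus m*n ≈ 16.8M for full fine-tuning
```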
Consequently, LoRA significantly reduces the number of fine-tuning parameters from $mn$ to $r(m + n)$, thereby decreasing the computational cost of fine-tuning.

2.2 LORA V.S. FULL FINE-TUNING

Despite its widespread application across various domains, LoRA's performance still falls short when compared to full fine-tuning. In this part, we compare LoRA and full fine-tuning in the optimization process. Then, we demonstrate that optimizing using LoRA is equivalent to using a low-rank gradient in full fine-tuning for updating the parameters.

Full fine-tuning. In full fine-tuning, we use the differential to analyze the relationship between changes in the loss and changes in the weights:

$$dL = \langle \frac{\partial L}{\partial W}, dW \rangle_F, \quad (3)$$

where $dW$ and $dL$ denote the changes of the parameter $W$ and the loss $L$, and $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product. To minimize the loss function, we typically set $dW = -\frac{\partial L}{\partial W} \triangleq -g$ (omitting the learning rate for simplicity), which results in $dL = -\|g\|_F^2 \le 0$.

Low-rank adaptation. In LoRA optimization, given that $W = W_0 + sBA$, we compute the differential using the chain rule:

$$dL = \langle \frac{\partial L}{\partial A}, dA \rangle_F + \langle \frac{\partial L}{\partial B}, dB \rangle_F. \quad (4)$$

Similarly, LoRA sets $dA = -\frac{\partial L}{\partial A} \triangleq -g_A^{lora}$ and $dB = -\frac{\partial L}{\partial B} \triangleq -g_B^{lora}$, and thus $dL = -\|g_A^{lora}\|_F^2 - \|g_B^{lora}\|_F^2 \le 0$. Moreover, employing the chain rule, we derive:

$$g_A^{lora} = \frac{\partial L}{\partial A} = sB^Tg, \quad g_B^{lora} = \frac{\partial L}{\partial B} = sgA^T. \quad (5)$$

Why LoRA performs worse than full fine-tuning. With Equations (3) and (4), we observe a critical connection between full fine-tuning and LoRA in the optimization process. In LoRA, changes in matrices $A$ and $B$ are inherently linked to changes in matrix $W$, i.e., $dW = \frac{\partial W}{\partial A}^T dA + \frac{\partial W}{\partial B}^T dB$. This indicates that updating $A$ and $B$ with gradients $g_A$ and $g_B$ is equivalent to performing full fine-tuning on $W$ via the following update:

$$dW = \frac{\partial W}{\partial A}^T dA + \frac{\partial W}{\partial B}^T dB = -(sBg_A + sg_BA). \quad (6)$$

Equation (6) reveals that LoRA optimization is equivalent to full fine-tuning using a low-rank gradient $\tilde{g} = sBg_A + sg_BA$ (whose rank is at most $2r$; we provide the proof in Appendix B.1) for optimization. Consequently, the performance gap between LoRA and full fine-tuning may stem from differences between $\tilde{g}$ and the full gradient $g$. The low-rank gradient $\tilde{g}$ may lose important information contained in $g$, leading to distinct optimization trajectories and ultimately causing LoRA to converge to a sub-optimal solution.
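The identities in Equations (5) and (6) are easy to verify numerically. The following sketch (our own toy example with a quadratic loss, not from the paper) checks them via autograd, along with the rank bound on the low-rank gradient:

```python
import torch

torch.manual_seed(0)
m, n, r, s = 32, 48, 4, 2.0
W0, T = torch.randn(m, n), torch.randn(m, n)   # T defines a toy quadratic loss
A = torch.randn(r, n, requires_grad=True)
B = torch.randn(m, r, requires_grad=True)

loss = 0.5 * ((W0 + s * B @ A - T) ** 2).sum()
loss.backward()

with torch.no_grad():
    g = W0 + s * B @ A - T                        # dL/dW for this loss is g = W - T
    print(torch.allclose(A.grad, s * B.T @ g))    # Eq. (5): g_A^lora = s B^T g
    print(torch.allclose(B.grad, s * g @ A.T))    # Eq. (5): g_B^lora = s g A^T
    g_tilde = s * B @ A.grad + s * B.grad @ A     # Eq. (6): equivalent gradient
    print(int(torch.linalg.matrix_rank(g_tilde)) <= 2 * r)  # rank at most 2r
```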
2.3 LOW-RANK ADAPTATION WITH EQUIVALENT GRADIENT

Definition 1 (Equivalent Gradient). In the context of LoRA optimization, we define the equivalent gradient as

$$\tilde{g} \triangleq sBg_A + sg_BA,$$

where $s$ is the scaling factor, and $g_A$ and $g_B$ are gradients with respect to $A$ and $B$, respectively.

In this part, we introduce our LoRA-Pro method, which bridges the performance gap by minimizing the discrepancy between the gradients. For convenience, we define the concept of the equivalent gradient in Definition 1. The equivalent gradient describes the virtual low-rank gradient of the matrix $W$ in the LoRA optimization process, despite $W$ not being a trainable parameter.

Figure 1: Illustration of LoRA-Pro. LoRA (Hu et al., 2022) reduces the trainable parameters by re-parameterizing the weight into the product of two low-rank matrices, i.e., $W = W_0 + sBA$. We have discovered a connection between the optimization processes of full fine-tuning and LoRA: updating matrices $B$ and $A$ using gradients $g_B$ and $g_A$ is equivalent to updating weight $W$ using a virtual low-rank gradient $\tilde{g} = sBg_A + sg_BA$. Therefore, in LoRA-Pro, we aim to adjust gradients $g_B$ and $g_A$ to minimize the distance between the equivalent gradient $\tilde{g}$ and the full fine-tuning gradient $g$, thereby reducing their performance gap. In Theorem 2.1, we provide the optimal update gradients, and in Appendix C, we present the pseudo-code for the optimization algorithm.

To narrow the performance gap, our goal is to carefully adjust the gradients $g_A$ and $g_B$ of matrices $A$ and $B$ to minimize the distance between the equivalent gradient $\tilde{g}$ and the full gradient $g$ in full fine-tuning. Hence, our objective is:

$$\min_{g_A, g_B} \|\tilde{g} - g\|_F^2 \quad \text{s.t.} \quad \tilde{g} = sBg_A + sg_BA, \; dL \le 0. \quad (7)$$

Here, $\|\cdot\|_F$ denotes the Frobenius norm, and $dL$ denotes the change in loss when updating with gradients $g_A$ and $g_B$. The objective aims to minimize the distance between the gradients while ensuring a decrease in loss when using the solutions for $g_A$ and $g_B$.

Closed-form solution. Fortunately, we prove that the optimization problem (7) admits an optimal closed-form solution, as stated in Theorem 2.1. Additionally, an interesting observation arises from Theorem 2.1: while the full gradient $g$ serves as the ground truth in the objective, it does not explicitly appear in the closed-form solution. Instead, the closed-form solution for the optimal gradients can be expressed in terms of the gradients of LoRA. This allows for an efficient gradient adjustment process, where we backpropagate using standard LoRA and adjust the gradients of matrices $A$ and $B$ based on the closed-form solution presented in Theorem 2.1. We provide detailed algorithms in Appendix C.

Theorem 2.1. Assume matrices $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$ are both full rank. For the objective $\min_{g_A, g_B} \|\tilde{g} - g\|_F^2$, the optimal solutions are given by:

$$g_A = \frac{1}{s}(B^TB)^{-1}B^Tg + XA = \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} + XA, \quad (8)$$
$$g_B = \frac{1}{s}[I - B(B^TB)^{-1}B^T]gA^T(AA^T)^{-1} - BX \quad (9)$$
$$= \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} - BX. \quad (10)$$

Here, $X \in \mathbb{R}^{r \times r}$ represents an arbitrary matrix. Proof. See Appendix B.2.

Loss minimization. While Theorem 2.1 offers a closed-form solution to the optimization problem $\min_{g_A, g_B} \|\tilde{g} - g\|_F^2$, it is crucial to understand that this solution does not inherently guarantee a decrease in loss when updating the matrices $A$ and $B$. To address this concern, we introduce Theorem 2.2, in which we prove that the change in loss $dL$ can be expressed as a negative sum of two Frobenius norms, which leads to $dL \le 0$. This result ensures that the optimization process consistently drives towards a lower loss.

Selection of matrix X. Although the equivalent gradient itself is not directly related to the matrix $X$, the presence of $X$ plays a significant role in the updates of matrices $A$ and $B$. We select an appropriate $X$ such that $g_A$ and $g_B$ remain close to $g_A^{lora}$ and $g_B^{lora}$, respectively. To achieve this, we minimize their Frobenius norms, as demonstrated in Equation (14). In practice, $B^TB$ and $AA^T$ do not share common eigenvalues. Therefore, according to Theorem 2.3, we can determine a unique optimal $X$ for updating matrices $A$ and $B$.

Theorem 2.2. When updating matrices $A$ and $B$ using the closed-form solution from Theorem 2.1, we proceed as follows:

$$A \leftarrow A - \gamma g_A, \quad (11)$$
$$B \leftarrow B - \gamma g_B, \quad (12)$$

where $\gamma > 0$ denotes the learning rate. Our method ensures a decrease in the loss, akin to the standard gradient descent algorithm, expressed by:

$$dL = -\gamma \{ \langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F + \langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F \} \le 0. \quad (13)$$

Proof. See Appendix B.3.

Theorem 2.3. Consider the optimization problem

$$\min_X \|g_A - g_A^{lora}\|_F^2 + \|g_B - g_B^{lora}\|_F^2, \quad (14)$$

where $g_A$ and $g_B$ are the optimal solutions stated in Theorem 2.1. The optimal $X$ can be determined by solving the Sylvester equation

$$B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}g_A^{lora}A^T, \quad (15)$$

which has a unique solution $X$ provided that $B^TB$ and $AA^T$ do not have any shared eigenvalues. Proof. See Appendix B.4.
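In code, the adjustment of Theorem 2.1 together with the Sylvester solve of Theorem 2.3 can be sketched as follows (our own illustration, not the released implementation; `scipy.linalg.solve_sylvester` is SciPy's dense solver for $PX + XQ = R$, and the function name `lora_pro_adjust` is ours):

```python
import torch
from scipy.linalg import solve_sylvester

def lora_pro_adjust(A, B, gA_lora, gB_lora, s):
    """Adjust LoRA gradients following Theorems 2.1 and 2.3 (sketch)."""
    BtB = B.T @ B                                   # r x r
    AAt = A @ A.T                                   # r x r
    BtB_inv = torch.linalg.inv(BtB)
    AAt_inv = torch.linalg.inv(AAt)
    P = torch.eye(B.shape[0], dtype=B.dtype) - B @ BtB_inv @ B.T  # I - B(B^T B)^{-1}B^T
    # Theorem 2.3: solve  B^T B X + X A A^T = -(1/s^2) (B^T B)^{-1} g_A^lora A^T
    rhs = -(1.0 / s**2) * (BtB_inv @ gA_lora @ A.T)
    X = torch.as_tensor(
        solve_sylvester(BtB.numpy(), AAt.numpy(), rhs.numpy()), dtype=A.dtype)
    # Theorem 2.1: optimal adjusted gradients (Eqs. (8) and (10))
    gA = (1.0 / s**2) * (BtB_inv @ gA_lora) + X @ A
    gB = (1.0 / s**2) * (P @ gB_lora @ AAt_inv) - B @ X
    return gA, gB
```

Note that all inversions and the Sylvester solve act only on small $r \times r$ matrices, which is why the adjustment adds little overhead in practice.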
3 EXPERIMENTAL RESULTS

In this section, we present extensive experiments to evaluate the effectiveness of LoRA-Pro across various tasks and models. First, we assess natural language understanding capabilities on the GLUE benchmark by fine-tuning the T5-base (Raffel et al., 2020) model in Section 3.1. Next, we evaluate its capabilities in dialogue generation, mathematical reasoning, and code generation using the Llama-2-7B model (Touvron et al., 2023) in Section 3.2. We then examine LoRA-Pro's effectiveness on image classification tasks using the CLIP-ViT-B/16 (Radford et al., 2021) model in Section 3.3. Finally, we conduct an ablation study of LoRA-Pro in Section 3.4.

Training details. To ensure a fair comparison, we align our experimental setup with that of LoRA-GA (Wang et al., 2024a). By default, we fine-tune the model using the AdamW optimizer (Loshchilov & Hutter, 2019) with hyper-parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay set to 0. We implement a cosine learning rate schedule with a warmup ratio of 0.03. LoRA is applied to all linear modules, excluding the embedding layer, normalization layer, and classification head. By default, we set the rank $r = 8$ and $\alpha = 16$.

For natural language understanding tasks, we fine-tune the T5-base (Raffel et al., 2020) model with a learning rate of 1e-4. The sequence length is set to 128, and the training batch size is 32. For dialogue generation, mathematical reasoning, and code generation tasks, we fine-tune the Llama-2-7B (Touvron et al., 2023) model with a learning rate of 2e-5. We set the sequence length to 1024 and the macro batch size to 32. For image classification tasks, we fine-tune the CLIP-ViT-B/16 (Radford et al., 2021) model with a learning rate of 1e-4. The classifier is obtained using prompts such as "a photo of a {class}" and kept frozen during fine-tuning, and the training batch size is set to 64. All experiments are conducted on NVIDIA RTX A6000 GPUs. To obtain a reliable estimate of model performance, we perform three runs with different random seeds and report the average and standard deviation of the results.
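As a reference point, this setup corresponds roughly to the following optimizer and scheduler configuration (a sketch under our assumptions; `model` and `total_steps` are placeholders, and only the LoRA parameters are assumed trainable):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],  # LoRA matrices A and B only
    lr=2e-5, betas=(0.9, 0.999), weight_decay=0.0,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * total_steps),  # warmup ratio 0.03
    num_training_steps=total_steps,
)
```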
3.1 RESULTS ON NATURAL LANGUAGE UNDERSTANDING TASKS

Table 1: Results of fine-tuning T5-base using full fine-tuning and various LoRA variants on a subset of the GLUE datasets. The LoRA rank is set to 8 by default. Bold and underline indicate the highest and second-highest scores, respectively.

| Method | MNLI | SST2 | CoLA | QNLI | MRPC | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Full FT | 86.33±0.00 | 94.75±0.21 | 80.70±0.24 | 93.19±0.22 | 84.56±0.73 | 87.91 |
| LoRA | 85.30±0.04 | 94.04±0.11 | 69.35±0.05 | 92.96±0.09 | 68.38±0.01 | 82.08 |
| PiSSA | 85.75±0.07 | 94.07±0.06 | 74.27±0.39 | 93.15±0.14 | 76.31±0.51 | 84.71 |
| rsLoRA | 85.73±0.10 | 94.19±0.23 | 72.32±1.12 | 93.12±0.09 | 52.86±2.27 | 79.64 |
| LoRA+ | 85.81±0.09 | 93.85±0.24 | 77.53±0.20 | 93.14±0.03 | 74.43±1.39 | 84.95 |
| LoRA-GA | 85.70±0.09 | 94.11±0.18 | 80.57±0.20 | 93.18±0.06 | 85.29±0.24 | 87.77 |
| DoRA | 85.67±0.09 | 94.04±0.53 | 72.04±0.94 | 93.04±0.06 | 68.08±0.51 | 82.57 |
| AdaLoRA | 85.45±0.11 | 93.69±0.20 | 69.16±0.24 | 91.66±0.05 | 68.14±0.28 | 81.62 |
| LoRA-Pro | 86.03±0.19 | 94.19±0.13 | 81.94±0.24 | 93.42±0.05 | 86.60±0.14 | 88.44 |

In this section, we evaluate our LoRA-Pro across various natural language understanding datasets. To provide a comprehensive comparison, we include several baseline methods: 1) full fine-tuning and the standard LoRA (Hu et al., 2022); 2) LoRA variants maintaining the original structure, such as rsLoRA (Kalajdzievski, 2023), LoRA+ (Hayou et al., 2024), PiSSA (Meng et al., 2024), and LoRA-GA (Wang et al., 2024a); and 3) LoRA variants with modified structures, including DoRA (Liu et al., 2024) and AdaLoRA (Zhang et al., 2023).

The results are presented in Table 1. We fine-tune the T5-base model (Raffel et al., 2020) with the baseline methods on a subset of the GLUE datasets: MNLI, SST2, CoLA, QNLI, and MRPC. As shown in Table 1, LoRA-Pro achieves the highest scores on 3 out of 5 datasets and the highest average score across all 5 datasets. Specifically, on average over the 5 datasets, LoRA-Pro surpasses standard LoRA (Hu et al., 2022) by a margin of 6.36, and it even scores higher than full fine-tuning. This superior performance may be attributed to overfitting in full fine-tuning, where optimizing all model parameters can lead to overfitting on the training data, thus reducing the model's generalization to the test set. This effect is particularly pronounced on small datasets, such as MRPC, which contains only 3.7k training examples. These results validate the effectiveness of our method.

3.2 RESULTS ON LARGE LANGUAGE MODELS

In this section, we evaluate the performance of LoRA-Pro on large language models, focusing on dialogue generation, mathematical reasoning, and code generation capabilities. Our experimental setup follows the configuration used in LoRA-GA (Wang et al., 2024a). For the dialogue generation task, we fine-tune the Llama-2-7B (Touvron et al., 2023) model on a 52k subset of the WizardLM dataset (Xu et al., 2024) and evaluate it using the MT-Bench dataset (Zheng et al., 2024a). GPT-4 is used to assess the quality of the model's responses, and we report the first-turn score as the metric.

Table 2: Fine-tuning results of the Llama-2-7B model. We fine-tune the Llama-2-7B model using full fine-tuning and LoRA variants on subsets of the WizardLM (Xu et al., 2024), MetaMathQA (Yu et al., 2024), and CodeFeedback (Zheng et al., 2024b) datasets, respectively, and assess dialogue generation, mathematical reasoning, and coding abilities on the MT-Bench, GSM8K, and HumanEval datasets. Bold and underline indicate the highest and second-highest scores, respectively.
| Method | MT-Bench | GSM8K | HumanEval |
| --- | --- | --- | --- |
| Full FT | 5.30±0.11 | 59.36±0.85 | 35.31±2.13 |
| LoRA | 5.61±0.10 | 42.08±0.04 | 14.76±0.17 |
| PiSSA | 5.30±0.02 | 44.54±0.27 | 16.02±0.78 |
| rsLoRA | 5.25±0.03 | 45.62±0.10 | 16.01±0.79 |
| LoRA+ | 5.71±0.08 | 52.11±0.62 | 18.17±0.52 |
| DoRA | 5.97±0.02 | 53.07±0.75 | 19.75±0.41 |
| AdaLoRA | 5.57±0.05 | 50.72±1.39 | 17.80±0.44 |
| LoRA-GA | 5.95±0.16 | 53.60±0.30 | 19.81±1.46 |
| LoRA-GA (rank=32) | 5.79±0.09 | 55.12±0.30 | 20.18±0.19 |
| LoRA-GA (rank=128) | 6.13±0.07 | 55.07±0.18 | 23.05±0.37 |
| LoRA-Pro | 5.72±0.03 | 57.57±0.50 | 22.97±0.35 |
| LoRA-Pro (rank=32) | 5.57±0.51 | 57.97±0.50 | 26.63±0.35 |
| LoRA-Pro (rank=128) | 5.67±0.11 | 61.08±0.19 | 30.28±0.93 |

For the math task, we fine-tune the Llama-2-7B (Touvron et al., 2023) model on a 100k sample from the MetaMathQA dataset (Yu et al., 2024). The model is then evaluated on the GSM8K test set (Cobbe et al., 2021), and we report accuracy as the metric. For the coding task, we fine-tune the Llama-2-7B (Touvron et al., 2023) model on a 100k subset of the CodeFeedback dataset (Zheng et al., 2024b) and test it on the HumanEval dataset (Chen et al., 2021), reporting the PASS@1 metric.

We compare LoRA-Pro with several baselines, including full fine-tuning, LoRA (Hu et al., 2022), PiSSA (Meng et al., 2024), rsLoRA (Kalajdzievski, 2023), LoRA+ (Hayou et al., 2024), DoRA (Liu et al., 2024), AdaLoRA (Zhang et al., 2023), and LoRA-GA (Wang et al., 2024a). By default, we set the rank to 8 and $\alpha = 16$. Following LoRA-GA (Wang et al., 2024a), we initialize the scaling factor as in rsLoRA (Kalajdzievski, 2023), i.e., $s = \frac{\alpha}{\sqrt{r}}$.

Table 2 presents our experimental results, which demonstrate LoRA-Pro's superior performance. With a rank of 8, LoRA-Pro achieves notable improvements over the original LoRA: 0.11 on MT-Bench, 15.49 on GSM8K, and 8.21 on HumanEval. When compared to the second-best PEFT method, LoRA-GA, LoRA-Pro shows consistent gains: 3.97 on GSM8K and a substantial 3.16 on HumanEval. These results validate the effectiveness of our LoRA-Pro method.

Interestingly, we observe that full fine-tuning unexpectedly underperforms on MT-Bench. We attribute this to potential discrepancies between the WizardLM training data distribution and the MT-Bench evaluation set. The extensive learning capacity of full fine-tuning may lead to overfitting on the training distribution, compromising generalization to MT-Bench. Since LoRA-Pro aligns more closely with full fine-tuning during optimization, its relatively low score on MT-Bench may also be attributed to overfitting.

To further explore the scalability of our method, we increase the rank in LoRA-Pro from 8 to 128. Our observations reveal a clear trend: as the rank increases, the performance gap between LoRA-Pro and full fine-tuning narrows rapidly. Notably, LoRA-Pro consistently outperforms LoRA-GA at the same ranks on both the GSM8K and HumanEval datasets. At rank 32, LoRA-Pro surpasses LoRA-GA by 2.85 on GSM8K and 6.45 on HumanEval. This performance disparity becomes even more pronounced at rank 128, where LoRA-Pro outperforms LoRA-GA by 6.01 on GSM8K and 7.23 on HumanEval. These results demonstrate the superior scalability and effectiveness of LoRA-Pro across various rank settings.

3.3 RESULTS ON IMAGE CLASSIFICATION TASKS

Table 3: Fine-tuning results of CLIP-ViT-B/16 on image classification tasks. We fine-tune CLIP-ViT-B/16 using full fine-tuning and LoRA variants across the Stanford Cars, DTD, EuroSAT, GTSRB, RESISC45, SUN397, and SVHN datasets.
Bold indicates the highest results, while underline represents the second-highest results.

| Method | Cars | DTD | EuroSAT | GTSRB | RESISC45 | SUN397 | SVHN | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-shot | 63.75 | 44.39 | 42.22 | 35.22 | 56.46 | 62.56 | 15.53 | 45.73 |
| Full FT | 84.23±0.06 | 77.44±0.19 | 98.09±0.03 | 94.31±0.28 | 93.95±0.00 | 75.35±0.10 | 93.04±0.18 | 88.06 |
| LoRA | 72.81±0.13 | 73.92±0.38 | 96.93±0.07 | 92.40±0.10 | 90.03±0.14 | 70.12±0.18 | 88.02±0.07 | 83.46 |
| rsLoRA | 82.38±0.20 | 78.03±0.76 | 98.06±0.08 | 95.04±0.11 | 93.96±0.18 | 75.38±0.24 | 92.74±0.18 | 87.94 |
| LoRA+ | 72.87±0.18 | 74.07±0.45 | 97.01±0.02 | 92.42±0.18 | 89.96±0.11 | 70.17±0.15 | 88.08±0.05 | 83.51 |
| DoRA | 73.72±0.06 | 73.72±0.33 | 96.95±0.01 | 92.38±0.17 | 90.03±0.08 | 70.20±0.19 | 88.23±0.05 | 83.48 |
| LoRA-GA | 85.18±0.41 | 77.50±0.12 | 98.05±0.27 | 95.28±0.10 | 94.43±0.19 | 75.44±0.06 | 93.68±0.35 | 88.51 |
| LoRA-Pro | 85.87±0.08 | 78.64±0.25 | 98.46±0.03 | 95.66±0.05 | 94.75±0.21 | 76.42±0.14 | 94.63±0.20 | 89.20 |

In this section, we assess the performance of LoRA-Pro on image classification tasks. To provide a comprehensive comparison, we compare it with several baselines: zero-shot CLIP (Radford et al., 2021), full fine-tuning, vanilla LoRA (Hu et al., 2022), rsLoRA (Kalajdzievski, 2023), LoRA+ (Hayou et al., 2024), DoRA (Liu et al., 2024), and LoRA-GA (Wang et al., 2024a). We fine-tune the CLIP-ViT-B/16 (Radford et al., 2021) model on various datasets, including Stanford Cars (Krause et al., 2013), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), GTSRB (Houben et al., 2013), RESISC45 (Cheng et al., 2017), SUN397 (Xiao et al., 2010), and SVHN (Netzer et al., 2011). Accuracy is used as the evaluation metric. During fine-tuning, only the visual backbone of the CLIP-ViT-B/16 model is updated, while the classifier, derived from prompts such as "a photo of a {class}", remains frozen.

The results are presented in Table 3. LoRA-Pro achieves the highest accuracy across all seven datasets. Specifically, on average, LoRA-Pro surpasses zero-shot classification by 43.47, outperforms LoRA (Hu et al., 2022) by 5.74, and exceeds rsLoRA (Kalajdzievski, 2023) by 1.26. These results validate the effectiveness of our LoRA-Pro method.

3.4 ABLATION STUDY

Ablation study on the full-rank assumption. In Theorem 2.1, we assume that the matrices $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ are full rank during training. Our goal here is to verify whether this assumption holds in practice. We track the rank changes of all $A$ and $B$ matrices during the fine-tuning of Llama-2-7B on the MetaMathQA (Yu et al., 2024) dataset. In Figure 2, we illustrate the rank changes of matrices $A$ and $B$ from the q projection of layer 9 during training, with the rank set to 8 and 32, respectively. We observe that, although $A$ and $B$ do not initially satisfy the full-rank assumption (matrix $B$ is initialized as a zero matrix), both matrices achieve full rank after the first update step. The rank behavior of $A$ and $B$ in other layers exhibits similar results.

Figure 2: Visualization of the matrix ranks of A and B during training, with ranks set to 8 and 32, respectively.

This observation provides practical evidence that the assumption in Theorem 2.1 is reasonable and supports the validity of the proposed solutions.
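The rank tracking itself is a one-liner per matrix; a sketch of this check (with an assumed `lora_A`/`lora_B` naming convention for the adapter parameters) is:

```python
import torch

def lora_ranks(model):
    """Numerical rank of every LoRA A/B matrix (sketch; naming convention assumed)."""
    return {
        name: int(torch.linalg.matrix_rank(p.detach().float()))
        for name, p in model.named_parameters()
        if "lora_A" in name or "lora_B" in name
    }

# e.g., record once per logging step: rank_history.append(lora_ranks(model))
```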
Memory footprint and training time. Here, we evaluate the additional costs associated with using LoRA-Pro compared to LoRA, focusing on the differences in memory cost and training time between LoRA-Pro, LoRA, and full fine-tuning. We measure memory cost using 8 A6000 GPUs with a batch size of 1. Training time is recorded on the WizardLM dataset using 8 A100 GPUs with DeepSpeed (Rasley et al., 2020) ZeRO-2 stage optimization. The results are shown in Table 4.

Table 4: We compare LoRA, LoRA-GA, LoRA-Pro, and full fine-tuning in terms of memory cost, training time, and performance on the MT-Bench, GSM8K, and HumanEval datasets.

| Method | Memory Cost | Training Time | MT-Bench | GSM8K | HumanEval |
| --- | --- | --- | --- | --- | --- |
| Full FT | 8 × 40 GB | 4h 20min | 5.30±0.11 | 59.36±0.85 | 35.31±2.13 |
| LoRA | 8 × 17 GB | 1h 30min | 5.61±0.10 | 42.08±0.04 | 14.76±0.17 |
| LoRA-GA | 8 × 17 GB | 1h 31min | 5.95±0.16 | 53.60±0.30 | 19.81±1.46 |
| LoRA-Pro | 8 × 21 GB | 1h 41min | 5.72±0.03 | 57.57±0.50 | 22.97±0.35 |

From the table, we observe the following: 1) LoRA-Pro requires approximately 4 GB more GPU memory per device than LoRA. This difference likely stems from the need to compute $B^TB$, $AA^T$, and their inverses during the calculation of the optimal gradients. 2) Surprisingly, the training time for LoRA-Pro is nearly identical to that of LoRA, with only about a 10-minute increase. We attribute this to the fact that the matrices $A$ and $B$ in LoRA are low-rank: the extra computations required by LoRA-Pro (such as matrix inversion and the calculation of $X$) are performed on small $r \times r$ matrices, making the extra computational overhead manageable. Considering that LoRA-Pro uses less memory and trains faster than full fine-tuning, while also providing performance improvements over LoRA, we believe that the additional memory and training time costs are acceptable.

Training curves of LoRA-Pro. In this part, we present the training loss curves for LoRA, LoRA-GA, LoRA-Pro, and full fine-tuning across the WizardLM, MetaMathQA, and CodeFeedback datasets. As illustrated in Figure 3, LoRA-Pro demonstrates a faster convergence speed compared to LoRA and LoRA-GA. Furthermore, LoRA-Pro achieves a lower final loss value upon convergence, indicating its improved efficiency and effectiveness.

Figure 3: Training loss curves of LoRA, LoRA-GA, LoRA-Pro, and full fine-tuning on WizardLM, MetaMathQA, and CodeFeedback.

4 RELATED WORK

Parameter-Efficient Fine-Tuning. Given the huge size of foundation models, recent research has focused on developing parameter-efficient fine-tuning methods (Hu et al., 2022; Liu et al., 2024; Ding et al., 2023; Houlsby et al., 2019; Liu et al., 2023; Lester et al., 2021; Wang et al., 2024c). These methods aim to reduce the cost of fine-tuning by adjusting only a small portion of the model's parameters. Generally, these methods fall into two main categories. The first category is adapter tuning (Houlsby et al., 2019; Sung et al., 2022; He et al., 2021; Zhang et al., 2024; Bapna & Firat, 2019; Hu et al., 2022), which involves inserting small neural network modules, called adapters, into specific layers of the model.
During fine-tuning, we keep the model frozen and only fine-tune the lightweight adapter modules, significantly reducing the memory footprint of fine-tuning. The second category is prompt tuning (Lester et al., 2021; Zhou et al., 2022; Li & Liang, 2021; Liu et al., 2022; Wang et al., 2023; 2024b; Liang et al., 2024). Prompt tuning adapts models to specific tasks by adding specially designed prompts or learnable tokens to the input data, rather than directly modifying the internal parameters of foundation models. In this paper, we focus on LoRA (Hu et al., 2022), a prominent method within the realm of adapter tuning.

Low-Rank Adaptation. Low-rank adaptation, initially referred to as LoRA (Hu et al., 2022), has evolved into a broad category encompassing parameter-efficient fine-tuning methods based on low-rank approximations (Hu et al., 2022; Liu et al., 2024; Hayou et al., 2024; Kalajdzievski, 2023; Zhang et al., 2023; Kopiczko et al., 2024; Hyeon-Woo et al., 2022; Zhang & Pilanci, 2024; Wang et al., 2024a; Zhao et al., 2024). LoRA (Hu et al., 2022) assumes that the changes in the weights of pre-trained models exhibit a low-rank structure. Consequently, it re-parameterizes these changes as the product of low-rank matrices, thereby reducing the cost of fine-tuning. Several variants of LoRA have been proposed to address different aspects of this approach. For example, DoRA (Liu et al., 2024) improves LoRA (Hu et al., 2022) by incorporating a learnable magnitude vector to re-scale the normalized product of low-rank matrices. Another variant, rsLoRA (Kalajdzievski, 2023), introduces a new scaling factor to stabilize training in high-rank scenarios. LoRA+ (Hayou et al., 2024) improves upon LoRA by applying different learning rates to the two low-rank matrices. Additionally, GaLore (Zhao et al., 2024) employs SVD to project the gradients of full training, along with their first and second moments, into a low-rank space, thereby reducing the memory footprint during pre-training and fine-tuning.

5 CONCLUSION

In this paper, we introduce LoRA-Pro, a novel approach designed to bridge the performance gap between LoRA and full fine-tuning. We have discovered that using LoRA for fine-tuning is equivalent to fine-tuning the original weights with a virtual equivalent low-rank gradient. Based on this insight, we propose adjusting the gradients of matrices A and B to make the equivalent gradient match the true full fine-tuning gradient, thereby reducing their performance gap. Fortunately, we theoretically prove that there exists an optimal closed-form solution for updating matrices A and B, which is applied during fine-tuning in LoRA-Pro. To validate the effectiveness of our method, we conduct extensive experiments across various domains, including natural language understanding, dialogue generation, mathematical reasoning, code generation, and image classification tasks. The results demonstrate that LoRA-Pro significantly improves LoRA's performance and narrows the performance gap with full fine-tuning.

Limitations. LoRA-Pro still has some limitations: (1) LoRA-Pro adheres to LoRA's assumption that $\Delta W$ is of low rank. However, this assumption may break down in cases of pre-training or when there is a large amount of fine-tuning data, potentially leading to suboptimal results. (2) So far, we have only applied LoRA-Pro to variants that have a structure similar to LoRA.
It currently cannot be applied to structurally different LoRA variants, such as DoRA (Liu et al., 2024) or FLoRA (Wen & Chaudhuri, 2024). We plan to explore these directions in future research.

ACKNOWLEDGEMENT

This work was funded by the National Natural Science Foundation of China under Grants (62276256, U2441251) and the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001). We thank Jie Cheng and Yongcan Yu for providing computational resources critical to this work.

REFERENCES

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In ACL-IJCNLP, 2021.

Ankur Bapna and Orhan Firat. Simple, scalable adaptation for neural machine translation. In EMNLP-IJCNLP, 2019.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865-1883, 2017.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220-235, 2023.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024.

Ruidan He, Linlin Liu, Hai Ye, Qingyu Tan, Bosheng Ding, Liying Cheng, Jiawei Low, Lidong Bing, and Luo Si. On the effectiveness of adapter-based tuning for pretrained language model adaptation. In ACL-IJCNLP, 2021.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217-2226, 2019.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In NeurIPS, 2022.

Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detection of traffic signs in real-world images: The German traffic sign detection benchmark. In IJCNN, 2013.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, 2019.

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

Nam Hyeon-Woo, Moon Ye-Bin, and Tae-Hyun Oh. FedPara: Low-rank Hadamard product for communication-efficient federated learning. In ICLR, 2022.

Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA. arXiv preprint arXiv:2312.03732, 2023.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. VeRA: Vector-based random matrix adaptation. In ICLR, 2024.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop, 2013.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In ICLR, 2018.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL-IJCNLP, 2021.

Jian Liang, Lijun Sheng, Zhengbo Wang, Ran He, and Tieniu Tan. Realistic unsupervised CLIP fine-tuning with universal entropy optimization. In ICML, 2024.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1-35, 2023.

Shih-yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In ICML, 2024.

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In ACL, 2022.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop, 2011.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In SIGKDD, pp. 3505-3506, 2020.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In CVPR, 2022.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Shaowen Wang, Linxi Yu, and Jian Li. LoRA-GA: Low-rank adaptation with gradient approximation. In NeurIPS, 2024a.

Zhengbo Wang, Jian Liang, Ran He, Nan Xu, Zilei Wang, and Tieniu Tan. Improving zero-shot generalization for CLIP with synthesized prompts. In ICCV, 2023.

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Connecting the dots: Collaborative fine-tuning for black-box vision-language models. In ICML, 2024b.

Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, and Tieniu Tan. A hard-to-beat baseline for training-free CLIP-based adaptation. In ICLR, 2024c.

Yeming Wen and Swarat Chaudhuri. Batched low-rank adaptation of foundation models. In ICLR, 2024.

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. In ICLR, 2024.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. In ICLR, 2024.

Fangzhao Zhang and Mert Pilanci. Riemannian preconditioned LoRA for fine-tuning foundation models. In ICML, 2024.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In ICLR, 2023.

Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In ICLR, 2024.

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. In ICML, 2024.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2024a.

Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. OpenCodeInterpreter: Integrating code generation with execution and refinement. In Findings of ACL, 2024b.

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337-2348, 2022.
APPENDIX

The structure of the Appendix is as follows: Appendix A contains the notation used in our paper; Appendix B contains the proofs of the theorems in the main manuscript; Appendix C details the optimization algorithms of the proposed method; Appendix D presents additional experimental results.

A NOTATIONS

In Table 5, we detail the notations utilized in our paper.

Table 5: Description of notations used in the paper.

| Notation | Description |
| --- | --- |
| $s$ | scaling factor in LoRA |
| $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$ | low-rank matrices in LoRA |
| $g = \frac{\partial L}{\partial W} \in \mathbb{R}^{m \times n}$ | gradient of full fine-tuning |
| $g_A^{lora} = \frac{\partial L}{\partial A} = sB^Tg \in \mathbb{R}^{r \times n}$ | gradient of matrix A in LoRA |
| $g_B^{lora} = \frac{\partial L}{\partial B} = sgA^T \in \mathbb{R}^{m \times r}$ | gradient of matrix B in LoRA |
| $dL$ | differential of the loss function |
| $dA$ | differential of the matrix A |
| $dB$ | differential of the matrix B |
| $\|\cdot\|_F$ | Frobenius norm |
| $\langle \cdot, \cdot \rangle_F$ | Frobenius inner product |

B PROOF OF THEORETICAL RESULTS

B.1 PROOF THAT THE EQUIVALENT GRADIENT IS LOW-RANK

Lemma. Assume $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$ and $g_B \in \mathbb{R}^{m \times r}$, $g_A \in \mathbb{R}^{r \times n}$ represent matrices and their corresponding gradients in LoRA optimization. We demonstrate that the equivalent gradient

$$\tilde{g} = sg_BA + sBg_A, \quad (16)$$

where $s > 0$ is the scaling factor, has matrix rank at most $2r$.

Proof. Since matrix rank satisfies the property of subadditivity, we have:

$$\text{rank}(\tilde{g}) = \text{rank}(sg_BA + sBg_A) \le \text{rank}(g_BA) + \text{rank}(Bg_A). \quad (17)$$

Furthermore, for any matrices $A$ and $B$, $\text{rank}(AB) \le \min(\text{rank}(A), \text{rank}(B))$. Therefore, we can bound the ranks as follows:

$$\text{rank}(g_BA) \le \min(\text{rank}(g_B), \text{rank}(A)) \le r, \quad (18)$$
$$\text{rank}(Bg_A) \le \min(\text{rank}(B), \text{rank}(g_A)) \le r. \quad (19)$$

Thus, in conclusion, the equivalent gradient has a rank of at most $2r$:

$$\text{rank}(\tilde{g}) \le \text{rank}(g_BA) + \text{rank}(Bg_A) \quad (20)$$
$$\le \min(\text{rank}(g_B), \text{rank}(A)) + \min(\text{rank}(B), \text{rank}(g_A)) \quad (21)$$
$$\le 2r. \quad (22)$$
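The bound is easy to confirm on random matrices (a toy check of ours):

```python
import torch

torch.manual_seed(0)
m, n, r, s = 64, 80, 4, 2.0
A, B = torch.randn(r, n), torch.randn(m, r)
gA, gB = torch.randn(r, n), torch.randn(m, r)

g_tilde = s * gB @ A + s * B @ gA
print(int(torch.linalg.matrix_rank(gB @ A)))   # at most r
print(int(torch.linalg.matrix_rank(B @ gA)))   # at most r
print(int(torch.linalg.matrix_rank(g_tilde)))  # at most 2r (generically exactly 2r)
```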
B.2 PROOF OF THEOREM 2.1

Theorem. Assume matrices $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$ are both full rank. For the objective $\min_{g_A, g_B} \|\tilde{g} - g\|_F^2$, the solutions are given by:

$$g_A = \frac{1}{s}(B^TB)^{-1}B^Tg + XA = \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} + XA, \quad (23)$$
$$g_B = \frac{1}{s}[I - B(B^TB)^{-1}B^T]gA^T(AA^T)^{-1} - BX \quad (24)$$
$$= \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} - BX. \quad (25)$$

Here, $X \in \mathbb{R}^{r \times r}$ represents an arbitrary matrix.

Proof. For simplicity, we denote $L = \|sBg_A + sg_BA - g\|_F^2$. To solve the optimization problem, we need to satisfy the following conditions:

$$\frac{\partial L}{\partial g_A} = 2sB^T(sBg_A + sg_BA - g) = 0, \quad (26)$$
$$\frac{\partial L}{\partial g_B} = 2(sBg_A + sg_BA - g)sA^T = 0. \quad (27)$$

Given that matrices $A$ and $B$ are full rank, $AA^T$ and $B^TB$ are invertible, and from Equation (27) we derive:

$$g_B = \frac{1}{s}gA^T(AA^T)^{-1} - Bg_AA^T(AA^T)^{-1}. \quad (28)$$

Substituting this into Equation (26), we obtain the following linear equation:

$$g_A[I - A^T(AA^T)^{-1}A] = \frac{1}{s}(B^TB)^{-1}B^Tg[I - A^T(AA^T)^{-1}A]. \quad (29)$$

Here, we notice that the matrix $P = I - A^T(AA^T)^{-1}A$ is a projection matrix with rank $n - r$. The solution to the linear equation (29) is:

$$g_A = \frac{1}{s}(B^TB)^{-1}B^Tg + XA, \quad (30)$$

where $X \in \mathbb{R}^{r \times r}$ represents an arbitrary matrix. Taking the solution (30) into Equation (28), we derive:

$$g_B = \frac{1}{s}[I - B(B^TB)^{-1}B^T]gA^T(AA^T)^{-1} - BX. \quad (31)$$

While we have obtained closed-form solutions for $g_A$ and $g_B$, these solutions explicitly depend on the gradient of the matrix $W$, i.e., $g$, which is undesirable since $g$ is unknown during LoRA optimization. Fortunately, the solutions can be transformed into forms using the gradients of standard LoRA, which are:

$$g_A^{lora} = sB^Tg, \quad g_B^{lora} = sgA^T. \quad (32)$$

Therefore, the solutions to the optimization problem can be written as:

$$g_A = \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} + XA, \quad (33)$$
$$g_B = \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} - BX. \quad (34)$$

In our method, we perform the standard forward and backward passes of LoRA, then adjust the gradients of $A$ and $B$ using Solutions (33) and (34), and subsequently update them.

B.3 PROOF OF THEOREM 2.2

Theorem. When updating matrices $A$ and $B$ using the closed-form solution from Theorem 2.1, we proceed as follows:

$$A \leftarrow A - \gamma g_A, \quad (35)$$
$$B \leftarrow B - \gamma g_B, \quad (36)$$

where $\gamma > 0$ denotes the learning rate. Our method ensures a decrease in the loss, akin to the standard gradient descent algorithm, expressed by:

$$dL = -\gamma \{ \langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F + \langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F \} \le 0. \quad (37)$$

Proof. In summary, the proof of Theorem 2.2 is divided into two distinct parts. To begin with, we demonstrate that $dL$ can be expressed in the following form:

$$dL = -\gamma \{ \langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F + \langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F \}. \quad (38)$$

In the second part, we prove that this expression for $dL$ is always less than or equal to zero: $dL \le 0$.

Part I. In this part, we first prove Equation (38). During the optimization process, the differential change in the loss function, $dL$, can be expressed in terms of the differentials $dA$ and $dB$ as follows:

$$dL = \langle \frac{\partial L}{\partial A}, dA \rangle_F + \langle \frac{\partial L}{\partial B}, dB \rangle_F. \quad (39)$$

From Equations (35) and (36), we can derive that:

$$dA = -\gamma g_A, \quad dB = -\gamma g_B. \quad (40)$$

Given that $\frac{\partial L}{\partial A} = g_A^{lora}$ and $\frac{\partial L}{\partial B} = g_B^{lora}$, it follows that:

$$dL = -\gamma (\langle g_A^{lora}, g_A \rangle_F + \langle g_B^{lora}, g_B \rangle_F) = -\gamma (\langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F + \langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F + \langle g_A^{lora}, XA \rangle_F - \langle g_B^{lora}, BX \rangle_F). \quad (41)$$

And we have the following equation:

$$\langle g_A^{lora}, XA \rangle_F - \langle g_B^{lora}, BX \rangle_F = \langle g_A^{lora}A^T, X \rangle_F - \langle B^Tg_B^{lora}, X \rangle_F = \langle g_A^{lora}A^T - B^Tg_B^{lora}, X \rangle_F = \langle (sB^Tg)A^T - B^T(sgA^T), X \rangle_F = 0. \quad (42)$$

Therefore, we have:

$$dL = -\gamma \{ \langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F + \langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F \}. \quad (43)$$

Part II. In this part, we aim to prove $dL \le 0$. Given that the learning rate $\gamma > 0$, it suffices to show the following inequalities:

$$\langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F \ge 0, \quad (44)$$
$$\langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F \ge 0. \quad (45)$$

By proving these inequalities, we can establish that $dL \le 0$ as derived from Equation (38).

① Proof of $\langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F \ge 0$. To begin with, we need to show that $(B^TB)^{-1}$ is positive definite. It is sufficient to show that $B^TB$ is positive definite, as the inverse of a positive definite matrix is also positive definite. To achieve this, consider any non-zero vector $x$; noting that $B$ is full rank, we have

$$\langle x, B^TBx \rangle = \langle Bx, Bx \rangle = \|Bx\|^2 > 0. \quad (46)$$

This shows that $B^TB$ is positive definite, and consequently $(B^TB)^{-1}$ is positive definite as well. Since $(B^TB)^{-1}$ is positive definite, we can apply the Cholesky decomposition $(B^TB)^{-1} = UU^T$. With this, we have:

$$\langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F = \frac{1}{s^2}\langle g_A^{lora}, UU^Tg_A^{lora} \rangle_F = \frac{1}{s^2}\langle U^Tg_A^{lora}, U^Tg_A^{lora} \rangle_F = \frac{1}{s^2}\|U^Tg_A^{lora}\|_F^2 \ge 0. \quad (47)$$

② Proof of $\langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F \ge 0$. Similarly, we can prove that the matrix $(AA^T)^{-1}$ is positive definite. By employing the Cholesky decomposition, we express $(AA^T)^{-1} = UU^T$, where $U$ is a lower-triangular matrix. Subsequently, we define $P = I - B(B^TB)^{-1}B^T$. It can be shown that $P^2 = P$ and $P$ is symmetric, indicating that $P$ is a projection matrix. Consequently, the eigenvalues of $P$ are either 0 or 1, which implies that $P$ is positive semi-definite. Utilizing this fact, we derive the decomposition $P = VV^T$. Finally, we have:

$$\langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F = \frac{1}{s^2}\langle g_B^{lora}, VV^Tg_B^{lora}UU^T \rangle_F = \frac{1}{s^2}\langle V^Tg_B^{lora}U, V^Tg_B^{lora}U \rangle_F = \frac{1}{s^2}\|V^Tg_B^{lora}U\|_F^2 \ge 0. \quad (48)$$

In summary, based on the above proofs, we have demonstrated that

$$dL = -\gamma \{ \underbrace{\langle g_A^{lora}, \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} \rangle_F}_{\ge 0 \text{ as shown in ①}} + \underbrace{\langle g_B^{lora}, \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} \rangle_F}_{\ge 0 \text{ as shown in ②}} \} \le 0. \quad (49)$$
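Both the cancellation of the X-dependent cross terms (Equation (42)) and the sign of $dL$ can be checked numerically; below is a small sketch of ours with random data, evaluating the closed-form solutions at an arbitrary $X$:

```python
import torch

torch.manual_seed(1)
m, n, r, s = 24, 40, 3, 1.5
A, B, g = torch.randn(r, n), torch.randn(m, r), torch.randn(m, n)
gA_lora, gB_lora = s * B.T @ g, s * g @ A.T
BtB_inv = torch.linalg.inv(B.T @ B)
AAt_inv = torch.linalg.inv(A @ A.T)
P = torch.eye(m) - B @ BtB_inv @ B.T

X = torch.randn(r, r)                       # arbitrary X: its cross terms must cancel
gA = (1 / s**2) * BtB_inv @ gA_lora + X @ A
gB = (1 / s**2) * P @ gB_lora @ AAt_inv - B @ X

lhs = (gA_lora * gA).sum() + (gB_lora * gB).sum()            # <g_A^lora, g_A> + <g_B^lora, g_B>
rhs = ((gA_lora * (BtB_inv @ gA_lora)).sum()
       + (gB_lora * (P @ gB_lora @ AAt_inv)).sum()) / s**2   # Eq. (43) without the -gamma factor
print(torch.allclose(lhs, rhs), bool(rhs >= 0))              # X-independent and non-negative
```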
B.4 PROOF OF THEOREM 2.3

Theorem. Consider the optimization problem

$$\min_X \|g_A - g_A^{lora}\|_F^2 + \|g_B - g_B^{lora}\|_F^2, \quad (50)$$

where $g_A$ and $g_B$ are the optimal solutions stated in Theorem 2.1. The optimal $X$ can be determined by solving the Sylvester equation

$$B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}g_A^{lora}A^T, \quad (51)$$

which has a unique solution $X$ provided that $B^TB$ and $AA^T$ do not have any shared eigenvalues.

Proof. For simplicity, we denote $L = \|g_A - g_A^{lora}\|_F^2 + \|g_B - g_B^{lora}\|_F^2$. To solve the optimization problem, we need to satisfy the condition

$$\frac{\partial L}{\partial X} = 0. \quad (52)$$

Since $g_A$ and $g_B$ are the solutions in Theorem 2.1, and $g_A^{lora} = sB^Tg$ and $g_B^{lora} = sgA^T$, we obtain:

$$2(g_A - g_A^{lora})A^T - 2B^T(g_B - g_B^{lora}) = 0 \;\Rightarrow\; g_AA^T - B^Tg_B = g_A^{lora}A^T - B^Tg_B^{lora} \;\Rightarrow\; B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}g_A^{lora}A^T, \quad (53)$$

which is a Sylvester equation. This equation has a unique solution for $X$ if and only if $B^TB$ and $AA^T$ have no shared eigenvalues.

C OPTIMIZATION ALGORITHMS

In this section, we present the pseudo-code for implementing our LoRA-Pro method with the SGD (Sutskever et al., 2013) and AdamW (Loshchilov & Hutter, 2019) optimizers. The details are provided in Algorithm 1 and Algorithm 2, respectively.

LoRA-Pro with the SGD optimizer. In the standard SGD algorithm, as illustrated in Algorithm 1, all we need to do is adjust the gradients of matrices $A$ and $B$ with the solutions in Theorem 2.1.

Algorithm 1 LoRA-Pro with the SGD optimizer

Require: initial learning rate $\gamma$, scaling factor $s$.
1: Initialize time step $t \leftarrow 0$, low-rank matrices $A_0 \in \mathbb{R}^{r \times n}$ and $B_0 \in \mathbb{R}^{m \times r}$
2: repeat
3: $t \leftarrow t + 1$
4: $g_A^{lora}, g_B^{lora} \leftarrow$ SelectBatch($A_{t-1}, B_{t-1}$) ▷ select a batch and return the corresponding gradients
5: $A, B \leftarrow A_{t-1}, B_{t-1}$ ▷ obtain the low-rank matrices A and B
6: $X \leftarrow$ SolveSylvester($B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}g_A^{lora}A^T$) ▷ compute X by solving the Sylvester equation
7: $g_A \leftarrow \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} + XA$ ▷ adjust the gradients of LoRA with Theorem 2.1
8: $g_B \leftarrow \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} - BX$
9: $A_t \leftarrow A_{t-1} - \gamma g_A$
10: $B_t \leftarrow B_{t-1} - \gamma g_B$
11: until stopping criterion is met
12: return optimized parameters $A_t$ and $B_t$

LoRA-Pro with the AdamW optimizer. With the AdamW optimizer, the implementation becomes more complex, since we aim to closely approximate full fine-tuning during optimization; several modifications are necessary. Firstly, in order to mimic full fine-tuning, after adjusting the gradients of matrices $A$ and $B$, we need to compute the equivalent gradient,

$$\tilde{g} = sg_BA + sBg_A. \quad (54)$$

Subsequently, we calculate the first and second moments of this equivalent gradient to derive the corresponding AdamW gradient, $\tilde{g}_{AdamW}$. Secondly, we determine the gradients with respect to matrices $A$ and $B$ as follows:

$$\tilde{g}_A^{lora} = sB^T\tilde{g}_{AdamW}, \quad \tilde{g}_B^{lora} = s\tilde{g}_{AdamW}A^T. \quad (55)$$

Thirdly, the weight decay process must be adjusted. In line with full fine-tuning, the weight decay is given by:

$$W \leftarrow (1 - \gamma\lambda)(W_0 + sBA). \quad (56)$$

This can be decomposed into:

$$W_0 \leftarrow (1 - \gamma\lambda)W_0, \quad B \leftarrow \sqrt{1 - \gamma\lambda}\,B, \quad A \leftarrow \sqrt{1 - \gamma\lambda}\,A. \quad (57)$$

Algorithm 2 LoRA-Pro with the AdamW optimizer

Require: initial learning rate $\gamma$, scaling factor $s$, original weight matrix $W_0 \in \mathbb{R}^{m \times n}$, and $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\lambda \in \mathbb{R}$
1: Initialize time step $t \leftarrow 0$, low-rank matrices $A_0 \in \mathbb{R}^{r \times n}$ and $B_0 \in \mathbb{R}^{m \times r}$, first moment $m_0 \in \mathbb{R}^{m \times n}$, second moment $v_0 \in \mathbb{R}^{m \times n}$
2: repeat
3: $t \leftarrow t + 1$
4: $g_A^{lora}, g_B^{lora} \leftarrow$ SelectBatch($A_{t-1}, B_{t-1}$) ▷ select a batch and return the corresponding gradients
5: $A, B \leftarrow A_{t-1}, B_{t-1}$ ▷ obtain the low-rank matrices A and B
6: $X \leftarrow 0$ ▷ X's value does not affect the equivalent gradient
7: $g_A \leftarrow \frac{1}{s^2}(B^TB)^{-1}g_A^{lora} + XA$ ▷ adjust the gradients of LoRA with Theorem 2.1
8: $g_B \leftarrow \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]g_B^{lora}(AA^T)^{-1} - BX$
9: $\tilde{g} \leftarrow sg_BA + sBg_A$ ▷ compute the equivalent gradient
10: $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\tilde{g}$
11: $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\tilde{g}^2$
12: $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$
13: $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$
14: $\tilde{g}_{AdamW} \leftarrow \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
15: $\tilde{g}_A^{lora} \leftarrow sB^T\tilde{g}_{AdamW}$
16: $\tilde{g}_B^{lora} \leftarrow s\tilde{g}_{AdamW}A^T$
17: $X \leftarrow$ SolveSylvester($B^TBX + XAA^T = -\frac{1}{s^2}(B^TB)^{-1}\tilde{g}_A^{lora}A^T$) ▷ compute X by solving the Sylvester equation
18: $\tilde{g}_A \leftarrow \frac{1}{s^2}(B^TB)^{-1}\tilde{g}_A^{lora} + XA$ ▷ adjust the gradients of LoRA with Theorem 2.1
19: $\tilde{g}_B \leftarrow \frac{1}{s^2}[I - B(B^TB)^{-1}B^T]\tilde{g}_B^{lora}(AA^T)^{-1} - BX$
20: $A \leftarrow \sqrt{1 - \gamma\lambda}\,A$ ▷ weight decay
21: $B \leftarrow \sqrt{1 - \gamma\lambda}\,B$
22: $W_0 \leftarrow (1 - \gamma\lambda)W_0$
23: $A_t \leftarrow A_{t-1} - \gamma\tilde{g}_A$
24: $B_t \leftarrow B_{t-1} - \gamma\tilde{g}_B$
25: until stopping criterion is met
26: return optimized parameters $A_t$ and $B_t$
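Read as code, one step of Algorithm 2 might look roughly like the following (a PyTorch rendition of ours, not the released implementation; full-size moment buffers are kept in `state`, and all helper names are our own):

```python
import torch
from scipy.linalg import solve_sylvester

@torch.no_grad()
def lora_pro_adamw_step(A, B, gA_lora, gB_lora, state, s, lr,
                        betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0, W0=None):
    """One LoRA-Pro AdamW step (sketch of Algorithm 2)."""
    BtB, AAt = B.T @ B, A @ A.T
    BtB_inv, AAt_inv = torch.linalg.inv(BtB), torch.linalg.inv(AAt)
    P = torch.eye(B.shape[0], dtype=B.dtype) - B @ BtB_inv @ B.T

    def adjust(ga, gb, solve_x):
        # Theorem 2.1, with X from the Sylvester equation (Theorem 2.3) or X = 0.
        if solve_x:
            rhs = -(1.0 / s**2) * (BtB_inv @ ga @ A.T)
            X = torch.as_tensor(solve_sylvester(BtB.numpy(), AAt.numpy(), rhs.numpy()),
                                dtype=A.dtype)
        else:
            X = torch.zeros(A.shape[0], A.shape[0], dtype=A.dtype)
        return (1 / s**2) * BtB_inv @ ga + X @ A, (1 / s**2) * P @ gb @ AAt_inv - B @ X

    gA, gB = adjust(gA_lora, gB_lora, solve_x=False)   # lines 6-8: X = 0 suffices here
    g_tilde = s * gB @ A + s * B @ gA                  # line 9: equivalent gradient
    state["t"] += 1
    state["m"].mul_(betas[0]).add_(g_tilde, alpha=1 - betas[0])               # line 10
    state["v"].mul_(betas[1]).addcmul_(g_tilde, g_tilde, value=1 - betas[1])  # line 11
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    g_adamw = m_hat / (v_hat.sqrt() + eps)                                    # line 14
    gA_new, gB_new = adjust(s * B.T @ g_adamw, s * g_adamw @ A.T, solve_x=True)  # lines 15-19
    if weight_decay > 0:                                                      # lines 20-22
        A.mul_((1 - lr * weight_decay) ** 0.5)
        B.mul_((1 - lr * weight_decay) ** 0.5)
        if W0 is not None:
            W0.mul_(1 - lr * weight_decay)
    A.sub_(lr * gA_new)                                                       # lines 23-24
    B.sub_(lr * gB_new)
```

Here `state` is assumed to be initialized as `{"t": 0, "m": torch.zeros(m, n), "v": torch.zeros(m, n)}`; the two full-size buffers account for most of LoRA-Pro's extra memory footprint noted in Section 3.4.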
D ADDITIONAL EXPERIMENTS

D.1 ABLATION STUDY OF THE SELECTION OF X

Based on Theorem 2.1, in LoRA-Pro, the matrix $X$ can be chosen arbitrarily. While its selection does not affect the equivalent gradient, it does influence the updates of matrices $A$ and $B$ in LoRA. Here, we conduct an ablation study on the choice of $X$, comparing three possible values. 1) Zero solution: in this simplest case, we set $X = 0$. 2) Sylvester solution: here, $X$ is obtained by solving the Sylvester equation, as described in Theorem 2.3. 3) Symmetry solution: this approach aims to balance the contributions of both terms in the equation $\tilde{g} = sg_BA + sBg_A$, enforcing the condition $g_BA = Bg_A$. For the symmetry solution, solving for $X$ yields:

$$X = -\frac{1}{2s}(B^TB)^{-1}B^TgA^T(AA^T)^{-1} = -\frac{1}{2s^2}(B^TB)^{-1}B^Tg_B^{lora}(AA^T)^{-1}. \quad (58)$$

The comparison of the selections of $X$ is presented in Table 6. As shown in the table, the choice of $X$ significantly impacts LoRA-Pro's performance, which is particularly evident on the GSM8K dataset. Different $X$ selections influence the subspaces of $A$ and $B$, ultimately affecting the optimization process described in Theorem 2.1. Our experiments demonstrate that the Sylvester solution consistently outperforms both the zero and symmetry solutions across all three evaluation tasks. We attribute the superior performance of the Sylvester solution to its ability to select subspaces for $A$ and $B$ that enable faster gradient descent (i.e., maximizing the approximation between the modified gradients $g_A$, $g_B$ and the LoRA gradients $g_A^{lora}$, $g_B^{lora}$).

Table 6: Ablation study on the selection of different X in LoRA-Pro.

| Choice of X | MT-Bench | GSM8K | HumanEval |
| --- | --- | --- | --- |
| Zero | 5.58±0.14 | 31.74±0.69 | 17.28±0.35 |
| Symmetry (Eq. (58)) | 5.71±0.11 | 42.81±0.62 | 17.88±0.35 |
| Sylvester (Thm. 2.3) | 5.72±0.03 | 57.57±0.50 | 22.97±0.35 |
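That the equivalent gradient is invariant to $X$ while the $(g_A, g_B)$ pair is not can be checked directly (a toy sketch of ours comparing the zero and symmetry solutions):

```python
import torch

torch.manual_seed(2)
m, n, r, s = 24, 40, 3, 1.5
A, B, g = torch.randn(r, n), torch.randn(m, r), torch.randn(m, n)
gA_lora, gB_lora = s * B.T @ g, s * g @ A.T
BtB_inv = torch.linalg.inv(B.T @ B)
AAt_inv = torch.linalg.inv(A @ A.T)
P = torch.eye(m) - B @ BtB_inv @ B.T

def adjusted(X):
    gA = (1 / s**2) * BtB_inv @ gA_lora + X @ A
    gB = (1 / s**2) * P @ gB_lora @ AAt_inv - B @ X
    return gA, gB

X_zero = torch.zeros(r, r)
X_sym = -(1 / (2 * s**2)) * BtB_inv @ B.T @ gB_lora @ AAt_inv   # Eq. (58)
(gA0, gB0), (gA1, gB1) = adjusted(X_zero), adjusted(X_sym)
eq0 = s * B @ gA0 + s * gB0 @ A
eq1 = s * B @ gA1 + s * gB1 @ A
print(torch.allclose(eq0, eq1))   # equivalent gradients agree...
print(torch.allclose(gA0, gA1))   # ...but the per-matrix updates differ (False)
```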
D.2 VISUALIZATION OF DIFFERENCES BETWEEN EQUIVALENT GRADIENTS AND FULL GRADIENTS

In this section, we fine-tune Llama-2-7B on the MetaMathQA 100k dataset and visualize the discrepancies between the equivalent gradients of LoRA and LoRA-Pro and the full gradients during training, i.e., the differences before and after the gradient adjustments. We present visualizations for different optimization modules, including the Q, K, V, O, Up, Down, and Gate layers, and provide results for these modules across the shallow (1), medium (15), and deep (31) layers of Llama-2-7B. The results are shown in Figure 4. From the figure, we can draw the following conclusions:

- After the gradient adjustments in LoRA-Pro, we observe a significant reduction in the distance between the equivalent gradients and the full gradients.
- In certain layers, the discrepancy between LoRA's equivalent gradients and the full gradients continues to increase (e.g., the Layer 1 O, Up, and Gate projections; the Layer 15 Up and Gate projections; and the Layer 31 O projection). However, in these layers, the discrepancy for LoRA-Pro remains stable, indicating that LoRA-Pro can consistently align with the full gradients during training, preventing the model from settling into sub-optimal solutions.
- In deep layers, the discrepancy between equivalent gradients and full gradients decreases as training progresses, whereas in shallow and medium layers, the discrepancy first increases and then stabilizes. The cause of this phenomenon is not yet clear, and we plan to investigate it further in future research.

These findings highlight that LoRA-Pro effectively reduces the distance between LoRA and full gradients during training and ensures continuous alignment with the full gradients, underscoring the efficacy of LoRA-Pro.
Figure 4: Visualization of the differences between the equivalent gradients of LoRA and LoRA-Pro and the full-parameter gradients during training, i.e., $\|\tilde{g} - g\|_F$. The rows illustrate the differences across various modules, including Q, K, V, O, Up, Down, and Gate. The columns show the differences at different depths, categorized as shallow (1), medium (15), and deep (31) layers.

D.3 EXPERIMENTAL RESULTS WITH DIFFERENT LEARNING RATES

To demonstrate the effectiveness of LoRA-Pro, we evaluate its performance on GSM8K under learning rates of 1e-5 and 5e-5, comparing it with LoRA and LoRA-GA. The results, presented in Table 7, show that LoRA-Pro maintains its advantages under both learning rates, highlighting its robustness to variations in the learning rate.

Table 7: Performance comparison of LoRA, LoRA-GA, and LoRA-Pro on GSM8K with learning rates 1e-5, 2e-5, and 5e-5.

| GSM8K | LoRA | LoRA-GA | LoRA-Pro |
| --- | --- | --- | --- |
| 1e-5 | 36.65±0.82 | 50.25±0.62 | 56.48±0.27 |
| 2e-5 | 42.08±0.04 | 53.60±0.30 | 57.57±0.50 |
| 5e-5 | 46.41±0.16 | 52.89±0.19 | 58.76±1.86 |

D.4 ADDITIONAL EXPERIMENTS ON LATEST MODELS

Table 8: Performance comparison of LoRA, LoRA-GA, and LoRA-Pro with Llama-2-7B and Llama-3.1-8B.

| GSM8K | LoRA | LoRA-GA | LoRA-Pro |
| --- | --- | --- | --- |
| Llama-2-7B | 42.08±0.04 | 53.60±0.30 | 54.23±0.79 |
| Llama-3.1-8B | 71.04±0.26 | 72.20±1.15 | 75.49±0.42 |

To further demonstrate the effectiveness of LoRA-Pro, we conduct additional experiments using the latest model, Llama-3.1-8B (Dubey et al., 2024). We fine-tune the model using three methods, LoRA, LoRA-GA, and LoRA-Pro, on the MetaMathQA 100k dataset and evaluate its performance on the GSM8K dataset. All results are averaged over three different random seeds. As shown in Table 8, LoRA-Pro demonstrates a clear advantage over both LoRA and LoRA-GA when applied to the Llama-3.1-8B model, further highlighting its effectiveness.