# LoRA-GA: Low-Rank Adaptation with Gradient Approximation

Shaowen Wang (wangsw23@mails.tsinghua.edu.cn), Linxi Yu (yulx23@mails.tsinghua.edu.cn), Jian Li (lijian83@mail.tsinghua.edu.cn). Tsinghua University, Beijing, China.

Fine-tuning large-scale pretrained models is prohibitively expensive in terms of computational and memory costs. LoRA, as one of the most popular Parameter-Efficient Fine-Tuning (PEFT) methods, offers a cost-effective alternative by fine-tuning an auxiliary low-rank model that has significantly fewer parameters. Although LoRA reduces the computational and memory requirements significantly at each iteration, extensive empirical evidence indicates that it converges at a considerably slower rate compared to full fine-tuning, ultimately leading to increased overall compute and often worse test performance. In our paper, we perform an in-depth investigation of the initialization method of LoRA and show that careful initialization (without any change of the architecture and the training algorithm) can significantly enhance both efficiency and performance. In particular, we introduce a novel initialization method, LoRA-GA (Low-Rank Adaptation with Gradient Approximation), which aligns the gradients of the low-rank matrix product with those of full fine-tuning at the first step. Our extensive experiments demonstrate that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning (hence being significantly faster than vanilla LoRA as well as various recent improvements) while simultaneously attaining comparable or even better performance. For example, on the subset of the GLUE dataset with T5-Base, LoRA-GA outperforms LoRA by 5.69% on average. On larger models such as Llama 2-7B, LoRA-GA shows performance improvements of 0.34, 11.52%, and 5.05% on MT-Bench, GSM8K, and HumanEval, respectively. Additionally, we observe up to 2-4 times convergence speed improvement compared to vanilla LoRA, validating its effectiveness in accelerating convergence and enhancing model performance. Code is available at code.

1 Introduction

Fine-tuning large language models (LLMs) is essential for enabling advanced techniques such as instruction fine-tuning [1], reinforcement learning from human feedback (RLHF) [2], and adapting models to specific downstream applications. However, the computational and storage costs associated with full fine-tuning are prohibitively high, particularly as model sizes continue to grow. To address these challenges, methods of Parameter-Efficient Fine-Tuning (PEFT) (see e.g., [3]), such as Low-Rank Adaptation (LoRA) [4], have emerged and gained significant attention.

Corresponding author. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: (Left) Training loss curves of Llama 2-7B on MetaMathQA versus training steps. LoRA-GA converges as quickly as full fine-tuning and outperforms LoRA. (Right) Initialization procedures used in LoRA and LoRA-GA. The key difference is that LoRA-GA initializes adapters using the eigenvectors of the gradient matrix, as opposed to random initialization with a scaling factor.

Instead of updating the parameters of the model directly, LoRA incorporates auxiliary low-rank matrices B and A into the linear layers of the model (such as the Q, K, V, and O matrices in a self-attention block [5]), while keeping the original layer weights W fixed. The modified layer is represented as y = (W + ηBA)x, where x is the input of that layer, y is the output, and η is the scaling factor.
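For concreteness, a minimal sketch of such a LoRA-augmented linear layer (hypothetical PyTorch code, not taken from the paper; class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer W0 plus a trainable low-rank update eta * B @ A."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Stand-in for the pretrained weight W0; frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.empty(r, d_in))     # trainable low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))    # trainable low-rank factor B, zero by default
        nn.init.kaiming_uniform_(self.A, a=5 ** 0.5)    # LoRA's default Kaiming init for A
        self.eta = alpha / r                            # scaling factor eta = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = (W0 + eta * B A) x, computed without materializing B @ A
        return x @ self.weight.T + self.eta * ((x @ self.A.T) @ self.B.T)
```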
This approach significantly reduces the number of parameters that need to be fine-tuned, thereby lowering the computational and memory costs at each step. Despite these benefits, extensive empirical evidence (see e.g., [6, 7, 8, 9]) shows that LoRA converges significantly slower compared to full fine-tuning. This slower convergence often increases overall computational costs (measured in floating point operations) and can sometimes lead to worse test performance. In our experiments, we typically observe that LoRA requires 5-6x more iterations and FLOPs to reach the same performance as full fine-tuning under the same learning rate, as shown in Figure 1.

To study the cause of slow convergence, we perform an in-depth investigation of the initialization strategy of LoRA's adapter weights. It is known that fine-tuning pretrained models using the same objective (e.g., language modeling) often converges faster than re-initializing new parameters (e.g., a classification head) [10]. This observation leads us to question whether the slow convergence of vanilla LoRA might be attributed to the default random initialization of adapter weights (LoRA initializes A using Kaiming initialization [11] and sets B to zero [4]). In our experiments, we find that different initialization strategies for LoRA can significantly impact the results, and its default initialization is suboptimal.

In pursuit of a convergence rate comparable to full fine-tuning, we aim for an initialization such that the update of BA closely matches the update of W. Previous work suggests that gradient descent operates in a low-dimensional subspace [12, 13]. If we can closely approximate the gradients of the full model at the initial step, subsequent steps can also be approximated, potentially accelerating the convergence of LoRA. To this end, we introduce a novel initialization method, LoRA-GA (Low Rank Gradient Approximation). By initializing A_init and B_init with the eigenvectors of the full gradient matrix, the gradient of the low-rank product BA aligns with the direction of the gradient of the full weight matrix W. Mathematically, we aim to ensure that ∇(BA) ≈ ζ∇W for some non-zero positive constant ζ.

Our contributions can be summarized as follows:
1. We propose LoRA-GA, a novel initialization method for LoRA that accelerates convergence by approximating the gradients of the low-rank matrices with those of the full weight matrix.
2. We identify the scaling factor under non-zero initialization, which ensures the variance of adapter outputs is invariant to the rank of the adapter and the dimension of the input.
3. We validate LoRA-GA through extensive experiments, demonstrating significant performance improvements and faster convergence compared to vanilla LoRA. Specifically, LoRA-GA outperforms LoRA by 5.69% on the GLUE [14] subset with T5-Base [15], and by 0.34, 11.52%, and 5.05% on MT-Bench [16], GSM8K [17], and HumanEval [18] with Llama 2-7B [19], respectively, while achieving up to 2-4 times faster convergence.

2 Related Work

2.1 Initialization

The significance of maintaining variance stability during initialization has been widely acknowledged to prevent the occurrence of diminishing or exploding phenomena. Xavier initialization [20] ensures stability in both the forward and backward passes of a network under a linear activation function.
He initialization [11] extends this solution to networks using ReLU activations. Distinct from these, LSUV initialization [21] selects a mini-batch of data, performs a forward pass to determine the output variance, and subsequently normalizes it to ensure stability. The tensor program framework (see e.g., [22]) has emerged as a powerful technique for tuning various hyperparameters, including the initialization, for large models.

2.2 Parameter-Efficient Fine-Tuning (PEFT)

To fine-tune increasingly large language models within limited hardware resources, researchers have developed various Parameter-Efficient Fine-Tuning (PEFT) methods. Adapter-based methods [23, 24, 25, 26] incorporate new layers into existing model layers. While fine-tuning only these inserted layers significantly reduces resource consumption and requires far fewer parameters, this approach introduces additional latency during both forward and backward passes. Soft prompt-based methods [10, 27, 28, 29, 30] prepend learnable soft tokens to the model's input to adapt the model to specific tasks. This approach effectively leverages the pre-trained model's capabilities, requiring only appropriate prompts for task adaptation, though it incurs computational overhead during inference. More broadly, GaLore [31] applies low-rank gradients to parameter updates for memory efficiency during training. While this approach is highly expressive and performant, it requires storing complete model checkpoints, consuming more storage than other PEFT methods.

2.3 LoRA's Variants

LoRA is one of the most popular PEFT methods; it introduces the product of low-rank matrices alongside existing layers to approximate the weight changes during fine-tuning. Several methods have been proposed to improve the structure of LoRA. AdaLoRA [32] dynamically prunes insignificant weights during fine-tuning using SVD, allowing more rank allocation to important areas within a fixed parameter budget. DoRA [8] enhances the model's expressiveness by adding learnable magnitudes to the direction adjustments made by low-rank matrix products. Additionally, LoHA [33] and LoKr [34] employ Hadamard and Kronecker products, respectively. Despite these advancements, vanilla LoRA remains the most popular method due to its robust library and hardware support. Therefore, improving LoRA without altering its structure and at a low cost is crucial. Several recent methods focus on this aspect. ReLoRA [35] suggests periodically merging learned adapters into the weight matrices to enhance LoRA's expressibility. LoRA+ [36] proposes using different learning rates for the two matrices in LoRA to improve convergence. rsLoRA [37] introduces a new scaling factor to make the scale of the output invariant to rank. Although our stable-scale approach appears similar to rsLoRA, rsLoRA assumes BA = 0 at initialization, so its scaling only makes the update of BA invariant to r. In contrast, our stable scale ensures that the non-zero initialized BA remains invariant to both the rank and the input dimension from the start. Recently, PiSSA [38] proposes initializing A and B to approximate the original matrix W by performing SVD on W. Our method, however, is based on a very different idea, namely approximating the gradient of W, which involves performing SVD on sampled gradients and properly scaling the initialized matrices, as detailed in Appendix E.

3 Method

In this section, we analyze the initialization of LoRA and introduce our method, LoRA-GA.
LoRA-GA consists of two key components: (i) approximating the direction of the gradient of full fine-tuning and (ii) ensuring rank and scale stability in the initialization process. We examine each component and subsequently present their integration within LoRA-GA.

3.1 Review of Vanilla LoRA

Structure of LoRA. Based on the hypothesis that the updates of fine-tuning are low-rank [13], LoRA [4] proposes to use the product of two low-rank matrices to represent the incremental part of the original matrix W. Here, W is the weight matrix of a linear layer in the model. For example, in transformers, it could be the Q, K, V, or O matrices of the self-attention layer or the weight matrix in the MLP layer. Specifically, LoRA has the following mathematical form:

W' = W_0 + ΔW = W_0 + (α/r) BA := W_0 + ηBA

where W', W_0 ∈ R^{m×n}, B ∈ R^{m×r}, and A ∈ R^{r×n}, with r ≪ min(m, n). W_0 is the pre-trained weight matrix and remains frozen during the fine-tuning process, while A and B are trainable.

Initialization of LoRA. Under LoRA's default initialization scheme [4, 39], matrix A is initialized using Kaiming uniform [11], while matrix B is initialized with all zeros. Consequently, BA = 0 and W' = W_0, ensuring that the initial parameters are unchanged. If the additional term ΔW = ηBA is initially non-zero (e.g., [38]), the frozen parameter can be adjusted to keep the initial parameters unchanged. This can be expressed as:

W' = (W_0 − ηB_init A_init) + ηBA := W_frozen + ηBA

where W_frozen = W_0 − ηB_init A_init is frozen, and B and A are trainable in this case.

3.2 Gradient Approximation

Our goal is to ensure that the first-step update Δ(ηBA) approximates the direction of the weight update ΔW of full fine-tuning, i.e., Δ(ηBA) ≈ ζΔW for some non-zero positive constant ζ. We will discuss how to choose ζ in Section 3.3; one can treat ζ as a fixed constant for now. Consider a gradient descent step with learning rate λ; the updates for A and B are ΔA = −λ∇_A L(A_init) and ΔB = −λ∇_B L(B_init), respectively. Assuming the learning rate λ is small, the update of ηBA at the first step can be expressed as:

η(ΔB A_init + B_init ΔA) = −ηλ[∇_B L(B_init) A_init + B_init ∇_A L(A_init)]

To measure how well it approximates the scaled update of the weights in full fine-tuning, ζΔW = −ζλ∇_W L(W_0), we use the Frobenius norm of the difference between these two updates as a criterion:

‖η(ΔB A_init + B_init ΔA) − ζΔW‖_F = λ‖η∇_B L(B_init) A_init + ηB_init ∇_A L(A_init) − ζ∇_W L(W_0)‖_F    (1)

Lemma 3.1. Suppose the loss function is L and y = W'x = (W_0 + ηBA)x, where y is the output of a layer and x is the input. The gradients of A and B are linear mappings of the gradient of W:

∇_A L = ηB^T ∇_W L,   ∇_B L = η(∇_W L) A^T

Remarkably, ∇_W L in LoRA and ∇_W L in full fine-tuning are equal at the beginning of training. By substituting the gradients in Lemma 3.1 into Equation 1, we can rewrite the criterion as follows:

λ‖η² ∇_W L(W_0) A_init^T A_init + η² B_init B_init^T ∇_W L(W_0) − ζ∇_W L(W_0)‖_F    (2)

This criterion evaluates how well the adapter's gradient approximates the direction of the gradient of full fine-tuning, and minimizing it brings the gradient of LoRA closer to that of full fine-tuning up to the scaling factor ζ:

min_{A_init, B_init} ‖η² ∇_W L A_init^T A_init + η² B_init B_init^T ∇_W L − ζ∇_W L‖_F    (3)

Theorem 3.1. For the optimization problem in Equation 3 with a given ζ, if the Singular Value Decomposition (SVD) of ∇_W L is ∇_W L = USV^T, the solution is:

A_init = (√ζ/η) V^T_{I_A},   B_init = (√ζ/η) U_{I_B},   such that |I_A| = |I_B| = r, I_A ∪ I_B = {i | 1 ≤ i ≤ 2r, i ∈ N}

where I_A and I_B are index sets (V_{I_A} and U_{I_B} collect the corresponding right and left singular vectors).

Theorem 3.1 provides an appropriate initialization scheme for A_init and B_init given a specific ζ.
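A small numerical check of the initialization in Theorem 3.1 (a hypothetical sketch; the gradient matrix, ζ, and η are arbitrary stand-ins, not values from the paper):

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 64, 32, 4
eta, zeta = 2.0, 0.5

# Gradient of the full weight matrix at W0 for some loss (here an arbitrary matrix).
grad_W = torch.randn(d_out, d_in)

# Theorem 3.1: SVD of the gradient, split the first 2r singular directions.
U, S, Vh = torch.linalg.svd(grad_W, full_matrices=False)
A_init = (zeta ** 0.5 / eta) * Vh[:r]           # rows 1..r of V^T     -> A in R^{r x d_in}
B_init = (zeta ** 0.5 / eta) * U[:, r:2 * r]    # columns r+1..2r of U -> B in R^{d_out x r}

# First-step direction of eta*BA as in Eq. (2): eta^2 (grad_W A^T A + B B^T grad_W).
approx = eta ** 2 * (grad_W @ A_init.T @ A_init + B_init @ B_init.T @ grad_W)

# It equals zeta times the best rank-2r approximation of grad_W (Eckart-Young).
best_2r = U[:, :2 * r] @ torch.diag(S[:2 * r]) @ Vh[:2 * r]
print(torch.allclose(approx, zeta * best_2r, atol=1e-4))  # True
```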
The selection of ζ, which influences the scaling of the update ηBA, will be discussed in the following section.

3.3 Scale Stability

Inspired by rsLoRA [37] and Kaiming initialization [11], we define the following notions of stability:

Definition 3.1. When d_out, d_in, r → ∞, an adapter ηBA exhibits two distinct types of scale stability:
1. Forward stability: if the inputs to the adapter are independently and identically distributed (i.i.d.) with second moment Θ_{r,d_out,d_in}(1), then the second moment of the outputs remains Θ_{r,d_out,d_in}(1).
2. Backward stability: if the gradient of the loss with respect to the adapter outputs is Θ_{r,d_out,d_in}(1), then the gradient with respect to the inputs remains Θ_{r,d_out,d_in}(1).

Theorem 3.2. Given the initialization proposed in Theorem 3.1, assume that the orthogonal vectors in A_init and B_init are randomly selected from the unit spheres in R^{d_in} and R^{d_out} with the constraint that the vectors are orthogonal to each other, and η = Θ_{r,d_out,d_in}(1/√r) as suggested by rsLoRA [37]. Under these conditions, the adapters are forward scale-stable if ζ = Θ_{r,d_out,d_in}(√(d_out/r²)) and backward scale-stable if ζ = Θ_{r,d_out,d_in}(√(d_in/r²)).

Similar to the results obtained from Kaiming initialization [11], we observe that either ζ = Θ_{r,d_out,d_in}(√(d_out/r²)) or ζ = Θ_{r,d_out,d_in}(√(d_in/r²)) works well independently. For all models presented in this paper, either form ensures convergence. Consequently, for all subsequent experiments, we adopt ζ = Θ_{r,d_out,d_in}(√(d_out/r²)).

Remark. We would like to remark that the scaling factor proposed in this subsection proves to be beneficial primarily when one adopts the learning rate typically used in full fine-tuning (e.g., 1e-5), since LoRA-GA attempts to approximate the updates of full fine-tuning. However, recent research [9] suggests that LoRA with default initialization performs much better with larger learning rates. Furthermore, tensor program analysis [40, 22] indicates that higher learning rates should be paired with smaller initialization magnitudes. Therefore, we recommend decreasing or omitting the scaling factor when training with larger learning rates (e.g., > 1e-4).

3.4 LoRA-GA Initialization

Combining the gradient approximation and stable scale components, we propose the LoRA-GA initialization method. First, we initialize A_init and B_init using the solution from Theorem 3.1. Then, we determine the scaling factor ζ according to Theorem 3.2 to ensure rank and scale stability. Thus, based on Theorems 3.1 and 3.2, we propose a novel initialization method, LoRA-GA.

LoRA-GA: We adopt η = α/√r and ζ = α²√(d_out)/(rγ) (i.e., √ζ/η equals the adapter scale ⁴√(d_out)/√γ used below), where γ is a hyperparameter. We define the index sets I_A = {i | 1 ≤ i ≤ r, i ∈ N} and I_B = {i | r+1 ≤ i ≤ 2r, i ∈ N}. Denote the singular value decomposition (SVD) of ∇_W L as ∇_W L = USV^T. The initializations are as follows:

A_init = (⁴√(d_out)/√γ) V^T_{[1:r]},   B_init = (⁴√(d_out)/√γ) U_{[r+1:2r]},   W_init = W_0 − ηB_init A_init

Algorithm 1 LoRA-GA Initialization
Require: Model f(·) with L layers, parameters W, sampled batch B = {x, y}, LoRA rank r, LoRA alpha α, loss function L, scale factor γ
Ensure: Initialized parameters W, η, A, B
1: ŷ ← f(x, W)  ▷ Forward pass
2: ℓ ← L(y, ŷ)
3: η ← α/√r
4: for l = L, ..., 1 do
5:   Compute ∇_{W_l} ℓ  ▷ Backward for one layer
6:   d_out, d_in ← size(W_l)
7:   U, S, V ← svd(∇_{W_l} ℓ)
8:   A_l ← V^T_{[1:r]} · ⁴√(d_out)/√γ
9:   B_l ← U_{[r+1:2r]} · ⁴√(d_out)/√γ
10:  W_l ← W_l − ηB_l A_l
11:  Clear ∇_{W_l} ℓ  ▷ Gradient for this layer is no longer needed
12: end for
13: return W, η, A, B

To save GPU memory during LoRA-GA initialization, we utilize a technique similar to [41]. By hooking into PyTorch's backward process, we compute the gradient for one layer at a time and discard the computed gradient immediately. This ensures that our memory usage remains O(1) instead of O(L), where L is the number of layers. This approach allows the memory consumption during the initialization phase to be lower than that during the subsequent LoRA fine-tuning phase. Our algorithm is shown in Algorithm 1.
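A minimal sketch of this per-layer hook idea (hypothetical PyTorch ≥ 2.1 code, not the authors' implementation; the offsetting of W_0 by ηB_init A_init is omitted, and the scale follows the form reconstructed above):

```python
import torch

def attach_loraga_hooks(model, r: int, gamma: float, store: dict):
    """Turn each weight's gradient into LoRA factors as soon as it is produced, then free it."""
    handles = []
    for name, p in model.named_parameters():
        if p.ndim != 2:
            continue  # in this sketch, only matrix-shaped weights receive adapters

        def hook(param, name=name):
            grad = param.grad.float()
            d_out = grad.shape[0]
            U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
            scale = d_out ** 0.25 / gamma ** 0.5            # assumed form of the LoRA-GA scale
            store[name] = (scale * Vh[:r], scale * U[:, r:2 * r])  # (A_init, B_init)
            param.grad = None                               # free this layer's gradient immediately

        handles.append(p.register_post_accumulate_grad_hook(hook))
    return handles

# Usage sketch: one forward/backward pass on the sampled batch, then remove the hooks.
# store = {}; handles = attach_loraga_hooks(model, r=8, gamma=64.0, store=store)
# loss_fn(model(x), y).backward()
# for h in handles:
#     h.remove()
```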
If the sampled batch size is large, we can also use gradient accumulation to save memory further, as shown in Algorithm 2 (Appendix C).

4 Experiments

In this section, we evaluate the performance of LoRA-GA on various benchmark datasets. Initially, we assess Natural Language Understanding (NLU) capabilities using a subset of the GLUE dataset [14] with the T5-Base model [15]. Subsequently, we evaluate dialogue [16, 42], mathematical reasoning [17, 43], and coding abilities [18, 44] using the Llama 2-7B model [19]. Finally, we conduct an ablation study to demonstrate the effectiveness of our method.

Baselines. We compare LoRA-GA with several baselines to demonstrate its effectiveness:
1. Full-Finetune: fine-tuning the model with all parameters, which requires the most resources.
2. Vanilla LoRA [4]: fine-tuning the model by inserting a low-rank matrix product BA into linear layers. A is initialized using Kaiming initialization, while B is initialized to zero.
3. LoRA variants with the original structure, i.e., methods that retain the original LoRA structure:
- rsLoRA [37] introduces a new scaling factor to stabilize the scale of LoRA.
- LoRA+ [36] updates the two matrices in LoRA with different learning rates.
- PiSSA [38] proposes performing SVD on the weight matrix W at the beginning of training and initializing A and B based on the components with larger singular values.
4. LoRA variants with a modified structure, i.e., methods that modify the original LoRA structure:
- DoRA [8] enhances the model's expressiveness by adding learnable magnitudes.
- AdaLoRA [32] dynamically prunes insignificant weights during fine-tuning using SVD, allowing more rank allocation to important areas within a fixed parameter budget.

4.1 Experiments on Natural Language Understanding

Models and Datasets. We fine-tune the T5-Base model on several datasets from the GLUE benchmark, including MNLI, SST-2, CoLA, QNLI, and MRPC. Performance is evaluated on the development set using accuracy as the primary metric.

Table 1: Results of fine-tuning T5-Base using Full-FT and various LoRA variants on a subset of GLUE.

| Method | MNLI | SST-2 | CoLA | QNLI | MRPC | Average |
|---|---|---|---|---|---|---|
| Size | 393k | 67k | 8.5k | 105k | 3.7k | |
| Full | 86.33±0.00 | 94.75±0.21 | 80.70±0.24 | 93.19±0.22 | 84.56±0.73 | 87.91 |
| LoRA | 85.30±0.04 | 94.04±0.11 | 69.35±0.05 | 92.96±0.09 | 68.38±0.01 | 82.08 |
| PiSSA | 85.75±0.07 | 94.07±0.06 | 74.27±0.39 | 93.15±0.14 | 76.31±0.51 | 84.71 |
| rsLoRA | 85.73±0.10 | 94.19±0.23 | 72.32±1.12 | 93.12±0.09 | 52.86±2.27 | 79.64 |
| LoRA+ | 85.81±0.09 | 93.85±0.24 | 77.53±0.20 | 93.14±0.03 | 74.43±1.39 | 84.95 |
| DoRA | 85.67±0.09 | 94.04±0.53 | 72.04±0.94 | 93.04±0.06 | 68.08±0.51 | 82.57 |
| AdaLoRA | 85.45±0.11 | 93.69±0.20 | 69.16±0.24 | 91.66±0.05 | 68.14±0.28 | 81.62 |
| LoRA-GA | 85.70±0.09 | 94.11±0.18 | 80.57±0.20 | 93.18±0.06 | 85.29±0.24 | 87.77 |

Implementation Details. We utilize prompt tuning to fine-tune the T5-Base model on the GLUE benchmark. This involves converting labels into tokens (e.g., "positive" or "negative") and using the normalized probability of these tokens as the predicted label probability for classification. We provide the hyperparameters in Appendix D.1.
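As an illustration of this label-token scoring (a hypothetical sketch using Hugging Face Transformers; the prompt prefix and label words are placeholders, not the paper's exact setup):

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def classify(sentence: str, label_words=("negative", "positive")) -> int:
    """Score each label word by the probability the decoder assigns to its first token."""
    enc = tokenizer("sst2 sentence: " + sentence, return_tensors="pt")
    start = torch.full((1, 1), model.config.decoder_start_token_id)       # single decoder step
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=start).logits[0, -1]      # (vocab_size,)
    label_ids = [tokenizer(w, add_special_tokens=False).input_ids[0] for w in label_words]
    probs = torch.softmax(logits[label_ids], dim=-1)                      # normalize over label tokens only
    return int(probs.argmax())
```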
Each experiment is conducted with 3 different random seeds, and the average performance is reported.

Results. As shown in Table 1, LoRA-GA consistently outperforms the original LoRA and other baseline methods, achieving performance comparable to full fine-tuning. Notably, LoRA-GA excels on smaller datasets such as CoLA and MRPC, demonstrating its ability to converge faster and effectively utilize limited training data.

4.2 Experiments on Large Language Models

Models and Datasets. To evaluate the scalability of LoRA-GA, we train Llama 2-7B on three tasks: chat, math, and code.
1. Chat: We train our model on a 52k subset of WizardLM [42], filtering out responses that begin with "As an AI" or "Sorry". We test our model on the MT-Bench dataset [16], which consists of 80 multi-turn questions designed to assess LLMs on multiple aspects. The quality of the responses is judged by GPT-4, and we report the first-turn score.
2. Math: We train our model on a 100k subset of MetaMathQA [43], a dataset bootstrapped from other math instruction tuning datasets such as GSM8K [17] and MATH [45], with higher complexity and diversity. We select data bootstrapped from the GSM8K training set and apply filtering. Accuracy is reported on the GSM8K evaluation set.
3. Code: We train our model on a 100k subset of Code-Feedback [44], a high-quality code instruction dataset, removing explanations after code blocks. The model is tested on HumanEval [18], which consists of 180 Python tasks, and we report the PASS@1 metric.

Implementation Details. Our model is trained using standard supervised learning for language modeling. The loss for the input prompt is set to zero. Detailed hyperparameters can be found in Appendix D.2. Each experiment uses 3 different random seeds, and the average performance across these runs is reported.

Results. Our results, as summarized in Table 2, indicate that LoRA-GA outperforms or is comparable to other methods, including full fine-tuning. Specifically, LoRA-GA achieves superior performance on both the GSM8K and HumanEval datasets, underscoring its effectiveness in handling tasks with higher complexity and diversity. On MT-Bench, LoRA-GA also demonstrates competitive performance, although it slightly trails DoRA. Nevertheless, LoRA-GA achieves this with fewer parameters and approximately 70% of the training time required by DoRA. Additionally, as illustrated in Figure 2 (Left), our method exhibits a significantly faster convergence rate compared to vanilla LoRA, with convergence comparable to that of full fine-tuning.

Effect of Rank. We attribute the performance discrepancies on the GSM8K and HumanEval datasets, when compared to full fine-tuning, primarily to the representational limitations imposed by the low-rank approximation. To address this, we experimented with higher ranks, specifically rank=32 and rank=128. Our findings reveal that LoRA-GA maintains stability across different rank settings and, in some cases, even surpasses full fine-tuning performance. As shown in Figure 2 (Left), higher ranks with our initialization also result in loss curves that closely resemble those of full fine-tuning.
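Returning to the implementation detail that the loss on the input prompt is set to zero: this corresponds to masking prompt positions in the labels. A minimal sketch (hypothetical helper; -100 is the ignore index used by PyTorch's cross-entropy and by Hugging Face causal-LM losses):

```python
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int, pad_id: int) -> torch.Tensor:
    """Copy the targets but mask prompt and padding positions so they contribute no loss."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100       # no loss on the instruction/prompt tokens
    labels[labels == pad_id] = -100     # no loss on padding
    return labels

# Usage: loss = model(input_ids=input_ids, labels=build_labels(input_ids, prompt_len, pad_id)).loss
```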
Table 2: Results of fine-tuning Llama 2-7B using Full-FT and various LoRA variants, tested on MT-Bench, GSM8K, and HumanEval. LoRA-GA significantly outperforms vanilla LoRA and approaches the performance of full fine-tuning. Unless otherwise specified, the LoRA rank is set to 8.

| Method | MT-Bench | GSM8K | HumanEval |
|---|---|---|---|
| Full | 5.56±0.09 | 54.20±0.42 | 19.87±0.57 |
| LoRA | 5.61±0.10 | 42.08±0.04 | 14.76±0.17 |
| PiSSA | 5.30±0.02 | 44.54±0.27 | 16.02±0.78 |
| rsLoRA | 5.25±0.03 | 45.62±0.10 | 16.01±0.79 |
| LoRA+ | 5.71±0.08 | 52.11±0.62 | 18.17±0.52 |
| DoRA | 5.97±0.02 | 53.07±0.75 | 19.75±0.41 |
| AdaLoRA | 5.57±0.05 | 50.72±1.39 | 17.80±0.44 |
| LoRA-GA | 5.95±0.16 | 53.60±0.30 | 19.81±1.46 |
| LoRA-GA (Rank=32) | 5.79±0.09 | 55.12±0.30 | 20.18±0.19 |
| LoRA-GA (Rank=128) | 6.13±0.07 | 55.07±0.18 | 23.05±0.37 |

4.3 Ablation Study

We conducted ablation studies to evaluate the contributions of non-zero initialization, stable output, and gradient approximation in LoRA-GA using five distinct experimental settings. Details of each setting are provided in Table 3.

Table 3: Initialization methods and corresponding settings for the ablation study. The table compares different initialization methods for LoRA and their settings for A, B, and η. "+SO" denotes stable output, scaling the parameters appropriately to ensure stability. "+GA" refers to gradient approximation, where A and B are initialized using orthogonal matrices derived from singular value decomposition.

| Method | A Initialization | B Initialization | η |
|---|---|---|---|
| LoRA | Kaiming | 0 | α/r |
| Gaussian | N(0, 1/d_out) | N(0, 1/d_in) | α/r |
| +SO | (⁴√(d_out)/√γ) N(0, 1/d_out) | (⁴√(d_out)/√γ) N(0, 1/d_in) | α/√r |
| +GA | V^T_{[1:r]} | U_{[r+1:2r]} | α/r |
| LoRA-GA | (⁴√(d_out)/√γ) V^T_{[1:r]} | (⁴√(d_out)/√γ) U_{[r+1:2r]} | α/√r |

Table 4: Performance of different settings in the ablation study. Results are shown for MT-Bench, GSM8K, and HumanEval on Llama 2-7B, as well as the average performance on a subset of GLUE on T5-Base. Detailed results can be found in Table 9.

| Method | MT-Bench | GSM8K | HumanEval | Average of GLUE |
|---|---|---|---|---|
| Full | 5.56±0.09 | 54.20±0.42 | 19.87±0.57 | 87.91 |
| LoRA | 5.61±0.10 | 42.08±0.04 | 14.76±0.17 | 82.08 |
| Gaussian | 5.62±0.11 | 38.21±0.06 | 14.76±0.68 | 81.88 |
| +SO | 5.72±0.04 | 42.81±1.14 | 15.55±0.78 | 82.28 |
| +GA | 5.48±0.02 | 46.65±1.17 | 16.15±0.78 | 82.54 |
| LoRA-GA | 5.95±0.16 | 53.60±0.30 | 19.81±1.46 | 87.77 |

Ablation Results. The results are presented in Tables 4 and 9. For both small and large models, we observe that simply changing LoRA's initialization to Gaussian does not yield any performance gains and may result in a slight performance decline. However, when combined with either "+SO" (stable output) or "+GA" (gradient approximation), performance improves upon that of LoRA. LoRA-GA, which integrates both techniques, outperforms the other methods. As shown in Figure 2 (Right) and Figure 4, +SO and +GA also enhance convergence speed, and when both are combined, the training loss curve is even closer to that of full fine-tuning. This indicates that both output stability and gradient approximation contribute to the improvement of LoRA, each addressing different aspects of the model's performance.

Figure 2: (Left) Training loss curves of LoRA-GA with different ranks on the MetaMathQA dataset. Higher ranks result in faster loss reduction, approaching the performance of full fine-tuning. (Right) Training loss curves from the ablation study with different settings on the MetaMathQA dataset. Compared to vanilla LoRA, both components of LoRA-GA, +SO (stable output) and +GA (gradient approximation), improve convergence speed. LoRA-GA achieves the fastest convergence, closely matching that of full fine-tuning.

4.4 Memory Costs and Running Time

We benchmark LoRA-GA on a single RTX 3090 24GB GPU, a 128-core CPU, and 256GB of RAM. As shown in Table 5, the memory consumption of our new method does not exceed that used for training with LoRA, indicating that no extra memory is needed.
Additionally, the time cost of this operation is relatively negligible compared to the subsequent fine-tuning process. For instance, in the Code-Feedback task, the training process took approximately 10 hours, while the initialization required only about 1 minute, which is insignificant.

Table 5: Memory and time costs for initialization and fine-tuning. "Parameters" indicates the number of parameters in the model, "Time (LoRA-GA)" represents the time required for initialization, "Memory (LoRA-GA)" shows the memory usage during initialization, and "LoRA" and "Full-FT" display the memory usage during LoRA and full fine-tuning, respectively.

| Model | Parameters | Time (LoRA-GA) | Memory (LoRA-GA) | LoRA | Full-FT |
|---|---|---|---|---|---|
| T5-Base | 220M | 2.8s | 1.69G | 2.71G | 3.87G |
| Llama 2-7B | 6738M | 74.7s | 18.77G | 23.18G | 63.92G |

4.5 Performance with Different Index Set Schemes

Theorem 3.1 establishes multiple optimal initialization schemes through different choices of the index sets I_A and I_B. While our primary experiments employed I_A = {1, ..., r} and I_B = {r+1, ..., 2r}, we conducted additional experiments to validate this choice by comparing three schemes:
- ArB2r: I_A = {1, ..., r}, I_B = {r+1, ..., 2r}
- A2rBr: I_A = {r+1, ..., 2r}, I_B = {1, ..., r}
- Random: random assignment of the first 2r indices into two groups

Table 6: Performance comparison of initialization schemes on GSM8K using models trained on the MetaMathQA subset.

| Scheme | ArB2r | A2rBr | Random |
|---|---|---|---|
| Performance | 52.79 | 52.38 | 52.01 |

As shown in Table 6, ArB2r slightly outperforms the alternatives. While Theorem 3.1 proves these schemes are equivalent at the first step, their behaviors diverge afterward. The gradient of matrix B (∇_B L = η(∇_W L) A^T) becomes larger than that of A (∇_A L = ηB^T ∇_W L), effectively increasing B's learning rate. This aligns with findings from LoRA+ [36], where larger learning rates for B proved beneficial, potentially explaining ArB2r's superior performance.

4.6 Impact of Sampled Batch Size

The gradient approximation in LoRA-GA uses sampled batches, with smaller batches resembling stochastic gradient descent (SGD) and larger ones approximating full gradient descent (GD). While theoretical work [46] suggests SGD's slower convergence may offer better generalization than GD, we conduct experiments to empirically evaluate different batch sizes. We assess gradient approximation quality by comparing gradients from various batch sizes against a reference batch size of 2048, which serves as a proxy for the full-dataset gradient, using two metrics:
- Sign similarity: the proportion of parameters sharing the same gradient sign.
- Magnitude similarity: the proportion of parameters within the same order of magnitude (where one's absolute value is not more than 10 times the other).

Table 7: Gradient similarity metrics (vs. batch size 2048) and model performance on GSM8K using models trained on the MetaMathQA subset.

| Batch Size | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|
| Sign Similarity | 0.743 | 0.790 | 0.838 | 0.875 | 0.903 | 0.925 |
| Magnitude Similarity | 0.878 | 0.908 | 0.933 | 0.950 | 0.962 | 0.971 |
| Performance | 52.79 | 52.99 | 52.91 | 53.56 | 52.57 | 53.22 |

As shown in Table 7, both similarity metrics consistently improve with larger batch sizes, indicating a better approximation of the full gradient. The results also demonstrate that while larger batch sizes tend to yield marginally better performance, the differences are relatively small. Based on these findings, we recommend using a moderately large batch size (e.g., 64) when computational resources permit.
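For reference, the two similarity metrics above can be computed per weight matrix as in this sketch (hypothetical helper, not taken from the paper):

```python
import torch

def gradient_similarity(g_small: torch.Tensor, g_ref: torch.Tensor, eps: float = 1e-12):
    """Sign and magnitude agreement between a small-batch gradient and a reference gradient."""
    sign_sim = (torch.sign(g_small) == torch.sign(g_ref)).float().mean()
    ratio = g_small.abs() / g_ref.abs().clamp_min(eps)
    mag_sim = ((ratio >= 0.1) & (ratio <= 10.0)).float().mean()  # within one order of magnitude
    return sign_sim.item(), mag_sim.item()
```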
5 Conclusions

In this paper, we present a novel initialization scheme for low-rank adaptation (LoRA), with the goal of accelerating its convergence. By examining the initialization methods and update processes of LoRA, we develop a new initialization method, LoRA-GA, which approximates the gradients of the low-rank matrix product with those of full fine-tuning from the very first step. Through extensive experiments, we have demonstrated that LoRA-GA achieves a convergence rate comparable to that of full fine-tuning while delivering similar or even superior performance. Since LoRA-GA solely modifies the initialization of LoRA without altering the architecture or training algorithms, it offers an efficient and effective approach that is easy to implement. Furthermore, it can also be combined with other LoRA variants. For example, ReLoRA [35] periodically merges the adapters into the frozen weights W, which may allow LoRA-GA to demonstrate its advantages over more steps. We leave this as an interesting future direction.

Acknowledgments and Disclosure of Funding

The authors are supported in part by the National Natural Science Foundation of China Grant 62161146004.

References

[1] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792, 2023.
[2] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[3] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024.
[4] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[6] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
[7] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. LISA: Layerwise importance sampling for memory-efficient large language model fine-tuning, 2024.
[8] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation, 2024.
[9] Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. arXiv preprint arXiv:2405.09673, 2024.
[10] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015.
[12] Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.
[13] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.
[14] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[15] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[16] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
[17] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[18] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[19] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[20] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[21] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
[22] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
[23] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[24] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning, 2022.
[25] Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. arXiv preprint arXiv:2205.12410, 2022.
[26] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020.
[27] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021.
[28] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, Jimmy Ba, and Amjad Almahairi. Residual prompt tuning: Improving prompt tuning with residual reparameterization, 2023.
[29] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. AI Open, 2023.
[30] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[31] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection, 2024.
[32] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning, 2023.
[33] Nam Hyeon-Woo, Moon Ye-Bin, and Tae-Hyun Oh. FedPara: Low-rank Hadamard product for communication-efficient federated learning. arXiv preprint arXiv:2108.06098, 2021.
[34] Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J Clark, and Mehdi Rezagholizadeh. KronA: Parameter efficient tuning with Kronecker adapter. arXiv preprint arXiv:2212.10650, 2022.
[35] Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. ReLoRA: High-rank training through low-rank updates, 2023.
[36] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models, 2024.
[37] Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with LoRA, 2023.
[38] Fanxu Meng, Zhaohui Wang, and Muhan Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models, 2024.
[39] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
[40] Greg Yang and Edward J Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2020.
[41] Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu. Full parameter fine-tuning for large language models with limited resources. arXiv preprint arXiv:2306.09782, 2023.
[42] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions, 2023.
[43] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models, 2024.
[44] Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. OpenCodeInterpreter: Integrating code generation with execution and refinement, 2024.
[45] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021.
[46] Idan Amir, Tomer Koren, and Roi Livni. SGD generalizes better than GD (and regularization doesn't help). In Conference on Learning Theory, pages 63–92. PMLR, 2021.
[47] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
[48] Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11(1):50–59, 1960.
[49] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
A Proofs of Theorems

A.1 Proof of Theorem 3.1

Lemma 3.1. Suppose the loss function is L and y = W'x = (W_0 + ηBA)x, where y is the output of a layer and x is the input. The gradients of the adapters A and B are linear mappings of the gradient of W:

∇_A L = ηB^T ∇_W L,   ∇_B L = η(∇_W L) A^T

Remarkably, the gradient of W in LoRA and the gradient of W in full fine-tuning are equal at the beginning of training.

Proof. For the gradients in LoRA,

∇_A L = ηB^T (∂L/∂y) x^T = ηB^T ∇_W L,   ∇_B L = η(∂L/∂y) x^T A^T = η(∇_W L) A^T.

At the beginning of training, both LoRA and full fine-tuning have the same output y and the same input x; therefore,

∇_W L (LoRA) = ∇_W L (full fine-tuning) = (∂L/∂y) x^T.

Theorem 3.1. Consider the following optimization problem:

min_{A_init, B_init} ‖η² ∇_W L A_init^T A_init + η² B_init B_init^T ∇_W L − ζ∇_W L‖_F

If the Singular Value Decomposition (SVD) of ∇_W L is ∇_W L = USV^T, the solution to this optimization problem is:

A_init = (√ζ/η) V^T_{I_A},   B_init = (√ζ/η) U_{I_B},   s.t. |I_A| = |I_B| = r, I_A ∪ I_B = {i | 1 ≤ i ≤ 2r, i ∈ N}

where I_A and I_B are index sets.

Proof. Since rank(A_init) = rank(B_init) = r and 2r < min(m, n), the matrix W̃ = η² ∇_W L A_init^T A_init + η² B_init B_init^T ∇_W L satisfies rank(W̃) ≤ 2r. Under the given solution,

W̃ = ζ USV^T (V_{I_A} V_{I_A}^T) + ζ (U_{I_B} U_{I_B}^T) USV^T = ζ Σ_{i∈I_A} σ_i u_i v_i^T + ζ Σ_{j∈I_B} σ_j u_j v_j^T = ζ Σ_{i=1}^{2r} σ_i u_i v_i^T

By the classic Eckart-Young theorem (see e.g., [47, 48]), the optimal low-rank approximation with respect to the Frobenius norm is

arg min_{rank(W̃)≤2r} ‖W̃ − ζ∇_W L‖_F = ζ Σ_{i=1}^{2r} σ_i u_i v_i^T

which is identical to the expression above. Therefore, this is the optimal solution.

A.2 Proof of Theorem 3.2

Lemma A.1. In R^n, if we randomly pick a vector x with Σ_{i=1}^{n} x_i² = 1 (i.e., uniformly from the unit sphere), then for distinct indices i, j, k:
1. E(x_i) = 0, E(x_i²) = 1/n, and E(x_i⁴) = Θ_{r,d_out,d_in}(1/n²);
2. E(x_i x_j) = 0;
3. E(x_i² x_j²) = Θ_{r,d_out,d_in}(1/n²);
4. E(x_i² x_j x_k) = 0.

Proof. Picking such a vector is equivalent to sampling a point uniformly from the unit sphere in R^n. For property 1, E(x_i) = 0 holds by symmetry. Since Σ_{i=1}^{n} x_i² = 1 and the coordinates are exchangeable, each entry has identical expectation, so E(Σ_{i=1}^{n} x_i²) = n E(x_i²) = 1 and E(x_i²) = 1/n. Moreover, E(x_i⁴) = E(x_i² · x_i²) = Θ(1/n) · Θ(1/n) = Θ(1/n²). Property 2 follows by symmetry: negating the coordinate x_i maps the sphere to itself, so E(x_i x_j) = 0. For property 3, E(x_i² x_j²) = Θ(1/n) · Θ(1/n) = Θ(1/n²). Property 4 again follows by symmetry: negating x_j maps the sphere to itself, so E(x_i² x_j x_k) = 0.

Lemma A.2. For a randomly selected orthogonal matrix A ∈ R^{n×n}, randomly pick two different column vectors x and y from it. For these two vectors and i ≠ j:
1. E(x_i y_i) = 0;
2. E(x_i y_j) = 0.

Proof. This is equivalent to first selecting a random vector x uniformly from the unit sphere in R^n and then selecting another vector y that is orthogonal to x. For property 1, Σ_{i=1}^{n} x_i y_i = 0, so E(Σ_{i=1}^{n} x_i y_i) = Σ_{i=1}^{n} E(x_i y_i) = 0, and by the exchangeability of coordinates E(x_i y_i) = 0. For property 2, note that E(x_i) = E(y_j) = 0 and that, given x, the vector −y is also orthogonal to x and equally likely; therefore E(x_i y_j) = 0.

Theorem 3.2. Given the initialization proposed in Theorem 3.1, assume that the orthogonal vectors in A_init and B_init are randomly selected from R^{d_in} and R^{d_out}, and set η = Θ_{r,d_out,d_in}(1/√r) as suggested by rsLoRA [37]. Under these conditions, the adapters are forward scale-stable if ζ = Θ_{r,d_out,d_in}(√(d_out/r²)) and backward scale-stable if ζ = Θ_{r,d_out,d_in}(√(d_in/r²)).

Proof.
In LoRA, h = (W + ηBA)x; since W is not considered here, denote y = ηBAx. During backpropagation, ∂L/∂x = ηA^T B^T ∂L/∂h. Represent ∂L/∂h by v and ∂L/∂x by g. Therefore,

y_i = η Σ_{j=1}^{r} Σ_{k=1}^{d_in} B_{ij} A_{jk} x_k,  1 ≤ i ≤ d_out  (Forward)
g_i = η Σ_{j=1}^{r} Σ_{k=1}^{d_out} A_{ji} B_{kj} v_k,  1 ≤ i ≤ d_in  (Backward)    (4)

Since the output of each layer in the model always passes through a softmax function, the vector ∂L/∂h is Θ_{r,d_out,d_in}(1). Further, since the inputs x_i are i.i.d., without loss of generality assume E(x_i) = 0 and E(x_i²) = 1. For the adapter, as Equation 4 shows, and using the expectations established in Lemmas A.1 and A.2, we can calculate the scales of the forward and backward passes. The scale of the forward pass is:

E(y_i²) = η² Σ_{j1,j2=1}^{r} Σ_{k1,k2=1}^{d_in} E(B_{i j1} A_{j1 k1} B_{i j2} A_{j2 k2} x_{k1} x_{k2})
        = η² Σ_{j1,j2=1}^{r} Σ_{k1,k2=1}^{d_in} E(B_{i j1} B_{i j2}) E(A_{j1 k1} A_{j2 k2}) E(x_{k1} x_{k2})
        = η² Σ_{j1,j2=1}^{r} Σ_{k=1}^{d_in} E(B_{i j1} B_{i j2}) E(A_{j1 k} A_{j2 k})
        = η² Σ_{j=1}^{r} Σ_{k=1}^{d_in} E(B_{ij}²) E(A_{jk}²)
        = η² · r · d_in · (ζ/(η² d_out)) · (ζ/(η² d_in)) = Θ_{r,d_out,d_in}(ζ² r² / (α² d_out))    (5)

The scale of the backward pass is:

E(g_i²) = η² Σ_{j1,j2=1}^{r} Σ_{k1,k2=1}^{d_out} E(A_{j1 i} B_{k1 j1} A_{j2 i} B_{k2 j2} v_{k1} v_{k2})
        = η² Σ_{j1,j2=1}^{r} Σ_{k1,k2=1}^{d_out} v_{k1} v_{k2} E(A_{j1 i} A_{j2 i}) E(B_{k1 j1} B_{k2 j2})
        = η² Σ_{j=1}^{r} Σ_{k=1}^{d_out} v_k² E(A_{ji}²) E(B_{kj}²)
        = η² · r · (ζ/(η² d_in)) · (ζ/(η² d_out)) · Σ_{k=1}^{d_out} v_k² = Θ_{r,d_out,d_in}(ζ² r² / (α² d_in))    (6)

From the results in Equations 5 and 6, one can see that we cannot find a single ζ that makes both scales Θ_{r,d_out,d_in}(1) unless d_out/d_in = Θ_{r,d_out,d_in}(1). We can also see that the forward scale is stable if we adopt ζ = Θ_{r,d_out,d_in}(√(d_out/r²)) and the backward scale is stable if ζ = Θ_{r,d_out,d_in}(√(d_in/r²)).

B Additional Experimental Results

B.1 Convergence Speed

As Figures 3 and 4 show, the convergence of LoRA-GA is significantly faster than that of vanilla LoRA and the other ablation models, and is almost as fast as full fine-tuning, which supports our claim about the speed of convergence.

Figure 3: Training loss curves of full fine-tuning, LoRA, and LoRA-GA on different datasets.
Figure 4: Training loss curves of different LoRA-GA ablations on different datasets.

B.2 Evaluating the Rank of the Gradient Matrix

Theorem 3.1 suggests that the closer the rank of the gradient matrix is to 2r, the better the gradient is approximated, thereby enhancing the theoretical effectiveness of our initialization. Figure 5 illustrates the low-rank nature of gradient matrices. The left panel depicts a grid-like pattern in the gradients of a weight matrix, indicating a low-rank structure. The middle panel shows a steeply declining curve of singular values, reflecting the highly low-rank nature of the gradient matrix. The right panel presents the cumulative curve of squared singular values, demonstrating that a few ranks account for nearly all the singular values of the gradient matrix. Specifically, the coverage in the right panel is defined as Coverage = (Σ_{i=1}^{2r} σ_i²) / (Σ_{i=1}^{n} σ_i²), where r is the LoRA rank used in LoRA-GA, indicating how much of the low-rank matrix can be approximated by this rank.

Figure 5: (Left) A gradient matrix of T5-Base during fine-tuning on CoLA. (Middle) The decreasing curve of singular values of the gradient matrix. (Right) The cumulative curve showing the coverage of squared singular values.

We further validate this observation on larger models by analyzing LLaMA 2-7B during MetaMathQA training. Table 8 presents the coverage across different layers with varying LoRA ranks. Even with a relatively small rank of 8, we achieve a mean coverage of 92.9% across all layers, with the minimum coverage being 85.1%. Increasing the rank to 128 yields an impressive mean coverage of 99.3%, with the minimum coverage reaching 97.5%. These results demonstrate that even for large models with weight matrices of dimension 4096, a modest LoRA rank is sufficient to capture the majority of the gradient information.
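A one-line version of this coverage statistic (hypothetical helper mirroring the definition above):

```python
import torch

def coverage(grad: torch.Tensor, r: int) -> float:
    """Fraction of squared singular values captured by the top 2r directions."""
    s = torch.linalg.svdvals(grad)
    return float(s[: 2 * r].square().sum() / s.square().sum())
```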
Table 8: Coverage of the gradient matrix across different layers in LLaMA 2-7B.

| LoRA Rank | 8 | 32 | 128 |
|---|---|---|---|
| Mean Coverage | 0.929 | 0.974 | 0.993 |
| Min Coverage | 0.851 | 0.933 | 0.975 |

B.3 Detailed Ablation Study Results on GLUE

Table 9 shows the full results of the ablation study on the subset of GLUE, whose average scores are briefly reported in Table 4. As Table 9 demonstrates, LoRA-GA outperforms all other ablation models, while both the "+SO" and "+GA" methods gain some improvement over vanilla LoRA and the simple non-zero initialization "Gaussian". This illustrates that both components of LoRA-GA contribute positively to the improvement in performance.

Table 9: Performance comparison of different ablations on the subset of the GLUE dataset. The settings are elaborated in Table 3.

| Method | MNLI | SST-2 | CoLA | QNLI | MRPC | Average |
|---|---|---|---|---|---|---|
| Trainset | 393k | 67k | 8.5k | 105k | 3.7k | |
| Full | 86.33±0.00 | 94.75±0.21 | 80.70±0.24 | 93.19±0.22 | 84.56±0.73 | 87.91 |
| LoRA | 85.30±0.04 | 94.04±0.11 | 69.35±0.05 | 92.96±0.09 | 68.38±0.01 | 82.08 |
| Gaussian | 85.26±0.07 | 93.85±0.18 | 69.00±0.22 | 92.89±0.08 | 68.38±0.00 | 81.88 |
| +SO | 85.47±0.19 | 94.23±0.13 | 70.63±0.78 | 93.12±0.07 | 67.97±0.75 | 82.28 |
| +GA | 85.33±0.07 | 93.88±0.18 | 74.37±1.12 | 93.03±0.06 | 66.09±11.32 | 82.54 |
| LoRA-GA | 85.70±0.09 | 94.11±0.18 | 80.57±0.20 | 93.18±0.06 | 85.29±0.24 | 87.77 |

B.4 Experimental Results with Different Learning Rates

Furthermore, we also conduct experiments under learning rates of 1e-5 and 5e-5. As Tables 10 and 11 show, LoRA-GA maintains strong performance across different learning rates, which illustrates its robustness to the variation of the learning rate.

Table 10: Performance comparison of different methods on MT-Bench, GSM8K, and HumanEval with learning rate 1e-5.

| Method | MT-Bench | GSM8K | HumanEval |
|---|---|---|---|
| Full | 5.63±0.04 | 43.95±1.95 | 15.97±0.42 |
| LoRA | 5.53±0.07 | 35.73±0.09 | 14.35±0.40 |
| PiSSA | 5.61±0.09 | 38.51±0.70 | 15.37±0.78 |
| rsLoRA | 5.60±0.10 | 40.56±0.47 | 15.69±0.87 |
| LoRA+ | 5.48±0.14 | 47.06±0.11 | 16.90±0.89 |
| LoRA-GA | 5.82±0.04 | 51.33±0.39 | 17.64±0.13 |

Table 11: Performance comparison of different methods on MT-Bench, GSM8K, and HumanEval with learning rate 5e-5.

| Method | MT-Bench | GSM8K | HumanEval |
|---|---|---|---|
| Full | 5.33±0.21 | 56.33±0.78 | 25.67±0.42 |
| LoRA | 5.52±0.08 | 46.89±0.05 | 15.67±0.60 |
| PiSSA | 5.35±0.01 | 49.70±0.80 | 17.62±0.60 |
| rsLoRA | 5.54±0.00 | 50.04±0.54 | 17.38±0.26 |
| LoRA+ | 5.89±0.11 | 55.23±0.16 | 19.21±0.37 |
| LoRA-GA | 5.76±0.22 | 52.79±1.02 | 20.45±0.92 |

B.5 Experiments on the Full MetaMathQA Dataset

Following [9], we conducted additional experiments by training on the complete MetaMathQA dataset for multiple epochs, whereas our main results in the previous section were based on fine-tuning for one epoch on the 100k subset of MetaMathQA. Due to computational constraints, we limited these extended experiments to three methods: LoRA [4], LoRA+ [36], and LoRA-GA. Table 12 presents the performance across four epochs, averaged over two random seeds.

Table 12: Performance comparison of different methods on the full MetaMathQA dataset trained for multiple epochs.

| Method | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 |
|---|---|---|---|---|
| LoRA (Rank=8) | 55.19 | 58.37 | 59.28 | 58.90 |
| LoRA+ (Rank=8) | 56.37 | 59.21 | 59.93 | 59.97 |
| LoRA-GA (Rank=8) | 56.48 | 58.64 | 60.16 | 60.88 |

The results show that LoRA-GA consistently achieves better performance than vanilla LoRA and outperforms LoRA+ in most cases across multiple epochs of training.
C LoRA-GA Initialization With Gradient Accumulation

Algorithm 2 LoRA-GA Initialization With Gradient Accumulation
Require: Model f(·) with L layers, parameters W, sampled batch B = {x, y} with n samples, LoRA rank r, LoRA alpha α, loss function L, scale factor γ, micro-batch size b
Ensure: Initialized parameters W, η, A, B
1: ŷ ← f(x, W)  ▷ Forward pass
2: ℓ ← L(y, ŷ)
3: η ← α/√r
4: for l = 1, ..., L do
5:   avg∇_{W_l} ℓ ← 0  ▷ Initialize the average gradient for each layer on CPU
6: end for
7: for each micro-batch B_i in B do
8:   ŷ_i ← f(x_i, W)  ▷ Forward pass for the micro-batch
9:   ℓ_i ← L(y_i, ŷ_i)
10:  for l = L, ..., 1 do
11:    Compute ∇_{W_l} ℓ_i  ▷ Backward pass for one layer
12:    avg∇_{W_l} ℓ ← avg∇_{W_l} ℓ + ∇_{W_l} ℓ_i · (b/n)  ▷ Move to CPU
13:    Clear ∇_{W_l} ℓ_i  ▷ Gradient for this layer is no longer needed
14:  end for
15: end for
16: for l = L, ..., 1 do
17:  d_out, d_in ← size(W_l)
18:  U, S, V ← svd(avg∇_{W_l} ℓ)
19:  A_l ← V^T_{[1:r]} · ⁴√(d_out)/√γ
20:  B_l ← U_{[r+1:2r]} · ⁴√(d_out)/√γ
21:  W_l ← W_l − ηB_l A_l
22: end for
23: return W, η, A, B

D Hyperparameters

D.1 Experiments on Natural Language Understanding

We use the following hyperparameters with T5-Base.
Training Algorithm: AdamW [49] with β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay of 0. For full fine-tuning, LoRA, and its variants, a learning rate of 1e-4, a warmup ratio of 0.03, and cosine decay are employed. For DoRA [8], a learning rate of 2e-4 is used, while for AdaLoRA, a learning rate of 5e-4 is applied, both with the same warmup ratio and cosine decay, adhering to their respective papers.
LoRA Hyperparameters: LoRA rank r = 8, α = 16. The LoRA target is all linear modules except the embedding layer, layer norm, and language model head.
LoRA-GA Hyperparameters: γ = 16, sampled batch size sbs = 8.
Other Hyperparameters: Sequence length T = 128, train batch size bs = 32, number of training epochs E = 1, precision FP32.

D.2 Experiments on Large Language Models

We use the following hyperparameters with Llama 2-7B.
Training Algorithm: AdamW [49] with β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay of 0. For full fine-tuning, LoRA, and its variants, a learning rate of 2e-5 [38], a warmup ratio of 0.03, and cosine decay are employed. For DoRA [8], a learning rate of 2e-4 is used, while for AdaLoRA, a learning rate of 5e-4 is applied, both with the same warmup ratio and cosine decay, adhering to their respective papers.
Precision: The backbone model uses bf16 precision, while during training LoRA's B and A matrices use fp32 precision, following the implementation of PEFT [39].
LoRA-GA Hyperparameters: γ = 64, micro sampled batch size sbs = 1 with gradient accumulation of 32.
LoRA Hyperparameters: LoRA rank r = 8 and α = 16 for all experiments.
Generation Hyperparameters: All generation is performed with top_p = 0.95 and temperature T = 0.8.
Other Hyperparameters: Number of training epochs E = 1, train micro batch size mbs = 1 with gradient accumulation of 32, sequence length T = 1024.

E Comparison between LoRA-GA and PiSSA

Both LoRA-GA and PiSSA [38] concentrate on the initialization of LoRA, and both utilize SVD on pre-trained models. While they may appear similar superficially, significant differences exist between them. Firstly, the motivations behind LoRA-GA and PiSSA are fundamentally different. As discussed in Section 3.2, LoRA-GA is motivated by approximating the LoRA update to the full fine-tuning update. We employ SVD on gradients solely because the optimal solution to the gradient approximation problem is precisely obtained (as stated in Theorem 3.1).
Conversely, PiSSA adopts SVD under the assumption that pre-trained weights possess a low intrinsic rank, and thus the SVD of the weights can provide an accurate representation of the original weights. In essence, LoRA-GA focuses on gradients and decomposes them, whereas PiSSA concentrates on weights and decomposes them. Secondly, LoRA-GA and PiSSA employ different scales of initialization. In Section 3.3, LoRA-GA derives an appropriate scaling factor by considering the forward and backward stability of our initialization scheme. On the other hand, PiSSA directly uses the largest r singular values as the magnitude of the orthogonal matrices.

F Limitations

In this paper, we have demonstrated that LoRA-GA can achieve performance comparable to full fine-tuning on the T5-Base (220M) and Llama 2-7B models, while significantly reducing the number of parameters and associated costs. However, due to computational resource constraints, we have not validated LoRA-GA on larger pre-trained models (e.g., Llama 2-70B). In LoRA-GA, we proposed that a scaling factor is necessary, but in some experiments with large learning rates we observed potential numerical instability due to the combined effect of the scaling factor and the learning rate. This limitation suggests a need for careful tuning of the scaling factor and learning rate to maintain stability. Another limitation pertains to our evaluation scope. While we provide evaluations on MT-Bench, GSM8K, and HumanEval, we did not assess our method on other datasets. Consequently, we cannot fully guarantee that our findings are universally consistent across all benchmarks. Additionally, we did not implement our method on other LoRA variants that are orthogonal to our improvements (e.g., ReLoRA [35]). Therefore, we cannot ascertain whether LoRA-GA would perform equally well with other LoRA architectures and improvements. Finally, compared to the original LoRA, LoRA-GA requires double the checkpoint storage, as it necessitates storing both the initial adapter checkpoints (A_init and B_init) and the final adapter checkpoints (A and B).

G Compute Resources

In this paper, we utilized two types of GPUs: the RTX 3090 24GB GPU, supported by a 128-core CPU and 256GB of RAM (hereinafter referred to as "the RTX 3090"), and the A100 80GB GPU (hereinafter referred to as "the A100"). For the experiments on T5-Base using the GLUE dataset, reported in Section 4.1, all computations were performed on a single RTX 3090. For the Llama 2-7B experiments, reported in Section 4.2, the full fine-tuning and DoRA scenarios were conducted on a single A100, while all other LoRA variants and LoRA-GA were executed on a single RTX 3090. Additionally, all ablation studies presented in Section 4.3 were carried out on a single RTX 3090.

H Broader Impacts

In this paper, we identify some limitations of vanilla LoRA and propose a more efficient and effective method for LoRA initialization, LoRA-GA. LoRA-GA converges faster than vanilla LoRA and consistently achieves better evaluation results. We believe that this work will have a positive social impact. The primary reasons are as follows: the high cost of training and fine-tuning large models is a significant challenge today. LoRA-GA offers a way to fine-tune with fewer parameters and lower computational costs while still achieving comparable performance. This will reduce the cost of fine-tuning models and, in turn, decrease energy consumption, such as electricity, contributing to the goal of a low-carbon environment.
F Limitations
In this paper, we have demonstrated that LoRA-GA can achieve performance comparable to full fine-tuning on the T5-Base (220M) and Llama 2-7B models, while significantly reducing the number of trainable parameters and the associated costs. However, due to computational resource constraints, we have not validated LoRA-GA on larger pre-trained models (e.g., Llama 2-70B).
In LoRA-GA, we proposed that a scaling factor is necessary. However, in some experiments with large learning rates we observed potential numerical instability caused by the interaction of the scaling factor and the learning rate. This limitation suggests that careful tuning of the scaling factor and the learning rate is needed to maintain stability.
Another limitation pertains to our evaluation scope. While we provide evaluations on MT-Bench, GSM8K, and HumanEval, we did not assess our method on other datasets. Consequently, we cannot fully guarantee that our findings are universally consistent across all benchmarks. Additionally, we did not implement our method on other LoRA variants that are orthogonal to our improvements (e.g., ReLoRA [35]). Therefore, we cannot ascertain whether LoRA-GA would perform equally well with other LoRA architectures and improvements.
Finally, compared to the original LoRA, LoRA-GA requires double the checkpoint storage, as it necessitates storing both the initial adapter checkpoints (A_init and B_init) and the final adapter checkpoints (A and B).

G Compute Resources
In this paper, we utilized two types of GPUs: the RTX 3090 24GB GPU, supported by a 128-core CPU and 256GB of RAM (hereinafter referred to as "the RTX 3090"), and the A100 80GB GPU (hereinafter referred to as "the A100"). For the experiments on T5-Base using the GLUE dataset, reported in Section 4.1, all computations were performed on a single RTX 3090. For the Llama 2-7B experiments, reported in Section 4.2, the full fine-tuning and DoRA runs were conducted on a single A100, while all other LoRA variants and LoRA-GA were executed on a single RTX 3090. Additionally, all ablation studies presented in Section 4.3 were carried out on a single RTX 3090.

H Broader Impacts
In this paper, we identify some limitations of vanilla LoRA and propose a more efficient and effective method for LoRA initialization, LoRA-GA. LoRA-GA converges faster than vanilla LoRA and consistently achieves better evaluation results. We believe that this work will have a positive social impact, for the following reasons.
The high cost of training and fine-tuning large models is a significant challenge today. LoRA-GA offers a way to fine-tune with fewer parameters and lower computational costs while still achieving comparable performance. This will reduce the cost of fine-tuning models and, in turn, decrease energy consumption, contributing to the goal of a low-carbon environment.
Furthermore, as the size of large language models (LLMs) continues to grow, it becomes increasingly difficult for individuals or small organizations to develop their own LLMs. With the help of LoRA-GA and open-source large models, the hardware barrier to entry in this area is greatly reduced. This promotes democratization in the field of large models and helps prevent monopolization by a few companies.
On the other hand, our method could potentially make it easier to train language models that generate fake news or misleading information. This underscores the necessity of designing effective detectors to identify content generated by large language models (LLMs). Ensuring the responsible use of this technology is crucial to mitigating the risks associated with the misuse of advanced language models.

NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We list three contributions at the end of the introduction. The LoRA-GA method and the stable scale are proposed in Section 3; the extensive experiments and results are presented in Section 4.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The last section of this paper (Appendix F) is entirely devoted to discussing the limitations of our work.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: One lemma (Lemma 3.1) and two theorems (Theorems 3.1 and 3.2) are proposed in our paper. All of them are properly proven in Appendix A.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Hyperparameters are disclosed in Appendix D. Other implementation details are disclosed in Section 4, in the paragraphs beginning with "Implementation details".
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: All code for our experiments is uploaded to an anonymous GitHub repository. All datasets used in our experiments are open source and are declared and cited in Section 1 (Introduction).
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Details that are important for understanding and evaluating the experimental results are given in the paper or appendix (e.g., Table 3). Other details can be found in our code.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report the standard deviations of our evaluation results as footnotes in Tables 1, 2, 4, 9, 10, and 11.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Discussed in Appendix G.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have read the NeurIPS Code of Ethics and are confident that our research conforms to it.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We have discussed this in Appendix H.
Guidelines: The answer NA means that there is no societal impact of the work performed.
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our research is purely fundamental research on LoRA, which does not pose such risks.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: All models and datasets used in our paper are properly cited.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: Instructions for running our code are provided together with the code.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: No human subjects or participants were involved in our research.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: No human subjects or participants were involved in our research.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.