# BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation

Published in Transactions on Machine Learning Research (07/2025)

Peijia Qin, pqin@ucsd.edu, University of California, San Diego
Ruiyi Zhang, ruz048@ucsd.edu, University of California, San Diego
Pengtao Xie, p1xie@ucsd.edu, University of California, San Diego

Reviewed on OpenReview: https://openreview.net/forum?id=v2xCm3VYl4

Abstract

Parameter-efficient fine-tuning (PEFT) is a flexible and efficient method for adapting large language models (LLMs) to downstream tasks. Among these methods, weight-decomposed low-rank adaptation (DoRA) is a promising approach that decomposes weight matrices into magnitude and direction components to better mimic full fine-tuning (FT). However, DoRA's simultaneous optimization of these components makes it over-expressive, increases the risk of overfitting, and creates a coupled updating pattern that limits its learning capacity. To address these issues, we propose Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation (BiDoRA), a novel PEFT method based on a bi-level optimization framework. BiDoRA fundamentally differs from DoRA by optimizing the magnitude and direction in two separate, asynchronous loops using distinct training and validation data splits. This decoupled optimization process effectively mitigates overfitting and allows for more flexible updates that align even more closely with FT. For instance, weight decomposition analysis shows that BiDoRA achieves a magnitude-direction update correlation of −8.042, significantly closer to the FT ideal than DoRA's −1.784. Evaluation of BiDoRA on diverse tasks spanning natural language understanding, generation, token classification, and extremely small biomedical datasets reveals that it consistently outperforms DoRA and a wide range of leading PEFT methods. This improvement is statistically significant, as demonstrated on the GLUE benchmark, where BiDoRA surpasses DoRA with a p-value of 2.4 × 10⁻⁴ under the Wilcoxon signed-rank test. The code for BiDoRA is available at https://github.com/t2ance/BiDoRA.

1 Introduction

Large language models (LLMs) (Radford et al., 2019; Brown et al., 2020) have achieved state-of-the-art results across a broad range of NLP tasks, from natural language understanding (NLU) (Wang et al., 2019) to natural language generation (NLG) (Novikova et al., 2017). Parameter-efficient fine-tuning (PEFT) methods (Houlsby et al., 2019; Hu et al., 2022b) have been introduced as a promising solution for adapting LLMs to downstream data. PEFT approaches update only a subset of the pre-trained parameters, achieving performance comparable to full fine-tuning (FT) while requiring significantly fewer computational resources. One popular type of PEFT is low-rank adaptation (LoRA; Hu et al., 2022b), which attaches low-rank matrices to the pre-trained weights and updates only these matrices during fine-tuning. Liu et al. (2024a) show that when the weights are decomposed into magnitude and direction, the correlation between their updates (see Appendix D) tends to be positive for LoRA, whereas it is negative for FT.
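To make this analysis concrete, the sketch below shows one way such a magnitude-direction decomposition can be computed in PyTorch. The exact quantities (ΔM, ΔD, and the correlation reported in Section 5.5) are defined in Appendix D, which is not reproduced here, so the normalization and averaging choices below are illustrative assumptions rather than the paper's precise definitions.

```python
import torch

def decompose(W: torch.Tensor):
    """Split a weight matrix into a column-wise magnitude and a direction.
    Assumed convention: magnitude = L2 norm of each column, direction = the
    unit-normalized column."""
    magnitude = W.norm(p=2, dim=0)               # shape (k,)
    direction = W / magnitude.clamp_min(1e-8)    # unit columns, shape (d, k)
    return magnitude, direction

def magnitude_direction_change(W0: torch.Tensor, Wt: torch.Tensor):
    """Average magnitude change (ΔM) and average directional change (ΔD)
    between a pre-trained weight W0 and its fine-tuned counterpart Wt.
    Illustrative only; the formal definitions are in Appendix D."""
    m0, d0 = decompose(W0)
    mt, dt = decompose(Wt)
    delta_m = (mt - m0).abs().mean()             # average absolute magnitude change
    cosine = (d0 * dt).sum(dim=0)                # column-wise cosine similarity
    delta_d = (1.0 - cosine).mean()              # average directional change
    return delta_m.item(), delta_d.item()
```

Collecting (ΔM, ΔD) pairs across layers and training checkpoints and computing their correlation yields the kind of correlation values compared in Fig. 3.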
To bridge this training-pattern distinction, they introduce an explicit reparameterization of the pre-trained weight matrix. The method, named DoRA, decomposes the weights into the column-wise product of a magnitude and a direction component, which determine the magnitude and direction of the weight update, respectively. This approach enables DoRA to share similar learning patterns with FT, thereby outperforming LoRA on multiple tasks. Nonetheless, DoRA introduces additional parameters and an over-expressive architecture compared to LoRA, which can exacerbate overfitting when adapting to small downstream datasets (see Table 3). Furthermore, in DoRA, the magnitude and direction components are optimized concurrently, leading to a constrained updating pattern due to the shared optimization setup (e.g., learning rate, optimizer, batch size).

To address the challenges above, we propose BiDoRA, a Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation method for PEFT. BiDoRA facilitates an even more flexible updating pattern and mitigates overfitting by optimizing the two components separately, on different data splits and at distinct optimization levels. BiDoRA is based on a bi-level optimization (BLO) framework: at the lower level, the low-rank direction component is updated using the training split while the magnitude component remains fixed; at the upper level, the magnitude component is updated by minimizing the loss on the validation split via hypergradient descent. Subsequently, the direction component is further fine-tuned with the optimal magnitude frozen to maximize performance. These two optimization steps are performed iteratively until convergence. Fig. 1 provides an overview of BiDoRA.

[Figure 1 (diagram): panels (a) Search phase and (b) Retraining phase, showing the magnitude m ∈ R^{1×k}, the direction V ∈ R^{d×k} with LoRA factors B ∈ R^{d×r} and A ∈ R^{r×k}, the column-wise normalization 1/‖V + ΔV‖_c, and the merged weight across n layers, with trainable, temporarily frozen, and untrainable components indicated.]
Figure 1: An overview of BiDoRA. BiDoRA performs PEFT using a BLO framework. At the lower level, BiDoRA learns the direction component ΔV of the update matrices using the training split of the downstream dataset. At the upper level, BiDoRA optimizes the magnitude component m with the optimized ΔV from the lower level, using the validation split of the dataset. After determining the optimal magnitude, the direction component undergoes further fine-tuning on a combined set of both training and validation splits to maximize overall performance.

A similar strategy for combating overfitting with BLO has been used in the well-established practice of differentiable neural architecture search (DARTS; Liu et al., 2019), where the architecture and the sub-networks are learned using different dataset splits. Optimizing the selection variables and sub-networks in a single loop can result in an over-expressive network, since the selection variables tend to select all sub-networks to achieve the best expressiveness, which in turn incurs severe overfitting. In contrast, training the sub-networks with the selection module fixed on the training split, while validating the effectiveness of the selection module on the unseen validation split, effectively eliminates the risk of overfitting.
Similarly, we treat the magnitude component as the architecture and the direction component as the sub-networks, and we train these components on separate data splits. As shown in Table 3, BiDoRA demonstrates better resistance to overfitting than DoRA, given the smaller performance gap between the training and test sets. Furthermore, the asynchronous gradient update steps at the two optimization levels in BiDoRA facilitate better decoupling of the two components, leading to a more flexible update pattern that more closely resembles FT. As illustrated in Fig. 3, the updates across different layers using BiDoRA have a correlation value that is closest to that of FT, highlighting its superior learning capability compared to both DoRA and LoRA.

Our work makes the following key contributions:

- We propose BiDoRA, a novel PEFT method based on bi-level optimization. In contrast to DoRA, which trains the magnitude and direction components on a single dataset, BiDoRA optimizes these components at different optimization levels. Our strategy effectively mitigates the risk of overfitting and results in a parameter update pattern that more closely resembles full fine-tuning.
- Extensive experiments on various downstream tasks highlight the superior performance of BiDoRA. BiDoRA consistently surpasses several baseline methods, including LoRA and DoRA.

2 Related Work

2.1 Parameter-Efficient Fine-Tuning Methods

Parameter-efficient fine-tuning (PEFT) methods aim to reduce the high cost of fully fine-tuning large-scale models by updating only a relatively small subset of pre-trained parameters, rather than the entire model, to adapt to downstream tasks. Existing PEFT methods fall mainly into three categories. The first category, adapter-based methods, injects additional trainable modules into the original frozen backbone. For instance, Houlsby et al. (2019) suggest adding linear modules in sequence to existing layers, while He et al. (2022) propose integrating these modules in parallel with the original layers to enhance performance. Recent advances include SAN (Xu et al., 2023), FADA (Bi et al., 2024), and SET (Yi et al., 2024). SAN presents a side adapter network attached to a frozen CLIP model, which contains two branches for predicting mask proposals and attention biases. FADA introduces a frequency-adapted learning scheme that uses the Haar wavelet transform to decompose frozen features into low- and high-frequency components, which are processed separately to enhance domain generalization. SET proposes a spectral-decomposed token learning framework that leverages the fast Fourier transform to separate frozen features into amplitude and phase components, enhancing them with spectral tokens and attention optimization. The second category is prompt tuning methods, which add extra soft tokens (prompts) to the initial input. During the fine-tuning stage, only these trainable soft tokens are updated, as demonstrated in works such as Lester et al. (2021) and Razdaibiedina et al. (2023). Unfortunately, the first two categories increase inference latency compared to fully fine-tuned models. The third prominent category focuses on low-rank adaptation, pioneered by LoRA (Hu et al., 2022a). LoRA injects trainable, low-rank matrices into a model's layers while freezing the original weights.
A key advantage is that these low-rank updates can be merged into the original weights before inference, thus incurring no additional latency. Subsequent works have aimed to improve LoRA's efficiency and performance. For instance, AdaLoRA (Zhang et al., 2023) dynamically reallocates the parameter budget based on importance scores of the weight matrices. Zhang et al. (2024b) use meta-learning to search for the optimal rank of LoRA matrices, further improving performance on downstream tasks. Pushing parameter efficiency further, VeRA (Kopiczko et al., 2024) employs a single pair of shared low-rank matrices across all layers, while AFLoRA (Liu et al., 2024b) freezes a portion of the adaptation parameters based on a learned score. A distinct sub-direction performs adaptation in the frequency domain, including FourierFT (Gao et al., 2024), LaMDA (Azizi et al., 2024), SSH (Shen et al., 2025b), and MaCP (Shen et al., 2025a). These methods learn updates in transformed spectral spaces, such as the Fourier, discrete Hartley, or discrete cosine domains, rather than directly in the weight space. Other research has focused on bridging the performance gap between LoRA and full fine-tuning. Liu et al. (2024a) found that LoRA's update patterns differ significantly from full fine-tuning, potentially constraining its learning capacity. To mitigate this, they proposed DoRA (Liu et al., 2024a), which decomposes pre-trained weights into magnitude and direction components and uses LoRA for efficient directional updates, better mimicking full fine-tuning.

2.2 Bi-level Optimization

Bi-level optimization (BLO) has been widely applied in various machine learning tasks, including meta-learning (Finn et al., 2017; Rajeswaran et al., 2019), neural architecture search (NAS) (Liu et al., 2019; Zhang et al., 2021), and hyperparameter optimization (Lorraine et al., 2020; Franceschi et al., 2017). Despite its wide usage, solving BLO problems can be challenging due to the inherently nested nature of the optimization. Several algorithms have been proposed to address this challenge, including zeroth-order methods such as Bayesian optimization (Cui & Bai, 2019) and first-order algorithms based on hypergradients (Pearlmutter & Siskind, 2008; Lorraine et al., 2020). Among these approaches, gradient-based BLO has received significant attention because it scales to high-dimensional problems with large numbers of trainable parameters. Inspired by NAS, where a bi-level approach learns an architecture and its sub-network weights on separate data splits to prevent overfitting, we adapt the BLO framework to parameter-efficient fine-tuning, specifically to the weight-decomposed adaptation introduced by DoRA. Unlike in NAS, where BLO searches for a network architecture, BiDoRA repurposes it to decouple the optimization of a weight matrix's two components: magnitude and direction. This approach marks a significant departure from previous PEFT methods such as LoRA and DoRA, which optimize all trainable parameters simultaneously on a single dataset. In this work, we extend the application of gradient-based BLO to develop a robust and effective PEFT method for pre-trained models.
By assigning the magnitude and direction components to different optimization levels with distinct data splits, BiDoRA creates a decoupled, flexible updating pattern that better mitigates overfitting and more closely resembles the learning behavior of full fine-tuning.

3 Preliminary

LoRA (Hu et al., 2022b) attaches the product of two low-rank matrices to the pre-trained weights and fine-tunes these low-rank matrices on downstream datasets with the pre-trained weights frozen. It is based on the assumption that parameter updates made during fine-tuning exhibit a low intrinsic rank. Formally, given a pre-trained weight matrix W0 ∈ R^{d×k}, LoRA attaches a low-rank update matrix ΔW ∈ R^{d×k} to the pre-trained weight. This update matrix can be decomposed as ΔW = BA, where B ∈ R^{d×r} and A ∈ R^{r×k} are two low-rank matrices with r ≪ min(d, k). Consequently, the weight matrix W is represented as follows:

W = W0 + ΔW = W0 + BA    (1)

In this setup, only the LoRA update ΔW (i.e., the matrices B and A) is trained. Liu et al. (2024a) found that LoRA and full fine-tuning exhibit different learning patterns by performing weight decomposition on fine-tuned weight matrices (see Appendix D). To bridge this discrepancy, weight-decomposed low-rank adaptation (DoRA; Liu et al., 2024a) further reparameterizes the weight matrices by explicitly decomposing them into learnable magnitude and direction components. Formally, DoRA performs adaptation as follows:

W = m · (V + ΔV) / ‖V + ΔV‖_c = m · (W0 + BA) / ‖W0 + BA‖_c    (2)

where V = W0 is the pre-trained directional component, ΔV = BA is the product of two learnable low-rank matrices B and A, and the magnitude component m ∈ R^{1×k} is a learnable vector. Here, ‖·‖_c denotes the vector-wise norm of a matrix computed across each column using the L2 norm. In DoRA, both components are optimized concurrently on a single downstream dataset. In this work, we aim to improve DoRA by further decoupling the training of the two components.

4.1 Overview of BiDoRA

Our method, BiDoRA, optimizes the trainable parameters in DoRA layers by solving a BLO problem. Let M = {m1, m2, ..., mn} denote the set of magnitude components of all n DoRA modules, and V = {ΔV1, ΔV2, ..., ΔVn} denote the set of corresponding direction components. Specifically, we first learn the direction components V*(M) on the training split Dtr of the downstream dataset at the lower level. The magnitude component M is tentatively fixed at this level; thus, the resulting optimal direction component V*(M) is a function of M. At the upper level, we determine the optimal magnitude component M by optimizing the loss on a validation split Dval. In practice, Dtr and Dval are typically created by splitting the original training set, without using any additional data. This BLO problem is solved with an efficient gradient-based algorithm in which the parameters at the two levels are optimized iteratively until convergence. While this work focuses on the empirical validation of BiDoRA, our choice of optimization strategy is grounded in established theoretical research. The convergence properties of similar gradient-based bi-level algorithms have been analyzed previously (Pedregosa, 2016; Rajeswaran et al., 2019), providing confidence in the stability of our training procedure. Furthermore, the ability of such frameworks to improve generalization, a core objective of BiDoRA, has also been formally studied (Bao et al., 2021), supporting the rationale that our approach can mitigate overfitting.
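For illustration, a minimal PyTorch sketch of a weight-decomposed layer in the spirit of Eq. (2) is given below. It is not the authors' implementation; in particular, initializing the magnitude to the column norms of W0, so the layer reproduces W0 before any training, is an assumption made here for convenience.

```python
import torch
import torch.nn as nn

class DecomposedLoRALayer(nn.Module):
    """Sketch of Eq. (2): W = m * (W0 + BA) / ||W0 + BA||_c, with a frozen
    pre-trained weight W0, low-rank factors B and A, and a magnitude vector m."""

    def __init__(self, W0: torch.Tensor, r: int = 4):
        super().__init__()
        d, k = W0.shape
        self.register_buffer("W0", W0)                    # frozen pre-trained weight
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}, zero-initialized
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A in R^{r x k}
        # assumption: m starts at the column norms of W0 so that the merged
        # weight equals W0 at initialization
        self.m = nn.Parameter(W0.norm(p=2, dim=0, keepdim=True))  # m in R^{1 x k}

    def merged_weight(self) -> torch.Tensor:
        V = self.W0 + self.B @ self.A                     # direction part V + ΔV
        col_norm = V.norm(p=2, dim=0, keepdim=True)       # ||.||_c, per column
        return self.m * V / col_norm                      # rescale each column by m

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.merged_weight().t()               # acts as a d x k linear map
```

In BiDoRA, m and (A, B) are exactly the parameters assigned to the upper and lower optimization levels, respectively.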
4.2 Orthogonal Regularization

A central goal of BiDoRA is to learn the two disentangled components of a weight update: magnitude and direction. The direction component, ΔV, is responsible for finding a low-rank basis of update directions. To maximize the expressive power of this component and prevent overfitting, its basis vectors (i.e., the columns of the direction matrix) should be as diverse and non-redundant as possible. Orthogonality of neural network weights has been identified as a beneficial property (Bansal et al., 2018) and can effectively mitigate overfitting (Balestriero & Baraniuk, 2018). By enforcing orthogonality, the direction vectors are constrained to represent distinct, independent pathways for updates, ensuring that the limited parameter budget of the low-rank matrix is used efficiently to explore the solution space. Therefore, we define a Gram regularization loss (Xie et al., 2017) on the direction component:

R(V) = Σ_{k=1}^{n} ‖(V_k + ΔV_k)ᵀ (V_k + ΔV_k) − I‖_F²    (3)

where I is the identity matrix and ‖·‖_F denotes the Frobenius norm. Intuitively, R(V) encourages the columns of each direction matrix, each representing a specific direction, to be orthogonal to one another. Since each column has already been normalized (equivalently, projected onto the unit sphere), this also pushes the columns away from one another, reducing parameter redundancy. The effectiveness of this constraint is empirically validated in our ablation study (see Table 5), which shows a consistent performance improvement resulting from the enhanced generalization ability of the learned direction component.

4.3 A Bi-level Optimization Framework

Lower level. At the lower level, we train the low-rank direction component V by minimizing a loss Ltr defined on the training split Dtr. The overall training objective at this level is Ltr(V, M) = L(V, M; Dtr) + γ R(V), where L is the fine-tuning loss given the low-rank direction component V, the magnitude component M, and the training split Dtr of the downstream dataset; R(V) is the orthogonal regularizer defined in Eq. (3), with γ as a trade-off hyperparameter. At this level, we update only V while keeping M fixed, resulting in the following optimization problem:

V*(M) = argmin_V Ltr(V, M)    (4)

where V*(M) denotes the optimal solution of this problem, which is a function of M.

Upper level. At the upper level, we validate the previously fixed magnitudes M on the validation split Dval, using the optimal direction component V*(M) learned at the lower level. This yields a validation loss Lval(V*(M), M) = L(V*(M), M; Dval). We determine the optimal magnitude component M by minimizing this validation loss:

min_M Lval(V*(M), M)    (5)

Algorithm 1: BiDoRA
Input: Training dataset Dtr and validation dataset Dval
1: Initialize the trainable magnitude components M = {m_k}_{k=1}^{n} and the low-rank direction components V = {ΔV_k}_{k=1}^{n} = {{A_k}_{k=1}^{n}, {B_k}_{k=1}^{n}}
2: // Search phase
3: while not converged do
4:   Update the magnitude M by descending ∇_M Lval(V − ξ ∇_V Ltr(V, M), M)
5:   Update the direction V by descending ∇_V Ltr(V, M)
6: Derive the optimal magnitude M* = {m*_k}_{k=1}^{n}
7: // Retraining phase
8: Train V to convergence on Dtr ∪ Dval and derive the optimal direction V*
Output: V* and M*

A bi-level optimization framework. Integrating the two levels of optimization problems, we obtain the following BLO framework:

min_M Lval(V*(M), M)   s.t.   V*(M) = argmin_V Ltr(V, M)    (6)
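The two objectives can be assembled as in the sketch below. The helper direction_matrices() (assumed to return the merged direction V_k + ΔV_k of every DoRA module) and the Hugging Face-style model(**batch).loss interface are illustrative assumptions; the paper's implementation is built on the Betty library rather than hand-rolled code of this kind.

```python
import torch

def gram_penalty(direction_matrices):
    """Orthogonal regularizer of Eq. (3): sum_k ||(V_k+ΔV_k)^T (V_k+ΔV_k) - I||_F^2."""
    loss = 0.0
    for V in direction_matrices:
        gram = V.t() @ V                                   # k x k Gram matrix of columns
        eye = torch.eye(gram.shape[0], device=V.device)
        loss = loss + ((gram - eye) ** 2).sum()            # squared Frobenius norm
    return loss

def lower_loss(model, train_batch, gamma):
    """L_tr(V, M): task loss on D_tr plus gamma * R(V); only the direction V is updated here."""
    task = model(**train_batch).loss                       # assumes an HF-style loss output
    return task + gamma * gram_penalty(model.direction_matrices())

def upper_loss(model, val_batch):
    """L_val(V*(M), M): plain task loss on D_val, driving the update of the magnitude M."""
    return model(**val_batch).loss
```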
Note that these two levels of optimization problems are mutually dependent on each other. The solution of the lower-level problem, V*(M), serves as a parameter of the upper-level problem, while the upper-level optimization variable M acts as a parameter of the lower-level problem. By solving these two interconnected problems jointly, we learn the optimal magnitude component M* and the incremental direction matrices V* in an end-to-end manner. Two reasons motivate placing the magnitude component at the upper level rather than the converse: 1) In the literature, the upper level usually has fewer parameters than the lower level; setting the magnitude, with complexity O(k), as the upper level and the direction, with complexity O(dr + kr), as the lower level is consistent with this common practice. 2) BiDoRA resembles DARTS (Liu et al., 2019) in the NAS literature, where sub-networks are selected by a selection variable. Specifically, the magnitude vector acts like a selection variable over the direction matrix, softly selecting each direction (sub-network) via scaling.

Optimization algorithm. We use a gradient-based optimization algorithm (Choe et al., 2023b) to solve the BLO problem in Eq. (6). A significant challenge in this process is the computation of the upper-level loss gradient with respect to the magnitude component M, because this gradient depends on the optimal solution of the lower-level problem, V*(M). For deep neural networks, the lower-level objective is non-convex, so finding the true optimal solution V*(M) would require running its optimization to full convergence. Performing this complete inner optimization for every single update of the upper-level variable M is computationally intractable. To address this issue, we use the following one-step-unrolled approximation of V*(M), inspired by previous work (Liu et al., 2019):

∇_M Lval(V*(M), M) ≈ ∇_M Lval(V − ξ ∇_V Ltr(V, M), M)    (7)

where ξ is the learning rate at the lower level, and the one-step-unrolled model V′ = V − ξ ∇_V Ltr(V, M) is used as a surrogate for the optimal solution V*(M). We then compute the approximated gradient as

∇_M Lval(V − ξ ∇_V Ltr(V, M), M) = ∇_M Lval(V′, M) − ξ ∇²_{M,V} Ltr(V, M) ∇_{V′} Lval(V′, M)    (8)
≈ ∇_M Lval(V′, M) − ξ (∇_M Ltr(V⁺, M) − ∇_M Ltr(V⁻, M)) / (2ε)    (9)

where ε is a small scalar and V± = V ± ε ∇_{V′} Lval(V′, M). Since directly computing the matrix-vector product in Eq. (8) is computationally expensive, we use a finite-difference approximation of this product, as in Eq. (9), following Liu et al. (2019). As detailed in Algorithm 1, the direction component V and the magnitude component M are updated with gradient descent iteratively until convergence. After acquiring the optimal magnitudes M* through this process, the direction component V is retrained on the union of the training and validation splits to achieve the best performance on downstream tasks, yielding the final learned V*. Both splits are intentionally used during retraining to maximize data utilization and performance. In practice, convergence of the search phase is determined by the evaluation metric at the upper level. For the subsequent retraining phase, we adopt a stopping criterion similar to DoRA's, monitoring performance on a separate, held-out test set that is not used during training.
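A sketch of this update rule is given below. It assumes functional helpers lower_loss(V, M, batch) and upper_loss(V, M, batch) that evaluate L_tr and L_val at explicitly supplied parameter lists; these helpers, and the plain list-based parameter handling, are simplifications for illustration, since the actual implementation relies on the Betty library (Choe et al., 2023b).

```python
import torch

def hypergradient_M(V, M, xi, train_batch, val_batch, lower_loss, upper_loss,
                    eps_scale=0.01):
    """One-step-unrolled hypergradient of Eqs. (7)-(9) with respect to the
    magnitude components M. V and M are lists of direction / magnitude tensors
    that require gradients."""
    # Virtual step on the direction: V' = V - xi * grad_V L_tr(V, M)   (Eq. 7)
    g_V = torch.autograd.grad(lower_loss(V, M, train_batch), V)
    V_prime = [(v - xi * g).detach().requires_grad_(True) for v, g in zip(V, g_V)]

    # Gradients of the validation loss at (V', M)
    val = upper_loss(V_prime, M, val_batch)
    g_M_val = torch.autograd.grad(val, M, retain_graph=True)
    g_Vp = torch.autograd.grad(val, V_prime)

    # Finite-difference surrogate for the mixed second-order term   (Eq. 9);
    # epsilon is chosen relative to the gradient norm, a DARTS-style heuristic.
    eps = eps_scale / torch.cat([g.flatten() for g in g_Vp]).norm()
    V_plus = [v + eps * g for v, g in zip(V, g_Vp)]
    V_minus = [v - eps * g for v, g in zip(V, g_Vp)]
    g_M_plus = torch.autograd.grad(lower_loss(V_plus, M, train_batch), M)
    g_M_minus = torch.autograd.grad(lower_loss(V_minus, M, train_batch), M)

    # grad_M L_val(V', M) - xi * (grad_M L_tr(V+, M) - grad_M L_tr(V-, M)) / (2 eps)
    return [gv - xi * (gp - gm) / (2 * eps)
            for gv, gp, gm in zip(g_M_val, g_M_plus, g_M_minus)]
```

In the search phase of Algorithm 1, this hypergradient drives one descent step on M using Dval, followed by an ordinary descent step on V using Dtr; the two alternate until convergence, after which V is retrained on Dtr ∪ Dval.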
Table 1: RoBERTa-base/large (Rb/l) and DeBERTa-XXL (DXXL) with different fine-tuning methods on the GLUE benchmark (Wang et al., 2019). A higher value is better for all datasets. The best results are shown in bold.

| Method | #Params | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Rb(FT) | 125.0M | 90.3 | 94.8 | 89.3 | 61.6 | 86.7 | 92.8 | 76.9 | 91.2 | 85.5 |
| Rb(Adapter) | 0.9M | 86.5 | 94.0 | 88.4 | 58.8 | 92.5 | 89.1 | 71.2 | 89.9 | 83.8 |
| Rb(LoRA) | 0.15M | 86.8 | 94.3 | 88.0 | 60.3 | 93.0 | 89.6 | 72.9 | 90.1 | 84.4 |
| Rb(DoRA) | 0.17M | 86.8 | 94.2 | 89.2 | 60.5 | 92.9 | 89.6 | 73.2 | 90.2 | 84.6 |
| Rb(BiDoRA) | 0.17M | 87.1 | 94.4 | 89.4 | 61.3 | 92.7 | 90.6 | 76.0 | 90.1 | 85.2 |
| Rl(FT) | 355.0M | 90.2 | 96.4 | 90.9 | 68.0 | 94.7 | 92.2 | 86.6 | 92.4 | 88.9 |
| Rl(Adapter) | 0.8M | 90.3 | 96.3 | 87.7 | 66.3 | 94.7 | 91.5 | 72.9 | 91.5 | 86.4 |
| Rl(LoRA) | 0.39M | 90.6 | 96.3 | 90.0 | 66.9 | 94.5 | 91.2 | 86.3 | 91.7 | 88.4 |
| Rl(DoRA) | 0.39M | 90.6 | 96.4 | 89.8 | 65.8 | 94.7 | 91.2 | 86.6 | 92.0 | 88.4 |
| Rl(BiDoRA) | 0.39M | 90.6 | 96.1 | 90.1 | 67.0 | 94.6 | 91.7 | 86.9 | 92.0 | 88.6 |
| DXXL(DoRA) | 4.9M | 91.2 | 96.3 | 92.3 | 71.1 | 95.3 | 91.6 | 91.8 | 90.8 | 90.0 |
| DXXL(BiDoRA) | 4.9M | 91.7 | 96.3 | 92.6 | 72.3 | 95.2 | 92.0 | 92.3 | 90.8 | 90.4 |

5 Experiments

5.1 Experimental Setup

We compare BiDoRA with several PEFT methods, including full fine-tuning (FT), adapter tuning (Houlsby et al., 2019; Lin et al., 2020; Rücklé et al., 2021; Pfeiffer et al., 2021), LoRA (Hu et al., 2022a), AdaLoRA (Zhang et al., 2023), DoRA (Liu et al., 2024a), VeRA (Kopiczko et al., 2024), FourierFT (Gao et al., 2024), AFLoRA (Liu et al., 2024b), LaMDA (Azizi et al., 2024), SSH (Shen et al., 2025b), and MaCP (Shen et al., 2025a). BiDoRA does not use any additional data compared to the other baselines, as we create the validation set for upper-level optimization by splitting the original training set with an 8:2 ratio for all tasks. Detailed descriptions of these baseline methods are provided in Appendix C. Our experiments cover a wide range of tasks, including natural language understanding (Section 5.2), extremely small biomedical datasets (Section 5.3), natural language generation (Appendix H.1), and token classification (Appendix H.2). Please refer to the detailed dataset settings and experimental settings in Appendix A and Appendix B, respectively. Our implementation is based on the Hugging Face Transformers library (Wolf et al., 2019) and the Betty library (Choe et al., 2023b).

5.2 Experiments on Natural Language Understanding Tasks

In this section, we evaluate the performance of BiDoRA on NLU tasks.

Main results. Table 1 presents the results of fine-tuning the RoBERTa-base, RoBERTa-large, and DeBERTa-XXL models on the GLUE benchmark with baseline PEFT methods and BiDoRA. The results show that BiDoRA achieves superior or comparable performance relative to the baseline methods across all datasets with the same number of trainable parameters. The superior performance of BiDoRA verifies the effectiveness of its BLO mechanism. By training the magnitude and direction components on two distinct sub-datasets, BiDoRA enhances the flexibility of the learning process and improves learning capacity compared to DoRA, resulting in a performance boost. We also present an experiment on the GLUE benchmark with the RoBERTa-base model against a larger and more diverse set of baselines, following the settings of Shen et al. (2025b) and Shen et al. (2025a) and citing their reported baseline results for reference.
The results in Table 2 indicate that BiDoRA consistently outperforms all baselines, including DoRA, across these diverse NLU tasks, demonstrating its robust generalization capability.

Table 2: Performance of various fine-tuning methods on the GLUE benchmark with the RoBERTa-base model. The best results are highlighted in bold and the second-best in italic.

| Model | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|
| FT | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.22 |
| BitFit | 93.7 | 92.7 | 62.0 | 91.8 | 81.5 | 90.8 | 85.42 |
| AdptD | 94.7 | 88.4 | 62.6 | 93.0 | 75.9 | 90.3 | 84.15 |
| LoRA | 95.1 | 89.7 | 63.4 | 93.3 | 78.4 | 91.5 | 85.23 |
| AdaLoRA | 94.5 | 88.7 | 62.0 | 93.1 | 81.0 | 90.5 | 84.97 |
| AFLoRA | 94.1 | 89.3 | 63.5 | 91.3 | 77.2 | 90.6 | 84.33 |
| LaMDA | 94.6 | 89.7 | 64.9 | 91.7 | 78.2 | 90.4 | 84.92 |
| VeRA | 94.6 | 89.5 | 65.6 | 91.8 | 78.7 | 90.7 | 85.15 |
| FourierFT | 94.2 | 90.0 | 63.8 | 92.2 | 79.1 | 90.8 | 85.02 |
| SSH | 94.1 | 91.2 | 63.6 | 92.4 | 80.5 | 90.9 | 85.46 |
| MaCP | 94.2 | 89.7 | 64.6 | 92.4 | 80.7 | 90.9 | 85.42 |
| DoRA (r = 8) | 94.9 | 89.9 | 63.7 | 93.3 | 78.9 | 91.5 | 85.37 |
| BiDoRA (r = 8) | 95.7 | 90.2 | 65.8 | 93.4 | 79.4 | 90.5 | 85.83 |
| DoRA (r = 16) | 94.8 | 90.4 | 65.6 | 93.1 | 81.9 | 90.7 | 86.08 |
| BiDoRA (r = 16) | 95.0 | 90.8 | 66.7 | 93.3 | 82.6 | 90.9 | 86.55 |

Table 3: Quantitative performance gap between the training and test sets for DoRA and BiDoRA with the RoBERTa-base model. The gap is calculated as the training metric minus the test metric; a smaller value indicates less overfitting.

| Method | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|
| DoRA | 2.0 | 9.5 | 32.5 | 6.6 | 18.0 | 8.8 | 12.9 |
| BiDoRA | 1.7 | 7.0 | 23.3 | 0.2 | 14.0 | 4.7 | 8.5 |

Figure 2: Training and test accuracy versus global training steps on the ModHayes split of the Reuters-21578 dataset (Padmanabhan et al., 2016) when fine-tuning a RoBERTa-base model with DoRA and BiDoRA. The training and test curves for DoRA show a larger gap than those for BiDoRA, highlighting the effectiveness of our method in reducing overfitting.

Robustness of BiDoRA towards different rank settings. We explore the impact of different rank configurations on BiDoRA and DoRA, evaluating them with ranks of 8 and 16 in addition to the rank of 4 used in Table 1. The average accuracies reported in Table 2 demonstrate that BiDoRA consistently surpasses DoRA across all rank configurations, highlighting its resilience and superior performance regardless of the rank setting.

Performance gap between the training and test sets. As visualized in Fig. 2, BiDoRA achieves a smaller gap between the training and test curves. Quantitatively, Table 3 presents this performance gap for the RoBERTa-base model. The training-set metric is calculated as a moving average of the per-batch metric with a decay ratio of 0.99. Since BiDoRA has two training loops, its training metric is a weighted average (0.8 × inner-loop metric + 0.2 × outer-loop metric), matching the data split sizes (inner : outer = 8 : 2 in our case). The results show that the performance gap for BiDoRA is consistently lower than that of DoRA across all datasets. This suggests that DoRA is more prone to overfitting, an issue that BiDoRA effectively addresses.
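For completeness, the sketch below reproduces how the training-side numbers behind Table 3 can be computed under the stated choices (an exponential moving average with decay 0.99 and a 0.8/0.2 weighting of the two loops); the class and variable names are illustrative, not taken from the paper's code.

```python
class EMAMetric:
    """Exponential moving average of a per-batch metric (decay 0.99)."""
    def __init__(self, decay: float = 0.99):
        self.decay, self.value = decay, None

    def update(self, metric: float) -> float:
        # metric is assumed to be a plain float, e.g. per-batch accuracy
        self.value = metric if self.value is None else (
            self.decay * self.value + (1.0 - self.decay) * metric)
        return self.value

# BiDoRA's training metric combines the two loops in proportion to the 8:2 split
inner, outer = EMAMetric(), EMAMetric()

def bidora_train_metric() -> float:
    return 0.8 * inner.value + 0.2 * outer.value

# The overfitting gap reported in Table 3 is then: training metric - test metric
```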
5.3 Experiments on Extremely Small Datasets

Table 4: Fine-tuning ESM on the thermostability prediction task (Chen et al., 2023) (left), the BBP task (Dai et al., 2021) (middle), and the MIC task (Ledesma-Fernandez et al., 2023) (right). A higher value is better for all metrics except MSE. The best results are highlighted in bold.

Thermostability prediction:

| Method | #Params | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| FT | 652.7M | 79.8 | 81.2 | 79.8 | 78.4 |
| LoRA | 1.5M | 75.9 | 78.2 | 75.9 | 75.5 |
| DoRA | 1.6M | 76.9 | 78.7 | 76.9 | 76.2 |
| BiDoRA | 1.6M | 78.8 | 79.1 | 78.8 | 78.2 |

BBP:

| Method | #Params | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| FT | 652.9M | 89.4 | 89.9 | 89.4 | 89.4 |
| LoRA | 1.9M | 86.8 | 87.7 | 86.8 | 86.7 |
| DoRA | 2.0M | 89.4 | 91.3 | 89.4 | 89.3 |
| BiDoRA | 2.0M | 92.1 | 93.1 | 92.1 | 92.0 |

MIC:

| Method | #Params | MSE |
|---|---|---|
| FT | 652.7M | 0.2894 |
| LoRA | 1.7M | 0.3433 |
| DoRA | 1.8M | 0.2918 |
| BiDoRA | 1.8M | 0.2818 |

We conduct additional experiments on biomedical datasets, including two classification tasks, thermostability prediction (Chen et al., 2023; 936 training samples) and blood-brain barrier peptide prediction (BBP; Dai et al., 2021; 200 training samples), and one regression task, minimum inhibitory concentration prediction (MIC; Ledesma-Fernandez et al., 2023; 3,695 training samples), all of which contain significantly fewer samples than standard NLP tasks. The results are presented in Table 4. Consistent with our previous findings, BiDoRA effectively fine-tunes pre-trained models on extremely small datasets. Our method outperforms the baselines by a larger margin as the dataset size decreases, confirming our previous conclusion that it effectively combats overfitting across various network architectures and diverse tasks.

Table 5: Ablation studies. We evaluate the performance of BiDoRA without retraining (w/o retraining), without BLO (ξ = 0), without orthogonal regularization (w/o cst.), and with retraining the magnitude component.

| Method | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| BiDoRA (retraining magnitude) | 87.0 | 94.3 | 89.1 | 60.7 | 92.7 | 91.0 | 73.4 | 89.9 | 84.8 |
| BiDoRA (w/o retraining) | 87.0 | 94.2 | 89.0 | 57.3 | 92.4 | 90.6 | 71.6 | 90.0 | 84.0 |
| BiDoRA (ξ = 0) | 86.9 | 94.2 | 89.0 | 59.4 | 90.8 | 91.2 | 75.9 | 90.0 | 84.7 |
| BiDoRA (w/o cst.) | 87.0 | 94.4 | 88.6 | 61.3 | 92.7 | 90.2 | 76.0 | 90.1 | 85.0 |
| BiDoRA | 87.1 | 94.4 | 89.4 | 61.3 | 92.7 | 90.6 | 76.1 | 90.1 | 85.2 |

5.4 Ablation Studies

In this section, we perform ablation studies to investigate the effectiveness of individual modules and strategies in BiDoRA. We fine-tune a RoBERTa-base model on the GLUE benchmark under different ablation settings; the results are shown in Table 5.

Retraining. We test the model directly obtained from the search phase to evaluate the effectiveness of further retraining the direction component. The results show that BiDoRA outperforms BiDoRA (w/o retraining) on average, highlighting the necessity of retraining. Table 5 also validates that retraining the direction component leads to better performance than retraining the magnitude.

Bi-level optimization. We set ξ to zero in Algorithm 1 to assess the effectiveness of the BLO framework. This ablation can be interpreted as an alternating learning method in which the two optimization steps are carried out alternately on two different splits of the training dataset. Notably, in this alternating scheme, the update of each component is unaware of the other, making training less stable. In contrast, the hypergradient used in BLO avoids this issue by explicitly coupling the two levels. The results show that BiDoRA outperforms BiDoRA (ξ = 0) on average, demonstrating the efficacy of the BLO strategy.

Orthogonal regularization. We examine the effectiveness of the orthogonality constraint in Eq. (3) by setting γ to zero. The results show that BiDoRA outperforms BiDoRA (w/o cst.) on average, indicating the effectiveness of the orthogonality regularizer in alleviating overfitting.
5.5 Weight Decomposition Analysis

One important motivation of DoRA is to bridge the inherent differences between LoRA and FT. Similar to DoRA, we conduct a weight decomposition analysis of the correlation between the change in magnitudes and the change in directions (detailed in Appendix D) for BiDoRA and the baseline methods by fine-tuning a GPT-2 medium model on the E2E dataset. As shown in Fig. 3, FT, DoRA, and BiDoRA all exhibit negative correlation values, while LoRA shows a positive correlation, consistent with the findings of Liu et al. (2024a). Notably, BiDoRA achieves a negative correlation of −8.042, closer to FT than DoRA's −1.784. This improvement is attributed to the decoupled training process of the two components, which allows for a higher learning capacity than DoRA.

[Figure 3 panels: (a) Full FT (k = −65.816), (b) LoRA (k = 0.836), (c) DoRA (k = −1.784), (d) BiDoRA (k = −8.042).]
Figure 3: Magnitude and direction updates for (a) FT, (b) LoRA, (c) DoRA, and (d) BiDoRA of the query matrices across different layers and intermediate steps after fine-tuning the GPT-2 model on the E2E dataset (Novikova et al., 2017), where k denotes the correlation value. Different markers represent matrices from different training steps, with each color corresponding to a specific layer. ΔM denotes the average change in weight vector magnitude, and ΔD denotes the average change in direction, as formally defined in Appendix D.

5.6 Discussion

The advantage of BiDoRA is supported by both theoretical insights and empirical evidence, as detailed below.

Motivation. Theoretically, Liu et al. (2024a) showed that LoRA's training pattern tends to be coupled in terms of magnitude-direction correlation, which degrades learning capacity. Their solution was to introduce a reparameterization that decouples these components in the formulation. We build upon DoRA following this line of reasoning and further decouple magnitude and direction in terms of training dynamics. Specifically, the two components are trained in separate loops within a bi-level optimization framework, which is expected to improve performance for reasons analogous to those behind DoRA.

Empirical evidence. We performed a Wilcoxon signed-rank test to compare the performance of DoRA and BiDoRA. Specifically, we used the results from Table 1. For each PEFT method, we collected 9 values (the 8 per-dataset results plus the average performance) from each base model and concatenated the results from the three base models (RoBERTa-base, RoBERTa-large, and DeBERTa-XXL) to obtain a list of 27 values. Comparing these 27 values between DoRA and BiDoRA reveals that BiDoRA is significantly better than DoRA, with a p-value of 2.4 × 10⁻⁴. This result demonstrates that BiDoRA offers a non-marginal improvement over DoRA. Additionally, the weight decomposition analysis (Fig. 3 and Fig. 4) indicates that BiDoRA achieves better decoupling of the components than DoRA. Evaluation metrics across various tasks demonstrate the superior performance of BiDoRA, confirming that our decoupled optimization loop leads to improved outcomes.

6 Conclusion and Future Works

We propose BiDoRA, a novel bi-level optimization framework for PEFT of large-scale pre-trained models. By conducting weight decomposition following the DoRA approach, our method trains the two components separately in two interconnected optimization levels using different sub-datasets.
In this way, Bi Do RA not only decouples the learning process of the two components, resulting in a learning pattern closer to FT, but also effectively alleviates overfitting. Empirical studies on various NLP tasks demonstrate that Bi Do RA outperforms Do RA and other baselines, highlighting the effectiveness of our method. One potential limitation of Bi Do RA is its training efficiency (see Appendix E) in terms of per-step cost, which could be reduced by using more advanced hyper-gradient estimators, such as SAMA (Choe et al., 2023a) or Mix Flow-MG (Kemaev et al., 2025). Furthermore, while we have empirically shown that Bi Do RA induces better decoupling between the magnitude and direction components (Fig. 3 and Fig. 4), a formal theoretical analysis of this property is currently lacking and serves for future work. Published in Transactions on Machine Learning Research (07/2025) Acknowledgments P.X. acknowledges funding support from NSF IIS2405974, NSF IIS2339216, NIH R35GM157217, and NIH R21GM154171. Seyedarmin Azizi, Souvik Kundu, and Massoud Pedram. Lamda: Large model fine-tuning via spectrally decomposed low-dimensional adaptation. Ar Xiv preprint, abs/2406.12832, 2024. URL https://arxiv. org/abs/2406.12832. Randall Balestriero and richard baraniuk. A spline theory of deep learning. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 374 383. PMLR, 2018. URL https://proceedings.mlr. press/v80/balestriero18b.html. Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss (eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65 72, Ann Arbor, Michigan, 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909. Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep cnns? In Neural Information Processing Systems, 2018. URL https: //api.semanticscholar.org/Corpus ID:55704502. Fan Bao, Guoqiang Wu, Chongxuan Li, Jun Zhu, and Bo Zhang. Stability and generalization of bilevel programming in hyperparameter optimization. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 4529 4541, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 2406a0a94c80406914ff2f6c9fdd67d5-Abstract.html. Qi Bi, Jingjun Yi, Hao Zheng, Haolan Zhan, Yawen Huang, Wei Ji, Yuexiang Li, and Yefeng Zheng. Learning frequency-adapted vision foundation model for domain generalized semantic segmentation. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, Neur IPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/ aaf50c91c3fc018f6a476032d02114d9-Abstract-Conference.html. Tom B. 
Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. Sem Eval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Steven Bethard, Marine Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David Jurgens (eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (Sem Eval-2017), pp. 1 14, Vancouver, Canada, 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://aclanthology. org/S17-2001. Published in Transactions on Machine Learning Research (07/2025) Tianlong Chen, Chengyue Gong, Daniel Jesus Diaz, Xuxi Chen, Jordan Tyler Wells, Qiang Liu, Zhangyang Wang, Andrew D. Ellington, Alex Dimakis, and Adam R. Klivans. Hotprotein: A novel framework for protein thermostability prediction and editing. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https:// openreview.net/pdf?id=YDJRFWBMNby. Sang Keun Choe, Sanket Vaibhav Mehta, Hwijeen Ahn, Willie Neiswanger, Pengtao Xie, Emma Strubell, and Eric P. Xing. Making scalable meta learning practical. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023a. URL http://papers.nips.cc/paper_files/paper/2023/hash/ 531998dc1fc858b5857a90b74d96ecab-Abstract-Conference.html. Sang Keun Choe, Willie Neiswanger, Pengtao Xie, and Eric P. Xing. Betty: An automatic differentiation library for multilevel optimization. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023b. URL https://openreview.net/ pdf?id=LV_Me MS38Q9. Nigel Collier, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Jin-Dong Kim. Introduction to the bio-entity recognition task at JNLPBA. In Nigel Collier, Patrick Ruch, and Adeline Nazarenko (eds.), Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/Bio NLP), pp. 73 78, Geneva, Switzerland, 2004. COLING. URL https:// aclanthology.org/W04-1213. Hua Cui and Jie Bai. A new hyperparameters optimization method for convolutional neural networks. Pattern Recognition Letters, 125:828 834, 2019. Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine learning challenges workshop, pp. 177 190. Springer, 2005. 
Ruyu Dai, Wei Zhang, Wending Tang, Evelien Wynendaele, Qizhi Zhu, Yannan Bin, Bart De Spiegeleer, and Junfeng Xia. Bbppred: sequence-based prediction of blood-brain barrier peptides with feature representation learning and logistic regression. Journal of Chemical Information and Modeling, 61(1):525 534, 2021. William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL https:// aclanthology.org/I05-5002. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 1126 1135. PMLR, 2017. URL http://proceedings.mlr.press/v70/ finn17a.html. Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradientbased hyperparameter optimization. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 1165 1173. PMLR, 2017. URL http:// proceedings.mlr.press/v70/franceschi17a.html. Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, and Jia Li. Parameter-efficient fine-tuning with discrete fourier transform. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Open Review.net, 2024. URL https://openreview.net/ forum?id=XUOHKSsurt. Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In The Tenth International Conference on Learning Published in Transactions on Machine Learning Research (07/2025) Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022. URL https: //openreview.net/forum?id=0RDcd5Axok. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2790 2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/ houlsby19a.html. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022a. URL https://openreview.net/forum?id=n Ze VKee FYf9. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022b. URL https://openreview.net/forum?id=n Ze VKee FYf9. Iurii Kemaev, Dan A Calian, Luisa M Zintgraf, Gregory Farquhar, and Hado van Hasselt. Scalable metalearning via mixed-mode differentiation. Ar Xiv preprint, abs/2505.00793, 2025. URL https://arxiv. org/abs/2505.00793. 
Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. Vera: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. Open Review.net, 2024. URL https://openreview.net/forum?id=Nj Nf Ldxr3A. Alba Ledesma-Fernandez, Susana Velasco-Lozano, Javier Santiago-Arcos, Fernando López-Gallego, and Aitziber L Cortajarena. Engineered repeat proteins as scaffolds to assemble multi-enzyme systems for efficient cell-free biosynthesis. Nature Communications, 14(1):2587, 2023. Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045 3059, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/ 2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74 81, Barcelona, Spain, 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013. Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp. 605 612, Barcelona, Spain, 2004. doi: 10.3115/ 1218955.1219032. URL https://aclanthology.org/P04-1077. Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language model via parameter-efficient transfer learning. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 441 459, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.41. URL https://aclanthology. org/2020.findings-emnlp.41. Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https://openreview.net/forum?id=S1e YHo C5FX. Published in Transactions on Machine Learning Research (07/2025) Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Open Review.net, 2024a. URL https://openreview.net/forum?id=3d5CIRG1n2. Zeyu Liu, Souvik Kundu, Anni Li, Junrui Wan, Lianghao Jiang, and Peter Anthony Beerel. Aflora: Adaptive freezing of low rank adaptation in parameter efficient fine-tuning of large models. Ar Xiv preprint, abs/2403.13269, 2024b. URL https://arxiv.org/abs/2403.13269. Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In Silvia Chiappa and Roberto Calandra (eds.), The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pp. 1540 1552. PMLR, 2020. URL http: //proceedings.mlr.press/v108/lorraine20a.html. Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to-end generation. 
In Kristiina Jokinen, Manfred Stede, David De Vault, and Annie Louis (eds.), Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 201 206, Saarbrücken, Germany, 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-5525. URL https://aclanthology. org/W17-5525. Divya Padmanabhan, Satyanath Bhat, Shirish Shevade, and Y Narahari. Topic model based multi-label classification. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 996 1003. IEEE, 2016. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin (eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311 318, Philadelphia, Pennsylvania, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https: //aclanthology.org/P02-1040. Barak A Pearlmutter and Jeffrey Mark Siskind. Reverse-mode ad in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2): 1 36, 2008. Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 737 746. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/pedregosa16. html. Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapter Fusion: Non-destructive task composition for transfer learning. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 487 503, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.39. URL https://aclanthology.org/2021.eacl-main.39. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 113 124, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 072b030ba126b2f4b2374f342be9ed44-Abstract.html. Published in Transactions on Machine Learning Research (07/2025) Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don t know: Unanswerable questions for SQu AD. In Iryna Gurevych and Yusuke Miyao (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784 789, Melbourne, Australia, 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2124. URL https://aclanthology.org/P18-2124. Anastasiia Razdaibiedina, Yuning Mao, Madian Khabsa, Mike Lewis, Rui Hou, Jimmy Ba, and Amjad Almahairi. Residual prompt tuning: improving prompt tuning with residual reparameterization. 
In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 6740 6757, Toronto, Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.421. URL https://aclanthology.org/2023.findings-acl.421. Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15): e2016239118, 2021. Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, and Iryna Gurevych. Adapter Drop: On the efficiency of adapters in transformers. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7930 7946, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.626. URL https://aclanthology.org/2021.emnlp-main.626. Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D Pimentel, and Anuj Pathania. Macp: Minimal yet mighty adaptation via hierarchical cosine projection. Ar Xiv preprint, abs/2505.23870, 2025a. URL https://arxiv.org/abs/2505.23870. Yixian Shen, Qi Bi, Jia-hong Huang, Hongyi Zhu, Andy D. Pimentel, and Anuj Pathania. SSH: Sparse spectrum adaptation via discrete hartley transformation. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 10400 10415, Albuquerque, New Mexico, 2025b. Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.522. URL https://aclanthology.org/2025.naacl-long.522/. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631 1642, Seattle, Washington, USA, 2013. Association for Computational Linguistics. URL https://aclanthology.org/ D13-1170. Erik F. Tjong Kim Sang. Introduction to the Co NLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (Co NLL-2002), 2002. URL https://aclanthology.org/W02-2024. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 4566 4575. IEEE Computer Society, 2015. doi: 10.1109/CVPR.2015.7299087. URL https://doi.org/10.1109/CVPR.2015.7299087. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https://openreview.net/forum?id=r J4km2R5t7. Published in Transactions on Machine Learning Research (07/2025) Zhiguo Wang, Wael Hamza, and Radu Florian. 
Bilateral multi-perspective matching for natural language sentences. In Carles Sierra (ed.), Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 4144–4150. ijcai.org, 2017. doi: 10.24963/ijcai.2017/579. URL https://doi.org/10.24963/ijcai.2017/579.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. doi: 10.1162/tacl_a_00290. URL https://aclanthology.org/Q19-1040.

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122, New Orleans, Louisiana, 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint, abs/1910.03771, 2019. URL https://arxiv.org/abs/1910.03771.

Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5075–5084. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.539. URL https://doi.org/10.1109/CVPR.2017.539.

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 2945–2954. IEEE, 2023. doi: 10.1109/CVPR52729.2023.00288. URL https://doi.org/10.1109/CVPR52729.2023.00288.

Jingjun Yi, Qi Bi, Hao Zheng, Haolan Zhan, Wei Ji, Yawen Huang, Yuexiang Li, and Yefeng Zheng. Learning spectral-decomposited tokens for domain generalized semantic segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 8159–8168, 2024.

Li Zhang, Han Guo, Leah V Schaffer, Young Su Ko, Digvijay Singh, Hamid Rahmani, Danielle Grotjahn, Elizabeth Villa, Michael Gilson, Wei Wang, et al. ProteinAligner: A multi-modal pretraining framework for protein foundation models. bioRxiv, pp. 2024–10, 2024a.

Miao Zhang, Steven W. Su, Shirui Pan, Xiaojun Chang, M. Ehsan Abbasnejad, and Reza Haffari. iDARTS: Differentiable architecture search with stochastic implicit gradients. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 12557–12566. PMLR, 2021. URL http://proceedings.mlr.press/v139/zhang21s.html.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=lq62uWRJjiY.
Ruiyi Zhang, Rushi Qiang, Sai Ashish Somayajula, and Pengtao Xie. AutoLoRA: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5048–5060, Mexico City, Mexico, 2024b. Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-long.282.

A Datasets and Models

Table 6: Summary of datasets used in the experiments.

Task Group | Dataset | Metrics | Train | Dev/Val | Test
Natural Language Understanding | MNLI | Accuracy | 393k | 20k | 20k
Natural Language Understanding | SST-2 | Accuracy | 67k | 872 | 1.8k
Natural Language Understanding | MRPC | Accuracy | 3.7k | 408 | 1.7k
Natural Language Understanding | CoLA | Matthews Corr | 8.5k | 1k | 1k
Natural Language Understanding | QNLI | Accuracy | 108k | 5.7k | 5.7k
Natural Language Understanding | QQP | Accuracy | 364k | 40k | 391k
Natural Language Understanding | RTE | Accuracy | 2.5k | 276 | 3k
Natural Language Understanding | STS-B | Pearson Corr | 7.0k | 1.5k | 1.4k
Text Classification | ModApte | F1 | 8.8k | - | 3k
Text Classification | ModHayes | F1 | 18k | - | 0.7k
Text Classification | ModLewis | F1 | 12k | - | 5.5k
Natural Language Generation | E2E | BLEU, NIST, MET, ROUGE-L, CIDEr | 42k | 4.6k | -
Token Classification | BioNLP | Accuracy, Precision, Recall, F1 | 17k | 1.9k | 3.9k
Token Classification | CoNLL2003 | Accuracy, Precision, Recall, F1 | 14k | 3.3k | 3.5k
Biomedical Experiments | Thermostability prediction | Accuracy, Precision, Recall, F1 | 3,695 | - | 924
Biomedical Experiments | BBP | Accuracy, Precision, Recall, F1 | 200 | - | 38
Biomedical Experiments | MIC | MSE | 936 | - | 104

In this section, we present the datasets and models used in the experiments and summarize their statistics in Table 6.

A.1 Natural Language Understanding

The GLUE Benchmark (Wang et al., 2019) comprises a diverse array of tasks that are widely employed for evaluation in natural language understanding. It encompasses two single-sentence classification tasks, three tasks assessing similarity and paraphrasing, and four tasks focusing on natural language inference. Specifically, it includes MNLI (MultiNLI, Williams et al. (2018)), SST-2 (Stanford Sentiment Treebank, Socher et al. (2013)), MRPC (Microsoft Research Paraphrase Corpus, Dolan & Brockett (2005)), CoLA (Corpus of Linguistic Acceptability, Warstadt et al. (2019)), QNLI (Question NLI, Rajpurkar et al. (2018)), QQP (Quora Question Pairs, Wang et al. (2017)), RTE (Recognizing Textual Entailment, Dagan et al. (2005)), and STS-B (Semantic Textual Similarity Benchmark, Cer et al. (2017)). We summarize the statistics for all datasets within the GLUE Benchmark in Table 6. Following existing practice, the GLUE development sets are used as test data, since the actual test sets are not publicly available. We report the overall (matched and mismatched) accuracy for MNLI, Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks.

The Reuters-21578 dataset (Padmanabhan et al., 2016) is one of the most widely used collections for text categorization research. It was collected from the Reuters financial newswire service in 1987 and is used for text classification and natural language processing tasks. Three splits are available: ModApte, ModHayes, and ModLewis. The documents cover various topics, such as politics, economics, and sports. The F1 score is used as the evaluation metric across all three splits.
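For reference, the headline GLUE metrics mentioned above (accuracy, Matthews correlation for CoLA, and Pearson correlation for STS-B) can be computed as in the following minimal sketch. It uses scikit-learn and SciPy purely for illustration; it is not the evaluation pipeline of this paper, and the toy predictions are placeholders.

```python
# Illustrative computation of the per-task GLUE metrics described above.
from sklearn.metrics import accuracy_score, matthews_corrcoef
from scipy.stats import pearsonr

def glue_metric(task, predictions, references):
    """Return the headline metric for a GLUE task (sketch only)."""
    if task == "cola":                       # Matthews correlation
        return matthews_corrcoef(references, predictions)
    if task == "stsb":                       # Pearson correlation
        return pearsonr(references, predictions)[0]
    return accuracy_score(references, predictions)   # all other tasks

# Toy usage
print(glue_metric("cola", [1, 0, 1, 1], [1, 0, 0, 1]))
print(glue_metric("stsb", [2.5, 4.0, 1.0], [3.0, 4.2, 0.8]))
```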
A.2 Natural Language Generation

In our experiments on natural language generation, we use the E2E dataset (Novikova et al., 2017), which was initially introduced for training end-to-end, data-driven natural language generation systems. Multiple references can be associated with each source table used as input. Each sample (x, y) consists of a series of slot-value pairs accompanied by an associated natural language reference text. The E2E dataset comprises approximately 42k training examples, 4,600 validation examples, and 4,600 test examples from the restaurant domain. We use the following five evaluation metrics: BLEU (Papineni et al., 2002), NIST (Lin & Och, 2004), METEOR (Banerjee & Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015).

A.3 Token Classification

For token classification, we fine-tune the RoBERTa-base and RoBERTa-large models on the BioNLP dataset (Collier et al., 2004) and the CoNLL2003 dataset (Tjong Kim Sang, 2002). BioNLP (Collier et al., 2004) is a named entity recognition dataset that contains biological entities such as DNA, RNA, and protein; it is essentially a token classification task in which each entity in the sequence is classified. CoNLL-2003 (Tjong Kim Sang, 2002) focuses on language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups. Accuracy, precision, recall, and F1 score are used as evaluation metrics.

A.4 Biomedical Experiments

The ESM (Evolutionary Scale Modeling, Rives et al. (2021)) model is a transformer-based protein language model designed for protein sequence analysis, leveraging the transformer architecture to capture evolutionary patterns. We fine-tune the ESM model using the ProteinAligner checkpoint (Zhang et al., 2024a) on two classification tasks, thermostability prediction (Chen et al. (2023), 3,695 training samples) and blood-brain barrier peptide prediction (BBP, Dai et al. (2021), 200 training samples), and one regression task, minimum inhibitory concentration prediction (MIC, Ledesma-Fernandez et al. (2023), 936 training samples). Notably, protein analysis datasets are typically much smaller than those in NLP, so large pre-trained models are prone to overfitting even when using PEFT methods. The trainable parameters (on the order of millions) are significantly overparameterized relative to the available samples (thousands or even hundreds), highlighting the need for our overfitting-resilient counterpart.

B Experimental Settings

In this section, we provide detailed experimental settings. We maintain consistent configurations across experiments, including LoRA rank, LoRA α, batch size, maximum sequence length, and optimizer, to ensure a fair comparison. For results other than Table 2, we do not include the bias term in PEFT linear layers. The hyperparameter tuning for our method is straightforward and convenient.

B.1 RoBERTa

We summarize the experimental settings for the GLUE benchmark (Table 1) and for the Reuters21578 and token classification (Table 13) tasks in Table 7, and the settings for the GPT-2 experiments (Table 12) in Table 8. The experimental configuration, particularly during the inference stage, follows the approach described by Hu et al. (2022b).
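To make these settings concrete, the sketch below shows how the lower- and upper-level optimizers and the linear warmup schedule listed in Table 7 could be instantiated. This is a minimal illustration assuming PyTorch and HuggingFace transformers; the toy tensors stand in for the direction (LoRA A/B) and magnitude parameters of an adapted model, and the specific values are taken from the MNLI column of Table 7 for illustration only.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Toy stand-ins for the BiDoRA parameter groups; in practice these are
# collected from the adapted model.
d, k, r = 768, 768, 4
lora_A = torch.nn.Parameter(0.01 * torch.randn(r, k))   # direction (lower level)
lora_B = torch.nn.Parameter(torch.zeros(d, r))           # direction (lower level)
magnitude = torch.nn.Parameter(torch.ones(1, k))          # magnitude (upper level)

total_steps = 20_000                      # "Global steps" (MNLI column, Table 7)
warmup_steps = int(0.06 * total_steps)    # warmup ratio 0.06, linear scheduler

# Separate AdamW optimizers mirror the distinct lower/upper learning rates
# and weight decays in Table 7.
lower_opt = torch.optim.AdamW([lora_A, lora_B], lr=5e-5, weight_decay=0.1)
upper_opt = torch.optim.AdamW([magnitude], lr=5e-5, weight_decay=0.1)
lower_sched = get_linear_schedule_with_warmup(lower_opt, warmup_steps, total_steps)
upper_sched = get_linear_schedule_with_warmup(upper_opt, warmup_steps, total_steps)
```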
Table 7: The hyperparameters used for RoBERTa on the GLUE benchmark (Wang et al., 2019), the Reuters21578 dataset (Padmanabhan et al., 2016), the BioNLP dataset (Collier et al., 2004), and the CoNLL2003 dataset (Tjong Kim Sang, 2002). Shared settings: optimizer AdamW, warmup ratio 0.06, linear scheduler, LoRA rank 4, LoRA α 8. Per-task columns (left to right): MNLI, SST-2, MRPC, CoLA, QNLI, QQP, RTE, STS-B, ModApte, ModHayes, ModLewis, BioNLP, CoNLL2003.

RoBERTa-base (total batch size 32, max sequence length 512):
Global steps: 20k | 12k | 25k | 20k | 15k | 20k | 15k | 12k | 20k | 20k | 20k | 12k | 12k
Lower learning rate: 5e-5 | 1e-5 | 2e-5 | 5e-5 | 2e-5 | 5e-5 | 1e-5 | 1e-5 | 3e-5 | 3e-5 | 3e-5 | 1e-5 | 2e-5
Upper learning rate: 5e-5 | 1e-5 | 2e-5 | 5e-5 | 2e-5 | 5e-5 | 1e-5 | 1e-5 | 3e-5 | 3e-5 | 3e-5 | 1e-5 | 2e-5
Lower weight decay: 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2
Upper weight decay: 0.1 | 0.1 | 0.1 | 0.1 | 0 | 0.1 | 0.1 | 0.01 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
Regularization coefficient: 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 0 | 1e-5 | 0 | 1e-5 | 0

RoBERTa-large (total batch size 32, max sequence length 128):
Global steps: 50k | 20k | 30k | 20k | 60k | 40k | 15k | 10k | 20k | 20k | 20k | 12k | 15k
Lower learning rate: 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5
Upper learning rate: 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 1e-5 | 2e-5 | 1e-5
Lower weight decay: 0.5 | 0.5 | 0 | 0.2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.2 | 0.1 | 0.2 | 0.02 | 0.1
Upper weight decay: 0.5 | 0.05 | 0 | 0.2 | 0.5 | 0.5 | 0.1 | 0.5 | 0.1 | 0.1 | 0.1 | 0.02 | 0.1
Regularization coefficient: 0 | 0 | 1e-5 | 1e-5 | 0 | 1e-5 | 0 | 1e-5 | 0 | 1e-5 | 0 | 0 | 1e-5

Table 8: The hyperparameters used for GPT-2 on the E2E NLG benchmark (Novikova et al., 2017).
Training: optimizer AdamW; warmup ratio 0.06; linear scheduler; LoRA rank (rank_a = rank_u) 4; LoRA α 32; label smoothing 0.1; lower learning rate 1e-3; upper learning rate 1e-4; lower weight decay 1; upper weight decay 1; max sequence length 512; regularization coefficient 1e-5.
Inference: beam size 10; length penalty 0.9; no-repeat n-gram size 4.

C Baselines in Experiments

Here, we provide a brief introduction to the baseline methods compared in our experiments.

Full Fine-Tuning (FT): The entire model is fine-tuned, with updates to all parameters.
Adapter Tuning (Houlsby et al., 2019; Lin et al., 2020; Rücklé et al., 2021; Pfeiffer et al., 2021): Methods that introduce adapter layers between the self-attention and MLP modules for parameter-efficient tuning.
LoRA (Hu et al., 2022a): A method that estimates weight updates via low-rank matrices.
AdaLoRA (Zhang et al., 2023): An extension of LoRA that dynamically reallocates the parameter budget based on importance scores.
DoRA (Liu et al., 2024a): Decomposes pretrained weights into magnitude and direction, using LoRA for efficient directional updates.
VeRA (Kopiczko et al., 2024): Employs a single pair of low-rank matrices across all layers to reduce trainable parameters.
FourierFT (Gao et al., 2024): Fine-tunes models by learning a subset of spectral coefficients in the Fourier domain.
AFLoRA (Liu et al., 2024b): Freezes low-rank adaptation parameters using a learned score, reducing trainable parameters while maintaining performance.
LaMDA (Azizi et al., 2024): Fine-tunes large models via spectrally decomposed low-dimensional adaptation.
SSH (Shen et al., 2025b): Fine-tunes large models after transforming weight matrices with the discrete Hartley transformation (DHT).
MaCP (Shen et al., 2025a): Fine-tunes large models by projecting the low-rank adaptation weight change into the discrete cosine space.
D Weight Decomposition Analysis

We provide a brief review of the weight decomposition analysis proposed in Liu et al. (2024a). Define the weight decomposition of a weight matrix $W \in \mathbb{R}^{d \times k}$ (e.g., the query matrix in an attention layer) as
$$W = m \, \frac{V}{\|V\|_c} = \|W\|_c \, \frac{W}{\|W\|_c},$$
where $m \in \mathbb{R}^{1 \times k}$ is the magnitude vector, $V \in \mathbb{R}^{d \times k}$ is the directional matrix, and $\|\cdot\|_c$ denotes the vector-wise norm of a matrix across each column. This decomposition ensures that each column of $V/\|V\|_c$ remains a unit vector, and the corresponding scalar in $m$ defines the magnitude of each vector. Liu et al. (2024a) examine the magnitude and directional variations between $W_0$ and $W_{FT}$, defined as
$$\Delta M_{FT}^t = \frac{1}{k}\sum_{n=1}^{k} \left|m_{FT}^{n,t} - m_0^n\right| \quad \text{and} \quad \Delta D_{FT}^t = \frac{1}{k}\sum_{n=1}^{k} \left(1 - \cos\!\left(V_{FT}^{n,t}, W_0^n\right)\right).$$
Here, $\Delta M_{FT}^t$ and $\Delta D_{FT}^t$ represent the magnitude and direction differences between $W_0$ and $W_{FT}$ at the $t$-th training step, respectively, with $\cos(\cdot,\cdot)$ denoting cosine similarity. $m_{FT}^{n,t}$ and $m_0^n$ are the $n$-th scalars in their respective magnitude vectors, while $V_{FT}^{n,t}$ and $W_0^n$ are the $n$-th columns of $V_{FT}^t$ and $W_0$. Intuitively, a consistent positive slope across the intermediate steps implies difficulty in concurrently learning both magnitude and direction, suggesting that slight directional changes are hard to execute alongside larger magnitude alterations. In contrast, a negative slope signifies a more varied learning pattern, with a more pronounced negative correlation indicating a larger learning capacity.

Complementary to Fig. 3 in the main paper, which analyzes the query matrix, Fig. 4 reports the weight decomposition analysis on the value matrix to support the findings in Section 5.5. We can draw two key observations from Fig. 4: 1) Consistent with the results in Liu et al. (2024a), both FT and DoRA exhibit negative correlations, with values of −49.279 and −5.485, respectively, while LoRA shows a positive correlation with a value of 2.503. 2) BiDoRA achieves a correlation of −10.547, indicating closer alignment with FT than DoRA. The analysis of how BiDoRA achieves this improvement is similar to that discussed in Section 5.5.

Figure 4: Magnitude and direction updates for (a) FT (k = −49.279), (b) LoRA (k = 2.503), (c) DoRA (k = −5.485), and (d) BiDoRA (k = −10.5) of the value matrices across different layers and intermediate steps after fine-tuning the GPT-2 model on the E2E dataset (Novikova et al., 2017). Different markers represent matrices from different training steps, while different colors indicate matrices from each layer. The correlation values are shown at the top of each panel, denoted by k.

E Training Cost

Table 9 compares the training efficiency of LoRA, DoRA, and BiDoRA on the GLUE benchmark using the RoBERTa-base model. The table details the total training steps required for convergence and the per-step computational cost, normalized relative to LoRA. For a fair comparison, all methods were benchmarked on a single NVIDIA A100 GPU.

Table 9: Average training time cost on the GLUE benchmark (Wang et al., 2019).
Metric | LoRA | DoRA | BiDoRA
Per-step cost | 1 | 1.36 | 3.54
Total steps | 27.45k | 27.45k | 17.37k
Total time | 1 | 1.36 | 2.24

The results show that BiDoRA converges in fewer steps than LoRA and DoRA, while its per-step cost is higher, as the BLO process requires iterative updates between the two levels and the computation of hypergradients.
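As a rough illustration of where this per-step overhead comes from, the sketch below computes a one-step unrolled hypergradient of a validation loss with respect to the magnitude component through a DoRA-style reparameterization. It is a generic sketch of this class of hypergradient approximations under toy tensor sizes, not the exact algorithm or released code of BiDoRA.

```python
import torch

torch.manual_seed(0)
d, k, r, lr = 8, 8, 2, 1e-2
W0 = torch.randn(d, k)                              # frozen pre-trained weight
A = (0.01 * torch.randn(r, k)).requires_grad_()     # LoRA A (direction, lower level)
B = torch.zeros(d, r, requires_grad=True)           # LoRA B (direction, lower level)
m = torch.ones(1, k, requires_grad=True)            # magnitude (upper level)

def forward(x, A, B, m):
    V = W0 + B @ A                                   # updated direction component
    W = m * V / V.norm(dim=0, keepdim=True)          # DoRA-style reparameterization
    return x @ W.t()

x_tr, y_tr = torch.randn(16, k), torch.randn(16, d)  # toy training batch
x_va, y_va = torch.randn(16, k), torch.randn(16, d)  # toy validation batch

# Lower level: one gradient step on the direction, kept in the graph.
train_loss = ((forward(x_tr, A, B, m) - y_tr) ** 2).mean()
gA, gB = torch.autograd.grad(train_loss, (A, B), create_graph=True)
A1, B1 = A - lr * gA, B - lr * gB

# Upper level: the validation loss through the unrolled step yields the
# hypergradient with respect to the magnitude m.
val_loss = ((forward(x_va, A1, B1, m) - y_va) ** 2).mean()
(hyper_grad,) = torch.autograd.grad(val_loss, m)
print(hyper_grad.shape)   # torch.Size([1, 8])
```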
The total training time for BiDoRA is approximately 1.64 times that of DoRA, a training cost that remains comparable to the baselines. Given BiDoRA's superior performance across various tasks, we argue that this moderate increase in computational cost is an acceptable trade-off, underscoring our method's practicality.

F The Role of the Data Partition Hyperparameter

Hyperparameter tuning for BiDoRA is simple and convenient. We further conducted experiments on the partition of the dataset into Dtr and Dval to provide insight into its role in BiDoRA. The partition balances the lower- and upper-level optimization by assigning each level a different portion of data. Since the direction component has far more trainable parameters, it is reasonable to allocate more data to training the lower level and to use the remaining data for training the magnitudes. As shown in Table 10, we varied the fraction of data assigned to the inner-level dataset Dtr from 0.6 to 1.0 in intervals of 0.1 and experimented with RoBERTa-base on the three splits of the Reuters21578 dataset to examine its influence. The results indicate that both extremes hurt overall performance: when the inner partition is too small (0.6), the directions are not well trained, and when it is 1.0, the magnitudes are not trained at all, leading to a significant performance drop. These findings demonstrate that BLO is effective in the sense that both levels are necessary for enhancing performance. Although tuning the partition ratio may further improve performance, we maintain a consistent data partition of 8 : 2 in all experiments for simplicity. This fixed partition already yields consistently superior performance for BiDoRA, demonstrating that our method is robust to this hyperparameter within a certain range.

Table 10: Experiment results on different data partitions of BiDoRA.
Partition | ModApte | ModHayes | ModLewis
0.6 | 85.32 | 79.76 | 77.69
0.7 | 85.32 | 80.01 | 77.74
0.8 | 85.34 | 79.93 | 77.63
0.9 | 85.27 | 79.85 | 77.64
1.0 | 85.23 | 79.59 | 77.42

Table 11: Experiment results on different weight decay values and different dropout rates of DoRA.
Method | CoLA | MRPC | RTE
DoRA (weight decay = 0) | 59.3 | 88.7 | 72.9
DoRA (weight decay = 0.05) | 60.1 | 89.2 | 73.3
DoRA (weight decay = 0.1) | 60.5 | 89.2 | 73.2
DoRA (weight decay = 0.2) | 60.3 | 89.0 | 73.2
DoRA (dropout rate = 0) | 59.2 | 89.2 | 72.9
DoRA (dropout rate = 0.1) | 60.2 | 88.9 | 71.4
DoRA (dropout rate = 0.2) | 55.1 | 87.8 | 64.2
BiDoRA | 61.3 | 89.4 | 76.0

G Comparison with Other General Methods for Addressing Overfitting

Some common experimental settings may also be used to reduce overfitting. For DoRA, two promising options are increasing the weight decay and adopting a more aggressive dropout rate. We conducted separate experiments for both. We kept the hyperparameters that had been well tuned for DoRA and achieve optimal results, and tuned only the weight decay value; similarly, we tuned only the dropout rate while keeping the optimized weight decay fixed. Experiments were conducted with RoBERTa-base on three datasets, and the results are presented in Table 11. Neither approach effectively addresses the overfitting issue or enhances the model's generalization ability. In contrast, BiDoRA exploits the specific magnitude-direction structure of DoRA and the strategy of training the two distinct components on separate splits of the dataset.
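For completeness, the sketch below illustrates the kind of 8 : 2 split of the downstream training data into Dtr (lower level, direction) and Dval (upper level, magnitude) described in Appendix F. The dataset object here is a toy placeholder, not the loading code used in our experiments.

```python
import torch
from torch.utils.data import random_split, TensorDataset

# Toy downstream dataset standing in for a real task's training split.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# 8 : 2 partition into D_tr (lower level) and D_val (upper level).
n_tr = int(0.8 * len(dataset))
d_tr, d_val = random_split(dataset, [n_tr, len(dataset) - n_tr],
                           generator=torch.Generator().manual_seed(0))
print(len(d_tr), len(d_val))   # 800 200
```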
An advantage of our methodology is that it can be easily combined with other general-purpose strategies for alleviating overfitting, since BiDoRA does not modify the original DoRA architecture.

H Additional Experiments

H.1 Experiments on Natural Language Generation Tasks

In this section, we evaluate BiDoRA's performance on the NLG task. Table 12 presents the results of fine-tuning a GPT-2 model on the E2E dataset with baseline PEFT methods and BiDoRA. BiDoRA achieves the best performance across all five evaluation metrics, demonstrating its effectiveness in fine-tuning pre-trained models for NLG tasks.

Table 12: Performance of BiDoRA and baseline methods for fine-tuning GPT2-medium on the E2E dataset (Novikova et al., 2017). A higher value is better for all metrics. The best results are shown in bold.
Method | #Params | BLEU | NIST | MET | ROUGE-L | CIDEr
FT | 354.9M | 68.0 | 8.61 | 46.1 | 69.0 | 2.38
Adapter | 11.1M | 67.0 | 8.50 | 45.2 | 66.9 | 2.31
LoRA | 0.39M | 67.1 | 8.54 | 45.7 | 68.0 | 2.33
DoRA | 0.39M | 67.0 | 8.48 | 45.4 | 70.1 | 2.33
BiDoRA | 0.39M | 69.0 | 8.72 | 46.2 | 70.9 | 2.44

H.2 Experiments on Token Classification

The effectiveness of BiDoRA can also be observed in Table 13, which reports the results of token classification tasks. Unlike the NLU tasks discussed in the previous section, which involve classifying entire sentences and focus on capturing global semantics, token classification requires classifying each token within a sentence, highlighting the importance of capturing local context. On the BioNLP dataset, BiDoRA consistently outperforms baseline methods by a large margin in terms of F1 score. On the CoNLL2003 dataset, BiDoRA either outperforms or matches all baseline methods across all metrics. Consistent with our previous findings, BiDoRA effectively fine-tunes pre-trained models for token classification tasks.

H.3 More Experiments on Natural Language Understanding Tasks

Table 13 also presents the results of fine-tuning RoBERTa models on the Reuters21578 dataset, a text classification task, where BiDoRA outperforms all baseline methods by an even larger margin. Notably, BiDoRA achieves performance comparable to or even better than full fine-tuning, providing further evidence of its superiority.

Table 13: RoBERTa-base/large (Rb/l) with different fine-tuning methods on the Reuters21578 (Padmanabhan et al., 2016), BioNLP (Collier et al., 2004), and CoNLL2003 (Tjong Kim Sang, 2002) benchmarks. A higher value is better for all metrics. The best results are shown in bold.
Method | #Params | ModApte | ModHayes | ModLewis | BioNLP Acc | BioNLP Prec | BioNLP Rec | BioNLP F1 | CoNLL2003 Acc | CoNLL2003 Prec | CoNLL2003 Rec | CoNLL2003 F1
Rb(FT) | 125.0M | 85.4 | 77.6 | 77.1 | 93.9 | 69.0 | 78.9 | 73.6 | 99.3 | 95.7 | 96.3 | 96.0
Rb(Adapter) | 0.9M | 85.3 | 77.5 | 76.8 | 93.9 | 69.1 | 78.8 | 73.7 | 99.3 | 95.7 | 96.4 | 96.0
Rb(LoRA) | 0.15M | 84.7 | 74.3 | 74.7 | 93.9 | 69.0 | 78.8 | 73.6 | 99.3 | 95.4 | 96.3 | 95.8
Rb(DoRA) | 0.17M | 84.8 | 78.2 | 76.6 | 94.0 | 69.2 | 79.1 | 73.8 | 99.3 | 95.3 | 96.2 | 95.8
Rb(BiDoRA) | 0.17M | 85.3 | 79.9 | 77.6 | 93.9 | 71.2 | 78.6 | 74.7 | 99.3 | 95.9 | 96.5 | 96.2
Rl(FT) | 355.0M | 84.8 | 77.5 | 76.6 | 94.0 | 69.4 | 79.6 | 74.1 | 99.4 | 96.2 | 97.0 | 96.6
Rl(Adapter) | 0.44M | 84.8 | 77.9 | 76.7 | 94.0 | 69.4 | 79.7 | 74.2 | 99.4 | 96.1 | 97.0 | 96.6
Rl(LoRA) | 0.39M | 84.7 | 77.7 | 76.7 | 93.9 | 69.2 | 79.3 | 73.9 | 99.4 | 96.2 | 97.0 | 96.6
Rl(DoRA) | 0.39M | 84.8 | 77.4 | 76.7 | 94.0 | 69.4 | 79.7 | 74.2 | 99.4 | 96.2 | 97.1 | 96.6
Rl(BiDoRA) | 0.39M | 84.9 | 78.9 | 77.3 | 94.0 | 71.3 | 79.3 | 75.1 | 99.4 | 96.4 | 97.1 | 96.7

I Evidence on Orthogonality of Incremental Matrix

To verify that the orthogonal regularization (OR) proposed in Section 4.2 encourages the columns of the direction matrix to be orthogonal, we visualize the normalized eigenvalues of the matrix in Fig. 5. The results show that, compared to methods without OR (i.e., DoRA and BiDoRA w/o cst.), BiDoRA with OR produces eigenvalues that are more closely aligned with those of a purely orthogonal matrix, where all eigenvalues would be one. This effect holds for both the query and value matrices and verifies the effectiveness of the OR constraint.

Figure 5: Eigenspectra of the direction matrix for query (top) and value (bottom) matrices across different layers. Panels: (a) DoRA, query; (b) BiDoRA (w/o cst.), query; (c) BiDoRA, query; (d) DoRA, value; (e) BiDoRA (w/o cst.), value; (f) BiDoRA, value. The figure compares three fine-tuning methods: BiDoRA, BiDoRA without orthogonal regularization (w/o cst.), and DoRA. Both axes are on a log scale, and only the 64 largest eigenvalues are shown for visualization clarity. Experiments were conducted on the CoLA dataset (Warstadt et al., 2019) with the RoBERTa-base model.
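To make this check concrete, the sketch below computes a column-orthogonality penalty and the normalized eigenspectrum of the Gram matrix of a direction matrix. The exact regularizer is the one defined in Section 4.2; the Frobenius-norm form used here is a common choice assumed purely for illustration, and the random matrix is a toy placeholder for a fine-tuned direction matrix.

```python
import torch

def orthogonal_penalty(V: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the column-normalized direction from orthogonality."""
    V_hat = V / V.norm(dim=0, keepdim=True)           # unit-norm columns
    gram = V_hat.t() @ V_hat                           # k x k Gram matrix
    eye = torch.eye(V.shape[1], device=V.device)
    return ((gram - eye) ** 2).sum()                   # squared Frobenius norm

def normalized_eigenspectrum(V: torch.Tensor, top: int = 64) -> torch.Tensor:
    """Largest `top` eigenvalues of the Gram matrix, normalized to [0, 1]."""
    V_hat = V / V.norm(dim=0, keepdim=True)
    eigvals = torch.linalg.eigvalsh(V_hat.t() @ V_hat)  # real, ascending order
    eigvals = eigvals.flip(0)[:top]                      # largest first
    return eigvals / eigvals.max()                       # flat spectrum => orthogonal

V = torch.randn(768, 768)   # toy stand-in for a direction matrix
print(orthogonal_penalty(V).item())
print(normalized_eigenspectrum(V)[:5])
```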