# Asymmetry in Low-Rank Adapters of Foundation Models

Jiacheng Zhu¹, Kristjan Greenewald²³, Kimia Nadjahi¹, Haitz Sáez de Ocáriz Borde⁴, Rickard Brüel Gabrielsson¹, Leshem Choshen¹³, Marzyeh Ghassemi¹, Mikhail Yurochkin²³, Justin Solomon¹

¹MIT CSAIL  ²IBM Research  ³MIT-IBM Watson AI Lab  ⁴University of Oxford. Correspondence to: Jiacheng Zhu. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product $BA$, we observe that the $B$ and $A$ matrices have distinct functions: $A$ extracts features from the input, while $B$ uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning $A$, and that a random untrained $A$ should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings of exclusively training $B$ improves the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs. The code and data are available at https://github.com/Jiacheng-Zhu-AIML/AsymmetryLoRA

1. Introduction

Foundation models for data-rich modalities such as text and imagery have achieved significant success by pre-training large models on vast amounts of data. While these models are designed to be general-purpose, it is often necessary to fine-tune them for downstream tasks. However, the huge size of foundation models can make fine-tuning the entire model impossible, inspiring parameter-efficient fine-tuning (PEFT) methods that selectively update fewer parameters (cf. Lialin et al., 2023). The effectiveness of PEFT demonstrates that updating even a tiny fraction of the parameters can retain and enrich the capabilities of pretrained models. Indeed, fine-tuning has become a necessary ingredient of modern ML; for example, the PEFT package (Hugging Face) has supported more than 4.4k projects since its creation in November 2022.

Among PEFT methods, low-rank adaptation (LoRA) (Hu et al., 2021), which leverages the assumption that over-parameterized models have a low intrinsic dimension (Aghajanyan et al., 2021), has become increasingly popular. To update a neural network, LoRA trains a subset of the parameters (usually attention) by representing weight matrices as $W_0 + \Delta W$, where $W_0$ is the fixed weight matrix from the pre-trained model and $\Delta W$ is a low-rank update. Compared to full fine-tuning, LoRA considerably reduces the number of trainable parameters and memory requirements and often achieves similar or better performance. Most LoRA implementations factor $\Delta W = BA$ and optimize for $A$ and $B$, where $A$ and $B$ have fewer rows and columns (resp.) than $\Delta W$; this approach was proposed by Hu et al. (2021).
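To make this parameterization concrete, below is a minimal sketch of a LoRA-adapted dense layer in PyTorch. This is our illustration rather than the authors' released implementation; the class name, dimensions, and scaling default are placeholders, but it follows the standard recipe of a frozen $W_0$, a randomly initialized $A$, and a zero-initialized $B$.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Dense layer with a LoRA update: y = (W0 + (alpha/r) * B A) x + b."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out)          # pre-trained weights (frozen)
        self.W0.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5)  # random init
        self.B = nn.Parameter(torch.zeros(d_out, r))               # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply W0 x plus the low-rank update (B A) x without forming B A explicitly.
        return self.W0(x) + self.scale * (x @ self.A.T) @ self.B.T

x = torch.randn(4, 64)
layer = LoRALinear(64, 64, r=8)
print(layer(x).shape)  # torch.Size([4, 64])
```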
With this set of variables, the standard LoRA training procedure, in which $A$ is initialized to a random matrix and $B$ is initialized to zero, exhibits an interesting asymmetry, which is leveraged in some empirical follow-ups (Zhang et al., 2023a; Kopiczko et al., 2024). In particular, while training $B$ is critical for the performance of LoRA, even a randomly initialized $A$ seems to suffice for strong performance. On the other hand, reversing the roles of $A$ and $B$ substantially decreases performance. Delving into this empirical suggestion from prior work, this paper demonstrates that LoRA's components are inherently asymmetric. In fact, the asymmetry occurs even for linear models (§4.1.1). Indeed, our theoretical (§4) and empirical (§5) analysis suggests that fixing $A$ to a random orthogonal matrix can yield similar performance to full LoRA training, and that this adjustment may even promote generalization. This observation is backed by a comprehensive empirical study, leading to practical suggestions for improving parameter efficiency and generalization of LoRA models.

Figure 1. Similarity of learned LoRA matrices $A$ & $B$ across layers of a RoBERTa model fine-tuned with different initialization and data settings: (a) random initialization, same task; (b) fixed initialization, different tasks; (c) random initialization, different tasks. $B$s are similar when fine-tuning on the same task (a) and dissimilar when fine-tuning on different tasks (b and c). $A$s are similar when initialized identically (b), even though fine-tuning is done on different tasks, and dissimilar when initialized randomly regardless of the fine-tuning task (a and c). The experiment demonstrates the asymmetric roles of $A$ and $B$ in LoRA.

Our contributions are as follows:

- We provide simple theoretical and empirical analysis demonstrating asymmetry of training the two adapter matrices, showing that tuning $B$ is more impactful than tuning $A$. This confirms and builds upon prior empirical observations (Zhang et al., 2023a; Kopiczko et al., 2024).
- We show theoretically and empirically that randomly drawing and freezing $A$ while tuning only $B$ can improve generalization vs. tuning both $B$ and $A$, in addition to practical gains achieved by a 2× parameter reduction.
- We validate our findings through experiments using models including RoBERTa, BART-Large, LLaMA-2, and the vision transformer (ViT), on both text and image datasets.

2. Related Work

Since the introduction of the original LoRA technique (Hu et al., 2021), numerous enhancements have been proposed. For example, quantization can reduce memory usage during training (Gholami et al., 2021; Dettmers et al., 2023; Guo et al., 2024). Also, the number of trainable parameters can be further reduced by adaptively allocating the rank (Zhang et al., 2023b), pruning during training (Benedek & Wolf, 2024), or pruning and quantizing after training (Yadav et al., 2023).
To further reduce the number of trainable LoRA parameters, the idea of reusing (randomly generated) weights or projections (Frankle & Carbin, 2018; Ramanujan et al., 2020) suggests strategies such as learning diagonal matrices that rescale randomly-drawn and frozen $B, A$ matrices (VeRA) (Kopiczko et al., 2024), deriving $B$ and $A$ from the SVD of the pre-trained $W_0$ and optimizing a smaller matrix in the resulting basis (SVDiff) (Han et al., 2023), learning a linear combination of fixed random matrices (NOLA) (Koohpayegani et al., 2023), or fine-tuning using orthogonal matrices (BOFT) (Liu et al., 2024). As echoed in our empirical results, previous methods observe that freezing $A$ in conventional LoRA preserves performance (Zhang et al., 2023a). While nearly all recent studies treat the two matrices asymmetrically in their initialization or freezing schemes, there is a lack of formal investigation into this asymmetry in low-rank adaptation. Zeng & Lee (2023) specifically investigate the expressive power of LoRA, but only focus on linearized networks and linear components. Their analysis does not consider aspects such as the particular distribution of the fine-tuning target data, generalization, or the differing roles of the different matrices. Lastly, we would like to highlight that even before LoRA, the effectiveness of fine-tuning was also explained by leveraging similar ideas related to the intrinsic low dimensionality of large models (Aghajanyan et al., 2021).

3. Preliminaries & Background

Notation. Suppose we are given a pre-trained weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ representing a dense multiplication layer of a neural network foundation model. LoRA fine-tunes by updating the weights to $W_0 + \Delta W$, where $\mathrm{rank}(\Delta W) = r \ll \min(d_{\text{out}}, d_{\text{in}})$. In particular, Hu et al. (2021) factor $\Delta W = BA$, where $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$ have restricted rank $r$. During training, $W_0$ is fixed; LoRA updates $(A, B)$. This yields more efficient updates than full fine-tuning, provided that $r < \frac{d_{\text{in}} d_{\text{out}}}{d_{\text{in}} + d_{\text{out}}}$. Now using $i$ to index layers of a network, a LoRA update is thus characterized by a set of pre-trained weight matrices $\mathcal{W} \triangleq \{W_i\}_{i=1}^{L}$, a set of pre-trained bias vectors $b \triangleq \{b_i\}_{i=1}^{L}$, and a set of low-rank trainable weights $\Delta\mathcal{W} \triangleq \{\Delta W_i\}_{i=1}^{L'}$. LoRA may not update all $L$ weight matrices in $\mathcal{W}$, in which case $L' \le L$.

Motivating example. In Figure 1, we investigate the similarity of learned matrices $A$ and $B$ under three scenarios: (a) random initialization, $A$ & $B$ trained multiple times on the same task; (b) fixed initialization, $A$ & $B$ trained multiple times, each time on a different task; and (c) random initialization, $A$ & $B$ trained multiple times, each time on a different task.

Here, we fine-tune RoBERTa-large (Liu et al., 2019) with LoRA on tasks from the GLUE benchmark (Wang et al., 2018). Specifically, we fine-tuned MRPC with 5 random seeds for (a) and on MRPC, RTE, STS-B, and CoLA for (b) and (c). Figure 1 plots the similarity of learned $A$ and $B$ matrices across layers, measured by canonical correlation analysis goodness of fit (Ramsay et al., 1984); see Appendix A for motivation. These plots suggest that $B$ is predominantly responsible for learning, while $A$ is less important. Specifically, when training on the same task with different initializations (scenario (a)), the learned $B$ matrices are similar to each other, while when training on different tasks (scenarios (b) and (c)), they are different.
On the contrary, the similarity of learned $A$ matrices is insensitive to training data and is determined by initialization; it is highest in scenario (b), when the initialization is fixed even though training data differs. See Appendix A for additional details of this experiment.

4. Theoretical Analysis

In this section, we analyze the asymmetry in prediction tasks and its effect on generalization. We discuss a general case rather than a specific neural network architecture, considering rank-$r$ adaptation of any parameter matrix $W = W_0 + BA$ used multiplicatively on some input-dependent vector, i.e.,

$$\text{layerOutput} = \psi\big((W_0 + BA)\,\phi(\text{layerInput}), \dots\big) \qquad (1)$$

for some differentiable functions $\psi, \phi$. Here, $\psi$ may take more arguments depending on layerInput, which may have their own low-rank adapted parameter matrices. This generic form encompasses both feedforward and attention layers. In this setting, $A$ serves to extract $r$ features from $\phi(\text{layerInput})$, which are then used by $B$ to predict some desired output for future layers. We will argue that training $B$ to predict the output is crucial for correct outputs, while using a random $A$ is often sufficient, as $B$ can be optimized to use whatever information is retained in the $r$-dimensional projection $A\,\phi(\text{layerInput})$.

4.1. A, B asymmetry in prediction tasks

If we wish to reduce the effort of training both $A$ and $B$ in (1), in principle either $A$ could be frozen and $B$ tuned, or $B$ frozen and $A$ tuned. As shown in §5 and elsewhere, these two options are not empirically equivalent: it is best to freeze $A$ and tune $B$. In this section, we seek to understand the principle behind this asymmetry by theoretically analyzing the fine-tuning of a class of prediction models. We first build intuition with least-squares linear regression.

4.1.1. MULTIVARIATE LINEAR LEAST-SQUARES

As a simple example analogous to a single network layer, we study $d_{\text{in}}$-to-$d_{\text{out}}$ least-squares linear regression (in (1), set $\phi, \psi$ to be the identity). Specifically, suppose there is an input $X \in \mathbb{R}^{d_{\text{in}}}$, an output $Y \in \mathbb{R}^{d_{\text{out}}}$, and a pre-trained linear model $y_{\text{pre}}(X) = W_0 X + b_0$, where $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ and $b_0 \in \mathbb{R}^{d_{\text{out}}}$. With this model held constant, our goal is regressing $(Y_{\text{targ}}, X_{\text{targ}})$ pairs where $Y_{\text{targ}}$ is given by

$$Y_{\text{targ}} = W_{\text{targ}} X_{\text{targ}} + b_{\text{targ}} + n, \qquad W_{\text{targ}} = W_0 + \Delta,$$

with $n$ denoting zero-mean observation noise of covariance $\sigma^2 I_{d_{\text{out}}}$. Following LoRA, we model the target using a low-rank update to the pre-trained $W_0$, i.e., $W = W_0 + BA$:

$$\hat{y}(x) = (W_0 + BA)x + b,$$

where $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$ for some $r$. To find an $A$ and $B$ that best match the output, we optimize the least-squares loss on the difference between $\hat{y}$ and $Y_{\text{targ}}$:

$$\mathcal{L}(A, B) = \mathbb{E}_{(Y_{\text{targ}}, X_{\text{targ}})}\big[\|Y_{\text{targ}} - (W_0 + BA)X_{\text{targ}} - b\|_2^2\big]. \qquad (2)$$

Below, we present lemmas on minimizing this loss while freezing either $A$ or $B$. In both, for simplicity, we set $b = b_{\text{targ}}$ and $\mathbb{E}[X_{\text{targ}}] = 0$, and defer proofs to Appendix B.

Lemma 4.1 (Freezing A yields regression on projected features). Optimizing $\mathcal{L}(A, B)$ while fixing $A = Q$ with $QQ^\top = I_r$ yields $B^* = \Delta \Sigma Q^\top (Q \Sigma Q^\top)^{-1}$, where $\Sigma = \mathrm{Cov}[X_{\text{targ}}]$, with expected loss
$$\mathcal{L}(Q, B^*) = d_{\text{out}}\sigma^2 + \mathrm{Tr}[\Delta \Sigma \Delta^\top] - \mathrm{Tr}\big[Q \Sigma \Delta^\top \Delta \Sigma Q^\top (Q \Sigma Q^\top)^{-1}\big].$$

Lemma 4.2 (Freezing B yields regression on projected outputs). Optimizing $\mathcal{L}(A, B)$ while fixing $B = U$ with $U^\top U = I_r$ yields $A^* = U^\top (W_{\text{targ}} - W_0)$, with expected loss
$$\mathcal{L}(A^*, U) = d_{\text{out}}\sigma^2 + \mathrm{Tr}[\Delta \Sigma \Delta^\top] - \mathrm{Tr}[U^\top \Delta \Sigma \Delta^\top U],$$
where $\Sigma = \mathrm{Cov}[X_{\text{targ}}]$.

Comparing the lemmas above, $A^*$ is simply the $U^\top$ projection of the targeted change in weight matrix $\Delta = W_{\text{targ}} - W_0$. Unlike $B^*$, the optimal choice of $A^*$ does not consider the input data distribution captured by $\Sigma$.
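The closed-form solutions in Lemmas 4.1 and 4.2 are easy to check numerically. The following NumPy sketch (ours, with arbitrary illustrative dimensions) draws a random $\Delta$ and input covariance $\Sigma$, plugs in the two optimal solutions, and compares the resulting expected losses; in the $d \gg r$ regime it typically reproduces the ordering formalized in Theorem 4.3 below.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in = d_out = 64
r, sigma2 = 4, 0.1

Delta = rng.normal(size=(d_out, d_in))          # targeted change Wtarg - W0
F = rng.normal(size=(d_in, d_in))
Sigma = F @ F.T / d_in                          # input covariance Cov[X_targ]

def orthonormal_rows(k, d):
    """k x d matrix Q with Q Q^T = I_k, spanning a random subspace."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q.T

Q = orthonormal_rows(r, d_in)                   # frozen A
U = orthonormal_rows(r, d_out).T                # frozen B (d_out x r, U^T U = I_r)

base = d_out * sigma2 + np.trace(Delta @ Sigma @ Delta.T)

# Lemma 4.1: freeze A = Q; optimal B* = Delta Sigma Q^T (Q Sigma Q^T)^{-1}.
loss_freeze_A = base - np.trace(
    Q @ Sigma @ Delta.T @ Delta @ Sigma @ Q.T @ np.linalg.inv(Q @ Sigma @ Q.T))

# Lemma 4.2: freeze B = U; optimal A* = U^T Delta.
loss_freeze_B = base - np.trace(U.T @ Delta @ Sigma @ Delta.T @ U)

print(f"freeze A, tune B: {loss_freeze_A:.2f}")
print(f"freeze B, tune A: {loss_freeze_B:.2f}")  # typically the larger loss
```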
Intuitively, if the goal of adaptation is to approximate some desired output, then projecting away the majority (since $r \ll d_{\text{out}}$) of the output is undesirable. In contrast, projecting away a portion of the input feature space will be less damaging if the information $X_{\text{targ}}$ contains about $Y_{\text{targ}}$ is redundant (cf. neuron dropout (Srivastava et al., 2014) in neural network training) or if the distribution of $X_{\text{targ}}$ tends to be low-rank.

Consider the following extreme example. If $\Sigma = FF^\top$ is at most rank $r$, e.g. if $F \in \mathbb{R}^{d_{\text{in}} \times r}$, then for each $X$ there exists an $N = F^\dagger X \in \mathbb{R}^r$ (here $F^\dagger$ denotes the pseudoinverse) such that $X = FN$. Suppose you have tuned a pair $A', B'$. For any orthonormal $Q \in \mathbb{R}^{r \times d_{\text{in}}}$ (e.g. one drawn at random), we can write
$$B'A'X = B'A'FN = \big(B'A'F(QF)^{-1}\big)QX,$$
i.e. regardless of $A', B'$, for any (random) $Q$, there is an exactly equivalent LoRA adaptation with $A = Q$ and $B = B'A'F(QF)^{-1}$. In this setting, therefore, randomizing $A$ (to $Q$) is equally expressive to tuning it (using $A'$); a numerical illustration appears at the end of this subsection.

This intuition is also reflected in the typical LoRA initialization. When doing full LoRA (tuning both $A, B$), $A$ usually is initialized to a random Gaussian matrix, and $B$ is initialized to zero. This procedure, presumably empirically derived by Hu et al. (2021), intuitively fits our analysis above, since a random $A$ yields good random predictive features, in contrast to using a random output prediction basis. Initializing $B$ to zero then starts the optimization at a zero perturbation of the pretrained model.

We validate the above intuition with the following theorem:

Theorem 4.3 (A, B output fit asymmetry). Consider the settings of Lemmas 4.1 and 4.2, and suppose $U, Q$ are sampled uniformly from their respective Stiefel manifolds. Then, $\mathcal{L}(A^*, U) \ge \mathcal{L}(Q, B^*)$ with high probability as $d/r \to \infty$. In other words, the least-squares prediction loss of only fine-tuning $B$ is at least as good as only fine-tuning $A$.

Intuition on asymmetry gap. Theorem 4.3 is built on the following inequality:
$$\mathrm{Tr}\big[\Delta \Sigma Q^\top (Q\Sigma Q^\top)^{-1} Q \Sigma \Delta^\top\big] \;\ge\; \mathrm{Tr}\big[\Delta (Q^\top Q)\, \Sigma Q^\top (Q\Sigma Q^\top)^{-1} Q \Sigma \Delta^\top\big]. \qquad (3)$$

Let us consider an example regime to build intuition on the size of this gap. Following the intuition that freezing $A$ is most successful when the information content of the input is redundant (cf. Aghajanyan et al. (2021)), suppose the distribution of $X$ is low-rank, i.e., $\Sigma$ is of rank $r_X$. We can then write $\Sigma = U_X S_X U_X^\top$, where $U_X \in \mathbb{R}^{d_{\text{in}} \times r_X}$ is orthogonal and $S_X \in \mathbb{R}^{r_X \times r_X}$ is diagonal with nonnegative real entries. For intuition, set $r_X = r$ and $S_X = \sigma^2 I_r$. We then have $\Sigma Q^\top (Q\Sigma Q^\top)^{-1} Q \Sigma = \sigma^2 U_X U_X^\top$, which no longer depends on $Q$. The expectation of the key inequality gap in (3) then becomes
$$\mathbb{E}_Q \mathrm{Tr}\big[\Delta \Sigma Q^\top (Q\Sigma Q^\top)^{-1} Q \Sigma \Delta^\top\big] - \mathbb{E}_Q \mathrm{Tr}\big[\Delta (Q^\top Q)\, \Sigma Q^\top (Q\Sigma Q^\top)^{-1} Q \Sigma \Delta^\top\big] = \mathbb{E}_Q \mathrm{Tr}\big[\Delta (I - Q^\top Q)\, \sigma^2 U_X U_X^\top \Delta^\top\big] \to \sigma^2\, \mathrm{Tr}\big[\Delta U_X U_X^\top \Delta^\top\big]$$
as $d$ becomes large, using $\mathbb{E}_Q[Q^\top Q] = \tfrac{r}{d} I$. In other words, the performance advantage of tuning $B$ over $A$ is large when $d \gg r$, which is the typical regime in practice.
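The following NumPy sketch (our illustration, with arbitrary dimensions) numerically checks the extreme example above: when the inputs lie in a rank-$r$ subspace, any tuned pair $(A', B')$ can be reproduced exactly with $A$ fixed to a random orthonormal $Q$ and $B$ re-solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, n = 32, 16, 4, 10

F = rng.normal(size=(d_in, r))                 # inputs live in a rank-r subspace
X = F @ rng.normal(size=(r, n))                # each column is one input X = F N

A_tuned = rng.normal(size=(r, d_in))           # some tuned adapter pair (A', B')
B_tuned = rng.normal(size=(d_out, r))

Q, _ = np.linalg.qr(rng.normal(size=(d_in, r)))
Q = Q.T                                        # random orthonormal A = Q (r x d_in)

# Re-express the same adaptation with A frozen to Q:
#   B' A' X = B' A' F N = (B' A' F (Q F)^{-1}) Q X
B_new = B_tuned @ A_tuned @ F @ np.linalg.inv(Q @ F)

err = np.max(np.abs(B_tuned @ A_tuned @ X - B_new @ (Q @ X)))
print(f"max discrepancy: {err:.2e}")           # numerically zero
```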
4.1.2. NONLINEAR LOSSES AND MULTILAYER MODELS

Recalling (1) with an input transformation $\phi$ and output transformation $\psi$, consider losses on the output of the form
$$\sum_{i=1}^n \Big[ h\big(f(\psi(W\phi(x_i)))\big) - y_i^\top f\big(\psi(W\phi(x_i))\big) \Big], \qquad (4)$$
where $f, h$ are differentiable functions specified by the desired loss, $y_i \in \mathbb{R}^K$, $x_i \in \mathbb{R}^{d_{\text{in}}}$, and $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$. This class contains logistic regression (with $y$ being a one-hot encoded class vector), least-squares regression, and generalized linear regression, including a neural network with cross-entropy loss with one layer being tuned. We next analyze the gradient of this loss.

Our argument is stated with one adapted parameter matrix, but it is directly applicable to multilayer and transformer networks with multiple matrices being adapted, where $\phi$, $\psi$, and $f$ will in that scenario vary depending on each parameter matrix's position in the network; $\phi$, $\psi$, and $f$ will depend on other parameter matrices and the current value of their adaptations (by definition of gradients). The interpretation will now be that fixing $A$ when adapting a parameter matrix $W^{(\ell)}$ projects the inputs of the corresponding parameter matrix to a lower-dimensional subspace while retaining the ability to fully match the outputs, and fixing $B$ correspondingly projects the parameter matrix's outputs.

For simplicity of notation, the remaining derivation in this section takes $\phi, \psi$ to be the identity; the extension to general $\phi, \psi$ is clear. Then, the gradient of (4) is
$$\nabla_W \mathcal{L}(W) = \sum_{i=1}^n J_f^\top(Wx_i)\,\big[\nabla h\big(f(Wx_i)\big) - y_i\big]\, x_i^\top, \qquad (5)$$
where $J_f$ is the Jacobian of $f$. Starting from this formula, below we incorporate (1) by taking $W = W_0 + BA$.

Freezing A. Freezing $A = Q$ yields
$$\nabla_B \mathcal{L}(BQ + W_0) = \sum_{i=1}^n J_f^\top\big((BQ + W_0)x_i\big)\,\big[\nabla h\big(f((W_0 + BQ)x_i)\big) - y_i\big]\,(Qx_i)^\top. \qquad (6)$$
Like the least-squares case, the input data is projected by $Q$, but the output $y_i$ is unaffected.

Freezing B. Freezing $B = U$ yields
$$\nabla_A \mathcal{L}(UA + W_0) = \sum_{i=1}^n U^\top J_f^\top\big((UA + W_0)x_i\big)\,\big[\nabla h\big(f((W_0 + UA)x_i)\big) - y_i\big]\, x_i^\top. \qquad (7)$$
Here, the coefficient of $x_i^\top$ can be thought of as the output fit term. It includes the Jacobian of $f$ since $f$ is applied between the weights and the output. Compared to (5) and (6), in (7) this output fit term is projected by $U^\top$. If $f$ is (near) linear, then this projection will be (approximately) data-independent, highlighting the loss of output information when freezing $B$. Hence, in this more general setting, the different roles of $A$ and $B$ are still apparent, and we expect an asymmetry in being able to fit the output.

Example: Logistic regression. For multiclass logistic regression, we have a training dataset $\{(x_i, c_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ (features) and $c_i \in \{1, \dots, K\}$ (label). Denote by $y_i \in \mathbb{R}^K$ the vector with $y_{i,c_i} = 1$ and $y_{i,k} = 0$ for $k \ne c_i$. The log-likelihood is the cross-entropy error
$$\mathcal{L}(w_1, \dots, w_K) = -\sum_{i=1}^n \sum_{k=1}^K y_{i,k} \ln(p_{i,k}), \qquad (8)$$
where $p_{i,k} = \frac{\exp(w_k^\top x_i)}{\sum_{l=1}^K \exp(w_l^\top x_i)}$ and $w_k \in \mathbb{R}^d$. Let $W \in \mathbb{R}^{K \times d}$ be the matrix whose $k$-th row is $w_k$. Then, (8) becomes
$$\sum_{i=1}^n \Big[\ln\big(\mathbf{1}^\top e^{W x_i}\big) - y_i^\top W x_i\Big],$$
where $\mathbf{1}$ is the column vector of size $K$ with all elements equal to 1; note $y_i^\top \mathbf{1} = 1$ due to the one-hot structure. This loss can be put in the form (4) by setting $f(z) = z$ and $h(z) = \ln(\mathbf{1}^\top e^z)$. For freezing, we then have
$$\nabla_A \mathcal{L}(UA) = -U^\top \sum_{i=1}^n \big(y_i - p_i(UA)\big)\,x_i^\top \quad \text{and} \quad \nabla_B \mathcal{L}(BQ) = -\sum_{i=1}^n \big(y_i - p_i(BQ)\big)\,(Qx_i)^\top,$$
where $p_i(W) = \frac{e^{W x_i}}{\mathbf{1}^\top e^{W x_i}} \in \mathbb{R}^K$. Freezing $B = U$, as in least-squares, implies that each output $y_i$ is projected as $U^\top y_i$, implying that, at best, the model can hope to only learn outputs in the small random subspace $U$. In contrast, freezing $A = Q$ is equivalent to logistic regression on the full output with features projected by $Q$: $\{(Qx_i, y_i)\}_{i=1}^n$.
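As a sanity check on the logistic-regression gradients above, the short PyTorch sketch below (ours; $W_0$ is omitted and the dimensions are arbitrary) compares autograd against the manual formula for $\nabla_A \mathcal{L}(UA)$ when $B$ is frozen to an orthonormal $U$, confirming that the output-fit term $(y_i - p_i)$ only enters through its projection by $U^\top$.

```python
import torch

torch.manual_seed(0)
n, d, K, r = 20, 16, 5, 3

X = torch.randn(n, d)
y = torch.nn.functional.one_hot(torch.randint(0, K, (n,)), K).float()

U, _ = torch.linalg.qr(torch.randn(K, r))      # frozen B = U, orthonormal columns
A = torch.randn(r, d, requires_grad=True)      # the only trained factor

logits = X @ (U @ A).T                         # W = U A (W0 dropped for brevity)
loss = (torch.logsumexp(logits, dim=1) - (y * logits).sum(dim=1)).sum()
loss.backward()

# Manual gradient from the text: grad_A = -U^T sum_i (y_i - p_i) x_i^T.
p = torch.softmax(logits, dim=1)
manual = -U.T @ (y - p).T @ X
print(torch.allclose(A.grad, manual, atol=1e-4))   # True
```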
4.2. Advantages of tuning only B over BA together

In the previous section, we established that fine-tuning $B$ alone is typically superior to fine-tuning $A$ alone. It remains, however, to motivate fine-tuning $B$ alone over fine-tuning both $A$ and $B$ together. In this section, we show that reducing the number of adapted parameters by (roughly) half provides computational gains and improvements in information-theoretic generalization bounds.

4.2.1. NUMBER OF PARAMETERS

The key benefit of LoRA is parameter efficiency, which saves memory during training, storage, and communication (Lialin et al., 2023). Fine-tuning $B$ alone as opposed to both $A$ and $B$ reduces the number of parameters by a factor of $\frac{d_{\text{out}}}{d_{\text{out}} + d_{\text{in}}}$, which equals 0.5 when $d_{\text{in}} = d_{\text{out}}$.

4.2.2. GENERALIZATION BOUNDS

Consider a learning task where the training examples lie in $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$; here, $\mathcal{X}$ denotes the feature space and $\mathcal{Y}$ is the label space. Suppose one observes a training set $S_n \triangleq (Z_1, \dots, Z_n) \in \mathcal{Z}^n$, with $n$ i.i.d. training examples from an unknown distribution $\mu$. Denote by $\mu^{\otimes n} = \mu \otimes \cdots \otimes \mu$ the distribution of $S_n$. The objective of the learner is to find a predictor $f: \mathcal{X} \to \mathcal{Y}$ that maps features to their labels. We assume each predictor is parameterized by $w \in \mathcal{W}$ (e.g., if $f$ is a neural network, $w$ denotes its weights). Denote by $\mathcal{A}: \mathcal{Z}^n \to \mathcal{W}$ the learning algorithm which selects a predictor given $S_n$. $\mathcal{A}$ is, in general, a probabilistic mapping, and we denote by $P_{W|S_n}$ the distribution of its output $W$ given input $S_n$. If $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}_+$ is a loss, we define the population risk $R_\mu(w) \triangleq \mathbb{E}_{Z \sim \mu}[\ell(w, Z)]$ and the empirical risk $\hat{R}_n(w) \triangleq \frac{1}{n}\sum_{i=1}^n \ell(w, Z_i)$. The generalization error of $\mathcal{A}$ is
$$\mathrm{gen}(\mu, \mathcal{A}) \triangleq \mathbb{E}_{(W, S_n) \sim P_{W|S_n} \otimes \mu^{\otimes n}}\big[R_\mu(W) - \hat{R}_n(W)\big].$$
We bound this generalization error using the information-theoretic generalization framework of Xu & Raginsky (2017). Consider the following incarnations of fine-tuning algorithms, corresponding to classic LoRA (tuning both $A, B$ matrices), tuning only $B$, and tuning only $A$:

Definition 4.4 (Fine-tuning algorithms). Let $\mathcal{W} = \{W_i\}_{i=1}^L$ be the $L$ parameter matrices of a pretrained model. Let $I \subseteq \{1, \dots, L\}$ be a specified subset of the parameter matrices to be fine-tuned. Given a fine-tuning training set $S_n$, let $r$ be a chosen rank and suppose each tuned parameter is quantized to $q$ bits. We define the following algorithmic frameworks (other details can be arbitrary) for choosing an adaptation $\Delta\mathcal{W} = \{\Delta_i\}_{i \in I}$, yielding a fine-tuned $\mathcal{W}_{\text{tuned}} = \{W_{\text{tuned},i}\}_{i=1}^L$ with $W_{\text{tuned},i} = W_i + \Delta_i$ for $i \in I$ and $W_{\text{tuned},i} = W_i$ otherwise:

- $\mathcal{A}_{BA}$: For each $i \in I$, constrain $\Delta_i = B_i A_i$ and optimize $\{B_i, A_i\}_{i \in I}$ to fit the data $S_n$.
- $\mathcal{A}_{B}$: For each $i \in I$, sample $Q_i \in \mathbb{R}^{r \times d^{(i)}_{\text{in}}}$ at random, constrain $\Delta_i = B_i Q_i$, and optimize $\{B_i\}_{i \in I}$ to fit the data $S_n$.
- $\mathcal{A}_{A}$: For each $i \in I$, sample $U_i \in \mathbb{R}^{d^{(i)}_{\text{out}} \times r}$ at random, constrain $\Delta_i = U_i A_i$, and optimize $\{A_i\}_{i \in I}$ to fit the data $S_n$.

We have the following lemma, proved in Appendix C:

Lemma 4.5 (Generalization bounds on adapting A and/or B). Consider the algorithms of Definition 4.4. Assume that $\ell_{\mathcal{W},b}(\Delta\mathcal{W}, \tilde{Z})$ is $\sigma$-sub-Gaussian (bounded losses are sub-Gaussian) under $(\Delta\mathcal{W}, \tilde{Z}) \sim P_{\Delta\mathcal{W}|\mathcal{W},b} \otimes \mu$. Then,
$$|\mathrm{gen}(\mu, \mathcal{A}_{BA})| \le \sqrt{\tfrac{2\sigma^2 q r}{n} \textstyle\sum_{i \in I} \big(d^{(i)}_{\text{in}} + d^{(i)}_{\text{out}}\big)}, \qquad |\mathrm{gen}(\mu, \mathcal{A}_{B})| \le \sqrt{\tfrac{2\sigma^2 q r}{n} \textstyle\sum_{i \in I} d^{(i)}_{\text{out}}}, \qquad |\mathrm{gen}(\mu, \mathcal{A}_{A})| \le \sqrt{\tfrac{2\sigma^2 q r}{n} \textstyle\sum_{i \in I} d^{(i)}_{\text{in}}}.$$

This generalization bound increases with the number of parameters being tuned, which is an increasing function of $r$ and the dimensions of the parameter matrices. Importantly, since tuning just one factor ($A$ or $B$) involves tuning fewer parameters than $A$ and $B$ together, the generalization bound is correspondingly smaller. In the case where $d^{(i)}_{\text{in}} = d^{(i)}_{\text{out}}$, the quantity under the square root for tuning one factor only is a factor of 2 smaller than for tuning both factors, implying that the rank $r$ for $\mathcal{A}_{B}$ could be doubled and have a generalization bound matching that of $\mathcal{A}_{BA}$.
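To illustrate the trade-off numerically, the small Python sketch below evaluates the quantity the bounds scale with (the square root of $q$ times the number of tuned dimensions over $n$) for $\mathcal{A}_{BA}$ at rank $r$ and $\mathcal{A}_{B}$ at ranks $r$ and $2r$. The dimensions, $q$, and $n$ are illustrative placeholders, and constants common to all three algorithms are dropped since they cancel in the comparison.

```python
import math

# Illustrative setup: three adapted square parameter matrices, q-bit quantization.
dims = [(1024, 1024)] * 3        # (d_in, d_out) for each adapted matrix
q, n, r = 16, 50_000, 8          # bits per parameter, training examples, rank

def bound_scale(num_tuned_dims):
    # Scaling of the Lemma 4.5 bounds: sqrt(q * (rank * sum of dims) / n),
    # with shared constants (e.g., 2 * sigma^2) omitted.
    return math.sqrt(q * num_tuned_dims / n)

dims_BA = r * sum(di + do for di, do in dims)        # tune both A and B, rank r
dims_B = r * sum(do for _, do in dims)               # tune B only, rank r
dims_B_2r = 2 * r * sum(do for _, do in dims)        # tune B only, rank 2r

print(f"A_BA, rank r  : {bound_scale(dims_BA):.3f}")
print(f"A_B,  rank r  : {bound_scale(dims_B):.3f}")    # sqrt(2) smaller
print(f"A_B,  rank 2r : {bound_scale(dims_B_2r):.3f}") # equals the A_BA value
```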
4.3. Discussion of theoretical analysis

The previous two sections establish two conclusions: (1) tuning $A$ has limited importance when trying to match a desired output; and (2) tuning one factor instead of two reduces the number of parameters for the same $r$, while improving generalization bounds and potentially providing memory benefits.

Given a fixed parameter count and generalization budget, therefore, we can use a larger $r = r_B$ when fine-tuning $B$ alone than the $r_{BA}$ that would be used in standard LoRA fine-tuning of both $A$ and $B$. This addition provides more expressive power for the same number of parameters without loss of generalization bounds. Hence, when matching the parameter or generalization budget, we expect that fine-tuning a rank-$r_B$ $B$ typically improves performance over fine-tuning a rank-$r_{BA}$ $BA$ LoRA adaptation.

5. Experiments

We investigate the asymmetry of low-rank adaptation methods with RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), Llama-2 (Touvron et al., 2023), and the Vision Transformer (Dosovitskiy et al., 2020). We evaluate the performance of fine-tuning strategies on natural language understanding (GLUE (Wang et al., 2018), MMLU (Hendrycks et al., 2020)), natural language generation (XSum (Narayan et al., 2018) and CNN/DailyMail (Chen et al., 2016)), and multi-domain image classification (Gulrajani & Lopez-Paz, 2020). We implement all algorithms using PyTorch starting from the publicly-available Hugging Face Transformers code base (Wolf et al., 2019).

The conventional LoRA method applies a scaling coefficient $\alpha/r$ to $\Delta W$. Following LoRA (Hu et al., 2021), we fix $\alpha = 2r$ to be twice the rank. Throughout our experiments, we use $\hat{A}$ to indicate that matrix $A$ is being updated during fine-tuning and use subscripts $\{\text{rand}, 0, \text{km}\}$ to indicate that the matrix is initialized as a random orthonormal matrix, a zero matrix, or the random uniform initialization used in the original LoRA, respectively (a minimal sketch of the frozen-$A$ configuration appears after the baseline list below). Note that a properly normalized $d \times r$ random matrix with independent entries will have close to orthonormal columns when $d \gg r$ (see e.g. Theorem 4.6.1 of Vershynin (2020)), implying that the random orthonormal and random uniform initializations should be essentially equivalent.

We compare to the following methods:

1. Full fine-tuning (FT): The most straightforward adaptation method, which initializes model parameters with the pre-trained weights and updates the whole model with gradient back-propagation.
2. Linear Probing (LP) (Kumar et al., 2022): A simple yet effective method that updates the last linear layer.
3. (IA)³ (Liu et al., 2022): Injects learned vectors into the attention and feedforward modules.
4. LoRA (Hu et al., 2021): Fine-tunes both $A$ and $B$ matrices of an additive $BA$ adaptation as introduced in previous sections, with a separate adaptation for each query/key/value parameter matrix.
5. AdaLoRA (Zhang et al., 2023b): A variant of LoRA that adaptively changes the rank for each layer.
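The following PyTorch sketch illustrates the $\hat{B}_0 A_{\text{rand}}$ configuration used throughout this section: $A$ is drawn once as a random orthonormal matrix and frozen, $B$ is zero-initialized and trained, and the $\alpha = 2r$ scaling is applied. This is our illustrative reconstruction; class and variable names are placeholders and the released code may differ.

```python
import torch
import torch.nn as nn

class BOnlyLoRALinear(nn.Module):
    """B-hat_0 A_rand adapter: A is a frozen random orthonormal matrix,
    B is zero-initialized and is the only trained adapter parameter."""

    def __init__(self, pretrained: nn.Linear, r: int = 16):
        super().__init__()
        d_out, d_in = pretrained.weight.shape
        self.W0 = pretrained
        self.W0.requires_grad_(False)                   # frozen backbone weights
        A = torch.linalg.qr(torch.randn(d_in, r)).Q.T   # r x d_in, orthonormal rows
        self.register_buffer("A", A)                    # buffer: no gradient, no update
        self.B = nn.Parameter(torch.zeros(d_out, r))    # the only trainable tensor
        self.scale = (2 * r) / r                        # alpha = 2r, so alpha / r = 2

    def forward(self, x):
        return self.W0(x) + self.scale * (x @ self.A.T) @ self.B.T

adapted = BOnlyLoRALinear(nn.Linear(768, 768), r=16)
print([name for name, p in adapted.named_parameters() if p.requires_grad])  # ['B']
```

Registering $A$ as a buffer means no gradient is computed or stored for it, which is consistent with the runtime and memory gains reported in the ablation of §5.5.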
Table 1. Different adaptation methods on the GLUE benchmark. We report the overall (matched and mismatched) accuracy for MNLI, Matthews correlation coefficient for CoLA, Pearson correlation for STS-B, and accuracy for other tasks. Higher is better for all metrics.

| Model & Method | # Trainable Params | MNLI | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| LoRA (r = 8) | 0.8% | 90.3 ±0.07 | 95.6 ±0.36 | 90.3 ±0.85 | 64.4 ±1.8 | 94.0 ±0.29 | 84.1 ±0.96 | 91.5 ±0.16 | 87.2 |
| AdaLoRA | 2.5% | 90.4 ±0.37 | 95.9 ±0.13 | 90.1 ±0.54 | 67.5 ±1.3 | 94.7 ±0.22 | 85.4 ±0.20 | 91.3 ±1.0 | 87.9 |
| (IA)³ | 0.7% | 90.0 ±0.21 | 95.4 ±0.17 | 83.7 ±0.13 | 57.6 ±0.67 | 93.7 ±0.07 | 70.3 ±1.5 | 87.0 ±0.4 | 82.5 |
| LoRA-FA | 0.3% | 90.3 ±0.06 | 95.6 ±0.17 | 90.6 ±0.32 | 67.3 ±2.3 | 93.4 ±0.61 | 82.4 ±1.4 | 91.2 ±0.29 | 87.3 |
| $\hat{B}_0 A_{\text{rand}}$ (r = 8) | 0.3% | 90.1 ±0.19 | 95.8 ±0.29 | 89.7 ±0.13 | 67.5 ±1.2 | 94.0 ±0.27 | 82.8 ±1.5 | 91.9 ±0.26 | 87.4 |
| $\hat{B}_0 A_{\text{rand}}$ (r = 16) | 0.8% | 90.1 ±0.20 | 96.1 ±0.18 | 90.7 ±0.90 | 66.1 ±2.6 | 94.4 ±0.10 | 84.1 ±0.96 | 91.2 ±0.42 | 87.5 |
| $B_{\text{rand}} \hat{A}_0$ (r = 8) | 0.3% | 90.3 ±0.18 | 95.5 ±0.66 | 89.3 ±0.09 | 58.7 ±2.5 | 93.8 ±0.21 | 77.1 ±1.3 | 90.7 ±0.31 | 84.2 |
| $B_{\text{rand}} \hat{A}_0$ (r = 16) | 0.8% | 89.9 ±0.19 | 95.6 ±0.64 | 90.2 ±0.23 | 60.3 ±3.3 | 93.9 ±0.25 | 80.4 ±0.21 | 90.9 ±0.13 | 85.9 |

Table 2. Different initializations of classic LoRA, setting either A or B to be zero. Note that the trained result is not sensitive to different initializations, with performance differences tending to be smaller than the standard error.

| Model & Method | # Trainable Params | MNLI | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| $\hat{B}_0 \hat{A}_V$ | 0.8% | 90.4 ±0.11 | 95.9 ±0.16 | 90.7 ±0.84 | 64.0 ±0.50 | 94.4 ±0.16 | 84.1 ±0.15 | 91.8 ±0.15 | 87.3 |
| $\hat{B}_0 \hat{A}_{\text{rand}}$ | 0.8% | 90.4 ±0.15 | 96.0 ±0.11 | 91.5 ±1.1 | 64.1 ±0.67 | 94.5 ±0.11 | 85.6 ±0.96 | 92.0 ±0.31 | 87.7 |
| $\hat{B}_U \hat{A}_0$ | 0.8% | 90.3 ±0.07 | 96.1 ±0.18 | 91.7 ±0.33 | 64.9 ±1.5 | 94.7 ±0.33 | 84.8 ±0.96 | 91.9 ±0.19 | 87.8 |
| $\hat{B}_{\text{rand}} \hat{A}_0$ | 0.8% | 90.3 ±0.27 | 96.0 ±0.26 | 90.8 ±0.51 | 66.0 ±1.01 | 94.5 ±0.38 | 83.6 ±1.5 | 92.0 ±0.18 | 87.8 |

5.1. Natural Language Understanding

We use the General Language Understanding Evaluation benchmark (GLUE; Wang et al., 2018) to evaluate the performance of different fine-tuning strategies. The GLUE benchmark contains a wide variety of tasks including question-answering, textual similarity, and sentiment analysis. We applied fine-tuning methods to the RoBERTa-large model (Liu et al., 2019), which has 355M parameters. To enable a fair comparison, we initialize the weights for all tasks with the original pretrained RoBERTa weights.

In Table 1 (see the appendix for an expanded table), we compare different freezing & initialization strategies with LoRA and other baselines. We underline entries whose performance is better than conventional LoRA, and we use bold to denote the best performance when freezing one of the matrices. First, we can see a clear trend where solely updating the $B$ matrix outperforms just learning the $A$ matrix. In addition, when doubling the rank to match the trainable parameters, $\hat{B}_0 A_{\text{rand}}$ consistently outperforms conventional LoRA. This confirms our hypothesis in §4.3 that any loss in expressive power from not tuning $A$ can be made up for by the larger intrinsic rank of $B$ at no additional parameter cost. In fact, its performance statistically matches that of AdaLoRA, which uses over 3 times the parameters (incurring the associated memory and training costs).

To assess the effects of different initialization methods for low-rank adaptation, we investigate initialization thoroughly in Table 2. We can see that the best results always come from orthogonal initialization, which further supports our conclusions in §4.
Table 3. R-1/2/L (%) on text summarization with BART-large on XSum and CNN/DailyMail (here we report numbers from Zhang et al. (2023b)).

| Method | # Param. | XSum (R-1/2/L) | CNN/DailyMail (R-1/2/L) |
|---|---|---|---|
| Full FT | 100% | 45.49 / 22.33 / 37.26 | 44.16 / 21.28 / 40.90 |
| LoRA (r = 2) | 0.26% | 42.81 / 19.68 / 34.73 | 43.68 / 20.63 / 40.71 |
| $\hat{B}_0 A_{\text{rand}}$, r = 16 | 0.44% | 42.91 / 19.61 / 34.64 | 43.65 / 20.62 / 40.72 |
| $B_{\text{rand}} \hat{A}_0$, r = 16 | 0.44% | 42.37 / 19.30 / 34.29 | 43.38 / 20.36 / 40.48 |
| $\hat{B}_0 \hat{A}_{\text{rand}}$, r = 8 | 0.44% | 43.78 / 20.47 / 35.53 | 43.96 / 20.94 / 41.00 |
| $\hat{B}_{\text{rand}} \hat{A}_0$, r = 8 | 0.44% | 43.80 / 20.39 / 35.48 | 44.07 / 21.08 / 41.19 |

5.2. Natural Language Generation

To investigate the asymmetry of low-rank fine-tuning in natural language generation (NLG), we fine-tune a BART-large model (Lewis et al., 2020) and evaluate model performance on the XSum (Narayan et al., 2018) and CNN/DailyMail (Chen et al., 2016) datasets. Following Zhang et al. (2023b), we apply low-rank adaptation to every query/key/value matrix and report ROUGE 1/2/L scores (R-1/2/L; Lin, 2004). We fine-tune models for 15 epochs. We select the beam length as 8 and the batch size as 48 for XSum, and the beam length as 4 and the batch size as 48 for CNN/DailyMail. More details of the configurations are in Appendix E.

The results are summarized in Table 3. In the first two rows, we observe the asymmetry between the factors, since freezing $A$ and only updating $B$ always outperforms only updating $A$. The last two rows show the results of tuning both matrices with different initializations, showing that the asymmetry is not explained by the initialization strategy.

Table 4. DomainBed results (mean accuracy and standard deviation in %). ID and OOD denote in-domain and out-of-domain test accuracy, respectively. For OOD we report the average performance across different environments.

| Method | # Param. | VLCS (ID) | VLCS (OOD) | PACS (ID) | PACS (OOD) | OfficeHome (ID) | OfficeHome (OOD) |
|---|---|---|---|---|---|---|---|
| LoRA, r = 8 | 0.46% | 73.51 ±0.62 | 56.43 ±1.96 | 94.94 ±0.56 | 75.58 ±0.92 | 78.54 ±1.49 | 74.46 ±0.40 |
| LP | 0.00% | 75.58 ±1.66 | 71.70 ±1.04 | 81.62 ±0.34 | 61.73 ±1.25 | 58.38 ±0.76 | 68.59 ±0.22 |
| Full Fine-tuning | 100% | 76.21 ±1.95 | 64.87 ±6.44 | 98.15 ±0.56 | 74.90 ±2.43 | 80.67 ±1.22 | 63.23 ±0.64 |
| $\hat{B} A_{\text{rand}}$, r = 8 | 0.29% | 77.40 ±2.30 | 75.81 ±1.65 | 92.45 ±2.68 | 72.55 ±1.03 | 77.66 ±0.89 | 77.72 ±0.32 |
| $\hat{B} A_{\text{rand}}$, r = 16 | 0.46% | 79.10 ±1.41 | 75.40 ±1.24 | 93.52 ±0.20 | 73.76 ±0.67 | 77.63 ±0.84 | 77.85 ±0.33 |
| $B_{\text{rand}} \hat{A}$, r = 8 | 0.29% | 76.71 ±0.93 | 72.50 ±0.89 | 92.02 ±1.07 | 66.25 ±0.80 | 72.36 ±0.69 | 73.66 ±0.35 |

Table 5. Accuracy (%) on the MMLU benchmark (5-shot).

| Method | # Param. | Hums. | STEM | Social | Other | Avg. |
|---|---|---|---|---|---|---|
| Llama-2-7B | 100% | 43.98 | 34.11 | 49.08 | 44.31 | 43.14 |
| LoRA, r = 32 | 0.24% | 44.59 | 36.50 | 51.81 | 45.75 | 44.76 |
| $\hat{B}_0 A_{\text{rand}}$, r = 32 | 0.12% | 44.17 | 36.00 | 46.88 | 45.14 | 45.36 |
| $B_{\text{rand}} \hat{A}_0$, r = 32 | 0.12% | 44.36 | 35.93 | 51.46 | 46.85 | 44.51 |
| $\hat{B}_0 A_{\text{rand}}$, r = 64 | 0.12% | 45.10 | 37.65 | 55.08 | 51.08 | 46.46 |

5.3. Massive Multitask Language Understanding

We fine-tune the pretrained Llama-2-7B model (Touvron et al., 2023) using instruction tuning on the Alpaca dataset (Wang et al., 2023). We assess the asymmetry on the MMLU benchmark (Hendrycks et al., 2020), which consists of 57 distinct language tasks. As shown in Table 5, the asymmetry also exists in larger language models, and updating $B$ consistently outperforms updating $A$. Moreover, it also outperforms standard LoRA, except for Other, where it matches the performance, reflecting the benefits of being able to increase $r$ without tuning more parameters.

5.4. Vision Transformers and Generalization

We next measure generalization, motivated by the theory in §4.2. In particular, we work with ViTs in image classification tasks using the DomainBed testbed for domain generalization (Gulrajani & Lopez-Paz, 2020).
DomainBed contains several datasets, each composed of multiple environments (or domains). Classes in each environment tend to be similar at a high level but differ in terms of style. We fine-tune a pre-trained ViT, originally trained on ImageNet, on the LabelMe, Cartoon, and Clipart environments within the VLCS, PACS, and Office-Home datasets, respectively. We employ different benchmark fine-tuning methods such as full fine-tuning, linear probing, and LoRA, and compare their performance to freezing either $A$ or $B$ in in-domain and out-of-domain generalization. We adhere to the original 80% training and 20% testing splits.

Results are presented in Table 4. In line with our expectations, randomly initializing and freezing matrix $A$ while only updating matrix $B$ generally results in better out-of-domain test accuracy. We report additional generalization results in Appendix H, in which we compare the train set and test set accuracy of the different approaches. We consistently find that fine-tuning a single matrix leads to smaller gaps between these two quantities compared to LoRA, paralleling the corresponding reduction in the generalization bounds of §4.2.

5.5. Ablation study and analysis

We also observe a runtime benefit when freezing the $A$ matrix, even when doubling the rank. This is because freezing matrix $A$ means its gradients do not need to be stored or computed, reducing the memory footprint for gradients during training. We provide additional experimental results on multiple datasets to illustrate the runtime improvement. Specifically, in Table 6 we compare the train samples per second of different PEFT methods on multiple fine-tuning tasks.

Table 6. Train samples per second on various datasets.

| Method | GLUE RTE | GLUE SST-2 |
|---|---|---|
| LoRA | 4.71 ±0.03 | 227.62 ±0.59 |
| AdaLoRA | 2.90 ±0.11 | 88.14 ±0.19 |
| $\hat{B}$ (r = 8) | 7.29 ±0.16 | 255.45 ±13.38 |
| $\hat{B}$ (r = 16) | 6.28 ±0.17 | 265.80 ±12.13 |

We also conducted a new ablation study to investigate how different fixed $A$ matrices affect performance. Specifically, we use three initializations: (1) columns dependent on each other, (2) rows dependent on each other, and (3) a banded matrix with a bandwidth equal to the rank. As we can see in Table 7, the model struggled to learn anything when either the columns or the rows of $A$ are correlated, while fixing $A$ to be a banded matrix leads to reasonable performance. This observation further agrees with our theoretical formulation, which requires the fixed $A$ to be orthogonal.

Table 7. Different fixed $A$ on the RTE task (accuracy in %).

| Method | RTE |
|---|---|
| $\hat{B} A^{(1)}$ | 50.9 ±3.13 |
| $\hat{B} A^{(2)}$ | 52.71 ±3.29 |
| $\hat{B} A^{(3)}$ | 83.51 ±2.18 |
| $\hat{B} A_{\text{rand}}$ (Ours) | 84.1 ±0.83 |

6. Conclusion

In this paper, we formally identify and investigate asymmetry in the roles of the low-rank adapter matrices in LoRA fine-tuning. The $A$ matrices extract features from the input, while the $B$ matrices project these features towards the desired objective. We illustrate differences between the two matrices from both theoretical and empirical perspectives. Our theoretical analysis explains the asymmetry in the fine-tuning of large models and suggests that freezing $A$ as a random orthogonal matrix can improve generalization, a claim we corroborate with experiments across multiple models and datasets. Our work serves as an initial step to unveil the mechanisms of fine-tuning large models, and it provides an understanding that can benefit future research directions, promoting efficiency and interpretability.

Impact Statement

This paper presents work whose goal is to advance machine learning.
There are no societal consequences of our work that we feel must be specifically highlighted here. Acknowledgement We thank Lingxiao Li, Aritra Guha, and the anonymous reviewers for their valuable feedback and helpful recommendations. The MIT Geometric Data Processing Group acknowledges the generous support of Army Research Office grants W911NF2010168 and W911NF2110293, from the CSAIL Systems that Learn program, from the MIT IBM Watson AI Laboratory, from the Toyota CSAIL Joint Research Center, and from an Amazon Research Award. Aghajanyan, A., Gupta, S., and Zettlemoyer, L. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 2021. doi: 10.18653/ v1/2021.acl-long.568. URL http://dx.doi.org/ 10.18653/v1/2021.acl-long.568. Benedek, N. and Wolf, L. Prilora: Pruned and rankincreasing low-rank adaptation. 2024. URL https: //api.semanticscholar.org/Corpus ID: 267068991. Chen, D., Bolton, J., and Manning, C. D. A thorough examination of the cnn/daily mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016. doi: 10.18653/v1/p16-1223. URL http://dx. doi.org/10.18653/v1/P16-1223. Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Ar Xiv, abs/2305.14314, 2023. URL https: //api.semanticscholar.org/Corpus ID: 258841328. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2018. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference, 2021. Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization, 2020. Guo, H., Greengard, P., Xing, E. P., and Kim, Y. Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning, 2024. Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., and Yang, F. Svdiff: Compact parameter space for diffusion fine-tuning. ar Xiv preprint ar Xiv:2303.11305, 2023. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2020. Asymmetry in Low-Rank Adapters of Foundation Models Hu, J. E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. Lora: Low-rank adaptation of large language models. Ar Xiv, abs/2106.09685, 2021. URL https://api.semanticscholar. org/Corpus ID:235458009. Hugging Face. Peft. https://github.com/ huggingface/peft, Year. Koohpayegani, S. A., Navaneet, K., Nooralinejad, P., Kolouri, S., and Pirsiavash, H. Nola: Networks as linear combination of low rank random basis, 2023. Kopiczko, D. J., Blankevoort, T., and Asano, Y. M. Vera: Vector-based random matrix adaptation, 2024. Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited. In International conference on machine learning, pp. 3519 3529. PMLR, 2019. 
Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution, 2022. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020. doi: 10.18653/v1/ 2020.acl-main.703. URL http://dx.doi.org/10. 18653/v1/2020.acl-main.703. Lialin, V., Deshpande, V., and Rumshisky, A. Scaling down to scale up: A guide to parameter-efficient fine-tuning. ar Xiv preprint ar Xiv:2303.15647, 2023. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74 81, 2004. Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. A. Few-shot parameter-efficient finetuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35: 1950 1965, 2022. Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., Wen, Y., Black, M. J., Weller, A., and Sch olkopf, B. Parameter-efficient orthogonal finetuning via butterfly factorization. In ICLR, 2024. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach, 2019. Narayan, S., Cohen, S. B., and Lapata, M. Don t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-1206. URL http://dx.doi.org/ 10.18653/v1/D18-1206. Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. What s hidden in a randomly weighted neural network? In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020. doi: 10.1109/cvpr42600.2020. 01191. URL http://dx.doi.org/10.1109/ CVPR42600.2020.01191. Ramsay, J., ten Berge, J., and Styan, G. Matrix correlation. Psychometrika, 49(3):403 423, 1984. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929 1958, 2014. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023. Vershynin, R. High-dimensional probability. University of California, Irvine, 2020. 
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop Blackbox NLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 2018. doi: 10.18653/v1/w18-5446. URL http://dx.doi.org/ 10.18653/v1/W18-5446. Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., Mac Millan, K., Smith, N. A., Beltagy, I., et al. How far can camels go? exploring the state Asymmetry in Low-Rank Adapters of Foundation Models of instruction tuning on open resources. ar Xiv preprint ar Xiv:2306.04751, 2023. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. Huggingface s transformers: State-of-the-art natural language processing. ar Xiv preprint ar Xiv:1910.03771, 2019. Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Yadav, P., Choshen, L., Raffel, C., and Bansal, M. Compeft: Compression for communicating parameter efficient updates via sparsification and quantization. ar Xiv preprint ar Xiv:2311.13171, 2023. Zeng, Y. and Lee, K. The expressive power of low-rank adaptation, 2023. Zhang, L., Zhang, L., Shi, S., Chu, X., and Li, B. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning, 2023a. Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. Adalora: Adaptive budget allocation for parameter-efficient finetuning, 2023b. Asymmetry in Low-Rank Adapters of Foundation Models A. Similarity Metric in Figure 1 To measure the similarity of learned A and B matrices we adopted a measure that accounts for the invariance of Lo RA fine-tuning. Let W = BA denote the learned Lo RA adapter. Since BA = BCC 1A for any invertible matrix C Rr r, we can define B = BC and A = C 1A resulting in the same Lo RA adapter W = B A. Thus, to measure the similarity of Lo RA matrices we need a metric that is invariant to invertible linear transformations, i.e., dissimilarity(B, BC) = 0 for any invertible C. In our experiment, we used Canonical Correlation Analysis goodness of fit (Ramsay et al., 1984), similar to prior work comparing neural network representations (Kornblith et al., 2019). The key idea is to compare orthonormal bases of the matrices, thus making this similarity metric invariant to invertible linear transformations. More specifically, given two matrices X Rn r1 and Y Rn r2, the similarity is computed as follows: U Y UX 2 F / min{r1, r2}, where UX/UY is the orthonormal bases for the columns of X/Y . Following a similar method as in Hu et al. (2021), for A we perform SVD and use the right-singular unitary matrices as the bases, and use left-singular unitary matrices for B. A.1. Reversed Initialization The initialization of adapter matrices can play an important role in Lo RA fine-tuning. To further investigate the effect of initialization on asymmetry, we reverse the initialization compared to conventional Lo RA, where A is initialized to zero and B is initialized with random uniform distributions. 
Overall, we observe that the trend of differences also reverses, which is expected given the significant role of initialization in training deep learning models. When comparing the similarities of different initialization strategies, we can still draw the same conclusion about the importance of the B matrix. For example, compared with Figure 2(a), the A matrices in Figure 2(d) have a smaller similarity in average. Such difference can also be observed when comparing Figure 2(b) and 2(e). (a) Random initialization, same task (b) Fixed initialization, different tasks (c) Random initialization, different tasks (d) Random initialization, same task (e) Fixed initialization, different tasks (f) Random initialization, different tasks Figure 2. Similarity of learned Lo RA matrices A & B across layers of a Ro BERTa model fine-tuned with different initialization and data settings. We compare the results from both conventional Lo RA initialization (In Figure (a), (b), and (c), A is initialized as random uniform B is initialized as zero) and a reversed initialization (In Figure (d), (e), and (f), A is initialized as zero B is initialized as random uniform. Asymmetry in Low-Rank Adapters of Foundation Models B. Asymmetry Proofs for Multivariate Least Squares B.1. Proof of Lemma 4.2 Consider freezing B = U where U is orthogonal (U U = Ir) and fine-tuning A. The objective becomes A = arg min A L(A, U) = arg min A E(Ytarg,Xtarg) Ytarg (W0 + UA)Xtarg b 2 2 = arg min A E (Wtarg Xtarg W0Xtarg) UAXtarg 2 2 = arg min A E U ((Wtarg W0)Xtarg + n) AXtarg 2 Interestingly, note that this solution A does not depend on the distribution of Xtarg, it is simply the projection of the difference between the pretrained W0 and the target Wtarg. This is because, intuitively, freezing B is projecting down the outputs into r dimensional space, and then optimizing A to match these projected outputs. It can be shown that the expected squared prediction error is L(A , U) = doutσ2 + Tr[ Σ ] Tr[U Σ U], where Σ = Cov[Xtarg]. B.2. Proof of Lemma 4.1 Consider freezing A = Q where Q is orthogonal (QQ = Ir) and fine-tuning B. The objective becomes B = arg min B L(Q, B) = arg min B E(Ytarg,Xtarg) Ytarg (W0 + BQ)Xtarg 2 2 = arg min B E (Ytarg W0Xtarg) B(QXtarg) 2 2 , which is simply an ordinary least squares regression problem mapping QXtarg to (Ytarg W0Xtarg). The solution is known to be B = ΣQ (QΣQ ) 1 yielding an expected squared prediction error of L(Q, B ) = doutσ2 + Tr[ Σ ] Tr[QΣ ΣQ (QΣQ ) 1]. Note that the solution is now clearly dependent on the distribution of Xtarg, and the first two terms of the squared prediction error are the same but the third term is different. B.3. Proof of Theorem 4.3 Note that since Σ is positive semidefinite, its symmetric square root Σ1/2 exists and we can simplify the third term in the expression for freezing A using the definition of the Moore-Penrose pseudoinverse ( ) and trace equalities as IIIA = Tr[QΣ ΣQ (QΣQ ) 1] = Tr[(Σ1/2 Σ1/2)(Σ1/2QT (QΣ1/2Σ1/2QT ) 1QΣ1/2)] = Tr[(Σ1/2 Σ1/2)((QΣ1/2) (QΣ1/2))]. By the properties of the Moore-Penrose pseudoinverse, the matrix (QΣ1/2) (QΣ1/2) is a d d orthogonal projection matrix onto the span of the r rows of QΣ1/2, i.e. we can write (QΣ1/2) (QΣ1/2) = QT ΣQΣ Asymmetry in Low-Rank Adapters of Foundation Models for some r d orthogonal matrix QΣ. But note that as d/r, d2 Σ 2 F (Tr(Σ))2 (Hanson-Wright inequality), 1 Tr(Σ)(QΣ1/2)(QΣ1/2)T p Ir, where p denotes convergence in probability. 
In other words, in the limit QΣ1/2 Tr(Σ) is close to orthogonal with high probability, implying that its transpose approaches its pseudoinverse. Hence d/r, d2 Σ 2 F (Tr(Σ))2 E[(QΣ1/2) (QΣ1/2)] = lim d/r, d2 Σ 2 F (Tr(Σ))2 1 Tr(Σ)ETr((QΣ1/2)T (QΣ1/2)) = r Σ Tr(Σ). E[IIIA] r Tr[Σ2 ] Tr[Σ] = r Tr[ Σ2 ] Tr[Σ] = r Σ 2 F Tr[Σ] . (9) Recall that on the other hand that E[IIIB] r Recall that we have assumed that the smallest nonzero eigenvalue of Σ is Tr[ Σ ]/d. Then revisiting (9) above, r Σ 2 F Tr[Σ] r d Tr[Σ]Tr[ Σ ] Tr[Σ] E[IIIB] (10) and the asymmetry is established. Hence lim d/r, d2 Σ 2 F (Tr(Σ))2 E[IIIA] limd/r E[IIIB], implying that freezing A to a random orthogonal matrix achieves lower mean squared error loss than freezing B. C. Proof of Lemma 4.5: Generalization Bounds We use the following bound on the generalization error is from (Xu & Raginsky, 2017), specialized to our setting and notation. Theorem C.1 (specialized from (Xu & Raginsky, 2017)). Denote by A a Lo RA-based fine-tuning algorithm, which outputs W given Sn. Assume that ℓW,b( W, e Z) is σ-sub-Gaussian under ( W, e Z) P W|W,b µ. Then, |gen(µ, A)| n I( W; Sn|A, W). (11) We consider the case of tuning B only first. Applying the above theorem, note that here I( W; Sn|AB, W) = I({Bi Qi}i I; Sn|AB, W) = I({Bi}i I; Sn|AB, W), where we have used the data processing inequality (DPI), noting that the Qi are here considered orthogonal fixed constant matrices as they are not trained, hence the mapping from Bi to Bi Qi is invertible. We can now bound this expression as I({Bi}i I; Sn|AB, W) H({Bi}i I) i I d(i) out, where we have noted that mutual information is upper bounded by discrete entropy, and entropy in turn is upper bounded by the uniform distribution over its possible support set (q bits in each of r P i I d(i) out dimensions). The bounds for the other two algorithms are similar. Asymmetry in Low-Rank Adapters of Foundation Models Table 8. Hyper-parameter setup for GLUE tasks. Dataset learning rate batch size # epochs γ ti T tf MNLI 5 10 4 48 25 0.1 6000 100 50000 SST-2 5 10 4 48 25 0.1 6000 100 50000 MRPC 5 10 4 48 15 0.1 5000 100 85000 Co LA 5 10 4 48 15 0.1 5000 100 85000 QNLI 5 10 4 48 15 0.1 5000 100 85000 RTE 5 10 4 48 15 0.1 5000 100 85000 STS-B 5 10 4 48 15 0.1 5000 100 85000 D. Natural Language Understanding Training Details E. Text Generation Training Details The configuration of our experiments on text generation is listed in Table 10. Table 9. Hyper-parameter setup for summarization tasks. Dataset learning rate batch size # epochs γ ti T tf XSum 5 10 4 48 25 0.1 6000 100 50000 CNN/Daily Mail 5 10 4 48 15 0.1 5000 100 85000 F. Llama-2 Training Details Table 10. Hyper-parameter setup for summarization tasks. Dataset learning rate batch size # epochs γ ti T tf Alpaca 5 10 4 48 25 0.1 6000 100 50000 Asymmetry in Low-Rank Adapters of Foundation Models G. Additional Language Results See Table 11 for additional results. Table 11. Different adaptation methods on the GLUE benchmark. We report the overall (matched and mismatched) accuracy for MNLI, Matthew s correlation coefficient for Co LA, Pearson correlation for STS-B, and accuracy for other tasks. Higher is better for all metrics. Model & Method # Trainable Parameters MNLI SST-2 MRPC Co LA QNLI RTE STS-B Avg. 
Lo RA (r = 8) 0.8M 90.3 .07 95.6 0.36 90.3 0.85 64.4 1.8 94.0 0.29 84.1 0.96 91.5 0.16 87.2 Ada Lo RA 2.5% 90.4 .37 95.9 .13 90.1 .54 67.5 1.3 94.7 .22 85.4 .20 91.3 1.0 87.9 (IA)3 0.7% 90.0 .21 95.4 .17 83.7 .13 57.6 .67 93.7 .07 70.3 1.5 87.0 0.4 82.5 ˆB0AV (r = 8) 0.3M 90.1 .09 95.5 .01 90.8 .24 63.8 4.2 94.2 .11 83.3 1.7 91.3 .24 87.0 ˆB0Arand (r = 8) 0.3M 90.1 .19 95.8 .29 89.7 .13 67.5 1.2 94.0 .27 82.8 1.5 91.9 .26 87.4 ˆB0Akm (r = 8) 0.3M 90.1 .17 95.6 .17 90.6 .32 67.3 2.3 93.4 .61 82.4 1.4 91.2 .29 87.2 BU ˆA0 (r = 8) 0.3M 89.3 .18 95.4 0.13 88.8 0.70 59.1 0.48 93.8 0.15 77.5 2.7 90.7 .27 94.9 Brand ˆA0 (r = 8) 0.3M 90.3 .18 95.5 .66 89.3 .09 58.7 2.5 93.8 .21 77.1 1.3 90.7 .31 85.1 Bkm ˆA0 (r = 8) 0.3M 34.5 1.6 95.2 .34 89.3 .11 0.0 0.0 93.0 .38 47.3 .0 91.2 .24 64.4 ˆB0AV (r = 16) 0.8M 90.2 .17 95.8 .20 90.1 .56 67.8 .49 94.5 .07 82.8 .42 91.6 .21 87.5 ˆB0Arand (r = 16) 0.8M 90.1 .20 96.1 .18 90.7 .90 66.1 2.6 94.4 .10 84.1 .96 91.2 .42 87.5 ˆB0Akm (r = 16) 0.8M 90.3 .06 95.6 .01 91.1 .32 65.2 2.1 94.5 .02 81.7 1.8 91.2 .39 87.1 BU ˆA0 (r = 16) 0.8M 90.3 .07 95.4 .57 90.4 1.1 60.7 .14 94.1 .30 80.1 1.2 90.8 .29 86.0 Brand ˆA0 (r = 16) 0.8M 89.9 .19 95.6 .64 90.2 0.23 60.3 3.3 93.9 0.25 80.4 0.21 90.9 0.13 85.9 Bkm ˆA0 (r = 16) 0.8M 89.2 .03 95.2 .29 90.6 0.65 40.4 35. 93.1 0.23 70.3 0.19 91.4 0.26 81.5 ˆB0 ˆAV (r = 8) 0.8M 90.4 .11 95.9 0.18 90.7 0.84 64.0 0.50 94.4 0.16 84.1 0.15 91.8 00.15 87.3 ˆB0 ˆArand (r = 8) 0.8M 90.4 .15 96.0 .63 91.5 1.1 64.1 0.67 94.5 0.11 85.6 0.96 92.0 0.31 87.7 ˆB0 ˆAkm (r = 8) 0.8M 90.3 .07 95.6 0.36 90.3 0.85 64.4 1.8 94.0 0.29 84.1 0.96 91.5 0.16 87.2 ˆBU ˆA0 (r = 8) 0.8M 90.3 .11 96.1 .18 91.7 0.33 64.9 1.5 94.7 0.33 84.8 0.96 91.9 0.19 87.8 ˆBrand ˆA0 (r = 8) 0.8M 90.3 .27 96.0 .26 90.8 0.51 66.0 1.01 94.5 0.38 83.6 1.5 92.0 0.18 87.6 ˆBkm ˆA0 (r = 8) 0.8M 35.5 1.6 95.6 .65 90.0 0.46 21.3 36. 93.8 0.01 57.4 0.17 91.6 0.43 69.3 H. Additional Vision Transformers and Generalization Results Table 12 displays a more fine-grained version of Table 4 in the main text, and presents results for each out-of-distribution environment independently, in which it is easier to appreciate the benefits of only updating B in terms of out-of-domain performance. Additional results for Terra Incognita, as well as generalization results, can be found in Table 13 and Table 14, respectively. Terra Incognita seems to be a particularly challenging dataset to which low-rank adapters struggle to fit; the most effective method, in this case, appears to be full fine-tuning. In terms of generalization, we can observe that fine-tuning only a single adapter matrix generally results in a lower difference between training set and test set accuracy compared to standard Lo RA for all datasets. Table 12. Domain Bed results (mean accuracy and standard deviation in %). ID and OOD denote in-domain and out-of-domain generalization, respectively. 
Method # Trainable Parameters VLCS PACS Office Home (% full Vi T params) Caltech101 Label Me SUN09 VOC2007 Art Cartoon Photo Sketch Art Clipart Product Photo (OOD) (ID) (OOD) (OOD) (OOD) (ID) (OOD) (OOD) (OOD) (ID) (OOD) (OOD) ˆBArand (r = 8) 0.16M-0.2M (0.18-0.29%) 93.19 2.27 77.40 2.30 61.52 1.50 72.72 1.18 81.22 1.40 92.45 2.68 96.07 0.86 40.37 0.83 73.59 0.59 77.66 0.89 78.02 0.14 81.55 0.24 ˆBArand (r = 16) 0.3M-0.4M (0.36-0.46%) 91.57 0.81 79.10 1.41 60.97 2.44 73.66 0.46 84.36 0.54 93.52 0.20 97.07 0.47 39.87 0.99 73.64 0.40 77.63 0.84 78.07 0.22 81.85 0.36 Brand ˆA (r = 8) 0.16M-0.2M (0.18-0.29%) 87.18 0.77 76.71 0.93 59.89 1.79 70.44 0.10 77.05 0.74 92.02 1.07 92.06 0.34 29.65 1.31 68.36 0.28 72.36 0.69 74.00 0.31 78.63 0.45 Brand ˆA (r = 16) 0.3M-0.4M (0.36-0.46%) 89.28 2.51 78.03 1.23 60.44 1.84 70.81 0.36 81.43 0.92 93.87 0.73 95.63 0.13 35.02 0.86 71.64 0.24 73.77 1.13 75.46 0.25 80.31 0.39 Lo RA (r = 8) 0.3M-0.4M (0.35-0.46%) 44.59 1.96 73.51 0.62 60.44 2.86 64.26 1.07 81.41 0.70 94.94 0.56 95.43 0.54 49.90 1.51 70.44 0.46 78.54 1.49 73.99 0.64 78.95 0.10 Linear Probing 0.004M (0.00%) 90.65 2.51 75.58 1.66 53.74 0.27 70.71 0.35 67.66 0.63 81.62 0.34 88.80 1.43 28.72 1.70 64.56 0.23 58.38 0.76 66.97 0.43 74.23 .001 Full FT 86.4M (100%) 70.57 15.13 76.21 1.95 57.14 1.46 66.90 2.72 75.52 2.89 98.15 0.56 89.54 1.88 59.63 2.53 58.38 0.64 80.67 1.22 63.05 0.85 68.27 0.43 Asymmetry in Low-Rank Adapters of Foundation Models Table 13. Terra Incognita results (mean accuracy and standard deviation in %). All methods were trained for 20,000 steps. Method # Trainable Parameters Terra Incognita (% full Vi T params) L100 L38 L43 L46 (OOD) (ID) (OOD) (OOD) ˆBArand (r = 8) 0.16M-0.2M (0.18-0.29%) 16.59 2.59 79.88 0.45 6.46 1.25 10.96 0.52 ˆBArand (r = 16) 0.3M-0.4M (0.36-0.46%) 14.14 1.45 80.48 0.99 7.74 0.26 11.09 0.76 Brand ˆA (r = 8) 0.16M-0.2M (0.18-0.29%) 12.82 0.84 78.65 0.57 3.42 0.81 7.24 1.36 Brand ˆA (r = 16) 0.3M-0.4M (0.36-0.46%) 17.58 1.01 78.89 0.55 8.41 1.88 7.62 0.56 Lo RA (r = 8) 0.3M-0.4M (0.35-0.46%) 41.36 2.94 87.33 .13 13.48 2.19 7.76 1.69 Linear Probing 0.004M (0.00%) 13.82 .20 69.82 0.36 10.06 .45 13.90 .49 Full FT 86.4M (100%) 38.33 6.50 95.05 .31 14.18 2.33 19.50 1.53 Table 14. Generalization results (train set - test set accuracy in %) for Domain Bed. 
Method # Trainable Parameters VLCS PACS Office Home Terra Incognita (% full Vi T params) Caltech101 Label Me SUN09 VOC2007 Art Cartoon Photo Sketch Art Clipart Product Photo L100 L38 L43 L46 (OOD) (ID) (OOD) (OOD) (OOD) (ID) (OOD) (OOD) (OOD) (ID) (OOD) (OOD) (OOD) (ID) (OOD) (OOD) ˆBArand (r = 8) 0.2M-M (0.29-0.%) -1.72 2.24 11.82 1.21 28.09 2.04 16.98 0.74 15.82 0.68 3.83 0.70 0.83 0.30 57.34 0.89 15.94 0.28 11.87 1.14 11.51 0.47 7.97 0.56 64.20 2.58 0.91 0.43 74.33 1.26 69.82 0.53 ˆBArand (r = 16) 0.3M-0.4M (0.36-0.46%) -2.48 0.69 9.99 1.44 28.11 2.74 15.43 0.70 12.92 0.87 3.76 0.40 0.22 0.67 57.42 0.62 16.22 0.93 12.25 1.23 11.81 0.34 8.19 0.87 66.62 1.54 0.28 1.18 73.02 0.24 69.67 0.56 Brand ˆA (r = 8) 0.2M-M (0.29-0.%) 0.19 0.86 10.66 0.86 27.48 1.86 16.93 0.19 19.79 0.66 4.81 0.99 4.78 0.29 67.19 1.34 17.73 0.30 13.73 0.86 12.08 0.42 7.45 0.65 65.86 0.64 0.04 0.60 75.27 0.50 71.45 1.17 Brand ˆA (r = 16) 0.3M-0.4M (0.36-0.46%) -1.50 2.88 9.75 0.85 27.34 2.07 16.97 0.61 15.89 0.96 3.44 0.54 1.69 0.30 62.30 0.83 15.20 0.53 13.07 1.30 11.38 0.38 6.53 0.64 62.17 1.41 0.86 0.96 71.34 1.91 72.13 0.15 Lo RA (r = 8) 0.3M-0.4M (0.35-0.46%) 52.94 1.48 24.03 0.16 37.10 3.25 33.28 1.64 18.23 0.74 4.70 0.57 4.22 0.43 49.74 1.44 26.07 0.39 17.97 1.80 22.53 0.63 17.57 0.23 47.53 2.80 1.56 0.24 75.41 2.29 81.12 1.73 Linear Probing 0.004M (0.00%) -12.03 2.11 3.04 1.38 24.88 0.47 7.91 0.79 17.18 0.13 3.22 0.40 -3.96 1.90 56.13 1.33 6.02 0.21 12.20 1.03 3.61 0.51 -3.65 0.19 55.17 0.28 -0.82 0.31 58.94 0.52 55.10 0.52 Full FT 86.4M (100%) 29.03 15.27 23.40 2.05 42.47 1.83 32.70 2.27 24.41 2.94 1.78 0.54 10.38 1.90 40.30 2.49 40.23 0.48 17.94 1.36 35.56 1.02 30.35 0.53 59.84 6.53 3.12 0.26 83.99 2.31 78.67 1.47