Published as a conference paper at ICLR 2024

THE EXPRESSIVE POWER OF LOW-RANK ADAPTATION

Yuchen Zeng, Department of Computer Science, University of Wisconsin-Madison, yzeng58@wisc.edu
Kangwook Lee, Department of Electrical and Computer Engineering, University of Wisconsin-Madison, kangwook.lee@wisc.edu

ABSTRACT

Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\bar{f}$ if LoRA-rank $\geq (\text{width of } f) \times \frac{\text{depth of } \bar{f}}{\text{depth of } f}$, under a mild assumption. We also quantify the approximation error when the LoRA-rank is lower than this threshold. For Transformer networks, we show that any model can be adapted to a target model of the same size with rank-$\left(\frac{\text{embedding size}}{2}\right)$ LoRA adapters. Our study reveals numerous theoretical insights on hyperparameter tuning and algorithm development for LoRA, all of which are empirically validated.

1 INTRODUCTION

Recent foundation models, such as large language models (OpenAI, 2023; Liu et al., 2019; Touvron et al., 2023), have achieved remarkable success in a wide range of applications. Due to their substantial size, the standard full fine-tuning approach, in which all of the model's parameters are updated for specialized tasks, is becoming increasingly difficult and inefficient. This has led to the growing popularity of parameter-efficient fine-tuning approaches (Hu et al., 2022a; Liu et al., 2022b; Ben Zaken et al., 2022; Hu et al., 2022b). Instead of updating all parameters, these approaches selectively update smaller subsets of weights or introduce lightweight adapters, thereby greatly decreasing the computational and storage costs. The most dominant approach along this line is Low-Rank Adaptation (LoRA) (Hu et al., 2022a), which applies lightweight low-rank adapters to the pre-trained weight matrices. Far from merely enhancing computational efficiency, empirical evidence has shown that LoRA can match or even exceed the performance of full fine-tuning (Hu et al., 2022a). To date, LoRA has been widely used and has achieved considerable success in adapting large language models (Hu et al., 2022a; Dinh et al., 2022b) and image generation models (Ryu, 2023; Fan et al., 2023) for various downstream tasks.

Despite the empirical success of LoRA, little is known in theory about how it works. A notable exception (Malladi et al., 2023) showed that LoRA fine-tuning is approximately equivalent to full fine-tuning in the lazy regime. However, many theoretical questions remain open, such as: What is the minimum rank of the LoRA adapters required to adapt a (pre-trained) model $f$ to match the functionality of a target model $\bar{f}$? How does the model architecture (i.e., depth, width) affect the minimal rank? If the adapter rank is lower than this threshold, what is the resulting approximation error? Answering such questions will provide important theoretical insights into when and why LoRA achieves effective adaptation.

Our Contributions.
In this paper, we present the first set of theoretical results that characterize the expressive power of Low-Rank Adaptation (LoRA) for Fully Connected Neural Networks (FNN) and Transformer Networks (TFN). In particular, we identify the necessary LoRA-rank for adapting a frozen model to exactly match a target model. For the FNN case, we also establish the required LoRA-rank for closely approximating the target model when a small approximation error is allowed. Our work focuses solely on the expressive power of the model with low-rank adapters, i.e., we show under which conditions effective low-rank adapters exist for a given adaptation task. This excludes other aspects such as optimization and generalization. We now present the essence of our main theoretical findings in the following informal statement.

Theorem 1 (Informal). Let $\bar{f}$ be a target FNN and $f_0$ be an arbitrary frozen FNN. Under mild conditions on ranks and network architectures, there exist low-rank adapters such that a low-rank adapted version of $f_0$ is exactly equal to $\bar{f}$.

We present the detailed formulations of Theorem 1 under two scenarios: (i) applying a uniform rank across all LoRA adapters, as detailed in Theorem 3 and the specialized instance Corollary 4 for randomly drawn frozen and target models; and (ii) allowing different ranks to be applied to each LoRA adapter, as described in Theorem 6. To the best of our knowledge, these are the first theoretical results on the expressive power of LoRA. While this informal theorem concerns exact representation, we also derive approximation bounds, i.e., we characterize the approximation error between the fine-tuned model and the target model as a function of the LoRA-rank, as provided in Theorem 5 for the uniform LoRA-rank scenario and in Theorem 6 for general cases. Furthermore, within the same framework, we investigate the expressive power of tuning the final layers of randomly generated frozen models, as described in Lemma 4. This result allows us to contrast LoRA with final layer tuning, thereby providing insights for future algorithm development. We summarize our main findings on TFN in the following informal theorem.

Theorem 2 (Informal). Let $\bar{f}$ be the target TFN and $f_0$ be the frozen TFN. Under mild conditions on ranks and network architectures, there exist low-rank adapters for the attention weight matrices such that a low-rank adapted version of $f_0$ is exactly equal to $\bar{f}$.

The formal statement of Theorem 2 is provided in Theorem 7, with a specialized version in Corollary 10 tailored to randomly generated frozen and target models. In Sec. 5 and G, we perform experiments on both synthetic and real datasets to substantiate our theoretical results, demonstrating the practical applicability of our theoretical findings in algorithm development and hyperparameter tuning.

1.1 RELATED WORKS

Expressive Power of Neural Networks. Theoretical study of the expressive power of unfrozen neural networks has progressed since the first universal approximation theorem (Hornik et al., 1989), showing that sufficient network width and depth can guarantee function approximation (Bengio & Delalleau, 2011; Eldan & Shamir, 2016; Liang & Srikant, 2017).
Many recent studies have obtained similar results for deep neural networks with modern twists such as ReLU activations and Transformer networks (Yun et al., 2020a; Raghu et al., 2017; Telgarsky, 2016; 2015; Bietti & Bach, 2021; Oymak et al., 2023; Lee et al., 2017; Shen & Zhang, 2020; Likhosherstov et al., 2021; Hsu et al., 2021; Park et al., 2021; Yun et al., 2020b; Giannou et al., 2023b). Metrics such as the Vapnik-Chervonenkis and Rademacher complexities (Vapnik & Chervonenkis, 2015; Bartlett & Mendelson, 2001) assess classification capacity. However, these theories cannot fully explain the performance of frozen neural networks, as they generally cannot factor in pre-trained model parameters and adaptation methods.

Expressive Power of Adaptation Methods. In stark contrast to the flourishing research on the expressive power of neural networks, only a limited number of works investigate the expressive power of adaptation methods. A notable exception is Giannou et al. (2023a), which investigates the expressive power of fine-tuning normalization parameters. They demonstrate that fine-tuning the normalization layers alone can adapt a randomly initialized ReLU network to match any target network that is O(width) times smaller. We borrow some proof techniques from this work, including techniques for extending results from linear neural networks to ReLU neural networks. In another recent work (Englert & Lazic, 2022), the authors show that neural reprogramming (Elsayed et al., 2019; Engel et al., 2018; Lee et al., 2020; Dinh et al., 2022a; Chen, 2022), a technique that modifies only the inputs while keeping the pretrained network frozen, can adapt any random two-layer ReLU network to achieve arbitrarily high accuracy on a Bernoulli data model over hypercube vertices. Despite these early attempts, no existing study has yet explored the expressive power of LoRA, the current leading adaptation method. A more detailed discussion of related works is provided in Sec. B.

1.2 NOTATIONS

Define $[N] := \{1, 2, \ldots, N\}$. Let the operators $\wedge$ and $\vee$ denote the minimum and the maximum function, respectively. We use $I$ to represent the identity matrix. For a sequence of $L$ matrices $(W_l)_{l=1}^{L}$, we abbreviate the product $W_L W_{L-1} \cdots W_1$ as $\prod_{l=1}^{L} W_l$, with matrices multiplied in descending order from $W_L$ to $W_1$. When $m > n$, we define $\sum_{i=m}^{n} a_i = 0$ and $\prod_{i=m}^{n} a_i = 1$ for scalars $(a_i)_{i=m}^{n}$, and $\sum_{i=m}^{n} W_i = O$ and $\prod_{i=m}^{n} W_i = I$ for square matrices $(W_i)_{i=m}^{n}$. The Singular Value Decomposition (SVD) of a matrix $W$ can be expressed as $W = U D V^\top$, where $U, V \in \mathbb{R}^{D \times D}$ are orthonormal matrices and $D \in \mathbb{R}^{D \times D}$ is a diagonal matrix. The singular values, sorted in descending order, appear on the diagonal of $D$ and satisfy $\sigma_1(W) \geq \sigma_2(W) \geq \cdots \geq \sigma_D(W) \geq 0$, where $\sigma_d(W)$ denotes the $d$-th largest singular value for all $d \in [D]$. When $d > D$, $\sigma_d(W)$ is defined as zero. The best rank-$r$ approximation (in the Frobenius norm or the 2-norm) of $W$ is $\sum_{i=1}^{r} \sigma_i u_i v_i^\top$, where $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$, respectively (Eckart & Young, 1936; Mirsky, 1960). We denote this best rank-$r$ approximation by $\mathrm{LR}_r(W)$, where LR is shorthand for "Low-Rank". When $r \geq \operatorname{rank}(W)$, it is clear that $\mathrm{LR}_r(W) = W$. Occasionally, the subscript $r$ may be omitted to indicate a general low-rank approximation without specifying the rank.
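Since $\mathrm{LR}_r(\cdot)$ appears throughout our constructions, we note that it is obtained by simply truncating the SVD. The snippet below is an illustrative sketch of this computation (the test matrix and dimensions are our own choices, not code from our experiments).

```python
import numpy as np

def low_rank_approx(W: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r approximation LR_r(W) of W in Frobenius/spectral norm,
    obtained by truncating the SVD (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = min(r, len(s))                       # if r >= rank(W), LR_r(W) = W
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# sanity check: the spectral-norm error of LR_r(W) equals sigma_{r+1}(W)
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
r = 4
err = np.linalg.norm(W - low_rank_approx(W, r), 2)
print(np.isclose(err, np.linalg.svd(W, compute_uv=False)[r]))   # True
```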
2 WARM-UP: EXPRESSIVE POWER OF LINEAR MODELS WITH LORA

Before delving into the expressive power of LoRA for FNN and TFN, we begin by investigating the simplest scenario: both the target model $\bar{f}$ and the frozen model $f_0$ are linear, i.e.,

Target Model $\bar{f}(x) = \bar{W} x$, Frozen Model $f_0(x) = W_L \cdots W_1 x = \left(\prod_{l=1}^{L} W_l\right) x$.

This problem serves as a simplified version of approximating a target FNN, where the target model $\bar{f}$ has a single layer, the frozen model $f_0$ has $L$ layers, all bias vectors in both models are zero, and the activation functions are linear. Throughout this paper, for the sake of simplicity, we will assume that both models have the same number of neurons in each layer, i.e., $\bar{W}, W_1, \ldots, W_L \in \mathbb{R}^{D \times D}$. Nevertheless, our results are readily extendable to situations where the frozen model is wider than the target model, which is a more natural setting, as frozen models are often overparameterized to ensure high capacity and good performance across diverse tasks in practice. See the discussion in Sec. H for more details.

The objective here is to incorporate low-rank adapters into the frozen model so that the adapted model can effectively approximate the target model. Unless otherwise specified, we always consider a uniform LoRA-rank for all low-rank adapters throughout this paper. For a given LoRA-rank $R \in [D]$, we apply LoRA adapters $\Delta W_1, \ldots, \Delta W_L$ to the frozen model, and the adapted model can be represented as

Adapted Model $\hat{f}(x) = (W_L + \Delta W_L) \cdots (W_1 + \Delta W_1)\, x$, where $\operatorname{rank}(\Delta W_l) \leq R$ for all $l \in [L]$.

Since the frozen model and the adapted model are both linear, we can focus on quantifying the discrepancy between the linear coefficients, i.e., $\prod_{l=1}^{L}(W_l + \Delta W_l) - \bar{W}$. In the subsequent lemma, we establish the minimal achievable norm and identify the smallest LoRA-rank required for the adapted model to exactly represent the target model, i.e., $\hat{f} = \bar{f}$, under a non-singularity assumption. We will demonstrate in Sec. 3.3 that this non-singularity assumption is mild, as it can be satisfied even by randomly generated weight matrices.

Lemma 1. Define the error matrix $E := \bar{W} - \prod_{l=1}^{L} W_l$, and denote its rank by $R_E = \operatorname{rank}(E)$. For a given LoRA-rank $R \in [D]$, assume that all the weight matrices of the frozen model $(W_l)_{l=1}^{L}$, as well as $\prod_{l=1}^{L} W_l + \mathrm{LR}_r(E)$ for all $r \leq R(L-1)$, are non-singular. Then, we have the following:

$\min_{\{\Delta W_l\}:\, \operatorname{rank}(\Delta W_l) \leq R} \left\| \prod_{l=1}^{L}(W_l + \Delta W_l) - \bar{W} \right\|_2 = \sigma_{RL+1}(E)$.

Thus, when $R \geq \lceil R_E / L \rceil$, the optimal solution satisfies $\prod_{l=1}^{L}(W_l + \Delta W_l) = \bar{W}$, implying $\hat{f} = \bar{f}$.

Proof Sketch. We start the proof by noting that the distance between the adapted and target models satisfies $\|\hat{f}(x) - \bar{f}(x)\|_2 \leq \left\|\prod_{l=1}^{L}(W_l + \Delta W_l) - \bar{W}\right\|_2 \|x\|_2$. The remaining proof aims to minimize the right-hand side under the constraint $\operatorname{rank}(\Delta W_l) \leq R$ for all $l \in [L]$. The basic idea is to match $\prod_{l=1}^{L}(W_l + \Delta W_l) - \prod_{l=1}^{L} W_l$ with the best low-rank approximation of $\bar{W} - \prod_{l=1}^{L} W_l$. The key steps to solve this problem are as follows.

1. Demonstrate that $\prod_{l=1}^{L}(W_l + \Delta W_l) - \prod_{l=1}^{L} W_l$ can be decomposed into $L$ terms: $\prod_{l=1}^{L}(W_l + \Delta W_l) - \prod_{l=1}^{L} W_l = \sum_{l=1}^{L} \left( \prod_{i=l+1}^{L} W_i \right) \Delta W_l \left( \prod_{i=1}^{l-1}(W_i + \Delta W_i) \right)$. Since $\operatorname{rank}(\Delta W_l) \leq R$, it follows that $\operatorname{rank}\left( \prod_{l=1}^{L}(W_l + \Delta W_l) - \prod_{l=1}^{L} W_l \right) \leq RL$.

2. Consider the rank-$RL$ approximation of $\bar{W} - \prod_{l=1}^{L} W_l$. Decompose this low-rank approximation into $L$ terms $\sum_{l=1}^{L} E_l$ such that $\operatorname{rank}(E_l) \leq R$, where the $E_l$'s will be determined later.

3. To match $\prod_{l=1}^{L}(W_l + \Delta W_l) - \prod_{l=1}^{L} W_l$ with the rank-$RL$ approximation of $\bar{W} - \prod_{l=1}^{L} W_l$, we let $\left( \prod_{i=l+1}^{L} W_i \right) \Delta W_l \left( \prod_{i=1}^{l-1}(W_i + \Delta W_i) \right) = E_l$ by choosing $\Delta W_l = \left( \prod_{i=l+1}^{L} W_i \right)^{-1} E_l \left( \prod_{i=1}^{l-1}(W_i + \Delta W_i) \right)^{-1}$.
4. Select appropriate $(E_l)_{l=1}^{L}$ such that $W_i + \Delta W_i$ is invertible for all $i \in [L]$.

The complete proof, including the explicit construction of the optimal LoRA adapters, is detailed in Sec. D.

In fact, this lemma delivers a crucial insight. When we consider $L = 1$ and $R = D$, the lemma becomes strikingly similar to the Eckart-Young-Mirsky theorem (Eckart & Young, 1936; Mirsky, 1960). However, there is a significant difference from the classical theorem on optimal low-rank approximation, which involves a single target matrix and a single matrix as an optimization variable. Our lemma demonstrates that a comparable result can be achieved for a product of matrices, where each matrix is optimized subject to a low-rank constraint. That is, even though each matrix is constrained to a low rank, the effective rank is the sum of these low ranks, which in this scenario is $LR$. Consequently, once the low-rank adapters are optimally configured, one can make the product equal to the best rank-$LR$ approximation of the target matrix. This can be viewed as an extension of the matrix approximation theorem to a product of matrices, each subject to low-rank constraints. Our main theoretical results on the expressive power of LoRA, which we present in the subsequent sections, build upon this core matrix approximation result.

3 EXPRESSIVE POWER OF FNNS WITH LORA

3.1 PROBLEM SETTING

We use $\mathrm{FNN}_{L,D}(\cdot; (W_l)_{l=1}^{L}, (b_l)_{l=1}^{L})$ to denote an $L$-layer, width-$D$ fully connected ReLU neural network with weight matrices $W_l \in \mathbb{R}^{D \times D}$ and biases $b_l \in \mathbb{R}^{D}$, where $l \in [L]$. The target FNN $\bar{f}$ and the frozen FNN $f_0$ can be represented as follows:

Target FNN $\bar{f} := \mathrm{FNN}_{\bar{L},D}(\cdot; (\bar{W}_l)_{l=1}^{\bar{L}}, (\bar{b}_l)_{l=1}^{\bar{L}})$, Frozen FNN $f_0 := \mathrm{FNN}_{L,D}(\cdot; (W_l)_{l=1}^{L}, (b_l)_{l=1}^{L})$,

where $\bar{W}_l \in \mathbb{R}^{D \times D}$ and $\bar{b}_l \in \mathbb{R}^{D}$ represent the weight matrix and bias vector of the $l$-th layer of the target model $\bar{f}$, respectively. Likewise, $W_l \in \mathbb{R}^{D \times D}$ and $b_l \in \mathbb{R}^{D}$ are those of $f_0$, for layer $l \in [L]$. Given a specified LoRA-rank $R \in [D]$, we adapt the frozen FNN $f_0$ into a new model $\hat{f}$ via LoRA. The adapted model $\hat{f}$ is defined as

Adapted FNN $\hat{f} := \mathrm{FNN}_{L,D}(\cdot; (W_l + \Delta W_l)_{l=1}^{L}, (\hat{b}_l)_{l=1}^{L})$,

where the weight matrix of each low-rank adapter $\Delta W_l \in \mathbb{R}^{D \times D}$ satisfies the specified rank constraint, and $\hat{b}_l \in \mathbb{R}^{D}$ is the updated bias vector for $l \in [L]$.¹ As noted in Sec. 2, it is common for the pretrained model to be larger than necessary. Therefore, we focus on a setting where the frozen model is deeper than the target model, i.e., $\bar{L} \leq L$. Furthermore, in this section, we let the input space $\mathcal{X} \subseteq \mathbb{R}^{D}$ be bounded.

¹We consider the case where the bias parameters can also be updated, as suggested by Hu et al. (2022a). Experiments investigating the impact of updating bias parameters are presented in Sec. G.5.

3.2 ONE-LAYER RELU FNN APPROXIMATION

We start by investigating the expressive power of LoRA for approximating a one-layer FNN. In this setting, our aim is to identify LoRA adapters $(\Delta W_l)_{l=1}^{L}$ and bias vectors $(\hat{b}_l)_{l=1}^{L}$ such that the adapted model $\mathrm{ReLU}\!\left((W_L + \Delta W_L)\,\mathrm{ReLU}\!\left((W_{L-1} + \Delta W_{L-1})\,\mathrm{ReLU}(\cdots) + \hat{b}_{L-1}\right) + \hat{b}_{L}\right)$ closely approximates the target one-layer ReLU FNN $\mathrm{ReLU}(\bar{W}_1 x + \bar{b}_1)$. This differs from the setting described in Sec. 2, where a multi-layer FNN with linear activation functions and zero biases was used to approximate a one-layer FNN with the same properties. In the current setting, we introduce non-linearity through the use of ReLU activation functions in the frozen model and also take biases into account.
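Before describing the construction, the following small numerical sketch illustrates the linearization idea that drives its first step below: on a bounded input set, a sufficiently large bias keeps every ReLU active, so the layer acts as an affine map whose offset can later be compensated through the remaining bias terms. The dimensions, input set, and bias choice here are illustrative assumptions, not the exact construction from our proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 16, 1000
W = rng.standard_normal((D, D))
X = rng.uniform(-1.0, 1.0, size=(n, D))       # bounded input set (hypothetical)

# choose a bias large enough that every pre-activation is positive on X:
# |(Wx)_i| <= sum_j |W_ij| for x in [-1, 1]^D
b = np.abs(W).sum(axis=1) + 1.0

pre = X @ W.T + b                              # pre-activations
relu_out = np.maximum(pre, 0.0)
print(np.all(pre > 0), np.allclose(relu_out, pre))   # True True: the layer is affine on X
```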
Consequently, the main challenge in generalizing the findings of Sec. 2 to this new setting is handling the non-linearity introduced by the ReLU activations in the frozen model. We employ the following two steps to extend the results of Sec. 2 to the current setting.

1. (Linearization) We eliminate the nonlinearity in the first $L-1$ layers of the adapted model, making it equivalent to a one-layer ReLU FNN. This can be readily achieved by choosing sufficiently large bias vectors for the first $L-1$ layers to ensure that all ReLUs in these layers are activated. This technique of eliminating non-linearity is inspired by Giannou et al. (2023a).

2. (Weight Matrix Alignment) We update the bias vector of the last layer, $\hat{b}_L$, to align with that of the target model $\bar{f}$, and apply the linear model approximation result (i.e., Lemma 1) to identify low-rank adapters that match the weight matrix of $\bar{f}$.

Following the steps above, we arrive at the subsequent lemma, which demonstrates that any one-layer FNN can be closely approximated by a multi-layer FNN fine-tuned via LoRA. The complete proof is provided in Sec. E.1.

Lemma 2. Define the error matrix $E := \bar{W}_1 - \prod_{l=1}^{L} W_l$, with its rank represented by $R_E = \operatorname{rank}(E)$. Consider a LoRA-rank $R \in [D]$. Assume that the weight matrices $W_1, \ldots, W_L \in \mathbb{R}^{D \times D}$ and $\prod_{l=1}^{L} W_l + \mathrm{LR}_r(E)$ for all $r \leq R(L-1)$ are non-singular. Let $x$ be a random input sampled from a distribution with bounded support $\mathcal{X}$ and let $\Sigma = \mathbb{E}\, x x^\top$. Then, there exist rank-$R$ or lower matrices $\Delta W_1, \ldots, \Delta W_L \in \mathbb{R}^{D \times D}$ and bias vectors $\hat{b}_1, \ldots, \hat{b}_L \in \mathbb{R}^{D}$ such that the expected squared error can be bounded as $\mathbb{E}\, \|\hat{f}(x) - \bar{f}(x)\|_2^2 \leq \|\Sigma\|_F\, \sigma_{RL+1}^2(E)$. Moreover, when $R \geq \lceil R_E / L \rceil$, we have $\hat{f}(x) = \bar{f}(x)$ for all $x \in \mathcal{X}$.

3.3 MULTI-LAYER RELU FNN APPROXIMATION

We now generalize our discussion to the approximation of multi-layer ReLU FNNs. The key strategy for extending the results to approximating multi-layer ReLU FNNs via LoRA is model partition, inspired by Giannou et al. (2023a). To elucidate this, we start with a specific example.

Example 1. Consider the case where $\bar{L} = 2$ and $L = 4$. We view the two-layer target model $\bar{f}$ as a composition of two one-layer ReLU FNNs. Accordingly, we partition the four-layer adapted model $\hat{f}$ into two submodels, each consisting of two layers. For each layer in the target model, we utilize the two corresponding layers in the frozen/adapted model for its approximation. The problem then reduces to a one-layer FNN approximation problem, which has already been addressed in Lemma 2.

Based on this example, we introduce an ordered partition $\mathcal{P} = \{P_1, \ldots, P_{\bar{L}}\}$ of the layers in the adapted model $\hat{f}$, where $\bigcup_{i=1}^{\bar{L}} P_i = [L]$ and each element $P_i \in \mathcal{P}$ consists of consecutive integers. Given a partition $\mathcal{P}$, each element $P_i$ specifies that the layers with index $l \in P_i$ in the adapted model will be used to approximate the $i$-th layer in the target model. Example 1, which uses every two layers in the adapted model to approximate each layer in the target model, corresponds to the partition $\{\{1, 2\}, \{3, 4\}\}$. Similarly, we extend this simple uniform partition to the general case of an $\bar{L}$-layer target FNN and an $L$-layer frozen FNN:

$\mathcal{P}^u = \{P^u_1, \ldots, P^u_{\bar{L}}\} := \{\{1, \ldots, M\}, \{M+1, \ldots, 2M\}, \ldots, \{(\bar{L}-1)M + 1, \ldots, L\}\}$,

where $M := \lfloor L / \bar{L} \rfloor$. The uniform partition indicates that every $M$ layers in the adapted model are employed to approximate one layer in the target model.
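For concreteness, a short helper that builds the uniform partition $\mathcal{P}^u$ described above is sketched below; it is an illustrative sketch, and the function and argument names are ours.

```python
def uniform_partition(L_frozen: int, L_target: int) -> list[list[int]]:
    """Uniform partition P^u: every M = floor(L_frozen / L_target) consecutive layers
    of the frozen/adapted model approximate one layer of the target model;
    the last block absorbs any remaining layers."""
    M = L_frozen // L_target
    parts = [list(range(i * M + 1, (i + 1) * M + 1)) for i in range(L_target - 1)]
    parts.append(list(range((L_target - 1) * M + 1, L_frozen + 1)))
    return parts

print(uniform_partition(L_frozen=4, L_target=2))   # [[1, 2], [3, 4]]  (Example 1)
print(uniform_partition(L_frozen=7, L_target=3))   # [[1, 2], [3, 4], [5, 6, 7]]
```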
We use $\prod_{l \in P_i} W_l$ to denote the product of the weight matrices from the layers $l \in P_i$, with later layers positioned to the left and earlier layers to the right in the matrix product. For example, $\prod_{l \in P^u_1} W_l = \prod_{l=1}^{M} W_l = W_M \cdots W_1$. We first extend Lemma 2 to the multi-layer FNN approximation setting using this uniform partition.

Uniform Model Partition. Given a specified LoRA-rank $R \in [D]$, to derive our results we introduce a non-singularity assumption on the weight matrices of the target and frozen models, needed for the feasibility of our analysis. The assumption is mild: Lemma 3 shows that even randomly initialized weight matrices satisfy it.

Assumption 1 (Non-Singularity). For a fixed LoRA-rank $R \in [D]$, the weight matrices of the frozen model $(W_l)_{l=1}^{L}$ and the matrices $\prod_{l \in P^u_i} W_l + \mathrm{LR}_r\!\left(\bar{W}_i - \prod_{l \in P^u_i} W_l\right)$ are non-singular for all $r \leq R(M-1)$ and $i \in [\bar{L}]$.

Lemma 3. Let $(\bar{W}_l)_{l=1}^{\bar{L}}, (W_l)_{l=1}^{L} \subseteq \mathbb{R}^{D \times D}$ be matrices whose elements are drawn independently from arbitrary continuous distributions. Then, with probability 1, Assumption 1 holds for all $R \in [D]$.

Given this assumption, we present our first main result, which shows that any frozen FNN can be adapted to exactly represent the target FNN via LoRA.

Theorem 3. Under Assumption 1, if the LoRA-rank satisfies $R \geq \max_{i \in [\bar{L}]} \left\lceil \operatorname{rank}\!\left(\bar{W}_i - \prod_{l \in P^u_i} W_l\right)/M \right\rceil$, then there exist rank-$R$ or lower matrices $\Delta W_1, \ldots, \Delta W_L \in \mathbb{R}^{D \times D}$ and bias vectors $\hat{b}_1, \ldots, \hat{b}_L \in \mathbb{R}^{D}$ such that the low-rank adapted model $\hat{f}$ exactly represents the target model $\bar{f}$, i.e., $\hat{f}(x) = \bar{f}(x)$ for all $x \in \mathcal{X}$.

Moreover, combining Lemma 3 and Theorem 3 gives the following corollary.

Corollary 4. Assume that the elements of $(\bar{W}_l)_{l=1}^{\bar{L}}, (W_l)_{l=1}^{L}$ are independently drawn from arbitrary continuous distributions. When $R \geq \lceil D/M \rceil$, with probability 1, there exist rank-$R$ or lower matrices $\Delta W_1, \ldots, \Delta W_L \in \mathbb{R}^{D \times D}$ and bias vectors $\hat{b}_1, \ldots, \hat{b}_L \in \mathbb{R}^{D}$ such that the low-rank adapted model $\hat{f}$ exactly represents the target model $\bar{f}$ on $\mathcal{X}$, i.e., $\hat{f}(x) = \bar{f}(x)$ for all $x \in \mathcal{X}$.

To understand the implications of this corollary, let us consider $\bar{L} \ll L$. In this scenario, the required LoRA-rank is small enough that the number of parameters of a rank-$R$ matrix is approximately $2RD$. This corollary thus suggests that with $2RDL \approx 2D^2 L / M \approx 2D^2 \bar{L}$ learnable parameters, even a random FNN can be adapted into the target model $\bar{f}$. It is noteworthy that the total number of parameters of the target model is $D^2 \bar{L}$. This indicates that even though the learnable parameters under LoRA fine-tuning appear to be highly constrained (low-rank constrained learnable parameters distributed across many layers), the effective expressive power of LoRA is nearly optimal, up to a constant factor of 2. Our discovery provides the first theoretical insights into the practical success of LoRA. Furthermore, Theorem 3 indicates that if the frozen model is close to $\bar{f}$, so that $\max_{i \in [\bar{L}]} \operatorname{rank}\!\left(\bar{W}_i - \prod_{l \in P^u_i} W_l\right)$ is small, the number of learnable parameters used by LoRA can be lower than $D^2 \bar{L}$.

Meanwhile, when the employed LoRA-rank is lower than this critical threshold, the following theorem provides an upper bound on the approximation error.

Theorem 5. Define the approximation error of the $i$-th layer as $E_i := \sigma_{RM+1}\!\left(\bar{W}_i - \prod_{l \in P^u_i} W_l\right)$, and the magnitude of the parameters and the input as $\beta := \|\Sigma\|_F \max_{i \in [\bar{L}]} \left( \prod_{j=1}^{i} \|\bar{W}_j\|_F + \sum_{j=1}^{i} \prod_{k=j+1}^{i-1} \|\bar{W}_k\|_F\, \|\bar{b}_j\|_2 \right)$. Under Assumption 1, there exist rank-$R$ or lower matrices $(\Delta W_l)_{l=1}^{L}$ with $\Delta W_l \in \mathbb{R}^{D \times D}$ and bias vectors $(\hat{b}_l)_{l=1}^{L}$ with $\hat{b}_l \in \mathbb{R}^{D}$ such that for a random input $x$ supported on $\mathcal{X}$ with $\mathbb{E}\, xx^\top = \Sigma$,

$\mathbb{E}\, \|\hat{f}(x) - \bar{f}(x)\|_2 \leq \beta \sum_{i=1}^{\bar{L}} \left( \max_{k \in [\bar{L}]} \left( \|\bar{W}_k\|_F + E_k \right) \right)^{\bar{L} - i} E_i$.
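As a quick numerical illustration of the rank threshold in Theorem 3 (and Corollary 4), the sketch below evaluates $\max_{i} \lceil \operatorname{rank}(\bar{W}_i - \prod_{l \in P^u_i} W_l)/M \rceil$ for randomly drawn weights; the helper and the random matrices are our own illustrative assumptions, not the setup used in our experiments.

```python
import numpy as np

def required_uniform_rank(target_Ws, frozen_Ws, partition):
    """LoRA-rank threshold from Theorem 3:
    max_i ceil(rank(Wbar_i - prod_{l in P^u_i} W_l) / M)."""
    M = len(partition[0])                    # block size of the uniform partition
    D = frozen_Ws[0].shape[0]
    ranks = []
    for Wbar_i, P_i in zip(target_Ws, partition):
        prod = np.eye(D)
        for l in P_i:                        # later layers multiply on the left
            prod = frozen_Ws[l - 1] @ prod
        ranks.append(np.linalg.matrix_rank(Wbar_i - prod))
    return max(int(np.ceil(r / M)) for r in ranks)

rng = np.random.default_rng(0)
D, M = 16, 2
frozen = [rng.standard_normal((D, D)) for _ in range(2 * M)]   # L = 4 frozen layers
target = [rng.standard_normal((D, D)) for _ in range(2)]       # target depth 2
print(required_uniform_rank(target, frozen, [[1, 2], [3, 4]])) # 8 = D / M with prob. 1
```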
Theorem 5 provides an upper bound on the approximation error of the adapted model. This bound is influenced by several factors: (i) the magnitude of the target model's parameters and the input, captured by $\beta$ and $\|\bar{W}_k\|_F$; (ii) the rank $R$ of the adapters and the discrepancy between the frozen and target models, $(\bar{W}_i - \prod_{l \in P^u_i} W_l)_{i=1}^{\bar{L}}$, both of which enter through the terms $E_i$; and (iii) the depth $L$ of the frozen model, reflected in $M$ and consequently in $E_i$. All proofs of the results derived for the uniform partition are provided in Sec. E.2.

General Model Partition. We note that the uniform partition strategy for approximating the target model may not always yield optimal results. To illustrate this, we revisit the case considered in Example 1, where $\bar{L} = 2$ and $L = 4$. Consider a scenario where the first layer of the frozen model has been pretrained to match the first layer of the target model. In this case, we can use just the first layer of $\hat{f}$ to approximate the first layer of $\bar{f}$, and a zero LoRA-rank suffices for the exact representation of the first layer. The remaining three layers of $\hat{f}$ can then be used to approximate the second layer of $\bar{f}$. Compared to the uniform partition, this partition leverages more layers to approximate the second layer of $\bar{f}$, allowing us to achieve the desired performance with a lower LoRA-rank, as per Lemma 2. This suggests that our approximation error bounds could be further optimized by considering partitioning schemes tailored to specific scenarios.

We now extend our results to a more general setting, where we do not assume a uniform partition. Concurrently, recent research by Zhang et al. (2023) has shown that applying varying LoRA-ranks leads to improved results. Consequently, we permit each layer in the frozen model to use an adapter with a different LoRA-rank. The rank of the LoRA adapter associated with the $l$-th layer of the frozen model is denoted by $R_l$, where $l \in [L]$. This result relies on Assumption 2, an analog of Assumption 1 revised to accommodate a general model partition. More details, including the proofs, are provided in Sec. E.3.

Theorem 6. Consider a partition $\mathcal{P}$ of the frozen model and let Assumption 2 hold. If $\sum_{l \in P_i} R_l \geq \operatorname{rank}\!\left(\bar{W}_i - \prod_{l \in P_i} W_l\right)$ for all $i \in [\bar{L}]$, then there exist LoRA adapters $(\Delta W_l)_{l=1}^{L}$ with $\operatorname{rank}(\Delta W_l) \leq R_l$ and biases $(\hat{b}_l)_{l=1}^{L}$ such that the adapted model $\hat{f}$ exactly represents the target model. Moreover, define the approximation error of the $i$-th layer as $E_i := \sigma_{\sum_{l \in P_i} R_l + 1}\!\left(\bar{W}_i - \prod_{l \in P_i} W_l\right)$, and the magnitude of the parameters and the input as $\beta := \|\Sigma\|_F \max_{i \in [\bar{L}]} \left( \prod_{j=1}^{i} \|\bar{W}_j\|_F + \sum_{j=1}^{i} \prod_{k=j+1}^{i-1} \|\bar{W}_k\|_F\, \|\bar{b}_j\|_2 \right)$. Then, there exist LoRA adapters $(\Delta W_l)_{l=1}^{L}$ with $\operatorname{rank}(\Delta W_l) \leq R_l$ and biases $(\hat{b}_l)_{l=1}^{L}$ such that for a random input $x$ supported on $\mathcal{X}$ with $\mathbb{E}\, xx^\top = \Sigma$, the approximation error can be bounded as

$\mathbb{E}\, \|\hat{f}(x) - \bar{f}(x)\|_2 \leq \beta \sum_{i=1}^{\bar{L}} \left( \max_{k \in [\bar{L}]} \left( \|\bar{W}_k\|_F + E_k \right) \right)^{\bar{L} - i} E_i$.

Comparison to Tuning Final Layers. Updating the final layers while keeping the initial layers frozen (Chatfield et al., 2014; Donahue et al., 2014; Sharif Razavian et al., 2014; Rahimi & Recht, 2007) is another popular model adaptation method. However, unlike LoRA, which can adapt even randomly generated networks to match a target model, empirical studies (Kornblith et al., 2019) suggest that the effectiveness of final layers tuning depends heavily on the quality of the initial layers. This indicates that merely tuning the final layers of randomly generated networks may not yield desirable performance.
The following lemma rigorously supports this assertion, demonstrating that regardless of how the final layers are tuned, it is impossible to adapt a randomly generated model into even a one-layer FNN, a model of very low complexity.

Lemma 4. Let $D \geq 2$ and let $\bar{f}$ be a one-layer target FNN. Assume that the elements of the weight matrices $(W_l)_{l=1}^{L}$ are independently drawn from arbitrary continuous distributions. Then, with probability 1, for any tuning of the last $L-1$ layers, $\hat{f} \neq \bar{f}$.

In Corollary 4, we demonstrate that LoRA can adapt any randomly generated model to match the target model using at most twice the number of learnable parameters of the target model. However, this lemma reveals that final layers tuning, even with $L-1$ times the learnable parameters of the target model, cannot achieve performance comparable to LoRA. In other words, LoRA requires at most $2RDL \approx 2D^2$ learnable parameters to achieve an exact representation, while final layers tuning fails to approximate the target model even with $(L-1)D^2$ learnable parameters. Therefore, when $L \geq 3$, LoRA can deliver strictly superior performance to final layers tuning with the same or fewer parameters. This provides insights into the empirical observation that LoRA outperforms final layers tuning (Kaplun et al., 2023; Ding et al., 2023).

4 EXPRESSIVE POWER OF TRANSFORMER NETWORKS WITH LORA

4.1 PROBLEM SETTING

A Transformer network, denoted by $\mathrm{TFN}_{L,D}$, is a composition of $L$ Transformer blocks and an output layer parameterized by a weight matrix $W_o \in \mathbb{R}^{D \times D}$. Each Transformer block comprises an $H$-head self-attention layer, parameterized by weights $((W^h_{Ol}, W^h_{Vl}, W^h_{Kl}, W^h_{Ql})_{h=1}^{H})_{l=1}^{L}$, followed by a token-wise feedforward layer, parameterized by weights $(W_{1l}, W_{2l})_{l=1}^{L}$ and biases $(b_{1l}, b_{2l})_{l=1}^{L}$. We assume that all weight matrices have dimension $D \times D$, while the bias vectors are of dimension $D$. We employ the same formulation of Transformer blocks as Yun et al. (2020a), with one exception: we exclude skip connections for analytical feasibility. As before, we use bars (e.g., $\bar{W}_{1l}$) to denote the corresponding parameters of the target model, and $\Delta$ (e.g., $\Delta W^h_{Ol}$) to denote the corresponding low-rank updates. For the TFN case, we consider scenarios where both the frozen model and the target model have $L$ Transformer blocks. For an explicit formulation, please refer to Sec. F.2.

4.2 MAIN RESULTS ON TRANSFORMER NETWORKS

We now present our main findings on TFNs. The first result relies on a non-singularity assumption (Assumption 4) tailored to TFNs. This assumption is mild, and models with randomly generated weights satisfy its criteria (Lemma 14). Further details are deferred to Sec. F.2. The following theorem shows that adding LoRA adapters primarily to the self-attention layers enables the adapted model $\hat{f}$ to exactly represent the target model $\bar{f}$. This finding is consistent with a recent observation by Hu et al. (2022a), which indicates that good performance can be achieved by adapting only the attention layers when applying LoRA to TFNs.

Theorem 7. Consider a given LoRA-rank $R \in [D]$ and let Assumption 4 hold. Let $G_i$ be the rank-based functionality gap of the $i$-th Transformer block ($i \in [L]$) or of the output layer ($i = L+1$), defined in (23).
If $R \geq \max_{i \in [L+1]} \lceil G_i / 2 \rceil$, then there exist low-rank adapters of rank at most $R$, namely $((\Delta W^h_{Kl}, \Delta W^h_{Ql}, \Delta W^h_{Vl}, \Delta W^h_{Ol})_{h=1}^{H})_{l=1}^{L}$, $\Delta W_{2L}$, and $\Delta W_o$, with all other low-rank adapters set to $O$, and updated bias vectors $(\hat{b}_{1l}, \hat{b}_{2l})_{l=1}^{L}$, such that for any input $X \in \mathbb{R}^{D \times N}$, the adapted model $\hat{f}$ exactly represents the target model $\bar{f}$, i.e., $\hat{f}(X) = \bar{f}(X)$.

Proof Sketch. The primary challenge in extending our analysis to TFNs, as in the FNN case, is the nonlinearity introduced by the softmax and ReLU functions. To manage this, we segment a sequence of Transformer blocks based on the softmax and ReLU functions. Specifically, we align the attention scores before the softmax is applied, and then match the output of the first feedforward layer before the ReLU is applied. The complete proof of Theorem 7 and the results for randomly generated models can be found in Sec. F.2.

Meanwhile, our results here are specifically for TFNs with multi-head attention layers. For TFNs with single-head attention layers, the construction of the LoRA adapters differs due to the absence of $W^h_{Oi}$. Since the results are similar, we defer the problem setting and results for TFNs with single-head attention layers to Sec. F.1.

5 EXPERIMENTS

Recall that all our theoretical statements are based on the construction of the LoRA adapters presented in the corresponding proofs. To validate these results, we empirically examine the relationship between approximation error and rank by integrating the LoRA adapters constructed with the uniform partition in our proofs into the frozen model.

Validation of Our LoRA Adapter Construction. We employ the Mean Squared Error (MSE) to assess the approximation error, comparing the MSE of the LoRA adapters derived from the gradient update method with that of our construction. We consider linear models and FNNs with model dimension $D = 16$. For the linear model case, we set $\bar{L} = 1$, $L = 2$, while for the FNN case, we set $\bar{L} = 2$, $L = 4$. We include two variants of the frozen model for fine-tuning: one with randomly initialized parameters (Random) and another pretrained on the target distribution (Pretrained).

Figure 1: Approximation error (measured by MSE) versus LoRA-rank, comparing the gradient update method with our construction. (a) Linear model approximation. (b) FNN approximation.

Our results for linear model approximation and FNN approximation via LoRA are depicted in Fig. 1a and 1b, respectively. Firstly, we observe that the MSE in both cases is close to zero when $R \geq D\bar{L}/L = 8$, which corroborates our claims. Meanwhile, a comparison between the left and right columns of Fig. 1a suggests that pretraining can further reduce the rank required to achieve near-zero approximation error. Furthermore, the curves of our construction align well with those of the gradient update method in the linear model approximation case, confirming the optimality claimed in Lemma 1. However, in the FNN approximation case, the gradient update method outperforms our construction in the small-rank region. We conjecture that the suboptimality of our construction in this multi-layer FNN case arises from unnecessarily matching the intermediate outputs of the frozen model with those of the target model during adapter construction. The uniform partition could also be a contributing factor.
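For reference, the gradient update baseline in Fig. 1a amounts to ordinary LoRA training of the frozen layers against the target model's outputs. The sketch below is a minimal PyTorch version of that setup under our own illustrative hyperparameters (learning rate, step count, and batch size are assumptions, not the exact experimental configuration).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable rank-R update W + A @ B."""
    def __init__(self, W_frozen: torch.Tensor, rank: int):
        super().__init__()
        D = W_frozen.shape[0]
        self.register_buffer("W", W_frozen)            # frozen weight, not trained
        self.A = nn.Parameter(torch.randn(D, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, D))    # B = 0 => adapter starts at zero
    def forward(self, x):
        return x @ (self.W + self.A @ self.B).T

torch.manual_seed(0)
D, R = 16, 8
W_bar = torch.randn(D, D)                              # one-layer linear target model
layers = nn.Sequential(LoRALinear(torch.randn(D, D), R),
                       LoRALinear(torch.randn(D, D), R))   # two-layer frozen model
opt = torch.optim.Adam([p for p in layers.parameters() if p.requires_grad], lr=1e-2)
for _ in range(2000):
    x = torch.randn(256, D)
    loss = ((layers(x) - x @ W_bar.T) ** 2).mean()     # MSE against the target model
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())   # should approach zero when R >= D * 1 / 2 = 8 (cf. Lemma 1, Fig. 1a)
```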
| Finding | Empirical Observation | Theoretical Insights |
| --- | --- | --- |
| For a fixed downstream task, larger models require a lower LoRA-rank to achieve the desired performance. | Sec. G.9 | Lemma 1, 2 and Theorem 5, 6 |
| When the frozen model is closer to the target model, a lower LoRA-rank is sufficient to attain the desired performance. | Sec. G.9 and footnote 6 in Hu et al. (2022a) | Lemma 1, 2 and Theorem 5, 6, 7 |
| LoRA outperforms final layers tuning if the quality of the shared representation is not good. | Sec. G.4 and observations by Kaplun et al. (2023) and Ding et al. (2023) | Lemma 4 |
| In addition to applying low-rank updates to weight matrices, it is crucial to also update the bias. | Sec. G.5 and footnote 2 in Hu et al. (2022a) | Proofs in Sec. 3.2 and E.1 |
| Tuning attention weights is sufficient for achieving good performance on TFNs. | Sec. 4.2 in Hu et al. (2022a) | Theorem 7 |
| Current optimization algorithms for LoRA training might be suboptimal. | Fig. 4, 5, and 9 | |

Table 1: Summary of our findings, supported by empirical evidence and theoretical results.

Detailed Experimental Setup and Additional Experiments in the Appendix. Further experimental details and a series of additional experiments, including simulations on FNNs and TFNs at different depths, evaluation on classification tasks, an empirical comparison between LoRA and final layers tuning, an investigation of the importance of updatable biases, LoRA's generalization and optimization properties, and experiments on the GLUE benchmark (Wang et al., 2018), are provided in Sec. G. Table 1 summarizes all the empirical findings and aligns them with theoretical insights.

6 CONCLUSIONS

This work pioneers the theoretical analysis of the expressive power of LoRA fine-tuning in FNNs and TFNs, offering novel insights into how rank, model depth, and proximity to the target model influence LoRA's effectiveness. Our theoretical findings are validated by empirical evidence. Future work includes quantifying approximation errors for TFNs when the LoRA-ranks are lower than required and refining LoRA adapter update algorithms based on our construction of LoRA adapters.

ACKNOWLEDGEMENT

This work was supported by NSF Award DMS-2023239, the NSF/Intel Partnership on Machine Learning for Wireless Networking Program under Grant No. CNS-2003129, and a grant by Furiosa AI. We extend our heartfelt gratitude to Angeliki Giannou, Kartik Sreenivasan, Tuan Dinh, Jy-yong Sohn, Jingpeng Liu, and the anonymous reviewers for their insightful comments that significantly enhanced the quality of our paper.

REPRODUCIBILITY STATEMENT

The code for all experiments reported in this paper is publicly accessible. For the purpose of reproducibility, the code can be found at the following anonymized GitHub repository: https://github.com/UW-Madison-Lee-Lab/Expressive_Power_of_LoRA.

REFERENCES

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. arXiv preprint arXiv:2306.00297, 2023.

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations (ICLR), 2023.

Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. arXiv preprint arXiv:2306.04637, 2023.

Peter L Bartlett and Shahar Mendelson.
Rademacher and Gaussian complexities: Risk bounds and structural results. In Computational Learning Theory (COLT), volume 2111, pp. 224 240, 2001. Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bit Fit: Simple parameter-efficient finetuning for Transformer-based masked language-models. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1 9, 2022. Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In Algorithmic Learning Theory, pp. 18 36, 2011. Alberto Bietti and Francis Bach. Deep equals shallow for Re LU networks in kernel regimes. 2021. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Richard Caron and Tim Traynor. The zero set of a polynomial. WSMR Report, 2005. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. ar Xiv preprint ar Xiv:1405.3531, 2014. Pin-Yu Chen. Model reprogramming: Resource-efficient cross-domain machine learning. ar Xiv preprint ar Xiv:2202.10629, 2022. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2:303 314, 1989. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLo RA: Efficient finetuning of quantized llms. ar Xiv preprint ar Xiv:2305.14314, 2023. Published as a conference paper at ICLR 2024 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional Transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4171 4186, 2019. Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi, Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen, Yang Liu, Jie Tang, Juanzi Li, and Maosong Sun. Parameter-efficient finetuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220 235, 2023. Tuan Dinh, Daewon Seo, Zhixu Du, Liang Shang, and Kangwook Lee. Improved input reprogramming for GAN conditioning. ar Xiv preprint ar Xiv:2201.02692, 2022a. Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. LIFT: Language-interfaced fine-tuning for nonlanguage machine learning tasks. Advances in Neural Information Processing Systems (Neur IPS), 35:11763 11784, 2022b. Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. De CAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), pp. 647 655, 2014. Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations (ICLR), 2021. Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. 
Psychometrika, 1936. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Annual Conference on Learning Theory, volume 49, pp. 907 940, 2016. Gamaleldin F Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial Reprogramming of neural networks. In International Conference on Learning Representations (ICLR), 2019. Jesse Engel, Matthew Hoffman, and Adam Roberts. Latent constraints: Learning to generate conditionally from unconditional generative models. In International Conference on Learning Representations (ICLR), 2018. Matthias Englert and Ranko Lazic. Adversarial Reprogramming revisited. In Advances in Neural Information Processing Systems (Neur IPS), volume 35, pp. 28588 28600, 2022. Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. ar Xiv preprint ar Xiv:2305.16381, 2023. Angeliki Giannou, Shashank Rajput, and Dimitris Papailiopoulos. The expressive power of tuning only the Norm layers. ar Xiv preprint ar Xiv:2302.07937, 2023a. Angeliki Giannou, Shashank Rajput, Jy-Yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped Transformers as programmable computers. In International Conference on Machine Learning (ICML), volume 202, pp. 11398 11442, 2023b. Michael Gira, Ruisu Zhang, and Kangwook Lee. Debiasing pre-trained language models via efficient fine-tuning. In Workshop on Language Technology for Equality, Diversity and Inclusion, pp. 59 69, 2022. Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. De BERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, 2021. Published as a conference paper at ICLR 2024 Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359 366, 1989. Daniel Hsu, Clayton H Sanford, Rocco Servedio, and Emmanouil Vasileios Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of random Re LUs. In Conference on Learning Theory, volume 134, pp. 2423 2461, 2021. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022a. Shengding Hu, Zhen Zhang, Ning Ding, Yadao Wang, Yasheng Wang, Zhiyuan Liu, and Maosong Sun. Sparse structure search for delta tuning. In Advances in Neural Information Processing Systems (Neur IPS), volume 35, pp. 9853 9865, 2022b. Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs crossentropy in classification tasks. In International Conference on Learning Representations, 2021. Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018. Gal Kaplun, Andrey Gurevich, Tal Swisa, Mazor David, Shai Shalev-Shwartz, and Eran Malach. Sub Tuning: Efficient finetuning for multi-task learning. ar Xiv preprint ar Xiv:2302.06354, 2023. Kenji Kawaguchi. Deep learning without poor local minima. Advances in neural information processing systems, 29, 2016. Simon Kornblith, Jonathon Shlens, and Quoc V Le. 
Do better Image Net models transfer better? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019. Thomas Laurent and James von Brecht. Deep linear networks with arbitrary loss: All local minima are global. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 2902 2907, 2018. Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. On the ability of neural nets to express distributions. In Conference on Learning Theory, pp. 1271 1296, 2017. Kangwook Lee, Changho Suh, and Kannan Ramchandran. Reprogramming GANs via input noise design. In Machine Learning and Knowledge Discovery in Databases - European Conference, (ECML PKDD), volume 12458, pp. 256 271, 2020. Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Empirical Methods in Natural Language Processing (EMNLP), pp. 3045 3059, 2021. Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL/IJCNLP), pp. 4582 4597, 2021. Shiyu Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations (ICLR), 2017. Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. 2021. Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in overparameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85 116, 2022a. Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In Advances in Neural Information Processing Systems (Neur IPS), volume 35, pp. 1950 1965, 2022b. Published as a conference paper at ICLR 2024 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Ro BERTa: A robustly optimized bert pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. ar Xiv preprint ar Xiv:1702.08580, 2017. Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. Advances in neural information processing systems, 30, 2017. Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, and Sanjeev Arora. A kernel-based view of language model fine-tuning. In International Conference on Machine Learning, pp. 23610 23641, 2023. Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1 32, 2016. Leon Mirsky. Symmetric gauge functions and unitarily invariant norms. The quarterly journal of mathematics, 1960. Open AI. GPT-4 technical report, 2023. Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, and Christos Thrampoulidis. On the role of attention in prompt-tuning. In International Conference on Machine Learning (ICML), 2023. Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum width for universal approximation. In International Conference on Learning Representations (ICLR), 2021. Jorge P erez, Javier Marinkovi c, and Pablo Barcel o. On the turing completeness of modern neural network architectures. 
ar Xiv preprint ar Xiv:1901.03429, 2019. Aleksandar Petrov, Philip HS Torr, and Adel Bibi. When do prompting and prefix-tuning work? A theory of capabilities and limitations. ar Xiv preprint ar Xiv:2310.19698, 2023. Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning (ICML), pp. 2847 2854, 2017. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (Neur IPS), volume 3, pp. 5, 2007. Simo Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning. https://github. com/cloneofsimo/lora, 2023. Andrew M Saxe, James L Mc Clelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 2014. Daewon Seo, Hongyi Wang, Dimitris Papailiopoulos, and Kangwook Lee. Empirical study on the effective VC dimension of low-rank neural networks. In ICML Workshop on Overparameterization: Pitfalls & Opportunities, 2021. Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806 813, 2014. Haizhao Shen, Zuowei Yang and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, (5):1768 1811, 2020. Matus Telgarsky. Representation benefits of deep feedforward networks. ar Xiv preprint ar Xiv:1509.08101, 2015. Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pp. 1517 1539, 2016. Published as a conference paper at ICLR 2024 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ee Lacroix, Baptiste Rozi ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLa MA: Open and efficient foundation language models. ar Xiv preprint ar Xiv:2302.13971, 2023. Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. In Advances in neural information processing systems (Neur IPS), volume 33, pp. 7852 7862, 2020. Rasul Tutunov, Antoine Grosnit, Juliusz Ziomek, Jun Wang, and Haitham Bou-Ammar. Why can large language models generate correct chain-of-thoughts? ar Xiv preprint ar Xiv:2310.13571, 2023. Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity: Festschrift for Alexey Chervonenkis, pp. 11 30. 2015. Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo ao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. ar Xiv preprint ar Xiv:2212.07677, 2022. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP Workshop Blackbox NLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353 355, 2018. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Noam Wies, Yoav Levine, and Amnon Shashua. 
The learnability of in-context learning. ar Xiv preprint ar Xiv:2303.07895, 2023. Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations (ICLR), 2022. Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are Transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations (ICLR), 2020a. Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. 33:13783 13794, 2020b. Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023. Published as a conference paper at ICLR 2024 This appendix encompasses more discussions, experiments, and proofs of the results presented in the main body. Given the extensive use of notations in our paper, we begin by presenting a list of common notations in Sec. A for the reader s convenience. We then delve into a more detailed discussion of related works in Sec. B. Following this, we present the proofs of results from the main body and auxiliary results in Sec. C, D, E, and F. Specifically, we provide additional results for TFN with single-head attention layers, and TFN with multi-head attention layers under random model cases in Sec. F. Further experimental details and interesting experiment findings are provided in Sec. G. Finally, we discuss how to extend our results to cases with varying model dimensions in Sec. H, while this work primarily focuses on instances where both the target model and the frozen model possess the same model width D. More potential future works are outlined in Sec. I. A List of Common Notations 17 B Expanded Related Works 18 C Proofs Related to Linear Algebra 19 C.1 Common Matrix Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C.2 Non-Singularity of Randomly Generated Matrices . . . . . . . . . . . . . . . . . . 19 D Proofs for Linear Model Approximation 20 E Proofs for FNN Approximation 25 E.1 Approximating One-Layer Re LU FNN via Lo RA . . . . . . . . . . . . . . . . . . 25 E.2 Approximating Multi-Layer Re LU FNN via Lo RA with Uniform Model Parition . 27 E.3 Approximating Multi-Layer Re LU FNN via Lo RA with General Model Parition . . 31 E.4 Approximating Multi-Layer Re LU FNN via Final Layers Tuning . . . . . . . . . . 32 F Proofs for TFN Approximation 33 F.1 Approximating Transformer Network with Single-Head Attention Layers . . . . . 33 F.2 Approximating Transformer Network with Multi-Head Attention Layers . . . . . . 36 G Experiments 39 G.1 Additional Details of Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . 40 G.2 Additional Details on Gradient Update Method . . . . . . . . . . . . . . . . . . . 40 G.3 Validation of Our Lo RA Adapter Construction . . . . . . . . . . . . . . . . . . . . 40 G.3.1 FNN Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 G.3.2 TFN Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 G.4 Comparison to Tuning Final Layers . . . . . . . . . . . . . . . . . . . . . . . . . 42 G.5 Benefits of Tuning Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
42 G.6 Training Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 G.7 Generalization Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 G.8 Evaluation on Classification Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . 43 G.9 Evaluation on Real Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Published as a conference paper at ICLR 2024 H Extension to Cases with Different Model Dimensions 45 I Extended Future Works 46 Published as a conference paper at ICLR 2024 A LIST OF COMMON NOTATIONS We first give a list of common notations that are used in the main body and appendix for reference. f: Lo RA-adapted model. f: target model. f0: frozen/pretrained model. R: rank of Lo RA adapters. D: dimensionality of the model, representing the number of neurons in each layer for FNNs and the embedding size for TFNs. L: depth of the (frozen) model, representing the number of layers for FNNs and the number of transformer blocks for TFNs. N: sequence length of the input for TFNs. x: random input. X: matrix input. X: input space. W : a weight matrix associated with (frozen) model. Subscripts and superscripts may be added for specificity. b: a bias vector associated with the (frozen) model. Subscripts may be added for specificity. zl: the output of the first l layers in the (frozen) FNN. Zl: the output of the first l transformer blocks in a (frozen) TFN. W : a weight matrix associated with the target model. Subscripts and superscripts may be added for specificity. b: a bias vector associated with the target model. Subscripts may be added for specificity. zl: the intermediate output of the first l layers in target FNN given the random input x. Zl: the output of the first l transformer blocks in a target TFN. L: depth of the target model, representing the number of layers for FNNs and the number of transformer blocks for TFNs. W : the weight matrix of a Lo RA adapter. bb: a bias vector associated with the Lo RA-adapted model. bzl: the output of the first l layers in the Lo RA-adapted model given the random input x. b Zl: the output of the first l transformer blocks in the Lo RA-adapted model. M: the ratio of the depth of the frozen model to that of the target model, i.e., L/L. P: partition P = {P1, . . . , PL}, each element Pi specifies that the layers with index l Pi in the adapted model will be used to approximate the i-th layer in the target model. Pi: the i-th element in partition P. Pu: uniform partition Pu := {{1, . . . , M} , {M + 1, . . . , 2M} , . . . , (L 1)M + 1, . . . , L }. The uniform partition indicates that every M layers in the adapted model are employed to approximate each layer in the target model. P u i : the i-th element in uniform partition Pu. ID: the D D identity matrix. When the context permits, the subscript D of ID may be omitted, simplifying the notation to I. Ia:b,D: a diagonal matrix where the diagonal entries from the ath to bth position are set to 1, while all remaining entries are 0s. Published as a conference paper at ICLR 2024 σd( ): the d-th largest singular value for the given square matrix. When d is greater than the width of the matrix, σd( ) = 0. LRr( ): best rank-r approximation of a square matrix in Frobenuis norm and spectral norm. The subscript r may be omitted to indicate a general low-rank approximation without specifying the rank. 
- ∏_{l∈P_i} W_l: product of the weight matrices from the layers l ∈ P_i, with later layers positioned to the left and earlier layers to the right in the matrix product. For example, ∏_{l∈P^u_1} W_l = ∏_{l=1}^{M} W_l = W_M · · · W_1.

B EXPANDED RELATED WORKS

Expressive Power of Fully Connected Neural Networks The theoretical exploration of the expressive power of unfrozen fully connected neural networks has advanced since the introduction of the first universal approximation theorem (Hornik et al., 1989; Cybenko, 1989). Subsequent studies have demonstrated the benefits of depth, asserting that sufficient depth can ensure function approximation (Bengio & Delalleau, 2011; Eldan & Shamir, 2016; Liang & Srikant, 2017; Telgarsky, 2016; 2015). Other works have examined the expressive power of FNNs from the perspective of width (Lu et al., 2017; Park et al., 2021; Bietti & Bach, 2021) and the number of neurons (Shen & Zhang, 2020). While these results assume that weight matrices can be arbitrarily adjusted for optimal performance, Hsu et al. (2021) examined the expressive power of randomly generated two-layer FNNs. Our work shares similarities with this direction, as we also delve into scenarios with randomly generated models. Beyond characterizing expressive power by approximation error, alternative metrics have been proposed: the Vapnik-Chervonenkis dimension (Vapnik & Chervonenkis, 2015; Seo et al., 2021) and Rademacher complexity (Bartlett & Mendelson, 2001) are used to assess classification capacity. Furthermore, Raghu et al. (2017) introduced a novel metric that captures the structural properties of an FNN, and Lee et al. (2017) investigated the ability of FNNs to express distributions.

Expressive Power of Transformers As TFNs have grown increasingly popular, a few studies have investigated their expressive power. Yun et al. (2020a) established a universal approximation theorem for TFNs approximating sequence-to-sequence functions. Likhosherstov et al. (2021) characterized the self-attention layer as a matrix and demonstrated that this matrix can approximate any sparse matrix. Beyond approximation, further research has delved into other facets of TFNs' expressive power. For instance, Giannou et al. (2023b) found that looped transformers can emulate an instruction-set computer, while Pérez et al. (2019) demonstrated that TFNs attain Turing completeness when operating with infinite precision. However, the theories above cannot fully explain the performance of frozen neural networks after adaptation, as they generally do not factor in pre-trained model parameters and adaptation methods.

Expressive Power of Adaptation Methods Our work focuses on the expressive power of adaptation methods. In stark contrast to the flourishing research on the expressive power of neural networks, only a limited number of works investigate the expressive power of adaptation methods. A notable exception is Giannou et al. (2023a), which investigates the expressive power of normalization parameter fine-tuning. They demonstrate that fine-tuning the normalization layers alone can adapt a randomly initialized ReLU network to match any target network that is O(width) times smaller. We borrow some proof techniques from this work, including techniques for extending results from linear neural networks to ReLU neural networks.
In another recent work (Englert & Lazic, 2022), the authors show that neural reprogramming (Elsayed et al., 2019; Engel et al., 2018; Lee et al., 2020; Dinh et al., 2022a; Chen, 2022), a technique that modifies only the inputs while keeping the pretrained network frozen, can adapt any random two-layer ReLU network to achieve arbitrarily high accuracy on a Bernoulli data model over hypercube vertices. Oymak et al. (2023) explores prompt-tuning within a one-layer attention architecture, revealing that the model resulting from prompt-tuning (Lester et al., 2021) is more expressive than the naive self-attention model. Petrov et al. (2023) shows that prompt-tuning and prefix-tuning (Li & Liang, 2021) are strictly less expressive than full fine-tuning. Despite these early attempts, no existing study has yet explored the expressive power of LoRA, the current leading adaptation method.

Other Theoretical Analysis of Adaptation Methods Considerable effort has been devoted to theoretically analyzing other properties of adaptation methods, such as generalization. Maurer et al. (2016) provides generalization bounds for transfer learning, particularly for final layers tuning, demonstrating that the estimation error decreases as the pretrained task diversity and the number of samples for the target task increase. Tripuraneni et al. (2020) further refines this bound by studying the effect of the number of samples in the pre-trained tasks. Interestingly, the estimation error of final layers tuning provided in Tripuraneni et al. (2020) heavily depends on the quality of the shared representation. This insight aligns with our finding on final layers tuning (Lemma 4), which implies that tuning the final layers fails to adapt an L-layer randomly generated FNN to approximate any one-layer target FNN if the first layer remains frozen; this failure is attributed to the poor quality of the shared random representation. Du et al. (2021) further investigates final layers tuning in few-shot cases, i.e., when there are only a few samples for the target task. A recent study by Malladi et al. (2023), which examined LoRA and full fine-tuning through the lens of the Neural Tangent Kernel (Jacot et al., 2018), suggested that if the kernel view describes full fine-tuning, then LoRA approximates full fine-tuning. However, their theoretical analysis of LoRA is based on linear models, which limits its applicability, whereas our study considers a more general setting. With the rapid advancement of large language models, new adaptation methods such as in-context learning (Brown et al., 2020), prefix-tuning, and prompt-tuning (Lester et al., 2021) are gaining increasing attention. A particular focus of research is the theoretical underpinnings of in-context learning (Akyürek et al., 2023; Bai et al., 2023; Wies et al., 2023; Xie et al., 2022; von Oswald et al., 2022; Ahn et al., 2023). Akyürek et al. (2023) demonstrates that transformer-based in-context learners implicitly implement standard learning algorithms, while Bai et al. (2023) presents a similar finding and posits that in-context learning performs algorithm selection like a statistician. Wies et al. (2023) analyzes the sample complexity of in-context learning. Other works find that in-context learning is equivalent to gradient descent (von Oswald et al., 2022; Ahn et al., 2023) and to Bayesian inference (Xie et al., 2022).
Beyond in-context learning, recent research by Tutunov et al. (2023) develops a theoretical framework elucidating how LLMs can accurately generate chain-of-thought reasoning (Wei et al., 2022).

C PROOFS RELATED TO LINEAR ALGEBRA

In this section, we present a collection of commonly used matrix inequalities and basic properties of randomly generated matrices.

C.1 COMMON MATRIX INEQUALITIES

Here, we list some commonly used properties of matrix multiplication, including rank bounds, norm inequalities, and key facts involving the trace and the Frobenius norm, for reference:
- rank(AB) ≤ min{rank(A), rank(B)};
- ‖Ax‖₂ ≤ ‖A‖₂ ‖x‖₂ ≤ ‖A‖_F ‖x‖₂;  (1)
- E[x⊤Ax] = tr(A Cov(x)) + (E x)⊤ A (E x) = tr(A E[xx⊤]);
- tr(AB) = tr(BA);
- tr(AB) ≤ tr(A) tr(B) for symmetric positive semidefinite A and B;
- ‖A‖²_F = Σᵢ σᵢ²(A), which equals tr(A²) when A is symmetric.

C.2 NON-SINGULARITY OF RANDOMLY GENERATED MATRICES

Although the non-singularity of randomly generated matrices is well established, we include a proof for completeness. To facilitate the proof, we first state a lemma asserting that if a polynomial is not identically zero, then its zero set has Lebesgue measure zero.

Lemma 5 (Caron & Traynor (2005)). Let p(x) be a polynomial of degree d, x ∈ Rⁿ. If p is not the zero polynomial, then the set S := {x ∈ Rⁿ | p(x) = 0} has Lebesgue measure zero.

We note that the determinant of a matrix can be viewed as a polynomial function of its vectorized version. Based on this insight, we proceed with our proof.

Lemma 6. Let X ∈ R^{D×D} be a random matrix that follows an arbitrary continuous distribution whose support has non-zero Lebesgue measure on R^{D×D}. Then X is non-singular with probability 1.

Proof of Lemma 6. The result is a direct consequence of Lemma 5. Let x = vec(X). Then x is a random vector following an arbitrary continuous distribution whose support has non-zero Lebesgue measure on R^{D²}. First, observe that P(det(X) = 0) = P(p(x) = 0) for some polynomial function p. Denote the support of the random vector x by 𝒳 ⊆ R^{D²} and its probability density function by q. Then

P(p(x) = 0) = ∫_𝒳 1{p(x) = 0} q(x) dx = ∫_{𝒳 ∩ {x : p(x) = 0}} q(x) dx.

By Lemma 5, the Lebesgue measure of {x : p(x) = 0} is zero. Hence ∫_{𝒳 ∩ {x : p(x) = 0}} q(x) dx = 0. Combining the equations above, we conclude that P(det(X) = 0) = 0, which implies that X is non-singular with probability 1.

D PROOFS FOR LINEAR MODEL APPROXIMATION

In this section, we present the results and corresponding proofs for the linear model approximation problem introduced in Sec. 2. The deep linear model is a common tool in theoretical deep learning research: it offers valuable insights into deep nonlinear models and has been employed in many notable studies, including those by Saxe et al. (2014); Kawaguchi (2016); Lu & Kawaguchi (2017); Hardt & Ma (2017); and Laurent & von Brecht (2018). We employ this toy model as a preliminary setting, which serves as a foundation for extending our results to nonlinear models (i.e., FNN and TFN). We first provide a slightly more detailed version of Lemma 1 along with its proof. Then, we present a variant that allows different LoRA-ranks for each low-rank adapter; the proof of this variant involves only a minor modification of the proof of Lemma 7.

Lemma 7 (Detailed version of Lemma 1). Define the error matrix E := W̄ − ∏_{l=1}^{L} W_l, and denote its rank by R_E = rank(E).
For a given Lo RA-rank R [D], assume that all the weight matrices of the frozen model (Wl)L l=1, and QL l=1 Wl + LRr(E) are non-singular for all r R(L 1). Then, the approximation error min W l:rank( W l) R l=1 (Wl + W l) W | {z } Error matrix E and the optimal solution to the matrix approximation problem satisfies QL l=1(Wl + W l) = QL l=1 Wl + LRRL RE(E). Therefore, when R RE L , we have QL l=1(Wl + W l) = W , implying f f. Published as a conference paper at ICLR 2024 Proof of Lemma 7. Our goal is to find matrices W 1, . . . , W L of rank R or lower such that the product of the adapted matrices approximates the target matrix well, i.e., we aim to solve the following constrained optimization problem: min W l:rank( W l) R l=1 (Wl + W l) W By subtracting QL l=1 Wl from both terms, the constrain optimization problem becomes min W l:rank( W l) R l=1 (Wl + W l) To perform analysis on (2), we start with the analysis of A as follows: l=1 ( W l + Wl) l=1 ( W l + Wl) + WL l=1 ( W l + Wl) Here, we have separated the first term in the product QL l=1( W l + Wl), breaking it into two parts: one involving W L and the other WL. We can further expand the part involving WL: l=1 ( W l + Wl) l=1 ( W l + Wl) + WL 1 l=1 ( W l + Wl) At this point, it becomes clear that this expression can be iteratively decomposed. Following this pattern, we can express A as: l=1 ( W l + Wl) + WL W L 1 l=1 ( W l + Wl) (3) + . . . + ( l=2 Wl)( W 1 + W1) i=l+1 Wi) W l( i=1 (Wi + W i)) | {z } :=Al In this final form, A is decomposed as A = PL l=1 Al. It is important to note that rank(Al) rank( W l) R. Consequently, rank(A) PL l=1 rank(Al) RL. Then, the optimization problem (2) can be relaxed into a low-rank approximation problem (2) min A:rank(A) RL A E 2 , (4) where the optimal solution is A = LRRL RE(E) := E . Therefore, if we can identify rank-R or lower matrices ( W l)L l=1 such that l=1 (Wl + W l) = LRRL RE(W Published as a conference paper at ICLR 2024 then we effectively solve the matrix approximation problem as defined in (2). Moreover, it is straightforward to verify that (5) directly implies all statements in this lemma. Therefore, our remaining proof focuses on proving (5). Denote RE = RL RE. To derive the explicit form of E , we first refer to the SVD of E as where U and V are orthonormal matrices and the first RE diagonal entries of D are non-zero, with all remaining entries being zero. Based on this, E is expressed as E = UDI1:RL,DV . Having already derived the decomposition A = PL l=1 Al, we next aim to decompose E as E = PL l=1 E Ql, where Q1, . . . , QL RD D. The goal now shifts to identifying W l, Ql such that Al = E Ql for each l [L]. Achieving this would complete the proof of (5). Therefore, our goal becomes finding W 1, . . . , W L with rank( W l) R for all l [L] such that i=l+1 Wi) W l( i=1 (Wi + W i)) = E Ql, for all l [L]. (6) One sufficient condition for achieving (6) is that the decomposed matrices Q1, QL and low-rank adapters W 1, . . . , W L meet the following conditions: l=1 E Ql = E , (7) i=l+1 Wi) 1E Ql( i=1 (Wi + W i)) 1, for all l [L] (8) rank( W l) R, for all l [L], (9) rank(Wl + W l) = D, for all l [L 1]. (10) Here (7) describes the decomposition of E , (8) provides one simple solution to (6) when (10) holds, and (9) is the rank constraint on the low-rank adapter. In particular, the (10) is used to ensure the invertibility of Ql i=1(Wi + W i) for l [L 1]. This condition is not necessary for l = L as the inverse of WL + W L is not required for computing any low-rank adapters. 
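To make the construction concrete before completing the induction below, the following NumPy sketch (our own illustration; the dimensions D, L, R and all variable names are arbitrary choices, not values used in the paper) builds the adapters ΔW_l sequentially via the update rule in (8) with the Q_l chosen as in (11) below, and then checks that the adapted product equals ∏_l W_l + LR_{RL∧R_E}(E); since RL ≥ rank(E) in this example, the product recovers the target matrix W̄ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, R = 8, 4, 2                       # width, depth, per-layer LoRA-rank (illustrative choices)
W = [rng.standard_normal((D, D)) for _ in range(L)]   # frozen weights W_1, ..., W_L
W_bar = rng.standard_normal((D, D))                   # target weight matrix

def chain(mats):
    """Product of the given matrices in left-to-right order; identity if the list is empty."""
    out = np.eye(D)
    for M in mats:
        out = out @ M
    return out

frozen_prod = chain(W[::-1])            # W_L ... W_1
E = W_bar - frozen_prod                 # error matrix
U, s, Vt = np.linalg.svd(E)
R_E = np.linalg.matrix_rank(E)
r_star = min(R * L, R_E)
E_star = (U[:, :r_star] * s[:r_star]) @ Vt[:r_star]   # best rank-(RL ∧ R_E) approximation of E

V = Vt.T
dW = []
for l in range(1, L + 1):               # adapters are constructed one layer at a time, as in (8)
    lo, hi = min(R * (l - 1), R_E), min(R * l, R_E)
    Q_l = V[:, lo:hi] @ V[:, lo:hi].T   # Q_l selects the l-th block of R singular directions, cf. (11)
    left = chain(W[L - 1:l - 1:-1])     # W_L ... W_{l+1}
    right = chain([W[i] + dW[i] for i in range(l - 2, -1, -1)])  # (W_{l-1}+ΔW_{l-1}) ... (W_1+ΔW_1)
    # ΔW_l factors through the (at most R) columns V[:, lo:hi], so rank(ΔW_l) <= R by construction
    dW.append(np.linalg.inv(left) @ E_star @ Q_l @ np.linalg.inv(right))

adapted_prod = chain([W[i] + dW[i] for i in range(L - 1, -1, -1)])
print(np.linalg.norm(adapted_prod - (frozen_prod + E_star)))   # ≈ 0 (floating-point error only)
print(np.linalg.norm(adapted_prod - W_bar))                    # ≈ 0 here, since R*L >= rank(E)
```

Under the non-singularity assumption of Lemma 7, both printed residuals vanish up to floating-point error; if R·L were smaller than rank(E), only the first would vanish, with the second reflecting the discarded singular values of E.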
We will show that the matrices (Ql)L l=1 defined by Ql = V I(R(l 1)+1) RE :Rl RE ,DV , for all l [L], (11) and W l defined by (8) for all l [L] satisfies the all four conditions (7), (8), (9), and (10). We note that the definition of (Ql)L l=1 clearly satisfies condition (7). For the remaining conditions, namely (8), (9), (10), we proceed the proof by induction. When l = 1. We begin by examining the three conditions (8), (9) and (10) under the base case l = 1. We first determine Q1 and W 1 based on (11) and (8): i=2 Wi) 1E Q1, Q1 = I1:R,D. (12) By the choice of W 1, we satisfy the condition (8). Moreover, it directly follows that rank( W 1) rank(Q1) = R, thereby fulfilling the rank constraint in (9). Published as a conference paper at ICLR 2024 Therefore, we just need to prove that W1 + W 1 is full-rank, as required by condition (10). To compute rank(W1 + W 1), we proceed as follows: rank(W1 + W 1) (12) = rank(W1 + ( i=2 Wi) 1E Q1) (Substituting for W 1) i=1 Wi) + E Q1) (Left multiplying with invertible ( i=1 Wi) + LRR RE (E)). (Simplifying) Given the assumption that QL l=1 Wl + LRr(E) is full rank for all r R(L 1), rank(W1 + W 1) = rank((QL i=1 Wi) + LRR RE (E)) = D, satisfying the last condition (10). When l > 1. Consider l = 2, . . . , L. We assume that for i [l 1], we have determined matrices Qi and W i based on (11) and (8), respectively, and we assume that they satisfy the conditions (8), (9), and (10). First, under the induction assumption that Wi + W i is invertible for all i [l 1], to achieve Al = E Ql, we set W l based on (8). This definition ensures rank( W l) rank(Ql) = R, thereby satisfying the condition (9). To prove that Wl+ W l is full-rank (condition (10)), we focus on computing rank(Wl + W l). We proceed as follows: rank(Wl + W l) (8) = rank(Wl + ( i=l+1 Wi) 1E Ql( i=1 (Wi + W i) 1)) (Substituting for W l) = rank(ID + ( i=l Wi) 1E Ql( i=1 (Wi + W i)) 1) (Left multiplying invertible W 1 l ) = rank l 1 Y i=1 (Wi + W i) + ( i=l Wi) 1E Ql (Right multiplying invertible i=1 (Wi + W i)) = rank (Wl 1 + W l 1) i=1 (Wi + W i) + ( i=l Wi) 1E Ql (Rearranging terms) (8) = rank (Wl 1 + ( i=l Wi) 1E Ql 1( i=1 (Wi + W i)) 1) i=1 (Wi + W i) i=l Wi) 1E Ql (Substituting for W l 1) i=l 1 Wi + E Ql 1( i=1 (Wi + W i)) 1) i=1 (Wi + W i) + E Ql (Left multiplying i=1 (Wi + W i) + E Ql 1 + E Ql (Rearranging terms) Published as a conference paper at ICLR 2024 i=1 Wi + E ( i=1 Qi)) (Taking similar steps) i=1 Wi + LRRl RE (E)). (Simplifying) By the assumption that QL l=1 Wl + LRr(E) is full-rank for r R(L 1) and consequently, rank(Wl + W l) = rank(QL i=1 Wi + LRRl RE (E)) = D, satisfying the last condition (10). Conclusion of Inductive Proof. Thus, by induction, we show that the definitions of ( W l)L l=1 in (8) and (Ql)L l=1 in (11) ensure that Al = E Ql for all l [L]. Summing over l from 1 to L satisfies condition (5), thereby completing the proof. The following lemma extends the results to a more general setting where different Lo RA-ranks can be employed across layers. Lemma 8. Define error matrix E := W QL l=1 Wl, and denote its rank by RE = rank(E). For a sequence of Lo RA-ranks for all layers (Rl)L l=1, assume that all the weight matrices of the frozen model (Wl)L l=1, and QL l=1 Wl + LRr(E) are non-singular for all r PL 1 l=1 Rl. Then, the approximation error min W l:rank( W l) Rl l=1 (Wl + W l) W 2 = σPL l=1 Rl+1 | {z } Error matrix E and the optimal solution to the matrix approximation problem satisfies QL l=1(Wl + W l) = QL l=1 Wl +LR(PL l=1 Rl) RE(E). 
Therefore, when PL l=1 Rl RE, we have QL l=1(Wl + W l) = W , implying f f. Proof of Lemma 8. The proof follows the same steps of Lemma 7 with only minor modifications. In the current setting, we target the following constrained optimization problem: min W l:rank( W l) Rl l=1 (Wl + W l) W where we allow each Lo RA adapter W l can possess different Lo RA-ranks Rl, i.e., rank( W l) Rl, l [L]. Subtracting QL l=1 Wl from both terms leads us to a similar constrained optimization problem as (2). The only distinction lies in the rank constraint: min W l:rank( W l) Rl l=1 (Wl + W l) Following the same steps, we decompose A into (3). Given that rank(Al) rank( W l) Rl, we deduce that rank(A) PL l=1 rank(Al) PL l=1 Rl. Consequently, the optimization problem above can be eased into a low-rank approximation problem analogous to (4): (13) min A:rank(A) PL l=1 Rl A E 2 , where the optimal solution is A = LR(PL l=1 Rl) RE(E) := E . Therefore, if we can identify the Lo RA adapters ( W l)L l=1 with rank( W l) Rl such that l=1 (Wl + W l) = LR(PL l=1 Rl) RE(W Published as a conference paper at ICLR 2024 the proof is completed. The remaining part of the proof adheres to the steps outlined in the proof of Lemma 7 deriving (5). The only difference is that we consider a different selection of (Ql)l = 1L that satisfies (9) here: Ql = V I(Pl 1 i=1 Ri) RE :(Pl i=1 Ri) RE ,DV . Applying the same steps with this change yields the desired outcomes. This lemma illustrates that in linear cases, the total number of parameters needed to achieve an exact approximation is constant, regardless of Lo RA-rank assignment. It suggests that applying a Lo RA-rank of R per layer is equivalent to applying a Lo RA-rank of RL at the final layer. As a result, fine-tuning only the last layer, which involves assigning a Lo RA-rank of D to the last layer, is equivalent to implementing Lo RA where each adapter is constrained to have a rank of D/L. Both methods can achieve an exact approximation and maintain the same parameter efficiency. E PROOFS FOR FNN APPROXIMATION In this section, we provide the full proof for deriving the main results outlined in Sec. 3. For the sake of completeness, we restate our results from the main body before presenting the proof. E.1 APPROXIMATING ONE-LAYER RELU FNN VIA LORA We first provide a slightly more detailed result on the one-layer Re LU FNN approximation (Lemma 9) along with its corresponding proof. Then, we present a variant of this lemma by allowing for different Lo RA-ranks for each low-rank adapter. The proof for this variant involves only a minor modification of the proof for Lemma 9. Lemma 9 (Detailed version of Lemma 2). Define error matrix E := W 1 QL l=1 Wl, with its rank represented by RE = rank(E). Consider a Lo RA-rank R [D]. Assume that the weight matrices W1, . . . , WL RD D and QL l=1 Wl + LRr(E) for all r R(L 1) are non-singular. Let x be a random input sampled from a distribution with bounded support X and let Σ = Exx . Then, there exists rank-R or lower matrices W 1, . . . , W L RD D and bias vectors bb1, . . . ,bb L RD such that for any input x X, f(x) f(x) = Re LU LRRL RE(W 1 l=1 Wl) (W 1 Therefore, when R RE/L , the adapted model exactly approximates the target model, i.e., f(x) = f(x) for all x X. Furthermore, let x be a random input sampled from a distribution with bounded support X and let Σ = Exx . Then, the expected squared error is bounded as E f(x) f(x) 2 2 Σ F σ2 RL RE+1(W 1 Proof of Lemma 9. 
This proof consists of three main steps: (i) linearize the first L 1 layers of the adapted model f to reduce it to a single-layer FNN, (ii) align the weight matrices and bias vectors of this simplified f with those of the target model f, (iii) derive an upper bound of the error E f(x) f(x) 2 2 . Linearization. The main challenge here stems from the non-linearities introduced by the Re LU activation function. To remove the non-linearities in the first L 1 layers of updated model f, since the input space X is bounded, we can set all the entries of bb1, . . . ,bb L 1 sufficiently large, thereby Published as a conference paper at ICLR 2024 activating all Re LUs in the first L 1 layers of f. Consequently, we have f(x) = Re LU((WL + W L)z L 1 + bb L) = Re LU (WL + W L)Re LU((WL 1 + W L 1)z L 2 + bb L 1) + bb L = Re LU (WL + W L)((WL 1 + W L 1)z L 2 + bb L 1) + bb L = Re LU (WL + W L)(WL 1 + W L 1)z L 2 + (WL + W L)bb L 1 + bb L l=1 (Wl + W l)x + ( i=l+1 (Wi + W i)bbl) + bb L which is equivalent to a single-layer Re LU neural network with weight matrix QL l=1(Wl + W l) and bias vector (PL 1 l=1 QL i=l+1(Wi + W i)bbl) + bb L. Parameter Alignment. To match the updated model f(x) and target model f(x), we proceed as follows. For weight matrix, Lemma 7 guarantees the existence of rank-R or lower matrices W 1, . . . , W L RD D such that l=1 (Wl + W l) = l=1 Wl + LRRL RE(W l=1 Wl). (14) For the bias vector, we set bb L = b1 PL 1 l=1 QL i=l+1(Wi+ W i)bbl such that PL 1 l=1 QL i=l+1(Wi+ W i)bbl + bb L = b1. Therefore, we obtain f(x) f(x) = Re LU LRRL RE(W 1 l=1 Wl) (W 1 Error Derivation. We compute the expected squared error as follows: E f(x) f(x) 2 2 LRRL RE(W 1 l=1 Wl) (W 1 2 (Re LU is 1-Lipschitz) LRRL RE(W 1 l=1 Wl) (W 1 = Σ F σ2 RL RE+1(W 1 l=1 Wl). (By the definition of LRRL RE( )) This completes the proof. Lemma 9 is extended to cases where different Lo RA-ranks can be used for different low-rank adapters, as detailed in the following lemma. Lemma 10. Define error matrix E := W 1 QL l=1 Wl, and denote its rank by RE = rank(E). Consider a sequence of Lo RA-ranks (Rl)L l=1. Assume that the weight matrices W1, . . . , WL RD D and QL l=1 Wl +LRr(E) for all r PL 1 l=1 Rl are non-singular. Then, there Lo RA adapters ( W l)L l=1 satisfying the rank constraints rank( W l) Rl for all l [L] and bias vectors bb1, . . . ,bb L RD such that for any input x X, f(x) f(x) = Re LU LR(PL l=1 Rl) RE(W 1 l=1 Wl) (W 1 Published as a conference paper at ICLR 2024 Therefore, when PL l=1 Rl RE, the adapted model exactly approximates the target model, i.e., f(x) = f(x) for all x X. Furthermore, for a random input x drawn from a distribution supported on X, and with Σ = Exx , the expected squared error is bounded by: E f(x) f(x) 2 2 Σ F σ2 (PL l=1 Rl) RE+1(W 1 Proof of Lemma 10. This proof closely adheres to the steps detailed in the proof of Lemma 9. The primary change implemented here is that, when we draw the analogy to (14), we apply Lemma 8 instead of Lemma 7. This results in l=1 (Wl + W l) = l=1 Wl + LR(PL l=1 Rl) RE(W Utilizing the steps from the proof of Lemma 9 and integrating the modification specified above, we can establish the desired result. E.2 APPROXIMATING MULTI-LAYER RELU FNN VIA LORA WITH UNIFORM MODEL PARITION In this part, we restate all the results considering uniform model partition from Sec. 3.3, along with their corresponding proofs, presented in the same order. Assumption 1 (Non-Singularity). 
For a fixed Lo RA-rank R [D], the weight matrices of the frozen model (Wl)L l=1 and matrices Q l P u i Wl + LRr(W i Q l P u i Wl) are non-singular for all r R(M 1) and i [L]. Lemma 3. Let (W l)L l=1, (Wl)L l=1 RD D be matrices whose elements are drawn independently from arbitrary continuous distributions. Then, with probability 1, Assumption 1 holds R [D]. Proof of Lemma 3. We first use Lemma 6 to establish that W 1, . . . , W L, W1, . . . , WL are non- singular with probability 1. The goal of the remaining proof is to demonstrate that Q l P u i Wl + LRr(W i Q l P u i Wl) is full-rank with probability 1. In this proof, we use p to denote the probability density function, where the subscript indicates the associated random variable. Fix an arbitrary i [L] and r [R]. Then probability of the Q l P u i Wl + W i Q l P u i Wl being full-rank can be computed as l P u i Wl = E l P u i Wl(E)d E. If the conditional random matrix Q l P u i Wl +LRr(E) | W i Q l P u i Wl = E has a continuous distribution with support of non-zero Lebesgue measure on RD D, then l P u i Wl = E l P u i Wl + LRr l P u i Wl is full-rank with probability 1. Published as a conference paper at ICLR 2024 Consequently, the remaining part of the proof aims to show that the conditional random matrix Q l P u i Wl + LRr(E) | W i Q l P u i Wl = E follows arbitrary continuous distribution with support having non-zero Lebesgue measure on RD D. Denote W = Q l P u i Wl. Now, consider the conditional distribution of Q l P u i Wl | W i Q l P u i Wl = E, which can be written as p W|Wi W=E(W ) = p Wi(E + W ). Since p Wi is continuous with support of non-zero Lebesgue measure on RD D, the same holds for Q l P u i Wl | W i Q l P u i Wl = E. Furthermore, adding a constant matrix LRr(E) to this conditional distribution preserves the desired properties, thus completing the proof. Theorem 3. Under Assumption 1, there exists rank-R or lower matrices ( W l)L l=1 with W l RD D and bias vectors (bbl)L l=1 with bbl RD when the rank of the low-rank adapter R maxi [L] rank(W i Q l P u i Wl)/M , the low-rank adapted model f can exactly approximate the target model f, i.e., f(x) = f(x) for all input x X. Proof of Theorem 3. The key to this proof lies in a simple idea: for each layer i [L] in the target model, we can update M layers (i.e., (i 1)M + 1-th layer to i M-th layer) in the frozen model to approximate it as guaranteed by Lemma 9. Hence, all layers of the target model can be approximated by the adapted model. Model Decomposition. We partition the adapted model f into L sub-models, each defined as fi( ) = FNNL,D( ; (Wl + W l)l P u i , (bbl)l P u i ), i [L]. In a similar manner, we break down f into L sub-models, each is a one-layer FNN: f i( ) = FNN1,D( ; W i, bi), i [L]. We can then express f(x) and f(x) as compositions of their respective sub-models: f( ) = f L f1( ), f( ) = f L f 1( ). To analyze the error E f(x) f(x) 2 = E f(x) f(x) 2, we consider the error caused by each submodel. Let e Ri = rank(W i Q l P u i Wl) denote the rank of the discrepancy between the target weight matrix and the frozen weight matrices, where i [L]. By Lemma 9, we can select W 1, . . . , W L,bb1, . . . ,bb L such that fi(z) f i(z) = Re LU LRRL e Ri(W i Y l P u i Wl) (W i Y l P u i Wl) E fi(z) f i(z) 2 2 Ezz F σ2 RL e Ri+1(W i l=1 Wl). (16) Given these selected parameters, fi is functionally equivalent to a one-layer FNN: fi(z) = Re LU LRRL e Ri(W i Y l P u i Wl) + Y Clearly, when R maxi e Ri M , it follows that fi = gi for all i [L], which implies f = g. Corollary 4. 
Assume that the elements of matrices (W l)L l=1, (Wl)L l=1 are independently drawn from arbitrary continuous distributions. When R D/M, there exists rank-R or lower matrices W 1, . . . , W L RD D and bias vectors bb1, . . . ,bb L RD such that low-rank adapted model f can functionally cover the target model f on X, i.e., f(x) = f(x) for all input x X, with probability 1. Published as a conference paper at ICLR 2024 Proof of Corollary 4. To prove the statement, we start by noting that combining Lemma 3 and Theorem 3 directly gives us f(x) = f(x) on X when R maxi [L] rank(W i Q l P u i Wl)/M . Therefore, the only thing left is to show that rank(Wi Q l P u i Wl) = D for i [L] with probability 1. In this proof, we use p to denote the probability density function, where the subscript indicates the associated random variable. To establish this, consider the following probability expression: l P u i Wl = W l P u i Wl(W )d W . Since W is independent of Q l P u i Wl, we have l P u i Wl = W Wi W = 0 Lemma 6 ===== 1. Therefore, we conclude that P n det l P u i Wl = 0 o = 1, which completes the proof. Theorem 5. Define the approximation error of i-th layer as Ei = σRM+1(W i Q l P u i Wl), and the magnitude of the parameters and the input as β := Σ F Qi j=1 W j F + Pi j=1 Qi 1 k=j+1 W k F bj 2 Under Assumption 1, there exists rank-R or lower matrices ( W l)L l=1 with W l RD D and bias vectors (bbl)L l=1 with bbl RD such that for input x X with Exx = Σ, E f(x) f(x) 2 β i=1 max k [L] W k F + Ek L i Ei. Proof of Theorem 5. This proof is a continuation of the proof of Theorem 3. In this proof, we will consider a more general case, without enforcing any constraints on the rank of the adapters R. We use c Wi to denote the corresponding weight matrix, i.e., c Wi = LRRL e Ri(W 1 Q l P u i Wl) + Q l P u i Wl. Error Decomposition. For submodel i = 2, . . . , L, we calculate the expected error of the composition of the first i sub-models, E bzi zi 2 = E fi(bzi 1) f i(zi 1) 2 (17) = E (fi(bzi 1) fi(zi 1)) + fi(zi 1) f i(zi 1) 2 (Rearranging terms) E fi(bzi 1) fi(zi 1) 2 | {z } Ai + E fi(zi 1) f i(zi 1) 2 | {z } Bi . (Applying triangle inequality) Here Ai represents the error resulting from the discrepancy between the first i 1 submodels, while Bi represents the error arising from the mismatch between the i-th submodel. Published as a conference paper at ICLR 2024 Computing Ai. We start by computing the error introduced by the first i 1 submodels, denoted by Ai: Ai = E fi(bzi 1) fi(zi 1) 2 = E Re LU(c Wi(bzi 1 zi 1)) 2 E c Wi(bzi 1 zi 1) 2 (Re LU is 1-Lipschitz) (1) c Wi F E bzi 1 zi 1 2 . (18) l P u i Wl + LRRM e Ri(W i Y l P u i Wl) l P u i Wl W i + LRRM e Ri(W i Y l P u i Wl) (Rearranging terms) l P u i Wl W i + LRRM e Ri(W i Y l P u i Wl) F (Applying triangle inequality) j=RM e Ri+1 σ2 j (W i Y l P u i Wl) (19) (By the definition of W i and LRRM e Ri+1( )) max k [L] ( W k F + Ei) := α. By combining (18) and (19), we get Ai max k [L] W k F + Ei E bzi 1 zi 1 2 αE bzi 1 zi 1 2 . (20) Computing Bi. We proceed to compute the error associated with the i-th submodel, which we denote as Bi. It can be evaluated as follows: Bi = E fi(zi 1) f i(zi 1) 2 LRRM e Ri(W i Y l P u i Wl) (W i Y l P u i Wl) LRRM e Ri(W i Y l P u i Wl) (W i Y l P u i Wl) (Re LU is 1-Lipschitz) LRRM e Ri(W i Y l P u i Wl) (W i Y l P u i Wl) = σRM e Ri+1(W i Y l P u i Wl)E zi 1 2 . 
We can further simplify E zi 1 2 as : E zi 1 2 = E Re LU(W i 1zi 2 + bi 1) 2 = E W i 1zi 2 + bi 1 2 (Re LU is 1-Lipschitz) Published as a conference paper at ICLR 2024 W i 1 F E zi 2 2 + bi 1 2 (Applying triangle inequality and (1)) W i 1 F W i 2 F E zi 3 2 + bi 2 2 + bi 1 2 (Following the same steps) W j F E x 2 + W k F bj 2 (Repeating the same steps) W k F bj 2 β. Therefore, we obtain Bi βσRM e Ri+1(W i Y l P u i Wl). Error Composition. Having established upper bounds for Ai and Bi, we next evaluate the expected error for the composition of the first i adapted submodels. (17) Ai + Bi (20) αE bzi 1 zi 1 2 + Bi α(αE bzi 2 zi 2 2 + Bi 1) + Bi = α2E bzi 2 zi 2 2 + αBi 1 + Bi αi 1E bz1 z1 2 + k=2 αi k Bk. (21) To compute the overall approximation error of f, which is the composite of all submodels, we have E f(x) f(x) 2 = E f(x) f(x) 2 = E bz L z L 2 (21) αL 1E bz1 z1 2 + i=2 αL i Bi (16) αL 1βσRM e Ri+1(W i Y l P u i Wl) + β i=2 αL iσRM e Ri+1(W i Y l P u i Wl) i=1 αL iσRM e Ri+1(W i Y l P u i Wl) i=1 αL iσRM+1(W i Y l P u i Wl). Substituting α with maxk [L]( W k F + Ei) concludes the proof. E.3 APPROXIMATING MULTI-LAYER RELU FNN VIA LORA WITH GENERAL MODEL PARITION Firstly, we provide the required non-singular assumption and the lemma demonstrating the mildness of this assumption for the general model partition cases after introducing necessary notations. Assumption 2. For the given Lo RA-rank sequence (Rl)L l=1 and partition P, the weight matrices of the frozen model W1, . . . , WL and Q l Pi Wl + LRr(W i Qmax Pi 1 l=min Pi Wl) are non-singular for all r Pmax Pi 1 l=min Pi Rl and i [L]. Note that max Pi and min Pi here represent the maximum and minimum elements in the set Pi, respectively. Lemma 11. Let (W l)L l=1, (Wl)L l=1 RD D be matrices whose elements are drawn independently from arbitrary continuous distributions. Then, with probability 1, Assumption 2 holds for all R [D]. Published as a conference paper at ICLR 2024 Proof of Lemma 11. Following the same steps in the proof of Lemma 3 but replacing the uniform partition with the general partition completes the proof. We now restate Theorem 6 and provide its proof. Theorem 6. Consider a partition P for the frozen model. Let Assumption 2 hold. If P l Pi Rl rank(W i Q l Pi Wl) for all i [L], there exists Lo RA adapters ( W l)L l=1 with rank( W l) Rl and biases (bbl)L l=1 such that the adapted model f can exactly approximate the target model. Moreover, define the approximation error of the i-th layer as Ei = σP l Pi Rl+1(W i Q l Pi Wl), and the magnitude of the parameters and the input as β := Σ F Qi j=1 W j F + Pi j=1 Qi 1 k=j+1 W k F bj 2 Then, there exists Lo RA adapters ( W l)L l=1 with rank( W l) Rl and biases (bbl)L l=1 such that for any input x X with Exx = Σ, the approximation error can be bounded as E f(x) f(x) 2 β i=1 max k [L] W k F + Ek L i Ei. Proof of Theorem 6. This proof follows the same steps as the proofs of Theorem 3 and Theorem 5, substituting the uniform partition Pu with the general partition P and applying Lemma 10 in place of Lemma 2 to derive the desired outcome. E.4 APPROXIMATING MULTI-LAYER RELU FNN VIA FINAL LAYERS TUNING We now aim to examine another commonly used model adaptation method, the final layers tuning, within the same theoretical framework. The main limitation of this method, as compared to Lo RA, is that while Lo RA can update all layers, the tuning of final layers keeps the initial layers frozen. 
Consequently, a clear limitation arises when the initial layers of the frozen model f are less discriminative than the target model f. That is, if there exist two input vectors x1, x2 RD D such that the output of the initial layers of the frozen model f0 is the same, but the output of the target model f is different, then no matter how the final layers are tuned, it is impossible for the adapted model f to exactly approximate the target model f. To formalize this, we observe that for the first layer of the frozen model, the outputs of the inputs in the non-activation region are always zero. In other words, when x1, x2 {x : W1x + b1 0}, we have Re LU(W1x1+b1) = Re LU(W1x2+b1) = 0. Therefore, no matter how the subsequent layers are tuned, we still have f(x1) = f(x2). When we fix the first l 1 layers, the non-activation region becomes {x : W2(W1x + b1) + b2 0}. Similarly, we define the non-active region of the first l layer in the frozen model as Il = n x : Ql i=1 Wix + Pl i=1 Ql j=i+1 Wjbi 0 o . Correspondingly, we define Il = n x : Ql i=1 W ix + Pl i=1 Ql j=i+1 W jbi 0 o . The following lemma is provided based on these definitions. Lemma 12. If l [L 1] such that Il \ SL i=1 Ii = and the weight matrices of the target model (W i)L i=1 are non-singular, then for any tuning of the last L l layers, f = f. Proof of Lemma 12. For the simplicity of the presentation, we let I = SL i=1 Ii to denote the non- activation region of the target model. Then, the condition Il \ SL i=1 Ii = can be written as Il \ I = . Clearly, both I and Il are closed convex sets. Condition Il \ I = . The condition Il \ I = indicates that there exists a region in Il where the Re LUs are deactivated in the l-th layer of the frozen model, but activated in the entire target model. Therefore, for any x1, x2 Il \ I, we have f(x1) = f(x2) regardless of how the final l + 1 layers Published as a conference paper at ICLR 2024 are tuned. If these x1, x2 Il \ I satisfies f(x1) = f(x2), this proof is completed. The remaining proof is showing the existence of such x1, x2. Existence of x1, x2. Firstly, we show that there exists two x1, x2 Il \ I such that x1 = x2. Let x1 Il \ I. Since Il is a closed set, there exists a sequence (zi) i=1 where zi Il and zi = x1 satisfying limi zi = x1. Note that at least one element zi must not belong to I, otherwise x1 would be in I due to the closed property of I, contradicting the selection of x1. Let x2 = zi. Therefore, we have two distinct x1, x2 Il \ I with x1 = x2. Then, given x1, x2 Il \ I such that x1 = x2, both x1, x2 activate all the Re LUs in the target model. Since x1, x2 I and the weight matrices of the target model (W l)L l=1 all are non-singular, we have f(x1) f(x2) = W L W 1(x1 x2) = 0, implying f(x1) = f(x2). Meanwhile, since x1, x2 Il, the output of the initial l layers of the frozen model are equal, thus we have f(x1) = f(x2) no matter how we tune the last L l layers. This completes the proof. The following lemma reduces the assumptions to the assumption of randomly generated models. This assumption aligns with that of Corollary 4, thereby facilitating a more effective comparison between the expressive power of Lo RA and the adaptation of the final layers. W1x + b1 = 0 W1x + b1 = 0 I1 Figure 2: An example of I1 and I1 when D = 2. Lemma 4. Let D 2 and f be a one-layer target FNN. Assume that the elements of weight matrices (Wl)L l=1 are independently drawn from arbitrary continuous distributions. With probability 1, for any tuning of the last L 1 layers, f = f. Proof of Lemma 4. 
If we can show that I1 \ I1 = , by Lemma 12, we obtain the desired results. Therefore, the remaining proof aims to show that I1 \ I1 = with probability 1. Note that I1 \ I1 = holds only when W 1 = W1 (not that this is necessary condition not sufficient condition), as demonstrated in Figure 2. However, since the elements of matrices W1 are independently drawn from arbitrary continuous distributions, we have P(W1 = W 1) = 1 for all l [L]. Therefore, I1 \ I1 = holds with probability 1. By Lemma 12, we complete the proof. F PROOFS FOR TFN APPROXIMATION In this section, we not only provide the proof for the results outlined in Sec. 4, but also introduce the problem setting for TFNs with single-head attention layers and present the corresponding results. F.1 APPROXIMATING TRANSFORMER NETWORK WITH SINGLE-HEAD ATTENTION LAYERS In this part, we outline the problem setting to investigate the expressive power of Lo RA in TFNs that utilize single-head attention layers. The primary distinction between this setting and that of TFNs with multi-head attention layers lies in the weight matrices. Specifically, the W h Ol matrices for combining different attention heads are absent in this case. Despite this difference, the derived results are consistent, albeit under slightly modified assumptions regarding the weight matrices and a different Lo RA adaptation strategy. We start by introducing necessary notations. For an input matrix X RD N, where D is the dimension of the token embeddings and N is the number of tokens, the l-th Transformer block using single-head self-attention can be expressed as: Attnl(Zl 1) = WV l Zl 1 softmax (WKl Zl 1) WQl Zl 1 , Zl := W2l Re LU(W1l Attnl(Zl 1) + b1l1 N) + b2l1 N, Published as a conference paper at ICLR 2024 where the weight matrices WKl, WQl, WV l, W1l, W2l RD D, bias vectors b1l, b2l RD, Zl is the output of l-th transformer block, with Z0 = X. The output of the first L Transformer blocks are subsequently fed into the output layer. This produces the final output of the TFN, given by softmax(Wo ZL), where Wo RD D represents the weight matrix of the output layer. For single-head self-attention layers, the target model f, frozen model f, and the adapted model f can be formally represented as: Target TFN g = TFNL,D ; (W V l, W Kl, W Ql, W 2l, W 1l)L l=1, W o , (b1l, b2l)L l=1 , Frozen TFN f0 = TFNL,D ; (WV l, WKl, WQl, W2l, W1l)L l=1, Wo , (b1l, b2l)L l=1 , Adapted TFN f = TFNL,D ; (WV l + W V l, WKl + W Kl, WQl + W Ql, W2l + W 2l, W1l + W 1l)L l=1, Wo + W o , (bb1l,bb2l)L l=1 . Here, W Kl, W Ql, W V l are the weight matrices for generating key, query, and values in the lth transformer block of the target TFN; W 1l, W 2l and b1l, b2l serve as the weight matrices and bias vectors, respectively, for the feedforward layer in the same block; W o is the weight matrix for the output layer. For the frozen TFN, the same roles are played by WKl, WQl, WV l, W1l, W2l, and b1l, b2l for all l [L] and Wo. For the adapted model, low-rank adapters W Kl, W Ql, W V l, W 1l, W 2l, W o with a rank constraint R [D] are added to each weight matrix, and the bias vectors are updated to bb1l,bb2l for all l [L]. Given the problem setting outlined above, we give the non-singularity assumption for TFNs with single-head attention layers. Assumption 3 (Non-Singularity). 
All the weight matrices of both the target model and the frozen model, as well as the following matrices for all r [D], W Kl WQl + LRr W Kl W Ql W Kl WQl , where l = 1, WKl WQl + LRr W 1 2,l 1W 2,l 1W Kl W Ql W 2,l 1W 1 2,l 1 WKl WQl , for l [L] \ {1} , W1l WV l + LRr W 1l W V l W1l WV l , for l = 1, W1l WV l + LRr W 1l W V l W 2,l 1W 1 2,l 1 W1l WV l , for all l [L] \ {1} , Wo W2L + LRr(W o W 2L Wo W2L), are non-singular. Lemma 13. Let the elements of all weight matrices in target model f and the frozen model f be independently sampled from continuous distributions. Then, Assumption 3 holds with probability 1. Proof of Lemma 13. The results can be obtained by replicating the same steps outlined in the proof of Lemma 3. Theorem 8. Consider the rank of the adapter weight matrices R [D]. Let Assumption 3 hold. Define the rank-based functionality gap Gi to i-th transformer block (i [L]) or output layer (i = L + 1) as maxh rank(W h Ki W h Qi W h Ki W h Qi) maxh rank(W 1i W h V i W1i W h V i) , i = 1, maxh rank(W 2,i 1W h Ki W h Qi W 2,i 1 W 2,i 1W h Ki W h Qi W2,i 1) maxh rank(W 1i W h V i W 2,i 1 W1i W h V i W2,i 1) , 2 i L, rank(W o W 2L Wo W2L), i = L + 1. If R maxi [L+1] Gi 2 , there exists rank-R or lower weight matrices for low-rank adapters ( W Kl, W Ql, W V l, W 1l)L l=1, W 2L, W o with other low-rank adapters set to O, and updated bias vectors: (bb1l,bb2l)L l=1, such that for any X RD N, the adapted model f exactly approximates f, i.e., f(X) = f(X), with probability 1. Published as a conference paper at ICLR 2024 Proof of Theorem 8. Let Hl RD N and Zl RD N denote the intermediate and final outputs of the l-th transformer block in the target model f, respectively. Specifically, Hl represents the output from the first feedforward layer in the l-th transformer block. They are defined as W 1l W V l Zl 1 softmax Z l 1W Kl W Ql Zl 1 + b1l1 N Zl = W 2l Hl + b2l1 N, where l [L]. For the adapted model f, we introduce c Hl and b Zl to denote the corresponding intermediate output of the first feedforward layer and the final output of the l-th transformer block for the adapted model, respectively: c Hl = Re LU (W1l + W 1l)(WV l + W V l) b Zl 1 softmax b Z l 1(WKl + W Kl) (WQl + W Ql) b Zl 1 + bb1l1 N b Zl = (W2l + W 2l)c Hl + bb2l1 N, where l [L]. We note that Z0 = b Z0 = X. In this proof, we set W 2l = O for all l [L]. Our goal is to show that adding low-rank adapters to self-attention layers and the first feedforward layers in all transformer blocks enables the adapted model f to be functionally equivalent to the target model f of the same dimensions. We start by inductively constructing the adapter weight matrices ( W 1l, W V l, W Kl, W Ql,bb1l,bb2l)L l=1 such that c Hl = Hl for all l [L]. We then select the low-rank adapters for W2L and the Wo to approximate the output of the target model. For unmentioned low-rank adapters, we set them as O. When l = 1. To achieve c Hl with Hl for all X, the following conditions must be satisfied: Bias Vector: bb1l = b1l, Query and Key: (WKl + W Kl) (WQl + W Ql) = W Kl W Ql Value and First Feedforward Layer: (W1l + W 1l)(WV l + W V l) = W 1l W V l. To achieve this, we set bb1l = b1l to achieve (24), and select rank-R or lower matrices W Kl, W Ql, W 1l, W V l as suggested by Lemma 7. This ensures c Hl = Hl for l = 1. When l > 1. Now we focus on the cases where l = 2, . . . , L. Assume the induction hypothesis holds for l 1, which is c Hl 1 = Hl 1. This implies Hl 1 = W 1 2,l 1(Zl 1 b2,l 11 N) = W 1 2,l 1( b Zl 1 bb2,l 11 N) = c Hl 1. 
Using this assumption, we express b Zl 1 in terms of Zl 1: b Zl 1 = W2,l 1W 1 2,l 1(Zl 1 b2,l 11 N) + bb2,l 11 N. Let bb2,l 1 = W2,l 1W 1 2,l 1b2,l 1, then we have b Zl 1 = W2,l 1W 1 2,l 1Zl 1. (22) To achieve c Hl = Hl, we express both c Hl and Hl in terms of Zl 1: W 1l W V l Zl 1 softmax Z l 1W Kl W Ql Zl 1 + b1l1 N c Hl = Re LU (W1l + W 1l)(WV l + W V l) b Zl 1 Published as a conference paper at ICLR 2024 softmax b Z l 1(WKl + W Kl) (WQl + W Ql) b Zl 1 + bb1l1 N , (22) = Re LU (W1l + W 1l)(WV l + W V l) W2,l 1W 1 2,l 1Zl 1 Z l 1W 1 2,l 1W 2,l 1(WKl + W Kl) (WQl + W Ql)W2,l 1W 1 2,l 1Zl 1 + bb1l1 N Therefore, we need to align the following three components: Bias Vector: bb1l = b1l, Query and Key: (WKl + W Kl) (WQl + W Ql) = W 1 2,l 1W 2,l 1W Kl W Ql W 2,l 1W 1 2,l 1, Value and First Feedforward Layer: (W1l + W 1l)(WV l + W V l) = W 1l W V l W 2,l 1W 1 2,l 1. By setting bb1l based on (26) and adjusting W Kl, W Ql, W 1l, W V l based on Lemma 7, we satisfy all three conditions above, thereby obtaining c Hl = Hl for l [L] \ {1}. Output Layer Analysis. By the induction method, we have established c Hl = Hl for all l [L]. We will complete the proof by showing that f(X) = f(X) for all X X. The final output distribution of the target TFN f can be written as f(X) = softmax(W o ZL) = softmax W 2LHL + b2L1 N . We can similarly formulate the final output distribution of the adapted model f : f(X) = softmax((Wo + W o) b ZL) = softmax (Wo + W o) (W2L + W 2L)c HL + bb2L1 N , To align these two expressions, we select W 2L and W o based on Lemma 7, and let bb2L = (Wo + W o) 1W ob2L, where Wo + W o is invertible as shown in the proof of Lemma 7. Thus, the proof is complete. The following corollary identifies the specific Lo RA-rank required to achieve exact representation for random model cases in the current setting. Corollary 9. Assume that the elements of all the weight matrices of both the target TFN and the frozen TFN are independently drawn from arbitrary continuous distributions. If R D 2 , adding low-rank adapters of rank at most R to weight matrices in ( W Kl, W Ql, W V l, W 1l)L l=1, W 2L, W o and tuning the bias vectors, enables the adapted model f to exactly approximate the target model f, i.e., f(X) = f(X) for all X RD N. Proof of Corollary 9. By combining Lemma 13 and Theorem 8, and following the same steps in the proof of Corollary 4 which yields maxi Gi = D, we can obtain the desired outcome. F.2 APPROXIMATING TRANSFORMER NETWORK WITH MULTI-HEAD ATTENTION LAYERS In this section, we first provide the explicit formulation of TFN with multi-head attention layers. Consider an input matrix X RD N, where D is the dimension of the token embeddings and N is the number of tokens. The output of the l-th transformer block is denoted as Zl, which can be computed as follows: Attnl(Zl 1) := h=1 W h Ol W h V l Zl 1 softmax (W h Kl Zl 1) W h Ql Zl 1 , Zl := W2l Re LU(W1l Attnl(Zl 1) + b1l1 N) + b2l1 N, Published as a conference paper at ICLR 2024 where we define Z0 = X. Here, H is the number of attention heads. The weight matrices for each head h [H] in the l-th transformer block are W h Ol, W h V l, W h Kl, W h Ql RD D. The softmax operator softmax( ) is applied column-wise to the matrix. Further, W2l, W1l RD D are the weight matrices and b1l, b2l RD are the bias vectors in the feedforward layers. A Transformer network, denoted as TFNL,D, is a composition of L Transformer blocks, followed by an softmax output layer softmax(Wo ), where Wo RD D. The final output of the TFN is given by softmax(Wo ZL). 
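As a quick reference for the architecture just defined, here is a minimal NumPy sketch of a forward pass through a multi-head transformer block and the full TFN (our own illustrative code; the function and variable names are not from the paper, and the column-wise softmax together with the absence of attention scaling and skip connections follows the definition above).

```python
import numpy as np

def col_softmax(A):
    """Column-wise softmax, matching the convention that softmax(.) acts on each column."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def transformer_block(Z, WO, WV, WK, WQ, W1, W2, b1, b2):
    """One transformer block as defined above.
    Z: (D, N) input; WO, WV, WK, WQ: lists of H matrices of shape (D, D);
    W1, W2: (D, D) feedforward weights; b1, b2: (D,) bias vectors."""
    D, N = Z.shape
    attn = np.zeros((D, N))
    for WOh, WVh, WKh, WQh in zip(WO, WV, WK, WQ):
        scores = col_softmax((WKh @ Z).T @ (WQh @ Z))   # (N, N) attention matrix per head
        attn += WOh @ (WVh @ Z) @ scores                # head outputs are summed over h
    H1 = np.maximum(W1 @ attn + b1[:, None], 0.0)       # ReLU feedforward, bias broadcast over tokens
    return W2 @ H1 + b2[:, None]

def tfn_forward(X, blocks, Wo):
    """Compose L transformer blocks and the output layer softmax(Wo Z_L)."""
    Z = X
    for params in blocks:
        Z = transformer_block(Z, *params)
    return col_softmax(Wo @ Z)
```

In this formulation, applying a LoRA adapter to, say, W^h_Q simply amounts to passing WQ[h] + A @ B.T in place of WQ[h], with A, B of shape (D, R).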
To study the expressive power of Lo RA within TFNs featuring multi-head attention layers, we next specify the parameters of the target model f, frozen model f0, and the adapted model f, each with L transformer blocks and a dimension D. To study the expressive power of Lo RA within TFNs featuring multi-head attention layers, we next specify the parameters of the target model f, frozen model f0, and the adapted model f, each with L transformer blocks and a dimension D. For ease of presentation, we drop the subscript in TFNL,D, referring to it simply as TFN. Given a specified rank R [D] for Lo RA, these models are defined as follows: Target TFN f = TFN ; ((W h Ol, W h V l, W h Kl, W h Ql)H h=1, W 2l, W 1l)L l=1, W o , (b1l, b2l)L l=1 , Frozen TFN f0 = TFN ; ((W h Ol, W h V l, W h Kl, W h Ql)H h=1, W2l, W1l)L l=1, Wo , (b1l, b2l)L l=1 , Adapted TFN f = TFN ; ((W h Ol + W h Ol, W h V l + W h V l, W h Kl + W h Kl, W h Ql + W h Ql)H h=1, W2l + W 2l, W1l + W 1l)L l=1, Wo + W o , (bb1l,bb2l)L l=1 , where the weight matrices RD D, and the bias vectors RD. Moreover, the weight matrices of the low-rank adapters W h Ol, W h V l, W h Kl, W h Ql, W 2l, W 1l for all h [H] and l [L] are of rank R or lower. We next introduce non-singularity Assumption 4 for TFN with multi-head attention layers scenarios, which is then validated by Lemma 14. We then provide proof of our main results for TFNs Theorem 7. Additionally, we introduce a supplementary theorem that amalgamates results for TFNs with both single-head and multi-head attention layers when the weight matrices are randomly initialized. This is articulated in Corollary 10. Assumption 4 (Non-Singularity). For a fixed R [D], all the weight matrices of both the target model and the frozen model and the following matrices for all r [R], W h Kl W h Ql + LRr W h Kl W h Ql W h Kl W h Ql , for all h [H] and l = 1, W h Kl W h Ql + LRr W 1 2,l 1W 2,l 1W h Kl W h Ql W 2,l 1W 1 2,l 1 W h Kl W h Ql , for all h [H] and l [L] \ {1} , W h Ol W h V l + LRr W 1 1l W 1l W h Ol W h V l W h Ol W h V l , for all h [H] and l = 1, W h Ol W h V l + LRr W 1 1l W 1l W h Ol W h V l W 2,l 1W 1 2,l 1 W h Ol W h V l , for all h [H] and l [L] \ {1} , Wo W2L + LRr(W o W 2L Wo W2L), are non-singular. Lemma 14. Let the elements of all weight matrices in the target model f and frozen model f0 be independently sampled from continuous distributions. Then, Assumption 4 holds with probability 1. Proof of Lemma 14. The results can be obtained by replicating the same steps outlined in the proof of Lemma 3. For the reader s reference, we restate Theorem 7 here integrated with the explicit formulation of the rank-based functionality gap Gi. Theorem 7. Consider a given Lo RA-rank R [D]. Let Assumption 4 hold. Define the rank-based functionality gap Gi to i-th transformer block (i [L]) or output layer (i = L + 1) as Published as a conference paper at ICLR 2024 maxh rank(W h Ki W h Qi W h Ki W h Qi) maxh rank(W 1i W h Oi W h V i W1i W h Oi W h V i) , i = 1, maxh rank(W 2,i 1W h Ki W h Qi W 2,i 1 W 2,i 1W h Ki W h Qi W2,i 1) maxh rank(W 1i W h Oi W h V i W 2,i 1 W1i W h Oi W h V i W2,i 1) , 2 i L, rank(W o W 2L Wo W2L), i = L + 1. (23) If R maxi [L+1] Gi 2 , then there exists low-rank adapters with rank lower than R [D] (( W h Kl, W h Ql, W h V l, W h Ol)H h=1)L l=1, W 2L, W o with other low-rank adapters set to O, and updated bias vectors (bb1l,bb2l)L l=1, such that for any X RD N, the adapted model f exactly approximates target model f, i.e., f(X) = f(X). Proof of Theorem 7. 
The key idea of this proof is the same as the proof of Theorem 8: our first step is to ensure that, for each transformer block, the output from the first feedforward layer in the target model matches that in the adapted model. Once this is established, we select an appropriate output layer weight matrix to complete the proof. Similar to the proof of Theorem 8, we define Hl RD N and Zl RD N as the intermediate and final outputs of the l-th transformer block in the target model f, respectively. In particular, Hl corresponds to the output of the first feedforward layer in the l-th transformer block. They are formulated as W h Ol W h V l Zl 1 softmax Z l 1W h Kl W h Ql Zl 1 ! Zl = W 2l Hl + b2l1 N. For the adapted model f, we introduce c Hl and b Zl accordingly to denote the intermediate output of the first feedforward layer and the final output of the l-th transformer block for the adapted model, respectively: c Hl = Re LU h=1 (W h Ol + W h Ol)(W h V l + W h V l) b Zl 1 softmax b Z l 1(W h Kl + W h Kl) (W h Ql + W h Ql) b Zl 1 + bb1l1 N b Zl = W2l c Hl + bb2l1 N. Note that Z0 = b Z0 = X. We aim to demonstrate that adding low-rank adapters to the weight matrices allows the adapted TFN f to be functionally equivalent to the target TFN of identical dimensions. We will initiate our proof by inductively constructing the adapter weight matrices (( W h Ol, W h V l, W h Kl, W h Ql)H h=1,bb1l,bb2l)L l=1 such that c Hl = Hl for all l [L], and then select the W 2L and the low-rank adapter for the output layer W o to approximate the output of the target model. For unmentioned low-rank adapters, we set them as O. When l = 1. To achieve c Hl with Hl for all X, we must satisfy the following conditions: Bias Vector: bb1l = b1l, (24) Query and Key: (W h Kl + W h Kl) (W h Ql + W h Ql) = W h Kl W h Ql, Value and Output Projection: (W h Ol + W h Ol)(W h V l + W h V l) = W 1 1l W 1l W h Ol W h V l. To achieve this, we set bb1l = b1l to achieve (24), and select rank-R or lower matrices W h Kl, W h Ql, W h Ol, W h V l for all h [H] as suggested by Lemma 7. This ensures c Hl = Hl for l = 1. Published as a conference paper at ICLR 2024 When l > 1. Now we focus on the cases where l = 2, . . . , L. Assume the induction hypothesis holds for l 1, which is c Hl 1 = Hl 1. Following the same steps in the proof of Theorem 8, we let bb2,l 1 = W2,l 1W 1 2,l 1b2,l 1, thereby obtaining, b Zl 1 = W2,l 1W 1 2,l 1Zl 1. (25) To achieve c Hl = Hl, we express both c Hl and Hl in terms of Zl 1: W h Ol W h V l Zl 1 softmax Z l 1W h Kl W h Ql Zl 1 + b1l1 N c Hl = Re LU W1l H X h=1 (W h Ol + W h Ol)(W h V l + W h V l) b Zl 1 softmax b Z l 1(W h Kl + W h Kl) (W h Ql + W h Ql) b Zl 1 + bb1l1 N , (25) = Re LU h=1 (W h Ol + W h Ol)(W h V l + W h V l) W2,l 1W 1 2,l 1Zl 1 Z l 1W 1 2,l 1W 2,l 1(W h Kl + W h Kl) (W h Ql + W h Ql)W2,l 1W 1 2,l 1Zl 1 + bb1l1 N Therefore, we need to align the following three components: Bias Vector: bb1l = b1l, (26) Query and Key: (W h Kl + W h Kl) (W h Ql + W h Ql) = W 1 2,l 1W 2,l 1W h Kl W h Ql W 2,l 1W 1 2,l 1, Value and Output Projection: (W h Ol + W h Ol)(W h V l + W h V l) = W 1 1l W 1l W h Ol W h V l W 2,l 1W 1 2,l 1. By setting bb1l based on (26) and adjusting W h Kl, W h Ql, W h Ol, W h V l for all h [H] based on Lemma 7, we satisfy all three conditions above, thereby obtaining c Hl = Hl for l [L] \ {1}. Output Layer Analysis. By applying the induction method, we have established c Hl = Hl for all l [L]. 
Lastly, we choose ΔW_o, ΔW_{2L}, and the bias vector b̂_{2L} using the same approach as in the proof of Theorem 8. This concludes the proof.

The following corollary identifies the specific LoRA-rank required to achieve exact representation in the random model case for the current setting.

Corollary 10. Assume that the elements of all the weight matrices of both the target TFN and the frozen TFN are independently drawn from arbitrary continuous distributions. If R ≥ ⌈D/2⌉, then adding low-rank adapters of rank at most R, namely ((ΔW^h_{Kl}, ΔW^h_{Ql}, ΔW^h_{Vl}, ΔW^h_{Ol})_{h=1}^{H})_{l=1}^{L}, ΔW_{2L}, and ΔW_o, to the corresponding weight matrices and tuning the bias vectors enables the adapted model f̂ to exactly approximate the target model f̄, i.e., f̂(X) = f̄(X) for all X ∈ R^{D×N}.

Proof of Corollary 10. By combining Lemma 14 and Theorem 7, and following the same steps as in the proof of Corollary 4, which yield max_i G_i = D, we obtain the desired result.

G EXPERIMENTS

In this section, we perform experiments on both synthetic and real datasets to corroborate our theoretical results. First, we focus on validating the construction of the LoRA adapters used in our proofs. Subsequently, we extend our experimental validation to the effect of tuning the final layers and the significance of updatable biases. Additionally, we provide visualizations of training curves, assess the generalization performance of LoRA, and evaluate its efficacy on classification tasks. We also conduct experiments on real datasets to further support our theoretical insights in real-world scenarios.

G.1 ADDITIONAL DETAILS OF EXPERIMENT SETUP

We implement the LoRA adapter ΔW by reparameterizing it as ΔW = AB⊤, where A, B ∈ R^{D×R}, and we use the same initialization scheme as proposed by Hu et al. (2022a). For the experiments presented in Sec. 5, G.3.1, G.3.2, G.4, and G.5, we consider two variants of the frozen model:

(Random) The first method randomly generates all the weight matrices using the Xavier uniform distribution, the default weight initialization method in PyTorch.

(Pretrained) The second method simulates scenarios where the pretrained model is relatively close to the target model. We achieve this by initially creating the target model and the frozen model in the same way as in the first method, and then performing full-rank updates on the frozen model via gradient descent to approximate the target model until the approximation error is reduced by 1/3.

For the other experiments on synthetic datasets, we default to the randomly parameterized frozen model unless specified otherwise.

G.2 ADDITIONAL DETAILS ON GRADIENT UPDATE METHOD

In our experiments, we use the Adam optimizer. We tune the learning rate over {10⁻², 10⁻³, 10⁻⁴} and the weight decay over {0, 10⁻², 10⁻³, 10⁻⁴}. The optimal configuration is selected based on the validation loss on a set of 256 samples independently drawn from a standard normal distribution. We run 5,000 iterations for each hyperparameter setting, where at each step 256 fresh standard Gaussian samples are generated for the loss and gradient computation.
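For concreteness, the following PyTorch sketch mirrors the gradient update protocol described in G.2 (Adam, the stated learning-rate and weight-decay grid, 5,000 steps with 256 fresh standard Gaussian samples per step, and validation on an independent set of 256 samples). It is our own schematic, not the authors' code: `adapted_model` and `target_model` are placeholder modules, and only parameters with requires_grad=True (the LoRA factors and, where applicable, the biases) are optimized.

```python
import copy
import itertools
import torch

def train_lora(adapted_model, target_model, steps=5000, batch=256, dim=16):
    """Fit the tunable parameters of `adapted_model` to match `target_model`
    on fresh standard Gaussian inputs, following the protocol in G.2."""
    best_val, best_model = float("inf"), None
    for lr, wd in itertools.product([1e-2, 1e-3, 1e-4], [0.0, 1e-2, 1e-3, 1e-4]):
        model = copy.deepcopy(adapted_model)
        opt = torch.optim.Adam(
            [p for p in model.parameters() if p.requires_grad], lr=lr, weight_decay=wd)
        for _ in range(steps):
            x = torch.randn(batch, dim)              # 256 fresh standard Gaussian samples per step
            with torch.no_grad():
                y = target_model(x)
            loss = ((model(x) - y) ** 2).mean()      # approximation error measured by MSE
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                        # validate on an independent Gaussian set
            x_val = torch.randn(256, dim)
            val = ((model(x_val) - target_model(x_val)) ** 2).mean().item()
        if val < best_val:                           # keep the best hyperparameter configuration
            best_val, best_model = val, model
    return best_val, best_model
```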
G.3 VALIDATION OF OUR LORA ADAPTER CONSTRUCTION

Recall that all our theoretical statements are based on the constructions of the LoRA adapters presented in their corresponding proofs. To validate these results, we empirically examine the relationship between the approximation error and the rank by plugging the LoRA adapters constructed with the uniform partition in our proofs into the frozen model. Furthermore, we evaluate the effectiveness of our constructed LoRA adapters by comparing their performance against adapters updated with the gradient update method (Adam). All simulations are conducted five times with different seeds, and the reported values are the medians across runs.

G.3.1 FNN APPROXIMATION

Figure 3: Approximation error (measured by MSE) versus LoRA-rank on FNNs. (a) Frozen model is randomly generated. (b) Frozen model is pretrained.

In this experiment, we assess the effectiveness of our low-rank adapter construction for FNN approximation, which is detailed in the proof of Theorem 5.

Figure 4: Log-scale MSE versus LoRA-rank on randomly initialized FNNs.

Setup. We consider two scenarios: one with $\overline{L} = 1$ and $L = 2$, and the other with $\overline{L} = 2$ and $L = 4$. Note that in both cases $M = L/\overline{L} = 2$. We employ the gradient update method and the construction outlined in the proof of Theorem 5 to update the LoRA adapters.

Results. Fig. 3 presents the results for FNN approximation. Consistent with the implications drawn in Sec. 5, the change in the y-axis limits from Fig. 3a to Fig. 3b suggests that the pretrained frozen model yields a smaller approximation error. Additionally, we observe that our construction's performance aligns closely with that of the gradient update method when the target model depth $\overline{L} = 1$. However, this alignment is not observed when $\overline{L} = 2$ in the low-rank region (i.e., $R \le 4$). This further underscores a limitation of our LoRA adapter construction, which inherently requires the intermediate outputs of the frozen model and the target model to align. To facilitate a more effective comparison between our construction and the gradient update method in the higher-rank region (i.e., $R \ge 6$), we present the curves on a logarithmic scale in Fig. 4. While the gradient update method appears to approach the optimal performance achieved by our LoRA construction on FNNs, a gap is still discernible on the logarithmic scale: the MSE of the gradient update method is approximately $10^{-4}$, whereas that of our LoRA construction is around $10^{-8}$ for sufficiently large ranks.

G.3.2 TFN APPROXIMATION

We assess the effectiveness of our LoRA adapter construction in approximating TFNs, as detailed in the proof of Theorem 7.

Setup. We examine a target model $\overline{f}$ and a frozen model $f$, both featuring the same architecture with $L$ transformer blocks, a single output layer, two attention heads, and embedding size $D = 16$. We focus on two scenarios: $L = 1$ and $L = 2$. The weight matrices of the attention layers are drawn from the standard Gaussian distribution, while those of the linear layers are initialized with the Xavier uniform distribution, PyTorch's default scheme for linear layer initialization.

Figure 5: Approximation error (measured by MSE) versus LoRA-rank on TFNs. (a) Frozen model is randomly generated. (b) Frozen model is pretrained.
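For reference, a minimal PyTorch sketch of one such transformer block (ours) is given below, with $D = 16$, two heads, Gaussian attention weights, and Xavier-initialized feedforward weights. The column-wise softmax, the absence of an attention scaling factor, and the zero-initialized biases are assumptions of the sketch; skip connections and layer norms are omitted, in line with our simplified TFN.

```python
import torch
import torch.nn as nn

class ToyTransformerBlock(nn.Module):
    """One simplified TFN block: multi-head attention followed by a two-layer
    feedforward map, without skip connections or layer norm (illustrative sketch)."""

    def __init__(self, D: int = 16, H: int = 2):
        super().__init__()
        self.H = H
        gauss = lambda: nn.Parameter(torch.randn(D, D))               # attention weights ~ N(0, 1)
        self.WK = nn.ParameterList([gauss() for _ in range(H)])
        self.WQ = nn.ParameterList([gauss() for _ in range(H)])
        self.WV = nn.ParameterList([gauss() for _ in range(H)])
        self.WO = nn.ParameterList([gauss() for _ in range(H)])
        xavier = lambda: nn.Parameter(nn.init.xavier_uniform_(torch.empty(D, D)))
        self.W1, self.W2 = xavier(), xavier()                         # feedforward weights
        self.b1, self.b2 = nn.Parameter(torch.zeros(D, 1)), nn.Parameter(torch.zeros(D, 1))

    def forward(self, Z: torch.Tensor) -> torch.Tensor:               # Z has shape (D, N)
        attn = sum(
            self.WO[h] @ self.WV[h] @ Z
            @ torch.softmax(Z.T @ self.WK[h].T @ self.WQ[h] @ Z, dim=0)
            for h in range(self.H)
        )
        H_mat = torch.relu(self.W1 @ attn + self.b1)                  # bias broadcasts over columns
        return self.W2 @ H_mat + self.b2

block = ToyTransformerBlock()
print(block(torch.randn(16, 5)).shape)                                # torch.Size([16, 5])
```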
Results. The observations here align with those from the FNN approximation experiments. We note that the gradient update method outperforms our approach when the rank is relatively small, but lags behind as the rank increases. The advantage of the gradient update method at small ranks arises from the inherent complexity of TFNs, which allows for more flexible low-rank adapter constructions. Meanwhile, the gradient update method's performance does not improve significantly as the rank increases, which we attribute to the inherent difficulty of optimizing TFNs. Nonetheless, our results corroborate the claims made in Theorem 7, as the approximation error vanishes once the rank reaches $D/2$.

Figure 6: Approximation error (measured by MSE) versus the number of tunable parameters for various methods. The analyses are conducted on FNN models. (a) Comparison between LoRA and tuning final layers. (b) Comparison between LoRA with fixed biases and LoRA with updatable biases.

G.4 COMPARISON TO TUNING FINAL LAYERS

Tuning or adding only the final layers is also a common adaptation method used in various domains, including computer vision (Chatfield et al., 2014; Donahue et al., 2014; Sharif Razavian et al., 2014) and natural language processing (Devlin et al., 2019; Gira et al., 2022). Recall that Corollary 4 and Lemma 12 demonstrate that tuning final layers does not perform as well as LoRA for randomly generated models, provided the LoRA-rank satisfies the rank constraints stated in Corollary 4. In this experiment, we validate this assertion and compare the performance of tuning final layers and LoRA in more general scenarios, such as when the frozen model has been pretrained and when the LoRA-rank is smaller than required.

Setup. We consider FNN models with $D = 16$, $\overline{L} = 1$, $L = 8$. We employ two baselines: LoRA and tuning final layers. The LoRA adapters and the final layers are updated using the gradient update method.

Results. Fig. 6a compares the MSE of LoRA and final layer tuning when the same number of tunable parameters is used. For randomly generated models, we observe that final layer tuning yields a significantly higher MSE given the same number of tunable parameters, corroborating our results in Lemma 12. However, when the frozen model has been pretrained, the performance of final layer tuning improves considerably, though it still falls short of LoRA. This aligns with conclusions drawn from previous theoretical studies such as Tripuraneni et al. (2020), which assert that the performance of final layer tuning heavily depends on the quality of the shared representations.

G.5 BENEFITS OF TUNING BIASES

In our proofs, as detailed in Sec. 3.2 and E.1, the updatable biases in the FNN play a crucial role in eliminating the nonlinearity of the ReLUs. In this experiment, we investigate the importance of updatable biases in ensuring the success of LoRA in the FNN case; the short sketch below isolates the underlying mechanism.
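Concretely, for inputs that are bounded (here by an arbitrary constant $c$ of our choosing), a sufficiently large bias shifts every pre-activation into the region where the ReLU acts as the identity, and a matching shift afterwards removes the offset. The snippet below is a minimal illustration of that mechanism, not our full FNN construction.

```python
import torch

# For inputs bounded below by -c, shifting by c before the ReLU and removing the shift
# afterwards makes the activation behave as the identity map.
c = 10.0
x = torch.empty(1000, 16).uniform_(-c, c)
assert torch.allclose(torch.relu(x + c) - c, x)
```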
Setup. We consider FNN models with $D = 16$, $\overline{L} = 1$, $L = 2$, and examine the performance of LoRA with and without bias tuning when adapting the frozen model to match the target FNN. The LoRA adapters and biases are updated using the gradient update method.

Results. The performance of LoRA with and without updatable biases is presented in Fig. 6b. We observe that in both the random and the pretrained model cases, LoRA with updatable biases outperforms LoRA with fixed biases when the number of tunable parameters is relatively small. However, the performance gap is not significant and diminishes as the number of tunable parameters increases. This suggests that while tuning biases in conjunction with the low-rank adapters does enhance performance, the gain is not substantial; even without bias tuning, LoRA's performance remains competitive.

G.6 TRAINING CURVES

Figure 7: Training curves (log training loss) of LoRA with varying LoRA-ranks ($R = 1, 4, 8, 13, 16$) when $D = 16$, for the settings ($\overline{L} = 1$, $L = 2$) and ($\overline{L} = 2$, $L = 4$).

Although our theoretical study does not incorporate any training process, we present the training curves of the LoRA gradient update method to illuminate the optimization aspects of LoRA.

Setup. We depict the training curves of LoRA fine-tuning on randomly generated FNNs for $R = 1, 4, 8, 13, 16$. Unless stated otherwise, all settings strictly adhere to the FNN experiments described in Sec. 5.

Results. The training curves visualized in Fig. 7 reveal that models with small ranks (e.g., $R = 1, 4$) converge swiftly due to their limited search space, but settle at a relatively high training loss. Medium-rank models (e.g., $R = 8$) converge more slowly. Highly overparameterized models (e.g., $R = 13, 16$) appear to converge faster, aligning with recent advancements in optimization theory, which suggest that overparameterized models are easier to optimize (Liu et al., 2022a).

G.7 GENERALIZATION PERFORMANCES

Our theoretical study only establishes an upper bound on LoRA's performance in the infinite-sample regime and does not consider LoRA's generalization performance in practice. Although this is beyond the current scope of our paper, we empirically investigate LoRA's generalization in this experiment.

Setup. We use a training set of 400 samples for the case ($\overline{L} = 1$, $L = 2$) and 800 training samples for the case ($\overline{L} = 2$, $L = 4$). We fine-tune the frozen model on the training set, report the training and test MSE, and evaluate how well LoRA's training performance transfers to the test set.

Figure 8: Assessment of LoRA's generalization performance on FNNs, for the settings ($\overline{L} = 1$, $L = 2$) and ($\overline{L} = 2$, $L = 4$).

Results. Fig. 8 presents the training and test MSE versus the LoRA-rank. We notice an increasing generalization gap (test MSE minus training MSE) as the LoRA-rank increases; this is most evident when $L = 2$ and less so when $L = 4$. This is intuitive, as larger LoRA-ranks imply a larger hypothesis class (e.g., in terms of Rademacher complexity). At the same time, the gap does not grow in a fully consistent pattern across ranks, which could be due to Adam not precisely finding the minimum (see Fig. 4), thereby partially avoiding overfitting. We defer a detailed analysis of LoRA's generalization performance to future work, but believe these simulation results provide a valuable starting point for further discussion and investigation.

G.8 EVALUATION ON CLASSIFICATION TASKS

Our theory and the previous experiments all focus on the regression setting. In this experiment, we consider binary and multi-class classification tasks, optimize the LoRA adapters via the cross-entropy loss, and report the performance of LoRA in terms of accuracy.

Multi-class Classification. As shown in Fig. 9a, consistent with our theoretical results, our construction achieves 100% accuracy when $R \ge 8$.
The performance of the gradient update method is also similar to what we observed when MSE is employed as the metric, particularly when the MSE is plotted on a logarithmic scale (Fig. 4). This observation echoes the findings of Hui & Belkin (2021), which indicate that optimizing the MSE is fundamentally equivalent to optimizing the cross-entropy.

Binary Classification. We also conduct binary classification tasks. We use the same setup as before, but add one more fixed output layer in $\mathbb{R}^{2 \times D}$ with a block structure: the first row has ones in its first 8 entries, the second row has ones in its last 8 entries, and all remaining entries are 0. We fix this output layer, optimize the cross-entropy loss over the LoRA adapters, and report the test accuracy.

Figure 9: Accuracy versus LoRA-rank on classification tasks, conducted on FNN models for the settings ($\overline{L} = 1$, $L = 2$) and ($\overline{L} = 2$, $L = 4$). (a) Multi-class classification with 16 classes. (b) Binary classification.

As shown in Fig. 9b, we observe that in this binary classification scenario, even with a very low LoRA-rank $R = 1$, the accuracy improves significantly and is comparable to the results achieved with higher ranks. In the higher-rank region, our construction significantly outperforms the gradient update method. The suboptimal performance of the gradient update method in this simulation suggests that, despite LoRA's impressive performance in current practical applications, there is potential for further refinement.
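For concreteness, the fixed binary-classification head described above can be written down directly. In this minimal sketch, the name `W_out`, the random batch of FNN outputs, and the random labels are our own illustrative choices.

```python
import torch

D = 16
W_out = torch.zeros(2, D)        # fixed (non-trainable) binary classification head
W_out[0, :D // 2] = 1.0          # first row: ones on the first 8 coordinates
W_out[1, D // 2:] = 1.0          # second row: ones on the last 8 coordinates

feats = torch.randn(32, D)       # stand-in for FNN outputs on a batch of 32 samples
logits = feats @ W_out.T         # two logits per sample
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 2, (32,)))
```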
G.9 EVALUATION ON REAL DATASETS

In our theoretical analysis, we demonstrate how the size of the frozen model and the distance between the frozen and target models influence the LoRA-rank necessary to achieve a desired performance (see Lemmas 1 and 2 and Theorems 5, 6, and 7). Specifically, our results suggest that larger models require lower LoRA-ranks to reach a given performance, and that when the frozen model is closer to the target model, a lower LoRA-rank suffices to achieve the same performance. We validate these theoretical insights through experiments on the GLUE benchmark (Wang et al., 2018).

Setup. Our experiments are conducted on Tesla V100-PCIE-16GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-40GB, and NVIDIA L40 GPUs; each run uses a single GPU. Unless otherwise specified, all our settings align with those established by Hu et al. (2022a).

Impact of Model Size on LoRA-Rank. In practice, most existing studies on LoRA use the same LoRA-rank for models of varying sizes. For instance, in the original LoRA paper (Hu et al., 2022a), Tables 9 and 10 use the same LoRA-rank for RoBERTa-base (Liu et al., 2019), RoBERTa-large (Liu et al., 2019), and DeBERTa-XXL (He et al., 2021). Similarly, in the QLoRA paper (Dettmers et al., 2023), a LoRA-rank of 64 is used for models ranging from 13B to 65B parameters (see their Appendix B.2). To validate our theoretical findings, we evaluate the performance of LoRA on models of different sizes, specifically RoBERTa-base with 110M parameters and RoBERTa-large with 340M parameters. The results are presented in Table 2.

| Model         | R | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP  | RTE  | STS-B |
| RoBERTa-base  | 0 | .330 | .491  | .316 | 0    | .495 | .682 | .527 | .024  |
| RoBERTa-large | 0 | .318 | .505  | .684 | 0    | .505 | .369 | .473 | .032  |
| RoBERTa-base  | 2 | .861 | .950  | .892 | .632 | .928 | .891 | .780 | .907  |
| RoBERTa-base  | 6 | .870 | .948  | .892 | .629 | .931 | .900 | .773 | .909  |
| RoBERTa-large | 2 | .904 | .956  | .917 | .631 | .946 | .887 | .884 | .916  |

Table 2: Comparison of the fine-tuned performance of RoBERTa-base and RoBERTa-large using LoRA with different LoRA-ranks on the GLUE benchmark. Following Hu et al. (2022a), we report the overall (matched and mismatched) accuracy for MNLI, Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks. Higher is better for all metrics.

Initially, we observe that, in the absence of fine-tuning (LoRA-rank $R = 0$), there is no consistent trend: RoBERTa-base performs better on 3 datasets, while RoBERTa-large performs better on 4. However, after LoRA fine-tuning, RoBERTa-large outperforms the base model in most cases. In fact, even when the base model is fine-tuned with a LoRA-rank three times larger, RoBERTa-large still performs better on 6 out of 8 datasets. Given that, before fine-tuning, the pretrained RoBERTa-large performed no differently from the base model, this observation supports our theoretical finding that deeper models are more expressive under LoRA fine-tuning. Despite the absence of a clear pattern indicating which pretrained model is superior before fine-tuning, RoBERTa-large (340M) fine-tuned with LoRA-rank $R = 2$ outperforms RoBERTa-base (110M) fine-tuned with LoRA-rank $R = 6$ on 7 out of 8 tasks, which aligns with our theoretical conclusion that larger models require lower LoRA-ranks to achieve the desired performance.

Impact of Model Proximity on LoRA-Rank. Our theoretical results (Lemmas 1 and 2 and Theorems 5, 6, and 7) also imply that a frozen model closer to the target model achieves better results for a fixed LoRA-rank. To validate this, we compare the performance of the pretrained RoBERTa-base with a randomly initialized RoBERTa-base, both fine-tuned with the same LoRA-ranks. The results are presented in Table 3.

| Model      | R | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP  | RTE  | STS-B |
| Random     | 2 | .523 | .775  | .691 | .154 | .627 | .761 | .542 | .213  |
| Pretrained | 2 | .861 | .950  | .892 | .632 | .928 | .891 | .780 | .907  |
| Random     | 4 | .535 | .788  | .696 | .145 | .625 | .768 | .542 | .224  |
| Pretrained | 4 | .868 | .950  | .890 | .634 | .929 | .898 | .805 | .910  |
| Random     | 6 | .544 | .799  | .696 | .154 | .632 | .768 | .542 | .210  |
| Pretrained | 6 | .868 | .948  | .892 | .629 | .931 | .900 | .773 | .909  |

Table 3: Comparison of the fine-tuned performance of randomly initialized and pretrained RoBERTa-base. Following Hu et al. (2022a), we report the overall (matched and mismatched) accuracy for MNLI, Matthews correlation for CoLA, Pearson correlation for STS-B, and accuracy for the other tasks. Higher is better for all metrics.

We observe that the pretrained RoBERTa-base significantly surpasses the randomly initialized RoBERTa-base given the same LoRA-rank. This observation is consistent with our theoretical findings: a frozen model closer to the target model yields better performance for the same LoRA-rank, or equivalently, requires a lower LoRA-rank to achieve the desired performance.
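Our GLUE experiments follow the implementation of Hu et al. (2022a). Purely as an illustration of how the rank could be set per model size in practice, the sketch below uses the Hugging Face peft library; the library choice, target modules, and hyperparameters are assumptions of ours and not the configuration used for Tables 2 and 3.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

def build_lora_model(model_name: str, rank: int):
    # Pair larger backbones with smaller LoRA-ranks, mirroring the comparison in Table 2.
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=rank,                             # LoRA-rank R
        lora_alpha=16,
        target_modules=["query", "value"],  # attention projections, as in Hu et al. (2022a)
        lora_dropout=0.0,
    )
    return get_peft_model(base, config)

small = build_lora_model("roberta-base", rank=6)    # smaller model, larger rank
large = build_lora_model("roberta-large", rank=2)   # larger model, smaller rank
large.print_trainable_parameters()
```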
H EXTENSION TO CASES WITH DIFFERENT MODEL DIMENSIONS

This discussion applies only to linear model approximation and FNN approximation. As highlighted in Sec. 2, our results can be easily extended to scenarios where the target model $\overline{f}$ and the frozen model $f$ have different model dimensions. Specifically, for linear model or FNN approximation, we use $\overline{D}$ to denote the number of hidden neurons per layer in the target model and $D$ for the frozen model. We particularly consider the case where the frozen model is wider than the target model, i.e., $D \ge \overline{D}$, since the frozen model is typically overparameterized in practical applications.

The key idea for extending our analysis to scenarios with different model dimensions is to expand the dimension of the target model. For simplicity, we take the simplest case, linear model approximation, as an example. In this setting, the adapted model receives the zero-padded input $[x^\top, 0^\top]^\top \in \mathbb{R}^{D}$ for $x \in \mathbb{R}^{\overline{D}}$, and the difference between the outputs of the adapted model and the target model can be measured by
$$\left\| \Big[ \prod_{l=1}^{L}(W_l+\Delta W_l)\begin{bmatrix} x \\ 0 \end{bmatrix} \Big]_{1:\overline{D}} - \overline{W} x \right\|, \quad x \in \mathbb{R}^{\overline{D}},  \qquad (27)$$
where $[\cdot]_{1:\overline{D}}$ takes the first $\overline{D}$ coordinates. Consequently, the last $(D - \overline{D})$ columns and rows of $\prod_{l=1}^{L}(W_l + \Delta W_l)$ do not affect this error at all. Denote by $[W]_d$ the submatrix consisting of the first $d$ rows and $d$ columns of a matrix $W$. Then, to approximate the target model, we aim to solve the following constrained optimization problem for a given LoRA-rank $R \in [D]$:
$$\min_{\mathrm{rank}(\Delta W_l) \le R}\ \left\| \Big[\prod_{l=1}^{L}(W_l+\Delta W_l)\Big]_{\overline{D}} - \overline{W} \right\|.$$
To solve this problem, we first define an expanded target matrix, denoted by $\widetilde{W} \in \mathbb{R}^{D \times D}$. The expanded target matrix $\widetilde{W}$ is constructed such that $[\widetilde{W}]_{\overline{D}} = \overline{W}$, while the remaining entries match the corresponding entries of $\prod_{l=1}^{L} W_l$. Then the error matrix $E = \widetilde{W} - \prod_{l=1}^{L} W_l$ is zero outside its top-left $\overline{D} \times \overline{D}$ block, and therefore $R_E := \mathrm{rank}(E) \le \overline{D}$. Given the expanded target matrix, we consider the following updated constrained optimization problem:
$$\min_{\mathrm{rank}(\Delta W_l) \le R}\ \left\| \prod_{l=1}^{L}(W_l+\Delta W_l) - \widetilde{W} \right\|.  \qquad (28)$$
By Lemma 1, when the LoRA-rank satisfies $R \ge \lceil \overline{D}/L \rceil$, the optimal solution to (28) satisfies $\prod_{l=1}^{L}(W_l + \Delta W_l) = \widetilde{W}$, since $\overline{D} \ge R_E$. This result implies that $\big[\prod_{l=1}^{L}(W_l + \Delta W_l)\big]_{\overline{D}} = \overline{W}$, and therefore the approximation error defined in (27) is 0 for all inputs $x$. A similar analysis can be conducted for FNN approximation.
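A quick numerical sanity check of this argument is given below (ours). It verifies that $\mathrm{rank}(E) \le \overline{D}$ and realizes $\widetilde{W}$ exactly by placing the whole correction in the last layer, which requires rank up to $\overline{D}$ in that single adapter; Lemma 1's construction, which we do not reproduce here, instead spreads the correction across all $L$ adapters of rank roughly $\overline{D}/L$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Dbar, L = 8, 3, 3                       # frozen width, target width (Dbar <= D), frozen depth

W = [rng.standard_normal((D, D)) for _ in range(L)]   # frozen map is W[2] @ W[1] @ W[0]
Wbar = rng.standard_normal((Dbar, Dbar))              # target linear map on R^Dbar

prod = W[2] @ W[1] @ W[0]

# expanded target matrix: top-left Dbar x Dbar block equals Wbar, the rest copies the frozen product
Wtil = prod.copy()
Wtil[:Dbar, :Dbar] = Wbar

E = Wtil - prod
assert np.linalg.matrix_rank(E) <= Dbar    # E vanishes outside its top-left Dbar x Dbar block

# simple (not rank-optimal) realization: put the entire correction in the last layer
prefix = W[1] @ W[0]
dW_last = E @ np.linalg.inv(prefix)        # rank(dW_last) = rank(E) <= Dbar
adapted = (W[2] + dW_last) @ prefix        # equals Wtil exactly

x = rng.standard_normal(Dbar)
x_pad = np.concatenate([x, np.zeros(D - Dbar)])         # zero-pad the input up to dimension D
assert np.allclose((adapted @ x_pad)[:Dbar], Wbar @ x)  # the error in (27) is zero
```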
I EXTENDED FUTURE WORKS

To the best of our knowledge, this paper is the first to offer a theoretical understanding of LoRA fine-tuning on both FNN and TFN. Our work delivers insightful results, elucidating the impact of the rank, the depth of the pretrained model, and the distance between the pretrained model and the target model on the expressive power of LoRA. These theoretical results are further corroborated by our experiments. Despite these advancements, several intriguing questions remain open. First, as observed in the numerical experiments, our construction of LoRA adapters for FNN and TFN may not always be optimal. Given that more complex models offer increased flexibility, an open question is whether we can devise a more parameter-efficient scheme for constructing the LoRA adapters, thereby deriving a tighter bound on the approximation error. Second, for TFN, we have only identified the conditions under which the LoRA-adapted model exactly matches the target model, due to the analytical complexity of TFNs; it would be interesting to quantify the approximation error when the rank is lower than required. Furthermore, for TFN, we constrain the target model and the frozen model to have identical embedding sizes and depths, and we omit the skip connections and layer norms for simplicity. Another intriguing direction is to study the expressive power of LoRA for TFNs under more general architectural settings. While our analysis does not involve any training process, an interesting direction for future research is to consider gradient-based optimization algorithms and examine how efficiently LoRA can be optimized. Finally, theoretical questions about LoRA's generalization to unseen data also remain unresolved.