# Improving LoRA in Privacy-Preserving Federated Learning

Published as a conference paper at ICLR 2024

Youbang Sun
Dept. of Mechanical & Industrial Engineering, Northeastern University
{sun.youb}@northeastern.edu

Zitao Li, Yaliang Li & Bolin Ding
Alibaba Group
{zitao.l, yaliang.li, bolin.ding}@alibaba-inc.com

(Work was done while the first author, Youbang Sun, was an intern at Alibaba Group.)

ABSTRACT

Low-rank adaptation (LoRA) is one of the most popular task-specific parameter-efficient fine-tuning (PEFT) methods on pre-trained language models, thanks to its good performance and computational efficiency. LoRA injects a product of two trainable rank-decomposition matrices on top of each frozen pre-trained model module. However, when applied in the setting of privacy-preserving federated learning (FL), LoRA may become unstable for the following reasons: 1) the effects of data heterogeneity and multi-step local updates are non-negligible, 2) additive noise applied to gradient updates to guarantee differential privacy (DP) can be amplified, and 3) the final performance is sensitive to hyper-parameters. A key factor behind these phenomena is the discordance between jointly optimizing the two low-rank matrices on local clients and aggregating them separately on the central server. Thus, this paper proposes an efficient and effective version of LoRA, Federated Freeze-A LoRA (FFA-LoRA), to alleviate these challenges and further halve the communication cost of federated fine-tuning of LLMs. The core idea of FFA-LoRA is to fix the randomly initialized non-zero matrices and only fine-tune the zero-initialized matrices. Compared to LoRA, FFA-LoRA is motivated by both practical and theoretical benefits in privacy-preserving FL. Our experiments demonstrate that FFA-LoRA provides more consistent performance with better computational efficiency than vanilla LoRA in various FL tasks.

1 INTRODUCTION

Recent years have witnessed tremendous success in the development of large language models (LLMs) (Touvron et al., 2023; OpenAI, 2023; Zhang et al., 2022a; Zeng et al., 2022). The applications of LLMs range from versatile chatbots for different writing tasks (OpenAI) to multi-modal systems (Driess et al., 2023; Wu et al., 2023; Bommasani et al., 2021). Besides commercialized products based on general-purpose LLMs, people can also build customized LLMs by using their task-specific data to fine-tune pre-trained LLMs (Howard & Ruder, 2018). Since modern LLMs usually contain billions of parameters, fine-tuning all parameters has prohibitively high computational costs. As a remedy, parameter-efficient fine-tuning (PEFT) approaches (Ding et al., 2023), such as Low-Rank Adaptation (LoRA) (Hu et al., 2021), have been developed and widely adopted in many downstream tasks. PEFT methods freeze the majority of parameters in pre-trained LLMs and update only a small subset of parameters. Compared to full-model fine-tuning, these approaches usually offer on-par or even better performance while significantly improving computational efficiency. In this paper, we focus on LoRA for its good performance and versatility across a wide spectrum of tasks. However, LoRA still requires sufficient training data to achieve significant improvement over the raw model. Data-limited parties can unite with others and adopt federated learning (FL) (Li et al., 2020) as the computation framework to fine-tune the model collaboratively.
The parameter-efficient nature of LoRA is welcomed in FL due to its low communication costs and relatively low local computational burden. Furthermore, if the data parties in FL (usually known as clients) want to provably prevent local data from leaking through the information they share in FL, differential privacy (DP) (Dwork et al., 2006) techniques can be further employed to provide privacy guarantees. While there are many existing research results exploring (privacy-preserving) PEFT in the central setting, the exploration of how to conduct (privacy-preserving) LoRA in the FL setting is still premature. Directly migrating LoRA from the central setting and combining it with FedAvg may not achieve the best performance, since other sources of interference in the (privacy-preserving) FL setting, such as noisy gradients and the non-iid distribution of data in the cross-silo setting, can play important roles in the optimization process. In real-world LLM applications with privacy concerns, such as federated fine-tuning (Babakniya et al., 2023) or fine-tuning under differential privacy guarantees (Li et al., 2022), the performance of LoRA often deteriorates.

Contributions. In this paper, we identify three discordances in applying LoRA in the privacy-preserving FL setting. The first is a mismatched term brought by the joint local updates and separate global aggregations of the two sets of low-rank matrices of LoRA. The second discordance is that, if we employ DP-SGD as the differentially private optimizer for training, the injected noise can be amplified by the locally semi-quadratic nature of LoRA. Lastly, the choice of one hyper-parameter of LoRA, the scaling factor α, can significantly affect the convergence and performance of the final model, regardless of whether DP is enforced. To resolve these discordances, we propose our solution named Federated Freeze-A LoRA (FFA-LoRA). FFA-LoRA freezes the non-zero-initialized low-rank matrices and only performs updates and aggregation on the zero-initialized matrices, which contain only half as many parameters as LoRA. Besides FFA-LoRA's obvious effect of saving half of the communication and computational cost in FL, we also provide intuitions on why it can alleviate the three aforementioned discordances. We conduct comprehensive experiments to demonstrate the advantages of FFA-LoRA over LoRA in privacy-preserving FL, across different tasks, hyper-parameters and privacy protection levels. We summarize our contributions as follows:

- We explore the conditions in privacy-preserving FL that are discordant with LoRA, and provide explanations of the potential reasons for this performance degradation.
- We propose a new method, FFA-LoRA, which tailors LoRA to improve its performance under these undesirable but unavoidable conditions in privacy-preserving FL.
- We conduct extensive experiments to verify that FFA-LoRA consistently outperforms LoRA.

2 BACKGROUND AND RELATED WORKS

Parameter-efficient fine-tuning. The ever-increasing size of LLMs makes them prohibitively expensive, if possible at all, to fine-tune directly. To mitigate this problem, parameter-efficient fine-tuning (PEFT) methods have been proposed. These methods introduce a small number of additional trainable parameters Θ to improve model performance while keeping most of the pre-trained parameters Φ frozen.
The task-specific increment ΔΦ is then encoded by Θ with much smaller dimensions. Houlsby et al. (2019) added additional trainable neural modules named adapters to each layer of the network. Alternatively, prefix-tuning (Li & Liang, 2021) and prompt-tuning (Lester et al., 2021) modify the network by concatenating additional trainable dimensions to the input or hidden layers of the network. Another series of works (Hu et al., 2021; Yu et al., 2021b) proposed LoRA and RGP, using low-rank matrices to approximate or re-parameterize the pre-trained weight matrices. LoRA is arguably the most popular approach among PEFT methods: it requires tuning less than 1% of the parameters of the full fine-tuning approach but achieves comparable performance in a wide range of downstream tasks. There are also works (He et al., 2021; Chavan et al., 2023) that seek to provide a generalized method unifying these PEFT methods.

Federated fine-tuning with LLMs. Although fine-tuned LLMs can become backbones for applications in different areas, the fine-tuning process still favors large-scale, domain-specific data. However, such domain-specific data is typically possessed by multiple parties, with each party's dataset containing inadequate data to fine-tune models by itself. Furthermore, these parties are often prohibited from sharing such data directly with other entities. A common solution to this dilemma is federated learning (Kairouz et al., 2021), which allows a set of agents to fine-tune LLMs efficiently by sharing their local model updates without explicitly sharing their respective data. Tian et al. (2022) proposed FedBERT and performed federated pre-training on the BERT model. Different from traditional machine learning models, an LLM's tremendous model size can consume a significant amount of resources for cross-party communication and require immense computation resources for local training. Many research solutions rely on the combination of PEFT with FL. There have been multiple studies of PEFT in FL in recent years; Zhang et al. (2022b) considers PEFT in the federated setting. Recently, Kuang et al. (2023) proposed FS-LLM, a framework for federated fine-tuning of LLMs. It has been pointed out that data heterogeneity in FL is a challenge for PEFT algorithms (Kairouz et al., 2021; Babakniya et al., 2023).

PEFT with differential privacy. Although LLMs are powerful tools and offer great performance thanks to their ability to extract rich features with the transformer structure and a large number of parameters, it is also well known that LLMs with a large number of parameters can leak critical information contained in the training dataset (Carlini et al., 2021; Huang et al., 2022). A popular privacy notion that can provide theoretical guarantees against training data leakage from the model is differential privacy (DP) (Dwork et al., 2006).

Definition 1 ((ε, δ)-DP). A randomized algorithm $\mathcal{A}$ is (ε, δ)-differentially private if for any two neighboring datasets $D$ and $D'$, which differ in exactly a single record, and for all subsets $S \subseteq \mathcal{O}$ of possible outputs of $\mathcal{A}$:
$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$

Intuitively, DP ensures that any single record cannot significantly affect the distribution of the output. With such indistinguishable output distributions, any adversary can only gain limited additional knowledge about whether a specific record is in the input data.
The level of privacy is characterized by the privacy parameters (ε, δ); a smaller choice of (ε, δ) means a stronger privacy protection guarantee.

Machine learning with DP: DP-SGD. A classic mechanism used to make the published model differentially private is DP-SGD (Song et al., 2013; Abadi et al., 2016; Bassily et al., 2014). It requires a DP optimizer to privatize gradients before using them to update the model. Compared with the vanilla stochastic gradient descent (SGD) algorithm, DP-SGD has two additional operations in each iteration. It first clips per-sample gradients with a norm constraint $C$ to limit the maximum influence of any sample. Then, it adds Gaussian noise $z \sim \mathcal{N}(0, C^2\sigma^2 I_p)$ to the sum of clipped gradients in a batch $B$. Namely,
$$\tilde{g} = \Big(\sum_{i \in B} \mathrm{Clip}(\nabla f_i, C) + z\Big) / |B|.$$
Finally, this noisy average of clipped gradients $\tilde{g}$ is used to update the model. The scalar $\sigma$ is decided by privacy composition rules (Abadi et al., 2016) given the privacy parameters ε, δ, the total number of iterations $T$ and the sampling rate $q = |B|/N$, where $N$ is the total number of samples in the training set. In the central setting, where a single trainer possesses all data, existing studies on fine-tuning LLMs with DP guarantees mainly adopt DP-SGD as the optimization algorithm. Yu et al. (2021a) studied the effect of parameter-efficient algorithms in private fine-tuning. Li et al. (2021a; 2022) found that although the number of trainable parameters is significantly reduced for PEFT, the performance of private fine-tuning is not significantly better, which might be contrary to traditional beliefs (Bassily et al., 2014).

Different DP settings in FL. Generally, there are two levels of differential privacy protection in federated learning, depending on whether the federated aggregation server is trusted by the clients or not. The first setting assumes that the server is trusted and the model updates are shared with the server without privacy concerns; the privacy guarantee is on the final output model and is achieved by randomizing the globally aggregated update on the server side (McMahan et al., 2017b). A stronger privacy setting is to forgo the trustworthy-server assumption and ensure the shared update from each client is already differentially private (Li et al., 2021b; Wu et al., 2020; Qu et al., 2021). In this paper, we adopt the stronger privacy setting, ensuring that any shared information (i.e., updates of model parameters) from local clients to the server satisfies DP. By DP's properties, including parallel composition, sequential composition and resistance to post-processing (Dwork et al., 2006; Abadi et al., 2016; Li et al., 2021b), the final model automatically satisfies DP globally.

3 LORA IN PRIVACY-PRESERVING FL

In this paper, we focus on LoRA, one of the most promising PEFT methods in the central setting; LoRA has also been shown to exhibit better performance than other PEFT methods in the federated setting (Kuang et al., 2023). The core idea of LoRA is to constrain the weight update of the model to a low-rank decomposition,
$$W_0 + \Delta W = W_0 + BA. \tag{1}$$
Instead of training the entire weight matrix $W_0 \in \mathbb{R}^{d \times k}$ composing $\Phi$, the updates are performed on $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ composing $\Theta$. With $r \ll \min(d, k)$, the number of trainable parameters $|\Theta|$ is reduced by a factor of $O(r/\min(d, k))$ compared to full fine-tuning with size $|\Phi|$.
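To make the DP-SGD update defined earlier in this section concrete, the following is a minimal numpy sketch of the privatized gradient $\tilde{g}$ (clip per-sample gradients, add Gaussian noise, average). The function name, array shapes and seed are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def dp_sgd_noisy_grad(per_sample_grads, clip_norm, sigma, rng):
    """Privatize a batch of per-sample gradients as in DP-SGD.

    per_sample_grads: array of shape (batch_size, p), one flattened gradient per sample.
    clip_norm:        the norm bound C.
    sigma:            noise multiplier; the Gaussian std is C * sigma.
    """
    batch_size, p = per_sample_grads.shape
    # Clip each sample's gradient to L2 norm at most C.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Add Gaussian noise z ~ N(0, C^2 sigma^2 I_p) to the sum, then average over the batch.
    z = rng.normal(0.0, clip_norm * sigma, size=p)
    return (clipped.sum(axis=0) + z) / batch_size

# Toy usage: 8 samples, 10 parameters.
rng = np.random.default_rng(0)
g = dp_sgd_noisy_grad(rng.normal(size=(8, 10)), clip_norm=1.0, sigma=0.99, rng=rng)
```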
To recover the performance of the raw model at the start of training and keep the weights trainable through back-propagation, $A$ uses a random Gaussian initialization while $B$ is set to zero. The product matrix is additionally scaled by a factor $\alpha/r$. The scaling factor $\alpha$ also influences the performance of LoRA and needs to be tuned.

Discordance 1: Data heterogeneity and model-averaging introduce interference to LoRA. The performance of vanilla LoRA is negatively affected in cross-silo FL tasks with data heterogeneity (Babakniya et al., 2023). Notice that the loss for back-propagation is computed on the composition of the raw model parameters and the product of $A$ and $B$ (as in Equation 1), and LoRA optimizes $A$ and $B$ jointly on the client side. This implies that the problem is approximately optimized as a locally semi-quadratic problem (supposing the model is locally linear when the learning rate is small). However, when the server performs aggregation, $A$ and $B$ are averaged separately following vanilla FedAvg (McMahan et al., 2017a). The product of the averaged $A$ and $B$ involves additional terms that may benefit neither the optimization of the clients' local losses nor the FL global loss. For example, consider an FL task involving two clients with datasets of the same size. If the clients locally fine-tune all parameters and the server aggregates with FedAvg, the new model parameters can be represented as
$$W^+ = \frac{1}{2}(W_1 + W_2) = W_0 + \frac{1}{2}(\Delta W_1 + \Delta W_2), \quad \text{where } W_i = W_0 + \Delta W_i, \; i = 1, 2. \tag{2}$$
An implicit assumption ensuring the global convergence of FL algorithms is $\Delta W_{\text{global}} \approx \frac{1}{2}(\Delta W_1 + \Delta W_2)$, where $\Delta W_{\text{global}}$ is the update assuming the server can access all clients' datasets directly. When the clients use LoRA locally, we can also consider $\Delta W_i \approx B_i A_i$. However, after using FedAvg to aggregate the trainable low-rank matrices, the server produces
$$\widehat{W}^+ = \underbrace{W_0 + \frac{1}{2}(B_1 + B_2)\cdot\frac{1}{2}(A_1 + A_2)}_{\text{Parameters after aggregation with LoRA + FedAvg}} \;\neq\; \underbrace{W_0 + \frac{1}{2}(B_1 A_1 + B_2 A_2) = W^+}_{\text{Ideal parameters following model-averaging}} \tag{3}$$
Thus, it is possible for two clients in FL to converge to two different combinations of adaptation matrices $B_i, A_i$, yet when an aggregation such as FedAvg is applied on the server, a linear combination does not necessarily provide good performance for the specific task. The difference between $\frac{1}{2}(B_1 + B_2)\cdot\frac{1}{2}(A_1 + A_2)$ and $\frac{1}{2}(\Delta W_1 + \Delta W_2)$ may become more significant when i) the number of local update steps between aggregations is large and ii) the local datasets differ across clients. This echoes the client-drift phenomenon discussed by Karimireddy et al. (2020). Client-drift happens in heterogeneous FL when there is a difference between the average of the clients' local loss optima and the optimum of the global loss, i.e., $\frac{1}{n}\sum_i \Theta_i^* \neq \Theta_{\text{global}}^*$. It is caused by local gradient dissimilarity among clients and slows down convergence. Since the parameters in LoRA are locally quadratic by construction, they are more prone to client-drift than a locally linear task such as full fine-tuning.
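The gap in Equation 3 is easy to see numerically. Below is a small numpy sketch (our own illustration, with arbitrary dimensions and random stand-ins for the clients' adapters) comparing FedAvg applied to the factors against ideal model-averaging of the products; the relative difference printed at the end is generally far from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4

# Two clients' locally fine-tuned LoRA factors (toy stand-ins for B_i, A_i).
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))

# Ideal model-averaging of the weight updates: mean of the products B_i A_i.
ideal = 0.5 * (B1 @ A1 + B2 @ A2)

# FedAvg on the factors: product of the averaged B and averaged A.
fedavg = (0.5 * (B1 + B2)) @ (0.5 * (A1 + A2))

# The mismatch term 1/4 (B1 - B2)(A2 - A1) is generally non-zero.
print(np.linalg.norm(ideal - fedavg) / np.linalg.norm(ideal))
```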
Discordance 2: The noise from DP-SGD can be amplified by LoRA. Although LoRA and DP-SGD are the most popular methods in PEFT and privacy-preserving machine learning respectively, directly combining them may not be the optimal choice. The discordance again comes from the semi-quadratic structure of LoRA. Consider the parameters after a single DP-SGD update. Even if no norm clipping operation is triggered, the parameters are updated as
$$W_0 + (B + \xi_B)(A + \xi_A) = W_0 + BA + \xi_B A + B \xi_A + \xi_B \xi_A,$$
where $\xi_A$ and $\xi_B$ consist of the Gaussian noise from DP-SGD. Three terms contain noise, and the third term, $\xi_B \xi_A$, no longer follows a Gaussian distribution. This shows that noise cascades through the multiplication in LoRA, introducing additional difficulties for convergence in fine-tuning. We provide a synthetic verification in Figure 1. In this example, $W \in \mathbb{R}^{1024 \times 1024}$ and the rank is $r = 8$. We plot the Frobenius norm of the noise matrices $\xi_B \xi_A$ and $\xi_W$ for LoRA and full fine-tuning, respectively. Due to the multiplication in LoRA, the norm of the noise scales quadratically with $\sigma$ and is significantly worse than full fine-tuning when $\sigma$ exceeds 0.5. For an FL algorithm with 1000 communication rounds, 10 local update steps and a dataset such as SST-2, using batch size $B = 200$, a DP guarantee with $\epsilon = 6$, $\delta = 10^{-5}$ requires a noise factor of $\sigma = 0.99$. In this case, LoRA produces approximately 3 times more noise than full-model fine-tuning. This could explain why LoRA does not significantly outperform full fine-tuning despite having fewer parameters, as reported in (Yu et al., 2021a; Li et al., 2022; Babakniya et al., 2023).

[Figure 1: Frobenius norm of the noise terms, $\|\xi_B \xi_A\|_F$ for LoRA and $\|\xi_W\|_F$ for full fine-tuning, within a single update as $\sigma$ varies from 0 to 1.]

Discordance 3: LoRA requires careful tuning of α. Regarding the optimal scaling factor α, empirical results (Kuang et al., 2023) have demonstrated that in many more complex tasks a larger α yields higher performance after fine-tuning, yet as α increases, the algorithm becomes more and more unstable, with much higher variance across runs. According to Zhou & Cong (2017), the convergence speed of FedAvg-like algorithms is closely related to the objective function's smoothness factor $L$. As the scaling factor α increases, the problem becomes less smooth by construction, slowing down convergence. Furthermore, as the scaling factor α increases, the impact of noise on model performance gets worse. This can be explained by the fact that as α increases, the update on $A$ becomes less significant compared to $A_0$. Since $B$ is initialized at 0, the gradient information in $B$ becomes more important in comparison. Yet the gradient clipping and privacy engine treat the updates on $A$ and $B$ equally. Due to this imbalanced distribution of information in the gradients, the algorithm suffers from either excessive information loss or excessive noise: a trade-off between increasing α for better performance and decreasing α to prevent noise-induced performance deterioration. While searching for a good hyper-parameter α is important, hyper-parameter optimization (HPO) is usually costly (Khodak et al., 2021). Adding α to HPO means extra communication and computation costs proportional to the size of the search space of α.
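As a rough illustration of the noise amplification behind Figure 1 (Discordance 2), the sketch below compares the Frobenius norms of the LoRA cross term $\xi_B \xi_A$ and the full fine-tuning noise $\xi_W$ for a $1024 \times 1024$ weight with rank 8. The per-entry noise scale $C\sigma$ used for every matrix is our simplifying assumption, so the exact crossover point will not match the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 1024   # weight shape used in the paper's synthetic example
r = 8          # LoRA rank
C = 1.0        # clipping threshold (placeholder value)

for sigma in (0.1, 0.25, 0.5, 1.0):
    # Gaussian DP noise added to the gradient of B, of A, and of the full W.
    xi_B = rng.normal(0.0, C * sigma, size=(d, r))
    xi_A = rng.normal(0.0, C * sigma, size=(r, k))
    xi_W = rng.normal(0.0, C * sigma, size=(d, k))
    # LoRA's cross term xi_B xi_A grows ~ sigma^2, full fine-tuning's xi_W grows ~ sigma.
    print(f"sigma={sigma:.2f}  ||xi_B xi_A||_F={np.linalg.norm(xi_B @ xi_A):10.1f}"
          f"  ||xi_W||_F={np.linalg.norm(xi_W):10.1f}")
```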
4 A SIMPLE RECIPE: FFA-LORA

In the previous section, we discussed the discordances between LoRA and privacy-preserving FL. Motivated by theory, we propose a simple modification to LoRA: Federated Freeze-A Low-Rank Adapters, or FFA-LoRA for short. FFA-LoRA modifies the training process of LoRA by fixing matrix $A$ after initialization. That is, for a weight matrix $W \in \mathbb{R}^{d \times k}$, we consider the model update to be projected to a low-rank matrix such that
$$W = W_0 + \Delta W = W_0 + B A_0, \quad \text{with } B \in \mathbb{R}^{d \times r}, \; A_0 \in \mathbb{R}^{r \times k}.$$
$W_0$ is initialized as the pre-trained weight, and $A_0$ follows a random Gaussian initialization. Following vanilla LoRA, we start with $B_0 = 0$ so that the pre-trained model is recovered at the start of fine-tuning. The key difference is that we consider only $B$ trainable and keep both $W_0$ and $A_0$ frozen. We note that our approach is somewhat similar to the works on the intrinsic dimension of deep models by Li et al. (2018) and Aghajanyan et al. (2020); however, those works emphasize the existence of low intrinsic dimensions in deep models and the resulting generalization properties. We summarize the advantages of FFA-LoRA as follows.

FFA-LoRA has no extra interference from data heterogeneity and model-averaging. We reconsider the federated aggregation example in Section 3. In the heterogeneous setting, each client will generate a different $\Delta W_i$. Since $\Delta W_i \approx B_i A_0$ in FFA-LoRA, the update is more compatible with FedAvg and DP-SGD than LoRA. Analogous to Equation 3, we write the aggregation step of a two-client system for FFA-LoRA:
$$\widehat{W}^+ = W_0 + \frac{1}{2}(B_1 + B_2) A_0 = W_0 + \frac{1}{2}(B_1 A_0 + B_2 A_0) = W^+. \tag{4}$$
Unlike LoRA in Equation 3, FFA-LoRA does not have the aggregation error term caused by the low-rank adaptation.

FFA-LoRA works better with noise from DP-SGD. Because FFA-LoRA no longer has the locally semi-quadratic structure of LoRA, the DP noise is not amplified. When no norm clipping operation is triggered, the parameters are updated as
$$W_0 + (B + \xi_B) A_0 = W_0 + B A_0 + \xi_B A_0.$$
This is because the trainable parameters are only the zero-initialized matrices $B$. The noise introduced by DP-SGD appears only in the term $\xi_B A_0$, without the cross term $\xi_B \xi_A$, making FFA-LoRA less susceptible to noise than LoRA. In addition, from an analytical perspective, if the model is Lipschitz smooth with respect to $W$, similar smoothness can be obtained for FFA-LoRA, but not for LoRA. We state our formal theorem and proof on the smoothness conditions of the two algorithms in the appendix. Convergence properties similar to (Zhou & Cong, 2017) can be derived from this theorem.

FFA-LoRA does not rely on α, and is equivalent to LoRA with α = ∞. The reliance of LoRA on α in some tasks, discussed above, is circumvented in FFA-LoRA. We can view the set of trainable parameters Θ as a dynamical system, a time-dependent series $\{\Theta_t\}_{t \in [T]}$ generated by the FFA-LoRA algorithm. We present the following theorem to illustrate the connection between α and η in FFA-LoRA.

Theorem 1. For local updates with the same initial condition on $W$, the vanilla LoRA update with scaling factor $\alpha_{LoRA}$ produces the trajectory $\{W^k_{\alpha_{LoRA}}\}_{k \in [K]}$, and FFA-LoRA with scaling $\alpha_{FFA}$ produces the trajectory $\{W^k_{\alpha_{FFA}}\}_{k \in [K]}$. Then we have
$$\lim_{\alpha_{LoRA} \to \infty} W^k_{\alpha_{LoRA}} = W^k_{\alpha_{FFA}}, \quad \text{for all } k, \alpha_{FFA}. \tag{5}$$
We refer to the appendix for the proof of Thm. 1. It is evident that introducing the scaling factor when fine-tuning with FFA-LoRA is unnecessary. However, the same does not apply to LoRA. In LoRA, both $A$ and $B$ are trainable by construction, but $A_0$ is initialized from a Gaussian distribution, away from 0. For LoRA, when α is different, the initialization point is different; if we want two selections of α to have the same performance, we also need to change the variance of $A$'s initialization. For vanilla LoRA, as discussed in Section 3, as α increases and η decreases, the update on $A$ becomes less significant compared to $A_0$. As α → ∞, there is almost no change to be made on $A$, i.e., $A \approx A_0$. Yet the update on $B$ is just as significant, making the dynamics of LoRA infinitely close to FFA-LoRA as α approaches infinity.
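As a concrete reference, here is a minimal PyTorch sketch of an FFA-LoRA adapter wrapped around a frozen linear layer. The class and attribute names are ours, not from the paper's code; it simply instantiates the construction $W_0 + B A_0$ with $A_0$ frozen after a Gaussian initialization and $B$ zero-initialized, and the $1/\sqrt{k}$ scale of $A_0$ is an illustrative assumption.

```python
import math
import torch
import torch.nn as nn

class FFALoRALinear(nn.Module):
    """Frozen pre-trained linear layer with an FFA-LoRA adapter: W0 + B @ A0."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W0 (and bias)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # A0: random Gaussian init, frozen after initialization (the "FFA" part).
        self.A0 = nn.Parameter(torch.randn(r, k) / math.sqrt(k), requires_grad=False)
        # B: zero init, the only trainable (and communicated) adapter matrix.
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scaling = alpha / r                  # kept for parity with LoRA; see Thm. 1

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A0.T) @ self.B.T
```

In an FL round only `B`, half of LoRA's adapter parameters, needs to be communicated; averaging the clients' `B` matrices realizes Equation 4 exactly because all clients share the same frozen $A_0$.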
FFA-LoRA saves computation and communication. Since $A_0$ is fixed after initialization in FFA-LoRA, the total number of trainable parameters in the model is effectively halved compared to LoRA. This leads to the most straightforward advantage in computation and communication efficiency. Meanwhile, since FFA-LoRA attains the same performance for a wide range of α as long as the learning rate is scaled accordingly, we can fix α and only search over other hyper-parameters such as the learning rate in HPO. We note that, subsequent to the submission of this paper, multiple new studies (Zhang et al., 2023; Zhu et al., 2024; Hao et al., 2024) have also considered similar approaches. While this paper distinctly considers the federated and privacy-related properties, these succeeding papers can serve as verification of the effectiveness of FFA-LoRA. Another intuitive approach to the problem at hand is to alternately update the two LoRA weights. While this update scheme exhibits similar properties, it is empirically shown to be slow to converge. In general, not only does FFA-LoRA provide higher efficiency than LoRA, it also preserves all the benefits of LoRA while avoiding the shortcomings of LoRA discussed in Section 3.

5 EXPERIMENTS

In this section, we evaluate and compare the performance of FFA-LoRA with LoRA on two LMs, RoBERTa (Liu et al., 2019) and LLaMA (Touvron et al., 2023). We show that our approach consistently performs better across different types of tasks. We first evaluate the language understanding tasks from the GLUE benchmark (Wang et al., 2018), including MNLI, SST-2, QNLI and QQP, using the RoBERTa model. For language generation tasks, we use the LLaMA model with the experiment settings provided by (Kuang et al., 2023) as the benchmark and use the GSM-8K dataset for evaluation. All experiments were run on NVIDIA Tesla A100 GPUs with half-precision enabled for efficiency. Our experiments are organized as follows: we provide the overall performance comparison of FFA-LoRA and LoRA in Section 5.1 (Tables 1 and 3); questions regarding the critical factors of convergence are answered in Section 5.2 (Tables 2, 4 and 7); and the evaluation of language generation tasks is provided in Section 5.3. We note that our results do not exactly match the centralized PEFT results presented in (Hu et al., 2021) and (Yu et al., 2021a) due to the additional federated communication/aggregation and data heterogeneity in our setup. Our experiments with LoRA are able to match LoRA's performance reported in (Hu et al., 2021) in the centralized setting.

5.1 PERFORMANCE OF FFA-LORA AND LORA IN LANGUAGE UNDERSTANDING TASKS

Our experiments on language understanding tasks are based on RoBERTa-Large (355M) (Liu et al., 2019), a popular choice that has been widely adopted in many research studies for its robustness and versatility. We start from a pre-trained model available from the Hugging Face library. All our experiments with LoRA and FFA-LoRA are run in a 3-client cross-silo federated setting. Data are randomly split among the clients according to certain proportions to ensure strong data heterogeneity.
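For illustration, the sketch below shows one way such label-skewed client splits can be generated. The exact proportions used in the paper are listed in the next paragraph; the splitting mechanics here (dividing each class across clients by fixed fractions) and the toy numbers are our own assumptions, not the paper's released code.

```python
import numpy as np

def split_by_label_proportions(labels, proportions, rng):
    """Assign sample indices to clients so that each class is divided
    among clients according to the given fractions.

    labels:      1-D array of integer class labels.
    proportions: array of shape (num_classes, num_clients); each row sums to 1
                 and gives how that class is split across clients.
    """
    labels = np.asarray(labels)
    num_clients = proportions.shape[1]
    client_indices = [[] for _ in range(num_clients)]
    for c in range(proportions.shape[0]):
        idx = rng.permutation(np.where(labels == c)[0])
        # Cut points for this class according to its row of proportions.
        cuts = (np.cumsum(proportions[c])[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy usage: a binary task split [0.1, 0.9] / [0.9, 0.1] over 2 clients.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
parts = split_by_label_proportions(labels, np.array([[0.1, 0.9], [0.9, 0.1]]), rng)
print([len(p) for p in parts])
```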
For the heterogeneous setting, we split the data based on their labels: we use the [0.1, 0.9], [0.9, 0.1], [0.5, 0.5] splits for binary classification tasks and [0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9] for three-class classification tasks. To make a fair comparison, we keep the batch size B = 200, the total number of communication rounds at 1000 and the number of local update steps at 10, the same across all experiments. All experiments use the same SGD (DP-SGD for the experiments with privacy guarantees) optimizer, and all transformer-related hyper-parameters, such as the sequence length $l_{seq} = 128$, are kept consistent with previous studies (Hu et al., 2021). The classification head of the LM is frozen after initialization, we add adapters to both the attention layers and the feed-forward layers, and we choose a scaling factor α = 8 for LoRA. The same scaling factor α is applied to FFA-LoRA for the sake of consistency, although it is not needed, as stated in Section 4.

Experiments with differential privacy guarantees. We report the best result from a set of experiments run with learning rate η ∈ {0.01, 0.02, 0.05, 0.1} for LoRA and η ∈ {0.1, 0.2, 0.5, 1} for FFA-LoRA. The batch size and total number of update steps are kept the same across tasks. We fix the rank r = 8 for both algorithms. In terms of privacy parameters, we use $\delta = 10^{-5}$ and three choices of privacy budget ε ∈ {6, 3, 1}. Given the sampling rate, the total number of steps and the privacy requirement (ε, δ), we use the privacy accountant from Opacus (Yousefpour et al., 2021) to calculate the noise scale σ for all our experiments. The optimal clipping threshold is determined from a grid search over C ∈ {2, 5, 10}. The results are presented in Table 1. For the technical analysis ensuring that the privacy guarantees are met in our experiments, we refer to Section A.5 in the appendix. The introduction of DP significantly degrades performance on every task for both FFA-LoRA and LoRA, yet FFA-LoRA offers better performance both with and without privacy. We note that the biggest performance gap occurs on the MNLI task, a three-class classification task with the strongest level of data heterogeneity across agents. This performance gap demonstrates that FFA-LoRA is more suitable for tasks where heterogeneity is strong.

| Priv. Budget | Method | MNLI (matched) | MNLI (mismatched) | SST-2 | QQP | QNLI |
|---|---|---|---|---|---|---|
| Non-Private | LoRA | 82.03 ± 10.7 | 82.50 ± 10.9 | 94.32 ± 2.1 | 83.51 ± 3.3 | 88.95 ± 6.7 |
| Non-Private | FFA-LoRA | 85.05 ± 1.1 | 85.62 ± 1.0 | 94.32 ± 1.7 | 84.35 ± 0.6 | 90.35 ± 1.9 |
| ε = 6 | LoRA | 39.46 ± 14.3 | 39.69 ± 14.8 | 93.70 ± 0.5 | 82.11 ± 1.0 | 84.99 ± 1.1 |
| ε = 6 | FFA-LoRA | 78.81 ± 0.8 | 80.00 ± 0.7 | 93.73 ± 0.3 | 83.31 ± 0.4 | 87.27 ± 1.0 |
| ε = 3 | LoRA | 35.82 ± 8.9 | 35.85 ± 9.1 | 93.32 ± 0.5 | 82.08 ± 0.7 | 83.94 ± 0.6 |
| ε = 3 | FFA-LoRA | 77.42 ± 0.8 | 78.69 ± 0.8 | 93.59 ± 0.3 | 83.03 ± 0.4 | 86.18 ± 1.7 |
| ε = 1 | LoRA | 33.80 ± 1.6 | 33.80 ± 1.5 | 92.14 ± 0.6 | 81.28 ± 0.7 | 78.93 ± 6.8 |
| ε = 1 | FFA-LoRA | 75.05 ± 1.3 | 76.50 ± 1.3 | 92.46 ± 0.5 | 82.50 ± 0.4 | 81.53 ± 1.4 |

Table 1: Experiments of FFA-LoRA and LoRA with differential privacy guarantees; accuracy (%) evaluated across 20 runs, reported as mean ± standard deviation.

5.2 ABLATION STUDY

Although FFA-LoRA is shown to be effective in federated settings and with privacy guarantees, previous works have also studied the impact of the other hyper-parameters of the LoRA algorithm. To provide a more comprehensive evaluation of the three discordances discussed in Section 3, we still need to answer the following questions:

- How does data heterogeneity affect the performance of FFA-LoRA and LoRA?
- What is the impact of the adapter parameter budget ($r$) for FFA-LoRA, and what is the relationship between the adapter parameter budget ($r$) and the privacy budget (ε) of DP-SGD?
- How do FFA-LoRA and LoRA behave when we choose different α for scaling?
- How does a different initialization of $A$ affect performance?

We answer the questions above with the following experiments.

How does data heterogeneity affect the performance of FFA-LoRA and LoRA? Our discussion in Section 3 stated that LoRA is not compatible with FedAvg when there is strong heterogeneity among clients. For verification, we consider the four tasks with both homogeneous and heterogeneous data, and provide the experiment results below. The severe heterogeneity case corresponds to the data distribution described in Section 5.1, while in the mild heterogeneity configuration the data are split with [0.15, 0.85], [0.85, 0.15], [0.5, 0.5] and [0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6], respectively.

| Data Dist. | Method | MNLI (matched) | MNLI (mismatched) | SST-2 | QQP | QNLI |
|---|---|---|---|---|---|---|
| i.i.d. | LoRA | 86.90 | 87.15 | 94.42 | 84.47 | 91.38 |
| i.i.d. | FFA-LoRA | 87.13 | 87.21 | 95.14 | 86.31 | 92.64 |
| mild het. | LoRA | 87.01 | 87.33 | 93.55 | 84.41 | 91.36 |
| mild het. | FFA-LoRA | 87.04 | 87.36 | 94.10 | 85.33 | 91.62 |
| severe het. | LoRA | 82.03 | 82.50 | 94.32 | 83.51 | 88.95 |
| severe het. | FFA-LoRA | 85.05 | 85.62 | 94.32 | 84.35 | 90.35 |

Table 2: Prediction accuracy (%) comparison between i.i.d. and non-i.i.d. data distributions.

It is evident that FFA-LoRA behaves better than LoRA in both the i.i.d. and non-i.i.d. settings, although the performance of the two methods is similar in this privacy-free setting.

What is the impact of the adapter parameter budget (r) for FFA-LoRA, and what is the relationship between the adapter parameter budget (r) and the privacy budget (ε) of DP-SGD? We first evaluate the performance of FFA-LoRA and LoRA without considering privacy; we use the mild heterogeneity data distribution and keep the batch size and total number of update steps the same across tasks. We experiment with rank r ∈ {2, 4, 8, 16} on the four tasks and report the best accuracy. The results are shown in Table 3. From the subspace similarity discussions in LoRA, we note that increasing the rank does not necessarily increase the information from the gradients; similar observations can be found in our experiments. Based on the results, we can see that FFA-LoRA performs better in the majority of tasks, regardless of the number of trainable parameters. In fact, due to the reduction of trainable parameters in FFA-LoRA, we should compare FFA-LoRA and LoRA under the same parameter budget (i.e., compare FFA-LoRA with r = 16 to LoRA with r = 8). In this case, the advantage of FFA-LoRA over LoRA becomes more apparent. Although there have been multiple studies on the performance of LoRA with DP, the relationship between the rank r and the privacy budget ε is unclear. We present the experiments below and compare the impact of the rank r on FFA-LoRA versus LoRA on the QNLI dataset. We use privacy budgets ε ∈ {6, 3, 1} with ranks r ∈ {2, 4, 8, 16}. The results are shown in Table 4. In our experiments, we find that as the privacy requirement gets stronger, the performance difference of LoRA across different ranks r becomes more and more apparent, yet FFA-LoRA is still able to deliver relatively stable performance over a wide range of rank selections.

How do FFA-LoRA and LoRA behave when we choose different α for scaling? As mentioned previously, LoRA requires a good scaling factor α in order to achieve good performance.
| Method | # of params (million) | MNLI (matched) | MNLI (mismatched) | SST-2 | QQP | QNLI |
|---|---|---|---|---|---|---|
| LoRA (rank 16) | 3.15 (0.877%) | 87.43 | 87.47 | 93.98 | 84.79 | 91.92 |
| LoRA (rank 8) | 1.57 (0.440%) | 87.01 | 87.33 | 93.55 | 84.41 | 91.36 |
| LoRA (rank 4) | 0.79 (0.220%) | 86.07 | 86.41 | 93.89 | 83.71 | 91.51 |
| LoRA (rank 2) | 0.39 (0.110%) | 85.83 | 86.52 | 93.58 | 83.00 | 91.76 |
| FFA-LoRA (rank 16) | 1.57 (0.440%) | 85.82 | 86.38 | 95.30 | 84.89 | 91.65 |
| FFA-LoRA (rank 8) | 0.79 (0.220%) | 87.04 | 87.36 | 94.10 | 85.33 | 91.62 |
| FFA-LoRA (rank 4) | 0.39 (0.110%) | 85.61 | 86.11 | 94.47 | 84.64 | 91.38 |
| FFA-LoRA (rank 2) | 0.20 (0.055%) | 84.89 | 85.75 | 94.18 | 84.92 | 90.98 |

Table 3: Prediction accuracy (%) comparison of FFA-LoRA and LoRA with different ranks.

| Privacy budget | Method | r = 16 | r = 8 | r = 4 | r = 2 |
|---|---|---|---|---|---|
| Non-Private | LoRA | 91.92 | 91.36 | 91.51 | 91.76 |
| Non-Private | FFA-LoRA | 91.65 | 91.62 | 91.38 | 88.56 |
| ε = 6 | LoRA | 86.87 | 86.45 | 85.24 | 83.54 |
| ε = 6 | FFA-LoRA | 87.33 | 87.57 | 86.74 | 86.31 |
| ε = 3 | LoRA | 86.23 | 86.05 | 85.35 | 85.57 |
| ε = 3 | FFA-LoRA | 86.36 | 86.98 | 86.22 | 85.08 |
| ε = 1 | LoRA | 80.54 | 81.45 | 58.30 | 58.15 |
| ε = 1 | FFA-LoRA | 81.87 | 83.01 | 82.06 | 82.64 |

Table 4: Prediction accuracy (%) of FFA-LoRA and LoRA across privacy and parameter budgets.

It has been shown in the proof of Thm. 1 that the scaling factor does not affect the overall performance of the algorithm. We conducted experiments with a selection of different α, and refer to A.6 in the appendix for the details and discussion.

How does a different initialization of A affect performance? Since our proposed FFA-LoRA keeps A fixed throughout the fine-tuning process, a natural question concerns the initialization of A. We provide a discussion in Appendix A.8.

5.3 EXTENDING BEYOND LANGUAGE CLASSIFICATION

We next consider the task of Natural Language Generation (NLG) with LLaMA-7B, a more sophisticated model with significantly more parameters. Our method achieves an accuracy of 17.12% on GSM-8K, significantly better than the best performance of LoRA at 15.68% (15.31% reported in (Kuang et al., 2023)). To the best of our knowledge, it is also the best result for fine-tuning LLaMA on GSM-8K. As an additional dataset for a computer vision task, we use a pre-trained vision transformer (Dosovitskiy et al., 2020) and consider fine-tuning on the Food-101 dataset (Bossard et al., 2014). In short, the algorithms behave similarly to how they do on the language classification tasks. We report the details of these two experiments in Appendix A.3 and A.7, respectively.

6 CONCLUSION

In this paper, we discussed how to improve LoRA in the context of privacy-preserving federated learning. An in-depth analysis was provided of LoRA's deficient performance in FL and under DP guarantees. We proposed a modification to LoRA named FFA-LoRA, which is theoretically motivated, empirically verified and computationally more efficient. Beyond the scope of this paper, FFA-LoRA could motivate more interesting problems related to PEFT for future study. For instance, we provide some preliminary results in Appendix A.4 to motivate future studies on algorithms that are even more parameter-efficient for federated LLM fine-tuning; one potential future direction is alternative initialization methods for the matrices, such as orthogonal initialization. From a theoretical perspective, FFA-LoRA could be related to random kernel methods due to its pseudo-linear nature.

REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy.
In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308 318, 2016. Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. ar Xiv preprint ar Xiv:2012.13255, 2020. Sara Babakniya, Ahmed Roushdy Elkordy, Yahya H Ezzeldin, Qingfeng Liu, Kee-Bong Song, Mostafa El-Khamy, and Salman Avestimehr. Slora: Federated parameter efficient fine-tuning of language models. ar Xiv preprint ar Xiv:2308.06522, 2023. Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th annual symposium on foundations of computer science, pp. 464 473. IEEE, 2014. Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. ar Xiv preprint ar Xiv:2108.07258, 2021. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 mining discriminative components with random forests. In Computer Vision ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446 461. Springer, 2014. Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633 2650, 2021. Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. One-for-all: Generalized lora for parameter-efficient fine-tuning. ar Xiv preprint ar Xiv:2306.07967, 2023. Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220 235, 2023. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. ar Xiv preprint ar Xiv:2303.03378, 2023. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp. 265 284. Springer, 2006. Yongchang Hao, Yanshuai Cao, and Lili Mou. Flora: Low-rank adapters are secretly gradient compressors. ar Xiv preprint ar Xiv:2402.03293, 2024. Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. ar Xiv preprint ar Xiv:2110.04366, 2021. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790 2799. PMLR, 2019. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. 
ar Xiv preprint ar Xiv:1801.06146, 2018. Published as a conference paper at ICLR 2024 Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ar Xiv preprint ar Xiv:2106.09685, 2021. Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained language models leaking your personal information? In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 2038 2047, 2022. Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021. Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pp. 5132 5143. PMLR, 2020. Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina F Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weightsharing. Advances in Neural Information Processing Systems, 34:19184 19197, 2021. Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning. ar Xiv preprint ar Xiv:2309.00363, 2023. Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. ar Xiv preprint ar Xiv:2104.08691, 2021. Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. ar Xiv preprint ar Xiv:1804.08838, 2018. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50 60, 2020. Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. ar Xiv preprint ar Xiv:2101.00190, 2021. Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. ar Xiv preprint ar Xiv:2110.05679, 2021a. Xuechen Li, Daogao Liu, Tatsunori B Hashimoto, Huseyin A Inan, Janardhan Kulkarni, Yin-Tat Lee, and Abhradeep Guha Thakurta. When does differentially private learning not suffer in high dimensions? Advances in Neural Information Processing Systems, 35:28616 28630, 2022. Zitao Li, Bolin Ding, Ce Zhang, Ninghui Li, and Jingren Zhou. Federated matrix factorization with privacy guarantee. Proceedings of the VLDB Endowment, 15(4), 2021b. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017a. H Brendan Mc Mahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. ar Xiv preprint ar Xiv:1710.06963, 2017b. Open AI. Introducing chatgpt. https://openai.com/blog/chatgpt. Accessed: 2023-0921. 
Open AI. Gpt-4 technical report. ar Xiv, pp. 2303 08774, 2023. Published as a conference paper at ICLR 2024 Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, and Marc Najork. Natural language understanding with privacy-preserving bert. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 1488 1497, 2021. Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE global conference on signal and information processing, pp. 245 248. IEEE, 2013. Yuanyishu Tian, Yao Wan, Lingjuan Lyu, Dezhong Yao, Hai Jin, and Lichao Sun. Fedbert: When federated learning meets pre-training. ACM Transactions on Intelligent Systems and Technology (TIST), 13(4):1 26, 2022. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ee Lacroix, Baptiste Rozi ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ar Xiv preprint ar Xiv:2302.13971, 2023. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. ar Xiv preprint ar Xiv:1804.07461, 2018. Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. ar Xiv preprint ar Xiv:2303.04671, 2023. Nan Wu, Farhad Farokhi, David Smith, and Mohamed Ali Kaafar. The value of collaboration in convex machine learning with differential privacy. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 304 317. IEEE, 2020. Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in pytorch. ar Xiv preprint ar Xiv:2109.12298, 2021. Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private fine-tuning of language models. ar Xiv preprint ar Xiv:2110.06500, 2021a. Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. Large scale private learning via lowrank reparametrization. In International Conference on Machine Learning, pp. 12208 12218. PMLR, 2021b. Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations, 2022. Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. ar Xiv preprint ar Xiv:2308.03303, 2023. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. ar Xiv preprint ar Xiv:2205.01068, 2022a. Zhuo Zhang, Yuanhang Yang, Yong Dai, Lizhen Qu, and Zenglin Xu. When federated learning meets pre-trained language models parameter-efficient tuning methods. ar Xiv preprint ar Xiv:2212.10025, 2022b. Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. ar Xiv preprint ar Xiv:1708.01012, 2017. 
Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, Mikhail Yurochkin, and Justin Solomon. Asymmetry in low-rank adapters of foundation models. arXiv preprint arXiv:2402.16842, 2024.

A.1 SMOOTHNESS ANALYSIS

Theorem 2 (Smoothness conditions). Assume that the loss function given weight and dataset is denoted $F(W, D)$. Consider a low-rank decomposition of the model parameter $W$ such that $W(A, B) = W_0 + BA$, satisfying Equation (1). We have the following properties.

1. If $B$ is trainable, $A$ is fixed with $\|A\| \le C$, and $F(W, D)$ is Lipschitz smooth with factor $L$, then the loss function $F(W(A, B))$ is Lipschitz smooth with respect to $B$ with factor $LC^2$.
2. If both $A$ and $B$ are trainable and $F(W, D)$ is Lipschitz smooth with factor $L$, the loss function $F(W(A, B))$ has no Lipschitz smoothness guarantee.

All smoothness notions are defined with respect to the matrix Frobenius norm, denoted $\|\cdot\|$.

Proof. First we show that, given $W(A, B) = W_0 + BA$ and denoting the gradient with respect to $W$ as $\nabla_W F$, the gradient with respect to $B$ can be written as $\nabla_B F = \nabla_W F A^T$, since for any $B_1, B_2$,
$$\langle B_1 - B_2, \nabla_B F \rangle = \langle W(A, B_1) - W(A, B_2), \nabla_W F \rangle = \langle B_1 A - B_2 A, \nabla_W F \rangle = \langle B_1 - B_2, \nabla_W F A^T \rangle.$$
Similarly, we have $\nabla_A F = B^T \nabla_W F$. Using the gradients on $A$ and $B$, we prove both properties.

1. For property 1, we know that for any given $B_1, B_2$,
$$\|\nabla_B F(W(A, B_1)) - \nabla_B F(W(A, B_2))\| = \|\nabla_W F(W(A, B_1)) A^T - \nabla_W F(W(A, B_2)) A^T\| \le L\,\|W(A, B_1) - W(A, B_2)\|\,\|A\| \le L\,\|B_1 - B_2\|\,\|A\|^2.$$
2. For the second property, for ease of notation, we introduce the stacked variable $x := [A, B]$. We construct a counter-example such that the function is not Lipschitz smooth with respect to $x$. We consider $W, A, B \in \mathbb{R}^{d \times d}$, $F(W) = \frac{1}{2}\|W\|^2$ with $W_0 = 0$. Then we consider a sequence $\{x_k\}_{k \in \mathbb{N}}$ with $x_k = [A_k, B_k] = [k I_d, k I_d]$, so that
$$\lim_{k \to \infty} \frac{\|\nabla_x F(W(A_k, B_k)) - \nabla_x F(W(A_0, B_0))\|}{\|x_k - x_0\|} = \lim_{k \to \infty} \frac{\|\nabla_A F(W(A_k, B_k)) - \nabla_A F(W(A_0, B_0))\| + \|\nabla_B F(W(A_k, B_k)) - \nabla_B F(W(A_0, B_0))\|}{\|A_k - A_0\| + \|B_k - B_0\|} = \lim_{k \to \infty} \frac{\|k^3 I_d\| + \|k^3 I_d\|}{\|k I_d\| + \|k I_d\|} = \infty.$$
From the existence of the above counter-example, we can see that although $F(W)$ is 1-Lipschitz smooth, the function is not smooth with respect to $x$.
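The gradient identities $\nabla_B F = \nabla_W F A^T$ and $\nabla_A F = B^T \nabla_W F$ used in the proof above can be sanity-checked with automatic differentiation. The following is a small PyTorch sketch with an arbitrary quadratic loss; the choice of $F$, the dimensions and the seed are our own illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d, k, r = 6, 5, 3
W0 = torch.randn(d, k)
A = torch.randn(r, k)                      # frozen factor
B = torch.randn(d, r, requires_grad=True)  # trainable factor
target = torch.randn(d, k)

W = W0 + B @ A
loss = 0.5 * ((W - target) ** 2).sum()     # a smooth loss F(W)
loss.backward()

grad_W = (W - target).detach()             # closed-form grad of F w.r.t. W for this F
# Check the identity grad_B = grad_W @ A^T used in the proof.
print(torch.allclose(B.grad, grad_W @ A.T, atol=1e-5))
```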
A.2 PROOF FOR THEOREM 1

Proof. The theorem assumes the same initial condition on $W$. Since $W = W_0 + \alpha B_\alpha A_\alpha$ and the initialization of $A$ is non-zero, this condition implies that $A_\alpha = A_1 = A$ and $B_\alpha = \frac{1}{\alpha} B_1$. Now we compare the updates of the two algorithms given the same initial conditions. From Theorem 1 we know that for FFA-LoRA a different $\alpha_{FFA}$ does not affect its dynamics, so without loss of generality we consider the case $\alpha_{FFA} = 1$. The FFA-LoRA update is as follows, where the only update is on $B$ (so $A^k = A_0$ throughout):
$$W^{k+1}_{FFA} = W_0 + B^{k+1} A^k = W_0 + (B^k - \eta \nabla B^k) A^k = W^k - \eta\, \nabla B^k A^k.$$
The rest of the proof proceeds by induction: we show that the limit holds for the $(k+1)$-th local iteration given that it holds for the $k$-th. Without loss of generality, we first consider $\alpha_{LoRA} = 1$; for iteration $k$, we denote the learning rate as $\eta_1$, and denote the matrices and their gradients as $A^k_1, B^k_1$ and $\nabla A^k_1, \nabla B^k_1$, respectively. By definition, the update is
$$A^{k+1}_1 \leftarrow A^k_1 - \eta_1 \nabla A^k_1, \qquad B^{k+1}_1 \leftarrow B^k_1 - \eta_1 \nabla B^k_1,$$
and the update of the original weight matrix $W$ becomes
$$W^{k+1}_1 = W_0 + \Delta W^{k+1}_1 = W_0 + (B^k_1 - \eta_1 \nabla B^k_1)(A^k_1 - \eta_1 \nabla A^k_1) = W^k_1 - \eta_1 \big(\nabla B^k_1 A^k_1 + B^k_1 \nabla A^k_1\big) + \eta_1^2\, \nabla B^k_1 \nabla A^k_1.$$
Since LoRA does not satisfy the conditions provided in Theorem 1, changing $\alpha_{LoRA}$ will affect its updates. When we choose a different $\alpha_{LoRA} = \alpha$ and the corresponding $\eta_\alpha = \frac{\eta_1}{\alpha^2}$, we can write the update of LoRA as
$$A^{k+1}_\alpha \leftarrow A^k_\alpha - \eta_\alpha \nabla A^k_\alpha = A^k_1 - \frac{\eta_1}{\alpha}\nabla A^k_1, \qquad B^{k+1}_\alpha \leftarrow B^k_\alpha - \eta_\alpha \nabla B^k_\alpha = \frac{1}{\alpha} B^k_1 - \frac{\eta_1}{\alpha}\nabla B^k_1,$$
$$W^{k+1}_\alpha = W_0 + \alpha\Big(\frac{1}{\alpha}B^k_1 - \frac{\eta_1}{\alpha}\nabla B^k_1\Big)\Big(A^k_1 - \frac{\eta_1}{\alpha}\nabla A^k_1\Big) = W_0 + B^k_1 A^k_1 - \eta_1 \nabla B^k_1 A^k_1 - \frac{\eta_1}{\alpha} B^k_1 \nabla A^k_1 + \frac{\eta_1^2}{\alpha}\nabla B^k_1 \nabla A^k_1.$$
Therefore we have
$$\lim_{\alpha_{LoRA} \to \infty} W^{k+1}_{\alpha_{LoRA}} = \lim_{\alpha \to \infty} \Big( W_0 + B^k_1 A^k_1 - \eta_1 \nabla B^k_1 A^k_1 - \frac{\eta_1}{\alpha} B^k_1 \nabla A^k_1 + \frac{\eta_1^2}{\alpha}\nabla B^k_1 \nabla A^k_1 \Big) = W^k_1 - \eta_1 \nabla B^k_1 A^k_1 = W^{k+1}_{FFA},$$
which completes our proof.

A.3 LLAMA EXPERIMENTS AND DETAILS

Similar to RoBERTa, LLaMA is also widely used and offers competitive results for its network size. We evaluate both LoRA and FFA-LoRA on the GSM-8K dataset using the same set of hyper-parameters listed by (Kuang et al., 2023). Our method achieves an accuracy of 17.12% on GSM-8K, significantly better than the best performance of LoRA at 15.68% (15.31% reported in (Kuang et al., 2023)). To the best of our knowledge, it is also the best result for fine-tuning LLaMA on GSM-8K. We sample some generated answers of LoRA and FFA-LoRA in Table 5. The answers generated by FFA-LoRA demonstrate better performance on longer questions with more complex contexts.

Question: A new program had 60 downloads in the first month. The number of downloads in the second month was three times as many as the downloads in the first month, but then reduced by 30% in the third month. How many downloads did the program have total over the three months?

FFA-LoRA: The number of downloads in the first month was 60. The number of downloads in the second month was 3 times as many as the first month, so that is 3 * 60 = 180. The number of downloads in the third month was 180 - 30% = 180 - 54 = 126. The total number of downloads over the three months is 60 + 180 + 126 = 366. The answer is 366.

LoRA: The number of downloads in the first month was 60. The number of downloads in the second month was 60 x 3 = 180. The number of downloads in the third month was 180 - 30 = 150. The answer is 150.

Question: John decides to get a loan by mortgaging his home. His house is worth $250,000. He gets a loan worth 40% of that. He uses 60% of that to pay off his debts. How much money did he have leftover after paying debt?

FFA-LoRA: John's house is worth 250,000 dollars. He gets a loan worth 250,000 x 40% = 100,000 dollars. He uses 100,000 x 60% = 60,000 dollars to pay off his debts. 100,000 - 60,000 = 40,000 dollars. The answer is 40,000.

LoRA: John's house is worth 250,000 dollars. 40% of that is 250,000 x 0.4 = 100,000 dollars. 100,000 dollars is the amount of the loan. 100,000 / 0.6 = 166,666 dollars. 166,666 - 100,000 = 66,666 dollars. 66,666 dollars is the amount of money he has left.

Question: Octavia wants to make a quilt for her sister. She is going to build it from small, medium, and large t-shirts that her family is done with. A small t-shirt is 3 square feet of fabric. A medium one is 4 square feet. A large one is 6 square feet. If she uses 11 small shirts, 8 medium, and 6 large shirts, how many square feet is the quilt?
FFA-LoRA: A small shirt is 3 square feet. A medium shirt is 4 square feet. A large shirt is 6 square feet. 11 small shirts is 11 x 3 = 33 square feet. 8 medium shirts is 8 x 4 = 32 square feet. 6 large shirts is 6 x 6 = 36 square feet. 33 + 32 + 36 = 101 square feet. The answer is 101.

LoRA: Octavia is going to use 11 small shirts, 8 medium, and 6 large shirts. 11 + 8 + 6 = 25. 3 square feet is 12 square feet. 25 * 12 = 300 square feet. The answer is 300 square feet.

Table 5: Examples of generated answers for the GSM-8K dataset.

A.4 A MOTIVATION FOR FURTHER REDUCING TRAINABLE PARAMETERS

Our approach FFA-LoRA from Section 4 exhibits a number of theoretical benefits compared to LoRA; additionally, it also performs better and is more consistent, as shown in Section 5. We can conclude that for PEFT with adapters, freezing the randomly initialized parameters and training only on the set of parameters that were initialized at 0 is a valid and practical approach. This guided us to an even more aggressive construction of adapters, which we refer to as QVP adapters, formulated as follows. For a weight matrix $W \in \mathbb{R}^{d \times k}$, we consider the model update to be projected to a low-rank matrix such that
$$W = W_0 + \Delta W = W_0 + Q_0 V P_0, \quad \text{where } Q_0 \in \mathbb{R}^{d \times r}, \; V \in \mathbb{R}^{r \times r}, \; P_0 \in \mathbb{R}^{r \times k}.$$
Similar to FFA-LoRA, $W_0$ is the pre-trained weight, and $P_0, Q_0$ follow a random Gaussian initialization. We consider $V$ trainable and start with $V_0 = 0$; $W_0, Q_0$ and $P_0$ are kept frozen throughout the training process. We provide the performance of QVP adapters in Table 6 and compare it with LoRA and FFA-LoRA.

| Method | # of params | acc w/o DP | acc @ ε = 6 | acc @ ε = 3 |
|---|---|---|---|---|
| LoRA (rank 16) | 3145728 (0.877%) | 92.49% | 86.87% | 86.23% |
| LoRA (rank 4) | 786432 (0.220%) | 91.40% | 85.2% | 85.35% |
| FFA-LoRA (rank 16) | 1572864 (0.440%) | 92.49% | 87.33% | 86.36% |
| FFA-LoRA (rank 4) | 393216 (0.110%) | 92.20% | 86.75% | 86.22% |
| QVP (rank 128) | 1572864 (0.412%) | 90.46% | 84.23% | 83.16% |
| QVP (rank 64) | 393216 (0.107%) | 90.17% | 86.41% | 84.44% |
| QVP (rank 32) | 98304 (0.0272%) | 87.31% | 85.69% | 84.31% |
| QVP (rank 16) | 24576 (0.00685%) | 83.40% | 84.44% | 83.67% |

Table 6: Comparison between LoRA, FFA-LoRA and QVP adapters, including the number of trainable parameters.

For the experiments where these algorithms have the same parameter budget (r = 64 for QVP versus r = 4 for FFA-LoRA, etc.), QVP does not perform as well as the previously mentioned algorithms. But a unique advantage offered by QVP adapters is that it is possible to reduce the number of trainable parameters even further while the algorithm is still able to learn meaningful features from data. The same is impossible for LoRA and FFA-LoRA since the rank r cannot be smaller than 1 for these methods. Therefore, QVP is potentially useful when the parameter budget is extremely constrained, such as local private training on mobile devices.
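A minimal PyTorch sketch of the QVP adapter described above is given below. As with the FFA-LoRA sketch earlier, the module and parameter names are ours, and the Gaussian scaling of $Q_0$ and $P_0$ is an illustrative assumption rather than the paper's exact initialization.

```python
import math
import torch
import torch.nn as nn

class QVPLinear(nn.Module):
    """Frozen linear layer with a QVP adapter: W0 + Q0 @ V @ P0, only V trainable."""

    def __init__(self, base: nn.Linear, r: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze W0 (and bias)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # Q0 and P0: random Gaussian, frozen after initialization.
        self.Q0 = nn.Parameter(torch.randn(d, r) / math.sqrt(d), requires_grad=False)
        self.P0 = nn.Parameter(torch.randn(r, k) / math.sqrt(k), requires_grad=False)
        # V: the only trainable matrix (r x r), initialized at zero.
        self.V = nn.Parameter(torch.zeros(r, r))

    def forward(self, x):
        # x: (..., k) -> project down with P0, mix with V, project up with Q0.
        return self.base(x) + ((x @ self.P0.T) @ self.V.T) @ self.Q0.T
```

Only the r-by-r matrix `V` is trained and communicated, which is why the parameter count in Table 6 shrinks quadratically with r rather than linearly.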
A.5 DIFFERENTIAL PRIVACY GUARANTEE

We present the following corollary regarding the privacy guarantees in our experiments.

Corollary 2.1 (Privacy Guarantee). Given Theorem 1 with the moments accountant in (Abadi et al., 2016) and the parallel composition and resistance to post-processing properties of DP, the mechanism updating FFA-LoRA with locally run DP-SGD and FedAvg satisfies (ε, δ)-DP given, for each client $i$, the sampling rate $q = |B_i|/|N_i|$, the total number of local updates $T$ of each client, and $\sigma = O\big(q\sqrt{T \log(1/\delta)}/\epsilon\big)$. (The exact σ is computed numerically by PyTorch's Opacus package (Yousefpour et al., 2021) given q, T, ε, δ.)

Proof. Firstly, we consider the local datasets $\{D_i\}_{i \in [n]}$ of the FL network to be disjoint chunks of the global dataset. The DP-SGD with FedAvg used in our paper to train LoRA or FFA-LoRA can be considered as (A) locally updating trainable parameters with DP-SGD, (B) averaging the trainable parameters from clients on the server, and (C) repeating the above two steps for some iterations. The privacy loss of (A) can be composed by the moments accountant used in (Abadi et al., 2016). The privacy loss of all clients performing local updates can be composed by the parallel composition property of DP. The averaging on the server in (B) is a post-processing operation that does not introduce privacy loss. The privacy loss of multiple FL rounds of (C) can again be composed with the moments accountant used in (Abadi et al., 2016). Eventually, we can convert the moments accountant to (ε, δ)-DP as in Theorem 1 of (Abadi et al., 2016).

A.6 EXPERIMENTS WITH DIFFERENT SCALING FACTOR α

We conducted experiments with a selection of different α; we use α = 8, r = 8 as the baseline and choose the learning rate η according to the learning-rate scaling discussed above. Our results are shown in Table 7. For FFA-LoRA, the performance using η that scales with α is consistent across a wide range of α. However, for LoRA the same relationship does not hold, and the performance of LoRA degrades drastically when α changes. An additional grid search shows that LoRA is still able to converge with high accuracy with an adequate learning rate, but finding an optimal learning rate given α is in general arduous. We note that the optimal η for α = 256 is the same for both FFA-LoRA and LoRA, consistent with our discussion in Section 4.

| Method | α = 2 | α = 8 | α = 16 | α = 64 | α = 256 |
|---|---|---|---|---|---|
| LoRA (best LR) | 91.78% | 91.36% | 92.11% | 91.50% | 91.23% |
| LoRA (LR scaling) | 71.88% | 91.36% | 92.11% | 50.96% | 49.46% |
| FFA-LoRA (LR scaling) | 91.31% | 91.62% | 91.9% | 91.17% | 92.46% |

Table 7: Experiments with different scaling factors α.

A.7 COMPUTER VISION EXPERIMENTS

For context, we provide the performance reported on Hugging Face as a baseline; a centralized, fine-tuned model has an accuracy of 0.8539. We first report the results in our centralized experimental setting in the table below. In this case there is no significant performance discrepancy between the two methods, implying that FFA-LoRA and vanilla LoRA have similar performance without DP and FL considerations. This also aligns with our observations in the previous experiments. For the federated case, we first report the i.i.d. setting. It can be seen that, compared to LoRA, FFA-LoRA has both (a) better convergence and (b) less fluctuation in training. These findings align with our observations on language-related tasks, showing that the properties of LoRA discussed in our paper are not limited to language tasks.

| Method | Baseline | Cen. LoRA | Cen. FFA-LoRA | FL iid LoRA | FL non-iid FFA-LoRA |
|---|---|---|---|---|---|
| Accuracy | 85.39% | 86.18% | 85.83% | 81.33% | 82.10% |

Table 8: Performance of FFA-LoRA with a vision transformer evaluated on the Food-101 dataset.

A.8 DIFFERENT MATRIX INITIALIZATION FOR A

Since our proposed FFA-LoRA keeps A fixed throughout the fine-tuning process, a natural question concerns the initialization of A. We know that for a zero-initialized A matrix, neither LoRA nor FFA-LoRA is able to train to any meaningful result. However, provided that A is full rank (which holds for any random initialization in general), there are a number of different initializations we could utilize. In the majority of this paper, we use the same initialization as LoRA. Apart from Kaiming initialization, we consider orthogonal random initialization and using the top r singular vectors of $W_0$ as matrix A. We provide some initial results in Table 9; orthogonal initialization seems to perform slightly better than the existing approach. However, the performance gap is not significant enough for a definitive answer.
| Method | QNLI mean | QNLI variance |
|---|---|---|
| Kaiming Init. | 91.84% | 0.38% |
| Orthogonal Init. | 92.16% | 0.83% |
| SVD Init. | 91.50% | 0.59% |

Table 9: Performance of the algorithm under similar conditions with different initializations of matrix A.