# federated_learning_with_partial_model_personalization__dd6ddd60.pdf Federated Learning with Partial Model Personalization Krishna Pillutla 1 Kshitiz Malik 2 Abdelrahman Mohamed 2 Michael Rabbat 2 Maziar Sanjabi 2 Lin Xiao 2 We consider two federated learning algorithms for training partially personalized models, where the shared and personal parameters are updated either simultaneously or alternately on the devices. Both algorithms have been proposed in the literature, but their convergence properties are not fully understood, especially for the alternating variant. We provide convergence analyses of both algorithms in the general nonconvex setting with partial participation and delineate the regime where one dominates the other. Our experiments on realworld image, text, and speech datasets demonstrate that (a) partial personalization can obtain most of the benefits of full model personalization with a small fraction of personal parameters, and, (b) the alternating update algorithm outperforms the simultaneous update algorithm by a small but consistent margin. 1. Introduction Federated Learning (Mc Mahan et al., 2017) has emerged as a powerful paradigm for distributed and privacy-preserving machine learning (see Kairouz et al., 2021, and references therein). We consider a typical setting of Federated Learning (FL) with n devices (also called clients), where each device i has a training dataset of Ni samples zi,1, , zi,Ni. Let w Rd represent the parameters of a machine learning model and fi(w, zi,j) be the loss of the model on the training example zi,j. Then the loss function associated with device i is Fi(w) = (1/Ni) PNi j=1 fi(w, zi,j). A common objective of FL is to find model parameters that minimize the weighted average loss across all devices i=1 αi Fi(w), (1) 1Paul G. Allen School of Computer Science & Engineering, University of Washington 2Meta AI. Correspondence to: Krishna Pillutla . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). where the weights αi > 0 satisfy Pn i=1 αi = 1. A common practice is to choose αi = Ni/N where N = Pn i=1 Ni, which corresponds to minimizing the average loss across all samples: (1/N) Pn i=1 PNi j=1 fi(w, zi,j). The main motivation for minimizing the average loss over all devices is to leverage their collective statistical power for better generalization, because the amount of data on each device can be very limited. This is especially important for training modern deep learning models with large number of parameters. However, this argument assumes that the datasets from different devices are sampled from the same, or at least very similar, distributions. Given the diverse characteristics of the users and increasing trend of personalized on-device services, such an i.i.d. assumption may not hold in practice. Thus, the one-model-fits-all formulation in (1) can be ineffective and undesirable. Several approaches have been proposed for personalized FL, including ones based on multi-task learning (Smith et al., 2017), meta learning (Fallah et al., 2020), and proximal methods (Dinh et al., 2020; Li et al., 2021). A simple formulation that captures their main idea is minimize w0,{wi}n i=1 i=1 αi Fi(wi) + λi 2 wi w0 2 , (2) where wi for i = 1, . . . , n are personalized model parameters at the devices, w0 is a reference model, and the λi s are regularization weights that control the extent of personalization. 
A major disadvantage of the formulation (2), which we call full model personalization, is that it requires twice the memory footprint of the full model, wi and w0 at each device, which severely limits the size of trainable models. On the other hand, full model personalization may be unnecessary for modern deep learning models, which are composed of many simple functional units, typically organized into layers or a more general interconnected architecture. Personalizing the right components, selected with domain knowledge, may lead to substantial benefits with only a small increase in memory footprint. In addition, partial model personalization can be less susceptible to catastrophic forgetting (Mc Closkey & Cohen, 1989), where a large model finetuned on a small local dataset forgets the original (non-personalized) task, leading to degraded test performance. Federated Learning with Partial Model Personalization (a) Personalized output layer(s). (b) Personalized input layer(s). input 1 input 2 (c) Personalized split input layer(s). Figure 1. Three simple examples of partitioning deep learning models. We consider a general setting of FL with partial model personalization. Specifically, we partition the model parameters into two groups: the shared parameters u Rd0 and the personal parameters vi Rdi for i = 1, . . . , n. The full model on device i is denoted as wi = (u, vi), and the local loss function is Fi(u, vi) = (1/Ni) PNi j=1 fi (u, vi), zi,j . Our goal is to solve the optimization problem minimize u, {vi}n i=1 i=1 αi Fi(u, vi). (3) Notice that the dimensions of vi can be different across the devices, allowing the personalized components to have different number of parameters or even different architecture. We investigate two FL algorithms for solving problem (3): Fed Sim, a simultaneous update algorithm and Fed Alt, an alternating update algorithm. Both algorithms follow the standard FL protocol. During each round, the server randomly selects a subset of the devices for update and broadcasts the current global version of the shared parameters to devices in the subset. Each selected device then performs one or more steps of (stochastic) gradient descent to update both the shared parameters and the personal parameters, and sends only the updated shared parameters to the server for aggregation. The updated personal parameters are kept locally at the device to serve as the initialization when the device is selected for another update. In Fed Sim, the shared and personal parameters are updated simultaneously during each local iteration. In Fed Alt, the devices first update the personal parameters with the received shared parameters fixed and then update the shared parameters with the new personal parameters fixed. We provide convergence analysis and empirical evaluation of both methods. Contributions. Our main contributions are as follows. We provide convergence guarantees for the Fed Alt and Fed Sim methods in the general (smooth) nonconvex setting with partial participation. While both methods have appeared in the literature previously, they are either used without convergence analysis or with results on limited settings (assuming convexity or full participation). Our analysis focuses on the general nonconvex setting with partial participation, providing theoretical support for training modern deep learning models in practice. The analysis of Fed Alt with partial participation is especially challenging. 
We decouple dependent random variables in Fed Alt by introducing the technique of virtual full participation to overcome the difficulties. We conduct extensive experiments on realistic image, text, and speech tasks, exploring different model personalization strategies for each task, and comparing with strong baselines. Our results demonstrate that partial model personalization can obtain most of the benefit of full model personalization with only a small fraction of personalized parameters, and that Fed Alt outperforms Fed Sim by a small but consistent margin. Our experiments also reveal that personalization (full or partial) may lead to worse performance for some devices, despite improving the average. Typical forms of regularization such as weight decay and dropout do not mitigate this issue. This phenomenon has been overlooked in previous work and calls for future research to improve both performance and fairness. It is our hope that the generality of our theory together with strong empirical study can provide valuable guidelines for training partially personalized models in practice. Related work. The ideas behind partial model personalization in federated learning can be traced back to seminal works on multi-task learning (Caruana, 1997; Baxter, 2000; Collobert & Weston, 2008). These works advocate for learning a shared representation across various tasks. These ideas were applied to the setting of federated learning by considering each client as a separate task by Arivazhagan et al. (2019) and Collins et al. (2021); see Figure 1a. Liang et al. (2019) instead propose to personalize the input layers to learn a personalized representation (Figure 1b). Federated Learning with Partial Model Personalization adapter adapter feedforward (a) Transformer layer with two adapters. (b) Generalized additive model. Figure 2. More structured partial model personalization. (a) The adapter has a skip connection, thus it collapses to the identity mapping if vi = 0; in addition, it has a bottleneck in the middle (Houlsby et al., 2019). (b) The generalized additive model can be further augmented with a shared input layer for representation learning. Both optimization algorithms Fed Sim and Fed Alt have appeared in the literature previously, but the scope of their convergence analyses is limited. Specifically, Liang et al. (2019), Arivazhagan et al. (2019) and Hanzely et al. (2021) use Fed Sim, while Collins et al. (2021) and Singhal et al. (2021) proposed variants of Fed Alt. Notably, Hanzely et al. (2021) establish convergence of Fed Sim with participation of all devices in each round in the convex and non-convex cases, while Collins et al. (2021) prove the linear convergence of Fed Alt for a two-layer linear network where Fi( , vi) and Fi(u, ) are both convex for fixed vi and u respectively. We analyze both Fed Alt and Fed Sim in the general nonconvex case with partial device participation where only a sample of devices participate in each round, hence addressing a more practical setting. While we primarily consider problem (3) in the context of partial model personalization, it can serve as a general formulation that covers many other problems. Hanzely et al. (2021) demonstrate that various full model personalization formulations based on regularization (Dinh et al., 2020; Li et al., 2021), including (2), as well as interpolation (Deng et al., 2020a; Mansour et al., 2020) are special cases of this problem. 
The rates of convergence we prove in 3 are competitive with or better than those in previous works for full model personalization methods in the non-convex case. 2. Partially Personalized Models Modern deep learning models all have a multi-layer architecture. While a complete understanding of why they work so well is still out of reach, a general insight is that the lower layers (close to the input) are responsible for feature extraction and the upper layers (close to the output) focus on complex pattern recognition. Depending on the application domain and scenarios, we may personalize either the input layer(s) or the output layer(s) of the model; see Figure 1. In Figure 1c, the input layers are split horizontally into two parts, one shared and the other personal. They process different chunks of the input vector and their outputs are concatenated before feeding to the upper layers of the model. As demonstrated by Bui et al. (2019), this partitioning can help protect user-specific private features (input 2 in Figure 1c) as the corresponding feature embedding (through vi) are personalized and kept local at the device. Similar architectures have also been proposed in context-dependent language models (e.g., Mikolov & Zweig, 2012). A more structured partitioning is illustrated in Figure 2a, where a typical transformer layer (Vaswani et al., 2017) is augmented with two adapters. This architecture is proposed by Houlsby et al. (2019) for finetuning large language models. Similar residual adapter modules are proposed by Rebuffi et al. (2017) for image classification models in the context of multi-task learning. In the context of FL, we treat the adapter parameters as personal and the rest of the model parameters as shared. Figure 2b shows a generalized additive model, where the outputs of two separate models, one shared and the other personalized, are fused to generate a prediction. Suppose the shared model is h(u, ) and the personal model is hi(vi, ). For regression tasks with samples zi,j = (xi,j, yi,j), where xi,j is the input and yi,j is the output, we let Fi(u, vi) = (1/Ni) PNi j=1 fi (u, vi), zi,j with fi (u, vi), zi,j = yi,j h(u, xi,j) hi(vi, xi,j) 2 . In this special case, the personal model fits the residual of the shared model and vice-versa (Evgeniou & Pontil, Federated Learning with Partial Model Personalization Algorithm 1 Fed Alt / Fed Sim 1: Input: Initial states u(0), {v(0) i }n i=1, number of communication rounds T, number of devices per round m 2: for t = 0, 1, , T 1 do 3: Server samples m devices S(t) {1, . . . , n} 4: Server broadcasts u(t) to each device in S(t) 5: for each device i S(t) in parallel do 6: u(t+1) i , v(t+1) i = Local Alt / Local Sim u(t), v(t) i 7: Device sends u(t+1) i back to server 8: Server updates u(t+1) = (1/m) P i S(t) u(t+1) i 2004; Agarwal et al., 2020). For classification tasks, h(u, ) and hi(vi, ) produce probability distributions over multiple classes. We can use the cross-entropy loss between yi,j and a convex combination of the two model outputs: θh(u, xi,j)+(1 θ)hi(vi, xi,j), where θ (0, 1) is a learnable parameter. Finally, we can cast full model personalization in (2) as a special case of (3) by letting u w0, vi wi and Fi(u, vi) Fi(vi) + (λi/2) vi u 2. Many other formulations of full model personalization can be reduced to (3) as well; see Hanzely et al. (2021). 3. Algorithms and Convergence Analysis In this section, we present and analyze the Fed Alt and Fed Sim algorithms for solving problem (3). 
To simplify presentation, we denote V = (v1, . . . , vn) Rd1+...+dn and focus on the case of αi = 1/n, i.e., minimizeu, V F(u, V ) := 1 n Pn i=1 Fi(u, vi). (4) This is equivalent to (3) if we scale Fi by nαi, thus does not lose generality. Moreover, we consider the more general setting with local functions Fi(u, vi) = Ez Di[fi((u, vi), z)], where Di is the local data distribution. The Fed Alt and Fed Sim algorithms share a common outerloop description given in Algorithm 1. They differ only in the local update procedures Local Alt and Local Sim, which are given in Algorithms 2 and 3 respectively. We use e u and e v to represent stochastic gradients with respect to w and vi respectively. In Local Alt (Algorithm 2), the personal parameters are updated first with the received shared parameters fixed, then the shared parameters are updated with the new personal parameters fixed. In Local Sim (Algorithm 3), the personal variables vi and local version of the shared parameters ui are updated simultaneously, with their partial gradients evaluated at the same point. They are analogous respectively to the Gauss-Seidel and Jacobi update in numerical linear algebra (e.g., Demmel, 1997, 6.5). The rest of the section is devoted to the convergence analysis. We start with the assumptions in 3.1. In 3.2, we outline the key technical difficulty of dependent random variables in the analysis of Fed Alt and describe how we overcome it with virtual full participation. Finally, we compare the convergence rates of Fed Alt and Fed Sim in 3.3. 3.1. Assumptions We make some assumptions for the convergence analysis. Assumption 1 (Smoothness). For each i = 1, . . . , n, the function Fi is continuously differentiable. There exist constants Lu, Lv, Luv, Lvu such that for each i = 1, . . . , n: u Fi(u, vi) is Lu Lipschitz with respect to u and Luv Lipschitz with respect to vi, and v Fi(u, vi) is Lv Lipschitz with respect to vi and Lvu Lipschitz with respect to u. We summarize the relative cross-sensitivity of u Fi with respect to vi and v Fi with respect to u with the scalar χ := max{Luv, Lvu} p Assumption 2 (Bounded Variance). The stochastic gradients in Algorithm 3 and Algorithm 2 are unbiased and have bounded variance. That is, for all u and vi, E e u Fi(u, vi) = u Fi(u, vi), E e v Fi(u, vi) = v Fi(u, vi) . Furthermore, there exist constants σu and σv such that E e u Fi(u, vi) u Fi(u, vi) 2 σ2 u , E e v Fi(u, vi) v Fi(u, vi) 2 σ2 v . This is a standard bounded variance assumption on the perdevice stochastic gradients (Bottou et al., 2018). We have another source of stochasticity in our setting due to partial device participation. We can view u Fi(u, vi), when i is randomly sampled from {1, . . . , n}, as a stochastic partial gradient of F(u, V ). The next assumption imposes a constant variance bound. Assumption 3 (Partial Gradient Diversity). There exist a constant δ 0 such that for all u and V , 1 n Pn i=1 u Fi(u, vi) u F(u, V ) 2 δ2 . Throughout this paper, we assume F is bounded below by F and denote F0 = F u(0), V (0) F . Further, we use the shorthands V (t) = (v(t) 1 , . . . , v(t) n ), (t) u = u F u(t), V (t) 2 , and v Fi u(t), v(t) i 2 . 
Federated Learning with Partial Model Personalization Algorithm 2 Local Alt u, vi 1: Input: Number of steps τv, τu, and step sizes γv, γu 2: Initialize vi,0 = vi 3: for k = 0, 1, , τv 1 do 4: vi,k+1 = vi,k γv e v Fi u, vi,k 5: Update v+ i = vi,τv and initialize ui,0 = u 6: for k = 0, 1, , τu 1 do 7: ui,k+1 = ui,k γu e u Fi ui,k, v+ i 8: Update u+ i = ui,τu 9: Return u+ i , v+ i For smooth and nonconvex loss functions Fi, we obtain convergence in expectation to a stationary point of F if the expected values of these two sequences converge to zero. 3.2. Challenges of Fed Alt and Virtual Full Participation To convey the salient ideas, we assume full gradients on each device (σ2 u = 0 = σ2 v) and a single local update per device (τu = 1 = τv). The only stochasticity in the algorithm comes from partial participation, i.e., sampling m devices in each round. Dependent Random Variables. Consider the iterates (u(t), V (t)) generated by Fed Alt (Algorithm 1 with local updates from Algorithm 2). In order to analyze the effect of the u-update, we invoke the smoothness of F( , V (t+1)) as F u(t+1), V (t+1) F u(t), V (t+1) (6) u F u(t), V (t+1) , u(t+1) u(t) + Lu u(t+1) u(t) 2 . Standard convergence proofs of stochastic gradient methods rely on the fact that when we take expectation w.r.t. the sampling S(t) over the first order term (within the inner product), we obtain simplifications because the gradient is usually independent of S(t). This is true for Fed Sim and the v-step of Fed Alt. However, this is not the case for the u-step of Fed Alt since Et h u F u(t), V (t+1) , u(t+1) u(t) i = Et[ u F u(t), V (t+1) ], Et[u(t+1) u(t)] in general, where Et = E[ |u(t), V (t)] denotes the expectation w.r.t. S(t). Indeed, V (t+1) is already updated based on S(t), so both V (t+1) and u(t+1) are dependent random variables, due to their mutual dependence on the sampling S(t); see Figure 3 (left). Therefore, directly taking expectation w.r.t. S(t) in (6) does not lead to a useful result. Virtual Full Participation. We decouple the dependent random variables with virtual full participation. Define e V (t+1) as the result of local v-updates as if every device Algorithm 3 Local Sim u, vi 1: Input: Number of steps τ, and step sizes γv, γu 2: Initialize vi,0 = vi 3: Initialize ui,0 = u 4: for k = 0, 1, , τ 1 do 5: vi,k+1 = vi,k γv e v Fi ui,k, vi,k 6: ui,k+1 = ui,k γu e u Fi ui,k, vi,k 7: Update v+ i = vi,τ 8: Update u+ i = ui,τ 9: Return u+ i , v+ i had participated. This iterate is virtual, meaning that it is a tool of the analysis but is not required by the algorithm. We introduce e V (t+1) on the right hand side of (6) to get F u(t+1), V (t+1) F u(t), V (t+1) E(t)+ u F(u(t), e V (t+1)), u(t+1) u(t) + Lu u(t+1) u(t) 2 , where E(t) is the error term from replacing V (t+1) with e V (t+1). Since e V (t+1) is deterministic when conditioned on (u(t), V (t)), we can now take an expectation w.r.t. the sampling S(t) over u(t+1) only, cf. Figure 3 (right). This allows us to simplify the first order term as Et h u F u(t), e V (t+1) , u(t+1) u(t) i = u F u(t), e V (t+1) , Et[u(t+1) u(t)] i=1 Et u F(u(t), v(t+1)) 2 . Finally, we bound the error term Et[E(t)] O(Luγ2 u + χ2Lvγ2 v), which can be made small by choosing appropriately small learning rates. The technique of virtual full participation is distinct from shadow iterates u(t) k = (1/n) Pn i=1 u(t) i,k typically used in decentralized (Yuan et al., 2016) and federated optimization (Wang et al., 2021), and could be of independent interest. 
We refer to Appendix A.2 for additional details. 3.3. Comparing Fed Alt and Fed Sim We first present our main result for Fed Alt (Algorithm 1 with Local Alt). The proof relies on the technique of virtual full participation and is proved in Appendix A.3. Theorem 1 (Convergence of Fed Alt). Suppose Assumptions 1, 2 and 3 hold and the learning rates in Fed Alt are chosen as γu = η/(Luτu) and γv = η/(Lvτv). For a choice of η depending on the problem parameters Lu, Lv, χ2, σ2 u, σ2 v, δ2, m, n, and the number of rounds T, Federated Learning with Partial Model Personalization V (t+1) u(t+1) D u F u(t), V (t+1) , u(t+1) u(t)E V (t+1) u(t+1) D u F u(t), e V (t+1) , u(t+1) u(t)E Figure 3. Left: Graphical model depicting the problem of dependent random variables in the analysis of Fed Alt. We cannot take an expectation of the bottom-most inner product term w.r.t. the device sampling S(t) because both V (t+1) and u(t+1) depend on it. Right: Virtual full participation overcomes this problem, since the virtual iterates e V (t+1) are statistically independent of the sampling S(t). The expectation can now pass through the inner product, as required by standard stochastic gradient analyses. we have (ignoring absolute constants), Lu E (t) u + m n Lv E (t) v F0 σ2 alt,1 1/2 F 2 0 σ2 alt,2 1/3 T 2/3 + O 1 where we define effective variance terms σ2 alt,1 = δ2 + σ2 u Lu + σ2 v(m + χ2(n m)) σ2 alt,2 = σ2 u + δ2 Lu (1 τ 1 u ) + σ2 vm Lvn (1 τ 1 v ) + χ2σ2 v Lv , and O( ) hides problem constants independent of T. The left-hand side of (7) is the average over time of a weighted sum of E (t) u and E (t) v . Convergence is measured in the rate at which this quantity decays to zero and depends on effective noise variances σ2 alt,1, σ2 alt,2; these are weighed sums of the variances δ2, σ2 u, and σ2 v contributed by the three sources of stochasticity. The right side contains a standard T 1/2 term with effective noise variance σ2 alt,1 and a lower order T 2/3 term with variance σ2 alt,2. Next, we present our main result for Fed Sim (Algorithm 1 with Local Sim), proved in Appendix A.4. Theorem 2 (Convergence of Fed Sim). Suppose Assumptions 1, 2 and 3 hold and the learning rates in Fed Sim are chosen as γu = η/(Luτ) and γv = η/(Lvτ). Then, for a η depending on the problem parameters and the number of rounds T, the bound (7) holds where the effective variance terms σ2 alt,1, σ2 alt,2 are respectively replaced by σ2 sim,1 = (1 + χ2) δ2 + σ2 u Lu + σ2 vm Lvn σ2 sim,2 = (1 + χ2) δ2 Lu + σ2 u Lu + σ2 v Lv The bound of Fed Sim is analogous to that of Fed Alt, with the only difference in the noise terms σ2 sim,1 and σ2 sim,2. Fed Alt vs. Fed Sim: Two Regimes. Comparing the variances σ2 alt,1 and σ2 sim,1 in the leading 1/ T term, we identify two regimes in terms of problem parameters. The regime where Fed Alt dominates Fed Sim is characterized by the condition < σ2 u + δ2(1 m/n) A practically relevant scenario where this is true is σ2 v 0 and σ2 u 0 from using a large or full batch on a small number of samples per device. In this case, the rate of Fed Alt is better than Fed Sim by a factor of (1 + χ2)1/2, indicating that the rate of Fed Alt is less affected by the coupling χ2 between the personal and shared parameters. Our experiments in 4 corroborate the practical relevance of this regime. Extensions and Discussion. Theorems 1 and 2 are also interesting because of the broad generality of the optimization model (3), as we discussed in 2 and as pointed out by Hanzely et al. (2021). 
In particular, Theorems 1 and 2 also give rates for full personalization schemes without convergence guarantees in the nonconvex case such as Fed Res (Agarwal et al., 2020), Mapper (Mansour et al., 2020), and Ditto (Li et al., 2021). Furthermore, our rates are better than those of (Dinh et al., 2020) for their p Fed Me objective. Federated Learning with Partial Model Personalization Table 1. Summary of datasets and models. A histogram of data per device is given in Figure 6 (Appendix B). Task Dataset #Classes Model # Model Params #Devices #Data per device Mean Max Next-word prediction Stack Overflow 10000 4-layer transformer 6M 1000 4964 15520 Landmark recognition GLDv2 2028 Res Net-18 12M 823 88 1000 Character recognition EMNIST 63 Res Net-18 11M 1114 298 418 Speech recognition Libri Speech N/A 6-layer transformer 15M 902 8.3 min 15 min We give fully non-asymptotic versions of these theorems under more general assumptions in Appendix A. The O(1/T) term is lower order and can be ignored for T Ω((n/m)2) for Fed Alt and T Ω(n/m) for Fed Sim. 4. Experiments We experimentally compare different model personalization schemes using Fed Alt and Fed Sim. Further details about the experiments and hyperparameters as well as additional experimental results are provided in the appendices. The code to reproduce the experimental results is publicly available.1 Datasets, Tasks and Models. We consider four learning tasks, summarized in Table 1. (a) Next-Word Prediction: We use the Stack Overflow dataset, where each device corresponds to the questions and answers of one user on stackoverflow.com. This is representative of mobile keyboard predictions. We use a 4-layer transformer model (Vaswani et al., 2017) trained with the cross entropy loss and evaluated with top-1 accuracy of next word prediction. (b) Landmark Recognition: We use GLDv2 (Weyand et al., 2020), a large-scale image dataset of global landmarks. Each device corresponds to a Wikipedia contributor and is representative of smartphone users capturing images while traveling. We use Res Net-18 (He et al., 2016). with group norm instead of batch norm (Hsieh et al., 2020) and images are reshaped to 224 224. It is trained with the cross entropy loss and evaluated with the classification accuracy. (c) Character Recognition: We use the EMNIST dataset (Cohen et al., 2017), where the input is a 28 28 grayscale image of a handwritten character and the output is its label (0-9, a-z, A-Z). Each device corresponds to a writer of the character. We use a Res Net-18 model with input and output layers modified to accommodate the smaller image size and number of classes. (d) Speech Recognition (ASR): We construct a federated version of the Libri Speech dataset (Panayotov et al., 2015), partitioned by the speaker of the audio. The 1https://github.com/krishnap25/FL_ partial_personalization input is an audio clip of English speech represented by log-mel filterbank coefficients and the output is its text transcription. We use a 6-layer transformer model trained with the connectionist temporal classification (CTC) criterion (Graves et al., 2006) and report the word error rate for evaluation. Model Partitioning for Partial Personalization. We consider three partitioning schemes. (a) Input layer personalization: This architecture personalizes the input layer to learn personal representations, while the rest of the model is shared (Figure 1b). For next-word prediction, we personalize the first transformer layer instead of the embedding layer. 
(b) Output layer personalization: This architecture learns a shared representation but personalizes the prediction layer (Figure 1a). We personalize the last transformer layer for a transformer model instead of the output layer. (c) Adapter personalization: Each device adds personal adapter modules to a shared model (Figure 2a). We use the transformer adapters of Houlsby et al. (2019) and the residual adapters of Rebuffi et al. (2017). Algorithms and Experimental Pipeline. We consider three full personalization baselines: (i) Finetune, where each device finetunes its personal full model starting from a learned common model, (ii) Ditto (Li et al., 2021), which is finetuning with ℓ2 regularization, and, (iii) p Fed Me (Dinh et al., 2020) which minimizes the objective (2). All methods, including Fed Alt, Fed Sim and the baselines are initialized with a global model trained with Fed Avg. 4.1. Experimental Results Partial personalization nearly matches full personalization and can sometimes outperform it. Table 2 shows the average test accuracy across all devices of different FL algorithms. We see that on the Stack Overflow dataset, output layer personalization (25.05%) makes up nearly 90% of the gap between the non-personalized baseline (23.82%) and full personalization (25.21%). On EMNIST, adapter personalization exactly matches full personalization. Most surprisingly, on GLDv2, adapter personalization outperforms full personalization by 3.5pp (percentage points). Federated Learning with Partial Model Personalization Table 2. Comparison of partial model personalization with full model personalization in terms of the average test accuracy % across devices. The subscript denotes the standard deviation over 5 random runs. The boldfaced/highlighted numbers denote entries within one standard deviation of the maximum in each row. For partial personalization, we show the accuracy of Fed Alt; see Table 4 for Fed Sim. Non-pers. Full Model Personalization Partial Model Personalization Fed Avg Finetune Ditto p Fed Me Input Layer Output Layer Adapter Stack Overflow 23.82 25.200.01 25.200.01 25.210.01 24.440.01 25.050.01 24.820.01 GLDv2 51.43 62.850.02 62.850.01 62.920.02 53.940.07 56.640.05 66.410.06 EMNIST 93.18 94.130.01 94.130.01 94.130.01 93.620.04 93.570.05 94.130.03 105 106 # Personalized Params. Output Layer Adapter (dim=16) Adapter (dim=64) Stack Overflow 104 105 106 107 # Personalized Params. Input Layer Output Layer 103 104 105 106 107 # Personalized Params. Input Layer Output Layer Partial Full Figure 4. Absolute change in accuracy (percentage points) due to personalization plotted against number of personal parameters (i.e., dimensionality of vi). Note that the x-axis is in log scale. This success of adapter personalization can be explained partly by the nature of GLDv2. On average, the training data on each device contains 25 classes out of a possible 2028 while the testing data contains 10 classes not seen in its own training data. These unseen classes account for nearly 23% of all testing data. Personalizing the full model is susceptible to forgetting the original task (Kirkpatrick et al., 2017), making it harder to get these unseen classes right. Such catastrophic forgetting is worse when finetuning on a very small local dataset, as we often have in FL. On the other hand, personalizing the adapters does not suffer as much from this issue (Rebuffi et al., 2017). Partial personalization only requires a fraction of the parameters to be personalized. 
Figure 4 shows that the number of personal parameters required to compete with full personalization is rather small. On Stack Overflow, personalizing 1.2% of the parameters with adapters captures 72% of the accuracy boost from personalizing all 5.7M parameters; this can be improved to nearly 90% by personalizing 14% of the parameters (output layer). Likewise, we match full personalization on EMNIST and exceed it on GLDv2 with adapters, personalizing 11.5-12.5% of parameters. The best personalized architecture is model and task dependent. Table 2 shows that personalizing the final transformer layer (denoted as Output Layer ) achieves the best performance for Stack Overflow, while the residual adapter achieves the best performance for GLDv2 and EMNIST. In contrast, input layer personalization achieves the best Table 3. Comparison of finetuning and partial personalization for ASR on Librispeech. We report the word error rate (WER, %) on the test data, averaged across devices. Smaller values are better. Finetune Input Layer Output Layer Adapter 15.55 15.13 15.53 15.50 performance for speech recognition, cf. Table 3. This variation is explained via the primary source of data heterogeneity across devices for each task. The choice of the next word after a context can vary between users, so the output layer is the right component to personalize for this task. Likewise, there is greater heterogeneity in the audio of Libri Speech (accent, tone, and voice of the speaker) than the text (standard literary English), so input layer personalization works best in this case. This shows that the approach of personalizing a fixed model part, as in past works, is suboptimal. Our framework allows for the use of domain knowledge to determine customized personalization. Finetuning is competitive with other full personalization methods. Full finetuning matches the performance of p Fed Me and Ditto on Stack Overflow and EMNIST. On GLDv2, however, p Fed Me outperforms finetuning by 0.07pp, but is still 3.5pp worse than adapter personalization. Fed Alt outperforms Fed Sim by a small but consistent margin. Table 4 shows that Fed Alt almost always outper- Federated Learning with Partial Model Personalization Table 4. Fed Alt vs. Fed Sim for partial personalization. FT (part.) means finetuning the personal parameters vi while fixing the shared parameters u from Fed Avg. The numbers are averaged over 5 random runs and the subscript denotes the standard deviation. Stack Overflow GLDv2 EMNIST FT (part.) Fed Alt Fed Sim FT (part.) Fed Alt Fed Sim FT (part.) Fed Alt Fed Sim Input Layer 24.960.01 24.440.01 24.810.01 51.970.02 53.940.06 53.640.08 93.290.00 93.620.03 93.550.05 Output Layer 24.930.01 25.050.01 25.020.01 53.210.01 56.640.05 56.240.04 93.370.01 93.570.04 93.550.05 Adapter 24.710.00 24.820.01 24.740.01 63.860.06 66.410.05 66.350.03 93.660.00 94.130.03 94.070.03 2500 5000 7500 10000 12500 15000 # Data per device Pers. helps Pers. hurts Full (train) 2500 5000 7500 10000 12500 15000 # Data per device Full (test) 2500 5000 7500 10000 12500 15000 # Data per device Partial (train) 2500 5000 7500 10000 12500 15000 # Data per device Partial (test) Figure 5. Stack Overflow task: Scatter plot of change in training and test accuracy (pp) per-device versus the number of training samples on the device for (a) Left: full personalization with finetuning, and, (b) Right: partial personalization with the output layer. forms Fed Sim by a small margin, e.g., 0.08pp for Stack Overflow/Adapter and 0.3pp for GLDv2/Input Layer. 
Fed Sim in turn yields a higher accuracy than simply finetuning the personal part of the model by a margin of 0.12pp for Stack Overflow/Output Layer and 2.55pp for GLDv2/Adapter. Furthermore, we observe that the difference between Fed Alt and Fed Sim is much larger than the standard deviation across runs. For instance, under output layer personalization for GLDv2, this difference is 0.4pp (= 8 std). As a practical recommendation, we recommend using Fed Alt as a default, but it does not hurt much to use Fed Sim. 4.2. Effects of Personalization on Generalization Personalization hurts the test accuracy on some devices. Figure 5 shows the change in training and test accuracy of each device, over a non-personalized model baseline. We see that personalization leads to an improvement in training accuracy across all devices, but a reduction in test accuracy on some of the devices. Devices whose testing performance is hurt by personalization are mostly on the left side of the plot, meaning that they have relatively small number of training samples. On the other hand, many devices with the most improved test accuracy also appear on the left side, signaling the benefit of personalization. Therefore, there is a large variation of results for devices with few samples. Additional results in Appendix C show that using ℓ2 regularization as in (2), or weight decay does not mitigate this issue. Increasing regularization strength (less personalization) can reduce the spread of per-device accuracy, but degrades the average accuracy. Dropout does not fix this issue either. An ideal personalized method would boost performance on most of the devices without causing a reduction in (test) accuracy on any device. Realizing this goal calls for a sound statistical analysis for personalized FL and may require sophisticated methods for local performance diagnosis and structured regularization. 5. Discussion In addition to a much smaller memory footprint than full model personalization and being less susceptible to catastrophic forgetting, partial model personalization has other advantages. For example, it reduces the amount of communication between the server and the devices because only the shared parameters are transmitted. While the communication savings may not be significant (especially when the personal parameters are only a small fraction of the full model), communicating only the shared parameters may have significant implications for privacy. Intuitively, it can be harder to infer private information from partial model information. This is especially the case if the more sensitive features of the data are processed through personal components of the model that are kept local at the devices. For example, we speculate that less noise needs to be added to the communicated parameters in order to satisfy differential privacy requirements (Abadi et al., 2016). Federated Learning with Partial Model Personalization Abadi, M., Chu, A., Goodfellow, I. J., Mc Mahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep Learning with Differential Privacy. In Proc. of ACM SIGSAC, pp. 308 318. ACM, 2016. Agarwal, A., Langford, J., and Wei, C. Federated Residual Learning. ar Xiv Preprint, 2020. Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. Federated Learning with Personalization Layers. ar Xiv Preprint, 2019. Baxter, J. A Model of Inductive Bias Learning. J. Artif. Intell. Res., 12:149 198, 2000. Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Methods for Large-Scale Machine Learning. 
SIAM Review, 60 (2):223 311, 2018. Bui, D., Malik, K., Goetz, J., Liu, H., Moon, S., Kumar, A., and Shin, K. G. Federated User Representation Learning. ar Xiv Preprint, 2019. Caruana, R. Multitask learning. Mach. Learn., 28(1):41 75, 1997. Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: an extension of MNIST to handwritten letters. ar Xiv Preprint, 2017. Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting Shared Representations for Personalized Federated Learning. In Proc. of ICML, volume 139, pp. 2089 2099, 2021. Collobert, R. and Weston, J. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In ICML, volume 307, pp. 160 167, 2008. Demmel, J. W. Applied Numerical Linear Algebra. SIAM, Philadelphia, 1997. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F. Image Net: A large-scale hierarchical image database. In Proc. of CVPR, pp. 248 255, 2009. Deng, Y., Kamani, M. M., and Mahdavi, M. Adaptive Personalized Federated Learning. ar Xiv Preprint, 2020a. Deng, Y., Kamani, M. M., and Mahdavi, M. Distributionally Robust Federated Averaging. In Neur IPS, 2020b. Dinh, C. T., Tran, N., and Nguyen, J. Personalized Federated Learning with Moreau Envelopes. In Proc. of Neur IPS, volume 33, pp. 21394 21405, 2020. Evgeniou, T. and Pontil, M. Regularized Multi Task Learning. In KDD, pp. 109 117, 2004. Fallah, A., Mokhtari, A., and Ozdaglar, A. E. Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. In Proc. of Neur IPS, 2020. Graves, A., Fern andez, S., Gomez, F., and Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pp. 369 376, 2006. Hanzely, F., Zhao, B., and Kolar, M. Personalized Federated Learning: A Unified Framework and Universal Optimization Techniques. ar Xiv Preprint, 2021. He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In CVPR, pp. 770 778, 2016. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-Efficient Transfer Learning for NLP. In Proc. of ICML, volume 97, pp. 2790 2799, 2019. Hsieh, K., Phanishayee, A., Mutlu, O., and Gibbons, P. B. The Non-IID Data Quagmire of Decentralized Machine Learning. In Proc. of ICML, volume 119, pp. 4387 4398. PMLR, 2020. Hsu, T. H., Qi, H., and Brown, M. Federated Visual Classification with Real-World Data Distribution. In Proc. of ECCV, volume 12355, pp. 76 92, 2020. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K. A., Charles, Z., Cormode, G., Cummings, R., D Oliveira, R. G. L., Eichner, H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gasc on, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Koneˇcn y, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Ozg ur, A., Pagh, R., Qi, H., Ramage, D., Raskar, R., Raykova, M., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tram er, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn., 14(1-2):1 210, 2021. Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. 
In Proc. of ICML, 2020. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Federated Learning with Partial Model Personalization Kumaran, D., and Hadsell, R. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521 3526, 2017. Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A Unified Theory of Decentralized SGD with Changing Topology and Local Updates. In Proc. of ICML, 2020. Li, T., Hu, S., Beirami, A., and Smith, V. Ditto: Fair and Robust Federated Learning Through Personalization. In Proc. of ICML, volume 139, pp. 6357 6368, 2021. Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the Convergence of Fed Avg on Non-IID Data. In ICLR, 2020. Liang, P. P., Liu, T., Liu, Z., Salakhutdinov, R., and Morency, L. Think Locally, Act Globally: Federated Learning with Local and Global Representations. In Neur IPS Workshop on Federated Learning, 2019. Mansour, Y., Mohri, M., Ro, J., and Suresh, A. T. Three Approaches for Personalization with Applications to Federated Learning. ar Xiv Preprint, 2020. Mc Closkey, M. and Cohen, N. J. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. volume 24 of Psychology of Learning and Motivation, pp. 109 165. Academic Press, 1989. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proc. of AISTATS, pp. 1273 1282, 2017. Mikolov, T. and Zweig, G. Context dependent recurrent neural network language model. In IEEE SLT, pp. 234 239, 2012. Misra, I., Shrivastava, A., Gupta, A., and Hebert, M. Crossstitch Networks for Multi-Task Learning. In CVPR, pp. 3994 4003, 2016. Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. Libri Speech: an ASR Corpus based on Public Domain Audio Books. In ICASSP, pp. 5206 5210. IEEE, 2015. Pillutla, K., Laguel, Y., Malick, J., and Harchaoui, Z. Federated Learning with Heterogeneous Data: A Superquantile Optimization Approach. ar Xiv Preprint, 2021. Rebuffi, S., Bilen, H., and Vedaldi, A. Learning multiple visual domains with residual adapters. In Neur IPS, pp. 506 516, 2017. Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Koneˇcn y, J., Kumar, S., and Mc Mahan, H. B. Adaptive Federated Optimization. In Proc. of ICLR, 2021. Singhal, K., Sidahmed, H., Garrett, Z., Wu, S., Rush, K., and Prakash, S. Federated reconstruction: Partially local federated learning. In Proc. of Neur IPS, 2021. Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated Multi-Task Learning. In Proc. of Neur IPS, pp. 4424 4434, 2017. Synnaeve, G., Xu, Q., Kahn, J., Likhomanenko, T., Grave, E., Pratap, V., Sriram, A., Liptchinsky, V., and Collobert, R. End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures. ar Xiv preprint, 2019. Tensor Flow Federated. https://www.tensorflow. org/federated. Turc, I., Chang, M.-W., Lee, K., and Toutanova, K. Wellread students learn better: On the importance of pretraining compact models. ar Xiv Preprint, 2019. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Proc. of Neur IPS, pp. 5998 6008, 2017. Wang, J., Charles, Z., Xu, Z., Joshi, G., Mc Mahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A Field Guide to Federated Optimization. 
ar Xiv Preprint, 2021. Weyand, T., Araujo, A., Cao, B., and Sim, J. Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval. In Proc. of CVPR, pp. 2572 2581, 2020. Yuan, K., Ling, Q., and Yin, W. On the Convergence of Decentralized Gradient Descent. SIAM Journal on Optimization, 26(3):1835 1854, 2016. Federated Learning with Partial Model Personalization A Convergence Analysis: Full Proofs 13 A.1 Review of Setup and Assumptions 13 A.2 Virtual Full Participation: Background and Details 14 A.3 Convergence Analysis of Fed Alt 16 A.4 Convergence Analysis of Fed Sim 23 A.5 Technical Lemmas 30 B Experiments: Detailed Setup and Hyperparameters 33 B.1 Datasets, Tasks, and Models 33 B.2 Experimental Pipeline and Baselines 36 B.3 Hyperparameters and Evaluation Details 37 B.4 Estimated Memory Requirement 38 C Experiments: Additional Results 39 C.1 Speech Recognition: Fed Alt vs. Fed Sim 39 C.2 Ablation: Final Finetuning for Fed Alt and Fed Sim 39 C.3 Effect of Personalization on Per-Device Generalization 39 C.4 Partial Personalization for Stateless Devices 40 Federated Learning with Partial Model Personalization A. Convergence Analysis: Full Proofs We give the full convergence proofs here. The outline of this section is: A.1: Review of setup and assumptions; A.2: Virtual Full Participation: Background and Details A.3: Convergence analysis of Fed Alt and the full proof of Theorem 1 (see Theorem 3 and Corollary 4); A.4: Convergence analysis of Fed Sim and the full proof of Theorem 2 (see Theorem 11 and Corollary 12); A.5: Technical lemmas used in the analysis. A.1. Review of Setup and Assumptions We consider a federated learning system with n devices. Let the loss function on device i be Fi(u, vi), where u Rd0 denotes the shared parameters across all devices and vi Rdi denotes the personal parameters at device i. We aim to minimize the function F(u, V ) := 1 i=1 Fi(u, vi) , (8) where V = (v1, , vn) is a concatenation of all the personalized parameters. This is a special case of (3) with the equal per-device weights, i.e., αi = 1/n. Recall that we assume that F is bounded from below by F . For convenience, we reiterate Assumptions 1, 2 and 3 from the main paper as Assumptions 1 , 2 and 3 below respectively, with some additional comments and discussion. Assumption 1 (Smoothness). For each device i = 1, . . . , n, the objective Fi is smooth, i.e., it is continuously differentiable and, (a) u 7 u Fi(u, vi) is Lu-Lipschitz for all vi, (b) vi 7 v Fi(u, vi) is Lv-Lipschitz for all u, (c) vi 7 u Fi(u, vi) is Luv-Lipschitz for all u, and, (d) u 7 v Fi(u, vi) is Lvu-Lipschitz for all vi. Further, we assume for some χ > 0 that max{Luv, Lvu} χ p The smoothness assumption is a standard one. We can assume without loss of generality that the cross-Lipschitz coefficients Luv, Lvu are equal. Indeed, if Fi is twice continuously differentiable, we can show that Luv, Lvu are both equal to the operator norm 2 uv Fi(u, vi) op of the mixed second derivative matrix. Further, χ denotes the extent to which u impacts the gradient of vi and vice-versa. For concreteness, consider the full personalization setting of Eq. (2), where each Fi is L-smooth; this is a special case of the formulation (8), as we argue in 2. In this case, a simple calculation shows that χ2 = λ λ + L 1 . Our next assumption is about the variance of the stochastic gradients, and is standard in literature. Compared to the main paper, we adopt a more precise notation about stochastic gradients. 
Assumption 2 (Bounded Variance). Let Di denote a probability distribution over the data space Z on device i. There exist functions Gi,u and Gi,v which are unbiased estimates of u Fi and v Fi respectively. That is, for all u, vi: Ez Di [Gi,u(u, v, z)] = u Fi(u, vi), and Ez Di [Gi,v(u, v, z)] = v Fi(u, vi) . Furthermore, the variance of these estimators is at most σ2 u and σ2 v respectively. That is, Ez Di Gi,u(u, v, z) u Fi(u, vi) 2 σ2 u , Ez Di Gi,v(u, v, z) v Fi(u, vi) 2 σ2 v . Federated Learning with Partial Model Personalization In practice, one usually has Gi,u(u, vi, z) = ufi((u, vi), z), which is the gradient of the loss on datapoint z Di under the model (u, vi), and similarly for Gi,v. Finally, we make a gradient diversity assumption. Assumption 3 (Partial Gradient Diversity). There exist δ 0 and ρ 0 such that for all u and V , i=1 u Fi(u, vi) u F(u, V ) 2 δ2 + ρ2 u F(u, V ) 2 . (9) This is a generalization of Assumption 3 used in the main paper, which is a special case of Assumption 3 with ρ = 0. We allow the partial gradient diversity to grow with the squared norm of the gradient with a factor of ρ2. This assumption is analogous to the bounded variance assumption (Assumption 2 ), but with the stochasticity coming from the sampling of devices. It characterizes how much local steps on one device help or hurt convergence globally. Similar gradient diversity assumptions are often used for analyzing non-personalized federated learning (Koloskova et al., 2020; Karimireddy et al., 2020). Finally, it suffices for the partial gradient diversity assumption to only hold at the iterates (u(t), V (t)) generated by either Fed Sim or Fed Alt. A.2. Virtual Full Participation: Background and Details We recap the challenge of dependent random variables with Fed Alt, and explain the technique of virtual full participation in some more detail. For this section, we assume full gradients on each device (σ2 u = 0 = σ2 v) and a single local update per device (τu = 1 = τv). The only stochasticity in the algorithm comes from partial device participation, i.e., sampling m devices in each round. Background: Stochastic Gradient Convergence Analysis. Consider the minimization problem min w Rd f(w) , where the function f : Rd R is L-smooth. Starting from some fixed w(0) Rd, consider the stochastic gradient iterations w(t+1) = w(t) γg(t), where γ is a fixed learning rate, and g(t) is an unbiased estimate of f(w(t)), i.e., E[g(t)|w(t)] = f(w(t)). Typical proofs of convergence proceed in the general nonconvex case with the smoothness bound f(w(t+1)) f(w(t)) f(w(t)), w(t+1) w(t) + L 2 w(t+1) w(t) 2 (10) = γ f(w(t)), g(t) + γ2L Since the stochastic gradient g(t) is unbiased, we get (under typical assumptions) an inequality Et h f(w(t+1)) i f(w(t)) cγ f(w(t)) 2 + O(γ2) , (11) where c > 0 is some absolute constant and Et[ ] = E[ |w(t)] takes an expectation only over the randomness in step t. The second term is a noise term that can be made small by choosing an appropriately small learning rate γ. Telescoping the inequality over t and rearranging gives a convergence bound. The key intuition behind this proof is that the update is unbiased in linear term of the smoothness upper bound (10). The same intuition holds for most smooth nonconvex stochastic gradient convergence analyses (Bottou et al., 2018). In particular, this takes the following form in this case Et h f(w(t)), w(t+1) w(t) i = D f(w(t)), Et[w(t+1) w(t)] E . (12) This ensures that the contribution of the stochasticity occurs in a lower order O(γ2) term. 
As we shall see next, such an equality does not hold for Fed Alt in the partial participation case due to dependent random variables. The Challenge in Fed Alt with Partial Participation. Consider the iterates (u(t), V (t)) generated by Fed Alt. The progress Federated Learning with Partial Model Personalization in one round is the combined progress of the v-step (call it Tv) and the u-step (call it Tu) so that F u(t+1), V (t+1) F u(t), V (t) = F u(t), V (t+1) F u(t), V (t) | {z } =:Tv + F u(t+1), V (t+1) F u(t), V (t+1) | {z } =:Tu The analysis of the v-step is easy because the unbiasedness condition similar to (12) holds: Et D V F u(t), V (t) , V (t+1) V (t)E = D V F u(t), V (t) , Et h V (t+1) V (t)i E , since Et[ ] takes an expectation w.r.t. the client sampling S(t). The recipe laid out earlier gives a descent condition similar to (11). For the u-step, an unbiasedness condition similar to (12) does not hold: Et D u F u(t), V (t+1) , u(t+1) u(t)E = D Et h u F u(t), V (t+1) i , Et h u(t+1) u(t)i E . The expectation cannot pass into the inner product because V (t+1) and u(t+1) are dependent random variables. Both are dependent on the device sampling S(t), as shown Figure 3 (left). Virtual Full Participation. We decouple these random variables by using virtual full participation. Define a virtual iterate e V (t+1) as the result of local v-updates as if every device had participated. Specifically, we introduce e V (t+1) on the right hand side of the smoothness bound applied on Tu to get F u(t+1), V (t+1) F u(t), V (t+1) E(t) + u F(u(t), e V (t+1)), u(t+1) u(t) + Lu u(t+1) u(t) 2 , where E(t) is the error term from replacing V (t+1) with e V (t+1) Since e V (t+1) is independent of the client sampling S(t), we can now take an expectation Et[ ] over u(t+1) only, leading us to a situation similar to (12); cf. Figure 3 (right). We bound the error term E(t) using Young s inequality and smoothness (Assumption 1 ) respectively as E(t) = u F(u(t), V (t+1)) u F(u(t), e V (t+1)), u(t+1) u(t) 2 u(t+1) u(t) 2 + 1 2Lu u F(u(t), V (t+1)) u F(u(t), e V (t+1)) 2 2 u(t+1) u(t) 2 + χ2Lv i=1 v(t+1) i v(t+1) i 2 . These two terms are similar to the quadratic terms we get from the smoothness upper bound. We can similarly show Et[E(t)] = O(Luγ2 u + χ2Lvγ2 v), so the error term from virtual full participation is also a lower order O(γ2) term. Virutal Iterates in Related Work. Virtual or shadow iterates have long been used in decentralized optimization (Yuan et al., 2016), and have since been adopted in the analysis of federated optimization algorithms in the non-personalized setting (Li et al., 2020; Koloskova et al., 2020; Wang et al., 2021). In our notation, the shadow iterates used in (Koloskova et al., 2020; Wang et al., 2021) take the form i=1 u(t) i,k , which is an average of the local versions of the shared parameters. This only makes sense for the case of full participation since u(t) i,k is only defined for selected devices i S(t). In partial participation case, Li et al. (2020) define the virtual sequence ( u(t) i,k)τu k=0 as the local SGD updates on all devices i irrespective of whether they were selected. Then, they define the average i=1 u(t) i,k . Federated Learning with Partial Model Personalization Their proof relies on the fact that ES(t)[u(t+1)] = u(t) τu due to the properties of the sampling. In contrast, we consider personalized federated learning the problem of dependent random variables only shows up in the analysis of Fed Alt with partial participation, a setting not considered in prior works. 
We employ virtual personal parameters v(t) i,k to overcome this problem. We believe that this technique of decoupling dependent random variables can be of independent interest for (distributed) stochastic optimization, including personalized extensions of nonsmooth federated learning objectives (Deng et al., 2020b; Pillutla et al., 2021) or more general multi-task learning formulations (Misra et al., 2016). A.3. Convergence Analysis of Fed Alt We give the full form of Fed Alt in Algorithms 4 for the general case of unequal αi s but focus on αi = 1/n for the analysis. Theorem 1 of the main paper is a simplification of Corollary 4 below, which in turn is proved based on Theorem 3. Throughout this section, we use the constants σ2 alt,1 = δ2 + σ2 u Lu + σ2 v(m + χ2(n m)) Lvn , σ2 alt,2 = σ2 u + δ2 Lu (1 τ 1 u ) + σ2 vm Lvn (1 τ 1 v ) + χ2σ2 v Lv . We also recall the definitions (t) u = u F u(t), V (t+1) 2 , and, (t) v = 1 v Fi u(t), v(t) i 2 . Theorem 3 (Convergence of Fed Alt). Suppose Assumptions 1 , 2 and 3 hold and the learning rates in Fed Alt are chosen as γu = η/(Luτu) and γv = η/(Lvτv), with η min 1 24(1 + ρ2), m 128χ2(n m), r m Then, ignoring absolute constants, we have Lu E (t) u + m n Lv E (t) v F0 ηT + η σ2 alt,1 + η2 σ2 alt,2 . Before proving the theorem, we have the corollary with optimized learning rates. Corollary 4 (Final Rate of Fed Alt). Consider the setting of Theorem 3 and let the number of rounds T be known in advance. Suppose we set the learning rates γu = η/(τLu) and γv = η/(τLv), where (ignoring absolute constants), F0 Tσ2 alt,1 !1/2 F 2 0 T 2 σ2 alt,2 !1/3 1 1 + ρ2 m χ2(n m) We have, ignoring absolute constants, 1 Lu E u F u(t), V (t) 2 + m Lvn2 i=1 E v Fi u(t), v(t) i 2 ! F0 σ2 alt,1 1/2 F 2 0 σ2 alt,2 1/3 1 + ρ2 + χ2 n Proof. The proof follows from invoking Lemma 25 on the bound of Theorem 3. Remark 5 (Asymptotic Rate). The asymptotic 1/ T rate of Theorem 1 is achieved when the 1/T term is dominated by the 1/ T term. This happens when (ignoring absolute constants) 1 + ρ4 + χ4 n2 Federated Learning with Partial Model Personalization Algorithm 4 Fed Alt: Alternating updates of shared and personalized parameters 1: Input: Initial iterates u(0), V (0), Number of communication rounds T, Number of devices per round m, Number of local updates τu, τv, Local step sizes γu, γv, 2: for t = 0, 1, , T 1 do 3: Sample m devices from [n] without replacement in S(t) 4: for each selected device i S(t) in parallel do 5: Initialize v(t) i,0 = v(t) i 6: for k = 0, , τv 1 do 7: // Update personal parameters 8: Sample data z(t) i,k Di 9: v(t) i,k+1 = v(t) i,k γv Gi,v(u(t), v(t) i,k, z(t) i,k) 10: Update v(t+1) i = v(t) k,τv 11: Initialize u(t) i,0 = u(t) 12: for k = 0, , τu 1 do 13: // Update shared parameters 14: u(t) i,k+1 = u(t) i,k γu Gi,u(u(t) i,k, v(t+1) i , z(t) i,k) 15: Update u(t+1) i = u(t) i,τu 16: Update u(t+1) = P i S(t) αiu(t+1) i /P i S(t) αi at the server with secure aggregation 17: return u(T ), v(T ) 1 , , v(T ) n We now prove Theorem 3. Proof of Theorem 3. The proof mainly applies the smoothness upper bound to write out a descent condition with suitably small noise terms. We start with some notation. We introduce the notation e (t) u as the analogue of (t) u with the virtual variable e V (t+1): e (t) u = u F u(t), e V (t+1) 2 . Notation. Let F(t) denote the σ-algebra generated by u(t), V (t) and denote Et[ ] = E[ |F(t)]. 
We now prove Theorem 3.

Proof of Theorem 3. The proof mainly applies the smoothness upper bound to write out a descent condition with suitably small noise terms. We start with some notation. We introduce $\widetilde\Delta^{(t)}_u$ as the analogue of $\Delta^{(t)}_u$ with the virtual variable $\widetilde V^{(t+1)}$:
\[
\widetilde\Delta^{(t)}_u = \big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2.
\]

Notation. Let $\mathcal F^{(t)}$ denote the $\sigma$-algebra generated by $(u^{(t)}, V^{(t)})$ and denote $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot\,|\mathcal F^{(t)}]$. For all devices, including those not selected in each round, we define virtual sequences $\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}$ as the SGD updates of Algorithm 4 run on every device, regardless of whether it is selected. For the selected devices $i \in S^{(t)}$, we have $\tilde v^{(t)}_{i,k} = v^{(t)}_{i,k}$ and $\tilde u^{(t)}_{i,k} = u^{(t)}_{i,k}$. Note that the random variables $\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}$ are independent of the device selection $S^{(t)}$. Finally, the updates for the selected devices $i \in S^{(t)}$ are given by
\[
v^{(t+1)}_i = v^{(t)}_i - \gamma_v \sum_{k=0}^{\tau_v-1} G_{i,v}\big(u^{(t)}, v^{(t)}_{i,k}, z^{(t)}_{i,k}\big),
\]
and the server update is given by
\[
u^{(t+1)} = u^{(t)} - \frac{\gamma_u}{m}\sum_{i\in S^{(t)}}\sum_{k=0}^{\tau_u-1} G_{i,u}\big(u^{(t)}_{i,k}, v^{(t)}_{i,\tau_v}, z^{(t)}_{i,k}\big).
\]

Proof Outline and the Challenge of Dependent Random Variables. We start with
\[
F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)}) = \big[F(u^{(t)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] + \big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t+1)})\big]. \tag{13}
\]
The first bracket corresponds to the effect of the $v$-step and the second to the $u$-step. The former is easy to handle with standard techniques that rely on the smoothness of $F(u^{(t)}, \cdot)$. The latter is more challenging. In particular, the smoothness bound for the $u$-step gives
\[
F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t+1)}) \le \big\langle \nabla_u F(u^{(t)}, V^{(t+1)}),\, u^{(t+1)} - u^{(t)} \big\rangle + L_u\big\|u^{(t+1)} - u^{(t)}\big\|^2.
\]
Standard convergence proofs for stochastic gradient methods rely on taking an expectation w.r.t. the sampling $S^{(t)}$ of devices in the first-order term. However, both $V^{(t+1)}$ and $u^{(t+1)}$ depend on $S^{(t)}$, so we cannot directly take this expectation.

Virtual Full Participation to Circumvent Dependent Random Variables. The crux of the proof lies in replacing $V^{(t+1)}$ in the analysis of the $u$-step with the virtual iterate $\widetilde V^{(t+1)}$, so as to move all dependence of the $u$-step on $S^{(t)}$ into the $u^{(t+1)}$ term. This allows us to take an expectation; it remains to carefully bound the resulting error terms. Eventually we arrive at a bound of the form
\[
\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{\gamma_u\tau_u}{8}\,\mathbb{E}\big[\widetilde\Delta^{(t)}_u\big] + \frac{\gamma_v\tau_v m}{16n}\,\mathbb{E}\big[\Delta^{(t)}_v\big]\right) \le \frac{F_0}{T} + O(\gamma_u^2 + \gamma_v^2).
\]
Next, we translate this bound from the gradient $\mathbb{E}[\widetilde\Delta^{(t)}_u]$, computed with the virtual $\widetilde V^{(t+1)}$, to $\mathbb{E}[\Delta^{(t)}_u]$, computed at the actual iterate $V^{(t)}$. A careful analysis shows that we only incur a lower-order $O(\gamma_u\gamma_v^2)$ term in this translation. Choosing $\gamma_u$ and $\gamma_v$ small enough gives the final result.

Analysis of the u-Step with Virtual Full Participation. We introduce the virtual iterates $\widetilde V^{(t+1)}$ into the analysis of the $u$-step as follows:
\begin{align*}
F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t+1)})
&\le \big\langle \nabla_u F(u^{(t)}, V^{(t+1)}),\, u^{(t+1)} - u^{(t)} \big\rangle + L_u\big\|u^{(t+1)} - u^{(t)}\big\|^2 \\
&= \big\langle \nabla_u F(u^{(t)}, \widetilde V^{(t+1)}),\, u^{(t+1)} - u^{(t)} \big\rangle + L_u\big\|u^{(t+1)} - u^{(t)}\big\|^2 \\
&\qquad + \big\langle \nabla_u F(u^{(t)}, V^{(t+1)}) - \nabla_u F(u^{(t)}, \widetilde V^{(t+1)}),\, u^{(t+1)} - u^{(t)} \big\rangle \\
&\le \underbrace{\big\langle \nabla_u F(u^{(t)}, \widetilde V^{(t+1)}),\, u^{(t+1)} - u^{(t)} \big\rangle}_{=:T_{1,u}} + \underbrace{2L_u\big\|u^{(t+1)} - u^{(t)}\big\|^2}_{=:T_{2,u}} + \underbrace{\frac{\chi^2 L_v}{n}\sum_{i=1}^n\big\|\tilde v^{(t+1)}_i - v^{(t+1)}_i\big\|^2}_{=:T_{3,u}}.
\end{align*}
The last two inequalities follow from Young's inequality and the Lipschitzness of $V \mapsto \nabla_u F(u, V)$ respectively. We have now eliminated the dependence of the first-order term $T_{1,u}$ on $V^{(t+1)}$; the virtual iterates $\widetilde V^{(t+1)}$ are independent of $S^{(t)}$, which allows us to take an expectation w.r.t. the sampling $S^{(t)}$ of the devices. Bounding each of these terms in Claims 6 to 8 below gives
\[
\mathbb{E}_t\big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t+1)})\big] \le -\frac{\gamma_u\tau_u}{4}\,\mathbb{E}_t\big[\widetilde\Delta^{(t)}_u\big] + \underbrace{\frac{2\gamma_u L_u^2}{n}\sum_{i=1}^n\sum_{k=0}^{\tau_u-1}\mathbb{E}_t\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2}_{=:T'_{2,u}} + 4\gamma_v^2\tau_v^2 L_v\sigma_v^2\chi^2(1-m/n) + \frac{L_u\gamma_u^2\tau_u^2}{m}\big(\sigma_u^2 + 3\delta^2(1-m/n)\big) + 8\gamma_v^2\tau_v^2 L_v\chi^2(1-m/n)\,\Delta^{(t)}_v.
\]
Note that we used the fact that $24 L_u\gamma_u\tau_u(1+\rho^2) \le 1$ to simplify the coefficients of some of the terms above. The second term $T'_{2,u}$ has also been referred to as client drift in the literature; we bound it with Lemma 22 and invoke the assumption on gradient diversity (Assumption 3) to get
\begin{align*}
T'_{2,u} &\le 16\gamma_u^3 L_u^2\tau_u(\tau_u-1)\,\frac1n\sum_{i=1}^n \mathbb{E}_t\big\|\nabla_u F_i(u^{(t)}, \tilde v^{(t+1)}_i)\big\|^2 + 8\gamma_u^3 L_u^2\tau_u^2(\tau_u-1)\sigma_u^2 \\
&\le 16\gamma_u^3 L_u^2\tau_u(\tau_u-1)\Big(\delta^2 + \rho^2\,\mathbb{E}_t\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2\Big) + 8\gamma_u^3 L_u^2\tau_u^2(\tau_u-1)\sigma_u^2.
\end{align*}
Plugging this back in, we get
\[
\mathbb{E}_t\big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t+1)})\big] \le -\frac{\gamma_u\tau_u}{8}\,\mathbb{E}_t\big[\widetilde\Delta^{(t)}_u\big] + \frac{L_u\gamma_u^2\tau_u^2}{m}\big(\sigma_u^2 + 2\delta^2(1-m/n)\big) + 4\gamma_v^2\tau_v^2 L_v\sigma_v^2\chi^2(1-m/n) + 8\gamma_v^2\tau_v^2 L_v\chi^2(1-m/n)\,\Delta^{(t)}_v + 8\gamma_u^3 L_u^2\tau_u^2(\tau_u-1)\big(\sigma_u^2 + 2\delta^2\big).
\]
Note that we used $128\gamma_u^2 L_u^2\tau_u(\tau_u-1)\rho^2 \le 1$, which is implied by $24 L_u\gamma_u\tau_u(1+\rho^2) \le 1$.

Descent Bound with the Virtual Iterates. We plug this analysis of the $u$-step and Claim 9 for the $v$-step into (13). We also simplify some coefficients using $128\gamma_v\tau_v L_v\chi^2(n/m - 1) \le 1$. This gives
\[
\mathbb{E}_t\big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] \le -\frac{\gamma_u\tau_u}{8}\,\mathbb{E}_t\big[\widetilde\Delta^{(t)}_u\big] - \frac{\gamma_v\tau_v m}{16n}\,\mathbb{E}_t\big[\Delta^{(t)}_v\big] + 4\gamma_v^2 L_v\tau_v^2\sigma_v^2\Big(\frac{m}{n} + \chi^2(1-m/n)\Big) + \frac{\gamma_u^2 L_u\tau_u^2}{m}\big(\sigma_u^2 + 2\delta^2(1-m/n)\big) + 8\gamma_u^3 L_u^2\tau_u^2(\tau_u-1)\big(\sigma_u^2 + 2\delta^2\big) + \frac{4\gamma_v^3 L_v^2\tau_v^2(\tau_v-1)\sigma_v^2 m}{n}.
\]
Taking an unconditional expectation, summing over $t = 0$ to $T-1$, and rearranging gives
\[
\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{\gamma_u\tau_u}{8}\,\mathbb{E}\big[\widetilde\Delta^{(t)}_u\big] + \frac{\gamma_v\tau_v m}{16n}\,\mathbb{E}\big[\Delta^{(t)}_v\big]\right) \le \frac{F_0}{T} + 4\gamma_v^2 L_v\tau_v^2\sigma_v^2\Big(\frac{m}{n} + \chi^2(1-m/n)\Big) + \frac{\gamma_u^2 L_u\tau_u^2}{m}\big(\sigma_u^2 + 2\delta^2(1-m/n)\big) + 8\gamma_u^3 L_u^2\tau_u^2(\tau_u-1)\big(\sigma_u^2 + 2\delta^2\big) + \frac{4\gamma_v^3 L_v^2\tau_v^2(\tau_v-1)\sigma_v^2 m}{n}. \tag{14}
\]
This is a bound in terms of the virtual iterates $\widetilde V^{(t+1)}$. However, we wish to show a bound in terms of the actual iterates $V^{(t)}$.

Obtaining the Final Bound. It remains to relate $\widetilde\Delta^{(t)}_u$ to $\Delta^{(t)}_u$. Using the Cauchy-Schwarz inequality and smoothness, we have
\[
\mathbb{E}_t\big\|\nabla_u F(u^{(t)}, V^{(t)}) - \nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2 \le \frac{\chi^2 L_u L_v}{n}\sum_{i=1}^n \mathbb{E}_t\big\|\tilde v^{(t+1)}_i - v^{(t)}_i\big\|^2 \le \frac{\chi^2 L_u L_v}{n}\sum_{i=1}^n\Big(16\gamma_v^2\tau_v^2\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + 8\gamma_v^2\tau_v^2\sigma_v^2\Big) = 8\gamma_v^2\tau_v^2\sigma_v^2\chi^2 L_u L_v + 16\gamma_v^2\tau_v^2\chi^2 L_u L_v\,\Delta^{(t)}_v,
\]
where the last inequality follows from Lemma 23. Using
\[
\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 \le 2\big\|\nabla_u F(u^{(t)}, V^{(t)}) - \nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2 + 2\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2,
\]
we get
\[
\mathbb{E}\big[\Delta^{(t)}_u\big] \le 2\,\mathbb{E}\big[\widetilde\Delta^{(t)}_u\big] + 16\gamma_v^2\tau_v^2\sigma_v^2\chi^2 L_u L_v + 32\gamma_v^2\tau_v^2\chi^2 L_u L_v\,\mathbb{E}\big[\Delta^{(t)}_v\big].
\]
Therefore,
\[
\frac{\gamma_u\tau_u}{16}\,\mathbb{E}\big[\Delta^{(t)}_u\big] + \frac{\gamma_v\tau_v m}{32n}\,\mathbb{E}\big[\Delta^{(t)}_v\big] \le \frac{\gamma_u\tau_u}{8}\,\mathbb{E}\big[\widetilde\Delta^{(t)}_u\big] + \frac{\gamma_v\tau_v m}{32n}\Big(1 + \frac{64\eta^2\chi^2 n}{m}\Big)\mathbb{E}\big[\Delta^{(t)}_v\big] + \gamma_u\tau_u\gamma_v^2\tau_v^2\sigma_v^2\chi^2 L_u L_v \le \frac{\gamma_u\tau_u}{8}\,\mathbb{E}\big[\widetilde\Delta^{(t)}_u\big] + \frac{\gamma_v\tau_v m}{16n}\,\mathbb{E}\big[\Delta^{(t)}_v\big] + \gamma_u\tau_u\gamma_v^2\tau_v^2\sigma_v^2\chi^2 L_u L_v,
\]
where we used $32\eta^2\chi^2 n/m \le 1/2$, which is one of the conditions we assume on $\eta$. Summing this up and plugging in (14) gives
\[
\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{\gamma_u\tau_u}{16}\,\mathbb{E}\big[\Delta^{(t)}_u\big] + \frac{\gamma_v\tau_v m}{32n}\,\mathbb{E}\big[\Delta^{(t)}_v\big]\right) \le \frac{F_0}{T} + 4\gamma_v^2 L_v\tau_v^2\sigma_v^2\Big(\frac{m}{n} + \chi^2(1-m/n)\Big) + \frac{\gamma_u^2 L_u\tau_u^2}{m}\big(\sigma_u^2 + 2\delta^2(1-m/n)\big) + 8\gamma_u^3 L_u^2\tau_u^2(\tau_u-1)\big(\sigma_u^2 + 2\delta^2\big) + \frac{4\gamma_v^3 L_v^2\tau_v^2(\tau_v-1)\sigma_v^2 m}{n} + \gamma_u\tau_u\gamma_v^2\tau_v^2\sigma_v^2\chi^2 L_u L_v.
\]
Plugging in $\gamma_u = \eta/(L_u\tau_u)$ and $\gamma_v = \eta/(L_v\tau_v)$ completes the proof.

The analysis of each of the terms in the $u$-step is given in the following claims.

Claim 6 (Bounding $T_{1,u}$). We have
\[
\mathbb{E}_t[T_{1,u}] \le -\frac{\gamma_u\tau_u}{2}\,\mathbb{E}_t\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2 + \frac{\gamma_u L_u^2}{n}\sum_{i=1}^n\sum_{k=0}^{\tau_u-1}\mathbb{E}_t\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2.
\]
Proof. For $i \in S^{(t)}$, we have $\tilde u^{(t)}_{i,k} = u^{(t)}_{i,k}$.
Therefore,
\[
\mathbb{E}_t[T_{1,u}] = -\gamma_u\,\mathbb{E}_t\Big\langle \nabla_u F(u^{(t)}, \widetilde V^{(t+1)}),\ \frac{1}{m}\sum_{i\in S^{(t)}}\sum_{k=0}^{\tau_u-1}\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t+1)}_i)\Big\rangle.
\]
Using that $\tilde u^{(t)}_{i,k}$ is independent of $S^{(t)}$, we get
\[
\mathbb{E}_t[T_{1,u}] = -\gamma_u\,\mathbb{E}_t\Big\langle \nabla_u F(u^{(t)}, \widetilde V^{(t+1)}),\ \frac{1}{n}\sum_{i=1}^n\sum_{k=0}^{\tau_u-1}\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t+1)}_i)\Big\rangle = -\gamma_u\tau_u\,\mathbb{E}_t\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2 - \gamma_u\sum_{k=0}^{\tau_u-1}\mathbb{E}_t\Big\langle \nabla_u F(u^{(t)}, \widetilde V^{(t+1)}),\ \frac{1}{n}\sum_{i=1}^n\Big(\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t+1)}_i) - \nabla_u F_i(u^{(t)}, \tilde v^{(t+1)}_i)\Big)\Big\rangle.
\]
Invoking $\langle x, y\rangle \le \|x\|^2/2 + \|y\|^2/2$ for vectors $x, y$ followed by smoothness completes the proof.

Claim 7 (Bounding $T_{2,u}$). We have
\[
\mathbb{E}_t[T_{2,u}] \le \frac{L_u\gamma_u^2\tau_u^2\sigma_u^2}{m} + 3L_u\gamma_u^2\tau_u^2\Big(1 + \frac{\rho^2(1-m/n)}{m}\Big)\mathbb{E}_t\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2 + \frac{3L_u^2\gamma_u^2\tau_u}{n}\sum_{i=1}^n\sum_{k=0}^{\tau_u-1}\mathbb{E}_t\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + \frac{6L_u\gamma_u^2\tau_u^2\delta^2}{m}(1-m/n).
\]
Proof. We use $\mathbb{E}\|z\|^2 = \|\mathbb{E} z\|^2 + \mathbb{E}\|z - \mathbb{E} z\|^2$ for a random vector $z$ to get
\[
\mathbb{E}_t[T_{2,u}] \le \frac{L_u\gamma_u^2\tau_u^2\sigma_u^2}{m} + L_u\gamma_u^2\tau_u\sum_{k=0}^{\tau_u-1}\mathbb{E}_t\underbrace{\Big\|\frac{1}{m}\sum_{i\in S^{(t)}}\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t+1)}_i)\Big\|^2}_{=:T'_k}.
\]
We break the term $T'_k$ as
\[
T'_k \le 3\Big\|\frac{1}{m}\sum_{i\in S^{(t)}}\Big(\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t+1)}_i) - \nabla_u F_i(u^{(t)}, \tilde v^{(t+1)}_i)\Big)\Big\|^2 + 3\Big\|\frac{1}{m}\sum_{i\in S^{(t)}}\nabla_u F_i(u^{(t)}, \tilde v^{(t+1)}_i) - \nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\Big\|^2 + 3\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2.
\]
For the first term, we use Jensen's inequality to take the squared norm inside the sum, then use smoothness and take an expectation over the sampling of devices to get the bound $\frac{L_u^2}{n}\sum_{i=1}^n \mathbb{E}_t\|\tilde u^{(t)}_{i,k} - u^{(t)}\|^2$. For the second term, we use the fact that $S^{(t)}$ is sampled without replacement (cf. Lemma 21) and invoke the gradient diversity assumption (Assumption 3) to get
\[
\mathbb{E}_t\Big\|\frac{1}{m}\sum_{i\in S^{(t)}}\nabla_u F_i(u^{(t)}, \tilde v^{(t+1)}_i) - \nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\Big\|^2 \le \frac{1-m/n}{m}\cdot\frac1n\sum_{i=1}^n\big\|\nabla_u F_i(u^{(t)}, \tilde v^{(t+1)}_i) - \nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2 \le \frac{1-m/n}{m}\Big(\delta^2 + \rho^2\,\mathbb{E}_t\big\|\nabla_u F(u^{(t)}, \widetilde V^{(t+1)})\big\|^2\Big).
\]
Plugging these terms back into $T'_k$ and $\mathbb{E}_t[T_{2,u}]$ completes the proof.

Claim 8 (Bounding $T_{3,u}$). We have
\[
\mathbb{E}_t[T_{3,u}] \le 8\gamma_v^2\tau_v^2 L_v\chi^2\Big(1-\frac{m}{n}\Big)\Delta^{(t)}_v + 4\chi^2\gamma_v^2\tau_v^2 L_v\sigma_v^2\Big(1-\frac{m}{n}\Big).
\]
Proof. Since $\tilde v^{(t+1)}_i = v^{(t+1)}_i$ for $i \in S^{(t)}$ and $v^{(t+1)}_i = v^{(t)}_i$ for $i \notin S^{(t)}$, we have
\[
T_{3,u} = \frac{\chi^2 L_v}{n}\sum_{i\notin S^{(t)}}\big\|\tilde v^{(t+1)}_i - v^{(t)}_i\big\|^2.
\]
Since $\|\tilde v^{(t+1)}_i - v^{(t)}_i\|^2$ is independent of $S^{(t)}$, we can take an expectation to get
\[
\mathbb{E}_t[T_{3,u}] = \frac{\chi^2 L_v}{n}\sum_{i=1}^n \mathbb{P}\big(i\notin S^{(t)}\big)\,\mathbb{E}_t\big\|\tilde v^{(t+1)}_i - v^{(t)}_i\big\|^2 = \chi^2 L_v\Big(1-\frac{m}{n}\Big)\frac1n\sum_{i=1}^n \mathbb{E}_t\big\|\tilde v^{(t+1)}_i - v^{(t)}_i\big\|^2.
\]
Plugging in Lemma 23 completes the proof.

The analysis of the $v$-step is given in the next result.

Claim 9. Consider the setting of Theorem 3 and assume that $\gamma_v\tau_v L_v \le 1/8$. We have
\[
\mathbb{E}_t\big[F(u^{(t)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] \le -\frac{\gamma_v\tau_v m}{8n}\,\Delta^{(t)}_v + \frac{\gamma_v^2\tau_v^2 L_v\sigma_v^2 m}{2n} + \frac{4\gamma_v^3 L_v^2\tau_v^2(\tau_v-1)\sigma_v^2 m}{n}.
\]
Proof. From smoothness, we get
\[
F_i(u^{(t)}, \tilde v^{(t+1)}_i) - F_i(u^{(t)}, v^{(t)}_i) \le \underbrace{\big\langle \nabla_v F_i(u^{(t)}, v^{(t)}_i),\, \tilde v^{(t+1)}_i - v^{(t)}_i\big\rangle}_{=:T_{1,v}} + \underbrace{\frac{L_v}{2}\big\|\tilde v^{(t+1)}_i - v^{(t)}_i\big\|^2}_{=:T_{2,v}}.
\]
We bound the first term as
\begin{align*}
\mathbb{E}_t[T_{1,v}] &= -\gamma_v\,\mathbb{E}_t\Big\langle \nabla_v F_i(u^{(t)}, v^{(t)}_i),\ \sum_{k=0}^{\tau_v-1}\nabla_v F_i(u^{(t)}, \tilde v^{(t)}_{i,k})\Big\rangle \\
&= -\gamma_v\tau_v\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 - \gamma_v\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\langle \nabla_v F_i(u^{(t)}, v^{(t)}_i),\, \nabla_v F_i(u^{(t)}, \tilde v^{(t)}_{i,k}) - \nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\rangle \\
&\le -\frac{\gamma_v\tau_v}{2}\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \frac{\gamma_v}{2}\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\|\nabla_v F_i(u^{(t)}, \tilde v^{(t)}_{i,k}) - \nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 \\
&\le -\frac{\gamma_v\tau_v}{2}\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \frac{\gamma_v L_v^2}{2}\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2.
\end{align*}
Next, we observe that
\[
\mathbb{E}_z\big\|G_{i,v}(u, v_i, z)\big\|^2 = \big\|\nabla_v F_i(u, v_i)\big\|^2 + \mathbb{E}_z\big\|G_{i,v}(u, v_i, z) - \nabla_v F_i(u, v_i)\big\|^2 \le \big\|\nabla_v F_i(u, v_i)\big\|^2 + \sigma_v^2.
\]
We invoke this inequality to handle the second term as
\begin{align*}
\mathbb{E}_t[T_{2,v}] &\le \frac{\gamma_v^2 L_v\tau_v}{2}\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\|G_{i,v}(u^{(t)}, \tilde v^{(t)}_{i,k}, z^{(t)}_{i,k})\big\|^2 \le \frac{\gamma_v^2 L_v\tau_v^2\sigma_v^2}{2} + \frac{\gamma_v^2 L_v\tau_v}{2}\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\|\nabla_v F_i(u^{(t)}, \tilde v^{(t)}_{i,k})\big\|^2 \\
&\le \frac{\gamma_v^2 L_v\tau_v^2\sigma_v^2}{2} + \gamma_v^2 L_v\tau_v^2\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \gamma_v^2 L_v^3\tau_v\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2.
\end{align*}
Plugging these bounds for $T_{1,v}$ and $T_{2,v}$ into the initial smoothness bound and using $\gamma_v L_v\tau_v \le 1/4$ gives
\[
\mathbb{E}_t\big[F_i(u^{(t)}, \tilde v^{(t+1)}_i) - F_i(u^{(t)}, v^{(t)}_i)\big] \le -\frac{\gamma_v\tau_v}{4}\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \gamma_v L_v^2\sum_{k=0}^{\tau_v-1}\mathbb{E}_t\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2 + \frac{\gamma_v^2 L_v\tau_v^2\sigma_v^2}{2}.
\]
We invoke Lemma 22 to bound the $\sum_k\mathbb{E}_t\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\|^2$ term, also known as client drift, and simplify some coefficients using $8\gamma_v\tau_v L_v \le 1$ to get
\[
\mathbb{E}_t\big[F_i(u^{(t)}, \tilde v^{(t+1)}_i) - F_i(u^{(t)}, v^{(t)}_i)\big] \le -\frac{\gamma_v\tau_v}{8}\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \frac{\gamma_v^2 L_v\tau_v^2\sigma_v^2}{2} + 4\gamma_v^3 L_v\tau_v^2(\tau_v-1)\sigma_v^2.
\]
It remains to invoke that $S^{(t)}$ is a uniformly random sample of $m$ devices from $\{1,\dots,n\}$ and that $\tilde v^{(t+1)}_i$ is independent of $S^{(t)}$. To this end, note that
\[
\mathbb{E}_t\big[F(u^{(t)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] = \mathbb{E}_t\Big[\frac1n\sum_{i\in S^{(t)}}\big(F_i(u^{(t)}, \tilde v^{(t+1)}_i) - F_i(u^{(t)}, v^{(t)}_i)\big)\Big] = \frac{m}{n}\cdot\frac1n\sum_{i=1}^n\mathbb{E}_t\big[F_i(u^{(t)}, \tilde v^{(t+1)}_i) - F_i(u^{(t)}, v^{(t)}_i)\big].
\]
Plugging in the previous bound completes the proof.

Remark 10. We only invoked the partial gradient diversity assumption (Assumption 3) at the (virtual) iterates $(u^{(t)}, \widetilde V^{(t+1)})$; therefore, it suffices if the assumption only holds at iterates $(u^{(t)}, \widetilde V^{(t+1)})$ generated by FedAlt, rather than at all $(u, V)$.
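Claim 7 above (and Claim 17 below) relies on the variance identity for sampling without replacement, stated as Lemma 21 in A.5. As a sanity check, the following small Monte Carlo experiment (ours, purely illustrative) verifies the identity numerically.

```python
import numpy as np

# Numerical check of Lemma 21: for a without-replacement sample S of size m,
# E||(1/m) sum_{i in S} a_i - abar||^2 = (n-m)/(m(n-1)) * (1/n) sum ||a_i - abar||^2.
rng = np.random.default_rng(0)
n, m, d = 20, 5, 3
a = rng.normal(size=(n, d))
abar = a.mean(axis=0)
sigma2 = ((a - abar) ** 2).sum(axis=1).mean()   # (1/n) sum_i ||a_i - abar||^2

trials, est = 50_000, 0.0
for _ in range(trials):
    S = rng.choice(n, size=m, replace=False)
    est += ((a[S].mean(axis=0) - abar) ** 2).sum()
est /= trials

pred = (n - m) / (m * (n - 1)) * sigma2
print(f"Monte Carlo: {est:.4f}   Lemma 21: {pred:.4f}")  # should agree closely
```

Note the correction factor $(n-m)/(n-1)$ relative to sampling with replacement, which is what produces the $(1-m/n)/m$ factors in Claims 7 and 17.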
Algorithm 5 FedSim: Simultaneous updates of shared and personal parameters
1: Input: Initial iterates $u^{(0)}, V^{(0)}$; number of communication rounds $T$; number of devices per round $m$; number of local updates $\tau$; local step sizes $\gamma_u, \gamma_v$
2: for $t = 0, 1, \dots, T-1$ do
3:   Sample $m$ devices from $[n]$ without replacement into $S^{(t)}$
4:   for each selected device $i \in S^{(t)}$ in parallel do
5:     Initialize $v^{(t)}_{i,0} = v^{(t)}_i$ and $u^{(t)}_{i,0} = u^{(t)}$
6:     for $k = 0, \dots, \tau - 1$ do
7:       // Update all parameters jointly
8:       Sample data $z^{(t)}_{i,k} \sim D_i$
9:       $v^{(t)}_{i,k+1} = v^{(t)}_{i,k} - \gamma_v\, G_{i,v}(u^{(t)}_{i,k}, v^{(t)}_{i,k}, z^{(t)}_{i,k})$
10:      $u^{(t)}_{i,k+1} = u^{(t)}_{i,k} - \gamma_u\, G_{i,u}(u^{(t)}_{i,k}, v^{(t)}_{i,k}, z^{(t)}_{i,k})$
11:    Update $v^{(t+1)}_i = v^{(t)}_{i,\tau}$ and $u^{(t+1)}_i = u^{(t)}_{i,\tau}$
12:  Update $u^{(t+1)} = \sum_{i\in S^{(t)}} \alpha_i u^{(t+1)}_i \big/ \sum_{i\in S^{(t)}} \alpha_i$ at the server with secure aggregation
13: return $u^{(T)}, v^{(T)}_1, \dots, v^{(T)}_n$

A.4. Convergence Analysis of FedSim

We give the full form of FedSim in Algorithm 5 for the general case of unequal $\alpha_i$'s, but focus on $\alpha_i = 1/n$ for the analysis. Theorem 2 of the main paper is a simplification of Corollary 12 below, which in turn is proved based on Theorem 11. Throughout this section, we use the constants
\[
\sigma^2_{\mathrm{sim},1} = (1+\chi^2)\,\frac{\delta^2 + \sigma_u^2}{L_u} + \frac{\sigma_v^2 m}{L_v n}, \qquad
\sigma^2_{\mathrm{sim},2} = (1+\chi^2)\,\frac{\delta^2}{L_u} + \frac{\sigma_u^2}{L_u} + \frac{\sigma_v^2}{L_v}.
\]

Theorem 11 (Convergence of FedSim). Suppose Assumptions 1, 2 and 3 hold and the learning rates in FedSim are chosen as $\gamma_u = \eta/(L_u\tau)$ and $\gamma_v = \eta/(L_v\tau)$, with
\[
\eta \le \min\left\{ \frac{1}{12(1+\chi^2)(1+\rho^2)},\ \sqrt{\frac{m/n}{196(1-\tau^{-1})(1+\chi^2)(1+\rho^2)}} \right\}.
\]
Then, ignoring absolute constants, we have
\[
\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{1}{L_u}\,\mathbb{E}\big[\Delta^{(t)}_u\big] + \frac{m}{nL_v}\,\mathbb{E}\big[\Delta^{(t)}_v\big]\right) \le \frac{F_0}{\eta T} + \eta\,\sigma^2_{\mathrm{sim},1} + \eta^2\,\sigma^2_{\mathrm{sim},2}.
\]

Before proving the theorem, we give the following corollary with optimized learning rates.

Corollary 12 (Final Rate of FedSim). Consider the setting of Theorem 11 and let the total number of rounds $T$ be known in advance. Suppose we set the learning rates $\gamma_u = \eta/(\tau L_u)$ and $\gamma_v = \eta/(\tau L_v)$, where (ignoring absolute constants)
\[
\eta = \min\left\{ \left(\frac{F_0}{T\,\sigma^2_{\mathrm{sim},1}}\right)^{1/2},\ \left(\frac{F_0^2}{T^2\,\sigma^2_{\mathrm{sim},2}}\right)^{1/3},\ \frac{1}{(1+\chi^2)(1+\rho^2)},\ \sqrt{\frac{m/n}{(1-\tau^{-1})(1+\chi^2)(1+\rho^2)}} \right\}.
\]
We have, ignoring absolute constants,
\[
\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{1}{L_u}\,\mathbb{E}\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac{m}{L_v n^2}\sum_{i=1}^n \mathbb{E}\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2\right)
\le \left(\frac{F_0\,\sigma^2_{\mathrm{sim},1}}{T}\right)^{1/2} + \frac{\big(F_0^2\,\sigma^2_{\mathrm{sim},2}\big)^{1/3}}{T^{2/3}} + \frac{F_0(1+\chi^2)(1+\rho^2)}{T} + \frac{F_0}{T}\sqrt{\frac{n(1-\tau^{-1})(1+\chi^2)(1+\rho^2)}{m}}.
\]
Proof. The proof follows from invoking Lemma 25 on the bound of Theorem 11.

Remark 13 (Asymptotic Rate). The asymptotic $1/\sqrt{T}$ rate of Theorem 2 is achieved when the $1/T$ term is dominated by the $1/\sqrt{T}$ term. This happens when (ignoring absolute constants)
\[
T \gtrsim \frac{F_0(1+\chi^2)(1+\rho^2)}{\sigma^2_{\mathrm{sim},1}}\,\max\Big\{(1-\tau^{-1})\,\frac{n}{m},\ (1+\chi^2)(1+\rho^2)\Big\}.
\]
Note that $T \ge \Omega(n/m)$ is necessary for each device to be seen at least once on average, or the personal parameters of some devices will never be updated.
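For contrast with the FedAlt sketch given earlier, here is a minimal illustrative rendering of one FedSim round (cf. Algorithm 5). Again, the toy quadratic losses and all names are our assumptions, not the paper's implementation; the point is that both gradients are evaluated at the same iterate before either block moves.

```python
import numpy as np

# Illustrative single FedSim round on toy quadratics F_i = 0.5*||u + v_i - c_i||^2.
def fedsim_round(u, V, c, S, tau=5, gamma_u=0.05, gamma_v=0.05, rng=None):
    """Jointly update (u_i, v_i) on sampled devices; aggregate u at the server."""
    rng = rng or np.random.default_rng()
    d = u.shape[0]
    u_locals = []
    for i in S:
        ui, vi = u.copy(), V[i].copy()
        for _ in range(tau):
            g = (ui + vi - c[i]) + 0.01 * rng.normal(size=d)  # shared stochastic grad
            ui, vi = ui - gamma_u * g, vi - gamma_v * g       # simultaneous step
        V[i] = vi                                             # personal state on device
        u_locals.append(ui)
    return np.mean(u_locals, axis=0), V                       # aggregated u, updated V
```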
We now prove Theorem 11.

Proof of Theorem 11. The proof mainly applies the smoothness upper bound to write out a descent condition with suitably small noise terms. We start with some notation.

Notation. Let $\mathcal F^{(t)}$ denote the $\sigma$-algebra generated by $(u^{(t)}, V^{(t)})$ and denote $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot\,|\mathcal F^{(t)}]$. For all devices, including those not selected in each round, we define virtual sequences $\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}$ as the SGD updates of Algorithm 5 run on every device, regardless of whether it is selected. For the selected devices $i \in S^{(t)}$, we have $(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) = (u^{(t)}_{i,k}, v^{(t)}_{i,k})$. Note that the random variables $\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}$ are independent of the device selection $S^{(t)}$. The updates for the devices $i \in S^{(t)}$ are given by
\[
v^{(t+1)}_i = v^{(t)}_i - \gamma_v\sum_{k=0}^{\tau-1} G_{i,v}\big(u^{(t)}_{i,k}, v^{(t)}_{i,k}, z^{(t)}_{i,k}\big),
\]
and the server update is given by
\[
u^{(t+1)} = u^{(t)} - \frac{\gamma_u}{m}\sum_{i\in S^{(t)}}\sum_{k=0}^{\tau-1} G_{i,u}\big(u^{(t)}_{i,k}, v^{(t)}_{i,k}, z^{(t)}_{i,k}\big). \tag{15}
\]

Proof Outline. We use the smoothness of $F_i$, more precisely Lemma 20, to obtain
\[
F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)}) \le \underbrace{\big\langle \nabla_u F(u^{(t)}, V^{(t)}),\, u^{(t+1)} - u^{(t)}\big\rangle}_{=:T_{1,u}} + \underbrace{\frac1n\sum_{i=1}^n\big\langle \nabla_v F_i(u^{(t)}, v^{(t)}_i),\, v^{(t+1)}_i - v^{(t)}_i\big\rangle}_{=:T_{1,v}} + \underbrace{\frac{L_u(1+\chi^2)}{2}\big\|u^{(t+1)} - u^{(t)}\big\|^2}_{=:T_{2,u}} + \underbrace{\frac{L_v(1+\chi^2)}{2n}\sum_{i=1}^n\big\|v^{(t+1)}_i - v^{(t)}_i\big\|^2}_{=:T_{2,v}}. \tag{16}
\]
Our goal is to bound each of these terms to get a descent condition of the form
\[
\mathbb{E}_t\big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] \le -\frac{\gamma_u\tau}{4}\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 - \frac{\gamma_v\tau m}{4n}\cdot\frac1n\sum_{i=1}^n\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + O(\gamma_u^2 + \gamma_v^2),
\]
where the $O(\gamma_u^2 + \gamma_v^2)$ terms are controlled using the bounded variance and gradient diversity assumptions. Telescoping this descent condition gives the final bound.

Main Proof. Towards this end, we prove non-asymptotic bounds on each of the terms $T_{1,v}, T_{1,u}, T_{2,v}$ and $T_{2,u}$ in Claims 14 to 17 respectively. We then invoke them to get the bound
\begin{align}
\mathbb{E}_t\big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] &\le -\frac{\gamma_u\tau}{4}\,\Delta^{(t)}_u - \frac{\gamma_v\tau m}{4n}\,\Delta^{(t)}_v + L_u(1+\chi^2)\gamma_u^2\tau^2\,\frac{\sigma_u^2 + 12\delta^2(1-m/n)}{m} + \frac{L_v(1+\chi^2)\gamma_v^2\tau^2\sigma_v^2 m}{2n} \notag\\
&\quad + \frac1n\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2\Big(L_u^2\gamma_u + \frac{m}{n}\,\chi^2 L_u L_v\gamma_v\Big) + \frac1n\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big(\frac{m}{n}\,L_v^2\gamma_v + \chi^2 L_u L_v\gamma_u\Big). \tag{17}
\end{align}
Note that we simplified some constants appearing on the gradient norm terms using $\gamma_u \le \big(12 L_u(1+\chi^2)(1+\rho^2)\tau\big)^{-1}$ and $\gamma_v \le \big(6 L_v(1+\chi^2)\tau\big)^{-1}$. Our next step is to bound the last two sums of (17) with Lemma 18 and invoke the gradient diversity assumption (Assumption 3) as
\[
\frac1n\sum_{i=1}^n\big\|\nabla_u F_i(u^{(t)}, v^{(t)}_i)\big\|^2 \le \delta^2 + (1+\rho^2)\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2.
\]
This gives, after plugging in the learning rates $\gamma_u = c/(L_u\tau)$, $\gamma_v = c/(L_v\tau)$ and further simplifying the constants,
\[
\mathbb{E}_t\big[F(u^{(t+1)}, V^{(t+1)}) - F(u^{(t)}, V^{(t)})\big] \le -\frac{c\,\Delta^{(t)}_u}{8L_u} - \frac{c\,m\,\Delta^{(t)}_v}{8L_v n} + c^2(1+\chi^2)\Big(\frac{\sigma_u^2}{2L_u} + \frac{m\sigma_v^2}{nL_v} + \frac{6\delta^2}{L_u}\Big) + c^3(1+\chi^2)(1-\tau^{-1})\Big(\frac{24\delta^2}{L_u} + \frac{4\sigma_u^2}{L_u} + \frac{4\sigma_v^2}{L_v}\Big).
\]
Taking a full expectation, telescoping the series over $t = 0, \dots, T-1$ and rearranging the resulting terms gives the desired bound in Theorem 11.

Claim 14 (Bounding $T_{1,v}$). Let $T_{1,v}$ be defined as in (16). We have
\[
\mathbb{E}_t[T_{1,v}] \le -\frac{\gamma_v\tau m}{2n}\cdot\frac1n\sum_{i=1}^n\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \frac{\gamma_v m}{n}\cdot\frac1n\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big(\chi^2 L_u L_v\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + L_v^2\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big).
\]
Proof. Define $T_{1,v,i}$ to be the contribution of the $i$th term to $T_{1,v}$. For $i \notin S^{(t)}$, we have $T_{1,v,i} = 0$, since $v^{(t+1)}_i = v^{(t)}_i$. On the other hand, for $i \in S^{(t)}$, we use the unbiasedness of the gradient estimator $G_{i,v}$ and the independence of $z^{(t)}_{i,k}$ from $(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})$ to get
\begin{align*}
\mathbb{E}_t[T_{1,v,i}] &= -\gamma_v\sum_{k=0}^{\tau-1}\mathbb{E}_t\big\langle \nabla_v F_i(u^{(t)}, v^{(t)}_i),\, \nabla_v F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})\big\rangle \\
&= -\gamma_v\tau\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 - \gamma_v\sum_{k=0}^{\tau-1}\mathbb{E}_t\big\langle \nabla_v F_i(u^{(t)}, v^{(t)}_i),\, \nabla_v F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) - \nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\rangle \\
&\le -\frac{\gamma_v\tau}{2}\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \frac{\gamma_v}{2}\sum_{k=0}^{\tau-1}\mathbb{E}_t\big\|\nabla_v F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) - \nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2. \tag{18}
\end{align*}
For the second term of (18), we add and subtract $\nabla_v F_i(u^{(t)}, \tilde v^{(t)}_{i,k})$ and use smoothness to get
\[
\big\|\nabla_v F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) - \nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 \le 2\chi^2 L_u L_v\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + 2L_v^2\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2. \tag{19}
\]
Since the right-hand side of this bound is independent of $S^{(t)}$, we get
\[
\mathbb{E}_t[T_{1,v}] = \mathbb{E}_t\Big[\frac1n\sum_{i\in S^{(t)}} T_{1,v,i}\Big] = \frac{m}{n}\cdot\frac1n\sum_{i=1}^n\mathbb{E}_t[T_{1,v,i}],
\]
and plugging in (18) and (19) completes the proof.

Claim 15 (Bounding $T_{1,u}$). Consider $T_{1,u}$ defined in (16). We have the bound
\[
\mathbb{E}_t[T_{1,u}] \le -\frac{\gamma_u\tau}{2}\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac{\gamma_u}{n}\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big(L_u^2\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + \chi^2 L_u L_v\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big).
\]
Proof. Due to the independence of $S^{(t)}$ from $(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})$, we have
\[
\mathbb{E}_t\big[u^{(t+1)} - u^{(t)}\big] = -\gamma_u\,\mathbb{E}_t\Big[\frac1m\sum_{i\in S^{(t)}}\sum_{k=0}^{\tau-1}\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})\Big] = -\frac{\gamma_u}{n}\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\big[\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})\big],
\]
where the last equality took an expectation over $S^{(t)}$. Now, using the same sequence of arguments as in Claim 14,
\begin{align*}
\mathbb{E}_t\big\langle \nabla_u F(u^{(t)}, V^{(t)}),\, u^{(t+1)} - u^{(t)}\big\rangle &= -\gamma_u\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big\langle \nabla_u F(u^{(t)}, V^{(t)}),\ \frac1n\sum_{i=1}^n\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})\Big\rangle \\
&\le -\frac{\gamma_u\tau}{2}\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac{\gamma_u}{2}\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big\|\frac1n\sum_{i=1}^n\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) - \nabla_u F(u^{(t)}, V^{(t)})\Big\|^2 \\
&\overset{(*)}{\le} -\frac{\gamma_u\tau}{2}\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac{\gamma_u}{n}\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big(L_u^2\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + \chi^2 L_u L_v\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big),
\end{align*}
where the inequality $(*)$ follows from Jensen's inequality as
\[
\Big\|\frac1n\sum_{i=1}^n\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) - \nabla_u F(u^{(t)}, V^{(t)})\Big\|^2 \le \frac1n\sum_{i=1}^n\big\|\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}) - \nabla_u F_i(u^{(t)}, v^{(t)}_i)\big\|^2.
\]

Claim 16 (Bounding $T_{2,v}$). Consider $T_{2,v}$ as defined in (16). We have the bound
\[
\mathbb{E}_t[T_{2,v}] \le \frac{3L_v(1+\chi^2)\gamma_v^2\tau^2 m}{2n^2}\sum_{i=1}^n\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + \frac{L_v(1+\chi^2)\gamma_v^2\tau^2 m\sigma_v^2}{2n} + \frac{3L_v(1+\chi^2)\gamma_v^2\tau m}{2n^2}\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big(L_v^2\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2 + \chi^2 L_u L_v\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2\Big).
\]
Proof. We start with
\begin{align*}
\mathbb{E}_t\big\|\tilde v^{(t)}_{i,\tau} - v^{(t)}_i\big\|^2 &= \gamma_v^2\,\mathbb{E}_t\Big\|\sum_{k=0}^{\tau-1}G_{i,v}(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}, z^{(t)}_{i,k})\Big\|^2 \le \gamma_v^2\tau\sum_{k=0}^{\tau-1}\mathbb{E}_t\big\|G_{i,v}(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k}, z^{(t)}_{i,k})\big\|^2 \\
&\le \gamma_v^2\tau^2\sigma_v^2 + 3\gamma_v^2\tau^2\big\|\nabla_v F_i(u^{(t)}, v^{(t)}_i)\big\|^2 + 3\gamma_v^2\tau\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big(L_v^2\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2 + \chi^2 L_u L_v\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2\Big).
\end{align*}
Using (a) $v^{(t+1)}_i = \tilde v^{(t)}_{i,\tau}$ for $i \in S^{(t)}$, and (b) the independence of $S^{(t)}$ from $(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})$, we get
\[
\mathbb{E}_t[T_{2,v}] = \frac{L_v(1+\chi^2)}{2n}\,\mathbb{E}_t\sum_{i\in S^{(t)}}\big\|\tilde v^{(t)}_{i,\tau} - v^{(t)}_i\big\|^2 = \frac{L_v(1+\chi^2)m}{2n}\cdot\frac1n\sum_{i=1}^n\mathbb{E}_t\big\|\tilde v^{(t)}_{i,\tau} - v^{(t)}_i\big\|^2.
\]
Plugging in the bound on $\mathbb{E}_t\|\tilde v^{(t)}_{i,\tau} - v^{(t)}_i\|^2$ completes the proof.

Claim 17 (Bounding $T_{2,u}$). Consider $T_{2,u}$ as defined in (16). We have
\[
\mathbb{E}_t[T_{2,u}] \le L_u(1+\chi^2)\gamma_u^2\tau^2\,\frac{\sigma_u^2 + 12\delta^2(1-m/n)}{m} + 3L_u(1+\chi^2)\gamma_u^2\tau^2(1+\rho^2)\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac{3L_u(1+\chi^2)\gamma_u^2\tau}{2n}\sum_{i=1}^n\sum_{k=0}^{\tau-1}\mathbb{E}_t\Big(L_u^2\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + \chi^2 L_u L_v\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big).
\]
Proof. We proceed with the first two inequalities as in the proof of Claim 16 to get
\[
\mathbb{E}_t\big\|u^{(t+1)} - u^{(t)}\big\|^2 \le \frac{\gamma_u^2\tau^2\sigma_u^2}{m} + \gamma_u^2\tau\sum_{k=0}^{\tau-1}\mathbb{E}_t\underbrace{\Big\|\frac1m\sum_{i\in S^{(t)}}\nabla_u F_i(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})\Big\|^2}_{=:T_{3,k}}.
\]
For $T_{3,k}$, (a) we add and subtract $\nabla_u F(u^{(t)}, V^{(t)})$ and $\nabla_u F_i(u^{(t)}, v^{(t)}_i)$, (b) invoke the squared triangle inequality, and (c) use smoothness to get
\[
\mathbb{E}_t[T_{3,k}] \le 6\,\mathbb{E}_t\Big\|\frac1m\sum_{i\in S^{(t)}}\nabla_u F_i(u^{(t)}, v^{(t)}_i) - \nabla_u F(u^{(t)}, V^{(t)})\Big\|^2 + 6\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac3n\sum_{i=1}^n\mathbb{E}_t\Big(L_u^2\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + \chi^2 L_u L_v\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big).
\]
For the first term, we use the fact that $S^{(t)}$ is obtained by sampling without replacement to apply Lemma 21 together with the gradient diversity assumption:
\[
\mathbb{E}_t\Big\|\frac1m\sum_{i\in S^{(t)}}\nabla_u F_i(u^{(t)}, v^{(t)}_i) - \nabla_u F(u^{(t)}, V^{(t)})\Big\|^2 \le \frac{1-m/n}{m}\cdot\frac1n\sum_{i=1}^n\big\|\nabla_u F_i(u^{(t)}, v^{(t)}_i) - \nabla_u F(u^{(t)}, V^{(t)})\big\|^2 \le \frac{1-m/n}{m}\Big(\delta^2 + \rho^2\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2\Big).
\]
Therefore,
\[
\mathbb{E}_t[T_{3,k}] \le \frac{12\delta^2(1-m/n)}{m} + 6(1+\rho^2)\big\|\nabla_u F(u^{(t)}, V^{(t)})\big\|^2 + \frac3n\sum_{i=1}^n\mathbb{E}_t\Big(L_u^2\big\|\tilde u^{(t)}_{i,k} - u^{(t)}\big\|^2 + \chi^2 L_u L_v\big\|\tilde v^{(t)}_{i,k} - v^{(t)}_i\big\|^2\Big),
\]
where we also used the independence between $S^{(t)}$ and $(\tilde u^{(t)}_{i,k}, \tilde v^{(t)}_{i,k})$. Plugging this into the expression for $\mathbb{E}_t\|u^{(t+1)} - u^{(t)}\|^2$ completes the proof.
Lemma 18. Let $F_i$ satisfy Assumptions 1-3, and consider the iterates
\[
u_{k+1} = u_k - \gamma_u\, G_{i,u}(u_k, v_k, z_k), \qquad v_{k+1} = v_k - \gamma_v\, G_{i,v}(u_k, v_k, z_k),
\]
for $k = 0, \dots, \tau-1$, where $z_k \sim D_i$. Suppose the learning rates satisfy $\gamma_u = c_u/(\tau L_u)$ and $\gamma_v = c_v/(\tau L_v)$ with
\[
c_u,\ c_v \le \frac{1}{\sqrt6\,\max\{1, \chi^2\}}.
\]
Further, define
\[
A = \gamma_u L_u^2 + f\chi^2\gamma_v L_u L_v, \qquad B = f\gamma_v L_v^2 + \chi^2\gamma_u L_u L_v,
\]
where $f \in (0, 1]$ is given. Then we have the bound
\[
\sum_{k=0}^{\tau-1}\mathbb{E}\big[A\|u_k - u_0\|^2 + B\|v_k - v_0\|^2\big] \le 4\tau^2(\tau-1)\big(\gamma_u^2\sigma_u^2 A + \gamma_v^2\sigma_v^2 B\big) + 12\tau^2(\tau-1)\big(\gamma_u^2 A\,\|\nabla_u F_i(u_0, v_0)\|^2 + \gamma_v^2 B\,\|\nabla_v F_i(u_0, v_0)\|^2\big).
\]
Proof. If $\tau = 1$, there is nothing to prove, so we assume $\tau > 1$. Let $\Delta_k := A\|u_k - u_0\|^2 + B\|v_k - v_0\|^2$ and denote by $\mathcal F_k$ the $\sigma$-algebra generated by $(u_k, v_k)$; further, let $\mathbb{E}_k[\cdot] = \mathbb{E}[\,\cdot\,|\mathcal F_k]$. We use the inequality $2\alpha\beta \le \alpha^2/\delta^2 + \delta^2\beta^2$ for reals $\alpha, \beta, \delta$ to get
\begin{align*}
\mathbb{E}_k\|u_{k+1} - u_0\|^2 &\le \Big(1 + \frac{1}{\tau-1}\Big)\|u_k - u_0\|^2 + \tau\gamma_u^2\,\mathbb{E}_k\|G_{i,u}(u_k, v_k, z_k)\|^2 \\
&\le \Big(1 + \frac{1}{\tau-1}\Big)\|u_k - u_0\|^2 + \tau\gamma_u^2\sigma_u^2 + \tau\gamma_u^2\|\nabla_u F_i(u_k, v_k)\|^2 \\
&\le \Big(1 + \frac{1}{\tau-1}\Big)\|u_k - u_0\|^2 + \tau\gamma_u^2\sigma_u^2 + 3\tau\gamma_u^2\|\nabla_u F_i(u_0, v_0)\|^2 + 3\tau\gamma_u^2 L_u^2\|u_k - u_0\|^2 + 3\tau\gamma_u^2\chi^2 L_u L_v\|v_k - v_0\|^2,
\end{align*}
where the last inequality followed from the squared triangle inequality (from adding and subtracting $\nabla_u F_i(u_0, v_k)$ and $\nabla_u F_i(u_0, v_0)$) followed by smoothness. Together with the analogous inequality for the $v$-update, we get
\[
\mathbb{E}_k[\Delta_{k+1}] \le \Big(1 + \frac{1}{\tau-1}\Big)\Delta_k + A'\|u_k - u_0\|^2 + B'\|v_k - v_0\|^2 + C,
\]
where
\[
A' = 3\tau\big(\gamma_u^2 L_u^2 A + \gamma_v^2\chi^2 L_u L_v B\big), \qquad B' = 3\tau\big(\gamma_v^2 L_v^2 B + \gamma_u^2\chi^2 L_u L_v A\big),
\]
\[
C = \tau\gamma_u^2\sigma_u^2 A + \tau\gamma_v^2\sigma_v^2 B + 3\tau\gamma_u^2 A\,\|\nabla_u F_i(u_0, v_0)\|^2 + 3\tau\gamma_v^2 B\,\|\nabla_v F_i(u_0, v_0)\|^2.
\]
Next, we apply Lemma 24 to get $A' \le A/\tau$ and $B' \le B/\tau$ under the assumed conditions on the learning rates; this allows us to write the right-hand side completely in terms of $\Delta_k$ and unroll the recurrence. The intuition behind Lemma 24 is as follows: ignoring the dependence on $\tau, L_u, L_v, \chi$ for a moment, if $\gamma_u$ and $\gamma_v$ are both $O(\eta)$, then $A', B'$ are both $O(\eta^3)$, while $A$ and $B$ are $O(\eta)$; thus, making $\eta$ small enough suffices to get $A' = O(A)$ and $B' = O(B)$. Concretely, Lemma 24 gives
\[
\mathbb{E}_k[\Delta_{k+1}] \le \Big(1 + \frac{2}{\tau-1}\Big)\mathbb{E}[\Delta_k] + C,
\]
and unrolling this recurrence gives, for $k \le \tau - 1$,
\[
\mathbb{E}[\Delta_k] \le C\,\frac{\tau-1}{2}\Big(1 + \frac{2}{\tau-1}\Big)^{\tau-1} \le C\,\frac{\tau-1}{2}\,e^2,
\]
where we used $(1 + 1/\alpha)^\alpha \le e$ for all $\alpha > 0$. Summing over $k$ and using the numerical bound $e^2 < 8$ completes the proof.

Remark 19. We only invoked the partial gradient diversity assumption (Assumption 3) at iterates $(u^{(t)}, V^{(t)})$; therefore, it suffices if the assumption only holds at iterates $(u^{(t)}, V^{(t)})$ generated by FedSim, rather than at all $(u, V)$.

A.5. Technical Lemmas

The first lemma involves smoothness over two blocks of variables; we use this in the proof of FedSim.

Lemma 20 (Block Smoothness). Suppose $F_i : \mathbb{R}^{d_0}\times\mathbb{R}^{d_i} \to \mathbb{R}$ satisfies Assumption 1. Then it holds that
\[
F_i(w', v'_i) - F_i(w, v_i) \le \big\langle \nabla_w F_i(w, v_i),\, w' - w\big\rangle + \big\langle \nabla_v F_i(w, v_i),\, v'_i - v_i\big\rangle + \frac{L_w}{2}(1+\chi^2)\|w' - w\|^2 + \frac{L_v}{2}(1+\chi^2)\|v'_i - v_i\|^2.
\]
Proof. Using the $L_w$-smoothness of $F_i(\cdot, v'_i)$ and the $L_v$-smoothness of $F_i(w, \cdot)$, we have
\[
F_i(w', v'_i) - F_i(w, v'_i) \le \big\langle \nabla_w F_i(w, v'_i),\, w' - w\big\rangle + \frac{L_w}{2}\|w' - w\|^2,
\]
\[
F_i(w, v'_i) - F_i(w, v_i) \le \big\langle \nabla_v F_i(w, v_i),\, v'_i - v_i\big\rangle + \frac{L_v}{2}\|v'_i - v_i\|^2.
\]
Summing the above two inequalities together gives
\[
F_i(w', v'_i) - F_i(w, v_i) \le \big\langle \nabla_w F_i(w, v'_i),\, w' - w\big\rangle + \big\langle \nabla_v F_i(w, v_i),\, v'_i - v_i\big\rangle + \frac{L_w}{2}\|w' - w\|^2 + \frac{L_v}{2}\|v'_i - v_i\|^2. \tag{20}
\]
We can bound the first inner product term on the right-hand side of (20) as
\begin{align*}
\big\langle \nabla_w F_i(w, v'_i),\, w' - w\big\rangle &= \big\langle \nabla_w F_i(w, v_i),\, w' - w\big\rangle + \big\langle \nabla_w F_i(w, v'_i) - \nabla_w F_i(w, v_i),\, w' - w\big\rangle \\
&\le \big\langle \nabla_w F_i(w, v_i),\, w' - w\big\rangle + \big\|\nabla_w F_i(w, v'_i) - \nabla_w F_i(w, v_i)\big\|\,\|w' - w\| \\
&\le \big\langle \nabla_w F_i(w, v_i),\, w' - w\big\rangle + L_{wv}\,\|v'_i - v_i\|\,\|w' - w\| \\
&\le \big\langle \nabla_w F_i(w, v_i),\, w' - w\big\rangle + \chi\sqrt{L_w L_v}\,\|v'_i - v_i\|\,\|w' - w\| \\
&\le \big\langle \nabla_w F_i(w, v_i),\, w' - w\big\rangle + \chi^2\,\frac{L_v}{2}\|v'_i - v_i\|^2 + \chi^2\,\frac{L_w}{2}\|w' - w\|^2,
\end{align*}
where the first inequality is due to Cauchy-Schwarz, the second inequality is due to the $L_{wv}$-Lipschitz property of $\nabla_w F_i(w, \cdot)$, the third inequality is due to the definition of $\chi$ in (5), and the last inequality is due to Young's inequality. Substituting the above inequality into (20) yields the desired result.

Next, we have the variance of sampling without replacement. Note the correction factor of $(n-m)/(n-1)$ over sampling with replacement. We include the elementary proof for completeness.

Lemma 21 (Sampling Without Replacement). Let $a_1, \dots, a_n \in \mathbb{R}^d$ be given. Let $S$ be a uniformly random sample of size $m$ from this collection, where the sampling is without replacement. Denoting the mean $\bar a = \frac1n\sum_{i=1}^n a_i$, we have
\[
\mathbb{E}\Big\|\frac1m\sum_{i\in S}a_i - \bar a\Big\|^2 = \frac{n-m}{m(n-1)}\Big(\frac1n\sum_{i=1}^n\|a_i - \bar a\|^2\Big).
\]
Proof. The statement is trivially true if $m = 1$ or $m = n$. Therefore, we assume now that $2 \le m \le n-1$. Further, without loss of generality, we assume that $\bar a = 0$. Finally, let $\mathcal S$ denote the set of all subsets of $[n]$ of size $m$; note that $|\mathcal S| = \binom{n}{m}$. We now have
\[
\mathbb{E}\Big\|\frac1m\sum_{i\in S}a_i\Big\|^2 = \frac{1}{m^2\binom{n}{m}}\sum_{S'\in\mathcal S}\Big(\sum_{i\in S'}\|a_i\|^2 + \sum_{i,j\in S':\, i\neq j}\langle a_i, a_j\rangle\Big).
\]
For the first term, we have
\[
\sum_{S'\in\mathcal S}\sum_{i\in S'}\|a_i\|^2 = \binom{n-1}{m-1}\sum_{i=1}^n\|a_i\|^2.
\]
Likewise, for the second term, we use $\sum_{j\neq i}a_j = -a_i$ to get
\[
\sum_{S'\in\mathcal S}\sum_{i,j\in S':\, i\neq j}\langle a_i, a_j\rangle = \binom{n-2}{m-2}\sum_{i=1}^n\sum_{j\neq i}\langle a_i, a_j\rangle = -\binom{n-2}{m-2}\sum_{i=1}^n\|a_i\|^2.
\]
Therefore, we get
\[
\mathbb{E}\Big\|\frac1m\sum_{i\in S}a_i\Big\|^2 = \frac{1}{m^2\binom{n}{m}}\left(\binom{n-1}{m-1} - \binom{n-2}{m-2}\right)\sum_{i=1}^n\|a_i\|^2 = \frac{n-m}{mn(n-1)}\sum_{i=1}^n\|a_i\|^2.
\]

The next two lemmas concern the effect of the local updates, as in the local SGD literature. The first lemma has also appeared in (Karimireddy et al., 2020); we give the proof for completeness.

Lemma 22. Consider $f : \mathbb{R}^d \to \mathbb{R}$ which is $L$-smooth and fix a $w^{(0)} \in \mathbb{R}^d$. Define the sequence $(w^{(t)})$ of iterates produced by stochastic gradient descent with a fixed learning rate $\gamma$ starting from $w^{(0)}$: $w^{(t+1)} = w^{(t)} - \gamma g^{(t)}$, where $g^{(t)}$ is an unbiased estimator of $\nabla f(w^{(t)})$ with bounded variance $\sigma^2$. Fix a number $\tau$ of steps. If $\gamma \le (\sqrt2\,\tau L)^{-1}$, we have the bound
\[
\sum_{t=0}^{\tau-1}\mathbb{E}\big\|w^{(t)} - w^{(0)}\big\|^2 \le 8\gamma^2\tau^2(\tau-1)\big\|\nabla f(w^{(0)})\big\|^2 + 4\gamma^2\tau^2(\tau-1)\sigma^2.
\]
Proof. If $\tau = 1$, we have nothing to prove. Assume now that $\tau \ge 2$. Let $\mathcal F^{(t)}$ be the $\sigma$-algebra generated by $w^{(t)}$ and denote $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot\,|\mathcal F^{(t)}]$. We will use the inequality
\[
\mathbb{E}_t\big\|g^{(t)}\big\|^2 = \mathbb{E}_t\big\|g^{(t)} - \nabla f(w^{(t)})\big\|^2 + \big\|\nabla f(w^{(t)})\big\|^2 \le \sigma^2 + \big\|\nabla f(w^{(t)})\big\|^2. \tag{21}
\]
We now successively deduce
\begin{align*}
\mathbb{E}_t\big\|w^{(t+1)} - w^{(0)}\big\|^2 &= \mathbb{E}_t\big\|w^{(t)} - w^{(0)} - \gamma g^{(t)}\big\|^2 \\
&\overset{(a)}{\le} \Big(1 + \frac{1}{\tau-1}\Big)\big\|w^{(t)} - w^{(0)}\big\|^2 + \gamma^2\tau\,\mathbb{E}_t\big\|g^{(t)}\big\|^2 \\
&\overset{(b)}{\le} \Big(1 + \frac{1}{\tau-1}\Big)\big\|w^{(t)} - w^{(0)}\big\|^2 + 2\gamma^2\tau\big\|\nabla f(w^{(t)}) - \nabla f(w^{(0)})\big\|^2 + 2\gamma^2\tau\big\|\nabla f(w^{(0)})\big\|^2 + \gamma^2\tau\sigma^2 \\
&\overset{(c)}{\le} \Big(1 + \frac{1}{\tau-1} + 2\gamma^2\tau L^2\Big)\big\|w^{(t)} - w^{(0)}\big\|^2 + 2\gamma^2\tau\big\|\nabla f(w^{(0)})\big\|^2 + \gamma^2\tau\sigma^2 \\
&\overset{(d)}{\le} \Big(1 + \frac{2}{\tau-1}\Big)\big\|w^{(t)} - w^{(0)}\big\|^2 + 2\gamma^2\tau\big\|\nabla f(w^{(0)})\big\|^2 + \gamma^2\tau\sigma^2.
\end{align*}
Above, we used (a) the inequality $2\alpha\beta \le \alpha^2/\delta^2 + \delta^2\beta^2$ for reals $\alpha, \beta, \delta$; (b) Eq. (21); (c) the $L$-smoothness of $f$; and (d) the condition on the learning rate. Let $C = 2\gamma^2\tau\|\nabla f(w^{(0)})\|^2 + \gamma^2\tau\sigma^2$. Unrolling the inequality and summing up the series gives, for all $t \le \tau - 1$,
\[
\mathbb{E}\big\|w^{(t)} - w^{(0)}\big\|^2 \le C\,\frac{\tau-1}{2}\Big(1 + \frac{2}{\tau-1}\Big)^{\tau-1} \le C\,\frac{\tau-1}{2}\,e^2,
\]
where we used the bound $(1 + 1/\alpha)^\alpha \le e$ for all $\alpha > 0$. Summing over $t$ and using the numerical bound $e^2 < 8$ completes the proof.

Lemma 23. Consider the setting of Lemma 22. If $\gamma \le (2\tau L)^{-1}$, we have the bound
\[
\mathbb{E}\big\|w^{(\tau)} - w^{(0)}\big\|^2 \le 16\gamma^2\tau^2\big\|\nabla f(w^{(0)})\big\|^2 + 8\gamma^2\tau^2\sigma^2.
\]
Proof. Proceeding similarly to the last proof (except using $\delta^2 = \tau$) gives us
\[
\mathbb{E}_t\big\|w^{(t+1)} - w^{(0)}\big\|^2 \le \Big(1 + \frac{2}{\tau}\Big)\big\|w^{(t)} - w^{(0)}\big\|^2 + 4\gamma^2\tau\big\|\nabla f(w^{(0)})\big\|^2 + 2\gamma^2\tau\sigma^2.
\]
Unrolling and summing up the sequence completes the proof, similar to that of Lemma 22.

The next lemma is about bounding constants.

Lemma 24. Let $\gamma_u, \gamma_v, L_u, L_v, \chi, f \in \mathbb{R}_+$ and a natural number $\tau$ be given. Denote
\[
A := \gamma_u L_u^2 + f\gamma_v\chi^2 L_u L_v, \qquad B := f\gamma_v L_v^2 + \gamma_u\chi^2 L_u L_v.
\]
Suppose $\gamma_u = c_u/(\tau L_u)$ and $\gamma_v = c_v/(\tau L_v)$ with $c_u, c_v > 0$ satisfying
\[
c_u,\ c_v \le \frac{1}{\sqrt6\,\max\{1, \chi^2\}}.
\]
Then we have that
\[
\gamma_v^2\chi^2 L_u L_v B + \gamma_u^2 L_u^2 A \le A/(3\tau^2), \qquad \gamma_u^2\chi^2 L_u L_v A + \gamma_v^2 L_v^2 B \le B/(3\tau^2).
\]
Proof. Note that it suffices to show
\[
3\tau^2\chi^2\gamma_v^2 L_u L_v B \le A/2, \qquad 3\tau^2\chi^2\gamma_u^2 L_u L_v A \le B/2.
\]
Plugging in $\gamma_u, \gamma_v$, these are equivalent to
\[
6\chi^2 f c_v^3 + 6\chi^4 c_v^2 c_u \le \chi^2 f c_v + c_u \qquad\text{and}\qquad 6\chi^2 c_u^3 + 6\chi^4 f c_v c_u^2 \le f c_v + \chi^2 c_u.
\]
The assumption on $c_v$ implies that $6\chi^2 f c_v^3 \le \chi^2 f c_v$ and $6\chi^4 c_v^2 c_u \le c_u$; therefore, the first condition holds. Similarly, the second condition holds too.

Figure 6. Distribution of the number of training samples per device for each of the tasks considered in the experiments (Stack Overflow: words per device; GLDv2 and EMNIST: images per device). For GLDv2, we do not show the long right tail, where the maximum number of data points per device is 1000 (cf. Table 1).

The final lemma is about tuning the learning rate; the proof is elementary and is omitted.

Lemma 25. Consider the map $\varphi : (0, \Gamma] \to \mathbb{R}_+$ given by
\[
\varphi(\gamma) = \frac{A}{\gamma T} + B\gamma + C\gamma^2,
\]
where $\Gamma, A, B, C > 0$ are given. Then, with $\gamma$ chosen as $\gamma = \min\big\{\Gamma,\ (A/(BT))^{1/2},\ (A/(CT))^{1/3}\big\}$, we have
\[
\varphi(\gamma) \le \frac{A}{\Gamma T} + 2\Big(\frac{AB}{T}\Big)^{1/2} + 2C^{1/3}\Big(\frac{A}{T}\Big)^{2/3}.
\]

B. Experiments: Detailed Setup and Hyperparameters

We conduct our experiments on four datasets from three modalities, namely images, text, and speech. The datasets contain a natural, non-i.i.d. split of data which is reflective of the data heterogeneity encountered in federated learning. We describe the experimental setup and hyperparameters in detail. The code to reproduce the experimental results will be publicly released. The outline of this section is as follows: B.1 describes the tasks and their associated datasets and metrics; B.2 describes the experimental pipeline as well as the baselines we compare to; B.3 presents the hyperparameters of all the algorithms. As discussed in Section 1, we take the weight $\alpha_i$ to be proportional to the number of datapoints available on the device.

B.1. Datasets, Tasks and Models

We consider four tasks motivated by real-world applications of federated learning. The tasks are summarized in Table 1 of the main paper and the distribution of data across the clients is visualized in Figure 6. For each model, we consider three partial personalization architectures:

(a) Input layer personalization: Motivated by Liang et al. (2019), this architecture places the first layer on-device to learn a personalized representation per client, while the rest of the model is shared. For the next-word prediction transformer model, we use the first transformer layer in place of the word embedding layer owing to its large size.

(b) Output layer personalization: Motivated by Collins et al. (2021), this architecture learns a shared global representation but personalizes the prediction layer. For the next-word prediction transformer model, we use the last transformer layer in place of the final prediction layer owing to its large size. For the same reason, we use the second fully connected layer within the final transformer block for the speech-to-text transformer.

Table 5. Summary of partial personalization architectures for the transformer model for next word prediction.
Personalization Type | Layer on-device        | # Personalized Params. | # Shared Params.
Input Layer          | 1st transformer block  | 0.8M                   | 4.9M
Output Layer         | Last transformer block | 0.8M                   | 4.9M
Adapter              | Adapter modules        | 0.07M                  | 5.7M

(c) Adapter personalization: We also consider a novel partial personalization architecture, where the full model is shared among all clients, while each client adds personalized adapter modules: lightweight modules added between layers of the shared model. We use the transformer adapters proposed by Houlsby et al. (2019) and the residual adapters proposed by Rebuffi et al. (2017).

B.1.1. STACKOVERFLOW FOR NEXT WORD PREDICTION

Dataset. The Stack Overflow dataset comprises questions and answers from the programming question-answer website stackoverflow.com. The goal of the next word prediction task is to predict the next word given a partial sequence of words in a question or answer. This task is a good open-source benchmark for next word prediction in mobile keyboards. We use the Stack Overflow dataset provided by TensorFlow Federated.

Client Distributions. Each client corresponds to one user on Stack Overflow; the data on the client corresponds to the questions and answers posted by this user. We only consider clients with at least 100 training sequences and 10 testing sequences, where a sequence refers to either a question or an answer, and we use a fixed subsample of 1000 of them. Following Reddi et al. (2021), we restrict the vocabulary to the 10000 most frequently occurring words in the dataset. We pad and truncate each sequence of each client to length 20 and consider at most 1000 training sequences on each client.

Model. We use a transformer model (Vaswani et al., 2017) commensurate in size with BERT-Mini (Turc et al., 2019). It has 4 transformer blocks and 4 attention heads in each self-attention layer, with a transformer hidden dimension of 256 and a fully connected hidden dimension of 1024. The output layer is a causal language modeling head, i.e., a fully connected layer which assigns a score to each possible vocabulary item, including the special tokens. The model has 6 million parameters, which require around 23 megabytes of memory.

Partial Personalization Architecture. The partial personalization architectures used are summarized in Table 5.

Loss Function and Evaluation Metric. We train the model with the causal language modeling objective. That is, for each partial sequence, we treat the prediction of the next word as a multiclass classification problem and minimize the multinomial logistic loss, also known as the cross entropy loss. For evaluation, we use the top-1 accuracy of predicting words in the proper 10000-word vocabulary (i.e., ignoring special tokens such as padding, out-of-vocabulary, and beginning/end of sequence).

B.1.2. GLDV2 FOR VISUAL LANDMARK RECOGNITION

Dataset. GLDv2 stands for the Google Landmarks Dataset v2 (Weyand et al., 2020), a large-scale image dataset. It contains images of popular landmarks from around the world taken and uploaded by Wikipedia contributors. While the images vary in size, the most common image size is 800 × 600 pixels. The goal of the visual landmark recognition task is to identify the landmark from its image. This task resembles a scenario where smartphone users take photos of natural and architectural landmarks while traveling. We use the federated version of the GLDv2 dataset introduced by Hsu et al. (2020) with 2028 landmarks, provided by TensorFlow Federated.

Client Distributions.
Each client corresponds to one Wikipedia user and contains all the images contributed by that user. We use only the 823 clients with at least 50 datapoints. We do not use the original GLDv2 test set for evaluation, as it comes from a different set of clients; instead, we take 50% of the data on each client as a testing set.

Model. We use a ResNet-18 (He et al., 2016) model pretrained on ImageNet (Deng et al., 2009), with group normalization instead of batch normalization (Hsieh et al., 2020). We resize all images to 224 × 224. We use two data augmentations for training: a random crop from 256 × 256 and a random horizontal flip. The model has 12 million parameters, which require around 49 megabytes of storage.

Table 6. Summary of partial personalization architectures for the ResNet-18 model for visual landmark recognition.
Personalization Type | Layer on-device            | # Personalized Params. | # Shared Params.
Input Layer          | 1st conv. layer            | 0.01M                  | 12.2M
Output Layer         | Last fully connected layer | 1M                     | 11.2M
Adapter              | Residual adapter modules   | 1.4M                   | 12.2M

Partial Personalization Architecture. The partial personalization architectures used are summarized in Table 6.

Loss Function and Evaluation Metric. We use the multinomial logistic loss, also known as the cross entropy loss, and evaluate the performance of the model using its classification accuracy.

B.1.3. EMNIST FOR CHARACTER RECOGNITION

Dataset. EMNIST (Cohen et al., 2017) is a character recognition dataset. The goal is to identify images of handwritten digits or letters; there are 62 possible options (a-z, A-Z, 0-9). The images are grey-scaled pictures of 28 × 28 = 784 pixels. We use the EMNIST dataset provided by TensorFlow Federated.

Client Distributions. Each client corresponds to one "writer", i.e., the human subject who hand-wrote the digit/letter during the data collection process. We only use those clients with at least 100 training points and 25 testing points; there are 1114 such clients.

Model. We use a ResNet-18 (He et al., 2016) model with group normalization instead of batch normalization (Hsieh et al., 2020). We make two modifications to handle the smaller image size (28 × 28 × 1, as opposed to the 224 × 224 × 3 which the original ResNet was designed to accept): (a) we use a convolutional kernel of size 3 × 3 rather than the original 7 × 7 in the first convolution layer, and (b) we drop the first pooling layer. The model has 11 million parameters, which require around 45 megabytes. Note that the number of parameters in this ResNet is smaller than the one for GLDv2 due to the architectural modifications we make for smaller images, as well as the smaller number of classes.

Table 7. Summary of partial personalization architectures for the ResNet-18 model for character recognition.
Personalization Type | Layer on-device            | # Personalized Params. | # Shared Params.
Input Layer          | 1st conv. layer            | 0.7K                   | 11.2M
Output Layer         | Last fully connected layer | 0.03M                  | 11.2M
Adapter              | Residual adapter modules   | 1.4M                   | 11.2M

Partial Personalization Architecture. The partial personalization architectures used are summarized in Table 7.

Loss Function and Evaluation Metric. We use the multinomial logistic loss, also known as the cross entropy loss, and evaluate the performance of the model using its classification accuracy.
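The two EMNIST-specific modifications above can be realized in a few lines. The following torchvision-based sketch is ours and illustrative (it assumes torchvision's stock `resnet18`; it is not the paper's released code):

```python
import torch.nn as nn
from torchvision.models import resnet18

# Illustrative sketch of the EMNIST ResNet-18 described above (not the paper's
# code): group norm in place of batch norm, a 3x3 first conv for 28x28x1 inputs,
# no initial pooling layer, and a 62-way output layer.
def emnist_resnet18(num_classes: int = 62) -> nn.Module:
    gn = lambda channels: nn.GroupNorm(num_groups=32, num_channels=channels)
    model = resnet18(norm_layer=gn)               # swap batch norm for group norm
    model.conv1 = nn.Conv2d(1, 64, kernel_size=3,  # (a) 3x3 kernel, 1 input channel
                            stride=1, padding=1, bias=False)
    model.maxpool = nn.Identity()                  # (b) drop the first pooling layer
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

For output layer personalization in this model, the on-device part would be `model.fc`; for input layer personalization, `model.conv1`.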
B.1.4. LIBRISPEECH FOR AUTOMATIC SPEECH RECOGNITION

Dataset. LibriSpeech is a speech-to-text dataset containing snippets of speech and the associated text from open-domain audiobooks (Panayotov et al., 2015). Given an utterance containing read English speech, the goal is to output a text transcription. Each device corresponds to the narrator of the utterance, leading to a natural non-identical split of the data, with differences in accent, tone, and voice across devices. This task is reflective of voice commands and speech recognition on mobile phones. We create a federated version of LibriSpeech. We use the clean subsets of LibriSpeech (a total of 460 hours of speech) to pretrain a model in a non-federated manner. We use the train-other-500 subset (a total of 500 hours of audio), which typically contains noisier audio, to construct a federated dataset. Real-world federated tasks often contain proxy data used to pretrain a model prior to federated training, such as ImageNet-pretrained vision models. We emulate this setup by first pretraining all our models on the non-federated clean subset of LibriSpeech.

Client Distributions. We construct the federated dataset from the train-other-500 subset of LibriSpeech and do not use the corresponding dev and test sets. Of the 1166 narrators, we discard those with only one chapter of data (LibriSpeech organizes the data for each narrator into chapters of the source book). For each narrator, we assign one chapter as the test data and the remaining as the training data; when multiple candidate chapters are available for use as a test set, we use the one closest in size to 20% of the data. This is done to ensure that each device has between 10% and 50% of the device's total data, in terms of length of audio, as test data; this leads to approximately 30% of the available audio being used for testing and the remaining 70% for training. Overall, we get a federated dataset with 902 narrators, each of whom corresponds to a device in the federated setting.

Model. We use a transformer model (Vaswani et al., 2017) with convolutional subsamplers, as proposed by Synnaeve et al. (2019). The input audio is represented as a sequence of 40 log-mel filterbank coefficients. The model has two 1D convolutional layers with a stride of 2, followed by 6 transformer blocks with 6 attention heads in each self-attention layer, a transformer hidden dimension of 384, and a fully connected hidden dimension of 1536. The final output layer produces log-probabilities over an output vocabulary of 5000 byte-pair encodings of subwords. The model has 15 million parameters, requiring around 60 megabytes of memory.

Table 8. Summary of partial personalization architectures for the transformer model for speech recognition.
Personalization Type | Layer on-device                     | # Personalized Params. | # Shared Params.
Input Layer          | Convolutional subsamplers           | 0.8M                   | 12.6M
Output Layer         | 2nd f.c. in last transformer block  | 0.6M                   | 12.8M
Adapter              | Adapter modules                     | 0.15M                  | 13.4M

Partial Personalization Architecture. The partial personalization architectures used are summarized in Table 8.

Loss Function and Evaluation Metric. We train the model with the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006). This is a structured prediction loss that uses dynamic programming to marginalize over all possible alignments between the per-frame subwords and the text transcription. For evaluation, we use the word error rate (WER) obtained from a greedy decoding of the model's prediction for a given utterance (equivalently, beam search with a beam size of 1 and no external language models).

B.2. Experimental Pipeline and Baselines

There are three components in the training pipeline for all experiments:

(a) Non-personalized federated training: The first step involves training a global model $w_g$ using the one-model-fits-all approach of (1) with FedAvg variants.
(b) Personalized federated training: This optional second step involves training the shared parameters $u$ together with the personalized parameters $v_i$ using a personalized federated learning approach. We warm-start $u, v_i$ from the non-personalized model $w_g$ of the previous step.

(c) Final finetuning: The last step involves finetuning only the personalized parameters $v_i$ while the shared parameters $u$ remain unchanged.

For step (b), we initialize $v_i$ for each $i$ to the appropriate part of $w_g$ for input/output layer personalization. For adapters, on the other hand, we initialize $v_i$ to the same set of randomly initialized weights on each device $i$.

We consider the following baselines:

Non-personalized: This denotes the performance of step (a) of the pipeline above, i.e., non-personalized federated training with FedAvg variants.

Full model personalization: We consider three baselines which personalize the full model:

(i) Finetune: The non-personalized model from step (a) of the pipeline above is finetuned locally on each client (step (c) of the pipeline). Step (b) is skipped for this baseline.

(ii) Ditto (Li et al., 2021): The non-personalized model from step (a) of the pipeline above is finetuned locally on each client (step (c) of the pipeline) with the $\ell_2$ regularization $\|v - w_g\|^2$. Step (b) is skipped for this baseline.

(iii) pFedMe (Dinh et al., 2020): The non-personalized baseline model from step (a) is trained further in step (b) to optimize (2) using the pFedMe algorithm of Dinh et al. (2020). Finally, the resulting model is finetuned locally in step (c).

Partial model personalization: We consider partial model personalization with the three architectures defined in B.1. For each personalization approach, we start with the non-personalized model in step (a), continue personalization in step (b) using either FedAlt or FedSim as the algorithm, and finally run step (c) for the local finetuning.

Table 9. Hyperparameters for each dataset/task.
Hyperparameter    | Stack Overflow | GLDv2   | EMNIST      | LibriSpeech
Batch size        | 64             | 64      | 32          | 32
Devices per round | 50             | 50      | 10          | 50
Local epochs      | 1              | 1       | 1           | 1
Server optimizer  | FedAdam        | FedAdam | FedAvg      | FedAdam
Client optimizer  | SGD            | SGD     | SGD         | SGD
Global scheduler  | Linear         | Linear  | Exponential | Linear
Warmup            | 10% of rounds  | 10% of rounds | N/A   | 10% of rounds
LR decay rounds   | N/A            | N/A     | 500         | N/A
Max. grad. norm   | 0.1            | N/A     | N/A         | 0.25
Non-personalized training (step (a) of the pipeline):
# Rounds          | 1000           | 2500    | 2000        | 500
Server learning rate | 5e-4        | 2e-4    | 1.0         | 1e-3
Client learning rate | 1           | 1e-2    | 0.5         | 1e-2
Personalized training (step (b) of the pipeline):
# Rounds          | 500            | 600     | 500         | 500
Server learning rate | 5e-5        | 2e-5    | 1.0         | 1e-3
Client learning rate | 1e-1        | 1e-3    | 1e-2        | 1e-2
Local finetuning (step (c) of the pipeline):
# Epochs          | 5              | 5       | 5           | 5
Optimizer         | SGD            | SGD     | SGD         | SGD
Client learning rate | 1e-1        | 1e-3    | 1e-2        | 1e-4
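To make the three-stage pipeline concrete, the following self-contained toy sketch (ours; the separable quadratic losses and all names are illustrative assumptions, not the paper's code) runs stages (a), (b), and (c) end to end:

```python
import numpy as np

# Toy rendering of the B.2 pipeline: (a) non-personalized FedAvg-style training,
# (b) personalized training with alternating (FedAlt-style) updates, and
# (c) local finetuning of the personal parameters only.
rng = np.random.default_rng(0)
n, d, rounds, m = 8, 4, 50, 4
targets = rng.normal(size=(n, d))                 # device i's local optimum

# (a) non-personalized training of a single global model w_g
w_g = np.zeros(d)
for _ in range(rounds):
    S = rng.choice(n, size=m, replace=False)
    w_g = np.mean([w_g - 0.5 * (w_g - targets[i]) for i in S], axis=0)

# (b) personalized training: shared u and personal v_i warm-started from w_g
u, V = w_g.copy(), np.tile(w_g, (n, 1))
for _ in range(rounds):
    S = rng.choice(n, size=m, replace=False)
    for i in S:                                   # v-step on sampled devices
        V[i] -= 0.5 * (V[i] - targets[i])
    u = np.mean([u - 0.5 * (u - targets[i]) for i in S], axis=0)  # u-step

# (c) final finetuning: only v_i moves, u stays fixed
for i in range(n):
    V[i] -= 0.5 * (V[i] - targets[i])
```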
B.3. Hyperparameters and Evaluation Details

All tuning of hyperparameters was performed on validation data, formed by holding out 20% of the training data on each device. Once the tuning was complete, we reran the experiments on the full training data, including the data held out for validation.

Evaluation Metric. Our primary evaluation metric for next-word prediction and image classification is the weighted average of the test accuracy on each client, weighted by the number of test examples (the details of how the accuracy is computed on each dataset are given in B.1, in the paragraphs on "Loss Function and Evaluation Metric"). This corresponds to the unweighted accuracy obtained by pooling all the data centrally, similar to the loss as discussed in Section 1. The same metric is used for hyperparameter tuning and is reported in all the tables and plots, unless explicitly noted otherwise. For speech recognition, we similarly use a weighted average of the word error rate (WER). The final hyperparameters we use are given in Table 9.

Rounds. We start with the number of communication rounds (i.e., the number of calls to the secure aggregation routine for the shared parameters), which is used to measure the progress of each algorithm. For the non-personalized training, we use 1000 rounds for Stack Overflow, 2500 rounds for GLDv2, and 2000 rounds for EMNIST. For the personalized training, we warm-start the model from the non-personalized one, and run the training for 500 rounds for Stack Overflow and EMNIST and 600 rounds for GLDv2.

Devices per Round. All devices are assumed to be available, and selections are made uniformly at random. Following (Reddi et al., 2021; Weyand et al., 2020), we select 50 devices per round for Stack Overflow/GLDv2 and 10 per round for EMNIST, for both the non-personalized and the personalized training.

Local Updates and Minibatch Size. Each selected device locally runs 1 epoch of mini-batch stochastic gradient descent for non-personalized as well as personalized federated training. The final finetuning at the end of personalized training is performed for 5 epochs. We use a minibatch size of 64 for Stack Overflow/GLDv2 and 32 for EMNIST for all settings.

Server and Client Optimizer Details. We use FedAvg for EMNIST and FedAdam (Reddi et al., 2021) for Stack Overflow and GLDv2. We also use a global scheduler, which applies a schedule to the client learning rates across rounds, while the client learning rate within each round is held constant. We use either a linear scheduler or an exponential scheduler (also called "step LR" in PyTorch). A linear scheduler applies a linear warmup, if applicable, up to the maximum learning rate, followed by a linear decay to 0. An exponential scheduler halves the client learning rate once every fixed number of rounds. Both the client and server learning rates are tuned using the validation set.

Regularization Coefficient for pFedMe and Ditto. We tune the regularization coefficient $\lambda_i = \lambda$ for pFedMe and Ditto using the validation data from the set $\{10^{-4}, 10^{-3}, \dots, 10^0\}$ of possible values. The tuned values are: Stack Overflow: $10^{-3}$ for Ditto and $10^{-4}$ for pFedMe; GLDv2: $10^{-1}$ for both Ditto and pFedMe; EMNIST: $10^{-1}$ for both Ditto and pFedMe.

Random Seed. We report numbers averaged over 5 random seeds for all experiments, with the exception of the speech recognition task.

Table 10. Memory requirements (in megabytes) for training with partial model personalization and full model personalization for the experimental settings considered here.
Mode                            | Stack Overflow | GLDv2 | EMNIST
No personalization              | 71             | 186   | 142
Input layer personalization     | 67             | 186   | 142
Output layer personalization    | 67             | 174   | 142
Adapter personalization         | 72             | 222   | 159
Full personalization            | 116            | 263   | 232
Memory savings with partial personalization | 42% | 34%  | 39%
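The two global schedulers described under "Server and Client Optimizer Details" above admit one-line implementations. The sketch below is ours and illustrative (function names and the exact boundary handling are assumptions; the paper only specifies the schedules' shapes):

```python
# Illustrative client-LR schedules: linear warmup followed by linear decay to 0,
# and an exponential (step) schedule that halves the rate every decay_rounds.
def linear_schedule(round_idx: int, total_rounds: int, max_lr: float,
                    warmup_frac: float = 0.1) -> float:
    warmup = max(1, int(warmup_frac * total_rounds))
    if round_idx < warmup:                       # linear warmup to max_lr
        return max_lr * (round_idx + 1) / warmup
    # linear decay from max_lr down to 0 at the final round
    return max_lr * (total_rounds - round_idx) / (total_rounds - warmup)

def exponential_schedule(round_idx: int, max_lr: float,
                         decay_rounds: int = 500) -> float:
    return max_lr * 0.5 ** (round_idx // decay_rounds)
```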
B.4. Estimated Memory Requirement

We estimate the memory footprint of partial versus full personalization during training below. During deployment, the memory footprint of partial and full model personalization is the same, since one full model is deployed.

Estimation Procedure. We assume that the following must be stored on device $i \in S^{(t)}$ during round $t$ of training: the previously broadcast global model $u^{(t)}$, which is needed to calculate the model delta to be sent back to the server; the current iterate of the shared parameters $u^{(t)}_{i,k}$; the current iterate of the personal parameters $v^{(t)}_{i,k}$; their respective gradients $\nabla_u$ and $\nabla_v$; and the internal buffers required for backpropagation. The total memory consumption is therefore
\[
\text{Memory} = 3 \times \text{size}(u) + 2 \times \text{size}(v) + \text{size(backprop)}.
\]
We estimate the size of the backpropagation buffers for a batch size of 1.

Training Memory Requirement. For full model personalization, $\text{size}(v) = \text{size}(u)$, whereas $\text{size}(v) \ll \text{size}(u)$ for the partial personalization architectures we have considered. Therefore, the total memory requirement of training with partial model personalization is smaller than with full model personalization. From Table 10, we see that partial personalization results in a 34% to 42% reduction in memory consumption across the models and datasets considered in the experiments.

C. Experiments: Additional Results

We now present the detailed experimental results.

C.1. Speech Recognition: FedAlt vs. FedSim

We compare FedAlt and FedSim for speech recognition in Table 11. We find that input layer personalization with FedAlt has the smallest word error rate of all the models considered.

Table 11. A comparison of FedAlt and FedSim on the speech recognition task in terms of the word error rate (WER) %. Smaller values indicate better predictive performance.
Personalization | FedAlt | FedSim
Finetune        | 15.55  | 15.55
Input Layer     | 15.13  | 15.47
Output Layer    | 15.53  | 15.51
Adapter         | 15.50  | 15.54

C.2. Ablation: Final Finetuning for FedAlt and FedSim

We now study the effect of the final finetuning (step (c) of the experimental pipeline; cf. B.2) for FedAlt and FedSim.

The final finetuning has a minimal impact on partial personalization. We see from Table 12 that the effect of the final finetuning is much smaller than the improvements from personalization. For instance, the improvements from finetuning are close to 0 for FedAlt on the Stack Overflow dataset. For GLDv2, the finetuning accounts for < 0.5pp of improvement, whereas personalization overall accounts for 5 to 15pp.

Table 12. The change in accuracy (percentage points) from the final finetuning for FedAlt and FedSim with stateful devices. The "±" entries denote the standard deviation over 5 random seeds.
             | Stack Overflow          | GLDv2                   | EMNIST
             | FedAlt     | FedSim     | FedAlt     | FedSim     | FedAlt     | FedSim
Input Layer  | 0.06±0.01  | 0.04±0.02  | 0.12±0.02  | 0.17±0.03  | 0.12±0.01  | 0.12±0.03
Output Layer | 0.00±0.01  | 0.25±0.02  | 0.49±0.02  | 0.57±0.03  | 0.09±0.01  | 0.09±0.03
Adapter      | 0.01±0.01  | 0.40±0.08  | 0.14±0.02  | 0.17±0.01  | 0.27±0.02  | 0.33±0.03

The final finetuning is more important to FedSim than to FedAlt. Table 12 also shows that the final finetuning helps FedSim more than FedAlt. However, FedAlt still outperforms FedSim, as we saw in Table 4. Overall, this shows that FedAlt is a better algorithm than FedSim: the final finetuning helps FedSim make up some percentage points in accuracy, but not enough to close its gap with FedAlt.

C.3. Effect of Personalization on Per-Device Generalization

Summary of all scatter plots.
All the scatter plots shown in the main paper are summarized in the violin plot of Figure 7. We see from the leftmost panel that the training accuracies on all devices improve with personalization.

Figure 7. Left two panels: distribution of the change in per-device train (leftmost) and test (center left) accuracy due to personalization (Ditto, pFedMe, partial) on the Stack Overflow dataset. Right two panels: distribution of the change in per-device test accuracy of partial personalization under regularization on the Stack Overflow dataset: (center right) adapter personalization under ℓ2 regularization (no/best/large regularization), and (rightmost) output layer personalization under dropout (no/best/large dropout). Note that the "No Reg." and "No d/o" plots in the right two panels differ because they personalize different model parts. Interpretation: the white dot inside the violin denotes the median, while the black box enclosing this white dot marks the interquartile range (i.e., the 25th and 75th percentiles). The body of the violin is a kernel density estimate of the distribution of accuracies. The lines extend out to the minimum and maximum accuracy in each case.

From the second panel, we see that the test accuracy of some of the devices decreases with personalization; this is true for both partial and full personalization. From the third panel of Figure 7, we see that regularization does not mitigate this overfitting. In fact, the regularization tuned for the best average accuracy leads to a nearly identical distribution of test accuracies. A larger regularization reduces the spread of accuracies, but does so at the expense of a smaller median (white dot). The fourth panel of Figure 7 shows that the effect of dropout is similar: the best dropout improves the median accuracy, but it does not mitigate the issue of some devices being hurt by personalization.

Train accuracy plots for devices. From Figure 8, we see that personalization leads to a reduction in test accuracy on some of the devices relative to the initial non-personalized model. The corresponding train accuracy plot is also given in Figure 8. We observe that personalization always leads to an improvement in the training accuracy, but not always in the test accuracy. The analogous plots for GLDv2 are in Figure 9, where the trends are similar.

Whether personalization helps a device or not depends on the random seed. We see in Figure 11 that the shaded region for some of the devices intersects the dotted line at 0. In other words, personalization sometimes helps a device and sometimes hurts it, depending on the random seed. This indicates that the best fix in practice is to use A/B testing on the deployed model to choose whether to use the personalized model or the non-personalized one.

Regularization and dropout do not mitigate this issue. From the first row of Figure 10, we see that the weight decay with the best mean accuracy exactly matches the unregularized case in terms of per-device statistics. Increasing the regularization weight can reduce the spread of per-device accuracy. However, this only leads to a worse mean accuracy and does not mitigate the issue of personalization hurting individual devices. From the second row of Figure 10, we see that the best dropout (0.3 in this case) leads to a slight increase in average accuracy (0.18 pp).
It also reduces the number of devices hurt by personalization from 256 out of 1000 to 193, but it does not fix this issue. Increasing dropout further only leads to a degradation of the per-device statistics.

C.4. Partial Personalization for Stateless Devices

The algorithms we considered in this paper, namely FedAlt and FedSim, require the devices to maintain the personalized parameters $v_i$ as state across rounds. In cross-device federated learning settings, it is also interesting to consider stateless devices, which are not allowed to maintain state between training rounds. We give preliminary experiments in this setting. We modify the FedAlt and FedSim algorithms from the main paper so that the personalized parameters $v_i$ are reinitialized each time device $i$ is chosen for participation. We warm-start $v_i$ from the appropriate part of the non-personalized model trained in step (a) of the pipeline. For adapters, we fix a random initialization once and reuse it.

Figure 8. Scatter plot of the change in accuracy (pp) per device versus the number of training samples on the device for Stack Overflow. Top: training accuracy. Bottom: test accuracy. This is the full version of Figure 5 from the main paper.

FedAlt is better than FedSim for stateless devices, although the improvement is smaller. We see from Table 13 that all algorithms perform similarly in the stateless setting. Nevertheless, FedAlt obtains mild improvements over both FedSim and finetuning for GLDv2, e.g., 0.24pp with adapters.

Table 13. The counterpart of Table 4 for stateless devices. We compare FedAlt and FedSim for partial model personalization with stateless devices. "FT (part.)" corresponds to finetuning the personal parameters $v_i$ locally while fixing the shared parameters $u$ from non-personalized training. The numbers are averaged over 5 random seeds ("±" denotes the standard deviation); in the original table, boldface denotes the highest accuracy in each row.
             | Stack Overflow                        | GLDv2                                 | EMNIST
             | FT (part.) | FedAlt     | FedSim     | FT (part.) | FedAlt     | FedSim     | FT (part.) | FedAlt     | FedSim
Input Layer  | 24.96±0.01 | 24.84±0.01 | 24.89±0.01 | 51.97±0.02 | 52.76±0.06 | 52.74±0.02 | 93.29±0.00 | 93.51±0.03 | 93.48±0.04
Output Layer | 24.93±0.01 | 24.94±0.01 | 24.94±0.01 | 53.21±0.01 | 53.30±0.06 | 53.30±0.08 | 93.37±0.01 | 93.53±0.03 | 93.51±0.04
Adapter      | 24.71±0.00 | 24.69±0.01 | 24.71±0.01 | 63.86±0.06 | 64.10±0.14 | 63.19±0.04 | 93.66±0.00 | 93.97±0.04 | 93.89±0.02

The final finetuning is crucial for stateless devices. We see from Table 14 that the final finetuning accounts for most of the improvements in the stateless case. For instance, for GLDv2, the final finetuning accounts for 11.68 and 10.42pp out of a total of 12.67 and 11.76pp for FedAlt and FedSim respectively. However, the personalized federated training (step (b) of the pipeline; cf. B.2) still leads to an increase in accuracy of 1 to 1.34pp.
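The stateless modification just described amounts to a single change relative to the FedAlt sketch given in A.3: instead of reading $v_i$ from persistent on-device state, each participating device re-creates it from a fixed warm-start value. A minimal illustrative rendering (ours, with assumed names and the same toy quadratic losses as before):

```python
import numpy as np

# Stateless FedAlt round (illustrative): v_i is reinitialized from a fixed
# warm-start (e.g., the matching part of the non-personalized model) whenever
# device i participates, instead of being carried across rounds.
def stateless_fedalt_round(u, v_warmstart, c, S, tau=5, gamma=0.05, rng=None):
    rng = rng or np.random.default_rng()
    d = u.shape[0]
    u_locals = []
    for i in S:
        v = v_warmstart.copy()                 # no state kept between rounds
        for _ in range(tau):                   # v-step with u fixed
            v = v - gamma * ((u + v - c[i]) + 0.01 * rng.normal(size=d))
        ui = u.copy()
        for _ in range(tau):                   # u-step with the fresh v fixed
            ui = ui - gamma * ((ui + v - c[i]) + 0.01 * rng.normal(size=d))
        u_locals.append(ui)
    return np.mean(u_locals, axis=0)           # server aggregate; v is discarded
```

Discarding $v_i$ after each round is what makes the final finetuning step so much more important in the stateless setting, as Table 14 shows.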
Figure 9. Scatter plot of the change in accuracy (pp) per device versus the number of training samples on the device for GLDv2. Top: training accuracy. Bottom: test accuracy.

Table 14. The change in accuracy (percentage points) from the final finetuning for FedAlt and FedSim with stateless devices. The "±" entries denote the standard deviation over 5 random seeds.
             | Stack Overflow          | GLDv2                    | EMNIST
             | FedAlt     | FedSim     | FedAlt      | FedSim     | FedAlt     | FedSim
Input Layer  | 0.86±0.03  | 1.00±0.02  | 0.44±0.03   | 0.42±0.03  | 0.11±0.02  | 0.10±0.04
Output Layer | 1.08±0.03  | 1.10±0.02  | 1.47±0.04   | 1.46±0.05  | 0.15±0.02  | 0.11±0.02
Adapter      | 0.84±0.04  | 0.88±0.02  | 11.68±0.20  | 10.42±0.09 | 0.46±0.02  | 0.42±0.04

Figure 10. Scatter plot of the change in accuracy (pp) per device versus the number of training samples on the device, showing the effect of regularization. Top: ℓ2 regularization, a.k.a. weight decay (parameter = 0, best, large). Bottom: dropout (= 0, best, large). The best values of the ℓ2 regularization parameter and dropout are chosen to maximize the average test accuracy across all devices.

Figure 11. Change in per-device accuracy (pp) due to personalization. The solid line is the mean over 5 random runs and the shaded area denotes the max/min across these runs. The devices are sorted in ascending order of accuracy change. The points in orange depict two example devices which might be either helped or harmed by personalization depending on the random seed.