# Personalization Improves Privacy-Accuracy Tradeoffs in Federated Learning

Alberto Bietti 1, Chen-Yu Wei 2, Miroslav Dudík 3, John Langford 3, Zhiwei Steven Wu 4

1 Center for Data Science, New York University; 2 University of Southern California; 3 Microsoft Research, New York; 4 Carnegie Mellon University. Correspondence to: Alberto Bietti. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Large-scale machine learning systems often involve data distributed across a collection of users. Federated learning algorithms leverage this structure by communicating model updates to a central server, rather than entire datasets. In this paper, we study stochastic optimization algorithms for a personalized federated learning setting involving local and global models subject to user-level (joint) differential privacy. While learning a private global model induces a cost of privacy, local learning is perfectly private. We provide generalization guarantees showing that coordinating local learning with private centralized learning yields a generically useful and improved tradeoff between accuracy and privacy. We illustrate our theoretical results with experiments on synthetic and real-world datasets.

1. Introduction

Many modern applications of machine learning involve data from a large set of users. In such settings, both privacy considerations and bandwidth limits may require keeping each user's data on their device, communicating with a centralized server only via shared model updates, a scenario commonly known as federated learning (Kairouz et al., 2021). When the data distribution varies across users, it is often beneficial to consider personalized models, where parts of the model (e.g., user-specific embeddings or additive offsets) are local to each user, leading to updates that may be performed locally. Such personalized federated learning has been deployed at scale in a wide range of applications, including keyboard next-word prediction (Wang et al., 2019), automated speech recognition, and news personalization (Paulik et al., 2021).

In this paper, we focus on personalized federated learning algorithms based on stochastic optimization, which are the most widely used in this context (McMahan et al., 2017; Wang et al., 2021). Concretely, we consider stochastic optimization problems of the following form:

f(w, θ_{1:N}) := (1/N) Σ_{i=1}^{N} E_{ξ∼P_i}[f_i(w, θ_i, ξ)],   (1)

where N is the number of users, w is a global parameter, θ_{1:N} = (θ_1, . . . , θ_N) is a set of local parameters, and P_i denotes the sample distribution for user i. Similar objectives have been considered in other works (Agarwal et al., 2020; Hanzely et al., 2021). The personalization problem raises the question of how much learning should happen at the local vs. global level: if the local models are expressive enough, local learning alone should suffice to learn good models given enough samples per user, yet if the data has some shared structure across users, a global model may help us learn more efficiently. Although federated learning algorithms do not store user data on the central server, private user information may still be leaked in the resulting models, unless appropriate mechanisms are applied to guarantee privacy.
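To make the structure of objective (1) concrete, the following is a minimal sketch of how the average-over-users risk can be estimated from per-user samples. The quadratic loss on a sum of global and local parameters is an illustrative placeholder (a model of this form appears later as the additive model (3)); the toy data and function names are assumptions, not the paper's setup.

```python
# A minimal sketch of the personalized objective (1): the average over users of
# each user's expected risk, estimated here by empirical means over per-user samples.
import numpy as np

def per_sample_loss(w, theta_i, xi):
    """Placeholder f_i(w, theta_i, xi): squared error of a linear predictor."""
    x, y = xi
    return 0.5 * (y - x @ (w + theta_i)) ** 2

def objective(w, thetas, user_data):
    """Monte-Carlo estimate of f(w, theta_{1:N}) = (1/N) sum_i E_xi[f_i(w, theta_i, xi)]."""
    risks = [np.mean([per_sample_loss(w, theta_i, xi) for xi in data_i])
             for theta_i, data_i in zip(thetas, user_data)]
    return float(np.mean(risks))

# Toy usage: N users, each with a handful of (x, y) samples.
rng = np.random.default_rng(0)
N, d, n_i = 5, 3, 10
user_data = [[(rng.normal(size=d), rng.normal()) for _ in range(n_i)] for _ in range(N)]
print(objective(np.zeros(d), [np.zeros(d) for _ in range(N)], user_data))
```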
In this work, we consider the notion of user-level (joint) differential privacy (DP) (Dwork et al., 2006; 2010a; Kearns et al., 2014), which ensures an adversary cannot reliably detect the presence or absence of all the data associated with a single user based on the output information. Such a notion has been used successfully in federated learning problems, with practical optimization algorithms that lead to useful privacy guarantees (McMahan et al., 2018; Hu et al., 2021). Unfortunately, ensuring such privacy may come at a cost to accuracy (see, for example, the lower bounds of Bassily et al., 2014). The key question we address is: Can we leverage personalization to improve privacy-accuracy tradeoffs in federated learning?

The insight motivating our work is that in personalized federated learning, it is only the global portion of the optimization that needs to experience this drop in accuracy. A user's local model can compensate for the privacy-guaranteeing accuracy limitations of the global model. We formalize this by considering algorithms with a personalization parameter α that may vary the level of personalization from local learning (α = 0) to global learning (α = ∞). We then show generalization bounds on the objective (1) of the form

f(w_n, θ_{1:N,n}) − f* ≤ C_stat(α, n, N) + C_priv(α, ϵ, N),   (2)

where f* = min_{w, θ_{1:N}} f(w, θ_{1:N}) is the optimal risk, n is the number of observed samples per user, and ϵ is the DP privacy parameter. Here, we expect the second term to vanish as α → 0, since local learning does not suffer from privacy. Crucially, this privacy cost does not depend on the number of samples per user n, while the first term, which captures statistical efficiency, generally decreases with n for any α. This emphasizes how the level of personalization α can improve generalization by balancing the trade-off between these two terms. Concretely, we provide precise guarantees of this form for simple federated private stochastic gradient algorithms, where the number of iterations corresponds to the number of samples per user n, and the personalization parameter α is given by the relative step-size between global and local updates. In particular, we show that α affects the complexity of learning by changing the geometry of the optimization: in problems that benefit from global models, small α makes learning more difficult but reduces the cost of privacy. We complement our theoretical results with experiments on synthetic and real-world federated learning datasets, which illustrate how varying the step-size ratio leads to improved trade-offs between accuracy and privacy.

2. Related Work

Model personalization. There are a variety of approaches to personalization in federated learning. In local fine-tuning, a global model is learnt by federated learning and then used as a warm start for on-device learning from the cache of local data (Wang et al., 2019; Paulik et al., 2021). This approach can be augmented with federated learning of hyperparameters (Wang et al., 2019; Jiang et al., 2019; Khodak et al., 2019) to obtain federated learning variants of meta-learning approaches like MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018).
Another approach is to view personalization as a multi-task learning problem, and learn task-specific models with a regularization that forces parameters for similar tasks to be close, e.g., in a Euclidean norm (Vanhaesebrouck et al., 2017; Smith et al., 2017; Arivazhagan et al., 2019; Mansour et al., 2020; Dinh et al., 2020; Shen et al., 2020; Huang et al., 2021; Marfoq et al., 2021; Singhal et al., 2021). Task similarity is typically expressed as a weighted undirected graph, a matrix, or a clustering. A special case of the multi-task ap- proach is to learn a global model in addition to local models, which are regularized to be close to the global model (Mansour et al., 2020; Deng et al., 2020; Hanzely & Richt arik, 2020; Hanzely et al., 2021; Marfoq et al., 2021). This is closest to the approach here, which also separates global and local parameters. However, the approach here is more general, because we allow a broader range of modeling relationships between the global and local parameters, similar to federated residual learning (Agarwal et al., 2020). The statistical aspects of such personalization models were studied by Mansour et al. (2020); Agarwal et al. (2020), who in particular provide generalization guarantees for additive personalization models similar to ours. However, compared to the present paper, these works do not provide privacy guarantees nor study the effect of varying the level of personalization. Privacy. The results here advance a recent line of work that provides formal user-level privacy guarantees for model personalizaton in federated learning. Similar to the prior works (Jain et al., 2021) and (Hu et al., 2021), we adopt the privacy formulation of joint differential privacy (JDP, Kearns et al., 2014), a variation of DP that is more suitable for the problem of model personalization than standard DP. (Jain et al., 2021) provides private algorithms that first learn a shared linear representation for all users, and allow each user to learn their local linear regression model over the lower-dimensional representation. This leads to a factorized model of personalization, which is different than ours and not handled by our theoretical assumptions due to non-convexity. (Hu et al., 2021) provides a private personalizaton algorithm through the mean-regularized multi-task learning objective without establishing its statistical rates. In contrast, we consider more general personalization schemes and provide statistical guarantees. The results here are also related to other work in DP with a similar motivation to model personalization. (Li et al., 2020) studies meta-learning under DP. Their framework does not cover model personalization with a separate model for each user. (Noble et al., 2021) studies federated learning with DP with heterogeneous data across nodes, but they do not support personalization. (Bellet et al., 2018) studies fully decentralized DP algorithms for collaborative learning over a network instead of federated learning. Finally, the notion of user-level privacy has also been adopted in prior works (Mc Mahan et al., 2018; Levy et al., 2021), but these do not consider model personalization. 3. Preliminaries In this section, we introduce the problem of personalized federated optimization, as well as the notion of user-level privacy that we consider. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning 3.1. 
Problem setting We consider stochastic optimization problems of the form f(w, θ1:N) := 1 i=1 fi(w, θi) where N is the number of users, w Rdw a global parameter, θ1:N = (θ1, . . . , θN) (Rdθ)N a set of local parameters, and fi(w, θi) := Eξ Pi[fi(w, θi, ξ)] is the expected risk of user i, with random samples ξ drawn from an unknown user-specific distribution Pi. While our algorithms may be run on arbitrary differentiable models, our analysis focuses on the convex setting, where fi(w, θi, ξi) is jointly convex in (w, θi) for all ξi. Additive model. An important special case is the additive model for supervised learning, where dw = dθ = d, and fi(w, θi, (x, y)) = ℓ(y, (w + θi) x), (3) where ℓis a loss function and ξ = (x, y) is a training sample. As a running example, consider a movie recommendation app, which seeks to predict how each user i will rate any given movie. In this case, x corresponds to the features describing a movie (e.g., its genre, popularity, length, actors, etc.), and y is the user s rating of that movie. Following the federated protocol, the data about user activity stays on the device and only model parameters can be communicated to the server. In this case, only the information about the global parameter w is communicated. The additive model makes the optimization problem (1) underdetermined, with many possible equivalent solutions obtained by adding any vector v Rd to w and then subtracting v from each θi, effectively shifting the predictive ability between global and local parameters. This is what allows the optimization algorithm to achieve different tradeoffs between statistical generalization (accuracy) and sharing of information across users (privacy). To develop the intuition about this tradeoff, first consider the homogeneous scenario, where the optimal parameters θ i arg minθ E(x,y) Pi[ℓ(y, θ x)] for each user are equal (θ 1 = = θ N). There are two extreme approaches: (i) local learning, where only θi are trained individually for each user, leading to poor sample complexity but perfect privacy (ii) global learning, where only w is trained, and we benefit from using samples from all users, but need communication with a centralized server and lose privacy. In the more realistic heterogeneous scenario, when the θ i are different but have some shared components, e.g., only some coordinates differ across users, joint learning of w and θ1:N can achieve greater accuracy (at the same number of samples) than both local and global learning, by leveraging more samples to estimate the shared components, while benefiting from user-specific data to personalize. Note that in this case, it is possible to further improve privacy, by allowing some shared components to be fitted entirely locally. Quantifying this improvement is the focus of our work. In the movie recommendation example, if only global learning is performed, the prediction w x can use overall popularity statistics, but the resulting prediction is the same for all users, without any personalization. On the other hand, if only local learning is performed, although the system can fully personalize and provide full privacy, the quality of recommendation is limited by the number of movies the user watched before. With joint learning, the system can capture both global trends through w and adapt to each user s preference through θi. 3.2. 
User-level (joint) differential privacy We aim to provide user-level privacy (Dwork et al., 2010a), which ensures that an adverary cannot detect the presence or absence of all of the data associated with a single user when given the output of the algorithm. To achieve this goal, we design algorithms using the billboard model (Hsu et al., 2014) of differential privacy (DP) (Dwork et al., 2006). Let us first revisit the definition of DP, which informally requires that changing any single user s data cannot change the algorithm s output by much. Definition 3.1 (User-level DP; Dwork et al., 2006; 2010a). A randomized mechanism M is (ϵ, δ)-differential privacy (DP) if for all pairs of data sets D, D that differ by a single user s data and all events E in the output range, Pr[M(D) E] eϵ Pr[M(D ) E] + δ. Billboard model. In the billboard model, a server computes aggregate information subject to the constraint of DP and shares the information as public messages with all N users. Then, based on the public messages and their own private data, each user computes their own personalized model. The billboard model is particularly compatible with algorithms in the federated setting, where the DP messages are typically noisy summary statistics about the users local models (Jain et al., 2021; Hu et al., 2021). (Hsu et al., 2014) shows that algorithms under the billboard model provide an extremely strong privacy guarantee, known as joint differential privacy (JDP, Kearns et al., 2014). Joint differential privacy (JDP). Let Di denote the collection of samples associated with each user i. Two datasets D and D are called i-neighbors if they only differ by user i s private data. For any mechanism M, we denote M i(D) as the output information to all other users except user i. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning Algorithm 1 Personalized-Private-SGD (PPSGD) 1: Input: η: step-size, α: global/local ratio, σζ: privacy noise level, C: clipping parameter. 2: Initialize w0 = θ0 = 0. 3: for t = 1 to n do 4: for all clients i in parallel do 5: Sample data ξi,t Pi 6: Compute gt θ,i = θfi(wt 1, θi,t 1, ξi,t) gt w,i = wfi(wt 1, θi,t 1, ξi,t) 7: Update θi,t = θi,t 1 η N gt θ,i 8: Clip gradient: gt w,i = gt w,i/max(1, gt w,i C ) 9: Send gt w,i to the server 10: end for 11: Sample ζt N(0, σ2 ζIdw) 12: Update wt = wt 1 αη( 1 N PN i=1 gt w,i + ζt) 13: end for Definition 3.2 (Joint-Differential Privacy, JDP). An algorithm M is (ϵ, δ)-jointly differentially private, written as (ϵ, δ)-JDP, if for all i, all i-neighboring datasets D, D , and all events E in the output space to all other users except i, Pr[M i(D) E] eϵ Pr[M i(D ) E] + δ. In the setting of model personalization, JDP implies that even if all of users except i collude, potentially pooling their private data and local models together, user i s private data are still protected, so long as i does not reveal their own model. 4. Main Algorithm and Analysis Our main algorithm, shown in Algorithm 1, is a personalized version of distributed SGD (Dekel et al., 2012), with a personalization parameter α that controls the relative step-size between global and local updates. At each round t, each user i samples a fresh datapoint ξi,t (this could also be a mini-batch), updates its local model θi, and sends the gradient with respect to w to the central server, which aggregates gradients of all users before updating the global model w. The total number of rounds n thus corresponds to the number of samples per user. 
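For concreteness, here is a minimal NumPy sketch of one round of Algorithm 1 (PPSGD), assuming full client participation; grad_w_fn, grad_theta_fn, and sample_fn are placeholder oracles for the per-sample gradients and the per-user distributions P_i, not part of the released code.

```python
# A minimal sketch of one round of Algorithm 1 (PPSGD), following its listing:
# fresh sample per user, local update, clipping, Gaussian noise, alpha-scaled global step.
import numpy as np

def ppsgd_round(w, thetas, grad_w_fn, grad_theta_fn, sample_fn,
                eta, alpha, sigma_zeta, C, rng):
    N = len(thetas)
    clipped_w_grads = []
    for i in range(N):
        xi = sample_fn(i)                              # line 5: sample xi_{i,t} ~ P_i
        g_theta = grad_theta_fn(w, thetas[i], xi)      # line 6: local gradient
        g_w = grad_w_fn(w, thetas[i], xi)              # line 6: global gradient
        thetas[i] = thetas[i] - (eta / N) * g_theta    # line 7: local update
        g_w = g_w / max(1.0, np.linalg.norm(g_w) / C)  # line 8: clip to norm C
        clipped_w_grads.append(g_w)                    # line 9: send to server
    zeta = rng.normal(scale=sigma_zeta, size=w.shape)  # line 11: Gaussian privacy noise
    w = w - alpha * eta * (np.mean(clipped_w_grads, axis=0) + zeta)  # line 12
    return w, thetas
```

Setting α = 0 skips the global step entirely (local-only learning), while larger α weights the global update more heavily, matching the role of the global/local step-size ratio.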
In order to guarantee user-level privacy, we clip each user s gradients gt w,i before aggregation, and add Gaussian noise, following common practice (Abadi et al., 2016; Mc Mahan et al., 2018; Hu et al., 2021; Chen et al., 2020). The choice of α affects the degree by which the algorithm favors local learning over global learning, with α = 0 forcing local-only updates. Privacy analysis. We first establish the formal user-level JDP guarantee of Algorithm 1. Theorem 4.1 (Privacy). Suppose we set the noise parameter for an absolute constant c, then Algorithm 1 satisfies (ϵ, δ)- JDP in the billboard model. We defer the full proof to Appendix B.1. At a high level, the proof first shows that the aggregate information released by the server (the sequence of global model updates) satisfies (ϵ, δ)-DP. Since the sequence of global models are sufficient statistics for each user to identify their personalized model θi, the JDP guarantee follows from the billboard lemma (see Lemma B.1). Generalization analysis. Our generalization analysis relies on the following assumptions. We begin with an assumption about minimizers of f, which relies on the following norm for a vector z = (w, θ1:N) that captures the geometry induced by the personalization parameter α: α w 2 + θ1:N 2. (5) Assumption 4.2 (Minimizers). f admits a minimizer z with finite norm z α. In the case of local learning (α = 0), this implies z must have no global component. For joint learning, different minimizers might exist, and our bounds scale with the minimal norm z α among all such minimizers. Assumption 4.3 (Convexity and smoothness). For all i and Pi-almost every ξ, the function (w, θi) 7 fi(w, θi, ξ) is jointly convex and L-smooth (its gradients are L-Lipschitz). Note that this implies that f(z) is jointly convex in z = (w, θ1:N). If we consider the example fi(w, θi, (x, y)) = 1 2(y (w + θi) x)2, and assume x R almost surely, it is easy to verify that the assumption holds with L = 2R2. We will also make the following boundedness assumption on gradients, which is commonly made in the context of private stochastic optimization (e.g., Bassily et al., 2014; Feldman et al., 2020), and simplifies our analysis by avoiding the need to study the effect of clipping on optimization. Assumption 4.4 (Bounded gradients). For all i, w, θi, and Pi-almost every ξ, we have wfi(w, θi, ξ) G. This assumptions avoid the need to clip gradients when C G, and our analysis hereafter assumes C = G. We note that G may be large in some cases, growing with the norm of optimal parameters, thus smaller values of C may often be beneficial in practice for better privacy guarantees. Gradient variances. We consider the following variance quantities, which we assume to be finite, and which are Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning obtained by considering gradients at a given minimizer z : σ2 w,i = Eξ wfi(w , θ i , ξ) wfi(w , θ i ) 2 σ2 θ,i = Eξ θfi(w , θ i , ξ) 2 i σ2 w,i, σ2 θ = 1 As an example, note that in a simple additive model (3) with squared loss and additive label noise of variance τ 2, we have σ2 w,i = σ2 θ,i = τ 2 Tr(EPi[xx ]), recovering standard statistical quantities. If the algorithm relies on mini-batches of size m instead of single datapoints, we may then replace σ2 w,i and σ2 θ,i by σ2 w,i/m and σ2 θ,i/m, respectively, by averaging gradients over the mini-batches. We now provide our main result on the convergence rate of Algorithm 1, which also yields a generalization bound on the excess risk. 
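For reference when reading Theorem 4.5 below, the α-norm from (5), the smoothness constant, and the aggregate variances from (6) can be written out as follows; this is a reconstruction from the surrounding definitions.

```latex
% Quantities used in Theorem 4.5, reconstructed from (5), (6), and the per-user variances above.
\|z\|_\alpha^2 = \tfrac{1}{\alpha}\|w\|^2 + \|\theta_{1:N}\|^2, \qquad
L_\alpha = L \max\!\left(\alpha, \tfrac{1}{N}\right), \qquad
\sigma_w^2 = \tfrac{1}{N}\textstyle\sum_{i=1}^N \sigma_{w,i}^2, \quad
\sigma_\theta^2 = \tfrac{1}{N}\textstyle\sum_{i=1}^N \sigma_{\theta,i}^2, \qquad
\sigma_{\mathrm{tot},\alpha}^2 = \frac{\alpha \sigma_w^2 + \sigma_\theta^2}{N} + \alpha\, d_w\, \sigma_\zeta^2 .
```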
The proof is in Appendix B.2. Theorem 4.5 (Generalization). Under Assumptions 4.2, 4.3, and 4.4, let z = (w , θ1:N ) be any minimizer of f, Lα := L max(α, 1 σ2 tot,α := α σ2 w + σ2 θ N + αdwσ2 ζ. (6) With η = min{ 1 4Lα , z α nσtot,α } and C = G, Algorithm 1 satisfies E[f( zn) f(z )] 4Lα z 2 α n + 3σtot,α z α n , (7) with zn = 1 n Pn 1 t=0 zt. In particular, with σζ as in (4), hiding absolute constants, we have the following generalization bound: E[f( zn) f(z )] (8) Lα z 2 α n + z α α σ2w + σ2 θ Nn + z α αdw G2 log( 1 The generalization bound takes the form (2), with a cost of privacy that does not depend on the number of samples per user n. As is common in the analysis of SGD, our bound displays a bias term that decays as 1/n, and a variance term decaying as 1/ n, controlled by the gradient variance σ2 tot,α. Ignoring the privacy noise, given that we use N samples at each round, this variance scales as 1/N, leading to an overall asymptotic rate of 1/ Nn, where Nn is the total sample size. Using minibatches of size m would further improve this to 1/ Nnm, where the total number of samples is now Nnm. Local vs global learning. Note that when multiple minimizers z exist, one may choose the one with minimal norm z α. If we consider the additive model (3), we may always consider a minimizer of the form z = (0, θ 1:N). Then z α is bounded even for α = 0 (local learning), which yields an excess risk of order In particular, there is no cost for privacy. If we further assume θ i = v for all i, this becomes We see slow convergence that does not improve with N, which is expected given that no information is shared across users. In this setting, we may instead consider a different minimizer z = (v , 0), for which taking the limit α (global learning) yields the excess risk of order dw G2 log( 1 δ ) N 2ϵ2 . The variance term now decays faster for large N, but has an additional privacy term, which decreases quickly with N but does not improve with large n and may be quite large, particularly in high dimensions, consistent with lower bounds for differentially private optimization (Bassily et al., 2014). Characterizing benefits of personalization. Depending on the number of samples per user n, it may be helpful to choose varying levels of personalization α to adjust this tradeoff, from large α for small n to small α for large n when the privacy cost dominates the bound. The next lemma quantifies this tradeoff for the additive model with homogeneous users (i.e., θ 1 = = θ N): Lemma 4.6 (Threshold on n for personalization benefits). Assume (v, 0) is a minimizer of f, with f an additive model of the form (3). For any α, the minimizer with minimal α-norm is given by (w, θ1:N) with w = αN αN+1v and θi = 1 αN+1v. Assuming σw = σθ = σ, the variance term in the excess risk (9) with n samples per user is monotonic in α, taking the form: Nn + αdw G2 log(1/δ) This is non-decreasing with α if and only if n N(N 1)σ2ϵ2 dw G2 log(1/δ) . (10) Thus, if we ignore the bias term in (9), this suggests that when (10) holds, using local learning (smaller α) should Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning help improve generalization, while if the reverse inequality holds, global learning should be preferred. Eq. (10) suggests that this threshold on the user sample size n scales quadratically with the number of users N, linearly with the privacy level ϵ, and inversely with the dimension dw. 
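As a small illustration of the threshold in (10), the sketch below checks, for given problem constants, whether the per-user sample size n is past the point where local learning (smaller α) is favored. The constants are taken at face value from (10), ignoring absolute constants; the toy numbers are hypothetical.

```python
# Illustrative check of the personalization threshold (10) from Lemma 4.6:
# once n exceeds the threshold, the variance term is non-decreasing in alpha,
# so smaller alpha (more local learning) should help.
import math

def personalization_threshold(N, sigma, eps, d_w, G, delta):
    """Right-hand side of (10): N(N-1) sigma^2 eps^2 / (d_w G^2 log(1/delta))."""
    return N * (N - 1) * sigma**2 * eps**2 / (d_w * G**2 * math.log(1.0 / delta))

def prefer_local(n, N, sigma, eps, d_w, G, delta):
    """True if (10) holds, suggesting local learning; otherwise global learning is preferred."""
    return n >= personalization_threshold(N, sigma, eps, d_w, G, delta)

# Hypothetical numbers: 1000 users, 100 dimensions, unit noise and gradient scales.
print(personalization_threshold(N=1000, sigma=1.0, eps=1.0, d_w=100, G=1.0, delta=1e-4))
print(prefer_local(n=5000, N=1000, sigma=1.0, eps=1.0, d_w=100, G=1.0, delta=1e-4))
```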
When the optimal user parameters are different, this transition would likely happen for smaller n, as we expect local models to be useful even at small sample sizes regardless of privacy. Improving the bias term. We remark that the bias term decreases as 1/n in both local and global learning scenarios described above, and does not improve with the number of users. As in standard SGD (e.g., Dekel et al., 2012), this term decreases with the number of rounds, and may thus decrease more quickly with the number of samples if more communication rounds are performed for the same total number of samples. In Section 5, we show that sampling users at each round can help improve this term. Comparison to Local SGD. A common approach for federated optimization is the local SGD (or federated averaging) algorithm (Mc Mahan et al., 2017), which performs multiple local steps before communicating with the server. This is in contrast to our method, which more closely resembles mini-batch SGD. We note that despite its practical success, the known theoretical guarantees of local SGD typically do not improve on mini-batch SGD except for specific settings (Woodworth et al., 2020). Our study therefore focuses on understanding personalization in the SGD setting, but we note that extending our analysis to local SGD is an interesting direction for future work. 5. Heterogeneous Sample Sizes, User Sampling In this section, we study a variant of Algorithm 1 where we sample some of the users uniformly at each round, with possibly different minibatch sizes for each user. This reflects the fact that users may have different amounts of data, and that not all clients may be available at each round. Let q be the probability of sampling any given user at each round, mi 1 be the mini-batch size for user i, and define M = P i mi and mmax = maxi mi. Algorithm 2 then optimizes the objective f m(w, θ1:N) := M fi(w, θi). (11) Namely, the algorithm optimizes each user s performance proportionally to their amount of samples. We denote the number of rounds by T here, since it no longer corresponds to the total number of samples per user n. Note that the expected total number of samples observed for user i after T rounds is now given by ni = Tqmi. Algorithm 2 PPSGD with client sampling 1: Input: q: client sampling probability, mi: minibatch sizes, η: step size, α: global/local ratio, σζ: privacy noise level, C: clipping parameter. 2: Initialize w0 = θ0 = 0. 3: for t = 1 to T do 4: Sample bi,t Ber(q) 5: for all clients i with bi,t = 1 in parallel do 6: Sample a minibatch {ξ(k) i,t }mi k=1 P mi i 7: Compute gt θ,i = Pmi k=1 θfi(wt 1, θi,t 1, ξ(k) i,t ) gt w,i = Pmi k=1 wfi(wt 1, θi,t 1, ξ(k) i,t ) 8: Update θi,t = θi,t 1 η q M gt θ,i 9: Clip gradient gt w,i = gt w,i/max(1, gt w,i C ) 10: Send gt w,i to the server 11: end for 12: Sample ζt N(0, σ2 ζIdw) 13: Update wt = wt 1 αη( 1 i:bi,t=1 gt w,i + ζt) 14: end for The following result provides a privacy guarantee for Algorithm 2. Compared to Theorem 4.1, it also leverages a standard argument of privacy amplification by subsampling. Theorem 5.1 (Privacy with client sampling). There exist absolute constants c1, c2 such that for any ϵ < c1q2T, if we take then Algorithm 2 satisfies (ϵ, δ)-JDP. Note that compared to the uniform sampling case, the clipping parameter C may need to be larger since we are clipping the sum of up to mmax gradients. In our generalization analysis hereafter, we thus assume C = Gmmax, so that clipping is not needed under Assumption 4.4. 
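For concreteness, a minimal sketch of one round of Algorithm 2 is given below, assuming the same placeholder gradient oracles as in the earlier sketch; it mirrors the listing above (Bernoulli client sampling, minibatch gradient sums, clipping at C, and aggregation scaled by 1/(qM)).

```python
# A minimal sketch of one round of Algorithm 2 (PPSGD with client sampling);
# grad_w_fn / grad_theta_fn / sample_batch_fn are placeholder oracles.
import numpy as np

def ppsgd_sampled_round(w, thetas, batch_sizes, grad_w_fn, grad_theta_fn,
                        sample_batch_fn, eta, alpha, q, sigma_zeta, C, rng):
    N = len(thetas)
    M = sum(batch_sizes)
    w_grad_sum = np.zeros_like(w)
    for i in range(N):
        if rng.random() >= q:                                           # line 4: b_{i,t} ~ Ber(q)
            continue
        batch = sample_batch_fn(i, batch_sizes[i])                      # line 6: minibatch from P_i
        g_theta = sum(grad_theta_fn(w, thetas[i], xi) for xi in batch)  # line 7: summed local grad
        g_w = sum(grad_w_fn(w, thetas[i], xi) for xi in batch)          # line 7: summed global grad
        thetas[i] = thetas[i] - (eta / (q * M)) * g_theta               # line 8: local update
        g_w = g_w / max(1.0, np.linalg.norm(g_w) / C)                   # line 9: clip to norm C
        w_grad_sum += g_w                                               # line 10: send to server
    zeta = rng.normal(scale=sigma_zeta, size=w.shape)                   # line 12: privacy noise
    w = w - alpha * eta * (w_grad_sum / (q * M) + zeta)                 # line 13: global update
    return w, thetas
```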
We now give an optimization and generalization guarantee. Theorem 5.2 (Generalization with client sampling). Under Assumptions 4.2, 4.3, and 4.4, let z = (w , θ1:N ) be any minimizer of f m, Lm,α = L max α+ αmmax σ2 m,α := α σ2 w,m + α σ2 w,m + σ2 θ,m q M + αdwσ2 ζ, (13) where σ2 w,m := 1 M P i miσ2 w,i and σ2 θ,m := 1 M P i miσ2 θ,i and σ2 w,m := 1 M P i q(1 q)m2 i wfi(w , θ i ) 2. With η = min 1 4Lm,α , z α T σm,α and C = Gm := Gmmax, Algorithm 2 satisfies E[f m( z T ) f m(z )] 4Lm,α z 2 α T + 3σm,α z α Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning with z T = 1 T PT 1 t=0 zt. Taking σζ as in (12), we have: E[f m( z T ) f m(z )] Lm,α z 2 α T + (14) α σ2w,m+α σ2w,m + σ2 θ,m q MT + αdw G2m log( 1 Note that the privacy cost now scales with Gm/M = Gmmax/M instead of G/N in Theorem 4.5. Note that we have Gmmax/M G/N, with equality if and only if all the mi are equal. This highlights the additional cost of privacy with heterogeneous sample sizes, particularly when some users have much more data than others. Compared to Theorem 4.5, there is also an additional variance term σw,m induced by the randomness of client sampling. Note that σw,m vanishes when (i) q = 1 (i.e., no sampling), or (ii) wfi(w , θ i ) = 0 for all i, which always holds for the additive model (3). When all users are sampled (q = 1) and mi = 1 for each user, we have M = N and σw,m = 0, and we recover the same bound as in Theorem 4.5, where T = n corresponds to the number of samples per user. In this case the bias term decreases slowly as 1/n. In contrast, if mi = 1 and q = 1/N, i.e., we sample one user per round on average, then q M = 1, and T now corresponds to the total number of samples Nn. The bias term now decreases as 1/Nn, while the variance decreases at the same 1/ Nn rate. Of course, this comes at the cost of more frequent communication rounds, since each round only uses the data of a single user. Optimizing average user performance. In some cases, we may have clients with heterogeneous amounts of data, but still want to optimize the average performance f in (1), rather than the weighted average f m in (11). This can be achieved by replacing gt θ,i and gt w,i by averages over the minibatch instead of sums in Algorithm 2. This leads to the following guarantee on the excess risk (see Appendix B.6): E[f( z T ) f(z )] (15) Lα z 2 α T + z α α σ2w + σ2 θ q Nmmin T + αdw G2 log( 1 δ ) N 2ϵ2 . Note that the privacy cost is similar to the case with homogeneous data in Section 4, rather than the larger cost of Theorem 5.2, since we are now treating each user equally. Nevertheless, the variance term displays an effective total sample size of ntot = q Nmmin T, which is smaller than in Theorem 5.2, where ntot = q MT corresponds to the expected total number of samples, unless all the mi are equal. This highlights that this improvement in privacy comes at a cost in statistical efficiency. 6. Experiments In this section, we present numerical experiments that illustrate our theoretical guarantees on both synthetic and real-world federated learning datasets. Our code is available at https://github.com/albietz/ppsgd. Experiment setup. We consider linear additive models of the form (3) with the squared loss. For classification datasets, we use a one-versus-all setup. Unless otherwise mentioned, we run our algorithms for one pass over the data, simulating the online setting considered by our theory. 
We sample a fixed number of users (denoted Q) at each iteration, chosen randomly without replacement, and consider minibatches of size m = 10 for each user. Following standard practice, we compute privacy guarantees ϵ at each iteration using the moments accountant (Abadi et al., 2016), with a fixed δ = 10 4, after applying Gaussian noise of variance σ2 ζ = σ2C2, where C is a clipping parameter and σ a noise multiplier. For each run, we consider a fixed step-size η, personalization parameter α, clipping parameter C, and noise multiplier σ, chosen from a grid (see Appendix A). In order to assess the best achievable performance, we optimize the learning rate separately at each number of iterations reported in our figures. Synthetic dataset. We begin with a synthetic regression example, where each user s data is generated from a ground truth linear model, and multiple coordinates of the parameters are shared across users. We consider N = 1000 users and data in d = 100 dimensions. The parameters θ i are generated as follows from a random θ0 N(0, 102Id): [θ i ]1:95 = [θ0]1:95 [θ i ]96:100 = [θ0]96:100 + δi, δi N(0, 0.012I), while the samples (xi, yi) for user i are generated by the following process: xi N(0, Σ), Σ = diag{λk}d k=1, λk = 1/k yi = θ i xi + εi, εi N(0, τ 2). With this setup, the excess risk can be computed in closed form, and is given by 1 i w+θi θ i 2 Σ, and is shown in Figure 1a. We observe that global learning (α = ) enjoys faster convergence at initial iterations in all cases, thanks to better data efficiency, but eventually stops improving, even without privacy noise, since minimizing risk requires personalization in this model. Small choices of α lead to better performance in later iterations, and with larger levels of privacy noise, full local learning (α = 0) dominates joint learning quite early in the learning process, after about 200 iterations, i.e., 2000 samples per user. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning 0 200 400 600 800 1000 iterations excess risk privacy noise = 0.0 Q = 3.0 Q = 0 200 400 600 800 1000 iterations excess risk privacy noise = 1.0 0 200 400 600 800 1000 iterations excess risk 17.5 privacy noise = 10.0 (a) Synthetic: N = 1000, d = 100, C = 10. One pass with Q = N and m = 10. 0 250 500 750 1000 1250 1500 1750 2000 iterations test accuracy privacy noise = 0.0 0 500 1000 1500 2000 iterations test accuracy 700 privacy noise = 0.2 0 500 1000 1500 2000 iterations test accuracy privacy noise = 1.0 (b) Stackoverflow tag prediction: N = 500, d = 5000 80, C = 0.01. One pass with Q = 10 and m = 10. 0 2500 5000 7500 10000 12500 15000 17500 iterations test accuracy privacy noise = 0.0 Q = 0 Q = 0.1 Q = 1.0 Q = 10.0 0 2500 5000 7500 10000 12500 15000 17500 test accuracy privacy noise = 1.0 0 2500 5000 7500 10000125001500017500 test accuracy privacy noise = 10.0 (c) EMNIST linear: N = 1000, d = 784 10, C = 10. 20 passes with Q = 10 and m = 10. 0 2500 5000 7500 10000 12500 15000 17500 iterations test accuracy privacy noise = 0.0 Q = 0.0 Q = 0.01 Q = 0.1 Q = 1.0 Q = 0 2500 5000 7500 10000 12500 15000 17500 test accuracy privacy noise = 1.0 0 2500 5000 7500 10000125001500017500 test accuracy 1.75 privacy noise = 3.0 (d) EMNIST CNN: N = 1000, d = 784 10, C = 0.1. 20 passes with Q = 10 and m = 10. Figure 1. Test performance of PP-SGD on the synthetic, Stackoverflow, and EMNIST datasets. Each plot shows curves for different levels of personalization α for fixed privacy noise σ and clipping parameter C. 
In the private setting, the green dashed line displays the privacy guarantee ϵ at iteration T for all curves in each plot (except α = 0 with is always fully private). Q is the number of sampled users and m the number of samples per user in each round. The choice αQ = 1 equalizes the variance of global and local updates due to samples. Stackoverflow tag prediction. We now consider a subset of the Stackoverflow dataset1 consisting of N = 500 users, 1https://www.tensorflow.org/federated/ api_docs/python/tff/simulation/datasets/ stackoverflow/load_data each with about 300 500 training and 10 100 test documents in a bag-of-words format of dimension 5000. The task is to predict Stackoverflow tags from documents that consist of user questions, and we use the 80 most popular tags as labels. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning 10 1 100 101 102 103 104 105 106 accuracy vs privacy (Stackoverflow) local only global only best personalized 10 1 100 101 102 103 accuracy vs privacy (EMNIST-linear) local only global only best personalized 10 1 100 101 102 103 accuracy vs privacy (EMNIST-CNN) local only global only best personalized Figure 2. Accuracy privacy tradeoffs on Stackoverflow and EMNIST at the end of training. The red, blue, and light green curves are obtained by varying the amount of privacy noise σ while optimizing the learning rate with local, global, and intermediate models, respectively. The black curve also optimizes over the level of personalization, which improves the tradeoff. In Figure 1b, we show the top-1 accuracy on the test set, while running our method for a single pass over the data, sampling Q = 10 users and m = 10 documents per user at each iteration. We observe that without privacy, the best performance is obtained with an intermediate amount of personalization, α = 10, suggesting that local models are helpful for generalization on this dataset. With privacy noise, however, models with large α quickly degrade in performance, and small α performs best for achieving reasonable privacy levels. Overall, personalization plays a key role for improving the privacy accuracy tradeoff on this dataset, as shown in Figure 2. Federated EMNIST. Finally, we consider a subset of the federated EMNIST digit classification dataset2 consisting of N = 1000 users with about 100 training and 10 test samples per user. Here, a single pass does not lead to good performance, suggesting this may be a hard learning problem that benefits from multiple passes (Pillaud-Vivien et al., 2018). We thus run our method for 20 epochs, sampling 10 users and 10 images per user at each iteration, and show the resulting test accuracy in Figure 1c. On this dataset, we see that local learning does generally poorly, perhaps because there is a lot of shared structure in images across users that benefits significantly from global learning. Nevertheless, there is still a benefit to using small α for achieving good privacy guarantees, and personalization still plays a role in improving the privacy accuracy tradeoff (albeit less pronounced than on Stackoverflow, see Figure 2). In Figure 1d, we show similar plots on federated EMNIST for a 4-layer CNN with a shared global representation ΦW (x) and an additive model at the output layer, leading to a loss of the form fi((w, W), θi, (x, y)) = ℓ(y, (w + θi) ΦW (x)). 
We see that similar conclusions hold in this non-convex 2https://www.tensorflow.org/federated/ api_docs/python/tff/simulation/datasets/ emnist/load_data scenario, namely, small α is helpful for achieving better privacy guarantees, while large α leads to better accuracy for poor privacy levels. Here the local-only models (α = 0) perform very poorly, since the global representation has frozen random weights, highlighting that learning a useful representation in this model requires a privacy cost. 7. Discussion and Conclusion In this paper, we studied personalized private federated optimization algorithms, providing generalization guarantees in the convex setting that highlight the role of personalization and privacy. We show both theoretically and empirically that adjusting the amount of personalization is always beneficial for improving the accuracy privacy tradeoff, by providing a way to attenuate the cost of privacy through local learning at large sample sizes. Promising future directions include extending our analysis to different models of personalization, e.g., based on neural networks, and to provide adaptive algorithms which may dynamically adjust the level of personalization in an online fashion. Abadi, M., Chu, A., Goodfellow, I., Mc Mahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 2016. Agarwal, A., Langford, J., and Wei, C.-Y. Federated residual learning. ar Xiv preprint ar Xiv:2003.12880, 2020. Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. Federated learning with personalization layers. ar Xiv preprint ar Xiv:1912.00818, 2019. Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 2014. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning Bellet, A., Guerraoui, R., Taziki, M., and Tommasi, M. Personalized and private peer-to-peer machine learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. Chen, X., Wu, Z. S., and Hong, M. Understanding gradient clipping in private SGD: A geometric perspective. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012. Deng, Y., Kamani, M. M., and Mahdavi, M. Adaptive personalized federated learning. ar Xiv preprint ar Xiv:2003.13461, 2020. Dinh, C. T., Tran, N., and Nguyen, J. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems (Neur IPS), 2020. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, 2006. Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, 2010a. Dwork, C., Rothblum, G. N., and Vadhan, S. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 2010b. Feldman, V., Koren, T., and Talwar, K. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020. 
Finn, C., Abbeel, P., , and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017. Hanzely, F. and Richt arik, P. Federated learning of a mixture of global and local models. ar Xiv preprint ar Xiv:2002.05516, 2020. Hanzely, F., Zhao, B., and Kolar, M. Personalized federated learning: A unified framework and universal optimization techniques. ar Xiv preprint ar Xiv:2102.09743, 2021. Hsu, J., Huang, Z., Roth, A., Roughgarden, T., and Wu, Z. S. Private matchings and allocations. In Symposium on Theory of Computing, STOC, 2014. Hu, S., Wu, Z. S., and Smith, V. Private multi-task learning: Formulation and applications to federated learning. ar Xiv preprint ar Xiv:2108.12978, 2021. Huang, Y., Chu, L., Zhou, Z., Wang, L., Liu, J., Pei, J., and Zhang, Y. Personalized cross-silo federated learning on non-iid data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021. Jain, P., Rush, J., Smith, A., Song, S., and Guha Thakurta, A. Differentially private model personalization. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Jiang, Y., Koneˇcn y, J., Rush, K., and Kannan, S. Improving federated learning personalization via model agnostic meta learning. ar Xiv preprint ar Xiv:1909.12488, 2019. Kairouz, P., Mc Mahan, H. B., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021. Kearns, M., Pai, M. M., Roth, A., and Ullman, J. Mechanism design in large games: Incentives and privacy. American Economic Review, 104(5):431 35, May 2014. Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems (Neur IPS), volume 32, 2019. Levy, D., Sun, Z., Amin, K., Kale, S., Kulesza, A., Mohri, M., and Suresh, A. T. Learning with user-level privacy. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Li, J., Khodak, M., Caldas, S., and Talwalkar, A. Differentially private meta-learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. Mansour, Y., Mohri, M., Ro, J., and Suresh, A. T. Three approaches for personalization with applications to federated learning, 2020. Marfoq, O., Neglia, G., Bellet, A., Kameni, L., and Vidal, R. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. Mc Mahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. ar Xiv preprint ar Xiv:1803.02999, 2018. Noble, M., Bellet, A., and Dieuleveut, A. Differentially private federated learning on heterogeneous data. ar Xiv preprint ar Xiv:2111.09278, 2021. Paulik, M., Seigel, M., Mason, H., Telaar, D., Kluivers, J., van Dalen, R., Lau, C. W., Carlson, L., Granqvist, F., Vandevelde, C., et al. 
Federated evaluation and tuning for on-device personalization: System design & applications. ar Xiv preprint ar Xiv:2102.08503, 2021. Pillaud-Vivien, L., Rudi, A., and Bach, F. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems (Neur IPS), 2018. Shen, T., Zhang, J., Jia, X., Zhang, F., Huang, G., Zhou, P., Kuang, K., Wu, F., and Wu, C. Federated mutual learning. ar Xiv preprint ar Xiv:2006.16765, 2020. Singhal, K., Sidahmed, H., Garrett, Z., Wu, S., Rush, J., and Prakash, S. Federated reconstruction: Partially local federated learning. Advances in Neural Information Processing Systems (Neur IPS), 2021. Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. Federated multi-task learning. In Advances in Neural Information Processing Systems (NIPS), 2017. Vanhaesebrouck, P., Bellet, A., and Tommasi, M. Decentralized collaborative learning of personalized models over networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. Wang, J., Charles, Z., Xu, Z., Joshi, G., Mc Mahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. ar Xiv preprint ar Xiv:2107.06917, 2021. Wang, K., Mathews, R., Kiddon, C., Eichner, H., Beaufays, F., and Ramage, D. Federated evaluation of on-device personalization. ar Xiv preprint ar Xiv:1910.10252, 2019. Woodworth, B., Patel, K. K., Stich, S., Dai, Z., Bullins, B., Mcmahan, B., Shamir, O., and Srebro, N. Is local sgd better than minibatch sgd? In Proceedings of the International Conference on Machine Learning (ICML), 2020. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning A. Experiment details We provide the hyperparameter grids for each dataset below. Our experiments always optimize the step-size at any fixed iteration. To obtain the plots in Figure 2, we also optimize over α to obtain the best personalized curve, while varying over a finer grid of noise levels, provided below. Synthetic dataset: step-size η: [0.01, 0.02, 0.05, 0.1, 0.2, 0.4, 0.7, 1., 1.2, 1.5, 1.8], personalization level αQ: [0, 0.1, 0.3, 1, 3, 10, 30, 100, ], noise multiplier σ: [0, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]. Stackoverflow: step-size η: [2, 5, 10, 20, 50, 100, 150, 200, 250], personalization level αQ: [0, 0.1, 0.3, 1, 3, 10, 30, 100, ], noise multiplier σ: [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 100]. EMNIST-linear: step-size η: [0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01], personalization level αQ: [0, 0.1, 0.3, 1, 3, 10, 30, 100, ], noise multiplier σ: [0, 0.1, 0.3, 1, 2, 3, 5, 10, 20, 30, 50, 100]. EMNIST-CNN: step-size η: [0.05, 0.1, 0.2, 0.5, 1., 2., 5.], personalization level αQ: [0, 0.1, 0.3, 1, 3, 10, 30, 100, ], noise multiplier σ: [0, 0.1, 0.3, 1., 3., 10., 30., 100.]. B. Proofs of Main Results For proving the privacy guarantees, we make use of the following billboard lemma. Lemma B.1 (Billboard Lemma (Hsu et al., 2014)). Suppose a message broadcasting mechanism M : D W is (ϵ, δ)- differentially private. Consider any set of functions: gi : Di W W . The composition {gi(Πi D, M(D))} is (ϵ, δ)-joint differentially private, where Πi : D Di is the projection of D onto Di. B.1. Proof of Theorem 4.1 (Privacy for PP-SGD) Proof. To prove the JDP guarantee, by the billboard lemma (Lemma B.1), it is sufficient to show that the global model updates satisfy standard DP w.r.t. any changes to a single user s data. 
With the assumption that gradients are uniformly bounded by G, each update of the global models has ℓ2 sensitivity bounded by αη/N w.r.t. any change to a single user s data. If we take for some absolute constant c, then by the Gaussian mechanism, each update is ( ϵ n log(1/δ), δ)-differentially private, and by the advanced composition theorem (Dwork et al., 2010b), the algorithm that outputs global models is (ϵ, δ)-differentially private after n steps. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning When considering both local and global models as outputs of the algorithm, combining the above with the billboard lemma yields the desired JDP guarantee. B.2. Proof of Theorem 4.5 (Generalization for PP-SGD) Proof. Note that the algorithm updates be written jointly on zt = (wt, θt) as zt = zt 1 ηH 1gt, (17) 1 N PN i=1 wfi(wt 1, θi,t 1, ξi,t) + ζt 1 N θf1(wt 1, θ1,t 1, ξ1,t) ... 1 N θf N(wt 1, θN,t 1, ξN,t) where H is a pre-conditioner of the form αIdw 0 0 INdθ Note that H satisties z 2 H = Hz, z = z 2 α, as defined in (5). Denote by Ft the sigma algebra spanned by random variables up to time t. Note that we have E[gt|Ft 1] = f(zt 1). We also define 1 N PN i=1 wfi(w , θ i , ξi,t) + ζt 1 N θf1(w , θ 1, ξ1,t) ... 1 N θf N(w , θ N, ξN,t) which satisfies E[g t ] = f(z ) = 0. We have E[ zt z 2 H|Ft 1] = zt 1 z 2 H 2η H E[H 1gt|Ft 1], zt 1 z + η2 E[ H 1gt 2 H|Ft 1] = zt 1 z 2 H 2η f(zt 1), zt 1 z + η2 E[ gt 2 H 1|Ft 1] zt 1 z 2 H 2η(f(zt 1) f(z )) + 2η2 E[ gt g t 2 H 1|Ft 1] + 2η2 E[ g t 2 H 1] (21) zt 1 z 2 H 2η(1 2ηLα)(f(zt 1) f(z )) + 2η2σ2 tot,α, (22) where the first inequality follows by convexity of f and using a+b 2 2( a 2 + b 2), while the second uses Lemma B.2 below to bound E[ gt g t 2 H 1|Ft 1], as well as the relation E[ g t 2 H 1] = α E i=1 wfi(w , θ i , ξt) + ζt " 1 N θfi(w , θ i , ξt) = α σ2 w + σθ N + αdwσ2 ζ = σ2 tot,α. Note that this quantity matches the definition in (6). Assuming η 1/4Lα, and taking total expectations, (22) yields E[ zt 1 z 2 H] E[ zt 1 z 2 H] η E[f(zt 1) f(z )] + 2η2σ2 tot,α. Summing this inequality from t = 0 to t = n, we obtain E[f( zn) f(z )] z0 z 2 α ηn + 2ησ2 tot,α. (23) Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning Taking η = min{ 1 4Lα , z α nσtot,α }, with z0 = 0 yields the desired bound. Lemma B.2. Let gt and g t be defined as in (18) and (20). We have E[ gt g t 2 H 1] 2Lα(f(zt 1) f(z t )), (24) with Lα := L max(α, 1 Proof. Define fi(w, θi) := fi(w, θi, ξi,t) and f(z) := 1 N PN i=1 fi(w, θi), which satisfies E[ f(z)] = f(z) and E[ f(z)] = f(z). By L-smoothness (Assumption 4.3), we have for every i, fi(wt 1, θi,t 1) fi(w , θ i ) 2 2L fi(wt 1, θi,t 1) fi(w , θ i ) + w fi(w , θ i ), wt 1 w + θ fi(w , θ i ), θt 1 θ Taking the average over i, we get i=1 fi(wt 1, θi,t 1) fi(w , θ i ) 2 2L f(zt 1) f(z ) + f(z ), zt 1 z . gt g t 2 H 1 w fi(wt 1, θi,t 1) w fi(w , θ i ) θ fi(wt 1, θi,t 1) θ fi(w , θ i ) w fi(wt 1, θi,t 1) w fi(w , θ i ) 2 + 1 θ fi(wt 1, θi,t 1) θ fi(w , θ i ) 2 (by Cauchy-Schwarz inequality) i=1 fi(wt 1, θi,t 1) fi(w , θ i ) 2 2Lα f(zt 1) f(z ) + f(z ), zt 1 z . with Lα := L max(α, 1 N ). Taking the expectation, we get E gt g t 2 H 1|Ft 1 2Lα (f(zt 1) f(z ) + f(z ), zt 1 z ) = 2Lα (f(zt 1) f(z )) . B.3. Proof of Lemma 4.6 (effect of α and critical sample size) Proof. We would like to make (w + θi) x = v x for all x, while minimizing (w, θ1:N) α. This is achieved by solving the following minimization problem: min w,θ1:N 1 α w 2 + X s.t. 
w + θi = v To solve w, we minimize 1 α w 2 + P i v w 2 = 1 α w 2 + N v w 2, which gives w = αN αN+1v and thus θi = 1 αN+1v for all i. Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning 2 v 2 + N 1 αN + 1 N αN + 1 v . Plugging this into (9), we see that the variance term takes the form specified in the lemma statement. To see the effect of α to the variance term, we identify the condition for α, such that the following function is increasing in α: ψ(α) = a(α + 1) αN + 1 + bα αN + 1 for constants a, b. Its derivative can be calculated as follows: ψ (α) = a(αN + 1) a(α + 1)N + b(αN + 1) bαN (αN + 1)2 = a(1 N) + b (αN + 1)2 . Thus, ψ(α) is increasing in α if and only if a(N 1) b. In our case, a = σ2 n and b = c dw G2 log(1/δ) Nϵ , and the condition a(N 1) b is equivalent to n (N 2 N)σ2ϵ cdw G2 log(1/δ). B.4. Proof of Theorem 5.1 (privacy of PPSGD with client sampling) Proof. The proof follows the same lines as that of Proposition 4.1, but additionally leverages a standard argument of privacy amplification by subsampling to accomodate user sampling. Specifically, we use (Abadi et al., 2016, Theorem 1), which leverages R enyi differential privacy to provide an improved dependency on δ compared to the advanced composition theorem (Dwork et al., 2010b). In order to apply (Abadi et al., 2016, Theorem 1), we rewrite the global updates as wt = wt 1 αη i:bi,t=1 gt w,i + q Mζt This matches the form of the the updates in (Abadi et al., 2016), with total noise standard deviation q Mσζ, which should equal σC in their notations. Then, by Abadi et al. (2016, Theorem 1), their condition on σ becomes: guarantees that the global update mechanism after T iterations is (ϵ, δ)-DP. This corresponds to the condition in the statement. Combining this with the billboard lemma (Lemma B.1) yields the desired JDP guarantee on the full algorithm output. B.5. Proof of Theorem 5.2 (Generalization of PP-SGD with client sampling) Proof. Note that the algorithm updates be written jointly on zt = (wt, θt) as zt = zt 1 ηH 1gt, (25) with gt = 1 q M PN i=1 bi,t Pmi k=1 wfi(wt 1, θi,t 1, ξ(k) i,t ) + q Mζt b1,t Pm1 k=1 θf1(wt 1, θ1,t 1, ξ(k) 1,t ) ... b N,t Pm N k=1 θf N(wt 1, θN,t 1, ξ(k) N,t) Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning with H as in (19). We have E[gt|Ft 1] = f m(zt 1). We also define g t = 1 q M PN i=1 bi,t Pmi k=1 wfi(w , θ i , ξ(k) i,t ) + q Mζt b1,t Pm1 k=1 θf1(w , θ 1, ξ(k) 1,t ) ... b N,t Pm N k=1 θf N(w , θ N, ξ(k) N,t) which satisfies E[g t ] = f m(z ) = 0. We have E[ zt z 2 H|Ft 1] = zt 1 z 2 H 2η H E[H 1gt|Ft 1], zt 1 z + η2 E[ H 1gt 2 H|Ft 1] = zt 1 z 2 H 2η f m(zt 1), zt 1 z + η2 E[ gt 2 H 1|Ft 1] zt 1 z 2 H 2η(f m(zt 1) f m(z )) + 2η2 E[ gt g t 2 H 1|Ft 1] + 2η2 E[ g t 2 H 1]. (29) By Lemma B.3, E[ gt g t 2 H 1|Ft 1] 2Lm,α (f m(zt 1) f m(z )) (30) where Lm,α := L max α + αmmax E[ g t 2 H 1] = α E k=1 wfi(w , θ i , ξ(k) i,t ) + ζt k=1 θfi(w , θ i , ξ(k) i,t ) k=1 wfi(w , θ i , ξ(k) i,t ) qmi wfi(w , θ i ) i=1 miσ2 θ,i + αdwσ2 ζ k=1 wfi(w , θ i , ξ(k) i,t ) bi,tmi wfi(w , θ i ) i=1 E h (bi,t q)mi wfi(w , θ i ) 2i + 1 q M 2 i=1 miσ2 θ,i + αdwσ2 ζ i=1 miσ2 w,i + α q M 2 i=1 m2 i E[(bi,t q)2] wfi(w , θ i ) 2 + σ2 θ,q q M + αdwσ2 ζ α σ2 w,m + σ2 θ,m q M + αdwσ2 ζ + αq(1 q) i=1 m2 i wfi(w , θ i ) 2 (32) Assuming η 1/4Lm,α, and taking total expectations, (29) yields E[ zt 1 z 2 H] E[ zt 1 z 2 H] η E[f m(zt 1) f m(z )] + 2η2σ2 m,α. Summing this inequality from t = 0 to t = T, we obtain E[f m( z T ) f m(z )] z0 z 2 α ηT + 2ησ2 m,α. 
(33) Taking η = min{ 1 4Lm,α , z α T σm,α }, with z0 = 0 yields the desired bound. Lemma B.3. Let gt and g t be defined as in (26) and (28). We have E[ gt g t 2 H 1] 2Lm,α (f m(zt 1) f m(z )) (34) with Lm,α := L max α + αmmax Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning Proof. Define f (k) i (w, θi) := fi(w, θi, ξ(k) i,t ) and f(z) := 1 q M PN i=1 bi,t Pmi k=1 f (k) i (w, θi), which satisfies E[ f(z)] = f m(z) and E[ f(z)] = f m(z). By L-smoothness (Assumption 4.3), we have for every i, k, f (k) i (wt 1, θi,t 1) f (k) i (w , θ i ) 2 2L f (k) i (wt 1, θi,t 1) f (k) i (w , θ i ) + w f (k) i (w , θ i ), wt 1 w + θ f (k) i (w , θ i ), θt 1 θ Taking the weighted sum over i, k with weights bi,t q M , we get k=1 f (k) i (wt 1, θi,t 1) f (k) i (w , θ i ) 2 2L f(zt 1) f(z ) + f(z ), zt 1 z . Taking the expectation on two sides, we further get k=1 E h f (k) i (wt 1, θi,t 1) f (k) i (w , θ i ) 2i 2L (f m(zt 1) f m(z ) + f m(z ), zt 1 z ) = 2L (f m(zt 1) f m(z )) . E gt g t 2 H 1 w f (k) i (wt 1, θi,t 1) w f (k) i (w , θ i ) θ f (k) i (wt 1, θi,t 1) θ f (k) i (w , θ i ) w f (k) i (wt 1, θi,t 1) w f (k) i (w , θ i ) w f (k) i (wt 1, θi,t 1) w f (k) i (w , θ i ) θ f (k) i (wt 1, θi,t 1) θ f (k) i (w , θ i ) k=1 E w f (k) i (wt 1, θi,t 1) w f (k) i (w , θ i ) 2 k=1 E w f (k) i (wt 1, θi,t 1) w f (k) i (w , θ i ) 2 k=1 E θ f (k) i (wt 1, θi,t 1) θ f (k) i (w , θ i ) 2 max α + αmmax k=1 E f (k) i (wt 1, θi,t 1) f (k) i (w , θ i ) 2 max α + αmmax 2L (f m(zt 1) f m(z )) = 2Lm,α (f m(zt 1) f m(z )) . Personalization Improves Privacy Accuracy Tradeoffs in Federated Learning Algorithm 3 PPSGD with client sampling, average user performance 1: Input: q: client sampling probability, mi: minibatch sizes, η: step size, α: global/local ratio, σζ: privacy noise level, C: clipping parameter. 2: Initialize w0 = θ0 = 0. 3: for t = 1 to T do 4: Sample bi,t Ber(q) 5: for all clients i with bi,t = 1 in parallel do 6: Sample a minibatch {ξ(k) i,t }mi k=1 P mi i 7: Compute gt θ,i = 1 mi Pmi k=1 θfi(wt 1, θi,t 1, ξ(k) i,t ) gt w,i = 1 mi Pmi k=1 wfi(wt 1, θi,t 1, ξ(k) i,t ) 8: Update θi,t = θi,t 1 η q N gt θ,i 9: Clip gradient gt w,i = gt w,i/max(1, gt w,i C ) 10: Send gt w,i to the server 11: end for 12: Sample ζt N(0, σ2 ζIdw) 13: Update wt = wt 1 αη( 1 i:bi,t=1 gt w,i + ζt) 14: end for B.6. Optimizing average user risk with heterogeneous sample sizes In this section, we study a variant of Algorithm 2 which optimizes the average risk f instead of the weighted risk f m in (11), as described in Section 5. The algorithm is described in Algorithm 3, and we provide its generalization and privacy guarantees below. Theorem B.4 (Generalization and privacy for Algorithm 3). Under Assumptions 4.2, 4.3, and 4.4, let z = (w , θ1:N ) be any minimizer of f, Lα = L max α + α N , 1 N , and σ2 m,α := α σ2 w,m + α σ2 w,m + σ2 θ,m q N + αdwσ2 ζ α σ2 w + σ2 θ q Nmmin + α σ2 w,m q N + αdwσ2 ζ, (35) where σ2 w,m := 1 i σ2 w,i mi and σ2 θ,m := 1 i σ2 θ,i mi and σ2 w,m := 1 i q(1 q) wfi(w , θ i ) 2. With η = min 1 4Lα , z α T σm,α and C = G, Algorithm 3 satisfies E[f( z T ) f(z )] 4Lα z 2 α T + 3σm,α z α with z T = 1 T PT 1 t=0 zt. For c1, c2 as in Theorem 5.1, and for any ϵ < c1q2T, if we take σζ = c2 C p then Algorithm 3 is (ϵ, δ)-JDP, and the generalization guarantee becomes: E[f( z T ) f(z )] Lα z 2 α T + z α α σ2w,m + α σ2w,m + σ2 θ,m q NT + αdw G2 log( 1 δ ) N 2ϵ2 . (36) Proof. The proof can be obtained by replacing fi by fi = M mi N fi in the proofs of Theorem 5.2 and Theorem 5.1.