# Personalized Federated Learning with Moreau Envelopes

Canh T. Dinh¹, Nguyen H. Tran¹, Tuan Dung Nguyen¹,²
¹The University of Sydney, Australia (tdin6081@uni.sydney.edu.au, nguyen.tran@sydney.edu.au)
²The University of Melbourne, Australia (tuandungn@unimelb.edu.au)

**Abstract.** Federated learning (FL) is a decentralized and privacy-preserving machine learning technique in which a group of clients collaborates with a server to learn a global model without sharing the clients' data. One challenge in FL is statistical diversity among clients, which prevents the global model from delivering good performance on each client's task. To address this, we propose an algorithm for personalized FL (pFedMe) using Moreau envelopes as the clients' regularized loss functions, which help decouple personalized-model optimization from global-model learning in a bi-level problem stylized for personalized FL. Theoretically, we show that pFedMe's convergence rate is state of the art: it achieves quadratic speedup for strongly convex objectives and sublinear speedup of order 2/3 for smooth nonconvex objectives. Experimentally, we verify that pFedMe excels in empirical performance compared with the vanilla FedAvg and Per-FedAvg, a meta-learning based personalized FL algorithm.

## 1 Introduction

The abundance of data generated on a massive number of hand-held devices has stimulated the development of federated learning (FL) [1]. The setting of FL is a network of clients connected to a server, and its goal is to build a global model from the clients' data in a privacy-preserving and communication-efficient way.
Current techniques that pursue this goal mostly follow three steps: (i) at each communication round, the server sends the current global model to the clients; (ii) the clients update their local models using their local data; (iii) the server collects the latest local models from a subset of sampled clients in order to update the global model; these steps are repeated until convergence [1–4]. Despite its advantages in data privacy and communication reduction, FL faces a main challenge that affects its performance and convergence rate: statistical diversity, meaning that the data distributions among clients are distinct (i.e., non-i.i.d.). A global model trained on such non-i.i.d. data is therefore hardly well-generalized to each client's data. This behaviour has been reported in [5,6], which showed that as statistical diversity increases, the generalization error of the global model on clients' local data also increases significantly. On the other hand, individual learning without FL (i.e., no client collaboration) also suffers a large generalization error due to insufficient data. These observations raise the question: How can we leverage the global model in FL to find a personalized model that is stylized for each client's data?

Motivated by the critical role of personalized models in several business applications in healthcare, finance, and AI services [5], we address this question by proposing a new FL scheme for personalization, which minimizes the Moreau envelopes [7] of the clients' loss functions. With this scheme, clients not only contribute to building the reference global model as in standard FL, but also leverage the reference model to optimize their personalized models w.r.t. local data.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Geometrically, the global model in this scheme can be considered a central point where all clients agree to meet, while the personalized models are points in different directions that clients pursue according to their heterogeneous data distributions. Our key contributions in this work are summarized as follows. First, we formulate a new bi-level optimization problem designed for personalized FL (pFedMe) by using the Moreau envelope as a regularized loss function. The bi-level structure of pFedMe has a key advantage: it decouples the optimization of personalized models from the learning of the global model. Thus, pFedMe updates the global model similarly to standard FL algorithms such as FedAvg [1], yet optimizes the personalized models in parallel with low complexity. Second, we exploit the convexity-preserving and smoothness-enabling properties of Moreau envelopes to facilitate the convergence analysis of pFedMe, which characterizes both client-sampling and client-drift errors, two notorious issues in FL [3]. With carefully tuned hyperparameters, pFedMe obtains the state-of-the-art quadratic speedup (resp. sublinear speedup of order 2/3), compared with existing works achieving linear speedup (resp. sublinear speedup of order 1/2), for strongly convex (resp. smooth nonconvex) objectives. Finally, we empirically evaluate the performance of pFedMe using both real and synthetic datasets that capture the statistical diversity of clients' data. We show that pFedMe outperforms the vanilla FedAvg and the meta-learning based personalized FL algorithm Per-FedAvg [8] in terms of convergence rate and local accuracy.

## 2 Related Work

**FL and challenges.** One of the first FL algorithms is FedAvg [1], which uses local SGD updates and builds a global model from a subset of clients with non-i.i.d. data. Subsequently, one-shot FL [9] allows the global model to be learned in a single round of communication.
To address communication limitations in an FL network, [10,11] introduced quantization methods, while [12–14] proposed performing multiple local optimization rounds before sending the local models to the server. In addition, the problem of statistical diversity has been addressed in [15–20]. Preserving privacy in FL has been studied in [21–25].

**Personalized FL: mixing models, contextualization, meta-learning, and multi-task learning.** Multiple approaches have been proposed to achieve personalization in FL. One such approach is mixing the global and local models: [26] combined the optimization of the local and global models in its L2GD algorithm, and [27] introduced three personalization approaches: user clustering, data interpolation, and model interpolation. While the first two approaches need meta-features from all clients, which makes them infeasible in FL due to privacy concerns, the last approach was used in [6] to create an adaptive personalized federated learning (APFL) algorithm, which mixes a user's local model with the global model. One personalization method for neural networks is FedPer [28], in which a network is divided into base and personalized layers; the base layers are trained by the server, while both types of layers are trained by users to create a personalized model. Regarding using a model in different contexts, the next-character prediction task in [29] requires predicting differently on different devices, which raises a need to inspect more features about the context of client devices during training; this was studied in [30]. [31] achieves personalization for each user in a fully decentralized network using asynchronous gossip algorithms, under assumptions on the network topology and the similarity between users. The concept of personalization can also be linked to meta-learning.
Per-FedAvg [8], influenced by Model-Agnostic Meta-Learning (MAML) [32], builds an initial meta-model that can be updated effectively after one more gradient descent step. During meta-optimization, however, MAML theoretically requires computing the Hessian term, which is computationally prohibitive; therefore, several works including [32–34] attempted to approximate the Hessian matrix. [35] based its framework, ARUBA, on online convex optimization and meta-learning, which can be integrated into FL to improve personalization. [36] discovered that FedAvg can be interpreted as meta-learning and proposed combining FedAvg with Reptile [33] for FL personalization. The application of federated meta-learning to recommender systems was studied in [37]. Finally, multi-task learning can be used for personalization: [20] introduced a federated multi-task framework called MOCHA, addressing both systems and statistical heterogeneity. For more details about FL, its challenges, and personalization approaches, we refer the reader to the comprehensive surveys [38,39].

## 3 Personalized Federated Learning with Moreau Envelopes (pFedMe)

### 3.1 pFedMe: Problem Formulation

In conventional FL, there are N clients communicating with a server to solve the following problem:

$$\min_{w \in \mathbb{R}^d} \Big\{ f(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w) \Big\} \qquad (1)$$

to find a global model w. The function $f_i : \mathbb{R}^d \to \mathbb{R}$, $i = 1, \dots, N$, denotes the expected loss over the data distribution of client i: $f_i(w) = \mathbb{E}_{\xi_i}\big[f_i(w; \xi_i)\big]$, where $\xi_i$ is a random data sample drawn according to the distribution of client i and $f_i(w; \xi_i)$ is the loss corresponding to this sample and w. In FL, since clients' data may come from different environments, contexts, and applications, clients can have non-i.i.d. data distributions; i.e., the distributions of $\xi_i$ and $\xi_j$, $i \neq j$, are distinct.
Instead of solving the traditional FL problem (1), we take a different approach by using a regularized loss function with the ℓ2-norm for each client:

$$f_i(\theta_i) + \frac{\lambda}{2}\,\|\theta_i - w\|^2, \qquad (2)$$

where $\theta_i$ denotes the personalized model of client i and λ is a regularization parameter that controls the strength of w's influence on the personalized model. While a large λ can benefit clients with unreliable data through the abundant data aggregation, a small λ helps clients with sufficient useful data prioritize personalization. Note that $\lambda \in (0, \infty)$ to avoid the extreme cases λ = 0 (i.e., no FL) and λ = ∞ (i.e., no personalized FL). Overall, the idea is to allow clients to pursue their own models in different directions, but without straying far from the reference point w, to which every client contributes. Based on this, personalized FL can be formulated as a bi-level problem:

$$\text{pFedMe:} \quad \min_{w \in \mathbb{R}^d} \Big\{ F(w) := \frac{1}{N} \sum_{i=1}^{N} F_i(w) \Big\}, \quad \text{where } F_i(w) = \min_{\theta_i \in \mathbb{R}^d} \Big\{ f_i(\theta_i) + \frac{\lambda}{2}\|\theta_i - w\|^2 \Big\}.$$

In pFedMe, while w is found by exploiting the data aggregation from multiple clients at the outer level, $\theta_i$ is optimized with respect to (w.r.t.) client i's data distribution and kept within a bounded distance of w at the inner level. The definition of $F_i(w)$ is the well-known Moreau envelope, which facilitates several learning-algorithm designs [40,41]. The optimal personalized model, which is the unique solution to the inner problem of pFedMe and is also known as the proximal operator in the literature, is defined as follows:

$$\hat\theta_i(w) := \operatorname{prox}_{f_i/\lambda}(w) = \arg\min_{\theta_i \in \mathbb{R}^d} \Big\{ f_i(\theta_i) + \frac{\lambda}{2}\|\theta_i - w\|^2 \Big\}. \qquad (3)$$

For comparison, we consider Per-FedAvg [8], which arguably has the closest formulation to pFedMe:

$$\min_{w \in \mathbb{R}^d} \Big\{ F(w) := \frac{1}{N} \sum_{i=1}^{N} f_i\big(\theta_i(w)\big) \Big\}, \quad \text{where } \theta_i(w) = w - \alpha \nabla f_i(w). \qquad (4)$$

Based on the MAML framework [32], Per-FedAvg aims to find a global model w that client i can use as an initialization to perform one more gradient update (with step size α) w.r.t. its own loss function to obtain its personalized model $\theta_i(w)$.
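As a concrete illustration, the inner problem (3) can be solved numerically by plain gradient descent on the regularized objective. The sketch below is our own toy example, not the paper's code: it assumes a quadratic client loss $f_i(\theta) = \frac{1}{2}\|\theta - a\|^2$, for which the proximal operator has the closed form $(a + \lambda w)/(1 + \lambda)$, so the iterative solution can be checked exactly; Per-FedAvg's one-step personalization (4) is computed alongside for contrast.

```python
import numpy as np

def prox_gd(grad_f, w, lam, lr=0.05, steps=500):
    """Approximate theta_hat_i(w) = prox_{f_i/lam}(w) by gradient descent
    on h(theta) = f_i(theta) + lam/2 * ||theta - w||^2."""
    theta = w.copy()
    for _ in range(steps):
        theta -= lr * (grad_f(theta) + lam * (theta - w))
    return theta

# Toy client loss f_i(theta) = 1/2 ||theta - a||^2 (a is an illustrative client optimum)
a = np.array([1.0, -2.0])
grad_f = lambda theta: theta - a

w = np.zeros(2)   # current global model
lam = 15.0        # regularization strength (same order as the paper's MNIST runs)

theta_hat = prox_gd(grad_f, w, lam)
closed_form = (a + lam * w) / (1.0 + lam)        # exact prox for this quadratic
print(np.allclose(theta_hat, closed_form, atol=1e-4))   # True

# Per-FedAvg instead takes a single gradient step from w, cf. Eq. (4):
alpha = 1.0 / lam
theta_perfedavg = w - alpha * grad_f(w)
```

For this quadratic one can also check Proposition 1's identity numerically: $\lambda(w - \hat\theta_i(w))$ equals $\nabla F_i(w) = \frac{\lambda}{1+\lambda}(w - a)$.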
Compared to Per-FedAvg, our problem gives w a similar meaning as a "meta-model", but instead of using w as an initialization, we pursue both the personalized and global models in parallel by solving a bi-level problem, which has several benefits. First, while Per-FedAvg is optimized for a one-step gradient update of its personalized model, pFedMe is agnostic to the inner optimizer, which means (3) can be solved using any iterative approach with multi-step updates. Second, by re-writing the personalized model update of Per-FedAvg as

$$\theta_i(w) = w - \alpha \nabla f_i(w) = \arg\min_{\theta_i \in \mathbb{R}^d} \Big\{ \langle \nabla f_i(w), \theta_i - w \rangle + \frac{1}{2\alpha}\|\theta_i - w\|^2 \Big\}, \qquad (5)$$

where we use ⟨x, y⟩ for the inner product of two vectors x and y, we can see that apart from the similar regularization term, Per-FedAvg only optimizes the first-order approximation of $f_i$, whereas pFedMe directly minimizes $f_i$ in (3). Third, Per-FedAvg (and more generally several MAML-based methods) requires computing or estimating the Hessian matrix, whereas pFedMe only needs gradient calculations using a first-order approach, as will be shown in the next section.

**Assumption 1** (Strong convexity and smoothness). $f_i$ is either (a) µ-strongly convex or (b) nonconvex and L-smooth (i.e., with L-Lipschitz gradient), respectively, as follows for all $w, w'$:

(a) $f_i(w) \ge f_i(w') + \langle \nabla f_i(w'), w - w' \rangle + \frac{\mu}{2}\|w - w'\|^2$; (b) $\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|$.

**Assumption 2** (Bounded variance). The variance of the stochastic gradients in each client is bounded: $\mathbb{E}_{\xi_i}\big[\|\nabla f_i(w; \xi_i) - \nabla f_i(w)\|^2\big] \le \gamma_f^2$.

**Assumption 3** (Bounded diversity). The variance of the local gradients w.r.t. the global gradient is bounded: $\frac{1}{N}\sum_{i=1}^{N} \|\nabla f_i(w) - \nabla f(w)\|^2 \le \sigma_f^2$.

While Assumption 1 is standard for convergence analysis, Assumptions 2 and 3 are widely used in the FL context, in which $\gamma_f^2$ and $\sigma_f^2$ quantify the sampling noise and the diversity of clients' data distributions, respectively [3,8,42,43]. Note that we avoid the uniformly bounded gradient assumption, i.e., $\|\nabla f_i(w)\| \le G, \forall i$, which was used in several related works [6,8].
It was shown that this assumption is not satisfied in unconstrained strongly convex minimization [44,45]. Finally, we review several useful properties of the Moreau envelope, such as smoothing and preserving convexity (see the reviews and proofs for the convex case in [40,46,47] and for the nonconvex smooth case in [48], respectively):

**Proposition 1.** If $f_i$ is convex, or nonconvex with L-Lipschitz $\nabla f_i$, then $F_i$ is $L_F$-smooth with $L_F = \lambda$ (with the condition λ > 2L for nonconvex L-smooth $f_i$), and

$$\nabla F_i(w) = \lambda\big(w - \hat\theta_i(w)\big). \qquad (6)$$

Furthermore, if $f_i$ is µ-strongly convex, then $F_i$ is $\mu_F$-strongly convex with $\mu_F = \frac{\lambda\mu}{\lambda + \mu}$.

### 3.2 pFedMe: Algorithm

In this section, we propose an algorithm, presented in Alg. 1, to solve pFedMe. As in conventional FL algorithms such as FedAvg [1], at each communication round t, the server broadcasts the latest global model $w^t$ to all clients. Then, after all clients perform R local updates, the server receives the latest local models from a uniformly sampled subset $S_t$ of clients to perform the model averaging. Note that we use an additional parameter β for the global model update, which recovers FedAvg's model averaging when β = 1. Though a similar server-side parameter was also used in [3,49], it will be shown that pFedMe can obtain better speedup convergence rates. Specifically, our algorithm, which aims to solve the bi-level problem pFedMe, has two key differences compared with FedAvg, which aims to solve (1). First, at the inner level, each client i solves (3) to obtain its personalized model $\hat\theta_i(w^t_{i,r})$, where $w^t_{i,r}$ denotes the local model of client i at global round t and local round r. As in FedAvg, the purpose of the local models is to contribute to building the global model with a reduced number of communication rounds between the clients and the server.
Second, at the outer level, the local update of client i uses gradient descent with respect to $F_i$ (instead of $f_i$):

$$w^t_{i,r+1} = w^t_{i,r} - \eta\, \nabla F_i\big(w^t_{i,r}\big),$$

where η is the learning rate and $\nabla F_i(w^t_{i,r})$ is calculated according to (6) using the current personalized model $\hat\theta_i(w^t_{i,r})$. For the practical algorithm, we use a δ-approximation of $\hat\theta_i(w^t_{i,r})$, denoted by $\tilde\theta_i(w^t_{i,r})$, satisfying $\mathbb{E}\big[\|\tilde\theta_i(w^t_{i,r}) - \hat\theta_i(w^t_{i,r})\|\big] \le \delta$, and correspondingly use $\lambda(w^t_{i,r} - \tilde\theta_i(w^t_{i,r}))$ to approximate $\nabla F_i(w^t_{i,r})$ (c.f. line 8).

**Algorithm 1** pFedMe: Personalized Federated Learning using Moreau Envelopes
1: input: T, R, S, λ, η, β, $w^0$
2: for t = 0 to T − 1 do ▷ global communication rounds
3: &nbsp;&nbsp;Server sends $w^t$ to all clients
4: &nbsp;&nbsp;for all i = 1 to N do
5: &nbsp;&nbsp;&nbsp;&nbsp;$w^t_{i,0} = w^t$
6: &nbsp;&nbsp;&nbsp;&nbsp;for r = 0 to R − 1 do ▷ local update rounds
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Sample a fresh mini-batch $D_i$ of size |D| and minimize $h_i(\theta_i; w^t_{i,r}, D_i)$, defined in (7), up to an accuracy level according to (8), to find a δ-approximate $\tilde\theta_i(w^t_{i,r})$
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$w^t_{i,r+1} = w^t_{i,r} - \eta\lambda(w^t_{i,r} - \tilde\theta_i(w^t_{i,r}))$
9: &nbsp;&nbsp;Server uniformly samples a subset $S_t$ of clients with size S; each sampled client $i \in S_t$ sends its local model $w^t_{i,R}$ to the server
10: &nbsp;&nbsp;Server updates the global model: $w^{t+1} = (1 - \beta)w^t + \frac{\beta}{S}\sum_{i \in S_t} w^t_{i,R}$

The reason for using the δ-approximate $\tilde\theta_i(w^t_{i,r})$ is two-fold. First, obtaining $\hat\theta_i(w^t_{i,r})$ according to (3) usually needs the gradient $\nabla f_i(\theta_i)$, which, however, requires the distribution of $\xi_i$. In practice, we use the following unbiased estimate of $\nabla f_i(\theta_i)$ obtained by sampling a mini-batch of data $D_i$:

$$\nabla f_i(\theta_i; D_i) := \frac{1}{|D_i|} \sum_{\xi_i \in D_i} \nabla f_i(\theta_i; \xi_i) \quad \text{such that} \quad \mathbb{E}\big[\nabla f_i(\theta_i; D_i)\big] = \nabla f_i(\theta_i).$$

Second, in general it is not straightforward to obtain $\hat\theta_i(w^t_{i,r})$ in closed form. Instead, we usually use an iterative first-order approach to obtain an approximation $\tilde\theta_i(w^t_{i,r})$ with high accuracy.
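To make Algorithm 1 concrete, the following self-contained sketch (a toy simulation under assumed quadratic client losses $f_i(w) = \frac{1}{2}\|w - a_i\|^2$, not the paper's experimental setup) runs the nested loops: each client approximately solves the inner problem (line 7), steps its local model along the approximate Moreau-envelope gradient $\lambda(w - \tilde\theta)$ (line 8), and the server averages sampled local models (lines 9–10). For these quadratics, F(w) is minimized at the mean of the $a_i$, which the global model should approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N clients with quadratic losses f_i(w) = 1/2 ||w - a_i||^2,
# whose minimizers a_i play the role of heterogeneous (non-i.i.d.) client optima.
N, d = 10, 5
A = rng.normal(size=(N, d))   # a_i = A[i]

lam, eta, beta = 15.0, 0.05, 1.0
T, R, S = 100, 5, 10          # global rounds, local rounds, sampled clients (S = N here)

def approx_theta(a, w, lr=0.05, steps=50):
    # Inner solver (line 7): gradient descent on h_i = f_i(theta) + lam/2 ||theta - w||^2
    theta = w.copy()
    for _ in range(steps):
        theta -= lr * ((theta - a) + lam * (theta - w))
    return theta

w = np.zeros(d)
for t in range(T):
    local_models = []
    for i in range(N):
        w_i = w.copy()
        for r in range(R):
            theta = approx_theta(A[i], w_i)      # line 7: delta-approximate prox
            w_i -= eta * lam * (w_i - theta)     # line 8: step along approx grad F_i
        local_models.append(w_i)
    sampled = rng.choice(N, size=S, replace=False)              # line 9: sampling
    w = (1 - beta) * w + beta * np.mean([local_models[i] for i in sampled], axis=0)

# For these quadratics F(w) is minimized at the mean of the a_i
print(np.linalg.norm(w - A.mean(axis=0)) < 1e-2)   # True
```

Each local step contracts $w_i$ toward $a_i$ by the factor $1 - \eta\lambda/(1+\lambda)$, and server averaging then contracts the global model toward the mean of the $a_i$, matching the "central meeting point" intuition above.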
Defining

$$h_i(\theta_i; w^t_{i,r}, D_i) := f_i(\theta_i; D_i) + \frac{\lambda}{2}\|\theta_i - w^t_{i,r}\|^2, \qquad (7)$$

and supposing we choose λ such that $h_i(\theta_i; w^t_{i,r}, D_i)$ is strongly convex with condition number κ (which quantifies how hard it is to optimize (7)), we can apply gradient descent (resp. Nesterov's accelerated gradient descent) to obtain $\tilde\theta_i(w^t_{i,r})$ such that

$$\big\|\nabla h_i(\tilde\theta_i; w^t_{i,r}, D_i)\big\|^2 \le \nu, \qquad (8)$$

with the number of $h_i$ computations $K := O\big(\kappa \log(d/\nu)\big)$ (resp. $O\big(\sqrt{\kappa} \log(d/\nu)\big)$) [50], where d is the diameter of the search space, ν is an accuracy level, and O(·) hides constants. The computation complexity of each client in pFedMe is K times that in FedAvg. In the following lemma, we show how δ can be adjusted by controlling (i) the sampling noise via the mini-batch size |D| and (ii) the accuracy level ν.

**Lemma 1.** Let $\tilde\theta_i(w^t_{i,r})$ be a solution to (8). Then

$$\mathbb{E}\big[\|\tilde\theta_i(w^t_{i,r}) - \hat\theta_i(w^t_{i,r})\|^2\big] \le \delta^2 := \begin{cases} \dfrac{2}{(\lambda+\mu)^2}\Big(\dfrac{\gamma_f^2}{|D|} + \nu\Big), & \text{if Assumption 1(a) holds;} \\[1.5ex] \dfrac{2}{(\lambda-L)^2}\Big(\dfrac{\gamma_f^2}{|D|} + \nu\Big), & \text{if Assumption 1(b) holds and } \lambda > L. \end{cases}$$

## 4 pFedMe: Convergence Analysis

In this section, we present the convergence of pFedMe. We first prove an intermediate result.

**Lemma 2.** Recall the definition of the Moreau envelope $F_i$ in pFedMe.

(a) If Assumption 1(a) holds, then

$$\frac{1}{N}\sum_{i=1}^{N} \|\nabla F_i(w) - \nabla F(w)\|^2 \le 4 L_F \big(F(w) - F(w^*)\big) + 2\,\underbrace{\frac{1}{N}\sum_{i=1}^{N} \|\nabla F_i(w^*)\|^2}_{=:\,\sigma^2_{F,1}}.$$

(b) If Assumption 1(b) holds and, furthermore, $\lambda > 2\sqrt{2}L$, then

$$\frac{1}{N}\sum_{i=1}^{N} \|\nabla F_i(w) - \nabla F(w)\|^2 \le \frac{8L^2}{\lambda^2 - 8L^2}\,\|\nabla F(w)\|^2 + \underbrace{\frac{2\lambda^2}{\lambda^2 - 8L^2}\,\sigma_f^2}_{=:\,\sigma^2_{F,2}}.$$

This lemma provides the bounded diversity of $F_i$, characterized by the variances $\sigma^2_{F,1}$ and $\sigma^2_{F,2}$ for strongly convex and nonconvex smooth $f_i$, respectively. While $\sigma^2_{F,2}$ is related to $\sigma_f^2$, which is bounded by Assumption 3, $\sigma^2_{F,1}$ is measured only at the unique solution $w^*$ of pFedMe (for strongly convex $F_i$, $w^*$ always exists), and thus $\sigma^2_{F,1}$ is finite. These bounds are tight in the sense that $\sigma^2_{F,1} = \sigma^2_{F,2} = 0$ when the data distributions of the clients are i.i.d.

**Theorem 1** (Strongly convex pFedMe convergence). Let Assumptions 1(a) and 2 hold.
If $T \ge \frac{2}{\hat\eta_1 \mu_F}$, there exists $\eta \le \frac{\hat\eta_1}{\beta R}$, where $\hat\eta_1 := \frac{1}{6 L_F (3 + 128 \kappa_F/\beta)}$ with β ≥ 1, such that

(a) $\mathbb{E}\big[F(\bar w^T)\big] - F(w^*) \le \tilde O\Big( \Delta_0 \mu_F\, e^{-\hat\eta_1 \mu_F T/2} \Big) + \tilde O\Big( \frac{(N/S - 1)\,\sigma^2_{F,1}}{\mu_F T N} \Big) + \tilde O\Big( \frac{(R\,\sigma^2_{F,1} + \delta^2\lambda^2)\,\kappa_F}{R\,(T \beta \mu_F)^2} \Big) + \tilde O\Big( \frac{\lambda^2 \delta^2}{\mu_F} \Big)$;

(b) $\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\big[\|\tilde\theta_i^T(\bar w^T) - w^*\|^2\big] \le \frac{1}{\mu_F}\,\tilde O\big( \mathbb{E}[F(\bar w^T)] - F^* \big) + O\Big( \frac{\sigma^2_{F,1}}{\lambda^2} + \delta^2 \Big)$,

where $\Delta_0 := \|w^0 - w^*\|^2$, $\kappa_F := L_F/\mu_F$, $\bar w^T := \sum_{t=0}^{T-1} \alpha_t w^t / A_T$ with $\alpha_t := (1 - \eta\mu_F/2)^{-(t+1)}$ and $A_T := \sum_{t=0}^{T-1} \alpha_t$, and $\tilde O(\cdot)$ hides both constants and polylogarithmic factors.

**Corollary 1.** When there is no client sampling (i.e., S = N), we can choose either (i) $\beta = \Theta(\sqrt{N/T})$ (i.e., massive clients) or (ii) $\beta = \Theta(\sqrt{NR})$ otherwise, to obtain either linear speedup $O\big(1/(TRN)\big)$ or quadratic speedup $O\big(1/(TRN)^2\big)$ w.r.t. computation rounds, respectively.

**Remark 1.** Theorem 1(a) shows the convergence of the global model w.r.t. four error terms, where the expectation is w.r.t. the randomness of the mini-batch and client samplings. While the first term shows that a carefully chosen constant step size reduces the initial error $\|w^0 - w^*\|^2$ linearly, the last term means that pFedMe converges towards a $\frac{\lambda^2\delta^2}{\mu_F}$-neighborhood of $w^*$, due to the approximation error δ at each local round. The second error term is due to client sampling, which obviously vanishes when S = N. If we choose S such that S/N is a fixed ratio, e.g., 0.5, then we obtain a linear speedup $O(1/(TN))$ w.r.t. communication rounds for the client-sampling error. The third error term is due to client drift with multiple local updates. According to Corollary 1, we are able to obtain the quadratic speedup, while most existing FL convergence analyses for strongly convex loss functions can only achieve linear speedup [3,6,45]. Theorem 1(b) shows the convergence of the personalized models, on average, to a ball of center $w^*$ and radius $O\big(\frac{\lambda^2\delta^2}{\mu_F} + \frac{\sigma^2_{F,1}}{\lambda^2} + \delta^2\big)$, which shows that λ can be controlled to trade off between reducing the $\delta^2$ and $\sigma^2_{F,1}$ errors.

**Theorem 2** (Nonconvex and smooth pFedMe convergence). Let Assumptions 1(b), 2, and 3 hold.
If $\eta \le \frac{\hat\eta_2}{\beta R}$, where $\hat\eta_2 := \frac{1}{75 L_F \left(\frac{\lambda^2}{\lambda^2 - 8L^2} + 1\right)}$ with $\lambda > 2\sqrt{2}L$ and β ≥ 1, then

(a) $\mathbb{E}\big[\|\nabla F(w^t)\|^2\big] \le O\Big( \sqrt{\frac{\Delta_F L_F\, \sigma^2_{F,2} (N/S - 1)}{T N}} \Big) + O\Big( \frac{(\Delta_F)^{2/3} \big(R\,\sigma^2_{F,2} + \lambda^2\delta^2\big)^{1/3}}{\beta^{4/3} R^{1/3} T^{2/3}} \Big) + O\big(\lambda^2\delta^2\big)$;

(b) $\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\big[\|\tilde\theta_i^t(w^t) - w^t\|^2\big] \le \frac{1}{\lambda^2}\, O\big( \mathbb{E}[\|\nabla F(w^t)\|^2] \big) + O\Big( \frac{\sigma^2_{F,2}}{\lambda^2} + \delta^2 \Big)$,

where $\Delta_F := F(w^0) - F^*$ and $t \in \{0, \dots, T-1\}$ is sampled uniformly.

**Corollary 2.** When there is no client sampling, we can choose $\beta = \Theta(N^{1/2} R^{1/4})$ and $\Theta(T^{1/3}) = \Theta((NR)^{2/3})$ to obtain a sublinear speedup of $O\big(1/(TRN)^{2/3}\big)$.

**Remark 2.** Theorem 2 shows a convergence structure similar to that of Theorem 1, but with a sublinear rate for the nonconvex case. According to Corollary 2, we are able to obtain the sublinear speedup $O\big(1/(TRN)^{2/3}\big)$, while most existing convergence analyses for nonconvex FL can only achieve a sublinear speedup of $O\big(1/\sqrt{TRN}\big)$ [3,6,49].

## 5 Experimental Results and Discussion

In this section, we validate the performance of pFedMe when the data distributions are heterogeneous and non-i.i.d. We first examine the effect of the hyperparameters K and β on the convergence of pFedMe. We then compare pFedMe with FedAvg and Per-FedAvg in both µ-strongly convex and nonconvex settings. The effect of other hyperparameters is presented in the supplementary document.

### 5.1 Experimental Settings

We consider a classification problem using both real (MNIST) and synthetic datasets. MNIST [51] is a handwritten-digit dataset containing 10 labels and 70,000 instances. Due to the limited size of MNIST, we distribute the complete dataset to N = 20 clients. To model a setting that is heterogeneous in both local data sizes and classes, each client is allocated a different local data size in the range [1165, 3834], and only has 2 of the 10 labels. For synthetic data, we adopt the data generation and distribution procedure from [15], using two parameters α and β to control how much the local model and the dataset of each client differ, respectively.
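The label-skew split described above for MNIST (each of the N = 20 clients holding only 2 of the 10 labels, with unequal local data sizes) can be sketched as follows; the label array, client count, and size range here are illustrative stand-ins, not the paper's exact partition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in labels for a 10-class dataset (e.g., MNIST digit labels).
labels = rng.integers(0, 10, size=70_000)

def partition_two_labels(labels, n_clients=20, rng=rng):
    """Assign each client exactly 2 of the 10 classes, with unequal shares,
    mimicking the heterogeneous split described in the text."""
    by_class = [np.flatnonzero(labels == c) for c in range(10)]
    cursors = [0] * 10   # how much of each class has already been handed out
    clients = []
    for i in range(n_clients):
        pair = [(2 * i) % 10, (2 * i + 1) % 10]   # 2 distinct labels per client
        take = int(rng.integers(500, 2000))       # unequal per-class share
        idx = []
        for c in pair:
            start = cursors[c]
            idx.extend(by_class[c][start:start + take])
            cursors[c] = start + take
        clients.append(np.array(idx))
    return clients

clients = partition_two_labels(labels)
# Every client sees at most 2 distinct labels
print(all(len(np.unique(labels[idx])) <= 2 for idx in clients))   # True
```

Slicing past the end of a class's index array simply yields a shorter share, so later clients of a popular class receive fewer samples, which adds further size heterogeneity.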
Specifically, the dataset serves a 10-class classifier on 60-dimensional real-valued data. We generate a synthetic dataset with α = 0.5 and β = 0.5. Each client's data size is in the range [250, 25810]. Finally, we distribute the data to N = 100 clients according to the power law in [15]. We fix the number of sampled clients at S = 5 for MNIST and S = 10 for Synthetic. We compare the algorithms in both cases of the same and fine-tuned learning rates, batch sizes, and numbers of local and global iterations. For the µ-strongly convex setting, we consider an ℓ2-regularized multinomial logistic regression model (MLR) with softmax activation and the cross-entropy loss. For the nonconvex case, a two-layer deep neural network (DNN) is implemented, with a hidden layer of size 100 for MNIST and 20 for Synthetic, using ReLU activation and a softmax output layer. In pFedMe, we use gradient descent to obtain the δ-approximate $\tilde\theta_i(w^t_{i,r})$; the personalized model is evaluated on the personalized parameter $\theta_i$, while the global model is evaluated on w. For the comparison with Per-FedAvg, we use its personalized model, which is the local model after taking one SGD step from the global model. All datasets are split randomly, with 75% for training and 25% for testing. All experiments were conducted using PyTorch [52] version 1.4.0. The code and datasets are available online¹.

### 5.2 Effect of hyperparameters

To understand how hyperparameters such as K and β affect the convergence of pFedMe in both µ-strongly convex and nonconvex settings, we conduct various experiments on the MNIST dataset with η = 0.005 and S = 5. The reasoning behind the choices of the hyperparameters R, |D|, and λ is provided in the supplementary files.

**Effect of the computation complexity K:** As K controls how closely the personalized model is approximated, K is also a hyperparameter of pFedMe. In Fig. 1, only the value of K is changed across experiments.
¹https://github.com/CharlieDinh/pFedMe

Figure 1: Effect of K on the convergence of pFedMe in µ-strongly convex and nonconvex settings on MNIST (|D| = 20, λ = 15, R = 20, β = 1).

We observe that pFedMe requires only a small value of K (around 3 to 5 steps) to approximately compute the personalized model. Larger values of K, such as 7, show no improvement in the convergence of either the personalized or the global model. As with R, a larger K also requires more computation at the user, which negatively affects user energy consumption. Therefore, K = 5 is chosen for the remaining experiments.

**Effect of β:** Fig. 2 illustrates how β (β ≥ 1) affects both the personalized and global models. Note that β = 1 corresponds to the model averaging of FedAvg. According to the figure, it is beneficial to increase β, as this allows pFedMe, and especially its global model, to converge faster. However, tuning β carefully is also important to prevent divergence and instability of pFedMe. For example, when β is large, to stabilize the global model as well as
the personalized model, a smaller value of η needs to be used. In other words, β and η should be adjusted in inverse proportion to keep pFedMe stable.

Figure 2: Effect of β on the convergence of pFedMe in µ-strongly convex and nonconvex settings on MNIST (|D| = 20, λ = 15, R = 20, K = 5).

Figure 3: Performance comparison of pFedMe, FedAvg, and Per-FedAvg in µ-strongly convex and nonconvex settings using MNIST (η = 0.005, |D| = 20, S = 5, β = 1 for all experiments).

### 5.3 Performance Comparison

To highlight the empirical performance of pFedMe, we perform several comparisons between pFedMe, FedAvg, and Per-FedAvg. We first use the same parameters for all algorithms as an initial comparison.
As algorithms behave differently when hyperparameters change, we conduct a grid search over a wide range of hyperparameters to find the combination of fine-tuned parameters that achieves the highest test accuracy for each algorithm. We use both the personalized model (PM) and the global model (GM) of pFedMe for comparison. The comparisons for the MNIST dataset are shown in Fig. 3 (same hyperparameters) and Table 1 (fine-tuned hyperparameters). Fig. 3 shows that pFedMe's personalized models in the strongly convex setting are 1.1%, 1.3%, and 1.5% more accurate than its global model, Per-FedAvg, and FedAvg, respectively. The corresponding figures for the nonconvex setting are 0.9%, 0.9%, and 1.3%. Table 1 shows that with fine-tuned hyperparameters, pFedMe's personalized model is the best performer in all settings.

Figure 4: Performance comparison of pFedMe, FedAvg, and Per-FedAvg in µ-strongly convex and nonconvex settings using Synthetic (η = 0.005, |D| = 20, S = 10, β = 1 for all experiments).

Table 1: Comparison using fine-tuned hyperparameters. We fix |D| = 20, R = 20, K = 5, and T = 800 for MNIST, T = 600 for Synthetic, and β = 2 for pFedMe (α̂ and β̂ are the learning rates of Per-FedAvg).
| Algorithm | Model | λ (MNIST) | η (MNIST) | (α̂, β̂) (MNIST) | Accuracy % (MNIST) | λ (Synthetic) | η (Synthetic) | (α̂, β̂) (Synthetic) | Accuracy % (Synthetic) |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg | MLR | | 0.02 | | 93.96 ± 0.02 | | 0.02 | | 77.62 ± 0.11 |
| Per-FedAvg | MLR | | | (0.03, 0.003) | 94.37 ± 0.04 | | | (0.02, 0.002) | 81.49 ± 0.09 |
| pFedMe-GM | MLR | 15 | 0.01 | | 94.18 ± 0.06 | 20 | 0.01 | | 78.65 ± 0.25 |
| pFedMe-PM | MLR | 15 | 0.01 | | 95.62 ± 0.04 | 20 | 0.01 | | 83.20 ± 0.06 |
| FedAvg | DNN | | 0.02 | | 98.79 ± 0.03 | | 0.03 | | 83.64 ± 0.22 |
| Per-FedAvg | DNN | | | (0.02, 0.001) | 98.90 ± 0.02 | | | (0.01, 0.001) | 85.01 ± 0.10 |
| pFedMe-GM | DNN | 30 | 0.01 | | 99.16 ± 0.03 | 30 | 0.01 | | 84.17 ± 0.35 |
| pFedMe-PM | DNN | 30 | 0.01 | | 99.46 ± 0.01 | 30 | 0.01 | | 86.36 ± 0.15 |

For the Synthetic dataset, the comparisons using the same parameters and the fine-tuned parameters are presented in Fig. 4 and Table 1, respectively. In Fig. 4, although pFedMe's global model performs worse than the others on test accuracy and training loss, the personalized model shows its advantage by achieving the highest test figures. Fig. 4 shows that pFedMe's personalized model is 6.1%, 3.8%, and 5.2% more accurate than its global model, Per-FedAvg, and FedAvg, respectively. The corresponding figures for the nonconvex setting are 3.9%, 0.7%, and 3.1%. In addition, with the fine-tuned hyperparameters in Table 1, the personalized model of pFedMe outperforms the others in all settings, while the global model of pFedMe only performs better than FedAvg. Overall, when the data among clients are non-i.i.d., both pFedMe and Per-FedAvg achieve higher test accuracy than FedAvg, as they allow the global model to be personalized to a specific client. Moreover, by optimizing the personalized model approximately with multiple gradient updates and avoiding Hessian computation, the personalized model of pFedMe is more advantageous than Per-FedAvg in terms of convergence rate and computation complexity.

## 6 Conclusion

In this paper, we propose pFedMe, a personalized FL algorithm that adapts to the statistical diversity issue to improve FL performance.
Our approach makes use of the Moreau envelope function, which helps decouple personalized-model optimization from global-model learning, allowing pFedMe to update the global model similarly to FedAvg while optimizing the personalized model w.r.t. each client's local data distribution in parallel. Theoretical results show that pFedMe achieves state-of-the-art convergence speedup rates. Experimental results demonstrate that pFedMe outperforms the vanilla FedAvg and the meta-learning based personalized FL algorithm Per-FedAvg in both convex and nonconvex settings, on both real and synthetic datasets. Finally, the degree to which personalization becomes provably useful remains a topic of experimental research, as parameters will need to be adapted to each dataset and federated setting.

## Broader Impact

There have been numerous applications of FL in practice. One notable commercial FL usage, which has proved successful in recent years, is the next-character prediction task on mobile devices. We believe this technology promises many more breakthroughs in a number of fields in the near future with the help of personalized FL models. In healthcare, for example, common causes of a disease can be identified from many patients without access to their raw data, and capable personalized models can help build better predictors of patients' conditions, allowing for faster, more efficient diagnosis and treatment. As much as FL promises, it also comes with a number of challenges. First, an important societal requirement when deploying such a technique is that the server must explain which clients' data will participate and which will not; the explainability and interpretability of a system are necessary for public understanding and informed consent. Second, to successfully preserve privacy, FL has to withstand malicious actors who may interfere in the training process during communication.
Such malicious behaviors include stealing personalized models from the server, performing adversarial attacks (e.g., degrading a personalized model on some examples while retaining good performance on average), and attempting to alter a model. Finally, an effective and unbiased FL system must be aware that data and computational power among clients can be extremely uneven in practice, and must therefore ensure that each client's contribution to the global model is weighted appropriately. These challenges motivate future research in decentralized learning in general and personalized FL in particular.

Acknowledgments and Disclosure of Funding

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number DS2020-28-01. Tuan Dung Nguyen's work is supported by the Faculty of Engineering Scholarship at the University of Sydney.

References

[1] H. B. McMahan, E. Moore, D. Ramage, and S. Hampson, Communication-Efficient Learning of Deep Networks from Decentralized Data, in International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2017.
[2] M. Mohri, G. Sivek, and A. T. Suresh, Agnostic Federated Learning, in International Conference on Machine Learning, 2019.
[3] S. P. R. Karimireddy et al., SCAFFOLD: Stochastic Controlled Averaging for Federated Learning, in International Conference on Machine Learning, 2020.
[4] K. Pillutla, S. M. Kakade, and Z. Harchaoui, Robust Aggregation for Federated Learning, arXiv:1912.13445, 2019.
[5] D. Li and J. Wang, FedMD: Heterogenous Federated Learning via Model Distillation, 2019.
[6] Y. Deng, M. M. Kamani, and M. Mahdavi, Adaptive Personalized Federated Learning, arXiv:2003.13461, 2020.
[7] J. J. Moreau, Propriétés des applications « prox », Comptes rendus hebdomadaires des séances de l'Académie des sciences, vol. 256, pp. 1069–1071, 1963.
[8] A. Fallah, A. Mokhtari, and A. Ozdaglar, Personalized Federated Learning: A Meta-Learning Approach, arXiv:2002.07948, 2020.
[9] N. Guha, A. Talwalkar, and V. Smith, One-Shot Federated Learning, 2018.
[10] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization, in International Conference on Artificial Intelligence and Statistics, 2020.
[11] X. Dai et al., Hyper-Sphere Quantization: Communication-Efficient SGD for Federated Learning, arXiv:1911.04655, 2019.
[12] J. Wang and G. Joshi, Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms, arXiv:1808.07576, 2019.
[13] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, Don't Use Large Mini-batches, Use Local SGD, in International Conference on Learning Representations, 2020.
[14] S. U. Stich, Local SGD converges fast and communicates little, in International Conference on Learning Representations, New Orleans, LA, USA, 2019.
[15] T. Li et al., Federated Optimization in Heterogeneous Networks, in Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020.
[16] Y. Zhao et al., Federated Learning with Non-IID Data, arXiv:1806.00582, 2018.
[17] F. Haddadpour and M. Mahdavi, On the Convergence of Local Descent Methods in Federated Learning, arXiv:1910.14425, 2019.
[18] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, On the Convergence of FedAvg on Non-IID Data, in International Conference on Learning Representations, 2020.
[19] A. Khaled, K. Mishchenko, and P. Richtárik, Tighter Theory for Local SGD on Identical and Heterogeneous Data, in International Conference on Artificial Intelligence and Statistics, 2020.
[20] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, Federated Multi-Task Learning, in Advances in Neural Information Processing Systems 30, 2017.
[21] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, Privacy Aware Learning, J. ACM, vol. 61, no. 6, pp. 1–57, 2014.
[22] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, Learning differentially private recurrent language models, in International Conference on Learning Representations, Vancouver, BC, Canada, 2018.
[23] W. Zhu, P. Kairouz, B. McMahan, H. Sun, and W. Li, Federated Heavy Hitters Discovery with Differential Privacy, in International Conference on Artificial Intelligence and Statistics, 2020.
[24] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan, cpSGD: Communication-efficient and differentially-private distributed SGD, in Advances in Neural Information Processing Systems 31, 2018.
[25] Z. Li, V. Sharma, and S. P. Mohanty, Preserving data privacy via federated learning: Challenges and solutions, IEEE Consumer Electronics Magazine, vol. 9, no. 3, pp. 8–16, 2020.
[26] F. Hanzely and P. Richtárik, Federated Learning of a Mixture of Global and Local Models, arXiv:2002.05516, 2020.
[27] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh, Three Approaches for Personalization with Applications to Federated Learning, arXiv:2002.10619, 2020.
[28] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary, Federated Learning with Personalization Layers, arXiv:1912.00818, 2019.
[29] A. Hard et al., Federated Learning for Mobile Keyboard Prediction, arXiv:1811.03604, 2019.
[30] K. Wang et al., Federated Evaluation of On-device Personalization, arXiv:1910.10252, 2019.
[31] P. Vanhaesebrouck, A. Bellet, and M. Tommasi, Decentralized collaborative learning of personalized models over networks, in International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2017.
[32] C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in International Conference on Machine Learning, 2017.
[33] A. Nichol, J. Achiam, and J. Schulman, On First-Order Meta-Learning Algorithms, arXiv:1803.02999, 2018.
[34] A. Fallah, A. Mokhtari, and A. Ozdaglar, On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms, in International Conference on Artificial Intelligence and Statistics, 2020.
[35] M. Khodak, M.-F. F. Balcan, and A. S. Talwalkar, Adaptive Gradient-Based Meta-Learning Methods, in Advances in Neural Information Processing Systems 32, 2019.
[36] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, Improving Federated Learning Personalization via Model Agnostic Meta Learning, 2019.
[37] F. Chen, M. Luo, Z. Dong, Z. Li, and X. He, Federated Meta-Learning with Fast Convergence and Efficient Communication, arXiv:1802.07876, 2019.
[38] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, Federated Learning: Challenges, Methods, and Future Directions, IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[39] P. Kairouz et al., Advances and Open Problems in Federated Learning, arXiv:1912.04977, 2019.
[40] H. Lin, J. Mairal, and Z. Harchaoui, Catalyst acceleration for first-order convex optimization: From theory to practice, J. Mach. Learn. Res., vol. 18, no. 1, pp. 7854–7907, 2017.
[41] P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng, Efficient Meta Learning via Minibatch Proximal Update, in Advances in Neural Information Processing Systems 32, 2019.
[42] X. Li, W. Yang, S. Wang, and Z. Zhang, Communication-Efficient Local Decentralized SGD Methods, arXiv:1910.09126, 2020.
[43] H. Yu, R. Jin, and S. Yang, On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization, in International Conference on Machine Learning, 2019.
[44] L. M. Nguyen et al., New convergence aspects of stochastic gradient algorithms, Journal of Machine Learning Research, vol. 20, no. 176, pp. 1–49, 2019.
[45] A. Khaled, K. Mishchenko, and P. Richtárik, First Analysis of Local GD on Heterogeneous Data, 2019.
[46] C. Lemaréchal and C. Sagastizábal, Practical Aspects of the Moreau–Yosida Regularization: Theoretical Preliminaries, SIAM J. Optim., vol. 7, no. 2, pp. 367–385, May 1997.
[47] C. Planiden and X. Wang, Strongly Convex Functions, Moreau Envelopes, and the Generic Nature of Convex Functions with Strong Minimizers, SIAM J. Optim., vol. 26, no. 2, pp. 1341–1364, 2016.
[48] T. Hoheisel, M. Laborde, and A. Oberman, A regularization interpretation of the proximal point method for weakly convex functions, Journal of Dynamics & Games, vol. 7, no. 1, pp. 79–96, 2020.
[49] S. Reddi et al., Adaptive Federated Optimization, arXiv:2003.00295, 2020.
[50] S. Bubeck, Convex Optimization: Algorithms and Complexity, FNT in Machine Learning, vol. 8, no. 3–4, pp. 231–357, 2015.
[51] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[52] A. Paszke et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library, in Advances in Neural Information Processing Systems 32, 2019.