Personalized Cross-Silo Federated Learning on Non-IID Data

Yutao Huang1, Lingyang Chu*2, Zirui Zhou3, Lanjun Wang3, Jiangchuan Liu1, Jian Pei1, Yong Zhang3
1Simon Fraser University, Burnaby, Canada
2McMaster University, Hamilton, Canada
3Huawei Technologies Canada, Burnaby, Canada
yutaoh@sfu.ca, chul9@mcmaster.ca, zirui.zhou@huawei.com, langjun.wang@huawei.com, jcliu@sfu.ca, jpei@cs.sfu.ca, yong.zhang3@huawei.com

*Lingyang Chu and Yutao Huang contribute equally to this work. The API of this work is available at https://t.ly/nGN9; free registration at Huawei Cloud is required before use. Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Non-IID data present a tough challenge for federated learning. In this paper, we explore a novel idea of facilitating pairwise collaborations between clients with similar data. We propose FedAMP, a new method that employs federated attentive message passing to facilitate similar clients to collaborate more. We establish the convergence of FedAMP for both convex and non-convex models, and propose a heuristic method to further improve the performance of FedAMP when clients adopt deep neural networks as personalized models. Our extensive experiments on benchmark data sets demonstrate the superior performance of the proposed methods.

1 Introduction

Federated learning (Yang et al. 2019) facilitates collaborations among a set of clients and preserves their privacy, so that the clients can achieve better machine learning performance than by working alone. The underlying idea is to learn collectively from the data of all clients. The initial idea of federated learning is to aggregate the models of all clients into a global model, so that the global model becomes more general and capable. The effectiveness of this global collaboration theme, which does not differentiate among clients, highly depends on how data are distributed among clients. It works well on IID data, that is, when clients are similar to each other in their private data distributions.

In many application scenarios where collaborations among clients are needed to train machine learning models, data are unfortunately not IID. For example, consider personalized cross-silo federated learning (Kairouz et al. 2019), where there are tens or hundreds of clients and the private data of clients may differ in size, in class distribution, and even in the distribution within each class. Global collaboration that does not consider individual private data often cannot achieve good performance for individual clients. Some federated learning methods try to fix the problem by conducting an additional fine-tuning step after a global model is trained (Ben-David et al. 2010; Cortes and Mohri 2014; Mansour et al. 2020; Mansour, Mohri, and Rostamizadeh 2009; Schneider and Vlachos 2020; Wang et al. 2019). While those methods work in some cases, they cannot solve the problem systematically, as demonstrated in our experimental results (e.g., data set CIFAR100 in Table 3).

We argue that the fundamental bottleneck in personalized cross-silo federated learning with non-IID data is the mistaken assumption that one global model can fit all clients. Consider the scenario where each client tries to train a model on customers' sentiments on food in a country, and different clients collect data in different countries.
Obviously, customers' reviews on food are likely to be related to their cultures, life-styles, and environments. It is unlikely that a single global model can fit all countries. Instead, pairwise collaborations among countries that share similarities in culture, life-style, environment and other factors may be the key to accomplishing good performance in personalized cross-silo federated learning with non-IID data.

Building on this insight, in this paper we tackle the challenging personalized cross-silo federated learning problem with a novel attentive message passing mechanism that adaptively facilitates the underlying pairwise collaborations between clients by iteratively encouraging similar clients to collaborate more. We make several technical contributions. We propose a novel method, federated attentive message passing (FedAMP), whose central idea is the attentive message passing mechanism. FedAMP allows each client to own a local personalized model, and does not use a single global model on the cloud server to conduct collaborations. Instead, it maintains a personalized cloud model on the cloud server for each client, and realizes the attentive message passing mechanism by attentively passing the personalized model of each client as a message to the personalized cloud models with similar model parameters. Moreover, FedAMP updates the personalized cloud model of each client by a weighted convex combination of all the messages it receives. This adaptively facilitates the underlying pairwise collaborations between clients and significantly improves the effectiveness of collaboration. We prove the convergence of FedAMP for both convex and non-convex personalized models. Furthermore, we propose a heuristic method to further improve the performance of FedAMP when clients use deep neural networks as personalized models. We conduct extensive experiments to demonstrate the superior performance of the proposed methods.

2 Related Works

Personalized federated learning for clients with non-IID data has attracted much attention (Deng, Kamani, and Mahdavi 2020; Fallah, Mokhtari, and Ozdaglar 2020; Kulkarni, Kulkarni, and Pant 2020; Mansour et al. 2020). In particular, our work is related to global federated learning, local customization and multi-task federated learning.

Global federated learning (Ji et al. 2019; McMahan et al. 2016; Wang et al. 2020; Yurochkin et al. 2019) trains a single global model to minimize an empirical risk function over the union of the data across all clients. When the data is non-IID across different clients, however, it is difficult to converge to a good global model that achieves good personalized performance on every client (Kairouz et al. 2019; Li et al. 2020; McMahan et al. 2016; Zhao et al. 2018).

Local customization methods (Chen et al. 2018; Fallah, Mokhtari, and Ozdaglar 2020; Jiang et al. 2019; Khodak, Balcan, and Talwalkar 2019; Kulkarni, Kulkarni, and Pant 2020; Mansour et al. 2020; Nichol, Achiam, and Schulman 2018; Schneider and Vlachos 2020; Wang et al. 2019) build a personalized model for each client by customizing a well-trained global model. There are several ways to conduct the customization. A practical way to customize a personalized model is local fine-tuning (Ben-David et al. 2010; Cortes and Mohri 2014; Mansour et al. 2020; Mansour, Mohri, and Rostamizadeh 2009; Schneider and Vlachos 2020; Wang et al. 2019), where the global model is fine-tuned using the private data of each client to produce a personalized model for the client.
Similarly, meta-learning methods (Chen et al. 2018; Fallah, Mokhtari, and Ozdaglar 2020; Jiang et al. 2019; Khodak, Balcan, and Talwalkar 2019; Kulkarni, Kulkarni, and Pant 2020; Nichol, Achiam, and Schulman 2018) can be extended to customize personalized models by adapting a well-trained global model to the local data of a client (Kairouz et al. 2019). Model mixture methods (Deng, Kamani, and Mahdavi 2020; Hanzely and Richtárik 2020) customize for each client by combining the global model with the client's latent local model. SCAFFOLD (Karimireddy et al. 2019) customizes the gradient updates of personalized models to correct client drift between personalized models and a global model.

Most existing local customization methods use a single global model to conduct a global collaboration involving all clients. The global collaboration framework only allows contributions from all clients to a global model and customization of the global model for each client. It does not allow pairwise collaboration among clients with similar data, and thus may encounter dramatic difficulties on non-IID data. Smith et al. (2017) model the pairwise collaboration relationships between clients by extending distributed multi-task learning to federated learning. They tackle the problem by a primal-dual optimization method that achieves great performance on convex models. However, due to its rigid requirement of strong duality, their method is not applicable when clients adopt deep neural networks as personalized models.

Different from all existing work, our study explores pairwise collaboration among clients. Our method is particularly effective when clients' data are non-IID, and can take full advantage of the similarity among clients.

3 Personalized Federated Learning Problem

In this section, we introduce the personalized federated learning problem, which aims to collaboratively train personalized models for a set of clients using the non-IID private data of all clients in a privacy-preserving manner (Kairouz et al. 2019; Zhao et al. 2018).

Consider m clients C_1, . . . , C_m that have the same type of model M personalized by m different sets of model parameters w_1, . . . , w_m, respectively. Denote by M(w_i) and D_i (1 ≤ i ≤ m) the personalized model and the private training data set of client C_i, respectively. These data sets are non-IID, that is, D_1, . . . , D_m are uniformly sampled from m distinct distributions P_1, . . . , P_m, respectively. For each client C_i, denote by V_i the performance of M(w_i) on the distribution P_i, and denote by V_i^* the best performance the model M can achieve on P_i over all possible parameter sets. The personalized federated learning problem aims to collaboratively use the private training data sets D_1, . . . , D_m to train the personalized models M(w_1), . . . , M(w_m) such that V_1, . . . , V_m are close to V_1^*, . . . , V_m^*, respectively, and no private training data of any client are exposed to any other clients or any third parties.

To be concrete, denote by F_i : R^d → R the training objective function that maps the model parameter set w_i ∈ R^d to a real-valued training loss with respect to the private training data D_i of client C_i. We formulate the personalized federated learning problem as

    min_{w_1, ..., w_m}  Σ_{i=1}^m F_i(w_i) + λ Σ_{i<j} A(‖w_i − w_j‖²),    (1)

where λ > 0 is a regularization parameter.
The first term Σ_{i=1}^m F_i(w_i) in Eq. (1) is the sum of the training losses of the personalized models of all clients. This term allows each client to separately train its own personalized model using its own private training data. The second term improves the collaboration effectiveness between clients through an attention-inducing function A(‖w_i − w_j‖²), defined as follows.

Definition 1 A(‖w_i − w_j‖²) is an attention-inducing function of w_i and w_j if A : [0, ∞) → R is a non-linear function that satisfies the following properties. 1. A is increasing and concave on [0, ∞) and A(0) = 0; 2. A is continuously differentiable on (0, ∞); and 3. for the derivative A′ of A, lim_{t→0+} A′(t) is finite.

The attention-inducing function A(‖w_i − w_j‖²) measures the difference between w_i and w_j in a non-linear manner. A typical example of A(‖w_i − w_j‖²) is the negative exponential function A(‖w_i − w_j‖²) = 1 − e^{−‖w_i − w_j‖²/σ} with a hyperparameter σ. Another example is the smoothly clipped absolute deviation function (Fan and Li 2001). One more example is the minimax concave penalty function (Zhang 2010). We adopt the widely used negative exponential function for our method in this paper.

As illustrated in the next section, our novel use of the attention-inducing function realizes an attentive message passing mechanism that adaptively facilitates collaborations between clients by iteratively encouraging similar clients to collaborate more with each other. The pairwise collaborations boost the performance of personalized federated learning dramatically.

4 Federated Attentive Message Passing

In this section, we first propose a general method to tackle the optimization problem in Eq. (1) without considering privacy preservation for clients. Then, we implement the general method as a personalized federated learning method, federated attentive message passing (FedAMP), which collaboratively trains the personalized models of all clients and preserves their data privacy. Last, we explain why FedAMP can adaptively facilitate collaborations between clients and significantly improve the performance of the personalized models.

A General Method

Denote by F(W) := Σ_{i=1}^m F_i(w_i) and A(W) := Σ_{i<j} A(‖w_i − w_j‖²), where W = [w_1, . . . , w_m] is the d-by-m matrix whose columns are the personalized model parameters. The problem in Eq. (1) can then be written as

    min_W  G(W) := F(W) + λ A(W).    (2)

The general method optimizes the two terms of G(W) alternately. In the k-th iteration, it first optimizes A(W) by a gradient descent step

    U^k = W^{k−1} − α_k ∇A(W^{k−1}),    (3)

where α_k > 0 is the step size of gradient descent, and W^{k−1} denotes the matrix W after the (k−1)-th iteration. Then, we use U^k as the prox-center and apply a proximal point step (Rockafellar 1976) to optimize F(W) by computing

    W^k = argmin_W  F(W) + (λ / (2α_k)) ‖W − U^k‖².    (4)

This iterative process continues until a preset maximum number of iterations K is reached. As illustrated later in Section 5, we analyze the non-asymptotic convergence of the general method, and prove that it converges to an optimal solution when G(W) is a convex function, and to a stationary point when G(W) is non-convex.

Figure 1: The message passing mechanism of FedAMP.

The general method introduced above can easily be implemented by merging the private training data of all clients into one training data set. To perform personalized federated learning without infringing the data privacy of the clients, we develop FedAMP, which implements the optimization steps of the general method in a client-server framework by maintaining a personalized cloud model for each client on a cloud server, and passing weighted model-aggregation messages between personalized models and personalized cloud models.

Following the optimization steps of the general method, FedAMP first optimizes A(W) and implements the optimization step in Eq. (3) by computing the d-by-m dimensional matrix U^k on the cloud server. Let U^k = [u^k_1, . . . , u^k_m], where u^k_1, . . . , u^k_m are the d-dimensional columns of U^k.
Since A(W) := Σ_{i<j} A(‖w_i − w_j‖²), the i-th column of U^k computed in Eq. (3) can be written as a weighted combination of the personalized models of all clients,

    u^k_i = ξ_{i,1} w^{k−1}_1 + ξ_{i,2} w^{k−1}_2 + · · · + ξ_{i,m} w^{k−1}_m,    (5)

where

    ξ_{i,j} = α_k A′(‖w^{k−1}_i − w^{k−1}_j‖²) for j ≠ i, and ξ_{i,i} = 1 − Σ_{j≠i} ξ_{i,j}.    (6)

In this way, the personalized model of every client is passed to the cloud server as a model-aggregation message, and the personalized cloud model u^k_i of client C_i is updated as a convex combination of all the messages it receives. Because A is increasing and concave, its derivative A′ is non-negative and decreasing, so a client C_j whose model parameters are closer to w^{k−1}_i obtains a larger weight ξ_{i,j} and contributes more to u^k_i. This attentive message passing mechanism forms a positive feedback loop that iteratively encourages similar clients to collaborate more with each other.

FedAMP then implements the proximal step in Eq. (4) on the clients. The cloud server sends u^k_i to client C_i, and the client updates its personalized model locally, using only its own private data, by computing

    w^k_i = argmin_{w ∈ R^d}  F_i(w) + (λ / (2α_k)) ‖w − u^k_i‖².    (7)

Algorithm 1 summarizes FedAMP, and Figure 1 illustrates its message passing mechanism.

5 Convergence Analysis

In this section, we analyze the convergence of FedAMP. Recall that G(W) := F(W) + λA(W) denotes the objective function of Eq. (1). Our analysis is based on the following assumption.

Assumption 1 There exists a constant B > 0 such that max{‖Y‖ : Y ∈ ∂F(W^k)} ≤ B and ‖∇A(W^k)‖ ≤ B/λ hold for every k ≥ 0, where ∂F is the subdifferential of F and ‖·‖ is the Frobenius norm.

For our problem in Eq. (1), Assumption 1 naturally holds if both F(W) and A(W) are locally Lipschitz continuous and W^k is bounded by a constant for all k ≥ 0. Now, we provide the guarantee on convergence of FedAMP when both F(W) and A(W) are convex functions.

Theorem 1 Under Assumption 1, and assuming the functions F(W) and A(W) in Eq. (1) are convex, if α_1 = · · · = α_K = λ/√K for some K > 0, then the sequence W^0, . . . , W^K generated by Algorithm 1 satisfies

    min_{0≤k≤K} G(W^k) − G^* ≤ (‖W^0 − W^*‖² + 5B²) / (2√K),

where W^* is an optimal solution of Eq. (1) and G^* = G(W^*). Moreover, if α_k satisfies Σ_{k=1}^∞ α_k = ∞ and Σ_{k=1}^∞ α_k² < ∞, then lim inf_{k→∞} G(W^k) = G^*.

Theorem 1 implies that for any ε > 0, FedAMP needs at most O(ε^{−2}) iterations to find an ε-optimal solution W̃ of Eq. (1) such that G(W̃) − G^* ≤ ε. It also establishes the global convergence of FedAMP to an optimal solution of Eq. (1) when G is convex. The proof of Theorem 1 is provided in Appendix A (Huang et al. 2020).

Next, we provide the convergence guarantee of FedAMP when G(W) is a smooth and non-convex function.

Theorem 2 Under Assumption 1, and assuming the functions F(W) and A(W) in Eq. (1) are continuously differentiable and the gradients ∇F(W) and ∇A(W) are Lipschitz continuous with modulus L, if α_1 = · · · = α_K = λ/√K, then the sequence W^0, . . . , W^K generated by Algorithm 1 satisfies

    min_{0≤k≤K} ‖∇G(W^k)‖² ≤ 18 (G(W^0) − G^* + 20LB²) / √K,

where W^* and G^* are the same as in Theorem 1. Moreover, if α_k satisfies Σ_{k=1}^∞ α_k = ∞ and Σ_{k=1}^∞ α_k² < ∞, then lim inf_{k→∞} ‖∇G(W^k)‖ = 0.

Theorem 2 implies that for any ε > 0, FedAMP needs at most O(ε^{−4}) iterations to find an ε-approximate stationary point W̃ of Eq. (1) such that ‖∇G(W̃)‖ ≤ ε. It also establishes the global convergence of FedAMP to a stationary point of Eq. (1) when G is smooth and non-convex. The proof of Theorem 2 is in Appendix B (Huang et al. 2020).
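To make the FedAMP round in Eqs. (5)-(7) concrete, the following is a minimal NumPy sketch of one communication round, assuming the negative exponential attention-inducing function adopted in this paper. The function names, the flattening of each client's parameters into a single vector, and the plain gradient-descent solver used for Eq. (7) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one FedAMP round with A(d) = 1 - exp(-d / sigma), so A'(d) = exp(-d / sigma) / sigma.
import numpy as np

def fedamp_cloud_update(W, alpha_k, sigma):
    """W: (m, d) array; row i holds client i's flattened parameters w_i^{k-1}.
    Returns U: (m, d) array; row i is the personalized cloud model u_i^k (Eq. (5))."""
    m = W.shape[0]
    xi = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if j != i:
                d2 = np.sum((W[i] - W[j]) ** 2)                    # ||w_i - w_j||^2
                xi[i, j] = alpha_k * np.exp(-d2 / sigma) / sigma   # Eq. (6), j != i
        xi[i, i] = 1.0 - xi[i].sum()                               # Eq. (6), self weight
    return xi @ W                                                  # Eq. (5)

def client_proximal_update(u_i, grad_Fi, w_init, lam, alpha_k, lr=0.1, steps=100):
    """Approximately solve Eq. (7), min_w F_i(w) + lam/(2*alpha_k)*||w - u_i||^2,
    by gradient descent, given a gradient oracle grad_Fi for the local loss F_i."""
    w = w_init.copy()
    for _ in range(steps):
        w = w - lr * (grad_Fi(w) + (lam / alpha_k) * (w - u_i))
    return w
```

For a sufficiently small step size α_k, every row of weights ξ_{i,1}, . . . , ξ_{i,m} is non-negative and sums to one, so each cloud model u^k_i is a convex combination of the clients' messages, with larger weights assigned to clients whose parameters are close to w^{k−1}_i.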
6 HeurFedAMP: Heuristic Improvement of FedAMP on Deep Neural Networks

In this section, we tackle the challenge that arises in the message passing mechanism when clients use deep neural networks, and propose a heuristic improvement of FedAMP.

As illustrated in Section 4, the effectiveness of the attentive message passing mechanism of FedAMP largely depends on the weights ξ_{i,1}, . . . , ξ_{i,m} of the model-aggregation messages. These message weights are determined by the similarity function A′(‖w_i − w_j‖²), which evaluates the similarity between the model parameter sets w_i and w_j based on their Euclidean distance ‖w_i − w_j‖. When the dimensionalities of w_i and w_j are small, Euclidean distance is a good measure of their difference, and the similarity function A′(‖w_i − w_j‖²) works well in evaluating the similarity between w_i and w_j. However, when clients adopt deep neural networks as their personalized models, each personalized model involves a large number of parameters, so the dimensionalities of w_i and w_j are high. In this case, Euclidean distance may no longer be effective in evaluating the difference between w_i and w_j due to the curse of dimensionality (Verleysen and François 2005). Consequently, the message weights produced by A′(‖w_i − w_j‖²) may not realize an effective attentive message passing mechanism, and we need a better way to produce the message weights.

To tackle this challenge, we propose HeurFedAMP, a heuristic revision of FedAMP for clients that use deep neural networks. The key idea of HeurFedAMP is to compute the message weights in a different way that works well with the high-dimensional model parameters of deep neural networks. Specifically, HeurFedAMP follows the optimization steps of FedAMP exactly, except for how the message weights ξ_{i,1}, . . . , ξ_{i,m} are computed in the k-th iteration. HeurFedAMP first treats the weight ξ_{i,i} as a self-attention hyper-parameter that controls the proportion of the message w^{k−1}_i sent from client C_i to its own personalized cloud model, and then computes the weight of the message passed from a client C_j (j ≠ i) to client C_i by

    ξ_{i,j} = ( e^{σ cos(w^{k−1}_i, w^{k−1}_j)} / Σ_{h≠i} e^{σ cos(w^{k−1}_i, w^{k−1}_h)} ) · (1 − ξ_{i,i}),    (8)

where σ is a scaling hyper-parameter and cos(w^{k−1}_i, w^{k−1}_j) is the cosine similarity between w^{k−1}_i and w^{k−1}_j. All the weights ξ_{i,1}, . . . , ξ_{i,m} computed by HeurFedAMP are non-negative and sum to 1. Applying the weights computed by HeurFedAMP to Eq. (5), the model parameter set u^k_i of the personalized cloud model of client C_i is still a convex combination of all the messages it receives. Furthermore, according to Eq. (8), if the model parameter sets w^{k−1}_i and w^{k−1}_j of two clients have a large cosine similarity cos(w^{k−1}_i, w^{k−1}_j), their messages have large weights and contribute more to the personalized cloud models of each other. In other words, HeurFedAMP builds a positive feedback loop similar to that of FedAMP to realize the attentive message passing mechanism. As demonstrated in Section 7, HeurFedAMP improves the performance of FedAMP when clients adopt deep neural networks as personalized models, because cosine similarity is more robust than Euclidean distance in evaluating the similarity between high-dimensional model parameters.
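The following is a minimal sketch of the HeurFedAMP weight computation in Eq. (8). Flattening each network's parameters into a single vector, the small constant added to the norms, and the function name are illustrative assumptions.

```python
# Sketch of the HeurFedAMP message weights of Eq. (8) based on cosine similarity.
import numpy as np

def heurfedamp_weights(W, i, sigma, xi_self):
    """W: (m, d) array of flattened client models w^{k-1}.
    Returns the m weights xi_{i,1..m} used to form client i's cloud model u_i^k."""
    norms = np.linalg.norm(W, axis=1)
    cos = (W @ W[i]) / (norms * norms[i] + 1e-12)   # cosine similarity to w_i
    scores = np.exp(sigma * cos)
    scores[i] = 0.0                                 # the softmax excludes h = i
    xi = (1.0 - xi_self) * scores / scores.sum()    # Eq. (8), j != i
    xi[i] = xi_self                                 # self-attention hyper-parameter
    return xi                                       # non-negative, sums to 1
```

Replacing the Euclidean-distance-based weights of Eq. (6) with these weights, while keeping the rest of the FedAMP round unchanged, yields HeurFedAMP.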
7 Experiments

In this section, we evaluate the performance of FedAMP and HeurFedAMP and compare them with state-of-the-art personalized federated learning algorithms, including SCAFFOLD (Karimireddy et al. 2019), APFL (Deng, Kamani, and Mahdavi 2020), FedAvg-FT and FedProx-FT (Wang et al. 2019). FedAvg-FT and FedProx-FT are two local fine-tuning methods (Wang et al. 2019) that obtain personalized models by fine-tuning the global models produced by the classic global federated learning methods FedAvg (McMahan et al. 2016) and FedProx (Li et al. 2020), respectively. To make our experiments more comprehensive, we also report the performance of FedAvg, FedProx, and a naive separate training method named Separate that independently trains the personalized model of each client without any collaboration between clients.

The performance of all methods is evaluated by the best mean testing accuracy (BMTA) in percentage, where the mean testing accuracy is the average of the testing accuracies of all clients, and BMTA is the highest mean testing accuracy achieved by a method over all communication rounds of training. All methods are implemented in PyTorch 1.3 running on a Dell Alienware with an Intel(R) Core(TM) i9-9980XE CPU, 128 GB of memory, an NVIDIA 1080Ti GPU, and Ubuntu 16.04.

Settings of Data Sets

We use four public benchmark data sets: MNIST (LeCun, Cortes, and Burges 2010), FMNIST (Fashion-MNIST) (Xiao, Rasul, and Vollgraf 2017), EMNIST (Extended-MNIST) (Cohen et al. 2017) and CIFAR100 (Krizhevsky and Hinton 2009). For each of the data sets, we apply three different data settings: 1) an IID data setting (McMahan et al. 2016) that uniformly distributes data across different clients; 2) a pathological non-IID data setting (McMahan et al. 2016) that partitions the data set in a non-IID manner such that each client contains two classes of samples and there are no group-wise similarities between the private data of clients; and 3) a practical non-IID data setting that first partitions clients into groups, and then assigns data samples to clients in such a way that the clients in the same group have similar data distributions, the clients in different groups have different data distributions, every client has data from all classes, and the number of samples per client differs across groups.

Compared with the pathological non-IID data setting, the practical non-IID data setting is closer to reality, since in practice each company participating in a personalized federated learning process often has data from most of the classes, and it is common that a subgroup of companies have similar data distributions that are different from the data owned by companies outside the subgroup.

Let us take EMNIST as an example to show how we apply the practical non-IID data setting. First, we set up 62 clients, numbered as clients 0, 1, . . . , 61, and divide them into three groups. Then, we assign the samples to the clients such that 80% of the data of every client are uniformly sampled from a set of dominating classes, and 20% of the data are uniformly sampled from the rest of the classes. Specifically, the first group consists of clients 0-9, where each client has 1,000 training samples and the dominating classes are the digits '0' to '9'. The second group consists of clients 10-35, where each client has 700 training samples and the dominating classes are the upper-case letters 'A' to 'Z'. The third group consists of clients 36-61, where each client has 400 training samples and the dominating classes are the lower-case letters 'a' to 'z'. Every client has 100 testing samples with the same distribution as its training data.
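To illustrate this partition scheme, the following is a hypothetical sketch for EMNIST. The class-index layout (digits as classes 0-9, upper-case letters as 10-35, lower-case letters as 36-61), the independent sampling for each client, and all helper names are assumptions made for illustration only.

```python
# Hypothetical sketch of the practical non-IID partition described above:
# three client groups with 1000/700/400 training samples per client,
# 80% drawn from the group's dominating classes and 20% from the rest.
import numpy as np

rng = np.random.default_rng(0)

GROUPS = [  # (client ids, dominating class ids, training samples per client)
    (range(0, 10),  range(0, 10),  1000),   # digits '0'-'9'
    (range(10, 36), range(10, 36), 700),    # upper-case letters 'A'-'Z'
    (range(36, 62), range(36, 62), 400),    # lower-case letters 'a'-'z'
]

def sample_indices(labels, classes, n):
    """Uniformly sample n indices of samples whose labels fall in the given classes."""
    pool = np.flatnonzero(np.isin(labels, list(classes)))
    return rng.choice(pool, size=n, replace=False)

def practical_noniid_partition(labels, num_classes=62):
    """labels: 1-D array of EMNIST training labels. Returns {client_id: sample indices}."""
    partition = {}
    for clients, dominating, n_per_client in GROUPS:
        rest = [c for c in range(num_classes) if c not in set(dominating)]
        n_dom = int(0.8 * n_per_client)          # 80% from the dominating classes
        for cid in clients:
            idx = np.concatenate([
                sample_indices(labels, dominating, n_dom),
                sample_indices(labels, rest, n_per_client - n_dom),  # 20% from the rest
            ])
            partition[cid] = rng.permutation(idx)
    return partition
```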
Limited by space, we only report the most important experimental results in the rest of this section. Please see Appendix C (Huang et al. 2020) for the details of the practical non-IID data setting on MNIST, FMNIST and CIFAR100, the implementation details and hyperparameter settings of all methods, and more extensive results on the convergence and robustness of the proposed methods.

Results on the IID Data Setting

Methods      MNIST   FMNIST   EMNIST   CIFAR100
Separate     99.27   81.66    54.41     9.82
FedAvg       99.31   91.94    74.38    49.59
FedProx      98.81   90.19    73.14    46.50
FedAvg-FT    98.98   90.17    70.53    35.07
FedProx-FT   98.72   89.02    69.49    40.77
SCAFFOLD     98.89   89.04    72.51    43.06
APFL         98.93   91.03    73.95    49.02
FedAMP       99.22   92.05    74.07    45.68
HeurFedAMP   99.28   91.80    74.07    45.88

Table 1: BMTA for the IID data setting.

Table 1 shows the BMTA of all methods under the IID data setting. The performance of Separate is a good baseline to indicate the need for collaboration on each data set, since Separate does not conduct any collaboration. Separate achieves a performance comparable with all the other methods on the easy data set MNIST. However, on the more challenging data sets FMNIST, EMNIST and CIFAR100, the performance of Separate falls significantly behind that of the others due to the lack of collaboration between clients.

The global federated learning methods FedAvg and FedProx achieve the best performance most of the time on IID data, because the clients are similar to each other and the global model fits every client well. Differentiating pairwise collaborations between clients is not needed on IID data. APFL achieves a performance comparable with FedAvg and FedProx on all data sets, because it degenerates to FedAvg under the IID data setting (Deng, Kamani, and Mahdavi 2020). For this reason, under the IID data setting, we consider APFL a global federated learning method rather than a personalized federated learning method.

The personalized federated learning methods FedAvg-FT, FedProx-FT and SCAFFOLD do not perform as well as FedAvg and FedProx under the IID data setting. Although they achieve a performance comparable to FedAvg and FedProx on MNIST, their performance on the more challenging data sets FMNIST, EMNIST and CIFAR100 is clearly inferior to FedAvg and FedProx. The local fine-tuning steps of FedAvg-FT and FedProx-FT are prone to over-fitting, and the rigid customization of the gradient updates in SCAFFOLD limits its flexibility to fit IID data well.

FedAMP and HeurFedAMP perform much better than FedAvg-FT, FedProx-FT and SCAFFOLD under the IID data setting. The personalized models of clients are similar to each other under the IID data setting, thus the attentive message passing mechanism assigns comparable weights to all messages, which in effect accomplishes a global collaboration among all clients similar to that of FedAvg and FedProx. FedAMP and HeurFedAMP achieve the best performance among all the personalized federated learning methods on all data sets, and also perform comparably well to FedAvg and FedProx on MNIST, FMNIST and EMNIST.

Results on the Pathological Non-IID Data Setting

Methods      MNIST   FMNIST   EMNIST   CIFAR100
Separate     98.73   97.67    99.15    92.67
FedAvg       98.39   77.88    19.44     2.70
FedProx      97.15   83.80    48.81     2.81
FedAvg-FT    99.66   98.07    99.24    95.00
FedProx-FT   99.63   98.00    99.27    94.36
SCAFFOLD     99.34   94.58    98.75     2.04
APFL         98.24   97.44    98.90    52.11
FedAMP       99.53   97.95    99.27    94.87
HeurFedAMP   99.38   98.17    99.26    94.74

Table 2: BMTA for the pathological non-IID data setting.

Table 2 shows the BMTA of all methods under the pathological non-IID data setting. This data setting is pathological because each client contains only two classes of samples, which largely simplifies the classification task on every client (McMahan et al. 2016). The simplicity of the client tasks is clearly indicated by the high performance of Separate on all the data sets.

However, the pathological non-IID data setting is not easy for the global federated learning methods. The performance of FedAvg and FedProx degenerates a lot on FMNIST and EMNIST, because the global aggregation of personalized models trained on the non-IID data of different clients introduces significant instability into the gradient-based optimization process (Zhang et al. 2020). On the most challenging data set, CIFAR100, this instability catastrophically destroys the performance of the global models produced by FedAvg and FedProx, and also significantly damages the performance of SCAFFOLD and APFL, because the global models are destroyed to the point that the customized gradient updates of SCAFFOLD and the model mixtures conducted by APFL can hardly tune them up.
The other personalized federated learning methods, FedAvg-FT, FedProx-FT, FedAMP and HeurFedAMP, achieve comparably good performance on all data sets. FedAvg-FT and FedProx-FT achieve good performance by taking many fine-tuning steps to tune the poor global models back to normal. The good performance of FedAMP and HeurFedAMP is achieved by adaptively facilitating pairwise collaborations between clients without using a single global model. Since the personalized cloud models of FedAMP and HeurFedAMP only aggregate similar personalized models of clients, they converge stably without suffering from the instability caused by the global aggregation of different personalized models.

Results on the Practical Non-IID Data Setting

Methods      MNIST   FMNIST   EMNIST   CIFAR100
Separate     86.30   86.73    61.78    39.99
FedAvg       81.82   79.50    72.27    35.21
FedProx      81.46   78.71    70.55    37.31
FedAvg-FT    91.79   89.73    78.93    49.00
FedProx-FT   94.10   87.51    77.31    50.24
SCAFFOLD     98.50   40.20    77.98    21.29
APFL         85.05   84.08    59.07    16.45
FedAMP       97.59   90.97    81.22    53.04
HeurFedAMP   97.36   91.37    81.47    53.27

Table 3: BMTA for the practical non-IID data setting.

Table 3 reports the BMTA of all methods under the practical non-IID data setting. FedAMP and HeurFedAMP perform comparably well to SCAFFOLD on MNIST, and they significantly outperform all the other methods on FMNIST, EMNIST and CIFAR100.

To evaluate the personalization performance of all methods in detail, we analyze the testing accuracy of the personalized model owned by each client (Figure 2). Both FedAMP and HeurFedAMP have more clients with higher testing accuracy on FMNIST, EMNIST and CIFAR100.

Figure 2: The distribution of the testing accuracy of all clients under the practical non-IID data setting. Panels: (a) MNIST, (b) FMNIST, (c) EMNIST, (d) CIFAR100.

We also conduct the Wilcoxon signed-rank test (Wilcoxon 1992) to compare FedAMP and HeurFedAMP against the other methods on FMNIST, EMNIST and CIFAR100, one pair of methods on one data set at a time. In all those tests, the p-values are less than 10^-4 and thus the null hypotheses are all rejected: FedAMP and HeurFedAMP outperform the other methods in the testing accuracies of individual clients with statistical significance.
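For reference, the following is a minimal sketch of how such a paired per-client comparison can be run with SciPy; the accuracy arrays below are placeholders rather than measurements from the paper.

```python
# Paired Wilcoxon signed-rank test on per-client testing accuracies.
from scipy.stats import wilcoxon

acc_fedamp   = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92]  # hypothetical, one entry per client
acc_baseline = [0.87, 0.84, 0.90, 0.80, 0.88, 0.83, 0.89]

stat, p_value = wilcoxon(acc_fedamp, acc_baseline)  # tests the paired per-client differences
print(f"statistic={stat:.3f}, p-value={p_value:.4f}")
```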
The superior performance of FedAMP and HeurFedAMP is contributed by the attentive message passing mechanism that adaptively facilitates the underlying pairwise collaborations between clients. Figure 3 visualizes the collaboration weights ξ_{i,j} computed by FedAMP and HeurFedAMP. The pairwise collaborations between clients are accurately captured by the three blocks in the matrix, where the three ground-truth collaboration groups are clients 0-9, 10-35 and 36-61. The other methods, however, are not able to form those collaboration groups, because a single global model cannot describe the numerous pairwise collaboration relationships between clients when the data is non-IID across different clients.

Figure 3: Visualization of the collaboration weights ξ_{i,j} computed by (a) FedAMP and (b) HeurFedAMP on EMNIST under the practical non-IID data setting. The x-axis and y-axis show the IDs of clients.

8 Conclusions

In this paper, we tackle the challenging problem of personalized cross-silo federated learning and develop FedAMP and HeurFedAMP, which introduce a novel attentive message passing mechanism to significantly improve the collaboration effectiveness between clients without infringing their data privacy. We analyze how the attentive message passing mechanism iteratively enables similar clients to collaborate more strongly than clients with dissimilar models, and empirically demonstrate that this mechanism significantly improves the learning performance.

Acknowledgements

Yutao Huang's, Jiangchuan Liu's and Jian Pei's research is supported in part by the NSERC Discovery Grant program. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. Most of the work of the author Zirui Zhou was done when he was affiliated with the Department of Mathematics of Hong Kong Baptist University (HKBU) and was supported in part by an HKBU Start-up Grant.

Ethics Statement

The ever-growing regulations and laws on protecting data privacy, such as the General Data Protection Regulation of Europe (https://gdpr.eu/), strictly restrict user data transmission between different sources. The restrictions on data transmission have become one of the biggest challenges for many data-intensive machine learning tasks. To tackle this challenge, we propose FedAMP and HeurFedAMP to securely and efficiently train high-performance AI models by legally using the private data held by multiple data owners without infringing the data privacy of any data owner.

References

Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1-2): 151–175.
Bertsekas, D. P. 2011. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning 2010(1-38): 3.
Chen, F.; Dong, Z.; Li, Z.; and He, X. 2018. Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876.
Cohen, G.; Afshar, S.; Tapson, J.; and Van Schaik, A. 2017. EMNIST: Extending MNIST to handwritten letters. In IEEE International Joint Conference on Neural Networks, 2921–2926.
Cortes, C.; and Mohri, M. 2014. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science 519: 103–126.
Deng, Y.; Kamani, M. M.; and Mahdavi, M. 2020. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461.
Fallah, A.; Mokhtari, A.; and Ozdaglar, A. 2020. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948.
Fan, J.; and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456): 1348–1360.
Hanzely, F.; and Richtárik, P. 2020. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516.
Huang, Y.; Chu, L.; Zhou, Z.; Wang, L.; Liu, J.; Pei, J.; and Zhang, Y. 2020. Personalized cross-silo federated learning on non-IID data. arXiv preprint arXiv:2007.03797.
Ji, S.; Pan, S.; Long, G.; Li, X.; Jiang, J.; and Huang, Z. 2019. Learning private neural language modeling with attentive aggregation. In IEEE International Joint Conference on Neural Networks, 1–8.
Jiang, Y.; Konečný, J.; Rush, K.; and Kannan, S. 2019. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488.
Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.
Karimireddy, S. P.; Kale, S.; Mohri, M.; Reddi, S. J.; Stich, S. U.; and Suresh, A. T. 2019. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378.
Khodak, M.; Balcan, M.-F. F.; and Talwalkar, A. S. 2019. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 5915–5926.
Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical Report, University of Toronto.
Kulkarni, V.; Kulkarni, M.; and Pant, A. 2020. Survey of personalization techniques for federated learning. arXiv preprint arXiv:2003.08673.
LeCun, Y.; Cortes, C.; and Burges, C. J. 2010. MNIST handwritten digit database. [Online]. Available: http://yann.lecun.com/exdb/mnist (accessed 2021-03-07).
Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2020. Federated optimization in heterogeneous networks. In Machine Learning and Systems, 429–450.
Mansour, Y.; Mohri, M.; Ro, J.; and Suresh, A. T. 2020. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619.
Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2016. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 1273–1282.
Nemirovski, A.; Juditsky, A.; Lan, G.; and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4): 1574–1609.
Nichol, A.; Achiam, J.; and Schulman, J. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.
Rockafellar, R. T. 1976. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization 14(5): 877–898.
Schneider, J.; and Vlachos, M. 2020. Mass personalization of deep learning. In International Data Science Conference.
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems, 4424–4434.
Verleysen, M.; and François, D. 2005. The curse of dimensionality in data mining and time series prediction. In International Conference on Artificial Neural Networks: Computational Intelligence and Bioinspired Systems, 758–770.
Wang, H.; Yurochkin, M.; Sun, Y.; Papailiopoulos, D.; and Khazaeni, Y. 2020. Federated learning with matched averaging. In International Conference on Learning Representations.
Wang, K.; Mathews, R.; Kiddon, C.; Eichner, H.; Beaufays, F.; and Ramage, D. 2019. Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252.
Wilcoxon, F. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics, 196–202. Springer.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology 10(2): 1–19.
Yurochkin, M.; Agarwal, M.; Ghosh, S.; Greenewald, K.; Hoang, N.; and Khazaeni, Y. 2019. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, 7252–7261.
Zhang, C.-H. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2): 894–942.
Zhang, X.; Hong, M.; Dhople, S.; Yin, W.; and Liu, Y. 2020. FedPD: A federated learning framework with optimal rates and adaptivity to non-IID data. arXiv preprint arXiv:2005.11418.
Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; and Chandra, V. 2018. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582.