Federated Learning with Fair Averaging

Zheng Wang¹, Xiaoliang Fan¹ (corresponding author), Jianzhong Qi², Chenglu Wen¹, Cheng Wang¹ and Rongshan Yu¹
¹Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University, Xiamen, China
²School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
zwang@stu.xmu.edu.cn, fanxiaoliang@xmu.edu.cn, jianzhong.qi@unimelb.edu.au, {clwen, cwang, rsyu}@xmu.edu.cn

Fairness has emerged as a critical problem in federated learning (FL). In this work, we identify a cause of unfairness in FL: conflicting gradients with large differences in their magnitudes. To address this issue, we propose the federated fair averaging (FedFV) algorithm to mitigate potential conflicts among clients before averaging their gradients. We first use the cosine similarity to detect gradient conflicts, and then iteratively eliminate such conflicts by modifying both the direction and the magnitude of the gradients. We further present the theoretical foundation of FedFV, showing that it mitigates conflicting gradients and converges to Pareto stationary solutions. Extensive experiments on a suite of federated datasets confirm that FedFV compares favorably against state-of-the-art methods in terms of fairness, accuracy and efficiency. The source code is available at https://github.com/WwZzz/easyFL.

1 Introduction

Federated learning (FL) has emerged as a new machine learning paradigm that aims to utilize clients' data to collaboratively train a global model while preserving data privacy [McMahan et al., 2017]. The global model is expected to perform better than any locally trained model, since it has much more data available for training. However, it is difficult for the global model to treat each client fairly [Kairouz et al., 2019; Li et al., 2019; Mohri et al., 2019].
For example, a global model for face recognition may work well for younger users (clients) but may suffer when used by more senior users, as the younger generation may use mobile devices more frequently and contribute more training data.

Techniques have been proposed to address the fairness issue in FL. AFL [Mohri et al., 2019] aims at good-intent fairness, which is to protect the worst-case performance on any client. This technique only works on a small network of dozens of clients because it directly treats each client as a domain, which may hurt generalizability. Li et al. [2019] seek to balance the overall performance and fairness using a fair resource allocation method. Hu et al. [2020] find a common degrade direction for all clients without sacrificing anyone's benefit, and demonstrate robustness to inflating-loss attacks. Li et al. [2020a] further explore a trade-off between a more general robustness and fairness, and they personalize each client's model differently.

Figure 1: When there are conflicting gradients (g_a · g_b < 0, g_a · g_c < 0, g_b · g_c < 0) and large differences in the gradient magnitudes, ||g_a|| > ||g_b|| > ||g_c||, the average of the original gradients will be dominated by g_a and will be far away from the global optimum θ*, resulting in unfairness to the dominated clients b and c. We argue that if the gradients are projected to mitigate the conflicts before averaging, the update direction ḡ will be corrected to be closer to the optimum and fairer.

Different from these studies, we observe that conflicting gradients with large differences in their magnitudes can fatally bring unfairness into FL. Since gradients are calculated locally by the clients, an update direction that suits some clients may hinder the model performance on other clients. The gradient conflicts are not detrimental on their own, since it has been shown that a simple average of gradients is valid for optimizing the global objective function [Li et al., 2020c].
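The situation in Figure 1 is easy to reproduce numerically. The sketch below (with made-up toy gradients, not taken from the paper's experiments) detects pairwise conflicts via dot products and shows that the plain average is dominated by the largest gradient, remaining in conflict with the two smaller clients:

```python
import numpy as np

# Made-up 2-D client gradients mirroring Figure 1: pairwise conflicting
# (all pairwise dot products are negative) with ||g_a|| > ||g_b|| > ||g_c||.
g_a = np.array([4.0, 0.0])
g_b = np.array([-1.2, 1.5])
g_c = np.array([-0.8, -1.0])

def conflicts(u, v):
    """Two updates conflict when their cosine similarity is negative,
    which has the same sign as their dot product."""
    return float(np.dot(u, v)) < 0.0

assert conflicts(g_a, g_b) and conflicts(g_a, g_c) and conflicts(g_b, g_c)

# The plain average is pulled toward the dominant client a: it still
# points against the updates of the smaller clients b and c.
g_avg = (g_a + g_b + g_c) / 3.0
print(np.dot(g_avg, g_a) > 0, np.dot(g_avg, g_b) < 0, np.dot(g_avg, g_c) < 0)
# prints: True True True
```

Mitigating such conflicts before averaging, as FedFV does, is precisely about correcting this dominated update direction.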
However, when the conflicting gradients have large differences in their magnitudes, model accuracy can suffer significant reductions for some of the clients, as shown in Figure 1. We highlight three characteristics of FL that make it critical to ensure fairness for the clients.

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Heterogeneous and imbalanced data. Different clients have their own data characteristics and hence potentially different data distributions. When the data is non-IID distributed, the divergence among the gradients of different clients gets even larger [Zhao et al., 2018]. These factors contribute to conflicting gradients in FL. Further, different clients may also have different dataset sizes. If the same local batch size is used across the clients, a client with more data will take more steps to train the local model, thus leading to large differences in gradient magnitudes. As a result, the clients selected in a communication round can largely conflict with each other, leading to an unfair update for some of them; we call such conflicts internal conflicts.

Party selection. In each communication round, the server only randomly selects a subset of clients to train the global model with their local datasets. There is no guarantee that the data distribution in each round is representative of the real population distribution. The data distribution can be highly imbalanced when the server repeatedly chooses a particular type of clients (e.g., due to sampling strategies based on computing power, latency, battery, etc.), and the average of gradients may become dominated by the repeatedly selected clients. Therefore, apart from the internal conflicts, we also identify external conflicts between the clients selected and those ignored by the server.

Client dropping out. Another source of external conflicts comes from network interruption.
When a client drops out due to network interruption (or other hardware issues), its gradient magnitude is effectively considered as 0, and an update to the global model may also be unfair to that client.

To address the issues above, we propose the federated fair averaging (FedFV) algorithm to mitigate the conflicts among clients before averaging their gradients. We first use the cosine similarity to detect gradient conflicts, and then iteratively eliminate such conflicts by modifying both the direction and the magnitude of the gradients. We further show how well FedFV can mitigate the conflicts and how it converges to either a Pareto stationary solution or the optimum on convex problems. Extensive experiments on a suite of federated datasets confirm that FedFV compares favorably against state-of-the-art methods in terms of fairness, accuracy and efficiency. The contributions of this work are summarized as follows:

- We identify two kinds of gradient conflicts (i.e., internal conflicts and external conflicts) that are major causes of unfairness in FL.
- We propose FedFV, which complements existing fair FL systems, and prove its ability to mitigate the two kinds of gradient conflicts and converge to Pareto stationary solutions.
- We perform extensive experiments on a suite of federated datasets to validate the competitiveness of FedFV against state-of-the-art methods in terms of fairness, accuracy and efficiency.

2 Preliminaries

FL aims to find a shared parameter θ that minimizes the weighted average loss of all clients [McMahan et al., 2017]:

min_θ F(θ) = Σ_{k=1}^{K} p_k F_k(θ)    (1)

where F_k(θ) denotes the local objective of the k-th client with weight p_k, p_k ≥ 0 and Σ_{k=1}^{K} p_k = 1. The local objective is usually defined by an empirical risk on a local dataset, i.e., F_k(θ) = (1/n_k) Σ_{ξ_i ∈ D_k} l(θ, ξ_i), where D_k denotes the local dataset of the k-th client, and n_k is the size of D_k. To optimize this objective, McMahan et al.
[2017] propose FedAvg, an iterative algorithm in which, in each round, the server randomly samples a subset S_t of m clients to train the global model. In the t-th communication round, there are internal conflicts if there exist g_i^t · g_j^t < 0 such that g_i^t conflicts with g_j^t, where g_i^t, g_j^t ∈ G_t = {g_1^t, g_2^t, ..., g_m^t} and G_t is the set of the selected clients' updates.

Algorithm 1 FedFV
Input: T, m, α, τ, η, θ_0, p_k, k = 1, ..., K
1: Initialize θ_0 and the gradient history GH = [].
2: for t = 0, 1, ..., T−1 do
3:   Server samples a subset S_t of m clients with probabilities p_k and sends the model θ_t to them.
4:   Server receives the update g_k^t and the training loss l_k^t from each client k ∈ S_t, where g_k^t ← θ_t − θ_k^t and θ_k^t is the model updated by client k. The server then updates the trace GH = [g_1^{t_1}, g_2^{t_2}, ..., g_K^{t_K}] of the latest updates of all clients.
5:   Server sorts the clients' updates into a projecting order list PO_t = [g_{i_1}^t, ..., g_{i_m}^t] based on their losses, where l_{i_j}^t ≤ l_{i_{j+1}}^t.
6:   g^t ← MitigateInternalConflict(PO_t, G_t, α)
7:   if t ≥ τ then
8:     g^t ← MitigateExternalConflict(g^t, GH, τ)
9:   end if
10:  g^t ← (g^t / ||g^t||) · ||(1/m) Σ_{k=1}^{m} g_k^t||
11:  Server updates the model θ_{t+1} ← θ_t − g^t
12: end for

Internal conflicts account for the unfairness among the selected clients. Consider learning a binary classifier: if clients with data of one class form the majority in a communication round and there exist conflicts between the gradients of the two classes, the global model will be updated to favor the clients of the majority class, sacrificing accuracy on the clients of the other class. Further, even when the proportion of clients of the two classes is balanced, the magnitudes of the gradients may still differ largely due to different dataset sizes on different clients. To address the internal conflicts, FedFV iteratively projects a client's gradient onto the normal plane of another client with a conflicting gradient.
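The projection step just described, together with the loss-ordered sweep FedFV builds on it, can be sketched as follows. This is a simplified NumPy sketch, not the authors' released implementation; the function names and the `losses`/`alpha` arguments are our own, and it assumes the αm highest-loss clients keep their original gradients:

```python
import numpy as np

def project_to_normal_plane(g, target):
    """If g conflicts with target (negative dot product), remove the
    conflicting component: g <- g - (g . target / ||target||^2) target.
    The result never conflicts with target."""
    d = float(np.dot(g, target))
    if d < 0.0:
        g = g - (d / float(np.dot(target, target))) * target
    return g

def mitigate_internal_conflicts(grads, losses, alpha=0.0):
    """Project every gradient (except those of the alpha*m highest-loss
    clients, which are kept as-is) onto the normal planes of the ORIGINAL
    gradients it conflicts with, sweeping the targets in ascending order
    of loss so that high-loss clients are the last targets, then average."""
    m = len(grads)
    order = sorted(range(m), key=lambda i: losses[i])  # low loss first
    n_keep = int(alpha * m)
    keep = set(order[-n_keep:]) if n_keep > 0 else set()
    projected = []
    for k in range(m):
        g = np.array(grads[k], dtype=float)
        if k not in keep:
            for j in order:
                if j != k:
                    g = project_to_normal_plane(g, grads[j])
        projected.append(g)
    return sum(projected) / m

# Two conflicting toy gradients; the higher-loss client (index 1) is the
# last projection target, so the result no longer points against it.
avg = mitigate_internal_conflicts(
    [np.array([1.0, 0.0]), np.array([-1.0, 0.5])], losses=[0.1, 0.9])
print(avg)  # components (0.1, 0.45)
```

Note that Algorithm 1 afterwards rescales the result back to the magnitude of the plain average (line 10), so only the direction is modified here.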
Instead of projecting in a random order, we design a loss-based order to decide which gradient should be the projecting target of the other gradients, based on Theorem 1. We show that the later a client's gradient serves as the projection target, the fewer conflicts it has with the final average gradient computed by FedFV. Algorithm 2 summarizes our steps to mitigate internal conflicts.

Theorem 1. Suppose there is a set of gradients G = {g_1, g_2, ..., g_m}, where g_i always conflicts with g_j^{(t_j)} before g_j^{(t_j)} is projected onto g_i's normal plane, and g_i^{(t_i)} is obtained by projecting g_i onto the normal planes of other gradients in G for t_i times. Assume |cos⟨g_i^{(t_i)}, g_j^{(t_j)}⟩| ≤ ε, 0 < ε ≤ 1, for each g_i ∈ G. As long as we iteratively project each g_i onto g_k's normal plane (skipping g_i itself) in ascending order of k, k = 1, 2, ..., m, the larger k is, the smaller the upper bound on the conflict between the average ḡ' = (1/m) Σ_{i=1}^{m} g_i^{(m)} and g_k.

Proof. See Appendix A.1.

Algorithm 2 MitigateInternalConflict
Input: PO_t, G_t, α
1: Server selects the set S_t^α of the first (1−α)m clients in PO_t (the remaining αm clients, which have the largest losses, keep their original gradients).
2: Server sets g_k^{PC} ← g_k^t for every k.
3: for each client k ∈ S_t^α in parallel do
4:   for each g_{i_j}^t ∈ PO_t, j = 1, ..., m do
5:     if g_k^{PC} · g_{i_j}^t < 0 and k ≠ i_j then
6:       Server computes g_k^{PC} ← g_k^{PC} − (g_k^{PC} · g_{i_j}^t / ||g_{i_j}^t||²) g_{i_j}^t
7:     end if
8:   end for
9: end for
10: g^t ← (1/m) Σ_{k=1}^{m} g_k^{PC}
11: return g^t

Since projecting gradients favors the gradient that serves as the projecting target later, we put the clients with larger training losses at the end of the projecting order list PO_t in the t-th communication round, to improve the model performance on the less well-trained data. We allow αm clients to keep their original gradients to further enhance fairness. We will detail α in Section 3.4.

3.3 Mitigate External Conflicts

Due to party selection and clients dropping out, there is a sample bias in each communication round in FL [Kairouz et al., 2019].
A client that is not selected in the t-th round can suffer a risk of being forgotten by the model whenever the combined update g^t conflicts with its imaginary gradient g_img^t. We can consider such clients to have a weight of zero in this round. However, we cannot directly detect conflicts for such clients because we have no access to their real gradients. To address this issue, we estimate their real gradients from their most recent gradients, and we call such gradient conflicts external conflicts:

Definition 3. In the t-th communication round, there are external conflicts if there is a client h ∉ S_t whose latest gradient g_h^{t_h} conflicts with the combined update g^t, i.e., g^t · g_h^{t_h} < 0.

Theorem 2. Under the setting of Theorem 1, assume ε_1 ≤ |cos⟨g_i^{(t_i)}, g_j^{(t_j)}⟩| ≤ ε_2 with 0 < ε_1 ≤ ε_2 ≤ 1. As long as we iteratively project each g_i onto g_k's normal plane (skipping g_i itself) in ascending order of k, k = 1, 2, ..., m, the maximum value of |g_k · ḡ'| is bounded by ((m−1)/m)(max_i ||g_i||)² f(m, k, ε_1, ε_2), where

f(m, k, ε_1, ε_2) = ε_2² (1 − ε_1²)^{k/2} (1 − (1 − ε_1²)^{(m−k)/2}) / (1 − (1 − ε_1²)^{1/2}).

Proof. See Appendix A.2.

According to Theorem 2, it is possible for FedFV to bound the maximum conflict for any gradient by choosing k such that f(m, k, ε_1, ε_2) < ε_2 for any given m ≥ 2, since the possible maximum value of the conflict between g_k and the original average gradient is max_k |g_k · ḡ| ≤ ((m−1)/m) ε_2 (max_k ||g_k||)². In practice, we mitigate all the conflicts that have been detected to limit the upper bound of the gradient conflicts of clients. We allow αm clients with large training losses to keep their original gradients to further enhance fairness. When α = 1, all clients keep their original gradients, so that FedFV reduces to FedAvg; when α = 0, all clients are forced to mitigate conflicts with the others. Parameter α thus controls the degree of conflict mitigation and can be tuned to seek a balance. Next, we show that FedFV finds a Pareto stationary point according to Theorems 3 and 4.

Theorem 3.
Assume that there are only two types of users whose objective functions are F_1(θ) and F_2(θ), and each objective function is differentiable, L-smooth and convex. For the average objective F(θ) = (1/2) Σ_i F_i(θ), FedFV with step size η ≤ 1/L converges to either 1) a Pareto stationary point, or 2) the optimum θ*.

Proof. See Appendix A.3.

Theorem 4. Assume that there are m objective functions F_1(θ), F_2(θ), ..., F_m(θ), and each objective function is differentiable, L-smooth and convex. If cos⟨g̃, ∇F(θ)⟩ ≥ 1/2 and ||g̃|| ≤ ||∇F(θ)||, where g̃ is the update computed by FedFV and ∇F(θ) is the true gradient of the average objective F(θ) = (1/m) Σ_i F_i(θ), then FedFV with step size η ≤ 1/L converges to either 1) a Pareto stationary point, or 2) the optimum θ*.

Proof. See Appendix A.3.

4 Related Work

4.1 Fairness in FL

Federated learning (FL) was first proposed by McMahan et al. [2017] to collaboratively train a global model in a distributed manner while preserving data privacy [Kairouz et al., 2019; Li et al., 2020b; McMahan et al., 2017]. A number of studies focus on collaborative fairness, where the server allocates different models to clients according to their contributions [Lyu et al., 2020; Lyu et al., 2019; Xu and Lyu, 2020]. A few other studies address the fairness of a uniform accuracy distribution across devices [Cho et al., 2020; Hu et al., 2020; Li et al., 2019; Li et al., 2020a; Mohri et al., 2019]. Mohri et al. [2019] and Li et al. [2019] propose the federated objectives AFL and q-FFL, respectively, to further improve fairness. Hu et al. [2020] observe the competing relation between fairness and robustness to inflating-loss attacks. Abay et al. [2020] analyze the potential causes of bias in FL that lead to unfairness, and also point out the negative impact of sample bias due to party selection. Cho et al. [2020] also show that client selection has an impact on fairness, and propose a bandit-based communication-efficient client selection strategy to overcome the bias. Huang et al.
[2020] reweight clients according to their accuracy and the number of times they have been selected, to achieve fairness, and use double momentum to accelerate convergence. Different from these works, we identify the conflicts among clients as a potential cause of unfairness in FL, and we mitigate such conflicts by computing a fairer average of gradients to achieve fair model performance across devices.

4.2 Gradient Projection

Gradient projection has been well studied in continual learning to mitigate the adverse impact of gradient updates on previously learned tasks [Chaudhry et al., 2018; Farajtabar et al., 2019; Guo et al., 2019; Lopez-Paz and Ranzato, 2017]. Lopez-Paz and Ranzato [2017] project gradients by solving a quadratic programming problem. Chaudhry et al. [2018] project the gradient onto the normal plane of the average gradient of previous tasks. Farajtabar et al. [2019] project the current task's gradients onto the orthonormal set of previous tasks' gradients. Yu et al. [2020] focus on adverse interference between task gradients when simultaneously learning multiple tasks. They iteratively project each task gradient onto the normal plane of conflicting gradients, which motivates our solution in this paper. To the best of our knowledge, we are the first to take adverse gradient interference into consideration in FL. Specifically, our proposed FedFV method builds a connection between fairness and conflicting gradients with large differences in their magnitudes, and we prove its ability to mitigate the two kinds of gradient conflicts (i.e., internal and external conflicts) and converge to Pareto stationary solutions.

| Method | Parameters |
|---|---|
| AFL | ηλ ∈ {0.01, 0.1, 0.5} |
| q-FedAvg | q ∈ {0.1, 0.2, 1, 2, 5, 15} |
| FedMGDA+ | ε ∈ {0.01, 0.05, 0.1, 0.5, 1} |
| FedFa | (α, β) ∈ {(0.5, 0.5)}, (γ_s, γ_c) ∈ {(0.5, 0.9)} |
| FedFV | α ∈ {0, 0.1, 0.2, 1/3}, τ ∈ {0, 1, 3, 10} |

Table 1: Method-specific hyper-parameters.
5 Experiments

5.1 Experimental Setup

Datasets and Models. We evaluate FedFV on three public datasets: CIFAR-10 [Krizhevsky, 2012], Fashion MNIST [Xiao et al., 2017] and MNIST [LeCun et al., 1998]. We follow [McMahan et al., 2017] to create non-IID datasets. For CIFAR-10, we sort all data records by class and then split them into 200 shards. We use 100 clients, and each client randomly picks 2 shards without replacement, so that each client has the same data size. We use 200 clients for MNIST and preprocess MNIST in the same way as CIFAR-10. Each local dataset is split into training and testing data with percentages of 80% and 20%. For Fashion MNIST, we simply follow the setting in [Li et al., 2019]. We use a feedforward neural network with 2 hidden layers on CIFAR-10 and Fashion MNIST, and a CNN with 2 convolution layers on MNIST.

Baselines. We compare with the classical method FedAvg [McMahan et al., 2017] and with FL systems that address fairness in FL, including AFL [Mohri et al., 2019], q-FedAvg [Li et al., 2019], FedFa [Huang et al., 2020] and FedMGDA+ [Hu et al., 2020]. We compare with q-FedAvg, FedFa and FedMGDA+ on all three datasets, and with AFL only on Fashion MNIST, because AFL is only suitable for small networks with dozens of clients.

Hyper-parameters. For all experiments, we fix the local epochs E = 1 and use batch size B ∈ {full, 64} for CIFAR-10 and MNIST and B ∈ {full, 400} for Fashion MNIST to run stochastic gradient descent (SGD) on the local datasets with step size η ∈ {0.01, 0.1}. We verify the different methods with the hyper-parameters listed in Table 1, and take the best performance of each method for the comparison.

Implementation. All experiments are run on a 64GB-memory Ubuntu 16.04.6 server with 40 Intel Xeon E5-2630 v4 CPUs @ 2.20GHz and 4 NVIDIA 2080Ti GPUs. All code is implemented in PyTorch 1.3.1. Please see https://github.com/WwZzz/easyFL for full details.
| Method | Ave. | Var. | Worst 5% | Best 5% |
|---|---|---|---|---|
| FedAvg | 46.85±0.65 | 12.57±1.50 | 19.84±6.55 | 69.28±1.17 |
| q-FedAvg (q=0.1) | 47.02±0.89 | 13.16±1.84 | 18.72±6.94 | 70.16±2.06 |
| q-FedAvg (q=0.2) | 46.91±0.90 | 13.09±1.84 | 18.88±7.00 | 70.16±2.10 |
| q-FedAvg (q=1.0) | 46.79±0.73 | 11.72±1.00 | 22.80±3.39 | 68.00±1.60 |
| q-FedAvg (q=2.0) | 46.36±0.38 | 10.85±0.76 | 24.64±2.17 | 66.80±2.02 |
| q-FedAvg (q=5.0) | 45.25±0.42 | 9.59±0.36 | 26.56±1.03 | 63.60±1.13 |
| FedFa | 46.43±0.56 | 12.79±1.54 | 19.28±6.78 | 69.36±1.40 |
| FedMGDA+ (ε=0.01) | 45.65±0.21 | 10.94±0.87 | 25.12±2.34 | 67.44±1.20 |
| FedMGDA+ (ε=0.05) | 45.58±0.21 | 10.98±0.81 | 25.12±1.87 | 67.76±2.27 |
| FedMGDA+ (ε=0.1) | 45.52±0.17 | 11.32±0.86 | 24.32±2.24 | 68.48±2.68 |
| FedMGDA+ (ε=0.5) | 45.34±0.21 | 11.63±0.69 | 24.00±1.93 | 68.64±3.11 |
| FedMGDA+ (ε=1.0) | 45.34±0.22 | 11.64±0.66 | 24.00±1.93 | 68.64±3.11 |
| FedFV (α=0.1, τ=0) | 48.57±0.76 | 10.97±1.02 | 28.32±2.01 | 69.76±2.45 |
| FedFV (α=0.2, τ=0) | 48.54±0.64 | 10.56±0.96 | 28.88±1.92 | 69.20±2.01 |
| FedFV (α=0.5, τ=0) | 48.14±0.38 | 10.60±0.76 | 28.72±1.92 | 68.80±1.77 |
| FedFV (α=0.1, τ=1) | 49.34±0.56 | 10.74±0.89 | 28.56±2.54 | 70.24±1.49 |
| FedFV (α=0.1, τ=3) | 50.00±0.74 | 10.85±1.14 | 28.24±3.03 | 70.96±1.00 |
| FedFV (α=0.1, τ=10) | 50.42±0.55 | 9.70±0.96 | 32.24±2.10 | 69.68±2.84 |

Table 2: The average, the variance, the worst 5% and the best 5% of the test accuracy of all clients on CIFAR-10. All experiments run over 2000 rounds with full batch size, learning rate η = 0.1 and local epochs E = 1. The reported results are averaged over 5 runs with different random seeds.

5.2 Experimental Results

Fairness. We first verify the advantage of FedFV in fairness. In Table 2, we list the mean, the variance, the worst 5% and the best 5% of test accuracy on 100 clients created by splitting CIFAR-10, where 10% of the clients are sampled in each communication round. While reaching a higher mean of test accuracy across clients, FedFV also yields the lowest variance among all experiments except for q-FedAvg (q=5.0). Note that q-FedAvg sacrifices the best clients' performance, from 69.28% to 63.60%, and causes a significant reduction of the mean accuracy from 46.85% to 45.25%.
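For reference, the per-client statistics reported in Table 2 can be computed from a vector of client accuracies as in the sketch below. The helper and its name are ours; we use the standard deviation as the dispersion measure, while the paper's "Var." column follows the authors' own convention:

```python
import numpy as np

def fairness_stats(acc, tail=0.05):
    """Mean, dispersion and tail statistics of per-client test accuracy,
    mirroring the columns of Table 2. `tail`=0.05 selects the worst/best
    5% of clients."""
    acc = np.sort(np.asarray(acc, dtype=float))
    k = max(1, int(round(tail * len(acc))))
    return {
        "mean": float(acc.mean()),
        "std": float(acc.std()),              # dispersion across clients
        "worst_5%": float(acc[:k].mean()),    # mean over the bottom 5%
        "best_5%": float(acc[-k:].mean()),    # mean over the top 5%
    }

# Made-up accuracies for 100 clients, only to exercise the helper.
rng = np.random.default_rng(0)
stats = fairness_stats(rng.uniform(20.0, 70.0, size=100))
print(stats)
```

A fairer method shrinks the gap between the "worst 5%" and "best 5%" entries while keeping the mean high.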
When reaching a variance similar to that of q-FedAvg (q=5.0) (9.70 versus 9.59), FedFV improves the mean accuracy to 50.42%. Further, the worst 5% performance of all the FedFV experiments is higher than that of any other method, which indicates FedFV's effectiveness in protecting the worst performance.

Table 3 lists the model's test accuracy on each client's data, along with the average and the variance of the accuracy, on the Fashion MNIST dataset. Different from the settings in Table 2, we use a small network with only 3 clients, and the server selects all the clients to participate in the training process in each communication round to eliminate the sample bias, which results in the absence of external conflicts. Therefore, we only use FedFV with α ∈ {0, 1/3, 2/3} and τ = 0 in this case. In our observation, even though FedFV with α = 0 can still find the solution with the highest mean test accuracy, 80.48%, it is

| Method | shirt | pullover | T-shirt | Ave. | Var. |
|---|---|---|---|---|---|
| FedAvg | 64.26±1.07 | 87.03±1.40 | 89.97±1.30 | 80.42±0.76 | 11.50±0.95 |
| AFL (ηλ=0.01) | 71.09±0.91 | 81.89±1.30 | 84.29±1.23 | 79.09±0.73 | 5.76±0.75 |
| AFL (ηλ=0.1) | 76.34±0.63 | 79.00±1.04 | 79.06±1.21 | 78.13±0.72 | 1.27±0.66 |
| AFL (ηλ=0.5) | 76.57±0.58 | 78.77±0.99 | 79.09±1.15 | 78.14±0.71 | 1.12±0.61 |
| q-FedAvg (q=5) | 71.29±0.85 | 81.46±1.08 | 82.86±0.97 | 78.53±0.64 | 5.16±0.69 |
| q-FedAvg (q=15) | 77.09±0.88 | 75.40±2.07 | 60.69±1.79 | 71.06±0.81 | 7.46±0.88 |
| FedFa | 63.80±1.10 | 86.86±1.28 | 89.97±1.30 | 80.21±0.73 | 11.68±0.93 |
| FedMGDA+ (ε=0.05) | 44.63±1.63 | 57.77±5.72 | 99.09±0.23 | 67.16±1.65 | 23.39±0.57 |
| FedMGDA+ (ε=0.1) | 72.26±3.12 | 79.71±4.47 | 86.03±4.27 | 79.33±1.11 | 6.45±2.21 |
| FedMGDA+ (ε=1.0) | 72.46±3.32 | 79.74±4.48 | 85.66±4.88 | 79.29±1.14 | 6.42±2.23 |
| FedFV (α=0, τ=0) | 61.06±1.21 | 89.31±1.38 | 91.06±1.29 | 80.48±0.68 | 13.76±1.02 |
| FedFV (α=1/3, τ=0) | 77.09±1.35 | 80.69±1.53 | 81.06±0.53 | 79.61±0.85 | 1.91±0.57 |
| FedFV (α=2/3, τ=0) | 77.91±0.80 | 81.46±1.99 | 81.46±1.51 | 80.28±1.09 | 1.77±0.87 |

Table 3: Test accuracy on the different clothing classes of the Fashion MNIST dataset.
All experiments run over 200 rounds with full batch size, learning rate η = 0.1 and local epochs E = 1. The reported results are averaged over 5 runs with different random seeds.

| | FedFV | FedFV (random order) | FedFV (reverse order) |
|---|---|---|---|
| Var. on CIFAR-10 | 13.19±1.06 | 14.12±0.55 | 16.28±0.72 |
| Var. on Fashion MNIST | 13.76±1.02 | 20.14±0.61 | 22.05±0.68 |

Table 4: The effects of the projecting order on fairness. The results are averaged over 5 runs with different random seeds.

an unfair solution, where the variance is as high as 13.76. Instead, FedFV with α = 2/3 also finds a solution with a high accuracy of 80.28% while simultaneously keeping the variance at a low level. We notice that AFL (ηλ=0.5) has the lowest variance. However, AFL only works for small networks where the number of clients is limited. Despite this, FedFV (α=2/3) outperforms AFL (ηλ=0.5) on all the clients (i.e., shirt, pullover and T-shirt), as well as on the mean accuracy. Further, when compared to the other methods except AFL, our method has the lowest variance, down to 1.77, which validates the ability of FedFV to find a fair and accurate solution in the absence of sample bias. From Tables 2 and 3, we observe that FedFV is able to ease the negative impact of conflicting gradients with largely different magnitudes, which results in a fairer and more accurate model with lower variance.

Accuracy and Efficiency. We further show that FedFV outperforms existing works addressing fairness in terms of accuracy and efficiency on CIFAR-10 and MNIST. All the methods are tuned to their best performance. As shown in Figure 2, when we only mitigate the internal conflicts, which is done by FedFV (τ=0), we already converge faster while keeping a lower variance than the others. In addition, when we mitigate both the internal and external conflicts by FedFV (τ>0), there is a substantial improvement in the accuracy and efficiency again.
FedFV outperforms state-of-the-art methods by up to 7.2% on CIFAR-10 and 78% on MNIST, which verifies the ability of FedFV to reach a higher accuracy with fewer communication rounds while still keeping a low variance. Thus, we demonstrate FedFV's advantages in saving communication cost and finding a better generalization.

Figure 2: The mean (left) and the variance (right) of test accuracy over all clients on (a) CIFAR-10 and (b) MNIST, comparing FedAvg, q-FedAvg, FedFa, FedMGDA+ and FedFV. The results are averaged over 5 runs with different random seeds.

Effects of Projecting Order. To verify the effectiveness of our projecting order, we compare FedFV with another two cases: 1) projecting gradients in a random order of the projection targets; and 2) projecting gradients in an order of the projection targets that is the reverse of FedFV's, as shown in Table 4. For CIFAR-10, 20% of the clients are sampled in each communication round. For Fashion MNIST, all clients are selected in each communication round. We set α = 0 and τ = 0 for all groups to isolate the effect of the projecting order. Projecting gradients to the targets in the loss-based order of FedFV yields the lowest variance. Projecting gradients to the targets in a random order is also fairer than in the reverse order, which indicates that the loss-based order used by FedFV helps improve fairness.
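The effect of the projecting order can also be checked on toy gradients. In the sketch below (our own helper with made-up vectors, not the released code), the client made the last projection target ends up with no conflict against the final average, while the reverse order leaves it in conflict, matching the intuition of Theorem 1:

```python
import numpy as np

def projected_average(grads, target_order):
    """Project each gradient onto the normal plane of every ORIGINAL
    gradient it conflicts with, sweeping targets in `target_order`,
    then average the projected gradients."""
    out = []
    for k, g in enumerate(grads):
        g = np.array(g, dtype=float)
        for j in target_order:
            if j == k:
                continue
            t = grads[j]
            d = float(np.dot(g, t))
            if d < 0.0:  # conflict: remove the opposing component
                g = g - (d / float(np.dot(t, t))) * t
        out.append(g)
    return sum(out) / len(grads)

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up pairwise-conflicting gradients; suppose client 2 has the largest
# loss, so the loss-based order [0, 1, 2] makes it the LAST target.
grads = [np.array([4.0, 0.0]), np.array([-1.2, 1.5]), np.array([-0.8, -1.0])]
avg_lossbased = projected_average(grads, [0, 1, 2])
avg_reverse = projected_average(grads, [2, 1, 0])

# The loss-based order leaves the high-loss client un-conflicted with the
# averaged update, while the reverse order still harms it.
print(cos_sim(avg_lossbased, grads[2]) > 0.0)  # True
print(cos_sim(avg_reverse, grads[2]) < 0.0)    # True
```

This mirrors, in miniature, why the loss-based order in Table 4 gives the lowest variance: the clients with the largest losses are the ones the averaged update stops working against.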
6 Conclusions and Future Work

We identified the issue of conflicting gradients with large differences in their magnitudes, which brings unfairness to FL. To address this issue, we proposed the federated fair averaging (FedFV) algorithm to mitigate the potential conflicts among clients before averaging their gradients. In addition, we showed how well FedFV can mitigate the conflicts and how it converges to either a Pareto stationary solution or the optimum on convex problems. Extensive experiments on a suite of federated datasets confirm that FedFV compares favorably against state-of-the-art methods in terms of fairness, accuracy and efficiency. In the future, we plan to build a complete theoretical analysis of FedFV on mitigating complex external conflicts.

A.1 Proof of Theorem 1

Proof. For each gradient g_i ∈ G, we project g_i onto g_k's normal plane in increasing order of k. Thus, we have the update rules

g_i^{(0)} = g_i,  k = 0
g_i^{(k)} = g_i^{(k−1)} − (g_i^{(k−1)} · g_k / ||g_k||²) g_k,  k = 1, 2, ..., m, k ≠ i
g_i^{(k)} = g_i^{(k−1)},  k = i

Therefore, g_i^{(k)} never conflicts with g_k, since all potential conflicts have been eliminated by the update rules. We then focus on how much the final gradient ḡ' conflicts with each gradient in G. First, the last gradient in the projecting order is g_m, and the final gradient ḡ' does not conflict with it. Then, we focus on the second-to-last gradient g_{m−1}.
Following the update rules, we have

g_i^{(m)} = g_i^{(m−1)} − (g_i^{(m−1)} · g_m / ||g_m||²) g_m    (2)

Let φ_{i,j}^{(k)} denote the angle between g_i^{(k)} and g_j, and let φ_{i,j} = φ_{i,j}^{(0)}. Then the average of the projected gradients is

ḡ' = (1/m) ( Σ_{i≠m} ( g_i^{(m−1)} − (g_i^{(m−1)} · g_m / ||g_m||²) g_m ) + g_m^{(m−1)} )
   = (1/m) Σ_i g_i^{(m−1)} − (1/m) Σ_{i≠m} ||g_i^{(m−1)}|| cos φ_{i,m}^{(m−1)} · g_m/||g_m||    (3)

Since g_{m−1} · Σ_i g_i^{(m−1)} ≥ 0,

g_{m−1} · ḡ' ≥ −(1/m) g_{m−1} · Σ_{i≠m} ||g_i^{(m−1)}|| cos φ_{i,m}^{(m−1)} g_m/||g_m||
           ≥ −(1/m) Σ_{i≠m} ||g_i^{(m−1)}|| ||g_{m−1}|| |cos φ_{i,m}^{(m−1)}| |cos φ_{m−1,m}|
           ≥ −(ε²/m) ||g_{m−1}|| Σ_{i≠m} ||g_i^{(m−1)}||    (4)

Similarly, we can compute the conflict between any gradient g_k ∈ G and ḡ' by collecting the components projected out after step k:

|e_k · ḡ'| = |(g_k/||g_k||) · ḡ'| ≤ (ε²/m) Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||    (5)

Therefore, the later g_k serves as the projecting target of the others, the smaller the upper bound on the conflict between ḡ' and g_k, since Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}|| ≤ Σ_{j=k−1}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||.

A.2 Proof of Theorem 2

Proof. According to (5), the upper bound of the conflict between any client gradient g_k and the final average gradient ḡ' can be expressed as

|g_k · ḡ'| ≤ (ε_2²/m) ||g_k|| Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||    (6)

With the update rules of FedFV, we can infer that

||g_k^{(i)}||² = || g_k^{(i−1)} − ||g_k^{(i−1)}|| cos φ_{k,i}^{(i−1)} g_i/||g_i|| ||²
             = ||g_k^{(i−1)}||² − 2||g_k^{(i−1)}||² cos² φ_{k,i}^{(i−1)} + ||g_k^{(i−1)}||² cos² φ_{k,i}^{(i−1)}
             = (1 − cos² φ_{k,i}^{(i−1)}) ||g_k^{(i−1)}||²
             ≤ (1 − ε_1²) ||g_k^{(i−1)}||²    (7)

Therefore, the maximum value of the gradient conflict is bounded by

|g_k · ḡ'| ≤ (ε_2²/m) ||g_k|| Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||
          ≤ (ε_2²/m) (max_i ||g_i||) Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(0)}|| (1 − ε_1²)^{j/2}
          ≤ ((m−1)/m) ε_2² (max_i ||g_i||)² Σ_{j=k}^{m−1} (1 − ε_1²)^{j/2}
          ≤ ((m−1)/m) (max_i ||g_i||)² f(m, k, ε_1, ε_2)    (8)

where f(m, k, ε_1, ε_2) = ε_2² (1 − ε_1²)^{k/2} (1 − (1 − ε_1²)^{(m−k)/2}) / (1 − (1 − ε_1²)^{1/2}).

A.3 Proof of the Convergence

Proof of Theorem 3. Let g_1 and g_2 be the parameter updates of the two types of users in the t-th communication round.
When F(θ) is L-smooth, we have

F(θ_{t+1}) ≤ F(θ_t) + ∇F(θ_t)ᵀ(θ_{t+1} − θ_t) + (L/2)||θ_{t+1} − θ_t||²    (9)

There are two cases: conflicting gradients exist or not. If there is no conflict between g_1 and g_2, i.e., g_1 · g_2 ≥ 0, FedFV updates as FedAvg does, simply computing the average of the two gradients and subtracting it from the parameters to obtain the new parameters θ_{t+1} = θ_t − η ḡ, where ḡ = (1/2)(g_1 + g_2). Therefore, with step size η ≤ 1/L, it strictly decreases the objective function F(θ) [Li et al., 2020c]. However, if g_1 · g_2 < 0, FedFV computes the update as

θ_{t+1} = θ_t − η g̃ = θ_t − η ( g_1 + g_2 − (g_1·g_2/||g_1||²) g_1 − (g_1·g_2/||g_2||²) g_2 )    (10)

Combining with (9), we have

F(θ_{t+1}) ≤ F(θ_t) − η (g_1+g_2)ᵀ ( g_1 + g_2 − (g_1·g_2/||g_1||²) g_1 − (g_1·g_2/||g_2||²) g_2 ) + (L/2) η² || g_1 + g_2 − (g_1·g_2/||g_1||²) g_1 − (g_1·g_2/||g_2||²) g_2 ||²    (11)

Since g_1·g_2 = ||g_1|| ||g_2|| cos φ_{12}, where φ_{12} denotes the angle between g_1 and g_2, after rearranging the terms in this inequality we have

F(θ_{t+1}) ≤ F(θ_t) − (η − (L/2)η²)(1 − cos²φ_{12})(||g_1||² + ||g_2||²) − Lη²(1 − cos²φ_{12}) ||g_1|| ||g_2|| cos φ_{12}    (12)

To decrease the objective function, the inequality below should be satisfied:

(η − (L/2)η²)(1 − cos²φ_{12})(||g_1||² + ||g_2||²) + Lη²(1 − cos²φ_{12}) ||g_1|| ||g_2|| cos φ_{12} ≥ 0    (13)

Dividing by (1 − cos²φ_{12}) > 0, it suffices that

(η − (L/2)η²)(||g_1||² + ||g_2||²) + (L/2)η² · 2||g_1|| ||g_2|| cos φ_{12} ≥ 0    (14)
(η − Lη²)(||g_1||² + ||g_2||²) + (L/2)η² (||g_1||² + ||g_2||² + 2||g_1|| ||g_2|| cos φ_{12}) ≥ 0    (15)
(η − Lη²)(||g_1||² + ||g_2||²) + (L/2)η² ||g_1 + g_2||² ≥ 0    (16)

With η ≤ 1/L, we have η − Lη² ≥ 0, which guarantees the decrease of the objective function.
On the other hand, since $\eta \leq \frac{1}{L}$, we can infer that
$$\eta - \frac{L}{2}\eta^2 \geq \frac{\eta}{2} \quad \text{and} \quad L\eta^2 \leq \eta \quad (17)$$
Combining (17) with (12), we have
$$F(\theta_{t+1}) \leq F(\theta_t) - \frac{\eta}{2}(1 - \cos^2\varphi_{12})(\|g_1\|^2 + \|g_2\|^2) - \frac{\eta}{2}(1 - \cos^2\varphi_{12}) \cdot 2\|g_1\|\|g_2\|\cos\varphi_{12} = F(\theta_t) - \frac{\eta}{2}(1 - \cos^2\varphi_{12})(\|g_1\|^2 + \|g_2\|^2 + 2\|g_1\|\|g_2\|\cos\varphi_{12}) = F(\theta_t) - \frac{\eta}{2}(1 - \cos^2\varphi_{12})\|g_1 + g_2\|^2 = F(\theta_t) - 2\eta(1 - \cos^2\varphi_{12})\|\bar g\|^2 \quad (18)$$
Therefore, the objective function always decreases unless 1) $\|\bar g\| = 0$, which indicates it has reached the optimum $\theta^*$, or 2) $\cos\varphi_{12} = -1$; in the latter case we can write $g_1 = -\beta g_2$ with $\beta > 0$ and obtain
$$\frac{1}{1+\beta}\, g_1 + \frac{\beta}{1+\beta}\, g_2 = 0 \quad (19)$$
So $\theta_t$ is a Pareto-stationary point as defined below:

Definition 4. For smooth criteria $F_k(\theta)$ $(1 \leq k \leq K)$, $\theta^0$ is called Pareto-stationary iff there exists a convex combination of the gradients $\nabla F_k(\theta^0)$ that equals zero [Désidéri, 2012].

Proof of Theorem 4. With the assumption that $F(\theta)$ is $L$-smooth, we get the inequality
$$F(\theta_{t+1}) \leq F(\theta_t) + \nabla F(\theta_t)^T(\theta_{t+1} - \theta_t) + \frac{L}{2}\|\theta_{t+1} - \theta_t\|_2^2 \quad (20)$$
Similar to the proof of Theorem 3, if there exist conflicts, then
$$F(\theta_{t+1}) \leq F(\theta_t) - \eta\, \bar g \cdot g' + \frac{L}{2}\eta^2 \|g'\|_2^2 \leq F(\theta_t) - \frac{\eta}{2}\|\bar g\|\|g'\| + \frac{L}{2}\eta^2 \|g'\|_2^2 \leq F(\theta_t) - \frac{\eta}{2}\|\bar g\|\|g'\| + \frac{L}{2}\eta^2 \|\bar g\|\|g'\| = F(\theta_t) + \Big(\frac{L}{2}\eta^2 - \frac{\eta}{2}\Big)\|\bar g\|\|g'\|$$
Since $\frac{L}{2}\eta^2 - \frac{\eta}{2} \leq 0$ with $\eta \leq \frac{1}{L}$, the average objective will always decrease if we repeatedly apply the update rules of FedFV, unless 1) $\|\bar g\| = 0$, which indicates it has finally reached the optimum $\theta^*$, or 2) $\|g'\| = 0$, which indicates that all pairs of conflicting gradients have a cosine similarity of $-1$, implying the existence of a convex combination $p$ that satisfies $\sum_i p_i g_i = 0$: we can choose a pair of conflicting gradients $g_i$ and $g_j$ as the proof of Theorem 3 does and set the weights of the rest to zero.

References

[Abay et al., 2020] Annie Abay, Yi Zhou, Nathalie Baracaldo, Shashank Rajamoni, Ebube Chuba, and Heiko Ludwig. Mitigating bias in federated learning. arXiv e-prints, page arXiv:2012.02447, December 2020.
[Chaudhry et al., 2018] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny.
Efficient lifelong learning with A-GEM. CoRR, abs/1812.00420, 2018.
[Cho et al., 2020] Yae Jee Cho, Samarth Gupta, Gauri Joshi, and Osman Yağan. Bandit-based communication-efficient client selection strategies for federated learning. arXiv e-prints, page arXiv:2012.08009, 2020.
[Désidéri, 2012] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5):313–318, 2012.
[Farajtabar et al., 2019] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. CoRR, abs/1910.07104, 2019.
[Guo et al., 2019] Yunhui Guo, Mingrui Liu, Tianbao Yang, and Tajana Rosing. Learning with long-term remembering: Following the lead of mixed stochastic gradient. CoRR, abs/1909.11763, 2019.
[Hu et al., 2020] Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. FedMGDA+: Federated learning meets multi-objective optimization. arXiv e-prints, page arXiv:2006.11489, 2020.
[Huang et al., 2020] Wei Huang, Tianrui Li, Dexian Wang, Shengdong Du, and Junbo Zhang. Fairness and accuracy in federated learning. arXiv e-prints, page arXiv:2012.10069, 2020.
[Kairouz et al., 2019] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, et al. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.
[Krizhevsky, 2012] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Li et al., 2019] Tian Li, Maziar Sanjabi, and Virginia Smith. Fair resource allocation in federated learning. CoRR, abs/1905.10497, 2019.
[Li et al., 2020a] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Federated multi-task learning for competing constraints. arXiv e-prints, page arXiv:2012.04221, 2020.
[Li et al., 2020b] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv e-prints, page arXiv:1812.06127, 2020.
[Li et al., 2020c] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. arXiv e-prints, page arXiv:1907.02189, 2020.
[Lopez-Paz and Ranzato, 2017] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continuum learning. CoRR, abs/1706.08840, 2017.
[Lyu et al., 2019] Lingjuan Lyu, Jiangshan Yu, Karthik Nandakumar, Yitong Li, Xingjun Ma, and Jiong Jin. Towards fair and decentralized privacy-preserving deep learning with blockchain. CoRR, abs/1906.01167, 2019.
[Lyu et al., 2020] Lingjuan Lyu, Xinyi Xu, and Qian Wang. Collaborative fairness in federated learning. arXiv e-prints, page arXiv:2008.12161, 2020.
[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
[Mohri et al., 2019] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. CoRR, abs/1902.00146, 2019.
[Xiao et al., 2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
[Xu and Lyu, 2020] Xinyi Xu and Lingjuan Lyu. Towards building a robust and fair federated learning system. arXiv e-prints, page arXiv:2011.10464, 2020.
[Yu et al., 2020] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient Surgery for Multi-Task Learning. arXiv e-prints, page arXiv:2001.06782, January 2020.
[Zhao et al., 2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated Learning with Non-IID Data. arXiv e-prints, page arXiv:1806.00582, June 2018.