Federated Learning with Fair Averaging

Zheng Wang¹, Xiaoliang Fan¹ (corresponding author), Jianzhong Qi², Chenglu Wen¹, Cheng Wang¹ and Rongshan Yu¹
¹Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University, Xiamen, China
²School of Computing and Information Systems, University of Melbourne, Melbourne, Australia
zwang@stu.xmu.edu.cn, fanxiaoliang@xmu.edu.cn, jianzhong.qi@unimelb.edu.au, {clwen, cwang, rsyu}@xmu.edu.cn

Fairness has emerged as a critical problem in federated learning (FL). In this work, we identify a cause of unfairness in FL: conflicting gradients with large differences in their magnitudes. To address this issue, we propose the federated fair averaging (FedFV) algorithm to mitigate potential conflicts among clients before averaging their gradients. We first use the cosine similarity to detect gradient conflicts, and then iteratively eliminate such conflicts by modifying both the direction and the magnitude of the gradients. We further present the theoretical foundation of FedFV, showing that it mitigates conflicting gradients and converges to Pareto stationary solutions. Extensive experiments on a suite of federated datasets confirm that FedFV compares favorably against state-of-the-art methods in terms of fairness, accuracy and efficiency. The source code is available at https://github.com/WwZzz/easyFL.

1 Introduction

Federated learning (FL) has emerged as a new machine learning paradigm that aims to utilize clients' data to collaboratively train a global model while preserving data privacy [McMahan et al., 2017]. The global model is expected to perform better than any locally trained model, since it has much more data available for training. However, it is difficult for the global model to treat each client fairly [Kairouz et al., 2019; Li et al., 2019; Mohri et al., 2019].
For example, a global model for face recognition may work well for younger users (clients) but may suffer when used by more senior users, as the younger generation may use mobile devices more frequently and contribute more training data.

Techniques have been proposed to address the fairness issue in FL. AFL [Mohri et al., 2019] aims at good-intent fairness, which is to protect the worst-case performance on any client. This technique only works on a small network of dozens of clients because it directly treats each client as a domain, which may hurt generalizability. Li et al. [2019] seek to balance the overall performance and fairness using a fair resource allocation method. Hu et al. [2020] find a common degrade direction for all clients without sacrificing anyone's benefit, and demonstrate robustness to inflating-loss attacks. Li et al. [2020a] further explore a trade-off between a more general robustness and fairness, and they personalize each client's model differently.

Figure 1: When there are conflicting gradients (g_a · g_b < 0, g_a · g_c < 0, g_b · g_c < 0) and large differences in the gradient magnitudes, ||g_a|| > ||g_b|| > ||g_c||, the average of the original gradients will be dominated by g_a and will be far away from the global optimum θ*, resulting in unfairness to the dominated clients b and c. We argue that if the gradients are projected to mitigate the conflicts before averaging, the update direction ḡ will be corrected to be closer to the optimum and fairer.

Different from these studies, we observe that conflicting gradients with large differences in their magnitudes can fatally bring unfairness into FL. Since gradients are calculated locally by the clients, an update direction that suits some clients may hinder the model performance on other clients. The gradient conflicts are not detrimental on their own, since it has been shown that a simple average of gradients is valid for optimizing the global objective function [Li et al., 2020c].
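The situation in Figure 1 is easy to reproduce numerically. The sketch below (with made-up toy gradients, not taken from the paper's experiments) detects pairwise conflicts via dot products and shows that the plain average is dominated by the largest gradient, remaining in conflict with the two smaller clients:

```python
import numpy as np

# Made-up 2-D client gradients mirroring Figure 1: pairwise conflicting
# (all pairwise dot products are negative) with ||g_a|| > ||g_b|| > ||g_c||.
g_a = np.array([4.0, 0.0])
g_b = np.array([-1.2, 1.5])
g_c = np.array([-0.8, -1.0])

def conflicts(u, v):
    """Two updates conflict when their cosine similarity is negative,
    which has the same sign as their dot product."""
    return float(np.dot(u, v)) < 0.0

assert conflicts(g_a, g_b) and conflicts(g_a, g_c) and conflicts(g_b, g_c)

# The plain average is pulled toward the dominant client a: it still
# points against the updates of the smaller clients b and c.
g_avg = (g_a + g_b + g_c) / 3.0
print(np.dot(g_avg, g_a) > 0, np.dot(g_avg, g_b) < 0, np.dot(g_avg, g_c) < 0)
# prints: True True True
```

Mitigating such conflicts before averaging, as FedFV does, is precisely about correcting this dominated update direction.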
However, when the conflicting gradients have large differences in their magnitudes, model accuracy can suffer significant reductions for some of the clients, as shown in Figure 1. We highlight three characteristics of FL that make it critical to ensure fairness for the clients.

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Heterogeneous and imbalanced data. Different clients have their own data characteristics and hence potentially different data distributions. When the data is non-IID distributed, the divergence among the gradients of different clients gets even larger [Zhao et al., 2018]. These factors contribute to conflicting gradients in FL. Further, different clients may also have different dataset sizes. If the same local batch size is used across the clients, a client with more data will take more steps to train the local model, thus leading to large differences in gradient magnitudes. As a result, the clients selected in a communication round can largely conflict with each other, leading to an unfair update for some of them; we call such conflicts internal conflicts.

Party selection. In each communication round, the server only randomly selects a subset of clients to train the global model with their local datasets. There is no guarantee that the data distribution in each round is representative of the real population distribution. The data distribution can be highly imbalanced when the server repeatedly chooses a particular type of clients (e.g., due to sampling strategies based on computing power, latency, battery, etc.), and the average of gradients may become dominated by the repeatedly selected clients. Therefore, apart from the internal conflicts, we also identify external conflicts between the clients selected and those ignored by the server.

Client dropping out. Another source of external conflicts comes from network interruption.
When a client drops out due to network interruption (or other hardware issues), its gradient magnitude is effectively considered as 0, and an update to the global model may also be unfair to that client.

To address the issues above, we propose the federated fair averaging (FedFV) algorithm to mitigate the conflicts among clients before averaging their gradients. We first use the cosine similarity to detect gradient conflicts, and then iteratively eliminate such conflicts by modifying both the direction and the magnitude of the gradients. We further show how well FedFV can mitigate the conflicts and how it converges to either a Pareto stationary solution or the optimum on convex problems. Extensive experiments on a suite of federated datasets confirm that FedFV compares favorably against state-of-the-art methods in terms of fairness, accuracy and efficiency. The contributions of this work are summarized as follows:

- We identify two kinds of gradient conflicts (i.e., internal conflicts and external conflicts) that are major causes of unfairness in FL.
- We propose FedFV, which complements existing fair FL systems, and prove its ability to mitigate the two kinds of gradient conflicts and converge to Pareto stationary solutions.
- We perform extensive experiments on a suite of federated datasets to validate the competitiveness of FedFV against state-of-the-art methods in terms of fairness, accuracy and efficiency.

2 Preliminaries

FL aims to find a shared parameter θ that minimizes the weighted average loss of all clients [McMahan et al., 2017]:

min_θ F(θ) = Σ_{k=1}^{K} p_k F_k(θ)    (1)

where F_k(θ) denotes the local objective of the k-th client with weight p_k, p_k ≥ 0 and Σ_{k=1}^{K} p_k = 1. The local objective is usually defined by an empirical risk on a local dataset, i.e., F_k(θ) = (1/n_k) Σ_{ξ_i ∈ D_k} l(θ, ξ_i), where D_k denotes the local dataset of the k-th client, and n_k is the size of D_k. To optimize this objective, McMahan et al.
[2017] propose FedAvg, an iterative algorithm in which, in each round, the server randomly samples a subset S_t of m clients to train the global model. In the t-th communication round, there are internal conflicts if there exist g_i^t · g_j^t < 0 such that g_i^t conflicts with g_j^t, where g_i^t, g_j^t ∈ G_t = {g_1^t, g_2^t, ..., g_m^t} and G_t is the set of the selected clients' updates.

Algorithm 1 FedFV
Input: T, m, α, τ, η, θ_0, p_k, k = 1, ..., K
1: Initialize θ_0 and the gradient history GH = [].
2: for t = 0, 1, ..., T−1 do
3:   Server samples a subset S_t of m clients with probabilities p_k and sends the model θ_t to them.
4:   Server receives the update g_k^t and the training loss l_k^t from each client k ∈ S_t, where g_k^t ← θ_t − θ_k^t and θ_k^t is the model updated by client k. The server then updates the trace GH = [g_1^{t_1}, g_2^{t_2}, ..., g_K^{t_K}] of the latest updates of all clients.
5:   Server sorts the clients' updates into a projecting order list PO_t = [g_{i_1}^t, ..., g_{i_m}^t] based on their losses, where l_{i_j}^t ≤ l_{i_{j+1}}^t.
6:   g^t ← MitigateInternalConflict(PO_t, G_t, α)
7:   if t ≥ τ then
8:     g^t ← MitigateExternalConflict(g^t, GH, τ)
9:   end if
10:  g^t ← (g^t / ||g^t||) · ||(1/m) Σ_{k=1}^{m} g_k^t||
11:  Server updates the model θ_{t+1} ← θ_t − g^t
12: end for

Internal conflicts account for the unfairness among the selected clients. Consider learning a binary classifier: if clients with data of one class form the majority in a communication round and there exist conflicts between the gradients of the two classes, the global model will be updated to favor the clients of the majority class, sacrificing accuracy on the clients of the other class. Further, even when the proportion of clients of the two classes is balanced, the magnitudes of the gradients may still differ largely due to different dataset sizes on different clients. To address the internal conflicts, FedFV iteratively projects a client's gradient onto the normal plane of another client with a conflicting gradient.
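The projection step just described, together with the loss-ordered sweep FedFV builds on it, can be sketched as follows. This is a simplified NumPy sketch, not the authors' released implementation; the function names and the `losses`/`alpha` arguments are our own, and it assumes the αm highest-loss clients keep their original gradients:

```python
import numpy as np

def project_to_normal_plane(g, target):
    """If g conflicts with target (negative dot product), remove the
    conflicting component: g <- g - (g . target / ||target||^2) target.
    The result never conflicts with target."""
    d = float(np.dot(g, target))
    if d < 0.0:
        g = g - (d / float(np.dot(target, target))) * target
    return g

def mitigate_internal_conflicts(grads, losses, alpha=0.0):
    """Project every gradient (except those of the alpha*m highest-loss
    clients, which are kept as-is) onto the normal planes of the ORIGINAL
    gradients it conflicts with, sweeping the targets in ascending order
    of loss so that high-loss clients are the last targets, then average."""
    m = len(grads)
    order = sorted(range(m), key=lambda i: losses[i])  # low loss first
    n_keep = int(alpha * m)
    keep = set(order[-n_keep:]) if n_keep > 0 else set()
    projected = []
    for k in range(m):
        g = np.array(grads[k], dtype=float)
        if k not in keep:
            for j in order:
                if j != k:
                    g = project_to_normal_plane(g, grads[j])
        projected.append(g)
    return sum(projected) / m

# Two conflicting toy gradients; the higher-loss client (index 1) is the
# last projection target, so the result no longer points against it.
avg = mitigate_internal_conflicts(
    [np.array([1.0, 0.0]), np.array([-1.0, 0.5])], losses=[0.1, 0.9])
print(avg)  # components (0.1, 0.45)
```

Note that Algorithm 1 afterwards rescales the result back to the magnitude of the plain average (line 10), so only the direction is modified here.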
Instead of projecting in a random order, we design a loss-based order to decide which gradient should be the projecting target of the other gradients, based on Theorem 1. We show that the later a client's gradient serves as the projection target, the fewer conflicts it has with the final average gradient computed by FedFV. Algorithm 2 summarizes our steps to mitigate internal conflicts.

Theorem 1. Suppose there is a set of gradients G = {g_1, g_2, ..., g_m}, where g_i always conflicts with g_j^{(t_j)} before g_j^{(t_j)} is projected onto g_i's normal plane, and g_i^{(t_i)} is obtained by projecting g_i onto the normal planes of other gradients in G for t_i times. Assume |cos⟨g_i^{(t_i)}, g_j^{(t_j)}⟩| ≤ ε, 0 < ε ≤ 1, for each g_i ∈ G. As long as we iteratively project each g_i onto g_k's normal plane (skipping g_i itself) in ascending order of k, k = 1, 2, ..., m, the larger k is, the smaller the upper bound on the conflict between the average ḡ' = (1/m) Σ_{i=1}^{m} g_i^{(m)} and g_k.

Proof. See Appendix A.1.

Algorithm 2 MitigateInternalConflict
Input: PO_t, G_t, α
1: Server selects the set S_t^α of the first (1−α)m clients in PO_t (the remaining αm clients, which have the largest losses, keep their original gradients).
2: Server sets g_k^{PC} ← g_k^t for every k.
3: for each client k ∈ S_t^α in parallel do
4:   for each g_{i_j}^t ∈ PO_t, j = 1, ..., m do
5:     if g_k^{PC} · g_{i_j}^t < 0 and k ≠ i_j then
6:       Server computes g_k^{PC} ← g_k^{PC} − (g_k^{PC} · g_{i_j}^t / ||g_{i_j}^t||²) g_{i_j}^t
7:     end if
8:   end for
9: end for
10: g^t ← (1/m) Σ_{k=1}^{m} g_k^{PC}
11: return g^t

Since projecting gradients favors the gradient that serves as the projecting target later, we put the clients with larger training losses at the end of the projecting order list PO_t in the t-th communication round, to improve the model performance on the less well-trained data. We allow αm clients to keep their original gradients to further enhance fairness. We will detail α in Section 3.4.

3.3 Mitigate External Conflicts

Due to party selection and clients dropping out, there is a sample bias in each communication round in FL [Kairouz et al., 2019].
A client that is not selected in the t-th round can suffer a risk of being forgotten by the model whenever the combined update g^t conflicts with its imaginary gradient g_img^t. We can consider such clients to have a weight of zero in this round. However, we cannot directly detect conflicts for such clients because we have no access to their real gradients. To address this issue, we estimate their real gradients from their most recent gradients, and we call such gradient conflicts external conflicts:

Definition 3. In the t-th communication round, there are external conflicts if there is a client h ∉ S_t whose latest gradient g_h^{t_h} conflicts with the combined update g^t, i.e., g^t · g_h^{t_h} < 0.

Theorem 2. Under the setting of Theorem 1, assume ε_1 ≤ |cos⟨g_i^{(t_i)}, g_j^{(t_j)}⟩| ≤ ε_2 with 0 < ε_1 ≤ ε_2 ≤ 1. As long as we iteratively project each g_i onto g_k's normal plane (skipping g_i itself) in ascending order of k, k = 1, 2, ..., m, the maximum value of |g_k · ḡ'| is bounded by ((m−1)/m)(max_i ||g_i||)² f(m, k, ε_1, ε_2), where

f(m, k, ε_1, ε_2) = ε_2² (1 − ε_1²)^{k/2} (1 − (1 − ε_1²)^{(m−k)/2}) / (1 − (1 − ε_1²)^{1/2}).

Proof. See Appendix A.2.

According to Theorem 2, it is possible for FedFV to bound the maximum conflict for any gradient by choosing k such that f(m, k, ε_1, ε_2) < ε_2 for any given m ≥ 2, since the possible maximum value of the conflict between g_k and the original average gradient is max_k |g_k · ḡ| ≤ ((m−1)/m) ε_2 (max_k ||g_k||)². In practice, we mitigate all the conflicts that have been detected to limit the upper bound of the gradient conflicts of clients. We allow αm clients with large training losses to keep their original gradients to further enhance fairness. When α = 1, all clients keep their original gradients, so that FedFV reduces to FedAvg; when α = 0, all clients are forced to mitigate conflicts with the others. Parameter α thus controls the degree of conflict mitigation and can be tuned to seek a balance. Next, we show that FedFV finds a Pareto stationary point according to Theorems 3 and 4.

Theorem 3.
Assume that there are only two types of users whose objective functions are F_1(θ) and F_2(θ), and each objective function is differentiable, L-smooth and convex. For the average objective F(θ) = (1/2) Σ_i F_i(θ), FedFV with step size η ≤ 1/L converges to either 1) a Pareto stationary point, or 2) the optimum θ*.

Proof. See Appendix A.3.

Theorem 4. Assume that there are m objective functions F_1(θ), F_2(θ), ..., F_m(θ), and each objective function is differentiable, L-smooth and convex. If cos⟨g̃, ∇F(θ)⟩ ≥ 1/2 and ||g̃|| ≤ ||∇F(θ)||, where g̃ is the update computed by FedFV and ∇F(θ) is the true gradient of the average objective F(θ) = (1/m) Σ_i F_i(θ), then FedFV with step size η ≤ 1/L converges to either 1) a Pareto stationary point, or 2) the optimum θ*.

Proof. See Appendix A.3.

4 Related Work

4.1 Fairness in FL

Federated learning (FL) was first proposed by McMahan et al. [2017] to collaboratively train a global model in a distributed manner while preserving data privacy [Kairouz et al., 2019; Li et al., 2020b; McMahan et al., 2017]. A number of studies focus on collaborative fairness, where the server allocates different models to clients according to their contributions [Lyu et al., 2020; Lyu et al., 2019; Xu and Lyu, 2020]. A few other studies address the fairness of a uniform accuracy distribution across devices [Cho et al., 2020; Hu et al., 2020; Li et al., 2019; Li et al., 2020a; Mohri et al., 2019]. Mohri et al. [2019] and Li et al. [2019] propose the federated objectives AFL and q-FFL, respectively, to further improve fairness. Hu et al. [2020] observe the competing relation between fairness and robustness to inflating-loss attacks. Abay et al. [2020] analyze the potential causes of bias in FL that lead to unfairness, and also point out the negative impact of sample bias due to party selection. Cho et al. [2020] also show that client selection has an impact on fairness, and propose a bandit-based communication-efficient client selection strategy to overcome the bias. Huang et al.
[2020] reweight clients according to their accuracy and the number of times they have been selected, to achieve fairness, and use double momentum to accelerate convergence. Different from these works, we identify the conflicts among clients as a potential cause of unfairness in FL, and we mitigate such conflicts by computing a fairer average of gradients to achieve fair model performance across devices.

4.2 Gradient Projection

Gradient projection has been well studied in continual learning to mitigate the adverse impact of gradient updates on previously learned tasks [Chaudhry et al., 2018; Farajtabar et al., 2019; Guo et al., 2019; Lopez-Paz and Ranzato, 2017]. Lopez-Paz and Ranzato [2017] project gradients by solving a quadratic programming problem. Chaudhry et al. [2018] project the gradient onto the normal plane of the average gradient of previous tasks. Farajtabar et al. [2019] project the current task's gradients onto the orthonormal set of previous tasks' gradients. Yu et al. [2020] focus on adverse interference between task gradients when simultaneously learning multiple tasks. They iteratively project each task gradient onto the normal plane of conflicting gradients, which motivates our solution in this paper. To the best of our knowledge, we are the first to take adverse gradient interference into consideration in FL. Specifically, our proposed FedFV method builds a connection between fairness and conflicting gradients with large differences in their magnitudes, and we prove its ability to mitigate the two kinds of gradient conflicts (i.e., internal and external conflicts) and converge to Pareto stationary solutions.

| Method | Parameters |
|---|---|
| AFL | ηλ ∈ {0.01, 0.1, 0.5} |
| q-FedAvg | q ∈ {0.1, 0.2, 1, 2, 5, 15} |
| FedMGDA+ | ε ∈ {0.01, 0.05, 0.1, 0.5, 1} |
| FedFa | (α, β) ∈ {(0.5, 0.5)}, (γ_s, γ_c) ∈ {(0.5, 0.9)} |
| FedFV | α ∈ {0, 0.1, 0.2, 1/3}, τ ∈ {0, 1, 3, 10} |

Table 1: Method-specific hyper-parameters.
5 Experiments

5.1 Experimental Setup

Datasets and Models. We evaluate FedFV on three public datasets: CIFAR-10 [Krizhevsky, 2012], Fashion MNIST [Xiao et al., 2017] and MNIST [LeCun et al., 1998]. We follow [McMahan et al., 2017] to create non-IID datasets. For CIFAR-10, we sort all data records by class and then split them into 200 shards. We use 100 clients, and each client randomly picks 2 shards without replacement, so that each client has the same data size. We use 200 clients for MNIST and preprocess MNIST in the same way as CIFAR-10. Each local dataset is split into training and testing data with percentages of 80% and 20%. For Fashion MNIST, we simply follow the setting in [Li et al., 2019]. We use a feedforward neural network with 2 hidden layers on CIFAR-10 and Fashion MNIST, and a CNN with 2 convolution layers on MNIST.

Baselines. We compare with the classical method FedAvg [McMahan et al., 2017] and with FL systems that address fairness in FL, including AFL [Mohri et al., 2019], q-FedAvg [Li et al., 2019], FedFa [Huang et al., 2020] and FedMGDA+ [Hu et al., 2020]. We compare with q-FedAvg, FedFa and FedMGDA+ on all three datasets, and with AFL only on Fashion MNIST, because AFL is only suitable for small networks with dozens of clients.

Hyper-parameters. For all experiments, we fix the local epochs E = 1 and use batch size B ∈ {full, 64} for CIFAR-10 and MNIST and B ∈ {full, 400} for Fashion MNIST to run stochastic gradient descent (SGD) on the local datasets with step size η ∈ {0.01, 0.1}. We verify the different methods with the hyper-parameters listed in Table 1, and take the best performance of each method for the comparison.

Implementation. All experiments are run on a 64GB-memory Ubuntu 16.04.6 server with 40 Intel Xeon E5-2630 v4 CPUs @ 2.20GHz and 4 NVIDIA 2080Ti GPUs. All code is implemented in PyTorch 1.3.1. Please see https://github.com/WwZzz/easyFL for full details.
| Method | Ave. | Var. | Worst 5% | Best 5% |
|---|---|---|---|---|
| FedAvg | 46.85±0.65 | 12.57±1.50 | 19.84±6.55 | 69.28±1.17 |
| q-FedAvg (q=0.1) | 47.02±0.89 | 13.16±1.84 | 18.72±6.94 | 70.16±2.06 |
| q-FedAvg (q=0.2) | 46.91±0.90 | 13.09±1.84 | 18.88±7.00 | 70.16±2.10 |
| q-FedAvg (q=1.0) | 46.79±0.73 | 11.72±1.00 | 22.80±3.39 | 68.00±1.60 |
| q-FedAvg (q=2.0) | 46.36±0.38 | 10.85±0.76 | 24.64±2.17 | 66.80±2.02 |
| q-FedAvg (q=5.0) | 45.25±0.42 | 9.59±0.36 | 26.56±1.03 | 63.60±1.13 |
| FedFa | 46.43±0.56 | 12.79±1.54 | 19.28±6.78 | 69.36±1.40 |
| FedMGDA+ (ε=0.01) | 45.65±0.21 | 10.94±0.87 | 25.12±2.34 | 67.44±1.20 |
| FedMGDA+ (ε=0.05) | 45.58±0.21 | 10.98±0.81 | 25.12±1.87 | 67.76±2.27 |
| FedMGDA+ (ε=0.1) | 45.52±0.17 | 11.32±0.86 | 24.32±2.24 | 68.48±2.68 |
| FedMGDA+ (ε=0.5) | 45.34±0.21 | 11.63±0.69 | 24.00±1.93 | 68.64±3.11 |
| FedMGDA+ (ε=1.0) | 45.34±0.22 | 11.64±0.66 | 24.00±1.93 | 68.64±3.11 |
| FedFV (α=0.1, τ=0) | 48.57±0.76 | 10.97±1.02 | 28.32±2.01 | 69.76±2.45 |
| FedFV (α=0.2, τ=0) | 48.54±0.64 | 10.56±0.96 | 28.88±1.92 | 69.20±2.01 |
| FedFV (α=0.5, τ=0) | 48.14±0.38 | 10.60±0.76 | 28.72±1.92 | 68.80±1.77 |
| FedFV (α=0.1, τ=1) | 49.34±0.56 | 10.74±0.89 | 28.56±2.54 | 70.24±1.49 |
| FedFV (α=0.1, τ=3) | 50.00±0.74 | 10.85±1.14 | 28.24±3.03 | 70.96±1.00 |
| FedFV (α=0.1, τ=10) | 50.42±0.55 | 9.70±0.96 | 32.24±2.10 | 69.68±2.84 |

Table 2: The average, the variance, the worst 5% and the best 5% of the test accuracy of all clients on CIFAR-10. All experiments run over 2000 rounds with full batch size, learning rate η = 0.1 and local epochs E = 1. The reported results are averaged over 5 runs with different random seeds.

5.2 Experimental Results

Fairness. We first verify the advantage of FedFV in fairness. In Table 2, we list the mean, the variance, the worst 5% and the best 5% of test accuracy on 100 clients created by splitting CIFAR-10, where 10% of the clients are sampled in each communication round. While reaching a higher mean of test accuracy across clients, FedFV also yields the lowest variance among all experiments except for q-FedAvg (q=5.0). Note that q-FedAvg sacrifices the best clients' performance, from 69.28% to 63.60%, and causes a significant reduction of the mean accuracy from 46.85% to 45.25%.
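For reference, the per-client statistics reported in Table 2 can be computed from a vector of client accuracies as in the sketch below. The helper and its name are ours; we use the standard deviation as the dispersion measure, while the paper's "Var." column follows the authors' own convention:

```python
import numpy as np

def fairness_stats(acc, tail=0.05):
    """Mean, dispersion and tail statistics of per-client test accuracy,
    mirroring the columns of Table 2. `tail`=0.05 selects the worst/best
    5% of clients."""
    acc = np.sort(np.asarray(acc, dtype=float))
    k = max(1, int(round(tail * len(acc))))
    return {
        "mean": float(acc.mean()),
        "std": float(acc.std()),              # dispersion across clients
        "worst_5%": float(acc[:k].mean()),    # mean over the bottom 5%
        "best_5%": float(acc[-k:].mean()),    # mean over the top 5%
    }

# Made-up accuracies for 100 clients, only to exercise the helper.
rng = np.random.default_rng(0)
stats = fairness_stats(rng.uniform(20.0, 70.0, size=100))
print(stats)
```

A fairer method shrinks the gap between the "worst 5%" and "best 5%" entries while keeping the mean high.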
When reaching a variance similar to that of q-FedAvg (q=5.0) (9.70 versus 9.59), FedFV improves the mean accuracy to 50.42%. Further, the worst 5% performance of all the FedFV experiments is higher than that of any other method, which indicates FedFV's effectiveness in protecting the worst performance.

Table 3 lists the model's test accuracy on each client's data, along with the average and the variance of the accuracy, on the Fashion MNIST dataset. Different from the settings in Table 2, we use a small network with only 3 clients, and the server selects all the clients to participate in the training process in each communication round to eliminate the sample bias, which results in the absence of external conflicts. Therefore, we only use FedFV with α ∈ {0, 1/3, 2/3} and τ = 0 in this case. In our observation, even though FedFV with α = 0 can still find the solution with the highest mean test accuracy, 80.48%, it is

| Method | shirt | pullover | T-shirt | Ave. | Var. |
|---|---|---|---|---|---|
| FedAvg | 64.26±1.07 | 87.03±1.40 | 89.97±1.30 | 80.42±0.76 | 11.50±0.95 |
| AFL (ηλ=0.01) | 71.09±0.91 | 81.89±1.30 | 84.29±1.23 | 79.09±0.73 | 5.76±0.75 |
| AFL (ηλ=0.1) | 76.34±0.63 | 79.00±1.04 | 79.06±1.21 | 78.13±0.72 | 1.27±0.66 |
| AFL (ηλ=0.5) | 76.57±0.58 | 78.77±0.99 | 79.09±1.15 | 78.14±0.71 | 1.12±0.61 |
| q-FedAvg (q=5) | 71.29±0.85 | 81.46±1.08 | 82.86±0.97 | 78.53±0.64 | 5.16±0.69 |
| q-FedAvg (q=15) | 77.09±0.88 | 75.40±2.07 | 60.69±1.79 | 71.06±0.81 | 7.46±0.88 |
| FedFa | 63.80±1.10 | 86.86±1.28 | 89.97±1.30 | 80.21±0.73 | 11.68±0.93 |
| FedMGDA+ (ε=0.05) | 44.63±1.63 | 57.77±5.72 | 99.09±0.23 | 67.16±1.65 | 23.39±0.57 |
| FedMGDA+ (ε=0.1) | 72.26±3.12 | 79.71±4.47 | 86.03±4.27 | 79.33±1.11 | 6.45±2.21 |
| FedMGDA+ (ε=1.0) | 72.46±3.32 | 79.74±4.48 | 85.66±4.88 | 79.29±1.14 | 6.42±2.23 |
| FedFV (α=0, τ=0) | 61.06±1.21 | 89.31±1.38 | 91.06±1.29 | 80.48±0.68 | 13.76±1.02 |
| FedFV (α=1/3, τ=0) | 77.09±1.35 | 80.69±1.53 | 81.06±0.53 | 79.61±0.85 | 1.91±0.57 |
| FedFV (α=2/3, τ=0) | 77.91±0.80 | 81.46±1.99 | 81.46±1.51 | 80.28±1.09 | 1.77±0.87 |

Table 3: Test accuracy on the different clothing classes of the Fashion MNIST dataset.
All experiments run over 200 rounds with full batch size, learning rate η = 0.1 and local epochs E = 1. The reported results are averaged over 5 runs with different random seeds.

| | FedFV | FedFV (random order) | FedFV (reverse order) |
|---|---|---|---|
| Var. on CIFAR-10 | 13.19±1.06 | 14.12±0.55 | 16.28±0.72 |
| Var. on Fashion MNIST | 13.76±1.02 | 20.14±0.61 | 22.05±0.68 |

Table 4: The effects of the projecting order on fairness. The results are averaged over 5 runs with different random seeds.

an unfair solution, where the variance is as high as 13.76. Instead, FedFV with α = 2/3 also finds a solution with a high accuracy of 80.28% while simultaneously keeping the variance at a low level. We notice that AFL (ηλ=0.5) has the lowest variance. However, AFL only works for small networks where the number of clients is limited. Despite this, FedFV (α=2/3) outperforms AFL (ηλ=0.5) on all the clients (i.e., shirt, pullover and T-shirt), as well as on the mean accuracy. Further, when compared to the other methods except AFL, our method has the lowest variance, down to 1.77, which validates the ability of FedFV to find a fair and accurate solution in the absence of sample bias. From Tables 2 and 3, we observe that FedFV is able to ease the negative impact of conflicting gradients with largely different magnitudes, which results in a fairer and more accurate model with lower variance.

Accuracy and Efficiency. We further show that FedFV outperforms existing works addressing fairness in terms of accuracy and efficiency on CIFAR-10 and MNIST. All the methods are tuned to their best performance. As shown in Figure 2, when we only mitigate the internal conflicts, which is done by FedFV (τ=0), we already converge faster while keeping a lower variance than the others. In addition, when we mitigate both the internal and external conflicts by FedFV (τ>0), there is a substantial improvement in the accuracy and efficiency again.
FedFV outperforms state-of-the-art methods by up to 7.2% on CIFAR-10 and 78% on MNIST, which verifies the ability of FedFV to reach a higher accuracy with fewer communication rounds while still keeping a low variance. Thus, we demonstrate FedFV's advantages in saving communication cost and finding a better generalization.

Figure 2: The mean (left) and the variance (right) of test accuracy over all clients on (a) CIFAR-10 and (b) MNIST, comparing FedAvg, q-FedAvg, FedFa, FedMGDA+ and FedFV. The results are averaged over 5 runs with different random seeds.

Effects of Projecting Order. To verify the effectiveness of our projecting order, we compare FedFV with another two cases: 1) projecting gradients in a random order of the projection targets; and 2) projecting gradients in an order of the projection targets that is the reverse of FedFV's, as shown in Table 4. For CIFAR-10, 20% of the clients are sampled in each communication round. For Fashion MNIST, all clients are selected in each communication round. We set α = 0 and τ = 0 for all groups to isolate the effect of the projecting order. Projecting gradients to the targets in the loss-based order of FedFV yields the lowest variance. Projecting gradients to the targets in a random order is also fairer than in the reverse order, which indicates that the loss-based order used by FedFV helps improve fairness.
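The effect of the projecting order can also be checked on toy gradients. In the sketch below (our own helper with made-up vectors, not the released code), the client made the last projection target ends up with no conflict against the final average, while the reverse order leaves it in conflict, matching the intuition of Theorem 1:

```python
import numpy as np

def projected_average(grads, target_order):
    """Project each gradient onto the normal plane of every ORIGINAL
    gradient it conflicts with, sweeping targets in `target_order`,
    then average the projected gradients."""
    out = []
    for k, g in enumerate(grads):
        g = np.array(g, dtype=float)
        for j in target_order:
            if j == k:
                continue
            t = grads[j]
            d = float(np.dot(g, t))
            if d < 0.0:  # conflict: remove the opposing component
                g = g - (d / float(np.dot(t, t))) * t
        out.append(g)
    return sum(out) / len(grads)

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up pairwise-conflicting gradients; suppose client 2 has the largest
# loss, so the loss-based order [0, 1, 2] makes it the LAST target.
grads = [np.array([4.0, 0.0]), np.array([-1.2, 1.5]), np.array([-0.8, -1.0])]
avg_lossbased = projected_average(grads, [0, 1, 2])
avg_reverse = projected_average(grads, [2, 1, 0])

# The loss-based order leaves the high-loss client un-conflicted with the
# averaged update, while the reverse order still harms it.
print(cos_sim(avg_lossbased, grads[2]) > 0.0)  # True
print(cos_sim(avg_reverse, grads[2]) < 0.0)    # True
```

This mirrors, in miniature, why the loss-based order in Table 4 gives the lowest variance: the clients with the largest losses are the ones the averaged update stops working against.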
6 Conclusions and Future Work

We identified the issue of conflicting gradients with large differences in their magnitudes, which brings unfairness to FL. To address this issue, we proposed the federated fair averaging (FedFV) algorithm to mitigate the potential conflicts among clients before averaging their gradients. In addition, we showed how well FedFV can mitigate the conflicts and how it converges to either a Pareto stationary solution or the optimum on convex problems. Extensive experiments on a suite of federated datasets confirm that FedFV compares favorably against state-of-the-art methods in terms of fairness, accuracy and efficiency. In the future, we plan to build a complete theoretical analysis of FedFV on mitigating complex external conflicts.

A.1 Proof of Theorem 1

Proof. For each gradient g_i ∈ G, we project g_i onto g_k's normal plane in increasing order of k. Thus, we have the update rules

g_i^{(0)} = g_i,  k = 0
g_i^{(k)} = g_i^{(k−1)} − (g_i^{(k−1)} · g_k / ||g_k||²) g_k,  k = 1, 2, ..., m, k ≠ i
g_i^{(k)} = g_i^{(k−1)},  k = i

Therefore, g_i^{(k)} never conflicts with g_k, since all potential conflicts have been eliminated by the update rules. We then focus on how much the final gradient ḡ' conflicts with each gradient in G. First, the last gradient in the projecting order is g_m, and the final gradient ḡ' does not conflict with it. Then, we focus on the second-to-last gradient g_{m−1}.
Following the update rules, we have

g_i^{(m)} = g_i^{(m−1)} − (g_i^{(m−1)} · g_m / ||g_m||²) g_m    (2)

Let φ_{i,j}^{(k)} denote the angle between g_i^{(k)} and g_j, and let φ_{i,j} = φ_{i,j}^{(0)}. Then the average of the projected gradients is

ḡ' = (1/m) ( Σ_{i≠m} ( g_i^{(m−1)} − (g_i^{(m−1)} · g_m / ||g_m||²) g_m ) + g_m^{(m−1)} )
   = (1/m) Σ_i g_i^{(m−1)} − (1/m) Σ_{i≠m} ||g_i^{(m−1)}|| cos φ_{i,m}^{(m−1)} · g_m/||g_m||    (3)

Since g_{m−1} · Σ_i g_i^{(m−1)} ≥ 0,

g_{m−1} · ḡ' ≥ −(1/m) g_{m−1} · Σ_{i≠m} ||g_i^{(m−1)}|| cos φ_{i,m}^{(m−1)} g_m/||g_m||
           ≥ −(1/m) Σ_{i≠m} ||g_i^{(m−1)}|| ||g_{m−1}|| |cos φ_{i,m}^{(m−1)}| |cos φ_{m−1,m}|
           ≥ −(ε²/m) ||g_{m−1}|| Σ_{i≠m} ||g_i^{(m−1)}||    (4)

Similarly, we can compute the conflict between any gradient g_k ∈ G and ḡ' by collecting the components projected out after step k:

|e_k · ḡ'| = |(g_k/||g_k||) · ḡ'| ≤ (ε²/m) Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||    (5)

Therefore, the later g_k serves as the projecting target of the others, the smaller the upper bound on the conflict between ḡ' and g_k, since Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}|| ≤ Σ_{j=k−1}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||.

A.2 Proof of Theorem 2

Proof. According to (5), the upper bound of the conflict between any client gradient g_k and the final average gradient ḡ' can be expressed as

|g_k · ḡ'| ≤ (ε_2²/m) ||g_k|| Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||    (6)

With the update rules of FedFV, we can infer that

||g_k^{(i)}||² = || g_k^{(i−1)} − ||g_k^{(i−1)}|| cos φ_{k,i}^{(i−1)} g_i/||g_i|| ||²
             = ||g_k^{(i−1)}||² − 2||g_k^{(i−1)}||² cos² φ_{k,i}^{(i−1)} + ||g_k^{(i−1)}||² cos² φ_{k,i}^{(i−1)}
             = (1 − cos² φ_{k,i}^{(i−1)}) ||g_k^{(i−1)}||²
             ≤ (1 − ε_1²) ||g_k^{(i−1)}||²    (7)

Therefore, the maximum value of the gradient conflict is bounded by

|g_k · ḡ'| ≤ (ε_2²/m) ||g_k|| Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(j)}||
          ≤ (ε_2²/m) (max_i ||g_i||) Σ_{j=k}^{m−1} Σ_{i=j+1}^{m} ||g_i^{(0)}|| (1 − ε_1²)^{j/2}
          ≤ ((m−1)/m) ε_2² (max_i ||g_i||)² Σ_{j=k}^{m−1} (1 − ε_1²)^{j/2}
          ≤ ((m−1)/m) (max_i ||g_i||)² f(m, k, ε_1, ε_2)    (8)

where f(m, k, ε_1, ε_2) = ε_2² (1 − ε_1²)^{k/2} (1 − (1 − ε_1²)^{(m−k)/2}) / (1 − (1 − ε_1²)^{1/2}).

A.3 Proof of the Convergence

Proof of Theorem 3. Let g_1 and g_2 be the parameter updates of the two types of users in the t-th communication round.
When F(θ) is L-smooth, we have

F(θ_{t+1}) ≤ F(θ_t) + ∇F(θ_t)ᵀ(θ_{t+1} − θ_t) + (L/2)||θ_{t+1} − θ_t||²    (9)

There are two cases: conflicting gradients exist or not. If there is no conflict between g_1 and g_2, i.e., g_1 · g_2 ≥ 0, FedFV updates as FedAvg does, simply computing the average of the two gradients and subtracting it from the parameters to obtain the new parameters θ_{t+1} = θ_t − η ḡ, where ḡ = (1/2)(g_1 + g_2). Therefore, with step size η ≤ 1/L, it strictly decreases the objective function F(θ) [Li et al., 2020c]. However, if g_1 · g_2 < 0, FedFV computes the update as

θ_{t+1} = θ_t − η g̃ = θ_t − η ( g_1 + g_2 − (g_1·g_2/||g_1||²) g_1 − (g_1·g_2/||g_2||²) g_2 )    (10)

Combining with (9), we have

F(θ_{t+1}) ≤ F(θ_t) − η (g_1+g_2)ᵀ ( g_1 + g_2 − (g_1·g_2/||g_1||²) g_1 − (g_1·g_2/||g_2||²) g_2 ) + (L/2) η² || g_1 + g_2 − (g_1·g_2/||g_1||²) g_1 − (g_1·g_2/||g_2||²) g_2 ||²    (11)

Since g_1·g_2 = ||g_1|| ||g_2|| cos φ_{12}, where φ_{12} denotes the angle between g_1 and g_2, after rearranging the terms in this inequality we have

F(θ_{t+1}) ≤ F(θ_t) − (η − (L/2)η²)(1 − cos²φ_{12})(||g_1||² + ||g_2||²) − Lη²(1 − cos²φ_{12}) ||g_1|| ||g_2|| cos φ_{12}    (12)

To decrease the objective function, the inequality below should be satisfied:

(η − (L/2)η²)(1 − cos²φ_{12})(||g_1||² + ||g_2||²) + Lη²(1 − cos²φ_{12}) ||g_1|| ||g_2|| cos φ_{12} ≥ 0    (13)

Dividing by (1 − cos²φ_{12}) > 0, it suffices that

(η − (L/2)η²)(||g_1||² + ||g_2||²) + (L/2)η² · 2||g_1|| ||g_2|| cos φ_{12} ≥ 0    (14)
(η − Lη²)(||g_1||² + ||g_2||²) + (L/2)η² (||g_1||² + ||g_2||² + 2||g_1|| ||g_2|| cos φ_{12}) ≥ 0    (15)
(η − Lη²)(||g_1||² + ||g_2||²) + (L/2)η² ||g_1 + g_2||² ≥ 0    (16)

With η ≤ 1/L, we have η − Lη² ≥ 0, which guarantees the decrease of the objective function.
On the other hand, since $\eta \leq \frac{1}{L}$, we can infer that
$$\eta - \frac{L}{2}\eta^2 \geq \frac{\eta}{2} \quad \text{and} \quad L\eta^2 \leq \eta \quad (17)$$
Combining (17) with (12), we have
$$F(\theta_{t+1}) \leq F(\theta_t) - \frac{\eta}{2}(1 - \cos^2\varphi_{12})(\|g_1\|^2 + \|g_2\|^2) - \frac{\eta}{2}(1 - \cos^2\varphi_{12}) \cdot 2\|g_1\|\|g_2\|\cos\varphi_{12} = F(\theta_t) - \frac{\eta}{2}(1 - \cos^2\varphi_{12})(\|g_1\|^2 + \|g_2\|^2 + 2\|g_1\|\|g_2\|\cos\varphi_{12}) = F(\theta_t) - \frac{\eta}{2}(1 - \cos^2\varphi_{12})\|g_1 + g_2\|^2 = F(\theta_t) - 2\eta(1 - \cos^2\varphi_{12})\|\bar g\|^2 \quad (18)$$
Therefore, the objective function always decreases unless 1) $\|\bar g\| = 0$, which indicates it has reached the optimum $\theta^*$, or 2) $\cos\varphi_{12} = -1$; in the latter case we can write $g_1 = -\beta g_2$ with $\beta > 0$ and obtain
$$\frac{1}{1+\beta}\, g_1 + \frac{\beta}{1+\beta}\, g_2 = 0 \quad (19)$$
So $\theta_t$ is a Pareto-stationary point as defined below:

Definition 4. For smooth criteria $F_k(\theta)$ $(1 \leq k \leq K)$, $\theta^0$ is called Pareto-stationary iff there exists a convex combination of the gradients $\nabla F_k(\theta^0)$ that equals zero [Désidéri, 2012].

Proof of Theorem 4. With the assumption that $F(\theta)$ is $L$-smooth, we get the inequality
$$F(\theta_{t+1}) \leq F(\theta_t) + \nabla F(\theta_t)^T(\theta_{t+1} - \theta_t) + \frac{L}{2}\|\theta_{t+1} - \theta_t\|_2^2 \quad (20)$$
Similar to the proof of Theorem 3, if there exist conflicts, then
$$F(\theta_{t+1}) \leq F(\theta_t) - \eta\, \bar g \cdot g' + \frac{L}{2}\eta^2 \|g'\|_2^2 \leq F(\theta_t) - \frac{\eta}{2}\|\bar g\|\|g'\| + \frac{L}{2}\eta^2 \|g'\|_2^2 \leq F(\theta_t) - \frac{\eta}{2}\|\bar g\|\|g'\| + \frac{L}{2}\eta^2 \|\bar g\|\|g'\| = F(\theta_t) + \Big(\frac{L}{2}\eta^2 - \frac{\eta}{2}\Big)\|\bar g\|\|g'\|$$
Since $\frac{L}{2}\eta^2 - \frac{\eta}{2} \leq 0$ with $\eta \leq \frac{1}{L}$, the average objective will always decrease if we repeatedly apply the update rules of FedFV, unless 1) $\|\bar g\| = 0$, which indicates it has finally reached the optimum $\theta^*$, or 2) $\|g'\| = 0$, which indicates that all pairs of conflicting gradients have a cosine similarity of $-1$, implying the existence of a convex combination $p$ that satisfies $\sum_i p_i g_i = 0$: we can choose a pair of conflicting gradients $g_i$ and $g_j$ as the proof of Theorem 3 does and set the weights of the rest to zero.

References

[Abay et al., 2020] Annie Abay, Yi Zhou, Nathalie Baracaldo, Shashank Rajamoni, Ebube Chuba, and Heiko Ludwig. Mitigating bias in federated learning. arXiv e-prints, page arXiv:2012.02447, December 2020.
[Chaudhry et al., 2018] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny.
Efficient lifelong learning with A-GEM. CoRR, abs/1812.00420, 2018.
[Cho et al., 2020] Yae Jee Cho, Samarth Gupta, Gauri Joshi, and Osman Yağan. Bandit-based communication-efficient client selection strategies for federated learning. arXiv e-prints, page arXiv:2012.08009, 2020.
[Désidéri, 2012] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique, 350(5):313–318, 2012.
[Farajtabar et al., 2019] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. CoRR, abs/1910.07104, 2019.
[Guo et al., 2019] Yunhui Guo, Mingrui Liu, Tianbao Yang, and Tajana Rosing. Learning with long-term remembering: Following the lead of mixed stochastic gradient. CoRR, abs/1909.11763, 2019.
[Hu et al., 2020] Zeou Hu, Kiarash Shaloudegi, Guojun Zhang, and Yaoliang Yu. FedMGDA+: Federated learning meets multi-objective optimization. arXiv e-prints, page arXiv:2006.11489, 2020.
[Huang et al., 2020] Wei Huang, Tianrui Li, Dexian Wang, Shengdong Du, and Junbo Zhang. Fairness and accuracy in federated learning. arXiv e-prints, page arXiv:2012.10069, 2020.
[Kairouz et al., 2019] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, et al. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.
[Krizhevsky, 2012] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Li et al., 2019] Tian Li, Maziar Sanjabi, and Virginia Smith. Fair resource allocation in federated learning. CoRR, abs/1905.10497, 2019.
[Li et al., 2020a] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Federated multi-task learning for competing constraints. arXiv e-prints, page arXiv:2012.04221, 2020.
[Li et al., 2020b] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv e-prints, page arXiv:1812.06127, 2020.
[Li et al., 2020c] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. arXiv e-prints, page arXiv:1907.02189, 2020.
[Lopez-Paz and Ranzato, 2017] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continuum learning. CoRR, abs/1706.08840, 2017.
[Lyu et al., 2019] Lingjuan Lyu, Jiangshan Yu, Karthik Nandakumar, Yitong Li, Xingjun Ma, and Jiong Jin. Towards fair and decentralized privacy-preserving deep learning with blockchain. CoRR, abs/1906.01167, 2019.
[Lyu et al., 2020] Lingjuan Lyu, Xinyi Xu, and Qian Wang. Collaborative fairness in federated learning. arXiv e-prints, page arXiv:2008.12161, 2020.
[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
[Mohri et al., 2019] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. CoRR, abs/1902.00146, 2019.
[Xiao et al., 2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
[Xu and Lyu, 2020] Xinyi Xu and Lingjuan Lyu. Towards building a robust and fair federated learning system. arXiv e-prints, page arXiv:2011.10464, 2020.
[Yu et al., 2020] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient Surgery for Multi-Task Learning. arXiv e-prints, page arXiv:2001.06782, January 2020.
[Zhao et al., 2018] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated Learning with Non-IID Data. arXiv e-prints, page arXiv:1806.00582, June 2018.