# FedLF: Layer-Wise Fair Federated Learning

Zibin Pan¹ ², Chi Li¹ ⁴, Fangchen Yu¹, Shuyi Wang¹ ², Haijin Wang¹, Xiaoying Tang¹ ² ³, Junhua Zhao*¹ ²

1 The School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
2 The Shenzhen Institute of Artificial Intelligence and Robotics for Society
3 The Guangdong Provincial Key Laboratory of Future Networks of Intelligence
4 Shenzhen Research Institute of Big Data

zibinpan@link.cuhk.edu.cn, chili@link.cuhk.edu.cn, fangchenyu@link.cuhk.edu.cn, shuyiwang@link.cuhk.edu.cn, haijinwang@link.cuhk.edu.cn, tangxiaoying@cuhk.edu.cn, zhaojunhua@cuhk.edu.cn

*Xiaoying Tang and Junhua Zhao are corresponding authors.

## Abstract

Fairness has become an important concern in Federated Learning (FL). An unfair model that performs well for some clients while performing poorly for others can reduce clients' willingness to participate. In this work, we identify a direct cause of unfairness in FL: the use of an unfair direction to update the global model, one that favors some clients while conflicting with other clients' gradients at both the model and layer levels. To address these issues, we propose a layer-wise fair Federated Learning algorithm (FedLF). First, we formulate a multi-objective optimization problem with an effective fair-driven objective for FL. A layer-wise fair direction is then calculated to mitigate model- and layer-level gradient conflicts and to reduce the improvement bias. We further provide theoretical analysis of how FedLF improves fairness and guarantees convergence. Extensive experiments on different learning tasks and models demonstrate that FedLF outperforms state-of-the-art (SOTA) FL algorithms in terms of both accuracy and fairness. The source code is available at https://github.com/zibinpan/FedLF.

## 1 Introduction

Federated Learning (FL) is a popular machine learning paradigm that allows clients to collaboratively train a global model without sharing data. It is a promising methodology for addressing data-island and privacy issues (Yu et al. 2022), enabling clients to obtain a more generalized model than local learning. However, due to many factors such as data heterogeneity, intermittent client participation, and client dropout, the global FL model is prone to be unfair: it may favor some clients while performing poorly on others (Li et al. 2020b; Kairouz et al. 2021), which reduces their willingness to participate in FL.

Improving fairness in FL has drawn increased attention recently (Shi, Yu, and Leung 2021; Zhou et al. 2021; Pan et al. 2023). Mohri, Sivek, and Suresh (2019) proposed AFL, aiming to prevent the model from overfitting to certain clients at the expense of others. Recent studies have explored proactive ways to enhance fairness by mitigating model-level gradient conflicts among clients (Wang et al. 2021; Hu et al. 2022).

Figure 1: A demo of two clients. (a) shows the clients' gradients and candidate update directions. The direction $d^t$ obtained by previous algorithms conflicts with client 2's gradient; the other two directions shown are obtained by FedLF with and without the fair-driven objective. (b) depicts the objective vectors of the models updated by the different directions, where $F(\omega^t)$ is the vector of the clients' objectives at round $t$.
Pan et al. (2023) developed this approach and proposed FedMDFG, which calculates a fair-descent direction for the model update that does not conflict with any client's gradient at the model level, making progress in enhancing FL fairness.

In this work, we extend this proactive approach to improve fairness in FL. We observe that besides the model-level gradient conflict, there are also layer-level gradient conflicts that should be mitigated when determining a fair direction for the model update. In summary, there exist three primary challenges when computing a fair direction: model-level gradient conflicts, improvement bias, and layer-level gradient conflicts.

Challenge 1: Model-level gradient conflict. Due to the data heterogeneity of clients, clients' gradients easily conflict with each other (Wang et al. 2021): at round $t$, there may exist clients $i, j$ whose gradients $g_i^t$ and $g_j^t$ satisfy $g_i^{t\top} g_j^t < 0$. When updating the model, a simply aggregated direction $d^t$ will easily conflict with some clients' gradients, i.e., $d^{t\top} g_i^t > 0$, and thus reduce the model's performance on those clients and harm fairness (see Fig. 1).

Challenge 2: Improvement bias. Addressing the above challenge alone is not enough to enhance fairness. Even though a direction that does not conflict with any client's gradient improves the model's performance on every client, this improvement may be substantial for some clients but only marginal for others. Fig. 1 illustrates this issue, where $d^t$ is a common descent direction that reduces each client's local objective. However, the reduction in client 1's objective is much larger than that of client 2, resulting in a model that favors client 1 and thereby reducing fairness. A concrete numerical check of both challenges is sketched below.
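As an illustration of Challenges 1 and 2 (our own toy example, not code from the paper), the following NumPy sketch checks whether a simply averaged direction conflicts with a client's gradient, using the first-order estimate $F_i(\omega + \eta d) - F_i(\omega) \approx \eta \, d^\top g_i$:

```python
import numpy as np

# Toy gradients of two heterogeneous clients; they conflict since g1.g2 < 0.
g1 = np.array([1.0, 0.0])
g2 = np.array([-0.9, 0.1])
print("model-level conflict between clients:", float(g1 @ g2) < 0)

# Simple averaging (FedAvg-style) yields the update direction d = -(g1+g2)/2.
d = -(g1 + g2) / 2

for i, g in enumerate((g1, g2), start=1):
    # d conflicts with client i iff d.g_i > 0; by a first-order expansion,
    # a positive value means the update *increases* client i's loss
    # (Challenge 1), and unequal negative values mean biased improvement
    # (Challenge 2).
    print(f"client {i}: d.g_i = {float(d @ g):+.4f}")
```

Here the averaged direction descends for client 1 ($d^\top g_1 < 0$) but ascends for client 2 ($d^\top g_2 > 0$), i.e., the update trades client 2's performance for client 1's.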
To handle the above two challenges, FedMDFG applied the multiple gradient descent algorithm to a designed multi-objective optimization problem with a dynamic objective (see Eq. 5 in their paper). But this objective can drive the model out of the fairer area in some cases, and it therefore requires a hyperparameter to determine when the dynamic objective is activated, which makes convergence difficult to prove (FedMDFG only gave a convergence proof that does not account for this hyperparameter). We discuss this in more detail in Section 3.1. Differently, we propose a simpler but more effective fair-driven objective and give a theoretical analysis of the convergence and the convergence rate.

Challenge 3: Layer-level gradient conflict. In deep learning, different layers of a neural network can have different utilities (Yu et al. 2018; Ma et al. 2022; Lee, Zhang, and Avestimehr 2023). For instance, shallow layers primarily focus on local feature extraction, while deeper layers extract global features. Therefore, simply aggregating gradients, or mitigating gradient conflicts only at the model level, cannot prevent the obtained direction from favoring some clients at certain layers while drifting away from others, which reduces the layer utility for those clients.

Figure 2: A demo of three clients. (a) shows the model. (b) and (c) depict clients' gradients that conflict at layers $l$ and $k$; $g_{i,l}^t, g_{i,k}^t \in \mathbb{R}^3$ denote client $i$'s gradient fragments at layers $l$ and $k$. $d^t$ is a conflicting direction obtained by previous methods, whose fragments $d_l^t$ and $d_k^t$ in (b) and (c) conflict with clients 2 and 3 at these layers, i.e., $d_l^{t\top} g_{2,l}^t > 0$, $d_l^{t\top} g_{3,l}^t > 0$, $d_k^{t\top} g_{2,k}^t > 0$, and $d_k^{t\top} g_{3,k}^t > 0$. The direction obtained by FedLF instead has fragments lying in the yellow area, which depicts all feasible directions that do not conflict with the clients' gradients at these layers.

Fig. 2 shows a demo of three clients whose gradients conflict with each other at layers $l$ and $k$. The direction $d^t$ obtained by previous FL methods favors client 1 while conflicting with clients 2 and 3 at layers $k$ and $l$. This causes the new model's parameters at these two layers to favor client 1 while drifting away from the others. In the experimental results of Tables 2 and 3, we verify the existence of layer-level gradient conflicts and show that mitigating them helps improve FL fairness.

In this paper, we propose a layer-wise Fair Federated Learning (FedLF) algorithm that computes a layer-wise fair direction to enhance fairness. First, we formulate a multi-objective optimization problem for FL, incorporating an effective fair-driven objective. Then, we design a layer-wise multiple gradient descent algorithm (LMGDA) to obtain a direction that mitigates both model- and layer-level gradient conflicts. To the best of our knowledge, FedLF is the first algorithm capable of identifying a layer-wise fair direction that does not conflict with clients' gradients at any layer. Our contributions are summarized as follows:

- We identify the layer-level gradient conflict that impairs FL fairness. A layer-level conflicting direction drifts away from some clients and diminishes the model's layer utility for those clients.
- We formulate a multi-objective optimization problem with an effective fair-driven objective to improve the model's performance on each client and enhance fairness.
- We propose FedLF, which establishes a layer-wise fair direction to drive the FL model toward fairness. We theoretically show that it mitigates gradient conflicts among clients at both the model and layer levels and reduces the improvement bias.
- We conduct extensive experiments on multiple FL scenarios, validating that FedLF outperforms SOTA approaches in terms of accuracy and fairness.

## 2 Background & Related Work

### 2.1 Background of Federated Learning

Traditional FL (McMahan et al. 2017) allows clients to collaboratively train a global model $\omega \in \mathbb{R}^n$ with the goal of minimizing the weighted average objective of the clients:

$$\min_{\omega} \sum_{i=1}^{m} p_i F_i(\omega), \tag{1}$$

where $p_i \ge 0$, $\sum_{i=1}^{m} p_i = 1$, and $m$ is the number of clients. $F_i(\omega)$ is the local objective of client $i$, usually defined by a specific loss function such as the cross-entropy loss and computed across $N_i$ samples as $F_i(\omega) = \frac{1}{N_i} \sum_{j=1}^{N_i} F_{ij}(\omega)$, where $F_{ij}(\omega)$ is the loss on the $j$-th sample.

However, traditional FL easily suffers from performance degradation in heterogeneous settings, since a simply aggregated direction for the model update may conflict with the gradients of some clients (Wang et al. 2021) while favoring others, making the model more unfair. An alternative is to formulate FL as a multi-objective optimization problem (MOP) (Hu et al. 2022):

$$\min_{\omega} \left( F_1(\omega), F_2(\omega), \ldots, F_m(\omega) \right). \tag{2}$$
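To make the two formulations concrete, here is a minimal PyTorch-style sketch (ours; `model` and `client_loaders` are assumed placeholders, not artifacts from the paper) that evaluates the per-client objective vector of Problem (2) and collapses it into the weighted scalar objective of Problem (1):

```python
import torch
import torch.nn.functional as F_loss

def client_objectives(model, client_loaders):
    """Return the objective vector (F_1(w), ..., F_m(w)) of Problem (2)."""
    losses = []
    with torch.no_grad():
        for loader in client_loaders:  # one DataLoader per client
            total, count = 0.0, 0
            for x, y in loader:
                total += F_loss.cross_entropy(model(x), y, reduction="sum").item()
                count += y.numel()
            losses.append(total / count)  # F_i(w) = (1/N_i) * sum_j F_ij(w)
    return torch.tensor(losses)

def weighted_objective(losses, p):
    """Collapse the vector into Problem (1): sum_i p_i * F_i(w)."""
    return torch.dot(p, losses)
```

Problem (1) fixes a single trade-off via the weights $p_i$, whereas Problem (2) keeps all $m$ objectives and lets the optimizer seek directions that improve them jointly.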
One typical gradient-based method to solve Problem (2) is the Multiple Gradient Descent Algorithm (MGDA) (Désidéri 2012; Gebken, Peitz, and Dellnitz 2019). It updates the model at each round $t$ by $\omega^{t+1} = \omega^t + \eta^t d^t$ with a step size $\eta^t$ and a common descent direction $d^t$, which is computed by solving Problem (3) and satisfies $g_i^{t\top} d^t < 0, \forall i$, so that it drives down all local objectives. Here $g_i^t$ is the local gradient of the model $\omega$ on the data samples of client $i$, calculated by $g_i^t = \nabla F_i(\omega^t) = \frac{1}{N_i} \sum_{j=1}^{N_i} \nabla F_{ij}(\omega^t)$. MGDA stops when $\omega^t$ reaches Pareto stationarity (Désidéri 2012).

$$(d^t, \alpha^t) = \arg\min_{d^t \in \mathbb{R}^n, \, \alpha^t \in \mathbb{R}} \; \alpha^t + \frac{1}{2} \|d^t\|^2 \quad \text{s.t.} \quad g_i^{t\top} d^t \le \alpha^t, \; i = 1, \ldots, m. \tag{3}$$

Definition 1 (Pareto Stationarity): $\omega$ is called Pareto stationary iff $\exists \xi_i \ge 0$ with $\sum_i \xi_i = 1$ such that $\sum_i \xi_i g_i = 0$.

Pan et al. (2023) successfully applied MGDA to FL, proposing FedMDFG by adding a dynamic objective to Problem (3); it was the first to ensure a common descent direction that does not conflict with any client's gradient at the model level. However, the layer-level gradient conflict remains a significant challenge: it introduces a layer-wise bias that reduces a layer's capability on some clients.

### 2.2 Fair Federated Learning

Our work follows the definition of FL fairness summarized by Li et al. (2021): a model $\omega_1$ is fairer than $\omega_2$ if the standard deviation of $\omega_1$'s performance (e.g., test accuracy or local objective) across clients is smaller than that of $\omega_2$. A general goal of fair FL is to train a model with better average and more uniform performance across clients (Li et al. 2020b; Wang et al. 2021; Pan et al. 2023).

Increasing fairness has attracted great interest in recent years. (1) Some previous works adopt reweighting aggregation methods (Li et al. 2020b; Huang et al. 2020; Zhao and Joshi 2022a; Li et al. 2023). (2) Li et al. (2020a) reduce the update bias across clients to alleviate the negative impact of client drift in heterogeneous settings, which may enhance fairness to some extent. (3) The works of Huang et al. (2022) and Salazar et al. (2022) use momentum approaches. (4) Recently, FedFV (Wang et al. 2021), FedMGDA+ (Hu et al. 2022), and FedMDFG (Pan et al. 2023) explore a proactive approach to fair FL by mitigating clients' model-level gradient conflicts. Our method also mitigates this kind of conflict, but it additionally mitigates the layer-level gradient conflicts among clients.

### 2.3 Layer-wise Federated Learning

Several previous works designed layer-wise approaches for FL. For example, Lee, Zhang, and Avestimehr (2023) used layer-wise model aggregation to reduce the communication cost of FL; Son, Kim, and Chung (2022) regularized parts of the layers during FL training; and some works (Mei et al. 2021; Ma et al. 2022) employed layer-wise aggregation to build personalized models in personalized FL. In contrast to these methods, we design a novel method that calculates a layer-wise fair direction to enhance fairness in FL.

## 3 The Proposed Approach

Our proposed FedLF handles the challenges described in Section 1 by computing layer-wise fair directions that drive the FL model toward fairness. Algorithm 1 presents the steps of FedLF.

Algorithm 1: Layer-wise Fair Federated Learning (FedLF)
Input: initial model parameters $\omega^0$, learning rate $\eta$.
1: for $t = 0, 1, \ldots, T-1$ do
2:   $S^t \leftarrow$ the set of online clients.
3:   Broadcast $\omega^t$ to every client $i$ with $i \in S^t$, $i \notin S^{t-1}$.
4:   The server receives the gradient $g_i^t$ and the loss $F_i(\omega^t)$ from each client $i \in S^t$, where $g_i^t = (\omega^t - \omega_i^t)/\eta$ and $\omega_i^t$ is the model updated locally by client $i$.
5:   Calculate $g_P^t$ by Eq. (5).
6:   for each layer $l \in L$ in parallel do
7:     $d_l^t \leftarrow$ the direction fragment calculated by Eq. (7).
8:   end for
9:   while $\exists l$ such that $d_l^t = 0$ do
10:    $l \leftarrow$ combine layer $l$ with its later/previous layer.
11:    Recompute $d_l^t$.
12:  end while
13:  $d^t \leftarrow$ the concatenation of all layer-wise directions $d_l^t$.
14:  Stop if $d^t = 0$.
15:  Rescale $d^t$ by $d^t \leftarrow \frac{d^t}{\|d^t\|} \cdot \frac{1}{|S^t|} \sum_{i \in S^t} \|g_i^t\|$, and then broadcast it to all online clients.
16:  Both the server and the online clients update the model parameters by $\omega^{t+1} \leftarrow \omega^t + \eta d^t$.
17: end for
Output: model parameters $\omega^t$.

In each communication round $t$, after receiving the local gradient $g_i^t$ and the training loss $F_i(\omega^t)$ of each online client $i$ (Alg. 1, Line 4), the server computes the gradient $g_P^t$ of the fair-driven objective. Direction fragments $d_l^t$ for each layer $l$ are then calculated in parallel to produce the layer-wise fair direction $d^t$ (Alg. 1, Lines 6-13). Finally, after sending $d^t$ to the clients, the model is updated by $\omega^{t+1} \leftarrow \omega^t + \eta d^t$ on both the server and the online clients. A compact server-side sketch of this loop is given below.
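The following Python sketch mirrors the server side of Algorithm 1 under our own simplifications (flat parameter vectors, synchronous clients; all function and variable names are ours, not the paper's). It defers the per-layer subproblem to a `solve_layer_qp` helper, for which a sketch is given in Section 3.2:

```python
import numpy as np

def fedlf_round(w, client_grads, g_P, layer_slices, eta):
    """One FedLF server round (Algorithm 1, Lines 5-16), as a sketch.

    w:             flat model parameters, shape (n,)
    client_grads:  list of m flat client gradients g_i^t, each shape (n,)
    g_P:           gradient of the fair-driven objective, shape (n,)
    layer_slices:  contiguous, ordered slices, one per layer l
    """
    d = np.zeros_like(w)
    pending = list(layer_slices)
    while pending:
        sl = pending.pop(0)
        # Stack the layer-l fragments g_{i,l}^t and g_{P,l}^t as rows.
        G_l = np.stack([g[sl] for g in client_grads] + [g_P[sl]])
        d_l = -solve_layer_qp(G_l)  # d_l^t via Eqs. (7)-(8)
        if np.allclose(d_l, 0.0) and pending:
            # Lines 9-12: merge this layer with the next one and retry
            # (the paper also allows merging with the previous layer).
            nxt = pending.pop(0)
            pending.insert(0, slice(sl.start, nxt.stop))
            continue
        d[sl] = d_l
    if np.allclose(d, 0.0):
        return w, True  # Pareto stationary: stop (Line 14)
    # Line 15: rescale d to the average client-gradient norm.
    avg_norm = np.mean([np.linalg.norm(g) for g in client_grads])
    d = d / np.linalg.norm(d) * avg_norm
    return w + eta * d, False  # Line 16
```

Here `g_P` would be computed from Eq. (5); a sketch for that step follows in Section 3.1.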
### 3.1 Problem Formulation

We follow the idea of treating FL as a multi-objective optimization problem (Hu et al. 2022; Pan et al. 2023), as in Problem (2). However, this formulation alone cannot avoid the improvement bias among objectives. For example, consider two clients whose local objectives are $F_1(\omega^t) = 4$ and $F_2(\omega^t) = 5$. If after the model update $F_1(\omega^{t+1}) = 1$ but $F_2(\omega^{t+1}) = 4$, the disparity between the objectives has grown from 1 to 3, which decreases fairness.

Toward this end, FedMDFG (Pan et al. 2023) added a dynamic objective $\min_{\omega} F(\omega)^\top h^t$ to Problem (2) (see Eq. 5 in their paper), where $h^t$ denotes the opposite of the normalized projection of $\mathbf{1}$ onto the normal plane of $F(\omega^t)$; they called it a fair-driven objective. However, this objective does not directly drive the model toward fairness: it can push the model out of the fairer area when the local objectives are already quite close. To handle this, FedMDFG runs a step-size line search for a smaller step size, which brings extra communication costs. Moreover, a hyperparameter was added to control whether the dynamic objective is activated, which makes convergence hard to guarantee; FedMDFG proved convergence only without considering the effect of this hyperparameter.

Differently, we design a simpler but more effective fair-driven objective: $\min_{\omega} P(\omega) = -\cos(\mathbf{1}, F(\omega))$, which aims to increase the cosine similarity between the all-ones vector $\mathbf{1}$ and the objective vector $F(\omega) = (F_1(\omega), \ldots, F_m(\omega))$, and hence to reduce the differences among the objectives. The goal of FedLF is therefore to optimize the following MOP instead of Problem (2) to render $\omega$ Pareto stationary:

$$\min_{\omega} \left( F_1(\omega), F_2(\omega), \ldots, F_m(\omega), P(\omega) \right). \tag{4}$$

Compared with the dynamic objective proposed in FedMDFG, our fair-driven objective retains the formulation of a static optimization problem, and thus we can easily analyze the convergence and the convergence rate (see Section 3.4). Note that for a non-Pareto-stationary solution $\omega^t$ of Problem (2), there always exists a direction $d^t$ such that $g_i^{t\top} d^t < 0, \forall i \in \{1, \ldots, m\}$ and $g_P^{t\top} d^t < 0$, where $g_P^t$ is the gradient of the fair-driven objective given by Eq. (5) with $G^t = (g_1^t, \ldots, g_m^t) \in \mathbb{R}^{n \times m}$. This indicates that the added fair-driven objective does not affect the convergence of FL. We provide the detailed proof in Appendix A.1.

$$g_P^t = \frac{G^t \left( (F(\omega^t)^\top \mathbf{1}) \, F(\omega^t) - \|F(\omega^t)\|^2 \, \mathbf{1} \right)}{\sqrt{m} \, \|F(\omega^t)\|^3}. \tag{5}$$

In Fig. 1, we show that without the fair-driven objective, the obtained direction $d^t$ cannot prevent the objective vector of the updated model from moving away from $\mathbf{1}$, thereby decreasing fairness, even though $d^t$ is a common descent direction.
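Eq. (5) as reconstructed above follows from differentiating $P(\omega) = -\cos(\mathbf{1}, F(\omega))$ with the chain rule. The following NumPy sketch (ours) computes $g_P^t$ from the stacked client gradients:

```python
import numpy as np

def fair_driven_gradient(G, F_vals):
    """Gradient g_P of P(w) = -cos(1, F(w)), following Eq. (5).

    G:      (n, m) matrix whose columns are the client gradients g_1, ..., g_m
    F_vals: (m,) vector of local objectives F(w) = (F_1(w), ..., F_m(w))
    """
    m = F_vals.shape[0]
    ones = np.ones(m)
    norm_F = np.linalg.norm(F_vals)
    # ((F^T 1) F - ||F||^2 1) / (sqrt(m) ||F||^3), pushed through G
    inner = (F_vals @ ones) * F_vals - norm_F**2 * ones
    return G @ inner / (np.sqrt(m) * norm_F**3)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.normal(size=(10, 3))          # toy Jacobian, n = 10, m = 3
    F_vals = np.array([4.0, 5.0, 6.0])    # uneven objectives
    print(fair_driven_gradient(G, F_vals))
```

Since the bracketed term vanishes exactly when $F(\omega) \propto \mathbf{1}$, $g_P^t = 0$ once all clients' objectives are equal, which is the desired fair fixed point.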
### 3.2 Layer-wise Fair Direction

In FedLF, we solve Problem (4) by iterating $\omega^{t+1} = \omega^t + \eta^t d^t$, where $d^t$ is a layer-wise fair direction at round $t$. In this section, we discuss how to compute such a direction. We start by defining model-level gradient conflicts, following (Wang et al. 2021), and layer-level gradient conflicts.

Definition 2 (Model-level Gradient Conflict): The gradients of clients $i$ and $j$ conflict with each other iff $g_i^\top g_j < 0$.

Moreover, given that a deep learning model with multiple layers often exhibits a distinct utility at each layer (Ma et al. 2022), such as data feature extraction or classification, the heterogeneity of data across clients often leads to substantial variance in the gradient fragments at different layers. Consequently, these fragments easily come into conflict with one another.

Definition 3 (Layer-level Gradient Conflict): Let $n_l$ be the dimension of the model parameters at layer $l$, with $\sum_{l \in L} n_l = n$, where $L$ is the set of all layers. Client $i$'s gradient $g_i \in \mathbb{R}^n$ conflicts with client $j$'s gradient $g_j \in \mathbb{R}^n$ at layer $l$ iff $g_{i,l}^\top g_{j,l} < 0$, where $g_{i,l} \in \mathbb{R}^{n_l}$ is the fragment of client $i$'s gradient at layer $l$.

It is worth noting that model- and layer-level gradient conflicts also arise under partial client participation or client dropout; we discuss this further in Section 3.3.

Existing fair FL algorithms focus on model-level aggregation. They cannot prevent the direction fragment $d_l^t$, which updates the model parameters at layer $l$, from favoring some clients while conflicting with others at that layer, i.e., $d_l^{t\top} g_{i,l}^t > 0$ for some client $i$, where $g_{i,l}^t$ denotes the fragment of client $i$'s gradient $g_i^t$ at layer $l$ in round $t$. To mitigate layer-level gradient conflicts, we must ensure that the obtained direction $d^t$ satisfies $d_l^{t\top} g_{i,l}^t < 0$ for every client $i$ and layer $l$. To achieve this, we solve the following Problem (6) to obtain $d_l^t$ for each layer $l$, where $g_{P,l}^t$ is the fragment of $g_P^t$ at layer $l$, and then concatenate all $d_l^t$ to obtain $d^t$. Unlike the MGDA formulation (3), Problem (6) is solved per layer, and it contains the additional constraint $g_{P,l}^{t\top} d_l^t \le \alpha_l^t$, which forces the direction to also reduce the fair-driven objective:

$$(d_l^t, \alpha_l^t) = \arg\min_{d_l^t \in \mathbb{R}^{n_l}, \, \alpha_l^t \in \mathbb{R}} \; \alpha_l^t + \frac{1}{2} \|d_l^t\|^2 \quad \text{s.t.} \quad g_{i,l}^{t\top} d_l^t \le \alpha_l^t, \; i = 1, \ldots, m, \quad g_{P,l}^{t\top} d_l^t \le \alpha_l^t. \tag{6}$$

Scalable method to obtain $d_l^t$. Problem (6) itself does not scale well to high-dimensional decision spaces, because a single layer $l$ often contains millions of parameters, so solving Problem (6) directly at this scale would be extremely slow. Inspired by (Fliege and Svaiter 2000; Pan et al. 2023), we obtain $d_l^t$ another way. By the KKT conditions,

$$d_l^t = -\left( \sum_{i=1}^{m} \lambda_i g_{i,l}^t + \mu \, g_{P,l}^t \right), \quad \sum_{i=1}^{m} \lambda_i + \mu = 1, \tag{7}$$

where $\lambda_1, \ldots, \lambda_m, \mu \ge 0$ is the optimal solution of the dual problem of Problem (6):

$$\min_{\lambda, \mu} \; \frac{1}{2} \left\| \sum_{i=1}^{m} \lambda_i g_{i,l}^t + \mu \, g_{P,l}^t \right\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{m} \lambda_i + \mu = 1, \; \lambda_i, \mu \ge 0, \; i = 1, \ldots, m. \tag{8}$$

Therefore, we only need to solve a simple quadratic optimization problem (8) in $(m+1)$ dimensions, which is not time-consuming. We report the actual computation time of FedLF in Appendix B.3.
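Problem (8) is a minimum-norm problem over the probability simplex. A standard way to solve it (our choice for illustration; the paper does not specify a solver) is Frank-Wolfe with exact line search, as commonly used for MGDA-style min-norm subproblems. This also serves as the `solve_layer_qp` helper assumed in the round sketch above:

```python
import numpy as np

def solve_layer_qp(G_l, iters=100):
    """Solve the dual QP (8) for one layer by Frank-Wolfe and return the
    minimum-norm convex combination u = sum_i theta_i * G_l[i].

    G_l: (m+1, n_l) matrix whose rows are g_{1,l}, ..., g_{m,l}, g_{P,l}.
    The layer direction of Eq. (7) is then d_l = -u.
    """
    k = G_l.shape[0]
    theta = np.full(k, 1.0 / k)      # start at the simplex center
    GG = G_l @ G_l.T                 # (m+1, m+1) Gram matrix
    for _ in range(iters):
        grad = GG @ theta            # gradient of 0.5 * ||G_l^T theta||^2
        j = int(np.argmin(grad))     # best simplex vertex e_j
        u, v = G_l.T @ theta, G_l[j]
        diff = v - u
        denom = float(diff @ diff)
        if denom < 1e-12:
            break
        gamma = np.clip(-float(u @ diff) / denom, 0.0, 1.0)  # exact line search
        theta = (1 - gamma) * theta
        theta[j] += gamma
    return G_l.T @ theta             # u; the caller uses d_l = -u
```

Because the Gram matrix is only $(m+1) \times (m+1)$, the per-layer cost is dominated by a single pass over the layer's parameters, matching the scalability argument above.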
Combine layers. Inspired by (Gebken, Peitz, and Dellnitz 2019), the obtained $d_l^t$ satisfies:

1. If $\exists \xi \in \mathbb{R}^{m+1}_{\ge 0}$ with $\sum_i \xi_i = 1$ such that $\sum_{i=1}^{m} \xi_i g_{i,l}^t + \xi_{m+1} g_{P,l}^t = 0$, then $d_l^t = 0$.
2. Otherwise, $g_{i,l}^{t\top} d_l^t < 0$ for all $i = 1, \ldots, m$, and $g_{P,l}^{t\top} d_l^t < 0$.

If $d_l^t = 0$ holds for every layer $l \in L$, then $d^t = 0$ and the model $\omega^t$ has reached Pareto stationarity. However, if $d_l^t = 0$ holds only for part of the layers $l \in L$, the parameters in those layers stagnate and stop updating. This case can occur when $\mathrm{rank}(g_{1,l}^t, \ldots, g_{m,l}^t, g_{P,l}^t) < m$. When $n_l \gg m$, from a probabilistic point of view, $\mathrm{rank}(g_{1,l}^t, \ldots, g_{m,l}^t, g_{P,l}^t) = m$ holds with probability almost 1. But when some layers have only a few parameters, $n_l$ may fall below $m$, so the rank condition fails and $d_l^t = 0$ can occur; this is precisely why Algorithm 1 (Lines 9-12) combines such a layer with an adjacent layer and recomputes $d_l^t$.
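To see why small layers can stall and how combining layers resolves it, here is a toy check (ours, reusing the `solve_layer_qp` sketch above with client-gradient rows only):

```python
import numpy as np

# A 1-parameter layer (n_l = 1 < m = 2): opposite scalar gradients place 0 in
# the convex hull of the rows, so the min-norm point is 0 and d_l = 0.
G_tiny = np.array([[1.0], [-1.0]])
print(np.allclose(solve_layer_qp(G_tiny), 0.0))   # True -> layer stagnates

# Combining the layer with an adjacent one (concatenating the fragments)
# raises the dimension, and the merged direction is generally nonzero.
G_merged = np.array([[1.0, 0.3], [-1.0, 0.5]])
d_merged = -solve_layer_qp(G_merged)
print(d_merged, bool((G_merged @ d_merged < 0).all()))  # descent for both
```

In the merged two-dimensional layer, the recomputed fragment satisfies $g_{i,l}^\top d_l < 0$ for both clients, so the previously stagnant parameters resume updating.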