# FedDisco: Federated Learning with Discrepancy-Aware Collaboration

Rui Ye 1, Mingkai Xu 1, Jianyu Wang 2, Chenxin Xu 1, Siheng Chen 1 3, Yanfeng Wang 3 1

Abstract

This work considers the category distribution heterogeneity in federated learning (FL). This issue is due to biased labeling preferences at multiple clients and is a typical setting of data heterogeneity. To alleviate this issue, most previous works consider either regularizing local models or fine-tuning the global model, while they ignore the adjustment of aggregation weights and simply assign weights based on the dataset size. However, based on our empirical observations and theoretical analysis, we find that the dataset size is not optimal and that the discrepancy between local and global category distributions could be a beneficial and complementary indicator for determining aggregation weights. We thus propose a novel FL algorithm, Federated Learning with Discrepancy-Aware Collaboration (FedDisco), whose aggregation weights not only involve both the dataset size and the discrepancy value, but also contribute to a tighter theoretical upper bound of the optimization error. FedDisco can promote utility and modularity in a communication- and computation-efficient way. Extensive experiments show that our FedDisco outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance the performance. Our code will be available at https://github.com/MediaBrain-SJTU/FedDisco.

1 Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China; 2 Carnegie Mellon University, Pittsburgh, PA, USA; 3 Shanghai AI Laboratory, Shanghai, China. Correspondence to: Siheng Chen.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Figure 1. To mitigate the negative impact of category distribution heterogeneity, FedDisco considers discrepancy-aware aggregation weights, which involve both the dataset size and the discrepancy between the local category distribution and the uniform distribution. FedDisco remains privacy-preserving as well as communication- and computation-efficient.

1. Introduction

Federated learning (FL) is an emerging field that offers a privacy-preserving collaborative machine learning paradigm (Kairouz et al., 2019). Its main idea is to enable multiple clients to collaboratively train a shared global model by sharing information about model parameters. FL algorithms have been widely applied to many real-world scenarios, such as users' next-word prediction (Hard et al., 2018), recommendation systems (Ding et al., 2022), and smart healthcare (Kaissis et al., 2020). In spite of this promising trend, there are still multiple key challenges that impede the further development of FL, including data heterogeneity, communication cost and privacy concerns (Kairouz et al., 2019).

In this work, we focus on one critical issue in data heterogeneity: category distribution heterogeneity; that is, clients have drastically different category distributions. For example, some clients have many samples in category 1 but very few in category 2, while other clients are the opposite. This setting is common in practice, since clients collect local data privately and usually tend to have biased labeling preferences. Due to this heterogeneity, different local models are optimized towards different local objectives, causing divergent optimization directions.
It is thus difficult to aggregate these divergent local models to obtain a robust global model. This unpleasant phenomenon has been theoretically and empirically verified in (Zhao et al., 2018; Li et al., 2019; Wang et al., 2021).

To mitigate the negative impact of category distribution heterogeneity, there are mainly two approaches: i) local model adjustment at the client side and ii) global model adjustment at the server side. Most previous works consider the first approach. For example, FedProx (Li et al., 2020a) and FedDyn (Acar et al., 2020) introduce regularization terms, and MOON (Li et al., 2021) aligns intermediate outputs of local and global models to reduce the variance among optimized local models. Along the line of the second approach, FedDF (Lin et al., 2020) and FedFTG (Zhang et al., 2022) introduce an additional fine-tuning step to refine the global model. Despite all these diverse efforts, few works consider optimizing the aggregation weights at the server side.1 In fact, most previous works simply aggregate the local models according to the dataset size at each local client. However, the dataset size does not reflect any categorical information and thus cannot provide sufficient information about a local client. It is still unclear whether dataset-size-based aggregation is the best aggregation strategy.

1 FedNova (Wang et al., 2020b) targets system heterogeneity.

In this paper, we first answer the above open question by empirically showing that, in many cases, aggregating local models based on the local dataset size is far from optimal; meanwhile, incorporating the discrepancy between local and global category distributions into the aggregation weights could be beneficial. Intuitively, a lower discrepancy value reflects that the private data at a local client has a category distribution more similar to that of the hypothetically aggregated global data, and the corresponding client should contribute more to the global model than those with higher discrepancy values.

To understand the effect of category distribution heterogeneity, we next provide a theoretical analysis for federated averaging with arbitrary aggregation weights, dataset sizes and local discrepancy levels. By minimizing the reformulated upper bound of the optimization error, we obtain optimized aggregation weights for the clients, which depend not only on the local dataset sizes but also on the local discrepancy levels.

Inspired by the above empirical and theoretical observations, we propose a novel model aggregation method, federated learning with discrepancy-aware collaboration (FedDisco), to alleviate the category distribution heterogeneity issue. This new method leverages each client's dataset size and discrepancy when determining its aggregation weight, assigning larger aggregation weights to clients with larger dataset sizes and smaller discrepancy values. As a novel model aggregation scheme, FedDisco remains privacy-preserving and can be easily incorporated with many FL methods, such as FedProx (Li et al., 2020a) and MOON (Li et al., 2021). Compared to the vanilla aggregation based on local dataset size, FedDisco introduces nearly no additional computation and communication costs.

To validate the effectiveness of our proposed FedDisco, we conduct extensive experiments under diverse scenarios, such as various heterogeneous settings, globally balanced and
imbalanced distributions, full and partial client participation, and across four datasets. We observe that FedDisco can significantly and consistently improve the performance over existing FL algorithms. The main contributions of this paper are listed as follows:

1. We empirically show that the dataset size is not optimal for determining aggregation weights in FL and that the discrepancy between local and global category distributions could be a beneficial complementary indicator;
2. We theoretically show that aggregation weights correlated with both the dataset size and the discrepancy could contribute to a tighter error bound;
3. We propose a novel aggregation method, called FedDisco (federated learning with discrepancy-aware collaboration), whose aggregation weight follows from the optimization of the theoretical error bound;
4. Extensive experiments show that FedDisco achieves state-of-the-art performance under diverse scenarios.

2. Background of Federated Learning

Suppose there are in total K clients, where the k-th client holds a private dataset $B_k = \{(x_i, y_i) \mid i = 1, 2, \ldots, |B_k|\}$ and a local model $w^{(t,r)}_k$, where $x_i$ and $y_i$ denote the i-th input and label, and t and r denote the round and iteration indices of training. The local category distribution of the k-th client's dataset is defined as $D_k \in \mathbb{R}^C$, where the c-th element $D_{k,c} = |\{(x_i, y_i) \in B_k \mid y_i = c\}| / |B_k|$, $c \in \{1, 2, \ldots, C\}$, is the data proportion of the c-th category. See detailed notation descriptions in Table 8.

Mathematically, the global objective of federated learning is $\min_w F(w) = \sum_{k=1}^{K} n_k F_k(w)$, where $n_k = |B_k| / \sum_{i=1}^{K} |B_i|$ and $F_k(w)$ are the relative dataset size and the local objective of client k, respectively. In the basic FL algorithm, FedAvg (McMahan et al., 2017), each training round t proceeds as follows:

1. Server broadcasts the global model $w^{(t,0)}$ to the clients;
2. Each client k performs local model training using τ SGD steps to obtain a trained model denoted by $w^{(t,\tau)}_k$;
3. Clients upload the local models to the server;
4. Server updates the global model by aggregating the local models: $w^{(t+1,0)} = \sum_{k=1}^{K} p_k w^{(t,\tau)}_k$, where $p_k$ is the aggregation weight of client k.

As mentioned in the introduction, the category distribution heterogeneity issue can cause different local objectives in local training. To address this, many previous works focus on adjusting the training of $w^{(t,\tau)}_k$, while neglecting the adjustment of the aggregation weight $p_k$. In fact, most previous works conduct model aggregation simply based on the local relative dataset size; that is, $p_k = n_k$. However, our empirical investigation shows that dataset-size-based weighted aggregation may not be optimal, which motivates us to search for a better aggregation strategy. The empirical observations are delivered in the next section.
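For concreteness, the dataset-size-based aggregation in Step 4 above ($p_k = n_k$) can be sketched as follows. This is a minimal illustration assuming clients return PyTorch state_dicts; the function and variable names are ours and not taken from any released implementation.

```python
import torch

def fedavg_aggregate(local_states, dataset_sizes):
    """Server-side FedAvg update: w^{(t+1,0)} = sum_k p_k * w_k^{(t,tau)} with p_k = n_k."""
    total = float(sum(dataset_sizes))
    weights = [size / total for size in dataset_sizes]  # p_k = n_k = |B_k| / sum_i |B_i|
    global_state = {}
    for key in local_states[0]:
        # Weighted average of this parameter tensor over all clients.
        global_state[key] = torch.stack(
            [w * state[key].float() for w, state in zip(weights, local_states)]
        ).sum(dim=0)
    return global_state
```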
3. Empirical Observations

This section presents a series of empirical observations, which motivate us to consider a new form of aggregation weight and propose our method.

In the experiments, we divide the CIFAR-10 (Krizhevsky et al., 2009) dataset into 10 clients, whose local category distributions are heterogeneous and follow the commonly used Dirichlet distribution (Wang et al., 2020a); see Figure 2(a). Based on this common setting, we conduct the following experiments.

Figure 2. Experiments for empirical observations: (a) data distributions; (b) Exploration 1: aggregation; (c) Exploration 2: dataset size; (d) Exploration 2: discrepancy. Exploration 1 shows that the dataset size may not be an optimal indicator for the aggregation weight. Exploration 2 shows that the discrepancy could be a beneficial complementary indicator.

To explore the relationship between the aggregation weight and the performance of the aggregated global model, we conduct three independent trials, where each trial considers federated training of two clients with similar dataset sizes. In each trial, each client trains a model (denoted as A and B) for 10 epochs and we aggregate these two models following $w_{agg} = (1 - r) w_A + r w_B$. Figure 2(b) shows three curves, which are the testing results of the aggregated model $w_{agg}$ as a function of the aggregation weight r in the three trials, respectively. Note that r = 0 corresponds to Model A, r = 1 corresponds to Model B, and r = 0.5 represents dataset-size-based weighted aggregation. We observe that i) it is not optimal to determine aggregation weights purely based on the local dataset size: the performance at r = 0.5 can be far from the optimal performance (stars in the plot); and ii) the best performance is achieved when assigning a relatively larger weight r to the better-performing local model. In these trials, Model B outperforms Model A, and the best performance is achieved when Model B has a larger weight (r is around 0.7 to 0.9). Similar phenomena can be seen in ensemble learning (Jiménez, 1998; Shen & Kong, 2004), which assigns larger weights to better models in the ensemble.

To search for indicators that can reflect a local model's performance, and eventually an appropriate aggregation weight, we explore the effects of two informative properties of a client in the category heterogeneity scenario: the dataset size and the discrepancy between the local and global category distributions. Here we use the ℓ2 distance to measure the discrepancy. We plot all 10 clients in Figure 2(c) & (d), where the y-axis denotes the local model's testing accuracy and the x-axis in (c) and (d) denotes the client's relative dataset size and discrepancy level, respectively. We highlight four clients with similar dataset sizes in a circle. Comparing these two plots, we clearly see that the discrepancy level is a better indicator of the local model's performance than the dataset size. In plot (c), these clients have similar dataset sizes but largely different performances; in plot (d), the discrepancy level clearly reflects the performance difference among these four clients, that is, the client with a smaller discrepancy performs better.

Based on these observations, we hypothesize that, in FL, when a client has a smaller discrepancy value, it is likely to have a better-performing model and thus needs a larger weight in aggregation. Motivated by this, we propose to leverage the discrepancy in determining the aggregation weight of each client, which is theoretically analyzed in the following section.
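For illustration, the discrepancy indicator used in Exploration 2 can be computed as below. This is a minimal sketch assuming integer class labels and a uniform global distribution; the helper names are ours.

```python
import numpy as np

def category_distribution(labels, num_classes):
    """D_k: proportion of each category among a client's local labels."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return counts / counts.sum()

def l2_discrepancy(local_dist, target_dist=None):
    """d_k = ||D_k - T||_2; the target T defaults to the uniform distribution."""
    local_dist = np.asarray(local_dist, dtype=float)
    if target_dist is None:
        target_dist = np.full_like(local_dist, 1.0 / len(local_dist))
    return float(np.linalg.norm(local_dist - target_dist))

# A client holding mostly category-0 samples has a much larger discrepancy than
# a client whose labels are spread evenly over the 10 categories.
d_skewed = l2_discrepancy(category_distribution([0] * 90 + [1] * 10, 10))   # ~0.85
d_spread = l2_discrepancy(category_distribution(list(range(10)) * 10, 10))  # 0.0
```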
4. Theoretical Analysis

In this section, we first obtain a convergence error bound for FedAvg (McMahan et al., 2017), which highlights the effects of the aggregation weight $p_k$, the dataset size $n_k$ and the discrepancy level $d_k$. Then, a concise analytical expression of the optimized aggregation weight is derived by minimizing the error bound, which indicates that the aggregation weight should depend on both the dataset size and the local discrepancy level.

Optimization error upper bound. Our analysis is based on the following four standard assumptions in FL.

Assumption 4.1 (Smoothness). Each local objective $F_k(w)$ is Lipschitz-smooth: $\|\nabla F_k(x) - \nabla F_k(y)\| \le L \|x - y\|$ for some constant L.

Assumption 4.2 (Bounded Scalar). The global objective function F(w) is bounded below by $F_{inf}$.

Assumption 4.3 (Unbiased Gradient and Bounded Variance). For each client, the stochastic gradient is unbiased, $\mathbb{E}_\xi[g_k(w|\xi)] = \nabla F_k(w)$, and has bounded variance, $\mathbb{E}_\xi[\|g_k(w|\xi) - \nabla F_k(w)\|^2] \le \sigma^2$.

Assumption 4.4 (Bounded Dissimilarity). For each loss function $F_k(w)$, there exists a constant B > 0 such that $\|\nabla F_k(w)\|^2 \le \|\nabla F(w)\|^2 + B d_k$.

All assumptions are commonly used in the federated learning literature (Wang et al., 2020b; Li et al., 2020a; 2019; Reddi et al., 2021). Since no previous work has considered the discrepancy in its theory, an assumption that involves the discrepancy is required. Rather than introducing an additional assumption, we slightly adjust the standard dissimilarity assumption (Wang et al., 2020b) to obtain Assumption 4.4, which includes the discrepancy level. This modification correlates the gradient dissimilarity with the distribution discrepancy and enables us to explore the relationships among the aggregation weight $p_k$, the relative dataset size $n_k$ and the discrepancy $d_k$. We present the optimization error bound in Theorem 4.5.

Theorem 4.5 (Optimization bound of the global objective function). Let $F(w) = \sum_{k=1}^{K} n_k F_k(w)$ be the global objective. Under the above assumptions, if we set $\eta L \le \frac{1}{2\tau}$, the optimization error is bounded as

$$\min_t \mathbb{E}\|\nabla F(w^{(t,0)})\|^2 \;\le\; \frac{T_1 + T_2 + T_3 + T_4 + T_5}{T_0},$$

where

$$T_0 = 1 - 3A - W_D(1-A), \quad T_1 = \frac{2(1-A)\big[F(w^{(0,0)}) - F_{inf}\big]}{\tau \eta T}, \quad T_2 = (1-A)\,W_D\,B,$$
$$T_3 = 2(1-A) L \eta \sigma^2 \sum_{k=1}^{K} p_k^2, \quad T_4 = 2(\tau-1)\sigma^2 L^2 \eta^2,$$

and $T_5$ is a term that increases with $\sum_{k=1}^{K} p_k d_k$ (its exact form is given in Appendix C.2). Here $W_D = 2\sum_{k=1}^{K}(n_k - p_k)^2$, $A = 2\tau(\tau-1)\eta^2 L^2$, $\eta$ is the local learning rate, and B is the constant in Assumption 4.4.

Proof. See details in Appendix C.2.

Generally, a tighter bound corresponds to a better optimization result. Thus, we explore the effect of $p_k$ on the upper bound. In Theorem 4.5, four parts depend on $p_k$. First, a larger difference between $p_k$ and $n_k$ leads to a larger $W_D$, and thus a smaller denominator $T_0$ and a larger $T_2$, which tends to loosen the bound. As for $T_5$, when clients have different discrepancy levels, i.e., different $d_k$, setting $p_k$ negatively correlated with $d_k$ further reduces $T_5$, which tends to tighten the bound. Therefore, there could be an optimal set $\{p_k \mid k \in [K]\}$ that yields the tightest bound, where an optimal $p_k$ should be correlated with both $n_k$ and $d_k$. This theoretically shows that the dataset size may not be the optimal indicator and that the discrepancy level can be a reasonable complementary indicator for determining aggregation weights.

Upper bound minimization. In the following, we derive an analytical expression for the aggregation weight $p_k$ by minimizing the upper bound in Theorem 4.5. Directly solving this minimization yields a complicated expression that involves too many unknown hyper-parameters in practice. To simplify the expression, we convert the original objective from minimizing $(T_1 + T_2 + T_3 + T_4 + T_5)/T_0$ to minimizing $T_1 + T_2 + T_3 + T_4 + T_5 - \lambda T_0$, where λ is a hyper-parameter. The converted objective still promotes maximizing $T_0$ and minimizing $T_1 + T_2 + T_3 + T_4 + T_5$, and thus still contributes to tightening the bound $(T_1 + T_2 + T_3 + T_4 + T_5)/T_0$ (see also the analysis in Section 7.3).
Then, our discrepancy-aware aggregation weight is obtained by solving the following optimization problem:

$$\min_{\{p_k\}} \; T_1 + T_2 + T_3 + T_4 + T_5 - \lambda T_0, \qquad \text{s.t.} \;\; \sum_k p_k = 1, \;\; p_k \ge 0, \tag{1}$$

with the terms defined as in Theorem 4.5, from which we derive a concise expression for the optimized aggregation weight (see the details of the derivation in Appendix C.5):

$$p_k \;\propto\; n_k - a \cdot d_k + b, \tag{2}$$

where a and b are two constants. This suggests that, for a tighter upper bound, the aggregation weight $p_k$ should be correlated with both the dataset size $n_k$ and the local discrepancy level $d_k$. This Disco aggregation weight mitigates the limitation of standard dataset-size-based weighted aggregation by assigning larger aggregation weights to clients with larger dataset sizes and smaller discrepancy levels.

5. FL with Discrepancy-Aware Collaboration

Motivated by these empirical and theoretical observations, we propose federated learning with discrepancy-aware collaboration (FedDisco).

Figure 3. The overview of federated learning with discrepancy-aware collaboration. The left shows the acquisition of the clients' discrepancies by comparing the global and local category distributions, which is only required at the first round. The right shows the federated learning process with discrepancy-aware collaboration, which adjusts each client's aggregation weight based on its dataset size and discrepancy level.

5.1. Framework

FedDisco involves four key steps: local model training, local discrepancy computation, discrepancy-aware aggregation weight determination, and global model aggregation. Steps 1 and 4 follow standard federated learning, while Steps 2 and 3 integrate the discrepancy between local and global category distributions into the aggregation weights. The overview is shown on the right of Figure 3. We also provide the algorithm flow in Algorithm 1.

Local model training. Each client k performs τ steps of local training based on its dataset $B_k$ and the initial model $w^{(t,0)}$ to obtain the trained model, denoted as $w^{(t,\tau)}_k = \text{LocalTrain}(B_k, w^{(t,0)})$.

Local discrepancy computation. Each client computes the discrepancy between its local category distribution and the hypothetically aggregated global category distribution. Here we consider the global category distribution to be uniform, since this naturally promotes fairness across all categories and the generalization capability of the global model. Moreover, each client can compute its discrepancy without additional data sharing, preventing leakage of its category distribution. Concretely, let $D_k$ and T denote the local and global category distributions. We set all elements $T_c = 1/C$ to treat all categories equally. Given the global distribution T, each client k calculates its local discrepancy level $d_k \in \mathbb{R}$ by evaluating the difference between these two distributions: $d_k = e(D_k, T)$, where $e(\cdot)$ is a pre-defined metric function (e.g., L2 difference or KL-divergence). Using the L2 difference corresponds to $d_k = \sqrt{\sum_{c=1}^{C} (D_{k,c} - T_c)^2}$. Finally, the server collects the clients' discrepancy levels; see the left of Figure 3. Note that although we are mainly interested in a uniform target distribution for the sake of category-level fairness, our method is also applicable to scenarios where the target distribution is imbalanced; see the experiments in Table 5.

Discrepancy-aware aggregation weight determination. Motivated by the previous empirical and theoretical observations, we determine more distinguishing aggregation weights by leveraging each client k's relative dataset size $n_k$ and local discrepancy level $d_k$ (derived from (2)):

$$p_k = \frac{\mathrm{ReLU}(n_k - a \cdot d_k + b)}{\sum_{m=1}^{K} \mathrm{ReLU}(n_m - a \cdot d_m + b)}, \tag{3}$$

where $\mathrm{ReLU}(\cdot)$ is the ReLU function (Fukushima, 1975), which takes care of negative values, a is a hyper-parameter balancing $n_k$ and $d_k$, and b is another hyper-parameter adjusting the weight. This Disco aggregation determines more distinguishing aggregation weights by assigning larger weights to clients with larger dataset sizes and smaller local discrepancy levels.

Global model aggregation. The server conducts model aggregation based on the above discrepancy-aware aggregation weights. The global model is updated as $w^{(t+1,0)} = \sum_{k=1}^{K} p_k w^{(t,\tau)}_k$, where $w^{(t,\tau)}_k$ is the k-th local model.
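To make Steps 2 to 4 concrete, the sketch below implements the Disco weighting of Equation (3) and the subsequent weighted aggregation, assuming the server has already collected the relative sizes $n_k$ and discrepancy levels $d_k$ and that local models are exchanged as PyTorch state_dicts. The defaults a = 0.5 and b = 0.1 follow the safe range reported in Section 7.4; all names are illustrative rather than taken from the released code.

```python
import numpy as np
import torch

def disco_weights(rel_sizes, discrepancies, a=0.5, b=0.1):
    """Discrepancy-aware weights of Equation (3): p_k proportional to ReLU(n_k - a*d_k + b)."""
    n = np.asarray(rel_sizes, dtype=float)      # n_k, sums to 1 over clients
    d = np.asarray(discrepancies, dtype=float)  # d_k, e.g. KL divergence to the uniform T
    scores = np.maximum(n - a * d + b, 0.0)     # ReLU clips negative scores to zero
    return scores / scores.sum()                # normalize so that the weights sum to 1

def disco_aggregate(local_states, rel_sizes, discrepancies, a=0.5, b=0.1):
    """Global model aggregation w^{(t+1,0)} = sum_k p_k * w_k^{(t,tau)} with Disco weights."""
    p = disco_weights(rel_sizes, discrepancies, a, b)
    global_state = {}
    for key in local_states[0]:
        global_state[key] = torch.stack(
            [float(w) * state[key].float() for w, state in zip(p, local_states)]
        ).sum(dim=0)
    return global_state
```

Compared with the dataset-size-only rule, the only extra inputs are the K scalar discrepancy values, which matches the negligible communication overhead discussed in Section 5.2.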
5.2. Discussions

Privacy. Our proposed FedDisco does not leak a client's exact category distribution (Figure 3), since the discrepancy is calculated at the client side. The server only collects the clients' discrepancy levels, from which the exact category distribution cannot be inversely inferred. This makes the process more privacy-preserving than several existing works, such as FedFTG (Zhang et al., 2022) and CCVR (Luo et al., 2021), which transmit the exact category distribution.

Communication & computation efficiency. Communicating the discrepancy is only required at the first communication round. Besides, the discrepancy is a single numerical value, which is negligible compared with the model communication. The discrepancy calculation is a simple operation between two vectors, which is negligible compared with model training. Moreover, FedDisco requires much less computation overhead and brings no computation burden to the server; in contrast, previous methods such as FedDF (Lin et al., 2020) and FedFTG (Zhang et al., 2022) require additional model fine-tuning by simultaneously running K local models at the server side in every communication round.

Modularity. The modularity of FedDisco indicates its broad range of application. Typically, federated learning involves four steps: (1) global model downloading, (2) local updating, (3) local model uploading and (4) model aggregation. FedDisco focuses on step (4), which means it can be easily combined with existing works that conduct correction in step (2) and compression in steps (1) and (3). Beyond this, it can also be incorporated with methods operating in step (4) by adjusting the aggregation weight to be negatively correlated with the discrepancy, e.g., conducting Disco re-weighting before the normalization in FedNova (Wang et al., 2020b).

6. Related Works

Federated learning (FL) has been an emerging topic (McMahan et al., 2017; Hao et al., 2020). However, data distribution heterogeneity may significantly degrade FL's performance (Zhao et al., 2018; Kairouz et al., 2019; Li et al., 2019). Works addressing this issue can be mainly divided into two directions: local and global model adjustment.

6.1. Local Model Adjustment

Methods in this direction adjust local model training at the client side, aiming at producing local models with smaller differences (Li et al., 2020a; Karimireddy et al., 2020). FedProx (Li et al., 2020a) regularizes the ℓ2-distance between the local and global models. FedDyn (Acar et al., 2020) proposes a dynamic regularizer to align the local and global solutions.
SCAFFOLD (Karimireddy et al., 2020) and FedDC (Gao et al., 2022) introduce control variates to correct each client's gradient, but require double the communication cost. MOON (Li et al., 2021) aligns the features of the local and global models; FedFM (Ye et al., 2022) aligns the category-wise feature spaces across clients, contributing to more consistent feature spaces. All of these methods simply assign aggregation weights based on the local dataset size, which is not sufficient to distinguish each client's contribution. In contrast, our proposed FedDisco leverages both the dataset size and the discrepancy between local and global category distributions to determine more distinguishing aggregation weights, and it is a plug-and-play module that can be easily incorporated with these methods to further enhance the overall performance.

6.2. Global Model Adjustment

Methods in this direction adjust the global model at the server side, aiming to produce a global model with better performance (Lin et al., 2020; Reddi et al., 2021; Li et al., 2023; Zhang et al., 2023; Fan et al., 2022). FedNova (Wang et al., 2020b) conducts re-weighting to target system heterogeneity, while we conduct re-weighting based on the local discrepancy level to target data heterogeneity. CCVR (Luo et al., 2021) conducts post-calibration, and FedAvgM (Hsu et al., 2019) and FedOPT (Reddi et al., 2021) introduce global model momentum to stabilize FL training; these methods are orthogonal to FedDisco and can be easily combined with it. FedDF (Lin et al., 2020) and FedFTG (Zhang et al., 2022) introduce an additional fine-tuning step to refine the global model; FedGen (Zhu et al., 2021) learns a feature generator to assist local training. However, these methods require substantial computation capability at the server and much more computation cost (e.g., FedFTG (Zhang et al., 2022) requires twice the computation cost of FedAvg (McMahan et al., 2017)). In contrast, our FedDisco brings no computation burden to the server and works with negligible computation overhead.

7. Experiments

We show the key experimental setups and results in this section. More details and results are in Appendix B.

7.1. Experimental Setup

Datasets. We consider five image classification datasets to cover medical, natural and artificial scenarios, including HAM10000 (Tschandl et al., 2018), CIFAR-10 & CIFAR-100 (Krizhevsky et al., 2009), CINIC-10 (Darlow et al., 2018) and Fashion-MNIST (Xiao et al., 2017), as well as AG News (Zhang et al., 2015), a text classification dataset.

Federated scenarios. We consider two data distribution heterogeneity settings, termed NIID-1 and NIID-2. NIID-1 follows the Dirichlet distribution Dir_β (Wang et al., 2020a; Li et al., 2021), where β (default 0.5) is an argument correlated with the heterogeneity level. We consider 10 clients for NIID-1. NIID-2 is a more heterogeneous setting consisting of 5 biased clients (each has data from C/5 categories (McMahan et al., 2017; Li et al., 2020a)) and 1 unbiased client (which has all C categories), where C is the total number of categories.

Implementation details. The number of local epochs and the batch size are 10 and 64, respectively. We run federated learning for 100 rounds. We use ResNet18 (He et al., 2016) for HAM10000, a simple CNN network for the other image datasets and TextCNN (Zhang & Wallace, 2015) for AG News. We use the SGD optimizer with a learning rate of 0.01.

Table 1.
Accuracy comparisons (mean std on 5 trials, %) on several heterogeneous settings and datasets. Experiments show that Fed Disco consistently outperforms these state-of-the-art methods. METHOD HAM10000 CIFAR-10 CINIC-10 FASHION-MNIST NIID-1 NIID-1 NIID-2 NIID-1 NIID-2 NIID-1 NIID-2 FEDAVG 42.54 0.59 68.47 0.20 65.60 0.16 54.24 0.18 50.35 0.43 89.26 0.09 86.46 0.03 FEDAVGM 42.54 0.45 68.59 0.32 66.01 0.25 54.00 0.44 50.23 0.64 89.31 0.08 87.06 0.11 FEDPROX 44.76 1.35 69.33 0.49 65.61 0.31 56.38 0.35 50.30 0.44 89.28 0.12 87.24 0.21 SCAFFOLD 55.08 0.23 71.09 0.24 66.93 0.17 57.47 0.35 53.50 0.43 89.65 0.07 86.87 0.24 FEDDYN 54.44 0.98 70.14 0.31 68.49 0.62 56.41 0.41 52.69 0.52 89.12 0.13 86.38 0.43 FEDNOVA 44.92 0.59 68.57 0.19 65.61 0.39 54.27 0.20 50.37 0.73 89.05 0.08 86.47 0.20 MOON 45.87 0.90 67.42 0.14 66.25 0.35 52.08 0.20 50.42 0.56 89.29 0.04 86.44 0.04 FEDDC 54.48 0.77 70.93 0.15 67.89 0.44 57.26 0.26 52.43 0.60 89.19 0.02 86.57 0.06 FEDDISCO 59.05 0.67 72.05 0.24 69.85 0.18 58.07 0.15 53.84 0.08 89.74 0.07 87.85 0.20 Table 2. Modularity. Each entry shows accuracy of baseline with Disco (accuracy difference compared with baseline without Disco in Table 1). Experiments show consistent improvements across settings, indicating the modularity of Fed Disco. + DISCO HAM10000 CIFAR-10 CINIC-10 FASHION-MNIST NIID-1 NIID-1 NIID-2 NIID-1 NIID-2 NIID-1 NIID-2 FEDAVG 50.95 (+8.41) 70.05 (+1.58) 68.30 (+2.70) 54.81 (+0.57) 52.46 (+2.11) 89.56 (+0.30) 87.56 (+1.10) FEDAVGM 50.00 (+7.46) 70.07 (+1.48) 67.73 (+1.72) 54.69 (+0.69) 51.91 (+1.68) 89.36 (+0.05) 87.46 (+0.40) FEDPROX 50.48 (+5.72) 70.68 (+1.35) 68.33 (+2.72) 56.93 (+0.55) 52.62 (+2.32) 89.58 (+0.27) 87.72 (+0.48) SCAFFOLD 57.62 (+2.54) 71.70 (+0.61) 69.10 (+2.17) 58.05 (+0.58) 53.82 (+0.32) 89.74 (+0.09) 87.85 (+0.98) FEDDYN 59.05 (+4.61) 72.05 (+1.91) 69.85 (+1.36) 58.07 (+1.66) 53.84 (+1.15) 89.31 (+0.19) 87.18 (+0.80) FEDNOVA 51.43 (+6.51) 70.04 (+1.47) 67.83 (+2.22) 55.04 (+0.77) 52.23 (+1.86) 89.28 (+0.23) 87.52 (+1.05) MOON 52.86 (+6.99) 68.79 (+1.37) 68.35 (+2.10) 53.26 (+1.18) 51.97 (+1.55) 89.50 (+0.21) 87.59 (+1.15) FEDDC 58.25 (+3.77) 71.96 (+1.03) 68.94 (+1.05) 57.70 (+0.44) 53.25 (+0.82) 89.51 (+0.32) 87.11 (+0.54) We evaluate the accuracy on the global testing set. We use KL-Divergence to measure the discrepancy. Baselines. We compare Fed Disco with eight representative baselines. Among these, 1) Fed Avg (Mc Mahan et al., 2017) is the pioneering FL method; 2) Fed Prox (Li et al., 2020a), SCAFFOLD (Karimireddy et al., 2020), Fed Dyn (Acar et al., 2020), MOON (Li et al., 2021), and Fed DC (Gao et al., 2022) focus on local model adjustment; 3) Fed Avg M (Hsu et al., 2019) and Fed Nova (Wang et al., 2020b) focus on global model adjustment. The tuned hyper-parameters are shown in Appendix B.1.5. 7.2. Main Results On four standard datasets and two types of heterogeneity, we compare Fed Disco with state-of-the-art algorithms, show the modularity of Fed Disco, and explore Fed Disco s broader scope of applications. Performance: state-of-the-art accuracy. We compare the accuracy of several state-of-the-art methods and our proposed Fed Disco on multiple heterogeneous settings and datasets in Table 1. Note that HAM10000 is an imbalanced dataset and thus is not applicable for NIID-2. The local training protocol used in Fed Disco is Fed Dyn except SCAFFOLD for Fashion-MNIST. 
Experiments show that i) our proposed Fed Disco consistently outperforms others across different settings, indicating the effectiveness of our proposed method; ii) Fed Disco achieves significantly better on NIID-1 setting of HAM10000 (β = 5), which is a more difficult task for its severe heterogeneity and imbalance. Modularity: improvements over baselines. One key advantage of our proposed Fed Disco is its modularity, that is, it can be a plug-and-play module in many existing FL methods to further improve their performance. Following the experiments in Table 1, we report the accuracy of baselines combined with our Disco module and the accuracy difference (in parentheses) compared with baselines without Disco in Table 2 (see CIFAR-100 in Table 11). Experiments show that i) Disco consistently enhances the performance of state-of-the-art methods under different datasets and heterogeneity types; ii) for the most difficult task (i.e., HAM10000), Fed Disco brings the largest performance improvement. Specifically, it achieves 19.8% relative accuracy improvement over Fed Avg (Mc Mahan et al., 2017). Applicability to partial client participation scenarios. Partial client participation is a common scenario in crossdevice FL applications where only a subset of clients are available in a specific FL round. To verify that our proposed Fed Disco is applicable to this scenario, we sample 10 out of 60 clients for each round under NIID-2 on CIFAR-10, which consists of 50 biased clients and 10 unbiased clients. We report the averaged accuracy of last 10 rounds and the number Fed Disco: Federated Learning with Discrepancy-Aware Collaboration (a) MOON, NIID-1, CIFAR-10 (b) Fed Avg, NIID-1, CIFAR-10 (c) Fed Avg, NIID-2, CIFAR-10 (d) Fed Avg, NIID-2, CINIC10 Figure 4. Ease of hyper-parameters tuning. Experiments show that our Disco works for a wide range of hyper-parameters. Table 3. Performance under partial client participation scenario without (w.o.) and with our proposed Disco module. Fed Disco not only brings accuracy improvement, but also speeds up training. METHOD ACC (%) ROUNDS 55% W.O. WITH DISCO W.O. WITH DISCO FEDAVG 54.27 61.59 ( 7.32) 58 20 ( 65.52%) FEDAVGM 57.75 60.65 ( 2.90) 36 20 ( 44.44%) FEDPROX 55.50 60.19 ( 4.69) 44 20 ( 54.55%) SCAFFOLD 60.43 64.10 ( 3.67) 34 20 ( 41.18%) FEDDYN 59.44 62.53 ( 3.09) 33 27 ( 18.18%) FEDNOVA 54.11 58.13 ( 4.02) 61 24 ( 60.66%) MOON 54.30 58.79 ( 4.49) 56 24 ( 57.14%) FEDDC 61.18 62.90 ( 1.72) 27 20 ( 25.93%) Table 4. Results on text classification dataset AG News (Zhang et al., 2015) under full and partial participation. Fed Disco consistently brings accuracy improvement over baselines. DISCO? FULL PARTIAL FEDAVG FEDPROX FEDAVG FEDPROX 82.03 79.89 50.88 58.22 84.09 80.81 57.71 62.86 of rounds to reach target accuracy (55%) in Table 3. We see that Disco module significantly i) brings performance gain to baselines (up to 7.32%); ii) reduces the communication cost to reach a target accuracy (up to 65.52%). These results indicate that Fed Disco not only improves the accuracy but also speeds up the training process under this scenario. Applicability to text-modality scenarios. To verify that our proposed Fed Disco can also be applied to text-modality, we explore on a text classification dataset, AG News (Tschandl et al., 2018) under full and partial participation scenarios. Results in Table 4 show that Disco still consistently improves the baselines on text modality and brings significant performance gain under partial client participation scenario. 
Applicability to globally imbalanced scenarios. Here, we verify that FedDisco also handles the globally imbalanced category distribution scenario, in which the hypothetically aggregated global data is imbalanced. We first allocate CIFAR-10 samples to each category c following an exponential distribution (Cui et al., 2019): $n_c = n_1 \rho^{-\frac{c-1}{C-1}}$, where ρ denotes the imbalance ratio, C = 10 is the total number of categories, and $n_1 = 5000$. The dataset is then distributed to 10 clients as in NIID-1. Note that ρ = 1 denotes the globally balanced case and ρ = 20 is the most imbalanced level in Figure 5(a), where category 1 has 5000 samples while category 10 has only 250 samples. We consider two scenarios: 1) the global category distribution is non-uniform while the distribution of the test dataset is uniform; 2) the global category distribution and the distribution of the test dataset are similar and both non-uniform. 1) We conduct experiments on two typical baselines, FedAvg and FedDyn, in Figure 5(a). Experiments show that our FedDisco consistently improves the baselines regardless of the global imbalance level; see more results in Appendix B.2. 2) We conduct experiments under two settings (ρ = 10 and ρ = 50) and report the results in Table 5. Experiments show that FedDisco still performs the best when both the global and test category distributions are non-uniform; see how to obtain the global category distribution in a privacy-preserving way in Appendix A.3.

Table 5. Performance on the scenario where the global category distribution and the test distribution are similar but both non-uniform. Experiments show that FedDisco is also applicable to scenarios where the global and test distributions are non-uniform.

| METHOD | FEDAVG | FEDAVGM | FEDPROX | SCAFFOLD | FEDDYN | FEDNOVA | MOON | FEDDC | FEDDISCO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ρ = 10 | 69.78 | 69.25 | 69.27 | 71.48 | 69.29 | 68.44 | 68.29 | 70.00 | 71.77 |
| ρ = 50 | 74.03 | 73.77 | 74.13 | 74.03 | 74.78 | 74.53 | 74.03 | 74.53 | 76.06 |

Figure 5. Effects of four key FL arguments: (a) globally imbalance level, (b) client number, (c) heterogeneity level, (d) number of local epochs. Experiments show that our Disco consistently brings performance improvement.

7.3. Numerical Simulation of Reformulation

In Section 4, we minimize the reformulated upper bound of Theorem 4.5 to obtain a concise expression, which benefits practical utility since the tuning effort is reduced. Here, we conduct a numerical simulation to examine the optimization processes of the original form of the upper bound in Theorem 4.5 and the reformulated form in Equation (1). We run 400 steps of gradient descent on $\{p_k\}$ by minimizing the reformulated form, and record the resulting values of both the reformulated and the original forms in Table 6. The results show that minimizing the reformulated form also promotes minimizing the original form, verifying the rationality of this reformulation in the derivation.

Table 6. Numerical simulation. Minimizing the reformulated form promotes minimizing the original form.

| STEP | 0 | 100 | 200 | 300 | 400 |
| --- | --- | --- | --- | --- | --- |
| REFORMULATED | 0.0684 | 0.0670 | 0.0663 | 0.0660 | 0.0658 |
| ORIGINAL | 0.0524 | 0.0513 | 0.0507 | 0.0504 | 0.0502 |

7.4. Ablation Study

Ease of hyper-parameter tuning. We tune a and b in (3) under four settings to illustrate the ease of hyper-parameter tuning for our proposed FedDisco. For a more comprehensive understanding, we consider multiple scenarios, includ-

Table 7. Effects of discrepancy metric. FedDisco is robust to various choices of discrepancy metrics.
METRIC L1 L2 COSINE KL-DIVERGENCE ACC 69.43 69.99 69.90 70.05 ing different baseline methods (Fed Avg (Mc Mahan et al., 2017) and MOON (Li et al., 2021)), heterogeneity types (NIID-1 and NIID-2) and datasets (CIFAR-10 and CINIC10). We plot the accuracy difference brought by Disco for each group of hyper-parameters in Figure 4. By comparing three pairs: (a)&(b), (b)&(c) and (c)&(d), we see that i) for a wide range (a [0.2, 0.7], b [0.05, 0.4]), our Disco brings performance improvement for most cases regardless of the baseline method, heterogeneity type and dataset; ii) generally, a = 0.4 0.6, b = 0.1 is a safe choice, which leads to stable and great performance. Effects of client number, heterogeneity level and local epoch. We tune three key arguments in FL, including client number (K {10, 30, 50}), heterogeneity level (smaller value corresponds to more severe heterogeneity, β {0.1, 0.3, 0.5}) and number of local epoch E {2, 10, 20}, and show the accuracy improvement brought by Disco in Figure 5(b), 5(c), and 5(d), respectively. Experiments show that our proposed Disco consistently brings performance improvement across different FL arguments. Effects of different discrepancy metrics. Unless specified, we use the KL-Divergence throughout the paper. Though, Table 7 compares four discrepancy metrics under NIID-1 on CIFAR-10, including L1&L2 norm, cosine similarity and KL-Divergence. Experiments show that Fed Disco with these metrics achieve similar performance, indicating its robustness to different discrepancy metrics. 8. Conclusions This paper focuses on data heterogeneity issue in FL. Through empirical and theoretical explorations, we find that conventional dataset-size-based aggregation manner could be far from optimal. Addressing these, we introduce a discrepancy value as a complementary indicator and propose Fed Disco, a novel FL algorithm that assigns larger aggregation weight to client with larger dataset size and smaller discrepancy. Fed Disco introduces negligible computation and communication cost, and can be easily incorporated with many methods. Experiments show that Fed Disco consistently achieves state-of-the-art performances. Though this work mainly explores category-level heterogeneity, we may extend the idea of discrepancy-aware collaboration to other types of heterogeneity, such as featurelevel heterogeneity for classification task and mask-level heterogeneity for segmentation task. Acknowledgements This research is supported by the National Key R&D Program of China under Grant 2021ZD0112801, NSFC under Grant 62171276 and the Science and Technology Commission of Shanghai Municipal under Grant 21511100900 and 22DZ2229005. Fed Disco: Federated Learning with Discrepancy-Aware Collaboration Acar, D. A. E., Zhao, Y., Matas, R., Mattina, M., Whatmough, P., and Saligrama, V. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2020. Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., Mc Mahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacypreserving machine learning. In proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175 1191, 2017. Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311, 2018. Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Koneˇcn y, J., Mc Mahan, H. B., Smith, V., and Talwalkar, A. Leaf: A benchmark for federated settings. 
ar Xiv preprint ar Xiv:1812.01097, 2018. Codella, N., Rotemberg, V., Tschandl, P., Celebi, M. E., Dusza, S., Gutman, D., Helba, B., Kalloo, A., Liopyris, K., Marchetti, M., et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). ar Xiv preprint ar Xiv:1902.03368, 2019. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Classbalanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9268 9277, 2019. Darlow, L. N., Crowley, E. J., Antoniou, A., and Storkey, A. J. Cinic-10 is not imagenet or cifar-10. ar Xiv preprint ar Xiv:1810.03505, 2018. Ding, Y., Niu, C., Wu, F., Tang, S., Lyu, C., Chen, G., et al. Federated submodel optimization for hot and cold data features. Advances in Neural Information Processing Systems, 35:1 13, 2022. Fan, Z., Wang, Y., Yao, J., Lyu, L., Zhang, Y., and Tian, Q. Fedskip: Combatting statistical heterogeneity with federated skip aggregation. ar Xiv preprint ar Xiv:2212.07224, 2022. Fukushima, K. Cognitron: A self-organizing multilayered neural network. Biological cybernetics, 20(3):121 136, 1975. Gao, L., Fu, H., Li, L., Chen, Y., Xu, M., and Xu, C.-Z. Feddc: Federated learning with non-iid data via local drift decoupling and correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10112 10121, June 2022. Hao, M., Li, H., Luo, X., Xu, G., Yang, H., and Liu, S. Efficient and privacy-enhanced federated learning for industrial artificial intelligence. IEEE Transactions on Industrial Informatics, 16(10):6532 6542, 2020. doi: 10.1109/TII.2019.2945367. Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. ar Xiv preprint ar Xiv:1811.03604, 2018. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. ar Xiv preprint ar Xiv:1909.06335, 2019. Jim enez, D. Dynamically weighted ensemble neural networks for classification. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227), volume 1, pp. 753 756. IEEE, 1998. Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. ar Xiv preprint ar Xiv:1912.04977, 2019. Kaissis, G. A., Makowski, M. R., R uckert, D., and Braren, R. F. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence, 2(6):305 311, 2020. Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132 5143. PMLR, 2020. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Li, Q., He, B., and Song, D. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10713 10722, 2021. Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. 
Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429 450, 2020a. Li, T., Sanjabi, M., Beirami, A., and Smith, V. Fair resource allocation in federated learning. In International Conference on Learning Representations, 2020b. Fed Disco: Federated Learning with Discrepancy-Aware Collaboration Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2019. Li, Z., Lin, T., Shang, X., and Wu, C. Revisiting weighted aggregation in federated learning with neural networks. ar Xiv preprint ar Xiv:2302.10911, 2023. Lin, T., Kong, L., Stich, S. U., and Jaggi, M. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33: 2351 2363, 2020. Luo, M., Chen, F., Hu, D., Zhang, Y., Liang, J., and Feng, J. No fear of heterogeneity: Classifier calibration for federated learning with non-iid data. Advances in Neural Information Processing Systems, 34, 2021. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017. Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., and Fetaya, E. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pp. 16428 16446. PMLR, 2022. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. Reddi, S. J., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Koneˇcn y, J., Kumar, S., and Mc Mahan, H. B. Adaptive federated optimization. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=Lk FG3l B13U5. Shen, Z.-Q. and Kong, F.-S. Dynamically weighted ensemble neural networks for regression problems. In Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826), volume 6, pp. 3492 3496. IEEE, 2004. Singh, K., Sandhu, R. K., and Kumar, D. Comment volume prediction using neural networks and decision trees. In IEEE UKSim-AMSS 17th International Conference on Computer Modelling and Simulation, UKSim2015 (UKSim2015), 2015. Tschandl, P., Rosendahl, C., and Kittler, H. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5(1):1 9, 2018. Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning with matched averaging. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum? id=Bkluql SFDS. Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611 7623, 2020b. Wang, J., Charles, Z., Xu, Z., Joshi, G., Mc Mahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. ar Xiv preprint ar Xiv:2107.06917, 2021. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Ye, R., Ni, Z., Xu, C., Wang, J., Chen, S., and Eldar, Y. C. 
Fedfm: Anchor-based feature matching for data heterogeneity in federated learning. ar Xiv preprint ar Xiv:2210.07615, 2022. Zhang, L., Shen, L., Ding, L., Tao, D., and Duan, L.-Y. Finetuning global model via data-free knowledge distillation for non-iid federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10174 10183, 2022. Zhang, R., Xu, Q., Yao, J., Zhang, Y., Tian, Q., and Wang, Y. Federated domain generalization with generalization adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3954 3963, 2023. Zhang, X., Zhao, J., and Le Cun, Y. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015. Zhang, Y. and Wallace, B. A sensitivity analysis of (and practitioners guide to) convolutional neural networks for sentence classification. ar Xiv preprint ar Xiv:1510.03820, 2015. Zhang, Y., Kang, B., Hooi, B., Yan, S., and Feng, J. Deep long-tailed learning: A survey. ar Xiv preprint ar Xiv:2110.04596, 2021. Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. ar Xiv preprint ar Xiv:1806.00582, 2018. Zhu, Z., Hong, J., and Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning, pp. 12878 12889. PMLR, 2021. Fed Disco: Federated Learning with Discrepancy-Aware Collaboration Table 8. Notation descriptions. Notation Description K The total client number in FL system τ The number of SGD steps during local model training for each round Bk Local private dataset of client k w(t,r) k Local model of client k at round t and iteration r w(t,0) Global model at round t Dk Local category distribution of client k Dk,c The c-th element of Dk T Global category distribution nk The relative dataset size of client k pk The aggregation weight of client k dk The discrepancy value of client k Fk(w) Local objective of client k F(w) Global objective of FL Algorithm 1 Fed Disco: Federated Learning with Discrepancy-Aware Collaboration Initialization: Global model w(0,0) for k = 0 to K 1 do Client sends discrepancy value dk to server end for Server computes the aggregation weight pk according to Equation (3) for t = 0 to T 1 do Server sends global model w(t,0) to each client for k = 0 to K 1 do w(t,τ) k local model training for τ steps of SGD Client sends local model w(t,τ) k to server end for Server aggregates local models w(t+1,0) = PK k=1 pkw(t,τ) k end for A. Methodology A.1. Complementary Description Notation table. For convenience, we provide a detailed notation descriptions in Table 8. Algorithm table. We provide the overall algorithm in Algorithm 1. A.2. Discussions Connection with multi-task learning method, Nash-MTL (Navon et al., 2022). Nash-MTL focuses on aggregation weights of multiple tasks and we focus on aggregation weights of multiple clients. We will cite this paper in the revision. However, there are two major differences between Nash-MTL and our work. 1) Nash-MTL focuses on multi-task learning in a centralized manner while we focus on federated learning in a distributed manner. 2) The aggregation weights in Nash-MTL is learned through minimizing a pre-defined problem to search for Nash bargaining solution, which focuses on pair-wise gradient relationships among multiple tasks. 
In comparison, our aggregation weights is directly computed based on dataset size and discrepancy level, whose design is guided by our empirical and theoretical observation. A.3. Obtaining Global Category Distribution in A Privacy-Preserving Manner In Section 5, we regard the target distribution T as uniform since we want to emphasize category-level fairness. However, our method is also applicable in scenarios where the global category distribution and test distribution are both non-uniform; see results in Table 5. Here, we show that we can obtain the global category distribution in a privacy-preserving way. Fed Disco: Federated Learning with Discrepancy-Aware Collaboration Specifically, we can send each client s category distribution Dk (a vector) to the server using Secure Aggregation (Bonawitz et al., 2017) technique such that the server knows the actual global category distribution without knowing each client s actual category distribution. Then, this actual category distribution can be sent to each client and the discrepancy can be calculated (this process is efficient as it only requires once). As a simple example for the secure aggregation process, the distribution vectors of Client 1, 2 are D1 and D2. Client 1 adds an arbitrary noise vector a to D1 to obtain D1 + a; while Client 2 substitutes a from D2 to obtain D2 a. These two transformed vectors are then sent to the server, where the server knows the sum of global category distribution without knowing the exact distribution of each client: D1 + a + D2 a = D1 + D2. B. Experiments B.1. Implementation details B.1.1. ENVIRONMENTS We run all methods by using Pytorch framework (Paszke et al., 2019) on a single NVIDIA GTX 3090 GPU. The memory occupation ranges from 2065MB to 4413MB for diverse datasets and methods. B.1.2. DATASETS We use five image classification datasets, which cover medical, natural and artificial images. For medical image classification dataset, we consider HAM10000 (Codella et al., 2019; Tschandl et al., 2018), a 7 category classification task for pigmented skin lesions. For natual image classification dataset, we consider CIFAR-10, CIFAR-100 and CINIC-10 (Krizhevsky et al., 2009; Darlow et al., 2018), all of which are classification tasks for natural objects with 10, 100 and 10 categories, respectively. For artificial image classification dataset, we consider Fashion-MNIST (Xiao et al., 2017), a 10 category classification task for clothing. All these four public datasets can be downloaded online. Note that for HAM10000, we hold out a uniform testing set by allocating 30 samples for each category. We use one text classification dataset, AG News (Zhang et al., 2015), which is a 4 classification task. B.1.3. HETEROGENEITY LEVEL Here, we illustrate the data distribution over categories and clients under NIID-1 setting of four datasets. As mentioned in the main paper, NIID-1 denotes the setting of Dirichlet distribution Dirβ. We set β = 0.5 for all datasets except HAM10000 and CIFAR-100. For HAM10000, all methods fail at NIID-1 scenario when β is too small since HAM10000 is an severely imbalanced dataset. Thus, we choose a moderated β = 5.0. We also choose β = 5.0 for CIFAR-100. We plot the NIID-1 data distribution over categories and clients on four datasets in Figure (6). Note that HAM10000 is globally imbalanced and it has the largest number of samples in class 0. 
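For reference, a common way to generate such a Dirichlet-based (NIID-1) partition is sketched below. This is an illustrative recipe assuming NumPy and integer labels, not necessarily the exact script used to produce the reported splits.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients with Dir(beta) per-category proportions.

    A smaller beta yields more heterogeneous local category distributions.
    Returns one index array per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        # Draw how this category is shared among the clients.
        proportions = rng.dirichlet(np.full(num_clients, beta))
        cuts = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for cid, chunk in enumerate(np.split(idx_c, cuts)):
            client_indices[cid].extend(chunk.tolist())
    return [np.asarray(idx) for idx in client_indices]
```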
For AG News, we partition the dataset to 5 and 50 clients for full and partial participation scenarios, where 80% biased clients have data samples from 2 categories and the other 20% uniform clients have data samples from 4 categories. (a) CIFAR-10 (b) CINIC-10 (c) Fashion-MNIST (d) HAM10000 Figure 6. Data distribution over categories and clients under NIID-1 setting. B.1.4. MODELS For the CIFAR-10, CINIC-10 and Fashion-MNIST datasets, we use a simple CNN network as (Li et al., 2021; Luo et al., 2021). The network is sequentially consists of: 5 5 convolution layer, max-pooling layer, 5 5 convolution layer, three fully-connected layer with hidden size of 120, 84 and 10 respectively. For the HAM10000 and CIFAR-100 dataset, we use Fed Disco: Federated Learning with Discrepancy-Aware Collaboration Table 9. Classification accuracy (%) comparisons under globally imbalanced dataset scenario (ρ = 10). We highlight the best performance and second-best performance. SCAFFOLD and Fed Dyn performs the best while SCAFFOLD requires twice communication costs. Experiments show our proposed Fed Disco consistently improves the performances of baselines, indicating Fed Disco s applicability to this scenario. METHOD FEDAVG FEDAVGM FEDPROX SCAFFOLD FEDDYN FEDNOVA MOON FEDDC WITHOUT DISCO 57.86 57.53 57.78 63.78 63.24 59.64 57.74 61.74 WITH DISCO 60.45 60.05 60.33 65.13 64.33 60.77 59.46 63.13 Res Net18 (He et al., 2016) in Pytorch library. We replace the first 7 7 convolution layer with a 3 3 convolution layer and eliminate the first pooling layer. We also replace the batch normalization layer with group normalization layer as (Acar et al., 2020). For AG News (Zhang et al., 2015), we use Text CNN model (Zhang & Wallace, 2015) with a 32 hidden dimension. B.1.5. HYPER-PARAMETERS For all methods, we tune the hyper-parameter in a reasonable range and report the highest accuracy in the paper. For Fed Prox (Li et al., 2020a), we tune the hyper-parameter µ from {0.001, 0.01, 0.1, 1.0}. For Fed Avg M (Hsu et al., 2019), we tune the hyper-parameter β from {0.3, 0.5, 0.7, 0.9}. For Fed Dyn (Acar et al., 2020), we tune the hyper-parameter α from {0.001, 0.01, 0.1, 1.0}. For MOON (Li et al., 2021), we tune the hyper-parameter µ from {0.01, 0.1, 0.5, 1.0, 5.0}. For Fed DC (Gao et al., 2022), we tune the hyper-parameter α from {0.001, 0.01, 0.1, 1.0}. The tuned best hyper-parameter for these methods are: µ = 0.01 in Fed Prox (Li et al., 2020a), β = 0.5 in Fed Avg M (Hsu et al., 2019), α = 0.01 in Fed Dyn (Acar et al., 2020), µ = 0.1 in MOON (Li et al., 2021), α = 0.01 in Fed DC (Gao et al., 2022). B.2. Globally Imbalanced Scenario: Accuracy and Fairness We show that our proposed Fed Disco is also capable of globally imbalanced category distribution scenario, that is the hypothetically aggregated global data is imbalanced. In the main paper, we show the performance of Fed Avg (Mc Mahan et al., 2017) and Fed Dyn (Acar et al., 2020) with and without Disco for different globally imbalance level (ρ = 1, 2, 5, 10, 20) in Figure 5 (a). We firstly visualize the hypothetically aggregated global data distribution in Figure 7, where x-axis denotes the category and y-axis denotes the number of data samples of the corresponding category. We plot the data distribution over categories for different globally imbalance levels (ρ). ρ = 1 denotes balanced situation and ρ = 20 denotes the most severe imbalanced situation. Our experiments in the Figure 5 (a) of the main paper show that our Fed Disco works for all these globally imbalance levels. 
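As a sketch of how such globally imbalanced category sizes can be generated, the exponential allocation $n_c = n_1 \rho^{-(c-1)/(C-1)}$ (Cui et al., 2019) from Section 7.2 can be written as below; the function name and defaults are illustrative.

```python
def exponential_class_counts(n1=5000, rho=10, num_classes=10):
    """Per-category sample counts n_c = n_1 * rho ** (-(c - 1) / (C - 1)).

    rho = 1 gives a balanced split; a larger rho gives a more imbalanced one.
    For example, rho = 20 keeps 5000 samples for category 1 and 250 for category 10.
    """
    return [
        int(round(n1 * rho ** (-(c - 1) / (num_classes - 1))))
        for c in range(1, num_classes + 1)
    ]

# exponential_class_counts(rho=20) -> [5000, ..., 250]
```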
(a) Balanced: ρ = 1 (b) Imbalanced: ρ = 2 (c) Imbalanced: ρ = 5 (d) Imbalanced: ρ = 10 (e) Imbalanced: ρ = 20
Figure 7. Hypothetically aggregated global data distribution over categories for different global imbalance levels (ρ). The imbalance level increases from left to right, where ρ = 1 denotes the balanced situation and ρ = 20 the most severely imbalanced one.

Next, we provide the performance of more baselines at ρ = 10 in Table 9. Experiments show that our proposed FedDisco consistently improves over the baselines under this globally imbalanced scenario, indicating its applicability to this setting.

We also examine the performance of the global model on each individual category in Figure 8, using FedAvg (McMahan et al., 2017) as an example. Following (Zhang et al., 2021), we define and evaluate three subsets: Head (categories 1-4), Middle (categories 5-7) and Tail (categories 8-10). Additionally, we report the accuracy averaged over all categories and the standard deviation across categories. We observe that i) FedDisco achieves comparable performance (difference within 0.75%) on the Head and Middle classes and significantly higher performance (nearly 8% improvement) on the Tail classes, which suggests that FedAvg is severely biased towards the Head classes while FedDisco mitigates this bias; and ii) FedDisco achieves higher averaged accuracy with a smaller standard deviation, meaning that FedDisco simultaneously enhances overall performance and promotes fairness across categories.

Figure 8. Per-category accuracy comparison on CIFAR-10. FedDisco significantly enhances the performance on the Tail classes without severely affecting the Head classes, and simultaneously improves overall performance (higher Avg) and fairness (lower Std) compared with FedAvg.

Table 10. Performance under the partial client participation scenario. Mean ± std is evaluated over the last 10 rounds. Rounds: number of rounds required to reach the target accuracy (55%). A higher mean, lower std and fewer rounds represent better performance. The accuracy difference brought by Disco is given in parentheses, as are the absolute and relative reductions in rounds. Methods with Disco achieve higher accuracy, more stable performance and faster training. In particular, FedAvg with Disco achieves a 7.32% accuracy improvement and requires 38 fewer rounds to reach the target accuracy, saving 65.52% of the training time and communication cost.

METHOD     MEAN ± STD (%)                          ROUNDS
           WITHOUT DISCO   WITH DISCO              WITHOUT DISCO   WITH DISCO
FEDAVG     54.27 ± 3.44    61.59 ± 1.84 (+7.32)    58              20 (-38, -65.52%)
FEDAVGM    57.75 ± 1.92    60.65 ± 1.32 (+2.90)    36              20 (-16, -44.44%)
FEDPROX    55.50 ± 3.02    60.19 ± 2.47 (+4.69)    44              20 (-24, -54.55%)
SCAFFOLD   60.43 ± 1.96    64.10 ± 0.76 (+3.67)    34              20 (-14, -41.18%)
FEDDYN     59.44 ± 2.59    62.53 ± 2.34 (+3.09)    33              27 (-6, -18.18%)
FEDNOVA    54.11 ± 3.83    58.13 ± 2.72 (+4.02)    61              24 (-37, -60.66%)
MOON       54.30 ± 3.97    58.79 ± 3.07 (+4.49)    56              24 (-32, -57.14%)

B.3. Partial Client Participation Scenario: Accuracy, Stability and Training Speed

In practice, clients may participate only when they have reliable power and Wi-Fi conditions, so only a subset of clients participates in each round (partial client participation). We conduct experiments on CIFAR-10 (Krizhevsky et al., 2009) with 40 biased clients and 10 unbiased clients; in each round, we randomly sample 10 clients to participate (see the sketch below).
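As a minimal sketch of this partial-participation setup, the snippet below uniformly samples 10 of the 50 clients each round and renormalizes the aggregation weights over the sampled subset; the renormalization step is an assumption of standard practice rather than a detail stated here.

```python
import random

def sample_clients(num_clients=50, clients_per_round=10, rng=random):
    """Uniformly sample the clients that participate in the current round."""
    return rng.sample(range(num_clients), clients_per_round)

def renormalize(weights, participants):
    """Restrict the aggregation weights to the sampled clients and rescale
    them so that they sum to one within the round."""
    total = sum(weights[k] for k in participants)
    return {k: weights[k] / total for k in participants}

# The full-population weights could be dataset-size-based or discrepancy-aware (Disco).
weights = {k: 1.0 / 50 for k in range(50)}   # placeholder weights for illustration
participants = sample_clients()
round_weights = renormalize(weights, participants)
assert abs(sum(round_weights.values()) - 1.0) < 1e-9
```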
We report two aspects in Table 10: the mean and standard deviation (std) of the accuracy over the last 10 rounds, and the number of rounds required to reach the target accuracy (55%). Experiments show that methods with our Disco achieve i) a consistently higher mean and smaller std of accuracy, indicating better and more stable performance across rounds, and ii) fewer rounds to reach the target accuracy, indicating faster training, which saves computation and communication cost. Specifically, for the basic method FedAvg, adding Disco yields a 7.32% accuracy improvement and requires 38 fewer rounds to reach the target accuracy, which saves 65.52% of the training time and communication cost.

Table 11. Classification accuracy (%) comparisons on the CIFAR-100 dataset under the NIID-1 scenario. FedDyn with Disco achieves the best performance. Experiments show that our proposed FedDisco consistently improves the performance of the baselines on CIFAR-100.

METHOD         FEDAVG  FEDAVGM  FEDPROX  SCAFFOLD  FEDDYN  FEDNOVA  MOON   FEDDC
WITHOUT DISCO  57.28   56.27    58.99    61.39     62.29   57.59    58.01  62.20
WITH DISCO     58.28   56.73    59.29    61.90     62.83   58.02    58.16  62.61

Table 12. Comparison of accuracy variance across clients at different rounds. Lower denotes fairer.

ROUND     1       10      20     30     40
FEDAVG    178.05  226.38  73.65  94.49  89.97
FEDDISCO  167.40  110.63  74.70  84.72  73.45

B.4. Experimental Results on CIFAR-100

We provide the results of several baselines on CIFAR-100 (Krizhevsky et al., 2009) under the NIID-1 setting in Table 11. Experiments show that our proposed FedDisco consistently brings gains on the CIFAR-100 dataset, verifying its modularity.

B.5. Effects on Client-Level Fairness

In this paper, we focus on generalization ability and promote category-level fairness, where the corresponding global distribution is uniform. Client-level fairness is another interesting and important research topic in FL (Li et al., 2020b); here, we explore this aspect for a more comprehensive understanding. Note that FedDisco does not hurt client-level fairness, for two reasons. First, when FedDisco assigns a client a high aggregation weight, this client has a relatively uniform training distribution, which naturally benefits all clients to some extent. If we did the opposite and assigned a high aggregation weight to a client whose training category distribution is highly skewed, this would only benefit that single client, hurt the other clients, and violate client-level fairness. Second, each client's test data distribution can be close to the uniform distribution even though its training data distribution is far from uniform; one important aim of federated learning is to enable clients with biased and limited training data to become aware of the global distribution. To validate that FedDisco does not hurt client-level fairness, we run 40 rounds of FL on CIFAR-10 and record the variance of test accuracy across clients. This accuracy variance is often used to evaluate fairness (Li et al., 2020b): a small variance means that the test accuracies of the clients are similar, i.e., higher fairness (see the sketch below). Table 12 compares the accuracy variance of FedAvg with that of FedAvg after applying FedDisco. From the table, we see that the variance of FedDisco is comparable to or lower than that of FedAvg, indicating that FedDisco can potentially enhance client-level fairness.
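For concreteness, the fairness metric reported in Table 12 can be computed as in the short sketch below, assuming the per-client test accuracies are provided by an external evaluation loop:

```python
import numpy as np

def client_level_fairness_variance(per_client_accuracies):
    """Variance of test accuracy (%) across clients, as in Table 12;
    a lower variance means the clients' accuracies are more similar, i.e. fairer."""
    accs = np.asarray(per_client_accuracies, dtype=float)
    return float(np.var(accs))
```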
B.6. Comparison with Equal Aggregation Weights

Following the experiments in Table 1, Table 13 compares FedDisco with FedAvg using equal aggregation weights (i.e., $p_k = 1/K$). From the table, we see that FedDisco performs significantly better across the different settings.

Table 13. Comparison between FedDisco and FedAvg with equal aggregation weights $p_k = 1/K$. FedDisco performs significantly better across different datasets and heterogeneity types.

METHOD              HAM10000   CIFAR-10          CINIC-10          FASHION-MNIST
                    NIID-1     NIID-1   NIID-2   NIID-1   NIID-2   NIID-1   NIID-2
FEDAVG (p_k = 1/K)  44.76      68.11    65.60    54.27    50.35    89.35    86.46
FEDAVG + DISCO      50.95      70.05    68.30    54.81    52.46    89.56    87.56

B.7. Experiments on Device Heterogeneity

For device heterogeneity, where there are stragglers, we conduct experiments under the NIID-2 setting on CIFAR-10 following (Wang et al., 2020b): in each round, we uniformly sample the number of local iterations of each client from the range 50 to 500. Results show that FedAvg achieves 60.21% while FedDisco achieves 63.20%, indicating that FedDisco still brings a performance improvement in the presence of stragglers. Device heterogeneity is an issue orthogonal to distribution heterogeneity; since the two can occur concurrently in practice, FedDisco can still play a role in enhancing overall performance. More importantly, device heterogeneity can even exacerbate the effect of distribution heterogeneity. For example, suppose Client A has a smaller discrepancy level and Client B a larger one. When Client A is a straggler that performs fewer iterations while Client B performs many more, it is even more critical to enhance the aggregation weight of Client A; otherwise the FL system will be dominated by Client B and the global model will be biased. Furthermore, algorithms that specifically target device heterogeneity (e.g., FedNova (Wang et al., 2020b)) can be combined with ours to further enhance performance.

B.8. Other Experiments

1) The idea of discrepancy-aware collaboration can be extended to other tasks such as regression. Here, we conduct experiments on the Facebook comment volume prediction dataset (Singh et al., 2015), where the task is to predict the number of comments given a post. We pre-define several categories for this regression task, where each category covers regression labels in a specific range; for example, samples with regression labels ranging from 1 to 10 are grouped into category 1. With this strategy, we obtain a distribution vector as we do for classification tasks. Note that this binning is used only to obtain the distribution vector; we still use the original regression labels for model training. The test loss of FedAvg (McMahan et al., 2017) is 0.432 while FedDisco achieves 0.407, indicating that FedDisco can also handle continuous label distributions.

2) Here, we consider feature-level heterogeneity. We conduct experiments on FEMNIST from the LEAF benchmark (Caldas et al., 2018) following (Li et al., 2020a), where both feature-level and category-level heterogeneity exist. We see that FedAvg achieves 76.40% accuracy while FedDisco achieves 78.36% accuracy.

C. Theoretical Analysis

C.1. Preliminaries

The global objective function is $F(w) = \sum_{k=1}^{K} n_k F_k(w)$, where $n_k$ is the relative dataset size of client $k$ with $\sum_{k=1}^{K} n_k = 1$, and the server aggregates local models with weights $p_k$ satisfying $\sum_{k=1}^{K} p_k = 1$. For ease of writing, we use $g_k(w)$ to denote the mini-batch gradient $g_k(w|\xi)$ and $\nabla F_k(w)$ to denote the full-batch gradient, where $\xi$ is a mini-batch sampled from client $k$'s dataset.
We further define the following two notions:

Averaged mini-batch gradient: $d_k = \frac{1}{\tau}\sum_{r=0}^{\tau-1} g_k\big(w_k^{(t,r)}\big)$,  (4)

Averaged full-batch gradient: $h_k = \frac{1}{\tau}\sum_{r=0}^{\tau-1} \nabla F_k\big(w_k^{(t,r)}\big)$.  (5)

Then, the update of the global model between two rounds is

$w^{(t+1,0)} - w^{(t,0)} = -\tau\eta \sum_{k=1}^{K} p_k d_k$.  (6)

Here, we present a key lemma and defer its proof to Section C.3.

Lemma C.1. Suppose $\{A_t\}_{t=1}^{T}$ is a sequence of random matrices satisfying $\mathbb{E}[A_t \mid A_{t-1}, A_{t-2}, \dots, A_1] = 0$. Then $\mathbb{E}\big[\|\sum_{t=1}^{T} A_t\|_F^2\big] = \sum_{t=1}^{T} \mathbb{E}\big[\|A_t\|_F^2\big]$.

C.2. Proof of Theorem 4.5

According to the Lipschitz-smoothness assumption in Assumption 4.1 and its equivalent form (Bottou et al., 2018), we have

$\mathbb{E}\big[F(w^{(t+1,0)})\big] - F(w^{(t,0)}) \le \mathbb{E}\big\langle \nabla F(w^{(t,0)}),\, w^{(t+1,0)} - w^{(t,0)} \big\rangle + \frac{L}{2}\,\mathbb{E}\big\|w^{(t+1,0)} - w^{(t,0)}\big\|^2$,  (7)

where the expectation is taken over the mini-batches $\xi_k^{(t,r)}$, $k \in \{1, 2, \dots, K\}$, $r \in \{0, 1, \dots, \tau-1\}$. Substituting the update rule (6) into (7) gives

$\mathbb{E}\big[F(w^{(t+1,0)})\big] - F(w^{(t,0)}) \le -\tau\eta\, \underbrace{\mathbb{E}\big\langle \nabla F(w^{(t,0)}),\, \sum_{k=1}^{K} p_k d_k \big\rangle}_{N_1} + \frac{L\tau^2\eta^2}{2}\, \underbrace{\mathbb{E}\big\|\sum_{k=1}^{K} p_k d_k\big\|^2}_{N_2}$.  (8)

C.2.1. BOUNDING N1 IN (8)

$N_1 = \mathbb{E}\big\langle \nabla F(w^{(t,0)}),\, \sum_{k=1}^{K} p_k (d_k - h_k) \big\rangle + \mathbb{E}\big\langle \nabla F(w^{(t,0)}),\, \sum_{k=1}^{K} p_k h_k \big\rangle$  (9)
$\quad = \mathbb{E}\big\langle \nabla F(w^{(t,0)}),\, \sum_{k=1}^{K} p_k h_k \big\rangle$  (10)
$\quad = \frac{1}{2}\big\|\nabla F(w^{(t,0)})\big\|^2 + \frac{1}{2}\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2 - \frac{1}{2}\mathbb{E}\big\|\nabla F(w^{(t,0)}) - \sum_{k=1}^{K} p_k h_k\big\|^2$,  (11)

where (10) uses the unbiased gradient assumption in Assumption 4.3, which gives $\mathbb{E}[d_k - h_k] = h_k - h_k = 0$, and (11) uses the identity $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$.

C.2.2. BOUNDING N2 IN (8)

$N_2 = \mathbb{E}\big\|\sum_{k=1}^{K} p_k (d_k - h_k) + \sum_{k=1}^{K} p_k h_k\big\|^2$  (12)
$\quad \le 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k (d_k - h_k)\big\|^2 + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2$  (13)
$\quad = 2\sum_{k=1}^{K} p_k^2\, \mathbb{E}\big[\|d_k - h_k\|^2\big] + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2$  (14)
$\quad = \frac{2}{\tau^2}\sum_{k=1}^{K} p_k^2\, \mathbb{E}\big\|\sum_{r=0}^{\tau-1}\big(g_k(w_k^{(t,r)}) - \nabla F_k(w_k^{(t,r)})\big)\big\|^2 + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2$  (15)
$\quad = \frac{2}{\tau^2}\sum_{k=1}^{K} p_k^2 \sum_{r=0}^{\tau-1} \mathbb{E}\big\|g_k(w_k^{(t,r)}) - \nabla F_k(w_k^{(t,r)})\big\|^2 + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2$  (16)
$\quad \le \frac{2\sigma^2}{\tau}\sum_{k=1}^{K} p_k^2 + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2$,  (17)

where (13) follows from $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, (14) uses the fact that the clients are independent of each other so that $\mathbb{E}\langle d_k - h_k,\, d_n - h_n\rangle = 0$ for $k \ne n$, (16) uses Lemma C.1, and (17) uses the bounded variance assumption in Assumption 4.3.

Plugging (11) and (17) back into (8), we have

$\mathbb{E}\big[F(w^{(t+1,0)})\big] - F(w^{(t,0)}) \le -\frac{\tau\eta}{2}\big\|\nabla F(w^{(t,0)})\big\|^2 - \frac{\tau\eta}{2}(1 - 2\tau\eta L)\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k h_k\big\|^2 + L\tau\eta^2\sigma^2 \sum_{k=1}^{K} p_k^2 + \frac{\tau\eta}{2}\,\underbrace{\mathbb{E}\big\|\nabla F(w^{(t,0)}) - \sum_{k=1}^{K} p_k h_k\big\|^2}_{N_3}$.  (18)

C.2.3. BOUNDING N3 IN (18)

$N_3 = \mathbb{E}\big\|\sum_{k=1}^{K}(n_k - p_k)\nabla F_k(w^{(t,0)}) + \sum_{k=1}^{K} p_k\big(\nabla F_k(w^{(t,0)}) - h_k\big)\big\|^2$  (19)
$\quad \le 2\,\mathbb{E}\big\|\sum_{k=1}^{K}(n_k - p_k)\nabla F_k(w^{(t,0)})\big\|^2 + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k\big(\nabla F_k(w^{(t,0)}) - h_k\big)\big\|^2$  (20)
$\quad \le 2\Big[\sum_{k=1}^{K}(n_k - p_k)^2\Big]\Big[\sum_{k=1}^{K}\big\|\nabla F_k(w^{(t,0)})\big\|^2\Big] + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k\big(\nabla F_k(w^{(t,0)}) - h_k\big)\big\|^2$  (21)
$\quad \le 2K\Big[\sum_{k=1}^{K}(n_k - p_k)^2\Big]\Big[\big\|\nabla F(w^{(t,0)})\big\|^2 + \frac{B}{K}\sum_{k=1}^{K} d_k\Big] + 2\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k\big(\nabla F_k(w^{(t,0)}) - h_k\big)\big\|^2$,  (22)

where (20) follows from $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, (21) follows from the Cauchy-Schwarz inequality, and (22) uses the bounded dissimilarity assumption in Assumption 4.4. We use $W_D$ to denote $2K\big[\sum_{k=1}^{K}(n_k - p_k)^2\big]$. When $1 - 2\tau\eta L \ge 0$, we have

$\mathbb{E}\big[F(w^{(t+1,0)})\big] - F(w^{(t,0)}) \le -\frac{\tau\eta}{2}(1 - W_D)\big\|\nabla F(w^{(t,0)})\big\|^2 + L\tau\eta^2\sigma^2\sum_{k=1}^{K} p_k^2 + \frac{\tau\eta W_D B}{2K}\sum_{k=1}^{K} d_k + \tau\eta\,\mathbb{E}\big\|\sum_{k=1}^{K} p_k\big(\nabla F_k(w^{(t,0)}) - h_k\big)\big\|^2$  (23)
$\quad \le -\frac{\tau\eta}{2}(1 - W_D)\big\|\nabla F(w^{(t,0)})\big\|^2 + L\tau\eta^2\sigma^2\sum_{k=1}^{K} p_k^2 + \frac{\tau\eta W_D B}{2K}\sum_{k=1}^{K} d_k + \tau\eta \sum_{k=1}^{K} p_k\, \underbrace{\mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - h_k\big\|^2}_{N_4}$,  (24)

where (24) uses Jensen's inequality $\big\|\sum_{k=1}^{K} p_k x_k\big\|^2 \le \sum_{k=1}^{K} p_k \|x_k\|^2$.

C.2.4. BOUNDING N4 IN (24)

$N_4 = \mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - h_k\big\|^2 = \mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - \frac{1}{\tau}\sum_{r=0}^{\tau-1}\nabla F_k(w_k^{(t,r)})\big\|^2$  (25)
$\quad = \mathbb{E}\big\|\frac{1}{\tau}\sum_{r=0}^{\tau-1}\big(\nabla F_k(w^{(t,0)}) - \nabla F_k(w_k^{(t,r)})\big)\big\|^2$  (26)
$\quad \le \frac{1}{\tau}\sum_{r=0}^{\tau-1} \mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - \nabla F_k(w_k^{(t,r)})\big\|^2$  (27)
$\quad \le \frac{L^2}{\tau}\sum_{r=0}^{\tau-1} \underbrace{\mathbb{E}\big\|w^{(t,0)} - w_k^{(t,r)}\big\|^2}_{N_5}$,  (28)

where (27) uses Jensen's inequality and (28) follows from the Lipschitz-smoothness property.

C.2.5. BOUNDING N5 IN (28)

$N_5 = \eta^2\,\mathbb{E}\big\|\sum_{s=0}^{r-1} g_k(w_k^{(t,s)})\big\|^2 = \eta^2\,\mathbb{E}\big\|\sum_{s=0}^{r-1}\big(g_k(w_k^{(t,s)}) - \nabla F_k(w_k^{(t,s)})\big) + \sum_{s=0}^{r-1}\nabla F_k(w_k^{(t,s)})\big\|^2$  (29)
$\quad \le 2\eta^2\,\mathbb{E}\big\|\sum_{s=0}^{r-1}\big(g_k(w_k^{(t,s)}) - \nabla F_k(w_k^{(t,s)})\big)\big\|^2 + 2\eta^2\,\mathbb{E}\big\|\sum_{s=0}^{r-1}\nabla F_k(w_k^{(t,s)})\big\|^2$  (30)
$\quad = 2\eta^2\sum_{s=0}^{r-1}\mathbb{E}\big\|g_k(w_k^{(t,s)}) - \nabla F_k(w_k^{(t,s)})\big\|^2 + 2\eta^2\,\mathbb{E}\big\|\sum_{s=0}^{r-1}\nabla F_k(w_k^{(t,s)})\big\|^2$  (31)
$\quad \le 2r\eta^2\sigma^2 + 2\eta^2\,\mathbb{E}\big\|\sum_{s=0}^{r-1}\nabla F_k(w_k^{(t,s)})\big\|^2$  (32)
$\quad \le 2r\eta^2\sigma^2 + 2r\eta^2 \sum_{s=0}^{r-1}\mathbb{E}\big\|\nabla F_k(w_k^{(t,s)})\big\|^2$  (33)
$\quad \le 2r\eta^2\sigma^2 + 2r\eta^2 \sum_{s=0}^{\tau-1}\mathbb{E}\big\|\nabla F_k(w_k^{(t,s)})\big\|^2$,  (34)

where (30) uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, (31) uses Lemma C.1, (32) uses the bounded variance assumption in Assumption 4.3, and (33) uses Jensen's inequality.

Plugging (34) back into (28) and using $\sum_{r=0}^{\tau-1} r = \frac{\tau(\tau-1)}{2}$, we have

$\mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - h_k\big\|^2 \le \frac{L^2}{\tau}\sum_{r=0}^{\tau-1}\mathbb{E}\big\|w^{(t,0)} - w_k^{(t,r)}\big\|^2$  (35)
$\quad \le (\tau-1)L^2\eta^2\sigma^2 + (\tau-1)L^2\eta^2 \sum_{s=0}^{\tau-1}\underbrace{\mathbb{E}\big\|\nabla F_k(w_k^{(t,s)})\big\|^2}_{N_6}$,  (36)

where $N_6$ in (36) can be further bounded.
C.2.6. BOUNDING N6 IN (36)

$N_6 = \mathbb{E}\big\|\nabla F_k(w_k^{(t,s)})\big\|^2 \le 2\,\mathbb{E}\big\|\nabla F_k(w_k^{(t,s)}) - \nabla F_k(w^{(t,0)})\big\|^2 + 2\,\mathbb{E}\big\|\nabla F_k(w^{(t,0)})\big\|^2$  (37)
$\quad \le 2L^2\,\mathbb{E}\big\|w^{(t,0)} - w_k^{(t,s)}\big\|^2 + 2\,\mathbb{E}\big\|\nabla F_k(w^{(t,0)})\big\|^2$,  (38)

where (37) uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and (38) uses the Lipschitz-smoothness property. Plugging (38) back into (36), we have

$\frac{L^2}{\tau}\sum_{r=0}^{\tau-1}\mathbb{E}\big\|w^{(t,0)} - w_k^{(t,r)}\big\|^2 \le (\tau-1)L^2\eta^2\sigma^2 + 2(\tau-1)\eta^2 L^4 \sum_{s=0}^{\tau-1}\mathbb{E}\big\|w^{(t,0)} - w_k^{(t,s)}\big\|^2 + 2(\tau-1)\eta^2 L^2 \sum_{s=0}^{\tau-1}\mathbb{E}\big\|\nabla F_k(w^{(t,0)})\big\|^2$.  (39)

After rearranging, we have

$\mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - h_k\big\|^2 \le \frac{L^2}{\tau}\sum_{r=0}^{\tau-1}\mathbb{E}\big\|w^{(t,0)} - w_k^{(t,r)}\big\|^2$  (40)
$\quad \le \frac{(\tau-1)\eta^2\sigma^2 L^2}{1 - 2\tau(\tau-1)\eta^2 L^2} + \frac{2\tau(\tau-1)\eta^2 L^2}{1 - 2\tau(\tau-1)\eta^2 L^2}\,\mathbb{E}\big\|\nabla F_k(w^{(t,0)})\big\|^2$  (41)
$\quad = \frac{(\tau-1)\eta^2\sigma^2 L^2}{1 - A} + \frac{A}{1 - A}\,\mathbb{E}\big\|\nabla F_k(w^{(t,0)})\big\|^2$,  (42)

where we define $A = 2\tau(\tau-1)\eta^2 L^2 < 1$. Then, the last term in (24) is bounded by

$\tau\eta\sum_{k=1}^{K} p_k\, \mathbb{E}\big\|\nabla F_k(w^{(t,0)}) - h_k\big\|^2 \le \tau\eta\Big[\frac{(\tau-1)\eta^2\sigma^2 L^2}{1 - A} + \frac{A}{1 - A}\sum_{k=1}^{K} p_k\,\mathbb{E}\big\|\nabla F_k(w^{(t,0)})\big\|^2\Big]$  (43)
$\quad \le \frac{\tau(\tau-1)\sigma^2 L^2\eta^3}{1 - A} + \frac{\tau\eta A}{1 - A}\,\mathbb{E}\big\|\nabla F(w^{(t,0)})\big\|^2 + \frac{\tau\eta A B}{1 - A}\sum_{k=1}^{K} p_k d_k$,  (44)

where (44) follows from the bounded dissimilarity assumption in Assumption 4.4. Plugging (44) back into (24), we have

$\mathbb{E}\big[F(w^{(t+1,0)})\big] - F(w^{(t,0)}) \le -\frac{\tau\eta}{2}(1 - W_D)\big\|\nabla F(w^{(t,0)})\big\|^2 + L\tau\eta^2\sigma^2\sum_{k=1}^{K} p_k^2 + \frac{\tau\eta W_D B}{2K}\sum_{k=1}^{K} d_k + \frac{\tau(\tau-1)\sigma^2 L^2\eta^3}{1 - A} + \frac{\tau\eta A}{1 - A}\,\mathbb{E}\big\|\nabla F(w^{(t,0)})\big\|^2 + \frac{\tau\eta A B}{1 - A}\sum_{k=1}^{K} p_k d_k$  (45)
$\quad = -\frac{\tau\eta}{2}\Big(1 - W_D - \frac{2A}{1 - A}\Big)\big\|\nabla F(w^{(t,0)})\big\|^2 + L\tau\eta^2\sigma^2\sum_{k=1}^{K} p_k^2 + \frac{\tau\eta W_D B}{2K}\sum_{k=1}^{K} d_k + \frac{\tau(\tau-1)\sigma^2 L^2\eta^3}{1 - A} + \frac{\tau\eta A B}{1 - A}\sum_{k=1}^{K} p_k d_k$.  (46)

Finally, by rearranging (46), summing over the rounds $t = 0, 1, \dots, T-1$, dividing by $T$, and using $F(w^{(T,0)}) \ge F_{\inf}$, we finish the proof of Theorem 4.5:

$\min_t \mathbb{E}\big\|\nabla F(w^{(t,0)})\big\|^2 \le \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla F(w^{(t,0)})\big\|^2$  (47)
$\quad \le \frac{2(1-A)\big(F(w^{(0,0)}) - F_{\inf}\big)}{\tau\eta T\,[1 - 3A - W_D(1-A)]} + \frac{(1-A)W_D B \sum_{k=1}^{K} d_k}{[1 - 3A - W_D(1-A)]\,K} + \frac{2(1-A)L\eta\sigma^2\sum_{k=1}^{K} p_k^2}{1 - 3A - W_D(1-A)} + \frac{2(\tau-1)\sigma^2 L^2\eta^2}{1 - 3A - W_D(1-A)} + \frac{2AB\sum_{k=1}^{K} p_k d_k}{1 - 3A - W_D(1-A)}$  (48)
$\quad = \frac{1}{\underbrace{1 - 3A - W_D(1-A)}_{T_0}}\Big[\underbrace{\frac{2(1-A)\big(F(w^{(0,0)}) - F_{\inf}\big)}{\tau\eta T}}_{T_1} + \underbrace{\frac{(1-A)W_D B}{K}\sum_{k=1}^{K} d_k}_{T_2} + \underbrace{2(1-A)L\eta\sigma^2\sum_{k=1}^{K} p_k^2}_{T_3} + \underbrace{2(\tau-1)\sigma^2 L^2\eta^2}_{T_4} + \underbrace{2AB\sum_{k=1}^{K} p_k d_k}_{T_5}\Big]$,  (49)

where $W_D = 2K\big[\sum_{k=1}^{K}(n_k - p_k)^2\big]$, $p_k$ is the aggregation weight, $n_k$ is the relative dataset size, $d_k$ is the discrepancy level, $A = 2\tau(\tau-1)\eta^2 L^2 < 1$, $\tau$ is the number of steps in local model training, $\eta$ is the learning rate, $T$ is the total number of communication rounds in FL, $K$ is the total number of clients, and $F_{\inf}$, $B$, $L$, $\sigma$ are the constants in the assumptions.

C.3. Proof of Lemma C.1

Suppose $\{A_t\}_{t=1}^{T}$ is a sequence of random matrices satisfying $\mathbb{E}[A_t \mid A_{t-1}, A_{t-2}, \dots, A_1] = 0$. Then

$\mathbb{E}\big[\big\|\sum_{t=1}^{T} A_t\big\|_F^2\big] = \sum_{t=1}^{T} \mathbb{E}\big[\|A_t\|_F^2\big] + \sum_{i=1}^{T}\sum_{j=1, j \ne i}^{T} \mathbb{E}\big[\mathrm{Tr}\big(A_i^\top A_j\big)\big]$  (50)
$\quad = \sum_{t=1}^{T} \mathbb{E}\big[\|A_t\|_F^2\big] + \sum_{i=1}^{T}\sum_{j=1, j \ne i}^{T} \mathrm{Tr}\big(\mathbb{E}\big[A_i^\top A_j\big]\big)$  (51)
$\quad = \sum_{t=1}^{T} \mathbb{E}\big[\|A_t\|_F^2\big]$,  (52)

where (52) comes from assuming $i < j$ (without loss of generality) and using the law of total expectation: $\mathbb{E}\big[A_i^\top A_j\big] = \mathbb{E}\big[A_i^\top\, \mathbb{E}[A_j \mid A_{j-1}, \dots, A_1]\big] = 0$.

C.4. Further Analysis with a Proper Learning Rate

As is conventional in the theoretical literature, we can set the learning rate $\eta = \frac{1}{\sqrt{\tau T}}$ (Wang et al., 2020b). Note that the learning rate $\eta$ is then strongly coupled with the number of communication rounds $T$. There are two typical cases.

Case 1: $T$ is finite and relatively small. According to $\eta = \frac{1}{\sqrt{\tau T}}$, the learning rate $\eta$ will be relatively large. Then $A = 2\tau(\tau-1)\eta^2 L^2$ will be relatively large and $(1-A)$ relatively small. As a result, $T_2$ will be relatively small and $T_5$ relatively large, making $T_5$ a dominant term in the optimization bound. Thus, in this case, tuning $p_k$ to make $T_5$ smaller is the better strategy, so that $a$ and $b$ in Equation (2) are non-zero and the overall upper bound can be tighter.

Case 2: $T$ is very large or infinite. The learning rate $\eta = \frac{1}{\sqrt{\tau T}}$ will be relatively small. Then $A = 2\tau(\tau-1)\eta^2 L^2$ will be relatively small and $(1-A)$ relatively large. As a result, $T_2$ will be relatively large and $T_5$ relatively small, making $T_2$ a dominant term in the optimization bound. Thus, in this case, tuning $p_k$ to make $T_2$ smaller is the better strategy.
To achieve this, $W_D$ should be reduced to zero, and thus $a$ and $b$ in Equation (2) should be zero to make the upper bound tight.

Here, we have the following corollary:

Corollary C.2. Substituting $\eta = \frac{1}{\sqrt{\tau T}}$ into Theorem 4.5 yields the bound

$\min_t \mathbb{E}\big\|\nabla F(w^{(t,0)})\big\|^2 \le \mathcal{O}\Big(\frac{1}{\sqrt{\tau T}}\Big) + \mathcal{O}\Big(\frac{1}{T}\Big)$.

Corollary C.2 indicates that as $T \to \infty$, the optimization upper bound approaches 0 (i.e., the algorithm reaches a stationary point), which matches previous convergence results (Wang et al., 2020b). Since the ultimate goal of this paper is to achieve better performance in practice, we are more interested in the former case, in which the number of communication rounds $T$ is finite ($T$ should not be too large due to issues such as the communication burden). For this reason, we explore better aggregation weights to achieve a tighter upper bound in this paper. Nevertheless, based on Theorem 4.5 and Corollary C.2, our algorithm does converge (reach a stationary point) as the number of communication rounds goes to infinity.

C.5. Reformulated Upper Bound Minimization

Generally, a tighter bound corresponds to a better optimization result. Thus, we explore the effect of $p_k$ on the upper bound in (49). In Theorem 4.5, there are four parts related to $p_k$. First, a larger difference between $p_k$ and $n_k$ leads to a larger $W_D$, and thus a smaller denominator $T_0$ and a larger $T_2$, which tends to loosen the bound. As for $T_5$, by setting $p_k$ negatively correlated with $d_k$, $T_5$ is further reduced when clients have different discrepancy levels, i.e., different $d_k$, which tends to tighten the bound. Therefore, there could be an optimal set $\{p_k \mid k \in [K]\}$ that yields the tightest bound, and such an optimal $p_k$ should be correlated with both $n_k$ and $d_k$.

Directly minimizing this upper bound leads to a complicated expression that involves too many unknown hyper-parameters in practice. To simplify the expression, we convert the original objective from minimizing $(T_1 + T_2 + T_3 + T_4 + T_5)/T_0$ to minimizing $T_1 + T_2 + T_3 + T_4 + T_5 - \lambda T_0$, where $\lambda$ is a hyper-parameter. The converted objective still promotes the maximization of $T_0$ and the minimization of $T_1 + T_2 + T_3 + T_4 + T_5$, and thus still contributes to tightening the bound $(T_1 + T_2 + T_3 + T_4 + T_5)/T_0$. Our discrepancy-aware aggregation weights are then obtained by solving the following optimization problem:

$\min_{\{p_k\}} \;\; \frac{2(1-A)\big[F(w^{(0,0)}) - F_{\inf}\big]}{\tau\eta T} + \frac{(1-A)W_D B}{K}\sum_{m=1}^{K} d_m + 2(1-A)L\eta\sigma^2\sum_{m=1}^{K} p_m^2 + 2(\tau-1)\sigma^2 L^2\eta^2 + 2AB\sum_{m=1}^{K} p_m d_m - \lambda\big(1 - 3A - W_D(1-A)\big)$,
$\;\;\text{s.t.}\;\; \sum_{m=1}^{K} p_m = 1, \;\; p_m \ge 0$,

where $W_D = 2K\big[\sum_{m=1}^{K}(n_m - p_m)^2\big]$.

To solve this constrained optimization problem, one condition of the optimal solution is that the derivative of the following Lagrangian function equals zero:

$Q(p_k) = \frac{2(1-A)\big[F(w^{(0,0)}) - F_{\inf}\big]}{\tau\eta T} + \frac{(1-A)W_D B}{K}\sum_{m=1}^{K} d_m + 2(1-A)L\eta\sigma^2\sum_{m=1}^{K} p_m^2 + 2(\tau-1)\sigma^2 L^2\eta^2 + 2AB\sum_{m=1}^{K} p_m d_m - \lambda\big(1 - 3A - W_D(1-A)\big) + \mu\Big(\sum_{m=1}^{K} p_m - 1\Big) - \sum_{m=1}^{K}\nu_m p_m$.

Setting $\partial Q/\partial p_k = 0$, we have the following equation:

$4B(1-A)\Big(\sum_{m=1}^{K} d_m\Big)(p_k - n_k) + 4(1-A)L\eta\sigma^2 p_k + 2ABd_k + 4K\lambda(1-A)(p_k - n_k) + \mu - \nu_k = 0$,

from which we can rearrange and obtain the expression of $p_k$:

$p_k = \frac{\big[4B(1-A)\sum_{m=1}^{K} d_m + 4K\lambda(1-A)\big]\, n_k - 2ABd_k - \mu + \nu_k}{4B(1-A)\sum_{m=1}^{K} d_m + 4(1-A)L\eta\sigma^2 + 4K\lambda(1-A)}$.  (54)

Finally, we can derive the following concise expression for an optimized aggregation weight $p_k$:

$p_k \propto n_k - a \cdot d_k + b$,  (55)

where $a$ and $b$ are two constants. This theoretically shows that simple dataset-size-based aggregation may not be optimal: for a tighter upper bound, the aggregation weight $p_k$ should be correlated with both the dataset size $n_k$ and the local discrepancy level $d_k$.
This expression of the aggregation weights mitigates the limitation of standard dataset-size-weighted aggregation by assigning larger aggregation weights to clients with larger dataset sizes and smaller discrepancy levels (a minimal illustrative sketch of such a weighting rule is given below).
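To make this weighting rule concrete, the following is a minimal NumPy sketch that instantiates $p_k \propto n_k - a \cdot d_k + b$ from (55). It assumes (i) the discrepancy level $d_k$ is measured as the L2 distance between a client's category distribution and the uniform distribution, (ii) the adjusted weights are clipped at zero and renormalized, and (iii) illustrative values for the hyper-parameters $a$ and $b$; the exact discrepancy metric and normalization used in the main paper's Equation (2) may differ.

```python
import numpy as np

def discrepancy_level(label_counts):
    """d_k: L2 distance between a client's category distribution and the
    uniform distribution (one possible choice of discrepancy metric)."""
    dist = np.asarray(label_counts, dtype=float)
    dist = dist / dist.sum()
    uniform = np.full_like(dist, 1.0 / len(dist))
    return float(np.linalg.norm(dist - uniform))

def disco_weights(dataset_sizes, discrepancies, a=0.5, b=0.1):
    """Discrepancy-aware aggregation weights following p_k ∝ n_k - a * d_k + b,
    clipped at zero and renormalized to sum to one."""
    n = np.asarray(dataset_sizes, dtype=float)
    n = n / n.sum()                      # relative dataset sizes n_k
    d = np.asarray(discrepancies, dtype=float)
    p = np.maximum(n - a * d + b, 0.0)   # Eq. (55) with non-negativity
    return p / p.sum()

# Illustrative example: three clients with different sizes and label skews.
counts = [[500, 450, 480, 470], [900, 50, 30, 20], [300, 300, 300, 300]]
sizes = [sum(c) for c in counts]
d = [discrepancy_level(c) for c in counts]
print(disco_weights(sizes, d))  # the highly skewed client is down-weighted
```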