Published as a conference paper at ICLR 2024

FEDCDA: FEDERATED LEARNING WITH CROSS-ROUND DIVERGENCE-AWARE AGGREGATION

Haozhao Wang1, Haoran Xu2, Yichen Li3, Yuan Xu1, Ruixuan Li3, Tianwei Zhang4
1S-Lab, Nanyang Technological University
2Zhejiang University
3Department of Computer Science, Huazhong University of Science and Technology
4Nanyang Technological University
{hz_wang, rxli}@hust.edu.cn, tianwei.zhang@ntu.edu.sg
Haozhao Wang, Haoran Xu, and Yichen Li contribute equally to this work. Ruixuan Li is the corresponding author.

ABSTRACT

In Federated Learning (FL), model aggregation is pivotal. It involves a global server iteratively aggregating clients' locally trained models in successive rounds without accessing private data. Traditional methods typically aggregate the local models from the current round alone. However, due to the statistical heterogeneity across clients, the local models from different clients may be greatly diverse, making the obtained global model incapable of maintaining the specific knowledge of each local model. In this paper, we introduce a novel method, FedCDA, which selectively aggregates cross-round local models, decreasing discrepancies between the global model and local models. The principle behind FedCDA is that, due to the different global model parameters received in different rounds and the non-convexity of deep neural networks, the local models from each client may converge to different local optima across rounds. Therefore, for each client, we select a local model from its several recent local models obtained in multiple rounds, where the local model is selected by minimizing its divergence from the local models of other clients. This ensures the aggregated global model remains close to all selected local models to maintain their data knowledge. Extensive experiments conducted on various models and datasets reveal that our approach outperforms state-of-the-art aggregation methods.

1 INTRODUCTION

Federated Learning (FL) has emerged as a key framework for training deep neural networks (DNNs) through client collaboration without the need to share original datasets (McMahan et al., 2017b; Wang et al., 2022; Li et al., 2022b). It has been extensively utilized in areas like medical image processing (Liu et al.; Guo et al.; Xu et al.) and recommendation systems (Ramaswamy et al., 2019; Ammad-ud-din et al., 2019). FL is an iterative procedure in which each round involves local model training across individual clients, followed by aggregating these models centrally on a server (McMahan et al., 2017a). In this paper, we focus on the aggregation step of FL, which is the critical step to obtain the global model from multiple local models. The typical aggregation method is FedAvg, which computes the coordinate-wise weighted average of the parameters of local models, with each weight set to the client's ratio of the data size (McMahan et al., 2017b). Although the implementation of this method is straightforward, some works (Yurochkin et al., 2019a; Li et al., 2022b; Liu et al., 2022; Wang et al., 2020a) argue that the coordinate-wise average reduces performance under Non-IID (i.e., not independently and identically distributed) data among clients. Specifically, they identify that the parameter ordering of different local models may vary due to the permutation invariance of neural network (NN) parameters. Thus, they propose re-ordering the parameters before applying the weighted average.
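For concreteness, the coordinate-wise, data-size-weighted average used by FedAvg can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's code: local models are given as PyTorch state dicts with identical keys, and `local_states` and `num_samples` are hypothetical names.

```python
import torch

def fedavg_aggregate(local_states, num_samples):
    """Coordinate-wise weighted average of client models (FedAvg-style).

    local_states: list of state_dicts (one per client), same keys and shapes.
    num_samples:  list of local dataset sizes used as aggregation weights.
    """
    total = float(sum(num_samples))
    weights = [n / total for n in num_samples]  # data-size ratios
    global_state = {}
    for key in local_states[0]:
        # weighted sum of the same-position parameter across clients
        global_state[key] = sum(w * s[key].float() for w, s in zip(weights, local_states))
    return global_state
```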
Another type of work considers that the Non-IID data also affects the aggregation weights and proposes adaptively setting the weights using a learnable approach (Li et al., 2023). Although these methods have achieved great success separately, they mainly aggregate the local models from the current single round, which may limit the improvement of FL performance.

Orthogonal to these works, in this paper we focus on the aggregation of cross-round local models to further unleash the potential aggregation performance. Intuitively, to acquire the data knowledge of some specific client, the global model needs to be close to that client's locally trained model (Kirkpatrick et al., 2017). Nevertheless, the local models of different clients in the same single round may have a large divergence from each other due to statistical heterogeneity. Thus, as shown in Figure 1(a), the aggregated global model may greatly deviate from these local models. To tackle this challenge, we consider a common fact that each client is usually able to achieve convergence in different rounds after the startup training stage, especially when FL prefers a larger interval for the local training process to save communication cost (Sun et al., 2023a). In addition, due to receiving different global models in different rounds and the existence of multiple local optima in deep neural networks (Wu et al., 2017; Kawaguchi, 2016; Xie et al., 2021), each client often converges to different models in different rounds, each of which can usually learn the local data well, especially when training with the most advanced optimizers (Loshchilov & Hutter, 2019; Chaudhari et al., 2017). Therefore, a natural idea is that the global model can also fit the local data of some specific client once it approaches any of that client's local models from multiple rounds. Motivated by this, the global model can essentially be obtained by aggregating selected local models from different rounds to reduce their divergence and maintain their knowledge.

Figure 1: Comparison of different aggregation methods on a two-client FL. (a) Single round: the global model w_{t+1} is the aggregation of the local model w^1_{t+1} of client 1 and w^2_{t+1} of client 2 in the same (t+1)-th round. (b) Cross round: w_{t+1} is the aggregation of the t-th round local model w^1_t and the (t+1)-th round local model w^2_{t+1}. The w_{t+1} obtained through cross-round aggregation is close to the local optimum 1 of client 1 and the local optimum 2 of client 2, while the single-round aggregation is distant from any local optima. A practical example is in Appendix A.

Based on the above motivation, we propose a novel aggregation method named FedCDA, which selectively aggregates cross-round local models. More specifically, we design a divergence-aware selection strategy that selects local models from multiple rounds with minimum divergence to their aggregated model and only aggregates the selected local models to obtain the global model. In this way, as shown in Figure 1(b), the global model approaches the selected local models and thus maintains the data knowledge of clients.
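To make the cross-round intuition concrete, here is a toy one-dimensional illustration of our own construction (not taken from the paper): each client caches its local models from two rounds, and the combination with the smallest divergence mixes models from different rounds, mirroring Figure 1(b).

```python
from itertools import product

# Cached 1-D "models" per client over two rounds (toy numbers, K = 2).
cached = {
    "client1": [2.1, 0.9],  # round t and round t+1 local models of client 1
    "client2": [3.4, 2.0],  # round t and round t+1 local models of client 2
}

def divergence(models):
    """Mean squared distance of the selected models to their average."""
    mean = sum(models) / len(models)
    return sum((m - mean) ** 2 for m in models) / len(models)

# Enumerate all cross-round combinations and keep the one with minimal divergence.
best = min(product(*cached.values()), key=divergence)
print(best, divergence(best))
# Picks (2.1, 2.0): client 1's round-t model together with client 2's round-(t+1)
# model, which lie much closer together than either same-round pair.
```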
Considering that the selection problem is a combinatorial optimization problem with a large search space, we further design an approximate version that selects local models in a batch-wise manner to reduce the selection cost. Then, we establish theories to provide a better understanding and to guarantee the convergence of our method. We conduct extensive experiments on various datasets, and the results show that FedCDA outperforms state-of-the-art baselines. Our contributions are:

- To the best of our knowledge, this paper is the first to study the aggregation of cross-round local models. We identify that cross-round local models among clients may have a smaller divergence than those in a single round, and selectively aggregating them can make the global model approach local models more closely, thus maintaining their local data knowledge.
- We propose a new cross-round aggregation method named FedCDA. It obtains the global model by aggregating local models selected from multiple rounds based on the criterion of minimum divergence. Besides, we design an approximation strategy to reduce the cost of selection.
- We establish comprehensive theories for our method. Specifically, we provide theoretical insights for understanding our algorithm and show that the approximate selection error is bounded by the convergence of the local models. Besides, we also prove the convergence of our method.
- We conduct extensive experiments over various deep-learning models and datasets. The superiority of FedCDA is demonstrated by comparing our proposed aggregation method with traditional aggregation methods, over which it achieves the best performance.

2 RELATED WORKS

Many previous methods have been proposed to improve the performance of FL. For example, some works propose regularizing the update of the local model to mitigate the Non-IID issue (Li et al., 2020; Sun et al., 2023b). Orthogonal to these works, this paper focuses on the aggregation of local models. Generally, there are three main types of FL aggregation methods.

Aggregation Weights. One typical line of research determines adaptive aggregation weights (Li et al., 2022a; Rehman et al., 2023). For instance, AUTO-FEDAVG (Xia et al., 2021) tailors weights based on distinct institutional medical datasets to enable personalized medicine, whereas L2C (Li et al., 2022a) identifies similar peers in decentralized FL by adapting weights using local data. While these approaches have proven effective, they primarily emphasize the creation of personalized models for individual clients. In contrast, our work centers on acquiring a global model. Recently, FedLAW (Li et al., 2023) aims to obtain a global model by learning the weights. Nevertheless, all of these methods rely on a proxy dataset on the server, while our aggregation method does not.

Model Fusion. Due to the permutation invariance of neural network parameters, some works consider that the parameter ordering of different local models across clients may vary, especially when their data is Non-IID (Yu et al., 2021; Singh & Jaggi, 2020; Li et al., 2022b). In this case, the coordinate-wise average of local models leads to a mismatch between the same-position parameters of cross-client local models, degrading the performance of the aggregated model.
Hence, these works seek to fuse the local models by re-ordering the parameters to match them across clients, e.g., using the Hungarian matching algorithm (Wang et al., 2020a), a Bayesian approach (Yurochkin et al., 2019b), or a graph matching algorithm (Liu et al., 2022).

Federated Distillation. Different from the two above types of methods that compute the average of model parameters, federated distillation employs ensemble distillation, which averages the logits of the local models instead of their parameters (Wu & Gong, 2021; Guo et al., 2020; Bistritz et al., 2020; Wang et al., 2023). Notably, Lin et al. (2020); Chen & Chao (2021) initially introduced a technique that harnesses knowledge distillation on the server side. This approach transfers knowledge from multiple local models to the global model using an unlabeled proxy dataset. However, these methods depend on the availability of an auxiliary dataset on the server, which may not be present in real-world scenarios. In response to this limitation, recent studies (Zhu et al., 2021; Zhang et al., 2022; Wang et al., 2023) proposed replacing the proxy dataset with generated data, enabling ensemble federated distillation in a data-free manner. In this paper, we focus on the average of model parameters, which is orthogonal to these works.

3 PROBLEM FORMULATION

Federated learning allows N clients with a server to solve the following optimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{N}\sum_{n=1}^{N} F_n(w), \quad \text{s.t. } F_n(w) = \mathbb{E}_{\xi \sim D_n} f_n(w; \xi) \qquad (1)$$

to obtain the global model w. The function $F_n(w): \mathbb{R}^d \to \mathbb{R}$ denotes the expected loss over the data distribution of client n. $D_n$ denotes the data distribution of the n-th client. $f_n(w; \xi)$ denotes the loss value with respect to model w and random data sample $\xi$. Without causing confusion, we use $f_n(w)$ to denote a mini-batch version of $f_n(w; \xi)$ for simplicity. Besides, we make the following assumptions on these objectives, which are widely adopted in FL (Dinh et al., 2020; Wang et al., 2020b).

Assumption 1 (L-smoothness). The objective function $F_n$ is L-smooth with Lipschitz constant L > 0, i.e., $\|\nabla F_n(w) - \nabla F_n(w')\|_2 \le L\|w - w'\|_2$ for all $w, w'$.

Assumption 2 (Bounded Variance). For all parameters w, the variance of the local stochastic gradient in each client is bounded by $\sigma_l^2$: $\mathbb{E}(\|\nabla f_n(w) - \nabla F_n(w)\|^2) \le \sigma_l^2$. Besides, the global variance of gradients among clients is bounded by $\sigma_g^2$: $\frac{1}{N}\sum_{n=1}^{N}\|\nabla F_n(w) - \nabla F(w)\|^2 \le \sigma_g^2$.

Assumption 3 (Bounded Gradient). For all parameters w, the stochastic gradient with respect to the loss is bounded by a constant M: $\mathbb{E}(\|\nabla f_n(w)\|^2) \le M^2$.

4 METHODOLOGY

In this part, we introduce our proposed aggregation method. To minimize the objective (1), we first apply Assumption 1 to each local loss function $F_n(w)$:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{N}\sum_{n=1}^{N} F_n(w) \le \frac{1}{N}\sum_{n=1}^{N}\left[ F_n(w^n) + \nabla_{w^n} F_n(w^n)^{T}(w - w^n) + \frac{L}{2}\|w - w^n\|_2^2 \right]. \qquad (2)$$

Then, we turn to minimize the upper bound of the objective function (1), which corresponds to the right-hand side of inequality (2), by aggregating local models $w^n$ of each client n from multiple rounds. Given the set of the recent K local models $W^n_t = \{w^n_{t_1}, \ldots, w^n_{t_K}\}$ of each client n obtained in multiple rounds, the server seeks to solve the following objective:

$$\min_{w \in \mathbb{R}^d,\, w^1 \in W^1_t, \ldots, w^N \in W^N_t} \frac{1}{N}\sum_{n=1}^{N}\left[ F_n(w^n) + \nabla_{w^n} F_n(w^n)^{T}(w - w^n) + \frac{L}{2}\|w - w^n\|_2^2 \right]. \qquad (3)$$

The problem (3) is strongly convex in terms of w for any combination of the local models $w^n$. Therefore, the global model w has a closed-form solution with respect to the local models $w^n$:

$$w = \frac{1}{N}\sum_{n=1}^{N} w^n - \frac{1}{LN}\sum_{n=1}^{N} \nabla_{w^n} F_n(w^n). \qquad (4)$$
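A minimal sketch of the closed-form aggregation in equation (4), assuming each client's selected model (and optionally its full local gradient) is available as a flat tensor. The names `selected_models`, `local_grads`, and `L` are hypothetical, and passing `local_grads=None` corresponds to the approximation ∇F_n(w^n) ≈ 0 introduced next.

```python
import torch

def aggregate_closed_form(selected_models, local_grads=None, L=1.0):
    """Closed-form minimizer of the quadratic upper bound in (3), cf. equation (4):
        w = (1/N) * sum_n w^n  -  1/(L*N) * sum_n grad F_n(w^n).

    selected_models: list of N flat parameter tensors, one per client.
    local_grads:     list of N flat gradient tensors, or None to apply the
                     approximation grad F_n(w^n) ~= 0.
    """
    n_clients = len(selected_models)
    w = torch.stack(selected_models).mean(dim=0)          # plain average of selected models
    if local_grads is not None:
        w = w - torch.stack(local_grads).sum(dim=0) / (L * n_clients)
    return w
```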
Given equation (4), the problem (3) is equivalent to a combinatorial optimization problem that selects the local models $w^n$. However, solving this problem requires computing the full gradient $\nabla_{w^n} F_n(w^n)$ on the local dataset of each client n, leading to extra expensive computation and communication cost. Considering that the local model $w^n$ may nearly approach one of the local optima or saddle points $w^{n,*}$, especially at the end of the FL training stage or when the number of local epochs is large, we take the approximation $\nabla_{w^n} F_n(w^n) \approx 0$. The problem (3) can then be re-formulated as:

$$\min_{w^1 \in W^1_t, \ldots, w^N \in W^N_t} \frac{1}{N}\sum_{n=1}^{N} F_n(w^n) + \frac{L}{2N}\sum_{n=1}^{N} \|w - w^n\|_2^2, \quad \text{s.t. } w = \frac{1}{N}\sum_{n=1}^{N} w^n, \qquad (5)$$

$$\min_{w^1 \in W^1_t, \ldots, w^N \in W^N_t} \frac{1}{N}\sum_{n=1}^{N} F_n(w^n) + \frac{L}{2N}\sum_{n=1}^{N} \|w^n\|_2^2 - \frac{L}{2}\|w\|_2^2, \quad \text{s.t. } w = \frac{1}{N}\sum_{n=1}^{N} w^n. \qquad (6)$$

Equation (5) reveals that the criterion for choosing local models can be understood as selecting cross-round local models that exhibit minimal divergence from each other, i.e., minimal variance $\frac{1}{N}\sum_{n=1}^{N}\|w - w^n\|_2^2$, particularly when the difference in loss $F_n(w^n)$ tends to be small among clients. Although solving (5) can obtain the optimal combination of cross-round local models, the computation complexity and memory cost are large. An approach to reducing the computation cost is to utilize the equivalent version of (5), i.e., (6), which is derived using $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$ with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. In this way, the squared l2 norm $\|w^n\|_2^2$ can be cached once it is computed to avoid repeated computations. Yet, the search space over all combinations is still large. Denoting the model size as C, we have the following conclusion.

Proposition 1. The computation complexity of solving (6) is $O(K^N)$ and the memory cost is KNC.

Due to the exponential computation complexity, directly solving (6) is not affordable even by a cloud for large K and N. Therefore, we further propose selecting local models with approximately minimum divergence to reduce the cost. Our strategy includes two steps.

First, selection for partial clients. We propose only selecting local models from the P clients $n \in P_t$ that participate in the current round t and fixing the local models of the other clients $n \in N \setminus P_t$ to those selected in previous rounds:

$$\min_{w^n \in W^n_t,\, n \in P_t} \frac{1}{N}\Big[\sum_{n \in P_t} L_n(w^n) + \underbrace{\sum_{n \in N \setminus P_t} L_n(w^n)}_{\text{fixed in the current round}}\Big] - \frac{L}{2}\|w\|_2^2, \qquad (7)$$

where the aggregated model w remains $w = \frac{1}{N}\sum_{n=1}^{N} w^n$ and $L_n(w^n) = F_n(w^n) + \frac{L}{2}\|w^n\|_2^2$. As the local models of non-participating clients are fixed in the current round, this leads to a great reduction in computational complexity and memory requirements.

Proposition 2. The computation complexity of solving (7) is $O(K^P)$ and the memory cost is KPC.

Second, batch-based selection. To further reduce the computation complexity, we propose selecting local models in a stochastic greedy manner. Specifically, we randomly group the participating clients into B equal-size batches $P_t = P^1_t \cup \cdots \cup P^B_t$ and select local models for these clients batch by batch. When selecting local models for clients in the b-th batch, the local models of clients in batches 1 to b-1 are fixed and those in batches b+1 to B are excluded, and the objective is:

$$\min_{w^n \in W^n_t,\, n \in P^b_t} \frac{1}{N - P + bP/B}\Big[\sum_{n \in P^b_t} L_n(w^n) + \underbrace{\sum_{n \in (N \setminus P_t) \cup P^1_t \cup \cdots \cup P^{b-1}_t} L_n(w^n)}_{\text{fixed in the }b\text{-th subset selection}}\Big] - \frac{L}{2}\|w\|_2^2, \qquad (8)$$

where the model w is aggregated by computing the average of the local models of non-participating clients and of the 1-st to b-th batches of participating clients, i.e., $w = \frac{1}{N - P + bP/B}\sum_{n \in (N \setminus P_t) \cup P^1_t \cup \cdots \cup P^b_t} w^n$.
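The batch-wise selection in (7) and (8) can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's implementation: models are flat tensors, clients report their last-epoch local losses F_n(w^n), and L is treated as a constant (our experiments simply set L = 1). The names `select_batchwise`, `cached`, `cached_loss`, and `fixed` are hypothetical.

```python
import itertools
import torch

def select_batchwise(cached, cached_loss, fixed, participants, n_batches, L=1.0):
    """Greedy batch-wise sketch of the selection objective (8).

    cached:       {client_id: [K flat model tensors]} for sampled clients.
    cached_loss:  {client_id: [K local loss values F_n(w^n)]} reported by clients.
    fixed:        {client_id: flat model tensor}, previously selected models of
                  non-participating clients (kept fixed this round).
    participants: sampled client ids, split into n_batches roughly equal batches.
    Returns the selected model per participating client.
    """
    chosen = dict(fixed)                                  # fixed + already-selected models
    batches = [participants[i::n_batches] for i in range(n_batches)]
    for batch in batches:
        best_combo, best_obj = None, float("inf")
        # Enumerate the K^{|batch|} cross-round combinations inside this batch.
        for combo in itertools.product(*(range(len(cached[c])) for c in batch)):
            trial = dict(chosen)
            trial.update({c: cached[c][k] for c, k in zip(batch, combo)})
            models = torch.stack(list(trial.values()))
            w = models.mean(dim=0)                        # aggregate over fixed + chosen models
            # Losses of already-fixed models are constant within a batch, so only
            # the current batch's losses affect the argmin.
            loss_term = sum(cached_loss[c][k] for c, k in zip(batch, combo)) / len(models)
            # Divergence term from (6): (L/2) * mean ||w^n||^2 - (L/2) * ||w||^2.
            div_term = 0.5 * L * (models.pow(2).sum(dim=1).mean() - w.pow(2).sum())
            obj = loss_term + div_term
            if obj < best_obj:
                best_obj, best_combo = obj, combo
        chosen.update({c: cached[c][k] for c, k in zip(batch, best_combo)})
    return {c: chosen[c] for c in participants}
```

In a full implementation, the squared norms ||w^n||^2 would be cached once per model, as noted above, so that only the aggregate w changes across candidate combinations.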
This can also be viewed as selecting local models that are close to those of clients participating in previous rounds, thus maintaining the memory of their data. The computation complexity is further reduced.

Proposition 3. The computation complexity of solving (8) is $O(B K^{P/B})$ and the memory cost is KPC.

While the computational complexity remains exponential, we retain the flexibility to manually adjust the value of B for control. In practice, we can keep P/B constant, effectively reducing the complexity to an acceptable level. In an extreme scenario, we can set B = P, resulting in linear complexity with respect to the value of K. Our experiments have shown that even a small value of K, such as K = 3, produces satisfactory performance, rendering the computational complexity acceptable for practical applications. Additionally, the memory cost KPC is also manageable when K is small, because the number of sampled clients P is usually a small ratio of the total clients. The local models of non-participating clients can be stored on disk, which has sufficient storage space. For example, a 1TB hard drive can store approximately 20,000 copies of ResNet-18, which is widely adopted on the edge. Given that even a mobile phone is equipped with 1TB storage, we believe that the cost is within the budget of the aggregation node, which is typically hosted by a cloud.

Compared to other aggregation methods like weight setting (Li et al., 2023) or ensemble distillation (Lin et al., 2020; Chen & Chao, 2021), our approach has a distinct advantage: we do not depend on an additional public dataset, which can be challenging to acquire due to the requirement for a distribution similar to the global dataset. Moreover, our method does not introduce higher computational complexity than existing methods. Many existing aggregation methods involve performing gradient descent (Li et al., 2023; Chen & Chao, 2021) or solving maximum bipartite matching problems (Wang et al., 2020a), which can be computationally intensive.

4.1 FEDCDA ALGORITHM

The complete procedure of our method is given in Algorithm 1, assuming the base algorithm is FedAvg (McMahan et al., 2017a). FedCDA differs from FedAvg primarily in lines 9 and 10 of its implementation. When it receives local models from a subset of clients, the server updates its cached local models. This update involves replacing the oldest round's local model with the most recently received one, as indicated in line 9. After this update, the server selects local models for aggregation by solving the problem (8) in line 10. Finally, the global model is obtained by averaging both selected and fixed local models to retain knowledge contributed by all clients. In practice, we usually apply FedCDA after a warmup training stage using FedAvg or other baselines to ensure that the local models can approach convergence during the local training process. It is also worth noting that most existing methods employ improved techniques over the step of line 11, e.g., re-setting aggregation weights (Li et al., 2023) or using ensemble distillation (Lin et al., 2020; Chen & Chao, 2021), which are orthogonal to ours.
We first prove that the selection error of using the approximated objective Published as a conference paper at ICLR 2024 Algorithm 1 Fed CDA Algorithm Input: Number of cached local models K, number of subsets B, learning rate 畏, number of sampling clients P, and total communication rounds T. Output: Converged global model w. 1: Initialize the model parameter w0; 2: Distribute w0 to all clients; 3: for each communication round t {1, 2, ..., T} do 4: Randomly select a set of clients Pt; 5: for each selected client n Pt in parallel do 6: Initialize the local model with the received global model: wn = wt; 7: Solve the local problem by updating wn for E local mini-batch SGD steps and accumulate the local loss Fn(wn) in the last local epoch: wn = wn 畏 wnfn(wn); 8: Update the cached set Wn t for n Pt by replacing the oldest model with received wn; 9: Select local models wn for each client n Pt by solving the problem (8); 10: Aggregate both selected and fixed local models to obtain: wt+1 = 1 N PN n=1 wn; return global model w T (5) to the exact objective (3) is bounded by the convergence degree of local models. Then, we present the benefits of Fed CDA on idealized cases and conditions. Finally, we establish the convergence theories for our algorithm. Theorem 1 (Approximation Selection Error) Define the local optima closest to wn t as wn, t and the maximum distance between any two local optima that are close to cached local models across clients and rounds as D, i.e., D = maxn [N],n [N],n =n ,i [K],i [K]( wn, ti wn , t i ). If the distance between the local model wn t and its approximated critical point wn, t is limited by a constant 系 > 0, i.e., wn t wn, t 系, then the disparity in the global loss between aggregating local models selected using (5) and (3) is constrained by 蔚 4L系2 + 2LD系. The proof can be found in Appendix B.1. The theorem indicates that the approximation error of using (5) to (3) becomes smaller when local models are convergent. It implicitly reveals that our algorithm may obtain a better global model when the local models approach convergence, i.e., with large local iterations or large warmup rounds, which are verified by our experimental results in Figure 2(c) and Figure 6.2. Further, we seek to show that minimizing (3) leads to a lower global loss than naively aggregating the local models in the newest current round. We define the divergence among local models wn, n = 1 . . . , N as Var(wn) = 1 N PN n=1 wn wn 2 2. We denote wt , wn t , n = 1, . . . , N as the solution of objective (3). Similarly, we denote wt as the t-th round global model aggregated from all t-th round local models wn t , wn t , n = 1, . . . , N. Subsequently, we demonstrate that the global loss F(wt ) can be assured to be lower than F(wt) under the condition that the divergence Var(wn t ) is less than Var(wn t ) by a certain value. Theorem 2 (Impact of Divergence of Local Optima) Let the definition of the local optima wn, t and distance 系 be the same as Theorem 1. Consider the loss function Fn(w) is strongly convex with a parameter 碌 within the region spanning from the local optima wn, t to the global model wt and the local loss achieves equivalent values on local optima in different rounds, i.e., Fn(wn, t ) = Fn(wn, t ). If the divergence among selected local models is small enough, i.e., satisfying Var(wn, t ) 碌 LVar(wn, t ) ( 碌 L + 1)系2, then the global loss of using selected global model wt is smaller than that of using t-th round global model wt, i.e., F(wt ) F(wt). 
The proof can be found in Appendix B.2. Although the conditions of Theorem 2 may be idealized in practical settings, it provides some insights for understanding our method. Smaller divergence among local models leads to a smaller loss of the aggregated model. An ideal case is that the divergence is reduced to 0, where the local optima of all local models across clients coincide. In fact, such an ideal case can widely exist in over-parameterized deep neural networks, where a large model may achieve zero loss on the local dataset of each client and hence is a local optimum for all clients. Therefore, our method may prefer large models. The experimental results in Table 1 also verify this statement, where the improvement is higher for larger models. Although our motivation mainly comes from non-convex functions with multiple local optima, our algorithm is also applicable to convex cases. More discussions can be found in Appendix B.3.

Finally, we present the convergence of our algorithm. Note that even though our algorithm does not achieve faster theoretical convergence using existing optimization analytical tools, it demonstrates great empirical benefits.

Theorem 3 (Convergence on Non-convex Functions). Consider problem (1) under Assumptions 1, 2, and 3. If the learning rate $\eta$ satisfies $0 < \eta \le \frac{1}{LE}$, then the global model $w^\star_t$ solved by (3) achieves asymptotic convergence, i.e., $\frac{1}{T}\sum_{t=1}^{T}\|\nabla F(w^\star_t)\|_2^2 \to 0$ as T grows.

Ideas of Proof: Our proof mainly includes two parts. First, we prove that the difference between the loss of the global model $w^\star_t$ obtained by (3) and that of the reference global model $w_t$ obtained by aggregating the newest local models is bounded. Then, we prove that the loss of the global model $w_t$ achieves convergence, which in turn indicates the convergence of the global model $w^\star_t$. Detailed derivations are deferred to Appendix B.4.

6 EVALUATION

6.1 EXPERIMENTAL SETUP

Datasets and Models: We consider three popular datasets in our experiments: Fashion-MNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky et al., 2009) and CIFAR-100 (Krizhevsky et al., 2009), which contain 10, 10, and 100 classes, respectively. For CIFAR-10 and CIFAR-100 we use ResNet-18 (He et al., 2016) as the backbone to train and test performance, while for Fashion-MNIST we use a simple CNN instead. The simple CNN has two 5x5 convolution layers (the first with 32 channels, the second with 64, each followed by 2x2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final fully connected output layer.

Data Partition: To evaluate the performance of our work in a heterogeneous scenario, we adopt two Non-IID data partition methods called Shards (McMahan et al., 2017a) and Dirichlet (Lin et al., 2020). In the Shards setting, the sorted samples are shuffled into N × S shards and assigned to the N clients randomly, with each client owning an equal number of shards. In the second setting, the data distribution over clients follows a Dirichlet distribution, with α characterizing the degree of heterogeneity. We set α of Dirichlet to {0.1, 0.3, 0.5} and the number of shards per client to {2, 4, 8}.

Baselines: Besides FedAvg (McMahan et al., 2017a), we also compare the proposed method against various types of efficient federated learning approaches in our experiments.
The first main type includes typical non-aggregation methods that speed up FL in the local training process or tune the learning rate, including FedProx (Li et al., 2020), FedExP (Jhunjhunwala et al., 2023), and FedSAM (Qu et al., 2022). The methods of the second type fall into three main representative aggregation categories: ensemble distillation, including FedDF (Lin et al., 2020) and FedGEN (Chen & Chao, 2021); model fusion, including FedMA (Wang et al., 2020a) and GAMF (Liu et al., 2022); and weight setting, including FedLAW (Li et al., 2023).

Implementation: We implement all experiments in a simulation environment based on PyTorch 2.0 and 8 NVIDIA GeForce RTX 3090 GPUs. We use 20 clients in total and randomly choose 20% of them each round for local training. We set the local epochs to 20, the batch size to 64, and the learning rate to 1e-3. We employ the SGD optimizer with momentum of 1e-4 and weight decay of 1e-5 for all methods and datasets. We set the number of global communication rounds to 200. Each experiment setting is run twice, and we take the accuracy of each run's final 10 rounds and calculate the average value and standard deviation. For our method, we set the client memory size K to 3, the batch number B to 3, and the number of warmup rounds to 50. Besides, we simply assume L = 1 for all clients to save computation cost.

6.2 EXPERIMENT RESULTS

Performance Comparison. We report the comparison results with other baselines in Table 1. Results with a broader range of hyperparameters can be found in Appendix D. In order to demonstrate the generalization of our method, we compare the methods under two different Non-IID settings, Shards and Dirichlet distribution, and apply different data distributions to different datasets. We can see that our proposed FedCDA achieves the best performance in almost all settings, which demonstrates the effectiveness and benefit of cross-round divergence-aware aggregation.

Table 1: The comparison of test accuracy of different methods. The best results are bolded.
Shards (S):
Method | Fashion-MNIST S=2 | S=4 | S=8 | CIFAR-10 S=2 | S=4 | S=8 | CIFAR-100 S=2 | S=4 | S=8
FedAvg | 64.69±5.62 | 74.78±4.55 | 76.81±3.33 | 28.10±3.96 | 59.83±2.94 | 70.87±1.91 | 11.86±1.19 | 15.87±1.00 | 21.91±0.55
FedProx | 64.21±4.11 | 70.76±3.89 | 72.19±4.16 | 26.39±4.16 | 53.03±2.29 | 70.91±1.87 | 10.87±0.58 | 15.37±0.46 | 24.16±0.33
FedExP | 65.24±3.47 | 69.31±4.62 | 76.66±5.04 | 26.84±4.75 | 59.31±3.61 | 69.53±1.94 | 11.59±0.81 | 16.47±0.99 | 23.58±1.36
FedSAM | 59.28±0.15 | 75.19±0.10 | 76.07±0.09 | 29.31±0.32 | 57.12±0.08 | 61.56±0.31 | 11.19±0.16 | 15.95±0.15 | 22.44±0.16
FedDF | 64.72±2.11 | 74.16±1.52 | 85.51±0.95 | 32.37±2.39 | 60.08±5.67 | 71.52±2.67 | 11.63±0.67 | 17.13±1.12 | 25.84±1.02
FedGEN | 63.50±3.27 | 69.42±4.09 | 80.17±4.71 | 27.21±3.12 | 57.16±2.71 | 68.93±1.75 | 10.07±0.19 | 15.26±0.29 | 21.49±0.17
FedMA | 64.71±4.92 | 74.98±5.03 | 77.13±4.10 | 28.61±1.39 | 59.97±0.96 | 70.91±1.02 | 11.89±0.57 | 15.90±0.92 | 22.02±0.82
GAMF | 64.97±3.93 | 75.21±4.05 | 77.34±3.78 | 28.92±1.52 | 60.23±1.93 | 71.44±1.75 | 11.98±0.99 | 16.76±0.77 | 24.15±0.49
FedLAW | 60.34±4.39 | 73.93±4.91 | 77.53±3.52 | 26.32±2.80 | 46.81±3.61 | 61.08±2.61 | 11.57±1.61 | 15.99±0.49 | 22.37±0.68
Ours | 66.30±0.07 | 76.59±0.25 | 78.99±0.13 | 34.97±0.31 | 62.81±0.28 | 72.04±0.23 | 12.20±0.13 | 19.98±0.25 | 28.16±0.30

Dirichlet (α):
Method | Fashion-MNIST α=0.1 | α=0.3 | α=0.5 | CIFAR-10 α=0.1 | α=0.3 | α=0.5 | CIFAR-100 α=0.1 | α=0.3 | α=0.5
FedAvg | 71.81±5.61 | 75.97±3.21 | 79.73±1.94 | 50.43±1.68 | 61.11±2.68 | 67.37±1.69 | 30.13±0.70 | 35.73±0.56 | 38.86±0.35
FedProx | 70.44±3.87 | 72.17±4.10 | 75.24±2.19 | 38.98±5.91 | 61.64±1.92 | 70.16±2.03 | 32.96±1.18 | 40.81±0.41 | 42.53±0.48
FedExP | 73.42±4.22 | 76.57±3.39 | 80.22±3.78 | 60.63±4.32 | 70.22±2.40 | 74.37±1.91 | 36.76±1.18 | 44.18±0.53 | 47.80±0.58
FedSAM | 71.65±0.07 | 75.91±0.06 | 77.67±0.10 | 49.96±0.20 | 59.53±0.19 | 64.54±0.21 | 21.54±0.12 | 24.72±0.18 | 28.59±0.19
FedDF | 80.03±1.04 | 84.42±0.62 | 86.84±1.93 | 54.28±2.39 | 69.85±5.67 | 73.76±2.67 | 34.76±0.67 | 39.42±1.12 | 42.31±1.02
FedGEN | 73.02±1.87 | 77.48±3.50 | 81.76±4.21 | 47.09±3.12 | 64.90±2.71 | 68.74±1.75 | 29.02±0.19 | 38.54±0.29 | 40.81±0.17
FedMA | 71.87±4.28 | 75.89±4.15 | 80.12±3.23 | 49.98±2.01 | 61.32±2.17 | 68.42±1.95 | 30.02±0.58 | 36.21±0.83 | 39.55±0.52
GAMF | 72.11±5.16 | 76.24±3.67 | 80.55±2.06 | 51.21±1.37 | 63.45±1.03 | 70.14±1.81 | 31.12±0.69 | 37.26±0.78 | 41.25±0.74
FedLAW | 71.93±8.23 | 76.88±2.80 | 79.98±1.09 | 48.91±3.59 | 61.50±2.29 | 67.08±1.75 | 32.01±2.61 | 38.80±2.20 | 40.11±1.17
Ours | 78.63±0.14 | 84.67±0.12 | 87.01±0.08 | 62.46±0.22 | 70.27±0.29 | 74.96±0.17 | 39.38±0.25 | 45.86±0.22 | 49.31±0.22

Figure 2: (a) shows the effect of different aggregation strategies. (b) shows the impact of the memory size K on FedCDA. (c) and (d) present the impact of local epochs on FedCDA and FedAvg.

Specifically, on relatively larger datasets such as CIFAR-10 with Dirichlet 0.1, FedCDA with ResNet-18 achieves 62.46% accuracy, whereas the best baseline method FedExP achieves 60.63%. In addition, FedCDA with the simple CNN also makes improvements on the relatively smaller dataset, although the improvement is smaller than with large models.
At the same time, we can also see that the results of our method on the relatively small dataset with the simple CNN are not always the best, which may be because the local models of different rounds are more similar on small datasets and simple models and thus provide fewer distinct candidates for cross-round aggregation to accelerate convergence. In conclusion, FedCDA yields larger improvements on larger models and more complex datasets.

Table 2: Results of FL with 100 clients.
Method | Dir(0.1) | Dir(0.3) | Dir(0.5)
FedAvg | 38.89%±0.85% | 40.38%±0.55% | 42.23%±0.30%
FedProx | 39.86%±0.45% | 39.48%±0.37% | 40.18%±0.46%
FedExP | 38.04%±3.37% | 44.10%±1.69% | 41.79%±1.62%
FedSAM | 16.35%±0.25% | 20.53%±0.25% | 25.70%±0.28%
FedDF | 41.04%±0.57% | 47.06%±0.74% | 47.63%±0.53%
FedGEN | 39.91%±1.72% | 41.65%±1.35% | 43.39%±1.08%
FedMA | 39.12%±0.52% | 40.42%±0.61% | 42.89%±0.21%
GAMF | 39.89%±0.67% | 40.98%±0.32% | 43.25%±0.34%
FedLAW | 40.88%±0.66% | 41.77%±0.78% | 41.89%±0.33%
Ours | 47.38%±0.23% | 49.96%±0.21% | 50.04%±0.19%

More Comparison Results with Different Hyperparameters. To compare with the baselines in a comprehensive way, we further conduct experiments with different hyperparameters. The number of clients is 100 with a sample ratio of 10%. The learning rate is set to 0.1 with weight decay 1e-3, and the number of local epochs is 5. The local optimizer is SGD without momentum. The experiment runs ResNet-18 on the CIFAR-100 dataset. The results are shown in Table 2. As can be seen, our method still performs the best. Specifically, when the data is the most heterogeneous, i.e., with Dirichlet α = 0.1, FedCDA achieves an accuracy of 47.38%, which outperforms the best baseline method FedDF by 6.34%.

Different Aggregation Strategies. We compare five aggregation strategies on the CIFAR-10 dataset. Because the optimal selection has exponential cost, we only sample 10 clients in each round, where the sample ratio is 0.3 and the client memory size is K = 3. As shown in Figure 2(a), we compare FedAvg with the different aggregation strategies of our method. The clients in FedAvg do not require memory. The newest strategy aggregates only the newest local model of each client during the server aggregation phase. The random strategy randomly selects a local model from multiple rounds to aggregate during the server aggregation phase. Finally, the approximate strategy solves (8) as shown above, and the optimal one solves (3). We find that the approximate and optimal strategies bring a large performance improvement over the FedAvg, newest, and random strategies with ResNet-18 on CIFAR-10 Dirichlet 0.1, 0.3 and 0.5. At its peak, there is an almost 10% increase over FedAvg. We can also see that the performance of the approximate strategy is about the same as that of the optimal one, but the convergence time of the former is much smaller than that of the latter; the former is a greedy algorithmic approximation of the latter, so the computation of the solution is greatly reduced. We also compare the average aggregation time per round of these aggregation strategies. Details are in Appendix C.

Hyperparameter Sensitivity. As shown in Figure 2(b), we compare the test accuracy for client memory sizes K of 1, 2, 3, 4 on CIFAR-10 Dirichlet 0.1. As K increases, the final test accuracy increases, which confirms our theory. Increasing K gives cross-round aggregation more choices and possibilities.
Figure 3: Impact of Batch.

Comparison with Different Epochs for FedCDA and FedAvg. By comparing the results in Figure 2(c) and Figure 2(d), from the vertical perspective, our FedCDA eventually converges to higher test accuracy with more local epochs, while the final convergence accuracy of FedAvg remains roughly unchanged. This shows that our algorithm can tolerate large local intervals to save communication cost. Horizontally, FedCDA converges rapidly and stably, whereas the convergence curve of FedAvg oscillates heavily. The reason is that our method excludes the negative impact of client sampling while FedAvg cannot. Therefore, the convergence of our method is more stable than that of FedAvg.

Effect of Batch Number. As we can see in Figure 3, different batch numbers have little effect on the final accuracy. Our experiment uses 50 clients with 10 sampled clients on CIFAR-10 Dirichlet 0.1. We compare the results for batch numbers B = 1, 2, 3, 4, 5, 6, 7. The results show that the approximate selection keeps the accuracy close to that of the optimal selection.

Figure 4: Impact of Warmup.

Warmup Analysis. The experiments in Figure 4 are conducted on CIFAR-10 with 20 clients and a sample ratio of 0.2. FedCDA with no warmup rounds is worse than the variants with warmup rounds. This is because the local models in the early rounds cannot approach convergence, and combining not-well-converged local models from old rounds may hinder the training of the global model. Therefore, it behaves similarly to FedAvg in the startup training stage. Its advantages gradually appear as training proceeds, and it outperforms FedAvg (warmup=200). Yet, there is a threshold for raising warmup rounds. Specifically, we can see that FL with 150 warmup rounds performs worse than with 100 warmup rounds. The principle behind this is that the local models and the global model have approached convergence in the final training stage; the difference between local models in different rounds is greatly reduced, and thus combining them gains little benefit.

7 CONCLUSION

This paper targets aggregation in federated learning, addressing the issue that traditional single-round methods may not preserve locally learned knowledge due to statistical heterogeneity. Recognizing that clients converge after the startup stage and that local models consistently fit their local data across rounds, we propose FedCDA, a new method that selectively aggregates cross-round models with minimum divergence. To enhance efficiency, we introduce an approximate selection algorithm. Theoretical convergence is proven, and empirical results show that our method outperforms state-of-the-art baselines. This paper addresses the ideal scenario where the smoothness value is equal among clients. The goal is to improve our method for cases with varying smoothness by refining the objectives and incorporating sharpness into cross-round aggregation, as we currently treat all local models equally without considering the sharpness of their local optima, despite flatter optima often correlating with better generalization.
ACKNOWLEDGMENTS

The research is supported under the National Key R&D Program of China (2022ZD0160201) and the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). This work is supported by the National Natural Science Foundation of China under grants U1836204, U1936108, 62206102, and the Science and Technology Support Program of Hubei Province under grant 2022BAA046.

REFERENCES

Muhammad Ammad-ud-din, Elena Ivannikova, Suleiman A. Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan, and Adrian Flanagan. Federated collaborative filtering for privacy-preserving personalized recommendation system. CoRR, abs/1901.09888, 2019.
Ilai Bistritz, Ariana Mann, and Nicholas Bambos. Distributed distillation for on-device learning. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR), 2017.
Hong-You Chen and Wei-Lun Chao. FedBE: Making Bayesian model ensemble applicable to federated learning. In International Conference on Learning Representations (ICLR), 2021.
Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learning with Moreau envelopes. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
Pengfei Guo, Puyang Wang, Jinyuan Zhou, Shanshan Jiang, and Vishal M. Patel. Multi-institutional collaborations for improving deep learning-based magnetic resonance image reconstruction using federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2423-2432, 2021.
Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distillation via collaborative learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11020-11029, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. FedExP: Speeding up federated averaging via extrapolation. In International Conference on Learning Representations (ICLR), 2023.
Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems 29 (NeurIPS), pp. 586-594, 2016.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Shuangtong Li, Tianyi Zhou, Xinmei Tian, and Dacheng Tao. Learning to collaborate in decentralized learning of personalized models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9756-9765, 2022a.
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems (MLSys), 2020.
Xin-Chun Li, Yichu Xu, Shaoming Song, Bingshuai Li, Yinchuan Li, Yunfeng Shao, and De-Chuan Zhan. Federated learning with position-aware neurons. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10072-10081, 2022b.
Zexi Li, Tao Lin, Xinyi Shang, and Chao Wu. Revisiting weighted aggregation in federated learning with neural networks. In International Conference on Machine Learning (ICML), pp. 19767-19788, 2023.
Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351-2363, 2020.
Chang Liu, Chenfei Lou, Runzhong Wang, Alan Yuhan Xi, Li Shen, and Junchi Yan. Deep neural network fusion via graph matching with applications to model ensemble and federated learning. In International Conference on Machine Learning (ICML), pp. 13857-13869, 2022.
Quande Liu, Cheng Chen, Jing Qin, Qi Dou, and Pheng-Ann Heng. FedDG: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1013-1023, 2021.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017a.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017b.
Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. Generalized federated learning via sharpness aware minimization. In International Conference on Machine Learning (ICML), pp. 18250-18280, 2022.
Swaroop Ramaswamy, Rajiv Mathews, Kanishka Rao, and Françoise Beaufays. Federated learning for emoji prediction in a mobile keyboard. CoRR, abs/1906.04329, 2019.
Yasar Abbas Ur Rehman, Yan Gao, Pedro Porto Buarque de Gusmão, Mina Alibeigi, Jiajun Shen, and Nicholas D. Lane. L-DAWA: Layer-wise divergence aware weight aggregation in federated self-supervised visual representation learning. CoRR, abs/2307.07393, 2023.
Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
Yan Sun, Li Shen, Tiansheng Huang, Liang Ding, and Dacheng Tao. FedSpeed: Larger local interval, less communication round, and higher generalization accuracy. In International Conference on Learning Representations (ICLR), 2023a.
Yan Sun, Li Shen, and Dacheng Tao. Understanding how consistency works in federated learning via stage-wise relaxed initialization. In Advances in Neural Information Processing Systems (NeurIPS), 2023b.
Chunnan Wang, Xiang Chen, Junzhe Wang, and Hongzhi Wang. ATPFL: Automatic trajectory prediction model design under federated learning framework. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6553-6562, 2022.
Haozhao Wang, Yichen Li, Wenchao Xu, Ruixuan Li, Yufeng Zhan, and Zhigang Zeng. DaFKD: Domain-aware federated knowledge distillation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20412-20421, 2023.
Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris S. Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations (ICLR), 2020a.
Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020b.
Guile Wu and Shaogang Gong. Peer collaborative learning for online knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021.
Lei Wu, Zhanxing Zhu, and Weinan E. Towards understanding generalization of deep learning: Perspective of loss landscapes. CoRR, abs/1706.10239, 2017.
Yingda Xia, Dong Yang, Wenqi Li, Andriy Myronenko, Daguang Xu, Hirofumi Obinata, Hitoshi Mori, Peng An, Stephanie A. Harmon, Evrim Turkbey, Baris Turkbey, Bradford J. Wood, Francesca Patella, Elvira Stellato, Gianpaolo Carrafiello, Anna Ierardi, Alan L. Yuille, and Holger Roth. Auto-FedAvg: Learnable federated averaging for multi-institutional medical image segmentation. CoRR, abs/2104.10195, 2021.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations (ICLR), 2021.
An Xu, Wenqi Li, Pengfei Guo, Dong Yang, Holger Roth, Ali Hatamizadeh, Can Zhao, Daguang Xu, Heng Huang, and Ziyue Xu. Closing the generalization gap of cross-silo federated medical image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20834-20843, 2022.
Fuxun Yu, Weishan Zhang, Zhuwei Qin, Zirui Xu, Di Wang, Chenchen Liu, Zhi Tian, and Xiang Chen. Fed2: Feature-aligned federated learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 2066-2074, 2021.
Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan H. Greenewald, and Trong Nghia Hoang. Statistical model aggregation via parameter matching. In Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 10954-10964, 2019a.
Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan H. Greenewald, Trong Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning (ICML), pp. 7252-7261, 2019b.
Lin Zhang, Li Shen, Liang Ding, Dacheng Tao, and Ling-Yu Duan. Fine-tuning global model via data-free knowledge distillation for non-iid federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10164-10173, 2022.
Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In International Conference on Machine Learning (ICML), pp. 12878-12889. PMLR, 2021.