# Faster Adaptive Federated Learning

Xidong Wu¹, Feihu Huang¹,², Zhengmian Hu¹, Heng Huang¹
¹ Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, United States
² College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
xidongwu@outlook.com, huangfeihu2018@gmail.com, huzhengmian@gmail.com, henghuanghh@gmail.com

Federated learning has attracted increasing attention with the emergence of distributed data. While extensive federated learning algorithms have been proposed for the non-convex distributed problem, federated learning in practice still faces numerous challenges, such as the large number of training iterations needed to converge as the sizes of models and datasets keep increasing, and the lack of adaptivity of SGD-based model updates. Meanwhile, the study of adaptive methods in federated learning is scarce, and existing works either lack a complete theoretical convergence guarantee or have suboptimal sample complexity. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on the momentum-based variance-reduced technique in cross-silo FL. We first explore how to design an adaptive algorithm in the FL setting. By providing a counter-example, we prove that a simple combination of FL and adaptive methods can lead to divergence. More importantly, we provide a convergence analysis for our method and prove that our algorithm is the first adaptive FL algorithm to reach the best-known sample complexity O(ϵ^{-3}) and O(ϵ^{-2}) communication rounds to find an ϵ-stationary point without large batches. Experimental results on a language modeling task and an image classification task with heterogeneous data demonstrate the efficiency of our algorithm.

## Introduction

Distributed training, which emerges to address the challenge of distributed data, has attracted wide attention (Bao et al. 2022). With the improvement of computing power, the bottleneck of training speed is gradually shifting from computing capacity to communication. Therefore, federated learning (FL) (McMahan et al. 2017) was proposed as an important distributed training paradigm in large-scale machine learning to reduce communication overhead. In the FL setting, a central server coordinates multiple worker nodes to learn a joint model together with periodic model averaging by leveraging the massive local data of each worker node. The worker nodes share the computational load, and FL also provides some level of data privacy because training data are not directly shared or aggregated. More recently, an increasing number of FL works focus on addressing the cross-silo FL (i.e., FL between large institutions) problem (Xu and Huang 2022; Guo et al. 2022; Karimireddy et al. 2020b), where most clients participate in computation every round and can maintain state between rounds. Cross-silo FL involves many practical applications, such as collaborative learning on financial data across various corporations and stakeholders, or on health data across numerous medical centers (Xu et al. 2022; Guo et al. 2022). In this paper, we consider solving a federated learning problem in the cross-silo setting, defined as

$$\min_{x\in\mathbb{R}^{d}} f(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x) \qquad (1)$$

where x ∈ ℝ^d denotes the model parameter and N indicates the number of worker nodes. f_i(x) = E_{ξ^{(i)}∼D_i}[f_i(x; ξ^{(i)})] is the loss function of the i-th worker node, and ξ^{(i)} ∼ D_i denotes a sample ξ^{(i)} drawn from the distribution D_i on the i-th worker node. When D_i and D_j are different (i ≠ j), this is referred to as the heterogeneous data setting; in this paper, the {D_i}_{i=1}^{N} are not identical. We restrict our focus to the non-convex problem, where the functions f_i(x) and f(x) are smooth and non-convex. On worker node i, we have access to a stochastic gradient ∇f_i(x; ξ^{(i)}) as an unbiased estimate of the i-th worker node's true gradient ∇f_i(x). Worker nodes collaboratively learn a global model, but the raw data on each worker node is never shared with the server or the other worker nodes.
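As a concrete (toy) illustration of objective (1) and the heterogeneous data setting, the following Python sketch builds N node-specific least-squares losses f_i from private local samples and evaluates the global objective f as their average. The data, loss, and dimensions are illustrative assumptions, not the tasks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 3                      # illustrative number of worker nodes and model dimension

# Each worker node i keeps its own samples, drawn from a node-specific distribution D_i,
# so the local objectives f_i differ across nodes (heterogeneous, non-iid setting).
local_data = [(rng.normal(size=(50, d)), rng.normal(loc=i, size=50)) for i in range(N)]

def f_i(x, i):
    """Local loss f_i(x) = E_{xi ~ D_i}[f_i(x; xi)], estimated on node i's private samples."""
    A, y = local_data[i]
    return 0.5 * np.mean((A @ x - y) ** 2)

def f(x):
    """Global objective (1): f(x) = (1/N) * sum_{i=1}^N f_i(x)."""
    return sum(f_i(x, i) for i in range(N)) / N

print(f(np.zeros(d)))            # value of the federated objective at the zero model
```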
Although various FL methods have been proposed (Karimireddy et al. 2020b; Reddi et al. 2020; Hong et al. 2021; Xiong, Li, and Cai 2023) that substantially reduce communication cost by avoiding frequent transmission between local worker nodes and the central server, they still suffer from unfavorable convergence behavior. This is caused by a variety of factors, such as (1) client drift (Karimireddy et al. 2020b), where local client models move towards local optima instead of the global optimum, (2) the lack of adaptivity of SGD-based updates (Reddi et al. 2020), and (3) the large number of training iterations needed to converge as the sizes of model parameters and training datasets keep increasing. Despite the recent advances, most existing work focuses on solving client drift (Karimireddy et al. 2020b; Khanduri et al. 2021; Xu and Huang 2022). The current federated learning framework still cannot address all of these challenges.

On the other hand, adaptive methods have been widely developed and studied in non-federated settings, since they generally require less parameter tuning and achieve better convergence speed during the training phase. Meanwhile, like centralized SGD, SGD-based FL methods are not a good option for settings with heavy-tailed stochastic gradient noise distributions. Such issues can be addressed by adaptive learning rates (Zhang et al. 2019), which combine knowledge from past iterations. In addition, adaptive gradient methods also escape saddle points faster than SGD (Staib et al. 2019). Therefore, the introduction of adaptive tools is an important direction for improving the performance of FL algorithms in practice. However, the design of adaptive FL methods is nontrivial, because local worker nodes move in different directions and the global trackers cannot be updated frequently in the FL setting. An improper design of an adaptive FL method can lead to convergence issues (Chen, Li, and Li 2020). Reddi et al. (2020) first proposed a class of federated versions of adaptive optimizers, including FedAdagrad, FedYogi, and FedAdam, but the analysis only holds when β1 = 0, so these methods cannot exploit the advantage of momentum. MimeAdam (Karimireddy et al. 2020a) applies server statistics locally to address this issue; nevertheless, it has to compute the full local gradient, which can be prohibitive in practice. More recently, FedAMS (Wang, Lin, and Chen 2022) provides a complete proof, but it does not improve the convergence rate. Overall, the sample complexities of FedAdagrad, FedYogi, FedAdam, and FedAMS are O(ϵ^{-4}) (no better than FedAvg). At the same time, they also require an extra global learning rate to tune.
As the sizes of model parameters and training datasets keep increasing, deep learning models require more training iterations to converge, and more efficient optimization methods are needed. Consequently, a natural question is whether one can achieve a faster convergence rate, in both theory and practice, with adaptive techniques. In this paper, we give an affirmative answer to this question by proposing a faster adaptive FL algorithm (i.e., FAFED).

### Contributions

The main contributions of this work are listed below:

- We study how to incorporate adaptive gradient methods into federated learning. We propose a faster stochastic adaptive FL method (i.e., FAFED) for heterogeneous data settings based on the momentum-based variance reduction technique with a general adaptive matrix.
- We provide a convergence analysis framework for our adaptive method under mild assumptions. Our algorithm is the first adaptive FL algorithm to reach the best-known sample complexity O(ϵ^{-3}) and communication complexity O(ϵ^{-2}) to find an ϵ-stationary point without large batches.
- Extensive experimental results on a language modeling task and image classification tasks confirm the effectiveness of our proposed algorithm.
- By establishing a counter-example, we also show that a naive combination of adaptive gradient methods and periodic model averaging can result in divergence. Therefore, shared adaptive learning rates should be considered in the FL setting.

## Related Works

### Federated Learning

FedAvg was proposed in (McMahan et al. 2017) as the first FL algorithm; with periodic model averaging, it can dramatically reduce communication overheads. Earlier works analyzed FL algorithms in the homogeneous data setting (Woodworth et al. 2020; Khaled, Mishchenko, and Richtárik 2020), and recent research extends federated learning to heterogeneous data settings (non-iid) as well as non-convex models, such as deep neural networks. When the datasets on different worker nodes are homogeneous, FedAvg reduces to local SGD (Zinkevich et al. 2010). Recent works (Yang, Fang, and Liu 2021; Karimireddy et al. 2020b) consider FedAvg with partial worker-node participation, O(1) local update iterations, and O(1) batch sizes; the sample and communication complexities are both O(ϵ^{-4}). In (Yu, Jin, and Yang 2019; Yu, Yang, and Zhu 2019), the authors propose Parallel Restarted SGD and Momentum SGD, and show that both require O(ε^{-4}) samples and O(ε^{-3}) rounds of communication to reach an ε-stationary solution. SCAFFOLD was proposed in (Karimireddy et al. 2020b); it uses control variates to correct for client drift when the data is heterogeneous, and it achieves the same sample and communication complexities as FedAvg. Li et al. (2020) proposed a penalty-based method called FedProx to reduce the communication complexity to O(ε^{-2}). The analysis of FedProx depends on a gradient similarity assumption to restrict the data heterogeneity, which essentially requires that all minima of f(x) are also minima of f_i(x). Later, FedPD was proposed in (Zhang et al. 2020) to relax this assumption.

Momentum-based optimizers are widely used in learning tasks (Sun et al. 2022), and some momentum-based FL algorithms have subsequently been proposed. For example, (Xu and Huang 2022) introduces a momentum fusion technique to coordinate the server and local momentum buffers, but it does not reduce the complexity. Based on variance reduction techniques, Fed-GLOMO (Das et al. 2022) requires O(ε^{-3}) sample complexity and O(ε^{-3}) communication complexity.
This sample complexity matches the optimal complexity of centralized non-convex stochastic optimization algorithms (Fang et al. 2018; Cutkosky and Orabona 2019). More recently, STEM was proposed in (Khanduri et al. 2021), which utilizes a momentum-assisted stochastic gradient direction for both the worker-node and central-server updates. It further reduces the number of communication rounds to O(ε^{-2}) while keeping the same sample cost of O(ε^{-3}).

### Adaptive Methods

Adaptive methods are a class of optimization algorithms that form one of the most important variants of stochastic gradient descent in machine learning. For example, Adam (Kingma and Ba 2014), AdaGrad (Duchi, Hazan, and Singer 2011), and AdaDelta (Zeiler 2012) are widely used as optimization tools for training deep neural networks (DNNs). Afterward, some variants (Reddi, Kale, and Kumar 2019) were proposed with convergence guarantees in the non-convex setting. More recently, the works (Cutkosky and Orabona 2019; Huang, Li, and Huang 2021) presented accelerated adaptive gradient methods based on variance-reduction techniques.

In FL settings, Reddi et al. (2020) first proposed federated versions of adaptive optimizers, including a class of adaptive FL methods such as FedAdagrad, FedYogi, and FedAdam. These methods achieve the same sample cost and number of communication rounds as FedAvg under the assumption β1 = 0. Chen, Li, and Li (2020) proposed Federated AMSGrad, which achieves the same sample cost and number of communication rounds. MimeAdam was proposed in (Karimireddy et al. 2020a), but it requires the full local gradient in each communication round. More recently, FedAMS (Wang, Lin, and Chen 2022) provides a complete proof and considers gradient compression, but it does not improve the convergence rate. Table 1 summarizes the details of typical adaptive FL algorithms.

| Algorithm | Reference | Sample | Communication | Adaptivity |
| --- | --- | --- | --- | --- |
| FedAvg | (Yang, Fang, and Liu 2021; Karimireddy et al. 2020b) | O(ϵ^{-4}) | O(ϵ^{-4}) | ✗ |
| FedAdagrad | (Reddi et al. 2020) | O(ϵ^{-4}) | O(ϵ^{-4}) | ✓ |
| FedYogi | (Reddi et al. 2020) | O(ϵ^{-4}) | O(ϵ^{-4}) | ✓ |
| FedAdam | (Reddi et al. 2020) | O(ϵ^{-4}) | O(ϵ^{-4}) | ✓ |
| FedAMS | (Wang, Lin, and Chen 2022) | O(ϵ^{-4}) | O(ϵ^{-4}) | ✓ |
| FAFED | Our work | O(ϵ^{-3}) | O(ϵ^{-2}) | ✓ |

Table 1: Complexity comparison of FedAvg and typical adaptive FL algorithms for finding an ϵ-stationary point. Sample complexity denotes the number of calls to the (incremental) first-order oracle (IFO) by all worker nodes to reach an ε-stationary point. Communication complexity is defined as the total number of back-and-forth communication rounds between each worker node and the central server required to reach an ε-stationary point.

## Preliminaries

Notations: For two vectors x and y in Euclidean space, ⟨x, y⟩ denotes their inner product. ‖·‖ denotes the ℓ2 norm for vectors and the spectral norm for matrices. x_{t,i} denotes the local model parameters of the i-th worker node at iteration t. ∇_x f(x) is the partial derivative w.r.t. the variable x. I_d denotes the d-dimensional identity matrix. a = O(b) denotes a ≤ Cb for some constant C > 0, and the notation Õ(·) hides logarithmic terms. Given mini-batch samples B = {ξ_j}_{j=1}^{q}, we let f_i(x; B) = (1/q) Σ_{j=1}^{q} f_i(x; ξ_j).

Assumption 1. (i) Unbiased gradient. The stochastic gradient of each component function f_i(x; ξ) computed at each worker node is unbiased, i.e., for all ξ^{(i)} ∼ D_i, i ∈ [N] and x ∈ ℝ^d: E[∇f_i(x; ξ^{(i)})] = ∇f_i(x). (ii) Intra- and inter-node variance bounds. The following holds for all ξ^{(i)} ∼ D_i, i, j ∈ [N] and x ∈ ℝ^d:

$$\mathbb{E}\big\|\nabla f_i(x;\xi^{(i)})-\nabla f_i(x)\big\|^2 \le \sigma^2, \qquad \big\|\nabla f_i(x)-\nabla f_j(x)\big\|^2 \le \zeta^2.$$
Assumption 1-(ii) is a typical assumption used in FL algorithms to constrain the data heterogeneity. ζ is the heterogeneity parameter and represents the level of data heterogeneity. If the datasets across worker nodes have identical distributions, i.e., D_i = D_j for all i, j ∈ [N], then ζ = 0, which corresponds to the homogeneous (i.i.d.) data setting. In this paper, we consider the heterogeneous data setting, where ζ > 0.

Assumption 2. Each component function f_i(x; ξ) has an L-Lipschitz gradient, i.e., for any x_1, x_2, we have E‖∇_x f_i(x_1; ξ) − ∇_x f_i(x_2; ξ)‖ ≤ L‖x_1 − x_2‖.

By the convexity of ‖·‖ and Assumption 2, we have

$$\|\nabla_x f(x_1)-\nabla_x f(x_2)\| = \big\|\mathbb{E}[\nabla_x f(x_1;\xi)-\nabla_x f(x_2;\xi)]\big\| \le \mathbb{E}\|\nabla_x f(x_1;\xi)-\nabla_x f(x_2;\xi)\| \le L\|x_1-x_2\|.$$

Assumption 2 (Lipschitz smoothness) is a widely used assumption in optimization analysis. Many typical centralized stochastic algorithms use this assumption, such as SPIDER (Fang et al. 2018) and STORM (Cutkosky and Orabona 2019). Similarly, it is used in FL algorithms such as MIME (Karimireddy et al. 2020a), Fed-GLOMO (Das et al. 2022), and STEM (Khanduri et al. 2021).

Assumption 3. The function f(x) is bounded below, i.e., f* = inf_{x∈ℝ^d} f(x) > −∞.

Assumption 4. In our algorithms, the adaptive matrices A_t for all t ≥ 1 used for updating the variable x are diagonal matrices and satisfy λ_min(A_t) ≥ ρ > 0, where ρ is an appropriate positive constant based on the definition of A_t.

Assumption 4 ensures that the adaptive matrices A_t, t ≥ 1, are positive definite, as in (Huang, Li, and Huang 2021). Since the adaptive matrices A_t are diagonal, we never need to invert a full matrix: applying A_t^{-1} reduces to an element-wise division.

Assumption 5 (Bounded gradients). The functions f_i(x) have G-bounded gradients, i.e., for any i ∈ [N] and x ∈ ℝ^d, we have ‖∇f_i(x)‖ ≤ G.

Assumption 5 is used to provide an upper bound on the gradients in adaptive methods, as in (Reddi et al. 2020; Chen, Li, and Li 2020; Wang, Lin, and Chen 2022). It is a typical assumption in adaptive methods to constrain the upper bound of the adaptive learning rate. It is reasonable and often satisfied in practice; for example, it holds for finite-sum problems.

Definition 1. A point x is called an ϵ-stationary point if ‖∇f(x)‖ ≤ ϵ. A stochastic algorithm is said to achieve an ϵ-stationary point in T iterations if E‖∇f(x_T)‖ ≤ ϵ.

## Faster Adaptive Federated Learning

In this section, we explore how to combine adaptive gradient methods with federated learning. We present two algorithms to show the idea behind the design of adaptive FL methods and how to use the adaptive learning rate properly. We first use a counter-example to show that a naive combination with local adaptive updates can result in divergence, and then propose our fast adaptive federated learning method (i.e., FAFED).

### Divergence of Local Adaptive Federated Learning

FedAdam, FedYogi, and FedAdagrad proposed in (Reddi et al. 2020) and FedAMS proposed in (Wang, Lin, and Chen 2022) adjust the adaptive learning rate on the server. The main drawback of these methods is that the adaptive term cannot adjust the model during local updates, and they introduce an extra global learning rate to tune. To improve on this, the most straightforward way to design an adaptive federated learning method is to add an adaptive term on each worker node, run an existing adaptive method such as Adam or SUPER-ADAM locally, and then average the models periodically after the inner loop. For ease of understanding, we present such a design, built on FedAvg, in Algorithm 1.
Algorithm 1: Naive adaptive FedAvg
1: Input: T, tuning parameters {β, η}, v_{0,i}, and mini-batch size b_0;
2: Initialize: x_{1,i} ∈ ℝ^d for i ∈ [N];
3: for t = 1, 2, ..., T do
4:   Client i ∈ [N]:
5:   Draw mini-batch samples B_{t,i} = {ξ_i^j}_{j=1}^{b_0} with |B_{t,i}| = b_0 from D_i locally, and compute the stochastic partial derivative ĝ_{t,i} = ∇_x f_i(x_{t,i}; B_{t,i})
6:   v_{t,i} = β v_{t−1,i} + (1 − β) ĝ_{t,i}²
7:   if mod(t, q) = 0 then
8:     Set x_{t+1,i} = x̄_{t+1} = (1/N) Σ_{j=1}^{N} ( x_{t,j} − η ĝ_{t,j} / √(v_{t,j}) )
9:   else
10:    Set x_{t+1,i} = x_{t,i} − η ĝ_{t,i} / √(v_{t,i})
11:   end if
12: end for
13: Output: x̃ chosen uniformly at random from {x̄_t}_{t=1}^{T}.

Each worker node runs local SGD with an adaptive learning rate independently, and the model parameters {x_{t,i}}_{i=1}^{N} are averaged after the inner loop, as in FedAvg. However, this design can suffer from convergence issues, and Algorithm 1 can fail to converge to a stationary point regardless of the parameter selection (Chen, Li, and Li 2020). This is because of the heterogeneous data setting and the fact that the adaptive learning rates on different nodes are different; as a result, the global model moves away from the global optimum. Following (Chen, Li, and Li 2020), Theorem 1 uses an example to present the details of the step updates in Algorithm 1 and shows that, in some cases, divergence is unavoidable no matter how we choose the tuning parameters.

Theorem 1. Suppose the sequence {x̄_t}_{t=1}^{T} is generated from Algorithm 1 using stochastic partial derivatives. {x̄_t}_{t=1}^{T} can fail to converge to a stationary point regardless of the tuning parameter selection.

Proof. We utilize a counter-example to prove Theorem 1 and consider a simple 1-dimensional case with N = 3 worker nodes:

$$f_1(x)=\begin{cases}3x^2, & |x|\le 1\\ 6|x|-2, & |x|>1\end{cases}\qquad\quad f_2(x)=f_3(x)=\begin{cases}-x^2, & |x|\le 1\\ -2|x|+1, & |x|>1\end{cases}$$

$$f(x)=\frac{1}{3}\sum_{i=1}^{3}f_i(x)=\begin{cases}\frac{1}{3}x^2, & |x|\le 1\\ \frac{2}{3}|x|, & |x|>1\end{cases}$$

It is clear that x = 0 is the unique stationary point. We begin from step t = 0. Assume η = 0.1, β = 0.5, v_{0,i} = 0, and the initial point x_{0,i} = 10 for i = 1, 2, 3. With the first update (t = 1), for f_1 we have g_{0,1} = 6 and v_{0,1} = 0.5 · 6² = 18. For i = 2, 3, we have g_{0,i} = −2 and v_{0,i} = 0.5 · 2² = 2. Following Algorithm 1, each worker node has its own adaptive learning rate. Thus, after the first update, we have x_{1,1} = 10 − 0.1 · 6/√18 ≈ 9.858 and x_{1,2} = x_{1,3} = 10 + 0.1 · 2/√2 ≈ 10.141, so that x̄ ≈ 10.05. The global model moves in the opposite direction. We continue with the following steps. We still have g_{t,1} = 6 and g_{t,2} = g_{t,3} = −2, with v_{t,1} = (1 − β^t) · 6² and v_{t,2} = v_{t,3} = (1 − β^t) · 2². Therefore, x_{t,1} always decreases by 6η/√((1 − β^t) · 6²) = η/√(1 − β^t), while x_{t,2} and x_{t,3} always increase by 2η/√((1 − β^t) · 2²) = η/√(1 − β^t). As a result, the averaged model parameter moves by η/(3√(1 − β^t)) per step in the positive direction. This is the opposite of the direction of convergence and is independent of the choice of the parameters η and β. Therefore, after the first inner loop on each worker node and the averaging step on the central server, the global model moves away from the optimal point. In the following steps, each worker node continues running local SGD with its adaptive learning rate from the same point, and the global model keeps moving away from the optimal point after each inner loop. Finally, the global model fails to converge to the optimal point.

From this example, we can see that the model diverges no matter which tuning parameters we choose. The divergence is caused by the heterogeneous data setting and the non-consensus of the adaptive learning rates on different worker nodes. This suggests that we should combine the gradient information across nodes when we design an adaptive method in the FL setting.
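The divergence can also be checked numerically. The following short Python sketch (ours, not the authors' code) runs the per-node adaptive update of Algorithm 1 on the three functions above with η = 0.1 and β = 0.5; the choice q = 10 for the averaging period is an illustrative assumption.

```python
import math

# The three local losses from the counter-example (1-D model, N = 3 worker nodes).
def grad_f1(x):        # f1(x) = 3x^2 for |x| <= 1, 6|x| - 2 otherwise
    return 6 * x if abs(x) <= 1 else 6 * math.copysign(1.0, x)

def grad_f23(x):       # f2(x) = f3(x) = -x^2 for |x| <= 1, -2|x| + 1 otherwise
    return -2 * x if abs(x) <= 1 else -2 * math.copysign(1.0, x)

eta, beta, q = 0.1, 0.5, 10            # tuning parameters as in the proof (q is illustrative)
grads = [grad_f1, grad_f23, grad_f23]
x = [10.0, 10.0, 10.0]                 # x_{0,i} = 10 on every node
v = [0.0, 0.0, 0.0]

for t in range(1, 101):
    for i in range(3):
        g = grads[i](x[i])
        v[i] = beta * v[i] + (1 - beta) * g ** 2   # per-node adaptive term (Algorithm 1, step 6)
        x[i] -= eta * g / math.sqrt(v[i])          # local adaptive SGD step (step 10)
    if t % q == 0:                                 # periodic model averaging (step 8)
        x = [sum(x) / 3] * 3
        print(f"communication round {t // q}: averaged model = {x[0]:.3f}")
# The averaged iterate climbs from 10 toward larger values, i.e. it drifts away from
# the unique stationary point x* = 0 by roughly eta/3 per local step, for any eta, beta.
```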
Thus, we use a shared adaptive learning rate in Algorithm 2 to avoid this divergence.

### Faster Adaptive Federated Learning Method

In the above subsection, we showed that an SGD-based local adaptive learning method can diverge, even on a very simple example, regardless of the tuning parameter selection. In this subsection, we propose a novel fast adaptive federated learning algorithm (FAFED) with shared adaptive learning rates for solving problem (1) under the heterogeneous data setting. Our FAFED algorithm is summarized in Algorithm 2.

Algorithm 2: FAFED Algorithm
1: Input: T, parameters β, η_t, α_t, the number of local updates q, mini-batch size b, and initial batch size B;
2: Initialize: x_{0,i} = x̄_0 = (1/N) Σ_{i=1}^{N} x_{0,i}; m_{0,i} = m̄_0 = (1/N) Σ_{i=1}^{N} m̂_{0,i} with m̂_{0,i} = ∇_x f_i(x_{0,i}; B_{0,i}); v_{0,i} = v̄_0 = (1/N) Σ_{i=1}^{N} v̂_{0,i} with v̂_{0,i} = (∇_x f_i(x_{0,i}; B_{0,i}))², where |B_{0,i}| = B is drawn from D_i for i ∈ [N]; A_0 = diag(√(v̄_0) + ρ)
3: x_{1,i} = x_{0,i} − η_0 m_{0,i} for all i ∈ [N]
4: for t = 1, 2, ..., T do
5:   Client i ∈ [N]:
6:   Draw mini-batch samples B_{t,i} = {ξ_i^j}_{j=1}^{b} with |B_{t,i}| = b from D_i locally, and compute the stochastic partial derivatives ĝ_{t,i} = ∇_x f_i(x_{t,i}; B_{t,i}) and ĝ_{t−1,i} = ∇_x f_i(x_{t−1,i}; B_{t,i})
7:   m_{t,i} = ĝ_{t,i} + (1 − α_t)(m_{t−1,i} − ĝ_{t−1,i})
8:   v_{t,i} = β v_{t−1,i} + (1 − β) ĝ_{t,i}²
9:   if mod(t, q) = 0 then
10:    v_{t,i} = v̄_t = (1/N) Σ_{i=1}^{N} v_{t,i} and A_t = diag(√(v̄_t) + ρ)
11:    m_{t,i} = m̄_t = (1/N) Σ_{i=1}^{N} m_{t,i}
12:    x_{t+1,i} = x̄_{t+1} = (1/N) Σ_{i=1}^{N} (x_{t,i} − η_t A_t^{−1} m_{t,i})
13:  else
14:    A_t = A_{t−1}
15:    x_{t+1,i} = x_{t,i} − η_t A_t^{−1} m_{t,i}
16:  end if
17: end for
18: Output: x̃ chosen uniformly at random from {x̄_t}_{t=1}^{T}.

At step 8 of Algorithm 2, we use the coordinate-wise adaptive learning rate as in Adam (Kingma and Ba 2014), defined as

$$v_{t,i} = \beta v_{t-1,i} + (1-\beta)\big(\nabla_x f_i(x_{t,i}; B_{t,i})\big)^2 \qquad (2)$$

where β ∈ (0, 1). At step 10 of Algorithm 2, we add a periodic averaging step for the local adaptive learning rates v_{t,i} on the server side. We then use v̄_t to generate an adaptive matrix A_t = diag(√(v̄_t) + ρ), where ρ > 0. In fact, the adaptive vector v̄_t can be generated by different adaptive learning rate methods, such as the global adaptive learning rate of AdaGrad-Norm (Ward, Wu, and Bottou 2019), while A_t keeps the same form. The tuning parameter ρ is used to balance the adaptive information against the noise. In the local updates, different from Algorithm 1, Algorithm 2 uses the shared adaptive learning rate to avoid model divergence: at step 15, the same A_t is used for the local updates of the different worker nodes. The idea behind this design is that v_{t,i} can be viewed as a second-moment estimate of the gradients; thus A_t, built on the average of the v_{t,i}, is also an estimate of the second moment of the global model. With the averaged adaptive information, A_t follows the global direction and avoids the divergence issue of Algorithm 1.

At step 7 of Algorithm 2, we use the momentum-based variance-reduced gradient estimator m_{t,i} to track the gradient and update the model, defined as

$$m_{t,i} = \nabla_x f_i(x_{t,i}; B_{t,i}) + (1-\alpha_t)\big(m_{t-1,i} - \nabla_x f_i(x_{t-1,i}; B_{t,i})\big) \qquad (3)$$

where α_t ∈ (0, 1). At step 11 of Algorithm 2, the gradient estimator m_{t,i} is also synchronized and averaged on the server. Overall, the worker nodes run adaptive updates locally with the shared adaptive learning rate, and the central server aggregates the model parameters, the gradient estimators m_{t,i}, and the second-moment estimators v_{t,i} every q steps. In the next subsection, we establish the theoretical convergence guarantee of the proposed algorithm.
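To make the updates (2)–(3) and the role of the shared adaptive matrix A_t concrete, here is a small self-contained Python/NumPy sketch of the main loop of Algorithm 2 on a toy heterogeneous least-squares problem. It is not the authors' implementation: the data, problem, constant step size eta and momentum alpha (the theory uses the schedules η_t and α_t of Theorem 2), and all numeric constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, q, b = 4, 5, 10, 8                      # nodes, dimension, local steps, batch size
beta, alpha, eta, rho = 0.9, 0.1, 0.05, 1.0   # constant rates for illustration only

# Hypothetical heterogeneous local least-squares problems, one per worker node.
data = [(rng.normal(size=(200, d)), rng.normal(loc=i, size=200)) for i in range(N)]

def batch_grad(w, i, idx):
    """Stochastic gradient of node i's least-squares loss at w on mini-batch idx."""
    A, y = data[i]
    Ab, yb = A[idx], y[idx]
    return Ab.T @ (Ab @ w - yb) / len(idx)

# Initialization (Algorithm 2, steps 2-3): m_0, v_0 from an initial batch, shared A_0.
x = [np.zeros(d) for _ in range(N)]
x_prev = [xi.copy() for xi in x]
m = [batch_grad(x[i], i, rng.choice(200, size=b, replace=False)) for i in range(N)]
v = [g ** 2 for g in m]
A_t = np.sqrt(np.mean(v, axis=0)) + rho       # diagonal of A_0 = diag(sqrt(v_bar_0) + rho)
for i in range(N):
    x_prev[i], x[i] = x[i], x[i] - eta * m[i]

for t in range(1, 301):
    for i in range(N):                                    # local computations (steps 5-8)
        idx = rng.choice(200, size=b, replace=False)      # one mini-batch, two gradients
        g_new = batch_grad(x[i], i, idx)
        g_old = batch_grad(x_prev[i], i, idx)
        m[i] = g_new + (1 - alpha) * (m[i] - g_old)       # variance-reduced estimator, eq. (3)
        v[i] = beta * v[i] + (1 - beta) * g_new ** 2      # second-moment estimate, eq. (2)
    if t % q == 0:                                        # communication round (steps 10-12)
        v_bar, m_bar = np.mean(v, axis=0), np.mean(m, axis=0)
        A_t = np.sqrt(v_bar) + rho                        # refreshed SHARED adaptive matrix
        x_bar = np.mean([x[i] - eta * m[i] / A_t for i in range(N)], axis=0)
        x_prev = [xi.copy() for xi in x]
        x = [x_bar.copy() for _ in range(N)]
        m = [m_bar.copy() for _ in range(N)]
        v = [v_bar.copy() for _ in range(N)]
    else:                                                 # local step with the shared A_t (step 15)
        for i in range(N):
            x_prev[i], x[i] = x[i], x[i] - eta * m[i] / A_t

full = np.arange(200)
grad_norm = np.linalg.norm(np.mean([batch_grad(x[0], i, full) for i in range(N)], axis=0))
print(f"approximate global gradient norm after training: {grad_norm:.4f}")
```

The point mirrored from Algorithm 2 is that every node steps with the same A_t built from the averaged second-moment estimates, while x_{t,i}, m_{t,i}, and v_{t,i} are only synchronized every q iterations.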
### Convergence Analysis of Our Algorithm

In this subsection, we study the convergence properties of our new algorithm under Assumptions 1, 2, 3, 4, and 5. The detailed proofs are provided in the supplementary materials. Given the sequence {x̄_t}_{t=1}^{T} generated from our algorithm, we first define a useful convergence metric as follows:

$$M_t = \frac{1}{4\eta_t^2}\|\bar{x}_{t+1}-\bar{x}_t\|^2 + \frac{1}{4\rho^2}\|\nabla f(\bar{x}_t)-\bar{m}_t\|^2 \qquad (4)$$

where the two terms of M_t measure the convergence of the iterate sequence {x̄_t}_{t=1}^{T}. This new convergence measure is tighter than the standard gradient-norm metric ‖∇f(x̄_t)‖, and we complete the final convergence analysis based on it.

Theorem 2. Suppose the sequence {x̄_t}_{t=1}^{T} is generated from Algorithm 2. Under the above Assumptions (1, 2, 3, 4, 5), given that for all t ≥ 0, α_{t+1} = cη_t² with

$$c = \frac{1}{12LIh^3\rho^2} + \frac{60L^2}{bN\rho}, \qquad w = \max\Big(\frac{3}{2},\; 1728L^3I^3h^3\Big), \qquad h = \frac{N^{2/3}}{L},$$

and set

$$\eta_t = \frac{\rho h}{(w+t)^{1/3}}, \qquad (5)$$

then we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big\|\nabla f(\bar{x}_t)\big\| \le \tilde{G}\sqrt{\Big(\frac{1}{\rho T}+\frac{L}{\rho (NT)^{2/3}}\Big)\Big(\mathbb{E}[f(\bar{x}_0)-f^{*}]+6q\sigma^2\Big) + \Big(\frac{122+150q}{b^2\rho^2 T}+\frac{1800}{b^2\rho^2 (NT)^{2/3}}\Big)5\sigma^2},$$

where $\tilde{G} = 4\sqrt{\sigma^2+G^2+\rho^2}$.

Remark 1. (Complexity) Without loss of generality, let B = bq and b = O(1) (b ≥ 1), and choose q = (T/N²)^{1/3}. Based on the definition of an ε-stationary point, namely E‖∇f(x̃_T)‖ ≤ ϵ and E[M_T] ≤ ϵ², we get T = O(N^{-1}ε^{-3}) and T/q = (NT)^{2/3} = O(ε^{-2}). Because the batch size b is a constant, the sample cost is O(N^{-1}ε^{-3}) per worker node (i.e., O(ε^{-3}) in total) and the number of communication rounds is O(ε^{-2}) for finding an ε-stationary point, which matches the best-known gradient complexity bound for this problem. Moreover, O(N^{-1}ε^{-3}) exhibits a linear speed-up over the aforementioned optimal centralized algorithms, such as SPIDER and STORM (Fang et al. 2018; Cutkosky and Orabona 2019).

Remark 2. (Data Heterogeneity) We use ζ to represent the data heterogeneity. The final results show that a larger ζ (higher data heterogeneity) slows down training.

Remark 3. Due to Assumption 4 and the definition of A_t, the smallest eigenvalue of the adaptive matrix A_t has a lower bound ρ > 0, which balances the adaptive information in the adaptive learning rate. Generally, we choose ρ = O(1); in practice, we do not choose a very small or a very large value.
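To make Remark 1 concrete, the small Python computation below illustrates, up to constants, how the choices b = O(1) and q = (T/N²)^{1/3} turn T = O(N^{-1}ε^{-3}) iterations into roughly (NT)^{2/3} = O(ε^{-2}) communication rounds. The values of N and ε are arbitrary illustrative assumptions.

```python
N, eps = 8, 1e-2                         # illustrative values only
T = round(eps ** -3 / N)                 # iterations, T = O(N^{-1} eps^{-3}) up to constants
q = round((T / N ** 2) ** (1 / 3))       # local updates between two communication rounds
rounds = T // q                          # communication rounds, roughly (N * T)^{2/3}
print(f"T = {T}, q = {q}, rounds = {rounds}, (N*T)^(2/3) = {(N * T) ** (2 / 3):.0f}")
```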
## Experimental Results

In this section, we evaluate our algorithm on a language modeling task and image classification tasks. We compare our algorithm with the existing state-of-the-art algorithms, including FedAvg, SCAFFOLD (Karimireddy et al. 2020b), STEM (Khanduri et al. 2021), FedAdam (Reddi et al. 2020), and FedAMS (Wang, Lin, and Chen 2022). Experiments are implemented using PyTorch, and we run all experiments on CPU machines with a 2.3 GHz Intel Core i9 as well as an NVIDIA Tesla P40 GPU.

### Language Modeling Task

The WikiText2 dataset is used in this experiment, and the data is partitioned across 16 worker nodes. On the WikiText2 dataset, we train a 2-layer LSTM (Hochreiter and Schmidhuber 1997) with 650-dimensional word embeddings and 650 hidden units per layer. We use a batch size of 20 and an inner loop number q of 10 in the experiment, and set the dropout rate to 0.5. To avoid exploding gradients in the LSTM, we also clip the gradients by norm 0.25 (Huang, Li, and Huang 2021). Grid search is used to choose the learning rate for each optimizer. In FedAvg, SCAFFOLD, FedAdam, and FedAMS, we set the learning rate to 10, and the global learning rate in SCAFFOLD is 1. Given that a large global learning rate in FedAdam and FedAMS causes divergence and large fluctuations, we tune it to 0.03 (≈ 10^{-1.5}). In the STEM algorithm, we set κ = 20 and w = σ = 1, and the step size is diminished in each epoch as in (Khanduri et al. 2021). In the FAFED algorithm, we set ρh = 1 and w = 1 and decrease the step size according to (5). In the FedAdam, FedAMS, STEM, and FAFED algorithms, the momentum parameters, such as α_t, β, β1, and β2, are chosen from the set {0.1, 0.9}. Their adaptive parameters τ or ρ are chosen as 0.01.

Figure 1 shows both the training loss and the training perplexity. Our FAFED algorithm outperforms all the other baseline optimizers. Although FedAdam also performs well with an adaptive learning rate, it exhibits clear fluctuations at the beginning of the training phase because it utilizes a global adaptive method. FedAMS is worse because the adaptive terms of FAFED and FedAdam are more flexible, whereas the adaptive term in FedAMS is monotonically increasing.

### Image Classification Tasks

In the second task, we conduct image classification on the Fashion-MNIST dataset as in (Nouiehed et al. 2019), the MNIST dataset, and the CIFAR-10 dataset, with 20 worker nodes in the network. The Fashion-MNIST and MNIST datasets each include 60,000 training images and 10,000 testing images classified into 10 classes; each image in both datasets is a 28×28 array of grayscale pixels. The CIFAR-10 dataset includes 50,000 training images and 10,000 testing images: 60,000 32×32 color images classified into 10 categories. Each worker node holds the same convolutional neural network (CNN) model as the classifier, and we use cross-entropy as the loss function. The network structures are provided in the supplementary material. We consider three different heterogeneity settings: low heterogeneity, moderate heterogeneity, and high heterogeneity. In the low heterogeneity setting, the datasets on different worker nodes have 95% similarity (Karimireddy et al. 2020b). In real-world applications, the data on different worker nodes are usually completely different and may even cover different categories, so we also consider two more challenging settings. In the moderate and high heterogeneity settings, the datasets are divided into disjoint sets across all worker nodes: in the moderate heterogeneity setting, each worker node holds part of the data from all the classes, while in the high heterogeneity setting, each worker node only has a part of the total classes (5 out of 10 classes).
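As an illustration of the high-heterogeneity split described above, the following Python sketch partitions a labeled dataset into disjoint shards so that every worker node only holds 5 of the 10 classes. The function name and split strategy are hypothetical helpers of ours, not the exact partition procedure used in the experiments.

```python
import random

def partition_high_heterogeneity(labels, num_nodes=20, classes_per_node=5, num_classes=10, seed=0):
    """Split sample indices into disjoint shards so that each worker node only
    sees `classes_per_node` of the `num_classes` labels (high-heterogeneity setting)."""
    rng = random.Random(seed)
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in range(num_classes)}
    for idx in by_class.values():
        rng.shuffle(idx)
    node_classes = [sorted(rng.sample(range(num_classes), classes_per_node)) for _ in range(num_nodes)]
    shards = [[] for _ in range(num_nodes)]
    for c, idx in by_class.items():
        holders = [n for n in range(num_nodes) if c in node_classes[n]]
        for k, n in enumerate(holders):
            shards[n].extend(idx[k::len(holders)])   # disjoint, interleaved slice of class c
    return shards, node_classes

# Toy usage: 1,000 fake labels over 10 classes split across 20 worker nodes.
labels = [i % 10 for i in range(1000)]
shards, node_classes = partition_high_heterogeneity(labels)
print(len(shards[0]), node_classes[0])
```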
We carefully tune the hyperparameters for all methods. We run a grid search for the step size and choose it from the set {0.001, 0.01, 0.02, 0.05, 0.1}. We set the global learning rate to 1 for SCAFFOLD. For FedAdam and FedAMS, we choose the global learning rate from the set {10^{-1.5}, 10^{-2}, 10^{-2.5}}, based on Fig. 2 in (Reddi et al. 2020), and the adaptive parameter τ is chosen as 0.01. Given that the datasets are divided across all worker nodes, the number of data points on each worker node is limited, and the heterogeneity settings also slow down training. Thus, the number of required local steps increases, and methods with a diminishing step size perform poorly because the step size decreases rapidly; we therefore choose a fixed step size for STEM and FAFED. We choose the momentum parameter from the set {0.1, 0.9}; β1 and β2 in FedAdam and FedAMS and β in FAFED are chosen from {0.1, 0.9}. The batch size b is chosen from {5, 50, 100} and the inner loop number q from {5, 10, 20}. As the data heterogeneity increases from the low heterogeneity setting to the high heterogeneity setting, we increase the batch size and decrease the inner loop number.

Figure 1: Experimental results on WikiText2 for the language modeling task. (a) Training loss; (b) training perplexity.

Figure 2: Training loss vs. the number of communication rounds for the low heterogeneity setting. (a) Fashion-MNIST; (b) MNIST; (c) CIFAR-10.

Discussion: The goal of our experiments is two-fold: (1) to compare the performance of FAFED with the other algorithms in different heterogeneity settings during the training phase on the training datasets; and (2) to demonstrate the model performance on the test datasets. In Figure 2 and the figures in the supplementary materials, we show the performance of FAFED and the other baseline methods against the number of communication rounds, namely the back-and-forth communication rounds between each worker node and the central server, on the three datasets with different heterogeneity settings for the image classification task. From these figures, we find that our algorithm consistently outperforms the other baseline algorithms. Compared with FedAdam and FedAMS, our adaptive method shows less fluctuation (e.g., at the beginning phase of FedAdam). That is because we use the adaptive learning rate locally, while FedAdam only scales the model parameters on the central server after multi-step training. Finally, we focus on the performance on the testing datasets. In the tables in the supplementary materials, we show the testing accuracy of FAFED and of the other algorithms on the Fashion-MNIST dataset for the different heterogeneity settings after training for the same number of epochs. Although model training becomes more difficult with increasing data heterogeneity, FAFED performs well under all conditions. This shows that adaptive FL methods adapt well and that our method (FAFED) achieves good performance in different heterogeneity settings.

## Conclusion

In this work, we proposed a novel adaptive algorithm (i.e., FAFED) based on the momentum-based variance-reduced technique in the FL setting. We show that adaptive optimizers can be powerful tools with good performance in both theoretical analysis and numerical experiments. We first explored how to design an adaptive algorithm in the FL setting: by providing a counter-example, we showed that a naive combination of a local adaptive method with periodic model averaging can lead to divergence, and that shared adaptive learning rates should be considered. Moreover, we provided a solid convergence analysis for our method and proved that our algorithm is the first adaptive FL method to reach the best-known sample complexity O(ϵ^{-3}) and communication complexity O(ϵ^{-2}) to find an ϵ-stationary point without large batches. Finally, we conducted experiments on the language modeling task and image classification tasks with different levels of data heterogeneity.

## Acknowledgements

This work was partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775.

## References

Bao, R.; Wu, X.; Xian, W.; and Huang, H. 2022. Doubly sparse asynchronous learning for stochastic composite optimization. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, 1916–1922.

Chen, X.; Li, X.; and Li, P. 2020. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, 119–128.

Cutkosky, A.; and Orabona, F. 2019. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32.
Das, R.; Acharya, A.; Hashemi, A.; Sanghavi, S.; Dhillon, I. S.; and Topcu, U. 2022. Faster non-convex federated learning via global and local momentum. In Uncertainty in Artificial Intelligence, 496–506. PMLR.

Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).

Fang, C.; Li, C. J.; Lin, Z.; and Zhang, T. 2018. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31.

Guo, P.; Yang, D.; Hatamizadeh, A.; Xu, A.; Xu, Z.; Li, W.; Zhao, C.; Xu, D.; Harmon, S.; Turkbey, E.; et al. 2022. AutoFedRL: Federated Hyperparameter Optimization for Multi-institutional Medical Image Segmentation. arXiv preprint arXiv:2203.06338.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735–1780.

Hong, J.; Wang, H.; Wang, Z.; and Zhou, J. 2021. Federated robustness propagation: Sharing adversarial robustness in federated learning. arXiv preprint arXiv:2106.10196, 1.

Huang, F.; Li, J.; and Huang, H. 2021. SUPER-ADAM: Faster and universal framework of adaptive gradients. Advances in Neural Information Processing Systems, 34.

Karimireddy, S. P.; Jaggi, M.; Kale, S.; Mohri, M.; Reddi, S. J.; Stich, S. U.; and Suresh, A. T. 2020a. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606.

Karimireddy, S. P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; and Suresh, A. T. 2020b. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, 5132–5143. PMLR.

Khaled, A.; Mishchenko, K.; and Richtárik, P. 2020. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, 4519–4529. PMLR.

Khanduri, P.; Sharma, P.; Yang, H.; Hong, M.; Liu, J.; Rajawat, K.; and Varshney, P. 2021. STEM: A stochastic two-sided momentum algorithm achieving near-optimal sample and communication complexities for federated learning. Advances in Neural Information Processing Systems, 34.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2: 429–450.

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.

Nouiehed, M.; Sanjabi, M.; Huang, T.; Lee, J. D.; and Razaviyayn, M. 2019. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32.

Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; and McMahan, H. B. 2020. Adaptive federated optimization. arXiv preprint arXiv:2003.00295.

Reddi, S. J.; Kale, S.; and Kumar, S. 2019. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.

Staib, M.; Reddi, S.; Kale, S.; Kumar, S.; and Sra, S. 2019. Escaping saddle points with adaptive gradient methods. In International Conference on Machine Learning, 5956–5965. PMLR.
Sun, J.; Huai, M.; Jha, K.; and Zhang, A. 2022. Demystify Hyperparameters for Stochastic Optimization with Transferable Representations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1706–1716.

Wang, Y.; Lin, L.; and Chen, J. 2022. Communication-Efficient Adaptive Federated Learning. arXiv preprint arXiv:2205.02719.

Ward, R.; Wu, X.; and Bottou, L. 2019. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, 6677–6686. PMLR.

Woodworth, B.; Patel, K. K.; Stich, S.; Dai, Z.; Bullins, B.; McMahan, B.; Shamir, O.; and Srebro, N. 2020. Is local SGD better than minibatch SGD? In International Conference on Machine Learning, 10334–10343. PMLR.

Xiong, Z.; Li, W.; and Cai, Z. 2023. Federated Generative Model on Multi-Source Heterogeneous Data in IoT. In Proceedings of the AAAI Conference on Artificial Intelligence.

Xu, A.; and Huang, H. 2022. Coordinating momenta for cross-silo federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 8735–8743.

Xu, A.; Li, W.; Guo, P.; Yang, D.; Roth, H. R.; Hatamizadeh, A.; Zhao, C.; Xu, D.; Huang, H.; and Xu, Z. 2022. Closing the Generalization Gap of Cross-silo Federated Medical Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20866–20875.

Yang, H.; Fang, M.; and Liu, J. 2021. Achieving linear speedup with partial worker participation in non-iid federated learning. arXiv preprint arXiv:2101.11203.

Yu, H.; Jin, R.; and Yang, S. 2019. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, 7184–7193. PMLR.

Yu, H.; Yang, S.; and Zhu, S. 2019. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5693–5700.

Zeiler, M. D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zhang, J.; Karimireddy, S. P.; Veit, A.; Kim, S.; Reddi, S. J.; Kumar, S.; and Sra, S. 2019. Why ADAM beats SGD for attention models. arXiv preprint arXiv:1912.03194.

Zhang, X.; Hong, M.; Dhople, S.; Yin, W.; and Liu, Y. 2020. FedPD: A federated learning framework with optimal rates and adaptivity to non-iid data. arXiv preprint arXiv:2005.11418.

Zinkevich, M.; Weimer, M.; Li, L.; and Smola, A. 2010. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23.