Communication-Efficient Adaptive Federated Learning

Yujia Wang 1 Lu Lin 2 Jinghui Chen 1

1College of Information Sciences and Technology, Pennsylvania State University, State College, PA, United States. 2Department of Computer Science, University of Virginia, Charlottesville, VA, United States. Correspondence to: Jinghui Chen .

Abstract

Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to repetitive server-client synchronization and the lack of adaptivity of SGD-based model updates. Although various methods have been proposed for reducing the communication cost via gradient compression or quantization, and federated versions of adaptive optimizers such as FedAdam have been proposed to add more adaptivity, current federated learning frameworks still cannot solve the aforementioned challenges all at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.

1. Introduction

Federated learning (FL) (Konečný et al., 2016; McMahan et al., 2017) has recently become a popular machine learning training paradigm in which multiple clients cooperate to jointly learn a machine learning model. In the federated learning setting, training data is distributed across a large number of clients, or edge devices, such as smartphones, personal computers, or IoT devices. These clients own valuable data for training a variety of machine learning
Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

models, yet those raw client data are not allowed to be shared with the server or other clients due to privacy and regulatory concerns. Federated learning (Konečný et al., 2016; McMahan et al., 2017) works by having each client train the ML model locally on its own data, while the clients iteratively exchange and synchronize their local ML model parameters with each other through a central server. McMahan et al. (2017) proposed the FedAvg algorithm, whose global model is updated by averaging multiple steps of local stochastic gradient descent (SGD) updates; it has become one of the most popular FL methods. Despite the ability to jointly train the model without directly sharing the data, the implementation of FL in practice still faces several major challenges, such as (1) large communication overhead due to the repetitive synchronization between the server and the clients; and (2) lack of adaptivity, as SGD-based updates may not be suitable for heavy-tailed stochastic gradient noise distributions, which often arise in training large-scale models such as BERT (Devlin et al., 2018), GPT-3 (Brown et al., 2020), GANs (Goodfellow et al., 2014), or ViT (Dosovitskiy et al., 2021). Note that various attempts have been made to solve the aforementioned challenges individually, but not all of them at once. In terms of reducing communication costs, one can avoid transmitting the complete model updates when synchronizing. Several works, including Reisizadeh et al. (2020); Jin et al. (2020); Jhunjhunwala et al. (2021); Chen et al. (2021b), have studied compressed and quantized federated learning optimization methods based on FedAvg.
Another way is to reduce the number of participating clients such that only part of the clients participate in the model training at each round (Yang et al., 2021; Li et al., 2019b; Nishio & Yonetani, 2019; Li & Wang, 2019). Besides, network resource allocation also plays an important role in communication-efficient federated learning (Li et al., 2019a; Yang et al., 2020). Regarding adaptivity, FedAdam (Reddi et al., 2020) and other variants, such as FedYogi (Reddi et al., 2020) and FedAMSGrad (Tong et al., 2020), were recently proposed to introduce adaptive gradient methods (Kingma & Ba, 2014; Reddi et al., 2018) into the federated learning framework with provable convergence guarantees. However, how to achieve communication-efficient adaptive federated optimization while still providing rigorous convergence guarantees remains an open problem. In this paper, we aim to develop a new compressed federated adaptive gradient optimization method that is communication-efficient while also provably convergent. Specifically, we first propose FedAMS, a variant of FedAdam, with an improved convergence analysis over the original FedAdam. Based on FedAMS, we propose FedCAMS, Federated Communication-compressed AMSGrad with Max Stabilization, which addresses both the communication and adaptivity challenges within one training framework. We summarize our contributions as follows:

- We provide an improved analysis of the convergence behaviour of FedAMS, a variant of the existing federated adaptive gradient method FedAdam (Reddi et al., 2020), whose original analysis is simplified by considering only the case where no momentum is used. In particular, we prove that FedAMS (with momentum) can achieve the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ w.r.t. the total number of iterations $T$, the number of local updates $K$, and the number of workers $m$, for both the full participation and partial participation schemes.
- We propose a new communication-efficient adaptive federated optimization method, FedCAMS, which, to the best of our knowledge, achieves for the first time both communication efficiency and adaptivity in federated learning within a single learning framework. FedCAMS largely reduces the communication cost through an error-feedback compression strategy and is compatible with various commonly used compressors in practice. We prove that FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its uncompressed counterpart FedAMS.

- We conduct experiments on various benchmarks and show that our proposed FedAMS and FedCAMS achieve good adaptivity in training real-world machine learning models. Furthermore, we show that FedCAMS effectively reduces the communication cost (number of bits for communication) by orders of magnitude while sacrificing little in terms of prediction accuracy.

Notation: For vectors $x, y \in \mathbb{R}^d$, $\sqrt{x}$, $x^2$, and $x/y$ denote the element-wise square root, square, and division of the vectors. For a vector $x$ and a matrix $A$, $\|\cdot\|$ denotes the $\ell_2$ norm, i.e., $\|x\| = \|x\|_2$ and $\|A\| = \|A\|_2$.

2. Related Work

SGD and Adaptive Gradient Methods: Stochastic gradient descent (SGD) (Robbins & Monro, 1951) has been widely applied in training machine learning models for decades. Although SGD is straightforward to implement, it is known to be sensitive to hyperparameters and relatively slow to converge when facing heavy-tailed stochastic gradient noise. Adaptive gradient methods were proposed to overcome these issues of SGD, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), and AdaDelta (Zeiler, 2012). Adam (Kingma & Ba, 2014) and its variant AMSGrad (Reddi et al., 2018) are widely used in training deep neural networks, and other variants (Luo et al., 2019; Loshchilov & Hutter, 2017; Chen et al., 2020a) also play important roles in improving adaptive gradient methods from different aspects.
Federated Learning: Driven by the demand for storing data and training models locally at edge devices, federated learning (Konečný et al., 2016; Li et al., 2020) has rapidly attracted growing interest in recent years. The Federated Averaging method (FedAvg) (McMahan et al., 2017) works by periodically averaging local SGD updates. Stich (2018) provided a concise theoretical convergence guarantee for local SGD, and Lin et al. (2018) proposed a variant of local SGD with empirical improvements. Many works build on FedAvg, such as FedProx (Li et al., 2020), FedNova (Wang et al., 2020), and SCAFFOLD (Karimireddy et al., 2020), and other works discussed variants of FedAvg (Yang et al., 2021; Li et al., 2019b; Hsu et al., 2019; Wang et al., 2019). Reddi et al. (2020) recently proposed several adaptive federated optimization methods, including FedAdagrad, FedYogi, and FedAdam, to overcome the existing convergence issues of FedAvg. Chen et al. (2020b) proposed Local AMSGrad, and Tong et al. (2020) proposed a family of federated adaptive gradient methods with calibrations. Another line of research focuses on data heterogeneity issues or network resource allocation issues (Ghosh et al., 2019; Li & Wang, 2019; Yang et al., 2020).

Communication-Compressed Federated Learning: Various strategies have been proposed for reducing communication costs in distributed learning, both for SGD-based algorithms (Bernstein et al., 2018; Seide et al., 2014; Alistarh et al., 2017; Basu et al., 2019; Stich et al., 2018; Stich & Karimireddy, 2019; Karimireddy et al., 2019) and for adaptive gradient methods (Tang et al., 2021; Wang et al., 2022).
In terms of federated learning, many studies have applied the aforementioned methods to FedAvg and have attracted growing interest recently, e.g., FedPAQ (Reisizadeh et al., 2020), FedCOM (Haddadpour et al., 2021), signSGD in federated learning (Jin et al., 2020), communication-efficient federated learning (Chen et al., 2021b), and AdaQuantFL (Jhunjhunwala et al., 2021). However, there have been fewer attempts to develop communication-efficient adaptive gradient methods in federated learning, which is our key focus in this work.

3. Proposed Method

In this paper, we aim to study the following federated learning nonconvex optimization problem:

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{m} \sum_{i=1}^{m} F_i(x), \qquad (3.1)$$

where $m$ is the total number of local clients, $d$ denotes the dimension of the model parameters, and $F_i(x) = \mathbb{E}_{\xi_i \sim \mathcal{D}_i} F_i(x, \xi_i)$ is the local nonconvex loss function on client $i$ associated with a local data distribution $\mathcal{D}_i$. In the stochastic setting, we can only obtain an unbiased estimator of $\nabla F_i(x)$, i.e., the stochastic gradient $g_t^i = \nabla F_i(x, \xi_i)$. In the non-i.i.d. setting, the distributions can vary from each other, i.e., $\mathcal{D}_i \neq \mathcal{D}_j$ for $i \neq j$.

FedAvg (McMahan et al., 2017) is a commonly used optimization approach for solving (3.1). Let $x_t$ denote the global model parameters before the $t$-th iteration. At iteration $t$, each participating client $i$ from the selected subset $S_t$ (of size $n$) receives the model $x_t$ from the server, conducts $K$ steps of local SGD updates with local learning rate $\eta_l$, and obtains the local model $x_{t,K}^i$. Client $i$ then sends the model difference $\Delta_t^i = x_{t,K}^i - x_t$ to the server. The server obtains the global model difference $\Delta_t$ by simply averaging the local model differences $\Delta_t^i$, and then updates the global model by $x_{t+1} = x_t + \Delta_t$, which is the same1 as directly averaging the local models $x_{t,K}^i$, i.e., $x_{t+1} = x_t + \frac{1}{n}\sum_{i \in S_t}(x_{t,K}^i - x_t) = \frac{1}{n}\sum_{i \in S_t} x_{t,K}^i$.
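A minimal numpy sketch of one such FedAvg round, assuming a user-supplied stochastic-gradient oracle `grad(i, x)` (a hypothetical name, not from the paper) for client $i$:

```python
# Sketch of one FedAvg round: each selected client runs K local SGD steps,
# then the server averages the model differences (global step size eta = 1).
import numpy as np

def fedavg_round(x_t, clients, grad, K=5, eta_l=0.1):
    deltas = []
    for i in clients:
        x = x_t.copy()
        for _ in range(K):
            x -= eta_l * grad(i, x)      # local SGD step on client i
        deltas.append(x - x_t)           # Delta_t^i = x_{t,K}^i - x_t
    delta_t = np.mean(deltas, axis=0)    # server averages the differences
    return x_t + delta_t                 # same as averaging the x_{t,K}^i
```

On a toy quadratic objective per client, one round moves the global model toward the average of the clients' minimizers, as the averaging identity above suggests.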
FedAdam was then proposed among several adaptive optimization methods in federated learning (Reddi et al., 2020). FedAdam changes the global update rule of FedAvg from a one-step SGD update to a one-step adaptive gradient update. Specifically, after gathering the local differences $\Delta_t^i$ and averaging them into $\Delta_t$, the server updates the global model with the Adam optimizer:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\Delta_t, \qquad (3.2)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\Delta_t^2, \qquad (3.3)$$
$$x_{t+1} = x_t + \eta \frac{m_t}{\sqrt{v_t} + \epsilon}, \qquad (3.4)$$

where $\Delta_t$ acts as a pseudo-gradient, so the global update can be viewed as a one-step Adam update using $\Delta_t$. Several variants were also proposed with slight changes in the variance term $v_t$, such as FedAdagrad and FedYogi (Reddi et al., 2020) and FedAMSGrad (Tong et al., 2020). Note that the $\epsilon$ in (3.4) is used for numerical stabilization purposes, as the $v_t$ term can be quite small and cause unstable optimization behaviours.

1The global update of FedAvg is equivalent to performing a one-step SGD update with the pseudo-gradient $\Delta_t$ and learning rate $\eta = 1$.

3.1. Federated AMSGrad with Max Stabilization

In this section, we propose a general adaptive federated optimization framework, Federated AMSGrad with Max Stabilization (FedAMS), in which the server conducts one additional max stabilization step before the final update. Algorithm 1 summarizes the details of the general FedAMS framework. At the beginning of global round $t$, we first select a subset of clients $S_t$; each participating client $i \in S_t$ obtains the local model $x_{t,K}^i$ after $K$ steps of local SGD updates with learning rate $\eta_l$. The model difference $\Delta_t^i$ is the difference between the locally updated model $x_{t,K}^i$ and the current global model $x_t$, i.e., $\Delta_t^i = x_{t,K}^i - x_t$. The server aggregates the $\Delta_t^i$ and gets the global difference $\Delta_t$, which acts as a pseudo-gradient for computing the momentum $m_t$ and variance $v_t$ following (3.2) and (3.3). Now, for updating $x_{t+1}$, our general FedAMS framework provides two options for max stabilization:

Option 1: $\hat{v}_t = \max(\hat{v}_{t-1}, v_t, \epsilon)$, $x_{t+1} = x_t + \eta \frac{m_t}{\sqrt{\hat{v}_t}}$.
Option 2: $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$, $x_{t+1} = x_t + \eta \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}$.

Note that Option 2 is the same as the AMSGrad (Reddi et al., 2018) update rule, which introduces a non-decreasing $v_t$ to fix a non-convergence issue of Adam (Kingma & Ba, 2014). For Option 1, FedAMS directly adopts $\hat{v}_t$ as the denominator, where $\epsilon$ is taken as part of the max operation in $\hat{v}_t$. Intuitively, the unstable behaviour of the denominator (small values in $v_t$) usually occurs only on a small set of dimensions. Therefore, the max stabilization strategy in Option 1 only affects those dimensions with small $v_t$ values, while the traditional additive strategy in Option 2 affects the accuracy on all dimensions. Moreover, we want to emphasize that the theoretical analysis in Reddi et al. (2020) assumes $\beta_1 = 0$ and only considers the impact of the variance $v_t$; thus the non-decreasing variance is not necessary for their analysis. In contrast, the non-decreasing variance is indeed necessary for us to obtain a complete proof with a positive $\beta_1$ (see Appendix for details).

3.2. Federated Communication-Compressed AMSGrad

In order to reduce the communication cost of synchronization, we propose Federated Communication-compressed AMSGrad with Max Stabilization (FedCAMS), which is summarized in Algorithm 2. The main difference is that after client $i$ obtains the model difference $\Delta_t^i$ via local SGD, FedCAMS compresses $\Delta_t^i$ into $\hat{\Delta}_t^i$ via an error-feedback compression strategy and then sends $\hat{\Delta}_t^i$ to the central server. In detail, at round $t$, client $i$ applies the compressor to the sum of the model difference $\Delta_t^i$ and the cumulative compression error $e_t^i$ to obtain $\hat{\Delta}_t^i$. After that, the client updates the cumulative compression error as $e_{t+1}^i = \Delta_t^i + e_t^i - \hat{\Delta}_t^i$, which will be used in the next round's computation.
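The two max-stabilization options and the client-side error-feedback step described above can be sketched as follows (a numpy illustration under our own naming, not the authors' implementation):

```python
# Server-side Fed AMS update with the two max-stabilization options, plus
# the Fed CAMS client-side error-feedback compression step.
import numpy as np

def server_update(x, delta, m, v, v_hat, eta=1.0, beta1=0.9, beta2=0.99,
                  eps=1e-3, option=1):
    """One global step; delta (Delta_t) acts as a pseudo-gradient."""
    m = beta1 * m + (1 - beta1) * delta              # Eq. (3.2)
    v = beta2 * v + (1 - beta2) * delta ** 2         # Eq. (3.3)
    if option == 1:   # Option 1: eps participates in the max
        v_hat = np.maximum.reduce([v_hat, v, np.full_like(v, eps)])
        x = x + eta * m / np.sqrt(v_hat)
    else:             # Option 2: AMSGrad-style additive eps
        v_hat = np.maximum(v_hat, v)
        x = x + eta * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat

def client_compress(delta, e, compressor):
    """Error feedback: compress Delta + residual, carry the new residual."""
    c = compressor(delta + e)       # what is actually transmitted
    return c, delta + e - c         # e_{t+1} = Delta_t + e_t - c
```

By construction, the transmitted message plus the carried residual always equals the uncompressed quantity, which is what prevents compression error from being silently dropped.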
The rest of FedCAMS is similar to FedAMS: the server aggregates the $\hat{\Delta}_t^i$ and obtains $\hat{\Delta}_t$, which participates in the global update.

Algorithm 1 FedAMS
Input: initial point $x_1$, local step size $\eta_l$, global step size $\eta$, $\beta_1$, $\beta_2$, $\epsilon$.
1: $m_0 \leftarrow 0$, $v_0 \leftarrow 0$
2: for $t = 1$ to $T$ do
3:   Randomly sample a subset $S_t$ of clients
4:   Server sends $x_t$ to the subset $S_t$ of clients
5:   $x_{t,0}^i = x_t$
6:   for each client $i \in S_t$ in parallel do
7:     for $k = 0, \ldots, K-1$ do
8:       Compute local stochastic gradient: $g_{t,k}^i = \nabla F_i(x_{t,k}^i; \xi_{t,k}^i)$
9:       $x_{t,k+1}^i = x_{t,k}^i - \eta_l g_{t,k}^i$
10:     end for
11:     $\Delta_t^i = x_{t,K}^i - x_t$
12:   end for
13:   Server aggregates local updates: $\Delta_t = \frac{1}{|S_t|}\sum_{i \in S_t} \Delta_t^i$
14:   Update $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\Delta_t$
15:   Update $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\Delta_t^2$
     // Option 1:
16:   $\hat{v}_t = \max(\hat{v}_{t-1}, v_t, \epsilon)$, update $x_{t+1} = x_t + \eta \frac{m_t}{\sqrt{\hat{v}_t}}$
     // Option 2:
17:   $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$, update $x_{t+1} = x_t + \eta \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}$
18: end for

To summarize, FedCAMS is communication-efficient with the following features.

Error Feedback Compression: Although the error-feedback strategy (Karimireddy et al., 2019; Stich et al., 2018; Stich & Karimireddy, 2019) has been widely used in various distributed learning settings, error feedback has seen much less use in federated settings, especially for adaptive federated optimization. Note that combining error feedback with adaptive federated optimization is not a trivial task; it is actually quite involved. Specifically, the theoretical analysis of adaptive gradient methods in the typical nonconvex setting relies on the construction of a Lyapunov function. Compared to directly analyzing the model parameter $x$, this causes extra difficulty, as applying the error-feedback strategy to the smoothness-expanded terms from the Lyapunov function results in an accumulation of the compression error, which leads to divergence2.
In our theoretical analysis, we have to modify the original construction of the Lyapunov function and introduce a new auxiliary sequence for the compression error, which eliminates the accumulation of compression error. Unlike direct compression strategies such as simple quantization (Haddadpour et al., 2021; Reisizadeh et al., 2020; Jin et al., 2020), which usually require an unbiased compressor to work, error feedback allows for various biased compressors, such as the commonly used scaled-sign or top-k compressors. Furthermore, error feedback is well known for reducing unnecessary compression error, which leads to a more precise model update.

2A similar divergence issue has also been discussed in Tang et al. (2021); Wang et al. (2022) in the distributed setting.

Support for Partial Participation: We also make error feedback compatible with partial participation settings by keeping the stale cumulative compression error for clients who were not selected for the current round of training (see Lines 14-16 in Algorithm 2). This design makes FedCAMS more practical and communication-efficient. Note that the default client sampling strategy in FedCAMS is to randomly select the participating clients (without replacement) in each round, i.e., $p_i = \mathbb{P}\{i \in S_t\} = n/m$. This can be easily extended to a weighted sampling strategy with probability $p_i = w_i$, even with varying numbers of participating workers $n$.

Algorithm 2 FedCAMS
Input: initial point $x_1$, local step size $\eta_l$, global step size $\eta$, $\beta_1$, $\beta_2$, $\epsilon$, compressor $C(\cdot)$.
1: $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $e_1^i = 0$
2: for $t = 1$ to $T$ do
3:   Randomly sample a subset $S_t$ of clients
4:   Server sends $x_t$ to the subset $S_t$ of clients
5:   $x_{t,0}^i = x_t$
6:   for each client $i \in S_t$ in parallel do
7:     for $k = 0, \ldots, K-1$ do
8:       Compute local stochastic gradient: $g_{t,k}^i = \nabla F_i(x_{t,k}^i; \xi_{t,k}^i)$
9:       $x_{t,k+1}^i = x_{t,k}^i - \eta_l g_{t,k}^i$
10:     end for
11:     $\Delta_t^i = x_{t,K}^i - x_t$
12:     Compress $\hat{\Delta}_t^i = C(\Delta_t^i + e_t^i)$, send $\hat{\Delta}_t^i$ to the server, and update $e_{t+1}^i = \Delta_t^i + e_t^i - \hat{\Delta}_t^i$
13:   end for
14:   for each client $j \notin S_t$ in parallel do
15:     Client $j$ maintains the stale compression error $e_{t+1}^j = e_t^j$
16:   end for
17:   Server aggregates local updates: $\hat{\Delta}_t = \frac{1}{|S_t|}\sum_{i \in S_t} \hat{\Delta}_t^i$
18:   Server updates $x_{t+1}$ using $\hat{\Delta}_t$ in the same way as in Algorithm 1 (Lines 14-17)
19: end for

4. Convergence Analysis

In this section, we present the theoretical convergence results of our proposed FedAMS and FedCAMS in Algorithms 1 and 2. We first introduce some assumptions needed for the proofs.

Assumption 4.1 (Smoothness). Each loss function on the $i$-th worker $F_i(x)$ is $L$-smooth, i.e., for all $x, y \in \mathbb{R}^d$, $F_i(x) \le F_i(y) + \langle \nabla F_i(y), x - y \rangle + \frac{L}{2}\|x - y\|^2$. This also implies the $L$-gradient Lipschitz condition, i.e., $\|\nabla F_i(x) - \nabla F_i(y)\| \le L\|x - y\|$.

Assumption 4.1 is a standard assumption in nonconvex optimization problems and has also been adopted in Kingma & Ba (2014); Reddi et al. (2018); Li et al. (2019b); Yang et al. (2021).

Assumption 4.2 (Bounded Gradient). Each loss function on the $i$-th worker $F_i(x)$ has a $G$-bounded stochastic gradient in $\ell_2$, i.e., for all $\xi$, we have $\|\nabla F_i(x, \xi)\| \le G$.

The bounded gradient assumption is commonly adopted for adaptive gradient methods (Kingma & Ba, 2014; Reddi et al., 2018; Zhou et al., 2018; Chen et al., 2020a).

Assumption 4.3 (Bounded Variance). Each stochastic gradient on the $i$-th worker has bounded local variance, i.e., for all $x$ and $i \in [m]$, we have $\mathbb{E}\|\nabla F_i(x, \xi) - \nabla F_i(x)\|^2 \le \sigma_l^2$, and the loss function on each worker has a bounded global variance, $\frac{1}{m}\sum_{i=1}^m \|\nabla F_i(x) - \nabla f(x)\|^2 \le \sigma_g^2$.
Assumption 4.3 is widely used in federated optimization problems (Li et al., 2019b; Reddi et al., 2020; Yang et al., 2021). The bounded local variance represents the randomness of stochastic gradients, and the bounded global variance represents the data heterogeneity between clients. Note that $\sigma_g = 0$ corresponds to the i.i.d. setting, in which the datasets of all clients follow the same distribution. In the following, we show the convergence results of FedAMS3 and FedCAMS.

4.1. Convergence Analysis for FedAMS

Full Participation: In the full participation scheme, all workers participate in the communication rounds and model updates, i.e., $|S_t| = m$ for all $t \in [T]$.

Theorem 4.4. Under Assumptions 4.1-4.3, if the local learning rate $\eta_l$ satisfies the condition

$$\eta_l \le \min\bigg\{\frac{1}{8KL},\ \frac{\epsilon}{K\sqrt{\beta_2 K^2 G^2 + \epsilon}\,\big[(3 + C_1^2)\eta L + 2\sqrt{2(1-\beta_2)}G\big]}\bigg\},$$

then the iterates of FedAMS in Algorithm 1 under the full participation scheme satisfy

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{\sqrt{\beta_2 \eta_l^2 K^2 G^2 + \epsilon}}{\eta \eta_l K}\Big(\frac{f_0 - f_*}{T} + \frac{\Psi}{T}\Big) + \Phi, \qquad (4.1)$$

where $\Psi = \frac{C_1 G^2 d}{\epsilon} + \frac{2C_1^2 \eta \eta_l K L G^2 d}{\epsilon}$, $\Phi = 5\eta_l^2 K L^2 (\sigma_l^2 + 6K\sigma_g^2) + \big[(3 + C_1^2)\eta L + 2\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l}{2m\epsilon}\sigma_l^2$, and $C_1 = \frac{\beta_1}{1-\beta_1}$.

3For simplicity, we only present the convergence guarantee with Option 1. The theoretical analysis can be easily extended to Option 2 with constant-only changes.

Remark 4.5. The upper bound for $\min_{t \in [T]} \mathbb{E}[\|\nabla f(x_t)\|^2]$ contains three parts: the first two terms, which are directly related to the total number of rounds $T$, vanish as $T \to \infty$. The last term in (4.1) relates to the local stochastic variance $\sigma_l$ and the global variance $\sigma_g$. In the i.i.d. setting, where each worker has the same data distribution, the global variance is zero, i.e., $\sigma_g = 0$, so the variance term $\Phi$ is smaller and less dependent on the number of local steps $K$.

Corollary 4.6. Suppose we choose the global learning rate $\eta = \Theta(\sqrt{Km})$ and local learning rate $\eta_l = \Theta(\frac{1}{\sqrt{T}K})$. When $T$ is sufficiently large, i.e., $T \ge Km$, the convergence rate of FedAMS in Algorithm 1 under the full participation scheme satisfies

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 = O\Big(\frac{1}{\sqrt{TKm}}\Big).$$

Remark 4.7.
Corollary 4.6 suggests that with sufficiently large $T$, FedAMS achieves a convergence rate of $O(\frac{1}{\sqrt{TKm}})$, which matches the results for general federated nonconvex optimization methods such as SCAFFOLD (Karimireddy et al., 2020) and FedAdam (Reddi et al., 2020).

Remark 4.8. Note that compared with FedAdam (Reddi et al., 2020), our theoretical analysis of FedAMS makes improvements in completing the proof. The analysis in Reddi et al. (2020) only considers the case $\beta_1 = 0$, which largely simplifies their theoretical analysis, while we provide a full analysis of FedAMS with a non-zero momentum term.

Partial Participation: In the partial participation scheme, we assume that only $n$ of the $m$ workers perform local updates and communicate with the central server at each round $t$, i.e., $|S_t| = n$ for all $t \in [T]$. Partial participation introduces the randomness of sampling, and the resulting coefficients vary for different sampling methods. Here we consider random sampling without replacement: at the $t$-th iteration, we randomly sample a subset $S_t$ containing $n$ workers for local updating, and for any two workers $i, j$, the probabilities of being sampled to participate in the model update are $\mathbb{P}\{i \in S_t\} = \frac{n}{m}$ and $\mathbb{P}\{i, j \in S_t\} = \frac{n(n-1)}{m(m-1)}$.

Theorem 4.9. Under Assumptions 4.1-4.3, if the local learning rate $\eta_l$ satisfies

$$\eta_l \le \min\bigg\{\frac{1}{8KL},\ \frac{\epsilon}{K\sqrt{\beta_2 K^2 G^2 + \epsilon}\,\big[(3 + C_1^2)\eta L + 2\sqrt{2(1-\beta_2)}G\big]}\bigg\},$$

then the iterates of FedAMS in Algorithm 1 under the partial participation scheme satisfy

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{\sqrt{\beta_2 \eta_l^2 K^2 G^2 + \epsilon}}{\eta \eta_l K}\Big(\frac{f_0 - f_*}{T} + \frac{\Psi}{T}\Big) + \Phi, \qquad (4.3)$$

where $\Psi = \frac{C_1 G^2 d}{\epsilon} + \frac{2C_1^2 \eta \eta_l K L G^2 d}{\epsilon}$, $\Phi = 5\eta_l^2 K L^2(\sigma_l^2 + 6K\sigma_g^2) + \big[(3 + C_1^2)\eta L + 2\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l}{2n\epsilon}\sigma_l^2 + \big[(3 + C_1^2)\eta L + 2\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l(m-n)}{2n(m-1)\epsilon}\big[15K^2L^2\eta_l^2(\sigma_l^2 + 6K\sigma_g^2) + 3K\sigma_g^2\big]$, and $C_1 = \frac{\beta_1}{1-\beta_1}$.

Remark 4.10. The upper bound for $\min_{t \in [T]} \mathbb{E}[\|\nabla f(x_t)\|^2]$ under partial participation is similar to the full participation case but with a larger variance term $\Phi$.
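The inclusion probabilities for sampling without replacement stated above can be checked by direct enumeration of all size-$n$ subsets (a small illustrative script, not part of the paper's analysis):

```python
# Exact P{i in S_t} and P{i, j in S_t} when S_t is a uniformly random
# size-n subset of m clients, computed by enumerating all subsets.
from fractions import Fraction
from itertools import combinations

def inclusion_probs(m, n):
    subsets = list(combinations(range(m), n))
    p_i = Fraction(sum(1 for s in subsets if 0 in s), len(subsets))
    p_ij = Fraction(sum(1 for s in subsets if 0 in s and 1 in s),
                    len(subsets))
    return p_i, p_ij
```

For example, with $m = 5$ and $n = 2$ this returns $2/5$ and $1/10$, matching $n/m$ and $n(n-1)/(m(m-1))$.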
This is due to the fact that random sampling of the participating workers introduces additional variance. In the i.i.d. setting, where the global variance is zero, i.e., $\sigma_g = 0$, the variance term again becomes smaller and less dependent on the number of local steps $K$.

Corollary 4.11. Suppose we choose the global learning rate $\eta = \Theta(\sqrt{Kn})$ and local learning rate $\eta_l = \Theta(\frac{1}{\sqrt{T}K})$. Then the convergence rate of FedAMS in Algorithm 1 under the partial participation scheme with without-replacement sampling is

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 = O\Big(\frac{1}{\sqrt{TKn}} + \frac{\sqrt{K}}{\sqrt{Tn}}\Big).$$

Remark 4.12. Note that Corollary 4.11 suggests that the dominant term in (4.3) is $O(\frac{\sqrt{K}}{\sqrt{Tn}})$, which directly relates to the global variance $\sigma_g^2$. This convergence rate is consistent with the partial participation result for FedAvg in the non-i.i.d. case in Yang et al. (2021). It shows that the global variance has more impact on the convergence behaviour in partial participation cases, especially in highly non-i.i.d. cases where $\sigma_g$ is large. Corollary 4.11 also suggests that a larger number of participating clients $n$ accelerates convergence. Note that although FedAdam (Reddi et al., 2020) did not provide explicit conclusions for the partial participation setting, its appendix introduced the necessary steps for analyzing such a setting, which do not imply such a desired relationship with $n$.

Remark 4.13. The impact of the number of local updates $K$ is complicated. In the partial participation setting, our result shows that a larger $K$ slows down convergence, while full participation suggests the opposite. A similar slow-down result has been mentioned in Li et al. (2019b), while some other works (Stich, 2018; McMahan et al., 2017) showed that a larger $K$ increases the convergence rate. We leave a thorough study of this as future work.

4.2. Convergence Analysis for FedCAMS

Let us first introduce the assumption on the compressor.

Assumption 4.14 (Biased Compressor).
Consider a biased compression operator $C: \mathbb{R}^d \to \mathbb{R}^d$ such that for some constant $0 \le q < 1$, $\mathbb{E}[\|C(x) - x\|] \le q\|x\|$ for all $x \in \mathbb{R}^d$. Note that $q = 0$ leads to $C(x) = x$, which means no compression of $x$.

Assumption 4.14 is a standard assumption for biased compressors (Karimireddy et al., 2019; Stich et al., 2018; Tang et al., 2021). Several widely used compressors satisfy Assumption 4.14, such as the scaled-sign compressor and the top-k compressor.

Top-k (Stich et al., 2018): For $1 \le k \le d$ and $x \in \mathbb{R}^d$, order the coordinates of $x$ by magnitude, $|x_{(1)}| \le |x_{(2)}| \le \cdots \le |x_{(d)}|$, and denote $\alpha_1, \alpha_2, \ldots, \alpha_d$ as the standard unit basis vectors in $\mathbb{R}^d$. The compressor $C_{top}: \mathbb{R}^d \to \mathbb{R}^d$ is defined as $C_{top}(x) = \sum_{i=d-k+1}^{d} x_{(i)}\alpha_{(i)}$.

Remark 4.15. Define the compression ratio $r = k/d$. It can be shown that $\|C_{top}(x) - x\|^2 \le (1 - k/d)\|x\|^2 = (1 - r)\|x\|^2$, and thus we have $q = \sqrt{1 - r}$.

Scaled sign (Karimireddy et al., 2019): For $x \in \mathbb{R}^d$, the compressor $C_{sign}: \mathbb{R}^d \to \mathbb{R}^d$ is defined as $C_{sign}(x) = \|x\|_1 \operatorname{sign}(x)/d$.

Remark 4.16. For the scaled-sign compressor, we have $\|C_{sign}(x) - x\|^2 = \big(1 - \frac{\|x\|_1^2}{d\|x\|^2}\big)\|x\|^2$, and thus we have $q = \sqrt{1 - \frac{\|x\|_1^2}{d\|x\|^2}}$.

Next, we show the convergence analysis of FedCAMS. Due to the space limit, we only present the full participation setting and leave the partial participation setting to Appendix B.3.

Theorem 4.17. Under Assumptions 4.1-4.3 and 4.14, if the local learning rate $\eta_l$ satisfies the condition

$$\eta_l \le \min\bigg\{\frac{1}{8KL},\ \frac{\epsilon}{K C_{\beta,q}\big[3\eta L + 2C_2\eta L + 2\sqrt{2(1-\beta_2)}G\big]}\bigg\}, \quad C_{\beta,q} = \sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}K^2G^2 + \epsilon},$$

then the iterates of FedCAMS in Algorithm 2 satisfy

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{\sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}\eta_l^2K^2G^2 + \epsilon}}{\eta \eta_l K}\Big(\frac{f_0 - f_*}{T} + \frac{\Psi}{T}\Big) + \Phi,$$

where $\Psi = \frac{C_1 G^2 d}{\epsilon} + \frac{2C_1^2 \eta \eta_l K L G^2 d}{\epsilon}$, $\Phi = 5\eta_l^2 K L^2(\sigma_l^2 + 6K\sigma_g^2) + \big[(3 + 2C_2)\eta L + 2\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l}{2m\epsilon}\sigma_l^2$, $C_1 = \frac{\beta_1}{1-\beta_1} + \frac{2q}{1-q^2}$, and $C_2 = \frac{\beta_1^2}{(1-\beta_1)^2} + \frac{4q^2}{(1-q^2)^2}$.

Remark 4.18. The convergence bound in Theorem 4.17 also contains three parts: the first two parts are related to the total number of rounds $T$ and vanish as $T$ increases. The last term $\Phi$ shows no direct dependency on $T$, but depends on the local and global variances.
In the i.i.d. case where $\sigma_g = 0$, the variance term $\Phi$ decreases and shows less dependency on the number of local steps $K$.

Corollary 4.19. Suppose we choose the global learning rate $\eta = \Theta(\sqrt{Km})$ and local learning rate $\eta_l = \Theta(\frac{1}{\sqrt{T}K})$. When $T$ is sufficiently large, i.e., $T \ge Km$, the convergence rate of FedCAMS in Algorithm 2 under the full participation scheme satisfies

$$\min_{t \in [T]} \mathbb{E}\|\nabla f(x_t)\|^2 = O\Big(\frac{1}{\sqrt{TKm}}\Big).$$

Remark 4.20. Corollary 4.19 suggests that with sufficiently large $T$, FedCAMS achieves the desired $O(\frac{1}{\sqrt{TKm}})$ convergence rate, which matches the result for its uncompressed counterpart, FedAMS. This suggests that FedCAMS can indeed achieve better communication efficiency without sacrificing much accuracy.

Remark 4.21. The constants $C_1$ and $C_2$ in Theorem 4.17 are related to the compression constant $q$. Specifically, if we track the dependency on $q$ in the convergence rate, we obtain $O(\frac{1}{(1-q)\sqrt{TKm}})$ under the full participation scheme. A larger $q$ ($q \to 1$) corresponds to stronger compression and leads to worse convergence due to heavier information loss. Note that this $q$-dependency is common for adaptive gradient methods, since their convergence proofs rely heavily on the bounded gradient assumption, and thus the compressed gradient bound is related to $q$. A similar type of $q$-dependency also occurs in other communication-compressed distributed Adam methods such as Chen et al. (2021a).4

5. Experiments

In this section, we present empirical validation of the effectiveness of our proposed algorithms. First, we compare FedAMS with other first-order federated optimization baselines. Second, we provide experimental results for our proposed communication-efficient adaptive federated learning method, FedCAMS, to show its effectiveness in achieving communication-efficient adaptive federated learning.
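The top-k and scaled-sign compressors defined in Section 4.2 can be sketched in numpy as follows (illustrative code with our own names; the contraction bounds from Remarks 4.15 and 4.16 can be checked numerically):

```python
# Two biased compressors satisfying E||C(x) - x|| <= q ||x||.
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def scaled_sign(x):
    """C_sign(x) = (||x||_1 / d) * sign(x)."""
    return np.linalg.norm(x, 1) / x.size * np.sign(x)
```

For top-k, the compression error satisfies $\|C_{top}(x) - x\|^2 \le (1 - k/d)\|x\|^2$; for scaled sign, the error norm equals $\|x\|^2 - \|x\|_1^2/d$ exactly, consistent with the stated values of $q$.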
Experimental Setup: We test all federated learning baselines, including ours, on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009) using the following two models: (1) ResNet-18 (He et al., 2016), a widely used convolutional neural network that is commonly trained with SGD; and (2) the ConvMixer model (Trockman & Kolter, 2022), which shares similar ideas with the vision transformer (Dosovitskiy et al., 2021), using patch embeddings to preserve locality, and is likewise trained via adaptive gradient methods by default. We use 100 clients in total for all federated training experiments. We set the partial participation ratio to 0.1, i.e., in each round the server picks 10 out of 100 clients to participate in the communication and model update. In each round, each client performs 3 local epochs of training with batch size 20. We search for the best training hyper-parameters for each baseline, including ours. Due to the space limit, we leave all hyper-parameter details as well as the CIFAR-100 experiments to the Appendix.

4Note that the $q$-dependency in 1-bit Adam (Tang et al., 2021) is actually different due to its use of a variance-freezed Adam update, i.e., freezing the variance term $v_t$ of the Adam update as a constant after a few epochs, which makes it resemble momentum SGD.

5.1. FedAMS and Adaptive Federated Optimization

We compare the two options of the FedAMS framework with several state-of-the-art adaptive federated optimization methods, including (1) FedAdam (Reddi et al., 2020) and (2) FedYogi (Reddi et al., 2020), as well as the standard federated baseline (3) FedAvg (McMahan et al., 2017). Note that Option 2 of the FedAMS framework is the same as FedAMSGrad (Tong et al., 2020). Thus, in this section, we write FedAMS for Option 1 and FedAMSGrad for Option 2 of the general FedAMS framework. Figure 1 shows the convergence results of FedAMS and other federated learning baselines when training on the CIFAR-10 dataset with the ResNet-18 and ConvMixer-256-8 models.
We compare the training loss and test accuracy against global rounds for each model. For the ResNet-18 model, FedAMS and FedYogi achieve quite similar performance, significantly better than the other three baselines. In particular, FedAMS performs the best in terms of final training loss and test accuracy. On the other hand, FedAMSGrad and FedAdam obtain quite similar results in test accuracy and training loss. FedAvg achieves a slightly better training loss than FedAdam and FedAMSGrad, and a much higher test accuracy that is close to FedYogi and FedAMS. For the ConvMixer-256-8 model, which is typically trained via adaptive gradient methods, we observe that all adaptive federated optimization methods (FedAdam, FedYogi, FedAMSGrad and FedAMS) achieve much better performance in terms of both training loss and test accuracy than FedAvg. In particular, FedAMS again achieves significantly better results than the other baselines. The other adaptive methods, including FedAdam, FedYogi, and FedAMSGrad, show similar convergence behaviour when training the ConvMixer-256-8 model. These results empirically show the effectiveness of our proposed FedAMS method with max stabilization. Figure 2 shows the effect of the number of participating clients n on convergence, for n ∈ {5, 10, 20}. From Figure 2 we observe that a larger number of participating clients n in general yields a faster convergence rate, which verifies our theoretical results in Section 4.1. Figure 3 shows an ablation study over the number of local epochs E ∈ {3, 10, 30, 100} when training CIFAR-10 on ResNet-18 with the FedAMS optimizer. We observe that a larger E leads to faster convergence, but does not show a significant advantage in achieving higher test accuracy. Following FedAvg (McMahan et al., 2017) and FedAdam (Reddi et al., 2020), we set the number of local epochs E = 3 by default unless otherwise specified.
Figure 1. The learning curves for FedAMS and other federated learning baselines on training CIFAR-10 data: (a)(b) show the results for the ResNet-18 model and (c)(d) show the results for the ConvMixer-256-8 model.

Figure 2. The learning curves for FedAMS with different numbers of participating clients n in training CIFAR-10 data on the ResNet-18 and ConvMixer-256-8 models.

Figure 3. The learning curves for FedAMS with different numbers of local epochs E in training CIFAR-10 data on the ResNet-18 model.

5.2. Communication-Efficient FedCAMS

Figure 4 shows the convergence results of FedAMS and FedCAMS5 with different compression strategies on training the CIFAR-10 dataset with the ResNet-18 model. It compares the scaled sign compressor against the top-k compressor with compression ratio r ∈ {1/64, 1/128, 1/256}. We compare the training loss and test accuracy against both global rounds and the (pseudo-)gradient communication bits6. FedAMS, which does not perform any communication compression, achieves the best training loss yet requires a large volume of communication.
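As our own concrete sketch (not the paper's code), the scaled sign compressor can be written as C(x) = (‖x‖₁/d)·sign(x), and the per-round bit accounting discussed in this subsection, with footnote 7's doubling for top-k index transmission, can be checked numerically:

```python
import numpy as np

def scaled_sign(x):
    """Scaled sign compressor: one 32-bit scale ||x||_1/d plus d sign bits."""
    return (np.linalg.norm(x, 1) / x.size) * np.sign(x)

def sign_bits(d):
    return 32 + d          # one float scale + one sign bit per coordinate

def topk_bits(d, r):
    k = int(d * r)
    return 64 * k          # k 32-bit values + roughly k 32-bit indices

# Exact compression error of scaled sign: ||x - C(x)||^2 = ||x||^2 - ||x||_1^2 / d.
rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
err = np.linalg.norm(x - scaled_sign(x)) ** 2
assert np.isclose(err, np.linalg.norm(x) ** 2 - np.linalg.norm(x, 1) ** 2 / x.size)

# At r = 1/64, top-k costs 64 * d/64 = d bits, about the same as sign's 32 + d.
d = 10_000_000
assert abs(topk_bits(d, 1 / 64) / sign_bits(d) - 1) < 0.01
```

This is why the sign compressor and top-k with r = 1/64 appear at roughly the same horizontal position on the communication-bits axis of Figure 4.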
For our FedCAMS compression methods, the sign compressor and the top-k compressor with ratio r = 1/64 achieve similar performance in terms of test accuracy against training rounds, and obtain the best trade-off between communication efficiency and model accuracy. Figure 4 (b)(d) show the direct comparison against communication bits for training ResNet-18 on CIFAR-10. In particular, we observe that the top-k compressor with a smaller r (i.e., heavier compression with more information lost) obtains better communication efficiency but a slower convergence rate. Note that for a d-dimensional vector, the overall cost of the scaled sign compressor is 32 + d bits, which is roughly the same communication cost as a top-k compressor7 with ratio r = 1/64. This verifies our theoretical results in Section 4.2. Figure 5 shows the convergence results of FedAMS and FedCAMS with the same compression strategies as in Figure 4 on training the CIFAR-10 dataset with the ConvMixer-256-8 model. We notice that FedCAMS with the scaled sign compressor achieves roughly the same training loss and test accuracy as FedAMS with a few orders of magnitude less communication cost, while the top-k compressed models perform significantly worse. Among the top-k compressed models, the one with compression ratio r = 1/64 still obtains better training loss and test accuracy but at a higher communication cost. These results suggest that our proposed FedCAMS is communication-efficient while maintaining high accuracy.

5 Here FedCAMS adopts Option 1 for the final update step for a fair comparison (same as FedAMS).
6 Note that here we only count the compressed client-to-server one-way communication.
7 The top-k compressor also needs to communicate the chosen k locations, which roughly doubles the cost.

Figure 4. The learning curves for FedCAMS and uncompressed FedAMS on training CIFAR-10 data on the ResNet-18 model.

Figure 5. The learning curves for FedCAMS and uncompressed FedAMS on training CIFAR-10 data on the ConvMixer-256-8 model.

6. Conclusions and Future Work

In this paper, we propose a communication-efficient compressed federated adaptive gradient optimization framework, FedCAMS, which largely reduces the communication overhead and addresses the adaptivity issue in federated optimization methods. FedCAMS is based on our proposed general adaptive federated optimization framework, FedAMS, which contains variants of FedAdam featuring max stabilization mechanisms. We present an improved theoretical convergence analysis of adaptive federated optimization, based on which we prove that, in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate as its uncompressed counterpart FedAMS with a few orders of magnitude less communication cost. Experiments on various benchmarks verify our theoretical analysis. Our current analysis is limited to one-way communication compression from the clients to the central server. However, extending our current analysis to two-way communication compression is highly non-trivial, as it can be hard to guarantee that the global models distributed from the server to the clients stay synchronized under error feedback and the biased compressor, especially in the partial participation setting. We leave this as future work.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. This research was supported in part by a Seed Grant award from the Institute for Computational and Data Sciences at the Pennsylvania State University as well as a Dell Technology AI Infrastructure Level Technologies Grant. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30:1709-1720, 2017.

Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. arXiv preprint arXiv:1906.02367, 2019.

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560-569. PMLR, 2018.

Brown, T.
B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Chen, C., Shen, L., Huang, H., and Liu, W. Quantized Adam with error feedback. ACM Transactions on Intelligent Systems and Technology (TIST), 12(5):1-26, 2021a.

Chen, J., Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. Closing the generalization gap of adaptive gradient methods in training deep neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2020a.

Chen, M., Shlezinger, N., Poor, H. V., Eldar, Y. C., and Cui, S. Communication-efficient federated learning. Proceedings of the National Academy of Sciences, 118(17), 2021b.

Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.

Chen, X., Li, X., and Li, P. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, pp. 119-128, 2020b.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Ghosh, A., Hong, J., Yin, D., and Ramchandran, K. Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629, 2019.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Haddadpour, F., Kamani, M. M., Mokhtari, A., and Mahdavi, M. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pp. 2350-2358. PMLR, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

Jhunjhunwala, D., Gadhikar, A., Joshi, G., and Eldar, Y. C. Adaptive quantization of model updates for communication-efficient federated learning. In ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3110-3114. IEEE, 2021.

Jin, R., Huang, Y., He, X., Dai, H., and Wu, T. Stochastic-sign SGD for federated learning with theoretical guarantees. arXiv preprint arXiv:2002.10940, 2020.

Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252-3261. PMLR, 2019.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132-5143. PMLR, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Li, D. and Wang, J. FedMD: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.

Li, T., Sanjabi, M., Beirami, A., and Smith, V. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019a.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50-60, 2020.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189, 2019b.

Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Luo, L., Xiong, Y., Liu, Y., and Sun, X. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Nishio, T. and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - IEEE International Conference on Communications (ICC), pp. 1-7. IEEE, 2019.

Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečný, J., Kumar, S., and McMahan, H. B. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R.
FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021-2031. PMLR, 2020.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association. Citeseer, 2014.

Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. arXiv preprint arXiv:1809.07599, 2018.

Tang, H., Gan, S., Awan, A. A., Rajbhandari, S., Li, C., Lian, X., Liu, J., Zhang, C., and He, Y. 1-bit Adam: Communication efficient large-scale training with Adam's convergence speed. arXiv preprint arXiv:2102.02888, 2021.

Tieleman, T., Hinton, G., et al. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.

Tong, Q., Liang, G., and Bi, J. Effective federated adaptive gradient methods with non-IID decentralized data. arXiv preprint arXiv:2009.06557, 2020.

Trockman, A. and Kolter, J. Z. Patches are all you need? arXiv preprint arXiv:2201.09792, 2022.

Wang, J., Tantia, V., Ballas, N., and Rabbat, M. SlowMo: Improving communication-efficient distributed SGD with slow momentum. arXiv preprint arXiv:1910.00643, 2019.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481, 2020.
Wang, Y., Lin, L., and Chen, J. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 6292-6320. PMLR, 2022.

Yang, H., Fang, M., and Liu, J. Achieving linear speedup with partial worker participation in non-IID federated learning. arXiv preprint arXiv:2101.11203, 2021.

Yang, Z., Chen, M., Saad, W., Hong, C. S., and Shikh-Bahaei, M. Energy efficient federated learning over wireless communication networks. IEEE Transactions on Wireless Communications, 20(3):1935-1949, 2020.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zhou, D., Chen, J., Cao, Y., Tang, Y., Yang, Z., and Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.

A. Proof in Section 4.1

A.1. Proof of Theorem 4.4

Similar to previous works studying adaptive methods (Chen et al., 2018; Zhou et al., 2018; Chen et al., 2020a), we introduce a Lyapunov sequence z_t: assume x_0 = x_1; for each t ≥ 1,

z_t = x_t + (β_1/(1−β_1))(x_t − x_{t−1}) = (1/(1−β_1)) x_t − (β_1/(1−β_1)) x_{t−1}. (A.1)

For the difference of the sequence z_t, we have

z_{t+1} − z_t = (1/(1−β_1))(x_{t+1} − x_t) − (β_1/(1−β_1))(x_t − x_{t−1})
= (1/(1−β_1)) η V̂_t^{−1/2} m_t − (β_1/(1−β_1)) η V̂_{t−1}^{−1/2} m_{t−1}
= (1/(1−β_1)) η V̂_t^{−1/2} [β_1 m_{t−1} + (1−β_1) Δ_t] − (β_1/(1−β_1)) η V̂_{t−1}^{−1/2} m_{t−1}
= η V̂_t^{−1/2} Δ_t − η (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m_{t−1}.

Since f is L-smooth, taking the conditional expectation at time t, we have

E[f(z_{t+1})] ≤ f(z_t) + E[⟨∇f(z_t), z_{t+1} − z_t⟩] + (L/2) E[‖z_{t+1} − z_t‖²]
= f(z_t) + E[⟨∇f(z_t), η V̂_t^{−1/2} Δ_t⟩] − E[⟨∇f(z_t), η (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m_{t−1}⟩] + (L η²/2) E[‖V̂_t^{−1/2} Δ_t − (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m_{t−1}‖²]
= f(z_t) + E[⟨∇f(x_t), η V̂_t^{−1/2} Δ_t⟩] − η E[⟨∇f(z_t), (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m_{t−1}⟩] + (L η²/2) E[‖V̂_t^{−1/2} Δ_t − (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m_{t−1}‖²] + E[⟨∇f(z_t) − ∇f(x_t), η V̂_t^{−1/2} Δ_t⟩]
=: f(z_t) + I_1 + I_2 + I_3 + I_4, (A.2)

where we recall the notation V̂_t = diag(v̂_t) with v̂_t = max(v̂_{t−1}, v_t, ε).
Bounding I1: We have I1 = E f(xt), η t bvt 2ηE f(xt), t p 2ηE f(xt), t vt + ϵ t p Communication-Efficient Adaptive Federated Learning where the first inequality follows by the fact that bvt vt+ϵ 2 . For the second term in (A.3), we have 2ηE f(xt), t vt + ϵ t p 2η f(xt) E 1 vt + ϵ 1 p ϵ E[ t 2], (A.4) where the second inequality follows from Lemma C.1 and C.4, and we will further apply the bound for E[ t 2] following from Lemma C.5. For the first term in (A.3), we have 2ηE f(xt), t p 2ηE f(xt) p β2vt 1 + ϵ , t + ηl K f(xt) ηl K f(xt) 2ηηl KE f(xt) 2ηE f(xt) p β2vt 1 + ϵ , t + ηl K f(xt) 2ηηl KE f(xt) β2vt 1 + ϵ , E 1 k=0 ηlgi t,k + ηl K f(xt) 2ηηl KE f(xt) β2vt 1 + ϵ , E ηl k=0 gi t,k + ηl K i=1 Fi(xt) , (A.5) where the third equality follows the local update rule. For the last term in (A.5), we have β2vt 1 + ϵ , E ηl k=0 gi t,k + ηl K β2vt 1 + ϵ f(xt), ηl K β2vt 1 + ϵ E m X k=0 ( Fi(xi t,k) Fi(xt)) 2ηηl 2Km2 E 1 k=0 ( Fi(xi t,k) Fi(xt)) 2ηηl 2Km2 E 1 k=0 Fi(xi t,k) k=0 E Fi(xi t,k) Fi(xt) 2ηηl 2Km2 E 1 k=0 Fi(xi t,k) k=0 E xi t,k xt 2ηηl 2Km2 E 1 k=0 Fi(xi t,k) where the second equation follows from x, y = 1 2[ x 2 + y 2 x y 2], the first inequality holds by applying Cauchy-Schwarz inequality, the second inequality follows from Assumption 4.1. Communication-Efficient Adaptive Federated Learning Hence by applying Lemma C.9 with the local learning rate condition: ηl 1 8KL, we have 2 η f(xt) p β2vt 1 + ϵ , E ηl k=0 gi t,k + ηl K 2 + 5ηη3 l K2L2 2ϵ (σ2 l + 6Kσ2 g) 2ηηl 2Km2 E 1 k=0 Fi(xi t,k) Then merging pieces together, we have 2 + 5ηη3 l K2L2 2ϵ (σ2 l + 6Kσ2 g) 2ηηl 2Km2 E 1 k=0 Fi(xi t,k) 2 + 5ηη3 l K2L2 2ϵ (σ2 l + 6Kσ2 g) ηηl 2Km2 E 1 k=0 Fi(xi t,k) ϵ E[ t 2]. 
(A.8) Bounding I2: The bound for I2 mainly follows by the update rule and definition of virtual sequence zt, I2 = ηE f(zt), β1 1 β1 b V 1/2 t 1 b V 1/2 t = ηE f(zt) f(xt) + f(xt), β1 1 β1 b V 1/2 t 1 b V 1/2 t ηE f(xt) β1 1 β1 b V 1/2 t 1 b V 1/2 t β1 1 β1 mt 1 b V 1/2 t 1 b V 1/2 t η β1 1 β1 ηl KG2E b V 1/2 t 1 b V 1/2 t 1 + η2 β2 1 (1 β1)2 Lη2 l K2G2ϵ 1/2E b V 1/2 t 1 b V 1/2 t 1 where the last inequality holds by applying Lemma C.4 and the fact of bvt 1 ϵ. Bounding I3: It can be bounded as follows: 2 E b V 1/2 t t + β1 1 β1 b V 1/2 t 1 b V 1/2 t η2LE b V 1/2 t t 2 + η2LE β1 1 β1 b V 1/2 t 1 b V 1/2 t η2LE b V 1/2 t t 2 + η2L β2 1 (1 β1)2 η2 l K2G2E b V 1/2 t 1 b V 1/2 t 2 , (A.10) where the first inequality follows by Cauchy-Schwarz inequality, and the second one follows by Lemma C.4. Communication-Efficient Adaptive Federated Learning Bounding I4: I4 = E f(zt) f(xt), η b V 1/2 t t E f(zt) f(xt) η b V 1/2 t t LE zt xt η b V 1/2 t t 2 E b V 1/2 t t 2 + η2L 2 E β1 1 β1 b V 1/2 t mt where the first inequality holds by the fact of a, b a b , the second one follows from Assumption 4.1 and the third one holds by the definition of virtual sequence zt and the fact of a b 1 2 b 2. Then summing I4 over t = 1, , T, we have t=1 E[ t 2] + η2L t=1 E β1 1 β1 mt t=1 E[ t 2] + η2L 2ϵ β2 1 (1 β1)2 t=1 E[ mt 2]. (A.11) By Lemma C.7, we have t=1 E[ mt 2] TKη2 l m σ2 l + η2 l m2 k=0 Fi(xi t,k) Therefore, the summation of I4 term is bounded by t=1 E[ t 2] + β2 1 (1 β1)2 η2L k=0 Fi(xi t,k) + β2 1 (1 β1)2 η2L 2ϵ TKη2 l m σ2 l . 
(A.12) Merging pieces together: Substituting (A.8), (A.9) and (A.10) into (A.2), summing over from t = 1 to T and then adding (A.11), we have E[f(z T +1)] f(z1) = t=1 [I1 + I2 + I3 + I4] t=1 E f(xt) 2 + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) + t=1 E[ t 2] k=0 Fi(xi t,k)) + β1 1 β1 ηηl KG2 T X t=1 E b V 1/2 t 1 b V 1/2 t 1 + β2 1 (1 β1)2 η2η2 l K2G2 t=1 E b V 1/2 t 1 b V 1/2 t 1 + β2 1 (1 β1)2 η2η2 l K2LG2 T X t=1 E b V 1/2 t 1 b V 1/2 t 2 + η2L t=1 E b V 1/2 t t 2 t=1 E[ t 2] + η2L 2ϵ β2 1 (1 β1)2 t=1 E[ mt 2]. (A.13) Communication-Efficient Adaptive Federated Learning By applying Lemma C.5 into all terms containing the second moment estimate of model difference t in (A.13), and using the fact that ( p β2K2G2 + ϵ) 1 x ( p β2η2 l K2G2 + ϵ) 1 x x β2vt+ϵ ϵ 1/2 x , we have E[f(z T +1)] f(z1) t=1 E f(xt) 2 + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) + β1 1 β1 + β2 1 (1 β1)2 2η2η2 l K2LG2d k=0 Fi(xi t,k)) + η2L + η2L 2(1 β2)ηG KTη2 l mϵ σ2 l + η2 l m2ϵ k=0 Fi(xi t,k) + β2 1 (1 β1)2 η2L k=0 Fi(xi t,k) 2 + β2 1 (1 β1)2 η2L 2ϵ TKη2 l m σ2 l β2η2 l K2G2 + ϵ t=1 E[ f(xt) 2] + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) ηηl KG2d ϵ + β2 1 (1 β1)2 2η2η2 l K2LG2d + η2L + η2L 2(1 β2)ηG + β2 1 (1 β1)2 η2L KTη2 l mϵ σ2 l k=0 Fi(xi t,k) β2K2G2 + ϵKm2 η2L + η2L 2(1 β2)G + β2 1 (1 β1)2 η2L β2η2 l K2G2 + ϵ t=1 E[ f(xt) 2] + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) ηηl KG2d ϵ + β2 1 (1 β1)2 2η2η2 l K2LG2d + η2L + η2L 2(1 β2)G + β2 1 (1 β1)2 η2L KTη2 l mϵ σ2 l . (A.14) The last inequality holds due to additional constraint of local learning rate ηl with the inequality ηηl 2 β2K2G2+ϵKm2 ( 3η2L 2(1 β2)G + β2 1 (1 β1)2 η2L 2 ) η2 l m2ϵ 0, thus we obtain the constraint ηl ϵ K β2K2G2+ϵ[(3+C2 1)ηL+2 2(1 β2)G]. Hence β2η2 l K2G2 + ϵ T t=1 E[ f(xt) 2] f(z0) E[f(z T )] T + 5ηη3 l K2L2 2ϵ (σ2 l + 6Kσ2 g) + [(3 + C2 1)η2L + 2 p 2(1 β2)ηG]Kη2 l 2mϵ σ2 l + C1ηηl KG2d T ϵ + 2C2 1η2η2 l K2LG2d Tϵ . 
(A.15) min E[ f(xt) 2] 4 q β2η2 l K2G2 + ϵ f0 f T + Φ , (A.16) where Ψ = C1G2d ϵ + 2C2 1ηηl KLG2d ϵ , Φ = 5η2 l KL2 2ϵ (σ2 l +6Kσ2 g)+[(3+C2 1)ηL+2 p 2(1 β2)G] ηl 2mϵσ2 l , where C1 = β1 1 β1 . Communication-Efficient Adaptive Federated Learning A.2. Proof of Corollary 4.6 If we pick ηl = Θ( 1 T K ) and η = Θ( Km), we have mint [T ] E[ f(xt) 2] = O( 1 A.3. Proof of Theorem 4.9 Notations and equations: For partial participation, i.e. |St| = n, t [T]. The global model difference is the average of local model difference from the subset St, i.e., t = 1 i St t i. Denote t = 1 m Pm i=1 t i, and for convenience, we follow the previous notation of b Vt = diag(bvt + ϵ). Next we show that the global model difference t is an unbiased estimator of t: ESt[ t] = 1 i=1 t wi] = ESt[ t w1] = 1 i=1 t i = t. (A.17) Define the virtual sequence zt same as previous: assume x0 = x1, for each t 1, we have zt = xt + β1 1 β1 (xt xt 1) = 1 1 β1 xt β1 1 β1 xt 1, (A.18) zt+1 zt = η b Vt t η β1 1 β1 b V 1/2 t 1 b V 1/2 t mt 1. (A.19) By Assumption 4.1, we have E[f(zt+1)] f(zt) E f(zt), η b V 1/2 t t E f(zt), η β1 1 β1 b V 1/2 t 1 b V 1/2 t 2 E b V 1/2 t t β1 1 β1 b V 1/2 t 1 b V 1/2 t = E f(xt), η b V 1/2 t t ηE f(zt), β1 1 β1 b V 1/2 t 1 b V 1/2 t 2 E b V 1/2 t t β1 1 β1 b V 1/2 t 1 b V 1/2 t + E f(zt) f(xt), η b V 1/2 t t Since t is an unbiased estimator of t, the main difference of convergence analysis for partial participation cases is bounding E[ t 2]. Note that the bound for I 2 is exactly the same as the bound for I2. For the corresponding three terms, I 1, I 3 and I 4 which include the second-order momentum estimate of t. For I 1, we have I 1 = E f(xt), η t bvt = ηE f(xt), + ηE f(xt), The first term in (A.21) does not change in partial participation scheme. The second term is changed due to the variance of t changes. For the second term of I 1, we have 2ηE f(xt), t vt + ϵ t p ϵ E[ t 2]. 
(A.22) Communication-Efficient Adaptive Federated Learning For I 3, we have t=1 I 3 η2L t=1 E[ t 2] + η2L β2 1 (1 β1)2 η2 l K2G2 T X b V 1/2 t 1 b V 1/2 t and for I 4, similar to (A.11), we have t=1 I 4 η2L t=1 E[ t 2] + η2L 2ϵ β2 1 (1 β1)2 t=1 E[ mt 2]. (A.24) From Lemma C.8, we have t=1 E[ mt 2] KTη2 l n σ2 l + η2 l n2 k=0 Fi(xi t,k) Then substituting (A.25) into (A.24), we have t=1 I 4 η2L t=1 E[ t 2] + β2 1 (1 β1)2 η2η2 l L 2n2ϵ k=0 Fi(xi t,k) 2 + β2 1 (1 β1)2 η2η2 l KTL 2nϵ σ2 l t=1 E[ t 2] + β2 1 (1 β1)2 η2η2 l L 2n2ϵ k=0 P{i St} Fi(xi t,k) + β2 1 (1 β1)2 η2η2 l KTL 2nϵ σ2 l , (A.26) where we will further apply the bound for E[ t 2] following by Lemma C.6. The second term in (A.26) can be bounded from (C.10). Therefore, summing up (A.22), (A.23) and (A.9), summing over from t = 1 to T, then adding (A.24), we have E[f(z T +1)] f(z1) = t=1 [I 1 + I2 + I 3 + I 4] t=1 E f(xt) 2 + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) + t=1 E[ t 2] k=0 Fi(xi t,k) + β1 1 β1 ηηl KG2 T X t=1 E b V 1/2 t 1 b V 1/2 t 1 + β2 1 (1 β1)2 η2η2 l K2G2 t=1 E b V 1/2 t 1 b V 1/2 t 1 + β2 1 (1 β1)2 η2η2 l K2LG2 T X t=1 E b V 1/2 t 1 b V 1/2 t 2 + η2L t=1 E b V 1/2 t t 2 t=1 E[ t 2] + η2L 2ϵ β2 1 (1 β1)2 t=1 E[ mt 2]. 
(A.27) By applying Lemma C.5 into all terms containing the second moment estimate of model difference t in (A.27), using the fact that ( p β2K2G2 + ϵ) 1 x ( p β2η2 l K2G2 + ϵ) 1 x x β2vt+ϵ ϵ 1/2 x , and applying Lemma C.8, we Communication-Efficient Adaptive Federated Learning E[f(z T +1)] f(z1) β2η2 l K2G2 + ϵ t=1 E[ f(xt) 2] + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) ηηl KG2d ϵ + β2 1 (1 β1)2 2η2η2 l K2LG2d 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG KTη2 l nϵ σ2 l k=0 Fi(xi t,k) β2K2G2 + ϵKm2 3η2L 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG η2 l (n 1) mn(m 1)ϵ 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG η2 l (m n) mn(m 1)ϵ 15m K3L3η2 l (σ2 l + 6Kσ2 g)T + (90m K4L2η2 l + 3m K2) t=1 E[ f(xt) 2] + 3m K2Tσ2 g then we have E[f(z T +1)] f(z1) β2η2 l K2G2 + ϵ t=1 E[ f(xt) 2] + 5ηη3 l K2L2T 2ϵ (σ2 l + 6Kσ2 g) + β1 1 β1 + β2 1 (1 β1)2 2η2η2 l K2LG2d 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG KTη2 l nϵ σ2 l 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG η2 l (m n) mn(m 1)ϵ 15m K3L3η2 l (σ2 l + 6Kσ2 g)T + (90m K4L2η2 l + 3m K2) t=1 E[ f(xt) 2] + 3m K2Tσ2 g By adopting additional constraint of local learning rate ηl with the inequality 3η2L 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG η2 l (n 1) mn(m 1)ϵ ηηl 2 β2K2G2+ϵKm2 0, thus we obtain the constraint ηl n(m 1) m(n 1) ϵ β2K2G2+ϵK(3ηL+C2 1ηL+2 2(1 β2)G), and we further need ηl satisfies ηηl K 4 β2η2 l K2G2+ϵ 3η2L 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG η2 l (m n) mn(m 1)ϵ(90m K4L2η2 l + 3m K2) ηηl K 8 β2η2 l K2G2+ϵ. Hence we have the following condition on local learning rate ηl, β2K2G2 + ϵ 3ηL 2 + β2 1 2(1 β1)2 ηL + p 2(1 β2)G 1 , (A.30) then we have β2η2 l K2G2 + ϵ T i=1 E[ f(xt) 2] f(z0) E[f(z T )] T + C1ηηl KG2d T ϵ + 2C2 1η2η2 l K2LG2d Tϵ + 5ηη3 l K2L2 2ϵ (σ2 l + 6Kσ2 g) + 3η2L 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG Kη2 l nϵ σ2 l 2 + β2 1 2(1 β1)2 η2L + p 2(1 β2)ηG η2 l (m n) mn(m 1)ϵ[15m K3L2η2 l (σ2 l + 6Kσ2 g) + 3m K2σ2 g]. 
(A.31)

Hence,

min_{t∈[T]} E[‖∇f(x_t)‖²] ≤ (8√(β_2 η_l² K² G² + ε)/(η η_l K)) ((f_0 − f_*)/T + Ψ/T + Φ), (A.32)

where Ψ = C_1 G² d/√ε + 2C_1² η η_l K L G² d/ε, Φ = (5 η_l² K L²/(2ε))(σ_l² + 6K σ_g²) + [(3 + C_1²) η L + 2√(2(1−β_2)) G] (η_l/(2nε)) σ_l² + [(3 + C_1²) η L + 2√(2(1−β_2)) G] (η_l (m−n)/(2n(m−1)ε)) [15 K² L² η_l² (σ_l² + 6K σ_g²) + 3K σ_g²], and C_1 = β_1/(1−β_1).

A.4. Proof of Corollary 4.11

If we choose η_l = Θ(1/(√T K)) and η = Θ(√(Kn)), we have min_{t∈[T]} E[‖∇f(x_t)‖²] = O(1/√(TKn)).

B. Proof of Theorems in Section 4.2 and Partial Participation Setting for FedCAMS

B.1. Proof of Theorem 4.17

Notations and equations: From the update rule of Algorithm 2, we have e_1 = 0, e_t = (1/m) Σ_{i=1}^m e_t^i, and m_t = (1−β_1) Σ_{i=1}^t β_1^{t−i} Δ̂_i. Denote the global uncompressed difference Δ_t = (1/m) Σ_{i=1}^m Δ_t^i. Denote a virtual momentum sequence m'_t = β_1 m'_{t−1} + (1−β_1) Δ_t, hence m'_t = (1−β_1) Σ_{i=1}^t β_1^{t−i} Δ_i. By the aforementioned definitions and notation, we have

(1/m) Σ_{i=1}^m (Δ̂_t^i − Δ_t^i) = (1/m) Σ_{i=1}^m (e_t^i − e_{t+1}^i) = e_t − e_{t+1}. (B.1)

Denote the weighted averaging error sequence Γ_t = (1−β_1) Σ_{τ=1}^t β_1^{t−τ} e_τ. With the input e_1 = 0, we obtain the following relation between Γ_t and m_t:

m_t − m'_t = (1−β_1) Σ_{τ=1}^t β_1^{t−τ} (Δ̂_τ − Δ_τ) = (1−β_1) Σ_{τ=1}^t β_1^{t−τ} (e_τ − e_{τ+1}) = Γ_t − Γ_{t+1}, (B.2)

where the last step holds due to Γ_{t+1} = (1−β_1) Σ_{τ=1}^{t+1} β_1^{t+1−τ} e_τ = (1−β_1) [Σ_{τ=1}^t β_1^{t−τ} e_{τ+1} + β_1^t e_1].

Similar to previous works studying adaptive methods (Chen et al., 2020a; Zhou et al., 2018; Chen et al., 2018), we introduce a Lyapunov sequence: assume x_0 = x_1; for each t ≥ 1,

y_t = x_t + (β_1/(1−β_1))(x_t − x_{t−1}) = (1/(1−β_1)) x_t − (β_1/(1−β_1)) x_{t−1}.

Therefore, by the update rule of x_t, we have

y_{t+1} = x_{t+1} + η (β_1/(1−β_1)) V̂_t^{−1/2} m_t
= x_{t+1} + η (β_1/(1−β_1)) V̂_t^{−1/2} [m'_t + Γ_t − Γ_{t+1}]
= x_{t+1} + η (β_1/(1−β_1)) V̂_t^{−1/2} m'_t + η V̂_t^{−1/2} Γ_{t+1} − η V̂_t^{−1/2} e_{t+1}. (B.3)

The third equality holds due to the fact that Γ_{t+1} = β_1 Γ_t + (1−β_1) e_{t+1}.
We then introduce a new sequence based on the previous Lyapunov sequence y_t as follows:

z_{t+1} = y_{t+1} + η V̂_t^{−1/2} e_{t+1} = x_{t+1} + η (β_1/(1−β_1)) V̂_t^{−1/2} m'_t + η V̂_t^{−1/2} Γ_{t+1}. (B.4)

The sequence difference z_{t+1} − z_t can be represented as

z_{t+1} − z_t = x_{t+1} − x_t + η (β_1/(1−β_1)) V̂_t^{−1/2} m'_t − η (β_1/(1−β_1)) V̂_{t−1}^{−1/2} m'_{t−1} + η V̂_t^{−1/2} Γ_{t+1} − η V̂_{t−1}^{−1/2} Γ_t
= η V̂_t^{−1/2} m_t + η V̂_t^{−1/2} Γ_{t+1} + η (β_1/(1−β_1)) V̂_t^{−1/2} m'_t − η (β_1/(1−β_1)) V̂_{t−1}^{−1/2} m'_{t−1} − η V̂_{t−1}^{−1/2} Γ_t, (B.5)

where the second equality follows from the update rule of x_{t+1}. Following (B.2), then combining like terms and applying the definition of m'_t, we have

z_{t+1} − z_t = η V̂_t^{−1/2} m'_t + η V̂_t^{−1/2} Γ_t + η (β_1/(1−β_1)) V̂_t^{−1/2} m'_t − η (β_1/(1−β_1)) V̂_{t−1}^{−1/2} m'_{t−1} − η V̂_{t−1}^{−1/2} Γ_t
= η (1/(1−β_1)) V̂_t^{−1/2} m'_t − η (β_1/(1−β_1)) V̂_{t−1}^{−1/2} m'_{t−1} + η (V̂_t^{−1/2} − V̂_{t−1}^{−1/2}) Γ_t
= η (1/(1−β_1)) V̂_t^{−1/2} [β_1 m'_{t−1} + (1−β_1) Δ_t] − η (β_1/(1−β_1)) V̂_{t−1}^{−1/2} m'_{t−1} + η (V̂_t^{−1/2} − V̂_{t−1}^{−1/2}) Γ_t
= η V̂_t^{−1/2} Δ_t − η (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m'_{t−1} − η (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) Γ_t. (B.6)

Therefore, we obtain a Lyapunov sequence that is helpful for our proof of FedCAMS. The proof of FedCAMS in the full participation setting follows a similar outline to the proof of FedAMS. By Assumption 4.1, we have

E[f(z_{t+1})] ≤ f(z_t) + E[⟨∇f(z_t), z_{t+1} − z_t⟩] + (L/2) E[‖z_{t+1} − z_t‖²]
= f(z_t) + E[⟨∇f(x_t), η V̂_t^{−1/2} Δ_t⟩] − η E[⟨∇f(z_t), (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m'_{t−1} + (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) Γ_t⟩] + (L η²/2) E[‖V̂_t^{−1/2} Δ_t − (β_1/(1−β_1)) (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) m'_{t−1} − (V̂_{t−1}^{−1/2} − V̂_t^{−1/2}) Γ_t‖²] + E[⟨∇f(z_t) − ∇f(x_t), η V̂_t^{−1/2} Δ_t⟩]
=: f(z_t) + T_1 + T_2 + T_3 + T_4, (B.7)

where we recall the notation V̂_t = diag(v̂_t) with v̂_t = max(v̂_{t−1}, v_t, ε).

Bounding T_1: We have

T_1 = E[⟨∇f(x_t), η V̂_t^{−1/2} Δ_t⟩] ≤ √2 η E[⟨∇f(x_t), Δ_t/√(β_2 v_{t−1} + ε)⟩] + √2 η E[⟨∇f(x_t), Δ_t (1/√(v_t + ε) − 1/√(β_2 v_{t−1} + ε))⟩], (B.8)

where the first inequality follows from the fact that v̂_t ≥ (v_t + ε)/2.
For the second term in (B.8), we have
$$\sqrt{2}\,\eta\,\mathbb{E}\Big[\Big\langle\nabla f(x_t),\Big(\frac{1}{\sqrt{v_t+\epsilon}}-\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big)\odot\Delta_t\Big\rangle\Big]\le\sqrt{2}\,\eta\,\mathbb{E}\Big[\|\nabla f(x_t)\|_\infty\Big\|\Big(\frac{1}{\sqrt{v_t+\epsilon}}-\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big)\odot\Delta_t\Big\|_1\Big]\le\frac{\sqrt{2(1-\beta_2)}\,\eta G}{\epsilon}\,\mathbb{E}[\|\Delta_t\|^2], \quad (B.9)$$
where the second inequality follows from Lemmas C.1 and C.4, and we will further bound $\mathbb{E}[\|\Delta_t\|^2]$ by applying Lemma C.5.

For the first term in (B.8), adding and subtracting $\eta_lK\nabla f(x_t)$ gives
$$\sqrt{2}\,\eta\,\mathbb{E}\Big[\Big\langle\nabla f(x_t),\frac{\Delta_t}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big\rangle\Big]=-\sqrt{2}\,\eta\eta_lK\,\mathbb{E}\Big[\frac{\|\nabla f(x_t)\|^2}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big]+\sqrt{2}\,\eta\,\mathbb{E}\Big[\Big\langle\frac{\nabla f(x_t)}{\sqrt{\beta_2v_{t-1}+\epsilon}},-\frac{\eta_l}{m}\sum_{i=1}^m\sum_{k=0}^{K-1}g_{t,k}^i+\frac{\eta_lK}{m}\sum_{i=1}^m\nabla F_i(x_t)\Big\rangle\Big], \quad (B.10)$$
where we use $\nabla f(x_t)=\frac{1}{m}\sum_{i=1}^m\nabla F_i(x_t)$ and $\Delta_t=-\frac{\eta_l}{m}\sum_{i=1}^m\sum_{k=0}^{K-1}g_{t,k}^i$. For the last term in (B.10), taking expectation over the stochastic gradients and applying $\langle x,y\rangle=\frac{1}{2}\big[\|x\|^2+\|y\|^2-\|x-y\|^2\big]$ together with the Cauchy-Schwarz inequality, we have
$$\sqrt{2}\,\eta\,\mathbb{E}\Big[\Big\langle\frac{\nabla f(x_t)}{\sqrt{\beta_2v_{t-1}+\epsilon}},-\frac{\eta_l}{m}\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)+\frac{\eta_lK}{m}\sum_{i=1}^m\nabla F_i(x_t)\Big\rangle\Big]\le\frac{\sqrt{2}\,\eta\eta_lK}{2}\,\mathbb{E}\Big[\frac{\|\nabla f(x_t)\|^2}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big]+\frac{\sqrt{2}\,\eta\eta_l}{2Km^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\big(\nabla F_i(x_{t,k}^i)-\nabla F_i(x_t)\big)\Big\|^2\Big]-\frac{\sqrt{2}\,\eta\eta_l}{2Km^2}\,\mathbb{E}\Big[\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]. \quad (B.11)$$
Then by Assumption 4.1 (smoothness), Lemma C.9 and the constraint on the local learning rate $\eta_l\le\frac{1}{8KL}$, we have
$$\sqrt{2}\,\eta\,\mathbb{E}\Big[\Big\langle\frac{\nabla f(x_t)}{\sqrt{\beta_2v_{t-1}+\epsilon}},-\frac{\eta_l}{m}\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)+\frac{\eta_lK}{m}\sum_{i=1}^m\nabla F_i(x_t)\Big\rangle\Big]\le\frac{\sqrt{2}\,\eta\eta_lK}{2}\,\mathbb{E}\Big[\frac{\|\nabla f(x_t)\|^2}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big]+\frac{5\eta\eta_l^3K^2L^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)-\frac{\sqrt{2}\,\eta\eta_l}{2Km^2}\,\mathbb{E}\Big[\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]. \quad (B.12)$$
Combining (B.9), (B.10) and (B.12), we then have
$$T_1\le-\frac{\sqrt{2}\,\eta\eta_lK}{2}\,\mathbb{E}\Big[\frac{\|\nabla f(x_t)\|^2}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big]+\frac{5\eta\eta_l^3K^2L^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)-\frac{\sqrt{2}\,\eta\eta_l}{2Km^2}\,\mathbb{E}\Big[\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]+\frac{\sqrt{2(1-\beta_2)}\,\eta G}{\epsilon}\,\mathbb{E}[\|\Delta_t\|^2]. \quad (B.13)$$

Bounding $T_2$: The bound for $T_2$ mainly follows from the update rule and the definition of the virtual sequence $z_t$:
$$T_2=-\eta\,\mathbb{E}\Big[\Big\langle\nabla f(z_t),\big(\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big)\Big(\frac{\beta_1}{1-\beta_1}\bar m_{t-1}+\Gamma_t\Big)\Big\rangle\Big]=-\eta\,\mathbb{E}\Big[\Big\langle\nabla f(x_t)+\nabla f(z_t)-\nabla f(x_t),\big(\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big)\Big(\frac{\beta_1}{1-\beta_1}\bar m_{t-1}+\Gamma_t\Big)\Big\rangle\Big]\le\eta C_1\eta_lKG^2\,\mathbb{E}\big[\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|_1\big]+\eta^2C_1^2L\eta_l^2K^2G^2\epsilon^{-1/2}\,\mathbb{E}\big[\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|_1\big], \quad (B.14)$$
where the last inequality holds by Lemma C.4, with $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$.
Bounding $T_3$: It can be bounded as follows:
$$T_3\le\eta^2L\,\mathbb{E}\big[\|\hat V_t^{-1/2}\Delta_t\|^2\big]+\eta^2L\,\mathbb{E}\Big[\Big\|\big(\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big)\Big(\frac{\beta_1}{1-\beta_1}\bar m_{t-1}+\Gamma_t\Big)\Big\|^2\Big]\le\eta^2L\,\mathbb{E}\big[\|\hat V_t^{-1/2}\Delta_t\|^2\big]+\eta^2LC_1^2\eta_l^2K^2G^2\,\mathbb{E}\big[\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|^2\big], \quad (B.15)$$
where the first inequality follows by the Cauchy-Schwarz inequality, and the second one follows by Lemma C.4, with $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$.

Bounding $T_4$:
$$T_4=\mathbb{E}\big[\big\langle\nabla f(z_t)-\nabla f(x_t),\eta\hat V_t^{-1/2}\Delta_t\big\rangle\big]\le\mathbb{E}\big[\|\nabla f(z_t)-\nabla f(x_t)\|\,\|\eta\hat V_t^{-1/2}\Delta_t\|\big]\le L\,\mathbb{E}\big[\|z_t-x_t\|\,\|\eta\hat V_t^{-1/2}\Delta_t\|\big]\le\frac{\eta^2L}{2}\,\mathbb{E}\big[\|\hat V_t^{-1/2}\Delta_t\|^2\big]+\frac{\eta^2L}{2}\,\mathbb{E}\Big[\Big\|\frac{\beta_1}{1-\beta_1}\hat V_{t-1}^{-1/2}\bar m_{t-1}+\hat V_{t-1}^{-1/2}\Gamma_t\Big\|^2\Big],$$
where the first inequality holds by the fact that $\langle a,b\rangle\le\|a\|\|b\|$, the second one follows from Assumption 4.1, and the third one holds by the definition of the virtual sequence $z_t$ and the fact that $\|a\|\|b\|\le\frac12\|a\|^2+\frac12\|b\|^2$. Then, summing $T_4$ over $t=1,\dots,T$, we have
$$\sum_{t=1}^{T}T_4\le\frac{\eta^2L}{2\epsilon}\sum_{t=1}^{T}\mathbb{E}[\|\Delta_t\|^2]+\frac{\eta^2L}{\epsilon}\Big(\frac{\beta_1^2}{(1-\beta_1)^2}\sum_{t=1}^{T}\mathbb{E}[\|\bar m_t\|^2]+\sum_{t=1}^{T}\mathbb{E}[\|\Gamma_t\|^2]\Big). \quad (B.16)$$
By Lemma C.7, we have
$$\sum_{t=1}^{T}\mathbb{E}[\|\bar m_t\|^2]\le\frac{TK\eta_l^2}{m}\sigma_l^2+\frac{\eta_l^2}{m^2}\sum_{t=1}^{T}\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big],$$
$$\sum_{t=1}^{T}\mathbb{E}[\|\Gamma_t\|^2]\le\frac{4Tq^2}{(1-q^2)^2}\cdot\frac{K\eta_l^2}{m}\sigma_l^2+\frac{4q^2}{(1-q^2)^2}\cdot\frac{\eta_l^2}{m^2}\sum_{t=1}^{T}\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$
Therefore, the $T_4$ term is bounded by
$$\sum_{t=1}^{T}T_4\le\frac{\eta^2L}{2\epsilon}\sum_{t=1}^{T}\mathbb{E}[\|\Delta_t\|^2]+\frac{C_2\eta^2L\eta_l^2}{m^2\epsilon}\sum_{t=1}^{T}\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]+\frac{C_2\eta^2L}{\epsilon}\cdot\frac{TK\eta_l^2}{m}\sigma_l^2, \quad (B.17)$$
where $C_2=\frac{4q^2}{(1-q^2)^2}+\frac{\beta_1^2}{(1-\beta_1)^2}$.

Merging pieces together: Substituting (B.13), (B.14) and (B.15) into (B.7), summing from $t=1$ to $T$ and then adding (B.17), we have
$$\mathbb{E}[f(z_{T+1})]-f(z_1)=\sum_{t=1}^{T}[T_1+T_2+T_3+T_4]\le-\frac{\sqrt2\,\eta\eta_lK}{2}\sum_{t=1}^T\mathbb{E}\Big[\frac{\|\nabla f(x_t)\|^2}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big]+\frac{5\eta\eta_l^3K^2L^2T}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\frac{\sqrt{2(1-\beta_2)}\,\eta G}{\epsilon}\sum_{t=1}^T\mathbb{E}[\|\Delta_t\|^2]-\frac{\sqrt2\,\eta\eta_l}{2Km^2}\sum_{t=1}^T\mathbb{E}\Big[\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]+C_1\eta\eta_lKG^2\sum_{t=1}^T\mathbb{E}\big[\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\|_1\big]+\frac{C_1^2\eta^2\eta_l^2K^2LG^2}{\sqrt\epsilon}\sum_{t=1}^T\mathbb{E}\big[\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\|_1\big]+C_1^2\eta^2\eta_l^2K^2LG^2\sum_{t=1}^T\mathbb{E}\big[\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\|^2\big]+\frac{3\eta^2L}{2}\sum_{t=1}^T\mathbb{E}\big[\|\hat V_t^{-1/2}\Delta_t\|^2\big]+\frac{\eta^2L}{2}\cdot\frac{\beta_1^2}{(1-\beta_1)^2}\sum_{t=1}^T\mathbb{E}[\|\bar m_t\|^2]+\eta^2L\sum_{t=1}^T\mathbb{E}[\|\Gamma_t\|^2]. \quad (B.18)$$
Hence, by organizing terms and applying the lemmas, we have
$$\mathbb{E}[f(z_{T+1})]-f(z_1)\le-\frac{\sqrt2\,\eta\eta_lK}{2}\sum_{t=1}^T\mathbb{E}\Big[\frac{\|\nabla f(x_t)\|^2}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big]+\frac{5\eta\eta_l^3K^2L^2T}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\frac{C_1\eta\eta_lKG^2d}{\sqrt\epsilon}+\frac{2C_1^2\eta^2\eta_l^2K^2LG^2d}{\epsilon}+\Big(\frac{3\eta^2L}{2}+C_2\eta^2L+\sqrt{2(1-\beta_2)}\,\eta G\Big)\frac{KT\eta_l^2}{m\epsilon}\sigma_l^2-\Big(\frac{\sqrt2\,\eta\eta_l}{2Km^2\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}}-\Big(\frac{3\eta^2L}{2}+C_2\eta^2L+\sqrt{2(1-\beta_2)}\,\eta G\Big)\frac{\eta_l^2}{m^2\epsilon}\Big)\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big], \quad (B.19)$$
by applying Lemma C.5 to all terms in (B.18) containing the second moment of the model difference $\Delta_t$, using the fact that $\big(\tfrac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon\big)^{-1/2}x\le(\beta_2v_{t-1}+\epsilon)^{-1/2}x\le\epsilon^{-1/2}x$ for $x\ge0$, and applying Lemmas C.2 and C.7. Therefore,
$$\mathbb{E}[f(z_{T+1})]-f(z_1)\le-\frac{\sqrt2\,\eta\eta_lK}{2\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}}\sum_{t=1}^T\mathbb{E}[\|\nabla f(x_t)\|^2]+\frac{5\eta\eta_l^3K^2L^2T}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\frac{C_1\eta\eta_lKG^2d}{\sqrt\epsilon}+\frac{2C_1^2\eta^2\eta_l^2K^2LG^2d}{\epsilon}+\Big(\frac{3\eta^2L}{2}+C_2\eta^2L+\sqrt{2(1-\beta_2)}\,\eta G\Big)\frac{KT\eta_l^2}{m\epsilon}\sigma_l^2, \quad (B.20)$$
where the last inequality holds by $\eta_l\le\frac{\epsilon}{\sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}K^2G^2+\epsilon}\,K\big(3\eta L+2C_2\eta L+2\sqrt{2(1-\beta_2)}G\big)}$. Hence we have
$$\frac{\eta\eta_lK}{2\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}}\cdot\frac1T\sum_{t=1}^T\mathbb{E}[\|\nabla f(x_t)\|^2]\le\frac{f(z_0)-\mathbb{E}[f(z_T)]}{T}+\frac{5\eta\eta_l^3K^2L^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\frac{C_1\eta\eta_lKG^2d}{T\sqrt\epsilon}+\frac{2C_1^2\eta^2\eta_l^2K^2LG^2d}{T\epsilon}+\big(3\eta^2L+2C_2\eta^2L+2\sqrt{2(1-\beta_2)}\,\eta G\big)\frac{K\eta_l^2}{2m\epsilon}\sigma_l^2, \quad (B.21)$$
where $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$ and $C_2=\frac{\beta_1^2}{(1-\beta_1)^2}+\frac{4q^2}{(1-q^2)^2}$. (B.21) also implies
$$\min_{t\in[T]}\mathbb{E}[\|\nabla f(x_t)\|^2]\le4\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}\,\Big(\frac{f_0-f^*}{\eta\eta_lKT}+\frac{\Psi}{T}+\Phi\Big), \quad (B.22)$$
where $\Psi=\frac{C_1G^2d}{\sqrt\epsilon}+\frac{2C_1^2\eta\eta_lKLG^2d}{\epsilon}$, $\Phi=\frac{5\eta_l^2KL^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\big[(3+2C_2)\eta L+2\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l}{2m\epsilon}\sigma_l^2$, $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$ and $C_2=\frac{\beta_1^2}{(1-\beta_1)^2}+\frac{4q^2}{(1-q^2)^2}$.

B.2. Proof of Corollary 4.19

Let $\eta_l=\Theta\big(\frac{1}{\sqrt TK}\big)$ and $\eta=\Theta(\sqrt{Km})$; the convergence rate under the full participation scheme is $O\big(\frac{1}{\sqrt{TKm}}\big)$.

B.3. Analysis of the Partial Participation Setting for FedCAMS

We now present the theoretical analysis of the partial participation scheme of FedCAMS (Algorithm 2).
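Before the proof, it may help to see the update rule analyzed above in code form. The following is a minimal single-machine sketch of a FedCAMS-style round under our reading of Algorithm 2; the quadratic local objectives, the top-k compressor, and all numeric values are illustrative assumptions, not the paper's setup. Each client runs $K$ local SGD steps, compresses its model difference with error feedback, and the server applies an AMSGrad-style update with the max stabilization $\hat v_t=\max(\hat v_{t-1},v_t,\epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_clients, K, T = 10, 8, 4, 300
eta, eta_l, beta1, beta2, eps = 0.1, 0.05, 0.9, 0.99, 1e-3

# Illustrative local objectives F_i(x) = 0.5 * x^T A_i x (minimizer x* = 0).
A = [np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(num_clients)]

def top_k(x, k=3):
    """q-contraction compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

x = rng.normal(size=d)
x0 = x.copy()
m_t, v_t = np.zeros(d), np.zeros(d)
v_hat = np.full(d, eps)
err = [np.zeros(d) for _ in range(num_clients)]    # per-client error feedback e_t^i

for _ in range(T):
    compressed = []
    for i in range(num_clients):
        xi = x.copy()
        for _ in range(K):                          # K local SGD steps
            g = A[i] @ xi + 0.01 * rng.normal(size=d)   # stochastic local gradient
            xi -= eta_l * g
        delta = xi - x                              # local model difference Delta_t^i
        comp = top_k(delta + err[i])                # compress with error feedback
        err[i] = delta + err[i] - comp              # e_{t+1}^i
        compressed.append(comp)
    delta_hat = np.mean(compressed, axis=0)         # server aggregates hat Delta_t
    m_t = beta1 * m_t + (1 - beta1) * delta_hat
    v_t = beta2 * v_t + (1 - beta2) * delta_hat**2
    v_hat = np.maximum(np.maximum(v_hat, v_t), eps) # hat v_t = max(hat v_{t-1}, v_t, eps)
    x = x + eta * m_t / np.sqrt(v_hat)              # AMSGrad-style server update

print(np.linalg.norm(x0), "->", np.linalg.norm(x))
```

The max operation keeps $\hat v_t$ non-decreasing, which is exactly what Lemma C.2 exploits when telescoping $\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\|$.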
Similar to the partial participation scheme in Section 4.1, we have the following convergence analysis.

Theorem B.1. Under Assumptions 4.1-4.3 and 4.14, if the local learning rate $\eta_l$ satisfies the condition
$$\eta_l\le\min\Big\{\frac{1}{8KL},\ \frac{n(m-1)\epsilon}{48m(n-1)}\Big[K\sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}K^2G^2+\epsilon}\,\big(\eta L+\sqrt{2(1-\beta_2)}G\big)\Big]^{-1}\Big\},$$
then the iterates of Algorithm 2 under the partial participation scheme satisfy
$$\min_{t\in[T]}\mathbb{E}[\|\nabla f(x_t)\|^2]\le8\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}\,\Big(\frac{f_0-f^*}{\eta\eta_lKT}+\frac{\Psi}{T}+\Phi\Big), \quad (B.23)$$
where $\Psi=\frac{C_1G^2d}{\sqrt\epsilon}+\frac{2C_1^2\eta\eta_lKLG^2d}{\epsilon}$,
$$\Phi=\frac{C_1\eta\eta_lKLG^2}{\epsilon}+\frac{5\eta_l^2KL^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\big[\eta L+\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l}{n\epsilon}\sigma_l^2+\big[\eta L+\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l(m-n)}{n(m-1)\epsilon}\big[15K^2L^2\eta_l^2(\sigma_l^2+6K\sigma_g^2)+3K\sigma_g^2\big],$$
and $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$.

Remark B.2. The upper bound for $\min_{t\in[T]}\mathbb{E}[\|\nabla f(x_t)\|^2]$ under partial participation is similar to the full participation case, but with a larger variance term $\Phi$. This is because the random sampling of participating workers introduces an additional variance.

Remark B.3. From Theorem B.1, the constant $C_1$ is related to the compressor constant $q$. Stronger compression applied to the model difference $\Delta_t^i$ corresponds to a larger $q$ ($q\to1$) and leads to worse convergence due to larger information loss.

Next, we provide the proof for the partial participation analysis of FedCAMS.

Proof of Theorem B.1: Notations and equations: From the update rule of Algorithm 2, we have $e_1=0$, $e_t=\frac1m\sum_{i=1}^m e_t^i$ and $m_t=(1-\beta_1)\sum_{i=1}^t\beta_1^{t-i}\hat\Delta_i$. Denote the global uncompressed difference $\Delta_t=\frac{1}{|S_t|}\sum_{i\in S_t}\Delta_t^i$, and a virtual momentum sequence $\bar m_t=\beta_1\bar m_{t-1}+(1-\beta_1)\Delta_t$, so that $\bar m_t=(1-\beta_1)\sum_{i=1}^t\beta_1^{t-i}\Delta_i$. Define two additional virtual sequences $\tilde\Delta_t=\frac1n\sum_{i=1}^m\Delta_t^i$ and $\hat{\tilde\Delta}_t=\frac1n\sum_{i=1}^m\hat\Delta_t^i$. Note that when client $i$ does not take part in the participation at round $t$, we have $\Delta_t^i=\hat\Delta_t^i=0$; therefore $\tilde\Delta_t=\Delta_t$ and $\hat{\tilde\Delta}_t=\hat\Delta_t$.
By the aforementioned definitions and notation, with the participating subset $S_t=\{w_1^t,w_2^t,\dots,w_n^t\}$, we have
$$\hat\Delta_t-\Delta_t=\frac{1}{|S_t|}\sum_{i\in S_t}(\hat\Delta_t^i-\Delta_t^i)=\frac1n\sum_{i=1}^m(\hat\Delta_t^i-\Delta_t^i)=\frac1n\sum_{i=1}^m(e_t^i-e_{t+1}^i)=\tilde e_t-\tilde e_{t+1}, \quad (B.24)$$
where the compression errors have the same structure, with $\tilde e_t=\frac1n\sum_{i=1}^m e_t^i$. Similar to the previous analysis, we define the sequence $\Gamma_{t+1}:=(1-\beta_1)\sum_{\tau=1}^{t+1}\beta_1^{t+1-\tau}\tilde e_\tau$ and keep using the Lyapunov function $z_t$ from (B.4). For the expectation of the model difference $\Delta_t$, we have
$$\mathbb{E}_{S_t}[\Delta_t]=\frac1n\sum_{i=1}^m\mathbb{P}\{i\in S_t\}\,\Delta_t^i=\frac1m\sum_{i=1}^m\Delta_t^i. \quad (B.25)$$

The proof of FedCAMS in the partial participation setting has a similar outline, combining the proof of partial participation for FedAMS and full participation for FedCAMS. By Assumption 4.1, we have
$$\mathbb{E}[f(z_{t+1})]-f(z_t)\le\underbrace{\mathbb{E}\big[\big\langle\nabla f(x_t),\eta\hat V_t^{-1/2}\Delta_t\big\rangle\big]}_{\tilde T_1}\underbrace{-\,\eta\,\mathbb{E}\Big[\Big\langle\nabla f(z_t),\big(\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big)\Big(\frac{\beta_1}{1-\beta_1}\bar m_{t-1}+\Gamma_t\Big)\Big\rangle\Big]}_{\tilde T_2}+\underbrace{\frac{\eta^2L}{2}\mathbb{E}\Big[\Big\|\hat V_t^{-1/2}\Delta_t-\big(\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big)\Big(\frac{\beta_1}{1-\beta_1}\bar m_{t-1}+\Gamma_t\Big)\Big\|^2\Big]}_{\tilde T_3}+\underbrace{\mathbb{E}\big[\big\langle\nabla f(z_t)-\nabla f(x_t),\eta\hat V_t^{-1/2}\Delta_t\big\rangle\big]}_{\tilde T_4}. \quad (B.26)$$
Note that the bound for $\tilde T_2$ is exactly the same as the bound for $T_2$. The remaining terms $\tilde T_1$, $\tilde T_3$ and $\tilde T_4$ involve the second moment of $\Delta_t$. For $\tilde T_1$, similar to the full participation setting, we have
$$\tilde T_1\le\sqrt2\,\eta\,\mathbb{E}\Big[\Big\langle\nabla f(x_t),\frac{\Delta_t}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big\rangle\Big]+\sqrt2\,\eta\,\mathbb{E}\Big[\Big\langle\nabla f(x_t),\Big(\frac{1}{\sqrt{v_t+\epsilon}}-\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big)\odot\Delta_t\Big\rangle\Big]. \quad (B.27)$$
The first term in (B.27) does not change in the partial participation scheme. The second term changes because the variance of $\Delta_t$ changes; for it we have
$$\sqrt2\,\eta\,\mathbb{E}\Big[\Big\langle\nabla f(x_t),\Big(\frac{1}{\sqrt{v_t+\epsilon}}-\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big)\odot\Delta_t\Big\rangle\Big]\le\frac{\sqrt{2(1-\beta_2)}\,\eta G}{\epsilon}\,\mathbb{E}[\|\Delta_t\|^2]. \quad (B.28)$$
For $\tilde T_3$, similar to the proof of $T_3$, we have
$$\sum_{t=1}^{T}\tilde T_3\le\frac{\eta^2L}{\epsilon}\sum_{t=1}^{T}\mathbb{E}[\|\Delta_t\|^2]+\eta^2LC_1^2\eta_l^2K^2G^2\sum_{t=1}^{T}\mathbb{E}\big[\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\|^2\big], \quad (B.29)$$
where $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$.
For $\tilde T_4$ in partial participation, we have
$$\tilde T_4=\mathbb{E}\big[\big\langle\nabla f(z_t)-\nabla f(x_t),\eta\hat V_t^{-1/2}\Delta_t\big\rangle\big]\le\eta\,\mathbb{E}\big[\|\nabla f(z_t)-\nabla f(x_t)\|\,\|\hat V_t^{-1/2}\Delta_t\|\big]\le\eta^2L\,\mathbb{E}\Big[\Big\|\frac{\beta_1}{1-\beta_1}\hat V_{t-1}^{-1/2}\bar m_{t-1}+\hat V_{t-1}^{-1/2}\Gamma_t\Big\|\,\|\hat V_t^{-1/2}\Delta_t\|\Big]\le\frac{C_1\eta^2\eta_l^2K^2LG^2}{\epsilon}.$$
Hence, summing $\tilde T_1$ to $\tilde T_4$ over the total iterations $T$ yields
$$\mathbb{E}[f(z_{T+1})]-f(z_1)=\sum_{t=1}^{T}[\tilde T_1+\tilde T_2+\tilde T_3+\tilde T_4]\le-\frac{\sqrt2\,\eta\eta_lK}{2\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}}\sum_{t=1}^T\mathbb{E}[\|\nabla f(x_t)\|^2]+\frac{5\eta\eta_l^3K^2L^2T}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\frac{C_1\eta\eta_lKG^2d}{\sqrt\epsilon}+\frac{2C_1^2\eta^2\eta_l^2K^2LG^2d}{\epsilon}+\frac{C_1T\eta^2\eta_l^2K^2LG^2}{\epsilon}+\big(\eta^2L+\sqrt{2(1-\beta_2)}\,\eta G\big)\frac{KT\eta_l^2}{n\epsilon}\sigma_l^2+\Big(\frac{\eta^2\eta_l^2L}{\epsilon}+\frac{\sqrt{2(1-\beta_2)}\,\eta\eta_l^2G}{\epsilon}\Big)\frac{m-n}{mn(m-1)}\Big[15mK^3L^2\eta_l^2(\sigma_l^2+6K\sigma_g^2)T+3mK^2T\sigma_g^2+(90mK^4L^2\eta_l^2+3mK^2)\sum_{t=1}^T\mathbb{E}[\|\nabla f(x_t)\|^2]\Big]+\Big(\frac{\eta^2\eta_l^2L}{\epsilon}+\frac{\sqrt{2(1-\beta_2)}\,\eta\eta_l^2G}{\epsilon}\Big)\frac{n-1}{mn(m-1)}\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]-\frac{\sqrt2\,\eta\eta_l}{2Km^2\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}}\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$
The proof outline is similar to the previous proof; we make use of Lemmas C.2, C.6 and C.8 for the corresponding terms. By the additional constraint on the local learning rate $\eta_l$ such that
$$\big(\eta^2L+\sqrt{2(1-\beta_2)}\,\eta G\big)\frac{\eta_l^2(n-1)}{mn(m-1)\epsilon}-\frac{\eta\eta_l}{2Km^2}\Big[\sqrt{\tfrac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}\Big]^{-1}\le0,$$
we obtain the constraint $\eta_l\le\frac{n(m-1)}{m(n-1)}\cdot\frac{\epsilon}{2K\sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}K^2G^2+\epsilon}\,\big[\eta L+\sqrt{2(1-\beta_2)}G\big]}$, and we further need $\eta_l$ to satisfy
$$\frac{\eta\eta_lK}{4\sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}\eta_l^2K^2G^2+\epsilon}}-\big(\eta L+\sqrt{2(1-\beta_2)}G\big)\frac{\eta\eta_l^2(m-n)}{mn(m-1)\epsilon}\big(90mK^4L^2\eta_l^2+3mK^2\big)\ge\frac{\eta\eta_lK}{8\sqrt{4\beta_2(1+q^2)^3(1-q^2)^{-2}\eta_l^2K^2G^2+\epsilon}}.$$
Hence, for the convergence rate, we have
$$\frac{\eta\eta_lK}{8\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}}\cdot\frac1T\sum_{t=1}^T\mathbb{E}[\|\nabla f(x_t)\|^2]\le\frac{f(z_0)-\mathbb{E}[f(z_T)]}{T}+\frac{5\eta\eta_l^3K^2L^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\big(\eta L+\sqrt{2(1-\beta_2)}G\big)\frac{\eta\eta_l^2K}{n\epsilon}\sigma_l^2+\frac{C_1\eta\eta_lKG^2d}{T\sqrt\epsilon}+\frac{2C_1^2\eta^2\eta_l^2K^2LG^2d}{T\epsilon}+\frac{C_1\eta^2\eta_l^2K^2LG^2}{\epsilon}+\Big(\frac{\eta^2\eta_l^2L}{\epsilon}+\frac{\sqrt{2(1-\beta_2)}\,\eta\eta_l^2G}{\epsilon}\Big)\frac{m-n}{mn(m-1)}\big[15mK^3L^2\eta_l^2(\sigma_l^2+6K\sigma_g^2)+3mK^2\sigma_g^2\big].$$
This implies
$$\min_{t\in[T]}\mathbb{E}[\|\nabla f(x_t)\|^2]\le8\sqrt{\frac{4\beta_2(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2+\epsilon}\,\Big(\frac{f_0-f^*}{\eta\eta_lKT}+\frac{\Psi}{T}+\Phi\Big), \quad (B.33)$$
where $\Psi=\frac{C_1G^2d}{\sqrt\epsilon}+\frac{2C_1^2\eta\eta_lKLG^2d}{\epsilon}$, $\Phi=\frac{C_1\eta\eta_lKLG^2}{\epsilon}+\frac{5\eta_l^2KL^2}{2\epsilon}(\sigma_l^2+6K\sigma_g^2)+\big[\eta L+\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l}{n\epsilon}\sigma_l^2+\big[\eta L+\sqrt{2(1-\beta_2)}G\big]\frac{\eta_l(m-n)}{n(m-1)\epsilon}\big[15K^2L^2\eta_l^2(\sigma_l^2+6K\sigma_g^2)+3K\sigma_g^2\big]$, and $C_1=\frac{\beta_1}{1-\beta_1}+\frac{2q}{1-q^2}$.

C. Supporting Lemmas

Lemma C.1. For the element-wise difference $W_t=\frac{1}{\sqrt{v_t+\epsilon}}-\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}$, we have $|W_t|\le\frac{\sqrt{1-\beta_2}\,|\Delta_t|}{\epsilon}$.

Proof. Note that we have
$$|W_t|=\Big|\frac{1}{\sqrt{v_t+\epsilon}}-\frac{1}{\sqrt{\beta_2v_{t-1}+\epsilon}}\Big|=\frac{|\beta_2v_{t-1}-v_t|}{\sqrt{v_t+\epsilon}\sqrt{\beta_2v_{t-1}+\epsilon}\big(\sqrt{\beta_2v_{t-1}+\epsilon}+\sqrt{v_t+\epsilon}\big)}=\frac{(1-\beta_2)\Delta_t^2}{\sqrt{v_t+\epsilon}\sqrt{\beta_2v_{t-1}+\epsilon}\big(\sqrt{\beta_2v_{t-1}+\epsilon}+\sqrt{v_t+\epsilon}\big)}\le\frac{(1-\beta_2)\Delta_t^2}{\sqrt{v_t+\epsilon}\,\epsilon}\le\frac{\sqrt{1-\beta_2}\,|\Delta_t|}{\epsilon}, \quad (C.1)$$
where the fourth equation holds by the update rule of $v_t$, i.e., $v_t=\beta_2v_{t-1}+(1-\beta_2)\Delta_t^2$, the first inequality holds due to $\sqrt{\beta_2v_{t-1}+\epsilon}\big(\sqrt{\beta_2v_{t-1}+\epsilon}+\sqrt{v_t+\epsilon}\big)\ge\epsilon$, and the last holds due to $\sqrt{v_t+\epsilon}\ge\sqrt{1-\beta_2}\,|\Delta_t|$. This concludes the proof.

Lemma C.2. For the variance difference sequence $\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}$, we have
$$\sum_{t=1}^T\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|_1\le\frac{d}{\sqrt\epsilon},\qquad\sum_{t=1}^T\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|^2\le\frac{d}{\epsilon}.$$

Proof. By the definition of the variance matrix $\hat V_t$ and the non-decreasing update of FedCAMS, i.e., $\hat v_{t-1}\le\hat v_t=\max(\hat v_{t-1},v_t,\epsilon)$, we have
$$\sum_{t=1}^T\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|_1=\sum_{t=1}^T\sum_{j=1}^d\Big(\frac{1}{\sqrt{\hat v_{t-1,j}}}-\frac{1}{\sqrt{\hat v_{t,j}}}\Big)\le\sum_{j=1}^d\frac{1}{\sqrt{\hat v_{0,j}}}\le\frac{d}{\sqrt\epsilon},$$
where the inequality holds by telescoping and the definition of $\hat v_t\in\mathbb{R}^d$. For the sum of the variance difference under the squared $\ell_2$ norm, we have
$$\sum_{t=1}^T\big\|\hat V_{t-1}^{-1/2}-\hat V_t^{-1/2}\big\|^2=\sum_{t=1}^T\sum_{j=1}^d\Big(\frac{1}{\sqrt{\hat v_{t-1,j}}}-\frac{1}{\sqrt{\hat v_{t,j}}}\Big)^2\le\sum_{t=1}^T\sum_{j=1}^d\Big(\frac{1}{\hat v_{t-1,j}}-\frac{1}{\hat v_{t,j}}\Big)\le\frac{d}{\epsilon},$$
where the first inequality holds by the element-wise fact that for $x,y\in\mathbb{R}$ with $0\le y\le x$, $(x-y)^2\le(x-y)(x+y)=x^2-y^2$. This concludes the proof.

Lemma C.3. The compression error has the following absolute bound:
$$\|e_t^i\|^2\le\frac{4q^2}{(1-q^2)^2}\eta_l^2K^2G^2,\qquad\|e_t\|^2\le\frac{4q^2}{(1-q^2)^2}\eta_l^2K^2G^2. \quad (C.5)$$

Proof. For all $t\in[T]$, by Assumption 4.14 and Young's inequality, we have
$$\|e_{t+1}^i\|^2=\big\|\Delta_t^i+e_t^i-C(\Delta_t^i+e_t^i)\big\|^2\le q^2\|\Delta_t^i+e_t^i\|^2\le q^2(1+\rho)\|e_t^i\|^2+q^2\Big(1+\frac1\rho\Big)\|\Delta_t^i\|^2\le\frac{1+q^2}{2}\|e_t^i\|^2+\frac{2q^2}{1-q^2}\|\Delta_t^i\|^2,$$
where the last inequality holds by choosing $\rho=\frac{1-q^2}{2q^2}$, so that $q^2(1+\rho)=\frac{1+q^2}{2}$ and $q^2(1+\frac1\rho)=\frac{q^2(1+q^2)}{1-q^2}\le\frac{2q^2}{1-q^2}$.
Thus, unrolling the recursion with $e_1^i=0$ and $\|\Delta_t^i\|^2\le\eta_l^2K^2G^2$, we obtain the absolute bound for the error terms,
$$\|e_t^i\|^2\le\frac{4q^2}{(1-q^2)^2}\eta_l^2K^2G^2,\qquad\Big\|\frac1m\sum_{i=1}^me_t^i\Big\|^2\le\frac{4q^2}{(1-q^2)^2}\eta_l^2K^2G^2. \quad (C.6)$$
This concludes the proof.

Lemma C.4. Under Assumptions 4.2 and 4.14, for FedAMS we have $\|\nabla f(x)\|\le G$, $\|\Delta_t\|\le\eta_lKG$, $\|m_t\|\le\eta_lKG$ and $\|v_t\|\le\eta_l^2K^2G^2$. For FedCAMS, we have $\|\nabla f(x)\|\le G$, $\|\hat\Delta_t\|^2\le\frac{4(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2$, $\|\bar m_t\|\le\eta_lKG$ and $\|v_t\|\le\frac{4(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2$, where $\bar m_t=\beta_1\bar m_{t-1}+(1-\beta_1)\Delta_t$.

Proof. Since $f$ has $G$-bounded stochastic gradients, for any $x$ and $\xi$ we have $\|\nabla f(x,\xi)\|\le G$, hence $\|\nabla f(x)\|=\|\mathbb{E}_\xi\nabla f(x,\xi)\|\le\mathbb{E}_\xi\|\nabla f(x,\xi)\|\le G$. For FedAMS, the model difference $\Delta_t^i$, by definition, satisfies
$$\Delta_t^i=x_{t,K}^i-x_t=-\eta_l\sum_{k=0}^{K-1}g_{t,k}^i,\qquad\|\Delta_t^i\|\le\eta_l\sum_{k=0}^{K-1}\|g_{t,k}^i\|\le\eta_lKG.$$
Thus, the momentum $m_t$ and variance $v_t$ satisfy
$$\|m_t\|=\Big\|(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\Delta_\tau\Big\|\le\eta_lKG,\qquad\|v_t\|=\Big\|(1-\beta_2)\sum_{\tau=1}^t\beta_2^{t-\tau}\Delta_\tau^2\Big\|\le\eta_l^2K^2G^2.$$
For the compressed version, FedCAMS, we have
$$\|\hat\Delta_t\|^2=\|C(\Delta_t+e_t)\|^2\le\big(\|C(\Delta_t+e_t)-(\Delta_t+e_t)\|+\|\Delta_t+e_t\|\big)^2\le2(q^2+1)\|\Delta_t+e_t\|^2\le4(q^2+1)\big[\|\Delta_t\|^2+\|e_t\|^2\big]\le\frac{4(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2,$$
where the second inequality holds due to Assumption 4.14, and the last inequality holds due to Lemma C.3 together with $(1-q^2)^2+4q^2=(1+q^2)^2$. The virtual momentum sequence $\bar m_t$ has the same bound as $m_t$ of FedAMS. For the variance sequence of FedCAMS, we have
$$\|v_t\|=\Big\|(1-\beta_2)\sum_{\tau=1}^t\beta_2^{t-\tau}\hat\Delta_\tau^2\Big\|\le\frac{4(1+q^2)^3}{(1-q^2)^2}\eta_l^2K^2G^2.$$
This concludes the proof.

Lemma C.5. The global model difference $\Delta_t=\frac1m\sum_{i=1}^m\Delta_t^i$ in the full participation case satisfies
$$\mathbb{E}[\|\Delta_t\|^2]\le\frac{K\eta_l^2}{m}\sigma_l^2+\frac{\eta_l^2}{m^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$

Proof. For $\mathbb{E}[\|\Delta_t\|^2]$ in the full participation case, we have
$$\mathbb{E}[\|\Delta_t\|^2]=\mathbb{E}\Big[\Big\|\frac1m\sum_{i=1}^m\sum_{k=0}^{K-1}\eta_lg_{t,k}^i\Big\|^2\Big]=\frac{\eta_l^2}{m^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\big(g_{t,k}^i-\nabla F_i(x_{t,k}^i)\big)\Big\|^2\Big]+\frac{\eta_l^2}{m^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]\le\frac{K\eta_l^2}{m}\sigma_l^2+\frac{\eta_l^2}{m^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big],$$
where the inequality holds by Assumption 4.2. This concludes the proof.

Lemma C.6.
The global model difference $\Delta_t=\frac1n\sum_{i\in S_t}\Delta_t^i$ in the partial participation case satisfies
$$\mathbb{E}[\|\Delta_t\|^2]\le\frac{K\eta_l^2}{n}\sigma_l^2+\frac{\eta_l^2(m-n)}{mn(m-1)}\Big[15mK^3L^2\eta_l^2(\sigma_l^2+6K\sigma_g^2)+(90mK^4L^2\eta_l^2+3mK^2)\,\mathbb{E}[\|\nabla f(x_t)\|^2]+3mK^2\sigma_g^2\Big]+\frac{\eta_l^2(n-1)}{mn(m-1)}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$

Proof. We have
$$\mathbb{E}[\|\Delta_t\|^2]=\mathbb{E}\Big[\Big\|\frac1n\sum_{i=1}^m\mathbb{I}\{i\in S_t\}\Delta_t^i\Big\|^2\Big]=\frac{\eta_l^2}{n^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\mathbb{I}\{i\in S_t\}\sum_{k=0}^{K-1}\big(g_{t,k}^i-\nabla F_i(x_{t,k}^i)\big)\Big\|^2\Big]+\frac{\eta_l^2}{n^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\mathbb{I}\{i\in S_t\}\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]=\frac{\eta_l^2}{mn}\sum_{i=1}^m\mathbb{E}\Big[\Big\|\sum_{k=0}^{K-1}\big(g_{t,k}^i-\nabla F_i(x_{t,k}^i)\big)\Big\|^2\Big]+\frac{\eta_l^2}{n^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\mathbb{I}\{i\in S_t\}\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]\le\frac{K\eta_l^2}{n}\sigma_l^2+\frac{\eta_l^2}{n^2}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\mathbb{I}\{i\in S_t\}\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big], \quad (C.8)$$
where the third equation holds due to $\mathbb{P}\{i\in S_t\}=\frac nm$ and the inequality holds by Assumption 4.2. For the last term in (C.8), by the sampling strategy (without replacement) we have $\mathbb{P}\{i\in S_t\}=\frac nm$ and $\mathbb{P}\{i,j\in S_t\}=\frac{n(n-1)}{m(m-1)}$; using $\langle x,y\rangle=\frac12\big[\|x\|^2+\|y\|^2-\|x-y\|^2\big]$ and the identity $\frac12\sum_{i\ne j}\|x_i-x_j\|^2=m\sum_{i=1}^m\|x_i\|^2-\big\|\sum_{i=1}^m x_i\big\|^2$, this yields
$$\mathbb{E}[\|\Delta_t\|^2]\le\frac{K\eta_l^2}{n}\sigma_l^2+\frac{\eta_l^2(m-n)}{mn(m-1)}\sum_{i=1}^m\mathbb{E}\Big[\Big\|\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]+\frac{\eta_l^2(n-1)}{mn(m-1)}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]. \quad (C.11)$$
The second term in (C.11) is bounded partially following Reddi et al. (2020):
$$\sum_{i=1}^m\mathbb{E}\Big[\Big\|\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big]=\sum_{i=1}^m\mathbb{E}\Big[\Big\|\sum_{k=0}^{K-1}\big[\nabla F_i(x_{t,k}^i)-\nabla F_i(x_t)+\nabla F_i(x_t)-\nabla f(x_t)+\nabla f(x_t)\big]\Big\|^2\Big]\le3\sum_{i=1}^m\mathbb{E}\Big[\Big\|\sum_{k=0}^{K-1}\big[\nabla F_i(x_{t,k}^i)-\nabla F_i(x_t)\big]\Big\|^2\Big]+3mK^2\sigma_g^2+3mK^2\,\mathbb{E}[\|\nabla f(x_t)\|^2]\le3KL^2\sum_{i=1}^m\sum_{k=0}^{K-1}\mathbb{E}\big[\|x_{t,k}^i-x_t\|^2\big]+3mK^2\sigma_g^2+3mK^2\,\mathbb{E}[\|\nabla f(x_t)\|^2]\le15mK^3L^2\eta_l^2(\sigma_l^2+6K\sigma_g^2)+(90mK^4L^2\eta_l^2+3mK^2)\,\mathbb{E}[\|\nabla f(x_t)\|^2]+3mK^2\sigma_g^2, \quad (C.12)$$
where the last inequality holds by applying Lemma C.9 (which also follows from Reddi et al. (2020)).
Substituting (C.12) into (C.11) concludes the proof.

Lemma C.7. Under Assumptions 4.1-4.3 and Assumption 4.14, for the momentum sequence $\bar m_t=(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\Delta_\tau$ and the accumulated error sequence $\Gamma_t=(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}e_\tau$ in the full participation setting, we have
$$\sum_{t=1}^T\mathbb{E}[\|\bar m_t\|^2]\le\frac{TK\eta_l^2}{m}\sigma_l^2+\frac{\eta_l^2}{m^2}\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big],$$
$$\sum_{t=1}^T\mathbb{E}[\|\Gamma_t\|^2]\le\frac{4Tq^2}{(1-q^2)^2}\cdot\frac{K\eta_l^2}{m}\sigma_l^2+\frac{4q^2}{(1-q^2)^2}\cdot\frac{\eta_l^2}{m^2}\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$

Proof. By the updating rule, we have
$$\mathbb{E}[\|\bar m_t\|^2]=\mathbb{E}\Big[\Big\|(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\Delta_\tau\Big\|^2\Big]\le(1-\beta_1)^2\Big(\sum_{\tau=1}^t\beta_1^{t-\tau}\Big)\Big(\sum_{\tau=1}^t\beta_1^{t-\tau}\mathbb{E}[\|\Delta_\tau\|^2]\Big)\le(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\mathbb{E}[\|\Delta_\tau\|^2]\le\frac{K\eta_l^2}{m}\sigma_l^2+\frac{\eta_l^2}{m^2}(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{\tau,k}^i)\Big\|^2\Big],$$
where the first inequality holds by applying the Cauchy-Schwarz inequality, the second holds by summation of the geometric series, and the last holds by Lemma C.5. Summing over $t=1,\dots,T$ gives the first claim. For the compression error $e_t$, following the proof of Lemma C.3, for each local worker $i$ we have the induction
$$\mathbb{E}[\|e_{t+1}^i\|^2]\le\frac{2q^2}{1-q^2}\sum_{\tau=1}^t\Big(\frac{1+q^2}{2}\Big)^{t-\tau}\mathbb{E}[\|\Delta_\tau^i\|^2].$$
For the sequence $\Gamma_t$, similar to the previous analysis,
$$\mathbb{E}[\|\Gamma_t\|^2]=\mathbb{E}\Big[\Big\|(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}e_\tau\Big\|^2\Big]\le(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\mathbb{E}[\|e_\tau\|^2]\le(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\cdot\frac1m\sum_{i=1}^m\mathbb{E}[\|e_\tau^i\|^2],$$
and summing over $t=1,\dots,T$ together with the induction above and Lemma C.5 yields
$$\sum_{t=1}^T\mathbb{E}[\|\Gamma_t\|^2]\le\frac{4Tq^2}{(1-q^2)^2}\cdot\frac{K\eta_l^2}{m}\sigma_l^2+\frac{4q^2}{(1-q^2)^2}\cdot\frac{\eta_l^2}{m^2}\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$

Lemma C.8. Under Assumptions 4.1-4.3 and Assumption 4.14, for the momentum sequence $\bar m_t=(1-\beta_1)\sum_{\tau=1}^t\beta_1^{t-\tau}\Delta_\tau$ in the partial participation setting, we have
$$\sum_{t=1}^T\mathbb{E}[\|\bar m_t\|^2]\le\frac{KT\eta_l^2}{n}\sigma_l^2+\frac{\eta_l^2}{n^2}\sum_{t=1}^T\mathbb{E}\Big[\Big\|\sum_{i=1}^m\sum_{k=0}^{K-1}\nabla F_i(x_{t,k}^i)\Big\|^2\Big].$$

Proof. The proof outline is the same as the proof of Lemma C.7; the main difference is that $\mathbb{E}[\|\Delta_t\|^2]$ has changed, so we need to apply Lemma C.6 instead of Lemma C.5 in the proof.

Lemma C.9. (This lemma directly follows from Lemma 3 of FedAdam (Reddi et al., 2020).)
For a local learning rate satisfying $\eta_l\le\frac{1}{8KL}$, the local model difference after $k$ ($\forall k\in\{0,1,\dots,K-1\}$) local update steps satisfies
$$\frac1m\sum_{i=1}^m\mathbb{E}\big[\|x_{t,k}^i-x_t\|^2\big]\le5K\eta_l^2(\sigma_l^2+6K\sigma_g^2)+30K^2\eta_l^2\,\mathbb{E}[\|\nabla f(x_t)\|^2]. \quad (C.18)$$

Proof. The proof of Lemma C.9 is exactly the same as the proof of Lemma 3 in Reddi et al. (2020).

D. Additional Discussions

The additional server-to-worker communication: Our current analysis only focuses on one-way compression from worker to server, while server-to-worker broadcasting remains uncompressed, since the cost of broadcasting is in general cheaper than worker-to-server uploading. Note that it is also straightforward to compress $x$ for server-to-worker communication in the full participation scheme (with guarantees), so we can indeed achieve high communication efficiency even with two-way compression. Table 1 shows the communication bits comparison for the scaled-sign and top-k compressors, where $T$ is the total number of training rounds and $d$ denotes the dimension of $x$. Specifically, Table 2 shows the communication bits corresponding to the experiments shown in Figure 4. However, the partial participation setting encounters a synchronization issue that is highly non-trivial to solve; thus we leave the two-way compression strategy for future work.

Method | Uncompressed | One-way Compression | Two-way Compression
Scaled sign | 32d * 2T | (32 + d)T + 32dT | (32 + d) * 2T
Top-k | 32d * 2T | 32(2k + d)T | 32 * 2k * 2T

Table 1. Communication bits comparisons for scaled-sign and top-k compressors.

Method | Uncompressed | One-way Compression | Two-way Compression
Scaled sign | 3.58e11 | 1.84e11 | 1.12e10
Top-k with r = 1/64 | 3.58e11 | 1.84e11 | 1.12e10
Top-k with r = 1/128 | 3.58e11 | 1.82e11 | 5.59e9
Top-k with r = 1/256 | 3.58e11 | 1.80e11 | 2.79e9

Table 2. Approximate communication bits comparisons for the scaled-sign and top-k (with r = 1/64, r = 1/128 and r = 1/256) compressors when training CIFAR-10 on the ResNet-18 model for 500 rounds.

E. Additional Experimental Results

E.1.
Hyperparameter Settings

We conduct detailed hyperparameter searches to find the best hyperparameters for each baseline method, including ours. In detail, we grid search over the local learning rate $\eta_l\in\{0.0001, 0.001, 0.01, 0.1, 1.0\}$ and the global learning rate $\eta\in\{0.001, 0.01, 0.1, 1.0\}$ for all methods. For adaptive federated optimization methods, we set $\beta_1=0.9$, $\beta_2=0.99$. For FedAdam, FedYogi, and FedAMSGrad, we search for the best $\epsilon$ from $\{10^{-8}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$. For FedAMS and FedCAMS, we search for the max stabilization $\epsilon$ from $\{10^{-8}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\}$.

Specifically, for our ResNet-18 experiments, we set the local learning rate $\eta_l=0.01$ and the global learning rate $\eta=1.0$ for FedAvg; $\eta_l=0.01$, $\eta=0.1$ and $\epsilon=0.1$ for FedAdam and FedAMSGrad; $\eta_l=0.01$, $\eta=1.0$ and $\epsilon=0.1$ for FedYogi; and $\eta_l=0.01$, $\eta=1.0$ and max stabilization $\epsilon=0.001$ for FedAMS and FedCAMS. For our ConvMixer-256-8 experiments, we set $\eta_l=0.01$ and $\eta=1.0$ for FedAvg; $\eta_l=0.01$, $\eta=1.0$ and $\epsilon=0.1$ for FedAdam, FedYogi and FedAMSGrad; and $\eta_l=0.01$, $\eta=1.0$ and max stabilization $\epsilon=0.001$ for FedAMS and FedCAMS.

E.2. Additional Experiments

Figure 6 shows the effect of the number of participating clients $n$ on the convergence of FedCAMS, with $n\in\{5, 10, 20\}$. For both the ResNet-18 and ConvMixer-256-8 models, a larger number of participating clients $n$ achieves a faster convergence rate, which backs up our theory.

Figure 6. The learning curves for FedCAMS with different numbers of participating clients $n$ when training CIFAR-10 on the (a) ResNet-18 and (b) ConvMixer-256-8 models.
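As a side note, the communication-cost entries in Tables 1 and 2 of Appendix D can be reproduced with a few lines of code. A sketch, assuming a ResNet-18 parameter count of $d=11{,}173{,}962$ (our assumption; the exact count depends on the implementation) and $T=500$ rounds:

```python
def bits_uncompressed(d, T):
    """32-bit floats in both directions (uplink + downlink)."""
    return 32 * d * 2 * T

def bits_scaled_sign(d, T, two_way=False):
    per_msg = 32 + d                     # d sign bits plus one 32-bit scale
    return per_msg * 2 * T if two_way else per_msg * T + 32 * d * T

def bits_top_k(d, T, k, two_way=False):
    per_msg = 32 * 2 * k                 # k values and k indices, 32 bits each
    return per_msg * 2 * T if two_way else per_msg * T + 32 * d * T

d, T = 11_173_962, 500                   # assumed ResNet-18 size and rounds
print(f"{bits_uncompressed(d, T):.2e}")                    # ~3.58e+11
print(f"{bits_scaled_sign(d, T):.2e}")                     # ~1.84e+11
print(f"{bits_top_k(d, T, d // 64):.2e}")                  # ~1.84e+11
print(f"{bits_top_k(d, T, d // 256, two_way=True):.2e}")   # ~2.79e+09
```

One-way compression totals include the uncompressed 32d-bit downlink per round, which is why the one-way column stays near 1.8e11 regardless of the compressor; only two-way compression removes that term.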
Figure 7 shows the convergence results of FedAMS and other federated learning baselines when training on the CIFAR-100 dataset with the ResNet-18 model and the ConvMixer-256-8 model. We compare the training loss and test accuracy against the number of global rounds for each model. For the ResNet-18 model, FedAMS and FedYogi achieve significantly better performance than the other three baselines. In particular, FedYogi converges quickly in the early stage of training, while FedAMS performs best in terms of final training loss and test accuracy. FedAvg achieves a training loss only slightly better than FedAdam and FedAMSGrad, but a much higher test accuracy, close to that of FedYogi and FedAMS.

Figure 7. The learning curves for FedAMS and other federated learning baselines when training on CIFAR-100: (a)(b) show the results for the ResNet-18 model and (c)(d) show the results for the ConvMixer-256-8 model.

For the ConvMixer-256-8 model, which is typically trained via adaptive gradient methods, we observe that all adaptive federated optimization methods (FedAdam, FedYogi, FedAMSGrad and FedAMS) achieve much better performance in terms of both training loss and test accuracy compared with FedAvg. In particular, FedAMS again achieves significantly better results than the other baselines in terms of training loss and test accuracy. The other adaptive methods, including FedAdam, FedYogi, and FedAMSGrad, show similar convergence behaviour when training the ConvMixer-256-8 model.
Such results empirically show the effectiveness of our proposed FedAMS method.

E.3. Additional Ablation Study

The ablation on $\epsilon$: We conduct an ablation study with $\epsilon\in\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-6}, 10^{-8}\}$ on CIFAR-10 in Table 3; the $\epsilon$ value used in our experiments was chosen for its relatively higher test accuracy.

$\epsilon$ | $10^{-1}$ | $10^{-2}$ | $10^{-3}$ | $10^{-4}$ | $10^{-6}$ | $10^{-8}$
Test acc (%) | 90.45 | 90.51 | 90.94 | 90.72 | 90.49 | 90.30

Table 3. Ablation study on $\epsilon$.
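The role of the max stabilization $\epsilon$ in this ablation can be seen directly from the update rule: $\epsilon$ floors $\hat v_t$ and therefore caps the per-coordinate step scaling $1/\sqrt{\hat v_t}$. A tiny illustration (the update values below are arbitrary, chosen only to show the effect):

```python
import numpy as np

delta_hat = np.array([0.05, -0.002, 0.0008])   # an illustrative averaged update
beta2 = 0.99
v = (1 - beta2) * delta_hat**2                  # first-step second-moment estimate

for eps in (1e-1, 1e-3, 1e-8):
    v_hat = np.maximum(v, eps)                  # max stabilization floor
    step_scale = 1.0 / np.sqrt(v_hat)           # per-coordinate scaling of eta
    print(eps, step_scale.round(2))
```

With a large $\epsilon$ the scaling is uniformly capped at $1/\sqrt\epsilon$ and the method behaves more like momentum SGD; with a very small $\epsilon$, coordinates with tiny second-moment estimates receive very large relative steps, which is consistent with the sweet spot around $\epsilon=10^{-3}$ in Table 3.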