# Momentum Provably Improves Error Feedback!

Ilyas Fatkhullin (ETH AI Center & ETH Zurich), Alexander Tyurin (KAUST*), Peter Richtárik (KAUST)

Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when left untreated, the errors caused by compression propagate and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al. [2021], known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum.

1 Introduction

Since the practical utility of modern machine learning models crucially depends on our ability to train them on large quantities of training data, it is imperative to perform the training in a distributed storage and compute environment. In federated learning (FL) [Konečný et al., 2016, Kairouz et al., 2019], for example, data is naturally stored in a distributed fashion across a large number of clients (who capture and own the data in the first place), and the goal is to train a single machine learning model from the wealth of all this distributed data, in a private fashion, directly on their devices.

1.1 Formalism

We consider the problem of collaborative training of a single model by several clients in a data-parallel fashion. In particular, we aim to solve the distributed nonconvex stochastic optimization problem

$$\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) \right], \qquad f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[f_i(x, \xi_i)\right], \qquad i = 1, \ldots, n, \tag{1}$$

where $n$ is the number of clients, $x \in \mathbb{R}^d$ represents the parameters of the model we wish to train, and $f_i(x)$ is the (typically nonconvex) loss of the model parameterized by the vector $x$ on the data $\mathcal{D}_i$ owned by client $i$. Unlike most works in federated learning, we do not assume the datasets to be similar, i.e., we allow the distributions $\mathcal{D}_1, \ldots, \mathcal{D}_n$ to be arbitrarily different.

*King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
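To make the formalism of (1) concrete, the following minimal NumPy sketch instantiates the objective with synthetic per-client data; the quadratic per-sample loss and the client distributions are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 10, 100  # clients, dimension, samples per client (illustrative)

# Each client i holds its own dataset D_i; the distributions may differ arbitrarily.
client_data = [rng.normal(loc=i, scale=1.0, size=(m, d)) for i in range(n)]

def f_i(x, data):
    """Empirical estimate of f_i(x) = E_{xi ~ D_i}[f_i(x, xi)];
    here f_i(x, xi) = 0.5 * ||x - xi||^2 is a stand-in loss."""
    return 0.5 * np.mean(np.sum((x - data) ** 2, axis=1))

def f(x):
    """Global objective (1): f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([f_i(x, D) for D in client_data])

x0 = np.zeros(d)
print(f"f(x0) = {f(x0):.3f}")
```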
We are interested in the fundamental problem of finding an approximately stationary point of $f$ in expectation, i.e., we wish to find a (possibly random) vector $\hat{x} \in \mathbb{R}^d$ such that $\mathbb{E}\left[\|\nabla f(\hat{x})\|\right] \le \varepsilon$. In order to solve this problem, we assume that the $n$ clients communicate via an orchestrating server. Typically, the role of the server is to first perform aggregation of the messages obtained from the workers, and to subsequently broadcast the aggregated information back to the workers. Following an implicit assumption made in virtually all theoretically-focused papers on communication-efficient training, we also assume that the server-to-workers broadcast is so fast (compared to the speed of workers-to-server communication) that the cost associated with broadcast can be neglected².

1.2 Aiming for communication and computation efficiency at the same time

In our work, we pay attention to two key aspects of efficient distributed training: the communication cost and the computation cost of finding an approximate stationary point $\hat{x}$. The former refers to the number of bits that need to be communicated by the workers to the server, and the latter refers to the number of stochastic gradients that need to be sampled by each client. The rest of the paper can be summarized as follows: we pick one of the most popular communication-efficient gradient-type methods (the EF21 method of Richtárik et al. [2021], the latest variant of error feedback pioneered by Seide et al. [2014]) and modify it in a way which provably preserves its communication complexity, but massively improves its computation/sample complexity, both theoretically and in practice.

2 Communication Compression, Error Feedback, and Sample Complexity

Communication compression techniques such as quantization [Alistarh et al., 2017, Horváth et al., 2019a] and sparsification [Seide et al., 2014, Beznosikov et al., 2020] are known to be immensely powerful for reducing the communication footprint of gradient-type³ methods. Arguably the most studied, versatile and practically useful class of compression mappings are contractive compressors.

Definition 1 (Contractive compressors). We say that a (possibly randomized) mapping $\mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d$ is a contractive compression operator if there exists a constant $0 < \alpha \le 1$ such that

$$\mathbb{E}\left[\|\mathcal{C}(x) - x\|^2\right] \le (1 - \alpha)\|x\|^2, \qquad \forall x \in \mathbb{R}^d. \tag{2}$$

Inequality (2) is satisfied by a vast array of compressors considered in the literature, including numerous variants of sparsification operators [Alistarh et al., 2018, Stich et al., 2018], quantization operators [Alistarh et al., 2017, Horváth et al., 2019a], low-rank approximation [Vogels et al., 2019, Safaryan et al., 2022], and more [Beznosikov et al., 2020, Safaryan et al., 2021]. The canonical examples are i) the Top$K$ sparsifier, which preserves the $K$ largest components of $x$ in magnitude and sets all remaining coordinates to zero [Stich et al., 2018], and ii) the (scaled) Rand$K$ sparsifier, which preserves a subset of $K$ components of $x$ chosen uniformly at random and sets all remaining coordinates to zero [Beznosikov et al., 2020]. In both cases, (2) is satisfied with $\alpha = K/d$; see the code sketch below.

2.1 Brief history of error feedback

When greedy contractive compressors, such as Top$K$, are used in a direct way to compress the local gradients in distributed gradient descent (GD), the resulting method may diverge exponentially, even on strongly convex quadratics [Beznosikov et al., 2020].
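For concreteness, here is a minimal NumPy sketch of the two canonical compressors from Definition 1; the function names are ours, and this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """TopK: keep the k largest-magnitude entries of x, zero out the rest.
    Deterministic and biased; satisfies (2) with alpha = k/d."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandK: keep k uniformly random entries of x, zero out the rest.
    Also satisfies (2) with alpha = k/d."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
for C, name in [(lambda v: top_k(v, 10), "TopK"),
                (lambda v: rand_k(v, 10, rng), "RandK")]:
    err = np.linalg.norm(C(x) - x) ** 2 / np.linalg.norm(x) ** 2
    print(f"{name}: ||C(x)-x||^2 / ||x||^2 = {err:.4f}  (<= 1 - 10/1000 = 0.99)")
```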
Empirically, instability caused by such a naive application of greedy compressors was observed much earlier, and a fix was proposed in the form of the error feedback (EF) mechanism by Seide et al. [2014], which we henceforth call EF14 or EF14-SGD (in the stochastic case).⁴ To the best of our knowledge, the best sample complexity of EF14-SGD for finding a stationary point in the distributed nonconvex setting is given by Koloskova et al. [2020]: after $\mathcal{O}\left(G\alpha^{-1}\varepsilon^{-3} + \sigma^2 n^{-1}\varepsilon^{-4}\right)$ samples⁵, EF14-SGD finds a point $x$ such that $\mathbb{E}\left[\|\nabla f(x)\|\right] \le \varepsilon$, where $\alpha$ is the contraction parameter (see Definition 1).

²While this is a reasonable assumption in many practical situations [Mishchenko et al., 2019, Kairouz et al., 2019], some works consider the regime when the server-to-workers broadcast cannot be neglected [Horváth et al., 2019a, Tang et al., 2020, Philippenko and Dieuleveut, 2020, Kovalev et al., 2021, Fatkhullin et al., 2021, Gruntkowska et al., 2022].
³For Newton-type methods, see [Islamov et al., 2022] and references therein.
⁴In Appendix A, we provide a more detailed discussion of theoretical developments for this method.
⁵Here $\sigma^2$ is the bound on the variance of stochastic gradients at each node, see Assumption 2. When referring to the sample complexity, we count the number of stochastic gradients used at only one of the $n$ nodes rather than by all nodes in total. This is a meaningful notion because the computations are done in parallel.

Figure 1: Divergence of EF21-SGD on the quadratic function $f(x) = \frac{1}{2}\|x\|^2$, $x \in \mathbb{R}^2$, using the Top1 compressor. Panel (a): divergence for $n = 1$ (EF21-SGD-ideal, EF21-SGD, EF21-SGDM). Panel (b): no improvement with $n$ (EF21-SGD with $n \in \{50, 500, 5000, 10000\}$). See the proof of Theorem 1 for details on the construction of the noise $\xi$; we use $\sigma = 1$, $B = 1$. The starting point is $x^0 = (0, -0.01)^\top$. Unlike EF21-SGD, our method EF21-SGDM does not suffer from divergence and is stable near the optimum. Figure 1b shows that when increasing the number of nodes $n$, EF21-SGD applied with $B = 1$ does not improve, and, moreover, diverges from the optimum even faster. All experiments use constant parameters $\gamma = \eta = 0.1/\sqrt{T} = 10^{-3}$; see Figure 4 for diminishing parameters. Each method is run 10 times and the plot shows the median performance alongside the 25% and 75% quantiles.

However, such an analysis has two important deficiencies. First, in the deterministic case (when exact gradients are computable by each node), the analysis only gives an $\mathcal{O}(\varepsilon^{-3})$ iteration complexity, which is suboptimal compared to vanilla (i.e., non-compressed) gradient descent, whose iteration complexity is $\mathcal{O}(\varepsilon^{-2})$. Second, their analysis relies heavily on additional strong assumptions, such as the bounded gradient (BG) assumption, $\mathbb{E}\left[\|\nabla f_i(x, \xi_i)\|^2\right] \le G^2$ for all $x \in \mathbb{R}^d$, $i \in [n]$, $\xi_i \sim \mathcal{D}_i$, or the bounded gradient similarity (BGS) assumption, $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le G^2$ for all $x \in \mathbb{R}^d$. Such assumptions are restrictive and sometimes even unrealistic. In particular, both BG and BGS might not hold even in the case of convex quadratic functions.⁶ Moreover, it was recently shown that a nonconvex analysis of stochastic gradient methods using a BG assumption may hide an exponential dependence on the smoothness constant in the complexity [Yang et al., 2023]. In 2021, these issues were partially resolved by Richtárik et al.
[2021], who proposed a modification of the EF mechanism, which they call EF21. They addressed both deficiencies of the original EF14 method: i) they removed the BG/BGS assumptions, and ii) they improved the iteration complexity to $\mathcal{O}(\varepsilon^{-2})$ in the full gradient regime. Subsequently, the EF21 method was modified in several directions, e.g., extended to bidirectional compression, variance reduction and the proximal setup [Fatkhullin et al., 2021], generalized from contractive to three-point compressors [Richtárik et al., 2022] and adaptive compressors [Makarenko et al., 2022], modified from dual (gradient) to primal (model) compression [Gruntkowska et al., 2022] and from the centralized to the decentralized setting [Zhao et al., 2022]. For further work, we refer to [Wang et al., 2022, Dorfman et al., 2023, Islamov et al., 2022].

2.2 Key issue: error feedback has an unhealthy appetite for samples!

Unfortunately, the current theory of EF21 with stochastic gradients has weak sample complexity guarantees. In particular, Fatkhullin et al. [2021] extended the EF21-GD method, which is the basic variant of EF21 using full gradients at the clients, to EF21-SGD, which uses a large minibatch of stochastic gradients instead. They obtained an $\mathcal{O}\left(\frac{1}{\alpha\varepsilon^2} + \frac{\sigma^2}{\alpha^3\varepsilon^4}\right)$ sample complexity for their method. Later, Zhao et al. [2022] improved this result slightly⁷ to $\mathcal{O}\left(\frac{1}{\alpha\varepsilon^2} + \frac{\sigma^2}{\alpha^2\varepsilon^4}\right)$, shaving off one $\alpha$ in the stochastic term. However, it is easy to notice several issues in these results, which reflect the fundamental challenge of combining biased gradient methods with stochastic gradients.

Mega-batches. These works require all clients to sample mega-batches of stochastic gradients/datapoints in each iteration, of order $\mathcal{O}(\varepsilon^{-2})$, in order to control the variance coming from stochastic gradients. In Figure 1, we find that, in fact, a batch-free (i.e., with mini-batch size $B = 1$) version of EF21-SGD diverges even on a very simple quadratic function. We also observe a similar behavior when a small batch $B > 1$ is applied. This implies that there is a fundamental flaw in the EF21-SGD method itself, rather than just a problem with the theoretical analysis. While mega-batch methods are common in the optimization literature, smaller batches are often preferred whenever they "work". For example, the time/cost required to obtain such a large number of samples at each iteration might be unreasonably large compared to the communication time, which is already reduced using compression. Moreover, when dealing with medical data, large batches might simply be unavailable [Rieke et al., 2020]. In certain applications, such as federated reinforcement learning (RL) or multi-agent RL, it is often intractable to sample more than one trajectory of the environment in order to form a gradient estimator [Mitra et al., 2023, Doan et al., 2019, Jin et al., 2022, Khodadadian et al., 2022].

⁶For example, one can consider $f_i(x) = x^\top A_i x$ with $A_i \in \mathbb{R}^{d \times d}$, for which the BG or BGS assumptions hold only in the trivial cases: the matrices $A_i$ are all zero or all equal to each other (homogeneous data regime).
⁷The result was obtained under the more general setting of decentralized optimization over a network.
Further, a method using a mega-batch at each iteration effectively follows the gradient descent (GD) dynamics instead of the dynamics of (mini-batch) SGD, which may hinder the training and generalization performance of such algorithms, since it is both empirically [Keskar et al., 2017, Kleinberg et al., 2018] and theoretically [Kale et al., 2021] observed that mini-batch SGD is superior to mega-batch SGD or GD in a number of machine learning tasks.

Dependence on $\alpha$. The total sample complexity results derived by Fatkhullin et al. [2021], Zhao et al. [2022] suffer from a poor dependence on the contraction parameter $\alpha$. Typically, EF methods are used with the Top$K$ sparsifier, which only communicates the $K$ largest entries in magnitude. In this case, $\alpha = K/d$, and the stochastic part of the sample complexity scales quadratically with the dimension.

No improvement with $n$. The stochastic term in the sample complexity of EF21-SGD does not improve when increasing the number of nodes. However, the opposite behavior is typically desired, and is present in several recent non-EF methods based on unbiased compressors, such as MARINA [Gorbunov et al., 2021] and DASHA [Tyurin and Richtárik, 2022]. We are not aware of any distributed algorithms utilizing the Top$K$ compressor that achieve linear speedup in $n$ in the stochastic term without relying on the restrictive BG or BGS assumptions.

These observations motivate our work with the following central questions: Can we design a batch-free distributed SGD method utilizing contractive communication compression (such as Top$K$) without relying on the restrictive BG/BGS assumptions? Is it possible to improve over the current state-of-the-art $\mathcal{O}\left(\alpha^{-1}\varepsilon^{-2} + \sigma^2\alpha^{-2}\varepsilon^{-4}\right)$ sample complexity under the standard smoothness and bounded variance assumptions? We answer both questions in the affirmative by incorporating a momentum update into EF21-SGD.

2.3 Mysterious effectiveness of momentum in nonconvex optimization

An immensely popular modification of SGD (and its distributed variants) is the use of momentum. This technique, initially inspired by developments in convex optimization [Polyak, 1964], is often applied in machine learning for stabilizing convergence and speeding up training. In particular, momentum is an important part of an immensely popular and empirically successful line of adaptive methods for deep learning, including ADAM [Kingma and Ba, 2015] and a plethora of variants. The classical SGD method with Polyak (i.e., heavy ball) momentum (SGDM) reads:

$$x^{t+1} = x^t - \gamma v^t, \qquad v^{t+1} = (1 - \eta)v^t + \eta\nabla f(x^{t+1}, \xi^{t+1}), \tag{3}$$

where $\gamma > 0$ is a learning rate and $\eta > 0$ is the momentum parameter (a minimal code sketch is given below). We provide a concise walk through the key theoretical developments in the analysis of SGDM in stochastic nonconvex optimization in Appendix A, and only mention the most relevant works here. The works most closely related to ours are [Mishchenko et al., 2019], [Xie et al., 2020], and [Fatkhullin et al., 2021], which analyze momentum together with communication compression. The analysis in [Mishchenko et al., 2019, Xie et al., 2020] requires the BG/BGS assumption, and does not provide any theoretical improvement over the variants without momentum. Finally, the analysis of Fatkhullin et al. [2021] is only established for the deterministic case, and it is unclear whether its extension to the stochastic case can bring any convergence improvement over EF21-SGD.
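The update (3) is easy to state in code. The following sketch runs SGDM on a toy quadratic with additive Gaussian gradient noise; the objective and noise model are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 1.0

def grad(x):
    return x  # gradient of f(x) = 0.5 * ||x||^2 (illustrative objective)

def stoch_grad(x):
    return grad(x) + sigma * rng.normal(size=d)  # unbiased; E||noise||^2 = d * sigma^2

def sgdm(x0, gamma=0.1, eta=0.1, T=1000):
    """SGDM, Eq. (3): x_{t+1} = x_t - gamma * v_t,
    v_{t+1} = (1 - eta) * v_t + eta * grad_f(x_{t+1}, xi_{t+1})."""
    x, v = x0.copy(), stoch_grad(x0)
    for _ in range(T):
        x = x - gamma * v
        v = (1 - eta) * v + eta * stoch_grad(x)
    return x

x_final = sgdm(rng.normal(size=d))
print(f"||grad f(x_T)|| = {np.linalg.norm(grad(x_final)):.4f}")
```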
Recently, several other works have attempted to explain the benefits of momentum [Plattner, 2022]; some consider structured nonconvex problems [Wang and Abernethy, 2021], and others focus on generalization [Jelassi and Li, 2022].

| Method | Communication complexity | Asymptotic sample complexity | Batch-free? | No extra assumptions? |
|---|---|---|---|---|
| EF14-SGD [Koloskova et al., 2020] | – | $\mathcal{O}\left(\frac{L_{\max}\sigma^2}{n\varepsilon^4}\right)$ | ✓ | ✗ (a) |
| NEOLITHIC [Huang et al., 2022] | $\mathcal{O}\left(\frac{K}{\alpha\varepsilon^2}\log\frac{G}{\varepsilon}\right)$ (b) | – | ✗ | ✗ (c) |
| EF21-SGD [Fatkhullin et al., 2021] | – | – | ✗ (d) | ✓ |
| BEER [Zhao et al., 2022] | – | – | ✗ (d) | ✓ |
| EF21-SGDM (Corollary 2) | $\mathcal{O}\left(\frac{K}{\alpha\varepsilon^2}\right)$ | $\mathcal{O}\left(\frac{L\sigma^2}{n\varepsilon^4}\right)$ | ✓ | ✓ |

(a) The analysis requires a bound on the second moment of the stochastic gradients, i.e., $\mathbb{E}\left[\|\nabla f_i(x, \xi_i)\|^2\right] \le G^2$ for all $x \in \mathbb{R}^d$.
(b) This complexity is achieved by using a large mini-batch and communicating $K/\alpha$ coordinates per iteration, see Appendix A.
(c) The analysis requires a bounded gradient dissimilarity assumption, i.e., $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le G^2$ for all $x \in \mathbb{R}^d$.
(d) The analysis requires a batch size of at least $B \ge \frac{\sigma^2}{\alpha^2\varepsilon^2}$ for EF21-SGD and $B \ge \frac{\sigma^2}{\alpha\varepsilon^2}$ for BEER.

Table 1: Summary of related works on distributed error compensated SGD methods using a Top$K$ compressor under Assumptions 1 and 2. The goal is to find an $\varepsilon$-stationary point of a smooth nonconvex function of the form (1), i.e., a point $x$ such that $\mathbb{E}\left[\|\nabla f(x)\|\right] \le \varepsilon$. "Communication complexity": the total number of communicated bits if the method is applied with a sufficiently large batch size; see Table 2 for the batch size. "Asymptotic sample complexity": the total number of samples required at each node to find an $\varepsilon$-stationary point with batch size $B = 1$ in the regime $\varepsilon \to 0$. "No extra assumptions": no additional assumption beyond Assumptions 1 and 2 is required.

Summary of contributions. Despite the vast amount of work trying to explain the benefits of momentum, there is no work obtaining any theoretical improvement over vanilla SGD in the smooth nonconvex setting under the standard assumptions of smoothness and bounded variance.

- First, we establish a negative result for a simplified/idealized version of EF21-SGD, which shows that this algorithm does not converge with a constant batch size, and that a mega-batch of order $\Omega(\sigma^2\varepsilon^{-2})$ is required. This provides a strong indication that the EF21-SGD method is inherently sensitive to stochastic gradients, which is also confirmed by our numerical simulations.
- We propose a simple fix for this problem by incorporating a momentum step into EF21-SGD, which leads to our one-batch EF21-SGDM algorithm. By leveraging our new Lyapunov function construction and new analysis, we establish an $\mathcal{O}\left(\alpha^{-1}\varepsilon^{-2} + \sigma^2\varepsilon^{-4}\right)$ sample complexity in the single node case.
- We extend our algorithm to the distributed setting and derive an improved sample complexity result compared to other methods using the Top$K$ compressor, without resorting to the BG/BGS assumptions. In particular, EF21-SGDM achieves the asymptotically optimal $\mathcal{O}\left(\sigma^2 n^{-1}\varepsilon^{-4}\right)$ sample complexity. Moreover, when EF21-SGDM is applied with a large enough batch size, we prove that it reaches the optimal communication complexity $\mathcal{O}\left(K\alpha^{-1}\varepsilon^{-2}\right)$; see Tables 1 & 2 for more details.
- Finally, we propose a double momentum variant of EF21-SGDM, and find that it further improves the sample complexity of EF21-SGDM in the non-asymptotic regime.

We highlight that, interestingly, we prove that momentum helps: EF21-SGDM is theoretically better than its non-momentum variant, large-batch EF21-SGD.
We believe that our new technique can be extended in many ways, e.g., to other biased compressors and other (biased) communication saving techniques such as lazy aggregation of gradients, model compression, bidirectional compression, partial participation, decentralized training, and adaptive compression; and to other important optimization techniques involving biased updates, such as proximal SGD with momentum, gradient clipping and adaptive step-size schedules. We also hope that our proof techniques can be useful to establish linear speedup for other classes of distributed methods, e.g., algorithms based on local training such as ProxSkip/Scaffnew [Mishchenko et al., 2022]. Additionally, we extend our results to the class of absolute compressors in Appendix H and study the variance reduced variant of error feedback in Appendix I.

3 Main Results

Throughout the paper we work under the following standard assumptions.

Assumption 1 (Smoothness and lower boundedness). We assume that $f$ has $L$-Lipschitz gradient, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$, and each $f_i$ has $L_i$-Lipschitz gradient, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i\|x - y\|$ for all $i \in [n]$, $x, y \in \mathbb{R}^d$. We denote $\widetilde{L}^2 := \frac{1}{n}\sum_{i=1}^{n} L_i^2$. Moreover, we assume that $f$ is lower bounded, i.e., $f^* := \inf_{x \in \mathbb{R}^d} f(x) > -\infty$.

Assumption 2 (Bounded variance (BV)). There exists $\sigma > 0$ such that

$$\mathbb{E}\left[\|\nabla f_i(x, \xi_i) - \nabla f_i(x)\|^2\right] \le \sigma^2, \qquad \forall x \in \mathbb{R}^d, \ i \in [n], \tag{4}$$

where $\xi_i \sim \mathcal{D}_i$ are i.i.d. random samples for each $i \in [n]$.

3.1 A deeper dive into the issues EF21 has with stochastic gradients

As remarked before, the current analysis of EF21 in the stochastic setting requires each client to sample a mega-batch in each iteration, and it is not clear how to avoid this. In order to understand this phenomenon, we propose to step back and examine an idealized version of EF21-SGD, which we call EF21-SGD-ideal, defined by the update rules (5a) + (5aa):

$$x^{t+1} = x^t - \gamma g^t, \qquad g^t = \frac{1}{n}\sum_{i=1}^{n} g_i^t, \tag{5a}$$

$$\text{EF21-SGD-ideal:} \qquad g_i^{t+1} = \nabla f_i(x^{t+1}) + \mathcal{C}\left(\nabla f_i(x^{t+1}, \xi_i^{t+1}) - \nabla f_i(x^{t+1})\right), \tag{5aa}$$

$$\text{EF21-SGD:} \qquad g_i^{t+1} = g_i^t + \mathcal{C}\left(\nabla f_i(x^{t+1}, \xi_i^{t+1}) - g_i^t\right). \tag{5ab}$$

Compared to EF21-SGD, given by (5a) + (5ab), we replace the previous state $g_i^t$ by the exact gradient at the current iteration. Since EF21-SGD heavily relies on the approximation $g_i^t \approx \nabla f_i(x^{t+1})$, and, according to the proof of convergence of EF21-SGD, this discrepancy tends to zero as $t \to \infty$, this change can only improve the method. While we admit this is a conceptual algorithm only (it does not lead to any communication or sample complexity reduction in practice)⁸, it serves us well to illustrate the drawbacks of EF21-SGD. We now establish the following negative result for EF21-SGD-ideal.

Theorem 1. Let $L, \sigma > 0$, $0 < \gamma \le 1/L$ and $n = 1$. There exists a convex, $L$-smooth function $f: \mathbb{R}^2 \to \mathbb{R}$, a contractive compressor $\mathcal{C}(\cdot)$ satisfying Definition 1, and an unbiased stochastic gradient with bounded variance $\sigma^2$ such that if the method EF21-SGD-ideal ((5a) + (5aa)) is run with step-size $\gamma$, then for all $T \ge 0$ and for all $x^0 \in \{(0, x^0_{(2)})^\top \in \mathbb{R}^2 \mid x^0_{(2)} < 0\}$, we have $\mathbb{E}\left[\|\nabla f(x^T)\|^2\right] \ge \frac{1}{60}\min\left\{\sigma^2, \|\nabla f(x^0)\|^2\right\}$. Fix $0 < \varepsilon \le L/\sqrt{60}$ and $x^0 = (0, -1)^\top$. Additionally, assume that $n \ge 1$ and the variance of the unbiased stochastic gradient is controlled by $\sigma^2/B$ for some $B \ge 1$. If $B < \frac{\sigma^2}{60\varepsilon^2}$, then we have $\mathbb{E}\left[\|\nabla f(x^T)\|\right] > \varepsilon$ for all $T \ge 0$.
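To make the update rules (5a), (5aa) and (5ab) concrete, here is a minimal single-node ($n = 1$) NumPy sketch on the quadratic $f(x) = \frac{1}{2}\|x\|^2$ from Figure 1. A caveat: for simplicity we use generic Gaussian gradient noise, whereas Theorem 1 and Figure 1 rely on the adversarial noise constructed in the proof, so this sketch only illustrates the mechanics of the updates and need not reproduce the divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1(u):
    out = np.zeros_like(u)
    j = np.argmax(np.abs(u))
    out[j] = u[j]
    return out

grad = lambda x: x                              # gradient of f(x) = 0.5 * ||x||^2
sgrad = lambda x: grad(x) + rng.normal(size=2)  # generic Gaussian noise, sigma = 1

def run(ideal, gamma=1e-3, T=10_000):
    x = np.array([0.0, -0.01])
    g = sgrad(x)
    for _ in range(T):
        x = x - gamma * g                        # (5a)
        if ideal:                                # (5aa): EF21-SGD-ideal
            g = grad(x) + top1(sgrad(x) - grad(x))
        else:                                    # (5ab): EF21-SGD
            g = g + top1(sgrad(x) - g)
    return np.linalg.norm(grad(x))

print("EF21-SGD-ideal final ||grad f||:", run(ideal=True))
print("EF21-SGD       final ||grad f||:", run(ideal=False))
```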
The above theorem implies that the method (5a) + (5aa) does not converge with a small batch size (e.g., equal to one) for any fixed step-size choice.⁹ Moreover, in the distributed setting with $n$ nodes, a mini-batch of order $B = \Omega\left(\sigma^2/\varepsilon^2\right)$ is required for convergence. Notice that this batch size is independent of $n$, which further implies that a linear speedup in the number of nodes $n$ cannot be achieved for this method. While we only prove these negative results for an "idealized" version of EF21-SGD rather than for the method itself, in Figures 1a and 4a we empirically verify that EF21-SGD also suffers from a similar divergence on the same problem instance provided in the proof of Theorem 1. Additionally, Figures 1b and 4b illustrate that the situation does not improve for EF21-SGD when increasing $n$.

3.2 Momentum for avoiding mega-batches

Let us now focus on the single node setting¹⁰ and try to fix the divergence issue shown above. As we can learn from Theorem 1, the key reason for the non-convergence of EF21-SGD is that even if the state vector $g^t$ sufficiently approximates the current gradient, i.e., $g^t \approx \nabla f(x^{t+1})$, the design of this method cannot guarantee that the quantity $\|g^t - \nabla f(x^t)\|^2 \approx \|\mathcal{C}(\nabla f(x^t, \xi^t) - \nabla f(x^t))\|^2$ is small enough. Indeed, the last term above can be bounded by $2(2 - \alpha)\sigma^2$ in expectation, but this is not sufficient for convergence, as formally illustrated in Theorem 1. To fix this problem, we propose to modify our idealized EF21-SGD-ideal method so that the compressed difference can be controlled and made arbitrarily small, which leads us to another (more advanced) conceptual algorithm, EF21-SGDM-ideal:

$$v^{t+1} = \nabla f(x^{t+1}) + \eta\left(\nabla f(x^{t+1}, \xi^{t+1}) - \nabla f(x^{t+1})\right), \qquad g^{t+1} = \nabla f(x^{t+1}) + \mathcal{C}\left(v^{t+1} - \nabla f(x^{t+1})\right). \tag{6}$$

In this method, instead of using $v^{t+1} = \nabla f(x^{t+1}, \xi^{t+1})$ as in EF21-SGD-ideal, we introduce a correction, which allows us to control the variance of the difference $\nabla f(x^{t+1}, \xi^{t+1}) - \nabla f(x^{t+1})$. This allows us to derive the following convergence result. Let $\delta^0 := f(x^0) - f^*$.

Proposition 1. Let Assumptions 1, 2 hold, and let $\mathcal{C}$ satisfy Definition 1. Let $g^0 = 0$ and the step-size in method (5a) + (6) be set as $\gamma \le 1/L$. Let $\hat{x}^T$ be sampled uniformly at random from the iterates of the method. Then for any $\eta > 0$, after $T$ iterations we have $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \frac{2\delta^0}{\gamma T} + \mathcal{O}\left(\eta^2\sigma^2\right)$.

Notice that if $\eta = 1$, then algorithm EF21-SGDM-ideal (5a) + (6) reduces to the EF21-SGD-ideal method (5a) + (5aa), and this result shows that the lower bound for the batch size established in Theorem 1 is tight, i.e., $B = \Theta(\sigma^2/\varepsilon^2)$ is necessary and sufficient¹¹ for convergence. For $\eta < 1$, the above proposition suggests that using a small enough parameter $\eta$, the variance term can be completely eliminated. This observation motivates us to design a practical variant of this method. Similarly to the design of the EF21 mechanism (from EF21-SGD-ideal), we propose to do this by replacing the exact gradients $\nabla f(x^{t+1})$ by the state vectors $v^t$ and $g^t$ as follows:

$$\text{EF21-SGDM:} \qquad v^{t+1} = v^t + \eta\left(\nabla f(x^{t+1}, \xi^{t+1}) - v^t\right), \qquad g^{t+1} = g^t + \mathcal{C}\left(v^{t+1} - g^t\right). \tag{7}$$

⁸This is because full gradients would need to be computed and communicated for its implementation. Notice also that if $\sigma = 0$, this method becomes exact distributed gradient descent.
⁹In fact, the example can be easily extended to the case of polynomially decaying step-size.
¹⁰In this case, we can drop the index $i$ everywhere and write $g_i^t = g^t$, $\xi_i^t = \xi^t$ for all $t \ge 0$.

Theorem 2. Let Assumptions 1, 2 hold, and let $\mathcal{C}$ satisfy Definition 1.
Let the method (5a) + (7) be run with $g^0 = v^0 = \nabla f(x^0)$, and let $\hat{x}^T$ be sampled uniformly at random from the iterates of the method. Then for all $\eta \in (0, 1]$, with $\gamma \le \gamma_0 = \frac{\min\{\alpha, \eta\}}{7L}$, we have

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{\delta^0}{\gamma T} + \eta\sigma^2\right).$$

The choice $\eta = \min\left\{1, \left(\frac{L\delta^0}{\sigma^2 T}\right)^{1/2}\right\}$, $\gamma = \gamma_0$ results in

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{L\delta^0}{\alpha T} + \left(\frac{L\delta^0\sigma^2}{T}\right)^{1/2}\right).$$

Compared to Proposition 1, where $\eta$ can be made arbitrarily small, Theorem 2 suggests that there is a trade-off in the choice of $\eta \in (0, 1]$ in algorithm (5a) + (7). The above theorem implies that in the single node setting, EF21-SGDM has an $\mathcal{O}\left(\frac{L\delta^0}{\alpha\varepsilon^2} + \frac{L\delta^0\sigma^2}{\varepsilon^4}\right)$ sample complexity. For $\alpha = 1$, this result matches the sample complexity of SGD and is known to be unimprovable under Assumptions 1 and 2 [Arjevani et al., 2019]. Moreover, when $\alpha = 1$, our sample complexity matches previous analyses of momentum methods in [Liu et al., 2020] and [Defazio, 2021]. However, even in this single node ($n = 1$), uncompressed ($\alpha = 1$) setting, our analysis is different from previous work; in particular, our choice of the momentum parameter and the Lyapunov function are different, see Appendices A and J. For $\alpha < 1$, the above result matches the sample complexity of EF14-SGD (in the single node setting) [Stich and Karimireddy, 2019], which was recently shown to be optimal [Huang et al., 2022] for biased compressors satisfying Definition 1. Notice, however, that the extension of the analysis by Stich and Karimireddy [2019] for EF14-SGD to the distributed setting meets additional challenges, and it is unclear whether it is possible without imposing additional BG or BGS assumptions as in [Koloskova et al., 2020]. We revisit this analysis in Appendix K to showcase the difficulty of removing BG/BGS. In the following, we demonstrate the benefit of our EF21-SGDM method by extending it to the distributed setting without imposing any additional assumptions.

3.3 Distributed stochastic error feedback with momentum

We are now ready to present a distributed variant of EF21-SGDM; see Algorithm 1. Letting $\delta^t := f(x^t) - f^*$, our convergence analysis of this method relies on the monotonicity of the following Lyapunov function:

$$\Lambda^t := \delta^t + \frac{\gamma}{\alpha n}\sum_{i=1}^{n}\left\|g_i^t - v_i^t\right\|^2 + \frac{\gamma\eta}{\alpha^2 n}\sum_{i=1}^{n}\left\|v_i^t - \nabla f_i(x^t)\right\|^2 + \gamma\left\|\frac{1}{n}\sum_{i=1}^{n}\left(v_i^t - \nabla f_i(x^t)\right)\right\|^2. \tag{8}$$

¹¹This follows by replacing $\sigma^2$ in the batch-free algorithm by $\sigma^2/B$ if a batch size $B > 1$ is used.

Algorithm 1 EF21-SGDM (Error Feedback 2021 Enhanced with Polyak Momentum)
1: Input: starting point $x^0$, step-size $\gamma > 0$, momentum $\eta \in (0, 1]$, initial batch size $B_{\text{init}}$
2: Initialize $v_i^0 = g_i^0 = \frac{1}{B_{\text{init}}}\sum_{j=1}^{B_{\text{init}}}\nabla f_i(x^0, \xi_{i,j}^0)$ for $i = 1, \ldots, n$; $g^0 = \frac{1}{n}\sum_{i=1}^{n} g_i^0$
3: for $t = 0, 1, 2, \ldots, T - 1$ do
4:   Master computes $x^{t+1} = x^t - \gamma g^t$ and broadcasts $x^{t+1}$ to all nodes
5:   for all nodes $i = 1, \ldots, n$ in parallel do
6:     Compute momentum estimator $v_i^{t+1} = (1 - \eta)v_i^t + \eta\nabla f_i(x^{t+1}, \xi_i^{t+1})$
7:     Compress $c_i^{t+1} = \mathcal{C}(v_i^{t+1} - g_i^t)$ and send $c_i^{t+1}$ to the master
8:     Update local state $g_i^{t+1} = g_i^t + c_i^{t+1}$
9:   end for
10:  Master computes $g^{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{t+1}$ via $g^{t+1} = g^t + \frac{1}{n}\sum_{i=1}^{n} c_i^{t+1}$
11: end for

Convergence of EF21-SGDM with contractive compressors. We obtain the following result:

Theorem 3. Let Assumptions 1 and 2 hold. Let $\hat{x}^T$ be sampled uniformly at random from the $T$ iterates of the method. Let EF21-SGDM (Algorithm 1) be run with a contractive compressor. For all $\eta \in (0, 1]$ and $B_{\text{init}} \ge 1$, with $\gamma \le \min\left\{\frac{\alpha}{20\widetilde{L}}, \frac{\eta}{7L}\right\}$, we have

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{\Lambda^0}{\gamma T} + \frac{\eta^3\sigma^2}{\alpha^2} + \frac{\eta^2\sigma^2}{\alpha} + \frac{\eta\sigma^2}{n}\right),$$

where $\Lambda^0$ is given by (8).
Choosing the batch size $B_{\text{init}} = \frac{\sigma^2}{L\delta^0}$, the step size $\gamma = \min\left\{\frac{\alpha}{20\widetilde{L}}, \frac{\eta}{7L}\right\}$, and the momentum parameter¹² $\eta = \min\left\{1, \left(\frac{L\delta^0\alpha^2}{\sigma^2 T}\right)^{1/4}, \left(\frac{L\delta^0\alpha}{\sigma^2 T}\right)^{1/3}, \left(\frac{L\delta^0 n}{\sigma^2 T}\right)^{1/2}, \frac{\alpha\sigma}{\sqrt{L\delta^0 B_{\text{init}}}}\right\}$, we get

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha T} + \left(\frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}T}\right)^{3/4} + \left(\frac{\widetilde{L}\delta^0\sigma}{\sqrt{\alpha}\,T}\right)^{2/3} + \left(\frac{L\delta^0\sigma^2}{nT}\right)^{1/2}\right).$$

Remark 1. Note that using a large initial batch size $B_{\text{init}} > 1$ is not necessary for the convergence of EF21-SGDM. If we set $B_{\text{init}} = 1$, the above theorem still holds with $\delta^0$ replaced by $\Lambda^0$.

Remark 2. In the single node setting ($n = 1$), the above result recovers the statement of Theorem 2 (with the same choice of parameters), since by Young's inequality $\left(\frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}T}\right)^{3/4} \le \frac{\widetilde{L}\delta^0}{\alpha T} + \left(\frac{\widetilde{L}\delta^0\sigma^2}{T}\right)^{1/2}$ and $\left(\frac{\widetilde{L}\delta^0\sigma}{\sqrt{\alpha}\,T}\right)^{2/3} \le \frac{\widetilde{L}\delta^0}{\alpha T} + \left(\frac{\widetilde{L}\delta^0\sigma^2}{T}\right)^{1/2}$, and $\widetilde{L} = L$.

Recovering previous rates in the case of full gradients. Compared to the iteration complexity $\mathcal{O}\left(\frac{L_{\max}G}{\alpha\varepsilon^3}\right)$ of EF14 [Koloskova et al., 2020], our result, summarized in

Corollary 1. If $\sigma = 0$, then $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right] \le \varepsilon$ after $T = \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha\varepsilon^2}\right)$ iterations.

is better by an order of magnitude, and does not require the BG assumption. The result of Corollary 1 is the same as for the EF21 method [Richtárik et al., 2021] and the EF21-HB method [Fatkhullin et al., 2021]. Notice, however, that even in this deterministic setting ($\sigma = 0$), the EF21-SGDM method is different from EF21 and EF21-HB: while the original EF21 does not use momentum, the EF21-HB method incorporates momentum on the server side to update $x^t$, which is different from our Algorithm 1, where momentum is applied by each node. This iteration complexity $\mathcal{O}\left(\frac{1}{\alpha\varepsilon^2}\right)$ is optimal in both $\alpha$ and $\varepsilon$. The matching lower bound was recently established by Huang et al. [2022] for smooth nonconvex optimization in the class of centralized, zero-respecting algorithms with contractive compressors.

Comparison to previous work. Our sample complexity¹³ in

Corollary 2. $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right] \le \varepsilon$ after $T = \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha\varepsilon^2} + \frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}\varepsilon^{8/3}} + \frac{\widetilde{L}\delta^0\sigma}{\alpha^{1/2}\varepsilon^3} + \frac{L\delta^0\sigma^2}{n\varepsilon^4}\right)$ iterations.

strictly improves over the complexity $\mathcal{O}\left(\frac{GL_{\max}}{\alpha\varepsilon^3} + \frac{L_{\max}\sigma^2}{n\varepsilon^4}\right)$ of EF14-SGD by Koloskova et al. [2020], even in the case when $G < +\infty$. Notice that it always holds that $\sigma \le G$. If we assume that $G \approx \sigma$, our first three terms improve the first term from Koloskova et al. [2020] by a factor of $\varepsilon/\sigma$, $(\varepsilon\alpha/\sigma)^{1/3}$, or $\alpha^{1/2}$, respectively. Compared to the BEER algorithm of Zhao et al. [2022], with sample complexity $\mathcal{O}\left(\frac{L_{\max}\sigma^2}{\alpha^2\varepsilon^4}\right)$, the result of Corollary 2 is strictly better in terms of $\alpha$, $n$, and the smoothness constants.¹⁴ In addition, we remove the large batch requirement for convergence compared to [Fatkhullin et al., 2021, Zhao et al., 2022]. Moreover, Corollary 2 implies that EF21-SGDM achieves the asymptotically optimal sample complexity $\mathcal{O}\left(\frac{L\sigma^2}{n\varepsilon^4}\right)$ in the regime $\varepsilon \to 0$.

¹²In Appendix J, we show how to deal with time-varying $\gamma_t$ and $\eta_t$.
¹³Note that the initial batch size contributes to the sample complexity only an additive constant independent of $\varepsilon$. Moreover, $B_{\text{init}} = \frac{\sigma^2}{L\delta^0} \le \frac{\sigma^2}{\varepsilon^2}$ since, otherwise, $\|\nabla f(x^0)\|^2 \le 2L\delta^0 \le \varepsilon^2$, and $x^0$ is a solution. In what follows, we ignore the dependence on $B_{\text{init}}$ for a fair comparison with other works.
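To complement the pseudocode of Algorithm 1, here is a minimal single-process NumPy simulation on a synthetic quadratic problem. The problem instance, noise model, and parameter values are our own illustrative assumptions; this is a sketch of the update structure, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K, sigma = 10, 100, 5, 1.0
A = [np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(n)]  # f_i(x) = 0.5 x^T A_i x

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def sgrad(i, x):  # unbiased stochastic gradient of f_i
    return A[i] @ x + sigma * rng.normal(size=d)

gamma, eta, T = 0.05, 0.1, 2000
x = rng.normal(size=d)
v = [sgrad(i, x) for i in range(n)]          # line 2 with B_init = 1
g = [vi.copy() for vi in v]
g_avg = np.mean(g, axis=0)

for t in range(T):
    x = x - gamma * g_avg                    # line 4 (master step)
    for i in range(n):                       # lines 5-9 (run in parallel on nodes)
        v[i] = (1 - eta) * v[i] + eta * sgrad(i, x)   # line 6: momentum estimator
        c = top_k(v[i] - g[i], K)                     # line 7: compress v_i - g_i
        g[i] += c                                     # line 8: local state update
        g_avg += c / n                                # line 10: master aggregates c_i

grad_norm = np.linalg.norm(np.mean([A[i] @ x for i in range(n)], axis=0))
print(f"||grad f(x_T)|| = {grad_norm:.4f}")
```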
3.4 Further improvement using double momentum!

Unfortunately, in the non-asymptotic regime, our sample complexity does not match the lower bound in all problem parameters simultaneously, due to the middle terms $\frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}\varepsilon^{8/3}} + \frac{\widetilde{L}\delta^0\sigma}{\alpha^{1/2}\varepsilon^3}$, which can potentially dominate the $\frac{L\delta^0\sigma^2}{n\varepsilon^4}$ term for large enough $n$ and $\varepsilon$, and small enough $\alpha$ and $\sigma$. We propose a double-momentum method, which can further improve the middle terms in the sample complexity of EF21-SGDM. We replace the momentum estimator $v_i^t$ in line 6 of Algorithm 1 by the following two-step momentum update:

$$\text{EF21-SGD2M:} \qquad v_i^{t+1} = (1 - \eta)v_i^t + \eta\nabla f_i(x^{t+1}, \xi_i^{t+1}), \qquad u_i^{t+1} = (1 - \eta)u_i^t + \eta v_i^{t+1}. \tag{10}$$

We formally present this method as Algorithm 3 in Appendix G. Compared to EF21-SGDM (Algorithm 1), the only change is that instead of compressing $v_i^t - g_i^t$, in EF21-SGD2M we compress $u_i^t - g_i^t$, where $u_i^t$ is a two-step (double) momentum estimator. The intuition behind this modification is that the double momentum estimator $u_i^t$ has a richer "memory" of the past gradients compared to $v_i^t$. Notice that for each node, EF21-SGD2M requires storing 3 vectors ($v_i^t$, $u_i^t$, $g_i^t$) instead of 2 in EF21-SGDM ($v_i^t$, $g_i^t$) and EF14-SGD ($e_i^t$, $g_i^t$).¹⁵ When interacting with a biased compression operator $\mathcal{C}(\cdot)$, this effect becomes crucial in improving the sample complexity. For EF21-SGD2M, we derive

Corollary 3. Let $v_i^t$ in Algorithm 1 be replaced by $u_i^t$ given by (10) (Algorithm 3 in Appendix G). Then, with an appropriate choice of $\gamma$ and $\eta$ (given in Theorem 5), we have $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right] \le \varepsilon$ after $T = \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha\varepsilon^2} + \frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}\varepsilon^{8/3}} + \frac{L\delta^0\sigma^2}{n\varepsilon^4}\right)$ iterations.

4 Experiments

We consider a nonconvex logistic regression problem:

$$f_i(x_1, \ldots, x_c) = -\frac{1}{m}\sum_{j=1}^{m}\log\left(\frac{\exp(a_{ij}^\top x_{y_{ij}})}{\sum_{y=1}^{c}\exp(a_{ij}^\top x_y)}\right)$$

with a nonconvex regularizer $h(x_1, \ldots, x_c) = \lambda\sum_{y=1}^{c}\sum_{k=1}^{l}\frac{[x_y]_k^2}{1 + [x_y]_k^2}$ with $\lambda = 10^{-3}$, where $x_1, \ldots, x_c \in \mathbb{R}^l$, $[\cdot]_k$ is an indexing operation on a vector, $c \ge 2$ is the number of classes, $l$ is the number of features, $m$ is the size of a dataset, and $a_{ij} \in \mathbb{R}^l$ and $y_{ij} \in \{1, \ldots, c\}$ are features and labels. The datasets used are MNIST (with $l = 784$, $m = 60\,000$, $c = 10$) and real-sim (with $l = 20\,958$, $m = 72\,309$, $c = 2$) [LeCun et al., 2010, Chang and Lin, 2011]. The dimension of the problem is $d = (l + 1)c$, i.e., $d = 7\,850$ for MNIST and $d = 41\,918$ for real-sim. In each experiment, we show the relation between the total number of transmitted coordinates and gradient/function values. The stochastic gradients in each algorithm are replaced by a mini-batch estimator $\frac{1}{B}\sum_{j=1}^{B}\nabla f_i(x, \xi_{ij})$ with the same $B \ge 1$ in each plot. Notice that all methods (except for NEOLITHIC)¹⁶ compute the same number of samples at each communication round; thus the dependence on the number of samples used is qualitatively the same. In all algorithms, the step sizes are fine-tuned from the set $\{2^k \mid k \in [-20, 20]\}$, and the Top$K$ compressor is used to compress information from the nodes to the master. For EF21-SGDM, we fix the momentum parameter $\eta = 0.1$ in all experiments. Prior to that, we tuned $\eta \in \{0.01, 0.1\}$ on the independent dataset w8a (with $l = 300$, $m = 49\,749$, $c = 2$). We omit the BEER method from the plots since it showed worse performance than EF21-SGD in all runs.

¹⁴$L_{\max} := \max_{i \in [n]} L_i$. Notice that $L \le \widetilde{L} \le L_{\max}$, and the inequalities are strict in the heterogeneous setting.
¹⁵See Appendix K for details on EF14-SGD. In contrast, EF21-SGD needs to store only one vector ($g_i^t$).
¹⁶For NEOLITHIC, we use the parameter $R = d/K$ following the requirement in their Theorem 3. The experiments in Huang et al. [2022] use a heuristic choice $R = 4$, and thus can show faster convergence.

4.1 Experiment 1: increasing batch size

In this experiment, we use the MNIST dataset, fix the number of transmitted coordinates to $K = 10$ (thus $\alpha \approx K/d \approx 10^{-3}$), and set $n = 10$.
Figure 2 shows convergence plots for $B \in \{1, 32, 128\}$. EF21-SGDM and its double momentum version EF21-SGD2M converge fast and show a significant improvement when increasing the batch size compared to EF14-SGD. In contrast, EF21-SGD suffers from poor performance for small $B$, which confirms our observations in the previous sections. NEOLITHIC converges significantly more slowly due to the fact that it sends $d/K$ compressed vectors in each iteration, while the other methods send only one.

Figure 2: Experiment on the MNIST dataset with $n = 10$ and the Top10 compressor; panels (a)-(c) show $\|\nabla f(x^t)\|^2$ against the number of transmitted bits per node for $B = 1$, $B = 32$, and $B = 128$, with step sizes fine-tuned per method (NEOLITHIC, EF14-SGD, EF21-SGD, EF21-SGD2M, EF21-SGDM).

4.2 Experiment 2: improving convergence with $n$

This experiment uses the real-sim dataset, $K = 100$ (thus $\alpha \approx K/d \approx 2 \cdot 10^{-3}$), and $B = 128 \ll m$. We vary the number of nodes within $n \in \{1, 10, 100\}$; see Figure 3. In this case, EF21-SGDM and EF21-SGD2M converge much faster than the other methods for all $n$. Moreover, the proposed algorithms show a significant improvement when $n$ increases. We also observe that on this task, EF21-SGD2M performs slightly worse than EF21-SGDM for $n = 10, 100$, but it is still much faster than the other methods.

Figure 3: Experiment on the real-sim dataset with batch size $B = 128$ and the Top100 compressor; panels (a)-(c) correspond to $n = 1$, $n = 10$, and $n = 100$, with step sizes fine-tuned per method (EF14-SGD, EF21-SGD, EF21-SGD2M, EF21-SGDM).

In Appendix C, we present extra simulations with different parameters for the above experiments. Additionally, we include experiments on simple quadratic problems and train larger image recognition models. In all cases, EF21-SGDM and EF21-SGD2M outperform the other algorithms.

Acknowledgments and Disclosure of Funding

The work of I. Fatkhullin was supported by an ETH AI Center doctoral fellowship. The work of P. Richtárik and A. Tyurin was supported by the KAUST Baseline Research Scheme (KAUST BRF) and the KAUST Extreme Computing Research Center (KAUST ECRC); P. Richtárik was also supported by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).

References

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, 2018.
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.
arXiv preprint arXiv:1912.02365, 2019.
Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.
Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for minimizing compositions of convex functions and random functions and its applications in network revenue management. arXiv preprint arXiv:2205.01774, 2022.
Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In International Conference on Machine Learning, pages 2260-2268. PMLR, 2020.
Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. In Advances in Neural Information Processing Systems, pages 4883-4895, 2021.
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, 2019.
Aaron Defazio. Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization. arXiv preprint arXiv:2010.00406, 2021.
Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1626-1635. PMLR, 2019.
Ron Dorfman, Shay Vargaftik, Yaniv Ben-Itzhak, and Kfir Y. Levy. DoCoFL: Downlink compression for cross-device federated learning. arXiv preprint arXiv:2302.00543, 2023.
Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, and Peter Richtárik. EF21 with bells & whistles: Practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294, 2021.
Juan Gao, Xin-Wei Liu, Yu-Hong Dai, Yakui Huang, and Junhua Gu. Distributed stochastic gradient tracking methods with momentum acceleration for non-convex optimization. Computational Optimization and Applications, 84(2):531-572, 2023.
Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pages 310-315. IEEE, 2015.
Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging error compensated SGD. In 34th Conference on Neural Information Processing Systems, 2020.
Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster non-convex distributed learning with compression. In International Conference on Machine Learning, pages 3788-3798. PMLR, 2021.
Kaja Gruntkowska, Alexander Tyurin, and Peter Richtárik. EF21-P and friends: Improved theoretical communication complexity for distributed optimization with bidirectional compression. arXiv preprint arXiv:2209.15218, 2022.
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737-1746, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019a.
Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019b.
Xinmeng Huang, Yiming Chen, Wotao Yin, and Kun Yuan. Lower bounds and nearly optimal algorithms in distributed learning with communication compression. In Advances in Neural Information Processing Systems, 2022.
Rustem Islamov, Xun Qian, Slavomír Hanzely, Mher Safaryan, and Peter Richtárik. Distributed Newton-type methods with communication compression and Bernoulli aggregation. arXiv preprint arXiv:2206.03588, 2022.
Samy Jelassi and Yuanzhi Li. Towards understanding how momentum improves generalization in deep learning. In International Conference on Machine Learning, pages 9965-10040. PMLR, 2022.
Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 18-37. PMLR, 2022.
Peter Kairouz et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Satyen Kale, Ayush Sekhari, and Karthik Sridharan. SGD: The role of implicit regularization, batch-size and multiple-epochs. arXiv preprint arXiv:2107.05074, 2021.
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, 2019.
Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2021.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linear speedup under Markovian sampling. In International Conference on Machine Learning, pages 10997-11057. PMLR, 2022.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698-2707. PMLR, 2018.
Anastasia Koloskova, Tao Lin, S. Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020.
Jakub Konečný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.
Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtárik, and Sebastian Stich. A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021), 2021.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, 2009.
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database.
ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive SGD with momentum. arXiv preprint arXiv:2007.14294, 2020.
Xiaoyu Li, Mingrui Liu, and Francesco Orabona. On the last iterate convergence of momentum methods. In International Conference on Algorithmic Learning Theory, pages 699-717. PMLR, 2022a.
Xiaoyun Li, Belhal Karimi, and Ping Li. On distributed adaptive optimization with gradient compression. In International Conference on Learning Representations, 2022b.
Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtárik. Acceleration for compressed gradient descent in distributed and federated optimization. In International Conference on Machine Learning, pages 5895-5904. PMLR, 2020.
Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pages 6286-6295. PMLR, 2021.
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
Yanli Liu, Yuan Gao, and Wotao Yin. An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems, 2020.
Maksim Makarenko, Elnur Gasanov, Rustem Islamov, Abdurakhmon Sadiev, and Peter Richtárik. Adaptive compression for communication-efficient distributed training. arXiv preprint arXiv:2211.00188, 2022.
Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In International Conference on Machine Learning, pages 15750-15769. PMLR, 2022.
Aritra Mitra, George J. Pappas, and Hamed Hassani. Temporal difference learning with compressed updates: Error-feedback meets reinforcement learning. arXiv preprint arXiv:2301.00944, 2023.
Constantin Philippenko and Aymeric Dieuleveut. Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: Tight convergence guarantees. arXiv preprint arXiv:2006.14591, 2020.
Maximilian Plattner. On SGD with momentum. Master's thesis, 2022.
Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
Xun Qian, Peter Richtárik, and Tong Zhang. Error compensated distributed SGD can be accelerated. arXiv preprint arXiv:2010.00091, 2020.
Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. In Advances in Neural Information Processing Systems, 2021.
Peter Richtárik, Igor Sokolov, Ilyas Fatkhullin, Elnur Gasanov, Zhize Li, and Eduard Gorbunov. 3PC: Three point compressors for communication-efficient distributed training and a better theory for lazy aggregation. arXiv preprint arXiv:2202.00998, 2022.
Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletarì, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso.
The future of digital health with federated learning. npj Digital Medicine, 3(11):1-7, 2020.
Mher Safaryan, Egor Shulgin, and Peter Richtárik. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 2021.
Mher Safaryan, Rustem Islamov, Xun Qian, and Peter Richtárik. FedNL: Making Newton-type methods applicable to federated learning. In International Conference on Machine Learning, 2022.
Atal Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, and Panos Kalnis. Rethinking gradient sparsification as total error minimization. In Advances in Neural Information Processing Systems, pages 8133-8146, 2021.
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, and Masoud Moshref. Scaling distributed machine learning with in-network aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation, pages 785-808, 2021.
Othmane Sebbouh, Charles Dossal, and Aude Rondepierre. Nesterov's acceleration and Polyak's heavy ball method in continuous time: Convergence rate analysis under geometric conditions and perturbations. arXiv preprint arXiv:1907.02710, 2019.
Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory, pages 3935-3971. PMLR, 2021.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
Navjot Singh, Deepesh Data, Jemin George, and Suhas Diggavi. SQuARM-SGD: Communication-efficient momentum SGD for decentralized optimization. IEEE Journal on Selected Areas in Information Theory, 2(3):954-969, 2021.
Sebastian U. Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
Sebastian U. Stich, J.-B. Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, 2018.
Rafał Szlendak, Alexander Tyurin, and Peter Richtárik. Permutation compressors for provably faster distributed nonconvex optimization. arXiv preprint arXiv:2110.03300, 2021.
Yuki Takezawa, Han Bao, Kenta Niwa, Ryoma Sato, and Makoto Yamada. Momentum tracking: Momentum acceleration for decentralized deep learning on heterogeneous data. arXiv preprint arXiv:2209.15505, 2022.
Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, and Ji Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, 2020.
Alexander Tyurin and Peter Richtárik. DASHA: Distributed nonconvex optimization with communication compression, optimal oracle complexity, and no client synchronization. arXiv preprint arXiv:2202.01268, 2022.
Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.
Jun-Kun Wang and Jacob Abernethy.
Quickly finding a benign region via heavy ball momentum in non-convex optimization. arXiv preprint arXiv:2010.01449, 2021.
Yujia Wang, Lu Lin, and Jinghui Chen. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pages 6292-6320. PMLR, 2022.
Tiannan Xiao and Guoguo Yang. A convergence study of SGD-type methods for stochastic optimization. arXiv preprint arXiv:2211.06197, 2022.
Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, and Haibin Lin. CSER: Communication-efficient SGD with error reset. In Advances in Neural Information Processing Systems, pages 12593-12603, 2020.
An Xu and Heng Huang. Detached error feedback for distributed SGD with random sparsification. In International Conference on Machine Learning, pages 24550-24575. PMLR, 2022.
Junchi Yang, Xiang Li, Ilyas Fatkhullin, and Niao He. Two sides of one coin: The limits of untuned SGD and the power of adaptive methods. In Advances in Neural Information Processing Systems, 2023.
Tianbao Yang, Qihang Lin, and Zhe Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.
Chung-Yiu Yau and Hoi-To Wai. DoCoM-SGT: Doubly compressed momentum-assisted stochastic gradient tracking algorithm for communication efficient decentralized learning. arXiv preprint arXiv:2202.00255, 2022.
Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184-7193. PMLR, 2019.
S.K. Zavriev and F.V. Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336-341, 1993.
Haoyu Zhao, Boyue Li, Zhize Li, Peter Richtárik, and Yuejie Chi. BEER: Fast O(1/T) rate for decentralized nonconvex optimization with communication compression. In Advances in Neural Information Processing Systems, 2022.
Shuai Zheng, Ziyue Huang, and James Kwok. Communication-efficient distributed blockwise momentum SGD with error-feedback. In Advances in Neural Information Processing Systems, 2019.

Contents

1 Introduction
2 Communication Compression, Error Feedback, and Sample Complexity
  2.1 Brief history of error feedback
  2.2 Key issue: error feedback has an unhealthy appetite for samples!
  2.3 Mysterious effectiveness of momentum in nonconvex optimization
3 Main Results
  3.1 A deeper dive into the issues EF21 has with stochastic gradients
  3.2 Momentum for avoiding mega-batches
  3.3 Distributed stochastic error feedback with momentum
  3.4 Further improvement using double momentum!
4 Experiments
  4.1 Experiment 1: increasing batch size
  4.2 Experiment 2: improving convergence with n
A More on Contractive Compressors, Error Feedback and Momentum
B Variance Reduction Effect of SGDM and Comparison to STORM
C Additional Experiments and Details of Experimental Setup
  C.1 Extra plots for experiments 1 and 2
  C.2 Experiment 3: stochastic quadratic optimization
22 C.3 Experiment 4: training neural network . . . . . . . . . . . . . . . . . . . . . . . . 24 D Descent Lemma 25 E EF21-SGDM-ideal (Proof of Theorem 1 and Proposition 1) 25 F EF21-SGDM (Proof of Theorems 2 and 3) 28 F.1 Controlling the error of momentum estimator . . . . . . . . . . . . . . . . . . . . 30 F.2 Controlling the error of contractive compression and momentum estimator . . . . . 31 G Further Improvement Using Double Momentum (Proof of Corollary 3) 33 G.1 Controlling the error of second momentum estimator . . . . . . . . . . . . . . . . 36 G.2 Controlling the error of contractive compression and double momentum estimator . 38 H EF21-SGDM with Absolute Compressor 40 H.1 Controlling the error of absolute compression . . . . . . . . . . . . . . . . . . . . 42 I EF21-STORM/MVR 43 I.1 Controlling the variance of STORM/MVR estimator . . . . . . . . . . . . . . . . 46 I.2 Controlling the variance of contractive compression and STORM/MVR estimator . 47 J Simplified Proof of SGDM: Time Varying Parameters and No Tuning for Momentum Sequence 49 K Revisiting EF14-SGD Analysis under BG and BGS Assumptions 51 Method Comm. compl. Batch-size for comm. compl. Asymp. sample compl. Batch free? No extra assump.? EF14-SGD [Koloskova et al., 2020] NEOLITHIC [Huang et al., 2022] π›Όπœ€2 log ( 𝐺 πœ€ ) (*) 𝐿max𝜎2 EF21-SGD [Fatkhullin et al., 2021] BEER [Zhao et al., 2022] EF21-SGDM Corollary 2 Corollary 3 (a) Analysis requires a bound of the second moment of the stochastic gradients, i.e., E [ 𝑓𝑖(π‘₯, πœ‰π‘–) 2] 𝐺2 for all π‘₯ R𝑑. (b) This complexity is achieved by using a large mini-batch and communicating 𝐾/𝛼 𝑑coordinates per iteration. (c) Analysis requires a bounded gradient disimilarity assumption, i.e., 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯) 𝑓(π‘₯) 2 𝐺2 for all π‘₯ R𝑑. (d) Analysis requires a batch-size at least 𝐡 𝜎2 𝛼2πœ€2 for EF21-SGD and 𝐡 𝜎2 π›Όπœ€2 for BEER. (*) For a fair comparison, we take the (minimal) batch-size for these methods which guarantees the reported communication complexity. Table 2: Extended summary of related works on distributed error compensated SGD methods using a Top𝐾 compressor under Assumptions 1 and 2. The goal is to find an πœ€-stationary point of a smooth nonconvex function of the form (1), i.e., a point π‘₯such that E [ 𝑓(π‘₯) ] πœ€. "Comm. compl." reports the total number of communicated bits if the method is applied with batch-size equal to "Batch-size" at each node. When Top𝐾 compressor is applied, then 𝛼 𝐾/𝑑, and the comm. compl. of error compensated methods can be reduced by a factor of 𝛼𝑑/𝐾. "Batch-size for comm. compl." means the batch-size for achieving the reported "Comm. compl.". "Asymp. sample compl." reports the asymptotic sample complexity of the algorithm with batch-size 𝐡= 1 in the regime πœ€ 0, i.e., the total number of samples required at each node to achieve πœ€-stationary point. "Batch free" marks with if the analysis ensures convergence with batch-size equal to 1. "No extra assump." marks with if no additional assumption beyond Assumptions 1 and 2 is required for analysis. We denote 𝐿max := max𝑖 [𝑛] 𝐿𝑖. Notice that it always holds 𝐿 𝐿 𝐿max and these inequalities only become equalities in the homogeneous case. It could be that 𝛼𝐿/ 𝐿 1 making the batch-size of EF21-SGDM and EF21-SGD2M much smaller than those of EF21-SGD and BEER. Symbol denotes the maximum of two scalars. A More on Contractive Compressors, Error Feedback and Momentum Greedy vs uniform. 
In our work, we specifically focus on the class of contractive compressors satisfying Definition 1, which contains a greedy Top𝐾compressor as a special case. Note that Top𝐾is greedy in that it minimizes the error Top𝐾(π‘₯) π‘₯ 2 subject to the sparsity constraint π’ž(π‘₯) 0 𝐾, where 𝑒 0 counts the number of nonzero entries in 𝑒. In practice, greediness is almost always17 very useful, translating into excellent empirical performance, especially when compared to the performance of the Rand𝐾sparsifier. On the other hand, it appears to be very hard to formalize these practical gains theoretically18. In fact, while greedy compressors such as Top𝐾 outperform their randomized cousins such as Rand𝐾in practice, and often by huge margins [Lin et al., 2018], the theoretical picture is exactly reversed, and the theoretical communication complexity of gradient-type methods based on randomized compressors [Alistarh et al., 2017, Mishchenko et al., 2019, HorvΓ‘th et al., 2019b, Li et al., 2020, Gorbunov et al., 2021] is much better than of those based on greedy compressors [Koloskova et al., 2020, RichtΓ‘rik et al., 2021, Fatkhullin et al., 2021, RichtΓ‘rik et al., 2022]. The key reason behind this is the fact that popular randomized compressors such as Rand𝐾become unbiased mappings after appropriate scaling (e.g., E[ 𝑑 𝐾Rand𝐾(π‘₯)] π‘₯), and that the inherent randomness is typically drawn independently by all clients. This leads to several key simplifications in the analysis, and consequently, to theoretical gains over methods that do not compress, and over methods that compress greedily. Further improvements are possible when the randomness is correlated in an appropriate way [Szlendak et al., 2021]. Due to the superior empirical properties of greedy contractive compressors, and our desire to push this very potent line of work further, in this paper we work with the general class of compressors 17Greediness is not useful, for example, when π’Ÿπ‘–= π’Ÿπ‘—for all 𝑖,𝑗and when Top𝐾is applied to the full-batch gradient 𝑓𝑖(π‘₯) by each client. However, situations of this type arise rarely in practice. 18No theoretical results of this type exist for 𝑛> 1 . satisfying Definition 1, and do not invoke any additional restrictive assumptions. For example, we do not assume π’žcan be made unbiased after scaling. Error Feedback. The first theoretical analysis of EF14 was presented in the works of Stich et al. [2018], Alistarh et al. [2018] and further revisited in convex case in [Karimireddy et al., 2019, Beznosikov et al., 2020, Gorbunov et al., 2020, Qian et al., 2020] and analysis was extended to nonconvex setting in [Stich and Karimireddy, 2019]. Later, in nonconvex case, various extensions and combinations of EF14 with other optimization techniques were considered and analyzed, which include bidirectional compression [Tang et al., 2020], decentralized training [Koloskova et al., 2020, Singh et al., 2021], server level momentum [Xie et al., 2020], client level momentum [Zheng et al., 2019], combination with adaptive methods [Li et al., 2022b]. To our knowledge, the best sample complexity for finding a stationary point for this method (including its momentum and adaptive variants) in the distributed nonconvex setting is given by Koloskova et al. [2020], which is π’ͺ( 𝐺 π‘›πœ€4 ). More recently, Huang et al. [2022] propose a modification of EF14 method achieving π’ͺ ( 1 π›Όπœ€2 log( 𝐺 π‘›πœ€4 ) sample complexity by using the BGS assumption. 
When applied with Top𝐾 compressor, this method requires to communicate π’ͺ(𝐾/𝛼) coordinates at every iteration. This makes it impractical since when the effective 𝛼is unknown and is set to 𝛼= 𝐾/𝑑, it means that their method communicates all 𝑑coordinates at every iteration, mimicking vanilla (S)GD method. Moreover, their algorithm uses an additional subroutine and applies updates with a large batch-size of samples of order π’ͺ( 1 πœ€ ) ), making the algorithm less practical and difficult to implement. It is worth to mention, that error feedback was also analyzed for other classes of compressors such as absolute (see Definition 2) or additive compressors (i.e., π’ž(π‘₯+ 𝑦) = π’ž(π‘₯) + π’ž(𝑦) for all π‘₯, 𝑦 R𝑑) [Tang et al., 2020, Xu and Huang, 2022], which do not include Top𝐾sparsifier. Momentum. The first convergence analysis of gradient descent with momentum was proposed by B.T. Polyak in his seminal work [Polyak, 1964] studying the benefit of multi-step methods. The proof technique proposed in this work is based on the analysis of the spectral norm of a certain matrix arising from the dynamics of a multi-step process on a quadratic function. Unfortunately, such technique is restricted to the case of strongly convex quadratic objective and the setting of full gradients. Later Zavriev and Kostyuk [1993] prove an asymptotic convergence of this method in nonconvex deterministic case without specifying the rate of convergence. To our knowledge, the first non-asymptotic analysis of SGDM in the smooth nonconvex setting is due to Yu et al. [2019]. Their analysis, however, heavily relies on BG assumption. Later, Liu et al. [2020] provide a refined analysis of SGDM, removing the BG assumption and improving the dependence on the momentum parameter πœ‚. Notice that the analysis of Liu et al. [2020] and the majority of other works relies on some variant of the following Lyapunov function: Λ𝑑:= 𝑓(𝑧𝑑) 𝑓* + 𝜏=0 π‘πœ π‘₯𝑑 𝜏 π‘₯𝑑 1 𝜏 2 , (11) where {𝑧𝑑}𝑑 0 is some auxiliary sequence (often) different from the iterates {π‘₯𝑑}𝑑 0, and {π‘πœ}𝜏 0 is a diminishing non-negative sequence. This approach is motivated by the dynamical system point of view at Polyak s heavy ball momentum, where the two terms in (11) are interpreted as the potential and kinetic energy of the system [Sebbouh et al., 2019]. In contrast, the Lyapunov function used in this work is conceptually different even in the single node (𝑛= 1), uncompressed (𝛼= 1) setting. Later, Defazio [2021] revisit the analysis in [Liu et al., 2020] through the lens of primal averaging and provide insights on why momentum helps in practice. The momentum is also used for stabilizing adaptive algorithms such as normalized SGD [Cutkosky and Mehta, 2020]. In particular, it was shown that by using momentum, one can ensure convergence without large batches for normalized SGD (while keeping the same sample complexity as a large batch normalized SGD). However, their analysis is specific to the normalized method, which allows using the function value as a Lyapunov function. High probability analysis of momentum methods was investigated in [Cutkosky and Mehta, 2021, Li and Orabona, 2020]. In the distributed setting, [Yu et al., 2019, Karimireddy et al., 2021] extend the analysis of SGDM under BGS assumption. Later [Takezawa et al., 2022, Gao et al., 2023] remove this assumption providing a refined analysis based on the techniques developed in [Liu et al., 2020]. 
However, the algorithms in these works do not apply any bandwidth reduction technique such as communication compression. We would like to mention that understanding the behavior of SGDM in convex case also remains an active area of research [Ghadimi et al., 2015, Yang et al., 2016, Sebbouh et al., 2021, Li et al., 2022a, Xiao and Yang, 2022] . B Variance Reduction Effect of SGDM and Comparison to STORM Notice that the choice of our Lyapunov function Λ𝑑(8), which is used in the analysis of EF21-SGDM implies that the gradient estimators 𝑔𝑑 𝑖and 𝑣𝑑 𝑖improve over the iterations, i.e., 𝑔𝑑 𝑖 𝑓𝑖(π‘₯𝑑), 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) for 𝑑 . This comes in contrast with the behavior of SGD, for which the gradient estimator 𝑣𝑑 𝑖= 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) does not necessarily tend to zero over iterations. Such effect of asymptotic improvement of the estimation error of the gradient estimator is reminiscent to the analogous effect known in the literature on variance reduction (VR) methods. In particular, the classical momentum step 6 of Algorithm 1 may be contrasted with a STORM variance reduced estimator proposed by Cutkosky and Orabona [2019], which updates the gradient estimator via 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )), 𝑀0 𝑖= 𝑓𝑖(π‘₯0, πœ‰0 𝑖) (12) It is known that the class of VR methods (and STORM, in particular) can show faster asymptotic convergence in terms of 𝑇(or πœ€) compared to SGD and SGDM under some additional assumptions. However, we would like to point out the important differences (and limitations) of (12) compared to the classical Polyak s momentum used on line 6 of Algorithm 1. First, the estimator 𝑀𝑑+1 𝑖 is different from the momentum update rule 𝑣𝑑+1 𝑖 in that it is unbiased for any 𝑑 0, i.e., E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) ] = 0,19 which greatly facilitates the analysis of this method. Notice that, in particular, in the deterministic case (𝜎= 0, 𝛼= 1), the method with estimator (12) reduces to vanilla gradient descent with 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1). Second, the computation of 𝑀𝑑+1 𝑖 requires access to two stochastic gradients 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) and 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) under the same realization of noise πœ‰π‘‘+1 𝑖 at each iteration, and requires the additional storage of control variate π‘₯𝑑. This is a serious limitation, which can make the method impractical or even not implementable for certain applications such as federated RL [Mitra et al., 2023], multi-agent RL [Doan et al., 2019] or operations research problems [Chen et al., 2022]. Third, the analysis of variance reduced methods such as STORM requires an additional assumptions such as individual smoothness of stochastic functions (or its averaged variants) (Assumption 3), i.e., 𝑓𝑖(π‘₯,πœ‰π‘–) 𝑓𝑖(𝑦, πœ‰π‘–) ℓ𝑖 π‘₯ 𝑦 for all π‘₯,𝑦 R𝑑, πœ‰π‘– π’Ÿπ‘–, 𝑖 [𝑛], while our EF21-SGDM only needs smoothness of (deterministic) local functions 𝑓𝑖(π‘₯). While this assumption is satisfied for some loss functions in supervised learning, it can also be very limiting. Even if Assumption 3 is satisfied, the constant β„“(which always satisfies β„“ 𝐿) can be much larger than 𝐿canceling the speed-up in terms of 𝑇(or πœ€). For completeness, we provide the sample complexity analysis of our error compensated method combined with estimator (12), which is deferred to Appendix I. 19Notice that E [ 𝑀0 𝑖 𝑓𝑖(π‘₯0) ] = 0. Let E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) ] = 0 hold, then E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) ] = (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) ] = 0. 
0 2000 4000 6000 8000 10000 Iteration, t Objective value EF21-SGD-ideal EF21-SGD EF21-SGDM (a) Divergence in single node setting, 𝑛= 1. 0 2000 4000 6000 8000 10000 Iteration, t Objective value EF21-SGD (n = 50) EF21-SGD (n = 500) EF21-SGD (n = 5000) EF21-SGD (n = 10000) (b) No improvement with 𝑛. Figure 4: Divergence of EF21-SGD on a quadratic function 1 2 π‘₯ 2 with Top1 compressor. See the proof of Therem 1 for details on the construction of noise πœ‰, we use 𝜎= 1, 𝐡= 1. The starting point for all methods is π‘₯0 = (0, 0.01) . Unlike Figure 1, these experiments use time varying step-sizes and momentum parameters 𝛾𝑑= πœ‚π‘‘= 0.1 𝑑+1. Each method is run 10 times and the plot shows the median performance alongside the 25% and 75% quantiles. C Additional Experiments and Details of Experimental Setup Divergence of EF21-SGD with time-varying step-sizes. We complement our Figure 1 in the main part of the paper, which shows divergence of EF21-SGD [Fatkhullin et al., 2021] with small (constant) step-size. Here, in Figure 4, we see that the similar divergence is observed when using time varying step-sizes 𝛾𝑑= 0.1 𝑑+1. Also, EF21-SGD with time-varying step-size does not improve convergence when 𝑛is increased. Implementation Details. The experiments were implemented in Python 3.7.9. The distributed environment was emulated on machines with Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. In all experiments with MNIST, we split the dataset across nodes by labels to simulate the heterogeneous setting. C.1 Extra plots for experiments 1 and 2 In Figures 5 and 6, we provide extra experiments for the setup from Section 4. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e6 || f(xt)||2 EF14-SGD: Step size: 0.5 EF21-SGD: Step size: 0.03125 EF21-SGDM: Step size: 0.03125 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e6 || f(xt)||2 EF14-SGD: Step size: 1.0 EF21-SGD: Step size: 0.015625 EF21-SGDM: Step size: 0.125 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e6 || f(xt)||2 EF14-SGD: Step size: 1.0 EF21-SGD: Step size: 0.015625 EF21-SGDM: Step size: 0.25 (c) 𝐡= 128 Figure 5: Performance of algorithms on MNIST dataset with 𝑛= 100, and Top10 compressor. C.2 Experiment 3: stochastic quadratic optimization We now consider a synthetic πœ† strongly convex quadratic function problem 𝑓(π‘₯) = 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯), where the functions 𝑓𝑖(π‘₯) = 1 2π‘₯ Q𝑖π‘₯ π‘₯ 𝑏𝑖are (not necessarily convex) quadratic functions for all 𝑖 [𝑛] and π‘₯ R𝑑. The matrices Q1, , Q𝑛, vectors 𝑏1, , 𝑏𝑛, and a starting point π‘₯0 are generated by Algorithm 2 with the number of nodes 𝑛= 100, dimension 𝑑= 1000, regularizer πœ†= 0.01, and scale 𝑠= 1. For all 𝑖 [𝑛] and π‘₯ R𝑑, we consider stochastic gradients 𝑓𝑖(π‘₯, πœ‰) = 𝑓𝑖(π‘₯) + πœ‰π‘–, where πœ‰π‘–are i.i.d. samples from 𝒩(0, 𝜎) with 𝜎 {0.001, 0.01}. In 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e7 EF14-SGD: Step size: 1.0 EF21-SGD: Step size: 2.0 EF21-SGDM: Step size: 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e7 EF14-SGD: Step size: 8.0 EF21-SGD: Step size: 8.0 EF21-SGDM: Step size: 8.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e7 EF14-SGD: Step size: 64.0 EF21-SGD: Step size: 64.0 EF21-SGDM: Step size: 128.0 (c) 𝑛= 100 Figure 6: Performance of algorithms on real-sim dataset with batch-size 𝐡= 1, and Top100 compressor. Figure 7, we present the comparison of EF21-SGDM and EF14-SGD with three different step sizes. The behavior of methods for other step sizes from the set {2π‘˜| π‘˜ [ 20, 20]} follows a similar trend. 
For every step size, we observe that at the beginning, the methods have almost the same linear rates, but then EF14-SGD gets stuck at high accuracies, while EF21-SGDM continues converging to the lower accuracies. 0 100000 200000 300000 400000 500000 #bits / n EF14-SGD: Step size: 0.0625 EF14-SGD: Step size: 0.125 EF14-SGD: Step size: 0.25 EF21-SGDM: Step size: 0.0625 EF21-SGDM: Step size: 0.125 EF21-SGDM: Step size: 0.25 0 100000 200000 300000 400000 500000 #bits / n EF14-SGD: Step size: 0.0625 EF14-SGD: Step size: 0.125 EF14-SGD: Step size: 0.25 EF21-SGDM: Step size: 0.0625 EF21-SGDM: Step size: 0.125 EF21-SGDM: Step size: 0.25 Figure 7: Stochastic Quadratic Optimization Problem with 𝜎= 0.001 (left figure) and 𝜎= 0.01 (right figure) Algorithm 2 Quadratic Optimization Task Generation Procedure 1: Parameters: number nodes 𝑛, dimension 𝑑, regularizer πœ†, and scale 𝑠. 2: for 𝑖= 1, . . . , 𝑛do 3: Calculate Guassian noises πœ‡π‘  𝑖= 1 + π‘ πœ‰π‘  𝑖and πœ‡π‘ 𝑖= π‘ πœ‰π‘ 𝑖, i.i.d. πœ‰π‘  𝑖, πœ‰π‘ 𝑖 𝒩(0, 1) 4: 𝑏𝑖= πœ‡π‘  𝑖 4 ( 1 + πœ‡π‘ 𝑖, 0, , 0) R𝑑 5: Scale the predefined tridiagonal matrix 1 ... ... ... ... 1 0 1 2 6: end for 7: Find the mean of matrices Q = 1 𝑛 𝑛 𝑖=1 Q𝑖 8: Find the minimum eigenvalue πœ†min(Q) 9: for 𝑖= 1, . . . , 𝑛do 10: Normalize matrix Q𝑖= Q𝑖+ (πœ† πœ†min(Q))I 11: end for 12: Find a starting point π‘₯0 = ( 𝑑, 0, , 0) 13: Output a new problem: matrices Q1, , Q𝑛, vectors 𝑏1, , 𝑏𝑛, starting point π‘₯0 A procedure to generate stochastic quadratic optimization problems. In this section, we present an algorithm that generates quadratic optimization tasks. The formal description is provided in Algorithm 2. The idea is to take a predefined tridiagonal matrix and add noises to simulate the heterogeneous setting. Algorithm 2 returns matrices Q1, , Q𝑛, vectors 𝑏1, , 𝑏𝑛, and a starting 0.0 0.5 1.0 1.5 2.0 #bits / n 1e12 f(xk) f(x * ) Number of nodes: 5 Vanilla SGD (no compression): Step size: 0.05 EF21-SGDM: Step size: 0.05 EF21-SGD: Step size: 0.05 EF14-SGD: Step size: 0.05 0.0 0.5 1.0 1.5 2.0 2.5 #bits / n 1e12 f(xk) f(x * ) Number of nodes: 5 Vanilla SGD (no compression): Step size: 0.05 EF21-SGDM: Step size: 0.05 EF21-SGD: Step size: 0.05 EF14-SGD: Step size: 0.05 (b) 𝐡= 25 Figure 8: Res Net-18 on CIFAR10 dataset with 𝑛= 5. Algorithm Test Accuracy SGD 81.5 % EF21-SGD 82.5 % EF14-SGD 83.1 % EF21-SGDM 83.3 % Figure 9: Accuracy on the CIFAR10 test split. point π‘₯0 such that the matrix Q = 1 𝑛 𝑛 𝑖=1 Q𝑖has the minimum eigenvalue πœ†min(Q) = πœ†, where πœ† 0 is a parameter. Next, we define the functions 𝑓𝑖and stochastic gradients in the following way: 2π‘₯ Q𝑖π‘₯ π‘₯ 𝑏𝑖 𝑓𝑖(π‘₯, πœ‰) := 𝑓𝑖(π‘₯) + πœ‰π‘–, for all π‘₯ R𝑑and 𝑖 [𝑛]. The noises πœ‰π‘–are i.i.d. samples from 𝒩(0, 𝜎), where 𝜎is a parameter. C.3 Experiment 4: training neural network We test algorithms on an image recognition task, CIFAR10 [Krizhevsky et al., 2009], with the Res Net18 [He et al., 2016] deep neural network (the number of parameters 𝑑 107). We split CIFAR10 among 5 nodes, and take 𝐾= 2 106 in Top𝐾. In all methods we finetune the step sizes. One can see that our findings in the low-scale experiments translate into large-scale experiments in Figure 8. With different batch sizes, EF21-SGD converges slower than EF21-SGDM and EF14-SGD, and our new method EF21-SGDM improves over EF14-SGD in Figure 8b. 
We checked the accuracies on the test dataset (see Table 9) and observed the same relations between algorithms (note that accuracies are far from the real SOTA because we turned off all augmentations and regularizations in training). D Descent Lemma Let us state the following lemma that is used in the analysis of nonconvex optimization methods. Lemma 1 ([Li et al., 2021]). Let the function 𝑓( ) be 𝐿-smooth and let π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑for some vector 𝑔𝑑 R𝑑and a step-size 𝛾> 0. Then we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 𝑓(π‘₯𝑑) 2 ( 1 ) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 . (13) E EF21-SGDM-ideal (Proof of Theorem 1 and Proposition 1) We now state a slighly more general result than Theorem 1, which holds for EF21-SGDM-ideal method with any πœ‚ (0, 1]. The statement of Theorem 1 follows by setting πœ‚= 1, since in that case EF21-SGDM-ideal coincides with EF21-SGD-ideal (5a), (5aa). Recall that EF21-SGDM-ideal (distributed variant) has the following update rule: π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑, 𝑔𝑑= 1 𝑖=1 𝑔𝑑 𝑖, (14) EF21-SGDM-ideal: 𝑣𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1) + πœ‚( 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 𝑓𝑖(π‘₯𝑑+1)), 𝑔𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1) + π’ž ( 𝑣𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) Theorem 4. Let 𝐿, 𝜎> 0, 0 < 𝛾 1/𝐿, 0 < πœ‚ 1 and 𝑛= 1. There exists a convex, 𝐿-smooth function 𝑓( ), a contractive compressor π’ž( ) satisfying Definition 1, and an unbiased stochastic gradient with bounded variance 𝜎2 such that if the method (14), (15) is run with a step-size 𝛾, then for all 𝑇 0 and for all π‘₯0 {(0, π‘₯0 (2)) R2 | π‘₯0 (2) < 0}, we have E [ 𝑓(π‘₯𝑇) 2] 1 60 min { πœ‚2𝜎2, 𝑓(π‘₯0) 2} . Fix 0 < πœ€ 𝐿/ 60 and π‘₯0 = (0, 1) . Additionally assume that 𝑛 1 and the variance of unbiased stochastic gradient is controlled by 𝜎2/𝐡for some 𝐡 1. If 𝐡< πœ‚2𝜎2 60πœ€2 , then we have E [ 𝑓(π‘₯𝑇) ] > πœ€for all 𝑇 0. Proof of Theorem 1. Part I. Consider 𝑓(π‘₯) = 𝐿 2 π‘₯ 2, π‘₯ R2. For each iteration 𝑑 0, let the random vector πœ‰π‘‘+1 be sampled uniformly at random from the set of vectors: 10 , 𝑧2 = ( 0 1 10 , 𝑧3 = ( 2 1 Define the stochastic gradient as 𝑓(π‘₯𝑑, πœ‰π‘‘) := 𝑓(π‘₯𝑑) + πœ‰π‘‘ = 𝐿π‘₯𝑑+ πœ‰π‘‘. Notice that E [ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), and E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] = 𝜎2. The update rule of method (14), (15) with such estimator is π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑= π‘₯𝑑 𝐿𝛾π‘₯𝑑 π›Ύπ’ž ( πœ‚πœ‰π‘‘) , where we choose π’ž( ) as a Top1 compressor. Notice that E [πœ‰π‘‘] = (0, 0) , but E [ π’ž(πœ‰π‘‘) ] = πœ‚ 10 (0, 1/3) = (0, 0) . By setting the initial iterate to π‘₯0 = (0, π‘₯0 (2)) for any π‘₯0 (2) < 0, we can derive E [ π‘₯𝑇] = (1 𝐿𝛾)𝑇π‘₯0 πœ‚ 𝑑=0 (1 𝐿𝛾)𝑑 = (1 𝐿𝛾)𝑇 ( 0 π‘₯0 (2) ) (1 (1 𝐿𝛾)𝑇) = ( 0 0 for any 0 𝛾 1/𝐿and any π‘₯0 (2) < 0. The inequality in (16) is because the first vector has strictly negative component π‘₯0 (2), and the second vector has non-positive second component when 𝛾> 0 and 𝜎2 > 0. Therefore, since 𝑓(π‘₯) 2 = 𝐿π‘₯ 2, we have E [ 𝑓(π‘₯𝑇) 2] = E [ 𝐿π‘₯𝑇 2] = E [ 𝐿π‘₯𝑇] 2 + E [ 𝐿π‘₯𝑇 E [ 𝐿π‘₯𝑇] 2] (1 𝐿𝛾)𝑇 𝐿π‘₯0 + πœ‚ 30(1 (1 𝐿𝛾)𝑇) (𝑖𝑖) (1 𝐿𝛾)2𝑇 𝑓(π‘₯0) 2 + πœ‚2𝜎2 30 (1 (1 𝐿𝛾)𝑇)2 𝑓(π‘₯0) 2 πœ‚2𝜎2 30 𝑓(π‘₯0) 2 + πœ‚2𝜎2 for all 𝑇 1, where in (𝑖) we used the form of vector E [ π‘₯𝑇] in (16), in (𝑖𝑖) we drop a nonnegative cross term, and use 𝑓(π‘₯0) = 𝐿π‘₯0. The last inequality follows by lower bounding a univariate quadratic function with respect to 𝑧:= (1 𝐿𝛾)𝑇for 0 𝑧 1, where optimal choice is 𝑧= πœ‚2𝜎2/(30 𝑓(π‘₯0) 2 + πœ‚2𝜎2). It is left to note that π‘₯𝑦 π‘₯+𝑦 1 2 min{π‘₯, 𝑦} for all π‘₯, 𝑦> 0. Part II. Fix 𝑛 1 and 𝐡 1. Let at each node 𝑖= 1, . . . 
, 𝑛, the random vectors πœ‰π‘‘ 𝑖be sampled independently and uniformly form the set of vectors: 10𝐡, 𝑧2 = ( 0 1 10𝐡, 𝑧3 = ( 2 1 Define a random matrix πœ‰π‘‘:= (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) . Then E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] = 𝜎2 𝐡. The update of the method (14), (15) on the same function instance will take the form π‘₯𝑑+1 = π‘₯𝑑 𝛾1 𝑖=1 𝑔𝑑 𝑖= π‘₯𝑑 𝐿𝛾π‘₯𝑑 𝛾1 𝑖=1 π’ž ( πœ‚πœ‰π‘‘ 𝑖 ) . Notice that in this case, we still have 𝑖=1 π’ž(πœ‚πœ‰π‘‘ 𝑖) 𝑖=1 E [ π’ž(πœ‚πœ‰π‘‘ 𝑖) ] = πœ‚ 10 (0, 1/3) = (0, 0) , which is independent (!) of 𝑛. Therefore, by similar derivations, we can conclude that E [ 𝑓(π‘₯𝑇) 2] 1 60 min { πœ‚2𝜎2 𝐡, 𝑓(π‘₯0) 2} > πœ€2, where we use that 𝐡< πœ‚2𝜎2 60πœ€2 , πœ€ 𝐿/ 60, and π‘₯0 = (0, 1) . Proof of Proposition 1. By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for 𝛾 1/𝐿we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 𝑓(π‘₯𝑑) 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 . (17) Now it remains to control the last term, which is due to the error introduced by a contractive compressor and stochastic gradients. We have E [ 𝑔𝑑 𝑓(π‘₯𝑑) 2] (𝑖) = E [ π’ž ( 𝑣𝑑 𝑓(π‘₯𝑑) ) 2] (𝑖𝑖) = E [ π’ž ( πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) ) ) 2] (𝑖𝑖𝑖) 2E [ π’ž ( πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) ) ) πœ‚( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑)) 2] +2πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] 2(2 𝛼)πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] where (𝑖) and (𝑖𝑖) use the update rule (6), (𝑖𝑖𝑖) holds by Young s inequality, and the last two steps hold by Definition 1 and Assumption 2. Subtracting 𝑓* from both sides of (17), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 F EF21-SGDM (Proof of Theorems 2 and 3) The statement of Theorem 2 follows directly from Theorem 3 and Remark 2. Let us prove Theorem 3. Proof of Theorem 3. In order to control the error between 𝑔𝑑and 𝑓(π‘₯𝑑), we decompose it into two terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 2 𝑔𝑑 𝑣𝑑 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 2 1 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 , and develop a recursion for each term above separately. Part I (a). Controlling the error of momentum estimator for each 𝑣𝑑 𝑖. Recall that by Lemma 2-(24), we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑣𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 3𝐿2 𝑖 πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2. (18) Averaging inequalities (18) over 𝑖= 1, . . . ,𝑛and denoting 𝑃𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] we have 𝑃𝑑+1 (1 πœ‚) 𝑃𝑑+ 3 𝐿2 Summing up the above inequality for 𝑑= 0, . . . , 𝑇 1, we derive 𝑑=0 𝑃𝑑 3 𝐿2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚π‘‡ 𝑃0. (19) Part I (b). Controlling the error of momentum estimator for 𝑣𝑑(on average). Similarly by Lemma 2-(25), we have for any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑣𝑑+1 𝑓(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] + 3𝐿2 πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2 where 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖is an auxiliary sequence. Summing up the above inequality for 𝑑= 0, . . . , 𝑇 1, and denoting 𝑃𝑑:= E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] , we derive 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 Part II. Controlling the error of contractive compressor and momentum estimator. By Lemma 3 we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑔𝑑+1 𝑖 𝑣𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] + 4πœ‚2 𝛼E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] 𝛼 E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2. (21) Averaging inequalities (21) over 𝑖= 1, . . . ,𝑛, denoting 𝑉𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] , and summing up the resulting inequality for 𝑑= 0, . . . , 𝑇 1, we obtain 𝑑=0 𝑃𝑑+ 8 𝐿2πœ‚2 𝑑=0 𝑅𝑑+ 2πœ‚2𝜎2 𝛼𝑇 𝑉0. (22) Part III. Combining steps I and II with descent lemma. 
By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (23) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾1 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 + 𝛾 𝑣𝑑 𝑓(π‘₯𝑑) 2 . Subtracting 𝑓* from both sides of (23), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2 1 𝑑=0 𝑉𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 16πœ‚2 𝑑=0 𝑃𝑑+ 2 1 𝑑=0 𝑃𝑑+ 4πœ‚2𝜎2 1 2 16𝛾2 𝐿2πœ‚2 (𝑖𝑖) 2𝛿0 𝛾𝑇+ 16πœ‚3𝜎2 1 2 16𝛾2 𝐿2πœ‚2 (𝑖𝑖𝑖) 2𝛿0 𝛾𝑇+ 16πœ‚3𝜎2 2𝛿0 𝛾𝑇+ 16πœ‚3𝜎2 where (𝑖) holds due to (22), (𝑖𝑖) utilizes (20), and (𝑖𝑖𝑖) follows by πœ‚ 1, and the last step holds due to the assumption on the step-size. We proved (9). We now find the particular values of parameters. Since 𝑔𝑖= 𝑣𝑖for all 𝑖 [𝑛], we have 𝑉0 = 0. Using 𝑣0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for all 𝑖= 1, . . . , 𝑛, we have 𝑃0 = E [ 𝑣0 𝑓(π‘₯0) 2] 𝜎2 𝑛𝐡init and 𝑃0 = 1 𝑖=1 E [ 𝑣0 𝑖 𝑓𝑖(π‘₯0) 2] 𝜎2 We can substitute the choice of 𝛾and obtain E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ πœ‚π‘›π΅init𝑇+ πœ‚πœŽ2 Since 𝐡init 𝜎2 𝐿𝛿0𝑛, we have E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ Notice that the choice of the momentum parameter such that πœ‚ ( 𝐿𝛿0𝛼2 𝜎2𝑇 ) 1/4 , πœ‚ ( 𝐿𝛿0𝛼 𝜎2𝑇 ) 1/2 and πœ‚ 𝛼 𝐿𝛿0𝐡init 𝜎 ensures that πœ‚3𝜎2 𝛼2𝐡init𝑇 𝐿𝛿0 πœ‚π‘‡. Therefore, we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎 𝛼𝑇 ) 2/3 + ( 𝐿𝛿0𝜎2 ) 1/2 + 𝜎 𝐿𝛿0 𝛼 𝐡init𝑇 Using 𝐡init min { 𝜎2𝐿 𝐿2𝛿0 , 𝜎 𝛼 𝐿𝛿0𝑇, 𝜎2/3 𝛼4/3𝑇2/3(𝐿𝛿0)1/3 , 𝑛 𝛼2𝑇 } , we obtain E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎 𝛼𝑇 ) 2/3 + ( 𝐿𝛿0𝜎2 It remains to notice that max { min { 𝜎2𝐿 𝐿2𝛿0 , 𝜎 𝛼 𝐿𝛿0𝑇, 𝜎2/3 𝛼4/3𝑇2/3(𝐿𝛿0)1/3 , 𝑛 𝛼2𝑇 } , 𝜎2 𝐿𝛿0𝑛 } 𝜎2 𝐿𝛿0 F.1 Controlling the error of momentum estimator Lemma 2. Let Assumption 1 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequence {𝑣𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖= 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , Define the sequence 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖. Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] (1 πœ‚)E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] + 3𝐿2 𝑖 πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, (24) E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] (1 πœ‚)E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] + 3𝐿2 πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 Proof. By the update rule of 𝑣𝑑 𝑖, we have E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] = E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] = E [ Eπœ‰π‘‘ 𝑖 [ (1 πœ‚)(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑)) 2] ] = (1 πœ‚)2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] ) E [ 𝑓𝑖(π‘₯𝑑 1) 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] + 3𝐿2 𝑖 πœ‚E [ π‘₯𝑑 π‘₯𝑑+1 2] + πœ‚2𝜎2, where the first inequality holds by Young s inequality, and the last step uses smoothness of 𝑓𝑖( ) (Assumption 1), which concludes the proof of (24). For each 𝑑 = 0, . . . , 𝑇 1, define a random vector πœ‰π‘‘ := (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) and denote by 𝑓(π‘₯𝑑, πœ‰π‘‘) := 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖). Note that the entries of the random vector πœ‰π‘‘are independent and Eπœ‰π‘‘[ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), then we have 𝑣𝑑= 𝑣𝑑 1 + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑣𝑑 1) , where 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖is an auxiliary sequence. 
Therefore, we can similarly derive E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] = E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑣𝑑 1) 2] = E [ Eπœ‰π‘‘ [ (1 πœ‚)(𝑣𝑑 1 𝑓(π‘₯𝑑)) + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) ) 2] ] = (1 πœ‚)2E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) 2] + πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] ) E [ 𝑓(π‘₯𝑑 1) 𝑓(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] + 3𝐿2 πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 where the last step uses smoothness of 𝑓( ) (Assumption 1), which concludes the proof of (25). F.2 Controlling the error of contractive compression and momentum estimator Lemma 3. Let Assumption 1 be satisfied, and suppose π’žis a contractive compressor with 𝛼 1 2. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑣𝑑 𝑖}𝑑 0 and {𝑔𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖= 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , 𝑔𝑑 𝑖= 𝑔𝑑 1 𝑖 + π’ž ( 𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) , Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 4πœ‚2 𝛼E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] 𝛼 E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2. (26) Proof. By the update rules of 𝑔𝑑 𝑖and 𝑣𝑑 𝑖, we derive E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] = E [ 𝑔𝑑 1 𝑖 𝑣𝑑 𝑖+ π’ž(𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) 2] = E [ Eπ’ž [ π’ž(𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) (𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) 2] ] (𝑖) (1 𝛼)E [ 𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 2] (𝑖𝑖) = (1 𝛼)E [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] = (1 𝛼)E [ Eπœ‰π‘‘ 𝑖 [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] ] = (1 𝛼)E [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) 2] +(1 𝛼)πœ‚2E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] (𝑖𝑖𝑖) (1 𝛼) (1 + 𝜌)E [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + (1 𝛼) (1 + 𝜌 1)πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] (𝑖𝑣) = (1 πœƒ)E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + (1 𝛼)πœ‚2𝜎2 (𝑣) (1 πœƒ)E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 2π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +2π›½πœ‚2E [ 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] + πœ‚2𝜎2 (1 πœƒ)E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 2π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +2𝛽𝐿2 π‘–πœ‚2E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, where (𝑖) is due to the definition of a contractive compressor (Definition 1), (𝑖𝑖) follows by the update rule of 𝑣𝑑 𝑖, (𝑖𝑖𝑖) and (𝑣) hold by Young s inequality for any 𝜌> 0. In (𝑖𝑣), we introduced the notation πœƒ:= 1 (1 𝛼)(1 + 𝜌), and 𝛽:= (1 𝛼)(1 + 𝜌 1). The last step follows by smoothness of 𝑓𝑖( ) (Assumption 1). The proof is complete by the choice 𝜌= 𝛼/2, which guarantees 1 πœƒ 1 𝛼/2, and 2𝛽 4/𝛼. G Further Improvement Using Double Momentum (Proof of Corollary 3) Algorithm 3 EF21-SGD2M (Error Feedback 2021 Enhanced with Double Momentum) 1: Input: starting point π‘₯0, step-size 𝛾> 0, parameter πœ‚ (0, 1], initial batch size 𝐡init 2: Initialize 𝑒0 𝑖= 𝑣0 𝑖= 𝑔0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for 𝑖= 1, . . . , 𝑛; 𝑔0 = 1 𝑛 𝑛 𝑖=1 𝑔0 𝑖 3: for 𝑑= 0,1, 2, . . . , 𝑇 1 do 4: Master computes π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑and broadcasts π‘₯𝑑+1 to all nodes 5: for all nodes 𝑖= 1, . . . , 𝑛in parallel do 6: Compute the first momentum estimator 𝑣𝑑+1 𝑖 = (1 πœ‚)𝑣𝑑 𝑖+ πœ‚ 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 7: Compute the second momentum estimator 𝑒𝑑+1 𝑖 = (1 πœ‚)𝑒𝑑 𝑖+ πœ‚π‘£π‘‘+1 𝑖 8: Compress 𝑐𝑑+1 𝑖 = π’ž(𝑒𝑑+1 𝑖 𝑔𝑑 𝑖) and send 𝑐𝑑+1 𝑖 to the master 9: Update local state 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ 𝑐𝑑+1 𝑖 10: end for 11: Master computes 𝑔𝑑+1 = 1 𝑛 𝑛 𝑖=1 𝑔𝑑+1 𝑖 via 𝑔𝑑+1 = 𝑔𝑑+ 1 𝑛 𝑛 𝑖=1 𝑐𝑑+1 𝑖 12: end for In this section, we state the detailed version of Corollary 3 in Theorem 5, followed by its formal proof. 
Notice that the key reason for the sample complexity improvement of the double momentum variant compared to EF21-SGDM (Theorem 3) is that in (27), one of the terms has better dependence on πœ‚ compared to (9) in Theorem 3, i.e., πœ‚4𝜎2/𝛼instead of πœ‚2𝜎2/𝛼. As a result, this term is dominated by other terms and vanishes in Corollary 3. Theorem 5. Let Assumptions 1 and 2 hold. Let Λ†π‘₯𝑇be sampled uniformly at random from the iterates of the method. Let Algorithm 3 run with a contractive compressor. For all πœ‚ (0, 1] and 𝐡init 1, with 𝛾 min { 𝛼 60 𝐿, πœ‚ 16𝐿 } , we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ( Ξ¨0 where Ξ¨0 := 𝛿0+ 𝛾 πœ‚E [ 𝑣0 𝑓(π‘₯0) 2] + π›Ύπœ‚4 𝑛 𝑛 𝑖=1 E [ 𝑣0 𝑖 𝑓𝑖(π‘₯0) 2] . Setting initial batch size 𝐡init = 𝜎2 𝐿𝛿0 , step-size and momentum parameters 60 𝐿 , πœ‚ 16𝐿 ) 1/4 , ( 𝐿𝛿0𝑛 ) 1/2 , 𝛼 𝐿𝛿0𝐡init 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎2 Proof. In order to control the error between 𝑔𝑑and 𝑓(π‘₯𝑑), we decompose it into three terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 3 𝑔𝑑 𝑒𝑑 2 + 3 𝑒𝑑 𝑣𝑑 2 + 3 𝑣𝑑 𝑓(π‘₯𝑑) 2 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2 + 3 𝑒𝑑 𝑣𝑑 2 + 3 𝑣𝑑 𝑓(π‘₯𝑑) 2 , where we define the sequences 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖and 𝑒𝑑:= 1 𝑛 𝑛 𝑖=1 𝑒𝑑 𝑖. In the following, we develop a recursion for each term above separately. Part I. Controlling the error of momentum estimator for each 𝑣𝑑 𝑖and on average for 𝑣𝑑. Denote 𝑃𝑑:= E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] , 𝑃𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] . Similarly to Part I of the proof of Theorem 3, we have 𝑑=0 𝑃𝑑 3 𝐿2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚π‘‡ 𝑃0, (29) 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 Part II (a). Controlling the error of the second momentum estimator for each 𝑒𝑑 𝑖. Recall that by Lemma 4-(37), we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑒𝑑+1 𝑖 𝑣𝑑+1 𝑖 2] (1 πœ‚)E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] + 6πœ‚E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] +6𝐿2 π‘–πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2, (31) Averaging inequalities (31) over 𝑖= 1, . . . ,𝑛and denoting 𝑄𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] , we have 𝑄𝑑+1 (1 πœ‚) 𝑄𝑑+ 6πœ‚ 𝑃𝑑+ 6 𝐿2πœ‚π‘…π‘‘+ πœ‚2𝜎2. Summing up the above inequalities for 𝑑= 0, . . . , 𝑇 1, we derive 𝑑=0 𝑃𝑑+ 6 𝐿2 1 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚2 + 6 𝐿2 ) 1 𝑇 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 + 1 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 + 6 πœ‚π‘‡ 𝑃0, (32) where we used (29), the bound πœ‚ 1, and 𝑒0 𝑖= 𝑣0 𝑖for 𝑖= 1, . . . , 𝑛. Part II (b). Controlling the error of the second momentum estimator 𝑒𝑑(on average). Similarly by Lemma 4-(38), we have for any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑒𝑑+1 𝑣𝑑+1 2] (1 πœ‚)E [ 𝑒𝑑 𝑣𝑑 2] + 6πœ‚E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] +6𝐿2πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2 Summing up the above inequalities for 𝑑= 0, . . . , 𝑇 1, and denoting 𝑄𝑑:= E [ 𝑒𝑑 𝑣𝑑 2] , we derive 𝑑=0 𝑃𝑑+ 6𝐿2 1 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚2 + 6𝐿2 ) 1 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 where we used (30), the bound πœ‚ 1, and 𝑒0 = 𝑣0. Part III. Controlling the error of contractive compressor and the double momentum estimator. By Lemma 5 we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑔𝑑+1 𝑖 𝑒𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] + 6πœ‚2 𝛼E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] 𝛼E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 2] + 6𝐿2 π‘–πœ‚4 𝛼 E [ π‘₯𝑑 π‘₯𝑑+1 2] + πœ‚4𝜎2. Averaging inequalities (34) over 𝑖= 1, . . . ,𝑛, denoting 𝑉𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] , and summing up the resulting inequality for 𝑑= 0, . . . , 𝑇 1, we obtain 𝑑=0 𝑉𝑑 12πœ‚2 𝑑=0 𝑄𝑑+ 12πœ‚4 𝑑=0 𝑃𝑑+ 12 𝐿2πœ‚4 𝑑=0 𝑅𝑑+ 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 ) 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 ) 𝑑=0 𝑅𝑑+ 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 12 7πœ‚3𝜎2 𝛼2 + 12 3 𝐿2πœ‚2 𝑑=0 𝑅𝑑+ 12πœ‚5𝜎2 𝑑=0 𝑅𝑑+ 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 84πœ‚3𝜎2 𝛼2 + 12πœ‚5𝜎2 Part IV. Combining steps I, II and III with descent lemma. 
By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (36) 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2 + 3𝛾 𝑒𝑑 𝑣𝑑 2 + 3𝛾 𝑣𝑑 𝑓(π‘₯𝑑) 2 . Subtracting 𝑓* from both sides of (36), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 3 1 𝑑=0 𝑉𝑑+ 3 1 𝑑=0 𝑄𝑑+ 3 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 3 276 𝐿2 𝑑=0 𝑅𝑑+ 3 84πœ‚3𝜎2 𝛼2 + 3 12πœ‚5𝜎2 𝛼2 + 3 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 3 7πœ‚πœŽ2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 𝑛 1 2𝛾2 1 𝑇 𝛼2𝑇 𝑃0 + 18 = 2𝛿0 𝛾𝑇+ 3 84πœ‚3𝜎2 𝛼2 + 3 12πœ‚5𝜎2 𝛼2 + 3 2πœ‚4𝜎2 πœ‚2 + 3 276 𝐿2 𝛼2𝑇 𝑃0 + 21 = 2𝛿0 𝛾𝑇+ 288πœ‚3𝜎2 𝛼2𝑇 𝑃0 + 21 where (𝑖) holds due to (30), (35) and (33), the last two steps hold because of the assumption on the step-size, and πœ‚ 1, which completes the proof of the first part of Theorem. Notice that it suffices to take the same initial batch-size as in the proof of the Theorem 3 in order to "remove" 𝑃0 and 𝑃0 terms, since the power of πœ‚in front of 𝑃0 is larger here compared to the proof of Theorem 3. The choice of the momentum parameter such that πœ‚ ( 𝐿𝛿0𝛼2 𝜎2𝑇 ) 1/4 , 𝜎2𝑇 ) 1/2 ensures that πœ‚3𝜎2 πœ‚π‘‡, and πœ‚πœŽ2 πœ‚π‘‡. Therefore, we can guarantee that the choice πœ‚= min { 𝛼 𝐿𝛿0𝐡init 𝜎 , ( 𝐿𝛿0𝛼2 𝜎2𝑇 ) 1/4 , ( 𝐿𝛿0𝑛 𝜎2𝑇 ) 1/2} ensures that E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎2 G.1 Controlling the error of second momentum estimator Lemma 4. Let Assumption 1 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑣𝑑 𝑖}𝑑 0 and {𝑒𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖= 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , 𝑒𝑑 𝑖= 𝑒𝑑 1 𝑖 + πœ‚ ( 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) . Define the sequences 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖and 𝑒𝑑:= 1 𝑛 𝑛 𝑖=1 𝑒𝑑 𝑖. Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 6πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +6𝐿2 π‘–πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, (37) E [ 𝑒𝑑 𝑣𝑑 2] (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 6πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] +6𝐿2πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 Proof. By the update rule of 𝑣𝑑 𝑖, we have E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] = E [ 𝑒𝑑 1 𝑖 𝑣𝑑 𝑖+ πœ‚(𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) 2] = (1 πœ‚)2E [ 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 2] = (1 πœ‚)2E [ (1 πœ‚)𝑣𝑑 1 𝑖 + πœ‚ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑒𝑑 1 𝑖 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 ) + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖)) 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 ) + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) + πœ‚( 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖)) 2] = (1 πœ‚)2E [ Eπœ‰π‘‘ 𝑖 [ (𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 ) + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) + πœ‚( 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖)) 2] ] = (1 πœ‚)2 ( E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) 2] + πœ‚2E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] ) (1 πœ‚)2E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) 2] + πœ‚2𝜎2 (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] ) πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 3πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 6πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +6πœ‚E [ 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 6πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +6𝐿2 π‘–πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, where the first inequality holds Assumption 2, the second inequality holds by Young s inequality, and the last step uses smoothness of 𝑓𝑖( ) (Assumption 1), which concludes the proof of (37). For each 𝑑 = 0, . . . , 𝑇 1, define a random vector πœ‰π‘‘ := (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) and denote by 𝑓(π‘₯𝑑, πœ‰π‘‘+1) := 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖). 
Note that the entries of the random vector πœ‰π‘‘are independent and Eπœ‰π‘‘[ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), then we have 𝑣𝑑= 𝑣𝑑 1 + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑣𝑑 1) , 𝑒𝑑 𝑖= 𝑒𝑑 1 𝑖 + πœ‚ ( 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) , where 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖, 𝑒𝑑:= 1 𝑛 𝑛 𝑖=1 𝑒𝑑 𝑖are auxiliary sequences. Therefore, we can similarly derive E [ 𝑒𝑑 𝑣𝑑 2] = E [ 𝑒𝑑 1 𝑣𝑑+ πœ‚(𝑣𝑑 𝑒𝑑 1) 2] = (1 πœ‚)2E [ 𝑣𝑑 𝑒𝑑 1 2] = (1 πœ‚)2E [ (1 πœ‚)𝑣𝑑 1 + πœ‚ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑒𝑑 1 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑣𝑑 1) + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑, πœ‰π‘‘)) 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑣𝑑 1) + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) + πœ‚( 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑, πœ‰π‘‘)) 2] = (1 πœ‚)2E [ Eπœ‰π‘‘ [ (𝑒𝑑 1 𝑣𝑑 1) + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) + πœ‚( 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑, πœ‰π‘‘)) 2] ] = (1 πœ‚)2 ( E [ 𝑒𝑑 1 𝑣𝑑 1 + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) 2] + πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] ) (1 πœ‚)2E [ 𝑒𝑑 1 𝑣𝑑 1 + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) 2] + πœ‚2𝜎2 (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑒𝑑 1 𝑣𝑑 1 2] ) πœ‚2E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 3πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 6πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] +6πœ‚E [ 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑 1) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 6πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] +6𝐿2πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 where the first inequality holds Assumption 2, the second inequality holds by Young s inequality, and the last step uses smoothness of 𝑓( ) (Assumption 1), which concludes the proof of (38). G.2 Controlling the error of contractive compression and double momentum estimator Lemma 5. Let Assumption 1 be satisfied, and suppose π’žis a contractive compressor. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑣𝑑 𝑖}𝑑 0, {𝑒𝑑 𝑖}𝑑 0, and {𝑔𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖 = 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , 𝑒𝑑 𝑖 = 𝑒𝑑 1 𝑖 + πœ‚ ( 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) , 𝑔𝑑 𝑖 = 𝑔𝑑 1 𝑖 + π’ž ( 𝑒𝑑 𝑖 𝑔𝑑 1 𝑖 ) . Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 1 𝑖 𝑒𝑑 1 𝑖 2] + 6πœ‚2 𝛼E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] 𝛼E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1, πœ‰π‘‘ 1 𝑖 ) 2] + 6𝐿2 π‘–πœ‚4 𝛼 E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚4𝜎2. Proof. By the update rules of 𝑔𝑑 𝑖, 𝑒𝑑 𝑖and 𝑣𝑑 𝑖, we derive E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] = E [ 𝑔𝑑 1 𝑖 𝑒𝑑 𝑖+ π’ž(𝑒𝑑 𝑖 𝑔𝑑 1 𝑖 ) 2] (𝑖) (1 𝛼)E [ 𝑒𝑑 𝑖 𝑔𝑑 1 𝑖 2] (𝑖𝑖) = (1 𝛼)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] = (1 𝛼)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) +πœ‚2( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑)) 2] = (1 𝛼)E [ Eπœ‰π‘‘ 𝑖 [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) +πœ‚2( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑)) 2] ] = (1 𝛼)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) 2] +(1 𝛼)πœ‚4E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] (𝑖𝑖𝑖) (1 𝛼)(1 + 𝜌)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] +(1 𝛼)(1 + 𝜌 1)E [ πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) 2] (𝑖𝑣) = (1 πœƒ)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + πœ‚4𝜎2 +𝛽E [ πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑 1) 𝑣𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] (𝑣) (1 πœƒ) E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + 3π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 2] +3π›½πœ‚4E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +3π›½πœ‚4E [ 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] + πœ‚4𝜎2 (1 πœƒ) E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + 3π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 2] +3π›½πœ‚4E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +3𝛽𝐿2 π‘–πœ‚4E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚4𝜎2 where (𝑖) is due to definition of a contractive compressor (Definition 1), (𝑖𝑖) follows by the update rule of 𝑣𝑑 𝑖and 𝑒𝑑 𝑖, (𝑖𝑖𝑖) and (𝑣) hold by Young s inequality for any 𝜌> 0. In (𝑖𝑣), we introduced the notation πœƒ:= 1 (1 𝛼)(1 + 𝜌), and 𝛽:= (1 𝛼)(1 + 𝜌 1). The last step follows by smoothness of 𝑓𝑖( ) (Assumption 1). 
The proof is complete by the choice 𝜌= 𝛼/2, which guarantees 1 πœƒ 1 𝛼/2, and 3𝛽 6/𝛼. H EF21-SGDM with Absolute Compressor In this section, we complement our theory by analyzing EF21-SGDM under a different class of widely used biased compressors, namely, absolute compressors, which are defined as follows. Definition 2 (Absolute compressors). We say that a (possibly randomized) map π’ž: R𝑑 R𝑑is an absolute compression operator if there exists a constant > 0 such that E [ π’ž(π‘₯) π‘₯ 2] 2, π‘₯ R𝑑. (40) This class includes important examples of compressors such as hard-threshold sparsifier [Sahu et al., 2021], (stochatsic) rounding schemes with bounded error [Gupta et al., 2015] and scaled integer rounding [Sapio et al., 2021]. Algorithm 4 EF21-SGDM (abs) 1: Input: starting point π‘₯0, step-size 𝛾> 0, momentum πœ‚ (0, 1], initial batch size 𝐡init 2: Initialize 𝑣0 𝑖= 𝑔0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for 𝑖= 1, . . . , 𝑛; 𝑔0 = 1 𝑛 𝑛 𝑖=1 𝑔0 𝑖 3: for 𝑑= 0,1, 2, . . . , 𝑇 1 do 4: Master computes π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑and broadcasts π‘₯𝑑+1 to all nodes 5: for all nodes 𝑖= 1, . . . , 𝑛in parallel do 6: Compute momentum estimator 𝑣𝑑+1 𝑖 = (1 πœ‚)𝑣𝑑 𝑖+ πœ‚ 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 7: Compress 𝑐𝑑+1 𝑖 = π’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 ) and send 𝑐𝑑+1 𝑖 to the master 8: Update local state 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ 𝛾𝑐𝑑+1 𝑖 9: end for 10: Master computes 𝑔𝑑+1 = 1 𝑛 𝑛 𝑖=1 𝑔𝑑+1 𝑖 via 𝑔𝑑+1 = 𝑔𝑑+ 1 𝑛 𝑛 𝑖=1 𝛾𝑐𝑑+1 𝑖 11: end for To accomodate absolute compressors into our EF21-SGDM method, we need to make a slight modification to our algorithm, see Algorithm 4. At each iteration, before compressing the difference 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖, we divide it by the step-size 𝛾. Later, we multiply the compressed vector 𝑐𝑑+1 𝑖 by 𝛾, i.e., have 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ π›Ύπ’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 Such modification is necessary for absolute compressors because by Definition 2 the compression error is not proportional to π‘₯ 2, but merely an absolute constant 2. In fact, Algorithm 4 is somewhat more universal in the sense that it can be also applied for contractive compressors.20 We derive the following result for EF21-SGDM (abs). Theorem 6. Let Assumptions 1 and 2 hold. Let Λ†π‘₯𝑇be sampled uniformly at random from the iterates of the method. Let Algorithm 4 run with an absolute compressor (Definition 2). For all πœ‚ (0, 1] and 𝐡init 1, with 𝛾 πœ‚ 4𝐿, we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ( Ξ¨0 𝛾𝑇+ 𝛾2 2 + πœ‚πœŽ2 where Ξ¨0 := 𝛿0 + 𝛾 πœ‚E [ 𝑣0 𝑓(π‘₯0) 2] is a Lyapunov function. With the following step-size, momentum parameter, and initial batch size ) 1/3 , ( 𝐿𝛿0𝑛 , 𝐡𝑖𝑛𝑖𝑑= 𝜎2 E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ) 2/3 + ( 𝐿𝛿0𝜎2 20It is straightforward to modify the proof of our Theorem 3 for the case when Algorithm 4 is applied with a contractive compressor. Corollary 4. Under the setting of Theorem 6, we have E [ 𝑓(Λ†π‘₯𝑇) ] πœ€after 𝑇 = π‘›πœ€4 ) iterations. Remark 3. The sample complexity result in Corollary 4 matches the one derived for Double Squeeze algorithm [Tang et al., 2020], which is different from Algorithm 4. Proof. Similarly to the proof of Theorem 3, we control the error between 𝑔𝑑and 𝑓(π‘₯𝑑) by decomposing it into two terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 2 𝑔𝑑 𝑣𝑑 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 2 1 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 . Again, for the second term above we can use the recursion developed for momentum estimator Lemma 2. However, since we use a different compressor here, we need to bound 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 term differently, thus we invoke Lemma 6 for absolute compressor. Part I. Controlling the error of momentum estimator on average for 𝑣𝑑. 
Denote 𝑃𝑑:= E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] . Similarly to Part I of the proof of Theorem 3, we have by Lemma 2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 Part II. Controlling the error of absolute compressor and momentum estimator. By Lemma 6 we have for any 0 < πœ‚ 1 and 𝑑 0 𝑖=1 E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] 𝛾2 2. (44) Part III. Combining steps I and II with descent lemma. By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (45) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑉𝑑+ 𝛾𝑃𝑑. Subtracting 𝑓* from both sides of (45), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2 1 𝑑=0 𝑉𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 2𝛾2 2 + 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖𝑖) 2𝛿0 𝛾𝑇+ 2𝛾2 2 + ( 6𝐿2 𝑑=0 𝑅𝑑+ 2πœ‚πœŽ2 2𝛿0 𝛾𝑇+ 2𝛾2 2 + 2πœ‚πœŽ2 where in (𝑖) and (𝑖𝑖) we apply (43), (44), and in the last step we use the assumption on the step-size 𝛾 πœ‚/(4𝐿). Setting 𝛾= πœ‚ 4𝐿, and taking πœ‚ ( 𝐿3𝛿0 Ξ”2𝑇 ) 1/3 we can ensure that πœ‚2Ξ”2 πœ‚π‘‡, since πœ‚ ( 𝐿𝛿0𝑛 we have πœ‚πœŽ2 πœ‚π‘‡. Finally, by setting the initial batch-size to 𝐡𝑖𝑛𝑖𝑑= 𝜎2 𝐿𝛿0𝑛, we have 1 πœ‚π‘‡π‘ƒ0 = 𝜎2 πœ‚π‘‡π‘›π΅π‘–π‘›π‘–π‘‘ 𝐿𝛿0 πœ‚π‘‡. Therefore, we derive 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2𝛾2 2 + 2πœ‚πœŽ2 𝑇 + 𝛿 2/3 0 2/3 𝑇 2/3 + 𝜎(𝐿𝛿0) 1/2 H.1 Controlling the error of absolute compression Lemma 6. Let π’žbe an absolute compressor and 𝑔𝑑+1 𝑖 be updated according to Algorithm 4, then for 𝑑 0, we have 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] 𝛾2 2. Proof. By the update rule for 𝑔𝑑+1 𝑖 in Algorithm 4 and Definition 2, we can bound E [ 𝑔𝑑+1 𝑖 𝑣𝑑+1 𝑖 2] = E [ π›Ύπ’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 ) (𝑣𝑑+1 𝑖 𝑔𝑑 𝑖) [ π’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 ) 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 I EF21-STORM/MVR Algorithm 5 EF21-STORM/MVR 1: Input: π‘₯0, step-size 𝛾> 0, parameter πœ‚ (0, 1], 𝐡init 1 2: Initialize 𝑀0 𝑖= 𝑔0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for 𝑖= 1, . . . , 𝑛; 𝑔0 = 1 𝑛 𝑛 𝑖=1 𝑔0 𝑖 3: for 𝑑= 0,1, 2, . . . , 𝑇 1 do 4: Master computes π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑and broadcasts π‘₯𝑑+1 to all nodes 5: for all nodes 𝑖= 1, . . . , 𝑛in parallel do 6: Draw πœ‰π‘‘+1 𝑖 and compute two (stochastic) gradients 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) and 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 7: Compute variance reduced STORM/MVR estimator 8: 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )) 9: Compress 𝑐𝑑+1 𝑖 = π’ž(𝑀𝑑+1 𝑖 𝑔𝑑 𝑖) and send 𝑐𝑑+1 𝑖 to the master 10: Update local state 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ 𝑐𝑑+1 𝑖 11: end for 12: Master computes 𝑔𝑑+1 = 1 𝑛 𝑛 𝑖=1 𝑔𝑑+1 𝑖 via 𝑔𝑑+1 = 𝑔𝑑+ 1 𝑛 𝑛 𝑖=1 𝑐𝑑+1 𝑖 13: end for Assumption 3 (Individual smoothness21). For each 𝑖= 1, . . . , 𝑛, every realization of πœ‰π‘– π’Ÿπ‘–, the stochastic gradient 𝑓𝑖(π‘₯,πœ‰π‘–) is ℓ𝑖-Lipschitz, i.e., for all π‘₯, 𝑦 R𝑑 𝑓𝑖(π‘₯,πœ‰π‘–) 𝑓𝑖(𝑦, πœ‰π‘–) ℓ𝑖 π‘₯ 𝑦 . We denote β„“2 := 1 𝑛 𝑛 𝑖=1 β„“2 𝑖 Theorem 7. Let Assumptions 1, 2 and 3 hold. Let Λ†π‘₯𝑇be sampled uniformly at random from the iterates of the method. Let Algorithm 5 run with a contractive compressor. For all πœ‚ (0, 1] and 𝐡init 1, with 𝛾 min { 𝛼 8 𝐿, 𝛼 } , we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ( Ξ¨0 where Ξ¨0 := 𝛿0 + 𝛾 πœ‚E [ 𝑣0 𝑓(π‘₯0) 2] + π›Ύπœ‚ 𝑛 𝑛 𝑖=1 E [ 𝑣0 𝑖 𝑓𝑖(π‘₯0) 2] . With the following step-size, momentum parameter, and initial batch size ( ℓ𝛿0 𝑛 𝜎2𝑇 and 𝐡init = max { 𝜎2 𝐿𝛿0𝑛, 𝛼𝑛 𝑇 } , we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ Corollary 5. Under the setting of Theorem 7. we have E [ 𝑓(Λ†π‘₯𝑇) ] πœ€after 𝑇 = π›Όπœ€2 + ℓ𝛿0𝜎 1/3 𝛼1/3 π‘›πœ€7/3 + ℓ𝛿0𝜎 1/2 𝛼1/4 π‘›πœ€5/2 + ℓ𝛿0𝜎 π‘›πœ€3 ) iterations. 
Recently, Yau and Wai [2022] propose and analyze a Do Co M-SGT algorithm for decentralized optimization with contractive compressor under the above Assumption 3. When their method is specialized to centralized setting (with mixing constant 𝜌= 1), their total sample complexity becomes π’ͺ ( β„“ π›Όπœ€2 + 𝑛4/5𝜎3/2 𝛼9/4πœ€3/2 + 𝜎3 π‘›πœ€3 ) (see Table 1 or Theorem 4.1 in [Yau and Wai, 2022]). In contrast, the sample complexity given in our Corollary 5 improves the dependence on 𝜎in the last term and, moreover, achieves the linear speedup in terms of 𝑛for all stochastic terms in the sample complexity. 21This assumption can be also relaxed to so-called expected smoothness. Proof. Similarly to the proof of Theorem 3, we control the error between 𝑔𝑑and 𝑓(π‘₯𝑑) by decomposing it into two terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 2 𝑔𝑑 𝑀𝑑 2+2 𝑀𝑑 𝑓(π‘₯𝑑) 2 2 1 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2+2 𝑀𝑑 𝑓(π‘₯𝑑) 2 . In the following, we develop a recursive bound for each term above separately. Part I. Controlling the variance of STORM/MVR estimator for each 𝑀𝑑 𝑖and on average 𝑀𝑑. Recall that by Lemma 7-(55), we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 2β„“2 𝑖E [ π‘₯𝑑 π‘₯𝑑+1 2] + 2πœ‚2𝜎2. (49) Averaging inequalities (49) over 𝑖= 1, . . . ,𝑛and denoting 𝑃𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] we have 𝑃𝑑+1 (1 πœ‚) 𝑃𝑑+ 2 β„“2𝑅𝑑+ 2πœ‚2𝜎2. Summing up the above inequality for 𝑑= 0, . . . , 𝑇 1, we derive 𝑑=0 𝑃𝑑 2 β„“2 𝑑=0 𝑅𝑑+ 2πœ‚πœŽ2 + 1 πœ‚π‘‡ 𝑃0. (50) Similarly by Lemma 7-(56) denoting 𝑃𝑑:= E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] , we have 𝑑=0 𝑃𝑑 2 β„“2 𝑑=0 𝑅𝑑+ 2πœ‚πœŽ2 Part II. Controlling the variance of contractive compressor and STORM/MVR estimator. By Lemma 8 we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑔𝑑+1 𝑖 𝑀𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2] + 4πœ‚2 𝛼E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + ( 4𝐿2 𝑖 𝛼 + β„“2 𝑖 ) E [ π‘₯𝑑+1 π‘₯𝑑 2] + 2πœ‚2𝜎2. (52) Averaging inequalities (52) over 𝑖= 1, . . . ,𝑛, denoting 𝑉𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2] , and summing up the resulting inequality for 𝑑= 0, . . . , 𝑇 1, we obtain 𝑑=0 𝑅𝑑+ 2πœ‚2𝜎2 𝛼2𝑇 𝑃0. (53) Part III. Combining steps I and II with descent lemma. By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (54) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾1 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2 + 𝛾 𝑀𝑑 𝑓(π‘₯𝑑) 2 . Subtracting 𝑓* from both sides of (54), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2 1 𝑑=0 𝑉𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 𝑑=0 𝑅𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖𝑖) 2𝛿0 𝛾𝑇+ 2𝛿0 𝛾𝑇+ 32πœ‚3𝜎2 where in (𝑖) we apply (53), in (𝑖𝑖) we use (51), and the last step follows by assumption on the step-size, which proves (48). We now find the particular values of parameters. Using 𝑀0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for all 𝑖= 1, . . . , 𝑛, we have 𝑃0 = E [ 𝑀0 𝑓(π‘₯0) 2] 𝜎2 𝑛𝐡init and 𝑃0 = 1 𝑖=1 E [ 𝑀0 𝑖 𝑓𝑖(π‘₯0) 2] 𝜎2 We can substitute the choice of 𝛾and obtain E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ ( 𝛿0 πœ‚π‘›π΅init𝑇+ πœ‚πœŽ2 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ ℓ𝛿0 π‘›πœ‚π‘‡+ πœ‚3𝜎2 πœ‚π‘›π΅init𝑇+ πœ‚πœŽ2 Since 𝐡init 𝜎2 𝐿𝛿0𝑛, we have E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ ℓ𝛿0 π‘›πœ‚π‘‡+ πœ‚3𝜎2 Notice that the choice of the momentum parameter such that πœ‚ ( ℓ𝛿0𝛼2 𝜎2 𝑛𝑇 ) 2/7 , πœ‚ ( ℓ𝛿0𝛼 𝜎2 𝑛𝑇 ) 2/5 , πœ‚ ( ℓ𝛿0 𝑛 𝜎2𝑇 ) 2/3 , and πœ‚ ( ℓ𝛿0𝛼2𝐡init 𝜎2 𝑛 ) 2/3 ensures that πœ‚3𝜎2 𝛼2 ℓ𝛿0 π‘›πœ‚π‘‡, πœ‚2𝜎2 𝛼 ℓ𝛿0 π‘›πœ‚π‘‡, πœ‚πœŽ2 𝛼2𝐡init𝑇 ℓ𝛿0 π‘›πœ‚π‘‡. 
Therefore, we have E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ 𝐡1/3 init 𝑇 Using 𝐡init 𝛼𝑛 𝑇, we obtain E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ I.1 Controlling the variance of STORM/MVR estimator Lemma 7. Let Assumptions 2 and 3 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequence {𝑀𝑑 𝑖}𝑑 0 be updated via 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )) Define the sequence 𝑀𝑑:= 1 𝑛 𝑛 𝑖=1 𝑀𝑑 𝑖. Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 2β„“2 𝑖E [ π‘₯𝑑 π‘₯𝑑+1 2] + 2πœ‚2𝜎2, (55) E [ 𝑀𝑑+1 𝑓(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] + 2 β„“2 𝑛E [ π‘₯𝑑 π‘₯𝑑+1 2] + 2πœ‚2𝜎2 Proof. For each 𝑑= 0, . . . , 𝑇 1, define a random vector πœ‰π‘‘:= (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) and denote by 𝑓(π‘₯𝑑, πœ‰π‘‘) := 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖). Note that the entries of the random vector πœ‰π‘‘are independent and Eπœ‰π‘‘[ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), then we have 𝑀𝑑+1 = 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) + (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑, πœ‰π‘‘+1) ) , where 𝑀𝑑= 1 𝑛 𝑛 𝑖=1 𝑀𝑑 𝑖is an auxiliary sequence. 𝒱𝑑 𝑖:= 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑), 𝒱𝑑:= 1 𝒲𝑑 𝑖:= 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) + 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 𝑓𝑖(π‘₯𝑑+1), 𝒲𝑑:= 1 Then by Assumptions 2, we have E [ 𝒱𝑑 𝑖 ] = E [ 𝒲𝑑 𝑖 ] = E [ 𝒱𝑑] = E [ 𝒲𝑑] = 0, (57) E [ 𝒱𝑑 𝑖 2] 𝜎2, E [ 𝒱𝑑 2] 𝜎2 Furthermore, we can derive E [ 𝒲𝑑 2] = E 𝑖=1 E [ 𝒲𝑑 𝑖 2] + 1 𝑖 =𝑗 E [ 𝒲𝑑 𝑖, 𝒲𝑑 𝑗 ] 𝑖=1 E [ 𝒲𝑑 𝑖 2] + 1 𝑖 =𝑗 E [ 𝒲𝑑 𝑖 ] , E [ 𝒲𝑑 𝑗 ] 𝑖=1 E [ 𝒲𝑑 𝑖 2] 𝑖=1 E [ 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) 2] 𝑖=1 β„“2 𝑖E [ π‘₯𝑑+1 π‘₯𝑑 2] = β„“2 𝑛E [ π‘₯𝑑+1 π‘₯𝑑 2] , where (𝑖) holds by the conditional independence of 𝒲𝑑 𝑖and 𝒲𝑑 𝑗, and the last inequality follows by the individual smoothness of stochastic functions (Assumption 3). Therefore, we have E [ 𝒲𝑑 𝑖 2] β„“2 𝑖E [ π‘₯𝑑+1 π‘₯𝑑 2] , E [ 𝒲𝑑 2] β„“2 𝑛E [ π‘₯𝑑+1 π‘₯𝑑 2] , (59) where the first inequality is obtained by using a similar derivation. By the update rule for 𝑀𝑑, we can also derive 𝑀𝑑+1 𝑓(π‘₯𝑑+1) = (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑, πœ‰π‘‘+1) ) + ( 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) 𝑓(π‘₯𝑑+1) ) = (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑) ) + πœ‚ ( 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) 𝑓(π‘₯𝑑+1) ) +(1 πœ‚) ( ( 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑, πœ‰π‘‘+1) + 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) 𝑓(π‘₯𝑑+1) ) ) = (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑) ) + πœ‚π’±π‘‘+1 + (1 πœ‚)𝒲𝑑. Therefore, we have E [ 𝑀𝑑+1 𝑓(π‘₯𝑑+1) 2] E [ Eπœ‰π‘‘+1 [ (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑) ) + πœ‚π’±π‘‘+1 + (1 πœ‚)𝒲𝑑 2] ] = (1 πœ‚)2E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] + E [ πœ‚π’±π‘‘+1 + (1 πœ‚)𝒲𝑑 2] (1 πœ‚) 𝑀𝑑 𝑓(π‘₯𝑑) 2 + 2πœ‚2E [ 𝒱𝑑+1 2] + 2E [ 𝒲𝑑 2] (1 πœ‚)E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] + 2𝜎2πœ‚2 𝑛E [ π‘₯𝑑+1 π‘₯𝑑 2] , where the last inequality holds by (58) and (59). Similarly for each 𝑖= 1, . . . , 𝑛, we have 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) = (1 πœ‚) ( 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) ) + πœ‚π’±π‘‘+1 𝑖 + (1 πœ‚)𝒲𝑑 𝑖. (60) E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 2𝜎2πœ‚2 + 2β„“2 𝑖𝑅𝑑. I.2 Controlling the variance of contractive compression and STORM/MVR estimator Lemma 8. Let Assumptions 1, 2 and 3 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑀𝑑 𝑖}𝑑 0 and {𝑔𝑑 𝑖}𝑑 0 be updated via 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )), 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ π’ž ( 𝑀𝑑+1 𝑖 𝑔𝑑 𝑖 ) . Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑔𝑑+1 𝑖 𝑀𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2] + 4πœ‚2 𝛼E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + ( 4𝐿2 𝑖 𝛼 + β„“2 𝑖 ) E [ π‘₯𝑑+1 π‘₯𝑑 2] + 2πœ‚2𝜎2. (61) Proof. 
I.2 Controlling the variance of the contractive compressor and the STORM/MVR estimator

Lemma 8. Let Assumptions 1, 2 and 3 be satisfied, and suppose $0 < \eta \le 1$. For every $i = 1,\dots,n$, let the sequences $\{w_i^t\}_{t\ge0}$ and $\{g_i^t\}_{t\ge0}$ be updated via
$$w_i^{t+1} = \nabla f_i(x^{t+1},\xi_i^{t+1}) + (1-\eta)\big(w_i^t - \nabla f_i(x^t,\xi_i^{t+1})\big), \qquad g_i^{t+1} = g_i^t + \mathcal{C}\big(w_i^{t+1} - g_i^t\big).$$
Then for every $i = 1,\dots,n$ and $t \ge 0$ it holds that
$$\mathbb{E}\big[\|g_i^{t+1} - w_i^{t+1}\|^2\big] \le \Big(1-\frac{\alpha}{2}\Big)\mathbb{E}\big[\|g_i^t - w_i^t\|^2\big] + \frac{4\eta^2}{\alpha}\,\mathbb{E}\big[\|w_i^t - \nabla f_i(x^t)\|^2\big] + \Big(\frac{4L_i^2}{\alpha} + 2\ell_i^2\Big)\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + 2\eta^2\sigma^2. \qquad (61)$$

Proof. By the update rules for $w_i^t$ and $g_i^t$, and the definitions of $\mathcal{V}_i^t, \mathcal{W}_i^t$ given in the proof of Lemma 7, we can derive
$$\mathbb{E}\big[\|g_i^{t+1} - w_i^{t+1}\|^2\big] = \mathbb{E}\big[\|\mathcal{C}(w_i^{t+1} - g_i^t) - (w_i^{t+1} - g_i^t)\|^2\big]$$
$$\overset{(i)}{\le} (1-\alpha)\,\mathbb{E}\big[\|w_i^{t+1} - g_i^t\|^2\big]$$
$$\overset{(ii)}{=} (1-\alpha)\,\mathbb{E}\big[\|(1-\eta)\big(w_i^t - \nabla f_i(x^t)\big) + \eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t + \nabla f_i(x^{t+1}) - g_i^t\|^2\big]$$
$$= (1-\alpha)\,\mathbb{E}\Big[\mathbb{E}_{\xi_i^{t+1}}\big[\|(1-\eta)\big(w_i^t - \nabla f_i(x^t)\big) + \eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t + \nabla f_i(x^{t+1}) - g_i^t\|^2\big]\Big]$$
$$\overset{(iii)}{=} (1-\alpha)\,\mathbb{E}\big[\|(1-\eta)\big(w_i^t - \nabla f_i(x^t)\big) + \nabla f_i(x^{t+1}) - g_i^t\|^2\big] + (1-\alpha)\,\mathbb{E}\big[\|\eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t\|^2\big]$$
$$= (1-\alpha)\,\mathbb{E}\big[\|\big(w_i^t - g_i^t\big) + \big(\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\big) - \eta\big(w_i^t - \nabla f_i(x^t)\big)\|^2\big] + (1-\alpha)\,\mathbb{E}\big[\|\eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t\|^2\big]$$
$$\overset{(iv)}{\le} (1-\alpha)(1+\rho)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + (1-\alpha)(1+\rho^{-1})\,\mathbb{E}\big[\|\big(\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\big) - \eta\big(w_i^t - \nabla f_i(x^t)\big)\|^2\big] + 2(1-\alpha)\eta^2\,\mathbb{E}\big[\|\mathcal{V}_i^{t+1}\|^2\big] + 2(1-\alpha)(1-\eta)^2\,\mathbb{E}\big[\|\mathcal{W}_i^t\|^2\big]$$
$$\overset{(v)}{=} (1-\theta)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + \beta\,\mathbb{E}\big[\|\big(\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\big) - \eta\big(w_i^t - \nabla f_i(x^t)\big)\|^2\big] + 2(1-\alpha)\eta^2\,\mathbb{E}\big[\|\mathcal{V}_i^{t+1}\|^2\big] + 2(1-\alpha)(1-\eta)^2\,\mathbb{E}\big[\|\mathcal{W}_i^t\|^2\big]$$
$$\overset{(vi)}{\le} (1-\theta)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + 2\beta\eta^2\,\mathbb{E}\big[\|w_i^t - \nabla f_i(x^t)\|^2\big] + 2\beta\,\mathbb{E}\big[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\big] + 2\ell_i^2\,\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + 2\eta^2\sigma^2$$
$$\le (1-\theta)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + 2\beta\eta^2\,\mathbb{E}\big[\|w_i^t - \nabla f_i(x^t)\|^2\big] + \big(2\beta L_i^2 + 2\ell_i^2\big)\,\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + 2\eta^2\sigma^2,$$
where $(i)$ holds by Definition 1, $(ii)$ follows from (60), $(iii)$ holds by the unbiasedness (57) of $\mathcal{V}_i^{t+1}$ and $\mathcal{W}_i^t$, in $(iv)$ we use Young's inequality twice, in $(v)$ we introduce the notation $\theta := 1 - (1-\alpha)(1+\rho)$ and $\beta := (1-\alpha)(1+\rho^{-1})$, and in $(vi)$ we again use Young's inequality together with the bounds (58) and (59). The last step holds by smoothness of $f_i(\cdot)$ (Assumption 1). The proof is completed by the choice $\rho = \alpha/2$, which guarantees $1-\theta \le 1-\alpha/2$ and $2\beta \le 4/\alpha$.

J Simplified Proof of SGDM: Time-Varying Parameters and No Tuning for the Momentum Sequence

In this section, we give a simplified proof for SGDM in the single-node setting ($n = 1$) without compression ($\alpha = 1$). The following theorem shows that the momentum parameter can be chosen in a parameter-agnosticΒ²Β² way as $\eta_t = 1/\sqrt{t+1}$ (or $\eta_t = 1/\sqrt{T+1}$), instead of being a constant depending on problem parameters as suggested in our main Theorem 3. In other words, using SGDM with time-varying momentum does not introduce any additional tuning of hyper-parameters.

Β²Β²That is, independently of the problem-specific parameters.

Theorem 8. Let Assumptions 1 and 2 hold. Let $n = 1$ and let Algorithm 1 run with the identity compressor $\mathcal{C}$, i.e., $\alpha = 1$, and (possibly) time-varying momentum $\eta_t \in (0,1]$ and step-size parameters $\gamma_t = \gamma\eta_t$ with $\gamma \in (0, 1/(3L)]$. Let $\hat x^T$ be sampled from the iterates of the algorithm with probabilities $p_t = \eta_t \big/ \sum_{t=0}^{T-1}\eta_t$. Then
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] \le \frac{2\Lambda^0\gamma^{-1} + 2\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{\sum_{t=0}^{T-1}\eta_t},$$
where $\Lambda^0 := f(x^0) - f^* + \gamma\,\mathbb{E}\big[\|v^0 - \nabla f(x^0)\|^2\big]$ is the Lyapunov function.

Proof. By Lemma 2, denoting $P^t := \mathbb{E}\big[\|v^t - \nabla f(x^t)\|^2\big]$ and $R^t := \mathbb{E}\big[\|x^t - x^{t+1}\|^2\big]$, we have
$$P^{t+1} \le P^t - \eta_t P^t + \frac{3L^2}{\eta_t}R^t + \eta_t^2\sigma^2. \qquad (62)$$
By the descent Lemma 1, we have for any $\gamma_t > 0$
$$\frac{\gamma_t}{2}\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \delta^t - \delta^{t+1} - \frac{1}{2\gamma_t}\big(1 - \gamma_t L\big)R^t + \gamma_t P^t, \qquad (63)$$
where $\delta^t := \mathbb{E}\big[f(x^t) - f^*\big]$. Define the Lyapunov function $\Lambda^t := \delta^t + \gamma P^t$. Then summing up (63) with a $\gamma$ multiple of (62), using $\gamma_t = \gamma\eta_t$ and noticing that $\gamma_t \le \gamma$, we get
$$\frac{\gamma\eta_t}{2}\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \Lambda^t - \Lambda^{t+1} - \frac{1}{2\gamma_t}\big(1 - \gamma L - 6\gamma^2L^2\big)R^t + \gamma\eta_t^2\sigma^2.$$
Since $\gamma \le 1/(3L)$, we have $1 - \gamma L - 6\gamma^2L^2 \ge 0$, and, therefore, by telescoping we can derive
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \frac{\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big]}{\sum_{t=0}^{T-1}\eta_t} \le \frac{2\Lambda^0\gamma^{-1} + 2\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{\sum_{t=0}^{T-1}\eta_t}.$$
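As an illustration of Theorem 8, here is a minimal single-node SGDM sketch with the parameter-agnostic schedule $\eta_t = 1/\sqrt{t+1}$, step-sizes $\gamma_t = \gamma\eta_t$ with $\gamma = 1/(3L)$, and the output $\hat x^T$ drawn with probabilities $p_t \propto \eta_t$. The exact update ordering and the oracle `grad_oracle` are our assumptions, not quoted from Algorithm 1.

```python
import numpy as np

def sgdm(grad_oracle, x0, T, L, rng):
    """Single-node SGDM with time-varying momentum (a sketch of Theorem 8's setting)."""
    gamma = 1.0 / (3.0 * L)
    x, v = x0.copy(), grad_oracle(x0)          # v^0 from one stochastic gradient
    etas, iterates = [], []
    for t in range(T):
        eta_t = 1.0 / np.sqrt(t + 1)           # parameter-agnostic momentum eta_t
        v = (1.0 - eta_t) * v + eta_t * grad_oracle(x)   # Polyak momentum estimator
        iterates.append(x.copy()); etas.append(eta_t)
        x = x - gamma * eta_t * v              # step-size gamma_t = gamma * eta_t
    p = np.array(etas) / np.sum(etas)          # p_t proportional to eta_t
    return iterates[rng.choice(T, p=p)]        # x_hat^T as in the theorem

# usage on a toy problem: f(x) = ||x||^2 / 2 with noisy gradients
rng = np.random.default_rng(0)
oracle = lambda x: x + 0.1 * rng.standard_normal(x.size)
x_hat = sgdm(oracle, np.ones(20), T=5000, L=1.0, rng=rng)
print(np.linalg.norm(x_hat))                   # gradient norm at the sampled iterate
```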
The above theorem suggests that, to ensure convergence, we can select any momentum sequence such that $\sum_{t=0}^{T-1}\eta_t \to \infty$ and $\sigma^2\sum_{t=0}^{T-1}\eta_t^2 \big/ \sum_{t=0}^{T-1}\eta_t \to 0$ as $T \to \infty$. The parameter $\gamma$, which determines the step-size $\gamma_t = \gamma\eta_t$, should be set to $\gamma = 1/(3L)$ (to minimize the upper bound). Let us now consider some special cases.

Deterministic case. If $\sigma = 0$, we can set the momentum to any constant $\eta_t = \eta \in (0,1]$ and derive
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] \le \frac{2\delta^0}{\gamma\eta T} = \mathcal{O}\left(\frac{L\delta^0}{\eta T}\right).$$

Stochastic case. For $\sigma^2 > 0$, we can select the time-varying $\eta_t = \frac{1}{\sqrt{t+1}}$ or the constant $\eta_t = \frac{1}{\sqrt{T+1}}$, which gives $\sum_{t=0}^{T-1}\eta_t^2 = \mathcal{O}(\log T)$ and $\sum_{t=0}^{T-1}\eta_t = \Omega(\sqrt{T})$, so that
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \mathcal{O}\left(\frac{L\Lambda^0 + \sigma^2\log T}{\sqrt{T}}\right).$$
Notice that if we set $\eta_t$ as above, we do not need any tuning of the momentum parameter. Only tuning of the parameter $\gamma$ is required to ensure convergence with the optimal dependence on $T$, as in SGD without momentum. Of course, this rate is not yet optimal in other parameters, e.g., $\sigma^2$ and $L$. To make it optimal in all problem parameters, we can set $\eta = \min\Big\{1, \big(\frac{L\Lambda^0}{\sigma^2 T}\big)^{1/2}\Big\}$, similarly to the statement of Theorem 2.

K Revisiting the EF14-SGD Analysis under the BG and BGS Assumptions

In this section, we revisit the analysis of the original variant of error feedback (EF14-SGD) to showcase the difficulty of avoiding the BG/BGS assumptions commonly used in the nonconvex analysis of this variant. In summary, the key reason for the BG/BGS assumption is the need to bound the second term in (67) or (68). Recall that EF14-SGD has the update rule [Stich et al., 2018]
$$x^{t+1} = x^t - g^t, \qquad g^t = \frac1n\sum_{i=1}^n g_i^t, \qquad (64)$$
$$\text{EF14-SGD:}\qquad e_i^{t+1} = e_i^t + \gamma\nabla f_i(x^t,\xi_i^t) - g_i^t, \qquad g_i^{t+1} = \mathcal{C}\big(e_i^{t+1} + \gamma\nabla f_i(x^{t+1},\xi_i^{t+1})\big), \qquad (65)$$
where $\{e_i^t\}_{t\ge0}$ are the error/memory sequences with $e_i^0 = 0$ for each $i = 1,\dots,n$. Let $e^t := \frac1n\sum_{i=1}^n e_i^t$. The proof for this method relies on so-called perturbed iterate analysis, for which one defines a "virtual sequence" $\tilde x^t := x^t - e^t$. It is then verified by direct substitution that for any $t \ge 0$
$$\tilde x^{t+1} = \tilde x^t - \gamma\frac1n\sum_{i=1}^n \nabla f_i(x^t,\xi_i^t).$$
It follows from Lemma 9 in [Stich and Karimireddy, 2021] that for any $\gamma \le 1/(2L)$ and $t \ge 0$
$$\mathbb{E}\big[f(\tilde x^{t+1})\big] \le \mathbb{E}\big[f(\tilde x^t)\big] - \frac{\gamma}{4}\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] + \frac{\gamma^2 L\sigma^2}{2n} + \frac{\gamma L^2}{2}\,\mathbb{E}\big[\|e^t\|^2\big].$$
Telescoping the recursion above and setting $\delta^0 := f(x^0) - f^*$, we have
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{4\delta^0}{\gamma T} + \frac{2\gamma L\sigma^2}{n} + \frac{2L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|e^t\|^2\big]. \qquad (66)$$
Now it remains to efficiently bound the average error term $\mathbb{E}\big[\|e^t\|^2\big] = \mathbb{E}\big[\|\frac1n\sum_{i=1}^n e_i^t\|^2\big]$. By Jensen's inequality, we have $\mathbb{E}\big[\|e^t\|^2\big] \le \frac1n\sum_{i=1}^n \mathbb{E}\big[\|e_i^t\|^2\big]$, and we develop a bound for each $\mathbb{E}\big[\|e_i^t\|^2\big]$ individually. Denoting $z := e_i^t + \gamma\nabla f_i(x^t,\xi_i^t)$, we get
$$\mathbb{E}\big[\|e_i^{t+1}\|^2\big] = \mathbb{E}\big[\|\mathcal{C}(z) - z\|^2\big] \le (1-\alpha)\,\mathbb{E}\big[\|e_i^t + \gamma\nabla f_i(x^t,\xi_i^t)\|^2\big]$$
$$\le (1-\alpha)\Big(1+\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + (1-\alpha)\Big(1+\frac{2}{\alpha}\Big)\mathbb{E}\big[\|\gamma\nabla f_i(x^t,\xi_i^t)\|^2\big]$$
$$\le \Big(1-\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + \frac{3\gamma^2}{\alpha}\,\mathbb{E}\big[\|\nabla f_i(x^t,\xi_i^t)\|^2\big], \qquad (67)$$
where we used Definition 1 and Young's inequality.

BG assumption. If we assume bounded (stochastic) gradients (BG), i.e., $\mathbb{E}\big[\|\nabla f_i(x,\xi_i)\|^2\big] \le G^2$ for all $i = 1,\dots,n$, then using (67) we can derive
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|e^t\|^2\big] \le \frac{6\gamma^2 G^2}{\alpha^2}.$$
Combining this bound with (66), we have
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{4\delta^0}{\gamma T} + \frac{2\gamma L\sigma^2}{n} + \frac{12L^2\gamma^2 G^2}{\alpha^2}.$$
The step-size choice $\gamma = \min\Big\{\frac1L, \big(\frac{\delta^0\alpha^2}{TL^2G^2}\big)^{1/3}, \big(\frac{n\delta^0}{TL\sigma^2}\big)^{1/2}\Big\}$ allows us to bound the right-hand side by $\frac{12\delta^0}{\gamma T}$ and guarantees
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \mathcal{O}\left(\Big(\frac{L\delta^0 G}{\alpha T}\Big)^{2/3} + \Big(\frac{L\delta^0\sigma^2}{nT}\Big)^{1/2} + \frac{L\delta^0}{T}\right),$$
or, equivalently, a $T = \mathcal{O}\big(\frac{L\delta^0 G}{\alpha\varepsilon^3} + \frac{L\delta^0\sigma^2}{n\varepsilon^4} + \frac{L\delta^0}{\varepsilon^2}\big)$ sample complexity to find a stationary point. This analysis using the BG assumption and the derived sample complexity are essentially a simplified version of the one by Koloskova et al. [2020].Β²Β³
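To make updates (64)–(65) concrete, the following sketch implements one round of EF14-SGD, assuming a Top-$K$ compressor and a hypothetical oracle `stoch_grads(x)` that returns the $n$ stochastic gradients at $x$; the helper names are ours.

```python
import numpy as np

def topk(x, k):
    """Top-K contractive compressor; satisfies Definition 1 with alpha = k/d."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef14_sgd_step(x, e, g, grads, stoch_grads, gamma, k):
    """One round of EF14-SGD, updates (64)-(65).

    grads holds grad f_i(x^t, xi_i^t), the same samples used when g^t was formed;
    e, g, grads all have shape (n, d), with e^0 = 0.
    """
    x_new = x - g.mean(axis=0)                     # x^{t+1} = x^t - g^t                    (64)
    e_new = e + gamma * grads - g                  # e_i^{t+1} = e_i^t + gamma*grad - g_i^t (65)
    grads_new = stoch_grads(x_new)                 # grad f_i(x^{t+1}, xi_i^{t+1})
    g_new = np.stack([topk(e_new[i] + gamma * grads_new[i], k) for i in range(len(e))])
    return x_new, e_new, g_new, grads_new          # g_i^{t+1} = C(e_i^{t+1} + gamma*grad)  (65)
```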
BGS assumption. If we assume bounded gradient similarity (BGS), i.e., $\frac1n\sum_{i=1}^n\mathbb{E}\big[\|\nabla f_i(x) - \nabla f(x)\|^2\big] \le G^2$, we can slightly modify the derivation in (67) as follows:
$$\mathbb{E}\big[\|e_i^{t+1}\|^2\big] \le (1-\alpha)\,\mathbb{E}\big[\|e_i^t + \gamma\nabla f_i(x^t,\xi_i^t)\|^2\big]$$
$$= (1-\alpha)\,\mathbb{E}\big[\|e_i^t + \gamma\nabla f_i(x^t)\|^2\big] + (1-\alpha)\gamma^2\,\mathbb{E}\big[\|\nabla f_i(x^t,\xi_i^t) - \nabla f_i(x^t)\|^2\big]$$
$$\le (1-\alpha)\Big(1+\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + (1-\alpha)\Big(1+\frac{2}{\alpha}\Big)\mathbb{E}\big[\|\gamma\nabla f_i(x^t)\|^2\big] + \gamma^2\sigma^2$$
$$\le \Big(1-\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + \frac{3\gamma^2}{\alpha}\,\mathbb{E}\big[\|\nabla f_i(x^t)\|^2\big] + \gamma^2\sigma^2. \qquad (68)$$
Averaging the above inequalities over $i = 1,\dots,n$ and using the BGS assumption in the form $\frac1n\sum_{i=1}^n\mathbb{E}\big[\|\nabla f_i(x)\|^2\big] \le \mathbb{E}\big[\|\nabla f(x)\|^2\big] + G^2$, we can derive, via averaging over $t = 0,\dots,T-1$,
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|e^t\|^2\big] \le \frac{6\gamma^2}{\alpha^2}\cdot\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] + \frac{6\gamma^2 G^2}{\alpha^2} + \frac{2\gamma^2\sigma^2}{\alpha}.$$
Combining the above inequality with (66), we have
$$\Big(1 - \frac{12L^2\gamma^2}{\alpha^2}\Big)\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{4\delta^0}{\gamma T} + \frac{2\gamma L\sigma^2}{n} + \frac{12\gamma^2L^2G^2}{\alpha^2} + \frac{4\gamma^2L^2\sigma^2}{\alpha}.$$
By setting $\gamma = \min\Big\{\frac{\alpha}{4L}, \big(\frac{n\delta^0}{L\sigma^2 T}\big)^{1/2}, \big(\frac{\alpha^2\delta^0}{L^2G^2T}\big)^{1/3}, \big(\frac{\alpha\delta^0}{L^2\sigma^2 T}\big)^{1/3}\Big\}$, we have $1 - \frac{12L^2\gamma^2}{\alpha^2} \ge \frac14$, and the right-hand side is at most $\frac{16\delta^0}{\gamma T}$. Therefore,
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \mathcal{O}\left(\Big(\frac{L\delta^0 G}{\alpha T}\Big)^{2/3} + \Big(\frac{L\delta^0\sigma}{\sqrt{\alpha}\,T}\Big)^{2/3} + \Big(\frac{L\delta^0\sigma^2}{nT}\Big)^{1/2} + \frac{L\delta^0}{\alpha T}\right),$$
or, equivalently, a $T = \mathcal{O}\big(\frac{L\delta^0}{\alpha\varepsilon^2} + \frac{L\delta^0 G}{\alpha\varepsilon^3} + \frac{L\delta^0\sigma}{\sqrt{\alpha}\,\varepsilon^3} + \frac{L\delta^0\sigma^2}{n\varepsilon^4}\big)$ sample complexity. Notice that in the single-node case ($n = 1$) we have $G = 0$, and by Young's inequality
$$\Big(\frac{L\delta^0\sigma}{\sqrt{\alpha}\,T}\Big)^{2/3} \le \frac{L\delta^0}{\alpha T} + \Big(\frac{L\delta^0\sigma^2}{T}\Big)^{1/2}.$$
Therefore, the above rate recovers the one by Stich and Karimireddy [2021] in the single-node setting.

Β²Β³Up to a smoothness constant and the fact that Koloskova et al. [2020] work in a more general decentralized setting.
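This last Young's inequality step is easy to verify numerically. Below is a quick sketch over random, purely hypothetical values of $L\delta^0$, $\sigma$, $\alpha \in (0,1]$, and $T$:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    Ld = rng.uniform(0.1, 10.0)        # stands for L * delta^0
    sig = rng.uniform(0.1, 10.0)       # noise level sigma
    alpha = rng.uniform(0.01, 1.0)     # contraction parameter, alpha in (0, 1]
    T = rng.integers(100, 10_000)      # number of iterations
    lhs = (Ld * sig / (np.sqrt(alpha) * T)) ** (2.0 / 3.0)
    rhs = Ld / (alpha * T) + np.sqrt(Ld * sig**2 / T)
    print(f"lhs={lhs:.3e} <= rhs={rhs:.3e}: {bool(lhs <= rhs)}")
```

The check always passes, since the left-hand side equals $A^{1/3}B^{2/3}$ for $A = \frac{L\delta^0}{\alpha T}$ and $B = \big(\frac{L\delta^0\sigma^2}{T}\big)^{1/2}$, and Young's inequality gives $A^{1/3}B^{2/3} \le \frac13 A + \frac23 B \le A + B$.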