# Momentum Provably Improves Error Feedback!

Ilyas Fatkhullin (ETH AI Center & ETH Zurich), Alexander Tyurin (KAUST*), Peter Richtárik (KAUST)

Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when left untreated, the errors caused by compression propagate and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field in the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both theoretically and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al. [2021], known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum.

1 Introduction

Since the practical utility of modern machine learning models crucially depends on our ability to train them on large quantities of training data, it is imperative to perform the training in a distributed storage and compute environment. In federated learning (FL) [Konečný et al., 2016, Kairouz et al., 2019], for example, data is naturally stored in a distributed fashion across a large number of clients (who capture and own the data in the first place), and the goal is to train a single machine learning model from the wealth of all this distributed data, in a private fashion, directly on their devices.

1.1 Formalism

We consider the problem of collaborative training of a single model by several clients in a data-parallel fashion. In particular, we aim to solve the distributed nonconvex stochastic optimization problem

$$\min_{x \in \mathbb{R}^d} \left[ f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) \right], \qquad f_i(x) := \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[f_i(x, \xi_i)\right], \qquad i = 1, \ldots, n, \tag{1}$$

where $n$ is the number of clients, $x \in \mathbb{R}^d$ represents the parameters of the model we wish to train, and $f_i(x)$ is the (typically nonconvex) loss of the model parameterized by the vector $x$ on the data $\mathcal{D}_i$ owned by client $i$. Unlike most works in federated learning, we do not assume the datasets to be similar, i.e., we allow the distributions $\mathcal{D}_1, \ldots, \mathcal{D}_n$ to be arbitrarily different.

*King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
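To make the formalism of (1) concrete, the following minimal NumPy sketch instantiates the objective with synthetic per-client data; the quadratic per-sample loss and the client distributions are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 10, 100  # clients, dimension, samples per client (illustrative)

# Each client i holds its own dataset D_i; the distributions may differ arbitrarily.
client_data = [rng.normal(loc=i, scale=1.0, size=(m, d)) for i in range(n)]

def f_i(x, data):
    """Empirical estimate of f_i(x) = E_{xi ~ D_i}[f_i(x, xi)];
    here f_i(x, xi) = 0.5 * ||x - xi||^2 is a stand-in loss."""
    return 0.5 * np.mean(np.sum((x - data) ** 2, axis=1))

def f(x):
    """Global objective (1): f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([f_i(x, D) for D in client_data])

x0 = np.zeros(d)
print(f"f(x0) = {f(x0):.3f}")
```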
We are interested in the fundamental problem of finding an approximately stationary point of $f$ in expectation, i.e., we wish to find a (possibly random) vector $\hat{x} \in \mathbb{R}^d$ such that $\mathbb{E}\left[\|\nabla f(\hat{x})\|\right] \le \varepsilon$. In order to solve this problem, we assume that the $n$ clients communicate via an orchestrating server. Typically, the role of the server is to first perform aggregation of the messages obtained from the workers, and to subsequently broadcast the aggregated information back to the workers. Following an implicit assumption made in virtually all theoretically-focused papers on communication-efficient training, we also assume that the server-to-workers broadcast is so fast (compared to the speed of workers-to-server communication) that the cost associated with broadcast can be neglected².

1.2 Aiming for communication and computation efficiency at the same time

In our work, we pay attention to two key aspects of efficient distributed training: the communication cost and the computation cost of finding an approximate stationary point $\hat{x}$. The former refers to the number of bits that need to be communicated by the workers to the server, and the latter refers to the number of stochastic gradients that need to be sampled by each client. The rest of the paper can be summarized as follows: we pick one of the most popular communication-efficient gradient-type methods (the EF21 method of Richtárik et al. [2021], the latest variant of error feedback pioneered by Seide et al. [2014]) and modify it in a way which provably preserves its communication complexity, but massively improves its computation/sample complexity, both theoretically and in practice.

2 Communication Compression, Error Feedback, and Sample Complexity

Communication compression techniques such as quantization [Alistarh et al., 2017, Horváth et al., 2019a] and sparsification [Seide et al., 2014, Beznosikov et al., 2020] are known to be immensely powerful for reducing the communication footprint of gradient-type³ methods. Arguably the most studied, versatile and practically useful class of compression mappings are contractive compressors.

Definition 1 (Contractive compressors). We say that a (possibly randomized) mapping $\mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d$ is a contractive compression operator if there exists a constant $0 < \alpha \le 1$ such that

$$\mathbb{E}\left[\|\mathcal{C}(x) - x\|^2\right] \le (1 - \alpha)\|x\|^2, \qquad \forall x \in \mathbb{R}^d. \tag{2}$$

Inequality (2) is satisfied by a vast array of compressors considered in the literature, including numerous variants of sparsification operators [Alistarh et al., 2018, Stich et al., 2018], quantization operators [Alistarh et al., 2017, Horváth et al., 2019a], low-rank approximation [Vogels et al., 2019, Safaryan et al., 2022], and more [Beznosikov et al., 2020, Safaryan et al., 2021]. The canonical examples are i) the Top$K$ sparsifier, which preserves the $K$ largest components of $x$ in magnitude and sets all remaining coordinates to zero [Stich et al., 2018], and ii) the (scaled) Rand$K$ sparsifier, which preserves a subset of $K$ components of $x$ chosen uniformly at random and sets all remaining coordinates to zero [Beznosikov et al., 2020]. In both cases, (2) is satisfied with $\alpha = K/d$; see the code sketch below.

2.1 Brief history of error feedback

When greedy contractive compressors, such as Top$K$, are used in a direct way to compress the local gradients in distributed gradient descent (GD), the resulting method may diverge exponentially, even on strongly convex quadratics [Beznosikov et al., 2020].
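For concreteness, here is a minimal NumPy sketch of the two canonical compressors from Definition 1; the function names are ours, and this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """TopK: keep the k largest-magnitude entries of x, zero out the rest.
    Deterministic and biased; satisfies (2) with alpha = k/d."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandK: keep k uniformly random entries of x, zero out the rest.
    Also satisfies (2) with alpha = k/d."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
for C, name in [(lambda v: top_k(v, 10), "TopK"),
                (lambda v: rand_k(v, 10, rng), "RandK")]:
    err = np.linalg.norm(C(x) - x) ** 2 / np.linalg.norm(x) ** 2
    print(f"{name}: ||C(x)-x||^2 / ||x||^2 = {err:.4f}  (<= 1 - 10/1000 = 0.99)")
```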
Empirically, instability caused by such a naive application of greedy compressors was observed much earlier, and a fix was proposed in the form of the error feedback (EF) mechanism by Seide et al. [2014], which we henceforth call EF14 or EF14-SGD (in the stochastic case).⁴ To the best of our knowledge, the best sample complexity of EF14-SGD for finding a stationary point in the distributed nonconvex setting is given by Koloskova et al. [2020]: after $\mathcal{O}\left(G\alpha^{-1}\varepsilon^{-3} + \sigma^2 n^{-1}\varepsilon^{-4}\right)$ samples⁵, EF14-SGD finds a point $x$ such that $\mathbb{E}\left[\|\nabla f(x)\|\right] \le \varepsilon$, where $\alpha$ is the contraction parameter (see Definition 1).

²While this is a reasonable assumption in many practical situations [Mishchenko et al., 2019, Kairouz et al., 2019], some works consider the regime when the server-to-workers broadcast cannot be neglected [Horváth et al., 2019a, Tang et al., 2020, Philippenko and Dieuleveut, 2020, Kovalev et al., 2021, Fatkhullin et al., 2021, Gruntkowska et al., 2022].
³For Newton-type methods, see [Islamov et al., 2022] and references therein.
⁴In Appendix A, we provide a more detailed discussion of theoretical developments for this method.
⁵Here $\sigma^2$ is the bound on the variance of stochastic gradients at each node, see Assumption 2. When referring to the sample complexity, we count the number of stochastic gradients used at only one of the $n$ nodes rather than by all nodes in total. This is a meaningful notion because the computations are done in parallel.

Figure 1: Divergence of EF21-SGD on the quadratic function $f(x) = \frac{1}{2}\|x\|^2$, $x \in \mathbb{R}^2$, using the Top1 compressor. Panel (a): divergence for $n = 1$ (EF21-SGD-ideal, EF21-SGD, EF21-SGDM). Panel (b): no improvement with $n$ (EF21-SGD with $n \in \{50, 500, 5000, 10000\}$). See the proof of Theorem 1 for details on the construction of the noise $\xi$; we use $\sigma = 1$, $B = 1$. The starting point is $x^0 = (0, -0.01)^\top$. Unlike EF21-SGD, our method EF21-SGDM does not suffer from divergence and is stable near the optimum. Figure 1b shows that when increasing the number of nodes $n$, EF21-SGD applied with $B = 1$ does not improve, and, moreover, diverges from the optimum even faster. All experiments use constant parameters $\gamma = \eta = 0.1/\sqrt{T} = 10^{-3}$; see Figure 4 for diminishing parameters. Each method is run 10 times and the plot shows the median performance alongside the 25% and 75% quantiles.

However, such an analysis has two important deficiencies. First, in the deterministic case (when exact gradients are computable by each node), the analysis only gives an $\mathcal{O}(\varepsilon^{-3})$ iteration complexity, which is suboptimal compared to vanilla (i.e., non-compressed) gradient descent, whose iteration complexity is $\mathcal{O}(\varepsilon^{-2})$. Second, their analysis relies heavily on additional strong assumptions, such as the bounded gradient (BG) assumption, $\mathbb{E}\left[\|\nabla f_i(x, \xi_i)\|^2\right] \le G^2$ for all $x \in \mathbb{R}^d$, $i \in [n]$, $\xi_i \sim \mathcal{D}_i$, or the bounded gradient similarity (BGS) assumption, $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le G^2$ for all $x \in \mathbb{R}^d$. Such assumptions are restrictive and sometimes even unrealistic. In particular, both BG and BGS might not hold even in the case of convex quadratic functions.⁶ Moreover, it was recently shown that a nonconvex analysis of stochastic gradient methods using a BG assumption may hide an exponential dependence on the smoothness constant in the complexity [Yang et al., 2023]. In 2021, these issues were partially resolved by Richtárik et al.
[2021], who proposed a modification of the EF mechanism, which they call EF21. They addressed both deficiencies of the original EF14 method: i) they removed the BG/BGS assumptions, and ii) they improved the iteration complexity to $\mathcal{O}(\varepsilon^{-2})$ in the full gradient regime. Subsequently, the EF21 method was modified in several directions, e.g., extended to bidirectional compression, variance reduction and the proximal setup [Fatkhullin et al., 2021], generalized from contractive to three-point compressors [Richtárik et al., 2022] and adaptive compressors [Makarenko et al., 2022], modified from dual (gradient) to primal (model) compression [Gruntkowska et al., 2022] and from the centralized to the decentralized setting [Zhao et al., 2022]. For further work, we refer to [Wang et al., 2022, Dorfman et al., 2023, Islamov et al., 2022].

2.2 Key issue: error feedback has an unhealthy appetite for samples!

Unfortunately, the current theory of EF21 with stochastic gradients has weak sample complexity guarantees. In particular, Fatkhullin et al. [2021] extended the EF21-GD method, which is the basic variant of EF21 using full gradients at the clients, to EF21-SGD, which uses a large minibatch of stochastic gradients instead. They obtained an $\mathcal{O}\left(\frac{1}{\alpha\varepsilon^2} + \frac{\sigma^2}{\alpha^3\varepsilon^4}\right)$ sample complexity for their method. Later, Zhao et al. [2022] improved this result slightly⁷ to $\mathcal{O}\left(\frac{1}{\alpha\varepsilon^2} + \frac{\sigma^2}{\alpha^2\varepsilon^4}\right)$, shaving off one $\alpha$ in the stochastic term. However, it is easy to notice several issues in these results, which reflect the fundamental challenge of combining biased gradient methods with stochastic gradients.

Mega-batches. These works require all clients to sample mega-batches of stochastic gradients/datapoints in each iteration, of order $\mathcal{O}(\varepsilon^{-2})$, in order to control the variance coming from stochastic gradients. In Figure 1, we find that, in fact, a batch-free (i.e., with mini-batch size $B = 1$) version of EF21-SGD diverges even on a very simple quadratic function. We also observe a similar behavior when a small batch $B > 1$ is applied. This implies that there is a fundamental flaw in the EF21-SGD method itself, rather than just a problem with the theoretical analysis. While mega-batch methods are common in the optimization literature, smaller batches are often preferred whenever they "work". For example, the time/cost required to obtain such a large number of samples at each iteration might be unreasonably large compared to the communication time, which is already reduced using compression. Moreover, when dealing with medical data, large batches might simply be unavailable [Rieke et al., 2020]. In certain applications, such as federated reinforcement learning (RL) or multi-agent RL, it is often intractable to sample more than one trajectory of the environment in order to form a gradient estimator [Mitra et al., 2023, Doan et al., 2019, Jin et al., 2022, Khodadadian et al., 2022].

⁶For example, one can consider $f_i(x) = x^\top A_i x$ with $A_i \in \mathbb{R}^{d \times d}$, for which the BG or BGS assumptions hold only in the trivial cases: the matrices $A_i$ are all zero or all equal to each other (homogeneous data regime).
⁷The result was obtained under the more general setting of decentralized optimization over a network.
Further, a method using a mega-batch at each iteration effectively follows the gradient descent (GD) dynamics instead of the dynamics of (mini-batch) SGD, which may hinder the training and generalization performance of such algorithms, since it is both empirically [Keskar et al., 2017, Kleinberg et al., 2018] and theoretically [Kale et al., 2021] observed that mini-batch SGD is superior to mega-batch SGD or GD in a number of machine learning tasks.

Dependence on $\alpha$. The total sample complexity results derived by Fatkhullin et al. [2021], Zhao et al. [2022] suffer from a poor dependence on the contraction parameter $\alpha$. Typically, EF methods are used with the Top$K$ sparsifier, which only communicates the $K$ largest entries in magnitude. In this case, $\alpha = K/d$, and the stochastic part of the sample complexity scales quadratically with the dimension.

No improvement with $n$. The stochastic term in the sample complexity of EF21-SGD does not improve when increasing the number of nodes. However, the opposite behavior is typically desired, and is present in several recent non-EF methods based on unbiased compressors, such as MARINA [Gorbunov et al., 2021] and DASHA [Tyurin and Richtárik, 2022]. We are not aware of any distributed algorithms utilizing the Top$K$ compressor that achieve linear speedup in $n$ in the stochastic term without relying on the restrictive BG or BGS assumptions.

These observations motivate our work with the following central questions: Can we design a batch-free distributed SGD method utilizing contractive communication compression (such as Top$K$) without relying on the restrictive BG/BGS assumptions? Is it possible to improve over the current state-of-the-art $\mathcal{O}\left(\alpha^{-1}\varepsilon^{-2} + \sigma^2\alpha^{-2}\varepsilon^{-4}\right)$ sample complexity under the standard smoothness and bounded variance assumptions? We answer both questions in the affirmative by incorporating a momentum update into EF21-SGD.

2.3 Mysterious effectiveness of momentum in nonconvex optimization

An immensely popular modification of SGD (and its distributed variants) is the use of momentum. This technique, initially inspired by developments in convex optimization [Polyak, 1964], is often applied in machine learning for stabilizing convergence and speeding up training. In particular, momentum is an important part of an immensely popular and empirically successful line of adaptive methods for deep learning, including ADAM [Kingma and Ba, 2015] and a plethora of variants. The classical SGD method with Polyak (i.e., heavy ball) momentum (SGDM) reads:

$$x^{t+1} = x^t - \gamma v^t, \qquad v^{t+1} = (1 - \eta)v^t + \eta\nabla f(x^{t+1}, \xi^{t+1}), \tag{3}$$

where $\gamma > 0$ is a learning rate and $\eta > 0$ is the momentum parameter (a minimal code sketch is given below). We provide a concise walk through the key theoretical developments in the analysis of SGDM in stochastic nonconvex optimization in Appendix A, and only mention the most relevant works here. The works most closely related to ours are [Mishchenko et al., 2019], [Xie et al., 2020], and [Fatkhullin et al., 2021], which analyze momentum together with communication compression. The analysis in [Mishchenko et al., 2019, Xie et al., 2020] requires the BG/BGS assumption, and does not provide any theoretical improvement over the variants without momentum. Finally, the analysis of Fatkhullin et al. [2021] is only established for the deterministic case, and it is unclear whether its extension to the stochastic case can bring any convergence improvement over EF21-SGD.
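The update (3) is easy to state in code. The following sketch runs SGDM on a toy quadratic with additive Gaussian gradient noise; the objective and noise model are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 1.0

def grad(x):
    return x  # gradient of f(x) = 0.5 * ||x||^2 (illustrative objective)

def stoch_grad(x):
    return grad(x) + sigma * rng.normal(size=d)  # unbiased; E||noise||^2 = d * sigma^2

def sgdm(x0, gamma=0.1, eta=0.1, T=1000):
    """SGDM, Eq. (3): x_{t+1} = x_t - gamma * v_t,
    v_{t+1} = (1 - eta) * v_t + eta * grad_f(x_{t+1}, xi_{t+1})."""
    x, v = x0.copy(), stoch_grad(x0)
    for _ in range(T):
        x = x - gamma * v
        v = (1 - eta) * v + eta * stoch_grad(x)
    return x

x_final = sgdm(rng.normal(size=d))
print(f"||grad f(x_T)|| = {np.linalg.norm(grad(x_final)):.4f}")
```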
Recently, several other works have attempted to explain the benefits of momentum [Plattner, 2022]; some consider structured nonconvex problems [Wang and Abernethy, 2021], and others focus on generalization [Jelassi and Li, 2022].

| Method | Communication complexity | Asymptotic sample complexity | Batch-free? | No extra assumptions? |
|---|---|---|---|---|
| EF14-SGD [Koloskova et al., 2020] | – | $\mathcal{O}\left(\frac{L_{\max}\sigma^2}{n\varepsilon^4}\right)$ | ✓ | ✗ (a) |
| NEOLITHIC [Huang et al., 2022] | $\mathcal{O}\left(\frac{K}{\alpha\varepsilon^2}\log\frac{G}{\varepsilon}\right)$ (b) | – | ✗ | ✗ (c) |
| EF21-SGD [Fatkhullin et al., 2021] | – | – | ✗ (d) | ✓ |
| BEER [Zhao et al., 2022] | – | – | ✗ (d) | ✓ |
| EF21-SGDM (Corollary 2) | $\mathcal{O}\left(\frac{K}{\alpha\varepsilon^2}\right)$ | $\mathcal{O}\left(\frac{L\sigma^2}{n\varepsilon^4}\right)$ | ✓ | ✓ |

(a) The analysis requires a bound on the second moment of the stochastic gradients, i.e., $\mathbb{E}\left[\|\nabla f_i(x, \xi_i)\|^2\right] \le G^2$ for all $x \in \mathbb{R}^d$.
(b) This complexity is achieved by using a large mini-batch and communicating $K/\alpha$ coordinates per iteration, see Appendix A.
(c) The analysis requires a bounded gradient dissimilarity assumption, i.e., $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le G^2$ for all $x \in \mathbb{R}^d$.
(d) The analysis requires a batch size of at least $B \ge \frac{\sigma^2}{\alpha^2\varepsilon^2}$ for EF21-SGD and $B \ge \frac{\sigma^2}{\alpha\varepsilon^2}$ for BEER.

Table 1: Summary of related works on distributed error compensated SGD methods using a Top$K$ compressor under Assumptions 1 and 2. The goal is to find an $\varepsilon$-stationary point of a smooth nonconvex function of the form (1), i.e., a point $x$ such that $\mathbb{E}\left[\|\nabla f(x)\|\right] \le \varepsilon$. "Communication complexity": the total number of communicated bits if the method is applied with a sufficiently large batch size; see Table 2 for the batch size. "Asymptotic sample complexity": the total number of samples required at each node to find an $\varepsilon$-stationary point with batch size $B = 1$ in the regime $\varepsilon \to 0$. "No extra assumptions": no additional assumption beyond Assumptions 1 and 2 is required.

Summary of contributions. Despite the vast amount of work trying to explain the benefits of momentum, there is no work obtaining any theoretical improvement over vanilla SGD in the smooth nonconvex setting under the standard assumptions of smoothness and bounded variance.

- First, we establish a negative result for a simplified/idealized version of EF21-SGD, which shows that this algorithm does not converge with a constant batch size, and that a mega-batch of order $\Omega(\sigma^2\varepsilon^{-2})$ is required. This provides a strong indication that the EF21-SGD method is inherently sensitive to stochastic gradients, which is also confirmed by our numerical simulations.
- We propose a simple fix for this problem by incorporating a momentum step into EF21-SGD, which leads to our one-batch EF21-SGDM algorithm. By leveraging our new Lyapunov function construction and new analysis, we establish an $\mathcal{O}\left(\alpha^{-1}\varepsilon^{-2} + \sigma^2\varepsilon^{-4}\right)$ sample complexity in the single node case.
- We extend our algorithm to the distributed setting and derive an improved sample complexity result compared to other methods using the Top$K$ compressor, without resorting to the BG/BGS assumptions. In particular, EF21-SGDM achieves the asymptotically optimal $\mathcal{O}\left(\sigma^2 n^{-1}\varepsilon^{-4}\right)$ sample complexity. Moreover, when EF21-SGDM is applied with a large enough batch size, we prove that it reaches the optimal communication complexity $\mathcal{O}\left(K\alpha^{-1}\varepsilon^{-2}\right)$; see Tables 1 & 2 for more details.
- Finally, we propose a double momentum variant of EF21-SGDM, and find that it further improves the sample complexity of EF21-SGDM in the non-asymptotic regime.

We highlight that, interestingly, we prove that momentum helps: EF21-SGDM is theoretically better than its non-momentum variant, large-batch EF21-SGD.
We believe that our new technique can be extended in many ways, e.g., to other biased compressors and other (biased) communication saving techniques such as lazy aggregation of gradients, model compression, bidirectional compression, partial participation, decentralized training, and adaptive compression; and to other important optimization techniques involving biased updates, such as proximal SGD with momentum, gradient clipping and adaptive step-size schedules. We also hope that our proof techniques can be useful to establish linear speedup for other classes of distributed methods, e.g., algorithms based on local training such as ProxSkip/Scaffnew [Mishchenko et al., 2022]. Additionally, we extend our results to the class of absolute compressors in Appendix H and study the variance reduced variant of error feedback in Appendix I.

3 Main Results

Throughout the paper we work under the following standard assumptions.

Assumption 1 (Smoothness and lower boundedness). We assume that $f$ has $L$-Lipschitz gradient, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$, and each $f_i$ has $L_i$-Lipschitz gradient, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i\|x - y\|$ for all $i \in [n]$, $x, y \in \mathbb{R}^d$. We denote $\widetilde{L}^2 := \frac{1}{n}\sum_{i=1}^{n} L_i^2$. Moreover, we assume that $f$ is lower bounded, i.e., $f^* := \inf_{x \in \mathbb{R}^d} f(x) > -\infty$.

Assumption 2 (Bounded variance (BV)). There exists $\sigma > 0$ such that

$$\mathbb{E}\left[\|\nabla f_i(x, \xi_i) - \nabla f_i(x)\|^2\right] \le \sigma^2, \qquad \forall x \in \mathbb{R}^d, \ i \in [n], \tag{4}$$

where $\xi_i \sim \mathcal{D}_i$ are i.i.d. random samples for each $i \in [n]$.

3.1 A deeper dive into the issues EF21 has with stochastic gradients

As remarked before, the current analysis of EF21 in the stochastic setting requires each client to sample a mega-batch in each iteration, and it is not clear how to avoid this. In order to understand this phenomenon, we propose to step back and examine an idealized version of EF21-SGD, which we call EF21-SGD-ideal, defined by the update rules (5a) + (5aa):

$$x^{t+1} = x^t - \gamma g^t, \qquad g^t = \frac{1}{n}\sum_{i=1}^{n} g_i^t, \tag{5a}$$

$$\text{EF21-SGD-ideal:} \qquad g_i^{t+1} = \nabla f_i(x^{t+1}) + \mathcal{C}\left(\nabla f_i(x^{t+1}, \xi_i^{t+1}) - \nabla f_i(x^{t+1})\right), \tag{5aa}$$

$$\text{EF21-SGD:} \qquad g_i^{t+1} = g_i^t + \mathcal{C}\left(\nabla f_i(x^{t+1}, \xi_i^{t+1}) - g_i^t\right). \tag{5ab}$$

Compared to EF21-SGD, given by (5a) + (5ab), we replace the previous state $g_i^t$ by the exact gradient at the current iteration. Since EF21-SGD heavily relies on the approximation $g_i^t \approx \nabla f_i(x^{t+1})$, and, according to the proof of convergence of EF21-SGD, this discrepancy tends to zero as $t \to \infty$, this change can only improve the method. While we admit this is a conceptual algorithm only (it does not lead to any communication or sample complexity reduction in practice)⁸, it serves us well to illustrate the drawbacks of EF21-SGD. We now establish the following negative result for EF21-SGD-ideal.

Theorem 1. Let $L, \sigma > 0$, $0 < \gamma \le 1/L$ and $n = 1$. There exists a convex, $L$-smooth function $f: \mathbb{R}^2 \to \mathbb{R}$, a contractive compressor $\mathcal{C}(\cdot)$ satisfying Definition 1, and an unbiased stochastic gradient with bounded variance $\sigma^2$ such that if the method EF21-SGD-ideal ((5a) + (5aa)) is run with step-size $\gamma$, then for all $T \ge 0$ and for all $x^0 \in \{(0, x^0_{(2)})^\top \in \mathbb{R}^2 \mid x^0_{(2)} < 0\}$, we have $\mathbb{E}\left[\|\nabla f(x^T)\|^2\right] \ge \frac{1}{60}\min\left\{\sigma^2, \|\nabla f(x^0)\|^2\right\}$. Fix $0 < \varepsilon \le L/\sqrt{60}$ and $x^0 = (0, -1)^\top$. Additionally, assume that $n \ge 1$ and the variance of the unbiased stochastic gradient is controlled by $\sigma^2/B$ for some $B \ge 1$. If $B < \frac{\sigma^2}{60\varepsilon^2}$, then we have $\mathbb{E}\left[\|\nabla f(x^T)\|\right] > \varepsilon$ for all $T \ge 0$.
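To make the update rules (5a), (5aa) and (5ab) concrete, here is a minimal single-node ($n = 1$) NumPy sketch on the quadratic $f(x) = \frac{1}{2}\|x\|^2$ from Figure 1. A caveat: for simplicity we use generic Gaussian gradient noise, whereas Theorem 1 and Figure 1 rely on the adversarial noise constructed in the proof, so this sketch only illustrates the mechanics of the updates and need not reproduce the divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def top1(u):
    out = np.zeros_like(u)
    j = np.argmax(np.abs(u))
    out[j] = u[j]
    return out

grad = lambda x: x                              # gradient of f(x) = 0.5 * ||x||^2
sgrad = lambda x: grad(x) + rng.normal(size=2)  # generic Gaussian noise, sigma = 1

def run(ideal, gamma=1e-3, T=10_000):
    x = np.array([0.0, -0.01])
    g = sgrad(x)
    for _ in range(T):
        x = x - gamma * g                        # (5a)
        if ideal:                                # (5aa): EF21-SGD-ideal
            g = grad(x) + top1(sgrad(x) - grad(x))
        else:                                    # (5ab): EF21-SGD
            g = g + top1(sgrad(x) - g)
    return np.linalg.norm(grad(x))

print("EF21-SGD-ideal final ||grad f||:", run(ideal=True))
print("EF21-SGD       final ||grad f||:", run(ideal=False))
```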
The above theorem implies that the method (5a) + (5aa) does not converge with a small batch size (e.g., equal to one) for any fixed step-size choice.⁹ Moreover, in the distributed setting with $n$ nodes, a mini-batch of order $B = \Omega\left(\sigma^2/\varepsilon^2\right)$ is required for convergence. Notice that this batch size is independent of $n$, which further implies that a linear speedup in the number of nodes $n$ cannot be achieved for this method. While we only prove these negative results for an "idealized" version of EF21-SGD rather than for the method itself, in Figures 1a and 4a we empirically verify that EF21-SGD also suffers from a similar divergence on the same problem instance provided in the proof of Theorem 1. Additionally, Figures 1b and 4b illustrate that the situation does not improve for EF21-SGD when increasing $n$.

3.2 Momentum for avoiding mega-batches

Let us now focus on the single node setting¹⁰ and try to fix the divergence issue shown above. As we can learn from Theorem 1, the key reason for the non-convergence of EF21-SGD is that even if the state vector $g^t$ sufficiently approximates the current gradient, i.e., $g^t \approx \nabla f(x^{t+1})$, the design of this method cannot guarantee that the quantity $\|g^t - \nabla f(x^t)\|^2 \approx \|\mathcal{C}(\nabla f(x^t, \xi^t) - \nabla f(x^t))\|^2$ is small enough. Indeed, the last term above can be bounded by $2(2 - \alpha)\sigma^2$ in expectation, but this is not sufficient for convergence, as formally illustrated in Theorem 1. To fix this problem, we propose to modify our idealized EF21-SGD-ideal method so that the compressed difference can be controlled and made arbitrarily small, which leads us to another (more advanced) conceptual algorithm, EF21-SGDM-ideal:

$$v^{t+1} = \nabla f(x^{t+1}) + \eta\left(\nabla f(x^{t+1}, \xi^{t+1}) - \nabla f(x^{t+1})\right), \qquad g^{t+1} = \nabla f(x^{t+1}) + \mathcal{C}\left(v^{t+1} - \nabla f(x^{t+1})\right). \tag{6}$$

In this method, instead of using $v^{t+1} = \nabla f(x^{t+1}, \xi^{t+1})$ as in EF21-SGD-ideal, we introduce a correction, which allows us to control the variance of the difference $\nabla f(x^{t+1}, \xi^{t+1}) - \nabla f(x^{t+1})$. This allows us to derive the following convergence result. Let $\delta^0 := f(x^0) - f^*$.

Proposition 1. Let Assumptions 1, 2 hold, and let $\mathcal{C}$ satisfy Definition 1. Let $g^0 = 0$ and the step-size in method (5a) + (6) be set as $\gamma \le 1/L$. Let $\hat{x}^T$ be sampled uniformly at random from the iterates of the method. Then for any $\eta > 0$, after $T$ iterations we have $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \frac{2\delta^0}{\gamma T} + \mathcal{O}\left(\eta^2\sigma^2\right)$.

Notice that if $\eta = 1$, then algorithm EF21-SGDM-ideal (5a) + (6) reduces to the EF21-SGD-ideal method (5a) + (5aa), and this result shows that the lower bound for the batch size established in Theorem 1 is tight, i.e., $B = \Theta(\sigma^2/\varepsilon^2)$ is necessary and sufficient¹¹ for convergence. For $\eta < 1$, the above proposition suggests that using a small enough parameter $\eta$, the variance term can be completely eliminated. This observation motivates us to design a practical variant of this method. Similarly to the design of the EF21 mechanism (from EF21-SGD-ideal), we propose to do this by replacing the exact gradients $\nabla f(x^{t+1})$ by the state vectors $v^t$ and $g^t$ as follows:

$$\text{EF21-SGDM:} \qquad v^{t+1} = v^t + \eta\left(\nabla f(x^{t+1}, \xi^{t+1}) - v^t\right), \qquad g^{t+1} = g^t + \mathcal{C}\left(v^{t+1} - g^t\right). \tag{7}$$

⁸This is because full gradients would need to be computed and communicated for its implementation. Notice also that if $\sigma = 0$, this method becomes exact distributed gradient descent.
⁹In fact, the example can be easily extended to the case of polynomially decaying step-size.
¹⁰In this case, we can drop the index $i$ everywhere and write $g_i^t = g^t$, $\xi_i^t = \xi^t$ for all $t \ge 0$.

Theorem 2. Let Assumptions 1, 2 hold, and let $\mathcal{C}$ satisfy Definition 1.
Let the method (5a) + (7) be run with $g^0 = v^0 = \nabla f(x^0)$, and let $\hat{x}^T$ be sampled uniformly at random from the iterates of the method. Then for all $\eta \in (0, 1]$, with $\gamma \le \gamma_0 = \frac{\min\{\alpha, \eta\}}{7L}$, we have

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{\delta^0}{\gamma T} + \eta\sigma^2\right).$$

The choice $\eta = \min\left\{1, \left(\frac{L\delta^0}{\sigma^2 T}\right)^{1/2}\right\}$, $\gamma = \gamma_0$ results in

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{L\delta^0}{\alpha T} + \left(\frac{L\delta^0\sigma^2}{T}\right)^{1/2}\right).$$

Compared to Proposition 1, where $\eta$ can be made arbitrarily small, Theorem 2 suggests that there is a trade-off in the choice of $\eta \in (0, 1]$ in algorithm (5a) + (7). The above theorem implies that in the single node setting, EF21-SGDM has an $\mathcal{O}\left(\frac{L\delta^0}{\alpha\varepsilon^2} + \frac{L\delta^0\sigma^2}{\varepsilon^4}\right)$ sample complexity. For $\alpha = 1$, this result matches the sample complexity of SGD and is known to be unimprovable under Assumptions 1 and 2 [Arjevani et al., 2019]. Moreover, when $\alpha = 1$, our sample complexity matches previous analyses of momentum methods in [Liu et al., 2020] and [Defazio, 2021]. However, even in this single node ($n = 1$), uncompressed ($\alpha = 1$) setting, our analysis is different from previous work; in particular, our choice of the momentum parameter and the Lyapunov function are different, see Appendices A and J. For $\alpha < 1$, the above result matches the sample complexity of EF14-SGD (in the single node setting) [Stich and Karimireddy, 2019], which was recently shown to be optimal [Huang et al., 2022] for biased compressors satisfying Definition 1. Notice, however, that the extension of the analysis by Stich and Karimireddy [2019] for EF14-SGD to the distributed setting meets additional challenges, and it is unclear whether it is possible without imposing additional BG or BGS assumptions as in [Koloskova et al., 2020]. We revisit this analysis in Appendix K to showcase the difficulty of removing BG/BGS. In the following, we demonstrate the benefit of our EF21-SGDM method by extending it to the distributed setting without imposing any additional assumptions.

3.3 Distributed stochastic error feedback with momentum

We are now ready to present a distributed variant of EF21-SGDM; see Algorithm 1. Letting $\delta^t := f(x^t) - f^*$, our convergence analysis of this method relies on the monotonicity of the following Lyapunov function:

$$\Lambda^t := \delta^t + \frac{\gamma}{\alpha n}\sum_{i=1}^{n}\left\|g_i^t - v_i^t\right\|^2 + \frac{\gamma\eta}{\alpha^2 n}\sum_{i=1}^{n}\left\|v_i^t - \nabla f_i(x^t)\right\|^2 + \gamma\left\|\frac{1}{n}\sum_{i=1}^{n}\left(v_i^t - \nabla f_i(x^t)\right)\right\|^2. \tag{8}$$

¹¹This follows by replacing $\sigma^2$ in the batch-free algorithm by $\sigma^2/B$ if a batch size $B > 1$ is used.

Algorithm 1 EF21-SGDM (Error Feedback 2021 Enhanced with Polyak Momentum)
1: Input: starting point $x^0$, step-size $\gamma > 0$, momentum $\eta \in (0, 1]$, initial batch size $B_{\text{init}}$
2: Initialize $v_i^0 = g_i^0 = \frac{1}{B_{\text{init}}}\sum_{j=1}^{B_{\text{init}}}\nabla f_i(x^0, \xi_{i,j}^0)$ for $i = 1, \ldots, n$; $g^0 = \frac{1}{n}\sum_{i=1}^{n} g_i^0$
3: for $t = 0, 1, 2, \ldots, T - 1$ do
4:   Master computes $x^{t+1} = x^t - \gamma g^t$ and broadcasts $x^{t+1}$ to all nodes
5:   for all nodes $i = 1, \ldots, n$ in parallel do
6:     Compute momentum estimator $v_i^{t+1} = (1 - \eta)v_i^t + \eta\nabla f_i(x^{t+1}, \xi_i^{t+1})$
7:     Compress $c_i^{t+1} = \mathcal{C}(v_i^{t+1} - g_i^t)$ and send $c_i^{t+1}$ to the master
8:     Update local state $g_i^{t+1} = g_i^t + c_i^{t+1}$
9:   end for
10:  Master computes $g^{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{t+1}$ via $g^{t+1} = g^t + \frac{1}{n}\sum_{i=1}^{n} c_i^{t+1}$
11: end for

Convergence of EF21-SGDM with contractive compressors. We obtain the following result:

Theorem 3. Let Assumptions 1 and 2 hold. Let $\hat{x}^T$ be sampled uniformly at random from the $T$ iterates of the method. Let EF21-SGDM (Algorithm 1) be run with a contractive compressor. For all $\eta \in (0, 1]$ and $B_{\text{init}} \ge 1$, with $\gamma \le \min\left\{\frac{\alpha}{20\widetilde{L}}, \frac{\eta}{7L}\right\}$, we have

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{\Lambda^0}{\gamma T} + \frac{\eta^3\sigma^2}{\alpha^2} + \frac{\eta^2\sigma^2}{\alpha} + \frac{\eta\sigma^2}{n}\right),$$

where $\Lambda^0$ is given by (8).
Choosing the batch size $B_{\text{init}} = \frac{\sigma^2}{L\delta^0}$, the step size $\gamma = \min\left\{\frac{\alpha}{20\widetilde{L}}, \frac{\eta}{7L}\right\}$, and the momentum parameter¹² $\eta = \min\left\{1, \left(\frac{L\delta^0\alpha^2}{\sigma^2 T}\right)^{1/4}, \left(\frac{L\delta^0\alpha}{\sigma^2 T}\right)^{1/3}, \left(\frac{L\delta^0 n}{\sigma^2 T}\right)^{1/2}, \frac{\alpha\sigma}{\sqrt{L\delta^0 B_{\text{init}}}}\right\}$, we get

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha T} + \left(\frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}T}\right)^{3/4} + \left(\frac{\widetilde{L}\delta^0\sigma}{\sqrt{\alpha}\,T}\right)^{2/3} + \left(\frac{L\delta^0\sigma^2}{nT}\right)^{1/2}\right).$$

Remark 1. Note that using a large initial batch size $B_{\text{init}} > 1$ is not necessary for the convergence of EF21-SGDM. If we set $B_{\text{init}} = 1$, the above theorem still holds with $\delta^0$ replaced by $\Lambda^0$.

Remark 2. In the single node setting ($n = 1$), the above result recovers the statement of Theorem 2 (with the same choice of parameters), since by Young's inequality $\left(\frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}T}\right)^{3/4} \le \frac{\widetilde{L}\delta^0}{\alpha T} + \left(\frac{\widetilde{L}\delta^0\sigma^2}{T}\right)^{1/2}$ and $\left(\frac{\widetilde{L}\delta^0\sigma}{\sqrt{\alpha}\,T}\right)^{2/3} \le \frac{\widetilde{L}\delta^0}{\alpha T} + \left(\frac{\widetilde{L}\delta^0\sigma^2}{T}\right)^{1/2}$, and $\widetilde{L} = L$.

Recovering previous rates in the case of full gradients. Compared to the iteration complexity $\mathcal{O}\left(\frac{L_{\max}G}{\alpha\varepsilon^3}\right)$ of EF14 [Koloskova et al., 2020], our result, summarized in

Corollary 1. If $\sigma = 0$, then $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right] \le \varepsilon$ after $T = \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha\varepsilon^2}\right)$ iterations.

is better by an order of magnitude, and does not require the BG assumption. The result of Corollary 1 is the same as for the EF21 method [Richtárik et al., 2021] and the EF21-HB method [Fatkhullin et al., 2021]. Notice, however, that even in this deterministic setting ($\sigma = 0$), the EF21-SGDM method is different from EF21 and EF21-HB: while the original EF21 does not use momentum, the EF21-HB method incorporates momentum on the server side to update $x^t$, which is different from our Algorithm 1, where momentum is applied by each node. This iteration complexity $\mathcal{O}\left(\frac{1}{\alpha\varepsilon^2}\right)$ is optimal in both $\alpha$ and $\varepsilon$. The matching lower bound was recently established by Huang et al. [2022] for smooth nonconvex optimization in the class of centralized, zero-respecting algorithms with contractive compressors.

Comparison to previous work. Our sample complexity¹³ in

Corollary 2. $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right] \le \varepsilon$ after $T = \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha\varepsilon^2} + \frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}\varepsilon^{8/3}} + \frac{\widetilde{L}\delta^0\sigma}{\alpha^{1/2}\varepsilon^3} + \frac{L\delta^0\sigma^2}{n\varepsilon^4}\right)$ iterations.

strictly improves over the complexity $\mathcal{O}\left(\frac{GL_{\max}}{\alpha\varepsilon^3} + \frac{L_{\max}\sigma^2}{n\varepsilon^4}\right)$ of EF14-SGD by Koloskova et al. [2020], even in the case when $G < +\infty$. Notice that it always holds that $\sigma \le G$. If we assume that $G \approx \sigma$, our first three terms improve the first term from Koloskova et al. [2020] by a factor of $\varepsilon/\sigma$, $(\varepsilon\alpha/\sigma)^{1/3}$, or $\alpha^{1/2}$, respectively. Compared to the BEER algorithm of Zhao et al. [2022], with sample complexity $\mathcal{O}\left(\frac{L_{\max}\sigma^2}{\alpha^2\varepsilon^4}\right)$, the result of Corollary 2 is strictly better in terms of $\alpha$, $n$, and the smoothness constants.¹⁴ In addition, we remove the large batch requirement for convergence compared to [Fatkhullin et al., 2021, Zhao et al., 2022]. Moreover, Corollary 2 implies that EF21-SGDM achieves the asymptotically optimal sample complexity $\mathcal{O}\left(\frac{L\sigma^2}{n\varepsilon^4}\right)$ in the regime $\varepsilon \to 0$.

¹²In Appendix J, we show how to deal with time-varying $\gamma_t$ and $\eta_t$.
¹³Note that the initial batch size contributes to the sample complexity only an additive constant independent of $\varepsilon$. Moreover, $B_{\text{init}} = \frac{\sigma^2}{L\delta^0} \le \frac{\sigma^2}{\varepsilon^2}$ since, otherwise, $\|\nabla f(x^0)\|^2 \le 2L\delta^0 \le \varepsilon^2$, and $x^0$ is a solution. In what follows, we ignore the dependence on $B_{\text{init}}$ for a fair comparison with other works.
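To complement the pseudocode of Algorithm 1, here is a minimal single-process NumPy simulation on a synthetic quadratic problem. The problem instance, noise model, and parameter values are our own illustrative assumptions; this is a sketch of the update structure, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K, sigma = 10, 100, 5, 1.0
A = [np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(n)]  # f_i(x) = 0.5 x^T A_i x

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def sgrad(i, x):  # unbiased stochastic gradient of f_i
    return A[i] @ x + sigma * rng.normal(size=d)

gamma, eta, T = 0.05, 0.1, 2000
x = rng.normal(size=d)
v = [sgrad(i, x) for i in range(n)]          # line 2 with B_init = 1
g = [vi.copy() for vi in v]
g_avg = np.mean(g, axis=0)

for t in range(T):
    x = x - gamma * g_avg                    # line 4 (master step)
    for i in range(n):                       # lines 5-9 (run in parallel on nodes)
        v[i] = (1 - eta) * v[i] + eta * sgrad(i, x)   # line 6: momentum estimator
        c = top_k(v[i] - g[i], K)                     # line 7: compress v_i - g_i
        g[i] += c                                     # line 8: local state update
        g_avg += c / n                                # line 10: master aggregates c_i

grad_norm = np.linalg.norm(np.mean([A[i] @ x for i in range(n)], axis=0))
print(f"||grad f(x_T)|| = {grad_norm:.4f}")
```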
3.4 Further improvement using double momentum!

Unfortunately, in the non-asymptotic regime, our sample complexity does not match the lower bound in all problem parameters simultaneously, due to the middle terms $\frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}\varepsilon^{8/3}} + \frac{\widetilde{L}\delta^0\sigma}{\alpha^{1/2}\varepsilon^3}$, which can potentially dominate the $\frac{L\delta^0\sigma^2}{n\varepsilon^4}$ term for large enough $n$ and $\varepsilon$, and small enough $\alpha$ and $\sigma$. We propose a double-momentum method, which can further improve the middle terms in the sample complexity of EF21-SGDM. We replace the momentum estimator $v_i^t$ in line 6 of Algorithm 1 by the following two-step momentum update:

$$\text{EF21-SGD2M:} \qquad v_i^{t+1} = (1 - \eta)v_i^t + \eta\nabla f_i(x^{t+1}, \xi_i^{t+1}), \qquad u_i^{t+1} = (1 - \eta)u_i^t + \eta v_i^{t+1}. \tag{10}$$

We formally present this method as Algorithm 3 in Appendix G. Compared to EF21-SGDM (Algorithm 1), the only change is that instead of compressing $v_i^t - g_i^t$, in EF21-SGD2M we compress $u_i^t - g_i^t$, where $u_i^t$ is a two-step (double) momentum estimator. The intuition behind this modification is that the double momentum estimator $u_i^t$ has a richer "memory" of the past gradients compared to $v_i^t$. Notice that for each node, EF21-SGD2M requires storing 3 vectors ($v_i^t$, $u_i^t$, $g_i^t$) instead of 2 in EF21-SGDM ($v_i^t$, $g_i^t$) and EF14-SGD ($e_i^t$, $g_i^t$).¹⁵ When interacting with a biased compression operator $\mathcal{C}(\cdot)$, this effect becomes crucial in improving the sample complexity. For EF21-SGD2M, we derive

Corollary 3. Let $v_i^t$ in Algorithm 1 be replaced by $u_i^t$ given by (10) (Algorithm 3 in Appendix G). Then, with an appropriate choice of $\gamma$ and $\eta$ (given in Theorem 5), we have $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right] \le \varepsilon$ after $T = \mathcal{O}\left(\frac{\widetilde{L}\delta^0}{\alpha\varepsilon^2} + \frac{\widetilde{L}\delta^0\sigma^{2/3}}{\alpha^{2/3}\varepsilon^{8/3}} + \frac{L\delta^0\sigma^2}{n\varepsilon^4}\right)$ iterations.

4 Experiments

We consider a nonconvex logistic regression problem:

$$f_i(x_1, \ldots, x_c) = -\frac{1}{m}\sum_{j=1}^{m}\log\left(\frac{\exp(a_{ij}^\top x_{y_{ij}})}{\sum_{y=1}^{c}\exp(a_{ij}^\top x_y)}\right)$$

with a nonconvex regularizer $h(x_1, \ldots, x_c) = \lambda\sum_{y=1}^{c}\sum_{k=1}^{l}\frac{[x_y]_k^2}{1 + [x_y]_k^2}$ with $\lambda = 10^{-3}$, where $x_1, \ldots, x_c \in \mathbb{R}^l$, $[\cdot]_k$ is an indexing operation on a vector, $c \ge 2$ is the number of classes, $l$ is the number of features, $m$ is the size of a dataset, and $a_{ij} \in \mathbb{R}^l$ and $y_{ij} \in \{1, \ldots, c\}$ are features and labels. The datasets used are MNIST (with $l = 784$, $m = 60\,000$, $c = 10$) and real-sim (with $l = 20\,958$, $m = 72\,309$, $c = 2$) [LeCun et al., 2010, Chang and Lin, 2011]. The dimension of the problem is $d = (l + 1)c$, i.e., $d = 7\,850$ for MNIST and $d = 41\,918$ for real-sim. In each experiment, we show the relation between the total number of transmitted coordinates and gradient/function values. The stochastic gradients in each algorithm are replaced by a mini-batch estimator $\frac{1}{B}\sum_{j=1}^{B}\nabla f_i(x, \xi_{ij})$ with the same $B \ge 1$ in each plot. Notice that all methods (except for NEOLITHIC)¹⁶ compute the same number of samples at each communication round; thus the dependence on the number of samples used is qualitatively the same. In all algorithms, the step sizes are fine-tuned from the set $\{2^k \mid k \in [-20, 20]\}$, and the Top$K$ compressor is used to compress information from the nodes to the master. For EF21-SGDM, we fix the momentum parameter $\eta = 0.1$ in all experiments. Prior to that, we tuned $\eta \in \{0.01, 0.1\}$ on the independent dataset w8a (with $l = 300$, $m = 49\,749$, $c = 2$). We omit the BEER method from the plots since it showed worse performance than EF21-SGD in all runs.

¹⁴$L_{\max} := \max_{i \in [n]} L_i$. Notice that $L \le \widetilde{L} \le L_{\max}$, and the inequalities are strict in the heterogeneous setting.
¹⁵See Appendix K for details on EF14-SGD. In contrast, EF21-SGD needs to store only one vector ($g_i^t$).
¹⁶For NEOLITHIC, we use the parameter $R = d/K$ following the requirement in their Theorem 3. The experiments in Huang et al. [2022] use a heuristic choice $R = 4$, and thus can show faster convergence.

4.1 Experiment 1: increasing batch size

In this experiment, we use the MNIST dataset, fix the number of transmitted coordinates to $K = 10$ (thus $\alpha \approx K/d \approx 10^{-3}$), and set $n = 10$.
Figure 2 shows convergence plots for $B \in \{1, 32, 128\}$. EF21-SGDM and its double momentum version EF21-SGD2M converge fast and show a significant improvement when increasing the batch size compared to EF14-SGD. In contrast, EF21-SGD suffers from poor performance for small $B$, which confirms our observations in the previous sections. NEOLITHIC converges significantly more slowly due to the fact that it sends $d/K$ compressed vectors in each iteration, while the other methods send only one.

Figure 2: Experiment on the MNIST dataset with $n = 10$ and the Top10 compressor; panels (a)-(c) show $\|\nabla f(x^t)\|^2$ against the number of transmitted bits per node for $B = 1$, $B = 32$, and $B = 128$, with step sizes fine-tuned per method (NEOLITHIC, EF14-SGD, EF21-SGD, EF21-SGD2M, EF21-SGDM).

4.2 Experiment 2: improving convergence with $n$

This experiment uses the real-sim dataset, $K = 100$ (thus $\alpha \approx K/d \approx 2 \cdot 10^{-3}$), and $B = 128 \ll m$. We vary the number of nodes within $n \in \{1, 10, 100\}$; see Figure 3. In this case, EF21-SGDM and EF21-SGD2M converge much faster than the other methods for all $n$. Moreover, the proposed algorithms show a significant improvement when $n$ increases. We also observe that on this task, EF21-SGD2M performs slightly worse than EF21-SGDM for $n = 10, 100$, but it is still much faster than the other methods.

Figure 3: Experiment on the real-sim dataset with batch size $B = 128$ and the Top100 compressor; panels (a)-(c) correspond to $n = 1$, $n = 10$, and $n = 100$, with step sizes fine-tuned per method (EF14-SGD, EF21-SGD, EF21-SGD2M, EF21-SGDM).

In Appendix C, we present extra simulations with different parameters for the above experiments. Additionally, we include experiments on simple quadratic problems and train larger image recognition models. In all cases, EF21-SGDM and EF21-SGD2M outperform the other algorithms.

Acknowledgments and Disclosure of Funding

The work of I. Fatkhullin was supported by an ETH AI Center doctoral fellowship. The work of P. Richtárik and A. Tyurin was supported by the KAUST Baseline Research Scheme (KAUST BRF) and the KAUST Extreme Computing Research Center (KAUST ECRC); P. Richtárik was also supported by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).

References

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, 2018.
Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.
arXiv preprint arXiv:1912.02365, 2019.
Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.
Xin Chen, Niao He, Yifan Hu, and Zikun Ye. Efficient algorithms for minimizing compositions of convex functions and random functions and its applications in network revenue management. arXiv preprint arXiv:2205.01774, 2022.
Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In International Conference on Machine Learning, pages 2260-2268. PMLR, 2020.
Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. In Advances in Neural Information Processing Systems, pages 4883-4895, 2021.
Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, 2019.
Aaron Defazio. Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization. arXiv preprint arXiv:2010.00406, 2021.
Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1626-1635. PMLR, 2019.
Ron Dorfman, Shay Vargaftik, Yaniv Ben-Itzhak, and Kfir Y. Levy. DoCoFL: Downlink compression for cross-device federated learning. arXiv preprint arXiv:2302.00543, 2023.
Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, and Peter Richtárik. EF21 with bells & whistles: Practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294, 2021.
Juan Gao, Xin-Wei Liu, Yu-Hong Dai, Yakui Huang, and Junhua Gu. Distributed stochastic gradient tracking methods with momentum acceleration for non-convex optimization. Computational Optimization and Applications, 84(2):531-572, 2023.
Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pages 310-315. IEEE, 2015.
Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging error compensated SGD. In 34th Conference on Neural Information Processing Systems, 2020.
Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster non-convex distributed learning with compression. In International Conference on Machine Learning, pages 3788-3798. PMLR, 2021.
Kaja Gruntkowska, Alexander Tyurin, and Peter Richtárik. EF21-P and friends: Improved theoretical communication complexity for distributed optimization with bidirectional compression. arXiv preprint arXiv:2209.15218, 2022.
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737-1746, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019a.
Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019b.
Xinmeng Huang, Yiming Chen, Wotao Yin, and Kun Yuan. Lower bounds and nearly optimal algorithms in distributed learning with communication compression. In Advances in Neural Information Processing Systems, 2022.
Rustem Islamov, Xun Qian, Slavomír Hanzely, Mher Safaryan, and Peter Richtárik. Distributed Newton-type methods with communication compression and Bernoulli aggregation. arXiv preprint arXiv:2206.03588, 2022.
Samy Jelassi and Yuanzhi Li. Towards understanding how momentum improves generalization in deep learning. In International Conference on Machine Learning, pages 9965-10040. PMLR, 2022.
Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 18-37. PMLR, 2022.
Peter Kairouz et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Satyen Kale, Ayush Sekhari, and Karthik Sridharan. SGD: The role of implicit regularization, batch-size and multiple-epochs. arXiv preprint arXiv:2107.05074, 2021.
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, 2019.
Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2021.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linear speedup under Markovian sampling. In International Conference on Machine Learning, pages 10997-11057. PMLR, 2022.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698-2707. PMLR, 2018.
Anastasia Koloskova, Tao Lin, S. Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020.
Jakub Konečný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.
Dmitry Kovalev, Anastasia Koloskova, Martin Jaggi, Peter Richtárik, and Sebastian Stich. A linearly convergent algorithm for decentralized optimization: Sending less bits for free! In The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021), 2021.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, 2009.
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database.
ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive SGD with momentum. arXiv preprint arXiv:2007.14294, 2020.
Xiaoyu Li, Mingrui Liu, and Francesco Orabona. On the last iterate convergence of momentum methods. In International Conference on Algorithmic Learning Theory, pages 699-717. PMLR, 2022a.
Xiaoyun Li, Belhal Karimi, and Ping Li. On distributed adaptive optimization with gradient compression. In International Conference on Learning Representations, 2022b.
Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtárik. Acceleration for compressed gradient descent in distributed and federated optimization. In International Conference on Machine Learning, pages 5895-5904. PMLR, 2020.
Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pages 6286-6295. PMLR, 2021.
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
Yanli Liu, Yuan Gao, and Wotao Yin. An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems, 2020.
Maksim Makarenko, Elnur Gasanov, Rustem Islamov, Abdurakhmon Sadiev, and Peter Richtárik. Adaptive compression for communication-efficient distributed training. arXiv preprint arXiv:2211.00188, 2022.
Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In International Conference on Machine Learning, pages 15750-15769. PMLR, 2022.
Aritra Mitra, George J. Pappas, and Hamed Hassani. Temporal difference learning with compressed updates: Error-feedback meets reinforcement learning. arXiv preprint arXiv:2301.00944, 2023.
Constantin Philippenko and Aymeric Dieuleveut. Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: Tight convergence guarantees. arXiv preprint arXiv:2006.14591, 2020.
Maximilian Plattner. On SGD with momentum. Master's thesis, 2022.
Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
Xun Qian, Peter Richtárik, and Tong Zhang. Error compensated distributed SGD can be accelerated. arXiv preprint arXiv:2010.00091, 2020.
Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. In Advances in Neural Information Processing Systems, 2021.
Peter Richtárik, Igor Sokolov, Ilyas Fatkhullin, Elnur Gasanov, Zhize Li, and Eduard Gorbunov. 3PC: Three point compressors for communication-efficient distributed training and a better theory for lazy aggregation. arXiv preprint arXiv:2202.00998, 2022.
Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletarì, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso.
The future of digital health with federated learning. npj Digital Medicine, 3(11):1-7, 2020.
Mher Safaryan, Egor Shulgin, and Peter Richtárik. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 2021.
Mher Safaryan, Rustem Islamov, Xun Qian, and Peter Richtárik. FedNL: Making Newton-type methods applicable to federated learning. In International Conference on Machine Learning, 2022.
Atal Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, and Panos Kalnis. Rethinking gradient sparsification as total error minimization. In Advances in Neural Information Processing Systems, pages 8133-8146, 2021.
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, and Masoud Moshref. Scaling distributed machine learning with in-network aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation, pages 785-808, 2021.
Othmane Sebbouh, Charles Dossal, and Aude Rondepierre. Nesterov's acceleration and Polyak's heavy ball method in continuous time: Convergence rate analysis under geometric conditions and perturbations. arXiv preprint arXiv:1907.02710, 2019.
Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory, pages 3935-3971. PMLR, 2021.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
Navjot Singh, Deepesh Data, Jemin George, and Suhas Diggavi. SQuARM-SGD: Communication-efficient momentum SGD for decentralized optimization. IEEE Journal on Selected Areas in Information Theory, 2(3):954-969, 2021.
Sebastian U. Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
Sebastian U. Stich, J.-B. Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, 2018.
Rafał Szlendak, Alexander Tyurin, and Peter Richtárik. Permutation compressors for provably faster distributed nonconvex optimization. arXiv preprint arXiv:2110.03300, 2021.
Yuki Takezawa, Han Bao, Kenta Niwa, Ryoma Sato, and Makoto Yamada. Momentum tracking: Momentum acceleration for decentralized deep learning on heterogeneous data. arXiv preprint arXiv:2209.15505, 2022.
Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, and Ji Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, 2020.
Alexander Tyurin and Peter Richtárik. DASHA: Distributed nonconvex optimization with communication compression, optimal oracle complexity, and no client synchronization. arXiv preprint arXiv:2202.01268, 2022.
Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.
Jun-Kun Wang and Jacob Abernethy.
Quickly finding a benign region via heavy ball momentum in non-convex optimization. arXiv preprint arXiv:2010.01449, 2021.
Yujia Wang, Lu Lin, and Jinghui Chen. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pages 6292-6320. PMLR, 2022.
Tiannan Xiao and Guoguo Yang. A convergence study of SGD-type methods for stochastic optimization. arXiv preprint arXiv:2211.06197, 2022.
Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, and Haibin Lin. CSER: Communication-efficient SGD with error reset. In Advances in Neural Information Processing Systems, pages 12593-12603, 2020.
An Xu and Heng Huang. Detached error feedback for distributed SGD with random sparsification. In International Conference on Machine Learning, pages 24550-24575. PMLR, 2022.
Junchi Yang, Xiang Li, Ilyas Fatkhullin, and Niao He. Two sides of one coin: The limits of untuned SGD and the power of adaptive methods. In Advances in Neural Information Processing Systems, 2023.
Tianbao Yang, Qihang Lin, and Zhe Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.
Chung-Yiu Yau and Hoi-To Wai. DoCoM-SGT: Doubly compressed momentum-assisted stochastic gradient tracking algorithm for communication efficient decentralized learning. arXiv preprint arXiv:2202.00255, 2022.
Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184-7193. PMLR, 2019.
S.K. Zavriev and F.V. Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336-341, 1993.
Haoyu Zhao, Boyue Li, Zhize Li, Peter Richtárik, and Yuejie Chi. BEER: Fast O(1/T) rate for decentralized nonconvex optimization with communication compression. In Advances in Neural Information Processing Systems, 2022.
Shuai Zheng, Ziyue Huang, and James Kwok. Communication-efficient distributed blockwise momentum SGD with error-feedback. In Advances in Neural Information Processing Systems, 2019.

Contents

1 Introduction
2 Communication Compression, Error Feedback, and Sample Complexity
  2.1 Brief history of error feedback
  2.2 Key issue: error feedback has an unhealthy appetite for samples!
  2.3 Mysterious effectiveness of momentum in nonconvex optimization
3 Main Results
  3.1 A deeper dive into the issues EF21 has with stochastic gradients
  3.2 Momentum for avoiding mega-batches
  3.3 Distributed stochastic error feedback with momentum
  3.4 Further improvement using double momentum!
4 Experiments
  4.1 Experiment 1: increasing batch size
  4.2 Experiment 2: improving convergence with n
A More on Contractive Compressors, Error Feedback and Momentum
B Variance Reduction Effect of SGDM and Comparison to STORM
C Additional Experiments and Details of Experimental Setup
  C.1 Extra plots for experiments 1 and 2
  C.2 Experiment 3: stochastic quadratic optimization
22 C.3 Experiment 4: training neural network . . . . . . . . . . . . . . . . . . . . . . . . 24 D Descent Lemma 25 E EF21-SGDM-ideal (Proof of Theorem 1 and Proposition 1) 25 F EF21-SGDM (Proof of Theorems 2 and 3) 28 F.1 Controlling the error of momentum estimator . . . . . . . . . . . . . . . . . . . . 30 F.2 Controlling the error of contractive compression and momentum estimator . . . . . 31 G Further Improvement Using Double Momentum (Proof of Corollary 3) 33 G.1 Controlling the error of second momentum estimator . . . . . . . . . . . . . . . . 36 G.2 Controlling the error of contractive compression and double momentum estimator . 38 H EF21-SGDM with Absolute Compressor 40 H.1 Controlling the error of absolute compression . . . . . . . . . . . . . . . . . . . . 42 I EF21-STORM/MVR 43 I.1 Controlling the variance of STORM/MVR estimator . . . . . . . . . . . . . . . . 46 I.2 Controlling the variance of contractive compression and STORM/MVR estimator . 47 J Simplified Proof of SGDM: Time Varying Parameters and No Tuning for Momentum Sequence 49 K Revisiting EF14-SGD Analysis under BG and BGS Assumptions 51 Method Comm. compl. Batch-size for comm. compl. Asymp. sample compl. Batch free? No extra assump.? EF14-SGD [Koloskova et al., 2020] NEOLITHIC [Huang et al., 2022] π›Όπœ€2 log ( 𝐺 πœ€ ) (*) 𝐿max𝜎2 EF21-SGD [Fatkhullin et al., 2021] BEER [Zhao et al., 2022] EF21-SGDM Corollary 2 Corollary 3 (a) Analysis requires a bound of the second moment of the stochastic gradients, i.e., E [ 𝑓𝑖(π‘₯, πœ‰π‘–) 2] 𝐺2 for all π‘₯ R𝑑. (b) This complexity is achieved by using a large mini-batch and communicating 𝐾/𝛼 𝑑coordinates per iteration. (c) Analysis requires a bounded gradient disimilarity assumption, i.e., 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯) 𝑓(π‘₯) 2 𝐺2 for all π‘₯ R𝑑. (d) Analysis requires a batch-size at least 𝐡 𝜎2 𝛼2πœ€2 for EF21-SGD and 𝐡 𝜎2 π›Όπœ€2 for BEER. (*) For a fair comparison, we take the (minimal) batch-size for these methods which guarantees the reported communication complexity. Table 2: Extended summary of related works on distributed error compensated SGD methods using a Top𝐾 compressor under Assumptions 1 and 2. The goal is to find an πœ€-stationary point of a smooth nonconvex function of the form (1), i.e., a point π‘₯such that E [ 𝑓(π‘₯) ] πœ€. "Comm. compl." reports the total number of communicated bits if the method is applied with batch-size equal to "Batch-size" at each node. When Top𝐾 compressor is applied, then 𝛼 𝐾/𝑑, and the comm. compl. of error compensated methods can be reduced by a factor of 𝛼𝑑/𝐾. "Batch-size for comm. compl." means the batch-size for achieving the reported "Comm. compl.". "Asymp. sample compl." reports the asymptotic sample complexity of the algorithm with batch-size 𝐡= 1 in the regime πœ€ 0, i.e., the total number of samples required at each node to achieve πœ€-stationary point. "Batch free" marks with if the analysis ensures convergence with batch-size equal to 1. "No extra assump." marks with if no additional assumption beyond Assumptions 1 and 2 is required for analysis. We denote 𝐿max := max𝑖 [𝑛] 𝐿𝑖. Notice that it always holds 𝐿 𝐿 𝐿max and these inequalities only become equalities in the homogeneous case. It could be that 𝛼𝐿/ 𝐿 1 making the batch-size of EF21-SGDM and EF21-SGD2M much smaller than those of EF21-SGD and BEER. Symbol denotes the maximum of two scalars. A More on Contractive Compressors, Error Feedback and Momentum Greedy vs uniform. 
In our work, we specifically focus on the class of contractive compressors satisfying Definition 1, which contains a greedy Top𝐾compressor as a special case. Note that Top𝐾is greedy in that it minimizes the error Top𝐾(π‘₯) π‘₯ 2 subject to the sparsity constraint π’ž(π‘₯) 0 𝐾, where 𝑒 0 counts the number of nonzero entries in 𝑒. In practice, greediness is almost always17 very useful, translating into excellent empirical performance, especially when compared to the performance of the Rand𝐾sparsifier. On the other hand, it appears to be very hard to formalize these practical gains theoretically18. In fact, while greedy compressors such as Top𝐾 outperform their randomized cousins such as Rand𝐾in practice, and often by huge margins [Lin et al., 2018], the theoretical picture is exactly reversed, and the theoretical communication complexity of gradient-type methods based on randomized compressors [Alistarh et al., 2017, Mishchenko et al., 2019, HorvΓ‘th et al., 2019b, Li et al., 2020, Gorbunov et al., 2021] is much better than of those based on greedy compressors [Koloskova et al., 2020, RichtΓ‘rik et al., 2021, Fatkhullin et al., 2021, RichtΓ‘rik et al., 2022]. The key reason behind this is the fact that popular randomized compressors such as Rand𝐾become unbiased mappings after appropriate scaling (e.g., E[ 𝑑 𝐾Rand𝐾(π‘₯)] π‘₯), and that the inherent randomness is typically drawn independently by all clients. This leads to several key simplifications in the analysis, and consequently, to theoretical gains over methods that do not compress, and over methods that compress greedily. Further improvements are possible when the randomness is correlated in an appropriate way [Szlendak et al., 2021]. Due to the superior empirical properties of greedy contractive compressors, and our desire to push this very potent line of work further, in this paper we work with the general class of compressors 17Greediness is not useful, for example, when π’Ÿπ‘–= π’Ÿπ‘—for all 𝑖,𝑗and when Top𝐾is applied to the full-batch gradient 𝑓𝑖(π‘₯) by each client. However, situations of this type arise rarely in practice. 18No theoretical results of this type exist for 𝑛> 1 . satisfying Definition 1, and do not invoke any additional restrictive assumptions. For example, we do not assume π’žcan be made unbiased after scaling. Error Feedback. The first theoretical analysis of EF14 was presented in the works of Stich et al. [2018], Alistarh et al. [2018] and further revisited in convex case in [Karimireddy et al., 2019, Beznosikov et al., 2020, Gorbunov et al., 2020, Qian et al., 2020] and analysis was extended to nonconvex setting in [Stich and Karimireddy, 2019]. Later, in nonconvex case, various extensions and combinations of EF14 with other optimization techniques were considered and analyzed, which include bidirectional compression [Tang et al., 2020], decentralized training [Koloskova et al., 2020, Singh et al., 2021], server level momentum [Xie et al., 2020], client level momentum [Zheng et al., 2019], combination with adaptive methods [Li et al., 2022b]. To our knowledge, the best sample complexity for finding a stationary point for this method (including its momentum and adaptive variants) in the distributed nonconvex setting is given by Koloskova et al. [2020], which is π’ͺ( 𝐺 π‘›πœ€4 ). More recently, Huang et al. [2022] propose a modification of EF14 method achieving π’ͺ ( 1 π›Όπœ€2 log( 𝐺 π‘›πœ€4 ) sample complexity by using the BGS assumption. 
When applied with Top𝐾 compressor, this method requires to communicate π’ͺ(𝐾/𝛼) coordinates at every iteration. This makes it impractical since when the effective 𝛼is unknown and is set to 𝛼= 𝐾/𝑑, it means that their method communicates all 𝑑coordinates at every iteration, mimicking vanilla (S)GD method. Moreover, their algorithm uses an additional subroutine and applies updates with a large batch-size of samples of order π’ͺ( 1 πœ€ ) ), making the algorithm less practical and difficult to implement. It is worth to mention, that error feedback was also analyzed for other classes of compressors such as absolute (see Definition 2) or additive compressors (i.e., π’ž(π‘₯+ 𝑦) = π’ž(π‘₯) + π’ž(𝑦) for all π‘₯, 𝑦 R𝑑) [Tang et al., 2020, Xu and Huang, 2022], which do not include Top𝐾sparsifier. Momentum. The first convergence analysis of gradient descent with momentum was proposed by B.T. Polyak in his seminal work [Polyak, 1964] studying the benefit of multi-step methods. The proof technique proposed in this work is based on the analysis of the spectral norm of a certain matrix arising from the dynamics of a multi-step process on a quadratic function. Unfortunately, such technique is restricted to the case of strongly convex quadratic objective and the setting of full gradients. Later Zavriev and Kostyuk [1993] prove an asymptotic convergence of this method in nonconvex deterministic case without specifying the rate of convergence. To our knowledge, the first non-asymptotic analysis of SGDM in the smooth nonconvex setting is due to Yu et al. [2019]. Their analysis, however, heavily relies on BG assumption. Later, Liu et al. [2020] provide a refined analysis of SGDM, removing the BG assumption and improving the dependence on the momentum parameter πœ‚. Notice that the analysis of Liu et al. [2020] and the majority of other works relies on some variant of the following Lyapunov function: Λ𝑑:= 𝑓(𝑧𝑑) 𝑓* + 𝜏=0 π‘πœ π‘₯𝑑 𝜏 π‘₯𝑑 1 𝜏 2 , (11) where {𝑧𝑑}𝑑 0 is some auxiliary sequence (often) different from the iterates {π‘₯𝑑}𝑑 0, and {π‘πœ}𝜏 0 is a diminishing non-negative sequence. This approach is motivated by the dynamical system point of view at Polyak s heavy ball momentum, where the two terms in (11) are interpreted as the potential and kinetic energy of the system [Sebbouh et al., 2019]. In contrast, the Lyapunov function used in this work is conceptually different even in the single node (𝑛= 1), uncompressed (𝛼= 1) setting. Later, Defazio [2021] revisit the analysis in [Liu et al., 2020] through the lens of primal averaging and provide insights on why momentum helps in practice. The momentum is also used for stabilizing adaptive algorithms such as normalized SGD [Cutkosky and Mehta, 2020]. In particular, it was shown that by using momentum, one can ensure convergence without large batches for normalized SGD (while keeping the same sample complexity as a large batch normalized SGD). However, their analysis is specific to the normalized method, which allows using the function value as a Lyapunov function. High probability analysis of momentum methods was investigated in [Cutkosky and Mehta, 2021, Li and Orabona, 2020]. In the distributed setting, [Yu et al., 2019, Karimireddy et al., 2021] extend the analysis of SGDM under BGS assumption. Later [Takezawa et al., 2022, Gao et al., 2023] remove this assumption providing a refined analysis based on the techniques developed in [Liu et al., 2020]. 
However, the algorithms in these works do not apply any bandwidth reduction technique such as communication compression. We would like to mention that understanding the behavior of SGDM in convex case also remains an active area of research [Ghadimi et al., 2015, Yang et al., 2016, Sebbouh et al., 2021, Li et al., 2022a, Xiao and Yang, 2022] . B Variance Reduction Effect of SGDM and Comparison to STORM Notice that the choice of our Lyapunov function Λ𝑑(8), which is used in the analysis of EF21-SGDM implies that the gradient estimators 𝑔𝑑 𝑖and 𝑣𝑑 𝑖improve over the iterations, i.e., 𝑔𝑑 𝑖 𝑓𝑖(π‘₯𝑑), 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) for 𝑑 . This comes in contrast with the behavior of SGD, for which the gradient estimator 𝑣𝑑 𝑖= 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) does not necessarily tend to zero over iterations. Such effect of asymptotic improvement of the estimation error of the gradient estimator is reminiscent to the analogous effect known in the literature on variance reduction (VR) methods. In particular, the classical momentum step 6 of Algorithm 1 may be contrasted with a STORM variance reduced estimator proposed by Cutkosky and Orabona [2019], which updates the gradient estimator via 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )), 𝑀0 𝑖= 𝑓𝑖(π‘₯0, πœ‰0 𝑖) (12) It is known that the class of VR methods (and STORM, in particular) can show faster asymptotic convergence in terms of 𝑇(or πœ€) compared to SGD and SGDM under some additional assumptions. However, we would like to point out the important differences (and limitations) of (12) compared to the classical Polyak s momentum used on line 6 of Algorithm 1. First, the estimator 𝑀𝑑+1 𝑖 is different from the momentum update rule 𝑣𝑑+1 𝑖 in that it is unbiased for any 𝑑 0, i.e., E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) ] = 0,19 which greatly facilitates the analysis of this method. Notice that, in particular, in the deterministic case (𝜎= 0, 𝛼= 1), the method with estimator (12) reduces to vanilla gradient descent with 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1). Second, the computation of 𝑀𝑑+1 𝑖 requires access to two stochastic gradients 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) and 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) under the same realization of noise πœ‰π‘‘+1 𝑖 at each iteration, and requires the additional storage of control variate π‘₯𝑑. This is a serious limitation, which can make the method impractical or even not implementable for certain applications such as federated RL [Mitra et al., 2023], multi-agent RL [Doan et al., 2019] or operations research problems [Chen et al., 2022]. Third, the analysis of variance reduced methods such as STORM requires an additional assumptions such as individual smoothness of stochastic functions (or its averaged variants) (Assumption 3), i.e., 𝑓𝑖(π‘₯,πœ‰π‘–) 𝑓𝑖(𝑦, πœ‰π‘–) ℓ𝑖 π‘₯ 𝑦 for all π‘₯,𝑦 R𝑑, πœ‰π‘– π’Ÿπ‘–, 𝑖 [𝑛], while our EF21-SGDM only needs smoothness of (deterministic) local functions 𝑓𝑖(π‘₯). While this assumption is satisfied for some loss functions in supervised learning, it can also be very limiting. Even if Assumption 3 is satisfied, the constant β„“(which always satisfies β„“ 𝐿) can be much larger than 𝐿canceling the speed-up in terms of 𝑇(or πœ€). For completeness, we provide the sample complexity analysis of our error compensated method combined with estimator (12), which is deferred to Appendix I. 19Notice that E [ 𝑀0 𝑖 𝑓𝑖(π‘₯0) ] = 0. Let E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) ] = 0 hold, then E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) ] = (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) ] = 0. 
0 2000 4000 6000 8000 10000 Iteration, t Objective value EF21-SGD-ideal EF21-SGD EF21-SGDM (a) Divergence in single node setting, 𝑛= 1. 0 2000 4000 6000 8000 10000 Iteration, t Objective value EF21-SGD (n = 50) EF21-SGD (n = 500) EF21-SGD (n = 5000) EF21-SGD (n = 10000) (b) No improvement with 𝑛. Figure 4: Divergence of EF21-SGD on a quadratic function 1 2 π‘₯ 2 with Top1 compressor. See the proof of Therem 1 for details on the construction of noise πœ‰, we use 𝜎= 1, 𝐡= 1. The starting point for all methods is π‘₯0 = (0, 0.01) . Unlike Figure 1, these experiments use time varying step-sizes and momentum parameters 𝛾𝑑= πœ‚π‘‘= 0.1 𝑑+1. Each method is run 10 times and the plot shows the median performance alongside the 25% and 75% quantiles. C Additional Experiments and Details of Experimental Setup Divergence of EF21-SGD with time-varying step-sizes. We complement our Figure 1 in the main part of the paper, which shows divergence of EF21-SGD [Fatkhullin et al., 2021] with small (constant) step-size. Here, in Figure 4, we see that the similar divergence is observed when using time varying step-sizes 𝛾𝑑= 0.1 𝑑+1. Also, EF21-SGD with time-varying step-size does not improve convergence when 𝑛is increased. Implementation Details. The experiments were implemented in Python 3.7.9. The distributed environment was emulated on machines with Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz. In all experiments with MNIST, we split the dataset across nodes by labels to simulate the heterogeneous setting. C.1 Extra plots for experiments 1 and 2 In Figures 5 and 6, we provide extra experiments for the setup from Section 4. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e6 || f(xt)||2 EF14-SGD: Step size: 0.5 EF21-SGD: Step size: 0.03125 EF21-SGDM: Step size: 0.03125 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e6 || f(xt)||2 EF14-SGD: Step size: 1.0 EF21-SGD: Step size: 0.015625 EF21-SGDM: Step size: 0.125 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e6 || f(xt)||2 EF14-SGD: Step size: 1.0 EF21-SGD: Step size: 0.015625 EF21-SGDM: Step size: 0.25 (c) 𝐡= 128 Figure 5: Performance of algorithms on MNIST dataset with 𝑛= 100, and Top10 compressor. C.2 Experiment 3: stochastic quadratic optimization We now consider a synthetic πœ† strongly convex quadratic function problem 𝑓(π‘₯) = 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯), where the functions 𝑓𝑖(π‘₯) = 1 2π‘₯ Q𝑖π‘₯ π‘₯ 𝑏𝑖are (not necessarily convex) quadratic functions for all 𝑖 [𝑛] and π‘₯ R𝑑. The matrices Q1, , Q𝑛, vectors 𝑏1, , 𝑏𝑛, and a starting point π‘₯0 are generated by Algorithm 2 with the number of nodes 𝑛= 100, dimension 𝑑= 1000, regularizer πœ†= 0.01, and scale 𝑠= 1. For all 𝑖 [𝑛] and π‘₯ R𝑑, we consider stochastic gradients 𝑓𝑖(π‘₯, πœ‰) = 𝑓𝑖(π‘₯) + πœ‰π‘–, where πœ‰π‘–are i.i.d. samples from 𝒩(0, 𝜎) with 𝜎 {0.001, 0.01}. In 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e7 EF14-SGD: Step size: 1.0 EF21-SGD: Step size: 2.0 EF21-SGDM: Step size: 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e7 EF14-SGD: Step size: 8.0 EF21-SGD: Step size: 8.0 EF21-SGDM: Step size: 8.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 #bits / n 1e7 EF14-SGD: Step size: 64.0 EF21-SGD: Step size: 64.0 EF21-SGDM: Step size: 128.0 (c) 𝑛= 100 Figure 6: Performance of algorithms on real-sim dataset with batch-size 𝐡= 1, and Top100 compressor. Figure 7, we present the comparison of EF21-SGDM and EF14-SGD with three different step sizes. The behavior of methods for other step sizes from the set {2π‘˜| π‘˜ [ 20, 20]} follows a similar trend. 
For every step size, we observe that at the beginning, the methods have almost the same linear rates, but then EF14-SGD gets stuck at high accuracies, while EF21-SGDM continues converging to the lower accuracies. 0 100000 200000 300000 400000 500000 #bits / n EF14-SGD: Step size: 0.0625 EF14-SGD: Step size: 0.125 EF14-SGD: Step size: 0.25 EF21-SGDM: Step size: 0.0625 EF21-SGDM: Step size: 0.125 EF21-SGDM: Step size: 0.25 0 100000 200000 300000 400000 500000 #bits / n EF14-SGD: Step size: 0.0625 EF14-SGD: Step size: 0.125 EF14-SGD: Step size: 0.25 EF21-SGDM: Step size: 0.0625 EF21-SGDM: Step size: 0.125 EF21-SGDM: Step size: 0.25 Figure 7: Stochastic Quadratic Optimization Problem with 𝜎= 0.001 (left figure) and 𝜎= 0.01 (right figure) Algorithm 2 Quadratic Optimization Task Generation Procedure 1: Parameters: number nodes 𝑛, dimension 𝑑, regularizer πœ†, and scale 𝑠. 2: for 𝑖= 1, . . . , 𝑛do 3: Calculate Guassian noises πœ‡π‘  𝑖= 1 + π‘ πœ‰π‘  𝑖and πœ‡π‘ 𝑖= π‘ πœ‰π‘ 𝑖, i.i.d. πœ‰π‘  𝑖, πœ‰π‘ 𝑖 𝒩(0, 1) 4: 𝑏𝑖= πœ‡π‘  𝑖 4 ( 1 + πœ‡π‘ 𝑖, 0, , 0) R𝑑 5: Scale the predefined tridiagonal matrix 1 ... ... ... ... 1 0 1 2 6: end for 7: Find the mean of matrices Q = 1 𝑛 𝑛 𝑖=1 Q𝑖 8: Find the minimum eigenvalue πœ†min(Q) 9: for 𝑖= 1, . . . , 𝑛do 10: Normalize matrix Q𝑖= Q𝑖+ (πœ† πœ†min(Q))I 11: end for 12: Find a starting point π‘₯0 = ( 𝑑, 0, , 0) 13: Output a new problem: matrices Q1, , Q𝑛, vectors 𝑏1, , 𝑏𝑛, starting point π‘₯0 A procedure to generate stochastic quadratic optimization problems. In this section, we present an algorithm that generates quadratic optimization tasks. The formal description is provided in Algorithm 2. The idea is to take a predefined tridiagonal matrix and add noises to simulate the heterogeneous setting. Algorithm 2 returns matrices Q1, , Q𝑛, vectors 𝑏1, , 𝑏𝑛, and a starting 0.0 0.5 1.0 1.5 2.0 #bits / n 1e12 f(xk) f(x * ) Number of nodes: 5 Vanilla SGD (no compression): Step size: 0.05 EF21-SGDM: Step size: 0.05 EF21-SGD: Step size: 0.05 EF14-SGD: Step size: 0.05 0.0 0.5 1.0 1.5 2.0 2.5 #bits / n 1e12 f(xk) f(x * ) Number of nodes: 5 Vanilla SGD (no compression): Step size: 0.05 EF21-SGDM: Step size: 0.05 EF21-SGD: Step size: 0.05 EF14-SGD: Step size: 0.05 (b) 𝐡= 25 Figure 8: Res Net-18 on CIFAR10 dataset with 𝑛= 5. Algorithm Test Accuracy SGD 81.5 % EF21-SGD 82.5 % EF14-SGD 83.1 % EF21-SGDM 83.3 % Figure 9: Accuracy on the CIFAR10 test split. point π‘₯0 such that the matrix Q = 1 𝑛 𝑛 𝑖=1 Q𝑖has the minimum eigenvalue πœ†min(Q) = πœ†, where πœ† 0 is a parameter. Next, we define the functions 𝑓𝑖and stochastic gradients in the following way: 2π‘₯ Q𝑖π‘₯ π‘₯ 𝑏𝑖 𝑓𝑖(π‘₯, πœ‰) := 𝑓𝑖(π‘₯) + πœ‰π‘–, for all π‘₯ R𝑑and 𝑖 [𝑛]. The noises πœ‰π‘–are i.i.d. samples from 𝒩(0, 𝜎), where 𝜎is a parameter. C.3 Experiment 4: training neural network We test algorithms on an image recognition task, CIFAR10 [Krizhevsky et al., 2009], with the Res Net18 [He et al., 2016] deep neural network (the number of parameters 𝑑 107). We split CIFAR10 among 5 nodes, and take 𝐾= 2 106 in Top𝐾. In all methods we finetune the step sizes. One can see that our findings in the low-scale experiments translate into large-scale experiments in Figure 8. With different batch sizes, EF21-SGD converges slower than EF21-SGDM and EF14-SGD, and our new method EF21-SGDM improves over EF14-SGD in Figure 8b. 
We checked the accuracies on the test dataset (see Table 9) and observed the same relations between algorithms (note that accuracies are far from the real SOTA because we turned off all augmentations and regularizations in training). D Descent Lemma Let us state the following lemma that is used in the analysis of nonconvex optimization methods. Lemma 1 ([Li et al., 2021]). Let the function 𝑓( ) be 𝐿-smooth and let π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑for some vector 𝑔𝑑 R𝑑and a step-size 𝛾> 0. Then we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 𝑓(π‘₯𝑑) 2 ( 1 ) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 . (13) E EF21-SGDM-ideal (Proof of Theorem 1 and Proposition 1) We now state a slighly more general result than Theorem 1, which holds for EF21-SGDM-ideal method with any πœ‚ (0, 1]. The statement of Theorem 1 follows by setting πœ‚= 1, since in that case EF21-SGDM-ideal coincides with EF21-SGD-ideal (5a), (5aa). Recall that EF21-SGDM-ideal (distributed variant) has the following update rule: π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑, 𝑔𝑑= 1 𝑖=1 𝑔𝑑 𝑖, (14) EF21-SGDM-ideal: 𝑣𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1) + πœ‚( 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 𝑓𝑖(π‘₯𝑑+1)), 𝑔𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1) + π’ž ( 𝑣𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) Theorem 4. Let 𝐿, 𝜎> 0, 0 < 𝛾 1/𝐿, 0 < πœ‚ 1 and 𝑛= 1. There exists a convex, 𝐿-smooth function 𝑓( ), a contractive compressor π’ž( ) satisfying Definition 1, and an unbiased stochastic gradient with bounded variance 𝜎2 such that if the method (14), (15) is run with a step-size 𝛾, then for all 𝑇 0 and for all π‘₯0 {(0, π‘₯0 (2)) R2 | π‘₯0 (2) < 0}, we have E [ 𝑓(π‘₯𝑇) 2] 1 60 min { πœ‚2𝜎2, 𝑓(π‘₯0) 2} . Fix 0 < πœ€ 𝐿/ 60 and π‘₯0 = (0, 1) . Additionally assume that 𝑛 1 and the variance of unbiased stochastic gradient is controlled by 𝜎2/𝐡for some 𝐡 1. If 𝐡< πœ‚2𝜎2 60πœ€2 , then we have E [ 𝑓(π‘₯𝑇) ] > πœ€for all 𝑇 0. Proof of Theorem 1. Part I. Consider 𝑓(π‘₯) = 𝐿 2 π‘₯ 2, π‘₯ R2. For each iteration 𝑑 0, let the random vector πœ‰π‘‘+1 be sampled uniformly at random from the set of vectors: 10 , 𝑧2 = ( 0 1 10 , 𝑧3 = ( 2 1 Define the stochastic gradient as 𝑓(π‘₯𝑑, πœ‰π‘‘) := 𝑓(π‘₯𝑑) + πœ‰π‘‘ = 𝐿π‘₯𝑑+ πœ‰π‘‘. Notice that E [ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), and E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] = 𝜎2. The update rule of method (14), (15) with such estimator is π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑= π‘₯𝑑 𝐿𝛾π‘₯𝑑 π›Ύπ’ž ( πœ‚πœ‰π‘‘) , where we choose π’ž( ) as a Top1 compressor. Notice that E [πœ‰π‘‘] = (0, 0) , but E [ π’ž(πœ‰π‘‘) ] = πœ‚ 10 (0, 1/3) = (0, 0) . By setting the initial iterate to π‘₯0 = (0, π‘₯0 (2)) for any π‘₯0 (2) < 0, we can derive E [ π‘₯𝑇] = (1 𝐿𝛾)𝑇π‘₯0 πœ‚ 𝑑=0 (1 𝐿𝛾)𝑑 = (1 𝐿𝛾)𝑇 ( 0 π‘₯0 (2) ) (1 (1 𝐿𝛾)𝑇) = ( 0 0 for any 0 𝛾 1/𝐿and any π‘₯0 (2) < 0. The inequality in (16) is because the first vector has strictly negative component π‘₯0 (2), and the second vector has non-positive second component when 𝛾> 0 and 𝜎2 > 0. Therefore, since 𝑓(π‘₯) 2 = 𝐿π‘₯ 2, we have E [ 𝑓(π‘₯𝑇) 2] = E [ 𝐿π‘₯𝑇 2] = E [ 𝐿π‘₯𝑇] 2 + E [ 𝐿π‘₯𝑇 E [ 𝐿π‘₯𝑇] 2] (1 𝐿𝛾)𝑇 𝐿π‘₯0 + πœ‚ 30(1 (1 𝐿𝛾)𝑇) (𝑖𝑖) (1 𝐿𝛾)2𝑇 𝑓(π‘₯0) 2 + πœ‚2𝜎2 30 (1 (1 𝐿𝛾)𝑇)2 𝑓(π‘₯0) 2 πœ‚2𝜎2 30 𝑓(π‘₯0) 2 + πœ‚2𝜎2 for all 𝑇 1, where in (𝑖) we used the form of vector E [ π‘₯𝑇] in (16), in (𝑖𝑖) we drop a nonnegative cross term, and use 𝑓(π‘₯0) = 𝐿π‘₯0. The last inequality follows by lower bounding a univariate quadratic function with respect to 𝑧:= (1 𝐿𝛾)𝑇for 0 𝑧 1, where optimal choice is 𝑧= πœ‚2𝜎2/(30 𝑓(π‘₯0) 2 + πœ‚2𝜎2). It is left to note that π‘₯𝑦 π‘₯+𝑦 1 2 min{π‘₯, 𝑦} for all π‘₯, 𝑦> 0. Part II. Fix 𝑛 1 and 𝐡 1. Let at each node 𝑖= 1, . . . 
, 𝑛, the random vectors πœ‰π‘‘ 𝑖be sampled independently and uniformly form the set of vectors: 10𝐡, 𝑧2 = ( 0 1 10𝐡, 𝑧3 = ( 2 1 Define a random matrix πœ‰π‘‘:= (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) . Then E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] = 𝜎2 𝐡. The update of the method (14), (15) on the same function instance will take the form π‘₯𝑑+1 = π‘₯𝑑 𝛾1 𝑖=1 𝑔𝑑 𝑖= π‘₯𝑑 𝐿𝛾π‘₯𝑑 𝛾1 𝑖=1 π’ž ( πœ‚πœ‰π‘‘ 𝑖 ) . Notice that in this case, we still have 𝑖=1 π’ž(πœ‚πœ‰π‘‘ 𝑖) 𝑖=1 E [ π’ž(πœ‚πœ‰π‘‘ 𝑖) ] = πœ‚ 10 (0, 1/3) = (0, 0) , which is independent (!) of 𝑛. Therefore, by similar derivations, we can conclude that E [ 𝑓(π‘₯𝑇) 2] 1 60 min { πœ‚2𝜎2 𝐡, 𝑓(π‘₯0) 2} > πœ€2, where we use that 𝐡< πœ‚2𝜎2 60πœ€2 , πœ€ 𝐿/ 60, and π‘₯0 = (0, 1) . Proof of Proposition 1. By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for 𝛾 1/𝐿we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 𝑓(π‘₯𝑑) 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 . (17) Now it remains to control the last term, which is due to the error introduced by a contractive compressor and stochastic gradients. We have E [ 𝑔𝑑 𝑓(π‘₯𝑑) 2] (𝑖) = E [ π’ž ( 𝑣𝑑 𝑓(π‘₯𝑑) ) 2] (𝑖𝑖) = E [ π’ž ( πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) ) ) 2] (𝑖𝑖𝑖) 2E [ π’ž ( πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) ) ) πœ‚( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑)) 2] +2πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] 2(2 𝛼)πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] where (𝑖) and (𝑖𝑖) use the update rule (6), (𝑖𝑖𝑖) holds by Young s inequality, and the last two steps hold by Definition 1 and Assumption 2. Subtracting 𝑓* from both sides of (17), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 F EF21-SGDM (Proof of Theorems 2 and 3) The statement of Theorem 2 follows directly from Theorem 3 and Remark 2. Let us prove Theorem 3. Proof of Theorem 3. In order to control the error between 𝑔𝑑and 𝑓(π‘₯𝑑), we decompose it into two terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 2 𝑔𝑑 𝑣𝑑 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 2 1 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 , and develop a recursion for each term above separately. Part I (a). Controlling the error of momentum estimator for each 𝑣𝑑 𝑖. Recall that by Lemma 2-(24), we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑣𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 3𝐿2 𝑖 πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2. (18) Averaging inequalities (18) over 𝑖= 1, . . . ,𝑛and denoting 𝑃𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] we have 𝑃𝑑+1 (1 πœ‚) 𝑃𝑑+ 3 𝐿2 Summing up the above inequality for 𝑑= 0, . . . , 𝑇 1, we derive 𝑑=0 𝑃𝑑 3 𝐿2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚π‘‡ 𝑃0. (19) Part I (b). Controlling the error of momentum estimator for 𝑣𝑑(on average). Similarly by Lemma 2-(25), we have for any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑣𝑑+1 𝑓(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] + 3𝐿2 πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2 where 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖is an auxiliary sequence. Summing up the above inequality for 𝑑= 0, . . . , 𝑇 1, and denoting 𝑃𝑑:= E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] , we derive 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 Part II. Controlling the error of contractive compressor and momentum estimator. By Lemma 3 we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑔𝑑+1 𝑖 𝑣𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] + 4πœ‚2 𝛼E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] 𝛼 E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2. (21) Averaging inequalities (21) over 𝑖= 1, . . . ,𝑛, denoting 𝑉𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] , and summing up the resulting inequality for 𝑑= 0, . . . , 𝑇 1, we obtain 𝑑=0 𝑃𝑑+ 8 𝐿2πœ‚2 𝑑=0 𝑅𝑑+ 2πœ‚2𝜎2 𝛼𝑇 𝑉0. (22) Part III. Combining steps I and II with descent lemma. 
By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (23) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾1 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 + 𝛾 𝑣𝑑 𝑓(π‘₯𝑑) 2 . Subtracting 𝑓* from both sides of (23), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2 1 𝑑=0 𝑉𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 16πœ‚2 𝑑=0 𝑃𝑑+ 2 1 𝑑=0 𝑃𝑑+ 4πœ‚2𝜎2 1 2 16𝛾2 𝐿2πœ‚2 (𝑖𝑖) 2𝛿0 𝛾𝑇+ 16πœ‚3𝜎2 1 2 16𝛾2 𝐿2πœ‚2 (𝑖𝑖𝑖) 2𝛿0 𝛾𝑇+ 16πœ‚3𝜎2 2𝛿0 𝛾𝑇+ 16πœ‚3𝜎2 where (𝑖) holds due to (22), (𝑖𝑖) utilizes (20), and (𝑖𝑖𝑖) follows by πœ‚ 1, and the last step holds due to the assumption on the step-size. We proved (9). We now find the particular values of parameters. Since 𝑔𝑖= 𝑣𝑖for all 𝑖 [𝑛], we have 𝑉0 = 0. Using 𝑣0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for all 𝑖= 1, . . . , 𝑛, we have 𝑃0 = E [ 𝑣0 𝑓(π‘₯0) 2] 𝜎2 𝑛𝐡init and 𝑃0 = 1 𝑖=1 E [ 𝑣0 𝑖 𝑓𝑖(π‘₯0) 2] 𝜎2 We can substitute the choice of 𝛾and obtain E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ πœ‚π‘›π΅init𝑇+ πœ‚πœŽ2 Since 𝐡init 𝜎2 𝐿𝛿0𝑛, we have E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ Notice that the choice of the momentum parameter such that πœ‚ ( 𝐿𝛿0𝛼2 𝜎2𝑇 ) 1/4 , πœ‚ ( 𝐿𝛿0𝛼 𝜎2𝑇 ) 1/2 and πœ‚ 𝛼 𝐿𝛿0𝐡init 𝜎 ensures that πœ‚3𝜎2 𝛼2𝐡init𝑇 𝐿𝛿0 πœ‚π‘‡. Therefore, we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎 𝛼𝑇 ) 2/3 + ( 𝐿𝛿0𝜎2 ) 1/2 + 𝜎 𝐿𝛿0 𝛼 𝐡init𝑇 Using 𝐡init min { 𝜎2𝐿 𝐿2𝛿0 , 𝜎 𝛼 𝐿𝛿0𝑇, 𝜎2/3 𝛼4/3𝑇2/3(𝐿𝛿0)1/3 , 𝑛 𝛼2𝑇 } , we obtain E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎 𝛼𝑇 ) 2/3 + ( 𝐿𝛿0𝜎2 It remains to notice that max { min { 𝜎2𝐿 𝐿2𝛿0 , 𝜎 𝛼 𝐿𝛿0𝑇, 𝜎2/3 𝛼4/3𝑇2/3(𝐿𝛿0)1/3 , 𝑛 𝛼2𝑇 } , 𝜎2 𝐿𝛿0𝑛 } 𝜎2 𝐿𝛿0 F.1 Controlling the error of momentum estimator Lemma 2. Let Assumption 1 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequence {𝑣𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖= 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , Define the sequence 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖. Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] (1 πœ‚)E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] + 3𝐿2 𝑖 πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, (24) E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] (1 πœ‚)E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] + 3𝐿2 πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 Proof. By the update rule of 𝑣𝑑 𝑖, we have E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] = E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] = E [ Eπœ‰π‘‘ 𝑖 [ (1 πœ‚)(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑)) 2] ] = (1 πœ‚)2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] ) E [ 𝑓𝑖(π‘₯𝑑 1) 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] + 3𝐿2 𝑖 πœ‚E [ π‘₯𝑑 π‘₯𝑑+1 2] + πœ‚2𝜎2, where the first inequality holds by Young s inequality, and the last step uses smoothness of 𝑓𝑖( ) (Assumption 1), which concludes the proof of (24). For each 𝑑 = 0, . . . , 𝑇 1, define a random vector πœ‰π‘‘ := (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) and denote by 𝑓(π‘₯𝑑, πœ‰π‘‘) := 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖). Note that the entries of the random vector πœ‰π‘‘are independent and Eπœ‰π‘‘[ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), then we have 𝑣𝑑= 𝑣𝑑 1 + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑣𝑑 1) , where 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖is an auxiliary sequence. 
Therefore, we can similarly derive E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] = E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑣𝑑 1) 2] = E [ Eπœ‰π‘‘ [ (1 πœ‚)(𝑣𝑑 1 𝑓(π‘₯𝑑)) + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) ) 2] ] = (1 πœ‚)2E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) 2] + πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] ) E [ 𝑓(π‘₯𝑑 1) 𝑓(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] + 3𝐿2 πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 where the last step uses smoothness of 𝑓( ) (Assumption 1), which concludes the proof of (25). F.2 Controlling the error of contractive compression and momentum estimator Lemma 3. Let Assumption 1 be satisfied, and suppose π’žis a contractive compressor with 𝛼 1 2. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑣𝑑 𝑖}𝑑 0 and {𝑔𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖= 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , 𝑔𝑑 𝑖= 𝑔𝑑 1 𝑖 + π’ž ( 𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) , Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 4πœ‚2 𝛼E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] 𝛼 E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2. (26) Proof. By the update rules of 𝑔𝑑 𝑖and 𝑣𝑑 𝑖, we derive E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] = E [ 𝑔𝑑 1 𝑖 𝑣𝑑 𝑖+ π’ž(𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) 2] = E [ Eπ’ž [ π’ž(𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) (𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 ) 2] ] (𝑖) (1 𝛼)E [ 𝑣𝑑 𝑖 𝑔𝑑 1 𝑖 2] (𝑖𝑖) = (1 𝛼)E [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] = (1 𝛼)E [ Eπœ‰π‘‘ 𝑖 [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] ] = (1 𝛼)E [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) 2] +(1 𝛼)πœ‚2E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] (𝑖𝑖𝑖) (1 𝛼) (1 + 𝜌)E [ 𝑣𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + (1 𝛼) (1 + 𝜌 1)πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] (𝑖𝑣) = (1 πœƒ)E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + (1 𝛼)πœ‚2𝜎2 (𝑣) (1 πœƒ)E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 2π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +2π›½πœ‚2E [ 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] + πœ‚2𝜎2 (1 πœƒ)E [ 𝑔𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 2π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +2𝛽𝐿2 π‘–πœ‚2E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, where (𝑖) is due to the definition of a contractive compressor (Definition 1), (𝑖𝑖) follows by the update rule of 𝑣𝑑 𝑖, (𝑖𝑖𝑖) and (𝑣) hold by Young s inequality for any 𝜌> 0. In (𝑖𝑣), we introduced the notation πœƒ:= 1 (1 𝛼)(1 + 𝜌), and 𝛽:= (1 𝛼)(1 + 𝜌 1). The last step follows by smoothness of 𝑓𝑖( ) (Assumption 1). The proof is complete by the choice 𝜌= 𝛼/2, which guarantees 1 πœƒ 1 𝛼/2, and 2𝛽 4/𝛼. G Further Improvement Using Double Momentum (Proof of Corollary 3) Algorithm 3 EF21-SGD2M (Error Feedback 2021 Enhanced with Double Momentum) 1: Input: starting point π‘₯0, step-size 𝛾> 0, parameter πœ‚ (0, 1], initial batch size 𝐡init 2: Initialize 𝑒0 𝑖= 𝑣0 𝑖= 𝑔0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for 𝑖= 1, . . . , 𝑛; 𝑔0 = 1 𝑛 𝑛 𝑖=1 𝑔0 𝑖 3: for 𝑑= 0,1, 2, . . . , 𝑇 1 do 4: Master computes π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑and broadcasts π‘₯𝑑+1 to all nodes 5: for all nodes 𝑖= 1, . . . , 𝑛in parallel do 6: Compute the first momentum estimator 𝑣𝑑+1 𝑖 = (1 πœ‚)𝑣𝑑 𝑖+ πœ‚ 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 7: Compute the second momentum estimator 𝑒𝑑+1 𝑖 = (1 πœ‚)𝑒𝑑 𝑖+ πœ‚π‘£π‘‘+1 𝑖 8: Compress 𝑐𝑑+1 𝑖 = π’ž(𝑒𝑑+1 𝑖 𝑔𝑑 𝑖) and send 𝑐𝑑+1 𝑖 to the master 9: Update local state 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ 𝑐𝑑+1 𝑖 10: end for 11: Master computes 𝑔𝑑+1 = 1 𝑛 𝑛 𝑖=1 𝑔𝑑+1 𝑖 via 𝑔𝑑+1 = 𝑔𝑑+ 1 𝑛 𝑛 𝑖=1 𝑐𝑑+1 𝑖 12: end for In this section, we state the detailed version of Corollary 3 in Theorem 5, followed by its formal proof. 
Notice that the key reason for the sample complexity improvement of the double momentum variant compared to EF21-SGDM (Theorem 3) is that in (27), one of the terms has better dependence on πœ‚ compared to (9) in Theorem 3, i.e., πœ‚4𝜎2/𝛼instead of πœ‚2𝜎2/𝛼. As a result, this term is dominated by other terms and vanishes in Corollary 3. Theorem 5. Let Assumptions 1 and 2 hold. Let Λ†π‘₯𝑇be sampled uniformly at random from the iterates of the method. Let Algorithm 3 run with a contractive compressor. For all πœ‚ (0, 1] and 𝐡init 1, with 𝛾 min { 𝛼 60 𝐿, πœ‚ 16𝐿 } , we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ( Ξ¨0 where Ξ¨0 := 𝛿0+ 𝛾 πœ‚E [ 𝑣0 𝑓(π‘₯0) 2] + π›Ύπœ‚4 𝑛 𝑛 𝑖=1 E [ 𝑣0 𝑖 𝑓𝑖(π‘₯0) 2] . Setting initial batch size 𝐡init = 𝜎2 𝐿𝛿0 , step-size and momentum parameters 60 𝐿 , πœ‚ 16𝐿 ) 1/4 , ( 𝐿𝛿0𝑛 ) 1/2 , 𝛼 𝐿𝛿0𝐡init 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎2 Proof. In order to control the error between 𝑔𝑑and 𝑓(π‘₯𝑑), we decompose it into three terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 3 𝑔𝑑 𝑒𝑑 2 + 3 𝑒𝑑 𝑣𝑑 2 + 3 𝑣𝑑 𝑓(π‘₯𝑑) 2 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2 + 3 𝑒𝑑 𝑣𝑑 2 + 3 𝑣𝑑 𝑓(π‘₯𝑑) 2 , where we define the sequences 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖and 𝑒𝑑:= 1 𝑛 𝑛 𝑖=1 𝑒𝑑 𝑖. In the following, we develop a recursion for each term above separately. Part I. Controlling the error of momentum estimator for each 𝑣𝑑 𝑖and on average for 𝑣𝑑. Denote 𝑃𝑑:= E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] , 𝑃𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] . Similarly to Part I of the proof of Theorem 3, we have 𝑑=0 𝑃𝑑 3 𝐿2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚π‘‡ 𝑃0, (29) 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 Part II (a). Controlling the error of the second momentum estimator for each 𝑒𝑑 𝑖. Recall that by Lemma 4-(37), we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑒𝑑+1 𝑖 𝑣𝑑+1 𝑖 2] (1 πœ‚)E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] + 6πœ‚E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] +6𝐿2 π‘–πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2, (31) Averaging inequalities (31) over 𝑖= 1, . . . ,𝑛and denoting 𝑄𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] , we have 𝑄𝑑+1 (1 πœ‚) 𝑄𝑑+ 6πœ‚ 𝑃𝑑+ 6 𝐿2πœ‚π‘…π‘‘+ πœ‚2𝜎2. Summing up the above inequalities for 𝑑= 0, . . . , 𝑇 1, we derive 𝑑=0 𝑃𝑑+ 6 𝐿2 1 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚2 + 6 𝐿2 ) 1 𝑇 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 + 1 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 + 6 πœ‚π‘‡ 𝑃0, (32) where we used (29), the bound πœ‚ 1, and 𝑒0 𝑖= 𝑣0 𝑖for 𝑖= 1, . . . , 𝑛. Part II (b). Controlling the error of the second momentum estimator 𝑒𝑑(on average). Similarly by Lemma 4-(38), we have for any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑒𝑑+1 𝑣𝑑+1 2] (1 πœ‚)E [ 𝑒𝑑 𝑣𝑑 2] + 6πœ‚E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] +6𝐿2πœ‚E [ π‘₯𝑑+1 π‘₯𝑑 2] + πœ‚2𝜎2 Summing up the above inequalities for 𝑑= 0, . . . , 𝑇 1, and denoting 𝑄𝑑:= E [ 𝑒𝑑 𝑣𝑑 2] , we derive 𝑑=0 𝑃𝑑+ 6𝐿2 1 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 + 1 πœ‚2 + 6𝐿2 ) 1 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 where we used (30), the bound πœ‚ 1, and 𝑒0 = 𝑣0. Part III. Controlling the error of contractive compressor and the double momentum estimator. By Lemma 5 we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑔𝑑+1 𝑖 𝑒𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] + 6πœ‚2 𝛼E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] 𝛼E [ 𝑣𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 2] + 6𝐿2 π‘–πœ‚4 𝛼 E [ π‘₯𝑑 π‘₯𝑑+1 2] + πœ‚4𝜎2. Averaging inequalities (34) over 𝑖= 1, . . . ,𝑛, denoting 𝑉𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] , and summing up the resulting inequality for 𝑑= 0, . . . , 𝑇 1, we obtain 𝑑=0 𝑉𝑑 12πœ‚2 𝑑=0 𝑄𝑑+ 12πœ‚4 𝑑=0 𝑃𝑑+ 12 𝐿2πœ‚4 𝑑=0 𝑅𝑑+ 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 7πœ‚πœŽ2 ) 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 ) 𝑑=0 𝑅𝑑+ 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 12 7πœ‚3𝜎2 𝛼2 + 12 3 𝐿2πœ‚2 𝑑=0 𝑅𝑑+ 12πœ‚5𝜎2 𝑑=0 𝑅𝑑+ 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 84πœ‚3𝜎2 𝛼2 + 12πœ‚5𝜎2 Part IV. Combining steps I, II and III with descent lemma. 
By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (36) 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2 + 3𝛾 𝑒𝑑 𝑣𝑑 2 + 3𝛾 𝑣𝑑 𝑓(π‘₯𝑑) 2 . Subtracting 𝑓* from both sides of (36), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 3 1 𝑑=0 𝑉𝑑+ 3 1 𝑑=0 𝑄𝑑+ 3 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 3 276 𝐿2 𝑑=0 𝑅𝑑+ 3 84πœ‚3𝜎2 𝛼2 + 3 12πœ‚5𝜎2 𝛼2 + 3 2πœ‚4𝜎2 𝑑=0 𝑅𝑑+ 3 7πœ‚πœŽ2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 𝑛 1 2𝛾2 1 𝑇 𝛼2𝑇 𝑃0 + 18 = 2𝛿0 𝛾𝑇+ 3 84πœ‚3𝜎2 𝛼2 + 3 12πœ‚5𝜎2 𝛼2 + 3 2πœ‚4𝜎2 πœ‚2 + 3 276 𝐿2 𝛼2𝑇 𝑃0 + 21 = 2𝛿0 𝛾𝑇+ 288πœ‚3𝜎2 𝛼2𝑇 𝑃0 + 21 where (𝑖) holds due to (30), (35) and (33), the last two steps hold because of the assumption on the step-size, and πœ‚ 1, which completes the proof of the first part of Theorem. Notice that it suffices to take the same initial batch-size as in the proof of the Theorem 3 in order to "remove" 𝑃0 and 𝑃0 terms, since the power of πœ‚in front of 𝑃0 is larger here compared to the proof of Theorem 3. The choice of the momentum parameter such that πœ‚ ( 𝐿𝛿0𝛼2 𝜎2𝑇 ) 1/4 , 𝜎2𝑇 ) 1/2 ensures that πœ‚3𝜎2 πœ‚π‘‡, and πœ‚πœŽ2 πœ‚π‘‡. Therefore, we can guarantee that the choice πœ‚= min { 𝛼 𝐿𝛿0𝐡init 𝜎 , ( 𝐿𝛿0𝛼2 𝜎2𝑇 ) 1/4 , ( 𝐿𝛿0𝑛 𝜎2𝑇 ) 1/2} ensures that E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ( 𝐿𝛿0𝜎2/3 ) 3/4 + ( 𝐿𝛿0𝜎2 G.1 Controlling the error of second momentum estimator Lemma 4. Let Assumption 1 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑣𝑑 𝑖}𝑑 0 and {𝑒𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖= 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , 𝑒𝑑 𝑖= 𝑒𝑑 1 𝑖 + πœ‚ ( 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) . Define the sequences 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖and 𝑒𝑑:= 1 𝑛 𝑛 𝑖=1 𝑒𝑑 𝑖. Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 6πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +6𝐿2 π‘–πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, (37) E [ 𝑒𝑑 𝑣𝑑 2] (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 6πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] +6𝐿2πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 Proof. By the update rule of 𝑣𝑑 𝑖, we have E [ 𝑒𝑑 𝑖 𝑣𝑑 𝑖 2] = E [ 𝑒𝑑 1 𝑖 𝑣𝑑 𝑖+ πœ‚(𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) 2] = (1 πœ‚)2E [ 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 2] = (1 πœ‚)2E [ (1 πœ‚)𝑣𝑑 1 𝑖 + πœ‚ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑒𝑑 1 𝑖 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 ) + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖)) 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 ) + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) + πœ‚( 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖)) 2] = (1 πœ‚)2E [ Eπœ‰π‘‘ 𝑖 [ (𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 ) + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) + πœ‚( 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖)) 2] ] = (1 πœ‚)2 ( E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) 2] + πœ‚2E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] ) (1 πœ‚)2E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑)) 2] + πœ‚2𝜎2 (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] ) πœ‚2E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 3πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 6πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +6πœ‚E [ 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] + 6πœ‚E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +6𝐿2 π‘–πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2, where the first inequality holds Assumption 2, the second inequality holds by Young s inequality, and the last step uses smoothness of 𝑓𝑖( ) (Assumption 1), which concludes the proof of (37). For each 𝑑 = 0, . . . , 𝑇 1, define a random vector πœ‰π‘‘ := (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) and denote by 𝑓(π‘₯𝑑, πœ‰π‘‘+1) := 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖). 
Note that the entries of the random vector πœ‰π‘‘are independent and Eπœ‰π‘‘[ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), then we have 𝑣𝑑= 𝑣𝑑 1 + πœ‚ ( 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑣𝑑 1) , 𝑒𝑑 𝑖= 𝑒𝑑 1 𝑖 + πœ‚ ( 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) , where 𝑣𝑑:= 1 𝑛 𝑛 𝑖=1 𝑣𝑑 𝑖, 𝑒𝑑:= 1 𝑛 𝑛 𝑖=1 𝑒𝑑 𝑖are auxiliary sequences. Therefore, we can similarly derive E [ 𝑒𝑑 𝑣𝑑 2] = E [ 𝑒𝑑 1 𝑣𝑑+ πœ‚(𝑣𝑑 𝑒𝑑 1) 2] = (1 πœ‚)2E [ 𝑣𝑑 𝑒𝑑 1 2] = (1 πœ‚)2E [ (1 πœ‚)𝑣𝑑 1 + πœ‚ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑒𝑑 1 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑣𝑑 1) + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑, πœ‰π‘‘)) 2] = (1 πœ‚)2E [ (𝑒𝑑 1 𝑣𝑑 1) + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) + πœ‚( 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑, πœ‰π‘‘)) 2] = (1 πœ‚)2E [ Eπœ‰π‘‘ [ (𝑒𝑑 1 𝑣𝑑 1) + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) + πœ‚( 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑, πœ‰π‘‘)) 2] ] = (1 πœ‚)2 ( E [ 𝑒𝑑 1 𝑣𝑑 1 + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) 2] + πœ‚2E [ 𝑓(π‘₯𝑑, πœ‰π‘‘) 𝑓(π‘₯𝑑) 2] ) (1 πœ‚)2E [ 𝑒𝑑 1 𝑣𝑑 1 + πœ‚(𝑣𝑑 1 𝑓(π‘₯𝑑)) 2] + πœ‚2𝜎2 (1 πœ‚)2 ( 1 + πœ‚ ) E [ 𝑒𝑑 1 𝑣𝑑 1 2] ) πœ‚2E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 3πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 6πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] +6πœ‚E [ 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑 1) 2] + πœ‚2𝜎2 (1 πœ‚)E [ 𝑒𝑑 1 𝑣𝑑 1 2] + 6πœ‚E [ 𝑣𝑑 1 𝑓(π‘₯𝑑 1) 2] +6𝐿2πœ‚E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚2𝜎2 where the first inequality holds Assumption 2, the second inequality holds by Young s inequality, and the last step uses smoothness of 𝑓( ) (Assumption 1), which concludes the proof of (38). G.2 Controlling the error of contractive compression and double momentum estimator Lemma 5. Let Assumption 1 be satisfied, and suppose π’žis a contractive compressor. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑣𝑑 𝑖}𝑑 0, {𝑒𝑑 𝑖}𝑑 0, and {𝑔𝑑 𝑖}𝑑 0 be updated via 𝑣𝑑 𝑖 = 𝑣𝑑 1 𝑖 + πœ‚ ( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) , 𝑒𝑑 𝑖 = 𝑒𝑑 1 𝑖 + πœ‚ ( 𝑣𝑑 𝑖 𝑒𝑑 1 𝑖 ) , 𝑔𝑑 𝑖 = 𝑔𝑑 1 𝑖 + π’ž ( 𝑒𝑑 𝑖 𝑔𝑑 1 𝑖 ) . Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 1 𝑖 𝑒𝑑 1 𝑖 2] + 6πœ‚2 𝛼E [ 𝑒𝑑 1 𝑖 𝑣𝑑 1 𝑖 2] 𝛼E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1, πœ‰π‘‘ 1 𝑖 ) 2] + 6𝐿2 π‘–πœ‚4 𝛼 E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚4𝜎2. Proof. By the update rules of 𝑔𝑑 𝑖, 𝑒𝑑 𝑖and 𝑣𝑑 𝑖, we derive E [ 𝑔𝑑 𝑖 𝑒𝑑 𝑖 2] = E [ 𝑔𝑑 1 𝑖 𝑒𝑑 𝑖+ π’ž(𝑒𝑑 𝑖 𝑔𝑑 1 𝑖 ) 2] (𝑖) (1 𝛼)E [ 𝑒𝑑 𝑖 𝑔𝑑 1 𝑖 2] (𝑖𝑖) = (1 𝛼)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑣𝑑 1 𝑖 ) 2] = (1 𝛼)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) +πœ‚2( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑)) 2] = (1 𝛼)E [ Eπœ‰π‘‘ 𝑖 [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) +πœ‚2( 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑)) 2] ] = (1 𝛼)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 + πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) 2] +(1 𝛼)πœ‚4E [ 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑) 2] (𝑖𝑖𝑖) (1 𝛼)(1 + 𝜌)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] +(1 𝛼)(1 + 𝜌 1)E [ πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑣𝑑 1 𝑖 ) 2] (𝑖𝑣) = (1 πœƒ)E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + πœ‚4𝜎2 +𝛽E [ πœ‚(𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑 1) 𝑣𝑑 1 𝑖 ) + πœ‚2( 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] (𝑣) (1 πœƒ) E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + 3π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 2] +3π›½πœ‚4E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +3π›½πœ‚4E [ 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑 1) 2] + πœ‚4𝜎2 (1 πœƒ) E [ 𝑒𝑑 1 𝑖 𝑔𝑑 1 𝑖 2] + 3π›½πœ‚2E [ 𝑣𝑑 1 𝑖 𝑒𝑑 1 𝑖 2] +3π›½πœ‚4E [ 𝑣𝑑 1 𝑖 𝑓𝑖(π‘₯𝑑 1) 2] +3𝛽𝐿2 π‘–πœ‚4E [ π‘₯𝑑 π‘₯𝑑 1 2] + πœ‚4𝜎2 where (𝑖) is due to definition of a contractive compressor (Definition 1), (𝑖𝑖) follows by the update rule of 𝑣𝑑 𝑖and 𝑒𝑑 𝑖, (𝑖𝑖𝑖) and (𝑣) hold by Young s inequality for any 𝜌> 0. In (𝑖𝑣), we introduced the notation πœƒ:= 1 (1 𝛼)(1 + 𝜌), and 𝛽:= (1 𝛼)(1 + 𝜌 1). The last step follows by smoothness of 𝑓𝑖( ) (Assumption 1). 
The proof is complete by the choice 𝜌= 𝛼/2, which guarantees 1 πœƒ 1 𝛼/2, and 3𝛽 6/𝛼. H EF21-SGDM with Absolute Compressor In this section, we complement our theory by analyzing EF21-SGDM under a different class of widely used biased compressors, namely, absolute compressors, which are defined as follows. Definition 2 (Absolute compressors). We say that a (possibly randomized) map π’ž: R𝑑 R𝑑is an absolute compression operator if there exists a constant > 0 such that E [ π’ž(π‘₯) π‘₯ 2] 2, π‘₯ R𝑑. (40) This class includes important examples of compressors such as hard-threshold sparsifier [Sahu et al., 2021], (stochatsic) rounding schemes with bounded error [Gupta et al., 2015] and scaled integer rounding [Sapio et al., 2021]. Algorithm 4 EF21-SGDM (abs) 1: Input: starting point π‘₯0, step-size 𝛾> 0, momentum πœ‚ (0, 1], initial batch size 𝐡init 2: Initialize 𝑣0 𝑖= 𝑔0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for 𝑖= 1, . . . , 𝑛; 𝑔0 = 1 𝑛 𝑛 𝑖=1 𝑔0 𝑖 3: for 𝑑= 0,1, 2, . . . , 𝑇 1 do 4: Master computes π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑and broadcasts π‘₯𝑑+1 to all nodes 5: for all nodes 𝑖= 1, . . . , 𝑛in parallel do 6: Compute momentum estimator 𝑣𝑑+1 𝑖 = (1 πœ‚)𝑣𝑑 𝑖+ πœ‚ 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 7: Compress 𝑐𝑑+1 𝑖 = π’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 ) and send 𝑐𝑑+1 𝑖 to the master 8: Update local state 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ 𝛾𝑐𝑑+1 𝑖 9: end for 10: Master computes 𝑔𝑑+1 = 1 𝑛 𝑛 𝑖=1 𝑔𝑑+1 𝑖 via 𝑔𝑑+1 = 𝑔𝑑+ 1 𝑛 𝑛 𝑖=1 𝛾𝑐𝑑+1 𝑖 11: end for To accomodate absolute compressors into our EF21-SGDM method, we need to make a slight modification to our algorithm, see Algorithm 4. At each iteration, before compressing the difference 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖, we divide it by the step-size 𝛾. Later, we multiply the compressed vector 𝑐𝑑+1 𝑖 by 𝛾, i.e., have 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ π›Ύπ’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 Such modification is necessary for absolute compressors because by Definition 2 the compression error is not proportional to π‘₯ 2, but merely an absolute constant 2. In fact, Algorithm 4 is somewhat more universal in the sense that it can be also applied for contractive compressors.20 We derive the following result for EF21-SGDM (abs). Theorem 6. Let Assumptions 1 and 2 hold. Let Λ†π‘₯𝑇be sampled uniformly at random from the iterates of the method. Let Algorithm 4 run with an absolute compressor (Definition 2). For all πœ‚ (0, 1] and 𝐡init 1, with 𝛾 πœ‚ 4𝐿, we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ( Ξ¨0 𝛾𝑇+ 𝛾2 2 + πœ‚πœŽ2 where Ξ¨0 := 𝛿0 + 𝛾 πœ‚E [ 𝑣0 𝑓(π‘₯0) 2] is a Lyapunov function. With the following step-size, momentum parameter, and initial batch size ) 1/3 , ( 𝐿𝛿0𝑛 , 𝐡𝑖𝑛𝑖𝑑= 𝜎2 E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ) 2/3 + ( 𝐿𝛿0𝜎2 20It is straightforward to modify the proof of our Theorem 3 for the case when Algorithm 4 is applied with a contractive compressor. Corollary 4. Under the setting of Theorem 6, we have E [ 𝑓(Λ†π‘₯𝑇) ] πœ€after 𝑇 = π‘›πœ€4 ) iterations. Remark 3. The sample complexity result in Corollary 4 matches the one derived for Double Squeeze algorithm [Tang et al., 2020], which is different from Algorithm 4. Proof. Similarly to the proof of Theorem 3, we control the error between 𝑔𝑑and 𝑓(π‘₯𝑑) by decomposing it into two terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 2 𝑔𝑑 𝑣𝑑 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 2 1 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 +2 𝑣𝑑 𝑓(π‘₯𝑑) 2 . Again, for the second term above we can use the recursion developed for momentum estimator Lemma 2. However, since we use a different compressor here, we need to bound 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2 term differently, thus we invoke Lemma 6 for absolute compressor. Part I. Controlling the error of momentum estimator on average for 𝑣𝑑. 
Denote 𝑃𝑑:= E [ 𝑣𝑑 𝑓(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] . Similarly to Part I of the proof of Theorem 3, we have by Lemma 2 𝑑=0 𝑅𝑑+ πœ‚πœŽ2 Part II. Controlling the error of absolute compressor and momentum estimator. By Lemma 6 we have for any 0 < πœ‚ 1 and 𝑑 0 𝑖=1 E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] 𝛾2 2. (44) Part III. Combining steps I and II with descent lemma. By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (45) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑉𝑑+ 𝛾𝑃𝑑. Subtracting 𝑓* from both sides of (45), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2 1 𝑑=0 𝑉𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 2𝛾2 2 + 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖𝑖) 2𝛿0 𝛾𝑇+ 2𝛾2 2 + ( 6𝐿2 𝑑=0 𝑅𝑑+ 2πœ‚πœŽ2 2𝛿0 𝛾𝑇+ 2𝛾2 2 + 2πœ‚πœŽ2 where in (𝑖) and (𝑖𝑖) we apply (43), (44), and in the last step we use the assumption on the step-size 𝛾 πœ‚/(4𝐿). Setting 𝛾= πœ‚ 4𝐿, and taking πœ‚ ( 𝐿3𝛿0 Ξ”2𝑇 ) 1/3 we can ensure that πœ‚2Ξ”2 πœ‚π‘‡, since πœ‚ ( 𝐿𝛿0𝑛 we have πœ‚πœŽ2 πœ‚π‘‡. Finally, by setting the initial batch-size to 𝐡𝑖𝑛𝑖𝑑= 𝜎2 𝐿𝛿0𝑛, we have 1 πœ‚π‘‡π‘ƒ0 = 𝜎2 πœ‚π‘‡π‘›π΅π‘–π‘›π‘–π‘‘ 𝐿𝛿0 πœ‚π‘‡. Therefore, we derive 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2𝛾2 2 + 2πœ‚πœŽ2 𝑇 + 𝛿 2/3 0 2/3 𝑇 2/3 + 𝜎(𝐿𝛿0) 1/2 H.1 Controlling the error of absolute compression Lemma 6. Let π’žbe an absolute compressor and 𝑔𝑑+1 𝑖 be updated according to Algorithm 4, then for 𝑑 0, we have 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑣𝑑 𝑖 2] 𝛾2 2. Proof. By the update rule for 𝑔𝑑+1 𝑖 in Algorithm 4 and Definition 2, we can bound E [ 𝑔𝑑+1 𝑖 𝑣𝑑+1 𝑖 2] = E [ π›Ύπ’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 ) (𝑣𝑑+1 𝑖 𝑔𝑑 𝑖) [ π’ž ( 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 ) 𝑣𝑑+1 𝑖 𝑔𝑑 𝑖 𝛾 I EF21-STORM/MVR Algorithm 5 EF21-STORM/MVR 1: Input: π‘₯0, step-size 𝛾> 0, parameter πœ‚ (0, 1], 𝐡init 1 2: Initialize 𝑀0 𝑖= 𝑔0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for 𝑖= 1, . . . , 𝑛; 𝑔0 = 1 𝑛 𝑛 𝑖=1 𝑔0 𝑖 3: for 𝑑= 0,1, 2, . . . , 𝑇 1 do 4: Master computes π‘₯𝑑+1 = π‘₯𝑑 𝛾𝑔𝑑and broadcasts π‘₯𝑑+1 to all nodes 5: for all nodes 𝑖= 1, . . . , 𝑛in parallel do 6: Draw πœ‰π‘‘+1 𝑖 and compute two (stochastic) gradients 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) and 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 7: Compute variance reduced STORM/MVR estimator 8: 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )) 9: Compress 𝑐𝑑+1 𝑖 = π’ž(𝑀𝑑+1 𝑖 𝑔𝑑 𝑖) and send 𝑐𝑑+1 𝑖 to the master 10: Update local state 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ 𝑐𝑑+1 𝑖 11: end for 12: Master computes 𝑔𝑑+1 = 1 𝑛 𝑛 𝑖=1 𝑔𝑑+1 𝑖 via 𝑔𝑑+1 = 𝑔𝑑+ 1 𝑛 𝑛 𝑖=1 𝑐𝑑+1 𝑖 13: end for Assumption 3 (Individual smoothness21). For each 𝑖= 1, . . . , 𝑛, every realization of πœ‰π‘– π’Ÿπ‘–, the stochastic gradient 𝑓𝑖(π‘₯,πœ‰π‘–) is ℓ𝑖-Lipschitz, i.e., for all π‘₯, 𝑦 R𝑑 𝑓𝑖(π‘₯,πœ‰π‘–) 𝑓𝑖(𝑦, πœ‰π‘–) ℓ𝑖 π‘₯ 𝑦 . We denote β„“2 := 1 𝑛 𝑛 𝑖=1 β„“2 𝑖 Theorem 7. Let Assumptions 1, 2 and 3 hold. Let Λ†π‘₯𝑇be sampled uniformly at random from the iterates of the method. Let Algorithm 5 run with a contractive compressor. For all πœ‚ (0, 1] and 𝐡init 1, with 𝛾 min { 𝛼 8 𝐿, 𝛼 } , we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ ( Ξ¨0 where Ξ¨0 := 𝛿0 + 𝛾 πœ‚E [ 𝑣0 𝑓(π‘₯0) 2] + π›Ύπœ‚ 𝑛 𝑛 𝑖=1 E [ 𝑣0 𝑖 𝑓𝑖(π‘₯0) 2] . With the following step-size, momentum parameter, and initial batch size ( ℓ𝛿0 𝑛 𝜎2𝑇 and 𝐡init = max { 𝜎2 𝐿𝛿0𝑛, 𝛼𝑛 𝑇 } , we have E [ 𝑓(Λ†π‘₯𝑇) 2] π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ Corollary 5. Under the setting of Theorem 7. we have E [ 𝑓(Λ†π‘₯𝑇) ] πœ€after 𝑇 = π›Όπœ€2 + ℓ𝛿0𝜎 1/3 𝛼1/3 π‘›πœ€7/3 + ℓ𝛿0𝜎 1/2 𝛼1/4 π‘›πœ€5/2 + ℓ𝛿0𝜎 π‘›πœ€3 ) iterations. 
Recently, Yau and Wai [2022] propose and analyze a Do Co M-SGT algorithm for decentralized optimization with contractive compressor under the above Assumption 3. When their method is specialized to centralized setting (with mixing constant 𝜌= 1), their total sample complexity becomes π’ͺ ( β„“ π›Όπœ€2 + 𝑛4/5𝜎3/2 𝛼9/4πœ€3/2 + 𝜎3 π‘›πœ€3 ) (see Table 1 or Theorem 4.1 in [Yau and Wai, 2022]). In contrast, the sample complexity given in our Corollary 5 improves the dependence on 𝜎in the last term and, moreover, achieves the linear speedup in terms of 𝑛for all stochastic terms in the sample complexity. 21This assumption can be also relaxed to so-called expected smoothness. Proof. Similarly to the proof of Theorem 3, we control the error between 𝑔𝑑and 𝑓(π‘₯𝑑) by decomposing it into two terms 𝑔𝑑 𝑓(π‘₯𝑑) 2 2 𝑔𝑑 𝑀𝑑 2+2 𝑀𝑑 𝑓(π‘₯𝑑) 2 2 1 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2+2 𝑀𝑑 𝑓(π‘₯𝑑) 2 . In the following, we develop a recursive bound for each term above separately. Part I. Controlling the variance of STORM/MVR estimator for each 𝑀𝑑 𝑖and on average 𝑀𝑑. Recall that by Lemma 7-(55), we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 2β„“2 𝑖E [ π‘₯𝑑 π‘₯𝑑+1 2] + 2πœ‚2𝜎2. (49) Averaging inequalities (49) over 𝑖= 1, . . . ,𝑛and denoting 𝑃𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] , 𝑅𝑑:= E [ π‘₯𝑑 π‘₯𝑑+1 2] we have 𝑃𝑑+1 (1 πœ‚) 𝑃𝑑+ 2 β„“2𝑅𝑑+ 2πœ‚2𝜎2. Summing up the above inequality for 𝑑= 0, . . . , 𝑇 1, we derive 𝑑=0 𝑃𝑑 2 β„“2 𝑑=0 𝑅𝑑+ 2πœ‚πœŽ2 + 1 πœ‚π‘‡ 𝑃0. (50) Similarly by Lemma 7-(56) denoting 𝑃𝑑:= E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] , we have 𝑑=0 𝑃𝑑 2 β„“2 𝑑=0 𝑅𝑑+ 2πœ‚πœŽ2 Part II. Controlling the variance of contractive compressor and STORM/MVR estimator. By Lemma 8 we have for each 𝑖= 1, . . . , 𝑛, and any 0 < πœ‚ 1 and 𝑑 0 E [ 𝑔𝑑+1 𝑖 𝑀𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2] + 4πœ‚2 𝛼E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + ( 4𝐿2 𝑖 𝛼 + β„“2 𝑖 ) E [ π‘₯𝑑+1 π‘₯𝑑 2] + 2πœ‚2𝜎2. (52) Averaging inequalities (52) over 𝑖= 1, . . . ,𝑛, denoting 𝑉𝑑:= 1 𝑛 𝑛 𝑖=1 E [ 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2] , and summing up the resulting inequality for 𝑑= 0, . . . , 𝑇 1, we obtain 𝑑=0 𝑅𝑑+ 2πœ‚2𝜎2 𝛼2𝑇 𝑃0. (53) Part III. Combining steps I and II with descent lemma. By smoothness (Assumption 1) of 𝑓( ) it follows from Lemma 1 that for any 𝛾 1/(2𝐿) we have 𝑓(π‘₯𝑑+1) 𝑓(π‘₯𝑑) 𝛾 π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾 𝑔𝑑 𝑓(π‘₯𝑑) 2 (54) π‘₯𝑑+1 π‘₯𝑑 2 + 𝛾1 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2 + 𝛾 𝑀𝑑 𝑓(π‘₯𝑑) 2 . Subtracting 𝑓* from both sides of (54), taking expectation and defining 𝛿𝑑:= E [𝑓(π‘₯𝑑) 𝑓*], we derive E [ 𝑓(Λ†π‘₯𝑇) 2] = 1 𝑇 𝑑=0 E [ 𝑓(π‘₯𝑑) 2] 2𝛿0 𝛾𝑇+ 2 1 𝑑=0 𝑉𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖) 2𝛿0 𝛾𝑇+ 𝑑=0 𝑅𝑑+ 2 1 𝑑=0 𝑃𝑑 1 2𝛾2 1 𝑇 (𝑖𝑖) 2𝛿0 𝛾𝑇+ 2𝛿0 𝛾𝑇+ 32πœ‚3𝜎2 where in (𝑖) we apply (53), in (𝑖𝑖) we use (51), and the last step follows by assumption on the step-size, which proves (48). We now find the particular values of parameters. Using 𝑀0 𝑖= 1 𝐡init 𝐡init 𝑗=1 𝑓𝑖(π‘₯0, πœ‰0 𝑖,𝑗) for all 𝑖= 1, . . . , 𝑛, we have 𝑃0 = E [ 𝑀0 𝑓(π‘₯0) 2] 𝜎2 𝑛𝐡init and 𝑃0 = 1 𝑖=1 E [ 𝑀0 𝑖 𝑓𝑖(π‘₯0) 2] 𝜎2 We can substitute the choice of 𝛾and obtain E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ ( 𝛿0 πœ‚π‘›π΅init𝑇+ πœ‚πœŽ2 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ ℓ𝛿0 π‘›πœ‚π‘‡+ πœ‚3𝜎2 πœ‚π‘›π΅init𝑇+ πœ‚πœŽ2 Since 𝐡init 𝜎2 𝐿𝛿0𝑛, we have E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ ℓ𝛿0 π‘›πœ‚π‘‡+ πœ‚3𝜎2 Notice that the choice of the momentum parameter such that πœ‚ ( ℓ𝛿0𝛼2 𝜎2 𝑛𝑇 ) 2/7 , πœ‚ ( ℓ𝛿0𝛼 𝜎2 𝑛𝑇 ) 2/5 , πœ‚ ( ℓ𝛿0 𝑛 𝜎2𝑇 ) 2/3 , and πœ‚ ( ℓ𝛿0𝛼2𝐡init 𝜎2 𝑛 ) 2/3 ensures that πœ‚3𝜎2 𝛼2 ℓ𝛿0 π‘›πœ‚π‘‡, πœ‚2𝜎2 𝛼 ℓ𝛿0 π‘›πœ‚π‘‡, πœ‚πœŽ2 𝛼2𝐡init𝑇 ℓ𝛿0 π‘›πœ‚π‘‡. 
Therefore, we have E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ 𝐡1/3 init 𝑇 Using 𝐡init 𝛼𝑛 𝑇, we obtain E [ 𝑓(Λ†π‘₯𝑇) 2] = π’ͺ 𝛼𝑇+ ℓ𝛿0 𝛼𝑇+ I.1 Controlling the variance of STORM/MVR estimator Lemma 7. Let Assumptions 2 and 3 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequence {𝑀𝑑 𝑖}𝑑 0 be updated via 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )) Define the sequence 𝑀𝑑:= 1 𝑛 𝑛 𝑖=1 𝑀𝑑 𝑖. Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 2β„“2 𝑖E [ π‘₯𝑑 π‘₯𝑑+1 2] + 2πœ‚2𝜎2, (55) E [ 𝑀𝑑+1 𝑓(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] + 2 β„“2 𝑛E [ π‘₯𝑑 π‘₯𝑑+1 2] + 2πœ‚2𝜎2 Proof. For each 𝑑= 0, . . . , 𝑇 1, define a random vector πœ‰π‘‘:= (πœ‰π‘‘ 1, . . . , πœ‰π‘‘ 𝑛) and denote by 𝑓(π‘₯𝑑, πœ‰π‘‘) := 1 𝑛 𝑛 𝑖=1 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖). Note that the entries of the random vector πœ‰π‘‘are independent and Eπœ‰π‘‘[ 𝑓(π‘₯𝑑, πœ‰π‘‘)] = 𝑓(π‘₯𝑑), then we have 𝑀𝑑+1 = 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) + (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑, πœ‰π‘‘+1) ) , where 𝑀𝑑= 1 𝑛 𝑛 𝑖=1 𝑀𝑑 𝑖is an auxiliary sequence. 𝒱𝑑 𝑖:= 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘ 𝑖) 𝑓𝑖(π‘₯𝑑), 𝒱𝑑:= 1 𝒲𝑑 𝑖:= 𝑓𝑖(π‘₯𝑑) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) + 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 𝑓𝑖(π‘₯𝑑+1), 𝒲𝑑:= 1 Then by Assumptions 2, we have E [ 𝒱𝑑 𝑖 ] = E [ 𝒲𝑑 𝑖 ] = E [ 𝒱𝑑] = E [ 𝒲𝑑] = 0, (57) E [ 𝒱𝑑 𝑖 2] 𝜎2, E [ 𝒱𝑑 2] 𝜎2 Furthermore, we can derive E [ 𝒲𝑑 2] = E 𝑖=1 E [ 𝒲𝑑 𝑖 2] + 1 𝑖 =𝑗 E [ 𝒲𝑑 𝑖, 𝒲𝑑 𝑗 ] 𝑖=1 E [ 𝒲𝑑 𝑖 2] + 1 𝑖 =𝑗 E [ 𝒲𝑑 𝑖 ] , E [ 𝒲𝑑 𝑗 ] 𝑖=1 E [ 𝒲𝑑 𝑖 2] 𝑖=1 E [ 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 ) 2] 𝑖=1 β„“2 𝑖E [ π‘₯𝑑+1 π‘₯𝑑 2] = β„“2 𝑛E [ π‘₯𝑑+1 π‘₯𝑑 2] , where (𝑖) holds by the conditional independence of 𝒲𝑑 𝑖and 𝒲𝑑 𝑗, and the last inequality follows by the individual smoothness of stochastic functions (Assumption 3). Therefore, we have E [ 𝒲𝑑 𝑖 2] β„“2 𝑖E [ π‘₯𝑑+1 π‘₯𝑑 2] , E [ 𝒲𝑑 2] β„“2 𝑛E [ π‘₯𝑑+1 π‘₯𝑑 2] , (59) where the first inequality is obtained by using a similar derivation. By the update rule for 𝑀𝑑, we can also derive 𝑀𝑑+1 𝑓(π‘₯𝑑+1) = (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑, πœ‰π‘‘+1) ) + ( 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) 𝑓(π‘₯𝑑+1) ) = (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑) ) + πœ‚ ( 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) 𝑓(π‘₯𝑑+1) ) +(1 πœ‚) ( ( 𝑓(π‘₯𝑑) 𝑓(π‘₯𝑑, πœ‰π‘‘+1) + 𝑓(π‘₯𝑑+1, πœ‰π‘‘+1) 𝑓(π‘₯𝑑+1) ) ) = (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑) ) + πœ‚π’±π‘‘+1 + (1 πœ‚)𝒲𝑑. Therefore, we have E [ 𝑀𝑑+1 𝑓(π‘₯𝑑+1) 2] E [ Eπœ‰π‘‘+1 [ (1 πœ‚) ( 𝑀𝑑 𝑓(π‘₯𝑑) ) + πœ‚π’±π‘‘+1 + (1 πœ‚)𝒲𝑑 2] ] = (1 πœ‚)2E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] + E [ πœ‚π’±π‘‘+1 + (1 πœ‚)𝒲𝑑 2] (1 πœ‚) 𝑀𝑑 𝑓(π‘₯𝑑) 2 + 2πœ‚2E [ 𝒱𝑑+1 2] + 2E [ 𝒲𝑑 2] (1 πœ‚)E [ 𝑀𝑑 𝑓(π‘₯𝑑) 2] + 2𝜎2πœ‚2 𝑛E [ π‘₯𝑑+1 π‘₯𝑑 2] , where the last inequality holds by (58) and (59). Similarly for each 𝑖= 1, . . . , 𝑛, we have 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) = (1 πœ‚) ( 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) ) + πœ‚π’±π‘‘+1 𝑖 + (1 πœ‚)𝒲𝑑 𝑖. (60) E [ 𝑀𝑑+1 𝑖 𝑓𝑖(π‘₯𝑑+1) 2] (1 πœ‚)E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + 2𝜎2πœ‚2 + 2β„“2 𝑖𝑅𝑑. I.2 Controlling the variance of contractive compression and STORM/MVR estimator Lemma 8. Let Assumptions 1, 2 and 3 be satisfied, and suppose 0 < πœ‚ 1. For every 𝑖= 1, . . . , 𝑛, let the sequences {𝑀𝑑 𝑖}𝑑 0 and {𝑔𝑑 𝑖}𝑑 0 be updated via 𝑀𝑑+1 𝑖 = 𝑓𝑖(π‘₯𝑑+1, πœ‰π‘‘+1 𝑖 ) + (1 πœ‚)(𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑, πœ‰π‘‘+1 𝑖 )), 𝑔𝑑+1 𝑖 = 𝑔𝑑 𝑖+ π’ž ( 𝑀𝑑+1 𝑖 𝑔𝑑 𝑖 ) . Then for every 𝑖= 1, . . . , 𝑛and 𝑑 0 it holds E [ 𝑔𝑑+1 𝑖 𝑀𝑑+1 𝑖 2] ( 1 𝛼 ) E [ 𝑔𝑑 𝑖 𝑀𝑑 𝑖 2] + 4πœ‚2 𝛼E [ 𝑀𝑑 𝑖 𝑓𝑖(π‘₯𝑑) 2] + ( 4𝐿2 𝑖 𝛼 + β„“2 𝑖 ) E [ π‘₯𝑑+1 π‘₯𝑑 2] + 2πœ‚2𝜎2. (61) Proof. 
I.2 Controlling the variance of the contractive compressor and the STORM/MVR estimator

Lemma 8. Let Assumptions 1, 2 and 3 be satisfied, and suppose $0 < \eta \le 1$. For every $i = 1,\dots,n$, let the sequences $\{w_i^t\}_{t\ge0}$ and $\{g_i^t\}_{t\ge0}$ be updated via
$$w_i^{t+1} = \nabla f_i(x^{t+1},\xi_i^{t+1}) + (1-\eta)\big(w_i^t - \nabla f_i(x^t,\xi_i^{t+1})\big), \qquad g_i^{t+1} = g_i^t + \mathcal{C}\big(w_i^{t+1} - g_i^t\big).$$
Then for every $i = 1,\dots,n$ and $t \ge 0$ it holds that
$$\mathbb{E}\big[\|g_i^{t+1} - w_i^{t+1}\|^2\big] \le \Big(1-\frac{\alpha}{2}\Big)\mathbb{E}\big[\|g_i^t - w_i^t\|^2\big] + \frac{4\eta^2}{\alpha}\,\mathbb{E}\big[\|w_i^t - \nabla f_i(x^t)\|^2\big] + \Big(\frac{4L_i^2}{\alpha} + 2\ell_i^2\Big)\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + 2\eta^2\sigma^2. \qquad (61)$$

Proof. By the update rules for $w_i^t$ and $g_i^t$, and the definitions of $\mathcal{V}_i^t, \mathcal{W}_i^t$ given in the proof of Lemma 7, we can derive
$$\mathbb{E}\big[\|g_i^{t+1} - w_i^{t+1}\|^2\big] = \mathbb{E}\big[\|\mathcal{C}(w_i^{t+1} - g_i^t) - (w_i^{t+1} - g_i^t)\|^2\big]$$
$$\overset{(i)}{\le} (1-\alpha)\,\mathbb{E}\big[\|w_i^{t+1} - g_i^t\|^2\big]$$
$$\overset{(ii)}{=} (1-\alpha)\,\mathbb{E}\big[\|(1-\eta)\big(w_i^t - \nabla f_i(x^t)\big) + \eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t + \nabla f_i(x^{t+1}) - g_i^t\|^2\big]$$
$$= (1-\alpha)\,\mathbb{E}\Big[\mathbb{E}_{\xi_i^{t+1}}\big[\|(1-\eta)\big(w_i^t - \nabla f_i(x^t)\big) + \eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t + \nabla f_i(x^{t+1}) - g_i^t\|^2\big]\Big]$$
$$\overset{(iii)}{=} (1-\alpha)\,\mathbb{E}\big[\|(1-\eta)\big(w_i^t - \nabla f_i(x^t)\big) + \nabla f_i(x^{t+1}) - g_i^t\|^2\big] + (1-\alpha)\,\mathbb{E}\big[\|\eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t\|^2\big]$$
$$= (1-\alpha)\,\mathbb{E}\big[\|\big(w_i^t - g_i^t\big) + \big(\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\big) - \eta\big(w_i^t - \nabla f_i(x^t)\big)\|^2\big] + (1-\alpha)\,\mathbb{E}\big[\|\eta\mathcal{V}_i^{t+1} + (1-\eta)\mathcal{W}_i^t\|^2\big]$$
$$\overset{(iv)}{\le} (1-\alpha)(1+\rho)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + (1-\alpha)(1+\rho^{-1})\,\mathbb{E}\big[\|\big(\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\big) - \eta\big(w_i^t - \nabla f_i(x^t)\big)\|^2\big] + 2(1-\alpha)\eta^2\,\mathbb{E}\big[\|\mathcal{V}_i^{t+1}\|^2\big] + 2(1-\alpha)(1-\eta)^2\,\mathbb{E}\big[\|\mathcal{W}_i^t\|^2\big]$$
$$\overset{(v)}{=} (1-\theta)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + \beta\,\mathbb{E}\big[\|\big(\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\big) - \eta\big(w_i^t - \nabla f_i(x^t)\big)\|^2\big] + 2(1-\alpha)\eta^2\,\mathbb{E}\big[\|\mathcal{V}_i^{t+1}\|^2\big] + 2(1-\alpha)(1-\eta)^2\,\mathbb{E}\big[\|\mathcal{W}_i^t\|^2\big]$$
$$\overset{(vi)}{\le} (1-\theta)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + 2\beta\eta^2\,\mathbb{E}\big[\|w_i^t - \nabla f_i(x^t)\|^2\big] + 2\beta\,\mathbb{E}\big[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\big] + 2\ell_i^2\,\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + 2\eta^2\sigma^2$$
$$\le (1-\theta)\,\mathbb{E}\big[\|w_i^t - g_i^t\|^2\big] + 2\beta\eta^2\,\mathbb{E}\big[\|w_i^t - \nabla f_i(x^t)\|^2\big] + \big(2\beta L_i^2 + 2\ell_i^2\big)\,\mathbb{E}\big[\|x^{t+1}-x^t\|^2\big] + 2\eta^2\sigma^2,$$
where $(i)$ holds by Definition 1, $(ii)$ follows from (60), $(iii)$ holds by the unbiasedness (57) of $\mathcal{V}_i^{t+1}$ and $\mathcal{W}_i^t$, in $(iv)$ we use Young's inequality twice, in $(v)$ we introduce the notation $\theta := 1 - (1-\alpha)(1+\rho)$ and $\beta := (1-\alpha)(1+\rho^{-1})$, and in $(vi)$ we again use Young's inequality together with the bounds (58) and (59). The last step holds by smoothness of $f_i(\cdot)$ (Assumption 1). The proof is completed by the choice $\rho = \alpha/2$, which guarantees $1-\theta \le 1-\alpha/2$ and $2\beta \le 4/\alpha$.

J Simplified Proof of SGDM: Time-Varying Parameters and No Tuning for the Momentum Sequence

In this section, we give a simplified proof for SGDM in the single-node setting ($n = 1$) without compression ($\alpha = 1$). The following theorem shows that the momentum parameter can be chosen in a parameter-agnosticΒ²Β² way as $\eta_t = 1/\sqrt{t+1}$ (or $\eta_t = 1/\sqrt{T+1}$), instead of being a constant depending on problem parameters as suggested in our main Theorem 3. In other words, using SGDM with time-varying momentum does not introduce any additional tuning of hyper-parameters.

Β²Β²That is, independently of the problem-specific parameters.

Theorem 8. Let Assumptions 1 and 2 hold. Let $n = 1$ and let Algorithm 1 run with the identity compressor $\mathcal{C}$, i.e., $\alpha = 1$, and (possibly) time-varying momentum $\eta_t \in (0,1]$ and step-size parameters $\gamma_t = \gamma\eta_t$ with $\gamma \in (0, 1/(3L)]$. Let $\hat x^T$ be sampled from the iterates of the algorithm with probabilities $p_t = \eta_t \big/ \sum_{t=0}^{T-1}\eta_t$. Then
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] \le \frac{2\Lambda^0\gamma^{-1} + 2\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{\sum_{t=0}^{T-1}\eta_t},$$
where $\Lambda^0 := f(x^0) - f^* + \gamma\,\mathbb{E}\big[\|v^0 - \nabla f(x^0)\|^2\big]$ is the Lyapunov function.

Proof. By Lemma 2, denoting $P^t := \mathbb{E}\big[\|v^t - \nabla f(x^t)\|^2\big]$ and $R^t := \mathbb{E}\big[\|x^t - x^{t+1}\|^2\big]$, we have
$$P^{t+1} \le P^t - \eta_t P^t + \frac{3L^2}{\eta_t}R^t + \eta_t^2\sigma^2. \qquad (62)$$
By the descent Lemma 1, we have for any $\gamma_t > 0$
$$\frac{\gamma_t}{2}\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \delta^t - \delta^{t+1} - \frac{1}{2\gamma_t}\big(1 - \gamma_t L\big)R^t + \gamma_t P^t, \qquad (63)$$
where $\delta^t := \mathbb{E}\big[f(x^t) - f^*\big]$. Define the Lyapunov function $\Lambda^t := \delta^t + \gamma P^t$. Then summing up (63) with a $\gamma$ multiple of (62), using $\gamma_t = \gamma\eta_t$ and noticing that $\gamma_t \le \gamma$, we get
$$\frac{\gamma\eta_t}{2}\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \Lambda^t - \Lambda^{t+1} - \frac{1}{2\gamma_t}\big(1 - \gamma L - 6\gamma^2L^2\big)R^t + \gamma\eta_t^2\sigma^2.$$
Since $\gamma \le 1/(3L)$, we have $1 - \gamma L - 6\gamma^2L^2 \ge 0$, and, therefore, by telescoping we can derive
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \frac{\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big]}{\sum_{t=0}^{T-1}\eta_t} \le \frac{2\Lambda^0\gamma^{-1} + 2\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{\sum_{t=0}^{T-1}\eta_t}.$$
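As an illustration of Theorem 8, here is a minimal single-node SGDM sketch with the parameter-agnostic schedule $\eta_t = 1/\sqrt{t+1}$, step-sizes $\gamma_t = \gamma\eta_t$ with $\gamma = 1/(3L)$, and the output $\hat x^T$ drawn with probabilities $p_t \propto \eta_t$. The exact update ordering and the oracle `grad_oracle` are our assumptions, not quoted from Algorithm 1.

```python
import numpy as np

def sgdm(grad_oracle, x0, T, L, rng):
    """Single-node SGDM with time-varying momentum (a sketch of Theorem 8's setting)."""
    gamma = 1.0 / (3.0 * L)
    x, v = x0.copy(), grad_oracle(x0)          # v^0 from one stochastic gradient
    etas, iterates = [], []
    for t in range(T):
        eta_t = 1.0 / np.sqrt(t + 1)           # parameter-agnostic momentum eta_t
        v = (1.0 - eta_t) * v + eta_t * grad_oracle(x)   # Polyak momentum estimator
        iterates.append(x.copy()); etas.append(eta_t)
        x = x - gamma * eta_t * v              # step-size gamma_t = gamma * eta_t
    p = np.array(etas) / np.sum(etas)          # p_t proportional to eta_t
    return iterates[rng.choice(T, p=p)]        # x_hat^T as in the theorem

# usage on a toy problem: f(x) = ||x||^2 / 2 with noisy gradients
rng = np.random.default_rng(0)
oracle = lambda x: x + 0.1 * rng.standard_normal(x.size)
x_hat = sgdm(oracle, np.ones(20), T=5000, L=1.0, rng=rng)
print(np.linalg.norm(x_hat))                   # gradient norm at the sampled iterate
```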
The above theorem suggests that, to ensure convergence, we can select any momentum sequence such that $\sum_{t=0}^{T-1}\eta_t \to \infty$ and $\sigma^2\sum_{t=0}^{T-1}\eta_t^2 \big/ \sum_{t=0}^{T-1}\eta_t \to 0$ as $T \to \infty$. The parameter $\gamma$, which determines the step-size $\gamma_t = \gamma\eta_t$, should be set to $\gamma = 1/(3L)$ (to minimize the upper bound). Let us now consider some special cases.

Deterministic case. If $\sigma = 0$, we can set the momentum to any constant $\eta_t = \eta \in (0,1]$ and derive
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] \le \frac{2\delta^0}{\gamma\eta T} = \mathcal{O}\left(\frac{L\delta^0}{\eta T}\right).$$

Stochastic case. For $\sigma^2 > 0$, we can select the time-varying $\eta_t = \frac{1}{\sqrt{t+1}}$ or the constant $\eta_t = \frac{1}{\sqrt{T+1}}$, which gives $\sum_{t=0}^{T-1}\eta_t^2 = \mathcal{O}(\log T)$ and $\sum_{t=0}^{T-1}\eta_t = \Omega(\sqrt{T})$, so that
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \mathcal{O}\left(\frac{L\Lambda^0 + \sigma^2\log T}{\sqrt{T}}\right).$$
Notice that if we set $\eta_t$ as above, we do not need any tuning of the momentum parameter. Only tuning of the parameter $\gamma$ is required to ensure convergence with the optimal dependence on $T$, as in SGD without momentum. Of course, this rate is not yet optimal in other parameters, e.g., $\sigma^2$ and $L$. To make it optimal in all problem parameters, we can set $\eta = \min\Big\{1, \big(\frac{L\Lambda^0}{\sigma^2 T}\big)^{1/2}\Big\}$, similarly to the statement of Theorem 2.

K Revisiting the EF14-SGD Analysis under the BG and BGS Assumptions

In this section, we revisit the analysis of the original variant of error feedback (EF14-SGD) to showcase the difficulty of avoiding the BG/BGS assumptions commonly used in the nonconvex analysis of this variant. In summary, the key reason for the BG/BGS assumption is the need to bound the second term in (67) or (68). Recall that EF14-SGD has the update rule [Stich et al., 2018]
$$x^{t+1} = x^t - g^t, \qquad g^t = \frac1n\sum_{i=1}^n g_i^t, \qquad (64)$$
$$\text{EF14-SGD:}\qquad e_i^{t+1} = e_i^t + \gamma\nabla f_i(x^t,\xi_i^t) - g_i^t, \qquad g_i^{t+1} = \mathcal{C}\big(e_i^{t+1} + \gamma\nabla f_i(x^{t+1},\xi_i^{t+1})\big), \qquad (65)$$
where $\{e_i^t\}_{t\ge0}$ are the error/memory sequences with $e_i^0 = 0$ for each $i = 1,\dots,n$. Let $e^t := \frac1n\sum_{i=1}^n e_i^t$. The proof for this method relies on so-called perturbed iterate analysis, for which one defines a "virtual sequence" $\tilde x^t := x^t - e^t$. It is then verified by direct substitution that for any $t \ge 0$
$$\tilde x^{t+1} = \tilde x^t - \gamma\frac1n\sum_{i=1}^n \nabla f_i(x^t,\xi_i^t).$$
It follows from Lemma 9 in [Stich and Karimireddy, 2021] that for any $\gamma \le 1/(2L)$ and $t \ge 0$
$$\mathbb{E}\big[f(\tilde x^{t+1})\big] \le \mathbb{E}\big[f(\tilde x^t)\big] - \frac{\gamma}{4}\,\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] + \frac{\gamma^2 L\sigma^2}{2n} + \frac{\gamma L^2}{2}\,\mathbb{E}\big[\|e^t\|^2\big].$$
Telescoping the recursion above and setting $\delta^0 := f(x^0) - f^*$, we have
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{4\delta^0}{\gamma T} + \frac{2\gamma L\sigma^2}{n} + \frac{2L^2}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|e^t\|^2\big]. \qquad (66)$$
Now it remains to efficiently bound the average error term $\mathbb{E}\big[\|e^t\|^2\big] = \mathbb{E}\big[\|\frac1n\sum_{i=1}^n e_i^t\|^2\big]$. By Jensen's inequality, we have $\mathbb{E}\big[\|e^t\|^2\big] \le \frac1n\sum_{i=1}^n \mathbb{E}\big[\|e_i^t\|^2\big]$, and we develop a bound for each $\mathbb{E}\big[\|e_i^t\|^2\big]$ individually. Denoting $z := e_i^t + \gamma\nabla f_i(x^t,\xi_i^t)$, we get
$$\mathbb{E}\big[\|e_i^{t+1}\|^2\big] = \mathbb{E}\big[\|\mathcal{C}(z) - z\|^2\big] \le (1-\alpha)\,\mathbb{E}\big[\|e_i^t + \gamma\nabla f_i(x^t,\xi_i^t)\|^2\big]$$
$$\le (1-\alpha)\Big(1+\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + (1-\alpha)\Big(1+\frac{2}{\alpha}\Big)\mathbb{E}\big[\|\gamma\nabla f_i(x^t,\xi_i^t)\|^2\big]$$
$$\le \Big(1-\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + \frac{3\gamma^2}{\alpha}\,\mathbb{E}\big[\|\nabla f_i(x^t,\xi_i^t)\|^2\big], \qquad (67)$$
where we used Definition 1 and Young's inequality.

BG assumption. If we assume bounded (stochastic) gradients (BG), i.e., $\mathbb{E}\big[\|\nabla f_i(x,\xi_i)\|^2\big] \le G^2$ for all $i = 1,\dots,n$, then using (67) we can derive
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|e^t\|^2\big] \le \frac{6\gamma^2 G^2}{\alpha^2}.$$
Combining this bound with (66), we have
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{4\delta^0}{\gamma T} + \frac{2\gamma L\sigma^2}{n} + \frac{12L^2\gamma^2 G^2}{\alpha^2}.$$
The step-size choice $\gamma = \min\Big\{\frac1L, \big(\frac{\delta^0\alpha^2}{TL^2G^2}\big)^{1/3}, \big(\frac{n\delta^0}{TL\sigma^2}\big)^{1/2}\Big\}$ allows us to bound the right-hand side by $\frac{12\delta^0}{\gamma T}$ and guarantees
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \mathcal{O}\left(\Big(\frac{L\delta^0 G}{\alpha T}\Big)^{2/3} + \Big(\frac{L\delta^0\sigma^2}{nT}\Big)^{1/2} + \frac{L\delta^0}{T}\right),$$
or, equivalently, a $T = \mathcal{O}\big(\frac{L\delta^0 G}{\alpha\varepsilon^3} + \frac{L\delta^0\sigma^2}{n\varepsilon^4} + \frac{L\delta^0}{\varepsilon^2}\big)$ sample complexity to find a stationary point. This analysis using the BG assumption and the derived sample complexity are essentially a simplified version of the one by Koloskova et al. [2020].Β²Β³
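To make updates (64)–(65) concrete, the following sketch implements one round of EF14-SGD, assuming a Top-$K$ compressor and a hypothetical oracle `stoch_grads(x)` that returns the $n$ stochastic gradients at $x$; the helper names are ours.

```python
import numpy as np

def topk(x, k):
    """Top-K contractive compressor; satisfies Definition 1 with alpha = k/d."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef14_sgd_step(x, e, g, grads, stoch_grads, gamma, k):
    """One round of EF14-SGD, updates (64)-(65).

    grads holds grad f_i(x^t, xi_i^t), the same samples used when g^t was formed;
    e, g, grads all have shape (n, d), with e^0 = 0.
    """
    x_new = x - g.mean(axis=0)                     # x^{t+1} = x^t - g^t                    (64)
    e_new = e + gamma * grads - g                  # e_i^{t+1} = e_i^t + gamma*grad - g_i^t (65)
    grads_new = stoch_grads(x_new)                 # grad f_i(x^{t+1}, xi_i^{t+1})
    g_new = np.stack([topk(e_new[i] + gamma * grads_new[i], k) for i in range(len(e))])
    return x_new, e_new, g_new, grads_new          # g_i^{t+1} = C(e_i^{t+1} + gamma*grad)  (65)
```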
BGS assumption. If we assume bounded gradient similarity (BGS), i.e., $\frac1n\sum_{i=1}^n\mathbb{E}\big[\|\nabla f_i(x) - \nabla f(x)\|^2\big] \le G^2$, we can slightly modify the derivation in (67) as follows:
$$\mathbb{E}\big[\|e_i^{t+1}\|^2\big] \le (1-\alpha)\,\mathbb{E}\big[\|e_i^t + \gamma\nabla f_i(x^t,\xi_i^t)\|^2\big]$$
$$= (1-\alpha)\,\mathbb{E}\big[\|e_i^t + \gamma\nabla f_i(x^t)\|^2\big] + (1-\alpha)\gamma^2\,\mathbb{E}\big[\|\nabla f_i(x^t,\xi_i^t) - \nabla f_i(x^t)\|^2\big]$$
$$\le (1-\alpha)\Big(1+\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + (1-\alpha)\Big(1+\frac{2}{\alpha}\Big)\mathbb{E}\big[\|\gamma\nabla f_i(x^t)\|^2\big] + \gamma^2\sigma^2$$
$$\le \Big(1-\frac{\alpha}{2}\Big)\mathbb{E}\big[\|e_i^t\|^2\big] + \frac{3\gamma^2}{\alpha}\,\mathbb{E}\big[\|\nabla f_i(x^t)\|^2\big] + \gamma^2\sigma^2. \qquad (68)$$
Averaging the above inequalities over $i = 1,\dots,n$ and using the BGS assumption in the form $\frac1n\sum_{i=1}^n\mathbb{E}\big[\|\nabla f_i(x)\|^2\big] \le \mathbb{E}\big[\|\nabla f(x)\|^2\big] + G^2$, we can derive, via averaging over $t = 0,\dots,T-1$,
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|e^t\|^2\big] \le \frac{6\gamma^2}{\alpha^2}\cdot\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] + \frac{6\gamma^2 G^2}{\alpha^2} + \frac{2\gamma^2\sigma^2}{\alpha}.$$
Combining the above inequality with (66), we have
$$\Big(1 - \frac{12L^2\gamma^2}{\alpha^2}\Big)\frac1T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f(x^t)\|^2\big] \le \frac{4\delta^0}{\gamma T} + \frac{2\gamma L\sigma^2}{n} + \frac{12\gamma^2L^2G^2}{\alpha^2} + \frac{4\gamma^2L^2\sigma^2}{\alpha}.$$
By setting $\gamma = \min\Big\{\frac{\alpha}{4L}, \big(\frac{n\delta^0}{L\sigma^2 T}\big)^{1/2}, \big(\frac{\alpha^2\delta^0}{L^2G^2T}\big)^{1/3}, \big(\frac{\alpha\delta^0}{L^2\sigma^2 T}\big)^{1/3}\Big\}$, we have $1 - \frac{12L^2\gamma^2}{\alpha^2} \ge \frac14$, and the right-hand side is at most $\frac{16\delta^0}{\gamma T}$. Therefore,
$$\mathbb{E}\big[\|\nabla f(\hat x^T)\|^2\big] = \mathcal{O}\left(\Big(\frac{L\delta^0 G}{\alpha T}\Big)^{2/3} + \Big(\frac{L\delta^0\sigma}{\sqrt{\alpha}\,T}\Big)^{2/3} + \Big(\frac{L\delta^0\sigma^2}{nT}\Big)^{1/2} + \frac{L\delta^0}{\alpha T}\right),$$
or, equivalently, a $T = \mathcal{O}\big(\frac{L\delta^0}{\alpha\varepsilon^2} + \frac{L\delta^0 G}{\alpha\varepsilon^3} + \frac{L\delta^0\sigma}{\sqrt{\alpha}\,\varepsilon^3} + \frac{L\delta^0\sigma^2}{n\varepsilon^4}\big)$ sample complexity. Notice that in the single-node case ($n = 1$) we have $G = 0$, and by Young's inequality
$$\Big(\frac{L\delta^0\sigma}{\sqrt{\alpha}\,T}\Big)^{2/3} \le \frac{L\delta^0}{\alpha T} + \Big(\frac{L\delta^0\sigma^2}{T}\Big)^{1/2}.$$
Therefore, the above rate recovers the one by Stich and Karimireddy [2021] in the single-node setting.

Β²Β³Up to a smoothness constant and the fact that Koloskova et al. [2020] work in a more general decentralized setting.
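This last Young's inequality step is easy to verify numerically. Below is a quick sketch over random, purely hypothetical values of $L\delta^0$, $\sigma$, $\alpha \in (0,1]$, and $T$:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    Ld = rng.uniform(0.1, 10.0)        # stands for L * delta^0
    sig = rng.uniform(0.1, 10.0)       # noise level sigma
    alpha = rng.uniform(0.01, 1.0)     # contraction parameter, alpha in (0, 1]
    T = rng.integers(100, 10_000)      # number of iterations
    lhs = (Ld * sig / (np.sqrt(alpha) * T)) ** (2.0 / 3.0)
    rhs = Ld / (alpha * T) + np.sqrt(Ld * sig**2 / T)
    print(f"lhs={lhs:.3e} <= rhs={rhs:.3e}: {bool(lhs <= rhs)}")
```

The check always passes, since the left-hand side equals $A^{1/3}B^{2/3}$ for $A = \frac{L\delta^0}{\alpha T}$ and $B = \big(\frac{L\delta^0\sigma^2}{T}\big)^{1/2}$, and Young's inequality gives $A^{1/3}B^{2/3} \le \frac13 A + \frac23 B \le A + B$.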