# Linearly Converging Error Compensated SGD

Eduard Gorbunov (MIPT, Yandex and Sirius, Russia; KAUST, Saudi Arabia), Dmitry Kovalev (KAUST, Saudi Arabia), Dmitry Makarenko (MIPT, Russia), Peter Richtárik (KAUST, Saudi Arabia)

Abstract

In this paper, we propose a unified analysis of variants of distributed SGD with arbitrary compressions and delayed updates. Our framework is general enough to cover different variants of quantized SGD, Error-Compensated SGD (EC-SGD) and SGD with delayed updates (D-SGD). Via a single theorem, we derive the complexity results for all the methods that fit our framework. For the existing methods, this theorem gives the best-known complexity results. Moreover, using our general scheme, we develop new variants of SGD that combine variance reduction or arbitrary sampling with error feedback and quantization, and we derive convergence rates for these methods that beat the state-of-the-art results. To illustrate the strength of our framework, we develop 16 new methods that fit it. In particular, we propose the first method, called EC-SGD-DIANA, that is based on error feedback for biased compression operators and quantization of gradient differences, and we prove convergence guarantees showing that EC-SGD-DIANA converges to the exact optimum asymptotically in expectation with a constant learning rate for both convex and strongly convex objectives when workers compute full gradients of their loss functions. Moreover, for the case when the loss function of the worker has the form of a finite sum, we modify the method and obtain a new one, called EC-LSVRG-DIANA, which is the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.

1 Introduction

We consider distributed optimization problems of the form
$$\min_{x\in\mathbb{R}^d}\left\{ f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)\right\}, \qquad (1)$$
where $n$ is the number of workers/devices/clients/nodes. The information about function $f_i$ is stored on the $i$-th worker only. Problems of this form appear in the distributed or federated training of supervised machine learning models [42, 30]. In such applications, $x \in \mathbb{R}^d$ describes the parameters identifying a statistical model we wish to train, and $f_i$ is the (generalization or empirical) loss of model $x$ on the data accessible by worker $i$. If worker $i$ has access to data with distribution $\mathcal{D}_i$, then $f_i$ is assumed to have the structure
$$f_i(x) = \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[f_{\xi_i}(x)\right]. \qquad (2)$$
Dataset $\mathcal{D}_i$ may or may not be available to worker $i$ in its entirety. Typically, we assume that worker $i$ only has access to samples from $\mathcal{D}_i$. If the dataset is fully available, it is typically finite, in which case we can assume that $f_i$ has the finite-sum form
$$f_i(x) = \frac{1}{m}\sum_{j=1}^{m} f_{ij}(x). \qquad (3)$$
(The implicit assumption that each worker contains exactly $m$ data points is for simplicity only; all our results have direct analogues in the general setting with $m_i$ data points on worker $i$.)

Communication bottleneck. The key bottleneck in practical distributed [14] and federated [30, 21] systems comes from the high cost of communicating messages among the clients, which is needed to find a solution of sufficient quality. Several approaches to addressing this communication bottleneck have been proposed in the literature. In the very rare situation when it is possible to adjust the network architecture connecting the clients, one may consider a fully decentralized setup [6] and allow each client, in each iteration, to communicate with its neighbors only.
One can argue that in some circumstances, and in a certain sense, a decentralized architecture may be preferable to a centralized one [34]. Another natural way to address the communication bottleneck is to do more meaningful (which typically means more expensive) work on each client before each communication round. This is done in the hope that such extra work will produce more valuable messages to be communicated, which hopefully results in the need for fewer communication rounds. A popular technique of this type, which is particularly relevant to Federated Learning, is based on applying multiple local updates instead of a single update only. This is the main idea behind Local-SGD [43]; see also [4, 15, 22, 24, 29, 46, 50]. However, in this paper, we contribute to the line of work which aims to resolve the communication bottleneck issue via communication compression. That is, the information that is normally exchanged (be it iterates, gradients, or some more sophisticated vectors/tensors) is compressed in a lossy manner before communication. By applying compression, fewer bits are transmitted in each communication round, and one hopes that the increase in the number of communication rounds necessary to solve the problem, if any, is compensated by the savings, leading to a more efficient method overall.

Error-feedback framework. In order to address these issues, in this paper we study a broad class of distributed stochastic first-order methods for solving problem (1) described by the iterative framework
$$x^{k+1} = x^k - \frac{1}{n}\sum_{i=1}^{n} v_i^k, \qquad (4)$$
$$e_i^{k+1} = e_i^k + \gamma g_i^k - v_i^k, \quad i = 1, 2, \dots, n. \qquad (5)$$
In this scheme, $x^k$ represents the key iterate, $v_i^k$ is the contribution of worker $i$ towards the update in iteration $k$, $g_i^k$ is an unbiased estimator of $\nabla f_i(x^k)$ computed by worker $i$, $\gamma > 0$ is a fixed stepsize, and $e_i^k$ is the error accumulated at node $i$ prior to iteration $k$ (we set $e_i^0 = 0$ for all $i$). In order to better understand the role of the vectors $v_i^k$ and $e_i^k$, first consider the simple special case with $v_i^k \equiv \gamma g_i^k$. In this case, $e_i^k = 0$ for all $i$ and $k$, and method (4)-(5) reduces to distributed SGD:
$$x^{k+1} = x^k - \frac{\gamma}{n}\sum_{i=1}^{n} g_i^k. \qquad (6)$$
However, by allowing the vectors $v_i^k$ to be chosen in a different manner, we obtain a more general update rule than what the SGD update (6) can offer. Stich and Karimireddy [46], who studied (4)-(5) in the $n = 1$ regime, show that this flexibility allows one to capture several types of methods, including those employing i) compressed communication, ii) delayed gradients, and iii) minibatch gradient updates. While our general results apply to all these special cases and more, in order not to dilute the focus, in the main body of the paper we concentrate on the first use case, compressed communication, which we now describe.

Error-compensated compressed gradient methods. Note that in distributed SGD (6), each worker needs to know the aggregate gradient $g^k = \frac{1}{n}\sum_{i=1}^{n} g_i^k$ to form $x^{k+1}$, which is needed before the next iteration can start. This can be achieved, for example, by each worker $i$ communicating its gradient $g_i^k$ to all other workers. Alternatively, in a parameter-server setup, a dedicated master node collects the gradients from all workers and broadcasts their average $g^k$ to all workers.
Instead of communicating the gradient vectors $g_i^k$, which is expensive in distributed learning in general and in federated learning in particular, especially if $d$ is large, we wish to communicate other but closely related vectors which can be represented with fewer bits. To this effect, each worker $i$ sends the vector
$$v_i^k = \mathcal{C}\left(e_i^k + \gamma g_i^k\right), \quad i \in [n], \qquad (7)$$
instead, where $\mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d$ is a compression operator (possibly randomized; if so, drawn independently of everything else at iteration $k$) used to reduce communication. We assume throughout that there exists $\delta \in (0, 1]$ such that the following inequality holds for all $x \in \mathbb{R}^d$:
$$\mathbb{E}\left\|\mathcal{C}(x) - x\right\|^2 \le (1 - \delta)\|x\|^2. \qquad (8)$$
For any $k \ge 0$, the vector $e_i^{k+1} = \sum_{t=0}^{k}\left(\gamma g_i^t - v_i^t\right)$ captures the error accumulated by worker $i$ up to iteration $k$. This is the difference between the ideal SGD update $\sum_{t=0}^{k}\gamma g_i^t$ and the applied update $\sum_{t=0}^{k} v_i^t$. As we see in (7), at iteration $k$ the current error $e_i^k$ is added to the gradient update $\gamma g_i^k$ (this is referred to as error feedback) and the sum is subsequently compressed, which defines the update vector $v_i^k$. Compression introduces an additional error, which is added to $e_i^k$, and the process is repeated.

Compression operators. For a rich collection of specific operators satisfying (8), we refer the reader to Stich and Karimireddy [46] and Beznosikov et al. [7]. These include various unbiased or contractive sparsification operators such as Rand-K and Top-K, respectively, and quantization operators such as natural compression and natural dithering [18]. Several additional comments related to compression operators are included in Section B.
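To make the error-feedback recursion (4)-(5) with the compressed updates (7) concrete, below is a minimal sketch of the scheme with a Top-K compressor, which satisfies (8) with $\delta = K/d$. This is our own illustration, not the authors' implementation; the toy quadratic objective and all function names are ours.

```python
import numpy as np

def top_k(x, k):
    """Top-K compressor: keep the k largest-magnitude entries, zero out the rest.
    It satisfies ||C(x) - x||^2 <= (1 - k/d) ||x||^2, i.e. (8) with delta = k/d."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ec_sgd(grads, x0, gamma, k, n_iters):
    """Error-compensated SGD, eqs. (4), (5), (7).
    grads: list of n callables; grads[i](x) returns worker i's (stochastic) gradient g_i^k."""
    n, d = len(grads), x0.shape[0]
    x = x0.copy()
    e = np.zeros((n, d))                                # e_i^0 = 0 for all workers
    for _ in range(n_iters):
        v = np.zeros((n, d))
        for i in range(n):
            g_i = grads[i](x)                           # gradient estimator g_i^k
            v[i] = top_k(e[i] + gamma * g_i, k)         # v_i^k = C(e_i^k + gamma g_i^k), eq. (7)
            e[i] = e[i] + gamma * g_i - v[i]            # error accumulation, eq. (5)
        x = x - v.mean(axis=0)                          # x^{k+1} = x^k - (1/n) sum_i v_i^k, eq. (4)
    return x

# Toy usage: n quadratic workers f_i(x) = 0.5 ||x - b_i||^2, using full local gradients as g_i^k.
rng = np.random.default_rng(0)
d, n = 50, 4
bs = [rng.normal(size=d) for _ in range(n)]
grads = [lambda x, b=b: x - b for b in bs]              # nabla f_i(x) = x - b_i
x_final = ec_sgd(grads, x0=np.zeros(d), gamma=0.5, k=5, n_iters=200)
print(np.linalg.norm(x_final - np.mean(bs, axis=0)))    # distance to the optimum (the mean of b_i)
```

In this toy run the workers use full local gradients; replacing `grads[i]` with stochastic estimators gives the general scheme studied in the paper.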
2 Summary of Contributions

We now summarize the key contributions of this paper.

General theoretical framework. In this work we propose a general theoretical framework for analyzing a wide class of methods that can be written in the error-feedback form (4)-(5). We perform complexity analysis under $\mu$-strong quasi-convexity (Assumption 3.1) and $L$-smoothness (Assumption 3.2) assumptions on the functions $f$ and $\{f_i\}$, respectively. Our analysis is based on an additional parametric assumption (using parameters $A$, $A'$, $B_1$, $B_1'$, $B_2$, $B_2'$, $C_1$, $C_2$, $D_1$, $D_1'$, $D_2$, $D_3$, $\eta$, $\rho_1$, $\rho_2$, $F_1$, $F_2$, $G$) on the relationship between the iterates $x^k$, stochastic gradients $g^k$, errors $e^k$ and a few other quantities (see Assumption 3.4, and the stronger Assumption 3.3). We prove a single theorem (Theorem 3.1) from which all our complexity results follow as special cases. That is, for each existing or new specific method, we prove that one (or both) of our parametric assumptions holds, and specify the parameters for which it holds. These parameters have a direct impact on the theoretical rate of the method. A summary of the values of the parameters for all methods developed in this paper is provided in Table 5 in the appendix. We remark that the values of the parameters $A$, $A'$, $B_1$, $B_1'$, $B_2$, $B_2'$, $C_1$, $C_2$ and $\rho_1$, $\rho_2$ influence the theoretical stepsize.

Sharp rates. For existing methods covered by our general framework, our main convergence result (Theorem 3.1) recovers the best-known rates for these methods up to constant factors.

Eight new error-compensated (EC) methods. We study several specific EC methods for solving problem (1). First, we recover the EC-SGD method first analyzed in the $n = 1$ case by Stich and Karimireddy [46] and later in the general $n \ge 1$ case by Beznosikov et al. [7]. More importantly, we develop eight new methods: EC-SGDsr, EC-GDstar, EC-SGD-DIANA, EC-SGDsr-DIANA, EC-GD-DIANA, EC-LSVRG, EC-LSVRGstar and EC-LSVRG-DIANA. (Regarding EC-SGD-DIANA: inspired by personal communication with D. Kovalev in November 2019, who shared a key algorithm and preliminary results of our paper, Stich [45] studied almost the same algorithm, together with other related methods, and independently derived convergence rates. Our work was finalized and submitted to NeurIPS 2020 in June 2020, while the results in [45] were obtained in Summer 2020 and appeared on arXiv in September 2020. Moreover, in our work we obtain tighter rates; see Table 1 for the details.)

Table 1: Complexity of error-compensated SGD methods established in this paper. Symbols: $\varepsilon$ = error tolerance; $\delta$ = contraction factor of the compressor $\mathcal{C}$; $\omega$ = variance parameter of the compressor $Q$; $\kappa = L/\mu$; $\mathcal{L}$ = expected smoothness constant; $\sigma_*^2$ = variance of the stochastic gradients at the solution; $\zeta_*^2$ = average of $\Vert\nabla f_i(x^*)\Vert^2$; $\sigma^2$ = average of the uniform bounds on the variances of the workers' stochastic gradients.

| Problem | Method | Alg # | Citation | Sec # | Rate (constants ignored) |
|---|---|---|---|---|---|
| (1)+(3) | EC-SGDsr | Alg 3 | new | J.1 | $\tilde O\big(\tfrac{\delta L + \mathcal{L}}{\delta\mu} + \tfrac{\sigma_*^2}{n\mu\varepsilon} + \tfrac{\sqrt{\mathcal{L}(\sigma_*^2 + \zeta_*^2/\delta)}}{\mu\sqrt{\varepsilon}}\big)$ |
| (1)+(2) | EC-SGD | Alg 4 | [46] | J.2 | $\tilde O\big(\tfrac{\kappa}{\delta} + \tfrac{\sigma_*^2}{n\mu\varepsilon} + \tfrac{\sqrt{L(\sigma_*^2 + \zeta_*^2/\delta)}}{\mu\sqrt{\varepsilon}}\big)$ |
| (1)+(3) | EC-GDstar | Alg 5 | new | J.3 | $O\big(\tfrac{\kappa}{\delta}\log\tfrac{1}{\varepsilon}\big)$ |
| (1)+(2) | EC-SGD-DIANA | Alg 6 | new | J.4 | Opt. I: $\tilde O\big(\tfrac{\cdots}{\delta} + \tfrac{\sigma^2}{n\mu\varepsilon} + \cdots\big)$; Opt. II: $\tilde O\big(\tfrac{1+\omega}{\delta} + \tfrac{\sigma^2}{n\mu\varepsilon} + \cdots\big)$ |
| (1)+(3) | EC-SGDsr-DIANA | Alg 7 | new | J.5 | Opt. I: $\tilde O\big(\tfrac{L+\mathcal{L}}{\delta\mu} + \tfrac{\sigma^2}{n\mu\varepsilon} + \sqrt{\cdots}\big)$; Opt. II: $\tilde O\big(\tfrac{L+\mathcal{L}}{\delta\mu} + \tfrac{\sigma^2}{n\mu\varepsilon} + \sqrt{\cdots}\big)$ |
| (1)+(2) | EC-GD-DIANA | Alg 6 | new | J.4 | $O\big(\big(\omega + \tfrac{\kappa}{\delta}\big)\log\tfrac{1}{\varepsilon}\big)$ |
| (1)+(3) | EC-LSVRG | Alg 8 | new | J.6 | $\tilde O\big(\cdots\big)$ |
| (1)+(3) | EC-LSVRGstar | Alg 9 | new | J.7 | $O\big(\big(m + \tfrac{\kappa}{\delta}\big)\log\tfrac{1}{\varepsilon}\big)$ |
| (1)+(3) | EC-LSVRG-DIANA | Alg 10 | new | J.8 | $O\big(\big(\omega + m + \tfrac{\kappa}{\delta}\big)\log\tfrac{1}{\varepsilon}\big)$ |

EC-GDstar, EC-LSVRGstar and EC-LSVRG-DIANA are the first EC methods with a linear convergence rate without assuming that $\nabla f_i(x^*) = 0$ for all $i$. EC-LSVRGstar and EC-LSVRG-DIANA are the first EC methods with a linear convergence rate which do not require the computation of the full gradient $\nabla f_i(x^k)$ by all workers in each iteration. Out of these three methods, only EC-LSVRG-DIANA is practical. EC-GD-DIANA is a special case of EC-SGD-DIANA in which each worker $i$ computes the full gradient $\nabla f_i(x^k)$.

Some of these methods are designed to work with the expectation structure of the local functions $f_i$ given in (2), and others are specifically designed to exploit the finite-sum structure (3). All these methods follow the error-feedback framework (4)-(5), with $v_i^k$ chosen as in (7). They differ in how the gradient estimator $g_i^k$ is constructed (see Table 2 for a compact description of all these methods; formal descriptions can be found in the appendix). As we shall see, the existing EC-SGD method uses a rather naive gradient estimator, which renders it less efficient in theory and practice when compared to the best of our new methods. A key property of our parametric assumption described above is that it allows for the construction and modeling of more elaborate gradient estimators, which leads to new EC methods with superior theoretical and practical properties when compared to the prior state of the art.
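As an illustration of the kind of gradient estimator our framework can model, the following sketch shows the EC-SGD-DIANA estimator from Table 2 below: each worker maintains a shift $h_i^k$, forms $g_i^k = \hat g_i^k - h_i^k + h^k$, and updates its shift through an unbiased compressor $Q$. This is our own illustrative code, not the authors' implementation; in particular, we assume here that $h^k$ is the average of the workers' shifts, and the callables `grads`, `Q` and `compressor_C` are placeholders.

```python
import numpy as np

def ec_sgd_diana_step(x, e, h, grads, Q, gamma, alpha, compressor_C):
    """One iteration of EC-SGD-DIANA (cf. Table 2).
    x: current iterate; e: (n, d) accumulated errors; h: (n, d) DIANA shifts;
    grads[i](x): stochastic gradient hat{g}_i^k of worker i with E[hat{g}_i^k] = nabla f_i(x^k);
    Q: unbiased compressor (for shift updates); compressor_C: contractive compressor (error feedback)."""
    n, d = e.shape
    h_avg = h.mean(axis=0)                           # h^k (assumed: average of the workers' shifts)
    v = np.zeros((n, d))
    for i in range(n):
        g_hat = grads[i](x)                          # hat{g}_i^k
        g_i = g_hat - h[i] + h_avg                   # g_i^k = hat{g}_i^k - h_i^k + h^k
        v[i] = compressor_C(e[i] + gamma * g_i)      # v_i^k = C(e_i^k + gamma g_i^k), eq. (7)
        e[i] = e[i] + gamma * g_i - v[i]             # error update, eq. (5)
        h[i] = h[i] + alpha * Q(g_hat - h[i])        # h_i^{k+1} = h_i^k + alpha Q(hat{g}_i^k - h_i^k)
    x_new = x - v.mean(axis=0)                       # eq. (4)
    return x_new, e, h
```

With $\hat g_i^k = \nabla f_i(x^k)$ (full local gradients) this reduces to EC-GD-DIANA, while setting $h_i^0 = 0$ and $\alpha = 0$ recovers plain EC-SGD.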
First linearly converging EC methods. The key theoretical consequence of our general framework is the development of the first linearly converging error-compensated SGD-type methods for distributed training with biased communication compression. In particular, we design four such methods: two simple but impractical methods, EC-GDstar and EC-LSVRGstar, with rates $O\big(\tfrac{\kappa}{\delta}\log\tfrac{1}{\varepsilon}\big)$ and $O\big(\big(m + \tfrac{\kappa}{\delta}\big)\log\tfrac{1}{\varepsilon}\big)$, respectively, and two practical but more elaborate methods, EC-GD-DIANA, with rate $O\big(\big(\omega + \tfrac{\kappa}{\delta}\big)\log\tfrac{1}{\varepsilon}\big)$, and EC-LSVRG-DIANA, with rate $O\big(\big(\omega + m + \tfrac{\kappa}{\delta}\big)\log\tfrac{1}{\varepsilon}\big)$. In these rates, $\kappa = L/\mu$ is the condition number, $0 < \delta \le 1$ is the contraction parameter associated with the compressor $\mathcal{C}$ used in (7), and $\omega$ is the variance parameter associated with a secondary unbiased compressor $Q$ (we assume that $\mathbb{E}[Q(x)] = x$ and $\mathbb{E}\|Q(x) - x\|^2 \le \omega\|x\|^2$ for all $x \in \mathbb{R}^d$), which plays a key role in the construction of the gradient estimator $g_i^k$. The complexity of the first and third methods does not depend on $m$, as they require the computation of the full gradient $\nabla f_i(x^k)$ for each $i$. The remaining two methods only need to compute $O(1)$ stochastic gradients $\nabla f_{ij}(x^k)$ on each worker $i$.

Table 2: Error-compensated methods developed in this paper. In all cases, $v_i^k = \mathcal{C}(e_i^k + \gamma g_i^k)$. The full descriptions of the algorithms are included in the appendix.

| Problem | Method | $g_i^k$ | Comment |
|---|---|---|---|
| (1)+(3) | EC-SGDsr | $\frac{1}{m}\sum_{j=1}^{m}\xi_{ij}\nabla f_{ij}(x^k)$ | $\mathbb{E}[\xi_{ij}] = 1$; $\mathbb{E}_{\mathcal{D}_i}\Vert\nabla f_{\xi_i}(x) - \nabla f_{\xi_i}(x^*)\Vert^2 \le 2\mathcal{L} D_{f_i}(x, x^*)$ |
| (1)+(2) | EC-SGD | $\nabla f_{\xi_i}(x^k)$ | |
| (1) | EC-GDstar | $\nabla f_i(x^k) - \nabla f_i(x^*)$ | known $\nabla f_i(x^*)$ for all $i$ |
| (1)+(2) | EC-SGD-DIANA | $\hat g_i^k - h_i^k + h^k$ | $\mathbb{E}[\hat g_i^k] = \nabla f_i(x^k)$; $\mathbb{E}_k\Vert\hat g_i^k - \nabla f_i(x^k)\Vert^2 \le D_{1,i}$; $h_i^{k+1} = h_i^k + \alpha Q(\hat g_i^k - h_i^k)$ |
| (1)+(3) | EC-SGDsr-DIANA | $\nabla f_{\xi_i^k}(x^k) - h_i^k + h^k$ | $\mathbb{E}\big[\nabla f_{\xi_i^k}(x^k)\big] = \nabla f_i(x^k)$; $\mathbb{E}_{\mathcal{D}_i}\Vert\nabla f_{\xi_i}(x) - \nabla f_{\xi_i}(x^*)\Vert^2 \le 2\mathcal{L} D_{f_i}(x, x^*)$; $h_i^{k+1} = h_i^k + \alpha Q(\nabla f_{\xi_i^k}(x^k) - h_i^k)$ |
| (1)+(3) | EC-LSVRG | $\nabla f_{il}(x^k) - \nabla f_{il}(w_i^k) + \nabla f_i(w_i^k)$ | $l$ chosen uniformly from $[m]$; $w_i^{k+1} = x^k$ with prob. $p$, $w_i^k$ with prob. $1-p$ |
| (1)+(3) | EC-LSVRGstar | $\nabla f_{il}(x^k) - \nabla f_{il}(w_i^k) + \nabla f_i(w_i^k) - \nabla f_i(x^*)$ | $l$ chosen uniformly from $[m]$; $w_i^{k+1} = x^k$ with prob. $p$, $w_i^k$ with prob. $1-p$ |
| (1)+(3) | EC-LSVRG-DIANA | $\hat g_i^k - h_i^k + h^k$, where $\hat g_i^k = \nabla f_{il}(x^k) - \nabla f_{il}(w_i^k) + \nabla f_i(w_i^k)$ | $h_i^{k+1} = h_i^k + \alpha Q(\hat g_i^k - h_i^k)$; $l$ chosen uniformly from $[m]$; $w_i^{k+1} = x^k$ with prob. $p$, $w_i^k$ with prob. $1-p$ |

The first two methods, while impractical, provided us with the intuition which enabled us to develop the practical variant. We include them in this paper due to their simplicity, because of the added insights they offer, and to showcase the flexibility of our general theoretical framework, which is able to describe them. EC-GDstar and EC-LSVRGstar are impractical since they require the knowledge of the gradients $\{\nabla f_i(x^*)\}$, where $x^*$ is an optimal solution of (1); these are obviously not known since $x^*$ is not known. The only known linear convergence result for an error-compensated SGD method is due to Beznosikov et al. [7], who require the computation of the full gradient of $f_i$ by each machine $i$ (i.e., $m$ stochastic gradients), and the additional assumption that $\nabla f_i(x^*) = 0$ for all $i$. We do not need such assumptions, thereby resolving a major theoretical issue with EC methods.

Results in the convex case. Our theoretical analysis goes beyond distributed optimization and recovers the results of Gorbunov et al. [11] and Khaled et al. [25] (without regularization) in the special case when $v_i^k \equiv \gamma g_i^k$. As we have seen, in this case $e_i^k \equiv 0$ for all $i$ and $k$, and the error-feedback framework (4)-(5) reduces to distributed SGD (6). In this regime, relation (19) in Assumption 3.4 becomes void, while relations (15) and (16) with $\sigma_{2,k}^2 \equiv 0$ are precisely those used by Gorbunov et al. [11] to analyze a wide array of SGD methods, including vanilla SGD [41], SGD with arbitrary sampling [13], as well as variance-reduced methods such as SAGA [9], SVRG [20], LSVRG [17, 31], JacSketch [12], SEGA [16] and DIANA [37, 19]. Our theorem recovers the rates of all the methods just listed, both in the convex case $\mu = 0$ (Khaled et al. [25]) and in the strongly convex case $\mu > 0$ (Gorbunov et al. [11]), under the more general Assumption 3.4.
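Both the DIANA-type shift updates above and the double quantization discussed next rely on a secondary unbiased compressor $Q$ with $\mathbb{E}[Q(x)] = x$ and $\mathbb{E}\|Q(x) - x\|^2 \le \omega\|x\|^2$. A minimal sketch of one standard choice, Rand-K with rescaling (for which $\omega = d/K - 1$), is given below; this is our illustration and is not taken from the paper's code.

```python
import numpy as np

def rand_k(x, k, rng=np.random.default_rng()):
    """Unbiased Rand-K compressor: keep k coordinates chosen uniformly at random,
    rescaled by d/k. Then E[Q(x)] = x and E||Q(x) - x||^2 = (d/k - 1) ||x||^2,
    i.e. omega = d/k - 1 in the notation used above."""
    d = x.shape[0]
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

# Quick empirical sanity check of unbiasedness and of the variance parameter omega.
rng = np.random.default_rng(1)
d, k = 100, 10
x = rng.normal(size=d)
samples = np.stack([rand_k(x, k, rng) for _ in range(20000)])
print(np.allclose(samples.mean(axis=0), x, atol=0.05))                    # approx E[Q(x)] = x
print(((samples - x) ** 2).sum(axis=1).mean() / np.linalg.norm(x) ** 2)   # approx d/k - 1 = 9
```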
DIANA with bi-directional quantization. To illustrate how our framework can be used even in the case when $v_i^k \equiv \gamma g_i^k$, $e_i^k \equiv 0$, we develop and analyze a new version of DIANA, called DIANAsr-DQ, that uses arbitrary sampling on every node and double quantization, i.e., unbiased compression not only on the workers' side but also on the master's side. (In the concurrent work [40], which appeared on arXiv after we submitted our paper to NeurIPS, a similar method was independently proposed under the name Artemis. However, our analysis is more general; see the details on this method in the appendix. This remark was added during the preparation of the camera-ready version of our paper.)

Methods with delayed updates. Following Stich [44], we also show that our approach covers SGD with delayed updates [1, 3, 10] (D-SGD), and our analysis yields the best-known rate for this method. Due to the flexibility of our framework, we are able to develop several new variants of D-SGD with and without quantization, variance reduction, and arbitrary sampling. Again, due to space limitations, we put these methods, together with their convergence analyses, in the appendix.

3 Main Result

In this section we present the main theoretical result of our paper. First, we introduce our assumption on $f$, which is a relaxation of $\mu$-strong convexity.

Assumption 3.1 ($\mu$-strong quasi-convexity). Assume that function $f$ has a unique minimizer $x^*$. We say that function $f$ is strongly quasi-convex with parameter $\mu \ge 0$ if for all $x \in \mathbb{R}^d$
$$f(x^*) \ge f(x) + \langle\nabla f(x), x^* - x\rangle + \frac{\mu}{2}\|x^* - x\|^2. \qquad (9)$$
We allow $\mu$ to be zero, in which case $f$ is sometimes called weakly quasi-convex (see [44] and references therein). Second, we introduce the classical $L$-smoothness assumption.

Assumption 3.2 ($L$-smoothness). We say that function $f$ is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous, i.e., for all $x, y \in \mathbb{R}^d$
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|. \qquad (10)$$
It is a well-known fact [38] that $L$-smoothness of a convex function $f$ implies
$$\|\nabla f(x) - \nabla f(y)\|^2 \le 2L\big(f(x) - f(y) - \langle\nabla f(y), x - y\rangle\big) \overset{\text{def}}{=} 2L D_f(x, y). \qquad (11)$$

We now introduce our key parametric assumption on the stochastic gradient $g^k$. It is a generalization of the assumption introduced by Gorbunov et al. [11] to the particular class of methods covered by the error-feedback framework (4)-(5).

Assumption 3.3. For all $k \ge 0$, the stochastic gradient $g^k$ is an average of stochastic gradients $g_i^k$ such that
$$g^k = \frac{1}{n}\sum_{i=1}^{n} g_i^k, \qquad \mathbb{E}\big[g^k \mid x^k\big] = \nabla f(x^k). \qquad (12)$$
Moreover, there exist constants $A, \tilde A, A', B_1, B_2, \tilde B_1, \tilde B_2, B_1', B_2', C_1, C_2, G, D_1, \tilde D_1, D_1', D_2, D_3 \ge 0$ and $\rho_1, \rho_2 \in [0, 1]$, and two sequences of (possibly random) variables $\{\sigma_{1,k}\}_{k\ge 0}$ and $\{\sigma_{2,k}\}_{k\ge 0}$, such that the following recursions hold:
$$\frac{1}{n}\sum_{i=1}^{n}\|\bar g_i^k\|^2 \le 2A\big(f(x^k) - f(x^*)\big) + B_1\sigma_{1,k}^2 + B_2\sigma_{2,k}^2 + D_1, \qquad (13)$$
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[\|g_i^k - \bar g_i^k\|^2 \mid x^k\big] \le 2\tilde A\big(f(x^k) - f(x^*)\big) + \tilde B_1\sigma_{1,k}^2 + \tilde B_2\sigma_{2,k}^2 + \tilde D_1, \qquad (14)$$
$$\mathbb{E}\big[\|g^k\|^2 \mid x^k\big] \le 2A'\big(f(x^k) - f(x^*)\big) + B_1'\sigma_{1,k}^2 + B_2'\sigma_{2,k}^2 + D_1', \qquad (15)$$
$$\mathbb{E}\big[\sigma_{1,k+1}^2 \mid \sigma_{1,k}^2, \sigma_{2,k}^2\big] \le (1 - \rho_1)\sigma_{1,k}^2 + 2C_1\big(f(x^k) - f(x^*)\big) + G\rho_1\sigma_{2,k}^2 + D_2, \qquad (16)$$
$$\mathbb{E}\big[\sigma_{2,k+1}^2 \mid \sigma_{2,k}^2\big] \le (1 - \rho_2)\sigma_{2,k}^2 + 2C_2\big(f(x^k) - f(x^*)\big), \qquad (17)$$
where $\bar g_i^k = \mathbb{E}\big[g_i^k \mid x^k\big]$.

Let us briefly explain the intuition behind the assumption and the meaning of the introduced parameters. First of all, we assume that the stochastic gradient at iteration $k$ is a conditionally unbiased estimator of $\nabla f(x^k)$, which is a natural and commonly used assumption on the stochastic gradient in the literature. However, we explicitly do not require unbiasedness of $g_i^k$, which is very useful in some special cases.
Secondly, let us consider the simplest special case when $g^k \equiv \nabla f(x^k)$ and $f_1 = \dots = f_n = f$, i.e., there is no stochasticity/randomness in the method and the workers have the same functions. Then, due to $\nabla f(x^*) = 0$, we have $\|\nabla f(x^k)\|^2 \overset{(11)}{\le} 2L\big(f(x^k) - f(x^*)\big)$, which implies that Assumption 3.3 holds in this case with $A = A' = L$, $\tilde A = 0$, $B_1 = B_2 = \tilde B_1 = \tilde B_2 = B_1' = B_2' = C_1 = C_2 = D_1 = \tilde D_1 = D_1' = D_2 = 0$, $\rho_1 = \rho_2 = 1$ and $\sigma_{1,k}^2 \equiv \sigma_{2,k}^2 \equiv 0$. In general, if $g^k$ satisfies Assumption 3.3, then the parameters $A$, $\tilde A$ and $A'$ are usually connected with the smoothness properties of $f$ and are typically just multiples of $L$, whereas the terms $B_1\sigma_{1,k}^2$, $B_2\sigma_{2,k}^2$, $\tilde B_1\sigma_{1,k}^2$, $\tilde B_2\sigma_{2,k}^2$, $B_1'\sigma_{1,k}^2$, $B_2'\sigma_{2,k}^2$ and $D_1$, $\tilde D_1$, $D_1'$ appear due to the stochastic nature of $g_i^k$. Moreover, $\{\sigma_{1,k}^2\}_{k\ge 0}$ and $\{\sigma_{2,k}^2\}_{k\ge 0}$ are sequences connected with variance reduction processes; for methods without any kind of variance reduction these sequences contain only zeros. Parameters $B_1$ and $B_2$ are often $0$ or small positive constants, e.g., $B_1 = B_2 = 2$, and $D_1$ characterizes the remaining variance in the estimator $g^k$ that is not included in the first two terms. Inequalities (16) and (17) describe the variance reduction processes: one can interpret $\rho_1$ and $\rho_2$ as the rates of the variance reduction processes, $2C_1(f(x^k) - f(x^*))$ and $2C_2(f(x^k) - f(x^*))$ are optimization terms, and, similarly to $D_1$, the constant $D_2$ represents the remaining variance that is not included in the first two terms. Typically, $\sigma_{1,k}^2$ controls the variance coming from compression and $\sigma_{2,k}^2$ controls the variance originating in finite-sum-type randomization (i.e., subsampling) by each worker. In the case $\rho_1 = 1$ we assume that $B_1 = B_1' = C_1 = G = 0$ and $D_2 = 0$ (and analogously for $\rho_2 = 1$), since inequality (16) becomes superfluous. However, in our main result we need a slightly different assumption.

Assumption 3.4. For all $k \ge 0$, the stochastic gradient $g^k$ is an unbiased estimator of $\nabla f(x^k)$:
$$\mathbb{E}\big[g^k \mid x^k\big] = \nabla f(x^k). \qquad (18)$$
Moreover, there exist non-negative constants $A', B_1', B_2', C_1, C_2, F_1, F_2, G, D_1', D_2, D_3 \ge 0$, $\rho_1, \rho_2 \in [0, 1]$ and two sequences of (possibly random) variables $\{\sigma_{1,k}\}_{k\ge 0}$ and $\{\sigma_{2,k}\}_{k\ge 0}$ such that inequalities (15), (16) and (17) hold and
$$\sum_{k=0}^{K} w_k \mathbb{E}\|e^k\|^2 \le \frac{1}{4}\sum_{k=0}^{K} w_k\,\mathbb{E}\big[f(x^k) - f(x^*)\big] + F_1\sigma_{1,0}^2 + F_2\sigma_{2,0}^2 + \gamma D_3 W_K \qquad (19)$$
for all $k, K \ge 0$, where $e^k = \frac{1}{n}\sum_{i=1}^{n} e_i^k$ and $\{W_K\}_{K\ge 0}$ and $\{w_k\}_{k\ge 0}$ are defined as
$$W_K = \sum_{k=0}^{K} w_k, \qquad w_k = (1 - \eta)^{-(k+1)}, \qquad \eta = \min\left\{\frac{\gamma\mu}{2}, \frac{\rho_1}{4}, \frac{\rho_2}{4}\right\}.$$

This assumption is more flexible than Assumption 3.3 and helps us to obtain a unified analysis of all methods falling into the error-feedback framework. We emphasize that in this assumption we do not assume that (13) and (14) hold explicitly. Instead, we introduce inequality (19), which is the key tool that helps us to analyze the effect of error feedback; it comes from the analysis in [46], with the adaptations needed to connect it with the first three inequalities. As we show in the appendix, this inequality can be derived for SGD with error compensation and delayed updates under Assumption 3.3 and, in particular, using (13) and (14). As before, $D_3$ hides a variance that is not handled by the variance reduction processes, and $F_1$ and $F_2$ are constants that typically depend on $L$, $B_1$, $B_2$, $\rho_1$, $\rho_2$ and $\gamma$. We now proceed to stating our main theorem.

Theorem 3.1. Let Assumptions 3.1, 3.2 and 3.4 be satisfied and let $\gamma \le \frac{1}{4(A' + C_1 M_1 + C_2 M_2)}$.
Then for all $K \ge 0$ we have
$$\mathbb{E}\big[f(\bar x^K)\big] - f(x^*) \le (1 - \eta)^K\,\frac{4\big(T^0 + \gamma F_1\sigma_{1,0}^2 + \gamma F_2\sigma_{2,0}^2\big)}{\gamma} + 4\gamma\big(D_1' + M_1 D_2 + D_3\big) \qquad (21)$$
when $\mu > 0$, and
$$\mathbb{E}\big[f(\bar x^K)\big] - f(x^*) \le \frac{4\big(T^0 + \gamma F_1\sigma_{1,0}^2 + \gamma F_2\sigma_{2,0}^2\big)}{\gamma K} + 4\gamma\big(D_1' + M_1 D_2 + D_3\big) \qquad (22)$$
when $\mu = 0$, where $\eta = \min\{\gamma\mu/2, \rho_1/4, \rho_2/4\}$, $T^k \overset{\text{def}}{=} \|x^k - x^*\|^2 + M_1\gamma^2\sigma_{1,k}^2 + M_2\gamma^2\sigma_{2,k}^2$, $M_1 = \frac{4B_1'}{3\rho_1}$ and $M_2 = 4\big(B_2' + \cdots\big)$.

All the complexity results summarized in Table 1 follow from this theorem; the detailed proofs are included in the appendix. Furthermore, in the appendix we include similar results for methods employing delayed updates; these methods and all the associated theory are included there, too.

4 Numerical Experiments

To justify our theory, we conduct several numerical experiments on the logistic regression problem with $\ell_2$-regularization:
$$\min_{x\in\mathbb{R}^d}\; f(x) = \frac{1}{N}\sum_{i=1}^{N}\log\big(1 + \exp(-y_i (Ax)_i)\big) + \frac{\mu}{2}\|x\|^2, \qquad (23)$$
where $N$ is the number of data samples, $x \in \mathbb{R}^d$ represents the weights of the model, $A \in \mathbb{R}^{N\times d}$ is the feature matrix, $y \in \{-1, 1\}^N$ is the vector of labels, and $(Ax)_i$ denotes the $i$-th component of the vector $Ax$. Clearly, this problem is $L$-smooth and $\mu$-strongly convex with $L = \mu + \lambda_{\max}(A^\top A)/(4N)$, where $\lambda_{\max}(A^\top A)$ is the largest eigenvalue of $A^\top A$. The datasets were taken from the LIBSVM library [8], and the code was written in Python 3.7 using standard libraries. Our code is available at https://github.com/eduardgorbunov/ef_sigma_k.

We simulate the parameter-server architecture using one machine with an Intel(R) Core(TM) i7-9750 CPU @ 2.60GHz in the following way. First of all, we always use $N$ such that $N = n\cdot m$ and consider $n = 20$ and $n = 100$ workers. The choice of $N$ for each dataset we consider is stated in Table 3. Next, we shuffle the data and split it into $n$ groups of size $m$. To emulate the work of the workers, we use a single machine and run the methods with the parallel loop executed in series. Since in these experiments we study sample complexity and the number of bits used for communication, this setup is identical to a real parameter-server setup in this sense.

Table 3: Summary of datasets: $N$ = total # of data samples; $d$ = # of features.

| | a9a | w8a | gisette | mushrooms | madelon | phishing |
|---|---|---|---|---|---|---|
| $N$ | 32,000 | 49,700 | 6,000 | 8,000 | 2,000 | 11,000 |
| $d$ | 123 | 300 | 5,000 | 112 | 500 | 68 |

In all experiments we use the stepsize $\gamma = 1/L$ and the $\ell_2$-regularization parameter $\mu = 10^{-4}\lambda_{\max}(A^\top A)/(4N)$. The starting point $x^0$ for each dataset was chosen so that $f(x^0) - f(x^*) \approx 10$. In the experiments with stochastic methods we used batches of size 1 and uniform sampling for simplicity. For LSVRG-type methods we choose $p = 1/m$.
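For concreteness, the problem constants used in this setup can be computed as in the following sketch. This is our own illustration with synthetic data (the authors' actual experimental code is at the repository linked above); `A` is assumed to be a dense $(N, d)$ feature matrix and `y` a vector of $\pm 1$ labels.

```python
import numpy as np

def problem_constants(A, reg_scale=1e-4):
    """Constants for the l2-regularized logistic regression (23), as used in Section 4:
    mu = 1e-4 * lambda_max(A^T A) / (4N),  L = mu + lambda_max(A^T A) / (4N),  gamma = 1/L."""
    N = A.shape[0]
    lam_max = np.linalg.eigvalsh(A.T @ A).max()      # largest eigenvalue of A^T A
    mu = reg_scale * lam_max / (4 * N)
    L = mu + lam_max / (4 * N)
    return mu, L, 1.0 / L

def logreg_grad(x, A, y, mu):
    """Full gradient of (23): -(1/N) sum_i y_i a_i / (1 + exp(y_i a_i^T x)) + mu x."""
    N = A.shape[0]
    t = y * (A @ x)
    return -(A.T @ (y / (1.0 + np.exp(t)))) / N + mu * x

# Example: split N = n*m shuffled rows among n workers, as in the experimental setup.
rng = np.random.default_rng(0)
N, d, n = 2000, 50, 20
A = rng.normal(size=(N, d)); y = np.sign(rng.normal(size=N))
mu, L, gamma = problem_constants(A)
perm = rng.permutation(N).reshape(n, N // n)         # n groups of size m = N // n
worker_grads = [lambda x, rows=rows: logreg_grad(x, A[rows], y[rows], mu) for rows in perm]
print(mu, L, gamma)
```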
Compressing stochastic gradients. The results for a9a, madelon and phishing can be found in Figure 1 (included here), and those for w8a, mushrooms and gisette in Figure 3 (in the appendix). We choose the number of components for the Top-K operator to be of order $\max\{1, d/100\}$. Clearly, in these experiments we see two levels of noise. For some datasets, like a9a, phishing or mushrooms, the noise that comes from the stochasticity of the gradients dominates the noise coming from compression. Therefore, methods such as EC-SGD and EC-SGD-DIANA start to oscillate around a larger value of the loss function than the other methods we consider. EC-LSVRG reduces the largest source of noise and, as a result, finds a better approximation of the solution. However, at some point it reaches another level of the loss function and starts to oscillate there due to the noise coming from compression. Finally, EC-LSVRG-DIANA reduces the variance of both types and, as a result, finds an even better approximation of the solution. In contrast, for the madelon dataset, both noises are of the same order, and therefore EC-LSVRG and EC-SGD-DIANA behave similarly to EC-SGD. However, EC-LSVRG-DIANA again reduces both types of noise effectively and finds a better approximation of the solution after a given number of epochs. In the experiments with the w8a and gisette datasets, the noise produced by compression is dominated by the noise coming from the stochastic gradients. As a result, we see that the DIANA-trick is not needed there.

Figure 1: Trajectories of EC-SGD, EC-SGD-DIANA, EC-LSVRG and EC-LSVRG-DIANA applied to solve the logistic regression problem with 20 workers. Panels: a9a, madelon and phishing; the relative error $\frac{f(x^k) - f(x^*)}{f(x^0) - f(x^*)}$ is plotted against the number of bits per worker (top row) and against the number of passes through the data (bottom row). Compressors: Top-1/Top-5 for $\mathcal{C}$ and Rand-1/Rand-5 or $\ell_2$-quantization for $Q$.

Compressing full gradients. In order to show the effect of the DIANA-type variance reduction itself, we consider the case when all workers compute the full gradients of their functions; see Figure 2 (included here) and Figures 4-7 (in the appendix). Clearly, for all datasets except mushrooms, EC-GD with a constant stepsize converges only to a neighborhood of the solution, while EC-GDstar and EC-GD-DIANA converge asymptotically, at a linear rate, to the exact solution. EC-GDstar always shows the best performance; however, it is impractical: we used a very good approximation of the solution to apply this method. In contrast, EC-GD-DIANA converges slightly more slowly and requires more bits for communication, but it is practical and shows better performance than EC-GD. On the mushrooms dataset, EC-GD does not reach the oscillation region after the given number of epochs and is therefore preferable there.
Figure 2: Trajectories of EC-GD, EC-GD-star and EC-GD-DIANA applied to solve the logistic regression problem with 20 workers. Panels: a9a, madelon and phishing; the relative error $\frac{f(x^k) - f(x^*)}{f(x^0) - f(x^*)}$ is plotted against the number of bits per worker and against the number of passes through the data. Compressors: Top-1/Top-2/Top-5/Top-10 for $\mathcal{C}$ and Rand-1/Rand-5 or $\ell_2$-quantization for $Q$.

Broader Impact

Our contribution is primarily theoretical. Therefore, a broader impact discussion is not applicable.

Acknowledgments and Disclosure of Funding

The work of Peter Richtárik, Eduard Gorbunov and Dmitry Kovalev was supported by the KAUST Baseline Research Fund. Part of this work was done while E. Gorbunov was a research intern at KAUST. The work of E. Gorbunov in Sections 4, A-D, H, and K was also partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project No. 0714-2020-0005, and in Sections 3, E-G, I, J by RFBR, project number 19-31-51001.

References

[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873-881, 2011.

[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.

[3] Y. Arjevani, O. Shamir, and N. Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. arXiv preprint arXiv:1806.10188, 2018.

[4] D. Basu, D. Data, C. Karakus, and S. Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems, pages 14668-14679, 2019.

[5] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. SignSGD with majority vote is communication efficient and fault tolerant. In ICLR, 2019.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.

[7] A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.

[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.

[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.
In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.

[10] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740-3754, 2016.

[11] E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.

[12] R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.

[13] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General analysis and improved rates. In International Conference on Machine Learning, pages 5200-5209, 2019.

[14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[15] F. Haddadpour and M. Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

[16] F. Hanzely, K. Mishchenko, and P. Richtárik. SEGA: Variance reduction via gradient sketching. In Advances in Neural Information Processing Systems, pages 2082-2093, 2018.

[17] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305-2313, 2015.

[18] S. Horváth, C.-Y. Ho, Ľudovít Horváth, A. N. Sahu, M. Canini, and P. Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.

[19] S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.

[20] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323, 2013.

[21] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[22] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. arXiv preprint arXiv:1910.06378, 2019.

[23] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pages 3252-3261, 2019.

[24] A. Khaled, K. Mishchenko, and P. Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.

[25] A. Khaled, O. Sebbouh, N. Loizou, R. M. Gower, and P. Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020.

[26] S. Khirirat, H. R. Feyzmahdavian, and M. Johansson. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.

[27] A. Koloskova, S. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication.
In International Conference on Machine Learning, pages 3478-3487, 2019.

[28] A. Koloskova, T. Lin, S. U. Stich, and M. Jaggi. Decentralized deep learning with arbitrary communication compression. In ICLR, 2020. URL https://arxiv.org/abs/1907.09356.

[29] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. U. Stich. A unified theory of decentralized SGD with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.

[30] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[31] D. Kovalev, S. Horváth, and P. Richtárik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020.

[32] R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. The Journal of Machine Learning Research, 19(1):3140-3207, 2018.

[33] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737-2745, 2015.

[34] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330-5340, 2017.

[35] X. Liu, Y. Li, J. Tang, and M. Yan. A double residual compression algorithm for efficient distributed learning. arXiv preprint arXiv:1910.07561, 2019.

[36] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202-2229, 2017.

[37] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

[38] Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.

[39] L. Nguyen, P. H. Nguyen, M. Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pages 3750-3758, 2018.

[40] C. Philippenko and A. Dieuleveut. Artemis: Tight convergence guarantees for bidirectional compression in federated learning. arXiv preprint arXiv:2006.14591, 2020.

[41] H. Robbins and S. Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102-109. Springer, 1985.

[42] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000-1008, 2014.

[43] S. U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1805.09767.

[44] S. U. Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.

[45] S. U. Stich. On communication compression for distributed optimization on heterogeneous data. arXiv preprint arXiv:2009.02388, 2020.

[46] S. U. Stich and S. P. Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
[47] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447-4458, 2018.

[48] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pages 6155-6165, 2019.

[49] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509-1519, 2017.

[50] B. Woodworth, K. K. Patel, S. U. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro. Is local SGD better than minibatch SGD? arXiv preprint arXiv:2002.07839, 2020.