# MARINA: Faster Non-Convex Distributed Learning with Compression

Eduard Gorbunov 1,2,3, Konstantin Burlachenko 3, Zhize Li 3, Peter Richtárik 3

1 Moscow Institute of Physics and Technology, Moscow, Russia; 2 Yandex, Moscow, Russia; 3 King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. Correspondence to: Eduard Gorbunov, Peter Richtárik.

Abstract

We develop and analyze MARINA: a new communication-efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of, but different from, the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theoretical and practical performance. The communication complexity bounds we prove for MARINA are evidently better than those of all previous first-order methods. Further, we develop and analyze two variants of MARINA: VR-MARINA and PP-MARINA. The first method is designed for the case when the local loss functions owned by the clients are either of a finite-sum or of an expectation form, and the second method allows for a partial participation of the clients, a feature important in federated learning. All our methods are superior to previous state-of-the-art methods in terms of oracle/communication complexity. Finally, we provide a convergence analysis of all methods for problems satisfying the Polyak-Łojasiewicz condition.

1. Introduction

Non-convex optimization problems appear in various applications of machine learning, such as training deep neural networks (Goodfellow et al., 2016) and matrix completion and recovery (Ma et al., 2018; Bhojanapalli et al., 2016). Because of their practical importance, these problems have gained much attention in recent years, which led to a rapid development of new efficient methods for non-convex optimization (Danilova et al., 2020), and especially for the training of deep learning models (Sun, 2019).

Training deep neural networks is notoriously computationally challenging and time-consuming. In the quest to improve the generalization performance of modern deep learning models, practitioners resort to using increasingly larger datasets in the training process, and to support such workloads, it is imperative to use advanced parallel and distributed hardware, systems, and algorithms. Distributed computing is often necessitated by the desire to train models from data naturally distributed across several edge devices, as is the case in federated learning (Konečný et al., 2016; McMahan et al., 2017). However, even when this is not the case, distributed methods are often very efficient at reducing the training time (Goyal et al., 2017; You et al., 2020). Due to these and other reasons, distributed optimization has gained immense popularity in recent years.
However, distributed methods almost invariably suffer from the so-called communication bottleneck: the communication cost of the information necessary for the workers to jointly solve the problem at hand is often very high, and depending on the particular compute architecture, workload, and algorithm used, it can be orders of magnitude higher than the computation cost. A popular technique for resolving this issue is communication compression (Seide et al., 2014; Konečný et al., 2016; Suresh et al., 2017), which is based on applying a lossy transformation/compression to the models, gradients, or tensors to be sent over the network in order to save on communication. Since lossy compression generally decreases the utility of the exchanged messages, such an approach typically leads to an increase in the number of communication rounds, and the overall usefulness of this technique manifests itself in situations where the communication savings outweigh the increased number of communication rounds (Horváth et al., 2019).

The optimization and machine learning communities have exerted considerable effort in recent years to design distributed methods supporting compressed communication. Among the many methods proposed, we emphasize VR-DIANA (Horváth et al., 2019), FedCOMGATE (Haddadpour et al., 2020), and FedSTEPH (Das et al., 2020), because these papers contain the state-of-the-art results in the setup where the local loss functions can be arbitrarily heterogeneous.

1.1. Contributions

We propose several new distributed optimization methods supporting compressed communication, specifically focusing on smooth but non-convex problems of the form

$$\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\}, \qquad (1)$$

where $n$ workers/devices/clients/peers are connected in a centralized way with a parameter server, and client $i$ has access to the local loss function $f_i$ only. We establish strong complexity rates for these methods and show that they are better than previous state-of-the-art results.

MARINA. The main contribution of our paper is a new distributed method supporting communication compression called MARINA (Alg. 1). In this algorithm, at each iteration, with some probability the workers apply an unbiased compression operator to the gradient differences and send them to the server, which performs aggregation by averaging. Unlike all known methods operating with unbiased compression operators, this procedure leads to a biased gradient estimator. We prove convergence guarantees for MARINA, which are strictly better than those of previous state-of-the-art methods (see Table 1). For example, MARINA's rate $O\left(\frac{1+\omega/\sqrt{n}}{\varepsilon^2}\right)$ is $O(\sqrt{\omega})$ times better than that of the state-of-the-art method DIANA (Mishchenko et al., 2019), where $\omega$ is the variance parameter associated with the deployed compressor. For example, in the case of the Rand1 sparsification compressor, we have $\omega = d - 1$, and hence we get an improvement by the factor $O(\sqrt{d})$. Since the number $d$ of features can be truly very large when training modern models, this is a substantial improvement that can amount to several orders of magnitude.

Variance Reduction on Nodes. We generalize MARINA to VR-MARINA, which can handle the situation when the local functions $f_i$ have either a finite-sum form (each $f_i$ is an average of $m$ functions) or an expectation form, and when it is more efficient to rely on local stochastic gradients rather than on local gradients.
When compared with MARINA, VR-MARINA additionally performs local variance reduction on all nodes, progressively removing the variance coming from the stochastic approximation, leading to a better oracle complexity than previous state-of-the-art results (see Table 1). When no compression is used (i.e., $\omega = 0$), the rate of VR-MARINA is $O\left(\frac{\sqrt{m}}{\sqrt{n}\varepsilon^2}\right)$, while the rate of the state-of-the-art method VR-DIANA is $O\left(\frac{m^{2/3}}{\varepsilon^2}\right)$. This is an improvement by the factor $O(\sqrt{n}\, m^{1/6})$. When much compression is applied, and $\omega$ is large, our method is faster by the factor $O\left(\frac{m^{2/3}+\omega}{m^{1/2}+\omega^{1/2}}\right)$. In the special case when there is just a single node ($n = 1$) and no compression is used, VR-MARINA reduces to the PAGE method of Li et al. (2020); this is an optimal first-order algorithm for smooth non-convex finite-sum/online optimization problems.

Partial Participation. We develop a modification of MARINA allowing for partial participation of the clients, which is a feature critical in federated learning. The resulting method, PP-MARINA, has superior communication complexity to the existing methods developed for this setting (see Table 1).

Convergence Under the Polyak-Łojasiewicz Condition. We analyze all proposed methods for problems satisfying the Polyak-Łojasiewicz condition (Polyak, 1963; Łojasiewicz, 1963). Again, the obtained results are strictly better than previous ones (see Table 2). Statements and proofs of all these results are in the Appendix.

Simple Analysis. The simplicity and flexibility of our analysis offer several extensions. For example, one can easily generalize our analysis to the case of different quantization operators and different batch sizes used by the clients. Moreover, one can combine the ideas of VR-MARINA and PP-MARINA and obtain a single distributed algorithm with compressed communication, variance reduction on nodes, and client sampling. We did not do this to keep the exposition simpler.

1.2. Related Work

Non-Convex Optimization. Since finding a global minimum of a non-convex function is, in general, an NP-hard problem (Murty & Kabadi, 1987), many researchers in non-convex optimization focus on relaxed goals such as finding an $\varepsilon$-stationary point. The theory of stochastic first-order methods for finding $\varepsilon$-stationary points is well developed: it contains lower bounds for expectation minimization without smoothness of stochastic realizations (Arjevani et al., 2019) and for finite-sum/expectation minimization (Fang et al., 2018; Li et al., 2020), as well as optimal methods matching the lower bounds (see (Danilova et al., 2020; Li et al., 2020) for an overview). Recently, distributed variants of such methods were proposed (Sun et al., 2020; Sharma et al., 2019; Khanduri et al., 2020).

Compressed Communications. Works on distributed methods supporting communication compression can be roughly split into two large groups: the first group focuses on methods using unbiased compression operators (which we refer to as quantizations in this paper), such as RandK, and the second one studies methods using biased compressors such as TopK. One can find a detailed summary of the most popular compression operators in (Safaryan et al., 2020; Beznosikov et al., 2020).
Table 1: Summary of the state-of-the-art results for finding an $\varepsilon$-stationary point for problem (1), i.e., a point $\hat{x}$ such that $\mathbb{E}\|\nabla f(\hat{x})\|^2 \le \varepsilon^2$. Dependencies on the numerical constants, the quality of the starting point, and the smoothness constants are omitted in the complexity bounds. Abbreviations: PP = partial participation; communication complexity = the number of communication rounds needed to find an $\varepsilon$-stationary point; oracle complexity = the number of (stochastic) first-order oracle calls needed to find an $\varepsilon$-stationary point. Notation: $\omega$ = the quantization parameter (see Def. 1.1); $n$ = the number of nodes; $m$ = the size of the local dataset; $r$ = the (expected) number of clients sampled at each iteration; $b'$ = the batch size for VR-MARINA at the iterations with compressed communication. To simplify the bounds, we assume that the expected density $\zeta_Q$ of the quantization operator $Q$ (see Def. 1.1) satisfies $\omega + 1 = \Theta(d/\zeta_Q)$ (e.g., this holds for RandK and $\ell_2$-quantization, see (Beznosikov et al., 2020)). We note that (Haddadpour et al., 2020) and (Das et al., 2020) also contain better rates under additional assumptions on the similarity of the clients' data.

(Table 1 body: communication and oracle complexity bounds for DIANA (Mishchenko et al., 2019; Li & Richtárik, 2020), VR-DIANA (Horváth et al., 2019), FedCOMGATE (Haddadpour et al., 2020), FedSTEPH (Das et al., 2020), and the new methods MARINA (Alg. 1, Thm. 2.1 & Cor. 2.1), VR-MARINA with $b' = 1$ and with $b' = \Theta(1/(n\varepsilon^2))$ (Alg. 2, Thms. 3.1 & 3.2, Cors. 3.1 & 3.2), and PP-MARINA (Alg. 4, Thm. 4.1 & Cor. 4.1), across the full-gradient, finite-sum, online, and partial-participation setups; for instance, MARINA's communication complexity is $O\left(\frac{1+\omega/\sqrt{n}}{\varepsilon^2}\right)$.)

(1) The results for FedCOMGATE are derived under the assumption that for all vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ the quantization operator $Q$ satisfies $\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n} Q(x_j) - Q\left(\frac{1}{n}\sum_{j=1}^{n} x_j\right)\right\|^2\right] \le G$ for some constant $G \ge 0$. In fact, this assumption does not hold for classical quantization operators like RandK and $\ell_2$-quantization on $\mathbb{R}^d$. Counterexample: $n = 2$ and $x_1 = x_2 = (t, t, \ldots, t)$ with arbitrarily large $t > 0$.
(2) One can further improve the communication complexity by increasing $b'$.
(3) No assumptions on the smoothness of the stochastic realizations $f_\xi(x)$ are used.

Unbiased Compression. In this line of work, the first convergence result in the non-convex case was obtained by Alistarh et al. (2017) for QSGD, under the assumptions that the local loss functions are the same for all workers and the stochastic gradient has uniformly bounded second moment. After that, Mishchenko et al. (2019) proposed DIANA (and its momentum version) and proved its convergence rate for non-convex problems without any assumption on the boundedness of the second moment of the stochastic gradient, but under the assumption that the dissimilarity between the local loss functions is bounded. This restriction was later eliminated by Horváth et al. (2019) for the variance-reduced version of DIANA called VR-DIANA, and the analysis was extended to a large class of unbiased compressors. Finally, the results for QSGD and DIANA were recently generalized and tightened by Li & Richtárik (2020) in a unifying framework that includes many other methods as well.

Biased Compression. Biased compression operators are less optimization-friendly than unbiased ones.
Indeed, one can construct a simple convex quadratic problem for which distributed SGD with Top1 compression diverges exponentially fast (Beznosikov et al., 2020). However, this issue can be resolved using error compensation (Seide et al., 2014). The first analysis of error-compensated SGD (EC-SGD) for non-convex problems was obtained by Karimireddy et al. (2019) for homogeneous problems under the assumption that the second moment of the stochastic gradient is uniformly bounded. This last assumption was recently removed from the analysis of EC-SGD by Stich & Karimireddy (2020); Beznosikov et al. (2020), while the first results without the homogeneity assumption were obtained by Koloskova et al. (2020a) for Choco-SGD, but still under the assumption that the second moment of the stochastic gradient is uniformly bounded. This issue was resolved by Beznosikov et al. (2020). In general, the current understanding of optimization methods with biased compressors is far from complete: even in the strongly convex case, the first linearly converging (Gorbunov et al., 2020) and accelerated (Qian et al., 2020) error-compensated stochastic methods were proposed only recently.

Other Approaches. Besides communication compression, there are also other techniques aiming to reduce the overall communication cost of distributed methods. The most popular ones are based on decentralized communication and on multiple local steps between communication rounds, where the second technique is very popular in federated learning (Konečný et al., 2016; Kairouz et al., 2019). One can find the state-of-the-art distributed optimization methods using these techniques and their combinations in (Lian et al., 2017; Karimireddy et al., 2020; Li et al., 2019; Koloskova et al., 2020b). Moreover, there exist results based on combinations of communication compression with either decentralized communication, e.g., Choco-SGD (Koloskova et al., 2020a), or local updates, e.g., Qsparse-local-SGD (Basu et al., 2019), FedCOMGATE (Haddadpour et al., 2020), and FedSTEPH (Das et al., 2020); in (Basu et al., 2019) the convergence rates were derived under the assumption that the stochastic gradient has uniformly bounded second moment, and the results for Choco-SGD, FedCOMGATE, and FedSTEPH were described either earlier in the text or in Table 1.
Table 2: Summary of the state-of-the-art results for finding an $\varepsilon$-solution of problem (1) satisfying the Polyak-Łojasiewicz condition (see As. 2.1), i.e., a point $\hat{x}$ such that $\mathbb{E}[f(\hat{x}) - f(x^*)] \le \varepsilon$. Dependencies on the numerical constants and $\log(1/\varepsilon)$ factors are omitted, and all smoothness constants are denoted by $L$ in the complexity bounds. Abbreviations: PP = partial participation; communication complexity = the number of communication rounds needed to find an $\varepsilon$-solution; oracle complexity = the number of (stochastic) first-order oracle calls needed to find an $\varepsilon$-solution. Notation: $\omega$ = the quantization parameter (see Def. 1.1); $n$ = the number of nodes; $m$ = the size of the local dataset; $r$ = the (expected) number of clients sampled at each iteration; $b'$ = the batch size for VR-MARINA at the iterations with compressed communication. To simplify the bounds, we assume that the expected density $\zeta_Q$ of the quantization operator $Q$ (see Def. 1.1) satisfies $\omega + 1 = \Theta(d/\zeta_Q)$ (e.g., this holds for RandK and $\ell_2$-quantization, see (Beznosikov et al., 2020)). We note that (Haddadpour et al., 2020) and (Das et al., 2020) also contain better rates under additional assumptions on the similarity of the clients' data.

(Table 2 body: communication and oracle complexity bounds under the PŁ condition for DIANA (Mishchenko et al., 2019; Li & Richtárik, 2020), VR-DIANA (Li & Richtárik, 2020), FedCOMGATE (Haddadpour et al., 2020), FedSTEPH (Das et al., 2020), and the new methods MARINA (Thm. 2.2 & Cor. C.2), VR-MARINA with $b' = 1$ and with $b' = \Theta(1/(n\mu\varepsilon))$ (Thms. D.2 & D.4, Cors. D.2 & D.4), and PP-MARINA (Thm. E.2 & Cor. E.2), across the full-gradient, finite-sum, online, and partial-participation setups.)

(1) The results for FedCOMGATE are derived under the assumption that for all vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ the quantization operator $Q$ satisfies $\mathbb{E}\left[\left\|\frac{1}{n}\sum_{j=1}^{n} Q(x_j) - Q\left(\frac{1}{n}\sum_{j=1}^{n} x_j\right)\right\|^2\right] \le G$ for some constant $G \ge 0$. In fact, this assumption does not hold for classical quantization operators like RandK and $\ell_2$-quantization on $\mathbb{R}^d$. Counterexample: $n = 2$ and $x_1 = x_2 = (t, t, \ldots, t)$ with arbitrarily large $t > 0$.
(2) One can further improve the communication complexity by increasing $b'$.
(3) No assumptions on the smoothness of the stochastic realizations $f_\xi(x)$ are used.
(4) The rate is derived under the assumption that $r = \Omega\big((1+\omega)\sqrt{L/\mu}\,\log(1/\varepsilon)\big)$.

1.3. Preliminaries

We will rely on two key assumptions throughout the text.

Assumption 1.1 (Uniform lower bound). There exists $f_* \in \mathbb{R}$ such that $f(x) \ge f_*$ for all $x \in \mathbb{R}^d$.

Assumption 1.2 ($L$-smoothness). We assume that $f_i$ is $L_i$-smooth for all $i \in [n] = \{1, 2, \ldots, n\}$, meaning that the following inequality holds for all $x, y \in \mathbb{R}^d$ and $i \in [n]$:

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|. \qquad (2)$$

This assumption implies that $f$ is $L_f$-smooth with $L_f^2 \le L^2 = \frac{1}{n}\sum_{i=1}^{n} L_i^2$.

Finally, we describe a large class of unbiased compression operators satisfying a certain variance bound, which we refer to in this paper by the name quantization.

Definition 1.1 (Quantization). We say that a stochastic mapping $Q: \mathbb{R}^d \to \mathbb{R}^d$ is a quantization operator/quantization if there exists $\omega \ge 0$ such that for any $x \in \mathbb{R}^d$ we have

$$\mathbb{E}[Q(x)] = x, \qquad \mathbb{E}\|Q(x) - x\|^2 \le \omega \|x\|^2. \qquad (3)$$

For a given quantization operator $Q$, we define the expected density as $\zeta_Q = \sup_{x \in \mathbb{R}^d} \mathbb{E}\left[\|Q(x)\|_0\right]$, where $\|y\|_0$ is the number of non-zero components of $y \in \mathbb{R}^d$. Notice that the expected density is well defined for any quantization operator since $\|Q(x)\|_0 \le d$.
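To make Definition 1.1 concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of the RandK sparsifier, which is unbiased with $\omega = d/K - 1$ and expected density $\zeta_Q = K$:

```python
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """RandK sparsification: keep k uniformly chosen coordinates, scaled by d/k.

    Unbiased, E[Q(x)] = x, with E||Q(x) - x||^2 = (d/k - 1) ||x||^2,
    i.e. omega = d/k - 1, and expected density zeta_Q = k (Definition 1.1).
    """
    d = x.size
    idx = rng.choice(d, size=k, replace=False)  # random support of size k
    q = np.zeros_like(x)
    q[idx] = (d / k) * x[idx]                   # rescaling keeps the estimator unbiased
    return q

# Quick empirical sanity check of unbiasedness and the variance identity behind (3):
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
samples = np.stack([rand_k(x, 10, rng) for _ in range(20000)])
print(np.abs(samples.mean(axis=0) - x).max())                    # close to 0: unbiasedness
emp_var = ((samples - x) ** 2).sum(axis=1).mean()
print(emp_var / ((100 / 10 - 1) * (x ** 2).sum()))                # close to 1: omega = d/k - 1
```

The same interface, a map $\mathbb{R}^d \to \mathbb{R}^d$ satisfying (3), is what the `quantize` callables assumed in the later sketches stand for.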
2. MARINA

In this section, we describe the main algorithm of this work: MARINA (see Algorithm 1). At each iteration of MARINA, each worker $i$ either sends to the server the dense vector $\nabla f_i(x^{k+1})$ with probability $p$, or it sends the quantized gradient difference $Q\big(\nabla f_i(x^{k+1}) - \nabla f_i(x^k)\big)$ with probability $1-p$. In the first situation, the server just averages the vectors received from the workers and gets $g^{k+1} = \nabla f(x^{k+1})$, whereas in the second case, the server averages the quantized differences from all workers and then adds the result to $g^k$ to get $g^{k+1}$. Moreover, if $Q$ is the identity quantization, i.e., $Q(x) = x$, then MARINA reduces to Gradient Descent (GD).

Algorithm 1 MARINA
1: Input: starting point $x^0$, stepsize $\gamma$, probability $p \in (0, 1]$, number of iterations $K$
2: Initialize $g^0 = \nabla f(x^0)$
3: for $k = 0, 1, \ldots, K-1$ do
4:   Sample $c_k \sim \mathrm{Be}(p)$
5:   Broadcast $g^k$ to all workers
6:   for $i = 1, \ldots, n$ in parallel do
7:     $x^{k+1} = x^k - \gamma g^k$
8:     Set $g_i^{k+1} = \nabla f_i(x^{k+1})$ if $c_k = 1$, and $g_i^{k+1} = g^k + Q\big(\nabla f_i(x^{k+1}) - \nabla f_i(x^k)\big)$ otherwise
9:   end for
10:  $g^{k+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{k+1}$
11: end for
12: Return: $\hat{x}^K$ chosen uniformly at random from $\{x^k\}_{k=0}^{K-1}$

However, for non-trivial quantizations, we have $\mathbb{E}[g^{k+1} \mid x^{k+1}] \neq \nabla f(x^{k+1})$, unlike all other distributed methods using exclusively unbiased compressors that we know of. That is, $g^{k+1}$ is a biased stochastic estimator of $\nabla f(x^{k+1})$. Nevertheless, MARINA is an example of a rare phenomenon in stochastic optimization when the bias of the stochastic gradient helps to achieve better complexity.
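To complement Algorithm 1, the following single-process sketch (our own simplification, not the authors' implementation; `grad_fns` and `quantize` are assumed helpers, with `quantize` any operator satisfying Definition 1.1) simulates one MARINA communication round:

```python
import numpy as np

def marina_round(x, g, grad_fns, quantize, gamma, p, rng):
    """One communication round of MARINA (Algorithm 1), simulated in a single process.

    x        : current iterate x^k
    g        : current gradient estimator g^k (initialize with g^0 = grad f(x^0))
    grad_fns : list of callables, grad_fns[i](x) = gradient of f_i at x
    quantize : unbiased compressor Q satisfying Definition 1.1
    Returns (x^{k+1}, g^{k+1}).
    """
    x_new = x - gamma * g                                  # line 7: all workers take the same step
    if rng.random() < p:                                   # line 4: c_k ~ Be(p); rare dense round
        g_new = np.mean([grad(x_new) for grad in grad_fns], axis=0)
    else:                                                  # usual round: compressed gradient differences
        diffs = [quantize(grad(x_new) - grad(x)) for grad in grad_fns]
        g_new = g + np.mean(diffs, axis=0)                 # lines 8 and 10: g^{k+1} = g^k + average of Q(...)
    return x_new, g_new
```

Iterating this round reproduces the estimator sequence of Algorithm 1; with `quantize` equal to the identity it collapses to plain gradient descent, as noted above, while with a RandK-style compressor only $K$ coordinates per worker are transmitted in the compressed rounds.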
2.1. Convergence Results for Generally Non-Convex Problems

We start with the following result.

Theorem 2.1. Let Assumptions 1.1 and 1.2 be satisfied. Then, after

$$K = O\!\left(\frac{\Delta_0 L}{\varepsilon^2}\left(1 + \sqrt{\frac{(1-p)\omega}{pn}}\right)\right)$$

iterations with $\Delta_0 = f(x^0) - f_*$, $L^2 = \frac{1}{n}\sum_{i=1}^{n} L_i^2$, and the stepsize $\gamma \le L^{-1}\left(1 + \sqrt{(1-p)\omega/(pn)}\right)^{-1}$, MARINA produces a point $\hat{x}^K$ for which $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$.

One can find the full statement of the theorem together with its proof in Section C.1 of the Appendix. The following corollary provides bounds on the number of iterations/communication rounds and estimates the total communication cost needed to achieve an $\varepsilon$-stationary point in expectation. For simplicity, throughout the paper we assume that the communication cost is proportional to the number of non-zero components of the vectors transmitted from the workers to the server.

Corollary 2.1. Let the assumptions of Theorem 2.1 hold and $p = \zeta_Q/d$. If $\gamma \le L^{-1}\left(1 + \sqrt{\omega(d - \zeta_Q)/(n\zeta_Q)}\right)^{-1}$, then MARINA requires

$$K = O\!\left(\frac{\Delta_0 L}{\varepsilon^2}\left(1 + \sqrt{\frac{\omega(d-\zeta_Q)}{n\zeta_Q}}\right)\right)$$

iterations/communication rounds in order to achieve $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$, and the expected total communication cost per worker is $O(d + \zeta_Q K)$.

Let us clarify the obtained result. First of all, if $\omega = 0$ (no quantization), then $\zeta_Q = d$ and the rate coincides with the rate of Gradient Descent (GD). Since GD is optimal among first-order methods in terms of reducing the norm of the gradient (Carmon et al., 2019), the dependence on $\varepsilon$ in our bound cannot be improved in general. Next, if $n$ is large enough, i.e., $n \ge \omega(d/\zeta_Q - 1)$ (for $\ell_2$-quantization this requirement is satisfied when $n \ge d$), then the iteration complexities of MARINA (a method with compressed communication) and GD (a method with dense communication) coincide. This means that in this regime, MARINA is able to reach a provably better communication complexity than GD!

2.2. Convergence Results Under the Polyak-Łojasiewicz Condition

In this section, we provide complexity bounds for MARINA under the Polyak-Łojasiewicz (PŁ) condition.

Assumption 2.1 (PŁ condition). The function $f$ satisfies the Polyak-Łojasiewicz (PŁ) condition with parameter $\mu$, i.e.,

$$\|\nabla f(x)\|^2 \ge 2\mu\left(f(x) - f(x^*)\right) \qquad (4)$$

holds for $x^* = \arg\min_{x \in \mathbb{R}^d} f(x)$ and for all $x \in \mathbb{R}^d$.

Under this and the previously introduced assumptions, we derive the following result.

Theorem 2.2. Let Assumptions 1.1, 1.2, and 2.1 be satisfied. Then, after

$$K = O\!\left(\max\left\{\frac{1}{p},\ \frac{L}{\mu}\left(1 + \sqrt{\frac{2(1-p)\omega}{pn}}\right)\right\}\log\frac{\Delta_0}{\varepsilon}\right)$$

iterations with $\Delta_0 = f(x^0) - f(x^*)$, $L^2 = \frac{1}{n}\sum_{i=1}^{n} L_i^2$, and the stepsize $\gamma \le \min\left\{L^{-1}\left(1 + \sqrt{2(1-p)\omega/(pn)}\right)^{-1},\ p(2\mu)^{-1}\right\}$, MARINA produces a point $x^K$ for which $\mathbb{E}[f(x^K) - f(x^*)] \le \varepsilon$.

One can find the full statement of the theorem together with its proof in Section C.2 of the Appendix.

3. Variance Reduction

Throughout this section, we assume that the local loss on each node has either a finite-sum form (finite-sum case),

$$f_i(x) = \frac{1}{m}\sum_{j=1}^{m} f_{ij}(x), \qquad (5)$$

or an expectation form (online case),

$$f_i(x) = \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[f_{\xi_i}(x)\right]. \qquad (6)$$

3.1. Finite-Sum Case

In this section, we generalize MARINA to problems of the form (1)+(5), obtaining VR-MARINA (see Algorithm 2).

Algorithm 2 VR-MARINA: finite-sum case
1: Input: starting point $x^0$, stepsize $\gamma$, minibatch size $b'$, probability $p \in (0, 1]$, number of iterations $K$
2: Initialize $g^0 = \nabla f(x^0)$
3: for $k = 0, 1, \ldots, K-1$ do
4:   Sample $c_k \sim \mathrm{Be}(p)$
5:   Broadcast $g^k$ to all workers
6:   for $i = 1, \ldots, n$ in parallel do
7:     $x^{k+1} = x^k - \gamma g^k$
8:     Set $g_i^{k+1} = \nabla f_i(x^{k+1})$ if $c_k = 1$, and $g_i^{k+1} = g^k + Q\Big(\frac{1}{b'}\sum_{j \in I'_{i,k}}\big(\nabla f_{ij}(x^{k+1}) - \nabla f_{ij}(x^k)\big)\Big)$ otherwise, where $I'_{i,k}$ is the set of the indices in the minibatch, $|I'_{i,k}| = b'$
9:   end for
10:  $g^{k+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{k+1}$
11: end for
12: Return: $\hat{x}^K$ chosen uniformly at random from $\{x^k\}_{k=0}^{K-1}$

At each iteration of VR-MARINA, with probability $p$ the devices compute the full gradients $\nabla f_i(x^{k+1})$ and send them to the server. Typically, $p \approx 1/m$ and $m$ is large, meaning that the workers compute full gradients rarely (once per $m$ iterations in expectation). At the other iterations, the workers compute minibatch stochastic gradients evaluated at the current and previous points, compress their difference using an unbiased compression operator, i.e., a quantization operator, and send the resulting vectors $g_i^{k+1} - g^k$ to the server. Moreover, if $Q$ is the identity quantization, i.e., $Q(x) = x$, and $n = 1$, then VR-MARINA reduces to the optimal method PAGE (Li et al., 2020).

In this part, we will rely on the following average smoothness assumption.

Assumption 3.1 (Average $\mathcal{L}$-smoothness). For all $k \ge 0$ and $i \in [n]$, the minibatch stochastic gradient difference $\widehat{\Delta}_i^k = \frac{1}{b'}\sum_{j \in I'_{i,k}}\big(\nabla f_{ij}(x^{k+1}) - \nabla f_{ij}(x^k)\big)$ computed on the $i$-th worker satisfies

$$\mathbb{E}\big[\widehat{\Delta}_i^k \mid x^k, x^{k+1}\big] = \Delta_i^k \quad \text{and} \quad \mathbb{E}\big[\|\widehat{\Delta}_i^k - \Delta_i^k\|^2 \mid x^k, x^{k+1}\big] \le \frac{\mathcal{L}_i^2}{b'}\|x^{k+1} - x^k\|^2 \qquad (7)$$

with some $\mathcal{L}_i \ge 0$, where $\Delta_i^k = \nabla f_i(x^{k+1}) - \nabla f_i(x^k)$.

This assumption is satisfied in many standard minibatch regimes. In particular, if $I'_{i,k} = \{1, \ldots, m\}$, then $\mathcal{L}_i = 0$, and if $I'_{i,k}$ consists of $b'$ i.i.d. samples from the uniform distribution on $\{1, \ldots, m\}$ and the $f_{ij}$ are $L_{ij}$-smooth, then $\mathcal{L}_i \le \max_{j \in [m]} L_{ij}$. Under this and the previously introduced assumptions, we derive the following result.

Theorem 3.1. Consider the finite-sum case (1)+(5). Let Assumptions 1.1, 1.2, and 3.1 be satisfied. Then, after

$$K = O\!\left(\frac{\Delta_0}{\varepsilon^2}\left(L + \sqrt{\frac{1-p}{pn}\left(\omega L^2 + \frac{(1+\omega)\mathcal{L}^2}{b'}\right)}\right)\right)$$

iterations with $\Delta_0 = f(x^0) - f_*$, $L^2 = \frac{1}{n}\sum_{i=1}^{n} L_i^2$, $\mathcal{L}^2 = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_i^2$, and the stepsize $\gamma \le \left(L + \sqrt{\big(\omega L^2 + (1+\omega)\mathcal{L}^2/b'\big)(1-p)/(pn)}\right)^{-1}$, VR-MARINA produces a point $\hat{x}^K$ such that $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$.

One can find the full statement of the theorem together with its proof in Section D.1.1 of the Appendix.

Corollary 3.1. Let the assumptions of Theorem 3.1 hold and $p = \min\{\zeta_Q/d,\ b'/(m + b')\}$, where $b' \le m$. If $\gamma \le \left(L + \sqrt{\big(\omega L^2 + (1+\omega)\mathcal{L}^2/b'\big)\max\{d/\zeta_Q - 1,\ m/b'\}/n}\right)^{-1}$, then VR-MARINA requires

$$K = O\!\left(\frac{\Delta_0}{\varepsilon^2}\left(L + \sqrt{\frac{\max\{d/\zeta_Q - 1,\ m/b'\}}{n}\left(\omega L^2 + \frac{(1+\omega)\mathcal{L}^2}{b'}\right)}\right)\right)$$

iterations/communication rounds and $O(m + b'K)$ stochastic oracle calls per node in expectation in order to achieve $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$, and the expected total communication cost per worker is $O(d + \zeta_Q K)$.
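As a small illustration of the compressed branch of line 8 in Algorithm 2 (again our own sketch, with `grad_ij` and `quantize` as assumed helpers), one worker's local estimator can be computed as follows:

```python
import numpy as np

def vr_marina_local_estimator(g, x_new, x_old, grad_ij, m, b_prime, quantize, rng):
    """Compressed branch of line 8 in Algorithm 2 for one worker (c_k = 0).

    g        : previous aggregated estimator g^k
    grad_ij  : callable, grad_ij(j, x) = gradient of f_{ij} at x
    m        : local dataset size; b_prime : minibatch size b'
    Returns g_i^{k+1} = g^k + Q(minibatch gradient difference).
    """
    batch = rng.choice(m, size=b_prime, replace=True)   # I'_{i,k}: b' i.i.d. uniform indices
    diff = np.mean([grad_ij(j, x_new) - grad_ij(j, x_old) for j in batch], axis=0)
    return g + quantize(diff)
```

When $c_k = 1$, the worker instead returns the full local gradient $\nabla f_i(x^{k+1})$; the server-side averaging is the same as in the MARINA sketch above, and the online variant of Section 3.2 simply replaces the indices $j$ by fresh samples $\xi_{ij}^k \sim \mathcal{D}_i$.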
Let us now discuss Corollary 3.1. First of all, when the workers quantize differences of the full gradients, then $I'_{i,k} = \{1, \ldots, m\}$ for all $i \in [n]$ and $k \ge 0$, implying $\mathcal{L} = 0$. In this case, the complexity bounds for VR-MARINA recover the ones for MARINA. Next, when $\omega = 0$ (no quantization) and $n = 1$, our bounds on the iteration and oracle complexities of VR-MARINA recover the bounds for PAGE (Li & Richtárik, 2020), which is optimal for finite-sum smooth non-convex optimization. This observation implies that the dependence on $\varepsilon$ and $m$ in the complexity bounds for VR-MARINA cannot be improved in the class of first-order stochastic methods. Next, we notice that, up to the differences in smoothness constants, the iteration and oracle complexities of VR-MARINA benefit from the number of workers $n$. Finally, as Table 1 shows, the rates for VR-MARINA are strictly better than the ones for the previous state-of-the-art method VR-DIANA (Horváth et al., 2019).

We provide the convergence results for VR-MARINA in the finite-sum case under the Polyak-Łojasiewicz condition, together with complete proofs, in Section D.1.2 of the Appendix.

3.2. Online Case

In this section, we focus on problems of type (1)+(6). For this type of problems, we consider a slightly modified version of VR-MARINA. That is, we replace line 8 in Algorithm 2 with the following update rule: $g_i^{k+1} = \frac{1}{b}\sum_{j \in I_{i,k}} \nabla f_{\xi_{ij}^k}(x^{k+1})$ if $c_k = 1$, and $g_i^{k+1} = g^k + Q\Big(\frac{1}{b'}\sum_{j \in I'_{i,k}}\big(\nabla f_{\xi_{ij}^k}(x^{k+1}) - \nabla f_{\xi_{ij}^k}(x^k)\big)\Big)$ otherwise, where $I_{i,k}$, $I'_{i,k}$ are the sets of the indices in the minibatches, $|I_{i,k}| = b$, $|I'_{i,k}| = b'$, and the $\xi_{ij}^k$ are sampled independently from $\mathcal{D}_i$ for each $i \in [n]$ (see Algorithm 3 in the Appendix). Before we provide our convergence results in this setup, we reformulate Assumption 3.1 for the online case.

Assumption 3.2 (Average $\mathcal{L}$-smoothness). For all $k \ge 0$ and $i \in [n]$, the minibatch stochastic gradient difference $\widehat{\Delta}_i^k = \frac{1}{b'}\sum_{j \in I'_{i,k}}\big(\nabla f_{\xi_{ij}^k}(x^{k+1}) - \nabla f_{\xi_{ij}^k}(x^k)\big)$ computed on the $i$-th worker satisfies

$$\mathbb{E}\big[\widehat{\Delta}_i^k \mid x^k, x^{k+1}\big] = \Delta_i^k \quad \text{and} \quad \mathbb{E}\big[\|\widehat{\Delta}_i^k - \Delta_i^k\|^2 \mid x^k, x^{k+1}\big] \le \frac{\mathcal{L}_i^2}{b'}\|x^{k+1} - x^k\|^2 \qquad (8)$$

with some $\mathcal{L}_i \ge 0$, where $\Delta_i^k = \nabla f_i(x^{k+1}) - \nabla f_i(x^k)$.

Moreover, we assume that the variance of the stochastic gradients on all nodes is uniformly upper bounded.

Assumption 3.3. We assume that for all $i \in [n]$ there exists a constant $\sigma_i \in [0, +\infty)$ such that for all $x \in \mathbb{R}^d$

$$\mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[\nabla f_{\xi_i}(x)\right] = \nabla f_i(x), \qquad (9)$$

$$\mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[\|\nabla f_{\xi_i}(x) - \nabla f_i(x)\|^2\right] \le \sigma_i^2. \qquad (10)$$

Under these and the previously introduced assumptions, we derive the following result.

Theorem 3.2. Consider the online case (1)+(6). Let Assumptions 1.1, 1.2, 3.2, and 3.3 be satisfied. Then, after

$$K = O\!\left(\frac{\Delta_0}{\varepsilon^2}\left(L + \sqrt{\frac{1-p}{pn}\left(\omega L^2 + \frac{(1+\omega)\mathcal{L}^2}{b'}\right)}\right)\right)$$

iterations with $\Delta_0 = f(x^0) - f_*$, $L^2 = \frac{1}{n}\sum_{i=1}^{n} L_i^2$, $\mathcal{L}^2 = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_i^2$, the stepsize $\gamma \le \left(L + \sqrt{\big(\omega L^2 + (1+\omega)\mathcal{L}^2/b'\big)(1-p)/(pn)}\right)^{-1}$, and $b = \Theta\big(\sigma^2/(n\varepsilon^2)\big)$, where $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2$, VR-MARINA produces a point $\hat{x}^K$ for which $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$.

One can find the full statement of the theorem, together with its proof, in Section D.2.1 of the Appendix.

Corollary 3.2. Let the assumptions of Theorem 3.2 hold and choose $p = \min\{\zeta_Q/d,\ b'/(b + b')\}$, where $b' \le b$ and $b = \Theta\big(\sigma^2/(n\varepsilon^2)\big)$. If $\gamma \le \left(L + \sqrt{\big(\omega L^2 + (1+\omega)\mathcal{L}^2/b'\big)\max\{d/\zeta_Q - 1,\ b/b'\}/n}\right)^{-1}$, then VR-MARINA requires

$$K = O\!\left(\frac{\Delta_0}{\varepsilon^2}\left(L + \sqrt{\frac{1}{n}\max\left\{\frac{d}{\zeta_Q} - 1,\ \frac{\sigma^2}{n b' \varepsilon^2}\right\}\left(\omega L^2 + \frac{(1+\omega)\mathcal{L}^2}{b'}\right)}\right)\right)$$

iterations/communication rounds and $O\big(\zeta_Q K + \sigma^2/(n\varepsilon^2)\big)$ stochastic oracle calls per node in expectation to achieve $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$, and the expected total communication cost per worker is $O(d + \zeta_Q K)$.

Similarly to the finite-sum case, when $\omega = 0$ (no quantization) and $n = 1$, our bounds on the iteration and oracle complexities of VR-MARINA recover the bounds for PAGE (Li & Richtárik, 2020), which is optimal for online smooth non-convex optimization as well. That is, the dependence on $\varepsilon$ in the complexity bound for VR-MARINA cannot be improved in the class of first-order stochastic methods. As previously, up to the differences in smoothness constants, the iteration and oracle complexities of VR-MARINA benefit from an increase in the number of workers $n$.
We provide the convergence results for VR-MARINA in the online case under the Polyak-Łojasiewicz condition, together with complete proofs, in Section D.2.2 of the Appendix.

Figure 1 (plots of $\|\nabla f(x^k)\|^2$ against communication rounds, #bits/$n$, and number of epochs): Comparison of MARINA with DIANA, and of VR-MARINA with VR-DIANA, on a binary classification problem involving a non-convex loss (see Section 5.1) with LibSVM data (Chang & Lin, 2011). The parameter $n$ is chosen as per Tbl. 3 in the Appendix. Stepsizes for the methods are chosen according to the theory, and the batch sizes for VR-MARINA and VR-DIANA are $m/100$. In all cases, we used the RandK sparsification operator with $K \in \{1, 5, 10\}$.

4. Partial Participation

Finally, we propose another modification of MARINA. In particular, we provide an option for partial participation of the clients, a feature important in federated learning. The resulting method is called PP-MARINA (see Algorithm 4 in the Appendix). At each iteration of PP-MARINA, the server receives quantized gradient differences from $r$ clients with probability $1-p$, and aggregates full gradients from all clients with probability $p$. That is, PP-MARINA coincides with MARINA up to the following difference: $g_i^{k+1} = \nabla f_i(x^{k+1})$ and $g^{k+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{k+1}$ if $c_k = 1$, and $g_i^{k+1} = g^k + Q\big(\nabla f_i(x^{k+1}) - \nabla f_i(x^k)\big)$ with $g^{k+1} = \frac{1}{r}\sum_{i \in I^k} g_i^{k+1}$ otherwise, where $I^k$ is the set of $r$ i.i.d. samples from the uniform distribution over $\{1, \ldots, n\}$. That is, if the probability $p$ is chosen small enough, then with high probability the server receives only quantized vectors from a subset of the clients at each iteration. Below, we provide a convergence result for PP-MARINA for smooth non-convex problems.

Theorem 4.1. Let Assumptions 1.1 and 1.2 be satisfied. Then, after

$$K = O\!\left(\frac{\Delta_0 L}{\varepsilon^2}\left(1 + \sqrt{\frac{(1-p)(1+\omega)}{pr}}\right)\right)$$

iterations with $\Delta_0 = f(x^0) - f_*$, $L^2 = \frac{1}{n}\sum_{i=1}^{n} L_i^2$, and the stepsize $\gamma \le L^{-1}\left(1 + \sqrt{(1-p)(1+\omega)/(pr)}\right)^{-1}$, PP-MARINA produces a point $\hat{x}^K$ for which $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$.

One can find the full statement of the theorem together with its proof in Section E.1 of the Appendix.

Corollary 4.1. Let the assumptions of Theorem 4.1 hold and choose $p = \zeta_Q r/(dn)$, where $r \le n$. If $\gamma \le L^{-1}\left(1 + \sqrt{(1+\omega)(dn - \zeta_Q r)/(\zeta_Q r^2)}\right)^{-1}$, then PP-MARINA requires $K = O\big(\Delta_0/(\gamma\varepsilon^2)\big)$ iterations/communication rounds to achieve $\mathbb{E}[\|\nabla f(\hat{x}^K)\|^2] \le \varepsilon^2$, and the expected total communication cost is $O(dn + \zeta_Q r K)$.

When $r = n$, i.e., when all clients participate in the communication with the server at each iteration, the rate of PP-MARINA recovers the rate of MARINA under the assumption that $(1+\omega)(d/\zeta_Q - 1) = O\big(\omega(d/\zeta_Q - 1)\big)$, which holds for a wide class of quantization operators, e.g., for the identity quantization, RandK, and $\ell_p$-quantization.
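Algorithm 4 itself is deferred to the Appendix, but the server-side update described above can be sketched as follows (our own single-process illustration; `grad_fns` and `quantize` are assumed helpers):

```python
import numpy as np

def pp_marina_aggregate(g, x_new, x_old, grad_fns, quantize, p, r, rng):
    """Server-side estimator update of PP-MARINA, simulated in a single process.

    With probability p the server averages dense gradients from all n clients;
    otherwise it averages compressed gradient differences from r sampled clients.
    """
    n = len(grad_fns)
    if rng.random() < p:                                   # rare full-participation round
        return np.mean([grad(x_new) for grad in grad_fns], axis=0)
    clients = rng.choice(n, size=r, replace=True)          # I^k: r i.i.d. uniform client indices
    updates = [g + quantize(grad_fns[i](x_new) - grad_fns[i](x_old)) for i in clients]
    return np.mean(updates, axis=0)                        # g^{k+1} = (1/r) * sum over sampled clients
```

The iterate update $x^{k+1} = x^k - \gamma g^k$ and the broadcast of $g^k$ are the same as in the MARINA sketch above; only the aggregation step changes.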
In general, the complexity derived in Corollary 4.1 is strictly better than the previous state-of-the-art one (see Table 1). We provide the convergence results for PP-MARINA under the Polyak-Łojasiewicz condition, together with complete proofs, in Section E.2 of the Appendix.

5. Numerical Experiments

5.1. Binary Classification with Non-Convex Loss

We conduct several numerical experiments on a binary classification problem involving a non-convex loss (Zhao et al., 2010) (used for two-layer neural networks) with LibSVM data (Chang & Lin, 2011) to justify the theoretical claims of the paper. (Our code is available at https://github.com/burlachenkok/marina.) That is, we consider the following optimization problem:

$$\min_{x \in \mathbb{R}^d} \frac{1}{N}\sum_{t=1}^{N} \ell\big(a_t^\top x, y_t\big),$$

where $\{a_t\}_{t=1}^{N} \subset \mathbb{R}^d$, $y_t \in \{-1, 1\}$ for all $t = 1, \ldots, N$, and the function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is defined as

$$\ell(b, c) = 1 - \frac{1}{1 + \exp(-bc)}.$$

The distributed environment is simulated in Python 3.8 using MPI4PY and other standard libraries. Additional details about the experimental setup, together with extra experiments, are deferred to Section A of the Appendix. In our experiments, we compare MARINA with the full-batch version of DIANA, and then VR-MARINA with VR-DIANA. We exclude FedCOMGATE and FedSTEPH from this comparison since they have significantly worse oracle complexities (see Table 1). The results are presented in Fig. 1. As our theory predicts, the first row shows the superiority of MARINA over DIANA both in terms of iteration/communication complexity and in terms of the total number of transmitted bits needed to achieve the given accuracy. Next, to study the oracle complexity as well, we consider the non-full-batch methods VR-MARINA and VR-DIANA, since they have better oracle complexity than the full-batch methods in the finite-sum case. Again, the results presented in the second row show that VR-MARINA outperforms VR-DIANA in terms of oracle complexity and the total number of transmitted bits needed to achieve the given accuracy.
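A minimal NumPy sketch of the loss above and its gradient may be useful for reproducing this setup; it is our own illustration (the function name is hypothetical, and the form of $\ell$ follows the formula as reconstructed above):

```python
import numpy as np

def nonconvex_loss_and_grad(x, A, y):
    """Loss of Section 5.1, ell(b, c) = 1 - 1/(1 + exp(-b c)), averaged over the data,
    together with its gradient in x.

    A : (N, d) array whose rows are the features a_t; y : (N,) labels in {-1, +1}.
    """
    margins = y * (A @ x)                        # b * c for every sample, b = a_t^T x, c = y_t
    s = 1.0 / (1.0 + np.exp(-margins))           # sigmoid of the margin
    loss = np.mean(1.0 - s)
    # d/dx [1 - sigmoid(y a^T x)] = -s (1 - s) * y * a, since sigmoid' = s (1 - s)
    grad = -(A * (s * (1.0 - s) * y)[:, None]).mean(axis=0)
    return loss, grad
```

Splitting the rows of $A$ across $n$ workers yields local losses $f_i$ of the form (1), whose gradients can be fed to the MARINA and VR-MARINA sketches above.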
5.2. Image Classification

We also compare the performance of VR-MARINA and VR-DIANA on training ResNet-18 (He et al., 2016) on the CIFAR100 (Krizhevsky et al., 2009) dataset. Formally, the optimization problem is

$$\min_{x} \frac{1}{N}\sum_{i=1}^{N} \ell\big(p(f(a_i, x)), y_i\big),$$

where $\{(a_i, y_i)\}_{i=1}^{N}$ encode the images and labels from the CIFAR100 dataset, $f(a_i, x)$ is the output of ResNet-18 on image $a_i$ with weights $x$, $p$ is the softmax function, and $\ell(\cdot, \cdot)$ is the cross-entropy loss. The code is written in Python 3.9 using PyTorch 1.7, and the distributed environment is simulated. The results are presented in Fig. 2. Again, VR-MARINA converges significantly faster than VR-DIANA both in terms of the oracle complexity and in terms of the total number of transmitted bits needed to achieve the given accuracy. See other details and observations in Section A of the Appendix.

Figure 2 (training curves for ResNet-18 on CIFAR100 against communication rounds, #bits/$n$, and epochs): Comparison of VR-MARINA with VR-DIANA on training ResNet-18 on the CIFAR100 dataset. The number of workers equals 5. Stepsizes for the methods were tuned, and the batch sizes are $m/50$. In all cases, we used the RandK sparsification operator; the approximate values of $K$ are given in the legends ($d$ is the dimension of the problem).

Acknowledgements

The work of Peter Richtárik, Eduard Gorbunov, Konstantin Burlachenko, and Zhize Li was supported by the KAUST Baseline Research Fund. The paper was written while E. Gorbunov was a research intern at KAUST. The work of E. Gorbunov in Sections 1, 2, and C was also partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project No. 0714-2020-0005, and in Sections 3, 4, D, and E by RFBR, project number 19-31-51001. We thank Konstantin Mishchenko (KAUST) for a suggestion related to the experiments, Elena Bazanova (MIPT) for suggestions about improving the text, and Slavomír Hanzely (KAUST) and Egor Shulgin (KAUST) for spotting typos.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709-1720, 2017.
Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems, pp. 14668-14679, 2019.
Beznosikov, A., Horváth, S., Richtárik, P., and Safaryan, M. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
Bhojanapalli, S., Kyrillidis, A., and Sanghavi, S. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pp. 530-582. PMLR, 2016.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points I. Mathematical Programming, pp. 1-50, 2019.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.
Danilova, M., Dvurechensky, P., Gasnikov, A., Gorbunov, E., Guminov, S., Kamzolov, D., and Shibaev, I. Recent theoretical advances in non-convex optimization. arXiv preprint arXiv:2012.06188, 2020.
Das, R., Hashemi, A., Sanghavi, S., and Dhillon, I. S. Improved convergence rates for non-convex federated learning with compression. arXiv preprint arXiv:2012.04061, 2020.
Fang, C., Li, C., Lin, Z., and Zhang, T. Near-optimal non-convex optimization via stochastic path integrated differential estimator. Advances in Neural Information Processing Systems, 31:689, 2018.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
Gorbunov, E., Kovalev, D., Makarenko, D., and Richtárik, P. Linearly converging error compensated SGD. Advances in Neural Information Processing Systems, 33, 2020.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Haddadpour, F., Kamani, M. M., Mokhtari, A., and Mahdavi, M. Federated learning with compression: Unified analysis and sharp guarantees. arXiv preprint arXiv:2007.01154, 2020.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
Horváth, S., Ho, C.-Y., Horváth, Ľ., Sahu, A. N., Canini, M., and Richtárik, P. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.
Horváth, S., Kovalev, D., Mishchenko, K., Stich, S., and Richtárik, P. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. Error feedback fixes signSGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252-3261, 2019.
Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132-5143. PMLR, 2020.
Khanduri, P., Sharma, P., Kafle, S., Bulusu, S., Rajawat, K., and Varshney, P. K. Distributed stochastic non-convex optimization: Momentum-based variance reduction. arXiv preprint arXiv:2005.00224, 2020.
Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020a. URL https://arxiv.org/abs/1907.09356.
Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pp. 5381-5393. PMLR, 2020b.
Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
Konečný, J., McMahan, H. B., Yu, F., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Li, X., Yang, W., Wang, S., and Zhang, Z. Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126, 2019.
Li, Z. and Richtárik, P. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020.
Li, Z., Bao, H., Zhang, X., and Richtárik, P. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. arXiv preprint arXiv:2008.10898, 2020.
Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330-5340, 2017.
Łojasiewicz, S. A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles, 117:87-89, 1963.
Ma, C., Wang, K., Chi, Y., and Chen, Y. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pp. 3345-3354. PMLR, 2018.
McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
Mishchenko, K., Gorbunov, E., Takáč, M., and Richtárik, P. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
Murty, K. and Kabadi, S. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117-129, 1987.
Polyak, B. T. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864-878, 1963.
Qian, X., Richtárik, P., and Zhang, T. Error compensated distributed SGD can be accelerated. arXiv preprint arXiv:2010.00091, 2020.
Safaryan, M., Shulgin, E., and Richtárik, P. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. arXiv preprint arXiv:2002.08958, 2020.
Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
Sharma, P., Kafle, S., Khanduri, P., Bulusu, S., Rajawat, K., and Varshney, P. K. Parallel restarted SPIDER: Communication efficient distributed nonconvex optimization with optimal computation complexity. arXiv preprint arXiv:1912.06036, 2019.
Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed updates. Journal of Machine Learning Research, 21:1-36, 2020.
Sun, H., Lu, S., and Hong, M. Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In International Conference on Machine Learning, pp. 9217-9228. PMLR, 2020.
Sun, R. Optimization for deep learning: theory and algorithms. arXiv preprint arXiv:1912.08957, 2019.
Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning, 2017.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Syx4wnEtvH.
Zhao, L., Mammadov, M., and Yearwood, J. From convex to nonconvex: a loss function analysis for binary classification. In 2010 IEEE International Conference on Data Mining Workshops, pp. 1281-1288. IEEE, 2010.