# Local SGD Converges Fast and Communicates Little

Sebastian U. Stich, EPFL, Switzerland. sebastian.stich@epfl.ch
Published as a conference paper at ICLR 2019.

**Abstract.** Mini-batch stochastic gradient descent (SGD) is state of the art in large-scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is *local SGD*, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has so far eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and in the mini-batch size. Compared to mini-batch SGD, the number of communication rounds can be reduced by up to a factor of $T^{1/2}$, where $T$ denotes the total number of steps. This also holds for asynchronous implementations. Local SGD can also be used for large-scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.

## 1 INTRODUCTION

Stochastic gradient descent (SGD) (Robbins & Monro, 1951) consists of iterations of the form
$$x_{t+1} := x_t - \eta_t g_t\,, \tag{1}$$
for iterates (weights) $x_t, x_{t+1} \in \mathbb{R}^d$, stepsize (learning rate) $\eta_t > 0$, and stochastic gradient $g_t \in \mathbb{R}^d$ with the property $\mathbb{E}\, g_t = \nabla f(x_t)$, for a loss function $f \colon \mathbb{R}^d \to \mathbb{R}$. This scheme can easily be parallelized by replacing $g_t$ in (1) by an average of stochastic gradients that are independently computed in parallel on separate workers (parallel SGD). This simple scheme has a major drawback: in each iteration the results of the computations on the workers have to be shared with the other workers in order to compute the next iterate $x_{t+1}$. Communication has been reported to be a major bottleneck for many large-scale deep learning applications, see e.g. (Seide et al., 2014; Alistarh et al., 2017; Zhang et al., 2017; Lin et al., 2018b).

Mini-batch parallel SGD addresses this issue by increasing the compute-to-communication ratio: each worker computes a mini-batch of size $b \geq 1$ before communication. This scheme is implemented in state-of-the-art distributed deep learning frameworks (Abadi et al., 2016; Paszke et al., 2017; Seide & Agarwal, 2016). Recent work (You et al., 2017; Goyal et al., 2017) explores various limitations of this approach, as it is generally reported that performance degrades for too large mini-batch sizes (Keskar et al., 2016; Ma et al., 2018; Yin et al., 2018).

In this work we follow an orthogonal approach, still with the goal of increasing the compute-to-communication ratio: instead of increasing the mini-batch size, we reduce the communication frequency. Rather than keeping the sequences on different machines in sync, we allow them to evolve locally on each machine, independently of each other, and only average the sequences once in a while (*local SGD*). Such strategies have been explored widely in the literature, under various names.
An extreme instance of this concept is one-shot SGD (McDonald et al., 2009; Zinkevich et al., 2010), where the local sequences are only exchanged once, after the local runs have converged. Zhang et al. (2013) show statistical convergence (see also (Shamir & Srebro, 2014; Godichon-Baggioni & Saadane, 2017; Jain et al., 2018)), but the analysis restricts the algorithm to at most one pass over the data, which is in general not enough for the training error to converge. More practical are schemes that perform more frequent averaging of the parallel sequences, as e.g. (McDonald et al., 2010) for perceptron training (iterative parameter mixing), see also (Coppola, 2015), or (Zhang et al., 2014; Bijral et al., 2016; Zhang et al., 2016) for the training of deep neural networks (model averaging), or in federated learning (McMahan et al., 2017).

Figure 1: Illustration of the speedup (3) for time-to-accuracy when either increasing the mini-batch size b (1 → 2) or the communication interval H (1 → 2), for compute-to-communication ratio ρ = 25. (Left panel: communication overhead vs. speedup for b = 1, H = 1, with perfect speedup as reference; right panel: gain for b ∈ {1, 2}, H ∈ {1, 2}.)

The question of how often communication rounds need to be initiated has eluded a concise theoretical answer so far. Whilst there is practical evidence, the theory does not even resolve the question whether averaging helps when optimizing convex functions; concretely, whether running local SGD on K workers is K times faster than running just a single instance of SGD on one worker.¹

We fill this gap in the literature and provide a concise convergence analysis of local SGD. We show that averaging helps: frequent synchronization of K local sequences increases the convergence rate by a factor of K, i.e. a linear speedup can be attained. Thus, local SGD is as efficient as parallel mini-batch SGD in terms of computation, but the communication cost can be drastically reduced.

### 1.1 CONTRIBUTIONS

We consider finite-sum convex optimization problems $f\colon \mathbb{R}^d \to \mathbb{R}$ of the form
$$f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)\,, \qquad x^\star := \arg\min_{x \in \mathbb{R}^d} f(x)\,, \qquad f^\star := f(x^\star)\,, \tag{2}$$
where $f$ is $L$-smooth² and $\mu$-strongly convex³. We consider $K$ parallel mini-batch SGD sequences with mini-batch size $b$ that are synchronized (by averaging) after at most every $H$ iterations. For appropriately chosen stepsizes and an averaged iterate $\hat{x}_T$ after $T$ steps (for $T$ sufficiently large, see Section 3 below for the precise statement of the convergence result with bias and variance terms) and synchronization delay $H = O\big(\sqrt{T/(Kb)}\big)$, we show convergence
$$\mathbb{E}\, f(\hat{x}_T) - f^\star = O\!\left(\frac{G^2}{\mu K b T}\right)\,, \tag{3}$$
with second moment bound $G^2 \geq \mathbb{E}\,\|\nabla f_i(x)\|^2$. Thus, compared to parallel mini-batch SGD, the communication rounds can be reduced by a factor $H = O\big(\sqrt{T/(Kb)}\big)$ without hampering the asymptotic convergence. Equation (3) shows perfect linear speedup in terms of computation, but with much less communication than mini-batch SGD. The resulting speedup when taking communication cost into account is illustrated in Figure 1 (see also Section D below).

Under the assumption that (3) is tight, one thus now has two strategies to improve the compute-to-communication ratio (denoted by ρ): (i) increase the mini-batch size b, or (ii) increase the communication interval H. Both strategies give the same improvement when b and H are small (linear speedup).
Like mini-batch SGD, which faces some limitations for large mini-batch sizes b (as discussed e.g. in (Dekel et al., 2012; Ma et al., 2018; Yin et al., 2018)), the parameter H cannot be chosen too large in local SGD. We give some practical guidelines in Section 4.

Our proof is simple and straightforward, and we imagine that with slight modifications the technique can also be used to analyze other variants of SGD that evolve sequences on different workers that are not perfectly synchronized. Although we do not yet provide convergence guarantees for the non-convex setting, we feel that the positive results presented here will spark further investigation of local SGD for this important application (see e.g. (Yu et al., 2018)).

¹ On convex functions, the average of the K local solutions can of course only decrease the objective value, but convexity does not imply that the averaged point is K times better.
² $f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$, $\forall\, x, y \in \mathbb{R}^d$.
³ $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$, $\forall\, x, y \in \mathbb{R}^d$.

### 1.2 RELATED WORK

A parallel line of work reduces the communication cost by compressing the stochastic gradients before communication, for instance by limiting the number of bits in the floating point representation (Gupta et al., 2015; Na et al., 2017; Sa et al., 2015) or by random quantization (Alistarh et al., 2017; Wen et al., 2017). The ZipML framework applies this technique also to the data (Zhang et al., 2017). Sparsification methods reduce the number of non-zero entries in the stochastic gradient (Alistarh et al., 2017; Wangni et al., 2017). A very aggressive and promising sparsification method is to keep only very few coordinates of the stochastic gradient, namely the coordinates with the largest magnitudes (Seide et al., 2014; Strom, 2015; Dryden et al., 2016; Aji & Heafield, 2017; Sun et al., 2017; Lin et al., 2018b; Stich et al., 2018).

Allowing asynchronous updates provides an alternative solution to disguise the communication overhead to a certain amount (Niu et al., 2011; Sa et al., 2015; Lian et al., 2015), though alternative strategies might be better when high accuracy is desired (Chen et al., 2016). The analysis of Agarwal & Duchi (2011) shows that asynchronous SGD on convex functions can tolerate delays up to $O\big(\sqrt{T/K}\big)$, which is identical to the maximal length of the local sequences in local SGD. Asynchronous SGD converges also for larger delays (see also (Zhou et al., 2018)), but without linear speedup; a similar statement holds for local SGD (see the discussion in Section 3). The current frameworks for the analysis of asynchronous SGD do not cover local SGD. A fundamental difference is that asynchronous SGD maintains an (almost) synchronized sequence and gradients are computed with respect to this unique sequence (but just applied with delays), whereas each worker in local SGD evolves a different sequence and computes gradients with respect to those iterates.

For the training of deep neural networks, Bijral et al. (2016) discuss a stochastic averaging schedule, whereas Zhang et al. (2016) study local SGD with more frequent communication at the beginning of the optimization process. The elastic averaging technique (Zhang et al., 2015) differs from local SGD, as it uses the average of the iterates only to guide the local sequences but does not perform a hard reset after averaging. Among the first theoretical studies of local SGD in the non-convex setting are (Coppola, 2015; Zhou & Cong, 2018), which did not establish a speedup, in contrast to two more recent analyses (Yu et al., 2018; Wang & Joshi, 2018).
Yu et al. (2018) show linear speedup of local SGD on non-convex functions for $H = O\big(T^{1/4} K^{-3/4}\big)$, which is more restrictive than the constraint on H in the convex setting. Lin et al. (2018a) empirically study hierarchical variants of local SGD.

Local SGD with averaging in every step, i.e. H = 1, is identical to mini-batch SGD. Dekel et al. (2012) show that batch sizes $b = T^\delta$, for $\delta \in (0, \tfrac{1}{2})$, are asymptotically optimal for mini-batch SGD; however, they also note that this asymptotic bound might be crude for practical purposes. Similar considerations might also apply to the asymptotic upper bounds on the communication frequency H derived here. Local SGD with averaging only at the end, i.e. H = T, is identical to one-shot SGD. Jain et al. (2018) give concise speedup results in terms of bias and variance for one-shot SGD with constant stepsizes for the optimization of quadratic least squares problems. In contrast, our upper bounds become loose when $H \to T$ and our results do not cover one-shot SGD. Recently, Woodworth et al. (2018) provided a lower bound for parallel stochastic optimization (in the convex setting, and not for strongly convex functions as considered here). The bound is not known to be tight for local SGD.

### 1.3 OUTLINE

We formally introduce local SGD in Section 2 and sketch the convergence proof in Section 3. In Section 4 we show numerical results to illustrate the result. We analyze asynchronous local SGD in Section 5. The proofs of the technical results, further discussion about the experimental setup, and implementation guidelines are deferred to the appendix.

## 2 LOCAL SGD

**Algorithm 1** LOCAL SGD
1: Initialize variables $x_0^k = x_0$ for workers $k \in [K]$
2: for $t$ in $0 \dots T-1$ do
3: parallel for $k \in [K]$ do
4: Sample $i_t^k$ uniformly in $[n]$
5: if $t+1 \in \mathcal{I}_T$ then
6: $x_{t+1}^k \leftarrow \frac{1}{K}\sum_{k=1}^K \big(x_t^k - \eta_t \nabla f_{i_t^k}(x_t^k)\big)$ (global synchronization)
7: else
8: $x_{t+1}^k \leftarrow x_t^k - \eta_t \nabla f_{i_t^k}(x_t^k)$ (local update)
9: end if
10: end parallel for
11: end for

Local SGD (depicted in Algorithm 1) generates in parallel $K$ sequences $\{x_t^k\}_{t=0}^T$ of iterates, $k \in [K]$. Here $K$ denotes the level of parallelization, i.e. the number of distinct parallel sequences, and $T$ the number of steps (i.e. the total number of stochastic gradient evaluations is $TK$). Let $\mathcal{I}_T \subseteq [T]$ with $T \in \mathcal{I}_T$ denote a set of synchronization indices. Then local SGD evolves the sequences $\{x_t^k\}_{t=0}^T$ in the following way:
$$x_{t+1}^k = \begin{cases} x_t^k - \eta_t \nabla f_{i_t^k}(x_t^k)\,, & \text{if } t+1 \notin \mathcal{I}_T\,,\\[2pt] \frac{1}{K}\sum_{k=1}^K \big(x_t^k - \eta_t \nabla f_{i_t^k}(x_t^k)\big)\,, & \text{if } t+1 \in \mathcal{I}_T\,, \end{cases}\tag{4}$$
where the indices $i_t^k$ are chosen uniformly at random in $[n]$ and $\{\eta_t\}_{t\geq 0}$ denotes a sequence of stepsizes. If $\mathcal{I}_T = [T]$, then the synchronization of the sequences is performed in every iteration. In this case, (4) amounts to parallel or mini-batch SGD with mini-batch size $K$.⁴ On the other extreme, if $\mathcal{I}_T = \{T\}$, the synchronization only happens at the end, which is known as one-shot averaging.

In order to measure the longest interval between subsequent synchronization steps, we introduce the gap of a set of integers.

**Definition 2.1** (gap). The gap of a set $P := \{p_0, \dots, p_t\}$ of $t+1$ integers, $p_i \leq p_{i+1}$ for $i = 0, \dots, t-1$, is defined as $\mathrm{gap}(P) := \max_{i=1,\dots,t} (p_i - p_{i-1})$.

⁴ For the ease of presentation, we assume here that each worker in local SGD only processes a mini-batch of size b = 1. This can be done without loss of generality, as we discuss later in Remark 2.4.
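The following minimal sketch (illustrative code, not the author's implementation; the toy least-squares objective, stepsize and helper names are assumptions) implements update rule (4) for an arbitrary synchronization index set $\mathcal{I}_T$ and checks its gap in the sense of Definition 2.1.

```python
import numpy as np

# Illustrative implementation of update rule (4): K workers take local SGD
# steps and average their iterates whenever t+1 lies in the synchronization
# index set I_T. The toy least-squares objective is an assumption.
rng = np.random.default_rng(0)
d, K, T, eta = 20, 4, 256, 0.05
A = rng.standard_normal((500, d))
y = A @ rng.standard_normal(d)

def stochastic_grad(x):
    i = rng.integers(len(A))
    return (A[i] @ x - y[i]) * A[i]          # gradient of 0.5*(a_i^T x - y_i)^2

def gap(P):
    """gap(P) as in Definition 2.1: largest distance of consecutive indices."""
    p = sorted(P)
    return max(b - a for a, b in zip(p, p[1:]))

H = 16
I_T = set(range(H, T + 1, H)) | {T}          # synchronize every H steps (and at the end)
assert gap(I_T | {0}) <= H

xs = np.zeros((K, d))                        # one row per local sequence x^k_t
for t in range(T):
    for k in range(K):
        xs[k] -= eta * stochastic_grad(xs[k])
    if (t + 1) in I_T:                       # global synchronization step
        xs[:] = xs.mean(axis=0)

print("residual of averaged iterate:", np.linalg.norm(A @ xs.mean(axis=0) - y))
```

Choosing $\mathcal{I}_T = \{1, 2, \dots, T\}$ in this sketch recovers mini-batch SGD with batch size $K$, and $\mathcal{I}_T = \{T\}$ recovers one-shot averaging.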
### 2.1 VARIANCE REDUCTION IN LOCAL SGD

Before jumping to the convergence result, we first discuss an important observation.

**Parallel (mini-batch) SGD.** For carefully chosen stepsizes $\eta_t$, SGD converges at rate $O\big(\tfrac{\sigma^2}{\mu T}\big)$⁵ on strongly convex and smooth functions $f$, where $\sigma^2 \geq \mathbb{E}\,\|\nabla f_{i_t^k}(x_t^k) - \nabla f(x_t^k)\|^2$ for $t > 0$, $k \in [K]$ is an upper bound on the variance, see for instance (Zhao & Zhang, 2015). By averaging $K$ stochastic gradients, as in parallel SGD, the variance decreases by a factor of $K$, and we conclude that parallel SGD converges at rate $O\big(\tfrac{\sigma^2}{\mu T K}\big)$, i.e. it achieves a linear speedup.

**Towards local SGD.** For local SGD such a simple argument is elusive. For instance, just capitalizing on the convexity of the objective function $f$ is not enough: this only shows that the averaged iterate of $K$ independent SGD sequences converges at rate $O\big(\tfrac{\sigma^2}{\mu T}\big)$, i.e. no speedup can be shown in this way. This indicates that one has to show that local SGD decreases the variance $\sigma^2$ instead, similar as in parallel SGD. Suppose the different sequences $x_t^k$ evolve close to each other. Then it is reasonable to assume that averaging the stochastic gradients $\nabla f_{i_t^k}(x_t^k)$ for all $k \in [K]$ can still yield a reduction in the variance by a factor of $K$, similar as in parallel SGD. Indeed, we will make this statement precise in the proof below.

⁵ For the ease of presentation, we here assume that the bias term is negligible compared to the variance term.

### 2.2 CONVERGENCE RESULT AND DISCUSSION

**Theorem 2.2.** Let $f$ be $L$-smooth and $\mu$-strongly convex, $\mathbb{E}_i\|\nabla f_i(x_t^k) - \nabla f(x_t^k)\|^2 \leq \sigma^2$ and $\mathbb{E}_i\|\nabla f_i(x_t^k)\|^2 \leq G^2$ for $t = 0, \dots, T-1$, where $\{x_t^k\}_{t=0}^T$ for $k \in [K]$ are generated according to (4) with $\mathrm{gap}(\mathcal{I}_T) \leq H$ and with stepsizes $\eta_t = \frac{4}{\mu(a+t)}$ with shift parameter $a > \max\{16\kappa, H\}$, for $\kappa = \frac{L}{\mu}$. Then
$$\mathbb{E}\, f(\hat{x}_T) - f^\star \leq \frac{\mu a^3}{2 S_T}\,\|x_0 - x^\star\|^2 + \frac{4T(T+2a)}{\mu K S_T}\,\sigma^2 + \frac{256\, T}{\mu^2 S_T}\,G^2 H^2 L\,, \tag{5}$$
where $\hat{x}_T = \frac{1}{K S_T}\sum_{k=1}^K\sum_{t=0}^{T-1} w_t x_t^k$, for $w_t = (a+t)^2$ and $S_T = \sum_{t=0}^{T-1} w_t \geq \frac{1}{3}T^3$.

We were not especially careful to optimize the constants (and the lower-order terms) in (5), so we now state the asymptotic result.

**Corollary 2.3.** Let $\hat{x}_T$ be defined as in Theorem 2.2, for parameter $a = \max\{16\kappa, H\}$. Then
$$\mathbb{E}\, f(\hat{x}_T) - f^\star = O\!\left(\frac{G^2}{\mu K T} + \frac{(\kappa + H)\,G^2}{\mu K T^2} + \frac{\kappa\, G^2 H^2}{\mu T^2} + \frac{(\kappa^3 + H^3)\,G^2}{\mu T^3}\right)\,. \tag{6}$$
For the last estimate we used $\mu\|x_0 - x^\star\| \leq 2G$ for $\mu$-strongly convex $f$, as derived in (Rakhlin et al., 2012, Lemma 2).

**Remark 2.4** (Mini-batch local SGD). So far we assumed that each worker only computes a single stochastic gradient. In mini-batch local SGD, each worker computes a mini-batch of size $b$ in each iteration. This reduces the variance by a factor of $b$, and thus Theorem 2.2 gives the convergence rate of mini-batch local SGD when $\sigma^2$ is replaced by $\frac{\sigma^2}{b}$.

We now state some consequences of equation (6). For the ease of the exposition we omit the dependency on $L$, $\mu$, $\sigma^2$ and $G^2$ below, but depict the dependency on the local mini-batch size $b$.

**Convergence rate.** For $T$ large enough and assuming $\sigma > 0$, the very first term is dominating in (6) and local SGD converges at rate $O(1/(KTb))$. That is, local SGD achieves a linear speedup in both the number of workers $K$ and the mini-batch size $b$.

**Global synchronization steps.** It needs to hold $H = O\big(\sqrt{T/(Kb)}\big)$ to get the linear speedup. This yields a reduction of the number of communication rounds by a factor $O\big(\sqrt{T/(Kb)}\big)$ compared to parallel mini-batch SGD, without hurting the convergence rate.

**Extreme cases.** We have not optimized the result for extreme settings of $H$, $K$, $L$ or $\sigma$. For instance, we do not recover convergence for one-shot averaging, i.e. the setting $H = T$ (though convergence for $H = o(T)$, but at a lower rate).
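The communication savings implied by the bound can be made concrete with a small helper (illustrative only; constants and the condition-number dependence hidden in (6) are dropped): for given $T$, $K$ and $b$ it returns the largest admissible communication interval $H \approx \sqrt{T/(Kb)}$ and the resulting number of synchronization rounds $T/H$, to be compared with the $T$ rounds of mini-batch SGD.

```python
import math

# Illustrative helper (constants and the condition-number dependence of (6)
# are dropped): largest communication interval H that still allows linear
# speedup, and the resulting number of communication rounds.
def max_interval_and_rounds(T: int, K: int, b: int = 1):
    H = max(1, int(math.sqrt(T / (K * b))))
    return H, math.ceil(T / H)

for T, K, b in [(10**5, 16, 1), (10**6, 16, 4), (10**6, 128, 4)]:
    H, rounds = max_interval_and_rounds(T, K, b)
    print(f"T={T:>7}, K={K:>3}, b={b}:  H ~ {H:>3},  rounds ~ {rounds}  (mini-batch SGD: {T})")
```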
**Unknown time horizon / adaptive communication frequency.** Zhang et al. (2016) empirically observe that more frequent communication at the beginning of the optimization can help to reach a better time-to-accuracy (see also Lin et al. (2018a)). Indeed, when the total number of iterations $T$ is not known beforehand (as it e.g. depends on the target accuracy, cf. (6) and also Section 4 below), then increasing the communication interval $H$ over time seems to be a good strategy to keep the communication low, while still respecting the constraint $H = O\big(\sqrt{T/(Kb)}\big)$ for all $T$.

## 3 PROOF OUTLINE

We now give the outline of the proof. The proofs of the lemmas are given in Appendix A.

**Perturbed iterate analysis.** Inspired by the perturbed iterate framework of (Mania et al., 2017), we first define a virtual sequence $\{\bar{x}_t\}_{t\geq 0}$ in the following way:
$$\bar{x}_0 = x_0\,, \qquad \bar{x}_t = \frac{1}{K}\sum_{k=1}^K x_t^k\,, \tag{7}$$
where the sequences $\{x_t^k\}_{t\geq 0}$ for $k \in [K]$ are the same as in (4). Notice that this sequence never has to be computed explicitly; it is just a tool that we use in the analysis. Further notice that $\bar{x}_t = x_t^k$ for all $k \in [K]$ whenever $t \in \mathcal{I}_T$. Especially, when $\mathcal{I}_T = [T]$, then $\bar{x}_t \equiv x_t^k$ for every $k \in [K]$, $t \in [T]$. It will be useful to define
$$g_t := \frac{1}{K}\sum_{k=1}^K \nabla f_{i_t^k}(x_t^k)\,, \qquad \bar{g}_t := \frac{1}{K}\sum_{k=1}^K \nabla f(x_t^k)\,. \tag{8}$$
Observe that $\bar{x}_{t+1} = \bar{x}_t - \eta_t g_t$ and $\mathbb{E}\, g_t = \bar{g}_t$.

Now the proof proceeds as follows: we show (i) that the virtual sequence $\{\bar{x}_t\}_{t\geq 0}$ almost behaves like mini-batch SGD with batch size $K$ (Lemmas 3.1 and 3.2), and (ii) that the true iterates $\{x_t^k\}_{t\geq 0, k\in[K]}$ do not deviate much from the virtual sequence (Lemma 3.3). These are the main ingredients in the proof. To obtain the rate we exploit a technical lemma from (Stich et al., 2018).

**Lemma 3.1.** Let $\{x_t^k\}_{t\geq 0}$ for $k \in [K]$ and $\{\bar{x}_t\}_{t\geq 0}$ be defined as in (4) and (7), let $f$ be $L$-smooth and $\mu$-strongly convex, and $\eta_t \leq \frac{1}{4L}$. Then
$$\mathbb{E}\,\|\bar{x}_{t+1} - x^\star\|^2 \leq (1 - \mu\eta_t)\,\mathbb{E}\,\|\bar{x}_t - x^\star\|^2 + \eta_t^2\, \mathbb{E}\,\|g_t - \bar{g}_t\|^2 - \frac{1}{2}\eta_t\, \mathbb{E}\big(f(\bar{x}_t) - f^\star\big) + 2\eta_t \frac{L}{K}\sum_{k=1}^K \mathbb{E}\,\|\bar{x}_t - x_t^k\|^2\,. \tag{9}$$

**Bounding the variance.** From equation (9) it becomes clear that we should derive an upper bound on $\mathbb{E}\,\|g_t - \bar{g}_t\|^2$. We will relate this to the variance $\sigma^2$.

**Lemma 3.2.** Let $\sigma^2 \geq \mathbb{E}_i\|\nabla f_i(x_t^k) - \nabla f(x_t^k)\|^2$ for $k \in [K]$, $t \in [T]$. Then $\mathbb{E}\,\|g_t - \bar{g}_t\|^2 \leq \frac{\sigma^2}{K}$.

**Bounding the deviation.** Further, we need to bound $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\,\|\bar{x}_t - x_t^k\|^2$. For this we impose a condition on $\mathcal{I}_T$ and an additional condition on the stepsizes $\eta_t$.

**Lemma 3.3.** If $\mathrm{gap}(\mathcal{I}_T) \leq H$ and the sequence of decreasing positive stepsizes $\{\eta_t\}_{t\geq 0}$ satisfies $\eta_t \leq 2\eta_{t+H}$ for all $t \geq 0$, then
$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\,\|\bar{x}_t - x_t^k\|^2 \leq 4\,\eta_t^2 G^2 H^2\,, \tag{10}$$
where $G^2$ is a constant such that $\mathbb{E}_i\|\nabla f_i(x_t^k)\|^2 \leq G^2$ for $k \in [K]$, $t \in [T]$.

**Optimal averaging.** Similar as in (Lacoste-Julien et al., 2012; Shamir & Zhang, 2013; Rakhlin et al., 2012), we define a suitable averaging scheme for the iterates $\{\bar{x}_t\}_{t\geq 0}$ to get the optimal convergence rate. In contrast to (Lacoste-Julien et al., 2012), who use linearly increasing weights, we use quadratically increasing weights, as for instance (Shamir & Zhang, 2013; Stich et al., 2018).

**Lemma 3.4** ((Stich et al., 2018)). Let $\{a_t\}_{t\geq 0}$, $a_t \geq 0$, and $\{e_t\}_{t\geq 0}$, $e_t \geq 0$, be sequences satisfying
$$a_{t+1} \leq (1 - \mu\eta_t)\, a_t - \eta_t e_t A + \eta_t^2 B + \eta_t^3 C\,, \tag{11}$$
for $\eta_t = \frac{4}{\mu(a+t)}$ and constants $A > 0$, $B, C \geq 0$, $\mu > 0$, $a > 1$. Then
$$\frac{A}{S_T}\sum_{t=0}^{T-1} w_t e_t \leq \frac{\mu a^3}{4 S_T}\, a_0 + \frac{2T(T+2a)}{\mu S_T}\, B + \frac{16\, T}{\mu^2 S_T}\, C\,, \tag{12}$$
for $w_t = (a+t)^2$ and $S_T := \sum_{t=0}^{T-1} w_t = \frac{T}{6}\big(2T^2 + 6aT - 3T + 6a^2 - 6a + 1\big) \geq \frac{1}{3}T^3$.

*Proof.* This is a reformulation of Lemma 3.3 in (Stich et al., 2018).

*Proof of Theorem 2.2.* By convexity of $f$ we have $\mathbb{E}\, f(\hat{x}_T) - f^\star \leq \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\, \mathbb{E}\big(f(\bar{x}_t) - f^\star\big)$. The proof of the theorem thus follows immediately from the four lemmas that we have presented, i.e. by Lemma 3.4 with $e_t := \mathbb{E}\big(f(\bar{x}_t) - f^\star\big)$ and constants $A = \frac{1}{2}$ (Lemma 3.1), $B = \frac{\sigma^2}{K}$ (Lemma 3.2) and $C = 8G^2H^2L$ (Lemma 3.3). Observe that the stepsizes $\eta_t = \frac{4}{\mu(a+t)}$ satisfy both the condition of Lemma 3.1 ($\eta_0 = \frac{4}{\mu a} \leq \frac{1}{4L}$, as $a \geq 16\kappa$) and that of Lemma 3.3 ($\frac{\eta_t}{\eta_{t+H}} = \frac{a+t+H}{a+t} \leq 2$, as $a \geq H$).
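As a quick sanity check (not part of the paper), the closed form for $S_T$ stated in Lemma 3.4 and the lower bound $S_T \geq \tfrac{1}{3}T^3$ can be verified numerically:

```python
# Numerical sanity check (not from the paper) of the closed form for
# S_T = sum_{t=0}^{T-1} (a+t)^2 stated in Lemma 3.4 and of S_T >= T^3/3.
def S_closed(T, a):
    return T * (2 * T**2 + 6 * a * T - 3 * T + 6 * a**2 - 6 * a + 1) // 6

for T, a in [(10, 5), (1000, 160), (10**5, 32)]:
    direct = sum((a + t) ** 2 for t in range(T))
    assert direct == S_closed(T, a)
    assert 3 * direct >= T**3
print("closed form and lower bound for S_T verified")
```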
Figure 2: Theoretical speedup of local SGD for different numbers of workers $K$ (up to 1024) and $H \in \{1, 4, 16, 64, 256\}$. (a) Theoretical speedup $S(K)$ ($\epsilon > 0$, $T$ small). (b) Theoretical speedup $S(K)$ ($\epsilon = 0$, $T \to \infty$).

Figure 3: Measured speedup of local SGD with mini-batch $b = 4$ for different numbers of workers $K$ and parameters $H \in \{1, 4, 16, 64, 256\}$. (a) Measured speedup, $\epsilon = 0.005$. (b) Measured speedup, $\epsilon = 0.0001$.

## 4 NUMERICAL ILLUSTRATION

In this section we show some numerical experiments to illustrate the results of Theorem 2.2.

**Speedup.** When Algorithm 1 is implemented in a distributed setting, there are two components that determine the wall-clock time: (i) the total number of gradient computations, $TK$, and (ii) the total time spent for communication. In each communication round $2(K-1)$ vectors need to be exchanged, and there will be $T/H$ communication rounds. Typically, the communication is more expensive than a single gradient computation. We denote this ratio by a factor $\rho \geq 1$ (in practice, $\rho$ can be 10 to 100, or even larger on slow networks). The parameter $T$ depends on the desired accuracy $\epsilon > 0$, and according to (6) we roughly have
$$T(\epsilon, H, K) \approx \frac{1}{K\epsilon}\left(\frac{1}{2} + \frac{1}{2}\sqrt{1 + \epsilon\,(1 + H + H^2 K)}\right)\,.$$
Thus, the theoretical speedup $S(K)$ of local SGD on $K$ machines compared to SGD on one machine ($H = 1$, $K = 1$) is
$$S(K) = \frac{K}{\left(\frac{1}{2} + \frac{1}{2}\sqrt{1 + \epsilon\,(1 + H + H^2 K)}\right)\left(1 + \frac{2\rho\,(K-1)}{H}\right)}\,. \tag{13}$$

**Theoretical.** Examining (13), we see that (i) increasing $H$ can reduce negative scaling effects due to parallelization (second bracket in the denominator of (13)), and (ii) local SGD only shows linear scaling for $\epsilon \ll 1$ (i.e. $T$ large enough, in agreement with the theory). In Figure 2 we depict $S(K)$, once for $\epsilon = 0$ in Figure 2b, and for positive $\epsilon > 0$ in Figure 2a, under the assumption $\rho = 25$. We see that for $\epsilon = 0$ the largest values of $H$ give the best speedup; however, when only a few epochs need to be performed, then the optimal values of $H$ change with the number of workers $K$. We also see that for a small number of workers $H = 1$ is never optimal. If $T$ is unknown, then these observations indicate that the technique from (Zhang et al., 2016), i.e. adaptively increasing $H$ over time, is a good strategy to get the best choice of $H$ when the time horizon is unknown.

**Experimental.** We examine the practical speedup on a logistic regression problem, $f(x) = \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2$, where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, +1\}$ are the data samples. The regularization parameter is set to $\lambda = 1/n$. We consider the w8a dataset (Platt, 1999) ($d = 300$, $n = 49749$). We initialize all runs with $x_0 = 0_d$ and measure the number of iterations to reach the target accuracy $\epsilon$. We consider the target accuracy reached when either the last iterate, the uniform average, the average with linear weights, or the average with quadratic weights (such as in Theorem 2.2) reaches the target accuracy. By extensive grid search we determine for each configuration $(H, K, b)$ the best stepsize from the set $\{\min\big(32, \frac{cn}{t+1}\big), 32c\}$, where $c$ can take the values $c = 2^i$ for $i \in \mathbb{Z}$. For more details on the experimental setup we refer to Section D in the appendix. We depict the results in Figure 3, again under the assumption $\rho = 25$.

**Conclusion.** The restriction on $H$ imposed by the theory is not severe for $T \to \infty$. Thus, for training that either requires many passes over the data or that is performed only on a small cluster, large values of $H$ are advisable. However, for smaller $T$ (few passes over the data), the $O(1/\sqrt{K})$ dependency shows significantly in the experiment. This has to be taken into account when deploying the algorithm on a massively parallel system, for instance through the technique mentioned in (Zhang et al., 2016).
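The wall-clock model behind Figure 2 can be reproduced with a few lines. This is a sketch of formula (13) as reconstructed above; the cost model, in which every gradient costs one time unit and every communication round costs $2\rho(K-1)$ units amortized over $H$ local steps, is the assumption stated in the Speedup paragraph.

```python
import math

# Sketch of the speedup model (13) as reconstructed above: iteration count
# T(eps, H, K) from (6) with constants dropped, and a cost model where a
# gradient costs 1 unit and a communication round costs 2*rho*(K-1) units,
# amortized over H local steps.
def iterations(eps, H, K):
    return (0.5 + 0.5 * math.sqrt(1 + eps * (1 + H + H * H * K))) / (K * eps)

def speedup(K, H, eps=1e-3, rho=25):
    serial = iterations(eps, 1, 1)                      # one worker, H = 1
    parallel = iterations(eps, H, K) * (1 + 2 * rho * (K - 1) / H)
    return serial / parallel

for H in (1, 16, 256):
    print("H =", H, [round(speedup(K, H), 2) for K in (1, 4, 64, 1024)])
```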
## 5 ASYNCHRONOUS LOCAL SGD

In this section we present asynchronous local SGD, which does not require the local sequences to be synchronized. This not only reduces communication bottlenecks, but by using load-balancing techniques the algorithm can also be tuned optimally for heterogeneous settings (slower workers do less computation between synchronizations, faster workers do more). We discuss this in more detail in Section C.

**Algorithm 2** ASYNCHRONOUS LOCAL SGD (SCHEMATIC)
1: Initialize variables $x_0^k = x_0$, $r_k = 0$ for $k \in [K]$, aggregate $\tilde{x} = x_0$.
2: parallel for $k \in [K]$ do
3: for $t$ in $0 \dots T-1$ do
4: Sample $i_t^k$ uniformly in $[n]$
5: $x_{t+1}^k \leftarrow x_t^k - \eta_t \nabla f_{i_t^k}(x_t^k)$ (local update)
6: if $t+1 \in \mathcal{I}_T^k$ then
7: $\tilde{x} \leftarrow \mathrm{add}\big(\tilde{x}, \frac{1}{K}(x_{t+1}^k - x_{r_k}^k)\big)$ (atomic aggregation of the updates)
8: $x_{t+1}^k \leftarrow \mathrm{read}(\tilde{x})$
9: $r_k \leftarrow t+1$ (iteration/time of last read)
10: end if
11: end for
12: end parallel for

Asynchronous local SGD generates in parallel $K$ sequences $\{x_t^k\}_{t=0}^T$ of iterates, $k \in [K]$. Similar as in Section 2, we introduce sets of synchronization indices, $\mathcal{I}_T^k \subseteq [T]$ with $T \in \mathcal{I}_T^k$, for $k \in [K]$. Note that the sets do not have to be equal for different workers. Each worker $k$ evolves its local sequence $x_t^k$ in the following way:
$$x_{t+1}^k = \begin{cases} x_t^k - \eta_t \nabla f_{i_t^k}(x_t^k)\,, & \text{if } t+1 \notin \mathcal{I}_T^k\,,\\ \tilde{x}_{t+1}^k\,, & \text{if } t+1 \in \mathcal{I}_T^k\,, \end{cases} \tag{14}$$
where $\tilde{x}_{t+1}^k$ denotes the state of the aggregated variable at the time when worker $k$ reads the aggregated variable. To be precise, we use the notation
$$\tilde{x}_t^k = x_0 - \frac{1}{K}\sum_{h=1}^K \sum_{j=0}^{T-1} \mathbb{1}_{\{j \in \mathcal{W}_t^{k,h}\}}\,\big(\eta_j \nabla f_{i_j^h}(x_j^h)\big)\,, \tag{15}$$
where $\mathcal{W}_t^{k,h} \subseteq [T]$ denotes all updates that have been written at the time the read takes place. The sets $\mathcal{W}_t^{k,h}$ are indexed by the iteration $t$, the worker $k$ that initiates the read, and $h \in [K]$. Thus $\mathcal{W}_t^{k,h}$ denotes all updates of the local sequence $\{x_t^h\}_{t\geq 0}$ that have been reported back to the server at the time worker $k$ reads (in iteration $t$). This notation is necessary, as we do not necessarily have $\mathcal{W}_t^{k,h} = \mathcal{W}_t^{k',h}$ for $k \neq k'$. We have $\mathcal{W}_t^{k,h} \subseteq \mathcal{W}_{t'}^{k,h}$ for $t \leq t'$, as updates are not overwritten. When we cast synchronized local SGD in this notation, then it holds $\mathcal{W}_t^{k,h} = \mathcal{W}_t^{k',h'}$ for all $k, h, k', h'$, as all the writes and reads are synchronized.

**Theorem 5.1.** Let $f$, $\sigma$, $G$ and $\kappa$ be as in Theorem 2.2, and let the sequences $\{x_t^k\}_{t=0}^T$ for $k \in [K]$ be generated according to (14) with $\mathrm{gap}(\mathcal{I}_T^k) \leq H$ for $k \in [K]$ and with stepsizes $\eta_t = \frac{4}{\mu(a+t)}$ with shift parameter $a > \max\{16\kappa, H + \tau\}$ for delay $\tau > 0$. If $\mathcal{W}_t^{k,h} \supseteq [t - \tau]$ for all $k, h \in [K]$, $t \in [T]$, then
$$\mathbb{E}\, f(\hat{x}_T) - f^\star \leq \frac{\mu a^3}{2 S_T}\,\|x_0 - x^\star\|^2 + \frac{4T(T+2a)}{\mu K S_T}\,\sigma^2 + \frac{768\, T}{\mu^2 S_T}\,G^2 (H + \tau)^2 L\,, \tag{16}$$
where $\hat{x}_T = \frac{1}{K S_T}\sum_{k=1}^K \sum_{t=0}^{T-1} w_t x_t^k$, for $w_t = (a+t)^2$ and $S_T = \sum_{t=0}^{T-1} w_t \geq \frac{1}{3}T^3$.

Hence, for $T$ large enough and $(H + \tau) = O\big(\sqrt{T/K}\big)$, asynchronous local SGD converges at rate $O\big(\frac{G^2}{\mu K T}\big)$, the same rate as synchronous local SGD.
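The following schematic sketch (illustrative, not the paper's implementation; the toy objective, the thread-based scheduling and all names are assumptions) simulates Algorithm 2 with one thread per worker and a lock standing in for the atomic add/read of the aggregate variable.

```python
import threading
import numpy as np

# Schematic sketch of Algorithm 2 (asynchronous local SGD) on a toy problem:
# each worker runs H local SGD steps, then atomically adds 1/K times its
# accumulated update to a shared aggregate and restarts from the aggregate.
d, K, H, T, eta = 10, 4, 8, 400, 0.02
A = np.random.default_rng(0).standard_normal((200, d))
y = A @ np.ones(d)

x_agg = np.zeros(d)                      # shared aggregate \tilde{x}
lock = threading.Lock()                  # guards the atomic add/read

def worker(k: int):
    global x_agg
    rng = np.random.default_rng(k)
    x = x_agg.copy()                     # local iterate x^k
    x_read = x.copy()                    # value obtained at the last read (x^k_{r_k})
    for t in range(T):
        i = rng.integers(len(A))
        x = x - eta * (A[i] @ x - y[i]) * A[i]      # local update
        if (t + 1) % H == 0:             # t + 1 in I_T^k
            with lock:
                x_agg = x_agg + (x - x_read) / K    # atomic aggregation
                x = x_agg.copy()                    # read back
            x_read = x.copy()

threads = [threading.Thread(target=worker, args=(k,)) for k in range(K)]
for th in threads: th.start()
for th in threads: th.join()
print("final residual:", np.linalg.norm(A @ x_agg - y) / np.linalg.norm(y))
```

Because the threads synchronize at different wall-clock times, the writes of different workers interleave arbitrarily, which is exactly the situation the delay parameter $\tau$ in Theorem 5.1 accounts for.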
## 6 CONCLUSION

We prove convergence of synchronous and asynchronous local SGD and are the first to show that local SGD (for nontrivial values of $H$) attains a theoretically linear speedup on strongly convex functions when parallelized among $K$ workers. We show that local SGD saves up to a factor of $O(T^{1/2})$ in global communication rounds compared to mini-batch SGD, while still converging at the same rate in terms of total stochastic gradient computations.

Deriving more concise convergence rates for local SGD could be an interesting future direction that could deepen our understanding of the scheme. For instance, one could aim for a more fine-grained analysis in terms of bias and variance terms (similar as e.g. in Dekel et al. (2012); Jain et al. (2018)), relax the assumptions (here we relied on the bounded gradient assumption), or investigate the data dependence (e.g. by considering data-dependent measures such as gradient diversity (Yin et al., 2018)). There are also no apparent reasons that would limit the extension of the theory to non-convex objective functions; Lemma 3.3 uses neither the smoothness nor the strong convexity assumption, so it can be applied in the non-convex setting as well. We feel that the positive results shown here can motivate and spark further research on non-convex problems. Indeed, very recent work (Zhou & Cong, 2018; Yu et al., 2018) analyzes local SGD for non-convex optimization problems and shows convergence of local SGD to a stationary point, though the restrictions on $H$ are stronger than here.

### ACKNOWLEDGMENTS

The author thanks Jean-Baptiste Cordonnier, Tao Lin and Kumar Kshitij Patel for spotting various typos in the first versions of this manuscript, as well as Martin Jaggi for his support.

### REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 24, pp. 873–881. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4247-distributed-delayed-stochastic-optimization.pdf.

Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 440–445. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/D17-1045.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 1709–1720. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6768-qsgd-communication-efficient-sgd-via-gradient-quantization-and-encoding.pdf.

Avleen S Bijral, Anand D Sarwate, and Nathan Srebro. On data dependence in distributed stochastic optimization. arXiv.org, 2016.

Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Józefowicz. Revisiting distributed synchronous SGD. CoRR, abs/1604.00981, 2016. URL http://arxiv.org/abs/1604.00981.

Greg Coppola.
Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. Ph D thesis, The University of Edinburgh, 2015. Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13(1):165 202, January 2012. ISSN 1532-4435. URL http:// dl.acm.org/citation.cfm?id=2503308.2188391. Published as a conference paper at ICLR 2019 N. Dryden, T. Moon, S. A. Jacobs, and B. V. Essen. Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 1 8, Nov 2016. doi: 10.1109/MLHPC.2016.004. Antoine Godichon-Baggioni and Sofiane Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms. ar Xiv preprint ar Xiv:1710.07926, 2017. Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training Image Net in 1 hour. Co RR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677. Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pp. 1737 1746. JMLR.org, 2015. URL http://dl.acm.org/ citation.cfm?id=3045118.3045303. Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1 42, 2018. URL http://jmlr.org/papers/v18/16595.html. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ar Xiv preprint ar Xiv:1609.04836, 2016. Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Co RR, abs/1212.2002, 2012. Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS 15, pp. 2737 2745, Cambridge, MA, USA, 2015. MIT Press. URL http: //dl.acm.org/citation.cfm?id=2969442.2969545. Tao Lin, Sebastian U. Stich, and Martin Jaggi. Don t use large mini-batches, use local SGD. Co RR, abs/1808.07217, 2018a. URL https://arxiv.org/abs/1808.07217. Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR 2018 - International Conference on Learning Representations, 2018b. URL https://openreview.net/forum?id=Skh QHMW0W. Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, 2018. Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202 2229, 2017. doi: 10.1137/16M1057000. Ryan Mc Donald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S. Mann. 
Efficient large-scale distributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (eds.), Advances in Neural Information Processing Systems 22, pp. 1231 1239. Curran Associates, Inc., 2009. URL http://papers.nips.cc/paper/3881-efficient-largescale-distributed-training-of-conditional-maximum-entropy-models.pdf. Ryan Mc Donald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT 10, pp. 456 464, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 1-932432-65-5. URL http://dl.acm.org/citation.cfm?id= 1857999.1858068. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communicationefficient learning of deep networks from decentralized data. In Aarti Singh and Jerry Zhu (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1273 1282, Fort Lauderdale, FL, USA, 20 22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/mcmahan17a.html. T. Na, J. H. Ko, J. Kung, and S. Mukhopadhyay. On-chip training of recurrent neural networks with limited numerical precision. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3716 3723, May 2017. doi: 10.1109/IJCNN.2017.7966324. Published as a conference paper at ICLR 2019 Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS 11, pp. 693 701, USA, 2011. Curran Associates Inc. ISBN 978-161839-599-3. URL http://dl.acm.org/citation.cfm?id=2986459.2986537. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. John C. Platt. Advances in kernel methods. chapter Fast Training of Support Vector Machines Using Sequential Minimal Optimization, pp. 185 208. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3. URL http://dl.acm.org/citation.cfm?id=299094.299105. Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML 12, pp. 1571 1578, USA, 2012. Omnipress. ISBN 978-1-4503-1285-1. URL http://dl.acm.org/citation.cfm?id=3042573.3042774. Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400 407, September 1951. Christopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of HOG WILD!-style algorithms. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS 15, pp. 2674 2682, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969442.2969538. Frank Seide and Amit Agarwal. CNTK: Microsoft s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135 2135. ACM, 2016. Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie (eds.), INTERSPEECH, pp. 1058 1062. ISCA, 2014. URL http://dblp.uni-trier.de/ db/conf/interspeech/interspeech2014.html#Seide FDLY14. O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 850 857, Sep. 2014. doi: 10.1109/ ALLERTON.2014.7028543. Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Sanjoy Dasgupta and David Mc Allester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 71 79, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL http://proceedings.mlr.press/ v28/shamir13.html. Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 4452 4463. Curran Associates, Inc., 2018. URL http: //papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf. Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pp. 1488 1492. ISCA, 2015. URL http://dblp.uni-trier.de/db/conf/interspeech/ interspeech2015.html#Strom15. Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. me Prop: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3299 3308, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/sun17c.html. Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. Co RR, abs/1808.07576, 2018. Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. Co RR, abs/1710.09854, 2017. URL http://arxiv.org/abs/1710.09854. Published as a conference paper at ICLR 2019 Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 1509 1519. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6749-terngrad-ternary-gradients-toreduce-communication-in-distributed-deep-learning.pdf. Blake E Woodworth, Jialei Wang, Adam Smith, Brendan Mc Mahan, and Nati Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 8505 8515. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8069-graph-oracle-models-lower-bounds-andgaps-for-parallel-stochastic-optimization.pdf. Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. 
In Amos Storkey and Fernando Perez Cruz (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1998 2007, Playa Blanca, Lanzarote, Canary Islands, 09 11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/yin18a.html. Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for Image Net training. Co RR, abs/1708.03888, 2017. Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD for non-convex optimization with faster convergence and less communication. Co RR, abs/1807.06629, 2018. Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zip ML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4035 4043, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zhang17e.html. Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? ar Xiv, 2016. Sixin Zhang, Anna E Choromanska, and Yann Le Cun. Deep learning with elastic averaging SGD. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 685 693. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/ 5761-deep-learning-with-elastic-averaging-sgd.pdf. X. Zhang, J. Trmal, D. Povey, and S. Khudanpur. Improving deep neural network acoustic models using generalized maxout networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 215 219, May 2014. doi: 10.1109/ICASSP.2014.6853589. Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321 3363, 2013. URL http://jmlr.org/ papers/v14/zhang13b.html. Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1 9, Lille, France, 07 09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/zhaoa15.html. Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3219 3227. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/447. URL https://doi.org/10.24963/ijcai.2018/ 447. Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. Distributed asynchronous optimization with unbounded delays: How slow can you go? In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5970 5979, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/zhou18b.html. Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. 
Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 2595–2603. Curran Associates, Inc., 2010. URL http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent.pdf.

## A MISSING PROOFS FOR SYNCHRONIZED LOCAL SGD

In this section we provide the proofs for the three lemmas that were introduced in Section 3.

*Proof of Lemma 3.1.* Using the update equation (7) we have
$$\|\bar{x}_{t+1} - x^\star\|^2 = \|\bar{x}_t - \eta_t g_t - x^\star\|^2 = \|\bar{x}_t - \eta_t g_t - x^\star - \eta_t \bar{g}_t + \eta_t \bar{g}_t\|^2 \tag{17}$$
$$= \|\bar{x}_t - x^\star - \eta_t \bar{g}_t\|^2 + \eta_t^2\,\|g_t - \bar{g}_t\|^2 + 2\eta_t\,\langle \bar{x}_t - x^\star - \eta_t \bar{g}_t,\; \bar{g}_t - g_t\rangle\,. \tag{18}$$
Observe that
$$\|\bar{x}_t - x^\star - \eta_t \bar{g}_t\|^2 = \|\bar{x}_t - x^\star\|^2 + \eta_t^2\,\|\bar{g}_t\|^2 - 2\eta_t\,\langle \bar{x}_t - x^\star, \bar{g}_t\rangle \tag{19}$$
$$= \|\bar{x}_t - x^\star\|^2 + \eta_t^2\,\|\bar{g}_t\|^2 - 2\eta_t\,\frac{1}{K}\sum_{k=1}^K \langle \bar{x}_t - x^\star, \nabla f(x_t^k)\rangle \tag{20}$$
$$\leq \|\bar{x}_t - x^\star\|^2 + \eta_t^2\,\frac{1}{K}\sum_{k=1}^K \|\nabla f(x_t^k)\|^2 - 2\eta_t\,\frac{1}{K}\sum_{k=1}^K \langle \bar{x}_t - x_t^k + x_t^k - x^\star, \nabla f(x_t^k)\rangle \tag{21}$$
$$= \|\bar{x}_t - x^\star\|^2 + \eta_t^2\,\frac{1}{K}\sum_{k=1}^K \|\nabla f(x_t^k) - \nabla f(x^\star)\|^2 - 2\eta_t\,\frac{1}{K}\sum_{k=1}^K \langle x_t^k - x^\star, \nabla f(x_t^k)\rangle - 2\eta_t\,\frac{1}{K}\sum_{k=1}^K \langle \bar{x}_t - x_t^k, \nabla f(x_t^k)\rangle\,, \tag{22}$$
where we used the inequality $\|\sum_{i=1}^K a_i\|^2 \leq K \sum_{i=1}^K \|a_i\|^2$ in (21), and $\nabla f(x^\star) = 0$ in (22). By $L$-smoothness,
$$\|\nabla f(x_t^k) - \nabla f(x^\star)\|^2 \leq 2L\big(f(x_t^k) - f^\star\big)\,, \tag{23}$$
and by $\mu$-strong convexity,
$$-\langle x_t^k - x^\star, \nabla f(x_t^k)\rangle \leq -\big(f(x_t^k) - f^\star\big) - \frac{\mu}{2}\,\|x_t^k - x^\star\|^2\,. \tag{24}$$
To estimate the last term in (22) we use $2\langle a, b\rangle \leq \gamma\|a\|^2 + \gamma^{-1}\|b\|^2$, for $\gamma > 0$. This gives
$$-2\,\langle \bar{x}_t - x_t^k, \nabla f(x_t^k)\rangle \leq 2L\,\|\bar{x}_t - x_t^k\|^2 + \frac{1}{2L}\,\|\nabla f(x_t^k)\|^2 \tag{25}$$
$$= 2L\,\|\bar{x}_t - x_t^k\|^2 + \frac{1}{2L}\,\|\nabla f(x_t^k) - \nabla f(x^\star)\|^2 \tag{26}$$
$$\leq 2L\,\|\bar{x}_t - x_t^k\|^2 + \big(f(x_t^k) - f^\star\big)\,, \tag{27}$$
where we have again used (23) in the last inequality. By applying these three estimates to (22) we get
$$\|\bar{x}_t - x^\star - \eta_t \bar{g}_t\|^2 \leq \|\bar{x}_t - x^\star\|^2 + 2\eta_t\,\frac{L}{K}\sum_{k=1}^K \|\bar{x}_t - x_t^k\|^2 - \eta_t\,(1 - 2L\eta_t)\,\frac{1}{K}\sum_{k=1}^K \big(f(x_t^k) - f^\star\big) - \mu\eta_t\,\frac{1}{K}\sum_{k=1}^K \|x_t^k - x^\star\|^2\,. \tag{28}$$
For $\eta_t \leq \frac{1}{4L}$ it holds $\eta_t L \leq \frac{1}{4}$, hence $1 - 2L\eta_t \geq \frac{1}{2}$. By convexity of the function $x \mapsto a\big(f(x) - f^\star\big) + b\,\|x - x^\star\|^2$ for $a, b \geq 0$:
$$\frac{1}{K}\sum_{k=1}^K \Big(a\big(f(x_t^k) - f^\star\big) + b\,\|x_t^k - x^\star\|^2\Big) \geq a\big(f(\bar{x}_t) - f^\star\big) + b\,\|\bar{x}_t - x^\star\|^2\,, \tag{29}$$
hence we can continue in (28) and obtain
$$\|\bar{x}_t - x^\star - \eta_t \bar{g}_t\|^2 \leq (1 - \mu\eta_t)\,\|\bar{x}_t - x^\star\|^2 - \frac{1}{2}\eta_t\,\big(f(\bar{x}_t) - f^\star\big) + 2\eta_t\,\frac{L}{K}\sum_{k=1}^K \|\bar{x}_t - x_t^k\|^2\,. \tag{30}$$
Finally, we can plug (30) back into (18). By taking expectation (the last term in (18) vanishes in expectation, as $\mathbb{E}\, g_t = \bar{g}_t$) we get
$$\mathbb{E}\,\|\bar{x}_{t+1} - x^\star\|^2 \leq (1 - \mu\eta_t)\,\mathbb{E}\,\|\bar{x}_t - x^\star\|^2 + \eta_t^2\,\mathbb{E}\,\|g_t - \bar{g}_t\|^2 - \frac{1}{2}\eta_t\,\mathbb{E}\big(f(\bar{x}_t) - f^\star\big) + 2\eta_t\,\frac{L}{K}\sum_{k=1}^K \mathbb{E}\,\|\bar{x}_t - x_t^k\|^2\,. \tag{31}$$

*Proof of Lemma 3.2.* By the definition of $g_t$ and $\bar{g}_t$ we have
$$\mathbb{E}\,\|g_t - \bar{g}_t\|^2 = \mathbb{E}\,\Big\|\frac{1}{K}\sum_{k=1}^K \big(\nabla f_{i_t^k}(x_t^k) - \nabla f(x_t^k)\big)\Big\|^2 = \frac{1}{K^2}\sum_{k=1}^K \mathbb{E}\,\|\nabla f_{i_t^k}(x_t^k) - \nabla f(x_t^k)\|^2 \leq \frac{\sigma^2}{K}\,,$$
where we used $\mathrm{Var}\big(\sum_{k=1}^K X_k\big) = \sum_{k=1}^K \mathrm{Var}(X_k)$ for independent random variables $X_k$.

*Proof of Lemma 3.3.* As $\mathrm{gap}(\mathcal{I}_T) \leq H$, there is an index $t_0 \leq t$, $t - t_0 \leq H$, such that $\bar{x}_{t_0} = x_{t_0}^k$ for all $k \in [K]$. Observe, using $\mathbb{E}\,\|X - \mathbb{E}X\|^2 = \mathbb{E}\,\|X\|^2 - \|\mathbb{E}X\|^2 \leq \mathbb{E}\,\|X\|^2$ and $\|\sum_{i=1}^H a_i\|^2 \leq H\sum_{i=1}^H \|a_i\|^2$,
$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\,\|\bar{x}_t - x_t^k\|^2 = \frac{1}{K}\sum_{k=1}^K \mathbb{E}\,\|x_t^k - \bar{x}_{t_0} - (\bar{x}_t - \bar{x}_{t_0})\|^2 \tag{32}$$
$$\leq \frac{1}{K}\sum_{k=1}^K \mathbb{E}\,\|x_t^k - \bar{x}_{t_0}\|^2 \tag{33}$$
$$\leq \frac{1}{K}\sum_{k=1}^K H \sum_{h=t_0}^{t-1} \eta_h^2\, \mathbb{E}\,\|\nabla f_{i_h^k}(x_h^k)\|^2 \tag{34}$$
$$\leq \frac{1}{K}\sum_{k=1}^K H^2\, \eta_{t_0}^2\, G^2 = H^2 \eta_{t_0}^2 G^2\,, \tag{35}$$
where we used $\eta_h \leq \eta_{t_0}$ for $h \geq t_0$ and the assumption $\mathbb{E}\,\|\nabla f_{i_h^k}(x_h^k)\|^2 \leq G^2$. Finally, the claim follows by the assumption on the stepsizes, $\eta_{t_0} \leq 2\eta_{t_0 + H} \leq 2\eta_t$.

## B MISSING PROOF FOR ASYNCHRONOUS LOCAL SGD

In this section we prove Theorem 5.1. The proof follows closely the proof presented in Section 3. We again introduce the virtual sequence
$$\bar{x}_t = x_0 - \frac{1}{K}\sum_{k=1}^K \sum_{j=0}^{t-1} \eta_j \nabla f_{i_j^k}(x_j^k)\,, \tag{36}$$
as before. By the property $T \in \mathcal{I}_T^k$ for $k \in [K]$ we know that all workers will have written their updates when the algorithm terminates. This assumption is not very critical and could be relaxed, but it facilitates the (already quite heavy) notation in the proof. Observe that Lemmas 3.1 and 3.2 hold for the virtual sequence $\{\bar{x}_t\}_{t=0}^T$. Hence, all we need is a refined version of Lemma 3.3 that bounds how far the local sequences can deviate from the virtual average.

**Lemma B.1.** If $\mathrm{gap}(\mathcal{I}_T^k) \leq H$ and $\tau > 0$ is such that $\mathcal{W}_t^{k,h} \supseteq [t - \tau]$ for all $k, h \in [K]$, $t \in [T]$, and the sequence of decreasing positive stepsizes $\{\eta_t\}_{t\geq 0}$ satisfies $\eta_t \leq 2\eta_{t+H+\tau}$ for all $t \geq 0$, then
$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\,\|\bar{x}_t - x_t^k\|^2 \leq 12\,\eta_t^2 G^2 (H + \tau)^2\,, \tag{37}$$
where $G^2$ is a constant such that $\mathbb{E}_i\|\nabla f_i(x_t^k)\|^2 \leq G^2$ for $k \in [K]$, $t \in [T]$.
Here we use the notation $[s] = \{\}$ for $s < 0$, such that $[t - \tau]$ is also defined for $t < \tau$.

*Proof.* As $\mathrm{gap}(\mathcal{I}_T^k) \leq H$, there exists for every $k \in [K]$ a $t_k$, $t - t_k \leq H$, such that $x_{t_k}^k = \tilde{x}_{t_k}^k$. Let $t_0 := \min\{t_1, \dots, t_K\}$ and observe $t - t_0 \leq H$. Let $t_0' = \max\{t_0 - \tau, 0\}$. As $\mathcal{W}_t^{k,h} \supseteq [t - \tau]$ for all $k, h \in [K]$, $t \in [T]$, it holds that
$$x_{t_k}^k = \bar{x}_{t_0'} - \frac{1}{K}\sum_{h=1}^K \sum_{j \geq t_0'} \mathbb{1}_{\{j \in \mathcal{W}_{t_k}^{k,h}\}}\,\big(\eta_j \nabla f_{i_j^h}(x_j^h)\big)\,, \tag{38}$$
for each $k \in [K]$. In other words, all updates up to iteration $t_0'$ have been written to the aggregated sequence. We decompose the error term as
$$\|\bar{x}_t - x_t^k\|^2 \leq 3\,\Big(\|x_t^k - x_{t_k}^k\|^2 + \|x_{t_k}^k - \bar{x}_{t_0'}\|^2 + \|\bar{x}_{t_0'} - \bar{x}_t\|^2\Big)\,. \tag{39}$$
Now, using $\eta_{t+1} \leq \eta_t$ and $t - t_k \leq H$, we conclude (as in (35))
$$\mathbb{E}\,\|x_t^k - x_{t_k}^k\|^2 \leq \eta_{t_k}^2 H^2 G^2 \leq \eta_{t_0'}^2 H^2 G^2\,. \tag{40}$$
As $t_k - t_0' \leq \tau$,
$$\mathbb{E}\,\|x_{t_k}^k - \bar{x}_{t_0'}\|^2 \leq \eta_{t_0'}^2 \tau^2 G^2\,, \tag{41}$$
and similarly, as $t - t_0' \leq H + \tau$,
$$\mathbb{E}\,\|\bar{x}_{t_0'} - \bar{x}_t\|^2 \leq \eta_{t_0'}^2 (H + \tau)^2 G^2\,. \tag{42}$$
Finally, as $\eta_{t_0'} \leq 2\eta_t$, we can conclude
$$\mathbb{E}\,\|\bar{x}_t - x_t^k\|^2 \leq 12\,\eta_t^2 (H + \tau)^2 G^2\,, \tag{43}$$
and the lemma follows.

Now the proof of Theorem 5.1 follows immediately.

*Proof of Theorem 5.1.* As in the proof of Theorem 2.2, we rely on Lemma 3.4 to derive the convergence rate. Again we have $A = \frac{1}{2}$ and $B = \frac{\sigma^2}{K}$, and now $C = 24\,L G^2 (H + \tau)^2$ (Lemma B.1). It is easy to see that the stepsizes satisfy the condition of Lemma B.1, as clearly $\frac{\eta_t}{\eta_{t+H+\tau}} = \frac{a+t+H+\tau}{a+t} \leq 2$, as $a \geq H + \tau$.

## C COMMENTS ON IMPLEMENTATION ISSUES

### C.1 SYNCHRONOUS LOCAL SGD

In Theorem 2.2 we do not prove convergence of the sequences $\{x_t^k\}_{t\geq 0}$ of the iterates, but only convergence of a weighted average of all iterates. In practice the last iterate might often be sufficient, but we like to remark that the weighted average of the iterates can easily be tracked on the fly with an auxiliary sequence $\{y_t\}_{t\geq 0}$, $y_0 = x_0$, without storing all intermediate iterates; see Table 1 for some examples.

| criteria | weights | formula | recursive update |
|---|---|---|---|
| last iterate | - | $y_t = x_t$ | $y_t = x_t$ |
| uniform average | $w_t = 1$ | $y_t = \frac{1}{t+1}\sum_{i=0}^{t} x_i$ | $y_t = \frac{1}{t+1}x_t + \frac{t}{t+1}y_{t-1}$ |
| linear weights | $w_t = (t+1)$ | $y_t = \frac{2}{(1+t)(2+t)}\sum_{i=0}^{t} (i+1)x_i$ | $y_t = \frac{2}{2+t}x_t + \frac{t}{t+2}y_{t-1}$ |
| quadratic weights | $w_t = (t+1)^2$ | $y_t = \frac{6}{(t+1)(t+2)(2t+3)}\sum_{i=0}^{t} (i+1)^2 x_i$ | $y_t = \frac{6(t+1)}{(t+2)(2t+3)}x_t + \frac{t(1+2t)}{6+7t+2t^2}y_{t-1}$ |

Table 1: Formulas to recursively track weighted averages.

### C.2 ASYNCHRONOUS LOCAL SGD

As for synchronous local SGD, the weighted averages of the iterates (if needed) can be tracked on each worker locally by a recursive formula as explained above. A more important aspect that we have not discussed yet is that Algorithm 2 allows for an easy procedure to balance the load in heterogeneous settings. In our notation we have always associated the local sequences $\{x_t^k\}$ with a specific worker $k$. However, the computation of the sequences does not need to be tied to a specific worker. Thus, a fast worker $k$ that has already advanced its local sequence too much can start computing updates for another sequence $k' \neq k$ if worker $k'$ lags behind. This was not possible in the synchronous model, as there all communications had to happen in sync. We demonstrate this principle in Table 2 below for two workers. Note that also the running averages can still be maintained.

| wall-clock time → | | | | | | |
|---|---|---|---|---|---|---|
| worker 1 | $x_H^1 \leftarrow U(\tilde{x})$ | $x_{2H}^1 \leftarrow U(\tilde{x})$ | $x_{3H}^1 \leftarrow U(\tilde{x})$ | $x_{2H}^2 \leftarrow U(\tilde{x})$ | $x_{4H}^2 \leftarrow U(\tilde{x})$ | $x_{4H}^1 \leftarrow U(\tilde{x})$ |
| worker 2 | $x_H^2 \leftarrow U(\tilde{x})$ | | | $x_{3H}^2 \leftarrow U(\tilde{x})$ | | |

Table 2: Simple load balancing. The faster worker can advance both sequences, even when the slower worker has not yet finished its computation. In the example each worker does $H$ steps of local SGD (denoted by the operator $U\colon \mathbb{R}^d \to \mathbb{R}^d$) before writing back the updates to the aggregate $\tilde{x}$. Due to the load balancing, $\tau \leq 3H$.
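The recursive updates of Table 1 can be checked directly; the following snippet (illustrative, using exact rational arithmetic on an arbitrary scalar sequence) tracks the quadratically weighted average with the recursion and compares it to the explicit weighted sum.

```python
from fractions import Fraction

# Check (illustrative) of the quadratic-weight recursion from Table 1:
# y_t = 6(t+1)/((t+2)(2t+3)) * x_t + t(1+2t)/(6+7t+2t^2) * y_{t-1}
# against the direct weighted average with weights w_i = (i+1)^2.
xs = [Fraction(t * t + 1) for t in range(50)]      # arbitrary scalar iterates

y = xs[0]                                          # y_0 = x_0
for t in range(1, len(xs)):
    y = (Fraction(6 * (t + 1), (t + 2) * (2 * t + 3)) * xs[t]
         + Fraction(t * (1 + 2 * t), 6 + 7 * t + 2 * t * t) * y)

weights = [(i + 1) ** 2 for i in range(len(xs))]
direct = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
assert y == direct
print("recursive and direct quadratic averages agree:", y == direct)
```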
## D DETAILS ON EXPERIMENTS

We here state the precise procedure that was used to generate the figures in this report. As briefly stated in Section 4, we examine empirically the speedup on a logistic regression problem, $f(x) = \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-b_i a_i^\top x)\big) + \frac{\lambda}{2}\|x\|^2$, where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, +1\}$ are the data samples. The regularization parameter is set to $\lambda = 1/n$. We consider the small-scale w8a dataset (Platt, 1999) ($d = 300$, $n = 49749$). For each run, we initialize $x_0 = 0_d$ and measure the number of iterations⁶ (and the number of stochastic gradient evaluations) to reach the target accuracy $\epsilon \in \{0.005, 0.0001\}$. As we prove convergence only for a special weighted sum of the iterates in Theorem 2.2 and not for standard criteria (last iterate or uniform average), we evaluate the function value for different weighted averages $y_t = \frac{1}{\sum_{i=0}^t w_i}\sum_{i=0}^t w_i x_i$, and consider the accuracy reached when one of the averages satisfies $f(y_t) - f^\star \leq \epsilon$, with $f^\star := 0.126433176216545$ (numerically determined). The precise formulas for the averages that we used are given in Table 1.

For each configuration $(K, H, b, \epsilon)$, we report the best result found with either of the following two stepsizes: $\eta_t := \min\big(32, \frac{cn}{t+1}\big)$ and $\eta_t = 32c$. Here $c$ is a parameter that can take the values $c = 2^i$ for $i \in \mathbb{Z}$. For each stepsize we determine the best parameter $c$ by a grid search, and consider a parameter $c$ optimal if the parameters $\{2^{-2}c, 2^{-1}c, 2c, 2^{2}c\}$ yield worse results (i.e. more iterations to reach the target accuracy).

⁶ Note that, besides the randomness involved in the stochastic gradient computations, the averaging steps of synchronous local SGD are deterministic. Hence, these results (convergence in terms of numbers of iterations) can be reproduced by simply simulating local SGD with virtual workers (which we did for large numbers of $K$). For completeness, we report that all experiments were run on an Ubuntu 16.04 machine with a 24-core Intel Xeon CPU E5-2680 v3 @ 2.50GHz.

In Figures 4 and 5 we give additional results for mini-batch sizes $b \in \{1, 16\}$.

Figure 4: Measured speedup of local SGD with mini-batch $b = 1$ for different numbers of workers $K$ and parameters $H \in \{1, 4, 16, 64, 256\}$. (a) Measured speedup, $\epsilon = 0.005$. (b) Measured speedup, $\epsilon = 0.0001$.

Figure 5: Measured speedup of local SGD with mini-batch $b = 16$ for different numbers of workers $K$ and parameters $H \in \{1, 4, 16, 64, 256\}$. (a) Measured speedup, $\epsilon = 0.005$. (b) Measured speedup, $\epsilon = 0.0001$.
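For readers who want to reproduce the qualitative behaviour without the original scripts, the following sketch (illustrative only: synthetic data stands in for w8a, the constant stepsize replaces the grid-searched schedules, and all names are assumptions) simulates mini-batch local SGD on the regularized logistic regression objective above.

```python
import numpy as np

# Sketch of the experimental setup of Section D: regularized logistic
# regression trained with (simulated) mini-batch local SGD. Synthetic data
# stands in for the w8a dataset, and the constant stepsize is an
# illustrative choice rather than the grid-searched schedules of the paper.
rng = np.random.default_rng(0)
n, d, K, H, b, eta, epochs = 2000, 50, 8, 16, 4, 0.1, 5
A = rng.standard_normal((n, d))
y = np.where(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n) >= 0, 1.0, -1.0)
lam = 1.0 / n

def f(x):                                    # regularized logistic loss
    return np.mean(np.logaddexp(0.0, -y * (A @ x))) + 0.5 * lam * (x @ x)

def grad(x, idx):                            # mini-batch stochastic gradient
    s = 0.5 * (1.0 - np.tanh(0.5 * y[idx] * (A[idx] @ x)))   # sigmoid(-y_i a_i^T x)
    return A[idx].T @ (-y[idx] * s) / len(idx) + lam * x

xs = np.zeros((K, d))                        # one iterate per (virtual) worker
for t in range(epochs * n // b):
    for k in range(K):
        idx = rng.integers(n, size=b)
        xs[k] -= eta * grad(xs[k], idx)
    if (t + 1) % H == 0:                     # global synchronization
        xs[:] = xs.mean(axis=0)
    if (t + 1) % 500 == 0:
        print("iteration", t + 1, "f =", float(f(xs.mean(axis=0))))
```

As in the paper's protocol, one would track the last iterate and the uniformly, linearly and quadratically weighted averages of the iterates and stop once any of them reaches the target accuracy.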