Asynchronous Stochastic Optimization Robust to Arbitrary Delays

Alon Cohen (Tel Aviv University and Google Research, Israel; alonco@tauex.tau.ac.il), Amit Daniely (Hebrew University of Jerusalem and Google Research, Israel; amit.daniely@mail.huji.ac.il), Yoel Drori (Google Research, Israel; dyoel@google.com), Tomer Koren (Tel Aviv University and Google Research, Israel; tkoren@tauex.tau.ac.il), Mariano Schain (Google Research, Israel; marianos@google.com)

Abstract

We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O(\sigma^2/\epsilon^4 + \tau/\epsilon^2)$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the average delay $\frac{1}{T}\sum_{t=1}^{T} d_t$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient descent achieves the same rate but with respect to the maximal delay $\max_t d_t$, which can be significantly larger than the average delay, especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.

1 Introduction

Gradient-based iterative optimization methods are widely used in large-scale machine learning applications as they are extremely simple to implement and use, and come with mild computational requirements. On the other hand, in their standard formulation they are also inherently serial and synchronous due to their iterative nature. For example, in stochastic gradient descent (SGD), each step involves an update of the form $x_{t+1} = x_t - \eta g_t$, where $x_t$ is the current iterate and $g_t$ is a (stochastic) gradient vector evaluated at $x_t$. To progress to the next step of the method, the subsequent iterate $x_{t+1}$ has to be fully determined by the end of step $t$, as it is required for future gradient queries. Evidently, this scheme has to wait for the computation of the gradient $g_t$ to complete (often the most computationally intensive part of SGD) before it can evaluate $x_{t+1}$.

In modern large-scale machine learning applications, a direct serial implementation of gradient methods like SGD is overly costly, and parallelizing the optimization process over several cores or machines is desired. Perhaps the most common parallelization approach is via mini-batching, where the computation of stochastic gradients is distributed across several worker machines that send updates to a parameter server. The parameter server is responsible for accruing the individual updates into a single averaged gradient and, consequently, for updating the optimization parameters using this gradient.

While mini-batching is well understood theoretically [e.g., 16, 9, 8, 10], it is still fundamentally synchronous in nature and its performance is adversely determined by the slowest worker machine: the parameter server must wait for all updates from all workers to arrive before it can update the model it maintains.
This could cause serious performance issues in heterogeneous distributed networks, where worker machines may be subject to unpredictable loads that vary significantly between workers (due to different hardware, communication bandwidth, etc.) and over time (due to varying user loads, power outages, etc.).

An alternative approach that has recently gained popularity is to employ asynchronous gradient updates [e.g., 21, 2, 7, 18, 11]; namely, each worker machine computes gradients independently of the other machines, possibly at different iterates, and sends updates to the parameter server in an asynchronous fashion. This implies the parameter server might be making stale updates based on delayed gradients taken at earlier, out-of-date iterates. While these methods often work well in practice, they have proven to be much more intricate and challenging to analyze theoretically than synchronous gradient methods, and overall our understanding of asynchronous updates remains lacking.

Recently, Arjevani et al. [4] and subsequently Stich and Karimireddy [26] have made significant progress in analyzing delayed asynchronous gradient methods. They have shown that in stochastic optimization, delays only affect a lower-order term in the convergence bounds. In other words, if the delays are not too large, the convergence rate of SGD may not be affected by the delays. ([4] first proved this for quadratic objectives; [26] then proved a more general result for smooth functions.) More concretely, Stich and Karimireddy [26] showed that SGD with a sufficiently attenuated step size, chosen to account for the delays, attains an iteration complexity bound of the form

$$O\!\left(\frac{\sigma^2}{\epsilon^4} + \frac{\tau_{\max}}{\epsilon^2}\right) \qquad (1)$$

for finding an $\epsilon$-stationary point of a possibly non-convex smooth objective function (namely, a point at which the gradient is of norm at most $\epsilon$). Here $\sigma^2$ is the variance of the noise in the stochastic gradients, and $\tau_{\max}$ is the maximal possible delay, which also needs to be known a priori for properly tuning the SGD step size. Up to the $\tau_{\max}$ factor in the second term, this bound is identical to standard iteration bounds for stochastic non-convex SGD without delays [12].

While the bound in Eq. (1) is a significant improvement over previous art, it is still lacking in one important aspect: the dependence on the maximal delay could be excessively large in truly asynchronous environments, making the second term in the bound the dominant term. For example, in heterogeneous or massively distributed networks, the maximal delay is effectively determined by the single slowest (or least reliable) worker machine, which is precisely the issue with synchronous methods we set out to address in the first place. Moreover, as Stich and Karimireddy [26] show, the step size used to achieve the bound in Eq. (1) could be as much as $\tau_{\max}$-times smaller than the step size used without delays, which could severely impact performance in practice.

1.1 Contribution

We propose a new algorithm for stochastic optimization with asynchronous delayed updates, which we call Picky SGD, that is significantly more robust than SGD, especially when the (empirical) distribution of delays is skewed or heavy-tailed and thus the maximal delay could be very large. For general smooth, possibly non-convex objectives, our algorithm achieves a convergence bound of the form

$$O\!\left(\frac{\sigma^2}{\epsilon^4} + \frac{\tau_{\mathrm{avg}}}{\epsilon^2}\right),$$

where now $\tau_{\mathrm{avg}}$ is the average delay in retrospect. This is a significant improvement over the bound in Eq. (1) whenever $\tau_{\mathrm{avg}} \ll \tau_{\max}$, which is indeed the case with heavy-tailed delay distributions (a small numerical illustration of this gap is given below).
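To get a feel for how large the gap between the average and the maximal delay can be, here is a quick numerical illustration (ours, not taken from the paper); the Pareto-shaped delay distribution and its parameters are placeholders chosen only to mimic a heavy-tailed delay profile:

```python
import numpy as np

# With a heavy-tailed delay distribution, the average delay is typically orders of
# magnitude smaller than the maximal delay, so a bound scaling with the average
# delay is far stronger than one scaling with the maximal delay.
rng = np.random.default_rng(0)
T = 1_000_000
delays = (10 * rng.pareto(a=1.5, size=T)).astype(int)  # placeholder heavy-tailed delays
print("average delay:", delays.mean())   # on the order of tens
print("maximal delay:", delays.max())    # typically tens of thousands or more
```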
Moreover, Picky SGD is very efficient, extremely simple to implement, and does not require knowing the average delay $\tau_{\mathrm{avg}}$ ahead of time for optimal tuning. In fact, the algorithm only relies on a single additional hyper-parameter beyond the step size. Notably, and in contrast to SGD as analyzed in previous work [26], our algorithm is able to employ a significantly larger effective step size, and thus one could expect it to perform well in practice compared to SGD. Indeed, we show in experiments that Picky SGD is able to converge quickly on large image classification tasks with a relatively high learning rate, even when very large delays are introduced. In contrast, in the same setting, SGD needs to be configured with a substantially reduced step size to be able to converge at all, consequently performing poorly compared to our algorithm.

Finally, we also address the case where $f$ is smooth and convex, in which we give a close variant of our algorithm with an analogous iteration complexity bound for obtaining a point $x$ with $f(x) - f(x^\star) \le \epsilon$ (where $x^\star$ is a minimizer of $f$ over $\mathbb{R}^d$). Here as well, our rate matches precisely the one obtained by the state of the art [26], but with the dependence on the maximal delay replaced by the average delay. For consistency of presentation, we defer details on the convex case to the full version of the paper [?] and focus here on our algorithm for non-convex optimization.

Concurrently with this work, Aviv et al. [5] derived similar bounds that depend on the average delay. Compared to our contribution, their results are adaptive to the smoothness and noise parameters, but on the other hand, they are restricted to convex functions, their algorithms are more elaborate, and their implementation is more involved.

1.2 Additional related work

For general background on distributed asynchronous optimization and basic asymptotic convergence results, we refer to the classic book by Bertsekas and Tsitsiklis [6]. Since the influential work of Niu et al. [24], there has been significant interest in asynchronous algorithms in a related model where there is a delay in updating individual parameters in a shared parameter vector (e.g., [25, 19, 28, 17]). This is of course very different from our model, where steps use the full gradient vector in atomic, yet delayed, updates.

Also related to our study is the literature on Local SGD (e.g., [27] and references therein), which is a distributed gradient method that performs several local (serial) gradient update steps before communicating with the parameter server or with other machines. Local SGD methods have become popular recently since they are used extensively in Federated Learning [20]. We note that the theoretical study in this line of work is mostly concerned with analyzing existing distributed variants of SGD used in practice, whereas we aim to develop and analyze new algorithmic tools to help with mitigating the effect of stale gradients in asynchronous optimization.

A related yet orthogonal issue in distributed optimization, which we do not address here, is reducing the communication load between the workers and servers. One approach that was recently studied extensively is to compress gradient updates before they are transmitted over the network. We refer to [3, 14, 26] for further discussion and references.
2 Setup and Basic Definitions

2.1 Stochastic non-convex smooth optimization

We consider stochastic optimization of a $\beta$-smooth (not necessarily convex) non-negative function $f$ defined over the $d$-dimensional Euclidean space $\mathbb{R}^d$. A function $f$ is said to be $\beta$-smooth if it is differentiable and its gradient operator is $\beta$-Lipschitz, that is, if $\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|$ for all $x, y \in \mathbb{R}^d$. This in particular implies (e.g., [22]) that for all $x, y \in \mathbb{R}^d$,

$$f(y) \;\le\; f(x) + \nabla f(x)^{\top}(y - x) + \frac{\beta}{2}\|y - x\|^2. \qquad (2)$$

We assume stochastic first-order oracle access to $f$; namely, $f$ is endowed with a stochastic gradient oracle that, given a point $x \in \mathbb{R}^d$, returns a random vector $g(x)$, independent of all past randomization, such that $\mathbb{E}[g(x) \mid x] = \nabla f(x)$ and $\mathbb{E}[\|g(x) - \nabla f(x)\|^2 \mid x] \le \sigma^2$ for some variance bound $\sigma^2 \ge 0$. In this setting, our goal is to find an $\epsilon$-stationary point of $f$, namely, a point $x \in \mathbb{R}^d$ such that $\|\nabla f(x)\| \le \epsilon$, with as few samples of stochastic gradients as possible.

2.2 Asynchronous delay model

We consider an abstract setting where stochastic gradients (namely, outputs of invocations of the stochastic first-order oracle) are received asynchronously and are subject to arbitrary delays. The asynchronous model can be abstracted as follows. We assume that at each step $t$ of the optimization, the algorithm obtains a pair $(x_{t-d_t}, g_t)$, where $g_t$ is a stochastic gradient at $x_{t-d_t}$ with variance bounded by $\sigma^2$; namely, $g_t$ is a random vector such that $\mathbb{E}_t[g_t] = \nabla f(x_{t-d_t})$ and $\mathbb{E}_t\|g_t - \nabla f(x_{t-d_t})\|^2 \le \sigma^2$, for some delay $0 \le d_t < t$. Here and throughout, $\mathbb{E}_t[\cdot]$ denotes the expectation conditioned on all randomness drawn before step $t$. After processing the received gradient update, the algorithm may query a new stochastic gradient at whatever point it chooses (the result of this query will be received with a delay, as above).

A few remarks are in order:

- We stress that the delays $d_1, d_2, \ldots$ are entirely arbitrary, possibly chosen by an adversary; in particular, we do not assume they are sampled from a fixed stationary distribution. Nevertheless, we assume that the delays are independent of the randomness of the stochastic gradients (and of the internal randomness of the optimization algorithm, if any).¹

- For simplicity, we assumed above that a stochastic gradient is received at every round $t$. This is almost without loss of generality:² if at some round no feedback is observed, we may simply skip the round without affecting the rest of the optimization process (up to a re-indexing of the remaining rounds). Similarly, we will also assume that only a single gradient is obtained in each step; the scenario where multiple gradients arrive at the same step (as in mini-batched methods) can be simulated by several subsequent iterations, in each of which a single gradient is processed.

¹ One can thus think of the sequence of delays as being fixed ahead of time by an oblivious adversary.

² We may, in principle, allow querying the stochastic gradient oracle even on rounds where no feedback is received; however, this would be redundant in most reasonable instantiations of this model (e.g., in a parameter server architecture).

3 The Picky SGD Algorithm

We are now ready to present our asynchronous stochastic optimization algorithm, which we call Picky SGD; see the pseudo-code in Algorithm 1. The algorithm is essentially a variant of stochastic gradient descent, parameterized by a learning rate $\eta$ as well as a target accuracy $\epsilon$.

Algorithm 1: Picky SGD
1: input: learning rate $\eta$, target accuracy $\epsilon$.
2: for $t = 1, \ldots, T$ do
3:   receive delayed stochastic gradient $g_t$ and point $x_{t-d_t}$ such that $\mathbb{E}_t[g_t] = \nabla f(x_{t-d_t})$.
4:   if $\|x_t - x_{t-d_t}\| \le \epsilon/(2\beta)$ then
5:     update: $x_{t+1} = x_t - \eta g_t$.
6:   else
7:     pass: $x_{t+1} = x_t$.
8:   end if
9: end for

Picky SGD maintains a sequence of iterates $x_1, \ldots, x_T$. At step $t$, the algorithm receives a delayed stochastic gradient $g_t$ that was computed at an earlier iterate $x_{t-d_t}$ (line 3). Then, in line 4, the algorithm tests whether $\|x_t - x_{t-d_t}\| \le \epsilon/(2\beta)$; a short illustrative sketch of this update loop is given below.
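The following is a minimal Python sketch of the loop, for illustration only (it is not the authors' implementation): the toy quadratic objective, the noise level, and the geometric delay distribution are placeholder assumptions, while the distance test and the update mirror lines 4-7 of Algorithm 1.

```python
import numpy as np

def grad_oracle(x, sigma, rng):
    # Stochastic gradient of the placeholder objective f(x) = 0.5 * ||x||^2 (so beta = 1).
    return x + sigma * rng.normal(size=x.shape)

def picky_sgd(T=10_000, dim=10, eta=0.05, eps=0.1, beta=1.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)              # x_1
    history = [x.copy()]                  # iterates x_1, ..., x_t
    threshold = eps / (2 * beta)          # the test of line 4 in Algorithm 1
    for t in range(1, T + 1):
        d = min(rng.geometric(0.1) - 1, t - 1)   # arbitrary delay 0 <= d_t < t (placeholder)
        x_stale = history[t - 1 - d]             # x_{t - d_t}
        g = grad_oracle(x_stale, sigma, rng)     # delayed stochastic gradient g_t
        if np.linalg.norm(x - x_stale) <= threshold:
            x = x - eta * g                      # line 5: update
        # else: line 7: pass, keep x_{t+1} = x_t
        history.append(x.copy())
    return x

print(np.linalg.norm(picky_sgd()))  # for this toy objective, ||x|| equals the gradient norm
```

Beyond plain delayed SGD, the only extra state the server needs in such a sketch is enough of the iterate history to evaluate the distance test, which is what keeps the method simple to deploy in a parameter-server setting.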
Intuitively, this test aims to verify whether the delayed (expected) gradient $\nabla f(x_{t-d_t})$ is similar to the gradient $\nabla f(x_t)$ at the current iterate $x_t$; due to the smoothness of $f$, we expect that if $x_{t-d_t}$ is close to $x_t$, then the corresponding gradients will also be similar. If this condition holds true, the algorithm takes a gradient step using $g_t$ with step size $\eta$.

Our main theoretical result is the following guarantee on the success of the algorithm.

Theorem 1. Suppose that Algorithm 1 is initialized at $x_1 \in \mathbb{R}^d$ with $f(x_1) \le F$ and run with $\eta = \frac{1}{4\beta}\min\{1, \epsilon^2/\sigma^2\}$ for $T = \Omega\big(\beta F\big(\frac{\sigma^2}{\epsilon^4} + \frac{\tau+1}{\epsilon^2}\big)\big)$ steps, where $\tau$ is the average delay, i.e., $\tau = \frac{1}{T}\sum_{t=1}^{T} d_t$. Then, with probability at least $1/2$, there is some $1 \le t \le T$ for which $\|\nabla f(x_t)\| \le \epsilon$.

Observe that the optimal step size in Theorem 1 is independent of the average delay $\tau$. This is important for two main reasons: (i) implementing the algorithm does not require knowledge about future, yet-to-be-seen delays; and (ii) even with very large delays, the algorithm can maintain a high effective step size.

We note that the guarantee of Theorem 1 is slightly different from typical bounds in non-convex optimization (e.g., the bounds appearing in the previous work [14]): our result concerns the minimal gradient norm over the iterates rather than the average gradient norm. Arguably, this difference does not represent a very strong limitation: the significance of convergence bounds in non-convex optimization is, in fact, that they ensure that one of the iterates along the trajectory of the algorithm is indeed an approximate critical point, and the type of bound we establish is sufficient to ensure exactly that.

We further note that while the theorem above only guarantees a constant success probability, it is not hard to amplify this probability to an arbitrary $1 - \delta$ simply by restarting the algorithm $O(\log(1/\delta))$ times (with independent stochastic gradients); with high probability, one of the repetitions will be successful and pass through a point with gradient norm at most $\epsilon$, which would imply the guarantee in the theorem with probability at least $1 - \delta$.

4 Analysis

In this section we analyze Algorithm 1 and prove our main result. Throughout, we denote $x'_t = x_{t-d_t}$ and let $N_t$ denote the noise vector at step $t$, namely $N_t = g_t - \nabla f(x'_t)$. Note that $\mathbb{E}[N_t \mid x_t, x'_t] = 0$ and $\mathbb{E}[\|N_t\|^2 \mid x_t, x'_t] \le \sigma^2$, since the iterates $x_t, x'_t$ are conditionally independent of the noise in $g_t$, as this gradient is obtained by the algorithm only at step $t$, after $x_t, x'_t$ were determined.

To prove Theorem 1, we will analyze a variant of the algorithm that stops making updates once it finds a point with $\|\nabla f(x)\| \le \epsilon$ (and eventually fails otherwise). That is, if $\|x_t - x'_t\| > \epsilon/(2\beta)$ or $\|\nabla f(x_t)\| \le \epsilon$, then $x_{t+1} = x_t$; else, $x_{t+1} = x_t - \eta g_t$. This variant is impossible to implement (since it needs to compute the exact gradient at each step), but the guarantee of Theorem 1 is valid for this variant if and only if it is valid for the original algorithm: one encounters an $\epsilon$-stationary point if and only if the other does so.

First, we prove a simple technical lemma guaranteeing that whenever the algorithm takes a step, a large gradient norm implies a large decrease in function value.
It is a variant of the classical descent lemma, adapted to the case where the gradient step is taken with respect to a gradient computed at a nearby point.

Lemma 2. Fix $x, x' \in \mathbb{R}^d$ with $\|x - x'\| \le \epsilon/(2\beta)$ and $\|\nabla f(x')\| > \epsilon$. Let $N \in \mathbb{R}^d$ be a random vector with $\mathbb{E}[N \mid x, x'] = 0$ and $\mathbb{E}[\|N\|^2 \mid x, x'] \le \sigma^2$. Then,

$$\mathbb{E}\big[f\big(x - \eta(\nabla f(x') + N)\big)\big] \;\le\; \mathbb{E} f(x) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(x')\|^2 + \frac{\eta^2\beta}{2}\big(\sigma^2 + \mathbb{E}\|\nabla f(x')\|^2\big).$$

In particular, for our choice of $\eta$, we have

$$\frac{\eta}{4}\,\mathbb{E}\|\nabla f(x')\|^2 \;\le\; \mathbb{E} f(x) - \mathbb{E}\big[f\big(x - \eta(\nabla f(x') + N)\big)\big]. \qquad (3)$$

Proof. Using the smoothness of $f$ (Eq. (2)), we have

$$f\big(x - \eta(\nabla f(x') + N)\big) \;\le\; f(x) - \eta\,\nabla f(x)^{\top}\big(\nabla f(x') + N\big) + \tfrac{1}{2}\eta^2\beta\,\big\|\nabla f(x') + N\big\|^2.$$

Taking expectation over $N$ conditioned on $x, x'$, we get

$$
\begin{aligned}
\mathbb{E}\big[f\big(x - \eta(\nabla f(x') + N)\big) - f(x) \,\big|\, x, x'\big]
&\le -\eta\,\nabla f(x)^{\top}\nabla f(x') + \tfrac{1}{2}\eta^2\beta\big(\|\nabla f(x')\|^2 + \sigma^2\big) \\
&= -\eta\,\nabla f(x')^{\top}\nabla f(x') - \eta\,\nabla f(x')^{\top}\big(\nabla f(x) - \nabla f(x')\big) + \tfrac{1}{2}\eta^2\beta\big(\|\nabla f(x')\|^2 + \sigma^2\big) \\
&\le -\eta\,\|\nabla f(x')\|^2 + \eta\beta\,\|\nabla f(x')\|\,\|x - x'\| + \tfrac{1}{2}\eta^2\beta\big(\|\nabla f(x')\|^2 + \sigma^2\big) \\
&= -\eta\big(\|\nabla f(x')\|^2 - \beta\,\|\nabla f(x')\|\,\|x - x'\|\big) + \tfrac{1}{2}\eta^2\beta\big(\|\nabla f(x')\|^2 + \sigma^2\big).
\end{aligned}
$$

Since $\epsilon \le \|\nabla f(x')\|$ and $\|x - x'\| \le \epsilon/(2\beta)$, we have $\beta\,\|\nabla f(x')\|\,\|x - x'\| \le \tfrac{1}{2}\|\nabla f(x')\|^2$, and therefore

$$\mathbb{E}\big[f\big(x - \eta(\nabla f(x') + N)\big) - f(x) \,\big|\, x, x'\big] \;\le\; -\frac{\eta}{2}\|\nabla f(x')\|^2 + \frac{1}{2}\eta^2\beta\big(\sigma^2 + \|\nabla f(x')\|^2\big).$$

If $\epsilon \ge \sigma$, then $\sigma^2 \le \|\nabla f(x')\|^2$. This, with $\eta = 1/(4\beta)$, yields Eq. (3). If $\epsilon < \sigma$ and $\eta = \epsilon^2/(4\sigma^2\beta)$, then $\eta^2 \le \epsilon^2/(16\sigma^2\beta^2)$. Plugging that in instead, using $\|\nabla f(x')\| \ge \epsilon$, and taking expectations (with respect to $x, x'$) yields Eq. (3).

We next introduce a bit of additional notation. We denote by $I_t$ the indicator of the event that the algorithm performed an update at time $t$; namely, $I_t = \mathbb{I}\{\|x_t - x'_t\| \le \epsilon/(2\beta) \text{ and } \|\nabla f(x_t)\| > \epsilon\}$. Note that $I_t = 1$ implies that $\|\nabla f(x_s)\| > \epsilon$ for all $s = 1, \ldots, t$. Further, we denote by $\Delta_t = f(x_t) - f(x_{t+1})$ the improvement at time $t$. Since $f$ is non-negative and $f(x_1) \le F$, we have that for all $t$,

$$\sum_{i=1}^{t} \Delta_i \;=\; f(x_1) - f(x_{t+1}) \;\le\; F.$$

Note that by Lemma 2 we have that $\mathbb{E}\,\Delta_t \ge 0$. The rest of the proof is split into two cases: $\sigma \le \epsilon$ and $\sigma > \epsilon$.

4.1 Case (i): σ ≤ ϵ

This regime is intuitively the low-noise regime, in which the standard deviation of the gradient noise, $\sigma$, is smaller than the desired accuracy $\epsilon$. We prove the following.

Lemma 3. Suppose that $\sigma \le \epsilon$ and the algorithm fails with probability at least $1/2$. Then $T \le 128\beta F(\tau + 1)/\epsilon^2$.

To prove the lemma above, we first show that the algorithm must make a significant number of updates, as shown by the following lemma.

Lemma 4. If the algorithm fails, then the number of updates that it makes is at least $T/(4(\tau + 1))$.

Proof. Consider $U_{2\tau}$, the number of steps $t$ for which the delay $d_t$ is at least $2\tau$. We must have $U_{2\tau} \le T/2$ (otherwise the total sum of delays exceeds $\tau T$, contradicting the definition of $\tau$). On the other hand, let $k$ be the number of updates that the algorithm makes, and let $t_1 < t_2 < \cdots < t_k$ be the steps in which an update is made. Denote $t_0 = 0$ and $t_{k+1} = T$. Now, fix $i$ and consider the steps at times $s_n = t_i + n$ for $n \in \{1, 2, \ldots, t_{i+1} - t_i - 1\}$. In all of those steps no update takes place and $x_{s_n} = x_{t_i}$. We must have $d_{s_n} > n$ for all $n$ (otherwise $x_t = x_{t-d_t}$ for $t = s_n$ and an update occurs). In particular, we have that $d_{s_n} \ge 2\tau$ for at least $t_{i+1} - t_i - 1 - 2\tau$ of the steps in $[t_i, t_{i+1}]$. Hence,

$$U_{2\tau} \;\ge\; \sum_{i=0}^{k} \big(t_{i+1} - t_i - 1 - 2\tau\big) \;=\; T - k(1 + 2\tau).$$

Finally, it follows that $T - k(1 + 2\tau) \le T/2$, which implies $k \ge \frac{T}{4(\tau+1)}$.

Given the lemma above, we prove Lemma 3 by showing that if the algorithm fails, it makes many updates, in all of which $\|\nabla f(x_t)\| > \epsilon$. By Lemma 2, this means that during the $T$ time steps of the algorithm it must decrease the value of $f$ significantly. Since we start at a point in which $f(x_1) \le F$, we must conclude that $T$ cannot be too large.

Proof of Lemma 3. Combining Eq. (3) with $\eta = 1/(4\beta)$ and Lemma 4, we get that if the algorithm fails with probability at least $1/2$, then

$$
F \;\ge\; \sum_{t=1}^{T} \mathbb{E}\,\Delta_t \;\ge\; \frac{1}{16\beta}\sum_{t=1}^{T} \mathbb{E}\big[I_t\,\|\nabla f(x_t)\|^2\big] \;=\; \frac{1}{16\beta}\,\mathbb{E}\Bigg[\sum_{t=1}^{T} I_t\,\|\nabla f(x_t)\|^2\Bigg] \;\ge\; \frac{\epsilon^2}{16\beta}\,\mathbb{E}\Bigg[\sum_{t=1}^{T} I_t \;\Bigg|\; \text{algorithm fails}\Bigg]\,\Pr\big[\text{algorithm fails}\big] \;\ge\; \frac{\epsilon^2}{32\beta}\cdot\frac{T}{4(\tau + 1)}.
$$

This yields the lemma's statement.
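As a quick numerical sanity check of the counting argument in Lemma 4 (our own illustration, not part of the paper's analysis), the sketch below simulates the worst-case update pattern of a failing run, in which an update is taken only when the stale point coincides with the current iterate, and compares the resulting number of updates against the $T/(4(\tau+1))$ bound for a few placeholder random delay sequences:

```python
import numpy as np

def worst_case_update_count(delays):
    """In a failing run, an update must occur at step t whenever no update has happened
    in steps t - d_t, ..., t - 1 (then x_t == x_{t-d_t} and the distance test passes).
    Counting only these forced updates gives a conservative estimate of the update count."""
    last_update = 0   # step at which the iterate last changed (0 = never)
    k = 0
    for t, d in enumerate(delays, start=1):
        if t - d > last_update:
            k += 1
            last_update = t
    return k

rng = np.random.default_rng(1)
T = 100_000
for mean_delay in (2.0, 20.0, 200.0):
    raw = rng.geometric(1.0 / mean_delay, size=T)          # placeholder delay distribution
    delays = np.minimum(raw, np.arange(1, T + 1)) - 1      # enforce 0 <= d_t < t
    tau = delays.mean()
    k = worst_case_update_count(delays)
    print(f"tau={tau:8.1f}  updates={k:7d}  bound T/(4(tau+1))={T / (4 * (tau + 1)):9.1f}")
```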
4.2 Case (ii): σ > ϵ

This is the high-noise regime. For this case, we prove the following guarantee for the convergence of our algorithm.

Lemma 5. Assume that $\sigma > \epsilon$ and that the algorithm fails with probability at least $1/2$. Then

$$\sum_{t=1}^{T} \mathbb{E}\,\Delta_t \;\ge\; \frac{T}{1000\beta}\,\min\Big\{\frac{\epsilon^2}{2\tau + 1},\; \frac{\epsilon^4}{\sigma^2}\Big\}.$$

In particular, $T \le 1000\beta F\big(\frac{2\tau+1}{\epsilon^2} + \frac{\sigma^2}{\epsilon^4}\big)$.

This result is attained using the following observation. Consider the iterate of the algorithm at time $t$, $x_t$, and the point at which the gradient was computed, $x'_t = x_{t-d_t}$. We claim that if the algorithm has not decreased the function value sufficiently during the interval $[t - d_t, t - 1]$, then it is likely to trigger a large decline in the function value at time $t$. Formally, either $\mathbb{E}\,\Delta_t$ is large, or $\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i$ is large. To show the claim, we first upper bound the distance $\|x_t - x'_t\|$ in terms of $\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i$, as shown by the following technical lemma.

Lemma 6. For all $t$ and $k$, it holds that

$$\mathbb{E}\,\|x_t - x_{t+k}\| \;\le\; \frac{4}{\epsilon}\sum_{i=t}^{t+k-1}\mathbb{E}\,\Delta_i \;+\; \sqrt{\frac{1}{\beta}\sum_{i=t}^{t+k-1}\mathbb{E}\,\Delta_i}.$$

Proof. We have

$$\mathbb{E}\,\|x_t - x_{t+k}\| \;=\; \eta\,\mathbb{E}\,\Bigg\|\sum_{i=t}^{t+k-1} I_i\big(\nabla f(x'_i) + N_i\big)\Bigg\| \;\le\; \eta\,\mathbb{E}\,\Bigg\|\sum_{i=t}^{t+k-1} I_i\,\nabla f(x'_i)\Bigg\| \;+\; \eta\,\mathbb{E}\,\Bigg\|\sum_{i=t}^{t+k-1} I_i\,N_i\Bigg\|.$$

We continue by bounding the second term above as follows:

$$
\begin{aligned}
\eta^2\,\mathbb{E}\,\Bigg\|\sum_{i=t}^{t+k-1} I_i\,N_i\Bigg\|^2
&= \eta^2 \sum_{i,j=t}^{t+k-1}\mathbb{E}\big[I_i I_j\, N_i^{\top} N_j\big] \\
&= \eta^2 \sum_{i=t}^{t+k-1}\mathbb{E}\big[I_i\,\|N_i\|^2\big] && (\mathbb{E}[N_i \mid I_i, I_j, N_j] = 0 \text{ for } i > j) \\
&\le \eta^2\sigma^2 \sum_{i=t}^{t+k-1}\mathbb{E}\,I_i
\;\le\; \frac{\eta^2\sigma^2}{\epsilon^2} \sum_{i=t}^{t+k-1}\mathbb{E}\big[I_i\,\|\nabla f(x'_i)\|^2\big] && (\|\nabla f(x'_i)\| \ge \epsilon \text{ when } I_i = 1) \\
&\le \frac{4\eta\sigma^2}{\epsilon^2} \sum_{i=t}^{t+k-1}\mathbb{E}\,\Delta_i && (\text{Eq. (3)}) \\
&= \frac{1}{\beta}\sum_{i=t}^{t+k-1}\mathbb{E}\,\Delta_i, && (\eta = \epsilon^2/4\beta\sigma^2)
\end{aligned}
$$

so that, by Jensen's inequality, $\eta\,\mathbb{E}\,\|\sum_i I_i N_i\| \le \sqrt{\frac{1}{\beta}\sum_{i=t}^{t+k-1}\mathbb{E}\,\Delta_i}$. Similarly, for the first term,

$$
\begin{aligned}
\eta\,\mathbb{E}\,\Bigg\|\sum_{i=t}^{t+k-1} I_i\,\nabla f(x'_i)\Bigg\|
&\le \eta \sum_{i=t}^{t+k-1}\mathbb{E}\big[I_i\,\|\nabla f(x'_i)\|\big] \\
&\le \frac{\eta}{\epsilon} \sum_{i=t}^{t+k-1}\mathbb{E}\big[I_i\,\|\nabla f(x'_i)\|^2\big] && (\|\nabla f(x'_i)\| \ge \epsilon \text{ when } I_i = 1) \\
&\le \frac{4}{\epsilon} \sum_{i=t}^{t+k-1}\mathbb{E}\,\Delta_i. && (\text{Eq. (3)})
\end{aligned}
$$

This completes the proof.

Given the lemma above, it is now clear that if $\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i$ is sufficiently small, then $\mathbb{E}\,\|x_t - x'_t\|$ is much smaller than $\epsilon/\beta$, which means that the algorithm is likely (with constant probability) to take a step at time $t$. This argument yields the following.

Corollary 7. Assume that the algorithm fails with probability at least $1/2$. If $\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i < \epsilon^2/(125\beta)$, then $\mathbb{E}\,\Delta_t \ge \epsilon^4/(64\sigma^2\beta)$. In particular,

$$\mathbb{E}\,\Delta_t + \frac{1}{2\tau+1}\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i \;\ge\; \frac{1}{250\beta}\,\min\Big\{\frac{\epsilon^2}{2\tau+1},\; \frac{\epsilon^4}{\sigma^2}\Big\}.$$

Proof. If $\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i < \epsilon^2/(125\beta)$, then $\mathbb{E}\,\|x_{t-d_t} - x_t\| \le \epsilon/(8\beta)$ by Lemma 6. By Markov's inequality, with probability at least $3/4$ we have $\|x_{t-d_t} - x_t\| \le \epsilon/(2\beta)$. Since the probability that $\|\nabla f(x_{t-d_t})\| > \epsilon$ is at least $1/2$, we get that $\mathbb{E}\,I_t \ge 1/4$. By Lemma 2 this implies that

$$\mathbb{E}\,\Delta_t \;\ge\; \frac{\eta}{4}\,\mathbb{E}\big[I_t\,\|\nabla f(x'_t)\|^2\big] \;\ge\; \frac{\eta\,\epsilon^2}{16} \;=\; \frac{\epsilon^4}{64\sigma^2\beta},$$

which yields our claim.

We now prove our main claim. We show that if the algorithm fails, then in all time steps in which $d_t \le 2\tau$ (of which there are at least $T/2$), either the algorithm makes a substantial step, or it has made significant updates in the interval $[t - d_t, t - 1]$. In either case, the function value must decrease significantly overall during the $T$ time steps of the algorithm, implying that $T$ cannot be too large.

Proof of Lemma 5. Since the intervals $[t - d_t, t - 1]$ for steps $t$ with $d_t \le 2\tau$ are of length at most $2\tau$, every index $i$ appears in at most $2\tau$ of them, and therefore

$$\sum_{t \,:\, d_t \le 2\tau}\Bigg(\mathbb{E}\,\Delta_t + \frac{1}{2\tau+1}\sum_{i=t-d_t}^{t-1}\mathbb{E}\,\Delta_i\Bigg) \;\le\; 2\sum_{t=1}^{T}\mathbb{E}\,\Delta_t.$$

Hence, using Corollary 7,

$$2\sum_{t=1}^{T}\mathbb{E}\,\Delta_t \;\ge\; \big|\{t : d_t \le 2\tau\}\big| \cdot \frac{1}{250\beta}\,\min\Big\{\frac{\epsilon^2}{2\tau+1}, \frac{\epsilon^4}{\sigma^2}\Big\} \;\ge\; \frac{T}{2}\cdot\frac{1}{250\beta}\,\min\Big\{\frac{\epsilon^2}{2\tau+1}, \frac{\epsilon^4}{\sigma^2}\Big\} \;=\; \frac{T}{500\beta}\,\min\Big\{\frac{\epsilon^2}{2\tau+1}, \frac{\epsilon^4}{\sigma^2}\Big\},$$

where we used Markov's inequality to show that $|\{t : d_t \le 2\tau\}| \ge T/2$. Dividing by two yields the first claim of the lemma; the second follows since $\sum_{t=1}^{T}\mathbb{E}\,\Delta_t \le F$ and $1/\min\{a, b\} \le 1/a + 1/b$.

4.3 Concluding the proof

Proof of Theorem 1. In the case $\sigma \le \epsilon$, Lemma 3 implies that if $T > 128\beta F(\tau + 1)/\epsilon^2$ then the algorithm succeeds with probability greater than $1/2$, which yields the theorem in this case. Similarly, Lemma 5 gives our claim in the case $\sigma > \epsilon$.

5 Experiments

To illustrate the robustness and efficacy of Picky SGD, we present a comparison between the performance of SGD and Picky SGD under various delay distributions. In particular, we show that Picky SGD requires significantly fewer iterations to reach a fixed goal and is more robust to varying delay distributions.

The main goal of our experimental setup is to be reproducible. To that end, the experimentation is done in two phases.
First, we perform a simulation to determine the delay $d_t$ at each iteration without actually computing any gradients:³ this is done by simulating $N$ concurrent worker threads sharing and collectively advancing a global iteration number, where each worker repeatedly records the current global iteration number $t_{\mathrm{start}}$, waits a random amount of time drawn from a prescribed Poisson distribution, then records the new global iteration number $t = t_{\mathrm{end}}$ and the difference $d_t = t_{\mathrm{end}} - t_{\mathrm{start}}$, and increases the global iteration number. This information (a delay schedule) is calculated once for each tested scheme (differing in the number of workers and the random distribution, as detailed below), and is stored for use in the second phase.

In the second phase of the experiments, the algorithms SGD and Picky SGD are executed for each delay schedule. Here, at every iteration the gradient is computed (if needed) and is kept until its usage as dictated by the schedule (and is then applied at the appropriate global iteration number). As a result of this configuration, we get a fully reproducible set of experiments, where the algorithms' performance may be compared as they are executed over identical delay series with identical statistical properties.

We created four different delay schedules: a baseline schedule (A) using $N = 10$ workers and sampling the simulated wait from a Poisson distribution (this schedule serves to compare Picky SGD and SGD in a setting of relatively small delay variance), and schedules (B), (C), and (D), all using $N = 75$ workers and sampling the simulated wait from bi-modal mixtures of Poisson distributions of similar mean but increasing variance, respectively.⁴ See Figure 2 in the full version of the paper [?] for an illustration of the delay distributions of the four delay schedules used.

All training is performed on the standard CIFAR-10 dataset [15] using a ResNet56 model with 9 blocks [13], implemented in TensorFlow [1]. We compare Picky SGD (Algorithm 1) to the SGD algorithm, which unconditionally updates the state $x_t$ given the stochastic delayed gradient $g_t$ (recall that $g_t$ is the stochastic gradient at state $x_{t-d_t}$). For both algorithms, instead of a constant learning rate $\eta$ we use a piecewise-linear learning rate schedule, as follows: we consider a baseline $\eta_0$ piecewise-linear learning rate schedule⁵ that achieves optimal performance in a synchronous distributed optimization setting (that is, for $d_t \equiv 0$),⁶ and we search (for each of the four delay schedules and each algorithm, to compensate for the effects of delays) for the best multiple of the baseline rate and the best first rate-change point. Alternatively, we also used a cosine-decay learning rate schedule (with the duration of the decay as a meta-parameter). Another meta-parameter we optimize is the threshold $\epsilon/(2\beta)$ in line 4 of Picky SGD. Batch size 64 was used throughout the experiments.

Note that although we chose the threshold value $\epsilon/(2\beta)$ by an exhaustive search, in practice a good choice can be found by logging the distance values during a typical execution and choosing a high percentile value. See the full version of the paper [?] for more details.

³ Note that, up to the training data ordering, a computation of $T$ steps of Picky SGD or SGD is uniquely determined by the starting state $x_1$ and the sequence $\{t - d_t\}_{t=1,\ldots,T}$.

⁴ See the full version of the paper [?] for specific parameter values and implementation details.

⁵ With rate changes at three achieved accuracy points: 0.93, 0.98, and 0.99.

⁶ This is also the best performance achievable in an asynchronous setting.
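For illustration, here is a minimal sketch (our own simplified rendering, not the authors' code) of the first, delay-schedule phase described above; the worker count and the Poisson mean are placeholder values:

```python
import heapq
import numpy as np

def simulate_delay_schedule(num_workers=10, num_steps=100_000, mean_wait=5.0, seed=0):
    """Simulate workers that each grab the current global step, wait a Poisson-distributed
    amount of time, and then deliver their gradient; the recorded delay is the number of
    global steps that elapsed while the worker was busy."""
    rng = np.random.default_rng(seed)
    global_step = 0
    # (finish_time, step_at_which_worker_started) for each in-flight worker
    events = [(rng.poisson(mean_wait), global_step) for _ in range(num_workers)]
    heapq.heapify(events)
    delays = []
    while len(delays) < num_steps:
        finish_time, t_start = heapq.heappop(events)      # next gradient to arrive
        delays.append(global_step - t_start)              # d_t = t_end - t_start
        global_step += 1                                  # the server applies an update
        heapq.heappush(events, (finish_time + rng.poisson(mean_wait), global_step))
    return np.array(delays)

delays = simulate_delay_schedule()
print("average delay:", delays.mean(), "max delay:", delays.max())
```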
[Figure 1: Accuracy trajectory (with a zoom-in on the tail of the convergence) over training epochs for the four delay schedules, respectively. The key metrics (reported in Table 1) for each trajectory are the number of epochs required to reach the 0.99 accuracy mark and the baseline learning rate multiplier $\eta/\eta_0$.]

5.2 Results

The accuracy trajectory for the best performing combination of parameters of each algorithm, for each of the four delay schedules, is shown in Fig. 1 and summarized in Table 1. Clearly, Picky SGD significantly outperforms SGD in terms of the final accuracy and the number of epochs it takes to achieve it. We also emphasize that the generalization performance (that is, the evaluation accuracy as related to the training accuracy) was not observed to vary across delay schedules or the applied algorithms (see, e.g., Fig. 4 in the full version of the paper [?]), and that the nature of the results is even more pronounced when using the alternative cosine-decay learning rate schedule (see Fig. 5 in the full version of the paper [?]). Specific details of the meta-parameters used, and additional performance figures, are reported in the full version of the paper [?].

Table 1: Summary of the key metrics from Fig. 1, for each of the four delay schedules A, B, C, and D.

| Schedule | Epochs to 0.99 accuracy (Picky SGD) | Epochs to 0.99 accuracy (SGD) | LR multiplier η/η0 (Picky SGD) | LR multiplier η/η0 (SGD) |
|---|---|---|---|---|
| A | 344 | 350 | 0.5 | 0.5 |
| B | 333 | 451 | 0.2 | 0.05 |
| C | 337 | 438 | 0.2 | 0.05 |
| D | 288 | 466 | 0.2 | 0.05 |

5.3 Discussion

We first observe that while the number of epochs it takes Picky SGD to reach the target accuracy mark is almost the same across the delay schedules (ranging from 288 to 344), SGD requires significantly more epochs to attain the target accuracy (ranging from 350 up to 466 for the highest-variance delay schedule). This is consistent with the average-delay dependence of the bound for Picky SGD (as stated in Theorem 1), compared to the max-delay dependence of the bound for SGD. Furthermore, the best baseline learning rate multiplier meta-parameter for Picky SGD is the same (0.2) across all high-variance delay schedules, while the respective meta-parameter for SGD is significantly smaller (0.05) and sometimes varying, explaining the need for more steps to reach the target and providing evidence of Picky SGD's superior robustness.

Acknowledgements

AD is partially supported by the Israeli Science Foundation (ISF) grant no. 2258/19. TK is partially supported by the Israeli Science Foundation (ISF) grant no. 2549/19, by the Len Blavatnik and the Blavatnik Family Foundation, and by the Yandex Initiative in Machine Learning.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 5451-5452. IEEE, 2012.

[3] Dan Alistarh, Demjan Grubic, Jerry Z Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1707-1718, 2017.

[4] Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. In Algorithmic Learning Theory, pages 111-132. PMLR, 2020.

[5] Rotem Zamir Aviv, Ido Hakimi, Assaf Schuster, and Kfir Yehuda Levy. Asynchronous distributed learning: Adapting to gradient delays without prior knowledge. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 436-445. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/aviv21a.html.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[7] Sorathan Chaturapruek, John C Duchi, and Christopher Ré. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. Advances in Neural Information Processing Systems, 28:1531-1539, 2015.

[8] Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 1647-1655, 2011.

[9] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.

[10] John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, 2012.

[11] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740-3754, 2016.

[12] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[14] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pages 3252-3261. PMLR, 2019.

[15] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, April 2009.

[16] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365-397, 2012.

[17] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. Journal of Machine Learning Research, 19:1-68, 2018.

[18] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization.
In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pages 2737-2745, 2015.

[19] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202-2229, 2017.

[20] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017.

[21] Angelia Nedić, Dimitri P Bertsekas, and Vivek S Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8(C):381-407, 2001.

[22] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.

[23] Yurii Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.

[24] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 693-701, 2011.

[25] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Pöczos, and Alex Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pages 2647-2655, 2015.

[26] Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed updates. Journal of Machine Learning Research, 21:1-36, 2020.

[27] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch SGD? In International Conference on Machine Learning, pages 10334-10343. PMLR, 2020.

[28] Kaiwen Zhou, Fanhua Shang, and James Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In International Conference on Machine Learning, pages 5980-5989. PMLR, 2018.