# Delay-Adaptive Distributed Stochastic Optimization

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Zhaolin Ren,1 Zhengyuan Zhou,2,4 Linhai Qiu,3 Ajay Deshpande,4 Jayant Kalagnanam4
1Harvard University, 2New York University, 3Google Inc., 4IBM Research

Abstract

In large-scale optimization problems, distributed asynchronous stochastic gradient descent (DASGD) is a commonly used algorithm. In most applications, there are many computing nodes asynchronously computing gradient information, so the gradient information received at a given iteration is often stale. In the presence of such delays, which can be unbounded, the convergence of DASGD is uncertain. The contribution of this paper is twofold. First, we propose a delay-adaptive variant of DASGD in which we adjust each iteration's step-size based on the size of the delay, and prove asymptotic convergence of the algorithm on variationally coherent stochastic problems, a class of functions that properly includes convex, quasi-convex, and star-convex functions. Second, we extend the convergence results of standard DASGD, usually stated for problems with bounded domains, to problems with unbounded domains. In this way, we extend the frontier of theoretical guarantees for distributed asynchronous optimization, and provide new insights for practitioners working on large-scale optimization problems.

1 Introduction

In recent years, rapid advances in computing infrastructures have led to a significant increase in the use of distributed stochastic optimization. There has correspondingly been intensive work studying distributed stochastic optimization, such as (Chaturapruek, Duchi, and Ré 2015), (Karakus et al. 2017), (Damaskinos et al. 2018), (Wu et al. 2018), (Assran et al. 2018), (Cutkosky and Busa-Fekete 2018), (Koloskova, Stich, and Jaggi 2019), (Yu and Jin 2019), (Xie, Koyejo, and Gupta 2019). Many optimization algorithms in distributed settings are first-order methods involving multiple computing nodes working asynchronously to perform stochastic gradient descent. For brevity, we will refer to such algorithms as distributed asynchronous stochastic gradient descent (DASGD) in this paper. Two common architectures on which DASGD is deployed are 1) a master-slave architecture, where each worker independently computes a noisy gradient for the master node, and 2) a multiprocessor shared-memory architecture, where there is no master node and workers asynchronously update a parameter in some shared memory. Both architectures are vulnerable to delays in the gradient computation. Such delays are of particular concern in volunteer computing grids, where there is no upper bound on the workers' delay.

Related Work and Our Contributions

Our paper addresses the robustness of DASGD-type algorithms in the presence of large, unbounded delays resulting from either the master-slave or shared-memory parallel computing architectures. In distributed deterministic optimization, asynchronous gradient descent has been shown to solve convex problems even if delays scale sublinearly over time (Bertsekas and Tsitsiklis 2003).
For stochastic convex problems, recent results by (Agarwal and Duchi 2011), (Recht et al. 2011), and (Chaturapruek, Duchi, and Ré 2015) have derived convergence rates for DASGD when the delay is bounded. A delay-adaptive DASGD variant for convex problems was proposed in (Sra et al. 2015), where the algorithm's step-size depends on the actual delay received. However, the authors of (Sra et al. 2015) assumed that the delays are either bounded or have finite first and second moments, whereas in many applications the delays might not be bounded. An algorithm for both convex and nonconvex problems with unbounded delays was introduced in (Peng et al. 2019). However, the proposed step-size in (Peng et al. 2019) is not delay-adaptive, and the authors there assumed finite-mean i.i.d. delays, while we make no distributional assumption.

Beyond convexity, the class of variationally coherent (VC) problems, which properly includes convex, quasi-convex, and star-convex objectives, was introduced in (Zhou et al. 2017a) and (Zhou et al. 2020a), and subsequently generalized to multi-player game settings ((Zhou et al. 2017c), (Zhou et al. 2017b), (Zhou et al. 2018a), (Mertikopoulos and Zhou 2019)). For VC problems on a bounded domain, (Zhou et al. 2018b) and (Zhou et al. 2020b) proved that DASGD converges almost surely to a global minimizer, even when the delays between gradient updates and requests grow at a polynomial rate. For synchronous stochastic optimization of variationally-coherent-like problems with unbounded domains, (Lelong 2008) proposed a random truncation approach that achieves convergence.

Our main contributions are twofold. First, we propose a delay-adaptive variant of DASGD for VC problems whose convergence is robust to any delay sequence arising from either the master-slave or shared-memory architecture, going beyond the polynomially growing delay allowed in (Zhou et al. 2020b). Developing such a theoretical guarantee is crucial given the arbitrary nature of delays in real-life computing grids. Second, we develop a modified version of DASGD that converges for VC problems with unbounded domains, building on the truncation technique pioneered in (Lelong 2008). Both contributions are, to the best of the authors' knowledge, novel in the literature.

2 Problem Setup

We consider the following problem:

$$\text{minimize } f(x) \quad \text{subject to } x \in X, \tag{Opt}$$

where the objective $f : X \to \mathbb{R}$ is of the form $f(x) = \mathbb{E}[F(x; \omega)]$ for some random function $F : X \times \Omega \to \mathbb{R}$, where $(\Omega, \mathcal{F}, \mathbb{P})$ is an underlying probability space and $X$ is a compact, convex subset of $\mathbb{R}^d$.

Assumptions

We make the following assumptions:

Assumption 1. $F$ satisfies the following:
1. $F(x; \omega)$ is differentiable in $x$ for almost every $\omega \in \Omega$.
2. $\nabla F(x; \omega)$ has bounded second moments, that is, $\mathbb{E}[\|\nabla F(x; \omega)\|_2^2] < \infty$ for all $x \in X$.
3. $\nabla F(x; \omega)$ is Lipschitz continuous in the mean on any compact set $\tilde{X}$: for any compact $\tilde{X}$, there exists some $L_{\tilde{X}}$ such that $\mathbb{E}[\nabla F(x; \omega)]$ is $L_{\tilde{X}}$-Lipschitz on $\tilde{X}$.

Assumption 2. The optimization problem (Opt) is variationally coherent in the mean, which means that

$$\mathbb{E}[\langle \nabla F(x; \omega), x - x^* \rangle] \ge 0 \tag{VC}$$

for all $x \in X$ and $x^* \in X^*$, where $X^*$ denotes the solution set of (Opt), with equality if and only if $x \in X^*$.

Variational coherence describes a wide class of optimization problems that properly includes convex, quasi-convex, and star-convex functions. We note that (VC) is a significantly weaker condition than convexity, as it only constrains gradients relative to the solution set $X^*$ and carries no information about generic pairs of points. For examples of such functions, see (Bottou 1998), (Zhou et al. 2020a).
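For intuition, the following minimal numerical check is illustrative and not from the paper: the test function $f(x) = \sqrt{|x|}$ is our own choice of a non-convex but quasi-convex objective with unique minimizer $x^* = 0$, and we verify the (deterministic) coherence inequality $\langle \nabla f(x), x - x^* \rangle \ge 0$ at sampled points.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = sqrt(|x|) for x != 0; f is non-convex but
    # quasi-convex with unique minimizer x* = 0 (illustrative example).
    return np.sign(x) / (2.0 * np.sqrt(np.abs(x)))

rng = np.random.default_rng(0)
x_star = 0.0
xs = rng.uniform(-10.0, 10.0, size=1000)
xs = xs[xs != 0.0]                     # gradient undefined at the minimizer
vc = grad_f(xs) * (xs - x_star)        # <grad f(x), x - x*> = sqrt(|x|)/2
print("min over samples:", vc.min())   # strictly positive, so (VC) holds
```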
DASGD: A Framework for Asynchronous Optimization

Our goal in this paper is to deal with delays that occur in either 1) master-slave systems or 2) multi-processor systems with shared memory.

1. In distributed master-slave gradient descent, workers asynchronously compute stochastic gradients and send them to the master. Meanwhile, the master updates the global state of the system and pushes the updated state back to the workers.
2. In a multi-processor system with shared memory, all processors access a global, shared memory, which contains both the data needed to compute a stochastic gradient and the current iterate. Each processor independently and asynchronously reads the current global iterate, computes a stochastic gradient, and then updates the shared global iterate.

Both systems are common asynchronous computing architectures used in practice, and are described in more detail in (Zhou et al. 2020b). We can unify both architectures under the DASGD framework in Algorithm 1.

Algorithm 1: Distributed Asynchronous Stochastic Gradient Descent

Require: $Y_0 \in \mathbb{R}^d$
  repeat
    $X_n = \mathrm{Proj}_X(Y_n)$
    $Y_{n+1} = Y_n - \alpha_{n+1} \nabla F(X_{s(n)}; \omega_{n+1})$
  until end
  return solution candidate $X_n$

In DASGD, note that $s(n)$ is the iteration of the gradient used at iteration $n$. Let $d_n = n - s(n)$ denote the delay. We say that the delay is (i) linear, (ii) polynomial, or (iii) exponential if $d_n$ is respectively (i) linear, (ii) polynomial, or (iii) exponential as a function of $s(n)$. For instance, if $s(n) = \sqrt{n}$, so that $d_n = n - \sqrt{n} = s(n)^2 - s(n)$, we say that the delay received at iteration $n$ is polynomial.

3 Adapting Step-size to Delay

Intuitively, using a fixed step-size sequence that ignores the delay information might jeopardize convergence, given the potential for unbounded delay. It is therefore important to develop a delay-adaptive step-size sequence that is robust to bad delays. In this section, we first state and prove two elementary results on the delay sequence resulting from either a master-slave or shared-memory architecture, before providing intuition for our proposed delay-adaptive step-size.

Results on Delay Sequence

We start by better understanding the delay generated by either (i) the master-slave architecture, or (ii) the multiprocessor system with shared memory. We show that the delays generated are, for the most part, not too stale. To see this, assume that there are $K$ workers computing the gradients, where $K \in \mathbb{Z}_+$. In a master-slave system, the iterate $x_t$ from any particular time $t$ is used in the computation of future gradients exactly once. Meanwhile, in a shared-memory system, multiple workers can read the same $x_t$. While the $K$ gradients computed by the $K$ workers are then different (due to different stochasticity), they could all stem from the same global iterate $x_t$. Hence, $x_t$ from any particular time $t$ can be used in the computation of gradients for future updates at most $K$ times. This observation gives rise to the following result.

Lemma 3.1. Suppose there are $K$ workers in a master-slave or shared-memory system. Then, for any $N > 0$, there exist at least $N/2$ indices $n \in [N]$ such that $s(n) \ge n/(2K)$.

Proof. Note that the first $N/(2K)$ iterates can each be used at most $K$ times, so in total at most $N/2$ iterations $n \in [N]$ can use gradients computed from the first $N/(2K)$ iterates. Hence, for the remaining indices $n \in [N]$ (and there are at least $N/2$ such indices), $s(n) \ge N/(2K) \ge n/(2K)$.

This shows that more often than not, the delay is no worse than linear. This is an interesting observation in its own right. It also implies that there are infinitely many $n$ such that $s(n) \ge n/(2K)$, which we will use later to prove convergence of DASGD along a subsequence.
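As a sanity check on Lemma 3.1, the following toy simulation is our own illustration (the uniformly-random-worker model and all parameters are assumptions, not the paper's setup): at each global iteration a random worker returns a gradient computed from the iterate it last read, and we count how many indices satisfy $s(n) \ge n/(2K)$.

```python
import random

# Toy master-slave model: at each global iteration n, a uniformly random
# worker returns a gradient computed from the iterate it last read; that
# read index is s(n). Each iterate is read by exactly one worker here.
K, N = 8, 100_000
random.seed(1)
last_read = [0] * K          # iterate index each worker last read
s = []                       # s(n) for n = 1..N
for n in range(1, N + 1):
    w = random.randrange(K)  # worker w returns its gradient
    s.append(last_read[w])
    last_read[w] = n         # worker then reads the freshly updated iterate
good = sum(1 for n, sn in enumerate(s, start=1) if sn >= n / (2 * K))
print(f"{good} of {N} indices satisfy s(n) >= n/(2K)")  # Lemma 3.1: >= N/2
```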
Lemma 3.2. As $n$ goes to infinity, so does $s(n)$.

Proof. Suppose otherwise. Then there exists some $C > 0$ such that $s(n) \le C$ for infinitely many $n$. But since there are only $K$ workers, each iterate is used at most $K$ times, so the first $C$ iterates can serve as the source of at most $CK$ gradient computations, a contradiction.

A Delay-adaptive Step-size Proposal

In developing a new delay-adaptive step-size, we build on an insight from (Zhou et al. 2020b). There, to deal with polynomially growing delay, the authors used the step-size sequence $\alpha_{n+1} = \frac{1}{n \log n \log\log n}$. This suggests that we might want to build our step-size on top of that sequence and incorporate a delay-adaptive adjustment term. Suppose $0 < c < 1$, and consider the step-size

$$\alpha_{n+1} = \frac{1}{n \log n \log\log n + n^{c \log n / \log s(n)}}. \tag{1}$$

We can view $n^{c \log n / \log s(n)}$ as a delay-adjustment term that varies with the delay. For concreteness, we pick $c = 2/3$ in our analysis during the rest of the paper.

To see why we picked this step-size, observe the size of $\alpha_n$ when $s(n)$ is small, i.e., when the delay is large. Suppose $s(n) = O(\sqrt{n})$. Then $n^{(2/3) \log n / \log s(n)} = \Omega(n^{4/3})$. So when the delay is unbounded and grows faster than polynomially, $\alpha_n$ is on the order of $O(1/n^{4/3})$, which is summable. Next, note that when the delay is linear, i.e., $s(n)$ is on the order of $n$, we have $n^{(2/3) \log n / \log s(n)} = O(n^{2/3})$, so the step-size $\alpha_n$ is on the order of $1/(n \log n \log\log n)$, which is not summable. We formalize the above in the following result, stated for general $0 < c < 1$. Details of the proof can be found in the supplementary material.

Lemma 3.3. Let $\alpha_{n+1} = \frac{1}{n \log n \log\log n + n^{c \log n / \log s(n)}}$. Then:
1. On the subsequence whose indices satisfy $s(n) = O(n^{\ell})$, where $\ell < c$, the step-sizes are summable: $\sum_{n \ge 0,\, s(n) = O(n^{\ell})} \alpha_{n+1} < \infty$.
2. On the subsequence whose indices satisfy $s(n) \ge n/(2K)$, the step-sizes are not summable: $\sum_{n \ge 0,\, s(n) \ge n/(2K)} \alpha_{n+1} = \infty$.

4 Convergence Analysis

For clarity, we write down the delay-adaptive variant of DASGD that we are proposing.

Algorithm 2: Delay-Adaptive DASGD

Require: $Y_0 \in \mathbb{R}^d$, $0 < c < 1$
  $n \leftarrow 0$
  repeat
    $X_n = \mathrm{Proj}_X(Y_n)$
    $\alpha_{n+1} = \dfrac{1}{n \log n \log\log n + n^{c \log n / \log s(n)}}$
    $Y_{n+1} = Y_n - \alpha_{n+1} \nabla F(X_{s(n)}; \omega_{n+1})$
  until end
  return solution candidate $X_n$

Energy Function

Consider the energy function

$$E(y) = \inf_{x^* \in X^*} E_{x^*}(y), \quad \text{where } E_{x^*}(y) = \|x^*\|_2^2 - \|\mathrm{Proj}_X(y)\|_2^2 + 2\langle y, \mathrm{Proj}_X(y) - x^* \rangle.$$

The energy function satisfies the following properties:

Lemma 4.1.
1. $E(y) \ge 0$, with equality if and only if $\mathrm{Proj}_X(y) \in X^*$.
2. Let $\{y_n\}_{n=1}^{\infty}$ be a sequence. Then $\lim_{n \to \infty} E(y_n) = 0$ if and only if $\mathrm{Proj}_X(y_n) \to X^*$ as $n \to \infty$.

Observe that calling $E(y)$ an energy function is justified, since $E(y)$ is always nonnegative. The key point is this: since $E(Y_n) \to 0$ if and only if $X_n \to X^*$, the energy function $E(y)$ provides a convenient way to characterize and prove convergence. The next result allows us to bound the change of the energy function across an iteration, and will prove essential in the technical analysis.

Lemma 4.2. Fix any $x^* \in X^*$. Then:
1. $\|\mathrm{Proj}_X(y) - \hat{y}\|_2^2 - \|\mathrm{Proj}_X(\hat{y}) - \hat{y}\|_2^2 \le \|y - \hat{y}\|_2^2$ for any $y, \hat{y} \in \mathbb{R}^d$.
2. $E_{x^*}(y + \Delta y) - E_{x^*}(y) \le 2\langle \Delta y, \mathrm{Proj}_X(y) - x^* \rangle + \|\Delta y\|_2^2$ for any $y, \Delta y \in \mathbb{R}^d$.

Convergence Analysis

We can write the DASGD updates as follows:

$$X_n = \mathrm{Proj}_X(Y_n), \qquad Y_{n+1} = Y_n - \alpha_{n+1}\big(\nabla f(X_n) + B_n + U_{n+1}\big),$$

where $B_n = \nabla f(X_{s(n)}) - \nabla f(X_n)$ is the delay error term and $U_{n+1} = \nabla F(X_{s(n)}; \omega_{n+1}) - \nabla f(X_{s(n)})$ is the stochastic noise term in the gradient computation. In addition, we use the adaptive step-size from equation (1). In our proof (see the supplement), we first prove almost sure (a.s.) convergence along a subsequence before extending to full a.s. convergence, a standard technique for convergence proofs of discrete iterates.
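To make equation (1) concrete, here is a minimal sketch of the delay-adaptive update of Algorithm 2. This is our own illustration, not the authors' code: the quadratic objective, the domain $X = [-1,1]^2$, the synthetic delay $s(n) = \lfloor \sqrt{n} \rfloor$, and the guard for small $n$ (where the iterated logarithms are undefined) are all assumptions made for the demo.

```python
import math
import numpy as np

def step_size(n, s_n, c=2/3):
    # Delay-adaptive step-size from equation (1); s_n = s(n) is the
    # iteration at which the received gradient was computed. The small-n
    # guard is an implementation choice, not part of the paper.
    if n < 3 or s_n < 2:
        return 0.01
    adj = n ** (c * math.log(n) / math.log(s_n))   # delay-adjustment term
    return 1.0 / (n * math.log(n) * math.log(math.log(n)) + adj)

# Illustrative run on f(x) = 0.5 ||x||^2 over X = [-1, 1]^2 with a
# "polynomial" delay s(n) = floor(sqrt(n)) and Gaussian gradient noise.
rng = np.random.default_rng(0)
Y = np.array([0.5, -0.4])
history = [np.clip(Y, -1, 1)]                      # X_n = Proj_X(Y_n)
for n in range(1, 20_001):
    s_n = max(1, math.isqrt(n))                    # stale read index s(n)
    X_stale = history[min(s_n, len(history) - 1)]
    grad = X_stale + rng.normal(size=2)            # noisy grad at X_{s(n)}
    Y = Y - step_size(n, s_n) * grad
    history.append(np.clip(Y, -1, 1))
print("final X_n:", history[-1])                   # drifts toward (0, 0)
```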
Convergence Along a Subsequence

Proposition 4.2.1. Under Assumptions 1 and 2, together with Lemmas 3.1 and 3.2, delay-adaptive DASGD admits a subsequence $X_{n_k}$ that converges to $X^*$ almost surely as $k \to \infty$.

The proof is technical, and is provided in full in the supplement. We highlight that the key is to show a bound on the delay term $B_n$, which is almost surely bounded above by $O(\log\log\log n)$ even when the delay is allowed to be exponential. A telescoping argument then allows us to show that $E(Y_{n+1}) - E(Y_0) \to -\infty$, a contradiction, since the energy function is nonnegative.

Extending to Convergence

The main result of delay-adaptivity in the stochastic context is the following:

Theorem 4.3. Under Assumptions 1 and 2, together with Lemmas 3.1 and 3.2, the global state variable $X_n$ of delay-adaptive DASGD converges a.s. to the solution set $X^*$ of (Opt).

The stochastic noise makes extending subsequential convergence to full convergence tricky. To get around this, we write down an ODE that approximates the trajectory of the DASGD iterates and show convergence of the ODE flow to the solution set. Then, we relate the DASGD iterates to the ODE trajectory to prove convergence. We note that we can view the DASGD updates as a discretization of the mean-flow ODE

$$\dot{y} = -\nabla f\big(\mathrm{Proj}_X(y)\big). \tag{2}$$

Compactness of $X$ and Lipschitz continuity of $\nabla f$ ensure a unique trajectory for $y$. We define $P : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}^d$ so that $P(t, y_0)$ denotes the state of the system after running for time $t$ starting from $y_0$. The idea is that, asymptotically, the (random) iterates $Y_1, Y_2, \ldots, Y_n$ hug the mean-flow ODE specified in (2). Using the (VC) condition allows us to prove that $P(t, y_0) \to X^*$ as $t \to \infty$, which in turn allows us to prove convergence of the iterates. To make this argument precise, we introduce the following three objects:

1. the DASGD iterates $Y_1, Y_2, \ldots, Y_n$;
2. an affine curve $A(t)$ interpolating the iterates $Y_1, Y_2, \ldots, Y_n$, where $Y_r$ is placed at time $\sum_{s=0}^{r} \alpha_s$ (recall that the step-size $\alpha_r$ is still delay-adaptive);
3. the curve given by the ODE flow $P(t, y)$.

To show that the interpolated curve $A(t)$ is asymptotically close to the mean-flow ODE, we consider the following notion of an asymptotic pseudotrajectory (APT), introduced in (Benaïm 1999):

Definition 1. A continuous function $s : \mathbb{R}_+ \to \mathbb{R}^d$ is an APT for $P$ if, for every $T > 0$,
$$\lim_{t \to \infty} \sup_{0 \le h \le T} d\big(s(t + h), P(h, s(t))\big) = 0.$$

In the absence of delay-adaptivity, Theorem 4.12 in (Zhou et al. 2020b) showed that, as long as $A(t)$ can be shown to be an APT for $P$ and there exists a convergent subsequence, convergence of the DASGD iterates follows. Since we have already established a convergent subsequence in the delay-adaptive setting, we only need to show that $A(t)$ is still an APT for $P$ under the delay-adaptive step-size. We prove this by relying on the fact that when $s(n) = O(\sqrt{n})$, i.e., when there is a possibly bad delay, the delay-adaptive step-size is summable. The precise details, using ideas from (Benaïm 1999), are in the supplement.
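For intuition about the mean-flow ODE (2), the following forward-Euler sketch is our own illustration (the objective $f(x) = \tfrac{1}{2}\|x\|_2^2$, the domain $X = [-1,1]^2$, and the integration parameters are all assumptions): the flow $P(t, y_0)$ visibly approaches the solution set.

```python
import numpy as np

def proj(y):
    # Euclidean projection onto the illustrative domain X = [-1, 1]^2.
    return np.clip(y, -1.0, 1.0)

# Forward-Euler integration of the mean-flow ODE (2),
# dy/dt = -grad f(Proj_X(y)), with grad f(x) = x for this choice of f.
y = np.array([3.0, -2.0])          # y0
dt = 0.01
for _ in range(5_000):             # integrate up to t = 50
    y = y - dt * proj(y)
print("P(50, y0) =", y)            # approaches the solution set X* = {0}
```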
5 Extending DASGD to Unbounded Domains

While DASGD guarantees are often provided for problems with bounded domains, to date there has been no result in the literature proving convergence of DASGD for problems with unbounded domains. In this section, we propose a novel variant of DASGD that achieves such guarantees for variationally coherent problems with unbounded domains. We will require the following assumption, as the existing proof technique only works for sublinearly growing delay.

Assumption 3. The delay sequence $\{d_n\}_n$ satisfies $d_n \le \beta n^c$ for some $\beta > 0$ and $0 \le c < 1$.

For clarity, we will also assume that the minimizer $x^*$ of $f$ is unique. In an unconstrained setting, ensuring boundedness of the iterates is problematic. To deal with this issue, we force the algorithm to remain in an increasing sequence of compact sets $(K_j)_j$, where the containment $K_j \subset K_{j+1}$ is strict. Each time the iterate leaves the current compact set $K_{\tau_n}$, we increase $\tau_n$ by 1 and perform a truncation by resetting the iterate to a fixed point in $K_{\tau_0}$. To differentiate it from vanilla DASGD, we call this new algorithm DASGD-T, where T stands for truncation.

Algorithm 3 provides one implementation of DASGD-T. As with DASGD, the master keeps track of a global counter $n$ and increments it every time it updates the current solution candidate $X_n$. When an iterate exceeds the current compact set $K_{\tau_n}$, we return the algorithm to its initial iterate $X_0$. This ensures that the iterates return to some fixed compact set infinitely often, which will prove essential in ensuring boundedness of the iterates. On the other hand, if the tentative iterate stays within $K_{\tau_n}$, we take $Y_n$ as our new candidate solution $X_{n+1}$. We say that the algorithm is in its $t$-th cycle at time $n$ if $\tau_n = t$.

Algorithm 3: DASGD-T

Require: initial state $X_0 \in \mathbb{R}^d$, step-size sequence $\alpha_n$, initial truncation parameter $\tau_0$
  $n \leftarrow 0$
  repeat
    if $\tau_{s(n)} < \tau_n$ then  /* only use gradients from the current cycle */
      $X_{n+1} \leftarrow X_n$;  $\tau_{n+1} \leftarrow \tau_n$
    else
      $Y_n \leftarrow X_n - \alpha_{n+1} \nabla F(X_{s(n)}, \omega_{s(n)+1})$
      if $Y_n \in K_{\tau_n}$ then
        $X_{n+1} \leftarrow Y_n$;  $\tau_{n+1} \leftarrow \tau_n$
      else
        $X_{n+1} \leftarrow X_0$;  /* truncate */
        $\tau_{n+1} \leftarrow \tau_n + 1$  /* increase exploration limit */
    $n \leftarrow n + 1$
  until end
  return $X_n$

Note also that if the gradient received at time $n$ came from an iteration $s(n)$ for which $\tau_{s(n)} < \tau_n$, we keep the existing iterate and do not use the gradient evaluated at iteration $s(n)$ to produce a new iterate. The idea is to prevent a stale gradient from an earlier cycle from affecting the algorithm, since the reason we exited the compact set $K_{\tau_{s(n)}}$ in the first place could be that the algorithm was producing iterates that had started to diverge. This choice of only using delayed gradients from the current cycle will also help in proving boundedness of the iterates later.

Boundedness of Iterates

The crux of this analysis is showing that the iterates remain a.s. bounded. To do so, we first show that the delay and noise terms are asymptotically negligible; using that, we then show that the iterates must remain a.s. bounded. For the purpose of analysis, it is convenient to rewrite the algorithm update in the following way:

$$X_{n+1} = X_n - \alpha_{n+1} \nabla f(X_n) - \alpha_{n+1} \delta M_{n+1} + \alpha_{n+1} p_{n+1}, \tag{3}$$

where

$$\delta M_{n+1} = \nabla F(X_{s(n)}; \omega) - \nabla f(X_n) = \big(\nabla F(X_{s(n)}; \omega) - \nabla f(X_{s(n)})\big) + \big(\nabla f(X_{s(n)}) - \nabla f(X_n)\big), \tag{4}$$

$$p_{n+1} = \begin{cases} \nabla f(X_n) + \delta M_{n+1} + \frac{1}{\alpha_{n+1}}(X_0 - X_n) & \text{if } Y_n \notin K_{\tau_n}, \\ 0 & \text{otherwise}. \end{cases} \tag{5}$$

We can interpret $\delta M_{n+1}$ as representing the delay and noise terms, whereas $p_{n+1}$ is the deviation from the noisy gradient-descent dynamics caused by a truncation. We also introduce the notation $v(n)$, defined as

$$v(n) \triangleq \inf\{k : \tau_k = \tau_n,\ k + d_k \ge s(n)\}.$$

To parse the above, $X_{v(n)}$ is the earliest iterate in the current cycle whose gradient was used in producing any iterate from time $s(n)$ onwards. This particular definition ensures that $(v(n))_n$ is an increasing sequence, which is important for the technical analysis.
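The bookkeeping above can be made concrete with a small end-to-end sketch of DASGD-T. This is our own illustration, not the paper's implementation: the objective $f(x) = \tfrac{1}{2}\|x\|_2^2$, the Gaussian noise, the truncation radii $(\tau+1)^2$, and the sublinear delay model are all assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = np.array([10.0, 10.0])
X, tau = X0.copy(), 0
iters, cycles = [X0.copy()], [0]               # X_k and its cycle tau_k
for n in range(1, 50_001):
    s_n = max(0, n - int(np.ceil(n ** 0.5)))   # sublinear delay, d_n <= sqrt(n)
    if cycles[s_n] < tau:                      # gradient from an earlier cycle:
        iters.append(X.copy()); cycles.append(tau)
        continue                               # keep the iterate, discard gradient
    grad = iters[s_n] + rng.normal(size=2)     # noisy grad of 0.5||x||^2 at X_{s(n)}
    Y = X - (1.0 / n) * grad                   # step-size alpha_{n+1} = 1/n
    if np.linalg.norm(Y - X0) <= (tau + 1) ** 2:
        X = Y                                  # Y stays inside K_tau
    else:
        X, tau = X0.copy(), tau + 1            # truncate and enlarge the compact set
    iters.append(X.copy()); cycles.append(tau)
print("final X_n:", X, "| truncations:", tau)
```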
We now characterize the notion that the delay and noise terms do not matter asymptotically.

Lemma 5.1. Suppose Assumptions 1, 2, and 3 hold, and $\alpha_{n+1} = 1/n$. Then, for all $q > 0$, the series $\sum_n \alpha_{n+1}\, \delta M_{n+1}\, \mathbb{I}\{\|X_{v(n):n} - x^*\| \le q\}$ converges a.s.

We prove this in the supplement. Note that this is where we use the assumption that the delays scale sublinearly. Next, we show that, given that the delay and noise terms do not matter asymptotically, the iterates must remain a.s. bounded.

Proposition 5.1.1. Suppose Assumptions 1, 2, and 3 hold, and $\alpha_{n+1} = 1/n$. If for all $q > 0$ the series $\sum_{n > 0} \alpha_{n+1}\, \delta M_{n+1}\, \mathbb{I}\{\|X_{v(n):n} - x^*\| \le q\}$ converges a.s., then the iterates of DASGD-T remain bounded a.s.

6 Numerical Experiments

We evaluate DASGD-T on a variationally coherent objective $f$ over the unbounded domain $\mathbb{R}^2$. The objective is not quasi-convex: there exist points $z_1, z_2$ whose midpoint satisfies $f\big((z_1 + z_2)/2\big) > 1/20 = \max(f(z_1), f(z_2))$. This can also be seen by visualizing the sublevel sets of the function $f$.

To carry out the minimization, an asynchronous master-slave framework was used, with delays that scale as $\Theta(\sqrt{n})$, a step-size sequence $\alpha_n = 1/n$, and stochastic gradient noise drawn from a standard multivariate Gaussian distribution (zero mean, identity covariance). The initial point was $(10, 10)$, and the truncation sets were chosen as $K_j = \{x : \|x - X_0\|_2 \le j^2\}$. We show plots of the convergence in function value of $f$, averaged over 100 trials, in the settings both with and without delay. While the no-delay case converges faster, the algorithm eventually stabilizes and converges in both cases. This confirms our theoretical analysis that DASGD-T can achieve convergence for a variationally coherent problem with an unbounded domain, even in the presence of (sublinear) delay.

7 Conclusion

To conclude, our contributions are twofold. First, we showed that a delay-adaptive algorithm can achieve convergence for VC functions under arbitrary delay sequences, a novel contribution to the literature. Second, we showed that the convergence results for bounded domains can be extended to unbounded domains using a truncation algorithm, with the caveat that our current results hold only in the setting of sublinearly growing delay. On the simulation front, we provided numerical results verifying our theoretical findings, and also showed that delay-adaptive step-sizes perform better than their non-delay-adaptive counterparts across a range of delay scenarios.

References

Agarwal, A., and Duchi, J. C. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, 873–881.

Assran, M.; Loizou, N.; Ballas, N.; and Rabbat, M. 2018. Stochastic gradient push for distributed deep learning. arXiv preprint arXiv:1811.10792.

Benaïm, M. 1999. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII. Springer. 1–68.

Bertsekas, D. P., and Tsitsiklis, J. N. 2003. Parallel and Distributed Computation: Numerical Methods.

Bottou, L. 1998. Online learning and stochastic approximations. On-line Learning in Neural Networks 17(9):142.

Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Advances in Neural Information Processing Systems, 1531–1539.

Cutkosky, A., and Busa-Fekete, R. 2018. Distributed stochastic optimization via adaptive SGD. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 1910–1919.

Damaskinos, G.; El Mhamdi, E. M.; Guerraoui, R.; Patra, R.; and Taziki, M. 2018. Asynchronous Byzantine machine learning (the case of SGD). In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 1145–1154. Stockholmsmässan, Stockholm, Sweden: PMLR.

Karakus, C.; Sun, Y.; Diggavi, S.; and Yin, W. 2017. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 5434–5442.
Koloskova, A.; Stich, S. U.; and Jaggi, M. 2019. Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340.

LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

Lelong, J. 2008. Almost sure convergence of randomly truncated stochastic algorithms under verifiable conditions. Statistics and Probability Letters 78:2632–2636.

Mertikopoulos, P., and Zhou, Z. 2019. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming 173(1-2):465–507.

Peng, Z.; Xu, Y.; Yan, M.; and Yin, W. 2019. On the convergence of asynchronous parallel iteration with unbounded delays. Journal of the Operations Research Society of China 7(1):5–42.

Recht, B.; Ré, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 693–701.

Sra, S.; Yu, A. W.; Li, M.; and Smola, A. J. 2015. AdaDelay: Delay adaptive distributed stochastic convex optimization. arXiv preprint arXiv:1508.05003.

Wu, J.; Huang, W.; Huang, J.; and Zhang, T. 2018. Error compensated quantized SGD and its applications to large-scale distributed optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5325–5333. Stockholmsmässan, Stockholm, Sweden: PMLR.

Xie, C.; Koyejo, S.; and Gupta, I. 2019. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 6893–6901. Long Beach, California, USA: PMLR.

Yu, H., and Jin, R. 2019. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. arXiv preprint arXiv:1905.04346.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Boyd, S.; and Glynn, P. W. 2017a. Stochastic mirror descent in variationally coherent optimization problems. In Advances in Neural Information Processing Systems, 7040–7049.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P. W.; and Tomlin, C. 2017b. Countering feedback delays in multi-agent learning. In Advances in Neural Information Processing Systems, 6171–6181.

Zhou, Z.; Mertikopoulos, P.; Moustakas, A. L.; Bambos, N.; and Glynn, P. 2017c. Mirror descent learning in continuous games. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 5776–5783. IEEE.

Zhou, Z.; Mertikopoulos, P.; Athey, S.; Bambos, N.; Glynn, P. W.; and Ye, Y. 2018a. Learning in games with lossy feedback. In Advances in Neural Information Processing Systems, 5134–5144.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P.; Ye, Y.; Li, L.-J.; and Fei-Fei, L. 2018b. Distributed asynchronous optimization with unbounded delays: How slow can you go? In International Conference on Machine Learning, 5965–5974.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Boyd, S.; and Glynn, P. 2020a. On the convergence of mirror descent beyond stochastic convex programming. SIAM Journal on Optimization.

Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P.; and Ye, Y. 2020b. Distributed stochastic optimization with large delays. Mathematics of Operations Research (under review).