Published as a conference paper at ICLR 2019

A2BCD: Asynchronous Acceleration with Optimal Complexity

Robert Hannah, Fei Feng, Wotao Yin
Department of Mathematics, University of California, Los Angeles
520 Portola Plaza, Los Angeles, CA 90095, USA
Corresponding author: RobertHannah89@gmail.com; fei.feng@math.ucla.edu; wotaoyin@math.ucla.edu

In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm (A2BCD). We prove that A2BCD converges linearly to a solution of the convex minimization problem at the same rate as NU_ACDM, so long as the maximum delay is not too large. This is the first asynchronous Nesterov-accelerated algorithm that attains any provable speedup. Moreover, we then prove that these algorithms both have optimal complexity. Asynchronous algorithms complete much faster iterations, and A2BCD has optimal complexity. Hence we observe in experiments that A2BCD is the top-performing coordinate descent algorithm, converging up to 4-5x faster than NU_ACDM on some data sets in terms of wall-clock time. To motivate our theory and proof techniques, we also derive and analyze a continuous-time analogue of our algorithm and prove that it converges at the same rate.

1 Introduction

In this paper, we propose and prove the convergence of the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm (A2BCD), the first asynchronous Nesterov-accelerated algorithm that achieves optimal complexity. No previous attempts have been able to prove a speedup for asynchronous Nesterov acceleration. We aim to find the minimizer $x^*$ of the unconstrained minimization problem:

$$\min_{x \in \mathbb{R}^d} f(x) = f\big(x(1), \ldots, x(n)\big) \qquad (1.1)$$

where $f$ is $\sigma$-strongly convex for $\sigma > 0$ with $L$-Lipschitz gradient $\nabla f = (\nabla_1 f, \ldots, \nabla_n f)$. Here $x \in \mathbb{R}^d$ is composed of coordinate blocks $x(1), \ldots, x(n)$. The coordinate blocks $\nabla_i f$ of the gradient are assumed $L_i$-Lipschitz with respect to the $i$th block; that is, for all $x, h \in \mathbb{R}^d$:

$$\|\nabla_i f(x + P_i h) - \nabla_i f(x)\| \le L_i \|h\| \qquad (1.2)$$

where $P_i$ is the projection onto the $i$th block of $\mathbb{R}^d$. Let $\bar{L} \triangleq \frac{1}{n}\sum_{i=1}^{n} L_i$ be the average block Lipschitz constant. These conditions on $f$ are assumed throughout this paper. Our algorithm can also be applied to non-strongly convex objectives ($\sigma = 0$) or non-smooth objectives using the black-box reduction techniques proposed in Allen-Zhu & Hazan (2016). Hence we consider only the coordinate-smooth, strongly convex case. Our algorithm can also be applied to the convex regularized ERM problem via the standard dual transformation (see for instance Lin et al. (2014)):

$$\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} f_i(\langle a_i, x\rangle) + \frac{\lambda}{2}\|x\|^2 \qquad (1.3)$$

Hence A2BCD can be used as an asynchronous Nesterov-accelerated finite-sum algorithm.

Coordinate descent methods, in which a chosen coordinate block $i_k$ is updated at every iteration, are a popular way to solve equation 1.1. Randomized block coordinate descent (RBCD, Nesterov (2012)) updates a uniformly randomly chosen coordinate block $i_k$ with a gradient-descent-like step: $x_{k+1} = x_k - (1/L_{i_k})\nabla_{i_k} f(x_k)$. The complexity $K(\epsilon)$ of an algorithm is defined as the number of iterations required to decrease the error $\mathbb{E}(f(x_k) - f(x^*))$ to less than $\epsilon(f(x_0) - f(x^*))$. Randomized coordinate descent has a complexity of $K(\epsilon) = O(n(\bar{L}/\sigma)\ln(1/\epsilon))$. Using a series of averaging and extrapolation steps, accelerated RBCD (Nesterov (2012)) improves RBCD's iteration complexity to $K(\epsilon) = O(n\sqrt{\bar{L}/\sigma}\,\ln(1/\epsilon))$, which leads to much faster convergence when $\bar{L}/\sigma$ is large. This rate is optimal when all $L_i$ are equal (Lan & Zhou (2015)).
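As a point of reference for the coordinate descent methods compared here, the following is a minimal sketch of the plain RBCD step just described, applied to a small strongly convex quadratic. It is not code from the paper; the test problem, block partition, and block Lipschitz constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly convex quadratic f(x) = 0.5*x^T A x - b^T x.
d, n_blocks = 40, 8
M = rng.standard_normal((d, d))
A = M @ M.T / d + np.eye(d)              # positive definite Hessian
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)           # exact minimizer, for checking progress

blocks = np.array_split(np.arange(d), n_blocks)
# Block Lipschitz constant of grad_i f: spectral norm of the (i, i) diagonal block of A.
L = [np.linalg.norm(A[np.ix_(B, B)], 2) for B in blocks]

x = np.zeros(d)
for k in range(5000):
    i = rng.integers(n_blocks)           # uniformly random block (plain RBCD)
    B = blocks[i]
    grad_i = A[B, :] @ x - b[B]          # block gradient of f at x
    x[B] -= (1.0 / L[i]) * grad_i        # step x_{k+1} = x_k - (1/L_i) grad_i f(x_k)

print("distance to minimizer:", np.linalg.norm(x - x_star))
```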
Finally, using a special probability distribution for the random block index $i_k$, the non-uniform accelerated coordinate descent method (NU_ACDM, Allen-Zhu et al. (2015)) can further decrease the complexity to $O\big(\sum_{i=1}^{n}\sqrt{L_i/\sigma}\,\ln(1/\epsilon)\big)$, which can be up to $n$ times faster than accelerated RBCD, since some $L_i$ can be significantly smaller than $\bar{L}$. NU_ACDM is the current state-of-the-art coordinate descent algorithm for solving equation 1.1.

Our A2BCD algorithm generalizes NU_ACDM to the asynchronous-parallel case. We solve equation 1.1 with a collection of $p$ computing nodes that continually read a shared-access solution vector $y$ into local memory, then compute a block gradient $\nabla_i f$, which is used to update the shared solution vectors $(x, y, v)$. Proving convergence in the asynchronous case requires extensive new technical machinery.

A traditional synchronous-parallel implementation is organized into rounds of computation: every computing node must complete an update in order for the next iteration to begin. However, this synchronization process can be extremely costly, since the lateness of a single node can halt the entire system. This becomes increasingly problematic with scale, as differences in node computing speeds, load balancing, random network delays, and bandwidth constraints mean that a synchronous-parallel solver may spend more time waiting than computing a solution.

Computing nodes in an asynchronous solver do not wait for others to complete and share their updates before starting the next iteration. They simply continue to update the solution vectors with the most recent information available, without any central coordination. This eliminates costly idle time, meaning that asynchronous algorithms can be much faster than traditional ones, since they have much faster iterations. For instance, random network delays cause asynchronous algorithms to complete iterations $\Omega(\ln(p))$ times faster than synchronous algorithms at scale. This and other factors that influence the speed of iterations are discussed in Hannah & Yin (2017a).

However, since many iterations may occur between the time that a node reads the solution vector and the time that its computed update is applied, the solution vector is effectively being updated with outdated information. At iteration $k$, the block gradient $\nabla_{i_k} f$ is computed at a delayed iterate $\hat{y}_k$ defined as

$$\hat{y}_k = \big(y_{k-j(k,1)}(1), \ldots, y_{k-j(k,n)}(n)\big)$$

for delay parameters $j(k,1), \ldots, j(k,n) \in \mathbb{N}$. (Every coordinate can be outdated by a different amount without significantly changing the proofs.) Here $j(k,i)$ denotes how many iterations out of date coordinate block $i$ is at iteration $k$. Different blocks may be out of date by different amounts, which is known as an inconsistent read. We assume that $j(k,i) \le \tau$ for some constant $\tau < \infty$. (This condition can be relaxed by techniques in Hannah & Yin (2017b); Sun et al. (2017); Peng et al. (2016c); Hannah & Yin (2017a).)
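The inconsistent read $\hat{y}_k$ can be made concrete with a short bookkeeping sketch (not from the paper): each block of the vector a node reads may come from a different recent iterate, with every per-block delay bounded by $\tau$. The buffer length and random delay model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_blocks, tau = 12, 4, 3
blocks = np.array_split(np.arange(d), n_blocks)

# Keep the last tau+1 iterates y_k, y_{k-1}, ..., y_{k-tau} (most recent first).
history = [np.zeros(d) for _ in range(tau + 1)]

def record(y_new):
    """Push the newest iterate onto the bounded history buffer."""
    history.insert(0, y_new.copy())
    history.pop()

def inconsistent_read():
    """Assemble y_hat block by block; block i comes from y_{k-j(k,i)} with j(k,i) <= tau."""
    y_hat = np.empty(d)
    for blk in blocks:
        j = rng.integers(tau + 1)      # per-block delay j(k, i)
        y_hat[blk] = history[j][blk]
    return y_hat

# Populate the history with a few dummy iterates, then form one inconsistent read.
for k in range(6):
    record(history[0] + 0.1 * rng.standard_normal(d))
print(inconsistent_read())
```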
Asynchronous algorithms were proposed in Chazan & Miranker (1969) to solve linear systems. General convergence results and theory were developed later in Bertsekas (1983); Bertsekas & Tsitsiklis (1997); Tseng et al. (1990); Luo & Tseng (1992; 1993); Tseng (1991) for partially and totally asynchronous systems, with essentially cyclic block sequence $i_k$. More recently, there has been renewed interest in asynchronous algorithms with random block coordinate updates. Linear and sublinear convergence results were proven for asynchronous RBCD Liu & Wright (2015); Liu et al. (2014); Avron et al. (2014), and similar results were proven for asynchronous SGD Recht et al. (2011), variance reduction algorithms Reddi et al. (2015); Leblond et al. (2017); Mania et al. (2015); Huo & Huang (2016), and primal-dual algorithms Combettes & Eckstein (2018).

There is also a rich body of work on asynchronous SGD. In the distributed setting, Zhou et al. (2018) showed global convergence for stochastic variationally coherent problems even when the delays grow at a polynomial rate. In Lian et al. (2018), an asynchronous decentralized SGD was proposed with the same optimal sublinear convergence rate as SGD and linear speedup with respect to the number of workers. In Liu et al. (2018), the authors obtained an asymptotic rate of convergence for asynchronous momentum SGD on streaming PCA, which provides insight into the tradeoff between asynchrony and momentum. In Dutta et al. (2018), the authors prove convergence results for asynchronous SGD that highlight the tradeoff between faster iterations and iteration complexity. Further related work is discussed in Section 4.

1.1 Summary of Contributions

In this paper, we prove that A2BCD attains NU_ACDM's state-of-the-art iteration complexity to highest order for solving equation 1.1, so long as delays are not too large (see Section 2). The proof is very different from that of Allen-Zhu et al. (2015), and involves significant technical innovations and complexity related to the analysis of asynchronicity. We also prove that A2BCD (and hence NU_ACDM) has optimal complexity to within a constant factor over a fairly general class of randomized block coordinate descent algorithms (see Section 2.1). This extends results in Lan & Zhou (2015) to asynchronous algorithms with $L_i$ not all equal. Since asynchronous algorithms complete faster iterations, and A2BCD has optimal complexity, we expect A2BCD to be faster than all existing coordinate descent algorithms. We confirm with numerical experiments that A2BCD is the current fastest coordinate descent algorithm (see Section 5).

We are aware of only one previous and one contemporaneous attempt at proving convergence results for asynchronous Nesterov-accelerated algorithms. However, the first is not accelerated and relies on extreme assumptions, and the second obtains no speedup. Therefore, we claim that our results are the first-ever analysis of asynchronous Nesterov-accelerated algorithms that attains a speedup. Moreover, our speedup is optimal for delays that are not too large (speedup is defined precisely in Section 2). The work of Meng et al. claims to obtain square-root speedup for an asynchronous accelerated SVRG. In the case where all component functions have the same Lipschitz constant $L$, the complexity they obtain reduces to $(n + \kappa)\ln(1/\epsilon)$ for $\kappa = O(\tau n^2)$ (their Corollary 4.4). Hence the authors do not even obtain accelerated rates. Their convergence condition is $\tau < \frac{1}{4}\Delta^{-1/8}$ for a sparsity parameter $\Delta$. Since the dimension $d$ satisfies $d \ge \Delta^{-1}$, they require $d \ge 2^{16}\tau^{8}$. So $\tau = 20$ requires dimension $d > 10^{15}$.

In a contemporaneous preprint, the authors of Fang et al. (2018) skillfully devised accelerated schemes for asynchronous coordinate descent and SVRG using momentum compensation techniques. Although their complexity results have the improved $\sqrt{\kappa}$ dependence on the condition number, they do not prove any speedup. Their complexity is $\tau$ times larger than the serial complexity.
Since $\tau$ is necessarily greater than $p$, their results imply that adding more computing nodes will increase running time. The authors claim that they can extend their results to linear speedup for asynchronous accelerated SVRG under sparsity assumptions, and while we think this is quite likely, they have not yet provided a proof.

We also derive a second-order ordinary differential equation (ODE), which is the continuous-time limit of A2BCD (see Section 3). This extends the ODE found in Su et al. (2014) to an asynchronous accelerated algorithm minimizing a strongly convex function. We prove that this ODE converges linearly to a solution at the same rate as A2BCD, without needing to resort to restarting techniques. The ODE analysis motivates and clarifies our proof strategy for the main result.

2 Main results

We should consider functions $f$ for which it is efficient to calculate blocks of the gradient, so that coordinate-wise parallelization is efficient. That is, the function should be coordinate friendly (Peng et al. (2016b)). This is a very wide class that includes regularized linear regression, logistic regression, etc. The L2-regularized empirical risk minimization problem is not coordinate friendly in general; however, the equivalent dual problem is, and hence it can be solved efficiently by A2BCD (see Lin et al. (2014) and Section 5).

To calculate the $(k+1)$th iteration of the algorithm from iteration $k$, we use only one block of the gradient, $\nabla_{i_k} f$. We assume that the delays $j(k, i)$ are independent of the block sequence $i_k$, but otherwise arbitrary (this is a standard assumption found in the vast majority of papers, but it can be relaxed; see Sun et al. (2017); Leblond et al. (2017); Cannelli et al. (2017)).

Definition 1. Asynchronous Accelerated Randomized Block Coordinate Descent (A2BCD). Let $f$ be $\sigma$-strongly convex, and let its gradient $\nabla f$ be $L$-Lipschitz with block coordinate Lipschitz parameters $L_i$ as in equation 1.2. We define the condition number $\kappa = L/\sigma$, and let $\underline{L} = \min_i L_i$. Using these parameters, we sample $i_k$ in an independent and identically distributed (IID) fashion according to

$$\mathbb{P}[i_k = j] = L_j^{1/2}/S, \quad j \in \{1, \ldots, n\}, \quad \text{for } S = \sum_{i=1}^{n} L_i^{1/2}. \qquad (2.1)$$

Let $\tau$ be the maximum asynchronous delay. We define the dimensionless asynchronicity parameter $\psi$, which is proportional to $\tau$, and quantifies how strongly asynchronicity will affect convergence:

$$\psi = 9 S^{-1/2}\underline{L}^{-1/2} L^{3/4}\kappa^{1/4}\tau. \qquad (2.2)$$

We use the above system parameters and $\psi$ to define the coefficients $\alpha$, $\beta$, and $h$ via eqs. (2.3) to (2.5); the A2BCD algorithm is then defined via the iterations in eqs. (2.6) to (2.8):

$$\alpha \triangleq \big(1 + (1 + \psi)\sigma^{-1/2}S\big)^{-1}, \qquad (2.3)$$
$$\beta \triangleq 1 - (1 - \psi)\sigma^{1/2}S^{-1}, \qquad (2.4)$$
$$h \triangleq 1 - \tfrac{1}{2}\sigma^{1/2}\underline{L}^{-1/2}\psi, \qquad (2.5)$$

$$y_k = \alpha v_k + (1 - \alpha)x_k, \qquad (2.6)$$
$$x_{k+1} = y_k - h L_{i_k}^{-1}\nabla_{i_k} f(\hat{y}_k), \qquad (2.7)$$
$$v_{k+1} = \beta v_k + (1 - \beta)y_k - \sigma^{-1/2}L_{i_k}^{-1/2}\nabla_{i_k} f(\hat{y}_k). \qquad (2.8)$$

See Section A for a discussion of why it is practical and natural for the gradient $\nabla_{i_k} f(\hat{y}_k)$ to be outdated, while the actual variables $x_k, y_k, v_k$ can be efficiently kept up to date. Essentially, this is because most of the computation lies in computing $\nabla_{i_k} f(\hat{y}_k)$. After this is computed, $x_k, y_k, v_k$ can be updated more or less atomically with minimal overhead, meaning that they will always be up to date. However, our main results still hold for more general asynchronicity.
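To make Definition 1 concrete, here is a minimal single-threaded simulation of the iterations (2.6)-(2.8), with the bounded delay imitated by reading each block of $y$ from a random recent iterate. This is a sketch, not the paper's shared-memory C++ implementation: the quadratic test problem and delay model are assumptions, and, following the tuning advice in Section A.1, the asynchronicity parameter $\psi$ is simply set to 0.25 rather than computed from equation 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly convex quadratic f(x) = 0.5*x^T A x - b^T x.
d, n_blocks, tau = 40, 8, 3
M = rng.standard_normal((d, d))
A = M @ M.T / d + np.eye(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)

blocks = np.array_split(np.arange(d), n_blocks)
sigma = np.linalg.eigvalsh(A)[0]                          # strong convexity modulus
L = np.array([np.linalg.norm(A[np.ix_(B, B)], 2) for B in blocks])  # block constants L_i
L_under = L.min()
S = np.sum(np.sqrt(L))
probs = np.sqrt(L) / S                                    # non-uniform sampling, eq. (2.1)

# Eq. (2.2) is very conservative for realistic tau, so we follow Section A.1
# and simply tune the asynchronicity parameter.
psi = 0.25
alpha = 1.0 / (1.0 + (1.0 + psi) * S / np.sqrt(sigma))    # eq. (2.3)
beta = 1.0 - (1.0 - psi) * np.sqrt(sigma) / S             # eq. (2.4)
h = 1.0 - 0.5 * psi * np.sqrt(sigma / L_under)            # eq. (2.5)

x, v = np.zeros(d), np.zeros(d)
y_hist = [np.zeros(d) for _ in range(tau + 1)]            # y_k, y_{k-1}, ..., y_{k-tau}

for k in range(4000):
    y = alpha * v + (1 - alpha) * x                       # eq. (2.6)
    y_hist.insert(0, y); y_hist.pop()

    # Inconsistent, delayed read: each block of y_hat comes from a random recent iterate.
    y_hat = np.concatenate([y_hist[rng.integers(tau + 1)][B] for B in blocks])

    i = rng.choice(n_blocks, p=probs)                     # sample block per eq. (2.1)
    B = blocks[i]
    g = A[B, :] @ y_hat - b[B]                            # delayed block gradient

    x_new = y.copy()
    x_new[B] -= (h / L[i]) * g                            # eq. (2.7)
    v_new = beta * v + (1 - beta) * y
    v_new[B] -= g / np.sqrt(sigma * L[i])                 # eq. (2.8)
    x, v = x_new, v_new

print("distance to minimizer:", np.linalg.norm(x - x_star))
```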
A natural quantity to consider in asynchronous convergence analysis is the asynchronicity error, a powerful tool for analyzing asynchronous algorithms used in several recent works Peng et al. (2016a); Hannah & Yin (2017b); Sun et al. (2017); Hannah & Yin (2017a). We adapt it, using a weighted sum of the history of the algorithm with weights that decrease further back in time.

Definition 2. Asynchronicity error. Using the above parameters, we define:

$$A_k = \sum_{j=1}^{\infty} c_j \|y_{k+1-j} - y_{k-j}\|^2 \qquad (2.9)$$

for

$$c_i = 6 S^{-1} L^{1/2}\kappa^{3/2}\tau\psi^{-1} \sum_{j=i}^{\tau} \big(1 - \sigma^{1/2}S^{-1}\big)^{i-j-1}. \qquad (2.10)$$

Here we define $y_k = y_0$ for all $k < 0$. The determination of the coefficients $c_i$ is in general a very involved process of trial and error, intuition, and balancing competing requirements. The algorithm doesn't depend on the coefficients, however; they are only an analytical tool.

We define $\mathbb{E}_k[X]$ as the expectation of $X$ conditioned on $(x_0, \ldots, x_k)$, $(y_0, \ldots, y_k)$, $(v_0, \ldots, v_k)$, and $(i_0, \ldots, i_{k-1})$. To simplify notation, we assume that the minimizer $x^* = 0$ and that $f(x^*) = 0$ with no loss of generality. (We can assume $x^* = 0$ since we may translate the coordinate system so that $x^*$ is at the origin, and $f(x^*) = 0$ since we can replace $f(x)$ with $f(x) - f(x^*)$. Without this assumption, the Lyapunov function below simply becomes $\|v_k - x^*\|^2 + A_k + c(f(x_k) - f(x^*))$.) We define the Lyapunov function:

$$\rho_k = \|v_k\|^2 + A_k + c f(x_k) \qquad (2.11)$$

for

$$c = 2\sigma^{-1/2}S^{-1}\big(\beta\alpha^{-1}(1-\alpha) + 1\big). \qquad (2.12)$$

We now present this paper's first main contribution.

Theorem 1. Let $f$ be $\sigma$-strongly convex with a gradient $\nabla f$ that is $L$-Lipschitz with block Lipschitz constants $\{L_i\}_{i=1}^{n}$. Let $\psi$ defined in equation 2.2 satisfy $\psi \le \frac{3}{7}$ (i.e., $\tau \le \frac{1}{21} S^{1/2}\underline{L}^{1/2}L^{-3/4}\kappa^{-1/4}$). Then for A2BCD we have:

$$\mathbb{E}_k[\rho_{k+1}] \le \big(1 - (1-\psi)\sigma^{1/2}S^{-1}\big)\rho_k.$$

To obtain $\mathbb{E}[\rho_k] \le \epsilon\rho_0$, it takes $K_{A2BCD}(\epsilon)$ iterations for:

$$K_{A2BCD}(\epsilon) = \frac{\big(\sigma^{-1/2}S + O(1)\big)\ln(1/\epsilon)}{1-\psi}, \qquad (2.13)$$

where $O(\cdot)$ is asymptotic with respect to $\sigma^{-1/2}S$, and uniformly bounded.

This result is proven in Section B. A stronger result for $L_i \equiv L$ can be proven, but this adds to the complexity of the proof; see Section E for a discussion. In practice, asynchronous algorithms are far more resilient to delays than the theory predicts: $\tau$ can be much larger without negatively affecting the convergence rate and complexity. This is perhaps because we are limited to a worst-case analysis, which is not representative of the average-case performance.

Allen-Zhu et al. (2015) (Theorem 5.1) shows a linear convergence rate of $1 - 2/(1 + 2\sigma^{-1/2}S)$ for NU_ACDM, which leads to the corresponding iteration complexity of $K_{NU\_ACDM}(\epsilon) = \big(\sigma^{-1/2}S + O(1)\big)\ln(1/\epsilon)$. Hence, we have:

$$K_{A2BCD}(\epsilon) = \frac{1}{1-\psi}(1 + o(1)) K_{NU\_ACDM}(\epsilon).$$

When $0 \le \psi \ll 1$, or equivalently, when $\tau \ll S^{1/2}\underline{L}^{1/2}L^{-3/4}\kappa^{-1/4}$, the complexity of A2BCD asymptotically matches that of NU_ACDM. Hence A2BCD combines state-of-the-art complexity with the faster iterations and superior scaling that asynchronous iterations allow. We now present some special cases of the conditions on the maximum delay $\tau$ required for good complexity.

Corollary 3. Let the conditions of Theorem 1 hold. If all coordinate-wise Lipschitz constants $L_i$ are equal (i.e., $L_i = L_1$, $\forall i$), then we have $K_{A2BCD}(\epsilon) \sim K_{NU\_ACDM}(\epsilon)$ when $\tau \lesssim n^{1/2}\kappa^{-1/4}(L_1/L)^{3/4}$. If we further assume that all coordinate-wise Lipschitz constants $L_i$ equal $L$, then $K_{A2BCD}(\epsilon) \sim K_{NU\_ACDM}(\epsilon) = K_{ACDM}(\epsilon)$ when $\tau \lesssim n^{1/2}\kappa^{-1/4}$.

Remark 1. Reduction to the synchronous case. Notice that when $\tau = 0$, we have $\psi = 0$, $c_i \equiv 0$ and hence $A_k \equiv 0$. Thus A2BCD becomes equivalent to NU_ACDM, the Lyapunov function $\rho_k$ becomes equivalent to one found in Allen-Zhu et al. (2015) (pg. 9), and Theorem 1 yields the same complexity.

The maximum delay $\tau$ will be a function $\tau(p)$ of $p$, the number of computing nodes.
Clearly τ p, and experimentally it has been observed that τ = O(p) Leblond et al. (2017). Let gradient complexity K(ϵ, τ) be the number of gradients required for an asynchronous algorithm with maximum delay τ to attain suboptimality ϵ. τ(1) = 0, since with only 1 computing node there can be no delay. This corresponds to the serial complexity. We say that an asynchronous algorithm attains a complexity speedup if p K(ϵ,τ(0)) K(ϵ,τ(p) is increasing in p. We say it attains linear complexity speedup if p K(ϵ,τ(0)) K(ϵ,τ(p) = Ω(p). In Theorem 1, we obtain a linear complexity speedup (for p not too large), whereas no other prior attempt can attain even a complexity speedup with Nesterov acceleration. In the ideal scenario where the rate at which gradients are calculated increases linearly with p, algorithms that have linear complexity speedup will have a linear decrease in wall-clock time. However in practice, when the number of computing nodes is sufficiently large, the rate at which gradients are calculated will no longer be linear. This is due to many parallel overhead factors including too many nodes sharing the same memory read/write bandwidth, and network bandwidth. However we note that even with these issues, we obtain much faster convergence than the synchronous counterpart experimentally. 2.1 Optimality NU_ACDM and hence A2BCD are in fact optimal in some sense. That is, among a fairly wide class of coordinate descent algorithms A, they have the best-possible worst-case complexity to highest order. We extend the work in Lan & Zhou (2015) to encompass algorithms are asynchronous and have unequal Li. For a subset S Rd, we let IC(S) (inconsistent read) denote the set of vectors v whose components are a combination of components of vectors in the set S. That is, v = (v1,1, v2,2, . . . , vd,d) for some vectors v1, v2, . . . , vd S. Here vi,j denotes the jth component of vector vi. Definition 4. Asynchronous Randomized Incremental Algorithms. Consider the unconstrained minimization problem equation 1.1 for function f satisfying the conditions stated in Section 1. We define the class A as algorithms G on this problem such that: 1. For each parameter set (σ, L1, . . . , Ln, n), G has an associated IID random variable ik with some fixed distribution P[ik] = pi for Pn i=1 pi = 1. 2. The iterates of A satisfy: xk+1 span{IC(Xk), i0f(IC(X0)), i1f(IC(X1)), . . . , ikf(IC(Xk))} This is a rather general class: xk+1 can be constructed from any inconsistent reading of past iterates IC(Xk), and any past gradient of an inconsistent read ijf(IC(Xj)). 5Their Lyapunov function is in fact a generalization of the one found in Nesterov (2012). Published as a conference paper at ICLR 2019 Theorem 2. For any algorithm G A that solves eq. (1.1), and parameter set (σ, L1, . . . , Ln, n), there is a dimension d, a corresponding function f on Rd, and a starting point x0, such that E xk x 2/ x0 x 2 1 Li/σ + 2n k Hence A has a complexity lower bound: K(ϵ) 1 4(1 + o(1)) Pn j=1 p Li/σ + 2n ln(1/2ϵ) Our proof in Section D follows very similar lines to Lan & Zhou (2015); Nesterov (2013). 3 ODE Analysis In this section we present and analyze an ODE which is the continuous-time limit of A2BCD. This ODE is a strongly convex, and asynchronous version of the ODE found in Su et al. (2014). For simplicity, assume Li = L, i. We rescale (I.e. we replace f(x) with 1 σf.) f so that σ = 1, and hence κ = L/σ = L. Taking the discrete limit of synchronous A2BCD (i.e. 
accelerated RBCD), we can derive the following ODE6 (see Section equation C.1): Y + 2n 1κ 1/2 Y + 2n 2κ 1 f(Y ) = 0 (3.1) We define the parameter η nκ1/2, and the energy: E(t) = en 1κ 1/2t(f(Y ) + 1 4 Y + η Y 2). This is very similar to the Lyapunov function discussed in equation 2.11, with 1 4 Y (t) + η Y (t) 2 fulfilling the role of vk 2, and Ak = 0 (since there is no delay yet). Much like the traditional analysis in the proof of Theorem 1, we can derive a linear convergence result with a similar rate. See Section C.2. Lemma 5. If Y satisfies equation 3.1, the energy satisfies E (t) 0, E(t) E(0), and hence: f(Y (t)) + 1 Y (t) + nκ1/2 Y (t) 2 f(Y (0)) + 1 Y (0) + η Y (0) 2 e n 1κ 1/2t We may also analyze an asynchronous version of equation 3.1 to motivate the proof of our main theorem. Here ˆY (t) is a delayed version of Y (t) with the delay bounded by τ. Y + 2n 1κ 1/2 Y + 2n 2κ 1 f ˆY = 0, (3.2) Unfortunately, this energy satisfies (see Section equation C.4, equation C.7): e η 1t E (t) 1 8η Y 2 + 3κ2η 1τD(t), for D(t) Z t Hence this energy E(t) may not be decreasing in general. But, we may add a continuous-time asynchronicity error (see Sun et al. (2017)), much like in Definition 2, to create a decreasing energy. Let c0 0 and r > 0 be arbitrary constants that will be set later. Define: t τ c(t s) Y (s) 2ds, for c(t) c0 e rt + e rτ 1 e rτ e rt 1 . Lemma 6. When rτ 1 2, the asynchronicity error A(t) satisfies: dt ert A(t) c0 Y (t) 2 1 2τ 1c0D(t). 6For compactness, we have omitted the (t) from time-varying functions Y (t), Y (t), Y (t), etc. Published as a conference paper at ICLR 2019 See Section C.3 for the proof. Adding this error to the Lyapunov function serves a similar purpose in the continuous-time case as in the proof of Theorem 1 (see Lemma 11). It allows us to negate 1 2τ 1c0 units of D(t) for the cost of creating c0 units of Y (t) 2. This restores monotonicity. Theorem 3. Let c0 = 6κ2η 1τ 2, and r = η 1. If τ 1 48nκ 1/2 then we have: E(t) + eη 1t A(t) 0. (3.3) Hence f(Y (t)) convergence linearly to f(x ) with rate O exp t/(nκ1/2) Notice how this convergence condition is similar to Corollary 3, but a little looser. The convergence condition in Theorem 1 can actually be improved to approximately match this (see Section E). E(t) + eη 1t A(t) c0 1 8η Y 2 + 3κ2η 1τ 1 = 6η 1κ2 τ 2 1 48n2κ 1 Y 2 0 The preceding should hopefully elucidate the logic and general strategy of the proof of Theorem 1. 4 Related work We now discuss related work that was not addressed in Section 1. Nesterov acceleration is a method for improving an algorithm s iteration complexity s dependence the condition number κ. Nesterov-accelerated methods have been proposed and discovered in many settings Nesterov (1983); Tseng (2008); Nesterov (2012); Lin et al. (2014); Lu & Xiao (2014); Shalev-Shwartz & Zhang (2016); Allen-Zhu (2017), including for coordinate descent algorithms (algorithms that use 1 gradient block if or minimize with respect to 1 coordinate block per iteration), and incremental algorithms (algorithms for finite sum problems 1 n Pn i=1 fi(x) that use 1 function gradient fi(x) per iteration). Such algorithms can often be augmented to solve composite minimization problems (minimization for objective of the form f(x) + g(x), especially for nonsomooth g), or include constraints. In Peng et al. (2016a), authors proposed and analyzed an asynchronous fixed-point algorithm called ARock, that takes proximal algorithms, forward-backward, ADMM, etc. as special cases. 
Work has also been done on asynchronous algorithms for finite sums in the operator setting Davis (2016); Johnstone & Eckstein (2018). In Hannah & Yin (2017b); Sun et al. (2017); Peng et al. (2016c); Cannelli et al. (2017) showed that many of the assumptions used in prior work (such as bounded delay τ < ) were unrealistic and unnecessary in general. In Hannah & Yin (2017a) the authors showed that asynchronous iterations will complete far more iterations per second, and that a wide class of asynchronous algorithms, including asynchronous RBCD, have the same iteration complexity as their synchronous counterparts. Hence certain asynchronous algorithms can be expected to significantly outperform traditional ones. In Xiao et al. (2017) authors propose a novel asynchronous catalyst-accelerated Lin et al. (2015) primal-dual algorithmic framework to solve regularized ERM problems. They structure the parallel updates so that the data that an update depends on is up to date (though the rest of the data may not be). However catalyst acceleration incurs a log(κ) penalty over Nesterov acceleration in general. In Allen-Zhu (2017), the author argues that the inner iterations of catalyst acceleration are hard to tune, making it less practical than Nesterov acceleration. Published as a conference paper at ICLR 2019 5 Numerical experiments To investigate the performance of A2BCD, we solve the ridge regression problem. Consider the following primal and corresponding dual objective (see for instance Lin et al. (2014)): min w Rd P(w) = 1 AT w l 2 + λ 2 w 2, min α Rn D(α) = 1 2d2λ Aα 2 + 1 2d α + l 2 (5.1) where A Rd n is a matrix of n samples and d features, and l is a label vector. We let A = [A1, . . . , Am] where Ai are the column blocks of A. We compare A2BCD (which is asynchronous accelerated), synchronous NU_ACDM (which is synchronous accelerated), and asynchronous RBCD (which is asynchronous non-accelerated). Nodes randomly select a coordinate block according to equation 2.1, calculate the corresponding block gradient, and use it to apply an update to the shared solution vectors. synchronous NU_ACDM is implemented in a batch fashion, with batch size p (1 block per computing node). Nodes in synchronous NU_ACDM implementation must wait until all nodes apply their computed gradients before they can start the next iteration, but the asynchronous algorithms simply compute with the most up-to-date information available. We use the datasets w1a (47272 samples, 300 features), wxa which combines the data from from w1a to w8a (293201 samples, 300 features), and aloi (108000 samples, 128 features) from LIBSVM Chang & Lin (2011). The algorithm is implemented in a multi-threaded fashion using C++11 and GNU Scientific Library with a shared memory architecture. We use 40 threads on two 2.5GHz 10-core Intel Xeon E5-2670v2 processors. See Section A.1 for a discussion of parameter tuning and estimation. The parameters for each algorithm are tuned to give the fastest performance, so that a fair comparison is possible. A critical ingredient in the efficient implementation of A2BCD and NU_ACDM for this problem is the efficient update scheme discussed in Lee & Sidford (2013b;a). In linear regression applications such as this, it is essential to be able to efficiently maintain or recover Ay. This is because calculating block gradients requires the vector AT i Ay, and without an efficient way to recover Ay, block gradient evaluations are essentially 50% as expensive as full-gradient calculations. 
Unfortunately, every accelerated iteration results in dense updates to yk because of the averaging step in equation 2.6. Hence Ay must be recalculated from scratch. However Lee & Sidford (2013a) introduces a linear transformation that allows for an equivalent iteration that results in sparse updates to new iteration variables p and q. The original purpose of this transformation was to ensure that the averaging steps (e.g. equation 2.6) do not dominate the computational cost for sparse problems. However we find a more important secondary use which applies to both sparse and dense problems. Since the updates to p and q are sparse coordinate-block updates, the vectors Ap, and Aq can be efficiently maintained, and therefore block gradients can be efficiently calculated. The specifics of this efficient implementation are discussed in Section A.2. In Table 5, we plot the sub-optimality vs. time for decreasing values of λ, which corresponds to increasingly large condition numbers κ. When κ is small, acceleration doesn t result in a significantly better convergence rate, and hence A2BCD and async-RBCD both outperform sync-NU_ACDM since they complete faster iterations at similar complexity. Acceleration for low κ has unnecessary overhead, which means async-RBCD can be quite competitive. When κ becomes large, async-RBCD is no longer competitive, since it has a poor convergence rate. We observe that A2BCD and sync-NU_ACDM have essentially the same convergence rate, but A2BCD is up to 4 5 faster than sync-NU_ACDM because it completes much faster iterations. We observe this advantage despite the fact that we are in an ideal environment for synchronous computation: A small, homogeneous, high-bandwidth, low-latency cluster. In large-scale heterogeneous systems with greater synchronization overhead, bandwidth constraints, and latency, we expect A2BCD s advantage to be much larger. Published as a conference paper at ICLR 2019 Table 1: Sub-optimality f(yk) f(x ) (y-axis) vs time in seconds (x-axis) for A2BCD, synchronous NU_ACDM, and asynchronous RBCD for data sets w1a, wxa and aloi for various values of λ. 6 Acknowledgement The authors would like to thank the reviewers for their helpful comments. The research presented in this paper was supported in part by AFOSR MURI FA9550-18-10502, NSF DMS-1720237, and ONR N0001417121. Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 1200 1205, New York, NY, USA, 2017. ACM. Zeyuan Allen-Zhu and Elad Hazan. Optimal Black-Box Reductions Between Optimization Objectives. ar Xiv:1603.05642, March 2016. Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. ar Xiv:1512.09103, December 2015. Yossi Arjevani. Limitations on Variance-Reduction and Acceleration Schemes for Finite Sums Optimization. In Advances in Neural Information Processing Systems 30, pp. 3540 3549. Curran Associates, Inc., 2017. H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pp. 198 207, May 2014. Dimitri P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27(1):107 120, 1983. Published as a conference paper at ICLR 2019 REFERENCES Dimitri P. Bertsekas and John N. Tsitsiklis. 
Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997. Loris Cannelli, Francisco Facchinei, Vyacheslav Kungurtsev, and Gesualdo Scutari. Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization. Part II: Complexity and Numerical Results. ar Xiv:1701.04900, January 2017. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1 27:27, May 2011. D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2(2):199 222, April 1969. Patrick L. Combettes and Jonathan Eckstein. Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Mathematical Programming, 168(1-2):645 672, March 2018. Damek Davis. SMART: The stochastic monotone aggregated root-finding algorithm. ar Xiv:1601.00698, January 2016. Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD. ar Xiv:1803.01113 [cs, stat], March 2018. Cong Fang, Yameng Huang, and Zhouchen Lin. Accelerating Asynchronous Algorithms for Convex Optimization by Momentum Compensation. ar Xiv:1802.09747 [cs, math], February 2018. Robert Hannah and Wotao Yin. More Iterations per Second, Same Quality Why Asynchronous Algorithms may Drastically Outperform Traditional Ones. ar Xiv:1708.05136, August 2017a. Robert Hannah and Wotao Yin. On Unbounded Delays in Asynchronous Parallel Fixed-Point Algorithms. Journal of Scientific Computing, pp. 1 28, December 2017b. Zhouyuan Huo and Heng Huang. Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization. ar Xiv:1604.03584, April 2016. Patrick R. Johnstone and Jonathan Eckstein. Projective Splitting with Forward Steps: Asynchronous and Block-Iterative Operator Splitting. ar Xiv:1803.07043 [cs, math], March 2018. Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. ar Xiv:1507.02000, July 2015. Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous Parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 46 54, April 2017. Y. T. Lee and A. Sidford. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 147 156, October 2013a. Yin Tat Lee and Aaron Sidford. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. ar Xiv:1305.1922, May 2013b. Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In International Conference on Machine Learning, pp. 3043 3052, July 2018. Published as a conference paper at ICLR 2019 REFERENCES Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. ar Xiv:1506.02186, June 2015. Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. ar Xiv:1407.1296, July 2014. J. Liu and S. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351 376, January 2015. Ji Liu, Steve Wright, Christopher Re, Victor Bittorf, and Srikrishna Sridhar. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. In International Conference on Machine Learning, pp. 
469 477, January 2014. Tianyi Liu, Shiyang Li, Jianping Shi, Enlu Zhou, and Tuo Zhao. Towards Understanding Acceleration Tradeoffbetween Momentum and Asynchrony in Nonconvex Stochastic Optimization. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2018. Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615 642, August 2014. Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7 35, January 1992. Zhi-Quan Luo and Paul Tseng. On the convergence rate of dual ascent methods for linearly constrained convex minimization. Mathematics of Operations Research, 18(4):846 867, November 1993. Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization. ar Xiv:1507.06970, July 2015. Qi Meng, Wei Chen, Jingcheng Yu, Taifeng Wang, Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous Accelerated Stochastic Gradient Descent. Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341 362, January 2012. Yurii Nesterov. A method of solving a convex programming problem with convergence rate O (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372 376, 1983. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, December 2013. Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: An Algorithmic Framework for Asynchronous Parallel Coordinate Updates. SIAM Journal on Scientific Computing, 38(5):A2851 A2879, January 2016a. Zhimin Peng, Tianyu Wu, Yangyang Xu, Ming Yan, and Wotao Yin. Coordinate friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications, 1(1):57 119, 2016b. Zhimin Peng, Yangyang Xu, Ming Yan, and Wotao Yin. On the Convergence of Asynchronous Parallel Iteration with Unbounded Delays. ar Xiv:1612.04425 [cs, math, stat], December 2016c. Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pp. 693 701, 2011. Published as a conference paper at ICLR 2019 REFERENCES Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants. ar Xiv:1506.06840, June 2015. Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A Stochastic Gradient Method with an Exponential Convergence _Rate for Finite Training Sets. In Advances in Neural Information Processing Systems 25, pp. 2663 2671. Curran Associates, Inc., 2012. Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105 145, January 2016. Weijie Su, Stephen Boyd, and Emmanuel Candes. A Differential Equation for Modeling Nesterov s Accelerated Gradient Method: Theory and Insights. In Advances in Neural Information Processing Systems 27, pp. 2510 2518. 2014. Tao Sun, Robert Hannah, and Wotao Yin. Asynchronous Coordinate Descent under More Realistic Assumptions. In Advances in Neural Information Processing Systems 30, pp. 6183 6191. 2017. P. Tseng. On the rate of convergence of a partially asynchronous gradient projection algorithm. 
SIAM Journal on Optimization, 1(4):603 619, November 1991. P. Tseng, D. Bertsekas, and J. Tsitsiklis. Partially asynchronous, parallel algorithms for network flow and other problems. SIAM Journal on Control and Optimization, 28(3):678 710, March 1990. Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Department of Mathematics, University of Washington, Tech. Rep., 2008. Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization. ar Xiv:1710.05080, October 2017. Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. Distributed Asynchronous Optimization with Unbounded Delays: How Slow Can You Go? In International Conference on Machine Learning, pp. 5970 5979, July 2018. Published as a conference paper at ICLR 2019 A Efficient Implementation An efficient implementation will have coordinate blocks of size greater than 1. This to ensure the efficiency of linear algebra subroutines. Especially because of this, the bulk of the computation for each iteration is computing ikf(ˆyk), and not the averaging steps. Hence the computing nodes only need a local copy of yk in order to do the bulk of an iteration s computation. Given this gradient ikf(ˆyk), updating yk and vk is extremely fast (xk can simply be eliminated). Hence it is natural to simply store yk and vk centrally, and update them when the delayed gradients ikf(ˆyk). Given the above, a write mutex over (y, v) has minuscule overhead (which we confirm with experiments), and makes the labeling of iterates unambiguous. This also ensures that vk and yk are always up to date when (y, v) are being updated. Whereas the gradient ikf(ˆyk) may at the same time be out of date, since it has been calculated with an outdated version of yk. However a write mutex is not necessary in practice, and does not appear to affect convergence rates or computation time. Also it is possible to prove convergence under more general asynchronicity. A.1 Parameter selection and tuning When defining the coefficients, σ may be underestimated, and L, L1, . . . , Ln may be overestimated if exact values are unavailable. Notice that xk can be eliminated from the above iteration, and the block gradient ikf(ˆyk) only needs to be calculated once per iteration. A larger (or overestimated) maximum delay τ will cause a larger asynchronicity parameter ψ, which leads to more conservative step sizes to compensate. To estimate ψ, one can first performed a dry run with all coefficient set to 0 to estimate τ. All function parameters can be calculated exactly for this problem in terms of the data matrix and λ. We can then use these parameters and this tau to calculate ψ. ψ and τ merely change the parameters, and do not change execution patterns of the processors. Hence their parameter specification doesn t affect the observed delay. Through simple tuning though, we found that ψ = 0.25 resulted in good performance. In tuning for general problems, there are theoretical reasons why it is difficult to attain acceleration without some prior knowledge of σ, the strong convexity modulus Arjevani (2017). Ideally σ is pre-specified for instance in a regularization term. If the Lipschitz constants Li cannot be calculated directly (which is rarely the case for the classic dual problem of empirical risk minimization objectives), the line-search method discussed in Roux et al. (2012) Section 4 can be used. 
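As a small illustration of the parameter workflow described in Section A.1, the sketch below collects the coefficient computations of eqs. (2.2) to (2.5) into one helper, taking (possibly estimated) $L_i$, $L$, $\sigma$, and a measured maximum delay $\tau$, with an optional hand-tuned $\psi$ as the text suggests. The function name and example values are hypothetical, not from the paper.

```python
import numpy as np

def a2bcd_coefficients(L_blocks, L_full, sigma, tau, psi_tuned=None):
    """Return (psi, alpha, beta, h) following eqs. (2.2)-(2.5).

    L_blocks:  block Lipschitz constants L_i (overestimates are safe),
    L_full:    Lipschitz constant of the full gradient,
    sigma:     strong convexity modulus (underestimates are safe),
    tau:       maximum delay, e.g. measured in a dry run,
    psi_tuned: optional hand-tuned asynchronicity parameter
               (Section A.1 reports psi = 0.25 working well in practice).
    """
    L_blocks = np.asarray(L_blocks, dtype=float)
    S = np.sum(np.sqrt(L_blocks))
    L_under = L_blocks.min()
    kappa = L_full / sigma
    # Eq. (2.2); in practice this is conservative, hence the tuned override.
    psi = 9.0 * tau * L_full**0.75 * kappa**0.25 / np.sqrt(S * L_under)
    if psi_tuned is not None:
        psi = psi_tuned
    alpha = 1.0 / (1.0 + (1.0 + psi) * S / np.sqrt(sigma))   # eq. (2.3)
    beta = 1.0 - (1.0 - psi) * np.sqrt(sigma) / S            # eq. (2.4)
    h = 1.0 - 0.5 * psi * np.sqrt(sigma / L_under)           # eq. (2.5)
    return psi, alpha, beta, h

# Hypothetical example: 8 unequal blocks, measured maximum delay tau = 12.
print(a2bcd_coefficients(L_blocks=[1.0, 2.0, 4.0, 8.0, 1.5, 3.0, 2.5, 5.0],
                         L_full=20.0, sigma=0.1, tau=12, psi_tuned=0.25))
```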
A.2 Sparse update formulation As mentioned in Section 5, authors in Lee & Sidford (2013a) proposed a linear transformation of an accelerated RBCD scheme that results in sparse coordinate updates. Our proposed algorithm can be given a similar efficient implementation. We may eliminate xk from A2BCD, and derive the equivalent iteration below: = 1 αβ, αβ 1 β, β ασ 1/2L 1/2 ik + h(1 α)L 1 ik ikf ˆyk σ 1/2L 1/2 ik ikf ˆyk Published as a conference paper at ICLR 2019 A.2 Sparse update formulation where C and Qk are defined in the obvious way. Hence we define auxiliary variables pk, qk defined via: yk vk These clearly follow the iteration: pk+1 qk+1 C (k+1)Qk (A.2) Since the vector Qk is sparse, we can evolve variables pk, and qk in a sparse manner, and recover the original iteration variables at the end of the algorithm via A.1. The gradient of the dual function is given by: d AT Ay + λ(y + l) As mentioned before, it is necessary to maintain or recover Ayk to calculate block gradients. Since Ayk can be recovered via the linear relation in equation A.1, and the gradient is an affine function, we maintain the auxiliary vectors Apk and Aqk instead. Hence we propose the following efficient implementation in Algorithm 1. We used this to generate the results in Table 5. We also note also that it can improve performance to periodically recover vk and yk, reset the values of pk, qk, and C to vk, yk, and I respectively, and restarting the scheme (which can be done cheaply in time O(d)). We let B R2 2 represent Ck, and b represent B 1. is the Kronecker product. Each computing node has local outdated versions of p, q, Ap, Aq which we denote ˆp, ˆq, ˆ Ap, ˆ Aq respectively. We also find it convenient to define: Dk 1 Dk 2 " ασ 1/2L 1/2 ik + h(1 α)L 1 ik σ 1/2L 1/2 ik Published as a conference paper at ICLR 2019 Algorithm 1 Shared-memory implementation of A2BCD 1: Inputs: Function parameters A, λ, L, {Li}n i=1, n, d. Delay τ (obtained in dry run). Starting vectors y, v. 2: Shared data: Solution vectors p, q; auxiliary vectors Ap, Aq; sparsifying matrix B 3: Node local data: Solution vectors ˆp, ˆq, auxiliary vectors ˆ Ap, ˆ Aq, sparsifying matrix ˆB. 4: Calculate parameters ψ, α, β, h via 1. Set k = 0. 5: Initializations: p y, q v, Ap Ay, Aq Av, B I. 6: while not converged, each computing node asynchronous do 7: Randomly select block i via equation 2.1. 8: Read shared data into local memory: ˆp p, ˆq q, ˆ Ap Ap, ˆ Aq Aq, ˆB B. 9: Compute block gradient: if(ˆy) = 1 nλ 1 n AT i ˆB1,1 ˆ Ap + ˆB1,2 ˆ Aq + λ ˆB1,1ˆp + ˆB1,2ˆq 10: Compute quantity gi = AT i if(ˆy) Shared memory updates: 11: Update B 1 αβ αβ 1 β β B, calculate inverse b B 1. = b Dk 1 Dk 2 if(ˆy) , Ap Aq = b Dk 1 Dk 2 13: Increase iteration count: k k + 1 14: end while 15: Recover original iteration variables: y v . Output y. B Proof of the main result We first recall a couple of inequalities for convex functions. Lemma 7. Let f be σ-strongly convex with L-Lipschitz gradient. 
Then we have: f(y) f(x) + y x, f(x) + 1 2L y x 2, x, y (B.1) f(y) f(x) + y x, f(x) + 1 2σ y x 2, x, y (B.2) We also find it convenient to define the norm: i=1 L 1/2 i si 2 (B.3) Published as a conference paper at ICLR 2019 B.1 Starting point B.1 Starting point First notice that using the definition equation 2.8 of vk+1 we have: vk+1 2 = βvk + (1 β)yk 2 2σ 1/2L 1/2 ik βvk + (1 β)yk, ikf(ˆyk) + σ 1L 1 ik ikf(ˆyk) 2 Ek vk+1 2 = βvk + (1 β)yk 2 | {z } A 2σ 1/2S 1 βvk + (1 β)yk, f(ˆyk) | {z } B + S 1σ 1 n X i=1 L 1/2 i if(ˆyk) 2 We have the following general identity: βx + (1 β)y 2 = β x 2 + (1 β) y 2 β(1 β) x y 2, x, y (B.5) It can also easily be verified from equation 2.6 that we have: vk = yk + α 1(1 α)(yk xk) (B.6) Using equation B.5 on term A, equation B.6 on term B, and recalling the definition equation B.3 on term C, we have from equation B.4: Ek vk+1 2 = β vk 2 + (1 β) yk 2 β(1 β) vk yk 2 + S 1σ 1/2 f(ˆyk) 2 (B.7) 2σ 1/2S 1βα 1(1 α) yk xk, f(ˆyk) 2σ 1/2S 1 yk, f(ˆyk) This inequality is our starting point. We analyze the terms on the second line in the next section. B.2 The Cross Term To analyze these terms, we need a small lemma. This lemma is fundamental in allowing us to deal with asynchronicity. Lemma 8. Let χ, A > 0. Let the delay be bounded by τ. Then: j=1 yk+1 j yk j 2 Proof. See Hannah & Yin (2017a). Lemma 9. We have: f(ˆyk), yk f(yk) 1 2σ(1 ψ) yk 2 + 1 j=1 yk+1 j yk j 2 (B.8) f(ˆyk), xk yk f(xk) f(yk) (B.9) κ 1ψβ vk yk 2 + κψ 1β 1τ j=1 yk+1 j yk j 2 The terms in bold in equation B.8 and equation B.9 are a result of the asynchronicity, and are identically 0 in its absence. Published as a conference paper at ICLR 2019 B.3 Function-value term Proof. Our strategy is to separately analyze terms that appear in the traditional analysis of Nesterov (2012), and the terms that result from asynchronicity. We first prove equation B.8: f(ˆyk), yk = f(yk), yk f(ˆyk) f(yk), yk 2σ yk 2 + L ˆyk yk yk (B.10) equation B.10 follows from strong convexity (equation B.2 with x = yk and y = x ), and the fact that f is L-Lipschitz. The term due to asynchronicity becomes: L ˆyk yk yk 1 2Lκ 1ψ yk 2 + 1 j=1 yk+1 j yk j 2 using Lemma 8 with χ = κψ 1, A = yk . Combining this with equation B.10 completes the proof of equation B.8. We now prove equation B.9: f(ˆyk), xk yk = f(yk), xk yk + f(ˆyk) f(yk), xk yk f(xk) f(yk) + L ˆyk yk xk yk f(xk) f(yk) κ 1ψβα 1(1 α) xk yk 2 + κψ 1β 1α(1 α) 1τ j=1 yk+1 j yk j 2 Here the last line follows from Lemma 8 with χ = κψ 1β 1α(1 α) 1, A = nxk yk. We can complete the proof using the following identity that can be easily obtained from equation 2.6: yk xk = α(1 α) 1(vk yk) B.3 Function-value term Much like Nesterov (2012), we need a f(xk) term in the Lyapunov function (see the middle of page 357). However we additionally need to consider asynchronicity when analyzing the growth of this term. Again terms due to asynchronicity are emboldened. Lemma 10. We have: Ekf(xk+1) f(yk) 1 2h 2 h 1 + 1 2σ1/2L 1/2ψ S 1 f(ˆyk) 2 + S 1Lσ1/2κψ 1τ j=1 yk+1 j yk j 2 Proof. From the definition equation 2.7 of xk+1, we can see that xk+1 yk is supported on block ik. 
Since each gradient block if is Li Lipschitz with respect to changes to block i, we can use Published as a conference paper at ICLR 2019 B.4 Asynchronicity error equation B.1 to obtain: f(xk+1) f(yk) + f(yk), xk+1 yk + 1 2Lik xk+1 yk 2 (from equation 2.7) = f(yk) h L 1 ik ikf(yk), ikf(ˆyk) + 1 2h2L 1 ik ikf(ˆyk) 2 = f(yk) h L 1 ik ikf(yk) ikf(ˆyk), ikf(ˆyk) 1 2h(2 h)L 1 ik ikf(ˆyk) 2 Ekf(xk+1) f(yk) h S 1 n X i=1 L 1/2 i if(yk) if(ˆyk), if(ˆyk) 1 2h(2 h)S 1 f(ˆyk) 2 Here the last line followed from the definition equation B.3 of the norm 1/2. We now analyze the middle term: i=1 L 1/2 i if(yk) if(ˆyk), if(ˆyk) i=1 L 1/4 i ( if(yk) if(ˆyk)), i=1 L 1/4 i if(ˆyk) (Cauchy Schwarz) i=1 L 1/4 i ( if(yk) if(ˆyk)) i=1 L 1/4 i if(ˆyk) i=1 L 1/2 i if(yk) if(ˆyk) 2 !1/2 n X i=1 L 1/2 i if(ˆyk) 2 !1/2 (L Li, i and equation B.3) L 1/4 f(yk) f(ˆyk) f(ˆyk) ( f is L-Lipschitz) L 1/4L yk ˆyk f(ˆyk) We then apply Lemma 8 to this with χ = 2h 1σ1/2L1/4κψ 1, A = f(ˆyk) to yield: i=1 L 1/2 i if(yk) if(ˆyk), if(ˆyk) h 1Lσ1/2κψ 1τ j=1 yk+1 j yk j 2 (B.12) 4hσ1/2L 1/2ψ f(ˆyk) 2 Finally to complete the proof, we combine equation B.11, with equation B.12. B.4 Asynchronicity error The previous inequalities produced difference terms of the form yk+1 j yk j 2. The following lemma shows how these errors can be incorporated into a Lyapunov function. Lemma 11. Let 0 < r < 1 and consider the asynchronicity error and corresponding coefficients: j=1 cj yk+1 j yk j 2 j=i ri j 1sj Published as a conference paper at ICLR 2019 B.4 Asynchronicity error This sum satisfies: Ek[Ak+1 r Ak] = c1Ek yk+1 yk 2 j=1 sj yk+1 j yk j 2 Remark 2. Interpretation. This result means that an asynchronicity error term Ak can negate a series of difference terms P j=1 sj yk+1 j yk j 2 at the cost of producing an additional error c1Ek yk+1 yk 2, while maintaining a convergence rate of r. This essentially converts difference terms, which are hard to deal with, into a yk+1 yk 2 term which can be negated by other terms in the Lyapunov function. The proof is straightforward. Ek[Ak+1 r Ak] = Ek j=0 cj+1 yk+1 j yk j 2 r Ek j=1 cj yk+1 j yk j 2 = c1Ek yk+1 yk 2 + Ek j=1 (cj+1 rcj) yk+1 j yk j 2 Noting the following completes the proof: j=i+1 ri+1 j 1sj r j=i ri j 1sj = si Given that Ak allows us to negate difference terms, we now analyze the cost c1Ek yk+1 yk 2 of this negation. Lemma 12. We have: Ek yk+1 yk 2 2α2β2 vk yk 2 + 2S 1L 1 f(ˆyk) 2 yk+1 yk = (αvk+1 + (1 α)xk+1) yk = α βvk + (1 β)yk σ 1/2L 1/2 ik ikf(ˆyk) + (1 α) yk h L 1 ik ikf(ˆyk) yk (B.13) = αβvk + α(1 β)yk ασ 1/2L 1/2 ik ikf(ˆyk) αyk (1 α)h L 1 ik ikf(ˆyk) = αβ(vk yk) ασ 1/2L 1/2 ik + h(1 α)L 1 ik ikf(ˆyk) yk+1 yk 2 2α2β2 vk yk 2 + 2 ασ 1/2L 1/2 ik + h(1 α)L 1 ik 2 ikf(ˆyk) 2 (B.14) Published as a conference paper at ICLR 2019 B.5 Master inequality Here equation B.13 following from equation 2.8, the definition of vk+1. equation B.14 follows from the inequality x + y 2 2 x 2 + 2 y 2. The rest is simple algebraic manipulation. yk+1 yk 2 2α2β2 vk yk 2 + 2L 1 ik ασ 1/2 + h(1 α)L 1/2 ik 2 ikf(ˆyk) 2 (L Li, i) 2α2β2 vk yk 2 + 2L 1 ik ασ 1/2 + h(1 α)L 1/2 2 ikf(ˆyk) 2 = 2α2β2 vk yk 2 + 2L 1 ik L 1 L1/2σ 1/2α + h(1 α) 2 ikf(ˆyk) 2 E yk+1 yk 2 2α2β2 vk yk 2 + 2S 1L 1 L1/2σ 1/2α + h(1 α) 2 f(ˆyk) 2 Finally, to complete the proof, we prove L1/2σ 1/2α + h(1 α) 1. 
L1/2σ 1/2α + h(1 α) = h + α L1/2σ 1/2 h (definitions of h and α: equation 2.3, and equation 2.5) = 1 1 2σ1/2L 1/2ψ + σ1/2S 1 1 σ1/2L 1/2 1 2ψ σ 1/2S 1L1 Rearranging the definition of ψ, we have: 92 ψ2L1L 3/2κ 1/2τ 2 (τ 1 and ψ 1 2) 1 182 L1L 3/2κ 1/2 Using this on equation B.15, we have: L1/2ασ 1/2 + h(1 α) 1 σ1/2L 1/2 1 2ψ 1 182 L1L 3/2κ 1/2σ 1/2L1 = 1 σ1/2L 1/2 1 2ψ 1 182 (L/L)2 2) = 1 σ1/2L 1/2 1 This completes the proof. B.5 Master inequality We are finally in a position to bring together all the all the previous results together into a master inequality for the Lyapunov function ρk (defined in equation 2.11). After this lemma is proven, we will prove that the right hand size is negative, which will imply that ρk linearly converges to 0 with rate β. Published as a conference paper at ICLR 2019 B.5 Master inequality Lemma 13. Master inequality. We have: Ek[ρk+1 βρk] + yk 2 1 β σ 1/2S 1σ(1 ψ) (B.16) + vk yk 2 β 2α2βc1 + S 1βL1/2κ 1/2ψ (1 β) + f(yk) c 2σ 1/2S 1 βα 1(1 α) + 1 + f(xk) β 2σ 1/2S 1α 1(1 α) c j=1 yk+1 j yk j 2 S 1Lκψ 1τσ1/2 2σ 1 + c s + f(ˆyk) 2 S 1 σ 1 + 2L 1c1 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ Ek vk+1 2 β vk 2 (B.7) = (1 β) yk 2 β(1 β) vk yk 2 + S 1σ 1 f(ˆyk) 2 2σ 1/2S 1 yk, f(ˆyk) 2σ 1/2S 1βα 1(1 α) yk xk, f(ˆyk) (1 β) yk 2 β(1 β) vk yk 2 + S 1σ 1 f(ˆyk) 2 (B.17) (B.8) + 2σ 1/2S 1 2σ(1 ψ) yk 2 + 1 j=1 yk+1 j yk j 2 (B.9) 2σ 1/2S 1βα 1(1 α)(f(xk) f(yk)) + σ 1/2S 1βL κ 1ψβ vk yk 2 + κψ 1β 1τ j=1 yk+1 j yk j 2 We now collect and organize the similar terms of this inequality. + yk 2 1 β σ 1/2S 1σ(1 ψ) + vk yk 2 β σ 1/2S 1βLκ 1ψ (1 β) f(yk) 2σ 1/2S 1 βα 1(1 α) + 1 + f(xk) 2σ 1/2S 1βα 1(1 α) j=1 yk+1 j yk j 2 2σ 1/2S 1Lκψ 1τ + f(ˆyk) 2 σ 1S 1 Published as a conference paper at ICLR 2019 B.6 Proof of main theorem Now finally, we add the function-value and asynchronicity terms to our analysis. We use Lemma 11 is with r = 1 σ1/2S 1, and si = s = 6S 1L1/2κ3/2ψ 1τ, 1 i τ 0, i > τ (B.18) Notice that this choice of si will recover the coefficient formula given in equation 2.9. Hence we have: Ek[cf(xk+1) + Ak+1 β(cf(xk) + Ak)] (Lemma 10) cf(yk) 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ S 1 f(ˆyk) 2 βcf(xk) + S 1Lσ1/2κψ 1τ j=1 yk+1 j yk j 2 (Lemmas 11 and 12) + c1 2α2β2 vk yk 2 + 2S 1L 1 f(ˆyk) 2 (B.20) j=1 sj yk+1 j yk j 2 + Ak(r β) Notice Ak(r β) 0. Finally, combining equation B.17 and equation B.19 completes the proof. In the next section, we will prove that every coefficient on the right hand side of equation B.16 is 0 or less, which will complete the proof of Theorem 1. B.6 Proof of main theorem Lemma 14. The coefficients of yk 2, f(yk), and Pτ j=1 yk+1 j yk j 2 in Lemma 13 are non-positive. Proof. The coefficient 1 (1 ψ)σ1/2S 1 β of yk 2 is identically 0 via the definition equation 2.4 of β. The coefficient c 2σ 1/2S 1 βα 1(1 α) + 1 of f(yk) is identically 0 via the definition equation 2.12 of c. First notice from the definition equation 2.12 of c: c = 2σ 1/2S 1 βα 1(1 α) + 1 (definitions of α, β) = 2σ 1/2S 1 1 σ1/2S 1(1 ψ) (1 + ψ)σ 1/2S + 1 = 2σ 1/2S 1 (1 + ψ)σ 1/2S + ψ2 = 2σ 1 (1 + ψ) + ψ2σ1/2S 1 (B.21) c 4σ 1 (B.22) Published as a conference paper at ICLR 2019 B.6 Proof of main theorem Here the last line followed since ψ 1 2 and σ1/2S 1 1. We now analyze the coefficient of Pτ j=1 yk+1 j yk j 2. S 1Lκψ 1τσ1/2 2σ 1 + c s (B.22) 6L1/2κ3/2ψ 1τ s (definition equation B.18 of s) 0 Lemma 15. The coefficient β 2σ 1/2S 1α 1(1 α) c of f(xk) in Lemma 13 is non-positive. 2σ 1/2S 1α 1(1 α) c (B.21) = 2σ 1/2S 1(1 + ψ)σ 1/2S 2σ 1 (1 + ψ) + ψ2σ1/2S 1 = 2σ 1 (1 + ψ) (1 + ψ) + ψ2σ1/2S 1 = 2ψ2σ 1/2S 1 0 Lemma 16. 
The coefficient S 1 σ 1 + 2L 1c1 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ of f(ˆyk) 2 in Lemma 13 is non-positive. Proof. We first need to bound c1. (equation B.18 and equation 2.9) c1 = s 1 σ1/2S 1 j equation B.18 6S 1L1/2κ3/2ψ 1τ 1 σ1/2S 1 j 6S 1L1/2κ3/2ψ 1τ 2 1 σ1/2S 1 τ It can be easily verified that if x 1 2 and y 0, then (1 x) y exp(2xy). Using this fact with x = σ1/2S 1 and y = τ, we have: 6S 1L1/2κ3/2ψ 1τ 2 exp τσ1/2S 1 (since ψ 3/7 and hence τσ1/2S 1 1 7) S 1L1/2κ3/2ψ 1τ 2 6 exp 1 c1 7S 1L1/2κ3/2ψ 1τ 2 (B.23) Published as a conference paper at ICLR 2019 B.6 Proof of main theorem We now analyze the coefficient of f(ˆyk) 2 σ 1 + 2L 1c1 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ (B.23 and 2.5) σ 1 + 14S 1L 1L1/2κ3/2ψ 1τ 2 1 σ 1 + 14S 1L 1L1/2κ3/2ψ 1τ 2 1 (definition 2.2 of ψ) = σ 1 + 14 (B.21, definition 2.5 of h) = σ 1 1 + 14 81ψ (1 + ψ) + ψ2σ1/2S 1 1 1 2σ1/2L 1/2ψ (σ1/2L 1/2 0 and σ1/2S 1 1) σ 1 1 + 14 81ψ (1 + ψ) 1 1 Lemma 17. The coefficient β 2α2βc1 + S 1βL1/2κ 1/2ψ (1 β) of vk yk 2 in 13 is nonpositive. 2α2βc1 + σ1/2S 1βψ (1 ψ)σ1/2S 1 (B.23) 14α2βS 1L1/2κ3/2ψ 1τ 2 + σ1/2S 1βψ (1 ψ)σ1/2S 1 14σS 3L1/2κ3/2ψ 1τ 2 + σ1/2S 1ψ (1 ψ)σ1/2S 1 = σ1/2S 1 14S 2Lκτ 2ψ 1 + 2ψ 1 Here the last inequality follows since β 1 and α σ1/2S 1. We now rearrange the definition of ψ to yield the identity: 94 L2L 3τ 4ψ4 Using this, we have: 14S 2Lκτ 2ψ 1 + 2ψ 1 94 L2L 2ψ3τ 2 + 2ψ 1 Here the last line followed since L L, ψ 3 7, and τ 1. Hence the proof is complete. Proof of Theorem 1. Using the master inequality 13 in combination with the previous Lemmas 14, 15, 16, and 17, we have: Ek[ρk+1] βρk = 1 (1 ψ)σ1/2S 1 ρk Published as a conference paper at ICLR 2019 When we have: 1 (1 ψ)σ1/2S 1 k ϵ then the Lyapunov function ρk has decreased below ϵρ0 in expectation. Hence the complexity K(ϵ) satisfies: K(ϵ) ln 1 (1 ψ)σ1/2S 1 = ln(ϵ) K(ϵ) = 1 ln 1 (1 ψ)σ1/2S 1 ln(1/ϵ) Now it can be shown that for 0 < x 1 2, we have: 1 x 1 1 ln(1 x) 1 2 1 ln(1 x) = 1 Since n 2, we have σ1/2S 1 1 K(ϵ) = 1 1 ψ σ 1/2S + O(1) ln(1/ϵ) An expression for KNU_ACDM(ϵ), the complexity of NU_ACDM follows by similar reasoning. KNU_ACDM(ϵ) = σ 1/2S + O(1) ln(1/ϵ) (B.24) Finally we have: K(ϵ) = 1 1 ψ σ 1/2S + O(1) σ 1/2S + O(1) KNU_ACDM(ϵ) = 1 1 ψ (1 + o(1))KNU_ACDM(ϵ) which completes the proof. C Ordinary Differential Equation Analysis C.1 Derivation of ODE for synchronous A2BCD If we take expectations with respect to Ek, then synchronous (no delay) A2BCD becomes: yk = αvk + (1 α)xk Ekxk+1 = yk n 1κ 1 f(yk) Ekvk+1 = βvk + (1 β)yk n 1κ 1/2 f(yk) We find it convenient to define η = nκ1/2. Inspired by this, we consider the following iteration: yk = αvk + (1 α)xk (C.1) xk+1 = yk s1/2κ 1/2η 1 f(yk) (C.2) vk+1 = βvk + (1 β)yk s1/2η 1 f(yk) (C.3) Published as a conference paper at ICLR 2019 C.1 Derivation of ODE for synchronous A2BCD for coefficients: α = 1 + s 1/2η 1 β = 1 s1/2η 1 s is a discretization scale parameter that will be sent to 0 to obtain an ODE analogue of synchronous A2BCD. We first use equation B.6 to eliminate vk from from equation C.3. 
We first use equation B.6 to eliminate $v_k$ from equation C.3:
\[
0 = -v_{k+1} + \beta v_k + (1-\beta)y_k - s^{1/2}\eta^{-1}\nabla f(y_k)
= -\alpha^{-1}y_{k+1} + \alpha^{-1}(1-\alpha)x_{k+1} + \beta\big(\alpha^{-1}y_k - \alpha^{-1}(1-\alpha)x_k\big) + (1-\beta)y_k - s^{1/2}\eta^{-1}\nabla f(y_k).
\]
Multiplying by $\alpha$:
\[
0 = -y_{k+1} + (1-\alpha)x_{k+1} + \beta\big(y_k - (1-\alpha)x_k\big) + \alpha(1-\beta)y_k - \alpha s^{1/2}\eta^{-1}\nabla f(y_k)
= -y_{k+1} + y_k\big(\beta + \alpha(1-\beta)\big) + (1-\alpha)x_{k+1} - x_k\beta(1-\alpha) - \alpha s^{1/2}\eta^{-1}\nabla f(y_k).
\]
We now eliminate $x_{k+1}$ and $x_k$ using equation C.2:
\[
0 = -y_{k+1} + y_k\big(\beta + \alpha(1-\beta)\big) + (1-\alpha)\big(y_k - s^{1/2}\eta^{-1}\kappa^{-1/2}\nabla f(y_k)\big) - \beta(1-\alpha)\big(y_{k-1} - s^{1/2}\eta^{-1}\kappa^{-1/2}\nabla f(y_{k-1})\big) - \alpha s^{1/2}\eta^{-1}\nabla f(y_k),
\]
and collecting terms,
\[
0 = (y_k - y_{k+1}) + \beta(1-\alpha)(y_k - y_{k-1}) + s^{1/2}\eta^{-1}\big((\beta-1)(1-\alpha)\nabla f(y_{k-1}) - \alpha\nabla f(y_k)\big).
\]
Now, to derive an ODE, we let $y_k = Y(ks^{1/2})$. Then $\nabla f(y_{k-1}) = \nabla f(y_k) + O(s^{1/2})$, and the above becomes
\[
0 = -\Big(s^{1/2}\dot Y + \tfrac12 s\ddot Y\Big) + \beta(1-\alpha)\Big(s^{1/2}\dot Y - \tfrac12 s\ddot Y\Big) + s^{1/2}\eta^{-1}\big((\beta-1)(1-\alpha) - \alpha\big)\nabla f(y_k) + O(s^{3/2}). \tag{C.4}
\]

We now examine some of the terms in this equation to find their highest-order dependence on $s$. First,
\[
\beta(1-\alpha) = \big(1 - s^{1/2}\eta^{-1}\big)\Big(1 - \frac{1}{1 + s^{-1/2}\eta}\Big)
= \big(1 - s^{1/2}\eta^{-1}\big)\,\frac{s^{-1/2}\eta}{1 + s^{-1/2}\eta}
= \frac{1 - s^{1/2}\eta^{-1}}{1 + s^{1/2}\eta^{-1}}
= 1 - 2s^{1/2}\eta^{-1} + O(s).
\]
We also have
\[
(\beta-1)(1-\alpha) - \alpha = \beta(1-\alpha) - 1 = -2s^{1/2}\eta^{-1} + O(s).
\]
Hence, using these facts in equation C.4, we have
\[
0 = -\Big(s^{1/2}\dot Y + \tfrac12 s\ddot Y\Big) + \big(1 - 2s^{1/2}\eta^{-1} + O(s)\big)\Big(s^{1/2}\dot Y - \tfrac12 s\ddot Y\Big) + s^{1/2}\eta^{-1}\big(-2s^{1/2}\eta^{-1} + O(s)\big)\nabla f(y_k) + O(s^{3/2})
\]
\[
= -s\ddot Y - 2s\eta^{-1}\dot Y - 2s\eta^{-2}\nabla f(y_k) + O(s^{3/2}),
\]
so that
\[
0 = \ddot Y + 2\eta^{-1}\dot Y + 2\eta^{-2}\nabla f(y_k) + O(s^{1/2}).
\]
Taking the limit as $s \to 0$, we obtain the ODE
\[
\ddot Y(t) + 2\eta^{-1}\dot Y(t) + 2\eta^{-2}\nabla f(Y(t)) = 0.
\]

C.2 Convergence proof for synchronous ODE

Differentiating the energy $E(t) = e^{\eta^{-1}t}\big(f(Y(t)) + \tfrac14\|Y(t)+\eta\dot Y(t)\|^{2}\big)$, and using the ODE to substitute $\dot Y(t) + \eta\ddot Y(t) = -\dot Y(t) - 2\eta^{-1}\nabla f(Y(t))$, we have
\[
e^{-\eta^{-1}t}E'(t) = \big\langle\nabla f(Y(t)),\dot Y(t)\big\rangle + \eta^{-1}f(Y(t)) + \tfrac12\big\langle Y(t)+\eta\dot Y(t),\,\dot Y(t)+\eta\ddot Y(t)\big\rangle + \tfrac14\eta^{-1}\big\|Y(t)+\eta\dot Y(t)\big\|^{2}
\]
\[
\overset{\text{(strong convexity, equation B.2)}}{\le} \big\langle\nabla f(Y),\dot Y\big\rangle + \eta^{-1}\big\langle\nabla f(Y),Y\big\rangle - \tfrac{\sigma}{2}\eta^{-1}\|Y\|^{2} + \tfrac12\big\langle Y+\eta\dot Y,\,-\dot Y - 2\eta^{-1}\nabla f(Y)\big\rangle + \tfrac14\eta^{-1}\big\|Y+\eta\dot Y\big\|^{2}.
\]
Hence we have $E'(t) \le 0$, and therefore $E(t) \le E(0)$. That is,
\[
E(t) = e^{n^{-1}\kappa^{-1/2}t}\Big(f(Y(t)) + \tfrac14\|Y(t)+\eta\dot Y(t)\|^{2}\Big) \le E(0) = f(Y(0)) + \tfrac14\|Y(0)+\eta\dot Y(0)\|^{2}, \tag{C.5}
\]
which implies
\[
f(Y(t)) + \tfrac14\|Y(t)+\eta\dot Y(t)\|^{2} \le e^{-n^{-1}\kappa^{-1/2}t}\Big(f(Y(0)) + \tfrac14\|Y(0)+\eta\dot Y(0)\|^{2}\Big). \tag{C.6}
\]

C.3 Asynchronicity error lemma

This result is the continuous-time analogue of Lemma 11. First notice that $c(0) = c_0$ and $c(\tau) = 0$. We also have
\[
c'(t)/c_0 = \frac{-re^{-rt}}{1 - e^{-r\tau}} = -r\,\frac{e^{-rt} - e^{-r\tau} + e^{-r\tau}}{1 - e^{-r\tau}} = -r\Big(\frac{c(t)}{c_0} + \frac{e^{-r\tau}}{1 - e^{-r\tau}}\Big),
\]
that is,
\[
c'(t) = -rc(t) - rc_0\,\frac{e^{-r\tau}}{1 - e^{-r\tau}}.
\]
Hence, using $c(\tau) = 0$,
\[
A'(t) = c_0\|\dot Y(t)\|^{2} + \int_{t-\tau}^{t}c'(t-s)\|\dot Y(s)\|^{2}\,ds
= c_0\|\dot Y(t)\|^{2} - rA(t) - rc_0\,\frac{e^{-r\tau}}{1 - e^{-r\tau}}\,D(t).
\]
Now, when $x \le \tfrac12$ we have $e^{-x}/(1-e^{-x}) \ge 1/(2x)$. Hence when $r\tau \le \tfrac12$ we have
\[
A'(t) \le c_0\|\dot Y(t)\|^{2} - rA(t) - \frac{c_0}{2\tau}\,D(t),
\]
and the result easily follows.

C.4 Convergence analysis for the asynchronous ODE

We consider the same energy as in the synchronous case (that is, for the ODE in equation 3.1). Similarly to before, we have
\[
e^{-\eta^{-1}t}E'(t) \le \big\langle\nabla f(Y),\dot Y\big\rangle + \eta^{-1}\big\langle\nabla f(Y),Y\big\rangle - \tfrac{\sigma}{2}\eta^{-1}\|Y\|^{2} + \tfrac12\big\langle Y+\eta\dot Y,\,-\dot Y - 2\eta^{-1}\nabla f(\hat Y)\big\rangle + \tfrac14\eta^{-1}\|Y+\eta\dot Y\|^{2}
\]
\[
= \big\langle\nabla f(Y),\dot Y\big\rangle + \eta^{-1}\big\langle\nabla f(Y),Y\big\rangle - \tfrac{\sigma}{2}\eta^{-1}\|Y\|^{2} + \tfrac12\big\langle Y+\eta\dot Y,\,-\dot Y - 2\eta^{-1}\nabla f(Y)\big\rangle + \tfrac14\eta^{-1}\|Y+\eta\dot Y\|^{2} - \eta^{-1}\big\langle Y+\eta\dot Y,\,\nabla f(\hat Y)-\nabla f(Y)\big\rangle
\]
\[
\le -\tfrac14\eta^{-1}\|Y\|^{2} - \tfrac14\eta\|\dot Y\|^{2} - \eta^{-1}\big\langle Y+\eta\dot Y,\,\nabla f(\hat Y)-\nabla f(Y)\big\rangle,
\]
where the final inequality follows from the proof in Section C.2. Hence
\[
e^{-\eta^{-1}t}E'(t) \le -\tfrac14\eta^{-1}\|Y\|^{2} - \tfrac14\eta\|\dot Y\|^{2} + L\eta^{-1}\|Y-\hat Y\|\,\|Y\| + L\|Y-\hat Y\|\,\|\dot Y\|. \tag{C.7}
\]
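As a quick illustration of the synchronous analysis, the sketch below integrates the limiting ODE $\ddot Y + 2\eta^{-1}\dot Y + 2\eta^{-2}\nabla f(Y) = 0$ with a simple explicit Euler scheme and compares the observed decay of $f(Y(t))$ with the rate $e^{-\eta^{-1}t}$ suggested by equation C.6. The quadratic test function (with Hessian eigenvalues between $1$ and $25$, so its strong-convexity and smoothness constants are $1$ and $25$), the value of $n$, the step size, and the horizon are all illustrative assumptions.

```python
import numpy as np

# Explicit-Euler integration of the synchronous limiting ODE
#   Y'' + (2/eta) Y' + (2/eta^2) grad f(Y) = 0,   eta = n * kappa^{1/2},
# for the quadratic f(Y) = 0.5 * sum_i eig_i * Y_i^2 with eigenvalues in [1, 25].
# Equation C.6 suggests f(Y(t)) decays at least like exp(-t/eta).
# All numerical values here are illustrative assumptions.

d = 30
eigs = np.linspace(1.0, 25.0, d)
f = lambda y: 0.5 * np.sum(eigs * y**2)
grad = lambda y: eigs * y

n = 8
kappa = eigs[-1] / eigs[0]
eta = n * np.sqrt(kappa)          # = 40

dt, T = 0.05, 400.0
Y = np.ones(d)                    # initial position
V = np.zeros(d)                   # initial velocity Y'(0) = 0
f0 = f(Y)

t = 0.0
while t < T:
    acc = -(2.0 / eta) * V - (2.0 / eta**2) * grad(Y)
    Y, V = Y + dt * V, V + dt * acc
    t += dt

observed_rate = np.log(f0 / f(Y)) / T
print("observed decay rate  :", observed_rate)
print("predicted lower bound:", 1.0 / eta)
assert observed_rate >= 1.0 / eta   # decay at least as fast as exp(-t/eta)
```

For a quadratic the observed rate is roughly $2/\eta$, comfortably faster than the bound $1/\eta$, which is consistent with (C.6) being a worst-case guarantee.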
Now we present an inequality that is similar to equation 8.

Lemma 18. Let $A, \chi > 0$. Then
\[
\|Y(t) - \hat Y(t)\|\,A \le \tfrac12\Big(\chi\tau D(t) + \chi^{-1}A^{2}\Big).
\]
Proof. Since $\hat Y(t)$ is a delayed version of $Y(t)$, we have $\hat Y(t) = Y(t - j(t))$ for some function $0 \le j(t) \le \tau$ (though this can easily be generalized to an inconsistent read). Recall that for $\chi > 0$ we have $ab \le \tfrac12\big(\chi a^{2} + \chi^{-1}b^{2}\big)$. Hence
\[
\|Y(t) - \hat Y(t)\| = \Big\|\int_{t-j(t)}^{t}\dot Y(s)\,ds\Big\| \le \int_{t-j(t)}^{t}\|\dot Y(s)\|\,ds
\overset{\text{(H\"older's inequality)}}{\le} \Big(\int_{t-j(t)}^{t}\|\dot Y(s)\|^{2}\,ds\Big)^{1/2}\Big(\int_{t-j(t)}^{t}1\,ds\Big)^{1/2}
\le \tau^{1/2}D(t)^{1/2},
\]
and therefore $\|Y(t)-\hat Y(t)\|\,A \le \tau^{1/2}D(t)^{1/2}A \le \tfrac12\big(\chi\tau D(t) + \chi^{-1}A^{2}\big)$.

We use this lemma twice, on $\|Y-\hat Y\|\,\|Y\|$ and on $\|Y-\hat Y\|\,\|\dot Y\|$ in equation C.7, with $\chi = 2L$, $A = \|Y\|$ and $\chi = 4L\eta^{-1}$, $A = \|\dot Y\|$ respectively, to yield
\[
e^{-\eta^{-1}t}E'(t) \le -\tfrac14\eta^{-1}\|Y\|^{2} - \tfrac14\eta\|\dot Y\|^{2} + L\eta^{-1}\Big(L\tau D(t) + \tfrac{1}{4L}\|Y\|^{2}\Big) + L\Big(2L\eta^{-1}\tau D(t) + \tfrac{1}{8L}\eta\|\dot Y\|^{2}\Big)
\le -\tfrac18\eta\|\dot Y\|^{2} + 3L^{2}\eta^{-1}\tau D(t).
\]
The proof of convergence is completed in Section 3.

D Optimality proof

For the parameter set $\sigma, L_1, \ldots, L_n, n$, we construct a block-separable function $f$ on the space $\mathbb{R}^{bn}$ (separated into $n$ blocks of size $b$), which will imply this lower bound. Define $\kappa_i = L_i/\sigma$. We define the matrix $A_i \in \mathbb{R}^{b\times b}$ via
\[
A_i = \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1\\
 & & & -1 & \theta_i
\end{pmatrix},
\qquad \theta_i = \frac{\kappa_i^{1/2} + 3}{\kappa_i^{1/2} + 1}.
\]
Hence we define $f_i$ on $\mathbb{R}^{b}$ via
\[
f_i(x) = \frac{L_i - \sigma}{4}\Big(\tfrac12\langle x, A_i x\rangle - \langle e_1, x\rangle\Big) + \frac{\sigma}{2}\|x\|^{2},
\]
which is clearly $\sigma$-strongly convex with $L_i$-Lipschitz gradient on $\mathbb{R}^{b}$. From Lemma 8 of Lan & Zhou (2015), we know that this function has the unique minimizer
\[
x^{*,(i)} = \big(q_i, q_i^{2}, \ldots, q_i^{b}\big), \qquad q_i = \frac{\kappa_i^{1/2} - 1}{\kappa_i^{1/2} + 1}.
\]
Finally, we define $f$ via $f(x) = \sum_{i=1}^{n}f_i\big(x^{(i)}\big)$.

Now let $e(i,j)$ be the $j$th unit vector of the $i$th block (of size $b$) in $\mathbb{R}^{bn}$. For $I_1, \ldots, I_n \in \mathbb{N}$, we define the subspaces
\[
V_i(I) = \mathrm{span}\{e(i,1), \ldots, e(i,I)\},
\qquad
V(I_1, \ldots, I_n) = V_1(I_1) \oplus \cdots \oplus V_n(I_n).
\]
$V(I_1,\ldots,I_n)$ is the subspace in which only the first $I_1$ components of block 1, the first $I_2$ components of block 2, etc., may be nonzero. First notice that $\mathrm{IC}(V(I_1,\ldots,I_n)) = V(I_1,\ldots,I_n)$. Also, clearly, we have
\[
\nabla_i f\big(V(I_1,\ldots,I_n)\big) \subseteq V\big(0,\ldots,0,\min\{I_i+1, b\},0,\ldots,0\big). \tag{D.1}
\]
$\nabla_i f$ is supported on the $i$th block, which is why all the other indices are $0$. The pattern of nonzeros in $A_i$ means that the gradient can have at most one more nonzero component on the $i$th block (see Nesterov (2013)).

Let the initial point $x_0$ belong to $V(I_1,\ldots,I_n)$. Let $I_{k,i}$ be the number of times that we have had $i_{k'} = i$ for $k' = 0,\ldots,k-1$. By induction on condition 2 of Definition 4, using equation D.1, we have
\[
x_k \in V\big(\min\{I_1 + I_{k,1}, b\}, \ldots, \min\{I_n + I_{k,n}, b\}\big).
\]
Hence if $x_0^{(i)} \in V_i(0)$ and $k \le b$, then
\[
\big\|x_k^{(i)} - x^{*,(i)}\big\|^{2} \ge \min_{x \in V_i(I_{k,i})}\big\|x - x^{*,(i)}\big\|^{2} = \sum_{j=I_{k,i}+1}^{b}q_i^{2j} = \big(q_i^{2I_{k,i}+2} - q_i^{2b+2}\big)\big/\big(1 - q_i^{2}\big).
\]
Therefore, for all $i$ we have
\[
\mathbb{E}\|x_k - x^{*}\|^{2} \ge \mathbb{E}\big\|x_k^{(i)} - x^{*,(i)}\big\|^{2} \ge \mathbb{E}\Big[\big(q_i^{2I_{k,i}+2} - q_i^{2b+2}\big)\big/\big(1 - q_i^{2}\big)\Big].
\]
To evaluate this expectation, we note that
\[
\mathbb{E}\,q_i^{2I_{k,i}} = \sum_{j=0}^{k}\binom{k}{j}p_i^{j}(1-p_i)^{k-j}q_i^{2j}
= (1-p_i)^{k}\sum_{j=0}^{k}\binom{k}{j}\big(q_i^{2}p_i(1-p_i)^{-1}\big)^{j}
= (1-p_i)^{k}\big(1 + q_i^{2}p_i(1-p_i)^{-1}\big)^{k}
= \big(1 - (1-q_i^{2})p_i\big)^{k}.
\]
Hence
\[
\mathbb{E}\|x_k - x^{*}\|^{2} \ge \Big(\big(1 - (1-q_i^{2})p_i\big)^{k} - q_i^{2b}\Big)\,q_i^{2}\big/\big(1 - q_i^{2}\big).
\]
For any $i$, we may select the starting iterate $x_0$ by defining its blocks $j = 1,\ldots,n$ via
\[
x_0^{(j)} = (1 - \delta_{ij})\,x^{*,(j)},
\]
that is, block $i$ of $x_0$ is zero and every other block starts at its minimizer. For such a choice of $x_0$ we have
\[
\|x_0 - x^{*}\|^{2} = \big\|x^{*,(i)}\big\|^{2} = q_i^{2} + \cdots + q_i^{2b} = q_i^{2}\big(1 - q_i^{2b}\big)\big/\big(1 - q_i^{2}\big).
\]
Hence for this choice of $x_0$:
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \Big(\big(1 - (1-q_i^{2})p_i\big)^{k} - q_i^{2b}\Big)\big/\big(1 - q_i^{2b}\big).
\]
Now notice that
\[
\big(1 - (1-q_i^{2})p_i\big)^{k} = \big(q_i^{-2} - (q_i^{-2} - 1)p_i\big)^{k}q_i^{2k} \ge q_i^{2k},
\]
and hence
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \big(1 - (1-q_i^{2})p_i\big)^{k}\big(1 - q_i^{2b-2k}\big)\big/\big(1 - q_i^{2b}\big).
\]
Now if we let $b = 2k$, then we have
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \big(1 - (1-q_i^{2})p_i\big)^{k}\big(1 - q_i^{2k}\big)\big/\big(1 - q_i^{4k}\big)
= \big(1 - (1-q_i^{2})p_i\big)^{k}\big/\big(1 + q_i^{2k}\big),
\]
and therefore
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \tfrac12\max_i\big(1 - (1-q_i^{2})p_i\big)^{k}.
\]
Now let us take the minimum of the right-hand side over the parameters $p_i$, subject to $\sum_{i=1}^{n}p_i = 1$. The solution of this minimization is clearly
\[
p_i = \big(1-q_i^{2}\big)^{-1}\Big/\sum_{j=1}^{n}\big(1-q_j^{2}\big)^{-1},
\]
which yields
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \tfrac12\Big(1 - 1\Big/\sum_{j=1}^{n}\big(1-q_j^{2}\big)^{-1}\Big)^{k}.
\]
Since
\[
\big(1-q_j^{2}\big)^{-1} = \tfrac14\big(\kappa_j^{1/2} + 2 + \kappa_j^{-1/2}\big) \ge \tfrac14\big(\kappa_j^{1/2} + 2\big),
\]
we have $\sum_{j=1}^{n}(1-q_j^{2})^{-1} \ge \tfrac14\big(\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\big)$, and hence
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \tfrac12\Big(1 - 4\Big/\Big(\textstyle\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\Big)\Big)^{k}.
\]
Hence the complexity $I(\epsilon)$ satisfies
\[
I(\epsilon) \ge \ln(2\epsilon)\Big/\ln\Big(1 - 4\Big/\Big(\textstyle\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\Big)\Big)
= \tfrac14\Big(\textstyle\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\Big)\big(1 + o(1)\big)\ln\big(1/(2\epsilon)\big).
\]
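The construction above is easy to verify numerically. The sketch below builds $A_i$ and $f_i$ for a single block and checks that the minimizer of $f_i$ is indeed $(q_i, q_i^2, \ldots, q_i^b)$ and that the Hessian spectrum lies in $[\sigma, L_i]$. It assumes the scaling $f_i(x) = \frac{L_i - \sigma}{4}\big(\tfrac12\langle x, A_i x\rangle - \langle e_1, x\rangle\big) + \frac{\sigma}{2}\|x\|^2$ written above (the usual constant for this construction); the values of $\sigma$, $L_i$, and $b$ are arbitrary.

```python
import numpy as np

# Verify that the worst-case block function f_i has minimizer (q, q^2, ..., q^b),
# assuming f_i(x) = (L_i - sigma)/4 * (0.5 x^T A_i x - e_1^T x) + sigma/2 ||x||^2
# with A_i tridiagonal (2 on the diagonal, -1 off-diagonal) except A_i[b-1, b-1] = theta_i.
# sigma, L_i, and b below are arbitrary sample values.

sigma, L_i, b = 1.0, 100.0, 12
kappa_i = L_i / sigma
theta_i = (np.sqrt(kappa_i) + 3.0) / (np.sqrt(kappa_i) + 1.0)
q_i = (np.sqrt(kappa_i) - 1.0) / (np.sqrt(kappa_i) + 1.0)

A = 2.0 * np.eye(b) - np.eye(b, k=1) - np.eye(b, k=-1)
A[-1, -1] = theta_i

# grad f_i(x) = (L_i - sigma)/4 * (A x - e_1) + sigma x = 0  is a linear system.
H = (L_i - sigma) / 4.0 * A + sigma * np.eye(b)
rhs = (L_i - sigma) / 4.0 * np.eye(b)[:, 0]
x_star = np.linalg.solve(H, rhs)

assert np.allclose(x_star, q_i ** np.arange(1, b + 1))

# The Hessian H also has eigenvalues between sigma and L_i, as claimed.
eigvals = np.linalg.eigvalsh(H)
assert eigvals[0] >= sigma - 1e-9 and eigvals[-1] <= L_i + 1e-9
print("minimizer matches (q, q^2, ..., q^b); Hessian spectrum in [sigma, L_i]")
```

The modified corner entry $\theta_i$ is exactly what makes the geometric sequence $(q_i, q_i^2, \ldots, q_i^b)$ satisfy the first-order optimality condition in finite dimension.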
E Extensions

As mentioned, a stronger result than Theorem 1 is possible. In the case when $L_i = L$ for all $i$, we can consider a slight modification of the coefficients:
\[
\alpha \triangleq \big(1 + (1+\psi)\sigma^{-1/2}S\big)^{-1}, \tag{E.1}
\]
\[
\beta \triangleq 1 - (1+\psi)^{-1}\sigma^{1/2}S^{-1}, \tag{E.2}
\]
\[
h \triangleq \big(1 + 2\sigma^{1/2}L^{-1/2}\psi\big)^{-1}, \tag{E.3}
\]
with the iteration
\[
y_k = \alpha v_k + (1-\alpha)x_k, \tag{E.4}
\]
\[
x_{k+1} = y_k - \frac{h}{L}\,\nabla_{i_k}f(\hat y_k), \tag{E.5}
\]
\[
v_{k+1} = \beta v_k + (1-\beta)y_k - \sigma^{-1/2}L^{-1/2}\,\nabla_{i_k}f(\hat y_k), \tag{E.6}
\]
for the asynchronicity parameter
\[
\psi = 6\kappa^{1/2}n^{-1}\tau. \tag{E.7}
\]
This leads to the complexity
\[
K(\epsilon) = (1+\psi)\,n\kappa^{1/2}\ln(1/\epsilon). \tag{E.8}
\]
Here there is no restriction on $\psi$ as in Theorem 1, and hence there is no restriction on $\tau$. Assuming $\psi \le 1$ gives optimal complexity to within a constant factor. Notice then that the resulting condition
\[
\tau \le \tfrac16\,n\kappa^{-1/2} \tag{E.9}
\]
essentially matches the one in Theorem 3 of Section 3. While this result is stronger, it increases the complexity of the proof substantially, so in the interests of space and simplicity we do not prove it here.
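To make the constants concrete, the short sketch below evaluates the modified coefficients (E.1)–(E.3), the asynchronicity parameter (E.7), the delay condition (E.9), and the resulting complexity (E.8) for sample values of $n$, $\kappa$, $\tau$, $\sigma$, and $\epsilon$, and compares (E.8) against the lower bound from Section D with equal $\kappa_i$. The constant $6$ in (E.7) is taken from the reconstruction above; all numerical values are illustrative.

```python
import numpy as np

# Evaluate the equal-L_i coefficients (E.1)-(E.3), the asynchronicity parameter
# (E.7), the delay condition (E.9), and the complexity (E.8) for sample values.
# n, kappa, tau, sigma, eps are illustrative; L = sigma * kappa and S = n * sqrt(L).

n, kappa, tau, sigma, eps = 64, 100.0, 1, 1.0, 1e-6
L = sigma * kappa
S = n * np.sqrt(L)

psi = 6.0 * np.sqrt(kappa) * tau / n                        # (E.7)
assert (psi <= 1.0) == (tau <= n / (6.0 * np.sqrt(kappa)))  # (E.9) <=> psi <= 1

alpha = 1.0 / (1.0 + (1.0 + psi) * S / np.sqrt(sigma))      # (E.1)
beta = 1.0 - np.sqrt(sigma) / ((1.0 + psi) * S)             # (E.2)
h = 1.0 / (1.0 + 2.0 * np.sqrt(sigma / L) * psi)            # (E.3)

K_async = (1.0 + psi) * n * np.sqrt(kappa) * np.log(1.0 / eps)   # (E.8)
K_sync = n * np.sqrt(kappa) * np.log(1.0 / eps)                  # psi = 0 baseline
# Lower bound from Section D with equal kappa_i = kappa:
K_lower = 0.25 * (n * np.sqrt(kappa) + 2.0 * n) * np.log(1.0 / (2.0 * eps))

print(f"psi = {psi:.3f}, alpha = {alpha:.3e}, beta = {beta:.6f}, h = {h:.3f}")
print(f"K_async / K_sync  = {K_async / K_sync:.3f}  (= 1 + psi)")
print(f"K_async / K_lower = {K_async / K_lower:.3f}  (constant-factor gap)")
```

With $\psi \le 1$ the asynchronous complexity exceeds the synchronous one by at most the factor $1+\psi \le 2$, and stays within a constant factor of the Section D lower bound, which is the sense in which the extension is optimal.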