Published as a conference paper at ICLR 2019

A2BCD: Asynchronous Acceleration with Optimal Complexity

Robert Hannah, Fei Feng, Wotao Yin
Department of Mathematics, University of California, Los Angeles
520 Portola Plaza, Los Angeles, CA 90095, USA
Corresponding author: RobertHannah89@gmail.com; fei.feng@math.ucla.edu; wotaoyin@math.ucla.edu

In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm (A2BCD). We prove that A2BCD converges linearly to a solution of the convex minimization problem at the same rate as NU_ACDM, so long as the maximum delay is not too large. This is the first asynchronous Nesterov-accelerated algorithm that attains any provable speedup. Moreover, we then prove that these algorithms both have optimal complexity. Asynchronous algorithms complete much faster iterations, and A2BCD has optimal complexity. Hence we observe in experiments that A2BCD is the top-performing coordinate descent algorithm, converging up to 4-5x faster than NU_ACDM on some data sets in terms of wall-clock time. To motivate our theory and proof techniques, we also derive and analyze a continuous-time analogue of our algorithm and prove that it converges at the same rate.

1 Introduction

In this paper, we propose and prove the convergence of the Asynchronous Accelerated Nonuniform Randomized Block Coordinate Descent algorithm (A2BCD), the first asynchronous Nesterov-accelerated algorithm that achieves optimal complexity. No previous attempts have been able to prove a speedup for asynchronous Nesterov acceleration. We aim to find the minimizer $x^*$ of the unconstrained minimization problem:

$$\min_{x \in \mathbb{R}^d} f(x) = f\big(x(1), \ldots, x(n)\big) \qquad (1.1)$$

where $f$ is $\sigma$-strongly convex for $\sigma > 0$ with $L$-Lipschitz gradient $\nabla f = (\nabla_1 f, \ldots, \nabla_n f)$. Here $x \in \mathbb{R}^d$ is composed of coordinate blocks $x(1), \ldots, x(n)$. The coordinate blocks $\nabla_i f$ of the gradient are assumed $L_i$-Lipschitz with respect to the $i$th block; that is, for all $x, h \in \mathbb{R}^d$:

$$\|\nabla_i f(x + P_i h) - \nabla_i f(x)\| \le L_i \|h\| \qquad (1.2)$$

where $P_i$ is the projection onto the $i$th block of $\mathbb{R}^d$. Let $\bar{L} \triangleq \frac{1}{n}\sum_{i=1}^{n} L_i$ be the average block Lipschitz constant. These conditions on $f$ are assumed throughout this paper. Our algorithm can also be applied to non-strongly convex objectives ($\sigma = 0$) or non-smooth objectives using the black-box reduction techniques proposed in Allen-Zhu & Hazan (2016). Hence we consider only the coordinate-smooth, strongly convex case. Our algorithm can also be applied to the convex regularized ERM problem via the standard dual transformation (see for instance Lin et al. (2014)):

$$\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} f_i(\langle a_i, x\rangle) + \frac{\lambda}{2}\|x\|^2 \qquad (1.3)$$

Hence A2BCD can be used as an asynchronous Nesterov-accelerated finite-sum algorithm.

Coordinate descent methods, in which a chosen coordinate block $i_k$ is updated at every iteration, are a popular way to solve equation 1.1. Randomized block coordinate descent (RBCD, Nesterov (2012)) updates a uniformly randomly chosen coordinate block $i_k$ with a gradient-descent-like step: $x_{k+1} = x_k - (1/L_{i_k})\nabla_{i_k} f(x_k)$. The complexity $K(\epsilon)$ of an algorithm is defined as the number of iterations required to decrease the error $\mathbb{E}(f(x_k) - f(x^*))$ to less than $\epsilon(f(x_0) - f(x^*))$. Randomized coordinate descent has a complexity of $K(\epsilon) = O(n(\bar{L}/\sigma)\ln(1/\epsilon))$. Using a series of averaging and extrapolation steps, accelerated RBCD (Nesterov (2012)) improves RBCD's iteration complexity to $K(\epsilon) = O(n\sqrt{\bar{L}/\sigma}\,\ln(1/\epsilon))$, which leads to much faster convergence when $\bar{L}/\sigma$ is large. This rate is optimal when all $L_i$ are equal (Lan & Zhou (2015)).
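As a point of reference for the coordinate descent methods compared here, the following is a minimal sketch of the plain RBCD step just described, applied to a small strongly convex quadratic. It is not code from the paper; the test problem, block partition, and block Lipschitz constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly convex quadratic f(x) = 0.5*x^T A x - b^T x.
d, n_blocks = 40, 8
M = rng.standard_normal((d, d))
A = M @ M.T / d + np.eye(d)              # positive definite Hessian
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)           # exact minimizer, for checking progress

blocks = np.array_split(np.arange(d), n_blocks)
# Block Lipschitz constant of grad_i f: spectral norm of the (i, i) diagonal block of A.
L = [np.linalg.norm(A[np.ix_(B, B)], 2) for B in blocks]

x = np.zeros(d)
for k in range(5000):
    i = rng.integers(n_blocks)           # uniformly random block (plain RBCD)
    B = blocks[i]
    grad_i = A[B, :] @ x - b[B]          # block gradient of f at x
    x[B] -= (1.0 / L[i]) * grad_i        # step x_{k+1} = x_k - (1/L_i) grad_i f(x_k)

print("distance to minimizer:", np.linalg.norm(x - x_star))
```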
Finally, using a special probability distribution for the random block index $i_k$, the non-uniform accelerated coordinate descent method (NU_ACDM, Allen-Zhu et al. (2015)) can further decrease the complexity to $O\big(\sum_{i=1}^{n}\sqrt{L_i/\sigma}\,\ln(1/\epsilon)\big)$, which can be up to $n$ times faster than accelerated RBCD, since some $L_i$ can be significantly smaller than $\bar{L}$. NU_ACDM is the current state-of-the-art coordinate descent algorithm for solving equation 1.1.

Our A2BCD algorithm generalizes NU_ACDM to the asynchronous-parallel case. We solve equation 1.1 with a collection of $p$ computing nodes that continually read a shared-access solution vector $y$ into local memory, then compute a block gradient $\nabla_i f$, which is used to update the shared solution vectors $(x, y, v)$. Proving convergence in the asynchronous case requires extensive new technical machinery.

A traditional synchronous-parallel implementation is organized into rounds of computation: every computing node must complete an update in order for the next iteration to begin. However, this synchronization process can be extremely costly, since the lateness of a single node can halt the entire system. This becomes increasingly problematic with scale, as differences in node computing speeds, load balancing, random network delays, and bandwidth constraints mean that a synchronous-parallel solver may spend more time waiting than computing a solution.

Computing nodes in an asynchronous solver do not wait for others to complete and share their updates before starting the next iteration. They simply continue to update the solution vectors with the most recent information available, without any central coordination. This eliminates costly idle time, meaning that asynchronous algorithms can be much faster than traditional ones, since they have much faster iterations. For instance, random network delays cause asynchronous algorithms to complete iterations $\Omega(\ln(p))$ times faster than synchronous algorithms at scale. This and other factors that influence the speed of iterations are discussed in Hannah & Yin (2017a).

However, since many iterations may occur between the time that a node reads the solution vector and the time that its computed update is applied, the solution vector is effectively being updated with outdated information. At iteration $k$, the block gradient $\nabla_{i_k} f$ is computed at a delayed iterate $\hat{y}_k$ defined as

$$\hat{y}_k = \big(y_{k-j(k,1)}(1), \ldots, y_{k-j(k,n)}(n)\big)$$

for delay parameters $j(k,1), \ldots, j(k,n) \in \mathbb{N}$. (Every coordinate can be outdated by a different amount without significantly changing the proofs.) Here $j(k,i)$ denotes how many iterations out of date coordinate block $i$ is at iteration $k$. Different blocks may be out of date by different amounts, which is known as an inconsistent read. We assume that $j(k,i) \le \tau$ for some constant $\tau < \infty$. (This condition can be relaxed by techniques in Hannah & Yin (2017b); Sun et al. (2017); Peng et al. (2016c); Hannah & Yin (2017a).)
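The inconsistent read $\hat{y}_k$ can be made concrete with a short bookkeeping sketch (not from the paper): each block of the vector a node reads may come from a different recent iterate, with every per-block delay bounded by $\tau$. The buffer length and random delay model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_blocks, tau = 12, 4, 3
blocks = np.array_split(np.arange(d), n_blocks)

# Keep the last tau+1 iterates y_k, y_{k-1}, ..., y_{k-tau} (most recent first).
history = [np.zeros(d) for _ in range(tau + 1)]

def record(y_new):
    """Push the newest iterate onto the bounded history buffer."""
    history.insert(0, y_new.copy())
    history.pop()

def inconsistent_read():
    """Assemble y_hat block by block; block i comes from y_{k-j(k,i)} with j(k,i) <= tau."""
    y_hat = np.empty(d)
    for blk in blocks:
        j = rng.integers(tau + 1)      # per-block delay j(k, i)
        y_hat[blk] = history[j][blk]
    return y_hat

# Populate the history with a few dummy iterates, then form one inconsistent read.
for k in range(6):
    record(history[0] + 0.1 * rng.standard_normal(d))
print(inconsistent_read())
```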
Asynchronous algorithms were proposed in Chazan & Miranker (1969) to solve linear systems. General convergence results and theory were developed later in Bertsekas (1983); Bertsekas & Tsitsiklis (1997); Tseng et al. (1990); Luo & Tseng (1992; 1993); Tseng (1991) for partially and totally asynchronous systems, with essentially cyclic block sequence $i_k$. More recently, there has been renewed interest in asynchronous algorithms with random block coordinate updates. Linear and sublinear convergence results were proven for asynchronous RBCD Liu & Wright (2015); Liu et al. (2014); Avron et al. (2014), and similar results were proven for asynchronous SGD Recht et al. (2011), variance reduction algorithms Reddi et al. (2015); Leblond et al. (2017); Mania et al. (2015); Huo & Huang (2016), and primal-dual algorithms Combettes & Eckstein (2018).

There is also a rich body of work on asynchronous SGD. In the distributed setting, Zhou et al. (2018) showed global convergence for stochastic variationally coherent problems even when the delays grow at a polynomial rate. In Lian et al. (2018), an asynchronous decentralized SGD was proposed with the same optimal sublinear convergence rate as SGD and linear speedup with respect to the number of workers. In Liu et al. (2018), the authors obtained an asymptotic rate of convergence for asynchronous momentum SGD on streaming PCA, which provides insight into the tradeoff between asynchrony and momentum. In Dutta et al. (2018), the authors prove convergence results for asynchronous SGD that highlight the tradeoff between faster iterations and iteration complexity. Further related work is discussed in Section 4.

1.1 Summary of Contributions

In this paper, we prove that A2BCD attains NU_ACDM's state-of-the-art iteration complexity to highest order for solving equation 1.1, so long as delays are not too large (see Section 2). The proof is very different from that of Allen-Zhu et al. (2015), and involves significant technical innovations and complexity related to the analysis of asynchronicity. We also prove that A2BCD (and hence NU_ACDM) has optimal complexity to within a constant factor over a fairly general class of randomized block coordinate descent algorithms (see Section 2.1). This extends results in Lan & Zhou (2015) to asynchronous algorithms with $L_i$ not all equal. Since asynchronous algorithms complete faster iterations, and A2BCD has optimal complexity, we expect A2BCD to be faster than all existing coordinate descent algorithms. We confirm with numerical experiments that A2BCD is the current fastest coordinate descent algorithm (see Section 5).

We are aware of only one previous and one contemporaneous attempt at proving convergence results for asynchronous Nesterov-accelerated algorithms. However, the first is not accelerated and relies on extreme assumptions, and the second obtains no speedup. Therefore, we claim that our results are the first-ever analysis of asynchronous Nesterov-accelerated algorithms that attains a speedup. Moreover, our speedup is optimal for delays that are not too large (speedup is defined precisely in Section 2). The work of Meng et al. claims to obtain square-root speedup for an asynchronous accelerated SVRG. In the case where all component functions have the same Lipschitz constant $L$, the complexity they obtain reduces to $(n + \kappa)\ln(1/\epsilon)$ for $\kappa = O(\tau n^2)$ (their Corollary 4.4). Hence the authors do not even obtain accelerated rates. Their convergence condition is $\tau < \frac{1}{4}\Delta^{-1/8}$ for a sparsity parameter $\Delta$. Since the dimension $d$ satisfies $d \ge \Delta^{-1}$, they require $d \ge 2^{16}\tau^{8}$. So $\tau = 20$ requires dimension $d > 10^{15}$.

In a contemporaneous preprint, the authors of Fang et al. (2018) skillfully devised accelerated schemes for asynchronous coordinate descent and SVRG using momentum compensation techniques. Although their complexity results have the improved $\sqrt{\kappa}$ dependence on the condition number, they do not prove any speedup. Their complexity is $\tau$ times larger than the serial complexity.
Since $\tau$ is necessarily greater than $p$, their results imply that adding more computing nodes will increase running time. The authors claim that they can extend their results to linear speedup for asynchronous accelerated SVRG under sparsity assumptions, and while we think this is quite likely, they have not yet provided a proof.

We also derive a second-order ordinary differential equation (ODE), which is the continuous-time limit of A2BCD (see Section 3). This extends the ODE found in Su et al. (2014) to an asynchronous accelerated algorithm minimizing a strongly convex function. We prove that this ODE converges linearly to a solution at the same rate as A2BCD, without needing to resort to restarting techniques. The ODE analysis motivates and clarifies our proof strategy for the main result.

2 Main results

We should consider functions $f$ for which it is efficient to calculate blocks of the gradient, so that coordinate-wise parallelization is efficient. That is, the function should be coordinate friendly (Peng et al. (2016b)). This is a very wide class that includes regularized linear regression, logistic regression, etc. The L2-regularized empirical risk minimization problem is not coordinate friendly in general; however, the equivalent dual problem is, and hence it can be solved efficiently by A2BCD (see Lin et al. (2014) and Section 5).

To calculate the $(k+1)$th iteration of the algorithm from iteration $k$, we use only one block of the gradient, $\nabla_{i_k} f$. We assume that the delays $j(k, i)$ are independent of the block sequence $i_k$, but otherwise arbitrary (this is a standard assumption found in the vast majority of papers, but it can be relaxed; see Sun et al. (2017); Leblond et al. (2017); Cannelli et al. (2017)).

Definition 1. Asynchronous Accelerated Randomized Block Coordinate Descent (A2BCD). Let $f$ be $\sigma$-strongly convex, and let its gradient $\nabla f$ be $L$-Lipschitz with block coordinate Lipschitz parameters $L_i$ as in equation 1.2. We define the condition number $\kappa = L/\sigma$, and let $\underline{L} = \min_i L_i$. Using these parameters, we sample $i_k$ in an independent and identically distributed (IID) fashion according to

$$\mathbb{P}[i_k = j] = L_j^{1/2}/S, \quad j \in \{1, \ldots, n\}, \quad \text{for } S = \sum_{i=1}^{n} L_i^{1/2}. \qquad (2.1)$$

Let $\tau$ be the maximum asynchronous delay. We define the dimensionless asynchronicity parameter $\psi$, which is proportional to $\tau$, and quantifies how strongly asynchronicity will affect convergence:

$$\psi = 9 S^{-1/2}\underline{L}^{-1/2} L^{3/4}\kappa^{1/4}\tau. \qquad (2.2)$$

We use the above system parameters and $\psi$ to define the coefficients $\alpha$, $\beta$, and $h$ via eqs. (2.3) to (2.5); the A2BCD algorithm is then defined via the iterations in eqs. (2.6) to (2.8):

$$\alpha \triangleq \big(1 + (1 + \psi)\sigma^{-1/2}S\big)^{-1}, \qquad (2.3)$$
$$\beta \triangleq 1 - (1 - \psi)\sigma^{1/2}S^{-1}, \qquad (2.4)$$
$$h \triangleq 1 - \tfrac{1}{2}\sigma^{1/2}\underline{L}^{-1/2}\psi, \qquad (2.5)$$

$$y_k = \alpha v_k + (1 - \alpha)x_k, \qquad (2.6)$$
$$x_{k+1} = y_k - h L_{i_k}^{-1}\nabla_{i_k} f(\hat{y}_k), \qquad (2.7)$$
$$v_{k+1} = \beta v_k + (1 - \beta)y_k - \sigma^{-1/2}L_{i_k}^{-1/2}\nabla_{i_k} f(\hat{y}_k). \qquad (2.8)$$

See Section A for a discussion of why it is practical and natural for the gradient $\nabla_{i_k} f(\hat{y}_k)$ to be outdated, while the actual variables $x_k, y_k, v_k$ can be efficiently kept up to date. Essentially, this is because most of the computation lies in computing $\nabla_{i_k} f(\hat{y}_k)$. After this is computed, $x_k, y_k, v_k$ can be updated more or less atomically with minimal overhead, meaning that they will always be up to date. However, our main results still hold for more general asynchronicity.
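To make Definition 1 concrete, here is a minimal single-threaded simulation of the iterations (2.6)-(2.8), with the bounded delay imitated by reading each block of $y$ from a random recent iterate. This is a sketch, not the paper's shared-memory C++ implementation: the quadratic test problem and delay model are assumptions, and, following the tuning advice in Section A.1, the asynchronicity parameter $\psi$ is simply set to 0.25 rather than computed from equation 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly convex quadratic f(x) = 0.5*x^T A x - b^T x.
d, n_blocks, tau = 40, 8, 3
M = rng.standard_normal((d, d))
A = M @ M.T / d + np.eye(d)
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)

blocks = np.array_split(np.arange(d), n_blocks)
sigma = np.linalg.eigvalsh(A)[0]                          # strong convexity modulus
L = np.array([np.linalg.norm(A[np.ix_(B, B)], 2) for B in blocks])  # block constants L_i
L_under = L.min()
S = np.sum(np.sqrt(L))
probs = np.sqrt(L) / S                                    # non-uniform sampling, eq. (2.1)

# Eq. (2.2) is very conservative for realistic tau, so we follow Section A.1
# and simply tune the asynchronicity parameter.
psi = 0.25
alpha = 1.0 / (1.0 + (1.0 + psi) * S / np.sqrt(sigma))    # eq. (2.3)
beta = 1.0 - (1.0 - psi) * np.sqrt(sigma) / S             # eq. (2.4)
h = 1.0 - 0.5 * psi * np.sqrt(sigma / L_under)            # eq. (2.5)

x, v = np.zeros(d), np.zeros(d)
y_hist = [np.zeros(d) for _ in range(tau + 1)]            # y_k, y_{k-1}, ..., y_{k-tau}

for k in range(4000):
    y = alpha * v + (1 - alpha) * x                       # eq. (2.6)
    y_hist.insert(0, y); y_hist.pop()

    # Inconsistent, delayed read: each block of y_hat comes from a random recent iterate.
    y_hat = np.concatenate([y_hist[rng.integers(tau + 1)][B] for B in blocks])

    i = rng.choice(n_blocks, p=probs)                     # sample block per eq. (2.1)
    B = blocks[i]
    g = A[B, :] @ y_hat - b[B]                            # delayed block gradient

    x_new = y.copy()
    x_new[B] -= (h / L[i]) * g                            # eq. (2.7)
    v_new = beta * v + (1 - beta) * y
    v_new[B] -= g / np.sqrt(sigma * L[i])                 # eq. (2.8)
    x, v = x_new, v_new

print("distance to minimizer:", np.linalg.norm(x - x_star))
```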
A natural quantity to consider in asynchronous convergence analysis is the asynchronicity error, a powerful tool for analyzing asynchronous algorithms used in several recent works Peng et al. (2016a); Hannah & Yin (2017b); Sun et al. (2017); Hannah & Yin (2017a). We adapt it, using a weighted sum of the history of the algorithm with weights that decrease further back in time.

Definition 2. Asynchronicity error. Using the above parameters, we define:

$$A_k = \sum_{j=1}^{\infty} c_j \|y_{k+1-j} - y_{k-j}\|^2 \qquad (2.9)$$

for

$$c_i = 6 S^{-1} L^{1/2}\kappa^{3/2}\tau\psi^{-1} \sum_{j=i}^{\tau} \big(1 - \sigma^{1/2}S^{-1}\big)^{i-j-1}. \qquad (2.10)$$

Here we define $y_k = y_0$ for all $k < 0$. The determination of the coefficients $c_i$ is in general a very involved process of trial and error, intuition, and balancing competing requirements. The algorithm doesn't depend on the coefficients, however; they are only an analytical tool.

We define $\mathbb{E}_k[X]$ as the expectation of $X$ conditioned on $(x_0, \ldots, x_k)$, $(y_0, \ldots, y_k)$, $(v_0, \ldots, v_k)$, and $(i_0, \ldots, i_{k-1})$. To simplify notation, we assume that the minimizer $x^* = 0$ and that $f(x^*) = 0$ with no loss of generality. (We can assume $x^* = 0$ since we may translate the coordinate system so that $x^*$ is at the origin, and $f(x^*) = 0$ since we can replace $f(x)$ with $f(x) - f(x^*)$. Without this assumption, the Lyapunov function below simply becomes $\|v_k - x^*\|^2 + A_k + c(f(x_k) - f(x^*))$.) We define the Lyapunov function:

$$\rho_k = \|v_k\|^2 + A_k + c f(x_k) \qquad (2.11)$$

for

$$c = 2\sigma^{-1/2}S^{-1}\big(\beta\alpha^{-1}(1-\alpha) + 1\big). \qquad (2.12)$$

We now present this paper's first main contribution.

Theorem 1. Let $f$ be $\sigma$-strongly convex with a gradient $\nabla f$ that is $L$-Lipschitz with block Lipschitz constants $\{L_i\}_{i=1}^{n}$. Let $\psi$ defined in equation 2.2 satisfy $\psi \le \frac{3}{7}$ (i.e., $\tau \le \frac{1}{21} S^{1/2}\underline{L}^{1/2}L^{-3/4}\kappa^{-1/4}$). Then for A2BCD we have:

$$\mathbb{E}_k[\rho_{k+1}] \le \big(1 - (1-\psi)\sigma^{1/2}S^{-1}\big)\rho_k.$$

To obtain $\mathbb{E}[\rho_k] \le \epsilon\rho_0$, it takes $K_{A2BCD}(\epsilon)$ iterations for:

$$K_{A2BCD}(\epsilon) = \frac{\big(\sigma^{-1/2}S + O(1)\big)\ln(1/\epsilon)}{1-\psi}, \qquad (2.13)$$

where $O(\cdot)$ is asymptotic with respect to $\sigma^{-1/2}S$, and uniformly bounded.

This result is proven in Section B. A stronger result for $L_i \equiv L$ can be proven, but this adds to the complexity of the proof; see Section E for a discussion. In practice, asynchronous algorithms are far more resilient to delays than the theory predicts: $\tau$ can be much larger without negatively affecting the convergence rate and complexity. This is perhaps because we are limited to a worst-case analysis, which is not representative of the average-case performance.

Allen-Zhu et al. (2015) (Theorem 5.1) shows a linear convergence rate of $1 - 2/(1 + 2\sigma^{-1/2}S)$ for NU_ACDM, which leads to the corresponding iteration complexity of $K_{NU\_ACDM}(\epsilon) = \big(\sigma^{-1/2}S + O(1)\big)\ln(1/\epsilon)$. Hence, we have:

$$K_{A2BCD}(\epsilon) = \frac{1}{1-\psi}(1 + o(1)) K_{NU\_ACDM}(\epsilon).$$

When $0 \le \psi \ll 1$, or equivalently, when $\tau \ll S^{1/2}\underline{L}^{1/2}L^{-3/4}\kappa^{-1/4}$, the complexity of A2BCD asymptotically matches that of NU_ACDM. Hence A2BCD combines state-of-the-art complexity with the faster iterations and superior scaling that asynchronous iterations allow. We now present some special cases of the conditions on the maximum delay $\tau$ required for good complexity.

Corollary 3. Let the conditions of Theorem 1 hold. If all coordinate-wise Lipschitz constants $L_i$ are equal (i.e., $L_i = L_1$, $\forall i$), then we have $K_{A2BCD}(\epsilon) \sim K_{NU\_ACDM}(\epsilon)$ when $\tau \lesssim n^{1/2}\kappa^{-1/4}(L_1/L)^{3/4}$. If we further assume that all coordinate-wise Lipschitz constants $L_i$ equal $L$, then $K_{A2BCD}(\epsilon) \sim K_{NU\_ACDM}(\epsilon) = K_{ACDM}(\epsilon)$ when $\tau \lesssim n^{1/2}\kappa^{-1/4}$.

Remark 1. Reduction to the synchronous case. Notice that when $\tau = 0$, we have $\psi = 0$, $c_i \equiv 0$ and hence $A_k \equiv 0$. Thus A2BCD becomes equivalent to NU_ACDM, the Lyapunov function $\rho_k$ becomes equivalent to one found in Allen-Zhu et al. (2015) (pg. 9), and Theorem 1 yields the same complexity.

The maximum delay $\tau$ will be a function $\tau(p)$ of $p$, the number of computing nodes.
Clearly τ p, and experimentally it has been observed that τ = O(p) Leblond et al. (2017). Let gradient complexity K(ϵ, τ) be the number of gradients required for an asynchronous algorithm with maximum delay τ to attain suboptimality ϵ. τ(1) = 0, since with only 1 computing node there can be no delay. This corresponds to the serial complexity. We say that an asynchronous algorithm attains a complexity speedup if p K(ϵ,τ(0)) K(ϵ,τ(p) is increasing in p. We say it attains linear complexity speedup if p K(ϵ,τ(0)) K(ϵ,τ(p) = Ω(p). In Theorem 1, we obtain a linear complexity speedup (for p not too large), whereas no other prior attempt can attain even a complexity speedup with Nesterov acceleration. In the ideal scenario where the rate at which gradients are calculated increases linearly with p, algorithms that have linear complexity speedup will have a linear decrease in wall-clock time. However in practice, when the number of computing nodes is sufficiently large, the rate at which gradients are calculated will no longer be linear. This is due to many parallel overhead factors including too many nodes sharing the same memory read/write bandwidth, and network bandwidth. However we note that even with these issues, we obtain much faster convergence than the synchronous counterpart experimentally. 2.1 Optimality NU_ACDM and hence A2BCD are in fact optimal in some sense. That is, among a fairly wide class of coordinate descent algorithms A, they have the best-possible worst-case complexity to highest order. We extend the work in Lan & Zhou (2015) to encompass algorithms are asynchronous and have unequal Li. For a subset S Rd, we let IC(S) (inconsistent read) denote the set of vectors v whose components are a combination of components of vectors in the set S. That is, v = (v1,1, v2,2, . . . , vd,d) for some vectors v1, v2, . . . , vd S. Here vi,j denotes the jth component of vector vi. Definition 4. Asynchronous Randomized Incremental Algorithms. Consider the unconstrained minimization problem equation 1.1 for function f satisfying the conditions stated in Section 1. We define the class A as algorithms G on this problem such that: 1. For each parameter set (σ, L1, . . . , Ln, n), G has an associated IID random variable ik with some fixed distribution P[ik] = pi for Pn i=1 pi = 1. 2. The iterates of A satisfy: xk+1 span{IC(Xk), i0f(IC(X0)), i1f(IC(X1)), . . . , ikf(IC(Xk))} This is a rather general class: xk+1 can be constructed from any inconsistent reading of past iterates IC(Xk), and any past gradient of an inconsistent read ijf(IC(Xj)). 5Their Lyapunov function is in fact a generalization of the one found in Nesterov (2012). Published as a conference paper at ICLR 2019 Theorem 2. For any algorithm G A that solves eq. (1.1), and parameter set (σ, L1, . . . , Ln, n), there is a dimension d, a corresponding function f on Rd, and a starting point x0, such that E xk x 2/ x0 x 2 1 Li/σ + 2n k Hence A has a complexity lower bound: K(ϵ) 1 4(1 + o(1)) Pn j=1 p Li/σ + 2n ln(1/2ϵ) Our proof in Section D follows very similar lines to Lan & Zhou (2015); Nesterov (2013). 3 ODE Analysis In this section we present and analyze an ODE which is the continuous-time limit of A2BCD. This ODE is a strongly convex, and asynchronous version of the ODE found in Su et al. (2014). For simplicity, assume Li = L, i. We rescale (I.e. we replace f(x) with 1 σf.) f so that σ = 1, and hence κ = L/σ = L. Taking the discrete limit of synchronous A2BCD (i.e. 
accelerated RBCD), we can derive the following ODE6 (see Section equation C.1): Y + 2n 1κ 1/2 Y + 2n 2κ 1 f(Y ) = 0 (3.1) We define the parameter η nκ1/2, and the energy: E(t) = en 1κ 1/2t(f(Y ) + 1 4 Y + η Y 2). This is very similar to the Lyapunov function discussed in equation 2.11, with 1 4 Y (t) + η Y (t) 2 fulfilling the role of vk 2, and Ak = 0 (since there is no delay yet). Much like the traditional analysis in the proof of Theorem 1, we can derive a linear convergence result with a similar rate. See Section C.2. Lemma 5. If Y satisfies equation 3.1, the energy satisfies E (t) 0, E(t) E(0), and hence: f(Y (t)) + 1 Y (t) + nκ1/2 Y (t) 2 f(Y (0)) + 1 Y (0) + η Y (0) 2 e n 1κ 1/2t We may also analyze an asynchronous version of equation 3.1 to motivate the proof of our main theorem. Here ˆY (t) is a delayed version of Y (t) with the delay bounded by τ. Y + 2n 1κ 1/2 Y + 2n 2κ 1 f ˆY = 0, (3.2) Unfortunately, this energy satisfies (see Section equation C.4, equation C.7): e η 1t E (t) 1 8η Y 2 + 3κ2η 1τD(t), for D(t) Z t Hence this energy E(t) may not be decreasing in general. But, we may add a continuous-time asynchronicity error (see Sun et al. (2017)), much like in Definition 2, to create a decreasing energy. Let c0 0 and r > 0 be arbitrary constants that will be set later. Define: t τ c(t s) Y (s) 2ds, for c(t) c0 e rt + e rτ 1 e rτ e rt 1 . Lemma 6. When rτ 1 2, the asynchronicity error A(t) satisfies: dt ert A(t) c0 Y (t) 2 1 2τ 1c0D(t). 6For compactness, we have omitted the (t) from time-varying functions Y (t), Y (t), Y (t), etc. Published as a conference paper at ICLR 2019 See Section C.3 for the proof. Adding this error to the Lyapunov function serves a similar purpose in the continuous-time case as in the proof of Theorem 1 (see Lemma 11). It allows us to negate 1 2τ 1c0 units of D(t) for the cost of creating c0 units of Y (t) 2. This restores monotonicity. Theorem 3. Let c0 = 6κ2η 1τ 2, and r = η 1. If τ 1 48nκ 1/2 then we have: E(t) + eη 1t A(t) 0. (3.3) Hence f(Y (t)) convergence linearly to f(x ) with rate O exp t/(nκ1/2) Notice how this convergence condition is similar to Corollary 3, but a little looser. The convergence condition in Theorem 1 can actually be improved to approximately match this (see Section E). E(t) + eη 1t A(t) c0 1 8η Y 2 + 3κ2η 1τ 1 = 6η 1κ2 τ 2 1 48n2κ 1 Y 2 0 The preceding should hopefully elucidate the logic and general strategy of the proof of Theorem 1. 4 Related work We now discuss related work that was not addressed in Section 1. Nesterov acceleration is a method for improving an algorithm s iteration complexity s dependence the condition number κ. Nesterov-accelerated methods have been proposed and discovered in many settings Nesterov (1983); Tseng (2008); Nesterov (2012); Lin et al. (2014); Lu & Xiao (2014); Shalev-Shwartz & Zhang (2016); Allen-Zhu (2017), including for coordinate descent algorithms (algorithms that use 1 gradient block if or minimize with respect to 1 coordinate block per iteration), and incremental algorithms (algorithms for finite sum problems 1 n Pn i=1 fi(x) that use 1 function gradient fi(x) per iteration). Such algorithms can often be augmented to solve composite minimization problems (minimization for objective of the form f(x) + g(x), especially for nonsomooth g), or include constraints. In Peng et al. (2016a), authors proposed and analyzed an asynchronous fixed-point algorithm called ARock, that takes proximal algorithms, forward-backward, ADMM, etc. as special cases. 
Work has also been done on asynchronous algorithms for finite sums in the operator setting Davis (2016); Johnstone & Eckstein (2018). In Hannah & Yin (2017b); Sun et al. (2017); Peng et al. (2016c); Cannelli et al. (2017) showed that many of the assumptions used in prior work (such as bounded delay τ < ) were unrealistic and unnecessary in general. In Hannah & Yin (2017a) the authors showed that asynchronous iterations will complete far more iterations per second, and that a wide class of asynchronous algorithms, including asynchronous RBCD, have the same iteration complexity as their synchronous counterparts. Hence certain asynchronous algorithms can be expected to significantly outperform traditional ones. In Xiao et al. (2017) authors propose a novel asynchronous catalyst-accelerated Lin et al. (2015) primal-dual algorithmic framework to solve regularized ERM problems. They structure the parallel updates so that the data that an update depends on is up to date (though the rest of the data may not be). However catalyst acceleration incurs a log(κ) penalty over Nesterov acceleration in general. In Allen-Zhu (2017), the author argues that the inner iterations of catalyst acceleration are hard to tune, making it less practical than Nesterov acceleration. Published as a conference paper at ICLR 2019 5 Numerical experiments To investigate the performance of A2BCD, we solve the ridge regression problem. Consider the following primal and corresponding dual objective (see for instance Lin et al. (2014)): min w Rd P(w) = 1 AT w l 2 + λ 2 w 2, min α Rn D(α) = 1 2d2λ Aα 2 + 1 2d α + l 2 (5.1) where A Rd n is a matrix of n samples and d features, and l is a label vector. We let A = [A1, . . . , Am] where Ai are the column blocks of A. We compare A2BCD (which is asynchronous accelerated), synchronous NU_ACDM (which is synchronous accelerated), and asynchronous RBCD (which is asynchronous non-accelerated). Nodes randomly select a coordinate block according to equation 2.1, calculate the corresponding block gradient, and use it to apply an update to the shared solution vectors. synchronous NU_ACDM is implemented in a batch fashion, with batch size p (1 block per computing node). Nodes in synchronous NU_ACDM implementation must wait until all nodes apply their computed gradients before they can start the next iteration, but the asynchronous algorithms simply compute with the most up-to-date information available. We use the datasets w1a (47272 samples, 300 features), wxa which combines the data from from w1a to w8a (293201 samples, 300 features), and aloi (108000 samples, 128 features) from LIBSVM Chang & Lin (2011). The algorithm is implemented in a multi-threaded fashion using C++11 and GNU Scientific Library with a shared memory architecture. We use 40 threads on two 2.5GHz 10-core Intel Xeon E5-2670v2 processors. See Section A.1 for a discussion of parameter tuning and estimation. The parameters for each algorithm are tuned to give the fastest performance, so that a fair comparison is possible. A critical ingredient in the efficient implementation of A2BCD and NU_ACDM for this problem is the efficient update scheme discussed in Lee & Sidford (2013b;a). In linear regression applications such as this, it is essential to be able to efficiently maintain or recover Ay. This is because calculating block gradients requires the vector AT i Ay, and without an efficient way to recover Ay, block gradient evaluations are essentially 50% as expensive as full-gradient calculations. 
Unfortunately, every accelerated iteration results in dense updates to yk because of the averaging step in equation 2.6. Hence Ay must be recalculated from scratch. However Lee & Sidford (2013a) introduces a linear transformation that allows for an equivalent iteration that results in sparse updates to new iteration variables p and q. The original purpose of this transformation was to ensure that the averaging steps (e.g. equation 2.6) do not dominate the computational cost for sparse problems. However we find a more important secondary use which applies to both sparse and dense problems. Since the updates to p and q are sparse coordinate-block updates, the vectors Ap, and Aq can be efficiently maintained, and therefore block gradients can be efficiently calculated. The specifics of this efficient implementation are discussed in Section A.2. In Table 5, we plot the sub-optimality vs. time for decreasing values of λ, which corresponds to increasingly large condition numbers κ. When κ is small, acceleration doesn t result in a significantly better convergence rate, and hence A2BCD and async-RBCD both outperform sync-NU_ACDM since they complete faster iterations at similar complexity. Acceleration for low κ has unnecessary overhead, which means async-RBCD can be quite competitive. When κ becomes large, async-RBCD is no longer competitive, since it has a poor convergence rate. We observe that A2BCD and sync-NU_ACDM have essentially the same convergence rate, but A2BCD is up to 4 5 faster than sync-NU_ACDM because it completes much faster iterations. We observe this advantage despite the fact that we are in an ideal environment for synchronous computation: A small, homogeneous, high-bandwidth, low-latency cluster. In large-scale heterogeneous systems with greater synchronization overhead, bandwidth constraints, and latency, we expect A2BCD s advantage to be much larger. Published as a conference paper at ICLR 2019 Table 1: Sub-optimality f(yk) f(x ) (y-axis) vs time in seconds (x-axis) for A2BCD, synchronous NU_ACDM, and asynchronous RBCD for data sets w1a, wxa and aloi for various values of λ. 6 Acknowledgement The authors would like to thank the reviewers for their helpful comments. The research presented in this paper was supported in part by AFOSR MURI FA9550-18-10502, NSF DMS-1720237, and ONR N0001417121. Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 1200 1205, New York, NY, USA, 2017. ACM. Zeyuan Allen-Zhu and Elad Hazan. Optimal Black-Box Reductions Between Optimization Objectives. ar Xiv:1603.05642, March 2016. Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. ar Xiv:1512.09103, December 2015. Yossi Arjevani. Limitations on Variance-Reduction and Acceleration Schemes for Finite Sums Optimization. In Advances in Neural Information Processing Systems 30, pp. 3540 3549. Curran Associates, Inc., 2017. H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pp. 198 207, May 2014. Dimitri P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27(1):107 120, 1983. Published as a conference paper at ICLR 2019 REFERENCES Dimitri P. Bertsekas and John N. Tsitsiklis. 
Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997. Loris Cannelli, Francisco Facchinei, Vyacheslav Kungurtsev, and Gesualdo Scutari. Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization. Part II: Complexity and Numerical Results. ar Xiv:1701.04900, January 2017. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1 27:27, May 2011. D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2(2):199 222, April 1969. Patrick L. Combettes and Jonathan Eckstein. Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Mathematical Programming, 168(1-2):645 672, March 2018. Damek Davis. SMART: The stochastic monotone aggregated root-finding algorithm. ar Xiv:1601.00698, January 2016. Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD. ar Xiv:1803.01113 [cs, stat], March 2018. Cong Fang, Yameng Huang, and Zhouchen Lin. Accelerating Asynchronous Algorithms for Convex Optimization by Momentum Compensation. ar Xiv:1802.09747 [cs, math], February 2018. Robert Hannah and Wotao Yin. More Iterations per Second, Same Quality Why Asynchronous Algorithms may Drastically Outperform Traditional Ones. ar Xiv:1708.05136, August 2017a. Robert Hannah and Wotao Yin. On Unbounded Delays in Asynchronous Parallel Fixed-Point Algorithms. Journal of Scientific Computing, pp. 1 28, December 2017b. Zhouyuan Huo and Heng Huang. Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization. ar Xiv:1604.03584, April 2016. Patrick R. Johnstone and Jonathan Eckstein. Projective Splitting with Forward Steps: Asynchronous and Block-Iterative Operator Splitting. ar Xiv:1803.07043 [cs, math], March 2018. Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. ar Xiv:1507.02000, July 2015. Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous Parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 46 54, April 2017. Y. T. Lee and A. Sidford. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 147 156, October 2013a. Yin Tat Lee and Aaron Sidford. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. ar Xiv:1305.1922, May 2013b. Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous Decentralized Parallel Stochastic Gradient Descent. In International Conference on Machine Learning, pp. 3043 3052, July 2018. Published as a conference paper at ICLR 2019 REFERENCES Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. ar Xiv:1506.02186, June 2015. Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. ar Xiv:1407.1296, July 2014. J. Liu and S. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351 376, January 2015. Ji Liu, Steve Wright, Christopher Re, Victor Bittorf, and Srikrishna Sridhar. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. In International Conference on Machine Learning, pp. 
469 477, January 2014. Tianyi Liu, Shiyang Li, Jianping Shi, Enlu Zhou, and Tuo Zhao. Towards Understanding Acceleration Tradeoffbetween Momentum and Asynchrony in Nonconvex Stochastic Optimization. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2018. Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615 642, August 2014. Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7 35, January 1992. Zhi-Quan Luo and Paul Tseng. On the convergence rate of dual ascent methods for linearly constrained convex minimization. Mathematics of Operations Research, 18(4):846 867, November 1993. Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization. ar Xiv:1507.06970, July 2015. Qi Meng, Wei Chen, Jingcheng Yu, Taifeng Wang, Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous Accelerated Stochastic Gradient Descent. Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341 362, January 2012. Yurii Nesterov. A method of solving a convex programming problem with convergence rate O (1/k2). In Soviet Mathematics Doklady, volume 27, pp. 372 376, 1983. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, December 2013. Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: An Algorithmic Framework for Asynchronous Parallel Coordinate Updates. SIAM Journal on Scientific Computing, 38(5):A2851 A2879, January 2016a. Zhimin Peng, Tianyu Wu, Yangyang Xu, Ming Yan, and Wotao Yin. Coordinate friendly structures, algorithms and applications. Annals of Mathematical Sciences and Applications, 1(1):57 119, 2016b. Zhimin Peng, Yangyang Xu, Ming Yan, and Wotao Yin. On the Convergence of Asynchronous Parallel Iteration with Unbounded Delays. ar Xiv:1612.04425 [cs, math, stat], December 2016c. Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pp. 693 701, 2011. Published as a conference paper at ICLR 2019 REFERENCES Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants. ar Xiv:1506.06840, June 2015. Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A Stochastic Gradient Method with an Exponential Convergence _Rate for Finite Training Sets. In Advances in Neural Information Processing Systems 25, pp. 2663 2671. Curran Associates, Inc., 2012. Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105 145, January 2016. Weijie Su, Stephen Boyd, and Emmanuel Candes. A Differential Equation for Modeling Nesterov s Accelerated Gradient Method: Theory and Insights. In Advances in Neural Information Processing Systems 27, pp. 2510 2518. 2014. Tao Sun, Robert Hannah, and Wotao Yin. Asynchronous Coordinate Descent under More Realistic Assumptions. In Advances in Neural Information Processing Systems 30, pp. 6183 6191. 2017. P. Tseng. On the rate of convergence of a partially asynchronous gradient projection algorithm. 
SIAM Journal on Optimization, 1(4):603 619, November 1991. P. Tseng, D. Bertsekas, and J. Tsitsiklis. Partially asynchronous, parallel algorithms for network flow and other problems. SIAM Journal on Control and Optimization, 28(3):678 710, March 1990. Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Department of Mathematics, University of Washington, Tech. Rep., 2008. Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization. ar Xiv:1710.05080, October 2017. Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. Distributed Asynchronous Optimization with Unbounded Delays: How Slow Can You Go? In International Conference on Machine Learning, pp. 5970 5979, July 2018. Published as a conference paper at ICLR 2019 A Efficient Implementation An efficient implementation will have coordinate blocks of size greater than 1. This to ensure the efficiency of linear algebra subroutines. Especially because of this, the bulk of the computation for each iteration is computing ikf(ˆyk), and not the averaging steps. Hence the computing nodes only need a local copy of yk in order to do the bulk of an iteration s computation. Given this gradient ikf(ˆyk), updating yk and vk is extremely fast (xk can simply be eliminated). Hence it is natural to simply store yk and vk centrally, and update them when the delayed gradients ikf(ˆyk). Given the above, a write mutex over (y, v) has minuscule overhead (which we confirm with experiments), and makes the labeling of iterates unambiguous. This also ensures that vk and yk are always up to date when (y, v) are being updated. Whereas the gradient ikf(ˆyk) may at the same time be out of date, since it has been calculated with an outdated version of yk. However a write mutex is not necessary in practice, and does not appear to affect convergence rates or computation time. Also it is possible to prove convergence under more general asynchronicity. A.1 Parameter selection and tuning When defining the coefficients, σ may be underestimated, and L, L1, . . . , Ln may be overestimated if exact values are unavailable. Notice that xk can be eliminated from the above iteration, and the block gradient ikf(ˆyk) only needs to be calculated once per iteration. A larger (or overestimated) maximum delay τ will cause a larger asynchronicity parameter ψ, which leads to more conservative step sizes to compensate. To estimate ψ, one can first performed a dry run with all coefficient set to 0 to estimate τ. All function parameters can be calculated exactly for this problem in terms of the data matrix and λ. We can then use these parameters and this tau to calculate ψ. ψ and τ merely change the parameters, and do not change execution patterns of the processors. Hence their parameter specification doesn t affect the observed delay. Through simple tuning though, we found that ψ = 0.25 resulted in good performance. In tuning for general problems, there are theoretical reasons why it is difficult to attain acceleration without some prior knowledge of σ, the strong convexity modulus Arjevani (2017). Ideally σ is pre-specified for instance in a regularization term. If the Lipschitz constants Li cannot be calculated directly (which is rarely the case for the classic dual problem of empirical risk minimization objectives), the line-search method discussed in Roux et al. (2012) Section 4 can be used. 
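As a small illustration of the parameter workflow described in Section A.1, the sketch below collects the coefficient computations of eqs. (2.2) to (2.5) into one helper, taking (possibly estimated) $L_i$, $L$, $\sigma$, and a measured maximum delay $\tau$, with an optional hand-tuned $\psi$ as the text suggests. The function name and example values are hypothetical, not from the paper.

```python
import numpy as np

def a2bcd_coefficients(L_blocks, L_full, sigma, tau, psi_tuned=None):
    """Return (psi, alpha, beta, h) following eqs. (2.2)-(2.5).

    L_blocks:  block Lipschitz constants L_i (overestimates are safe),
    L_full:    Lipschitz constant of the full gradient,
    sigma:     strong convexity modulus (underestimates are safe),
    tau:       maximum delay, e.g. measured in a dry run,
    psi_tuned: optional hand-tuned asynchronicity parameter
               (Section A.1 reports psi = 0.25 working well in practice).
    """
    L_blocks = np.asarray(L_blocks, dtype=float)
    S = np.sum(np.sqrt(L_blocks))
    L_under = L_blocks.min()
    kappa = L_full / sigma
    # Eq. (2.2); in practice this is conservative, hence the tuned override.
    psi = 9.0 * tau * L_full**0.75 * kappa**0.25 / np.sqrt(S * L_under)
    if psi_tuned is not None:
        psi = psi_tuned
    alpha = 1.0 / (1.0 + (1.0 + psi) * S / np.sqrt(sigma))   # eq. (2.3)
    beta = 1.0 - (1.0 - psi) * np.sqrt(sigma) / S            # eq. (2.4)
    h = 1.0 - 0.5 * psi * np.sqrt(sigma / L_under)           # eq. (2.5)
    return psi, alpha, beta, h

# Hypothetical example: 8 unequal blocks, measured maximum delay tau = 12.
print(a2bcd_coefficients(L_blocks=[1.0, 2.0, 4.0, 8.0, 1.5, 3.0, 2.5, 5.0],
                         L_full=20.0, sigma=0.1, tau=12, psi_tuned=0.25))
```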
A.2 Sparse update formulation As mentioned in Section 5, authors in Lee & Sidford (2013a) proposed a linear transformation of an accelerated RBCD scheme that results in sparse coordinate updates. Our proposed algorithm can be given a similar efficient implementation. We may eliminate xk from A2BCD, and derive the equivalent iteration below: = 1 αβ, αβ 1 β, β ασ 1/2L 1/2 ik + h(1 α)L 1 ik ikf ˆyk σ 1/2L 1/2 ik ikf ˆyk Published as a conference paper at ICLR 2019 A.2 Sparse update formulation where C and Qk are defined in the obvious way. Hence we define auxiliary variables pk, qk defined via: yk vk These clearly follow the iteration: pk+1 qk+1 C (k+1)Qk (A.2) Since the vector Qk is sparse, we can evolve variables pk, and qk in a sparse manner, and recover the original iteration variables at the end of the algorithm via A.1. The gradient of the dual function is given by: d AT Ay + λ(y + l) As mentioned before, it is necessary to maintain or recover Ayk to calculate block gradients. Since Ayk can be recovered via the linear relation in equation A.1, and the gradient is an affine function, we maintain the auxiliary vectors Apk and Aqk instead. Hence we propose the following efficient implementation in Algorithm 1. We used this to generate the results in Table 5. We also note also that it can improve performance to periodically recover vk and yk, reset the values of pk, qk, and C to vk, yk, and I respectively, and restarting the scheme (which can be done cheaply in time O(d)). We let B R2 2 represent Ck, and b represent B 1. is the Kronecker product. Each computing node has local outdated versions of p, q, Ap, Aq which we denote ˆp, ˆq, ˆ Ap, ˆ Aq respectively. We also find it convenient to define: Dk 1 Dk 2 " ασ 1/2L 1/2 ik + h(1 α)L 1 ik σ 1/2L 1/2 ik Published as a conference paper at ICLR 2019 Algorithm 1 Shared-memory implementation of A2BCD 1: Inputs: Function parameters A, λ, L, {Li}n i=1, n, d. Delay τ (obtained in dry run). Starting vectors y, v. 2: Shared data: Solution vectors p, q; auxiliary vectors Ap, Aq; sparsifying matrix B 3: Node local data: Solution vectors ˆp, ˆq, auxiliary vectors ˆ Ap, ˆ Aq, sparsifying matrix ˆB. 4: Calculate parameters ψ, α, β, h via 1. Set k = 0. 5: Initializations: p y, q v, Ap Ay, Aq Av, B I. 6: while not converged, each computing node asynchronous do 7: Randomly select block i via equation 2.1. 8: Read shared data into local memory: ˆp p, ˆq q, ˆ Ap Ap, ˆ Aq Aq, ˆB B. 9: Compute block gradient: if(ˆy) = 1 nλ 1 n AT i ˆB1,1 ˆ Ap + ˆB1,2 ˆ Aq + λ ˆB1,1ˆp + ˆB1,2ˆq 10: Compute quantity gi = AT i if(ˆy) Shared memory updates: 11: Update B 1 αβ αβ 1 β β B, calculate inverse b B 1. = b Dk 1 Dk 2 if(ˆy) , Ap Aq = b Dk 1 Dk 2 13: Increase iteration count: k k + 1 14: end while 15: Recover original iteration variables: y v . Output y. B Proof of the main result We first recall a couple of inequalities for convex functions. Lemma 7. Let f be σ-strongly convex with L-Lipschitz gradient. 
Then we have: f(y) f(x) + y x, f(x) + 1 2L y x 2, x, y (B.1) f(y) f(x) + y x, f(x) + 1 2σ y x 2, x, y (B.2) We also find it convenient to define the norm: i=1 L 1/2 i si 2 (B.3) Published as a conference paper at ICLR 2019 B.1 Starting point B.1 Starting point First notice that using the definition equation 2.8 of vk+1 we have: vk+1 2 = βvk + (1 β)yk 2 2σ 1/2L 1/2 ik βvk + (1 β)yk, ikf(ˆyk) + σ 1L 1 ik ikf(ˆyk) 2 Ek vk+1 2 = βvk + (1 β)yk 2 | {z } A 2σ 1/2S 1 βvk + (1 β)yk, f(ˆyk) | {z } B + S 1σ 1 n X i=1 L 1/2 i if(ˆyk) 2 We have the following general identity: βx + (1 β)y 2 = β x 2 + (1 β) y 2 β(1 β) x y 2, x, y (B.5) It can also easily be verified from equation 2.6 that we have: vk = yk + α 1(1 α)(yk xk) (B.6) Using equation B.5 on term A, equation B.6 on term B, and recalling the definition equation B.3 on term C, we have from equation B.4: Ek vk+1 2 = β vk 2 + (1 β) yk 2 β(1 β) vk yk 2 + S 1σ 1/2 f(ˆyk) 2 (B.7) 2σ 1/2S 1βα 1(1 α) yk xk, f(ˆyk) 2σ 1/2S 1 yk, f(ˆyk) This inequality is our starting point. We analyze the terms on the second line in the next section. B.2 The Cross Term To analyze these terms, we need a small lemma. This lemma is fundamental in allowing us to deal with asynchronicity. Lemma 8. Let χ, A > 0. Let the delay be bounded by τ. Then: j=1 yk+1 j yk j 2 Proof. See Hannah & Yin (2017a). Lemma 9. We have: f(ˆyk), yk f(yk) 1 2σ(1 ψ) yk 2 + 1 j=1 yk+1 j yk j 2 (B.8) f(ˆyk), xk yk f(xk) f(yk) (B.9) κ 1ψβ vk yk 2 + κψ 1β 1τ j=1 yk+1 j yk j 2 The terms in bold in equation B.8 and equation B.9 are a result of the asynchronicity, and are identically 0 in its absence. Published as a conference paper at ICLR 2019 B.3 Function-value term Proof. Our strategy is to separately analyze terms that appear in the traditional analysis of Nesterov (2012), and the terms that result from asynchronicity. We first prove equation B.8: f(ˆyk), yk = f(yk), yk f(ˆyk) f(yk), yk 2σ yk 2 + L ˆyk yk yk (B.10) equation B.10 follows from strong convexity (equation B.2 with x = yk and y = x ), and the fact that f is L-Lipschitz. The term due to asynchronicity becomes: L ˆyk yk yk 1 2Lκ 1ψ yk 2 + 1 j=1 yk+1 j yk j 2 using Lemma 8 with χ = κψ 1, A = yk . Combining this with equation B.10 completes the proof of equation B.8. We now prove equation B.9: f(ˆyk), xk yk = f(yk), xk yk + f(ˆyk) f(yk), xk yk f(xk) f(yk) + L ˆyk yk xk yk f(xk) f(yk) κ 1ψβα 1(1 α) xk yk 2 + κψ 1β 1α(1 α) 1τ j=1 yk+1 j yk j 2 Here the last line follows from Lemma 8 with χ = κψ 1β 1α(1 α) 1, A = nxk yk. We can complete the proof using the following identity that can be easily obtained from equation 2.6: yk xk = α(1 α) 1(vk yk) B.3 Function-value term Much like Nesterov (2012), we need a f(xk) term in the Lyapunov function (see the middle of page 357). However we additionally need to consider asynchronicity when analyzing the growth of this term. Again terms due to asynchronicity are emboldened. Lemma 10. We have: Ekf(xk+1) f(yk) 1 2h 2 h 1 + 1 2σ1/2L 1/2ψ S 1 f(ˆyk) 2 + S 1Lσ1/2κψ 1τ j=1 yk+1 j yk j 2 Proof. From the definition equation 2.7 of xk+1, we can see that xk+1 yk is supported on block ik. 
Since each gradient block if is Li Lipschitz with respect to changes to block i, we can use Published as a conference paper at ICLR 2019 B.4 Asynchronicity error equation B.1 to obtain: f(xk+1) f(yk) + f(yk), xk+1 yk + 1 2Lik xk+1 yk 2 (from equation 2.7) = f(yk) h L 1 ik ikf(yk), ikf(ˆyk) + 1 2h2L 1 ik ikf(ˆyk) 2 = f(yk) h L 1 ik ikf(yk) ikf(ˆyk), ikf(ˆyk) 1 2h(2 h)L 1 ik ikf(ˆyk) 2 Ekf(xk+1) f(yk) h S 1 n X i=1 L 1/2 i if(yk) if(ˆyk), if(ˆyk) 1 2h(2 h)S 1 f(ˆyk) 2 Here the last line followed from the definition equation B.3 of the norm 1/2. We now analyze the middle term: i=1 L 1/2 i if(yk) if(ˆyk), if(ˆyk) i=1 L 1/4 i ( if(yk) if(ˆyk)), i=1 L 1/4 i if(ˆyk) (Cauchy Schwarz) i=1 L 1/4 i ( if(yk) if(ˆyk)) i=1 L 1/4 i if(ˆyk) i=1 L 1/2 i if(yk) if(ˆyk) 2 !1/2 n X i=1 L 1/2 i if(ˆyk) 2 !1/2 (L Li, i and equation B.3) L 1/4 f(yk) f(ˆyk) f(ˆyk) ( f is L-Lipschitz) L 1/4L yk ˆyk f(ˆyk) We then apply Lemma 8 to this with χ = 2h 1σ1/2L1/4κψ 1, A = f(ˆyk) to yield: i=1 L 1/2 i if(yk) if(ˆyk), if(ˆyk) h 1Lσ1/2κψ 1τ j=1 yk+1 j yk j 2 (B.12) 4hσ1/2L 1/2ψ f(ˆyk) 2 Finally to complete the proof, we combine equation B.11, with equation B.12. B.4 Asynchronicity error The previous inequalities produced difference terms of the form yk+1 j yk j 2. The following lemma shows how these errors can be incorporated into a Lyapunov function. Lemma 11. Let 0 < r < 1 and consider the asynchronicity error and corresponding coefficients: j=1 cj yk+1 j yk j 2 j=i ri j 1sj Published as a conference paper at ICLR 2019 B.4 Asynchronicity error This sum satisfies: Ek[Ak+1 r Ak] = c1Ek yk+1 yk 2 j=1 sj yk+1 j yk j 2 Remark 2. Interpretation. This result means that an asynchronicity error term Ak can negate a series of difference terms P j=1 sj yk+1 j yk j 2 at the cost of producing an additional error c1Ek yk+1 yk 2, while maintaining a convergence rate of r. This essentially converts difference terms, which are hard to deal with, into a yk+1 yk 2 term which can be negated by other terms in the Lyapunov function. The proof is straightforward. Ek[Ak+1 r Ak] = Ek j=0 cj+1 yk+1 j yk j 2 r Ek j=1 cj yk+1 j yk j 2 = c1Ek yk+1 yk 2 + Ek j=1 (cj+1 rcj) yk+1 j yk j 2 Noting the following completes the proof: j=i+1 ri+1 j 1sj r j=i ri j 1sj = si Given that Ak allows us to negate difference terms, we now analyze the cost c1Ek yk+1 yk 2 of this negation. Lemma 12. We have: Ek yk+1 yk 2 2α2β2 vk yk 2 + 2S 1L 1 f(ˆyk) 2 yk+1 yk = (αvk+1 + (1 α)xk+1) yk = α βvk + (1 β)yk σ 1/2L 1/2 ik ikf(ˆyk) + (1 α) yk h L 1 ik ikf(ˆyk) yk (B.13) = αβvk + α(1 β)yk ασ 1/2L 1/2 ik ikf(ˆyk) αyk (1 α)h L 1 ik ikf(ˆyk) = αβ(vk yk) ασ 1/2L 1/2 ik + h(1 α)L 1 ik ikf(ˆyk) yk+1 yk 2 2α2β2 vk yk 2 + 2 ασ 1/2L 1/2 ik + h(1 α)L 1 ik 2 ikf(ˆyk) 2 (B.14) Published as a conference paper at ICLR 2019 B.5 Master inequality Here equation B.13 following from equation 2.8, the definition of vk+1. equation B.14 follows from the inequality x + y 2 2 x 2 + 2 y 2. The rest is simple algebraic manipulation. yk+1 yk 2 2α2β2 vk yk 2 + 2L 1 ik ασ 1/2 + h(1 α)L 1/2 ik 2 ikf(ˆyk) 2 (L Li, i) 2α2β2 vk yk 2 + 2L 1 ik ασ 1/2 + h(1 α)L 1/2 2 ikf(ˆyk) 2 = 2α2β2 vk yk 2 + 2L 1 ik L 1 L1/2σ 1/2α + h(1 α) 2 ikf(ˆyk) 2 E yk+1 yk 2 2α2β2 vk yk 2 + 2S 1L 1 L1/2σ 1/2α + h(1 α) 2 f(ˆyk) 2 Finally, to complete the proof, we prove L1/2σ 1/2α + h(1 α) 1. 
L1/2σ 1/2α + h(1 α) = h + α L1/2σ 1/2 h (definitions of h and α: equation 2.3, and equation 2.5) = 1 1 2σ1/2L 1/2ψ + σ1/2S 1 1 σ1/2L 1/2 1 2ψ σ 1/2S 1L1 Rearranging the definition of ψ, we have: 92 ψ2L1L 3/2κ 1/2τ 2 (τ 1 and ψ 1 2) 1 182 L1L 3/2κ 1/2 Using this on equation B.15, we have: L1/2ασ 1/2 + h(1 α) 1 σ1/2L 1/2 1 2ψ 1 182 L1L 3/2κ 1/2σ 1/2L1 = 1 σ1/2L 1/2 1 2ψ 1 182 (L/L)2 2) = 1 σ1/2L 1/2 1 This completes the proof. B.5 Master inequality We are finally in a position to bring together all the all the previous results together into a master inequality for the Lyapunov function ρk (defined in equation 2.11). After this lemma is proven, we will prove that the right hand size is negative, which will imply that ρk linearly converges to 0 with rate β. Published as a conference paper at ICLR 2019 B.5 Master inequality Lemma 13. Master inequality. We have: Ek[ρk+1 βρk] + yk 2 1 β σ 1/2S 1σ(1 ψ) (B.16) + vk yk 2 β 2α2βc1 + S 1βL1/2κ 1/2ψ (1 β) + f(yk) c 2σ 1/2S 1 βα 1(1 α) + 1 + f(xk) β 2σ 1/2S 1α 1(1 α) c j=1 yk+1 j yk j 2 S 1Lκψ 1τσ1/2 2σ 1 + c s + f(ˆyk) 2 S 1 σ 1 + 2L 1c1 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ Ek vk+1 2 β vk 2 (B.7) = (1 β) yk 2 β(1 β) vk yk 2 + S 1σ 1 f(ˆyk) 2 2σ 1/2S 1 yk, f(ˆyk) 2σ 1/2S 1βα 1(1 α) yk xk, f(ˆyk) (1 β) yk 2 β(1 β) vk yk 2 + S 1σ 1 f(ˆyk) 2 (B.17) (B.8) + 2σ 1/2S 1 2σ(1 ψ) yk 2 + 1 j=1 yk+1 j yk j 2 (B.9) 2σ 1/2S 1βα 1(1 α)(f(xk) f(yk)) + σ 1/2S 1βL κ 1ψβ vk yk 2 + κψ 1β 1τ j=1 yk+1 j yk j 2 We now collect and organize the similar terms of this inequality. + yk 2 1 β σ 1/2S 1σ(1 ψ) + vk yk 2 β σ 1/2S 1βLκ 1ψ (1 β) f(yk) 2σ 1/2S 1 βα 1(1 α) + 1 + f(xk) 2σ 1/2S 1βα 1(1 α) j=1 yk+1 j yk j 2 2σ 1/2S 1Lκψ 1τ + f(ˆyk) 2 σ 1S 1 Published as a conference paper at ICLR 2019 B.6 Proof of main theorem Now finally, we add the function-value and asynchronicity terms to our analysis. We use Lemma 11 is with r = 1 σ1/2S 1, and si = s = 6S 1L1/2κ3/2ψ 1τ, 1 i τ 0, i > τ (B.18) Notice that this choice of si will recover the coefficient formula given in equation 2.9. Hence we have: Ek[cf(xk+1) + Ak+1 β(cf(xk) + Ak)] (Lemma 10) cf(yk) 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ S 1 f(ˆyk) 2 βcf(xk) + S 1Lσ1/2κψ 1τ j=1 yk+1 j yk j 2 (Lemmas 11 and 12) + c1 2α2β2 vk yk 2 + 2S 1L 1 f(ˆyk) 2 (B.20) j=1 sj yk+1 j yk j 2 + Ak(r β) Notice Ak(r β) 0. Finally, combining equation B.17 and equation B.19 completes the proof. In the next section, we will prove that every coefficient on the right hand side of equation B.16 is 0 or less, which will complete the proof of Theorem 1. B.6 Proof of main theorem Lemma 14. The coefficients of yk 2, f(yk), and Pτ j=1 yk+1 j yk j 2 in Lemma 13 are non-positive. Proof. The coefficient 1 (1 ψ)σ1/2S 1 β of yk 2 is identically 0 via the definition equation 2.4 of β. The coefficient c 2σ 1/2S 1 βα 1(1 α) + 1 of f(yk) is identically 0 via the definition equation 2.12 of c. First notice from the definition equation 2.12 of c: c = 2σ 1/2S 1 βα 1(1 α) + 1 (definitions of α, β) = 2σ 1/2S 1 1 σ1/2S 1(1 ψ) (1 + ψ)σ 1/2S + 1 = 2σ 1/2S 1 (1 + ψ)σ 1/2S + ψ2 = 2σ 1 (1 + ψ) + ψ2σ1/2S 1 (B.21) c 4σ 1 (B.22) Published as a conference paper at ICLR 2019 B.6 Proof of main theorem Here the last line followed since ψ 1 2 and σ1/2S 1 1. We now analyze the coefficient of Pτ j=1 yk+1 j yk j 2. S 1Lκψ 1τσ1/2 2σ 1 + c s (B.22) 6L1/2κ3/2ψ 1τ s (definition equation B.18 of s) 0 Lemma 15. The coefficient β 2σ 1/2S 1α 1(1 α) c of f(xk) in Lemma 13 is non-positive. 2σ 1/2S 1α 1(1 α) c (B.21) = 2σ 1/2S 1(1 + ψ)σ 1/2S 2σ 1 (1 + ψ) + ψ2σ1/2S 1 = 2σ 1 (1 + ψ) (1 + ψ) + ψ2σ1/2S 1 = 2ψ2σ 1/2S 1 0 Lemma 16. 
The coefficient S 1 σ 1 + 2L 1c1 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ of f(ˆyk) 2 in Lemma 13 is non-positive. Proof. We first need to bound c1. (equation B.18 and equation 2.9) c1 = s 1 σ1/2S 1 j equation B.18 6S 1L1/2κ3/2ψ 1τ 1 σ1/2S 1 j 6S 1L1/2κ3/2ψ 1τ 2 1 σ1/2S 1 τ It can be easily verified that if x 1 2 and y 0, then (1 x) y exp(2xy). Using this fact with x = σ1/2S 1 and y = τ, we have: 6S 1L1/2κ3/2ψ 1τ 2 exp τσ1/2S 1 (since ψ 3/7 and hence τσ1/2S 1 1 7) S 1L1/2κ3/2ψ 1τ 2 6 exp 1 c1 7S 1L1/2κ3/2ψ 1τ 2 (B.23) Published as a conference paper at ICLR 2019 B.6 Proof of main theorem We now analyze the coefficient of f(ˆyk) 2 σ 1 + 2L 1c1 1 2ch 2 h 1 + 1 2σ1/2L 1/2ψ (B.23 and 2.5) σ 1 + 14S 1L 1L1/2κ3/2ψ 1τ 2 1 σ 1 + 14S 1L 1L1/2κ3/2ψ 1τ 2 1 (definition 2.2 of ψ) = σ 1 + 14 (B.21, definition 2.5 of h) = σ 1 1 + 14 81ψ (1 + ψ) + ψ2σ1/2S 1 1 1 2σ1/2L 1/2ψ (σ1/2L 1/2 0 and σ1/2S 1 1) σ 1 1 + 14 81ψ (1 + ψ) 1 1 Lemma 17. The coefficient β 2α2βc1 + S 1βL1/2κ 1/2ψ (1 β) of vk yk 2 in 13 is nonpositive. 2α2βc1 + σ1/2S 1βψ (1 ψ)σ1/2S 1 (B.23) 14α2βS 1L1/2κ3/2ψ 1τ 2 + σ1/2S 1βψ (1 ψ)σ1/2S 1 14σS 3L1/2κ3/2ψ 1τ 2 + σ1/2S 1ψ (1 ψ)σ1/2S 1 = σ1/2S 1 14S 2Lκτ 2ψ 1 + 2ψ 1 Here the last inequality follows since β 1 and α σ1/2S 1. We now rearrange the definition of ψ to yield the identity: 94 L2L 3τ 4ψ4 Using this, we have: 14S 2Lκτ 2ψ 1 + 2ψ 1 94 L2L 2ψ3τ 2 + 2ψ 1 Here the last line followed since L L, ψ 3 7, and τ 1. Hence the proof is complete. Proof of Theorem 1. Using the master inequality 13 in combination with the previous Lemmas 14, 15, 16, and 17, we have: Ek[ρk+1] βρk = 1 (1 ψ)σ1/2S 1 ρk Published as a conference paper at ICLR 2019 When we have: 1 (1 ψ)σ1/2S 1 k ϵ then the Lyapunov function ρk has decreased below ϵρ0 in expectation. Hence the complexity K(ϵ) satisfies: K(ϵ) ln 1 (1 ψ)σ1/2S 1 = ln(ϵ) K(ϵ) = 1 ln 1 (1 ψ)σ1/2S 1 ln(1/ϵ) Now it can be shown that for 0 < x 1 2, we have: 1 x 1 1 ln(1 x) 1 2 1 ln(1 x) = 1 Since n 2, we have σ1/2S 1 1 K(ϵ) = 1 1 ψ σ 1/2S + O(1) ln(1/ϵ) An expression for KNU_ACDM(ϵ), the complexity of NU_ACDM follows by similar reasoning. KNU_ACDM(ϵ) = σ 1/2S + O(1) ln(1/ϵ) (B.24) Finally we have: K(ϵ) = 1 1 ψ σ 1/2S + O(1) σ 1/2S + O(1) KNU_ACDM(ϵ) = 1 1 ψ (1 + o(1))KNU_ACDM(ϵ) which completes the proof. C Ordinary Differential Equation Analysis C.1 Derivation of ODE for synchronous A2BCD If we take expectations with respect to Ek, then synchronous (no delay) A2BCD becomes: yk = αvk + (1 α)xk Ekxk+1 = yk n 1κ 1 f(yk) Ekvk+1 = βvk + (1 β)yk n 1κ 1/2 f(yk) We find it convenient to define η = nκ1/2. Inspired by this, we consider the following iteration: yk = αvk + (1 α)xk (C.1) xk+1 = yk s1/2κ 1/2η 1 f(yk) (C.2) vk+1 = βvk + (1 β)yk s1/2η 1 f(yk) (C.3) Published as a conference paper at ICLR 2019 C.1 Derivation of ODE for synchronous A2BCD for coefficients: α = 1 + s 1/2η 1 β = 1 s1/2η 1 s is a discretization scale parameter that will be sent to 0 to obtain an ODE analogue of synchronous A2BCD. We first use equation B.6 to eliminate vk from from equation C.3. 
We first use equation B.6 to eliminate $v_k$ from equation C.3:
\[
0 = -v_{k+1} + \beta v_k + (1-\beta)y_k - s^{1/2}\eta^{-1}\nabla f(y_k)
= -\alpha^{-1}y_{k+1} + \alpha^{-1}(1-\alpha)x_{k+1} + \beta\big(\alpha^{-1}y_k - \alpha^{-1}(1-\alpha)x_k\big) + (1-\beta)y_k - s^{1/2}\eta^{-1}\nabla f(y_k).
\]
Multiplying by $\alpha$:
\[
0 = -y_{k+1} + (1-\alpha)x_{k+1} + \beta\big(y_k - (1-\alpha)x_k\big) + \alpha(1-\beta)y_k - \alpha s^{1/2}\eta^{-1}\nabla f(y_k)
= -y_{k+1} + y_k\big(\beta + \alpha(1-\beta)\big) + (1-\alpha)x_{k+1} - x_k\beta(1-\alpha) - \alpha s^{1/2}\eta^{-1}\nabla f(y_k).
\]
We now eliminate $x_{k+1}$ and $x_k$ using equation C.2:
\[
0 = -y_{k+1} + y_k\big(\beta + \alpha(1-\beta)\big) + (1-\alpha)\big(y_k - s^{1/2}\eta^{-1}\kappa^{-1/2}\nabla f(y_k)\big) - \beta(1-\alpha)\big(y_{k-1} - s^{1/2}\eta^{-1}\kappa^{-1/2}\nabla f(y_{k-1})\big) - \alpha s^{1/2}\eta^{-1}\nabla f(y_k),
\]
and collecting terms,
\[
0 = (y_k - y_{k+1}) + \beta(1-\alpha)(y_k - y_{k-1}) + s^{1/2}\eta^{-1}\big((\beta-1)(1-\alpha)\nabla f(y_{k-1}) - \alpha\nabla f(y_k)\big).
\]
Now, to derive an ODE, we let $y_k = Y(ks^{1/2})$. Then $\nabla f(y_{k-1}) = \nabla f(y_k) + O(s^{1/2})$, and the above becomes
\[
0 = -\Big(s^{1/2}\dot Y + \tfrac12 s\ddot Y\Big) + \beta(1-\alpha)\Big(s^{1/2}\dot Y - \tfrac12 s\ddot Y\Big) + s^{1/2}\eta^{-1}\big((\beta-1)(1-\alpha) - \alpha\big)\nabla f(y_k) + O(s^{3/2}). \tag{C.4}
\]

We now examine some of the terms in this equation to find their highest-order dependence on $s$. First,
\[
\beta(1-\alpha) = \big(1 - s^{1/2}\eta^{-1}\big)\Big(1 - \frac{1}{1 + s^{-1/2}\eta}\Big)
= \big(1 - s^{1/2}\eta^{-1}\big)\,\frac{s^{-1/2}\eta}{1 + s^{-1/2}\eta}
= \frac{1 - s^{1/2}\eta^{-1}}{1 + s^{1/2}\eta^{-1}}
= 1 - 2s^{1/2}\eta^{-1} + O(s).
\]
We also have
\[
(\beta-1)(1-\alpha) - \alpha = \beta(1-\alpha) - 1 = -2s^{1/2}\eta^{-1} + O(s).
\]
Hence, using these facts in equation C.4, we have
\[
0 = -\Big(s^{1/2}\dot Y + \tfrac12 s\ddot Y\Big) + \big(1 - 2s^{1/2}\eta^{-1} + O(s)\big)\Big(s^{1/2}\dot Y - \tfrac12 s\ddot Y\Big) + s^{1/2}\eta^{-1}\big(-2s^{1/2}\eta^{-1} + O(s)\big)\nabla f(y_k) + O(s^{3/2})
\]
\[
= -s\ddot Y - 2s\eta^{-1}\dot Y - 2s\eta^{-2}\nabla f(y_k) + O(s^{3/2}),
\]
so that
\[
0 = \ddot Y + 2\eta^{-1}\dot Y + 2\eta^{-2}\nabla f(y_k) + O(s^{1/2}).
\]
Taking the limit as $s \to 0$, we obtain the ODE
\[
\ddot Y(t) + 2\eta^{-1}\dot Y(t) + 2\eta^{-2}\nabla f(Y(t)) = 0.
\]

C.2 Convergence proof for synchronous ODE

Differentiating the energy $E(t) = e^{\eta^{-1}t}\big(f(Y(t)) + \tfrac14\|Y(t)+\eta\dot Y(t)\|^{2}\big)$, and using the ODE to substitute $\dot Y(t) + \eta\ddot Y(t) = -\dot Y(t) - 2\eta^{-1}\nabla f(Y(t))$, we have
\[
e^{-\eta^{-1}t}E'(t) = \big\langle\nabla f(Y(t)),\dot Y(t)\big\rangle + \eta^{-1}f(Y(t)) + \tfrac12\big\langle Y(t)+\eta\dot Y(t),\,\dot Y(t)+\eta\ddot Y(t)\big\rangle + \tfrac14\eta^{-1}\big\|Y(t)+\eta\dot Y(t)\big\|^{2}
\]
\[
\overset{\text{(strong convexity, equation B.2)}}{\le} \big\langle\nabla f(Y),\dot Y\big\rangle + \eta^{-1}\big\langle\nabla f(Y),Y\big\rangle - \tfrac{\sigma}{2}\eta^{-1}\|Y\|^{2} + \tfrac12\big\langle Y+\eta\dot Y,\,-\dot Y - 2\eta^{-1}\nabla f(Y)\big\rangle + \tfrac14\eta^{-1}\big\|Y+\eta\dot Y\big\|^{2}.
\]
Hence we have $E'(t) \le 0$, and therefore $E(t) \le E(0)$. That is,
\[
E(t) = e^{n^{-1}\kappa^{-1/2}t}\Big(f(Y(t)) + \tfrac14\|Y(t)+\eta\dot Y(t)\|^{2}\Big) \le E(0) = f(Y(0)) + \tfrac14\|Y(0)+\eta\dot Y(0)\|^{2}, \tag{C.5}
\]
which implies
\[
f(Y(t)) + \tfrac14\|Y(t)+\eta\dot Y(t)\|^{2} \le e^{-n^{-1}\kappa^{-1/2}t}\Big(f(Y(0)) + \tfrac14\|Y(0)+\eta\dot Y(0)\|^{2}\Big). \tag{C.6}
\]

C.3 Asynchronicity error lemma

This result is the continuous-time analogue of Lemma 11. First notice that $c(0) = c_0$ and $c(\tau) = 0$. We also have
\[
c'(t)/c_0 = \frac{-re^{-rt}}{1 - e^{-r\tau}} = -r\,\frac{e^{-rt} - e^{-r\tau} + e^{-r\tau}}{1 - e^{-r\tau}} = -r\Big(\frac{c(t)}{c_0} + \frac{e^{-r\tau}}{1 - e^{-r\tau}}\Big),
\]
that is,
\[
c'(t) = -rc(t) - rc_0\,\frac{e^{-r\tau}}{1 - e^{-r\tau}}.
\]
Hence, using $c(\tau) = 0$,
\[
A'(t) = c_0\|\dot Y(t)\|^{2} + \int_{t-\tau}^{t}c'(t-s)\|\dot Y(s)\|^{2}\,ds
= c_0\|\dot Y(t)\|^{2} - rA(t) - rc_0\,\frac{e^{-r\tau}}{1 - e^{-r\tau}}\,D(t).
\]
Now, when $x \le \tfrac12$ we have $e^{-x}/(1-e^{-x}) \ge 1/(2x)$. Hence when $r\tau \le \tfrac12$ we have
\[
A'(t) \le c_0\|\dot Y(t)\|^{2} - rA(t) - \frac{c_0}{2\tau}\,D(t),
\]
and the result easily follows.

C.4 Convergence analysis for the asynchronous ODE

We consider the same energy as in the synchronous case (that is, for the ODE in equation 3.1). Similarly to before, we have
\[
e^{-\eta^{-1}t}E'(t) \le \big\langle\nabla f(Y),\dot Y\big\rangle + \eta^{-1}\big\langle\nabla f(Y),Y\big\rangle - \tfrac{\sigma}{2}\eta^{-1}\|Y\|^{2} + \tfrac12\big\langle Y+\eta\dot Y,\,-\dot Y - 2\eta^{-1}\nabla f(\hat Y)\big\rangle + \tfrac14\eta^{-1}\|Y+\eta\dot Y\|^{2}
\]
\[
= \big\langle\nabla f(Y),\dot Y\big\rangle + \eta^{-1}\big\langle\nabla f(Y),Y\big\rangle - \tfrac{\sigma}{2}\eta^{-1}\|Y\|^{2} + \tfrac12\big\langle Y+\eta\dot Y,\,-\dot Y - 2\eta^{-1}\nabla f(Y)\big\rangle + \tfrac14\eta^{-1}\|Y+\eta\dot Y\|^{2} - \eta^{-1}\big\langle Y+\eta\dot Y,\,\nabla f(\hat Y)-\nabla f(Y)\big\rangle
\]
\[
\le -\tfrac14\eta^{-1}\|Y\|^{2} - \tfrac14\eta\|\dot Y\|^{2} - \eta^{-1}\big\langle Y+\eta\dot Y,\,\nabla f(\hat Y)-\nabla f(Y)\big\rangle,
\]
where the final inequality follows from the proof in Section C.2. Hence
\[
e^{-\eta^{-1}t}E'(t) \le -\tfrac14\eta^{-1}\|Y\|^{2} - \tfrac14\eta\|\dot Y\|^{2} + L\eta^{-1}\|Y-\hat Y\|\,\|Y\| + L\|Y-\hat Y\|\,\|\dot Y\|. \tag{C.7}
\]
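As a quick illustration of the synchronous analysis, the sketch below integrates the limiting ODE $\ddot Y + 2\eta^{-1}\dot Y + 2\eta^{-2}\nabla f(Y) = 0$ with a simple explicit Euler scheme and compares the observed decay of $f(Y(t))$ with the rate $e^{-\eta^{-1}t}$ suggested by equation C.6. The quadratic test function (with Hessian eigenvalues between $1$ and $25$, so its strong-convexity and smoothness constants are $1$ and $25$), the value of $n$, the step size, and the horizon are all illustrative assumptions.

```python
import numpy as np

# Explicit-Euler integration of the synchronous limiting ODE
#   Y'' + (2/eta) Y' + (2/eta^2) grad f(Y) = 0,   eta = n * kappa^{1/2},
# for the quadratic f(Y) = 0.5 * sum_i eig_i * Y_i^2 with eigenvalues in [1, 25].
# Equation C.6 suggests f(Y(t)) decays at least like exp(-t/eta).
# All numerical values here are illustrative assumptions.

d = 30
eigs = np.linspace(1.0, 25.0, d)
f = lambda y: 0.5 * np.sum(eigs * y**2)
grad = lambda y: eigs * y

n = 8
kappa = eigs[-1] / eigs[0]
eta = n * np.sqrt(kappa)          # = 40

dt, T = 0.05, 400.0
Y = np.ones(d)                    # initial position
V = np.zeros(d)                   # initial velocity Y'(0) = 0
f0 = f(Y)

t = 0.0
while t < T:
    acc = -(2.0 / eta) * V - (2.0 / eta**2) * grad(Y)
    Y, V = Y + dt * V, V + dt * acc
    t += dt

observed_rate = np.log(f0 / f(Y)) / T
print("observed decay rate  :", observed_rate)
print("predicted lower bound:", 1.0 / eta)
assert observed_rate >= 1.0 / eta   # decay at least as fast as exp(-t/eta)
```

For a quadratic the observed rate is roughly $2/\eta$, comfortably faster than the bound $1/\eta$, which is consistent with (C.6) being a worst-case guarantee.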
Now we present an inequality that is similar to equation 8.

Lemma 18. Let $A, \chi > 0$. Then
\[
\|Y(t) - \hat Y(t)\|\,A \le \tfrac12\Big(\chi\tau D(t) + \chi^{-1}A^{2}\Big).
\]
Proof. Since $\hat Y(t)$ is a delayed version of $Y(t)$, we have $\hat Y(t) = Y(t - j(t))$ for some function $0 \le j(t) \le \tau$ (though this can easily be generalized to an inconsistent read). Recall that for $\chi > 0$ we have $ab \le \tfrac12\big(\chi a^{2} + \chi^{-1}b^{2}\big)$. Hence
\[
\|Y(t) - \hat Y(t)\| = \Big\|\int_{t-j(t)}^{t}\dot Y(s)\,ds\Big\| \le \int_{t-j(t)}^{t}\|\dot Y(s)\|\,ds
\overset{\text{(H\"older's inequality)}}{\le} \Big(\int_{t-j(t)}^{t}\|\dot Y(s)\|^{2}\,ds\Big)^{1/2}\Big(\int_{t-j(t)}^{t}1\,ds\Big)^{1/2}
\le \tau^{1/2}D(t)^{1/2},
\]
and therefore $\|Y(t)-\hat Y(t)\|\,A \le \tau^{1/2}D(t)^{1/2}A \le \tfrac12\big(\chi\tau D(t) + \chi^{-1}A^{2}\big)$.

We use this lemma twice, on $\|Y-\hat Y\|\,\|Y\|$ and on $\|Y-\hat Y\|\,\|\dot Y\|$ in equation C.7, with $\chi = 2L$, $A = \|Y\|$ and $\chi = 4L\eta^{-1}$, $A = \|\dot Y\|$ respectively, to yield
\[
e^{-\eta^{-1}t}E'(t) \le -\tfrac14\eta^{-1}\|Y\|^{2} - \tfrac14\eta\|\dot Y\|^{2} + L\eta^{-1}\Big(L\tau D(t) + \tfrac{1}{4L}\|Y\|^{2}\Big) + L\Big(2L\eta^{-1}\tau D(t) + \tfrac{1}{8L}\eta\|\dot Y\|^{2}\Big)
\le -\tfrac18\eta\|\dot Y\|^{2} + 3L^{2}\eta^{-1}\tau D(t).
\]
The proof of convergence is completed in Section 3.

D Optimality proof

For the parameter set $\sigma, L_1, \ldots, L_n, n$, we construct a block-separable function $f$ on the space $\mathbb{R}^{bn}$ (separated into $n$ blocks of size $b$), which will imply this lower bound. Define $\kappa_i = L_i/\sigma$. We define the matrix $A_i \in \mathbb{R}^{b\times b}$ via
\[
A_i = \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1\\
 & & & -1 & \theta_i
\end{pmatrix},
\qquad \theta_i = \frac{\kappa_i^{1/2} + 3}{\kappa_i^{1/2} + 1}.
\]
Hence we define $f_i$ on $\mathbb{R}^{b}$ via
\[
f_i(x) = \frac{L_i - \sigma}{4}\Big(\tfrac12\langle x, A_i x\rangle - \langle e_1, x\rangle\Big) + \frac{\sigma}{2}\|x\|^{2},
\]
which is clearly $\sigma$-strongly convex with $L_i$-Lipschitz gradient on $\mathbb{R}^{b}$. From Lemma 8 of Lan & Zhou (2015), we know that this function has the unique minimizer
\[
x^{*,(i)} = \big(q_i, q_i^{2}, \ldots, q_i^{b}\big), \qquad q_i = \frac{\kappa_i^{1/2} - 1}{\kappa_i^{1/2} + 1}.
\]
Finally, we define $f$ via $f(x) = \sum_{i=1}^{n}f_i\big(x^{(i)}\big)$.

Now let $e(i,j)$ be the $j$th unit vector of the $i$th block (of size $b$) in $\mathbb{R}^{bn}$. For $I_1, \ldots, I_n \in \mathbb{N}$, we define the subspaces
\[
V_i(I) = \mathrm{span}\{e(i,1), \ldots, e(i,I)\},
\qquad
V(I_1, \ldots, I_n) = V_1(I_1) \oplus \cdots \oplus V_n(I_n).
\]
$V(I_1,\ldots,I_n)$ is the subspace in which only the first $I_1$ components of block 1, the first $I_2$ components of block 2, etc., may be nonzero. First notice that $\mathrm{IC}(V(I_1,\ldots,I_n)) = V(I_1,\ldots,I_n)$. Also, clearly, we have
\[
\nabla_i f\big(V(I_1,\ldots,I_n)\big) \subseteq V\big(0,\ldots,0,\min\{I_i+1, b\},0,\ldots,0\big). \tag{D.1}
\]
$\nabla_i f$ is supported on the $i$th block, which is why all the other indices are $0$. The pattern of nonzeros in $A_i$ means that the gradient can have at most one more nonzero component on the $i$th block (see Nesterov (2013)).

Let the initial point $x_0$ belong to $V(I_1,\ldots,I_n)$. Let $I_{k,i}$ be the number of times that we have had $i_{k'} = i$ for $k' = 0,\ldots,k-1$. By induction on condition 2 of Definition 4, using equation D.1, we have
\[
x_k \in V\big(\min\{I_1 + I_{k,1}, b\}, \ldots, \min\{I_n + I_{k,n}, b\}\big).
\]
Hence if $x_0^{(i)} \in V_i(0)$ and $k \le b$, then
\[
\big\|x_k^{(i)} - x^{*,(i)}\big\|^{2} \ge \min_{x \in V_i(I_{k,i})}\big\|x - x^{*,(i)}\big\|^{2} = \sum_{j=I_{k,i}+1}^{b}q_i^{2j} = \big(q_i^{2I_{k,i}+2} - q_i^{2b+2}\big)\big/\big(1 - q_i^{2}\big).
\]
Therefore, for all $i$ we have
\[
\mathbb{E}\|x_k - x^{*}\|^{2} \ge \mathbb{E}\big\|x_k^{(i)} - x^{*,(i)}\big\|^{2} \ge \mathbb{E}\Big[\big(q_i^{2I_{k,i}+2} - q_i^{2b+2}\big)\big/\big(1 - q_i^{2}\big)\Big].
\]
To evaluate this expectation, we note that
\[
\mathbb{E}\,q_i^{2I_{k,i}} = \sum_{j=0}^{k}\binom{k}{j}p_i^{j}(1-p_i)^{k-j}q_i^{2j}
= (1-p_i)^{k}\sum_{j=0}^{k}\binom{k}{j}\big(q_i^{2}p_i(1-p_i)^{-1}\big)^{j}
= (1-p_i)^{k}\big(1 + q_i^{2}p_i(1-p_i)^{-1}\big)^{k}
= \big(1 - (1-q_i^{2})p_i\big)^{k}.
\]
Hence
\[
\mathbb{E}\|x_k - x^{*}\|^{2} \ge \Big(\big(1 - (1-q_i^{2})p_i\big)^{k} - q_i^{2b}\Big)\,q_i^{2}\big/\big(1 - q_i^{2}\big).
\]
For any $i$, we may select the starting iterate $x_0$ by defining its blocks $j = 1,\ldots,n$ via
\[
x_0^{(j)} = (1 - \delta_{ij})\,x^{*,(j)},
\]
that is, block $i$ of $x_0$ is zero and every other block starts at its minimizer. For such a choice of $x_0$ we have
\[
\|x_0 - x^{*}\|^{2} = \big\|x^{*,(i)}\big\|^{2} = q_i^{2} + \cdots + q_i^{2b} = q_i^{2}\big(1 - q_i^{2b}\big)\big/\big(1 - q_i^{2}\big).
\]
Hence for this choice of $x_0$:
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \Big(\big(1 - (1-q_i^{2})p_i\big)^{k} - q_i^{2b}\Big)\big/\big(1 - q_i^{2b}\big).
\]
Now notice that
\[
\big(1 - (1-q_i^{2})p_i\big)^{k} = \big(q_i^{-2} - (q_i^{-2} - 1)p_i\big)^{k}q_i^{2k} \ge q_i^{2k},
\]
and hence
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \big(1 - (1-q_i^{2})p_i\big)^{k}\big(1 - q_i^{2b-2k}\big)\big/\big(1 - q_i^{2b}\big).
\]
Now if we let $b = 2k$, then we have
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \big(1 - (1-q_i^{2})p_i\big)^{k}\big(1 - q_i^{2k}\big)\big/\big(1 - q_i^{4k}\big)
= \big(1 - (1-q_i^{2})p_i\big)^{k}\big/\big(1 + q_i^{2k}\big),
\]
and therefore
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \tfrac12\max_i\big(1 - (1-q_i^{2})p_i\big)^{k}.
\]
Now let us take the minimum of the right-hand side over the parameters $p_i$, subject to $\sum_{i=1}^{n}p_i = 1$. The solution of this minimization is clearly
\[
p_i = \big(1-q_i^{2}\big)^{-1}\Big/\sum_{j=1}^{n}\big(1-q_j^{2}\big)^{-1},
\]
which yields
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \tfrac12\Big(1 - 1\Big/\sum_{j=1}^{n}\big(1-q_j^{2}\big)^{-1}\Big)^{k}.
\]
Since
\[
\big(1-q_j^{2}\big)^{-1} = \tfrac14\big(\kappa_j^{1/2} + 2 + \kappa_j^{-1/2}\big) \ge \tfrac14\big(\kappa_j^{1/2} + 2\big),
\]
we have $\sum_{j=1}^{n}(1-q_j^{2})^{-1} \ge \tfrac14\big(\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\big)$, and hence
\[
\mathbb{E}\|x_k - x^{*}\|^{2}\big/\|x_0 - x^{*}\|^{2} \ge \tfrac12\Big(1 - 4\Big/\Big(\textstyle\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\Big)\Big)^{k}.
\]
Hence the complexity $I(\epsilon)$ satisfies
\[
I(\epsilon) \ge \ln(2\epsilon)\Big/\ln\Big(1 - 4\Big/\Big(\textstyle\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\Big)\Big)
= \tfrac14\Big(\textstyle\sum_{j=1}^{n}\kappa_j^{1/2} + 2n\Big)\big(1 + o(1)\big)\ln\big(1/(2\epsilon)\big).
\]
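The construction above is easy to verify numerically. The sketch below builds $A_i$ and $f_i$ for a single block and checks that the minimizer of $f_i$ is indeed $(q_i, q_i^2, \ldots, q_i^b)$ and that the Hessian spectrum lies in $[\sigma, L_i]$. It assumes the scaling $f_i(x) = \frac{L_i - \sigma}{4}\big(\tfrac12\langle x, A_i x\rangle - \langle e_1, x\rangle\big) + \frac{\sigma}{2}\|x\|^2$ written above (the usual constant for this construction); the values of $\sigma$, $L_i$, and $b$ are arbitrary.

```python
import numpy as np

# Verify that the worst-case block function f_i has minimizer (q, q^2, ..., q^b),
# assuming f_i(x) = (L_i - sigma)/4 * (0.5 x^T A_i x - e_1^T x) + sigma/2 ||x||^2
# with A_i tridiagonal (2 on the diagonal, -1 off-diagonal) except A_i[b-1, b-1] = theta_i.
# sigma, L_i, and b below are arbitrary sample values.

sigma, L_i, b = 1.0, 100.0, 12
kappa_i = L_i / sigma
theta_i = (np.sqrt(kappa_i) + 3.0) / (np.sqrt(kappa_i) + 1.0)
q_i = (np.sqrt(kappa_i) - 1.0) / (np.sqrt(kappa_i) + 1.0)

A = 2.0 * np.eye(b) - np.eye(b, k=1) - np.eye(b, k=-1)
A[-1, -1] = theta_i

# grad f_i(x) = (L_i - sigma)/4 * (A x - e_1) + sigma x = 0  is a linear system.
H = (L_i - sigma) / 4.0 * A + sigma * np.eye(b)
rhs = (L_i - sigma) / 4.0 * np.eye(b)[:, 0]
x_star = np.linalg.solve(H, rhs)

assert np.allclose(x_star, q_i ** np.arange(1, b + 1))

# The Hessian H also has eigenvalues between sigma and L_i, as claimed.
eigvals = np.linalg.eigvalsh(H)
assert eigvals[0] >= sigma - 1e-9 and eigvals[-1] <= L_i + 1e-9
print("minimizer matches (q, q^2, ..., q^b); Hessian spectrum in [sigma, L_i]")
```

The modified corner entry $\theta_i$ is exactly what makes the geometric sequence $(q_i, q_i^2, \ldots, q_i^b)$ satisfy the first-order optimality condition in finite dimension.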
E Extensions

As mentioned, a stronger result than Theorem 1 is possible. In the case when $L_i = L$ for all $i$, we can consider a slight modification of the coefficients:
\[
\alpha \triangleq \big(1 + (1+\psi)\sigma^{-1/2}S\big)^{-1}, \tag{E.1}
\]
\[
\beta \triangleq 1 - (1+\psi)^{-1}\sigma^{1/2}S^{-1}, \tag{E.2}
\]
\[
h \triangleq \big(1 + 2\sigma^{1/2}L^{-1/2}\psi\big)^{-1}, \tag{E.3}
\]
with the iteration
\[
y_k = \alpha v_k + (1-\alpha)x_k, \tag{E.4}
\]
\[
x_{k+1} = y_k - \frac{h}{L}\,\nabla_{i_k}f(\hat y_k), \tag{E.5}
\]
\[
v_{k+1} = \beta v_k + (1-\beta)y_k - \sigma^{-1/2}L^{-1/2}\,\nabla_{i_k}f(\hat y_k), \tag{E.6}
\]
for the asynchronicity parameter
\[
\psi = 6\kappa^{1/2}n^{-1}\tau. \tag{E.7}
\]
This leads to the complexity
\[
K(\epsilon) = (1+\psi)\,n\kappa^{1/2}\ln(1/\epsilon). \tag{E.8}
\]
Here there is no restriction on $\psi$ as in Theorem 1, and hence there is no restriction on $\tau$. Assuming $\psi \le 1$ gives optimal complexity to within a constant factor. Notice then that the resulting condition
\[
\tau \le \tfrac16\,n\kappa^{-1/2} \tag{E.9}
\]
essentially matches the one in Theorem 3 of Section 3. While this result is stronger, it increases the complexity of the proof substantially, so in the interests of space and simplicity we do not prove it here.
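To make the constants concrete, the short sketch below evaluates the modified coefficients (E.1)–(E.3), the asynchronicity parameter (E.7), the delay condition (E.9), and the resulting complexity (E.8) for sample values of $n$, $\kappa$, $\tau$, $\sigma$, and $\epsilon$, and compares (E.8) against the lower bound from Section D with equal $\kappa_i$. The constant $6$ in (E.7) is taken from the reconstruction above; all numerical values are illustrative.

```python
import numpy as np

# Evaluate the equal-L_i coefficients (E.1)-(E.3), the asynchronicity parameter
# (E.7), the delay condition (E.9), and the complexity (E.8) for sample values.
# n, kappa, tau, sigma, eps are illustrative; L = sigma * kappa and S = n * sqrt(L).

n, kappa, tau, sigma, eps = 64, 100.0, 1, 1.0, 1e-6
L = sigma * kappa
S = n * np.sqrt(L)

psi = 6.0 * np.sqrt(kappa) * tau / n                        # (E.7)
assert (psi <= 1.0) == (tau <= n / (6.0 * np.sqrt(kappa)))  # (E.9) <=> psi <= 1

alpha = 1.0 / (1.0 + (1.0 + psi) * S / np.sqrt(sigma))      # (E.1)
beta = 1.0 - np.sqrt(sigma) / ((1.0 + psi) * S)             # (E.2)
h = 1.0 / (1.0 + 2.0 * np.sqrt(sigma / L) * psi)            # (E.3)

K_async = (1.0 + psi) * n * np.sqrt(kappa) * np.log(1.0 / eps)   # (E.8)
K_sync = n * np.sqrt(kappa) * np.log(1.0 / eps)                  # psi = 0 baseline
# Lower bound from Section D with equal kappa_i = kappa:
K_lower = 0.25 * (n * np.sqrt(kappa) + 2.0 * n) * np.log(1.0 / (2.0 * eps))

print(f"psi = {psi:.3f}, alpha = {alpha:.3e}, beta = {beta:.6f}, h = {h:.3f}")
print(f"K_async / K_sync  = {K_async / K_sync:.3f}  (= 1 + psi)")
print(f"K_async / K_lower = {K_async / K_lower:.3f}  (constant-factor gap)")
```

With $\psi \le 1$ the asynchronous complexity exceeds the synchronous one by at most the factor $1+\psi \le 2$, and stays within a constant factor of the Section D lower bound, which is the sense in which the extension is optimal.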