# Complexity of Highly Parallel Non-Smooth Convex Optimization

Sébastien Bubeck (Microsoft Research, sebubeck@microsoft.com), Qijia Jiang (Stanford University, qjiang2@stanford.edu), Yin Tat Lee (University of Washington & Microsoft Research, yintat@uw.edu), Yuanzhi Li (Stanford University, yuanzhil@stanford.edu), Aaron Sidford (Stanford University, sidford@stanford.edu)

A landmark result of non-smooth convex optimization is that gradient descent is an optimal algorithm whenever the number of computed gradients is smaller than the dimension $d$. In this paper we study the extension of this result to the parallel optimization setting. Namely we consider optimization algorithms interacting with a highly parallel gradient oracle, that is one that can answer $\mathrm{poly}(d)$ gradient queries in parallel. We show that in this case gradient descent is optimal only up to $\tilde{O}(\sqrt{d})$ rounds of interactions with the oracle. The lower bound improves upon a decades-old construction by Nemirovski which proves optimality only up to $d^{1/3}$ rounds (as recently observed by Balkanski and Singer), and the suboptimality of gradient descent after $\sqrt{d}$ rounds was already observed by Duchi, Bartlett and Wainwright. In the latter regime we propose a new method with improved complexity, which we conjecture to be optimal. The analysis of this new method is based upon a generalized version of the recent results on optimal acceleration for highly smooth convex optimization.

1 Introduction

Much of the research in convex optimization has focused on the oracle model, where an algorithm optimizing some objective function $f : \mathbb{R}^d \to \mathbb{R}$ does so by sequential interaction with, e.g., a gradient oracle (given a query $x \in \mathbb{R}^d$, the oracle returns $\nabla f(x)$), [Nemirovski and Yudin, 1983, Nesterov, 2004, Bubeck, 2015].¹ In the early 1990s, Arkadi Nemirovski introduced the parallel version of this problem [Nemirovski, 1994]: instead of submitting queries one by one sequentially, the algorithm can submit in parallel up to $Q \geq 1$ queries. We refer to the depth of such a parallel algorithm as the number of rounds of interaction with the oracle, and the work as the total number of queries (in particular work $\leq Q \times$ depth). In this paper we study the optimal depth achievable for highly parallel algorithms, namely we consider the regime $Q = \mathrm{poly}(d)$. We focus on non-smooth convex optimization, that is we want to optimize a Lipschitz, convex function $f$ on the unit Euclidean ball.

¹Throughout we assume that $f$ is differentiable, though our results carry over to the case where $f$ is non-differentiable and given by a sub-gradient oracle. This generalization is immediate as our analysis and algorithms are stable under finite-precision arithmetic and convex functions are almost everywhere differentiable.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our key result is a new form of quadratic acceleration: while for purely sequential methods the critical depth at which one can improve upon local search is $\tilde{O}(d)$, we show that in the highly parallel regime the critical depth is $\tilde{O}(\sqrt{d})$.

1.1 Classical optimality results

Classically, when $Q = 1$, it is known that gradient descent's query complexity is order optimal for any target accuracy $\varepsilon$ in the range $[d^{-1/2}, 1]$. More precisely, it is known that the query complexity of gradient descent is $O(1/\varepsilon^2)$, and that for any $\varepsilon$ in this range, and for any algorithm, there exists a Lipschitz and convex function $f$ on which the number of oracle queries the algorithm makes to achieve additive $\varepsilon$ accuracy is $\Omega(1/\varepsilon^2)$.
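To make the $O(1/\varepsilon^2)$ rate concrete, here is a minimal sketch (illustrative code, not from the paper; the test function and constants are our own) of projected subgradient descent on the Euclidean ball. For a convex $L$-Lipschitz $f$ with a minimizer in the radius-$R$ ball, the averaged iterate satisfies $f(\bar{x}) - f^* \leq LR/\sqrt{T}$, so depth $T = O(1/\varepsilon^2)$ suffices for accuracy $\varepsilon$:

```python
import numpy as np

def project_ball(x, R=1.0):
    """Euclidean projection onto the ball of radius R."""
    n = np.linalg.norm(x)
    return x if n <= R else (R / n) * x

def subgradient_method(subgrad, d, T, L=1.0, R=1.0):
    """Projected subgradient method on the radius-R Euclidean ball.

    For convex L-Lipschitz f minimized inside the ball, the average of the
    iterates satisfies f(x_bar) - f* <= L*R/sqrt(T).
    """
    x = np.zeros(d)
    x_bar = np.zeros(d)
    eta = R / (L * np.sqrt(T))  # fixed step size of the classical analysis
    for _ in range(T):
        x_bar += x / T
        x = project_ball(x - eta * subgrad(x), R)
    return x_bar

# Example: f(x) = ||x - a|| is 1-Lipschitz with minimum value 0 at a.
rng = np.random.default_rng(0)
a = rng.standard_normal(16)
a *= 0.5 / np.linalg.norm(a)          # place the minimizer inside the ball
f = lambda x: np.linalg.norm(x - a)
subgrad = lambda x: (x - a) / max(np.linalg.norm(x - a), 1e-12)
x_bar = subgradient_method(subgrad, d=16, T=4000)
assert f(x_bar) <= 1.0 / np.sqrt(4000)  # the O(1/sqrt(T)) guarantee
```

The assertion holds deterministically here: the classical analysis bounds the averaged suboptimality by $\frac{\|x_0 - x^*\|^2}{2\eta T} + \frac{\eta L^2}{2} \leq \frac{LR}{\sqrt{T}}$.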
Furthermore, whenever $\varepsilon$ is smaller than $d^{-1/2}$ there exists a better algorithm (i.e., with smaller depth), namely the center of gravity method, whose depth is $O(d \log(1/\varepsilon))$. Consequently, an alternative formulation of these results is that, for $Q = 1$, gradient descent is order optimal if and only if the depth is smaller than $\tilde{O}(d)$. (See previously cited references for the exact statements.)

1.2 Optimality for highly parallel algorithms

The main result of this paper is to show that in the highly parallel regime ($Q = \mathrm{poly}(d)$), gradient descent is order optimal if and only if the depth is smaller than $\tilde{O}(\sqrt{d})$. Thus one has a "quadratic" improvement over the purely sequential setting in terms of the critical depth at which naive local search becomes suboptimal. The "only if" part of the above statement follows from Duchi et al. [2012], where randomized smoothing with accelerated gradient descent was proposed (henceforth referred to as distributed randomized smoothing [Scaman et al., 2018]), and shown to achieve depth $d^{1/4}/\varepsilon$, which is order better than $1/\varepsilon^2$ exactly when the latter is larger than $\sqrt{d}$. A first key contribution of our work is a matching lower bound showing that, when the depth is smaller than $\tilde{O}(\sqrt{d})$, no significant improvement over gradient descent is possible, i.e. $Q = 1$ and $Q = \mathrm{poly}(d)$ have essentially the same power. Importantly we note that our lower bound applies to randomized algorithms. The previous state-of-the-art lower bound was that gradient descent is optimal up to depth $\tilde{O}(d^{1/3})$ [Balkanski and Singer, 2018]. In fact the construction in the latter paper is exactly the same as the original construction of Nemirovski in [Nemirovski, 1994] (however the final statements are different, as Nemirovski was concerned with an $\ell_\infty$ setting instead of $\ell_2$; see also Diakonikolas and Guzmán [2018] for more results about non-Euclidean setups). A second key contribution of this work is to improve the state-of-the-art complexity of parallel algorithms with depth between $\sqrt{d}$ and $d$.
Improving the depth $d^{1/4}/\varepsilon$ of Duchi et al. [2012] was explicitly mentioned as an open problem by Scaman et al. [2018]. Leveraging the recent higher order acceleration schemes of Gasnikov et al. [2018], Jiang et al. [2018], Bubeck et al. [2018], we propose a new method with depth $d^{1/3}/\varepsilon^{2/3}$. This means that for any value of $\varepsilon$ in the range $[1/d, d^{-1/4}]$ there is an algorithm that is order better than both gradient descent and center of gravity. Moreover we conjecture that the depth $d^{1/3}/\varepsilon^{2/3}$ is in fact optimal for any $\varepsilon$ in this range (as the arguments of Section 2.3 would imply this if a similar argument could be made for $\delta = \varepsilon$, i.e., a smaller walling radius). We leave this question, the optimality of the center of gravity method for small $\varepsilon < 1/\mathrm{poly}(d)$, and the optimal work among optimal-depth algorithms, for future works.

1.3 Related works

Though Nemirovski's prescient work stood alone for decades, more recently the subfield of parallel/distributed optimization is booming, propelled by problems in machine learning, see e.g., [Boyd et al., 2011]. Chief among those problems is how to leverage mini-batches in stochastic gradient descent as efficiently as possible [Dekel et al., 2012]. The literature on this topic is sprawling, see for example [Duchi et al., 2018], which studies the total work achievable in parallel stochastic convex optimization, or [Zhang and Xiao, 2018], where the stochastic assumptions are leveraged to take advantage of second order information. More directly related to our work are [Nemirovski, 1994, Diakonikolas and Guzmán, 2018, Balkanski and Singer, 2018] on the lower bound side (we directly improve upon the result in the latter paper), and [Duchi et al., 2012, Scaman et al., 2018] on the upper bound side (we directly improve upon the depth provided by the algorithms in those works).

2 Lower bound

Fix $\varepsilon > 0$ such that $1/\varepsilon^2 = \tilde{O}(\sqrt{d})$.
In this section we construct a random function $f$ such that, for any deterministic algorithm with depth $O(1/\varepsilon^2)$ and total work $\mathrm{poly}(d)$, the output point $x$ is such that $\mathbb{E}[f(x) - f^*] > \varepsilon$, where the expectation is with respect to the random function $f$, and $f^*$ denotes the minimum value of $f$ on the unit centered Euclidean ball. Note that by the minimax theorem, this implies that for any randomized algorithm there exists a deterministic function such that the same conclusion applies. Formally, we prove the following:

Theorem 1 (Lower Bound) Let $\delta \in (0, 1)$ and $C = 12 + 4 \log_d(Q/\delta)$. Further, assume that $\log(N)\, N \sqrt{C \log(d)/d} \leq \frac{1}{4}$ (i.e., $N \lesssim \sqrt{d}/\log^{3/2}(d)$). Fix a randomized algorithm that queries at most $Q$ points per iteration (both function value and gradient), and that runs for at most $N$ iterations. Then, with probability at least $1 - \delta$, when run on the shielded Nemirovski function $f$ (see Section 2.3 and Section 2.4) one has for any queried point $x$:

$$f(x) - f^* \geq \frac{1}{4\sqrt{N}}.$$

The details of the proof of this theorem are deferred to Appendix A. In the remainder of this section we instead provide a sketch of its proof. We first recall in Section 2.1 why, for purely sequential algorithms, the above statement holds true, and in fact one can even replace $\sqrt{d}$ by $d$ in this case (this construction goes back to [Yudin and Nemirovski, 1976], see also [Nemirovski and Yudin, 1983]). Next, in Section 2.2 we explain Nemirovski [1994]'s construction, which yields a weaker version of the above statement, with $\sqrt{d}$ replaced by $d^{1/3}$ (as rediscovered by [Balkanski and Singer, 2018]). We then explain in Section 2.3 our key new construction, a type of shielding operation. Finally, we conclude the proof sketch in Section 2.4. For the rest of the section we let $v_1, \ldots, v_N$ denote $N$ random orthonormal vectors in $\mathbb{R}^d$ (in particular $N \leq d$), and $x^* = -\frac{1}{\sqrt{N}} \sum_{i=1}^N v_i$. We define the Nemirovski function with parameter $\gamma \geq 0$ by:

$$\mathrm{N}(x) := \max_{i \in [N]} \left( v_i \cdot x - i\gamma \right), \quad \text{so that} \quad \mathrm{N}(x^*) \leq -\frac{1}{\sqrt{N}}. \qquad (1)$$

2.1 The classical argument

We consider here the Nemirovski function with parameter $\gamma = 0$.
Each gradient query reveals a single vector in the collection of the $v_i$, so after $N/2$ iterations one might know, say, $v_1, \ldots, v_{N/2}$, but the rest remain unknown (or in other words they remain random orthonormal vectors in $\mathrm{span}(v_1, \ldots, v_{N/2})^{\perp}$). Thus for any output $x$ that depends on only $N/2$ queries, one has $\mathbb{E}[\mathrm{N}(x)] \geq 0$ (formally this inequality follows from Jensen's inequality and the tower rule). Thus, together with (1), it follows that $\mathbb{E}[\mathrm{N}(x) - \mathrm{N}(x^*)] \geq 1/\sqrt{N}$. In other words the best rate of convergence of sequential methods is $1/\sqrt{N}$, provided that $N \leq d$.

2.2 The basic parallel argument

We consider here the Nemirovski function with parameter $\gamma = \sqrt{C \log(d)/d}$ for some large enough constant $C$ (more precisely that constant $C$ depends on the exponent in the $\mathrm{poly}(d)$ number of allowed queries per round). The key observation is as follows: Imagine that the algorithm has already discovered $v_1, \ldots, v_{i-1}$. Then for any set of $\mathrm{poly}(d)$ queries, with high probability with respect to the random draw of $v_i, \ldots, v_N$, one has that the inner product of any of those vectors with any of the queried points is in $[-\gamma/2, \gamma/2]$ (using both basic concentration of measure on the sphere, and a union bound). Thus the maximum in the definition of $\mathrm{N}$ is attained at some index $j \leq i$. This means that this set of $\mathrm{poly}(d)$ queries can only reveal $v_i$, and not any of the $v_j$, $j > i$. Thus after $N - 1$ rounds we know that with high probability any output $x$ satisfies $\mathrm{N}(x) \geq v_N \cdot x - N\gamma \geq -(N+1)\gamma$ (since $v_N$ is a random direction orthogonal to $\mathrm{span}(v_1, \ldots, v_{N-1})$ and $x$ only depends on $v_1, \ldots, v_{N-1}$). Thus we obtain that the suboptimality gap is at least $\frac{1}{\sqrt{N}} - (N+1)\gamma$. Let us assume that

$$(N+1)\gamma \leq \frac{1}{2\sqrt{N}}, \qquad (2)$$

i.e., $N = \tilde{O}(d^{1/3})$ (since $\gamma = \sqrt{C \log(d)/d}$). Then one has that the best rate of convergence with a highly parallel algorithm is $\Omega(1/\sqrt{N})$ (i.e., the same as with purely sequential methods).
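For intuition, the construction of Section 2.2 can be simulated numerically. The sketch below (illustrative, not from the paper; the dimensions and constants are arbitrary) builds the Nemirovski function with its subgradient oracle, and checks that a whole parallel batch of queries concentrated around an already-revealed direction returns the same subgradient, i.e. a round reveals only one new vector:

```python
import numpy as np

def make_nemirovski(d, N, gamma, rng):
    """Random orthonormal v_1..v_N and an oracle for
    N(x) = max_i (v_i . x - i*gamma)."""
    V = np.linalg.qr(rng.standard_normal((d, N)))[0].T  # rows are orthonormal
    def oracle(x):
        scores = V @ x - gamma * np.arange(1, N + 1)
        i = int(np.argmax(scores))
        return scores[i], V[i]      # value and a subgradient (argmax vector)
    return V, oracle

d, N, C = 512, 8, 12.0
rng = np.random.default_rng(1)
gamma = np.sqrt(C * np.log(d) / d)  # the parameter of Section 2.2
V, oracle = make_nemirovski(d, N, gamma, rng)

# A parallel batch of queries that only "knows" v_1: each query is v_1 plus
# noise, so its inner product with v_2, ..., v_N is O(1/sqrt(d)) << gamma,
# and every returned subgradient is v_1 -- the round reveals one vector.
queries = [V[0] + 0.1 * rng.standard_normal(d) / np.sqrt(d) for _ in range(100)]
revealed = {tuple(np.round(oracle(q)[1], 8)) for q in queries}
assert len(revealed) == 1
```

The $-i\gamma$ tilt is what makes the argmax fall on the smallest index whose vector the queries correlate with, so later vectors stay hidden.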
2.3 The wall function

Our new idea to improve upon Nemirovski's construction is to introduce a new random wall function $W$ (with parameter $\delta > 0$), where the randomness comes from $v_1, \ldots, v_N$. Our new random hard function, which we term the shielded-Nemirovski function, is then defined by:

$$f(x) = \max \{\mathrm{N}(x), W(x)\}.$$

We construct the convex function $W$ so that one can essentially repeat the argument of Section 2.2 with a smaller value of $\gamma$ (the parameter in the Nemirovski function), so that the condition (2) becomes less restrictive and allows to take $N$ as large as $\tilde{O}(\sqrt{d})$. Roughly speaking the wall function will satisfy the following properties:

1. The value of $W$ at $x^*$ is small, namely $W(x^*) \leq -\frac{1}{\sqrt{N}}$.

2. The value of $W$ at "most" vectors $x$ with $\|x\| \geq \delta$ is large, namely $W(x) \geq \mathrm{N}(x)$, and moreover it does not depend on the collection $(v_i)$ (in fact at most points we will have the simple formula $W(x) = 2\|x\|^{1+\alpha}$, for some small $\alpha > 0$ that depends on $\delta$, to be defined later).

The key argument is that, by property 2, one can expect (roughly) that information on the random collection of $v_i$'s can only be obtained by querying points of norm smaller than $\delta$. This means that one can repeat the argument of Section 2.2 with a smaller value of $\gamma$, namely $\gamma = \delta \sqrt{C \log(d)/d}$. In turn the condition (2) now becomes $N = \tilde{O}\big((\sqrt{d}/\delta)^{2/3}\big)$. Due to convexity of $W$, there is a tension between property 1 and 2, so that one cannot take $\delta$ too small. We will show below that it is possible to take $\delta = \sqrt{N/d}$. In turn this means that the argument proves that $1/\sqrt{N}$ is the best possible rate, up to $N = \tilde{O}(\sqrt{d})$.

The above argument is imprecise because the meaning of "most" in property 2 is unclear. A more precise formulation of the required property is as follows:

2′. Let $x = w + z$ with $w \in V_i$ and $z \in V_i^{\perp}$, where $V_i = \mathrm{span}(v_1, \ldots, v_i)$. Assume that $\|z\| \geq \delta$; then the total variation distance between the conditional distribution of $v_{i+1}, \ldots, v_N$ given $\nabla W(x)$ (and $W(x)$) and the unconditional distribution is polynomially small in $d$ with high probability (here high probability is with respect to the realization of $\nabla W(x)$ and $W(x)$, see below for an additional comment about such conditional reasoning). Moreover if the argmax in the definition of $\mathrm{N}(x)$ is attained at some index $> i$, then $W(x) \geq \mathrm{N}(x)$.

Given both property 1 and 2′ it is actually easy to formalize the whole argument. We do so by considering a game between Alice, who is choosing the query points, and Bob, who is choosing the random vectors $v_1, \ldots, v_N$. Moreover, to clarify the reasoning about conditional distributions, Bob will resample the vectors $v_i, \ldots, v_N$ at the beginning of the $i$-th round of interaction, so that one explicitly does not have any information about those vectors given the first $i - 1$ rounds of interaction. Then we argue that with high probability all the oracle answers remain consistent throughout this resampling process. See Appendix A for the details. Next we explain how to build $W$ so as to satisfy property 1 and 2′.

2.4 Building the wall

Let $h(x) = 2\|x\|^{1+\alpha}$ be the basic building block of the wall. Consider the correlation cones:

$$C_i := \left\{ x \in \mathbb{R}^d : |v_i \cdot x| \geq \sqrt{C \log(d)/d}\, \|x\| \right\}.$$

Note that for any fixed query $x$, the probability (with respect to the random draw of $v_i$) that $x$ is in $C_i$ is polynomially small in $d$. We now define the wall $W$ as follows: it is equal to the function $h$ outside of the correlation cones and the ball of radius $\delta$, and it is extended by convexity to the rest of the unit ball. In other words, let $\Omega = \{x \in \mathbb{R}^d : \|x\| \in [\delta, 1] \text{ and } x \notin C_i \text{ for all } i \in [N]\}$, and

$$W(x) := \max_{y \in \Omega} \left\{ h(y) + \nabla h(y) \cdot (x - y) \right\}. \qquad (3)$$

Let us first prove property 1:

Lemma 2 Let $\alpha = 1/\log_2(1/\delta) \leq 1$, and let $\delta$ be such that $\delta/\log_2(1/\delta) = 4\sqrt{C N \log(d)/d}$. Then $W(x^*) \leq -\frac{1}{\sqrt{N}}$.

Proof One has $\nabla h(y) = 2(1+\alpha)\|y\|^{\alpha-1} y$ and thus

$$h(y) + \nabla h(y) \cdot (x^* - y) = -2\alpha \|y\|^{1+\alpha} + 2(1+\alpha)\|y\|^{\alpha-1}\, y \cdot x^*.$$

Moreover for any $y \in \Omega$ one has:

$$|y \cdot x^*| \leq \sqrt{C N \log(d)/d}\, \|y\|.$$

Thus for any $y \in \Omega$ we have:

$$h(y) + \nabla h(y) \cdot (x^* - y) \leq -2\alpha \delta^{1+\alpha} + 2(1+\alpha)\sqrt{C N \log(d)/d}.$$

The proof is straightforwardly concluded by using the values of $\alpha$ and $\delta$.
Next we prove a simple formula for $W(x)$ in the context of property 2′. More precisely we assume that $x = w + z$ with $w \in V_i$ and $z \in V_i^{\perp}$, with $z \notin C_j$ for any $j > i$. Note that for any fixed $z$, the latter condition happens with high probability with respect to the random draw of $v_{i+1}, \ldots, v_N$.

Lemma 3 Let $x = w + z$ with $w \in V_i$ and $z \in V_i^{\perp}$, with $z \notin C_j$ for any $j > i$. Then one has:

$$W(x) = \max_{a, b \in \mathbb{R}_+ :\, a^2 + b^2 \in [\delta^2, 1]} \left\{ -2\alpha (a^2+b^2)^{\frac{1+\alpha}{2}} + 2(1+\alpha)(a^2+b^2)^{\frac{\alpha-1}{2}} \left( \max_{y \in \tilde{\Omega} :\, \|y\| = a} y \cdot w + b\|z\| \right) \right\},$$

where $\tilde{\Omega} = \{x \in V_i : x \notin C_j \text{ for all } j \in [i]\}$ and we use the convention that the maximum of an empty set is $-\infty$.

Proof Recall (3), and let us optimize over $y \in \Omega$ subject to $\|P_{V_i} y\| = a$ and $\|P_{V_i^{\perp}} y\| = b$ for some $a, b$ such that $a^2 + b^2 \in [\delta^2, 1]$. Note that in fact there is an upper bound constraint on $a$ for such a $y$ to exist (for if the projection of $y$ onto $V_i$ is large, then necessarily $y$ must be in one of the correlation cones), which we can ignore thanks to the convention choice for the maximum of an empty set. Thus the only calculation we have to do is to verify that:

$$\max_{y \in \Omega :\, \|P_{V_i} y\| = a \text{ and } \|P_{V_i^{\perp}} y\| = b} y \cdot x = \max_{y \in \tilde{\Omega},\, \|y\| = a} y \cdot w + b\|z\|.$$

Note that $y \cdot x = P_{V_i} y \cdot w + P_{V_i^{\perp}} y \cdot z$. Thus the right hand side is clearly an upper bound on the left hand side (note that $P_{V_i} y \in \tilde{\Omega}$). To see that it is also a lower bound take $y = y_0 + b \frac{z}{\|z\|}$ for some arbitrary $y_0 \in \tilde{\Omega}$ with $\|y_0\| = a$, and note that $y \in \Omega$ (in particular using the assumption on $z$) with $\|P_{V_i} y\| = a$ and $\|P_{V_i^{\perp}} y\| = b$.

The key point of the formula given in Lemma 3 is that it does not depend on $v_{i+1}, \ldots, v_N$. Thus when the algorithm queries the point $x$ and obtains the above value for $W(x)$ (and the corresponding gradient), the only information that it obtains is that $z \notin C_j$ for any $j > i$. Since the latter condition holds with high probability, the algorithm essentially learns nothing (more precisely the conditional distribution of $v_{i+1}, \ldots, v_N$ only changes by $1/\mathrm{poly}(d)$ compared to the unconditional distribution).
Thus to complete the proof of property 2′ it only remains to show that if $\|z\| \geq \delta$ and the argmax in $\mathrm{N}(x)$ is attained at an index $> i$, then the formula in Lemma 3 is larger than $\mathrm{N}(x)$. By taking $a = 0$ and $b = \delta$ one obtains that this formula is at least (using also the values assigned to $\alpha$ and $\delta$ in Lemma 2):

$$-2\alpha \delta^{1+\alpha} + 2(1+\alpha)\delta^{\alpha-1} \cdot \delta\|z\| = -\alpha\delta + (1+\alpha)\|z\| \geq \|z\|.$$

On the other hand one has (by assumption that the argmax index is $> i$):

$$\mathrm{N}(x) = \max_{j > i} \{ v_j \cdot x - j\gamma \} \leq \|z\|.$$

This concludes the proof of property 2′, and in turn concludes the proof sketch of our lower bound.

3 Upper bound

Here we present our highly parallel optimization procedure. Throughout this section we let $f : \mathbb{R}^d \to \mathbb{R}$ denote a differentiable $L$-Lipschitz function that attains its minimum value at $x^* \in \mathbb{R}^d$ with $\|x^*\|_2 \leq R$. The main result of this section is the following theorem, which provides an $\tilde{O}(d^{1/3}/\varepsilon^{2/3})$-depth highly-parallel algorithm that computes an $\varepsilon$-optimal point with high probability.

Theorem 4 (Highly Parallel Function Minimization) There is a randomized highly-parallel algorithm which, given any differentiable $L$-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ minimized at $x^*$ with $\|x^*\| \leq R$, computes with probability $1 - \delta$ a point $x \in \mathbb{R}^d$ with $f(x) - f(x^*) \leq \varepsilon$ in depth $\tilde{O}(d^{1/3}(LR/\varepsilon)^{2/3})$ and work $\tilde{O}(d^{4/3}(LR/\varepsilon)^{8/3})$, where $\tilde{O}(\cdot)$ hides factors polylogarithmic in $d$, $\varepsilon$, $L$, $R$, and $1/\delta$.

Our starting point for obtaining this result are the $O(d^{1/4}/\varepsilon)$-depth highly parallel algorithms of [Duchi et al., 2012]. This paper considers the convolution of $f$ with simple functions, e.g. Gaussians and uniform distributions, and shows this preserves the convexity and continuity of $f$ while improving the smoothness, and thereby enables methods like accelerated gradient descent (AGD) to run efficiently. Since the convolved function can be accessed efficiently in parallel by random sampling, working with the convolved function is comparable to working with the original function in terms of query depth (up to the sampling error).
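The parallel sampling access to the convolved function can be sketched as follows (illustrative code, not from the paper; the test function and sample counts are our own). For $g = f * \gamma_r$ one has $\nabla g(x) = \mathbb{E}_{\xi \sim \mathcal{N}(0, I)}[\nabla f(x + r\xi)]$, so one round of $Q$ parallel gradient queries at Gaussian perturbations of $x$ yields an unbiased estimate of $\nabla g(x)$:

```python
import numpy as np

def smoothed_grad(grad_f, x, r, Q, rng):
    """One parallel round: Q gradient queries at Gaussian perturbations of x
    give an unbiased Monte-Carlo estimate of grad g(x), g = f * N(0, r^2 I)."""
    xi = rng.standard_normal((Q, x.size))
    return np.mean([grad_f(x + r * z) for z in xi], axis=0)

# f(x) = ||x||_1 is non-smooth at 0, but its Gaussian smoothing g is smooth
# and has grad g(0) = 0 by symmetry; the estimate should be close to 0.
rng = np.random.default_rng(2)
grad_f = np.sign                      # a subgradient of the l1 norm
est = smoothed_grad(grad_f, np.zeros(32), r=0.1, Q=20000, rng=rng)
assert np.linalg.norm(est) < 0.1
```

Each coordinate of the estimate has standard deviation $1/\sqrt{Q}$ here, so the error shrinks with the amount of parallel work per round rather than with depth.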
Consequently, the paper achieves its depth bound by trading off the error induced by convolution with the depth improvements gained from stochastic variants of AGD. To improve upon this bound, we apply a similar approach of working with the convolution of $f$ with a Gaussian. However, instead of applying standard stochastic AGD, we consider accelerated methods which build a more sophisticated model of the convolved function in parallel. Instead of using random sampling to approximate only the gradient of the convolved function, we obtain our improvements by using random sampling to glean more local information with each highly-parallel query, and then use this to minimize the convolved function at an accelerated rate. To enable the use of these more sophisticated models we develop a general acceleration framework that allows us to turn any subroutine for approximate minimization of a local model / approximate gradient computation into an accelerated minimization scheme. We believe this framework is of independent interest, as we show that we can analyze the performance of this method just in terms of simple quantities regarding the local model. This framework is discussed in Section 3.1 and in Appendix C, where we show how it generalizes multiple previous results on near-optimal acceleration. Using this framework, proving Theorem 4 reduces to showing that we can minimize high quality local models of the convolved function. Interestingly, it is possible to nearly obtain this result by simply using random sampling to estimate all derivatives up to some order $k$ and then using these to minimize a regularized $k$-th order Taylor approximation of the function. Near-optimal convergence for such methods under Lipschitz bounds on the $k$-th derivatives was recently given by [Gasnikov et al., 2018, Jiang et al., 2018, Bubeck et al., 2018] (and follows from our framework).
This approach can be shown to give a highly-parallel algorithm of depth $\tilde{O}(d^{1/3+c}/\varepsilon^{2/3})$ for any $c > 0$ (with an appropriately large $k$). Unfortunately, the work of these methods is $O(d^{\mathrm{poly}(1/c)})$ and expensive for small $c$. To overcome this limitation, we leverage the full power of our acceleration framework and instead show that we can randomly sample to build a model of the convolved function accurate within a ball of sufficiently large radius. In Section 3.2 we bound this quality of approximation and show that this local model can be optimized to sufficient accuracy efficiently. By combining this result with our framework we prove Theorem 4. We believe this demonstrates the utility of our general acceleration scheme and we plan to further explore its implications in future work.

3.1 Acceleration framework

Here we provide a general framework for accelerated convex function minimization. Throughout this section we assume that there is a twice-differentiable convex function $g : \mathbb{R}^d \to \mathbb{R}$ given by an approximate proximal step oracle and an approximate gradient oracle, defined as follows.

Definition 5 (Approximate Proximal Step Oracle) Let $\omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function, $\delta \geq 0$, and $\sigma \in [0, 1)$. We call $T_{\mathrm{prox}}$ a $(\sigma, \delta)$-approximate $\omega$-proximal step oracle for $g : \mathbb{R}^d \to \mathbb{R}$ if, for all $x \in \mathbb{R}^d$, when queried at $x$ the oracle returns $y = T_{\mathrm{prox}}(x)$ such that

$$\|\nabla g(y) + \omega(\|y - x\|)(y - x)\| \leq \sigma\, \omega(\|y - x\|)\|y - x\| + \delta. \qquad (4)$$

Definition 6 (Approximate Gradient Oracle) We call $T_{\mathrm{grad}}$ a $\delta$-approximate gradient oracle for $g : \mathbb{R}^d \to \mathbb{R}$ if when queried at $x \in \mathbb{R}^d$ the oracle returns $v = T_{\mathrm{grad}}(x)$ such that $\|v - \nabla g(x)\| \leq \delta$.

We show that there is an efficient accelerated optimization algorithm for minimizing $g$ using only these oracles. Its performance is encapsulated by the following theorem.

Theorem 7 (Acceleration Framework) Let $g : \mathbb{R}^d \to \mathbb{R}$ be a convex twice-differentiable function minimized at $x^*$ with $\|x^*\| \leq R$, and let $\varepsilon > 0$, $\sigma \in [0, 1)$, and $\gamma \geq 1$ be such that $128\sigma\gamma^2 \leq 1$. Further, let $\omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a monotonically increasing continuously differentiable function with $0 < \omega'(s) \leq \gamma\, \omega(s)/s$ for all $s > 0$. There is an algorithm which for all $k$ computes a point $y_k$ with

$$g(y_k) - g(x^*) \leq \max\left\{ \varepsilon,\; \frac{10^{20}\gamma^6 R^2\, \omega(10^5\gamma^2 R\, k^{-3/2})}{k^2} \right\}$$

using $k\left(6 + \log_2\left[10^{20}\gamma^6 R^2\, \omega(10^5\gamma^2 R)\, \varepsilon^{-1}\right]\right)^2$ queries to a $(\sigma, \delta)$-approximate $\omega$-proximal step oracle for $g$ and a $\delta$-approximate gradient oracle for $g$, provided that both $\delta \leq \varepsilon/[10^{20}\gamma^2 R]$ and $\varepsilon \leq 10^{20}\gamma^4 R^3 \omega(80R)$.

This theorem generalizes multiple accelerated methods (up to polylogarithmic factors) and sheds light on the rates of these methods (see Appendix C for applications). For example, choosing $\omega(s) \equiv L$ and $T_{\mathrm{prox}}(x) = x - \frac{1}{L}\nabla g(x)$ recovers standard accelerated minimization of $L$-smooth functions; choosing $\omega(s) \equiv L$ and $T_{\mathrm{prox}}(x) \approx \arg\min_y g(y) + \frac{L}{2}\|y - x\|^2$ recovers a variant of approximate proximal point [Frostig et al., 2015] and Catalyst [Lin et al., 2015]; and choosing $\omega(s) := \frac{(p+1)L_p s^{p-1}}{p!}$ and $T_{\mathrm{prox}}(x) = \arg\min_y g_p(y; x) + \frac{L_p}{p!}\|y - x\|^{p+1}$, where $g_p(y; x)$ is the value of the $p$-th order Taylor approximation of $g$ about $x$ evaluated at $y$, recovers highly smooth function minimization [Monteiro and Svaiter, 2013, Gasnikov et al., 2018, Jiang et al., 2018, Bubeck et al., 2018].

We prove Theorem 7 by generalizing an acceleration framework due to [Monteiro and Svaiter, 2013]. This framework was recently used by several results to obtain near-optimal query complexities for minimizing highly smooth convex functions [Gasnikov et al., 2018, Jiang et al., 2018, Bubeck et al., 2018]. In Section B.1 we provide a variant of this general framework that is amenable to the noise induced by our oracles. In Section B.2 we show how to instantiate our framework using the oracles, assuming a particular type of line search can be performed. In Section B.3 we then prove Theorem 7. The algorithm for and analysis of the line search is deferred to Appendix E.
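To make Definition 5 concrete, a single gradient step on a smooth function already implements an approximate proximal step oracle. This particular instantiation is our own illustration (chosen so that the contraction parameter is strictly below 1, with the constant function $\omega \equiv 2L$ rather than one of the choices discussed above): taking $T_{\mathrm{prox}}(x) = x - \frac{1}{2L}\nabla g(x)$ gives $\nabla g(y) + 2L(y - x) = \nabla g(y) - \nabla g(x)$, whose norm is at most $L\|y - x\| = \frac{1}{2}\omega(\|y-x\|)\|y - x\|$, so inequality (4) holds with first parameter $\frac{1}{2}$ and additive error $0$. A numerical check on a random quadratic:

```python
import numpy as np

# Sanity check of Definition 5 on a random quadratic g(x) = 0.5 x'Ax:
# with omega(s) == 2L and T_prox(x) = x - grad g(x)/(2L), one has
#   grad g(y) + 2L(y - x) = grad g(y) - grad g(x),
# whose norm is <= L||y - x||, i.e. inequality (4) holds with
# contraction parameter 1/2 and additive error delta = 0.
rng = np.random.default_rng(3)
M = rng.standard_normal((20, 20))
A = M @ M.T / 20.0                          # random PSD Hessian
L = float(np.linalg.eigvalsh(A).max())      # smoothness constant of g
grad = lambda x: A @ x

for _ in range(100):
    x = rng.standard_normal(20)
    y = x - grad(x) / (2 * L)               # candidate proximal step
    lhs = np.linalg.norm(grad(y) + 2 * L * (y - x))
    rhs = 0.5 * (2 * L) * np.linalg.norm(y - x)  # sigma * omega * ||y-x||
    assert lhs <= rhs + 1e-9
```

Here the check passes deterministically since $\|A(y - x)\| \leq \lambda_{\max}(A)\,\|y - x\|$ for any positive semi-definite $A$.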
3.2 Highly parallel optimization

With Theorem 7 in hand, to obtain our result we need to provide, for an appropriate function $\omega$, a highly parallel implementation of an approximate proximal step oracle and an approximate gradient oracle for a function that is an $O(\varepsilon)$ additive approximation of $f$. As in previous work [Duchi et al., 2012, Scaman et al., 2018] we consider the convolution of $f$ with a Gaussian of covariance $r^2 I_d$, for an $r > 0$ we will tune later. Formally, we define $g : \mathbb{R}^d \to \mathbb{R}$ for all $x \in \mathbb{R}^d$ as

$$g(x) := \int_{\mathbb{R}^d} \gamma_r(y) f(x - y)\, dy, \quad \text{where} \quad \gamma_r(x) := (2\pi r^2)^{-d/2} \exp\left(-\frac{\|x\|^2}{2r^2}\right).$$

It is straightforward to prove (see Section D.1) the following standard facts regarding $g$.

Lemma 8 The function $g$ is convex, $L$-Lipschitz, and satisfies both $|g(y) - f(y)| \leq \sqrt{d}\, L r$ and $\nabla^2 g(y) \preceq (L/r) I_d$ for all $y \in \mathbb{R}^d$.

Consequently, to minimize $f$ up to $\varepsilon$ error, it suffices to minimize $g$ to $O(\varepsilon)$ error with $r = O\!\left(\frac{\varepsilon}{\sqrt{d}\, L}\right)$. In the remainder of this section we simply show how to provide highly parallel implementations of the requisite oracles to achieve this by Theorem 7.

Now, as we have discussed, one way we could achieve this goal would be to use random sampling to approximate (in parallel) the $k$-th order Taylor approximation to $g$ and minimize a regularization of this function to implement the approximate proximal step oracle. While this procedure is depth-efficient, its work is quite large. Instead, we provide a more work-efficient application of our acceleration framework. To implement a query to the oracles at some point $c \in \mathbb{R}^d$, we instead simply take multiple samples from $\gamma_r(x - c)$, i.e., the normal distribution with covariance $r^2 I_d$ and mean $c$, and use these samples to build an approximation to the gradient field of $g$. The algorithm for this procedure is given by Algorithm 1, which carefully combines the gradients of the sampled points to build a model with small bias and variance. By a concentration bound and an $\varepsilon$-net argument, we can show that Algorithm 1 outputs a vector field $v : \mathbb{R}^d \to \mathbb{R}^d$ that is a uniform approximation of $\nabla g$ within a small ball (see Section D.2 for the proof).

Algorithm 1: Compute vector field approximating $\nabla g$
1 Input: number of samples $N$, radius $r > 0$, error parameter $\alpha \in (0, 1)$, center $c \in \mathbb{R}^d$.
2 Sample $x_1, x_2, \ldots, x_N$ independently from $\gamma_r(x - c)$.
3 return $v : \mathbb{R}^d \to \mathbb{R}^d$ defined for all $y \in \mathbb{R}^d$ by

$$v(y) := \frac{1}{N} \sum_{i=1}^N \frac{\gamma_r(y - x_i)}{\gamma_r(c - x_i)}\, \nabla f(x_i)\, \chi\big((x_i - c)^{\top}(y - c)\big)\, \mathbb{1}\{\|x_i - c\| \leq 2\sqrt{d}\, r\},$$

where $\chi(t) := 1$ if $|t| \leq \alpha r^2$, $\chi(t) := 0$ if $|t| \geq 2\alpha r^2$, and $\chi(t) := 2 - \frac{|t|}{\alpha r^2}$ otherwise.

Lemma 9 (Uniform Approximation) Algorithm 1 outputs a vector field $v : \mathbb{R}^d \to \mathbb{R}^d$ such that for any $\delta \in (0, \frac{1}{2})$, with probability at least $1 - \delta$, the following holds:

$$\max_{y :\, \|y - c\| \leq \frac{\alpha}{4\sqrt{d}} r} \|v(y) - \nabla g(y)\| \leq 5L\left(e^{-d} + \sqrt{\frac{2d\log(9d) + \log(1/\delta)}{N}}\right).$$

Consequently, for any $\varepsilon \in [0, 1]$, taking $N = O([d\log(d)\log(\frac{1}{\varepsilon}) + \log(\frac{1}{\delta})]\varepsilon^{-2})$ and a suitable $\alpha$ yields that $\max_{y :\, \|y - c\| \leq \tilde{r}} \|v(y) - \nabla g(y)\| \leq L\varepsilon$, where $\tilde{r} = r/\sqrt{\log(1/\varepsilon)}$.

This lemma immediately yields that we can use Algorithm 1 to implement a highly-parallel approximate gradient oracle for $g$. Interestingly, it can also be leveraged to implement a highly-parallel approximate proximal step oracle. Formally, we show how to use it to find $y$ such that

$$\nabla g(y) + \omega(\|y - x\|)(y - x) \approx 0, \quad \text{where} \quad \omega(s) := \frac{L s^p}{\tilde{r}^{\,p+1}} \qquad (5)$$

for some $p$ to be determined later. Ignoring logarithmic factors, and supposing for simplicity that $L, R \leq 1$, Theorem 7 shows that by invoking this procedure $\tilde{O}(k)$ times we could achieve function error on the order of $\omega(1/k^{3/2})/k^2 \approx \tilde{r}^{-(p+1)} k^{-(3p+4)/2}$, and therefore achieve error $\varepsilon$ with $k \approx \tilde{r}^{-\frac{2(p+1)}{3p+4}} \varepsilon^{-\frac{2}{3p+4}}$; the desired result follows by setting $p$ to be polylogarithmic in the problem parameters. Consequently, we simply need to find $y$ satisfying (5). The algorithm that achieves this is Algorithm 2, which essentially performs gradient descent on $g(y) + \Phi(\|y - c\|)$, where

$$\Phi(s) = \int_0^s \omega(t)\, t\, dt. \qquad (6)$$

The performance of this algorithm is given by Theorem 24 in Section D.3. Combined with all of the above it proves Theorem 4; see Section D.4 for the details.
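The reweighting idea behind Algorithm 1 can be sketched as follows (a simplified illustration of ours, not the paper's exact estimator: the truncation $\chi$ and the norm indicator, which control the variance of the importance weights, are omitted). Since $\nabla g(y) = \mathbb{E}_{x \sim \gamma_r(\cdot - c)}\big[\frac{\gamma_r(y - x)}{\gamma_r(c - x)} \nabla f(x)\big]$, a single batch of samples around $c$ yields a model of the whole gradient field near $c$:

```python
import numpy as np

def vector_field(grad_f, c, r, N, rng):
    """Monte-Carlo model of the gradient field of g = f * N(0, r^2 I) near c,
    built from one batch of N samples via the importance weights
    gamma_r(y - x_i) / gamma_r(c - x_i).  (The truncation chi and the norm
    indicator of Algorithm 1, which control the variance, are omitted.)"""
    xs = c + r * rng.standard_normal((N, c.size))   # x_i ~ gamma_r(. - c)
    gs = np.array([grad_f(x) for x in xs])
    def v(y):
        # log of gamma_r(y - x_i)/gamma_r(c - x_i), computed stably
        logw = (np.sum((xs - c) ** 2, axis=1)
                - np.sum((xs - y) ** 2, axis=1)) / (2 * r * r)
        return (np.exp(logw)[:, None] * gs).mean(axis=0)
    return v

# Check on a linear f(x) = b.x, for which g = f + const and grad g = b.
rng = np.random.default_rng(4)
b = rng.standard_normal(8)
v = vector_field(lambda x: b, np.zeros(8), r=0.5, N=4000, rng=rng)
assert np.linalg.norm(v(0.05 * np.ones(8)) - b) < 0.2
```

The importance weights are unbiased for any $y$, but their variance grows with $\|y - c\|/r$, which is why the model is only accurate within a small ball around $c$.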
Algorithm 2: Approximate minimization of $g(y) + \Phi(\|y - c\|)$
1 Input: center $c \in \mathbb{R}^d$, accuracy $\varepsilon$, inner radius $\tilde{r} = r/\sqrt{\log(1/\varepsilon)}$, and step size $h = \tilde{r}/(48\sqrt{d})$.
2 Use Algorithm 1 to find a vector field $v$ such that $\max_{y :\, \|y - c\| \leq \tilde{r}} \|v(y) - \nabla g(y)\| \leq L\varepsilon$.
3 Set $y = c$.
4 for $i = 1, 2, \ldots$ do
5   $\delta_y = v(y) + \omega(\|y - c\|)(y - c)$, where $\omega$ is defined by (5) with $p \geq 1$.
6   if $\|\delta_y\| \leq \frac{5L\varepsilon}{6}$ then return $y$ else $y = y - h\, \delta_y$

Acknowledgments

The authors thank the anonymous reviewers for their helpful feedback in preparing this final version. Further, the authors are grateful for multiple funding sources which supported this work in part, including NSF Awards CCF-1740551, CCF-1749609, and DMS-1839116 and NSF CAREER Award CCF-1844855.

References

E. Balkanski and Y. Singer. Parallelization does not accelerate convex optimization: Adaptivity lower bounds for non-smooth convex minimization. arXiv preprint arXiv:1808.03880, 2018.

K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, pages 1-58. Cambridge University Press, 1997.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Near-optimal method for highly smooth convex optimization. arXiv preprint arXiv:1812.08026, 2018.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.

J. Diakonikolas and C. Guzmán. Lower bounds for parallel and randomized convex optimization. arXiv preprint arXiv:1811.01903, 2018.

J. Duchi, P. Bartlett, and M. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, 2012.

J. Duchi, F. Ruan, and C. Yun. Minimax bounds on stochastic batched convex optimization. In Proceedings of the 31st Conference On Learning Theory (COLT), volume 75 of Proceedings of Machine Learning Research, pages 3065-3162, 2018.

R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

A. Gasnikov, E. Gorbunov, D. Kovalev, A. Mohhamed, and E. Chernousova. The global rate of convergence for optimal tensor methods in smooth convex optimization. arXiv preprint arXiv:1809.00382, 2018.

B. Jiang, H. Wang, and S. Zhang. An optimal high-order tensor method for convex optimization. arXiv preprint arXiv:1812.06557, 2018.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302-1338, 2000. doi: 10.1214/aos/1015957395.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28 (NIPS), pages 3384-3392, 2015.

R. D. C. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092-1125, 2013.

A. Nemirovski. On parallel complexity of nonsmooth convex optimization. Journal of Complexity, 10(4):451-463, 1994.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Annals of Probability, 22(4):1679-1706, 1994. doi: 10.1214/aop/1176988477.

K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems 31, pages 2740-2749, 2018.

D. Yudin and A. Nemirovski. Estimating the computational complexity of mathematical-programming problems. Ekonomika i Matematicheskie Metody, 12(1):128-142, 1976. In Russian.

Y. Zhang and L. Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. In Large-Scale and Distributed Optimization, pages 289-341. Springer, 2018.