# Acceleration with a Ball Optimization Oracle

Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, Kevin Tian

Stanford University, {yairc,jmblpati,qjiang2,yujiajin,sidford,kjtian}@stanford.edu. Tel Aviv University, ycarmon@cs.tau.ac.il. University of Washington, yintat@uw.edu.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** Consider an oracle which takes a point $x$ and returns the minimizer of a convex function $f$ in an $\ell_2$ ball of radius $r$ around $x$. It is straightforward to show that roughly $r^{-1}\log\frac{1}{\epsilon}$ calls to the oracle suffice to find an $\epsilon$-approximate minimizer of $f$ in an $\ell_2$ unit ball. Perhaps surprisingly, this is not optimal: we design an accelerated algorithm which attains an $\epsilon$-approximate minimizer with roughly $r^{-2/3}\log\frac{1}{\epsilon}$ oracle queries, and give a matching lower bound. Further, we implement ball optimization oracles for functions with locally stable Hessians using a variant of Newton's method and, in certain cases, stochastic first-order methods. The resulting algorithm applies to a number of problems of practical and theoretical import, improving upon previous results for logistic and $\ell_\infty$ regression and achieving guarantees comparable to the state-of-the-art for $\ell_p$ regression.

## 1 Introduction

We study unconstrained minimization of a smooth convex objective $f : \mathbb{R}^d \to \mathbb{R}$, which we access through a ball optimization oracle $\mathcal{O}_{\mathrm{ball}}$ that, when queried at any point $x$, returns the minimizer of $f$ restricted to a ball of radius $r$ around $x$, i.e.,¹
$$ \mathcal{O}_{\mathrm{ball}}(x) = \arg\min_{x' \,:\, \|x' - x\| \le r} f(x'). $$
Such oracles underlie trust-region methods [15] and, as we demonstrate via applications, encapsulate problems with local stability. Iterating $x_{k+1} \leftarrow \mathcal{O}_{\mathrm{ball}}(x_k)$ minimizes $f$ in $\widetilde{O}(R/r)$ iterations (see Appendix A), where $R$ is the initial distance to the minimizer $x^\star$, and $\widetilde{O}(\cdot)$ hides polylogarithmic factors in problem parameters, including the desired accuracy.

¹ In the introduction we discuss exact oracles for simplicity, but our results account for inexactness. Our results hold for any weighted Euclidean (semi)norm, i.e., $\|x\| = \sqrt{x^\top M x}$ for $M \succeq 0$, which we sometimes write explicitly as $\|x\|_M$.

Given the fundamental geometric nature of the ball optimization abstraction, the central question motivating our work is whether it is possible to improve upon this $\widetilde{O}(R/r)$ query complexity. It is natural to conjecture that the answer is negative: we require $R/r$ oracle calls to observe the entire line from $x_0$ to the optimum, and therefore finding a solution using fewer queries would require jumping into completely unobserved regions. Nevertheless, we prove that the optimal query complexity scales as $(R/r)^{2/3}$. This result has positive implications for the complexity of several key regression tasks, for which we can efficiently implement the ball optimization oracles.

### 1.1 Our contributions

We overview our main contributions: accelerating ball optimization oracles (with a matching lower bound), implementing them under Hessian stability, and applying our results to regression problems.

**Monteiro-Svaiter (MS) oracles via ball optimization.** Our starting point is an acceleration framework due to Monteiro and Svaiter [25]. It relies on access to an oracle that, when queried with $x, v \in \mathbb{R}^d$ and $A > 0$, returns points $x_+, y \in \mathbb{R}^d$ and $\lambda > 0$ such that
$$ y = \frac{A}{A + a_\lambda}\, x + \frac{a_\lambda}{A + a_\lambda}\, v, \qquad (1) $$
$$ x_+ \approx \arg\min_{x' \in \mathbb{R}^d} \Big\{ f(x') + \frac{1}{2\lambda}\|x' - y\|^2 \Big\}, \qquad (2) $$
where $a_\lambda = \frac{1}{2}\big(\lambda + \sqrt{\lambda^2 + 4A\lambda}\big)$.
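The quantities in (1) are simple to compute; the following is a minimal Python sketch (illustrative only, taking $M = I$ so that the norm is the standard Euclidean one, and using the expression for $a_\lambda$ given above):

```python
import numpy as np

def ms_combination(A, x, v, lam):
    """Compute a_lambda and the point y from equation (1).

    a_lambda is the positive root of a^2 = lam * (A + a), i.e.
    a_lambda = (lam + sqrt(lam^2 + 4 * A * lam)) / 2.
    """
    a = 0.5 * (lam + np.sqrt(lam ** 2 + 4 * A * lam))
    t = A / (A + a)          # weight on x; (1 - t) is the weight on v
    y = t * x + (1 - t) * v
    return a, y
```

Given such a pair $(y, \lambda)$, the remaining work of the MS oracle is the proximal step (2), which is where the ball optimization oracle enters.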
Basic calculus shows that for any $z$, the radius-$r$ oracle response $\mathcal{O}_{\mathrm{ball}}(z)$ solves the proximal point problem (2) for $y = z$ and some $\lambda = \lambda_r(z) \ge 0$ which depends on $r$ and $z$. Therefore, to implement the MS oracle with a ball optimization oracle, given query $(x, v, A)$ we need to find $\lambda$ that solves the implicit equation $\lambda = \lambda_r(y(\lambda))$, with $y(\lambda)$ as in (1). We solve this equation to sufficient accuracy via binary search over $\lambda$, resulting in an accelerated scheme that makes $\widetilde{O}(1)$ queries to $\mathcal{O}_{\mathrm{ball}}(\cdot)$ per iteration (each iteration also requires a gradient evaluation).

The main challenge lies in proving that our MS oracle implementation guarantees rapid convergence. We do so by a careful analysis which relates convergence to the distance between the MS oracle outputs $y$ and $x_+$. Specifically, letting $\{y_k, x_{k+1}\}$ be the sequence of these points, we prove that
$$ f(x_K) - f(x^\star) \le \big(f(x_0) - f(x^\star)\big)\, \exp\Big(-\Omega(K) \cdot \min_{k < K} \min\Big\{1,\ \Big(\frac{\|x_{k+1} - y_k\|}{R}\Big)^{2/3}\Big\}\Big). $$
In particular, when every oracle response lies at distance $r$ from its query point, $\widetilde{O}((R/r)^{2/3})$ iterations suffice.

**Ball optimization oracles for Hessian-stable functions.** For functions whose Hessian is multiplicatively stable (within a factor $c$) over balls of radius $r$, we implement the ball optimization oracle by solving a sequence of trust-region subproblems (see Section 3). We analyze Nesterov's accelerated gradient method in a Euclidean norm weighted by the Hessian at the ball center, which we can also view as accelerated Newton steps, and show that it implements the oracle in $\widetilde{O}(c)$ linear system solutions, improving upon the $c^2$ dependence of more naive methods. This improvement is not necessary for our applications, where we take $c$ to be a constant, but we include it for completeness. For certain objectives (e.g., softmax), we show that a first-order oracle implementation (e.g., computing the Newton steps with accelerated SVRG) allows us to further exploit the problem structure, and improve state-of-the-art runtime guarantees in some regimes.

**Applications.** We apply our implementation and acceleration of ball optimization oracles to problems of the form $f(Ax - b)$ for a data matrix $A \in \mathbb{R}^{n \times d}$. For logistic regression, where $f(z) = \sum_{i \in [n]} \log(1 + e^{-z_i})$, Hessian stability implies [4] that our algorithm solves the problem with $\widetilde{O}\big(\|x_0 - x^\star\|_{A^\top A}^{2/3}\big)$ linear system solves of the form $A^\top D A x = z$ for diagonal $D$. This improves upon the previous best linearly convergent condition-free algorithm due to Karimireddy et al. [22], which requires $\widetilde{O}\big(\|x_0 - x^\star\|_{A^\top A}\big)$ system solves. Our improvement is precisely the power-$2/3$ factor that comes from acceleration using the ball optimization oracle.

For $\ell_\infty$ regression, we take $f$ to be the log-sum-exp (softmax) function and establish that it too has a stable Hessian. By appropriately scaling softmax to approximate $\ell_\infty$ regression to $\epsilon$ additive error and taking $r = \epsilon$, our method solves $\ell_\infty$ regression to additive error $\epsilon$ in $\widetilde{O}\big(\|x_0 - x^\star\|_{A^\top A}^{2/3}\, \epsilon^{-2/3}\big)$ linear system solves of the same form as above. This improves upon the algorithm of Bullins and Peng [11] in terms of $\epsilon$ scaling (from $\epsilon^{-4/5}$ to $\epsilon^{-2/3}$) and the algorithm of Ene and Vladu [18] in terms of distance scaling (from $n^{1/3}\|A(x_0 - x^\star)\|_\infty^{2/3}$ to $\|A(x_0 - x^\star)\|_2^{2/3}$). We also give a runtime guarantee improving over the state-of-the-art first-order method of Carmon et al. [13] whenever $d \le (\max_i \|a_i\|_2 R/\epsilon)^{2/3} \le n/d$, where $R$ is the $\ell_2$ distance between an initial point and the optimizer, by using a first-order oracle implementation based on [5].

Finally, we leverage our framework to obtain high-accuracy solutions to $\ell_p$ norm regression, where $f(z) = \sum_{i \in [n]} |z_i|^p$, via minimizing a sequence of proximal problems with geometrically shrinking regularization. The result is an algorithm that solves $\widetilde{O}(\mathrm{poly}(p)\, n^{1/3})$ linear systems. For $p = \omega(1)$, this matches the state-of-the-art $n$ dependence [1] but obtains a worse dependence on $p$.
Nevertheless, we provide a straightforward alternative approach to prior work, and our results leave room for further refinements which we believe may result in stronger guarantees.

### 1.2 Related work

Our developments are rooted in three lines of work, which we now briefly survey.

**Monteiro-Svaiter framework instantiations.** Monteiro and Svaiter [25] propose a new acceleration framework, which they specialize to recover the classic fast gradient method [27] and to obtain an optimal accelerated second-order method for convex problems with Lipschitz Hessian. Subsequent work [19] extends this to functions with $p$th-order Lipschitz derivatives and a $p$th-order oracle. Generalizing further, Bubeck et al. [9] implement the MS oracle via a $\Phi$-prox oracle that, given query $x$, returns roughly $\arg\min_{x'} \{f(x') + \Phi(\|x' - x\|)\}$ for continuously differentiable $\Phi$, and prove an error bound scaling with the iterate number $k$ as $\phi(R/k^{3/2})\, R^2/k^2$, where $\phi(t) = \Phi'(t)/t$. Using $\mathrm{poly}(d)$ parallel queries to a subgradient oracle for non-smooth $f$, they show how to implement the $\Phi$-prox oracle for $\Phi(t) \propto (t/r)^p$ with arbitrarily large $p$, where $r = \epsilon/\sqrt{d}$. Our notion of a ball optimization oracle corresponds to taking $p = \infty$, i.e., letting $\Phi$ be the indicator of $[0, r]$. However, since such $\Phi$ is not continuous, our result does not follow directly from [9]. Thus, our approach clarifies the limiting behavior of MS acceleration of infinitely smooth functions.

**Trust-region methods.** The idea of approximately minimizing the objective in a trust region around the current iterate plays a central role in nonlinear optimization and machine learning [see, e.g., 15, 23, 28]. Typically, the approximation takes the form of a second-order Taylor expansion, where regularity of the Hessian is key for guaranteeing approximation quality. Of particular relevance to us is the work of Karimireddy et al. [22], which defines a notion of Hessian stability under which a trust-region method converges linearly with only logarithmic dependence on problem conditioning. We observe that this stability condition in fact renders the second-order trust-region approximation highly effective, so that a few iterations suffice to implement an ideal ball optimization oracle, thus enabling accelerated condition-free convergence. Karimireddy et al. [22] also observe that quasi-self-concordance (QSC) is a sufficient condition for Hessian stability, and that the logistic regression objective is QSC. We use this observation for our applications, and prove that the softmax objective is also QSC. Marteau-Ferey et al. [24] directly leverage the QSC property using Newton method variants. For QSC functions with parameter $M$, they show complexity guarantees scaling linearly in $MR$. Under the same assumptions, we obtain the improved scaling $(MR)^{2/3}$. Both guarantees depend only weakly (polylogarithmically) on the standard problem condition number.

**Efficient $\ell_p$ regression algorithms.** There has been rapid recent progress in linearly convergent algorithms for minimizing the $p$-norm of the regression residual $Ax - b$, or alternatively for finding a minimum-$p$-norm $x$ satisfying linear constraints $Ax = b$. Bubeck et al. [8] give faster algorithms for all $p \in (1, 2) \cup (2, \infty)$, discovering and overcoming a limitation of classic interior point methods. Their algorithm is based on considering a smooth interpolation between a quadratic and the original objective.
Bullins [10] applies accelerated tensor methods to develop a gradient descent method for the case of $p = 4$ with linear-system solution complexity scaling as $n^{1/5}$ (for $A \in \mathbb{R}^{n \times d}$). Adil et al. [2] give an iterative refinement method for general $p \in (1, \infty)$ with complexity proportional to $n^{|p-2|/(2p + |p-2|)} \le n^{1/3}$, matching [10] for $p = 4$ and improving on [8]. Adil and Sachdeva [1] provide an alternative method with complexity scaling as $p\, n^{1/3}$, improving on the $O(p^{O(p)})$ dependence in [2]. As mentioned in the previous section, a number of recent works [11, 18, 13] obtain $\epsilon$-accurate solutions for $p = \infty$ with complexity scaling polynomially in $\epsilon^{-1}$. Bullins and Peng [11] leverage accelerated tensor methods and fourth-order smoothness, Ene and Vladu [18] carefully analyze reweighted least squares, and Carmon et al. [13] develop a first-order stochastic variance reduction technique for matrix saddle-point problems. We believe that our approach brings us closer to a unified perspective on high-order smoothness and acceleration for regression problems.

### 1.3 Paper organization

In Section 2, we implement the MS oracle using a ball optimization oracle and prove its $\widetilde{O}((R/r)^{2/3})$ convergence guarantee. In Section 3, we show how to use Hessian stability to efficiently implement a ball optimization oracle, and also show that quasi-self-concordance implies Hessian stability. In Section 4 we apply our developments to the aforementioned regression tasks. Finally, in Section 5 we give a lower bound implying that our oracle complexity is optimal (up to logarithmic terms).

**Notation.** Let $M$ be a positive semidefinite matrix, and let $M^\dagger$ be its pseudoinverse. We perform our analysis in the Euclidean seminorm $\|x\|_M \stackrel{\text{def}}{=} \sqrt{x^\top M x}$; we will choose a specific $M$ when discussing applications. We denote the $\|\cdot\|_M$ ball of radius $r$ around $\bar{x}$ by $B_r(\bar{x}) \stackrel{\text{def}}{=} \{x \in \mathbb{R}^d \mid \|x - \bar{x}\|_M \le r\}$. We recall standard definitions of smoothness and strong convexity in a quadratic norm: differentiable $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth in $\|\cdot\|_M$ if its gradient is $L$-Lipschitz in $\|\cdot\|_M$, and twice-differentiable $f$ is $L$-smooth and $\mu$-strongly convex in $\|\cdot\|_M$ if $\mu M \preceq \nabla^2 f(x) \preceq L M$ for all $x \in \mathbb{R}^d$.

## 2 Monteiro-Svaiter Acceleration with a Ball Optimization Oracle

In this section, we give an accelerated algorithm for optimization with the following oracle.

**Definition 1 (Ball optimization oracle).** We call $\mathcal{O}_{\mathrm{ball}}$ a $(\delta, r)$-ball optimization oracle for $f : \mathbb{R}^d \to \mathbb{R}$ if for any $\bar{x} \in \mathbb{R}^d$, it outputs $y = \mathcal{O}_{\mathrm{ball}}(\bar{x}) \in B_r(\bar{x})$ such that $\|y - x^\star_{\bar{x}, r}\|_M \le \delta$ for some $x^\star_{\bar{x}, r} \in \arg\min_{x \in B_r(\bar{x})} f(x)$.

We use the Monteiro and Svaiter acceleration framework [25, 19, 9], relying on the following oracle.

**Definition 2 (MS oracle).** We call $\mathcal{O}_{\mathrm{MS}}$ a $\sigma$-MS oracle for differentiable $f : \mathbb{R}^d \to \mathbb{R}$ if, given inputs $(A, x, v) \in \mathbb{R}_{\ge 0} \times \mathbb{R}^d \times \mathbb{R}^d$, $\mathcal{O}_{\mathrm{MS}}$ outputs $(\lambda, a_\lambda, y_{t_\lambda}, z) \in \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} \times \mathbb{R}^d \times \mathbb{R}^d$ such that
$$ a_\lambda = \tfrac{1}{2}\big(\lambda + \sqrt{\lambda^2 + 4A\lambda}\big), \quad t_\lambda = \frac{A}{A + a_\lambda}, \quad y_{t_\lambda} = t_\lambda x + (1 - t_\lambda) v, $$
and we have the guarantee
$$ \big\| z - \big(y_{t_\lambda} - \lambda M^\dagger \nabla f(z)\big) \big\|_M \le \sigma\, \|z - y_{t_\lambda}\|_M. \qquad (3) $$

We now state the acceleration framework and the main bound we use to analyze its convergence.

**Algorithm 1: Monteiro-Svaiter acceleration**
1: Input: Strictly convex and differentiable function $f : \mathbb{R}^d \to \mathbb{R}$. Symmetric $M \succeq 0$ with $\nabla f(x) \in \mathrm{Im}(M)$ for all $x \in \mathbb{R}^d$. Initialization $A_0 \ge 0$ and $x_0 = v_0 \in \mathbb{R}^d$. Monteiro-Svaiter oracle $\mathcal{O}_{\mathrm{MS}}$ with parameter $\sigma \in [0, 1)$.
2: for $k = 0, 1, 2, \ldots$ do
3: $\quad (\lambda_{k+1}, a_{k+1}, y_k, x_{k+1}) \leftarrow \mathcal{O}_{\mathrm{MS}}(A_k, x_k, v_k)$
4: $\quad v_{k+1} \leftarrow v_k - a_{k+1} M^\dagger \nabla f(x_{k+1})$, $\quad A_{k+1} \leftarrow A_k + a_{k+1}$.
5: end for
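For orientation, here is a minimal Python sketch of the loop above (illustrative only, taking $M = I$ and assuming a black-box `ms_oracle` satisfying Definition 2; a sketch of one concrete implementation via a ball optimization oracle appears after Theorem 6 below):

```python
import numpy as np

def monteiro_svaiter(grad_f, ms_oracle, x0, A0, num_iters):
    """Sketch of Algorithm 1 with M = I (standard Euclidean norm).

    ms_oracle(A, x, v) is assumed to return (lam, a, y, x_next) as in
    Definition 2, and grad_f(x) returns the gradient of f at x.
    """
    x, v, A = x0.copy(), x0.copy(), A0      # x_0 = v_0, A_0 >= 0
    for _ in range(num_iters):
        lam, a, y, x = ms_oracle(A, x, v)   # line 3 of Algorithm 1
        v = v - a * grad_f(x)               # line 4: v_{k+1}
        A = A + a                           # line 4: A_{k+1}
    return x
```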
**Proposition 3.** Let differentiable $f$ be strictly convex, $\|x_0 - x^\star\|_M \le R$ and $f(x_0) - f(x^\star) \le \epsilon_0$. Set $A_0 = R^2/(2\epsilon_0)$ and suppose that for some $r > 0$ the iterates of Algorithm 1 satisfy $\|x_{k+1} - y_k\|_M \ge r$ for all $k \ge 0$. Then, the iterates also satisfy
$$ f(x_k) - f(x^\star) \le 2\epsilon_0 \exp\Big(-\Big(\frac{r(1-\sigma)}{R}\Big)^{2/3}(k-1)\Big). $$

Proposition 3 is one of our main technical results, obtained via applying a reverse Hölder's inequality to a variant of the performance guarantees of [25]; we defer the proof to Appendix B. Clearly, Proposition 3 implies that the progress of Algorithm 1 is related to the amount of movement of the iterates, i.e., the quantities $\{\|x_{k+1} - y_k\|_M\}$. We now show that by using a ball optimization oracle of radius $r$, we are able to guarantee movement by roughly $r$, which implies rapid convergence. We rely on the following characterization, whose proof we defer to Appendix C.

**Lemma 4.** Let $f : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable and strictly convex. For all $y \in \mathbb{R}^d$, $z = \arg\min_{z' \in B_r(y)} f(z')$ either globally minimizes $f$, or satisfies $\|z - y\|_M = r$ and $M^\dagger \nabla f(z) = -\frac{\|\nabla f(z)\|_M}{r}(z - y)$.

Lemma 4 implies that a $(0, r)$ ball optimization oracle either globally minimizes $f$, or yields $z$ with $\|z - y\|_M = r$ and
$$ \big\| z - \big(y - \lambda M^\dagger \nabla f(z)\big) \big\|_M = 0, \quad \text{for } \lambda = \frac{r}{\|\nabla f(z)\|_M}. \qquad (4) $$
This is precisely the type of bound compatible with both Proposition 3 and requirement (3) of $\mathcal{O}_{\mathrm{MS}}$. The remaining difficulty lies in that $\lambda$ also defines the point $y = y_{t_\lambda}$. Therefore, to implement an MS oracle using a ball optimization oracle we perform binary search over $\lambda$, with the goal of solving
$$ g(\lambda) \stackrel{\text{def}}{=} \lambda \|\nabla f(z_{t_\lambda})\|_M = r, \quad \text{where } z_{t_\lambda} \stackrel{\text{def}}{=} \arg\min_{z \in B_r(y_{t_\lambda})} f(z), $$
and $t_\lambda, y_{t_\lambda}$ are as in Definition 2.

Algorithm 2 describes our binary search implementation. The algorithm takes the MS oracle input $(A, x, v)$ as well as $D$ bounding the distance of $x$ and $v$ from the optimum, and the desired global solution accuracy $\epsilon$, outputting either a (global) $\epsilon$-approximate minimizer or $(\lambda, a_\lambda, y_{t_\lambda}, \tilde{z}_{t_\lambda})$ satisfying both (3) (with $\sigma = \frac{1}{2}$) and a lower bound on $\|\tilde{z}_{t_\lambda} - y_{t_\lambda}\|_M$. To bound our procedure's complexity we leverage $L$-smoothness of $f$ (i.e., $L$-Lipschitz continuity of $\nabla f$), yielding a bound on the Lipschitz constant of $g(\lambda)$ defined above. Our analysis is somewhat intricate as it must account for inexactness in the ball optimization oracle. It obtains the following performance guarantee, whose proof is in Appendix C.

**Proposition 5 (Guarantees of Algorithm 2).** Let $L, D, \delta, r > 0$ and $\mathcal{O}_{\mathrm{ball}}$ satisfy the requirements in Lines 1-3 of Algorithm 2, and let $\epsilon < 2LD^2$. Then, Algorithm 2 either returns $\tilde{z}_{t_\lambda}$ with $f(\tilde{z}_{t_\lambda}) - f(x^\star) < \epsilon$, or implements a $\frac{1}{2}$-MS oracle with the additional guarantee $\|\tilde{z}_{t_\lambda} - y_{t_\lambda}\|_M \ge \frac{11r}{12}$. Moreover, the number of calls to $\mathcal{O}_{\mathrm{ball}}$ is bounded by $O(\log(LD^2/\epsilon))$.

**Algorithm 2: Monteiro-Svaiter oracle implementation**
1: Input: Function $f : \mathbb{R}^d \to \mathbb{R}$ that is strictly convex and $L$-smooth in $\|\cdot\|_M$. $A \in \mathbb{R}_{\ge 0}$ and $x, v \in \mathbb{R}^d$ satisfying $\|x - x^\star\|_M \le D$ and $\|v - x^\star\|_M \le D$, where $x^\star = \arg\min_x f(x)$. A $(\delta, r)$-ball optimization oracle $\mathcal{O}_{\mathrm{ball}}$, where $\delta = \frac{r}{12(1 + Lu)}$ and $u = \frac{2(D + r)r}{\epsilon}$.
2: Set $\lambda \leftarrow u$ and $\ell \leftarrow \frac{r}{2LD}$, and let $\tilde{z}_{t_\lambda} \leftarrow \mathcal{O}_{\mathrm{ball}}(y_{t_\lambda})$
3: if $u\, \|\nabla f(\tilde{z}_{t_\lambda})\|_M \le r + uL\delta$ then
4: $\quad$ return $(\lambda, a_\lambda, y_{t_\lambda}, \tilde{z}_{t_\lambda})$
5: else
6: $\quad$ while $\big|\lambda \|\nabla f(\tilde{z}_{t_\lambda})\|_M - r\big| > \frac{r}{6}$ do
7: $\quad\quad \lambda \leftarrow \frac{\ell + u}{2}$, $\ \tilde{z}_{t_\lambda} \leftarrow \mathcal{O}_{\mathrm{ball}}(y_{t_\lambda})$
8: $\quad\quad$ if $\lambda \|\nabla f(\tilde{z}_{t_\lambda})\|_M \ge r$ then $u \leftarrow \lambda$, else $\ell \leftarrow \lambda$
9: $\quad$ end while
10: $\quad$ return $(\lambda, a_\lambda, y_{t_\lambda}, \tilde{z}_{t_\lambda})$
11: end if

Finally, we state our main acceleration result, whose proof we defer to Appendix C.

**Theorem 6 (Acceleration with a ball optimization oracle).** Let $\mathcal{O}_{\mathrm{ball}}$ be a $\big(\frac{r}{12 + 126LRr/\epsilon},\, r\big)$-ball optimization oracle for strictly convex and $L$-smooth $f : \mathbb{R}^d \to \mathbb{R}$ with minimizer $x^\star$, and let the initial point $x_0$ satisfy $\|x_0 - x^\star\|_M \le R$ and $f(x_0) - f(x^\star) \le \epsilon_0$. Then, Algorithm 1 using Algorithm 2 as a Monteiro-Svaiter oracle with $D = 18R$ produces an iterate $x_k$ with $f(x_k) - f(x^\star) \le \epsilon$ in
$$ O\Big( (R/r)^{2/3} \log(\epsilon_0/\epsilon) \log\big(LR^2/\epsilon\big) \Big) $$
calls to $\mathcal{O}_{\mathrm{ball}}$.
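To complement the pseudocode, the following Python sketch (illustrative only: it takes $M = I$, an exact ball oracle, and the bracketing interval $[\ell, u]$ of Algorithm 2) produces a function with the interface assumed by the Algorithm 1 sketch above:

```python
import numpy as np

def make_ms_oracle(grad_f, ball_oracle, r, L, D, eps, max_iters=200):
    """Binary search over lambda as in Algorithm 2 (sketch).

    ball_oracle(y) should return (approximately) argmin_{||z - y|| <= r} f(z).
    The returned function maps (A, x, v) to (lam, a, y, z) as in Definition 2.
    """
    def query(A, x, v, lam):
        a = 0.5 * (lam + np.sqrt(lam ** 2 + 4 * A * lam))   # a_lambda
        t = A / (A + a)
        y = t * x + (1 - t) * v                             # y_{t_lambda}
        z = ball_oracle(y)                                  # minimize over B_r(y)
        return a, y, z

    def ms_oracle(A, x, v):
        lo, hi = r / (2 * L * D), 2 * (D + r) * r / eps     # ell and u in Algorithm 2
        lam = hi
        a, y, z = query(A, x, v, lam)
        if lam * np.linalg.norm(grad_f(z)) <= r:            # line 3: z is near-optimal
            return lam, a, y, z
        for _ in range(max_iters):                          # O(log(L D^2 / eps)) suffice
            if abs(lam * np.linalg.norm(grad_f(z)) - r) <= r / 6:
                break
            lam = 0.5 * (lo + hi)
            a, y, z = query(A, x, v, lam)
            if lam * np.linalg.norm(grad_f(z)) >= r:
                hi = lam
            else:
                lo = lam
        return lam, a, y, z

    return ms_oracle
```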
## 3 Ball Optimization Oracle for Hessian-Stable Functions

In this section we leverage standard techniques for solving the trust-region subproblem [15] in order to implement a ball optimization oracle. The key structure enabling efficient implementation is the following notion of Hessian stability, a slightly stronger version of the condition in Karimireddy et al. [22].²

² A variant of the algorithm we develop also works under the weaker stability condition. We state the stronger condition as it is simpler, and it holds for all our applications.

**Definition 7 (Hessian stability).** Twice-differentiable $f : \mathbb{R}^d \to \mathbb{R}$ is $(r, c)$-Hessian stable for $r, c \ge 0$ with respect to $\|\cdot\|$ if for all $x, y \in \mathbb{R}^d$ with $\|x - y\| \le r$ we have
$$ c^{-1} \nabla^2 f(y) \preceq \nabla^2 f(x) \preceq c\, \nabla^2 f(y). $$

We give a method implementing a $(\delta, r)$-ball oracle (cf. Definition 1) for $(r, c)$-stable functions in $\|\cdot\|_M$, requiring $\widetilde{O}(c)$ linear system solutions. The method reduces the oracle to solving $\widetilde{O}(c)$ trust-region subproblems of the form $\min_{x \in B_r(\bar{x})} Q(x) \stackrel{\text{def}}{=} g^\top x + \frac{1}{2} x^\top H x$, and we show each requires $\widetilde{O}(1)$ linear system solves in $H + \lambda M$ for $\lambda \ge 0$. In terms of total linear system solves, our method has a (mild) polylogarithmic dependence on the condition number of $f$ in $\|\cdot\|_M$. The main result of this section is Theorem 8, which guarantees the correctness and complexity of our ball optimization oracle implementation; proofs are deferred to Appendices D.1 and D.2.

**Theorem 8.** Let $f$ be $L$-smooth, $\mu$-strongly convex, and $(r, c)$-Hessian stable in the seminorm $\|\cdot\|_M$. Then, Algorithm 7 (in Appendix D.2) implements a $(\delta, r)$-ball optimization oracle for query point $\bar{x}$ with $\|\bar{x} - x^\star\|_M \le D$, for $x^\star$ the minimizer of $f$, and requires $O\big(c \log^2\big(\frac{\kappa (D + r) c}{\delta}\big)\big)$ linear system solves in matrices of the form $H + \lambda M$ for nonnegative $\lambda$, where $\kappa = L/\mu$.

**Remark 9 (First-order implementation).** The linear system solves required by Theorem 8 can be carried out via Gaussian elimination, fast matrix multiplication, or a number of more scalable algorithms, including first-order methods [e.g., 5]. In Section 4.3, we show that using first-order methods that exploit the particular problem structure allows us to achieve state-of-the-art runtimes for $\ell_\infty$ regression in certain regimes.
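To make the trust-region subproblems behind Theorem 8 concrete, here is a minimal sketch of the standard solution procedure for a single subproblem $\min_{\|x\|_2 \le r}\, g^\top x + \frac{1}{2} x^\top H x$ (illustrative only: it assumes $M = I$, $H \succ 0$, and exact linear system solves; each candidate multiplier costs one solve in $H + \lambda I$, the type of solve counted in Theorem 8):

```python
import numpy as np

def trust_region_subproblem(g, H, r, tol=1e-10, max_iters=100):
    """Minimize g^T x + 0.5 * x^T H x over the Euclidean ball of radius r (H > 0).

    If the unconstrained minimizer -H^{-1} g lies in the ball, return it.
    Otherwise, bisect on lam >= 0 so that ||(H + lam I)^{-1} g|| = r; each
    candidate lam costs one linear system solve in H + lam I.
    """
    d = g.shape[0]
    x = np.linalg.solve(H, -g)
    if np.linalg.norm(x) <= r:
        return x
    lo, hi = 0.0, np.linalg.norm(g) / r   # at hi: ||(H + hi I)^{-1} g|| <= ||g|| / hi <= r
    for _ in range(max_iters):
        lam = 0.5 * (lo + hi)
        x = np.linalg.solve(H + lam * np.eye(d), -g)
        if abs(np.linalg.norm(x) - r) <= tol * r:
            break
        if np.linalg.norm(x) > r:
            lo = lam   # step too long: subsequent trials use more regularization
        else:
            hi = lam
    return x
```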
We state a sufficient condition for Hessian stability below. We use this result in Section 4 to establish Hessian stability for several structured problems, and defer its proof to Appendix E for completeness.

**Definition 10 (Quasi-self-concordance).** We say that thrice-differentiable $f : \mathbb{R}^d \to \mathbb{R}$ is $M$-quasi-self-concordant (QSC) with respect to some norm $\|\cdot\|$, for $M \ge 0$, if for all $u, h, x \in \mathbb{R}^d$,
$$ \big| \nabla^3 f(x)[u, u, h] \big| \le M \|h\| \cdot \nabla^2 f(x)[u, u], $$
i.e., the restriction of the third-derivative tensor of $f$ to any direction is bounded by a multiple of its Hessian.

**Lemma 11.** If thrice-differentiable $f : \mathbb{R}^d \to \mathbb{R}$ is $M$-quasi-self-concordant with respect to norm $\|\cdot\|$, then it is $(r, \exp(Mr))$-Hessian stable with respect to $\|\cdot\|$.

## 4 Applications

Algorithm 3 puts together the ingredients from previous sections to give a complete second-order method for minimizing QSC functions. We now apply it to functions of the form $f(x) = g(Ax)$ for matrix $A \in \mathbb{R}^{n \times d}$ and $g : \mathbb{R}^n \to \mathbb{R}$. The logistic loss, the softmax approximation of $\ell_\infty$ regression, and variants of $\ell_p$ regression objectives all have this form. The following complexity guarantee for Algorithm 3 follows directly from our previous developments, and we defer its proof to Appendix F.

**Algorithm 3: Monteiro-Svaiter accelerated BAll COnstrained Newton's method (MS-BACON)**
1: Input: Function $f : \mathbb{R}^d \to \mathbb{R}$, desired accuracy $\epsilon$, initial point $x_0$, initial suboptimality $\epsilon_0$.
2: Input: Domain bound $R$, quasi-self-concordance $M$, smoothness $L$, norm $\|\cdot\|_M$.
3: Define $\hat{f}(x) = f(x) + \frac{\epsilon}{55R^2}\|x - x_0\|_M^2$
4: Using Algorithm 7, implement $\mathcal{O}_{\mathrm{ball}}$, a $(\delta, \frac{1}{M})$-ball optimization oracle for $\hat{f}$, where $\delta$ is chosen to satisfy the requirement of Theorem 6
5: Using Algorithm 2 and $\mathcal{O}_{\mathrm{ball}}$, implement $\mathcal{O}_{\mathrm{MS}}$, a $\frac{1}{2}$-MS oracle for $\hat{f}$
6: Using $O\big((RM)^{2/3} \log\frac{\epsilon_0}{\epsilon}\big)$ iterations of Algorithm 1 with $\mathcal{O}_{\mathrm{MS}}$ and initial point $x_0$, compute $x_{\mathrm{out}}$, an $\epsilon/2$-accurate minimizer of $\hat{f}$
7: return $x_{\mathrm{out}}$

**Corollary 12.** Let $f(x) = g(Ax)$, for $g : \mathbb{R}^n \to \mathbb{R}$ that is $L$-smooth and $M$-QSC in the $\ell_2$ norm, and $A \in \mathbb{R}^{n \times d}$. Let $x^\star$ be a minimizer of $f$, and suppose that $\|x_0 - x^\star\|_{A^\top A} \le R$ and $f(x_0) - f(x^\star) \le \epsilon_0$ for some $x_0 \in \mathbb{R}^d$, where the seminorm is taken with respect to $A^\top A$. Then, Algorithm 3 yields an $\epsilon$-approximate minimizer of $f$ in $\widetilde{O}\big((RM)^{2/3} \log\frac{\epsilon_0}{\epsilon}\big)$ linear system solves in matrices of the form $A^\top\big(\nabla^2 g(Ax) + \lambda I\big)A$ for $\lambda > 0$ and $x \in \mathbb{R}^d$.

Both the (unaccelerated) Newton-method-based algorithm of Marteau-Ferey et al. [24] and our method depend polylogarithmically on the (regularized) problem's condition number. The method proposed in Marteau-Ferey et al. [24] has a complexity of $\widetilde{O}(MR)$ for solving $M$-QSC functions with domain size $R$, while our method gives an accelerated dependence of $\widetilde{O}((MR)^{2/3})$. We defer proofs of claims in the following subsections to Appendix F.

### 4.1 Logistic regression

Consider logistic regression with data matrix $A \in \mathbb{R}^{n \times d}$, consisting of $n$ data points of dimension $d$, and corresponding labels $b \in \{-1, 1\}^n$. The objective is
$$ f(x) = \sum_{i \in [n]} \log\big(1 + \exp(-b_i \langle a_i, x \rangle)\big) = g(Ax), \qquad (5) $$
where $g(y) = \sum_{i \in [n]} \log(1 + \exp(-b_i y_i))$. It is known [6] that $g$ is $1$-QSC and $1$-smooth in $\ell_2$, with a diagonal Hessian. Thus, we have the following convergence guarantee from Corollary 12.

**Corollary 13.** For the logistic regression objective (5), given $x_0$ with initial function error $\epsilon_0$ and distance $R$ from a minimizer in $\|\cdot\|_{A^\top A}$, Algorithm 3 obtains an $\epsilon$-approximate minimizer using $O\big(R^{2/3} \log(\epsilon_0/\epsilon) \log^3\big(R^2(1+R)/\epsilon\big)\big)$ linear system solves in matrices $A^\top D A$ for diagonal $D$.

Compared to Karimireddy et al. [22], which gives a trust-region Newton method using $\widetilde{O}(R)$ linear system solves, we obtain an improved dependence on the domain size $R$.

### 4.2 $\ell_\infty$ regression

Consider $\ell_\infty$ regression with matrix $A \in \mathbb{R}^{n \times d}$ and vector $b \in \mathbb{R}^n$, which asks to minimize
$$ f(x) = \|Ax - b\|_\infty = g(Ax), \qquad (6) $$
where $g(y) = \|y - b\|_\infty$. Without loss of generality (by concatenating $A, b$ with $-A, -b$), we may replace the $\ell_\infty$ norm in the objective with a maximum. It is well known that $g(y)$ is approximated within additive $\epsilon/2$ by $\mathrm{lse}_t(y - b)$ for $t = \epsilon/(2\log n)$ (see Lemma 42 for a proof), where we set
$$ \mathrm{lse}(x) \stackrel{\text{def}}{=} \log\Big(\sum_{i \in [n]} \exp(x_i)\Big), \qquad \mathrm{lse}_t(x) \stackrel{\text{def}}{=} t\, \mathrm{lse}(x/t). $$
Our improvement stems from the fact that $\mathrm{lse}_t$ is QSC, which appears to be a new observation. The proof carefully manipulates the third-derivative tensor of $\mathrm{lse}_t$ and is deferred to Appendix F.

**Lemma 14.** $\mathrm{lse}_t$ is $1/t$-smooth and $2/t$-QSC in $\ell_\infty$.

Lemma 14 immediately implies that $\mathrm{lse}_t$ is $n/t$-smooth and $2/t$-QSC in $\ell_2$. We thus obtain the following by applying Corollary 12 to the $\mathrm{lse}_{\epsilon/(2\log n)}$ objective and solving to $\epsilon/2$ additive accuracy.

**Corollary 15.** Given $x_0$ with initial function error $\epsilon_0$ and distance $R$ from a minimizer in $\|\cdot\|_{A^\top A}$, Algorithm 3 obtains an $\epsilon$-approximate minimizer using $O\big((R\log n/\epsilon)^{2/3} \log(\epsilon_0/\epsilon) \log^3(nR/\epsilon)\big)$ linear system solves in matrices $\hat{A}^\top D \hat{A}$, where $D$ is a positive definite diagonal matrix and $\hat{A}$ is the vertical concatenation of $A$ and $-A$.

The reduction from solving linear systems of the form described in Corollary 12 to linear systems of the form in Corollary 15 (which is not immediate, since the Hessian of softmax is not diagonal) is given in Appendix F.
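As a concrete illustration of the softmax smoothing used above, here is a short numerical sketch (illustrative only; `linf_softmax_objective` is a hypothetical helper, not part of the paper's algorithms):

```python
import numpy as np

def lse_t(x, t):
    """Softmax smoothing lse_t(x) = t * log(sum_i exp(x_i / t)).

    Computed stably by factoring out max(x); satisfies
    max(x) <= lse_t(x) <= max(x) + t * log(len(x)).
    """
    m = np.max(x)
    return m + t * np.log(np.sum(np.exp((x - m) / t)))

def linf_softmax_objective(A, b, x, eps):
    """Smooth surrogate for ||Ax - b||_inf; additive error roughly eps / 2 (cf. Lemma 42)."""
    n = A.shape[0]
    t = eps / (2 * np.log(n))                  # scaling from Section 4.2
    res = A @ x - b
    # replace the max over coordinates of |res| by its log-sum-exp smoothing,
    # using the concatenation (res, -res) to encode the absolute value
    return lse_t(np.concatenate([res, -res]), t)
```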
Compared to Bullins and Peng [11], which finds an $\epsilon$-approximate solution to (6) in $\widetilde{O}((R/\epsilon)^{4/5})$ linear system solves using high-order acceleration, we obtain an improved dependence on $R/\epsilon$. Ene and Vladu [18] consider the equivalent problem $\mathrm{minimize}_{y : A^\top y = c}\ \|y\|_\infty$ (see Appendix F.2.1 for an explanation of this equivalence). They show how to solve this problem to $\delta$ multiplicative error in $\widetilde{O}(n^{1/3}\delta^{-2/3})$ linear system solutions in $A^\top D A$ for positive diagonal $D$. Translated into our setting, this implies a complexity of $\widetilde{O}\big(n^{1/3}\|Ax^\star\|_\infty^{2/3}\epsilon^{-2/3}\big)$ linear system solves in $A^\top D A$, which is never better than our guarantee since $\|v\|_2 \le \sqrt{n}\,\|v\|_\infty$ for all $v \in \mathbb{R}^n$. Conversely, our result maps to the setting of Ene and Vladu [18] to provide a complexity guarantee of $\widetilde{O}\big(\|x^\star\|_2^{2/3}\epsilon^{-2/3}\big)$ appropriate linear system solves to attain $\epsilon$ additive error. Finally, we note that our unconstrained regression solver also solves constrained regression problems, which are sometimes considered in the literature, through a reduction.

### 4.3 First-order methods and improved norm dependence

For both logistic regression and $\ell_\infty$ regression, we can alternatively work in the standard $\ell_2$ norm and obtain a different QSC parameter depending on $\max_i \|a_i\|_2$; we defer all proofs to Appendix F.3.

**Lemma 16.** The logistic objective $f(x) = g(Ax)$ in (5) is $\max_{i \in [n]} \|a_i\|_2$-QSC in the $\ell_2$ norm.

**Lemma 17.** The log-sum-exp function $f(x) = \mathrm{lse}_t(Ax)$ is $\frac{2}{t}\max_{i \in [n]} \|a_i\|_2$-QSC in the $\ell_2$ norm.

With these alternative QSC bounds, we turn our attention to the cost of implementing a ball oracle. In the previous sections we accomplish this by using a generic positive semidefinite linear system solver; we now demonstrate how first-order methods can give improved runtimes in large-scale settings. We focus on $\ell_\infty$ regression here, as the case of logistic regression is similar. Defining $R = \|x_0 - x^\star\|_2$, we seek an $\epsilon/4$-approximate minimizer of a smooth, strongly convex approximation of the $\ell_\infty$-norm objective: we pick
$$ h(x) = \mathrm{lse}_t(Ax) + \frac{\epsilon}{4R^2}\|x - x_0\|_2^2, \quad \text{where } t = \frac{\epsilon}{2\log n}. $$
By applying variance-reduced stochastic gradient methods to solve linear systems in $\nabla^2 h(x)$ and combining with our framework, we obtain the following complexity bound in terms of runtime (as opposed to linear system solves).

**Corollary 18.** With initial function error $\epsilon_0$ and $R = \|x_0 - x^\star\|_2$, Algorithm 3 using the first-order linear system solver of Agarwal et al. [5] returns an $\epsilon$-approximate minimizer within total runtime
$$ \widetilde{O}\Big( \Big(\frac{\max_{i \in [n]}\|a_i\|_2\, R}{\epsilon}\Big)^{2/3} \Big( nd + d^{1.5}\, \frac{\max_{i \in [n]}\|a_i\|_2\, R}{\epsilon} \Big) \Big). $$

Let $L = \max_{i \in [n]} \|a_i\|_2$. In the regime $d \le (LR/\epsilon)^{2/3} \le n/d$ and when $A$ is dense, we obtain a speed-up compared to the state-of-the-art runtime $\widetilde{O}\big(nd + \sqrt{nd(n+d)}\cdot LR^2/\epsilon\big)$ of Carmon et al. [13].

### 4.4 $\ell_p$ regression

Consider $\ell_p$ regression with matrix $A \in \mathbb{R}^{n \times d}$ and vector $b \in \mathbb{R}^n$, which asks to minimize
$$ f(x) = \|Ax - b\|_p^p = g(Ax) \quad \text{with optimizer } x^\star, \qquad (7) $$
for some fixed $p > 3$,³ where $g(x) = \sum_i |x_i - b_i|^p$. While this objective is not QSC, our method iteratively considers a regularized QSC objective to halve the error, as summarized in Algorithm 8.

³ We assume $p > 3$ for ease of presentation; for $p \le 4$ our runtime is superseded by, e.g., the algorithm of [3].

**Algorithm 4: High accuracy $\ell_p$ regression**
1: Input: $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, multiplicative error tolerance $\delta \ge 0$.
2: Set $x_0 = A^\dagger b$ and $\epsilon_0 = f(x_0) = \|Ax_0 - b\|_p^p$.
3: for $k \le \lceil \log_2(n/\delta^{1/p}) \rceil$ do
4: $\quad \epsilon_k \leftarrow 2^{-p}\epsilon_{k-1}$
5: $\quad x_k \leftarrow$ output of Algorithm 3 applied to $f(x) = \|Ax - b\|_p^p$ with initialization $x_{k-1}$, desired accuracy $\epsilon_k$, and parameters $R = O\big(n^{(p-2)/(2p)}\epsilon_k^{1/p}\big)$ and $M = O(p\sqrt{n}/R)$ (see Lemma 52)
6: end for
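A minimal Python sketch of this outer loop (illustrative only: `ms_bacon` is a placeholder for Algorithm 3, and the parameter settings mirror the pseudocode above with absolute constants omitted):

```python
import numpy as np

def lp_regression(A, b, p, delta, ms_bacon):
    """Sketch of Algorithm 4: drive the l_p error down by a factor 2^{-p} per round."""
    n = A.shape[0]
    x = np.linalg.lstsq(A, b, rcond=None)[0]               # x_0 = A^+ b
    eps = np.sum(np.abs(A @ x - b) ** p)                   # eps_0 = f(x_0)
    num_rounds = int(np.ceil(np.log2(n / delta ** (1.0 / p))))
    for _ in range(num_rounds):
        eps = 2.0 ** (-p) * eps                            # eps_k = 2^{-p} * eps_{k-1}
        R = n ** ((p - 2) / (2.0 * p)) * eps ** (1.0 / p)  # domain bound (up to constants)
        M = p * np.sqrt(n) / R                             # QSC parameter (up to constants)
        x = ms_bacon(A, b, p, x, eps, R, M)                # Algorithm 3 on ||Ax - b||_p^p
    return x
```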
Below we state the guarantee of Algorithm 8, and defer its proof to Appendix F.4.

**Corollary 19.** Algorithm 8 computes $x \in \mathbb{R}^d$ with $\|Ax - b\|_p^p \le (1 + \delta)\|Ax^\star - b\|_p^p$ using $O\big(p^{14/3} n^{1/3} \log^4(n/\delta)\big)$ linear system solves in $A^\top D A$ for diagonal matrix $D \succeq 0$.

Compared to Adil and Sachdeva [1] and Adil et al. [2], which minimize $f$ to $1+\delta$ multiplicative accuracy by solving $\widetilde{O}\big(\min\{p\, n^{1/3},\, p^{O(p)} n^{\frac{p-2}{3p-2}}\}\log(1/\delta)\big)$ linear systems, our guarantee is slightly weaker in its $p$ dependence. Nonetheless, we believe our alternative, simple approach sheds further light on the complexity of this problem, and that there is room for additional improvement.

## 5 Lower Bound

We establish a lower bound showing that all algorithms for minimizing a function via repeated calls to a $(0, r)$ ball optimization oracle, which we call $r$-BOO algorithms, require $\Omega((R/r)^{2/3})$ queries in the worst case. Formally, the following lower bound matches Theorem 6 up to logarithmic factors.

**Theorem 20.** Let $r \le R$, $\delta \in (0, 1)$, and let the dimension $d$ be a sufficiently large polynomial in $R/r$ and $1/\delta$. There exists a distribution $\mathcal{P}$ over convex and 1-Lipschitz functions from $B_R(0)$ to $\mathbb{R}$ such that the following holds for any $r$-BOO algorithm: with probability at least $1 - \delta$ over the draw of $f \sim \mathcal{P}$ and the algorithm's coin flips (i.e., the randomness used by the algorithm), its first $\Omega\big((R/r)^{2/3}\big)$ queries are all at least $\Omega\big(R^{2/3} r^{1/3}\big)$ suboptimal for $f$.

We prove Theorem 20 in Appendix G as a corollary of a stronger result stating the same lower bound for $r$-local oracle algorithms, which for each query $x$ receive the function $f$ restricted to $B_r(x)$. This information clearly suffices to compute the ball optimization oracle output as well as $f(x)$ and $\nabla f(x)$, implying that Algorithm 2 also operates within our oracle framework. Our proof is essentially a careful reading of the classical information-based complexity lower bound for convex optimization [26, 20], where we strengthen the notion of local oracles, which return $f$ restricted to a neighborhood of the query, by quantifying the size of the neighborhood. Using arguments from [20, 17] we may further strengthen the lower bound to hold for instances which are smooth, strongly convex, and have unbounded domain, precisely matching the assumptions of Theorem 6 (see Appendix G). However, the implementations of the ball optimization oracle we consider in Section 3 require a Hessian stability assumption (Definition 7), and it is unclear whether we can make the hard instances underlying Theorem 20 Hessian stable. Nevertheless, our lower bound precludes further progress via the ball optimization oracle abstraction, up to logarithmic factors.

## Broader Impact

This work does not present any foreseeable societal consequence.

## Acknowledgments

We thank Sébastien Bubeck for helpful conversations.

**Sources of funding.** Researchers on this project were supported by NSF CAREER Awards CCF-1844855 and CCF-1749609, NSF Grants CCF-1955039, CCF-1740551, DMS-1839116 and DMS-2023166, two Sloan Research Fellowships and Packard Fellowships, and two Stanford Graduate Fellowships. Additional support was provided by PayPal and Microsoft, including two Microsoft Research Faculty Fellowships and a PayPal research gift.

**Competing interests.** The authors declare no competing interests.

## References

[1] Deeksha Adil and Sushant Sachdeva. Faster p-norm minimizing flows, via smoothed q-norm problems. In Symposium on Discrete Algorithms (SODA), pages 892–910, 2020.
[2] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for ℓp-norm regression. In Symposium on Discrete Algorithms (SODA), pages 1405–1424, 2019.
[3] Deeksha Adil, Richard Peng, and Sushant Sachdeva. Fast, provably convergent IRLS algorithm for p-norm linear regression. In Advances in Neural Information Processing Systems (NeurIPS), pages 14166–14177, 2019.
[4] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(1):4148–4187, 2017.
[5] Naman Agarwal, Sham Kakade, Rahul Kidambi, Yin Tat Lee, Praneeth Netrapalli, and Aaron Sidford. Leverage score sampling for faster accelerated regression and ERM. arXiv preprint arXiv:1711.08426, 2017.
[6] Francis Bach et al. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
[7] Keith Ball et al. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1–58, 1997.
[8] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, and Yuanzhi Li. An homotopy method for ℓp regression provably beyond self-concordance and in input-sparsity time. In Symposium on Theory of Computing (STOC), pages 1130–1137, 2018.
[9] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Complexity of highly parallel non-smooth convex optimization. In Advances in Neural Information Processing Systems, pages 13900–13909, 2019.
[10] Brian Bullins. Fast minimization of structured convex quartics. arXiv preprint arXiv:1812.10349, 2018.
[11] Brian Bullins and Richard Peng. Higher-order accelerated methods for faster non-smooth optimization. arXiv preprint arXiv:1906.01621, 2019.
[12] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, May 2019.
[13] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In Advances in Neural Information Processing Systems, pages 11377–11388, 2019.
[14] Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. On acceleration with noise-corrupted gradients. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, pages 1018–1027, 2018.
[15] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods. MOS-SIAM Series on Optimization. SIAM, 2000.
[16] Olivier Devolder, François Glineur, and Yurii E. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
[17] Jelena Diakonikolas and Cristóbal Guzmán. Lower bounds for parallel and randomized convex optimization. In Conference on Learning Theory (COLT), pages 1132–1157, 2019.
[18] Alina Ene and Adrian Vladu. Improved convergence for ℓ1 and ℓ∞ regression via iteratively reweighted least squares. In International Conference on Machine Learning, pages 1794–1801, 2019.
[19] Alexander Gasnikov, Pavel E. Dvurechensky, Eduard A. Gorbunov, Evgeniya A. Vorontsova, Daniil Selikhanovych, César A. Uribe, Bo Jiang, Haoyue Wang, Shuzhong Zhang, Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near optimal methods for minimizing convex functions with Lipschitz p-th derivatives. In Conference on Learning Theory (COLT), pages 1392–1393, 2019.
[20] Cristóbal Guzmán and Arkadi Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
[21] William W. Hager. Minimizing a quadratic over a sphere. SIAM Journal on Optimization, 12(1):188–208, 2001.
[22] Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
[23] Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008.
[24] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Globally convergent Newton methods for ill-conditioned generalized self-concordant losses. In Advances in Neural Information Processing Systems, pages 7636–7646, 2019.
[25] Renato D. C. Monteiro and Benar Fux Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.
[26] Arkadi Nemirovski and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[27] Yurii Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Doklady AN SSSR, 269:543–547, 1983.
[28] Mark Schmidt, Dongmin Kim, and Suvrit Sra. Projected Newton-type methods in machine learning. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning, chapter 11. MIT Press, 2011.
[29] Blake Woodworth and Nathan Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.
[30] Blake Woodworth and Nathan Srebro. Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594, 2017.
[31] Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (FOCS), pages 222–227, 1977.