# Optimal and Adaptive Monteiro-Svaiter Acceleration

Yair Carmon, Danielle Hausler, Arun Jambulapati, Yujia Jin, Aaron Sidford

Tel Aviv University, ycarmon@tauex.tau.ac.il, hausler@mail.tau.ac.il. Stanford University, {jmblpati,yujiajin,sidford}@stanford.edu.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

We develop a variant of the Monteiro-Svaiter (MS) acceleration framework that removes the need to solve an expensive implicit equation at every iteration. Consequently, for any p ≥ 2 we improve the complexity of convex optimization with Lipschitz pth derivative by a logarithmic factor, matching a lower bound. We also introduce an MS subproblem solver that requires no knowledge of problem parameters, and implement it as either a second- or first-order method via exact linear system solution or MinRes, respectively. On logistic regression our method outperforms previous second-order acceleration schemes, but under-performs Newton's method; simply iterating our first-order adaptive subproblem solver performs comparably to L-BFGS.

1 Introduction

We consider the problem of minimizing a convex function f : X → R over a closed convex set X ⊆ R^d, given access to an oracle O : X → X that minimizes a local model of f around a given query point. A key motivating example of such an oracle is the cubic-regularized Newton step

  O_cr(y) = argmin_{x∈X} { f(y) + ∇f(y)^⊤(x − y) + (1/2)(x − y)^⊤∇²f(y)(x − y) + (M/6)‖x − y‖³ },   (1)

i.e., minimizing the second-order Taylor approximation of f around y plus a cubic regularization term. However, our results apply to additional oracles including a simple gradient step, regularized higher-order Taylor expansions [5, 19, 7, 23, 8, 35, 21, 41, 36, 26], ball-constrained optimization [12], and new adaptive oracles that we develop.

Seminal work by Monteiro and Svaiter [32] (MS) shows how to accelerate the basic oracle iteration x_{t+1} = O(x_t). Their algorithm is based on the fact that many oracles, including O_cr, implicitly approximate proximal points. That is, for every y and x = O(y), there exists λ_{x,y} > 0 such that

  x ≈ argmin_{x'∈X} { f(x') + (λ_{x,y}/2)‖x' − y‖² },

with the approximation error controlled by a specific condition they define. MS prove that, under this condition, the accelerated proximal point method [22, 40] (with dynamic regularization parameter) maintains its rate of convergence. Applying their framework to O_cr and assuming ∇²f is Lipschitz, they achieve error bounds that decay as O(t^{−7/2} log t) after t oracle calls, improving on the O(t^{−2}) rate of the basic O_cr iteration [37] and the O(t^{−3}) rate of an earlier accelerated method [33]. Subsequent works apply variations of the MS framework to different oracles, obtaining improved theoretical guarantees for functions with continuous higher-order derivatives [19, 7, 23, 41, 2], parallel optimization [6], logistic and ℓ1 regression [8, 12], minimizing functions with Hölder continuous higher derivatives [41], and distributionally-robust optimization [13, 11]. However, all of these algorithms based on the MS framework share a common drawback: the iterate y_t used to produce x_{t+1} = O(y_t) depends on the proximal parameter λ_{t+1} = λ_{x_{t+1},y_t}, which itself depends on both x_{t+1} and y_t. This circular dependence necessitates solving an implicit equation for λ_{t+1}; MS (and many subsequent results based upon it) propose bisection procedures for doing so using a number of oracle calls logarithmic in the problem parameters.
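To make the oracle concrete, the following minimal numpy sketch (our illustration, not the authors' code) computes the cubic-regularized Newton step (1) in the unconstrained convex case. It uses the optimality condition (∇²f(y) + λI)(x − y) = −∇f(y) with λ = (M/2)‖x − y‖, and finds λ by bisection; `grad` and `hess` are assumed callables returning ∇f and ∇²f.

```python
import numpy as np

def cubic_newton_step(grad, hess, y, M, tol=1e-10):
    """Sketch of the cubic-regularized Newton oracle O_cr in (1), unconstrained convex case.

    Finds lam by bisection on the decreasing function
        phi(lam) = (M/2) * ||(hess(y) + lam I)^{-1} grad(y)|| - lam,
    then returns (x, lam): the point and the associated proximal parameter.
    """
    g, H = grad(y), hess(y)
    eye = np.eye(len(g))

    def delta(lam):                      # the step x - y for a given lam > 0
        return np.linalg.solve(H + lam * eye, -g)

    lo, hi = 0.0, 1.0
    while (M / 2) * np.linalg.norm(delta(hi)) > hi:   # grow hi until phi(hi) <= 0
        hi *= 2
    while hi - lo > tol * max(hi, 1.0):               # bisection for the root of phi
        mid = 0.5 * (lo + hi)
        if (M / 2) * np.linalg.norm(delta(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return y + delta(hi), hi
```

This scalar search over λ is exactly the kind of per-call implicit equation that the paper's adaptive oracles later replace with a condition that can be checked directly.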
From a theoretical perspective, the additional bisection complexity introduces a logarithmic gap between the upper bounds due to MS-based algorithms and the best known lower bounds [3, 21] in a number of settings. From a practical perspective, the use of bisection in the MS framework is undesirable as it potentially discards the optimization progress made by oracle calls during each bisection. In his textbook, Nesterov [34, §4.3.3] argues that the logarithmic cost of bisection likely renders the MS scheme for accelerating O_cr inferior in practice to algorithms whose error decays at the asymptotically worse rate O(t^{−3}) but do not require bisection; he notes that removing the bisection from the MS algorithm is an open and challenging question in optimization theory. Carmon et al. [13] also point out bisection as one of the main hurdles in making their theoretical scheme practical, while Song et al. [41] note this limitation and propose a heuristic alternative to bisection. (See Appendix A for an extended discussion of related work, including a concurrent and independent result by Kovalev and Gasnikov [27].)

1.1 Our contributions

We settle this open question, providing a variant of MS acceleration that does not require bisection (Section 2). When combined with certain existing MS oracles (Section 3.1), our algorithm obtains complexity bounds that are optimal up to constant factors, improving over prior art by a logarithmic factor (see Table 1). In addition, our algorithm has no parameters sensitive to tuning.

We then go a step further and (in Section 3.2) develop an adaptive alternative to O_cr (Equation (1)). Our oracle does not require tuning the parameter M, which in theory should be proportional to the (difficult to estimate) Lipschitz constant of ∇²f. Using our oracle, we obtain the optimal Hessian evaluation complexity, with error decaying as O(t^{−(4+3ν)/2}) after t evaluations, for functions with order-ν Hölder Hessian (Lipschitz Hessian is the ν = 1 special case), without requiring any knowledge of the Hölder constant or the order ν. Our oracle is also efficient: while existing complexity bounds for computing O_cr require a logarithmic number of linear system solutions per call, our oracle requires a double-logarithmic number. Moreover, when used with our acceleration method, the number of linear system solves per iteration is essentially constant.

We also provide a first-order implementation of our adaptive oracle (Section 3.3). It approximately solves linear systems via first-order operations (Hessian-vector products) using MinRes/Conjugate Residuals [42, 18] with a simple, adaptive stopping criterion lifted directly from our analysis. Our oracle attains the optimal first-order evaluation complexity for smooth functions up to an additive logarithmic term, without knowledge of the gradient Lipschitz constant or any parameter tuning. Moreover, it maintains an optimal outer iteration complexity for Hölder Hessian of any order.

Finally, we report empirical results (Section 4).³ On logistic regression problems, combining our optimal acceleration scheme with our adaptive oracle outperforms previously proposed accelerated second-order methods. However, we also show that, while somewhat helpful for O_cr with a conservative choice of H, adding momentum to well-tuned or adaptive second-order methods is harmful in logistic regression: simply iterating our oracle or, better yet, applying Newton's method dramatically outperforms all accelerated algorithms.
This important fact seems to have gone unobserved in the literature on accelerated second-order methods, despite logistic regression appearing in many related experiments [41, 16, 30, 24]. Simply iterating our adaptive oracle outperforms the classical accelerated gradient descent, and performs comparably to L-BFGS.

1.2 Limitations and outlook

While our algorithms resolve an enduring theoretical open problem in convex optimization, and are free of sensitive parameters that typically hinder theoretically-optimal methods, practical performance remains a limitation. On logistic regression, Newton's method is remarkably fast, and our acceleration scheme does not seem to help our adaptive oracle. We do not fully understand why this is so, but we suspect that it has to do with additional structure in logistic regression, which Newton's method can automatically exploit but momentum cannot. We believe that future research should identify the structure that makes Newton's method so efficient, and modify momentum schemes to leverage it.

³The code for our experiments is available at https://github.com/danielle-hausler/ms-optimal.

Scalability is another important limitation. While our first-order oracle significantly improves scalability over the second-order oracle from which it is built, it still relies on exact gradient and Hessian-vector products. Therefore, it will have difficulty scaling up to very large datasets. Nevertheless, we hope that further scalability improvements may be possible by building an oracle that utilizes cheap stochastic gradient estimates instead of exact gradients, bringing with it the exciting prospect of a new and powerful adaptive stochastic gradient method. The alternative, probabilistic approximation condition we propose in Appendix C might be helpful in this regard.

2 Removing bisection from the Monteiro-Svaiter framework

Algorithm 1: Optimal MS Acceleration
Input: Initial x_0, function f, oracle O
Parameters: Initial λ'_0 > 0, multiplicative adjustment factor α > 1
1   Set v_0 = x_0, A_0 = 0
2   x̃_1, λ_1 = O(x_0; λ'_0); set λ'_1 = λ_1
3   for t = 0, 1, . . . do
4     a'_{t+1} = (1 + √(1 + 4λ'_{t+1} A_t)) / (2λ'_{t+1})
5     A'_{t+1} = A_t + a'_{t+1}
6     y_t = (A_t x_t + a'_{t+1} v_t) / A'_{t+1}
7     if t > 0 then x̃_{t+1}, λ_{t+1} = O(y_t; λ'_{t+1})
8     if λ_{t+1} ≤ λ'_{t+1} then
9       a_{t+1} = a'_{t+1}, A_{t+1} = A'_{t+1}
10      x_{t+1} = x̃_{t+1}
11      λ'_{t+2} = λ_{t+1}/α
12    else
13      γ_{t+1} = λ'_{t+1}/λ_{t+1}
14      a_{t+1} = γ_{t+1} a'_{t+1}, A_{t+1} = A_t + a_{t+1}
15      x_{t+1} = ((1 − γ_{t+1}) A_t / A_{t+1}) x_t + (1 − (1 − γ_{t+1}) A_t / A_{t+1}) x̃_{t+1}
16      λ'_{t+2} = α λ_{t+1}
17    v_{t+1} = v_t − a_{t+1} ∇f(x̃_{t+1})

Algorithm 0: MS Acceleration
Input: Initial x_0, function f, oracle O
Parameters: Bisection limits λ_ℓ, λ_h, and tolerance α > 1
1   Set v_0 = x_0, A_0 = 0
2   for t = 0, 1, . . . do
3     λ_{ℓ,t+1}, λ_{h,t+1} = λ_ℓ, λ_h
4     λ'_{t+1} = √(λ_{ℓ,t+1} λ_{h,t+1})
5     a'_{t+1} = (1 + √(1 + 4λ'_{t+1} A_t)) / (2λ'_{t+1})
6     A'_{t+1} = A_t + a'_{t+1}
7     y_t = (A_t x_t + a'_{t+1} v_t) / A'_{t+1}
8     x̃_{t+1}, λ_{t+1} = O(y_t; λ'_{t+1})
9     if λ_{t+1} ∈ [α^{−1} λ'_{t+1}, λ'_{t+1}] then
10      a_{t+1} = a'_{t+1}, A_{t+1} = A'_{t+1}
11      x_{t+1} = x̃_{t+1}
12    else if λ_{t+1} < α^{−1} λ'_{t+1} then
13      λ_{h,t+1} = λ'_{t+1}
14      Go to line 4
15    else
16      λ_{ℓ,t+1} = λ'_{t+1}
17      Go to line 4
18    v_{t+1} = v_t − a_{t+1} ∇f(x̃_{t+1})

In this section we present our acceleration algorithm (Algorithm 1), which removes bisection from the MS method (shown in stylized form as Algorithm 0) and thereby attains optimal rates of convergence. For simplicity, in this section and the next we focus on unconstrained optimization (X = R^d) and assume that f is continuously differentiable, so that ∇f exists. In Appendix C we extend our framework to general closed and convex domains and non-differentiable convex objectives.

The key object in both the original MS algorithm and our new variant is an oracle O that approximately minimizes a local model of f at a query point y. In particular, O satisfies the following approximation error bound, adapted from Monteiro and Svaiter [32, eq. (3.3)] (λ in [32] is 1/λ in our notation).
Definition 1 (MS oracle). An oracle O : R^d × R_+ → R^d × R_+ is a σ-MS oracle for function f : R^d → R if for every y ∈ R^d and λ' > 0, the points (x, λ) = O(y; λ') satisfy

  ‖x − (y − (1/λ)∇f(x))‖ ≤ σ‖x − y‖.   (2)

Definition 1 endows the oracle with an additional output λ and an additional input λ'. The value of λ has the following simple interpretation: any point x satisfying (2) approximately minimizes F(x') = f(x') + (λ/2)‖x' − y‖² in the sense that ‖∇F(x)‖ ≤ λσ‖x − y‖. In particular, computing an exact proximal point x_λ = argmin_{x'} F(x') and outputting (x_λ, λ) implements a 0-MS oracle. The input λ' is optional: oracle implementations in prior work do not require it, but our new adaptive oracles (described in the next section) use it for improved efficiency. In Appendix C we provide a slightly more general approximation condition for MS oracles that handles non-smooth objectives and bounded domains, as well as a different, stochastic condition similar to that of [4, 11].

Let us discuss the key differences between our algorithm (Algorithm 1) and the stylized MS algorithm (Algorithm 0). At every iteration, Algorithm 0 searches for a value λ'_{t+1} such that x̃_{t+1}, λ_{t+1} = O(y_t; λ'_{t+1}) satisfies λ_{t+1} ≈ λ'_{t+1} (note that y_t depends on λ'_{t+1}). This is done via a bisection procedure iteratively shrinking an interval that contains a successful choice of λ'_{t+1}.⁴ This bisection process is inefficient in the sense that every time we reach lines 14 and 17 (highlighted in red) all of the optimization progress made by the last oracle call is discarded. In contrast, even though our algorithm queries O in the same way (with y_t computed based on a guess λ'_{t+1}), it makes use of the oracle output even if λ_{t+1} is very far from λ'_{t+1}, thus never discarding progress made by the oracle. Instead of performing a bisection, we compare λ'_{t+1} and λ_{t+1} to guide our next guess λ'_{t+2}. When λ' overshoots λ, we decrease it by a factor α (line 11, highlighted in green) and set x_t and A_t as in Algorithm 0. When it undershoots, we multiply it by α (line 16). In this case, we perform an additional key algorithmic modification which we call the momentum damping mechanism: we scale down the growth of the parameter A_{t+1} and replace the next iterate with a convex combination of x_t and x̃_{t+1}, to ensure that our overly optimistic guess for λ_{t+1} does not destabilize the algorithm.⁵ In Appendix E.6 we demonstrate empirically that this mechanism is important for stabilizing Algorithm 1.
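The following Python sketch transcribes the main loop of Algorithm 1 as we read it; `oracle(y, lam_guess)` stands for any σ-MS oracle returning a pair (x̃, λ), and `grad_f` for ∇f. It is an illustration under these assumptions (in particular, the convex-combination weights follow our reading of lines 13-15), not the reference implementation.

```python
import numpy as np

def optimal_ms_acceleration(x0, grad_f, oracle, lam_init=1.0, alpha=2.0, T=100):
    """Sketch of Algorithm 1 (bisection-free MS acceleration)."""
    v, x, A = x0.copy(), x0.copy(), 0.0
    x_tilde, lam = oracle(x0, lam_init)        # line 2: first oracle call
    lam_guess = lam                            # line 2: lambda'_1 = lambda_1
    for t in range(T):
        # Lines 4-6: tentative step size, aggregate weight, and momentum point.
        a_guess = (1.0 + np.sqrt(1.0 + 4.0 * lam_guess * A)) / (2.0 * lam_guess)
        A_guess = A + a_guess
        y = (A * x + a_guess * v) / A_guess
        if t > 0:                              # line 7
            x_tilde, lam = oracle(y, lam_guess)
        if lam <= lam_guess:                   # lines 8-11: guess did not undershoot
            a, A = a_guess, A_guess
            x = x_tilde
            lam_guess = lam / alpha
        else:                                  # lines 12-16: undershoot => damp momentum
            gamma = lam_guess / lam
            a = gamma * a_guess
            A_new = A + a
            w = (1.0 - gamma) * A / A_new      # weight on the previous iterate x_t
            x = w * x + (1.0 - w) * x_tilde
            A = A_new
            lam_guess = alpha * lam
        v = v - a * grad_f(x_tilde)            # line 17
    return x
```

Note that every oracle output is used, regardless of how far λ lands from the guess; the only reaction to a bad guess is the multiplicative update of `lam_guess` and, on undershoot, the damping of the momentum weights.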
Different MS oracles attain different rates of convergence when accelerated via the MS framework. In the following definition, we distill a key property that determines this rate.

Definition 2 (Movement bound). For s ≥ 1, c, λ > 0, and x, y ∈ R^d we say that (x, y, λ) satisfy an (s, c)-movement bound if

  ‖x − y‖ ≥ (λ/c^s)^{1/(s−1)}  if 1 < s < ∞,   and   ‖x − y‖ ≥ 1/c  if s = ∞,   (3)

where a (1, c)-movement bound simply means that λ ≤ c.

In the next section, we will show how to build MS oracles that, given query y, output (x, λ) such that (x, y, λ) always satisfy an (s, c)-movement bound, for certain s and c depending on the oracle type and function structure (e.g., level of smoothness). For example, when f has H-Lipschitz Hessian, the cubic-regularized Newton step with M = 2H is a 1/2-MS oracle that guarantees a (2, √H)-movement bound.

With the necessary definitions in hand, we are ready to state our main result: the iteration (and MS oracle query) complexity of Algorithm 1.

Theorem 1. Let f : R^d → R be convex and differentiable, and consider Algorithm 1 with parameters α > 1,⁶ λ'_0 > 0, and a σ-MS oracle (Definition 1) for f with σ ∈ [0, 0.99). Let s ≥ 1 and c > 0, and suppose that for all t such that λ_t > λ'_t or t = 1, the iterates (x̃_t, y_{t−1}, λ_t) satisfy an (s, c)-movement bound (Definition 2). There exists a constant C_{α,s}, depending only on α and s, such that for any x* ∈ R^d and ε > 0 we have f(x_T) − f(x*) ≤ ε whenever

  T ≥ C_{α,s} ( (c^s ‖x_0 − x*‖^{s+1} / ε)^{2/(3s+1)} + log_α( λ_1 ‖x_0 − x*‖² / ε ) )   if s < ∞,
  T ≥ C_{α,s} ( (c ‖x_0 − x*‖)^{2/3} + log_α( λ_1 ‖x_0 − x*‖² / ε ) )   if s = ∞.

⁴Algorithm 0 simplifies the bisection routine of Monteiro and Svaiter [32] and implicitly assumes that an initial interval [λ_ℓ, λ_h] always contains a valid solution. One can guarantee such an interval exists by selecting a very small λ_ℓ and a very large λ_h. Alternatively, one may construct a valid initial interval via a bracketing procedure, as we do in the empirical comparison. Either way, the cost is logarithmic in problem parameters.
⁵It is also possible to set x_{t+1} = argmin_{x∈{x̃_{t+1}, x_t}} f(x) instead of the convex combination in line 15 and maintain our theoretical guarantees.
⁶For a fixed s ≥ 1, the value of α minimizing our complexity bound is an explicit function of s alone. In practice, performance is not sensitive to the choice of α (see Appendix E.3).

Proof sketch. The remainder of this section is an overview of the proof of Theorem 1, which we provide in full in Appendix B. To simplify this proof sketch, we treat α, c, and 1/(1 − σ) as O(1), and focus on s < ∞. To highlight the novel aspects of the proof, let us first briefly recall the analysis of Algorithm 0 [32, 19, 7, 23, 12]. For every t ≤ T let

  E_t := f(x_t) − f(x*),   D_t := (1/2)‖v_t − x*‖²   and   M_t := (1/2)‖x̃_t − y_{t−1}‖².

The key facts about the standard MS iterations are

  A_T E_T + D_T ≤ D_0,   Σ_{t∈[T]} λ_t A_t M_t ≤ O(D_0),   and   √A_T ≥ Ω( Σ_{t∈[T]} 1/√λ_t ).   (4)

The first fact implies that the optimality gap at iteration T is inversely proportional to A_T, while the latter two facts imply that A_T grows rapidly. More specifically, substituting the movement bound M_t ≥ Ω((λ_t)^{2/(s−1)}) and λ'_t = Θ(λ_t) (thanks to the bisection) yields Σ_{t∈[T]} λ_t^{(s+1)/(s−1)} A_t = O(D_0). Combining this with the third fact in (4) and using the reverse Hölder inequality allows one to conclude that, for k = (s+1)/(3s+1) and k' = (s−1)/(3s+1), the quantity √A_T is lower bounded in terms of D_0 and the powers A_i^k, which, upon further algebraic manipulation, yields A_T ≥ Ω(T^{(3s+1)/2} D_0^{−(s−1)/2}). Plugging this back into the first fact in (4) gives the claimed convergence rate.

Having described the standard MS analysis, we move on to our algorithm. Our first challenge is re-establishing the facts (4). The difficult case is λ_t > λ'_t, where the standard cancellation that occurs in the MS analysis may fail. This is where the momentum damping mechanism (lines 14 and 15 of our algorithm) comes into play, allowing us to show that (see Proposition 1 in the appendix)

  A_T E_T ≤ O(D_0),   Σ_{t∈S^>_T∪{1}} λ_t A_t M_t ≤ O(D_0),   and   √A_T ≥ Ω( Σ_{t∈S^≤_T} 1/√λ_t ),   (5)

where S^≤_T := {t ∈ [T] | λ_t ≤ λ'_t} and S^>_T is analogously defined. Comparing (4) and (5), the price of removing the bisection becomes evident: at each iteration (except the first) only one of the terms forcing the growth of A_t receives a contribution.

The second challenge of our proof is establishing a lower bound on √A_T in terms of the 1/√λ_t values for t ∈ S^>_T ∪ {1}, where the movement bound holds for M_t. This is where the multiplicative λ' update rule (lines 11 and 16 of the algorithm) comes into play: it allows us to credit the contribution of every down iterate (in S^≤_T) to an adjacent up iterate (in S^>_T ∪ {1}) and furthermore argue that the contribution gets an exponential bonus based on the distance between the two. Consequently, we are able to identify a set Q_T ⊆ S^>_T ∪ {1} of iterates and a sequence {r_t} with Σ_{t∈[T]} r_t = T − 1 such that (see Lemma 1)

  √A_T ≥ Ω( Σ_{t∈Q_T} r_t / √λ_t ).

Repeating the reverse Hölder argument of prior work, we obtain the recursive bound (6), which lower bounds √A_T in terms of D_0 and the weighted powers r_t A_t^k over t ∈ Q_T, with k = (s+1)/(3s+1) and k' = (s−1)/(3s+1) as before.
The final challenge of our proof is to show that such a recursion implies sufficient growth of A_t. This is where careful algebra comes into play; we show that (6) implies A_T ≥ Ω( (Σ_{t∈[T]} r_t)^{(3s+1)/2} D_0^{−(s−1)/2} ) (see Lemmas 3 and 4), which establishes our result since Σ_{t∈[T]} r_t = T − 1.

3 MS oracle implementations

In this section we describe several oracles that satisfy both Definition 1 (the MS condition) and Definition 2 (movement bounds) and may therefore be used by Algorithm 1. Section 3.1 briefly reviews oracles that have appeared in prior work, while Section 3.2 and Section 3.3 describe our new adaptive oracle implementations. We summarize the key oracle properties and resulting complexity bounds in Table 1.

3.1 Oracles from prior work

Here we consider several previously-studied oracles of the form (x, λ) = O(y), where we omit the second argument λ' since prior work does not leverage it to improve implementation efficiency.

| Assumption | Oracle | Complexity with Algorithm 1 | Lower bound |
|---|---|---|---|
| ∇^p f is (1, ν)-Hölder | O_{p,ν}-reg | O(ε^{−2/(3(p+ν)−2)}) evals of ∇^p f | Ω(ε^{−2/(3(p+ν)−2)}) |
| ∇³f is 1-Lipschitz | O_{3-reg-so} | O(ε^{−1/5}) Hessian evals | N/A |
| - | O_{r-ball} | O(r^{−2/3} + log(1/ε)) oracle calls | - |
| Stable Hessian | O_{r-BaCoN} | O(r^{−2/3} + log(1/ε)) Hessian evals | - |
| ∇²f is (1, ν)-Hölder | O_aMSN (Alg. 2) | O(ε^{−2/(4+3ν)}) Hessian evals + Õ(1) linear systems | - |
| ∇f is 1-Lipschitz and ∇²f is (1, ν)-Hölder | O_aMSN-fo (Alg. 3) | O(ε^{−1/2}) + Õ(1) first-order evals, O(ε^{−2/(4+3ν)}) iterations | - |

Table 1. Complexity bounds for finding x such that f(x) − f(x*) ≤ ε assuming ‖x_0 − x*‖ ≤ 1, attained by MS oracles from the literature (top four rows, described in Section 3.1) and oracles we develop (bottom two rows). In all cases we improve on prior work by a logarithmic factor. We require p + ν ≥ 2. Our adaptive oracles do not require knowledge of continuity constants or even the Hölder order ν ∈ [0, 1].

Gradient descent step [e.g., 34]. As a gentle start, consider the oracle O_gd(y) = (y − η∇f(y), 1/η), i.e., an oracle that returns x by taking a standard gradient step of size η, together with λ = 1/η. Obviously, the oracle always satisfies a (1, 1/η)-movement bound. Moreover, if we assume that ∇f is L-Lipschitz, then

  ‖x − (y − (1/λ)∇f(x))‖ = η‖∇f(x) − ∇f(y)‖ ≤ ηL‖x − y‖.

Therefore, when 1/η ≥ L/σ the oracle is a σ-MS oracle.

Taylor descent step [5, 35, 19, 7, 23, 41]. Generalizing both O_gd and the cubic-regularized Newton step oracle O_cr, we define for every integer p ≥ 1 and ν ∈ [0, 1] the oracle O_{p,ν}-reg that, for parameter M and input y, returns (x, λ) = O_{p,ν}-reg(y) where

  x ∈ argmin_{x'} { f̃_p(x'; y) + (M/(p!(p+ν)))‖x' − y‖^{p+ν} }   and   λ = (M/p!)‖x − y‖^{p+ν−2},   (7)

and f̃_p(x; y) := Σ_{i=0}^{p} (1/i!)∇^i f(y)[(x − y)^{⊗i}] is the Taylor expansion of f around y evaluated at x. The oracles O_gd and O_cr correspond to the special cases O_{1,1}-reg (with η = M^{−1}) and O_{2,1}-reg, respectively. Clearly, by definition, the oracle always satisfies a (p + ν − 1, (M/p!)^{1/(p+ν−1)})-movement bound. Moreover, it is easy to show that

  ‖x − (y − (1/λ)∇f(x))‖ = (p! / (M‖x − y‖^{p+ν−2})) ‖∇f(x) − ∇f̃_p(x; y)‖.

For any p ≥ 1 and ν ∈ [0, 1] we say that ∇^p f is (H, ν)-Hölder if for all x, y we have ‖∇^p f(x) − ∇^p f(y)‖_op ≤ H‖x − y‖^ν. (An (H, 1)-Hölder derivative is H-Lipschitz.) If ∇^p f is (H, ν)-Hölder, Taylor's theorem gives ‖∇f(x) − ∇f̃_p(x; y)‖ ≤ (H/p!)‖x − y‖^{p+ν−1} [41, Lemma 2.5], and so ‖x − (y − (1/λ)∇f(x))‖ ≤ (H/M)‖x − y‖. Therefore, when M ≥ H/σ the oracle is a σ-MS oracle.
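Both oracles above follow the same pattern: produce a candidate point together with a proximal parameter λ, then verify condition (2), which requires only one extra gradient evaluation. A minimal illustration (our own, assuming a callable `grad_f` for ∇f and a step size `eta` as above):

```python
import numpy as np

def is_ms_step(x, y, lam, grad_f, sigma):
    """Test the sigma-MS condition (2): ||x - (y - grad_f(x)/lam)|| <= sigma * ||x - y||."""
    lhs = np.linalg.norm(x - (y - grad_f(x) / lam))
    return lhs <= sigma * np.linalg.norm(x - y)

def gradient_step_oracle(y, grad_f, eta):
    """The oracle O_gd: a gradient step of size eta, reported with lambda = 1/eta.
    By the argument above, it is a sigma-MS oracle whenever eta <= sigma / L."""
    x = y - eta * grad_f(y)
    return x, 1.0 / eta
```

The adaptive oracles of the next two subsections turn this check into the driver of the search for λ, rather than an assumption verified in hindsight.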
Exploiting third-order smoothness with a second-order oracle [36, 26]. For p > 2, computing O_{p,ν}-reg is typically intractable due to the need to compute the high-order derivative tensors ∇³f(y), ∇⁴f(y), . . . , ∇^p f(y). Nevertheless, for p = 3 Nesterov [36] designs an approximate solver for (7), which we denote O_{3-reg-so}, using only ∇²f(y) and a logarithmic number of gradient evaluations. When ∇³f is (L_3, 1)-Hölder, [36] shows that O_{3-reg-so} is a valid MS oracle satisfying a (3, O(L_3^{1/3}))-movement bound, on par with the movement bound of O_{3,1}-reg.

Exact ball optimization oracle [12]. For a given query y, consider the exact minimizer of f constrained to a ball of radius r around y, i.e., consider an oracle O_{r-ball} such that (x, λ) = O_{r-ball}(y) satisfy

  x ∈ argmin_{x' : ‖x' − y‖ ≤ r} f(x')   and   λ = ‖∇f(x)‖ / ‖x − y‖.

One may easily verify that (unless λ = ‖∇f(x)‖ = 0) we have x = y − (1/λ)∇f(x), and therefore the oracle is a 0-MS oracle. Moreover, when f is convex, we have either ‖x − y‖ = r or x is a global minimizer of f, and so we may assume without loss of generality that the oracle satisfies an (∞, 1/r)-movement bound.

Ball-Constrained Newton (BaCoN) oracle [12]. Exactly implementing O_{r-ball} is generally intractable. Nevertheless, Carmon et al. [12, Alg. 3] describe a method, which we call O_{r-BaCoN}, based on solving a sequence of Õ(1) trust-region problems (ball-constrained Newton steps); for functions that are O(1)-Hessian stable in a ball of radius r (or 1/r-quasi-self-concordant) and have a finite condition number, it outputs (x, λ) satisfying the 1/2-MS oracle condition and an (∞, O(1/r))-movement bound. Implementing O_{r-BaCoN} requires only a single Hessian evaluation and a number of linear system solutions that is polylogarithmic in problem parameters. Subsequent works implementing ball oracles [13, 4, 11] satisfy an approximation guarantee different from the MS condition, similar to the one we describe in Appendix C.

3.2 An adaptive Monteiro-Svaiter-Newton oracle

The oracle implementations in Section 3.1 satisfy movement bounds by design and the MS condition (2) by assumption. For example, the cubic-regularized Newton step oracle O_cr is guaranteed to satisfy the MS condition only when the regularization parameter M is sufficiently larger than the Lipschitz constant of ∇²f. This suggests that M must be carefully tuned to ensure good performance. Prior work attempts to dynamically adjust M in order to meet certain approximation conditions [14, 20, 21, 24]. However, even computing a single cubic-regularized Newton step entails searching for a λ that satisfies ‖[∇²f(y) + λI]^{−1}∇f(y)‖ = 2λ/M. Therefore, such a search over M is essentially a (potentially) redundant double search over λ.

We propose a more direct and more adaptive MS oracle recipe: search for the smallest λ for which the regularized Newton step x = y − [∇²f(y) + λI]^{−1}∇f(y) satisfies the MS condition (2).⁷ This yields a valid MS oracle by construction, independently of any assumption. Moreover, it is simple to argue that when ∇²f is (H, ν)-Hölder continuous for some ν ∈ [0, 1], such an oracle would guarantee the same movement bound as O_{2,ν}-reg with the best choice of parameters M and ν (see Appendix D.1), even though our recipe requires neither of these parameters!

Exactly fulfilling this recipe, i.e., finding the ideal minimal λ* that satisfies the MS condition, is difficult. Fortunately, to adaptively guarantee movement bounds, it suffices to find a value λ such that the corresponding regularized Newton step satisfies the MS condition, while the step corresponding to λ/2 does not; Algorithm 2 finds precisely such a λ.

Let us describe the operation of Algorithm 2. If the input λ' is invalid (i.e., its corresponding regularized Newton step does not satisfy the MS condition, so that CHECKMS(λ'; y, σ) evaluates to False), we set λ_invld ← λ' and test a double-exponentially increasing series of λ's, until reaching a valid λ_vld (line 11).
If λ' is valid and the LAZY flag is set, we return it immediately. Otherwise (if LAZY is not set) we set λ_vld = λ' and decrease it at a double-exponential rate until finding an invalid λ_invld (line 5). In either case (so long as LAZY is not set) we obtain an (invalid, valid) pair (λ_invld, λ_vld) such that λ_vld/λ_invld = 2^{2^{k*}}, at the cost of 2 + k* linear system solutions. We then perform precisely k* log-scale bisection steps in order to shrink λ_vld/λ_invld down to 2, while maintaining the invariant that λ_vld is valid and λ_invld is invalid (line 15).

The following theorem bounds the complexity of Algorithm 2 in terms of the number of linear system solutions, and establishes a movement bound for its output assuming that ∇²f is locally Hölder around the query point. We defer the proof of the theorem and of the corollary that follows it to Appendix D.2.

Theorem 2. Algorithm 2 with parameter σ is a σ-MS oracle O_aMSN. For any y ∈ R^d and λ' > 0, computing (x, λ) = O_aMSN(y; λ') requires at most 2 + 2 log₂ log₂ max{λ/λ', λ'/λ, 2} linear system solutions. If LAZY is False or λ > λ', and if ∇²f is (H, ν)-Hölder in a ball of radius 2‖x − y‖ around y, then (x, y, λ) satisfy a (1 + ν, (2H/σ)^{1/(1+ν)}/σ)-movement bound.

⁷The prior works [30, 17] also directly consider quadratically-regularized Newton steps, but employ approximation conditions other than (2) to select the parameter λ.

Algorithm 2: O_aMSN
Input: Query y ∈ R^d, λ' > 0. Flag LAZY.
Parameters: MS factor σ ∈ (0, 1).
1   if CHECKMS(λ'; y, σ) then
2     if LAZY then
3       return y − [∇²f(y) + λ'I]^{−1}∇f(y), λ'
4     λ_vld ← λ', k ← 0
5     while CHECKMS(λ_vld/2^{2^k}; y, σ) do
6       λ_vld ← λ_vld/2^{2^k}
7       k ← k + 1
8     k* ← k, λ_invld ← λ_vld/2^{2^{k*}}
9   else
10    λ_invld ← λ', k ← 0
11    while not CHECKMS(λ_invld · 2^{2^k}; y, σ) do
12      λ_invld ← λ_invld · 2^{2^k}
13      k ← k + 1
14    k* ← k, λ_vld ← λ_invld · 2^{2^{k*}}
15  while λ_invld < λ_vld/2 do
16    λ ← √(λ_invld λ_vld)
17    if CHECKMS(λ; y, σ) then λ_vld ← λ
18    else λ_invld ← λ
19  return y − [∇²f(y) + λ_vld I]^{−1}∇f(y), λ_vld
20  function CHECKMS(λ; y, σ)
21    x ← y − [∇²f(y) + λI]^{−1}∇f(y)
22    if ‖x − (y − (1/λ)∇f(x))‖ ≤ σ‖x − y‖ then return True
23    else return False

Algorithm 3: O_aMSN-fo
Input: Query y ∈ R^d, λ' > 0. Flag LAZY.
Parameters: MS factor σ ∈ (0, 1).
1   λ ← λ', FAILEDCHECK ← False
2   repeat
3     A ← ∇²f(y) + λI, b ← −∇f(y)   ⊳ apply MinRes/Conjugate Residuals [18] until obtaining w s.t. ‖Aw − b‖ ≤ (λσ/2)‖w‖
4     x ← y + CONJRES(A, b, λσ)
5     if ‖x − (y − (1/λ)∇f(x))‖ ≤ σ‖x − y‖ then
6       if LAZY or FAILEDCHECK then
7         return x, λ
8       else λ ← λ/2
9     else
10      FAILEDCHECK ← True
11      λ ← 2λ
12  function CONJRES(A, b, λσ)
13    w_0 ← 0
14    p_0 ← r_0 ← Aw_0 − b   ⊳ r_i = Aw_i − b
15    s_0 ← q_0 ← Ar_0   ⊳ s_i = Ar_i, q_i = Ap_i
16    i ← 0
17    while ‖r_i‖ > (λσ/2)‖w_i‖ do
18      w_{i+1} ← w_i − (⟨r_i, s_i⟩/‖q_i‖²) p_i
19      r_{i+1} ← r_i − (⟨r_i, s_i⟩/‖q_i‖²) q_i
20      s_{i+1} ← A r_{i+1}
21      p_{i+1} ← (⟨r_{i+1}, s_{i+1}⟩/⟨r_i, s_i⟩) p_i + r_{i+1}
22      q_{i+1} ← (⟨r_{i+1}, s_{i+1}⟩/⟨r_i, s_i⟩) q_i + s_{i+1}
23      i ← i + 1
24    return w_i

To understand the LAZY option of Algorithm 2, note that when λ' is valid we necessarily output λ ≤ λ'. In this case Theorem 1 does not require a movement bound (except for the first iteration). Therefore, we might as well save on computation and return λ'. The following Corollary 3 gives the overall complexity bound for the combination of Algorithm 1 and O_aMSN, leveraging lazy oracle calls to show that the number of linear system solves per iteration is essentially constant.

Corollary 3. Consider Algorithm 1 with initial point x_0, parameters satisfying 1.1 ≤ α = O(1) and λ'_0 > 0, and σ-MS oracle O_aMSN (with LAZY = True in all but the first iteration) with σ ∈ (0.01, 0.99). For any H, ε > 0, ν ∈ [0, 1] and any x* ∈ R^d, if f is convex with (H, ν)-Hölder Hessian, the algorithm produces an iterate x_T such that f(x_T) ≤ f(x*) + ε using T = O( (H‖x_0 − x*‖^{2+ν}/ε)^{2/(4+3ν)} ) Hessian evaluations and O(T + log log max{λ'_0 R²/ε, HR^ν/λ'_0, 2}) linear system solutions, where R is the distance between x_0 and argmin_{x'} f(x').
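For concreteness, here is a compact Python sketch of the λ-search in Algorithm 2, using exact numpy solves. It is our transcription under the definitions above (including the lazy shortcut), not the authors' implementation, and it omits safeguards against pathological inputs.

```python
import numpy as np

def a_msn_oracle(y, lam0, grad_f, hess_f, sigma=0.5, lazy=False):
    """Sketch of O_aMSN: find lambda whose regularized Newton step passes the MS
    check (2) while (roughly) lambda/2 does not, via double-exponential bracketing
    followed by log-scale bisection."""
    g, H = grad_f(y), hess_f(y)
    eye = np.eye(y.shape[0])

    def newton_step(lam):
        return y - np.linalg.solve(H + lam * eye, g)

    def check_ms(lam):
        x = newton_step(lam)
        return np.linalg.norm(x - (y - grad_f(x) / lam)) <= sigma * np.linalg.norm(x - y)

    if check_ms(lam0):
        if lazy:                                 # lazy shortcut: lam0 is already valid
            return newton_step(lam0), lam0
        lam_valid, k = lam0, 0                   # search downward for an invalid lambda
        while check_ms(lam_valid / 2.0 ** (2 ** k)):
            lam_valid /= 2.0 ** (2 ** k)
            k += 1
        lam_invalid = lam_valid / 2.0 ** (2 ** k)
    else:
        lam_invalid, k = lam0, 0                 # search upward for a valid lambda
        while not check_ms(lam_invalid * 2.0 ** (2 ** k)):
            lam_invalid *= 2.0 ** (2 ** k)
            k += 1
        lam_valid = lam_invalid * 2.0 ** (2 ** k)

    while lam_invalid < lam_valid / 2.0:         # log-scale bisection down to a factor 2
        lam = np.sqrt(lam_invalid * lam_valid)
        if check_ms(lam):
            lam_valid = lam
        else:
            lam_invalid = lam
    return newton_step(lam_valid), lam_valid     # final solve recomputed for clarity
```

Each call to `check_ms` costs one linear system solve plus one gradient evaluation, which is where the double-logarithmic count in Theorem 2 comes from.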
Note that as long as λ'_0 is in the range [2^{−2^T} HR^ν, 2^{2^T} εR^{−2}], the double-logarithmic term in our bound on the number of linear system solutions is O(T). Therefore, the overall bound is O(T) for an extremely large range of λ'_0.

3.3 First-order implementation via MinRes/Conjugate Residuals

We now present a first-order implementation of our adaptive oracle, O_aMSN-fo (Algorithm 3), which replaces exact linear system solutions with approximations obtained via Hessian-vector products and the MinRes/Conjugate Residuals method [42, 18]. Similar to Algorithm 2, the algorithm searches for λ such that x_λ ≈ y − [∇²f(y) + λI]^{−1}∇f(y) satisfies the MS condition, but x_{λ/2} does not. Departing from the double-exponential scheme of Algorithm 2, here we adopt the following doubling scheme, which allows us to control the cost of the x_λ approximation. If λ' is such that x_{λ'} does not satisfy the MS condition, we repeatedly test λ = 2λ', 4λ', 8λ', . . . and return the first one for which x_λ satisfies the MS condition. If x_{λ'} satisfies the MS condition and the algorithm is LAZY, we immediately return it. Otherwise, we repeatedly test λ = λ'/2, λ'/4, λ'/8, . . . until reaching a λ for which x_λ does not satisfy the MS condition, and return x_{2λ}.

The subroutine CONJRES of Algorithm 3 takes as input a matrix A, a vector b, and an accuracy parameter λσ, and iteratively generates iterates {w_i} that approximate A^{−1}b. The construction of the MinRes/Conjugate Residuals method guarantees that w_i minimizes the norm of the residual r_i = Aw_i − b in the Krylov subspace span{b, Ab, . . . , A^{i−1}b}. The key algorithmic decision here is when to stop the iterations: stop too early, and the approximation of the Newton step might not be accurate enough to guarantee a movement bound; stop too late, and incur a high Hessian-vector product complexity. We introduce a simple stopping condition (line 17) that strikes a balance. On the one hand, we show that whenever the condition ‖r_i‖ ≤ (λσ/2)‖w_i‖ holds, the resulting point x can certify roughly the same movement bounds as exact Newton steps. On the other hand, by invoking the complexity bounds in [28] and using the optimality of ‖r_i‖, we guarantee that the stopping condition is met within a number of iterations proportional to 1/√λ. The structure of our doubling scheme for λ then allows us to relate the overall first-order complexity to the lowest value of λ queried, obtaining the following guarantees. See proofs in Appendix D.3.

Theorem 4. Algorithm 3 with parameter σ is a σ-MS oracle O_aMSN-fo. For any y ∈ R^d and λ' > 0, computing (x, λ) = O_aMSN-fo(y; λ') requires at most O(1 + √(‖∇²f(y)‖_op / (σ min{λ, λ'}))) Hessian-vector product computations and O(1 + log₂ max{λ/λ', λ'/λ}) gradient computations. If LAZY is False or λ > λ', and if ∇²f is (H, ν)-Hölder, then (x, y, λ) satisfy a (1 + ν, (6H/σ)^{1/(1+ν)})-movement bound.

Corollary 5. Consider Algorithm 1 with initial point x_0, parameters satisfying 1.1 ≤ α = O(1) and λ'_0 > 0, and σ-MS oracle O_aMSN-fo with LAZY set to True in all but the first iteration and σ ∈ (0.01, 0.99). For any L, H, ε > 0, ν ∈ [0, 1] and any x* ∈ R^d, if f is convex with (H, ν)-Hölder Hessian and L-Lipschitz gradient, the algorithm produces an iterate x_T such that f(x_T) ≤ f(x*) + ε within T = O((H‖x_0 − x*‖^{2+ν}/ε)^{2/(4+3ν)}) iterations and at most O(√(L‖x_0 − x*‖²/ε)) + Õ(1) gradient and Hessian-vector product evaluations, where Õ(1) hides factors logarithmic in L, λ'_0, ‖x_0 − x*‖, and 1/ε.

Note that the L-Lipschitz gradient assumption implies an (L, 0)-Hölder Hessian assumption, giving the iteration complexity bound we state in Table 1. Moreover, note that our algorithm has the optimal O(√(L‖x_0 − x*‖²/ε)) complexity for any λ'_0 in the range from ε/‖x_0 − x*‖² up to a polynomial in L, ‖x_0 − x*‖, and 1/ε. By choosing a large λ'_0 (say 10⁶) we may guarantee that only the logarithmic term is added to the optimal first-order evaluation complexity.
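Below is a minimal sketch of the conjugate-residual subroutine with the adaptive stopping rule ‖r_i‖ ≤ (σλ/2)‖w_i‖. The recurrences are the standard CR ones, written so that the matrix is accessed only through a Hessian-vector product callable; Algorithm 3 wraps such a solver in the doubling search over λ described above. This is our illustration, not the authors' code.

```python
import numpy as np

def conj_res(matvec, b, lam, sigma, max_iter=1000):
    """Conjugate-residual (MinRes-type) iteration for A w = b, where matvec(v) = A v
    (here A = hessian + lam * I, accessed only via Hessian-vector products).
    Stops once ||A w - b|| <= (sigma * lam / 2) * ||w||, the criterion of Algorithm 3."""
    w = np.zeros_like(b)
    r = b - matvec(w)                 # residual b - A w
    p, Ar = r.copy(), matvec(r)
    Ap = Ar.copy()
    rAr = float(r @ Ar)
    for _ in range(max_iter):
        rnorm = np.linalg.norm(r)
        if rnorm == 0.0 or rnorm <= 0.5 * sigma * lam * np.linalg.norm(w):
            break
        alpha = rAr / float(Ap @ Ap)  # CR step size
        w = w + alpha * p
        r = r - alpha * Ap
        Ar_new = matvec(r)
        rAr_new = float(r @ Ar_new)
        beta = rAr_new / rAr          # CR conjugation coefficient
        p = r + beta * p
        Ap = Ar_new + beta * Ap
        Ar, rAr = Ar_new, rAr_new
    return w

# Usage sketch: w approximates -(hess(y) + lam I)^{-1} grad_f(y), so x = y + w is the
# candidate regularized Newton step tested against the MS condition (2), e.g.
#   w = conj_res(lambda v: hess_vec(y, v) + lam * v, -grad_f(y), lam, sigma=0.5)
```

Because the stopping threshold scales with λ, lowering λ both loosens the target accuracy per unit of ‖w‖ and worsens the conditioning, which is why the cost in Theorem 4 is governed by the smallest λ queried.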
4 Experiments

We conduct three sets of experiments. First, we consider O_cr with a fixed parameter M and compare previous acceleration schemes to Algorithm 1. Second, we combine Algorithm 1 with our adaptive O_aMSN and test it against previous adaptive accelerated (second-order) methods and Newton's method. Finally, we compare Algorithm 1 with our first-order adaptive oracle O_aMSN-fo to other first-order methods. We provide full implementation details in Appendix E.1.

Figure 1 summarizes our results for logistic regression on the a9a dataset [15]; see Appendix E.2 for similar results on three additional datasets. These experiments were conducted with no tuning of Algorithm 1: the parameters σ and α were simply set to 1/2 and 2, respectively. An additional experiment, reported in Appendix E.3, shows that the algorithm is indeed insensitive to that choice.

Figure 1. Empirical results for logistic regression on the a9a dataset. See Section 4 for description, and Appendix E.2 for additional datasets. Boldface legend entries denote methods we contribute.

Non-adaptive methods. We use the non-adaptive oracle O_cr (1), and take M to be 0.2·Ĥ where, for feature vectors φ_1, . . . , φ_n, Ĥ = ‖(1/n)Σ_{i∈[n]} φ_iφ_i^⊤‖_op · max_{i∈[n]}‖φ_i‖ is an upper bound on 6√3 ≈ 10 times the Lipschitz constant of the logistic regression Hessian [see, e.g., 41]. Fixing the MS oracle allows for a controlled comparison of different acceleration schemes: Figure 1(a) shows that standard MS acceleration with a carefully-implemented bisection outperforms standard cubic regularization (CR) and its accelerated counterpart (ACR) [33, Alg. 4.8], and removing the bisection via Algorithm 1 yields the best results. We also implemented the heuristic suggested by Song et al. [41], where instead of a bisection in Algorithm 0 we select a sequence λ'_t such that A_t = (t/3)^{7/2} / (M‖x_0 − x*‖). In Appendix E.4 we tune the M parameter for each method separately, finding that the optimal M for CR is near 0, so that O_cr is nearly a Newton step (and not a valid MS oracle).

Adaptive methods and Newton's method. We compare the following adaptive accelerated second-order methods (which do not require an estimate of the Hessian Lipschitz constant): Adaptive ACR [21, Algorithm 4] (which adaptively sets M in O_cr), standard MS acceleration (Algorithm 0) with O_aMSN (Algorithm 2, with LAZY = False), and Algorithm 1 with O_aMSN (with LAZY = True in all but the first iteration). Figure 1(b) shows that the latter converges significantly faster than the other adaptive acceleration schemes. However, the classical unaccelerated Newton iteration x_{t+1} = x_t − (∇²f(x_t))^{−1}∇f(x_t) strongly outperforms all accelerated methods, indicating that momentum mechanisms might actually be slowing down convergence in logistic regression problems. To test this, we consider the following simple iteration of (the non-lazy variant of) our oracle: x_{t+1}, λ_{t+1} = O_aMSN(x_t; λ_t/2); it significantly improves over Algorithm 1. These results beg the question: is momentum ever useful for second-order methods? In Appendix E.5 we test different schemes on the lower bound construction [3, 21]. We find momentum is helpful for O_cr, but not for the adaptive oracle O_aMSN. What makes Newton's method perform so well on logistic regression, and whether simply iterating O_aMSN is worst-case optimal, are important questions for future work.
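In code, this unaccelerated baseline is just a short loop around an adaptive oracle such as the `a_msn_oracle` sketch given after Corollary 3; the names `x0`, `grad_f`, `hess_f`, and `num_iters` below are assumed from context.

```python
# Simple adaptive Newton-type iteration: x_{t+1}, lam_{t+1} = O_aMSN(x_t; lam_t / 2).
x, lam = x0.copy(), 1.0
for _ in range(num_iters):
    x, lam = a_msn_oracle(x, lam / 2.0, grad_f, hess_f, sigma=0.5, lazy=False)
```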
First-order methods. We compare our first-order adaptive O_aMSN-fo (Algorithm 3) to the following baselines: gradient descent and accelerated gradient descent [38] with a tuned step size, and L-BFGS-B from SciPy [10, 44, 43]. In light of the above comparison with Newton's method, we also test the following simple iteration of (the lazy variant of) our oracle: x_{t+1}, λ_{t+1} = O_aMSN-fo(x_t; λ_t/2). Figure 1(c) shows that forgoing (second-order) momentum is better for the first-order oracle, too: Algorithm 1 performs comparably to tuned AGD (without tuning a single parameter), and the equally adaptive O_aMSN-fo iteration performs comparably to L-BFGS-B.

Acknowledgments

We thank the anonymous reviewers for helpful questions and suggestions, leading to important writing clarifications, a simplification of the proof of Lemma 3, and an improved complexity bound for second-order methods under third-order smoothness (due to the observation that our framework is compatible with the oracle of [36]). YC and DH were supported in part by the Israeli Science Foundation (ISF) grant no. 2486/21, the Len Blavatnik and the Blavatnik Family Foundation, and the Adelis Foundation. YJ was supported in part by a Stanford Graduate Fellowship and the Dantzig-Lieberman Fellowship. AS was supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research award, and a Sloan Research Fellowship.

References

[1] D. Adil, B. Bullins, A. Jambulapati, and S. Sachdeva. Line search-free methods for higher-order smooth monotone variational inequalities. arXiv:2205.06167, 2022.
[2] M. M. Alves. Variants of the A-HPE and large-step A-HPE algorithms for strongly convex problems with applications to accelerated high-order tensor methods. arXiv:2102.02045, 2021.
[3] Y. Arjevani, O. Shamir, and R. Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, 178(1-2):327-360, 2019.
[4] H. Asi, Y. Carmon, A. Jambulapati, Y. Jin, and A. Sidford. Stochastic bias-reduced gradient methods. Advances in Neural Information Processing Systems, 34, 2021.
[5] M. Baes. Estimate sequence methods: extensions and approximations. Optimization Online.
[6] S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Complexity of highly parallel non-smooth convex optimization. arXiv:1906.10655 [math.OC], 2019.
[7] S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Near-optimal method for highly smooth convex optimization. In Proceedings of the Thirty Second Annual Conference on Computational Learning Theory, pages 492-507, 2019.
[8] B. Bullins. Highly smooth minimization of non-smooth problems. In Conference on Learning Theory, pages 988-1030, 2020.
[9] B. Bullins and K. A. Lai. Higher-order methods for convex-concave min-max optimization and monotone variational inequalities. arXiv:2007.04528, 2020.
[10] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190-1208, 1995.
[11] Y. Carmon and D. Hausler. Distributionally robust optimization via ball oracle acceleration. arXiv:2203.13225, 2022.
[12] Y. Carmon, A. Jambulapati, Q. Jiang, Y. Jin, Y. T. Lee, A. Sidford, and K. Tian. Acceleration with a ball optimization oracle. In Advances in Neural Information Processing Systems, 2020.
[13] Y. Carmon, A. Jambulapati, Y. Jin, and A. Sidford. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Conference on Learning Theory, 2021.
[14] C. Cartis, N. I. M.
Gould, and P. L. Toint. Adaptive cubic regularisation methods for uncon- strained optimization. part II: worst-case functionand derivative-evaluation complexity. Math. Program., 130(2):295 319, 2011. [15] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1 27:27, 2011. Software available at http: //www.csie.ntu.edu.tw/~cjlin/libsvm. [16] N. Doikov and Y. Nesterov. Inexact tensor methods with dynamic accuracies. In International Conference on Machine Learning, 2020. [17] N. Doikov and Y. Nesterov. Gradient regularization of Newton method with Bregman distances. ar Xiv:2112.02952, 2021. [18] D. C.-L. Fong and M. Saunders. CG versus MINRES: An empirical comparison. Sultan Qaboos University Journal for Science (SQUJS), 17(1):44 62, 2012. [19] A. V. Gasnikov, P. E. Dvurechensky, E. Gorbunov, E. A. Vorontsova, D. Selikhanovych, and C. A. Uribe. Optimal tensor methods in smooth convex and uniformly convex optimization. In Proceedings of the Thirty Second Annual Conference on Computational Learning Theory, pages 1374 1391, 2019. [20] N. I. M. Gould, M. Porcelli, and P. L. Toint. Updating the regularization parameter in the adaptive cubic regularization algorithm. Comput. Optim. Appl., 53(1):1 22, 2012. [21] G. N. Grapiglia and Y. Nesterov. Tensor methods for minimizing convex functions with Hölder continuous higher-order derivatives. SIAM Journal on Optimization, 30(4):2750 2779, 2020. [22] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimiza- tion, 2(4):649 664, 1992. [23] B. Jiang, H. Wang, and S. Zhang. An optimal high-order tensor method for convex optimization. In Proceedings of the Thirty Second Annual Conference on Computational Learning Theory, pages 1799 1801, 2019. [24] B. Jiang, T. Lin, and S. Zhang. A unified adaptive tensor approximation scheme to accelerate composite convex optimization. SIAM Journal on Optimization, 30(4):2897 2926, 2020. [25] R. Jiang and A. Mokhtari. Generalized optimistic methods for convex-concave saddle point problems. ar Xiv:2202.09674, 2022. [26] D. Kamzolov and A. Gasnikov. Near-optimal hyperfast second-order method for convex optimization and its sliding. ar Xiv:2002.09050, 2020. [27] D. Kovalev and A. Gasnikov. The first optimal acceleration of high-order methods in smooth convex optimization. In Advances in Neural Information Processing Systems (Neur IPS), 2022. [28] J. Lee, C. Park, and E. Ryu. A geometric structure of acceleration and its role in making gradients small fast. Advances in Neural Information Processing Systems, 2021. [29] T. Lin and M. Jordan. Perseus: A simple high-order regularization method for variational inequalities. ar Xiv:2205.03202, 2022. [30] K. Mishchenko. Regularized Newton method with global O(1/k2) convergence. ar Xiv:2112.02089, 2021. [31] R. D. Monteiro and B. F. Svaiter. Iteration-complexity of a Newton proximal extragradi- ent method for monotone variational inequalities and inclusion problems. SIAM Journal on Optimization, 22(3):914 935, 2012. [32] R. D. C. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim., 23(2): 1092 1125, 2013. [33] Y. Nesterov. Accelerating the cubic regularization of Newton s method on convex problems. Mathematical Programming, 112(1):159 181, 2008. [34] Y. Nesterov. Lectures on convex optimization, volume 137. Springer, 2018. [35] Y. Nesterov. 
Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, 186(1):157 183, 2021. [36] Y. Nesterov. Superfast second-order methods for unconstrained convex optimization. Journal of Optimization Theory and Applications, 191(1):1 30, 2021. [37] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177 205, 2006. [38] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate o(1/k2). In Dokl. akad. nauk Sssr, volume 269, pages 543 547, 1983. [39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. [40] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex analysis, 19(4):1167 1192, 2012. [41] C. Song, Y. Jiang, and Y. Ma. Unified acceleration of high-order algorithms under Hölder continuity and uniform convexity. SIAM Journal on Optimization, 2021. [42] E. Stiefel. Relaxationsmethoden bester strategie zur lösung linearer gleichungssysteme. Com- mentarii Mathematici Helvetici, 29(1):157 179, 1955. [43] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. Vander Plas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. [44] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on mathematical software (TOMS), 23(4):550 560, 1997.