# Inexact Tensor Methods with Dynamic Accuracies

Nikita Doikov¹, Yurii Nesterov²

¹ Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Catholic University of Louvain (UCL), Belgium. ² Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), Belgium. Correspondence to: Nikita Doikov, Yurii Nesterov.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

**Abstract**

In this paper, we study inexact high-order Tensor Methods for solving convex optimization problems with composite objective. At every step of such methods, we use an approximate solution of the auxiliary problem, defined by a bound on the residual in function value. We propose two dynamic strategies for choosing the inner accuracy: the first one decreases as $1/k^{p+1}$, where $p \ge 1$ is the order of the method and $k$ is the iteration counter, while the second one uses the last progress in the target objective as the inner accuracy. We show that inexact Tensor Methods with these strategies achieve the same global convergence rate as in the error-free case. For the second approach we also establish local superlinear rates (for $p \ge 2$) and propose an accelerated scheme. Lastly, we present computational results on a variety of machine learning problems for several methods and different accuracy policies.

## 1. Introduction

### 1.1. Motivation

With the growth of computing power, high-order optimization methods are becoming more and more popular in machine learning, due to their ability to tackle ill-conditioning and to improve the rate of convergence. Based on the work (Nesterov & Polyak, 2006), where global complexity guarantees for the Cubic regularization of the Newton method were established, a significant leap in the development of second-order optimization algorithms was made, leading to stochastic and randomized methods (Kohler & Lucchi, 2017; Tripuraneni et al., 2018; Cartis & Scheinberg, 2018; Doikov & Richtárik, 2018; Wang et al., 2018; Zhou et al., 2019), which have better convergence rates than the corresponding first-order analogues.

The main weakness, though, is that every step of the Newton method is much more expensive: it requires solving a subproblem, namely the minimization of a quadratic function with a regularizer and possibly with some additional nondifferentiable components. For this reason, the idea of employing even higher derivatives in optimization schemes long remained questionable, because of the high cost of the computations. However, it was recently shown (Nesterov, 2019a) that the third-order Tensor Method for convex minimization problems admits a very effective implementation, with a cost comparable to that of a Newton step.

Thus we now have a family of methods (starting from methods of order one), each iteration of which may need to call an auxiliary subsolver. It therefore becomes important to study which level of exactness we need to ensure at every step in order not to lose the fast convergence of the initial method. In this work, we suggest describing the approximate solution of the subproblem in terms of the residual in function value. We propose two strategies for the inner accuracies, which are dynamic (changing with the iterations).
Indeed, there is no need for a very precise solution of the subproblem at the first iterations, but we reasonably ask for higher precision closer to the end of the optimization process.

### 1.2. Related Work

Global convergence of first-order methods with inexact proximal-gradient steps was studied in (Schmidt et al., 2011). The authors considered errors in the residual in function value of the subproblem and required them to decrease with the iterations at an appropriate rate. This setting is the most similar to the current work. In (Cartis et al., 2011a;b), adaptive second-order methods with cubic regularization and inexact steps were proposed. High-order inexact tensor methods were considered in (Birgin et al., 2017; Jiang et al., 2018; Grapiglia & Nesterov, 2019c;d; Cartis et al., 2019; Lucchi & Kohler, 2019). In all of these works, the authors describe the approximate solution of the subproblem in terms of the corresponding first-order optimality condition (using the gradients). This can be difficult to achieve with current optimization schemes, since it is more common to have better (or the only available) guarantees for the decrease of the residual in function value. The latter is used as the measure of inaccuracy in the recent work (Nesterov, 2019b) on inexact Basic Tensor Methods. However, only a constant choice of the accuracy level is considered there.

### 1.3. Contributions

We propose new dynamic strategies for choosing the inner accuracy for general Tensor Methods, and several inexact algorithms based on them, with proven complexity guarantees, summarized next (we denote by $\delta_k$ the required precision for the residual in function value of the auxiliary problem):

- The rule $\delta_k := 1/k^{p+1}$, where $p \ge 1$ is the order of the method and $k$ is the iteration counter. Using this strategy, we propose two optimization schemes: Monotone Inexact Tensor Method I (Algorithm 1) and the Inexact Tensor Method with Averaging (Algorithm 3). Both of them have a global complexity estimate of $O(1/\varepsilon^{1/p})$ iterations for minimizing a convex function up to $\varepsilon$-accuracy (see Theorem 1 and Theorem 5). The latter method seems to be the first primal high-order scheme (aggregating points from the primal space only) having the explicit distance between the starting point and the solution in the complexity bound.
- The rule $\delta_k := c\,\big(F(x_{k-2}) - F(x_{k-1})\big)$, where $F(x_i)$ are the values of the target objective during the iterations and $c \ge 0$ is a constant. We incorporate this strategy into our Monotone Inexact Tensor Method II (Algorithm 2). For this scheme, for minimizing convex functions up to $\varepsilon$-accuracy by methods of order $p \ge 1$, we prove a global complexity proportional to $O(1/\varepsilon^{1/p})$ (Theorem 2). The global rate becomes linear if the objective is uniformly convex (Theorem 3). Assuming that $\delta_k := c\,\big(F(x_{k-2}) - F(x_{k-1})\big)^{\frac{p+1}{2}}$, for methods of order $p \ge 2$ applied to the minimization of a strongly convex objective, we also establish a local superlinear rate of convergence (see Theorem 4).
- Using the technique of Contracting Proximal iterations (Doikov & Nesterov, 2019a), we propose an inexact Accelerated Scheme (Algorithm 4), where at each iteration $k$ we solve the corresponding subproblem with precision $\zeta_k := 1/k^{p+2}$ in the residual of the function value, by inexact Tensor Methods of order $p \ge 1$. The resulting complexity bound is $O(1/\varepsilon^{\frac{1}{p+1}})$ inexact tensor steps for minimizing a convex function up to $\varepsilon$-accuracy (Theorem 6).

Numerical results with an empirical study of the methods for different accuracy policies are also provided.
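As a concrete illustration, the two dynamic rules above can be written in a few lines. This is only a minimal sketch (plain Python; the function names, the default constant `c`, and the fallback value `delta1` are illustrative choices of ours, not values prescribed by the analysis):

```python
def accuracy_power_rule(k, p, c=1.0):
    """First strategy: delta_k = c / k^(p+1), for iteration counter k >= 1."""
    return c / k ** (p + 1)

def accuracy_progress_rule(F_prev2, F_prev1, c=0.05, delta1=1.0):
    """Second strategy: delta_k = c * (F(x_{k-2}) - F(x_{k-1})), for k >= 2.

    Before any progress has been observed (k = 1), fall back to delta1.
    """
    if F_prev2 is None or F_prev1 is None:
        return delta1
    return c * (F_prev2 - F_prev1)
```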
### 1.4. Contents

The rest of the paper is organized as follows. Section 2 contains the notation which we use throughout the paper and states our problem of interest in composite form. In Section 3 we introduce a high-order model of the objective and describe a general optimization scheme using this model. Then, we summarize some known techniques for computing a step for methods of different order. In Section 3.2 we study monotone inexact methods, for which we guarantee a decrease of the objective function at every iteration, and in Section 3.3 we study methods with averaging. In Section 4 we present our accelerated scheme. Section 5 contains numerical results. Missing proofs are provided in the supplementary material.

## 2. Notation

In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space and by $\mathbb{E}^*$ its dual space, which is the space of linear functions on $\mathbb{E}$. The value of a function $s \in \mathbb{E}^*$ at $x \in \mathbb{E}$ is denoted by $\langle s, x\rangle$. One can always identify $\mathbb{E}$ and $\mathbb{E}^*$ with $\mathbb{R}^n$ when some basis is fixed, but it is often useful to separate these spaces in order to avoid ambiguities. Let us fix some symmetric positive definite linear operator $B : \mathbb{E} \to \mathbb{E}^*$ and use it to define the Euclidean norm for the primal variables:

$$\|x\| \;\overset{\mathrm{def}}{=}\; \langle Bx, x\rangle^{1/2}, \qquad x \in \mathbb{E}.$$

Then the norm for the dual space is defined as

$$\|s\|_* \;\overset{\mathrm{def}}{=}\; \max_{h \in \mathbb{E}:\ \|h\| \le 1} \langle s, h\rangle \;=\; \langle s, B^{-1}s\rangle^{1/2}, \qquad s \in \mathbb{E}^*.$$

For a smooth function $f$, its gradient at a point $x$ is denoted by $\nabla f(x)$, and its Hessian by $\nabla^2 f(x)$. Note that for $x \in \mathrm{dom}\, f \subseteq \mathbb{E}$ we have $\nabla f(x) \in \mathbb{E}^*$, and $\nabla^2 f(x)h \in \mathbb{E}^*$ for $h \in \mathbb{E}$. For $p \ge 1$, we denote by $D^p f(x)[h_1, \dots, h_p]$ the $p$-th directional derivative of $f$ along the directions $h_1, \dots, h_p \in \mathbb{E}$. If $h_i = h$ for all $1 \le i \le p$, the shorter notation $D^p f(x)[h]^p$ is used. The norm of $D^p f(x)$, which is a $p$-linear symmetric form on $\mathbb{E}$, is induced in the standard way:

$$\|D^p f(x)\| \;\overset{\mathrm{def}}{=}\; \max_{h_1,\dots,h_p:\ \|h_i\|\le 1} D^p f(x)[h_1,\dots,h_p] \;=\; \max_{h:\ \|h\|\le 1} \big| D^p f(x)[h]^p \big|.$$

See Appendix 1 in (Nesterov & Nemirovskii, 1994) for a proof of the last equation.

We are interested in solving the convex optimization problem in composite form:

$$\min_{x}\; \Big\{ F(x) \;\overset{\mathrm{def}}{=}\; f(x) + \psi(x) \Big\}, \qquad (1)$$

where $\psi : \mathbb{E} \to \mathbb{R} \cup \{+\infty\}$ is a simple proper closed convex function, and $f : \mathrm{dom}\,\psi \to \mathbb{R}$ is convex and several times differentiable. Basic examples of $\psi$ which we keep in mind are the $\{0, +\infty\}$-indicator of a simple closed convex set and $\ell_1$-regularization. We always assume that a solution $x^* \in \mathrm{dom}\,\psi$ of problem (1) exists, denoting $F^* = F(x^*)$.

## 3. Inexact Tensor Methods

### 3.1. High-Order Model of the Objective

We assume that, for some $p \ge 1$, the $p$-th derivative of the smooth component of our objective (1) is Lipschitz continuous.

**Assumption 1** For all $x, y \in \mathrm{dom}\,\psi$:

$$\|D^p f(x) - D^p f(y)\| \;\le\; L_p\,\|x - y\|. \qquad (2)$$

Examples of convex functions with known Lipschitz constants are as follows.

**Example 1** For the power of the Euclidean norm $f(x) = \frac{1}{p+1}\|x - x_0\|^{p+1}$, $p \ge 1$, condition (2) holds with $L_p = p!$ (see Theorem 7.1 in (Rodomanov & Nesterov, 2019)).

**Example 2** For given $a_i \in \mathbb{E}^*$, $1 \le i \le m$, consider the log-sum-exp function $f(x) = \log\big(\sum_{i=1}^m e^{\langle a_i, x\rangle}\big)$, $x \in \mathbb{E}$. Then, for $B := \sum_{i=1}^m a_i a_i^*$ (assuming $B \succ 0$; otherwise we can reduce the dimensionality of the problem), condition (2) holds with $L_1 = 1$, $L_2 = 2$ and $L_3 = 4$ (see Lemma 4 in the supplementary material).

**Example 3** Using $\mathbb{E} = \mathbb{R}$, $a_1 = 0$ and $a_2 = 1$ in the previous example, we obtain the logistic regression loss $f(x) = \log(1 + e^{x})$.
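As a concrete illustration of Example 2, the log-sum-exp function and its first two derivatives can be computed as in the following sketch (NumPy; the max-shift is only for numerical stability, and stacking the vectors $a_i$ as the rows of a matrix `A` is our own convention):

```python
import numpy as np

def log_sum_exp_oracle(A, x):
    """f(x) = log(sum_i exp(<a_i, x>)) together with its gradient and Hessian.

    A : (m, n) array whose rows are the vectors a_i.
    """
    z = A @ x
    z_max = z.max()                      # shift for numerical stability
    w = np.exp(z - z_max)
    f = z_max + np.log(w.sum())
    pi = w / w.sum()                     # softmax weights
    grad = A.T @ pi                      # sum_i pi_i a_i
    hess = A.T @ (pi[:, None] * A) - np.outer(grad, grad)
    return f, grad, hess
```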
Let us consider the Taylor model of $f$ around a fixed point $x$:

$$f(y) \;\approx\; f_{p,x}(y) \;\overset{\mathrm{def}}{=}\; f(x) + \sum_{k=1}^{p} \frac{1}{k!}\, D^k f(x)[y - x]^k.$$

Then, from (2) we have a global bound for this approximation: for all $x, y \in \mathrm{dom}\,\psi$,

$$|f(y) - f_{p,x}(y)| \;\le\; \frac{L_p}{(p+1)!}\,\|y - x\|^{p+1}. \qquad (3)$$

Denote by $\Omega_H(x; y)$ the following regularized model of our objective:

$$\Omega_H(x; y) \;\overset{\mathrm{def}}{=}\; f_{p,x}(y) + \frac{H\,\|y - x\|^{p+1}}{(p+1)!} + \psi(y), \qquad (4)$$

which serves as a global upper bound, $F(y) \le \Omega_H(x; y)$, for $H$ big enough (at least for $H \ge L_p$). This property suggests using the minimizer of (4) in $y$ as the next point of a hypothetical optimization scheme, with $x$ being the current iterate:

$$x_{k+1} \;\in\; \mathrm{Argmin}_y\; \Omega_H(x_k; y), \qquad k \ge 0. \qquad (5)$$

The approach of using the high-order Taylor model $f_{p,x}(y)$ with its regularization was first investigated in (Baes, 2009). Note that for $p = 1$, iteration (5) gives the Gradient Method (see (Nesterov, 2013) as a modern reference), and for $p = 2$ it corresponds to the Newton method with Cubic regularization (Nesterov & Polyak, 2006) (see also (Doikov & Richtárik, 2018) and (Grapiglia & Nesterov, 2019a) for extensions to the composite setting). Recently it was shown in (Nesterov, 2019a) that for $H \ge p L_p$, the function $\Omega_H(x; \cdot)$ is always convex (even though the Taylor polynomial $f_{p,x}(y)$ may be nonconvex for $p \ge 3$). Thus, the computation (5) of the next point can be done by the powerful tools of Convex Optimization.

Let us summarize some known techniques for computing a step of the general method (5), for different $p$:

- $p = 1$. When there is no composite part ($\psi \equiv 0$), iteration (5) can be represented as a Gradient Step with preconditioning: $x_{k+1} = x_k - \frac{1}{H}B^{-1}\nabla f(x_k)$. One can precompute the inverse of $B$ in advance, or use a numerical subroutine at every step to solve the linear system (for example, the Conjugate Gradient method, see (Nocedal & Wright, 2006)). If $\psi \not\equiv 0$, computing the corresponding prox-operator is required (see (Beck, 2017)).
- $p = 2$. One approach consists in diagonalizing the quadratic part by an eigenvalue or tridiagonal decomposition of the Hessian matrix. Then the computation of the model minimizer (5) can be done very efficiently by solving a one-dimensional nonlinear equation (Nesterov & Polyak, 2006; Gould et al., 2010). The use of fast approximate eigenvalue computation was considered in (Agarwal et al., 2017). Another way to compute the (inexact) second-order step (5) is to launch the Gradient Method (Carmon & Duchi, 2019). Recently, a subsolver based on the Fast Gradient Method with restarts was proposed in (Nesterov, 2019b), whose convergence rate is $O(1/t^6)$, where $t$ is the iteration counter.
- $p = 3$. In (Nesterov, 2019a), an efficient method for computing the inexact third-order step (5) was outlined, based on the notion of relative smoothness (Van Nguyen, 2017; Bauschke et al., 2016; Lu et al., 2018). Every iteration of the subsolver can be represented as a Gradient Step for the auxiliary problem, with a specific choice of prox-function, formed by the second derivative of the initial objective augmented by the fourth power of the Euclidean norm. Methods of this type were studied in (Grapiglia & Nesterov, 2019b; Bullins, 2018; Nesterov, 2019b).

To the best of our knowledge, an efficient implementation of the tensor step of degree $p \ge 4$ remains an open question.

### 3.2. Monotone Inexact Methods

Let us assume that at every step of our method, we minimize the model (4) inexactly by an auxiliary subroutine, up to some given accuracy $\delta \ge 0$. We use the following definition of an inexact $\delta$-step.

**Definition 1** Denote by $T_{H,\delta}(x)$ a point $T \equiv T_{H,\delta}(x) \in \mathrm{dom}\,\psi$ satisfying

$$\Omega_H(x; T) - \min_y \Omega_H(x; y) \;\le\; \delta. \qquad (6)$$
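For $p = 2$ and $\psi \equiv 0$, one simple way to produce a point satisfying Definition 1 is to run gradient descent directly on the cubic model, in the spirit of (Carmon & Duchi, 2019). The sketch below is only illustrative: it is not the subsolver used in our experiments (there we use the Fast Gradient Method with restarts of (Nesterov, 2019b)), and the initial step size and the iteration budget are assumptions of ours rather than tuned choices.

```python
import numpy as np

def cubic_model(grad, hess, H, h):
    """Model value m(h) = <g, h> + 0.5 <hess h, h> + (H / 6) ||h||^3   (psi = 0)."""
    return grad @ h + 0.5 * h @ (hess @ h) + (H / 6.0) * np.linalg.norm(h) ** 3

def inexact_cubic_step(grad, hess, H, n_iters=500):
    """Approximate minimizer of the cubic model via gradient descent with backtracking.

    Returns the displacement h, so the trial point is T = x + h.
    """
    h = np.zeros_like(grad)
    step = 1.0 / (np.linalg.norm(hess, 2) + H)      # crude, illustrative initial step size
    for _ in range(n_iters):
        g = grad + hess @ h + 0.5 * H * np.linalg.norm(h) * h   # gradient of the model
        # backtracking: shrink the step until the model decreases sufficiently
        while (cubic_model(grad, hess, H, h - step * g)
               > cubic_model(grad, hess, H, h) - 0.5 * step * (g @ g)):
            step *= 0.5
            if step < 1e-16:
                break
        h = h - step * g
    return h
```

In practice the inner loop would be terminated by an accuracy certificate such as the stopping criterion (29) from Section 5, rather than by a fixed iteration budget.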
The main property of the point $T_{H,\delta}(x)$ from Definition 1 is given by the next lemma.

**Lemma 1** Let $H = \alpha L_p$ for some $\alpha \ge p$. Then, for every $y \in \mathrm{dom}\,\psi$,

$$F(T_{H,\delta}(x)) \;\le\; F(y) + \frac{(\alpha+1)L_p}{(p+1)!}\,\|y - x\|^{p+1} + \delta. \qquad (7)$$

Indeed, denoting $T \equiv T_{H,\delta}(x)$, we have

$$F(T) \;\overset{(3)}{\le}\; \Omega_H(x; T) \;\overset{(6)}{\le}\; \Omega_H(x; y) + \delta \;\overset{(3)}{\le}\; F(y) + \frac{(\alpha+1)L_p}{(p+1)!}\,\|y - x\|^{p+1} + \delta.$$

Now, if we plug $y := x$ (the current iterate) into (7), we obtain $F(T_{H,\delta}(x)) \le F(x) + \delta$. So in the case $\delta = 0$ (exact tensor step), we would have a nonincreasing sequence $\{F(x_k)\}_{k\ge0}$ of test points of the method. However, this is not the case for $\delta > 0$ (inexact tensor step). Therefore we propose the following minimization scheme with correction.

**Algorithm 1** Monotone Inexact Tensor Method, I
Initialization: Choose $x_0 \in \mathrm{dom}\,\psi$. Fix $H := pL_p$.
for $k = 0, 1, 2, \dots$ do
  Pick up $\delta_{k+1} \ge 0$
  Compute the inexact tensor step $T_{k+1} := T_{H,\delta_{k+1}}(x_k)$
  if $F(T_{k+1}) < F(x_k)$ then $x_{k+1} := T_{k+1}$ else $x_{k+1} := x_k$
end for

If at some step $k \ge 0$ of this algorithm we get $x_{k+1} = x_k$, then we need to decrease the inner accuracy for the next step. From the practical point of view, an efficient implementation of this algorithm should include the possibility of improving the accuracy of the previously computed point.

Denote by $D$ the radius of the initial level set of the objective:

$$D \;\overset{\mathrm{def}}{=}\; \sup_x \big\{ \|x - x^*\| : F(x) \le F(x_0) \big\}. \qquad (8)$$

For Algorithm 1, we can prove the following convergence result, which uses a simple strategy for choosing $\delta_{k+1}$.

**Theorem 1** Let $D < +\infty$, and let the sequence of inner accuracies $\{\delta_k\}_{k\ge1}$ be chosen according to the rule

$$\delta_k := \frac{c}{k^{p+1}} \qquad (9)$$

with some $c \ge 0$. Then for the sequence $\{x_k\}_{k\ge1}$ produced by Algorithm 1, we have

$$F(x_k) - F^* \;\le\; \frac{(p+1)^{p+1} L_p D^{p+1}}{p!\, k^{p}} + \frac{c}{k^{p}}. \qquad (10)$$

We see that the global convergence rate of the inexact Tensor Method remains on the same level as that of the exact one. Namely, in order to achieve $F(x_K) - F^* \le \varepsilon$, we need to perform $K = O(1/\varepsilon^{1/p})$ iterations of the algorithm. According to these estimates, at the last iteration $K$ the rule (9) requires solving the subproblem up to accuracy

$$\delta_K \;=\; \frac{c}{K^{p+1}} \;=\; O\big(\varepsilon^{\frac{p+1}{p}}\big). \qquad (11)$$

This is intriguing, since for bigger $p$ (the order of the method) we need less accurate solutions. Note that estimate (11) coincides with the constant choice of inner accuracy in (Nesterov, 2019b). However, the dynamic strategy (9) provides a significant decrease of the computational time during the first iterations of the method, which is also confirmed by our numerical results (see Section 5).

Now, looking at Algorithm 1, one may think that we are discarding the points $T_{k+1}$ with $F(T_{k+1}) \ge F(x_k)$, and thus losing some computations. However, this is not true: even if the point $T_{k+1}$ has not been taken as $x_{k+1}$, we shall use it internally as a starting point for computing the next $T_{k+2}$. To support this concept, we introduce an inexact $\delta$-step with an additional condition of monotonicity.

**Definition 2** Denote by $M_{H,\delta}(x)$ a point $M \equiv M_{H,\delta}(x) \in \mathrm{dom}\,\psi$ satisfying the following two conditions:

$$\Omega_H(x; M) - \min_y \Omega_H(x; y) \;\le\; \delta, \qquad (12)$$

$$F(M) \;<\; F(x). \qquad (13)$$

It is clear that a point $M$ from Definition 2 satisfies Definition 1 as well (while the opposite is not always the case). Therefore, we can also use Lemma 1 for the monotone inexact tensor step. Using this definition, we simplify Algorithm 1 and present the following scheme.

**Algorithm 2** Monotone Inexact Tensor Method, II
Initialization: Choose $x_0 \in \mathrm{dom}\,\psi$. Fix $H := pL_p$.
for $k = 0, 1, 2, \dots$ do
  Pick up $\delta_{k+1} \ge 0$
  Compute the inexact monotone tensor step $x_{k+1} := M_{H,\delta_{k+1}}(x_k)$
end for
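A minimal sketch of how Algorithm 2 can be driven in code, using the "last progress" rule analyzed in Theorem 2 below (rule (14)). The oracle `inexact_tensor_step(x, delta)` (returning a point satisfying (12), for instance the cubic step sketched earlier for $p = 2$) and the objective value `F` are assumed to be supplied by the user; enforcing the monotonicity condition (13) by repeatedly tightening the accuracy is our own simple heuristic, not a prescription of the paper.

```python
def monotone_inexact_tensor_method(x0, F, inexact_tensor_step,
                                   n_iters=50, c=0.05, delta1=1.0):
    """Algorithm 2 driven by the adaptive rule delta_k = c * (F(x_{k-2}) - F(x_{k-1}))."""
    x = x0
    F_hist = [F(x0)]
    delta = delta1
    for k in range(1, n_iters + 1):
        if k >= 2:
            delta = c * (F_hist[-2] - F_hist[-1])   # rule (14)
        T = inexact_tensor_step(x, delta)
        # enforce the monotonicity condition (13) by tightening the inner accuracy
        while F(T) >= F_hist[-1] and delta > 1e-14:
            delta *= 0.5
            T = inexact_tensor_step(x, delta)
        if F(T) < F_hist[-1]:
            x = T
        F_hist.append(F(x))
    return x, F_hist
```

The halving loop is only a safeguard: with rule (14) the required accuracy already shrinks together with the observed progress of the method.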
When our method is strictly monotone, we guarantee that $F(x_{k+1}) < F(x_k)$ for all $k \ge 0$, and we propose to use the following adaptive strategy for defining the inner accuracies.

**Theorem 2** Let $D < +\infty$, and let the sequence of inner accuracies $\{\delta_k\}_{k\ge1}$ be chosen in accordance with the rule

$$\delta_k := c\,\big(F(x_{k-2}) - F(x_{k-1})\big), \qquad k \ge 2, \qquad (14)$$

for some fixed $0 \le c < \frac{1}{(p+2)3^{p+1} - 1}$ and $\delta_1 \ge 0$. Then for the sequence $\{x_k\}_{k\ge1}$ produced by Algorithm 2, we have

$$F(x_k) - F^* \;\le\; \frac{\gamma\, L_p D^{p+1}}{p!\, k^{p}} + \frac{\beta}{k^{p+2}}, \qquad (15)$$

where $\gamma$ and $\beta$ are the constants

$$\gamma \;\overset{\mathrm{def}}{=}\; \frac{(p+2)^{p+1}}{1 - c\big((p+2)3^{p+1} - 1\big)}, \qquad \beta \;\overset{\mathrm{def}}{=}\; \frac{\delta_1 + c\,2^{p+2}\big(F(x_0) - F^*\big)}{1 - c\big((p+2)^2/(p+1) - 1\big)}.$$

Rule (14) is surprisingly simple and natural: as the method approaches the optimum, it becomes more and more difficult to optimize the function. Consequently, the progress in the function value at every step decreases. Therefore, we need to solve the auxiliary problem more accurately, and this is exactly what we do in accordance with this rule. It is also notable that rule (14) is universal, in the sense that it remains the same (up to a constant factor) for methods of any order, starting from $p = 1$.

This strategy also works in the nondegenerate case. Let us assume that our objective is uniformly convex of degree $p+1$ with constant $\sigma_{p+1}$. Thus, for all $x, y \in \mathrm{dom}\,\psi$ and $F'(x) \in \partial F(x)$ it holds that

$$F(y) \;\ge\; F(x) + \langle F'(x), y - x\rangle + \frac{\sigma_{p+1}}{p+1}\,\|y - x\|^{p+1}. \qquad (16)$$

For $p = 1$ this definition corresponds to the standard class of strongly convex functions. One of the main sources of uniform convexity is regularization by a power of the Euclidean norm (we use this construction in Section 4, where we accelerate our methods):

**Example 4** Let $\psi(x) = \frac{\mu}{p+1}\|x - x_0\|^{p+1}$, $\mu \ge 0$. Then (16) holds with $\sigma_{p+1} = \mu\, 2^{1-p}$ (see Lemma 5 in (Doikov & Nesterov, 2019c)).

**Example 5** Let $\psi(x) = \frac{\mu}{2}\|x - x_0\|^{2}$, $\mu \ge 0$. Consider the ball of radius $D$ around the optimum: $\mathcal{B} = \{x : \|x - x^*\| \le D\}$. Then (16) holds for all $x, y \in \mathcal{B}$ with $\sigma_{p+1} = \frac{(p+1)\mu}{2^{p} D^{p-1}}$ (see Lemma 5 in the supplementary material).

Denote by $\omega_p$ the condition number of degree $p$:

$$\omega_p \;\overset{\mathrm{def}}{=}\; \max\Big\{ \frac{(p+1)^2 L_p}{p!\,\sigma_{p+1}},\; 1 \Big\}. \qquad (17)$$

The next theorem shows that $\omega_p$ serves as the main factor in the complexity of solving uniformly convex problems by inexact Tensor Methods.

**Theorem 3** Let $\sigma_{p+1} > 0$, and let the sequence of inner accuracies $\{\delta_k\}_{k\ge1}$ be chosen in accordance with the rule

$$\delta_k := c\,\big(F(x_{k-2}) - F(x_{k-1})\big), \qquad k \ge 2, \qquad (18)$$

for some fixed $0 \le c < \frac{p}{p+1}\,\omega_p^{-1/p}$ and $\delta_1 \ge 0$. Then the sequence $\{x_k\}_{k\ge1}$ produced by Algorithm 2 has the following linear rate of convergence:

$$F(x_k) - F^* \;\le\; \Big(1 - \frac{p}{p+1}\,\omega_p^{-1/p} + c\Big)\big(F(x_{k-1}) - F^*\big). \qquad (19)$$

Let us pick $c := \frac{p}{2(p+1)}\,\omega_p^{-1/p}$. Then, according to estimate (19), in order to solve the problem up to $\varepsilon$-accuracy, $F(x_K) - F^* \le \varepsilon$, we need to perform

$$K \;=\; O\Big(\omega_p^{1/p}\,\log\frac{F(x_0) - F^*}{\varepsilon}\Big) \qquad (20)$$

iterations of the algorithm.

Finally, we study the local behavior of the method for a strongly convex objective.

**Theorem 4** Let $\sigma_2 > 0$, and let the sequence of inner accuracies $\{\delta_k\}_{k\ge1}$ be chosen in accordance with the rule

$$\delta_k := c\,\big(F(x_{k-2}) - F(x_{k-1})\big)^{\frac{p+1}{2}}, \qquad k \ge 2, \qquad (21)$$

with some fixed $c \ge 0$ and $\delta_1 \ge 0$. Then for $p \ge 2$ the sequence $\{x_k\}_{k\ge1}$ produced by Algorithm 2 has the local superlinear rate of convergence

$$F(x_k) - F^* \;\le\; \Big( \frac{L_p}{p!}\Big(\frac{2}{\sigma_2}\Big)^{\frac{p+1}{2}} + c \Big)\big(F(x_{k-1}) - F^*\big)^{\frac{p+1}{2}}. \qquad (22)$$

Let us assume for simplicity that the constant $c$ is chosen small enough: $c \le \frac{L_p}{\sigma_2^{(p+1)/2}}$. Then we can describe the region of superlinear convergence as

$$\mathcal{Q} \;=\; \Big\{ x \in \mathrm{dom}\,\psi \;:\; F(x) - F^* \;\le\; \Big( \frac{\sigma_2^{p+1}}{2^{2p+3}}\Big(\frac{p!}{L_p}\Big)^{2} \Big)^{\frac{1}{p-1}} \Big\}.$$

After reaching it, the method becomes very fast: we need to perform no more than $O(\log\log\frac{1}{\varepsilon})$ additional iterations to solve the problem.
Note that estimate (22) of the local convergence is slightly weaker than the corresponding one for exact Tensor Methods (Doikov & Nesterov, 2019b). For example, for $p = 2$ (Cubic regularization of the Newton method) we obtain convergence of order $\frac{3}{2}$ rather than quadratic convergence, which affects only a constant factor in the complexity estimate. The region $\mathcal{Q}$ of superlinear convergence remains the same.

### 3.3. Inexact Methods with Averaging

The methods from the previous section were developed by forcing monotonicity of the sequence of function values $\{F(x_k)\}_{k\ge0}$ into the scheme. As a byproduct, we get the radius $D$ of the initial level set (see definition (8)) in the right-hand side of our complexity estimates (10) and (15). Note that $D$ may be significantly bigger than the distance $\|x_0 - x^*\|$ from the initial point to the solution.

**Example 6** Consider the following function, for $x \in \mathbb{R}^n$:

$$f(x) \;=\; |x^{(1)}|^{p+1} + \sum_{i=2}^{n} |x^{(i)} - 2x^{(i-1)}|^{p+1},$$

where $x^{(i)}$ denotes the $i$-th coordinate of $x$. Clearly, the minimum of $f$ is at the origin: $x^* = (0, \dots, 0)^T$. Let us take two points: $x_0 = (1, \dots, 1)^T$ and $x_1$ such that $x_1^{(i)} = 2^{i} - 1$. It holds that $f(x_0) = f(x_1) = n$, so they belong to the same level set. However, we have (for the standard Euclidean norm) $\|x_0 - x^*\| = \sqrt{n}$, while $D \ge \|x_1 - x^*\| \ge 2^{n} - 1$.

Here we present an alternative approach, Tensor Methods with Averaging. In this scheme, we perform a step not from the previous point $x_k$, but from a point $y_k$ which is a convex combination of the previous point and the starting point: $y_k = \lambda_k x_k + (1 - \lambda_k)x_0$, where $\lambda_k = \big(\frac{k}{k+1}\big)^{p+1}$. The whole optimization scheme remains very simple.

**Algorithm 3** Inexact Tensor Method with Averaging
Initialization: Choose $x_0 \in \mathrm{dom}\,\psi$. Fix $H := pL_p$.
for $k = 0, 1, 2, \dots$ do
  Set $\lambda_k := \big(\frac{k}{k+1}\big)^{p+1}$, $y_k := \lambda_k x_k + (1 - \lambda_k)x_0$
  Pick up $\delta_{k+1} \ge 0$
  Compute the inexact tensor step $x_{k+1} := T_{H,\delta_{k+1}}(y_k)$
end for

For this method, we are able to prove a convergence result similar to that of Algorithm 1. However, now we have the explicit distance $\|x_0 - x^*\|$ in the right-hand side of our bound for the convergence rate (compare with Theorem 1).

**Theorem 5** Let the sequence of inner accuracies $\{\delta_k\}_{k\ge1}$ be chosen according to the rule

$$\delta_k := \frac{c}{k^{p+1}} \qquad (23)$$

for some $c \ge 0$. Then for the sequence $\{x_k\}_{k\ge1}$ produced by Algorithm 3, we have

$$F(x_k) - F^* \;\le\; \frac{(p+1)^{p+1} L_p\, \|x_0 - x^*\|^{p+1}}{p!\, k^{p}} + \frac{c}{k^{p}}. \qquad (24)$$

Thus, Algorithm 3 seems to be the first primal Tensor method (aggregating only points from the primal space $\mathbb{E}$) which admits the explicit initial distance in the global convergence estimate (24).
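The averaging scheme is equally easy to drive in code. Here is a minimal sketch, reusing the assumed `inexact_tensor_step` oracle from the earlier sketches together with schedule (23); the points are assumed to be NumPy arrays and the default constants are illustrative:

```python
def inexact_tensor_method_with_averaging(x0, inexact_tensor_step,
                                         p=2, n_iters=50, c=1.0):
    """Algorithm 3: step from y_k = lam_k * x_k + (1 - lam_k) * x_0."""
    x = x0.copy()
    for k in range(n_iters):
        lam = (k / (k + 1)) ** (p + 1)
        y = lam * x + (1.0 - lam) * x0
        delta = c / (k + 1) ** (p + 1)      # rule (23): delta_{k+1} = c / (k+1)^{p+1}
        x = inexact_tensor_step(y, delta)
    return x
```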
Table 1 contains a short overview of the inexact Tensor methods from this section.

Table 1. Summary of the methods.

| Algorithm | The rule for $\delta_k$ | Global rate | Local superlinear |
| --- | --- | --- | --- |
| Tensor Method (Nesterov, 2019a) | $0$ | $O\big(\frac{L_p D^{p+1}}{k^p}\big)$ | yes |
| Monotone Inexact Tensor Method, I (Algorithm 1) | $1/k^{p+1}$ | $O\big(\frac{L_p D^{p+1}}{k^p}\big)$ | no |
| Monotone Inexact Tensor Method, II (Algorithm 2) | $\big(F(x_{k-1}) - F(x_k)\big)^{\alpha}$ | $O\big(\frac{L_p D^{p+1}}{k^p}\big)$ | yes |
| Inexact Tensor Method with Averaging (Algorithm 3) | $1/k^{p+1}$ | $O\big(\frac{L_p \|x_0 - x^*\|^{p+1}}{k^p}\big)$ | no |

## 4. Acceleration

After the Fast Gradient Method was discovered in (Nesterov, 1983), huge efforts were made to develop accelerated second-order (Nesterov, 2008; Monteiro & Svaiter, 2013; Grapiglia & Nesterov, 2019a) and high-order (Baes, 2009; Nesterov, 2019a; Gasnikov et al., 2019; Grapiglia & Nesterov, 2019c; Song & Ma, 2019) optimization algorithms. Most of these schemes are based on the notion of Estimating Sequences (see (Nesterov, 2018)). An alternative approach, using Proximal Point iterations with acceleration, was first studied in (Güler, 1992). It has become very popular recently in the context of machine learning applications (Lin et al., 2015; 2018; Kulunchakov & Mairal, 2019; Ivanova et al., 2019).

In this section, we use the technique of Contracting Proximal iterations (Doikov & Nesterov, 2019a) to accelerate our inexact tensor methods. In the accelerated scheme, two sequences of points are used: the main sequence $\{x_k\}_{k\ge0}$, for which we are able to guarantee convergence in function residuals, and an auxiliary sequence $\{v_k\}_{k\ge0}$ of prox-centers, starting from the same initial point: $v_0 = x_0$. We also use a sequence $\{A_k\}_{k\ge0}$ of scaling coefficients and denote $a_k \overset{\mathrm{def}}{=} A_k - A_{k-1}$, $k \ge 1$. Then, at every iteration, we apply Monotone Inexact Tensor Method, II (Algorithm 2) to minimize the following contracted objective with regularization:

$$h_{k+1}(x) \;\overset{\mathrm{def}}{=}\; A_{k+1}\, f\Big( \frac{a_{k+1}x + A_k x_k}{A_{k+1}} \Big) + a_{k+1}\,\psi(x) + \beta_d(v_k; x).$$

Here $\beta_d(v_k; x) \overset{\mathrm{def}}{=} d(x) - d(v_k) - \langle \nabla d(v_k), x - v_k\rangle$ is the Bregman divergence centered at $v_k$, for the following choice of prox-function: $d(x) := \frac{1}{p+1}\|x - x_0\|^{p+1}$, which is uniformly convex of degree $p+1$ (Example 4). Therefore, the Tensor Method achieves a fast linear rate of convergence on this subproblem (Theorem 3). By an appropriate choice of the scaling coefficients $\{A_k\}_{k\ge1}$, we are able to make the condition number of the subproblem an absolute constant. This means that only $O(1)$ steps of Algorithm 2 are needed to find an approximate minimizer of $h_{k+1}(\cdot)$:

$$h_{k+1}(v_{k+1}) - h_{k+1}^* \;\le\; \zeta_{k+1}. \qquad (25)$$

Note that the inexact condition (25) was first considered in (Lin et al., 2015), in a general algorithmic framework for accelerating first-order methods. It differs from the corresponding one in (Doikov & Nesterov, 2019a), where a bound on the (sub)gradient norm was used.

Therefore, for accelerating inexact Tensor Methods, we propose a multi-level approach. On the upper level, we run Algorithm 4. At each iteration of this method, we call Algorithm 2 (with the dynamic rule (18) for the inner accuracies) to find $v_{k+1}$.

**Algorithm 4** Accelerated Scheme
Initialization: Choose $x_0 \in \mathrm{dom}\,\psi$. Set $v_0 := x_0$, $A_0 := 0$.
for $k = 0, 1, 2, \dots$ do
  Set $A_{k+1} := \frac{(k+1)^{p+1}}{L_p}$
  Pick up $\zeta_{k+1} \ge 0$
  Find $v_{k+1}$ such that (25) holds
  Set $x_{k+1} := \frac{a_{k+1}v_{k+1} + A_k x_k}{A_{k+1}}$
end for

For this optimization scheme, we obtain the following global convergence guarantee.

**Theorem 6** Let the sequence $\{\zeta_k\}_{k\ge1}$ be chosen according to the rule

$$\zeta_k := \frac{c}{k^{p+2}} \qquad (26)$$

with some $c \ge 0$. Then for the iterates $\{x_k\}_{k\ge1}$ produced by Algorithm 4, it holds that

$$F(x_k) - F^* \;\le\; O\Big( \frac{L_p\big(\|x_0 - x^*\|^{p+1} + c\big)}{k^{p+1}} \Big). \qquad (27)$$

For every $k \ge 0$, in order to find $v_{k+1}$ by Algorithm 2 (minimizing $h_{k+1}(\cdot)$, starting from $v_k$), it is enough to perform no more than $O\big(\log\big((k+1)(\|x_0 - x^*\|^{p+1} + c)\big)\big)$ inexact monotone tensor steps. Therefore, the total number of inexact tensor steps for finding an $\varepsilon$-solution of (1) is bounded by $\tilde{O}\big(1/\varepsilon^{\frac{1}{p+1}}\big)$.

One theoretical question remains open: is it possible to construct, in the framework of inexact tensor steps, optimal methods with the complexity estimate $O\big(1/\varepsilon^{\frac{2}{3p+1}}\big)$ and no hidden logarithmic factors? This would match the existing lower bound (Arjevani et al., 2019; Nesterov, 2019a).

## 5. Experiments

Let us demonstrate computational results with an empirical study of different accuracy policies. We consider inexact methods of order $p = 2$ (Cubic regularization of the Newton method), and to solve the corresponding subproblem we call the Fast Gradient Method with restarts from (Nesterov, 2019b). To estimate the residual in function value of the subproblem, we use a simple stopping criterion, given by the uniform convexity of the model $g(y) = \Omega_H(x; y)$:

$$g(y) - \min_y g(y) \;\le\; \frac{4}{3}\Big(\frac{1}{H}\Big)^{1/2}\|\nabla g(y)\|_*^{3/2}. \qquad (29)$$
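In code, criterion (29) is a one-line test on the (dual) norm of the model gradient at the current inner iterate: once the right-hand side falls below the required inner accuracy, the subsolver may stop. A minimal sketch (the function names are ours; `model_grad_norm` is assumed to be provided by the subsolver):

```python
def cubic_model_residual_bound(model_grad_norm, H):
    """Upper bound (29) on g(y) - min g, from uniform convexity of the cubic model."""
    return (4.0 / 3.0) * (1.0 / H) ** 0.5 * model_grad_norm ** 1.5

def subproblem_solved(model_grad_norm, H, delta):
    """Stop the subsolver once the bound certifies the required inner accuracy delta."""
    return cubic_model_residual_bound(model_grad_norm, H) <= delta
```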
An alternative approach would be to bound the functional residual by the duality gap. (Note also that the left-hand side of (29) can be bounded from below by the distance from $y$ to the optimum of the model, using uniform convexity; therefore, we have a computable bound for the distance to the solution of the subproblem.)

We compare the adaptive rule (14) for the inner accuracies with dynamic strategies of the form $\delta_k = 1/k^{\alpha}$ for different $\alpha$ (left plots), and with constant choices (right plots).

### 5.1. Logistic Regression

First, let us consider the problem of training an $\ell_2$-regularized logistic regression model for a two-class classification task, on several real datasets (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): mushrooms ($m = 8124$, $n = 112$), w8a ($m = 49749$, $n = 300$), and a8a ($m = 22696$, $n = 123$), where $m$ is the number of training examples and $n$ is the dimension of the problem (the number of features). We use the standard Euclidean norm for this problem, and a simple line search at every iteration to fit the regularization parameter $H$. The results are shown in Figure 1.

Figure 1. Comparison of different accuracy policies for inexact Cubic Newton, training logistic regression (functional residual vs. time in seconds, for the mushrooms, w8a, and a8a datasets).

### 5.2. Log-Sum-Exp

In the next set of experiments, we consider unconstrained minimization of the following objective:

$$f_{\mu}(x) \;=\; \mu \ln\Big( \sum_{i=1}^m \exp\Big( \frac{\langle a_i, x\rangle - b_i}{\mu} \Big) \Big),$$

where $\mu > 0$ is a smoothing parameter. To generate the data, we sample coefficients $\{\bar a_i\}_{i=1}^m$ and $b$ randomly from the uniform distribution on $[-1, 1]$. Then, we shift the parameters in such a way that the solution $x^*$ is at the origin. Namely, using $\{\bar a_i\}_{i=1}^m$ we form a preliminary function $\bar f_{\mu}(x)$ and set $a_i := \bar a_i - \nabla \bar f_{\mu}(0)$. Thus we obtain $\nabla f_{\mu}(0) = 0$. We set $m = 6n$ and $n = 100$. In the method, we use the following Euclidean norm for the primal space: $\|x\| = \langle Bx, x\rangle^{1/2}$, with the matrix $B = \sum_{i=1}^m a_i a_i^T$, and fix the regularization parameter $H$ equal to 1. The results are shown in Figure 2.

Figure 2. Comparison of different accuracy policies for inexact Cubic Newton, minimizing the Log-Sum-Exp function (functional residual vs. time in seconds, for $\mu = 0.25$, $0.1$, and $0.05$).
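The data-generation procedure described above can be reproduced with a few lines. This is a sketch under the stated assumptions (uniform coefficients, shift so that $\nabla f_\mu(0) = 0$); the seeding and the softmax-based gradient computation are implementation details of ours:

```python
import numpy as np

def generate_log_sum_exp_data(n=100, mu=0.25, seed=0):
    """Sample a_i, b uniformly on [-1, 1] and shift a_i so that grad f_mu(0) = 0."""
    rng = np.random.default_rng(seed)
    m = 6 * n
    A_bar = rng.uniform(-1.0, 1.0, size=(m, n))
    b = rng.uniform(-1.0, 1.0, size=m)
    # gradient of the preliminary function at x = 0: softmax(-b / mu) weights
    s = -b / mu
    w = np.exp(s - s.max())
    pi = w / w.sum()
    grad0 = A_bar.T @ pi
    A = A_bar - grad0                    # a_i := a_bar_i - grad f_bar_mu(0)
    return A, b
```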
We see that the adaptive rule demonstrates reasonably good performance, in terms of the total computational time, in all the scenarios. (Clock time was evaluated on a machine with an Intel Core i5 CPU, 1.6 GHz, and 8 GB RAM. All methods were implemented in Python. The source code can be found at https://github.com/doikov/dynamic-accuracies/.)

## Acknowledgements

The research results of this paper were obtained in the framework of ERC Advanced Grant 788368.

## References

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.

Arjevani, Y., Shamir, O., and Shiff, R. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, 178(1-2):327–360, 2019.

Baes, M. Estimate sequence methods: extensions and approximations. Institute for Operations Research, ETH, Zürich, Switzerland, 2009.

Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.

Beck, A. First-Order Methods in Optimization, volume 25. SIAM, 2017.

Birgin, E. G., Gardenghi, J., Martínez, J. M., Santos, S. A., and Toint, P. L. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1-2):359–368, 2017.

Bullins, B. Fast minimization of structured convex quartics. arXiv preprint arXiv:1812.10349, 2018.

Carmon, Y. and Duchi, J. Gradient descent finds the cubic-regularized nonconvex Newton step. SIAM Journal on Optimization, 29(3):2146–2178, 2019.

Cartis, C. and Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337–375, 2018.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011a.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011b.

Cartis, C., Gould, N. I., and Toint, P. L. Universal regularization methods: varying the power, the smoothness and the accuracy. SIAM Journal on Optimization, 29(1):595–615, 2019.

Doikov, N. and Nesterov, Y. Contracting proximal methods for smooth convex optimization. CORE Discussion Papers 2019/27, 2019a.

Doikov, N. and Nesterov, Y. Local convergence of tensor methods. CORE Discussion Papers 2019/21, 2019b.

Doikov, N. and Nesterov, Y. Minimizing uniformly convex functions by cubic regularization of Newton method. arXiv preprint arXiv:1905.02671, 2019c.

Doikov, N. and Richtárik, P. Randomized block cubic Newton method. In International Conference on Machine Learning, pp. 1289–1297, 2018.

Gasnikov, A., Dvurechensky, P., Gorbunov, E., Vorontsova, E., Selikhanovych, D., Uribe, C. A., Jiang, B., Wang, H., Zhang, S., Bubeck, S., et al. Near optimal methods for minimizing convex functions with Lipschitz p-th derivatives. In Conference on Learning Theory, pp. 1392–1393, 2019.

Gould, N. I., Robinson, D. P., and Thorne, H. S. On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation, 2(1):21–57, 2010.

Grapiglia, G. N. and Nesterov, Y. Accelerated regularized Newton methods for minimizing composite convex functions. SIAM Journal on Optimization, 29(1):77–99, 2019a.

Grapiglia, G. N. and Nesterov, Y. On inexact solution of auxiliary problems in tensor methods for convex optimization. arXiv preprint arXiv:1907.13023, 2019b.

Grapiglia, G. N. and Nesterov, Y. Tensor methods for minimizing functions with Hölder continuous higher-order derivatives. arXiv preprint arXiv:1904.12559, 2019c.

Grapiglia, G. N. and Nesterov, Y. Tensor methods for finding approximate stationary points of convex functions. arXiv preprint arXiv:1907.07053, 2019d.
Güler, O. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

Ivanova, A., Grishchenko, D., Gasnikov, A., and Shulgin, E. Adaptive catalyst for smooth convex optimization. arXiv preprint arXiv:1911.11271, 2019.

Jiang, B., Lin, T., and Zhang, S. A unified adaptive tensor approximation scheme to accelerate composite convex optimization. arXiv preprint arXiv:1811.02427, 2018.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pp. 1895–1904, 2017.

Kulunchakov, A. and Mairal, J. A generic acceleration framework for stochastic composite optimization. In Advances in Neural Information Processing Systems, pp. 12556–12567, 2019.

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.

Lin, H., Mairal, J., and Harchaoui, Z. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18(212):1–54, 2018.

Lu, H., Freund, R. M., and Nesterov, Y. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.

Lucchi, A. and Kohler, J. A stochastic tensor method for non-convex optimization. arXiv preprint arXiv:1911.10367, 2019.

Monteiro, R. D. and Svaiter, B. F. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.

Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Nesterov, Y. Lectures on Convex Optimization, volume 137. Springer, 2018.

Nesterov, Y. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, pp. 1–27, 2019a.

Nesterov, Y. Inexact basic tensor methods. CORE Discussion Papers 2019/23, 2019b.

Nesterov, Y. and Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, 2nd edition, 2006.

Rodomanov, A. and Nesterov, Y. Smoothness parameter of power of Euclidean norm. arXiv preprint arXiv:1907.12346, 2019.

Schmidt, M., Roux, N. L., and Bach, F. R. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pp. 1458–1466, 2011.

Song, C. and Ma, Y. Towards unified acceleration of high-order algorithms under Hölder continuity and uniform convexity. arXiv preprint arXiv:1906.00582, 2019.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2899–2908, 2018.

Van Nguyen, Q. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.
Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372, 2018.

Zhou, D., Xu, P., and Gu, Q. Stochastic variance-reduced cubic regularization methods. Journal of Machine Learning Research, 20(134):1–47, 2019.