# secondorder_optimization_with_lazy_hessians__4cd47fd9.pdf Second-Order Optimization with Lazy Hessians Nikita Doikov * 1 El Mahdi Chayti * 1 Martin Jaggi 1 We analyze Newton s method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetic complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every d iterations, where d is the dimension of the problem. This provably improves the total arithmetic complexity of second-order algorithms by a factor 1. Introduction Motivation. Second-order optimization algorithms are being widely used for solving difficult ill-conditioned problems. The classical Newton method approximates the objective function by its second-order Taylor approximation at the current point. A minimizer of this model serves as one step of the algorithm. Locally, Newton s method achieves very fast quadratic rate (Kantorovich, 1948), when the iterates are in a neighborhood of the solution. However, it may not converge if the initial point is far away from the optimum. For the global convergence of Newton s method, we need to use some of the various regularization techniques, that have been intensively developed during recent decades (including damped Newton steps with line search (Kantorovich, 1948; Ortega & Rheinboldt, 2000; Hanzely et al., 2022), *Equal contribution 1Machine Learning and Optimization Laboratory, EPFL, Switzerland. Correspondence to: El Mahdi Chayti , Nikita Doikov . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). trust-region methods (Conn et al., 2000), self-concordant analysis (Nesterov & Nemirovski, 1994; Bach, 2010; Sun & Tran-Dinh, 2019; Dvurechensky & Nesterov, 2018) or the notion of Hessian stability (Karimireddy et al., 2018), cubic regularization (Nesterov & Polyak, 2006), contracting-point steps (Doikov & Nesterov, 2020; 2022), or gradient regularization techniques (Mishchenko, 2021; Doikov & Nesterov, 2021a)). In all of these approaches, the arithmetic complexity of each step remains costly, since it requires computing both the gradient and Hessian of the objective function, and then perform some factorization of the matrix. Hessian computations are prohibitively expensive in any large-scale setting. In this work, we explore a simple possibility for a significant acceleration of second-order schemes in terms of their total arithmetic complexity. The idea is to keep a previously computed Hessian (a stale version of the Hessian) for several iterations, while using fresh gradients in each step of the method. Therefore, we can save a lot of computational resources by reusing the old Hessians. We call this lazy Hessian updates. We justify that this idea can be successfully employed in most established second-order schemes, resulting in a provable improvement of the overall performance. reuse Hessian 2f(xm) f(x0) f(x1) . . . 
f(xm 1) f(xm) Lazy Hessian Updates: compute a new Hessian once per m iterations. For general problems1, the optimal schedule is m := d (update the Hessian every d iterations), where d is the dimension of the problem. In these cases, we gain a provable computational acceleration of our methods by a factor d. The key to our analysis is an increased value of the regularization constant. In a general case, this should be proportional to m L, where L is the Lipschitz constant of the Hessian. Thus, for the lazy Newton steps, we should regularize our model stronger to balance the possible errors 1Assuming that the cost of computing one Hessian and its appropriate factorization is d times the cost of one gradient step. See Examples 3.2, 3.3, and 3.4. Second-order optimization with lazy Hessians Method Global Complexity Local superlinear convergence Non-convex problems Hessian computations Reference Gradient Method O(ε 2) not used (Ghadimi & Lan, 2013) Classical Newton every step (Nesterov, 2018) Shamanskii Newton once per m steps (Shamanskii, 1967) Cubic Newton O(ε 3/2) every step (Nesterov & Polyak, 2006) Cubic Newton with Lazy Hessians O( mε 3/2) once per m steps ours (Algorithm 1) Reg. Newton with Lazy Hessians O( mε 3/2) once per m steps ours (Algorithm 2) Table 1: A comparison of our methods with the classical deterministic algorithms. The Global Complexity shows how much gradient computations are needed for a method to find a point x with the small gradient norm: f( x) ε. coming from the Hessian inexactness, and so the steps of the method become shorter. Fortunately, we save much more in terms of overall computational cost, which makes our approach appealing. Related Work. The idea of using an old Hessian for several steps of Newton s method probably appeared for the first time in (Shamanskii, 1967), with the study of local convergence of this process as applied to solving systems of nonlinear equations. The author proved a local quadratic rate for the iterates when updating the Hessian, and a linear rate with an improving factor for the iterates in between. Later on, this idea has been successfully combined with Levenberg Marquardt (Levenberg, 1944; Marquardt, 1963) regularization approach in (Fan, 2013), with damped Newton steps in (Lampariello & Sciandrone, 2001; Wang et al., 2006), and with the proximal Newton-type methods for convex optimization in (Adler et al., 2020). In these papers, it was established that the methods possess an asymptotic global convergence to a solution, without explicit non-asymptotic bounds on the actual rate. Compared with these works, our new second-order algorithms with the lazy Hessian updates use a different analysis that is based on the modern globalization techniques (cubic regularization (Nesterov & Polyak, 2006) and gradient regularization (Mishchenko, 2021; Doikov & Nesterov, 2021a)). It allows equipping our methods with provably fast global rates for wide classes of convex and non-convex optimization problems. Thus, we prove a global complexity O(1/ε3/2) of the lazy regularized Newton steps for achieving a second-order stationary point with the gradient norm bounded by ε, while saving a lot of computational resources by reusing the old Hessian (see Table 1 for the comparison of the global complexities). 
Another interesting connection can be established to recently developed distributed Newton-type methods (Islamov et al., 2022), where the authors propose to use a probabilistic aggregation and compression of the Hessians, as for example in federated learning. In the particular case of the single node setup, their algorithm also evaluates the Hessian rarely which is similar in spirit to our approach but using different aggregation strategies. In their technique, the node needs to consider the Hessian each iteration. Then, using a certain criterion, it decides whether to employ the new Hessian for the next step or to keep using just the old information, while in our approach the Hessian is computed only once per m iterations. The authors proved local linear and superlinear rates that are independent on the condition number. At the same time, in our paper, we primarily focus on global complexity guarantees and thus establish a provable computational improvement by a factor d for our methods as compared to updating the Hessian in every step. We also justify local superlinear rates for our approach. Contributions. We develop new efficient second-order optimization algorithms and equip them with the global complexity guarantees. More specifically, We propose the lazy Newton step with cubic regularization (Section 2). It uses the gradient computed at the current point and the Hessian at some different point from the past trajectory. We quantify the error effect coming from the inexactness in the second-order information and formulate the progress for one method step (Theorem 2.1). We show how to balance the errors for m consecutive lazy Newton steps, by increasing the regularization parameter proportionally. Based on that, we develop the Cubic Newton with Lazy Hessians (Algorithm 1) and establish its fast global convergence to a second-order stationary point (Theorem 3.1 in Section 3). This avoids that our method could get stuck in saddle points. In the case m := 1 (updating the Hessian each iteration), our rate recovers the classical rate of full Cubic Newton (Nesterov & Polyak, 2006). Then, we show that taking into account the actual arithmetic cost of the Hessian computations, the optimal choice for one phase of the method is m := d, which improves the total complexity of Second-order optimization with lazy Hessians Cubic Newton by a factor d (see Corollary 3.6). We show how to improve our method when the problem is convex in Section 4. We develop the Regularized Newton with Lazy Hessians (Algorithm 2), which replaces the cubic regularizer in the model by quadratic one. This makes the subproblem much easier to solve, involving just one standard matrix inversion, while keeping the fast rate of the original cubically regularized method. We study the local convergence of our new algorithms in Section 5 (see Theorem 5.1 and Theorem 5.3). We prove that they both enjoy superlinear convergence rates. As a particular case, we also justify the local quadratic convergence for the classical Newton method (without regularization) with the lazy Hessian updates. Illustrative numerical experiments are provided. Notation. We consider the following optimization problem, min x Rd f(x), (1) where f is a several times differentiable function, not necessarily convex, that we assume to be bounded from below: inf x f(x) f . We denote by f(x) Rd its gradient and by 2f(x) Rd d its Hessian, computed at some x Rd. 
Our analysis will apply both if optimization is over the standard Euclidean space, and also more generally if a different norm is defined by an arbitrary fixed symmetric positive definite matrix B = B 0. In that case the norm becomes x def = Bx, x 1/2, x Rd. (2) Thus, the matrix B is responsible for fixing the coordinate system in our problem. In the simplest case, we can choose B := I (identity matrix), which recovers the classical Euclidean geometry. A more advanced choice of B can take into account a specific structure of the problem (see Example 3.2). The norm for the dual objects (gradients) is defined in the standard way, g def = sup x: x 1 g, x = g, B 1g 1/2, g Rd. The induced norm for a matrix A Rd d is defined by A def = sup x,y: x 1, y 1 Ax, y = sup x: x 1 Ax . We assume that the Hessian of f is Lipschitz continuous, with some constant L > 0: 2f(x) 2f(y) L x y , x, y Rd. (3) 2. Lazy Newton Steps Let us introduce the following lazy Newton steps, where we allow the gradient and Hessian to be computed at different points x, z Rd. Thus, we denote the following quadratic model of the objective: Qx,z(y) def = f(x), y x + 1 2 2f(z)(y x), y x . We use cubic regularization of our model, for some M > 0: TM(x, z) Argmin y Rd n Qx,z(y) + M 6 y x 3o . (4) For z := x this is the iteration of Cubic Newton (Nesterov & Polyak, 2006). Thus, in our scheme, we can reuse the Hessian from a previous step z without recomputing it, which significantly reduces the overall iteration cost. Our definition implies that the point T = TM(x, z) is a global minimum of the cubically regularized model, which is generally non-convex. However, it turns out that we can compute this point efficiently by using standard techniques developed initially for trust-region methods (Conn et al., 2000). Let us denote r def = T x . The solution to the subproblem (4) satisfies the following stationarity conditions: ( f(x) + 2f(z)(T x) + Mr 2 B(T x) = 0, 2f(z) + Mr Thus, in the non-degenerate case, one step can be represented in the following form: T = x 2f(z) + Mr 2 B 1 f(x), (5) and the value r > 0 can be found by solving the corresponding univariate nonlinear equation (Nesterov & Polyak, 2006, Section 5). It can be done very efficiently from a precomputed eigenvalue or the tridiagonal decomposition of the Hessian. Typically, it is of similar cost as for matrix inversion in the classical Newton step. We discuss the computation of the iterate TM(x, z) in more details in Section 6.1. Let us define the following quantity, for y Rd: ξ(y) def = h λmin B 1/2 2f(y)B 1/2 i where [t]+ def = max{t, 0} denotes the positive part, and λmin( ) is the smallest eigenvalue of a symmetric matrix. If 2f(y) 0 for a certain y Rd, then ξ(y) = 0. Otherwise, ξ(y) shows how big (in absolute value) the smallest eigenvalue of the Hessian is w.r.t. a fixed matrix B 0. Second-order optimization with lazy Hessians Theorem 2.1. Let M L. Then, for one cubic step (4), it holds max n 1 648M 2 ξ(T)3, 1 72 2M f(T) 3/2 o Theorem 2.1 shows how much progress we can expect from one lazy Newton step with cubic regularization. The price for using the lazy Hessian is the last term in the progress bound (7), which vanishes if z := x (updating the Hessian for the current iterate). 3. Global Convergence Rates Let us consider one phase of the potential algorithm when we compute the Hessian once at a current point z := x0 and then perform m cubic steps (4) in a row: xi = TM(xi 1, z), i = 1, . . . , m. (8) We use some fixed regularization constant M > 0. 
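To make the lazy step concrete, here is a minimal NumPy sketch (the function name and signature are ours, not from the paper) of computing the displacement $T_M(x,z) - x$ from a fresh gradient and a stale Hessian, following the representation (5): for a trial radius $r$ it solves the shifted linear system and adjusts $r$ by bisection until $r = \|T - x\|$. It covers only the non-degenerate case, where the shifted matrix stays positive definite on the bracket, and it re-factorizes the matrix at every trial value of $r$; a practical implementation would instead reuse a precomputed eigenvalue or tridiagonal decomposition of $\nabla^2 f(z)$, as discussed in Section 6.1.

```python
import numpy as np

def lazy_cubic_step(grad_x, hess_z, B, M, tol=1e-10):
    """Sketch of one lazy cubic-Newton step T_M(x, z), cf. (4)-(5).

    grad_x: gradient at the current point x (fresh),
    hess_z: Hessian stored at the snapshot point z (possibly stale),
    B:      symmetric positive definite matrix defining the norm (2),
    M:      cubic regularization parameter (e.g. M = 6*m*L).
    Returns the displacement h = T - x (to be added to x).
    """
    def trial(r):
        # Solve (hess_z + (M*r/2) * B) h = -grad_x for a fixed radius r.
        h = np.linalg.solve(hess_z + 0.5 * M * r * B, -grad_x)
        return h, np.sqrt(h @ B @ h)            # step and its B-norm

    # Find r with ||h(r)||_B = r by bisection on phi(r) = ||h(r)||_B - r,
    # assuming the non-degenerate case where the shifted matrix is positive
    # definite over the bracket (Nesterov & Polyak, 2006, Section 5).
    lo, hi = 0.0, 1.0
    while trial(hi)[1] > hi:                    # grow the bracket until phi(hi) <= 0
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if trial(mid)[1] > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return trial(hi)[0]
```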
From (7), we see that the error from using the old Hessian is proportional to the cube of the distance to the starting point: xi 1 x0 3, for each 1 i m. Hopefully, we can balance off the accumulating error by aggregating as well the positive terms from (7), and by choosing a sufficiently big value of the regularization parameter M. Hence, our aim would be to ensure: i=1 xi xi 1 3 11L3 i=1 xi 1 x0 3. (9) In fact, we show (see Appendix B) that it is enough to choose M 6m L to guarantee (9). Thus we deal with the accumulating error for m consecutive lazy steps in (8). After each inner phase of cubic steps (8) with stale Hessian, we recompute the Hessian at a new snapshot point z in a new outer round. The length of each inner phase m 1 is the key parameter of the algorithm. For convenient notation, we denote by π(k) the highest multiple of m which is less than or equal to k: π(k) def = k k mod m, k 0. (10) The main algorithm is defined as follows: As suggested by our previous analysis, we will use a simple fixed rule for the regularization parameter: M := 6m L (11) Algorithm 1 Cubic Newton with Lazy Hessians Input: x0 Rd, m 1, L > 0. Choose M 0. for k = 0, 1, . . . do Set last snapshot point zk = xπ(k) Compute lazy cubic step xk+1 = TM(xk, zk) end for Theorem 3.1. Let M be fixed as in (11). Assume that the gradients for previous iterations {xi}k i=1 of Algorithm 1 are higher than a desired error level ε > 0: f(xi) ε. (12) Then, the number of iterations to reach accuracy f(xk+1) ε is at most m L(f(x0) f ) ε3/2 , (13) The total number of Hessian updates during these iterations is t O L(f(x0) f ) ε3/2 m . (14) For the minimal eigenvalues of all Hessians, it holds that min 1 i k ξ(xi) O M 2(f(x0) f ) k 1/3 . (15) Let us assume that the desired accuracy level is small, i.e. ε (f(x0) f )2/3 L which requires from the method to use several Hessians. To choose the parameter m, we need to take into account the actual computational efforts required in all iterations of our method. We denote by Grad Cost the arithmetic complexity of one gradient step, and by Hess Cost the cost of computing the Hessian and its appropriate factorization. In general, we have Hess Cost = d Grad Cost, (16) where d is the dimension of our problem (1). Example 3.2. Let f(x) = 1 i=1 ϕ( ai, x ), where ai Rd, 1 i n are given data and ϕ is a fixed univariate smooth function. In this case, f(x) = A s(x), and 2f(x) = A Q(x)A, where A Rn d is the matrix with rows a1, . . . , an; s(x) Rn and Q(x) Rn n is a diagonal matrix given by s(x) (i) def = 1 nϕ ( ai, x ), Q(x) (i,i) def = 1 nϕ ( ai, x ). Second-order optimization with lazy Hessians Because of the Hessian structure, it is convenient to use the matrix B := A A to define our primal norm (2). Indeed, for this choice of the norm, we can show that the Lipschitz constant of the Hessian depends only on the loss function ϕ (i.e. it is an absolute numerical constant which does not depend on data). At the same time, when using the identity matrix I, the corresponding Lipschitz constant depends on the size of the data matrix A . Thus, in the latter case, the value of the Lipschitz constant can be very big. Let us assume that the cost of computing ϕ ( ) and ϕ ( ) is O(1) which does not depend on the problem parameters. We denote by nz (A) the number of nonzero elements of A. Then, Hess Cost = O d nz (A) + d3 , where the last cubic term comes from computing a matrix factorization (see Section 6.1), and Grad Cost = O nz (A) + d2 , where the second term is from using the factorization. 
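To illustrate the shared structure of the gradient and the Hessian in Example 3.2, here is a small sketch under the stated assumptions (the oracle name and the logistic loss in the comment are our own illustrations): one gradient costs roughly one pass over the data, while forming the full Hessian $A^\top Q(x) A$ additionally touches every column of $A$, in line with the Grad Cost and Hess Cost estimates above.

```python
import numpy as np

def glm_oracle(A, phi_prime, phi_double_prime, x):
    """Gradient and Hessian of f(x) = (1/n) sum_i phi(<a_i, x>), cf. Example 3.2.

    A is the n-by-d data matrix with rows a_i; phi_prime and phi_double_prime
    are the first and second derivatives of the univariate loss phi.
    """
    n = A.shape[0]
    t = A @ x                                    # all inner products <a_i, x>
    grad = A.T @ (phi_prime(t) / n)              # f'(x) = A^T s(x), cost ~ nnz(A)
    hess = A.T @ (A * (phi_double_prime(t) / n)[:, None])   # A^T Q(x) A
    return grad, hess

# Hypothetical usage with the logistic loss phi(t) = log(1 + exp(t)):
# sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
# grad, hess = glm_oracle(A, sigma, lambda t: sigma(t) * (1.0 - sigma(t)), x)
```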
Thus, relation (16) is satisfied. Example 3.3. Let the representation of our objective be given by the computational graph (e.g. a neural network). In this case, we can compute the Hessian-vector product 2f(x)h for any x, h Rd at the same cost as its gradient f(x) by using automatic differentiation technique (Nocedal & Wright, 2006, Chapter 7.2). Thus, in general we can compute the Hessian as d Hessian-vector products, 2f(x) = 2f(x)e1 . . . 2f(x)ed , where e1, . . . , ed are the standard basis vectors. This satisfies (16). Example 3.4. For any differentiable function, we can use an approximation of its Hessian by finite differences. Namely, for a fixed parameter δ > 0, we can form Hx,δ Rd d as Hx,δ (i,j) := 1 δ f(x + δei) f(x) (j), where e1, . . . , ed are the standard basis vectors, and use the following symmetrization as an approximation of the true Hessian (see (Nocedal & Wright, 2006; Cartis et al., 2012; Grapiglia et al., 2022) for more details): 1 2 h Hx,δ + H x,δ i 2f(x). Thus, forming the Hessian approximation requires d + 1 gradient computations, which is consistent with (16). Hence, according to Theorem 3.1, the total arithmetic complexity of Algorithm 1 can be estimated as Arithmetic Complexity = k Grad Cost + t Hess Cost (13),(14),(16) O m + d m L(f(x0) f ) Corollary 3.5. For m = 1 we update Hessian at each step. It corresponds to the full Cubic Newton (Nesterov & Polyak, 2006), and Theorem 3.1 recovers its global iteration complexity: L(f(x0) f ) Consequently, the total arithmetic complexity of the method is bounded by L(f(x0) f ) ε3/2 Grad Cost. (18) Corollary 3.6. For m = d we obtain the optimal choice for the length of one phase, which minimizes the right hand side of (17). The total arithmetic complexity of Algorithm 1 becomes L(f(x0) f ) ε3/2 Grad Cost. This improves upon the full Cubic Newton by factor Now, let us look at the minimal eigenvalue of all the Hessians at points generated by our algorithm. Corollary 3.7. Let us fix some ε > 0 and perform m L(f(x0) f ) iterations of Algorithm 1. According to Theorem 3.1, we thus ensure min 1 i k f(xi) O(ε). At the same time, min 1 i k ξ(xi) (15) 648M 2(f(x0) f ) k 1/3 (19) = 25/332 Therefore, the negative eigenvalues of the Hessians cannot be big. For O(ε) level of gradient norm we guarantee O( m Lε) level for the smallest eigenvalue (6). 4. Minimizing Convex Functions In this section, we assume that the objective in our problem (1) is convex. Thus, all the Hessians are positive semidefinite, i.e., 2f(x) 0, x Rd. Then, we can apply the gradient regularization technique (Mishchenko, 2021; Doikov & Nesterov, 2021a), which allows using the square of the Euclidean norm as a regularizer Second-order optimization with lazy Hessians Algorithm 2 Regularized Newton with Lazy Hessians 1: Input: x0 Rd, m 1, L > 0. Choose M 0. 2: for k = 0, 1, . . . do 3: Set last snapshot point zk = xπ(k) 4: Set regularization parameter λk = p M f(xk) 5: Compute lazy Newton step: xk+1 = xk 2f(zk) + λk B 1 f(xk) 6: end for for our model. Each step of the method becomes easier to compute by employing just one matrix inversion. Thus, we come to the following scheme: The parameter M should have the same order and physical interpretation as the one in Cubic Newton. We use M := 3m L (20) The global complexity bounds for Algorithm 2 are the same (up to an additive logarithmic term) as those ones for Algorithm 1. However, each iteration of Algorithm 2 is much easier to implement since it involves solving just one linear system. Theorem 4.1. 
Let M be fixed as in (20). Assume that the gradients for previous iterations {xi}k i=1 of Algorithm 2 are higher than a desired error level ε > 0: f(xi) ε. (21) Then, the number of iterations to reach accuracy f(xk+1) ε is at most m L(f(x0) f ) ε3/2 + ln f(x0) The total number of Hessian updates t during these iterations is bounded as L(f(x0) f ) 5. Local Superlinear Convergence In this section, we discuss local convergence of our secondorder schemes with lazy Hessian updates. We show the fast linear rate with a constant factor during each phase and superlinear convergence taking into account the Hessian updates. Let us assume that our objective is strongly convex, thus 2f(x) µB, x Rd, with some parameter µ > 0. For simplicity, we require that for all points from our space, while it is possible to analyze Newton steps assuming strong convexity only in a neighborhood of the solution. Theorem 5.1. Let M 0. Assume that the initial gradient is small enough: f(x0) 1 3L+M . Then Algorithm 1 has a superlinear convergence for the gradient norm, for k 0: 2 (1+m)π(k)(1+k mod m), where π(k) is defined by (10). Corollary 5.2. Combining both Theorem 3.1 and 5.1, we conclude that for minimizing a strongly convex function by Algorithm 1 with regularization parameter M given by (11) and starting from an arbitrary x0, we need k O m2L2(f(x0) f ) µ3 + 1 ln(1+m) ln ln µ2 lazy steps to achieve f(xk) ε. We show also that Algorithm 2 locally has a slightly worse but still superlinear convergence rate, which is extremely fast from the practical perspective. For example, for m = 1, Algorithm 2 has a superlinear convergence rate of order 3/2 while Algorithm 1 has order 2. Theorem 5.3. Let M 0. Assume that the initial gradient is small enough: f(x0) 1 22 G0, where 23(3L+4M). Then Algorithm 2 has a superlinear convergence for the gradient norm, for k 0: 2 2(1 + m/2)π(k)(1+(k mod m)/2), where π(k) is defined by (10). Corollary 5.4. Combining both Theorem 4.1 and 5.3, we conclude that for minimizing a strongly convex function by Algorithm 2 with regularization parameter M given by (20) and starting from an arbitrary x0, we need to do k O m2L2(f(x0) f ) µ3 + ln m L f(x0) + 1 ln(1+m/2) ln ln µ2 lazy steps to achieve f(xk) ε. 6. Practical Implementation 6.1. Use of Matrix Factorization Computing the Hessian once per m iterations, we want to be able to solve efficiently the corresponding subproblem with this Hessian in each iteration of the method. For solving the subproblem in Cubic Newton (4), we need to find parameter r > 0 that is, in a nondegenerate case, the root of the following nonlinear equation (see also Section 5 in (Nesterov & Polyak, 2006)), φ(r) def = s(r) r = 0, (24) Second-order optimization with lazy Hessians where s(r) def = ( 2f(z) + Mr 2 B) 1 f(x), for r such that the matrix is positive definite. Applying to (24) the standard bisection method, or the univariate Newton s method with steps of the form r+ = r φ(r)/φ (r), the main operation that we need to perform is solving a linear equation: 2f(z) + τB h = f(x), (25) for different values of τ > 0. This is the same type of operation that we need to do once per each step in Algorithm 2. Now, let us assume that we have the following factorization of the Hessian: 2f(z) = UΛU , where UU = B, (26) and Λ Rd d is a diagonal or tridiagonal matrix. Thus, U is a set of vectors orthogonal with respect to B. In case B = I (identity matrix) and Λ being diagonal, (26) is called Eigendecomposition and implemented in most of the standard Linear Algebra packages. 
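As a non-authoritative sketch of how such a factorization is reused, the snippet below builds a decomposition of the form (26) from two standard eigendecompositions and then runs one phase of Algorithm 2 on top of it, so that every lazy step costs one O(d^2) solve plus a gradient. The helper names are illustrative, and the convex setting of Section 4 is assumed so that all shifted matrices are positive definite.

```python
import numpy as np

def factor_snapshot(hess_z, B):
    """Factorization (26): hess_z = U Lam U^T with U U^T = B (a sketch).

    Returns W = B^{-1/2} V and the eigenvalues lam, so that for any tau > 0
        (hess_z + tau * B)^{-1} g = W diag(1 / (lam + tau)) W^T g,
    i.e. every further solve of the form (25) costs O(d^2).
    """
    wb, Vb = np.linalg.eigh(B)
    B_inv_sqrt = Vb @ ((1.0 / np.sqrt(wb)) * Vb).T       # B^{-1/2}
    lam, V = np.linalg.eigh(B_inv_sqrt @ hess_z @ B_inv_sqrt)
    return B_inv_sqrt @ V, lam

def solve_shifted(W, lam, tau, g):
    """Solve (hess_z + tau*B) h = -g, reusing the stored factorization."""
    return -W @ ((W.T @ g) / (lam + tau))

def lazy_reg_newton_phase(x, grad, hess, B, m, M):
    """One phase of Algorithm 2 (convex case): the Hessian is computed and
    factorized once, then m regularized Newton steps reuse it."""
    W, lam = factor_snapshot(hess(x), B)         # snapshot Hessian, once per phase
    for _ in range(m):
        g = grad(x)                              # fresh gradient at every step
        tau = np.sqrt(M * np.linalg.norm(W.T @ g))   # lambda_k = sqrt(M ||g||_*)
        x = x + solve_shifted(W, lam, tau, g)
    return x
```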
In general, factorization (26) can be computed in O(d3) arithmetic operations. Namely, we can apply a standard orthogonalizing decomposition for the following matrix: B 1/2 2f(z)B 1/2 = VΛV , with VV = I, and then set U := B1/2V, which gives (26). Note that the solution to (25) can expressed as h = U Λ + τI 1U 1 f(x), and it is computable in O(d2) arithmetic operations for any given τ > 0 and f(x). The use of tridiagonal decomposition can be more efficient in practice. Indeed, inversion of the matrix Λ + τI would still cost O(d) operations for tridiagonal Λ, while it requires less computational resources and less floating point precision to compute such decomposition. In practice, it is also important to leverage a structure of the Hessian (e.g. when matrices are sparse), which can further improve the arithmetic cost of each step. 6.2. Adaptive Search To obtain our results we needed to pick the regularization parameter M proportional to m L, where m 1 is the length of one phase and L > 0 is the Lipschitz constant of the Hessian. One drawback of this choice is that we actually need to know the Lipschitz constant, which is not always the case in practice. Moreover, with a constant regularization, the methods become conservative, preventing the possibility of big steps. At the same time, from the local perspective, the best quadratic approximation of the objective is the pure secondorder Taylor polynomial. So, being in a neighborhood of the solution, the best strategy is to tend M to zero which gives the fastest local superlinear rates (see Section 5). The use of adaptive search in second-order schemes has been studied for several years (Nesterov & Polyak, 2006; Cartis et al., 2011a;b; Grapiglia & Nesterov, 2017; 2019; Doikov & Nesterov, 2021b). It is well known that such schemes have a very good practical performance, while the adaptive search makes the methods to be also universal (Grapiglia & Nesterov, 2017; Doikov & Nesterov, 2021b), that is to adapt to the best H older degree of the Hessian, or even super-universal (Doikov et al., 2022) which chooses automatically the H older degree of either the second or third derivative of the objective, working properly on a wide range of problem classes with the best global complexity guarantees. We propose Algorithm 3 which changes the value of M adaptively and checks the functional progress after m steps to validate its choice. Importantly, the knowledge of the Lipschitz constant L is not needed. Algorithm 3 Adaptive Cubic Newton with Lazy Hessians 1: Input: x0 Rd, m 1. Fix some M0 > 0. 2: for t = 0, 1, . . . do 3: Compute snapshot Hessian 2f(xtm) 4: repeat 5: Update Mt = 2 Mt 6: for i = 1, . . . , m do 7: Lazy cubic step xtm+i = TMt(xtm+i 1, xtm) 8: end for 9: until f(xtm) f(xtm+m) 1 Mt i=1 f(xtm+i) 3/2 10: Set Mt+1 = 1 4 Mt 11: end for M0 is an initial guess for the regularization constant, which can be further both increased or decreased dynamically. Theorem 6.1. Let the gradients for previous iterates {xi}tm i=1 during t phases of Algorithm 3 are higher than a desired error level ε > 0: f(xi) ε. (27) Then, the number of phases to reach accuracy f(xtm+1) ε is at most m , L f(x0) f ε3/2 m . (28) The total number N of gradient calls during these phases is bounded as N 2tm + max{1, log2 2935m L M0 }m. (29) Remark 6.2. According to Theorem 6.1, the global complexity of the adaptive algorithm is the same as the one given by Theorem 3.1 for the method with a fixed regularization constant. 
Due to (29), the average number of tries of different $M_t$ per one phase is only two. Clearly, it is also possible to incorporate a similar adaptive search into Algorithm 2 for the convex case (see Appendix E.2).

7. Experiments

We demonstrate an illustrative numerical experiment on the performance of the proposed second-order methods with lazy Hessian updates. We consider the following convex minimization problem with the Soft Maximum objective (log-sum-exp):

$\min_{x \in \mathbb{R}^d} f(x) := \mu \ln \Big( \sum_{i=1}^{n} \exp\big( \tfrac{\langle a_i, x \rangle - b_i}{\mu} \big) \Big) \;\approx\; \max_{1 \le i \le n} \big[ \langle a_i, x \rangle - b_i \big].$ (30)

The problems of this type are important in applications with minimax strategies for matrix games and for training $\ell_\infty$-regression (Nesterov, 2005; Bullins, 2020). To generate the data, we sample randomly the vectors $a_1, \ldots, a_n \in \mathbb{R}^d$ and $b \in \mathbb{R}^n$ with elements from the uniform distribution on $[-1, 1]$. Then, we build an auxiliary objective $\tilde{f}$ of the form (30) with these vectors and set $a_i := a_i - \nabla \tilde{f}(0)$. This ensures the optimum is at the origin, since $\nabla f(0) = 0$. The starting point is $x_0 = (1, \ldots, 1)$. For the primal norm (2), we use the matrix

$B := \sum_{i=1}^{n} a_i a_i^{\top} + \delta I \succ 0,$ (31)

where $\delta > 0$ is a small perturbation parameter to ensure positive definiteness. Then, the corresponding Lipschitz constant of the Hessian is bounded by $L = 2/\mu^2$, where $\mu > 0$ is the smoothing parameter (see, e.g., (Doikov, 2021, Example 1.3.5)). Since the problem is convex, we can apply Newton's method with gradient regularization (Algorithm 2). In Figure 1, we compare different values of the parameter $m$, that is, the frequency of updating the Hessian. The regularization parameter is fixed as $M := 1$. We also show the performance of the Gradient Method as a standard baseline. We see that increasing the parameter $m$, thus reusing old Hessians for more of the inner steps, significantly improves the overall performance of the method in terms of the total computational time. The best frequency is $m := d$, which confirms our theory.

[Figure 1: Gradient norm depending on the number of computed gradients and on running time, when varying the frequency $m$ of updating the Hessian in Algorithm 2. Panels: Log-sum-exp with $d = 100$, $n = 100$, $\mu = 0.5$ and with $d = 100$, $n = 500$, $\mu = 0.5$; curves for GM and $m \in \{1, 2, 10, d, 10d\}$.]

8. Discussion

Conclusions. In this work, we have developed new second-order algorithms with lazy Hessian updates for solving general possibly non-convex optimization problems. We show that it can be very efficient to reuse the previously computed Hessian for several iterations of the method, instead of updating the Hessian each step. Indeed, in general, the cost of computing the Hessian matrix is $d$ times more expensive than computing one gradient. At the same time, it is intuitively clear that even inexact second-order information from the past should significantly help the method in dealing with ill-conditioning of the problem. In our analysis, we show that this intuition truly works. By using cubic regularization and gradient regularization techniques, we establish fast global and local rates for our second-order methods with lazy Hessian updates. We show that the optimal strategy for updating the Hessian is once per $d$ iterations, which gives a provable improvement of the total arithmetic complexity by a factor of $\sqrt{d}$. Our approach also works with classical Newton steps (without regularization), achieving a local quadratic convergence.
Note that it is possible to extend our results onto the composite convex optimization problems, i.e. minimizing the sum f(x) + ψ(x), where f( ) is convex and smooth, while ψ( ) is a simple closed convex proper function but not necessarily differentiable (see Appendix F). Directions for Future Work. One important direction for further research can be a study of the problems with a specific Hessian structure (e.g. sparsity or a certain spectral clustering). Then, we may need to have different schedules for updating the Hessian matrix, while it can also help in Second-order optimization with lazy Hessians solving each step more efficiently. Another interesting question is to make connections between our approach and classical Quasi-Newton methods (Nocedal & Wright, 2006), which gradually update the approximation of the Hessian after each step. Recently discovered nonasymptotic complexity bounds (Rodomanov & Nesterov, 2021; Rodomanov, 2022) for Quasi-Newton methods may be especially useful for reaching this goal. We also think that it is possible to generalize our analysis to high-order optimization schemes (Nesterov, 2021) as well. Note that the main advantage of our schemes is that we can reuse the precomputed factorization of the Hessian and thus we can perform several steps with the same Hessian in an efficient manner. At the same time, it is not clear up to now, how to compute the tensor step efficiently by reusing an old high-order tensor. This would require using some advanced tensor decomposition techniques. Finally, it seems to be very important to study the effect of lazy Hessian updates for convex optimization in more details. In our analysis, we studied only a general non-convex convergence in terms of the gradient norm. Another common accuracy measure in the convex case is the functional residual. Thus, it could be possible to prove some better convergence rates using this measure, as well as considering accelerated (Nesterov, 2018) and super-universal (Doikov et al., 2022) second-order schemes. We keep these questions for further investigation. Acknowledgements This work was supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00133. Adler, I., Hu, Z. T., and Lin, T. New proximal Newton-type methods for convex optimization. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 4828 4835. IEEE, 2020. Bach, F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384 414, 2010. Bullins, B. Highly smooth minimization of non-smooth problems. In Conference on Learning Theory, pp. 988 1030. PMLR, 2020. Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245 295, 2011a. Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regu- larisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130(2):295 319, 2011b. Cartis, C., Gould, N. I., and Toint, P. L. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22(1):66 86, 2012. Conn, A. R., Gould, N. I., and Toint, P. L. Trust region methods. SIAM, 2000. Doikov, N. New second-order and tensor methods in Convex Optimization. Ph D thesis, Universit e catholique de Louvain, 2021. Doikov, N. and Nesterov, Y. 
Convex optimization based on global lower second-order models. Advances in Neural Information Processing Systems, 33, 2020.
Doikov, N. and Nesterov, Y. Gradient regularization of Newton method with Bregman distances. arXiv preprint arXiv:2112.02952, 2021a.
Doikov, N. and Nesterov, Y. Minimizing uniformly convex functions by cubic regularization of Newton method. Journal of Optimization Theory and Applications, pp. 1-23, 2021b.
Doikov, N. and Nesterov, Y. Affine-invariant contracting-point methods for convex optimization. Mathematical Programming, pp. 1-23, 2022.
Doikov, N., Mishchenko, K., and Nesterov, Y. Super-universal regularized Newton method. arXiv preprint arXiv:2208.05888, 2022.
Dvurechensky, P. and Nesterov, Y. Global performance guarantees of second-order methods for unconstrained convex minimization. Technical report, CORE Discussion Paper, 2018.
Fan, J. A Shamanskii-like Levenberg-Marquardt method for nonlinear equations. Computational Optimization and Applications, 56(1):63-80, 2013.
Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
Grapiglia, G. N. and Nesterov, Y. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478-506, 2017.
Grapiglia, G. N. and Nesterov, Y. Accelerated regularized Newton methods for minimizing composite convex functions. SIAM Journal on Optimization, 29(1):77-99, 2019.
Grapiglia, G. N., Gonçalves, M. L., and Silva, G. A cubic regularization of Newton's method with finite difference Hessian approximations. Numerical Algorithms, 90(2):607-630, 2022.
Hanzely, S., Kamzolov, D., Pasechnyuk, D., Gasnikov, A., Richtárik, P., and Takáč, M. A damped Newton method achieves global O(1/k²) and local quadratic convergence rate. arXiv preprint arXiv:2211.00140, 2022.
Islamov, R., Qian, X., Hanzely, S., Safaryan, M., and Richtárik, P. Distributed Newton-type methods with communication compression and Bernoulli aggregation. arXiv preprint arXiv:2206.03588, 2022.
Kantorovich, L. V. Functional analysis and applied mathematics [in Russian]. Uspekhi Matematicheskikh Nauk, 3(6):89-185, 1948.
Karimireddy, S. P., Stich, S. U., and Jaggi, M. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.
Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pp. 1895-1904. PMLR, 2017.
Lampariello, F. and Sciandrone, M. Global convergence technique for the Newton method with periodic Hessian evaluation. Journal of Optimization Theory and Applications, 111(2):341-358, 2001.
Levenberg, K. A method for the solution of certain nonlinear problems in least squares. Quarterly of Applied Mathematics, 2(2):164-168, 1944.
Marquardt, D. W. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431-441, 1963.
Mishchenko, K. Regularized Newton method with global O(1/k²) convergence. arXiv preprint arXiv:2112.02089, 2021.
Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.
Nesterov, Y. Lectures on convex optimization, volume 137.
Springer, 2018. Nesterov, Y. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, 186:157 183, 2021. Nesterov, Y. and Nemirovski, A. Interior-point polynomial algorithms in convex programming. SIAM, 1994. Nesterov, Y. and Polyak, B. Cubic regularization of Newton s method and its global performance. Mathematical Programming, 108(1):177 205, 2006. Nocedal, J. and Wright, S. Numerical optimization. Springer Science & Business Media, 2006. Ortega, J. M. and Rheinboldt, W. C. Iterative solution of nonlinear equations in several variables. SIAM, 2000. Rodomanov, A. Quasi-Newton Methods with Provable Efficiency Guarantees. Ph D thesis, UCL-Universit e Catholique de Louvain, 2022. Rodomanov, A. and Nesterov, Y. New results on superlinear convergence of classical quasi-Newton methods. Journal of optimization theory and applications, 188(3):744 769, 2021. Shamanskii, V. A modification of Newton s method. Ukrainian Mathematical Journal, 19(1):118 122, 1967. Sun, T. and Tran-Dinh, Q. Generalized self-concordant functions: a recipe for Newton-type methods. Mathematical Programming, 178(1-2):145 213, 2019. Wang, C.-y., Chen, Y.-y., and Du, S.-q. Further insight into the Shamanskii modification of Newton method. Applied mathematics and computation, 180(1):46 52, 2006. Supplementary Material We consider the problem (1) : min x Rd f(x), (32) We assume that the Hessian of f is Lipschitz continuous, with some constant L > 0: 2f(x) 2f(y) L x y , x, y Rd. (33) The immediate consequences are the following inequalities, valid for all x, y Rd: f(y) f(x) 2f(x)(y x) L 2 x y 2, (34) |f(y) f(x) f(x), y x 1 2 2f(x)(y x), y x | L 6 y x 3. (35) In our analysis, we frequently use the standard Young s inequality for product: q , a, b 0, which is valid for any p, q > 1 such that 1 A. Proofs for Section 2 Our lazy Newton steps, allow the gradient and Hessian to be computed at different points x, z Rd. We use cubic regularization of our model with a parameter M > 0: TM(x, z) Argmin y Rd n f(x), y x + 1 2 2f(z)(y x), y x + M 6 y x 3o . (36) For z := x this is the iteration of Cubic Newton (Nesterov & Polyak, 2006). Thus, in our scheme, we can reuse the Hessian from a previous step z without recomputing it, which significantly reduces the overall iteration cost. Our definition implies that the point T = TM(x, z) is a global minimum of the cubically regularized model, which is generally non-convex. However, it turns out that we can compute this point efficiently by using standard techniques developed initially for trust-region methods (Conn et al., 2000). Let us denote r def = T x . The solution to the subproblem (36) satisfies the following stationarity conditions (see (Nesterov & Polyak, 2006)): f(x) + 2f(z)(T x) + Mr 2 B(T x) = 0, (37) and 2f(z) + Mr 2 B 0. (38) Thus, in the non-degenerate case, one step can be represented in the following form: T = x 2f(z) + Mr 2 B 1 f(x), (39) and the value r > 0 can be found by solving the corresponding univariate nonlinear equation (see (Nesterov & Polyak, 2006)[Section 5]). It can be done very efficiently from a precomputed eigenvalue or the tridiagonal decomposition of the Hessian. These decompositions typically require O(d3) arithmetic operations, which is of similar cost as for matrix inversion in the classical Newton step. We discuss the computation of the new iterate TM(x, z) in more detail in Section 6.1. Second-order optimization with lazy Hessians Now, let us express the progress achieved by the proposed step (36). 
Firstly, we can relate the length of the step and the gradient at the new point as follows: (34) f(T) f(x) 2f(x)(T x) (37) = f(T) + Mr 2 B(T x) + ( 2f(z) 2f(x))(T x) 2 r 2f(z) 2f(x) . Rearranging terms and using Lipschitz continuity of the Hessian, we get 2 r2 + r 2f(z) 2f(x) 2 r2 + Lr z x . Thus, we have established Lemma A.1. For any M > 0, it holds that 2 r2 + Lr z x . (40) Secondly, we have the following progress in terms of the objective function. Lemma A.2. For any M > 0, it holds that f(x) f(T) 5M 4L 3M 2 z x 3. (41) Proof. Indeed, multiplying (37) by the vector T x, we get f(x), T x = 2f(z)(T x), T x M 2 2f(z)(T x), T x M Hence, using the global upper bound on the objective, we obtain f(T) (35) f(x) + f(x), T x + 1 2 2f(x)(T x), T x + L (42) f(x) M 2 ( 2f(x) 2f(z))(T x), T x (33) f(x) M Applying Young s inequality for the last term, 2 r2 z x = M 2/3 2 321/3 r2 321/3L M 2/3 z x M 24r3 + 32L3 3M 2 z x 3, we obtain (41). We can also bound the smallest eigenvalue of the new Hessian, which will be crucial to understand the behaviour for non-convex objectives. Indeed, using Lipschitz continuity of the Hessian and the triangle inequality, we have: 2f(T) 2f(z) L z T B 2f(z) (L z x + Lr)B 2 r + L z x + Lr)B. Second-order optimization with lazy Hessians Let us denote the following quantity, for any y Rd: ξ(y) def = h λmin B 1/2 2f(y)B 1/2 i where [t]+ def = max{t, 0} denotes the positive part, and λmin( ) is the smallest eigenvalue of a symmetric matrix. If 2f(y) 0 for a certain y Rd, then ξ(y) = 0. Otherwise, ξ(y) shows how big (in absolute value) the smallest eigenvalue of the Hessian is with respect to a fixed matrix B 0. Thus, due to (43), we can control the quantity ξ as follows: Lemma A.3. For any M > 0, it holds 2 r + L z x . (45) Now, we can combine (40) and (45) with (41) to express the functional progress of one step using the new gradient norm and the smallest eigenvalue of the new Hessian. Theorem A.4. Let M L. Then, for one cubic step (4), we have the one step progress bound (7), i.e., f(x) f(T) max n 1 648M 2 ξ(T)3, 1 72 2M f(T) 3/2 o + M M 2 z x 3. (46) Proof. Indeed, using convexity of the function t 7 t3 for t 0, we obtain ξ(T)3 (45) 3 2Mr + L z x 3 27 2 M 3r3 + 4L3 z x 3. 2 1 27M 2 ξ(T)3 4L3 27M 2 z x 3, which gives the first part of the maximum when substituting into (41). Then, using convexity of the function t 7 t3/2 for t 0 and Young s inequality, we get f(T) 3/2 (40) Mr2 + Lr z x 3/2 2(Lr z x )3/2 2M 3/2 z x 3 = 3 2M 3/2 z x 3. Therefore, Mr3 2M f(T) 3/2 L3 6M 2 z x 3. Substituting this bound into (41) completes the proof. B. Proofs for Section 3 Let us denote the corresponding step lengths by rk+1 := xk+1 xk . Then, according to Theorem 2.1, for each 0 k m 1, we have the following progress in terms of the objective function: f(xk) f(xk+1) (7) max n 1 648M 2 ξ(xk+1)3, 1 72 2M f(xk+1) 3/2 o + M 48r3 k+1 11L3 M 2 x0 xk 3. Telescoping this bound for different k, and using triangle inequality for the last negative term, we get f(x0) f(xm) k=1 max n 1 648M 2 ξ(xk)3, 1 72 k=1 f(xk) 3/2 o + M k=1 r3 k 11L3 i=1 ri 3 . (47) It remains to use the following simple technical lemma. Second-order optimization with lazy Hessians Lemma B.1. For any sequence of positive numbers {rk}k 1, it holds for any m 1: i=1 ri 3 m3 k=1 r3 k (48) Proof. We prove (48) by induction. It is obviously true for m = 1, which is our base. Assume that it holds for some arbitrary m 1. Then i=1 ri 3 = m 1 P i=1 ri 3 + m P i=1 ri 3 (48) m3 k=1 r3 k + m P i=1 ri 3 . 
(49) Applying Jensen s inequality for convex function t 7 t3, t 0, we have a bound for the second term: m P i=1 ri 3 = m3 1 m i=1 ri 3 m2 m P i=1 r3 i . (50) Therefore, m P i=1 ri 3 (49),(50) m3 k=1 r3 k (m+1)3 k=1 r3 k. 2 Using this bound, we ensure the right-hand side of (47) to be positive. For that, we need to use an increased value of the regularization parameter. Thus, we prove the following guarantee. Corollary B.2. Let M 6m L. Then, f(x0) f(xm) m P k=1 max n 1 648M 2 ξ(xk)3, 1 72 2M f(xk) 3/2 o . (51) Theorem B.3. Let the objective function be bounded from below: inf x f(x) f . Let the regularization parameter M be fixed as in (11). Assume that the gradients for previous iterations {xi}k i=1 are higher than a desired error level ε > 0: f(xi) ε. (52) Then, the number of iterations of Algorithm 1 to reach accuracy f(xk+1) ε is at most m L(f(x0) f ) ε3/2 , (53) where O( ) hides some absolute numerical constants. The total number of Hessian updates t during these iterations is bounded as (14), i.e., t O L(f(x0) f ) ε3/2 m . (54) For the minimal eigenvalues of all Hessians, it holds that h λmin B 1/2 2f(xi)B 1/2 i + 648M 2(f(x0) f ) k 1/3 . (55) Proof. Without loss of generality, we assume that k is a multiple of m: k = tm. Then, for the ith (1 i t) phase of the method, we have f(xm(i 1)) f(xmi) (51),(52) m 72 2M ε3/2 = m 144 3Lε3/2. (56) Telescoping this bound for all phases, we obtain f(x0) f f(x0) f(xk) (56) t m 144 This gives the claimed bound (14) for t. Taking into account that k = tm, (53) follows immediately. Telescoping the bound (51) for the smallest Hessian eigenvalue ξ( ), we obtain f(x0) f f(x0) f(xk) k 648M 2 min 1 i k ξ(xi)3. (57) Hence, min 1 i k h λmin B 1/2 2f(xi)B 1/2 i + min 1 i k ξ(xi) (57) 648M 2(f(x0) f ) which completes the proof. Second-order optimization with lazy Hessians C. Proofs for Section 4 Here we assume all the Hessians are positive semidefinite: 2f(x) 0, x Rd. (58) For a fixed z Rd, let us denote one regularized lazy Newton step from point x Rd by x+ def = x 2f(z) + λB 1 f(x), λ > 0. (59) We choose parameter λ as in Algorithm 2: M f(x) (60) where M > 0 is fixed. The Newton step with gradient regularization can be seen as an approximation of the Cubic Newton step (39), which utilizes convexity of the objective. Indeed, we have by convexity: x+ x = 2f(z) + λB 1 f(x) (58) 1 λ f(x) , (61) and so we can ensure our main bound: M x+ x (61) M λ f(x) (60) = λ, (62) which justifies the actual choice of λ. Now, we need to have analogues of Lemma A.1 and Lemma A.2 that we established for the cubically regularized lazy Newton step. We can prove the following. Lemma C.1. For any M > 0, it holds 2 x+ x + λ + L z x x+ x . (63) Consequently, for M 3L, we get f(x+) (63),(62) 7 6λ + L z x x+ x . (64) Proof. Indeed, we have 2 x+ x 2 (34) f(x+) f(x) 2f(x)(x+ x) f(x+) f(x) + 2f(z)(x+ x) L z x x+ x , where we used triangle inequality and Lipschitz continuity of the Hessian in the last bound. It remains to note that f(x) + 2f(z)(x+ x) (59) = λ x+ x . Lemma C.2. For M 3L, it holds f(x) f(x+) λ 4 x+ x 2 8L3 27M 2 z x 3. (65) Proof. Combining the method step with Lipschitz continuity of the Hessian, we obtain f(x+) (35),(62) f(x) + f(x), x+ x + 1 2 2f(x)(x+ x), x+ x + M (59) = f(x) λ x+ x 2 2f(z)(x+ x), x+ x + M 2 2f(x)(x+ x), x+ x (33),(58),(62) f(x) λ x+ x 2 + λ 2 x+ x 2 + L 2 z x x+ x 2. 
Second-order optimization with lazy Hessians Applying Young s inequality for the last term, 2 z x x+ x 2 = (3M)2/3 4 x+ x 2 2L (3M)2/3 z x 4 x+ x 3 + 8L3 27M 2 z x 3 (62) λ 4 x+ x 2 + 8L3 27M 2 z x 3. we obtain (65). Combining these two lemmas, we get the functional progress of one step in terms of the new gradient norm. Theorem C.3. Let M 3L. Then, for one lazy Newton step (59) with gradient regularization (60), it holds: f(x) f(x+) 9 244λ f(x+) 2 + M 8 x+ x 3 3L3 M 2 z x 3. (66) Proof. Using convexity of the function t 7 t2 for t 0 and Young s inequality, we get (64) 7 6λ x+ x + L z x x+ x 2 49 18λ2 x+ x 2 + 2L2 z x 2 x+ x 2 49 18λ2 x+ x 2 + 4λL3 3M 2 z x 3 + 2M 4 x+ x 6 (62) 61 18λ2 x+ x 2 + 4λL3 3M 2 z x 3. λ 4 x+ x 2 (62) λ 8 x+ x 2 + M (67) 9 244λ f(x+) 2 + M 8 x+ x 3 3L3 61M 2 z x 3. Substituting this bound into (65) completes the proof. Let us consider one phase of the algorithm. Telescoping bound (66) for the first m iterations and using triangle inequality, i=1 xi xi 1 =: k P Corollary C.4. Let M 3m L. Then f(x0) f(xm) 9 244 f(xk) 2 f(xk 1) 1/2 + M k=1 r3 k 3L3 f(xk) 2 f(xk 1) 1/2 . We are ready to prove the global complexity bound for Algorithm 2. Let us consider the constant choice of the regularization parameter: M := 3m L (69) Theorem C.5. Let the objective function be convex and bounded from below: inf x f(x) f . Let the regularization parameter M be fixed as in (20). Assume that the gradients for previous iterations {xi}k i=1 are higher than a desired error level ε > 0: f(xi) ε. (70) Second-order optimization with lazy Hessians Then, the number of iterations of Algorithm 2 to reach accuracy f(xk+1) ε is at most m L(f(x0) f ) ε3/2 + ln f(x0) The total number of Hessian updates t during these iterations is bounded as L(f(x0) f ) Proof. Without loss of generality we assume that k is a multiple of m: k = tm. Then, for ith (1 i t) phase of the method, we have f(xm(i 1)) f(xmi) (68) 9 244 m(i 1)+1 j mi f(xj) 2 f(xj 1) 1/2 m(i 1)+1 j mi f(xj) f(xj 1) 1 Telescoping this bound for all phases and using inequality between arithmetic and geometric means, we get f(x0) f f(x0) f(xk) (73) 9ε3/2 f(xj) f(xj 1) 1 f(xj) f(xj 1) 2k = 9ε3/2k 244 f(xk) f(x0) (70) 9ε3/2k 244 2k = 9ε3/2k 244 M exp 1 2k ln ε f(x0) 1 1 2k ln f(x0) Rearranging the terms gives (71). Inequality (72) follows immediately from t = k/m. We see that the global complexity bounds for Algorithm 2 are the same (up to an additive logarithmic term) as those ones for Algorithm 1. However, each iteration of Algorithm 2 is much easier to implement since it involves just one standard matrix inversion. D. Proofs for Section 5 We assume here that our objective is strongly convex with some parameter µ > 0. 2f(x) µB, x Rd. (74) By strong convexity (74), we have a bound for the distance to the optimum x def = argminx f(x) using the gradient norm (see, e.g. (Nesterov, 2018)), for all x Rd: x x 1 µ f(x) . (75) Now, let us look at one lazy Cubic Newton step T = TM(x, z) with some M 0, for which we have: r T x (39) = 2f(z) + Mr 2 B 1 f(x) (74) 1 µ f(x) . (76) Note that M = 0 corresponds to the classical pure Newton step with a lazy Hessian. Then, applying Lemma A.1 with triangle inequality, we get f(T) (40) M+L 2 r2 + Lr z x M+L 2 r2 + Lr x x + Lr z x (75),(76) M+3L 2µ2 f(x) 2 + L µ2 f(x) f(z) . Second-order optimization with lazy Hessians This inequality leads to a superlinear convergence of the method. Theorem D.1. Let M 0. Assume that initial gradient is small enough: 2(M+3L). 
(78) Then, for the iterations of Algorithm 1, we have a superlinear convergence for the gradient norms: 2 (1+m)π(k)(1+k mod m), k 0, (79) where π(k) is defined by (10). Proof. According to (77), for one iteration of Algorithm 1, it holds f(xk+1) M+3L 2µ2 f(xk) 2 + L µ2 f(xk) f(xπ(k)) . Let us multiply this inequality by c := M+3L 2µ2 and denote sk def = c f(xk) . This yields the following recursion sk+1 s2 k + sksπ(k), (80) and by initial condition (78), we have s0 1 Required inequality from our claim is 2 (1+i)(1+m)t+1, (81) for any t 0 (phases of the method) and 0 i m. We prove it by induction. Indeed, stm+i+1 (80) s2 tm+i + stm+istm (81) 1 2 2(1+i)(1+m)t+2 + 1 2 (1+i)(1+m)t+1+(1+m)t+1 2 (2+i)(1+m)t+2 + 1 2 (2+i)(1+m)t+2 = 1 2 (2+i)(1+m)t+1, which is (81) for the next index. Corollary D.2. Combining both Theorem 3.1 and 5.1, we conclude that for minimizing a strongly convex function by Algorithm 1 with regularization parameter M given by (11) and starting from an arbitrary x0, we need to do k O m2L2(f(x0) f ) µ3 + 1 ln(1+m) ln ln µ2 lazy steps to achieve f(xk) ε. Let us also analyze the local behaviour of the lazy Newton steps with gradient regularization, that we perform in Algorithm 2. Note that by strong convexity (74) we have the following bound for the length of one step, which is x+ x (59) = 2f(z) + λB 1 f(x) (74) 1 µ f(x) . (82) Therefore, we can estimate the norm of the gradient at new point taking into account the actual choice of the regularization parameter (60). Our reasoning works for any M 0. Applying Lemma C.1 and triangle inequality, we get f(x+) (63) L 2 x+ x + λ + L z x x+ x 2µ f(x) + λ + L z x 1 2µ f(x) + λ + L x x + L z x 1 (75),(60) 3L 2µ2 f(x) 2 + M µ f(x) 3/2 + L µ2 f(x) f(z) Second-order optimization with lazy Hessians We see that the power of the gradient norm for the second term in the right hand side is 3/2, which is slightly worse than 2, that is in the cubic regularization (compare with (77)). However, it is still a local superlinear convergence, which is extremely fast from the practical perspective. Theorem D.3. Let M 0. Assume that initial gradient is small enough: 25(3L+4M). (84) Then, for the iterations of Algorithm 2, we have a superlinear convergence for the gradient norms: 23(3L+4M) 1 2 2(1 + m/2)π(k)(1+(k mod m)/2), k 0, (85) where π(k) is defined by (10). Proof. According to (83), for one iteration of Algorithm 2, it holds f(xk+1) 3L µ2 f(xk) 2 + M µ f(xk) 3/2 + L µ2 f(xk) f(xπ(k)) . Let us multiply this inequality by c := 2(4M+3L) µ2 and denote sk def = c f(xk) . This yields the following recursion (compare with (80)): sk+1 1 2 s2 k + s3/2 k + sksπ(k), (86) and by initial condition (84), we have s0 1 24 . Let us prove by induction that 2 2(1+(1+m/2)t), (87) for any t 0 (phases of the method), and 2 i(1+m/2)t stm, (88) for 0 i m. Then (85) clearly follows. Note that from (87) we have 1 2s1/2 tm + 3 2stm 2s1/2 tm 2 (1+m/2)t . (89) Then, assuming that (87) and (89) holds for the current iterate, we get stm+i+1 (86) 1 2 s2 tm+i + s3/2 tm+i + stm+istm (88),(87) 1 2 i(1+m/2)t stm 1 2s1/2 tm + 3 2 (i+1)(1+m/2)t stm, which is (88) for the next index. It remains to observe that substituting i := m into (88) we obtain (87) for the next phase. Indeed, s(t+1)m = stm+m (88) 1 2 m(1+m/2)t stm 2 m(1+m/2)t+2+2(1+m/2)t = 1 2 2+2(1+m/2)t+1 . 2 Corollary D.4. 
Combining both Theorem 4.1 and 5.3, we conclude that for minimizing a strongly convex function by Algorithm 2 with regularization parameter M given by (20) and starting from an arbitrary x0, we need to do k O m2L2(f(x0) f ) µ3 + ln m L f(x0) µ2 + 1 ln(1+m/2) ln ln µ2 lazy steps to achieve f(xk) ε. Second-order optimization with lazy Hessians E. Proofs for Section 6 and the Adaptive Regularized Newton E.1. Proof of the General Case The key to our analysis of Algorithm 3 is inequality (51) on m consecutive steps of the method. Note that it holds for any value of M that is sufficiently big. Hence, there is no necessity to fix the regularization parameter. We are going to change its value adaptively while checking the condition on the functional progress after m steps of the method. The value M0 is just an initial guess for the regularization constant, which can be further both increased or decreased dynamically. This process is well defined. Indeed, due to (51), the stopping condition is satisfied as long as Mt 2835m L. Hence, we ensure that all regularization coefficients always satisfy the following bound: Mt max{2M0, 2935m L}. (90) Substituting this bound into the stopping condition, and telescoping it for the first t 0 phases, we get f(x0) f f(x0) f(xtm) 1 max{2M0,2935m L} i=1 f(xi) 3/2 . (91) Thus, we obtain the following global complexity guarantee. Theorem E.1. Let the objective function be bounded from below: inf x f(x) f and let the gradients for previous iterates {xi}tm i=1 during t phases of Algorithm 3 are higher than a desired error level ε > 0: f(xi) ε. (92) Then, the number of phases of Algorithm 3 to reach accuracy f(xtm+1) ε is at most m , L f(x0) f ε3/2 m . (93) The total number N of gradient calls during these phases is bounded as N 2t + log2 Mt M0 m (90) 2tm + max{1, log2 2935m L M0 }m. (94) Proof. Bound (93) follows directly from (91) and (92). To prove (94), we denote by ni the number of times the do ...until loop is performed at phase 0 i t 1. By our updates, it holds 2ni 2 = Mi+1 It remains to note that m (95) = 2t + t 1 P i=0 log2 Mi+1 m = 2t + log2 Mt M0 Remark E.2. According to Theorem 6.1, the global complexity of the adaptive algorithm is the same as that one given by Theorem 3.1 for the method with a fixed regularization constant. Due to (94), the average number of tries of different Mt per one phase is only two. E.2. Adaptive Search for the Regularized Newton Let us present an adaptive version of Algorithm 2, which is suitable for solving convex minimization problems (Algorithm 4 below). Repeating the same reasoning as in section E.1 , we straightforwardly obtain the same global guarantees for this process as in Theorem 4.1 for the method with a fixed Lipschitz constant. The cost of adaptive search again is only one extra try of regularization parameter in average per phase. Second-order optimization with lazy Hessians Algorithm 4 Adaptive Regularized Newton with Lazy Hessians 1: Input: x0 Rd, m 1. Fix some M0 > 0. 2: for t = 0, 1, . . . do 3: Compute snapshot Hessian 2f(xtm) 4: repeat 5: Update Mt = 2 Mt 6: for i = 1, . . . , m do 7: Denote k = tm + i 1 and set λk = p Mt f(xk) 8: Compute lazy Newton step xk+1 = 2f(xtm) + λk B 1 f(xk) 9: end for 10: until f(xtm) f(xtm+m) Pm i=1 1 λtm+i 1 f(xtm+i) 2 11: Set Mt+1 = 1 4 Mt 12: end for F. 
F. Composite Convex Optimization

Let us consider composite convex optimization problems of the following form:

$\min_{x \in \mathrm{dom}\,\psi}\Big\{\, F(x) \overset{\mathrm{def}}{=} f(x) + \psi(x) \,\Big\}, \qquad (96)$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a several times differentiable convex function with Lipschitz continuous Hessian (3), and $\psi : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ is a proper closed convex function that can be non-differentiable. However, we assume that it has a simple structure, which allows us to efficiently solve the corresponding minimization subproblems involving $\psi(\cdot)$. In this section, we demonstrate how to generalize our analysis of the lazy Newton steps to this class of problems. For example, $\psi$ can be the indicator of a simple closed convex set $Q \subseteq \mathbb{R}^d$:

$\psi(x) = \begin{cases} 0, & x \in Q, \\ +\infty, & \text{otherwise}, \end{cases}$

in which case problem (96) becomes the constrained optimization problem $\min_{x\in Q} f(x)$.

F.1. Lazy Cubic Newton for Composite Problems

We replace the lazy Cubic Newton step (4) by the following one:

$T_M(x, z) \;=\; \operatorname*{argmin}_{y \in \mathrm{dom}\,\psi}\Big\{\, \langle\nabla f(x), y - x\rangle + \frac{1}{2}\langle\nabla^2 f(z)(y-x), y-x\rangle + \frac{M}{6}\|y-x\|^3 + \psi(y) \,\Big\}. \qquad (97)$

Since we assume here that both f and ψ are convex, the solution to (97) always exists and is unique, due to the uniform convexity of the cubic regularizer (Nesterov, 2008). Then, we can use this point in exactly the same way as the basic step (4) in Algorithm 1 and Algorithm 3. Let us present the main inequalities that are needed for its theoretical analysis. The stationarity condition for $T := T_M(x,z)$ is as follows: for any $y \in \mathrm{dom}\,\psi$,

$\Big\langle \nabla f(x) + \nabla^2 f(z)(T-x) + \frac{M}{2}r\,B(T-x),\; y - T \Big\rangle + \psi(y) \;\ge\; \psi(T), \qquad (98)$

where $r := \|T - x\|$ as always. In other words, we have that

$\psi'(T) \;\overset{\mathrm{def}}{=}\; -\nabla f(x) - \nabla^2 f(z)(T-x) - \frac{M}{2}r\,B(T-x) \;\in\; \partial\psi(T). \qquad (99)$

Correspondingly, we denote $F'(T) \overset{\mathrm{def}}{=} \nabla f(T) + \psi'(T) \in \partial F(T)$.

We can work with these objects in a similar way as with the new gradient of f in the basic non-composite case. Thus, we can ensure the following inequalities, employing the Lipschitzness of the Hessian:

$\frac{L}{2}r^2 \;\overset{(34)}{\ge}\; \big\|\nabla f(T) - \nabla f(x) - \nabla^2 f(x)(T-x)\big\| \;\overset{(99)}{=}\; \Big\|F'(T) + \frac{Mr}{2}B(T-x) + \big(\nabla^2 f(z) - \nabla^2 f(x)\big)(T-x)\Big\|$
$\ge\; \|F'(T)\| - \frac{Mr^2}{2} - r\,\big\|\nabla^2 f(z) - \nabla^2 f(x)\big\| \;\overset{(3)}{\ge}\; \|F'(T)\| - \frac{Mr^2}{2} - rL\|z - x\|.$

Hence, rearranging the terms, we can prove the analogue of Lemma A.1 for the composite case (96):

Lemma F.1. For any $M > 0$, it holds that

$\|F'(T)\| \;\le\; \frac{M+L}{2}r^2 + Lr\|z-x\|. \qquad (101)$

Secondly, for the objective function value at the new point, we get

$F(T) \;\overset{\mathrm{def}}{=}\; f(T) + \psi(T) \;\overset{(35)}{\le}\; f(x) + \langle\nabla f(x), T-x\rangle + \frac12\langle\nabla^2 f(x)(T-x), T-x\rangle + \frac{L}{6}r^3 + \psi(T)$
$\le\; F(x) + \langle\nabla f(x), T-x\rangle + \frac12\langle\nabla^2 f(x)(T-x), T-x\rangle + \frac{L}{6}r^3 + \langle\psi'(T), T-x\rangle$
$\overset{(99)}{=}\; F(x) + \frac12\langle\nabla^2 f(x)(T-x), T-x\rangle - \langle\nabla^2 f(z)(T-x), T-x\rangle - \frac{3M-L}{6}r^3$
$\overset{(\star)}{\le}\; F(x) + \frac12\big\langle\big(\nabla^2 f(x) - \nabla^2 f(z)\big)(T-x), T-x\big\rangle - \frac{3M-L}{6}r^3$
$\overset{(3)}{\le}\; F(x) + \frac{L}{2}r^2\|z-x\| - \frac{3M-L}{6}r^3,$

where we used the convexity of $\psi(\cdot)$ in the second line and the convexity of $f(\cdot)$ in $(\star)$. Note that the form of this inequality is exactly the same as in the proof of Lemma A.2. Hence, we can establish its analogue for the composite case (96):

Lemma F.2. For any $M > 0$, it holds that

$F(x) - F(T) \;\ge\; \frac{2M - L}{6}\,r^3 \;-\; \frac{4L^3}{3M^2}\,\|z-x\|^3. \qquad (102)$

Therefore, having these two main lemmas established, we can prove the global rates of the composite Cubic Newton with Lazy Hessian updates using the same technique as before. Note that in the composite case, we ensure convergence in terms of the subgradient norm: $\|F'(x_k)\| \to 0$ as $k \to \infty$.
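To make the composite step (97) concrete, the following sketch approximately computes $T_M(x, z)$ for the particular choice $\psi(y) = \gamma\|y\|_1$ and $B = I$, by running proximal gradient descent with backtracking on the cubic model; here `g` stands for the fresh gradient $\nabla f(x)$ and `H` for the snapshot Hessian $\nabla^2 f(z)$. The choice of $\psi$, the inner solver, and its tolerances are illustrative assumptions of ours; the paper only requires that the subproblem involving $\psi$ can be solved efficiently.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_cubic_step(g, H, x, M, gamma, iters=200):
    """Approximately solve the composite model (97) with psi(y) = gamma * ||y||_1
    by proximal gradient descent with backtracking on the smooth part."""
    def smooth(y):
        d = y - x
        return g @ d + 0.5 * d @ (H @ d) + (M / 6.0) * np.linalg.norm(d)**3

    def smooth_grad(y):
        d = y - x
        return g + H @ d + 0.5 * M * np.linalg.norm(d) * d

    y, alpha = x.copy(), 1.0
    for _ in range(iters):
        gy = smooth_grad(y)
        while True:                                   # backtracking line search
            y_new = soft_threshold(y - alpha * gy, alpha * gamma)
            d = y_new - y
            if smooth(y_new) <= smooth(y) + gy @ d + (0.5 / alpha) * (d @ d):
                break
            alpha *= 0.5
        if np.linalg.norm(y_new - y) <= 1e-10:
            return y_new
        y, alpha = y_new, alpha * 2.0                 # mild step-size recovery
    return y
```

Any faster subproblem solver can be substituted here, for instance one that reuses a factorization of $\nabla^2 f(z)$ shared across the m steps of a phase; the analysis only uses the stationarity property (98) of the computed point.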
Finally, let us justify the local superlinear convergence when the smooth component of the objective is strongly convex:

$\nabla^2 f(x) \;\succeq\; \mu B, \qquad \forall x \in \mathbb{R}^d.$

Then, we have the following inequality for the whole objective:

$F(y) \;\ge\; F(x) + \langle s_x, y - x\rangle + \frac{\mu}{2}\|y-x\|^2, \qquad \forall x, y \in \mathrm{dom}\, F,\; s_x \in \partial F(x). \qquad (103)$

Substituting $x := x^\star$ into (103), we get, for any $y \in \mathrm{dom}\, F$:

$F(y) - F^\star \;\ge\; \frac{\mu}{2}\|y - x^\star\|^2. \qquad (104)$

At the same time, minimizing the left- and right-hand sides of (103) with respect to y independently, we obtain

$F^\star \;\ge\; F(x) - \frac{1}{2\mu}\|s_x\|^2. \qquad (105)$

Thus, combining (104) and (105) together, we ensure the following standard inequality, for any $x \in \mathrm{dom}\, F$ and $s_x \in \partial F(x)$:

$\frac{1}{\mu}\|s_x\| \;\ge\; \|x - x^\star\|. \qquad (106)$

Further, for one lazy composite cubic step $x \mapsto T := T_M(x, z)$, we obtain

$\mu r^2 \;\le\; \langle\nabla^2 f(z)(T-x), T-x\rangle \;\overset{(99)}{=}\; \langle\nabla f(x) + \psi'(T),\, x - T\rangle - \frac{M}{2}r^3 \;\overset{(\star)}{\le}\; \langle s_x,\, x - T\rangle - \frac{M}{2}r^3 \;\le\; r\,\|s_x\|,$

where we used the convexity of ψ in $(\star)$. Hence,

$\frac{1}{\mu}\|s_x\| \;\ge\; r. \qquad (107)$

It remains to use Lemma F.1, which gives, for any $M \ge 0$:

$\|F'(T)\| \;\overset{(101)}{\le}\; \frac{M+L}{2}r^2 + Lr\|z-x\| \;\le\; \frac{M+L}{2}r^2 + Lr\|x - x^\star\| + Lr\|z - x^\star\| \;\overset{(106),(107)}{\le}\; \frac{M+3L}{2\mu^2}\|F'(x)\|^2 + \frac{L}{\mu^2}\|F'(x)\|\,\|F'(z)\|.$

This ensures local superlinear convergence in terms of the subgradient norm for our method in the composite case (compare with (77)).

G. Extra Experiments

In this section, we include an additional evaluation of our methods on several optimization problems: Logistic Regression with $\ell_2$-regularization, Logistic Regression with a non-convex regularizer, and training a small Diagonal Neural Network model. In all our experiments, we observe that the use of lazy Hessian updates significantly improves the performance of the Cubic Newton method in terms of the total computational cost. We also show the convergence of the classic Gradient Method (GM) as a natural baseline. We use a constant regularization parameter M (correspondingly, a constant stepsize in the Gradient Method), which we choose for each method separately to optimize its performance.

G.1. Convex Logistic Regression

The objective in this problem has the following form:

$f(x) \;=\; \sum_{i=1}^{n}\log\big(1 + e^{-y_i\langle a_i, x\rangle}\big) + \frac{\lambda}{2}\|x\|^2, \qquad x \in \mathbb{R}^d,$

where $a_1, \ldots, a_n \in \mathbb{R}^d$ are the feature vectors and $y_1, \ldots, y_n \in \{-1, 1\}$ are the labels given by the dataset², and $\lambda > 0$ is the regularization parameter, which we fix as $\lambda = \frac{1}{n}$. The results are shown in Figure 2.

Figure 2: Logistic Regression with $\ell_2$-regularization trained on the a9a dataset ($d = 123$, $n = 32561$): convergence in terms of gradient computations and time for GM and the Cubic Newton method with $m \in \{1, 2, 10, d, 10d\}$. The Cubic Newton method with Lazy Hessian updates ($m = d$) shows the best overall performance.

²www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

G.2. Non-convex Logistic Regression

In the following experiment, we consider the Logistic Regression model with a non-convex regularizer:

$f(x) \;=\; \sum_{i=1}^{n}\log\big(1 + e^{-y_i\langle a_i, x\rangle}\big) + \lambda\sum_{i=1}^{d}\frac{x_i^2}{1 + x_i^2},$

which was also considered in (Kohler & Lucchi, 2017), and we set $\lambda = \frac{1}{n}$. The results are shown in Figure 3.

Figure 3: Logistic Regression with a non-convex regularizer trained on the a9a dataset ($d = 123$, $n = 32561$). The Cubic Newton method with Lazy Hessian updates ($m = d$) shows the best overall performance.
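For completeness, here is a NumPy sketch of the oracles behind the experiments of Sections G.1 and G.2, i.e., the function, gradient, and Hessian that each (lazy) cubic Newton step consumes. The helper name `logreg_oracles` and its interface are ours, not taken from the paper's code. Note that one Hessian evaluation costs roughly d times one gradient evaluation here (the product $A^\top \mathrm{diag}(w) A$), which is precisely the regime where the schedule $m = d$ is advantageous.

```python
import numpy as np

def logreg_oracles(A, y, lam, convex_reg=True):
    """Objective, gradient and Hessian of the logistic losses from G.1 / G.2.

    A: (n, d) feature matrix, y: labels in {-1, +1}, lam: regularization weight.
    convex_reg=True uses the l2 penalty of G.1, otherwise the non-convex
    penalty lam * sum_i x_i^2 / (1 + x_i^2) of G.2.
    """
    def sigm_neg_margin(x):
        # p_i = sigmoid(-y_i <a_i, x>), computed in a numerically stable way
        return np.exp(-np.logaddexp(0.0, y * (A @ x)))

    def f(x):
        loss = np.sum(np.logaddexp(0.0, -y * (A @ x)))
        reg = 0.5 * lam * (x @ x) if convex_reg else lam * np.sum(x**2 / (1 + x**2))
        return loss + reg

    def grad(x):
        p = sigm_neg_margin(x)
        g_loss = A.T @ (-y * p)
        g_reg = lam * x if convex_reg else lam * 2 * x / (1 + x**2)**2
        return g_loss + g_reg

    def hess(x):
        p = sigm_neg_margin(x)
        w = p * (1 - p)                            # n per-sample weights
        H_loss = A.T @ (A * w[:, None])            # ~ n d^2 cost, vs ~ n d per gradient
        h_reg = lam * np.ones(len(x)) if convex_reg \
            else lam * 2 * (1 - 3 * x**2) / (1 + x**2)**3
        return H_loss + np.diag(h_reg)

    return f, grad, hess
```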
G.3. Non-convex Diagonal Neural Network

Finally, we consider a non-convex optimization problem with the following objective:

$f(x, y) \;=\; \|A(x \circ y) - b\|^2, \qquad x, y \in \mathbb{R}^d,$

where $A \in \mathbb{R}^{n\times d}$ and $b \in \mathbb{R}^n$ are generated randomly with entries from the Gaussian distribution, and $\circ$ denotes the coordinate-wise product. The results are shown in Figure 4. We see that there are many choices of the parameter $m > 1$ (the frequency of the Hessian updates) that lead to considerable time savings without hurting the convergence rate. Notice that $m = d$ is not necessarily optimal for this problem. Our intuition is that the parametrization of our Diagonal Neural Network implies a certain effective dimension that is smaller than the full dimension d. Hence, it can be an interesting direction for further research to develop new strategies of Hessian updates that are suitable for problems with a specific structure.

Figure 4: Performance of a Diagonal Neural Network trained on random data, in terms of gradient computations and time, with GM as the baseline. Left: $d = 80$, $n = 10000$, comparing $m \in \{1, 2, 10, d, 10d\}$; right: $d = 200$, $n = 10000$, comparing $m \in \{1, 2, 10, 20, 30, 40, 50\}$.
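To complement the description of this experiment, here is a NumPy sketch of the corresponding oracles written in the stacked variable $w = (x, y) \in \mathbb{R}^{2d}$; the helper names and the random problem sizes are illustrative choices of ours and are not taken from the paper's code.

```python
import numpy as np

def diag_nn_oracles(A, b):
    """Objective, gradient, Hessian of f(x, y) = ||A (x * y) - b||^2 in w = (x, y)."""
    n, d = A.shape
    G = A.T @ A                                    # Gram matrix, reused by every Hessian call

    def split(w):
        return w[:d], w[d:]

    def f(w):
        x, y = split(w)
        r = A @ (x * y) - b
        return r @ r

    def grad(w):
        x, y = split(w)
        g = A.T @ (A @ (x * y) - b)
        return 2.0 * np.concatenate([y * g, x * g])

    def hess(w):
        x, y = split(w)
        g = A.T @ (A @ (x * y) - b)
        Hxx = 2.0 * (G * np.outer(y, y))           # 2 diag(y) A^T A diag(y)
        Hyy = 2.0 * (G * np.outer(x, x))           # 2 diag(x) A^T A diag(x)
        Hxy = 2.0 * (G * np.outer(y, x) + np.diag(g))
        return np.block([[Hxx, Hxy], [Hxy.T, Hyy]])

    return f, grad, hess

# Example setup mirroring G.3: random Gaussian data (the sizes are illustrative).
rng = np.random.default_rng(0)
n, d = 1000, 80
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
f, grad, hess = diag_nn_oracles(A, b)
w0 = rng.standard_normal(2 * d)
```

These oracles can be passed directly to a cubic or regularized Newton loop with lazy Hessian updates, for instance to the adaptive sketch given after Algorithm 4.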