# Stochastic Variance-Reduced Cubic Regularization Methods

Journal of Machine Learning Research 20 (2019) 1-47. Submitted 1/19; Revised 7/19; Published 8/19.

Dongruo Zhou (drzhou@cs.ucla.edu), Pan Xu (panxu@cs.ucla.edu), Quanquan Gu (qgu@cs.ucla.edu)
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA

Editor: Zhihua Zhang

We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for nonconvex optimization. At the core of SVRC is a novel semi-stochastic gradient, along with a semi-stochastic Hessian, which are specifically designed for the cubic regularization method. For a nonconvex function with n component functions, we show that our algorithm is guaranteed to converge to an (ϵ, √ϵ)-approximate local minimum within Õ(n^{4/5}/ϵ^{3/2}) second-order oracle calls,¹ which outperforms the state-of-the-art cubic regularization algorithms, including subsampled cubic regularization. To further reduce the sample complexity of Hessian computation in cubic regularization based methods, we also propose a sample-efficient stochastic variance-reduced cubic regularization (Lite-SVRC) algorithm for finding the local minimum more efficiently. Lite-SVRC converges to an (ϵ, √ϵ)-approximate local minimum within Õ(n + n^{2/3}/ϵ^{3/2}) Hessian sample complexity, which is faster than all existing cubic regularization based methods. Numerical experiments with different nonconvex optimization problems conducted on real datasets validate our theoretical results for both SVRC and Lite-SVRC.

Keywords: Cubic Regularization, Nonconvex Optimization, Variance Reduction, Hessian Sample Complexity, Local Minimum

1.
Introduction

We study the following unconstrained finite-sum nonconvex optimization problem:

min_{x∈ℝ^d} F(x) := (1/n) Σ_{i=1}^{n} f_i(x),    (1)

where each f_i : ℝ^d → ℝ is a general nonconvex function.

¹ Here Õ(·) hides poly-logarithmic factors.
©2019 Dongruo Zhou, Pan Xu and Quanquan Gu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/19-055.html.

Such nonconvex optimization problems are ubiquitous in machine learning, including training deep neural networks (LeCun et al., 2015), robust linear regression (Yu and Yao, 2017) and nonconvex regularized logistic regression (Reddi et al., 2016b). Finding the global minimum of (1) is in general NP-hard (Hillar and Lim, 2013) due to the lack of convexity. Instead of finding the global minimum, various algorithms have been developed in the literature (Nesterov and Polyak, 2006; Cartis et al., 2011a; Carmon and Duchi, 2016; Agarwal et al., 2017; Xu et al., 2018b; Allen-Zhu and Li, 2018) to find an approximate local minimum of (1). In particular, a point x is said to be an (ϵ_g, ϵ_H)-approximate local minimum of F if

‖∇F(x)‖₂ ≤ ϵ_g,  λ_min(∇²F(x)) ≥ −ϵ_H,    (2)

where ϵ_g, ϵ_H > 0 are predefined precision parameters. It has been shown that such approximate local minima can be as good as global minima for some problems. For instance, Ge et al. (2016) proved that any local minimum is actually a global minimum in matrix completion problems. Developing algorithms that find approximate local minima is therefore of great interest both in theory and in practice.

A very important and popular method for finding approximate local minima is the cubic-regularized (CR) Newton method, originally introduced by Nesterov and Polyak (2006). Generally speaking, in the k-th iteration, CR solves a subproblem that minimizes a cubic-regularized second-order Taylor expansion at the current iterate x_k.
The update rule can be written as follows:

h_k = argmin_{h∈ℝ^d} ⟨∇F(x_k), h⟩ + ½⟨∇²F(x_k)h, h⟩ + (M/6)‖h‖₂³,    (3)

x_{k+1} = x_k + h_k,    (4)

where M > 0 is a penalty parameter used in CR. Nesterov and Polyak (2006) proved that to find an (ϵ, √ϵ)-approximate local minimum of a nonconvex function F, CR requires at most O(ϵ^{-3/2}) iterations. However, a major drawback of CR is that it needs to evaluate n individual gradients ∇f_i(x_k) and Hessian matrices ∇²f_i(x_k) in (3) at each iteration, which leads to a total O(nϵ^{-3/2}) Hessian sample complexity, i.e., number of queries to the stochastic Hessian ∇²f_i(x) for some i and x. Such computational cost is extremely expensive when n is large, as in many large-scale machine learning problems.

To overcome the computational burden of CR based methods, some recent studies have proposed to use a subsampled Hessian instead of the full Hessian (Kohler and Lucchi, 2017; Xu et al., 2017a) to reduce the Hessian complexity. In detail, Kohler and Lucchi (2017) proposed a subsampled cubic-regularized Newton method (SCR), which uses a subsampled Hessian instead of the full Hessian to reduce the per-iteration sample complexity of Hessian evaluations. Xu et al. (2017a) proposed a refined convergence analysis of SCR, as well as a subsampled trust-region algorithm (Conn et al., 2000). Nevertheless, SCR bears a much slower convergence rate than the original CR method, and the total Hessian sample complexity for SCR to achieve an (ϵ, √ϵ)-approximate local minimum is Õ(ϵ^{-5/2}). This suggests that the computational cost of SCR can be even worse than that of CR when ϵ ≤ n^{-1}.

In this paper, we propose a novel cubic regularization algorithm named Stochastic Variance-Reduced Cubic regularization (SVRC), which incorporates variance reduction techniques (Johnson and Zhang, 2013; Xiao and Zhang, 2014; Allen-Zhu and Hazan, 2016; Reddi et al., 2016a) into the cubic-regularized Newton method.
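As a concrete illustration, one CR iteration (3)-(4) can be sketched as below. This is a minimal sketch, not the solver analyzed in this paper: the cubic subproblem is handed to a generic first-order optimizer (SciPy's BFGS) from a zero start, which merely stands in for a subproblem oracle.

```python
import numpy as np
from scipy.optimize import minimize

def cubic_model_step(grad_F, hess_F, x, M):
    """One cubic-regularization update (3)-(4): minimize the cubic model
    m(h) = <g, h> + 0.5 <H h, h> + (M/6) ||h||^3 and set x_{k+1} = x_k + h_k."""
    g, H = grad_F(x), hess_F(x)
    model = lambda h: g @ h + 0.5 * h @ H @ h + (M / 6.0) * np.linalg.norm(h) ** 3
    # Generic solver as a placeholder for the cubic subproblem oracle.
    h = minimize(model, np.zeros_like(x), method="BFGS").x
    return x + h
```

For instance, on the one-dimensional nonconvex function F(x) = x⁴/4 − x²/2, a single step from x = 2 already decreases F.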
The key component of our algorithm is a novel semi-stochastic gradient, together with a semi-stochastic Hessian, specifically designed for cubic regularization. Furthermore, we prove that, for L₂-Hessian-Lipschitz functions, to attain an (ϵ, √(L₂ϵ))-approximate local minimum, our proposed algorithm requires O(n + n^{4/5}/ϵ^{3/2}) second-order oracle (SO) calls and O(1/ϵ^{3/2}) cubic subproblem oracle (CSO) calls. Here an SO call returns an evaluation of the triple (f_i(x), ∇f_i(x), ∇²f_i(x)), and a CSO call returns an exact (or inexact) solution of the cubic subproblem (3). Compared with the original cubic regularization algorithm (Nesterov and Polyak, 2006), which requires O(n/ϵ^{3/2}) SO calls and O(1/ϵ^{3/2}) CSO calls, our proposed SVRC algorithm reduces the number of SO calls by a factor of Ω(n^{1/5}).

The second-order oracle complexity is dominated by the maximum number of queries to one of the elements of the triple (f_i(x), ∇f_i(x), ∇²f_i(x)), and is therefore not always a faithful reflection of the computational cost across different applications. For relatively high-dimensional problems, we therefore focus more on the Hessian sample complexity of cubic regularization methods. Based on the SVRC algorithm, in order to further reduce the Hessian sample complexity, we also develop a sample-efficient stochastic variance-reduced cubic-regularized Newton method called Lite-SVRC, which significantly reduces the sample complexity of Hessian matrix evaluations in stochastic CR methods. Under mild conditions, we prove that Lite-SVRC achieves a lower Hessian sample complexity than existing cubic regularization based methods: it converges to an (ϵ, √ϵ)-approximate local minimum of a nonconvex function within Õ(n + n^{2/3}ϵ^{-3/2}) Hessian sample complexity.

We summarize the major contributions of this paper as follows:

- We present a novel cubic regularization method (SVRC) with improved oracle complexity.
To the best of our knowledge, this is the first algorithm that outperforms cubic regularization without any loss in convergence rate. In sharp contrast, existing subsampled cubic regularization methods (Kohler and Lucchi, 2017; Xu et al., 2017a) suffer from worse convergence rates than cubic regularization.

- We also extend SVRC to the case where the cubic regularization subproblem is solved inexactly. Similar to previous work (Cartis et al., 2011a; Xu et al., 2017a), we lay out a set of sufficient conditions under which the output of the inexact algorithm is still guaranteed to have the same convergence rate and oracle complexity as the exact algorithm. This further sheds light on the practical implementation of our algorithm.

- As far as we know, our work is the first to rigorously demonstrate the advantage of variance reduction for second-order optimization algorithms. Although a few studies (Lucchi et al., 2015; Moritz et al., 2016; Rodomanov and Kropotov, 2016) use variance reduction to accelerate Newton's method, none of them delivers a faster rate of convergence than the standard Newton method.

- We also propose a lite version of SVRC, namely the Lite-SVRC algorithm, which only requires a constant batch size of Hessian evaluations at each iteration. The proposed Lite-SVRC further improves the Hessian sample complexity of SVRC and outperforms the state-of-the-art result by achieving Õ(n + n^{2/3}ϵ^{-3/2}) Hessian sample complexity.

- We conduct extensive numerical experiments with different types of nonconvex optimization problems on various real datasets to validate our theoretical results for both SVRC and Lite-SVRC.

When the short version of this paper was submitted to ICML, there was a concurrent work by Wang et al. (2018a), which also applies the idea of stochastic variance reduction to cubic regularization. Their algorithms have a worse Hessian sample complexity than Lite-SVRC.
Since the short version of this paper was published at ICML, there have been two follow-up works by Wang et al. (2018b) and Zhang et al. (2018), which both propose algorithms similar to our Lite-SVRC algorithm and achieve the same Hessian sample complexity. However, the results of Wang et al. (2018b) and Zhang et al. (2018) rely on an adaptive choice of batch size for the stochastic Hessian. Furthermore, the result of Zhang et al. (2018) relies on a stronger notion of the Hessian Lipschitz condition. We discuss the key differences between our Lite-SVRC algorithm and the algorithms in Wang et al. (2018a,b); Zhang et al. (2018) in detail in Section 7.

Notation: We write a(x) = O(b(x)) if a(x) ≤ C·b(x), where C is a constant independent of any parameters of our algorithm. We use Õ(·) to hide poly-logarithmic factors. We use ‖v‖₂ to denote the 2-norm of a vector v ∈ ℝ^d. For a symmetric matrix H ∈ ℝ^{d×d}, we use ‖H‖₂ and ‖H‖_{S_r} to denote the spectral norm and the Schatten r-norm of H, respectively. We denote the smallest eigenvalue of H by λ_min(H).

2. Related Work

Cubic Regularization and Trust-Region Newton Methods. The traditional Newton method in the convex setting has been widely studied in past decades (Bennett, 1916; Bertsekas, 1999). The most related work to ours is the nonconvex cubic-regularized Newton method, originally proposed by Nesterov and Polyak (2006). Cartis et al. (2011a) presented an adaptive framework for cubic regularization, which uses an adaptive estimate of the local Lipschitz constant and an approximate solution to the cubic subproblem. Connecting cubic regularization with the traditional trust-region method (Conn et al., 2000; Cartis et al., 2009, 2012, 2013), Blanchet et al. (2016); Curtis et al. (2017); Martínez and Raydan (2017) showed that the trust-region Newton method can achieve the same iteration complexity as the cubic regularization method. To overcome the computational burden of gradient and Hessian matrix evaluations, Kohler and Lucchi (2017); Xu et al.
(2017a,b) proposed to use subsampled gradients and Hessians in cubic regularization. On the other hand, in order to solve the cubic subproblem (3) more efficiently, Carmon and Duchi (2016) proposed to use gradient descent, while Agarwal et al. (2017) proposed a sophisticated algorithm based on approximate matrix inversion and approximate PCA. Tripuraneni et al. (2018) proposed a refined stochastic cubic regularization algorithm based on the above subproblem solvers. However, none of the aforementioned variants of cubic regularization outperforms the original cubic regularization method in terms of oracle complexity.

Finding Approximate Local Minima. Another line of work for finding approximate local minima focuses on escaping nondegenerate saddle points using negative curvature. Ge et al. (2015); Jin et al. (2017) showed that simple (stochastic) gradient descent with uniform noise injected over a small ball is able to converge to approximate local minima. Carmon et al. (2018); Royer and Wright (2018); Allen-Zhu (2018) showed that by computing the negative curvature direction using Hessian information or Hessian-vector products, one can find approximate local minima faster than with first-order methods. Xu et al. (2018b); Allen-Zhu and Li (2018); Jin et al. (2018) further proved that gradient methods with additive noise are also able to find approximate local minima faster than first-order methods. Yu et al. (2017) proposed the GOSE algorithm to save negative curvature computations, and Yu et al. (2018) improved the gradient complexity by exploiting the third-order smoothness of objective functions. Raginsky et al. (2017); Zhang et al. (2017); Xu et al. (2018a) proved that a family of algorithms based on discretizations of Langevin dynamics can find a neighborhood of the global minimum of nonconvex objective functions.
Variance Reduction. Variance reduction techniques, which play an important role in our proposed algorithms, have been extensively studied for large-scale finite-sum optimization problems. Variance reduction was first proposed for convex finite-sum optimization (Roux et al., 2012; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014); it uses a semi-stochastic gradient to reduce the variance of the stochastic gradient and improves the gradient complexity of both stochastic gradient descent (SGD) and gradient descent (GD). Representative algorithms include Stochastic Average Gradient (SAG) (Roux et al., 2012), Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014), to mention a few. Garber and Hazan (2015); Shalev-Shwartz (2016) studied nonconvex finite-sum problems where each individual function may be nonconvex but their sum is still convex. Reddi et al. (2016a) and Allen-Zhu and Hazan (2016) extended SVRG to general nonconvex finite-sum optimization, and proved that SVRG converges to a first-order stationary point at the same rate as gradient descent, yet with an Ω(n^{1/3}) improvement in gradient complexity. Recently, Zhou et al. (2018b) and Fang et al. (2018) further improved the gradient complexity of SVRG-type algorithms for converging to a first-order stationary point in nonconvex optimization to an optimal rate. However, to the best of our knowledge, it remained an open problem whether variance reduction can improve the oracle complexity of second-order optimization algorithms.

The remainder of this paper is organized as follows: we present the stochastic variance-reduced cubic regularization (SVRC) algorithm in Section 3. We present our theoretical analysis of the proposed SVRC algorithm in Section 4 and discuss SVRC with inexact cubic subproblem oracles in Section 5.
In Section 6, we propose a modified algorithm, Lite-SVRC, to further reduce the Hessian sample complexity, and present its theoretical analysis in Section 7. We conduct thorough numerical experiments on different nonconvex optimization problems and different real-world datasets to validate our theory in Section 8. We conclude in Section 9.

3. Stochastic Variance-Reduced Cubic Regularization

In this section, we present a novel algorithm which utilizes stochastic variance reduction techniques to improve the cubic regularization method. As discussed in the introduction, to reduce the computational burden of gradient and Hessian matrix evaluations in the cubic regularization update (3), subsampled gradients and Hessians have been used in subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017b) and stochastic cubic regularization (Tripuraneni et al., 2018).

Algorithm 1 Stochastic Variance-Reduced Cubic Regularization (SVRC)
1: Input: batch size parameters b_g, b_h, cubic penalty parameters {M_{s,t}}, epoch number S, epoch length T and starting point x_0.
2: Initialization: x̂¹ = x_0
3: for s = 1, ..., S do
4:   x_0^s = x̂^s
5:   g^s = ∇F(x̂^s) = (1/n) Σ_{i=1}^n ∇f_i(x̂^s),  H^s = (1/n) Σ_{i=1}^n ∇²f_i(x̂^s)
6:   for t = 0, ..., T−1 do
7:     Sample index sets I_g, I_h with |I_g| = b_g, |I_h| = b_h
8:     v_t^s = (1/b_g) Σ_{i_t∈I_g} [∇f_{i_t}(x_t^s) − ∇f_{i_t}(x̂^s)] + g^s − [(1/b_g) Σ_{i_t∈I_g} ∇²f_{i_t}(x̂^s) − H^s](x_t^s − x̂^s)
9:     U_t^s = (1/b_h) Σ_{j_t∈I_h} [∇²f_{j_t}(x_t^s) − ∇²f_{j_t}(x̂^s)] + H^s
10:    h_t^s = argmin_h ⟨v_t^s, h⟩ + ½⟨U_t^s h, h⟩ + (M_{s,t}/6)‖h‖₂³
11:    x_{t+1}^s = x_t^s + h_t^s
12:  end for
13:  x̂^{s+1} = x_T^s
14: end for
15: Output: x_out = x_t^s, where s, t are chosen uniformly at random from [S] and [T].

Nevertheless, stochastic gradients and Hessian matrices have large variances, which undermine the convergence performance. Inspired by SVRG (Johnson and Zhang, 2013), we propose to use semi-stochastic versions of the gradient and Hessian matrix, which control the variances automatically. Specifically, our algorithm has two loops.
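For concreteness, the estimators on lines 8-9 of Algorithm 1 can be sketched as follows. This is a minimal NumPy sketch under our own conventions: `grads[i]` and `hesses[i]` are assumed to be callables returning ∇f_i and ∇²f_i.

```python
import numpy as np

def svrc_gradient(grads, hesses, x, x_ref, g_ref, H_ref, idx):
    """Semi-stochastic gradient of Algorithm 1, line 8: a minibatch correction
    of the stored full gradient g_ref, including a second-order term."""
    bg = len(idx)
    dx = x - x_ref
    v = g_ref.copy()
    for i in idx:
        v += (grads[i](x) - grads[i](x_ref)) / bg
        v -= ((hesses[i](x_ref) - H_ref) @ dx) / bg
    return v

def svrc_hessian(hesses, x, x_ref, H_ref, idx):
    """Semi-stochastic Hessian of Algorithm 1, line 9."""
    bh = len(idx)
    U = H_ref.copy()
    for j in idx:
        U += (hesses[j](x) - hesses[j](x_ref)) / bh
    return U
```

A useful sanity check: with the full index set, both estimators reduce exactly to ∇F(x) and ∇²F(x); with random minibatches they are unbiased estimates of the same quantities.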
At the beginning of the s-th epoch of the outer loop, we denote x̂^s = x_0^s. We first calculate the full gradient g^s = ∇F(x̂^s) and Hessian matrix H^s = ∇²F(x̂^s), which are stored for further reference in the inner loop. At the t-th iteration of the inner loop, we calculate the following semi-stochastic gradient and Hessian matrix:

v_t^s = (1/b_g) Σ_{i_t∈I_g} [∇f_{i_t}(x_t^s) − ∇f_{i_t}(x̂^s)] + g^s − [(1/b_g) Σ_{i_t∈I_g} ∇²f_{i_t}(x̂^s) − H^s](x_t^s − x̂^s),    (5)

U_t^s = (1/b_h) Σ_{j_t∈I_h} [∇²f_{j_t}(x_t^s) − ∇²f_{j_t}(x̂^s)] + H^s,    (6)

where I_g and I_h are batch index sets, and the batch sizes b_g = |I_g|, b_h = |I_h| will be decided later. In each inner iteration, we solve the following cubic regularization subproblem:

h_t^s = argmin_h m_t^s(h),  m_t^s(h) = ⟨v_t^s, h⟩ + ½⟨U_t^s h, h⟩ + (M_{s,t}/6)‖h‖₂³,    (7)

where {M_{s,t}} are cubic regularization parameters, which may depend on s and t. We then perform the update x_{t+1}^s = x_t^s + h_t^s in the t-th iteration of the inner loop. The proposed algorithm is displayed in Algorithm 1.

There are two notable features of our estimators of the full gradient and Hessian in each inner loop, compared with those used in SVRG (Johnson and Zhang, 2013). The first is that our gradient and Hessian estimators consist of mini-batches of stochastic gradients and Hessians. The second is that we use second-order information when constructing the gradient estimator v_t^s, while classical SVRG only uses first-order information. Intuitively speaking, both features are used to produce more accurate estimates of the true gradient and Hessian with affordable oracle calls. Note that similar approximations of the gradient and Hessian matrix have appeared in recent work by Gower et al. (2018) and Wai et al. (2017), where this kind of estimator is used for traditional SVRG in the convex setting, which radically differs from our setting.

4. Theoretical Analysis of SVRC

In this section, we prove the convergence rate of SVRC (Algorithm 1) to an (ϵ, √(L₂ϵ))-approximate local minimum.
We first lay out the following Hessian Lipschitz assumption, which is necessary for our analysis and widely used in the literature (Nesterov and Polyak, 2006; Xu et al., 2016; Kohler and Lucchi, 2017).

Assumption 1 (Hessian Lipschitz) There exists a constant L₂ > 0 such that for all x, y and i ∈ [n],

‖∇²f_i(x) − ∇²f_i(y)‖₂ ≤ L₂‖x − y‖₂.

The Hessian Lipschitz assumption plays a central role in controlling how fast second-order information can change. In fact, this is the only assumption we need to prove our theoretical results for SVRC. We then define the optimal function gap between the initial point x_0 and the global minimum of F.

Definition 2 (Optimal Gap) For the function F(·) and the initial point x_0, the optimal gap is Δ_F = F(x_0) − F*, where F* = inf_{x∈ℝ^d} F(x). Without loss of generality, we assume Δ_F < +∞ throughout this paper.

Before presenting the nonasymptotic convergence results for Algorithm 1, we define the following useful quantity:

μ(x) = max{ ‖∇F(x)‖₂^{3/2}, −λ_min(∇²F(x))³ / L₂^{3/2} }.    (8)

A similar definition appears in Nesterov and Polyak (2006) in a slightly different form; it describes how close a point is to a true local minimum. In particular, according to the definition in (8), μ(x) ≤ ϵ^{3/2} holds if and only if

‖∇F(x)‖₂ ≤ ϵ,  λ_min(∇²F(x)) ≥ −√(L₂ϵ).    (9)

Therefore, in order to find an (ϵ, √(L₂ϵ))-approximate local minimum of the nonconvex function F, it suffices to find a point x satisfying μ(x) < ϵ^{3/2}. Next we formally define our oracles.

Definition 3 (Second-order Oracle) Given an index i and a point x, one second-order oracle (SO) call returns the triple

[f_i(x), ∇f_i(x), ∇²f_i(x)].
(10)

Definition 4 (Cubic Subproblem Oracle) Given a vector g ∈ ℝ^d, a Hessian matrix H and a positive constant θ, one cubic subproblem oracle (CSO) call returns h_sol, where h_sol can be solved exactly as

h_sol = argmin_{h∈ℝ^d} ⟨g, h⟩ + ½⟨h, Hh⟩ + (θ/6)‖h‖₂³.

Remark 5 The second-order oracle is a special form of the Information Oracle first introduced by Nemirovsky and Yudin (1983), which returns the gradient, Hessian and all higher-order derivatives of the objective function F(x). Here, our second-order oracle only returns first- and second-order information at some point of a single component function f_i instead of F. We argue that this is a reasonable adaptation because this paper focuses on finite-sum objective functions. The cubic subproblem oracle returns an exact or inexact solution of (7), which plays an important role in both theory and practice.

Now we are ready to give a general convergence result for Algorithm 1.

Theorem 6 Under Assumption 1, suppose that the cubic regularization parameter M_{s,t} of Algorithm 1 satisfies M_{s,t} = C_M L₂, where L₂ is the Hessian Lipschitz constant and C_M ≥ 100 is a constant. Suppose the batch sizes b_g and b_h satisfy

b_g ≥ 5T⁴,  b_h ≥ 100T² log d,    (11)

where T ≥ 2 is the length of the inner loop of Algorithm 1 and d is the dimension of the problem. Then the output of Algorithm 1 satisfies

E[μ(x_out)] ≤ 240 C_M² L₂^{1/2} Δ_F / (ST).    (12)

Remark 7 According to (8), to ensure that x_out is an (ϵ, √(L₂ϵ))-approximate local minimum, we can set the right-hand side of (12) to be less than ϵ^{3/2}. This immediately implies that the total iteration complexity of Algorithm 1 is ST = O(Δ_F L₂^{1/2} ϵ^{-3/2}), which matches the iteration complexity of cubic regularization (Nesterov and Polyak, 2006).

Remark 8 Note that the log d term in (11) only affects the Hessian batch size b_h. It comes from matrix concentration inequalities and is believed to be unavoidable (Tropp et al., 2015).
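A standard way to implement the exact CSO oracle of Definition 4 is via the characterization of Nesterov and Polyak (2006): outside the so-called "hard case", the global minimizer satisfies h = −(H + (θ/2) r I)^{-1} g with r = ‖h‖₂, so r can be found by one-dimensional bisection. The following is a sketch under that assumption, not the subproblem solver used in our experiments:

```python
import numpy as np

def solve_cubic_exact(g, H, theta):
    """Global minimizer of <g,h> + 0.5<h,Hh> + (theta/6)||h||^3 via the
    Nesterov-Polyak characterization (assumes we are not in the hard case)."""
    lam_min = np.linalg.eigvalsh(H)[0]
    # Keep H + (theta/2) r I positive definite: r > -2*lam_min/theta.
    lo = max(0.0, -2.0 * lam_min / theta) + 1e-12
    hi = max(lo, 1.0)
    norm_h = lambda r: np.linalg.norm(
        np.linalg.solve(H + 0.5 * theta * r * np.eye(len(g)), -g))
    while norm_h(hi) > hi:            # grow bracket until ||h(r)|| <= r
        hi *= 2.0
    for _ in range(200):              # bisection on the decreasing ||h(r)|| - r
        mid = 0.5 * (lo + hi)
        if norm_h(mid) > mid:
            lo = mid
        else:
            hi = mid
    return np.linalg.solve(H + 0.5 * theta * hi * np.eye(len(g)), -g)
```

At the returned point the model gradient g + Hh + (θ/2)‖h‖₂ h vanishes, even for an indefinite H, which is exactly what distinguishes the cubic model from a plain Newton step.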
In other words, the Hessian batch size b_h has an unavoidable dependence on the dimension d, unlike the gradient batch size b_g.

The result in Theorem 6 depends on a series of parameters. In the following corollary, we show how to choose these parameters in practice to achieve a better oracle complexity.

Corollary 9 Under Assumption 1, let the cubic regularization parameter M_{s,t} = M = C_M L₂, where C_M ≥ 100 is a constant. Let the epoch length T = n^{1/5}, batch sizes b_g = 5n^{4/5} and b_h = 100 n^{2/5} log d, and the number of epochs S = max{1, 240 C_M² L₂^{1/2} Δ_F n^{-1/5} ϵ^{-3/2}}. Then Algorithm 1 finds an (ϵ, √(L₂ϵ))-approximate local minimum x_out within

O(n + Δ_F L₂^{1/2} n^{4/5} ϵ^{-3/2}) SO calls    (13)

and

O(Δ_F L₂^{1/2} ϵ^{-3/2}) CSO calls.    (14)

Remark 10 Corollary 9 states that we can reduce the number of SO calls by choosing the batch sizes b_g, b_h as functions of n. In contrast, to find an (ϵ, √(L₂ϵ))-approximate local minimum, the original cubic regularization method of Nesterov and Polyak (2006) needs O(n/ϵ^{3/2}) second-order oracle calls, which is worse than ours by a factor of n^{1/5}. Subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017b) requires Õ(n/ϵ^{3/2} + 1/ϵ^{5/2}) SO calls, which is also worse than our algorithm.

In Table 1, we summarize the comparison of our SVRC algorithm with the most related algorithms in terms of SO and CSO oracle complexities. It can be seen from Table 1 that our algorithm (SVRC) achieves the lowest SO and CSO oracle complexity compared with the original cubic regularization method (Nesterov and Polyak, 2006), which uses full gradient and Hessian evaluations, and the subsampled cubic method (Kohler and Lucchi, 2017; Xu et al., 2017b). In particular, our algorithm reduces the SO oracle complexity of cubic regularization by a factor of n^{1/5} for finding an (ϵ, √(L₂ϵ))-approximate local minimum.
| Algorithm | SO calls | CSO calls | Gradient Lipschitz | Hessian Lipschitz |
|---|---|---|---|---|
| CR | O(n/ϵ^{3/2}) | O(1/ϵ^{3/2}) | no | yes |
| SCR | Õ(n/ϵ^{3/2} + 1/ϵ^{5/2})² | O(1/ϵ^{3/2}) | yes | yes |
| SVRC (Algorithm 1) | Õ(n + n^{4/5}/ϵ^{3/2}) | O(1/ϵ^{3/2}) | no | yes |

Table 1: Comparison of methods for finding an (ϵ, √(L₂ϵ))-approximate local minimum, in terms of second-order oracle (SO) complexity and cubic subproblem oracle (CSO) complexity. The compared methods are (1) CR: cubic regularization (Nesterov and Polyak, 2006) and (2) SCR: subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017b).

2. This is the refined rate proved by Xu et al. (2017b) for the subsampled cubic regularization algorithm proposed in Kohler and Lucchi (2017).

5. SVRC with Inexact Oracles

In practice, the exact solution to the cubic subproblem (7) often cannot be obtained; one can only get an approximate solution from an inexact solver. Thus we replace the CSO oracle of Definition 4 with the following inexact CSO oracle:

h̃_sol ≈ argmin_{h∈ℝ^d} ⟨g, h⟩ + ½⟨h, Hh⟩ + (θ/6)‖h‖₂³.

To analyze the performance of SVRC with an inexact cubic subproblem solver, we relax the exact solution h_t^s in Line 10 of Algorithm 1 to

h̃_t^s ≈ argmin_h m_t^s(h).    (15)

The ultimate goal of this section is to prove that the theoretical results for SVRC still hold with inexact subproblem solvers. To this end, we present the following sufficient condition, under which an inexact solution ensures the same oracle complexity as the exact solution:

Condition 11 (Inexact Condition) For each s, t and a given δ > 0, h̃_t^s satisfies the δ-inexact condition if

m_t^s(h̃_t^s) ≤ −(M_{s,t}/12)‖h̃_t^s‖₂³ + δ,
‖∇m_t^s(h̃_t^s)‖₂ ≤ M_{s,t}^{1/3} δ^{2/3},
| ‖h̃_t^s‖₂ − ‖h_t^s‖₂ | ≤ M_{s,t}^{-1/3} δ^{1/3}.

Remark 12 Similar inexact conditions have been studied in the cubic regularization literature. For instance, Nesterov and Polyak (2006) presented a practical way to solve the cubic subproblem without a termination condition. Cartis et al.
(2011a); Kohler and Lucchi (2017) presented termination criteria for approximate solutions to the cubic subproblem, which are slightly different from Condition 11.

We now present the convergence result for SVRC with inexact CSO oracles.

Theorem 13 Suppose that for each s, t, h̃_t^s is an inexact solution of the cubic subproblem m_t^s(h) satisfying Condition 11. Under the same conditions as Theorem 6, the output of Algorithm 1 satisfies

E[μ(x_out)] ≤ 240 C_M² L₂^{1/2} Δ_F / (ST) + 480 C_M² L₂^{1/2} δ.    (16)

Remark 14 By the definition of μ(x), in order to attain an (ϵ, √(L₂ϵ))-approximate local minimum we require E[μ(x_out)] ≤ ϵ^{3/2}, and thus 480 C_M² L₂^{1/2} δ < ϵ^{3/2}, which implies that δ in Condition 11 should satisfy δ < (480 C_M² L₂^{1/2})^{-1} ϵ^{3/2}. The total iteration complexity of Algorithm 1 with the inexact oracle is then still O(Δ_F L₂^{1/2} ϵ^{-3/2}). With the same choice of parameters, Algorithm 1 with the inexact oracle achieves the same reduction in SO calls.

Corollary 15 Suppose that for each s, t, h̃_t^s is an inexact solution of the cubic subproblem m_t^s(h) satisfying Condition 11 with δ = (960 C_M²)^{-1} L₂^{-1/2} ϵ^{3/2}. Under Assumption 1, let the cubic regularization parameter M_{s,t} = M = C_M L₂, where C_M ≥ 100 is a constant. Let the epoch length T = n^{1/5}, batch sizes b_g = 5n^{4/5} and b_h = 100 n^{2/5} log d, and the number of epochs S = max{1, 480 C_M² L₂^{1/2} Δ_F n^{-1/5} ϵ^{-3/2}}. Then Algorithm 1 finds an (ϵ, √(L₂ϵ))-approximate local minimum within

O(n + Δ_F L₂^{1/2} n^{4/5} ϵ^{-3/2}) SO calls    (17)

and

O(Δ_F L₂^{1/2} ϵ^{-3/2}) CSO calls.    (18)

Remark 16 It is worth noting that with any inexact CSO oracle satisfying Condition 11, the SO and CSO complexities of SVRC remain the same as with the exact CSO oracle.

6.
Lite-SVRC for Efficient Hessian Sample Complexity

As discussed in the introduction, when the problem dimension d is relatively high, we may care more about the Hessian sample complexity of cubic regularization methods than about the second-order oracle complexity. In this section, we present a new algorithm, Lite-SVRC, based on SVRC, which trades second-order oracle complexity for a more affordable Hessian sample complexity.

As displayed in Algorithm 2, Lite-SVRC has a similar structure to Algorithm 1, with S epochs and T iterations within each epoch. At the t-th iteration of the s-th epoch, we again use a semi-stochastic gradient ṽ_t^s and Hessian U_t^s to replace the full gradient and full Hessian in the CR subproblem (3):

ṽ_t^s = (1/B_{g;s,t}) Σ_{i_t∈I_g} [∇f_{i_t}(x_t^s) − ∇f_{i_t}(x̂^s)] + g^s,    (19)

U_t^s = (1/B_h) Σ_{j_t∈I_h} [∇²f_{j_t}(x_t^s) − ∇²f_{j_t}(x̂^s)] + H^s,    (20)

where x̂^s is the reference point at which g^s and H^s are computed, I_g and I_h are sampling index sets (with replacement), and B_{g;s,t} and B_h are the sizes of I_g and I_h. Compared with SVRC (Algorithm 1), Lite-SVRC uses a lite version of the semi-stochastic gradient: the additional Hessian information in the semi-stochastic gradient (5) actually increases the Hessian sample complexity. Therefore, with the goal of reducing the Hessian sample complexity, the standard semi-stochastic gradient (Johnson and Zhang, 2013; Xiao and Zhang, 2014) is used in this section. Note that similar semi-stochastic gradients and Hessians have been proposed in Johnson and Zhang (2013); Xiao and Zhang (2014) and in Gower et al. (2018); Wai et al. (2017); Zhou et al. (2018a); Wang et al. (2018a,b); Zhang et al. (2018), respectively.

In Algorithm 2, we choose a fixed batch size B_h = |I_h| for the stochastic Hessian. However, the batch size of the stochastic gradient is chosen adaptively at each iteration:

B_{g;s,t} = D_g / ‖x_t^s − x̂^s‖₂²,    (21)

where D_g is a constant depending only on n and d.
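The lite gradient estimator (19) and the adaptive batch-size rule (21) can be sketched as follows. The cap at n, the rounding, and the fallback when x = x̂ are practical choices of ours, not part of the paper's specification.

```python
import numpy as np

def lite_batch_size(x, x_ref, Dg, n):
    """Adaptive gradient batch size (21): B = Dg / ||x - x_ref||_2^2,
    rounded up and capped at n (cap and zero-distance fallback are
    practical assumptions)."""
    dist_sq = float(np.linalg.norm(x - x_ref) ** 2)
    if dist_sq == 0.0:
        return 1
    return int(min(n, max(1, np.ceil(Dg / dist_sq))))

def lite_gradient(grads, x, x_ref, g_ref, idx):
    """Lite semi-stochastic gradient (19): the standard SVRG estimator,
    with no Hessian correction term."""
    v = g_ref.copy()
    for i in idx:
        v += (grads[i](x) - grads[i](x_ref)) / len(idx)
    return v
```

Note how the rule allocates samples: the farther the iterate moves from the reference point, the smaller the gradient minibatch, since the variance of the correction term shrinks relative to the distance traveled.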
Algorithm 2 Sample-efficient stochastic variance-reduced cubic regularization (Lite-SVRC)
1: Input: batch size parameters D_g, B_h, cubic penalty parameters {M_{s,t}}, epoch number S, epoch length T and starting point x_0.
2: Initialization: x̂¹ = x_0
3: for s = 1, ..., S do
4:   x_0^s = x̂^s
5:   g^s = ∇F(x̂^s) = (1/n) Σ_{i=1}^n ∇f_i(x̂^s),  H^s = ∇²F(x̂^s) = (1/n) Σ_{i=1}^n ∇²f_i(x̂^s)
6:   h_0^s = argmin_{h∈ℝ^d} m_0^s(h) = ⟨g^s, h⟩ + ½⟨H^s h, h⟩ + (M_{s,0}/6)‖h‖₂³
7:   x_1^s = x_0^s + h_0^s
8:   for t = 1, ..., T−1 do
9:     B_{g;s,t} = D_g / ‖x_t^s − x̂^s‖₂²
10:    Sample index sets I_g, I_h ⊆ [n] with |I_g| = B_{g;s,t}, |I_h| = B_h
11:    ṽ_t^s = (1/B_{g;s,t}) Σ_{i_t∈I_g} [∇f_{i_t}(x_t^s) − ∇f_{i_t}(x̂^s)] + g^s
12:    U_t^s = (1/B_h) Σ_{j_t∈I_h} [∇²f_{j_t}(x_t^s) − ∇²f_{j_t}(x̂^s)] + H^s
13:    h_t^s = argmin_{h∈ℝ^d} m_t^s(h) = ⟨ṽ_t^s, h⟩ + ½⟨U_t^s h, h⟩ + (M_{s,t}/6)‖h‖₂³
14:    x_{t+1}^s = x_t^s + h_t^s
15:  end for
16:  x̂^{s+1} = x_T^s
17: end for
18: Output: x_out = x_t^s, where s, t are chosen uniformly at random from [S] and [T].

In addition, the major difference between our algorithm and the SVRC-type algorithms proposed in Wang et al. (2018a); Zhang et al. (2018); Wang et al. (2018b) is that our algorithm uses a constant Hessian minibatch size at each iteration instead of an adaptive one, which makes parameter tuning much easier. In sharp contrast, the minibatch sizes of the stochastic Hessian in the algorithms proposed by Wang et al. (2018a); Zhang et al. (2018); Wang et al. (2018b) depend on both the accuracy parameter ϵ and the current update h_t^s, which makes the update an implicit one; such hyperparameters are hard to tune in practice.

7. Theoretical Analysis of Lite-SVRC

In this section, we present our theoretical results on the Hessian sample complexity of Lite-SVRC (Algorithm 2).
Different from the analysis of SVRC in Section 4, which only requires the Hessian Lipschitz condition (Assumption 1), the analysis of Lite-SVRC additionally needs the following smoothness assumption:
Assumption 17 (Gradient Lipschitz) There exists a constant $L_1 > 0$ such that for all x, y and $i \in \{1, \dots, n\}$,
$$\|\nabla f_i(x) - \nabla f_i(y)\|_2 \le L_1 \|x - y\|_2.$$
Assumptions 1 and 17 are mild and widely used in the line of research on finding approximate local minima (Carmon and Duchi, 2016; Carmon et al., 2018; Agarwal et al., 2017; Wang et al., 2018a; Yu et al., 2018). Recalling the definition in (8), we need to upper bound $\mu(x_{\mathrm{out}})$ in order to find an approximate local minimum. The following theorem spells out this upper bound.
Theorem 18 Under Assumptions 1 and 17, suppose that $n > 10$, $M_{s,t} = C_M L_2$, $D_g \ge C_1 (L_1^2/L_2^2)\, n^{4/3}/C_M$ and $B_h \ge 144 (C_1 C_h)^{2/3} n^{2/3}/C_M^2$, where $C_h = 1200 \log d$ and $C_M, C_1$ are absolute constants. Then the output $x_{\mathrm{out}}$ of Algorithm 2 satisfies
$$\mathbb{E}[\mu(x_{\mathrm{out}})] \le \frac{216\, C_M^2 L_2^{1/2} \Delta_F}{ST}, \quad (22)$$
where $\Delta_F = F(x_0) - \inf_x F(x)$.
Remark 19 Theorem 18 suggests that with a fixed inner loop length T, if we run Algorithm 2 for sufficiently many epochs S, then we obtain a sequence of iterates $x_i$ with $\mathbb{E}[\mu(x_i)] \to 0$. That is, $x_i$ converges to a local minimum, which is consistent with the convergence analyses in existing related work (Nesterov and Polyak, 2006; Kohler and Lucchi, 2017; Wang et al., 2018a).
Now we provide a specific choice of the parameters in Theorem 18 to derive the total Hessian sample complexity of Algorithm 2.
Corollary 20 Under the same assumptions as in Theorem 18, let the batch size parameters satisfy $D_g = 4 (L_1^2/L_2^2)\, n^{4/3}$ and $B_h = \log d \cdot (C_h n)^{2/3}$. Set the inner loop length $T = n^{1/3}$ and the cubic regularization parameter $M_{s,t} = C_M L_2$, where $C_M$ is an absolute constant. Set the epoch number $S = O(\max\{L_2^{1/2}\Delta_F/(\epsilon^{3/2} n^{1/3}), 1\})$.
Then the output $x_{\mathrm{out}}$ of Algorithm 2 is an $(\epsilon, \sqrt{L_2\epsilon})$-approximate local minimum after
$$\tilde O\big(n + \Delta_F \sqrt{L_2}\, n^{2/3}/\epsilon^{3/2}\big) \quad (23)$$
stochastic Hessian evaluations. Moreover, the total number of CSO calls of Algorithm 2 is $O(\Delta_F \sqrt{L_2}/\epsilon^{3/2})$.
Remark 21 Note that the CSO oracle complexity of Lite-SVRC is the same as that of SVRC. In what follows, we present a comprehensive comparison of the Hessian sample complexity of Lite-SVRC and related algorithms in Table 2. The algorithm proposed in Wang et al. (2018a) has two versions, sampling with replacement and sampling without replacement; for completeness, we present both versions. From Table 2 we can see that Lite-SVRC strictly outperforms CR by a factor of $n^{1/3}$ and outperforms SVRC by a factor of $n^{2/15}$ in terms of Hessian sample complexity. Lite-SVRC also outperforms SCR when $\epsilon = O(n^{-2/3})$, which suggests that the variance reduction scheme makes Lite-SVRC perform better in the high-accuracy regime. More importantly, our proposed Lite-SVRC does not rely on the assumption that the function F is Lipschitz continuous, which is required by the algorithm proposed in Wang et al. (2018a). In terms of Hessian sample complexity, our algorithm directly improves that of Wang et al. (2018a) by a factor of $n^{2/33}$. The Hessian sample complexity of Lite-SVRC matches that of the algorithms recently proposed in Wang et al. (2018b) and Zhang et al. (2018). Nevertheless, Lite-SVRC uses a constant Hessian batch size, in contrast to the adaptive batch sizes used in Wang et al. (2018b); Zhang et al. (2018), which makes Lite-SVRC much simpler and more practical to use.
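Corollaries 20 and 23 target an $(\epsilon, \sqrt{L_2\epsilon})$-approximate local minimum in the sense of (2) from the introduction. A minimal numeric check of that condition (the function name is ours) can be sketched as:

```python
import numpy as np

def is_approx_local_min(grad, hess, eps_g, eps_h):
    """Check the (eps_g, eps_H)-approximate local minimum condition (2):
    ||grad F(x)||_2 <= eps_g and lambda_min(hess F(x)) >= -eps_H."""
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order,
    # so the first entry is the minimum eigenvalue.
    return bool(np.linalg.norm(grad) <= eps_g and np.linalg.eigvalsh(hess)[0] >= -eps_h)
```

For the guarantees above, one would call this with `eps_g = eps` and `eps_h = np.sqrt(L2 * eps)`.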
Algorithm | Per-iteration Hessian | Total Hessian | Function Lip. | Gradient Lip. | Hessian Lip.
CR | $O(n)$ | $O(n/\epsilon^{3/2})$ | No | No | Yes
SCR | $\tilde O(1/\epsilon)$ | $\tilde O(1/\epsilon^{5/2})$ | No (3) | Yes | Yes
SVRC_with (4) | $\tilde O(\|x_t^s - \hat x^s\|_2^2 / \|h_t^s\|_2^2)$ | $\tilde O(n + n^{3/4}/\epsilon^{3/2})$ | Yes | Yes | Yes
SVRC_without (4) | $\tilde O\big((1/n + \|h_t^s\|_2^2/\|x_t^s - \hat x^s\|_2^2)^{-1}\big)$ | $\tilde O(n + n^{8/11}/\epsilon^{3/2})$ | Yes | Yes | Yes
SVRC_Wang | $\tilde O(\|x_t^s - \hat x^s\|_2^2 / \max\{\epsilon, \|h_t^s\|_2^2\})$ | $\tilde O(n + n^{2/3}/\epsilon^{3/2})$ | No | Yes | Yes
SVRC_Zhang | $\tilde O(\|x_t^s - \hat x^s\|_2^2 / \epsilon)$ | $\tilde O(n + n^{2/3}/\epsilon^{3/2})$ | No | Yes | Yes
SVRC (Algorithm 1) (5) | $\tilde O(n^{4/5})$ | $\tilde O(n + n^{4/5}/\epsilon^{3/2})$ | No | No | Yes
Lite-SVRC (Algorithm 2) | $\tilde O(n^{2/3})$ | $\tilde O(n + n^{2/3}/\epsilon^{3/2})$ | No | Yes | Yes
Table 2: Comparisons of per-iteration and total sample complexities of Hessian evaluations for different algorithms: CR (Nesterov and Polyak, 2006), SCR (Kohler and Lucchi, 2017; Xu et al., 2017a), SVRC_with (Wang et al., 2018a), SVRC_without (Wang et al., 2018a), SVRC_Wang (Wang et al., 2018b), SVRC_Zhang (Zhang et al., 2018), SVRC (Algorithm 1) and Lite-SVRC (Algorithm 2). As in Table 1, the CSO oracle complexities of all the compared methods are the same, i.e., $O(1/\epsilon^{3/2})$, so we omit them for simplicity.
Recall the inexact cubic subproblem solver defined in Section 5. The same inexact CSO oracles can also be used in Algorithm 2. In what follows, we present the convergence result of Lite-SVRC with inexact CSO oracles.
3. Although the refined SCR in Xu et al. (2017b) does not need the function Lipschitz condition, the original SCR in Kohler and Lucchi (2017) does.
4. In Wang et al. (2018a), both algorithms need to calculate $\lambda_{\min}(\nabla^2 F(x_t^s))$ at each iteration to decide whether the algorithm should continue, which adds an additional $O(n)$ Hessian sample complexity. We choose not to include this in the table.
5. For SVRC (Algorithm 1), we present its second-order oracle complexity derived in Section 4 as its Hessian sample complexity.
Theorem 22 Suppose that for each s, t, $\tilde h_t^s$ is an inexact solution of the cubic subproblem $m_t^s(h)$ satisfying Condition 11.
Then under the same conditions as Theorem 18, the output of Algorithm 2 satisfies
$$\mathbb{E}[\mu(x_{\mathrm{out}})] \le \frac{216\, C_M^2 L_2^{1/2}\Delta_F}{ST} + 432\, C_M^2 L_2^{1/2}\delta. \quad (24)$$
In addition, Algorithm 2 with an inexact oracle also enjoys the reduced Hessian sample complexity, as summarized in the following corollary.
Corollary 23 Suppose that for each s, t, $\tilde h_t^s$ is an inexact solution of the cubic subproblem $m_t^s(h)$ satisfying Condition 11 with $\delta = (864\, C_M^2 L_2^{1/2})^{-1}\epsilon^{3/2}$. Then with the same choice of parameters as in Corollary 20, Algorithm 2 finds an $(\epsilon, \sqrt{L_2\epsilon})$-approximate local minimum within $\tilde O(n + \Delta_F \sqrt{L_2}\, n^{2/3}/\epsilon^{3/2})$ stochastic Hessian evaluations and $O(\Delta_F \sqrt{L_2}/\epsilon^{3/2})$ CSO calls.
8. Experiments In this section, we conduct experiments on real-world datasets to support our theoretical analysis of the proposed SVRC and Lite-SVRC algorithms. We investigate two nonconvex problems on three datasets, a9a, ijcnn1 and covtype, which are common machine learning benchmarks; their sizes are summarized in Table 3.
Dataset | sample size n | dimension d
a9a | 32,561 | 123
covtype | 581,012 | 54
ijcnn1 | 35,000 | 22
Table 3: Datasets used in the experiments.
8.1. Baseline Algorithms To validate the superior performance of the proposed SVRC (Algorithm 1) in terms of second-order oracle calls, we compare it with the following baseline algorithms: (1) trust-region Newton method (TR) (Conn et al., 2000); (2) Adaptive Cubic regularization (Cartis et al., 2011a,b); (3) Subsampled Cubic regularization (Kohler and Lucchi, 2017); (4) Gradient Cubic regularization (Carmon and Duchi, 2016); and (5) Stochastic Cubic regularization (Tripuraneni et al., 2018). To demonstrate the improvement of Lite-SVRC (Algorithm 2) in Hessian sample complexity, we further compare Lite-SVRC with all the baselines above, including SVRC. In addition, we also compare Lite-SVRC with (6) SVRC-without (Wang et al., 2018a), which likewise focuses on reducing the Hessian sample complexity. There are two versions of SVRC in Wang et al.
(2018a), but the one based on sampling without replacement performs better in both theory and experiments, so we only compare with that version. Note that the SVRC algorithms in Wang et al. (2018b); Zhang et al. (2018) are essentially the same as our Lite-SVRC algorithm except for the choice of batch size for the stochastic Hessian; thus we do not compare Lite-SVRC with these algorithms (Wang et al., 2018b; Zhang et al., 2018).
8.2. Implementation Details For Subsampled Cubic and SVRC-without, the sample size $B_k$ depends on $\|h_k\|_2$ (Kohler and Lucchi, 2017) and $B_h$ depends on $\|h_t^s\|_2$ (Wang et al., 2018a), which makes both algorithms implicit. To address this issue, we follow the suggestion in Kohler and Lucchi (2017); Wang et al. (2018a) and use $\|h_{k-1}\|_2$ and $\|h_{t-1}^s\|_2$ instead of $\|h_k\|_2$ and $\|h_t^s\|_2$. Furthermore, we choose the penalty parameters $M_{s,t}$ for SVRC, SVRC-without and Lite-SVRC as the constants suggested by the original papers of these algorithms. Finally, to solve the CR subproblem at each iteration, we solve it approximately in the Krylov subspace spanned by Hessian-related vectors, as in Kohler and Lucchi (2017).
8.3. Nonconvex Optimization Problems In this subsection, we formulate the nonconvex optimization problems studied in our experiments. In particular, we choose two nonconvex regression problems as our objectives, both with the nonconvex regularizer
$$g(\lambda, \gamma, x) = \lambda \sum_{i=1}^d \frac{(\gamma x_i)^2}{1 + (\gamma x_i)^2}, \quad (25)$$
where λ, γ are control parameters and $x_i$ is the i-th coordinate of x; λ and γ are set differently for each dataset. This regularizer has been widely used in nonconvex regression problems and can be regarded as a special example of robust nonlinear regression (Reddi et al., 2016b; Kohler and Lucchi, 2017; Wang et al., 2018a).
8.3.1. Logistic Regression with Nonconvex Regularizer The first problem is binary logistic regression with the nonconvex regularizer g (Reddi et al., 2016b).
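The extraction of (25) is garbled, so the code below assumes the standard robust form $g(\lambda, \gamma, x) = \lambda \sum_i (\gamma x_i)^2 / (1 + (\gamma x_i)^2)$ used in the cited works; the function names are ours. The gradient is included since the CR methods above need derivatives of the regularizer.

```python
import numpy as np

def g_reg(x, lam, gamma):
    """Nonconvex regularizer, reading (25) as
    lambda * sum_i (gamma*x_i)^2 / (1 + (gamma*x_i)^2)  (our assumption)."""
    u = (gamma * x) ** 2
    return lam * np.sum(u / (1.0 + u))

def g_reg_grad(x, lam, gamma):
    """Coordinate-wise gradient: 2*lam*gamma^2*x_i / (1 + (gamma*x_i)^2)^2."""
    u = (gamma * x) ** 2
    return 2.0 * lam * gamma ** 2 * x / (1.0 + u) ** 2
```

Each coordinate term is bounded by λ, which is what makes the penalty robust (and nonconvex); a finite-difference check confirms the gradient formula.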
Given training data $x_i \in \mathbb{R}^d$ and labels $y_i \in \{0, 1\}$, $1 \le i \le n$, our goal is to solve the following optimization problem:
$$\min_{s \in \mathbb{R}^d} -\frac{1}{n}\sum_{i=1}^n \Big[ y_i \log \phi(s^\top x_i) + (1 - y_i)\log\big(1 - \phi(s^\top x_i)\big) \Big] + g(\lambda, \gamma, s), \quad (26)$$
where $\phi(x) = 1/(1 + \exp(-x))$ is the sigmoid function and g is defined in (25).
8.3.2. Nonlinear Least Squares with Nonconvex Regularizer The second problem is nonlinear least squares with the nonconvex regularizer $g(\lambda, \gamma, x)$ defined in (25); this problem is also studied in Xu et al. (2017b). Given training data $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$, $1 \le i \le n$, our goal is to minimize
$$\min_{s \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \big(y_i - \phi(s^\top x_i)\big)^2 + g(\lambda, \gamma, s). \quad (27)$$
Here $\phi(x) = 1/(1 + \exp(-x))$ is again the sigmoid function and g is defined in (25).
8.4. Experimental Results for SVRC In this subsection, we present the experimental results for SVRC compared with the baseline algorithms (1)-(5) listed in Section 8.1. Here, we fix λ = 10 and γ = 1 in the nonconvex regularizer g in (25) for both the logistic regression and the nonlinear least squares problems.
Calculation of SO calls: For Subsampled Cubic, each iteration takes $(B_g + B_h)$ SO calls, where $B_g$ and $B_h$ are the subsampling sizes of the gradient and Hessian. For Stochastic Cubic, each iteration costs $(n_g + n_h)$ SO calls, where $n_g$ and $n_h$ denote the subsampling sizes of the gradient and the Hessian-vector operator. Gradient Cubic, Adaptive Cubic and TR cost n SO calls per iteration. We define the number of epochs as the number of SO calls divided by n.
Parameters: For each algorithm and each dataset, we choose $b_g, b_h, T$ for the best performance. Meanwhile, we choose the cubic regularization parameter as $M_{s,t} = \alpha/(1+\beta)^{s + t/T}$, α, β > 0, for each iteration; when β = 0, this constant choice has been proved to enjoy good convergence performance. This schedule is similar to the choice of penalty parameter in Subsampled Cubic and Adaptive Cubic, which sometimes makes some algorithms behave better in our experiments.
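Objectives (26) and (27) can be sketched directly; the function names are ours, and the regularizer uses the form we assumed for (25) above.

```python
import numpy as np

def sigmoid(z):
    """phi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-z))

def _reg(s, lam, gamma):
    # Regularizer (25), as we read it: lambda * sum_i (gamma*s_i)^2/(1+(gamma*s_i)^2).
    u = (gamma * s) ** 2
    return lam * np.sum(u / (1.0 + u))

def logistic_obj(s, X, y, lam, gamma):
    """Objective (26): regularized binary cross-entropy."""
    p = sigmoid(X @ s)
    ce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return ce + _reg(s, lam, gamma)

def nls_obj(s, X, y, lam, gamma):
    """Objective (27): nonlinear least squares with the same regularizer."""
    r = y - sigmoid(X @ s)
    return np.mean(r ** 2) + _reg(s, lam, gamma)
```

At s = 0 the logistic loss equals log 2 and the squared residuals equal 1/4 for binary labels, which makes for an easy sanity check.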
Subproblem Solver: For the cubic subproblem (7), we use the Lanczos-type method of Cartis et al. (2011a), which finds the global minimizer $h_t^s$ of $m_t^s(h)$ in a Krylov subspace
$$\mathcal{K}_l = \mathrm{span}\{\tilde v_t^s,\; U_t^s \tilde v_t^s,\; (U_t^s)^2 \tilde v_t^s,\; \dots,\; (U_t^s)^{l-1}\tilde v_t^s\},$$
where $l \le d$ is the dimension of $\mathcal{K}_l$ and can be selected manually or adaptively (Cartis et al., 2011a; Kohler and Lucchi, 2017). The computational complexity of the Lanczos-type method consists of two parts, following Carmon and Duchi (2018). First, $(l-1)$ matrix-vector products are performed to compute a basis of $\mathcal{K}_l$, at a cost of $O(d^2 l)$. Second, the minimizer of $m_t^s(h)$ is computed within the subspace $\mathcal{K}_l$, at a cost of $O(l \log l)$. Thus the total computational complexity of the Lanczos-type method is $O(d^2 l)$. At each iteration, SVRC also needs to compute the semi-stochastic gradient $v_t^s$ and Hessian $U_t^s$, which costs $O(d b_g + d^2 b_h)$ for both the nonconvex regularized logistic regression and nonlinear least squares problems, where $b_g$ and $b_h$ are the minibatch sizes of the stochastic gradient and Hessian, respectively. Putting these pieces together, the per-iteration complexity of SVRC is $O(d b_g + d^2 b_h + d^2 l)$, and the total computational complexity of SVRC is $O(ST(d b_g + d^2 b_h + d^3))$, where S is the number of epochs and T is the epoch length.
For the binary logistic regression problem (26), the parameters of $M_{s,t} = \alpha/(1+\beta)^{s+t/T}$, α, β > 0, are set as follows: α = 0.05, β = 0 for the a9a and ijcnn1 datasets, and α = 5e3, β = 0.15 for covtype. The experimental results are shown in Figure 1. For the nonlinear least squares problem (27), we set α = 0.05, 1e8, 0.003 and β = 0, 1, 0.5 for the a9a, covtype and ijcnn1 datasets, respectively. The experimental results are shown in Figure 2. From both Figures 1 and 2, we can see that SVRC outperforms all the other baseline algorithms on all the datasets.
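The Lanczos-type subproblem solver described above can be sketched as follows. `cubic_small` solves the projected low-dimensional model exactly via eigendecomposition plus bisection on the standard first-order condition $h = -(H+\theta I)^{-1}g$ with $\theta = (M/2)\|h\|_2$; we ignore the degenerate "hard case" for simplicity, and both function names are ours.

```python
import numpy as np

def cubic_small(g, H, M, tol=1e-10):
    """Exact minimizer of <g,h> + 0.5 h'Hh + (M/6)||h||^3 in small dimension.

    Uses the stationarity condition h = -(H + theta*I)^{-1} g with
    theta = (M/2)*||h||, found by bisection (hard case not handled)."""
    lam, V = np.linalg.eigh(H)
    gt = V.T @ g
    phi = lambda th: np.linalg.norm(gt / (lam + th)) - 2.0 * th / M  # decreasing in th
    lo = max(0.0, -lam[0]) + 1e-12
    hi = lo + 1.0
    while phi(hi) > 0:                    # expand until the residual changes sign
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
    th = 0.5 * (lo + hi)
    return -V @ (gt / (lam + th))

def lanczos_cubic(g, H, M, l):
    """Lanczos-type CR solver sketch: minimize the cubic model over
    K_l = span{g, Hg, ..., H^{l-1} g}."""
    Q = [g / np.linalg.norm(g)]
    for _ in range(l - 1):
        w = H @ Q[-1]
        for q in Q:                       # full reorthogonalization, for clarity
            w -= (q @ w) * q
        nrm = np.linalg.norm(w)
        if nrm < 1e-12:
            break                         # Krylov subspace is exhausted
        Q.append(w / nrm)
    Q = np.stack(Q, axis=1)               # d x l orthonormal basis
    z = cubic_small(Q.T @ g, Q.T @ H @ Q, M)
    return Q @ z                          # ||Qz|| = ||z||, so the model is preserved
```

With l = d the Krylov subspace is the full space, so the returned h satisfies the global optimality condition $g + Hh + (M/2)\|h\|_2 h = 0$ up to the bisection tolerance.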
The only exception is the nonlinear least squares problem on the covtype dataset, where our algorithm behaves slightly worse than Adaptive Cubic in the high-accuracy regime in terms of epoch counts; even in this setting, however, our algorithm still outperforms the other baselines in terms of CPU time.
[Figure 1 here: six panels comparing TR, Adaptive Cubic, Gradient Cubic, Subsampled Cubic, Stochastic Cubic and SVRC on a9a, covtype and ijcnn1, versus epochs and versus time in seconds.]
Figure 1: Logarithmic function value gap for nonconvex regularized logistic regression on different datasets. (a), (b) and (c) present the oracle complexity comparison; (d), (e) and (f) present the runtime comparison.
8.5. Experimental Results for Lite-SVRC In this subsection, we present the experimental results for Lite-SVRC compared with all the baselines listed in Section 8.1. For Lite-SVRC, we use the same cubic subproblem solver as for SVRC in the previous subsection. For the binary logistic regression problem (26) with the nonconvex regularizer g in (25), we set $\lambda = 10^{-3}$ for all three datasets, and γ = 10, 50, 100 for the a9a, ijcnn1 and covtype datasets, respectively. The experimental results are displayed in Figure 3: the first row plots the function value gap versus the Hessian sample complexity of all compared algorithms, and the second row plots the function value gap versus CPU runtime (in seconds).
It can be seen from Figure 3 that Lite-SVRC performs the best among all algorithms with respect to both Hessian sample complexity and runtime on all three datasets, which is consistent with our theoretical analysis. We remark that SVRC performs second best in most settings in terms of both Hessian sample complexity and runtime. It should also be noted that although SVRC-without is also a variance-reduced method similar to Lite-SVRC and SVRC, it indeed performs much worse
[Figure 2 here: six panels comparing TR, Adaptive Cubic, Gradient Cubic, Subsampled Cubic, Stochastic Cubic and SVRC on a9a, covtype and ijcnn1, versus epochs and versus time in seconds.]
Figure 2: Logarithmic function value gap for nonlinear least squares on different datasets. (a), (b) and (c) present the oracle complexity comparison; (d), (e) and (f) present the runtime comparison.
[Figure 3 here: six panels comparing TR, Adaptive Cubic, Gradient Cubic, Subsampled Cubic, Stochastic Cubic, SVRC, SVRC-without and Lite-SVRC on a9a, covtype and ijcnn1, versus epochs and versus time in seconds.]
Figure 3: Function value gap of different algorithms for nonconvex regularized logistic regression problems on different datasets. (a)-(c) are plotted w.r.t. Hessian sample complexity; (d)-(f) are plotted w.r.t. CPU runtime.
[Figure 4 here: same algorithms and datasets as Figure 3, versus epochs and versus time in seconds.]
Figure 4: Function value gap of different algorithms for nonlinear least square problems on different datasets. (a)-(c) are plotted w.r.t.
Hessian sample complexity; (d)-(f) are plotted w.r.t. CPU runtime.
than other methods because, as we pointed out in the introduction, it needs to compute the minimum eigenvalue of the Hessian at each iteration, which makes its Hessian sample complexity even worse than that of Subsampled Cubic, let alone its runtime. For the least squares problem (27), the parameters λ and γ in the nonconvex regularizer are set as follows: $\lambda = 5 \times 10^{-3}$ for all three datasets, and γ = 10, 20, 50 for the a9a, ijcnn1 and covtype datasets, respectively. The experimental results are summarized in Figure 4, where the first row plots the function value gap versus the Hessian sample complexity and the second row versus CPU runtime (in seconds). Lite-SVRC again achieves the best performance among all the algorithms with respect to both Hessian sample complexity and runtime when the required precision is high, which supports our theoretical analysis.
9. Conclusions In this paper, we propose two novel second-order algorithms for nonconvex optimization: SVRC and Lite-SVRC. SVRC is the first algorithm that improves the oracle complexity of cubic regularization and its subsampled variants in certain regimes using variance reduction techniques, and we show that a similar oracle complexity also holds with inexact oracles; under both exact and inexact oracle settings our algorithm outperforms the state-of-the-art methods. Furthermore, Lite-SVRC achieves a lower Hessian sample complexity than SVRC and existing variance-reduction-based cubic regularization algorithms. Extensive experiments on various nonconvex optimization problems and datasets validate our theory.
Acknowledgement We would like to thank the anonymous reviewers for their helpful comments.
This research was sponsored in part by the National Science Foundation IIS-1904183 and IIS-1906169. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
Appendix A. Proof of Main Theoretical Results for SVRC In this section, we present the proofs of our main theoretical results for SVRC. Let us first recall the notation used in Algorithm 1: $v_t^s$ and $U_t^s$ are the semi-stochastic gradient and Hessian defined in (5) and (6), respectively; the $x_t^s$ are the iterates and the $\hat x^s$ are the reference points used in Algorithm 1; $b_g$ and $b_h$ are the batch sizes of the semi-stochastic gradient and Hessian; S and T are the number of epochs and the epoch length. We set $M_{s,t} := M = C_M L_2$ as suggested by Theorems 6 and 13, where $C_M > 0$ is a constant. $h_t^s$ is the exact minimizer of $m_t^s(h)$, where $m_t^s(h)$ is defined in (7), and $\tilde h_t^s$ is the inexact minimizer defined in (15). In order to prove Theorems 6 and 13, we lay down the following useful technical lemmas. The first lemma is standard in the analysis of cubic regularization methods.
Lemma 24 Suppose F is $L_2$-Hessian Lipschitz for some constant $L_2 > 0$. For the semi-stochastic gradient and Hessian defined in (5) and (6), we have the following results:
$$v_t^s + U_t^s h_t^s + \frac{M}{2}\|h_t^s\|_2\, h_t^s = 0, \quad (28)$$
$$U_t^s + \frac{M}{2}\|h_t^s\|_2\, I \succeq 0, \quad (29)$$
$$\langle v_t^s, h_t^s\rangle + \frac{1}{2}\langle U_t^s h_t^s, h_t^s\rangle + \frac{M}{6}\|h_t^s\|_2^3 \le -\frac{M}{12}\|h_t^s\|_2^3. \quad (30)$$
The next two important lemmas control the variances of $v_t^s$ and $U_t^s$.
Lemma 25 For the semi-stochastic gradient $v_t^s$ defined in (5), we have
$$\mathbb{E}_{i_t}\big\|\nabla F(x_t^s) - v_t^s\big\|_2^{3/2} \le \frac{L_2^{3/2}}{b_g^{3/4}}\|x_t^s - \hat x^s\|_2^3,$$
where $\mathbb{E}_{i_t}$ is the expectation over all $i_t \in I_g$.
Lemma 26 Let $U_t^s$ be the semi-stochastic Hessian defined in (6). If the batch size satisfies $b_h \ge 400 \log d$, then
$$\mathbb{E}_{j_t}\big\|\nabla^2 F(x_t^s) - U_t^s\big\|_2^3 \le 1200\, L_2^3 \Big(\frac{\log d}{b_h}\Big)^{3/2}\|x_t^s - \hat x^s\|_2^3,$$
where $\mathbb{E}_{j_t}$ is the expectation over all $j_t \in I_h$.
Zhou, Xu and Gu Lemma 27 For the semi-stochastic gradient and Hessian defined in (5) and (6) and h Rd, we have F(xs t) vs t, h M 27 h 3 2 + 2 F(xs t) vs t 3/2 2 M1/2 , 2F(xs t) Us t h, h 2M 27 h 3 2 + 27 M2 2F(xs t) Us t 3 2. Lemma 28 Let h Rd and CM 100. For the semi-stochastic gradient and Hessian defined in (5) and (6), we have µ(xs t + h) 9C3/2 M h M3/2 h 3 2 + F(xs t) vs t 3/2 2 + M 3/2 2F(xs t) Us t 3 2 + ms t(h) 3/2 2 + M3/2 h 2 hs t 2 3i . Lemma 29 Let h Rd and C 3/2, For the semi-stochastic gradient and Hessian defined in (5) and (6), we have xs t + h bxs 3 2 2C2 h 3 2 + (1 + 3/C) xs t bxs 3 2. (31) Lemma 30 We define constant series ct for 0 t T as follows: c T = 0 and ct = ct+1(1 + 3/T) + M(500T 3) 1 for 0 t T 1. Then we have for any 1 t T, M/24 2ct T 2 0. (32) A.1. Proof of Theorem 6 Proof [Proof of Theorem 6] We first upper bound F(xs t+1) as follows F(xs t+1) F(xs t) + F(xs t), hs t + 1 2 2F(xs t)hs t, hs t + L2 = F(xs t) + vs t, hs t + 1 2 Us ths t, hs t + M 6 hs t 3 2 + F(xs t) vs t, hs t 2 2F(xs t) Us t hs t, hs t M L2 12 hs t 3 2 + M 27 hs t 3 2 + 2 F(xs t) vs t 3/2 2 M1/2 27 hs t 3 2 + 27 M2 2F(xs t) Us t 3 2 12 hs t 3 2 + 2 M1/2 F(xs t) vs t 3/2 2 + 27 M2 2F(xs t) Us t 3 2, (33) where the first inequality follows from Lemma 40 and the second inequality holds due to Lemmas 24 and 27. We define Rs t = E F(xs t) + ct xs t bxs 3 2 , (34) Stochastic Variance-Reduced Cubic Regularization Methods where ct is defined in Lemma 30. Then by Lemma 29, for T 3/2 we have ct+1 xs t+1 bxs 3 2 = ct+1 hs t + xs t bxs 3 2 2ct+1T 2 hs t 3 2 + ct+1(1 + 3/T) xs t bxs 3 2. (35) Applying Lemma 28 with h = hs t, we have 240C2 ML1/2 2 1µ(xs t+1) M 24 hs t 3 2 + F(xs t) vs t 3/2 2 24M1/2 + 2F(xs t) Us t 3 2 24M2 + ms t(hs t) 3/2 2 24M1/2 + M hs t 2 hs t 2 3 24 hs t 3 2 + F(xs t) vs t 3/2 2 24M1/2 + 2F(xs t) Us t 3 2 24M2 , (36) where the equality is due to Lemma 24. 
Adding (33) with (35) and (36) and taking total expectation, we have Rs t+1 + 240C2 ML1/2 2 1E[µ(xs t+1)] = E h F(xs t+1) + ct+1 xs t+1 bxs 3 2 + 240C2 ML1/2 2 1µ(xs t+1) i E F(xs t) + ct+1(1 + 3/T) xs t bxs 3 2 hs t 3 2 M/24 2ct+1T 2 + E h 3M 1/2 F(xs t) vs t 3/2 2 + 28M 2 2F(xs t) Us t 3 2 E F(xs t) + ct+1(1 + 3/T) xs t bxs 3 2 + E h 3M 1/2 F(xs t) vs t 3/2 2 + 28M 2 2F(xs t) Us t 3 2 where the third inequality holds due to Lemma 30. To further bound (37), we have 3 M1/2 E F(xs t) vs t 3/2 2 3L3/2 2 M1/2b3/4 g E xs t bxs 3 2 M 1000T 3 E xs t bxs 3 2, (38) where the first inequality holds due to Lemma 25, the second inequality holds due to M 100L2 and bg 5T 4 from the condition of Theorem 6. We also have 28 M2 E 2F(xs t) Us t 3 2 28 15000L3 2 M2(bh/ log d)3/2 E xs t bxs 3 2 M 1000T 3 E xs t bxs 3 2, (39) where the first inequality holds due to Lemma 26, where we have bh 100T 2 log d 400 log d, and the second inequality holds due to M 100L2 and bh 100T 2 log d from the assumption of Theorem 6. Thus, submitting (38) and (39) into (37), we have Rs t+1 + 240C2 ML1/2 2 1E[µ(xs t+1)] E F(xs t) + xs t bxs 3 2 ct+1(1 + 3/T) + M 500T 3 = E F(xs t) + ct xs t bxs 3 2 = Rs t, (40) Zhou, Xu and Gu where the first equality holds due to the definition of ct in Lemma 30. Telescoping (40) from t = 0 to T 1, we have 240C2 ML1/2 2 1E[µ(xs t)]. By the definition of c T in Lemma 30, we have c T = 0, then Rs T = E F(xs T ) + c T xs T bxs 3 2 = EF(bxs+1); meanwhile by the definition of xs 0, we have xs 0 = bxs. Thus we have Rs 0 = E F(xs 0) + c0 xs 0 bxs 3 2 = EF(bxs), which implies EF(bxs) EF(bxs+1) = Rs 0 Rs T 240C2 ML1/2 2 1 T X t=1 E[µ(xs t)]. (41) Finally, telescoping (41) from s = 1 to S yields s=1 EF(bxs) EF(bxs+1) 240C2 ML1/2 2 1 S X t=1 E[µ(xs t)]. By the definition about choice of xout, we complete the proof. A.2. Proof of Corollary 9 Proof We can verify that the parameter setting in Corollary 9 satisfies the requirement of Theorem 6. 
Thus, submitting the choice of parameters into Theorem 6, the output of Algorithm 1 xout satisfies that E[µ(xout)] 240C2 ML1/2 2 F ST ϵ3/2, (42) which indeed implies that xout is an (ϵ, L2ϵ)-approximate local minimum. Next we calculate how many SO calls and CSO calls are needed. Algorithm 1 needs to calculate full gradient gs and full Hessian Hs at the beginning of each epoch, with n SO calls. In each epoch, Algorithm 1 needs to calculate vs t and Us t with bg + bh SO calls at each iteration. Thus, the total amount of SO calls is Sn + (ST)(bg + bh) n + C1 F L1/2 2 n4/5ϵ 3/2 + C1 F L1/2 2 ϵ 3/2(5n4/5 + 1000n2/5 log d) = e O n + F L2n4/5 where C1 = 240C2 M. For the CSO calls, Algorithm 1 needs to solve cubic subproblem at each single iteration. Thus, the total amount of CSO calls is ST C1 F L1/2 2 ϵ 3/2 = O F L2 ϵ3/2 Stochastic Variance-Reduced Cubic Regularization Methods A.3. Proof of Theorem 13 Proof [Proof of Theorem 13] Similar to (33) in the proof of Theorem 6, we have F(xs t+1) F(xs t) + F(xs t), ehs t + 1 2 2F(xs t)ehs t, ehs t + L2 = F(xs t) + vs t, ehs t + 1 2 Us t ehs t, ehs t + M ehs t 3 2 + F(xs t) vs t, ehs t 2 ( 2F(xs t) Us t)ehs t, ehs t M L2 ehs t 3 2 + δ + M ehs t 3 2 + 2 F(xs t) vs t 3/2 2 M1/2 ehs t 3 2 + 27 M2 2F(xs t) Us t 3 2 ehs t 3 2 + 2 M1/2 F(xs t) vs t 3/2 2 + 27 M2 2F(xs t) Us t 3 2 + δ, where the second inequality holds because ehs t is an inexact solver satisfying Condition 11. By Lemma 29 with h = ehs t, we have ct+1 xs t+1 bxs 3 2 = ct+1 xs t bxs + ehs t 3 2 2ct+1T 2 ehs t 3 2 + ct+1(1 + 3/T) xs t bxs 3 2. (44) By Lemma 28, we also have 240C2 ML1/2 2 1µ(xs t+1) = 240C2 ML1/2 2 1µ xs t + ehs t ehs t 3 2 + F(xs t) vs t 3/2 2 24M1/2 + 2F(xs t) Us t 3 2 24M2 + ms t ehs t 3/2 2 24M1/2 + M ehs t 2 hs t 2 3 Since ehs t is an inexact solver satisfying Condition 11, we have ms t ehs t 3/2 2 24M1/2 + M ehs t 2 hs t 2 3 24 < δ. 
(46) Submitting (46) into (45), we have 240C2 ML1/2 2 1µ(xs t+1) M ehs t 3 2 + F(xs t) vs t 3/2 2 24M1/2 + 2F(xs t) Us t 3 2 24M2 + δ. (47) Then adding (43), (44) and (47) up, we have Rs t+1 + 240C2 ML1/2 2 1E[µ(xs t+1)] = E h F(xs t+1) + ct+1 xs t+1 bxs 3 2 + 240C2 ML1/2 2 1µ(xs t+1) i Zhou, Xu and Gu E h F(xs t) + ct+1(1 + 3/T) xs t bxs 3 2 ehs t 3 2 M/24 2ct+1T 2 i + E 3 M1/2 F(xs t) vs t 3/2 2 + 28 M2 2F(xs t) Us t 3 2 E F(xs t) + ct+1(1 + 3/T) xs t bxs 3 2 + E 3 M1/2 F(xs t) vs t 3/2 2 + 28 M2 2F(xs t) Us t 3 2 Since the parameter setting is the same as Theorem 6, by (38) and (39), we have 3 M1/2 E F(xs t) vs t 3/2 2 M 1000T 3 E xs t bxs 3 2, (49) 28 M2 E 2F(xs t) Us t 3 2 M 1000T 3 E xs t bxs 3 2. (50) Submitting (49) and (50) into (48) yields Rs t+1 + 240C2 ML1/2 2 1E[µ(xs t+1)] E F(xs t) + xs t bxs 3 2 ct+1(1 + 3/T) + M 500T 3 = E F(xs t) + ct xs t bxs 3 2 + 2δ = Rs t + 2δ, (51) where the first equality holds due to the definition of ct in Lemma 30. Telescoping (40) from t = 0 to T 1, we have 240C2 ML1/2 2 1E[µ(xs t)] 2δ . By the definition of c T in Lemma 30, we have c T = 0, then Rs T = E F(xs T ) + c T xs T bxs 3 2 = E[F(bxs+1)]; meanwhile by the definition of xs 0, we have xs 0 = bxs. Thus we have Rs 0 = E F(xs 0) + c0 xs 0 bxs 3 2 = E[F(bxs)], which further implies E[F(bxs)] E[F(bxs+1)] = Rs 0 Rs T 240C2 ML1/2 2 1E[µ(xs t)] 2δ . (52) Finally, telescoping (52) from s = 1 to S, we obtain s=1 E[F(bxs)] E[F(bxs+1)] h 240C2 ML1/2 2 1E[µ(xs t)] 2δ i . By the definition about choice of xout, we finish the proof. Stochastic Variance-Reduced Cubic Regularization Methods A.4. Proof of Corollary 15 Proof [Proof of Corollary 15] Under the parameter choice in Corollary 15, it holds that E[µ(xout)] 240C2 ML1/2 2 F ST + 480C2 ML1/2 2 δ ϵ3/2/2 + ϵ3/2/2 = ϵ3/2. (53) Thus, xout is an (ϵ, L2ϵ)-approximate local minimum. 
By the proof of Corollary 9, the total amount of SO calls is Sn + (ST)(bg + bh) n + C1 F L1/2 2 n4/5ϵ 3/2 + C1 F L1/2 2 ϵ 3/2(5n4/5 + 1000n2/5 log d) = e O n + F L2n4/5 where C1 = 480C2 M. For the CSO calls, Algorithm 1 needs to solve cubic subproblem at each single iteration. Thus, the total amount of CSO calls is ST C1 F L1/2 2 ϵ 3/2 = O F L2 ϵ3/2 Appendix B. Proof of Technical Lemmas in Appendix A In this section, we prove the technical lemmas used in Appendix A. B.1. Proof of Lemma 24 The result of Lemma 24 is typical in the literature of cubic regularization (Nesterov and Polyak, 2006; Cartis et al., 2011a,b), but no exactly the same result has been shown in any formal way. Thus we present the proof here for self-containedness. Proof [Proof of Lemma 24] For simplicity, we let g = vs t, H = Us t, θ = Mt and hopt = hs t. Then we need to prove g + Hhopt + θ 2 hopt 2hopt = 0, (54) 2 hopt 2I 0, (55) g, hopt + 1 2 Hhopt, hopt + θ 6 hopt 3 2 θ 12 hopt 3 2. (56) Let λ = θ hopt 2/2. Note that hopt = argmin m(h), then the necessary condition m(hopt) = 0 and 2m(hopt) 0 can be written as m(hopt) = g + Hhopt + λhopt = 0, (57) w 2m(hopt)w = w H + λI + λ hopt hopt 2 hopt hopt 2 w 0, w Rd. (58) Zhou, Xu and Gu Apparently, (57) directly implies (54). To prove (55), we adapt the proof of Lemma 5.1 in Agarwal et al. (2017). Note that if w, hopt = 0, then (58) directly implies (55). So we only need to focus on the case that w, hopt = 0. Since w, hopt = 0, there exists η = 0 such that hopt + ηw 2 = hopt 2. (In fact, we can find η = 2 w, hopt / w 2 2 satisfies the requirement). Next we will take a close look at the difference m(hopt + ηw) m(hopt). 
On one hand, we have m(hopt + ηw) m(hopt) = g [(hopt + ηw) hopt] + (hopt + ηw) H(hopt + ηw) 2 h opt Hhopt = [(hopt + ηw) hopt] (H + λI)hopt + (hopt + ηw) H(hopt + ηw) 2 h opt Hhopt 2 w 2 2 + [hopt (hopt + ηw)] Hhopt + (hopt + ηw) H(hopt + ηw) 2 h opt Hhopt 2 w 2 2 + h opt Hhopt 2 (hopt + ηw) Hhopt + (hopt + ηw) H(hopt + ηw) 2 w 2 2 + η2 2 w Hw = η2 2 w (H + λI)w, where (59) holds due to (57) and (60) holds due to the definition of η. On the other hand, by the definition of hopt, m(hopt + ηw) m(hopt) 0. Thus, we have proved (55). Finally, we prove (56) by showing that g, hopt + 1 2 Hhopt, hopt + θ = g + Hhopt + θ 2 hopt 2hopt, hopt 2h opt(H + λI)hopt θ 12 hopt 3 2 2h opt(H + λI)hopt θ 12 hopt 3 2 (61) 12 hopt 3 2, (62) where (61) holds due to (54) and (62) holds due to (55). B.2. Proof of Lemma 25 In order to prove Lemma 25, we need the following useful lemma. Lemma 31 Suppose a1, . . . , a N are i.i.d. and Eai = 0, then 2 1 N3/4 E ai 2 2 3/4. Proof [Proof of Lemma 25] For simplification, we use E to replace Evit. We have E F(xs t) vs t 3/2 2 Stochastic Variance-Reduced Cubic Regularization Methods X fit(xs t) fit(bxs) + gs 1 X 2fit(bxs) Hs (xs t bxs) F(xs t) X fit(xs t) fit(bxs) 2fit(bxs)(xs t bxs) F(xs t) + F(bxs) + 2F(bxs)(xs t bxs Now we set the parameters in Lemma 31 as N = bg and ait = fit(xs t) fit(bxs) 2fit(bxs)(xs t bxs) F(xs+1 t ) + F(bxs) + 2F(bxs)(xs t bxs). We can check that ait satisfy the assumption of Lemma 31. Thus, by Lemma 31, we have E F(xs t) vs t 3/2 2 1 E fit(xs t) fit(bxs) 2fit(bxs)(xs t bxs) F(xs t) + F(bxs) + 2F(bxs)(xs t bxs) 2 2 By Assumption 1 and Lemma 40, we have fit(xs t) fit(bxs) 2fit(bxs)(xs t bxs) F(xs t) + F(bxs) + 2F(bxs)(xs t bxs) 2 fit(xs t) fit(bxs) 2fit(bxs)(xs t bxs) 2 + F(xs t) F(bxs) 2F(bxs)(xs t bxs) 2 2 xs t bxs 2 2 + L2 2 xs t bxs 2 2 = L2 xs t bxs 2 2. (64) Plugging (64) into (63) yields E F(xs t) vs t 3/2 2 1 L2 2 xs t bxs 4 2 3/4 = L3/2 2 b3/4 g xs t bxs 3 2. B.3. 
Proof of Lemma 26

In order to prove Lemma 26, we need the following supporting lemma.

Lemma 32 Suppose $q \ge 2$, $p \ge 2$, and fix $r \ge \max\{q, 2\log p\}$. Consider i.i.d. random self-adjoint matrices $Y_1, \ldots, Y_N$ of dimension $p\times p$ with $\mathbb{E} Y_i = 0$. It holds that
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N Y_i\Big\|_2^q\Big)^{1/q} \le 2\sqrt{er}\,\Big\|\sum_{i=1}^N \mathbb{E} Y_i^2\Big\|_2^{1/2} + 4er\,\big(\mathbb{E}\max_i \|Y_i\|_2^q\big)^{1/q}. $$

Proof [Proof of Lemma 26] For simplicity, we use $\mathbb{E}$ to denote $\mathbb{E}_{j_t}$. We have
$$ \mathbb{E}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3 = \mathbb{E}\Big\|\nabla^2 F(x_t^s) - \frac{1}{b_h}\sum_{j_t}\big[\nabla^2 f_{j_t}(x_t^s) - \nabla^2 f_{j_t}(\hat{x}^s)\big] - H^s\Big\|_2^3 = \frac{1}{b_h^3}\,\mathbb{E}\Big\|\sum_{j_t} Y_{j_t}\Big\|_2^3, \quad (65) $$
where $Y_{j_t} = \nabla^2 f_{j_t}(x_t^s) - \nabla^2 f_{j_t}(\hat{x}^s) + H^s - \nabla^2 F(x_t^s)$. We apply Lemma 32 with parameters $q = 3$, $p = d$, $r = 2\log p$ and $N = b_h$; it can be easily checked that these parameters satisfy the assumption of Lemma 32. Meanwhile, by Assumption 1, we have the following upper bound on $Y_{j_t}$:
$$ \|Y_{j_t}\|_2 \le \|\nabla^2 f_{j_t}(x_t^s) - \nabla^2 f_{j_t}(\hat{x}^s)\|_2 + \|H^s - \nabla^2 F(x_t^s)\|_2 \le L_2\|x_t^s - \hat{x}^s\|_2 + L_2\|x_t^s - \hat{x}^s\|_2 = 2L_2\|x_t^s - \hat{x}^s\|_2. \quad (66) $$
By Lemma 32, we have
$$ \Big(\mathbb{E}\Big\|\sum_{j_t} Y_{j_t}\Big\|_2^3\Big)^{1/3} \le 2\sqrt{er}\,\Big\|\sum_{j_t}\mathbb{E} Y_{j_t}^2\Big\|_2^{1/2} + 4er\,\big(\mathbb{E}\max_i\|Y_i\|_2^3\big)^{1/3}. \quad (67) $$
The first term on the RHS of (67) can be bounded as
$$ 2\sqrt{er}\,\Big\|\sum_{j_t}\mathbb{E} Y_{j_t}^2\Big\|_2^{1/2} \le 2\sqrt{Ner}\,\|\mathbb{E} Y_{j_t}^2\|_2^{1/2} \le 2\sqrt{Ner}\,\big(\mathbb{E}\|Y_{j_t}^2\|_2\big)^{1/2} = 2\sqrt{Ner}\,\big(\mathbb{E}\|Y_{j_t}\|_2^2\big)^{1/2} \le 4L_2\sqrt{Ner}\,\|x_t^s - \hat{x}^s\|_2, \quad (68) $$
where the second inequality holds due to Jensen's inequality, the equality holds because $\|Y_{j_t}^2\|_2 = \|Y_{j_t}\|_2^2$, and the last inequality holds due to (66). The second term on the RHS of (67) can be bounded as
$$ 4er\,\big(\mathbb{E}\max_i\|Y_i\|_2^3\big)^{1/3} \le 4er\,\big[(2L_2\|x_t^s - \hat{x}^s\|_2)^3\big]^{1/3} = 8L_2 er\,\|x_t^s - \hat{x}^s\|_2. \quad (69) $$
Substituting (68) and (69) into (67), we have
$$ \Big(\mathbb{E}\Big\|\sum_{j_t} Y_{j_t}\Big\|_2^3\Big)^{1/3} \le 4L_2\sqrt{Ner}\,\|x_t^s - \hat{x}^s\|_2 + 8L_2 er\,\|x_t^s - \hat{x}^s\|_2, $$
which immediately implies
$$ \mathbb{E}\Big\|\sum_{j_t} Y_{j_t}\Big\|_2^3 \le 64L_2^3\big(\sqrt{Ner} + 2er\big)^3\,\|x_t^s - \hat{x}^s\|_2^3. \quad (70) $$
Substituting (70) into (65) with $r = 2\log d$ and $N = b_h$, we have
$$ \mathbb{E}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3 \le \frac{64L_2^3}{b_h^3}\big(\sqrt{2e\, b_h\log d} + 4e\log d\big)^3\,\|x_t^s - \hat{x}^s\|_2^3 \le \frac{1600\,(\log d)^{3/2} L_2^3}{b_h^{3/2}}\,\|x_t^s - \hat{x}^s\|_2^3, $$
where the last inequality holds due to $b_h \ge 400\log d$.

B.4.
Proof of Lemma 27

Proof We have
$$ \langle\nabla F(x_t^s) - v_t^s, h\rangle \le \|\nabla F(x_t^s) - v_t^s\|_2\,\|h\|_2 = \big(M^{1/3}\|h\|_2\big)\cdot\frac{\|\nabla F(x_t^s) - v_t^s\|_2}{M^{1/3}} \le \frac{M}{27}\|h\|_2^3 + \frac{2\,\|\nabla F(x_t^s) - v_t^s\|_2^{3/2}}{M^{1/2}}, $$
where the second inequality holds due to Young's inequality. Meanwhile, we have
$$ \langle(\nabla^2 F(x_t^s) - U_t^s)h, h\rangle \le \|\nabla^2 F(x_t^s) - U_t^s\|_2\,\|h\|_2^2 = \big(M^{2/3}\|h\|_2^2\big)\cdot\frac{\|\nabla^2 F(x_t^s) - U_t^s\|_2}{M^{2/3}} \le \frac{2M}{27}\|h\|_2^3 + \frac{27}{M^2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3, $$
where the second inequality holds due to Young's inequality.

B.5. Proof of Lemma 28

In order to prove Lemma 28, we need the following two useful lemmas.

Lemma 33 Under Assumption 1, if $M \ge 2L_2$, then we have
$$ \|\nabla F(x_t^s + h)\|_2 \le M\|h\|_2^2 + \|\nabla F(x_t^s) - v_t^s\|_2 + \frac{1}{M}\|\nabla^2 F(x_t^s) - U_t^s\|_2^2 + \|\nabla m_t^s(h)\|_2. $$

Lemma 34 Under Assumption 1, if $M \ge 2L_2$, then we have
$$ \lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big) \ge -\Big( M\|h\|_2 + \|\nabla^2 F(x_t^s) - U_t^s\|_2 + M\big|\,\|h_t^s\|_2 - \|h\|_2\big| \Big). $$

Proof [Proof of Lemma 28] By the definition of $\mu$, we can bound $\|\nabla F(x_t^s + h)\|_2^{3/2}$ and $-L_2^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big)^3$ separately. To bound $\|\nabla F(x_t^s + h)\|_2^{3/2}$, applying Lemma 33 we have
$$ \|\nabla F(x_t^s + h)\|_2^{3/2} \le \Big[ M\|h\|_2^2 + \|\nabla F(x_t^s) - v_t^s\|_2 + \frac{1}{M}\|\nabla^2 F(x_t^s) - U_t^s\|_2^2 + \|\nabla m_t^s(h)\|_2 \Big]^{3/2} \le 2\Big[ M^{3/2}\|h\|_2^3 + \|\nabla F(x_t^s) - v_t^s\|_2^{3/2} + M^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3 + \|\nabla m_t^s(h)\|_2^{3/2} \Big], $$
where the second inequality holds due to the basic inequality $(a+b+c+d)^{3/2} \le 2(a^{3/2} + b^{3/2} + c^{3/2} + d^{3/2})$. To bound $-\lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big)^3$, applying Lemma 34 we have
$$ -L_2^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big)^3 = -C_M^{3/2} M^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big)^3 \le C_M^{3/2} M^{-3/2}\Big[ M\|h\|_2 + \|\nabla^2 F(x_t^s) - U_t^s\|_2 + M\big|\,\|h_t^s\|_2 - \|h\|_2\big| \Big]^3 \le 9C_M^{3/2}\Big[ M^{3/2}\|h\|_2^3 + M^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3 + M^{3/2}\big|\,\|h_t^s\|_2 - \|h\|_2\big|^3 \Big], $$
where the second inequality is due to $(a+b+c)^3 \le 9(a^3+b^3+c^3)$. Since $9C_M^{3/2} > 2$, we have
$$ \mu(x_t^s + h) = \max\Big\{ \|\nabla F(x_t^s + h)\|_2^{3/2},\; -L_2^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big)^3 \Big\} \le 9C_M^{3/2}\Big[ M^{3/2}\|h\|_2^3 + \|\nabla F(x_t^s) - v_t^s\|_2^{3/2} + M^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3 + \|\nabla m_t^s(h)\|_2^{3/2} + M^{3/2}\big|\,\|h_t^s\|_2 - \|h\|_2\big|^3 \Big], $$
which completes the proof.

B.6.
Proof of Lemma 29

Proof For any $x_t^s$, $h$, $\hat{x}^s$ and a constant $C > 0$, we have
$$ \|x_t^s + h - \hat{x}^s\|_2^3 \le \big(\|h\|_2 + \|x_t^s - \hat{x}^s\|_2\big)^3 = \|h\|_2^3 + 3\|h\|_2^2\|x_t^s - \hat{x}^s\|_2 + 3\|h\|_2\|x_t^s - \hat{x}^s\|_2^2 + \|x_t^s - \hat{x}^s\|_2^3 $$
$$ = \|h\|_2^3 + 3\big(C^{1/3}\|h\|_2^2\big)\frac{\|x_t^s - \hat{x}^s\|_2}{C^{1/3}} + 3\big(C^{2/3}\|h\|_2\big)\frac{\|x_t^s - \hat{x}^s\|_2^2}{C^{2/3}} + \|x_t^s - \hat{x}^s\|_2^3 $$
$$ \le \|h\|_2^3 + 3\Big[\frac{2}{3}\big(C^{1/3}\|h\|_2^2\big)^{3/2} + \frac{1}{3}\frac{\|x_t^s - \hat{x}^s\|_2^3}{C}\Big] + 3\Big[\frac{1}{3}\big(C^{2/3}\|h\|_2\big)^3 + \frac{2}{3}\Big(\frac{\|x_t^s - \hat{x}^s\|_2^2}{C^{2/3}}\Big)^{3/2}\Big] + \|x_t^s - \hat{x}^s\|_2^3 $$
$$ = \big(1 + 2C^{1/2} + C^2\big)\|h\|_2^3 + \Big(1 + \frac{3}{C}\Big)\|x_t^s - \hat{x}^s\|_2^3 \le 2C^2\|h\|_2^3 + \Big(1 + \frac{3}{C}\Big)\|x_t^s - \hat{x}^s\|_2^3, \quad (71) $$
where the second inequality holds due to Young's inequality and the last inequality holds because $1 + 2\sqrt{C} + C^2 \le 2C^2$ when $C \ge 2$.

B.7. Proof of Lemma 30

Proof By induction, we have for any $0 \le t \le T$,
$$ c_t = \frac{M}{1500\,T}\Big[\big(1 + 3/T\big)^{T-t} - 1\Big]. $$
Then for any $0 \le t \le T$,
$$ 2c_t T \le \frac{2M(1 + 3/T)^T}{1500} \le \frac{2Me^3}{1500} \le \frac{M}{27}. $$

Appendix C. Proof of Auxiliary Lemmas

In this section, we prove the auxiliary lemmas used in Appendix B.

C.1. Proof of Lemma 31

Proof We have
$$ \mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^N a_i\Big\|_2^{3/2} = \frac{\mathbb{E}\big\|\sum_{i=1}^N a_i\big\|_2^{3/2}}{N^{3/2}} \le \frac{\big(\mathbb{E}\big\|\sum_{i=1}^N a_i\big\|_2^2\big)^{3/4}}{N^{3/2}} = \frac{\big(\sum_{i=1}^N \mathbb{E}\|a_i\|_2^2\big)^{3/4}}{N^{3/2}} = \frac{1}{N^{3/4}}\big(\mathbb{E}\|a_i\|_2^2\big)^{3/4}. $$
The first inequality holds due to Lemma 41 with $s = 3/2$, $t = 2$. The second equality holds because $\mathbb{E} a_i = 0$ and the $a_i$ are independently and identically distributed.

C.2. Proof of Lemma 32

Proof This proof is mainly adapted from Chen et al. (2012); Tropp (2016). First, let $\{Y_i' : i = 1, \ldots, N\}$ be an independent copy of the sequence $\{Y_i : i = 1, \ldots, N\}$, and denote by $\mathbb{E}_{Y'}$ the expectation over this independent copy. Then $\mathbb{E}_{Y'} Y_i' = 0$ and
$$ \mathbb{E}\Big\|\sum_{i=1}^N Y_i\Big\|_2^q = \mathbb{E}\Big\|\sum_{i=1}^N \big(Y_i - \mathbb{E}_{Y'} Y_i'\big)\Big\|_2^q \le \mathbb{E}\Big\|\sum_{i=1}^N (Y_i - Y_i')\Big\|_2^q. \quad (72) $$
The first equality holds due to $\mathbb{E}_{Y'} Y_i' = 0$; the inequality holds because $\|\cdot\|_2^q$ with $q \ge 2$ is a convex function (Jensen's inequality), after combining the iterated expectation into a single expectation. Note that $Y_i - Y_i'$ has a symmetric distribution, so the independent sequence $\{\xi_i(Y_i - Y_i') : 1 \le i \le N\}$ has the same distribution as $\{Y_i - Y_i' : 1 \le i \le N\}$, where the $\xi_i$ are independent Rademacher random variables, also independent of $Y_i, Y_i'$.
Therefore,
$$ \mathbb{E}\Big\|\sum_{i=1}^N (Y_i - Y_i')\Big\|_2^q = \mathbb{E}\Big\|\sum_{i=1}^N \xi_i(Y_i - Y_i')\Big\|_2^q. \quad (73) $$
Furthermore, we have
$$ \mathbb{E}\Big\|\sum_{i=1}^N \xi_i(Y_i - Y_i')\Big\|_2^q \le 2^{q-1}\Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_2^q + \mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i'\Big\|_2^q\Big) = 2^q\,\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_2^q. \quad (74) $$
The inequality holds due to $\|A - B\|_2^q \le (\|A\|_2 + \|B\|_2)^q \le 2^{q-1}(\|A\|_2^q + \|B\|_2^q)$, where we let $A = \sum_{i=1}^N \xi_i Y_i$ and $B = \sum_{i=1}^N \xi_i Y_i'$; the equality holds due to the identical distribution of $\{\xi_i Y_i\}$ and $\{\xi_i Y_i'\}$. Substituting (73) and (74) into (72) yields
$$ \mathbb{E}\Big\|\sum_{i=1}^N Y_i\Big\|_2^q \le 2^q\,\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_2^q. \quad (75) $$
Taking the $q$-th root on both sides, we have
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N Y_i\Big\|_2^q\Big)^{1/q} \le 2\Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_2^q\Big)^{1/q}. \quad (76) $$
Next, we have the inequality chain
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_2^q\Big)^{1/q} \le \Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_{S_r}^q\Big)^{1/q} \le \Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_{S_r}^r\Big)^{1/r}, \quad (77) $$
where the first inequality holds due to $\|\cdot\|_2 \le \|\cdot\|_{S_r}$, and the second inequality holds due to Lyapunov's inequality (Lemma 41) with $s = q$, $t = r$ (this applies since $q \le r$). Note that we have
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i Y_i\Big\|_{S_r}^r\Big)^{1/r} \le \sqrt{r}\,\Big(\mathbb{E}\Big\|\Big(\sum_{i=1}^N Y_i^2\Big)^{1/2}\Big\|_{S_r}^r\Big)^{1/r} \le \sqrt{r}\,p^{1/r}\Big(\mathbb{E}\Big\|\sum_{i=1}^N Y_i^2\Big\|_2^{q/2}\Big)^{1/q} \le \sqrt{er}\,\Big(\mathbb{E}\Big\|\sum_{i=1}^N Y_i^2\Big\|_2^{q/2}\Big)^{1/q}, \quad (78) $$
where the first inequality holds due to Proposition 42; the second inequality holds because $\|A\|_{S_r} \le p^{1/r}\|A\|_2$, where we set $A = \big(\sum_{i=1}^N Y_i^2\big)^{1/2}$ and $p$ is the dimension of $A$; the third inequality holds because $p^{1/r} \le p^{1/(2\log p)} = e^{1/2}$. Finally, we use Proposition 43 to bound the right-hand side of (78). Since the $Y_i^2$ are positive semi-definite and independent random matrices, we can set $W_i = Y_i^2$ in Proposition 43. Meanwhile, $q/2 \ge 1$, so we have
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N Y_i^2\Big\|_2^{q/2}\Big)^{2/q} \le \Big(\Big\|\sum_{i=1}^N \mathbb{E} Y_i^2\Big\|_2^{1/2} + 2\sqrt{er}\,\big(\mathbb{E}\max_i\|Y_i\|_2^q\big)^{1/q}\Big)^2, $$
which immediately implies
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N Y_i^2\Big\|_2^{q/2}\Big)^{1/q} \le \Big\|\sum_{i=1}^N \mathbb{E} Y_i^2\Big\|_2^{1/2} + 2\sqrt{er}\,\big(\mathbb{E}\max_i\|Y_i\|_2^q\big)^{1/q}. \quad (79) $$
Substituting (77), (78) and (79) into (76), we have the proof completed.

C.3. Proof of Lemma 33

Proof We have
$$ \|\nabla F(x_t^s + h)\|_2 = \Big\|\big[\nabla F(x_t^s + h) - \nabla F(x_t^s) - \nabla^2 F(x_t^s)h\big] + \Big[v_t^s + U_t^s h + \frac{M}{2}\|h\|_2 h\Big] + \big[\nabla F(x_t^s) - v_t^s\big] + \big[\nabla^2 F(x_t^s) - U_t^s\big]h - \frac{M}{2}\|h\|_2 h\Big\|_2 $$
$$ \le \|\nabla F(x_t^s + h) - \nabla F(x_t^s) - \nabla^2 F(x_t^s)h\|_2 + \Big\|v_t^s + U_t^s h + \frac{M}{2}\|h\|_2 h\Big\|_2 + \|\nabla F(x_t^s) - v_t^s\|_2 + \|\nabla^2 F(x_t^s) - U_t^s\|_2\|h\|_2 + \frac{M}{2}\|h\|_2^2, \quad (80) $$
where the inequality holds due to the triangle inequality. In the following, we bound the right-hand side of (80). The first term on the right-hand side of (80) can be bounded as
$$ \|\nabla F(x_t^s + h) - \nabla F(x_t^s) - \nabla^2 F(x_t^s)h\|_2 \le \frac{L_2}{2}\|h\|_2^2 \le \frac{M}{4}\|h\|_2^2, $$
where the first inequality holds due to Assumption 1 and the second inequality holds due to $2L_2 \le M$. The second term on the right-hand side of (80) equals
$$ \Big\|v_t^s + U_t^s h + \frac{M}{2}\|h\|_2 h\Big\|_2 = \|\nabla m_t^s(h)\|_2. $$
And the final term can be bounded as
$$ \|\nabla^2 F(x_t^s) - U_t^s\|_2\|h\|_2 \le \frac{1}{M}\|\nabla^2 F(x_t^s) - U_t^s\|_2^2 + \frac{M}{4}\|h\|_2^2, $$
where the inequality is due to Young's inequality. Putting all these bounds together and substituting them into (80), we have
$$ \|\nabla F(x_t^s + h)\|_2 \le M\|h\|_2^2 + \|\nabla F(x_t^s) - v_t^s\|_2 + \frac{1}{M}\|\nabla^2 F(x_t^s) - U_t^s\|_2^2 + \|\nabla m_t^s(h)\|_2. $$

C.4. Proof of Lemma 34

Proof We have
$$ \nabla^2 F(x_t^s + h) \succeq \nabla^2 F(x_t^s) - L_2\|h\|_2 I \succeq U_t^s - \|\nabla^2 F(x_t^s) - U_t^s\|_2 I - L_2\|h\|_2 I \succeq -\frac{M}{2}\|h_t^s\|_2 I - \|\nabla^2 F(x_t^s) - U_t^s\|_2 I - L_2\|h\|_2 I, $$
where the first inequality holds because $\nabla^2 F$ is $L_2$-Hessian Lipschitz and the last inequality holds due to (29) in Lemma 24. Thus we have
$$ \lambda_{\min}\big(\nabla^2 F(x_t^s + h)\big) \ge -\Big(\frac{M}{2}\|h_t^s\|_2 + \|\nabla^2 F(x_t^s) - U_t^s\|_2 + L_2\|h\|_2\Big) = -\Big(\frac{M}{2}\big(\|h_t^s\|_2 - \|h\|_2\big) + \|\nabla^2 F(x_t^s) - U_t^s\|_2 + \big(L_2 + \tfrac{M}{2}\big)\|h\|_2\Big) \ge -\Big(M\|h\|_2 + \|\nabla^2 F(x_t^s) - U_t^s\|_2 + M\big|\,\|h_t^s\|_2 - \|h\|_2\big|\Big), $$
where the last inequality holds because $L_2 \le M/2$.

Appendix D. Proof of Main Theoretical Results for Lite-SVRC

In this section, we provide the proofs of the theoretical results for Lite-SVRC (Algorithm 2).

D.1. Proof of Theorem 18

For simplicity of notation, we define $\tilde{v}$ and $\tilde{U}$ as
$$ \tilde{v} = \nabla F(x_t^s) - \tilde{v}_t^s, \qquad \tilde{U} = \nabla^2 F(x_t^s) - U_t^s, \quad (81) $$
where $\tilde{v}_0^s = g^s$ and $U_0^s = H^s$. Before we state the proof, we present some technical lemmas that are useful in our analysis. First, we give a sharp bound on $\mu(x_{t+1}^s)$. A crucial observation is that we can bound the gradient norm $\|\nabla F(x_{t+1}^s)\|_2$ and the smallest Hessian eigenvalue $\lambda_{\min}(\nabla^2 F(x_{t+1}^s))$ in terms of $\|h_t^s\|_2$, $\|\tilde{v}\|_2$ and $\|\tilde{U}\|_2$ defined in (81).

Lemma 35 Under the same assumptions as in Theorem 18, let $h_t^s$, $x_{t+1}^s$, $M_{s,t}$ be the variables defined in Algorithm 2. Then we have
$$ \mu(x_{t+1}^s) \le 9C_M^{3/2}\Big(M_{s,t}^{3/2}\|h_t^s\|_2^3 + \|\tilde{v}\|_2^{3/2} + M_{s,t}^{-3/2}\|\tilde{U}\|_2^3\Big), \quad (82) $$
where $C_M$ is the parameter in $M_{s,t} = C_M L_2$.

Lemma 35 suggests that to bound our target $\mathbb{E}\mu(x_{t+1}^s)$, we only need to focus on $\mathbb{E}\|h_t^s\|_2^3$, $\mathbb{E}\|\tilde{v}\|_2^{3/2}$ and $\mathbb{E}\|\tilde{U}\|_2^3$. The next lemma upper bounds $F(x_{t+1}^s) - F(x_t^s)$ in terms of $\tilde{v}$, $\tilde{U}$ and $h_t^s$, which can be straightforwardly derived from the Hessian Lipschitz condition.
Lemma 36 Under the same assumptions as in Theorem 18, let $h_t^s$, $x_t^s$, $x_{t+1}^s$, $M_{s,t}$ be the variables defined in Algorithm 2. Then we have the following result:
$$ F(x_{t+1}^s) \le F(x_t^s) - \frac{M_{s,t}}{12}\|h_t^s\|_2^3 + C_1\Big(\|\tilde{v}\|_2^{3/2}/M_{s,t}^{1/2} + \|\tilde{U}\|_2^3/M_{s,t}^2\Big), \quad (83) $$
where $C_1 = 200$.

We can also bound $\|x_{t+1}^s - \hat{x}^s\|_2^3$ in terms of $\|x_t^s - \hat{x}^s\|_2^3$ as follows.

Lemma 37 Under the same assumptions as in Theorem 18, let $h_t^s$, $x_t^s$, $x_{t+1}^s$, $M_{s,t}$ be the variables defined in Algorithm 2. Then we have the following result:
$$ \|x_{t+1}^s - \hat{x}^s\|_2^3 \le \big(1 + 3/n^{1/3}\big)\|x_t^s - \hat{x}^s\|_2^3 + 2n^{2/3}\|h_t^s\|_2^3. \quad (84) $$

Finally, we bound the variances of $\tilde{v}_t^s$ and $U_t^s$ as follows.

Lemma 38 Under the same assumptions as in Theorem 18, let $x_t^s$, $\tilde{v}_t^s$ and $\hat{x}^s$ be the iterates defined in Algorithm 2. Then we have
$$ \mathbb{E}_{\tilde{v}_t^s}\|\tilde{v}\|_2^{3/2} \le \frac{3L_1^{3/2}}{D_g^{3/4}}\|x_t^s - \hat{x}^s\|_2^3, $$
where $D_g$ is the batch size parameter defined in (21) and $\mathbb{E}_{\tilde{v}_t^s}$ takes expectation over $\tilde{v}_t^s$.

Lemma 39 Under the same assumptions as in Theorem 18, let $x_t^s$, $U_t^s$ and $\hat{x}^s$ be the iterates defined in Algorithm 2. If the batch size of the Hessian satisfies $B_h > 400\log d$, we have
$$ \mathbb{E}_{U_t^s}\|\tilde{U}\|_2^3 \le \frac{C_h L_2^3}{B_h^{3/2}}\|x_t^s - \hat{x}^s\|_2^3, $$
where $C_h = 1200(\log d)^{3/2}$ and $\mathbb{E}_{U_t^s}$ only takes expectation over $U_t^s$.

Lemmas 38 and 39 suggest that, with a careful selection of batch sizes, both $\mathbb{E}\|\tilde{v}\|_2^{3/2}$ and $\mathbb{E}\|\tilde{U}\|_2^3$ can be bounded by $\mathbb{E}\|x_t^s - \hat{x}^s\|_2^3$; they play similar roles to Lemmas 25 and 26 in the analysis of SVRC.

Proof [Proof of Theorem 18] We first define $R_t^s = \mathbb{E}\big[F(x_t^s) + c_{s,t}\|x_t^s - \hat{x}^s\|_2^3\big]$, where $c_{s,t}$ is a number sequence defined as follows: $c_{s,T} = 0$ and, for $s = 1, \ldots, S$ and $t = 0, \ldots, T-1$,
$$ c_{s,t} = c_{s,t+1}\big(1 + 3/n^{1/3}\big) + M_{s,t}/(500n). \quad (85) $$
Combining Lemmas 36 and 37, we can upper bound $F(x_{t+1}^s) + c_{s,t+1}\|x_{t+1}^s - \hat{x}^s\|_2^3$. Specifically, (83) $+\, c_{s,t+1}\times$(84) yields
$$ F(x_{t+1}^s) + c_{s,t+1}\|x_{t+1}^s - \hat{x}^s\|_2^3 \le F(x_t^s) - \|h_t^s\|_2^3\big(M_{s,t}/12 - 2n^{2/3}c_{s,t+1}\big) + C_1\Big(\|\tilde{v}\|_2^{3/2}/M_{s,t}^{1/2} + \|\tilde{U}\|_2^3/M_{s,t}^2\Big) + c_{s,t+1}\big(1 + 3/n^{1/3}\big)\|x_t^s - \hat{x}^s\|_2^3. $$
(86)

By Lemma 35, multiplying (82) by $\big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}$ and adding it to (86) yields
$$ F(x_{t+1}^s) + c_{s,t+1}\|x_{t+1}^s - \hat{x}^s\|_2^3 + \big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}\mu(x_{t+1}^s) \le F(x_t^s) + c_{s,t+1}\big(1 + 3/n^{1/3}\big)\|x_t^s - \hat{x}^s\|_2^3 + \frac{24C_1 + 1}{24\,M_{s,t}^{1/2}}\|\tilde{v}\|_2^{3/2} + \frac{24C_1 + 1}{24\,M_{s,t}^2}\|\tilde{U}\|_2^3, \quad (87) $$
where we use the fact that $M_{s,t}/24 - 2n^{2/3}c_{s,t+1} > 0$, which can be easily verified from the definition in (85) and an argument similar to that in Appendix B.7. By Lemmas 38 and 39 we have
$$ \mathbb{E}\|\tilde{v}\|_2^{3/2} \le \frac{3L_1^{3/2}}{D_g^{3/4}}\,\mathbb{E}\|x_t^s - \hat{x}^s\|_2^3, \qquad \mathbb{E}\|\tilde{U}\|_2^3 \le \frac{C_h L_2^3}{B_h^{3/2}}\,\mathbb{E}\|x_t^s - \hat{x}^s\|_2^3, $$
where $D_g$, $B_h$ are the batch size parameters and $C_h = 1200(\log d)^{3/2}$. Taking expectations in (87), we obtain
$$ \mathbb{E}\Big[F(x_{t+1}^s) + c_{s,t+1}\|x_{t+1}^s - \hat{x}^s\|_2^3 + \big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}\mu(x_{t+1}^s)\Big] \le \mathbb{E}\Big[F(x_t^s) + \|x_t^s - \hat{x}^s\|_2^3\Big(c_{s,t+1}\big(1 + 3/n^{1/3}\big) + \frac{6L_1^{3/2}C_1}{D_g^{3/4}M_{s,t}^{1/2}} + \frac{2L_2^3 C_h C_1}{B_h^{3/2}M_{s,t}^2}\Big)\Big] \le \mathbb{E}\Big[F(x_t^s) + \|x_t^s - \hat{x}^s\|_2^3\Big(c_{s,t+1}\big(1 + 3/n^{1/3}\big) + M_{s,t}/(500n)\Big)\Big], \quad (88) $$
where in the last inequality we use the facts that $M_{s,t} = C_M L_2$, $D_g \ge C_1 n^{4/3}L_1^2/(L_2^2 C_M^{4/3})$ and $B_h \ge 144(C_1 C_h)^{2/3} n^{2/3}/C_M^2$. By the definition of $c_{s,t}$ in (85), (88) is equivalent to
$$ \big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}\,\mathbb{E}\big[\mu(x_{t+1}^s)\big] \le R_t^s - R_{t+1}^s. \quad (89) $$
Summing (89) over $t = 0$ to $T - 1$ yields
$$ \sum_{t=0}^{T-1}\big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}\,\mathbb{E}\big[\mu(x_{t+1}^s)\big] \le R_0^s - R_T^s. $$
Substituting $x_0^s = \hat{x}^s$, $x_T^s = \hat{x}^{s+1}$ and $c_{s,T} = 0$ into the inequality above, we get
$$ \sum_{t=0}^{T-1}\big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}\,\mathbb{E}\big[\mu(x_{t+1}^s)\big] \le \mathbb{E} F(\hat{x}^s) - \mathbb{E} F(\hat{x}^{s+1}). $$
Then, summing over $s = 1$ to $S$, we have
$$ \sum_{s=1}^S\sum_{t=0}^{T-1}\big(216\,C_M^{3/2}M_{s,t}^{1/2}\big)^{-1}\,\mathbb{E}\big[\mu(x_{t+1}^s)\big] \le \mathbb{E} F(\hat{x}^0) - \mathbb{E} F(\hat{x}^{S+1}) \le F(\hat{x}^0) - \inf_{x\in\mathbb{R}^d} F(x) = \Delta_F. $$
Plugging $M_{s,t} = C_M L_2$ into the above inequality, we have
$$ \sum_{s=1}^S\sum_{t=0}^{T-1}\mathbb{E}\big[\mu(x_{t+1}^s)\big] \le 216\,C_M^2 L_2^{1/2}\,\Delta_F. \quad (90) $$
In Algorithm 2, we choose $x_{\mathrm{out}}$ uniformly at random over $s$ and $t$; thus (90) gives
$$ \mathbb{E}\big[\mu(x_{\mathrm{out}})\big] \le \frac{216\,C_M^2 L_2^{1/2}\,\Delta_F}{ST}. $$
This completes the proof.

D.2. Proof of Corollary 20

Now we provide the proof of our corollary on the sample complexity of Lite-SVRC.
Proof [Proof of Corollary 20] By the definition in (8) and the result of Theorem 18, to find an $(\epsilon, \sqrt{L_2\epsilon})$-approximate local minimum we only need to ensure $216\,C_M^2 L_2^{1/2}\Delta_F/(ST) \le \epsilon^{3/2}$. Setting $T = n^{1/3}$, it suffices to let $S = O\big(\max\{L_2^{1/2}\Delta_F/(\epsilon^{3/2}n^{1/3}),\, 1\}\big)$. We need to sample $n$ Hessians at the beginning of each inner loop, and within each inner loop we need to sample $B_h = \widetilde{O}(n^{2/3})$ Hessian matrices per iteration. Therefore, the total Hessian sample complexity of Algorithm 2 is
$$ S\cdot n + S\cdot T\cdot B_h = \widetilde{O}\big(n + n^{2/3}\,\Delta_F L_2^{1/2}/\epsilon^{3/2}\big). $$
The total number of CSO calls is $O(ST) = O\big(\Delta_F L_2^{1/2}/\epsilon^{3/2}\big)$.

D.3. Proofs of Theorem 22 and Corollary 23

The proofs of the convergence of Lite-SVRC with an inexact cubic subproblem solver, as defined in Section 5, are almost the same as those for the convergence of SVRC. More specifically, the proof of Theorem 22 is the same as that of Theorem 13, and the proof of Corollary 23 is the same as that of Corollary 15. Therefore, we omit the proofs here for simplicity.

Appendix E. Proof of Technical Lemmas

In this section, we prove the technical lemmas used in the proof of Theorem 18.

E.1. Proof of Lemma 35

Proof [Proof of Lemma 35] Recall the definition of $\mu(\cdot)$ in (8). We need to upper bound $\|\nabla F(x_{t+1}^s)\|_2^{3/2}$ and $-L_2^{-3/2}\lambda_{\min}^3\big(\nabla^2 F(x_{t+1}^s)\big)$, which can be achieved by applying Lemmas 33 and 34 since $x_{t+1}^s = x_t^s + h_t^s$. To bound $\|\nabla F(x_{t+1}^s)\|_2^{3/2}$, we apply Lemma 33:
$$ \|\nabla F(x_{t+1}^s)\|_2^{3/2} \le \Big[M_{s,t}\|h_t^s\|_2^2 + \|\nabla m_t^s(h_t^s)\|_2 + \|\nabla F(x_t^s) - \tilde{v}_t^s\|_2 + \frac{1}{M_{s,t}}\|\nabla^2 F(x_t^s) - U_t^s\|_2^2\Big]^{3/2} $$
$$ \le 2\Big[M_{s,t}^{3/2}\|h_t^s\|_2^3 + \|\nabla F(x_t^s) - \tilde{v}_t^s\|_2^{3/2} + \|\nabla m_t^s(h_t^s)\|_2^{3/2} + M_{s,t}^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3\Big] $$
$$ \le 2\Big[M_{s,t}^{3/2}\|h_t^s\|_2^3 + \|\nabla F(x_t^s) - \tilde{v}_t^s\|_2^{3/2} + M_{s,t}^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3\Big], $$
where the second inequality holds due to the basic inequality $(a+b+c+d)^{3/2} \le 2(a^{3/2} + b^{3/2} + c^{3/2} + d^{3/2})$ and in the last inequality we use the fact that $\nabla m_t^s(h_t^s) = 0$. Next we bound $-L_2^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_{t+1}^s)\big)^3$.
Applying Lemma 34 with $h = h_t^s$ (so that the $\big|\,\|h\|_2 - \|h_t^s\|_2\big|$ term vanishes), we have
$$ -L_2^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_{t+1}^s)\big)^3 = -C_M^{3/2}M_{s,t}^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_{t+1}^s)\big)^3 \le C_M^{3/2}M_{s,t}^{-3/2}\Big[M_{s,t}\|h_t^s\|_2 + \|\nabla^2 F(x_t^s) - U_t^s\|_2\Big]^3 \le 9C_M^{3/2}\Big[M_{s,t}^{3/2}\|h_t^s\|_2^3 + M_{s,t}^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3\Big], $$
where the second inequality holds due to $(a+b+c)^3 \le 9(a^3+b^3+c^3)$. Thus, we have
$$ \mu(x_{t+1}^s) = \max\Big\{\|\nabla F(x_{t+1}^s)\|_2^{3/2},\; -L_2^{-3/2}\lambda_{\min}\big(\nabla^2 F(x_{t+1}^s)\big)^3\Big\} \le 9C_M^{3/2}\Big[M_{s,t}^{3/2}\|h_t^s\|_2^3 + \|\nabla F(x_t^s) - \tilde{v}_t^s\|_2^{3/2} + M_{s,t}^{-3/2}\|\nabla^2 F(x_t^s) - U_t^s\|_2^3\Big], $$
which completes the proof.

E.2. Proof of Lemma 36

Proof [Proof of Lemma 36] Note that $x_{t+1}^s = x_t^s + h_t^s$. Applying Lemma 40, we have
$$ F(x_{t+1}^s) = F(x_t^s + h_t^s) \le F(x_t^s) + \langle\nabla F(x_t^s), h_t^s\rangle + \frac{1}{2}\langle\nabla^2 F(x_t^s)h_t^s, h_t^s\rangle + \frac{L_2}{6}\|h_t^s\|_2^3 $$
$$ = F(x_t^s) + \Big[\langle\tilde{v}_t^s, h_t^s\rangle + \frac{1}{2}\langle U_t^s h_t^s, h_t^s\rangle + \frac{M_{s,t}}{6}\|h_t^s\|_2^3\Big] + \langle\tilde{v}, h_t^s\rangle + \frac{1}{2}\langle\tilde{U}h_t^s, h_t^s\rangle + \frac{L_2 - M_{s,t}}{6}\|h_t^s\|_2^3. \quad (91) $$
Based on Lemma 24 for the subproblem in cubic regularization, we have
$$ \langle\tilde{v}_t^s, h_t^s\rangle + \frac{1}{2}\langle U_t^s h_t^s, h_t^s\rangle + \frac{M_{s,t}}{6}\|h_t^s\|_2^3 \le -\frac{M_{s,t}}{12}\|h_t^s\|_2^3. \quad (92) $$
Meanwhile, we have the following two bounds on $\langle\tilde{v}, h_t^s\rangle$ and $\langle\tilde{U}h_t^s, h_t^s\rangle$ by Young's inequality:
$$ \langle\tilde{v}, h_t^s\rangle \le \big(C_4\|\tilde{v}\|_2\big)\frac{\|h_t^s\|_2}{C_4} \le \frac{2}{3}C_4^{3/2}\|\tilde{v}\|_2^{3/2} + \frac{1}{3C_4^3}\|h_t^s\|_2^3, \quad (93) $$
$$ \langle\tilde{U}h_t^s, h_t^s\rangle \le \big(C_5^2\|\tilde{U}\|_2\big)\frac{\|h_t^s\|_2^2}{C_5^2} \le \frac{1}{3}C_5^6\|\tilde{U}\|_2^3 + \frac{2}{3C_5^3}\|h_t^s\|_2^3. \quad (94) $$
We set $C_4 = C_5 = (18/M_{s,t})^{1/3}$. Finally, because $L_2 \le M_{s,t}/2$, we have
$$ \frac{L_2 - M_{s,t}}{6}\|h_t^s\|_2^3 \le -\frac{M_{s,t}}{12}\|h_t^s\|_2^3. \quad (95) $$
Substituting (92), (93), (94) and (95) into (91), we obtain the final result:
$$ F(x_{t+1}^s) \le F(x_t^s) - \frac{M_{s,t}}{12}\|h_t^s\|_2^3 + C_1\Big(\frac{\|\tilde{v}\|_2^{3/2}}{M_{s,t}^{1/2}} + \frac{\|\tilde{U}\|_2^3}{M_{s,t}^2}\Big), $$
where $C_1 = 200$.

E.3. Proof of Lemma 37

Proof [Proof of Lemma 37] Note that $x_{t+1}^s = x_t^s + h_t^s$. Then we have
$$ \|x_{t+1}^s - \hat{x}^s\|_2^3 \le \big(\|x_t^s - \hat{x}^s\|_2 + \|h_t^s\|_2\big)^3 = \|x_t^s - \hat{x}^s\|_2^3 + \|h_t^s\|_2^3 + 3\|x_t^s - \hat{x}^s\|_2^2\|h_t^s\|_2 + 3\|x_t^s - \hat{x}^s\|_2\|h_t^s\|_2^2. \quad (97) $$
The inequality holds due to the triangle inequality.
Next, we bound $\|x_t^s - \hat{x}^s\|_2^2\|h_t^s\|_2$ and $\|x_t^s - \hat{x}^s\|_2\|h_t^s\|_2^2$ by Young's inequality:
$$ \|x_t^s - \hat{x}^s\|_2^2\|h_t^s\|_2 = \frac{\|x_t^s - \hat{x}^s\|_2^2}{n^{2/9}}\cdot n^{2/9}\|h_t^s\|_2 \le \frac{2\|x_t^s - \hat{x}^s\|_2^3}{3n^{1/3}} + \frac{n^{2/3}\|h_t^s\|_2^3}{3}, \quad (98) $$
$$ \|x_t^s - \hat{x}^s\|_2\|h_t^s\|_2^2 = \frac{\|x_t^s - \hat{x}^s\|_2}{n^{1/9}}\cdot n^{1/9}\|h_t^s\|_2^2 \le \frac{\|x_t^s - \hat{x}^s\|_2^3}{3n^{1/3}} + \frac{2n^{1/6}\|h_t^s\|_2^3}{3}. \quad (99) $$
Substituting (98) and (99) into (97), we have the result:
$$ \|x_{t+1}^s - \hat{x}^s\|_2^3 \le \big(1 + 2/n^{1/3} + 1/n^{1/3}\big)\|x_t^s - \hat{x}^s\|_2^3 + \big(1 + 2n^{1/6} + n^{2/3}\big)\|h_t^s\|_2^3 \le \big(1 + 3/n^{1/3}\big)\|x_t^s - \hat{x}^s\|_2^3 + 2n^{2/3}\|h_t^s\|_2^3, $$
which completes the proof.

E.4. Proof of Lemma 38

Proof [Proof of Lemma 38] This proof is essentially the same as that of Lemma 25 in Appendix B.2, with the semi-stochastic gradient $\tilde{v}_t^s$ defined in (19) in place of the $v_t^s$ used in Lemma 25, which leads to the following inequality analogous to (63):
$$ \mathbb{E}_{\tilde{v}_t^s}\|\nabla F(x_t^s) - \tilde{v}_t^s\|_2^{3/2} \le \frac{1}{B_{g;s,t}^{3/4}}\Big(\mathbb{E}\big\|\nabla f_{i_t}(x_t^s) - \nabla f_{i_t}(\hat{x}^s) + g^s - \nabla F(x_t^s)\big\|_2^2\Big)^{3/4}. \quad (100) $$
Next we bound $\|\nabla f_{i_t}(x_t^s) - \nabla f_{i_t}(\hat{x}^s) + g^s - \nabla F(x_t^s)\|_2$. By Assumption 17, we have
$$ \|\nabla f_{i_t}(x_t^s) - \nabla f_{i_t}(\hat{x}^s) + g^s - \nabla F(x_t^s)\|_2 \le \|\nabla f_{i_t}(x_t^s) - \nabla f_{i_t}(\hat{x}^s)\|_2 + \|\nabla F(\hat{x}^s) - \nabla F(x_t^s)\|_2 \le 2L_1\|x_t^s - \hat{x}^s\|_2. \quad (101) $$
Finally, substituting (101) and $B_{g;s,t} = D_g/\|x_t^s - \hat{x}^s\|_2^2$ into (100), we have
$$ \mathbb{E}_{\tilde{v}_t^s}\|\nabla F(x_t^s) - \tilde{v}_t^s\|_2^{3/2} \le \frac{\|x_t^s - \hat{x}^s\|_2^{3/2}}{D_g^{3/4}}\big(4L_1^2\|x_t^s - \hat{x}^s\|_2^2\big)^{3/4} = \frac{2^{3/2}L_1^{3/2}}{D_g^{3/4}}\|x_t^s - \hat{x}^s\|_2^3. $$
This completes the proof.

E.5. Proof of Lemma 39

Proof [Proof of Lemma 39] The proof of Lemma 39 is the same as that of Lemma 26 in Appendix B.3. Thus we omit it for simplicity.

Appendix F. Additional Lemmas and Propositions

It is obvious that Assumption 1 implies the Hessian Lipschitz property of $F$, which yields the following standard lemma:

Lemma 40 (Nesterov and Polyak, 2006) Suppose $F$ is $L_2$-Hessian Lipschitz for some constant $L_2 > 0$. Then we have
$$ \|\nabla^2 F(x) - \nabla^2 F(y)\|_2 \le L_2\|x - y\|_2, $$
$$ F(x + h) \le F(x) + \langle\nabla F(x), h\rangle + \frac{1}{2}\langle\nabla^2 F(x)h, h\rangle + \frac{L_2}{6}\|h\|_2^3, $$
$$ \|\nabla F(x + h) - \nabla F(x) - \nabla^2 F(x)h\|_2 \le \frac{L_2}{2}\|h\|_2^2. $$

Lemma 41 (Lyapunov's Inequality) (Durrett, 2010) For a random variable $X$, when $0 < s < t$, it holds that $(\mathbb{E}|X|^s)^{1/s} \le (\mathbb{E}|X|^t)^{1/t}$.
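Lyapunov's inequality is the moment bound used repeatedly above (e.g., in the proofs of Lemmas 31 and 32). As a quick numerical sanity check (an illustrative sketch with an arbitrary sampling distribution, not part of the analysis), the ordering of empirical moment norms can be verified in a few lines; note that the inequality even holds deterministically for the empirical distribution of any finite sample:

```python
import random

# Monte Carlo check of Lyapunov's inequality: for 0 < s < t,
# (E|X|^s)^(1/s) <= (E|X|^t)^(1/t).  The Gaussian below is an arbitrary
# illustrative choice; the inequality holds for any X with finite t-th moment.
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]

def moment_norm(xs, p):
    """Empirical (E|X|^p)^(1/p)."""
    return (sum(abs(x) ** p for x in xs) / len(xs)) ** (1.0 / p)

# s = 3/2 and t = 2 are exactly the exponents used in the proof of Lemma 31
assert moment_norm(samples, 1.5) <= moment_norm(samples, 2.0)
assert moment_norm(samples, 1.0) <= moment_norm(samples, 3.0)
```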
The following two propositions are matrix concentration inequalities.

Proposition 42 (Matrix Khintchine Inequality) (Mackey et al., 2014) Suppose $r \ge 2$. Consider a finite sequence $\{A_i,\, 1 \le i \le N\}$ of deterministic, self-adjoint matrices. Then
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N \xi_i A_i\Big\|_{S_r}^r\Big)^{1/r} \le \sqrt{r}\,\Big\|\Big(\sum_{i=1}^N A_i^2\Big)^{1/2}\Big\|_{S_r}, $$
where the sequence $\{\xi_i\}$ consists of independent Rademacher random variables.

Proposition 43 (Chen et al., 2012) Let $q \ge 1$, and fix $r \ge \max\{q, 2\log p\}$. Consider independent, random, positive semi-definite matrices $W_1, \ldots, W_N$ of dimension $p\times p$. Then
$$ \Big(\mathbb{E}\Big\|\sum_{i=1}^N W_i\Big\|_2^q\Big)^{1/q} \le \Big(\Big\|\sum_{i=1}^N \mathbb{E} W_i\Big\|_2^{1/2} + 2\sqrt{er}\,\big(\mathbb{E}\max_i\|W_i\|_2^q\big)^{1/(2q)}\Big)^2. $$

References

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2676–2687, 2018.

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.

Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3720–3730, 2018.

Albert A Bennett. Newton's method in general analysis. Proceedings of the National Academy of Sciences, 2(10):592–598, 1916.

Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999. ISBN 9781886529144.

Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust region method for nonconvex optimization. arXiv preprint arXiv:1609.07428, 2016.

Yair Carmon and John C Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.

Yair Carmon and John C Duchi. Analysis of Krylov subspace solutions of regularized nonconvex quadratic problems.
In Advances in Neural Information Processing Systems, pages 10705–10715, 2018.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Trust-region and other regularisations of linear least-squares problems. BIT Numerical Mathematics, 49(1):21–53, 2009.

Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011a.

Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011b.

Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012.

Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM Journal on Optimization, 23(3):1553–1574, 2013.

Richard Y Chen, Alex Gittens, and Joel A Tropp. The masked sample covariance estimator: an analysis using matrix concentration inequalities. Information and Inference: A Journal of the IMA, 1(1):2–20, 2012.

Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust-Region Methods. Society for Industrial and Applied Mathematics, 2000. ISBN 0-89871-460-5.

Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of $O(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 686–696, 2018.

Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

Robert Gower, Nicolas Le Roux, and Francis Bach. Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. In International Conference on Artificial Intelligence and Statistics, pages 707–715, 2018.

Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732, 2017.

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference on Learning Theory, volume 75, pages 1042–1085. PMLR, 2018.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

Jonas Moritz Kohler and Aurelien Lucchi.
Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1895–1904. PMLR, 2017.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Aurelien Lucchi, Brian McWilliams, and Thomas Hofmann. A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316, 2015.

Lester Mackey, Michael I Jordan, Richard Y Chen, Brendan Farrell, and Joel A Tropp. Matrix concentration inequalities via the method of exchangeable pairs. The Annals of Probability, 42(3):906–945, 2014.

José Mario Martínez and Marcos Raydan. Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. Journal of Global Optimization, 68(2):367–385, 2017.

Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.

Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

Yurii Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703, 2017.

Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016a.

Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 1971–1977. IEEE, 2016b.

Anton Rodomanov and Dmitry Kropotov.
A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In International Conference on Machine Learning, pages 2597–2605, 2016.

Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018. doi: 10.1137/17M1134329.

Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747–754, 2016.

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2904–2913, 2018.

Joel A Tropp. The expected norm of a sum of independent random matrices: An elementary approach. In High Dimensional Probability VII, pages 173–202. Springer, 2016.

Joel A Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

H. Wai, W. Shi, A. Nedic, and A. Scaglione. Curvature-aided incremental aggregated gradient method. In 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 526–532, 2017. doi: 10.1109/ALLERTON.2017.8262782.

Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. Sample complexity of stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372v1, 2018a.

Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. Stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372v2, 2018b.

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction.
SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3125–3136, 2018a.

Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. 2016.

Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827, 2017a.

Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017b.

Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5531–5541, 2018b.

Chun Yu and Weixin Yao. Robust linear regression: A review and comparison. Communications in Statistics-Simulation and Computation, 46(8):6261–6282, 2017.

Yaodong Yu, Difan Zou, and Quanquan Gu. Saving gradient and negative curvature computations: Finding local minima more efficiently. arXiv preprint arXiv:1712.03950, 2017.

Yaodong Yu, Pan Xu, and Quanquan Gu. Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima. In Advances in Neural Information Processing Systems, pages 4526–4536, 2018.

Junyu Zhang, Lin Xiao, and Shuzhong Zhang. Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization. arXiv preprint arXiv:1811.11637, 2018.

Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pages 1980–2022, 2017.

Dongruo Zhou, Pan Xu, and Quanquan Gu.
Stochastic variance-reduced cubic regularized Newton methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5990–5999. PMLR, 2018a.

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3922–3933, 2018b.