# Escaping Saddle-Points Faster under Interpolation-like Conditions

Abhishek Roy, Department of Statistics, University of California, Davis. abroy@ucdavis.edu
Krishnakumar Balasubramanian, Department of Statistics, University of California, Davis. kbala@ucdavis.edu
Saeed Ghadimi, Department of Management Sciences, University of Waterloo. sghadimi@uwaterloo.ca
Prasant Mohapatra, Department of Computer Science, University of California, Davis. pmohapatra@ucdavis.edu

Abstract

In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $O(1/\epsilon^2)$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that its oracle complexity to reach an $\epsilon$-local-minimizer is $O(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $O(1/\epsilon^{1.5})$ of the deterministic Cubic-Regularized Newton method. It seems that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.

1 Introduction

Over-parametrized models, for which the training stage involves solving nonconvex optimization problems, are common in modern machine learning; a canonical example of such a model is the deep neural network. Such over-parametrized models have several interesting statistical and computational properties. On the statistical side, over-parametrized models are highly expressive and are capable of nearly perfectly interpolating the training data. Furthermore, despite the highly nonconvex training landscape, most local minimizers have good generalization properties under regularity conditions; see for example [39, 27, 22, 21] for empirical and theoretical details. We emphasize here that over-parametrization plays an important role for both phenomena to occur. Furthermore, it is to be noted that not all critical points exhibit such nice generalization properties. Hence, from a computational perspective, designing algorithms that do not get trapped in saddle-points, and converge to local minimizers during the training process, becomes extremely important [12]. Indeed, there has recently been extensive research in the machine learning and optimization communities on designing algorithms that escape saddle-points and converge to local minimizers. The authors of [30] proved the folklore result that, in the deterministic setting and for sufficiently regular functions, the vanilla gradient descent algorithm converges almost surely to local minimizers, even when initialized randomly; see also [29]. However, [30, 29] only provide asymptotic results, which have limited consequence for practice.
That said, it has been shown by the authors of [15] that gradient descent might take exponential time to escape saddle-points in several cases. In this context, injecting artificial noise in each step of the gradient descent algorithm has been empirically observed to help escape saddle-points. Several works, for example [24, 26], showed that such perturbed gradient descent algorithms escape saddles faster in a non-asymptotic sense. Such algorithms are routinely used in training highly over-parametrized deep neural networks and other over-parametrized nonconvex machine learning models. However, existing theoretical analyses of such algorithms fail to take advantage of the interpolation-like properties enjoyed by over-parametrized machine learning models, and are hence conservative. Specifically, there is a gap between the assumptions used in the theoretical analysis of algorithms that escape saddle-points and the assumptions commonly satisfied by the over-parametrized models that are trained by those algorithms.

In this work, we consider nonconvex stochastic optimization problems of the following form:

$$\min_{x \in \mathbb{R}^d} \;\{ f(x) := \mathbb{E}_\xi[F(x,\xi)] \}, \qquad (1)$$

where $f: \mathbb{R}^d \to \mathbb{R}$ is a nonconvex function satisfying certain regularity properties described next, and $\xi$ is a random variable characterizing the stochasticity in the problem. We assume throughout this work that the function $f$ has a lower bound $f^*$. We analyze two standard algorithms that escape saddle-points, namely perturbed stochastic gradient descent (PSGD) and the stochastic cubic-regularized Newton's method (SCRN), for problems of the form (1). We show that, under interpolation-like assumptions on the stochastic gradient (see Section 2 for exact definitions), both PSGD and SCRN escape saddle-points and converge to local minimizers much faster. In particular, we show that in order for the PSGD algorithm to escape saddle-points and find an $\epsilon$-local-minimizer, the number of calls to the stochastic first-order oracle is of the order $O(1/\epsilon^2)$ (here and throughout, $O(\cdot)$ hides logarithmic factors), which matches the number of calls when the objective being optimized is deterministic, i.e., when the exact gradient can be obtained in each step of the algorithm (in the deterministic setting it is also possible to obtain $O(1/\epsilon^{1.75})$ complexity using an accelerated method; see [26]). As a point of comparison, [19, 25] showed that without the interpolation-like conditions that we make, PSGD requires $O(1/\epsilon^4)$ calls to the stochastic gradient oracle. Furthermore, [17] analyzed a version of PSGD with averaging and improved the oracle complexity to $O(1/\epsilon^{3.5})$. It is also worth noting that, with a mean-square Lipschitz gradient assumption on the objective function being optimized, and using complicated variance reduction techniques, the authors of [16] showed that it is possible for a double-loop version of PSGD to converge to $\epsilon$-local minimizers with $O(1/\epsilon^3)$ calls to the stochastic first-order oracle. However, recent empirical investigations suggest that variance reduction techniques are inefficient for nonconvex deep learning problems [13, 43]. Our results, on the other hand, exploit the structure naturally present in over-parametrized models and obtain the best-known oracle complexity for escaping saddle-points using only the vanilla version of the PSGD algorithm (which is oftentimes the version used in practice). We also analyze the corresponding zeroth-order version of the PSGD algorithm, in which setting we are able to observe only potentially noisy evaluations of the function being optimized.
In this setting, we show that the PSGD algorithm requires $O(d^{1.5}/\epsilon^{4.5})$ calls to the stochastic zeroth-order oracle. In this context, we are not aware of a directly comparable result. The recent works of [4, 18] provided results for bounded functions in the zeroth-order deterministic setting, where one obtains exact function values; such a setting, though, is highly unrealistic in practice.

Next, we consider the question of whether using second-order methods helps reduce the number of oracle calls. Indeed, in the deterministic setting, it is well-known that second-order information helps escape saddle-points at a much faster rate. For example, [37] proposed the cubic-regularized Newton's method and showed that it requires only $O(1/\epsilon^{1.5})$ calls to the gradient and Hessian oracles; see also [8, 11] for related results. Correspondingly, in the stochastic setting, [47] showed that the SCRN method requires $O(1/\epsilon^{3.5})$ calls, which is better than that of PSGD (without further assumptions). In this work, we show that under interpolation-like assumptions on (only) the stochastic gradient, the SCRN method requires only $O(1/\epsilon^{2.5})$ calls. In contrast to the PSGD setting, SCRN requires more calls than its corresponding deterministic counterpart. However, it should be noted that the complexity of SCRN is still better than that of PSGD, with or without interpolation-like assumptions. We believe that without further interpolation-like assumptions also on the stochastic Hessians, the oracle complexity of SCRN cannot be improved, in particular to match the deterministic rate of $O(1/\epsilon^{1.5})$ (see also Remark 6). We also provide similarly improved results for a zeroth-order version of the SCRN method, thereby improving upon the results of [5]. All of our results, along with a comparison to existing results in the literature and the corresponding assumptions required, are summarized in Table 1. We conclude this section with a discussion of other related works.

| | With SGC (this paper): ZO | With SGC (this paper): HO | Without SGC: ZO | Without SGC: HO | Deterministic: HO |
|---|---|---|---|---|---|
| PSGD | Theorem 3.1 | $O(1/\epsilon^2)$, Theorem 3.1 | Theorem 3.2 | $O(1/\epsilon^4)$ [19, 25] | $O(1/\epsilon^2)$ |
| Cubic Newton | Theorem 4.1 | $O(1/\epsilon^{2.5})$, Theorem 4.1 | [5] | $O(1/\epsilon^{3.5})$ [47] | $O(1/\epsilon^{1.5})$ [37] |

Table 1: Oracle complexities of perturbed stochastic gradient descent (PSGD) and stochastic cubic-regularized Newton's method (SCRN). ZO corresponds to the number of calls to the zeroth-order oracle and HO corresponds to the number of calls to first- or second-order oracles. The results for PSGD and SCRN are given in high probability and in expectation, respectively. The results in the deterministic case correspond to perturbed gradient descent and cubic-regularized Newton's method (without stochastic gradients).

More Related Works. In the interpolation regime, [33] recently showed that the mini-batch stochastic gradient descent (SGD) algorithm enjoys exponential rates of convergence for unconstrained strongly-convex optimization problems; see also [45, 36] for related earlier work. For the non-convex setting, [6] analyze SGD for non-convex functions satisfying the Polyak-Lojasiewicz (PL) inequality [41] under the interpolation condition and show that SGD can achieve a linear convergence rate. Recently, [48] introduced a more practical form of the interpolation condition, and proved that constant step-size SGD can obtain the optimal convergence rate for strongly-convex and smooth convex functions.
They also show the first result in the non-convex setting that constant step-size SGD can obtain the deterministic rate in the interpolation regime for converging to a first-order stationary solution. Subsequently, [34] investigated the regularized subsampled Newton method (R-SSN) and the stochastic BFGS algorithm under interpolation-like conditions. We emphasize that all of the above works predominantly consider the case of convex objective functions; the only exception is [48], which considers the nonconvex case but only studies convergence to first-order stationary solutions. There have been several works on the oracle complexity of escaping saddle-points in the finite-sum setting; we refer the interested reader to [1, 51, 50, 49] and references therein for such results. We emphasize that a majority of these works are based on complicated variance reduction techniques that increase the implementation complexity of such methods and make them less appealing in practice. There exist only a few works on escaping saddle-points for constrained optimization problems; see [32, 31, 40, 35] for more details.

We also briefly discuss the consequences of our results for deep neural network training and related works. Roughly speaking, there are now two potential explanations for the success of optimization methods for training deep neural networks [46]. The first explanation is based on landscape analysis. This involves two steps: showing that the optimization landscape has favorable geometry [28] (i.e., all local minima are (approximate) global minima under suitable regularity conditions), and then constructing optimization algorithms that can efficiently escape saddle-points. The second explanation is based on the NTK viewpoint; see, for example, [23, 9, 10, 3, 14, 52] for a partial overview. However, a majority of the results based on the NTK viewpoint are for polynomially (in depth and sample-size) large-width networks (indeed, [3] mention that their polynomial degrees are impractical). Our results in this paper are geared towards the former program.

2 Preliminaries

We now present the assumptions and definitions used throughout the paper; section-specific additional details are given in the respective sections. We use $\|\cdot\|$ and $\|\cdot\|_*$ to denote a norm and the corresponding dual norm on $\mathbb{R}^d$. We now describe the regularity conditions imposed on the objective function in (1) in this work.

Assumption 2.1 (Lipschitz Function) The function $F$ is $L$-Lipschitz, almost surely for any $\xi$, i.e., $|F(x,\xi) - F(y,\xi)| \le L\,\|x - y\|$. Here we assume $\|\cdot\| = \|\cdot\|_2$, unless specified explicitly.

Assumption 2.2 (Lipschitz Gradient) The function $F$ has Lipschitz continuous gradient, almost surely for any $\xi$, i.e., $\|\nabla F(x,\xi) - \nabla F(y,\xi)\|_* \le L_G\,\|x - y\|$, where $\|\cdot\|_*$ denotes the dual norm of $\|\cdot\|$. This also implies $|F(y,\xi) - F(x,\xi) - \nabla F(x,\xi)^\top (y - x)| \le \frac{L_G}{2}\|y - x\|^2$.

Assumption 2.3 (Lipschitz Hessian) The function $F$ has Lipschitz continuous Hessian, almost surely for any $\xi$, i.e., $\|\nabla^2 F(x,\xi) - \nabla^2 F(y,\xi)\| \le L_H\,\|x - y\|$.

Note that if Assumptions 2.1-2.3 hold for $F$, then they also hold for $f(\cdot) = \mathbb{E}_\xi[F(\cdot,\xi)]$; the converse is not true. For our higher-order results, we make the above assumptions only on $f(\cdot)$, which is a weaker assumption. In the interpolation regime, the stochastic gradients become small when the true gradient is small. The following condition, known as the Strong Growth Condition (SGC) [48], captures how fast the stochastic gradient goes to zero with respect to the true gradient.
Assumption 2.4 (SGC [48]) For any point $x \in \mathbb{R}^d$, the stochastic gradient satisfies $\mathbb{E}_\xi\big[\|\nabla F(x,\xi)\|^2\big] \le \rho\,\|\nabla f(x)\|^2$ for some $\rho \ge 1$ ($\rho = 1$ corresponds to the deterministic setting).

SGC controls the variance of the stochastic gradient in the above-mentioned way. Note in particular that when $\|\nabla f(x)\|^2 = 0$, under SGC we have $\|\nabla F(x,\xi)\|^2 = 0$ almost surely. This means that when the point $x$ is a stationary point of the function $f$, it is also a stationary point of the function $F$ almost surely. In the context of deep neural networks, the function $F$ corresponds to the loss on a training sample and the function $f$ corresponds to the population risk. Hence, the strong growth condition states that the deep neural network is capable of interpolating the training data almost surely. Such a phenomenon is observed in practice with deep neural networks, which provides strong motivation for using this assumption to analyze the performance of PSGD and SCRN for escaping saddle-points.

In this work, we study the algorithms under two oracle settings: the stochastic zeroth-order oracle, where one obtains noisy unbiased function evaluations, and the stochastic higher-order oracle, where one obtains noisy unbiased estimators of the gradients and Hessians. We now define them formally.

Assumption 2.5 (Zeroth-order oracle) For any $x \in \mathbb{R}^d$, the zeroth-order oracle outputs an estimator $F(x,\xi)$ of $f(x)$ such that $\mathbb{E}_\xi[F(x,\xi)] = f(x)$, $\mathbb{E}_\xi[\nabla F(x,\xi)] = \nabla f(x)$, $\mathbb{E}_\xi[\nabla^2 F(x,\xi)] = \nabla^2 f(x)$, and $\mathbb{E}_\xi\big[\|\nabla^2 F(x,\xi) - \nabla^2 f(x)\|_F^4\big] \le \sigma^2$, where $\|\cdot\|_F$ is the Frobenius norm.

Assumption 2.6 (Higher-order oracles) For any $x \in \mathbb{R}^d$, (i) the first-order oracle outputs an estimate $\nabla F(x,\xi)$ of $\nabla f(x)$ such that $\mathbb{E}_\xi[\nabla F(x,\xi)] = \nabla f(x)$, and (ii) the second-order oracle in addition outputs an estimate $\nabla^2 F(x,\xi)$ of $\nabla^2 f(x)$ such that $\mathbb{E}_\xi[\nabla^2 F(x,\xi)] = \nabla^2 f(x)$ and $\mathbb{E}_\xi\big[\|\nabla^2 F(x,\xi) - \nabla^2 f(x)\|_F^4\big] \le \sigma^2$.

Such assumptions on the zeroth-order and higher-order oracles are standard in the literature; see for example [20, 38]. Our goal in this paper is to reach an approximate local minimizer (also called a second-order stationary point) of a non-convex function, which is defined as follows.

Definition 2.1 ($\epsilon$-Local Minimizer) Let Assumption 2.3 hold for a function $f$. Then a point $\bar{x}$ is called an $\epsilon$-second-order stationary point ($\epsilon$-local minimizer) if

$$\|\nabla f(\bar{x})\| \le \epsilon \quad \text{and} \quad \lambda_{\min}\big(\nabla^2 f(\bar{x})\big) \ge -\sqrt{L_H\,\epsilon}, \qquad (2)$$

where $\lambda_{\min}$ denotes the minimum eigenvalue of $\nabla^2 f(\bar{x})$.

Algorithm 1 Perturbed Stochastic Gradient Descent
Input: $x_0 \in \mathbb{R}^d$, step size $\eta$, perturbation radius $r$.
for $t = 0$ to $T$ do
  Set $g_t = \frac{1}{n_1}\sum_{i=1}^{n_1} g_{t,i}$, where
    $g_{t,i} = \nabla F(x_t, \xi_{t,i})$   (First-order)
    $g_{t,i} = \frac{F(x_t + \nu u_{t,i}, \xi_{t,i}) - F(x_t, \xi_{t,i})}{\nu}\, u_{t,i}$   (Zeroth-order)
  with $u_{t,i} \sim N(0, I_d)$ for all $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, n_1$
  Sample perturbation $\zeta_t \sim N(0, (r^2/d)\, I_d)$
  Update $x_{t+1} = x_t - \eta\,(g_t + \zeta_t)$
end for

Note that for stochastic optimization problems, the quantities on the left-hand side of (2) are random variables. In this paper we prove a high-probability bound and a bound in expectation for these quantities for PSGD and SCRN, respectively. For a point $x_t$, we will use $\nabla_t$, $\nabla^2_t$, $h_t$, and $\lambda_{1,t}$ to denote $\nabla f(x_t)$, $\nabla^2 f(x_t)$, $(x_{t+1} - x_t)$, and $\lambda_{\min}(\nabla^2 f(x_t))$, respectively. The zeroth-order mini-batch gradient [38] and Hessian estimators $g_t$ and $H_t$ are defined as

$$g_t = \frac{1}{n_1}\sum_{i=1}^{n_1} g_{t,i}, \qquad g_{t,i} = \frac{F(x_t + \nu u_{t,i}, \xi_{t,i}) - F(x_t, \xi_{t,i})}{\nu}\, u_{t,i}, \qquad (3)$$

$$H_t = \frac{1}{n_1}\sum_{i=1}^{n_1} H_{t,i}, \qquad H_{t,i} = \frac{F(x_t + \nu u_{t,i}, \xi_{t,i}) + F(x_t - \nu u_{t,i}, \xi_{t,i}) - 2F(x_t, \xi_{t,i})}{2\nu^2}\,\big(u_{t,i}u_{t,i}^\top - I_d\big), \qquad (4)$$

where $u_{t,i} \sim N(0, I_d)$ for all $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, n_1$. We will use $e_t = g_t - \nabla_t = \frac{1}{n_1}\sum_{i=1}^{n_1} g_{t,i} - \nabla_t$ to denote the gradient estimation error. In the following lemma we show that, under SGC, the variance of the mini-batch gradient estimator is of the order of the squared gradient norm.

Lemma 2.1 Let Assumption 2.4 hold for a function $f$. Then, for both the zeroth-order and first-order oracles, we have

$$\mathbb{E}\,\Big\|\frac{1}{n_1}\sum_{i=1}^{n_1}\nabla F(x_t,\xi_i) - \nabla_t\Big\|^2 \;\le\; \frac{\rho - 1}{n_1}\,\|\nabla_t\|^2.$$

Proof. Let $g_t = \frac{1}{n_1}\sum_{i=1}^{n_1}\nabla F(x_t,\xi_i)$.
Then we have
$$\mathbb{E}\|g_t - \nabla_t\|^2 = \mathbb{E}\|g_t\|^2 + \|\nabla_t\|^2 - 2\,\mathbb{E}[g_t]^\top \nabla_t = \mathbb{E}\|g_t\|^2 - \|\nabla_t\|^2,$$
and, by Assumption 2.4,
$$\mathbb{E}\|g_t\|^2 = \frac{1}{n_1^2}\Big(\sum_{i=1}^{n_1}\mathbb{E}\|\nabla F(x_t,\xi_i)\|^2 + \sum_{i \ne j}\mathbb{E}\big[\nabla F(x_t,\xi_i)^\top \nabla F(x_t,\xi_j)\big]\Big) \le \frac{\rho}{n_1}\|\nabla_t\|^2 + \frac{n_1(n_1-1)}{n_1^2}\|\nabla_t\|^2,$$
so that $\mathbb{E}\|g_t - \nabla_t\|^2 \le \frac{\rho - 1}{n_1}\|\nabla_t\|^2$, which completes the proof.

Remark 1 The above simple result actually turns out to have far-reaching consequences for obtaining improved complexity bounds for both the PSGD and SCRN algorithms. It implies that when the true gradient is small, the variance of the stochastic gradient is also small. Typically, in the analysis of PSGD and SCRN, the stochastic gradients are assumed to have a constant variance. For over-parametrized models, we will instead use Lemma 2.1 to prove a deterministic rate for PSGD and improved rates for SCRN.

3 Perturbed Stochastic Gradient Descent

In this section we show that, under SGC, PSGD attains the deterministic rate in the first-order setting and obtains a much better rate than previously known in the zeroth-order setting. An intuitive explanation of this phenomenon is as follows. In the general stochastic setting, at an iteration $t$ where $\|\nabla_t\| \ge \epsilon$, PSGD does not descend as much as in the deterministic setting due to the noisy gradient, so it takes more iterations to average out the noise. While escaping a saddle point, due to the noisy gradient, the iterates follow the direction of the most negative curvature with more difficulty, leading to higher complexity. Under SGC, when $\|\nabla_t\| \ge \epsilon$, the noise variance is of the order of $\|\nabla_t\|^2$ as shown in Lemma 2.1, so the algorithm still manages to descend. While escaping a saddle point under SGC, $\|\nabla_t\| \le \epsilon$, and hence the gradient noise is also small, leading to deterministic rates. The outline of the proof of the bounds for PSGD in the first-order setting is similar to [25], except that we analyze PSGD in the interpolation regime. At a high level the proof has two stages. First, we show that when $\|\nabla_t\| \ge \epsilon$, the function descends as fast as in the deterministic case. Second, when $\|\nabla f(x_t)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x_t)) \le -\sqrt{L_H\,\epsilon}$, i.e., $x_t$ is a saddle point, a coupling argument shows that either the function descends or the sequence of iterates is stuck around the saddle point; it is then shown that the stuck region is narrow enough that the iterates escape the saddle point with high probability. We now require a condition on the tail of the stochastic gradient.

Assumption 3.1 For any $x \in \mathbb{R}^d$,
$$\mathbb{P}\big(\|\nabla F(x,\xi) - \nabla f(x)\| \ge \epsilon\big) \le 2\exp\Big(-\frac{\epsilon^2}{2\,\mathbb{E}[\|\nabla F(x,\xi) - \nabla f(x)\|^2]}\Big).$$

Such light-tail conditions are common in the stochastic optimization literature for obtaining high-probability bounds; see for example [20, 25]. Note that under Assumption 2.4, Assumption 3.1 is equivalent to
$$\mathbb{P}\big(\|\nabla F(x,\xi) - \nabla f(x)\| \ge \epsilon\big) \le 2\exp\Big(-\frac{\epsilon^2}{2(\rho - 1)\|\nabla f(x)\|^2}\Big). \qquad (5)$$

We now present our main result on PSGD.

Theorem 3.1 a) Let Assumptions 2.2 and 2.3 hold for the function $f(\cdot)$, and let Assumptions 2.4 and 3.1 hold. Choose the step size $\eta$, the perturbation radius $r = O(\epsilon^{1.5})$, and the mini-batch size
$$n_1 = 512\,c\,(\rho - 1)\,\log(\cdot), \qquad (6)$$
where $c$ is an absolute constant and $\log(\cdot)$ collects the logarithmic dependence on $1/\epsilon$, $1/\delta$, and the problem parameters. Then, with probability at least $1 - \delta$, half of the iterations of Algorithm 1 will be $\epsilon$-local minimizers after
$$T = O\Big(\max\Big\{a_0\,\frac{(f(x_0) - f^*)\,T_1}{F_1},\; a_1\,\frac{f(x_0) - f^*}{\eta\,\epsilon^2}\Big\}\Big) \qquad (7)$$
iterations, where $a_0, a_1$ are constants, $T_1 = O(\epsilon^{-0.5})$ is the number of iterations required to escape a saddle point, and $F_1 = O(\epsilon^{1.5})$ is the corresponding decrease in function value. Consequently, the total number of calls to the stochastic first-order oracle is $T n_1 = O(1/\epsilon^2)$.

b) Let Assumptions 2.1, 2.2, 2.3, 2.4, and 3.1 hold. In the zeroth-order setting, choose the step size $\eta$, the smoothing parameter $\nu$, the perturbation radius $r$, and the mini-batch size $n_1$ appropriately (each depending polynomially on $d$ and $1/\epsilon$, up to logarithmic factors; the hidden constants are absolute). Then, with probability at least $1 - \delta$, half of the iterations of Algorithm 1 will be $\epsilon$-local minimizers after
$$T = O\Big(\max\Big\{\frac{(f(x_0) - f^*)\,T_0}{F_0},\; \frac{f(x_0) - f^*}{\eta\,\epsilon^2}\Big\}\Big) \qquad (8)$$
iterations, where $T_0 = O(\epsilon^{-0.5}\log^2 d)$ and $F_0 = O(\epsilon^{1.5})$. Hence, the total number of zeroth-order oracle calls is
$$T n_1 = O\big(d^{1.5}/\epsilon^{4.5}\big). \qquad (9)$$

Remark 2 Note that the complexity result in (7) for PSGD in the first-order setting matches the corresponding complexity of perturbed gradient descent on deterministic optimization problems.
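To make Algorithm 1 concrete, the following is a minimal sketch of its first-order variant in Python/NumPy. The toy objective, the Gaussian form and scale of the perturbation, and all hyperparameter values are illustrative assumptions and not the tuned choices of Theorem 3.1; the multiplicative gradient noise in the example is chosen so that SGC (Assumption 2.4) holds.

```python
import numpy as np

rng = np.random.default_rng(0)

def psgd(grad_oracle, x0, eta=1e-2, r=1e-3, n1=8, T=1000):
    """Minimal sketch of perturbed SGD (Algorithm 1), first-order variant.

    grad_oracle(x) returns one stochastic gradient sample at x; eta is the
    step size, r the perturbation scale (Gaussian here, an assumption),
    n1 the mini-batch size, and T the number of iterations.
    """
    x = np.array(x0, dtype=float)
    d = x.shape[0]
    for _ in range(T):
        # mini-batch stochastic gradient g_t = (1/n1) * sum_i grad F(x_t, xi_{t,i})
        g = np.mean([grad_oracle(x) for _ in range(n1)], axis=0)
        # injected perturbation zeta_t (exact distribution/scale is an assumption)
        zeta = r * rng.standard_normal(d)
        # update x_{t+1} = x_t - eta * (g_t + zeta_t)
        x = x - eta * (g + zeta)
    return x

# Toy example: f(x) = 0.5 x^T A x has a saddle at the origin (A has a negative
# eigenvalue). The stochastic gradient rescales the true gradient by a random
# factor s, so E||s * grad f||^2 = E[s^2] ||grad f||^2 and SGC holds with rho = E[s^2].
A = np.diag([1.0, -0.1])
grad_oracle = lambda x: rng.uniform(0.5, 1.5) * (A @ x)
x_final = psgd(grad_oracle, x0=[1e-3, 1e-3])
print(x_final)  # the iterate drifts away from the saddle along the negative-curvature direction
```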
Remark 3 We briefly highlight the difficulty associated with proving the result in (9). First note that in the first-order proof, and also in [25], it is assumed that the noise is sub-Gaussian. But for the zeroth-order gradient $g_t$ as defined in (3), $\|g_t - \nabla_t\|$ no longer has sub-Gaussian tails. Also note that, from [38], $\mathbb{E}_{u_{t,i}}[g_{t,i}] = \nabla_x\,\mathbb{E}_{u_{t,i}}[F(x_t + \nu u_{t,i}, \xi_{t,i})]$, i.e., the gradient of a Gaussian-smoothed version of $F(\cdot,\xi_{t,i})$, so $g_t$ is not an unbiased estimator of $\nabla_t$. But, as shown in [38], this smoothed gradient is close to $\nabla F(x_t, \xi_{t,i})$. So we first need to establish concentration properties for $g_t$ in the zeroth-order setting. Towards this, we show that $g_t$ is $\alpha$-sub-exponential with $\alpha = 2/3$, even if $\nabla F(x,\xi) - \nabla f(x)$ is sub-Gaussian, i.e., the noise in the zeroth-order gradient estimates has heavier tails (Lemma A.1). This leads to the complexity bound in (9).

Remark 4 Note that $T_1$ and $T_0$ are the numbers of iterations required to descend by $F_1$ and $F_0$, respectively, in the first- and zeroth-order settings, after the algorithm hits a saddle point. As shown in [25], without SGC, $T_1 = O(\epsilon^{-2.5})$. In this paper we show that, under SGC, $T_1 = T_0 = O(\epsilon^{-0.5})$. This shows that under SGC it is indeed possible to escape saddle points faster. We highlight here that [4, 18] recently considered escaping saddle points in the zeroth-order setting. However, they assume that the function being optimized is deterministic (so that exact gradients can be obtained) and bounded (so that sub-Gaussian tails are possible for the zeroth-order gradient estimator). These two assumptions are highly impractical and are not satisfied in several situations where zeroth-order optimization techniques are used in practice.

To the best of our knowledge, there is no known bound on the number of zeroth-order oracle calls required for (9) to hold when SGC does not hold and only the following standard variance assumption on the unbiased stochastic gradient holds (see, e.g., [20]): for some $\sigma > 0$,
$$\mathbb{E}\,\Big\|\frac{1}{n_1}\sum_{i=1}^{n_1}\nabla F(x_t,\xi_i) - \nabla_t\Big\|^2 \le \frac{\sigma^2}{n_1}. \qquad (10)$$
For completeness we present the corresponding result below, which serves as a reference against which to compare our results under the SGC assumption.

Theorem 3.2 Let Assumptions 2.1, 2.2, 2.3, and 3.1 hold, and let the variance condition (10) hold in place of Assumption 2.4. In the zeroth-order setting, choosing the step size $\eta$, the smoothing parameter $\nu$, the perturbation radius $r$, and the mini-batch size $n_1$ appropriately, with $n_1$ now scaling as $d^{1.5}/\epsilon^{3.5}$ times a polynomial factor in $\sigma$ and logarithmic terms, we have that, with probability at least $1 - \delta$, half of the iterations of Algorithm 1 will be $\epsilon$-local minimizers after
$$T = O\Big(\max\Big\{\frac{(f(x_0) - f^*)\,T_0}{F_0},\; \frac{f(x_0) - f^*}{\eta\,\epsilon^2}\Big\}\Big)$$
iterations, where $T_0$ and $F_0$ play the same roles as in Theorem 3.1(b). Hence, the total number of zeroth-order oracle calls is $T n_1$, which, in contrast to Theorem 3.1(b), depends polynomially on the noise level $\sigma$.

Remark 5 A generic reduction was proposed in [2] for turning any algorithm that converges to a first-order stationary point at a particular rate into one that converges to a local minimizer at the same rate. The results in [2] are not directly applicable to the zeroth-order setting due to their assumptions. However, assuming that their assumptions could be relaxed to make the reduction work in the zeroth-order setting, it is interesting to examine whether the results in [20] for converging to a first-order stationary solution could be combined with the reduction proposed in [2] to establish a result similar to Theorem 3.2. To make the result of [20] hold with the same probability as in Theorem 3.2, we would require $O(d\,\epsilon^{-6})$ calls to the stochastic zeroth-order oracle. Hence, in certain regimes it is plausible that improved results can be obtained. It is interesting future work to examine this further rigorously.
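As a small companion to the zeroth-order results of this section, the sketch below implements the Gaussian-smoothing mini-batch gradient estimator of (3) in Python/NumPy. The test function, the additive evaluation noise, and the values of the smoothing parameter and batch size are illustrative assumptions; as noted in Remark 3, the estimator is unbiased only for a Gaussian-smoothed surrogate of the objective, not for the objective itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def zo_gradient(func, x, nu=1e-4, n1=500):
    """Mini-batch zeroth-order gradient estimator, cf. (3):
    g_{t,i} = [F(x + nu * u_i, xi_i) - F(x, xi_i)] / nu * u_i, with u_i ~ N(0, I_d).

    func(x) returns one (possibly noisy) function evaluation F(x, xi)."""
    d = x.shape[0]
    samples = []
    for _ in range(n1):
        u = rng.standard_normal(d)
        samples.append((func(x + nu * u) - func(x)) / nu * u)
    return np.mean(samples, axis=0)

# Sanity check on a smooth quadratic with small additive evaluation noise (an assumption).
f_noisy = lambda x: 0.5 * np.dot(x, x) + 1e-6 * rng.standard_normal()
x = np.array([1.0, -2.0, 0.5])
print(zo_gradient(f_noisy, x))  # roughly recovers the true gradient, which here equals x itself
```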
4 Stochastic Cubic-Regularized Newton's Method

In this section we analyze the cubic-regularized (CR) Newton method in the interpolation regime. In the stochastic setting without interpolation-like conditions, CR Newton achieves a rate of $O(\epsilon^{-3.5})$, compared to the $O(\epsilon^{-4})$ attained by PSGD. Here we show that CR Newton achieves a rate of $O(\epsilon^{-2.5})$ under SGC. Even though this rate is better than in the non-interpolation stochastic setting, quite interestingly, the CR Newton method fails to achieve the deterministic rate of $O(\epsilon^{-1.5})$, unlike PSGD. We believe that, without a stronger assumption on the Hessian estimator noise as well, CR Newton will perform worse than PSGD.

Algorithm 2 Stochastic Cubic-Regularized Newton's Method
Input: $x_1 \in \mathbb{R}^d$, $T$, $M$, $n_1$, $n_2$
for $t = 1$ to $T$ do
  Set $g_t = \frac{1}{n_1}\sum_{i=1}^{n_1} g_{t,i}$, where
    $g_{t,i} = \nabla F(x_t, \xi^G_{t,i})$   (Higher-order)
    $g_{t,i} = \frac{F(x_t + \nu u^G_{t,i}, \xi^G_{t,i}) - F(x_t, \xi^G_{t,i})}{\nu}\, u^G_{t,i}$   (Zeroth-order)
  Set $H_t = \frac{1}{n_2}\sum_{i=1}^{n_2} H_{t,i}$, where
    $H_{t,i} = \nabla^2 F(x_t, \xi^H_{t,i})$   (Higher-order)
    $H_{t,i} = \frac{F(x_t + \nu u^H_{t,i}, \xi^H_{t,i}) + F(x_t - \nu u^H_{t,i}, \xi^H_{t,i}) - 2F(x_t, \xi^H_{t,i})}{2\nu^2}\,\big(u^H_{t,i}(u^H_{t,i})^\top - I_d\big)$   (Zeroth-order)
  with $u^{G[H]}_{t,i} \sim N(0, I_d)$ for all $t = 1, 2, \ldots, T$ and $i = 1, 2, \ldots, n_1\,[n_2]$
  Update
  $$x_{t+1} = \arg\min_y\; m_t(x_t, y, g_t, H_t, M), \qquad (13)$$
  $$m_t(y) = f(x_t) + (y - x_t)^\top g_t + \tfrac{1}{2}(y - x_t)^\top H_t\,(y - x_t) + \tfrac{M}{6}\|y - x_t\|^3. \qquad (14)$$
end for

In this section, let $\mathcal{F}_t$ be the filtration generated up to time $t$: in the higher-order setting $\mathcal{F}_t = \sigma\big(\{\xi^G_{i,j}, \xi^H_{i,j}\}_{i \le t, j}\big)$, and in the zeroth-order setting $\mathcal{F}_t = \sigma\big(\{\xi^G_{i,j}, \xi^H_{i,j}, u^G_{i,j}, u^H_{i,j}\}_{i \le t, j}\big)$. We now present our main result.

Theorem 4.1 Let $f$ be a function for which Assumptions 2.2 and 2.3 hold. Then, under SGC, i.e., under Assumption 2.4, for Algorithm 2 we have:
a) In the higher-order setting, choosing $T = 144\,(f(x_1) - f^*)\,\epsilon^{-1.5}$, the mini-batch sizes $n_1 = \mu_0(\rho - 1)\,\epsilon^{-1}$ and $n_2 = 1$, and the regularization parameter $M$ as the maximum of appropriate problem-dependent quantities, we get $\mathbb{E}\big[\|\nabla f(x_R)\|\big] \le O(\epsilon)$ and $\mathbb{E}\big[\lambda_{1,R}\big] \ge -O(\sqrt{\epsilon})$, where $\mu_0$ is a constant independent of $\epsilon$ and $d$, and $R$ is an integer random variable uniformly distributed over the support $\{1, 2, \ldots, T\}$. The total number of first-order and second-order oracle calls is hence $O(1/\epsilon^{2.5})$.
b) In the zeroth-order setting, choosing $T = \mu_0\,(f(x_1) - f^*)\,\epsilon^{-1.5}$, $M = \mu_4$, the smoothing parameter $\nu$, and the mini-batch sizes $n_1 \propto \mu_1(d + 5)$ and $n_2 \propto \mu_2(1 + 2\log 2d)(d + 16)^4$ (with appropriate polynomial dependence on $1/\epsilon$), we get $\mathbb{E}\big[\|\nabla f(x_R)\|\big] \le O(\epsilon)$ and $\mathbb{E}\big[\lambda_{1,R}\big] \ge -O(\sqrt{\epsilon})$, where $\mu_0, \mu_1, \mu_2, \mu_3, \mu_4$ are constants independent of $\epsilon$ and $d$, and $R$ is an integer random variable uniformly distributed over the support $\{1, 2, \ldots, T\}$. The total number of zeroth-order oracle calls used for the gradient estimates is $T n_1$, and the number used for the Hessian estimates is $T n_2$.

Remark 6 The above results only require Assumption 2.4, which is a gradient-level interpolation property. As SCRN is a second-order algorithm, an assumption of the form "if the minimum eigenvalue of the true Hessian at a point is non-negative, then the minimum eigenvalue of the stochastic Hessian at that point is almost surely also non-negative" might be required to capture second-order properties of interpolation. Such an assumption could then be used to obtain a result similar to Lemma 2.1 for stochastic Hessians, and thereby improve the rates in Theorem 4.1. Formalizing this intuition is an extremely interesting direction for future work.

Remark 7 In comparison to the PSGD algorithm, we obtain the results for the SCRN algorithm in expectation. We highlight that it is straightforward to obtain a high-probability result in the higher-order setting. However, it is technically challenging to do so in the zeroth-order setting, due to the difficulty of obtaining sharper concentration results for the zeroth-order Hessian estimator, which we leave as future work. In Theorem 4.1, we presented the results in expectation for both settings to maintain uniformity of presentation. In Algorithm 2 we assume that the exact solution to (13) is available (the sketch below, for illustration, uses an inexact gradient-descent solve of the subproblem).
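For concreteness, the sketch below implements one higher-order step of Algorithm 2 in Python/NumPy, with the cubic subproblem (13) minimized approximately by gradient descent on the model (14), in the spirit of the inexact-solve discussion that follows. The toy oracles, the choice of $M$, and the inner-loop step size and iteration budget are illustrative assumptions rather than the parameter choices of Theorem 4.1.

```python
import numpy as np

rng = np.random.default_rng(2)

def cubic_model_grad(s, g, H, M):
    """Gradient of the model (14) at y = x_t + s (the constant f(x_t) drops out):
    grad m_t(s) = g + H s + (M/2) ||s|| s."""
    return g + H @ s + 0.5 * M * np.linalg.norm(s) * s

def scrn_step(x, grad_oracle, hess_oracle, n1=32, n2=8, M=10.0,
              inner_steps=300, inner_lr=0.1):
    """One higher-order step of Algorithm 2 with an inexact cubic subproblem solve.

    grad_oracle(x) / hess_oracle(x) return one stochastic gradient / Hessian sample;
    the subproblem (13) is minimized approximately by gradient descent on m_t."""
    g = np.mean([grad_oracle(x) for _ in range(n1)], axis=0)
    H = np.mean([hess_oracle(x) for _ in range(n2)], axis=0)
    s = np.zeros_like(x)
    for _ in range(inner_steps):
        s = s - inner_lr * cubic_model_grad(s, g, H, M)
    return x + s

# Toy quadratic with a saddle at the origin; multiplicative gradient noise, so SGC holds.
A = np.diag([1.0, -0.5])
grad_oracle = lambda x: rng.uniform(0.5, 1.5) * (A @ x)
hess_oracle = lambda x: A  # exact Hessian sample here, purely for illustration
x = np.array([0.5, 1e-3])
for _ in range(5):
    x = scrn_step(x, grad_oracle, hess_oracle)
print(x)  # the cubic step exploits negative curvature and moves the iterate away from the saddle
```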
We remark that it is possible to relax the exact-solution assumption by following the approach of [47], which in turn leveraged the results in [7] showing that the subproblem in (13) can be solved with high probability using gradient descent.

5 Conclusion

In this work, we analyze the oracle complexity of two standard algorithms, the perturbed stochastic gradient descent algorithm and the stochastic cubic-regularized Newton's method, for escaping saddle-points in nonconvex stochastic optimization. We show that, under interpolation-like conditions satisfied by modern over-parametrized machine learning problems, PSGD and SCRN obtain improved rates for escaping saddle-points. In particular, the above improvements are obtained for the vanilla versions of the PSGD and SCRN algorithms and are not based on any complicated variance reduction techniques. For future work, it is extremely interesting to bridge the gap between SCRN and its deterministic counterpart. The key to this is to come up with a Hessian-based interpolation-like assumption that is both practically meaningful and theoretically sound.

Broader Impact

We focus in this work on establishing theoretical justification for a practically observed phenomenon: the stochastic gradient method and its relatives perform well for training deep neural networks with complicated nonconvex landscapes. The results presented will benefit researchers and practitioners who are interested in understanding the theoretical underpinnings of stochastic optimization for deep learning. Although the work in this draft is theoretical, it might have a positive impact on various practical applications of neural networks.

Funding Disclosure

AR was supported in part by NSF Grant CCF-1934568. The research of KB was supported in part by the UC Davis CeDAR (Center for Data Science and Artificial Intelligence Research) Innovative Data Science Seed Funding Program. The research of PM was sponsored in part by the U.S. Army Combat Capabilities Development Command Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194-8244, 2017.
[2] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3716-3726, 2018.
[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242-252. PMLR, 2019.
[4] Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal. Escaping saddle points for zeroth-order non-convex optimization using estimated gradient descent. In 2020 54th Annual Conference on Information Sciences and Systems (CISS), pages 1-6. IEEE, 2020.
[5] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high-dimensionality and saddle-points. arXiv preprint arXiv:1809.06474, 2018.
[6] Raef Bassily, Mikhail Belkin, and Siyuan Ma.
On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
[7] Yair Carmon and John C Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.
[8] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295-319, 2011.
[9] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036-3046, 2018.
[10] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2937-2947, 2019.
[11] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of $O(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1-2):1-32, 2017.
[12] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933-2941, 2014.
[13] Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In Advances in Neural Information Processing Systems, pages 1753-1763, 2019.
[14] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675-1685, 2019.
[15] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067-1077, 2017.
[16] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689-699, 2018.
[17] Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex SGD escaping from saddle points. In Conference on Learning Theory, pages 1192-1234, 2019.
[18] Lampros Flokas, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Georgios Piliouras. Efficiently avoiding saddle points with zero order methods: No gradients required. arXiv preprint arXiv:1910.13021, 2019.
[19] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797-842, 2015.
[20] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[21] Benjamin Haeffele, Eric Young, and Rene Vidal. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In International Conference on Machine Learning, pages 2007-2015, 2014.
[22] Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
[23] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.
In Advances in Neural Information Processing Systems, pages 8571-8580, 2018.
[24] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1724-1732. JMLR.org, 2017.
[25] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811, 2019.
[26] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference on Learning Theory, pages 1042-1085, 2018.
[27] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586-594, 2016.
[28] Kenji Kawaguchi and Leslie Kaelbling. Elimination of all bad local minima in deep learning. In International Conference on Artificial Intelligence and Statistics, pages 853-863, 2020.
[29] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.
[30] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246-1257, 2016.
[31] Songtao Lu, Meisam Razaviyayn, Bo Yang, Kejun Huang, and Mingyi Hong. SNAP: Finding approximate second-order stationary solutions efficiently for non-convex linearly constrained problems. arXiv preprint arXiv:1907.04450, 2019.
[32] Songtao Lu, Ziping Zhao, Kejun Huang, and Mingyi Hong. Perturbed projected gradient descent converges to approximate second-order points for bound constrained nonconvex problems. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5356-5360. IEEE, 2019.
[33] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325-3334, 2018.
[34] Si Yi Meng, Sharan Vaswani, Issam Laradji, Mark Schmidt, and Simon Lacoste-Julien. Fast and furious convergence: Stochastic second order methods under interpolation. arXiv preprint arXiv:1910.04920, 2020.
[35] Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629-3639, 2018.
[36] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017-1025, 2014.
[37] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[38] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527-566, 2017.
[39] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.
[40] Maher Nouiehed and Meisam Razaviyayn. A trust region method for finding second-order stationarity in linearly constrained non-convex optimization. arXiv preprint arXiv:1904.06784, 2019.
[41] Boris Teodorovich Polyak. Gradient methods for minimizing functionals.
Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.
[42] Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, and Prasant Mohapatra. Multi-point bandit algorithms for nonstationary online nonconvex optimization. arXiv preprint arXiv:1907.13616, 2019.
[43] Mark Schmidt. Faster algorithms for deep learning? Presentation at the Vector Institute: https://www.cs.ubc.ca/~schmidtm/documents/2020_vector_smallresidual.pdf, 2020.
[44] Lingqing Shen, Krishnakumar Balasubramanian, and Saeed Ghadimi. Non-asymptotic results for Langevin Monte Carlo: Coordinate-wise and black-box sampling. arXiv preprint arXiv:1902.01373, 2019.
[45] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[46] Ruoyu Sun. Optimization for deep learning: theory and algorithms. arXiv preprint arXiv:1912.08957, 2019.
[47] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899-2908, 2018.
[48] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.
[49] Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. Stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372, 2018.
[50] Junyu Zhang, Lin Xiao, and Shuzhong Zhang. Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization. arXiv preprint arXiv:1811.11637, 2018.
[51] Dongruo Zhou and Quanquan Gu. Stochastic recursive variance-reduced cubic regularization methods. arXiv preprint arXiv:1901.11518, 2019.
[52] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109(3):467-492, 2020.