# Natasha 2: Faster Non-Convex Optimization Than SGD

Zeyuan Allen-Zhu (Microsoft Research AI), zeyuan@csail.mit.edu

**Abstract.** We design a stochastic algorithm to find ε-approximate local minima of any smooth nonconvex function in rate O(ε^{-3.25}), with only oracle access to stochastic gradients. The best result before this work was O(ε^{-4}) by stochastic gradient descent (SGD).²

## 1 Introduction

The diverse world of deep learning research has given rise to numerous architectures for neural networks (convolutional ones, long short-term memory ones, etc.). However, to this date, the underlying training algorithms for neural networks are still stochastic gradient descent (SGD) and its heuristic variants. In this paper, we address the problem of designing a new algorithm that has a provably faster running time than the best known result for SGD.

Mathematically, we study the problem of online stochastic nonconvex optimization:

  min_{x ∈ ℝ^d} { f(x) := E_i[f_i(x)] = (1/n) Σ_{i=1}^n f_i(x) }    (1.1)

where both f(·) and each f_i(·) can be nonconvex. We want to study online algorithms for finding an approximate local minimum of f(x). Here, we say an algorithm is online if its complexity is independent of n; this tackles the big-data scenario in which n is extremely large or even infinite.³

Nonconvex optimization arises prominently in large-scale machine learning. Most notably, training deep neural networks corresponds to minimizing f(x) of this average structure: each training sample i corresponds to one loss function f_i(·) in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random ∇f_i(x) (corresponding to computing backpropagation once) to approximate ∇f(x) and performs descent updates.

The standard goal of nonconvex optimization with provable guarantees is to find approximate local minima. This is not only because finding the global one is NP-hard, but also because there exists a rich literature on heuristics for turning a local-minima finding algorithm into a global one, including random seeding, graduated optimization [25], and others. Therefore, faster algorithms for finding approximate local minima translate into faster heuristic algorithms for finding the global minimum. On a separate note, experiments [16, 17, 24] suggest that fast convergence to approximate local minima may be sufficient for training neural nets, while convergence to stationary points (i.e., points that may be saddle points) is not. In other words, we need to avoid saddle points.

The full version of this paper can be found at https://arxiv.org/abs/1708.08694.
²When this manuscript first appeared online, the best rate was T = O(ε^{-4}) by SGD. Several follow-ups appeared after this paper, including stochastic cubic regularization [44], which gives T = O(ε^{-3.5}), and Neon+SCSG [10, 46], which gives T = O(ε^{-3.333}). These rates are worse than T = O(ε^{-3.25}).
³All of our results in this paper apply to the case when n is infinite, meaning f(x) = E_i[f_i(x)], because we focus on online methods. However, we still introduce n to simplify notations.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

*Figure 1: Local minimum (left), saddle point (right) and its negative-curvature direction.*

### 1.1 Classical Approach: Escaping Saddle Points Using Random Perturbation

One natural way to avoid saddle points is to use randomness to escape from them whenever we meet one. For instance, Ge et al.
[22] showed that, by injecting random perturbation, SGD will not be stuck at saddle points: whenever SGD moves into a saddle point, randomness shall help it escape. This partially explains why SGD performs well in deep learning.⁴ Jin et al. [27] showed that, equipped with random perturbation, full gradient descent (GD) also escapes from saddle points. Although this classical approach is easy to implement, we raise two main efficiency issues regarding it:

**Issue 1.** If we want to escape from saddle points, is random perturbation the only way? Moving in a random direction is blind to the Hessian information of the function, so can we escape from saddle points faster?

**Issue 2.** If we want to avoid saddle points, is it really necessary to first move close to saddle points and then escape from them? Can we design an algorithm that somehow avoids saddle points without ever moving close to them?

### 1.2 Our Resolutions

**Resolution to Issue 1: Efficient Use of the Hessian.** Mathematically, instead of using a random perturbation, the most negative eigenvector of ∇²f(x) (a.k.a. the negative-curvature direction of f(·) at x) gives us a better direction to escape from saddle points; see Figure 1. To make it concrete, suppose we apply the power method on ∇²f(x) to find its most negative eigenvector. If we run the power method for 0 iterations, it gives us a totally random direction; if we run it for more iterations, it converges to the most negative eigenvector of ∇²f(x). Unfortunately, applying the power method is unrealistic because ∇²f(x) = (1/n) Σ_i ∇²f_i(x) can possibly have infinitely many pieces.

We propose to use Oja's algorithm [37] to approximate the power method. Oja's algorithm can be viewed as an online variant of the power method, and requires only (stochastic) matrix-vector product computations. In our setting, this is the same as (stochastic) Hessian-vector products, namely computing ∇²f_i(x)·w for arbitrary vectors w ∈ ℝ^d and random indices i ∈ [n]. It is a known fact that computing Hessian-vector products is as cheap as computing stochastic gradients, and thus we can use Oja's algorithm to escape from saddle points.

*Figure 2: Illustration of how Natasha2^full swings by a saddle point. (a) Move in a negative-curvature direction if there is any (by applying Oja's algorithm); (b) swing by a saddle point without entering its neighborhood (wishful thinking); (c) swing by a saddle point using only stochastic gradients (by applying Natasha1.5^full).*

**Resolution to Issue 2: Swing by Saddle Points.** If the function is sufficiently smooth,⁵ then any point close to a saddle point must have a negative curvature. Therefore, as long as we are close to saddle points, we can already use Oja's algorithm to find such negative curvature and move in its direction to decrease the objective; see Figure 2(a).

Therefore, we are left only with the case that the point is not close to any saddle point. Using the smoothness of f(·), this gives a safe zone near the current point in which there is no strict saddle point; see Figure 2(b). Intuitively, we wish to use the property of the safe zone to design an algorithm that decreases the objective faster than SGD. Formally, f(·) inside this safe zone must be of bounded nonconvexity, meaning that the eigenvalues of its Hessian are always greater than some negative threshold −σ (where σ depends on how long we run Oja's algorithm). Intuitively, the greater σ is, the more nonconvex f(x) is.

⁴In practice, stochastic gradients naturally incur random noise and adding perturbation may not be needed.
⁵As we shall see, smoothness is necessary for finding approximate local minima with provable guarantees.
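To make the negative-curvature search above concrete, here is a minimal sketch of an Oja's-style online power iteration that uses only stochastic Hessian-vector products. The oracle name `stochastic_hvp`, the step size, and the iteration count are illustrative assumptions, not the exact choices analyzed in this paper.

```python
import numpy as np

def oja_negative_curvature(stochastic_hvp, d, iters=200, eta=0.05, rng=None):
    """Sketch: estimate a direction of (most) negative curvature at the current point.

    stochastic_hvp(w) should return (Hessian of f_i at x) @ w for a freshly sampled i.
    Running Oja-style updates on (I - eta * Hessian) amplifies directions whose
    curvature is most negative; with iters=0 the output is a purely random direction.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        hv = stochastic_hvp(w)      # one stochastic Hessian-vector product
        w = w - eta * hv            # power-method step on (I - eta * Hessian)
        w /= np.linalg.norm(w)      # keep the iterate on the unit sphere
    return w

# Toy usage: a quadratic saddle f(x) = 0.5 * x^T A x, whose Hessian is A.
A = np.diag([1.0, 0.5, -0.3])
v = oja_negative_curvature(lambda w: A @ w, d=3)
print(v @ A @ v)   # close to -0.3: the negative-curvature direction was found
```

With such a routine in hand, what remains is to exploit the bounded nonconvexity σ inside the safe zone, which is what the next sections develop.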
We wish to design an (online) stochastic first-order method whose running time scales with σ. Unfortunately, classical stochastic methods such as SGD or SCSG [30] cannot make use of this nonconvexity parameter σ; the only known methods that can are offline algorithms. In this paper, we design a new stochastic first-order method, Natasha1.5.

**Theorem 1 (informal).** Natasha1.5 finds x with ‖∇f(x)‖ ≤ ε in rate T = O(1/ε³ + σ^{1/3}/ε^{10/3}).

Finally, we put Natasha1.5 together with Oja's algorithm to construct our final algorithm, Natasha2.

**Theorem 2 (informal).** Natasha2 finds x with ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI in rate T = Õ(1/δ⁵ + 1/(δε³) + 1/ε^{3.25}). In particular, when δ ≥ ε^{1/4}, this gives T = Õ(1/ε^{3.25}).

In contrast, the convergence rate of SGD was T = Õ(poly(d) · ε^{-4}) [22].

### 1.3 Follow-Up Results

Since the original appearance of this work, there has been a lot of progress in stochastic nonconvex optimization. Most notably:

- If one swings by saddle points using Oja's algorithm and SGD variants (instead of Natasha1.5), the convergence rate is T = Õ(ε^{-3.5}) [5].
- If one applies SGD and only escapes from saddle points using Oja's algorithm, the convergence rate is T = Õ(ε^{-4}) [10, 46].
- If one applies SCSG and only escapes from saddle points using Oja's algorithm, the convergence rate is T = Õ(ε^{-3.333}) [10, 46].
- If one applies a stochastic version of the cubic regularized Newton's method, the convergence rate is T = Õ(ε^{-3.5}) [44].
- If f(x) is of σ-bounded nonconvexity, the SGD4 method [5] gives rate T = Õ(ε^{-2} + σε^{-4}).

We include these results in Table 1 for a close comparison.

| algorithm | gradient complexity T | variance bound | Lipschitz smooth | 2nd-order smooth |
|---|---|---|---|---|
| *convex only* | | | | |
| SGD1 [5, 23] | O(ε^{-2.667}) | needed | needed | no |
| SGD2 [5] | O(ε^{-2.5}) | needed | needed | no |
| SGD3 [5] | Õ(ε^{-2}) | needed | needed | no |
| *approximate stationary points* | | | | |
| SGD (folklore) | O(ε^{-4}) (see Appendix B) | needed | needed | no |
| SCSG [30] | O(ε^{-3.333}) | needed | needed | no |
| Natasha1.5 | O(ε^{-3} + σ^{1/3}ε^{-3.333}) (see Theorem 1) | needed | needed | no |
| SGD4 [5] | Õ(ε^{-2} + σε^{-4}) | needed | needed | no |
| *approximate local minima* | | | | |
| perturbed SGD [22] | Õ(ε^{-4} · poly(d)) | needed | needed | needed |
| Natasha2 | Õ(ε^{-3.25}) (see Theorem 2) | needed | needed | needed |
| NEON + SGD [10, 46] | Õ(ε^{-4}) | needed | needed | needed |
| cubic Newton [44] | Õ(ε^{-3.5}) | needed | needed | needed |
| SGD5 [5] | Õ(ε^{-3.5}) | needed | needed | needed |
| NEON + SCSG [10, 46] | Õ(ε^{-3.333}) | needed | needed | needed |

*Table 1: Comparison of online methods for finding ‖∇f(x)‖ ≤ ε. Following tradition, in these complexity bounds we treat the variance and smoothness parameters as constants, and only show the dependency on n, d, ε and the bounded nonconvexity parameter σ ∈ (0, 1). Results citing [5, 10, 44, 46] appeared after this paper.*

**Remark 1.** A variance bound is necessary for online methods.
**Remark 2.** Lipschitz smoothness is necessary for achieving even approximate stationary points.
**Remark 3.** Second-order smoothness is necessary for achieving approximate local minima.

## 2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, ..., n} uniformly at random. We denote by ∇f(x) the gradient of function f if it is differentiable, and ∂f(x) any subgradient if f is only Lipschitz continuous. We denote by 𝟙[event] the indicator function of probabilistic events. We denote by ‖A‖₂ the spectral norm of matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD); therefore, A ⪰ σI if and only if all eigenvalues of A are no less than σ. We denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalues of a symmetric matrix A.

**Definition 2.1.** For a function f : ℝ^d → ℝ:
- f is σ-strongly convex if for all x, y ∈ ℝ^d it satisfies f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/2)‖x − y‖².
- f is of σ-bounded nonconvexity⁶ (or σ-nonconvex for short) if for all x, y ∈ ℝ^d it satisfies f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ − (σ/2)‖x − y‖².
- f is L-Lipschitz smooth (or L-smooth for short) if for all x, y ∈ ℝ^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
- f is second-order L₂-Lipschitz smooth (or L₂-second-order smooth for short) if for all x, y ∈ ℝ^d, ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.

These definitions have other equivalent forms; see the textbook [33].

**Definition 2.2.** For a composite function F(x) = ψ(x) + f(x) where ψ(x) is proper convex, given a parameter η > 0, the gradient mapping of F(·) at point x is
  G_{F,η}(x) := (1/η)·(x − x′), where x′ = arg min_y { ψ(y) + ⟨∇f(x), y⟩ + (1/(2η))‖y − x‖² }.
In particular, if ψ(·) ≡ 0, then G_{F,η}(x) ≡ ∇f(x).

## 3 Natasha 1.5: Finding Approximate Stationary Points

We first make a detour to study how to find approximate stationary points using only first-order information. A point x ∈ ℝ^d is an ε-approximate stationary point⁷ of f(x) if it satisfies ‖∇f(x)‖ ≤ ε. Let the gradient complexity T be the number of computations of ∇f_i(x).

⁶Previous authors also refer to this notion as "approximately convex", "almost convex", "hypo-convex", "semi-convex", or "weakly convex". We call it σ-nonconvex to stress the point that σ can be as large as L (any L-smooth function is automatically L-nonconvex).
⁷Historically, in the first-order literature, x is called ε-approximate if ‖∇f(x)‖² ≤ ε; in the second-order literature, x is ε-approximate if ‖∇f(x)‖ ≤ ε. We adopt the latter notion following Polyak and Nesterov [34, 36].

*Figure 3: Comparison of first-order methods for finding ε-approximate stationary points of a σ-nonconvex function: (a) offline first-order methods (repeatSVRG, Natasha1); (b) online first-order methods (including SGD4). For simplicity, in the plots we let L = 1 and V = 1. The results SGD2/3/4 appeared after this work.*

Before 2015, nonconvex first-order methods gave rise to two convergence rates: SGD converges in T = O(ε^{-4}) and GD converges in T = O(nε^{-2}). The proofs of both are simple (see Appendix B for completeness). In particular, the convergence of SGD relies on two minimal assumptions:

- ∇f(x) has bounded variance V, meaning E_i[‖∇f_i(x) − ∇f(x)‖²] ≤ V, and   (A1)
- f(x) is L-Lipschitz smooth, meaning ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.   (A2′)

**Remark 3.1.** Both assumptions are necessary for designing online algorithms that find stationary points.⁸ For offline algorithms like GD, the first assumption is not needed.

Since 2016, the convergence rates have been improved to T = O(n + n^{2/3}ε^{-2}) for offline methods [6, 38], and to T = O(ε^{-10/3}) for online algorithms [30]. Both results are based on the SVRG (stochastic variance reduced gradient) method, and additionally assume (note that (A2) implies (A2′)):

- each f_i(x) is L-Lipschitz smooth.   (A2)

Lei et al. [30] gave their algorithm a new name, SCSG (stochastically controlled stochastic gradient).

**Bounded Nonconvexity.** In recent works [3, 13], it has been proposed to study a more refined convergence rate, by assuming that f(x) is of σ-bounded nonconvexity (or σ-nonconvex), meaning

- all the eigenvalues of ∇²f(x) lie in [−σ, L]   (A3)

for some σ ∈ (0, L]. This parameter σ is analogous to the strong-convexity parameter µ in convex optimization, where all the eigenvalues of ∇²f(x) lie in [µ, L] for some µ > 0. In our illustrative process of swinging by a saddle point, the function inside the safe zone (see Figure 2(b)) is also of bounded nonconvexity.
Since a larger σ means the function is more nonconvex and thus harder to optimize, can we design algorithms with gradient complexity T that is an increasing function of σ?

**Remark 3.2.** Most methods (SGD, SCSG, SVRG and GD) do not run faster in theory even if σ < L.

In the offline setting, two methods are known to make use of the parameter σ. One is repeatSVRG, implicitly in [13] and formally in [3]; the other is Natasha1 [3]. repeatSVRG performs better when σ ≤ L/√n and Natasha1 performs better when σ ≥ L/√n; see Figure 3(a) and Table 2. Before this work, no online method was known to take advantage of σ.

### 3.1 Our Theorem

We show that, under (A1), (A2) and (A3), one can non-trivially extend Natasha1 to an online version, taking advantage of σ and achieving a better complexity than SCSG. Let ∆_f be any upper bound on f(x₀) − f(x*), where x₀ is the starting point. In this section, to present the simplest results, we use the big-O notation to hide the dependency on ∆_f and V; in Section 6, we shall add back such dependency as well as support the existence of a proximal term (that is, minimizing ψ(x) + f(x) where ψ(x) is a proper convex simple function). Under such simplified notations, our main theorem can be stated as follows.

⁸For instance, if the variance V is unbounded, we cannot even tell whether a point x satisfies ‖∇f(x)‖ ≤ ε using finitely many samples. Also, if f(x) is not Lipschitz smooth, it may contain sharp turning points (e.g., behave like the absolute value function |x|); in this case, finding ‖∇f(x)‖ ≤ ε can be as hard as finding ∇f(x) = 0, and is NP-hard in general.

**Theorem 1 (simple).** Under (A1), (A2) and (A3), using the big-O notation to hide the dependency on ∆_f and V, for every ε ∈ (0, σ/L], letting B = Θ(1/ε²), T = Θ(L^{2/3}σ^{1/3}/ε^{10/3}) and α = Θ(ε^{4/3}/(L^{2/3}σ^{1/3})), Natasha1.5(f, x∅, B, T/B, α) outputs a point x_out with E[‖∇f(x_out)‖] ≤ ε, and needs O(T) computations of stochastic gradients. (See also Figure 3(b).)

We emphasize that the additional factor σ^{1/3} in the numerator of T shall become our key to achieving a faster algorithm for finding approximate local minima in Section 4. Also, if the requirement ε ≤ σ/L is not satisfied, one can replace σ with εL; accordingly, T becomes O(L/ε³ + L^{2/3}σ^{1/3}/ε^{10/3}).

We note that the SGD4 method of [5] (which appeared after this paper) achieves T = Õ(ε^{-2} + σε^{-4}). It is better than Natasha1.5 only when σ ≤ εL. We compare them in Figure 3(b), and emphasize that it is necessary to use Natasha1.5 (rather than SGD4) to design Natasha2 in the next section.

**Algorithm 1** Natasha1.5(F, x∅, B, T′, α)
Input: f(·) = (1/n) Σ_{i=1}^n f_i(x), starting vector x∅, epoch length B ∈ [n], epoch count T′ ≥ 1, learning rate α > 0.
1: x̂ ← x∅; p ← Θ((σ/(εL))^{2/3}); m ← B/p; X ← [];
2: for k ← 1 to T′ do   ▷ T′ epochs, each of length B
3: x̃ ← x̂; µ ← (1/B) Σ_{i∈S} ∇f_i(x̃), where S is a uniform random subset of [n] with |S| = B;
4: for s ← 0 to p−1 do   ▷ p sub-epochs, each of length m
5: x₀ ← x̂; X ← [X, x̂];
6: for t ← 0 to m−1 do
7: ∇̃ ← ∇f_i(x_t) − ∇f_i(x̃) + µ + 2σ(x_t − x̂), where i ∈_R [n];
8: x_{t+1} ← x_t − α∇̃;
9: end for
10: x̂ ← a random choice from {x₀, x₁, ..., x_{m−1}};   ▷ in practice, choose the average
11: end for
12: end for
13: ŷ ← a random vector in X;   ▷ in practice, simply return ŷ
14: define g(x) := f(x) + σ‖x − ŷ‖² and use convex SGD to minimize g(x) for T_sgd = T′B iterations;
15: return x_out, the output of SGD.
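To make the inner loop of Algorithm 1 concrete, here is a minimal Python sketch of one epoch (lines 3 to 12): the mini-batch snapshot gradient µ, the variance-reduced estimator of line 7 with the retraction term 2σ(x_t − x̂), and the descent step of line 8. The oracle interface `grad_i(i, x)` and the default parameters are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np

def natasha15_epoch(grad_i, n, x_hat, B, p, m, alpha, sigma, rng=None):
    """One epoch of Natasha1.5 (lines 3-12 of Algorithm 1), as a sketch.

    grad_i(i, x): stochastic gradient oracle returning grad f_i(x).
    Returns the list X of sub-epoch centers (from which Algorithm 1 samples y-hat).
    """
    rng = np.random.default_rng() if rng is None else rng
    S = rng.choice(n, size=B, replace=False)                 # snapshot mini-batch
    x_tilde = x_hat.copy()
    mu = np.mean([grad_i(i, x_tilde) for i in S], axis=0)    # line 3: snapshot gradient
    X = []
    for _ in range(p):                                       # p sub-epochs of length m
        x = x_hat.copy()
        X.append(x_hat.copy())                               # line 5
        iterates = []
        for _ in range(m):
            i = rng.integers(n)
            # line 7: variance-reduced gradient plus the retraction 2*sigma*(x - x_hat)
            g = grad_i(i, x) - grad_i(i, x_tilde) + mu + 2 * sigma * (x - x_hat)
            iterates.append(x.copy())
            x = x - alpha * g                                # line 8
        x_hat = iterates[rng.integers(m)]                    # line 10 (or use the average)
    return X
```

The retraction term 2σ(x_t − x̂) is what later allows Natasha2 to call this routine with a single epoch (T′ = 1) on the regularized function F^k.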
**Extension.** In fact, we prove Theorem 1 in a more general proximal setting: minimizing F(x) := f(x) + ψ(x), where ψ(x) is a proper convex function that can be non-smooth. For instance, if ψ(x) is the indicator function of a convex set, then Problem (1.1) becomes constrained minimization; and if ψ(x) = ‖x‖₁, we encourage sparsity. At a first reading of its proof, one can assume ψ(x) ≡ 0.

### 3.2 Our Intuition

We first recall the main idea of the SVRG method [28, 48], which is an offline algorithm. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x̃ for each epoch, and computes the full gradient ∇f(x̃) only for snapshots. Then, in each iteration t at point x_t, SVRG defines the gradient estimator ∇̃f(x_t) := ∇f_i(x_t) − ∇f_i(x̃) + ∇f(x̃), which satisfies E_i[∇̃f(x_t)] = ∇f(x_t), and performs the proximal update x_{t+1} ← x_t − α∇̃f(x_t) for a learning rate α.

For minimizing nonconvex functions, SVRG does not take advantage of the parameter σ even if the learning rate is adapted to σ. This is because SVRG (and in fact SGD and GD too) relies on a gradient-descent analysis to argue for an objective decrease per iteration, and this analysis is blind to σ.⁹

⁹These analyses argue for an objective decrease per iteration of the form f(x_t) − E[f(x_{t+1})] ≥ Ω(α) · E[‖∇̃f(x_t)‖²]. Unlike a mirror-descent analysis, this inequality cannot take advantage of the bounded nonconvexity parameter of f(x). For readers interested in the difference between gradient and mirror descent, see [11].

The prior work Natasha1 takes advantage of σ. Natasha1 is similar to SVRG, but it further divides each epoch into sub-epochs, each with a starting vector x̂. Then, it replaces ∇̃f(x_t) with ∇̃f(x_t) + 2σ(x_t − x̂). This is equivalent to replacing f(x) with f(x) + σ‖x − x̂‖², where the center x̂ changes every sub-epoch. We view this additional term 2σ(x_t − x̂) as a type of retraction: conceptually, it stabilizes the algorithm by moving a bit in the backward direction; technically, it enables us to perform only a mirror-descent type of analysis, and thus bypass the above issue of SVRG.

**Our Algorithm.** Both SVRG and Natasha1 are offline methods, because the gradient estimator requires the full gradient computation ∇f(x̃) at snapshots x̃. A natural fix, originally studied by practitioners but first formally analyzed by Lei et al. [30], is to replace the computation of ∇f(x̃) with (1/|S|) Σ_{i∈S} ∇f_i(x̃) for a random batch S ⊆ [n] with fixed cardinality B := |S| ≪ n. This allows us to shorten the epoch length from n to B, thus turning SVRG and Natasha1 into online methods.

How large should we pick B? By the Chernoff bound, we wish B ≈ 1/ε² because our desired accuracy is ε. One can thus hope to replace the parameter n in the complexities of SVRG and Natasha1 (ignoring the dependency on L), namely T = O(n + n^{2/3}/ε²) and T = O(n + n^{1/2}/ε² + σ^{1/3}n^{2/3}/ε²), with B ≈ 1/ε². This wishful thinking gives T = O(ε^{-10/3}) and T = O(ε^{-3} + σ^{1/3}ε^{-10/3}), which are exactly the results achieved by SCSG [30] and to be achieved by our new Natasha1.5.

Unfortunately, the Chernoff bound itself is not sufficient for obtaining such rates. Let e := (1/|S|) Σ_{i∈S} ∇f_i(x̃) − ∇f(x̃) denote the bias of this new gradient estimator. When performing iterative updates, this bias e gives rise to two types of error terms: first-order error terms of the form ⟨e, x − y⟩ and a second-order error term ‖e‖². The Chernoff bound ensures that the second-order error E_S[‖e‖²] ≤ ε² is bounded. However, the first-order error terms are the true bottlenecks.

In the SCSG method, Lei et al. [30] carefully performed updates so that all first-order errors cancel out. To the best of our knowledge, this analysis cannot take advantage of σ even if the algorithm knows σ. (Again, for experts, this is because SCSG is based on a gradient-descent type of analysis rather than mirror descent.)
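As a quick numerical illustration of the batch-size heuristic B ≈ 1/ε² above, the following sketch estimates E_S[‖e‖²] for the mini-batch snapshot gradient on a toy finite-sum problem. The toy objective and all constants are assumptions chosen only for illustration.

```python
import numpy as np

# Toy finite-sum setup: f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i
# and grad f(x) = x - mean(c).  The per-sample spread plays the role of V.
rng = np.random.default_rng(0)
n, d = 100_000, 20
C = rng.standard_normal((n, d))            # centers c_i
x_tilde = rng.standard_normal(d)           # snapshot point
full_grad = x_tilde - C.mean(axis=0)       # exact grad f(x_tilde)

def batch_error_sq(B, trials=200):
    """Monte-Carlo estimate of E_S ||e||^2, where e = batch gradient - full gradient."""
    errs = []
    for _ in range(trials):
        S = rng.choice(n, size=B, replace=False)
        batch_grad = x_tilde - C[S].mean(axis=0)
        errs.append(np.sum((batch_grad - full_grad) ** 2))
    return np.mean(errs)

for B in [100, 400, 1600]:
    print(B, batch_error_sq(B))   # shrinks roughly like V / B
```

The printed values shrink roughly like V/B, which is why a batch of size about 1/ε² suffices to push the second-order error below ε².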
In Natasha1.5, we use the aforementioned retraction to ensure that all points in a single sub-epoch are close to each other (based on a mirror-descent type of analysis). Then, we use Young's inequality to bound ⟨e, x − y⟩ by (1/2)‖e‖² + (1/2)‖x − y‖². In this bound, ‖e‖² is already controlled by Chernoff concentration, and ‖x − y‖² can also be bounded as long as x and y are within the same sub-epoch.

This summarizes the high-level technical contribution of Natasha1.5. We formally state Natasha1.5 in Algorithm 1, which uses the big-O notation to hide the dependency on L, ∆_f, and V. The more general code that takes care of the proximal term is Algorithm 3 in Section 6.

## 4 Natasha 2: Finding Approximate Local Minima

Stochastic gradient descent (SGD) finds approximate local minima [22] under (A1), (A2) and an additional assumption (A4):

- f(x) is second-order L₂-Lipschitz smooth, meaning ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.   (A4)

**Remark 4.1.** (A4) is necessary to make the task of finding approximate local minima meaningful, for the same reason that Lipschitz smoothness was needed for finding stationary points.

**Definition 4.2.** We say x is an (ε, δ)-approximate local minimum of f(x) if¹⁰
  ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI,
or an ε-approximate local minimum if it is an (ε, ε^{1/C})-approximate local minimum for some constant C ≥ 1.

¹⁰The notation ∇²f(x) ⪰ −δI means that all the eigenvalues of ∇²f(x) are at least −δ.

Before our work, Ge et al. [22] was the only result giving a provable online complexity for finding approximate local minima. Other previous results, including SVRG, SCSG, Natasha1, and even Natasha1.5, do not find approximate local minima and may get stuck at saddle points.¹¹ Ge et al. [22] showed that, hiding factors that depend on L, L₂ and V, SGD finds an ε-approximate local minimum of f(x) in gradient complexity T = O(poly(d)·ε^{-4}). This ε^{-4} factor seems necessary, since SGD needs T ≥ Ω(ε^{-4}) just for finding stationary points (see Appendix B and Table 1).

**Remark 4.3.** Offline methods are often studied under (ε, ε^{1/2})-approximate local minima. In the online setting, Ge et al. [22] used (ε, ε^{1/4})-approximate local minima, thus giving T = O(poly(d)/ε⁴ + poly(d)/δ¹⁶). In general, it is preferable to treat ε and δ separately; nevertheless, (ε, ε^{1/C})-approximate local minima are always better than ε-approximate stationary points.

### 4.1 Our Theorem

We propose a new method, Natasha2^full, which, very informally speaking, alternately finds approximate stationary points of f(x) using Natasha1.5, or finds negative curvature of the Hessian ∇²f(x) using Oja's online eigenvector algorithm. In this section, we define the gradient complexity T to be the number of stochastic gradient computations plus Hessian-vector products. Let ∆_f be any upper bound on f(x₀) − f(x*), where x₀ is the starting point. In this section, to present the simplest results, we use the big-O notation to hide the dependency on L, L₂, ∆_f, and V; in Section 7, we shall add back such dependency for a more general description of the algorithm. Our main result can be stated as follows.

**Theorem 2 (informal).** Under (A1), (A2) and (A4), for any ε ∈ (0, 1) and δ ∈ (0, ε^{1/4}), Natasha2(f, y₀, ε, δ) outputs a point x_out such that, with probability at least 2/3, ‖∇f(x_out)‖ ≤ ε and ∇²f(x_out) ⪰ −δI. Furthermore, its gradient complexity is T = Õ(1/δ⁵ + 1/(δε³)).¹²

**Remark 4.4.** If δ > ε^{1/4}, we can replace it with δ = ε^{1/4}. Therefore, T = Õ(1/δ⁵ + 1/(δε³) + 1/ε^{3.25}).
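For intuition about Definition 4.2, the following sketch certifies whether a given point of a small toy function is an (ε, δ)-approximate local minimum by checking the gradient norm and the smallest Hessian eigenvalue directly. This exact eigenvalue check is only feasible in low dimension and merely stands in for the stochastic test that Natasha2 performs with Oja's algorithm.

```python
import numpy as np

def is_approx_local_min(grad, hess, x, eps, delta):
    """Check Definition 4.2: ||grad f(x)|| <= eps and hess f(x) >= -delta * I."""
    g_ok = np.linalg.norm(grad(x)) <= eps
    h_ok = np.linalg.eigvalsh(hess(x)).min() >= -delta
    return g_ok and h_ok

# Toy example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
grad = lambda v: np.array([2 * v[0], -2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, -2.0]])

print(is_approx_local_min(grad, hess, np.zeros(2), eps=0.1, delta=0.1))  # False: saddle
print(is_approx_local_min(grad, hess, np.zeros(2), eps=0.1, delta=3.0))  # True only because delta is huge
```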
**Remark 4.5.** The follow-up work [10] replaced the Hessian-vector products in Natasha2 with only stochastic gradient computations, turning Natasha2 into a purely first-order method. That paper is built on ours, and thus all the proofs of this paper are still needed.

**Corollary 4.6.** T = Õ(ε^{-3.25}) for finding (ε, ε^{1/4})-approximate local minima. This is better than the T = O(ε^{-10/3}) of SCSG for finding only ε-approximate stationary points.

**Corollary 4.7.** T = Õ(ε^{-3.5}) for finding (ε, ε^{1/2})-approximate local minima. This was not known before, and has been matched by several follow-up works using different algorithms [5, 10, 44, 46].

### 4.2 Our Intuition

It is known that the problem of finding (ε, δ)-approximate local minima, at a high level, reduces to (repeatedly) finding ε-approximate stationary points of an O(δ)-nonconvex function [1, 13]. Specifically, Carmon et al. [13] proposed the following procedure. In every iteration at point y_k, detect whether the minimum eigenvalue of ∇²f(y_k) is below −δ:

- if yes, find the minimum eigenvector of ∇²f(y_k) approximately and move in this direction;
- if no, let F^k(x) := f(x) + L·(max{0, ‖x − y_k‖ − δ/L₂})², which can be proven to be 5L-smooth and 3δ-nonconvex, and then find an ε-approximate stationary point of F^k(x) to move there.

Intuitively, F^k(x) penalizes us for moving out of the safe zone {x : ‖x − y_k‖ ≤ δ/L₂}.

¹¹These methods are based on the variance-reduction technique to reduce the random noise of SGD. They have been criticized by practitioners for performing more poorly than SGD on training neural networks, because the noise of SGD allows it to escape from saddle points; variance-reduction based methods have less noise and thus cannot escape from saddle points.
¹²Throughout this paper, we use the Õ notation to hide at most one logarithmic factor in all the parameters (namely, n, d, L, L₂, V, 1/ε, 1/δ).

**Algorithm 2** Natasha2(f, y₀, ε, δ)
Input: function f(x) = (1/n) Σ_{i=1}^n f_i(x), starting vector y₀, target accuracies ε > 0 and δ > 0.
1: if δ ≤ ε^{1/3} then L̃ = σ̃ ← Θ(ε^{1/3}/δ) ≥ 1;   ▷ the boundary case for large L₂
2: else L̃ ← 1 and σ̃ ← Θ(ε/δ³) ∈ [δ, 1];
3: X ← [];
4: for k ← 0, 1, 2, ... do
5: apply Oja's algorithm to find a minimum eigenvector v of ∇²f(y_k), using Θ̃(1/δ²) iterations;   ▷ see Lemma 5.3
6: if some v ∈ ℝ^d is found such that v⊤∇²f(y_k)v ≤ −δ/2 then
7: y_{k+1} ← y_k ± (δ/L₂)·v, where the sign is random;
8: else   ▷ it satisfies ∇²f(y_k) ⪰ −δI
9: F^k(x) := f(x) + L·(max{0, ‖x − y_k‖ − δ/L₂})²;
10: run Natasha1.5(F^k, y_k, Θ(ε^{-2}), 1, Θ(εδ));   ▷ F^k(·) is L̃-smooth and σ̃-nonconvex
11: let ŷ_k, y_{k+1} be the vectors ŷ and x̂ when Line 13 is reached in Natasha1.5;
12: X ← [X, (y_k, ŷ_k)];
13: break the for loop if we have performed Θ(1/(δε)) first-order steps;
14: end if
15: end for
16: (y, ŷ) ← a random pair in X;   ▷ in practice, simply output ŷ_k
17: define the convex function g(x) := f(x) + L·(max{0, ‖x − y‖ − δ/L₂})² + σ̃‖x − ŷ‖²;
18: use SGD to minimize g(x) for Θ̃(1/ε²) steps and output x_out.

Previously, it was thought necessary to achieve high accuracy for both tasks above. This is why researchers had only been able to design offline methods: in particular, the shift-and-invert method [21] was applied to find the minimum eigenvector, and repeatSVRG was applied to find a stationary point of F^k(x).¹³

In this paper, we apply efficient online algorithms to the two tasks: namely, Oja's algorithm (see Section 5.1) for finding minimum eigenvectors, and our new Natasha1.5 algorithm (see Section 3.2) for finding stationary points. More specifically, with Oja's algorithm we only decide whether there is an eigenvalue below the threshold −δ/2, or conclude that the Hessian has all eigenvalues above −δ; this can be done in an online fashion using O(δ^{-2}) Hessian-vector products (with high probability). As for Natasha1.5, we only apply it for a single epoch of length B = Θ(ε^{-2}). Conceptually, this makes the above procedure online, with a running complexity independent of n.
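Putting the pieces together, here is a minimal sketch of the outer loop of Algorithm 2 under the simplification L = L₂ = 1: at each point, look for a direction of curvature below −δ/2 with an Oja-style routine; if one is found, take a random-sign ±δ step along it, and otherwise run a single epoch of first-order steps on the regularized function F^k. The inner first-order loop is a plain-SGD stand-in for Natasha1.5, and the oracle names, stopping rule, and constants are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def natasha2_sketch(grad_i, hvp_i, n, d, y0, eps, delta, max_outer=50, rng=None):
    """Sketch of Natasha2's outer loop with L = L2 = 1 (not the paper's exact constants).

    grad_i(i, x): stochastic gradient of f_i at x.
    hvp_i(i, x, w): Hessian-vector product of f_i at x applied to w.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = y0.copy()
    for _ in range(max_outer):
        # --- negative-curvature search (stand-in for Oja's algorithm, lines 5-6) ---
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(int(10 / delta**2)):
            i = rng.integers(n)
            v -= 0.1 * hvp_i(i, y, v)        # Oja-style step amplifies negative curvature
            v /= np.linalg.norm(v)
        curv = np.mean([v @ hvp_i(rng.integers(n), y, v) for _ in range(200)])
        if curv <= -delta / 2:
            y = y + rng.choice([-1.0, 1.0]) * delta * v      # line 7: move along negative curvature
            continue
        # --- no negative curvature found: one epoch on F^k (stand-in for Natasha1.5, line 10) ---
        def grad_Fk(i, x, yk=y.copy()):
            # stochastic gradient of f_i(x) + (max{0, ||x - yk|| - delta})^2
            r = np.linalg.norm(x - yk)
            pen = 2 * max(0.0, r - delta) * (x - yk) / max(r, 1e-12)
            return grad_i(i, x) + pen
        x = y.copy()
        for _ in range(int(1 / eps**2)):                     # epoch length Theta(1/eps^2)
            x = x - eps * delta * grad_Fk(rng.integers(n), x)  # learning rate Theta(eps * delta)
        y = x
    return y
```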
Unfortunately, technical issues arise in this wishful thinking. Most notably, the above process finishes only if Natasha1.5 finds an approximate stationary point x of F^k(x) that is also inside the safe zone {x : ‖x − y_k‖ ≤ δ/L₂}. This is because F^k(x) = f(x) inside the safe zone, and therefore ‖∇F^k(x)‖ ≤ ε also implies ‖∇f(x)‖ ≤ 2ε. What can we do if we move out of the safe zone?

To tackle this case, we show an additional property of Natasha1.5 (see Lemma 6.5): the amount of objective decrease, i.e., f(y_k) − f(x) when x moves out of the safe zone, must be proportional to the distance ‖x − y_k‖² that we travel in space. Therefore, if x moves out of the safe zone, then we decrease the objective sufficiently, which is also a good case.

This summarizes some of the high-level technical ingredients of Natasha2. We formally state Natasha2 in Algorithm 2, which uses the big-O notation to hide the dependency on L, L₂, V and ∆_f. The more general code that takes care of all the parameters can be found in Algorithm 5 of Section 7.

Finally, we stress that although we borrowed the construction of f(x) + L·(max{0, ‖x − y_k‖ − δ/L₂})² from the offline algorithm of Carmon et al. [13], our Natasha2 algorithm and analysis are different from theirs in all other aspects.

¹³repeatSVRG is an offline algorithm that finds an ε-approximate stationary point of a function f(x) that is σ-nonconvex. It is divided into stages: in each stage t, it considers the modified function f_t(x) := f(x) + σ‖x − x_t‖², applies the accelerated SVRG method to minimize f_t(x), and then moves to x_{t+1}, a sufficiently accurate minimizer of f_t(x).

## References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding Approximate Local Minima for Nonconvex Optimization in Linear Time. In STOC, 2017.
[2] Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In STOC, 2017.
[3] Zeyuan Allen-Zhu. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter. In ICML, 2017.
[4] Zeyuan Allen-Zhu. Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization. In ICML, 2018.
[5] Zeyuan Allen-Zhu. How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD. In NeurIPS, 2018.
[6] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. In ICML, 2016.
[7] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain. In NeurIPS, 2016.
[8] Zeyuan Allen-Zhu and Yuanzhi Li. First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate. In FOCS, 2017.
[9] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU. In ICML, 2017.
[10] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding Local Minima via First-Order Oracles. In NeurIPS, 2018.
[11] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, 2017.
[12] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, Efficient, and Neural Algorithms for Sparse Coding. In COLT, 2015.
[13] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford.
Accelerated Methods for Non-Convex Optimization. ArXiv e-prints, abs/1611.00756, November 2016.
[14] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. In ICML, 2017.
[15] Yuxin Chen and Emmanuel Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739-747, 2015.
[16] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[17] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NeurIPS, pages 2933-2941, 2014.
[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014.
[19] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
[20] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, 2015.
[21] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.
[22] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.
[23] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pages 1-26, February 2015. ISSN 0025-5610.
[24] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.
[25] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pages 1833-1841, 2016.
[26] Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy, and Martin Takáč. Distributed Hessian-Free Optimization for Deep Neural Network. ArXiv e-prints, abs/1606.00511, June 2016.
[27] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to Escape Saddle Points Efficiently. In ICML, 2017.
[28] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NeurIPS 2013, pages 315-323, 2013.
[29] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ArXiv e-prints, abs/1412.6980, December 2014.
[30] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Nonconvex Finite-Sum Optimization Via SCSG Methods. In NeurIPS, 2017.
[31] Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In NeurIPS, 2017.
[32] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NeurIPS, 2015.
[33] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004.
ISBN 1402075537.
[34] Yurii Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.
[35] Yurii Nesterov. How to make the gradients small. Optima, 88:10-11, 2012.
[36] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[37] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, 1982.
[38] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.
[39] Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alexander J. Smola. A generic approach for escaping saddle points. ArXiv e-prints, abs/1709.01434, September 2017.
[40] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. ArXiv e-prints, abs/1309.2388, September 2013.
[41] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012. ISSN 1935-8237.
[42] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.
[43] Ruoyu Sun and Zhi-Quan Luo. Guaranteed Matrix Completion via Nonconvex Factorization. In FOCS, 2015.
[44] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic Cubic Regularization for Fast Nonconvex Optimization. ArXiv e-prints, abs/1711.02838, November 2017.
[45] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.
[46] Yi Xu and Tianbao Yang. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time. ArXiv e-prints, abs/1711.01944, November 2017.
[47] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. ArXiv e-prints, abs/1212.5701, December 2012.
[48] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980-988, 2013.