# How to Escape Saddle Points Efficiently

Chi Jin¹, Rong Ge², Praneeth Netrapalli³, Sham M. Kakade⁴, Michael I. Jordan¹

¹University of California, Berkeley. ²Duke University. ³Microsoft Research India. ⁴University of Washington. Correspondence to: Chi Jin.

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

**Abstract.** This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost dimension-free). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.

## 1. Introduction

Given a function $f:\mathbb{R}^d \to \mathbb{R}$, gradient descent aims to minimize the function via the following iteration:
$$x_{t+1} = x_t - \eta \nabla f(x_t),$$
where $\eta > 0$ is a step size. Gradient descent and its variants (e.g., stochastic gradient descent) are widely used in machine learning applications due to their favorable computational properties. This is notably true in the deep learning setting, where gradients can be computed efficiently via backpropagation (Rumelhart et al., 1988). Gradient descent is especially useful in high-dimensional settings because the number of iterations required to reach a point with small gradient is independent of the dimension ("dimension-free"). More precisely, for a function that is $\ell$-gradient Lipschitz (see Definition 1), it is well known that gradient descent finds an $\epsilon$-first-order stationary point (i.e., a point $x$ with $\|\nabla f(x)\| \le \epsilon$) within $\ell(f(x_0) - f^\star)/\epsilon^2$ iterations (Nesterov, 1998), where $x_0$ is the initial point and $f^\star$ is the optimal value of $f$. This bound does not depend on the dimension of $x$. In convex optimization, finding an $\epsilon$-first-order stationary point is equivalent to finding an approximate global optimum.

In non-convex settings, however, convergence to first-order stationary points is not satisfactory. For non-convex functions, first-order stationary points can be global minima, local minima, saddle points or even local maxima. Finding a global minimum can be hard, but fortunately, for many non-convex problems, it is sufficient to find a local minimum. Indeed, a line of recent results shows that, in many problems of interest, all local minima are global minima (e.g., in tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2016a), phase retrieval (Sun et al., 2016b), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2016), matrix completion (Ge et al., 2016), and certain classes of deep neural networks (Kawaguchi, 2016)). Moreover, there are suggestions that in more general deep networks most of the local minima are as good as global minima (Choromanska et al., 2014).
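To make the iteration above concrete, here is a minimal Python sketch of plain gradient descent with the first-order stopping rule; the step size, tolerance, and toy quadratic objective are illustrative choices of ours, not from the paper:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, eps, max_iters=100_000):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad_f(x_t).
    Stops at an eps-first-order stationary point, i.e. ||grad f(x)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:   # eps-first-order stationary point reached
            return x, t
        x = x - eta * g
    return x, max_iters

# Toy example: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_out, iters = gradient_descent(lambda x: x, x0=np.ones(10), eta=0.5, eps=1e-6)
```

The iteration count of such a loop scales with $1/\epsilon^2$ and $\ell$ but not with $d$: the "dimension-free" behavior discussed above.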
On the other hand, saddle points (and local maxima) can correspond to highly suboptimal solutions in many problems (see, e.g., Jain et al., 2015; Sun et al., 2016b). Furthermore, Dauphin et al. (2014) argue that saddle points are ubiquitous in high-dimensional, non-convex optimization problems, and are thus the main bottleneck in training neural networks. Standard analysis of gradient descent cannot distinguish between saddle points and local minima, leaving open the possibility that gradient descent may get stuck at saddle points, either asymptotically or for a sufficiently long time so as to make training times for arriving at a local minimum infeasible. Ge et al. (2015) showed that by adding noise at each step, gradient descent can escape all saddle points in a polynomial number of iterations, provided that the objective function satisfies the strict saddle property (see Assumption A2). Lee et al. (2016) proved that under similar conditions, gradient descent with random initialization avoids saddle points even without adding noise. However, this result does not bound the number of steps needed to reach a local minimum.

**Algorithm 1** Perturbed Gradient Descent (Meta-algorithm)

```
for t = 0, 1, ... do
    if perturbation condition holds then
        x_t ← x_t + ξ_t,   ξ_t sampled uniformly from B_0(r)
    x_{t+1} ← x_t − η∇f(x_t)
```

Previous work explains why gradient descent avoids saddle points in the non-convex setting, but not why it is efficient: all of these results have runtime guarantees with high polynomial dependence on the dimension $d$. For instance, the number of iterations required in Ge et al. (2015) is at least $\Omega(d^4)$, which is prohibitive in high-dimensional settings such as deep learning (typically with millions of parameters). It is therefore natural to ask whether gradient-descent-type algorithms are fundamentally slow at escaping saddle points, or whether gradient descent is in fact efficient and the gap lies in our theoretical understanding. This motivates the following question:

*Can gradient descent escape saddle points and converge to local minima in a number of iterations that is (almost) dimension-free?*

In order to answer this question formally, this paper investigates the complexity of finding $\epsilon$-second-order stationary points. For $\rho$-Hessian Lipschitz functions (see Definition 5), these points are defined as (Nesterov & Polyak, 2006):
$$\|\nabla f(x)\| \le \epsilon, \quad\text{and}\quad \lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho\epsilon}.$$
Under the assumption that all saddle points are strict (i.e., for any saddle point $x_s$, $\lambda_{\min}(\nabla^2 f(x_s)) < 0$), all second-order stationary points ($\epsilon = 0$) are local minima. Therefore, convergence to second-order stationary points is equivalent to convergence to local minima.

This paper studies a simple variant of gradient descent (with phasic perturbations; see Algorithm 1). For $\ell$-smooth functions that are also Hessian Lipschitz, we show that perturbed gradient descent converges to an $\epsilon$-second-order stationary point in $\tilde{O}(\ell(f(x_0) - f^\star)/\epsilon^2)$ iterations, where $\tilde{O}(\cdot)$ hides polylog factors. This guarantee is almost dimension-free (up to $\mathrm{polylog}(d)$ factors), answering the above highlighted question affirmatively. Note that this rate is exactly the same as the well-known convergence rate of gradient descent to first-order stationary points (Nesterov, 1998), up to log factors. Furthermore, our analysis admits a maximal step size of up to $\Omega(1/\ell)$, which is the same as that in analyses for first-order stationary points. As many real learning problems present strong local geometric properties, similar to strong convexity in the global setting (see, e.g.,
Bhojanapalli et al., 2016; Sun & Luo, 2016; Zheng & Lafferty, 2016), it is important to note that our analysis naturally takes advantage of such local structure. We show that when local strong convexity is present, the $\epsilon$-dependence improves from a polynomial rate, $1/\epsilon^2$, to linear convergence, $\log(1/\epsilon)$. As an example, we show that sharp global convergence rates can be obtained for matrix factorization as a direct consequence of our analysis.

### 1.1. Our Contributions

This paper presents the first sharp analysis showing that (perturbed) gradient descent finds an approximate second-order stationary point with only poly-logarithmic dependence on dimension, thus escaping all saddle points efficiently. Our main technical contributions are as follows:

- For $\ell$-gradient Lipschitz, $\rho$-Hessian Lipschitz functions (possibly non-convex), gradient descent with appropriate perturbations finds an $\epsilon$-second-order stationary point in $\tilde{O}(\ell(f(x_0) - f^\star)/\epsilon^2)$ iterations. This rate matches the well-known convergence rate of gradient descent to first-order stationary points up to log factors.
- Under a strict-saddle condition (see Assumption A2), the same convergence result applies to local minima. This means that gradient descent can escape all saddle points with only logarithmic overhead in runtime.
- When the function has local structure, such as local strong convexity (see Assumption A3.a), the above results can be further improved to linear convergence. We give sharp rates that are comparable to previous problem-specific local analyses of gradient descent with smart initialization (see Section 1.2).

All the above results rely on a new characterization of the geometry around saddle points: the points from which gradient descent gets stuck at a saddle point constitute a thin "band." We develop novel techniques to bound the volume of this band. As a result, we can show that after a random perturbation the current point is very unlikely to be in the band; hence, efficient escape from the saddle point is possible (see Section 5).

### 1.2. Related Work

Over the past few years, there have been many problem-specific convergence results for non-convex optimization. One line of work requires a smart initialization algorithm to provide a coarse estimate lying inside a local neighborhood, from which popular local search algorithms enjoy fast local convergence (see, e.g., Netrapalli et al., 2013; Candès et al., 2015; Sun & Luo, 2016; Bhojanapalli et al., 2016). While there are not many results that show global convergence for non-convex problems, Jain et al. (2015) show that gradient descent yields global convergence rates for matrix square-root problems. Although these results give strong guarantees, the analyses are heavily tailored to specific problems, and it is unclear how to generalize them to a wider class of non-convex functions.

Table 1. Oracle models and iteration complexity for convergence to a second-order stationary point.

| Algorithm | Iterations | Oracle |
| --- | --- | --- |
| Ge et al. (2015) | $O(\mathrm{poly}(d/\epsilon))$ | Gradient |
| Levy (2016) | $O(d^3\,\mathrm{poly}(1/\epsilon))$ | Gradient |
| **This work** | $O(\log^4(d)/\epsilon^2)$ | Gradient |
| Agarwal et al. (2016) | $O(\log(d)/\epsilon^{7/4})$ | Hessian-vector |
| Carmon et al. (2016) | $O(\log(d)/\epsilon^{7/4})$ | Hessian-vector |
| Carmon & Duchi (2016) | $O(\log(d)/\epsilon^{2})$ | Hessian-vector |
| Nesterov & Polyak (2006) | $O(1/\epsilon^{1.5})$ | Hessian |
| Curtis et al. (2014) | $O(1/\epsilon^{1.5})$ | Hessian |

For general non-convex optimization, there are a few previous results on finding second-order stationary points.
These results can be divided into the following three categories, where, for simplicity of presentation, we only highlight dependence on dimension $d$ and $\epsilon$, assuming that all other problem parameters are constant from the point of view of iteration complexity:

**Hessian-based:** Traditionally, only second-order optimization methods were known to converge to second-order stationary points. These algorithms rely on computing the Hessian to distinguish between first- and second-order stationary points. Nesterov & Polyak (2006) designed a cubic regularization algorithm which converges to an $\epsilon$-second-order stationary point in $O(1/\epsilon^{1.5})$ iterations. Trust region algorithms (Curtis et al., 2014) can also achieve the same performance if the parameters are chosen carefully. These algorithms typically require the computation of the inverse of the full Hessian per iteration, which can be very expensive.

**Hessian-vector-product-based:** A number of recent papers have explored the possibility of using only Hessian-vector products instead of full Hessian information in order to find second-order stationary points. These algorithms require a Hessian-vector product oracle: given a function $f$, a point $x$ and a direction $u$, the oracle returns $\nabla^2 f(x)\cdot u$. Agarwal et al. (2016) and Carmon et al. (2016) presented accelerated algorithms that can find an $\epsilon$-second-order stationary point in $O(\log d/\epsilon^{7/4})$ steps. Also, Carmon & Duchi (2016) showed that by running gradient descent as a subroutine to solve the subproblem of cubic regularization (which requires a Hessian-vector product oracle), it is possible to find an $\epsilon$-second-order stationary point in $O(\log d/\epsilon^{2})$ iterations. In many applications such an oracle can be implemented efficiently, in roughly the same complexity as the gradient oracle. Also, when the function has a Hessian Lipschitz property, such an oracle can be approximated by taking the difference of gradients at two very close points (although this may suffer from numerical issues, and is thus seldom used in practice).

**Gradient-based:** Another recent line of work shows that it is possible to converge to a second-order stationary point without any use of the Hessian. These methods feature simple computation per iteration (only involving gradient operations), and are closest to the algorithms used in practice. Ge et al. (2015) showed that stochastic gradient descent can converge to a second-order stationary point in $\mathrm{poly}(d/\epsilon)$ iterations, where the polynomial dependence on $d$ is of order at least four. This was improved in Levy (2016) to $O(d^3\,\mathrm{poly}(1/\epsilon))$ using normalized gradient descent. The current paper improves on both results by showing that perturbed gradient descent can actually find an $\epsilon$-second-order stationary point in $O(\mathrm{polylog}(d)/\epsilon^2)$ steps, which matches the guarantee for converging to first-order stationary points up to polylog factors.

## 2. Preliminaries

In this section, we first introduce our notation, and then present some definitions and existing results in optimization which will be used later.

### 2.1. Notation

We use bold upper-case letters $\mathbf{A}, \mathbf{B}$ to denote matrices and bold lower-case letters $\mathbf{x}, \mathbf{y}$ to denote vectors. $A_{ij}$ denotes the $(i,j)$-th entry of matrix $\mathbf{A}$. For vectors we use $\|\cdot\|$ to denote the $\ell_2$-norm, and for matrices we use $\|\cdot\|$ and $\|\cdot\|_F$ to denote the spectral norm and Frobenius norm respectively. We use $\sigma_{\max}(\cdot)$, $\sigma_{\min}(\cdot)$, $\sigma_i(\cdot)$ to denote the largest, the smallest and the $i$-th largest singular values respectively, and $\lambda_{\max}(\cdot)$, $\lambda_{\min}(\cdot)$, $\lambda_i(\cdot)$ for the corresponding eigenvalues.
For a function $f:\mathbb{R}^d\to\mathbb{R}$, we use $\nabla f(\cdot)$ and $\nabla^2 f(\cdot)$ to denote its gradient and Hessian, and $f^\star$ to denote the global minimum of $f(\cdot)$. We use the notation $O(\cdot)$ to hide only absolute constants which do not depend on any problem parameter, and the notation $\tilde{O}(\cdot)$ to hide only absolute constants and log factors. We let $\mathbb{B}^{(d)}_x(r)$ denote the $d$-dimensional ball centered at $x$ with radius $r$; when it is clear from context, we simply denote it as $\mathbb{B}_x(r)$. We use $\mathcal{P}_{\mathcal{X}}(\cdot)$ to denote projection onto the set $\mathcal{X}$. Distance and projection are always defined in a Euclidean sense.

### 2.2. Gradient Descent

The theory of gradient descent often takes its point of departure to be the study of convex optimization.

**Definition 1.** A differentiable function $f(\cdot)$ is $\ell$-smooth (or $\ell$-gradient Lipschitz) if:
$$\forall x_1, x_2,\quad \|\nabla f(x_1) - \nabla f(x_2)\| \le \ell \|x_1 - x_2\|.$$

**Definition 2.** A twice-differentiable function $f(\cdot)$ is $\alpha$-strongly convex if $\forall x$, $\lambda_{\min}(\nabla^2 f(x)) \ge \alpha$.

Such a smoothness guarantee implies that the gradient cannot change too rapidly, and strong convexity ensures that there is a unique stationary point (and hence a global minimum). Standard analysis using these two properties shows that gradient descent converges linearly to the global optimum $x^\star$ (see, e.g., Bubeck et al., 2015).

**Theorem 1.** Assume $f(\cdot)$ is $\ell$-smooth and $\alpha$-strongly convex. For any $\epsilon > 0$, if we run gradient descent with step size $\eta = 1/\ell$, the iterate $x_t$ will be $\epsilon$-close to $x^\star$ in the following number of iterations:
$$\frac{2\ell}{\alpha}\log\frac{\|x_0 - x^\star\|}{\epsilon}.$$

In a more general setting, we no longer have convexity, let alone strong convexity. Though global optima are difficult to achieve in such a setting, it is possible to analyze convergence to first-order stationary points.

**Definition 3.** For a differentiable function $f(\cdot)$, we say that $x$ is a first-order stationary point if $\nabla f(x) = 0$; we also say $x$ is an $\epsilon$-first-order stationary point if $\|\nabla f(x)\| \le \epsilon$.

Under an $\ell$-smoothness assumption, it is well known that by choosing the step size $\eta = 1/\ell$, gradient descent converges to first-order stationary points.

**Theorem 2** (Nesterov, 1998). Assume that the function $f(\cdot)$ is $\ell$-smooth. Then, for any $\epsilon > 0$, if we run gradient descent with step size $\eta = 1/\ell$ and termination condition $\|\nabla f(x)\| \le \epsilon$, the output will be an $\epsilon$-first-order stationary point, and the algorithm will terminate within the following number of iterations:
$$\frac{\ell(f(x_0) - f^\star)}{\epsilon^2}.$$

Note that the iteration complexity does not depend explicitly on the intrinsic dimension; in the literature this is referred to as "dimension-free" optimization. Note that a first-order stationary point can be a local minimum, a saddle point, or a local maximum. For minimization problems, saddle points and local maxima are undesirable, and we abuse nomenclature to call both of them "saddle points" in this paper. The formal definition is as follows:

**Definition 4.** For a differentiable function $f(\cdot)$, we say that $x$ is a local minimum if $x$ is a first-order stationary point, and there exists $\epsilon > 0$ so that for any $y$ in the $\epsilon$-neighborhood of $x$, we have $f(x) \le f(y)$; we also say $x$ is a saddle point if $x$ is a first-order stationary point but not a local minimum. For a twice-differentiable function $f(\cdot)$, we further say a saddle point $x$ is strict (or non-degenerate) if $\lambda_{\min}(\nabla^2 f(x)) < 0$.

For a twice-differentiable function $f(\cdot)$, a saddle point $x$ must satisfy $\lambda_{\min}(\nabla^2 f(x)) \le 0$. Intuitively, for a saddle point $x$ to be strict, we simply rule out the undetermined case $\lambda_{\min}(\nabla^2 f(x)) = 0$, where Hessian information alone is not enough to decide whether $x$ is a local minimum or a saddle point. In most non-convex problems, saddle points are undesirable.
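The Hessian test behind Definition 4 is easy to state in code. The following numpy sketch (the helper name, tolerance, and toy function are ours, not the paper's) classifies a stationary point by the smallest Hessian eigenvalue:

```python
import numpy as np

def classify_stationary_point(hess, tol=1e-8):
    """Classify a stationary point (where grad f = 0) by lambda_min(Hessian)."""
    lam_min = np.linalg.eigvalsh(hess).min()
    if lam_min < -tol:
        return "strict saddle"    # lambda_min < 0: strict (non-degenerate) saddle
    if lam_min > tol:
        return "local minimum"    # Hessian positive definite
    return "degenerate: Hessian test is inconclusive"

# f(x, y) = x^2 - y^2 has a strict saddle at the origin:
print(classify_stationary_point(np.diag([2.0, -2.0])))  # -> "strict saddle"
```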
To escape from saddle points and find local minima in a general setting, we move both the assumptions and the guarantees of Theorem 2 one order higher. In particular, we require the Hessian to be Lipschitz:

**Definition 5.** A twice-differentiable function $f(\cdot)$ is $\rho$-Hessian Lipschitz if:
$$\forall x_1, x_2,\quad \|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|.$$

That is, the Hessian cannot change dramatically in terms of spectral norm. We also generalize the definition of a first-order stationary point to higher order:

**Definition 6.** For a $\rho$-Hessian Lipschitz function $f(\cdot)$, we say that $x$ is a second-order stationary point if $\nabla f(x) = 0$ and $\lambda_{\min}(\nabla^2 f(x)) \ge 0$; we also say $x$ is an $\epsilon$-second-order stationary point if:
$$\|\nabla f(x)\| \le \epsilon, \quad\text{and}\quad \lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho\epsilon}.$$

Second-order stationary points are very important in non-convex optimization because when all saddle points are strict, all second-order stationary points are exactly local minima.

Note that the literature sometimes defines an $\epsilon$-second-order stationary point by two independent error terms, i.e., $\|\nabla f(x)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\epsilon_H$. We instead follow the convention of Nesterov & Polyak (2006) by choosing $\epsilon_H = \sqrt{\rho\epsilon_g}$ to reflect the natural relation between the gradient and the Hessian.

## 3. Main Result

In this section we show that it is possible to modify gradient descent in a simple way so that the resulting algorithm will provably converge quickly to a second-order stationary point.

**Algorithm 2** Perturbed Gradient Descent: $\mathrm{PGD}(x_0, \ell, \rho, \epsilon, c, \delta, \Delta_f)$

```
χ ← 3 max{log(dℓΔ_f / (cϵ²δ)), 4},   η ← c/ℓ,   r ← (√c/χ²)·(ϵ/ℓ)
g_thres ← (√c/χ²)·ϵ,   f_thres ← (c/χ³)·√(ϵ³/ρ),   t_thres ← (χ/c²)·(ℓ/√(ρϵ))
t_noise ← −t_thres − 1
for t = 0, 1, ... do
    if ‖∇f(x_t)‖ ≤ g_thres and t − t_noise > t_thres then
        x̃_t ← x_t,   t_noise ← t
        x_t ← x̃_t + ξ_t,   ξ_t sampled uniformly from B_0(r)
    if t − t_noise = t_thres and f(x_t) − f(x̃_{t_noise}) > −f_thres then
        return x̃_{t_noise}
    x_{t+1} ← x_t − η∇f(x_t)
```

The algorithm that we analyze is a perturbed form of gradient descent (see Algorithm 2). The algorithm is based on gradient descent with step size $\eta$. When the norm of the current gradient is small ($\le g_{\mathrm{thres}}$), which indicates that the current iterate $x_t$ is potentially near a saddle point, the algorithm adds a small random perturbation to the current iterate. The perturbation is added at most once every $t_{\mathrm{thres}}$ iterations. To simplify the analysis we choose the perturbation $\xi_t$ to be uniformly sampled from a $d$-dimensional ball.¹ The use of the threshold $t_{\mathrm{thres}}$ ensures that the dynamics are mostly those of gradient descent. If the function value does not decrease enough (by $f_{\mathrm{thres}}$) within $t_{\mathrm{thres}}$ iterations after a perturbation, the algorithm outputs $\tilde{x}_{t_{\mathrm{noise}}}$. The analysis in this section shows that under this protocol, the output $\tilde{x}_{t_{\mathrm{noise}}}$ is necessarily close to a second-order stationary point.

We first state the assumption that we require.

**Assumption A1.** Function $f(\cdot)$ is both $\ell$-smooth and $\rho$-Hessian Lipschitz.

The Hessian Lipschitz condition ensures that the function is well-behaved near a saddle point, so that the small perturbation we add suffices to allow the subsequent gradient updates to escape from the saddle point. More formally, we have:

**Theorem 3.** Assume that $f(\cdot)$ satisfies A1. Then there exists an absolute constant $c_{\max}$ such that, for any $\delta > 0$, $\epsilon \le \ell^2/\rho$, $\Delta_f \ge f(x_0) - f^\star$, and constant $c \le c_{\max}$, $\mathrm{PGD}(x_0, \ell, \rho, \epsilon, c, \delta, \Delta_f)$ will output an $\epsilon$-second-order stationary point, with probability $1 - \delta$, and terminate in the following number of iterations:
$$O\left(\frac{\ell(f(x_0) - f^\star)}{\epsilon^2}\log^4\left(\frac{d\ell\Delta_f}{\epsilon^2\delta}\right)\right).$$

¹ Note that uniform sampling from a $d$-dimensional ball can be done efficiently by sampling $U^{1/d}\cdot Y/\|Y\|$, where $U \sim \mathrm{Uniform}([0,1])$ and $Y \sim \mathcal{N}(0, I_d)$ (Harman & Lacko, 2010).
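A minimal Python sketch of Algorithm 2 is given below, assuming the step size and thresholds are passed in directly rather than derived from $(\ell, \rho, \epsilon, c, \delta, \Delta_f)$ as in the algorithm box; the function names and defaults are ours:

```python
import numpy as np

def sample_ball(d, r, rng):
    """Uniform sample from the d-dimensional ball B_0(r), via footnote 1:
    r * U^(1/d) * Y/||Y||, with U ~ Uniform([0,1]) and Y ~ N(0, I_d)."""
    y = rng.standard_normal(d)
    return r * rng.uniform() ** (1.0 / d) * y / np.linalg.norm(y)

def pgd(f, grad_f, x0, eta, r, g_thres, f_thres, t_thres,
        max_iters=100_000, seed=0):
    """Sketch of PGD: perturb when the gradient is small (and no perturbation
    was added within the last t_thres steps); if the function value fails to
    drop by f_thres within t_thres steps after a perturbation, return the
    pre-perturbation iterate as a candidate second-order stationary point."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    t_noise, x_tilde = -t_thres - 1, None
    for t in range(max_iters):
        if np.linalg.norm(grad_f(x)) <= g_thres and t - t_noise > t_thres:
            x_tilde, t_noise = x.copy(), t
            x = x_tilde + sample_ball(x.size, r, rng)
        if t - t_noise == t_thres and f(x) - f(x_tilde) > -f_thres:
            return x_tilde   # escape failed: likely epsilon-second-order stationary
        x = x - eta * grad_f(x)
    return x
```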
Strikingly, Theorem 3 shows that perturbed gradient descent finds a second-order stationary point in almost the same amount of time that gradient descent takes to find a first-order stationary point. The step size $\eta$ is chosen as $O(1/\ell)$, which is in accord with classical analyses of convergence to first-order stationary points. Though we state the theorem with a certain choice of parameters for simplicity of presentation, our result holds even if we vary the parameters up to constant factors.

Without loss of generality, we can focus on the case $\epsilon \le \ell^2/\rho$, as in Theorem 3. In the case $\epsilon > \ell^2/\rho$, standard gradient descent without perturbation (Theorem 2) easily solves the problem. This is because by A1 we always have $\lambda_{\min}(\nabla^2 f(x)) \ge -\ell \ge -\sqrt{\rho\epsilon}$, which means that all $\epsilon$-first-order stationary points are automatically $\epsilon$-second-order stationary points.

We believe that the dependence on at least one $\log d$ factor in the iteration complexity is unavoidable in the non-convex setting, as our result can be directly applied to the principal component analysis problem, for which the best known runtimes (for the power method or Lanczos method) incur a $\log d$ factor due to random initialization. Establishing this formally is still an open question, however.

To provide some intuition for Theorem 3, consider an iterate $x_t$ which is not yet an $\epsilon$-second-order stationary point. By definition, either (1) the gradient $\nabla f(x_t)$ is large, or (2) the Hessian $\nabla^2 f(x_t)$ has a significant negative eigenvalue. Traditional analysis works in the first case. The crucial step in the proof of Theorem 3 involves handling the second case: when the gradient is small, $\|\nabla f(x_t)\| \le g_{\mathrm{thres}}$, and the Hessian has a significant negative eigenvalue, $\lambda_{\min}(\nabla^2 f(x_t)) \le -\sqrt{\rho\epsilon}$, then adding a perturbation, followed by standard gradient descent for $t_{\mathrm{thres}}$ steps, decreases the function value by at least $f_{\mathrm{thres}}$, with high probability. The proof of this fact relies on a novel characterization of the geometry around saddle points (see Section 5).

If we make stronger assumptions on the objective function, we can strengthen our main result. This further analysis is presented in the next section.

### 3.1. Functions with Strict Saddle Property

In many real applications, objective functions further admit the property that all saddle points are strict (Ge et al., 2015; Sun et al., 2016a;b; Bhojanapalli et al., 2016; Ge et al., 2016). In this case, all second-order stationary points are local minima, and hence convergence to second-order stationary points (Theorem 3) is equivalent to convergence to local minima. To state this result formally, we introduce a robust version of the strict saddle property (cf. Ge et al., 2015):

**Assumption A2.** Function $f(\cdot)$ is $(\theta, \gamma, \zeta)$-strict saddle. That is, for any $x$, at least one of the following holds:

- $\|\nabla f(x)\| \ge \theta$;
- $\lambda_{\min}(\nabla^2 f(x)) \le -\gamma$;
- $x$ is $\zeta$-close to $\mathcal{X}^\star$, the set of local minima.

Intuitively, the strict saddle assumption states that the space $\mathbb{R}^d$ can be divided into three regions: 1) a region where the gradient is large; 2) a region where the Hessian has a significant negative eigenvalue (around saddle points); and 3) the region close to a local minimum. With this assumption, we immediately have the following corollary:

**Corollary 4.** Let $f(\cdot)$ satisfy A1 and A2.
Then, there exists an absolute constant $c_{\max}$ such that, for any $\delta > 0$, $\Delta_f \ge f(x_0) - f^\star$, constant $c \le c_{\max}$, and letting $\epsilon = \min(\theta, \gamma^2/\rho)$, $\mathrm{PGD}(x_0, \ell, \rho, \epsilon, c, \delta, \Delta_f)$ will output a point $\zeta$-close to $\mathcal{X}^\star$, with probability $1 - \delta$, and terminate in the following number of iterations:
$$O\left(\frac{\ell(f(x_0) - f^\star)}{\epsilon^2}\log^4\left(\frac{d\ell\Delta_f}{\epsilon^2\delta}\right)\right).$$

Corollary 4 shows that after finding an $\epsilon$-second-order stationary point via Theorem 3 with $\epsilon = \min(\theta, \gamma^2/\rho)$, the output is also in the $\zeta$-neighborhood of some local minimum. Although Corollary 4 only explicitly asserts that the output lies within some fixed radius $\zeta$ of a local minimum, in many real applications $\zeta$ can be written as a function $\zeta(\theta)$ which decreases linearly or polynomially in $\theta$, while $\gamma$ is non-decreasing with respect to $\theta$. In these cases, the corollary further gives a convergence rate to a local minimum.

### 3.2. Functions with Strong Local Structure

The convergence rate in Theorem 3 is polynomial in $\epsilon$, which is similar to that of Theorem 2, but worse than the rate of Theorem 1 because of the lack of strong convexity. Although global strong convexity does not hold in the non-convex setting that is our focus, in many machine learning problems the objective function may have a favorable local structure in the neighborhood of local minima (Ge et al., 2015; Sun et al., 2016a;b; Sun & Luo, 2016). Exploiting this property can lead to much faster (linear) convergence to local minima. One such property that ensures such convergence is a local form of smoothness and strong convexity:

**Assumption A3.a.** In a $\zeta$-neighborhood of the set of local minima $\mathcal{X}^\star$, the function $f(\cdot)$ is $\alpha$-strongly convex and $\beta$-smooth.

Here we use a different letter $\beta$ to denote the local smoothness parameter (in contrast to the global smoothness parameter $\ell$). Note that we always have $\beta \le \ell$.

**Algorithm 3** Perturbed Gradient Descent with Local Improvement: $\mathrm{PGDli}(x_0, \ell, \rho, \epsilon, c, \delta, \Delta_f, \beta)$

```
x_0 ← PGD(x_0, ℓ, ρ, ϵ, c, δ, Δ_f)
for t = 0, 1, ... do
    x_{t+1} ← x_t − (1/β)∇f(x_t)
```

However, often even local $\alpha$-strong convexity does not hold. We thus introduce the following relaxation:

**Assumption A3.b.** In a $\zeta$-neighborhood of the set of local minima $\mathcal{X}^\star$, the function $f(\cdot)$ satisfies an $(\alpha, \beta)$-regularity condition if for any $x$ in this neighborhood:
$$\langle \nabla f(x),\, x - \mathcal{P}_{\mathcal{X}^\star}(x)\rangle \ \ge\ \frac{\alpha}{2}\|x - \mathcal{P}_{\mathcal{X}^\star}(x)\|^2 + \frac{1}{2\beta}\|\nabla f(x)\|^2.$$

Here $\mathcal{P}_{\mathcal{X}^\star}(\cdot)$ is the projection onto the set $\mathcal{X}^\star$. Note that the $(\alpha, \beta)$-regularity condition is more general: it is directly implied by the standard $\beta$-smooth and $\alpha$-strongly convex conditions. This regularity condition commonly appears in low-rank problems such as matrix sensing and matrix completion, and has been used in Bhojanapalli et al. (2016); Zheng & Lafferty (2016), where local minima form a connected set, and where the Hessian is strictly positive only with respect to directions pointing outside the set of local minima.

Gradient descent naturally exploits such local structure very well. In Algorithm 3, we first run Algorithm 2 to output a point within the neighborhood of a local minimum, and then perform standard gradient descent with step size $1/\beta$. We can then prove the following theorem:

**Theorem 5.** Let $f(\cdot)$ satisfy A1, A2, and A3.a (or A3.b). Then there exists an absolute constant $c_{\max}$ such that, for any $\delta > 0$, $\epsilon > 0$, $\Delta_f \ge f(x_0) - f^\star$, constant $c \le c_{\max}$, and letting $\tilde\epsilon = \min(\theta, \gamma^2/\rho)$, $\mathrm{PGDli}(x_0, \ell, \rho, \tilde\epsilon, c, \delta, \Delta_f, \beta)$ will output a point that is $\epsilon$-close to $\mathcal{X}^\star$, with probability $1 - \delta$, in the following number of iterations:
$$O\left(\frac{\ell(f(x_0) - f^\star)}{\tilde\epsilon^2}\log^4\left(\frac{d\ell\Delta_f}{\tilde\epsilon^2\delta}\right) + \frac{\beta}{\alpha}\log\frac{\zeta}{\epsilon}\right).$$

Theorem 5 says that if strong local structure is present, the convergence rate can be boosted to linear convergence, $\log(1/\epsilon)$.
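Algorithm 3 then amounts to a two-phase loop; below is a sketch reusing the `pgd` function from the earlier sketch, where `fine_iters` is an illustrative stand-in of ours for the $O((\beta/\alpha)\log(\zeta/\epsilon))$ second-phase bound:

```python
def pgd_li(f, grad_f, x0, eta, r, g_thres, f_thres, t_thres, beta,
           fine_iters=1_000):
    """Sketch of PGDli: phase 1 lands near a local minimum via perturbed
    gradient descent; phase 2 is plain gradient descent with step 1/beta."""
    x = pgd(f, grad_f, x0, eta, r, g_thres, f_thres, t_thres)
    for _ in range(fine_iters):          # linear convergence under A3.a / A3.b
        x = x - (1.0 / beta) * grad_f(x)
    return x
```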
In this theorem, the sequence of iterates can be decomposed into two phases. In the first phase, perturbed gradient descent finds a $\zeta$-neighborhood of $\mathcal{X}^\star$, by Corollary 4. In the second phase, standard gradient descent takes us from $\zeta$-close to $\epsilon$-close to a local minimum. Standard gradient descent together with Assumption A3.a (or A3.b) ensures that the iterate never steps out of the $\zeta$-neighborhood in this second phase, giving a result similar to Theorem 1 with linear convergence.

## 4. Example: Matrix Factorization

As a simple example to illustrate how to apply our general theorems to specific non-convex optimization problems, we consider a symmetric low-rank matrix factorization problem, based on the following objective function:
$$\min_{U \in \mathbb{R}^{d\times r}}\ f(U) = \frac{1}{2}\|UU^\top - M^\star\|_F^2, \tag{2}$$
where $M^\star \in \mathbb{R}^{d\times d}$. For simplicity, we assume $\mathrm{rank}(M^\star) = r$, and denote $\sigma_1^\star := \sigma_1(M^\star)$, $\sigma_r^\star := \sigma_r(M^\star)$. Clearly, in this case the global minimum of the function value is zero, which is achieved at $V = TD^{1/2}$, where $TDT^\top$ is the SVD of the symmetric real matrix $M^\star$. The following two lemmas show that the objective function in Eq. (2) satisfies the geometric assumptions A1, A2, and A3.b. Moreover, all local minima are global minima.

**Lemma 6.** For any $\Gamma \ge \sigma_1^\star$, the function $f(U)$ defined in Eq. (2) is $8\Gamma$-smooth and $12\Gamma^{1/2}$-Hessian Lipschitz inside the region $\{U \mid \|U\|^2 < \Gamma\}$.

**Lemma 7.** For the function $f(U)$ defined in Eq. (2), all local minima are global minima. The set of global minima is $\mathcal{X}^\star = \{VR \mid RR^\top = R^\top R = I\}$. Furthermore, $f(U)$ is $(\frac{1}{24}(\sigma_r^\star)^{3/2}, \frac{1}{3}\sigma_r^\star, \frac{1}{3}(\sigma_r^\star)^{1/2})$-strict saddle, and satisfies a $(\frac{2}{3}\sigma_r^\star, 10\sigma_1^\star)$-regularity condition in a $\frac{1}{3}(\sigma_r^\star)^{1/2}$-neighborhood of $\mathcal{X}^\star$.

One caveat is that since the objective function is actually a fourth-order polynomial in $U$, the smoothness and Hessian Lipschitz parameters from Lemma 6 naturally depend on $\|U\|$. Fortunately, we can further show that gradient descent (even with perturbation) does not increase $\|U\|$ beyond $O(\max\{\|U_0\|, (\sigma_1^\star)^{1/2}\})$. Then, applying Theorem 5 gives:

**Theorem 8.** There exists an absolute constant $c_{\max}$ such that the following holds. For the objective function in Eq. (2), for any $\delta > 0$ and constant $c \le c_{\max}$, and for $\Gamma^{1/2} := 2\max\{\|U_0\|, 3(\sigma_1^\star)^{1/2}\}$, the output of $\mathrm{PGDli}(U_0,\, 8\Gamma,\, 12\Gamma^{1/2},\, \frac{(\sigma_r^\star)^2}{108\Gamma^{1/2}},\, c,\, \delta,\, \frac{r\Gamma^2}{2},\, 10\sigma_1^\star)$ will be $\epsilon$-close to the global minimum set $\mathcal{X}^\star$, with probability $1 - \delta$, after the following number of iterations:
$$O\left(\frac{r\Gamma^4}{(\sigma_r^\star)^4}\log^4\left(\frac{d\Gamma}{\sigma_r^\star\delta}\right) + \frac{\sigma_1^\star}{\sigma_r^\star}\log\frac{\sigma_r^\star}{\epsilon}\right).$$

Theorem 8 establishes global convergence of perturbed gradient descent from an arbitrary initial point $U_0$, including exact saddle points. Suppose we initialize at $U_0 = 0$; then our iteration complexity becomes
$$O\left(r(\kappa^\star)^4\log^4(d\kappa^\star/\delta) + \kappa^\star\log(\sigma_r^\star/\epsilon)\right),$$
where $\kappa^\star = \sigma_1^\star/\sigma_r^\star$ is the condition number of the matrix $M^\star$. We see that in the first phase, to reach a neighborhood of the solution, our method requires a number of iterations scaling as $\tilde{O}(r(\kappa^\star)^4)$. We suspect that this strong dependence on the condition number arises from our generic assumption that the Hessian Lipschitz constant is uniformly upper bounded; it may well be the case that this dependence can be reduced in the special case of matrix factorization via a finer analysis of the geometric structure of the problem.

## 5. Proof Sketch for Theorem 3

In this section we present the key ideas underlying the main result of this paper (Theorem 3). We first argue the correctness of Theorem 3 given two important intermediate lemmas. Then we turn to the main lemma, which establishes that gradient descent can escape from saddle points quickly. We present full proofs of all these results in Appendix A.
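Before turning to the details, we note that the matrix factorization example of Section 4 is easy to try numerically. Below is a self-contained toy sketch with a simplified perturbation rule (perturb whenever the gradient is small, rather than Algorithm 2's once-per-$t_{\mathrm{thres}}$ schedule); all dimensions, step sizes, and thresholds here are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 50, 3
A = rng.standard_normal((d, r))
M = A @ A.T                                   # symmetric rank-r target M*

f      = lambda U: 0.5 * np.linalg.norm(U @ U.T - M, "fro") ** 2  # Eq. (2)
grad_f = lambda U: 2.0 * (U @ U.T - M) @ U                        # gradient of Eq. (2)

U = np.zeros((d, r))                          # U_0 = 0 is an exact saddle point
eta = 1.0 / (8.0 * np.linalg.norm(M, 2))      # step ~ 1/smoothness, cf. Lemma 6
for t in range(5_000):
    if np.linalg.norm(grad_f(U)) < 1e-3:      # near-flat region: perturb
        U = U + 1e-3 * rng.standard_normal((d, r))
    U = U - eta * grad_f(U)

print(f(U))   # decreases to near 0, the global optimum; plain GD from U_0 = 0
              # would stay at the saddle forever, since grad_f(0) = 0
```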
Throughout this section, we use $\eta$, $r$, $g_{\mathrm{thres}}$, $f_{\mathrm{thres}}$ and $t_{\mathrm{thres}}$ as defined in Algorithm 2.

### 5.1. Exploiting Large Gradient or Negative Curvature

Recall that an $\epsilon$-second-order stationary point is a point with a small gradient, and where the Hessian does not have a significant negative eigenvalue. Suppose we are currently at an iterate $x_t$ that is not an $\epsilon$-second-order stationary point; i.e., it does not satisfy the above properties. There are two possibilities: (1) the gradient is large, $\|\nabla f(x_t)\| \ge g_{\mathrm{thres}}$; or (2) we are around a saddle point, $\|\nabla f(x_t)\| \le g_{\mathrm{thres}}$ and $\lambda_{\min}(\nabla^2 f(x_t)) \le -\sqrt{\rho\epsilon}$. The following two lemmas address these two cases respectively. They guarantee that perturbed gradient descent will decrease the function value in both scenarios.

**Lemma 9** (Gradient). Assume that $f(\cdot)$ satisfies A1. Then for gradient descent with step size $\eta \le 1/\ell$, we have:
$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2.$$

**Lemma 10** (Saddle; informal). Assume that $f(\cdot)$ satisfies A1. If $x_t$ satisfies $\|\nabla f(x_t)\| \le g_{\mathrm{thres}}$ and $\lambda_{\min}(\nabla^2 f(x_t)) \le -\sqrt{\rho\epsilon}$, then adding one perturbation step followed by $t_{\mathrm{thres}}$ steps of gradient descent yields $f(x_{t+t_{\mathrm{thres}}}) - f(x_t) \le -f_{\mathrm{thres}}$ with high probability.

We see that Algorithm 2 is designed so that Lemma 10 can be directly applied. According to these two lemmas, perturbed gradient descent will decrease the function value either in the case of a large gradient, or around strict saddle points. Computing the average decrease in function value yields the total iteration complexity. Since Algorithm 2 only terminates when the function value decreases too slowly, this guarantees that the output must be an $\epsilon$-second-order stationary point (see Appendix A for formal proofs).

*Figure 1. Perturbation ball in 3D and the thin, pancake-shaped stuck region.*

*Figure 2. Narrow-band stuck region in 2D, plotted on top of the gradient flow.*

### 5.2. Escaping from Saddle Points Quickly

The proof of Lemma 9 is straightforward and follows from traditional analysis. The key technical contribution of this paper is the proof of Lemma 10, which gives a new characterization of the geometry around saddle points. Consider a point $\tilde{x}$ that satisfies the preconditions of Lemma 10 ($\|\nabla f(\tilde{x})\| \le g_{\mathrm{thres}}$ and $\lambda_{\min}(\nabla^2 f(\tilde{x})) \le -\sqrt{\rho\epsilon}$). After adding the perturbation ($x_0 = \tilde{x} + \xi$), we can view $x_0$ as coming from a uniform distribution over $\mathbb{B}_{\tilde{x}}(r)$, which we call the perturbation ball. We can divide this perturbation ball into two disjoint regions: (1) an escaping region $\mathcal{X}_{\mathrm{escape}}$, which consists of all the points $x \in \mathbb{B}_{\tilde{x}}(r)$ whose function value decreases by at least $f_{\mathrm{thres}}$ after $t_{\mathrm{thres}}$ steps; (2) a stuck region $\mathcal{X}_{\mathrm{stuck}} = \mathbb{B}_{\tilde{x}}(r)\setminus\mathcal{X}_{\mathrm{escape}}$. Our general proof strategy is to show that $\mathcal{X}_{\mathrm{stuck}}$ occupies only a very small proportion of the volume of the perturbation ball. After adding a perturbation to $\tilde{x}$, the point $x_0$ has a very small chance of falling in $\mathcal{X}_{\mathrm{stuck}}$, and hence will escape from the saddle point efficiently.

Let us consider the nature of $\mathcal{X}_{\mathrm{stuck}}$. For simplicity, imagine that $\tilde{x}$ is an exact saddle point whose Hessian has one negative eigenvalue and $d-1$ positive eigenvalues. Denote the minimum-eigenvalue direction by $e_1$. In this case, if the Hessian were constant (so that the function is quadratic), the stuck region $\mathcal{X}_{\mathrm{stuck}}$ would consist of points $x$ such that $x - \tilde{x}$ has a small $e_1$ component: a straight band in two dimensions and a flat disk in high dimensions. However, when the Hessian is not constant, the shape of the stuck region is distorted. In two dimensions, it forms a narrow band, as plotted in Figure 2 on top of the gradient flow.
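As a toy illustration of the band picture, consider the hypothetical two-dimensional quadratic $f(x) = \frac{1}{2}(x_1^2 - x_2^2)$ (ours, not the paper's), whose stuck set in the long-time limit is exactly the $x_1$-axis: gradient descent started there converges to the saddle at the origin, while almost any perturbation escapes along the negative-curvature direction:

```python
import numpy as np

# f(x) = 0.5 * (x1^2 - x2^2); saddle at the origin, negative curvature along e2.
grad_f = lambda x: np.array([x[0], -x[1]])

def run_gd(x0, eta=0.1, iters=200):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * grad_f(x)
    return x

rng = np.random.default_rng(0)
print(run_gd([1.0, 0.0]))                    # on the band: converges to (0, 0)
print(run_gd([1.0, 0.0] + 1e-3 * rng.standard_normal(2)))
# off the band: the x2 component grows geometrically, so the iterate escapes
# and f decreases; the stuck set is vanishingly thin, mirroring the volume bound.
```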
In three dimensions, the stuck region forms a thin pancake, as shown in Figure 1. The major challenge here is to bound the volume of this high-dimensional, non-flat, pancake-shaped region $\mathcal{X}_{\mathrm{stuck}}$. A crude approximation of this pancake by a flat disk loses polynomial factors in the dimensionality, which gives a suboptimal rate. Our proof relies on the following crucial observation: although we do not know the explicit form of the stuck region, we know it must be very thin, and therefore it cannot have a large volume. The informal statement of the lemma is as follows:

**Lemma 11** (informal). Suppose $\tilde{x}$ satisfies the preconditions of Lemma 10, and let $e_1$ be the smallest-eigenvalue direction of $\nabla^2 f(\tilde{x})$. For any $\delta \in (0, 1/3]$ and any two points $w, u \in \mathbb{B}_{\tilde{x}}(r)$, if $w - u = \mu r e_1$ with $\mu \ge \delta/(2\sqrt{d})$, then at least one of $w, u$ is not in the stuck region $\mathcal{X}_{\mathrm{stuck}}$.

Using this lemma it is not hard to bound the volume of the stuck region: draw a straight line along the $e_1$ direction which intersects the perturbation ball (shown as the purple line segment in Figure 2). For any two points on this line segment that are at least $\delta r/(2\sqrt{d})$ away from each other (shown as the red points $w, u$ in Figure 2), by Lemma 11 at least one of them must lie outside $\mathcal{X}_{\mathrm{stuck}}$. This implies that if there is one point $u \in \mathcal{X}_{\mathrm{stuck}}$ on this line segment, then $\mathcal{X}_{\mathrm{stuck}}$ on this line is contained in an interval of length at most $\delta r/\sqrt{d}$ around $u$. This establishes the thickness of $\mathcal{X}_{\mathrm{stuck}}$ in the $e_1$ direction, which is turned into an upper bound on the volume of the stuck region $\mathcal{X}_{\mathrm{stuck}}$ by standard calculus.

## 6. Conclusion

This paper presents the first (nearly) dimension-free result for gradient descent in a general non-convex setting. We present a general convergence result and show how it can be further strengthened when combined with additional structure such as strict saddle conditions and/or local regularity/convexity. There are still many related open problems. First, in the presence of constraints, it is worthwhile to study whether gradient descent still admits similar sharp convergence results. Another important question is whether similar techniques can be applied to accelerated gradient descent. We hope that this result can serve as a first step towards a more general theory with strong, almost dimension-free guarantees for non-convex optimization.

## References

Agarwal, Naman, Allen-Zhu, Zeyuan, Bullins, Brian, Hazan, Elad, and Ma, Tengyu. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

Bhojanapalli, Srinadh, Neyshabur, Behnam, and Srebro, Nathan. Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221, 2016.

Bubeck, Sébastien et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Candès, Emmanuel J, Li, Xiaodong, and Soltanolkotabi, Mahdi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

Carmon, Yair and Duchi, John C. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.

Carmon, Yair, Duchi, John C, Hinder, Oliver, and Sidford, Aaron. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surface of multilayer networks. arXiv preprint arXiv:1412.0233, 2014.

Curtis, Frank E, Robinson, Daniel P, and Samadi, Mohammadreza.
A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, pp. 1–32, 2014.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.

Ge, Rong, Huang, Furong, Jin, Chi, and Yuan, Yang. Escaping from saddle points: online stochastic gradient for tensor decomposition. In COLT, 2015.

Ge, Rong, Lee, Jason D, and Ma, Tengyu. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pp. 2973–2981, 2016.

Harman, Radoslav and Lacko, Vladimír. On decompositional algorithms for uniform sampling from n-spheres and n-balls. Journal of Multivariate Analysis, 101(10):2297–2304, 2010.

Jain, Prateek, Jin, Chi, Kakade, Sham M, and Netrapalli, Praneeth. Computing matrix squareroot via non-convex local search. arXiv preprint arXiv:1507.05854, 2015.

Kawaguchi, Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.

Lee, Jason D, Simchowitz, Max, Jordan, Michael I, and Recht, Benjamin. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257, 2016.

Levy, Kfir Y. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.

Nesterov, Yu. Introductory lectures on convex programming, volume I: Basic course. Lecture notes, 1998.

Nesterov, Yurii and Polyak, Boris T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Netrapalli, Praneeth, Jain, Prateek, and Sanghavi, Sujay. Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pp. 2796–2804, 2013.

Park, Dohyung, Kyrillidis, Anastasios, Caramanis, Constantine, and Sanghavi, Sujay. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. arXiv preprint arXiv:1609.03240, 2016.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Cognitive Modeling, 5, 1988.

Sun, Ju, Qu, Qing, and Wright, John. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 2016a.

Sun, Ju, Qu, Qing, and Wright, John. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pp. 2379–2383. IEEE, 2016b.

Sun, Ruoyu and Luo, Zhi-Quan. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

Zheng, Qinqing and Lafferty, John. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.