# Escaping Saddle Points with Adaptive Gradient Methods

Matthew Staib¹ ² Sashank Reddi³ Satyen Kale³ Sanjiv Kumar³ Suvrit Sra¹

¹MIT EECS. ²Based on work performed at Google Research New York. ³Google Research New York. Correspondence to: Matthew Staib. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.

## 1. Introduction

Stochastic first-order methods are the algorithms of choice for training deep networks, or more generally optimization problems of the form $\arg\min_x \mathbb{E}_z[f(x, z)]$. While vanilla stochastic gradient descent (SGD) is still the most popular such algorithm, there has been much recent interest in adaptive methods that adaptively change learning rates for each parameter. This is a very old idea, e.g. (Jacobs, 1988); modern variants such as Adagrad (Duchi et al., 2011; McMahan & Streeter, 2010), Adam (Kingma & Ba, 2014) and RMSProp (Tieleman & Hinton, 2012) are widely used in deep learning due to their good empirical performance. Adagrad uses the square root of the sum of the outer products of the past gradients to achieve adaptivity. At time step $t$, Adagrad updates the parameters in the following manner:

$$x_{t+1} = x_t - G_t^{-1/2} g_t,$$

where $g_t$ is a noisy stochastic gradient at $x_t$ and $G_t = \sum_{i=1}^t g_i g_i^T$. More often, a diagonal version of Adagrad is used due to practical considerations, which effectively yields a per-parameter learning rate. In the convex setting, Adagrad achieves provably good performance, especially when the gradients are sparse.

Although Adagrad works well in sparse convex settings, its performance appears to deteriorate in (dense) nonconvex settings. This performance degradation is often attributed to the rapid decay of the learning rate in Adagrad over time, which is a consequence of the rapid increase in eigenvalues of the matrix $G_t$. To tackle this issue, variants of Adagrad such as Adam and RMSProp have been proposed, which replace the sum of the outer products with an exponential moving average, i.e., $G_t = (1-\beta)\sum_{i=1}^t \beta^{t-i} g_i g_i^T$ for some constant $\beta \in (0, 1)$. This connection with Adagrad is often used to justify the design of Adam and RMSProp (e.g. (Goodfellow et al., 2016)). Although this connection is simple and appealing, it is clearly superficial. For instance, while learning rates in Adagrad decrease monotonically, that is not necessarily the case with Adam or RMSProp, as shown recently in Reddi et al. (2018b), leading to their non-convergence in even simple convex settings. Thus, a principled understanding of these adaptive methods is largely missing.
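To make the contrast between the two accumulators concrete, the following minimal sketch shows them side by side in their common diagonal form; the stepsize `eta`, EMA parameter `beta`, and stability constant `eps` are illustrative defaults, not values taken from this paper.

```python
import numpy as np

def adagrad_step(x, g, accum, eta=0.01, eps=1e-8):
    """Diagonal Adagrad: accumulate the running *sum* of squared gradients."""
    accum = accum + g * g                          # G_t grows monotonically
    return x - eta * g / (np.sqrt(accum) + eps), accum

def rmsprop_step(x, g, accum, eta=0.01, beta=0.99, eps=1e-8):
    """Diagonal RMSProp/Adam-style step: exponential moving average of g^2."""
    accum = beta * accum + (1 - beta) * g * g      # accumulator can shrink as well as grow
    return x - eta * g / (np.sqrt(accum) + eps), accum
```

Because the EMA accumulator can decrease, the effective per-parameter learning rates of RMSProp are not monotone, which is exactly the behavior exploited by the Reddi et al. (2018b) counterexample discussed later.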
In this paper, we introduce a much simpler way of thinking about adaptive methods such as Adam and RMSProp. Roughly, adaptive methods try to precondition SGD by some matrix $A$; e.g. when $A$ is diagonal, $A_{ii}$ corresponds to the effective stepsize for coordinate $i$. For some choices of $A$ the algorithms do not have oracle access to $A$, but instead form an estimate $\hat A \approx A$. We separate out these two steps, by 1) giving convergence guarantees for an idealized setting where we have access to $A$, then 2) proving bounds on the quality of the estimate $\hat A$. Our approach makes it possible to effectively intuit about the algorithms, prove convergence guarantees (including second-order convergence), and give insights about how to choose algorithm parameters. It also leads to a number of surprising results, including an understanding of why the Reddi et al. (2018b) counterexample is hard for adaptive methods, why adaptive methods tend to escape saddle points faster than SGD (observed empirically in (Reddi et al., 2018a)), insights into how to tune Adam's parameters, and (to our knowledge) the first second-order convergence proof for any adaptive method.

Contributions: In addition to the aforementioned novel viewpoint, we also make the following key contributions:

- We develop a new approach for analyzing convergence of adaptive methods, leveraging the preconditioner viewpoint and by way of disentangling estimation from the behavior of the idealized preconditioner.
- We provide second-order convergence results for adaptive methods, and as a byproduct, first-order convergence results. To the best of our knowledge, ours is the first work to show second-order convergence for any adaptive method.
- We provide theoretical insights on how adaptive methods escape saddle points quickly. In particular, we show that the preconditioner used in adaptive methods leads to isotropic noise near stationary points, which helps escape saddle points faster. Our analysis also provides practical suggestions for tuning the exponential moving average parameter $\beta$.

### 1.1. Related work

There is an immense amount of work studying nonconvex optimization for machine learning, too much to discuss here in detail. Thus, we only briefly discuss the two lines of work that are most relevant to our paper. First, the recent work to understand and give theoretical guarantees for adaptive methods such as Adam and RMSProp, e.g. (Chen et al., 2018; Reddi et al., 2018b; Zou et al., 2018). Second, the technical developments in using first-order algorithms to achieve nonconvex second-order convergence (see Definition 2.1), e.g. (Ge et al., 2015; Allen-Zhu & Li, 2018; Jin et al., 2017; Lee et al., 2016).

Nonconvex convergence of adaptive methods. Many recent works have investigated convergence properties of adaptive methods. However, to our knowledge, all these results either require convexity or show only first-order convergence to stationary points. Reddi et al. (2018b) showed non-convergence of Adam and RMSProp in simple convex settings and provided a variant of Adam, called AMSGrad, with guaranteed convergence in the convex setting; Zhou et al. (2018) generalized this to a nonconvex first-order convergence result. Zaheer et al. (2018) showed first-order convergence of Adam when the batch size grows over time.
Chen et al. (2018) bound the nonconvex convergence rate for a large family of Adam-like algorithms, but they essentially need to assume the effective stepsize is well-behaved (as in AMSGrad). Agarwal et al. (2018) give a convex convergence result for a full-matrix version of RMSProp, which they extend to the nonconvex case via iteratively optimizing convex functions. Their algorithm uses a fixed sliding window instead of an exponential moving average. Mukkamala & Hein (2017) prove improved convergence bounds for Adagrad in the online strongly convex case; they prove similar results for RMSProp, but only in a regime where it is essentially the same as Adagrad. Ward et al. (2018) give a nonconvex convergence result for a variant of Adagrad which employs an adaptively decreasing single learning rate (not per-parameter). Zou et al. (2018) give sufficient conditions for first-order convergence of Adam.

Nonconvex second-order convergence of first-order methods. Starting with Ge et al. (2015), there has been a resurgence of interest in first-order algorithms that find second-order stationary points of nonconvex objectives, where the gradient is small and the Hessian is nearly positive semidefinite. Most other results in this space operate in the deterministic setting where we have exact gradients, with carefully injected isotropic noise to escape saddle points. Levy (2016) shows improved results for normalized gradient descent. Some algorithms rely on Hessian-vector products instead of pure gradient information, e.g. (Agarwal et al., 2017; Carmon et al., 2018); it is possible to reduce Hessian-vector based algorithms to gradient algorithms (Xu et al., 2018; Allen-Zhu & Li, 2018). Jin et al. (2017) improve the dependence on dimension to polylogarithmic. Mokhtari et al. (2018) work towards adapting these techniques for constrained optimization. Most relevant to our work is that of Daneshmand et al. (2018), who prove convergence of SGD with better rates than Ge et al. (2015). Concurrent with our paper, Fang et al. (2019) give even better rates for SGD. Our work differs in that we provide second-order results for preconditioned SGD.

## 2. Notation and definitions

The objective function is $f$, and the gradient and Hessian of $f$ are $\nabla f$ and $H = \nabla^2 f$, respectively. Denote by $x_t \in \mathbb{R}^d$ the iterate at time $t$, by $g_t$ an unbiased stochastic gradient at $x_t$, and by $\nabla_t$ the expected gradient at $x_t$. The matrix $G_t$ refers to $\mathbb{E}[g_t g_t^T]$. Denote by $\lambda_{\max}(G)$ and $\lambda_{\min}(G)$ the largest and smallest eigenvalues of $G$, and $\kappa(G)$ is the condition number $\lambda_{\max}(G)/\lambda_{\min}(G)$ of $G$. For a vector $v$, its elementwise $p$-th power is written $v^p$. The objective $f(x)$ has global minimizer $x^*$, and we write $f^* = f(x^*)$. The Euclidean norm of a vector $v$ is written as $\|v\|$, while for a matrix $M$, $\|M\|$ refers to the operator norm of $M$. The matrix $I$ is the identity matrix, whose dimension should be clear from context.

Definition 2.1 (Second-order stationary point). A $(\tau_g, \tau_h)$-stationary point of $f$ is a point $x$ so that $\|\nabla f(x)\| \le \tau_g$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\tau_h$, where $\tau_g, \tau_h > 0$. As is standard (e.g. Nesterov & Polyak (2006)), we will discuss only $(\tau, \sqrt{\rho\tau})$-stationary points, where $\rho$ is the Lipschitz constant of the Hessian.

**Algorithm 1: Preconditioned SGD.** Input: initial $x_0$, time $T$, stepsize $\eta$, preconditioner $A(x)$. For $t = 0, \ldots, T$:
1. $g_t \leftarrow$ stochastic gradient at $x_t$
2. $A_t \leftarrow A(x_t)$, e.g. $A_t = \mathbb{E}[g_t g_t^T]^{-1/2}$
3. $x_{t+1} \leftarrow x_t - \eta A_t g_t$

**Algorithm 2: Full-matrix RMSProp.** Input: initial $x_0$, time $T$, stepsize $\eta$, small number $\varepsilon > 0$ for stability. For $t = 0, \ldots, T$:
1. $g_t \leftarrow$ stochastic gradient
2. $\hat G_t = \beta \hat G_{t-1} + (1-\beta) g_t g_t^T$
3. $A_t = (\hat G_t + \varepsilon I)^{-1/2}$
4. $x_{t+1} \leftarrow x_t - \eta A_t g_t$
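As a concrete reference point, here is a minimal NumPy sketch of Algorithm 2, written explicitly as preconditioned SGD; the stochastic gradient oracle `stoch_grad` and the default hyperparameters are placeholders rather than choices made in the paper.

```python
import numpy as np

def full_matrix_rmsprop(x0, stoch_grad, T, eta=1e-3, beta=0.99, eps=1e-8):
    """Sketch of Algorithm 2: full-matrix RMSProp as preconditioned SGD.

    `stoch_grad(x)` is assumed to return an unbiased stochastic gradient at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    d = x.shape[0]
    G_hat = np.zeros((d, d))                       # EMA estimate of E[g g^T]
    for t in range(T):
        g = stoch_grad(x)
        G_hat = beta * G_hat + (1.0 - beta) * np.outer(g, g)
        # Preconditioner A_t = (G_hat + eps*I)^{-1/2} via an eigendecomposition.
        evals, evecs = np.linalg.eigh(G_hat + eps * np.eye(d))
        A = evecs @ np.diag(evals ** -0.5) @ evecs.T
        x = x - eta * A @ g                        # preconditioned SGD step
    return x
```

Replacing the eigendecomposition with an elementwise update of $\mathbb{E}[g^2]$ recovers the familiar diagonal RMSProp update.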
## 3. The RMSProp Preconditioner

Recall that methods like Adam and RMSProp replace the running sum $\sum_{i=1}^t g_i g_i^T$ used in Adagrad with an exponential moving average (EMA) of the form $(1-\beta)\sum_{i=1}^t \beta^{t-i} g_i g_i^T$; e.g. full-matrix RMSProp is described formally in Algorithm 2. One key observation is that

$$\hat G_t = (1-\beta)\sum_{i=1}^t \beta^{t-i} g_i g_i^T \approx \mathbb{E}[g_t g_t^T] =: G_t$$

if $\beta$ is chosen appropriately; in other words, at time $t$, the accumulated $\hat G_t$ can be seen as an approximation of the true second moment matrix $G_t = \mathbb{E}[g_t g_t^T]$ at the current iterate. Thus, RMSProp can be viewed as preconditioned SGD (Algorithm 1) with the preconditioner being $A_t = G_t^{-1/2}$. In practice, it is too expensive to compute $G_t$ exactly since it requires summing over all training samples. Practical adaptive methods (see Algorithm 2) estimate this preconditioner (or a diagonal approximation) on-the-fly via an EMA.

Before developing our formal results, we will build intuition about the behavior of adaptive methods by studying an idealized adaptive method (IAM) with perfect access to $G_t$. In the rest of this section, we make use of idealized RMSProp to answer some simple questions about adaptive methods that we feel have not yet been addressed satisfactorily.

### 3.1. What is the purpose of the preconditioner?

Why should preconditioning by $A = \mathbb{E}[gg^T]^{-1/2}$ help optimization? The original Adam paper (Kingma & Ba, 2014) argues that Adam is an approximation to natural gradient descent, since if the objective $f$ is a log-likelihood, $\mathbb{E}[gg^T]$ approximates the Fisher information matrix, which captures curvature information in the space of distributions. There are multiple issues with comparing adaptive methods to natural gradient descent, which we discuss in Appendix A. Instead, Balles & Hennig (2018) argue that the primary function of adaptive methods is to equalize the stochastic gradient noise in each direction. But it is still not clear why or how equalized noise should help optimization.

Our IAM abstraction makes it easy to explain precisely how rescaling the gradient noise helps. Specifically, we manipulate the update rule for idealized RMSProp:

$$x_{t+1} \leftarrow x_t - \eta A_t g_t \tag{1}$$
$$\phantom{x_{t+1}} = x_t - \eta A_t \nabla_t - \eta \underbrace{A_t (g_t - \nabla_t)}_{=:\xi_t}. \tag{2}$$

The $A_t \nabla_t$ term is deterministic; only $\xi_t$ is stochastic, with mean $\mathbb{E}[A_t(g_t - \nabla_t)] = A_t\,\mathbb{E}[g_t - \nabla_t] = 0$. Take $\varepsilon = 0$ and assume $G_t = \mathbb{E}[g_t g_t^T]$ is invertible, so that $\xi_t = G_t^{-1/2}(g_t - \nabla_t)$. Now we can be more precise about how RMSProp rescales gradient noise. Specifically, we compute the covariance of the noise $\xi_t$:

$$\mathrm{Cov}(\xi_t) = I - G_t^{-1/2}\nabla_t\nabla_t^T G_t^{-1/2}. \tag{3}$$

The key insight is: near stationary points, $\nabla_t$ will be small, so that the noise covariance $\mathrm{Cov}(\xi_t)$ is approximately the identity matrix $I$. In other words, at stationary points, the gradient noise is approximately isotropic. This observation hints at why adaptive methods are so successful for nonconvex problems, where one of the main challenges is to escape saddle points (Reddi et al., 2018a). Essentially all first-order approaches for escaping saddle points rely on adding carefully tuned isotropic noise, so that regardless of what the escape direction is, there is enough noise in that direction to escape with high probability.
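The claim in equation (3) is easy to check numerically. The sketch below draws stochastic gradients with a small mean and an ill-conditioned noise covariance (both invented for illustration) and verifies that the idealized preconditioner $G^{-1/2}$ makes the noise covariance close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
grad = 1e-3 * rng.normal(size=d)                 # near-stationary: tiny expected gradient
noise_cov = np.diag([1.0, 0.3, 0.1, 0.03, 0.01]) # ill-conditioned gradient noise (assumed)

# Sample stochastic gradients g = grad + n with n ~ N(0, noise_cov).
noise = rng.multivariate_normal(np.zeros(d), noise_cov, size=200_000)

G = np.outer(grad, grad) + noise_cov             # G = E[g g^T]
evals, evecs = np.linalg.eigh(G)
A = evecs @ np.diag(evals ** -0.5) @ evecs.T     # idealized preconditioner G^{-1/2}

xi = noise @ A                                   # xi_t = G^{-1/2}(g_t - grad); A is symmetric
print(np.round(np.cov(xi, rowvar=False), 2))     # approximately the identity, cf. Eq. (3)
```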
### 3.2. (Reddi et al., 2018b) counterexample resolution

Recently, Reddi et al. (2018b) provided a simple convex stochastic counterexample on which RMSProp and Adam do not converge. Their reasoning is that RMSProp and Adam too quickly forget about large gradients from the past, in favor of small (but poor) gradients at the present. In contrast, for RMSProp with the idealized preconditioner (Algorithm 1 with $A = \mathbb{E}[gg^T]^{-1/2}$), there is no issue, but the preconditioner $A$ cannot be computed in practice. Rather, for this example, the exponential moving average estimation scheme fails to adequately estimate the preconditioner.

The counterexample is an optimization problem of the form

$$\min_{x \in [-1, 1]} F(x) = p f_1(x) + (1-p) f_2(x), \tag{4}$$

where the stochastic gradient oracle returns $\nabla f_1$ with probability $p$ and $\nabla f_2$ otherwise. Let $\zeta > 0$ be small, and $C > 0$ be large. Reddi et al. (2018b) set $p = (1+\zeta)/(C+1)$, $f_1(x) = Cx$, and $f_2(x) = -x$. Overall, then, $F(x) = \zeta x$, which is minimized at $x = -1$; however, Reddi et al. (2018b) show that RMSProp has $\mathbb{E}[F(x_t)] \ge 0$ and so incurs a suboptimality gap of at least $\zeta$. In contrast, the idealized preconditioner is a function of

$$\mathbb{E}[g^2] = p\,(\nabla f_1)^2 + (1-p)\,(\nabla f_2)^2 = C(1+\zeta) - \zeta,$$

which is a constant independent of $x$. Hence the preconditioner is constant, and, up to the choice of stepsize, idealized RMSProp on this problem is the same as SGD, which of course will converge. The difficulty for practical adaptive methods (which estimate $\mathbb{E}[g^2]$ via an EMA) is that as $C$ grows, the variance of the estimate of $\mathbb{E}[g^2]$ grows too. Thus Reddi et al. (2018b) break Adam by making estimation of $\mathbb{E}[g^2]$ harder.
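A small simulation makes the estimation difficulty visible. The sketch below draws the stochastic gradients of $F$ (which equal $C$ with probability $p$ and $-1$ otherwise), tracks the EMA estimate of $\mathbb{E}[g^2]$, and reports its relative error as $C$ grows; the values of $\zeta$, $\beta$, and the horizon are illustrative choices, not the constants used by Reddi et al. (2018b).

```python
import numpy as np

rng = np.random.default_rng(0)
zeta, beta, T = 0.01, 0.9, 10_000

for C in [10, 100, 1000]:
    p = (1 + zeta) / (C + 1)
    # Stochastic gradient of F: equals C w.p. p, and -1 otherwise (independent of x).
    grads = np.where(rng.random(T) < p, float(C), -1.0)
    true_second_moment = C * (1 + zeta) - zeta      # E[g^2], constant in x
    ema, rel_err = 0.0, []
    for g in grads:
        ema = beta * ema + (1 - beta) * g ** 2
        rel_err.append(abs(ema - true_second_moment) / true_second_moment)
    print(f"C={C:5d}  mean relative error of EMA estimate: {np.mean(rel_err[100:]):.2f}")
```

As expected, the true second moment is constant, but the EMA estimate of it becomes increasingly unreliable as $C$ grows.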
## 4. Main Results: Gluing Estimation and Optimization

The key enabling insight of this paper is to separately study the preconditioner and its estimation via EMA, then combine these to give proofs for practical adaptive methods. We will prove a formal guarantee that the EMA estimate $\hat G_t$ is close to the true $G_t$. By combining our estimation results with the underlying behavior of the preconditioner, we will be able to give convergence proofs for practical adaptive methods that are constructed in a novel, modular way.

Separating these two components enables more general results: we actually analyze preconditioned SGD (Algorithm 1) with oracle access to an arbitrary preconditioner $A(x)$. Idealized RMSProp is but one particular instance. Our convergence results depend only on specific properties of the preconditioner $A(x)$, with which we can recover convergence results for many RMSProp variants simply by bounding the appropriate constants. For example, $A = (\mathbb{E}[gg^T]^{1/2} + \varepsilon I)^{-1}$ corresponds to full-matrix Adam with $\beta_1 = 0$, or RMSProp as commonly implemented. For cleaner presentation, we instead focus on the variant $A = (\mathbb{E}[gg^T] + \varepsilon I)^{-1/2}$, but our proof technique can handle either case or its diagonal approximation.

### 4.1. Estimating from Moving Sequences

The above discussion about IAM is helpful for intuition, and as a base algorithm for analyzing convergence. But it remains to understand how well the estimation procedure works, both for intuition's sake and for later use in a convergence proof. In this section we introduce an abstraction we name estimation from moving sequences. This abstraction will allow us to guarantee high-quality estimates of the preconditioner, or, for that matter, any similarly constructed preconditioner. Our results will moreover make apparent how to choose the $\beta$ parameter in the exponential moving average: $\beta$ should be tied to the stepsize $\eta$, increasing as $\eta$ decreases. Increasing $\beta$ over time has been supported both empirically (Shazeer & Stern, 2018) and theoretically (Mukkamala & Hein, 2017; Zou et al., 2018; Reddi et al., 2018b), though to our knowledge, the precise pinning of $\beta$ to the stepsize $\eta$ is new.

Suppose there is a sequence of states $x_1, x_2, \ldots, x_T \in \mathcal{X}$, e.g. the parameters of our model at each time step. We have access to the states $x_t$, but more importantly we know the states are not changing too fast: $\|x_t - x_{t-1}\|$ is bounded for all $t$. There is a Lipschitz function $G : \mathcal{X} \to \mathbb{R}^{d\times d}$, which in our case is the second moment matrix of the stochastic gradients, but could be more general. We would like to estimate $G(x)$ for each $x = x_t$, but we have only a noisy oracle $Y(x)$ for $G(x)$, which we assume is unbiased and has bounded variance. Our goal is, given noisy reads $Y_1, \ldots, Y_T$ of $G(x_1), \ldots, G(x_T)$, to estimate $G(x_T)$ at the current point $x_T$ as well as possible.

We consider estimators of the form $\sum_{t=1}^T w_t Y_t$. For example, setting $w_T = 1$ and all others to zero would yield an unbiased (but high-variance) estimate of $G(x_T)$. We could assign more mass to older samples $Y_t$, but this will introduce bias into the estimate. By optimizing this bias-variance tradeoff, we can get a good estimator. In particular, taking $w$ to be an exponential moving average (EMA) of $\{Y_t\}_{t=1}^T$ will prioritize more recent and relevant estimates, while placing enough weight on old estimates to reduce the variance. The tradeoff is controlled by the EMA parameter $\beta$; e.g. if the sequence $x_t$ moves slowly (the stepsize is small), we will want large $\beta$ because older iterates are still very relevant.

In adaptive methods, the underlying function $G(x)$ we want to estimate is $\mathbb{E}[gg^T]$, and every stochastic gradient $g$ gives us an unbiased estimate $Y = gg^T$. The diagonal approximation also fits this formulation: we can set $Y = \mathrm{diag}(g^2)$ as an unbiased estimate of the matrix $\mathbb{E}[\mathrm{diag}(g^2)]$. With these applications in mind, we formalize our results in terms of matrix estimation. By combining standard matrix concentration inequalities (e.g. Tropp (2015)) with bounds on how fast the sequence moves, we obtain the following result:

Theorem 4.1. Assume $\|x_t - x_{t+1}\| \le \eta M$. The function $G : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ is matrix-valued and $L$-Lipschitz. For each $t$, $Y_t$ is a random matrix with $\mathbb{E}[Y_t] = G(x_t)$, $\|Y_t - \mathbb{E}[Y_t]\| \le R$, and $\|\mathbb{E}[(Y_t - \mathbb{E}[Y_t])^2]\| \le \sigma_{\max}^2$. Set $w_t \propto \beta^{T-t}$ with $\sum_{t=1}^T w_t = 1$ and assume $T > 4/(1-\beta)$. Then with probability $1-\delta$, the estimation error $\Phi = \big\|\sum_{t=1}^T w_t Y_t - G(x_T)\big\|$ is bounded by

$$\Phi = O\Big(\sigma_{\max}\sqrt{(1-\beta)\log(d/\delta)} + ML\eta/(1-\beta)\Big).$$

This is optimized by $\beta = 1 - C\eta^{2/3}$, for which the bound is $O\big((\eta M \sigma_{\max}^2 \log(d/\delta)\, L)^{1/3}\big)$ as long as $T > C\eta^{-2/3}$.

The proof is given in Appendix G. As long as $T$ is sufficiently large, we can get a high-quality estimate of $G_t = \mathbb{E}[g_t g_t^T]$. For this, it suffices to start off the underlying optimization algorithm with $W = O(\eta^{-2/3})$ burn-in iterations during which our estimate is updated but the algorithm is not started. This burn-in period will not affect the asymptotic runtime as long as $W = O(\eta^{-2/3}) = O(T)$. In our nonconvex convergence results we will require $T = O(\tau^{-4})$ and $\eta = O(\tau^2)$, so that $W = O(\tau^{-4/3})$, which is much smaller than $T$. In practice, one can get away with a much shorter (or no) burn-in period.

If $\beta$ is properly tuned, while running an adaptive method like RMSProp, we will get good estimates of $G = \mathbb{E}[gg^T]$ from samples $gg^T$. However, we actually require a good estimate of $A = \mathbb{E}[gg^T]^{-1/2}$ and variants.
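The following sketch shows the estimator analyzed in Theorem 4.1, with the EMA parameter tied to the stepsize as $\beta = 1 - C\eta^{2/3}$; the constant `C`, the helper name, and the way the oracle is passed in are assumptions of this sketch.

```python
import numpy as np

def ema_second_moment(sample_grad, x_iterates, eta, C=1.0):
    """Sketch of the EMA estimator from Theorem 4.1 for G(x) = E[g g^T].

    `sample_grad(x)` is assumed to return an unbiased stochastic gradient at x,
    and `x_iterates` is the slowly moving sequence x_1, ..., x_T (moving O(eta)
    per step).  The EMA parameter is pinned to the stepsize: beta = 1 - C*eta^(2/3).
    """
    beta = 1.0 - C * eta ** (2.0 / 3.0)
    d = len(x_iterates[0])
    G_hat = np.zeros((d, d))
    for x in x_iterates:                   # weights w_t proportional to beta^(T-t)
        g = sample_grad(x)
        G_hat = beta * G_hat + (1.0 - beta) * np.outer(g, g)
    return G_hat
```

With the burn-in of Corollary 4.1 in mind, one would run this update for roughly $W = O(\eta^{-2/3})$ iterations before trusting $\hat G_t$.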
To treat estimation in a unified way, we introduce estimable matrix sequences:

Definition 4.1. A $(W, T, \eta, \Delta, \delta)$-estimable matrix sequence is a sequence of matrices $\{A(x_t)\}_{t=1}^{W+T}$ generated from $\{x_t\}_t$ with $\|x_t - x_{t-1}\| \le \eta$, so that with probability $1-\delta$, after a burn-in of time $W$, we can achieve an estimate sequence $\{\hat A_t\}$ with $\|\hat A_t - A_t\| \le \Delta$ simultaneously for all times $t = W+1, \ldots, W+T$.

Applying Theorem 4.1 and union bounding over all times $t = W+1, \ldots, W+T$, we may state a concise result in terms of Definition 4.1:

Proposition 4.1. Suppose $G = \mathbb{E}[g_t g_t^T]$ is $L$-Lipschitz as a function of $x$. When applied to a generator sequence $\{x_t\}$ with $\|x_t - x_{t-1}\| \le \eta M$ and samples $Y_t = g_t g_t^T$, the matrix sequence $G_t = \mathbb{E}[g_t g_t^T]$ is $(W, T, \eta M, \Delta, \delta)$-estimable with $W = O(\eta^{-2/3})$, $T = \Omega(W)$, and $\Delta = O\big(\eta^{1/3}\sigma_{\max}^{2/3}\log(2Td/\delta)^{1/3} M^{1/3} L^{1/3}\big)$.

We are hence guaranteed a good estimate of $G$. What we actually want, though, is a good estimate of the preconditioner $A = (G + \varepsilon I)^{-1/2}$. In Appendix H we show how to bound the quality of an estimate of $A$. One simple result is:

Proposition 4.2. Suppose $G = \mathbb{E}[gg^T]$ is $L$-Lipschitz as a function of $x$. Further suppose a uniform bound $\lambda_{\min}(G) I \preceq G(x)$ for all $x$, with $\lambda_{\min}(G) > 0$. When applied to a generator sequence $\{x_t\}$ with $\|x_t - x_{t-1}\| \le \eta M$ and samples $Y_t = g_t g_t^T$, the matrix sequence $A_t = (G_t + \varepsilon I)^{-1/2}$ is $(W, T, \eta M, \Delta, \delta)$-estimable with $W = O(\eta^{-2/3})$, $T = \Omega(W)$, and $\Delta = O\big((\eta\sigma_{\max}^2\log(2Td/\delta) M L)^{1/3}(\varepsilon + \lambda_{\min}(G))^{-3/2}\big)$.

### 4.2. Convergence Results

We saw in the last two sections that it is simple to reason about adaptive methods via IAM, and that it is possible to compute a good estimate of the preconditioner. But we still need to glue the two together in order to get a convergence proof for practical adaptive methods. In this section we will give nonconvex convergence results, first for IAM and then for practical realizations thereof. We start with first-order convergence as a warm-up, and then move on to second-order convergence. In each case we give a bound for IAM, study it, and then give the corresponding bound for practical adaptive methods.

#### 4.2.1. Assumptions and notation

We want results for a wide variety of preconditioners $A$, e.g. $A = I$, the RMSProp preconditioner $A = (G+\varepsilon I)^{-1/2}$, and the diagonal version thereof, $A = (\mathrm{diag}(G) + \varepsilon I)^{-1/2}$. To facilitate this and the future extension of our approach to other preconditioners, we give guarantees that hold for general preconditioners $A$. Our bounds depend on $A$ via the following properties:

Definition 4.2. We say $A(x)$ is a $(\Lambda_1, \Lambda_2, \Gamma, \nu, \lambda_*)$-preconditioner if, for all $x$, the following bounds hold. First, $\|A\nabla f\|^2 \le \Lambda_1 \|A^{1/2}\nabla f\|^2$. Second, if $\tilde f(x)$ is the quadratic approximation of $f$ at some point $x_0$, we assume $\|A(\nabla f - \nabla\tilde f)\| \le \Lambda_2 \|\nabla f - \nabla\tilde f\|$. Third, $\Gamma \ge \mathbb{E}[\|Ag\|^2]$. Fourth, $\nu \le \lambda_{\min}(A\,\mathbb{E}[gg^T]A^T)$. Finally, $\lambda_* \le \lambda_{\min}(A)$.

Note that we could bound $\Lambda_1 = \Lambda_2 = \lambda_{\max}(A)$, but in practice $\Lambda_1$ and $\Lambda_2$ may be smaller, since they depend on the behavior of $A$ only in specific directions. In particular, if the preconditioner $A$ is well-aligned with the Hessian, as may be the case if the natural gradient approximation is valid, then $\Lambda_1$ would be very small. If $f$ is exactly quadratic, $\Lambda_2$ can be taken as a constant. The constant $\Gamma$ controls the magnitude of the (rescaled) gradient noise, which affects stability at a local minimum. Finally, $\nu$ gives a lower bound on the amount of gradient noise in any direction; when $\nu$ is larger it is easier to escape saddle points.¹ For shorthand, a $(\cdot, \cdot, \Gamma, \cdot, \lambda_*)$-preconditioner needs to satisfy only the corresponding inequalities.

¹ In cases where $G = \mathbb{E}[gg^T]$ is rank deficient, e.g. in high-dimensional finite-sum problems, lower bounds on $\lambda_{\min}(G)$ should be understood as lower bounds on $\mathbb{E}[(v^T g)^2]$ for escape directions $v$ from saddle points, analogous to the CNC condition from (Daneshmand et al., 2018).
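As a quick aid for intuition, the sketch below evaluates the Definition 4.2 constants for the RMSProp preconditioner $A = (G+\varepsilon I)^{-1/2}$, using the closed forms in terms of the eigenvalues of $G$ stated in Proposition 4.4 below; the helper name is ours.

```python
import numpy as np

def rmsprop_preconditioner_constants(G, eps=1e-8):
    """Constants of Definition 4.2 for A = (G + eps*I)^{-1/2} (cf. Proposition 4.4)."""
    evals = np.linalg.eigvalsh(G)
    lam_min, lam_max = float(evals[0]), float(evals[-1])
    d = G.shape[0]
    Lambda = (lam_min + eps) ** -0.5        # crude bound Lambda_1 = Lambda_2
    Gamma = d * lam_max / (eps + lam_max)   # Gamma >= E[||A g||^2]
    nu = lam_min / (lam_min + eps)          # nu <= lambda_min(A E[g g^T] A^T)
    lam_star = (lam_max + eps) ** -0.5      # lambda_* <= lambda_min(A)
    return Lambda, Gamma, nu, lam_star
```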
In Appendix C we provide bounds on these constants for variants of the second-moment preconditioner. We highlight the two most relevant cases, for SGD and RMSProp:

Proposition 4.3. The preconditioner $A = I$ is a $(\Lambda_1, \Lambda_2, \Gamma, \nu, \lambda_*)$-preconditioner, with $\Lambda_1 = \Lambda_2 = 1$, $\Gamma = \mathbb{E}[\|g\|^2] = \mathrm{tr}(G) \le d\,\lambda_{\max}(G)$, $\nu = \lambda_{\min}(G)$, and $\lambda_* = 1$.

Proposition 4.4. The preconditioner $A = (G + \varepsilon I)^{-1/2}$ is a $(\Lambda_1, \Lambda_2, \Gamma, \nu, \lambda_*)$-preconditioner, with

$$\Lambda_1 = \Lambda_2 = \frac{1}{(\lambda_{\min}(G) + \varepsilon)^{1/2}}, \quad \Gamma = \frac{d\,\lambda_{\max}(G)}{\varepsilon + \lambda_{\max}(G)}, \quad \nu = \frac{\lambda_{\min}(G)}{\lambda_{\min}(G) + \varepsilon}, \quad \lambda_* = (\lambda_{\max}(G) + \varepsilon)^{-1/2}.$$

#### 4.2.2. First-order convergence

Proofs are given in Appendix F. For all first-order results, we assume that $A$ is a $(\cdot, \cdot, \Gamma, \cdot, \lambda_*)$-preconditioner. The proof technique is essentially standard, with minor changes in order to accommodate general preconditioners. First, suppose we have exact oracle access to the preconditioner:

Theorem 4.2. Run preconditioned SGD with preconditioner $A$ and stepsize $\eta = \tau^2\lambda_*/(L\Gamma)$. For small enough $\tau$, after $T = 2(f(x_0) - f^*)L\Gamma/(\tau^4\lambda_*^2)$ iterations,

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(x_t)\|^2 \le \tau^2. \tag{5}$$

Now we consider an alternate version where, instead of the preconditioner $A_t$, we precondition by a noisy version $\hat A_t$ that is close to $A_t$, i.e. $\|\hat A_t - A_t\| \le \Delta$.

Theorem 4.3. Suppose we have access to an inexact preconditioner $\hat A$, which satisfies $\|\hat A - A\| \le \Delta$ for $\Delta < \lambda_*/2$. Run preconditioned SGD with preconditioner $\hat A$ and stepsize $\eta = \tau^2\lambda_*/(8L\Gamma)$. For small enough $\tau$, after $T = 32(f(x_0) - f^*)L\Gamma/(\tau^4\lambda_*^2)$ iterations, we will have

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(x_t)\|^2 \le \tau^2. \tag{6}$$

The results are the same up to constants. In other words, as long as we can achieve less than $\lambda_*/2$ error, we will converge at essentially the same rate as if we had the exact preconditioner. In light of this, for the second-order convergence results, we treat only the noisy version.

Theorem 4.3 gives a convergence bound assuming a good estimate of the preconditioner, and our estimation results guarantee a good estimate. By gluing together Theorem 4.3 with our estimation results for the RMSProp preconditioner, i.e. Proposition 4.2, we can give a convergence result for bona fide RMSProp:

Corollary 4.1. Consider RMSProp with burn-in, as in Algorithm 3, where we estimate $A = (G + \varepsilon I)^{-1/2}$. Retain the same choice of $\eta = O(\tau^2)$ and $T = O(\tau^{-4})$ as in Theorem 4.3. For small enough $\tau$, such a choice of $\eta$ will yield $\Delta < \lambda_*/2$. Choose all other parameters, e.g. $\beta$, in accordance with Proposition 4.2. In particular, choose $W = \Theta(\eta^{-2/3}) = \Theta(\tau^{-4/3}) = O(T)$ for the burn-in parameter. Then with probability $1-\delta$, in overall time $O(W + T) = O(\tau^{-4})$, we achieve

$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(x_t)\|^2 \le \tau^2. \tag{7}$$

**Algorithm 3: RMSProp with burn-in.** Input: initial $x_0$, time $T$, stepsize $\eta$, burn-in length $W$. Initialize $\hat G_0 \leftarrow \mathrm{BURNIN}(W, \beta)$ (Appendix B). For $t = 0, \ldots, T$:
1. $g_t \leftarrow$ stochastic gradient
2. $\hat G_t \leftarrow \beta \hat G_{t-1} + (1-\beta) g_t g_t^T$
3. $\hat A_t \leftarrow \hat G_t^{-1/2}$
4. $x_{t+1} \leftarrow x_t - \eta \hat A_t g_t$
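Here is a minimal sketch of the burn-in scheme of Algorithm 3, written for the diagonal approximation that is most common in practice; the full-matrix version only changes the accumulator and the inverse square root. The gradient oracle and the way burn-in is folded into one loop are assumptions of this sketch.

```python
import numpy as np

def diag_rmsprop_with_burnin(x0, stoch_grad, T, W, eta, beta, eps=1e-8):
    """Sketch of Algorithm 3 (RMSProp with burn-in), diagonal approximation.

    For the first W iterations only the EMA estimate of E[g^2] is updated and
    the iterate does not move; afterwards the usual preconditioned step is taken.
    `stoch_grad(x)` is assumed to return an unbiased stochastic gradient at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    v_hat = np.zeros_like(x)                      # EMA estimate of E[g^2]
    for t in range(W + T):
        g = stoch_grad(x)
        v_hat = beta * v_hat + (1.0 - beta) * g * g
        if t >= W:                                # during burn-in: estimate only
            x = x - eta * g / np.sqrt(v_hat + eps)
    return x
```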
#### 4.2.3. Second-order convergence

Now we leverage the power of our high-level approach to prove nonconvex second-order convergence for adaptive methods. Like the first-order results, we start by proving convergence bounds for a generic, possibly inexact preconditioner $A$. Our proof is based on that of Daneshmand et al. (2018) for SGD, and therefore we achieve the same $O(\tau^{-5})$ rate. It may be possible to improve our result using the technique of Fang et al. (2019), which is concurrent work to ours. However, our focus is on the preconditioner, and our study of it is wholly new.

Accordingly, we study the convergence of Algorithm 4, which is the same as Algorithm 1 (generic preconditioned SGD) except that once in a while we take a large stepsize so we may escape saddle points. The proof is given completely in Appendix E. At a high level, we show the algorithm makes progress when the gradient is large and when we are at a saddle point, and does not escape from local minima. Our analysis uses all the constants specified in Definition 4.2; e.g. the speed of escape from saddle points depends on $\nu$, the lower bound on stochastic gradient noise. Then, as before, we simply fuse our convergence guarantees with our estimation guarantees. The end result is, to our knowledge, the first nonconvex second-order convergence result for any adaptive method.

**Algorithm 4: Preconditioned SGD with increasing stepsize.** Input: initial $x_0$, time $T$, stepsizes $\eta, r$, threshold $t_{\mathrm{thresh}}$, matrix error $\Delta$. For $t = 0, \ldots, T$:
1. $A_t \leftarrow A(x_t)$ (preconditioner at $x_t$)
2. $\hat A_t \leftarrow$ any matrix with $\|\hat A_t - A_t\| \le \Delta$
3. $g_t \leftarrow$ stochastic gradient at $x_t$
4. If $t \bmod t_{\mathrm{thresh}} = 0$: $x_{t+1} \leftarrow x_t - r \hat A_t g_t$; else: $x_{t+1} \leftarrow x_t - \eta \hat A_t g_t$

Definitions for second-order results. Assume further that the Hessian $H$ is $\rho$-Lipschitz and the preconditioner $A(x)$ is $\alpha$-Lipschitz. The dependence on these constants is made precise in the proof, in Appendix E. The usual stepsize is $\eta$, while $r$ is the occasional large stepsize used every $t_{\mathrm{thresh}}$ iterations. We tolerate a small probability of failure $\delta$. For all results, we assume $A$ is a $(\Lambda_1, \Lambda_2, \Gamma, \nu, \lambda_*)$-preconditioner. For simplicity, we assume the noisy estimate $\hat A$ also satisfies the $\Lambda_1$ inequality. We will also assume a uniform bound $\|Ag\| \le M$. The proofs rely on a few other quantities that we optimally determine as a function of the problem parameters: $f_{\mathrm{thresh}}$ is a threshold on the function-value progress, and $g_{\mathrm{thresh}} = f_{\mathrm{thresh}}/t_{\mathrm{thresh}}$ is the time-amortized average of $f_{\mathrm{thresh}}$. We specify the precise values of all quantities in the proof.

Theorem 4.4. Consider Algorithm 4 with inexact preconditioner $\hat A_t$ and exact preconditioner $A_t$ satisfying the preceding requirements. Suppose that for all $t$, we have $\|\hat A_t - A_t\| \le \Delta$ with $\Delta = O(\tau^{1/2})$. Then for small $\tau$, with probability $1-\delta$, we reach a $(\tau, \sqrt{\rho\tau})$-stationary point in time

$$T = O\!\left(\frac{\Lambda_1^4 \Lambda_2^4 \Gamma^4}{\lambda_*^{10}\,\nu^4\,\rho\,\delta^3\,\tau^5}\right). \tag{8}$$

The big-O suppresses other constants given in the proof.
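A minimal sketch of Algorithm 4 follows; the preconditioner oracle `precond` stands in for whatever estimate $\hat A_t$ is available (for RMSProp, an EMA-based estimate of $(G+\varepsilon I)^{-1/2}$), and is an assumption of this sketch.

```python
import numpy as np

def precond_sgd_increasing_step(x0, stoch_grad, precond, T, eta, r, t_thresh):
    """Sketch of Algorithm 4: preconditioned SGD with an occasional large step.

    Every t_thresh iterations the large stepsize r is used; because the
    preconditioned gradient noise is nearly isotropic near stationary points,
    this large step helps push the iterate off saddle points.
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(T):
        A_hat = precond(x)                 # any matrix within Delta of A(x_t)
        g = stoch_grad(x)
        step = r if t % t_thresh == 0 else eta
        x = x - step * A_hat @ g
    return x
```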
To prove a result for bona fide RMSProp, we need to combine Theorem 4.4 with an algorithm that maintains a good estimate of $G = \mathbb{E}[gg^T]$ (and consequently of $A = (G + \varepsilon I)^{-1/2}$). This is more delicate than the first-order case because now the stepsize varies. Whenever we take a large stepsize, the estimation algorithm will need to hallucinate $S$ smaller steps in order to keep the estimate accurate. Our overall scheme is formalized in Appendix B, for which the following convergence result holds:

Corollary 4.2. Consider the RMSProp version of Algorithm 4 that is described in Appendix B. Retain the same choice of $\eta = O(\tau^{5/2})$, $r = O(\tau)$, and $T = O(\tau^{-5})$ as in Theorem 4.4. For small enough $\tau$, such a choice of $\eta$ will yield $\Delta < \lambda_*/2$. Choose $W = \Theta(\eta^{-2/3}) = \Theta(\tau^{-5/3}) = O(T)$ for the burn-in parameter. Choose $S = O(\tau^{-3/2})$, so that as far as the estimation scheme is concerned, the stepsize is bounded by $\max\{\eta, r/S\} = O(\tau^{5/2}) = O(\eta)$. Then, as before, with probability $1-\delta$, we can reach a $(\tau, \sqrt{\rho\tau})$-stationary point in total time

$$W + T = O\!\left(\frac{\Lambda_1^4 \Lambda_2^4 \Gamma^4}{\lambda_*^{10}\,\nu^4\,\rho\,\delta^3\,\tau^5}\right), \tag{9}$$

where $\Lambda_1, \Lambda_2, \Gamma, \nu, \lambda_*$ are the constants describing $A = (G + \varepsilon I)^{-1/2}$.

Again, as in the first-order results, one could substitute in any other estimable preconditioner. In particular, in Appendix D we discuss the more common diagonal version of RMSProp.

## 5. Discussion

Separating the estimation step from the preconditioning enables evaluation of different choices for the preconditioner.

### 5.1. How to set the regularization parameter ε

In the adaptive methods literature, it is still a mystery how to properly set the regularization parameter $\varepsilon$ that ensures invertibility of $G + \varepsilon I$. When the optimality tolerance $\tau$ is small enough, estimating the preconditioner is not the bottleneck. Thus, focusing only on the idealized case, one could just choose $\varepsilon$ to minimize the bound. Our first-order results depend on $\varepsilon$ only through $\Gamma$ and $\lambda_{\min}(A)$; using the preconditioner bounds from Proposition 4.4, the relevant term is

$$\frac{d\,\lambda_{\min}(G)}{\varepsilon + \lambda_{\min}(G)}\,\bigl(\lambda_{\max}(G) + \varepsilon\bigr). \tag{10}$$

This is minimized by taking $\varepsilon \to \infty$, which suggests using the identity preconditioner, i.e. SGD. In contrast, for second-order convergence, the bound is

$$\frac{\Lambda_1^4\Lambda_2^4\Gamma^4}{\lambda_*^{10}\nu^4} = d^4\kappa(G)^4\bigl(\lambda_{\max}(G) + \varepsilon\bigr), \tag{11}$$

which is instead minimized with $\varepsilon = 0$. So for the best second-order convergence rate, it is desirable to set $\varepsilon$ as small as possible. Note that since our bounds hold only for small enough convergence tolerance $\tau$, it is possible that the optimal $\varepsilon$ should depend in some way on $\tau$.

### 5.2. Comparison to SGD

Another important question we make progress towards is: when are adaptive methods better than SGD? Our second-order result depends on the preconditioner only through $\Lambda_1^4\Lambda_2^4\Gamma^4/(\lambda_*^{10}\nu^4)$. Plugging in Proposition 4.3 for SGD, we may bound

$$\frac{\Lambda_1^4\Lambda_2^4\Gamma^4}{\lambda_*^{10}\nu^4} = \frac{\mathbb{E}[\|g\|^2]^4}{\lambda_{\min}(G)^4} \le d^4\kappa(G)^4, \tag{12}$$

while for full-matrix RMSProp, we have

$$\frac{\Lambda_1^4\Lambda_2^4\Gamma^4}{\lambda_*^{10}\nu^4} = d^4\kappa(G)^4\bigl(\lambda_{\max}(G) + \varepsilon\bigr). \tag{13}$$

Setting $\varepsilon = 0$ for simplicity, we conclude that full-matrix RMSProp converges faster if $\lambda_{\max}(G) \le 1$. Now suppose that for a given optimization problem, the preconditioner $A$ is well-aligned with the Hessian so that $\Lambda_1 = O(1)$ (e.g. if the natural gradient approximation holds) and that near saddle points the objective is essentially quadratic so that $\Lambda_2 = O(1)$. In this regime, the preconditioner dependence of idealized full-matrix RMSProp is $d^4\lambda_{\max}(G)^5$, which yields a better result than SGD when $\lambda_{\max}(G) \le \lambda_{\min}(G)^{-4}$. This will happen whenever $\lambda_{\min}(G)$ is relatively small. Thus, when there is not much noise in the escape direction, and the Hessian and $G^{-1/2}$ are not poorly aligned, RMSProp will converge faster overall. A similar phenomenon can be shown for the diagonal case when the approximation is good, per the results in Appendix C and D.

### 5.3. Alternative preconditioners

Our analysis inspires the design of other preconditioners: e.g., if at each iteration we sample two independent stochastic gradients $g_1$ and $g_2$, we have unbiased sample access to $\frac{1}{2}(g_1 - g_2)(g_1 - g_2)^T$, which in expectation yields the covariance $\Sigma = \mathrm{Cov}(g)$ instead of the second moment matrix of $g$. It immediately follows that we can prove second-order convergence results for an algorithm that constructs an exponential moving average estimate of $\Sigma$ and preconditions by $\Sigma^{-1/2}$, as advocated by Ida et al. (2017).
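The following sketch shows how such a covariance-based preconditioner could be estimated; the function names are ours, and the scheme (an EMA of $\frac{1}{2}(g_1-g_2)(g_1-g_2)^T$ followed by an inverse square root) is a straightforward reading of the construction above, not necessarily the exact algorithm of Ida et al. (2017).

```python
import numpy as np

def cov_ema_update(Sigma_hat, g1, g2, beta):
    """EMA estimate of Sigma = Cov(g) from two independent stochastic gradients.

    (g1 - g2)(g1 - g2)^T / 2 is an unbiased sample of Cov(g) when g1, g2 are
    independent gradients at the same iterate.
    """
    sample = 0.5 * np.outer(g1 - g2, g1 - g2)
    return beta * Sigma_hat + (1.0 - beta) * sample

def cov_preconditioner(Sigma_hat, eps=1e-8):
    """Return (Sigma_hat + eps*I)^{-1/2}, i.e. precondition by Sigma^{-1/2}."""
    evals, evecs = np.linalg.eigh(Sigma_hat + eps * np.eye(Sigma_hat.shape[0]))
    return evecs @ np.diag(evals ** -0.5) @ evecs.T
```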
### 5.4. Tuning the EMA parameter β

Another mystery of adaptive methods is how to set the exponential moving average (EMA) parameter $\beta$. In practice $\beta$ is typically set to a constant, e.g. 0.99, while other parameters such as the stepsize $\eta$ are tuned more carefully and may vary over time. While our estimation guarantee, Theorem 4.1, suggests setting $\beta = 1 - O(\eta^{2/3})$, the specific formula depends on constants that may be unknown, e.g. Lipschitz constants and gradient norms. Instead, one could set $\beta = 1 - C\eta^{2/3}$ and search for a good choice of the hyperparameter $C$. For example, the common initial choice of $\eta = 0.001$ and $\beta = 0.99$ corresponds to $C = 1$.

## 6. Experiments

We experimentally test our claims about adaptive methods escaping saddle points, and our suggestion for setting $\beta$.

Escaping saddle points. First, we test our claim that when the gradient noise is ill-conditioned, adaptive methods escape saddle points faster than SGD, and often converge faster to (approximate) local minima. We construct a two-dimensional² nonconvex problem $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ where $f_i(x) = \frac{1}{2}x^T H x + b_i^T x + \|x\|_{10}^{10}$. Here, $H = \mathrm{diag}([1, -0.1])$, so $f$ has a saddle point at the origin with objective value zero. The vectors $b_i$ are chosen so that sampling $b$ uniformly from $\{b_i\}_{i=1}^n$ yields $\mathbb{E}[b] = 0$ and $\mathrm{Cov}(b) = \mathrm{diag}([1, 0.01])$. Hence at the origin there is an escape direction but little gradient noise in that direction. We initialize SGD and (diagonal) RMSProp (with $\beta = 1 - \eta^{2/3}$) at the saddle point and test several stepsizes $\eta$ for each. Results for the first $10^4$ iterations are shown in Figure 1. In order to escape the saddle point as fast as RMSProp, SGD requires a substantially larger stepsize; e.g. SGD needs $\eta = 0.01$ to escape as fast as RMSProp does with $\eta = 0.001$. But with such a large stepsize, SGD cannot converge to a small neighborhood of the local minimum, and instead bounces around due to gradient noise. Since RMSProp can escape with a small stepsize, it can converge to a much smaller neighborhood of the local minimum. Overall, for any fixed final convergence criterion, RMSProp escapes faster and converges faster overall.

Figure 1. SGD (top, $\eta \in \{0.03, 0.01, 0.007\}$) vs. RMSProp (bottom, $\eta \in \{0.01, 0.001, 0.0001\}$) escaping a saddle point with poorly conditioned gradient noise, over the first $10^4$ iterations. Compared to RMSProp, SGD requires a larger stepsize to escape as quickly, which negatively impacts convergence to the local minimum.

² The same phenomenon still holds in higher dimensions, but the presentation is simpler with $d = 2$.

Setting the EMA parameter β. Next, we test our recommendations regarding setting the EMA parameter $\beta$. We consider logistic regression on MNIST. We use (diagonal) RMSProp with batch size 100, decreasing stepsize $\eta_t = 0.001/\sqrt{t}$ and $\varepsilon = 10^{-8}$, and compare different schedules for $\beta$. Specifically, we test $\beta \in \{0.7, 0.9, 0.97, 0.99\}$ (so that $1-\beta$ is spaced roughly logarithmically) as well as our recommendation of $\beta_t = 1 - C\eta_t^{2/3}$ for $C \in \{0.1, 0.3, 1\}$. As shown in Figure 2, all options for $\beta$ have similar performance initially, but as $\eta_t$ decreases, large $\beta$ yields substantially better performance. In particular, our schedule $\beta_t = 1 - C\eta_t^{2/3}$ achieved the best performance, and moreover was insensitive to how $C$ was set.

Figure 2. Performance on MNIST logistic regression of RMSProp with different choices of $\beta$ (constants $\beta \in \{0.7, 0.9, 0.97, 0.99\}$ and the schedule $\beta_t = 1 - 0.3\eta_t^{2/3}$) and decreasing stepsize, over 200 epochs.
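A compact sketch of the synthetic saddle experiment follows; the way the noise $b$ is sampled (Gaussian with the stated moments) and the exact hyperparameter values printed are our assumptions for illustration, not a reproduction of the paper's experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, -0.1])                      # saddle point at the origin
noise_std = np.sqrt(np.array([1.0, 0.01]))    # little noise along the escape direction

def stoch_grad(x):
    b = noise_std * rng.normal(size=2)        # plays the role of a sampled b_i
    return H @ x + b + 10 * x ** 9            # gradient of 0.5 x'Hx + b'x + ||x||_10^10

def run(method, eta, T=10_000):
    x, v = np.zeros(2), np.zeros(2)           # start exactly at the saddle point
    beta = 1 - eta ** (2 / 3)
    for _ in range(T):
        g = stoch_grad(x)
        if method == "rmsprop":
            v = beta * v + (1 - beta) * g * g
            g = g / np.sqrt(v + 1e-8)
        x = x - eta * g
    return 0.5 * x @ H @ x + np.sum(x ** 10)  # noise-free objective value

print("SGD      eta=0.01 :", run("sgd", 0.01))
print("RMSProp  eta=0.001:", run("rmsprop", 0.001))
```

With these (assumed) settings, RMSProp with the small stepsize leaves the saddle and settles near the local minimum, mirroring the qualitative behavior reported in Figure 1.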
## Acknowledgements

SS acknowledges support from the DARPA Lagrange grant, an Amazon Research Award, and the NSF CAREER Award (ID 1846088). We thank Nicolas Le Roux for helpful conversations.

## References

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 1195–1199, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4528-6. doi: 10.1145/3055399.3055464.

Agarwal, N., Bullins, B., Chen, X., Hazan, E., Singh, K., Zhang, C., and Zhang, Y. The case for full-matrix adaptive regularization. arXiv preprint arXiv:1806.02958, 2018.

Allen-Zhu, Z. and Li, Y. Neon2: Finding local minima via first-order oracles. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 3720–3730. Curran Associates, Inc., 2018.

Balles, L. and Hennig, P. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 404–413, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

Carmon, Y., Duchi, J., Hinder, O., and Sidford, A. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018. doi: 10.1137/17M1114296.

Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.

Daneshmand, H., Kohler, J., Lucchi, A., and Hofmann, T. Escaping saddles with stochastic gradients. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1155–1164, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011. ISSN 1532-4435.

Fang, C., Lin, Z., and Zhang, T. Sharp analysis for nonconvex SGD escaping from saddle points. arXiv preprint arXiv:1902.00247, 2019.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Grünwald, P., Hazan, E., and Kale, S. (eds.), Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pp. 797–842, Paris, France, 03–06 Jul 2015. PMLR.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Ida, Y., Fujiwara, Y., and Iwamura, S. Adaptive learning rate via covariance matrix based preconditioning for deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 1923–1929, 2017. doi: 10.24963/ijcai.2017/267.

Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307, 1988.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1724–1732, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1246–1257, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.

Levy, K. Y. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.

McMahan, H. B. and Streeter, M. J. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pp. 244–256, 2010.

Mokhtari, A., Ozdaglar, A., and Jadbabaie, A. Escaping saddle points in constrained optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 3633–3643. Curran Associates, Inc., 2018.

Mukkamala, M. C. and Hein, M. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2545–2553, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston, 2004. ISBN 1402075537.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., and Smola, A. A generic approach for escaping saddle points. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1233–1242, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018a. PMLR.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018b.

Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4596–4604, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015. ISSN 1935-8237. doi: 10.1561/2200000048.

Ward, R., Wu, X., and Bottou, L. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.

Xu, Y., Rong, J., and Yang, T. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 5531–5541. Curran Associates, Inc., 2018.
Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In NIPS, 2018.

Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.

Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of Adam and RMSProp. arXiv preprint arXiv:1811.09358, 2018.