# Escaping Saddles with Stochastic Gradients

Hadi Daneshmand*¹, Jonas Kohler*¹, Aurelien Lucchi¹, Thomas Hofmann¹

*Equal contribution. ¹ETH Zurich, Switzerland. Correspondence to: Hadi Daneshmand. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and does not decrease with the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.

1. Introduction

In this paper we analyze the use of gradient descent (GD) and its stochastic variant (SGD) to minimize objectives of the form

$$w^\star = \arg\min_{w \in \mathbb{R}^d} \Big[ f(w) := \mathbb{E}_{z \sim P}\left[f_z(w)\right] \Big], \tag{1}$$

where $f \in C^2(\mathbb{R}^d, \mathbb{R})$ is a not necessarily convex loss function and $P$ is an arbitrary probability distribution. In the era of big data and deep neural networks, (stochastic) gradient descent is a core component of many training algorithms (Bottou, 2010). What makes SGD so attractive is its simplicity, its seemingly universal applicability and a convergence rate that is independent of the size of the training set. One specific trait of SGD is its inherent noise, originating from sampling training points, whose variance has to be controlled in order to guarantee convergence, either through a conservative step size (Nesterov, 2013) or via explicit variance-reduction techniques (Johnson & Zhang, 2013).

While the convergence behavior of SGD is well understood for convex functions (Bottou, 2010), we are here interested in the optimization of non-convex functions, which pose additional challenges, in particular due to the presence of saddle points and suboptimal local minima (Dauphin et al., 2014; Choromanska et al., 2015). For example, finding the global minimum of even a degree-four polynomial can be NP-hard (Hillar & Lim, 2013). Instead of aiming for a global minimizer, a more practical goal is to search for a local optimum of the objective. In this paper we thus focus on reaching a second-order stationary point of a smooth non-convex function. Formally, we aim to find an $(\epsilon_g, \epsilon_h)$-second-order stationary point $w$ such that the following conditions hold:

$$\|\nabla f(w)\| \le \epsilon_g \quad \text{and} \quad \nabla^2 f(w) \succeq -\epsilon_h I, \tag{2}$$

where $\epsilon_g, \epsilon_h > 0$. Existing work, such as (Ge et al., 2015; Jin et al., 2017a), proved convergence to a point satisfying Eq. (2) for modified variants of gradient descent and its stochastic variant by requiring additional noise to be added explicitly to the iterates, either along the entire path (former) or whenever the gradient is sufficiently small (latter). Formally, this yields the following update steps for the perturbed GD and SGD versions:

$$\text{PGD:} \quad w_{t+1} = w_t - \eta_t \nabla f(w_t) + r\,\zeta_{t+1} \tag{3}$$

$$\text{PSGD:} \quad w_{t+1} = w_t - \eta_t \left( \nabla f_z(w_t) + \zeta_t \right), \tag{4}$$

where $\zeta_t$ is typically zero-mean noise sampled uniformly from a unit sphere.
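For concreteness, the following minimal NumPy sketch spells out the two perturbed update rules of Eqs. (3) and (4). It is an illustration only, not the implementation used in the cited works; the gradient oracles `full_grad` and `stoch_grad` as well as the step sizes `eta` and `r` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_sphere_noise(d):
    """Zero-mean noise zeta sampled uniformly from the unit sphere, as in Eqs. (3)-(4)."""
    u = rng.normal(size=d)
    return u / np.linalg.norm(u)

def pgd_step(w, full_grad, eta, r):
    """Perturbed GD, Eq. (3): a full-gradient step plus explicit isotropic noise of radius r."""
    return w - eta * full_grad(w) + r * unit_sphere_noise(w.size)

def psgd_step(w, stoch_grad, eta):
    """Perturbed SGD, Eq. (4): the stochastic gradient itself is perturbed by isotropic noise."""
    return w - eta * (stoch_grad(w) + unit_sphere_noise(w.size))
```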
Isotropic noise. The perturbed variants of GD and SGD in Eqs. (3)-(4) have been analyzed for the case where the added noise $\zeta_t$ is isotropic (Ge et al., 2015; Levy, 2016; Jin et al., 2017a) or at least exhibits a certain amount of variance along all directions in $\mathbb{R}^d$ (Ge et al., 2015). As shown in Table 1, an immediate consequence of such conditions is that they introduce a dependency on the input dimension $d$ in the convergence rate. Furthermore, it is unknown as of today whether this condition is satisfied by the intrinsic noise of vanilla SGD for any specific class of machine learning models. Recent empirical observations show that this is not the case for training neural networks (Chaudhari & Soatto, 2017). In this work, we therefore turn our attention to the following question. Do we need to perturb iterates along all dimensions in order for (S)GD to converge to a second-order stationary point? Or is it enough to simply rely on the inherent variance of SGD induced by sampling?

More than a purely theoretical exercise, this question has important practical implications, since in practice the vast majority of existing SGD methods do not add additional noise and therefore do not meet the requirement of isotropic noise. We thus instead focus our attention on a less restrictive condition, for which perturbations only have a guaranteed variance along directions of negative curvature of the objective, i.e. along the eigenvector(s) associated with the minimum eigenvalue of the Hessian. Instead of explicitly adding noise as done in Eqs. (3) and (4), we will from now on consider the simple SGD step

$$w_{t+1} = w_t - \eta\, \nabla f_z(w_t) \tag{5}$$

and propose the following sufficient condition on the stochastic gradient $\nabla f_z(w)$ to guarantee convergence to a second-order stationary point.

Assumption 1 (Correlated Negative Curvature (CNC)). Let $v_w$ be the eigenvector corresponding to the minimum eigenvalue of the Hessian matrix $\nabla^2 f(w)$. The stochastic gradient $\nabla f_z(w)$ satisfies the CNC assumption if the second moment of its projection along the direction $v_w$ is uniformly bounded away from zero, i.e.

$$\exists\, \gamma > 0 \ \text{s.t.}\ \forall w: \quad \mathbb{E}\big[\langle v_w, \nabla f_z(w)\rangle^2\big] > \gamma. \tag{6}$$

Contributions. Our contribution is fourfold. First, we analyze the convergence of GD perturbed by SGD steps (Algorithm 1). Under the CNC assumption, we demonstrate that this method converges to an $(\epsilon, \epsilon^{2/5})$-second-order stationary point in $\mathcal{O}(\epsilon^{-2})$ iterations, with high probability. Second, we prove that vanilla SGD as stated in Algorithm 2 - again under Assumption 1 - also converges to an $(\epsilon, \epsilon^{2/5})$-second-order stationary point, in $\mathcal{O}(\epsilon^{-4})$ iterations and with high probability. To the best of our knowledge, this is the first second-order convergence result for SGD without added noise. One important consequence of not relying on isotropic noise is that the rate of convergence becomes independent of the input dimension $d$. This can be a very significant practical advantage when optimizing deep neural networks that contain millions of trainable parameters. Third, we prove that stochastic gradients satisfy Assumption 1 in the setting of learning half-spaces, which is ubiquitous in machine learning. Finally, we provide experimental evidence suggesting the validity of this condition for training neural networks.
In particular, we show that while the variance of uniform noise along the eigenvectors corresponding to the most negative eigenvalue decreases as $\mathcal{O}(1/d)$, stochastic gradients have a significant component along this direction, independent of the width and depth of the neural network. When looking at the entire eigenspectrum, we find that this variance increases with the magnitude of the associated eigenvalues. Hereby, we contribute to a better understanding of the success of training deep networks with SGD and its extensions.

2. Background & Related Work

Reaching a 1st-order stationary point. For smooth non-convex functions, a first-order stationary point satisfying $\|\nabla f(x)\| \le \epsilon$ can be reached by GD and SGD in $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ iterations, respectively (Nesterov, 2013). Recently, it has been shown that GD can be accelerated to find such a point in $\mathcal{O}(\epsilon^{-7/4}\log(\epsilon^{-1}))$ iterations (Carmon et al., 2017).

Reaching a 2nd-order stationary point. In order to reach second-order stationary points, existing first-order techniques rely on explicitly adding isotropic noise with a known variance (see Eq. (3)). The key motivation for this step is the insight that the basin of attraction of a saddle point constitutes an unstable manifold, and thus gradient descent methods are unlikely to get stuck; but if they do, adding noise allows them to escape (Lee et al., 2016). Based upon this observation, recent works prove second-order convergence of normalized GD (Levy, 2016) and perturbed GD (Jin et al., 2017a). The latter needs at most $\mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_h^{-4}\}\log^4(d))$ iterations and is thus the first to achieve a poly-logarithmic dependency on the dimensionality. The convergence of SGD with additional noise was analyzed in (Ge et al., 2015), but to the best of our knowledge no prior work demonstrated convergence of SGD without explicitly adding noise.

| Algorithm | First-order complexity | Second-order complexity | $d$ dependency |
| --- | --- | --- | --- |
| Perturbed SGD (Ge et al., 2015) | $\mathcal{O}(d^p \epsilon_g^{-4})$ | $\mathcal{O}(d^p \epsilon_h^{-16})$ | poly |
| SGLD (Zhang et al., 2017) | $\mathcal{O}(d^p \epsilon_g^{-2})$ | $\mathcal{O}(d^p \epsilon_h^{-4})$ | poly |
| PGD (Jin et al., 2017a) | $\mathcal{O}(\log^4(d/\epsilon_g)\, \epsilon_g^{-2})$ | $\mathcal{O}(\log^4(d/\epsilon_h)\, \epsilon_h^{-4})$ | poly-log |
| SGD+NEON (Xu & Yang, 2017) | $\tilde{\mathcal{O}}(\epsilon_g^{-4})$ | $\tilde{\mathcal{O}}(\epsilon_h^{-8})$ | poly-log |
| CNC-PGD (Algorithm 1) | $\mathcal{O}(\epsilon_g^{-2}\log(1/\epsilon_g))$ | $\mathcal{O}(\epsilon_h^{-5}\log(1/\epsilon_h))$ | free |
| CNC-SGD (Algorithm 2) | $\mathcal{O}(\epsilon_g^{-4}\log^2(1/\epsilon_g))$ | $\mathcal{O}(\epsilon_h^{-10}\log^2(1/\epsilon_h))$ | free |

Table 1. Dimension dependency and iteration complexity to reach a second-order stationary point as characterized in Eq. (2). The notation $\mathcal{O}(\cdot)$ hides constant factors and $\tilde{\mathcal{O}}(\cdot)$ hides a poly-logarithmic factor.

Using curvature information. Since negative curvature signals potential descent directions, it seems logical to apply a second-order method to exploit this curvature direction in order to escape saddle points. Yet, the prototypical Newton's method has no global convergence guarantee and is locally attracted by saddle points and even local maxima (Dauphin et al., 2014). Another issue is the computation (and perhaps storage) of the Hessian matrix, which requires $\mathcal{O}(nd^2)$ operations, as well as the computation of its inverse, which requires $\mathcal{O}(d^3)$ operations. The first problem can be solved by using trust-region methods that guarantee convergence to a second-order stationary point (Conn et al., 2000). Among these methods, the cubic regularization technique initially proposed by (Nesterov & Polyak, 2006) has been shown to achieve the optimal worst-case iteration bound $\mathcal{O}(\max\{\epsilon_g^{-3/2}, \epsilon_h^{-3}\})$ (Cartis et al., 2012).
The second problem can be addressed by replacing the computation of the Hessian by Hessian-vector products, which can be computed efficiently in $\mathcal{O}(nd)$ (Pearlmutter, 1994). This is applied, e.g., using matrix-free Lanczos iterations (Curtis & Robinson, 2017; Reddi et al., 2017) or online variants such as Oja's algorithm (Allen-Zhu, 2017). Sub-sampling the Hessian can furthermore reduce the dependence on $n$ via various sampling schemes (Kohler & Lucchi, 2017; Xu et al., 2017). Finally, (Xu & Yang, 2017) and (Allen-Zhu & Li, 2017) showed that noisy gradient updates act as a noisy power method, allowing one to find a negative curvature direction using only first-order information. Despite the recent theoretical improvements obtained by such techniques, first-order methods still dominate for training large deep neural networks. Their theoretical properties are, however, not perfectly well understood in the general case, and we here aim to deepen the current understanding.

3. GD Perturbed by Stochastic Gradients

In this section we derive a convergence guarantee for a combination of gradient descent and stochastic gradient steps, as presented in Algorithm 1, for the case where the stochastic gradient sequence meets the CNC assumption introduced in Eq. (6). We name this algorithm CNC-PGD since it is a modified version of the PGD method (Jin et al., 2017a), but it uses the intrinsic noise of SGD instead of requiring noise isotropy. Our theoretical analysis relies on the following smoothness conditions on the objective function $f$.

Assumption 2 (Smoothness Assumption). We assume that the function $f \in C^2(\mathbb{R}^d, \mathbb{R})$ has $L$-Lipschitz gradients and $\rho$-Lipschitz Hessians, and that each function $f_z$ has an $\ell$-bounded gradient.¹ W.l.o.g. we further assume that $\rho$, $\ell$, and $L$ are greater than one.

Note that $L$-smoothness and $\rho$-Hessian Lipschitzness are standard assumptions for convergence analysis to a second-order stationary point (Ge et al., 2015; Jin et al., 2017a; Nesterov & Polyak, 2006). The boundedness of the stochastic gradient $\nabla f_z(w)$ is often used in stochastic optimization (Moulines & Bach, 2011).

¹See Appendix A for formal definitions.

Algorithm 1 CNC-PGD
1: Input: $g_{\text{thres}}$, $t_{\text{thres}}$, $T$, $\eta$ and $r$
2: $t_{\text{noise}} \leftarrow -t_{\text{thres}} - 1$
3: for $t = 1, 2, \dots, T$ do
4: if $\|\nabla f(w_t)\|^2 \le g_{\text{thres}}$ and $t - t_{\text{noise}} \ge t_{\text{thres}}$ then
5: $\tilde{w}_t \leftarrow w_t$, $t_{\text{noise}} \leftarrow t$ # used in the analysis
6: $w_{t+1} \leftarrow w_t - r\,\nabla f_z(w_t)$ # z drawn i.i.d. from P
7: else
8: $w_{t+1} \leftarrow w_t - \eta\,\nabla f(w_t)$
9: end if
10: end for
11: return $\hat{w}$ drawn uniformly from $\{w_1, \dots, w_T\}$

| Parameter | Value | Dependency on $\epsilon$ |
| --- | --- | --- |
| $\eta$ | $1/L$ | independent |
| $r$ | $c_1 \delta\gamma\epsilon^{4/5}/(\ell^3 L^2)$ | $\mathcal{O}(\epsilon^{4/5})$ |
| $\omega$ | $\log\left(\ell L/(\gamma\delta\epsilon)\right)$ | $\mathcal{O}(\log(1/\epsilon))$ |
| $t_{\text{thres}}$ | $c_2 L\,(\sqrt{\rho}\,\epsilon^{2/5})^{-1}\omega$ | $\mathcal{O}(\epsilon^{-2/5}\log(1/\epsilon))$ |
| $f_{\text{thres}}$ | $c_3 \delta\gamma^2\epsilon^{8/5}/(\ell^2 L)^2$ | $\mathcal{O}(\epsilon^{8/5})$ |
| $g_{\text{thres}}$ | $f_{\text{thres}}/t_{\text{thres}}$ | $\mathcal{O}(\epsilon^{2}/\log(1/\epsilon))$ |
| $T$ | $4(f(w_0) - f^\star)/(\eta\delta g_{\text{thres}})$ | $\mathcal{O}(\epsilon^{-2}\log(1/\epsilon))$ |

Table 2. Parameters of CNC-PGD. Note that the constants $f_{\text{thres}}$ and $\omega$ are only needed for the analysis and are thus not required to run Algorithm 1. The constant $\delta \in (0, 1)$ comes from the probability statement in Theorem 1. Finally, the constants $c_1$, $c_2$ and $c_3$ are independent of the parameters $\gamma$, $\delta$, $\epsilon$, $\ell$, $\rho$, and $L$ (see Appendix B for more details).

Parameters. The analysis presented below relies on a particular choice of parameters. Their values are set based on the desired accuracy $\epsilon$ and presented in Table 2.

3.1. PGD Convergence Result

Theorem 1. Let the stochastic gradients $\nabla f_z(w_t)$ in CNC-PGD satisfy Assumption 1 and let $f$, $f_z$ satisfy Assumption 2. Then Algorithm 1 returns an $(\epsilon, \sqrt{\rho}\,\epsilon^{2/5})$-second-order stationary point with probability at least $1 - \delta$ after

$$\mathcal{O}\left(\frac{(\ell L)^4}{(\delta\gamma\epsilon)^{2}}\, \log\left(\frac{\ell L}{\eta\delta\gamma\epsilon^{2/5}}\right)\right)$$

steps, where $\delta < 1$.

Remark. CNC-PGD converges polynomially to a second-order stationary point under Assumption 1. By relying on isotropic noise, (Jin et al., 2017a) prove convergence to an $(\epsilon, (\rho\epsilon)^{1/2})$-stationary point in $\mathcal{O}(1/\epsilon^2)$ steps. The result of Theorem 1 matches this rate in terms of first-order optimality but is worse by a factor of $\epsilon^{-0.1}$ in terms of the second-order condition. Yet, we do not know whether our rate is the best achievable rate under the CNC condition, nor whether isotropic noise is necessary to obtain a faster rate of convergence. As mentioned previously, a major benefit of employing the CNC condition is that it results in a convergence rate that does not depend on the dimension of the parameter space.² Furthermore, we believe that the dependency on $\gamma$ (Eq. (6)) can be significantly improved.

²This result is not in conflict with the dimensionality-dependent lower bound established in (Simchowitz et al., 2017), since they make no initialization assumption as we do in Assumption 1 (CNC).
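To make the procedure concrete, the following Python sketch implements the CNC-PGD loop of Algorithm 1; the only difference to the PGD method of (Jin et al., 2017a) is that the perturbation is a single stochastic gradient step rather than isotropic noise. It is a sketch only: `full_grad` and `stoch_grad` are placeholder gradient oracles, and the step sizes and thresholds are free arguments rather than the specific values from Table 2 under which Theorem 1 holds.

```python
import numpy as np

def cnc_pgd(w0, full_grad, stoch_grad, T, eta, r, g_thres, t_thres, seed=0):
    """Sketch of CNC-PGD (Algorithm 1): gradient descent whose only perturbation
    is a single SGD step, taken whenever the full gradient is small and at least
    t_thres iterations have passed since the last perturbation."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    t_noise = -t_thres - 1                  # iteration index of the last perturbation
    for t in range(1, T + 1):
        g = full_grad(w)
        if g @ g <= g_thres and t - t_noise >= t_thres:
            t_noise = t
            w = w - r * stoch_grad(w)       # perturbation: one stochastic gradient step (z ~ P)
        else:
            w = w - eta * g                 # plain gradient descent step
        iterates.append(w.copy())
    return iterates[rng.integers(T)]        # return one of the first T iterates uniformly at random
```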
3.2. Proof Sketch of Theorem 1

In order to prove Theorem 1, we consider three different scenarios depending on the magnitude of the gradient and the amount of negative curvature. Our proof scheme is mainly inspired by the analysis of perturbed gradient descent (Jin et al., 2017a), where a deterministic sufficient condition is established for escaping from saddle points (see Lemma 11). This condition is shown to hold in the case of isotropic noise. However, the non-isotropic noise coming from stochastic gradients is more difficult to analyze. Our contribution is to show that a less restrictive assumption on the perturbation noise still allows one to escape saddle points. Detailed proofs of each lemma are provided in the Appendix.

Large gradient regime. When the gradient is large enough, we can invoke existing results on the analysis of gradient descent for non-convex functions (Nesterov, 2013).

Lemma 1. Consider a gradient descent step $w_{t+1} = w_t - \eta\nabla f(w_t)$ on an $L$-smooth function $f$. For $\eta \le 1/L$ this yields the following function decrease:

$$f(w_{t+1}) \le f(w_t) - \frac{\eta}{2}\|\nabla f(w_t)\|^2. \tag{7}$$

Using the above result, we can guarantee the desired decrease whenever the norm of the gradient is large enough. Suppose that $\|\nabla f(w_t)\|^2 \ge g_{\text{thres}}$; then Lemma 1 immediately yields

$$f(w_{t+1}) - f(w_t) \le -\frac{\eta}{2}\, g_{\text{thres}}. \tag{8}$$

Small gradient and sharp negative curvature regime. Consider the setting where the norm of the gradient is small, i.e. $\|\nabla f(w_t)\|^2 \le g_{\text{thres}}$, but the minimum eigenvalue of the Hessian matrix is significantly less than zero, i.e. $\lambda_{\min}(\nabla^2 f(w)) \ll 0$. In such a case, exploiting Assumption 1 (CNC) provides a guaranteed decrease in the function value after $t_{\text{thres}}$ iterations, in expectation.

Lemma 2. Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 1 with parameters as in Table 2) starting from $\tilde{w}_t$ such that $\|\nabla f(\tilde{w}_t)\|^2 \le g_{\text{thres}}$. Assume the Hessian matrix $\nabla^2 f(\tilde{w}_t)$ has a large negative eigenvalue, i.e.

$$\lambda_{\min}\big(\nabla^2 f(\tilde{w}_t)\big) \le -\sqrt{\rho}\,\epsilon^{2/5}. \tag{9}$$

Then, after $t_{\text{thres}}$ iterations the function value decreases as

$$\mathbb{E}\left[f(w_{t+t_{\text{thres}}})\right] - f(\tilde{w}_t) \le -f_{\text{thres}}, \tag{10}$$

where the expectation is over the sequence $\{w_k\}_{k=t+1}^{t+t_{\text{thres}}}$.

Small gradient with moderate negative curvature regime. Suppose that $\|\nabla f(w_t)\|^2 \le g_{\text{thres}}$ and that the absolute value of the minimum eigenvalue of the Hessian is close to zero, i.e. we have already reached the desired first- and second-order optimality. In this case, we can guarantee that adding noise will only lead to a limited increase in terms of expected function value.
Lemma 3. Let Assumptions 1 and 2 hold. Consider perturbed gradient steps (Algorithm 1 with parameters as in Table 2) starting from $\tilde{w}_t$ such that $\|\nabla f(\tilde{w}_t)\|^2 \le g_{\text{thres}}$. Then after $t_{\text{thres}}$ iterations, the function value cannot increase by more than

$$\mathbb{E}\left[f(w_{t+t_{\text{thres}}})\right] - f(\tilde{w}_t) \le \eta\delta f_{\text{thres}}, \tag{11}$$

where the expectation is over the sequence $\{w_k\}_{k=t+1}^{t+t_{\text{thres}}}$.

Joint analysis. We now combine the results of the three scenarios discussed so far. Towards this end we introduce the set

$$\mathcal{S} := \left\{ w \in \mathbb{R}^d \;\middle|\; \|\nabla f(w)\|^2 \ge g_{\text{thres}} \ \text{or}\ \lambda_{\min}\big(\nabla^2 f(w)\big) \le -\sqrt{\rho}\,\epsilon^{2/5} \right\}.$$

Each of the visited parameters $w_t$, $t = 1, \dots, T$, constitutes a random variable. For each of these random variables we define the event $A_t := \{w_t \in \mathcal{S}\}$. When $A_t$ occurs, the function value decreases in expectation. Since the numbers of steps required in the analysis of the large gradient regime and of the sharp curvature regime are different, we use an amortized analysis similar to (Jin et al., 2017a), where we consider the per-step decrease.³ Indeed, when the negative curvature is sharp, Lemma 2 provides a guaranteed decrease in $f$ which, when normalized per step, yields

$$\frac{\mathbb{E}\left[f(w_{t+t_{\text{thres}}})\right] - f(\tilde{w}_t)}{t_{\text{thres}}} \le -\frac{f_{\text{thres}}}{t_{\text{thres}}} = -\eta g_{\text{thres}}. \tag{12}$$

The large gradient norm regime of Lemma 1 guarantees a decrease of the same order, and hence

$$\mathbb{E}\left[f(w_{t+1}) - f(w_t) \mid A_t\right] \le -\frac{\eta}{2}\, g_{\text{thres}} \tag{13}$$

follows from combining the two results. Let us now consider the case when $A_t^c$ (the complement of $A_t$) occurs. Then the result of Lemma 3 allows us to bound the increase in terms of function value, i.e.

$$\mathbb{E}\left[f(w_{t+1}) - f(w_t) \mid A_t^c\right] \le \frac{\eta\delta}{4}\, g_{\text{thres}}. \tag{14}$$

³Note that the amortization technique is used here to simplify the presentation, but all our results hold without amortization.

Probabilistic bound. The results established so far have shown that in expectation the function value decreases until the iterates reach a second-order stationary point, for which Lemma 3 guarantees that the function value does not increase too much subsequently.⁴ This result guarantees visiting a second-order stationary point within $T$ steps (see Table 2). Yet, certifying second-order optimality is slightly more intricate, as one would need to know which of the parameters $\{w_1, \dots, w_T\}$ meets the required condition. One solution to address this problem is to provide a high-probability statement, as suggested in (Jin et al., 2017a) (see Lemma 10). We here follow a similar approach, except that unlike the result of (Jin et al., 2017a), which relies on exact function values, our results hold in expectation. Our solution is to establish a high-probability bound by returning one of the visited parameters picked uniformly at random. This approach is often used in stochastic non-convex optimization (Ghadimi & Lan, 2013). The idea is simple: if the number of steps is sufficiently large, then the results of Lemmas 1-3 guarantee that the number of times we visit a second-order stationary point is high. Let $R$ be the random variable giving the fraction of $(\epsilon, \sqrt{\rho}\,\epsilon^{2/5})$-second-order stationary points visited along the optimization path $\{w_t\}_{t=1,\dots,T}$. Formally,

$$R := \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}(A_t^c), \tag{15}$$

where $\mathbf{1}$ is the indicator function. Let $P_t$ denote the probability of the event $A_t$ and $1 - P_t$ the probability of its complement $A_t^c$. The probability of returning a second-order stationary point is then simply

$$\mathbb{E}[R] = \frac{1}{T}\sum_{t=1}^{T} (1 - P_t). \tag{16}$$

⁴Since there may exist degenerate saddle points which are second-order stationary but not local minima, we cannot guarantee that PGD stays close to a second-order stationary point it visits. One could rule out degenerate saddles using the strict-saddle assumption introduced in (Ge et al., 2015).
Estimating the probabilities $P_t$ is difficult due to the interdependence of the random variables $w_t$. However, we can upper bound the sum of the individual $P_t$. Using the law of total expectation and the results from Eqs. (13) and (14), we bound the expected decrease in function value as

$$\mathbb{E}\left[f(w_{t+1}) - f(w_t)\right] \le \frac{\eta g_{\text{thres}}}{2}\left(\frac{\delta}{2} - \Big(1 + \frac{\delta}{2}\Big)P_t\right). \tag{17}$$

Summing over $T$ iterations yields

$$\sum_{t=1}^{T}\Big(\mathbb{E}\left[f(w_{t+1})\right] - \mathbb{E}\left[f(w_t)\right]\Big) \le \frac{\eta g_{\text{thres}}}{2}\left(\frac{\delta T}{2} - \Big(1 + \frac{\delta}{2}\Big)\sum_{t=1}^{T} P_t\right), \tag{18}$$

which, after rearranging terms (telescoping the left-hand side, lower-bounding it by $f^\star - f(w_0)$, and using the choice of $T$ in Table 2), leads to the following upper bound:

$$\frac{1}{T}\sum_{t=1}^{T} P_t \le \frac{2}{2+\delta}\left(\frac{\delta}{2} + \frac{2\big(f(w_0) - f^\star\big)}{T\eta g_{\text{thres}}}\right) \le \delta. \tag{19}$$

Therefore, the probability that $A_t^c$ occurs, uniformly over $\{1, \dots, T\}$, is lower bounded as

$$\frac{1}{T}\sum_{t=1}^{T} (1 - P_t) \ge 1 - \delta, \tag{20}$$

which concludes the proof of Theorem 1.

4. SGD without Perturbation

We now turn our attention to the stochastic variant of gradient descent under the assumption that the stochastic gradients fulfill the CNC condition (Assumption 1). We name this method CNC-SGD and demonstrate that it converges to a second-order stationary point without any additional perturbation. Note that in order to provide the convergence guarantee, we periodically enlarge the step size throughout the optimization process, as outlined in Algorithm 2. This periodic step size increase amplifies the variance along the eigenvectors corresponding to the minimum eigenvalue of the Hessian, allowing SGD to exploit the negative curvature in the subsequent steps (which use a smaller step size). Increasing the step size therefore plays a role similar to the perturbation step used in CNC-PGD (Algorithm 1). Although this may not be very common in practice, adaptive step sizes are not unusual in the literature (see e.g. (Goyal et al., 2017)).

Parameters. The analysis of CNC-SGD relies on the particular choice of parameters presented in Table 3.

Algorithm 2 CNC-SGD
1: Input: $t_{\text{thres}}$, $r$, $\eta$, and $T$ ($\eta < r$)
2: for $t = 1, 2, \dots, T$ do
3: if $(t \bmod t_{\text{thres}}) = 0$ then
4: $\tilde{w}_t \leftarrow w_t$ # used in the analysis
5: $w_{t+1} \leftarrow w_t - r\,\nabla f_z(w_t)$ # z drawn i.i.d. from P
6: else
7: $w_{t+1} \leftarrow w_t - \eta\,\nabla f_z(w_t)$ # z drawn i.i.d. from P
8: end if
9: end for
10: return $\hat{w}$ drawn uniformly from $\{w_1, \dots, w_T\}$

| Parameter | Value | Dependency on $\epsilon$ |
| --- | --- | --- |
| $r$ | $c_1 \delta\gamma\epsilon^{4/5}/(\ell^3 L)$ | $\mathcal{O}(\epsilon^{4/5})$ |
| $f_{\text{thres}}$ | $c_2 \delta\gamma^2\epsilon^{8/5}/(\ell^4 L)$ | $\mathcal{O}(\epsilon^{8/5})$ |
| $\omega$ | $c_3 \log\left(\ell L/(\eta\epsilon r)\right)$ | $\mathcal{O}(\log(1/\epsilon))$ |
| $\eta$ | $c_4 \delta^2\gamma^2\epsilon^{2}/(\ell^6 L^2\omega)$ | $\mathcal{O}(\epsilon^{2}/\log(1/\epsilon))$ |
| $t_{\text{thres}}$ | $(\eta\epsilon^{2/5})^{-1}\omega$ | $\mathcal{O}(\epsilon^{-12/5}\log^2(1/\epsilon))$ |
| $g_{\text{thres}}$ | $f_{\text{thres}}/t_{\text{thres}}$ | $\mathcal{O}(\epsilon^{4}/\log^2(1/\epsilon))$ |
| $T$ | $4(f(w_0) - f^\star)/(\delta g_{\text{thres}})$ | $\mathcal{O}(\epsilon^{-4}\log^2(1/\epsilon))$ |

Table 3. Parameters of CNC-SGD: the parameters $f_{\text{thres}}$ and $g_{\text{thres}}$ are used exclusively in the analysis and are thus not needed to run the algorithm. The constants $c_1, c_2, \dots, c_4$ are independent of the parameters $\gamma$, $\delta$, $\epsilon$, $\rho$, and $L$ (see Appendix B for more details).

Theorem 2. Let the stochastic gradients $\nabla f_z(w_t)$ in CNC-SGD satisfy Assumption 1 and let $f$, $f_z$ satisfy Assumption 2. Then Algorithm 2 returns an $(\epsilon, \sqrt{\rho}\,\epsilon^{2/5})$-second-order stationary point with probability at least $1 - \delta$ after $T = \mathcal{O}\big(\epsilon^{-4}\log^2(1/\epsilon)\big)$ steps (with $T$ as specified in Table 3), where $\delta < 1$.

Remarks. As reported in Table 1, perturbed SGD - with isotropic noise - converges to an $(\epsilon, \epsilon^{1/4})$-second-order stationary point in $\mathcal{O}(d^p\epsilon^{-4})$ steps (Ge et al., 2015). Here, we prove that under the CNC assumption, vanilla SGD - i.e. without perturbations - converges to an $(\epsilon, \sqrt{\rho}\,\epsilon^{2/5})$-second-order stationary point using $\mathcal{O}(\epsilon^{-4})$ stochastic gradient steps. Our result matches the result of (Ge et al., 2015) in terms of first-order optimality and yields an improvement by a factor of $\epsilon^{0.15}$ in terms of second-order optimality. However, this second-order optimality rate is still worse by a factor of $\epsilon^{-0.1}$ compared to the best known convergence rate for perturbed SGD, established by (Zhang et al., 2017), which requires $\mathcal{O}(d^p\epsilon^{-4})$ iterations for an $(\epsilon, \epsilon^{1/2})$-second-order stationary point. One can even improve the convergence guarantee of SGD by using the NEON framework (Allen-Zhu & Li, 2017; Xu & Yang, 2017), but a perturbation with isotropic noise is still required. The theoretical guarantees we provide in Theorem 2, however, are based on a less restrictive assumption. As we prove in the following section, this assumption actually holds for stochastic gradients when learning half-spaces. Subsequently, in Section 6, we present empirical observations that suggest its validity even for training wide and deep neural networks.
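Before turning to the half-space setting, the following Python sketch illustrates the update schedule of Algorithm 2. It is a sketch only: `stoch_grad` is a placeholder for an unbiased stochastic gradient oracle, and `r`, `eta`, and `t_thres` are free arguments rather than the values from Table 3 required by Theorem 2.

```python
import numpy as np

def cnc_sgd(w0, stoch_grad, T, eta, r, t_thres, seed=0):
    """Sketch of CNC-SGD (Algorithm 2): plain SGD with step size eta, except that
    every t_thres iterations a single step with the larger step size r > eta is
    taken, which amplifies the variance along negative-curvature directions."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        step = r if t % t_thres == 0 else eta   # periodic large step (Algorithm 2, line 3)
        w = w - step * stoch_grad(w)            # z is drawn i.i.d. from P inside stoch_grad
        iterates.append(w.copy())
    return iterates[rng.integers(T)]            # return one of the first T iterates uniformly at random
```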
5. Learning Half-spaces with Correlated Negative Curvature

The analysis presented in the previous sections relies on the CNC assumption introduced in Eq. (6). As mentioned before, this assumption is weaker than the isotropic noise condition required in previous work. In this section we confirm the validity of this condition for the problem of learning half-spaces, which is a core problem in machine learning, commonly encountered when training perceptrons, support vector machines or neural networks (Zhang et al., 2015). Learning a half-space reduces to a minimization problem of the following form:

$$\min_{w \in \mathbb{R}^d} f(w) := \mathbb{E}_{z \sim P}\left[\phi(w^\top z)\right], \tag{21}$$

where $\phi$ is an arbitrary loss function and the data distribution $P$ might have finite or infinite support. There are different choices for the loss function $\phi$, e.g. the zero-one loss, the sigmoid loss or a piece-wise linear loss (Zhang et al., 2015). Here, we assume that $\phi(\cdot)$ is differentiable. Generally, the objective $f(w)$ is non-convex and might exhibit many local minima and saddle points. Note that the stochastic gradient is unbiased and defined as

$$\nabla f_z(w) = \phi'(w^\top z)\, z, \qquad \nabla f(w) = \mathbb{E}_z\left[\nabla f_z(w)\right], \tag{22}$$

where the samples $z$ are drawn from the distribution $P$.

Noise isotropy vs. CNC assumption. First, one can easily find a scenario where the noise isotropy condition is violated for stochastic gradients. Take for example the case where the data distribution from which $z$ is sampled lives in a low-dimensional subspace $\mathcal{L} \subset \mathbb{R}^d$. In this case, one can prove that there exists a vector $u \in \mathbb{R}^d$ orthogonal to all $z \in \mathcal{L}$. Then clearly $\mathbb{E}\big[(u^\top \nabla f_z(w))^2\big] = 0$, and thus $\nabla f_z(w)$ does not have components along all directions. However - under mild assumptions - we show that the stochastic gradients do have a significant component along directions of negative curvature. Lemma 4 makes this argument precise by establishing a lower bound on the second moment of the stochastic gradients projected onto eigenvectors corresponding to negative eigenvalues of the Hessian matrix $\nabla^2 f(w)$. To establish this lower bound we require the following structural property of the loss function $\phi$.

Assumption 3. Suppose that the magnitude of the second-order derivative of $\phi$ is bounded by a constant factor of its first-order derivative, i.e.

$$|\phi''(\alpha)| \le c\,|\phi'(\alpha)| \tag{23}$$

holds for all $\alpha$ in the domain of $\phi$ and some $c > 0$.

The reader might notice that this condition resembles the self-concordance assumption often used in the optimization literature (Nesterov, 2013), for which the second derivative is bounded by the third derivative. One can easily check that this condition is fulfilled by commonly used activation functions in neural networks, such as the sigmoid and softplus.
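As a quick numerical illustration (not part of the original analysis), the snippet below checks Assumption 3 for one common parameterization of the sigmoid loss, $\phi(\alpha) = 1/(1 + e^{\alpha})$, for which the ratio $|\phi''|/|\phi'|$ equals $|1 - 2\sigma(-\alpha)| \le 1$, so $c = 1$ suffices; it also implements the corresponding stochastic gradient of Eq. (22). The choice of loss is an assumption made only for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sigmoid loss phi(alpha) = 1 / (1 + exp(alpha)) and its first two derivatives,
# written via s = sigmoid(-alpha): phi' = -s(1-s), phi'' = s(1-s)(1-2s).
def phi(a):
    return sigmoid(-a)

def phi_prime(a):
    s = sigmoid(-a)
    return -s * (1.0 - s)

def phi_second(a):
    s = sigmoid(-a)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

# Numerical check of Assumption 3, |phi''(alpha)| <= c * |phi'(alpha)| with c = 1.
alpha = np.linspace(-20.0, 20.0, 10001)
assert np.all(np.abs(phi_second(alpha)) <= np.abs(phi_prime(alpha)) + 1e-12)

def stoch_grad(w, z):
    """Stochastic gradient for learning half-spaces, Eq. (22): grad f_z(w) = phi'(w^T z) z."""
    return phi_prime(w @ z) * z
```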
We now leverage this property to prove that the stochastic gradient $\nabla f_z(w)$ satisfies Assumption 1 (CNC).

Lemma 4. Consider the problem of learning half-spaces as stated in Eq. (21), where $\phi$ satisfies Assumption 3. Furthermore, assume that the support of $P$ is a subset of the unit sphere.⁵ Let $v$ be a unit-length eigenvector of $\nabla^2 f(w)$ with corresponding eigenvalue $\lambda < 0$. Then

$$\mathbb{E}_z\left[\big(\nabla f_z(w)^\top v\big)^2\right] \ge (\lambda/c)^2. \tag{24}$$

⁵This assumption is equivalent to assuming that the random variable $z$ lies inside the unit sphere, which is common in learning half-spaces (Zhang et al., 2015).

Discussion. Since the result of Lemma 4 holds for any eigenvector $v$ associated with a negative eigenvalue $\lambda < 0$, this naturally includes the eigenvector(s) corresponding to $\lambda_{\min}$. As a result, Assumption 1 (CNC) holds for stochastic gradients on learning half-spaces. Combining this result with the convergence guarantee derived in Theorem 1 implies that a mix of SGD and GD steps (Algorithm 1) obtains a second-order stationary point in polynomial time. Furthermore, according to Theorem 2, vanilla SGD obtains a second-order stationary point in polynomial time without any explicit perturbation. Notably, both established convergence guarantees are dimension-free.

Furthermore, Lemma 4 reveals an interesting relationship between stochastic gradients and eigenvectors at a given iterate $w$: the variance of stochastic gradients along these eigenvectors scales proportionally to the magnitude of the negative eigenvalues in the spectrum of the Hessian matrix. This is in clear contrast to the case of isotropic noise, whose variance is uniformly distributed along all eigenvectors of the Hessian matrix. The difference can be important from a generalization point of view. Consider the simplified setting where $\phi$ is the square loss. Then the eigenvectors with large eigenvalues correspond to the principal directions of the data. In this regard, having a lower variance along the non-principal directions avoids over-fitting. In the following section we confirm the above results and furthermore show experiments on neural networks that suggest the validity of these results beyond the setting of learning half-spaces.

6. Experiments

In this section we first show that vanilla SGD (Algorithm 2) as well as GD with a stochastic gradient step as perturbation (Algorithm 1) indeed escape saddle points. Towards this end, we initialize SGD, GD, perturbed GD with isotropic noise (ISO-PGD) (Jin et al., 2017a) and CNC-PGD close to a saddle point of a low-dimensional learning-half-spaces problem with Gaussian input data and sigmoid loss. Figure 2 shows suboptimality over epochs, averaged over 10 runs. The results are in line with our analysis, since all stochastic methods quickly find a negative curvature direction to escape the saddle point. See Appendix E for more details.⁶

[Figure 2. Learning half-spaces (n = 40, d = 4): the stochastic methods need fewer iterations to escape the saddle.]

⁶Rather than an encompassing benchmark of the different methods, this result is to be seen as a proof of concept.

Secondly - and more importantly - we study the properties of the variance of stochastic gradients depending on the width and depth of neural networks. All of these experiments are conducted using feed-forward networks on the well-known MNIST classification task (n = 70,000). Specifically, we draw m = 30 random parameters $w_i$ in each of these networks and test Assumption 1 by estimating the second moment of the stochastic gradients projected onto the eigenvectors $v_k$ of $\nabla^2 f(w_i)$ as follows:

$$\mu_k := \frac{1}{n}\sum_{j=1}^{n}\big(\nabla f_j(w_i)^\top v_k\big)^2.$$

We do the same for $n$ isotropic noise vectors drawn from the unit ball $B_d$ around each $w_i$.⁷

⁷For a fair comparison, all involved vectors were normalized.
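The following self-contained NumPy sketch mimics this kind of measurement on a small synthetic regression network rather than the MNIST models used in the paper. The architecture, the data, and the finite-difference Hessian are illustrative stand-ins, chosen only to show how the projected second moments of (normalized) per-sample gradients and of isotropic noise can be compared; it is not the authors' exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer tanh network with squared loss on synthetic data; a
# stand-in for the MNIST networks in the paper (all sizes are illustrative).
n, d_in, h = 200, 5, 8
X = rng.normal(size=(n, d_in))
y = rng.normal(size=n)
dim = d_in * h + h                       # number of parameters (no biases, for brevity)

def unpack(w):
    return w[:d_in * h].reshape(d_in, h), w[d_in * h:]

def grad_sample(w, j):
    """Gradient of the per-sample loss 0.5 * (a(x_j) @ w2 - y_j)^2 w.r.t. all parameters."""
    W1, w2 = unpack(w)
    a = np.tanh(X[j] @ W1)               # hidden activations
    err = a @ w2 - y[j]                  # residual
    dW1 = np.outer(X[j], (1.0 - a**2) * w2) * err
    dw2 = a * err
    return np.concatenate([dW1.ravel(), dw2])

def full_grad(w):
    return np.mean([grad_sample(w, j) for j in range(n)], axis=0)

def hessian_fd(w, eps=1e-4):
    """Full-batch Hessian via central finite differences of the gradient."""
    H = np.zeros((dim, dim))
    for i in range(dim):
        e = np.zeros(dim)
        e[i] = eps
        H[:, i] = (full_grad(w + e) - full_grad(w - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)               # symmetrize

w_i = rng.normal(size=dim)               # one random parameter setting w_i
eigvals, eigvecs = np.linalg.eigh(hessian_fd(w_i))
v_min = eigvecs[:, 0]                    # eigenvector of the smallest (typically negative) eigenvalue

# Second moment of normalized per-sample gradients projected onto v_min
# (vectors are normalized, as in footnote 7) ...
G = np.array([grad_sample(w_i, j) for j in range(n)])
G /= np.linalg.norm(G, axis=1, keepdims=True)
mu_sgd = np.mean((G @ v_min) ** 2)

# ... compared with isotropic noise drawn uniformly from the unit sphere,
# whose expected squared projection onto any fixed direction is 1/dim.
U = rng.normal(size=(n, dim))
U /= np.linalg.norm(U, axis=1, keepdims=True)
mu_iso = np.mean((U @ v_min) ** 2)

print(f"lambda_min={eigvals[0]:.4f}  mu_sgd={mu_sgd:.4f}  mu_iso={mu_iso:.4f}  1/d={1.0/dim:.4f}")
```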
[Figure 1. Average variance of stochastic gradients (1) and isotropic noise (2) along eigenvectors corresponding to $\lambda_{\min}$, as well as extreme eigenvalues (3), of 30 random weight settings in a 1-layer neural network with increasing number of units U (top) and a multi-layer neural network with increasing number of hidden layers HL (bottom).]

Figure 1 shows this estimate for eigenvectors corresponding to the minimum eigenvalues for a network with one hidden layer and an increasing number of units (top), and for a network with 10 hidden units per layer and an increasing number of layers (bottom). Similar results for the entire negative eigenspectrum can be found in Appendix E. Figure 3 shows how $\mu_k$ varies with the magnitude of the corresponding negative eigenvalues $\lambda_k$. Again we evaluate 30 random parameter settings in neural networks with increasing depth.

[Figure 3. Variance of stochastic gradients along eigenvectors corresponding to eigenvalues of different magnitudes, computed on neural networks with 8, 16 and 32 hidden layers. Scatter plot and fitted linear model with 95% confidence interval.]

Two conclusions can be drawn from the results: (i) although the variance of isotropic noise along eigenvectors corresponding to $\lambda_{\min}$ decreases as $\mathcal{O}(1/d)$, the stochastic gradients maintain a significant component along the directions of most negative curvature, independent of the width and depth of the neural network (see Figure 1); (ii) the stochastic gradients yield an increasing variance along eigenvectors corresponding to larger eigenvalues (see Figure 3).

These findings suggest important implications. Finding (i) justifies the use, and helps explain the success, of training wide and deep neural networks with pure SGD despite the presence of saddle points. Finding (ii) suggests that the bound established in Lemma 4 may well extend to more general settings such as training neural networks, and it illustrates the implicit regularization of optimization methods that rely on stochastic gradients, since directions of large curvature correspond to principal (more robust) components of the data for many machine learning models.

7. Conclusion

In this work we have analyzed the convergence of PGD and SGD for optimizing non-convex functions under a new assumption - named CNC - that requires the stochastic noise to exhibit a certain amount of variance along the directions of most negative curvature. This is a less restrictive assumption than the noise isotropy condition required by previous work, which causes a dependency on the problem dimensionality in the convergence rate. For the problem of learning half-spaces, we have shown theoretically that stochastic gradients satisfy the CNC assumption and exhibit a variance proportional to the magnitude of the eigenvalues. Furthermore, we provided empirical evidence that suggests the validity of this assumption in the context of neural networks, and thereby contribute to a better understanding of training these models with stochastic gradients. Proving this observation theoretically and investigating its implications for the optimization and generalization properties of stochastic gradient methods is an interesting direction of future research.
Acknowledgements

We would like to thank Kfir Levy, Gary Becigneul, Yannic Kilcher and Kevin Roth for their helpful discussions. We also thank Antonio Orvieto for pointing out a mistake in an early draft.

References

Allen-Zhu, Z. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.

Allen-Zhu, Z. and Li, Y. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pp. 177-186. Springer, 2010.

Carmon, Y., Hinder, O., Duchi, J. C., and Sidford, A. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. arXiv preprint arXiv:1705.02766, 2017.

Cartis, C., Gould, N. I., and Toint, P. L. How much patience do you have? A worst-case perspective on smooth nonconvex optimization. Science and Technology Facilities Council Swindon, 2012.

Chaudhari, P. and Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029, 2017.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In AISTATS, 2015.

Conn, A. R., Gould, N. I., and Toint, P. L. Trust Region Methods. SIAM, 2000.

Curtis, F. E. and Robinson, D. P. Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412, 2017.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933-2941, 2014.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points - online stochastic gradient for tensor decomposition. In COLT, pp. 797-842, 2015.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Hillar, C. J. and Lim, L.-H. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017a.

Jin, C., Netrapalli, P., and Jordan, M. I. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, 2017.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.

Levy, K. Y. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.

Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451-459, 2011.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.

Reddi, S. J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., and Smola, A. J. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017.

Simchowitz, M., Alaoui, A. E., and Recht, B. On the gap between strict-saddles and true convexity: An Omega(log d) lower bound for eigenvector approximation. arXiv preprint arXiv:1704.04548, 2017.

Xu, P., Roosta-Khorasani, F., and Mahoney, M. W. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.

Xu, Y. and Yang, T. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.

Zhang, Y., Lee, J. D., Wainwright, M. J., and Jordan, M. I. Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.

Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient Langevin dynamics. In COLT, 2017.