# Escaping Saddle Points Faster with Stochastic Momentum

Jun-Kun Wang, Chi-Heng Lin, & Jacob Abernethy
Georgia Institute of Technology
{jimwang,cl3385,prof}@gatech.edu

Published as a conference paper at ICLR 2020.

Abstract: Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization, particularly for the training of deep neural networks. In standard SGD, parameters are updated by moving along the gradient at the current iterate computed on a batch of examples; the addition of a momentum term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, and variants of it have flourished in the development of other popular update methods, e.g. ADAM (Kingma & Ba (2015)) and AMSGrad (Reddi et al. (2018b)). Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second-order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter: our analysis suggests that β ∈ [0, 1) should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.

1 INTRODUCTION

SGD with stochastic momentum has been a de facto algorithm in nonconvex optimization and deep learning. It has been widely adopted for training machine learning models in various applications. Modern techniques in computer vision (e.g. Krizhevsky et al. (2012); He et al. (2016); Cubuk et al. (2018); Gastaldi (2017)), speech recognition (e.g. Amodei et al. (2016)), natural language processing (e.g. Vaswani et al. (2017)), and reinforcement learning (e.g. Silver et al. (2017)) use SGD with stochastic momentum to train models. The advantage of SGD with stochastic momentum has been widely observed (Hoffer et al. (2017); Loshchilov & Hutter (2019); Wilson et al. (2017)). Sutskever et al. (2013) demonstrate that training deep neural nets by SGD with stochastic momentum achieves faster convergence compared with the standard SGD (i.e. without momentum). The success of momentum makes it a necessary tool for designing new optimization algorithms in optimization and deep learning. For example, all the popular variants of adaptive stochastic gradient methods, like Adam (Kingma & Ba (2015)) or AMSGrad (Reddi et al. (2018b)), include the use of momentum.

Despite the wide use of stochastic momentum (Algorithm 1) in practice, justification for the clear empirical improvements has remained elusive, as has any mathematical guideline for actually setting the momentum parameter; it has been observed that large values (e.g. β = 0.9) work well in practice. It should be noted that Algorithm 1 is the default momentum method in popular software packages such as PyTorch and TensorFlow.¹

¹ Heavy ball momentum is the default choice of momentum method in PyTorch and TensorFlow, instead of Nesterov's momentum. See the manual pages https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html and https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD.
Algorithm 1: SGD with stochastic heavy ball momentum
1: Required: step size parameter η and momentum parameter β.
2: Init: w_0 ∈ ℝ^d and m_{−1} = 0 ∈ ℝ^d.
3: for t = 0 to T do
4:   Given the current iterate w_t, obtain a stochastic gradient g_t := ∇f(w_t; ξ_t).
5:   Update the stochastic momentum m_t := β m_{t−1} + g_t.
6:   Update the iterate w_{t+1} := w_t − η m_t.
7: end for

In this paper we provide a theoretical analysis for SGD with momentum. We identify some mild conditions that guarantee SGD with stochastic momentum will provably escape saddle points faster than the standard SGD, which provides clear evidence for the benefit of using stochastic momentum. Stochastic heavy ball momentum maintains a weighted average of the stochastic gradients at the visited points; the next iterate is computed as the current iterate minus a step in the direction of the momentum. Our analysis shows that these updates can amplify a component in an escape direction of the saddle points.

In this paper, we focus on finding a second-order stationary point for smooth non-convex optimization by SGD with stochastic heavy ball momentum. Specifically, we consider the stochastic nonconvex optimization problem min_{w ∈ ℝ^d} f(w) := E_{ξ∼D}[f(w; ξ)], where we overload notation so that f(w; ξ) represents a stochastic function induced by the randomness ξ, while f(w) is the expectation of the stochastic functions. An (ϵ, ϵ)-second-order stationary point w satisfies

‖∇f(w)‖ ≤ ϵ  and  ∇²f(w) ⪰ −ϵ I.  (1)

Obtaining a second-order guarantee has emerged as a desired goal in the nonconvex optimization community. Since finding a global minimum or even a local minimum in general nonconvex optimization can be NP-hard (Anandkumar & Ge (2016); Nie (2015); Murty & Kabadi (1987); Nesterov (2000)), most papers in nonconvex optimization aim at reaching an approximate second-order stationary point under additional assumptions such as Lipschitzness of the gradients and the Hessian (e.g. Allen-Zhu & Li (2018); Carmon & Duchi (2018); Curtis et al. (2017); Daneshmand et al. (2018); Du et al. (2017); Fang et al. (2018; 2019); Ge et al. (2015); Jin et al. (2017; 2019); Kohler & Lucchi (2017); Lei et al. (2017); Lee et al. (2019); Levy (2016); Mokhtari et al. (2018); Nesterov & Polyak (2006); Reddi et al. (2018a); Staib et al. (2019); Tripuraneni et al. (2018); Xu et al. (2018)).² We follow these related works for the goal and aim at showing the benefit of the use of momentum in reaching an (ϵ, ϵ)-second-order stationary point.

We introduce a required condition, akin to a model assumption made in (Daneshmand et al. (2018)), that ensures the dynamic procedure in Algorithm 2 produces updates with suitable correlation with the negative curvature directions of the function f.

Definition 1. Assume, at some time t, that the Hessian H_t = ∇²f(w_t) has some eigenvalue smaller than −ϵ and that ‖∇f(w_t)‖ ≤ ϵ. Let v_t be the eigenvector corresponding to the smallest eigenvalue of ∇²f(w_t). The stochastic momentum m_t satisfies Correlated Negative Curvature (CNC) at t with parameter γ > 0 if

E_t[⟨m_t, v_t⟩²] ≥ γ.  (2)

As we will show, the recursive dynamics of SGD with heavy ball momentum help to amplify the escape signal γ, which allows it to escape saddle points faster.

² We apologize that the list is far from exhaustive.
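To make the update in Algorithm 1 concrete, the following is a minimal NumPy sketch of the heavy ball loop. The toy stochastic objective `toy_grad`, the target `w_star`, and all hyperparameter values are illustrative assumptions for demonstration, not part of the paper.

```python
import numpy as np

def sgd_heavy_ball(grad_fn, w0, eta=0.01, beta=0.9, T=5000, seed=0):
    """Heavy ball update of Algorithm 1: m_t = beta*m_{t-1} + g_t, w_{t+1} = w_t - eta*m_t."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                 # m_{-1} = 0
    for _ in range(T):
        g = grad_fn(w, rng)              # stochastic gradient g_t = grad f(w_t; xi_t)
        m = beta * m + g                 # momentum update (line 5)
        w = w - eta * m                  # iterate update (line 6)
    return w

# Illustrative stochastic objective (an assumption for demonstration only):
# f(w; xi) = 0.5 * ||w - xi||^2 with xi ~ N(w_star, I), so E[f(w; xi)] is minimized at w_star.
w_star = np.array([1.0, -2.0])

def toy_grad(w, rng):
    xi = w_star + rng.normal(size=w.shape)
    return w - xi

print(sgd_heavy_ball(toy_grad, w0=np.zeros(2)))   # fluctuates around w_star
```

The only state carried across iterations besides the iterate itself is the momentum vector m, matching lines 5 and 6 of Algorithm 1.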
Contribution: We show that, under the CNC assumption and some minor constraints that upper-bound the parameter β, if SGD with momentum has the properties called Almost Positively Aligned with Gradient (APAG), Almost Positively Correlated with Gradient (APCG), and Gradient Alignment or Curvature Exploitation (GrACE), defined in a later section, then it takes T = O((1−β) log(1/((1−β)ϵ)) ϵ^{−10}) iterations to return an (ϵ, ϵ) second-order stationary point. Alternatively, one can obtain an (ϵ, √ϵ) second-order stationary point in T = O((1−β) log(1/((1−β)ϵ)) ϵ^{−5}) iterations. Our theoretical result demonstrates that a larger momentum parameter β can help in escaping saddle points faster. As saddle points are pervasive in the loss landscape of optimization and deep learning (Dauphin et al. (2014); Choromanska et al. (2015)), the result sheds light on explaining why SGD with momentum enables faster training in optimization and deep learning.

Notation: In this paper we use E_t[·] to denote the conditional expectation E[· | w_1, w_2, . . . , w_t], i.e. the randomness up to but not including time t is fixed; note that w_t was determined at time t−1.

2 BACKGROUND

2.1 A THOUGHT EXPERIMENT

Figure 1: The trajectory of the standard SGD (left) and SGD with momentum (right).

Let us provide some high-level intuition about the benefit of stochastic momentum with respect to escaping saddle points. In an iterative update scheme, at some time t_0 the parameters w_{t_0} can enter a saddle point region, that is, a place where the Hessian ∇²f(w_{t_0}) has a non-trivial negative eigenvalue, say λ_min(∇²f(w_{t_0})) ≤ −ϵ, and the gradient ∇f(w_{t_0}) is small in norm, say ‖∇f(w_{t_0})‖ ≤ ϵ. The challenge here is that gradient updates may drift only very slowly away from the saddle point, and may not escape this region; see (Du et al. (2017); Lee et al. (2019)) for additional details. On the other hand, if the iterates were to move in one particular direction, namely along v_{t_0}, the eigenvector corresponding to the smallest eigenvalue of ∇²f(w_{t_0}), then a fast escape is guaranteed under certain constraints on the step size η; see e.g. (Carmon et al. (2018)). While the negative eigenvector could be computed directly, this second-order method is prohibitively expensive, and hence we typically aim to rely on gradient methods. With this in mind, Daneshmand et al. (2018), who study non-momentum SGD, make an assumption akin to our CNC property described above: that each stochastic gradient g_{t_0} is strongly non-orthogonal to v_{t_0}, the direction of large negative curvature. This suffices to drive the updates out of the saddle point region.

In the present paper we study stochastic momentum, and our CNC property requires that the update direction m_{t_0} is strongly non-orthogonal to v_{t_0}; more precisely, E_{t_0}[⟨m_{t_0}, v_{t_0}⟩²] ≥ γ > 0. We are able to take advantage of the analysis of (Daneshmand et al. (2018)) to establish that updates begin to escape a saddle point region for similar reasons. Further, this effect is amplified in successive iterations through the momentum update when β is close to 1. Assume that at some w_{t_0} we have m_{t_0} which possesses significant correlation with the negative curvature direction v_{t_0}; then on successive rounds m_{t_0+1} is quite close to β m_{t_0}, m_{t_0+2} is quite close to β² m_{t_0}, and so forth; see Figure 1 for an example. This provides an intuitive perspective on how momentum might help accelerate the escape process.
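As a quick numerical sanity check on this intuition (our own toy simulation, not an experiment from the paper), the sketch below runs heavy ball momentum on the two-dimensional quadratic saddle f(w) = ½ wᵀ diag(1, −0.1) w with small Gaussian gradient noise and records how long the iterate takes to leave a ball around the saddle; the noise scale, step size, escape radius, and iteration cap are all assumed values.

```python
import numpy as np

# Quadratic saddle: f(w) = 0.5 * w^T H w with H = diag(1, -0.1); the escape
# direction v is the second coordinate axis (eigenvalue -0.1).
H = np.diag([1.0, -0.1])

def escape_time(beta, eta=1e-3, noise=1e-3, T=200000, seed=0):
    """Iterations until ||w_t|| exceeds 1, starting from the saddle at the origin."""
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    m = np.zeros(2)
    for t in range(T):
        g = H @ w + noise * rng.normal(size=2)   # stochastic gradient at w_t
        m = beta * m + g                         # heavy ball momentum
        w = w - eta * m
        if np.linalg.norm(w) > 1.0:              # left the saddle region
            return t
    return T                                     # did not escape within the budget

for beta in [0.0, 0.5, 0.9]:
    print(f"beta={beta}: escape after {escape_time(beta)} iterations")
```

With these assumed constants, the escape time shrinks markedly as β approaches 1, in line with the 1/(1 − β) amplification of the component along v_{t_0} described above.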
Yet one might ask: does this procedure provably contribute to the escape process and, if so, what is the aggregate performance improvement of the momentum? We answer the first question in the affirmative, and we answer the second question essentially by showing that momentum can help speed up saddle-point escape by a multiplicative factor of 1 − β. On the negative side, we also show that β is constrained and may not be chosen arbitrarily close to 1.

2.2 MOMENTUM HELPS ESCAPE SADDLE POINTS: AN EMPIRICAL VIEW

Let us now establish, empirically, the clear benefit of stochastic momentum on the problem of saddle-point escape. We construct two stochastic optimization tasks, each of which exhibits at least one significant saddle point. The two objectives are as follows:

min_w f(w) := (1/n) Σ_{i=1}^{n} ( ½ wᵀHw + b_iᵀw + ‖w‖_{10}^{10} ),  (3)

min_w f(w) := (1/n) Σ_{i=1}^{n} ( (a_iᵀw)² − y_i )².  (4)

Problem (3) was considered by (Staib et al. (2019); Reddi et al. (2018a)) and represents a very straightforward non-convex optimization challenge, with an embedded saddle given by the matrix H := diag([1, −0.1]) and stochastic Gaussian perturbations given by b_i ∼ N(0, diag([0.1, 0.001])); the small variance in the second component provides lower noise in the escape direction. Here we have set n = 10. Observe that the origin is in the neighborhood of saddle points and has objective value zero. SGD and SGD with momentum are initialized at the origin in the experiment, so that they have to escape saddle points before convergence. The second objective (4) appears in the phase retrieval problem, which has real applications in the physical sciences (Candès et al. (2013); Shechtman et al. (2015)). In phase retrieval,³ one wants to find an unknown w* ∈ ℝ^d with access to but a few samples y_i = (a_iᵀw*)²; the design vectors a_i are known a priori. Here we have sampled w* ∼ N(0, I_d/d) and a_i ∼ N(0, I_d) with d = 10 and n = 200.

(a) Solving objective (3). (b) Solving objective (4) (phase retrieval).
Figure 2: Performance of SGD with different values of β ∈ {0, 0.3, 0.5, 0.7, 0.9}; β = 0 corresponds to the standard SGD. Fig. 2a: We plot convergence in function value f(·) given in (3). Initialization is always set as w_0 = 0. All the algorithms use the same step size η = 5 × 10^{−5}. Fig. 2b: We plot convergence in relative distance to the true model w*, defined as min(‖w_t − w*‖, ‖w_t + w*‖)/‖w*‖, which more appropriately captures progress as the global sign of the objective (4) is unrecoverable. All the algorithms are initialized at the same point w_0 ∼ N(0, I_d/(10000 d)) and use the same step size η = 5 × 10^{−4}.

The empirical findings, displayed in Figure 2, are quite stark: for both objectives, convergence is significantly accelerated by larger choices of β. In the first objective (Figure 2a), we see each optimization trajectory entering a saddle point region, apparent from the flat progress, yet we observe that large-momentum trajectories escape the saddle much more quickly than those with smaller momentum. A similar effect appears in Figure 2b. To the best of our knowledge, this is the first reported empirical finding that establishes the dramatic speed-up of stochastic momentum for finding an optimal solution in phase retrieval.

³ It is known that phase retrieval is nonconvex and has the so-called strict saddle property: (1) every local minimizer {w*, −w*} is global up to phase, (2) each saddle exhibits negative curvature (see e.g. (Sun et al. (2015; 2016); Chen et al. (2018))).
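For readers who want to reproduce the flavor of the first experiment, here is a small, hypothetical reconstruction of the setup around objective (3); the single-sample stochastic gradients, random seeds, iteration budget, and the sign convention H = diag([1, −0.1]) are our own assumptions rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 10
H = np.diag([1.0, -0.1])                              # embedded saddle at the origin
b = rng.normal(size=(n, d)) * np.sqrt([0.1, 0.001])   # b_i ~ N(0, diag([0.1, 0.001]))

def f(w):
    # Objective (3): average of 0.5*w^T H w + b_i^T w, plus ||w||_10^10.
    return np.mean(0.5 * w @ H @ w + b @ w) + np.sum(np.abs(w) ** 10)

def stoch_grad(w, rng):
    i = rng.integers(n)                               # sample one b_i per step
    return H @ w + b[i] + 10 * np.sign(w) * np.abs(w) ** 9

def run(beta, eta=5e-5, T=300000, seed=1):
    rng = np.random.default_rng(seed)
    w, m = np.zeros(d), np.zeros(d)                   # initialized at the origin, as in Fig. 2a
    for _ in range(T):
        m = beta * m + stoch_grad(w, rng)
        w = w - eta * m
    return f(w)

for beta in [0.0, 0.5, 0.9]:
    print(f"beta={beta}: final objective {run(beta):.4f}")
```

Under these assumptions, larger β should reach a noticeably lower objective value within the same iteration budget, consistent with the trend in Figure 2a.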
2.3 RELATED WORKS

Heavy ball method: The heavy ball method was originally proposed by Polyak (1964). It has been observed that this algorithm, even in the deterministic setting, provides no convergence speedup over standard gradient descent, except in some highly structured cases such as convex quadratic objectives where an accelerated rate is possible (Lessard et al. (2016); Goh (2017); Ghadimi et al. (2015); Sun et al. (2019); Loizou & Richtárik (2017; 2018); Gadat et al. (2016); Yang et al. (2018); Kidambi et al. (2018); Can et al. (2019)). We provide a comprehensive survey of the related works on the heavy ball method in Appendix A.

Reaching a second-order stationary point: As we mentioned earlier, there are many works aiming at reaching a second-order stationary point. We classify them into two categories: specialized algorithms and simple GD/SGD variants. Specialized algorithms are those designed to exploit the negative curvature explicitly and escape saddle points faster than the ones without the explicit exploitation (e.g. Carmon et al. (2018); Agarwal et al. (2017); Allen-Zhu & Li (2018); Xu et al. (2018)). Simple GD/SGD variants are those with minimal tweaks of standard GD/SGD or their variants (e.g. Ge et al. (2015); Levy (2016); Fang et al. (2019); Jin et al. (2017; 2018; 2019); Daneshmand et al. (2018); Staib et al. (2019)). Our work belongs to this category. In this category, perhaps the pioneering works are (Ge et al. (2015)) and (Jin et al. (2017)). Jin et al. (2017) show that explicitly adding isotropic noise in each iteration guarantees that GD escapes saddle points and finds a second-order stationary point with high probability. Following (Jin et al. (2017)), Daneshmand et al. (2018) assume that the stochastic gradient inherently has a component to escape. Specifically, they make the Correlated Negative Curvature (CNC) assumption for the stochastic gradient g_t, namely E_t[⟨g_t, v_t⟩²] ≥ γ > 0. The assumption allows the algorithm to avoid the procedure of perturbing the updates by adding isotropic noise. Our work is motivated by (Daneshmand et al. (2018)) but assumes CNC for the stochastic momentum m_t instead. In Appendix A, we compare the results of our work with the related works.

3 MAIN RESULTS

We assume that the gradient ∇f is L-Lipschitz; that is, f is L-smooth. Further, we assume that the Hessian ∇²f is ρ-Lipschitz. These two properties ensure that ‖∇f(w) − ∇f(w′)‖ ≤ L‖w − w′‖ and ‖∇²f(w) − ∇²f(w′)‖ ≤ ρ‖w − w′‖ for all w, w′. The L-Lipschitz gradient assumption implies that |f(w′) − f(w) − ⟨∇f(w), w′ − w⟩| ≤ (L/2)‖w′ − w‖² for all w, w′, while the ρ-Lipschitz Hessian assumption implies that |f(w′) − f(w) − ⟨∇f(w), w′ − w⟩ − ½(w′ − w)ᵀ∇²f(w)(w′ − w)| ≤ (ρ/6)‖w′ − w‖³ for all w, w′. Furthermore, we assume that the stochastic gradient has bounded noise, ‖∇f(w) − ∇f(w; ξ)‖² ≤ σ², and that the norm of the stochastic momentum is bounded, ‖m_t‖ ≤ c_m. We denote by Π_i M_i the matrix product of the matrices {M_i}, and we use σ_max(M) = ‖M‖₂ := sup_{x ≠ 0} ⟨x, Mx⟩/⟨x, x⟩ to denote the spectral norm of the matrix M.

3.1 REQUIRED PROPERTIES WITH EMPIRICAL VALIDATION

Our analysis of stochastic momentum relies on three properties of the stochastic momentum dynamic. These properties are somewhat unusual, but we argue they should hold in natural settings, and later we aim to demonstrate that they hold empirically in a couple of standard problems of interest.
Definition 2. We say that SGD with stochastic momentum satisfies Almost Positively Aligned with Gradient (APAG)⁴ if

E_t[⟨∇f(w_t), m_t − g_t⟩] ≥ −½ ‖∇f(w_t)‖².  (5)

We say that SGD with stochastic momentum satisfies Almost Positively Correlated with Gradient (APCG) with parameter τ if there exists a constant c′ > 0 such that

E_t[⟨∇f(w_t), M_t m_t⟩] ≥ −c′ η σ_max(M_t) ‖∇f(w_t)‖²,  (6)

where the PSD matrix M_t is defined as M_t = (Π_{s=1}^{τ−1} G_{s,t})(Π_{s=k}^{τ−1} G_{s,t}) with G_{s,t} := I − η Σ_{j=1}^{s} β^{s−j} ∇²f(w_t) = I − η ((1 − β^s)/(1 − β)) ∇²f(w_t), for any integer 1 ≤ k ≤ τ − 1, and η is any step size chosen that guarantees each G_{s,t} is PSD.

Definition 3. We say that SGD with momentum exhibits Gradient Alignment or Curvature Exploitation (GrACE) if there exists a constant c_h ≥ 0 such that

E_t[ η⟨∇f(w_t), g_t − m_t⟩ + (η²/2) m_tᵀ ∇²f(w_t) m_t ] ≤ η² c_h.  (7)

APAG requires that the momentum term m_t must, in expectation, not be significantly misaligned with the gradient ∇f(w_t). This is a very natural condition when one sees that the momentum term is acting as a biased estimate of the gradient of the deterministic f. APAG demands that the bias cannot be too large relative to the size of ∇f(w_t). Indeed, this property is only needed in our analysis when the gradient is large (i.e. ‖∇f(w_t)‖ ≥ ϵ), as it guarantees that the algorithm makes progress; our analysis does not require APAG to hold when the gradient is small.

APCG is a related property, but requires that the current momentum term m_t is almost positively correlated with the gradient ∇f(w_t), measured in the Mahalanobis norm induced by M_t. It may appear to be an unusual object, but one can view the PSD matrix M_t as measuring something about the local curvature of the function with respect to the trajectory of the SGD-with-momentum dynamic. We will show that this property holds empirically on two natural problems for a reasonable constant c′. APCG is only needed in our analysis when the update is in a saddle region with significant negative curvature, ‖∇f(w)‖ ≤ ϵ and λ_min(∇²f(w)) ≤ −ϵ. Our analysis does not require APCG to hold when the gradient is large or when the update is at an (ϵ, ϵ)-second-order stationary point.

Figure 3: Plots of the related properties. Sub-figures on the top row concern solving (3) and sub-figures on the bottom row concern solving (4) (phase retrieval). The function value/relative distance to w* are plotted in Figure 2. Sub-figures (a) and (d): We plot the gradient norms ‖∇f(w_t)‖ versus iterations. Sub-figures (b) and (e): We plot the values of ⟨∇f(w_t), m_t − g_t⟩/‖∇f(w_t)‖² versus iterations. For (b), we only report them when the gradient is large (‖∇f(w_t)‖ ≥ 0.02); the value is larger than −0.5 except during the transition. For (e), we observe that the value is almost always nonnegative. Sub-figures (c) and (f): We plot the value of ⟨∇f(w_t), M_t m_t⟩/(η σ_max(M_t) ‖∇f(w_t)‖²). For (c), we let M_t = (Π_{s=1}^{3×10^5} G_{s,t})(Π_{s=1}^{3×10^5} G_{s,t}) and we only report the values when the update is in the region of saddle points. For (f), we let M_t = (Π_{s=1}^{500} G_{s,t})(Π_{s=1}^{500} G_{s,t}) and we observe that the value is almost always nonnegative. The figures imply that SGD with momentum has the APAG and APCG properties in these experiments. Furthermore, an interesting observation is that, for the phase retrieval problem, the expected values might actually be nonnegative.

⁴ Note that our analysis still goes through if one replaces the ½ on the r.h.s. of (5) with any larger number c < 1; the resulting iteration complexity would only be a constant multiple worse.
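As one concrete way to run this kind of check (a sketch under our own assumptions, not the authors' code), the snippet below tracks the APAG ratio ⟨∇f(w_t), m_t − g_t⟩/‖∇f(w_t)‖² along a single trajectory on the toy objective (3); a single-trajectory trace is only a proxy for the conditional expectation in (5), and the 0.02 gradient threshold mirrors the one used for Figure 3(b).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 10
H = np.diag([1.0, -0.1])
b = rng.normal(size=(n, d)) * np.sqrt([0.1, 0.001])

full_grad = lambda w: H @ w + b.mean(axis=0) + 10 * np.sign(w) * np.abs(w) ** 9
stoch_grad = lambda w, i: H @ w + b[i] + 10 * np.sign(w) * np.abs(w) ** 9

def apag_trace(beta, eta=5e-5, T=100000, eps=0.02, seed=1):
    """Record <grad f(w_t), m_t - g_t> / ||grad f(w_t)||^2 whenever the gradient is large;
    APAG asks that its conditional expectation stay above -1/2 in that regime."""
    rng = np.random.default_rng(seed)
    w, m, ratios = np.zeros(d), np.zeros(d), []
    for _ in range(T):
        g = stoch_grad(w, rng.integers(n))
        m = beta * m + g
        grad = full_grad(w)
        if np.linalg.norm(grad) >= eps:
            ratios.append(grad @ (m - g) / (grad @ grad))
        w = w - eta * m
    return np.array(ratios)

r = apag_trace(beta=0.9)
if r.size:
    print("min ratio:", r.min(), "fraction below -0.5:", np.mean(r < -0.5))
```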
For GrACE, the first term on the l.h.s. of (7) measures the alignment between the stochastic momentum m_t and the gradient ∇f(w_t), while the second term on the l.h.s. measures the curvature exploitation. The first term is small (or even negative) when the stochastic momentum m_t is aligned with the gradient ∇f(w_t), while the second term is small (or even negative) when the stochastic momentum m_t can exploit a negative curvature (i.e. the subspace of eigenvectors that corresponds to the negative eigenvalues of the Hessian ∇²f(w_t), if one exists). Overall, a small sum of the two terms (and, consequently, a small c_h) allows one to bound the function value of the next iterate (see Lemma 8).

In Figure 3, we report some quantities related to APAG and APCG, as well as the gradient norm, when solving the previously discussed problems (3) and (4) using SGD with momentum. We also report a quantity regarding GrACE in Figure 4 in the appendix.

3.2 CONVERGENCE RESULTS

The high-level idea of our analysis follows a similar template to (Jin et al. (2017); Daneshmand et al. (2018); Staib et al. (2019)). Our proof is structured into three cases: either (a) ‖∇f(w)‖ ≥ ϵ, or (b) ‖∇f(w)‖ ≤ ϵ and λ_min(∇²f(w)) ≤ −ϵ, or otherwise (c) ‖∇f(w)‖ ≤ ϵ and λ_min(∇²f(w)) ≥ −ϵ, meaning we have arrived in a second-order stationary region. The precise algorithm we analyze is Algorithm 2, which is identical to Algorithm 1 except that we boost the step size to a larger value r on occasion. We will show that the algorithm makes progress in cases (a) and (b). In case (c), when the goal has already been met, further execution of the algorithm only weakly hurts progress. Ultimately, we prove that a second-order stationary point is arrived at with high probability. While our proof borrows tools from (Daneshmand et al. (2018); Staib et al. (2019)), much of the momentum analysis is entirely novel to our knowledge.

Algorithm 2: SGD with stochastic heavy ball momentum
1: Required: step size parameters r and η, momentum parameter β, and period parameter T_thred.
2: Init: w_0 ∈ ℝ^d and m_{−1} = 0 ∈ ℝ^d.
3: for t = 0 to T do
4:   Get a stochastic gradient g_t at w_t, and set the stochastic momentum m_t := β m_{t−1} + g_t.
5:   Set the learning rate: η̂ := η unless (t mod T_thred) = 0, in which case η̂ := r.
6:   w_{t+1} = w_t − η̂ m_t.
7: end for

Theorem 1. Assume that the stochastic momentum satisfies CNC. Set⁵ r = O(ϵ²), η = O(ϵ⁵), and T_thred = (c(1−β)/(ηϵ)) log( Lc_mσ²ρc′c_h / ((1−β)δγϵ) ) = O((1−β) log( Lc_mσ²ρc′c_h / ((1−β)δγϵ) ) ϵ^{−6}) for some constant c > 0. If SGD with momentum (Algorithm 2) has the APAG property when the gradient is large (‖∇f(w)‖ ≥ ϵ), the APCG_{T_thred} property when it enters a region of saddle points that exhibits negative curvature (‖∇f(w)‖ ≤ ϵ and λ_min(∇²f(w)) ≤ −ϵ), and the GrACE property throughout the iterations, then it reaches an (ϵ, ϵ) second-order stationary point in T = 2T_thred(f(w_0) − min_w f(w))/(δF_thred) = O((1−β) log( Lc_mσ²ρc′c_h / ((1−β)δγϵ) ) ϵ^{−10}) iterations with high probability 1 − δ, where F_thred = O(ϵ⁴).

The theorem implies the advantage of using stochastic momentum for SGD: a higher β leads to reaching a second-order stationary point faster. As we will show in the following, this is because a higher β enables escaping the saddle points faster. In Subsection 3.2.1, we provide some key details of the proof of Theorem 1. The interested reader can read a high-level sketch of the proof, as well as the detailed version, in Appendix G.

⁵ See Table 3 in Appendix E for the precise expressions of the parameters. Here, we hide the parameters' dependencies on γ, L, c_m, c′, σ², ρ, c_h, and δ. W.l.o.g., we also assume that c_m, L, σ², c′, c_h, and ρ are not less than one and that ϵ ≤ 1.
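A minimal sketch of Algorithm 2, assuming a generic stochastic-gradient oracle `grad_fn`; the numerical values of r, η, and T_thred in the commented usage are placeholders rather than the O(ϵ²), O(ϵ⁵), and O((1−β)ϵ^{−6}) choices prescribed by Theorem 1.

```python
import numpy as np

def cnc_sgd_momentum(grad_fn, w0, eta, r, beta, T_thred, T, seed=0):
    """Sketch of Algorithm 2: SGD with stochastic heavy ball momentum, where the
    step size is boosted from eta to the larger value r once every T_thred steps."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    for t in range(T):
        g = grad_fn(w, rng)                       # stochastic gradient g_t at w_t
        m = beta * m + g                          # m_t := beta * m_{t-1} + g_t
        step = r if t % T_thred == 0 else eta     # periodic large step of size r
        w = w - step * m
    return w

# Hypothetical usage (placeholder constants, not the theorem's prescriptions):
# w_out = cnc_sgd_momentum(grad_fn, np.zeros(2), eta=1e-4, r=1e-2,
#                          beta=0.9, T_thred=5000, T=200000)
```

The only change relative to Algorithm 1 is the `step` selection: once every T_thred iterations the update uses the larger step size r, which is what lets the analysis restart the escape argument at each t_0 with (t_0 mod T_thred) = 0.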
Remark 1 (constraints on β): We also need some minor constraints on β so that β cannot be too close to 1. They are: 1) L(1−β)³ > 1, 2) σ²(1−β)³ > 1, 3) c′(1−β)² > 1, 4) η ≤ (1−β)/L, 5) η ≤ (1−β)/ϵ, and 6) T_thred ≥ 1 + 2β/(1−β). Please see Appendix E.1 for the details and a discussion.

Remark 2 (escaping saddle points): Note that Algorithm 2 reduces to CNC-SGD of Daneshmand et al. (2018) when β = 0 (i.e. without momentum). Therefore, let us compare the results. We show that the escape time of Algorithm 2 is T_thred := O((1−β)/(ηϵ)) (see Appendix E.3.3, especially (81)–(82)). On the other hand, for CNC-SGD, based on Table 3 in their paper, it is T_thred = O(1/(ηϵ)). One can clearly see that the T_thred of our result carries a factor 1−β, which makes it smaller than that of Daneshmand et al. (2018) for the same η and consequently demonstrates faster escape from saddle points with momentum.

Remark 3 (finding a second-order stationary point): Denote by ℓ a number such that ‖g_t‖ ≤ ℓ for all t. In Appendix G.3, we show that in the high-momentum regime where (1−β) ≪ ρ²ℓ^{10}/(c_m⁹ c_h² c′), Algorithm 2 is strictly better than CNC-SGD of Daneshmand et al. (2018), which means that a higher momentum can help find a second-order stationary point faster. Empirically, we find that c′ ≈ 0 (Figure 3) and c_h ≈ 0 (Figure 4) in the phase retrieval problem, so the condition is easily satisfied for a wide range of β.

3.2.1 ESCAPING SADDLE POINTS

In this subsection, we analyze the process of escaping saddle points by SGD with momentum. Denote by t_0 any time such that (t_0 mod T_thred) = 0. Suppose that the iterate enters a region exhibiting a small gradient but a large negative eigenvalue of the Hessian (i.e. ‖∇f(w_{t_0})‖ ≤ ϵ and λ_min(∇²f(w_{t_0})) ≤ −ϵ). We want to show that it takes at most T_thred iterations to escape the region and that, whenever it escapes, the function value decreases by at least F_thred = O(ϵ⁴) in expectation, where the precise expression of F_thred is determined later in Appendix E. The technique that we use is proof by contradiction. Assume that the function value does not, in expectation, decrease by at least F_thred in T_thred iterations. Then we obtain an upper bound on the expected distance, E_{t_0}[‖w_{t_0+T_thred} − w_{t_0}‖²] ≤ C_upper. Yet, by leveraging the negative curvature, we also show a lower bound of the form E_{t_0}[‖w_{t_0+T_thred} − w_{t_0}‖²] ≥ C_lower. The analysis will show that the lower bound is larger than the upper bound (namely, C_lower > C_upper), which leads to the contradiction and allows us to conclude that the function value must decrease by at least F_thred in T_thred iterations in expectation. Since T_thred = O((1−β) log(1/((1−β)ϵ)) ϵ^{−6}), the dependency on β suggests that a larger β leads to a smaller T_thred, which implies that larger momentum helps in escaping saddle points faster.

Lemma 1 below provides an upper bound on the expected distance. The proof is in Appendix C.

Lemma 1. Denote by t_0 any time such that (t_0 mod T_thred) = 0. Suppose that E_{t_0}[f(w_{t_0}) − f(w_{t_0+t})] ≤ F_thred for any 0 ≤ t ≤ T_thred. Then,

E_{t_0}[‖w_{t_0+t} − w_{t_0}‖²] ≤ C_upper,t := 8ηt ( F_thred + 2r²c_h + (ρ/3) r³c_m³ ) / (1−β)² + 8η²tσ² / (1−β)² + 4η² (β/(1−β))² c_m² + 2r²c_m².

We see that C_upper,t in Lemma 1 is monotone increasing in t, so we can define C_upper := C_upper,T_thred. Now let us switch to obtaining the lower bound on E_{t_0}[‖w_{t_0+T_thred} − w_{t_0}‖²]. The key to getting the lower bound comes from the recursive dynamics of SGD with momentum.

Lemma 2. Denote by t_0 any time such that (t_0 mod T_thred) = 0.
Let us define a quadratic approximation at w_{t_0}, Q(w) := f(w_{t_0}) + ⟨w − w_{t_0}, ∇f(w_{t_0})⟩ + ½ (w − w_{t_0})ᵀ H (w − w_{t_0}), where H := ∇²f(w_{t_0}). Also, define G_s := (I − η Σ_{k=1}^{s} β^{s−k} H). Then we can write w_{t_0+t} − w_{t_0} exactly using the following decomposition:

w_{t_0+t} − w_{t_0} = q_{v,t−1} + η q_{m,t−1} + η q_{q,t−1} + η q_{w,t−1} + η q_{ξ,t−1},

where
q_{v,t−1} := −(Π_{j=1}^{t−1} G_j) r m_{t_0},
q_{m,t−1} := −Σ_{s=1}^{t−1} (Π_{j=s+1}^{t−1} G_j) β^s m_{t_0},
q_{q,t−1} := −Σ_{s=1}^{t−1} (Π_{j=s+1}^{t−1} G_j) Σ_{k=1}^{s} β^{s−k} ( ∇f(w_{t_0+k}) − ∇Q(w_{t_0+s}) ),
q_{w,t−1} := −Σ_{s=1}^{t−1} (Π_{j=s+1}^{t−1} G_j) Σ_{k=1}^{s} β^{s−k} ∇f(w_{t_0}),
q_{ξ,t−1} := −Σ_{s=1}^{t−1} (Π_{j=s+1}^{t−1} G_j) Σ_{k=1}^{s} β^{s−k} ξ_{t_0+k}.

The proof of Lemma 2 is in Appendix D. We will use the quantities q_{v,t−1}, q_{m,t−1}, q_{q,t−1}, q_{w,t−1}, q_{ξ,t−1} as defined above throughout the analysis.

Lemma 3. Following the notation of Lemma 2, we have

E_{t_0}[‖w_{t_0+t} − w_{t_0}‖²] ≥ E_{t_0}[‖q_{v,t−1}‖²] + 2η E_{t_0}[⟨q_{v,t−1}, q_{m,t−1} + q_{q,t−1} + q_{w,t−1} + q_{ξ,t−1}⟩] =: C_lower.

We are going to show that the dominant term in the lower bound on E_{t_0}[‖w_{t_0+t} − w_{t_0}‖²] is E_{t_0}[‖q_{v,t−1}‖²], which is the critical component for ensuring that the lower bound is larger than the upper bound on the expected distance.

Lemma 4. Denote θ_j := Σ_{k=1}^{j} β^{j−k} = Σ_{k=1}^{j} β^{k−1} and λ := −λ_min(H). Following the conditions and notation in Lemma 1 and Lemma 2, we have

E_{t_0}[‖q_{v,t−1}‖²] ≥ (Π_{j=1}^{t−1}(1 + ηθ_jλ))² r²γ.  (8)

Proof. We know that λ_min(H) ≤ −ϵ < 0. Let v be the eigenvector of the Hessian H with unit norm that corresponds to λ_min(H), so that Hv = λ_min(H)v. We have (I − ηH)v = v − ηλ_min(H)v = (1 − ηλ_min(H))v. Then,

E_{t_0}[‖q_{v,t−1}‖²] (a)= E_{t_0}[‖q_{v,t−1}‖² ‖v‖²] (b)≥ E_{t_0}[⟨q_{v,t−1}, v⟩²] (c)= E_{t_0}[⟨(Π_{j=1}^{t−1} G_j) r m_{t_0}, v⟩²] (d)= E_{t_0}[⟨(Π_{j=1}^{t−1} (I − ηθ_j H)) r m_{t_0}, v⟩²] = E_{t_0}[(Π_{j=1}^{t−1} (1 − ηθ_j λ_min(H)))² ⟨r m_{t_0}, v⟩²] (e)≥ (Π_{j=1}^{t−1}(1 + ηθ_jλ))² r²γ,

where (a) is because v has unit norm, (b) is by the Cauchy–Schwarz inequality, (c) and (d) are by the definitions, and (e) is by the CNC assumption, which gives E_{t_0}[⟨m_{t_0}, v⟩²] ≥ γ.

Observe that the lower bound in (8) is monotone increasing in t and in the momentum parameter β. Moreover, it actually grows exponentially in t. To get the contradiction, we have to show that the lower bound is larger than the upper bound. By Lemma 1 and Lemma 3, it suffices to prove the following lemma. We provide its proof in Appendix E.

Lemma 5. Let F_thred = O(ϵ⁴) and η²T_thred ≤ r². By following the conditions and notation in Theorem 1, Lemma 1, and Lemma 2, we conclude that if SGD with momentum (Algorithm 2) has the APCG property, then

C_lower := E_{t_0}[‖q_{v,T_thred−1}‖²] + 2η E_{t_0}[⟨q_{v,T_thred−1}, q_{m,T_thred−1} + q_{q,T_thred−1} + q_{w,T_thred−1} + q_{ξ,T_thred−1}⟩] > C_upper.

4 CONCLUSION

In this paper, we identify three properties that guarantee SGD with momentum reaches a second-order stationary point faster with a higher momentum, which justifies the practice of using a large value of the momentum parameter β. We show that a greater momentum leads to escaping strict saddle points faster because SGD with momentum recursively enlarges the projection onto an escape direction. However, how to make sure that SGD with momentum has the three properties is not very clear. It would be interesting to identify conditions that guarantee SGD with momentum to have the properties. Perhaps a good starting point is understanding why the properties hold in phase retrieval. We believe that our results shed light on understanding the recent success of SGD with momentum in non-convex optimization and deep learning.

ACKNOWLEDGMENTS

We gratefully acknowledge financial support from NSF IIS awards 1910077 and 1453304.

REFERENCES

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. STOC, 2017.
Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. NeurIPS, 2018.

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. ICML, 2016.

Anima Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. COLT, 2016.

Bugra Can, Mert Gürbüzbalaban, and Lingjiong Zhu. Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. ICML, 2019.

Emmanuel J. Candès, Yonina Eldar, Thomas Strohmer, and Vlad Voroninski. Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences, 2013.

Yair Carmon and John C. Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. NeurIPS, 2018.

Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal of Optimization, 2018.

Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 2018.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. AISTATS, 2015.

Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv:1805.09501, 2018.

Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ϵ^{-3/2}) for nonconvex optimization. Mathematical Programming, 2017.

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. ICML, 2018.

Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS, 2014.

Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points. NIPS, 2017.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. NeurIPS, 2018.

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex SGD escaping from saddle points. COLT, 2019.

Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. arXiv:1609.04228, 2016.

Xavier Gastaldi. Shake-shake regularization. arXiv:1705.07485, 2017.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. COLT, 2015.

Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. ECC, 2015.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013.

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 2016.

Gabriel Goh. Why momentum really works. Distill, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. NIPS, 2017.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. ICML, 2017.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. COLT, 2018.

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. Stochastic gradient descent escapes saddle points efficiently. arXiv:1902.04811, 2019.

Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. ICLR, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.

Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. ICML, 2017.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical Programming, Series B, 2019.

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Nonconvex finite-sum optimization via SCSG methods. NIPS, 2017.

Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.

Kfir Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv:1611.04831, 2016.

Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv:1712.09677, 2017.

Nicolas Loizou and Peter Richtárik. Accelerated gossip via stochastic heavy ball method. Allerton, 2018.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.

Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. NeurIPS, 2018.

Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 1987.

Yurii Nesterov. Squared functional systems and optimization problems. High Performance Optimization, Springer, 2000.

Yurii Nesterov. Introductory lectures on convex optimization: a basic course. Springer, 2013.

Yurii Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Ser. A 108, 177-205, 2006.

Jiawang Nie. The hierarchy of local minimums in polynomial optimization. Mathematical Programming, 2015.

Peter Ochs, Yunjin Chen, Thomas Brox, and Thomas Pock. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal of Imaging Sciences, 2014.

B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.

Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alex Smola. A generic approach for escaping saddle points. AISTATS, 2018a.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. ICLR, 2018b.

Yoav Shechtman, Yonina C. Eldar, Oren Cohen, Henry Nicholas Chapman, Jianwei Miao, and Mordechai Segev. Phase retrieval with application to optical imaging: a contemporary overview. IEEE Signal Processing Magazine, 2015.
David Silver, Julian Schrittwieser, Karen Simonyan, et al. Mastering the game of Go without human knowledge. Nature, 2017.

Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, and Suvrit Sra. Escaping saddle points with adaptive gradient methods. ICML, 2019.

Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? NIPS Workshop on Non-convex Optimization for Machine Learning: Theory and Practice, 2015.

Ju Sun, Qing Qu, and John Wright. A geometrical analysis of phase retrieval. International Symposium on Information Theory, 2016.

Tao Sun, Penghang Yin, Dongsheng Li, Chun Huang, Lei Guan, and Hao Jiang. Non-ergodic convergence analysis of heavy-ball algorithms. AAAI, 2019.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. ICML, 2013.

T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. NeurIPS, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. NIPS, 2017.

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. NIPS, 2017.

Yi Xu, Jing Rong, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. NeurIPS, 2018.

Tianbao Yang, Qihang Lin, and Zhe Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. IJCAI, 2018.

A LITERATURE SURVEY

Heavy ball method: The heavy ball method was originally proposed by Polyak (1964). It has been observed that this algorithm, even in the deterministic setting, provides no convergence speedup over standard gradient descent, except in some highly structured cases such as convex quadratic objectives where an accelerated rate is possible (Lessard et al. (2016); Goh (2017)). In recent years, several works have made efforts to analyze the heavy ball method for classes of optimization problems beyond quadratic functions. For example, Ghadimi et al. (2015) prove an O(1/T) ergodic convergence rate when the problem is smooth convex, while Sun et al. (2019) provide a non-ergodic convergence rate for certain classes of convex problems. Ochs et al. (2014) combine the technique of forward-backward splitting with the heavy ball method for a specific class of nonconvex optimization problems. For the stochastic heavy ball method, Loizou & Richtárik (2017) analyze a class of linear regression problems and show a linear convergence rate for stochastic momentum; these linear regression problems in fact belong to the class of strongly convex quadratic functions. Other works include (Gadat et al. (2016)), which shows almost sure convergence to the critical points of general non-convex coercive functions by stochastic heavy ball. Yet, that result does not show any advantage of stochastic heavy ball over other optimization algorithms such as SGD. Can et al. (2019) show an accelerated linear convergence to a stationary distribution under the Wasserstein distance for strongly convex quadratic functions by SGD with stochastic heavy ball momentum. Yang et al.
(2018) provide a unified analysis of stochastic heavy ball momentum and Nesterov's momentum for smooth non-convex objective functions. They show that the expected gradient norm converges at rate O(1/√t). Yet, the rate is not better than that of the standard SGD. We are also aware of the works (Ghadimi & Lan (2016; 2013)), which propose some variants of stochastic accelerated algorithms with first-order stationary point guarantees. Yet, the framework in (Ghadimi & Lan (2016; 2013)) does not capture the stochastic heavy ball momentum used in practice. There is also a negative result about heavy ball momentum: Kidambi et al. (2018) show that for a specific strongly convex and strongly smooth problem, SGD with heavy ball momentum fails to achieve the best convergence rate while some other algorithms can.

(a) About GrACE for problem (3). (b) About GrACE for problem (4) (phase retrieval).
Figure 4: Plot regarding the GrACE property. We plot the values of (η⟨∇f(w_t), g_t − m_t⟩ + ½η² m_tᵀ H_t m_t)/η² versus iterations. An interesting observation is that the value is well upper-bounded by zero for the phase retrieval problem. The results imply that the constant c_h is indeed small.

Reaching a second-order stationary point: As we mentioned earlier, there are many works aiming at reaching a second-order stationary point. We classify them into two categories: specialized algorithms and simple GD/SGD variants. Specialized algorithms are those designed to exploit the negative curvature explicitly and escape saddle points faster than the ones without the explicit exploitation (e.g. Carmon et al. (2018); Agarwal et al. (2017); Allen-Zhu & Li (2018); Xu et al. (2018)). Simple GD/SGD variants are those with minimal tweaks of standard GD/SGD or their variants (e.g. Ge et al. (2015); Levy (2016); Fang et al. (2019); Jin et al. (2017; 2018; 2019); Daneshmand et al. (2018); Staib et al. (2019)). Our work belongs to this category. In this category, perhaps the pioneering works are (Ge et al. (2015)) and (Jin et al. (2017)). Jin et al. (2017) show that explicitly adding isotropic noise in each iteration guarantees that GD escapes saddle points and finds a second-order stationary point with high probability. Following (Jin et al. (2017)), Daneshmand et al. (2018) assume that the stochastic gradient inherently has a component to escape. Specifically, they make the Correlated Negative Curvature (CNC) assumption for the stochastic gradient g_t, namely E_t[⟨g_t, v_t⟩²] ≥ γ > 0. The assumption allows the algorithm to avoid the procedure of perturbing the updates by adding isotropic noise. Our work is motivated by (Daneshmand et al. (2018)) but assumes CNC for the stochastic momentum m_t instead. Very recently, Jin et al. (2019) consider perturbing the update of SGD and provide a second-order guarantee. Staib et al. (2019) consider a variant of RMSProp (Tieleman & Hinton (2012)), in which the gradient g_t is multiplied by a preconditioning matrix G_t and the update is w_{t+1} = w_t − G_t^{−1/2} g_t. The work shows that the algorithm can help in escaping saddle points faster compared to the standard SGD under certain conditions. Fang et al. (2019) propose average-SGD, in which a suffix averaging scheme is conducted for the updates. They also assume an inherent property of stochastic gradients that allows SGD to escape saddle points. We summarize the iteration complexity results of the related works for simple SGD variants in Table 1.⁶ The reader can see that the iteration complexities of (Fang et al.
(2019)) and (Jin et al. (2019)) are better than (Daneshmand et al. (2018); Staib et al. (2019)) and our result. So, we want to explain the results and clarify the differences. First, we focus on explaining why the popular algorithm, SGD with heavy ball momentum, works well in practice, which is without the suffix averaging scheme used in (Fang et al. (2019)) and is without the explicit perturbation used in (Jin et al. (2019)). Specifically, we focus on studying the effect of stochastic heavy ball momentum and showing the advantage of using it. Furthermore, our analysis framework is built on the work of (Daneshmand et al. (2018)). We believe that, based on the insight in our work, one can also show the advantage of stochastic momentum by modifying the assumptions and algorithms in (Fang et al. (2019)) or (Jin et al. (2019)) and consequently get a better dependency on ϵ. 6We follow the work (Daneshmand et al. (2018)) for reaching an (ϵ, ϵ)-stationary point, while some works are for an (ϵ, ϵ)-stationary point. We translate them into the complexity of getting an (ϵ, ϵ)-stationary point. Published as a conference paper at ICLR 2020 Algorithm Complexity Perturbed SGD (Ge et al. (2015)) O(ϵ 16) Average-SGD (Fang et al. (2019)) O(ϵ 7) Perturbed SGD (Jin et al. (2019)) O(ϵ 8) CNC-SGD (Daneshmand et al. (2018)) O(ϵ 10) Adaptive SGD (Staib et al. (2019)) O(ϵ 10) SGD+momentum (this work) O((1 β) log( 1 (1 β)ϵ)ϵ 10) Table 1: Iteration complexity to find an (ϵ, ϵ) second-order stationary point . B LEMMA 6, 7, AND 8 In the following, Lemma 7 says that under the APAG property, when the gradient norm is large, on expectation SGD with momentum decreases the function value by a constant and consequently makes progress. On the other hand, Lemma 8 upper-bounds the increase of function value of the next iterate (if happens) by leveraging the Gr ACE property. Lemma 6. If SGD with momentum has the APAG property, then, considering the update step wt+1 = wt ηmt, we have that Et[f(wt+1)] f(wt) η 2 f(wt) 2 + Lη2c2 m 2 . Proof. By the L-smoothness assumption, f(wt+1) f(wt) η f(wt), mt + Lη2 f(wt) η f(wt), gt η f(wt), mt gt + Lη2c2 m 2 . (10) Taking the expectation on both sides. We have Et[f(wt+1)] f(wt) η f(wt) 2 ηEt[ f(wt), mt gt ] + Lη2c2 m 2 2 f(wt) 2 + Lη2c2 m 2 . (11) where we use the APAG property in the last inequality. Lemma 7. Assume that the step size η satisfies η ϵ2 8Lc2m . If SGD with momentum has the APAG property, then, considering the update step wt+1 = wt ηmt, we have that Et[f(wt+1)] f(wt) η 4ϵ2 when f(wt) ϵ. Proof. Et[f(wt+1) f(wt)] Lemma 6 η 2 f(wt) 2+ Lη2c2 m 2 2ϵ2+ Lη2c2 m 2 η 4ϵ2, where the last inequality is due to the constraint of η. Lemma 8. If SGD with momentum has the Gr ACE property, then, considering the update step wt+1 = wt ηmt, we have that Et[f(wt+1)] f(wt) + η2ch + ρη3 Proof. Consider the update rule wt+1 = wt ηmt, where mt represents the stochastic momentum and η is the step size. By ρ-Lipschitzness of Hessian, we have f(wt+1) f(wt) η f(wt), gt + η f(wt), gt mt + η2 2 m t 2f(wt)mt + ρη3 6 mt 3. Taking the conditional expectation, one has Et[f(wt+1)] f(wt) Et[η f(wt) 2] + Et[η f(wt), gt mt + η2 2 m t 2f(wt)mt] + ρη3 f(wt) + 0 + η2ch + ρη3 Published as a conference paper at ICLR 2020 C PROOF OF LEMMA 1 Lemma 1 Denote t0 any time such that (t0 mod Tthred) = 0. Suppose that Et0[f(wt0) f(wt0+t)] Fthred for any 0 t Tthred. Then, Et0[ wt0+t wt0 2] Cupper,t := 8ηt Fthred + 2r2ch + ρ (1 β)2 + 8η2 tσ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m. Proof. 
Recall that the update is wt0+1 = wt0 rmt0, and wt0+t = wt0+t 1 ηmt0+t 1, for t > 1. We have that wt0+t wt0 2 2( wt0+t wt0+1 2 + wt0+1 wt0 2) 2 wt0+t wt0+1 2 + 2r2c2 m, (14) where the first inequality is by the triangle inequality and the second one is due to the assumption that mt cm for any t. Now let us denote αs := Pt 1 s j=0 βj At 1 := Pt 1 s=1 αs and let us rewrite gt = f(wt) + ξt, where ξt is the zero-mean noise. We have that Et0[ wt0+t wt0+1 2] = Et0[ s=1 ηmt0+s 2] = Et0[η2 j=1 βs jgt0+j) + βsmt0 2] j=1 βs jgt0+j 2 + 2η2 s=1 βsmt0 2] j=1 βs jgt0+j 2] + 2η2 β 1 β 2c2 m s=1 αsgt0+s 2] + 2η2 β 1 β 2c2 m s=1 αs f(wt0+s) + ξt0+s 2] + 2η2 β 1 β 2c2 m s=1 αs f(wt0+s) 2] + Et0[4η2 s=1 αsξt0+s 2] + 2η2 β 1 β 2c2 m. To proceed, we need to upper bound Et0[4η2 Pt 1 s=1 αs f(wt0+s) 2]. We have that s=1 αs f(wt0+s) 2] (a) Et0[4η2A2 t 1 αs At 1 f(wt0+s) 2] (b) Et0[4η2 At 1 s=1 f(wt0+s) 2] (c) Et0[4η2 t (1 β)2 s=1 f(wt0+s) 2]. Published as a conference paper at ICLR 2020 where (a) is by Jensen s inequality, (b) is by maxs αs 1 1 β , and (c) is by At 1 t 1 β . Now let us switch to bound the other term. s=1 αsξt0+s 2] = 4η2 Et0[ i =j αiαjξ t0+iξt0+j] + Et0[ s=1 α2 sξ t0+sξt0+s] (a) = 4η2 0 + Et0[ s=1 α2 sξ t0+sξt0+s] , (b) 4η2 tσ2 (1 β)2 . (17) where (a) is because Et0[ξ t0+iξt0+j] = 0 for i = j, (b) is by that ξt 2 σ2 and maxt αt 1 1 β . Combining (14), (15), (16), (17), Et0[ wt0+t wt0 2] Et0[8η2 t (1 β)2 s=1 f(wt0+s) 2] + 8η2 tσ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m. Now we need to bound Et0[Pt 1 s=1 f(wt0+s) 2]. By using ρ-Lipschitzness of Hessian, we have that f(wt0+s) f(wt0+s 1) η f(wt0+s 1), mt0+s 1 + 1 2η2m t0+s 1 2f(wt0+s 1)mt0+s 1 + ρ 6η3 mt0+s 1 3. By adding η f(wt0+s 1), gt0+s 1 on both sides, we have η f(wt0+s 1), gt0+s 1 f(wt0+s 1) f(wt0+s) + η f(wt0+s 1), gt0+s 1 mt0+s 1 2η2m t0+s 1 2f(wt0+s 1)mt0+s 1 + ρ 6η3 mt0+s 1 3. Taking conditional expectation on both sides leads to Et0+s 1[η f(wt0+s 1) 2] Et0+s 1[f(wt0+s 1) f(wt0+s)] + η2ch + ρ 6η3c3 m, (21) where Et0+s 1[η f(wt0+s 1), gt0+s 1 mt0+s 1 + 1 2η2m t0+s 1 2f(wt0+s 1)mt0+s 1] η2ch by the Gr ACE property. We have that for t0 t0 + s 1 Et0[η f(wt0+s 1) 2] = Et0[Et0+s 1[η f(wt0+s 1) 2]] (21) Et0[Et0+s 1[f(wt0+s 1) f(wt0+s)]] + η2ch + ρ = Et0[f(wt0+s 1) f(wt0+s)] + η2ch + ρ 6η3c3 m. (22) Summing the above inequality from s = 2, 3, . . . , t leads to s=1 η f(wt0+s) 2] Et0[f(wt0+1) f(wt0+t)] + η2(t 1)ch + ρ 6η3(t 1)c3 m = Et0[f(wt0+1) f(wt0) + f(wt0) f(wt0+t)] + η2(t 1)ch + ρ 6η3(t 1)c3 m (a) Et0[f(wt0+1) f(wt0)] + Fthred + η2(t 1)ch + ρ 6η3(t 1)c3 m, where (a) is by the assumption (made for proving by contradiction) that Et0[f(wt0) f(wt0+s)] Fthred for any 0 s Tthred. By (21) with s = 1 and η = r, we have Et0[r f(wt0) 2] Et0[f(wt0) f(wt0+1)] + r2ch + ρ 6r3c3 m. (24) Published as a conference paper at ICLR 2020 By (23) and (24), we know that s=1 η f(wt0+s) 2] Et0[r f(wt0) 2] + Et0[ s=1 η f(wt0+s) 2] Fthred + r2ch + ρ 6r3c3 m + η2tch + ρ (a) Fthred + 2r2ch + ρ 6r3c3 m + ρ (b) Fthred + 2r2ch + ρ 3r3c3 m, (25) where (a) is by the constraint that η2t r2 for 0 t Tthred and (b) is by the constraint that r η. By combining (25) and (18) Et0[ wt0+t wt0 2] Et0[8η2 t (1 β)2 s=1 f(wt0+s) 2] + 8η2 tσ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m 8ηt Fthred + 2r2ch + ρ (1 β)2 + 8η2 tσ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m. Published as a conference paper at ICLR 2020 D PROOF OF LEMMA 2 AND LEMMA 3 Lemma 2 Denote t0 any time such that (t0 mod Tthred) = 0. 
Let us define a quadratic approximation at wt0, Q(w) := f(wt0) + w wt0, f(wt0) + 1 2(w wt0) H(w wt0), where H := 2f(wt0). Also, define Gs := (I η Ps k=1 βs k H) and qv,t 1 := Πt 1 j=1Gj rmt0 . qm,t 1 := Pt 1 s=1 Πt 1 j=s+1Gj βsmt0. qq,t 1 := Pt 1 s=1 Πt 1 j=s+1Gj Ps k=1 βs k f(wt0+k) Q(wt0+s) . qw,t 1 := Pt 1 s=1 Πt 1 j=s+1Gj Ps k=1 βs k f(wt0). qξ,t 1 := Pt 1 s=1 Πt 1 j=s+1Gj Ps k=1 βs kξt0+k. Then, wt0+t wt0 = qv,t 1 + ηqm,t 1 + ηqq,t 1 + ηqw,t 1 + ηqξ,t 1. Notations: Denote t0 any time such that (t0 mod Tthred) = 0. Let us define a quadratic approximation at wt0, Q(w) := f(wt0) + w wt0, f(wt0) + 1 2(w wt0) H(w wt0), (27) where H := 2f(wt0). Also, we denote k=1 βs k H) vm,s := βsmt0 k=1 βs k f(wt0+k) Q(wt0+s) k=1 βs k f(wt0) k=1 βs kξt0+k Proof. First, we rewrite mt0+j for any j 1 as follows. mt0+j = βjmt0 + k=1 βj kgt0+k k=1 βj k f(wt0+k) + ξt0+k . Published as a conference paper at ICLR 2020 We have that wt0+t wt0 = wt0+t 1 wt0 ηmt0+t 1 (a) = wt0+t 1 wt0 η βt 1mt0 + k=1 βt 1 k f(wt0+k) + ξt0+k (b) = wt0+t 1 wt0 η k=1 βt 1 k Q(wt0+t 1) η βt 1mt0 + k=1 βt 1 k f(wt0+k) Q(wt0+t 1) + ξt0+k (c) = wt0+t 1 wt0 η k=1 βt 1 k H(wt0+t 1 wt0) + f(wt0) η βt 1mt0 + k=1 βt 1 k f(wt0+k) Q(wt0+t 1) + ξt0+k k=1 βt 1 k H) wt0+t 1 wt0 η βt 1mt0 + k=1 βt 1 k f(wt0+k) Q(wt0+t 1) + f(wt0) + ξt0+k , (30) where (a) is by using (29) with j = t 1, (b) is by subtracting and adding back the same term, and (c) is by Q(wt0+t 1) = f(wt0) + H(wt0+t 1 wt0). To continue, by using the nations in (28), we can rewrite (30) as wt0+t wt0 = Gt 1 wt0+t 1 wt0 η vm,t 1 + vq,t 1 + vw,t 1 + vξ,t 1 . (31) Recursively expanding (31) leads to wt0+t wt0 = Gt 1 wt0+t 1 wt0 η vm,t 1 + vq,t 1 + vw,t 1 + vξ,t 1 = Gt 1 Gt 2 wt0+t 2 wt0 η vm,t 2 + vq,t 2 + vw,t 2 + vξ,t 2 η vm,t 1 + vq,t 1 + vw,t 1 + vξ,t 1 (a) = Πt 1 j=1Gj wt0+1 wt0) η Πt 1 j=s+1Gj vm,s + vq,s + vw,s + vξ,s , (b) = Πt 1 j=1Gj rmt0 η Πt 1 j=s+1Gj vm,s + vq,s + vw,s + vξ,s , where (a) we use the notation that Πt 1 j=s Gj := Gs Gs+1 . . . . . . Gt 1 and the notation that Πt 1 j=t Gj = 1 and (b) is by the update rule. By using the definitions of {q ,t 1} in the lemma statement, we complete the proof. Lemma 3 Following the notations of Lemma 2, we have that Et0[ wt0+t wt0 2] Et0[ qv,t 1 2] + 2ηEt0[ qv,t 1, qm,t 1 + qq,t 1 + qw,t 1 + qξ,t 1 ] := Clower (33) Proof. Following the proof of Lemma 2, we have wt0+t wt0 = qv,t 1 + η qm,t 1 + qq,t 1 + qw,t 1 + qξ,t 1 . (34) Published as a conference paper at ICLR 2020 Therefore, by using a + b 2 a 2 + 2 a, b , Et0[ wt0+t wt0 2] Et0[ qv,t 1 2] + 2ηEt0[ qv,t 1, qm,t 1 + qq,t 1 + qw,t 1 + qξ,t 1 ]. (35) Published as a conference paper at ICLR 2020 E PROOF OF LEMMA 5 Lemma 5 Let Fthred = O(ϵ4) and η2Tthred r2. By following the conditions and notations in Theorem 1, Lemma 1 and Lemma 2, we conclude that if SGD with momentum (Algorithm 2) has the APCG property, then we have that Clower := Et0[ qv,Tthred 1 2]+2ηEt0[ qv,Tthred 1, qm,Tthred 1 + qq,Tthred 1 + qw,Tthred 1 + qξ,Tthred 1 ] > Cupper. Table 3: Constraints and choices of the parameters. 
Parameter Value Constraint origin constant r δγϵ2cr (64), (65), (66) cr c0 c3mρLσ2ch , c0 = 1 1152 c0 c3mρLσ2c (1 β)2ch cr c0 c3mρLσ4(1 β)3ch cr 8ch from (89) η δ2γ2ϵ5cη (64) cη c1 c5mρL2σ2c ch , c1 = c0 24 η η r/ Tthred from (25),(39),(87),(89) η η min{ (1 β) ϵ } from (45), (78) 7 Fthred δγ2ϵ4c F (65) c F c2 c4mρ2Lσ4ch , c2 = c0 576 c F 8c2 0 c6mρ2L2σ4ch Fthred Fthred ϵ2r 4 from (88) Tthred Tthred c(1 β) ηϵ log( Lcmσ2ρc ch (1 β)δγϵ ) from (82) W.l.o.g, we assume that cm, L, σ2, c , ch, and ρ are not less than one and that ϵ 1. E.1 SOME CONSTRAINTS ON β. We require that parameter β is not too close to 1 so that the following holds, 1) L(1 β)3 > 1. 2) σ2(1 β)3 > 1. 3) c (1 β)2 > 1. 6) Tthred c(1 β) ηϵ log( Lcmσ2ρc ch (1 β)δγϵ ) 1 + 2β 1 β . The constraints upper-bound the value of β. That is, β cannot be too close to 1. We note that the β dependence on L, σ, and c are only artificial. We use these constraints in our proofs but they are mostly artefacts of the analysis. For example, if a function is L-smooth, and L < 1, then it is also 1-smooth, so we can assume without loss of generality that L > 1. Similarly, the dependence on σ is not highly relevant, since we can always increase the variance of the stochastic gradient, for example by adding an O(1) gaussian perturbation. Published as a conference paper at ICLR 2020 E.2 SOME LEMMAS To prove Lemma 5, we need a series of lemmas with the choices of parameters on Table 3. Upper bounding Et0[ qq,t 1 ]: Lemma 9. Following the conditions in Lemma 1 and Lemma 2, we have Et0[ qq,t 1 ] Πt 1 j=1(1 + ηθjλ) βLcm ϵ(1 β)2 Πt 1 j=1(1 + ηθjλ) 1 β ρ ηϵ2 8 Fthred + 2r2ch + ρ Πt 1 j=1(1 + ηθjλ) (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m Et0[ qq,t 1 ] = Et0[ Πt 1 j=s+1Gj s X k=1 βs k f(wt0+k) Q(wt0+s) ] s=1 Πt 1 j=s+1Gj s X k=1 βs k f(wt0+k) Q(wt0+s) ] s=1 Πt 1 j=s+1Gj 2 k=1 βs k f(wt0+k) Q(wt0+s) ] s=1 Πt 1 j=s+1Gj 2 k=1 βs k f(wt0+k) Q(wt0+s) ] s=1 Πt 1 j=s+1Gj 2 k=1 βs k f(wt0+k) f(wt0+s) + f(wt0+s) Q(wt0+s) ] (37) where (a), (c), (d) is by triangle inequality, (b) is by the fact that Ax 2 A 2 x 2 for any matrix A and vector x. Now that we have an upper bound of f(wt0+k) f(wt0+s) , f(wt0+k) f(wt0+s) (a) L wt0+k wt0+s (b) Lη(s k)cm. (38) where (a) is by the assumption of L-Lipschitz gradient and (b) is by applying the triangle inequality (s k) times and that wt wt 1 η mt 1 ηcm, for any t. We can also derive an upper bound of Et0[ f(wt0+s) Q(wt0+s) ], Et0[ f(wt0+s) Q(wt0+s) ] 2 wt0+s wt0 2] (b) ρ 2 8ηs Fthred + 2r2ch + ρ (1 β)2 + 8 r2σ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m. (39) Above, (a) is by the fact that if a function f( ) has ρ Lipschitz Hessian, then f(y) f(x) 2f(x)(y x) ρ 2 y x 2 (40) (c.f. Lemma 1.2.4 in (Nesterov (2013))) and using the definition that Q(w) := f(wt0) + w wt0, f(wt0) + 1 2(w wt0) H(w wt0), Published as a conference paper at ICLR 2020 (b) is by Lemma 1 and η2t r2 for 0 t Tthred Et0[ wt0+t wt0 2] 8ηt Fthred + 2r2ch + ρ (1 β)2 + 8η2 tσ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m 8ηt Fthred + 2r2ch + ρ (1 β)2 + 8 r2σ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m. (41) Combing (37), (38), (39), we have that Et0[ qq,t 1 ] s=1 Πt 1 j=s+1Gj 2 k=1 βs k f(wt0+k) f(wt0+s) + f(wt0+s) Q(wt0+s) ] s=1 Πt 1 j=s+1Gj 2 k=1 βs k Lη(s k)cm s=1 Πt 1 j=s+1Gj 2 2 8ηs Fthred + 2r2ch + ρ (1 β)2 + 8 r2σ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m s=1 Πt 1 j=s+1Gj 2 k=1 βs k Lη(s k)cm + s=1 Πt 1 j=s+1Gj 2 (42) where on the last line we use the notation that νs := 8ηs Fthred + 2r2ch + ρ ν := 8 r2σ2 (1 β)2 + 4η2 β 1 β 2c2 m + 2r2c2 m. (43) To continue, let us analyze Πt 1 j=s+1Gj 2 first. 
E.2 SOME LEMMAS

To prove Lemma 5, we need a series of lemmas with the choices of parameters in Table 3.

Upper bounding $E_{t_0}[\|q_{q,t-1}\|]$:
Lemma 9. Following the conditions in Lemma 1 and Lemma 2, we have
$$E_{t_0}[\|q_{q,t-1}\|] \le \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,\frac{\beta Lc_m}{\epsilon(1-\beta)^2} + \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,\frac{1}{1-\beta}\,\frac{8\rho\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{\eta\epsilon^2(1-\beta)^2} + \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,\frac{\rho\Big(\frac{8r^2\sigma^2}{(1-\beta)^2}+4\eta^2\big(\frac{\beta}{1-\beta}\big)^2c_m^2+2r^2c_m^2\Big)}{2\eta\epsilon(1-\beta)}.$$
Proof. We have
$$\begin{aligned}
E_{t_0}[\|q_{q,t-1}\|] &= E_{t_0}\Big[\Big\|\sum_{s=1}^{t-1}\Pi_{j=s+1}^{t-1}G_j\sum_{k=1}^{s}\beta^{s-k}\big(\nabla f(w_{t_0+k})-\nabla Q(w_{t_0+s})\big)\Big\|\Big]\\
&\overset{(a)}{\le} E_{t_0}\Big[\sum_{s=1}^{t-1}\Big\|\Pi_{j=s+1}^{t-1}G_j\sum_{k=1}^{s}\beta^{s-k}\big(\nabla f(w_{t_0+k})-\nabla Q(w_{t_0+s})\big)\Big\|\Big]
\overset{(b)}{\le} E_{t_0}\Big[\sum_{s=1}^{t-1}\big\|\Pi_{j=s+1}^{t-1}G_j\big\|_2\Big\|\sum_{k=1}^{s}\beta^{s-k}\big(\nabla f(w_{t_0+k})-\nabla Q(w_{t_0+s})\big)\Big\|\Big]\\
&\overset{(c)}{\le} E_{t_0}\Big[\sum_{s=1}^{t-1}\big\|\Pi_{j=s+1}^{t-1}G_j\big\|_2\sum_{k=1}^{s}\beta^{s-k}\big\|\nabla f(w_{t_0+k})-\nabla Q(w_{t_0+s})\big\|\Big]
\overset{(d)}{\le} E_{t_0}\Big[\sum_{s=1}^{t-1}\big\|\Pi_{j=s+1}^{t-1}G_j\big\|_2\sum_{k=1}^{s}\beta^{s-k}\Big(\big\|\nabla f(w_{t_0+k})-\nabla f(w_{t_0+s})\big\| + \big\|\nabla f(w_{t_0+s})-\nabla Q(w_{t_0+s})\big\|\Big)\Big], \qquad(37)
\end{aligned}$$
where (a), (c), (d) are by the triangle inequality and (b) is by the fact that $\|Ax\| \le \|A\|_2\|x\|$ for any matrix $A$ and vector $x$.

Now we derive an upper bound of $\|\nabla f(w_{t_0+k})-\nabla f(w_{t_0+s})\|$:
$$\|\nabla f(w_{t_0+k})-\nabla f(w_{t_0+s})\| \overset{(a)}{\le} L\|w_{t_0+k}-w_{t_0+s}\| \overset{(b)}{\le} L\eta(s-k)c_m, \qquad(38)$$
where (a) is by the assumption of $L$-Lipschitz gradient and (b) is by applying the triangle inequality $(s-k)$ times together with $\|w_t - w_{t-1}\| = \eta\|m_{t-1}\| \le \eta c_m$ for any $t$.

We can also derive an upper bound of $E_{t_0}[\|\nabla f(w_{t_0+s})-\nabla Q(w_{t_0+s})\|]$:
$$E_{t_0}[\|\nabla f(w_{t_0+s})-\nabla Q(w_{t_0+s})\|] \overset{(a)}{\le} \frac{\rho}{2}E_{t_0}[\|w_{t_0+s}-w_{t_0}\|^2] \overset{(b)}{\le} \frac{\rho}{2}\Big(\frac{8\eta s\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2} + \frac{8r^2\sigma^2}{(1-\beta)^2} + 4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2 + 2r^2c_m^2\Big). \qquad(39)$$
Above, (a) is by the fact that if a function $f(\cdot)$ has $\rho$-Lipschitz Hessian, then
$$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y-x)\| \le \frac{\rho}{2}\|y-x\|^2 \qquad(40)$$
(c.f. Lemma 1.2.4 in (Nesterov (2013))), using the definition $Q(w) := f(w_{t_0}) + \langle w-w_{t_0},\nabla f(w_{t_0})\rangle + \frac12(w-w_{t_0})^\top H(w-w_{t_0})$; and (b) is by Lemma 1 and $\eta^2 t \le r^2$ for $0\le t\le T_{thred}$:
$$E_{t_0}[\|w_{t_0+t}-w_{t_0}\|^2] \le \frac{8\eta t\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2} + \frac{8\eta^2 t\sigma^2}{(1-\beta)^2} + 4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2 + 2r^2c_m^2 \le \frac{8\eta t\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2} + \frac{8r^2\sigma^2}{(1-\beta)^2} + 4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2 + 2r^2c_m^2. \qquad(41)$$

Combining (37), (38), and (39), we have
$$E_{t_0}[\|q_{q,t-1}\|] \le \sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\sum_{k=1}^{s}\beta^{s-k}L\eta(s-k)c_m + \sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\sum_{k=1}^{s}\beta^{s-k}\frac{\rho}{2}(\nu_s+\nu), \qquad(42)$$
where on the last line we use the notation
$$\nu_s := \frac{8\eta s\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2}, \qquad \nu := \frac{8r^2\sigma^2}{(1-\beta)^2} + 4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2 + 2r^2c_m^2. \qquad(43)$$

To continue, let us analyze $\|\Pi_{j=s+1}^{t-1}G_j\|_2$ first:
$$\|\Pi_{j=s+1}^{t-1}G_j\|_2 = \Big\|\Pi_{j=s+1}^{t-1}\Big(I-\eta\sum_{k=1}^{j}\beta^{j-k}H\Big)\Big\|_2 \overset{(a)}{\le} \Pi_{j=s+1}^{t-1}(1+\eta\theta_j\lambda) = \frac{\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)}{\Pi_{j=1}^{s}(1+\eta\theta_j\lambda)} \overset{(b)}{\le} \frac{\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)}{(1+\eta\epsilon)^s}. \qquad(44)$$
Above, we use the notation $\theta_j := \sum_{k=1}^{j}\beta^{j-k}$. For (a), it is due to $\lambda := -\lambda_{\min}(H)$, $\lambda_{\max}(H)\le L$, and the choice of $\eta$ so that $1-\frac{\eta L}{1-\beta}\ge0$, or equivalently,
$$\eta \le \frac{1-\beta}{L}. \qquad(45)$$
For (b), it is due to $\theta_j\ge1$ for any $j\ge1$ and $\lambda\ge\epsilon$.

Therefore, we can upper-bound the first term on the r.h.s. of (42) as
$$\sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\sum_{k=1}^{s}\beta^{s-k}L\eta(s-k)c_m = \sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\sum_{k=1}^{s-1}\beta^{k}k\,L\eta c_m \overset{(a)}{\le} \sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\,\frac{\beta}{(1-\beta)^2}L\eta c_m \overset{(b)}{\le} \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,\frac{\beta L\eta c_m}{(1-\beta)^2}\sum_{s=1}^{t-1}\frac{1}{(1+\eta\epsilon)^s} \overset{(c)}{\le} \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,\frac{\beta Lc_m}{\epsilon(1-\beta)^2}, \qquad(46)$$
where (a) is by the fact that $\sum_{k=1}^{\infty}\beta^k k \le \frac{\beta}{(1-\beta)^2}$ for any $0\le\beta<1$, (b) is by using (44), and (c) is by using $\sum_{s=1}^{\infty}\big(\frac{1}{1+\eta\epsilon}\big)^s \le \frac{1}{\eta\epsilon}$.

Now let us bound $\sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\sum_{k=1}^{s}\beta^{s-k}\frac{\rho}{2}(\nu_s+\nu)$ in (42). We have
$$\begin{aligned}
\sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\sum_{k=1}^{s}\beta^{s-k}\frac{\rho}{2}(\nu_s+\nu) &\overset{(a)}{\le} \frac{1}{1-\beta}\sum_{s=1}^{t-1}\|\Pi_{j=s+1}^{t-1}G_j\|_2\,\frac{\rho}{2}(\nu_s+\nu)
\overset{(b)}{\le} \frac{\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)}{1-\beta}\Big(\sum_{s=1}^{t-1}\frac{1}{(1+\eta\epsilon)^s}\frac{\rho}{2}\nu_s + \sum_{s=1}^{t-1}\frac{1}{(1+\eta\epsilon)^s}\frac{\rho}{2}\nu\Big)\\
&\overset{(c),(d)}{\le} \frac{\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)}{1-\beta}\,\frac{\rho}{(\eta\epsilon)^2}\,\frac{8\eta\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2} + \frac{\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)}{1-\beta}\,\frac{\rho}{2\eta\epsilon}\Big(\frac{8r^2\sigma^2}{(1-\beta)^2}+4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2+2r^2c_m^2\Big), \qquad(47)
\end{aligned}$$
where (a) is by the fact that $\sum_{k=1}^{s}\beta^{s-k}\le\frac{1}{1-\beta}$, (b) is by (44), (c) is by using $\sum_{s=1}^{\infty}\big(\frac{1}{1+\eta\epsilon}\big)^s\le\frac{1}{\eta\epsilon}$, and (d) is by $\sum_{k=1}^{\infty}z^k k\le\frac{z}{(1-z)^2}$ for any $|z|<1$ with $z=\frac{1}{1+\eta\epsilon}$, which leads to $\sum_{k}z^k k\le\frac{1/(1+\eta\epsilon)}{(1-1/(1+\eta\epsilon))^2}=\frac{1+\eta\epsilon}{(\eta\epsilon)^2}\le\frac{2}{(\eta\epsilon)^2}$, where the last inequality is by choosing the step size $\eta$ so that $\eta\epsilon\le1$.

Combining (42), (46), and (47) gives the bound stated in the lemma, which is (48). This completes the proof.

Upper bounding $\|q_{v,t-1}\|$:
Lemma 10. Following the conditions in Lemma 1 and Lemma 2, we have
$$\|q_{v,t-1}\| \le \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,r c_m. \qquad(49)$$
Proof.
$$\|q_{v,t-1}\| = \big\|\Pi_{j=1}^{t-1}G_j\, r m_{t_0}\big\| \le \big\|\Pi_{j=1}^{t-1}G_j\big\|_2\, r\|m_{t_0}\| \le \Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,r c_m, \qquad(50)$$
where the last inequality is because $\eta$ is chosen so that $\eta\le\frac{1-\beta}{L}$ and the fact that $\lambda_{\max}(H)\le L$.

Lower bounding $E_{t_0}[2\eta\langle q_{v,t-1}, q_{q,t-1}\rangle]$:
Lemma 11. Following the conditions in Lemma 1 and Lemma 2, we have
$$E_{t_0}[2\eta\langle q_{v,t-1}, q_{q,t-1}\rangle] \ge -2\eta\big(\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\big)^2 r c_m\Big[\frac{\beta Lc_m}{\epsilon(1-\beta)^2} + \frac{\rho}{\eta\epsilon^2}\,\frac{8\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^3} + \frac{\rho\Big(\frac{8r^2\sigma^2}{(1-\beta)^2}+4\eta^2\big(\frac{\beta}{1-\beta}\big)^2c_m^2+2r^2c_m^2\Big)}{2\eta\epsilon(1-\beta)}\Big].$$
Proof. By the results of Lemma 9 and Lemma 10,
$$E_{t_0}[2\eta\langle q_{v,t-1}, q_{q,t-1}\rangle] \ge -E_{t_0}[2\eta\|q_{v,t-1}\|\,\|q_{q,t-1}\|] \overset{\text{Lemma 10}}{\ge} -E_{t_0}\big[2\eta\,\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\,rc_m\,\|q_{q,t-1}\|\big] \overset{\text{Lemma 9}}{\ge} -2\eta\big(\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\big)^2 rc_m\big[\,\cdot\,\big],$$
where $[\,\cdot\,]$ denotes the bracketed sum in the statement of the lemma.

Lower bounding $E_{t_0}[2\eta\langle q_{v,t-1}, q_{\xi,t-1}\rangle]$:
Lemma 12. Following the conditions in Lemma 1 and Lemma 2, we have
$$E_{t_0}[2\eta\langle q_{v,t-1}, q_{\xi,t-1}\rangle] = 0. \qquad(53)$$
Proof.
$$E_{t_0}[2\eta\langle q_{v,t-1}, q_{\xi,t-1}\rangle] = -E_{t_0}\Big[2\eta\Big\langle q_{v,t-1},\, \sum_{s=1}^{t-1}\Pi_{j=s+1}^{t-1}G_j\sum_{k=1}^{s}\beta^{s-k}\xi_{t_0+k}\Big\rangle\Big] \overset{(a)}{=} -E_{t_0}\Big[2\eta\Big\langle q_{v,t-1},\, \sum_{k=1}^{t-1}\alpha_k\xi_{t_0+k}\Big\rangle\Big] \overset{(b)}{=} -E_{t_0}\Big[2\eta\sum_{k=1}^{t-1}E_{t_0+k-1}\big[\langle q_{v,t-1}, \alpha_k\xi_{t_0+k}\rangle\big]\Big] \overset{(c)}{=} -E_{t_0}\Big[2\eta\sum_{k=1}^{t-1}\big\langle q_{v,t-1}, E_{t_0+k-1}[\alpha_k\xi_{t_0+k}]\big\rangle\Big] \overset{(d)}{=} 0,$$
where (a) holds for some coefficients $\alpha_k$ collecting the terms in front of each $\xi_{t_0+k}$, (b) is by the tower rule, (c) is because $q_{v,t-1}$ and the $\alpha_k$ are measurable with respect to the information available at $t_0$, and (d) is by the zero-mean assumption on the $\xi$'s.
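The product bound (44) is used repeatedly in Lemmas 9-12 above. The following sketch checks it numerically on a random symmetric $H$ with $\lambda_{\min}(H)=-\lambda\le-\epsilon$ and $\lambda_{\max}(H)\le L$, under the step-size condition (45); the dimension and the particular constants are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d, beta, L, eps = 8, 0.9, 4.0, 0.1
eta = (1 - beta) / L                     # step size satisfying (45)
lam = 0.5                                # -lambda_min(H), with lam >= eps

# Random symmetric H with lambda_min = -lam and lambda_max <= L (hypothetical instance).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.concatenate(([-lam], rng.uniform(-lam, L, d - 1)))
H = (Q * eigs) @ Q.T

theta = lambda j: (1 - beta**j) / (1 - beta)      # theta_j = sum_{k=1}^{j} beta^{j-k}
G = lambda j: np.eye(d) - eta * theta(j) * H

t = 40
for s in range(1, t):
    P = np.eye(d)
    for j in range(s + 1, t):
        P = P @ G(j)
    lhs = np.linalg.norm(P, 2)
    rhs = np.prod([1 + eta * theta(j) * lam for j in range(1, t)]) / (1 + eta * eps)**s
    assert lhs <= rhs * (1 + 1e-10), (s, lhs, rhs)
print("bound (44) verified for all s")
```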
Lower bounding $E_{t_0}[2\eta\langle q_{v,t-1}, q_{m,t-1}\rangle]$:
Lemma 13. Following the conditions in Lemma 1 and Lemma 2, we have
$$E_{t_0}[2\eta\langle q_{v,t-1}, q_{m,t-1}\rangle] \ge 0. \qquad(55)$$
Proof.
$$E_{t_0}[2\eta\langle q_{v,t-1}, q_{m,t-1}\rangle] = 2\eta r\,E_{t_0}\Big[\Big\langle \Pi_{j=1}^{t-1}G_j\, m_{t_0},\ \sum_{s=1}^{t-1}\Pi_{j=s+1}^{t-1}G_j\,\beta^s m_{t_0}\Big\rangle\Big] \overset{(a)}{=} 2\eta r\,E_{t_0}[\langle m_{t_0}, B m_{t_0}\rangle] \overset{(b)}{\ge} 0,$$
where (a) is by defining the matrix $B := \Pi_{j=1}^{t-1}G_j\sum_{s=1}^{t-1}\Pi_{j=s+1}^{t-1}G_j\,\beta^s$. For (b), notice that the matrix $B$ is symmetric positive semidefinite. To see this, observe that each $G_j := \big(I-\eta\sum_{k=1}^{j}\beta^{j-k}H\big)$ can be written in the form $G_j = UD_jU^\top$ for some orthonormal matrix $U$ and a diagonal matrix $D_j$. Therefore, each matrix product $\Pi_{j=1}^{t-1}G_j\,\Pi_{j=s+1}^{t-1}G_j = U\big(\Pi_{j=1}^{t-1}D_j\big)\big(\Pi_{j=s+1}^{t-1}D_j\big)U^\top$ is symmetric positive semidefinite as long as each $G_j$ is, and so is their nonnegatively weighted sum $B$. So, (b) is by the property of a symmetric positive semidefinite matrix.

Lower bounding $2\eta E_{t_0}[\langle q_{v,t-1}, q_{w,t-1}\rangle]$:
Lemma 14. Following the conditions in Lemma 1 and Lemma 2, if SGD with momentum has the APCG property, then
$$2\eta E_{t_0}[\langle q_{v,t-1}, q_{w,t-1}\rangle] \ge -\frac{2\eta r c'}{1-\beta}\big(\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\big)^2\epsilon. \qquad(57)$$
Proof. Define $D_s := \Pi_{j=1}^{t-1}G_j\,\Pi_{j=s+1}^{t-1}G_j$. Then
$$\begin{aligned}
2\eta E_{t_0}[\langle q_{v,t-1}, q_{w,t-1}\rangle] &= 2\eta E_{t_0}\Big[\Big\langle \Pi_{j=1}^{t-1}G_j\, r m_{t_0},\ \sum_{s=1}^{t-1}\Pi_{j=s+1}^{t-1}G_j\sum_{k=1}^{s}\beta^{s-k}\nabla f(w_{t_0})\Big\rangle\Big]
= 2\eta E_{t_0}\Big[\Big\langle r m_{t_0},\ \sum_{s=1}^{t-1}\Pi_{j=1}^{t-1}G_j\,\Pi_{j=s+1}^{t-1}G_j\sum_{k=1}^{s}\beta^{s-k}\nabla f(w_{t_0})\Big\rangle\Big]\\
&= 2\eta r\sum_{s=1}^{t-1}\sum_{k=1}^{s}\beta^{s-k}\,E_{t_0}[\langle m_{t_0}, D_s\nabla f(w_{t_0})\rangle]
\overset{(a)}{\ge} -2\eta^2 r c'\sum_{s=1}^{t-1}\sum_{k=1}^{s}\beta^{s-k}\|D_s\|_2\|\nabla f(w_{t_0})\|^2 \ge -\frac{2\eta^2 r c'}{1-\beta}\sum_{s=1}^{t-1}\|D_s\|_2\|\nabla f(w_{t_0})\|^2, \qquad(58)
\end{aligned}$$
where (a) is by the APCG property. We also have that
$$\|D_s\|_2 = \big\|\Pi_{j=1}^{t-1}G_j\,\Pi_{j=s+1}^{t-1}G_j\big\|_2 \le \big\|\Pi_{j=1}^{t-1}G_j\big\|_2\,\big\|\Pi_{j=s+1}^{t-1}G_j\big\|_2 \overset{(a)}{\le} \big\|\Pi_{j=1}^{t-1}G_j\big\|_2\,\frac{\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)}{(1+\eta\epsilon)^s} \overset{(b)}{\le} \frac{\big(\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\big)^2}{(1+\eta\epsilon)^s}, \qquad(59)$$
where (a) and (b) are by (44). Substituting this back into (58), we get
$$2\eta E_{t_0}[\langle q_{v,t-1}, q_{w,t-1}\rangle] \ge -\frac{2\eta^2 rc'}{1-\beta}\sum_{s=1}^{t-1}\frac{\big(\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\big)^2}{(1+\eta\epsilon)^s}\|\nabla f(w_{t_0})\|^2 \ge -\frac{2\eta^2 rc'}{(1-\beta)\eta\epsilon}\big(\Pi_{j=1}^{t-1}(1+\eta\theta_j\lambda)\big)^2\|\nabla f(w_{t_0})\|^2. \qquad(60)$$
Using the fact that $\|\nabla f(w_{t_0})\|\le\epsilon$ completes the proof.
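A small numerical illustration of the positive-semidefiniteness argument in Lemma 13: since every $G_j$ is a polynomial in $H$, they share eigenvectors, so the matrix $B$ defined in the proof is symmetric PSD whenever each $G_j$ is, and hence $\langle m_{t_0}, Bm_{t_0}\rangle\ge0$. The Hessian, horizon, and constants below are hypothetical placeholders; the step size is taken to satisfy (45).

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta, L, t = 6, 0.9, 2.0, 30
eta = 0.5 * (1 - beta) / L                 # satisfies the step-size condition (45)

# Hypothetical symmetric Hessian with eigenvalues in [-L, L].
A = rng.standard_normal((d, d)); H = 0.5 * (A + A.T)
H *= L / np.linalg.norm(H, 2)

theta = lambda j: (1 - beta**j) / (1 - beta)          # theta_j = sum_{k=1}^{j} beta^{j-k}
G = [np.eye(d) - eta * theta(j) * H for j in range(t)]

def prod(lo, hi):                                      # G_lo ... G_hi (identity if empty)
    P = np.eye(d)
    for j in range(lo, hi + 1):
        P = P @ G[j]
    return P

# B as defined in the proof of Lemma 13.
B = prod(1, t - 1) @ sum(beta**s * prod(s + 1, t - 1) for s in range(1, t))
B = 0.5 * (B + B.T)                                    # symmetrize away round-off asymmetry
lam_min = np.linalg.eigvalsh(B)[0]
print(lam_min >= -1e-9 * np.linalg.norm(B, 2))         # True: B is PSD, so <m, B m> >= 0
```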
E.3 PROOF OF LEMMA 5

Recall that the strategy is proof by contradiction. Assume that the function value does not decrease by at least $F_{thred}$ in $T_{thred}$ iterations in expectation. Then we can get an upper bound on the expected distance, $E_{t_0}[\|w_{t_0+T_{thred}}-w_{t_0}\|^2] \le C_{upper}$, but, by leveraging the negative curvature, we can also show a lower bound of the form $E_{t_0}[\|w_{t_0+T_{thred}}-w_{t_0}\|^2] \ge C_{lower}$. The strategy is to show that the lower bound is larger than the upper bound, which leads to a contradiction and allows us to conclude that the function value must decrease by at least $F_{thred}$ in $T_{thred}$ iterations in expectation. To get the contradiction, according to Lemma 1 and Lemma 3, we need to show that
$$E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}\big[\langle q_{v,T_{thred}-1},\, q_{m,T_{thred}-1}+q_{q,T_{thred}-1}+q_{w,T_{thred}-1}+q_{\xi,T_{thred}-1}\rangle\big] > C_{upper}. \qquad(61)$$
Yet, by Lemma 13 and Lemma 12, we have $\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{m,T_{thred}-1}\rangle] \ge 0$ and $\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{\xi,T_{thred}-1}\rangle] = 0$. So, it suffices to prove that
$$E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}\big[\langle q_{v,T_{thred}-1},\, q_{q,T_{thred}-1}+q_{w,T_{thred}-1}\rangle\big] > C_{upper}, \qquad(62)$$
and for this it suffices to show that
$$\tfrac14 E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{q,T_{thred}-1}\rangle] \ge 0,$$
$$\tfrac14 E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{w,T_{thred}-1}\rangle] \ge 0,$$
$$\tfrac14 E_{t_0}[\|q_{v,T_{thred}-1}\|^2] > C_{upper}.$$

E.3.1 PROVING THAT $\tfrac14 E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{q,T_{thred}-1}\rangle] \ge 0$:

By Lemma 4 and Lemma 11, we have
$$\tfrac14 E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + E_{t_0}[2\eta\langle q_{v,T_{thred}-1}, q_{q,T_{thred}-1}\rangle] \ge \tfrac14\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\lambda)\big)^2 r^2\gamma - 2\eta\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\lambda)\big)^2 rc_m\Big[\frac{\beta Lc_m}{\epsilon(1-\beta)^2} + \frac{\rho}{\eta\epsilon^2}\,\frac{8\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^3} + \frac{\rho\Big(\frac{8r^2\sigma^2}{(1-\beta)^2}+4\eta^2\big(\frac{\beta}{1-\beta}\big)^2c_m^2+2r^2c_m^2\Big)}{2\eta\epsilon(1-\beta)}\Big].$$
To show that the above is nonnegative, it suffices to show that
$$r^2\gamma \ge \frac{24\eta r\beta Lc_m^2}{\epsilon(1-\beta)^2}, \qquad(64)$$
$$r^2\gamma \ge \frac{24\eta rc_m\rho}{(1-\beta)\eta\epsilon^2}\cdot\frac{8\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2}, \qquad(65)$$
$$r^2\gamma \ge \frac{24\eta rc_m\rho}{2\eta\epsilon(1-\beta)}\Big(\frac{8r^2\sigma^2}{(1-\beta)^2}+4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2+2r^2c_m^2\Big). \qquad(66)$$
Now, w.l.o.g., we assume that $c_m$, $L$, $\sigma^2$, $c'$, and $\rho$ are not less than one and that $\epsilon\le1$. By using the values of the parameters in Table 3, we have the following results. A sufficient condition of (64) is that
$$\frac{c_r}{c_\eta} \ge \frac{24Lc_m^2\epsilon^2}{(1-\beta)^2}. \qquad(67)$$
A sufficient condition of (65) is that
$$\frac{c_r}{c_F} \ge \frac{576c_m\rho}{(1-\beta)^3}, \qquad(68)$$
and
$$1 \ge \frac{1152c_m\rho c_h c_r}{(1-\beta)^3}, \qquad(69)$$
$$1 \ge \frac{192c_m^4\rho^2 c_r^2}{(1-\beta)^3}. \qquad(70)$$
A sufficient condition of (66) is that
$$1 \ge \frac{96c_m\rho(\sigma^2+3c_m^2)c_r\epsilon}{(1-\beta)^3}, \qquad(71)$$
and a sufficient condition for (71), by the assumption that both $\sigma^2\ge1$ and $c_m\ge1$, is
$$1 \ge \frac{576c_m^3\rho\sigma^2c_r\epsilon}{(1-\beta)^3}. \qquad(72)$$
Now let us verify that (67), (68), (69), (70), and (72) are satisfied. For (67), using the constraint on $c_\eta$ in Table 3, we have $\frac{1}{c_\eta}\ge\frac{c_m^5\rho L^2\sigma^2c'c_h}{c_1}$. Using this inequality, it suffices to let $c_r\ge\frac{c_0\epsilon^2}{c_m^3\rho L\sigma^2c'(1-\beta)^2c_h}$ for getting (67), which holds by using the constraint that $c'(1-\beta)^2>1$ and $\epsilon\le1$. For (68), using the constraint on $c_F$ in Table 3, we have $\frac{1}{c_F}\ge\frac{c_m^4\rho^2L\sigma^4c_h}{c_2}$. Using this inequality, it suffices to let $c_r\ge\frac{c_0}{c_m^3\rho L\sigma^4(1-\beta)^3c_h}$, which holds by using the constraint that $\sigma^2(1-\beta)^3>1$. For (69), it needs $(1-\beta)^3\ge1152c_m\rho c_h\,c_r$, which, by $c_r\le\frac{c_0}{c_m^3\rho L\sigma^2c_h}$, holds by using the constraint that $\sigma^2(1-\beta)^3>1$. For (70), it suffices to let $(1-\beta)^3\ge192c_m^4\rho^2 c_r\cdot\frac{c_0}{c_m^3\rho L\sigma^2c_h}$, which holds by using the constraint that $\sigma^2(1-\beta)^3>1$. For (72), it suffices to let $(1-\beta)^3\ge576c_m^3\rho\sigma^2\epsilon\cdot\frac{c_0}{c_m^3\rho L\sigma^2c_h}$, which holds by using the constraint that $L(1-\beta)^3>1$ and $\epsilon\le1$. Therefore, by choosing the parameter values as in Table 3, we can guarantee that $\tfrac14E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{q,T_{thred}-1}\rangle] \ge 0$.

E.3.2 PROVING THAT $\tfrac14E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{w,T_{thred}-1}\rangle] \ge 0$:

By Lemma 4 and Lemma 14, we have
$$\tfrac14E_{t_0}[\|q_{v,T_{thred}-1}\|^2] + 2\eta E_{t_0}[\langle q_{v,T_{thred}-1}, q_{w,T_{thred}-1}\rangle] \ge \tfrac14\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\lambda)\big)^2r^2\gamma - \frac{2\eta rc'}{1-\beta}\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\lambda)\big)^2\epsilon. \qquad(73)$$
To show that the above is nonnegative, it suffices to show that
$$r^2\gamma \ge \frac{8\eta rc'\epsilon}{1-\beta}. \qquad(74)$$
A sufficient condition is $c_r\ge\frac{8\epsilon^4c_\eta c'}{1-\beta}$. Using the constraint on $c_\eta$ in Table 3, we have $\frac{1}{c_\eta}\ge\frac{c_m^5\rho L^2\sigma^2c'c_h}{c_1}$. So, it suffices to let $c_r\ge\frac{c_0\epsilon^4}{3c_m^5\rho L^2\sigma^2c_h(1-\beta)}$, which holds by using the constraint that $L(1-\beta)^3>1$ (so that $L(1-\beta)>1$) and $\epsilon\le1$.
E.3.3 PROVING THAT $\tfrac14E_{t_0}[\|q_{v,T_{thred}-1}\|^2] > C_{upper}$:

From Lemma 4 and Lemma 1, we need to show that
$$\tfrac14\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\lambda)\big)^2r^2\gamma > \frac{8\eta T_{thred}\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2} + \frac{8r^2\sigma^2}{(1-\beta)^2} + 4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2 + 2r^2c_m^2. \qquad(75)$$
We know that $\tfrac14\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\lambda)\big)^2r^2\gamma \ge \tfrac14\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\epsilon)\big)^2r^2\gamma$, so it suffices to show that
$$\tfrac14\big(\Pi_{j=1}^{T_{thred}-1}(1+\eta\theta_j\epsilon)\big)^2r^2\gamma > \frac{8\eta T_{thred}\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2} + \frac{8r^2\sigma^2}{(1-\beta)^2} + 4\eta^2\Big(\frac{\beta}{1-\beta}\Big)^2c_m^2 + 2r^2c_m^2. \qquad(76)$$
Note that the left-hand side is exponentially growing in $T_{thred}$, so we can choose the number of iterations $T_{thred}$ large enough to get the desired result. Specifically, we claim that $T_{thred}=\frac{c(1-\beta)}{\eta\epsilon}\log\big(\frac{Lc_m\sigma^2\rho c'c_h}{(1-\beta)\delta\gamma\epsilon}\big)$ suffices for some constant $c>0$. To see this, let us first apply $\log$ on both sides of (76):
$$2\sum_{j=1}^{T_{thred}-1}\log(1+\eta\theta_j\epsilon) + \log(r^2\gamma) > \log(8aT_{thred}+8b), \qquad(77)$$
where we denote $a:=\frac{\eta\big(F_{thred}+2r^2c_h+\frac{\rho r^3c_m^3}{3}\big)}{(1-\beta)^2}$ and $b:=\frac{r^2\sigma^2}{(1-\beta)^2}+\frac{\eta^2}{2}\big(\frac{\beta}{1-\beta}\big)^2c_m^2+\frac{r^2c_m^2}{4}$. To proceed, we are going to use the inequality $\log(1+x)\ge\frac{x}{2}$, for $x\in[0,2.51]$. We have that
$$\frac{\eta\epsilon}{1-\beta}\le1 \qquad(78)$$
as guaranteed by the constraint on $\eta$. So,
$$2\sum_{j=1}^{T_{thred}-1}\log(1+\eta\theta_j\epsilon) \overset{(a)}{\ge} \sum_{j=1}^{T_{thred}-1}\eta\theta_j\epsilon \ge \frac{\eta\epsilon}{1-\beta}\Big(T_{thred}-1-\frac{\beta}{1-\beta}\Big) \overset{(b)}{\ge} \frac{T_{thred}-1}{2(1-\beta)}\,\eta\epsilon, \qquad(79)$$
where (a) is by using the inequality $\log(1+x)\ge\frac{x}{2}$ with $x=\eta\theta_j\epsilon\le1$, and (b) is by making $\frac{T_{thred}-1}{2}\ge\frac{\beta}{1-\beta}$, which is equivalent to the condition that
$$T_{thred}\ge1+\frac{2\beta}{1-\beta}. \qquad(80)$$
Now let us substitute the result of (79) back into (77). We have that it suffices to show
$$T_{thred}\ge1+\frac{2(1-\beta)}{\eta\epsilon}\log\Big(\frac{8aT_{thred}+8b}{\gamma r^2}\Big), \qquad(81)$$
which is what we need to show. By choosing $T_{thred}$ large enough,
$$T_{thred}=\frac{c(1-\beta)}{\eta\epsilon}\log\Big(\frac{Lc_m\sigma^2\rho c'c_h}{(1-\beta)\delta\gamma\epsilon}\Big)=O\Big((1-\beta)\log\big(\tfrac{1}{(1-\beta)\epsilon}\big)\epsilon^{-6}\Big) \qquad(82)$$
for some constant $c>0$, we can guarantee that the above inequality (81) holds.
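The E.3.3 argument boils down to an exponential-versus-linear comparison: the left-hand side of (76) grows like $\big(\prod_j(1+\eta\theta_j\epsilon)\big)^2$, while the right-hand side grows linearly in $T_{thred}$, so a threshold of order $\frac{1-\beta}{\eta\epsilon}\log(\cdot)$ suffices. The sketch below locates the crossover numerically; the constants $\beta$, $\eta$, $\epsilon$, $r$, $\gamma$, $a$, and $b$ are hypothetical placeholders rather than the Table 3 values.

```python
import numpy as np

# Exponential-vs-linear comparison behind (76)-(82), with placeholder constants.
beta, eta, eps, r, gamma, a, b = 0.9, 1e-3, 0.1, 1e-2, 0.5, 1e-4, 1e-3
theta = lambda j: (1 - beta**j) / (1 - beta)

T_max = 20000
T_vals = np.arange(2, T_max)
# log of prod_{j=1}^{T-1} (1 + eta * theta_j * eps), accumulated for T = 2, ..., T_max - 1
log_prod = np.cumsum([np.log1p(eta * theta(j) * eps) for j in range(1, T_max - 1)])
lower = 0.25 * np.exp(2 * log_prod) * r**2 * gamma   # left-hand side of (76)
upper = 8 * a * T_vals + 8 * b                       # right-hand side, linear in T_thred
T_cross = T_vals[np.argmax(lower > upper)]
print(T_cross, (1 - beta) / (eta * eps))  # crossover is a modest multiple of (1-beta)/(eta*eps)
```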
F PROOF OF LEMMA 15

Lemma 15 (Daneshmand et al. (2018)) Let us define the event $\Upsilon_k := \{\|\nabla f(w_{kT_{thred}})\|\ge\epsilon\ \text{or}\ \lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\le-\epsilon\}$. The complement is $\Upsilon_k^c := \{\|\nabla f(w_{kT_{thred}})\|\le\epsilon\ \text{and}\ \lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\ge-\epsilon\}$, which means that $w_{kT_{thred}}$ is an $(\epsilon,\epsilon)$-second-order stationary point. Suppose that
$$E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k]\le-F_{thred} \quad\text{and}\quad E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k^c]\le\frac{\delta F_{thred}}{2}.$$
Set $T = 2T_{thred}\big(f(w_0)-\min_wf(w)\big)/(\delta F_{thred})$. We return $w$ uniformly at random from $w_0, w_{T_{thred}}, w_{2T_{thred}}, \ldots, w_{kT_{thred}}, \ldots, w_{KT_{thred}}$, where $K := T/T_{thred}$. Then, with probability at least $1-\delta$, we will have chosen a $w_{kT_{thred}}$ for which $\Upsilon_k$ did not occur.

Proof. Let $P_k$ be the probability that $\Upsilon_k$ occurs. Then
$$E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})] = E[\,\cdot\,|\,\Upsilon_k]P_k + E[\,\cdot\,|\,\Upsilon_k^c](1-P_k) \le -F_{thred}P_k + \frac{\delta F_{thred}}{2}(1-P_k) = \frac{\delta F_{thred}}{2} - \Big(1+\frac{\delta}{2}\Big)F_{thred}P_k \le \frac{\delta F_{thred}}{2} - F_{thred}P_k. \qquad(84)$$
Summing over all $K+1$ blocks and using the fact that the sum telescopes with $f$ lower bounded by $\min_wf(w)$, we have
$$-\frac{f(w_0)-\min_wf(w)}{K+1} \le \frac{1}{K+1}\sum_{k=0}^{K}E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})] \le \frac{1}{K+1}\sum_{k=0}^{K}\Big(\frac{\delta F_{thred}}{2}-F_{thred}P_k\Big),$$
so that
$$\frac{1}{K+1}\sum_{k=0}^{K}P_k \le \frac{\delta}{2} + \frac{f(w_0)-\min_wf(w)}{(K+1)F_{thred}} \le \delta, \quad\text{and hence}\quad \frac{1}{K+1}\sum_{k=0}^{K}(1-P_k)\ge1-\delta.$$

G PROOF OF THEOREM 1

Theorem 1 Assume that the stochastic momentum satisfies CNC. Set $r=O(\epsilon^2)$, $\eta=O(\epsilon^5)$, and $T_{thred}=\frac{c(1-\beta)}{\eta\epsilon}\log\big(\frac{Lc_m\sigma^2\rho c'c_h}{(1-\beta)\delta\gamma\epsilon}\big)=O\big((1-\beta)\log\big(\frac{Lc_m\sigma^2\rho c'c_h}{(1-\beta)\delta\gamma\epsilon}\big)\epsilon^{-6}\big)$ for some constant $c>0$. If SGD with momentum (Algorithm 2) has the APAG property when the gradient is large ($\|\nabla f(w)\|\ge\epsilon$), the APCG$_{T_{thred}}$ property when it enters a region of saddle points that exhibits a negative curvature ($\|\nabla f(w)\|\le\epsilon$ and $\lambda_{\min}(\nabla^2f(w))\le-\epsilon$), and the GrACE property throughout the iterations, then it reaches an $(\epsilon,\epsilon)$-second-order stationary point in $T=2T_{thred}\big(f(w_0)-\min_wf(w)\big)/(\delta F_{thred})=O\big((1-\beta)\log\big(\frac{Lc_m\sigma^2\rho c'c_h}{(1-\beta)\delta\gamma\epsilon}\big)\epsilon^{-10}\big)$ iterations with high probability $1-\delta$, where $F_{thred}=O(\epsilon^4)$.

G.1 PROOF SKETCH OF THEOREM 1

In this subsection, we provide a sketch of the proof of Theorem 1; the complete proof is available in Subsection G.2. Our proof uses a lemma from Daneshmand et al. (2018), which is Lemma 15 below. The lemma guarantees that uniformly sampling a $w$ from $\{w_{kT_{thred}}\}$, $k=0,1,2,\ldots,T/T_{thred}$, gives an $(\epsilon,\epsilon)$-second-order stationary point with high probability. We replicate the proof of Lemma 15 in Appendix F.

Lemma 15. (Daneshmand et al. (2018)) Let us define the event $\Upsilon_k := \{\|\nabla f(w_{kT_{thred}})\|\ge\epsilon\ \text{or}\ \lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\le-\epsilon\}$. The complement is $\Upsilon_k^c := \{\|\nabla f(w_{kT_{thred}})\|\le\epsilon\ \text{and}\ \lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\ge-\epsilon\}$, which means that $w_{kT_{thred}}$ is an $(\epsilon,\epsilon)$-second-order stationary point. Suppose that
$$E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k]\le-F_{thred} \quad\&\quad E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k^c]\le\frac{\delta F_{thred}}{2}. \qquad(86)$$
Set $T = 2T_{thred}\big(f(w_0)-\min_wf(w)\big)/(\delta F_{thred})$. (One can use any upper bound of $f(w_0)-\min_wf(w)$ in the expression of $T$.) We return $w$ uniformly at random from $w_0, w_{T_{thred}}, w_{2T_{thred}}, \ldots, w_{kT_{thred}}, \ldots, w_{KT_{thred}}$, where $K := T/T_{thred}$. Then, with probability at least $1-\delta$, we will have chosen a $w_{kT_{thred}}$ for which $\Upsilon_k$ did not occur.

To use the result of Lemma 15, we need the conditions in (86) to be satisfied. We can bound $E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k]\le-F_{thred}$ based on the analysis of the large-gradient-norm regime (Lemma 7) and the analysis for the scenario when the update has a small gradient norm but a large negative curvature is available (Subsection 3.2.1). For the other condition, $E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k^c]\le\frac{\delta F_{thred}}{2}$, it is required that the expected amortized increase of the function value due to taking the large step size $r$ is limited (i.e., bounded by $\frac{\delta F_{thred}}{2}$) when $w_{kT_{thred}}$ is already a second-order stationary point. Once these conditions are satisfied, we can apply Lemma 15 and finish the proof of the theorem.

G.2 FULL PROOF OF THEOREM 1

Proof. Our proof is based on Lemma 15. So, let us consider the events in Lemma 15, $\Upsilon_k := \{\|\nabla f(w_{kT_{thred}})\|\ge\epsilon\ \text{or}\ \lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\le-\epsilon\}$. We first show that $E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k]\le-F_{thred}$.

When $\|\nabla f(w_{kT_{thred}})\|\ge\epsilon$: Denote $t_0:=kT_{thred}$ in the following. We have that
$$\begin{aligned}
E_{t_0}[f(w_{t_0+T_{thred}})-f(w_{t_0})] &= \sum_{t=0}^{T_{thred}-1}E_{t_0}\big[E[f(w_{t_0+t+1})-f(w_{t_0+t})\,|\,w_{0:t_0+t}]\big]
= E_{t_0}[f(w_{t_0+1})-f(w_{t_0})] + \sum_{t=1}^{T_{thred}-1}E_{t_0}\big[E[f(w_{t_0+t+1})-f(w_{t_0+t})\,|\,w_{0:t_0+t}]\big]\\
&\overset{(a)}{\le} -\frac{r}{2}\|\nabla f(w_{t_0})\|^2 + \frac{Lr^2c_m^2}{2} + \sum_{t=1}^{T_{thred}-1}E_{t_0}\big[E[f(w_{t_0+t+1})-f(w_{t_0+t})\,|\,w_{0:t_0+t}]\big]
\overset{(b)}{\le} -\frac{r}{2}\|\nabla f(w_{t_0})\|^2 + \frac{Lr^2c_m^2}{2} + (T_{thred}-1)\Big(\eta^2c_h+\frac{\rho\eta^3c_m^3}{6}\Big)\\
&\overset{(c)}{\le} -\frac{r}{2}\|\nabla f(w_{t_0})\|^2 + \frac{Lr^2c_m^2}{2} + r^2c_h + \frac{\rho r^3c_m^3}{6}
\overset{(d)}{\le} -\frac{r}{2}\|\nabla f(w_{t_0})\|^2 + Lr^2c_m^2 + r^2c_h
\overset{(e)}{\le} -\frac{r}{2}\epsilon^2 + Lr^2c_m^2 + r^2c_h
\overset{(f)}{\le} -\frac{r}{4}\epsilon^2
\overset{(g)}{\le} -F_{thred}, \qquad(87)
\end{aligned}$$
where (a) is by using Lemma 6 with step size $r$, (b) is by using Lemma 8, (c) is due to the constraint $\eta^2T_{thred}\le r^2$ (and $\eta\le r$), (d) is by the choice of $r$, (e) is by $\|\nabla f(w_{t_0})\|\ge\epsilon$, (f) is by the choice of $r$ so that $r\le\frac{\epsilon^2}{4(Lc_m^2+c_h)}$, and (g) is by
$$\frac{r\epsilon^2}{4}\ge F_{thred}. \qquad(88)$$

When $\|\nabla f(w_{kT_{thred}})\|\le\epsilon$ and $\lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\le-\epsilon$: This scenario has been analyzed in Appendix E, which guarantees that $E[f(w_{t_0+T_{thred}})-f(w_{t_0})]\le-F_{thred}$ under this setting.

When $\|\nabla f(w_{kT_{thred}})\|\le\epsilon$ and $\lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\ge-\epsilon$: Now let us switch to showing that $E[f(w_{(k+1)T_{thred}})-f(w_{kT_{thred}})\,|\,\Upsilon_k^c]\le\frac{\delta F_{thred}}{2}$. Recall that $\Upsilon_k^c$ means that $\|\nabla f(w_{kT_{thred}})\|\le\epsilon$ and $\lambda_{\min}(\nabla^2f(w_{kT_{thred}}))\ge-\epsilon$. Denote $t_0:=kT_{thred}$ in the following. We have that
$$\begin{aligned}
E_{t_0}[f(w_{t_0+T_{thred}})-f(w_{t_0})] &= \sum_{t=0}^{T_{thred}-1}E_{t_0}\big[E[f(w_{t_0+t+1})-f(w_{t_0+t})\,|\,w_{0:t_0+t}]\big]
= E_{t_0}[f(w_{t_0+1})-f(w_{t_0})] + \sum_{t=1}^{T_{thred}-1}E_{t_0}\big[E[f(w_{t_0+t+1})-f(w_{t_0+t})\,|\,w_{0:t_0+t}]\big]\\
&\overset{(a)}{\le} r^2c_h + \frac{\rho r^3c_m^3}{6} + \sum_{t=1}^{T_{thred}-1}E_{t_0}\big[E[f(w_{t_0+t+1})-f(w_{t_0+t})\,|\,w_{0:t_0+t}]\big]
\overset{(b)}{\le} r^2c_h + \frac{\rho r^3c_m^3}{6} + (T_{thred}-1)\Big(\eta^2c_h+\frac{\rho\eta^3c_m^3}{6}\Big)\\
&\overset{(c)}{\le} 2r^2c_h + \frac{\rho r^3c_m^3}{3} \le 4r^2c_h \overset{(d)}{\le} \frac{\delta F_{thred}}{2}, \qquad(89)
\end{aligned}$$
where (a) is by using Lemma 8 with step size $r$, (b) is by using Lemma 8 with step size $\eta$, (c) is by setting $\eta^2T_{thred}\le r^2$ and $\eta\le r$, and (d) is by the choice of $r$ so that $8r^2c_h\le\delta F_{thred}$.

Now we are ready to use Lemma 15, since both conditions are satisfied. According to the lemma and the choices of parameter values in Table 3, we can set $T=2T_{thred}\big(f(w_0)-\min_wf(w)\big)/(\delta F_{thred})=O\big((1-\beta)\log\big(\frac{Lc_m\sigma^2\rho c'c_h}{(1-\beta)\delta\gamma\epsilon}\big)\epsilon^{-10}\big)$, which will return a $w$ that is an $(\epsilon,\epsilon)$-second-order stationary point. Thus, we have completed the proof.
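The two conditional bounds (87)-(89) rest on simple compatibility conditions between $r$ and $F_{thred}$: $r\le\frac{\epsilon^2}{4(Lc_m^2+c_h)}$, $F_{thred}\le\frac{r\epsilon^2}{4}$, and $8r^2c_h\le\delta F_{thred}$. The sketch below checks that such choices are mutually consistent for some placeholder constants; it is an arithmetic sanity check under the reconstructed reading of this appendix, not the paper's actual parameter setting.

```python
# Arithmetic sanity check of the r / F_thred conditions used around (87)-(89).
# All constants are hypothetical placeholders chosen only to satisfy the stated constraints.
L, c_m, c_h, eps, delta = 2.0, 1.5, 1.2, 0.05, 0.1

r = min(eps**2 / (4 * (L * c_m**2 + c_h)),   # condition (f): r <= eps^2 / (4 (L c_m^2 + c_h))
        delta * eps**2 / (64 * c_h))         # leaves slack for (d): 8 r^2 c_h <= delta F_thred
F_thred = r * eps**2 / 4                     # condition (g)/(88): F_thred <= r eps^2 / 4

# Large-gradient block: per-block expected decrease is at least F_thred.
assert -(r / 2) * eps**2 + L * r**2 * c_m**2 + r**2 * c_h <= -F_thred

# Stationary block: expected increase is at most delta * F_thred / 2.
assert 4 * r**2 * c_h <= delta * F_thred / 2
print("r and F_thred choices are consistent")
```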
G.3 COMPARISON TO DANESHMAND ET AL. (2018)

Theorem 2 in Daneshmand et al. (2018) states that, for CNC-SGD to find an $(\epsilon,\rho^{1/2}\epsilon)$-second-order stationary point, the total number of iterations is $T=O\big(\frac{\ell^{10}L^3}{\delta^4\gamma^4}\log^2\big(\frac{\ell L}{\epsilon\delta\gamma}\big)\epsilon^{-10}\big)$, where $\ell$ is the bound on the stochastic gradient norm, $\|g_t\|\le\ell$, which can be viewed as the counterpart of $c_m$ in our paper. Translating their result to finding an $(\epsilon,\epsilon)$-second-order stationary point, it becomes $T=O\big(\frac{\ell^{10}L^3\rho^5}{\delta^4\gamma^4}\log^2\big(\frac{\rho\ell L}{\epsilon\delta\gamma}\big)\epsilon^{-10}\big)$. On the other hand, using the parameter values in Table 3, we have $T=2T_{thred}\big(f(w_0)-\min_wf(w)\big)/(\delta F_{thred})=O\big(\frac{(1-\beta)c_m^9L^3\rho^3(\sigma^2)^3c_h^2c'}{\delta^4\gamma^4}\log\big(\frac{Lc_m\sigma^2c'c_h}{(1-\beta)\delta\gamma\epsilon}\big)\epsilon^{-10}\big)$ for Algorithm 2.

Before making a comparison, we note that their result does not have a dependency on the variance of the stochastic gradient (i.e., $\sigma^2$), because they assume that the variance is also bounded by the constant $\ell$ (as can be seen from (86) in the supplementary material of their paper, where the variance terms $\zeta_i$ are bounded by $\ell$). Following their treatment, if we also assume that $\sigma^2\le c_m$, then in (71) we can replace $(\sigma^2+3c_m^2)$ with $4c_m^2$, and (72) becomes $1\ge\frac{576c_m^3\rho c_r\epsilon}{(1-\beta)^3}$. This removes all the dependency of the parameters on $\sigma^2$. Now, comparing $O\big(\frac{(1-\beta)c_m^9c_h^2c'\rho^3L^3}{\delta^4\gamma^4}\epsilon^{-10}\big)$ of ours with $T=O\big(\frac{\rho^2\ell^{10}\rho^3L^3}{\delta^4\gamma^4}\epsilon^{-10}\big)$ of Daneshmand et al. (2018), we see that in the high-momentum regime where $(1-\beta)\ll\frac{\rho^2\ell^{10}}{c_m^9c_h^2c'}$, Algorithm 2 is strictly better than that of Daneshmand et al. (2018), which means that a higher momentum can help to find a second-order stationary point faster.
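To make the comparison in G.3 concrete, the following sketch evaluates the ratio of the two iteration bounds; the shared factor $\frac{\rho^3L^3}{\delta^4\gamma^4}\epsilon^{-10}$ and the logarithmic terms cancel, so only the constants and $(1-\beta)$ remain. The numbers used for $c_m$, $c_h$, $c'$, $\rho$, and the gradient bound $\ell$ are hypothetical placeholders; the point is only that the ratio shrinks linearly in $1-\beta$.

```python
# Illustrative ratio of the two iteration bounds from Section G.3 (placeholder constants).
c_m, c_h, c_prime, rho, ell = 2.0, 1.5, 4.0, 1.0, 2.0

def ratio(beta):
    # T_ours / T_CNC-SGD  ~  (1 - beta) * c_m^9 * c_h^2 * c' / (rho^2 * ell^10)
    return (1 - beta) * c_m**9 * c_h**2 * c_prime / (rho**2 * ell**10)

for beta in (0.0, 0.5, 0.9, 0.99):
    print(beta, ratio(beta))   # shrinks linearly in (1 - beta): the high-momentum regime wins
```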