# SGD WITH A CONSTANT LARGE LEARNING RATE CAN CONVERGE TO LOCAL MAXIMA

Published as a conference paper at ICLR 2022

Liu Ziyin¹, Botao Li², James Simon³, & Masahito Ueda¹
¹The University of Tokyo
²ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris
³University of California, Berkeley

Previous works on stochastic gradient descent (SGD) often focus on its success. In this work, we construct worst-case optimization problems illustrating that, outside the regimes that previous works often assume, SGD can exhibit many strange and potentially undesirable behaviors. Specifically, we construct landscapes and data distributions such that (1) SGD converges to local maxima, (2) SGD escapes saddle points arbitrarily slowly, (3) SGD prefers sharp minima over flat ones, and (4) AMSGrad converges to local maxima. We also realize these results in a minimal neural-network-like example. Our results highlight the importance of simultaneously analyzing the minibatch sampling, the discrete-time update rules, and realistic landscapes to understand the role of SGD in deep learning.

1 INTRODUCTION

SGD is the main optimization algorithm for training deep neural networks, and understanding SGD is widely regarded as a key step on the path to understanding deep learning (Bottou, 2012; Zhang et al., 2018; Xing et al., 2018; Mori et al., 2021; Du et al., 2018; Allen-Zhu et al., 2018; Wojtowytsch, 2021a;b; Gower et al., 2021; Ziyin et al., 2022b; Gurbuzbalaban et al., 2021; Zou et al., 2021; Li et al., 2021; Feng and Tu, 2021). Despite its algorithmic simplicity (it can be described by two lines of equations), SGD is hard to understand. The difficulty is threefold: (1) SGD is discrete-time in nature, and discrete-time dynamics are typically much harder to solve than their continuous-time counterparts (May, 1976); (2) the SGD noise is state-dependent (Ziyin et al., 2022b; Hodgkinson and Mahoney, 2020); and (3) the loss landscape can be nonlinear and non-convex, involving many local minima, saddle points, and degeneracies (directions in which the Hessian is not full-rank) (Xing et al., 2018; Wu et al., 2017). Each of these difficulties is challenging enough that very few works attempt to deal with all of them simultaneously. Most previous works on SGD are limited to the case where the loss landscape is strongly convex (Ziyin et al., 2022b; Liu et al., 2021; Hoffer et al., 2017; Mandt et al., 2017) or where the noise is assumed to be Gaussian and time-independent (Zhu et al., 2019; Jastrzebski et al., 2017; Xie et al., 2021); the works that do tackle SGD in a non-convex setting often assume strong conditions. This reliance on strong assumptions for each of the challenges means that our present understanding of SGD for deep learning could be speculative. This work aims to examine some commonly held presuppositions about SGD and to show that, when all three challenging factors are taken together, many counter-intuitive phenomena may arise. Since most of these phenomena are potentially undesired, this work can also be seen as a worst-case analysis of SGD. We study the behavior of SGD in toy landscapes with non-convex losses and multiple minima, and we approach the problem of SGD convergence from a different angle than many related works: instead of studying when SGD will converge, our results help answer the question of when SGD might fail.
In particular, the problem setting considers discrete-time SGD close to saddle points, where the noise is due to minibatch sampling and the learning rate is held constant throughout training. In the next section, we define the SGD algorithm and the necessary notation. In Sec. 3, we discuss related works. A warm-up example is provided in Sec. 4. Sec. 5 introduces the main results. Sec. 6 presents the numerical simulations, including a minimal example involving a neural network. We also encourage the reader to examine the appendix: Sec. A presents additional experiments, Sec. B presents a continuous-time analysis of the problems in the main text (relevant for discussing the unique discrete-time features of SGD), and Sec. C presents all the proofs.

Figure 1: SGD converges to a local maximum for the example studied in Sec. 5.1. Left: Distribution of w on the quadratic potential, $\hat L = xw^2/2$; the histograms in different colors show the distribution of model parameters at different time steps. Middle: Distribution on the fourth-order potential, $\hat L = xw^2/2 + w^4/4$. Right: Escape probability as a function of a and λ. The parameter space is divided into an absorbing phase where w is attracted to the local maximum (in dark blue) and an active phase where w successfully escapes the two central bins (in white).

2 BACKGROUND

Notation and Terminology. We use $\lambda$ to denote the learning rate and $w$ the model parameter. $\hat L(w; x)$, with the caret symbol, denotes the sampled loss function, which is a function of the data point, and $L = \mathbb{E}_x[\hat L]$ is the loss landscape averaged over the data distribution; for notational conciseness, we often hide the dependence of $\hat L$ on $x$ when the context is clear. The lowercase $t$ denotes the time step of optimization. The terms loss, objective, and potential are used interchangeably.

We now introduce the minibatch SGD algorithm. The objective of SGD can be defined as a pair $(\hat L, p(x))$ such that $\hat L$ is a differentiable function and $p(x)$ is a probability density. We aim to find the minimizer of the following differentiable objective:

$$L(w) = \mathbb{E}_{x \sim p(x)}[\hat L(w; x)], \qquad (1)$$

where $x$ is a data point drawn from the distribution $p(x)$ and $w \in \mathbb{R}^D$ is the model parameter. The gradient descent (GD) algorithm for $L$ is defined by the update rule $w_t = w_{t-1} - \lambda \nabla_w L(w_{t-1})$ for some randomly initialized $w_0$, where $\lambda$ is the learning rate. The following definition defines SGD.

Definition 1. The minibatch SGD algorithm computes the update to the parameter $w$ with the following equation: $w_t = w_{t-1} - \lambda \hat g_t$, where $\hat g_t = \nabla_w \hat L(w_{t-1}; x_t)$ and $x_t$ is drawn from $p(x)$ such that $x_i$ is independent of $x_j$ for $i \neq j$.
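As a concrete reference for Definition 1, the following minimal Python sketch implements one-sample minibatch SGD. The quadratic example objective and the Gaussian sampler at the bottom are our own illustrative choices standing in for the abstract pair $(\hat L, p(x))$; they are not part of the paper.

```python
import numpy as np

def sgd(grad_L_hat, sample_x, w0, lr, steps, rng):
    """Minibatch SGD per Definition 1: w_t = w_{t-1} - lr * grad L_hat(w_{t-1}; x_t)."""
    w = w0
    for _ in range(steps):
        x = sample_x(rng)                 # x_t ~ p(x), i.i.d. across steps
        w = w - lr * grad_L_hat(w, x)
    return w

# Example objective (our choice): L_hat(w; x) = x * w^2 / 2, so its gradient is x * w.
rng = np.random.default_rng(0)
w_final = sgd(lambda w, x: x * w, lambda r: r.normal(1.0, 1.0),
              w0=0.5, lr=0.1, steps=1000, rng=rng)
print(w_final)
```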
3 RELATED WORKS

Escaping Saddle Points. Because neural networks are highly nonlinear, the landscape of neural networks is believed to contain many local minima and saddle points (Du et al., 2017; Kleinberg et al., 2018; Reddi et al., 2018b). It is thus crucial for the optimization algorithm to overcome saddle points efficiently and to escape suboptimal minima. However, the majority of works on escaping saddle points assume no stochasticity or only artificially injected noise (Li et al., 2017; Du et al., 2017; Jin et al., 2017; Ge et al., 2015; Reddi et al., 2018b; Lee et al., 2016; Pemantle, 1990). Some physics-inspired approaches take the stochasticity into account but often rely on continuous-time approximations and nontrivial assumptions about the stationary distribution of the parameters (Mori et al., 2021; Xie et al., 2021; Liu et al., 2021). In the context of stochastic non-convex optimization, one setting in which SGD is known to escape saddles and converge well is when the learning rate decreases to zero during training (Pemantle, 1990; Vlaski and Sayed, 2019; Mertikopoulos et al., 2020). In contrast, we only consider the case where the learning rate is held fixed. We stress that our result does not contradict these previous results; on the contrary, it reinforces the idea that the learning rate needs to be carefully chosen and scheduled when convergence is an essential concern.

Role of Minibatch Noise. One indispensable aspect of SGD is the stochasticity originating from minibatch sampling (Wu et al., 2020; Ziyin et al., 2022b; Zhu et al., 2019; Hodgkinson and Mahoney, 2020), whose strength and structure are determined by the model architecture and the data distribution. Moreover, models trained with SGD have been observed to significantly outperform models trained directly with GD (Hoffer et al., 2017). Understanding the role of minibatch noise is therefore of both fundamental and practical value. However, previous works focus on analyzing the behavior of SGD on the landscape specified by L; the unique noise structures of SGD are often treated in an oversimplified manner, without reference to the actual models in deep learning or the actual data distributions. For example, Daneshmand et al. (2018) assume that the noisy gradient is negatively correlated with the direction of the smallest eigenvalue of the Hessian, an assumption whose validity is unclear when the Hessian is not full rank. Kleinberg et al. (2018) assume that the noise in SGD makes the landscape one-point strongly convex, which implies that the loss landscape before the convolution is already close to a strongly convex landscape. Another commonly assumed condition is that the loss landscape satisfies the Łojasiewicz condition, which is non-convex but implies the unrealistic property that there is no local minimum other than the global minimum (Karimi et al., 2020; Wojtowytsch, 2021a; Vaswani et al., 2019); this is not sufficient to understand the complicated dynamics that often arise in settings with many minima (for example, a deep linear net (Ziyin et al., 2022a)). In Sec. 6.3, we show that these assumptions are violated for a minimal nonlinear neural network with two layers and one hidden neuron. While our example may or may not be directly relevant to realistic settings in deep learning practice, our work highlights the importance of analyzing the actual noise structure imposed by a deep neural network in future works.

Large Learning Rate Regime. Recently, it has been realized that networks trained at a large learning rate perform better than networks trained with a vanishing learning rate (lazy training) (Chizat and Bach, 2018). Lewkowycz et al. (2020) show that there is a qualitative difference between the lazy training regime and the large learning rate regime: the testing accuracy features two plateaus across the two regimes, with the large learning rate regime performing much better. However, almost no previous theory exists for understanding SGD at a large learning rate.
Our work is also relevant for understanding what happens at a large learning rate.

4 A WARM-UP EXAMPLE

This section studies a special case of the main results of our work. While the setting of this example is restricted, it captures the essential features and mechanisms of SGD that we will use to prove the main results. Consider the following loss function:

$$\hat L(w; x) = \frac{1}{2} x w^2, \qquad (2)$$

where $x, w \in \mathbb{R}$ and $x$ is drawn from an underlying distribution $p(x)$ at every time step. We let $\mathbb{E}[x] = a/2$ and assume that $\mathrm{Var}[x] = \sigma^2$ is finite. This gives rise to a true, sample-averaged loss of $L(w) = \frac{a}{4} w^2$. The stationary point of this loss function is $w = 0$. When $a > 0$, the underlying (deterministic) landscape has a minimum at $w = 0$; this case is now well understood in the discrete-time setting with state-dependent noise (Liu et al., 2021). When $a < 0$, the point $w = 0$ is a local maximum; one hopes that the dynamics escape to infinity, and our goal is to understand the behavior of SGD in this case. The following proposition and proof show that, for a bounded two-point distribution $p(x)$, SGD cannot escape from the local maximum when the learning rate is set to 1.

Proposition 1. Let $\lambda = 1$ and $a < 0$, and let $p(x)$ be the distribution such that $x = 1$ and $x = a - 1$ with equal probability. Then $w$ converges to $w = 0$ with probability 1 under the SGD algorithm, independent of the initialization.

Proof. By the definition of the SGD algorithm, we have

$$w_{t+1} = \begin{cases} 0 & \text{with probability } 1/2; \\ (2-a)\, w_t & \text{otherwise.} \end{cases} \qquad (3)$$

Therefore, after $t$ time steps, $w_t$ has at most probability $2^{-t}$ of being non-zero. As $t \to \infty$, $w_t = 0$ with probability 1.

This simple toy example illustrates a few important points. First of all, it suggests that one cannot use the expected value of $w$ to study the escape problem of SGD. The expected value of $w_t$ is

$$\mathbb{E}[w_t] = (1 - a/2)\,\mathbb{E}[w_{t-1}] = (1 - a/2)^t\, w_0 = w_0\, e^{t \ln(1 - a/2)}, \qquad (4)$$

which is an exponentially fast escape from any initialization $w_0$ for $a < 0$. In fact, $\ln(1 - a/2)$ is exactly the escape rate of the GD algorithm at this learning rate. This is counter-intuitive: SGD converges to the local maximum with probability 1, even though its expected value escapes the local maximum exponentially fast. Alternatively, one might hope to use the expected loss $\mathbb{E}_{w_t}[L(w_t)]$ or the expected squared norm $\mathbb{E}[\|w_t\|_2^2]$ as the metric of escape, but it is not hard to see that they suffer from the same problem. At time $t$, because $w_t = 0$ with probability $1 - 2^{-t}$ and $w_t = (2-a)^t w_0$ with probability $2^{-t}$, we have $\mathbb{E}[L(w_t)] \propto w_0^2\, e^{2t \ln \frac{2-a}{\sqrt{2}}}$, which also predicts an exponentially fast escape. Again, this fails to reflect the fact that SGD has only a vanishingly small probability of escaping the constructed local maximum. Statistically speaking, the norms of $w_t$ fail to be good metrics of the escape rate because they are not robust against outliers. In our example, SGD has only probability $2^{-t}$ of escaping the local maximum, yet the speed at which this outlier escapes outweighs the decay of its probability through time, so it contributes more to the expected norm than any other event. This example thus shows the difficulty and subtlety of dealing with the escape problem. This example (together with Sec. 5.1) also emphasizes the importance of showing convergence in probability in future works on non-convex optimization, because convergence in expectation is not sufficient to guarantee good empirical performance.
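A minimal simulation of Proposition 1 makes the contrast concrete: essentially every run collapses exactly to w = 0 within a few dozen steps, while the analytical expectation of Eq. (4) explodes. The script below is our own illustrative sketch of the construction (λ = 1, x ∈ {1, a − 1} with equal probability), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
a, lam, steps, runs = -1.0, 1.0, 30, 100_000

w = np.ones(runs)                                # w0 = 1 for every run
for _ in range(steps):
    x = rng.choice([1.0, a - 1.0], size=runs)    # x_t ~ p(x), i.i.d.
    w = w - lam * x * w                          # SGD step on L_hat = x w^2 / 2

print("fraction exactly at w = 0:", np.mean(w == 0.0))     # ~1, as in Prop. 1
print("E[w_t] from Eq. (4):      ", (1 - a / 2) ** steps)  # exponentially large
```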
The results in the rest of this work can be seen as extensions of this special case to less restrictive settings.

5 MAIN RESULTS

In this section, we present our four main theoretical results: (1) SGD may converge to a local maximum; (2) SGD may escape a saddle point arbitrarily slowly; (3) SGD may prefer sharp minima over flat ones; and (4) AMSGrad may converge to a local maximum.

5.1 SGD MAY CONVERGE TO A LOCAL MAXIMUM

We show that there are cases where SGD fails to escape a saddle point that GD escapes successfully; this contradicts the suggestion in Kleinberg et al. (2018) that SGD always escapes a local minimum faster than GD. It is appropriate to begin with the definition of a saddle point.

Definition 2. $w^*$ is said to be a stationary point of $L(w)$ if $\nabla_w L(w^*) = 0$. $w^*$ is a local minimum of $L$ if, for all $w$ in a sufficiently small neighborhood of $w^*$, $L(w^*) \le L(w)$. $w^*$ is said to be a saddle point if it is a stationary point but not a local minimum.

This definition of saddle points, which includes local maxima, agrees with the common definition in the literature (Daneshmand et al., 2018). The following proposition generalizes our warm-up example to a family of 1D problems in which SGD converges to a local maximum.

Proposition 2. Let Eq. (2) be the loss function, let $\lambda > 0$, and let $p(x)$ be such that $\mu = \mathbb{E}[\ln|1 - \lambda x|]$ and $s^2 = \mathrm{Var}[\ln|1 - \lambda x|]$ are finite; then SGD converges to the local maximum in probability if $\mu < 0$.

Proof Sketch. We first show that $\frac{1}{t}\ln|w_t / w_0|$ obeys a normal distribution as $t \to \infty$. This allows us to deduce the cumulative distribution function (c.d.f.) of $w$ as $t \to \infty$. The c.d.f. has a bifurcative dependence on the sign of $\mu = \mathbb{E}[\ln|1 - \lambda x|]$. When $\mu < 0$, the distribution of $w_t$ converges to 0 in probability. See Sec. C.2 for the detailed proof.

Corollary 1. Let Eq. (2) be the loss function and $p(x) = \frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x - a + 1)$; then SGD converges to the local maximum in probability if

$$\frac{a}{a-1} < \lambda < \frac{a - \sqrt{a^2 - 8a + 8}}{2(a-1)}. \qquad (5)$$

One special feature of this example is that the state-dependent noise vanishes at the saddle point. In this particular example, this is achieved by having SGD noise proportional to the gradient, thus vanishing at the saddle point. This type of structure may be relevant for deep learning because vanishing noise at a stationary point and state-dependent noise can appear at the origin of a deep net, where the weights of all layers are zero. Note that the convergence is independent of the actual shape of the distribution $p(x)$, and so may apply to many kinds of distributions besides the Bernoulli distribution we considered.
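As a sanity check on Proposition 2 and Corollary 1, one can evaluate µ = E[ln|1 − λx|] for the two-point distribution directly and compare its sign with the interval in Eq. (5). The sketch below is our own illustration; the value a = −1 is an arbitrary choice.

```python
import numpy as np

def mu(lam, a):
    """mu = E[ln|1 - lam * x|] for p(x) = (delta(x - 1) + delta(x - a + 1)) / 2."""
    return 0.5 * (np.log(abs(1 - lam)) + np.log(abs(1 - lam * (a - 1))))

a = -1.0
lower = a / (a - 1)                                      # left edge of Eq. (5)
upper = (a - np.sqrt(a**2 - 8*a + 8)) / (2 * (a - 1))    # right edge of Eq. (5)
for lam in (0.9 * lower, 0.5 * (lower + upper), 1.1 * upper):
    print(f"lam = {lam:.3f}, mu = {mu(lam, a):+.4f}, trapped = {mu(lam, a) < 0}")
```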
There are three different regimes/phases for this simple escape problem (see also Fig. 1-Right). The escape regime $\lambda > \frac{a - \sqrt{a^2 - 8a + 8}}{2(a-1)}$ is also present when $a > 0$; in this regime, SGD escapes from $w = 0$ even when it is a local minimum. This regime does not appear in a continuous-time analysis (Sec. B), showing that it is due to the instability of the discrete-time SGD algorithm. This kind of escape is undesirable because, in this regime, the local gradient cannot provide any guidance for minimizing the loss. The region $\frac{a}{a-1} < \lambda < \frac{a - \sqrt{a^2 - 8a + 8}}{2(a-1)}$ is the trapped regime. This is due to the special structure of the minibatch noise: a comparison with the continuous-time analysis suggests that SGD will not be trapped if the noise is not position-dependent (Sec. B.1). The small learning rate regime $\lambda < \frac{a}{a-1}$ is the successful escape regime, because the SGD noise is of order $\lambda^2$ and becomes negligible at small $\lambda$ (Liu et al., 2021). This regime corresponds to the lazy learning regime of training: with negligible noise and vanishing gradient, SGD is known to optimize neural networks well (Du et al., 2018). This discussion also justifies our later choice of placing an upper bound on $\lambda$ and focusing only on the second and third regimes in Secs. 5.3 and 5.4.

It is important to understand why previous guarantees of convergence do not apply to this particular example, for example, that of Ge et al. (2015). The SGD algorithm is not guaranteed to escape the saddle point in this example because two crucial assumptions of Ge et al. (2015) are violated: (1) the gradient does not have a bounded fluctuation around its mean (so even parameters very far from the saddle can be brought back close to it by the noise), and (2) we do not inject additive noise into the gradient, and the actual noise decreases as we approach the saddle point (so the closer we are to the saddle point, the less likely escape becomes). The standard theoretical guarantee of convergence in Ge et al. (2015) thus does not apply to this example. This shows that the minibatch noise of SGD is very special and can strongly influence convergence. A continuous-time analysis of this example in Sec. B shows that the convergence to local maxima occurs when the minibatch noise dominates the gradient, namely, when the signal-to-noise ratio becomes smaller than 1. This suggests that the signal-to-noise ratio can be an important parameter in SGD dynamics and may deserve more future research.

5.2 SGD MAY ESCAPE SADDLE POINTS ARBITRARILY SLOWLY

In the previous section, we showed that one can adversarially construct a loss function, for a fixed learning rate, such that the SGD algorithm converges to a local maximum. This section shows that, even if one can tune the learning rate at will, there is a loss landscape on which SGD takes an arbitrarily long time to escape, no matter how the learning rate is chosen. We begin by defining the escape rate.

Definition 3. The asymptotic average escape rate of $w$ at learning rate $\lambda$ is defined as $\gamma(\lambda) = \lim_{t \to \infty} \frac{1}{t}\, \mathbb{E}_{w_t}[\ln |w_t|]$. The optimal escape rate is defined as $\gamma^* = \sup_\lambda \gamma(\lambda)$.

Remark. This definition can be seen as a generalization of the Lyapunov exponent to a probabilistic setting. We note that this definition of the escape rate differentiates this work from the standard literature on escaping saddles, where the focus is on the time scale for reaching a local minimum when saddle points exist (Ge et al., 2015; Daneshmand et al., 2018; Jin et al., 2017); our definition focuses on the time scale of escaping a specified saddle point. These two time scales can be vastly different.

As discussed in the previous section, an undesired escape can be achieved by setting the learning rate arbitrarily large; however, this is not possible in practice, because the learning rate of SGD is inherently associated with the stability of learning, and $\lambda$ usually has to be much smaller than 1.
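The escape rate of Definition 3 is straightforward to estimate empirically: run many independent SGD trajectories and average ln|w_t|/t at a moderately large t (Appendix A.2 reports that such estimates stabilize within a few tens of steps). The following sketch is our own illustrative estimator for the two-point construction; the parameter values are arbitrary.

```python
import numpy as np

def escape_rate(lam, a, t=100, runs=20_000, seed=0):
    """Monte Carlo estimate of gamma(lam) = E[ln|w_t|] / t (Definition 3), with w0 = 1."""
    rng = np.random.default_rng(seed)
    w = np.ones(runs)
    for _ in range(t):
        x = rng.choice([1.0, a - 1.0], size=runs)
        w = w - lam * x * w
    return np.mean(np.log(np.abs(w) + 1e-300)) / t   # tiny offset guards log(0)

for lam in (0.1, 0.3, 0.5, 0.8):
    print(f"lam = {lam}: gamma ~ {escape_rate(lam, a=-1.0):+.4f}")
```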
Proposition 3. Let the loss function be Eq. (2) with $a < 0$, let $p(x) = \frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x - a + 1)$, where $\delta$ is the Dirac delta function, and let $\lambda \le 1$. Then, for any $\epsilon > 0$, there exists $a$ such that $\sup_\lambda \gamma(\lambda) < \epsilon$. Specifically, $\gamma^* = \ln\frac{2-a}{2\sqrt{1-a}}$, and, for any $\epsilon > 0$, $\gamma^* < \epsilon$ if $a \ge -\frac{2 e^\epsilon \sqrt{e^{2\epsilon} - 1}}{e^{2\epsilon} + 1}$. Namely, the SGD algorithm can escape the saddle point of Eq. (2) arbitrarily slowly.

Intuitively, one expects escape to become easier as the learning rate increases (Kleinberg et al., 2018), with the escape rate increasing monotonically in the learning rate. This example conveys the novel message that the escape efficiency is not monotonically increasing in the learning rate: here, the escape rate first decreases and only starts to increase when $\lambda$ is quite large. This example also shows the subtlety of the escape problem. On the one hand, one needs to avoid a too-large learning rate to avoid training instability. On the other hand, one cannot use a too-small learning rate, because a small learning rate also makes optimization slow. Thus, there should be a tradeoff between learning speed and learning stability, and one may construct a general theory for finding the learning rate that achieves the best tradeoff; however, this tradeoff problem is sufficiently complex on its own and is beyond the scope of the present work.

5.3 SGD MAY PREFER SHARPER MINIMA

Modern deep neural networks are often overparametrized and can easily memorize all the training data points. This means that traditional metrics such as the Rademacher complexity cannot be used to guarantee the generalization capability of SGD (Zhang et al., 2017). Nevertheless, neural networks are found to generalize very well. This inspired the hypothesis that SGD, the main training algorithm of neural networks, contains some implicit regularization effect that biases the neural network towards simpler solutions (Neyshabur et al., 2017; Soudry et al., 2018). One hypothesized mechanism of this regularization is that SGD selects flat minima over sharp ones (Hochreiter and Schmidhuber, 1997; Meng et al., 2020; Xie et al., 2021; Liu et al., 2021; Mori et al., 2021; Smith and Le, 2018; Wojtowytsch, 2021b). In this work, we show that this may not be the case, because the dynamics and convergence of SGD crucially depend on the underlying minibatch noise, while the sharpness of the landscape is independent of the noise. Before introducing the results, let us first define the sharpness of a local minimum.

Definition 4. Let $w^*$ be a local minimum of the loss function $L(w)$; the sharpness of $w^*$ is defined as $s(w^*) = \mathrm{Tr}[\nabla^2_w L(w^*)]$. We say that a minimum $w_1^*$ is sharper than $w_2^*$ if and only if $s(w_1^*) > s(w_2^*)$.

In this definition, the sharpness is the trace of the Hessian of the quadratic approximation at the local minimum. Other existing definitions, such as the determinant of the Hessian, are essentially similar to this one (Dinh et al., 2017). For an explicit construction, consider the following objective on a 2-dimensional landscape:

$$\hat L(w_1, w_2) = \frac{1}{2}\left[-(w_1 - w_2)^2 - (w_1 + w_2)^2 + (w_1 - w_2)^4 + (w_1 + w_2)^4 - 2a w_2^2 + x b w_1^2\right] \qquad (6)$$

for $a, b > 0$ and $p(x) = \frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x + 1)$. The diagonal terms of the Hessian of the averaged loss are

$$\nabla^2_{w_1} L(w_1, w_2) = -2 + 12 w_1^2 + 12 w_2^2; \qquad \nabla^2_{w_2} L(w_1, w_2) = -2 - 2a + 12 w_1^2 + 12 w_2^2. \qquad (7)$$

There are four local minima in total for this loss function. These minima and their sharpnesses are

$$(w_1, w_2) = \left(\pm\tfrac{1}{\sqrt{2}},\, 0\right) \text{ with } s = 4; \qquad (w_1, w_2) = \left(0,\, \pm\sqrt{\tfrac{1+a}{2}}\right) \text{ with } s = 4 + 4a. \qquad (8)$$

We see that for positive $a$, the minima at $(0, \pm\sqrt{(1+a)/2})$ are sharper than the other two minima. We show that SGD will converge to these sharper minima.
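To see the claim in action, one can run plain SGD on Eq. (6) and track where the iterates settle. The sketch below is our own illustration: λ = 0.05 matches the Appendix A.4 experiment, while a = 0.5 and b = 10 are our own choices (Proposition 4 only asserts that a suitable b exists), and the clipping mimics the boundedness assumption $|w_1|, |w_2| \le 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, lam, steps, runs = 0.5, 10.0, 0.05, 10_000, 2000

def grad(w1, w2, x):
    """Gradient of Eq. (6), written with u = w1 - w2 and v = w1 + w2."""
    u, v = w1 - w2, w1 + w2
    g1 = -u - v + 2 * u**3 + 2 * v**3 + x * b * w1
    g2 = u - v - 2 * u**3 + 2 * v**3 - 2 * a * w2
    return g1, g2

w1 = rng.uniform(-1, 1, runs)
w2 = rng.uniform(-1, 1, runs)
for _ in range(steps):
    x = rng.choice([1.0, -1.0], size=runs)
    g1, g2 = grad(w1, w2, x)
    w1 = np.clip(w1 - lam * g1, -1, 1)   # clip: Prop. 4 assumes |w_i| <= 1
    w2 = np.clip(w2 - lam * g2, -1, 1)

print("ended near sharp minima (w1 ~ 0):", np.mean(np.abs(w1) < 0.1))
print("ended near flat minima  (w2 ~ 0):", np.mean(np.abs(w2) < 0.1))
```

The mechanism is visible in the gradient: the noisy term $x b w_1$ vanishes exactly on the $w_1 = 0$ axis, so the sharper minima are quiet for SGD while the flatter ones are constantly kicked.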
For this example, we assume that $w_1$ and $w_2$ are bounded because, if initialized sufficiently far from the origin, the fourth-order term causes divergence.

Proposition 4. Let the loss function be Eq. (6) with $|w_1| \le 1$ and $|w_2| \le 1$. For any $\lambda < c$ for some constant $c = O(1)$, there exists some $b$ such that, if SGD converges in probability, it converges to the sharper minima at $(w_1, w_2) = (0, \pm\sqrt{(1+a)/2})$.

Remark. In this example, the noise is not full-rank, which is often the case in deep learning for overparametrized networks (Wojtowytsch, 2021a). For example, when weight decay is used, the Hessian of the loss function is full rank, while the rank of the noise covariance should be proportional to the inherent dimension of the data, which is in general much smaller than the dimension of the Hessian in deep learning (Ansuini et al., 2019). The proof is technical and given in appendix Sec. C.5.

In fact, it has been controversial whether finding a flat minimum helps generalization. For example, Dinh et al. (2017) show that, for every flat minimum of a ReLU-based net, there exists a minimum that is arbitrarily sharper and generalizes just as well. However, this does not rule out the possibility that, conditioned on using gradient-based optimization, the sharper minima found by gradient descent perform worse than the flatter minima it finds. If this assumption is valid, then biasing gradient descent towards flatter minima can indeed help, and the stochasticity of SGD has been hypothesized to provide such a bias. Our construction, however, shows that the stochasticity of SGD, on its own, may be incapable of biasing SGD towards flatter minima; at least some other assumptions need to be invoked to show that SGD helps. For example, the definition of neural networks may endow these models with a special kind of Hessian and noise structure that, when combined with SGD, results in a miraculous generalization behavior. However, no previous work has pursued this direction in sufficient depth to our knowledge, and future studies in this direction may be fruitful.

5.4 NON-CONVERGENCE OF ADAPTIVE GRADIENTS

Adam (Kingma and Ba, 2014) and its closely related variants such as RMSProp (Tieleman and Hinton, 2012) have been shown to converge to a local maximum even for some simple convex loss landscapes (Reddi et al., 2018a). The proposed fix, named AMSGrad, takes the maximum of all the previous preconditioners of Adam. This section shows that AMSGrad may also converge to a local maximum even in simple non-convex settings. Let $x_t \sim p(x)$ and $\hat g_t = \nabla \hat L(x_t, w_{t-1})$; the AMSGrad algorithm is given by

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \hat g_t; \qquad (9)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, \hat g_t^2; \qquad (10)$$
$$\hat v_t = \max(\hat v_{t-1}, v_t); \qquad (11)$$
$$w_t = w_{t-1} - \frac{\lambda}{\sqrt{\hat v_t}}\, m_t, \qquad (12)$$

with $v_0 = m_0 = 0$ (Adam is recovered by using $v_t$ in place of $\hat v_t$, i.e., by skipping Eq. (11)). We have removed the numerical smoothing constant $\epsilon$ from the denominator, which causes no problem if $w_0$ is initialized away from 0. Here, $\beta_1$ is the momentum hyperparameter, $v_t$ is called the preconditioner, and $\beta_2$ is its associated hyperparameter. The standard values are $\beta_1 = 0.9$ and $\beta_2 = 0.999$. In our theory, we only consider the case $\beta_1 = 0$; the experiment section shows that a similar problem exists when $\beta_1 > 0$ (with additional interesting findings). Intuitively, this is easy to understand: AMSGrad behaves like SGD asymptotically, so one only has to wait long enough for the results of the previous sections to apply. The construction below follows this intuition.
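The four updates translate directly into code. The sketch below is our own minimal implementation of Eqs. (9)-(12) (with the smoothing constant removed, as in the text), applied to the loss of Eq. (2); the values a = −0.1, λ = 0.2, and β₂ = 0.999 follow the experiment in Sec. 6.2, while everything else is our own illustration.

```python
import numpy as np

def amsgrad(sample_grad, w0, lr=0.2, beta1=0.0, beta2=0.999, steps=50_000, seed=0):
    """AMSGrad following Eqs. (9)-(12), without the smoothing epsilon."""
    rng = np.random.default_rng(seed)
    w, m, v, v_hat = w0, 0.0, 0.0, 0.0
    for _ in range(steps):
        g = sample_grad(w, rng)
        m = beta1 * m + (1 - beta1) * g        # Eq. (9)
        v = beta2 * v + (1 - beta2) * g * g    # Eq. (10)
        v_hat = max(v_hat, v)                  # Eq. (11); drop this line for Adam
        w = w - lr * m / np.sqrt(v_hat)        # Eq. (12)
    return w

# Sampled gradient of Eq. (2): g = x * w with x in {1, a - 1}, each with prob. 1/2.
a = -0.1
print(amsgrad(lambda w, r: r.choice([1.0, a - 1.0]) * w, w0=0.5))
# Per Prop. 5, for a suitable a < 0 the iterates approach the local maximum w = 0.
```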
Proposition 5. Let $w_t \in [-1, 1]$ and $w_0 \neq 0$. For fixed $\lambda < 1$ and the loss function in Eq. (2), there exists $a < 0$ such that the AMSGrad algorithm converges in probability to a local maximum.

Remark. The proof is given in Sec. C.6. Note that the example we construct is similar to the original construction in Reddi et al. (2018a), which shows that Adam converges to a maximum while AMSGrad succeeds in reaching the global minimum. In their example (with some rescaling), the gradient is a random variable

$$\hat g_t = \begin{cases} 1 & \text{with probability } q \ll 1; \\ c_0 < 0 & \text{with probability } 1 - q \approx 1, \end{cases} \qquad (13)$$

such that the expected gradient is negative, and Adam is shown to converge in the direction opposite to gradient descent (therefore to a maximum). In our example, the gradient is (roughly)

$$\hat g_t = \begin{cases} w_{t-1} & \text{with probability } 1/2; \\ (a - 1)\, w_{t-1} & \text{with probability } 1/2. \end{cases} \qquad (14)$$

Therefore, our example can be seen as a minimal generalization of the example of Reddi et al. (2018a) to a non-convex setting.

6 EXPERIMENTAL DEMONSTRATIONS

We perform experiments to illustrate the examples studied in this work. Due to space limitations, we illustrate the escape-rate example and the convergence-to-sharper-minima example in Appendix Sec. A.

6.1 SGD CONVERGES TO A LOCAL MAXIMUM

Numerical results are obtained for the setting described in Sec. 5.1; see Fig. 1-Left. The loss landscape is defined by $L(w) = \frac{1}{4} a w^2$. In this numerical example, we set $\lambda = 0.8$ and $a = -1$, and the histogram is plotted over 2000 independent runs. We see that the distribution converges to the local maximum at $w = 0$, as the theory predicts. For better qualitative understanding, we also plot the empirical phase diagram of this setting in Fig. 1-Right. We perform numerical evaluations for different combinations of $a$ and $\lambda$ in the $a$-$\lambda$ plane, and for each combination, we plot the percentage of $w$ that escapes the central bin (white is 100% and dark blue is 0%). For completeness, we also plot the case $a > 0$, i.e., when the critical point at $w = 0$ is a local minimum. The orange line shows where the phase transition is expected to happen according to Eq. (5). The fact that the orange line agrees exactly with the empirical phase transition line confirms our theory.

We also numerically study a closely related fourth-order loss landscape, defined as $\hat L(w) = \frac{1}{2} x w^2 + \frac{1}{4} w^4$, where the distribution $p(x)$ is the same as before. The expected loss $L(w)$ has two local minima located at $w = \pm\sqrt{-a/2}$. We are interested in whether SGD can converge to these two points successfully; note that, without the fourth-order term, this loss is the same as the quadratic loss we studied. See Fig. 1-Middle. As before, $a = -1$ and $\lambda = 0.8$. We see that the distribution of $w$ also concentrates at the local maximum after training, and no $w$ is found to converge to the two local minima, even though a significant proportion of the runs is initialized close to these two minima. A phase diagram analysis is given in Sec. A.1.

Figure 2: AMSGrad converges to the local maximum, with or without momentum, in the example studied in Sec. 5.4. Our result shows that AMSGrad is always attracted to the local maximum, while Adam with momentum has the potential of escaping the local maximum. Left: without momentum. Right: with momentum.

6.2 NON-CONVERGENCE OF AMSGRAD

In this section, we illustrate the convergence behavior of AMSGrad described in Sec. 5.4; for reference, we also plot the behavior of Adam and gradient descent in this example. See Fig. 2. The experiment is an average over 2000 runs, each with $5 \times 10^4$ steps.
The uncertainty is reflected by the shaded region (almost invisibly small). The loss function is the same as discussed in Sec. 5.4, with $a = -0.1$. For illustration purposes, we set $\lambda = 0.2$ and $\beta_2 = 0.999$ for both Adam and AMSGrad. When momentum is used, we set $\beta_1 = 0.9$. GD is run with a learning rate of 0.01. Without momentum, both Adam and AMSGrad converge to the local maximum at almost the same speed. When momentum is added, AMSGrad still converges to the local maximum. Curiously, Adam is then no longer attracted to the local maximum, but it also remains away from the local minimum. This suggests that Adam with momentum might have a better capability of escaping saddle points than AMSGrad and might be preferable to AMSGrad in a non-convex setting.

6.3 A TOY NEURAL NETWORK EXAMPLE

We have now illustrated several SGD phenomena in artificial settings, inspiring the natural question of whether they occur for an actual neural network. Here we construct a neural-net-like optimization problem to show that these convergence problems might indeed arise for a neural network. The toy neural network function is given by $f(x) = w_2\, \sigma(w_1 x)$, where $w_1, w_2 \in \mathbb{R}$, $\sigma$ is the nonlinearity, and we let $\sigma(u) = u^2$ as a minimal example. This is the simplest kind of nonlinear feedforward network one can construct (2 layers with a single hidden neuron). We also pick a minimal dataset with a single data point: $x = 1$ with probability 1 and $y \in \{-1, 2\}$, each with probability 0.5; i.e., the label has a degree of inherent uncertainty, as is often the case in real problems. We use the mean squared error (MSE) as the loss function. Therefore, the expected loss is $L(w_1, w_2) = \frac{1}{2}(w_2 w_1^2 + 1)^2 + \frac{1}{2}(w_2 w_1^2 - 2)^2$. The global minimum $L = 2.25$ is degenerate and is achieved when $w_1 = \pm 1/\sqrt{2 w_2}$. There is also a manifold of saddle points given by $\{(w_1, w_2) \mid w_1 = 0,\, w_2 \neq 0\}$, all with $L = 2.5$. Despite the simplicity of this example, it contains a few realistic features of practical problems, including (1) inherent uncertainty in the data, (2) a nonlinear hidden layer, and (3) a degenerate minimum with zero eigenvalues in the Hessian.

Figure 3: Convergence of a two-layer one-neuron neural network to a saddle point. The blue region shows the empirical density of the converged parameter distribution. Left: λ = 0.001 at step 10000 converges to the global minima. Middle: λ = 0.1 at step 10000 converges to a saddle point. Right: Average loss in equilibrium as a function of the learning rate. The loss diverges for learning rates larger than 0.108.

More importantly, most previous works on the convergence or escaping behavior of SGD are inapplicable to this example. One dominant assumption is the ρ-Hessian Lipschitz property (Jin et al., 2017; Ge et al., 2015), which does not hold for this example. Another common assumption is the PL condition (Karimi et al., 2020; Wojtowytsch, 2021a; Vaswani et al., 2019), which does not apply due to the existence of the saddle points. One recent assumption is that the loss function is one-point convex at all points (Kleinberg et al., 2018), which is also ruled out because $(0, 0)$ is not one-point strongly convex. Another relevant assumption is the correlated negative curvature assumption, which also does not hold at $(0, 0)$. We explain in Sec. A.5 in detail why these conditions are violated.
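The toy model is simple enough to reproduce in a few lines. The following sketch is our own illustration of the Sec. 6.3 setup ($f(x) = w_2\sigma(w_1 x)$ with $\sigma(u) = u^2$, $x = 1$, $y \in \{-1, 2\}$), contrasting a small and a large learning rate as in Fig. 3; the per-sample loss is taken as $(f - y)^2$, which averages to the $L$ stated above.

```python
import numpy as np

def train(lam, steps=10_000, runs=1000, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.uniform(-1.0, 1.0, runs)   # initialization used in the paper
    w2 = rng.uniform(0.0, 1.0, runs)
    for _ in range(steps):
        y = rng.choice([-1.0, 2.0], size=runs)   # noisy label, x = 1
        r = w2 * w1**2 - y                       # residual f - y
        w1, w2 = w1 - lam * 4 * r * w1 * w2, w2 - lam * 2 * r * w1**2
    z = w2 * w1**2
    return np.mean(0.5 * (z + 1) ** 2 + 0.5 * (z - 2) ** 2)   # expected loss L

print("small lr:", train(0.001))   # ~2.25: the global-minimum valley
print("large lr:", train(0.1))     # ~2.5:  the saddle manifold w1 = 0
```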
See Fig. 3 for the experimental results with this example. Here, $w_1$ is initialized uniformly in $[-1, 1]$; $w_2$ is initialized uniformly in $[0, 1]$ to be closer to the global minimum and away from the saddle point (a standard initialization such as $[-1, 1]$ does not change the conclusion). The left panel shows the stationary distribution of the model parameters at a small learning rate ($\lambda = 0.001$). All the parameters are located in the global-minimum valley, as expected. In contrast, when the learning rate is large ($\lambda = 0.1$), the central panel shows that all models (1000 independent runs) converge to the saddle point at $(0, 0)$. The right panel investigates this change more systematically, showing the average stationary training loss as the learning rate increases from 0.001 to 0.15. We see that for a small learning rate, the training loss is close to that of the global minima ($L = 2.25$), and for a significant range of large learning rates, the model invariably converges to a saddle point ($L = 2.5$). One additional interesting observation is that the model diverges at $\lambda \approx 0.11$, and there is almost no warning sign of this divergence (such as increased fluctuation) as the learning rate approaches the divergence threshold.

This example shows the relevance of our results to the study of neural networks. In fact, it has often been noticed that, at convergence, many large-scale neural networks in real tasks exhibit negative eigenvalues in the Hessian (Alain et al., 2019; Granziol et al., 2019). The existence of negative eigenvalues at late times suggests either that the algorithm has yet to escape or that it has indeed been attracted to such saddle points; if the latter is true, then our work offers an explanation. The emergence of negative eigenvalues at the end of training thus serves as indirect evidence that our results may be relevant for larger neural networks. It remains an important open question to prove (and identify the conditions for) or disprove the convergence of larger and deeper neural networks to a local maximum.

7 DISCUSSION

We have shown that when the learning rate is not carefully chosen or scheduled, SGD can exhibit many undesirable behaviors, such as convergence to local maxima or saddle points. The limitation of our work is clear: all of our constructions are minimal, simplistic toy examples that are far from practice, and investigating whether the discovered messages are relevant for deep learning is an important immediate next step. Other relevant questions include: (1) Does convergence to saddle points help or hurt generalization? (2) If it hurts, how can we modify SGD to avoid saddle points better? We suspect that changing learning rates, changing the batch size, or injecting noise may help, but a convincing theoretical guarantee in a realistic setting is still lacking. (3) If saddle points do not generalize worse, our results motivate understanding why. This is especially relevant because we showed that using a large learning rate makes convergence to saddle points more likely, while previous works have shown that using a large learning rate can improve generalization; combined, this may imply that certain saddle points have intriguing but unknown regularization effects.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for providing detailed and constructive feedback on our draft. Ziyin thanks Jie Zhang for help during the writing of this manuscript.
Ziyin is financially supported by the GSS Scholarship from the University of Tokyo. BL acknowledges CNRS for financial support and Werner Krauth for all kinds of help. JS gratefully acknowledges support from the National Science Foundation Graduate Research Fellowship Program (NSF-GRFP) under grant DGE 1752814. This work was supported by KAKENHI Grant Numbers JP18H01145 and JP21H05185 from the Japan Society for the Promotion of Science.

REFERENCES

Alain, G., Roux, N. L., and Manzagol, P.-A. (2019). Negative eigenvalues of the Hessian in deep neural networks. arXiv preprint arXiv:1902.02366.

Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.

Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. (2019). Intrinsic dimension of data representations in deep neural networks.

Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.

Daneshmand, H., Kohler, J., Lucchi, A., and Hofmann, T. (2018). Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pages 1155–1164. PMLR.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR.

Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Poczos, B., and Singh, A. (2017). Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.

Feng, Y. and Tu, Y. (2021). The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima. Proceedings of the National Academy of Sciences, 118(9).

Ge, R., Huang, F., Jin, C., and Yuan, Y. (2015). Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842. PMLR.

Gower, R., Sebbouh, O., and Loizou, N. (2021). SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1315–1323. PMLR.

Granziol, D., Garipov, T., Zohren, S., Vetrov, D., Roberts, S., and Wilson, A. G. (2019). The deep learning limit: are negative neural network eigenvalues just noise? In ICML 2019 Workshop on Theoretical Physics for Deep Learning.

Gurbuzbalaban, M., Simsekli, U., and Zhu, L. (2021). The heavy-tail phenomenon in SGD. In International Conference on Machine Learning, pages 3964–3975. PMLR.

Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42.

Hodgkinson, L. and Mahoney, M. W. (2020). Multiplicative noise and heavy tails in stochastic optimization. arXiv preprint arXiv:2006.06293.

Hoffer, E., Hubara, I., and Soudry, D. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1729–1739.

Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2017). Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623.
Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. (2017). How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724–1732. PMLR.

Karimi, H., Nutini, J., and Schmidt, M. (2020). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Kleinberg, B., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698–2707. PMLR.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257. PMLR.

Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., and Gur-Ari, G. (2020). The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2017). Visualizing the loss landscape of neural nets. arXiv e-prints.

Li, Z., Malladi, S., and Arora, S. (2021). On the validity of modeling SGD with stochastic differential equations (SDEs). arXiv preprint arXiv:2102.12470.

Liu, K., Ziyin, L., and Ueda, M. (2021). Noise and fluctuation of finite learning rate stochastic gradient descent.

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18:1–35.

May, R. M. (1976). Simple mathematical models with very complicated dynamics. Nature, 261:459–467.

Meng, Q., Gong, S., Chen, W., Ma, Z.-M., and Liu, T.-Y. (2020). Dynamic of stochastic gradient descent with state-dependent noise. arXiv preprint arXiv:2006.13719.

Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. (2020). On the almost sure convergence of stochastic gradient descent in non-convex problems. arXiv preprint arXiv:2006.11144.

Mori, T., Ziyin, L., Liu, K., and Ueda, M. (2021). Logarithmic landscape and power-law escape rate of SGD. arXiv preprint arXiv:2105.09557.

Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071.

Pemantle, R. (1990). Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, 18(2):698–712.

Reddi, S., Kale, S., and Kumar, S. (2018a). On the convergence of Adam and beyond. In International Conference on Learning Representations.

Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., and Smola, A. (2018b). A generic approach for escaping saddle points. In International Conference on Artificial Intelligence and Statistics, pages 1233–1242. PMLR.

Smith, S. L. and Le, Q. V. (2018). A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5, RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Vaswani, S., Bach, F., and Schmidt, M. (2019). Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron.
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR.

Vlaski, S. and Sayed, A. H. (2019). Second-order guarantees of stochastic gradient descent in non-convex optimization.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. Citeseer.

Wojtowytsch, S. (2021a). Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. arXiv preprint arXiv:2105.01650.

Wojtowytsch, S. (2021b). Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis. arXiv preprint arXiv:2106.02588.

Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V., and Zhu, Z. (2020). On the noisy gradient descent that generalizes as SGD. In International Conference on Machine Learning, pages 10367–10376. PMLR.

Wu, L., Zhu, Z., and E, W. (2017). Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv e-prints.

Xie, Z., Sato, I., and Sugiyama, M. (2021). A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. In International Conference on Learning Representations.

Xing, C., Arpit, D., Tsirigotis, C., and Bengio, Y. (2018). A walk with SGD. arXiv preprint arXiv:1802.08770.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization.

Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., and Poggio, T. (2018). Theory of deep learning IIb: Optimization properties of SGD. arXiv preprint arXiv:1801.02254.

Zhu, Z., Wu, J., Yu, B., Wu, L., and Ma, J. (2019). The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In International Conference on Machine Learning, pages 7654–7663. PMLR.

Ziyin, L., Li, B., and Meng, X. (2022a). Exact solutions of a deep linear network. arXiv preprint arXiv:2202.04777.

Ziyin, L., Liu, K., Mori, T., and Ueda, M. (2022b). Strength of minibatch noise in SGD. In International Conference on Learning Representations.

Zou, D., Wu, J., Braverman, V., Gu, Q., and Kakade, S. M. (2021). Benign overfitting of constant-stepsize SGD for linear regression. arXiv preprint arXiv:2103.12692.

A ADDITIONAL EXPERIMENTS

A.1 PHASE DIAGRAM OF THE FOURTH-ORDER LOSS FUNCTION

With two proper local minima, the fourth-order loss function is a more realistic loss function than the quadratic one considered in Sec. 5.1. We also performed one experiment with it in the main text (see Fig. 1). Here, we plot its empirical phase diagram in Figure 4. We see that for this loss landscape as well, there is some region of $a$ such that, for all $\lambda$, SGD converges to the local maximum. This loss landscape is very difficult to study in discrete time, as it is known to lead to chaotic behavior at a large learning rate (May, 1976). Therefore, we instead try to understand this landscape using a continuous-time approximation; see Sec. B.

Figure 4: Escape probability from the local maximum as a function of a and λ for the fourth-order loss landscape. The parameter space is divided into an absorbing phase where w is attracted to the local maximum and an active phase where w escapes to infinity successfully.
The orange line is the theoretical phase transition line for the quadratic loss function. We see that when λ is small, the line based on the quadratic loss also agrees well with the fourth-order loss. This suggests that part of this result is universal and independent of the details of the loss function.

A.2 ESTIMATING THE ESCAPE RATE

In the escape-rate experiments, the empirical results are obtained from the proposed finite-time approximation. Here, we show that it is valid. See Fig. 5, where we plot the estimated γ as a function of the training step. We see that the estimated value converges within a few tens of steps. We therefore estimate the escape rates at time step 100 in the main text.

Figure 5: Escape rate γ as a function of the time step t, obtained for a = −1, showing that γ, defined in Equation (73), is indeed a well-defined quantity when t is large. γ becomes stable after roughly 40 steps.

A.3 ESCAPE RATE EXPERIMENTS

This section illustrates the slow-escape problem studied in Sec. 5.2. See Fig. 6. γ is calculated by averaging over the first 50 time steps across 2000 independent runs. We see that, as our theory predicts, as a gets closer to 0, the optimal escape rate decreases towards 0. This implies that there exists a landscape on which SGD is arbitrarily slow at learning, independent of parameter tuning. One additional observation is that the optimal learning rate is neither too large nor too small; i.e., there seems to be a tradeoff between the escape speed and the escape probability (and between speed and stability). See also the discussion at the end of Sec. 5.2.

Figure 6: Escape rate γ as a function of the learning rate λ for the quadratic loss landscape, obtained by simulations for several values of a (the line λ = a/(2(a − 1)) marks the predicted optimal learning rate). We note that the escape-rate analysis yields results compatible with the phase diagram.

Figure 7: Evolution of the distribution of w in a landscape where two flat minima and two sharp minima exist. As Proposition 4 shows, the model parameters converge to the sharper minima even if initialized in the flat minimum. Left: initialization. Middle: step 2. Right: step 10000.

A.4 CONVERGENCE TO THE SHARPER MINIMUM

We use the same 2-dimensional loss landscape defined in Sec. 5.3. The experiment is run with 2000 independent simulations and learning rate λ = 0.05. See Fig. 7, where we overlay the underlying landscape with the empirical distribution (the heat map). We see that, even though the initialization overlaps significantly with the flatter local minimum, all points converge to the sharper minimum, as the theorem predicts.

A.5 INADEQUACY OF STANDARD ASSUMPTIONS FOR THE NEURAL NETWORK EXAMPLE

In this section, we show that the standard assumptions in the previous literature do not hold for the example studied in Sec. 6.3. One common assumption is that the loss function is ρ-Hessian Lipschitz.

Definition 5. A loss function $L$ is said to be ρ-Hessian Lipschitz if, for some $\rho > 0$ and all $w, w'$,

$$\|\nabla^2 L(w) - \nabla^2 L(w')\| \le \rho\, \|w - w'\|. \qquad (15)$$

This certainly does not hold for the example we considered. Explicitly, at $w_2 = 0$, the Hessian $H(w)$ of the loss function has entries

$$H_{w_1, w_1} = H_{w_1, w_2} = H_{w_2, w_1} = 0 \qquad (16)$$

and

$$H_{w_2, w_2} = w_1^2. \qquad (17)$$

Letting $w_1 \to \infty$ therefore violates the ρ-Hessian Lipschitz property. Another common assumption for the non-convex setting is the PL condition.
Definition 6. Let $L^*$ be the value of $L$ at the global minimum. A loss function $L$ is said to satisfy the PL (Polyak-Łojasiewicz) condition if, for all $w$,

$$\|\nabla L(w)\|^2 \ge \mu \left(L(w) - L^*\right) \qquad (18)$$

for some $\mu > 0$.

This also does not hold, due to the existence of the saddle points. At $(w_1, w_2) = (0, 0)$, the gradient is zero, but the right-hand side is greater than 0.

Recently, the correlated negative curvature assumption has been proposed for studying the escaping behavior of SGD.

Definition 7. Let $v_w$ be the eigenvector corresponding to the minimum eigenvalue of the Hessian $H(w)$. $L$ satisfies the correlated negative curvature assumption if, for some $\gamma > 0$ and all $w$,

$$\mathbb{E}_x\left[\langle v_w, \nabla \hat L(x, w)\rangle^2\right] \ge \gamma, \qquad (19)$$

where $x$ is the data point.

This also does not hold: the point $(0, 0)$ violates this condition because $\nabla \hat L(x, w) = 0$ for all $x$ at $(w_1, w_2) = (0, 0)$. In fact, it is exactly this condition-violating point that SGD converges to in this example.

Recently, the one-point strongly convex assumption has also been proposed for studying the escaping behavior of SGD.

Definition 8. A loss function is said to satisfy the $c$-one-point strongly convex condition with respect to $w^*$ if, for all $w$ and some $c > 0$, defining $v = w - \lambda \nabla L(w)$,

$$\mathbb{E}_\epsilon\left[\langle -\nabla L(v - \lambda \epsilon),\, w^* - v \rangle\right] \ge c\, \|w^* - v\|_2^2, \qquad (20)$$

where $\epsilon$ is the random noise caused by minibatch sampling.

This assumption says that the loss landscape is strongly convex towards $w^*$ after convolution with the noise, which implies that there is only a single stationary point after the convolution. However, this is not the case here: consider any point with $w_1 = 0$ and $w_2 \neq 0$. These points have zero sampled gradient, and so $\epsilon = 0$ with probability 1. Then these points all have the same loss after the convolution,

$$\mathbb{E}[L(v - \lambda \epsilon)] = L(v), \qquad (21)$$

which implies either that the landscape after the convolution is not convex or that there is more than one stationary point; in either case, the system is not one-point strongly convex.¹

¹ One notices that the origin, where all the parameters are zero, is a special point in this problem. In fact, the origin may be a very special point in the landscape of deep neural networks in general; see, for example, Ziyin et al. (2022a).

B CONTINUOUS-TIME APPROXIMATION WITH THE FOKKER-PLANCK EQUATION

In this section, we study the examples from the main text in a continuous-time approximation, in order to understand the unique aspects of discrete-time SGD. Compared with the discrete-time analysis, the continuous-time analysis is usually more powerful in terms of calculation: it can deal with more complicated potentials.

B.1 FOKKER-PLANCK EQUATION AND ITS STATIONARY DISTRIBUTION

Recall that the SGD update takes the form (see Sec. 2)

$$\Delta w_t = -\lambda \nabla L + \lambda \eta_t, \qquad (22)$$

where $\eta_t$ denotes the minibatch noise. When $\lambda < 1$, the above equation may be approximated by a continuous-time Ornstein-Uhlenbeck-type process (Mandt et al., 2017),

$$dw(t) = -\lambda \nabla L\, dt + \frac{\lambda}{\sqrt{S}} \sqrt{C(w)}\, dW(t), \qquad (23)$$

where $\lambda$ is the learning rate and we have introduced $S$ as the batch size. $dW(t)$ is a stochastic process satisfying

$$dW(t) \sim \mathcal{N}(0, dt\, I), \qquad \mathbb{E}[dW(t)\, dW(t')] \propto \delta(t - t'). \qquad (24)$$

By the definition of the underlying discrete-time process, the term $dw(t)$ depends only on $w(t)$ and has no dependence on $w(t + dt)$; thus, the stochastic integral should be interpreted in the Ito sense. For simplicity, we only consider the one-dimensional version of the Fokker-Planck equation, which is

$$\frac{\partial}{\partial t} P[w, t \mid w(0), 0] = -\frac{\partial}{\partial w} J[w, t \mid w(0), 0], \qquad (25)$$

where $J[w, t \mid w(0), 0]$ is the probability current:

$$J[w, t \mid w(0), 0] = -\lambda \frac{\partial L}{\partial w} P[w, t \mid w(0), 0] - \frac{\lambda^2}{2S} \frac{\partial}{\partial w}\left\{C(w)\, P[w, t \mid w(0), 0]\right\}. \qquad (26)$$
Assuming that the SGD dynamics is ergodic, there is a unique stationary distribution for $w$ as $t \to \infty$, which can be found to be

$$P(w) \propto \frac{1}{C(w)} \exp\left[-\frac{2S}{\lambda} \int dw\, \frac{1}{C(w)} \frac{\partial L}{\partial w}\right]. \qquad (27)$$

We will apply this equation to the relevant problems in this work.

B.2 A GENERAL CASE

In this section, we consider (a slightly more general version of) the fourth-order potential studied in the main text. The average loss landscape in this model is

$$L(w) = \frac{a}{4} w^2 + \frac{b}{4} w^4, \qquad (28)$$

where $b > 0$ guarantees that $w$ remains bounded regardless of the value of $a$. In the limit $b \to 0$, this function reduces to the saddle-point problem studied in Sec. 5.1. Besides the SGD noise, an additive noise is also present in the dynamics; the variance of the additive noise has no dependence on $w$. When the additive noise is present, the update rule (equation of motion) is

$$dw(t) = -\lambda\left[(a/2)\, w(t) + b\, w^3(t)\right] dt + \lambda\, \eta_m(t) + \lambda\, \eta_a(t), \qquad (29)$$

where $\eta_m(t)$ and $\eta_a(t)$ are independent Gaussian noise processes, denoting the multiplicative (SGD) noise and the additive noise, respectively. The $w$-dependent covariance of the SGD noise is $C(w) = w^2$; the SGD noise is thus a multiplicative noise, and $\eta_m(t)$ has variance $\frac{w(t)^2}{S}\, dt$, while the additive noise $\eta_a(t)$ has variance $\sigma^2 dt$, where $\sigma$ is a positive constant denoting the strength of the additive noise. If the multiplicative noise is seen as part of the loss landscape, it coincides with the second-order term of the loss landscape. There is no correlation between the additive noise and the SGD noise, i.e., $\mathbb{E}[\eta_m(t)\, \eta_a(t')] = 0$. The additive noise can be seen as artificially injected noise, a technique sometimes used for aiding the escape or for Bayesian learning purposes (Jin et al., 2017; Welling and Teh, 2011). The additive noise vanishes in the limit $\sigma \to 0$, and the model reduces to standard SGD with the fourth-order loss function.

It is always possible to define a noise $\eta'(t)$ whose contribution is equivalent to the contributions of the SGD noise and the additive noise combined, so that the equation of motion becomes

$$dw(t) = -\lambda\left[(a/2)\, w(t) + b\, w^3(t)\right] dt + \lambda\, \eta'(t). \qquad (30)$$

This combined noise $\eta'(t)$ has zero mean and variance $\left(\frac{w(t)^2}{S} + \sigma^2\right) dt$. Defining $\eta(t)$ through $\eta'(t) = \sqrt{\frac{w^2}{S} + \sigma^2}\, \eta(t)$, the equation of motion becomes

$$dw(t) = -\lambda\left[(a/2)\, w(t) + b\, w^3(t)\right] dt + \lambda \sqrt{\frac{w^2}{S} + \sigma^2}\, \eta(t). \qquad (31)$$

Comparing with Eq. (23), one finds

$$C(w) = w^2 + S\sigma^2. \qquad (32)$$

The solution of the corresponding stationary Fokker-Planck equation is

$$P(w) \propto \frac{1}{w^2 + S\sigma^2} \exp\left[-\frac{2S}{\lambda} \int dw\, \frac{(a/2)\, w + b\, w^3}{w^2 + S\sigma^2}\right] = \left(w^2 + S\sigma^2\right)^{-1 - \frac{Sa}{2\lambda} + \frac{S^2 b \sigma^2}{\lambda}} \exp\left[-\frac{Sb}{\lambda}\, w^2\right], \qquad (33)$$

where we used $b w^3 = b w (w^2 + S\sigma^2) - S b \sigma^2 w$ to evaluate the integral. The function $P(w)$ is finite everywhere and decays exponentially to 0 as $w \to \pm\infty$, regardless of the values of the parameters $a, b, \sigma, S$. This indicates that the normalization integral of $P(w)$ has a well-defined value. As a consequence, there is no concentration of measure: $w(t)$ falls into any given interval with finite probability after a large enough $t$, and $w(t)$ can be arbitrarily far away from 0. In practice, this means that $w(t)$ escapes from the saddle point at $w = 0$ regardless of the values of the parameters.

Figure 8: Stationary distribution of w with additive noise of σ = 0.1 for the quadratic loss landscape. All the other parameters are identical to those in Fig. 1. This distribution has finite width.

The effect of the additive noise is better understood when $b = 0$.
The effect of the additive noise is better understood when $b = 0$. In this case, which corresponds to the case shown in Fig. 1, $w$ neither concentrates at 0 nor escapes to infinity; instead, it stays near 0 without converging to it. The stationary distribution of $w$ for $b = 0$, obtained by numerical simulation, is shown in Fig. 8; the other settings are identical to those in Fig. 1. The stationary distribution exists, but $w$ does not concentrate at $w = 0$. First of all, this figure shows that the additive noise helps SGD escape the saddle point: if the designed landscape were part of a realistic landscape, a broader distribution means a higher chance of being attracted by another minimum. However, the figure also shows that $w$ stays in the neighborhood of $w = 0$ unless the noise is large enough. Thus, additive noise and a long enough training time do not by themselves guarantee an efficient escape.

In the language of Bayesian inference, $\ln P(w)$ is the log-likelihood of $w$. The most probable value $\hat w$ of $w$ is defined by the relation

$\frac{d\ln P(w)}{dw}\Big|_{w=\hat w} = 0,$  (34)

which reads

$-\left\{\left[1 + \frac{S}{\lambda}\Big(\frac{a}{2} - bS\sigma^2\Big)\right]\frac{2w}{w^2 + S\sigma^2} + \frac{2Sb}{\lambda}w\right\}_{w=\hat w} = 0.$  (35)

The likelihood maximizer is

$\hat w = \begin{cases} 0 & \text{if } a > -2\lambda/S; \\ \pm\sqrt{-\dfrac{a + 2\lambda/S}{2b}} & \text{if } a < -2\lambda/S. \end{cases}$  (36)

When $a > -2\lambda/S$, the maximum-likelihood parameter is $\hat w = 0$. When $a < -2\lambda/S$, the two maximizers $\hat w = \pm\sqrt{-(a + 2\lambda/S)/(2b)}$ are equally likely. This resembles a phase transition in statistical physics, and we call $a = -2\lambda/S$ the critical value of $a$, or the "critical $a$". For the energy landscape, when $a > 0$ there is only one global minimum at $w = 0$, and when $a < 0$ there are two global minima at $w = \pm\sqrt{-a/(2b)}$. Comparing with the maximum-likelihood solutions, we see that the likelihood maximizer given by SGD is in fact a biased estimator of the underlying minima: the SGD noise introduces a bias term both in the critical $a$ and in the value of the most probable $w$. This term vanishes as $S \to \infty$, i.e., when the SGD noise vanishes. This indicates that, when the noise is state-dependent, there is no reason to expect SGD to be an unbiased or consistent estimator of the minimizer, as some works assume.

B.3 FOURTH-ORDER POTENTIAL WITH MULTIPLICATIVE NOISE

In this subsection, we study the fourth-order loss landscape under SGD dynamics without additive noise. The average loss function is again $L(w) = \frac{a}{4}w^2 + \frac{b}{4}w^4$, and the SGD update rule (i.e., the equation of motion) in this case is

$dw(t) = -\lambda\left[\frac{a}{2}w(t) + bw^3(t)\right]dt + \lambda\sqrt{\frac{1}{S}}\,\eta(t)\,w(t),$  (37)

where the definition of $\eta$ is identical to that in the previous example. The solution of the corresponding stationary Fokker–Planck equation is

$P(w) \propto |w|^{-2 - \frac{Sa}{\lambda}}\exp\left[-\frac{Sb}{\lambda}w^2\right].$  (38)

When

$-\frac{Sa}{2\lambda} < 1,$  (39)

$P(w)$ diverges at $w = 0$. The normalization factor, i.e., the integral $\int_{-\infty}^{+\infty} dw\,|w|^{-2-\frac{Sa}{\lambda}}\exp[-\frac{Sb}{\lambda}w^2]$, also diverges due to the divergence at $w = 0$. This solution can be seen as a limit of (33) as the additive noise vanishes. As the strength of the additive noise $\sigma \to 0$, the normalization factor keeps growing while the distribution remains normalized. For $w \neq 0$, the function $|w|^{-2-\frac{Sa}{\lambda}}\exp[-\frac{Sb}{\lambda}w^2]$ is always finite. Thus, in the limit $\sigma \to 0$, $P(w)$ approaches 0 everywhere except at $w = 0$. It is straightforward to check that the normalization factor diverges more slowly than $P(0)$ as $\sigma \to 0$. As a consequence, when $\sigma = 0$, $P(w)$ diverges at $w = 0$ and becomes a Dirac delta function. This corroborates the result in Sec. 5.3 that a small $|a|$ leads to a concentration of measure towards the local maximum. Compared with (33), one finds that the additive noise prevents the concentration of measure. Thus, in practice, the additive noise helps SGD escape the local maximum or saddle point. However, the existence of additive noise does not change the critical $a$ or the position of the peaks.
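This dichotomy is easy to reproduce numerically. In the sketch below the sampled curvature is drawn as $x \sim \mathcal{N}(a/2, 1)$, so that the noise is purely multiplicative with $C(w) = w^2$ and $S = 1$; the values of $\lambda$, $b$, and the escape threshold are illustrative choices.

```python
import numpy as np

# Concentration of measure in Sec. B.3: SGD on L-hat = x w^2/2 + b w^4/4 with
# sampled curvature x ~ N(a/2, 1), i.e. purely multiplicative noise (S = 1).
# Runs collapse onto w = 0 when a is above the critical value and settle near
# the minima at +/- sqrt(-a/(2b)) when a is well below it. Illustrative values.
rng = np.random.default_rng(2)
lam, b, T, n_runs = 0.05, 1.0, 100_000, 300

def fraction_escaped(a):
    w = rng.uniform(-0.1, 0.1, size=n_runs)
    for _ in range(T):
        x = a / 2 + rng.standard_normal(n_runs)  # E[x] = a/2, Var[x] = 1
        w = w - lam * (x * w + b * w**3)         # SGD step on the sampled loss
    return np.mean(np.abs(w) > 0.05)

for a in (-0.01, -0.5):                           # above / well below a_c = -2*lam/S
    print(f"a = {a:5.2f}: escaped fraction = {fraction_escaped(a):.2f}")
```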
B.4 QUADRATIC POTENTIAL

Finally, we consider the continuous approximation of the model treated in Sec. 5.1. Let the loss function be $\hat L = xw^2/2$, for which

$\mathbb{E}_x[\hat g_t] = \frac{a}{2}w_t,$  (40)

$C(w_t) = \frac{1}{S}\mathbb{E}\big[\nabla\hat\ell\,\nabla\hat\ell^T\big] - \frac{1}{S}\nabla L(w_t)\nabla L(w_t)^T \approx \frac{1}{S}\mathbb{E}[x^2]\,w_t^2.$  (41)

The SGD update rule (i.e., the equation of motion) in 1d is

$\Delta w_t = -\lambda\hat g_t$  (42)
$\phantom{\Delta w_t} = -\lambda x_t w_t.$  (43)

In the continuous approximation, the equation of motion becomes

$dw(t) = -\lambda\frac{a}{2}w(t)\,dt + \lambda\sqrt{\frac{1}{S}}\,w(t)\,\eta(t),$  (44)

where $\eta(t) \sim \mathcal{N}(0, dt)$. Comparing with the one-dimensional Fokker–Planck equation, one reads off the drift $A(w) = -\lambda\frac{a}{2}w$ and the noise amplitude $B(w) \propto w$, i.e., $C(w) = w^2$.  (45)

The stationary distribution of $w$ is

$P(w) \propto |w|^{-2 - \frac{Sa}{\lambda}}.$  (46)

The only difference between this solution and the one in the fourth-order case is the exponential factor, showing that the fourth-order term does nothing but keep $w(t)$ bounded. The function $|w|^{-2-\frac{Sa}{\lambda}}$ diverges at $w = 0$ or at $w = \pm\infty$ depending on the value of $a$. However, the stationary distribution, being a limit of (33), is well defined. In the case where it diverges at $w = 0$, $P(w)$ becomes a delta function; on the contrary, there are two peaks infinitely far away from 0 when $|w|^{-2-\frac{Sa}{\lambda}}$ diverges at $w = \pm\infty$. Being able to escape requires that the probability measure of $w$ does not concentrate at 0, corresponding to

$-\frac{Sa}{2\lambda} > 1.$  (47)

This straight line agrees approximately with the boundary of the lower branch of the phase diagram in Fig. 9. This escape condition is worth interpreting. Note that $a$ is the local curvature and reflects the strength of the gradient signal, while the strength of the SGD noise is proportional to $\lambda/S$.² The term $-Sa/(2\lambda)$ is thus the signal-to-noise ratio of this learning problem, and the escape condition is exactly that the signal becomes larger than the noise. This analysis therefore pinpoints the cause of the convergence to saddle points: the dominance of the SGD noise over the gradient.

Also, this example raises an interesting question regarding the mechanism by which SGD converges to the local maximum. From the perspective of the types of convergence, the SGD mechanism behind Proposition 1 may be different from that of Proposition 2. However, from the perspective of the continuous-time analysis, one and the same mechanism governs the SGD dynamics behind Propositions 1 and 2 (namely, the fact that the noise dominates the gradient).

Figure 9: Escape probability as a function of $a$ and $\lambda$. The parameter space is divided into an absorbing phase where $w$ is attracted to the local maximum (in dark blue) and an active phase where $w$ successfully escapes the two central bins (in white). The orange line denotes the analytical convergence bound on $\lambda$ as a function of $a$ obtained in the discrete-time calculation, while the red dashed line denotes the same bound obtained by the continuous-time calculation. Here the batch size $S$ is set to 1. The continuous-time result agrees well with the discrete-time process in the small-$\lambda$, small-$a$ limit.

²See Ziyin et al. (2022b), for example.
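The phase boundary in Fig. 9 can be probed directly. The sketch below estimates the escape probability of the exact discrete-time recursion $w_{t+1} = (1 - \lambda x_t)w_t$ with the two-point noise of Corollary 1, tracked in log space for numerical stability, and prints the continuous-time prediction $-Sa/(2\lambda) > 1$ next to it; the $(a, \lambda)$ grid is an illustrative subset of the figure's axes.

```python
import numpy as np

# Probe of the phase diagram in Fig. 9 (batch size S = 1): estimate the escape
# probability of the exact multiplicative recursion w_{t+1} = (1 - lam*x_t) w_t
# with x in {1, a-1}, tracked in log space, and compare with the
# continuous-time criterion -S*a/(2*lam) > 1. Illustrative (a, lam) values.
rng = np.random.default_rng(3)
T, n_runs = 5_000, 200

def escape_prob(a, lam):
    logw = np.full(n_runs, np.log(0.01))     # start inside the central bin
    for _ in range(T):
        x = np.where(rng.random(n_runs) < 0.5, 1.0, a - 1.0)
        logw += np.log(np.abs(1.0 - lam * x))
    return np.mean(logw > np.log(0.01))      # grew beyond the initial bin

for a in (-0.3, -1.5):
    for lam in (0.1, 0.5):
        phase = "active" if -a / (2 * lam) > 1 else "absorbing"
        print(f"a={a:5.2f}, lam={lam:3.1f}: "
              f"P_escape={escape_prob(a, lam):.2f}, continuous-time: {phase}")
```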
C DELAYED PROOFS

C.1 TWO FREQUENTLY USED LEMMAS

We first prove the following lemma regarding the limiting distribution of $\ln|w_t|$.

Lemma 1. Let the loss function be $\hat L(w) = \frac{1}{2}xw^2$, with $x \sim p(x)$ such that $\mathrm{Var}[x] = \sigma^2$ and $\mathbb{E}_x[x] = a < 0$, $p(x)$ continuous in a $\delta$-neighborhood of $x = 1/\lambda$, and $w_0 \neq 0$. Then, for $w_t$ generated by SGD after $t$ time steps,

$\frac{1}{\sqrt t}\left(\ln\left|\frac{w_t}{w_0}\right| - t\mu\right) \xrightarrow{d} \mathcal{N}(0, s^2),$  (48)

where $\mu = \mathbb{E}_x[\ln|1-\lambda x|]$ and $s^2 = \mathrm{Var}[\ln|1-\lambda x|]$, and $\lambda > 0$ is the learning rate.

Proof. After $t$ steps of SGD, we have

$w_t = \prod_{i=1}^{t}\,[1 - \lambda x_i]\,w_0.$  (49)

This leads to

$\ln\left|\frac{w_t}{w_0}\right| = \sum_{i=1}^{t}\ln|1 - \lambda x_i|.$  (50)

Now, since the density of $1 - \lambda x$ is continuous in the neighborhood of 0 by assumption, we can apply Proposition 6 (below) to show that the first and second moments of $\ln|1-\lambda x|$ exist, whereby we can apply the central limit theorem to obtain

$\frac{1}{\sqrt t}\left(\ln\left|\frac{w_t}{w_0}\right| - t\,\mathbb{E}[\ln|1-\lambda x|]\right) \xrightarrow{d} \mathcal{N}(0, s^2),$  (51)

where $s^2 = \mathrm{Var}[\ln|1-\lambda x|]$. This proves the lemma.

Proposition 6. $\mathbb{E}_{x\sim p(x)}[\ln^2|x|]$ is finite if the second moment of $p(x)$ exists and if $p(x)$ is continuous in a neighborhood of $x = 0$.

Proof. Since $p(x)$ is continuous in a neighborhood of $x = 0$, there exist $\delta > 0$ and $c > 0$ such that $p(x) \le p(0) + c$ for all $|x| \le \delta$. This means that we can divide the integral over $p(x)$ into three regions:

$\mathbb{E}_{x\sim p(x)}[\ln^2|x|] = \int_{-\infty}^{\infty} dx\,p(x)\ln^2|x|$  (52)
$= \int_{-\infty}^{-\delta} dx\,p(x)\ln^2|x| + \int_{-\delta}^{\delta} dx\,p(x)\ln^2|x| + \int_{\delta}^{\infty} dx\,p(x)\ln^2|x|.$  (53)

The second term can be bounded as

$\int_{-\delta}^{\delta} dx\,p(x)\ln^2|x| \le [p(0) + c]\int_{-\delta}^{\delta} dx\,\ln^2|x|$  (54)
$= 2[p(0) + c]\int_0^{\delta} dx\,\ln^2 x$  (55)
$= 2[p(0) + c]\,\big[x(\ln^2 x - 2\ln x + 2)\big]_0^{\delta}$  (56)
$= 2[p(0) + c]\,\delta(\ln^2\delta - 2\ln\delta + 2) < \infty.$  (57)

The first and third terms can also be bounded. Since $|\ln x|$ is dominated by a linear function in the region $x \ge \delta$, we can find a linear function $ax + b$ of $x$ such that $|\ln x| \le ax + b$ for all $x \ge \delta$. This leads to

$\int_{\delta}^{\infty} dx\,p(x)\ln^2 x \le \int_{\delta}^{\infty} dx\,p(x)(ax+b)^2$  (58)
$\le \int_{-\infty}^{\infty} dx\,p(x)(ax+b)^2$  (59)
$= \int dx\,p(x)(a^2x^2 + 2abx + b^2)$  (60)
$= a^2\mu_2 + 2ab\mu_1 + b^2,$  (61)

where we have used the notation $\mathbb{E}[x^2] = \mu_2$ and $\mathbb{E}[x] = \mu_1$, which are finite by assumption. The case $x \le -\delta$ is completely symmetric and can also be bounded by $a^2\mu_2 + 2ab\mu_1 + b^2$. Therefore, we have shown that

$\mathbb{E}_{x\sim p(x)}[\ln^2|x|] \le 2(a^2\mu_2 + 2ab\mu_1 + b^2) + 2[p(0) + c]\,\delta(\ln^2\delta - 2\ln\delta + 2) < \infty.$  (62)

This proves the proposition.

Remark. Note that continuity in a neighborhood of 0 is not a necessary condition. For example, one can also prove that $\mathbb{E}_x[\ln^2|x|]$ is finite if $p(x)$ is bounded and its second moment exists.

C.2 PROOF OF PROPOSITION 2

Proof. The case $w_0 = 0$ is trivially true; we therefore consider $w_0 \neq 0$. By Lemma 1, we have

$\frac{1}{\sqrt t}\,z_t \xrightarrow{d} \mathcal{N}(0, s^2),$  (63)

where we have defined $z_t = \ln|w_t/w_0| - t\mu$, $\mu = \mathbb{E}[\ln|1-\lambda x|]$, and $s^2 = \mathrm{Var}[\ln|1-\lambda x|]$. This means that the density of $\ln|w_t/w_0|$ approaches

$\lim_{t\to\infty} P\left(\ln\left|\frac{w_t}{w_0}\right|\right) = \lim_{t\to\infty}\frac{1}{\sqrt{2\pi t s^2}}\exp\left[-\frac{1}{2ts^2}\left(\ln\left|\frac{w_t}{w_0}\right| - t\mu\right)^2\right].$  (64)

By definition, $|w_t| = |w_0|\,e^{\sqrt t\,z + t\mu}$ with $z$ asymptotically distributed as $\mathcal{N}(0, s^2)$. Therefore, for all $\epsilon > 0$,

$\lim_{t\to\infty}P(|w_t| > \epsilon) = \lim_{t\to\infty}P\left(|w_0|\,e^{\sqrt t\,z + t\mu} > \epsilon\right) = \begin{cases}0 & \text{if }\mu < 0;\\ 1 & \text{if }\mu > 0.\end{cases}$  (65)

In other words, since $\mu$ and $s$ are $t$-independent, the sign of $\mu$ becomes crucial in the infinite-$t$ limit. When $\mu > 0$, the limiting distribution diverges to infinity; when $\mu < 0$, $w_t$ converges to 0 in probability. This finishes the proof.

C.3 PROOF OF COROLLARY 1

Proof. By the definition of $p(x)$,

$\mu = \mathbb{E}[\ln|1 - \lambda x|]$  (66)
$= \tfrac{1}{2}\ln|1 - \lambda| + \tfrac{1}{2}\ln|1 - \lambda(a-1)|$  (67)
$= \tfrac{1}{2}\ln\big|(1-\lambda)\big(1 - \lambda(a-1)\big)\big|,$  (68)

while $s^2$ is likewise constant in time. Therefore, the asymptotic distribution of $w_t$ depends solely on the sign of $\mu$. The above equation implies that $\mu > 0$ when

$\big|(1-\lambda)[1 - \lambda(a-1)]\big| > 1.$  (69)

When $\lambda \le 1$, this is solved by

$\lambda < \frac{a}{a-1}.$  (70)

When $\lambda > 1$, it solves to

$\lambda > \frac{a - \sqrt{a^2 - 8a + 8}}{2(a-1)};$  (71)

only when the learning rate satisfies one of the above two conditions can SGD escape from the local maximum. Conversely, SGD cannot escape the local maximum when

$\frac{a}{a-1} \le \lambda \le \frac{a - \sqrt{a^2 - 8a + 8}}{2(a-1)}.$  (72)

We are done.
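The escape window (70)–(72) is easy to check numerically: the sketch below evaluates $\mu$ from (68) on a grid of learning rates and compares its sign with the closed-form thresholds; $a = -1$ is an illustrative choice.

```python
import numpy as np

# Check of Corollary 1: SGD escapes w = 0 iff mu = E[ln|1 - lam*x|] > 0 for the
# two-point noise x in {1, a-1}. Compare the sign of mu from (68) with the
# closed-form thresholds (70) and (71). a = -1 is an illustrative choice.
a = -1.0
lower = a / (a - 1)                                      # escape for lam < lower
upper = (a - np.sqrt(a**2 - 8 * a + 8)) / (2 * (a - 1))  # escape for lam > upper

for lam in np.linspace(0.05, 2.0, 8):
    mu = 0.5 * (np.log(abs(1 - lam)) + np.log(abs(1 - lam * (a - 1))))
    predicted = (lam < lower) or (lam > upper)
    print(f"lam={lam:4.2f}: mu={mu:+.3f}, escapes={mu > 0}, "
          f"corollary predicts {predicted}")
```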
C.4 PROOF OF PROPOSITION 3

Proof. By Lemma 1, we have

$\frac{1}{t}\,\mathbb{E}\left[\ln\left|\frac{w_t}{w_0}\right|\right] = \mu = \frac{1}{2}\ln\big\{(1-\lambda)[1-\lambda(a-1)]\big\},$  (73)

where the second equality follows from the assumption that $\lambda < 1$. Differentiating with respect to $\lambda$, we find the critical learning rate

$\lambda^* = \frac{a}{2(a-1)}.$  (74)

Since $\mu$ is concave in $\lambda$, this critical learning rate achieves the maximum escape rate. Plugging it into $\mu$, we obtain

$\gamma = \mu(\lambda^*) = \frac{1}{2}\ln\frac{(2-a)^2}{4(1-a)} = \ln\frac{2-a}{2\sqrt{1-a}} \le \epsilon,$  (75)

i.e., the optimal escape rate can be made smaller than any $\epsilon$ if we set

$a \ge -2\,\frac{e^{2\epsilon}-1}{e^{2\epsilon}+1}.$  (76)

This finishes the proof.

C.5 PROOF OF PROPOSITION 4

Proof. It suffices to show that $w_1$ converges to 0 with probability 1 because, if this is the case, the only possible local minima to converge to are $(w_1, w_2) = \big(0, \pm\sqrt{(1+a)/2}\big)$. The SGD dynamics is

$\begin{cases}\Delta w_{1,t} = -\lambda\big(-2w_{1,t} + 4w_{1,t}^3 + 12w_{2,t}^2 w_{1,t} + x_t b\,w_{1,t}\big);\\ \Delta w_{2,t} = -\lambda\big(-2w_{2,t} + 4w_{2,t}^3 + 12w_{1,t}^2 w_{2,t} - 2a w_{2,t}\big).\end{cases}$  (77)

We focus on the dynamics of $w_1$. By the definition of the noise $x$, we have

$w_{1,t+1} = \begin{cases} w_{1,t}\big[1 + \lambda(2-b) - 4\lambda w_{1,t}^2 - 12\lambda w_{2,t}^2\big] & \text{with probability } 1/2;\\ w_{1,t}\big[1 + \lambda(2+b) - 4\lambda w_{1,t}^2 - 12\lambda w_{2,t}^2\big] & \text{with probability } 1/2.\end{cases}$  (78)

Equivalently,

$\frac{w_{1,t+1}}{w_{1,t}} = \begin{cases} 1 + \lambda(2-b) - 4\lambda w_{1,t}^2 - 12\lambda w_{2,t}^2 & \text{with probability } 1/2;\\ 1 + \lambda(2+b) - 4\lambda w_{1,t}^2 - 12\lambda w_{2,t}^2 & \text{with probability } 1/2.\end{cases}$  (79)

Since $0 \le 4\lambda w_{1,t}^2 + 12\lambda w_{2,t}^2 \le 16\lambda$, we can define a new random variable $r_t$:

$r_t = r(x_t) = \begin{cases} \max\big(|1 + \lambda(2-b)|,\ |1 + \lambda(2-b) - 16\lambda|\big) & \text{if } x_t > 0;\\ \max\big(|1 + \lambda(2+b)|,\ |1 + \lambda(2+b) - 16\lambda|\big) & \text{if } x_t < 0.\end{cases}$  (80)

By construction, $r_t \ge |w_{1,t+1}/w_{1,t}|$ for all values of $x_t$. This implies $|w_{1,t}/w_{1,0}| \le \prod_{i=1}^t r_i$, and so

$P\big(|w_{1,t}/w_{1,0}| > \epsilon\big) \le P\Big(\prod_{i=1}^t r_i > \epsilon\Big),$  (81)

i.e., if $\prod_{i=1}^t r_i$ converges to 0 with probability 1, then $w_{1,t}$ must also converge to 0 with probability 1. We let $b = 1/\lambda - 6$ (note that this is the value of $b$ for which $r_t$ is minimized in both cases), and we obtain

$r_t = \begin{cases} 8\lambda & \text{if } x_t > 0;\\ 2\max\big(|1-2\lambda|,\ |1-10\lambda|\big) & \text{if } x_t < 0.\end{cases}$  (82)

The rest of the proof follows by applying the central limit theorem to $\sum_{i=1}^t \ln r_i$ as in Lemma 1. The result is that $\prod_{i=1}^t r_i$ converges to 0 with probability 1 if

$\mu = \mathbb{E}[\ln r_t] = \frac{1}{2}\ln\big[16\lambda\max\big(|1-2\lambda|,\ |1-10\lambda|\big)\big] < 0.$  (83)

The above inequality solves to

$\lambda < \frac{2-\sqrt 2}{8},$  (84)

which is of order $O(1)$, as stated in the theorem statement. This proves the proposition.
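The sign change of the bound (83) can also be located numerically; the short scan below confirms that $\mu < 0$ exactly on the interval closed by the root quoted in (84).

```python
import numpy as np

# Locate where the bound (83), mu = (1/2) ln[16*lam*max(|1-2*lam|, |1-10*lam|)],
# changes sign, and compare with the closed-form root (2 - sqrt(2))/8 from (84).
lam = np.linspace(1e-4, 0.2, 200_000)
mu = 0.5 * np.log(16 * lam * np.maximum(np.abs(1 - 2 * lam),
                                        np.abs(1 - 10 * lam)))
neg = lam[mu < 0]
print(f"mu < 0 for lam in ({neg.min():.4f}, {neg.max():.4f})")
print(f"(2 - sqrt(2))/8 = {(2 - np.sqrt(2)) / 8:.4f}")
```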
C.6 PROOF OF PROPOSITION 5

Proof. First, we note that, since $|w_{t-1}| \le 1$, the sampled gradient $\hat g_t = x_t w_{t-1}$ satisfies $\hat g_t^2 \le 1 + a$.  (85)

By the definition of the AMSGrad algorithm, we have

$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\hat g_t^2;$  (86)
$\hat v_t = \max(\hat v_{t-1}, v_t).$  (87)

Since $v_0 = 0$ and $\hat g_t^2 \le 1 + a$, we have

$v_t \le 1 + a$  (88)

for all $t$, and so

$\hat v_t \le \max_t v_t \le 1 + a.$  (89)

Meanwhile, since $\hat v_t$ is a monotonically increasing sequence that is upper-bounded by $1 + a$, it must converge to a constant $0 < c \le 1 + a$. Now, as before, we want to upper-bound the random variable

$\frac{1}{\sqrt t}\sum_{i=1}^{t}\left(\ln\Big|1 - \frac{\lambda}{\sqrt{\hat v_i}}x_i\Big| - \mu\right),$  (90)

where $\mu = \mathbb{E}[\ln|w_{t+1}/w_t|]$. Since $\hat v_t$ converges to $c$, for any $\epsilon > 0$ there exists a positive integer $N(\epsilon)$ such that $\hat v_N \ge c - \epsilon$. This allows us to divide the sum into two terms:

$\frac{1}{\sqrt t}\sum_{i=1}^{t}\left(\ln\Big|1 - \frac{\lambda}{\sqrt{\hat v_i}}x_i\Big| - \mu\right)$  (91)
$= \frac{1}{\sqrt t}\sum_{i=1}^{N}\left(\ln\Big|1 - \frac{\lambda}{\sqrt{\hat v_i}}x_i\Big| - \mu\right) + \frac{1}{\sqrt t}\sum_{i=N+1}^{t}\left(\ln\Big|1 - \frac{\lambda}{\sqrt{c - k_i}}x_i\Big| - \mu\right),$  (92)

where we introduced $0 \le k_i \le \epsilon$. The first term is of order $O(1/\sqrt t)$ and converges to 0, so for sufficiently large $t$ it is smaller than an arbitrary $\epsilon$. This means that the quantity in (91) is upper-bounded by

$\frac{1}{\sqrt t}\sum_{i=N+1}^{t}\left(\ln\Big|1 - \frac{\lambda}{\sqrt{c - k_i}}x_i\Big| - \mu\right) + \epsilon.$  (93)

We can now consider the random variable (recall $\hat v_t = c - k_t$)

$\Big|1 - \frac{\lambda}{\sqrt{c-k_t}}\,x_t\Big| = \begin{cases}\Big|1 - \dfrac{\lambda}{\sqrt{c-k_t}}\Big| & \text{with probability } \tfrac12;\\[4pt] \Big|1 + \dfrac{\lambda}{\sqrt{c-k_t}}(1-a)\Big| & \text{with probability } \tfrac12.\end{cases}$  (94)

We now define a new random variable

$r_t(x_t) = \begin{cases} m & \text{if } x_t = 1;\\ 1 + \dfrac{\lambda}{\sqrt{c-\epsilon}}(1-a) & \text{if } x_t = a - 1,\end{cases}$  (95)

where we defined the constant $m = \max\big(|1 - \lambda/\sqrt c|,\ |1 - \lambda/\sqrt{c-\epsilon}|\big)$. One can easily check that $r_t \ge |1 - (\lambda/\sqrt{c-k_t})\,x_t|$. This means that

$P\Big(\Big|\frac{w_{t+1}}{w_0}\Big| > \alpha\Big) \le P\Big(\prod_{i=0}^{t} r_i > \alpha + O(1/\sqrt t)\Big),$  (96)

i.e., if $\prod_i r_i$ converges to 0 in probability, then $w_{t+1}/w_0$ must also converge to 0 in probability. Proceeding as in the proof of Proposition 2, one finds that the condition for $\prod_i r_i$ to converge to 0 in probability is

$\ln\Big\{m\Big[1 + \frac{\lambda}{\sqrt{c-\epsilon}}(1-a)\Big]\Big\} < 0,$  (97)

which is equivalent to

$m\Big[1 + \frac{\lambda}{\sqrt{c-\epsilon}}(1-a)\Big] < 1.$  (98)

Since $\epsilon$ is arbitrary, we let $\epsilon \to 0$ and obtain

$\Big(1 - \frac{\lambda}{\sqrt c}\Big)\Big[1 + \frac{\lambda}{\sqrt c}(1-a)\Big] \le 1.$  (99)

Denoting $\lambda/\sqrt c$ as $\lambda'$, this condition solves to

$\frac{a}{a-1} \le \lambda' \le \frac{a - \sqrt{a^2 - 8a + 8}}{2(a-1)}.$  (100)

Setting $a$ such that the above condition is met, AMSGrad converges to 0 in probability (for $\lambda' \le 1$, the binding condition is $\lambda' \ge a/(a-1)$). This completes the proof.
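To close, a minimal simulation of the mechanism behind Proposition 5: AMSGrad with $\beta_1 = 0$ (so only the adaptive second moment acts) on the quadratic loss with the two-point noise $x \in \{1, a-1\}$. As a simplifying assumption of this sketch, $\hat v$ is warm-started at 1, so that the effective rate $\lambda' = \lambda/\sqrt{\hat v_t} \approx 0.8$ sits inside the non-escape window of (100) from the first step; $\lambda$, $\beta_2$, $a$, and the run length are illustrative choices.

```python
import numpy as np

# Minimal illustration of Proposition 5: AMSGrad (beta1 = 0) on the quadratic
# loss with two-point gradient noise x in {1, a-1}. We warm-start v-hat at 1
# (a simplifying assumption, so that lam' = lam / sqrt(v-hat) ~ 0.8 lies in the
# non-escape window (100) from the start); most runs are then absorbed at the
# local maximum w = 0. All parameter values are illustrative.
rng = np.random.default_rng(5)
lam, beta2, a, T, n_runs = 0.8, 0.999, -1.0, 5_000, 1_000

w = rng.uniform(-1, 1, n_runs)
v = np.ones(n_runs)
v_hat = np.ones(n_runs)
for _ in range(T):
    x = np.where(rng.random(n_runs) < 0.5, 1.0, a - 1.0)
    g = x * w                                 # sampled gradient of x * w^2 / 2
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = np.maximum(v_hat, v)              # AMSGrad: non-decreasing v-hat
    w = w - lam / np.sqrt(v_hat) * g
print("fraction absorbed at w = 0:", np.mean(np.abs(w) < 1e-6))
```

A minority of runs may still escape when a rare excursion inflates $\hat v$ and pushes $\lambda'$ out of the window, which is consistent with the in-probability (rather than almost-sure) nature of the statement.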