# powerlaw_escape_rate_of_sgd__d04447f3.pdf Power-Law Escape Rate of SGD Takashi Mori 1 Liu Ziyin 2 Kangqiao Liu 2 Masahito Ueda 1 2 3 Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the meansquare loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier log L = log[L(θs)/L(θ )] between a local minimum θ and a saddle θs determines the escape rate of SGD from the local minimum, contrary to the previous results borrowing from physics that the linear loss barrier L = L(θs) L(θ ) decides the escape rate. Our escape-rate formula strongly depends on the typical magnitude h and the number n of the outlier eigenvalues of the Hessian. This result explains an empirical fact that SGD prefers flat minima with low effective dimensions, giving an insight into implicit biases of SGD. 1. Introduction Deep learning has achieved breakthroughs in various applications in artificial intelligence such as image classification (Krizhevsky et al., 2012; Le Cun et al., 2015), speech recognition (Hinton et al., 2012), natural language processing (Collobert & Weston, 2008), and natural sciences (Iten et al., 2020; Bapst et al., 2020; Seif et al., 2021). Such unparalleled success of deep learning hinges crucially on stochastic gradient descent (SGD) and its variants as an efficient training algorithm. Although the loss landscape is highly nonconvex, the SGD often succeeds in finding a global minimum. It has been argued that the SGD noise plays a key role in escaping from local minima (Jastrze bski et al., 2017; Wu et al., 2018; 1Center for Emergent Matter Science, Riken, Saitama, Japan 2Department of Physics, The University of Tokyo, Tokyo, Japan 3Institute for Physics of Intelligence, The University of Tokyo, Tokyo, Japan. Correspondence to: Takashi Mori . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). 2020; Zhu et al., 2019; Meng et al., 2020; Xie et al., 2021; Liu et al., 2021). It has also been suggested that SGD has an implicit bias that is beneficial for generalization. For example, SGD may help the network to find flat minima, which are considered to imply good generalization (Keskar et al., 2017; Hoffer et al., 2017; Wu et al., 2018; Pittorino et al., 2022). How and why does SGD help the network escape from bad local minima and find flat minima? These questions have been addressed in several works, and it is now recognized that the SGD noise strength and structure significantly affect the efficiency of escape from local minima. Our work follows this line of research and adds new theoretical perspectives. In physics and chemistry, escape from a local minimum of the energy landscape due to thermal noise at temperature T is a thoroughly studied fundamental problem (Kramers, 1940; Eyring, 1935). When the energy barrier is given by E, the escape rate is proportional to e E/T , which is known as the Arrhenius law. By analogy, in machine learning, escape from a local minimum of the loss function is considered to be determined by the loss barrier height L = L(θs) L(θ ), where L(θ) denotes the loss function at the network parameters θ, θ stands for a local minimum of L(θ), and θs denotes a saddle point that separates θ from other minima. 
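To make the distinction concrete, the following minimal sketch (illustrative numbers only, not an experiment from this paper) compares the two predictions for two minima that share the same barrier height ΔL but have different loss values L(θ*): an Arrhenius-type rate depends only on ΔL, whereas a power-law rate of the form [L(θ*)/L(θs)]^α depends on the ratio of the losses. The exponent α, the noise strength D, and all loss values are arbitrary choices for illustration.

```python
import numpy as np

# Two minima with the same barrier height dL = L(saddle) - L(min),
# but different loss values at the minimum.
dL = 0.1
L_min = np.array([1e-3, 1e-1])        # loss at the two local minima
L_saddle = L_min + dL                 # loss at the corresponding saddles

D = 0.05        # noise strength for the Arrhenius-type rate
alpha = 5.0     # illustrative power-law exponent, playing the role of B/(eta*h) + 1 - n/2

arrhenius_rate = np.exp(-dL / D) * np.ones_like(L_min)    # identical for both minima
power_law_rate = (L_min / L_saddle) ** alpha              # depends on the loss ratio

for lm, ra, rp in zip(L_min, arrhenius_rate, power_law_rate):
    print(f"L(min) = {lm:.0e}:  Arrhenius rate {ra:.3e}   power-law rate {rp:.3e}")
# The Arrhenius prediction is the same for both minima, while the power-law
# prediction escapes far more slowly from the minimum with the smaller loss value.
```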
Indeed, if we assume that the SGD noise is uniform and isotropic, which is often assumed in machine-learning literature (Jastrze bski et al., 2017), the escape rate is proportional to e L/D, where D denotes the SGD noise strength. In this paper, we show that inhomogeneity of the SGD noise strength in the parameter space brings about drastic modification for the mean-square loss. We show that the escape rate is determined by the logarithmic loss barrier height log L = log L(θs) log L(θ ) = log[L(θs)/L(θ )]. In other words, the escape rate is determined not by the difference but by the ratio of L(θs) and L(θ ). This result means that even if the loss barrier height L is the same, minima with smaller values of L(θ ) are more stable. Moreover, given the fact that the eigenvalue spectrum of the Hessian at a minimum consists of a bulk of almost zero eigenvalues and outliers (Sagun et al., 2017; Papyan, 2019), our escape-rate formula implies that SGD prefers flat minima with a low effective dimension, where the effective Power-Law Escape Rate of SGD dimension is defined as the number of outliers (Mac Kay, 1992) and flatness is measured by a typical magnitude of outlier eigenvalues (Keskar et al., 2017). The previous theories (Jastrze bski et al., 2017; Wu et al., 2018; Zhu et al., 2019; Meng et al., 2020; Xie et al., 2021; Liu et al., 2021) have also successfully explained the fact that SGD prefers flat minima, but not the preference of small effective dimensions. The logarithmic loss naturally explains the latter, and sheds light on implicit biases of SGD. Main contributions: We obtain the following main results: We derive an equation for approximating the SGD noise in Eq. (6). Remarkably, the SGD noise strength in the mean-square loss is shown to be proportional to the loss function, which is experimentally verified in Section 5.2. A key ingredient in deriving Eq. (6) is the decoupling approximation given in Eq. (7). This is a novel approximate method introduced in our analysis, and hence we experimentally verify it in Section 5.1. We derive a novel stochastic differential equation (SDE) in Eq. (14) via a random time change introduced in Eq. (13). Although the original SDE (4) has a multiplicative noise, the transformed SDE (14) has a simple additive noise with the gradient of the logarithmic loss. This shows the relevance of the logarithmic loss landscape for understanding SGD. We derive a novel form of SGD escape rate from a local minimum in Eq. (15). Remarkably, the escape rate takes the power-law form with respect to the ratio between L(θ ) and L(θs). In Section 5.3, we experimentally test the validity of this result for a linear regression and a neural network. We show that the escape rate of SGD crucially depends on the flatness and the effective dimension, which implies that SGD has implicit biases towards flat minima with low effective dimension. We also show in Eq. (16) that a local minimum with an effective dimension n greater than a certain critical value nc becomes unstable. Related works: The role of the SGD noise structure has been discussed in some previous works (Zhu et al., 2019; Xie et al., 2021; Liu et al., 2021; Meng et al., 2020; Wojtowytsch, 2021). It was pointed out that the anisotropic nature of the SGD noise is important: the SGD noise covariance matrix is aligned with the Hessian of the loss function, which is beneficial for escape from sharp minima (Zhu et al., 2019; Xie et al., 2021; Liu et al., 2021). 
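As a forward-looking illustration of this point and of the relation Σ(θ) ≈ (2L(θ)/B) H(θ*) derived in Eq. (6) below, the following sketch (a toy linear regression with illustrative sizes, not the paper's code) estimates the mini-batch-noise covariance at a fixed parameter value and compares it with (2L(θ)/B) H; for a linear model the Hessian XᵀX/N is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B = 2000, 4, 20
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]    # minimizer of the mean-square loss
theta = theta_star + 0.05 * rng.standard_normal(d)   # a point near the minimum

residual = X @ theta - y
L = 0.5 * np.mean(residual ** 2)                     # mean-square loss, Eq. (1)
H = X.T @ X / N                                      # Hessian; equals (1/N) sum_mu C_f^(mu) for a linear model

# Empirical covariance of the mini-batch gradient noise at this fixed theta.
full_grad = X.T @ residual / N
samples = []
for _ in range(5000):
    idx = rng.choice(N, size=B, replace=False)
    g_B = X[idx].T @ residual[idx] / B
    samples.append(g_B - full_grad)
Sigma_emp = np.cov(np.array(samples).T)

Sigma_pred = 2.0 * L / B * H                         # prediction of Eq. (6)
print("trace(empirical)  =", np.trace(Sigma_emp))
print("trace(predicted)  =", np.trace(Sigma_pred))
print("relative deviation:",
      np.linalg.norm(Sigma_emp - Sigma_pred) / np.linalg.norm(Sigma_pred))
```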
These previous works, however, do not take the parameter dependence of the SGD noise strength into account, and consequently, escape rates derived there depend exponentially on the loss barrier height, which differs from our formula. Compared with the anisotropy of the SGD noise, the inhomogeneity of the SGD noise strength has been less explored. In (Meng et al., 2020; Wojtowytsch, 2021), the SGD dynamics under a state-dependent noise is discussed. However, in these previous works, the connection between the noise strength and the loss function was not theoretically established, and the logarithmic loss landscape was not discussed. The instability due to large effective dimensions was also not shown. Another recent work (Pesme et al., 2021) observed that the noise is proportional to the loss for specific simple models. In our paper, such a result is derived for more generic models. G urb uzbalaban et al. (2021) showed that SGD will converge to a heavy-tailed stationary distribution due to a multiplicative nature of the SGD noise in a simple linear regression problem. Our paper strengthens this result: we argue that such a heavy-tailed distribution generically appears for the mean-square loss. 2. Background We consider supervised learning. Let D = {(x(µ), y(µ)) : µ = 1, 2, . . . , N} be the training dataset, where x(µ) Rd denotes a data vector and y(µ) R be its label. The network output for a given input x is denoted by f(θ, x) R, where θ RP stands for a set of trainable parameters with P being the number of trainable parameters. In this work, we focus on the mean-square loss L(θ) = 1 2N h f(θ, x(µ)) y(µ)i2 =: 1 (1) The training proceeds through optimization of L(θ). In most machine-learning applications, the optimization is done via SGD or its variants. In SGD, the parameter θk+1 at the time step k + 1 is determined by θk+1 = θk η LBk(θk), LBk(θ) = 1 µ Bk ℓµ(θ), (2) where η > 0 is the learning rate, Bk {1, 2, . . . , N} with |Bk| = B is a mini-batch used at the kth time step, and LBk denotes the mini-batch loss. Since the training dataset D is randomly divided into mini-batches, the dynamics defined by Eq. (2) is stochastic. When B = N, the full training data samples are used for every iteration. In this case, the dynamics is deterministic and called gradient descent (GD). SGD is interpreted as GD with stochastic noise. By introducing the SGD noise ξk = [ LBk(θk) L(θk)], Eq. (2) is rewritten as θk+1 = θk η L(θk) + ηξk. (3) Obviously, ξk = 0, where the brackets denote the average over possible choices of mini-batches. The noise covariance Power-Law Escape Rate of SGD matrix is defined as Σ(θk) := ξkξT k . The covariance structure of the SGD noise is important in analyzing the SGD dynamics, which will be discussed in Section 3.1. 2.2. Stochastic differential equation for SGD When the parameter update for each iteration is small, which is typically the case when the learning rate η is sufficiently small, the continuous-time approximation can be used (Li et al., 2017; Smith & Le, 2018). By introducing a continuous time variable t R and regarding η as an infinitesimal time step dt, we have a SDE dθt = L(θt)dt + p ηΣ(θt) d Wt, (4) where d Wt N(0, IP dt) with In being the n-by-n identity matrix, and the multiplicative noise p ηΣ(θt) d Wt is interpreted as Itˆo since the noise ξk in Eq. (3) should not depend on θk+1. Throughout this work, we consider the continuous-time approximation (4) with Gaussian noise. 
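A minimal sketch of the update (2) and the noise decomposition (3) for a linear model with the mean-square loss is given below. The synthetic dataset and hyperparameters are illustrative choices; the pooled statistics are only meant to show that the noise has zero mean over mini-batches and shrinks together with the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.standard_normal((N, d))     # data vectors x^(mu)
y = rng.standard_normal(N)          # labels y^(mu)

def loss_and_grad(theta, idx=None):
    # mean-square loss (Eq. 1) and its gradient, over the full set or a mini-batch
    Xs, ys = (X, y) if idx is None else (X[idx], y[idx])
    r = Xs @ theta - ys
    return 0.5 * np.mean(r ** 2), Xs.T @ r / len(ys)

eta, B, steps = 0.05, 50, 2000
theta = rng.standard_normal(d)
noise_sq, losses = [], []
for _ in range(steps):
    batch = rng.choice(N, size=B, replace=False)
    _, g_B = loss_and_grad(theta, batch)
    L, g = loss_and_grad(theta)
    xi = -(g_B - g)                   # SGD noise, Eq. (3); zero mean over mini-batches
    noise_sq.append(xi @ xi)
    losses.append(L)
    theta = theta - eta * g_B         # SGD update, Eq. (2)

print("early training: loss %.3f, mean |xi|^2 %.4f" % (np.mean(losses[:200]), np.mean(noise_sq[:200])))
print("late  training: loss %.3f, mean |xi|^2 %.4f" % (np.mean(losses[-200:]), np.mean(noise_sq[-200:])))
# The noise variance shrinks together with the loss, anticipating Eq. (6).
```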
In machine learning, the gradient Langevin dynamics (GLD) is also considered, in which the isotropic and uniform Gaussian noise is injected into the GD as dθt = L(θt)dt + 2Dd Wt, (5) where D > 0 corresponds to the noise strength (it is also called the diffusion coefficient) (Sato & Nakagawa, 2014; Zhang et al., 2017b; Zhu et al., 2019). The GLD yields an escape rate proportional to e L/D (Eyring, 1935; Kramers, 1940). We will see in Section 4 that the SGD noise structure, which is characterized by Σ(θ), drastically alters the escape rate from a local minimum. 3. Dynamics of SGD The fluctuation of SGD and the local landscape are the two most important factors that influence the escaping behavior of learning. In Sections 3.1 and 3.2, we establish the form of the noise of SGD and show that under this type of noise, the relevant landscape for consideration should be log L instead of L. We then use these two results to derive the escape rate of SGD in Section 4. 3.1. Structure of the SGD noise covariance The SGD noise covariance matrix Σ(θ) significantly affects the dynamics (Jastrze bski et al., 2017; Smith & Le, 2018; Zhu et al., 2019; Ziyin et al., 2021). In this section, under some approximations, we derive the following expression of Σ(θ) for the mean-square loss near a local minimum θ : B H(θ ), (6) where H(θ) = 2L(θ) is the Hessian. It should be noted that L(θ ) = 0 and H(θ ) is positive semidefinite at any local minima θ . We give a derivation below and the list of the approximations and their justifications in Appendix A. In particular, the following decoupling approximation is a key assumption in the derivation. Assumption 3.1 (decoupling approximation). The quantities ℓµ and C(µ) f (θ) := f(θ, x(µ)) f(θ, x(µ))T are uncorrelated, which implies µ=1 ℓµC(µ) f (θ) = µ=1 C(µ) f (θ). (7) This approximation seems justified for large networks in which f behaves as a random vector. In Section 5.1, we experimentally verify the decoupling approximation for the entire training dynamics. Our formula (6) possesses two important properties. First, the noise is aligned with the Hessian, which is well known and has been pointed out in the literature (Jastrze bski et al., 2017; Zhu et al., 2019; Xie et al., 2021; Liu et al., 2021). If the loss landscape has flat directions, which correspond to the directions of the Hessian eigenvectors belonging to vanishingly small eigenvalues, the SGD noise does not arise along these directions. Consequently, the SGD dynamics is frozen along the flat directions, which effectively reduces the dimension of the parameter space explored by the SGD dynamics. This reduction plays an important role in the escape efficiency. Indeed, we will see that the escape rate crucially depends on the effective dimension of a given local minimum. Second, the noise is proportional to the loss function, which is indeed experimentally confirmed in Section 5.2. This property has not been pointed out and not been taken into account in previous studies (Jastrze bski et al., 2017; Zhu et al., 2019; Xie et al., 2021; Liu et al., 2021) and therefore gives new insights into the SGD dynamics. Indeed, this property allows us to formulate the Langevin equation on the logarithmic loss landscape with simple additive noise as discussed in Section 3.2. This new formalism yields the power-law escape rate summarized as Theorem 4.7, and the importance of the effective dimension of local minima for their stability. It should be noted that although Eq. 
(6) is derived for the mean-square loss, the proportionality between the noise strength and the loss function seems to hold for more general loss functions such as the cross-entropy loss (see Appendix D for the detail). Derivation of Eq. (6): We start from an analytic expression Power-Law Escape Rate of SGD of Σ(θ), which reads µ=1 ℓµ ℓT µ L LT ! µ=1 ℓµ ℓT µ L LT ! where B N is assumed in the second equality. The derivation of Eq. (8) is found in Jastrze bski et al. (2017); Smith & Le (2018). Usually, the gradient noise variance dominates the square of the gradient noise mean, and hence the term L LT in Eq. (8) is negligible (Shwartz-Ziv & Tishby, 2017; Zhu et al., 2019). For the mean-square loss, we have ℓµ = [f(θ, x(µ)) y(µ)] f(θ, x(µ)), and hence µ=1 ℓµ f(θ, x(µ)) f(θ, x(µ))T µ=1 ℓµC(µ) f (θ) 2L(θ) µ=1 C(µ) f (θ), (9) where the decoupling approximation (7) is used in the last equality. Equation (9) is directly related to the Hessian of the loss function near a (local or global) minimum. The Hessian H(θ) = 2L(θ) is written as n C(µ) f (θ) + h f(θ, x(µ)) y(µ)i 2f(θ, x(µ)) o . (10) It is shown by Papyan (2018) that the last term of Eq. (10) does not contribute to outliers (i.e. large eigenvalues) of the Hessian. The dynamics near a local minimum is governed by outliers, and hence we can ignore this term. At θ = θ , we therefore obtain µ=1 C(µ) f (θ ). (11) Let us assume C(µ) f (θ) C(µ) f (θ ) for θ within the valley of a local minimum θ . We then obtain the desired expression (6) by substituting it into Eq. (9). 3.2. Logarithmized loss landscape Let us consider the Itˆo SDE (4) with the SGD noise covariance (6) near a local minimum θ , which is written as dθt = L(θt)dt + B H(θ ) d Wt. (12) Let us consider a stochastic time t(τ) for τ 0 as 0 dt L(θt ), (13) and perform a random time change from t to τ (Øksendal, 1998). Correspondingly, we introduce the Wiener process d Wτ N(0, IP dτ). Since dτ = L(θt)dt, we have d Wτ = p L(θt) d Wt. In terms of the notation θτ = θt, Eq. (12) is expressed as d θτ = 1 L( θτ) L( θτ)dτ + = h log L( θτ) i dτ + B d Wτ. (14) We should note that at a global minimum with L(θ) = 0, which is realized in an overparameterized regime (Zhang et al., 2017a), the random time change through Eq. (13) is ill-defined since τ is frozen at a finite value once the model reaches a global minimum. We can overcome this difficulty by adding an infinitesimal constant ϵ > 0 to the loss, which makes the loss function positive without changing the finitetime dynamics like the escape from a local minimum θ with L(θ ) > 0. In this way, the Langevin equation on the loss landscape L(θ) with multiplicative noise is transformed to that on the logarithmic loss landscape U(θ) = log L(θ) with simpler additive noise. This formulation indicates the importance of considering the logarithmic loss landscape U(θ) = log L(θ). In the following, we use Eq. (14) to discuss the escape efficiency from local minima. 4. Escape rate from local minima Now we present our main result on the escape from a local minimum θ . First, we formally define some key concepts regarding the escape problem. Definition 4.1 (basin of attraction). For a given local minimum θ , its basin of attraction Aθ (or a valley of θ ) is defined as the set of all the starting points θ0 such that θt θ as t when there is no noise. Definition 4.2 (escape rate). We denote by P the total probability within Aθ and by J the total flux of the probability current across the boundary of Aθ .1 The escape rate κ is then defined as J /P (Kramers, 1940). 
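The following Euler–Maruyama sketch (a one-dimensional illustration with arbitrary parameter values, not the paper's code) integrates the transformed SDE (14) on a quadratic valley L(θ) = L* + hθ²/2 and accumulates the physical time t from the rescaled time τ by inverting Eq. (13). In the original variables, the same dynamics would carry the multiplicative noise amplitude sqrt(2ηL(θ)h/B) of Eq. (12); after the time change only the constant amplitude sqrt(2ηh/B) remains.

```python
import numpy as np

rng = np.random.default_rng(2)

# One-dimensional quadratic valley with a nonzero loss offset at the minimum.
L_star, h = 0.05, 2.0
L = lambda th: L_star + 0.5 * h * th ** 2
grad_logL = lambda th: h * th / L(th)          # d/dtheta log L(theta)

eta, B = 0.1, 10
noise_scale = np.sqrt(2.0 * eta * h / B)       # additive noise amplitude in Eq. (14)

dtau, n_steps = 1e-3, 100_000
theta, t_phys = 0.0, 0.0
traj = np.empty(n_steps)
for k in range(n_steps):
    dW = rng.standard_normal() * np.sqrt(dtau)
    theta += -grad_logL(theta) * dtau + noise_scale * dW    # Eq. (14), Ito/Euler-Maruyama
    t_phys += dtau / L(theta)                               # invert Eq. (13): dt = dtau / L
    traj[k] = theta

print("rescaled time tau       =", n_steps * dtau)
print("recovered physical time =", t_phys)
print("stationary fluctuation of theta:", traj[n_steps // 2:].std())
```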
1By defining the probability current density J(θ, t) via the continuity equation P(θ, t)/ t = J(θ, t) for the probability distribution P(θ, t) of θ at time t, J is explicitly given by J = R Aθ dθ J(θ, t) = R Aθ da J(θ, t), where the second equality follows from Gauss law. Here, Aθ denotes the boundary of Aθ and da is a vector representing an infinitesimal element of area of the surface. Power-Law Escape Rate of SGD We now calculate the asymptotic behavior of the escape rate in the weak-noise limit η/B +0, which has been investigated in many fields, including physics (Kramers, 1940; Langer, 1969), chemistry (Eyring, 1935), stochastic processes (Bovier et al., 2004; Berglund, 2013), and machine learning (Jastrze bski et al., 2017; Xie et al., 2021). In these previous studies, the following two basic assumptions are commonly used, which are expected to hold in the weaknoise limit. Indeed, the Kramers formula (Kramers, 1940), which is obtained by using Assumptions 4.3 and 4.4 below, is rigorously justified (Bovier et al., 2004). Assumption 4.3 (quasi-stationary approximation). The parameters θ obey the quasi-stationary distribution near θ , which is identified as the stationary distribution restricted to Aθ (Bianchi & Gaudilli ere, 2016). Assumption 4.4 (most probable escape path). The escape from θ is dominated by the most probable escape path (MPEP) (Freidlin & Wentzell, 1998; Maier & Stein, 1993). It is known that the direction of the MPEP at one point must be the direction of one eigenvector of the Hessian at that point, and the MPEP passes a saddle point θs at the boundary of Aθ . Now we introduce further key assumptions in our analysis for P > 1. Assumption 4.5 (Hessian outliers). The eigenvalue spectrum of the Hessian H(θ ) at θ has n nonzero eigenvalues, which are called outliers . The remaining P n eigenvalues are vanishingly small. Assumption 4.6 (low-dimensional subspace of SGD). Outlier eigenvectors of the Hessian v1, v2, . . . , vn RP do not change within Aθ , and the SGD dynamics is restricted to the n-dimensional subspace spanned by those outlier eigenvectors (n is called the effective dimension of the local minimum). Assumption 4.5 is empirically confirmed (Sagun et al., 2017; Papyan, 2019). Since the SGD dynamics is frozen along flat directions as we pointed out in Section 3.1, the restriction to the n-dimensional outlier subspace in Assumption 4.6 is justified. Indeed, it is empirically observed that the SGD dynamics is actually restricted to a low-dimensional subspace spanned by top eigenvectors of the Hessian (Gur-Ari et al., 2018). Now we are ready to state our main result in the form of the following theorem. Theorem 4.7. Let the model parameter θt evolve according to the SDE in Eq. (12). Under Assumptions 4.3 to 4.6, the escape rate κ asymptotically behaves as B ηh e +1 n as η/B +02, where θs is the saddle on the MPEP, h e is the eigenvalue of the Hessian at θ along the MPEP, and hs e is the negative eigenvalue of the Hessian at θs along the MPEP. Proof. The proof is given in Appendix B. From Eq. (15) we can obtain some implications. The factor [L(θs)/L(θ )] B ηh e +1 n 2 increases with h e and n, which indicates that sharp minima (i.e. minima with large h e) or minima with large n are unstable. This fact explains why SGD finds flat minima with a low effective dimension n. Equation (15) also implies that the effective dimension of any stable minima must satisfy n < 2(B/(ηh e) + 1). 
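The dominant dependence in Eq. (15) is the power-law factor [L(θ*)/L(θs)]^{B/(ηh_e*)+1−n/2}. The short sketch below (with illustrative values of the losses, η, and B) evaluates only this factor, together with the critical value 2(B/(ηh_e*)+1) appearing in the stability condition above; the prefactor of Eq. (15) is omitted.

```python
import numpy as np

def power_law_factor(L_min, L_saddle, eta, B, h_e, n):
    """Dominant factor of Eq. (15): [L(theta*)/L(theta_s)]**(B/(eta*h_e) + 1 - n/2)."""
    exponent = B / (eta * h_e) + 1.0 - n / 2.0
    return (L_min / L_saddle) ** exponent

L_min, L_saddle = 0.01, 0.05       # illustrative loss values at the minimum and the saddle
eta, B = 0.1, 10

for h_e in (10.0, 50.0, 100.0):    # larger h_e: sharper minimum
    n_c = 2.0 * (B / (eta * h_e) + 1.0)   # critical effective dimension from the bound above
    for n in (5, 20, 50):
        factor = power_law_factor(L_min, L_saddle, eta, B, h_e, n)
        label = "stable" if n < n_c else "unstable"
        print(f"h_e={h_e:6.1f}  n={n:3d}  n_c={n_c:5.1f}  relative rate ~ {factor:9.3e}  ({label})")
# Sharper minima (large h_e) and larger effective dimensions n escape faster;
# a factor exceeding 1 signals that the stability condition n < n_c is violated.
```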
This condition guarantees the stability along the MPEP, but any minimum must be stable along all the directions. The stability condition along the direction of the ith eigenvector of the Hessian is obtained by replacing h e by hi. Therefore, the maximum eigenvalue hmax gives the most stringent stability condition: n < nc := 2 B ηhmax + 1 . (16) The instability due to a large effective dimension is a new insight naturally explained by the picture of the logarithmic loss landscape. Let us summarize novel aspects of the escape rate derived here, which are not found in previous studies (Jastrze bski et al., 2017; Wu et al., 2018; Zhu et al., 2019; Meng et al., 2020; Xie et al., 2021; Liu et al., 2021). First, the escape rate in Eq. (15) takes a power-law form with respect to the ratio between L(θ ) and L(θs), which differs from a conventional exponential form with respect to the difference L = L(θs) L(θ ). Intuitively, escaping behavior must be a result of noise in the gradient, and if there is no noise, gradient descent cannot escape any basin of attraction. Our result agrees with this intuition: at the global minimum where L(θ ) = 0, there is no noise, and the SGD must be stuck in where it is with no escaping behavior, which is directly reflected by our formula. However, the standard exponential escape rate formula predicts a non-zero escape rate even when L(θ ) = 0, which cannot be correct in principle. Second, Eqs. (15) and (16) imply that the SGD is biased to small effective dimensions n. As we mentioned, it is empirically confirmed that the SGD dynamics is restricted to a low-dimensional subspace spanned by top eigenvectors of the Hessian (Gur-Ari et al., 2018). Since the restriction to a low-dimensional subspace greatly reduces the capacity of the network (Li et al., 2018), Eq. (16) supports the presence of an implicit regularization via the SGD. 2Here, the notation f(x) g(x) as x +0 means limx +0(f(x)/g(x)) = 1. Power-Law Escape Rate of SGD The formula in Eq. (15) is tested in Section 5.3. Since it is difficult in practice to identify when the barrier crossing occurs, we instead consider the mean first passage time, which imitates the escape time for a non-convex loss landscape. Let us fix a threshold value of the loss function. The first passage time tp is defined as the shortest time at which the loss exceeds the threshold value. We identify the threshold value as L(θs), i.e., the loss at the saddle in the escape problem. It is expected that tp is similar to the escape time and proportional to κ 1. Indeed, at least when noise is isotropic and uniform as in the GLD (5), the inverse of the mean first passage time 1/ tp is logarithmically equivalent to κ, i.e., log κ log tp , in the weak-noise asymptotic limit (Freidlin & Wentzell, 1998; Berglund, 2013). 5. Experiments Our key theoretical observation is that the SGD noise strength is proportional to the loss function, which is obtained as a result of the decoupling approximation. This property leads us to the Langevin equation (14) with the logarithmic loss gradient and an additive noise through a random time change (13). Equation (14) implies the escape rate (15). In Section 5.1, we show that the decoupling approximation is valid during the entire training dynamics. In Section 5.2, we measure the SGD noise strength and confirm that it is indeed proportional to the loss function near a minimum. In Section 5.3, we experimentally test the validity of Eq. (15) for the escape rate. 
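Since it is difficult to identify barrier crossings directly, the experiments below record the first passage time of the loss across a threshold L(θs) = c·L(θ*). A minimal helper (assuming the loss is recorded at every iteration; the synthetic trace is only for illustration) is:

```python
import numpy as np

def first_passage_time(loss_history, threshold):
    """Index of the first recorded iteration at which the loss exceeds `threshold`.

    Returns None if the threshold is never crossed within the recorded history.
    """
    above = np.asarray(loss_history) > threshold
    idx = np.argmax(above)            # index of the first True (0 if none are True)
    return int(idx) if above.any() else None

# Example: a synthetic loss trace that drifts upward with multiplicative noise.
rng = np.random.default_rng(3)
trace = 0.01 * np.exp(0.001 * np.arange(5000)) * np.exp(0.1 * rng.standard_normal(5000))
print(first_passage_time(trace, threshold=3 * 0.01))   # c = 3 relative to L(theta*) = 0.01
```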
We will see that numerical results for a linear regression and for a non-linear neural network agree with our theoretical results. 5.1. Experimental verification of the decoupling approximation Let us compare the eigenvalue distribution of the exact matrix (1/N) PN µ=1 ℓ(µ)C(µ) f with that of the de- coupled one L(θ) (1/N) PN µ=1 C(µ) f with C(µ) f = f(θ, x(µ)) f(θ, x(µ))T. We consider a binary classification problem using the first 104 samples of the MNIST dataset such that we classify each image into even (its label is y = +1) or odd number (its label is y = 1). The network has two hidden layers, each of which has 100 units and the Re LU activation, followed by the output layer of a single unit with no activation. Starting from the Glorot initialization, the training is performed via SGD with the mean-square loss, where we fix η = 0.01 and B = 100. Figure 1 shows histograms of their eigenvalues at different stages of the training: (a) at initialization, (b) after 50 epochs, and (c) after 500 epochs. We see that the exact matrix and the approximate one have statistically similar eigenvalue distributions except for exponentially small eigenvalues dur- ing the training dynamics. This shows that the decoupling approximation holds during the entire training dynamics in this experiment. 5.2. Measurements of the SGD noise strength Now we test whether the SGD noise strength is actually proportional to the loss function, which is predicted by Eq. (6), an essential theoretical result of ours. As a measure of the SGD noise strength, let us consider the norm of the noise vector ξ given by ξTξ = Tr Σ N/B, where N := (1/N) PN µ=1 ℓT µ ℓµ LT L. Here we present experimental results for two architectures and datasets. First, we consider training of the Fashion-MNIST dataset by using a fully connected network with three hidden layers, each of which has 2 103 units and the Re LU activation, followed by the output layer of 10 units with no activation (classification labels are given in the one-hot representation). Second, we consider training of the CIFAR-10 dataset by using a convolutional neural network. Following Keskar et al. (2017), let us denote a stack of n convolutional layers of a filters and a kernel size of b c with the stride length of d by n [a, b, c, d]. We use the configuration: 3 [64, 3, 3, 1], 3 [128, 3, 3, 1], 3 [256, 3, 3, 1], where a Max Pool(2) is applied after each stack. To all layers, the Re LU activation is applied. Finally, an output layer consists of 10 units with no activation. Starting from the Glorot initialization, the network is trained by SGD of the mini-batch size B = 100 and η = 0.1 for the mean-square loss. During the training, we measure the training loss and the noise strength N for every epoch. Numerical results are given in Fig. 2. We see that roughly N L at a later stage of the training, which agrees with our theoretical prediction. Although N is not proportional to L at an early stage of training, it does not mean that Eq. (9) is invalid there. Since the decoupling approximation is valid for the entire training dynamics, Eq. (9) always holds. The reason why the SGD noise strength does not decrease with the loss function in the early-stage dynamics is that N 2L(θ) (1/N) PN µ=1 f(θ, x(µ))T f(θ, x(µ)), but the quantity (1/N) PN µ=1 f(θ, x(µ))T f(θ, x(µ)) also changes during training. Although Eq. 
(9) is derived for the mean-square loss, the relation N L(θ) holds in more general loss functions; see Appendix D for general argument and experiments on the cross entropy loss. 5.3. Experimental test of the escape rate formula We now experimentally verify our escape-rate formula (15). Below, we first present numerical results for a simple linear regression problem, and then for a nonlinear model, i.e., a Power-Law Escape Rate of SGD (a) at initialization (b) at 50 epochs (c) at 500 epochs 20 10 0 10 log(eigenvalues) 103 exact decoupling 20 10 0 10 log(eigenvalues) 103 exact decoupling 20 10 0 10 log(eigenvalues) 103 exact decoupling Figure 1. Comparison of the eigenvalue distributions of the left-hand side (exact expression) and the right-hand side (decoupled one) of Equation (7). They agree with each other except for exponentially small eigenvalues during the entire training dynamics. For the detail, see the description in Section 5.1. 0 20 40 60 80 epoch (SGD noise strength)/(1.3 103) 0 20 40 60 80 100 epoch (SGD noise strength)/(1.5 103) 0 20 40 60 80 100 epoch (SGD noise strength)/(1.5 103) 10 3 10 2 10 1 SGD noise strength Figure 2. Training dynamics of the loss and the SGD noise strength N. In the figure, we multiply N by a numerical factor to emphasize that N is proportional to the loss in a later stage of the training. (Top left) Fully connected network trained by the Fashion-MNIST dataset. (Top right) Convolutional network trained by the CIFAR-10 dataset. (Bottom left) Mean values with standard deviations (shaded regions) taken over 10 independent runs with different initializations in convolutional network trained by the CIFAR-10 dataset. (Bottom right) Loss vs N in the training of the convolutional network. The dashed line is a straight line of slope 1, which implies N L(θ). neural network. 5.3.1. LINEAR REGRESSION Let us start with the following linear regression problem: each entry of x(µ) Rd and its label y(µ) R are i.i.d. Gaussian random variables of zero mean and unit variance. We focus on the case of d N. The output for an input x is given by f(θ, x) = θTx, where θ Rd is the trainable network parameter. We optimize θ via SGD. The mean-square loss L(θ) = (1/2N) PN µ=1 θx(µ) y(µ) 2 is quadratic and has a unique minimum at θ 0. the Hessian H = (1/N) PN µ=1 x(µ)x(µ)T has d nonzero eigenvalues, all of which are close to unity. We can therefore identify h e = h = 1 and n = d. Before investigating the escape rate, we remark that the stationary distribution of θ in this case is theoretically obtained because of the isotropy of SGD noise. Indeed, H(θ ) in Eq. (14) can be replaced by h Id, where Id is the d-dimensional identity matrix, and then the stationary distribution of θτ is simply given by the Gibbs distribution under the potential U(θ) = log L(θ): Ps(θ) e U(θ)/T = L(θ) 1/T , where the temperature T is given by T = ηh /B. In Appendix C, it is shown that the stationary distribution Ps(θ) of θt under Eq. (12) is related to Ps(θ) via Ps(θ) L(θ) 1 Ps(θ). We therefore have Ps(θ) L(θ) φ, φ = 1 + B Remarkably, it depends on L(θ) polynomially rather than Power-Law Escape Rate of SGD (a) Exponent φ in Ps(θ) (b) tp for the linear regression (c) tp for the neural network 1 10 100 1000 1 2 3 4 5 6 7 mean first passage time d=1 d=5 d=10 0.1 mean first passage time η=0.01 η=0.015 Figure 3. (a) Exponent φ for the stationary distribution Ps(θ) L(θ) φ for d = 1 in the linear regression. Dashed lines show the theoretical prediction φ = 1 + B/(ηh ). 
(b) Log-log plot of the mean first-passage time tp vs c = L(θs)/L(θ ) for B = 1 and η = 0.1 in the linear regression. Dashed lines show the theoretical prediction, tp κ 1 cφ d/2. (c) Log-log plot of tp vs c for B = 10 and various η in the neural network. Dashed lines show the theoretical prediction, tp κ 1 c B/(ηh e)+1 n/2 with h e = 95.6 and n = 9. exponentially as in the standard GLD (Jastrze bski et al., 2017; Sato & Nakagawa, 2014; Zhang et al., 2017b). This property is closely related to the power-law form of the escape rate. First, we test Eq. (17), i.e. the stationary distribution, for d = 1 and N = 105. We sampled the value of θk at every 100 iterations (k = j 100, j = 1, 2, . . . , 104) and made a histogram. We then fit the histogram to the form Ps(θ) L(θ) φ and determine the exponent φ. Our theory predicts φ = 1+B/(ηh ). Numerical results for the exponent φ are presented in Fig. 3 (a) against B for three fixed learning rates η. In the same figure, theoretical values of φ are plotted in dashed lines. The agreement between theory and experiment is fairly well. For a large learning rate η = 1, the exponent slightly deviates from its theoretical value. This is due to the effect of a finite learning rate (recall that η is assumed to be small in deriving the continuous-time stochastic differential equation). Next, we test our formula on the escape rate, Eq. (15). Although the mean-square loss is quadratic and no barrier crossing occurs, we can measure the first passage time, which imitates the escape time for a non-convex loss landscape as is mentioned in Section 4. The mean first passage time over 100 independent runs is measured for varying threshold values which are specified by c = L(θs)/L(θ ) > 1. Experimental results for N = 104 are presented in Fig. 3 (b). Dashed straight lines have slope B/(ηh ) + 1 n/2. Experiments show that the first passage time behaves as tp [L(θs)/L(θ )]B/(ηh )+1 n/2, which agrees with our theoretical evaluation of κ 1 [see Eq. (15)]. We conclude that the escape rate crucially depends on the effective dimension n, which is not explained by the previous results (Zhu et al., 2019; Xie et al., 2021; Liu et al., 2021; Meng et al., 2020). 5.3.2. NEURAL NETWORK The escape-rate formula (15) is also verified in a non-linear model. As in Section 5.1, we consider the binary classification problem using the first 104 samples of MNIST dataset such that we classify each image into even or odd. The network has one hidden layer with 10 units activated by Re LU, followed by the output layer of a single unit with no activation. We always use the mean-square loss. Starting from the Glorot initialization, the network is pre-trained via SGD with η = 0.01 and B = 100 for 105 iterations. We find that after pre-training, the loss becomes almost stationary around at L(θ) 0.035. We assume that the pre-trained network is near a local minimum. We then further train the pre-trained network via SGD with a new choice of η and B (here we fix B = 10), and measure the first passage time tp. The mean first-passage time over 100 independent runs is measured for varying threshold values which are specified by c = L(θs)/L(θ ) > 1. Experimental results are presented in Fig. 3 (c). We see the power-law behavior, which is consistent with our theory. To further verify our theoretical formula (15), we also compare the power-law exponent for the mean first-passage time with our theoretical prediction B/(ηh e) + 1 n/2. Here, h e and n are estimated by the Hessian eigenvalues. 
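A minimal reproduction sketch of the linear-regression first-passage-time measurement described above is given below (smaller N, fewer runs, and a different η than in the figure, purely to keep the runtime short; mini-batches are drawn with replacement for simplicity). The measured mean t_p can be compared on a log-log plot against c with the theoretical slope B/(ηh*) + 1 − n/2.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 500, 1
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
L_star = loss(theta_star)

eta, B, runs, max_steps = 0.2, 1, 30, 50_000
h_star, n = 1.0, d                   # Hessian eigenvalues are close to 1; effective dimension is d
theory_slope = B / (eta * h_star) + 1.0 - n / 2.0

for c in (1.5, 2.0, 3.0):
    threshold = c * L_star
    times = []
    for _ in range(runs):
        theta = theta_star.copy()
        for k in range(1, max_steps + 1):
            idx = rng.integers(0, N, size=B)                # mini-batch of size B
            g = X[idx].T @ (X[idx] @ theta - y[idx]) / B
            theta = theta - eta * g
            if loss(theta) > threshold:
                times.append(k)
                break
    print(f"c = {c:.1f}: mean first-passage time {np.mean(times):.1f} "
          f"over {len(times)} crossings (theoretical log-log slope {theory_slope:.1f})")
```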
In Appendix E, we present a numerical result for eigenvalues of the approximate Hessian given by the right-hand side of Eq. (11). By identifying the largest eigenvalue of the Hessian as h e, we have h e 95.6. On the other hand, it is difficult to precisely determine the effective dimension n, but it seems reasonable to estimate n 10. It turns out that theory and experiment agree with each other by choosing n = 9. Dashed lines in Fig. 3 (c) correspond to our theoretical prediction κ 1 c B/(ηh e)+1 n/2 with h e = 95.6 and n = 9. This excellent agreement shows that our theoretical formula (15) is also valid in non-linear models. Power-Law Escape Rate of SGD 5.4. Stability condition and minima selection As we have argued in Section 4, any stable solution found by SGD should satisfy Equation (16). For a fixed value of B, Equation (16) implies the existence of the critical learning rate ηc = 2 n 2 B hmax , (18) above which a minimum with a certain value of h max and n is unstable. Moreover, Equation (16) indicates that increasing η/B, i.e. making the SGD noise stronger, leads to a flatter minimum with a lower effective dimension. We numerically test these statements in the following. We conducted binary classification of 5,000 samples of MNIST dataset (classify each image into even or odd number) by using a fully connected network with three hidden layers of width 100. The number of trainable parameters is about 105. First, we perform pre-training with fixed values of the learning rate ηini and the batch size Bini. Here we consider the two cases (ηini, Bini) = (0.1, 50) and (0.05, 100), the former of which corresponds to stronger SGD noise compared with the latter. In both cases, we find that the loss becomes almost stationary at a small value after 104 iterations. We then measure the largest Hessian eigenvalue hmax and the effective dimension n by the method in Appendix E, and determine ηc for B = 8. Next, we fix B = 8 and gradually increase the learning rate from a small value η = 0.001 and measure the loss. When the minimum becomes unstable, the loss will rapidly increase and fluctuate. Such behavior is actually observed in Figure 4. We find that our theoretical prediction of ηc (vertical dashed lines in Figure 4) is relatively close to the actual transition point. In this way, Equation (16) is relevant in learning with a neural network. Furthermore, by comparing hmax and n in Figure 4 (a) and (b), we find that increasing the SGD noise results in a flatter minimum with a lower effective dimension. It shows that SGD has a bias towards flatter minima with lower effective dimensions. 6. Conclusion In this work, we have investigated the SGD dynamics via a Langevin approach. With several approximations listed in Appendix A, we have derived Eq. (6), which shows that the SGD noise strength is proportional to the loss function. This SGD noise covariance structure yields the stochastic differential equation (14) with additive noise near a minimum via a random time change (13). The original multiplicative noise is reduced to simpler additive noise, but instead the gradient of the loss function is replaced by that of the logarithmic loss function U(θ) = log L(θ). This new formalism yields the power-law escape rate formula (15) whose exponent depends on η, B, h e, and n. 0.00 0.02 0.04 0.06 0.08 0.10 0.12 learning rate (a) strong SGD noise ηini = 0.1, Bini = 50 hmax = 8.54, n = 20.1 0.00 0.01 0.02 0.03 0.04 learning rate (b) weak SGD noise ηini = 0.05, Bini = 100 hmax = 18.4, n = 28.1 Figure 4. 
Stability of a minimum found by SGD with η = ηini and B = Bini. After finding a minimum, we fix B = 8 and gradually increase the learning rate η. When η exceeds a threshold value, the minimum becomes unstable and the loss rapidly increases and fluctuates. Vertical dashed lines correspond to the critical learning rate ηc estimated by Equation (16). The escape-rate formula in Eq. (15) explains an empirical fact that SGD favors flat minima with low effective dimensions. The effective dimension of a minimum must satisfy Eq. (16) for its stability. This result as well as the formulation of the SGD dynamics using the logarithmic loss landscape should help understand more deeply the SGD dynamics and its implicit biases in machine learning problems. Although the present work focuses on the Gaussian noise, the non-Gaussianity can also play an important role. For example, S ims ekli et al. (2019) approximated SGD as a L evy-driven SDE, which explains why SGD finds wide minima. It would be an interesting future problem to take the non-Gaussian effect into account. Acknowledgements This work was supported by KAKENHI Grant Numbers JP21H05185 and JP22H01152 from the Japan Society for the Promotion of Science. LZ was financially supported by the GSS Scholarship of the University of Tokyo. KL was supported by Global Science Graduate Course (GSGC) program of the University of Tokyo. Power-Law Escape Rate of SGD Bapst, V., Keck, T., Grabska-Barwi nska, A., Donner, C., Cubuk, E. D., Schoenholz, S. S., Obika, A., Nelson, A. W., Back, T., Hassabis, D., and Kohli, P. Unveiling the predictive power of static structure in glassy systems. Nature Physics, 16(4):448 454, 2020. Berglund, N. Kramers Law: Validity , Derivations and Generalisations. Markov Processes and Related Fields, 19:459 490, 2013. Bianchi, A. and Gaudilli ere, A. Metastable states, quasistationary distributions and soft measures. Stochastic Processes and their Applications, 126:1622 1680, 2016. Bovier, A., Eckhoff, M., Gayrard, V., and Klein, M. Metastability in reversible diffusion processes I. Sharp asymptotics for capacities and exit times. Journal of the European Mathematical Society, 6:399 424, 2004. Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, 2008. Eyring, H. The Activated Complex in Chemical Reactions. Journal of Chemical Physics, 3:63 71, 1935. ISSN 00219606. Freidlin, M. I. and Wentzell, A. D. Random Perturbations of Dynamical Systems. Springer, 1998. Gur-Ari, G., Roberts, D. A., and Dyer, E. Gradient Descent Happens in a Tiny Subspace. ar Xiv preprint ar Xiv:1812.04754, 2018. G urb uzbalaban, M., S ims ekli, U., and Zhu, L. The Heavy Tail Phenomenon in SGD. In International Conference on Machine Learning, 2021. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Others. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29 (6):82 97, 2012. Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, 2017. Iten, R., Metger, T., Wilming, H., Del Rio, L., and Renner, R. Discovering Physical Concepts with Neural Networks. Physical Review Letters, 124(1):10508, 2020. 
Jastrze bski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three Factors Influencing Minima in SGD. ar Xiv:1711.04623, 2017. Keskar, N. S., Nocedal, J., Tang, P. T. P., Mudigere, D., and Smelyanskiy, M. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017. Kramers, H. A. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4): 284 304, 1940. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097 1105, 2012. Langer, J. S. Statistical theory of the decay of metastable states. Annals of Physics, 54(2):258 275, 1969. Le Cun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436 444, 2015. Li, C., Farkhoor, H., Liu, R., and Yosinski, J. Measuring the Intrinsic Dimension of Objective Landscapes. International Conference on Learning Representations, 2018. Li, Q., Tai, C., and Weinan, E. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, 2017. Liu, K., Ziyin, L., and Ueda, M. Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent. In International Conference on Machine Learning, 2021. Mac Kay, D. Bayesian model comparison and backprop nets. In Advances in Neural Information Processing Systems, 1992. Maier, R. S. and Stein, D. L. Escape problem for irreversible systems. Physical Review E, 48:931 938, 1993. Meng, Q., Gong, S., Chen, W., Ma, Z. M., and Liu, T. Y. Dynamic of Stochastic Gradient Descent with State Dependent Noise. ar Xiv:2006.13719, 2020. Øksendal, B. Stochastic differential equations: an introduction with applications. Springer, Berlin, 1998. Papyan, V. The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size. ar Xiv:1811.07062, 2018. Papyan, V. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. In International Conference on Machine Learning, 2019. Pesme, S., Pillaud-Vivien, L., and Flammarion, N. Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity. ar Xiv:2106.09524, 2021. Power-Law Escape Rate of SGD Pittorino, F., Ferraro, A., Perugini, G., Feinauer, C., Baldassi, C., and Zecchina, R. Deep Networks on Toroids: Removing Symmetries Reveals the Structure of Flat Regions in the Landscape Geometry. ar Xiv:2202.03038, 2022. Sagun, L., Evci, U., G uney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the hessian of over-parametrized neural networks. ar Xiv:1706.04454, 2017. Sato, I. and Nakagawa, H. Approximation analysis of stochastic gradient langevin dynamics by using fokkerplanck equation and ito process. In International Conference on Machine Learning, 2014. Seif, A., Hafezi, M., and Jarzynski, C. Machine learning the thermodynamic arrow of time. Nature Physics, 17: 105 113, 2021. Shwartz-Ziv, R. and Tishby, N. Opening the Black Box of Deep Neural Networks via Information. ar Xiv:1703.00810, 2017. S ims ekli, U., Sagun, L., and Giirbiizbalaban, M. A tailindex analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, 2019. Smith, S. L. and Le, Q. V. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018. Wojtowytsch, S. 
Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis. ar Xiv:2106.02588, 2021. Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V., and Zhu, Z. On the Noisy Gradient Descent that Generalizes as SGD. In International Conference on Machine Learning, 2020. Wu, L., Ma, C., and Weinan, E. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, 2018. Xie, Z., Sato, I., and Sugiyama, M. A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima. In International Conference on Learning Representations, 2021. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding Deep Learning Requires Rethinking of Generalization. In International Conference on Learning Representations, 2017a. Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of Machine Learning Research, 2017b. Zhu, Z., Wu, J., Yu, B., Wu, L., and Ma, J. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In International Conference on Machine Learning, 2019. Ziyin, L., Liu, K., Mori, T., and Ueda, M. On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes. ar Xiv:2102.05375, 2021. Power-Law Escape Rate of SGD A. List of approximations and their justifications In Section 3.1, we made several approximations to derive the result (6). For clarity, we list the approximations made and their justifications below. In Eq. (8), we make the approximation of B N, which simplifies the expression but is not essential. The main conclusion is not affected by this approximation. We ignore L LT in Eq. (8), which is justified near a local minimum. In Eq. (9), we make the decoupling approximation (7), which is one of the key heuristic approximations in our work. This approximation is verified experimentally in Section 5.1. We ignore the last term of Eq. (10), which is a common approximation (Sagun et al., 2017). This approximation is justified if we are only interested in the outliers of the Hessian eigenvalues. Indeed, outliers of the Hessian eigenvalues, which play dominant roles in escape from local minima, are known to be attributable to the first term of the right-hand side of Eq. (10) (Papyan, 2018). In Eq. (11), we assume that the matrix (1/N) PN µ=1 f(θ, x(µ)) f(θ, x(µ))T does not change so much in a valley with a given local minimum. This approximation is indirectly verified in Fig. 2 (the proportionality between the loss and the noise strength implies this matrix is actually constant). B. Proof of Theorem 4.7 We now prove Theorem 4.7. Under Assumption 4.6, dynamics of θ is restricted to the n-dimensional subspace spanned by the outlier eigenvectors v1, v2, . . . , vn RP of the Hessian. Consequently, θ is parametrized by n variables z1, z2, . . . , zn R as i=1 zivi. (19) The Hessian is a diagonal matrix in the basis of (v1, v2, . . . , vn) within Aθ because of Assumption 4.6. Therefore, 2U/ zi zj = 0 for any i = j, where U = log L is the logarithmic loss introduced in Eq. (14). It implies that U/ zi is a function of zi, and hence U is written in the form U(θ) = log L(θ ) + i=1 Ui(zi). (20) It should be noted that Ui(0) = 0, Ui(zi) > 0 for any zi = 0, Ui zi (0) = 0, 2Ui z2 i (0) = h i L(θ ), (21) where the last equality is obtained by putting U = log L and 2L/ z2 i z=0 = h i . 
The stochastic differential equation for the n-dimensional vector z = (z1, z2, . . . , zn)T in the rescaled time variable τ is given by dz = z Udτ + q 2η ˆH(θ )/Bd Wτ [see Eq. (14), which is equivalent to the following Fokker-Planck equation for the distribution function P(z, τ) of z at τ: zi P + ηh i B 2 z2 i P =: z J(z), (22) where the probability current density J(z) = (J1(z), J2(z), . . . , Jn(z))T is explicitly given by Ji(z) = Ui(zi) zi P(z) ηh i B zi P(z). (23) Under Assumption 4.4, the escape rate is evaluated by considering the MPEP. It is known that the MPEP aligns with one of the eigenvectors of the Hessian (Xie et al., 2021). Without loss of generality, let us assume that the direction of eth Power-Law Escape Rate of SGD eigenvector ve corresponds to the direction of the MPEP. We denote by z Rn 1 the displacement perpendicular to the escape direction, and write z = (ze, z ). At the saddle zs, hs e < 0 and hs i > 0 for all i = e, where {hs i} is the set of eigenvalues of the Hessian at zs, and zi (zs i ) = 0, 2Ui z2 i (zs i ) = hs i L(θs). (24) We now derive the escape rate formula for asymptotically weak noise η/B +0 following Kramers (1940). The steady current J Rn is aligned to the escape direction, and hence Je = 0 and J = 0 along the MPEP. Let us denote by P the total probability within the valley Aθ and by Jτ the total current per unit change of τ flowing to the outside of the valley through the saddle θs = θ + zs. From Assumption 4.3, z is assumed to be in the quasi-stationary distribution. Since the quasi-stationary distribution is concentrated around θ in the weak-noise limit, L(θt) is approximately equal to L(θ ) before the escape. Consequently, the rescaled time variable τ in Eq. (13) is approximately equal to L(θ )t, and hence the total probability current Jt per unit change of t is given by L(θ )Jτ. The escape rate κ per unit change of t is therefore given by B +0.3 (25) We now evaluate P and Jτ. First, to obtain P , we must know about the quasi-stationary distribution Ps(z) near θ . It is obtained by putting P(z, τ)/ τ = 0 in Eq. (22), which implies J(z) = 0. We therefore have zi Ps(z). (26) By solving this equation, we obtain for θ = θ + z Aθ Ps(z) = P(θ ) exp B ηh i Ui(zi) For convenience, we put Ps(z) = 0 for z / Aθ . Then, P , which is the total probability within Aθ , is given by dzn Ps(z) = P(θ ) dzi e B ηh i U(zi). (28) In the weak-noise limit η/B +0, we can evaluate the above integral by using the saddle-point method, which is also called the Laplace method. By using Eq. (21), the saddle-point method yields dzi e B ηh i U(zi) Z dzi e B 2ηL(θ ) z2 i = 2πηL(θ ) as η/B +0, and therefore P P(θ ) 2πηL(θ ) Next, let us evaluate Jτ. Along the MPEP (z = 0), the current vector J perpendicular to the escape direction is zero, and hence the probability distribution near the saddle is evaluated in a similar way as Eq. (27): P(ze = zs e, z ) = P(ze = zs e, z = 0) exp B ηh i Ui(zi) By substituting it into ze P ηh e B ze P, (32) 3Recall that the notation f(x) g(x) as x +0 means limx +0(f(x)/g(x)) = 1. Power-Law Escape Rate of SGD U(θ) U(θ*) : z = 0 e (θ = θ*) (θ = θs) Figure 5. Schematic illustration of the escape from a potential barrier. Je(zs e, z ) = Jpath exp B ηh i Ui(zi) where the current along the MPEP (z = 0) is denoted by Jpath := J(zs e, z = 0). The total current through the saddle is then evaluated by using the saddle-point method: Jτ = Z dz Je(zs e, z ) = Jpath Y dz e B ηh i Ui(zi) Jpath (n 1)/2 (34) as η/B +0. 
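The weak-noise evaluations of P and J_τ below rely on the saddle-point (Laplace) method. As a numerical sanity check (with arbitrary illustrative values of L(θ*), h, and η/B), the Gaussian approximation of the relevant integral can be compared against direct quadrature:

```python
import numpy as np

# Laplace (saddle-point) check for the type of integral appearing in the proof:
#   I = \int dz exp(-(B/(eta*h)) * U(z)),   U(z) = log(1 + h z^2 / (2 L*)),
# which for eta/B -> +0 should approach sqrt(2*pi*eta*L*/B).
L_star, h = 0.05, 3.0
z = np.linspace(-2.0, 2.0, 400_001)
dz = z[1] - z[0]
U = np.log1p(h * z ** 2 / (2.0 * L_star))

for eta_over_B in (1e-1, 1e-2, 1e-3):
    integrand = np.exp(-U / (eta_over_B * h))
    I_num = integrand.sum() * dz                       # simple quadrature
    I_laplace = np.sqrt(2.0 * np.pi * eta_over_B * L_star)
    print(f"eta/B = {eta_over_B:.0e}:  numerical {I_num:.5e}   Laplace {I_laplace:.5e}")
# The ratio approaches 1 as eta/B -> +0, justifying the weak-noise evaluation.
```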
When the distribution function is almost stationary, Eq. (22) yields Je(ze, z = 0)/ ze = 0, and hence the current along the MPEP is constant Je(ze, z = 0) = Jpath. By putting z = 0 in Eq. (33), we have ze P(ze, z = 0) ηh e B P ze (ze, z = 0) = ηh e B e B ηh e U B ηh e UP(ze, z = 0) i . (35) By multiplying e B ηh e U in both sides and integrating over ze from 0 to zc e, where zc e defined as Ue(θ + zc eve) = Ue(θ ) (see Fig. 5), we obtain B ηh e U ηh e B e B ηh e U(θ )P(θ ) (36) as η/B +0, where we used the fact that the probability at zc e is negligible in the weak-noise limit. By using the saddle-point method, the integral in the left-hand side of Eq. (36) is evaluated as Z zc e U(θs) + hs e 2L(θs)(ze zs e)2 = 2πηh e B|hse|L(θs) B ηh e U(θs). (37) By substituting this result in Eq. (36), we obtain Jpath ηh e|hs e| 2πBL(θs) 1/2 e B ηh e UP(θ ), (38) where U = U(θs) U(θ ). The total current Jτ in Eq. (34) is then expressed as h e|hse| 2πL(θs) n/2 e B ηh e UP(θ ). (39) Power-Law Escape Rate of SGD By using Eqs. (30) and (39), the escape rate κ in Eq. (25) is evaluated as 2 1 e B ηhe U. (40) Since U = log[L(θs)/L(θ )], we finally obtain B ηh e +1 n which is exactly identical to the escape rate formula given in Theorem 4.7. C. Stationary distribution Suppose that θτ obeys a simple Langevin equation d θτ = U ( θτ) + 2Td Wτ. (42) The stationary distribution of θτ is then given by the Gibbs distribution Ps(θ) e U(θ)/T . On the other hand, what we want is the stationary distribution Ps(θ) of θt, where θt = θτ with τ = R t 0 dt L(θt ). In this section, we show the relation between the two distributions: Ps(θ) L(θ) 1 Ps(θ). We express the stationary distributions in terms of the long-time average of the delta function: Ps(θ) = lim s 1 s 0 dt δ(θt θ), Ps(θ) = lim s 1 s 0 dτ δ( θτ θ). (43) By using the relation τ = R t 0 dt L(θt ), we have dτ = L(θt)dt. For a sufficiently large t, we also obtain τ t L, where L := lims (1/s) R s 0 dt L(θt ) denotes the long-time average of L(θt). By using them, Ps(θ) is rewritten as Ps(θ) lim s 1 s 0 dτ δ( θτ θ) = 1 L(θ) lim s 0 dτ δ( θτ θ) = L L(θ) Ps(θ). (44) We thus obtain the desired relation, Ps(θ) L(θ) 1 Ps(θ). D. Other loss functions In our paper, we mainly focus on the mean-square loss, for which we can analytically derive the relation between the loss L(θ) and the SGD noise covariance Σ(θ). An important observation is that the SGD noise strength N is proportional to the loss, i.e., N L(θ) (see Section 5.2 for the definition of N). Here, we argue that the relation N L(θ) also holds in more general situations. During the training, the value of ℓµ will fluctuate from sample to sample. At a certain time step of SGD, let us suppose that N M samples in the training dataset are already fit correctly and hence ℓµ 0, whereas the other M samples are not and hence ℓµ = O(1). The loss function is then given by L(θ) = (1/N) PN µ=1 ℓµ M/N. When ℓµ is small, ℓµ will also be small. Therefore, for N M samples with ℓµ 0, ℓµ 0 also holds. The other M samples will have non-small gradients: ℓµ 2 = O(1). We thus estimate N as N (1/N) PN µ=1 ℓT µ ℓµ M/N L(θ). In this way, N L(θ) will hold, irrespective of the loss function. However, we emphasize that this is a crude argument. In particular, the above argument will not hold near a global minimum because all the samples are correctly fit there, which implies that ℓµ is small for all µ, in contrast to the above argument relying on the existence of M samples with ℓµ and ℓµ 2 of O(1). 
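The argument above can be illustrated numerically. In the following sketch (purely synthetic per-sample losses and gradients, not a trained model), M of the N samples carry per-sample losses and squared gradient norms of O(1) while the rest are essentially fit; both L and the noise strength then scale as M/N, so their ratio stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 10_000, 50      # number of samples and of parameters (illustrative)

for M in (100, 1000, 5000):
    ell = np.zeros(N)
    grads = np.zeros((N, P))
    unfit = rng.choice(N, size=M, replace=False)
    ell[unfit] = rng.uniform(0.5, 1.5, size=M)                  # l_mu = O(1) for unfit samples
    grads[unfit] = rng.standard_normal((M, P)) / np.sqrt(P)     # ||grad l_mu||^2 = O(1)

    L = ell.mean()
    full_grad = grads.mean(axis=0)
    noise_strength = (grads ** 2).sum(axis=1).mean() - full_grad @ full_grad
    print(f"M/N = {M / N:.2f}   L = {L:.3f}   noise strength = {noise_strength:.3f}   "
          f"ratio = {noise_strength / L:.2f}")
# Both L and the noise strength scale like M/N, so the ratio stays roughly constant.
```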
Figure 6. Training dynamics of the cross-entropy loss and the SGD noise strength N for (a) a fully connected network trained on the Fashion-MNIST dataset and (b) a convolutional network trained on the CIFAR-10 dataset. In the figure, we multiply N by a numerical factor to emphasize that N is proportional to the loss in an intermediate stage of the training. Loss vs N for (c) a fully connected network trained on Fashion-MNIST and (d) a convolutional network trained on CIFAR-10. Dashed lines in (c) and (d) are straight lines of slope 1, which imply N ∝ L(θ).

We now experimentally test the relation N ∝ L(θ) for the cross-entropy loss. We consider the same architectures and datasets as in Section 5.2: a fully connected neural network trained on Fashion-MNIST and a convolutional neural network trained on CIFAR-10 (see Section 5.2 for the details). We fix B = 100 in both cases, and η = 0.1 for the fully connected network and η = 0.05 for the convolutional network. Experimental results are presented in Fig. 6. We find that the relation N ∝ L(θ) holds at an intermediate stage of the training dynamics, although the proportionality is less clear than in Fig. 2 in the main text for the mean-square loss. We also find that for sufficiently small values of the loss, N ∝ L(θ)^2 [see Fig. 6 (c) and (d)], whose implications merit further investigation in future studies.

E. Hessian eigenvalues for a neural network in Section 5.3

We present numerical results on Hessian eigenvalues in the pre-trained neural network studied in Section 5.3. Instead of the exact Hessian, we consider the approximate Hessian given by the right-hand side of Eq. (11), i.e., (1/N) Σ_{µ=1}^{N} ∇f(θ, x^(µ)) ∇f(θ, x^(µ))^T. (45) Eigenvalues {h_i} are arranged in descending order as h_1 ≥ h_2 ≥ ... ≥ h_P (in our model P = 7861). A histogram of the Hessian eigenvalues is presented in Fig. 7 (a). We see that most eigenvalues are close to zero, which corresponds to the bulk, but there are some large eigenvalues, which correspond to the outliers. The largest eigenvalue is h_1 = 95.6, which is identified as h_e^* in our theoretical formula (15).

Figure 7. Hessian eigenvalues for the pre-trained neural network studied in Section 5.3. (a) Histogram of the Hessian eigenvalues and (b) the top 30 eigenvalues. Most eigenvalues are close to zero, but there are some large eigenvalues, which correspond to the outliers.

Another important quantity is the effective dimension n, corresponding to the number of outliers. Since the outliers and the bulk are not sharply separated, it is difficult to determine n precisely. In Fig. 7 (b), we plot h_i up to i = 30. From this figure, it seems reasonable to estimate n ≈ 10. As a heuristic method of determining n, we can consider the following identification: first we define the weight p_i of the ith eigenmode as p_i = h_i / Σ_{j=1}^{P} h_j. We then determine n from these weights, which gives n = 12.7 in our case. In Section 5.3, our formula (15) with n = 9 explains the numerical results on the first-passage time.
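The approximate Hessian (45) can be assembled directly from per-sample output gradients. The sketch below does this for a tiny network on synthetic data (sizes are illustrative); the outlier-counting rule at the end is a hypothetical trace-share heuristic for illustration only and is not the identification used above.

```python
import torch

torch.manual_seed(1)
N, d = 300, 8
X = torch.randn(N, d)

model = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
params = [p for p in model.parameters() if p.requires_grad]
P = sum(p.numel() for p in params)

# Approximate Hessian, Eq. (45): (1/N) sum_mu grad f(theta, x_mu) grad f(theta, x_mu)^T.
H_approx = torch.zeros(P, P)
for mu in range(N):
    out = model(X[mu:mu + 1]).squeeze()
    g = torch.autograd.grad(out, params)
    g_flat = torch.cat([gi.reshape(-1) for gi in g])
    H_approx += torch.outer(g_flat, g_flat) / N

h = torch.linalg.eigvalsh(H_approx).flip(0)      # eigenvalues in descending order
print("top 10 eigenvalues:", [f"{v:.3f}" for v in h[:10].tolist()])

# Hypothetical outlier count: eigenvalues carrying a non-negligible share of the
# total trace (the 1% threshold is an arbitrary illustrative choice).
p = h / h.sum()
n_eff = int((p > 0.01).sum())
print("effective dimension (heuristic):", n_eff)
```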