# strength_of_minibatch_noise_in_sgd__7dde7be8.pdf

Published as a conference paper at ICLR 2022

STRENGTH OF MINIBATCH NOISE IN SGD

Liu Ziyin , Kangqiao Liu , Takashi Mori, & Masahito Ueda The University of Tokyo

The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning. This work presents the ﬁrst systematic study of the SGD noise and ﬂuctuations close to a local minimum. We ﬁrst analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization, (3) explain why the linear learning ratebatchsize scaling law fails at a large learning rate or at a small batchsize and (4) can provide an understanding of how discrete-time nature of SGD affects the recently discovered power-law phenomenon of SGD.

1 INTRODUCTION

Stochastic gradient descent (SGD) is the simple and efﬁcient optimization algorithm behind the success of deep learning (Allen-Zhu et al., 2019; Xing et al., 2018; Zhang et al., 2018; Wang et al., 2020; He and Tao, 2020; Liu et al., 2021; Simsekli et al., 2019; Wu et al., 2020). Minibatch noise, also known as the SGD noise, is the primary type of noise in the learning dynamics of neural networks. Practically, minibatch noise is unavoidable because a modern computer s memory is limited while the size of the datasets we use is large; this demands the dataset to be split into minibatches" for training. At the same time, using minibatch is also a recommended practice because using a smaller batch size often leads to better generalization performance (Hoffer et al., 2017). Therefore, understanding minibatch noise in SGD has been one of the primary topics in deep learning theory. Dominantly many theoretical studies take two approximations: (1) the continuous-time approximation, which takes the inﬁnitesimal step-size limit; (2) the Hessian approximation, which assumes that the covariance matrix of the SGD noise is equal to the Hessian H. While these approximations have been shown to provide some qualitative understanding, the limitation of these approximations is not well understood. For example, it is still unsure when such approximations are valid, which hinders our capability to assess the correctness of the results obtained by approximations.

In this work, we ﬁll this gap by deriving analytical formulae for discrete-time SGD with arbitrary learning rates and exact minibatch noise covariance. In summary, the main contributions are: (1) we derive the strength and the shape of the minibatch SGD noise in the cases where the noise for discrete-time SGD is analytically solvable; (2) we show that the SGD noise takes a different form in different kinds of minima and propose general and more accurate approximations. This work is organized as follows: Sec. 2 introduces the background. Sec. 3 discusses the related works. Sec. 4 outlines our theoretical results. Sec. 5 derives new approximation formulae for SGD noises. In Sec. 6, we show how our results can provide practical and theoretical insights to problems relevant to contemporary machine learning research. For reference, the relationship of this work to the previous works is shown in Table 1.

2 BACKGROUND

In this section, we introduce the minibatch SGD algorithm. Let {xi,yi}N i=1 be a training set. We can deﬁne the gradient descent (GD) algorithm for a differentiable loss function L as wt = wt 1 λ w L(w,{x,y}), where λ is the learning rate and w RD is the weights of the model. We consider an additive loss function for applying the minibatch SGD. Deﬁnition 1. A loss function L({xi,yi}N i=1,w) is additive if L({xi,yi}N i=1,w) = 1 N N i=1 ℓ(xi,yi,w) for some differentiable, non-negative function ℓ( ).

Published as a conference paper at ICLR 2022

Table 1: Summary of related works on the noise and stationary distribution of SGD. This work ﬁlls the gap of the lack of theoretical results for the actual SGD dynamics, which is discrete-time and with minibatch noise.

Setting Artiﬁcial Noise Hessian Approximation Noise Minibatch Noise Continuous-time Sato and Nakagawa (2014); Welling and Teh (2011) Jastrzebski et al. (2018); Zhu et al. (2019) Blanc et al. (2020); Mori et al. (2021) Mandt et al. (2017); Meng et al. (2020) Wu et al. (2020); Xie et al. (2021) Discrete-time Yaida (2019); Gitman et al. (2019) Liu et al. (2021) This Work Liu et al. (2021)

This deﬁnition is quite general. Most commonly studied and used loss functions are additive, e.g., the mean-square error (MSE) and cross-entropy loss. For an additive loss, the minibatch SGD with momentum algorithm can be deﬁned. Deﬁnition 2. The minibatch SGD with momentum algorithm by sampling with replacement computes the update to the parameter w with the following set of equations:

S i Bt ℓ(xi,yi,wt 1); mt = µmt 1 + ˆgt; wt = wt 1 λmt. (1)

where µ [0,1) is the momentum hyperparameter, S = Bt is the minibatch size, and the set Bt = {i1,...i S} are S i.i.d. random integers sampled uniformly from [1,N].

One can decompose the gradient into a deterministic plus a stochastic term. Note that EB[ˆgt] = L is equal to the gradient for the GD algorithm. We use EB( ) to denote the expectation over batches, and use Ew( ) to denote the expectation over the stationary distribution of the model parameters. Therefore, we can write ˆgt = EB[ˆgt] + ηt, where ηt = 1

S i Bt ℓ(xi,yi,wt 1) EB[ˆgt] is the noise term; the noise covariance is C(wt) = cov(ηt,ηt). Of central importance to us is the averaged asymptotic noise covariance C = limt Ewt[C(wt)]. Also, we consider the asymptotic model ﬂuctuation Σ = limt cov(wt,wt). Σ gives the strength and shape of the ﬂuctuation of w around a local minimum and is another quantity of central importance to this work. Throughout this work, C is called the noise" and Σ the ﬂuctuation".

3 RELATED WORKS

Noise and Fluctuation in SGD. Deep learning models are trained with SGD and its variants. To understand the parameter distribution in deep learning, one needs to understand the stationary distribution of SGD (Mandt et al., 2017). Sato and Nakagawa (2014) describes the stationary distribution of stochastic gradient Langevin dynamics using discrete-time Fokker-Planck equation. Yaida (2019) connects the covariance of parameter Σ to that of the noise C through the ﬂuctuationdissipation theorem. When Σ is known, one may obtain by Laplace approximation the stationary distribution of the model parameter around a local minimum w as N(w ,Σ). Therefore, knowing Σ can be of great practical use. For example, it has been used to estimate the local minimum escape efﬁciency (Zhu et al., 2019; Liu et al., 2021) and argue that SGD prefers a ﬂatter minimum; it can also be used to assess parameter uncertainty and prediction uncertainty when a Bayesian prior is speciﬁed (Mandt et al., 2017; Gal and Ghahramani, 2016; Pearce et al., 2020). Empirically, both the ﬂuctuation and the noise are known to crucially affect the generalization of a deep neural network. Wu et al. (2020) shows that the strength and shape of the Σ due to the minibatch noise lead to better generalization of neural networks in comparison to an artiﬁcially constructed noise.

Hessian Approximation of the Minibatch Noise. However, it is not yet known what form C and Σ actually take for SGD in a realistic learning setting. Early attempts assume isotropic noise in the continuous-time limit (Sato and Nakagawa, 2014; Mandt et al., 2017). In this setting, the noise is an isotropic Gaussian with C ID, and Σ is known to be proportional to the inverse Hessian H 1. More recently, the importance of noise structure was realized (Hoffer et al., 2017; Jastrzebski et al., 2018; Zhu et al., 2019; Hao Chen et al., 2020). Hessian approximation", which assumes C c0H for some unknown constant c0, has often been adopted for understanding SGD (see Table 1); this assumption is often motivated by the fact that C = Jw H, where Jw is the Fisher information matrix (FIM) (Zhu et al., 2019); the ﬂuctuation can be solved to be isotropic: Σ ID. However, it is not known under what conditions the Hessian approximation is valid, while previous works have argued that it can be very inaccurate (Martens, 2014; Liu et al., 2021; Thomas et al., 2020; Kunstner et al., 2019). However, Martens (2014) and Kunstner et al. (2019) only focuses on the natural gradient descent (NGD) setting; Thomas et al. (2020) is closest to ours, but it does not apply to the case with momentum, a matrix learning rate, or regularization.

Published as a conference paper at ICLR 2022

Discrete-time SGD with a Large Learning Rate. Recently, it has been realized that networks trained at a large learning rate have a dramatically better performance than networks trained with a vanishing learning rate (lazy training) (Chizat and Bach, 2018). Lewkowycz et al. (2020) shows that there is a qualitative difference between the lazy training regime and the large learning rate regime; the performance features two plateaus in testing accuracy in the two regimes, with the large learning rate regime performing much better. However, the theory regarding discrete-time SGD at a large learning rate is almost non-existent, and it is also not known what Σ may be when the learning rate is non-vanishing. Our work also sheds light on the behavior of SGD at a large learning rate. Some other works also consider discrete-time SGD in a similar setting (Fontaine et al., 2021; Dieuleveut et al., 2020; Toulis et al., 2017), but the focus is not on deriving analytical formulae or does not deal with the stationary distribution.

4 SGD NOISE AND FLUCTUATION IN LINEAR REGRESSION

This section derives the shape and strength of SGD noise and ﬂuctuation for linear regression; concurrent to our work, Kunin et al. (2021) also studies the same problem but with continuous-time approximation; our result is thus more general. To emphasize the message, we discuss the label noise case in more detail. The other situations also deserve detailed analysis; we delay such discussion to the appendix due to space constraints. Notation: S denotes the minibatch size. w RD is the model parameter viewed in a vectorized form; λ R+ denotes a scalar learning rate; when the learning rate takes the form of a preconditioning matrix, we use Λ RD D. A RD D denotes the covariance matrix of the input data. When a matrix X is positive semi-deﬁnite, we write X 0; throughout, we require Λ 0. γ R denotes the weight decay hyperparameter; when the weight decay hyperparameter is a matrix, we write Γ RD D. µ is the momentum hyperparameter in SGD. For two matrices X,Y , the commutator is deﬁned as [X,Y ] = XY Y X. Other notations are introduced in the context.1 The results of this section are numerically veriﬁed in Appendix A.

4.1 KEY PREVIOUS RESULTS When N S, the following proposition is well-known and gives the exact noise due to minibatch sampling. See Appendix E.1 for a derivation. Proposition 1. The noise covariance of SGD as deﬁned in Deﬁnition 2 is

C(w) = 1 SN

N i=1 ℓi(w) ℓi(w)T 1

S L(w) L(w)T, (2)

where the notations ℓi(w) = l(xi,yi,w) and L(w) = L({xi,yi}N i=1,w) are used.

This gradient covariance matrix C is crucial to understand the minibatch noise. The standard literature often assumes C(w) H(w); however, the following well-known proposition shows that this approximation can easily break down. Proposition 2. Let w be the solution such that L(w ) = 0, then C(w ) = 0.

Proof. Because ℓi is non-negative for all i, L(w ) = 0 implies that ℓi(w ) = 0. The differentiability in turn implies that each ℓi(w ) = 0; therefore, C = 0.

This proposition implies that there is no noise if our model can achieve zero training loss (which is achievable for an overparametrized model). This already suggests that the Hessian approximation C H is wrong since the Hessian is unlikely to vanish in any minimum. The fact that the noise strength vanishes at L = 0 suggests that the SGD noise might at least be proportional to L(w), which we will show to be true for many cases. The following theorem relates C and Σ of the discrete-time SGD algorithm with momentum for a matrix learning rate. Theorem 1. (Liu et al., 2021) Consider running SGD on a quadratic loss function with Hessian H, learning rate matrix Λ, momentum µ. Assuming ergodicity, then

(1 µ)(ΛHΣ + ΣHΛ) 1 + µ2

1 µ2 ΛHΣHΛ + µ 1 µ2 (ΛHΛHΣ + ΣHΛHΛ) = ΛCΛ. (3)

Propostion 1 and Theorem 1 allow one to solve C and Σ. Equation (3) can be seen as a general form of the Lyapunov equation (Lyapunov, 1992) and is hard to solve in general (Hammarling, 1982; Ye et al., 1998; Simoncini, 2016). Solving this analytical equation in settings of machine learning relevance is one of the main technical contributions of this work.

1We use the word global minimum to refer to the global minimum of the loss function, i.e., where L = 0 and a local minimum refers to a minimum that has a non-negative loss, i.e., L 0.

Published as a conference paper at ICLR 2022

4.2 RANDOM NOISE IN THE LABEL

We ﬁrst consider the case when the labels contain noise. The loss function takes the form

L(w) = 1 2N

N i=1 (w Txi yi)2, (4)

where xi RD are drawn from a zero-mean Gaussian distribution with feature covariance A = EB[xx T], and yi = u Txi + ϵi, for some ﬁxed u and ϵi R is drawn from a distribution with zero mean and ﬁnite second momentum σ2. We redeﬁne w u w and let N with D held ﬁxed. The following lemma ﬁnds C as a function of Σ.

Lemma 1. (Covariance matrix for SGD noise in the label) Let N and the model be updated according to Eq. (1) with loss function in Eq. (4). Then,

S (AΣA + Tr[AΣ]A + σ2A). (5)

The model ﬂuctuation can be obtained using this lemma.

Theorem 2. (Fluctuation of model parameters with random noise in the label) Let the assumptions be the same as in Lemma 1 and [Λ,A] = 0. Then,

S )ΛG 1 µ , (6)

where κµ = Tr[ΛAG 1 µ ] 1 1

S Tr[ΛAG 1 µ ] with Gµ = 2(1 µ)ID ( 1 µ

Remark. This result is numerically validated in Appendix A. The subscript µ refers to momentum. To obtain results for vanilla SGD, one can set µ = 0, which has the effect of reducing Gµ G = 2ID (1 + 1

S )ΛA. From now on, we focus on the case when µ = 0 for notational simplicity, but we note that the results for momentum can be likewise studied. The assumption [Λ,A] = 0 is not too strong because this condition holds for a scalar learning rate and common second-order methods such as Newton s method.

If σ2 = 0, then Σ = 0. This means that when there is no label noise, the model parameter has a vanishing stationary ﬂuctuation, which corroborates Proposition 2. When a scalar learning rate λ 1 and 1 S, we have

which is the result one would expect from the continuous-time theory with the Hessian approximation (Liu et al., 2021; Xie et al., 2021; Zhu et al., 2019), except for a correction factor of σ2. Therefore, a Hessian approximation fails to account for the randomness in the data of strength σ2. We provide a systematic and detailed comparison with the Hessian approximation in Table 2 of Appendix B.

Moreover, it is worth comparing the exact result in Theorem 2 with Eq. (7) in the regime of nonvanishing learning rate and small batch size. One notices two differences: (1) an anisotropic enhancement, appearing in the matrix Gµ and taking the form λ(1 + 1/S)A; compared with the result in Liu et al. (2021), this term is due to the compound effect of using a large learning rate and a small batchsize; (2) an isotropic enhancement term κ, which causes the overall magnitude of ﬂuctuations to increase; this term does not appear in the previous works that are based on the Hessian approximation and is due to the minibatch sampling process alone. As the numerical example in Appendix A shows, at large batch size, the discrete-time nature of SGD is the leading source of ﬂuctuation; at small batch size, the isotropic enhancement becomes the dominant source of ﬂuctuation. Therefore, the minibatch sampling process causes two different kinds of enhancement to the ﬂuctuation, potentially increasing the exploration power of SGD at initialization but reducing the convergence speed.

Now, combining Theorem 2 and Lemma 1, one can obtain an explicit form of the noise covariance.

Theorem 3. The noise covariance matrix of minibatch SGD with random noise in the label is

S )(ΛAG 1 µ + Tr[ΛAG 1 µ ]ID)A. (8)

Published as a conference paper at ICLR 2022

By deﬁnition, C = J is the FIM. The Hessian approximation, in sharp contrast, can only account for the term in orange. A signiﬁcant modiﬁcation containing both anisotropic and isotropic (up to Hessian) is required to fully understand SGD noise, even in this simple example. Additionally, comparing this result with the training loss (127), one can ﬁnd that the noise covariance contains one term that is proportional to the training loss. In fact, we will derive in Sec. 5 that containing a term proportional to training loss is a general feature of the SGD noise. We also study the case when the input is contaminated with noise. Interestingly, the result is the same with the label noise case with σ2 replaced by a more complicated term of the form Tr[AK 1BU]. We thus omit this part from the main text. A detailed discussion can be found in Appendix E.3.1. In the next section, we study the effect of regularization on SGD noise and ﬂuctuation.

4.3 LEARNING WITH REGULARIZATION

Now, we show that regularization also causes a unique SGD noise. The loss function for Γ L2 regularized linear regression is

LΓ(w) = 1 2N

N i=1 [(w u)Txi] 2 + 1

2(w u)TA(w u) + 1

2w TΓw, (9)

where Γ is a symmetric matrix; conventionally, one set Γ = γID with a scalar γ > 0. For conciseness, we assume that there is no noise in the label, namely yi = u Txi with a constant vector u. One important quantity in this case will be uu T = U. The noise for this form of regularization can be calculated but takes a complicated form. Proposition 3. (Noise covariance matrix for learning with L2 regularization) Let the algorithm be updated according to Eq. (1) on loss function (9) with N and [A,Γ] = 0. Then,

S (AΣA + Tr[AΣ]A + Tr[Γ TAΓ U]A + ΓA UA Γ), (10)

where A = K 1A, Γ = K 1Γ with K = A + Γ.

Notice that the last term ΓA UA Γ in C is unique to the regularization-based noise: it is rank-1 because U is rank-1. This term is due to the mismatch between the regularization and the minimum of the original loss. Also, note that the term Tr[AΣ] is proportional to the training loss. Deﬁne the test loss to be Ltest = limt Ewt[ 1

2(wt u)TA(wt u)], we can prove the following theorem. We will show that one intriguing feature of discrete-time SGD is that the weight decay can be negative. Theorem 4. (Test loss and model ﬂuctuation for L2 regularization) Let the assumptions be the same as in Proposition 3. Then

2S (Tr[AK 2Γ2U]κ + r) + 1

2Tr[AK 2Γ2U], (11)

where κ = Tr[A2K 1G 1] 1 λ

S Tr[A2K 1G 1], r = Tr[A3K 3Γ2G 1U]

S Tr[A2K 1G 1], with G = 2ID λ(K + 1

S K 1A2). Moreover,

let [Γ,U] = 0, then

S Tr[AK 2Γ2U](1 + λκ

S )AK 1G 1 + λ

S (A2K 2Γ2U + λr

S A)K 1G 1. (12)

This result is numerically validated in Appendix A. The test loss (11) has an interesting consequence. One can show that there exist situations where the optimal Γ is negative.2 When discussing the test loss, we make the convention that if wt diverges, then Ltest = . Corollary 1. Let γ = arg minγ Ltest. There exist a, λ and S such that γ < 0.

The proof shows that when the learning rate is sufﬁciently large, only negative weight decay is allowed. This agrees with the argument in Liu et al. (2021) that discrete-time SGD introduces an implicit L2 regularization that favors small norm solutions. A too-large learning rate requires a negative weight decay because a large learning rate already over-regularizes the model and one needs

2Some readers might argue that discussing test loss is meaningless when N ; however, this criticism does not apply because the size of the training set is not the only factor that affects generalization. In fact, this section s crucial message is that using a large learning rate affects the generalization by implicitly regularizing the model and, if one over-regularizes, one needs to offset this effect.

Published as a conference paper at ICLR 2022

to introduce an explicit negative weight decay to offset this over-regularization effect of SGD. This is a piece of direct evidence that using a large learning rate can help regularize the models. It has been hypothesized that the dynamics of SGD implicitly regularizes neural networks such that the training favors simpler solutions (Kalimeris et al., 2019). Our result suggests one new mechanism for such a regularization.

5 NOISE STRUCTURE FOR GENERIC SETTINGS

The results in the previous sections suggest that (1) the SGD noises differ for different kinds of situations, and (2) SGD noise contains a term proportional to the training loss in general. These two facts motivate us to derive the noise covariance differently for different kinds of minima. Let f(w,x) denote the output of the model for a given input x RD. Here, we consider a more general case; f(w,x) may be any differentiable function, e.g., a non-linear deep neural network. The number of parameters in the model is denoted by P, and hence w RP . For a training dataset {xi,yi}i=1,2,...,N, the loss function with a L2 regularization is given by

LΓ(w) = L0(w) + 1

2w TΓw, (13)

where L0(w) = 1 N N i=1 ℓ(f(w,xi),yi) is the loss function without regularization, and H0 is the Hessian of L0. We focus on the MSE loss ℓ(f(w,xi),yi) = [f(w,xi) yi]2/2. Our result crucially relies on the following two assumptions, which relate to the conditions of different kinds of local minima. Assumption 1. (Fluctuation decays with batch size) Σ is proportional to S 1, i.e. Σ = O(S 1).

This is justiﬁed by the results in all the related works (Liu et al., 2021; Xie et al., 2021; Meng et al., 2020; Mori et al., 2021), where Σ is found to be O(S 1). Assumption 2. (Weak homogeneity) L ℓi is small; in particular, it is of order o(L).

This assumption amounts to assuming that the current training loss L reﬂects the actual level of approximation for each data point well. In fact, since L 0, one can easily show that L ℓi = O(L). Here, we require a slightly stronger condition for a more clean expression, when L ℓi = O(L) we can still get a similar expression but with some constant that hinders the clarity. Relaxing this condition can be an important and interesting future work. The above two conditions allow us to state our general theorem formally. Theorem 5. Let the training loss be LΓ = L0 + 1

2w TΓw and the models be optimized with SGD in the neighborhood of a local minimum w . Then,

C(w) = 2L0(w)

S LΓ(w) LΓ(w)T + o(L0). (14)

The noise takes different forms for different kinds of local minima. Corollary 2. Omitting the terms of order o(L0), when Γ 0,

C = 2L0(w )

S Γw w TΓ + O(S 2) + O( w w 2). (15)

When Γ = 0 and L0(w ) 0,

C = 2L0(w )

S H0(w ) + O(S 2) + O( w w 2). (16)

When Γ = 0 and L0(w ) = 0,

S (Tr[H0(w )Σ]ID H0(w )Σ)H0(w ) + O(S 2) + O( w w 2). (17)

Remark. Assumption 2 can be replaced by a weaker but more technical assumption called the decoupling assumption", which has been used in recent works to derive the continuous-time distribution of SGD (Mori et al., 2021; Wojtowytsch, 2021). The Hessian approximation was invoked in most of the literature without considering the conditions of its applicability (Jastrzebski et al., 2018; Zhu et al., 2019; Liu et al., 2021; Wu et al., 2020; Xie et al., 2021). Our result does provide such

Published as a conference paper at ICLR 2022

conditions for applicability. As indicated by the two assumptions, this theorem is applicable when the batch size is not too small and when the local minimum has a loss close to 0. The reason for the failure of the Hessian approximation is that, while the FIM is equal to the expected Hessian J = E[H], there is no reason to expect the expected Hessian to be close to the actual Hessian of the minimum.

The proof is given in Appendix C. Two crucial messages this corollary delivers are (1) the SGD noise is different in strength and shape in different kinds of local minima and that they need to be analyzed differently; (2) the SGD noise contains a term that is proportional to the training loss L0 in general. Recently, it has been experimentally demonstrated that the SGD noise is indeed proportional to the training loss in realistic deep neural network settings, both when the loss function is MSE and cross-entropy (Mori et al., 2021); our result offers a theoretical justiﬁcation. The previous works all treat all the minima as if the noise is similar (Jastrzebski et al., 2018; Zhu et al., 2019; Liu et al., 2021; Wu et al., 2020; Xie et al., 2021), which can lead to inaccurate or even incorrect understanding. For example, Theorem 3.2 in Xie et al. (2021) predicts a high escape probability from a sharp local or global minimum. However, this is incorrect because a model at a global minimum has zero probability of escaping due to a vanishing gradient. In contrast, the escape rate results derived in Mori et al. (2021) correctly differentiate the local and global minima. We also note that these general formulae are consistent with the exact solutions we obtained in the previous section than the Hessian approximation. For example, the dependence of the noise strength on the training loss in Theorem 2, and the rank-1 noise of regularization are all reﬂected in these formulae. In contrast, the simple Hessian approximation misses these crucial distinctions. Lastly, combining Theorem 5 with Theorem 1, one can also ﬁnd the ﬂuctuation. Corollary 3. Let the noise be as in Theorem 5, and omit the terms of order O(S 2) and O( w w 2). Then, when Γ 0 and when Λ, H0(w ) and Γ commute with each other, Pr Σ =

1 S Λ 1 µ(2L0H0 Γw w TΓ)(H0 + Γ)+ [2ID Λ 1+µ(H0 + Γ)] 1 . When Γ = 0 and L0(w ) 0,

PrΣ = 2L0 S(1 µ)PrΛ(2ID Λ 1+µH0) 1 . When Γ = 0 and L0(w ) = 0, PrΣ = 0. Here the superscript + is the Moore-Penrose pseudo inverse, Pr = diag(1,...,1,0,...,0) is the projection operator with r non-zero entries, r D is the rank of the Hessian H0, and r D is the rank of H0 + Γ. For the null space H0, Σ can be arbitrary.

6 APPLICATIONS

One major advantage of analytical solutions is that they can be applied in a simple plug-in" manner by the practitioners or theorists to analyze new problems they encounter. In this section, we brieﬂy outline a few examples where the proposed theories can be relevant.

6.1 HIGH-DIMENSIONAL REGRESSION

We ﬁrst apply our result to the high-dimensional regression problem and show how over-andunderparametrization might play a role in determining the minibatch noise. Here, we take N,D with the ratio α = N/D held ﬁxed. The loss function is L(w) = 1 2N N i=1 (w Txi yi) 2. As in the standard literature (Hastie et al., 2019), we assume the existence of label noise: yi = u Txi + ϵi, with Var[ϵi] = σ2. A key difference between our setting and the standard high-dimensional setting is that, in the standard setting (Hastie et al., 2019), one uses the GD algorithm with vanishing learning rate λ instead of the minibatch SGD algorithm with a non-vanishing learning rate. Tackling the high-dimensional regression problem with non-vanishing λ and a minibatch noise is another main technical contribution of this work. In this setting, we can obtain the following result on the noise covariance matrix. Proposition 4. Let ˆA = 1

N N i xix T i and suppose assumptions 1 and 2 hold. With ﬁxed S, λ, then

S (Tr[ ˆAΣ]ID ˆAΣ) ˆA + max{0, σ2

We note that this proposition follows from Theorem 5, showing an important theoretical application of our general theory. An interesting observation is that one Σ-independent term proportional to σ2 emerges in the underparametrized regime (α > 1). However, for the overparametrized regime, the noise is completely dependent on Σ, which is a sign that the stationary solution has no ﬂuctuation. This shows that the degree of underparametrization also plays a distinctive role in the ﬂuctuation. In fact, one can prove the following theorem, which is veriﬁed in Appendix A.2.

Published as a conference paper at ICLR 2022

Figure 1: Realistic learning settings with neural networks and logistic regression. Left: Variance of training loss of a neural network with width d and tanh activation on the MNIST dataset. We see that the variance explodes after d 200. In contrast, rescaling the learning rate by 1/d results in a constant noise level in training. This suggests that the stability condition we derived for high-dimension regression is also useful for understanding deep learning. Middle: Stability of Adam with the same setting. Adam also experiences a similar stability problem when the model width increases. Right: Logistic regression on MNIST trained with SGD; with λ = 1.5, S = 32. We see that the optimal performance is also achieved at negative weight decay strength γ, suggesting that a large learning rate can indeed introduce effective regularization.

Theorem 6. When a stationary solution exists for w, we have Tr[ ˆAΣ] = max{0, λσ2

where ˆκ = Tr[ ˆ G 1 ˆ A] 1 λ

S Tr[ ˆ G 1 ˆ A] with ˆG = 2ID λ(1 1

6.2 IMPLICATION FOR NEURAL NETWORK TRAINING

It is commonly believed that the high-dimensional linear regression problem can be a minimal model for deep learning. Taking this stance, Theorem 6 suggests a technique for training neural networks. For SGD to converge, a positive semi-deﬁnite Σ must exist; however, Σ 0 if and only if ˆκ 0. From ˆκ > 0, we have D i=1 1 2/λai 1+1/S < S, where ai are the eigenvalues of ˆA. This means that each summand should have the order of D/S. Thus the upper bound of λ should have the order of 2S/a D, where a is the typical value of ai s. One implication of the dependence on the dimension is that the stability of a neural network trained with SGD may strongly depend on its width d, and one may rescale the learning rate according to the width to stabilize neural network training. See Figure 1-Left and Middle. We train a two-layer tanh neural network on MNIST and plot the variance of its training loss in the ﬁrst epoch with ﬁxed λ = 0.5. We see that, when d 200, the training starts to destabilize, and the training loss begins to ﬂuctuate dramatically. When rescaling the learning rate by 1/d, we see that the variance of the training loss is successfully kept roughly constant across all d. This suggests a training technique worth being explored by practitioners in the ﬁeld. In Figure 1-Middle, we also use Adam for training the same network and ﬁnd a similar stabilizing trick to work for Adam.

6.3 A NATURAL LEARNING EXAMPLE WITH NEGATIVE WEIGHT DECAY Sec. 4.3 shows that a too-large learning rate introduces an effective L2 regularization that can be corrected by setting the weight decay to be negative. This effect can be observed in more realistic learning settings. We train a logistic regressor on the MNIST dataset with a large learning rate (of order O(1)). Figure 1-Right conﬁrms that, at a large learning rate, the optimal weight decay can indeed be negative. This agrees with our argument that using a large learning rate can effectively regularize the training.

6.4 SECOND-ORDER METHODS Understanding stochastic second-order methods (including the adaptive gradient methods) is also important for deep learning (Agarwal et al., 2017; Zhang and Liu, 2021; Martens, 2014; Kunstner et al., 2019). In this section, we apply our theory to two standard second-order methods: damped Newton s method (DNM) and natural gradient descent (NGD). We provide more accurate results than those derived in Liu et al. (2021). The derivations are given in Appendix D.2. For DNM, the preconditioning learning rate matrix is deﬁned as Λ = λA 1. The model ﬂuctuation is shown to be proportional to the inverse of the Hessian: Σ = λσ2

g S λDA 1, where g = 2(1 µ) ( 1 µ

S )λ. The main difference with the previous results is that the ﬂuctuation now depends explicitly on the dimension D, and implies a stability condition: S λD/g, corroborating the stability condition we derived above. For NGD, the preconditioning matrix is deﬁned by the inverse of the Fisher information that Λ = λ S J(w) 1 = λ S C 1. We show that Σ = λ

2 ( 1 1+D 1 1+µ + 1 1 µ 1 S )A 1 is one solution when σ = 0, which also contains a correction related to D compared to the result in Liu et al. (2021) which is Σ = λ

2 ( 1 1+µ + 1 1 µ 1 S )A 1. A consequence is that J Σ 1. The surprising

Published as a conference paper at ICLR 2022

fact is that the stability of both NGD and DNM now crucially depends on D; combining with the results in Sec. 6.1, this suggests that the dimension of the problem may crucially affect the stability and performance of the minibatch-based algorithms. This result also implies that some features we derived are shared across many algorithms that depend on minibatch noise and that our results may be relevant to a broad class of optimization algorithms other than SGD.

6.5 FAILURE OF THE λ S SCALING LAW One well-known technique in deep learning training is that one can scale λ linearly as one increases the batch size S to achieve high-efﬁciency training without hindering the generalization performance; however, it is known that this scaling law fails when the learning rate is too large, or the batch size is too small (Goyal et al., 2017). In Hoffer et al. (2017), this scaling law is established on the ground that Σ λ/S. However, our result in Theorem 2 suggests the reason for the failure even for the simple setting of linear regression. Recall that the exact Σ takes the form:

for a scalar λ. One notices that the leading term is indeed proportional to λ/S. However, the discrete-time SGD results in a second-order correction in S, and the term proportional to 1/S2 does not contain a corresponding λ; this explains the failure of the scaling law in small S, where the second-order contribution of S becomes signiﬁcant. To understand the failure at large λ, we need to look at the term Gµ:

Gµ = 2(1 µ)ID (λ1 µ

One notices that the second term contains a part that only depends on λ but not on S. This part is negligible compared to the ﬁrst term when λ is small; however, it becomes signiﬁcant as the second term approaches the ﬁrst term. Therefore, increasing λ changes this part of the ﬂuctuation, and the scaling law no more holds if λ is large.

6.6 POWER LAW TAIL IN DISCRETE-TIME SGD

Figure 2: Comparison of the proposed theory with the continuoustime theory on the SGD stationary distribution for aλ = 1. The proposed theory agrees with the experiment exactly.

It has recently been discovered that the SGD noise causes a heavytail distribution (Simsekli et al., 2019; 2020), with a tail decaying like a power law with tail index β (Hodgkinson and Mahoney, 2020). In continuous-time, the stationary distribution has been found to obey a Student s t-like distribution, p(w) L (1+β)/2

(σ2 + aw2) (1+β)/2 (Meng et al., 2020; Mori et al., 2021; Wojtowytsch, 2021). However, this result is only established for continuous-time approximations to SGD and one does not know what affects the exponent β for discrete-time SGD. Our result in Theorem 2 can serve as a tool to ﬁnd the discrete-time correction to the tail index of the stationary distribution. In Appendix D.3, we show that the tail index of discrete-time SGD in 1d can be estimated as β(λ,S) = 2S

aλ S. A clear discrete-time contribution is (S + 1) which depends only on the batch size, while 2S

aλ +1 is the tail index in the continuous-time limit (Mori et al., 2021). See Figure 2; the proposed formula agrees with the experiment. Knowing the tail index β is important for understanding the SGD dynamics because β is equal to the smallest moment of w that diverges. For example, when β 4, then the kurtosis of w diverges, and one expects to see outliers of w very often during training; when β 2, then the second moment of w diverges, and one does not expect w to converge in the minimum under consideration. Our result suggests that the discrete-time dynamics always leads to a heavier tail than the continuous-time theory expects, and therefore is more unstable.

In this work, we have presented a systematic analysis with a focus on exactly solvable results to promote our fundamental understanding of SGD. One major limitation is that we have only focused on studying the asymptotic behavior of SGD in local minimum. For example, Ziyin et al. (2022) showed that SGD can converge to a local maximum when the learning rate is large. One important future step is thus to understand the SGD noise beyond a strongly convex landscape.

Published as a conference paper at ICLR 2022

ACKNOWLEDGEMENT

Liu Ziyin thanks Jie Zhang, Junxia Wang, and Shoki Sugimoto. Ziyin is supported by the GSS Scholarship of The University of Tokyo. Kangqiao Liu was supported by the GSGC program of the University of Tokyo. This work was supported by KAKENHI Grant Numbers JP18H01145 and JP21H05185 from the Japan Society for the Promotion of Science.

Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148 4187.

Allen-Zhu, Z., Li, Y., and Song, Z. (2019). A convergence theory for deep learning via overparameterization. In International Conference on Machine Learning, pages 242 252. PMLR.

Amari, S.-I. (1998). Natural gradient works efﬁciently in learning. Neural Comput., 10(2):251 276.

Blanc, G., Gupta, N., Valiant, G., and Valiant, P. (2020). Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on learning theory, pages 483 513. PMLR.

Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. ar Xiv preprint ar Xiv:1812.07956, 8.

Clauset, A., Shalizi, C. R., and Newman, M. E. (2009). Power-law distributions in empirical data. SIAM review, 51(4):661 703.

Dieuleveut, A., Durmus, A., Bach, F., et al. (2020). Bridging the gap between constant step size stochastic gradient descent and markov chains. Annals of Statistics, 48(3):1348 1382.

Fontaine, X., Bortoli, V. D., and Durmus, A. (2021). Convergence rates and approximation results for sgd and its continuous-time counterpart.

Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050 1059. PMLR.

Gitman, I., Lang, H., Zhang, P., and Xiao, L. (2019). Understanding the role of momentum in stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 9633 9643.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. ar Xiv preprint ar Xiv:1706.02677.

Hammarling, S. J. (1982). Numerical solution of the stable, non-negative deﬁnite lyapunov equation lyapunov equation. IMA Journal of Numerical Analysis, 2(3):303 323.

Hao Chen, J. Z., Wei, C., Lee, J. D., and Ma, T. (2020). Shape matters: Understanding the implicit bias of the noise covariance. ar Xiv preprint ar Xiv:2006.08680.

Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. ar Xiv preprint ar Xiv:1903.08560.

He, F. and Tao, D. (2020). Recent advances in deep learning theory. ar Xiv preprint ar Xiv:2012.10931.

Hodgkinson, L. and Mahoney, M. W. (2020). Multiplicative noise and heavy tails in stochastic optimization. ar Xiv preprint ar Xiv:2006.06293.

Hoffer, E., Hubara, I., and Soudry, D. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731 1741.

Published as a conference paper at ICLR 2022

Janssen, P. H. M. and Stoica, P. (1988). On the expectation of the product of four matrix-valued gaussian random variables. IEEE Transactions on Automatic Control, 33(9):867 870.

Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Storkey, A., and Bengio, Y. (2018). Three factors inﬂuencing minima in SGD.

Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. (2019). Sgd on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32:3496 3506.

Kunin, D., Sagastuy-Brena, J., Gillespie, L., Margalit, E., Tanaka, H., Ganguli, S., and Yamins, D. L. (2021). Rethinking the limiting dynamics of sgd: modiﬁed loss, phase space oscillations, and anomalous diffusion. ar Xiv preprint ar Xiv:2107.09133.

Kunstner, F., Balles, L., and Hennig, P. (2019). Limitations of the empirical ﬁsher approximation for natural gradient descent. ar Xiv preprint ar Xiv:1905.12558.

Levy, M. and Solomon, S. (1996). Power laws are logarithmic boltzmann laws. International Journal of Modern Physics C, 7(04):595 601.

Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., and Gur-Ari, G. (2020). The large learning rate phase of deep learning: the catapult mechanism. ar Xiv preprint ar Xiv:2003.02218.

Liu, K., Ziyin, L., and Ueda, M. (2021). Noise and ﬂuctuation of ﬁnite learning rate stochastic gradient descent. ar Xiv preprint ar Xiv:2012.03636.

Lyapunov, A. M. (1992). The general problem of the stability of motion. International journal of control, 55(3):531 534.

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate bayesian inference. J. Mach. Learn. Res., 18(1):4873 4907.

Martens, J. (2014). New insights and perspectives on the natural gradient method. cite arxiv:1412.1193Comment: New title and abstract. Added multiple sections, including a proper introduction/outline and one on convergence speed. Many other revisions throughout.

Meng, Q., Gong, S., Chen, W., Ma, Z.-M., and Liu, T.-Y. (2020). Dynamic of stochastic gradient descent with state-dependent noise. ar Xiv preprint ar Xiv:2006.13719.

Mori, T., Ziyin, L., Liu, K., and Ueda, M. (2021). Logarithmic landscape and power-law escape rate of sgd. ar Xiv preprint ar Xiv:2105.09557.

Pearce, T., Leibfried, F., and Brintrup, A. (2020). Uncertainty in neural networks: Approximately bayesian ensembling. In International conference on artiﬁcial intelligence and statistics, pages 234 244. PMLR.

Sato, I. and Nakagawa, H. (2014). Approximation analysis of stochastic gradient langevin dynamics by using fokker-planck equation and ito process. In International Conference on Machine Learning, pages 982 990. PMLR.

Simoncini, V. (2016). Computational methods for linear matrix equations. SIAM Review, 58(3):377 441.

Simsekli, U., Sagun, L., and Gurbuzbalaban, M. (2019). A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827 5837. PMLR.

Simsekli, U., Sener, O., Deligiannidis, G., and Erdogdu, M. A. (2020). Hausdorff dimension, heavy tails, and generalization in neural networks. Advances in Neural Information Processing Systems, 33.

Thomas, V., Pedregosa, F., Merriënboer, B., Manzagol, P.-A., Bengio, Y., and Le Roux, N. (2020). On the interplay between noise and curvature and its effect on optimization and generalization. In International Conference on Artiﬁcial Intelligence and Statistics, pages 3503 3513. PMLR.

Published as a conference paper at ICLR 2022

Toulis, P., Airoldi, E. M., et al. (2017). Asymptotic and ﬁnite-sample properties of estimators based on stochastic gradients. Annals of Statistics, 45(4):1694 1727.

Wang, X., Zhao, Y., and Pourpanah, F. (2020). Recent advances in deep learning. International Journal of Machine Learning and Cybernetics, 11(4):747 750.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681 688. Citeseer.

Wojtowytsch, S. (2021). Stochastic gradient descent with noise of machine learning type. part ii: Continuous time analysis. ar Xiv preprint ar Xiv:2106.02588.

Wu, J., Hu, W., Xiong, H., Huan, J., Braverman, V., and Zhu, Z. (2020). On the noisy gradient descent that generalizes as sgd. In International Conference on Machine Learning, pages 10367 10376. PMLR.

Xie, Z., Sato, I., and Sugiyama, M. (2021). A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors ﬂat minima. In International Conference on Learning Representations.

Xing, C., Arpit, D., Tsirigotis, C., and Bengio, Y. (2018). A walk with sgd. ar Xiv preprint ar Xiv:1802.08770.

Yaida, S. (2019). Fluctuation-dissipation relations for stochastic gradient descent. In International Conference on Learning Representations.

Ye, H., Michel, A. N., and Hou, L. (1998). Stability theory for hybrid dynamical systems. IEEE transactions on automatic control, 43(4):461 474.

Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., and Poggio, T. (2018). Theory of deep learning iib: Optimization properties of sgd. ar Xiv preprint ar Xiv:1801.02254.

Zhang, Z. and Liu, Z. (2021). On the distributional properties of adaptive gradients. In Uncertainty in Artiﬁcial Intelligence, pages 419 429. PMLR.

Zhu, Z., Wu, J., Yu, B., Wu, L., and Ma, J. (2019). The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In International Conference on Machine Learning, pages 7654 7663. PMLR.

Ziyin, L., Li, B., Simon, J. B., and Ueda, M. (2022). SGD can converge to local maxima. In International Conference on Learning Representations.

Published as a conference paper at ICLR 2022

Figure 3: Left: 1d experiments with label noise. The parameters are set to be a = 1.5 and λ = 1. Right: Experiments with L2 regularization with weight decay strength γ. The parameters are set to be a = 1, λ = 0.5, S = 1. This is the standard case with a vanishing optimal γ. The vertical lines show where our theory predicts a divergence.

(a) a = 1, S = 10

Figure 4: Comparison between theoretical predictions and experiments. (a) 1d experiment. We plot Σ as an increasing function of λ. We see that the continuous-time approximation fails to predict the divergence at a learning rate and the prediction in Liu et al. (2021) severely underestimates the model ﬂuctuation. In contrast, our result is accurate throughout the entire range of learning rates. (b)-(c) 2d experiments. The straight line shows where the proposed theory predicts a divergence in the variance, which agrees with experiment exactly. The Hessian has eigenvalues 1 and 0.5, and λ = 1.5. For a large batch size, the discrete-time Hessian approximation is quite accurate; for a small S, the Hessian approximation underestimates the overall strength of the ﬂuctuation. In contrast, the continuous-time result is both inaccurate in shape and in strength.

A EXPERIMENTS

A.1 LABEL NOISE AND REGULARIZATION

Theorem 2 can be veriﬁed empirically. We run 1d experiment in Figure 4(a) and high dimensional experiments in Figures 4(b)-(c), where we choose D = 2 for visualization. We see that the continuous Hessian approximation fails badly for both large and small batch sizes. When the batch size is large, both the discrete-time Hessian approximation and our solution give a accurate estimate of the shape and the spread of the distribution. This suggests that when the batch size is large, discreteness is the determining factor of the ﬂuctuation. When the batch size is small, the discrete Hessian approximation severely underestimates the strength of the noise. This reﬂects the fact that the isotropic noise enhancement is dominant at a small batch size.

In Figure 3-Left, we run a 1d experiment with λ = 1, N = 10000 and σ2 = 0.25. Comparing the predicted Σ, we see that the proposed theory agrees with the experiment across all ranges of S. The continuous theory with the Hessian approximation fails almost everywhere, while the recently proposed discrete theory with the Hessian approximation underestimates the ﬂuctuation when S is small. In Figure 3-Right, we plot a standard case where the optimal regularization strength γ is vanishing.

Now, we validate the existence of the optimal negative weight decay as predicted by our formula. For illustration, we plot in Figure 5 the test loss (11) for a 1d example while varying either S or λ. The orange vertical lines show the place where the theory predicts a divergence in the test loss. We also

Published as a conference paper at ICLR 2022

Figure 5: 1d experiments with L2 regularization with weight decay strength γ. The parameters are set to be a = 4, λ = 1, S = 64. This shows a case where the optimal γ is negative. The vertical lines show where our theory predicts a divergence.

Figure 6: High-dimensional linear regression. We see that the predicted ﬂuctuation coefﬁcient agrees with the experiment well. The slight deviation is due to a ﬁnite training time and ﬁnite N and D. On the other hand, a naive Hessian approximation results in a qualitatively wrong result.

plot a standard case where the optimal γ is close to 0 in Appendix A. Also, we note that the proposed theory agrees better with the experiment.

A.2 HIGH-DIMENSIONAL REGRESSION

See Figure 6-Left. We vary N with D = 1000 held ﬁxed. We set λ = 0.01 and S = 32. We see that the agreement between the theory and experiment is good, even for this modest dimension number D. The vertical line shows where the over-to-underparametrization transition takes place. As expected, there is no ﬂuctuation when α < 1, and the ﬂuctuation gradually increases as α . On the other hand, the Hessian approximation gives a wrong picture, predicting ﬂuctuation to rise when there is no ﬂuctuation and predicting a constant ﬂuctuation just when the ﬂuctuation starts to rise.

Published as a conference paper at ICLR 2022

Table 2: Comparison with previous results. For notational conciseness, we compare the case when all the relevant matrices commute. The model ﬂuctuation Σ, the expected training loss Ltrain and the expected test loss Ltest calculated by continuousand discrete-time theories with Hessian approximation C H are presented. Exact solutions to these quantities obtained in the present work are shown in the rightmost column.

Hessian Approximation Exact Solution

Cts-time Approximation D-time Solution This Work

Label Noise λ 2S ID λ S (2ID λA) 1 λσ2

S ) [2ID λ (1 + 1

Input Noise λ 2S ID λ S (2ID λK) 1 λTr[AK 1BU]

S ) [2ID λ (1 + 1

L2 Regularization λ 2S ID λ S (2ID λK) 1 Eq. (12)

Ltrain Ltrain Ltrain Label Noise λ 4S Tr[A] + 1

2σ2 Eq. (20) σ2

Input Noise λ 4S Tr[K] + 1

2Tr[AK 1BU] Eq. (28) 1 2Tr[AK 1BU] (1 + λ

L2 Regularization λ 4S Tr[K] + 1

2Tr[AK 1ΓU] Eq. (37) Eq. (151)

Ltest Ltest Ltest Label Noise λ 4S Tr[A] λ 2S Tr[A(2ID λA) 1] λσ2

Input Noise λ 4S Tr[A] + 1

2Tr[B TAB U] Eq. (29) λ 2S Tr[AK 1BU]κ + 1

2Tr[B TAB U]

L2 Regularization λ 4S Tr[A] + 1

2Tr[AK 2Γ2U] Eq. (38) Eq. (11)

B COMPARISON WITH CONVENTIONAL HESSIAN APPROXIMATION

We compare our results for the three cases with the results obtained with the conventional Hessian approximation of the noise covariance, i.e., C H, where H is the Hessian of the loss function. We summarize the analytical results for a special case in Table 2.

B.1 LABEL NOISE

We ﬁrst consider discrete-time dynamics with the Hessian approximation. The matrix equation is

ΣA + AΣ λAΣA = λ

Compared with the exact result (3), it is a large-S limit up to the constant σ2. This constant factor is ignored during the approximation that J(w) = EB[ l l T] EB[ Tl] = H(w), which is exact only when l({xi},w) is a negative log likelihood function of w. Solving the matrix equation yields

S (2ID λA) 1. (19)

The training loss and the test loss are

2S Tr[A(2ID λA) 1] + 1

2S Tr[A(2ID λA) 1]. (21)

On the other hand, by taking the large-S limit directly from the exact equation (3), the factor σ2 is present:

ΣA + AΣ λAΣA = λ

S σ2A. (22)

For the continuous-time limit with the Hessian approximation, the matrix equation is

ΣA + AΣ = λ

which is the small-λ limit up to the factor σ2. The variance is

2S ID. (24)

Published as a conference paper at ICLR 2022

The training and the test error are

4S Tr[A] + 1

4S Tr[A]. (26)

Again, taking the small-λ limit directly from the exact result (3) shows the presence of the factor σ2 on the right hand side of the matrix equation.

B.2 INPUT NOISE

The case with the input noise is similar to the label noise. This can be understood if we replace A by K and σ2 by Tr[AK 1BU]. The model parameter variance resulting from the discrete-time dynamics under the Hessian approximation is

S (2ID λK) 1. (27)

The training and the test error are

2S Tr[K(2ID λK) 1] + 1

2Tr[AK 1BU], (28)

2S Tr[A(2ID λK) 1] + 1

2Tr[B TAB U]. (29)

The large-S limit from the exact matrix equation (144) results in a prefactor Tr[AK 1BU] in the ﬂuctuation:

S Tr[AK 1BU](2ID λK) 1. (30)

For the continuous-time limit, we take λ 0. The Hessian approximation gives

2S ID, (31)

4S Tr[K] + 1

2Tr[AK 1BU], (32)

4S Tr[A] + 1

2Tr[B TAB U]. (33)

The large-S limit again produces a prefactor Tr[AK 1BU].

B.3 L2 REGULARIZATION

For learning with regularization, there is a more difference between the Hessian approximation and the limit taken directly from the exact theory. We ﬁrst adopt the Hessian approximation for the discrete-time dynamics. The matrix equation is

ΣK + KΣ λKΣK = λ

which is similar to the previous subsection. However, it is different from the large-S limit of the exact matrix equation (154):

ΣK + KΣ λKΣK = λ

S (Tr[AK 2Γ2U]A + AK 1ΓUΓK 1A). (35)

This signiﬁcant difference suggests that the conventional Fisher-to-Hessian approximation J H fails badly. The ﬂuctuation, the training loss, and the test loss with the Hessian approximation are

S (2ID λK) 1, (36)

2S Tr[K(2ID λK) 1] + 1

2Tr[AK 1ΓU], (37)

2S Tr[A(2ID λK) 1] + 1

2Tr[AK 2Γ2U], (38)

Published as a conference paper at ICLR 2022

while the large-S limit of the exact theory yields

S Tr[AK 2Γ2U]AK 1(2ID λK) 1 + λ

S A2K 3Γ2(2ID λK) 1U, (39)

2S Tr[AK 2Γ2U]Tr[A(2ID λK) 1] + λ

2S Tr[A2K 2Γ2(2ID λK) 1U]

2Tr[AK 1ΓU], (40)

2S Tr[AK 2Γ2U]Tr[A(2ID λK) 1] + λ

2S Tr[A3K 3Γ2(2ID λK) 1U]

2Tr[AK 2Γ2U]. (41)

The continuous-time results are obtained by taking the small-λ limit on Eqs. (36)-(38) for the Hessian approximation and on Eqs. (39)-(41) for the limiting cases of the exact theory. Speciﬁcally, for the Hessian approximation, we have

2S ID, (42)

4S Tr[K] + 1

2Tr[AK 1ΓU], (43)

4S Tr[A] + 1

2Tr[AK 2Γ2U]. (44)

The small-λ limit of the exact theory yields

2S Tr[AK 2Γ2U]AK 1 + λ

2S A2K 3Γ2U, (45)

4S Tr[AK 2Γ2U]Tr[A] + λ

4S Tr[A2K 2Γ2U] + 1

2Tr[AK 1ΓU], (46)

4S Tr[AK 2Γ2U]Tr[A] + λ

4S Tr[A3K 3Γ2U] + 1

2Tr[AK 2Γ2U]. (47)

Published as a conference paper at ICLR 2022

C PROOF OF THE GENERAL FORMULA

C.1 PROOF OF THEOREM 5 AND COROLLARY 2

We restate the theorem.

Theorem 7. Let the training loss be LΓ = L0 + 1

2w TΓw and the models be optimized with SGD in the neighborhood of a local minimum w . When Γ 0, the noise covariance is given by

C = 2L0(w )

S Γw w TΓ + O(S 2) + O( w w 2). (48)

When Γ = 0 and L0(w ) 0,

C = 2L0(w )

S H0(w ) + O(S 2) + O( w w 2). (49)

When Γ = 0 and L0(w ) = 0,

S (Tr[H0(w )Σ]ID H0(w )Σ)H0(w ) + O(S 2) + O( w w 2). (50)

Proof. We will use the following shorthand notations: ℓi = ℓ(f(w,xi),yi), ℓ i = ℓi

f , ℓ i = 2ℓi

f 2 . The Hessian of the loss function without regularization H0(w) = TL0(w) is given by

N i=1 ℓ i f(w,xi) f(w,xi)T + 1

N i=1 ℓ i Tf(w,xi). (51)

The last term of Eq. (51) can be ignored when L0 1, since

N i=1 ℓ i Tf(w,xi)

N i=1 (ℓ i)2)

N i=1 Tf(w,xi) 2 F )

= ℓ 2 1/2 ( 1

N i=1 Tf(w,xi) 2 F )

N i=1 Tf(w,xi) 2 F )

where F stands for the Frobenius norm3, and we have deﬁned the variable ℓ 2 = 1

N N i=1(ℓ i)2. Since ℓ i = 1 for the mean-square error, we obtain

N i=1 f(w,xi) f(w,xi)T + O (

near a minimum. The Hessian with regularization HΓ(w) = TLΓ(w) is just given by H0(w)+Γ.

On the other hand, the SGD noise covariance C(w) is given by Eq. (2). By assumption 2, the SGD noise covariance is directly related to the Hessian:

N i=1 f(w,xi) f(w,xi)T 1

S LΓ(w) LΓ(w)T

N i=1 (ℓi L0) f(w,xi) f(w,xi)T

S LΓ(w) LΓ(w)T + o(L0). (53)

This ﬁnishes the proof.

3In the linear regression problem, the last term of Eq. (51) does not exist since Tf(w, xi) = 0.

Published as a conference paper at ICLR 2022

Now we prove Corollary 2.

Proof. Near a minimum w of the full loss LΓ(w), we have

LΓ(w) = H0(w )(w w ) + Γw + O( w w 2), (54)

within the approximation LΓ(w) = LΓ(w )+(1/2)(w w )THΓ(w )(w w )T +O( w w 2). Equations (14) and (54) give the SGD noise covariance near a minimum of LΓ(w).

Now it is worth discussing two different cases separately: (1) with regularization and (2) without regularization. We ﬁrst discuss the case when regularization is present. In this case, the regularization Γ is not small enough, and the SGD noise covariance is not proportional to the Hessian. Near a local or global minimum w w , the ﬁrst term of the right-hand side of Eq. (54) is negligible, and hence we obtain

Ew[C(w)] = 2L0(w )

S H0(w )(w w )(w w )TH0(w )] + O( w w 2)

S Γw w TΓ + O(S 2) + O( w w 2). (55)

where we have used the fact that E[w] = w . The SGD noise does not vanish even at a global minimum of LΓ(w). Note that this also agrees with the exact result derived in Sec. 4.3: together with an anisotropic noise that is proportional to the Hessian, a rank-1 noise proportional to the strength of the regularization appears. This rank-1 noise is a signature of regularization.

On the other hand, as we will see below, the SGD noise covariance is proportional to the Hessian near a minimum when there is no regularization, i.e., Γ = 0. We have

C(w) = 2L0(w)

S H0(w )(w w )(w w )TH0(w ) + O( w w 2). (56)

For this case, we need to differentiate between a local minimum and a global minimum. When L0(w ) is not small enough (e.g. at a local but not global minimum),

C(w) = 2L0(w )

S H0(w) + (w w )TH0(w )(w w )

S H0(w )(w w )(w w )TH0(w ) + O( w w 2)

S H0(w) + O(S 2) + O( w w 2)

S H0(w ) + O(S 2) + O( w w 2), (57)

and so, to leading order,

C = 2L0(w )

S H0(w ), (58)

which is proportional to the Hessian but also proportional to the achievable approximation error.

On the other hand, when L0(w ) is vanishingly small (e.g. at a global minimum), we have 2L0(w) (w w )TH0(w )(w w ), and thus obtain

S [(w w )TH0(w )(w w )H0(w ) H0(w )(w w )(w w )TH0(w )]

+ O(S 2) + O( w w 2), (59)

S (Tr[H0Σ]ID H0Σ)H0 + O(S 2) + O( w w 2). (60)

This completes the proof.

Published as a conference paper at ICLR 2022

Remark. It should be noted that the second term on the right-hand side of Eq. (59) would typically be much smaller than the ﬁrst term for large D. For example, when H0(w ) = a ID with a > 0, the ﬁrst and the second terms are respectively given by (a2/S) w w 2ID and (a2/S)(w w )(w w )T. The Frobenius norm of the former is given by (Da2/S) w w 2, while that of the latter is given by (a2/S) w w 2, which indicates that in Eq. (59), the ﬁrst term is dominant over the second term for large D. Therefore the second term of Eq. (59) can be dropped for large D, and Eq. (59) is simpliﬁed as

C(w) (w w )TH0(w )(w w )

S H0(w ); E[C] Tr[H0Σ]

Again, the SGD noise covariance is proportional to the Hessian.

In conclusion, as long as the regularization is small enough, that the SGD noise covariance near a minimum is proportional to the Hessian is a good approximation. This implies that the noise is multiplicative, which is known to lead to a heavy tail distribution (Clauset et al., 2009; Levy and Solomon, 1996). Thus, we have studied the nature of the minibatch SGD noise in three different situations. As an example, we have demonstrated the power of this general formulation by applying it to the high-dimensional linear regression problem in Sec. 6.1.

C.2 PROOF OF COROLLARY 3

Proof. We prove the case where Γ = 0 and L(w ) 0 as an example. Substituting Theorem 5 into Theorem 1yields

[2ID 1 1 + µΛH0]ΛH0Σ = 2L0 S(1 µ)Λ2H0, (62)

where we have assumed necessary commutation relations. Suppose that the Hessian H0 is of rank-r with r D. The singular-value decomposition and its Moore-Penrose pseudo inverse are given by H0 = USV T and H+ 0 = V S+U T, respectively, where U and V are unitary, S is a rank-r diagonal matrix with elements being singular values of H0, and S+ is obtained by inverting every non-zero entry of S. Multiplying H+ 0 to both sides of the above equation, we have

PrΣ = 2L0 S(1 µ)PrΛ(2ID Λ 1 + µH0)

where Pr = diag(1,1,...,1,0,...,0) is the projection operator with r non-zero entries. When the Hessian is full-rank, i.e., r = D, the Moore-Penrose pseudo inverse is nothing but the usual inverse. The other cases can be calculated similarly.

Published as a conference paper at ICLR 2022

D APPLICATIONS

D.1 INFINITE-DIMENSIONAL LIMIT OF THE LINEAR REGRESSION PROBLEM

Now we apply the general theory in Sec. 5 to linear regressions in the high-dimensional limit, namely N,D with α = N/D held ﬁxed.

D.1.1 PROOF OF PROPOSITION 4

The loss function

L(w) = 1 2N

N i=1 (w Txi yi) 2 (64)

with yi = u T + ϵi can be written as

2 (w u ˆA+v) T ˆA(w u ˆA+v) 1

2v T ˆA+v + 1

N i=1 ϵ2 i , (65)

where ˆA = 1

N N i=1 xix T i is an empirical covariance for the training data and v = 1

N N i=1 xiϵi. The symbol ( )+ denotes the Moore-Penrose pseudoinverse. We also introduce the the averaged traing loss: Ltrain = Ew[L(w)]

The minimum of the loss function is given by

w = u + ˆA+v + Πr, (66)

where r RD is an arbitrary vector and Π is the projection onto the null space of ˆA. Since 1 Π = ˆA+ ˆA, w is also expressed as

w = ˆA+( ˆAu + v) + Πr. (67)

In an underparameterized regime α > 1, Π = 0 almost surely holds as long as the minimum eigenvalue of A (not ˆA) is positive (Hastie et al., 2019). In this case, ˆA+ = ˆA 1 and we obtain

w = u + ˆA 1v for α > 1. (68)

On the other hand, in an overparameterized regime α > 1, Π 0 and there are inﬁnitely many global minima. In the ridgeless regression, we consider the global minimum that has the minimum norm w , which corresponds to

w = ˆA+( ˆAu + v) = (1 Π)u + ˆA+v for ridgeless regression with α < 1. (69)

In both cases, the loss function is expressed as

2 (w w )T ˆA(w w ) 1

2v T ˆAv + 1

N i=1 ϵ2 i . (70)

Asymptotically, wt converges to a stationary point w with ﬂuctuation Σ obeying the following equation (Theorem 1: λ ˆAΣ + λΣ ˆA λ2 ˆAΣ ˆA = λ2C. (71)

The SGD noise covariance C is given by Eq. (14). In the present case, the Hessian is given by H = ˆA and we also have

N i=1 (ℓ i)2 = 1

N i=1 (w Txi yi) 2 = 2

N i=1 ℓi = 2L(w). (72)

On the other hand, L(w) L(w)T = ˆA(w w )(w w )T ˆA, and hence Ew[ L(w) L(w)T] = ˆAΣ ˆA. Therefore we obtain

C = Ew[C(w)] = 2Ltrain

S ˆAΣ ˆA. (73)

Published as a conference paper at ICLR 2022

Now, we ﬁnd Ltrain. First, we deﬁne X RN D as Xik = (xi)k, and ϵ RN as ϵi = ϵi. Then w = (1 Π)u + ˆA+v = (1 Π)u + (XTX)+XT ϵ.

With this notation, we have ˆA = XTX/N, and the loss function is expressed as

2(w w )T ˆA(w w ) 1

2N ϵTX(XTX)+XT ϵ + 1

N i=1 ϵ2 i . (74)

We therefore obtain

2Tr[ ˆAΣ] 1

2N E[ ϵTX(XTX)+XT ϵ] + σ2

Here, E[ ϵTX(XTX)+XT ϵ] = σ2Tr[(XTX)(XTX)+] = σ2Tr(1 Π). (76) We can prove that the following identity is almost surely satisﬁed (Hastie et al., 2019) as long as the smallest eigenvalue of A (not ˆA) is positive:

Tr(1 Π) = min{D,N}. (77)

We therefore obtain

2Tr[ ˆAΣ] σ2

2N min{D,N} + σ2

1 2Tr[ ˆAΣ] + 1

α)σ2 for α > 1,

1 2Tr[ ˆAΣ] for α 1 (78)

By substituting Eq. (78) into Eq. (73), we obtain the following SGD noise covariance:

1 S (Tr[ ˆAΣ] ˆAΣ) ˆA + σ2

α) ˆA for α > 1,

1 S (Tr[ ˆAΣ] ˆAΣ) ˆA for α 1. (79)

This ﬁnishes the proof.

D.1.2 PROOF OF THEOREM 6

Proof. We have to solve this equation:

ˆAΣ + Σ ˆA λ ˆAΣ ˆA = λC, (80)

where C is given in Proposition 4. Using the similar trick of multiplying by ˆG = 2ID λ(1 1

S ) ˆA as in Appendix E.2.2, one obtains

Tr[ ˆAΣ] = { λσ2

α) ˆκ for α > 1; 0 for α 1, (81)

where ˆκ = Tr[ ˆ G 1 ˆ A] 1 λ

S Tr[ ˆ G 1 ˆ A] with ˆG = 2ID λ(1 1

Substituting the above trace into the matrix equation, we have

S ˆκ) ˆG 1 for α > 1; 0 for α 1. (82)

D.2 SECOND-ORDER METHODS

Proposition 5. Suppose that we run DNM with Λ = λA 1 with random noise in the label. The model ﬂuctuation is

g S λDA 1, (83)

where g = 2(1 µ) ( 1 µ

Published as a conference paper at ICLR 2022

Proof. Substituting Λ = λA 1 into Eqs. (3) and (5) yields

S (Tr[AΣ] + σ2)A 1, (84)

where g == 2(1 µ) ( 1 µ

S )λ. Multiplying A and taking trace on both sides, we have

Tr[AΣ] = λDσ2

g S λD. (85)

Therefore, the model ﬂuctuation is

g S λDA 1. (86)

Proposition 6. Suppose that we run NGD with Λ = λ

S C 1 with random noise in the label. The model ﬂuctuation is

Á Á Àλ2g2 + 4λ(g 2 1 + D 1 1 + µ) σ2

1 + D + 4( σ2

where g = 1 1+D 1 1+µ + 1 1 µ 1 S .

Proof. Similarly to the previous case, the matrix equation satisﬁed by Σ is

(1 µ)(C 1AΣ + ΣAC 1) 1 + µ2

1 µ2 λ S C 1AΣAC 1 + µ 1 µ2 λ S (C 1AC 1AΣ + ΣAC 1AC 1) = λ

Although it is not obvious how to directly solve this equation, it is possible to guess one solution according to the hope that Σ be proportional to J 1, in turn, A 1 (Amari, 1998; Liu et al., 2021). We assume that Σ = x A 1 and substitute it into the above equation to solve for x. This yields one solution without claiming its uniqueness. By simple algebra, this x is solved to be

Á Á Àλ2g2 + 4λ(g 2 1 + D 1 1 + µ) σ2

1 + D + 4( σ2

Let σ = 0. We obtain the result in Sec. 6.4.

D.3 ESTIMATION OF TAIL INDEX

In Mori et al. (2021); Meng et al. (2020), it is shown that the (1d) discrete-time SGD results in a distribution that is similar to a Student s t-distribution:

p(w) (σ2 + aw2) 1+β

where σ2 is the degree of noise in the label, and a is the local curvature of the minimum. For large w, this distribution is a power-law distribution with tail index:

p( w ) w (1+β), (91)

and it is not hard to check that β also equal to the smallest moment of w that diverges: E[wβ] = . Therefore, estimating β can be of great use both empirically and theoretically.

In continuous-time, it is found that βcts = 2S

aλ + 1 (Mori et al., 2021). For discrete-time SGD, we hypothesize that the discrete-time nature causes a change in the tail index β = βcts + ϵ, and we are interested in ﬁnding ϵ. We propose a semi-continuous" approximation to give the formula to estimate the tail index. Notice that Theorem 2 gives the variance of the discrete-time SGD, while Eq. (90) can be integrated to give another value of the variance, and the two expressions must be equal for consistency. This gives us an equation that β must satisfy:

p(w;β)(w E[w])2 = Var[w], (92)

Published as a conference paper at ICLR 2022

Figure 7: Tail index β of the stationary distribution of SGD in a 1d linear regression problem. Left to Right: aλ = 0.2, 1.0, 1.8.

this procedure gives the following formula:

β(λ,S) = 2S

aλ S = βcts + ϵ, (93)

and one immediately recognizes that (S + 1) is the discrete-time contribution to the tail index. See Figure 7 for additional experiments. We see that the proposed formula agrees with the experimentally measured value of the tail index for all ranges of the learning rate, while the result of Mori et al. (2021) is only correct when λ 0+. Hodgkinson and Mahoney (2020) also studies the tail exponent of discrete-time SGD; however, their conclusion is only that the index decreases with the learning rate and increases with the batch size". In contrast, our result give the functional form of the tail index directly. In fact, this is the ﬁrst work that gives any functional form for the tail index of discrete-time SGD ﬂuctuation to the best of our knowledge.

The following proposition gives the intermediate steps in the calculation.

Proposition 7. (Tail index estimation for discrete-time SGD) Let the parameter distribution be

p(w) (σ2 + aw2) 1+β

2 , and Var[w] be given by Theorem 2. Then

β(λ,S) = 2S

Proof. The normalization factor for the distribution exists if β > 0:

If β > 2, the variance exists and the value is

Var[w] = σ2

a(β 2). (96)

By equating Eq. (96) with the exact variance (6), we are able to solve for an expression of the tail index as β(λ,S) = 2S

Published as a conference paper at ICLR 2022

E PROOFS AND ADDITIONAL THEORETICAL CONSIDERATIONS

E.1 PROOF OF PROPOSITION 1

The with-replacement sampling is deﬁned in Deﬁnition 2. Let us here deﬁne the without-replacement sampling. Deﬁnition 3. A minibatch SGD without replacement computes the update to the parameter w with the following set of equations:

S i Bt ℓ(xi,yi,wt 1); wt = wt 1 λˆgt, (98)

where S = Bt N is the minibatch size, and the set Bt is an element uniformly-randomly drawn from the set of all S-size subsets of {1,...,N}.

From the deﬁnition of the update rule for sampling with or without replacement, the covariance matrix of the SGD noise can be exactly derived. Proposition 8. The covariance matrices of noise in SGD due to minibatch sampling as deﬁned in Deﬁnitions 2 and 3 with an arbitrary N are

N N i=1 ℓi ℓT i L(w) L(w)T], (with replacement) N S S(N 1) [ 1

N N i=1 ℓi ℓT i L(w) L(w)T], (without replacement) (99)

where the shorthand notation ℓi(w) = l(xi,yi,w) is used.

In the limit of S = 1 or N S, two cases coincide. In the N S limit, both methods of sampling have the same noise covariance as stated in Proposition 1:

C(w) = 1 SN

N i=1 ℓi ℓT i 1

S L(w) L(w)T. (100)

Remark. We also note that a different way of deﬁning minibatch noise exists in Hoffer et al. (2017). The difference is that our deﬁnition requires the size of each minibatch to be exactly S, while Hoffer et al. (2017) treats the batch size also as a random variable and is only expected to be S. In comparison, our deﬁnition agrees better with the common practice.

Now we prove Proposition 8.

Proof. We derive the noise covariance matrices for sampling with and without replacement. We ﬁrst derive the case with replacement. According to the deﬁnition, the stochastic gradient for sampling with replacement can be rewritten as

N n=1 gnsn, (101)

where gn = ℓn and

sn = l, if l multiple n s are sampled in S, with 0 l S. (102)

The probability of sn assuming value l is given by the multinomial distribution

P(sn = l) = (S l )( 1

S l . (103)

Therefore, the expectation value of sn is given by

EB[sn] = S l=0 l P(sn = l) = S

which gives

EB[ˆg] = g = 1

N n=1 gn = L(w). (105)

Published as a conference paper at ICLR 2022

For the covariance, we ﬁrst calculate the covariance between sn and sn . Due to the properties of the covariance of multinomial distribution, we have for n n

EB[snsn ] = cov[sn,sn ] + E[sn]2

N 2 ; (106)

and for n = n

EB[snsn] = Var[sn] + E[sn]2

= SN + S(S 1)

N 2 . (107)

Substituting these results into the deﬁnition of the noise covariance yields

C(w) = EB[ˆgˆg T] EB[ˆg]EB[ˆg]T

N n =1 gng T n EB[snsn ] gg T

N n,n =1 gng T n S(S 1)

N n=1 gng T n [SN + S(S 1)

N n=1 gng T n 1

N i=1 ℓi ℓT i L(w) L(w)T]. (108)

Then, we derive the noise covariance for sampling without replacement. Similarly, according to the deﬁnition, the stochastic gradient for sampling without replacement can be rewritten as

N n=1 gnsn, (109)

sn = {0,if n S; 1,if n S. (110)

The probability of n that is sampled in S from N is given by

P(sn = 1) = (N 1 S 1)

The expectation value of sn is then given by

EB[sn] = P(sn = 1) = S

which gives

EB[ˆg] = g = 1

N n=1 gn = L(w). (113)

For the covariance, we ﬁrst calculate the covariance between sn and sn . By deﬁnition, we have for n n

EB[snsn ] = P(sn = 1,s n = 1) = P(sn = 1 s n = 1)P(s n = 1)

= (N 2 S 2)

(N S) = S(S 1)

N(N 1); (114)

Published as a conference paper at ICLR 2022

and for n = n

EB[snsn] = P(sn = l) = S

Substituting these results into the deﬁnition of the noise covariance yields

C(w) = EB[ˆgˆg T] EB[ˆg]EB[ˆg]T

N n =1 gng T n EB[snsn ] gg T

N n,n =1 gng T n S(S 1)

N n=1 gng T n [ S

N(N 1)] gg T

N n=1 gng T n N S S(N 1)gg T

= N S S(N 1) [ 1

N i=1 ℓi ℓT i L(w) L(w)T]. (116)

E.2 PROOFS IN SEC. 4.2

E.2.1 PROOF OF LEMMA 1

Proof. From the deﬁnition of noise covariance (2), the covariance matrix for the noise in the label is

C(w) = 1 NS

N i=1 li(wt 1) li(wt 1)T 1

S L(wt 1) L(wt 1)T

N i (w Txi ϵi)xix T i (w Txi ϵi)T 1

N i (w Txi ϵi)xi]

N j x T j (w Txj ϵj)T

N i (w Txixix T i x T i w + ϵ2 i xix T i ) 1

N i (w Txixi)]

N j (x T i x T i w) (117)

S (Aww TA + Tr[Aww T]A + σ2A), (118)

where we have invoked the law of large numbers and the expectation value of the product of four Gaussian random variables in the third line is evaluated as follows.

Because N is large, we invoke the law of large numbers to obtain the (j,k)-th component of the matrix as

N i=1 (w Txixix T i x T i w)jk = EB[w Txxx Tx Tw]jk = EB [ D i wixixjxk

D i xi wi ]. (119)

Because the average is taken with respect to x and each x is a Gaussian random variable, we apply the expression for the product of four Gaussian random variables E[x1x2x3x4] = E[x1x2]E[x3x4] + E[x1x3]E[x2x4]+E[x1x4]E[x2x3] 2E[x1]E[x2]E[x3]E[x4] (Janssen and Stoica, 1988) to obtain

EB [ D i wixixjxk

D i xi wi ]

= EB [ D i wixixj]EB [xk

D i xi wi ] + EB [ D i wixixk]EB [xj

D i xi wi ]

+ EB [ D i wixi

D i xi wi ]EB [xjxk]

= 2(Aww TA)jk + Tr[Aww T]Ajk. (120)

Writing Σ = Ew[ww T], we obtain

Ew[C(w)] = C = 1

S (AΣA + Tr[AΣ]A + σ2A). (121)

Published as a conference paper at ICLR 2022

This method has been utilized repeatedly in this work.

E.2.2 PROOF OF THEOREM 2

Proof. We substitute Eq. (5) into Eq. (3) which is a general solution obtained in a recent work (Liu et al., 2021):

(1 µ)(ΛAΣ + ΣAΛ) 1 + µ2

1 µ2 ΛAΣAΛ + µ 1 µ2 (ΛAΛAΣ + ΣAΛAΛ) = ΛCΛ. (122)

To solve it, we assume the commutation relation that [Λ,A] = ΛA AΛ = 0. Therefore, the above equation can be alternatively rewritten as

S )ΛA]ΣAΛ + ΛAΣ[(1 µ)ID 1

S Tr[AΣ]ΛA = 1

S σ2ΛA. (123)

To solve this equation, we ﬁrst need to solve for Tr[AΣ]. Multiplying Eq. (123) by G 1 µ =

[2(1 µ)ID ( 1 µ

S )ΛA] 1 and taking trace, we obtain

S Tr[AΣ]Tr[ΛAG 1 µ ] = 1

S σ2Tr[ΛAG 1 µ ], (124)

which solves to give

Tr[AΣ] = σ2

Tr[ΛAG 1 µ ]

S Tr[ΛAG 1 µ ] = σ2

S κµ. (125)

Therefore, Σ is

S )Λ[2(1 µ)ID (1 µ

E.2.3 TRAINING ERROR AND TEST ERROR FOR LABEL NOISE

In the following theorem, we calculate the expected training and test loss for random noise in the label. Theorem 8. (Approximation error and test loss for SGD noise in the label) The expected approximation error, or the training loss, is deﬁned as Ltrain = Ew[L(w)]; the expected test loss is deﬁned as Ltest = 1

2Ew EB [(w Tx)2]. For SGD with noise in the label given by Eq. (5), the expected approximation error and test loss are

Ltrain = σ2

Ltest = λσ2

2S κ. (128)

Remark. Notably, the training loss decomposes into two additive terms. The term that is proportional to 1 is the bias, caused by insufﬁcient model expressivity to perfectly ﬁt all the data points, while the second term that is proportional to λκ/S is the variance in the model parameter, induced by the randomness of minibatch noise. Remark. When the learning rate λ is vanishingly small, the expected test loss diminishes whereas the training error remains ﬁnite as long as label noise exists.

Proof. We ﬁrst calculate the approximation error. By deﬁnition,

Ltrain = Ew[L(w)]

2Tr[AΣ] + 1

Published as a conference paper at ICLR 2022

The test loss is

2Ew [w TAw] = 1

2S κ. (130)

E.3 MINIBATCH NOISE FOR RANDOM NOISE IN THE INPUT

E.3.1 NOISE STRUCTURE

Similar to label noise, noise in the input data can also cause ﬂuctuation. We assume that the training data points xi = xi+ηi can be decomposed into a signal part and a random part. As before, we assume Gaussian distributions, xi N(0,A) and ηi N(0,B). The problem remains analytically solvable if we replace the Gaussian assumption by the weaker assumption that the fourth-order moment exists and takes some matrix form. For conciseness, we assume that there is no noise in the label, namely yi = u Txi with a constant vector u. One important quantity in this case will be uu T = U. Notice that the trick w u = w no more works, and so we write the difference explicitly here. The loss function then takes the form

L(w) = 1 2N

N i=1 [(w u)Txi + w Tηi] 2 = 1

2(w u)TA(w u) + 1

2w TBw. (131)

The gradient L = (A + B)(w u) + Bu vanishes at w = (A + B) 1Au, which is the minimum of the loss function and the expectation of the parameter at convergence. It can be seen that, even at the minimum w , the loss function remains ﬁnite unless u = 0, which reﬂects the fact that in the presence of input noise, the network is not expressive enough to memorize all the information of the data. The SGD noise covariance for this type of noise is calculated in the following proposition. Proposition 9. (Covariance matrix for SGD noise in the input) Let the algorithm be updated according to Eq. (1) or (98) with random noise in the input while the limit N is taken with D held ﬁxed. Then the noise covariance is

S {KΣK + Tr[KΣ]K + Tr[AK 1BU]K}, (132)

where K = A + B, and Σ = Ew [(w w )(w w )T]. Remark. It can be seen that the form of the covariance (132) of input noise is similar to that of label noise (5) with replacing A by K and σ2 by Tr[AK 1BU], suggesting that these two types of noise share a similar nature.

Deﬁning the test loss as Ltest = 1 2Ew EB [(w Tx u Tx)2], Proposition 9 can then be used to calculate the test loss and the model ﬂuctuation. Theorem 9. (Training error, test loss and model ﬂuctuation for noise in the input) The expected training loss is deﬁned as Ltrain = Ew[L(w)], and the expected test loss is deﬁned as Ltest = 1 2Ew EB [(w Tx u Tx)2]. For SGD with noise in the input given in Proposition (9), the expected approximation error and test loss are

2Tr[AK 1BU](1 + λ

S κ ), (133)

2S Tr[AK 1BU]κ + 1

2Tr[B TAB U], (134)

where κ = Tr[KG 1] 1 λ 1

S Tr[KG 1] with G = 2ID λ(1 + 1

S )K, and B = K 1B. Moreover, let [K,U] = 0. Then the covariance matrix of model parameters is

Σ = λTr[AK 1BU]

S )[2ID λ(1 + 1

Remark. Note that [K,U] = 0 is necessary only for an analytical expression of Σ. It can be obtained by solving Eq. (144) even without invoking [K,U] = 0. In general, the condition that [K,U] = 0 does not hold. Therefore, only the training and test error can be calculated exactly. Remark. The test loss is always smaller than or equal to the training loss because all matrices involved here are positive semideﬁnite.

Published as a conference paper at ICLR 2022

E.3.2 PROOF OF PROPOSITION 9

Proof. We deﬁne Σ = Ew [(w w )(w w )T]. Then,

Ew[ww T] = Σ + (A + B) 1AUA(A + B) 1 = Σ + A UA T = ΣA, (136)

Ew [(w u)(w u)T] = Σ + B UB T = ΣB, (137)

where we use the shorthand notations A = (A+B) 1A, B = (A+B) 1B and ΣA = Σ+A UA T, ΣB = Σ + B UB T. We remark that the covariance matrix Σ here still satisﬁes the matrix equation (3) with the Hessian being K = A + B.

The noise covariance is

N i (w T xi u Txi) xi x T i (w T xi u Txi)T 1

S L(w) L(w)T

S {A(w u)(w u)TA + Bww TB + A(w u)w TB + Bw(w u)TA

+ Tr[A(w u)(w u)T]K + Tr[Bww T]K}. (138)

In Eq. (138), there are four terms without trace and two terms with trace. We ﬁrst calculate the traceless terms. For the latter two terms, we have

Ew[(w u)w T] = Σ A UB T, (139)

Ew[w(w u)T] = Σ B UA T. (140)

Because A + B = ID, after simple algebra the four traceless terms result in 2(A + B)Σ(A + B).

The two traceful terms add to Tr[AΣB + BΣA]K. With the relation AB = BA , what inside the trace is

AΣB + BΣA = KΣ + AK 1BU. (141)

Therefore, the asymptotic noise is

C = Ew[C(w)]

S {KΣK + Tr[AΣB + BΣA]K} (142)

S {KΣK + Tr[KΣ]K + Tr[AK 1BU]K}. (143)

E.3.3 PROOF OF THEOREM 9

Proof. The matrix equation satisﬁed by Σ is

ΣK + KΣ λ(1 + 1

S (Tr[KΣ]K + Tr[AK 1BU]K). (144)

By using a similar technique as in Appendix E.2.2, the trace Tr[KΣ] can be calculated to give

Tr[KΣ] = λTr[AK 1BU]

S κ , (145)

where κ = Tr[KG 1] 1 λ 1

S Tr[KG 1] with G = 2ID λ(1 + 1

With Eq. (145), the training error and the test error can be calculated. The approximation error is

Ltrain = Ew[L(w)] = 1

2Tr[AΣB + BΣA] = 1

2Tr[AK 1BU](1 + λ

S κ ). (146)

Published as a conference paper at ICLR 2022

The test loss takes the form of a bias-variance tradeoff:

2Ew EB [(w Tx u Tx)2] = 1

2Ew [(w u)TA(w u)] = 1

2S Tr[AK 1BU](1 + λκ

S )Tr[AG 1] + 1

2Tr[B TAB U]

2S Tr[AK 1BU]κ + 1

2Tr[B TAB U]. (147)

Let [K,U] = 0. Then Σ can be explicitly solved because it is a function of K and U. Speciﬁcally,

Σ = λTr[AK 1BU]

S )[2ID λ(1 + 1

E.4 PROOFS IN SEC. 4.3

E.4.1 PROOF OF PROPOSITION 3

Proof. The covariance matrix of the noise is

N i [(w u)Txixi + Γw][x T i x T i (w u) + w TΓ] 1

S LΓ(w) LΓ(w)T

S {A(w u)(w u)TA + Tr[A(w u)(w u)T]A}. (149)

Using a similar trick as in Appendix E.3.2, the asymptotic noise is

S (AΣA + Tr[AΣ]A + Tr[Γ TAΓ U]A + ΓA UA Γ). (150)

E.4.2 PROOF OF THEOREM 4

Besides the test loss and the model ﬂuctuation, we derive the approximation error here as well.

Theorem. (Training error, test loss and model ﬂuctuation for learning with L2 regularization) The expected training loss is deﬁned as Ltrain = Ew[L(w)], and the expected test loss is deﬁned as Ltest = 1

2Ew EB [(w Tx u Tx)2]. For noise induced by L2 regularization given in Proposition 3, let [A,Γ] = 0. Then the expected approximation error and test loss are

2S Tr[AK 2Γ2U]Tr[AG 1](1 + λκ

2S (Tr[A2K 2Γ2G 1U] + λr

S Tr[AG 1])

2Tr[AK 1ΓU], (151)

2S (Tr[AK 2Γ2U]κ + r) + 1

2Tr[AK 2Γ2U], (152)

where κ = Tr[A2K 1G 1] 1 λ

S Tr[A2K 1G 1], r = Tr[A3K 3Γ2G 1U]

S Tr[A2K 1G 1], with G = 2ID λ(K + 1

S K 1A2). Moreover, if A, Γ and U commute with each other, the model ﬂuctuation is

S Tr[AK 2Γ2U](1 + λκ

S )AK 1G 1 + λ

S (A2K 2Γ2U + λr

S A)K 1G 1. (153)

Remark. Because Γ may not be positive semideﬁnite, the test loss can be larger than the training loss, which is different from the input noise case.

Published as a conference paper at ICLR 2022

Proof. The matrix equation obeyed by Σ is

ΣK + KΣ λKΣK λ

S Tr[AΣ]A + λ

S (Tr[AK 2Γ2U]A + AK 1ΓUΓK 1A),

where we use the shorthand notation K = A + Γ. Let [A,Γ] = 0. Using the trick in Appendix E.2.2, the trace term Tr[AΣ] is calculated as

S (Tr[AK 2Γ2U]κ + r), (155)

where κ = Tr[A2K 1G 1] 1 λ

S Tr[A2K 1G 1], r = Tr[A3K 3Γ2G 1U]

S Tr[A2K 1G 1], and G = 2ID λ(K + 1

The training error is

2Tr[AΣΓ + ΓΣA]

2Tr[KΣ] + 1

2Tr[AK 1ΓU]

2S Tr[AK 2Γ2U]Tr[AG 1](1 + λκ

2S (Tr[A2K 2Γ2G 1U] + λr

S Tr[AG 1])

2Tr[AK 1ΓU]. (156)

The test loss is

2Ew EB [(w Tx u Tx)2]

2Ew [(w u)TA(w u)] = 1

2S (Tr[AK 2Γ2U]κ + r) + 1

2Tr[AK 2Γ2U]. (157)

Let A, Γ and U commute with each other. Then,

S Tr[AK 2Γ2U](1 + λκ

S )AK 1G 1 + λ

S (A2K 2Γ2U + λr

S A)K 1G 1. (158)

E.4.3 PROOF OF COROLLARY 1

For a 1d example, the training loss and the test loss have a simple form. We use lowercase letters for 1d cases.

Corollary 4. For a 1d SGD with L2 regularization, the training loss and the test loss are

Ltrain = aγ 2(a + γ)

2(a + γ) λ[(a + γ)2 + 2

2(a + γ) λ[(a + γ)2 + 2

S a2] u2, (159)

Ltest = aγ2

2(a + γ) 2 λ(a + γ) 2(a + γ) λ[(a + γ)2 + 2

S a2]u2. (160)

Proof. The training error and the test loss for 1d cases can be easily obtained from Theorem E.4.2.

Now we prove Corollary 1.

Proof. The condition for convergence is 1 λ

S Tr[A2K 1G 1] > 0. Speciﬁcally,

λ(a + γ)2 2(a + γ) + λ 2

S a2 < 0. (161)

Published as a conference paper at ICLR 2022

For a given γ, the learning rate needs to satisfy

λ < 2(a + γ) (a + γ)2 + 2

S a2 . (162)

For a given λ, γ needs to satisfy

λ < γ < 1 aλ +

which indicates a constraint on λ:

If γ is allowed to be non-negative, the optimal value can only be 0 due to the convergence condition. Therefore, a negative optimal γ requires an upper bound on it being negative, namely

λ < 0. (165)

Solving it, we have

aλ > 2 1 + 2

By combining with Eq. (164), a necessary condition for the existence of a negative optimal γ is

2 (S 2)2 > 0 S 2. (167)

Hence, a negative optimal γ exists, if and only if

2 , and S 2. (168)

For higher dimension with Γ = γID, it is possible to calculate the optimal γ for minimizing the test loss (11) as well. Speciﬁcally, the condition is given by

d dγ Ltest = 1

2 d dγ f(γ) g(γ) = 0, (169)

f(γ) = γ2Tr[AK 2 (ID + λ

S A2K 1G 1)U], (170)

S Tr[A2K 1G 1]. (171)

Although it is impossible to solve the equation analytically, it can be solved numerically.