# learning_trajectories_are_generalization_indicators__9b8d9432.pdf

Learning Trajectories are Generalization Indicators

Jingwen Fu1 , Zhizheng Zhang2 , Dacheng Yin3 , Yan Lu2 , Nanning Zheng1 fu1371252069@stu.xjtu.edu.cn {zhizzhang,yanlu}@microsoft.com ydc@mail.ustc.edu.cn nnzheng@mail.xjtu.edu.cn 1National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi an Jiaotong University, 2Microsoft Research Asia, 3University of Science and Technology of China

This paper explores the connection between learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized using (stochastic) gradient descent algorithms. Instead of concentrating solely on the generalization error of the DNN post-training, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error. This perspective enable a more direct comprehension of how the learning trajectory influences generalization error. Building upon this analysis, we propose a new generalization bound that incorporates more extensive trajectory information. Our proposed generalization bound depends on the complexity of learning trajectory and the ratio between the bias and diversity of training set. Experimental observations reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels. These results demonstrate that learning trajectory information is a valuable indicator of a model s generalization capabilities.

1 Introduction

The generalizability of a Deep Neural Network (DNN) is a crucial research topic in the field of machine learning. Deep neural networks are commonly trained with a limited number of training samples while being tested on unseen samples. Depite the commonly used independent and identically distributed (i.i.d.) assumption between the training and testing sets, there often exists a varying degree of discrepancy between them in real-world applications. Generalization theories study the generalization of DNNs by modeling the gap between the empirical risk [37] and the popular risk [37]. Classical uniform convergence based methods [20] adopt the complexity of the function space to analyze this generalization error. These theories discover that more complex function space results in a larger generalization error [38]. However, they are not well applicable for DNNs [33, 22]. In deep learning, the double descent phenomenon [6] exists, which tells that larger complexity of function space may lead to smaller generalization error. This violates the aforementioned property in uniform convergence methods and imposes demands in studying the generalization of DNNs.

Although the function space of DNNs is vast, not all functions within that space can be discovered by learning algorithms. Therefore, some representative works bound the generalization of DNNs based

Work done during internships at Microsoft Research Asia. Corresponding Authors

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

on the properties of the learning algorithm, e.g. , stability of algorithm [11], information-theoretic analysis [40]. These works rely on the relation between the input (i.e. , training data) and output (weights of the model after training) of the learning algorithm to infer the generalization ability of the learned model. Here, the relation refers to how the change of one sample in the training data impacts the final weights of model in the stability of algorithms while referring to the mutual information between the weights and the training data in the information-theoretic analysis. Although some works [24, 11] leverage some information from training process to understand the properties of learning algorithm, there is limited trajectory information conveyed.

The purpose of this article is to enhance our theoretical comprehension of the relation between learning trajectory and generalization. While some recent experiments [9, 13, 29] have shown a strong correlation between the information contained in learning trajectory and generalization, the theoretical understanding behind this is still underexplored. By investigating the contribution of each update step to the change in generalization error, we give a new generalization bound with rich trajectory related information. Our work can serve as a starting point to understand those experimental discoveries.

1.1 Our Contribution

Our contributions can be summarized below:

We demonstrate that learning trajectory information serves as a valuable indicator of generalization abilities. With this motivation, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error.

Utilizing the aforementioned modeling technique, we introduce a novel generalization bound for deep neural networks (DNNs). Our proposed bound provides a greater depth of trajectory-related insights than existing methods.

Our method effectively captures the generalization error throughout the training process. And the assumption corresponding to this method is also confirmed by experiments. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels.

2 Related Work

Generalization Theories Existing works on studying the generalization of DNNs can be divided into three categories: the methods based on the complexity of function space, the methods based on learning algorithms, and the methods based on PAC Bayes. The first category considers the generalization of DNNs from the perspective of the complexity of the function space. Many methods for measuring the complexity of the function space have been proposed, e.g. , VC dimension [39], Rademacher Complexity [4] and covering number [33]. These works fail in being applied to DNN models since the complexity of the function space of a DNN model is too large to deliver a trivial result [41]. This thus motivates recent works to rethink the generalization of DNNs based on the accessible information in different learning algorithms such as stability of algorithm [11], informationtheoretic analysis [40]. Among them, the stability of algorithm [7] measures how one sample change of training data impacts the model weights finally learned, and the information theory [30, 31, 40] based generalization bounds rely on the mutual information of the input (training data) and output (weights after training) of the learning algorithm. Another line is PAC Bayes [19] based method, which bounds the expectation of the error rates of a classifier chosen from a posterior distribution in terms of the KL divergence from a given prior distribution. Our research modifies the conventional Rademacher Complexity to calculate the complexity of the space explored by a learning algorithm, which in turn helps derive the generalization bound. Our approach resembles the first category, as we also rely on the complexity of the function space. However, our method differs as we focus on the function space explored by the learning trajectory, rather than the entire function space. The novelty of our technique lies in addressing the issue of dependence between training data and the function space explored by the learning trajectory, a dependence that is not permitted by the original Rademacher Complexity Theory.

Generalization Analysis for SGD The optimization plays an nonnegligible role in the success of DNN. Therefore, there are many prior works studying the generalization of DNNs by exploring property of SGD, which could be summarized into two categories: stability of SGD and informationtheoretic analysis. The most popular way of the former category is to analyze the stability of the weights updating. Hardt et al. [11] is the first work to analyze the stability of SGD with the requirements of smooth and Lipschitz assumptions. Its follow-up works try to discard the smooth [5], or Lipschitz [25] assumptions towards getting a more general bound. Information-theoretic methods leverage the chain rule of KL-divergence to calculate the mutual information between the learned model weights and the data. This kind of works is mainly applied for Stochastic Gradient Langevin Dynamics(SGLD), i.e. , SGD with noise injected in each step of parameters updating [28]. Negrea et al. [23], Haghifam et al. [10] improve the results using data-dependent priors. Neu et al. [24] construct an auxiliary iterative noisy process to adapt this method to the SGD scenario. In contrast to these studies, our approach utilizes more information related to learning trajectories. A more detailed comparison can be found in Table 2 and Appendix B.

3 Generalization Bound

Let us consider a supervised learning problem with a instance space Z and a parameter space W. The loss function can be defined as f : W Z R+. We denote the distribution of the instance space Z as µ. The n i.i.d samples draw from µ are denoted as S = {z1, ..., zn} µn. Given parameters w, the empirical risk and popular risk are denoted as FS(w) 1 n Pn i f(w, zi), and Fµ(w) Ez µ[f(w, z)] respectively. Our work studies the generalization error of the learned model, i.e. Fµ(w) FS(w). For an optimizaiton process, the learning trajectory is represented as a function J : N W. We use Jt to denote the weights of model after t times updating, where Jt = J(t). The learning algorithm is defined as A : µn R J, where the second input R denotes all randomness in the algorithm A, including the randomness in initialization, batch sampling et al. . We simply use A(S) to represent a random choice for the second input term. Given two functions U, V , R

t U(t)d V (t) P t U(t)(V (t + 1) V (t)) and we use to denote L2 norm. If S is a set, then |S| denotes the number of elements in S. Et denotes taking the expectiation conditioned on {Ji|i t}.

Let mini-batch B be a random subset sampled from dataset S, and we have |B| = b. The averaged function value of mini-batch B is denoted as FB(w) 1

z B f(w, z). The parameters updated with gradient descent can be formulated as: Jt+1 = Jt ηt FS(Jt). (1) where ηt is the learning rate for the t-th update. The parameter updating with stochastic gradient descent is: Jt+1 = Jt ηt FB(Jt). (2) Let ϵ(w) FS(w) FB(w) be the gradient noise in mini-batch updating, where w is the weights of a DNN. Then we can transform Equation (2) into: Jt+1 = Jt ηt FS(Jt) + ηtϵ(Jt). (3) The covariance of the gradients over the entire dataset S can be calculated as:

i=1 f(w, zi) f(w, zi)T FS(w) FS(w)T. (4)

Therefore, the covariance of the gradient noise ϵ(w) is:

C(w) n b b(n 1)Σ(w). (5)

Since for any w we have E(ϵ(w)) = 0, we can represent ϵ(w) as C(w) 1 2 ϵ , where ϵ is a random distribution whose mean is zero and covariance matrix is an identity matrix. Here, ϵ can be any distributions, including Guassian distribution [12] and SαS distribution [35].

The primary objective of our work is to suggest a new generalization bound that incorporates more comprehensive trajectory-related information. The key aspects of this information are: 1) It should be adaptive and change according to different learning trajectories. 2) It should not rely on the extra information from data distribution µ except from the training data S.

3.1 Investigating generalization alone learning trajectory

As annotated before, the learning trajectory is represented by a function J : N W, which defines the relationship between the model weights and the training timesteps t. Jt denotes the model weights after t times updating. Note that J depends on S, because it comes from the equation J = A(S). We simply use f(Jt) : Z R+ to represent the function after t-times update. Our goal is to analyze the generalization error, i.e., Fµ(JT) FS(JT), where T represents the total training steps.

We reformulate the function corresponding to the finally obtained model as:

f(JT) = f(J0) +

t=1 (f(Jt) f(Jt 1)). (6)

Therefore, the generalization error can be rewritten as:

Fµ(JT) FS(JT) = Fµ(J0) FS(J0) | {z } (i)

t=1 [(Fµ(Jt) Fµ(Jt 1)) (FS(Jt) FS(Jt 1))] | {z } (ii)t

(7) In this form, we divide the generalization error into two parts. (i) is the generalization error before the training. (ii)t is the generalization error caused by t-step update.

Typically, there is independence between J0 and the data S. Therefore, we have E(i) = 0. Combining with this, we have:

E[Fµ(JT) FS(JT)] = E

t=1 (ii)t. (8)

Analyzing the generalization error after training can be transformed into analyzing the increase of generalization error for each update. This is a straighforward and quite different way to extract the information from learning trajectory compared with previous work. Here, we list two techniques that most used by previous works to extract the information from learning trajectory.

(T1). This method leverages the chaining rule of mutual informaton to calculate a upper bound of the mutual information between JT and the training data S, i.e. I(S; JT) I(S; Jt T) PT t=0 I(S; Jt|Ji<t). I(S; JT) is the value of concerning for their theory. (T2). This method assumes we have another data S , which is obtained by replacing one sample in data S with another sample drawing from distribution µ. J is the learning trajectory trained from data S with same randomness value as J. Denote k Jk J k and assume 0 = 0. Then, the value of concerning is T . The upper bound of T is calculate by iterately apply the formular k ck 1 k 1 + ek 1.

(T1) is commonly utilized in analyzing Stochastic Gradient Langevin Dynamics(SGLD) [18, 2, 28], while (T2) is frequently employed in stability-based works for analyzing SGD [11, 15, 5]. Our method offers several benefits, including: 1) We directly focus on the change in generalization error, rather than intermediate values such as k and I(S; Jt|Ji<t), 2) The generalization error is equivalent to the sum of (ii)t, while (T1) and (T2) takes the upper bound value of I(S; JT) and T , and 3) From this perspective, We can extract more in-depth trajectory-related information. For (T1), the computation of I(S; Jt|Ji<t) primarily involves the information of Fµ(Jt), which is inaccessible to us (Detail in Appendix D and Neu et al. [24]). (T2) faces the challenge that only the upper bounds of ck and ek can be calculated. The upper bounds remain unchanged across various learning trajectories. Consequently, both (T1) and (T2) have difficulty conveying meaningful trajectory information.

3.2 A New Generalization Bound

In this section, we introduce the generalization bound based on our aforementioned modeling. Let us start with the definition of commonly used assumptions.

Definition 3.1. The function f is L-Lipschitz, if for all w1, w2 W and for all z Z, wherein we have f(w1, z) f(w2, z) L w1 w2 .

Definition 3.2. The function f is β-smooth, if for all w1, w2 W and for all z Z, wherein we have f(w1, z) f(w2, z) β w1 w2 .

Definition 3.3. The function f is convex, if for all w1, w2 W and for all z Z, wherein we have f(w1, z) f(w2, z) + (w1 w2)T f(w2, z).

Here, L-lipschitz assumption implies that the f(w, z) L holds. β-smooth assumption indicates the largest eignvalue of 2f(w, z) is smaller than β. The convexity indicates the smallest eigenvalue of 2f(w, z) are positive. These assumptions tell us the constraints of gradients and Hessian matrices of the training data and the unseen samples in the test set. Since the values of gradients and Hessian matrices in the training set are accessible, the key role of these assumptions is to deliver knowledge about the unseen samples in the test set.

In the following, we introduce a new generalization bound. We give the assumption required by our new generalization bound in the following.

Assumption 3.4. There is a value γ, so that for all w {Jt|t N}, we have Fµ(w) γ FS(w) .

Remark 3.5. Assumption 3.4 gives a restriction with the norm of popular gradient Fµ(w). This assumption is easily satisfied when n is a large number, because we have lim n FS(w) =

Fµ(w) . When the n is not large enough, the assumption will hold before SGD enter the neighbourhood of convergent point. Under the case that SGD enters the neighbourhood of convergent point, we give a relaxed assumption and its corresponding generalization bound in Appendix B. According to paper [42], this case will ununsually happen in real situation. Section 4 gives experiments to explore the assumption.

Theorem 3.6. Under Assumption 3.4, given S µn, let J = A(S), where A denoted the SGD or GD algorithm training with T steps, we have:

E[Fµ(JT) FS(JT)] 2γ Vm E Z

1 + Tr(Σ(Jt)) FS(Jt) 2 2 + O(ηm) (9)

where V(w) = FS(w) EU S |U|

n FU(w) n |U|

n FS/U(w) , Vm = max t V(Jt), γ =

max{1, max U S;t

n FS(Jt) }γ and ηm max t ηt.

Remark 3.7. Our generalization bound mainly relies on the information from gradients. V(w) is related to the variance of the gradient. When the variance of the gradients across different samples in the training set S is large, then the value of V(w) is small, and vice versa. Note that we have |U| < n due to U S. Our bound will became trival if EU S |U|

n FU(w) n |U|

n FS/U(w) = 0. This rarely happens in real case, because it requires that for all U S, we have |U| FU(w) = (n |U|) FS/U(w). We also give a relaxed assumption version of this theorem in Appendix B. The generalization bound provides a clear insight into how the reduction of training loss leads to a increase in generalization error.

Proof Sketch The proof of this theorem is placed in Appendix A. Here, we give the sketch for this proof.

Step 1 Beginning with Equation (8), we decomposite the Fµ(JT) FS(JT) into a linear part (genlin(JT)) and nonlinear part(gennl(JT)). We have genlin(JT) = PT t=1(ii)lin t , where (ii)lin t (Jt Jt 1)T( Fµ(Jt 1) FS(Jt 1)). The nonlinear part is gennl(JT) = Fµ(JT) FS(JT) genlin(JT). We takle these two parts differently. Here, we focus on analyzing genlin(JT) because it dominates under small learning rate. Detail discussion of gennl(JT) is given in Appendix (Propositon A.1 and Subsection C.3)

Step 2 We construct the addictive linear space LJ|S {PT 1 t=0 wt T f(Jt) | wt t}, where t ηt FS(Jt) . Then E[genlin(JT)] 2γ Vm ERS(LJ|S), where RS(LJ|S) Eσ sup h LJ|S ( 1

n Pn i=1 σih(zi)).

Table 1: Comparison of the generalization bounds with stability based method for SGD learning algorithms. T.R.T is an abbreviation for the term related to trajectory. T.R.T is defined as the term that 1) varies based on different learning trajectories, and 2) don t rely on the extra information of data distribution µ except from training data S. We can infer that the proposed bound incorporates a greater amount of information pertaining to the trajectory. Other related works are discussed in Appendix D.

Method β-Smooth L-Lipschitz Convex Small LR Other Conditions Generalization Bound T.R.T Hardt et al. [11] 2L2

n PT t=1 ηt

PT t=1 ηt Hardt et al. [11] f [0, 1], ηt < c

n L 2 βc+1 T

βc βc+1 ) T

Zhang et al. [43] T > n, ηt = c βt

Zhou et al. [44] Ez S f(w, z) FS(w) 2 B2 O(

2βFµ(J0) + 1

2EB2 log T) log T

Bassily et al. [5] Projected SGD 2L2 q PT 1 t=1 η2 t + 4L2

n PT 1 t=1 ηt

PT 1 t=1 ηt Lei and Ying [16] Projected SGD O((1 + T

n2 ) PT t=1 η2 t ) T PT t=1 η2 t Ours (Theorem 3.6) Fµ(w) γ FS(w) Theorem 3.6

t d FS(Jt) q

1 + Tr(Σ(Jt)) FS(Jt) 2

Step 3 Finally, we compute the upper bound of RS(LJ|S), which follows same techniques used in Radermacher Complexity theory. By combining this with Proposition A.1, we establish the theorem.

Technical Novety Directly applying the Rademacher complexity to calculate the generalization error bound fails because the large complexity of neural network s function space leads to trival bound[41]. In this work, we want to calculate the complexity of the function space that can be explored during the training process. However, there are two challenges here. First, the trajectory of neural network is a "line", instead of a function space that can be calculated the complexity. To solve this problem, we indroduce the addictive linear space LJ|S. This space contains the local information of learning trajectory, and can serve as the pseudo function space. Second, the function space LJ|S has a dependent on the sample set S, while the theory of Rademacher complexity requires that the function space is independent with training samples. To decouple this dependence, we adapt the Rademacher complexity and we obtain that E[genlin(JT)] 2γ Vm ERS(LJ|S). Here, γ is indroduced to decouple the dependent fact mentioned above.

Next, in order to draw a clearer comparison with the stability-based method, we present the following corollary. This corollary employs the β-smooth assumption to bound gennl(JT) and leverages a similar learning rate setting to that found in stability based works.

Corollary 3.8. If function f( ) is β-smooth, under Assumption 3.4 given S µn, let J = A(S), ηt = c β(t+1), M 2 2 = max t Et 1( FS(Jt) + ϵ(Jt) 2) and M 4 4 = max t Et 1( FS(Jt) + ϵ(Jt) 4) , where A denoted the SGD or GD algorithm training with T steps, we have:

E[Fµ(JT) FS(JT)] 2γ Vm E Z

1 + Tr(Σ(Jt)) FS(Jt) 2 2

+ 2c2γ Vm M 2 4

dt nβ2(t + 1)4

1 + Tr(Σ(Jt)) FS(Jt) 2 2

+ 2c2 M 2 2 β .

where V(w) = FS(w) EU S |U|

n FU(w) n |U|

n FS/U(w) , Vm = max t V(Jt) and γ =

max{1, max U S;t

n FS(Jt) }γ.

3.3 Further Analysis

3.3.1 Interpreting the Generalization Bounds

We rewrite the obtained generalization bound here:

E[Fµ(JT) FS(JT)] γ |{z} Bias of Training Set

1 Diversity of Training Set z}|{ Vm

1 + Tr(Σ(Jt)) FS(Jt) 2 2

| {z } Complexity of Learning Trajectory

(11) The "Bias of Training Set" refers to the disparity between the characteristics of the training set and those of the broader population. To measure this difference, we use the distance between the norm of the popular gradient and that of the training set gradient, as specified in Assumption 3.4. The "Diversity of Training Set" can be understood as the variation among the samples in the training set, which in turn affects the quality of the training data. The ratio Bias of Training Set Diversity of Training Set gives us the property of information conveyed by the training set. It is important to consider the properties of the training set, as the data may not contribute equally to the generalization[36]. The detail version of the equation can be found in Theorem 3.6.

3.3.2 Asymptotic Analysis

We will first analyze the dependent of n for V. The V is calculated as V(w) = FS(w) EU S |U|

n FU(w) n |U|

n FS/U(w) . Obviously, the gradient of individual sample is unrelated

to the sample size n. And |U| n. Therefore, V = O(1). Similarly, we have

t d FS(Jt) n q

1 + Tr(Σ(Jt)) FS(Jt) 2 2 = O( 1 n). As for the O(ηm) term in Theorem 3.6, we have

lim n O(ηm) = 0 according to Proposition A.1. We simply assume that O(ηm) = O( 1

nc ). Therefore,

our bound has O( 1 nmin{0.5,c} ).

3.3.3 Comparison with Stability-based methods

We first compare our method with the stability-based methods in terms of the trajectory information. In Table 1, we present a summary of stability-based methods, while other methods are outlined in Appendix D. We focus on generalization bounds from previous works that eliminate terms dependent on extra information about data distribution µ, apart from the training data S, using assumptions such as smoothness or Lipschitz continuity. Analyzing Table 1 reveals that most prior works primarily depend on the learning rate η and the total number of training steps T. This suggests that we can achieve the same bound by using an identical learning rate schedule and total training steps, which does not align with our practical experience. Our proposed generalization bound considers the evolution of function values, gradient covariance, and gradient norms throughout the training process. As a result, our bounds encompass more comprehensive information about the learning trajectory.

Table 2: Detail comparison with Hardt et al. [11]. The β refers to the β-smooth assumption (see in Definition 3.2). S denotes the training set.

Uniform Stability[11] Ours Assumption w W z Z f(w, z ) L w {Jt|t N} Ez µ f(w, z ) γ Ez S f(w, z ) Modelling Method of SGD Epoch Structure Full Batch Gradient + Stochastic Noise Batch Size 1 n Trajectory Information in Bound Learning rate and number of training step Values in Trajectory (gradient norm and covariance) Perspective Stability of Algorithm Complexity of Learning Trajectory

Next, we give a detail comparison with Hardt et al. [11] in Table 2. The concept of uniform stability is commonly used to evaluate the ability of SGD in generalizaton, by assessing its stability when a single training sample is altered. Our primary point of comparison is with Hardt et al. [11], as their work is considered the most representative in terms of analyzing the stability of SGD. We find that First, the assumption of Uniform Stability requires the gradient norm of all input samples for all weights being bounded by L, whereas our assumption only limits the expectation of the gradients

for the weights during the learning trajectory. Secondly, Uniform Stability uses an epoch structure to model the stochastic gradient descent, whereas our approach regards each stochastic gradient descent as full batch gradient descent with added stochastic noise. The epoch structure complicates the modelling process because it requires a consideration of sampling. As a result, in Hardt et al. [11], the author only considers the setting with batch size 1. Thirdly, the bound of Uniform Stability only uses hyperparameters setting such as learning rate and number of training step. In contrast, our bound contains more trajectory-related information, such as the gradient norm and covariance. Finally, the Uniform Stability provides the generalization bound based on the stability of the algorithm, while our approach leverages the complexity of the learning trajectory. In summary, there are some notable differences between our approach and Uniform Stability, such as the assumptions made, the modelling process, the type of information used in the bound, and the perspectives.

4 Experiments

4.1 Tightness of Our Bounds

Table 3: Numeric comparison with stability-based work on toy examples. The reason for the value of Zhang et al. [43] is large is because that our and Hardt et al. [11] has dependent on L2

β , while Zhang et al. [43] depends on L2. L and β are usually large numbers.

Gen Error Ours Hardt et al. [11] Zhang et al. [43] 1.49 3.62 4.04 4417

In a toy dataset setting, we compare our generalization bound with stability-based methods.

Reasons for toy examples 1) Some values in the bounds are hard to be calculated. Calculating β (under the β-smooth assumption) and L (under the L-Lipschitz assumption) in stability-based work, as well as the values of V and γ in our proposed bound, are challenging. 2) Stability-based methods require a batch size of 1. The training is hard for batch size of 1 with learning rate setting ηt = 1

t in complex datasets.

Constuction of the toy examples In the following, we discuss the construction of the toy dataset used to compare the tightness of the generalization bounds. The training data is Xtr = {xi}n i=1. All the data xi is sampled from Guassian distribution N(0, Id). Sampling w N(0, Id),the ground truth is generated by yi = 1 if w Txi > 0 else 0. The weights for learning is denoted as w. The predict y is calculated as yi = w Txi. The loss for a simple data point is li = yi w Txi 2. The training loss is L = Pn i=1 li. The test data is Xte = {x i}, where x i = x i and x i N(0, Id). We use 100 samples for training and 1,000 samples for evaluation. The model is trained using SGD for 200 epochs.

We evaluate the tightness of our bound by comparing our results with those in Hardt et al. [11] and Zhang et al. [43] from the original paper. We set the learning rate as ηt = 1 βt. Our reasons for comparing with these two papers are: 1) Hardt et al. [11] is a representative study, 2) Both papers have theorems using a learning rate setting of ηt = O( 1

t ), which aligns with Corollary 3.8 in our paper, and 3) They do not assume convexity. The generalization bounds we compare include Corollary 3.8 from our paper, Theorem 3.12 from Hardt et al. [11], and Theorem 5 from Zhang et al. [43].

Our results are given in Table 3. Our bound is tighter under this setting.

4.2 Capturing the trend of generalization error

In this section, 1) we conduct the deep learning experiment to verify Assumption 3.4 and 2) Verify whether our proposed generalization bound can capture the changes of generalization error. In this experiment, we mainly consider the term C(Jt) 2 R t i=0 d FS(Ji) n q

1 + Tr(Σ(Ji)) FS(Ji) 2 2 . We omit the term γ and Vm, because all the trajectory related information that we want to explore is stored in C(Jt). Capturing the trend of generalization error is regarded as an important problem in Nagarajan [21]. Unless further specified, we use the default setting of the experiments on CIFAR-10 dataset [14]

Figure 1: Exploration of Assumption 3.4 for different dataset. The eγt is stable before training loss reaches a relative small value. Assumption holds if the training is stop before extremely overfitting. A relaxed assumption and its corresponding generalization bound are given in Appendix B for extremely overfitting situation

Figure 2: Exploration of C(Jt) during the training process. Left: The curve FS(Jt) + C(Jt) exhibits a comparable Pattern with the curve FS (Jt). Center: After the early stage, C(Jt) and FS (Jt) FS(Jt) have a similar trend. Right: The value of d C(Jt) d FS(Jt) alone the training process.

with the VGG13 [34] network. The experimental details for each figure can be found in Appendix C.2.

Our observations are:

Assumption 3.4 is valid when SGD is not exhibiting extreme overfitting. The term of C(Jt) can depict how the generalization error varies along the training process. And it can also track the changes in generalization error when adjustments are made to learnling rates and label noise levels

Exploring the assumption 3.4 for different dataset during the training process To explore the Assumption 3.4, we define γt Fµ(Jt)

FS(Jt) and eγt FS (Jt)

FS(Jt) , where S is another data set i.i.d sampled from distribution µ. Because S is independent with S, we have eγt γt. We found that eγt is stable around 1 during the early stage of training(Figure 1). When the training loss is reaching a relative small value, eγt increases as we continue training. This phenomenon remain consistant aross the Cifar10, Cifar100 and SVHN datasets. The γ in Assumption 3.4 can be assigned as γ = maxt eγt. We can always find such γ if the optimizer is not extreme overfitting. Under the extremely overfitting case, we can use the relaxed theorem in Appendix B to bound the generalization error.

The bound capturing the trend of generalization error during training process The generalization error and the C(Jt) both changes as the training continues. Therefore, we want to verify whether they correlate with each other during the training process. Here, we use the term FS (Jt) FS(Jt) to approximate the generalization error. We find that FS (Jt) FS(Jt) has similar trend with C(Jt) (Figure 2 Center). What s more, we also find that the curve of FS(Jt) + C(Jt) exhibits a comparable pattern with the curve FS (Jt) (Figure 2 Left). To explore whether C(Jt) reveals influence of the change of FS(Jt) to the generalization error, we plot d C(Jt) d FS(Jt) (Figure 2 Right) during

Figure 3: C(JT) correlates with FS (Jt) FS(Jt). Left: C(Jt) and the generalization error under different label noise level. Right: C(Jt) and the generalization error under learning rate. The C(Jt) can capture the trend of generalization error cased by learning rate when learning rate is small. Appendix E provides proof that a large learning rate results in a smaller proposed generalization bound. Further discussions on why a small learning rate leads to a larger generalization error can be found in Li et al. [17], Barrett and Dherin [3].

the training process. d C(Jt) d FS(Jt) increases slowly during the early stage of training, but surge rapidly afterward. This discovery is aligned with our intuition about the overfitting.

The complexity of learning trajectory correlates with the generalization error In Figure 3, we carry out experiments under various settings. Each data point in the figure represents the average of three repeated experiments. The results demonstrate that both the generalization error and C(Jt) increase as the level of label noise is raised (Figure 3 Left). The another experiments measure C(Jt) and generalization error for different learning rate and discover that C(Jt) can capture the trend generalization error. The reasons behind a larger learning rate resulting in a smaller generalization error have been explored in Li et al. [17], Barrett and Dherin [3]. Additionally, Appendix E discusses why a larger learning rate can lead to a smaller C(Jt).

5 Limitation

The assumption of small learning rate is required by our method. But this assumption is also common use in previous works. For example, Hardt et al. [11], Zhang et al. [43], Zhou et al. [44] explicitly requires that the learning rate should be small and is decayed with a rate of O( 1

t ). Some methods have no explict requirements about this but show that large learning rate pushes the generalization bounds to a trivial point. For example, the generalization bounds in works [5, 16] have a term PT t=1 η2 t that is not decayed as the data size n increases. The value of this term is unignorable when the learning rate is large. The small learning assumption widens the gap between theory and practice. Eliminating this assumption is crucial for future work.

6 Conclusion

In this study, we investigate the relation between learning trajectories and generalization capabilities of Deep Neural Networks (DNNs) from a unique standpoint. We show that learning trajectories can serve as reliable predictors for DNNs generalization performance. To understand the relation between learning trajectory and generalization error, we analyze how each update step impacts the generalization error. Based on this, we propose a novel generalization bound that encompasses extensive information related to the learning trajectory. The conducted experiments validate our newly proposed assumption. Experimental findings reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and the level of label noises.

7 Acknowledgement

We thank all the anonymous reviewers for their valuable comments. The work was supported in part with the National Natural Science Foundation of China (Grant No. 62088102).

[1] K. Ahn, J. Zhang, and S. Sra. Understanding the unstable convergence of gradient descent. In International Conference on Machine Learning, pages 247 257. PMLR, 2022. [2] A. Banerjee, T. Chen, X. Li, and Y. Zhou. Stability based generalization bounds for exponential family langevin dynamics. ar Xiv preprint ar Xiv:2201.03064, 2022. [3] D. G. Barrett and B. Dherin. Implicit gradient regularization. ar Xiv preprint ar Xiv:2009.11162, 2020. [4] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463 482, 2002. [5] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:4381 4391, 2020. [6] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116 (32):15849 15854, 2019. [7] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499 526, 2002. [8] N. Chandramoorthy, A. Loukas, K. Gatmiry, and S. Jegelka. On the generalization of learning algorithms that do not converge. ar Xiv preprint ar Xiv:2208.07951, 2022. [9] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. ar Xiv preprint ar Xiv:2103.00065, 2021. [10] M. Haghifam, J. Negrea, A. Khisti, D. M. Roy, and G. K. Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 33:9925 9935, 2020. [11] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225 1234. PMLR, 2016. [12] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in sgd. ar Xiv preprint ar Xiv:1711.04623, 2017. [13] S. Jastrzebski, D. Arpit, O. Astrand, G. B. Kerg, H. Wang, C. Xiong, R. Socher, K. Cho, and K. J. Geras. Catastrophic fisher explosion: Early phase fisher matrix impacts generalization. In International Conference on Machine Learning, pages 4772 4784. PMLR, 2021. [14] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009. [15] Y. Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. ar Xiv preprint ar Xiv:2206.07082, 2022. [16] Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809 5819. PMLR, 2020. [17] Y. Li, C. Wei, and T. Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 32, 2019. [18] X. Luo, B. Luo, and J. Li. Generalization bounds for gradient methods via discrete and continuous prior. Advances in Neural Information Processing Systems, 35:10600 10614, 2022. [19] D. A. Mc Allester. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164 170, 1999. [20] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018. [21] V. Nagarajan. Explaining generalization in deep learning: progress and fundamental limits. ar Xiv preprint ar Xiv:2110.08922, 2021. [22] V. Nagarajan and J. Z. Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019. [23] J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy. Information-theoretic generalization bounds for sgld via data-dependent estimates. Advances in Neural Information Processing Systems, 32, 2019.

[24] G. Neu, G. K. Dziugaite, M. Haghifam, and D. M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Conference on Learning Theory, pages 3526 3545. PMLR, 2021. [25] K. E. Nikolakakis, F. Haddadpour, A. Karbasi, and D. S. Kalogerias. Beyond lipschitz: Sharp generalization and excess risk bounds for full-batch gd. ar Xiv preprint ar Xiv:2204.12446, 2022. [26] B. Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013. [27] S. Park, U. Simsekli, and M. A. Erdogdu. Generalization bounds for stochastic gradient descent via localized ϵ-covers. Advances in Neural Information Processing Systems, 35:2790 2802, 2022. [28] A. Pensia, V. Jog, and P.-L. Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546 550. IEEE, 2018. [29] Y. Ren, S. Guo, and D. J. Sutherland. Better supervisory signals by observing learning paths. ar Xiv preprint ar Xiv:2203.02485, 2022. [30] D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pages 1232 1240. PMLR, 2016. [31] D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302 323, 2019. [32] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the hessian of over-parametrized neural networks. ar Xiv preprint ar Xiv:1706.04454, 2017. [33] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014. [35] U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827 5837. PMLR, 2019. [36] B. Sorscher, R. Geirhos, S. Shekhar, S. Ganguli, and A. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523 19536, 2022. [37] V. Vapnik. Principles of risk minimization for learning theory. Advances in neural information processing systems, 4, 1991. [38] V. Vapnik. The nature of statistical learning theory. Springer science & business media, 1999. [39] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11 30. Springer, 2015. [40] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30, 2017. [41] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021. [42] J. Zhang, H. Li, S. Sra, and A. Jadbabaie. Neural network weights do not converge to stationary points: An invariant measure perspective. In International Conference on Machine Learning, pages 26330 26346. PMLR, 2022. [43] Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of sgd: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364 2373. PMLR, 2022. [44] Y. Zhou, Y. Liang, and H. Zhang. Understanding generalization error of sgd in nonconvex optimization. Machine Learning, 111(1):345 375, 2022. [45] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. ar Xiv preprint ar Xiv:1803.00195, 2018.

A Proof of Theorem 3.6

We rewrite the Equation (7) and Equation (8):

Fµ(JT) FS(JT) = Fµ(J0) FS(J0) | {z } (i)

t=1 [(Fµ(Jt) Fµ(Jt 1)) (FS(Jt) FS(Jt 1))] | {z } (ii)t

E[Fµ(JT) FS(JT)] = E

t=1 (ii)t. (13)

Using Taylor expansion for the function f( ), we have:

f(Jt) f(Jt 1) = (Jt Jt 1)T f(Jt 1) + O( Jt+1 Jt 2). (14)

Therefore, we can define (ii)lin t as:

(ii)lin t (Jt Jt 1)T( Fµ(Jt 1) FS(Jt 1)). (15)

The (ii)t can be decomposed as (ii)t = (ii)lin t + (ii)nl t , where (ii)nl t (ii)t (ii)lin t .

Then Equation 13 can be decomposited as:

E[Fµ(JT) FS(JT)] = E

t=1 (ii)lin t | {z } genlin(JT)

t=1 (ii)nl t | {z } gennl(JT)

Proposition A.1. For the gradient descent or the stochastic gradient descent algorithm, we have:

E[genlin(JT)] = E[

t=0 ηt FS(Jt)T( FS(Jt) Fµ(Jt))]. (17)

If T = O( 1

ηm ), we have: | gennl(JT)| = O(ηm), (18)

where ηm max t ηt, and we have:

lim n | gennl(JT)| = 0. (19)

Remark A.2. We give an experimental exploration of the gennl(JT) in Appendix C.3. We discover that if the optimizer doesn t enter the Eo S (Edge of Stability) regime [9], we have gennl(JT) 0.

Proof. Analyzing of genlin(JT)

Because of ϵ(w) = C(w) 1 2 ϵ and E[ϵ ] = 0 (detail in Equation (3) and Equation (5)d), we can get:

Et 1[ϵT t ( Fµ(Jt) FS(Jt))]

= Et 1[(ϵ )T(C(Jt) 1 2 )T( Fµ(Jt) FS(Jt))]

= Et 1[ϵ ]TEt 1[(C(Jt) 1 2 )T( Fµ(Jt) FS(Jt))] = 0.

Combining with Formula (3), we have

E[genlin(JT)] = E[

t=0 ηt FS(Jt)T( FS(Jt) Fµ(Jt))]. (21)

Analyzing of gennl(JT)

Here, we denote M max t FS(Jt) . According to the definition of gennl(JT).

| gennl(JT)| |Fµ(JT) F lin µ (JT)| + |FS(JT) F lin S (JT)|

t=1 O( Jt+1 Jt 2)| + |

t=1 O( Jt+1 Jt 2)|

t=1 O(η2 t FS(Jt) 2)

= O(Tη2 m M 2)

ηm η2 m M 2)

Because all the elements of training set S are sampled from distribution µ, we have lim n FS(w) =

Fµ(w), Therefore:

lim n (ii)lin t = lim n (Jt Jt 1)T( Fµ(Jt 1) FS(Jt 1)) = 0. (23)

What s more, we also have:

lim n [Fµ(JT) FS(JT)] = 0. (24)

Because Fµ(JT) FS(JT) = PT t=1(ii)lin t + PT t=1(ii)nl t , we have:

lim n | gennl(JT)| = lim n

t=1 (ii)nl t

Fµ(JT) FS(JT)

t=1 (ii)lin t

= |0 0| = 0

According to the Equation (17), we analyze the generalization error of FJ|S {PT 1 t=0 wt T f(Jt) | wt = δt f(Jt) f(Jt) } as a proxy for analyzing generalization error of the function trained using SGD or GD algorithm. The value of genlin(JT) is equal to the generalization error of FJ|S. To analyze FJ|S, we construct an addictive linear space as LJ|S {PT 1 t=0 wt T f(Jt) | wt δt}, where δt ηt FS(Jt) . Here, we use J|S to emphasize that J depends on S.

Under Assumption 3.4 (that is introduced in the main paper), we can have the following lemma.

Lemma A.3. Under Assumption 3.4, given S µn, let J = A(S), we have:

E[genlin(JT)] 2γ Vm ERS(LJ|S), (26)

where V(w) = FS(w) EU S |U|

n FU(w) |S| |U|

n FS/U(w) , Vm = max t V(Jt) and γ =

max{1, max U S;t

n FS(Jt) }γ.

Proof. For a function h, we define that hµ Ez µ[g(z)] and h S = 1

n P zi S h(zi). Given a function space, the maximum generalization error of the space can be defined as: Φ(S, H) sup h H (hµ h S)

Φ(S, H|S) = sup h H|S (hµ h S)

= sup h H|S (ES h S h S)

ES sup h H|S (h S h S)

= ES ,σ sup h H|S ( 1

i σi(h(zσi i ) h(z σi i )))

ES ,σ sup h H|S ( 1

i σi(h(zσi i ))) + ES ,σ sup h H|S ( 1

i σi(h(zσi i )))

= 2ES ,σ sup h H|S ( 1

i σi(h(zσi i ))),

where S is another i.i.d sample set drawn from µn and σ denotes the Rademacher variable. The σi in zσi i denotes zσi i that belongs to S or S . if σi = 1 zσi i S, otherwise, zσi i S .

RS(LJ|S) Eσ sup h LJ|S ( 1

= Eσ sup h LJ|S ( 1

z S+ h(z) X

t=0 δt g S+(Jt) g S (Jt) ),

where S+ {zi | σi = +1} and S {zi | σi = 1}, and g S(w) |S| FS(w).

ES ,σ sup h FJ|S ( 1

i σi(h(zσi i ))) = ES ,σ sup h FJ|S ( 1

z S + h(z) X

t=0 δt g S(Jt) g S(Jt) (g S +(Jt) g S (Jt)))

t=0 δt g S(Jt) g S(Jt) (|S+| Fµ(Jt) g S (Jt)))

where S + is a subset of S with |S +| = |S+|. Defining k γ Vm, we have:

k Eσ sup h LJ|S ( 1

i σih(zi)) ES ,σ sup h FJ|S ( 1

i σi(h(zi)))

t=0 δt g S+(Jt) g S (Jt) ) ES ,σ( 1

t=0 δt g S(Jt) g S(Jt) (g S +(Jt) g S (Jt)))

t=0 δt g S+(Jt) g S (Jt) 1

t=0 δt g S(Jt) g S(Jt) (|S+| Fµ(Jt) g S (Jt)))

t=0 δt g S+(Jt) g S (Jt) 1

t=0 δt |S+| Fµ(Jt) g S (Jt) )

t=0 δt Eσ( g S+(Jt) g S (Jt) )

t=0 δtγ FS(Jt)

Therefore, combining Equation (27) and (30), we have E[genlin(JT)] 2γ Vm ERS(LJ|S).

Lemma A.4. Given J = A(S), the formula RS(LJ|S) can be upper bounded with:

RS(LJ|S) E Z

1 + Tr(Σ(w)) FS(w) 2 2 . (31)

Proof. Let us start with the calculation of RS(w T f(Jt)):

RS({w T f(Jt)| w δ}) = 1

sup w δ w T n X

i=1 σi f(Jt, zi)

i=1 σi f(Jt, zi)

i=1 σi f(Jt, zi)

i=1 σi f(Jt, zi) 2

i=1 f(Jt, zi) 2,

where represents using the relation that for i, j satisfying i = j, we have Eσiσj = 0.

Because wi is independent of wj if i = j, we have:

RS(LJ|S) = RS({f(J0) +

t=0 wt T f(Jt) | wt δt})

t=0 RS({wt T f(Jt)| wt δt})

i=1 f(Jt, zi) 2.

The covariance of gradient noise can be calculated as:

Tr[Σ(w)] = Tr[ 1

i=1 f(w, zi) f(w, zi)T FS(w) FS(w)T]

i=1 Tr[ f(w, zi) f(w, zi)T] Tr[ FS(w) FS(w)T]

i=1 f(w, zi)2 FS(w) 2

Taking Equation (33) and δt ηt FS(Jt) into Equation (34), we have :

i=1 f(Jt, zi) 2

Tr[Σ(Jt)] + FS(Jt) 2

ηt FS(Jt) n

Tr[Σ(Jt)] + FS(Jt) 2

When ηt is small, δt Eϵ (Jt+1 Jt)T FS(Jt)

FS(Jt) Eϵ FS(Jt+1) FS(Jt)

FS(Jt) holds, therefore we have:

ERS(LJ|S) E

i=1 f(Jt, zi) 2 2

FS(Jt+1) FS(Jt) n

1 + Tr(Σ(w)) FS(w) 2 2

1 + Tr(Σ(w)) FS(w) 2 2

Theorem A.5. Under Assumption 3.4, given S µn, let J = A(S), where A denotes the SGD or GD algorithm training with T steps, we have:

E[Fµ(JT) FS(JT)] 2γ Vm E Z

1 + Tr(Σ(Jt)) FS(Jt) 2 2 + O(ηm), (37)

where V(w) = FS(w) EU S |U|

n FU(w) |S| |U|

n FS/U(w) , Vm = maxt V(Jt) and γ =

max{1, max U S;t

n FS(Jt) }γ.

Proof. We rewrite Equation 2 of the update of SGD with batchsize b here: Jt = Jt 1 ηt FS(Jt 1) + ηtϵt (38) where we simplify the ϵ(wt) as ϵt, then we can expand the function at f(JT) as:

f lin(JT) f(J0) +

t=0 (ηt FS(Jt) + ϵ)T f(Jt)

t=0 ηt FS(Jt)T f(Jt) +

t=0 ϵT t f(Jt)

Note that when the learning rate is small, we have f(JT) f lin(JT).

The difference between the distributional value and the empirical value of the linear function can be calculated as:

t=0 (ηt FS(Jt) + ϵ)T Fµ(Jt)] E[FS(J0) +

t=0 (ηt FS(Jt) + ϵ)T FS(Jt)]

= E[Fµ(J0) FS(J0) +

t=0 (ηt FS(Jt) + ϵ)T Fµ(Jt)

t=0 (ηt FS(Jt) + ϵ)T FS(Jt)]

t=0 ηt FS(Jt)T( Fµ(Jt) FS(Jt)) +

t=0 ϵT t ( Fµ(Jt) FS(Jt))]]

t=0 ηt FS(Jt)T( Fµ(Jt) FS(Jt))]

Φ(S, FJ|S), (40) where using the equation that E[ϵT t ( Fµ(Jt) FS(Jt))] = 0, according to Equation 20.

Because of E[Fµ(JT) FS(JT)] = E[F lin µ (JT)+O(ηm) F lin S (JT) O(ηm)] = E[F lin µ (JT) F lin S (JT)] + O(ηm)(from Proposition A.1), by applying Lemma A.3 and Lemma A.4, the theorm is proved.

Corollary A.6. If function f( ) is β-smooth, under Assumption 3.4 given S µn, let J = A(S), ηt = c β(t+1), M 2 2 = max t Et 1( FS(Jt) + ϵ(Jt) 2) and M 4 4 = max t Et 1( FS(Jt) + ϵ(Jt) 4) , where A denoted the SGD or GD algorithm training with T steps, we have:

E[Fµ(JT) FS(JT)] 2γ Vm E Z

1 + Tr(Σ(Jt)) FS(Jt) 2 2

+ 2c2γ Vm M 2 4

dt nβ2(t + 1)4

1 + Tr(Σ(Jt)) FS(Jt) 2 2

+ 2c2 M 2 2 β .

where V(w) = FS(w) EU S |U|

n FU(w) |S| |U|

n FS/U(w) , Vm = max t V(Jt) and γ =

max{1, max U S;t

n FS(Jt) }γ.

Proof. If f( ) is β-smooth, we have:

f(Jt+1) f(Jt) (Jt+1 Jt)T f(Jt) + 1

2β Jt+1 Jt 2 (42)

f(Jt+1) f(Jt) (Jt+1 Jt)T f(Jt) 1

2β Jt+1 Jt 2. (43)

Combining the two equations, we obtain:

|Rµ(JT) RS(JT)| |Fµ(JT) F lin µ (JT)| + |FS(JT) F lin S (JT)|

t=0 Jt+1 Jt 2 + β

t=0 Jt+1 Jt 2

t=0 Jt+1 Jt 2.

The generalization error can be divided into three parts:

E[Fµ(JT) FS(JT)] 2γ Vm E Z

1 + Tr(Σ(Jt)) FS(Jt) 2 2

d F lin S (Jt) d FS(Jt) n

1 + Tr(Σ(Jt)) FS(Jt) 2 2 | {z } (A)

t=0 Jt+1 Jt 2

The term (A) is caused by using FS(Jt+1) FS(Jt) to replace F lin S (Jt+1) F lin S (Jt). The term "(B)" is induced by gennl(JT). Then, we want to give a upper bound of (A) using M 4 4 :

(A) ( ) 2γ Vm E

β Jt+1 Jt 2

1 + Tr(Σ(Jt)) FS(Jt) 2 2

( ) 2c2γ Vm E

FS(Jt) + ϵ(Jt) 2

β n(t + 1)2

1 + Tr(Σ(Jt)) FS(Jt) 2 2

t=0 Et 1 FS(Jt) + ϵ(Jt) 2

β n(t + 1)2

1 + Tr(Σ(Jt)) FS(Jt) 2 2

( ) 2c2γ Vm

Et 1 FS(Jt) + ϵ(Jt) 4

β2n(t + 1)4 Et 1

1 + Tr(Σ(Jt)) FS(Jt) 2 2

M 4 4 β2n(t + 1)4 Et 1

1 + Tr(Σ(Jt)) FS(Jt) 2 2

2c2γ Vm M 2 4

1 β2n(t + 1)4

1 + Tr(Σ(Jt)) FS(Jt) 2 2

where ( ) is due to the Equation 44, ( ) is due to the update rule of Jt and ( ) is que to Hölder s inequality. In the following, we use M 2 2 to give a upper bound for (B):

1 (t + 1)2 E F(Jt) + ϵ(Jt) 2

1 (t + 1)2 M 2 2

1 (t + 1)2 M 2 2

2c2 M 2 2 β

Taking the upper bound value of "(A)" and "(B)" into Equation 45, we obtain the result.

B Relaxed Assumption and Corresponding Bound

Assumption B.1. There is a value γ, T0 and ζ, so that for all w {Jt|t N t < T0}, we have Fµ(w) γ FS(w) and for all w {Jt|t N t T0}, we have Fµ(w) γ FS(w) + ζ.

Theorem B.2. Under Assumption B.1, given S µn, let J = A(S), where A denotes the SGD or GD algorithm training with T steps, we have:

E[Fµ(JT) FS(JT)] 2γ Vm E Z

1 + Tr(Σ(Jt)) FS(Jt) 2 2 +1

t=T0 ηt FS(Jt) ζ+O(ηm),

(48) where V(w) = FS(w) EU S |U|

n FU(w) n |U|

n FS/U(w) , Vm = maxt V(Jt) and γ =

max{1, max U S;t

n FS(Jt) }γ.

Proof. Most of the proofs in this part are the same as those in Appendix A, except for Equation 30. The Equation 30 is replaced by:

k Eσ sup h LJ|S ( 1

i σih(zi)) ES ,σ sup h FJ|S ( 1

i σi(h(zi)))

t=0 δt g S+(Jt) g S (Jt) ) ES ,σ( 1

t=0 δt g S(Jt) g S(Jt) (g S +(Jt) g S (Jt)))

t=0 δt g S+(Jt) g S (Jt) 1

t=0 δt g S(Jt) g S(Jt) (|S+| Fµ(Jt) g S (Jt)))

t=0 δt g S+(Jt) g S (Jt) 1

t=0 δt |S+| Fµ(Jt) g S (Jt) )

t=0 δt Eσ( g S+(Jt) g S (Jt) )

t=0 δtγ FS(Jt) 1

t=T0 ηt FS(Jt) ζ.

Remark B.3. Compared of Theorem 3.6, we have a extra term PT t=T0 ηt FS(Jt) ζ here. Since the unrelaxed assumption Fµ(w) γ FS(w) is not satisfied only when FS(w) is relative small, the term PT t=T0 ηt FS(Jt) ζ is small value.

C Experiments

C.1 Calculation of C(J)

To reduce the calculation, we construct a randomly sampled subset Ssp = {zsp 1 , ..., zsp n } S.

1 + Tr(Σ(Jt)) FS(Jt) 2 2 = Z

s Pn i=1 f(Jt, zi) 2 2 n 1 FS(Jt) 2 2

s Pnsp i=1 f(Jt, zsp i ) 2 2 nsp

1 FS(Jt) 2 2

Denote the weights after t-epoch training as Xt. We can roughly calculated C(Jt)

FS(Xt) FS(Xt 1) n

s Pnsp i=1 f(Xt, zsp i ) 2 2 nsp

1 FS(Xt) 2 2

C.2 Experimental Details

Here, we give a detail setting of the experiment for each figure.

Figure 1 The learning rate is fixed to 0.05 during all the training process. The batch size is 256. All experiments is trained with 100 epoch. The test accuracy for CIFAR-10, CIFAR-100, and SVHN are 87.64%, 55.08%, and 92.80%, respectively.

Figure 2 The initial learning rate is set to 0.05 with the batch size of 1024. We use the Cosine Annealing LR Schedule to adjust the learning rate during training.

Figure 3 Each point is an average of three repeated experiments. We stop training when the training loss is small than 0.2.

C.3 Experimental exploration of gennl(JT)

In this section, our aim is to investigate the conditions under which gennl(JT) 0. Since directly calculating the difference |Rµ(JT) RS(JT)| is challenging, we concentrate on the upper bound value |Rµ(JT)| + |RS(JT)|.

We conduct the experiment using cifar10-5k dataset and fc-tanh network, following the setting of paper [9]. Cifar10-5k[9] is a subset of cifar10 dataset. Building upon the work of [1], we compute the Relative Progress Ratio (RP) and Test Relative Progress Ratio (TRP) throughout the training process. We initially consider the case of gradient descent. The definitions of RP and TRP for gradient descent are as follows:

RP(Jt) FS(Jt+1) FS(Jt)

η FS(Jt) 2 (50)

TRP(Jt) FS (Jt+1) FS (Jt)

η FS(Jt)T FS (Jt). (51)

Therefore, we have:

t=1 (FS(Jt) FS(Jt 1)) FS(J0)

t=0 (Jt Jt 1)T FS(Jt 1)

(FS(Jt) FS(Jt 1)) (Jt Jt 1)T FS(Jt 1)

(FS(Jt) FS(Jt 1)) + η FS(Jt 1) 2

ηt(1 + RP(Jt 1)) FS(Jt 1) 2

Following the same way, we have:

t=1 (FS (Jt) FS (Jt 1)) FS (J0)

t=0 (Jt Jt 1)T FS (Jt 1)

ηt(1 + TRP(Jt 1)) FS(Jt 1)T FS (Jt 1) (53)

Combining Equation (52) and Equation (53), we have:

| gennl(JT)|

t=1 ηt (1 + TRP(Jt 1))| FS(Jt 1)T FS (Jt 1)| + (1 + RP(Jt 1)) FS(Jt 1) 2 (54)

Therefore, if we have for all t,RP(Jt) 1 and TRP(Jt) 1, then | gennl(JT)| 0.

From Figure 4 we find that in stable regime, where the sharpness is below the 2

η, we have TRP RP 1. Under small learning rate, the gradient descent doesn t enter the regime of edge of stability and we have TRP RP 1 during whole training process and gennl(JT) 0.

Next, we consider the case of Stochastic Gradient Descent (SGD). Due to the stochastic estimation of the gradient, we need to rely on some approximations. Let Xi t represent the weights after the t-epoch and i-th iteration of training. We assume a constant learning rate η for SGD. The gradient is approximated as follows:

η FS(Xi t) B

n (Xt Xt+1) = B

i=1 FS(Xi t), (55)

Figure 4: Exploration of gennl(JT) on Gradient descent. Experiments is conducted on cifar10-5k dataset with cross entropy loss. The blue dash line in fourth row denotes 2

η. Gradient descent enter the Eo S regime when the sharpness is above 2

η. Both RP and TRP have values around -1 when sharpness is below the 2

and we appximate FS (Xi t) as:

η FS(Xi t) η FS(Xt). (56)

Therefore, we have:

RP(Xt) η(FS(Xt+1) FS(Xt))

Xt+1 Xt (57)

TRP(Xt) FS (Xt+1) FS (Xt) (Xt Xt+1)T FS (Xt). (58)

Figure 5: Exploration of gennl(JT) on SGD case. Here, the effective learning rate is defined as ηef n

B η. We still have gennl(JT) 0 under small learning rate.

We calculated the effect learning rate for SGD as ηef n

B η. Figure 5 shows that the conclusions of SGD are similar as GD, except that the conditions of entering Eo S are different.

Table 4: Comparison of trajectory based generalization bounds. Only our proposed method can apply to the SGD with rich trajectory related information.

Method Conditions T.R.T Nikolakakis et al. [25] Gradient Descent, ηt c

β , β-smooth PT t=1 ηt 1

n Pn i=1 f(Jt, zi) 2

Neu et al. [24] β-smooth, E [ f(w, z) Fµ(w) ] v, f( ) is subguassian distribution

Tη2 Park et al. [27] Weak Lipschitz continuity, Piecewise β -smooth, f( ) is bounded, η < 2

Ours Small Learning Rate, Fµ(w) γ FS(w)

t d FS(Jt) q

1 + Tr(Σ(Jt)) FS(Jt) 2

D Other Related Work

This part compares the works that is not listed in Table 2. Table 4 gives other trajectory based generalization bounds. [25] is a stability based work designed mainly for generalization of gradient descent. It removes the Lipschitz assumption, and replaced by the term PT t=1 ηt 1

n Pn i=1 f(Jt, zi) 2 in the generalization bounds. This helps enrich the trajectory information in the bounds. The limitation of this work is that it can only apply to the gradient descent and it is hard to extend to the stochastic gradient descent. Neu et al. [24] adapt the information-theretical generalization bound to the stochastic gradient descent. The Theorem 1 in Neu et al. [24] contains rich information about the learning trajectory, but most is about Fµ(w), which is unavailable for us. Therefore, we mainly consider the result of Corollary 2 in Neu et al. [24], which removes the term Fµ(w) by the assumption listed in Table 4. For this Collorary, the remained information within trajectory is merely the p

Tη2. Althouth Neu et al. [24] dosen t require the assumption of small learning rate, the bound contains the dimension of model, which is large for deep neural network. Compared with these work, our proposed method has advantage in that it can both reveal rich information about learning trajectory and applied to stochastic gradient descent.

Chandramoorthy et al. [8] analyzes the generalization behavior based on statistical algorithmic stability. The proposed generalization bound can be applied into algorithms that don t converge. Let S(i) be the dataset obtained by replace zi in S with another sample z i draw from distribution µ.The generalization bound relies on the stability measure m sup{ 1

T PT 1 t=0 f(Jt|S, z) 1 T PT 1 t=0 f(Jt|S(i), z)|z Z, i [n]}. We don t directly compare with this method because the calculation of m relies on S(i) which contains sample outside of S. Therefore, we treat this result as intermediate results. More assumption is needed to remove this dependence of the information about the unseen samples, i.e., the samples outside set S.

E Effect of Learning Rate and Stochastic Noise

In this part, we want to analyze how learning rate and the stochastic noise jointly affect our proposed generalization bound. Specifically, we denote pt(w) as the distribution of the Jt during the training with multiple training steps. Following the work [12], we consider the SDE function as an approximation, which is shown as below:

dw = FS(w)dt + ηC 1 2 d W(t). (59)

The SDE can be regarded as the continuous counterpart of Equation(3) when sets the distribution of noise term ϵ in Equation(3) as Gaussian distribution. The influence of the noise ϵ on pt(w) is shown in the following theorem. Theorem E.1. When the updating of the weight w follows Equation (59), the covariance matrix C is a hessian matrix of a function with a scalar output, then we have:

wi [ FS(w)pt(w) η

2[ Tr(C(w)) + C(w) w log(pt(w)) | {z } dampling factor

]pt(w)]. (60)

Remark E.2. Previous studies [45, 32, 12] tell that the covariance matrix C is proximately equal to the hessian matrix of the loss function with respect to the parameters of DNN. Thus, the above condition that the covariance matrix C is a hessian matrix of a function with scalar output is easy to be satisfied. Formula (60) contains three parts. The item FS(w)pt(w) enlarge the probability of parameters being located in the parameter space with low FS(w). Tr(C(w)) and C(w) w log(pt(w)) ususally

contradict with each other. Tr(C(w)) enlarge the probability of parameters being located in the parameter space with low Tr(C(w)) value, while C(w) w log(pt(w)) serves as a damping factor to prevent the probability from concentrating on a small space. Therefore, setting larger learning rate gives stronger force for the weight to the area with lower Tr(C(w)) values. According to Equation 5, we also have a lower Σ(w). As a result, large learning rate causes a small lower bound in Theorem 3.6

Proof. Based on the condition described above, we can infer that C(w) = G(w), where G is a function with a scalar output.

We first prove that C(w) = Tr(C(w)) as below:

[ C(w)]j = [ G(w)]j

= wj Tr(C(w)).

So far, we can infer that C = Tr(C). According to Fokker-Planck equation( [26]), we have:

wi [ FD(w)pt(w)] + 1

wj [C(w)pt(w)]

wi [ FD(w)pt(w)] + 1

wi [pt(w) C + pt(w)C w log pt(w)]

FD(w)pt(w) 1

2η [ C(w) + C(w) w log pt(w)] pt(w)

FD(w)pt(w) 1

2η [ Tr(C(w)) + C(w) w log pt(w)] pt(w) .

(62) Therefore, the theorem is proven.