# explicit_eigenvalue_regularization_improves_sharpnessaware_minimization__db0ddb1a.pdf Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization Haocheng Luo1, Tuan Truong2, Tung Pham2, Mehrtash Harandi1, Dinh Phung1, Trung Le1 1Monash University, Australia 2Vin AI Research, Vietnam haocheng.luo@monash.edu, v.tuantm27@vinai.io, v.tungph4@vinai.io, mehrtash.harandi@monash.edu, dinh.phung@monash.edu, trunglm@monash.edu Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM s training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of secondand thirdorder terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM s effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, which limits SAM s efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at https://github.com/Ritian Luo/Eigen SAM. 1 Introduction Understanding the generalization of deep learning algorithms is one of the core challenges in modern machine learning. Overparameterization makes the loss landscape of neural networks highly nonconvex, often featuring numerous global optima, while simple gradient-based algorithms surprisingly tend to find solutions that generalize well. A body of empirical and theoretical work suggests that the "flatness" or "sharpness" of the minima is a promising explanation for generalization (Hochreiter and Schmidhuber, 1997; Keskar et al., 2016; Dinh et al., 2017; Jiang et al., 2019; Xie et al., 2020; Liu et al., 2023b), and the implicit bias of optimization algorithms drives them toward flatter solutions, thereby ensuring good generalization (Blanc et al., 2020; Wen et al., 2022; Arora et al., 2022; Damian et al., 2022; Ahn et al., 2023; Tahmasebi et al., 2024). Inspired by research on flatness and generalization, recent work by Foret et al. (2021) proposed Sharpness-Aware Minimization (SAM), a dual optimization method that perturbs parameters before performing gradient descent to enhance generalization performance by minimizing sharpness. Although SAM has demonstrated empirical success across various fields (Foret et al., 2021; Kaddour et al., 2022), theoretical analysis of the principles underlying its success remains limited. The work by Compagnoni et al. (2023) explains SAM s generalization advantage as an implicit minimization of the gradient norm, while Wen et al. (2022) suggests that SAM regularizes the Hessian spectrum near the minima manifold. However, existing theories are either somewhat simplified or rely on overly 38th Conference on Neural Information Processing Systems (Neur IPS 2024). idealized assumptions, leading to a noticeable gap between theory and practice. This gap limits their ability to fully explain the advantages of SAM (see Appendix A or contemporaneous work Song et al. (2024) for empirical evidence). In this paper, we consider a widely used measure of sharpness: the largest eigenvalue of the Hessian matrix (Lyu et al., 2022; Arora et al., 2022; Wen et al., 2022; Damian et al., 2022). We extend the previous PAC-Bayes theory to demonstrate the importance of this measure for generalization. Our main contribution is an in-depth analysis of the training dynamics of SAM, expanding on the second-order Stochastic Differential Equation (SDE) proposed by Compagnoni et al. (2023) and revealing that the complex third-order terms play a crucial role in shaping SAM s implicit bias. Under ideal conditions, where the perturbation vector aligns well with the top eigenvector of the Hessian matrix, these terms effectively reduce the sharpness of the loss function. However, our experiments show that this alignment does not hold in real-world settings, limiting SAM s ability to effectively regularize sharpness. Based on our theoretical findings and experimental observations, we propose a new algorithm, Eigen-SAM, which intermittently estimates the top eigenvector of the Hessian matrix and incorporates its component orthogonal to the gradient into the perturbation, enabling explicit regularization of the top Hessian eigenvalue. We summarize our contributions as follows: We prove a new theorem (Theorem 3.1) to establish the relationship between the top eigenvalue and generalization error, building on the general PAC-Bayes theorem (Alquier et al., 2016). We propose a third-order SDE (Theorem 4.1) to model the dynamics of SAM. This approach achieves a lower approximation error compared to the previous second-order SDE by Compagnoni et al. (2023) and additionally infers a close relationship between perturbationeigenvector alignment and sharpness reduction (Corollary 4.1.1). We introduce a novel algorithm, Eigen-SAM, based on our theoretical insights and experimental observations (Section 5). This method aims to enhance alignment between the perturbation vector and the top eigenvector, resulting in a more effective reduction of sharpness. We validate our theory and the effectiveness of the proposed algorithm through comprehensive experiments (Section 6). 2 Related work Theoretical understanding of SAM. SAM has garnered widespread attention for its significant improvements in generalization performance; however, the theoretical analysis provided in the original paper (Foret et al., 2021) is limited. The authors only presented a PAC-Bayes generalization bound, which is effective only for 0-1 loss. Subsequently, Andriushchenko and Flammarion (2022) attempted to further understand the success of SAM by restricting the network structure to diagonal linear networks and providing an implicit bias for SAM. They also established the first convergence result for SAM. Bartlett et al. (2023) conducted a detailed study of SAM s dynamics for quadratic loss, suggesting that it oscillates between the two sides of the minimum in the direction of greatest curvature and drifts toward flatter minima. However, the assumption of quadratic or locally quadratic loss settings is not realistic for practical deep learning models. More recently, Wen et al. (2022) extended the analysis of SAM s dynamics to general loss functions, assuming that all global minima form a connected manifold. Given sufficient training time and infinitesimally small η and ρ, they rigorously proved that SAM s dynamics would track the trajectory of a Riemannian flow with respect to sharpness, achieving the same sharpness-reducing effect. On a different front, Compagnoni et al. (2023) applied the continuous-time approximation framework from Li et al. (2017) to analyze SAM dynamics, concluding that SAM implicitly minimizes the norm of the gradient scaled by ρ. Continuous-time approximations for discrete algorithms. Substantial research demonstrates that the trajectory of stochastic discrete iterations with decaying step sizes will ultimately follow the solution of certain Ordinary Differential Equations (ODEs) (Harold et al., 1997; Borkar et al., 2009; Duchi and Ruan, 2018). Further developments in understanding deep learning algorithms were made by Li et al. (2017), who proposed a general and rigorous mathematical framework for continuous-time approximations, deriving SDEs for SGD and its various variants. They also provided experimental evidence supporting the reasonableness of continuous-time approximations in real-world models (Li et al., 2021). In this paper, we follow this mathematical framework, reusing some of its notations and definitions. 3 Preliminaries 3.1 Notations We start by introducing the notation used throughout our paper. Let S denote the training set, sampled from the true data distribution D. For a given mini-batch γ, we define the mini-batch loss as fγ(x) with parameters x Rd. The generalization loss is defined as f D (x) = Eγ D [fγ(x)], while the empirical loss is defined as f S (x) = Eγ S [fγ(x)]. Since we primarily analyze and discuss the empirical loss in this paper, we drop the dependency on S for simplicity, denoting the empirical loss as f(x) when no ambiguity arises. We use to denote the Euclidean norm. The k-th order derivative of the loss f at x is denoted by kf(x), which is a symmetric k-tensor in (Rd) k when x Rd. We denote by λ1( 2f(x)) the largest eigenvalue of the Hessian matrix 2f(x) and v1( 2f(x)) its corresponding unit eigenvector, with v1( 2f(x)) = 1. Additionally, we use 3f(x)(u, v) Rd to represent the application of the symmetric 3-tensor 3f(x) along directions u and v. 3.2 Sharpness-Aware Minimization SAM (Foret et al., 2021) seeks flat minima by minimizing the perturbed loss: min x max ϵ 1 f(x + ρϵ), where ρ is a predefined hyperparameter controlling the radius of the perturbation. Solving the inner maximization problem leads to ϵSAM(x) = f(x) f(x) . Differentiating the perturbed loss with respect to x, we get: f(x + ρϵSAM(x)) = d(x + ρϵSAM(x)) dx f(x)|x+ρϵSAM(x) f(x)|x+ρϵSAM(x). In the last approximation, Foret et al. (2021) ignore the dependency of ϵSAM(x) on x, leading to faster computational efficiency and higher generalization performance. Applying SAM in the stochastic case, the SAM iteration for mini-batch γk is summarized as: xk+1 = xk η fγk xk + ρ fγk We use fγk( ) and fγk to distinguish between those needing and not needing backpropagation of gradients, matching the algorithm in practice. Thus, the perturbation vector of mini-batch SAM can be written as: ϵSAM γ = fγ fγ . (2) 3.3 SDE approximation for SGD and SAM Li et al. (2017) developed the following SDE to approximate discrete SGD: d Xt = f(Xt)dt + η(Σ1,1(Xt)) 1 2 d Wt, where Wt is standard Brownian motion. Compagnoni et al. (2023) applied this framework to analyze the dynamics of SAM, deriving the following second-order SDE for SAM: d Xt = f(Xt) ρE 2fγ(Xt) fγ dt + η Σ1,1(Xt) + ρ(Σ1,2(Xt) + Σ1,2(Xt) ) 1 where Σa,b denotes the covariance matrix of the a-th and b-th terms in the Taylor expansion of the perturbed loss (see Appendix B for the full expression). We refer to Eq.3 as the second-order SDE since it includes up to second-order partial derivatives in both the drift and diffusion coefficients. 3.4 Choice of sharpness measure In this paper, we use the largest eigenvalue λ1( 2f(x)) as a measure of sharpness, similar to prior works (Lyu et al., 2022; Arora et al., 2022; Wen et al., 2022; Damian et al., 2022). Geometrically, the top eigenvalue of the Hessian matrix at a given point represents the maximal curvature of the loss function along any direction. Moreover, it is closely related to the concept of sharpness (defined as the maximum perturbed loss difference) used by Foret et al. (2021) at the minima. Note that at the minima, f(x) = 0; thus, max ϵ 1 f(x + ρϵ) f(x) max ϵ 1 ρϵ f(x) + ρ2 = ρ2λ1( 2f(x)) Another possible choice is the trace of the Hessian matrix. However, the trace of the Hessian scales with the dimensionality of the parameters, complicating cross-model comparisons and limiting this measure s applicability. We further establish a PAC-Bayes theorem that bounds the generalization error through the top eigenvalue of the Hessian matrix. This theorem is based on the general PAC-Bayes theorem Alquier et al. (2016) and applies to bounded losses, not limited to 0-1 loss as in the work of Foret et al. (2021); Zhuang et al. (2022). Theorem 3.1. (Generalization Bound) Assume that the loss function is bounded by L, and the third-order partial derivative of the loss function is bounded by C. Additionally, we assume f D(x) Eϵ N(0,σ2Id)f D(x + ϵ), as in Foret et al. (2021). For any δ (0, 1) and σ > 0, with a probability over 1 δ over the choice of S Dn, we have f D (x) f S (x) + dσ2 2 λ1 2f S (x) + Cd3σ3 d log 1 + x 2 + O(1) + 2 log 1 δ + 4 log (n + d). where n is the number of samples. We defer the proof to Appendix C. Theorem 3.1 suggests that minimizing the top eigenvalue of the Hessian matrix is crucial for improving generalization ability. 4 Third-order SDE reveals implicit regularization in SAM In this section, we delve into the discussion and derivation of the third-order SDE continuous-time approximation for SAM. In Section 4.1, we present heuristic derivations that provide intuitive insights into our approach. Following this, Section 4.2 offers a formal third-order SDE approximation for SAM, establishing the mathematical rigor of our framework. Finally, in Section 4.3, we propose a corollary linking perturbation-eigenvector alignment with eigenvalue regularization, furthering our understanding of the implicit regularization effects inherent in SAM. 4.1 Heuristic derivations for the third-order SDE We begin by examining the drift coefficient in Compagnoni et al. (2023) (Eq. 3): f(Xt) ρE 2fγ(Xt) fγ = f(Xt) ρ E fγ(Xt) , where the second term indicates that SAM penalizes trajectories with large loss gradients. However, this formulation does not reveal an implicit regularization effect concerning sharpness, specifically regarding the top eigenvalue of the Hessian matrix. Thus, understanding the implicit bias on the Hessian matrix requires a third-order Taylor expansion. The missing cubic term in the Taylor expansion is: 2 E 3fγ(Xt)( fγ, fγ) 2 E f γ 2fγ(Xt) fγ The equality holds because fγ is treated as a perturbation vector independent of Xt, as SAM s implementation does not involve differentiating with respect to it (See Section 3.2 for a detailed description of this part). This third-order term suggests that SAM employs an additional gradient measurement to compute a specific third derivative: the gradient of the second derivative along the direction of the gradient. Hessian-gradient alignment during training. Recent findings indicate that during training, the gradient implicitly aligns with the top eigenvector of the Hessian matrix under certain conditions: (1) training with normalized full-batch gradient descent (Arora et al., 2022); (2) a locally quadratic loss landscape (Bartlett et al., 2023); (3) training with SAM when very close to the minimizer manifold (Wen et al., 2022). This alignment phenomenon is crucial for interpreting the third-order term in our drift coefficient. Specifically, when the gradient is highly aligned with the top eigenvector, we have: " 3fγ(Xt)( fγ, fγ) 2 E 3fγ(Xt)(v1( 2fγ(Xt)), v1( 2fγ(Xt))) 2 E λ1( 2fγ(Xt)), where the final equality follows from the properties of differentiating eigenvalues (see Magnus (1985) for a detailed discussion). If this alignment phenomenon holds, we can conclude that the implicit bias of the drift coefficient aligns with the gradient of the top eigenvalue of the Hessian, thereby implicitly minimizing sharpness. 4.2 Formal third-order SDE approximation for SAM In this subsection, we present the general formulation of the SDE for SAM. We refer to our SDE (Eq. 4) as the third-order SDE, to distinguish it from the second-order SDE (Eq. 3). For the complete statements and proofs, we refer the reader to Appendix B. Theorem 4.1. (Third-order SDE for SAM, Informal Statement of Theorem B.4) Let 0 < η < 1, T > 0, N = T/η , and {xk : k 0} denote the sequence of discrete SAM iterations defined by Eq. 1. Define {Xt : t [0, T]} as the stochastic process satisfying the SDE d Xt = ef SAM(Xt)dt + η(ΣSAM(Xt)) 1 2 d Wt, X0 = x0 (4) with ef SAM(Xt) := f(Xt) + ρE fγ(Xt) + ρ2 2 E f γ 2fγ(Xt) fγ ΣSAM(Xt) := Σ1,1(Xt) + ρ(Σ1,2(Xt) + Σ1,2(Xt) ) + ρ2 Σ2,2(Xt) + 1 2(Σ1,3(Xt) + Σ1,3(Xt) ) , where Σa,b denotes the covariance matrix of the a-th and b-th terms in the Taylor expansion of the perturbed loss (see Appendix B for the full expression). Under sufficient regularity conditions, let ρ = O(η 1 3 ). Then, {Xt : t [0, T]} is an order-1 weak approximation of {xk : k 0}, i.e., for any test function g of at most polynomial growth, there exists a constant C independent of η such that max k=0,1,...,N |Eg(xk) Eg(Xkη)| Cη. Our proof relies on the third-order Taylor expansion of fγk Xt + ρ fγk fγk , carefully matching the firstand second-order conditional moments and quantifying the errors for higher-order terms. Our third-order SDE reveals that SAM s implicit bias includes a complex combination of second-order and third-order terms, with scales of ρ and ρ2 2 , respectively. Compared to Compagnoni et al. (2023), our theorem offers two main advantages: first, we allow ρ to take larger values (η 1 3 compared to η 1 2 in Compagnoni et al. (2023)), which is more consistent with real-world settings; equivalently, our SDE achieves a lower approximation error for a fixed ρ. Second, our SDE explicitly captures SAM s implicit bias on the Hessian matrix, manifesting as the gradient of the Hessian in the gradient s direction. Additionally, the diffusion coefficient in Eq. 4 implies that SAM injects additional noise in the form of the covariance of the higher-order terms in the Taylor expansion of the perturbed loss. This curvature-dependent noise aligns with recent studies (Gatmiry et al., 2024a,b), which demonstrate that label noise in SGD exhibits similar behavior to SAM. 4.3 Perturbation-eigenvector alignment and eigenvalue regularization The implicit bias introduced by the third term in the drift coefficient of the SDE (Eq. 4), i.e., ρ2 2 E 3fγ(Xt)( fγ, fγ) fγ 2 , remains difficult to understand. As discussed in Section 4.1, if we assume that the perturbation vector is well-aligned with the top eigenvector, we can interpret the cubic term as the gradient of the top eigenvalue, leading to an implicit bias that decreases sharpness. Next, we quantify and formalize this heuristic approach. Define Align(ϵ, v1) := 1 mins { 1} ϵ ϵ s v1 as a measure of alignment between the perturbation vector ϵ and the top eigenvector v1. It is worth noting that +v1 and v1 are equivalent eigenvectors, which is why we define alignment as the maximum over both +v1 and v1. Corollary 4.1.1. Recall that ϵSAM γ is defined in Eq. 2. Let s denote the direction scalar, i.e., s = arg mins { 1} ϵ ϵ s v1 . Under the same conditions as in Theorem 4.1, and assuming a positive eigenvalue gap (see Assumption B.2 for a definition), we have the following: 1. If Align(ϵSAM γ , v1( 2fγ(Xt))) 1 O(ρ), then the SDE (Eq. 4) becomes d Xt = ef SAM ρ (Xt)dt + η(ΣSAM) 1 2 d Wt, (5) where ef SAM ρ (Xt) := f(Xt) + ρ E fγ(Xt) + ρ2 2 Eλ1( 2fγ(Xt)); 2. If Align(ϵSAM γ , v1( 2fγ(Xt))) 1 O(ρ2), then the SDE (Eq. 4) becomes d Xt = ef SAM ρ2 (Xt)dt + η(ΣSAM) 1 2 d Wt, (6) where ef SAM ρ2 (Xt) := f(Xt) + ρE s λ1( 2fγ(Xt))v1( 2fγ(Xt)) + ρ2 2 Eλ1( 2fγ(Xt)). The proof is deferred to Appendix B. In Corollary 4.1.1, we rigorously formalize our intuition from Section 4.1. If the alignment is at least 1 O(ρ), we conclude that the SAM trajectory comprises three components: the gradient of the loss, the gradient of the gradient norm, and the gradient of the top eigenvalue, with respective scales 1, ρ, and ρ2 2 . Under this well-aligned condition, the SAM trajectory minimizes the loss while implicitly regularizing both the gradient norm and sharpness. This demonstrates SAM s complex implicit bias, which is not solely influenced by secondor third-order terms, as summarized in previous work (Wen et al., 2022; Compagnoni et al., 2023). For empirical evidence supporting this observation, we refer readers to Appendix A. Furthermore, if the alignment is at least 1 O(ρ2), then the gradient of the gradient norm oscillates in the direction of the top eigenvector. Notably, Bartlett et al. (2023) derived a similar discrete SAM dynamic under specific conditions (Theorem 20), where the parameter trajectory oscillates in the direction of the top eigenvector while regularizing the leading eigenvalue. However, their conditions are stricter than ours, assuming that the parameters are already close to the minimum and requiring a specific initialization. When these conditions are met, they require an alignment of the gradient with the top eigenvector of at least 1 O(ηρ) = 1 O(ρ4). In comparison, our SDE framework is more general. Comparison with Wen et al. (2022). Wen et al. (2022) derived an implicit bias similar to our cubic term in the third-order SDE (Eq. 5) for SAM through the analysis of the Riemannian flow near the minimizer manifold. However, our work fundamentally differs from theirs. First, their theory requires a much longer training time Θ(η 1ρ 2) compared to our Θ(η 1). Thus, our SDE corresponds to the initial phase of their analysis regarding time scale, during which they do not conclude any implicit bias. In contrast, our SDE (Eq. 5), which indicates that the implicit bias comprises three components with different scales, provides richer insights in this phase. Second, they require ηln(1/ρ) to be sufficiently small, causing ρ to be exponentially smaller than η, whereas our theory accommodates a more practical range, ρ = O(η 1 3 ). 5 Eigen-SAM: an explicit regularization method for the top eigenvalue of the Hessian 5.1 Failure of perturbation-eigenvector alignment in practice Within the theoretical framework of Section 4, a natural question arises: Is the perturbationeigenvector alignment sufficient in practice for SAM to effectively minimize sharpness? Unfortunately, 0 5000 10000 15000 20000 25000 30000 35000 40000 Step (a) Perturbation-eigenvector alignment (b) Top eigenvalue of Hessian Figure 1: Alignment and top eigenvalue for a 6-layer CNN model trained on CIFAR-10. The left panel shows the trend of alignment during SAM training; the shaded area represents the 95% confidence interval. The right panel displays the trend of the top eigenvalue over the course of training. we have empirically verified that such alignment may be poor in practice, even for relatively simple models. As a result, the regularization effect on the largest eigenvalue, as discussed in Corollary 4.1.1, may not be clearly observable in practical scenarios. To investigate this phenomenon, we trained a 6-layer Simple CNN model, as used in Jastrzebski et al. (2021); Deng et al. (2024), on CIFAR-10 (Krizhevsky et al., 2009). Figure 1 illustrates two key findings from our experiments. The left panel shows that the alignment between the perturbation vector and the top eigenvector is indeed poor during training. Consequently, the right panel reveals that SAM is unable to efficiently minimize the top eigenvalue when alignment is weak. These results highlight the limitations of SAM in real-world scenarios where ideal alignment cannot be assumed. 5.2 Proposed method: Eigen-SAM To address the issue of poor alignment, we propose a novel algorithm called Eigen-SAM, which aims to explicitly align the perturbation vector with the top eigenvector. This approach makes SAM s update closer to the SDE approximation in Eq. 5, where the third-order term in the drift coefficient can effectively minimize the largest eigenvalue. To achieve this alignment, we estimate the top eigenvector of the Hessian matrix once every p mini-batch steps (with p = 100 in our implementation) using the power method for q iterations (with q = 5 in our implementation). This strategy of intermittently estimating the Hessian matrix has been shown to be effective in practice (Liu et al., 2023a). After obtaining an estimate ˆv of the top eigenvector, we decompose it into components parallel and perpendicular to the gradient direction, then add the perpendicular component to the perturbation vector to enhance alignment explicitly: ˆv = ˆv + ˆv (7) ϵEigen-SAM γ = fγ fγ + α sign( fγ, ˆv )ˆv , (8) where α is a hyperparameter that controls the strength of the explicit alignment. Since +v and v are equivalent eigenvectors, we include sign( fγ(x), ˆv ) to determine the direction of ˆv. Here, we always choose ˆv to have a smaller angle with the gradient. The full algorithm is presented in Algorithms 1 and 2. In Appendix D, we provide an in-depth discussion of the theoretical properties of Eigen-SAM, including sufficient conditions for improving alignment (Proposition D.1) and its convergence rate (Theorem D.2), which is comparable to that of SAM. Algorithm 1 Power iteration to estimate the top eigenvector 1: Initialize ˆv as a random unit vector 2: for t2 = 1, 2, . . . , q do 3: Compute Hessian-vector product ˆv = 2f(x) ˆv 4: Normalize ˆv: ˆv ˆv/ ˆv 5: end for 6: return ˆv Algorithm 2 Eigen-SAM 1: for t1 = 1, 2, . . . , T do 2: Compute mini-batch loss fγ(xk) 3: if t1 mod p = 1 then 4: Use Algorithm 1 to estimate ˆv 5: end if 6: Compute the perturbation ϵt per Eq. 7 and 8 7: Perturb the model: xt = xk + ρϵt 8: Update parameters: xk+1 = xk η fγ( xk) 9: end for 10: return xk 5.3 Analysis of additional computational overhead The additional overhead in Eigen-SAM arises from running the Hessian-vector product q times every p steps to estimate the top eigenvector. The Hessian-vector product requires roughly 1 2 times the time needed to compute the gradient, so the overhead of our algorithm is approximately 2 + q p to 2 + 2q p times that of SGD, compared to 2 times for standard SAM. For larger models, the computation time for the Hessian-vector product remains nearly constant. For a detailed analysis of the computation cost of Hessian-vector products, we refer readers to Dagréou et al. (2024). 6 Experiments 6.1 Numerical simulation of the third-order SDE In this subsection, we validate the approximation error between our proposed SDE (Eq. 4) and the discrete SAM algorithm (Eq. 1). We trained a fully-connected network with one hidden layer, consisting of 784 hidden units and using Ge LU activation, on the MNIST dataset (Deng, 2012). The training was conducted with η = 0.01 and ρ = 0.2 (where ρ η 1 3 ). During training, we carefully tracked several key metrics, including training loss, test loss, test accuracy, parameter norm, gradient norm, and the top eigenvalue of the Hessian, as shown in Figure 2. Our results demonstrate that the approximation error of our third-order SDE is significantly lower than that of the previous second-order SDE. Specifically, the curves of our SDE closely match those of the discrete SAM across all tracked metrics, underscoring the accuracy and reliability of our approximation. This close alignment suggests that our proposed continuous-time approximation provides a more precise representation of the discrete SAM dynamics, thus enhancing the theoretical understanding of SAM. 6.2 Image classification from scratch To evaluate the effectiveness of Eigen-SAM, we applied it to several image classification tasks on benchmark datasets, including CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), Fashion-MNIST (Xiao et al., 2017), and SVHN (Netzer et al., 2011). For these tasks, we used Res Net-18 (He et al., 2016), Res Net-50 (He et al., 2016), and Wide Res Net-28-10 (Zagoruyko and Komodakis, 2016) models. We selected SGD as the base optimizer and applied basic data augmentation techniques, including horizontal flips, padding by four pixels, and random cropping. The batch size was set to 256, with training conducted for 200 epochs. We used an initial learning rate of 0.1 for CIFAR-10, Fashion-MNIST, and CIFAR-100, and 0.01 for SVHN, adjusting the learning rate over time with a cosine schedule. The weight decay was set to 5 10 5, and the momentum was 0.9. Detailed hyperparameter settings are provided in Appendix E. The test set performance, reported in Table 1 along with the 95% confidence interval, shows that Eigen-SAM consistently achieves state-of-the-art performance across various datasets and models, validating its effectiveness and robustness. (a) Training Loss (b) Test Loss (c) Test Accuracy (d) Parameter Norm (e) Gradient Norm (f) Top Eigenvalue Figure 2: Training dynamics of discrete SAM, second-order SDE, and third-order SDE during training. Metrics include training loss, test loss, test accuracy, parameter norm, gradient norm, and the top Hessian eigenvalue. Each plot illustrates how each approach affects loss dynamics and key stability metrics. Table 1: Test accuracy on CIFAR-10, CIFAR-100, Fashion-MNIST, SVHN. Architecture Method CIFAR-10 CIFAR-100 Fashion-MNIST SVHN Res Net18 SGD 94.8 0.2 74.6 0.2 94.9 0.2 96.1 0.1 SAM 95.5 0.1 77.4 0.2 95.4 0.1 96.3 0.1 Eigen-SAM 95.9 0.2 78.3 0.2 95.6 0.2 96.5 0.1 Res Net50 SGD 95.0 0.1 76.6 0.2 94.8 0.1 96.1 0.1 SAM 95.6 0.2 79.0 0.2 95.4 0.1 96.4 0.1 Eigen-SAM 96.2 0.1 79.7 0.1 95.7 0.1 96.6 0.1 Wide Res Net-28-10 SGD 95.7 0.1 79.8 0.2 95.1 0.1 96.2 0.1 SAM 96.5 0.1 82.0 0.2 95.6 0.1 96.4 0.1 Eigen-SAM 96.8 0.1 82.8 0.1 95.9 0.1 96.7 0.1 6.3 Finetuning We evaluated performance by fine-tuning a Vi T-B-16 model Dosovitskiy et al. (2020) pre-trained on Image Net for CIFAR-10 and CIFAR-100. We used the checkpoint provided by Py Torch s official repository1. For SAM and Eigen-SAM, we used an initial learning rate of 0.01 and trained for 4k steps, while for SGD, we trained for 8k steps. Table 2 shows the test accuracy, where Eigen-SAM consistently outperforms the baselines. Table 2: Test accuracy for fine-tuning Vi T-B-16 pretrained on Image Net-1K on CIFAR-10 and CIFAR-100. Architecture Method CIFAR-10 CIFAR-100 Vi T-B-16 SGD 98.0 0.1 88.6 0.1 SAM 98.4 <0.1 89.5 0.1 Eigen-SAM 98.5 0.1 89.8 0.1 1https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html 6.4 Sensitivity analysis We investigated the impact of varying the hyperparameter α on the test accuracy of Eigen-SAM. We conducted experiments on Res Net-18 with CIFAR-100, testing a range of α values, as shown in Figure 3. We observed that the test accuracy peaks at α = 0.2, yielding a 0.9% improvement in test accuracy compared to α = 0, which corresponds to the standard SAM. These results suggest that α is a robust hyperparameter, as its variations do not cause significant performance fluctuations, while consistently enhancing performance. In Table 3 and Table 4 in Appendix F, we demonstrate how larger values of p affect generalization performance and observe that setting p to 1000 (resulting in less than 1% additional overhead) retains most of the performance gains. In Figure 5 in Appendix F, we demonstrate the convergence speed of Algorithm 1, which typically requires only a few steps to converge. 0(SAM) 0.05 0.1 0.2 0.3 0.5 0.7 Alpha Test accuracy (a) Sensitivity Analysis of α (b) Spectrum of Hessian Figure 3: Left: Sensitivity analysis of α; the blue lines indicate the confidence interval. Right: Spectrum of the Hessian at the end of training. 6.5 Hessian spectrum Figure 3 shows the Hessian spectrum at the 200th epoch for Res Net-18 trained on CIFAR-100 using SAM and Eigen-SAM. We observe that the model trained with Eigen-SAM has both a smaller top eigenvalue and trace, with more eigenvalues concentrated near zero. This observation aligns with our motivation for proposing Eigen-SAM and explains why Eigen-SAM generalizes better than SAM. 7 Conclusion In this work, we analyzed the training dynamics of SAM using a third-order SDE, identifying the alignment between the perturbation vector and the top eigenvector as a crucial factor for effective sharpness regularization. However, our empirical analysis showed that this alignment is often poor in practice. Building on our theoretical framework and experimental insights, we proposed Eigen SAM, an algorithm that intermittently estimates the top eigenvector of the Hessian matrix and incorporates its component orthogonal to the gradient into the perturbation, explicitly regularizing the top Hessian eigenvalue. Extensive experiments demonstrated that our third-order SDE yields a smaller approximation error than previous models and that Eigen-SAM achieves state-of-the-art performance across various tasks, validating both its accuracy and effectiveness. Acknowledgements We would like to thank Zehang Deng for helpful discussions. We also appreciate the constructive feedback provided by the anonymous reviewers. This work was supported by ARC DP23 grant DP230101176 and by the Air Force Office of Scientific Research under award number FA2386-23-14044. Ahn, K., Jadbabaie, A., and Sra, S. (2023). How to escape sharp minima with random perturbations. ar Xiv preprint ar Xiv:2305.15659. Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approximations of gibbs posteriors. Journal of Machine Learning Research, 17(236). Andriushchenko, M. and Flammarion, N. (2022). Towards understanding sharpness-aware minimization. In International Conference on Machine Learning, pages 639 668. PMLR. Arora, S., Li, Z., and Panigrahi, A. (2022). Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948 1024. PMLR. Barrett, D. G. and Dherin, B. (2020). Implicit gradient regularization. ar Xiv preprint ar Xiv:2009.11162. Bartlett, P. L., Long, P. M., and Bousquet, O. (2023). The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Journal of Machine Learning Research, 24(316):1 36. Blanc, G., Gupta, N., Valiant, G., and Valiant, P. (2020). Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on learning theory, pages 483 513. PMLR. Borkar, V. S., Pinto, J., and Prabhu, T. (2009). A new learning algorithm for optimal stopping. Discrete Event Dynamic Systems, 19:91 113. Compagnoni, E. M., Biggio, L., Orvieto, A., Proske, F. N., Kersting, H., and Lucchi, A. (2023). An sde for modeling sam: Theory and insights. In International Conference on Machine Learning, pages 25209 25253. PMLR. Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. (2024). How to compute hessian-vector products? In ICLR Blogposts 2024. https://iclr-blogposts.github.io/2024/blog/bench-hvp/. Damian, A., Nichani, E., and Lee, J. D. (2022). Self-stabilization: The implicit bias of gradient descent at the edge of stability. ar Xiv preprint ar Xiv:2209.15594. Deng, L. (2012). The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142. Deng, Z., Sun, R., Xue, M., Wen, S., Camtepe, S., Nepal, S., and Xiang, Y. (2024). Leakage-resilient and carbon-neutral aggregation featuring the federated ai-enabled critical infrastructure. ar Xiv preprint ar Xiv:2405.15258. Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019 1028. PMLR. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929. Drucker, H. and Le Cun, Y. (1991). Double backpropagation increasing generalization performance. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 145 150. IEEE. Duchi, J. C. and Ruan, F. (2018). Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229 3259. Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. ar Xiv preprint ar Xiv:1703.11008. Folland, G. B. (2005). Higher-order derivatives and taylor s formula in several variables. Preprint, pages 1 4. Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. (2021). Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net. Gatmiry, K., Li, Z., Ma, T., Reddi, S., Jegelka, S., and Chuang, C.-Y. (2024a). What is the inductive bias of flatness regularization? a study of deep matrix factorization models. Advances in Neural Information Processing Systems, 36. Gatmiry, K., Li, Z., Ruiz, L., Reddi, S. J., and Jegelka, S. (2024b). Simplicity bias of SGD via sharpness minimization. Harold, J., Kushner, G., and Yin, G. (1997). Stochastic approximation and recursive algorithm and applications. Application of Mathematics, 35(10). He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778. Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural computation, 9(1):1 42. Jastrzebski, S., Arpit, D., Astrand, O., Kerg, G. B., Wang, H., Xiong, C., Socher, R., Cho, K., and Geras, K. J. (2021). Catastrophic fisher explosion: Early phase fisher matrix impacts generalization. In International Conference on Machine Learning, pages 4772 4784. PMLR. Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2019). Fantastic generalization measures and where to find them. ar Xiv preprint ar Xiv:1912.02178. Kaddour, J., Liu, L., Silva, R., and Kusner, M. J. (2022). When do flat minima optimizers work? Advances in Neural Information Processing Systems, 35:16577 16595. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. ar Xiv preprint ar Xiv:1609.04836. Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Li, Q., Tai, C., and Weinan, E. (2017). Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101 2110. PMLR. Li, Z., Malladi, S., and Arora, S. (2021). On the validity of modeling sgd with stochastic differential equations (sdes). Advances in Neural Information Processing Systems, 34:12712 12725. Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. (2023a). Sophia: A scalable stochastic second-order optimizer for language model pre-training. ar Xiv preprint ar Xiv:2305.14342. Liu, H., Xie, S. M., Li, Z., and Ma, T. (2023b). Same pre-training loss, better downstream: Implicit bias matters for language models. In International Conference on Machine Learning, pages 22188 22214. PMLR. Lyu, K., Li, Z., and Arora, S. (2022). Understanding the generalization benefit of normalization layers: Sharpness reduction. Advances in Neural Information Processing Systems, 35:34689 34708. Magnus, J. R. (1985). On differentiating eigenvalues and eigenvectors. Econometric theory, 1(2):179 191. Mil shtein, G. (1986). Weak approximation of solutions of systems of stochastic differential equations. Theory of Probability & Its Applications, 30(4):750 766. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y., et al. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 7. Granada, Spain. Si, D. and Yun, C. (2024). Practical sharpness-aware minimization cannot converge all the way to optima. Advances in Neural Information Processing Systems, 36. Song, M., Ahn, K., and Yun, C. (2024). Does sgd really happen in tiny subspaces? Tahmasebi, B., Soleymani, A., Bahri, D., Jegelka, S., and Jaillet, P. (2024). A universal class of sharpness-aware minimization algorithms. ar Xiv preprint ar Xiv:2406.03682. Wen, K., Ma, T., and Li, Z. (2022). How does sharpness-aware minimization minimize sharpness? Co RR, abs/2211.05729. Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747. Xie, Z., Sato, I., and Sugiyama, M. (2020). A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. ar Xiv preprint ar Xiv:2002.03495. Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. ar Xiv preprint ar Xiv:1605.07146. Zhuang, J., Gong, B., Yuan, L., Cui, Y., Adam, H., Dvornek, N., Tatikonda, S., Duncan, J., and Liu, T. (2022). Surrogate gap minimization improves sharpness-aware training. ar Xiv preprint ar Xiv:2203.08065. A Additional empirical evidence In this section, we present empirical evidence highlighting discrepancies between existing theories and practical observations. We demonstrate this by designing several "counterexample" algorithms and comparing their performance with SAM. First, the theory developed by Wen et al. (2022) suggests that SAM s implicit regularization primarily occurs near the minima manifold, where f(x) 0 and the quadratic term dominates the perturbed term. To test this, we designed a counterexample algorithm called Reverse-SAM, which applies SAM s ascent step using a negative normalized gradient, i.e., ϵ = fγ(x) fγ(x) . If this theory holds in practice, then Reverse-SAM should perform similarly to SAM, as the sign of the perturbation vector would not affect the implicit bias induced by the quadratic term. The second counterexample algorithm we consider is Explicit Gradient Regularization (EGR), a regularization method with a long history, tracing back to the works of Barrett and Dherin (2020) and Drucker and Le Cun (1991). The second-order SDE proposed by Compagnoni et al. (2023) suggests that SAM s implicit bias is equivalent to optimizing the objective f(x) + ρ f(x) . Therefore, if this second-order SDE model is accurate in practice, EGR with a regularization coefficient equal to ρ should exhibit performance comparable to SAM. We trained Res Net-18 on CIFAR-10 to compare the performance of these two counterexample algorithms with SAM. Figure 4 shows the training and test losses. Additionally, the test accuracies for SAM, EGR, and Reverse-SAM were 95.5% 0.1%, 95.0% 0.2%, and 94.4% 0.2%, respectively. Both counterexample algorithms show noticeable performance gaps from SAM across all metrics, highlighting the limitations of existing theories in explaining SAM s practical outcomes. 0 5000 10000 15000 20000 25000 30000 35000 40000 Step Training loss Reverse-SAM SAM EGR (a) Training Loss 0 5000 10000 15000 20000 25000 30000 35000 40000 Step Reverse-SAM SAM EGR (b) Test Loss Figure 4: Comparison of training loss and test loss metrics across algorithms. B Proofs for SDEs (Section 4) We follow the notation of multi-index, which is commonly used in SDE literature: A multi-index is an n-tuple of non-negative integers α = (α1, α2..., αn) |α| := α1 + α2 + ... + αn α! := α1!α2!...αn! For x = (x1, x2..., xn) Rn, xα := xα1 1 xα2 2 ...xαn n For a multi-index β, |β| β f(x) := |β| xβ1 1 xβ2 2 ... xβn n f(x) We denote the partial derivative with respect to xi by ei. Following the SDE framework by Mil shtein (1986); Li et al. (2017); Compagnoni et al. (2023), we have the following definitions and assumptions: Definition B.1. Let G denote the set of continuous functions Rd R of at most polynomial growth, i.e. g G if there exists positive integers κ1, κ2 > 0 such that |g(x)| κ1(1 + x 2κ2) for all x Rd. Moreover, for each integer α 1 we denote by Gα the set of α-times continuously differentiable functions Rd R which, together with its partial derivatives up to and including order α, belong to G. This definition originates from the field of numerical analysis of SDEs (Mil shtein, 1986). In the case of g(x) = ||x||j, the bound restricts the difference between the j-th moments of the discrete process and those of the continuous process. We write O(ρα1ηα2) to denote that there exists a function K G independent with ρ, η, such that the error terms are bounded by Kρα1ηα2. Assumption B.1. Assume that the following conditions on f, fi and their gradients are satisfied: f, fi satisfy a Lipschitz condition: there exists L > 0 such that f(x) f(y) + i=1 fi(x) fi(y) L x y ; f, fi and its partial derivatives up to order 7 belong to G; f, fi satisfy a growth condition: there exists M > 0 such that i=1 fi(x) M(1 + x ); g and its partial derivatives up to order 6 belong to G; See Li et al. (2017) for a detailed discussion on Assumption B.1. To prove the Corollary 4.1.1, we need an additional assumption to ensure that the top eigenvalue is differentiable. Note that this assumption is common in recent works (Damian et al., 2022; Wen et al., 2022) and easy to satisfy. Assumption B.2. (Eigenvalues gap) For all k > 0, λ1( 2f(xk)) > λ2( 2f(xk)), and λ1( 2f(Xkη)) > λ2( 2f(Xkη)). In the proof of our main theorem, we will utilize the following two auxiliary lemmas. Lemma B.1. (Lemma 1 (Li et al., 2017)) Let 0 < η < 1. Consider a stochastic process {Xt : t > 0} satisfying the SDE d Xt = b(Xt)dt + η 1 2 σ(Xt)d Wt with X0 = x Rd and b, σ together with their derivatives belong to G. Define the one-step difference = Xη x, then we have E i = biη + 1 j=1 bj ejbi η2 + O η3 i = 1, . . . , d; (9) E i j = h bibj + σσT (ij) i η2 + O η3 i, j = 1, . . . , d; (10) j=1 (ij) = O η3 s 3, ij = 1, . . . , d. (11) all the functions are evaluated at x. Lemma B.2. (Theorem 2 and Lemma 5 (Mil shtein, 1986) )Under Assumption B.1, we define the onestep difference for the stochastic process = Xη x. if in addition there exists K1, K2, K3, K4 G so that E i E i K1(x)η2, i = 1, . . . , d; (12) E i j E i j K2(x)η2, i, j = 1, . . . , d; (13) E K3(x)η2, s 3, ij {1, . . . , d}; (14) ij K4(x)η2, ij {1, . . . , d}. (15) Then, for each g G, there exists a constant C such that max k=0,1,...N |Eg(xk) Eg(Xkη)| Cη Next, we will prove a lemma that controls the moments of the discrete process for SAM: Lemma B.3. Under Assumption B.1, let 0 < η < 1. We define: ei f SAM(x) := eif(x)+ρE j 2 ei+ejfγ(x) ejfγ(x) jk 3 ei+ej+ekfγ(x) ejfγ(x) ekfγ(x) In addition, we define the one-step difference for the discrete-time algorithm as = x1 x. Then we have: 1. E i = ei f SAM(x)η + O ηρ3 , i = 1, . . . , d; (16) 2. E i j = ei f SAM(x) ej f SAM(x)η2 + ΣSAM (ij) η2 + O η2ρ3 , i, j = 1, . . . , d; (17) j=1 ij = O η3 , s 3, ij {1, . . . , d}. (18) All functions are evaluated at x. Proof. To evaluate E i = E h eifγ x + ρ fγ(x) fγ(x) i , we start by analyzing eifγ x + ρ fγ(x) fγ(x) , which is the partial derivative in the direction ei. Taking the Taylor expansion, we have: x + ρ fγ(x) fγ(x) = eifγ(x) + X |α|=1 2 ei+αfγ(x)ρ αfγ(x) |α|=2 3 ei+αfγ(x)ρ2 αfγ(x) + R eifγ(x) x,1 where the residual term is defined in Folland (2005). For some constant c (0, 1), it holds that R eifγ(x) x,1 4 ei+αfγ x + cρ fγ(x) fγ(x) ρ|α| fγ(x) fγ(x) α Now, we observe that Ki(x) := 4 ei+αfγ x + cρ fγ(x) fγ(x) ρ|α| fγ(x) fγ(x) α is a finite sum of products of functions that, by assumption, are in G. We can rewrite Equation (19) as x + ρ fγ(x) fγ(x) = eifγ(x) + X |α|=1 2 ei+αfγ(x)ρ αfγ(x) |α|=2 3 ei+αfγ(x)ρ2 αfγ(x) + ρ3Ki(x), (21) which implies that x + ρ fγ(x) fγ(x) = eif(x) + ρE j 2 ei+ejfγ(x) ejfγ(x) "P jk 3 ei+ej+ekfγ(x) ejfγ(x) ekfγ(x) + ρ3 Ki(x), (22) where Ki(x) = EKi(x). Therefore, we have i = 1, 2, . . . , d, E i = ei f SAM(x)η + O ηρ3 . To prove the second statement, by definition, we have Cov( i, j) = η2 Σ1,1 i,j (x) + ρ(Σ1,2 i,j (x) + Σ1,2 i,j (x) ) + ρ2(Σ2,2 i,j + 1 2(Σ1,3 i,j (x) + Σ1,3 i,j (x) )) + O(η2ρ3) (23) = η2ΣSAM(x) + O(η2ρ3). (24) E i j = E i E j + Cov( i, j) (25) = ei f SAM(x) ej f SAM(x)η2 + η2ΣSAM(x) + O(η2ρ3). (26) Finally, it is clear that j=1 ij = O (ηs) , s 3, ij {1, . . . , d} (27) = O η3 , s 3, ij {1, . . . , d}. (28) Now we are ready to state the theorem and prove it. Theorem B.4. (Stochastic modified equations) Under Assumption B.1, let 0 < η < 1, T > 0, N = T/η . Let xk Rd, 0 k N denote the sequence of SAM iterations defined by Equation 1. Define {Xt} as the stochastic process satisfying the SDE d Xt = f SAM(Xt)dt + η(ΣSAM(Xt)) 1 2 d Wt (29) where f SAM(Xt) := f(Xt) + ρE|| fγ(Xt)|| + ρ2 2 E f γ 2fγ(Xt) fγ ΣSAM(Xt) := Σ1,1(Xt)+ρ(Σ1,2(Xt)+Σ1,2(Xt) )+ρ2(Σ2,2(Xt)+ 1 2(Σ1,3(Xt)+Σ1,3(Xt) )) Σ1,1(Xt) := E h ( f(Xt) fγ(Xt)) ( f(Xt) fγ(Xt)) i Σ1,2(Xt) := E ( f(Xt) fγ(Xt)) E 2fγ(Xt) fγ Σ2,2(Xt) := E " E 2fγ(Xt) fγ E 2fγ(Xt) fγ Σ1,3(Xt) := E ( f(Xt) fγ(Xt)) " 3fγ(Xt)( fγ, fγ) " 3fγ(Xt)( fγ, fγ) Additionally, let us take ρ = O(η 1 3 ) Then, {Xt : t [0, T]} is an order-1 weak approximation of {xk : k 0}, i.e. for each g G, there exists a constant C independent of η such that max k=0,1,...N |Eg(xk) Eg(Xkη)| Cη Proof. We will check the conditions in Lemma B.2. As we apply Lemma B.1, we make the following choices: b(x) = f SAM(x) σ(x) = ΣSAM(x) 1 2 First, for i = 1, ..., d, we have E i Lemma B.1 = ei f SAM(x)η + O ηρ3 (30) E i Lemma B.3 = ei f SAM(x)η + O ηρ3 (31) Therefore, we have that for some K1(x) G E i E i K1(x)η2 (32) Second, for i, j = 1, ..., d, it holds that E i j Lemma B.1 = ei f SAM(x) ej f SAM(x)η2 + ΣSAM (ij) η2 + O η2ρ3 (33) E i j Lemma B.3 = ei f SAM(x) ej f SAM(x)η2 + ΣSAM (ij) η2 + O η2ρ3 (34) Consequently, we have for some K2(x) G E i j E i j K2(x)η2 (35) Third, for ij = 1, ..., d, we have j=1 (ij) Lemma B.1 = O η3 (36) j=1 (ij) Lemma B.3 = O η3 (37) Therefore, for some K3(x) G, we have E K3(x)η2, s 3 (38) Additionally, for some K4(x) G, ij = 1, ..., d, ij Lemma B.3 K4(x)η2 (39) By combining the four equations, Eq. 32, Eq. 35, Eq. 38, Eq. 39, we complete the proof. Proof of Corollary 4.1.1 1. Under the supposition, the first statement is immediately followed by using the fact: ρ2 2 E f γ 2fγ(Xt) fγ 2 E v1( 2fγ(Xt)) 2fγ(Xt)v1( 2fγ(Xt)) = O(ρ3) (40) 2. Under the supposition, the second statement is immediately followed by using the fact: ρ2 2 E f γ 2fγ(Xt) fγ 2 E v1( 2fγ(Xt)) 2fγ(Xt)v1( 2fγ(Xt)) = O(ρ4) (41) , and ρ E 2fγ(x) fγ ρ E s 2fγ(x) v1 2fγ(x) = O(ρ3), (42) ρ E 2fγ(x) fγ ρ E s λ1 2fγ(x) v1 2fγ(x) = O(ρ3) (43) C Proof of Theorem 3.1 Theorem C.1. (Generalization Bound) Assume that the loss function is bounded by L, and the third-order derivative of the loss function is bounded by C. Additionally, we assume f D(x) Eϵ N(0,σ2Id)f D(x + ϵ), similar to Foret et al. (2021). For any δ (0, 1) and σ > 0, with a probability over 1 δ over the choice of S Dn, we have f D (x) f S (x) + dσ2 2 λ1 2f S (x) + Cd3σ3 d log 1 + x 2 + O(1) + 2 log 1 δ + 4 log (n + d). Proof. We use the PAC-Bayes theory in this proof. In PAC-Bayes theory, x follows a distribution, denoted by P, and we express the expected loss over x as follows: f D(P) = Ex P f D(x) f S(P) = Ex P f S(x) For any distribution P = N(0, σ2 P Id) and Q = N(x, σ2Id) over x Rd, where P is the prior distribution and Q is the posterior distribution, use the general PAC-Bayes theorem in Alquier et al. (2016), for all β > 0, with a probability at least 1 δ, we have f D(Q) f S(Q) + 1 h KL(Q P) + log 1 δ + Ψ(β, n) i , (44) where Ψ is defined as Ψ(β, n) = log Ex P ES h exp β f D(x) f S(x) i . When the loss function is bounded by L, then by using the Hoeffding s inequality we have: Ψ(β, n) β2L2 The task is to minimize the second term of RHS of (44), we thus choose β = 8n(KL(Q P )+log 1 δ) L . Then the second term of RHS of (44) is equal to s KL(Q P) + log 1 The KL divergence between Q and P, when they are Gaussian, is given by formula KL(Q P) = 1 σ2 P d + d log σ2 P σ2 For given posterior distribution Q with fixed σ2, to minimize the KL term, the σ2 P should be equal to σ2 + x 2/d. In this case, the KL term is no less than d log 1 + x 2 Thus, the second term of RHS is s KL(Q P) + log 1 d log 1 + x 2 when x 2 2 > σ2 exp(4n/d) 1 . Hence, for any x 2 > σ2 exp(4n/d) 1 , we have the RHS is greater than the LHS, the inequality is trivial. In the remainder of the proof, we only consider the case: x 2 < σ2 exp(4n/d) 1 . (45) Distribution P is Gaussian centered around 0 with variance σ2 P = σ2 + x 2/d, which is unknown at the time we set up the inequality, since x is unknown. Meanwhile we have to specify P in advance, since P is the prior distribution. To deal with this problem, we apply the union bound technique (Dziugaite and Roy, 2017; Foret et al., 2021). We set c = σ2 1 + exp(4n/d) Pj = N 0, c exp 1 j P := Pj : j = 1, 2, . . . Then the following inequality holds for a particular distribution Pj with probability 1 δj with δj = 6δ π2j2 f D Q f S Q + 1 KL(Q Pj) + log 1 δj + Ψ(β, n) . Use the well-known equation: P j=1 1 j2 = π2 6 , then with probability 1 δ, the above inequality holds with every j. We pick j := 1 d log σ2 + x 2/d = 1 d log σ2 + x 2/d σ2(1 + exp{4n/d}) 1 j = d log σ2 + x 2/d log σ2 + x 2/d d log σ2 + x 2/d σ2 + x 2/d c exp 1 j exp(1/d) σ2 + x 2/d σ2 + x 2/d σ2 Pj exp(1/d) σ2 + x 2/d . Thus the KL term could be bounded as follow KL(Q Pj ) = 1 " dσ2 + x 2 σ2 Pj d + d log σ2 Pj σ2 " d(σ2 + x 2/d) σ2 + x 2/d d + d log exp(1/d) σ2 + x 2/d h d log exp(1/d) σ2 + x 2/d h 1 + d log 1 + x 2 For the term log 1 δj , with recall that c = σ2 1 + exp(4n/d) and j = j 1 d log σ2+ x 2/d σ2(1+exp{4n/d}) k , we have δj = log (j )2π2 + 2 log(j ) 6 + 2 log 1 + d log σ2 1 + exp(4n/d) 6 + 2 log 1 + d log 1 + exp(4n/d) 6 + 2 log 1 + d 1 + 4n 6 + log(1 + d + 4n). Hence, the inequality f D (Q) f S (Q) + KL(Q Pj ) + log 1 δj 2n L 1 + d log 1 + x 2 6δ + 4 log(n + d) d log 1 + x 2 dσ2 + O(1) + 2 log 1 δ + 4 log(n + d). Eϵ N(0,σ2Id) h f D x + ϵ i Eϵ N(0,σ2Id) h f S x + ϵ i d log 1 + x 2 dσ2 + O(1) + 2 log 1 δ + 4 log(n + d). Using the assumption that f D(x) Eϵ N(0,σ2Id) h f D x + ϵ i , we reach f D (x) Eϵ N(0,σ2Id) h f S x + ϵ i d log 1 + x 2 dσ2 + O(1) + 2 log 1 δ + 4 log(n + d). Using the second-order Taylor expansion for f S x + ϵ , we obtain f S x + ϵ = f S x + ϵT xf S x + 1 2ϵT 2f S (x) ϵ + 1 3f S (x + tϵ) xi1 xi2 xi3 ϵi1ϵi2ϵi3 f S x + ϵT xf S x + 1 2λ1 2f S (x) ϵ 2 2 + 1 3f S (x + tϵ) xi1 xi2 xi3 ϵi1ϵi2ϵi3, where t [0, 1]. Thanks to Eϵ N(0,σ2Id) ϵ2 Eϵ N(0,Id) ϵ2 = dσ2, we have Eϵ N(0,σ2Id) h f S x + ϵ i f S x + dσ2 2 λ1 2f S (x) 6 Eϵ1 N(0,σ2) [|ϵ1|] Eϵ2 N(0,σ2) [|ϵ2|] Eϵ3 N(0,σ2) [|ϵ3|] f S x + dσ2 2 λ1 2f S (x) + Cd3 Eϵ1 N(0,σ2) ϵ2 1 1/2 3 = f S x + dσ2 2 λ1 2f S (x) + Cd3σ3 By the assumption f D(x) Eϵ N(0,σ2Id)f D(x + ϵ), we reach f D (x) f S x + dσ2 2 λ1 2f S (x) + Cd3σ3 d log 1 + x 2 dσ2 + O(1) + 2 log 1 δ + 4 log(n + d). D Theoretical Properties of Eigen-SAM In this subsection, first we show that Eq. 8 can indeed improve the alignment for moderate α. Let ω := cos( fγ(x) fγ(x) , v), without loss of generalization, we suppose ω > 0. We only consider the case 2 2 , since in the case 0 < ω 2 2 , any α > 0 will enhance the alignment. Using fundamental mathematics to solve the inequality, we obtain Proposition D.1. Let ω > 2 2 , for any α 0, 2ω 1 ω2 2ω2 1 , we have cos( fγ(x) fγ(x) + αv , v) > ω. Proposition D.1 shows our update can indeed improve the alignment for a wide range of α. For example, if cos( fγ(x), v) = 0.8, then α can be any value in (0, 3.43); if cos( fγ(x), v) = 0.9, then α can be any value in (0, 1.27). Next, inspired by Si and Yun (2024), we have the following convergence rate for stochastic Eigen SAM on non-convex function: Theorem D.2. (Convergence rate) Consider a β-smooth function f satisfying f = infx f(x) > , let := f(x0) f , and assume the mini-batch variance is bounded by σ2. Under Eigen-SAM, starting at x0 with any perturbation size ρ > 0 and step size η = min{ 1 βσ2T } to minimize f, t=0 E fγ (xt) 2 O + β2(ρ2 + α2) Proof. By the definition of β-smoothness, we have Ef(xt+1) Ef(xt) ηE f(xt), f(xt + ρϵ) + βη2 2 E f(xt + ρϵ) 2 Ef(xt) ηE f(xt), f(xt + ρϵ) + βη2 E f(xt + ρϵ) 2 + E f(xt + ρϵ) f(xt) 2 Ef(xt) ηE f(xt), f(xt + ρϵ) + βη2 E f(xt + ρϵ) 2 + σ2 2E f(xt) 2 η 2E f(xt + ρϵ) 2 2E f(xt) f(xt + ρϵ) 2 + βη2 E f(xt + ρϵ) 2 + σ2 2E f(xt) 2 + β2η 2 E xt (xt + ρϵ) 2 + βσ2η2 2E f(xt) 2 + β2η(ρ2 + α2) Rearranging this inequality and deviding both sides by ηT 2 , we have t=0 E f(xt) 2 2 ηT (Ef(x0) Ef(x T )) + β2(ρ2 + α2) + 2βσ2η ηT + β2(ρ2 + α2) + 2βσ2η. By using η = min{ 1 βσ2T }, we complete the proof. E Additional Experimental Details For hyperparameter ρ, we follow the guidelines by Foret et al. (2021), setting ρ to 0.05 for CIFAR-10 and Fashion-MNIST, 0.01 for SVHN, and 0.1 for CIFAR-100. We ensure that SAM and Eigen-SAM use the same ρ for a fair comparison. Additionally, we tune the hyperparameter α for Eigen-SAM over {0.05, 0.1, 0.2} using 10% of the training set as a validation set. We find that α = 0.2 works the best for almost all cases. Therefore, we report the performance with α = 0.2 for all experiments to demonstrate that Eigen-SAM does not require extensive hyperparameter tuning. We run three independent repeat experiments with different weight initializations and data shuffling. Because SAM and Eigen-SAM require twice the runtime, we allow SGD to train for twice the number of epochs. All our experiments were conducted on NVIDIA RTX 4090 24GB GPUs. F Additional experiment results Table 3: Test accuracy on CIFAR-10 for different values of p (interval steps for estimating eigenvectors) using Eigen-SAM. Architechture p=100 p=200 p=500 p=1000 SAM Res Net18 95.9 0.2 95.9 0.1 95.7 0.2 95.8 0.1 95.5 0.1 Wide Res Net-28-10 96.8 0.1 96.7 0.1 96.8 0.1 96.7 0.1 96.5 0.1 Table 4: Test accuracy on CIFAR-100 for different values of p (interval steps for estimating eigenvectors) using Eigen-SAM. Architechture p=100 p=200 p=500 p=1000 SAM Res Net18 78.3 0.2 78.2 0.2 78.2 0.2 78.3 0.1 77.4 0.2 Wide Res Net-28-10 82.8 0.1 82.7 0.2 82.7 0.1 82.6 0.1 82.0 0.2 (a) Res Net18 (b) Wide Res Net-28-10 Figure 5: The effect of the number of Hessian-vector product steps in Algorithm 1 (power iteration) on the alignment of the estimated vector with the top eigenvalue. The dataset is CIFAR-100, and the models are Res Net18 and Wide Res Net-28-10 at mid-training stage (100th epoch). G Limitaions A limitation of this work is the additional memory and time required to estimate the top eigenvalue of the Hessian matrix. Improving the efficiency of Eigen-SAM is a direction for future research. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We accurately assess and describe our contributions. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss our limitation at the end of the appendix. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide all assumptions and our proof is complete. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide all details for experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes, our code is open source. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify them in the main paper. Additional details are provided in the appendix. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We give a confidence level of 95%. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide the information of the computational resource. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our research conduct with the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Our paper focuses on fundamental research in deep learning theory and algorithms. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: We only use public datasets. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We correctly cite authors for models and data. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: We communicate the details of the dataset/code/model. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: We do not contain human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: We do not contain human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.