# When Do Flat Minima Optimizers Work?

Jean Kaddour (Centre for Artificial Intelligence, University College London)
Linqing Liu (Centre for Artificial Intelligence, University College London)
Ricardo Silva (Department of Statistical Science, University College London)
Matt J. Kusner (Centre for Artificial Intelligence, University College London)

Equal contribution, correspondence to {jean.kaddour,linqing.liu}.20@ucl.ac.uk. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Recently, flat-minima optimizers, which seek parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.

1 Introduction

Stochastic gradient descent (SGD) methods are central to neural network optimization [6]. Recently, one class of algorithms has focused on biasing SGD methods towards so-called flat minima, which are located in large weight-space regions with very similar, low loss values [43]. Theoretical and empirical studies [21, 77, 9, 55, 49, 5, 12] postulate that such flatter regions generalize better than sharper minima, e.g., due to the flat minimizer's robustness against loss function shifts between train and test data, as illustrated in Fig. 1.

Two popular flat-minima optimization approaches are: 1. Stochastic Weight Averaging (SWA) [48], and 2. Sharpness-Aware Minimization (SAM) [22]. While both strategies aim to find flatter minima, they operate very differently. On the one hand, SWA is based on the intuition that, near a flat minimum, gradients are smaller, leaving many iterates in that flat region. Therefore, averaging iterates will produce a solution that is pulled towards these flatter regions; see Fig. 1, top. On the other hand, SAM minimizes the maximum loss around a neighborhood of the current iterate. This way, a region around the iterate is designed to have uniformly low loss; see Fig. 1, bottom. Crucially, SAM requires an additional forward/backward pass for each parameter update, making it more expensive than SWA.

Despite the successes [76, 3, 51, 12, 4] of SWA and SAM in some domains, we are unaware of a systematic comparison between them that would help practitioners to choose the right optimizer for their problem and researchers to develop better optimizers. The SWA [48] paper was published in 2018, and the SAM [22] paper in 2021; however, the SAM paper and its most noticeable follow-ups [65, 12, 103] do not compare against SWA. Further, there is very limited overlap in terms of the model architectures and datasets used in the experiments of both papers, which are likely further confounded by other differences in the training procedures (e.g., data augmentations, hyper-parameters, etc.).
Figure 1: The mechanics behind SWA and SAM, whose solutions are denoted by "+" and "⋆", respectively. SWA produces a solution θ that is pulled towards flatter regions, while SAM approximates the sharpness within the parameters' neighborhood (arrows).

Contributions

1. In-depth comparison of minima found by SWA and SAM: We visualize linear interpolations between different models and quantify the flatness of the minimizers. This analysis yields 4 insights, e.g., despite SAM finding flatter solutions than SWA as quantified by Hessian eigenvalues, they can be close to sharp directions, a phenomenon that has been overlooked in the previous SAM literature. Averaging SAM iterates leads to the flattest among all minima.

2. Rigorous comparison of SWA's and SAM's performance over 42 tasks: We empirically compare the optimizers with a rigorous model selection procedure on a broad range of tasks across different domains (CV, NLP, and GRL), model types (MLPs, CNNs, Transformers), and tasks (classification, self-supervised learning, open-domain question answering, natural language understanding, and node/graph/link property prediction). We discuss 9 findings, e.g., that both dataset and architecture impact their effectiveness, that for NLP tasks, SAM improves over SWA in most cases, and that the converse holds for GRL tasks. When flat-minima optimizers do not help, we notice clear discrepancies between the shapes of loss and accuracy curves. To assist future work, we open-source the code for all pipelines and hyper-parameters to reproduce the results.

2 Background and Related Work

2.1 Stochastic Gradient Descent (SGD)

The classic optimization framework of machine learning is empirical risk minimization,

$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell(x_i; \theta),$  (1)

where θ ∈ R^d is a vector of parameters, {x_1, . . . , x_N} is a training set of inputs x_n ∈ R^D, and ℓ(x; θ) is a loss function quantifying the performance of parameters θ on x. SGD samples a mini-batch B ⊆ {1, . . . , N} of size |B| ≪ N from the training set and updates the parameters through

$\theta^{\mathrm{SGD}}_{t+1} = \theta_t - \eta \, g(\theta_t), \quad \text{where} \quad g(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla \ell(\theta; x_i),$  (2)

for a step length specified by η, the learning rate.

2.2 Stochastic Weight Averaging (SWA)

The idea of averaging weights dates back to accelerating the convergence speed of SGD [78, 51]. SWA's motivation is based on the following observation about SGD's behavior when training neural networks: it often traverses regions of the weight space that correspond to high-performing models but rarely reaches the central points of this optimal set. Averaging the parameter values over iterations moves the solution closer to the centroid of this space of points. The SWA update rule is the cumulative moving average

$\theta^{\mathrm{SWA}}_{t+1} \leftarrow \frac{\theta^{\mathrm{SWA}}_{t} \cdot l + \theta^{\mathrm{SGD}}_{t}}{l + 1},$  (3)

where l is the number of distinct parameters averaged so far and t is the SGD iteration number (SWA parameters are constant between averaging steps).

Algorithm 1: Stochastic Weight Averaging [48]
Input: loss function L, training budget of b iterations, training dataset D := {x_i}_{i=1}^n, mini-batch size |B|, averaging start epoch E, averaging frequency ν, (scheduled) learning rate η, initial weights θ_0.
  for t = 1, . . . , b do
    Sample a mini-batch B from D
    Compute gradient g ← ∇L(θ_t)
    Update parameters θ_{t+1} ← θ_t − ηg
    if t ≥ E and mod(t, ν) = 0 then
      θ^SWA ← (θ^SWA · l + θ_{t+1}) / (l + 1)
    end if
  end for
  return θ^SWA

Algorithm 2: Sharpness-Aware Minimization [22]
Input: loss function L, training budget of b iterations, training dataset D := {x_i}_{i=1}^n, mini-batch size |B|, neighborhood radius ρ, (scheduled) learning rate η, initial weights θ_0.
  for t = 1, . . . , b do
    Sample a mini-batch B from D
    Compute worst-case perturbation ε̂ ← ρ ∇L(θ_t) / ‖∇L(θ_t)‖_2
    Compute gradient g ← ∇L(θ_t + ε̂)
    Update parameters θ_{t+1} ← θ_t − ηg
  end for
  return θ^SAM
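To make Eq. (3) and Algorithm 1 concrete, below is a minimal PyTorch-style sketch of a training loop that maintains a running average of the weights. It is an illustration under simplifying assumptions (constant learning rate, averaging once per epoch); `model`, `loader`, `loss_fn`, and `swa_start` are placeholder names rather than the interface of our released code.

```python
import copy
import torch

def train_with_swa(model, loader, loss_fn, epochs, swa_start, lr=0.1):
    """Plain SGD training plus a cumulative moving average of the weights (Eq. 3)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    swa_model, n_avg = None, 0

    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        # Average once per epoch (frequency nu = 1 epoch), starting at `swa_start`.
        if epoch >= swa_start:
            if swa_model is None:
                swa_model = copy.deepcopy(model)
            else:
                for p_swa, p in zip(swa_model.parameters(), model.parameters()):
                    # theta_swa <- (theta_swa * l + theta) / (l + 1)
                    p_swa.data.mul_(n_avg).add_(p.data).div_(n_avg + 1)
            n_avg += 1

    # BatchNorm running statistics of swa_model should be recomputed before evaluation.
    return swa_model
```

Recent PyTorch releases ship similar functionality in `torch.optim.swa_utils` (e.g., `AveragedModel` and `update_bn` for recomputing BatchNorm statistics).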
SWA has two hyper-parameters: the update frequency ν and the starting epoch E. When using a constant learning rate, Izmailov et al. [48] suggest updating the parameters once after each epoch, i.e., ν = N/|B|, and starting at E ≈ 0.75T, where T is the training budget required to train the model until convergence with conventional SGD training. He et al. [39] argue that SWA may always improve generalization, regardless of the loss function's geometry. Kaddour [51] shows that averaging a specific range of weights can speed up training convergence. Cha et al. [8] argue that tuning ν and E carefully is necessary to make SWA work effectively in domain generalization (DG) tasks. Beyond DG tasks, a list of tuned hyper-parameters based on a fair model selection procedure across different architectures and tasks has been missing in the literature. To the best of our knowledge, Cha et al. [8] is the only study that compares SWA and SAM over the same experiments, but it focuses on domain generalization tasks, which we therefore leave out of this work.

2.3 Sharpness-Aware Minimization (SAM)

While SWA is implicitly biased towards flat minima, SAM explicitly approximates the flatness around the parameters θ to guide the parameter update. It first computes the worst-case perturbation ε that maximizes the loss within a given neighborhood of radius ρ, then minimizes the loss w.r.t. the perturbed weights θ + ε. Formally, SAM finds θ by solving the minimax problem

$\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} L(\theta + \epsilon),$  (4)

where ρ ≥ 0 is a hyper-parameter. To find the worst-case perturbation ε efficiently in practice, Foret et al. [22] approximate Eq. (4) via a first-order Taylor expansion of L(θ + ε) w.r.t. ε around 0, obtaining

$\epsilon^{*} \approx \arg\max_{\|\epsilon\|_2 \le \rho} \; \epsilon^{\top} \nabla_{\theta} L(\theta) \;=\; \rho \, \frac{\nabla_{\theta} L(\theta)}{\|\nabla_{\theta} L(\theta)\|_2} \;=:\; \hat{\epsilon}.$  (5)

In words, ε̂ is simply the scaled gradient of the loss function w.r.t. the current parameters θ. Given ε̂, the altered gradient used to update the current θ_t (in place of g(θ_t)) is

$\nabla_{\theta} \max_{\|\epsilon\|_2 \le \rho} L(\theta + \epsilon) \;\approx\; \nabla_{\theta} L(\theta)\big|_{\theta + \hat{\epsilon}}.$

Due to Eq. (5), SAM's computational overhead consists of an additional forward and backward pass per parameter update step compared to SWA and non-flat optimizers.

SAM's performance strongly depends on the neighborhood radius ρ. For example, Chen et al. [12] and Wu et al. [93] show that ρ should be set to values outside the ranges originally considered by Foret et al. [22]. Analogously to Sec. 2.2, this lack of coherence among hyper-parameter tuning protocols in the SAM literature makes it tricky to determine SAM's comparative effectiveness.
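To illustrate Eqs. (4)–(5), the following PyTorch-style sketch performs one SAM update with a generic base optimizer. It is a simplified illustration rather than the official SAM implementation; `model`, `loss_fn`, `batch`, and `base_opt` are placeholder names, and details such as gradient clipping or weight-decay handling are omitted.

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One SAM update: perturb the weights by eps_hat (Eq. 5), then take the descent
    step using the gradient at the perturbed point (two forward/backward passes)."""
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    # First forward/backward pass: gradient at the current weights theta.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12

    # Climb to the approximate worst case theta + eps_hat = theta + rho * g / ||g||_2.
    eps = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # Second forward/backward pass: gradient at theta + eps_hat.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()

    # Undo the perturbation and apply the update with the base optimizer (e.g., SGD-M).
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_opt.step()
```

WASAM (Sec. 3.2) then simply applies the running average of Algorithm 1 to the iterates produced by this update.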
2.4 Other Flat-Minima Optimizers

There are several extensions of SWA [36, 8] and SAM [65, 103, 101]. For simplicity, we do not consider them in this work. Besides SWA and SAM, other flat-minima optimizers include, e.g., [9, 84]. However, due to their computational cost and/or lack of performance gains, we do not include them in this work either. Chaudhari et al. [9] requires 5–20 forward and backward passes per parameter update. Sankar et al. [84] similarly requires 5–10 forward and backward passes to estimate the Hessian trace, and 6 of 7 experiments yield a minimal improvement of 0.27% (see Table 1 in Sankar et al. [84]). In contrast, SWA and SAM have been shown to increase performance by multiple percentage points in some cases [8, 12] while requiring fewer computational resources.

3 How do minima found by SWA and SAM differ?

In this section, we investigate SWA and SAM solutions in two prototypical deep learning tasks where these optimizers improve over the baseline. Our goal is to better understand their geometric properties (rather than their generalization performance, which is the focus of Sec. 4). First, we investigate the behavior of the loss landscape along the line between non-flat and flat solutions (Sec. 3.1). Previous studies successfully used such linear interpolations to gain novel insights, e.g., for training dynamics [32, 25], regularization [69, 28], and network pruning [26]. Second, motivated by findings in Sec. 3.1, we average SAM iterates and visualize interpolations between averaged and non-averaged solutions (Sec. 3.2). Interestingly, the averaged SAM solution is less susceptible to asymmetric directions. Third, we compare quantitative measurements of the flatness of all solutions (Sec. 3.3). Here, we compute dominant Hessian eigenvalues, as commonly used in the flat-minima literature [9, 98, 12, 22]. Lastly, in Appendix A.1, we further compute CKA [61] and cosine similarities between SWA's and SAM's network output logits.

We choose the following two disparate learning settings: (i) a well-known image classification task, widely used for evaluation in flat-minima optimizer papers, and (ii) a novel, challenging Python code summarization task, on which state-of-the-art models achieve only around 16% F1 score on the test set (which is higher than the accuracy commonly achieved on the more challenging training set), and which has not been explored yet in the flat-minima literature. Specifically, for (i), we investigate the loss/accuracy surfaces of a WideResNet-28-10 [99] model on CIFAR-100 [63] (baseline non-flat optimizer: SGD with momentum (SGD-M) [83]). For (ii), we use the theoretically-grounded Graph Isomorphism Network (GIN) [95] model on OGB-Code2 [45] (baseline optimizer: Adam [56]). All optimizers start from the same initialization. We denote the minimizer produced by the non-flat methods (SGD-M and Adam) by θ^NF and the flat ones by θ^SWA and θ^SAM.

3.1 What is between non-flat and flat solutions?

We start by comparing the similarity of flat and non-flat minimizers through linear interpolations. This analysis allows us to understand whether they are in the same basin and how close they are to a region of sharply increasing loss, where we expect loss/accuracy to differ widely between train and test. Further, for each of our four observations, we recommend a future work direction.

To linearly interpolate between two sets of parameters θ and θ′, we parameterize the line connecting the two by choosing a scalar parameter α and defining the weighted average θ(α) = (1 − α)θ + αθ′. If there exists no high-loss barrier between two networks θ, θ′ along the linear interpolation, we say that they are located in the same basin, i.e., {θ, θ′} ⊆ Ω [75, 102]. A basin is an area in the parameter space where the loss function has relatively low values. Due to NN non-linearities, the linear combination of the weights of two accurate models does not necessarily define an accurate model. Hence, we generally expect high-loss barriers along the linear interpolation path.
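The interpolation θ(α) = (1 − α)θ + αθ′ can be evaluated with a few lines of PyTorch; the sketch below assumes two state dicts with identical keys and generic `model`, `loader`, and `loss_fn` placeholders, rather than our exact evaluation code.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Weighted average (1 - alpha) * theta_a + alpha * theta_b of two state dicts."""
    out = {}
    for k in sd_a:
        if torch.is_floating_point(sd_a[k]):
            out[k] = (1.0 - alpha) * sd_a[k] + alpha * sd_b[k]
        else:  # e.g. integer buffers such as BatchNorm's num_batches_tracked
            out[k] = sd_a[k]
    return out

@torch.no_grad()
def loss_along_path(model, sd_a, sd_b, loader, loss_fn, alphas):
    """Evaluate the loss of theta(alpha) for each alpha, e.g. alphas in [-1, 1.5]."""
    losses = []
    for alpha in alphas:
        model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
        model.eval()  # note: BatchNorm running stats are not re-estimated here
        total, n = 0.0, 0
        for x, y in loader:
            total += loss_fn(model(x), y).item() * len(y)
            n += len(y)
        losses.append(total / n)
    return losses
```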
While there are alternative distance measures that could be used to compare two networks, they typically either (a) do not offer clear interpretations, as pointed out by Frankle et al. [26], or (b) yield trivial network connectivity results, such as non-linear low-loss paths, which can be found for any two network minimizers [20, 27, 33, 23].

Figure 2: Training (blue) and test (red) losses and accuracies/F1 scores of linear interpolations θ(α) = (1 − α)θ + αθ′ (for α ∈ [−1, 1.5]) between the SWA (+) and SAM solutions (α = 0.0) and the non-flat baseline solutions (α = 1.0). Panels (a)–(d): WRN on CIFAR-100; panels (e)–(h): GIN on Code2.

Obs. 1: {θ^SWA, θ^NF} ⊆ Ω^NF. θ^SWA and θ^NF are in the same basin, as can be seen in Figures 2a and 2e. Additionally, θ^NF is near the periphery of a sharp increase in loss, as can be seen when moving in the direction from θ^SWA to θ^NF (i.e., α > 1). Conversely, θ^SWA finds flat regions in which the loss changes slowly. This bias of SWA towards flatter loss beneficially transfers to the accuracy landscape too: Figures 2b and 2f show the accuracy/F1 score rapidly dropping off approaching and beyond θ^NF. Interestingly, in Figures 2e and 2f, we see that for Code2, for α < 0, there exist solutions with even better training loss/accuracy but worse test loss/accuracy. However, θ^SWA_GIN is close to the test accuracy maximizer along this interpolation. Future work may inspect why the cross-entropy loss function used for GIN/Code2 seems less well correlated with its accuracy compared to WRN/CIFAR-100.

Obs. 2: θ^SAM ∈ Ω^SAM ≠ Ω^NF. θ^SAM and θ^NF are not in the same basin: Figures 2c and 2g show that there is a high loss barrier between them, respectively. Figures 2d and 2h show that θ^SAM and even nearby points in parameter space achieve higher accuracies/F1 scores (i.e., generalize better) than θ^NF and points around it. This is an interesting result because we expect different basins to produce qualitatively different predictions, one of the motivations behind combining models, even if they exhibit different performances [46, 67]. Grewal & Bui [34] successfully combine models yielded by different optimizers, and we think future work should study ensembling SAM and non-SAM solutions.

Obs. 3: SAM finds a saddle point. Figure 2g shows θ^SAM_GIN being located at a sharp training loss minimum whose training loss is much higher than that of θ^NF. Yet, its test loss is only slightly higher, and its F1 score is better. We visualize 2D plots moving along random directions (not shown here due to space) to confirm that θ^SAM_GIN is a saddle point (Appendix A.2). A common pathology among curvature-based methods is that they attract saddle points [16]. Since SAM takes some form of curvature into account too, we believe that future work should investigate SAM's propensity to find saddle points and potential remedies.

Obs. 4: θ^SAM is closer to sharper directions than θ^SWA, as can be seen by L_tr/te(θ^SAM(0.1)) ≈ 2 · L_tr/te(θ^SAM(−0.1)), while L_tr/te(θ^SWA(0.1)) ≈ L_tr/te(θ^SWA(−0.1)), where L_tr/te(·) refers to both the training and test loss functions. A possible explanation for SAM being closer to sharp sides is that while it finds different basins than SGD/SWA by smoothing the loss surface (as illustrated in Fig. 1), within a local basin, it may oscillate around the minimizer similarly to SGD.
One cause for this can be that Ω^SAM's hypersphere is larger than SAM's radius ρ. If that holds, then given a small enough learning rate, we expect SAM to oscillate around θ ∈ Ω^SAM (the smaller the learning rate, the less likely it escapes the basin due to that stochasticity). Two possible remedies are: (1) adapt/schedule ρ, or (2) average SAM iterates to bias the solution towards the flatter side. (1) has been explored by [103, 101]. We try (2) in the next subsection. Future work may study SAM's basin escape time, e.g., using convolutions [58] or stochastic differential equations [102].

3.2 What happens if we average SAM iterates?

Based on Observation 4 ("θ^SAM is closer to sharper directions than θ^SWA"), averaging SAM iterates may further improve generalization; we refer to this as Weight-Averaged Sharpness-Aware Minimization (WASAM). The reason is that while SAM finds better-performing basins, within the basin, its final iterate may still be near a side that increases sharply in the loss.

Figure 3: Training (blue) / test (red) losses / accuracies of linear interpolations (for α ∈ [−1, 2]) between the non-flat baseline, SWA (+), SAM, and WASAM solutions. Panels (a), (b): WRN on CIFAR-100 (loss, accuracy); panels (c), (d): GIN on Code2 (loss, F1).

Starting with the first of the two previously analyzed settings (WRN/CIFAR-100), Figures 3a and 3b show that θ^SWA+SAM_WRN achieves the lowest test loss and highest test accuracy, respectively. What stands out in comparison to the previous plots is θ^SAM_WRN's proximity to sharp sides, surprisingly similar to θ^NF_WRN here and in Figures 2c and 2e. As we hoped, θ^SWA+SAM_WRN is indeed closer to a flatter region, as can be seen by L_tr/te(θ^SWA+SAM_WRN(−0.2)) ≈ L_tr/te(θ^SWA+SAM_WRN(0.2)).

In GIN/OGB-Code2, one unanticipated finding is that θ^SWA+SAM_GIN escapes the (previously discussed) saddle point of θ^SAM_GIN, which appears here as a maximum in Figure 3c. A likely reason is that SAM traversed nearby flatter regions before arriving at the saddle point, especially if it is a non-strict saddle. In terms of F1 score, Figure 3d shows that while θ^SWA_GIN and θ^SAM_GIN perform about equally well, the flatter region found by θ^SWA+SAM_GIN improves over both.

3.3 How flat are the found minima?

We now quantify the flatness of all four optimizers' solutions over both tasks by computing the median of the dominant Hessian eigenvalue across all training set batches using the power iteration algorithm [74, 98]. This metric measures the worst-case loss landscape curvature. We choose this metric as it is very commonly used in the minima flatness literature, e.g., [9, 12, 22, 97, 18, 62, 85]. Table 1 shows that SAM leads to flatter minima than SWA in both cases. Interestingly, λ_max(θ^NF_WRN) ≈ 2.5 · λ_max({θ^SWA, θ^SAM}), while λ_max(θ^NF_WRN) ≈ 5.75 · λ_max(θ^SWA+SAM_WRN), indicating room for improvement in terms of flatness for both SWA and SAM. The relative differences are less dramatic for GIN/Code2, although surprisingly λ_max(θ^NF_GIN) < λ_max(θ^SWA_GIN). In sum, averaging SAM iterates leads to the flattest and best-performing minima in both cases (see Sec. 4).

Table 1: Median λ_max of the Hessian over all training set batches.

| Task | Baseline | SWA | SAM | WASAM |
|---|---|---|---|---|
| WRN on CIFAR-100 | 673 | 265 | 237 | 117 |
| GIN on Code2 | 16.65 | 16.79 | 11.31 | 9.96 |
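For completeness, the dominant Hessian eigenvalue reported in Table 1 can be estimated by power iteration with Hessian-vector products (double backward). The sketch below is a single-batch illustration with placeholder names (`model`, `loss_fn`, `batch`); the reported numbers take the median over all training batches, and a library such as PyHessian [98] provides a ready-made implementation.

```python
import torch

def dominant_hessian_eigenvalue(model, loss_fn, batch, iters=20, tol=1e-4):
    """Estimate lambda_max of the loss Hessian w.r.t. the parameters on one batch
    via power iteration with Hessian-vector products."""
    x, y = batch
    params = [p for p in model.parameters() if p.requires_grad]

    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit start vector with the same shapes as the parameters.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / v_norm for vi in v]

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product: Hv = d/dtheta (g . v)
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        new_eigenvalue = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()
        hv_norm = torch.sqrt(sum((hvi ** 2).sum() for hvi in hv))
        v = [hvi / (hv_norm + 1e-12) for hvi in hv]
        if abs(new_eigenvalue - eigenvalue) / (abs(new_eigenvalue) + 1e-12) < tol:
            eigenvalue = new_eigenvalue
            break
        eigenvalue = new_eigenvalue
    return eigenvalue
```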
4 How do SWA and SAM perform on a broad set of experiments?

As we point out in the introduction, there is almost no overlap or consistency regarding reported SWA and SAM results in the literature. This section addresses this gap. For example, Bahri et al. [4] and Chen et al. [12] illustrate that the flat minima found by SAM improve generalization on Transformer [90] architectures compared to non-flat optimizers, but they do not compare against SWA. Hence, it is unclear if the computationally cheaper SWA may provide better or similar performance.

We compare the flat minimizers SWA, SAM, and averaged SAM iterates (WASAM) against the non-flat minimizers across a range of different tasks in the domains of computer vision, natural language processing, and graph representation learning. We average all runs over at least three random seeds (more often for experiments with higher variability, see details in Appendix B), and we report the corresponding standard error. We bold the best-performing approach and any approach whose average performance plus standard error overlaps it.

Table 2: CV test results: Supervised Classification (SC) and Self-Supervised Learning (SSL) tasks.

| Task | Model | Baseline | SWA | SAM | WASAM |
|---|---|---|---|---|---|
| SC: CIFAR-10 | WRN-28-10 | 96.78 ± 0.03 | −0.05 ± 0.04 | +0.34 ± 0.09 | +0.25 ± 0.05 |
| | PN-272 | 96.73 ± 0.14 | +0.22 ± 0.14 | +0.42 ± 0.06 | +0.41 ± 0.02 |
| | ViT-B-16 | 98.95 ± 0.02 | −0.04 ± 0.04 | +0.07 ± 0.01 | +0.10 ± 0.01 |
| | Mixer-B-16 | 96.65 ± 0.03 | +0.02 ± 0.03 | +0.19 ± 0.05 | +0.22 ± 0.06 |
| SC: CIFAR-100 | WRN-28-10 | 80.93 ± 0.19 | +1.62 ± 0.06 | +1.82 ± 0.14 | +2.24 ± 0.14 |
| | PN-272 | 80.86 ± 0.12 | +1.88 ± 0.04 | +2.33 ± 0.08 | +2.60 ± 0.09 |
| | ViT-B-16 | 92.77 ± 0.07 | −0.12 ± 0.05 | +0.19 ± 0.09 | +0.13 ± 0.07 |
| | Mixer-B-16 | 83.77 ± 0.08 | +0.45 ± 0.06 | +0.52 ± 0.15 | +0.97 ± 0.12 |
| SSL: CIFAR-10 | MoCo | 89.25 ± 0.07 | −0.03 ± 0.10 | −0.25 ± 0.06 | −0.17 ± 0.10 |
| | SimCLR | 88.66 ± 0.08 | −0.05 ± 0.06 | +0.05 ± 0.04 | −0.13 ± 0.06 |
| | SimSiam | 89.86 ± 0.22 | +0.12 ± 0.26 | +0.07 ± 0.10 | +0.11 ± 0.10 |
| | Barlow Twins | 86.34 ± 0.24 | −0.09 ± 0.19 | +0.09 ± 0.15 | +0.14 ± 0.05 |
| | BYOL | 90.32 ± 0.14 | +0.70 ± 0.05 | +0.14 ± 0.03 | +0.21 ± 0.07 |
| | SwAV | 87.28 ± 0.05 | +0.09 ± 0.06 | +0.07 ± 0.12 | +0.02 ± 0.06 |
| SSL: Imagenette | MoCo | 81.74 ± 0.18 | +0.97 ± 0.10 | +0.91 ± 0.32 | +1.40 ± 0.10 |
| | SimCLR | 83.28 ± 0.22 | +0.95 ± 0.25 | +0.18 ± 0.24 | +1.07 ± 0.13 |
| | SimSiam | 81.77 ± 0.14 | +0.20 ± 0.37 | +0.33 ± 0.28 | +0.18 ± 0.26 |
| | Barlow Twins | 77.49 ± 0.36 | +0.20 ± 0.16 | +0.47 ± 0.27 | +0.66 ± 0.57 |
| | BYOL | 84.16 ± 0.14 | +0.76 ± 0.08 | +0.15 ± 0.25 | +0.31 ± 0.19 |
| | SwAV | 88.16 ± 0.31 | +1.04 ± 0.27 | +0.03 ± 0.10 | +1.03 ± 0.09 |

Hyper-parameters. For all architectures and datasets, we set hyper-parameters shared by all methods (e.g., learning rate) mostly to values cited in prior work.³ As explained in Secs. 2.2 and 2.3, the effectiveness of flat-minima optimizers is highly sensitive to their additional hyper-parameters. We select these hyper-parameters using a grid search over a held-out validation set. Specifically, for SWA we follow Izmailov et al. [48], hold the update frequency ν constant at once per epoch, and tune the start time E ∈ {0.5T, 0.6T, 0.75T, 0.9T} (T is the number of baseline training epochs). Izmailov et al. [48] argue that a cyclical learning rate starting from E helps to encourage exploration of the basin. For the sake of simplicity, we average the iterates of the baseline directly but include even earlier starting times (i.e., 0.5T, 0.6T). For SAM, we tune its neighborhood size ρ ∈ {0.01, 0.02, 0.05, 0.1, 0.2}, as in previous work [22, 4]. Appendix B contains the values of all hyper-parameters and additional training details (including public model checkpoints, hardware infrastructure, software libraries, etc.) to ensure full reproducibility alongside open-sourcing our code.
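The model selection described above reduces to a small grid search per task; the sketch below spells it out, where `train_and_validate` is a hypothetical helper that trains one configuration and returns its held-out validation score.

```python
# Grids as described above: SWA start epoch E and SAM neighborhood radius rho.
SWA_START_FRACTIONS = [0.5, 0.6, 0.75, 0.9]   # E in {0.5T, 0.6T, 0.75T, 0.9T}
SAM_RHOS = [0.01, 0.02, 0.05, 0.1, 0.2]       # rho grid from [22, 4]

def select_flat_minima_hyperparameters(train_and_validate, total_epochs):
    """Pick the best SWA start epoch and SAM radius by validation score."""
    best_swa = max((train_and_validate(swa_start=int(f * total_epochs)), f)
                   for f in SWA_START_FRACTIONS)
    best_sam = max((train_and_validate(sam_rho=rho), rho) for rho in SAM_RHOS)
    # Each entry is a (validation score, chosen hyper-parameter) pair.
    return {"swa": best_swa, "sam": best_sam}
```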
4.1 Computer Vision

Supervised Classification (SC). We evaluate the CNN architectures WideResNet [99] with 28 layers and width 10, and PyramidNet (PN) with 110 layers and widening factor 272 [38], as well as the Vision Transformer (ViT) [19] and MLP-Mixer [87], on CIFAR-{10, 100} [63]. All experiments use basic data augmentations: horizontal flip, padding by four pixels, random crop, and cutout [17].

Self-Supervised Learning (SSL). We consider the following methods on CIFAR-10 and Imagenette⁴: Momentum Contrast (MoCo) [41], a Simple framework for Contrastive Learning (SimCLR) [10], Simple Siamese representation learning (SimSiam) [11], Barlow Twins [100], Bootstrap Your Own Latent (BYOL) [35], and Swapping Assignments between multiple Views of the same image (SwAV) [7]. All SSL methods use a ResNet-18 [40] as the backbone network. To test the frozen representations, we use k-nearest-neighbor classification with a memory bank [94]. We choose k = 200 and temperature τ = 0.1 to reweight similarities. Compared to learning a linear model on top of the representations, this evaluation procedure is more robust to hyper-parameter changes [59].

³ Sometimes with minor modifications, e.g., adjusting per-device batch sizes to be compatible with our GPU infrastructure.
⁴ https://github.com/fastai/imagenette

Figure 4: (a) NLP test results: Open-Domain Question Answering and Natural Language Understanding (GLUE), including paraphrase, sentiment analysis, and textual entailment. (b) GRL test results: Node Property Prediction (NPP), Graph Property Prediction (GPP), Link Property Prediction (LPP).

(a)

| Task | Model | Baseline | SWA | SAM | WASAM |
|---|---|---|---|---|---|
| NQ | FiD | 49.35 ± 0.44 | −0.20 ± 0.33 | +0.33 ± 0.19 | +0.48 ± 0.21 |
| TriviaQA | FiD | 67.74 ± 0.29 | +0.40 ± 0.24 | +0.89 ± 0.03 | +0.92 ± 0.10 |
| CoLA | RoBERTa | 60.41 ± 0.22 | +0.09 ± 0.08 | +1.57 ± 1.20 | +1.41 ± 1.14 |
| SST | RoBERTa | 94.95 ± 0.13 | −0.30 ± 0.27 | −0.23 ± 0.40 | +0.19 ± 0.14 |
| MRPC | RoBERTa | 89.14 ± 0.57 | +0.08 ± 0.49 | +0.73 ± 0.43 | +0.81 ± 0.38 |
| STS-B | RoBERTa | 90.40 ± 0.02 | +0.00 ± 0.05 | +0.38 ± 0.17 | +0.35 ± 0.16 |
| QQP | RoBERTa | 91.36 ± 0.07 | +0.01 ± 0.06 | +0.08 ± 0.07 | +0.06 ± 0.08 |
| MNLI | RoBERTa | 87.41 ± 0.09 | +0.08 ± 0.11 | +0.39 ± 0.02 | +0.35 ± 0.03 |
| QNLI | RoBERTa | 92.96 ± 0.06 | −0.08 ± 0.11 | +0.09 ± 0.01 | +0.11 ± 0.06 |
| RTE | RoBERTa | 80.09 ± 0.23 | −0.23 ± 0.20 | +0.70 ± 0.65 | −0.46 ± 0.12 |

(b)

| Task | Model | Baseline | SWA | SAM | WASAM |
|---|---|---|---|---|---|
| NPP: Proteins | SAGE | 77.79 ± 0.18 | −0.17 ± 0.22 | −0.02 ± 0.13 | −0.11 ± 0.15 |
| | DGCN | 85.42 ± 0.17 | +0.11 ± 0.08 | −0.14 ± 0.05 | −0.08 ± 0.07 |
| NPP: Products | SAGE | 78.92 ± 0.08 | +0.39 ± 0.10 | +0.13 ± 0.08 | +0.57 ± 0.03 |
| | DGCN | 73.88 ± 0.13 | +0.44 ± 0.14 | +0.08 ± 0.09 | +0.53 ± 0.05 |
| GPP: Code2 | GCN | 16.04 ± 0.09 | +0.73 ± 0.11 | +0.36 ± 0.08 | +0.93 ± 0.15 |
| | GIN | 15.73 ± 0.11 | +0.83 ± 0.11 | +0.57 ± 0.09 | +1.10 ± 0.09 |
| GPP: Molpcba | GIN | 28.10 ± 0.11 | +0.40 ± 0.18 | −0.33 ± 0.14 | +0.33 ± 0.16 |
| | DGCN | 25.65 ± 0.13 | +1.90 ± 0.20 | −0.13 ± 0.18 | +1.34 ± 0.12 |
| LPP: Biokg | CP | 84.06 ± 0.00 | +0.07 ± 0.01 | −0.00 ± 0.03 | +0.08 ± 0.02 |
| | ComplEx | 84.94 ± 0.01 | +0.14 ± 0.01 | −0.02 ± 0.01 | +0.12 ± 0.02 |
| LPP: Citation2 | GCN | 79.52 ± 0.41 | −0.05 ± 0.52 | +1.32 ± 0.06 | +1.50 ± 0.13 |
| | SAGE | 81.95 ± 0.02 | +1.15 ± 0.02 | −0.31 ± 0.07 | +0.86 ± 0.04 |

4.2 Natural Language Processing

We consider the task of open-domain question answering (ODQA) using the T5-based Fusion-in-Decoder (FiD) model [47]. We evaluate FiD-base on the test sets of Natural Questions (NQ) [64] and TriviaQA [50]. We also consider the natural language understanding tasks included in the GLUE benchmark [91], which cover acceptability, sentiment, paraphrase, similarity, and inference. We fine-tune RoBERTa-base [72] for each task individually and report the results on the GLUE dev set.

4.3 Graph Representation Learning

We use a subset of the Open Graph Benchmark (OGB) datasets [45].
The tasks are node property prediction (NPP), graph property prediction (GPP), and link property prediction (LPP). For each task, we use two of the following GNN architectures and matrix factorization methods: GCN [57], DeeperGCN (DGCN) [68], SAGE [37], GIN [95], ComplEx [89], and CP [66]. We use popular training schemes, such as virtual nodes, cluster sampling [14], or relation prediction as an auxiliary training objective [13]. The reported metrics are ROC-AUC for Proteins, accuracy for Products, F1 score for Code2, average precision for Molpcba, and mean reciprocal rank for Biokg/Citation2.

4.4 9 Findings

We use G(·) to denote the generalization accuracy/F1/ROC-AUC/AP/MRR of each optimizer in {non-flat baseline (NF), SWA, SAM, WASAM}.

1. Datasets matter. For example, for node property prediction (Proteins), we see that no flat optimizer improves over the baseline optimizer; however, for Products, flat-minima optimizers on the same architectures significantly improve over the baseline. We further explore the impact of different data augmentation strategies in Appendix A.7.

2. Architectures matter. E.g., there is a vast difference across model architectures for link property prediction on the Citation2 dataset: using a GNN with GCN layers, SAM achieves a statistically significant boost of >1.30%, and SWA slightly hurts the performance. When we replace the GCN layers with SAGE layers (and fix everything else), we see a boost of >1.15% for SWA, while SAM hurts the performance by 0.31%.

3. SWA underperforms on NLP tasks. SAM achieves the best performance in 7/10 experiments on NLP tasks, consistent with the findings of Bahri et al. [4], which show that SAM can boost performance across a wide range of NLP tasks. However, SWA never performs best, only improving the results in 1/10 cases, and even hurting the performance on 4 tasks. Surprisingly, G(WASAM) > G(SAM) for {SST, QNLI}, even though SWA decreases performance in these cases.

4. SWA beats SAM on GRL tasks. G(SWA) > G(NF) in 10/12 experiments, while G(SAM) > G(NF) in only 4. We examine why SAM under-performs in Appendix A.4.

5. SWA does not work well with Transformers. SWA often does not improve and sometimes hurts performance, as can be seen in the ViT results in Table 2 and the NLP results in Fig. 4a. In contrast, SAM has some positive effects in these settings. We explore this further in Appendix A.3.

6. SWA and SAM improve SSL task performance. This is non-trivial, as the theoretical motivation behind finding flat minima is linked to supervised learning losses [43, 77, 21]. Concurrently to our work, Ramesh et al. [82] report that SAM helps for contrastive CLIP [79] models too.

7. Flat optimizers do not strictly improve over non-flat optimizers. The non-flat optimizer is nearly always the best for NPP Proteins and for SSL methods on CIFAR-10. We investigate the NPP Proteins solutions in the next subsection and recommend a more thorough investigation of the landscapes of SSL objectives for future work.

8. Flat-minima optimizers offer asymmetric payoffs: at worst, they decreased performance by 0.30%; at best, they increased it by 2.60%.

9. Averaging SAM iterates often improves over SWA or SAM alone. G(WASAM) > min(G(SWA), G(SAM)) in 39/42 cases. We hypothesize that asymmetric payoffs are the reason: when either SWA or SAM does not improve over the baseline (as discussed above), it does not hurt (much) either, hence WASAM is more robust across all tasks.

4.5 Why do flat-minima optimizers fail?
Here, we audit one of the cases where neither θ^SWA nor θ^SAM improves over θ^NF (this happens in 3 out of 42 cases): training a GraphSAGE [37] model on OGB-Proteins, a protein-protein interaction graph where the goal is to predict the presence of protein functions (multi-label binary classification) [45]. θ^SWA performs noticeably worse; θ^SAM performs about equally well.

Figure 5: GraphSAGE on OGB-Proteins: Adam's solution performs about equally well as SAM's, and better than SWA's (+). Panels (a), (b): interpolation from Adam to SWA; panels (c), (d): interpolation from Adam to SAM (loss and ROC-AUC in each case).

Fig. 5 shows two linear interpolations between θ^NF (Adam) and (1) θ^SWA (Figures 5a and 5b), and (2) θ^SAM (Figures 5c and 5d). In contrast to the success cases in Fig. 2, here: (a) for both SWA and SAM, the training loss minimizer is largely uncorrelated with the test loss minimizer; (b) SAM and Adam seem to be contained in the same test loss/accuracy basin. More analyses can be found in the Appendix.

5 Limitations and Future Work

First, some of the fixed, shared hyper-parameter values we used from previous works may harm the effect of flat optimizers. The ideal experimental design includes tuning all hyper-parameters independently for the non-flat optimizer, SWA, SAM, and WASAM. However, this forces the number of required runs to grow exponentially in the number of unique hyper-parameters and quickly renders this benchmark infeasible. Second, despite our best efforts to evaluate the optimizers on a broad range of benchmark tasks, there are still plenty of unexplored domains, especially ones known to be sensitive to careful optimization. For example, bi-level optimization problems [24] are common in generative modeling [31, 42], deep reinforcement learning [60, 44], meta-learning [81, 52], and causal machine learning [53, 54]. We are unaware of any investigation of flat-minima optimization for such problems. Third, in general, we believe fruitful directions of research include (a) optimizers that explicitly find basins where training loss flatness more directly corresponds to higher hold-out accuracy, (b) post-processing methods for existing optimization runs to move into flatter regions of these basins [2], (c) loss functions whose contours more tightly align with accuracy contours, (d) the study of flat-minima hyper-parameter interactions (e.g., learning rate and neighborhood radius in SAM) (see Appendices A.5 and A.6 for first results), and (e) analyses of the effect of flat-minima optimization on convergence speed [51]. Our benchmark results point to which tasks would most benefit from improving these future work directions: graph learning tasks would benefit from improvements in (a), as SAM is never among the best-performing methods, and language tasks would benefit if (b) is improved, as SWA is never among the best-performing methods.

6 Conclusion

We investigated when flat-minima optimizers work by conducting a fair comparison of two popular flat-minima optimizers. We examined the behavior of SWA/SAM by analyzing their loss landscapes on two representative deep learning tasks. Our next step was to evaluate their generalization performance on a broad and diverse set of tasks (in data, learning settings, and model architectures). Based on this benchmarking, we identified 9 findings, some of which directly guide future work directions. Finally, when SWA/SAM did not improve over baselines, common assumptions seemed broken (i.e., train and test loss minimizers were not correlated).
Acknowledgements We are very grateful to Kilian Q. Weinberger and Gao Huang for initial discussions and intuition, Pontus Stenetorp for NLP experimental design advice, and Oscar Key for feedback on the draft. JK and LL acknowledge support by the Engineering and Physical Sciences Research Council with grant number EP/S021566/1. This research was supported through Azure resources provided by The Alan Turing Institute and credits awarded by Google Cloud. [1] Aghajanyan, A., Shrivastava, A., Gupta, A., Goyal, N., Zettlemoyer, L., and Gupta, S. Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations, 2020. [2] Andriushchenko, M. and Flammarion, N. Towards understanding sharpness-aware minimization. In International Conference on Machine Learning, pp. 639 668. PMLR, 2022. [3] Athiwaratkun, B., Finzi, M., Izmailov, P., and Wilson, A. G. There are many consistent explanations of unlabeled data: Why you should average, 2019. [4] Bahri, D., Mobahi, H., and Tay, Y. Sharpness-aware minimization improves language model generalization, 2021. [5] Bisla, D., Wang, J., and Choromanska, A. Low-pass filtering sgd for recovering flat optima in the deep learning optimization landscape, 2022. URL https://arxiv.org/abs/2201. 08025. [6] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311, 2018. [7] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/ paper/2020/hash/70feb62b69f16e0238f741fab228fec2-Abstract.html. [8] Cha, J., Chun, S., Lee, K., Cho, H.-C., Park, S., Lee, Y., and Park, S. SWAD: Domain generalization by seeking flat minima. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=zk Hlu_3s JYU. [9] Chaudhari, P., Choromanska, A., Soatto, S., Le Cun, Y., Baldassi, C., Borgs, C., Chayes, J. T., Sagun, L., and Zecchina, R. Entropy-sgd: Biasing gradient descent into wide valleys. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https: //openreview.net/forum?id=B1Yf Afcgl. [10] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. E. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 1597 1607. PMLR, 2020. URL http://proceedings. mlr.press/v119/chen20j.html. [11] Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750 15758, 2021. [12] Chen, X., Hsieh, C.-J., and Gong, B. When vision transformers outperform resnets without pre-training or strong data augmentations, 2021. [13] Chen, Y., Minervini, P., Riedel, S., and Stenetorp, P. Relation prediction as an auxiliary training objective for improving multi-relational graph representations. 
In 3rd Conference on Automated Knowledge Base Construction, 2021. URL https://openreview.net/forum? id=Qa3u S3H7-Le. [14] Chiang, W., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Teredesai, A., Kumar, V., Li, Y., Rosales, R., Terzi, E., and Karypis, G. (eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 257 266. ACM, 2019. doi: 10.1145/3292500.3330925. URL https://doi.org/10.1145/3292500.3330925. [15] Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113 123, 2019. [16] Dauphin, Y. N., Pascanu, R., Gülçehre, Ç., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2933 2941, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/ 17e23e50bedc63b4095e3d8204ce063b-Abstract.html. [17] Devries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. Co RR, abs/1708.04552, 2017. URL http://arxiv.org/abs/1708.04552. [18] Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. HAWQ: hessian aware quantization of neural networks with mixed-precision. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 293 302. IEEE, 2019. doi: 10.1109/ICCV.2019.00038. URL https://doi.org/ 10.1109/ICCV.2019.00038. [19] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum?id=Yicb Fd NTTy. [20] Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. A. Essentially no barriers in neural network energy landscape. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 1308 1317. PMLR, 2018. URL http://proceedings.mlr.press/v80/draxler18a.html. [21] Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Elidan, G., Kersting, K., and Ihler, A. T. (eds.), Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017. AUAI Press, 2017. URL http://auai.org/uai2017/proceedings/papers/173.pdf. [22] Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum?id=6Tm1mposlr M. 
[23] Fort, S. and Jastrzebski, S. Large scale structure of neural network loss landscapes. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 6706 6714, 2019. URL https://proceedings.neurips.cc/paper/2019/ hash/48042b1dae4950fef2bd2aafa0b971a1-Abstract.html. [24] Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568 1577. PMLR, 2018. [25] Frankle, J. Revisiting "qualitatively characterizing neural network optimization problems". Co RR, abs/2012.06898, 2020. URL https://arxiv.org/abs/2012.06898. [26] Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 3259 3269. PMLR, 2020. URL http://proceedings.mlr.press/ v119/frankle20a.html. [27] Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, 2018. [28] Geiping, J., Goldblum, M., Pope, P. E., Moeller, M., and Goldstein, T. Stochastic training is not necessary for generalization. Co RR, abs/2109.14119, 2021. URL https://arxiv.org/ abs/2109.14119. [29] Ghorbani, B., Krishnan, S., and Xiao, Y. An investigation into neural net optimization via hessian eigenvalue density. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2232 2241. PMLR, 09 15 Jun 2019. URL https://proceedings. mlr.press/v97/ghorbani19b.html. [30] Golub, G. H. and Welsch, J. H. Calculation of gauss quadrature rules. Mathematics of computation, 23(106):221 230, 1969. [31] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11): 139 144, 2020. [32] Goodfellow, I. J. and Vinyals, O. Qualitatively characterizing neural network optimization problems. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6544. [33] Gotmare, A., Keskar, N. S., Xiong, C., and Socher, R. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. URL https://openreview.net/forum?id=r14EOs Cq KX. [34] Grewal, Y. and Bui, T. D. Diversity is all you need to improve bayesian model averaging. In Bayesian Deep Learning Workshop at Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, 2021. [35] Grill, J., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. Á., Guo, Z., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. Bootstrap your own latent - A new approach to self-supervised learning. 
In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips. cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html. [36] Guo, H., Jin, J., and Liu, B. Stochastic weight averaging revisited. Co RR, abs/2201.00519, 2022. URL https://arxiv.org/abs/2201.00519. [37] Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025 1035, 2017. [38] Han, D., Kim, J., and Kim, J. Deep pyramidal residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6307 6315. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.668. URL https://doi.org/10.1109/CVPR.2017.668. [39] He, H., Huang, G., and Yuan, Y. Asymmetric valleys: Beyond sharp and flat local minima. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 2549 2560, 2019. URL https://proceedings.neurips.cc/paper/2019/ hash/01d8bae291b1e4724443375634ccfa0e-Abstract.html. [40] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770 778. IEEE Computer Society, 2016. doi: 10.1109/CVPR. 2016.90. URL https://doi.org/10.1109/CVPR.2016.90. [41] He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. B. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 9726 9735. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.00975. URL https://doi.org/10.1109/CVPR42600.2020.00975. [42] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. [43] Hochreiter, S. and Schmidhuber, J. Flat minima. Neural computation, 9(1):1 42, 1997. [44] Hong, M., Wai, H.-T., Wang, Z., and Yang, Z. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. ar Xiv preprint ar Xiv:2007.05170, 2020. [45] Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/ paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html. [46] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get M for free. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https://openreview.net/forum?id=BJYww Y9ll. 
[47] Izacard, G. and Grave, É. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874 880, 2021. [48] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pp. 876 885. AUAI Press, 2018. URL http://auai.org/uai2018/proceedings/papers/313.pdf. [49] Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. Fantastic generalization measures and where to find them. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https: //openreview.net/forum?id=SJg IPJBFv H. [50] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601 1611, 2017. [51] Kaddour, J. Stop wasting my time! saving days of imagenet and bert training with latest weight averaging. ar Xiv preprint ar Xiv:2209.14981, 2022. URL https://arxiv.org/abs/ 2209.14981. [52] Kaddour, J., Saemundsson, S., and Deisenroth (he/him), M. Probabilistic active metalearning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 20813 20822. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ ef0d17b3bdb4ee2aa741ba28c7255c53-Paper.pdf. [53] Kaddour, J., Zhu, Y., Liu, Q., Kusner, M., and Silva, R. Causal effect inference for structured treatments. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum? id=0v9EPJGc10. [54] Kaddour, J., Lynch, A., Liu, Q., Kusner, M. J., and Silva, R. Causal machine learning: A survey and open problems. ar Xiv preprint ar Xiv:2206.15475, 2022. URL https://arxiv. org/abs/2206.15475. [55] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On largebatch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https://openreview.net/ forum?id=H1oy Rl Ygg. [56] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. [57] Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https: //openreview.net/forum?id=SJU4ay Ygl. [58] Kleinberg, B., Li, Y., and Yuan, Y. An alternative view: When does sgd escape local minima? In International Conference on Machine Learning, pp. 2698 2707. PMLR, 2018. [59] Kolesnikov, A., Zhai, X., and Beyer, L. 
Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1920 1929, 2019. [60] Konda, V. and Tsitsiklis, J. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999. [61] Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. E. Similarity of neural network representations revisited. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 3519 3529. PMLR, 2019. URL http://proceedings.mlr.press/v97/kornblith19a.html. [62] Krishnapriyan, A. S., Gholami, A., Zhe, S., Kirby, R. M., and Mahoney, M. W. Characterizing possible failure modes in physics-informed neural networks. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 26548 26560, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ df438e5206f31600e6ae4af72f2725f1-Abstract.html. [63] Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [64] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452 466, 2019. [65] Kwon, J., Kim, J., Park, H., and Choi, I. K. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5905 5914. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/v139/kwon21b.html. [66] Lacroix, T., Usunier, N., and Obozinski, G. Canonical tensor decomposition for knowledge base completion. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 2869 2878. PMLR, 2018. URL http://proceedings.mlr.press/v80/lacroix18a.html. [67] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 6402 6413, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ 9ef2ed4b7fd2c810847ffa5fa85bce38-Abstract.html. [68] Li, G., Xiong, C., Thabet, A., and Ghanem, B. Deepergcn: All you need to train deeper gcns, 2020. [69] Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6391 6401, 2018. URL https://proceedings.neurips. 
cc/paper/2018/hash/a41b3bb3e6b050b6c9067c67f663b915-Abstract.html. [70] Li, Z., Cui, Z., Wu, S., Zhang, X., and Wang, L. Fi-gnn: Modeling feature interactions via graph neural networks for ctr prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 539 548, 2019. [71] Liu, Q., Nickel, M., and Kiela, D. Hyperbolic graph neural networks. Advances in Neural Information Processing Systems, 32, 2019. [72] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. [73] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id= Bkg6Ri Cq Y7. [74] Mises, R. and Pollaczek-Geiringer, H. Praktische verfahren der gleichungsauflösung. ZAMMJournal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 9(1):58 77, 1929. [75] Neyshabur, B., Sedghi, H., and Zhang, C. What is being transferred in transfer learning? In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 0607f4c705595b911a4f3e7a127b44e0-Abstract.html. [76] Nikishin, E., Izmailov, P., Athiwaratkun, B., Podoprikhin, D., Garipov, T., Shvechikov, P., Vetrov, D., and Wilson, A. G. Improving stability in deep reinforcement learning with weight averaging. In Uncertainty in artificial intelligence workshop on uncertainty in Deep learning, 2018. [77] Petzka, H., Kamp, M., Adilova, L., Sminchisescu, C., and Boley, M. Relative flatness and generalization. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=sygvo7ctb_. [78] Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838 855, 1992. [79] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 8748 8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html. [80] Rahman, M. K. Training sensitivity in graph isomorphism network. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2181 2184, 2020. [81] Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019. [82] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. [83] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning Representations by Back Propagating Errors, pp. 696 699. MIT Press, Cambridge, MA, USA, 1988. ISBN 0262010976. [84] Sankar, A. R., Khasbage, Y., Vigneswaran, R., and Balasubramanian, V. N. 
[85] Stutz, D., Hein, M., and Schiele, B. Relating adversarially robust generalization to flat minima. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 7787–7797. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00771. URL https://doi.org/10.1109/ICCV48922.2021.00771.

[86] Susmelj, I., Heller, M., Wirth, P., Prescott, J., Ebner, M., et al. Lightly. GitHub. Note: https://github.com/lightly-ai/lightly, 2020.

[87] Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., and Dosovitskiy, A. MLP-Mixer: An all-MLP architecture for vision. CoRR, abs/2105.01601, 2021. URL https://arxiv.org/abs/2105.01601.

[88] Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. Going deeper with image transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 32–42. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00010. URL https://doi.org/10.1109/ICCV48922.2021.00010.

[89] Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., and Bouchard, G. Complex embeddings for simple link prediction. In Balcan, M. and Weinberger, K. Q. (eds.), Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 2071–2080. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/trouillon16.html.

[90] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

[91] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355, 2018.

[92] Wolf, T., Chaumond, J., Debut, L., Sanh, V., Delangue, C., Moi, A., Cistac, P., Funtowicz, M., Davison, J., Shleifer, S., et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.

[93] Wu, Y., Bojchevski, A., and Huang, H. Adversarial weight perturbation improves generalization in graph neural networks, 2022. URL https://openreview.net/forum?id=hUr6K4D9f7P.

[94] Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via nonparametric instance discrimination. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 3733–3742. Computer Vision Foundation / IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018.00393. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Wu_Unsupervised_Feature_Learning_CVPR_2018_paper.html.

[95] Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.

[96] Yang, Y., Hodgkinson, L., Theisen, R., Zou, J., Gonzalez, J. E., Ramchandran, K., and Mahoney, M. W. Taxonomizing local versus global structure in neural network loss landscapes. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=P6bUrLREcne.
[97] Yao, Z., Gholami, A., Keutzer, K., and Mahoney, M. Large batch size training of neural networks with adversarial training and second-order information, 2019. URL https://openreview.net/forum?id=H1lnJ2Rqt7.

[98] Yao, Z., Gholami, A., Keutzer, K., and Mahoney, M. W. PyHessian: Neural networks through the lens of the Hessian. In Wu, X., Jermaine, C., Xiong, L., Hu, X., Kotevska, O., Lu, S., Xu, W., Aluru, S., Zhai, C., Al-Masri, E., Chen, Z., and Saltz, J. (eds.), 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020, pp. 581–590. IEEE, 2020. doi: 10.1109/BigData50022.2020.9378171. URL https://doi.org/10.1109/BigData50022.2020.9378171.

[99] Zagoruyko, S. and Komodakis, N. Wide residual networks. In Wilson, R. C., Hancock, E. R., and Smith, W. A. P. (eds.), Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016. URL http://www.bmva.org/bmvc/2016/papers/paper087/index.html.

[100] Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow Twins: Self-supervised learning via redundancy reduction. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 12310–12320. PMLR, 2021. URL http://proceedings.mlr.press/v139/zbontar21a.html.

[101] Zhao, Y., Zhang, H., and Hu, X. SS-SAM: Stochastic scheduled sharpness-aware minimization for efficiently training deep neural networks. arXiv preprint arXiv:2203.09962, 2022.

[102] Zhou, P., Feng, J., Ma, C., Xiong, C., Hoi, S. C. H., et al. Towards theoretically understanding why SGD generalizes better than Adam in deep learning. Advances in Neural Information Processing Systems, 33:21285–21296, 2020.

[103] Zhuang, J., Gong, B., Yuan, L., Cui, Y., Adam, H., Dvornek, N. C., Tatikonda, S., Duncan, J., and Liu, T. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=edONMAnhLu-.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Sec. 5
(c) Did you discuss any potential negative societal impacts of your work? [No] We consider this work to be an investigation into optimization algorithms that are central to modern ML systems. As these systems can be used in vastly different ways, we believe that it is intractable to isolate the societal impact of novel insights into optimization.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]

3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Supplemental material
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Sec. 4
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] See Appendix.
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]