Published as a conference paper at ICLR 2023

SAM AS AN OPTIMAL RELAXATION OF BAYES

Thomas Möllenhoff & Mohammad Emtiyaz Khan
RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
{thomas.moellenhoff,emtiyaz.khan}@riken.jp

ABSTRACT

Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.

1 INTRODUCTION

Sharpness-aware minimization (SAM) (Foret et al., 2021) and related adversarial methods (Zheng et al., 2021; Wu et al., 2020; Kim et al., 2022) have been shown to improve generalization, calibration, and robustness in various applications of deep learning (Chen et al., 2022; Bahri et al., 2022), but the reasons behind their success are not fully understood. The original proposal of SAM was geared towards biasing training trajectories towards flat minima, and the effectiveness of such minima has various Bayesian explanations, for example, those relying on the optimization of description lengths (Hinton & Van Camp, 1993; Hochreiter & Schmidhuber, 1997), PAC-Bayes bounds (Dziugaite & Roy, 2017; 2018; Jiang et al., 2020; Alquier, 2021), or marginal likelihoods (Smith & Le, 2018). However, SAM is not known to directly optimize any such Bayesian criteria, even though some connections to PAC-Bayes do exist (Foret et al., 2021). The issue is that the max-loss used in SAM fundamentally departs from a Bayesian-style expected-loss under the posterior; see Fig. 1(a). The two methodologies are distinct, and little is known about their relationship.

Here, we establish a connection by using a relaxation of the Bayes objective where the expected negative-loss is replaced by its tightest convex lower bound. The bound is optimal, and obtained by Fenchel biconjugates which naturally yield the maximum loss used in SAM (Fig. 1(a)). From this viewpoint, SAM can be seen as optimizing the relaxation of Bayes to find the mean of an isotropic Gaussian posterior while keeping the variance fixed. Higher variances lead to smoother objectives, which biases the solution towards flatter regions (Fig. 1(b)). Essentially, the result connects SAM and Bayes through a Fenchel biconjugate that replaces the expected loss in Bayes by a maximum loss.

What do we gain by this connection? The generality of our result makes it possible to easily combine the complementary strengths of SAM and Bayes. For example, we show that the relaxed Bayes objective can be used to learn the variance parameter, which yields an Adam-like extension of SAM (Alg. 1). The variances are cheaply obtained from the vector that adapts the learning rate, and SAM's hyperparameter is adjusted for each parameter dimension via the variance vector. The extension improves the performance of SAM on standard benchmarks while giving comparable uncertainty estimates to the best Bayesian methods.
Our work complements similar extensions for SGD, RMSprop, and Adam from the Bayesian deep-learning community (Gal & Ghahramani, 2016; Mandt et al., 2017; Khan et al., 2018; Osawa et al., 2019). So far there is no work on such connections between SAM and Bayes, except for a recent empirical study by Kaddour et al. (2022). Husain & Knoblauch (2022) give an adversarial interpretation of Bayes, but it does not cover methods like SAM. Our work here is focused on connections to approximate Bayesian methods, which can have both theoretical and practical impact. We discuss a new path to robustness by connecting the two fields of adversarial and Bayesian methodologies.

[Figure 1: panel (a) "SAM vs Bayes" contrasts the loss ℓ(θ + ϵ), SAM's max-loss, the expected loss E_{ϵ∼N(0,σ²)}[ℓ(θ + ϵ)], and the Fenchel biconjugate; panel (b) "Higher variances give smoother objectives" shows the objectives for variance = 0.5, 2, and 4, each with the SAM minimizer marked.]

Figure 1: Panel (a) highlights the main difference between SAM and Bayes. SAM uses max-loss (the red dot), while Bayes uses expected-loss (the blue dot shows ℓ(θ + ϵ) for an ϵ ∼ N(ϵ | 0, σ²)). Our main result in Theorem 2 connects the two by using the optimal convex relaxation (Fenchel biconjugate) of the negative expected loss. It shows that the roles of ρ and σ are exactly the same, and a SAM minimizer obtained for a fixed ρ can always be recovered from the relaxation for some σ. An example is shown in Panel (b) for a loss (gray line) with 3 local minima indicated by A, B, and C. The expected loss is smoother (blue line) but the relaxation, which upper bounds it, is even more smooth (red line). Higher σ give smoother objectives where sharp minima A and C slowly disappear. The SAM minimizer is shown with a red star which matches the minimizer of the relaxation.

2 SAM AS AN OPTIMAL RELAXATION OF BAYES

In SAM, the training loss ℓ(θ) is replaced by the maximum loss within a ball of radius ρ > 0 around the parameter θ ∈ Θ ⊆ R^P, as shown below for an ℓ2-regularized problem,

  E_SAM(θ; ρ, δ) = sup_{‖ϵ‖ ≤ ρ} ℓ(θ + ϵ) + (δ/2) ‖θ‖²,    (1)

with δ > 0 as the regularization parameter. The use of "maximum" above differs from Bayesian strategies that use "expectation", for example, the variational formulation by Zellner (1988),

  L(q) = E_{θ∼q}[−ℓ(θ)] − D_KL(q(θ) ‖ p(θ)),    (2)

where q(θ) is the generalized posterior (Zhang, 1999; Catoni, 2007), p(θ) is the prior, and D_KL(· ‖ ·) is the Kullback-Leibler divergence (KLD). For an isotropic Gaussian posterior q(θ) = N(θ | m, σ²I) and prior p(θ) = N(θ | 0, I/δ), the objective with respect to the mean m becomes

  E_Bayes(m; σ, δ) = E_{ϵ′∼N(0,σ²I)}[ℓ(m + ϵ′)] + (δ/2) ‖m‖²,    (3)

which closely resembles Eq. 1, but with the maximum replaced by an expectation. The above objective is obtained by simply plugging in the posterior and prior, then collecting the terms that depend on m, and finally rewriting θ = m + ϵ′.

There are some resemblances between the two objectives which indicate that they might be connected. For example, both use local neighborhoods: the isotropic Gaussian can be seen as a softer version of the hard-constrained ball used in SAM. The size of the neighborhood is decided by their respective parameters (σ or ρ in Fig. 1(a)). But, apart from these resemblances, are there any deeper connections? Here, we answer this question. We will show that Eq. 1 is equivalent to optimizing with respect to m an optimal convex relaxation of Eq. 3 while keeping σ fixed.
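To make the contrast between Eq. 1 and Eq. 3 concrete, here is a small numerical sketch that evaluates the max-loss and a Monte-Carlo estimate of the expected loss on a toy one-dimensional problem (the loss function and all constants are our own illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Toy 1-D loss with minima of different sharpness (illustrative only).
    return 1.0 - np.cos(3.0 * theta) * np.exp(-0.1 * theta**2) + 0.05 * theta**2

def sam_loss(theta, rho, n_grid=401):
    # Max-loss of Eq. 1 (without the L2 term), brute-forced on a grid.
    eps = np.linspace(-rho, rho, n_grid)
    return np.max(loss(theta + eps))

def bayes_loss(theta, sigma, n_samples=100_000):
    # Expected loss of Eq. 3 (without the L2 term), by Monte Carlo.
    eps = rng.normal(0.0, sigma, n_samples)
    return np.mean(loss(theta + eps))

theta = 0.8
print(loss(theta), sam_loss(theta, rho=0.5), bayes_loss(theta, sigma=0.5))
```

The max-loss upper-bounds the loss at θ, while the expected loss is a smoothed version of it; these are exactly the two quantities sketched in Fig. 1(a).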
See Theorem 2 where an equivalence is drawn between σ and ρ, revealing similarities between the mechanisms used in SAM and Bayes to favor flatter regions. Fig. 1(b) shows an illustration where the relaxation is shown to upper bound the expected loss. We will now describe the construction of the relaxation based on Fenchel biconjugates.

2.1 FROM EXPECTED-LOSS TO MAX-LOSS VIA FENCHEL BICONJUGATES

We use results from convex duality to connect expected-loss to max-loss. Our main result relies on Fenchel conjugates (Hiriart-Urruty & Lemaréchal, 2001), also known as convex conjugates.

[Figure 2: panel (a) "Duality of exponential family"; panel (b) "Biconjugate of expected loss"; panel (c) "Biconjugate and expected loss in (ω, v)-parameterization", showing the biconjugate, the expected loss, and the original negative loss with minima A, B, and C.]

Figure 2: Panel (a) shows the dual structure of the exponential family for a Gaussian defined in Eq. 6. The space M lies above the parabola, while Ω is a half space. Panel (b) shows f(µ) and its biconjugate f**(µ) with the blue mesh and red surface respectively. We use the loss from Fig. 1(b). The f** function perfectly matches the original loss at the boundary where v → 0 (black curve), clearly showing the regions for the three minima A, B and C. Panel (c) shows the same but with respect to the standard mean-variance (ω, v) parameterization. Here, convexity of f** is lost, but we see more clearly the negative loss −ℓ(ω) appear at v → 0 (black curve). At this limit, we also get f(µ) = f**(µ) = −ℓ(ω), but for higher variances, both f and f** are smoothed versions of −ℓ(ω). The slices at v = 0.5, 2, and 4 give rise to the curves shown in Fig. 1(b).

Consider a function f(u), defined over u ∈ U and taking values in the extended real line. Given a dual space v* ∈ V, the Fenchel biconjugate f** and conjugate f* are defined as follows,

  f**(u) = sup_{v*∈V} ⟨v*, u⟩ − f*(v*),   f*(v*) = sup_{u′∈U} ⟨u′, v*⟩ − f(u′),    (4)

where ⟨·, ·⟩ denotes the inner product. Note that f need not be convex, but both f* and f** are convex functions. Moreover, the biconjugate is the optimal convex lower bound: f(u) ≥ f**(u).

We will use the conjugate functions to connect the max-loss in SAM to the expected-loss involving regular, minimal exponential-family distributions (Wainwright et al., 2008) of the following form¹,

  q_λ(θ) = exp( ⟨λ, T(θ)⟩ − A(λ) ),    (5)

with natural parameter λ ∈ Ω, sufficient statistics T(θ), log-partition function A(λ), and a base measure of 1, where Ω = {λ : A(λ) < ∞}. Dual spaces naturally arise for such distributions and can be used to define conjugates. For every λ ∈ Ω, there exists a unique dual parameterization in terms of the expectation parameter µ = E_{θ∼q_λ}[T(θ)]. The space of all valid µ satisfies µ = ∇A(λ), and all such µ form the interior of a dual space which we denote by M. We can also write λ in terms of µ using the inverse λ = ∇A*(µ), and in general, both µ and λ are valid parameterizations. In what follows, we will always use µ to parameterize the distribution, denoting q_µ(θ). As an example, for a Gaussian N(θ | ω, v) over a scalar θ with mean ω and variance v > 0, we have

  T(θ) = (θ, θ²),   µ = (ω, ω² + v),   λ = (ω/v, −1/(2v)).    (6)

Because v > 0, the space M is the epigraph of the parabola (ω, ω²), and Ω is a negative half space because the second natural parameter is constrained to be negative; see Fig. 2(a). The dual spaces can be used to define the Fenchel conjugates. We will use the conjugates of the expected negative-loss acting over µ ∈ int(M),

  f(µ) = E_{θ∼q_µ}[−ℓ(θ)].    (7)

¹The notation used in the paper is summarized in Table 10 in App. H.
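The Gaussian duality in Eq. 6 can be checked numerically: µ should equal ∇A(λ) for the log-partition function A of the scalar Gaussian. The following sketch verifies this with central finite differences; A below is the standard exponential-family expression, stated up to an additive constant that does not affect the gradient:

```python
import numpy as np

def A(lam1, lam2):
    # Log-partition of a scalar Gaussian in natural parameters
    # lam = (omega/v, -1/(2v)), up to an additive constant.
    return -lam1**2 / (4.0 * lam2) - 0.5 * np.log(-2.0 * lam2)

omega, v = 1.3, 0.7
lam1, lam2 = omega / v, -1.0 / (2.0 * v)     # natural parameters (Eq. 6)
mu = (omega, omega**2 + v)                   # expectation parameters (Eq. 6)

h = 1e-6                                     # central finite differences
g1 = (A(lam1 + h, lam2) - A(lam1 - h, lam2)) / (2 * h)
g2 = (A(lam1, lam2 + h) - A(lam1, lam2 - h)) / (2 * h)
print(mu, (g1, g2))                          # both approx. (1.3, 2.39)
```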
For all µ ∉ int(M), we set f(µ) = +∞. Using Eq. 4, we can obtain the biconjugate,

  f**(µ) = sup_{λ′∈Ω} ⟨λ′, µ⟩ − sup_{µ′∈M} ( ⟨µ′, λ′⟩ − E_{θ∼q_{µ′}}[−ℓ(θ)] ),    (8)

where we use λ′ and µ′ to denote arbitrary points from Ω and M, respectively, and avoid confusion with the change of coordinates µ′ = ∇A(λ′). We show in App. A that, for an isotropic Gaussian, the biconjugate takes a form where max-loss automatically emerges. We define µ to be the expectation parameter of a Gaussian N(θ | ω, vI), and λ′ to be the natural parameter of a (different) Gaussian N(θ | m, σ²I). The equation below shows the biconjugate with the max-loss included in the second term,

  f**(µ) = sup_{m,σ} { E_{θ∼q_µ}[ −(1/(2σ²)) ‖θ − m‖² ] − sup_ϵ ( ℓ(m + ϵ) − (1/(2σ²)) ‖ϵ‖² ) }.    (9)

The max-loss here is in a proximal form, which has an equivalence to the trust-region form used in Eq. 1 (Parikh & Boyd, 2013, Sec. 3.4). The first term in the biconjugate arises after simplification of the ⟨λ′, µ⟩ term in Eq. 8, while the second term is obtained by using the theorem below to switch the maximum over µ′ to a maximum over θ, and subsequently over ϵ by rewriting θ = m + ϵ.

Theorem 1. For a Gaussian distribution that takes the form (5), the conjugate f*(λ′) can be obtained by taking the supremum over Θ instead of M, that is,

  sup_{µ′∈M} ⟨µ′, λ′⟩ − E_{θ∼q_{µ′}}[−ℓ(θ)] = sup_{θ∈Θ} ⟨T(θ), λ′⟩ − [−ℓ(θ)],    (10)

assuming that ℓ(θ) is lower-bounded, continuous and majorized by a quadratic function. Moreover, there exists a λ′ for which f*(λ′) is finite and, assuming ℓ(θ) is coercive, dom(f*) ⊆ Ω.

A proof of this theorem is in App. B, where we also discuss its validity for generic distributions.

The red surface in Fig. 2(b) shows the convex lower bound f** to the function f (shown with the blue mesh) for a Gaussian with mean ω and variance v (Eq. 6). Both functions have a parabolic shape which is due to the shape of M. Fig. 2(c) shows the same but with respect to the standard mean-variance parameterization (ω, v). In this parameterization, convexity of f** is lost, but we see more clearly the negative loss −ℓ(ω) appear at v → 0 (shown with the thick black curve). At the limit, we get f(µ) = f**(µ) = −ℓ(ω), but for higher variances, both f and f** are smoothed versions of −ℓ(ω). Fig. 2(c) shows three pairs of blue and red lines. The blue line in each pair is a slice of the function f, while the red line is a slice of f**. The values of v for which the functions are plotted are 0.5, 2, and 4. The slices at these variances are visualized in Fig. 1(b) along with the original loss.

The function f**(µ) is a lifted version of the one-dimensional ℓ(θ) to a two-dimensional µ-space. Such liftings have been used in the optimization literature (Bauermeister et al., 2022). Our work connects them to exponential families and probabilistic inference. The convex relaxation preserves the shape at the limit v → 0, while giving smoothed lower bounds for all v > 0. For a high enough variance the local minima at A and C disappear, biasing the optimization towards the flatter region B of the loss. We will now use this observation to connect to SAM.
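The smoothing behavior can be reproduced directly from the inner proximal term of Eq. 9: for a fixed σ, larger σ progressively wipes out sharp minima, mirroring the slices in Fig. 1(b). A brute-force sketch on a toy loss (our own illustrative choice; the full biconjugate also involves the outer supremum over (m, σ), which we omit here):

```python
import numpy as np

def loss(theta):
    # Toy 1-D loss with minima of different sharpness (illustrative only).
    return 1.0 - np.cos(3.0 * theta) * np.exp(-0.1 * theta**2) + 0.05 * theta**2

def proximal_max_loss(m, sigma, half_width=8.0, n_grid=4001):
    # sup_eps  l(m + eps) - ||eps||^2 / (2 sigma^2), the max-loss term
    # appearing in Eqs. 9 and 12, evaluated by brute force on a grid.
    eps = np.linspace(-half_width, half_width, n_grid)
    return np.max(loss(m + eps) - eps**2 / (2.0 * sigma**2))

ms = np.linspace(-5, 5, 201)
for sigma in (0.5, 1.0, 2.0):
    curve = [proximal_max_loss(m, sigma) for m in ms]
    print(sigma, ms[int(np.argmin(curve))])   # minimizer moves to flat regions
```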
2.2 SAM AS AN OPTIMAL RELAXATION OF THE BAYES OBJECTIVE

We will now show that a minimizer of the SAM objective in Eq. 1 can be recovered by minimizing a relaxation of the Bayes objective where the expected-loss f is replaced by its convex lower bound f**. The new objective, which we call relaxed-Bayes, always lower bounds the Bayes objective of Eq. 2,

  sup_{µ∈M} L(q_µ) ≥ sup_{µ∈M} f**(µ) − D_KL(q_µ(θ) ‖ p(θ))    (11)
   = sup_{µ∈M} sup_{m,σ} { −sup_ϵ ( ℓ(m + ϵ) − (1/(2σ²)) ‖ϵ‖² ) − D_KL( q_µ(θ) ‖ (1/Z) p(θ) e^{−(1/(2σ²)) ‖θ−m‖²} ) + log Z },

where in the second line we substitute f**(µ) from (9), and merge E_{θ∼q_µ}[−(1/(2σ²)) ‖θ − m‖²] with the KLD term. Z denotes the normalizing constant of the second distribution in the KLD. The relaxed-Bayes objective has a specific difference-of-convex (DC) structure because both f** and the KLD are convex with respect to µ. Optimizing the relaxed-Bayes objective relates to double-loop algorithms where the inner loop uses max-loss to optimize for (m, σ), while the outer loop solves for µ. We will use this structure to simplify and connect to SAM.

Due to the DC structure, it is possible to switch the order of maximization with respect to µ and (m, σ), and eliminate µ by a direct maximization (Toland, 1978). The maximum occurs when the KLD term is 0 and the two distributions in its arguments are equal. A detailed derivation is in App. C, which gives us a minimization problem over (m, σ),

  E_relaxed(m, σ; δ′) = sup_ϵ ( ℓ(m + ϵ) − (1/(2σ²)) ‖ϵ‖² ) + (δ′/2) ‖m‖² − (P/2) log(σ²δ′),    (12)

where the last two terms equal −log Z, δ′ = 1/(σ² + 1/δ₀), and we use a Gaussian prior p(θ) = N(θ | 0, I/δ₀). This objective uses max-loss in the proximal form, which is similar to the SAM loss in Eq. 1 with the difference that the hard constraint is replaced by a softer penalty term. The theorem below is our main result which shows that, for a given ρ, a minimizer of the SAM objective in Eq. 1 can be recovered by optimizing Eq. 12 with respect to m while keeping σ fixed.

Theorem 2. For every (ρ, δ), there exist (σ, δ′) such that

  arg min_{θ∈Θ} E_SAM(θ; ρ, δ) = arg min_{m∈Θ} E_relaxed(m, σ; δ′),    (13)

assuming that the SAM perturbation satisfies ‖ϵ*‖ = ρ at a stationary point.

A proof is given in App. D and essentially follows from the equivalences between proximal and trust-region formulations. The result establishes SAM as a special case of relaxed-Bayes, and shows that ρ and σ play exactly the same role. This formalizes the intuitive resemblance discussed at the beginning of Sec. 2, and suggests that, in both Bayes and SAM, flatter regions are preferred for higher σ because such values give rise to smoother objectives (Fig. 1(b)).

The result can be useful for combining the complementary strengths of SAM and Bayes. For example, uncertainty estimation in SAM can now be performed by estimating σ through the relaxed-Bayes objective. Another way is to exploit the DC structure in µ-space to devise new optimization procedures to improve SAM. In Sec. 3, we will demonstrate both of these improvements of SAM. It might also be possible to use SAM to improve Bayes, something we would like to explore in the future. For Bayes, it is still challenging to get a good performance in deep learning while optimizing for both mean and variance (despite some recent success (Osawa et al., 2019)). Often, non-Bayesian methods are used to set the variance parameter (Plappert et al., 2018; Fortunato et al., 2018; Bisla et al., 2022). The reasons behind the bad performance remain unclear. It is possible that a simplistic posterior is to blame (Foong et al., 2020; Fortuin et al., 2022; Coker et al., 2022), but our result suggests to not disregard the opposite hypothesis (Farquhar et al., 2020). SAM works well, despite using a simple posterior, and it is possible that the poor performance of Bayes is due to poor algorithms, not simplistic posteriors.
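Theorem 2 can be sanity-checked numerically on a one-dimensional toy problem: fix ρ, find the SAM minimizer by brute force, and scan σ until the minimizer of the relaxed objective (12) (restricted to m, with the m-independent constants dropped) coincides. A sketch with δ = 0 for simplicity and our own toy loss:

```python
import numpy as np

def loss(theta):
    return 1.0 - np.cos(3.0 * theta) * np.exp(-0.1 * theta**2) + 0.05 * theta**2

thetas = np.linspace(-5, 5, 1001)
eps_grid = np.linspace(-8, 8, 4001)

def e_sam(theta, rho):        # Eq. 1 with delta = 0
    e = eps_grid[np.abs(eps_grid) <= rho]
    return np.max(loss(theta + e))

def e_relaxed(m, sigma):      # max-loss term of Eq. 12 (sigma held fixed)
    return np.max(loss(m + eps_grid) - eps_grid**2 / (2 * sigma**2))

rho = 0.5
sam_min = thetas[np.argmin([e_sam(t, rho) for t in thetas])]
for sigma in (0.25, 0.5, 1.0):
    rel_min = thetas[np.argmin([e_relaxed(t, sigma) for t in thetas])]
    print(sigma, sam_min, rel_min)   # some sigma reproduces the SAM minimizer
```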
We believe that our result can be used to borrow algorithmic techniques from SAM to improve Bayes, and we hope to explore this in the future.

3 BAYESIAN-SAM TO OBTAIN UNCERTAINTY ESTIMATES FOR SAM

In this section, we will show an application of our result to improve an aspect of SAM. We will derive a new algorithm that can automatically obtain reasonable uncertainty estimates for SAM. We will do so by optimizing the relaxed-Bayes objective to get σ. A straightforward way is to directly optimize Eq. 12, but we will use the approach followed by Khan et al. (2018), based on the natural-gradient method of Khan & Rue (2021), called the Bayesian learning rule (BLR). This approach has been shown to work well on large problems, and is more promising because it leads to Adam-like algorithms, enabling us to leverage existing deep-learning techniques to improve training (Osawa et al., 2019), for example, by using momentum and data augmentation.

To learn variances on top of SAM, we will use a multivariate Gaussian posterior q_µ(θ) = N(θ | ω, V) with mean ω and a diagonal covariance V = diag(s)⁻¹, where s is a vector of precision values (inverse of the variances). The posterior is more expressive than the isotropic Gaussian used in the previous section, and is expected to give better uncertainty estimates. As before, we will use a Gaussian prior p(θ) = N(θ | 0, I/δ₀).

Algorithm 1: Our Bayesian-SAM (bSAM) is a simple modification of "SAM with Adam". Just add the red boxes, or use them to replace the blue boxes next to them. The advantage of bSAM is that it automatically obtains variances σ² through the vector s that adapts the learning rate (see line 11). Two other main changes are in line 3, where Gaussian sampling is added, and line 5, where the vector s is used to adjust the perturbation ϵ. The change in line 8 improves the variance by adding the regularizer δ and a constant γ (just like Adam), while line 9 uses a Newton-like update where s is used instead of √s. We denote by a ∘ b, a/b and |a| the element-wise multiplication, division and absolute value, respectively. N denotes the number of training examples.

Input: learning rate α, β₁, L2-regularizer δ, SAM parameter ρ, β₂ = 0.999, γ = 0.1 (replacing Adam's 10⁻⁸)
1: Initialize mean ω (NN weight init), scale s ← 1 (replacing Adam's s ← 0), and momentum vector g_m ← 0
2: while not converged do
3:   θ ← ω + e, where e ∼ N(e | 0, σ²), σ² ← 1/(N s)   {sample to improve the variance}
4:   g ← (1/B) Σ_{i∈M} ∇ℓᵢ(θ), where M is a minibatch of B examples
5:   ϵ ← ρ g/s   {rescale by the vector s, instead of the scalar ‖g‖ used in SAM}
6:   g_ϵ ← (1/B) Σ_{i∈M} ∇ℓᵢ(ω + ϵ)
7:   g_m ← β₁ g_m + (1 − β₁) (g_ϵ + δω)
8:   s ← β₂ s + (1 − β₂) (√s ∘ |g| + δ + γ)   {improved variance estimate, replacing Adam's (g_ϵ + δω)²}
9:   ω ← ω − α g_m/s   {scale the gradient by s instead of √s + γ}
10: end while
11: Return the mean ω and variance σ² ← 1/(N s)

The BLR uses gradients with respect to µ to update λ. The pair (µ, λ) is defined similarly to Eq. 6. We will use the update given in Khan & Rue (2021, Eq. 6), which when applied to Eq. 11 gives us

  λ ← (1 − α)λ + α [ ∇_µ f**(µ) + λ₀ ],    (14)

where α > 0 is the learning rate and λ₀ = (0, −δ₀I/2) is the natural parameter of p(θ). This is obtained by using the fact that ∇_µ D_KL(q_µ(θ) ‖ p(θ)) = λ − λ₀; see Khan & Rue (2021, Eq. 23). For α = 1, the update is equivalent to the difference-of-convex (DC) algorithm (Tao & An, 1997). The BLR generalizes this procedure by using α < 1, and automatically exploits the DC structure in µ-space, unlike a naive direct minimization of (12) with gradient descent.
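As a concrete reading of Alg. 1, the following minimal NumPy sketch implements one bSAM iteration for a generic minibatch-gradient oracle. This is our own rendering of the pseudocode, not the authors' implementation, and `grad_fn` is a placeholder name:

```python
import numpy as np

def bsam_step(omega, s, g_m, grad_fn, N, alpha=1e-3, beta1=0.9,
              beta2=0.999, delta=1e-4, rho=0.05, gamma=0.1, rng=None):
    """One iteration of Alg. 1. grad_fn(theta) returns the average
    minibatch gradient of the loss at theta."""
    rng = rng or np.random.default_rng()
    sigma2 = 1.0 / (N * s)                                   # line 3
    theta = omega + rng.normal(size=omega.shape) * np.sqrt(sigma2)
    g = grad_fn(theta)                                       # line 4
    eps = rho * g / s                                        # line 5
    g_eps = grad_fn(omega + eps)                             # line 6
    g_m = beta1 * g_m + (1 - beta1) * (g_eps + delta * omega)               # line 7
    s = beta2 * s + (1 - beta2) * (np.sqrt(s) * np.abs(g) + delta + gamma)  # line 8
    omega = omega - alpha * g_m / s                          # line 9
    return omega, s, g_m
```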
The update (14) requires ∇_µ f**(µ), which is obtained by maximizing a variant of Eq. 9 that replaces the isotropic covariance by a diagonal one. For a scalable method, we solve these maximizations inexactly by using the local-linearization approximation used in SAM, along with an additional approximation that performs a single block-coordinate descent step to optimize for the variances. The rest of the derivation is similar to the previous applications of the BLR; see Khan et al. (2018); Osawa et al. (2019). The details are in App. E, and the resulting algorithm, which we call Bayesian-SAM (bSAM), is shown in Alg. 1.

bSAM differs slightly from a direct application of Adam (Kingma & Ba, 2015) on the SAM objective, but uses the same set of hyperparameters. The differences are highlighted in Alg. 1, where bSAM is obtained by replacing the blue boxes with the red boxes next to them. For SAM with Adam, we use the method discussed in Bahri et al. (2022); Kwon et al. (2021) for language models. The main change is in line 5, where the perturbation ϵ is adapted by using a vector s, as opposed to SAM which uses a scalar ‖g‖. Modifications in lines 3 and 8 aim to improve the variance estimates, because with just Adam we may not get good variances; see the discussion in Khan et al. (2018, Sec. 3.4). Finally, the modification in line 9 is due to the Newton-like updates arising from the BLR (Khan & Rue, 2021), where s is used instead of √s to scale the gradient, as explained in App. E.

Due to their overall structural similarity, the bSAM algorithm has a similar computational complexity to Adam with SAM. The only additional overhead compared to SAM-type methods is the computation of a random noise perturbation, which is negligible compared to gradient computations.

[Figure 3: panels (a) Exact Bayes, (b) bSAM, (c) SAM (point est.), (d) SAM (Laplace).]

Figure 3: For a 2D logistic regression, bSAM gives predictive uncertainties that are similar to those obtained by an exact Bayesian posterior. White areas show uncertain predictive probabilities around 0.5. SAM's point estimate gives overconfident predictions, while Laplace leads to underconfidence.

| Model / Data | Method | Accuracy (higher is better) | NLL (lower is better) | ECE (lower is better) | AUROC (higher is better) |
|---|---|---|---|---|---|
| ResNet-20-FRN CIFAR-10 | SGD | 86.55 (0.35) | 0.56 (0.014) | 0.0839 (0.003) | 0.878 (0.005) |
| | SAM-SGD | 87.49 (0.26) | 0.49 (0.019) | 0.0710 (0.003) | 0.891 (0.004) |
| | SWAG | 86.80 (0.10) | 0.53 (0.017) | 0.0774 (0.001) | 0.880 (0.006) |
| | VOGN | 87.30 (0.24) | 0.38 (0.004) | 0.0315 (0.003) | 0.890 (0.003) |
| | Adam | 80.85 (0.98) | 0.83 (0.063) | 0.1317 (0.011) | 0.820 (0.013) |
| | SAM-Adam | 85.26 (0.15) | 0.46 (0.007) | 0.0228 (0.002) | 0.874 (0.004) |
| | **bSAM (ours)** | 88.72 (0.24) | 0.34 (0.005) | 0.0163 (0.002) | 0.903 (0.003) |
| ResNet-20-FRN CIFAR-100 | SGD | 55.82 (0.97) | 1.91 (0.025) | 0.1695 (0.005) | 0.811 (0.004) |
| | SAM-SGD | 58.58 (0.59) | 1.60 (0.022) | 0.0989 (0.005) | 0.827 (0.003) |
| | SWAG | 56.53 (0.40) | 1.86 (0.018) | 0.1604 (0.004) | 0.814 (0.004) |
| | VOGN | 59.83 (0.75) | 1.44 (0.019) | 0.0756 (0.005) | 0.830 (0.002) |
| | Adam | 39.73 (0.97) | 2.29 (0.045) | 0.0295 (0.018) | 0.775 (0.004) |
| | SAM-Adam | 53.25 (0.80) | 1.71 (0.035) | 0.0401 (0.005) | 0.818 (0.005) |
| | **bSAM (ours)** | 62.64 (0.33) | 1.32 (0.001) | 0.0311 (0.003) | 0.841 (0.004) |

Table 1: Comparison without data augmentation. The results are averaged over 5 random seeds, with standard deviations shown in brackets. The best statistically significant performance is shown in bold. bSAM (gray row) gives either comparable or better performance than the rest. Small-scale experiments on MNIST are in Table 6 in App. F.8.
4 NUMERICAL EXPERIMENTS

In this section, we present numerical results and show that bSAM brings the best of the two worlds together. It gives improved uncertainty estimates similar to the best Bayesian approaches, but also improves test accuracy, just like SAM. We compare performance to many methods from the deep learning (DL) and Bayesian DL literature: SGD, Adam (Kingma & Ba, 2015), SWAG (Maddox et al., 2019), and VOGN (Osawa et al., 2019). We also compare with two SAM variants: SAM-SGD and SAM-Adam. Both are obtained by inserting the perturbed gradients into either SGD or Adam, as suggested in Foret et al. (2021); Bahri et al. (2022); Kwon et al. (2021). We compare these methods across three different neural network architectures and on five datasets of increasing complexity. The comparison is carried out with respect to four different metrics evaluated on the validation set. The metrics are test accuracy, the negative log-likelihood (NLL), expected calibration error (ECE) (Guo et al., 2017) (using 20 bins) and the area under the ROC curve (AUROC). For an explanation of these metrics, we refer the reader to Osawa et al. (2019, Appendix H). For Bayesian methods, we report the performance of the predictive distribution p̂(y | D) = ∫ p(y | D, θ) q(θ) dθ, which we approximate using an average over 32 models drawn from q(θ).

4.1 ILLUSTRATION ON A TOY EXAMPLE

Fig. 3 compares the predictive probability on a simple Bayesian logistic regression problem (Murphy, 2012, Ch. 8.4). For exact Bayes, we use numerical integration, which is feasible for this toy 2D problem. For the rest, we use diagonal Gaussians. White areas indicate probabilities around 0.5, while red and blue are closer to 0 and 1, respectively. We see that the bSAM result is comparable to the exact posterior and corrects the overconfident predictions of SAM. Performing a Laplace approximation around the SAM solution leads to underconfident predictions. The posteriors are visualized in Fig. 5(c) and other details are discussed in App. F.1. We show results on another small toy example ("two moons") in App. F.2.

| Model / Dataset | Method | Accuracy (higher is better) | NLL (lower is better) | ECE (lower is better) | AUROC (higher is better) |
|---|---|---|---|---|---|
| ResNet-20-FRN / CIFAR-10 | SGD | 91.68 (0.26) | 0.29 (0.008) | 0.0397 (0.002) | 0.915 (0.002) |
| | SAM-SGD | 92.29 (0.39) | 0.25 (0.004) | 0.0266 (0.003) | 0.920 (0.003) |
| | Adam | 89.97 (0.27) | 0.41 (0.021) | 0.0610 (0.003) | 0.900 (0.003) |
| | SAM-Adam | 91.57 (0.21) | 0.26 (0.004) | 0.0329 (0.002) | 0.918 (0.001) |
| | **bSAM (ours)** | 92.16 (0.16) | 0.23 (0.003) | 0.0057 (0.002) | 0.925 (0.001) |
| ResNet-20-FRN / CIFAR-100 | SGD | 66.48 (0.10) | 1.20 (0.007) | 0.0524 (0.004) | 0.846 (0.002) |
| | SAM-SGD | 67.27 (0.22) | 1.19 (0.011) | 0.0481 (0.001) | 0.848 (0.002) |
| | Adam | 61.76 (0.67) | 1.66 (0.049) | 0.1582 (0.006) | 0.826 (0.003) |
| | SAM-Adam | 65.34 (0.32) | 1.23 (0.012) | 0.0166 (0.003) | 0.847 (0.002) |
| | **bSAM (ours)** | 68.22 (0.44) | 1.10 (0.013) | 0.0258 (0.003) | 0.857 (0.004) |
| ResNet-20-FRN / TinyImageNet | SGD | 52.01 (0.36) | 1.98 (0.007) | 0.0330 (0.002) | 0.832 (0.002) |
| | SAM-SGD | 52.25 (0.26) | 1.97 (0.013) | 0.0155 (0.002) | 0.827 (0.005) |
| | Adam | 49.04 (0.38) | 2.14 (0.024) | 0.0502 (0.004) | 0.820 (0.004) |
| | SAM-Adam | 51.17 (0.45) | 2.02 (0.014) | 0.0460 (0.004) | 0.828 (0.004) |
| | **bSAM (ours)** | 52.90 (0.35) | 1.94 (0.009) | 0.0199 (0.003) | 0.831 (0.001) |
| ResNet-18 / CIFAR-10 | SGD | 94.76 (0.11) | 0.21 (0.006) | 0.0304 (0.001) | 0.926 (0.006) |
| | SAM-SGD | 95.72 (0.14) | 0.14 (0.004) | 0.0134 (0.001) | 0.949 (0.003) |
| | **bSAM (ours)** | 96.15 (0.08) | 0.12 (0.002) | 0.0049 (0.001) | 0.954 (0.001) |
| ResNet-18 / CIFAR-100 | SGD | 76.54 (0.26) | 0.98 (0.007) | 0.0501 (0.002) | 0.869 (0.003) |
| | SAM-SGD | 78.74 (0.19) | 0.79 (0.007) | 0.0445 (0.002) | 0.887 (0.003) |
| | **bSAM (ours)** | 80.22 (0.28) | 0.70 (0.008) | 0.0311 (0.003) | 0.892 (0.003) |

Table 2: Comparison with data augmentation. Similar to Table 1, the shaded rows show that bSAM consistently improves over the baselines and is the overall best method.
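For reference, the 20-bin ECE used in the tables can be computed along these lines. This is a standard equal-width-binning sketch of the Guo et al. (2017) metric; the exact binning details of the paper's evaluation code are an assumption on our part:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=20):
    """probs: (n, classes) predictive probabilities; labels: (n,) true classes."""
    conf = probs.max(axis=1)                 # confidence of the prediction
    acc = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():                       # weight each bin by its size
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece
```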
We can also use this toy problem to see the effect of using f** instead of f in the Bayesian objective. Fig. 7 in App. F.3 shows a comparison of optimizing Eq. 11 to an optimization of the original Bayesian objective Eq. 2. For both, we use a full covariance to avoid the inaccuracies arising due to a diagonal covariance assumption. We find that the posteriors are very similar for both f and f**, indicating that the gap introduced by the relaxation is also small.

4.2 REAL DATASETS

We first perform a set of experiments without any data augmentation, which helps us to quantify the improvements obtained by the regularization induced by our Bayesian approach. This is useful in applications where it is not easy to do data augmentation (for example, tabular or time-series data). All further details about hyperparameter settings and experiments are in App. G. The results are shown in Table 1, and the proposed bSAM method overall performs best with respect to test accuracy as well as uncertainty metrics. bSAM also works well when training smaller networks on MNIST-like datasets; see Table 6 in App. F.8.

In Table 2 we show results with data augmentations (random horizontal flipping and random cropping). We consider CIFAR-10, CIFAR-100 and TinyImageNet, and still find bSAM to perform favorably, although the margin of improvement is smaller with data augmentation. We note that, for SGD on CIFAR-10 in Table 2, the accuracy (91.68%) is slightly better than the 91.25% reported in the original ResNet paper (He et al., 2016, Table 6), by using a similar ResNet-20 architecture with filter response normalization (FRN), which has 0.27 million parameters.

[Figure 4: accuracy and NLL as a function of ρ (from 10⁻⁴ to 10⁰) for SAM-SGD, SAM-Adam and bSAM with ResNet-20-FRN on CIFAR-100.]

Figure 4: bSAM is less sensitive to the choice of ρ than both SAM-SGD and SAM-Adam. We show accuracy and NLL. Plots for the other metrics are in App. F.4.

bSAM not only improves uncertainty estimates and test accuracy, it also reduces the sensitivity of SAM to the hyperparameter ρ. This is shown in Fig. 4 for SAM-SGD, SAM-Adam and bSAM, where bSAM shows the least sensitivity to the choice of ρ. Plots for other metrics are in App. F.4.

To understand the reasons behind the gains obtained with bSAM, we do several ablation studies. First, we study the effect of Gaussian sampling in line 3 of Alg. 1, which is not used in SAM. Experiments in App. F.5 show that Gaussian sampling consistently improves the test accuracy and uncertainty metrics. Second, in App. F.6, we study the effect of posterior averaging on the quality of bSAM's predictions, and show that increasing the number of sampled models consistently improves the performance. Finally, we study the effect of "m-sharpness", where we distribute the mini-batch across several compute nodes and use a different perturbation for each node (m refers to the number of splits). This has been shown to improve the performance of both SAM (Foret et al., 2021) and Bayesian methods (Osawa et al., 2019). In App. F.7, we show that larger values of m lead to better performance for bSAM too. In our experiments, we used m = 8. For completeness, we show how m-sharpness is implemented in bSAM in App. E.4.
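The posterior-averaged predictions used throughout this section (32 sampled models; see also the posterior-averaging ablation in App. F.6) can be sketched as follows, where `logit_fn` is a placeholder for the network evaluated at sampled weights:

```python
import numpy as np

def predictive_probs(logit_fn, omega, sigma2, x, n_samples=32, rng=None):
    """Approximate p(y|x) = E_{theta ~ q}[softmax(f(x; theta))] by averaging
    over n_samples weight samples theta ~ N(omega, diag(sigma2))."""
    rng = rng or np.random.default_rng(0)
    avg = 0.0
    for _ in range(n_samples):
        theta = omega + rng.normal(size=omega.shape) * np.sqrt(sigma2)
        z = logit_fn(x, theta)
        z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
        p = np.exp(z)
        avg += p / p.sum(axis=-1, keepdims=True) / n_samples
    return avg
```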
5 DISCUSSION

In this paper, we show a connection between SAM and Bayes through a Fenchel biconjugate. When the expected negative loss in the Bayes objective is replaced by the biconjugate, we obtain a relaxed Bayesian objective, which involves a maximum loss, similar to SAM. We showed that SAM can be seen as optimizing the mean while keeping the variance fixed. We then derived an extension where the variances are optimized using a Bayesian learning algorithm applied to the relaxation. The numerical results show that the new variant improves both uncertainty estimates and generalization, as expected of a method that combines the strengths of Bayes and SAM.

We expect this result to be useful for researchers interested in designing new robust methods. The principles used in SAM are related to adversarial robustness, which is different from Bayes. We show that the Fenchel conjugate is the bridge between the two, and we used it to obtain an extension of SAM. Since SAM gives excellent performance on many deep-learning benchmarks, the techniques used there will be useful in improving Bayesian methods. Finally, the use of convex duality can be useful to improve the theoretical understanding of SAM-like methods. It is possible to use results from the convex-optimization literature to study and improve adversarial as well as Bayesian methods. Our work may open such directions and provide a new path for research in improving robustness. Expectations with respect to distributions arise naturally in many other problems, and often sampling is used to approximate them. Our paper gives a different way to approximate such expectations using convex optimization rather than sampling. Such problems may also benefit from our results.

ACKNOWLEDGEMENTS

We would like to thank Pierre Alquier for his valuable comments and suggestions. This work is supported by the Bayes duality project, JST CREST Grant Number JPMJCR2112.

AUTHOR CONTRIBUTIONS STATEMENT

List of Authors: Thomas Möllenhoff (T.M.), Mohammad Emtiyaz Khan (M.E.K.). T.M. proposed the Fenchel biconjugates, their connections to SAM (Theorems 1 and 2), and derived the new bSAM algorithm. M.E.K. provided feedback to simplify these. T.M. conducted all the experiments with suggestions from M.E.K. Both authors wrote the paper together.

REFERENCES

Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv:2110.11216, 2021.

Dara Bahri, Hossein Mobahi, and Yi Tay. Sharpness-aware minimization improves language model generalization. In Association for Computational Linguistics (ACL), 2022.

Hartmut Bauermeister, Emanuel Laude, Thomas Möllenhoff, Michael Moeller, and Daniel Cremers. Lifting the convex conjugate in Lagrangian relaxations: A tractable approach for continuous Markov random fields. SIAM J. Imaging Sci., 15(3):1253-1281, 2022.

Devansh Bisla, Jing Wang, and Anna Choromanska. Low-pass filtering SGD for recovering flat optima in the deep learning optimization landscape. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

Olivier Catoni. PAC-Bayesian Supervised Classification. Number 56 in Institute of Mathematical Statistics Lecture Notes Monograph Series, 2007.

Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform ResNets without pretraining or strong data augmentations. In International Conference on Learning Representations (ICLR), 2022.

Beau Coker, Wessel P Bruinsma, David R Burt, Weiwei Pan, and Finale Doshi-Velez. Wide mean-field Bayesian neural networks ignore the data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
Imre Csiszár and František Matúš. Closures of exponential families. Annals of Probability, pp. 582-600, 2005.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. (JMLR), 17(1):2909-2913, 2016.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

Gintare Karolina Dziugaite and Daniel M Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In International Conference on Machine Learning (ICML), 2018.

Sebastian Farquhar, Lewis Smith, and Yarin Gal. Liberty or depth: Deep Bayesian neural nets do not need complex weight posterior approximations. Advances in Neural Information Processing Systems, 33:4346-4357, 2020.

Andrew Foong, David Burt, Yingzhen Li, and Richard Turner. On the expressiveness of approximate inference in Bayesian neural networks. Advances in Neural Information Processing Systems, 33, 2020.

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations (ICLR), 2021.

Vincent Fortuin, Adrià Garriga-Alonso, Florian Wenzel, Gunnar Rätsch, Richard Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. In International Conference on Learning Representations (ICLR), 2022.

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations (ICLR), 2018.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Learning Theory (COLT), 1993.

Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997.

Hisham Husain and Jeremias Knoblauch. Adversarial interpretation of Bayesian inference. In International Conference on Algorithmic Learning Theory, 2022.

Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian neural nets via local linearization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.

Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Wilson. What are Bayesian neural network posteriors really like? In International Conference on Machine Learning (ICML), 2021.
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations (ICLR), 2020.

Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning (ICML), 2020.

Jean Kaddour, Linqing Liu, Ricardo Silva, and Matt J Kusner. A fair comparison of two popular flat minima optimizers: Stochastic weight averaging vs. sharpness-aware minimization. arXiv:2202.00661, 2022.

Mohammad Emtiyaz Khan and Haavard Rue. The Bayesian learning rule. arXiv:2107.04562, 2021.

Mohammad Emtiyaz Khan, Wu Lin, Voot Tangkaratt, Zuozhu Liu, and Didrik Nielsen. Variational adaptive-Newton method for explorative learning. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2017.

Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In International Conference on Machine Learning (ICML), 2018.

Minyoung Kim, Da Li, Shell X Hu, and Timothy Hospedales. Fisher SAM: Information geometry and sharpness aware minimisation. In International Conference on Machine Learning (ICML), 2022.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning (ICML), 2021.

Yong Liu, Siqi Mai, Minhao Cheng, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Random sharpness-aware minimization. Advances in Neural Information Processing Systems (NeurIPS), 2022.

Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Luigi Malagò, Matteo Matteucci, and Giovanni Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Foundations of Genetic Algorithms (FOGA), 2011.

Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. (JMLR), 18:1-35, 2017.

Kevin Patrick Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner, Rio Yokota, and Mohammad Emtiyaz Khan. Practical deep learning with Bayesian principles. Advances in Neural Information Processing Systems (NeurIPS), 2019.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1:123-231, 2013.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In International Conference on Learning Representations (ICLR), 2018.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Ralph Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis. Springer, 1998.
Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations (ICLR), 2018.

Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Math. Vietnam., 22(1):289-355, 1997.

John F Toland. Duality in nonconvex optimization. J. Math. Anal. Appl., 66(2):399-415, 1978.

Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Arnold Zellner. Optimal information processing and Bayes's theorem. The American Statistician, 42(4):278-280, 1988.

Tong Zhang. Theoretical analysis of a class of randomized regularization methods. In Conference on Learning Theory (COLT), 1999.

Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

A DERIVATION OF THE FENCHEL BICONJUGATE

Here, we give the derivation of the Fenchel biconjugate given in Eq. 9. For an isotropic Gaussian, we have T(θ) = (θ, θθᵀ) and λ′ = (m/σ², −I/(2σ²)). We will use the following identity to simplify,

  ⟨λ′, T(θ)⟩ = (m/σ²)ᵀθ + Tr( −(I/(2σ²)) θθᵀ ) = −(1/(2σ²)) ‖m − θ‖² + (1/(2σ²)) ‖m‖².    (15)

Using this, the first term in Eq. 8 becomes

  ⟨λ′, µ⟩ = E_{θ∼q_µ}[⟨λ′, T(θ)⟩] = E_{θ∼q_µ}[ −(1/(2σ²)) ‖m − θ‖² ] + (1/(2σ²)) ‖m‖².    (16)

Similarly, for the second term in Eq. 8, we use Theorem 1 and Eq. 15 to simplify,

  sup_{µ′∈M} ⟨µ′, λ′⟩ − E_{θ∼q_{µ′}}[−ℓ(θ)] = sup_{θ∈Θ} ⟨λ′, T(θ)⟩ + ℓ(θ)
   = sup_{θ∈Θ} −(1/(2σ²)) ‖θ − m‖² + (1/(2σ²)) ‖m‖² + ℓ(θ)
   = sup_ϵ ( ℓ(m + ϵ) − (1/(2σ²)) ‖ϵ‖² ) + (1/(2σ²)) ‖m‖²,    (17)

where in the last step we have performed a change of variables θ − m = ϵ. Subtracting (17) from (16), we arrive at Eq. 9.

B PROOF OF THEOREM 1

Theorem 1. For a Gaussian distribution that takes the form (5), the conjugate f*(λ′) can be obtained by taking the supremum over Θ instead of M, that is,

  sup_{µ′∈M} ⟨µ′, λ′⟩ − E_{θ∼q_{µ′}}[−ℓ(θ)] = sup_{θ∈Θ} ⟨T(θ), λ′⟩ − [−ℓ(θ)],    (10)

assuming that ℓ(θ) is lower-bounded, continuous and majorized by a quadratic function. Moreover, there exists a λ′ for which f*(λ′) is finite and, assuming ℓ(θ) is coercive, dom(f*) ⊆ Ω.

Proof. We first pull out the expectation on the left-hand side of Eq. 10,

  sup_{µ′∈M} ⟨µ′, λ′⟩ − E_{θ∼q_{µ′}}[−ℓ(θ)] = sup_{µ′∈M} E_{θ∼q_{µ′}}[ ⟨λ′, T(θ)⟩ + ℓ(θ) ] =: sup_{µ′∈M} E_{θ∼q_{µ′}}[J(θ)].

We will denote the expression inside the expectation by J(θ), and prove that the supremum over µ′ is equal to the supremum over θ. We will do so for two separate cases, where the supremum J* = sup_{θ∈Θ} J(θ) < ∞ is finite, and infinite, respectively.

When J* is finite: For this case, our derivation proceeds in two steps, where we show that the following two inequalities are true at the same time,

  sup_{µ′∈M} E_{θ∼q_{µ′}}[J(θ)] ≥ J*,   sup_{µ′∈M} E_{θ∼q_{µ′}}[J(θ)] ≤ J*.

This is only possible when the two are equal, which will establish the result. To prove the first inequality, we take a sequence {θ_t} for which J(θ_t) → J* when sending t → ∞. By the definition of the supremum, there always exists such a sequence. From the sequence {θ_t} we will now construct a sequence {µ′_t} for which the expectation of J(θ) also converges to J*. This will establish that the supremum over µ′ is larger than or equal to the supremum over θ.
It is only larger or equal, since hypothetically there could be another sequence {µ′_t} which achieves a higher value. We will see soon that this cannot be the case.

We use the sequence {µ′_t} where µ′_t = (θ_t, Σ_t + θ_tθ_tᵀ), with {Σ_t} taken in the space S^{P×P}_+ of symmetric positive-definite matrices such that Σ_t → 0. Now, we can see that the expectation E_{θ∼q_{µ′_t}}[J(θ)] converges to the supremum as follows:

  lim_{t→∞} E_{θ∼q_{µ′_t}}[J(θ)] = lim_{t→∞} E_{ϵ∼N(0,I)}[ J(θ_t + Σ_t^{1/2} ϵ) ] = E_{ϵ∼N(0,I)}[ lim_{t→∞} J(θ_t + Σ_t^{1/2} ϵ) ] = sup_{θ∈R^P} J(θ).

In the above steps, we used the fact that J is Gaussian-integrable and continuous. A sufficient condition for the integrability is that ℓ(θ) is lower-bounded and majorized by quadratics. With this, we have established that the supremum over µ′ is at least as large as the one over θ.

To prove the second inequality, we will show that the supremum over µ′ is also less than or equal to the supremum over θ. To see that, we note that J(θ′) ≤ J* for all θ′ ∈ Θ. This implies that for any distribution q_{µ′}, we have E_{θ′∼q_{µ′}}[J(θ′)] ≤ J*. Since the inequality holds for any distribution, we also have sup_{µ′∈M} E_{θ′∼q_{µ′}}[J(θ′)] ≤ J*. Having established both inequalities, we now know that the suprema in (10) agree whenever the supremum over θ is finite.

When J* is infinite: For such cases, we will show that sup_{µ′∈M} E_{θ∼q_{µ′}}[J(θ)] = ∞, which will complete the proof. Again, by definition of the supremum, there exists a sequence {θ_t} for which J(θ_t) → ∞ when t → ∞. This means that for any M > 0 there is a parameter θ_{t(M)} in the sequence such that J(θ_{t(M)}) > M. From the previous part of the proof, we know that we can construct a sequence q_{µ′_t} of Gaussians whose expectation converges to J(θ_{t(M)}). Since the above holds for any M > 0, the supremum over the expectation is also +∞.

Conditions for existence of a λ′ for which J* is finite: The final step of our proof is to give the conditions under which there exists a λ′ for which J* is finite, and therefore the result is non-trivial. Since ℓ(θ) is majorized by a quadratic function, there exist a constant c ∈ R, a vector b ∈ R^P and a matrix A ∈ S^{P×P}_+ such that the following bound holds:

  ℓ(θ) ≤ c + bᵀθ + ½ θᵀAθ.

Then for the candidate λ′ = (−b, −½A) we know that J(θ) = ℓ(θ) + ⟨λ′, T(θ)⟩ ≤ c has a finite upper bound for any θ. Hence, the supremum is finite for this choice of λ′ ∈ Ω.

Domain of f*(λ′): For a Gaussian distribution, valid (λ′₁, λ′₂) ∈ Ω satisfy the constraint λ′₂ ≺ 0. We now show that if λ′ ∉ Ω, then f*(λ′) = +∞. Since λ′₂ ⊀ 0, there exists a direction d ≠ 0 such that dᵀλ′₂d ≥ 0 and λ′₁ᵀd ≥ 0. Taking the sequence {θ_t} with θ_t = t·d, we see

  J(θ_t) = t λ′₁ᵀd + t² dᵀλ′₂d + ℓ(td) ≥ ℓ(td) → ∞ as t → ∞,

since coercivity of ℓ(θ) implies that ℓ(θ_t) → ∞ for any sequence {θ_t} with ‖θ_t‖ → ∞.

Some additional comments on the assumptions used in the theorem are in order.

1. Any reasonable loss function that one would want to minimize is lower-bounded, so this assumption is always satisfied in practice.

2. The existence of an upper-bounding quadratic holds for functions with Lipschitz-continuous gradient, which is a standard assumption in optimization known as the descent lemma. In practice, our conjugates are also applicable to cases where this assumption is not satisfied. In such a case, one can consider the conjugate function only in a local neighbourhood around the current solution by restricting the set of admissible θ in the supremum.
It may not matter that the bounding quadratic will eventually intersect the loss far away, outside of our current region of interest.

3. We assumed that ℓ(θ) is continuous since virtually all loss functions used in deep learning are continuous. Interestingly, if ℓ(θ) is not continuous, the conjugate function f* will involve its semicontinuous relaxation.

4. The coercivity assumption on ℓ(θ) is another standard assumption in optimization. For non-coercive losses, there could exist quadratic upper bounds which are flat in some direction. This would lead to a dual variable λ′ at the boundary of Ω.

The above Theorem 1 also holds for other exponential-family distributions such as scalar Bernoulli and Gamma distributions, but we do not give more details here because they are not relevant for connecting to SAM. There are also cases where the result does not hold. The simplest case is a Gaussian with fixed covariance. The above proof does not work there, because the closure (see e.g., Csiszár & Matúš (2005); Malagò et al. (2011)) of the family of distributions does not contain all Dirac delta distributions. Intuitively, the requirement is that we should be able to approach any delta as a limit of a sequence of members of the family, which excludes the fixed-covariance case.

C DERIVATION OF THE EQUIVALENT RELAXED OBJECTIVE (12)

Inserting the biconjugate (9) into the relaxed Bayes objective (11) yields

  sup_{µ∈M} sup_{m,σ} { −sup_ϵ ( ℓ(m + ϵ) − (1/(2σ²)) ‖ϵ‖² ) + E_{θ∼q_µ}[ −(1/(2σ²)) ‖θ − m‖² + log p(θ) − log q_µ(θ) ] },

where we expanded the KLD as D_KL(q_µ(θ) ‖ p(θ)) = E_{θ∼q_µ}[log q_µ(θ)] − E_{θ∼q_µ}[log p(θ)]. Interchanging the order of maximization and noticing that the first term does not depend on µ, we can find a closed-form expression for the maximization in µ as follows:

  sup_{µ∈M} E_{θ∼q_µ}[ −(1/(2σ²)) ‖θ − m‖² + log p(θ) − log q_µ(θ) ]
   = sup_{µ∈M} E_{θ∼q_µ}[ log( (1/Z) e^{−(1/(2σ²)) ‖θ−m‖²} p(θ) ) − log q_µ(θ) ] + log Z
   = log Z − inf_{µ∈M} D_KL( q_µ(θ) ‖ (1/Z) e^{−(1/(2σ²)) ‖θ−m‖²} p(θ) ).

In the last step, the KLD attains its minimum of zero at q_µ(θ) = (1/Z) e^{−(1/(2σ²)) ‖θ−m‖²} p(θ). Finally, we compute Z as

  Z = ∫ exp( −(1/(2σ²)) ‖θ − m‖² ) p(θ) dθ
   = (2πσ²)^{P/2} ∫ (2πσ²)^{−P/2} exp( −(1/(2σ²)) ‖θ − m‖² ) p(θ) dθ
   = (σ²)^{P/2} (σ² + 1/δ₀)^{−P/2} exp( −(δ′/2) ‖m‖² ),

where in the last step we inserted the normalization constant for the product of the two Gaussians N(θ | m, σ²I) and N(θ | 0, (1/δ₀)I), as given for example in Rasmussen & Williams (2006, App. 2). This gives the following expression,

  log Z = (P/2) log(σ²) − (P/2) log(σ² + 1/δ₀) − ‖m‖²/(2(σ² + 1/δ₀))
       = (P/2) log(σ²) + (P/2) log(δ′) − (δ′/2) ‖m‖²,    (18)

where δ′ = 1/(σ² + 1/δ₀). Switching to a minimization over (m, σ²) and inserting the result (18) for the exact µ-solution leads to the final objective

  sup_ϵ ( ℓ(m + ϵ) − (1/(2σ²)) ‖ϵ‖² ) + (δ′/2) ‖m‖² − (P/2) log(σ²δ′),

which is the energy function in (12).
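The closed form for Z can be checked numerically in one dimension. The following sketch compares the formula above against direct quadrature; all constants are arbitrary test values of our choosing:

```python
import numpy as np

sigma2, delta0, m = 0.5, 2.0, 0.7          # P = 1 test case
delta_p = 1.0 / (sigma2 + 1.0 / delta0)    # delta'

def integrand(theta):
    prior = np.sqrt(delta0 / (2 * np.pi)) * np.exp(-0.5 * delta0 * theta**2)
    return np.exp(-(theta - m)**2 / (2 * sigma2)) * prior

theta = np.linspace(-20, 20, 400001)
Z_numeric = np.trapz(integrand(theta), theta)
Z_closed = sigma2**0.5 * (sigma2 + 1/delta0)**(-0.5) * np.exp(-0.5 * delta_p * m**2)
print(Z_numeric, Z_closed)                 # agree to quadrature precision
```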
D PROOF OF THEOREM 2

Theorem 2. For every (ρ, δ), there exist (σ, δ′) such that

  arg min_{θ∈Θ} E_SAM(θ; ρ, δ) = arg min_{m∈Θ} E_relaxed(m, σ; δ′),    (13)

assuming that the SAM perturbation satisfies ‖ϵ*‖ = ρ at a stationary point.

Proof. We will show that any stationary point of E_SAM is also a stationary point of E_relaxed when σ and δ′ are chosen appropriately. The correspondence between the stationary points follows directly from the equivalences between proximal and trust-region formulations (Parikh & Boyd, 2013, Sec. 3.4). To show this, we first write the two objectives in terms of the perturbations ϵ′ and ϵ, respectively,

  min_m sup_{ϵ′} ℓ(m + ϵ′) − (1/(2σ²)) ‖ϵ′‖² + (δ′/2) ‖m‖²,
  min_θ sup_ϵ ℓ(θ + ϵ) − i{‖ϵ‖ ≤ ρ} + (δ/2) ‖θ‖²,

where i{‖·‖ ≤ ρ} is the indicator function which is zero inside the ℓ2-ball of radius ρ and +∞ otherwise. Since these are min-max problems, we need an appropriate notion of "stationary point", which we consider here to be a local Nash equilibrium (Jin et al., 2020, Prop. 3). At such points, the first derivatives in θ and ϵ (or ϵ′) vanish. Taking the derivatives and setting them to 0, we get the following conditions for the two problems,

  δ′m = −∇ℓ(m + ϵ′*),   ϵ′* = σ² ∇ℓ(m + ϵ′*),
  δθ = −∇ℓ(θ + ϵ*),   ϵ* = (ρ/µ̃) ∇ℓ(θ + ϵ*),

for some constant multiplier µ̃ > 0. Here, we used the assumption that the constraint ‖ϵ‖ ≤ ρ is active at our optimal point. Since the constraint is active, the negative gradient is an element of the (non-trivial) normal cone, which gives us the multiplier µ̃ in the second line above. The two equations are structurally equivalent. Any pair (m*, ϵ′*) satisfying the first line also satisfies the second line for σ² = ρ/µ̃ and δ′ = δ. The opposite is also true when using the pair (θ*, ϵ*) in the first line, which proves the theorem.

We assume that the constraint on the perturbation is active, that is, ‖ϵ*‖ = ρ. This assumption would be violated if there exists a local maximum within a ρ-ball around the parameter θ*. However, such a local maximum is unlikely to exist since the parameter θ* is determined by minimization and tends to lie inside a flat minimum.

E DERIVATION OF THE BSAM ALGORITHM

In this section of the appendix, we show how the natural-gradient descent update (14) leads us to the proposed bSAM algorithm shown in Alg. 1. As mentioned in the main text, for bSAM we consider a generalized setting where our Gaussian distribution q_µ(θ) = N(θ | ω, V) has a diagonal covariance. This corresponds to the following setting:

  T(θ) = (θ, θθᵀ),   µ = (ω, ω² + diag(s)⁻¹),   λ = (s ∘ ω, −½ diag(s)).    (19)

Here, ω ∈ R^P is the mean and diag(s) = V⁻¹ denotes the entries of the inverse diagonal covariance matrix. All operations (squaring, multiplication) are performed entrywise.

E.1 THE GRADIENT OF THE BICONJUGATE

First, we discuss how to compute the gradient of the Fenchel biconjugate, as the BLR update (14) requires it in every iteration. The diagonal Gaussian setup leads to a slightly generalized expression for the Fenchel biconjugate function:

  f**(µ) = −min_{m∈R^P, b∈R^P_+} { sup_{ϵ∈R^P} ( ℓ(m + ϵ) − ½ ϵᵀΣ⁻¹ϵ ) + E_{θ∼q_µ}[ ½ (θ − m)ᵀΣ⁻¹(θ − m) ] }
   = −min_{m∈R^P, b∈R^P_+} { sup_{ϵ∈R^P} ( ℓ(m + ϵ) − ½ ϵᵀΣ⁻¹ϵ ) + ½ (ω − m)ᵀΣ⁻¹(ω − m) + ½ tr(Σ⁻¹V) },    (20)

where Σ = diag(1/b) denotes the diagonal covariance matrix with 1/b on its diagonal. It follows from results in convex analysis, for instance from Rockafellar & Wets (1998, Proposition 11.3), that ∇f**(µ) = (Σ*⁻¹m*, −½ diag(b*)), that is, the gradient of the biconjugate can be constructed from the optimal solution (m*, b*) of the optimization problem (20).

In (20) there are three optimization steps: over ϵ, m, and b. For the first two, we will make approximations very similar to those used for SAM (Foret et al., 2021), and simply add an additional optimization over b. To simplify the computation, we will make an additional approximation. Instead of an iterative joint minimization in m and b, we will perform a single block-coordinate descent step where we first minimize in m for a fixed b = s and then afterwards in b for a fixed m = ω. This will give rise to a simple algorithm that works well in practice.

Optimization with respect to ϵ. To simplify the supremum in ϵ we will use SAM's technique of local linearization (Foret et al., 2021) with one difference. Instead of using the current solution to linearize, we will sample a random linearization point θ ∼ N(θ | ω, diag(s)⁻¹).
This is preferable from a Bayesian perspective and is also used in the other Bayesian variants (Khan et al., 2018). Linearization at the sampled θ then gives us

  ℓ(m + ϵ) ≈ ℓ(θ) + ⟨∇ℓ(θ), m + ϵ − θ⟩.

Using this approximation in (20), we get a closed-form solution ϵ* = Σ ∇ℓ(θ).

Optimization with respect to m. To get a closed-form solution for the optimization in m, we again consider a linearized loss: ℓ(m + ϵ*) ≈ ℓ(ω + ϵ*) + ⟨∇ℓ(ω + ϵ*), m − ω⟩. Moreover, we set ϵ* = V ∇ℓ(θ), corresponding to a fixed b = s. This reduces the optimization of (20) to

  m* = arg min_{m∈R^P} ⟨∇ℓ(ω + ϵ*), m⟩ + ½ ‖Σ*^{−1/2} (m − ω)‖²,    (21)

which has the closed-form solution m* = ω − Σ* ∇ℓ(ω + ϵ*), which we here leave dependent on Σ*.

Optimization with respect to Σ via b. Inserting ϵ* into (20) with the linearized loss, the minimization over b while fixing m = ω also has a simple closed-form solution, as shown below,

  b* = arg min_{b∈R^P_+} ½ ‖Σ^{1/2} ∇ℓ(θ)‖² + ½ tr(Σ⁻¹ diag(s)⁻¹) = |g| ∘ √s,    (22)

where θ ∼ N(θ | ω, diag(s)⁻¹) and g = ∇ℓ(θ).

E.2 SIMPLIFYING THE NATURAL-GRADIENT UPDATES

Having a way to approximate the biconjugate gradient, we now rewrite and simplify the BLR update (14). The derivations are largely similar to the ones used to derive the variational online Newton approaches; see Khan et al. (2017; 2018); Khan & Rue (2021). Inserting the parametrization (19) and our expression ∇f**(µ) = (Σ*⁻¹m*, −½ diag(b*)) into the BLR update (14), we get the following updates,

  s ∘ ω ← (1 − α) s ∘ ω + α Σ*⁻¹m*,   s ← (1 − α) s + α (b* + δ₀),

where we also used the prior λ₀ = (0, −½ δ₀I). By changing the order of the updates, we can write them in an equivalent form that resembles an adaptive gradient method:

  s ← (1 − α) s + α (b* + δ₀),    (23)
  ω ← ω − α V ( Σ*⁻¹(ω − m*) + δ₀ω ).    (24)

Now, inserting the solutions Σ*⁻¹(ω − m*) = ∇ℓ(ω + ϵ*) and b* = |∇ℓ(θ)| ∘ √s from Eqs. 21 and 22 into these updates, we arrive at

  s ← (1 − α) s + α ( |∇ℓ(θ)| ∘ √s + δ₀ ),    (25)
  ω ← ω − α V ( ∇ℓ(ω + ϵ) + δ₀ω ),    (26)

where ϵ = V ∇ℓ(θ) and θ ∼ N(θ | ω, V).

E.3 FURTHER MODIFICATIONS FOR LARGE-SCALE DEEP LEARNING

Now we will consider further modifications to the updates (25)-(26) to arrive at our bSAM method shown in Alg. 1. These are similar to the ones used for VOGN (Osawa et al., 2019). First, we consider a stochastic approximation to deal with larger problems. To that end, we now briefly describe the general learning setup. The loss is written as a sum of individual losses, ℓ(θ) = Σ_{i=1}^N ℓᵢ(θ), where N is the number of datapoints. The N datapoints are disjointly partitioned into K minibatches {Bᵢ}_{i=1}^K of B examples each. We then apply a stochastic variant to the bound N · (1/K) Σ_{i=1}^K fᵢ**(µ) ≤ f**(µ), where the individual functions are fᵢ(µ) = (1/B) Σ_{j∈Bᵢ} E_{θ∼q_µ}[−ℓⱼ(θ)]. Essentially, the variant computes the gradient on an incrementally sampled fᵢ rather than considering the full dataset. With s = Nŝ and δ₀ = Nδ, this turns (25)-(26) into the following updates:

  Nŝ ← (1 − α) Nŝ + α [ Nδ + N √(Nŝ) ∘ |g| ],    (27)
  ω ← ω − α [ Nδω + Ng_ϵ ] / (Nŝ),    (28)

where g = (1/B) Σ_{j∈M} ∇ℓⱼ(θ), θ ∼ N(θ | ω, diag(σ²)), σ² = (Nŝ)⁻¹, and M denotes a randomly drawn minibatch. Moreover, we have g_ϵ = (1/B) Σ_{j∈M} ∇ℓⱼ(ω + ϵ) with ϵ = g/(Nŝ), which we generalize to ϵ = ρ g/ŝ, where we introduce the step-size ρ > 0 in the adversarial step to absorb the 1/N factor and to gain additional control over the perturbation strength.
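Before the further deep-learning modifications below, the exact updates (25)-(26) can be written as a compact full-batch procedure. The sketch below is our direct transcription, with `grad_fn` a placeholder for the full-batch gradient:

```python
import numpy as np

def blr_bsam_step(omega, s, grad_fn, delta0, alpha, rng):
    """One step of updates (25)-(26) for a diagonal Gaussian N(omega, diag(1/s))."""
    V = 1.0 / s                                            # diagonal covariance
    theta = omega + rng.normal(size=omega.shape) * np.sqrt(V)
    g = grad_fn(theta)
    eps = V * g                                            # eps = V grad l(theta)
    s = (1 - alpha) * s + alpha * (np.abs(g) * np.sqrt(s) + delta0)   # (25)
    V = 1.0 / s                                            # (26) uses the updated V
    omega = omega - alpha * V * (grad_fn(omega + eps) + delta0 * omega)
    return omega, s
```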
With these modifications in place, and dividing out the $N$-factors in (27)–(28), we arrive at the following method:

$$g_m \leftarrow \beta_1\, g_m + (1 - \beta_1)\,(\delta\omega + g_\epsilon), \quad (29)$$
$$\hat{s} \leftarrow \beta_2\, \hat{s} + (1 - \beta_2)\,\Big[\, \delta + \gamma + \sqrt{N\hat{s}} \circ |g| \,\Big], \quad (30)$$
$$\omega \leftarrow \omega - \alpha\, g_m / \hat{s}, \quad (31)$$

with $g$ and $g_\epsilon$ defined as above. In practice, the method performed better without the $N$-factor in the $\hat{s}$-update. Renaming $\hat{s}$ as $s$, the updates (29)–(31) are equivalent to bSAM (Alg. 1).

E.4 BSAM ALGORITHM WITH m-SHARPNESS

The bSAM algorithm parallelizes well onto multiple accelerators. This is done by splitting up a minibatch into $m$ parts and computing independent perturbations for each part in parallel. The algorithm remains exactly the same as Alg. 1, with lines 3–6 replaced by the following lines 1–8:

1: Equally partition $M$ into $\{M_1, \ldots, M_m\}$ of $B$ examples each
2: for $k = 1, \ldots, m$ in parallel do
3:   $\theta_k \leftarrow \omega + e_k$, with $e_k \sim N(e_k \mid 0, \sigma^2)$ and $\sigma^2 \leftarrow 1/(N s)$
4:   $g_k \leftarrow (1/B) \sum_{i \in M_k} \nabla\ell_i(\theta_k)$
5:   $\epsilon_k \leftarrow g_k / s$
6:   $g_{\epsilon,k} \leftarrow (1/B) \sum_{i \in M_k} \nabla\ell_i(\omega + \epsilon_k)$
7: end for
8: $g_\epsilon \leftarrow (1/m) \sum_{k=1}^m g_{\epsilon,k}$, $\quad g \leftarrow (1/m) \sum_{k=1}^m g_k$

F ADDITIONAL DETAILS AND EXPERIMENTS

F.1 DETAILS ON THE LOGISTIC REGRESSION EXPERIMENT

Fig. 5(a) shows the binary classification dataset (red vs. blue circles) adopted from (Murphy, 2012, Ch. 8.4). The data can be separated by a linear classifier with two parameters whose decision boundary passes through zero. We show the decision boundaries obtained from the MAP solution and from samples of the exact Bayesian posterior in Fig. 5(a). Fig. 5(b) shows the 2D probability density function of the exact Bayesian posterior over the two parameters, where the black cross indicates the mode (the MAP solution) and the orange triangles mark the weight-samples corresponding to the classifiers shown in the left figure. In Fig. 5(c), we show the solutions found by SAM, bSAM, and an approximate Bayesian posterior. On this example, the posterior approximation found by bSAM is similar to the Bayesian one. Performing a Laplace approximation at the SAM solution leads to the covariance shown by the green dashed ellipse. It overestimates the extent of the posterior as it is based only on local information.

Figure 5: Posterior shapes for the logistic regression problem shown in Sec. 4.1.

F.2 COMPARISON ON A LOW-DIMENSIONAL CLASSIFICATION PROBLEM

In Fig. 6, we compare the predictive uncertainties of SAM and bSAM on the two-moons classification problem. We fit a small network (2-24-12-12-1). For SAM+Laplace, we consider a diagonal GGN approximation to the Hessian (which was indefinite at the SAM solution) and use the linearized model for predictions, as suggested by Immer et al. (2021).

Figure 6: bSAM improves SAM's predictive uncertainty. Using a Laplace approximation at the SAM solution (panel (c), SAM + Laplace) can lead to an overestimation of the predictive uncertainty.

F.3 EFFECT OF USING f** INSTEAD OF f

In this section, we empirically check the effect of using $f^{**}$ in the Bayesian objective instead of $f$. To that end, we compute an exact solution to the relaxed problem as well as the original problem (2) for a full Gaussian variational family.

For small problems, gradients of the biconjugate function can be easily approximated using convex optimization. To see this, notice that

$$\nabla f^{**}(\mu) = \arg\min_{\lambda^* \in \Omega}\; f^*(\lambda^*) - \langle \lambda^*, \mu\rangle = \arg\min_{\lambda^* \in \Omega,\, c \in \mathbb{R}}\; c - \langle \lambda^*, \mu\rangle \quad \text{s.t. } c \ge f^*(\lambda^*). \quad (32)$$

This is an optimization problem with a linear cost function and convex constraints. To implement the constraints, we approximate the loss by a maximum over linear functions, which is always possible for convex losses.
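As a small illustration of this step (assuming NumPy; the function names are ours, not from the paper), a convex loss can be lower-bounded by the pointwise maximum of its tangent planes, and the bound tightens as more points are added:

```python
# Minimal sketch: approximate a convex loss from below by the maximum of
# its tangents at a set of points. Illustrative only.
import numpy as np

def max_of_tangents(loss, loss_grad, points):
    tangents = [(loss(p), loss_grad(p), p) for p in points]
    def approx(theta):
        # Each tangent l(p) + <grad l(p), theta - p> under-estimates a convex
        # loss, so their pointwise maximum is a convex lower bound.
        return max(l + g @ (theta - p) for l, g, p in tangents)
    return approx

# Example with the convex quadratic loss l(x) = x'x:
f = max_of_tangents(lambda x: x @ x, lambda x: 2 * x,
                    [np.array([t]) for t in (-1.0, 0.0, 1.0)])
```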
In our case, the linear functions are first-order approximations of the loss at $L$ points $\theta_i \sim q_\mu$ drawn from the current variational distribution. Under this approximation, the constraint in (32) can be written as $L$ separate constraints for $i \in \{1, \ldots, L\}$ as follows:

$$\sup_{\theta \in \Theta}\; \langle\theta, \lambda_1^*\rangle + \theta^\top \lambda_2^*\, \theta + \ell(\theta_i) + \langle\nabla\ell(\theta_i),\; \theta - \theta_i\rangle \;\le\; c$$
$$\Longleftrightarrow\quad -\tfrac{1}{4}\,\big(\lambda_1^* + \nabla\ell(\theta_i)\big)^\top \lambda_2^{*\,-1} \big(\lambda_1^* + \nabla\ell(\theta_i)\big) \;\le\; c - \ell(\theta_i) + \langle\theta_i, \nabla\ell(\theta_i)\rangle. \quad (33)$$

The convex program (32) with convex constraints (33) is then solved using the software package CVXPY (Diamond & Boyd, 2016) to obtain the gradient of the biconjugate function. Having the gradients of the biconjugate available, our posterior is obtained by the Bayesian learning rule of Khan & Rue (2021),

$$\lambda \leftarrow (1 - \alpha)\lambda + \alpha \begin{cases} \lambda_0 + \nabla f(\mu), & \text{for Bayes (2)}, \\ \lambda_0 + \nabla f^{**}(\mu), & \text{for relaxed-Bayes (11)}. \end{cases} \quad (34)$$

Fig. 7 compares both solutions obtained by iterating (34) on the logistic regression problem described in App. F.1. We can see that the relaxation induces a lower bound as expected, but the relaxed solution is still a reasonable approximation to the true posterior.

Figure 7: Fig. 7(a) compares the posterior approximations obtained by solving (2) and our lower bound (11) for a full Gaussian variational family. Fig. 7(b) shows the gap in the evidence lower bound (ELBO) induced by the relaxation.

F.4 SENSITIVITY OF BSAM TO THE CHOICE OF ρ

We show the sensitivity of all considered SAM variants to the hyperparameter $\rho$ in Fig. 8. As the bSAM algorithm learns the variance vector, the method is expected to be less sensitive to the choice of $\rho$. We confirm this here for a ResNet-20 on the CIFAR-100 dataset without data augmentation.

Figure 8: Sensitivity of SAM variants to $\rho$; four panels plot Accuracy (↑), NLL (↓), ECE (↓), and AUROC (↑) against $\rho \in [10^{-4}, 10^{0}]$ for SAM-SGD, SAM-Adam, and bSAM with a ResNet-20-FRN on CIFAR-100. For bSAM, we plot the markers at $100\rho$ to roughly align the minima/maxima of the curves. The proposed bSAM method adapts to the shape of the loss and is overall more robust to a misspecified parameter $\rho$, while also giving the overall best performance on all four metrics.

F.5 ABLATION STUDY: EFFECT OF GAUSSIAN SAMPLING (LINE 3 IN ALG. 1)

In the derivation of the bSAM algorithm in Sec. 3 we performed a local linearization in order to approximate the inner problem. There, we claim that performing this local linearization not at the mean but rather at a sampled point is preferable from a Bayesian viewpoint. A concurrent work by Liu et al. (2022) also shows that combining SAM with random sampling improves the performance. Table 3 below confirms that the noisy linearization helps in practice.
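In code, this ablation amounts to a one-line change in where the inner gradient is evaluated; a minimal sketch in the notation of the earlier snippets (names again illustrative):

```python
# Minimal sketch of the ablation: evaluate the inner gradient at a Gaussian
# sample (bSAM, Alg. 1) or at the mean ("bSAM w/o Noise").
import numpy as np

def inner_gradient(omega, s_hat, minibatch_grad, N, rng, use_noise=True):
    if use_noise:
        theta = omega + rng.standard_normal(omega.shape) / np.sqrt(N * s_hat)
    else:
        theta = omega
    return minibatch_grad(theta)
```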
Model / Data                Method           Accuracy ↑    NLL ↓          ECE ↓           AUROC ↑
MLP / MNIST                 bSAM w/o Noise   98.67(0.04)   0.041(0.0011)  0.0030(0.0007)  0.981(0.001)
                            bSAM (Alg. 1)    98.78(0.06)   0.038(0.0012)  0.0024(0.0004)  0.982(0.001)
                            Improvement      +0.11         +0.003         +0.0006         +0.001
LeNet-5 / FMNIST            bSAM w/o Noise   92.01(0.11)   0.23(0.002)    0.0128(0.0021)  0.918(0.002)
                            bSAM (Alg. 1)    92.08(0.15)   0.22(0.005)    0.0066(0.0022)  0.920(0.003)
                            Improvement      +0.07         +0.01          +0.0062         +0.002
ResNet-20-FRN / CIFAR-10    bSAM w/o Noise   87.74(0.28)   0.38(0.009)    0.0251(0.006)   0.896(0.004)
                            bSAM (Alg. 1)    88.72(0.24)   0.34(0.005)    0.0163(0.002)   0.903(0.003)
                            Improvement      +0.98         +0.04          +0.0088         +0.007
ResNet-20-FRN / CIFAR-100   bSAM w/o Noise   60.30(0.33)   1.44(0.009)    0.0226(0.003)   0.832(0.002)
                            bSAM (Alg. 1)    62.64(0.34)   1.32(0.015)    0.0311(0.002)   0.841(0.003)
                            Improvement      +2.34         +0.12          -0.0085         +0.008

Table 3: Performing the loss-linearization at the noisy point rather than at the mean ("bSAM w/o Noise") typically improves the results with respect to all metrics (↑: higher is better, ↓: lower is better).

F.6 ABLATION STUDY: EFFECT OF BAYESIAN MODEL AVERAGING

In Table 4 we show how the predictive performance and uncertainty of bSAM depend on the number of MC samples used to approximate the integral in the Bayesian marginalization.

Model / Data                Samples    Accuracy ↑    NLL ↓          ECE ↓           AUROC ↑
ResNet-20-FRN / CIFAR-10    0 (mean)   88.34(0.22)   0.392(0.007)   0.0538(0.002)   0.9025(0.005)
                            8          88.43(0.21)   0.349(0.006)   0.0212(0.002)   0.9038(0.005)
                            64         88.59(0.20)   0.339(0.005)   0.0186(0.001)   0.9056(0.005)
                            512        88.69(0.20)   0.338(0.005)   0.0173(0.002)   0.9047(0.005)
                            2048       88.63(0.20)   0.338(0.005)   0.0173(0.001)   0.9058(0.005)

Table 4: Increasing the number of MC samples used to approximate the marginal Bayesian predictive distribution tends to improve the performance in terms of NLL and ECE.

F.7 ABLATION STUDY: DEPENDENCE OF SAM AND BSAM ON m-SHARPNESS

It has been observed in Foret et al. (2021) that the performance of SAM depends on the number m of sub-batches the minibatch is split into; this phenomenon is referred to as m-sharpness. From a theoretical viewpoint, a larger value of m corresponds to a looser lower bound for bSAM, as the sum of biconjugates is always smaller than the biconjugate of the sum. Here we investigate the dependence of SAM and bSAM on the parameter m for a ResNet-20-FRN on CIFAR-100 without data augmentation. The other parameters are chosen as before, and the batch size is set to 128. The results are summarized in Table 5.

Algorithm   m    Accuracy ↑    NLL ↓         ECE ↓          AUROC ↑
bSAM        1    61.38(0.27)   1.39(0.007)   0.0264(0.002)  0.836(0.003)
            2    61.47(0.46)   1.38(0.002)   0.0233(0.003)  0.837(0.003)
            4    62.35(0.32)   1.34(0.014)   0.0208(0.003)  0.837(0.003)
            8    62.81(0.57)   1.31(0.017)   0.0327(0.004)  0.842(0.003)
            16   63.31(0.32)   1.30(0.011)   0.0592(0.003)  0.840(0.003)
SAM-SGD     1    55.88(0.81)   1.82(0.032)   0.1501(0.004)  0.820(0.005)
            2    56.66(0.74)   1.76(0.041)   0.1421(0.006)  0.819(0.005)
            4    56.53(1.09)   1.73(0.058)   0.1289(0.006)  0.810(0.003)
            8    58.42(0.66)   1.60(0.030)   0.1017(0.005)  0.825(0.003)
            16   59.51(0.67)   1.51(0.040)   0.0681(0.006)  0.829(0.004)

Table 5: Effect of the m-sharpness sub-batch size for SAM-SGD and bSAM with a ResNet-20-FRN on CIFAR-100 without data augmentation. As the effective minibatch size decreases, all performance metrics (accuracy and uncertainty) tend to improve. This boost in performance is not captured by our theory, and understanding it is an interesting direction for future work.

F.8 EXPERIMENTS ON MNIST

In Table 6 we compare bSAM to several baselines for smaller neural networks on MNIST and Fashion-MNIST. Also in these smaller settings, bSAM performs competitively with the state of the art.

G CHOICE OF HYPERPARAMETERS AND OTHER DETAILS

G.1 EXPERIMENTS IN TABLE 1 AND TABLE 6

In the following, we list the details of the experimental setup. First, we provide some general remarks, followed by tables of the detailed parameters used on each dataset.
For all experiments, the hyperparameters are selected using a grid search over a moderate number of configurations to find the best validation accuracy. Our neural network outputs the natural parameters of the categorical distribution as a minimal exponential family (number of classes minus one output neurons). The loss function is the negative log-likelihood. We always use a batch size of B = 128. For SAM-SGD, SAM-Adam, and bSAM we split each minibatch into m = 8 sub-batches; for VOGN we set m = 16 and consider independently computed perturbations for each sub-batch. Finally, to demonstrate that bSAM does not require an excessive amount of tuning and that its hyperparameters transfer well, we use the same hyperparameters found on CIFAR-10 for the CIFAR-100 experiments. We summarize all hyperparameters in Table 7.

Model / Data       Method        Accuracy ↑    NLL ↓          ECE ↓           AUROC ↑
MLP / MNIST        SGD           98.63(0.06)   0.044(0.0012)  0.0028(0.0007)  0.979(0.002)
                   SAM-SGD       98.82(0.03)   0.039(0.0005)  0.0031(0.0002)  0.978(0.003)
                   SWAG          98.65(0.04)   0.044(0.0002)  0.0028(0.0005)  0.976(0.003)
                   VOGN          98.54(0.03)   0.046(0.0008)  0.0033(0.0007)  0.980(0.002)
                   Adam          98.41(0.06)   0.050(0.0012)  0.0036(0.0007)  0.979(0.002)
                   SAM-Adam      98.58(0.03)   0.046(0.0009)  0.0044(0.0005)  0.980(0.002)
                   bSAM (ours)   98.78(0.06)   0.038(0.0011)  0.0024(0.0004)  0.982(0.001)
LeNet-5 / FMNIST   SGD           91.37(0.27)   0.32(0.007)    0.0429(0.0015)  0.897(0.004)
                   SAM-SGD       91.95(0.17)   0.22(0.004)    0.0062(0.0007)  0.917(0.002)
                   SWAG          91.38(0.27)   0.31(0.005)    0.0397(0.0026)  0.901(0.005)
                   VOGN          91.24(0.25)   0.24(0.004)    0.0071(0.0012)  0.916(0.002)
                   Adam          91.14(0.25)   0.33(0.005)    0.0450(0.0008)  0.897(0.005)
                   SAM-Adam      91.66(0.19)   0.25(0.004)    0.0225(0.0026)  0.913(0.002)
                   bSAM (ours)   92.10(0.26)   0.22(0.005)    0.0066(0.0022)  0.920(0.002)

Table 6: Comparison on small datasets (↑: higher is better, ↓: lower is better).

MNIST. The architecture is a 784-500-300-9 fully-connected multilayer perceptron (MLP) with ReLU activation functions. All methods are trained for 75 epochs. We use a cosine learning-rate decay scheme, annealing the learning rate to zero. The hyperparameters of the individual methods are shown in Table 7. For SWAG, we run SGD for 60 epochs (with the same parameters) and then collect SWAG statistics with fixed α = 0.01, β₁ = 0.95. For all SWAG runs, we set the rank to 20 and collect a sample every 100 steps for 15 epochs.

Fashion-MNIST. The architecture is a LeNet-5 with ReLU activations. We use a cosine learning-rate decay scheme, annealing the learning rate to zero. We train all methods for 120 epochs. For SWAG, we run SGD for 105 epochs and collect for 15 epochs with α = 0.001 and β₁ = 0.8.

CIFAR-10/100. The architecture is a ResNet-20 with filter response normalization nonlinearity, adopted from Izmailov et al. (2021). The ResNet-20 has the same number of parameters as the one used in the original publication (He et al., 2016). We train for 180 epochs and decay the learning rate by a factor of 10 at epochs 100 and 120. For SWAG, we run SGD for 165 epochs and collect for 15 epochs with α = 0.0001, β₁ = 0.9.

G.2 EXPERIMENTS IN TABLE 2

The bSAM method depends on the number of data samples N. To account for the random cropping and horizontal flipping, we rescale this number by a factor of 4, which improved the performance. This corresponds to a tempered posterior, as also suggested in Osawa et al. (2019). In all experiments, all methods use a cosine learning-rate schedule which anneals the learning rate to zero across 180 epochs.
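For completeness, a minimal sketch of the cosine annealing rule assumed above (the function name is ours):

```python
# Minimal sketch of a cosine learning-rate schedule annealed to zero.
import math

def cosine_lr(base_lr: float, epoch: int, total_epochs: int = 180) -> float:
    """Decay base_lr to zero over total_epochs following a half cosine:
    returns base_lr at epoch 0 and 0 at epoch total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```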
The batch size is again chosen as B = 128. For an overview of all the hyperparameters, please see Table 8.

Configuration              Method     α       β₁     Nδ    ρ       β₂     γ
MNIST / MLP                SGD        0.1     0.95   24    —       —      —
                           VOGN       0.0005  0.95   25    —       0.999  0.05
                           Adam       0.0005  0.8    24    —       ∗      —
                           SAM-SGD    0.1     0.95   30    0.05    —      —
                           SAM-Adam   0.0005  0.8    24    0.02    ∗      —
                           bSAM       0.1     0.8    15    0.01    ∗      ∗
Fashion-MNIST / LeNet-5    SGD        0.01    0.8    60    —       —      —
                           VOGN       0.0001  0.95   25    —       0.99   0.05
                           Adam       0.001   0.9    60    —       ∗      —
                           SAM-SGD    0.05    0.8    60    0.05    —      —
                           SAM-Adam   0.001   0.9    60    0.02    ∗      —
                           bSAM       0.1     0.95   50    0.002   ∗      ∗
CIFAR-10 / ResNet-20-FRN   SGD        0.1     0.9    25    —       —      —
                           VOGN       0.001   0.9    25    —       0.999  0.01
                           Adam       0.001   0.9    250   —       ∗      —
                           SAM-SGD    0.05    0.9    25    0.05    —      —
                           SAM-Adam   0.0003  0.9    25    0.1     ∗      —
                           bSAM       0.1     0.9    25    0.001   ∗      ∗
CIFAR-100 / ResNet-20-FRN  SGD        0.1     0.9    30    —       —      —
                           VOGN       0.001   0.9    25    —       0.999  0.01
                           Adam       0.001   0.9    250   —       ∗      —
                           SAM-SGD    0.05    0.9    25    0.05    —      —
                           SAM-Adam   0.0003  0.9    25    0.1     ∗      —
                           bSAM       0.1     0.9    25    0.001   ∗      ∗

Table 7: Hyperparameters used in the experiments without data augmentation. "—" denotes that the method does not use that hyperparameter; "∗" indicates a hyperparameter fixed across all datasets.

Dataset        Method     α      β₁    Nδ    ρ        β₂   γ
CIFAR-10       SGD        0.1    0.95  10    —        —    —
               Adam       0.005  0.7   2.5   —        ∗    —
               SAM-SGD    0.03   0.95  25    0.01     —    —
               SAM-Adam   0.001  0.8   10    0.1      ∗    —
               bSAM       1.0    0.95  2     0.01     ∗    ∗
CIFAR-100      SGD        0.03   0.95  25    —        —    —
               Adam       0.005  0.9   1     —        ∗    —
               SAM-SGD    0.2    0.8   10    0.02     —    —
               SAM-Adam   0.005  0.7   1     0.2      ∗    —
               bSAM       1.0    0.95  2     0.01     ∗    ∗
TinyImageNet   SGD        0.1    0.95  20    —        —    —
               Adam       0.002  0.9   20    —        ∗    —
               SAM-SGD    0.1    0.95  50    0.01     —    —
               SAM-Adam   0.001  0.95  10    0.1      ∗    —
               bSAM       0.1    0.9   25    0.00005  ∗    ∗

Table 8: Hyperparameters used in the experiments with data augmentation shown in Table 2. "—" denotes that the method does not use that hyperparameter; "∗" indicates a hyperparameter fixed across all datasets.

G.3 RESNET-18 EXPERIMENTS IN TABLE 2

We trained all methods over 180 epochs with a cosine learning-rate scheduler, annealing the learning rate to zero. The batch size is set to 200, and both SAM and bSAM use m-sharpness with m = 8.

Dataset     Method    α      β₁    Nδ    ρ      β₂   γ
CIFAR-10    SGD       0.03   0.9   25    —      —    —
            SAM-SGD   0.03   0.9   25    0.05   —    —
            bSAM      0.5    0.9   10    0.01   ∗    ∗
CIFAR-100   SGD       0.03   0.9   25    —      —    —
            SAM-SGD   0.03   0.9   25    0.05   —    —
            bSAM      0.5    0.9   10    0.01   ∗    ∗

Table 9: Hyperparameters used in the ResNet-18 experiments of Table 2. "—" denotes that the method does not use that hyperparameter; "∗" indicates a hyperparameter fixed across all datasets.

H TABLE OF NOTATIONS

The following Table 10 provides an overview of the notation used throughout the paper. Multidimensional quantities are written in boldface; scalars and scalar-valued functions are in regular face.

Symbol            Description
⟨·, ·⟩            An inner product on a finite-dimensional vector space.
‖·‖               The ℓ₂-norm.
f*, f**           Fenchel conjugate and biconjugate of the function f.
θ                 Parameter vector of the model, e.g., the neural network.
T(θ)              Sufficient statistics of an exponential-family distribution over the parameters.
µ, µ*, µ₀         Expectation (or mean/moment) parameters of an exponential family.
λ, λ*, λ₀         Natural parameters of an exponential family.
Ω                 Space of valid natural parameters.
M                 Space of valid expectation parameters.
qµ                Exponential-family distribution with moments µ.
qλ                Exponential-family distribution with natural parameters λ.
A(λ)              Log-partition function of the exponential family in natural parameters.
A*(µ)             Entropy of qµ, the convex conjugate of the log-partition function.
ω, v              Mean and variance of a normal distribution, corresponding to µ and λ.
m, σ²             Mean and variance of another normal, corresponding to µ* and λ*.
N(e | m, σ²I)     e follows a normal distribution with mean m and covariance σ²I.
δ                 Precision of an isotropic Gaussian prior.

Table 10: Notation.