# tilted_sharpnessaware_minimization__d81b2e2f.pdf

Tilted Sharpness-Aware Minimization

Tian Li 1 Tianyi Zhou 2 Jeffrey A. Bilmes 3

Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a smoothed generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that incur larger losses. TSAM is parameterized by a tilt hyperparameter t and reduces to SAM as t approaches infinity. We show that TSAM is smoother than SAM and thus easier to optimize, and it explicitly favors flatter minima. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks.

1. Introduction

Empirical risk minimization (ERM) is a classic framework for machine learning that optimizes for the average performance of the observed samples. For n training samples {xi}i [n] (which may also contain label information), model parameters θ Rd, and a loss function l( ), let ERM

1University of Chicago 2University of Maryland, College Park 3University of Washington. Correspondence to: Tian Li <litian@uchicago.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

be defined as

min θ L(θ) := 1

i [n] l(xi; θ). (1)

In overparameterized models, however, minimizing ERM may arrive at a bad local minimum. To address this, one line of work focuses on minimizing the sharpness of final solutions, ensuring that the losses of parameters around local minima are uniformly small. One popular formulation is sharpness-aware minimization (SAM), that optimizes over the worst-case loss over perturbed parameters (Foret et al., 2020). For a perturbing region ϵ ρ where ϵ Rd, the canonical SAM objective is defined as

min θ Ls(θ) := max ϵ ρ L(θ + ϵ). (2)

Typically, SAM is optimized by alternating between running gradient ascent (to find the max loss) and gradient descent steps (to minimize the max loss) on model parameters (e.g., Foret et al., 2020; Andriushchenko & Flammarion, 2022). However, it is difficult for such algorithms (and its variants) that rely on one or few steps of gradient ascent to find the exact perturbation ϵ that incurs the true max loss, as the loss landscape can be highly non-convex and potentially nonsmooth. Despite recent advancements on approximately solving the SAM objective (e.g., Liu et al., 2022b;a; Zhuang et al., 2022), the min-max formulation itself overlooks many neighborhood regions that may also result in large losses, leaving some loss surface near local minima sharp. For instance, we have computed the average loss across the neighborhoods of SAM solutions, and find that it is still higher than the ones obtained by our approach (Section 5.2).

To this end, we propose a generalized and smoothed variant of SAM inspired by exponential tilting and its widespread usage in probability and statistics. In optimization literature, it has also been used as an efficient min-max smoothing operator (Kort & Bertsekas, 1972). Tilted SAM (TSAM), parameterized by a tilt scalar t 0, is defined as

min θ Lt(θ) := 1

t log Z et L(θ+ϵ)dµ(ϵ)

t log Eµ(ϵ) h et L(θ+ϵ)i , (3)

where L(θ + ϵ) is defined in Eq. (1). µ(ϵ) denotes an uncertainty probability measure for ϵ Rd that can represent

Tilted Sharpness-Aware Minimization

uniform balls such as ϵ ρ (but other measures are possible as well). When t and µ(ϵ) takes ϵ ρ, Lt(θ) reduces to the SAM objective Ls(θ). When t = 0, Lt(θ) reduces to the average loss over the perturbed neighborhood µ(ϵ), i.e., Eµ(ϵ)[L(θ + ϵ)] where the expectation is taken with respect to the randomness of ϵ (formally proved in Appendix B). When we set t = 0 and ρ = 0, the TSAM loss is reduced to the classic average empirical risk L(θ). We may use Eµ(ϵ), Eϵ, and E interchangeably when the meaning is clear from the context.

TSAM provides a smooth transition between minmax optimization (Eq. (2)) and min-avg optimization minθ Eµ(ϵ)[L(θ + ϵ)]. The min-avg optimization has appeared in prior works known as average-perturbed sharpness (Wen et al., 2022), noise-perturbed loss (Zhang et al., 2024), or random smoothing (Duchi et al., 2012). The smoothness parameter of the TSAM objective increases as the value of t increases, which suggests that it is easier to optimize than SAM (Section 3). As we formalize later, TSAM reweights gradients of neighboring solutions based on their loss values, which can be viewed as a soft version of SAM that assigns all the weights to one single worst minimum. In addition to the benefits in optimization, rigorously considering many, as opposed to one, neighbourhood parameters that incur large losses can result in improved generalization. We provide both theoretical characterization and empirical evidence showing that TSAM solutions are flatter than those of ERM and SAM. One line of the mostly related works have explored tilted risks to reweight different data points (Li et al., 2023; Robey et al., 2022). In this work, we use the TSAM framework to assign varying priority to local minima in the parameter space.

To solve TSAM, we need to estimate the integral over µ(ϵ) (Eq. (3)), or equivalently, to estimate the full gradient of the objective, which is a tilted aggregation of gradients evaluated at L(θ + ϵ). Both require sampling the perturbation ϵ with probability proportional to et L(θ+ϵ) for the integration. Naively sampling ϵ at random to obtain Lt(θ) would be inefficient, as it is likely that L(θ + ϵ) under the sampled ϵ is small and therefore we need many samples to converge to the true distribution. On the other hand, methods based on Hamiltonian Monte Carlo (HMC) (Leimkuhler & Reich, 2004) are guaranteed to arrive at the exact distribution. Inspired by the Euler s rules for HMC, we develop an algorithm to efficiently sample ϵ s and estimate the true gradient of Lt(θ). Our contributions are summarized as follows.

Contributions. We propose TSAM, a new sharpnessaware optimization objective that reweights the parameters around local minima via exponential tilting. We rigorously study several properties of TSAM, showing that it always favors flatter solutions as t increases (Section 3). To optimize TSAM, we adapt a specific HMC algorithm to efficiently

sample the model perturbation ϵ (Section 4). We empirically demonstrate that TSAM results in flatter solutions and superior generalization performance than SAM and its variants for deep neural networks including transformers on both image and text datasets (Section 5).

2. Related Work

Sharpness-Aware Minimization. SAM regularizes overparameterized models by considering adversarial data points that have large training errors (Foret et al., 2020; Zheng et al., 2021). The SAM variants, training dynamics, and applications in different models have been extensively studied in prior work (Long & Bartlett, 2023; Foret et al., 2020; Bartlett et al., 2023; Andriushchenko & Flammarion, 2022; Chen et al., 2024; Kwon et al., 2021; Chen et al., 2021; Liu et al., 2022b; Zhou et al., 2021; Du et al., 2022; Baek et al., 2024; Zhao et al., 2022; Mi et al., 2022; Zhuang et al., 2022; Mueller et al., 2023; Xie et al., 2024; Tahmasebi et al., 2024). Notably, TSAM objective can be viewed as a special case of a general sharpness-aware approach (Tahmasebi et al., 2024), though Tahmasebi et al. (2024) do not specifically study TSAM. Some work aim to improve efficiency of the SAM algorithm studying different relaxations (Du et al., 2022; Liu et al., 2022a). Zhao et al. (2022) use a linear interpolation between normal gradients and SAM outer gradients evaluated at the max-loss parameter, which does not take into account the possibly many bad local minima for highly non-convex problems. Zhou et al. (2021) perform sample-wise reweighting for SAM, as opposed to parameter-wise reweighting proposed herein. Liu et al. (2022b) improve the inner max optimization by adding a random perturbation to the gradient ascent step to smoothen its trajectory. Li & Giannakis (2023) leverage a moving average of stochastic gradients in the ascent direction to reduce the gradient variance1. Our goal is not to better approximate the inner max or develop algorithms for solving the min-max SAM formulation, but rather, to solve a different TSAM objective that reweights many local minima to seek flat solutions. Nevertheless, we still compare with more advanced algorithms for the SAM objective and show the superiority of TSAM solutions (Section 5). As TSAM is a new objective, in principle, we can readily apply many existing optimization techniques (that can be potentially applied to SAM as well) such as variance reduction (Johnson & Zhang, 2013), acceleration (Nesterov, 1983), or adaptivity (Streeter & Mc Mahan, 2010; Duchi et al., 2011; Kingma & Ba, 2014) on top of the tilted stochastic gradients to gain further improvement.

1Note that the notion of variance in VASSO (Li & Giannakis, 2023) refers to the variance of stochastic gradients compared with full gradients; whereas in TSAM, we examine the loss variance around the neighborhood regions, where the randomness in variance (Definition 3.4) comes from the perturbation ϵ.

Tilted Sharpness-Aware Minimization

There are also works exploring why SAM leads to better generalization or theoretically studying what SAM (and its implementation) is effectively minimizing (Chen et al., 2024; Andriushchenko & Flammarion, 2022; Long & Bartlett, 2023; Wen et al., 2022). In this work, we prove that TSAM (and SAM) encourages flatter models for a class of problems including generalized linear models, where flatness (or sharpness) is characterized by the variance of the losses around the minima (Definition 3.5). Our proposed TSAM framework is particularly suitable for problems where flatness helps generalization. The various notions of sharpness, along with theoretical relations between sharpness and generalization still remain an open problem (Andriushchenko et al., 2023; Wen et al., 2024; Ding et al., 2024; Tahmasebi et al., 2024), which is outside the scope of our paper.

Tilting in Machine Learning. Exponential tilting, used to shift parametric distributions, has appeared in previous literature in importance sampling, optimization, control, and information theory (e.g., Siegmund, 1976; Kort & Bertsekas, 1972; Dembo, 2009; Aminian et al., 2024; Whittle, 2002; 1981). Recently, the idea of tilted risk minimization (which exponentially reweights different training samples) has been explored in machine learning applications such as enforcing fairness and robustness, image segmentation, and noisy label correction (Li et al., 2023; Robey et al., 2022; Zhou et al., 2020; Szab o et al., 2021; Aminian et al., 2024). A closely-related Log Sum Exp operator is often used to as an smooth approximation to the max, which is always considered more computationally favorable (Kort & Bertsekas, 1972; Calafiore & El Ghaoui, 2014; Shen & Li, 2010; Li et al., 2023). One application of tilted risks applied to the adversarial training problem is to balance worst-case robustness (i.e., adversarial robustness) and average-case robustness in the data space (Robey et al., 2022), among other approaches that can also achieve a transition between worst-case and average-case errors (Rice et al., 2021). Our work is similar conceptually, but we consider reweighting adversarial model parameters, instead of adversarial data points. Compared with SAM (optimizing the largest loss), the TSAM framework offers additional flexibility of optimizing over quantiles of losses given the connections between exponential tilting and quantile approaches (Rockafellar et al., 2000; Li et al., 2023).

3. Properties of TSAM

In this section, we discuss properties of the TSAM objective. We first state the convexity and smoothness of TSAM (Section 3.1). We then show that as t increases, the gap between less-flat and more-flat solutions of the TSAM objective becomes larger. In other words, optimizing TSAM would give flatter solutions as t increases (Section 3.2). Finally, we discuss the generalization behavior of TSAM and

prove that there exists t (0, ) that result in the tightest bound (Section 3.3). All properties discussed in this section hold regardless of the distributions of ϵ (i.e., choice of µ(ϵ)), unless otherwise specified.

3.1. Convexity and Smoothness

In this part, we connect the convexity and smoothness of TSAM with the convexity and smoothness of the ERM loss. We provide complete proofs in Appendix B. We first define a useful quantity (tilted weights) that will be used throughout this section.

Definition 3.1 (t-tilted weights). For a perturbed model parameter θ + ϵ, we define its corresponding t-tilted weight as wt(θ + ϵ) := et L(θ+ϵ)

E[et L(θ+ϵ)].

The weight of parameter θ + ϵ is exponentially proportional to the loss evaluated at this point. The expectation is with respect to the randomness of ϵ constrained by µ(ϵ). When t = 0, 0-tilted weights are uniform. When t , wt(θ+ϵ) focuses on the max loss among all possible {θ + ϵ}. Such weights have appeared in previous literature on importance sampling (Siegmund, 1976), but they are only applied to reweight sample-specific losses, as opposed to perturbationspecific parameters. Given tilted weights in Definition 3.1, we can present the TSAM gradients and Hessian as follows. Lemma 3.2 (Gradient and Hessian for TSAM). Assume L( ) is continuously differentiable. The full gradient of TSAM (Objective (3)) is

Lt(θ) = E[et L(θ+ϵ) L(θ + ϵ)]

E[et L(θ+ϵ)] = E[wt(θ + ϵ) L(θ + ϵ)].

The Hessian of TSAM 2Lt(θ) is

t E wt(θ + ϵ) L(θ + ϵ) L(θ + ϵ)

E wt(θ + ϵ) L(θ + ϵ) E wt(θ + ϵ) L(θ + ϵ)

+ E wt(θ + ϵ) 2L(θ + ϵ) . (4)

The gradient of TSAM can be viewed as reweighting the gradients of L(θ + ϵ) by the loss values et L(θ+ϵ). Examining the Hessian, we note that the first term is t multiplied by a positive semi-definite matrix, and the second term can be viewed as a reweighted Hessian of the original loss L evaluted at θ + ϵ.

It is not difficult to observe that if L(θ) is p-Lipschitz with respect to θ, then Lt(θ) is p-Lipschitz with respect to θ. If L is µ-strongly convex, then Lt is also µ-strongly convex (proof in Appendix B). Next, we show that the smoothness of TSAM scales linearly with t.

Lemma 3.3 (Smoothness of TSAM). Let L( ) be β-smooth and β is bounded. Then Lt(θ) is β(t)-smooth, where β(t) satisfies 0 < limt β(t)

Tilted Sharpness-Aware Minimization

That is, β(t) = O(t). The proof is deferred to Appendix B. In Lemma 3.3, we connect the smoothness of the tilted objective with the smoothness of the original ERM objective. We see that for any bounded t, the smoothness parameter is bounded. As t increases, TSAM becomes more difficult to optimize as the loss becomes more and more non-smooth. When t , β(t) . If we have access to unbiased gradient estimates at each round, it directly follows that the convergence of SAM (TSAM with t ) objective is slower than that of tilted SAM following standard arguments (Nesterov, 2013). To further visualize this, we create a one-dimensional toy problem in Appendix A, where we obtain the globally optimal solutions for each objective. We show that both SAM and TSAM are able to arrive at flat solutions; but the SAM objective is non-smooth, hence more difficult to optimize.

3.2. TSAM Prefers Flatter Models as t Increases

In this subsection, we focus on a specific class of models including generalized linear models (GLMs), where the loss function l(xi; θ) carries the form of l(xi; θ) = A(θ) θ T(xi), (5)

L(θ) = A(θ) θ

i [n] T(xi)

For GLMs, A(θ) is a convex function, and P

i [n] T(xi)T(xi) 0 (Wainwright & Jordan, 2008). Our results in this section apply to loss functions defined in Eq. (6), which subsume linear models. Before introducing the main theorem, we define two important quantities that will be used throughout this section, and in the experiments.

Definition 3.4 (t-weighted mean and variance). We define t-weighted mean of a random variable X as E[et XX]

E[et X] . Similarly, we define t-weighted variance of a random variable

X as E[et XX2]

E[et X] E[et XX]

E[et X] 2 .

When t = 0, these definitions reduce to standard mean E[X] and variance E[X2] (E[X])2. Similar tilted statistics definitions have also appeared in prior work (Li et al., 2023). We leverage weighted variance to define sharpness below.

Definition 3.5 (t-sharpness). We say that a model parameter is θ1 is t-sharper than θ2 if the t-weighted variance of L(θ1 + ϵ) (which is a random variable of loss distribution under model parameters perturbed by ϵ) is larger than the t-weighted variance of L(θ2 + ϵ).

Given the definition of sharpness above based on weighted variance, we are ready to prove that TSAM encourages flatter local minima as the increase of t. Empirically, in Section 5, we also plot the 0-sharpness of the solutions

obtained from different objectives, and observe that TSAM achieves smaller sharpness values (measured in terms of standard variance) than ERM and SAM. Proper definitions of sharpness is generally still an open problem, and other options are possible such as the trace of Hessian, gradient norms, and other functions over Hessian (Wen et al., 2022; Tahmasebi et al., 2024).

Theorem 3.6 (TSAM prefers flatter models as t increases). Assume L(θ) is given by Eq. (6) and L(θ) is continuously differentiable. For any θ1, θ2 Rd, let gt(θ1, θ2) := Lt(θ1) Lt(θ2). If θ1 is t-sharper than θ2, then gt(θ1,θ2)

For some θ1 sharper than θ2, it is possible that L(θ1) = L(θ2), which implies that ERM is not able to distinguish between the two solutions, while TSAM can. Furthermore, Theorem 3.6 indicates that as t increases, the TSAM objective favors θ2 more aggressively, as the gap between Lt(θ1) and Lt(θ2) grows larger. We explore a one-dimensional toy problem with many local minima: L(θ) = 2 sin(4πθ)/(2θ) + 0.005(θ 1)2, and focus on the area θ (0.2, 2.5) to visualize this behavior. We take ρ = 0.2 and a fixed learning rate 0.005 for all objectives. Each run starts from the same initialization θ = 0.5. In Figure 1, we see that as t increases, TSAM leads to flatter solutions, despite having larger objective values measured by L(θ). As a side note, we prove that for any θ, the objective value of Lt(θ) is monotonically increasing as t increases (Appendix B). Next, we discuss a special case when t is close to 0, where we provide another perspective on the TSAM behavior.

Discussions for t 0. All the results above hold for the small-t regime, where sharpness reduces to standard variance when t 0 (Definition 3.4). It still follows that gt(θ1,θ2)

t t 0 0 if θ1 is sharper than θ2. Here, we provide another interpretation of TSAM when t is close to zero. Similar statements have also appeared in prior works in a different context (e.g., Liu & Theodorou, 2019; Li et al., 2023). For a very small t, it holds that 1

t log E et L(θ+ϵ) E[L(θ + ϵ)] + t

2var (L(θ + ϵ)) + o(t2). We provide a proof in Appendix B. Hence, optimizing TSAM is approximately equivalent to optimizing for the mean plus variance of the losses under the perturbed parameters. When t = 0, it reduces to only optimizing for E[L(θ + ϵ)]. In other words, TSAM with t close to 0 is directly minimizing 0-sharpness (standard variance). For any θ1 and θ2 such that θ1 is sharper than θ2, we have

gt(θ1, θ2) E[L(θ1 + ϵ)] + t

2var (L(θ1 + ϵ))

E[L(θ2 + ϵ)] t

2var (L(θ2 + ϵ)) ,

2 (var (L(θ1 + ϵ)) var (L(θ2 + ϵ))) 0.

Tilted Sharpness-Aware Minimization

0.5 1.0 1.5 2.0 2.5 3

0.5 1.0 1.5 2.0 2.5 3

0.5 1.0 1.5 2.0 2.5 3

TSAM (t=20)

0.5 1.0 1.5 2.0 2.5 3

TSAM (t=100)

Figure 1. Optimization trajectories of different objectives (orange) with final solutions marked in star. The local minima gets flatter from left to right in each subfigure. We see that TSAM favors flat minima as t increases. ERM solution by gradient descent converges to a sharp minimal. We solve TSAM on this one-dimensional problem by sampling thousands of ϵ s to estimate the gradient of Lt(θ) at each step, which is infeasible for real problems.

This is a special case of Theorem 3.6 for t 0. It suggests that as we increase t from 0 for a small amount, the standard variance of neighborhood loss would reduce. Note that some recent works propose a noise-injected loss similar to TSAM with t 0 (Zhang et al., 2024). The proposed algorithm therein explicitly minimizes the trace of Hessian, which aligns with our arguments that the TSAM objective can lead to flatter solutions.

So far, we study properties of TSAM regarding convexity, smoothness of the objective, and sharpness of the resulting solutions. We note that these properties of the objective are independent of the actual optimization algorithms used to optimize TSAM. Though Theorem 3.6 implies similar benefits of both TSAM and SAM relative to ERM (assuming we optimize TSAM and SAM perfectly), Lemma 3.3 shows the superiority of the TSAM objective over SAM with unbounded smoothness parameters, as TSAM is easier to optimize. The performance of practical applications depends on both the properties of the objectives and the approximation algorithms used to solve them.

3.3. Generalization of TSAM

In this section, we give a uniform bound on the generalization error of our TSAM objective. By solving the tilted objective empirically, during test time, we are ultimately interested in evaluating the linear population risk EZ[l(θ; Z)] where Z denotes the underlying data distribution and {xi}i [n] Z. We define generalization error as the difference between population risk and our empirical objective value EZ[l(θ; Z)] 1

t log Eϵ[et L(θ+ϵ)], bounded as follows.

Theorem 3.7 (Generalization of TSAM). Assume losses are bounded as 0 L( ) M. Suppose we have n training data points. For any θ Θ and t 0, with probability 1 δ, the difference between population risk and empirical TSAM risk gent := EZ[l(θ; Z)] 1

t log Eϵ[et L(θ+ϵ)] satisfies

2n varϵ(et L(θ+ϵ))

2te2t M + c, (7)

where c = L(θ) Eϵ[L(θ + ϵ)] is a constant unrelated to t.

We defer the proof to Appendix B, where we build upon existing generalization results of a related objective (Aminian et al., 2024). From Theorem 3.7, we see that when the sample space of ϵ is empty, our result reduces to EZ[l(θ, Z)]

2n , scaling at a rate of 1 n consistent with standard uniform bound on the average risk (Shalev Shwartz & Ben-David, 2014). When t and we define µ(ϵ) to be ϵ ρ over some distribution, the result gives an upper bound on the generalization of SAM:

EZ[l(θ, Z)] max ϵ ρ L(θ + ϵ) M q

Additionally, denote θTSAM and θERM as optimal solutions for TSAM (Eq. (3)) and ERM (Eq. (1)), respectively. For modest values of t, due to the negativity of varϵ(et L(θ+ϵ))

2te2t M , the upper bound of the linear population risk EZ[l(θTSAM, Z)] can be smaller than that of the linear risk EZ[l(θERM, Z)], as long as 1

t log Eϵ[et L(θTSAM+ϵ)]

varϵ(et L(θTSAM+ϵ))

2te2t M L(θERM). This implies that by solving TSAM, we can obtain a solution that results in a smaller upper bound of the linear population error than that of ERM.

4. Algorithms

In this section, we describe the algorithms we use to solve TSAM. The main challenge in solving TSAM is to sample ϵ to get a good estimator of Lt(θ), or equivalently, Lt(θ). We first describe a general approach where we use estimated tilted gradients (given sampled ϵ s) to update the model (Section 4.1). Then, we discuss how to sample ϵ s via a specific Hamiltonian Monte Carlo algorithm and present our method and implementation (Section 4.2).

4.1. General Algorithm

To solve TSAM, the primary challenge is to estimate the integral 1

t log R et L(θ+ϵ)dµ(ϵ) , or its full gradient

E[et L(θ+ϵ) L(θ+ϵ)]

E[et L(θ+ϵ)] , assuming gradient-based methods and

Tilted Sharpness-Aware Minimization

the differentiable loss L. A naive way is to first sample ϵ from µ(ϵ) following the pre-defined distribution (e.g., Gaussian or uniform) over µ(ϵ), and then perform tilted aggregation with weights proportional to et L(θ+ϵ). However, this approach may be extremely inefficient, as there could be an infinite set of perturbed model parameters with relatively small losses, which are not informative. In Figure 5 in the appendix, we empirically show that even when we sample a much larger number of ϵ s, the resulting accuracy is still worse than our proposed method. Instead, we propose to sample s number of ϵ s from distribution eδL(θ+ϵ) (denoted as {ϵj}j [s]), where 0 δ t. We then use these {ϵj}j [s] to obtain an empirical gradient estimation with weights proportional to {e(t δ)L(θ+ϵj)}j [s], as the full gradient is a tilted average of the original gradient on L( ). To improve sample efficiency, we use gradient-based methods such as Hamiltonian Monte Carlo (HMC) that simulates Hamiltonian dynamics (Leimkuhler & Reich, 2004). The structure of our proposed method is in Algorithm 1. Note that in principle, after estimating the tilted stochastic gradients, we can further apply existing optimization techniques such as variance reduction (Johnson & Zhang, 2013), acceleration (Nesterov), or adaptivity (Streeter & Mc Mahan, 2010; Duchi et al., 2011) to gain further improvement, which we leave for future work.

Algorithm 1 Tilted SAM Solver

Require: t, θ0, learning rate η, total iterations T, total number of samples s, 0 δ t 1: for i = 0, , T 1 do 2: Sample s random perturbations {ϵj}j [s] from distribution eδL(θi+ϵ) under the constraint characterized by µ(ϵ) via some HMC algorithm (Algorithm 3) 3: Update θi with the estimated gradient evaluated on the mini-batch:

j [s] e(t δ)L(θi+ϵj) L(θi + ϵj)

j [s] e(t δ)L(θi+ϵj)

4: end for 5: Return θT

4.2. Sampling ϵ

There could be potentially different algorithms for sampling ϵ where p(ϵ) eδL(θ+ϵ). Here we propose an approximate and cheap sampler based on discretization of Hamiltonian dynamics. Our method is inspired by one of the best-known way to approximate the solution to a system of differential equations, i.e., Euler s method or its modification (Neal et al., 2011). A more accurate solver like the leap-frog method might be more popular for HMC, but these come at an increased expense (Neal et al., 2011). As our goal to minimize computational cost, we stick with the cheaper

Euler s approach as follows. We first initialize ϵ0 from an L2 ball that satisfies ϵ ρ, and initialize the momentum p0 Rd from some Gaussian distribution, i.e., p0 N(0, σ2I). Note that the negative log probability density of the energy function U(ϵ) is log(eδL(θ+ϵ)) = δL(θ + ϵ). At each sampling step, we run the following steps for N iterations with a small step-size β to obtain a candidate ϵ:

p p + βδ ϵL(θ + ϵ), ϵ ϵ + βp/σ2. (8)

After obtaining a candidate ϵ, we accept ϵ with probability

min{1, eδL(θ+ϵ) p 2

2σ2 /eδL(θ+ϵ0) p0 2

2σ2 }. If the candidate ϵ is not accepted, we set (p, ϵ) to the initial point before the N iterations. Repeating the above for enough times would give us a sample ϵ from the exact distribution.

Algorithm 3 Sampling from eδL(θi+ϵ) where ϵ ρ

Require: θ0, total samples s, uncertainty ball radius ρ

1: for j = 0, , s do 2: Perturb θi with a random δj sampled from Gaussian or uniform distribution: θi j θi + δj 3: Run normalized SGD on the mini-batch data at θi j:

ˆθi j θi j + ρ L(θi j) L(θi j) ; ϵj ˆθi j θi

4: end for 5: Return {ϵj}j [s]

Generating one ϵ via HMC requires at least 2N gradient evaluations, which is infeasible for large-scale problems. Hence, we set N = 1 in all the main experiments, and meanwhile accept the generated ϵ with probability 1. For completeness, we evaluate the effects of increasing N in HMC in Appendix C. We observe that using N > 1 does not significantly improve the performance. Running equations in Eq. (8) for one step, if p is initialized as p = 0, we have ϵ ϵ+β L(θ +ϵ), where β is a constant. We adopt this updating rule in our problem, and run the aforementioned procedure in parallel for s times to get s samples. Our method is presented in Algorithm 3. Though Algorithm 3 does not guarantee the ϵj s result in a consistent estimator of the TSAM integral, we empirically showcase its effectiveness on non-convex models including transformers in the next section.

5. Experiments

In this section, we first describe our setup. Then we present our main results, comparing TSAM with the baselines of ERM (Eq. (1)), SAM (Eq. (2)), and SAM variants on both image and text data (Section 5.1). We explain TSAM s superior performance by empirically examining the flatness of local minima in Section 5.2. In Section 5.3, we discuss the

Tilted Sharpness-Aware Minimization

effects of hyperparameters. Our code is publicly available at github.com/litian96/TSAM.

Tasks and Datasets. We study image classification tasks with CNNs and transformers, and language modeling tasks with BERT. First, we explore training Res Net18 (He et al., 2016) and Wide Res Net16-8 (Zagoruyko, 2016) on CIFAR100 (Krizhevsky et al., 2009). Previous works show that SAM is robust to label noise (Foret et al., 2020; Baek et al., 2024); and we investigate this setting by training the same models on on CIFAR100 with label noise generated by substituting 20% of the true labels uniformly at random to other labels. Since vision transformers (Vi Ts) (Dosovitskiy et al., 2020) have been shown to result in sharper local minima than CNNs (Chen et al., 2021), we study the performance of finetuning Vi Ts (pretrained on Image Net (Deng et al., 2009)) on an out-of-distribution Describable Texture Dataset (DTD) (Cimpoi et al., 2014), where the task is 47-class classification. We also use Wide Res Net16-8 to train DTD from scratch. Additionally, we evaluate a 200-class classification task for Tiny Imagenet (Le & Yang, 2015) with Res Net18 and Res Net34 (He et al., 2016) models. Lastly, for text data, we study finetuning a pretrained Distil BERT (Sanh, 2019) model on the GLUE benchmark including both classification and regression problems.

Hyperparameter Tuning. We take µ(ϵ) to be ϵ ρ for all TSAM experiments, and tune the ρ parameters separately from {0.05, 0.1, 0.2} for relevant methods. For TSAM, we tune t from {0, 1, 5, 20, 100} and select the best one based on the validation set. We also report the performance for all t s in the next sections. We use s=3 or s=5 sampled ϵ s for all datasets and find that it works well. For some SAM variants that introduce additional hyperparameters, we tune those via grid search as well. The batch size is 64 for all the datasets and methods and a constant learning rate is tuned from {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1} for each algorithm. Despite the existence of adaptive methods for SGD and SAM (Kingma & Ba, 2014; Kwon et al., 2021), we do not use adaptivity for any algorithm for a fair comparison. See Appendix C for details on hyperparameter tuning.

5.1. TSAM Leads to Better Test Performance

We compare the performance of various objectives and algorithms in Table 1. ERM denotes minimizing the empirical average loss with mini-batch SGD. SAM is the vanilla SAM implementation with one step of gradient ascent and one step of gradient descent at each iteration (Foret et al., 2020). Note that TSAM requires more gradient evaluations per iteration. Hence, we include two additional baselines of SAM under the same computational budget as TSAM runs. (1) We simply run the vanilla SAM algorithm for more iterations until it reaches the same runtime as TSAM. (2) We try another SAM approximation by exploring different step sizes

along the gradient ascent directions and pick the one incurring the biggest loss. Then we evaluate the gradient under that step size to be applied to the original model parameters. We call these expensive SAM baselines ESAM1, and ESAM2, respectively. We also evaluate two more advanced sharpness-aware optimization methods: PGN that combines normal gradients and SAM gradients (Zhao et al., 2022), and Ramdom SAM (RSAM) which adds random perturbations before finding the adversarial directions (Liu et al., 2022b). We let PGN and RSAM run the same amount of time as TSAM on the same computing platform. On all the datasets, we tuned t values via grid search from {0, 1, 5, 20, 100}.

Our results are shown in Table 1. The performance for all t s on three image datasets and different model architectures are reported in Section 5.3. For the GLUE benchmark, we report the standard metrics for each dataset in GLUE. TSAM consistently achieves higher test performance than ERM and variants of SAM. We provide corresponding convergence plots of ERM, vanilla SAM, and TSAM in Appendix C.

5.2. Flatness of TSAM Solutions

In this part, we take a more detailed look into the properties of TSAM solutions compared with the ones of ERM and SAM on the CIFAR100 dataset trained by Res Net18 from scratch. In Figure 2, we plot the loss mean and variance over the neighborhood areas around local minima obtained by different objectives, i.e., E[L(θ +ϵ)] and var[L(θ +ϵ)], where ϵ N(0, δ2), and θ denotes the different solutions of any objective (with a slight abuse of notation). These measurements have appeared in prior works named averageloss sharpness (Wen et al., 2024; Chen et al., 2021), and are consistent with our sharpness definition (Definition 3.5) mentioned before. In Figure 2, for all δ values, we see that TSAM consistently result in flatter local minima than ERM and SAM measured by both the mean and variance of losses around the minima. In addition, we evaluate sharpness following other common notions by investigating the top-5 eigen values of Hessian (e.g., Foret et al., 2020). Under the same model setup, the top-5 eigenvalues are {342.11, 304.72, 260.71, 252.92, 210.88} for ERM, {232.60, 198.35, 182.61, 153.74, 145.76} for SAM, and {140.91, 113.38, 105.90, 92.94, 89.55} for TSAM (t=20). We see that TSAM achieves the smallest max eigenvalues among the three.

We further report the training and test performance of besttuned ERM, SAM, and TSAM in Table 3 in the appendix. We show that ERM solutions have lower training losses but higher test losses than SAM and TSAM when evaluated on the average test performance (i.e., the ERM column in the right table). This is due to the fact that ERM does not generalize as well as SAM or TSAM, and there exist bad sharp local minima around ERM solutions. On the other hand, while TSAM s average training loss is the highest

Tilted Sharpness-Aware Minimization

Table 1. TSAM achieves higher test performance relative to ERM and different variants of SAM across image datasets and the GLUE benchmark with both CNNs and transformers. TSAM (or SAM) is particularly suitable for applications with distribution shifts (DTD and noisy CIFAR100 datasets), which is also consistent with observations in prior works (Foret et al., 2020; Baek et al., 2024). TSAM also results in lower test loss, discussed in detail in the next section.

Methods CIFAR100 DTD Noisy CIFAR100 Tiny Imagenet Res Net18 Wide Res Net Vi T Wide Res Net Res Net18 Wide Res Net Res Net18 Res Net34

ERM 71.39 (.2) 73.22 (.2) 66.38 (.2) 16.97 (.3) 61.01 (.2) 57.03 (.3) 71.10 (.2) 74.90 (.1) SAM 76.52 (.1) 78.44 (.1) 67.87 (.2) 17.45 (.2) 69.00 (.1) 68.02 (.2) 72.43 (.1) 76.75 (.2) ESAM1 77.40 (.1) 80.22 (.1) 68.18 (.2) 17.67 (.2) 69.20 (.1) 69.79 (.1) 73.24 (.1) 77.41 (.1) ESAM2 77.52 (.1) 79.03 (.2) 68.35 (.1) 17.71 (.1) 67.27 (.1) 66.83 (.1) 73.26 (.1) 77.40 (.1) PGN 77.45 (.1) 78.58 (.2) 67.76 (.1) 18.23 (.2) 65.68 (.1) 64.02 (.2) 73.18 (.1) 77.80 (.1) RSAM 77.35 (.1) 79.02 (.2) 68.35 (.2) 17.66 (.3) 69.31 (.1) 65.93 (.2) 73.57 (.1) 77.72 (.1) TSAM 77.78 (.1) 80.85 (.2) 68.82 (.1) 18.63 (.2) 69.98 (.1) 70.26 (.1) 73.55 (.1) 77.79 (.1)

Objectives Co LA WNLI SST-2 MNLI QNLI RTE MRPC QQP STSB AVG

ERM 52/80.34 54.93 90.48 79.6 87.72 60.65 83.82 86.32 86.6/86.3 77.15 SAM 52/80.48 56.34 91.74 81.1 86.42 58.84 85.29 87.71 87.0/86.5 77.56 ESAM1 52/80.44 56.34 91.63 81.1 86.18 59.02 85.31 87.69 87.1/86.7 77.59 ESAM2 52/80.53 56.34 91.63 81.2 86.29 59.25 85.80 87.47 86.8/86.5 77.62 TSAM 52/80.81 56.34 91.86 81.1 87.81 60.65 85.05 88.77 87.1/86.6 78.01

0.01 0.02 0.03 0.04 0.05 10 1

sharpness (measured by [L( * + )])

ERM SAM TSAM

0.01 0.02 0.03 0.04 0.05

var[L( * + )]

sharpness (measured by var[L( * + )])

ERM SAM TSAM

Figure 2. Sharpness of the solutions found by ERM, SAM, and TSAM on CIFAR100 with Res Net18. We empirically measure sharpness by both E[L(θ + ϵ)] and var[L(θ + ϵ)] where ϵ N(0, δ2). θ denotes different optimal model parameters obtained from the three objectives. These metrics (especially variance) are also consistent with Definition 3.5 with t = 0. We see that TSAM solutions have a flatter neighborhood compared with the other two.

(which is expected because it does not directly optimize over ERM), the test losses of TSAM for both the average-case performance and worst-case performance are lower than the other two baselines. While we show better generalization of TSAM empirically, rigorous understandings between generalization and flatness remains an open area of research.

5.3. Sensitivity to Hyperparameters

Effects of the Tilting Hyperparameter t. One critical hyperparameter in TSAM is t. When t = 0, TSAM objective reduces to the average-case perturbed objective. When t , the TSAM objective (Eq. (3)) recovers SAM (Eq. 2). But the TSAM algorithm (Algorithm 3) do not exactly recover SAM s alternating updating approximation when t . See Section 4 for a detailed discussion. Here,

we report the test accuracies as the training proceeds under multiple values of t s for all the three tasks. Results are plotted in Figure 7 in the appendix. We see that there are a range of t s that result in faster convergence or higher accuracies than SAM. There also exists an optimal t that leads to the best test performance. This is consistent with our previous generalization bound (Section 3.3). Though Theorem 3.6 captures the benefits of both SAM and TSAM, we note that the final empirical performance does not only depend on the properties of the objectives. But rather, it also relies on the choice of approximation algorithms. Results in Theorem 3.6 assume that the objectives are optimized perfectly, which is infeasible in high-dimensional settings.

Additionally, we empirically study the effects of scheduling t during optimization. Increasing t from 0 to a fixed value effectively switches from weighting local minima uniformly to rewighting them based on the loss values, and vice versa. We experiment with two options: linearly decreasing t and linearly increasing t on the noisy CIFAR100 dataset trained by Res Net18. The convergence curves are shown in Figure 8. We see that using a fixed t throughout training does not have significant difference from scheduling t. Hence, we stick to fixed t s for our TSAM experiments.

Effects of the Number of ϵ s. One may wonder if we need to sample a large number of perturbations for the algorithm to be effective. In Table 2, we show that we usually only need s = 3 or 5 number of ϵ s to achieve significant improvements relative to SAM. The models in three columns correspond to Res Net18, Vi T, and Res Net18, respectively.

Tilted Sharpness-Aware Minimization

CIFAR100 DTD noisy CIFAR100

SAM 76.52 67.87 69.00 TSAM, s=3 77.40 68.82 69.55 TSAM, s=5 77.78 68.70 69.98

Table 2. Test accuracies of SAM and TSAM with different number of sampled ϵ s (denoted as s; see Algorithm 3). Empirically, we do not need many samples from the tilted distribution.

6. Conclusion

In this work, we have proposed a tilted sharpness-aware minimization (TSAM) objective, which leverages exponential tilting (parameterized by t) to reweight potentially many local minima in the neighborhoods. TSAM is a more smooth problem relative to SAM with a bounded t, and it explicitly encourages flatter solutions as t increases. We have proposed a practical algorithm motivated by HMC to sample from the tilted distribution et L(θ+ϵ). Through experiments on different models and datasets including label-noise settings, we have demonstrated that TSAM consistently outperforms SAM and its variants on both image and text datasets.

Acknowledgements

We thank Ahmad Beirami, Behrooz Tahmasebi, and Manzil Zaheer for helpful discussions. We thank the anonymous reviewers for their feedback.

Impact Statement

In this work, we propose tilted sharpness-aware minimization (TSAM) that leverages exponential tilting to reweight sharp local minima and achieve better generalization. It could potentially improve the test performance of a wide range of models including those designed to address societal problems such as fairness, robustness, and privacy. As the algorithmic framework is general, we do not anticipate our objective or algorithm to directly cause any negative impacts. However, if TSAM is applied in an undesirable manner, e.g., to some malicious attacking approaches, it may strengthen the capabilities of those models.

Aminian, G., Asadi, A. R., Li, T., Beirami, A., Reinert, G., and Cohen, S. N. Generalization error of the tilted empirical risk. ar Xiv preprint ar Xiv:2409.19431, 2024.

Andriushchenko, M. and Flammarion, N. Towards understanding sharpness-aware minimization. In International Conference on Machine Learning, pp. 639 668. PMLR, 2022.

Andriushchenko, M., Croce, F., M uller, M., Hein, M., and Flammarion, N. A modern look at the relationship between sharpness and generalization. ar Xiv preprint ar Xiv:2302.07011, 2023.

Baek, C., Kolter, Z., and Raghunathan, A. Why is sam robust to label noise? ar Xiv preprint ar Xiv:2405.03676, 2024.

Bartlett, P. L., Long, P. M., and Bousquet, O. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Journal of Machine Learning Research, 24(316):1 36, 2023.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Calafiore, G. C. and El Ghaoui, L. Optimization Models. Cambridge University Press, 2014.

Chen, X., Hsieh, C.-J., and Gong, B. When vision transformers outperform resnets without pre-training or strong data augmentations. ar Xiv preprint ar Xiv:2106.01548, 2021.

Chen, Z., Zhang, J., Kou, Y., Chen, X., Hsieh, C.-J., and Gu, Q. Why does sharpness-aware minimization generalize better than sgd? Advances in Neural Information Processing Systems, 36, 2024.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , and Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.

Dembo, A. Large deviations techniques and applications. Springer, 2009.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009.

Ding, L., Drusvyatskiy, D., Fazel, M., and Harchaoui, Z. Flat minima generalize for low-rank matrix recovery. Information and Inference: A Journal of the IMA, 13(2): iaae009, 2024.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020.

Du, J., Zhou, D., Feng, J., Tan, V., and Zhou, J. T. Sharpnessaware training for free. Advances in Neural Information Processing Systems, 35:23439 23451, 2022.

Tilted Sharpness-Aware Minimization

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674 701, 2012.

Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. ar Xiv preprint ar Xiv:2010.01412, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Kort, B. W. and Bertsekas, D. P. A new penalty function method for constrained minimization. In Proceedings of the 1972 ieee conference on decision and control and 11th symposium on adaptive processes, pp. 162 166. IEEE, 1972.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kwon, J., Kim, J., Park, H., and Choi, I. K. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pp. 5905 5914. PMLR, 2021.

Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.

Leimkuhler, B. and Reich, S. Simulating hamiltonian dynamics. Number 14. Cambridge university press, 2004.

Li, B. and Giannakis, G. Enhancing sharpness-aware optimization through variance suppression. Advances in Neural Information Processing Systems, 36, 2023.

Li, T., Beirami, A., Sanjabi, M., and Smith, V. On tilted losses in machine learning: Theory and applications. Journal of Machine Learning Research, 24(142):1 79, 2023.

Liu, G.-H. and Theodorou, E. A. Deep learning theory review: An optimal control and dynamical systems perspective. ar Xiv preprint ar Xiv:1908.10920, 2019.

Liu, Y., Mai, S., Chen, X., Hsieh, C.-J., and You, Y. Towards efficient and scalable sharpness-aware minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12360 12370, 2022a.

Liu, Y., Mai, S., Cheng, M., Chen, X., Hsieh, C.-J., and You, Y. Random sharpness-aware minimization. Advances in Neural Information Processing Systems, 35:24543 24556, 2022b.

Long, P. M. and Bartlett, P. L. Sharpness-aware minimization and the edge of stability. ar Xiv preprint ar Xiv:2309.12488, 2023.

Mi, P., Shen, L., Ren, T., Zhou, Y., Sun, X., Ji, R., and Tao, D. Make sharpness-aware minimization stronger: A sparsified perturbation approach. Advances in Neural Information Processing Systems, 35:30950 30962, 2022.

Mueller, M., Vlaar, T. J., Rolnick, D., and Hein, M. Normalization layers are all that sharpness-aware minimization needs. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Neal, R. M. et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.

Nesterov, Y. A method of solving a convex programming problem with convergence rate of 1/k2. Proceedings of the USSR Academy of Sciences, 269:3.

Nesterov, Y. A method for solving the convex programming problem with convergence rate o (1/k2). In Dokl akad nauk Sssr, volume 269, pp. 543, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Rice, L., Bair, A., Zhang, H., and Kolter, J. Z. Robustness between the worst and average case. Advances in Neural Information Processing Systems, 34:27840 27851, 2021.

Robey, A., Chamon, L., Pappas, G. J., and Hassani, H. Probabilistically robust learning: Balancing average and worst-case performance. In International Conference on Machine Learning, pp. 18667 18686. PMLR, 2022.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of risk, 2:21 42, 2000.

Sanh, V. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. ar Xiv preprint ar Xiv:1910.01108, 2019.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Tilted Sharpness-Aware Minimization

Shen, C. and Li, H. On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2216 2231, 2010.

Siegmund, D. Importance sampling in the monte carlo study of sequential tests. The Annals of Statistics, pp. 673 684, 1976.

Streeter, M. and Mc Mahan, H. B. Less regret via online conditioning. ar Xiv preprint ar Xiv:1002.4862, 2010.

Szab o, A., Jamali-Rad, H., and Mannava, S.-D. Tilted cross-entropy (tce): Promoting fairness in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2305 2310, 2021.

Tahmasebi, B., Soleymani, A., Bahri, D., Jegelka, S., and Jaillet, P. A universal class of sharpness-aware minimization algorithms. In ICML, 2024.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008.

Wen, K., Ma, T., and Li, Z. How does sharpness-aware minimization minimize sharpness? ar Xiv preprint ar Xiv:2211.05729, 2022.

Wen, K., Li, Z., and Ma, T. Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. Advances in Neural Information Processing Systems, 36, 2024.

Whittle, P. Risk-sensitive linear/quadratic/gaussian control. Advances in Applied Probability, 13(4):764 777, 1981.

Whittle, P. Risk sensitivity, a strangely pervasive concept. Macroeconomic Dynamics, 6(1):5 18, 2002.

Xie, W., Latorre, F., Antonakopoulos, K., Pethick, T., and Cevher, V. Improving SAM requires rethinking its optimization formulation. In Proceedings of the 41st International Conference on Machine Learning, 2024.

Zagoruyko, S. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016.

Zhang, H. R., Li, D., and Ju, H. Noise stability optimization for finding flat minima: A hessian-based regularization approach. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.

Zhao, Y., Zhang, H., and Hu, X. Penalizing gradient norm for efficiently improving generalization in deep learning. In International Conference on Machine Learning, pp. 26982 26992. PMLR, 2022.

Zheng, Y., Zhang, R., and Mao, Y. Regularizing neural networks via adversarial model perturbation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8156 8165, 2021.

Zhou, T., Wang, S., and Bilmes, J. Robust curriculum learning: from clean label detection to noisy label selfcorrection. In International Conference on Learning Representations, 2020.

Zhou, W., Liu, F., Zhang, H., and Chen, M. Sharpnessaware minimization with dynamic reweighting. ar Xiv preprint ar Xiv:2112.08772, 2021.

Zhuang, J., Gong, B., Yuan, L., Cui, Y., Adam, H., Dvornek, N., Tatikonda, S., Duncan, J., and Liu, T. Surrogate gap minimization improves sharpness-aware training. ar Xiv preprint ar Xiv:2203.08065, 2022.

Tilted Sharpness-Aware Minimization

A. Additional Toy Problems

In Figure 1 in Section 1, we present a specific toy problem where TSAM arrives at more flat solutions as t increases. Though the TSAM objective will recover SAM when t , we note that TSAM can be easier to solve due to smoothness. To illustrate this, we create another toy problem in Figure 3 and 4 below. We see that SAM always leads to a non-smooth optimization problem for ρ > 0.

Figure 3. SAM losses as ρ increases. The original loss function (shown in the blue lines across all figures) is a one-dimensional problem L(θ) = |θ 1| 0.01 if θ 2, and L(θ) = (θ 3)2 otherwise. Note that θ = 3 is a more flat solution than θ = 1, though L(1) < L(3). The SAM objective is minθ max|ϵ| ρ L(θ + ϵ), shown in the red lines, where the values of ρ s increase from 0 to 0.8. When ρ = 0, the objective reduces to ERM. For ρ > 0, the SAM objectives (red lines) are non-smooth, and the global minima (marked in orange) are achieved at a flat region in L( ). The SAM objective visualization holds regardless of the usage of any existing SAM algorithms or implementation.

Figure 4. TSAM (t=0.01) losses as ρ increases. The TSAM objective (red lines) is 1

t log Eµ(ϵ) h et L(θ+ϵ)i , where µ(ϵ) := U(|ϵ| ρ) defines a uniform distribution of ϵ s constrained in a ball with radius ρ. The larger ρ is, the darker the redness becomes. TSAM with a small t is able to find flat solutions.

Tilted Sharpness-Aware Minimization

B. Complete Proofs

B.1. Proofs for Section 3.1

Proof for the Case of t 0, Lt(θ + ϵ) E[L(θ + ϵ)]. Note that if L( ) is continuously differentiable, then et L(θ+ϵ) is continuous w.r.t. ϵ Rd. It is also continuous w.r.t. t R. When t 0,

lim t 0 Lt(θ) = lim t 0 1

t log Z et L(θ+ϵ)dµ(ϵ) (9)

= 1 R et L(θ+ϵ)dµ(ϵ)

Z et L(θ+ϵ)L(θ + ϵ)dµ(ϵ) (10)

= Z L(θ + ϵ)dµ(ϵ) (11)

= E[L(θ + ϵ)]. (12)

Proof for Lipschitzness. First observe that if L(θ) is p-Lipschitz with respect to θ, then Lt(θ) is p-Lipschitz with respect to θ. This follows from

Lt(θ1) Lt(θ2) = 1

t log E h et L(θ1+ϵ)i 1

t log E h et L(θ2+ϵ)i (13)

t log E et L(θ1+ϵ)

E et L(θ2+ϵ)

t log etp θ1 θ2 E et L(θ2+ϵ)

E et L(θ2+ϵ)

= p θ1 θ2 . (16)

Proof for Strong Convexity. Assume L is continuously differentiable. If L is µ-strongly convex, then Lt is also µ-strongly convex. This is because of the Hessian in Eq. (17), which can be written as

2Lt(θ) = t M + E et L(θ+ϵ)

E[et L(θ+ϵ)] 2L(θ + ϵ) , (17)

where M is a positive semi-definite matrix. We note that due to the µ-strong convexity of L, the second term satisfies E h et L(θ+ϵ)

E[et L(θ+ϵ)] 2L(θ + ϵ) i µI. Hence, 2Lt(θ) µI.

Proof for Smoothness. From Eq. (17), we know that

t 2Lt(θ) = M + 1

t E et L(θ+ϵ)

E[et L(θ+ϵ)] 2L(θ + ϵ) . (18)

As E h et L(θ+ϵ)

E[et L(θ+ϵ)] i = 1, and the max eigenvalue λmax( 2L(θ + ϵ)) β, we have

0 < min t 1

t λmax( 2L(θ + ϵ)) < + . (19)

B.2. Proofs for Section 3.2

In the following, we use E to denote Eϵ. Define g(t) as

g(t) := Lt(θ1) Lt(θ2) (20)

t log Z et L(θ1+ϵ)dµ(ϵ) 1

t log Z et L(θ2+ϵ)dµ(ϵ) . (21)

Tilted Sharpness-Aware Minimization

Assume l has the specific form of

l(xi; θ) = A(θ) θ T(xi), (22)

L(θ) = A(θ) θ 1 n

:= A(θ) θ T(x). (23)

Under this form, we have that

t log Z et(A(θ+ϵ) (θ+ϵ) T (x))p(ϵ)dϵ (24)

t log e tθ T (x) Z et(A(θ+ϵ) ϵ T (x))p(ϵ)dϵ (25)

= θ T(x) + 1

t log Z et(A(θ+ϵ) ϵ T (x))p(ϵ)dϵ (26)

Γt(θ) := log E h et A(θ+ϵ) tϵ T (x)i . (27)

Then we have

t log E h et(A(θ+ϵ) (θ+ϵ) T (x))i = θ T(x) + 1

t Γt(θ). (28)

nt(θ) := et A(θ+ϵ) tϵ T (x), (29)

ht(θ) := E nt(θ)(A(θ + ϵ) ϵ T(x)

E [nt(θ)] , (30)

t2 Γt(θ) + 1

t ht(θ). (31)

We have that

t = nt(θ)(A(θ + ϵ) ϵ T(x)). (32)

t = E h et A(θ+ϵ) tϵ T (x)(A(θ + ϵ) ϵ T(x)) i

E et A(θ+ϵ) tϵ T (x) = ht(θ), (33)

t2 Γt(θ) + 1

t2 Γt(θ) + 1

t ht(θ) = mt(θ). (34)

B.3. Lt(θ) is monotonically non-increasing as t

We would like to prove the sign of Lt(θ)

t , or mt(θ), is non-negative. The sign of mt(θ) is the same as the sign of t2mt(θ). We have

t2mt(θ) = Γt(θ) + tht(θ), (35)

t = ht(θ) + ht(θ) + t ht(θ)

t = t ht(θ)

t = E[nt(θ)(A(θ + ϵ) ϵ T(x))2]E[nt(θ)]

(E [nt(θ)])2 (37)

E[nt(θ)(A(θ + ϵ) ϵ T(x))]E[nt(θ)(A(θ + ϵ) ϵ T(x))]

(E [nt(θ)])2 . (38)

Tilted Sharpness-Aware Minimization

Let random variables X = p

nt(θ)(A(θ+ϵ) ϵ T(x)), and Y = p

nt(θ). Following the fact E[X2]E[Y 2] (E[XY ])2 0 gives

t = E[X2]E[Y 2] E[XY ]E[XY ]

(E [nt(θ)])2 0. (39)

We note that limt 0 t2mt(θ) = limt 0 t2 Γt(θ) + tht(θ) = 0. Hence t2mt(θ) 0. Therefore, mt(θ) 0. we have shown that the tilted SAM loss Lt(θ) is monotonically non-decreasing as the increase of t, for any θ.

B.4. t-SAM prefers flatter models as t increases

Next, we examine gt(θ1, θ2) := Lt(θ1) Lt(θ2). First, we have

t2 Γt(θ1) + 1

t ht(θ1) + 1

t2 Γt(θ2) 1

t ht(θ2) (42)

= mt(θ1) mt(θ2). (43)

Similarly, it holds that

t (t2mt(θ2))

t = t ht(θ1)

For t 0, we have

sign ht(θ1)

= sign (t2mt(θ1))

t (t2mt(θ2))

sign mt(θ1) mt(θ2) = sign t2mt(θ1) t2mt(θ2) . (46)

Let the random variable L1 denote A(θ1 + ϵ) ϵ T(x), and random variable L2 denote A(θ2 + ϵ) ϵ T(x). Then

t = E et L1L2 1 E et L1 E et L1L1 2

(E [et L1])2 = E et L1L2 1

E [et L1] E[et L1L1]

t = E et L1L2 1

E [et L1] E[et L1L1]

E et L2L2 2

E [et L2] E[et L2L2]

Given random variables L1 and L2, the exponentially reweighted losses can be defined as et L1L1 and et L2L2. The t-

weighted second moment is E[et L1L2 1] E[et L1] , and the t-weighted mean is E[et L1L1]

E[et L1] . Hence, E[et L1L2 1] E[et L1] E[et L1L1]

E[et L1] 2 can be

viewed as t-weighted variance. As θ1 is t-sharper than θ2, we have ht(θ1)

t 0. Therefore t2mt(θ1) t2mt(θ2)

is non-decreasing as t increases. It takes value of 0 when t = 0, which implies that t2mt(θ1) t2mt(θ2) = gt(θ1,θ2)

Proof for the Discussions on t 0. Recall that exp and log functions can be expanded as

exp(x) = 1 +

k! 1 + x + 1

log(x + 1) =

k=1 ( 1)k 1 xk

Tilted Sharpness-Aware Minimization

For very small t 1,

t log E h et L(θ+ϵ)i (51)

t log E 1 + t L(θ + ϵ) + t2

2 L2(θ + ϵ) (52)

E t L(θ + ϵ) + t2

2 L2(θ + ϵ) 1

2E2 t L(θ + ϵ) + t2

2 L2(θ + ϵ) (53)

t E[L(θ + ϵ)] + t2

2 E L2(θ + ϵ) t2

2 E2[L(θ + ϵ)] + O(t3) + O(t4) (54)

= E[L(θ + ϵ)] + t

2 E L2(θ + ϵ) E2 [L(θ + ϵ)] + O(t2) (55)

E[L(θ + ϵ)] + t

2var (L(θ + ϵ)) (56)

Hence, our proposed objective can be viewed as optimizing for the mean plus variance of the losses in the neighborhood regions when t is very close to 0. When t = 0, it reduces to only optimizing for E[L(θ + ϵ)] for ϵ µ(ϵ). For any θ1 and θ2 such that θ1 is sharper than θ2, we have

g(t) E[L(θ1 + ϵ)] + t

2var (L(θ1 + ϵ)) E[L(θ2 + ϵ)] t

2var (L(θ2 + ϵ)) , (57)

2 (var (L(θ1 + ϵ)) var (L(θ2 + ϵ))) . (58)

Recall that sharpness is defined as standard variance when t 0 (Definition 3.5), and we have that g (t) 0.

B.5. Proof for Theorem 3.7

We first state some useful lemmas. Lemma B.1 (Stated and proved in Aminian et al. (2024)). Let X be a random variable. Suppose 0 < a < X < b < + , we have

2b2 log(E[X]) E[log(X)] var(X)

The Lemma directly follows from existing results in Aminian et al. (2024). For completeness, we include the proof here.

Proof. As d2 dx2 log(x) + βx2 = 1

x2 + 2β, the function log(x) + βx2 is concave for β = 1 2b2 and convex for β = 1 2a2 . Hence, by Jensen s inequality,

E[log(X)] = E h log(X) + X2

i log(E[X]) + 1

2b2 E[X]2 1

2b2 E[X2] (60)

= log(E[X]) 1

2b2 var(X), (61)

which completes the proof of the lower bound. A similar approach can be applied to derive the upper bound.

Proof for Theorem 3.7. We can now proceed with the detailed proof below.

Proof. Examine the following decomposition of the generalization error:

EZ[l(θ, Z)] 1

t log Eϵ[et L(θ+ϵ)] = EZ[l(θ, Z)] Eϵ[L(θ + ϵ)] + Eϵ[L(θ + ϵ)] 1

t log Eϵ[et L(θ+ϵ)]. (62)

Based on Lemma B.1, let X be et L(θ+ϵ) and 1 et L(θ+ϵ) et M (assuming positive and bounded losses and non-negative t), we have that

var(et L(θ+ϵ))

t log(Eϵ[et L(θ+ϵ)]) E[L(θ + ϵ)] var(et L(θ+ϵ))

Tilted Sharpness-Aware Minimization

var(et L(θ+ϵ))

2te2t M Eϵ[L(θ + ϵ)] 1

t log(Eϵ[et L(θ+ϵ)]) var(et L(θ+ϵ))

For the term EZ[l(θ, Z)] Eϵ[L(θ + ϵ)], we further decompose it into

EZ[l(θ, Z)] Eϵ[L(θ + ϵ)] = EZ[l(θ, Z)] L(θ) + L(θ) Eϵ[L(θ + ϵ)]. (65)

Recall that L( ) denote the empirical average loss based on n training samples (Eq. (1)), applying Hoeffding Inequality (Boucheron et al., 2013) gives

EZ[l(θ, Z)] L(θ) M

Combining Eq.(64), (65), and (66), we have the desired bound

EZ[l(θ, Z)] 1

t log Eϵ[et L(θ+ϵ)] M

2n var(et L(θ+ϵ))

2te2t M + L(θ) Eϵ[L(θ + ϵ)]. (67)

To investigate the impacts of t on the generalization bound, we can leave the last term L(θ) Eϵ[L(θ + ϵ)] as it is since it is independent of t.

Additionally, to bound L(θ) Eϵ[L(θ + ϵ)] (the gap between empirical average losses and its randomly-smoothed version) we note that it is related to the curvature of L(θ). If we further assume that the loss is µ-strongly convex, then it holds that

L(θ) E[L(θ + ϵ)] L(θ) L(θ) E[ L(θ) ϵ] µ

2 E[ ϵ 2]. (68)

Combining all the above results gives

EZ[l(θ, Z)] 1

t log Eϵ[et L(θ+ϵ)] M

2n var(et L(θ+ϵ))

2 E[ ϵ 2]. (69)

Tilted Sharpness-Aware Minimization

C. Additional Experimental Details

C.1. Hyperparamter Tuning

For all the three datasets and all methods, we tune learning rates from {0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1}. We use a fixed batch size of 64, label smoothing of 0.1 (for smooth cross-entropy loss), momentum with a parameter 0.9, and L2 weight decay 0.0005 for all runs. For vanilla SAM, we tune ρ from {0.05, 0.1, 0.5, 1, 5, 15} and found that the best values are 0.05, 0.1, 0.1 for CIFAR100, DTD, noisy CIFAR100, respectively. For SAM variants, we tune ρ parameters in the same way. The PGN baseline (Zhao et al., 2022) introduces another hyperparameter the coefficients for the linear combination between normal gradients and SAM gradients, and we tune that from {0.3, 0.5, 0.7, 0.9}. We follow the recommendations of λ and γ hyperparameters in the original Random SAM paper (Liu et al., 2022b). For TSAM, we set δ = t

2 in Algorithm 1, set ρ to be 20, and α in Algorithm 3 to be 0.995 across all datasets. We tune t from {0, 1, 5, 20, 100}, and the best t s are 20, 5, 1 for the three image datasets. The number of sampled ϵ s (the s hyperparameter in Algorithm 1) are chosen from {3, 5}. We show the effects of t and s in detail in Section 5.3 in the main text.

C.2. Naive Sampling

As discussed in Section 4, one naive approach to estimate E[et L(θ+ϵ) L(θ+ϵ)]

E[et L(θ+ϵ)] for ϵ in uniformly distributed over µ(ϵ) is to first uniformly sample s ϵ s over µ(ϵ), and then perform tilted aggregation, as follows:

P i [s] et L(θ+ϵi) L(θ + ϵi)

i [s] e L(θ+ϵi) , ϵi µ(ϵ). (70)

We demonstrate convergence as a function of s on the CIFAR100 dataset in the figure below. We see that as s increases, the performance increases. However, when s = 10, which means that we need 10 gradient evaluations per model updates, the accuracy is lower than that of TSAM with the proposed algorithm.

0 20 40 60 80 # epochs

test accuracy

naive sampling (s=3) naive sampling (s=10) TSAM, proposed alg.

Figure 5. TSAM (t = 20) under the proposed algorithm compared with first uniformly sampling ϵ and then performing tilted aggregation.

C.3. Additional Results

Convergence Curves. In Table 1, we present the final test accuracies of TSAM and the baseline. In Figure 6, we show the convergence of these methods on three datasets. We see that TSAM achieves the fastest convergence, and arrives at a better solution. This is consistent with our argument that TSAM with a bounded t is a more smooth objective than the original SAM formulation (Section 3).

Tilted Sharpness-Aware Minimization

0 25 50 75 100 125 150 175 200 # epochs

test accuracy

ERM SAM TSAM

0 10 20 30 40 50 # epochs

test accuracy

ERM SAM TSAM

0 50 100 150 200 250 300 # epochs

test accuracy

noisy CIFAR100

ERM SAM TSAM

Figure 6. Convergence curves on three image datasets showing test accuracies.

Training and Test Loss Comparisons under Different Objectives. In Table 3, we show that TSAM generates better than ERM and SAM by comparing training and test losses.

Table 3. To further understand TSAM behavior, we report the losses of models trained by {ERM, SAM, TSAM} and evaluated on {ERM, SAM, TSAM} objectives, respectively. The left table shows training losses and the right one shows test losses. We see that (1) every objective achieves the smallest training loss if directly being optimized (diagonal entries, left table). (2) Though SAM and TSAM incurs larger training losses than ERM (last two rows, left table), they lead to smaller test losses (last two rows, right table), i.e., better generalization. (3) Optimizing TSAM results in the smallest test loss across all the three metrics (last row, right table).

trained on evaluated on ERM SAM TSAM training loss

ERM 0.1283 1.35 3.48 SAM 0.1489 0.22 0.60 TSAM 0.1763 0.27 0.46

trained on evaluated on ERM SAM TSAM test loss

ERM 0.9302 2.05 3.54 SAM 0.7414 0.91 1.34 TSAM 0.7163 0.90 1.08

Effects of N in HMC. Recall that Algorithm 3 follows HMC updates in Eq. (8) by running it for N = 1 step. In principle, we can run Eq. (8) for more than 1 step to generate each candidate perturbation ϵ. This require additional gradient evaluations, which can be very expensive. We report results of N = 2 and N = 3 in Table 4 below.

configurations CIFAR100 DTD

N = 1 (s = 3) 0.7740 0.6882 N = 1 (s = 5) 0.7778 0.6870 N = 2 (s = 3) 0.7745 0.6883 N = 3 (s = 3) 0.7752 0.6880

Table 4. N > 1 requires N times more gradient evaluations to generate a single perturbation ϵ (Eq. (8), without resulting in significantly better model accuracies.

Runtime Comparisons. We report the runtime of TSAM and other baselines in Table 5 below.

Effects of the Tilting Hyperparameter t. One critical hyperparameter in TSAM is t. When t = 0, TSAM objective reduces to the average-case perturbed objective. When t , the TSAM objective (Eq. (3)) recovers SAM (Eq. 2). Here, we report the test accuracies as the training proceeds under multiple values of t s for the three tasks. In Figure 7, we see that there are a range of t s that result in faster convergence or higher accuracies than SAM.

Effects of Scheduling t. We also evaluate the performance when we schedule t dynamically during training. In Figure 8, we see that there is no significant difference between using a fixed t, increasing t linearly, or decreasing it lienarly on the noisy CIFAR100 data.

Tilted Sharpness-Aware Minimization

Table 5. Runtime comparisons (in minutes) averaged over different hyperparameter sets. ERM and SAM run the same numbers of epochs as TSAM. All baselines are explained in Section 5. ESAM1 denotes the baseline of letting SAM run longer until it reaches the same computation budget as TSAM.

datasets ERM SAM ESAM1 ESAM2 PGN RSAM TSAM

CIFAR100 32 65 325 310 333 320 333 DTD 5 6 18 17 15 16 15 Noisy CIFAR100 45 92 350 315 302 312 337

SAM t=0 t=1 t=5 t=20 t=100

Figure 7. Test accuracies of SAM and TSAM for various values of t when the number of sampled ϵ s is 3 for each dataset. We select the best SAM and TSAM runs based on the final accuracies on validation data. The results suggest that (1) there are multiple t values that give superior performance than SAM; and (2) we typically need to manually tune a best t via grid search. The empirical performance under different values of t relies on tradeoffs between optimization efficiency (Section 3.1, Section 4), flatness (Section 3.2), and generalization (Section 3.3), and it is difficult to determine an optimal t prior to training. Note that in the label noise regime (left subfigure), one might think that SAM performance could be further improved via a smaller learning rate or early stopping; however, we observe that SAM with a smaller learning rate does not give better accuracy. With early stopping, SAM accuracy is 0.6918, which is still 0.4% lower than that of TSAM without early stopping.

50 100 150 200 250 300 # epochs

test accuracy

noisy CIFAR100

fix t increase t decrease t

Figure 8. Linearly decreasing or increasing the tilting hyperparameter t with the epochs does not differ from the results of a fixed t.