# sharpnessaware_minimization_general_analysis_and_improved_rates__9dec50ad.pdf Published as a conference paper at ICLR 2025 SHARPNESS-AWARE MINIMIZATION: GENERAL ANALYSIS AND IMPROVED RATES Dimitris Oikonomou CS & MINDS Johns Hopkins University doikono1@jh.edu Nicolas Loizou AMS & MINDS Johns Hopkins University nloizou1@jh.edu Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models by minimizing the sharpness of the loss landscape. However, despite its success, several important questions regarding the convergence properties of SAM in non-convex settings are still open, including the benefits of using normalization in the update rule, the dependence of the analysis on the restrictive bounded variance assumption, and the convergence guarantees under different sampling strategies. To address these questions, in this paper, we provide a unified analysis of SAM and its unnormalized variant (USAM) under one single flexible update rule (Unified SAM), and we present convergence results of the new algorithm under a relaxed and more natural assumption on the stochastic noise. Our analysis provides convergence guarantees for SAM under different step size selections for non-convex problems and functions that satisfy the Polyak-Lojasiewicz (PL) condition (a non-convex generalization of strongly convex functions). The proposed theory holds under the arbitrary sampling paradigm, which includes importance sampling as special case, allowing us to analyze variants of SAM that were never explicitly considered in the literature. Experiments validate the theoretical findings and further demonstrate the practical effectiveness of Unified SAM in training deep neural networks for image classification tasks. 1 INTRODUCTION Consider the classical finite-sum optimization problem where each fi is differentiable, Li-smooth and lower bounded. Let X be the set of minimizers of f, which we assume is non-empty. In practical scenarios, the variable x represents the model parameters, n is the total number of training instances, and the functions fi are loss functions that measure how close our model is to the i-th training data point. The goal is to minimize the average loss of all training instances. Understanding the generalization capabilities of Deep Neural Networks (DNNs) is a central concern in machine learning research, (Zhang et al., 2016; Hardt et al., 2016; Neyshabur et al., 2017; Wilson et al., 2017; Neyshabur et al., 2018; Zhang et al., 2021). The training objective function f has numerous stationary points that perfectly fit the training data (Liu et al., 2020), and each of these points can lead to dramatically varying generalization performances. This suggests that minimizing the training objective using a carefully selected optimization algorithm can lead to convergence at a point with improved generalization performance (Foret et al., 2021). Recent studies have observed that the sharpness of the training loss, that is, how rapidly it changes in a neighborhood around the model parameters, correlates strongly with the generalization error (Keskar et al., 2016; Jiang et al., 2019; Singh et al., 2025; Dziugaite & Roy, 2018). This observation has motivated recent works (Foret et al., 2021; Zheng et al., 2021; Wu et al., 2020; Andriushchenko et al., 2023; Xie et al., 2024; Tahmasebi et al., 2024) aiming to minimize sharpness to improve Published as a conference paper at ICLR 2025 generalization. More specifically, building on these ideas Foret et al. (2021), proposed reformulating the optimization problem in (1) into a min-max problem of the following form: min x Rd max ε ρ f(x + ε) where ε represents the radius of the desired neighborhood. The merits of such a formulation reside in the fact that essentially we minimize the empirical sharpness measure max ε ρ f(x+ε) f(x), which inevitably will lead to flatter minima. The objective now is to find x that minimizes f not just at a specific point but across the entire ε-neighborhood. By taking the first-order Taylor expansion of f around x and solving for the optimal ε , the (Normalized) Sharpness-Aware Minimization (SAM) update rule is obtained: xt+1 = xt γt f St xt + ρt f St(xt) f St(xt) where St [n] is a random subset of data points (mini-batch) with cardinality τ sampled independently at each iteration t. The normalization of the inner gradient ensures that the point xt = xt + ρt f St(xt) f St(xt) is a good approximation of xt, since xt xt = ρt. This leads to a more stable optimization process, as explained in Dai et al. (2023). In an orthogonal direction and building upon SAM, Unnormalized Sharpness-Aware Minimization (USAM) was introduced by Andriushchenko & Flammarion (2022) and further investigated in Shin et al. (2024); Dai et al. (2023). The update rule for USAM is defined as follows: xt+1 = xt γt f St xt + ρt f St(xt) . (USAM) In contrast to SAM, USAM omits the normalization thus the point xt can be potentially far from xt making the xt s of USAM updates much larger. This means that the removal of normalization can lead to much more aggressive steps making the USAM potentially more unstable. Although the two variants appear closely related, the proof techniques and upper bounds used in the convergence analysis of SAM are substantially different from those in USAM. Furthermore, the convergence guarantees of the two variants vary significantly. For example, in the deterministic setting (full-batch), SAM guarantees convergence only to a neighborhood of the solution, whereas USAM does not. Additionally, the step sizes γt and ρt used in the two update rules to guarantee convergence are very different. All of these differences motivate the importance and necessity of a novel general analysis of SAM-type algorithms, unifying the two main variants (SAM and USAM) and providing the ability to design and analyze new SAM-like methods filling existing gaps in the theoretical understanding of the update rules. In this work we develop such unified framework that allows the combination of the two approaches and, at the same time, obtains the best-known convergence guarantees under relaxed assumptions. Main Contributions. Our main contributions are summarized below. Unified Framework. We propose the Unified SAM, an update rule that is a convex combination of SAM and USAM, given by: xt+1 = xt γt f St 1 λt + λt f St(xt) f St(xt) (Unified SAM) where λ [0, 1]. The new formulation captures both USAM and SAM as special cases (λ = 0 and λ = 1, respectively), but more importantly, it opens up a wide range of possible update rules beyond these traditional settings. The unified framework offers the flexibility to adjust the degree of normalization (using different values for λ) based on specific model needs, offering a more versatile approach to SAM. Technical Assumptions on the Stochastic Noise. Existing convergence analyses of stochastic SAM rely heavily on the bounded variance assumption, that is, there exists a σ 0 such that E f St(x) f(x) 2 σ2, (Andriushchenko & Flammarion, 2022; Si & Yun, 2023; Li & Giannakis, 2023; Harada & Iiduka, 2024; Mi et al., 2022; Zhuang et al., 2022) or sometimes to the much stronger bounded gradient condition E f St(x) 2 q2, where q 0 (Mi et al., 2022; Zhuang et al., 2022). While these assumptions have been crucial in previous analyses, they can be Published as a conference paper at ICLR 2025 Work Assumptions Arbitrary Sampling? SAM Variant PL functions (Andriushchenko & Flammarion, 2022) BV USAM (Shin et al., 2024) Interpolation USAM (Dai et al., 2023) Deterministic USAM Theorems 3.2 and 3.5 ER Unified SAM General Non-convex functions (Mi et al., 2022) BV, BG SAM/SSAM (Zhuang et al., 2022) BV, BG GSAM (Andriushchenko & Flammarion, 2022) BV USAM (Li & Giannakis, 2023) BV SAM (Si & Yun, 2023) BV SAM Theorem 3.7 ER Unified SAM Table 1: Summary of the convergence results in the SAM literature. In all works, smoothness is assumed. The top part of the table is for PL functions and the lower part is for general non-convex functions. Here BV = Bounded Variance, BG = Bounded Gradients. overly restrictive. In the literature of convergence analysis for stochastic gradient descent (SGD), there have been a lot of efforts recently on relaxing such assumptions (Gower et al., 2019; Khaled & Richt arik, 2020; Gower et al., 2021), but, to date, no work has successfully used similar ideas for the analysis of SAM. In our analysis, we relax the bounded gradients/variance assumptions by utilizing the recently proposed Expected Residual (ER) condition Gower et al. (2021); Khaled & Richt arik (2020). As we explain later, in several scenarios, including smooth non-convex problems, ER holds for free and allows us to provide step sizes for SAM related to the sampling strategies. Convergence guarantees for Unified SAM. We provide tight convergence guarantees for Unified SAM, for smooth functions satisfying the Polyak-Lojasiewicz (PL) condition (Polyak, 1987; Lojasiewicz, 1963; Karimi et al., 2016) and for general non-convex functions. See also Table 1 for a summary of our results and comparison with closely related works. PL functions: For constant step-sizes γ and ρ we prove linear convergence for Unified SAM to a neighborhood of the solution. Our theorem holds without requiring the much stronger assumptions of interpolation condition or the bounded variance assumption of previous works (Andriushchenko & Flammarion, 2022; Shin et al., 2024). Additionally, we prove that for decreasing step sizes γt and ρt, the Unified SAM converges to the exact solution with a sublinear O(1/t) rate (under the ER condition). Our theoretical results hold under the arbitrary sampling paradigm and, as such, can capture tight convergence guarantees in the deterministic setting. For PL functions in the deterministic setting (full-batch SAM), we show that USAM converges to the exact solution, while SAM does not. This observation was first noted by Si & Yun (2023) for deterministic algorithms. To the best of our knowledge, our work is the first that provides tight convergence guarantees, showing this behaviour as a special case of stochastic algorithms. Non-convex functions: Under the ER condition, we show that for general non-convex functions Unified SAM with step sizes that depend on T (the total number of iterations) achieves E f(x T ) < ε for a given ε at a sublinear rate. This is the first result that drops the bounded variance assumption for both USAM and SAM, (Andriushchenko & Flammarion, 2022; Li & Giannakis, 2023), and substitutes it with the Expected Residual condition. Finally, as corollaries of the main theorems for the above two classes of problems, we obtain the state-of-the-art convergence guarantees for SGD (a special case of SAM with ρ = 0), showing the tightness of our analysis. Arbitrary Sampling. Via a stochastic reformulation of the finite sum problem (1), firstly introduced in Gower et al. (2019), we explain how our convergence guarantees of Unified SAM hold under the arbitrary sampling paradigm. This allows us to cover a wide range of samplings for USAM and SAM (and their convex combination via λ [0, 1]) that were never considered in the literature before, including uniform sampling and importance sampling as special cases. In this sense, our analysis of Unified SAM is unified for different sampling strategies. Published as a conference paper at ICLR 2025 Numerical Evaluation. In Section 4, we present extensive experiments validating different aspects of our theoretical results (behavior of methods in the deterministic setting, importance sampling, and different step-size selections). We also assess the performance of Unified SAM in training deep neural networks for multi-class image classification problems. An open-source implementation of our method is available at https://github.com/dimitris-oik/unifiedsam. 2 UNIFIED SAM WITH ARBITRARY SAMPLING In this work, we provide a theoretical analysis of Unified SAM that allows us to obtain convergence guarantees of any minibatch and reasonable sampling selection. We are able to do that by leveraging the recently proposed stochastic reformulation of the minimization problem (1) from Gower et al. (2019; 2021). Following an identical setting to Gower et al. (2021), we assume that we have access to unbiased gradient estimates g(x) Rd such that E[g(x)] = f(x). For example, we can have g(x) = 1 τ P i S fi(x) to be a mini-batch, where S [n] is chosen uniformly at random with |S| = τ. To accommodate any form of mini-batching, we utilize the arbitrary sampling notation g(x) = fv(x) := 1 n Pn i=1 vi fi(x), where v Rn is a random sampling vector drawn from a distribution D such that ED[vi] = 1, for i = 1, . . . , n. Then the original problem (1) can be reformulated as minx Rd ED fv(x) := 1 n Pn i=1 vifi(x) . Note that it follows immediately from the definition of sampling vector that E[g(x)] = 1 n Pn i=1 E[vi] fi(x) = f(x). Using this reformulation of the original problem, the update rule of Unified SAM can be rewritten as follows: xt+1 = xt γtg xt + ρt 1 λt + λt g(xt) g(xt) , (Unified SAM) where g(xt) D is sampled i.i.d at each iteration and ρt 0, γt > 0 and λt [0, 1]. The name unified stems from the fact that the update rule indeed unifies both USAM and SAM, however, we acknowledge that there exist other SAM-like variants that our approach does not cover. Arbirtary Sampling. Using the stochastic reformulation, the update rule of Unified SAM includes several variants of the algorithm, each related to different sampling, by simply varying the distribution D (that satisfies ED[vi] = 1, i [n]). This flexibility implies that different choices of D lead to distinct SAM-type methods (never proposed in the literature before) for solving the original problem (1). In this work we focus on two representative sampling distributions, without aiming to be exhaustive: 1. Single element sampling: We choose only singleton sets S = {i} for i [n], i.e. P[|S| = 1] = 1. Each number i is sampled with probability pi [0, 1] or more formally the vector v Rn is defined via P[v = ei/pi] = pi, where Pn i=1 pi = 1. It is clear that E[vi] = 1. For example, when pi = 1/n for all i then this reduces to the well-known uniform sampling. 2. τ-nice Sampling: Let τ [n]. We generate a random subset S [n] by choosing uniformly from all subsets of size τ. More formally, the vector v Rn is defined by P[v = n i S ei] := 1/ n τ = τ!(n τ)! n! for any subset S [n] with |S| = τ. Using a double counting argument, one can show that E[vi] = 1 (see Gower et al. (2019)). Importantly, our analysis applies to all forms of mini-batching and supports various choices of sampling vectors v. Later in Section 3.4, we provide additional details on non-uniform single element sampling strategies. In addition, it is clear that if τ = n in the τ-nice sampling then we recover the full batch or deterministic regime. Later in Section 3.2, we further demonstrate how our analysis encompasses deterministic SAM and USAM as special cases. 3 CONVERGENCE ANALYSIS In this section, we present our main convergence results. Firstly, we introduce the main assumption for our results, namely the Expected Residual (ER) condition. Then we focus on PL functions, where we demonstrate a linear convergence rate using constant step sizes, and also provide a variant with decreasing step sizes for convergence to the exact solution. Moreover, we extend the analysis to general non-convex functions. Lastly, we discuss the use of importance sampling. Published as a conference paper at ICLR 2025 3.1 MAIN ASSUMPTION In all our theoretical results, we rely on the Expected Residual condition. Assumption 3.1 (Expected Residual Condition). Let f inf = infx Rn f(x). We say the Expected Residual condition holds if there exist parameters A, B, C 0 such that for an unbiased estimator g(x) of f(x), we have that for all x Rd ED g(x) 2 2A[f(x) f inf] + B f(x) 2 + C. (ER) Most prior works in the SAM literature assume either bounded gradients (e.g., Mi et al. (2022); Zhuang et al. (2022)) or bounded variance (e.g., Andriushchenko & Flammarion (2022); Harada & Iiduka (2024)). Both conditions are stronger assumptions than ER. Note that the bounded gradients assumption is captured by ER for A = B = 0, C > 0, while the bounded variance is obtained for A = 0, B = 1, and C > 0. For a detailed analysis of other conditions that automatically satisfy ER, see Gower et al. (2021) and Khaled & Richt arik (2020). Finally, when each fi is Li-smooth under mild assumptions on the distribution D, one can show that ER holds immediately (not an assumption but property of the problem) and has closed-from expressions for the constants A, B, and C. For more details on the expressions A, B, and C in this scenario, please check Appendix B. 3.2 PL FUNCTIONS One of the popular generalizations of strong convexity in the literature is the Polyak-Lojasiewicz (PL) condition, (Karimi et al., 2016; Lei et al., 2019). We call a function µ-PL if f(x) 2 2µ(f(x) f(x )), for all x Rd. In the following, we prove linear convergence of Unified SAM for µ-PL functions. Theorem 3.2. Assume that each fi is Li-smooth, f is µ-PL and the ER is satisfied. Set Lmax = maxi [n] Li. Then the iterates of Unified SAM with ρt = ρ ρ := µ Lmax(µ+2[Bµ+A](1 λ)2), γt = γ γ := µ Lmaxρ(µ+2[Bµ+A](1 λ)2) 2Lmax(Bµ+A)[2L2maxρ2(1 λ)2+1], satisfy: E[f(xt) f(x )] (1 γµ)t f(x0) f(x ) + N, where N = Lmax µ Cγ + ρ(1 + 2γL2 maxρ) λ2 + C(1 λ)2 . As an immediate corollary of Theorem 3.2, we get the following guarantees for USAM and SAM, when we substitute λ = 0 and λ = 1 respectively. Corollary 3.3. Consider the same assumptions as in Theorem 3.2. USAM: For USAM with ρ µ Lmax(µ+2[Bµ+A]) and γ µ Lmaxρ(µ+2[Bµ+A]) 2Lmax(Bµ+A)[2L2 maxρ2+1], it holds: E[f(xt) f(x )] (1 γµ)t f(x0) f(x ) + Lmax C µ γ + ρ(1 + 2γL2 maxρ) . SAM: For SAM with ρt = ρ 1 Lmax and γt = γ µ(1 Lmaxρ) 2Lmax(Bµ+A), it holds: E[f(xt) f(x )] (1 γµ)t f(x0) f(x ) + Lmax µ Cγ + ρ(1 + 2γL2 maxρ) . To the best of our knowledge, all prior convergence results for (stochastic) SAM have relied on the strong assumption of bounded variance, as seen in works like Andriushchenko & Flammarion (2022), Si & Yun (2023), Li & Giannakis (2023) and Harada & Iiduka (2024). In contrast, the above theorem is the first to establish convergence for SAM without relying on this assumption. The closest related works on the convergence of constant step size SAM for PL functions are Shin et al. (2024) and Dai et al. (2023). The former provides a linear convergence rate for USAM in the interpolated regime, while the latter establishes a linear convergence rate for USAM in the deterministic regime. Published as a conference paper at ICLR 2025 Our result is the first to demonstrate linear convergence in the fully stochastic regime. Additionally, when ρ = 0, Unified SAM reduces to SGD, and Theorem 3.2 recovers the step sizes and rates (up to constants) of Theorem 4.6 from Gower et al. (2021), demonstrating the tightness of our results. Recall that with our general analysis via the arbitrary sampling framework, the proposed convergence guarantees hold for the τ-nice sampling as well (see Section 2). As such, a special case of Theorem 3.2 is the case where |S| = n with probability one. That is, we run Unified SAM with a full batch, τ = n. In this scenario, using Proposition B.3, the ER condition holds with A = 0, B = 1 and C = 0 and Theorem 3.2 simplifies as follows. Corollary 3.4 (Deterministic SAM). Assume that each fi is Li-smooth and f is µ-PL. Then the iterates of the deterministic Unified SAM with ρ 1 Lmax(1+2(1 λ)2) and γ 1 Lmaxρ(1+2(1 λ)2) 2Lmax[2L2maxρ2(1 λ)2+1] satisfy: E[f(xt) f(x )] (1 γµ)t f(x0) f(x ) + Lmaxρ(1 + 2γL2 maxρ)λ2 First, observe that the PL parameter µ no longer appears in the step sizes ρ and γ. Additionally, when λ = 0, i.e. USAM, the method converges to the exact solution at a linear rate. However, for λ > 0, and in particular when λ = 1 (SAM), convergence is only up to a neighborhood. This suggests that even in the deterministic setting, SAM does not fully converge to the minimum. This was first investigated in Si & Yun (2023) and a similar result appear in Dai et al. (2023). We illustrate this phenomenon experimentally in Section 4.1. Finally, as an extension of Theorem 3.2, we also show how to obtain convergence to exact solution with an O(1/t) rate for Unified SAM using decreasing step sizes. Theorem 3.5. Assume that each fi is Li-smooth, f is µ-PL and the ER is satisfied. Then the iterates of Unified SAM with ρt = min n 1 2t+1, ρ o and γt = min n 2t+1 (t+1)2µ, γ o , where ρ and γ are defined in Theorem 3.2, satisfy: E[f(xt) f(x )] O 1 The detailed expression hidden under the big O notation can be found in Appendix C. A similar result appears in Andriushchenko & Flammarion (2022) where they provide decreasing step size selection for USAM for PL functions and prove a convergence rate of O(1/t). However, their result relies on the additional assumption of bounded variance. In contrast, our theorem does not require this assumption and is valid for any λ [0, 1]. Notably, to the best of our knowledge, this is the first decreasing step size result for SAM. 3.3 GENERAL NON-CONVEX FUNCTIONS In this section, we remove the PL assumption and work with general non-convex functions. First, we start with a general proposition that upper bounds the quantity mint=0,...,T 1 E f(xt) 2, which can serve as an intermediate result for Theorem 3.7. Our approach follows a similar derivation to the analysis of SGD in the same setting by Khaled & Richt arik (2020). Proposition 3.6. Assume that each fi is Li-smooth and the ER is satisfied. Then the iterates of Unified SAM with ρ min n 1 4Lmax , 1 8BLmax(1 λ)2 o and γ 1 8BLmax , satisfy: min t=0,...,T 1 E f(xt) 2 2 1 + 2AγLmax ρ(1 λ)2(1 + 2γρL2 max) + γ T Tγ [f(x0) f inf] + 2Lmax Cγ + ρ(1 + 2γρL2 max)(λ2 + C(1 λ)2) . In order to control the coefficient of the term f(x0) f inf of Proposition 3.6 from exploding we need to carefully select the step sizes ρ and γ. This is what the following Theorem 3.7 achieves. Published as a conference paper at ICLR 2025 Theorem 3.7. Let ε > 0 and set δ0 = f(x0) f inf 0. For ρ = ρ and γ = γ,a provided that ε2 max 96B, 24(1 λ) p 3Lmax A, 5184Lmax A2(1 λ)4δ0 ε2 , 864δ0A 288L2 max(1 λ)2 the iterates of Unified SAM satisfy mint=0,...,T 1 E f(xt) ε. a For the precise expressions of ρ and γ we refer to Theorem C.4. The results in Proposition 3.6 and Theorem 3.7 are tight, as setting ρ = 0 Unified SAM reduces in SGD and these simplify to the step sizes and rates (up to constants) of Theorem 2 and Corollary 1 from Khaled & Richt arik (2020). Other results for general non-convex functions can be found in Mi et al. (2022), Zhuang et al. (2022) and Li & Giannakis (2023) for SAM and in Andriushchenko & Flammarion (2022) for USAM. However, as mentioned earlier, all these analyses rely on the strong assumption of bounded variance and/or bounded gradients. In contrast, our result uses weaker assumptions and offers guarantees for any λ [0, 1]. Furthermore, Khanh et al. (2024) have results for general non-convex functions in the deterministic setting though all their results are asymptotic. Another closely related work is Nam et al. (2023), where they also assume the expected residual condition, however, they additionally assume bounded gradient of f and their results are only asymptotic and hold almost surely. 3.4 BEYOND UNIFORM SAMPLING: IMPORTANCE SAMPLING In this work, we provide theorems in the PL and non-convex regime through which we can analyze all importance sampling and mini-batch variants of Unified SAM (and as a result of USAM and SAM). We achieve that via our general arbitrary sampling analysis. In this section, we focus on the non-convex regime (Theorem 3.7) and provide importance sampling for Unified SAM with singleelement sampling. That is, we select probabilities that optimize the iteration complexity (lower bound of T). In particular, to derive the importance sampling probabilities, we substitute the bounds for A, B, and C from Proposition B.2 into Theorem 3.7, resulting in: nτ max i Li n2τ 2 (maxi Li pi )2(1 λ)4δ0 ε2 , ε2 , 288L2 max(1 λ)2 Having substituted the values of A, B, and C in the above expression, to obtain the best bound, we require optimizing the quantity maxi Li npi over the probabilities pi. This results in pi = Li Pn j=1 Lj . Similar probabilities have been proposed for several optimization algorithms, including SGD (Gower et al., 2019; 2021; Khaled & Richt arik, 2020), variance-reduced methods (Khaled et al., 2023), and methods for min-max optimization (Gorbunov et al., 2022; Loizou et al., 2020; Choudhury et al., 2024; Beznosikov et al., 2023). To the best of our knowledge, our work is the first to provide importance sampling for SAM algorithms. 4 NUMERICAL EXPERIMENTS In this section, we evaluate our proposed step sizes for Unified SAM on both deterministic and stochastic PL problems, with experiments designed to illustrate our theoretical findings. Additionally, we explore different values of λ when training deep neural networks to improve accuracy. 4.1 VALIDATION OF THE THEORY In this part, we empirically validate our theoretical results and illustrate the main properties of Unified SAM that our theory suggests in Section 3. In these experiments, we focus on ℓ2-regularized regression problems (problems with strongly convex objective f and components fi and thus PL) Published as a conference paper at ICLR 2025 and we evaluate the performance of Unified SAM on synthetic data. The loss function of the ℓ2regularized ridge regression is given by 2 Ax b 2 + λr x 2 = 1 i=1 (A[i, :]x bi)2 + λr and the loss function of the ℓ2-regularized logistic regression is given by i=1 log (1 + exp ( bi A[i, :]x)) + λr In both problems A Rn d, b Rn are the given data and λr 0 is the regularization parameter. 0 1000 2000 3000 4000 5000 Iterations f(xk) f(x * ) f(x0) f(x * ) = 0.0 = 0.1 = 0.2 = 0.3 = 0.4 = 0.5 = 0.6 = 0.7 = 0.8 = 0.9 = 1.0 Figure 1: Deterministic Unified SAM for various values of λ applied to the ridge regression problem. USAM (λ = 0) converges to the exact solution while the other variants λ > 0 converge to a neighborhood of the solution. Normalized SAM converges only in neighborhood. In Corollary 3.4 we highlight that our theoretical results indicate that in the deterministic case, USAM achieves linear convergence to the exact solution. In contrast, when λ > 0, and specifically for SAM (λ = 1), convergence is only to a neighborhood of the solution, an unusual outcome for deterministic optimization methods. We validate this observation experimentally in Figure 1. We run a ridge regression problem with n = 100, d = 100, λr = 0. The matrix A has been generated according to Lenard & Minkoff (1984) such that the condition number of A is 10 and the vector b has been sampled from the standard Gaussian distribution. We have used the deterministic Unified SAM for λ = 0.0, . . . , 1.0 and we run each algorithm for 50 epochs. Indeed we can see that USAM (λ = 0) converges all the way to the exact solution while the other choices of λ converge to a neighborhood. It is also noteworthy that the neighborhood increases as λ approaches 1. Constant vs Decreasing Step size. In this part, we compare the performance of Unified SAM under both constant and decreasing step size regimes, as described in Theorem 3.2 and Theorem 3.5, respectively. For this experiment, we consider a logistic regression task with n = 100, d = 100, and λr = 3/n. As before the matrix A has been generated such that its condition number is 10 and the vector b has been sampled from the standard Gaussian distribution. We run Unified SAM with λ = 0.0, 0.5, 1.0, using uniform single-element sampling for 10, 000 epochs across 5 trials. The average results of these trials, along with one standard deviation, are shown in Figure 2. Initially, the trajectory of the decreasing step size follows that of the constant step size. However, as the constant step size version of Theorem 3.2 approaches a neighborhood near optimality, it stagnates. In contrast, the decreasing step size from Theorem 3.5 allows for continued improvement, leading to better overall accuracy. 0.0 0.2 0.4 0.6 0.8 1.0 Iterations f(xk) f(x * ) f(x0) f(x * ) Constant Step-size Decreasing Step-size 0.0 0.2 0.4 0.6 0.8 1.0 Iterations f(xk) f(x * ) f(x0) f(x * ) Constant Step-size Decreasing Step-size 0.0 0.2 0.4 0.6 0.8 1.0 Iterations f(xk) f(x * ) f(x0) f(x * ) Constant Step-size Decreasing Step-size Figure 2: Comparison between constant and decreasing step size regimes of Unified SAM. From left to right we have λ = 0.0, 0.5, 1.0 Uniform vs Importance Sampling. In this experiment, we demonstrate the benefit of using importance sampling compared to uniform sampling. We consider a ridge regression problem with n = 100, d = 100, and λr = 3/n. The eigenvalues of matrix A are sampled uniformly from the interval [1.0, 10.0], and the vector b is drawn from a standard Gaussian distribution. We run Unified SAM with λ = 0.0, 0.5, 1.0, employing single-element uniform sampling for 3, 000 epochs across Published as a conference paper at ICLR 2025 0 50000 100000 150000 200000 250000 300000 Iterations f(xk) f(x * ) f(x0) f(x * ) Uniform Sampling Importance Sampling 0 50000 100000 150000 200000 250000 300000 Iterations f(xk) f(x * ) f(x0) f(x * ) Uniform Sampling Importance Sampling 0 50000 100000 150000 200000 250000 300000 Iterations f(xk) f(x * ) f(x0) f(x * ) Uniform Sampling Importance Sampling Figure 3: Comparison between uniform and importance sampling for Unified SAM. From left to right we have λ = 0.0, 0.5, 1.0 five trials. For importance sampling, the probabilities are set as pi = Li/ Pn j=1 Lj. The averaged results with one standard deviation are presented in Figure 3. The results clearly show that importance sampling enhances the convergence of Unified SAM for all values of λ, which is consistent with the discussion in Section 3.4. 4.2 IMAGE CLASSIFICATION In this section, we evaluate the generalization performance of Unified SAM using the classic image classification task. The models are trained on the CIFAR-10 and CIFAR-100 datasets, (Krizhevsky et al., 2009). For data augmentation, we apply standard techniques such as random crop, random horizontal flip, and normalization (De Vries, 2017). All experiments are conducted on NVIDIA RTX 6000 Ada GPUs. Models. We use several models in our experiments, including Res Net-18 (RN-18) (He et al., 2016a), and Pre Act Res Net-18 (PRN-18) (He et al., 2016b). To demonstrate the scalability of Unified SAM, we also include Wider Res Net (WRN-28-10) (Zagoruyko & Komodakis, 2016). Different values of λ. This subsection explores how varying λ in the Unified SAM update rule affects generalization performance. We experiment with the PRN-18 and WRN-28-10 models trained on CIFAR-10 and CIFAR-100. The Unified SAM method is trained using ρ = 0.1, 0.2, 0.3, 0.4 and λt = 0.0, 0.5, 1.0, 1/t, 1 1/t. The last two choices of λt can be intuitively explained as follows: for λt = 1 1/t, the algorithm starts as USAM (λ1 = 0) and gradually transitions toward SAM as λt 1 when t , meaning it begins with USAM and approaches SAM. Conversely, for λt = 1/t, the algorithm behaves the other way around, starting closer from SAM and converging to USAM over time. Following Pang et al. (2021) and Zhang et al. (2024), we set the weight decay to 5 10 4, the momentum to 0.9, and train for 100 epochs. The step size γ is initialized at 0.1 and reduced by a factor of 10 at the 75-th and 90-th epochs. Each experiment is repeated three times, and we report the averages and standard errors in Tables 2 and 3 and Tables 8 and 9 in appendix. The results indicate that, for a fixed ρ, the optimal λ value is not always λ = 0 or λ = 1. Overall, λt = 1 1/t appears to be a reliable choice. In particular, we observe that USAM (λ = 0) does not have the best performance in any of these experiments. On the contrary in the CIFAR10 dataset the value λt = 1 1/t is a good choice while in the CIFAR100 dataset both λt = 1 and λt = 1 1/t offer strong performance. Table 2: Test accuracy (%) of Unified SAM for WRN-28-10 on CIFAR10, evaluated across different values of ρ and λ. With bold we highlight the best performance for fixed ρ. Unified SAM λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t ρ = 0.1 95.7 0.01 95.68 0.11 95.9 0.08 95.84 0.07 95.81 0.03 ρ = 0.2 95.8 0.05 95.77 0.09 95.93 0.07 95.71 0.13 95.98 0.1 ρ = 0.3 95.35 0.3 95.88 0.1 95.95 0.09 95.68 0.02 95.99 0.06 ρ = 0.4 95.46 0.02 95.76 0.1 95.62 0.05 95.46 0.27 95.79 0.07 SGD 95.35 0.06 Unified Va SSO. The Variance-Suppressed Sharpness-Aware Optimization (Va SSO) method, introduced in Li & Giannakis (2023), is an extension of SAM designed to reduce variance in gradient estimates. Va SSO adjusts the direction of gradient updates using a combination of past gradients and the current gradient, controlled by a parameter θ, aiming to suppress noise during training and enhance Published as a conference paper at ICLR 2025 Table 3: Test accuracy (%) of Unified SAM for WRN-28-10 on CIFAR100, evaluated across different values of ρ and λ. With bold we highlight the best performance for fixed ρ. Unified SAM λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t ρ = 0.1 80.84 0.08 81.01 0.11 80.69 0.11 80.81 0.24 80.88 0.31 ρ = 0.2 81.12 0.18 81.45 0.23 81.53 0.09 81.31 0.21 81.22 0.19 ρ = 0.3 80.94 0.13 81.64 0.16 81.62 0.16 81.03 0.14 81.71 0.17 ρ = 0.4 80.1 0.22 81.31 0.16 81.70 0.06 80.39 0.07 81.59 0.05 SGD 79.79 0.18 generalization, particularly in overparameterized models. In their work, they prove that Va SSO converges at the same asymptotic rate as SAM, under the bounded variance assumption. The update rule of Va SSO is defined as follows: dt = (1 θ)dt 1 + θ fi(xt), xt+1 = xt γt fi xt + ρt dt dt . We can incorporate our unification approach, initially designed for SAM, into Va SSO. This leads to the following modified update rule: dt = (1 θ)dt 1 + θ fi(xt) xt+1 = xt γt fi 1 λt + λt dt (Unified Va SSO) Note that when λt = 1 then we recover Va SSO as introduced in Li & Giannakis (2023). In this section, we conduct experiments to evaluate the performance of Unified Va SSO. All models are trained for 200 epochs with a batch size of 128. A cosine scheduler is employed in all cases, with an initial step size of 0.05. The weight decay is set to 0.001. For Va SSO, we use θ = 0.4, as this value provides the best accuracy according to Li & Giannakis (2023). For the CIFAR-10 dataset, we set ρ = 0.1, while for CIFAR-100, we use ρ = 0.2. Each experiment is repeated three times, and we report the average of the maximum test accuracy along with the standard error. The numerical results are presented in Tables 4 and 5. The results show that, for the Wide Res Net-28-10 model the value λt = 1 1/t produces the best accuracy. For the Res Net-18 model the best values for λ appear to be λ = 0.5 and λ = 1 1/t. Thus, the choice λt = 1 1/t appears to be a strong overall choice. In any cases, a λ value different from 1 achieves the best performance. Table 4: Test accuracy (%) of Unified Va SSO on various neural networks trained on CIFAR10. Model λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t Res Net-18 96.12 0.02 96.10 0.05 96.22 0.11 96.03 0.04 96.34 0.01 Wide Res Net-28-10 96.77 0.07 96.93 0.03 97.03 0.06 96.71 0.06 97.06 0.09 Table 5: Test accuracy (%) of Unified Va SSO on various neural networks trained on CIFAR100. Model λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t Res Net-18 79.82 0.15 80.01 0.07 79.76 0.03 79.93 0.13 79.86 0.04 Wide Res Net-28-10 83.07 0.25 83.29 0.33 83.51 0.09 82.83 0.18 83.66 0.19 5 CONCLUSION In this work, we introduced Unified SAM, an algorithm that generalizes SAM and USAM via the convex combination parameter λ [0, 1]. The proposed analysis relaxes the common bounded variance assumption used in previous works by employing the ER condition, which enables us to prove convergence under weaker assumptions. In particular, under the ER, we established convergence guarantees for Unified SAM for both PL and general non-convex functions, and our analysis encompasses importance sampling and tau-nice sampling. Future work could investigate the optimal value of λ for various classes of problems, extend the unified approach to incorporate adaptive step sizes ρt and γt, and study Unified SAM in distributed and federated learning settings. Another promising direction is to explore how our approach may help minimize the sharpness-aware loss by employing the Hessian-regularized methods proposed by Wen et al. (2023) and Bartlett et al. (2023). Published as a conference paper at ICLR 2025 Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. In ICML, 2022. Maksym Andriushchenko, Dara Bahri, Hossein Mobahi, and Nicolas Flammarion. Sharpness-aware minimization leads to low-rank features. In Neur IPS, 2023. Peter L Bartlett, Philip M Long, and Olivier Bousquet. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Journal of Machine Learning Research, 24(316):1 36, 2023. Aleksandr Beznosikov, Eduard Gorbunov, Hugo Berard, and Nicolas Loizou. Stochastic gradient descent-ascent: Unified theory and new efficient methods. In AISTATS, 2023. Sayantan Choudhury, Eduard Gorbunov, and Nicolas Loizou. Single-call stochastic extragradient methods for structured non-monotone variational inequalities: Improved analysis under weaker conditions. In Neur IPS, 2024. Yan Dai, Kwangjun Ahn, and Suvrit Sra. The crucial role of normalization in sharpness-aware minimization. In Neur IPS, 2023. Terrance De Vries. Improved regularization of convolutional neural networks with cutout. ar Xiv preprint ar Xiv:1708.04552, 2017. Gintare Karolina Dziugaite and Daniel Roy. Entropy-sgd optimizes the prior of a pac-bayes bound: Generalization properties of entropy-sgd and data-dependent priors. In ICML, 2018. Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021. Eduard Gorbunov, Hugo Berard, Gauthier Gidel, and Nicolas Loizou. Stochastic extragradient: General analysis and improved rates. In AISTATS, 2022. Robert Gower, Othmane Sebbouh, and Nicolas Loizou. SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In AISTATS, 2021. Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt arik. SGD: General analysis and improved rates. In ICML, 2019. Hinata Harada and Hideaki Iiduka. Convergence of sharpness-aware minimization algorithms using increasing batch size and decaying learning rate. ar Xiv preprint ar Xiv:2409.09984, 2024. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016b. Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2019. Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximalgradient methods under the polyak-lojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, pp. 795 811, 2016. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2016. Ahmed Khaled and Peter Richt arik. Better theory for sgd in the nonconvex world. ar Xiv preprint ar Xiv:2002.03329, 2020. Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M Gower, and Peter Richt arik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2):499 540, 2023. Pham Duy Khanh, Hoang-Chau Luong, Boris S Mordukhovich, and Dat Ba Tran. Fundamental convergence analysis of sharpness-aware minimization. ar Xiv preprint ar Xiv:2401.08060, 2024. Published as a conference paper at ICLR 2025 Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009. Yunwen Lei, Ting Hu, Guiying Li, and Ke Tang. Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE transactions on neural networks and learning systems, 31(10):4394 4400, 2019. Melanie L Lenard and Michael Minkoff. Randomly generated test problems for positive definite quadratic programming. ACM Transactions on Mathematical Software (TOMS), 10(1):86 96, 1984. Bingcong Li and Georgios Giannakis. Enhancing sharpness-aware optimization through variance suppression. In Neur IPS, 2023. Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and SGD can reach them. In Neur IPS, 2020. Nicolas Loizou, Hugo Berard, Alexia Jolicoeur-Martineau, Pascal Vincent, Simon Lacoste-Julien, and Ioannis Mitliagkas. Stochastic hamiltonian gradient methods for smooth games. In ICML, 2020. Stanislaw Lojasiewicz. A topological property of real analytic subsets. Coll. du CNRS, Les equations aux d eriv ees partielles, 117(87-89):2, 1963. Peng Mi, Li Shen, Tianhe Ren, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji, and Dacheng Tao. Make sharpness-aware minimization stronger: A sparsified perturbation approach. In Neur IPS, 2022. Kyunghun Nam, Jinseok Chung, and Namhoon Lee. Almost sure last iterate convergence of sharpness-aware minimization. 2023. Behnam Neyshabur, Srinadh Bhojanapalli, David Mc Allester, and Nati Srebro. Exploring generalization in deep learning. In Neur IPS, 2017. Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann Le Cun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. ar Xiv preprint ar Xiv:1805.12076, 2018. Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training. In ICML, 2021. Boris T Polyak. Introduction to Optimization. New York, Optimization Software, 1987. Sungbin Shin, Dongyeop Lee, Maksym Andriushchenko, and Namhoon Lee. Analyzing sharpnessaware minimization under overparameterization. ar Xiv preprint ar Xiv:2311.17539, 2024. Dongkuk Si and Chulhee Yun. Practical sharpness-aware minimization cannot converge all the way to optima. In Neur IPS, 2023. Sidak Pal Singh, Hossein Mobahi, Atish Agarwala, and Yann Dauphin. Avoiding spurious sharpness minimization broadens applicability of sam. ar Xiv preprint ar Xiv:2502.02407, 2025. Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, and Patrick Jaillet. A universal class of sharpness-aware minimization algorithms. In ICML, 2024. Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How sharpness-aware minimization minimizes sharpness? In ICLR, 2023. Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Neur IPS, 2017. Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Neur IPS, 2020. Wanyun Xie, Thomas Pethick, and Volkan Cevher. Sampa: Sharpness-aware minimization parallelized. In Neur IPS, 2024. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ar Xiv preprint ar Xiv:1611.03530, 2016. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021. Published as a conference paper at ICLR 2025 Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, and Zeming Wei. On the duality between sharpness-aware minimization and adversarial training. In ICML, 2024. Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8156 8165, 2021. Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan, and Ting Liu. Surrogate gap minimization improves sharpnessaware training. In ICLR, 2022. Published as a conference paper at ICLR 2025 Supplementary Material The Supplementary Material is organized as follows: In Appendix A, we give the basic definitions and lemmas that we need for the proofs. Appendix C presents the proofs of the theoretical guarantees from the main paper. In Appendix D, we provide additional experiments. 1 Introduction 1 2 Unified SAM with Arbitrary Sampling 4 3 Convergence Analysis 4 3.1 Main Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2 PL functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3 General non-convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.4 Beyond Uniform Sampling: Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . 7 4 Numerical Experiments 7 4.1 Validation of the theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.2 Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5 Conclusion 10 A Technical Preliminaries 15 A.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Basic Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B More on the Expected Residual condition 18 C Proofs of the main results 19 C.1 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 C.2 Proof of Theorem 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.3 Proof of Proposition 3.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C.4 Proof of Theorem 3.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 D Additional Experiments 27 Published as a conference paper at ICLR 2025 A TECHNICAL PRELIMINARIES A.1 BASIC DEFINITIONS In this section, we present some basic definitions we use throughout the paper. Definition A.1 (L-smooth). A differentiable function f : Rd R is L-smooth if there exists a constant L > 0 such that f(x) f(y) L x y , (2) or equivalently f(x) f(y) f(y), x y + L for all x, y Rd. Lemma A.2. Let f be an L-smooth function. Then for all x Rn we have f(x) 2 2L[f(x) f inf]. Definition A.3 (µ-PL). We say that a differentiable function f : Rd R satisfies the Polyak Lojasiewicz condition if f(x) 2 2µ(f(x) f(x )), (3) for all x Rd. Definition A.4 (Interpolation). We say that the interpolation condition holds if there exists x X such that min x Rn fi(x) = fi(x ) for all i = 1, . . . , n. Proposition A.5 (Young s Inequality). For any a, b Rn and β = 0 we have Completing the square we have a + b 2 (1 + β 1) a 2 + (1 + β) b 2. (5) For β = 1 we get a + b 2 2 a 2 + 2 b 2. (6) A.2 BASIC LEMMAS Lemma A.6. Assume that each fi is Li-smooth and consider the iterates of Unified SAM. Then it holds 1 λt + λt g(xt) 2 4L2 maxρ2 tλ2 t + 2 2L2 maxρ2 t(1 λt)2 + 1 ED g(xt) 2. Published as a conference paper at ICLR 2025 Proof. We have 1 λt + λt g(xt) 1 λt + λt g(xt) g(xt) g(xt) + g(xt) 1 λt + λt g(xt) g(xt) g(xt) 2 + 2 ED g(xt) 2 (2) 2L2 maxρ2 t ED (1 λt)g(xt) + λt g(xt) g(xt) 2 + 2 ED g(xt) 2 (6) 4L2 maxρ2 t(1 λt)2 ED g(xt) 2 + 4L2 maxρ2 tλ2 t + 2 ED g(xt) 2 = 4L2 maxρ2 tλ2 t + 2 2L2 maxρ2 t(1 λt)2 + 1 ED g(xt) 2. This completes the proof. Lemma A.7. Assume that each fi is Li-smooth and consider the iterates of Unified SAM. Then for any β > 0 it holds 1 λt + λt g(xt) g(xt) , f(xt) 1 β f(xt) 2 L2 maxρ2 tλ2 t β L2 maxρ2 t(1 λt)2 β ED g(xt) 2. In particular, for β = Lmaxρt we get 1 λt + λt g(xt) g(xt) , f(xt) 1 Lmaxρt Lmaxρtλ2 t Lmaxρt(1 λt)2 ED g(xt) 2. Proof. We have 1 λt + λt g(xt) g(xt) , f(xt) 1 λt + λt g(xt) g(xt) g(xt), f(xt) + ED g(xt), f(xt) 1 λt + λt g(xt) g(xt) g(xt), f(xt) + f(xt) 2. 1 λt + λt g(xt) g(xt) g(xt), f(xt) (4) 1 2β ED 1 λt + λt g(xt) g(xt) g(xt) 2 ED f(xt) 2 (2) L2 maxρ2 t 2β ED (1 λt)g(xt) + λt g(xt) g(xt) (6) L2 maxρ2 t 2β 2(1 λt)2 ED g(xt) 2 + L2 maxρ2 t 2β 2λ2 t + β Published as a conference paper at ICLR 2025 = L2 maxρ2 tλ2 t β + L2 maxρ2 t(1 λt)2 β ED g(xt) 2 + β 1 λt + λt g(xt) g(xt)), f(xt) L2 maxρ2 t(1 λt)2 β ED g(xt) 2 L2 maxρ2 tλ2 t β β f(xt) 2 L2 maxρ2 tλ2 t β L2 maxρ2 t(1 λt)2 β ED g(xt) 2. This completes the proof. Lemma A.8. Let (rt)t 0 and (δt)t 0 be sequences of non-negative real numbers and let g > 1 and N 0. Assume that the following recursive relationship holds: rt gδt δt+1 + N (7) Then it holds min 0 t T 1 rt g T Proof. Set gt = g t for any t Z. Then multiply both sides of (7) with gt. This yields gtrt gt 1δt gtδt+1 + Ngt. Summing for t = 0, . . . , T and telescoping we get t=0 gtrt g 1δ0 g T 1δT + N t=0 gt. (8) Now let G = PT 1 t=0 gt. Note that the sequence (gt) is decreasing for t 0, thus G Tg T 1. Now dividing (8) by G we get min 0 t T 1 rt 1 This completes the proof. Published as a conference paper at ICLR 2025 B MORE ON THE EXPECTED RESIDUAL CONDITION In this section, we introduce several assumptions under which ER is automatically satisfied. Similar derivations are provided in Khaled & Richt arik (2020), where the authors focus on the analysis of SGD. Specifically, we consider the following cases: Assuming bounded gradients, i.e. ED g(x) 2 σ2, x Rn, then the ER holds with A = 0, B = 0, C = σ2. Assuming bounded variance, i.e. ED g(x) f(x) σ2, x Rn, then the ER holds with A = 0, B = 1, C = σ2. Assuming expected smoothness, i.e. ED g(x) f(x) 2L[f(x) f inf], x Rn, then the ER holds with A = 2L, B = 0, C = 0. Assuming the relaxed strong growth condition, i.e. ED g(x) ρ f(x) + σ2, x Rn, then the ER holds with A = 0, B = ρ, C = σ2. Assuming the relaxed strong growth condition, i.e. ED g(x) α[f(x) f inf] + σ2, x Rn, then the ER holds with A = α, B = 0, C = σ2. Additionally, if we have more information about the problem and the distribution D, we can derive stronger bounds on A, B, and C. The more assumptions we make, the tighter these bounds become, as demonstrated in the following propositions. Proposition B.1 (Prop. 2, (Khaled & Richt arik, 2020)). Assume that each fi is Li-smooth and that ED[v2 i ] < for all i [n]. Let σ = 1 n Pn i=1 fi(x ) f i 0, where f i = infx Rn fi(x). Then ER holds with A = maxi Li ED[v2 i ], B = 0 and C = 2Aσ . This proposition indicates that if each fi is Li-smooth and minimal assumptions hold for the distribution D, then ER is satisfied. As an immediate corollary of Proposition B.1, if the problem further satisfies the interpolation assumption (see Definition A.4), then C = 0. The next proposition gives tighter constants for A, B, and C in the context of the sampling strategies considered in this paper. Proposition B.2 (Prop. 3, (Khaled & Richt arik, 2020)). Assume that each fi is Li-smooth. 1. For the single element sampling ER holds with A = 1 nτ maxi Li pi , B = 0 and C = 2Aσ . 2. For the τ-nice sampling ER holds with A = n τ τ(n 1)Lmax, B = n(τ 1) τ(n 1) and C = 2Aσ , where Lmax = maxi Li. Lastly, if we assume x -convexity with τ-nice sampling, we obtain the following constants: Proposition B.3 (Prop. 3.3, (Gower et al., 2021)). Assume that each fi is Li-smooth and that there exists x X such that fi is x convex. In addition, assume that D is the τnice sampling. Then ER holds with A = n τ τ(n 1)Lmax, B = 1 and C = 2(n τ) τ(n 1)σ1, where σ1 = supx X 1 n Pn i=1 fi(x ) 2. Published as a conference paper at ICLR 2025 C PROOFS OF THE MAIN RESULTS In this section we present the proofs of the main theoretical results presented in the main paper, i.e., the convergence guarantees of Unified SAM for PL and smooth functions and general non-convex and smooth functions. We restate the main theorems here for completeness. In all cases, we use the convention 1/0 = . C.1 PROOF OF THEOREM 3.2 Theorem C.1. Assume that each fi is Li-smooth, f is µ-PL and the ER is satisfied. Set Lmax = maxi [n] Li. Then the iterates of Unified SAM with ρt = ρ ρ := µ Lmax(µ+2[Bµ+A](1 λ)2), γt = γ γ := µ Lmaxρ(µ+2[Bµ+A](1 λ)2) 2Lmax(Bµ+A)[2L2maxρ2(1 λ)2+1], E[f(xt) f(x )] (1 γµ)t f(x0) f(x ) + N, where N = Lmax µ Cγ + ρ(1 + 2γL2 maxρ) λ2 + C(1 λ)2 . Proof of Theorem 3.2. By combining the smoothness of function f with the update rule of Unified SAM we obtain: f(xt+1) f(xt) + f(xt), xt+1 xt + Lmax 2 xt+1 xt 2 f(xt), g xt + ρt 1 λt + λt g(xt) + Lmaxγ2 t 2 1 λt + λt g(xt) By taking expectation conditioned on xt we obtain: E[f(xt+1) f(x )|xt] [f(xt) f(x )] 1 λt + λt g(xt) g(xt) , f(xt) + Lmaxγ2 t 2 ED 1 λt + λt g(xt) Lem. A.6 and A.7 f(xt) 2 Lmaxρtλ2 t Lmaxρt(1 λt)2 ED g(xt) 2 + γ2 t Lmax 2 4L2 maxρ2 tλ2 t + 2 2L2 maxρ2 t(1 λt)2 + 1 ED g(xt) 2 + γt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt ED g(xt) 2 + γt Lmaxρtλ2 t(1 + 2γt L2 maxρt) + 2Aγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt f(xt) f(x ) + Bγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt f(xt) 2 + Cγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt + γt Lmaxρtλ2 t(1 + 2γt L2 maxρt) 2 BLmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt f(xt) 2 Published as a conference paper at ICLR 2025 + 2Aγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt f(xt) f(x ) + γt Lmax Cρt(1 λt)2 + 2Cγt L2 maxρ2 t(1 λt)2 + Cγt + ρtλ2 t + 2γt L2 maxρ2 tλ2 t . (9) In order to use the fact that is µ-PL we need to ensure that the coefficient of f(xt) 2 is nonnegative. For this we have 2 BLmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt 0 2 + BLmaxρt(1 λt)2 + BLmaxγt 2L2 maxρ2 t(1 λt)2 + 1 1 Solving for γt we get γt 2 Lmaxρt 2BLmaxρt(1 λt)2 2BLmax (2L2maxρ2 t(1 λt)2 + 1) . (10) The numerator of the upper bound on γt in (10) must be positive (positive step-size). For this, the following restriction on ρt is needed. 2 Lmaxρt 2BLmaxρt(1 λt)2 0 ρt 2 Lmax (1 + 2B(1 λt)2). (11) Now using the µ-PL condition (3) on (9) we get E[f(xt+1) f(x )|xt] [f(xt) f(x )] 2γtµ 1 Lmaxρt 2 BLmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt [f(xt) f(x )] + 2Aγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt f(xt) f(x ) + γt Lmax Cρt(1 λt)2 + 2Cγt L2 maxρ2 t(1 λt)2 + Cγt + ρtλ2 t + 2γt L2 maxρ2 tλ2 t , E[f(xt+1) f(x )|xt] 1 2γtµ 1 Lmaxρt 2 BLmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt + 2Aγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt f(xt) f(x ) + γt Lmax Cρt(1 λt)2 + 2Cγt L2 maxρ2 t(1 λt)2 + Cγt + ρtλ2 t + 2γt L2 maxρ2 tλ2 t . (12) The coefficient of [f(xt) f(x )] can be simplified to 1 2γtµ 1 Lmaxρt 2 BLmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt + 2Aγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt = 1 2γtµ + γtµLmaxρt + 2γtµBLmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt + 2Aγt Lmax ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt = 1 2γtµ + γtµLmaxρt + 2γt Lmax(Bµ + A) ρt(1 λt)2 + 2γt L2 maxρ2 t(1 λt)2 + γt = 1 2γtµ + γt Lmaxρt µ + 2(Bµ + A)(1 λt)2 + 2γ2 t Lmax(Bµ + A) 2L2 maxρ2 t(1 λt)2 + 1 Published as a conference paper at ICLR 2025 Let us bound the above coefficient by 1 γtµ (used for the final linear convergence). For doing that, we force the following restrictions for ρt and γt: 1 2γtµ + γt Lmaxρt µ + 2(Bµ + A)(1 λt)2 + 2γ2 t Lmax(Bµ + A) 2L2 maxρ2 t(1 λt)2 + 1 1 γtµ γt Lmaxρt µ + 2(Bµ + A)(1 λt)2 + 2γ2 t Lmax(Bµ + A) 2L2 maxρ2 t(1 λt)2 + 1 γtµ Lmaxρt µ + 2(Bµ + A)(1 λt)2 + 2γt Lmax(Bµ + A) 2L2 maxρ2 t(1 λt)2 + 1 µ 2γt Lmax(Bµ + A) 2L2 maxρ2 t(1 λt)2 + 1 µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 γt µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 2Lmax(Bµ + A) (2L2maxρ2 t(1 λt)2 + 1). (13) The numerator of the upper bound on γt in (13) must be positive (as the step-size is positive). Thus: µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 0 ρt µ Lmax (µ + 2(Bµ + A)(1 λt)2). (14) Now by combining Equations (11) and (14) we obtain: ρt ρ = min 2 Lmax (1 + 2B(1 λt)2), µ Lmax (µ + 2(Bµ + A)(1 λt)2) = µ Lmax (µ + 2(Bµ + A)(1 λt)2), where last equality holds due to: µ Lmax (µ + 2(Bµ + A)(1 λt)2) 2 Lmax (1 + 2B(1 λt)2) Lmaxµ + 2Lmax Bµ(1 λt)2 2Lmaxµ + 4Lmax(Bµ + A)(1 λt)2 0 Lmaxµ + 2Lmax(Bµ + 2A)(1 λt)2. Similarly for γt by combining Equations (13) and (10) we obtain: ( 2 Lmaxρt 2BLmaxρt(1 λt)2 2BLmax (2L2maxρ2 t(1 λt)2 + 1) , µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 2Lmax(Bµ + A) (2L2maxρ2 t(1 λt)2 + 1) = µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 2Lmax(Bµ + A) (2L2maxρ2 t(1 λt)2 + 1), where the last inequality holds due to: µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 2Lmax(Bµ + A) (2L2maxρ2 t(1 λt)2 + 1) 2 Lmaxρt 2BLmaxρt(1 λt)2 2BLmax (2L2maxρ2 t(1 λt)2 + 1) µ Lmaxρt µ + 2(Bµ + A)(1 λt)2 Bµ + A 2 Lmaxρt 2BLmaxρt(1 λt)2 Bµ BLmaxρtµ 2BLmaxρt(Bµ + A)(1 λt)2 2Bµ + 2A Bmaxρtµ ALmaxρt 2BLmaxρt(Bµ + A)(1 λt)2, The above derivation simplifies to the below inequality: 0 Bµ + A(2 Lmaxρt), Published as a conference paper at ICLR 2025 which is true by the restriction on ρ in (11), that implies the quantity (2 Lmaxρt) is non-negative due to: ρt 2 Lmax (1 + 2B(1 λt)2) 2 Lmax . Now setting constant ρt = ρ , γt = γ and λt = λ [0, 1] in (12) we have E[f(xt+1) f(x )|xt] (1 γµ)[f(xt) f(x )] + γLmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 . (15) Taking expectation on (15) and using the tower property we get E[f(xt+1) f(x )] (1 γµ) E[f(xt) f(x )] + γLmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λt)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 = (1 γµ) E[f(xt) f(x )] + γLmax Cγ + ρ(1 + 2γL2 maxρ) λ2 + C(1 λt)2 . (16) Recursively applying the above and summing up the resulting geometric series yields: E[f(xt+1) f(x )] (1 γµ)t [f(x0) f(x )] + γLmax Cγ + ρ(1 + 2γL2 maxρ) λ2 + C(1 λ)2 t X j=0 (1 γµ)j. Using the fact that Pt j=0(1 γµ)j 1 γµ completes the proof. C.2 PROOF OF THEOREM 3.5 Theorem C.2. Assume that each fi is Li-smooth, f is µ-PL and the ER is satisfied. Then the iterates of Unified SAM with ρt = min n 1 2t+1, ρ o and γt = min n 2t+1 (t+1)2µ, γ o , where ρ and γ are defined in Theorem 3.2, satisfy: E[f(xt) f(x )] O 1 Proof of Theorem 3.5. Since γt = min n 2t+1 (t+1)2µ, γ o γ and ρt = min n 1 2t+1, ρ o ρ we get that inequality (16) holds for any t 0, hence we have: E[f(xt+1) f(x )] (1 γtµ) E[f(xt) f(x )] + γt Lmax Cγt + ρt(1 + 2γtρt L2 max) λ2 + C(1 λ)2 = (1 γtµ) E[f(xt) f(x )] + CLmaxγ2 t + γtρt L2 max(1 + 2γtρt L2 max) λ2 + C(1 λ)2 (1 γtµ) E[f(xt) f(x )] + CLmaxγ2 t + γtρt L2 max 1 + 2L2 max µ λ2 + C(1 λ)2 , (17) where the last inequality follows from γt 2t+1 (t+1)2µ and ρt 1 2t+1, and thus γtρt 1 (t+1)2µ 1 Now set δt = E[f(xt) f(x )], R = CLmax and Q = L2 max 1 + 2L2 max µ λ2 + C(1 λ)2 . Then the inequality (17) takes the following form: δt+1 (1 γtµ)δt + Qγtρt + Rγ2 t . (18) Published as a conference paper at ICLR 2025 Now since the sequences 2t+1 (t+1)2µ and 1 2t+1 are decreasing there exists an integer t N such that for any t t we have γt = 2t+1 (t+1)2µ and ρt = 1 2t+1. Substituting in (18) we get that for any t t (t + 1)2 δt + Q µ(t + 1)2 + R(2t + 1)2 (t + 1)2 δt + Qµ + 4R µ2(t + 1)2 , because (2t+1)2 (t+1)4 4(t+1)2 (t+1)4 = 4 (t+1)2 . Multiplying both sides with (t + 1)2 and rearranging we have (t + 1)2δt+1 t2δt Qµ + 4R Summing for t = t , . . . , T 1 and telescoping we have T 2δT (t )2δt + Qµ + 4R µ2 (T t 1). Changing notation from T to t we get E[f(xt) f(x )] (t )2δt t2 + Qµ + 4R t2 + Qµ + 4R This completes the proof. C.3 PROOF OF PROPOSITION 3.6 Proposition C.3. Assume that each fi is Li-smooth and the ER is satisfied. Then the iterates of Unified SAM with ρ min n 1 4Lmax , 1 8BLmax(1 λ)2 o and γ 1 8BLmax , satisfy: min t=0,...,T 1 E f(xt) 2 2 1 + 2AγLmax ρ(1 λ)2(1 + 2γρL2 max) + γ T Tγ [f(x0) f inf] + 2Lmax Cγ + ρ(1 + 2γρL2 max)(λ2 + C(1 λ)2) . Proof of Proposition 3.6. From Equation (9) we have E[f(xt+1) f inf|xt] [f(xt) f inf] 2 BLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ f(xt) 2 + 2AγLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ f(xt) f inf + γLmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 2 f(xt) 2 + 2AγLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ f(xt) f inf + γLmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 . (19) Here inequality (19) is true due to: 1 Lmaxρ 2 BLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ 1 2 + BLmaxρ(1 λ)2 + BLmaxγ + 2BγL3 maxρ2(1 λ)2 1 Inequality (20) holds due to the theorem s constraints on γ and ρ, which imply that each factor on the left-hand side of inequality (20) is less than 1/8, as explained below. Published as a conference paper at ICLR 2025 8 by the condition on ρ. BLmaxρ(1 λ)2 1 8 by the condition on ρ. 8 by the condition on γ. 2BγL3 maxρ2(1 λ)2 1 8 because we have ρ 1 4Lmax and γ 1 8BLmax , so 2BγL3 maxρ2(1 λ)2 2B 1 8BLmax L3 max Summing up the inequalities in the above bullet points results in inequality (20). By taking expectation and using the tower property in (19), we have E[f(xt+1) f inf] E[f(xt) f inf] 2 E f(xt) 2 + 2AγLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ E f(xt) f inf + γLmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 , thus γ 2 E f(xt) 2 1 + 2AγLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ E[f(xt) f inf] E[f(xt+1) f inf] + γLmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 . (21) Let us now set: γ E[f(xt) f inf] 0 rt = E f(xt) 2 0 g = 1 + 2AγLmax ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ = 1 + 2AγLmax γ + ρ(1 λ)2 1 + 2γL2 max > 1 N = 2Lmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 = 2Lmax Cγ + ρ(1 + 2γρL2 max)(λ2 + C(1 λ)2) 0 and inequality (21) takes the following form: γ 2 rt gδt δt+1 + N. (22) Applying Lemma A.8 to (22) completes the proof. C.4 PROOF OF THEOREM 3.7 Theorem C.4. Let ε > 0 and set δ0 = f(x0) f inf 0. For ρ = min 1 4Lmax , 1 8BLmax(1 λ)2 , 1 12Lmax(C(1 λ)2 + λ2) γ = min 1 8BLmax , 1 2L(1 λ) 3ALmax , 1 6Lmax A(1 λ)2 Published as a conference paper at ICLR 2025 24L3max (C(1 λ)2 + λ2), ε2 Then provided that ε2 max 96B, 24(1 λ) p 3Lmax A, 5184Lmax A2(1 λ)4δ0 ε2 , 864δ0A 288L2 max(1 λ)2 we have mint=0,...,T 1 E f(xt) ε. Proof of Theorem 3.7. From Proposition C.3 under the condition that ρ min n 1 4Lmax , 1 8BLmax(1 λ)2 o and γ 1 8BLmax we have min t=0,...,T 1 E f(xt) 2 2 1 + 2AγLmax ρ(1 λ)2(1 + 2γρL2 max) + γ T Tγ [f(x0) f inf] + 2Lmax Cγ + ρ(1 + 2γρL2 max)(λ2 + C(1 λ)2) . (23) Using the fact that 1 + x exp(x), we have that 1 + 2AγLmax ρ(1 λ)2(1 + 2γρL2 max) + γ T exp 2TAγLmax ρ(1 λ)2(1 + 2γρL2 max) + γ exp(1) < 3. (24) We note that for the second inequality to hold, it is enough to assume that 2TAγLmax ρ(1 λ)2(1 + 2γρL2 max) + γ = 2TALmaxγ ρ(1 λ)2 + 2γL2 maxρ2(1 λ)2 + γ 1, which is true if we have that, 2TALmaxγρ(1 λ)2 1/3 4TAL3 maxγρ2(1 λ)2 1/3 2TALmaxγ2 1/3. Imposing the restriction ρ 1 T , the above inequalities can be expressed as (by solving for γ): γ 1 6Lmax A(1 λ)2 γ 1 2L(1 λ) 3ALmax Using these restrictions on γ, we are able to substitute the bound of (24) in (23) to get: min t=0,...,T 1 E f(xt) 2 6δ0 Tγ + 2Lmax Cγ + ρ(1 + 2γρL2 max)(λ2 + C(1 λ)2) . (25) To make the right hand side of (25) smaller than ε2, we require that the second term satisfies 2Lmax Cγ + ρ(1 + 2γρL2 max)(λ2 + C(1 λ)2) ε2 2Lmax Cρ(1 λ)2 + 2CγL2 maxρ2(1 λ)2 + Cγ + ρλ2 + 2γL2 maxρ2λ2 ε2 2Lmax ρ C(1 λ)2 + λ2 + 2L2 maxγρ2 C(1 λ)2 + λ2 + Cγ ε2 2 For this to hold it is enough to have Published as a conference paper at ICLR 2025 2Lmaxρ(C(1 λ)2 + λ2) ε2 6 ρ ε2 12Lmax(C(1 λ)2+λ2) 4L3 maxγρ2 C(1 λ)2 + λ2 ε2 6 ρ<1 γ ε2 24L3max(C(1 λ)2+λ2) 2Lmax Cγ ε2 6 γ ε2 12Lmax C Similarly, for the first term, we get that the number of iterations must satisfy: Hence so far we need the following restrictions on ρ and γ: ρ min n 1 4Lmax , 1 8BLmax(1 λ)2 o and γ 1 8BLmax (by the restrictions of Proposition C.3) T and ρ ε2 12Lmax(C(1 λ)2+λ2) γ 1 6Lmax A(1 λ)2 T and γ 1 2L(1 λ) 3ALmax and γ 1 ε2 24L3 max(C(1 λ)2+λ2) and γ ε2 12Lmax C Plugging each of the previous bounds of γ into (26) we get T 96BLmaxδ0 T 24Lmaxδ0(1 λ) 3Lmax A T 5184L2 max A2(1 λ)4δ2 0 ε4 T 864δ2 0ALmax T 144Lmax Cδ0 T 288L3 maxδ0(1 λ)2 Finally, collecting all the terms into a single bound we have: ρ = min 1 4Lmax , 1 8BLmax(1 λ)2 , 1 12Lmax(C(1 λ)2 + λ2) γ = min 1 8BLmax , 1 2L(1 λ) 3ALmax , 1 6Lmax A(1 λ)2 24L3max (C(1 λ)2 + λ2), ε2 ε2 max 96B, 24(1 λ) p 3Lmax A, 5184Lmax A2(1 λ)4δ0 ε2 , 864δ0A 288L2 max(1 λ)2 This completes the proof. Published as a conference paper at ICLR 2025 D ADDITIONAL EXPERIMENTS In this section, we present additional experimental evaluations of Unified SAM, following the same setup as in Li & Giannakis (2023). Specifically, we train Res Net-18 and WRN-28-10 on CIFAR10 and CIFAR100 datasets. Standard data augmentation techniques, including random cropping, random horizontal flipping, and normalization De Vries (2017), are employed. The models are trained for 200 epochs with a batch size of 128, using a cosine scheduler starting from 0.05. Weight decay is set to 0.001. For SAM, we use ρ = 0.1 for CIFAR10 and ρ = 0.2 for CIFAR100. Each experiment is repeated three times, and we report the average of the maximum test accuracy along with the standard error. The numerical results are presented in Tables 6 and 7, demonstrate that Unified SAM consistently improves the test accuracy over both USAM and SAM across all tested models. Lastly, we also observe that careful tuning of the parameter λ is essential for achieving optimal performance. Table 6: Test accuracy (%) of Unified SAM on various neural networks trained on CIFAR10. Model λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t Res Net-18 96.13 0.05 96.16 0.03 96.33 0.03 96.20 0.05 96.22 0.09 Wide Res Net-28-10 97.26 0.44 96.98 0.08 97.05 0.05 96.73 0.04 96.63 0.35 Table 7: Test accuracy (%) of Unified SAM on various neural networks trained on CIFAR100. Model λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t Res Net-18 80.21 0.02 80.28 0.16 80.14 0.07 80.27 0.09 80.19 0.06 Wide Res Net-28-10 83.49 0.23 83.71 0.03 83.55 0.19 83.55 0.10 83.62 0.16 Table 8: Test accuracy (%) of Unified SAM for PRN-18 on CIFAR10, evaluated across different values of ρ and λ. With bold we highlight the best performance for fixed ρ. Unified SAM λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t ρ = 0.1 95.29 0.09 95.32 0.02 95.59 0.09 95.24 0.03 95.53 0.11 ρ = 0.2 95.25 0.14 95.48 0.05 95.5 0.02 95.38 0.11 95.58 0.07 ρ = 0.3 95.25 0.11 95.24 0.02 95.18 0.04 95.12 0.19 95.26 0.1 ρ = 0.4 94.76 0.09 94.98 0.07 94.7 0.02 94.64 0.03 94.61 0.09 SGD 94.82 0.02 Table 9: Test accuracy (%) of Unified SAM for PRN-18 on CIFAR100, evaluated across different values of ρ and λ. With bold we highlight the best performance for fixed ρ. Unified SAM λ = 0.0 λ = 0.5 λ = 1.0 λ = 1/t λ = 1 1/t ρ = 0.1 78.28 0.15 78.28 0.06 78.32 0.22 78.33 0.32 78.39 0.31 ρ = 0.2 78.98 0.18 78.68 0.13 78.96 0.12 78.87 0.02 78.79 0.1 ρ = 0.3 79.0 0.05 78.95 0.07 79.21 0.08 78.73 0.06 79.27 0.08 ρ = 0.4 78.57 0.26 78.76 0.3 79.05 0.16 78.36 0.09 78.79 0.13 SGD 76.9 0.23