Published as a conference paper at ICLR 2025

SHARPNESS-AWARE BLACK-BOX OPTIMIZATION

Feiyang Ye1,2, Yueming Lyu3,4, Xuehao Wang1, Masashi Sugiyama5,6, Yu Zhang1, Ivor W. Tsang2,3,4,7
1Department of Computer Science and Engineering, Southern University of Science and Technology
2Australian Artificial Intelligence Institute, University of Technology Sydney
3Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore
4Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
5RIKEN Center for Advanced Intelligence Project
6Graduate School of Frontier Sciences, The University of Tokyo
7College of Computing and Data Science, Nanyang Technological University
{feiyang.ye.uts,xuehaowangfi,yu.zhang.ust}@gmail.com sugi@k.u-tokyo.ac.jp {Lyu Yueming,Ivor Tsang}@cfar.a-star.edu.sg
Equal contribution. Corresponding author.

ABSTRACT

Black-box optimization algorithms have been widely used in various machine learning problems, including reinforcement learning and prompt fine-tuning. However, directly optimizing the training loss value, as commonly done in existing black-box optimization methods, could lead to suboptimal model quality and generalization performance. To address those problems in black-box optimization, we propose a novel Sharpness-Aware Black-box Optimization (SABO) algorithm, which applies a sharpness-aware minimization strategy to improve the model generalization. Specifically, the proposed SABO method first reparameterizes the objective function by its expectation over a Gaussian distribution. Then it iteratively updates the parameterized distribution by approximated stochastic gradients of the maximum objective value within a small neighborhood around the current solution in the Gaussian distribution space. Theoretically, we prove the convergence rate and generalization bound of the proposed SABO algorithm. Empirically, extensive experiments on the black-box prompt fine-tuning tasks demonstrate the effectiveness of the proposed SABO method in improving model generalization performance.

1 INTRODUCTION

Black-box optimization involves optimizing an objective function by using function queries only. In this work, we study the black-box optimization problem (Jones et al., 1998), which is formulated as

$\min_{x} F(x), \quad \text{s.t.} \ x \in \mathcal{X}$, (1)

where $\mathcal{X} \subseteq \mathbb{R}^d$ and $d$ represents the parameter dimension. The objective function $F: \mathbb{R}^d \to \mathbb{R}$, which is lower bounded, can only be queried to obtain function values, and we cannot get the gradient of $F$ w.r.t. $x$. In this work, we focus on the online setting for black-box optimization, where, different from the offline setting (Chen et al., 2022; Qi et al., 2022), we do not have a prior dataset containing the variable $x$ and its corresponding objective value. Black-box optimization has drawn intensive attention in a wide range of applications, such as deep reinforcement learning (Salimans et al., 2017; Conti et al., 2018), black-box adversarial attacks on deep neural networks (Ilyas et al., 2018; Dong et al., 2019), etc.

Recently, black-box optimization has shown increasing power on real-world natural language processing tasks, especially with the emergence of large language models (LLMs), since a common practice is to release LLMs as a service and allow users to access them through inference APIs.
In such a scenario, called Language-Model-as-a-Service (LMaaS) (Sun et al., 2022b;a; 2023), users cannot access or tune model parameters but can only tune their prompts without model backpropagation to accomplish language tasks of interest, which directly increases the demand for black-box optimization methods.

Although black-box optimization algorithms have been successfully applied to various learning tasks, most existing works directly optimize the training loss value (Sun et al., 2022b), which may lead to suboptimal model quality and generalization performance. Since the training loss landscape is complex and has many local minima with different generalization abilities (Zhang et al., 2021), the learned model may suffer from the overfitting problem, causing poor generalization performance (Foret et al., 2021). Hence, reducing the performance gap between training and testing is an important research topic in deep learning (Neyshabur et al., 2017). Recently, there have been many works exploring the close relationship between loss geometry and generalization performance, and it has been observed that flat minima often imply better generalization (Dziugaite & Roy, 2017; Chatterji et al., 2019; Petzka et al., 2021). This inspires us to design a black-box optimization algorithm that improves model generalization by finding flat minima.

Sharpness-aware minimization (SAM) (Foret et al., 2021) is a state-of-the-art method to seek flat minima in white-box cases by solving a min-max optimization problem. SAM minimizes the maximum objective value within a small neighborhood around the current solution. Since SAM considers the geometry of the Euclidean parameter space, it uses a Euclidean ball to define the neighborhood. In SAM, each update consists of two forward-backward computations: one for computing the perturbation and the other for computing the actual update direction. SAM has been proven to perform better than SGD, and its variants (Kwon et al., 2021; Zhuang et al., 2022; Zhao et al., 2022; Kim et al., 2022; Jiang et al., 2023; Liu et al., 2022) yield significant performance gains in various fields such as computer vision and natural language processing (Bahri et al., 2022; Foret et al., 2021). However, SAM and its variants rely on the availability of true gradients or stochastic gradients w.r.t. the variable x, so they are inapplicable to black-box optimization.

To take advantage of SAM to improve the generalization performance of black-box optimization, we propose a Sharpness-Aware Black-box Optimization (SABO) algorithm. Specifically, SABO first reparameterizes the objective function via its expectation over a Gaussian distribution, which makes it possible to optimize the objective by accessing only function values (Wierstra et al., 2014; Lyu & Tsang, 2021). Then the SABO method seeks to identify a robust minimum region over the space of Gaussian distributions, which is different from SAM, which finds flat minima over the parameter space. To achieve that, the SABO method iteratively updates the parameterized distribution via a search direction obtained by approximated stochastic gradients of the maximum objective value within a small neighborhood around the current solution in the space of Gaussian distributions. Theoretically, we analyze the convergence rate and provide a generalization error bound for the proposed SABO algorithm.
Empirically, we verify the convergence results of the proposed algorithm on synthetic problems, and extensive experimental results on a black-box prompt fine-tuning problem demonstrate the effectiveness of the proposed SABO method. Our contributions are summarized as follows.

We propose the SABO algorithm for black-box optimization. To the best of our knowledge, we are the first to design a stochastic gradient approximation algorithm to improve model generalization in black-box optimization by using the sharpness-aware minimization strategy.

Theoretically, we prove that the proposed SABO algorithm possesses a convergence rate of $O(\frac{\log T}{T})$ in the full-batch function query setting and $O(\frac{1}{\sqrt{T}})$ in the mini-batch function query setting, respectively. Moreover, we provide a generalization error analysis for the proposed SABO method.

Empirically, we verify the convergence results of the SABO algorithm on synthetic numerical problems. Moreover, extensive experiments on black-box prompt fine-tuning tasks demonstrate the effectiveness of the proposed SABO method in improving the model generalization performance.

Notation and Symbols. We denote by $\|\cdot\|_2$ and $\|\cdot\|_\infty$ the $\ell_2$ norm and the $\ell_\infty$ norm for vectors, respectively, and by $\|\cdot\|_F$ the Frobenius norm for matrices. $\mathcal{S}_+$ denotes the set of positive semidefinite matrices. For a square matrix $X$, $\mathrm{diag}(X)$ represents a vector with the diagonal entries of $X$, and if $x$ is a vector, $\mathrm{diag}(x)$ represents a diagonal matrix with $x$ as its diagonal entries. We define $\|X\|_Y := \sqrt{\langle X, YX \rangle}$ for a positive semi-definite matrix $Y \in \mathcal{S}_+$ or a non-negative vector $Y$, where $\langle \cdot, \cdot \rangle$ denotes the inner product under the Frobenius norm for matrices and the inner product under the $\ell_2$ norm for vectors. $X \oslash Y$ denotes elementwise division when $X$ and $Y$ are vectors, and elementwise division of the diagonal elements of $X$ and $Y$ when they are diagonal matrices.

2 BACKGROUND

Stochastic Gradient Approximation. The stochastic gradient approximation method (Wierstra et al., 2014; Lyu & Tsang, 2021; Ye et al., 2024) is a representative strategy for solving black-box optimization problems; instead of maintaining a population of search points, it iteratively updates a search distribution by stochastic gradient approximation. The general procedure of stochastic gradient approximation methods is to first generate a batch of sample points from a parameterized search distribution. The sample points then allow the algorithm to capture the local structure of the fitness function and appropriately estimate a stochastic gradient to update the distribution. Specifically, the stochastic gradient approximation method reparameterizes $F(x)$ as

$J(\theta) = \mathbb{E}_{p_\theta(x)}[F(x)] = \int F(x)\, p(x; \theta)\, dx$, (2)

where $\theta$ denotes the parameters of the density $p(x; \theta)$ (also written $p_\theta$) and $F(x)$ is also referred to as the fitness function of $x$. Based on this definition, we can obtain the Monte Carlo estimation of the search gradient as

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} F(x_j)\, \nabla_\theta \log p_\theta(x_j)$, (3)

where $x_j$ denotes the $j$-th sample and $N$ denotes the number of samples. Therefore, the stochastic gradient $\nabla_\theta J(\theta)$ provides a search direction in the space of search distributions.
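To make this background concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) of the Monte Carlo search-gradient estimate in Eq. (3) for a diagonal Gaussian search distribution; the toy objective `sphere` and all variable names are assumptions made for the example.

```python
import numpy as np

def sphere(x):
    """A toy black-box objective: only function values are available."""
    return float(np.sum(x ** 2))

def search_gradient(F, mu, sigma2, n_samples=50, rng=None):
    """Monte Carlo estimate of the gradient of J(theta) = E_{p_theta}[F(x)]
    for p_theta = N(mu, diag(sigma2)), using the log-derivative trick of Eq. (3)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((n_samples, mu.shape[0]))
    x = mu + np.sqrt(sigma2) * z                      # samples from N(mu, diag(sigma2))
    f = np.array([F(xi) for xi in x])
    # Score functions of a diagonal Gaussian:
    #   d/d mu     log p(x) = (x - mu) / sigma2
    #   d/d sigma2 log p(x) = ((x - mu)^2 - sigma2) / (2 * sigma2^2)
    grad_mu = np.mean(f[:, None] * (x - mu) / sigma2, axis=0)
    grad_sigma2 = np.mean(f[:, None] * ((x - mu) ** 2 - sigma2) / (2.0 * sigma2 ** 2), axis=0)
    return grad_mu, grad_sigma2

# Example: estimate the search gradient of a 10-dimensional sphere at mu = 1, sigma2 = 0.5.
g_mu, g_sigma2 = search_gradient(sphere, np.ones(10), 0.5 * np.ones(10))
```

The estimators used by SABO in Section 3.2 have the same structure but additionally subtract the value $F(\mu_t)$ as a baseline, which keeps the estimates unbiased while improving stability.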
Sharpness-Aware Minimization. SAM (Foret et al., 2021) attempts to improve generalization by finding flat minima. This is achieved by minimizing the worst-case loss within some perturbation radius. Mathematically, it is formulated as the following min-max optimization problem:

$\min_{x \in \mathcal{X}} \max_{\|\epsilon\|_2 \le \rho^2} F(x + \epsilon)$, (4)

where $F: \mathbb{R}^d \to \mathbb{R}$ is the objective function, $x$ denotes variables that can represent model parameters, $\rho > 0$ is a positive constant, and $\epsilon$ is the perturbation whose magnitude is bounded by $\rho^2$. By taking the first-order Taylor approximation of $F(x + \epsilon)$ around $x$, a solution of $\epsilon$ for the maximization subproblem can be obtained as

$\epsilon(x) = \rho^2 \frac{\nabla F(x)}{\|\nabla F(x)\|_2}$. (5)

Then problem (4) can be solved by performing the gradient descent method on the minimization subproblem as $x_{t+1} = x_t - \beta_t \nabla_x F(x_t + \epsilon(x_t))$, where $\beta_t$ represents the step size in the $t$-th iteration. Note that this gradient implicitly depends on the Hessian of $F(x)$ because $\epsilon(x)$ is a function of $x$. To accelerate the computation, a common approach in SAM-based methods (Foret et al., 2021; Kim et al., 2022; Jiang et al., 2023) is to apply a first-order gradient approximation, so we obtain the update rule for SAM as $x_{t+1} = x_t - \beta_t \nabla_x F(x)|_{x = x_t + \epsilon_t}$, where $\epsilon_t = \epsilon(x_t)$ is viewed as a constant w.r.t. $x$.

3 METHODOLOGY

In this section, we introduce the proposed SABO algorithm. We first formulate sharpness-aware black-box optimization as a min-max optimization problem and solve it in Section 3.1, and in Section 3.2, we derive the update formulas for the parameters of the search distribution. The detailed derivations for this section are provided in Appendix A.

3.1 SHARPNESS-AWARE BLACK-BOX OPTIMIZATION

Suppose we are given a training set $\mathcal{D}$ with i.i.d. samples $\{(X_i, y_i)\}$. The main objective is defined as

$F(x; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(X_i, y_i) \in \mathcal{D}} l(x; (X_i, y_i))$, (6)
where $x$ denotes model parameters, $|\mathcal{D}|$ denotes the number of data points in the dataset $\mathcal{D}$, and $l(x; (X, y))$ denotes the loss function (e.g., the cross-entropy loss for classification). To simplify the notation, we define $F(x) := F(x; \mathcal{D})$.

In black-box optimization, we aim at minimizing the objective function $F(x)$ with only function queries. Due to the lack of gradient information, we first apply the stochastic gradient approximation method (Wierstra et al., 2014; Lyu & Tsang, 2021). We denote by $\theta$ the parameters of the search distribution $p_\theta$ and define the expected fitness of $F(x)$ under the parametric search distribution $p_\theta(x)$ as $J(\theta) = \mathbb{E}_{p_\theta(x)}[F(x)]$. Then the optimal parameter $\theta$ can be found by minimizing the reparameterized objective $J(\theta)$.

Inspired by SAM (Foret et al., 2021), we attempt to improve generalization by finding flat minima w.r.t. $\theta$. However, for the reparameterized objective $J(\theta)$, the geometry of the corresponding distribution space is not Euclidean but a statistical manifold, where the distance between two probability distributions is defined by some statistical distance, e.g., the Kullback-Leibler (KL) divergence. Therefore, instead of restricting the perturbation to a Euclidean ball, we restrict the perturbed distribution to lie inside a small neighborhood of the unperturbed distribution w.r.t. the KL divergence (Amari, 2016). The proposed optimization problem for black-box optimization is formulated as

$\min_{\theta} \max_{\delta \in C(\theta)} J(\theta + \delta)$, (7)

where $C(\theta) = \{\delta \mid \mathrm{KL}(p_{\theta+\delta} \,\|\, p_\theta) \le \rho^2\}$, $\rho$ is a positive constant, and $J(\theta + \delta) = \mathbb{E}_{x \sim p_{\theta+\delta}}[F(x)]$. Note that $C(\theta)$ defines the neighborhood around a given distribution $p_\theta$ in the distribution space, which is different from the neighborhood of SAM, which is defined in the parameter space. Here $\rho$ defines the size of the neighborhood. Generally, problem (7) is applicable to any family of distributions $p_\theta$.

For computational considerations, the search distribution is assumed to be a Gaussian distribution, i.e., $p_\theta(x) = \mathcal{N}(x \mid \mu, \Sigma)$, where $\mu$ denotes the mean and $\Sigma$ denotes the covariance matrix, and correspondingly $\theta$ includes $\mu$ and $\Sigma$, i.e., $\theta = \{\mu, \Sigma\}$. In this work, we assume that the covariance matrix $\Sigma$ is a diagonal matrix. For a perturbation $\delta = \{\delta_\mu, \delta_\Sigma\}$, where $\delta_\Sigma$ is a diagonal matrix, the perturbed distribution is a Gaussian distribution $p_{\theta+\delta}(x) = \mathcal{N}(x \mid \mu + \delta_\mu, \Sigma + \delta_\Sigma)$.

We need to solve problem (7) to derive the update formulation for $\theta$. By using the first-order Taylor expansion, the maximization subproblem can be approximated as a quadratically constrained linear programming problem:

$\max_{\delta \in C(\theta)} J(\theta + \delta) \approx \max_{\delta \in C(\theta)} \langle \nabla_\theta J(\theta), \delta \rangle$. (8)

The corresponding Lagrangian of problem (8) is

$\mathcal{L}(\delta, \lambda) = -\langle \nabla_\theta J(\theta), \delta \rangle + \lambda \big(\mathrm{KL}(p_{\theta+\delta} \,\|\, p_\theta) - \rho^2\big)$, (9)

where $\lambda$ is the Lagrange multiplier. We can see that problem (9) is convex with respect to $\delta$. Therefore, by setting the derivatives w.r.t. $\delta_\mu$ and $\delta_\Sigma$ to zero, we can obtain $\delta$ as

$\delta_\mu(\theta) = \frac{1}{\lambda} \Sigma \nabla_\mu J(\theta), \qquad \delta_\Sigma(\theta) = 2\Sigma \nabla_\Sigma J(\theta) \oslash \big(\lambda \Sigma^{-1} - 2\nabla_\Sigma J(\theta)\big)$. (10)

As shown in Eq. (10), to calculate the perturbation, we need to calculate the inverse covariance matrix, i.e., $\Sigma^{-1}$, which is computationally expensive for high-dimensional problems. Therefore, assuming that $\Sigma$ is a diagonal matrix can significantly reduce the computation cost. Then, by plugging $\delta_\mu(\theta)$ and $\delta_\Sigma(\theta)$ into the neighborhood constraint, i.e., $\delta \in C(\theta)$, we can determine the optimal $\lambda$ as

$\lambda = \frac{1}{\rho} \sqrt{\|\Sigma \nabla_\Sigma J(\theta)\|_F^2 + 0.5\, \|\Sigma^{\frac{1}{2}} \nabla_\mu J(\theta)\|_2^2}$. (11)

Based on Eqs. (10) and (11), we obtain the approximated closed-form solution $\delta(\theta)$ for a given $\theta$. With $\delta(\theta)$, the minimization subproblem of problem (7) can be reformulated as

$\min_{\theta} J(\theta + \delta(\theta))$. (12)

To solve problem (12), in the $t$-th iteration, we add a regularization term $\frac{1}{\beta_t} \mathrm{KL}(p_\theta \,\|\, p_{\theta_t})$ to problem (12) to enforce $\theta$ to be close to $\theta_t$, and obtain $\theta_{t+1}$ by solving the following problem:

$\theta_{t+1} = \arg\min_{\theta}\; J(\theta + \delta(\theta_t)) - J(\theta_t) + \frac{1}{\beta_t} \mathrm{KL}(p_\theta \,\|\, p_{\theta_t})$. (13)

Following standard SAM-based methods (Foret et al., 2021), we treat $\delta(\theta_t)$ as a constant $\delta_t$ instead of a function of $\theta_t$ to accelerate the computation. Then, by using the first-order Taylor expansion, problem (13) can be approximated as

$\theta_{t+1} = \arg\min_{\theta}\; \langle \theta - \theta_t, \nabla_\theta J(\theta_t + \delta_t) \rangle + \frac{1}{\beta_t} \mathrm{KL}(p_\theta \,\|\, p_{\theta_t})$. (14)

By solving problem (14), we can obtain the update formulations for $\mu$ and $\Sigma$ in the $t$-th iteration as

$\mu_{t+1} = \mu_t - \beta_t \Sigma_t \nabla_\mu J(\theta_t + \delta_t), \qquad \Sigma_{t+1}^{-1} = \Sigma_t^{-1} + 2\beta_t \nabla_\Sigma J(\theta_t + \delta_t)$, (15)

where $\nabla_\mu J(\theta_t + \delta_t)$ and $\nabla_\Sigma J(\theta_t + \delta_t)$ denote the derivatives of $J(\theta)$ w.r.t. $\mu$ and $\Sigma$ evaluated at $\mu = \mu_t + \delta_{\mu t}$ and $\Sigma = \Sigma_t + \delta_{\Sigma t}$, respectively.

3.2 UPDATE FORMULATIONS FOR SABO

The gradients of the reparameterized objective $J(\theta)$ w.r.t. $\mu$ and $\Sigma$ rely on expectations of the black-box function and can be obtained with only function queries (Wierstra et al., 2014) (see Theorem A.1 in Appendix A.4). Hence, we estimate them by Monte Carlo sampling. Specifically, the stochastic approximations of the gradients $\nabla_\mu J(\theta_t)$ and $\nabla_\Sigma J(\theta_t)$ are given as

$g'_t = \frac{1}{N} \sum_{j=1}^{N} \Sigma_t^{-1} (x'_j - \mu_t)\big(F(x'_j) - F(\mu_t)\big)$, (16)

$G'_t = \frac{1}{2N} \sum_{j=1}^{N} \mathrm{diag}\Big[\Sigma_t^{-1}\Big(\mathrm{diag}\big((x'_j - \mu_t)(x'_j - \mu_t)^\top\big)\Sigma_t^{-1} - I\Big)\Big]\big(F(x'_j) - F(\mu_t)\big)$, (17)

where $x'_j$ denotes the $j$-th sample drawn from the distribution $\mathcal{N}(x \mid \mu_t, \Sigma_t)$. Note that $g'_t$ is an unbiased estimator of the gradient $\nabla_\mu J(\theta)$, as proved in Lemma C.5, and, inspired by Lyu & Tsang (2021), we subtract $F(\mu_t)$ to improve the computational stability while keeping the estimates unbiased. To avoid the scaling problem, in practice we can employ a monotonic transformation strategy; the details can be found in Appendix F.
Then, according to Eq. (10), we obtain the perturbation $\delta_t = \{\delta_{\mu t}, \delta_{\Sigma t}\}$ in the $t$-th iteration as

$\delta_{\mu t} = \frac{1}{\lambda} \Sigma_t g'_t, \qquad \delta_{\Sigma t} = 2\Sigma_t G'_t \oslash \big(\lambda \Sigma_t^{-1} - 2 G'_t\big)$, (18)

where $\lambda$ is approximated by $\lambda = \frac{1}{\rho}\sqrt{\|\Sigma_t G'_t\|_F^2 + 0.5\, \|\Sigma_t^{\frac{1}{2}} g'_t\|_2^2}$. Similarly, the gradients $\nabla_\mu J(\theta_t + \delta_t)$ and $\nabla_\Sigma J(\theta_t + \delta_t)$ can be approximated as follows:

$g_t = \frac{1}{N} \sum_{j=1}^{N} \hat{\Sigma}_t^{-1} (x_j - \hat{\mu}_t)\big(F(x_j) - F(\hat{\mu}_t)\big)$, (19)

$G_t = \frac{1}{2N} \sum_{j=1}^{N} \mathrm{diag}\Big[\hat{\Sigma}_t^{-1}\Big(\mathrm{diag}\big((x_j - \hat{\mu}_t)(x_j - \hat{\mu}_t)^\top\big)\hat{\Sigma}_t^{-1} - I\Big)\Big]\big(F(x_j) - F(\hat{\mu}_t)\big)$, (20)

where $\hat{\Sigma}_t = \Sigma_t + \delta_{\Sigma t}$, $\hat{\mu}_t = \mu_t + \delta_{\mu t}$, and $x_j$ denotes the $j$-th sample drawn from the distribution $\mathcal{N}(x \mid \hat{\mu}_t, \hat{\Sigma}_t)$. Then the update formulations for $\mu$ and $\Sigma$ are rewritten as

$\mu_{t+1} = \mu_t - \beta_t \Sigma_t g_t, \qquad \Sigma_{t+1}^{-1} = \Sigma_t^{-1} + 2\beta_t G_t$. (21)

The entire algorithm is shown in Algorithm 1. To avoid the scaling problem, in practice we can employ a monotonic transformation of the objective; more details can be found in Appendix F. For the proposed SABO method, the computation cost per iteration is of order $O(Nd)$.

Mini-batch SABO. In Algorithm 1, we assume full access to the objective function $F(x; \mathcal{D})$, while in practice a full-batch function query might be costly. Therefore, we can perform a mini-batch function query. Specifically, in each iteration, we query the expected fitness on a mini-batch of data and approximate $F(x; \mathcal{D})$ by

$F(x; \mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum_{(X, y) \in \mathcal{B}} l(x; (X, y))$, (22)

where $|\mathcal{B}|$ denotes the number of data points in the mini-batch $\mathcal{B}$. The corresponding expected fitness of $F(x; \mathcal{B})$ under the distribution $p_\theta(x)$ is formulated as $J(\theta; \mathcal{B}) = \mathbb{E}_{p_\theta(x)}[F(x; \mathcal{B})]$. Then, similar to the full-batch function query setting, we can approximate the gradients $\nabla_\mu J(\theta_t; \mathcal{B})$ and $\nabla_\Sigma J(\theta_t; \mathcal{B})$, where the detailed formulations can be found in Appendix B. The entire SABO algorithm with mini-batch function queries is shown in Algorithm 2 in Appendix B.

Algorithm 1 SABO
Require: Neighborhood size $\rho$, learning rate $\beta_t$
1: Initialize $\theta_0 = (\mu_0, \Sigma_0)$;
2: for $t = 0$ to $T - 1$ do
3:   Take i.i.d. samples $z'_j \sim \mathcal{N}(0, I)$ and set $x'_j = \mu_t + \Sigma_t^{\frac{1}{2}} z'_j$ for $j \in \{1, \ldots, N\}$;
4:   Query the batch observations $\{F(x'_1), \ldots, F(x'_N)\}$;
5:   Compute the gradient $g'_t$ via Eq. (16) and the gradient $G'_t$ via Eq. (17);
6:   Compute $\lambda = \frac{1}{\rho}\sqrt{\|\Sigma_t G'_t\|_F^2 + 0.5\, \|\Sigma_t^{\frac{1}{2}} g'_t\|_2^2}$;
7:   Compute $\delta_{\mu t}$ and $\delta_{\Sigma t}$ via Eq. (18);
8:   Take i.i.d. samples $z_j \sim \mathcal{N}(0, I)$ for $j \in \{1, \ldots, N\}$;
9:   Set $x_j = \mu_t + \delta_{\mu t} + (\Sigma_t + \delta_{\Sigma t})^{\frac{1}{2}} z_j$ for $j \in \{1, \ldots, N\}$;
10:  Query the batch observations $\{F(x_1), \ldots, F(x_N)\}$;
11:  Compute the gradient $g_t$ via Eq. (19) and the gradient $G_t$ via Eq. (20);
12:  Set $\mu_{t+1} = \mu_t - \beta_t \Sigma_t g_t$ and set $\Sigma_{t+1}^{-1} = \Sigma_t^{-1} + 2\beta_t G_t$;
13: end for
14: return $\theta_T = (\mu_T, \Sigma_T)$.

4 THEORETICAL ANALYSIS

In this section, we provide comprehensive theoretical analyses for the proposed SABO method, with all the detailed proofs in Appendix D.

4.1 CONVERGENCE ANALYSIS OF SABO

Firstly, we make an assumption about the reparameterized objective function.

Assumption 4.1 The function $F(x)$ is $L_F$-Lipschitz w.r.t. $x$. The function $J(\theta)$ satisfies that $\nabla_\mu J(\theta)$ is $L$-Lipschitz w.r.t. $\theta = \{\mu, \Sigma\} \in \Theta$, where $\Theta := \{(\mu, \Sigma) \mid \mu \in \mathbb{R}^d, \Sigma \in \mathcal{S}_+\}$.

Note that the proposed SABO algorithm approximates the gradients of the reparameterized objective function. It is necessary to study the relation between the optimal solutions of the original objective function $F(x)$ and of $J(\theta)$, and we put the result in the following proposition.

Proposition 4.2 (Lyu & Tsang, 2021) Suppose $p_\theta(x)$ is a Gaussian distribution with $\theta = \{\mu, \Sigma\}$ and $F(x)$ is a convex function. Let $J(\theta) = \mathbb{E}_{p_\theta}[F(x)]$ and $J(\mu^*, 0) := F(\mu^*)$. Then we have

$F(\mu) - F(\mu^*) \le J(\mu, \Sigma) - J(\mu^*, 0)$, (23)

where $0$ denotes a zero matrix of appropriate size.
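As a one-dimensional illustration of Proposition 4.2 (a toy example of ours, not from the paper), consider the convex function $F(x) = x^2$ with minimizer $\mu^* = 0$:

```latex
% Toy illustration of Proposition 4.2 with F(x) = x^2 and p_theta = N(mu, sigma^2).
\[
  J(\mu, \sigma^2) = \mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}\big[x^2\big] = \mu^2 + \sigma^2,
  \qquad
  J(\mu^*, 0) = F(\mu^*) = 0,
\]
\[
  F(\mu) - F(\mu^*) = \mu^2 \;\le\; \mu^2 + \sigma^2 = J(\mu, \sigma^2) - J(\mu^*, 0).
\]
```

Hence, driving the reparameterized gap $J(\mu, \Sigma) - J(\mu^*, 0)$ to zero also drives the original optimality gap $F(\mu) - F(\mu^*)$ to zero, which is how Proposition 4.2 is combined with Theorems 4.3 and 4.6 below.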
The convexity assumption on the objective function in Theorem 4.3 has been widely adopted in the area of stochastic-gradient-approximation black-box optimization (Beyer, 2014; Wierstra et al., 2014; Lyu & Tsang, 2021; Ye, 2023), since the Gaussian-smoothed approximation is always an upper bound of the true target function in the convex case, i.e., $F(\mu) \le \mathbb{E}_{\mathcal{N}(\mu, \Sigma)}[F(x)]$. When $\mu^*$ is an optimal solution of the minimization problem $\min_x F(x)$, Proposition 4.2 implies that the difference between the objective value at $\mu$ and the optimal objective value of the original problem is upper-bounded by that of the expected objective function. Then, for Algorithm 1, the following theorem captures the convergence of $\mu$ for a convex objective function.

Theorem 4.3 Suppose that $F(x)$ is a convex function, $J(\theta)$ is $c$-strongly convex w.r.t. $\mu$, the gradient estimator $G_t$ (w.r.t. the covariance matrix) is a positive semi-definite matrix such that $\xi I \preceq G_t \preceq \frac{c}{4} I$ with $\xi \ge 0$, $\Sigma_0 \in \mathcal{S}_+$, and $\Sigma_0 \preceq R I$ where $R > 0$. Suppose the sequence $\{\mu_t\}$ generated by Algorithm 1 satisfies that the distance between the sequence $\{\mu_t\}$ and the optimal solution $\mu^*$ of $F(x)$ is bounded, i.e., $\|\mu_t - \mu^*\| \le D$, that $\|\nabla_\Sigma J(\theta)|_{\Sigma = \Sigma_t}\|_F \le H$, $\beta_t = O(1)$, and $\rho < \frac{d}{2}$ satisfies $\rho = O(\frac{1}{\sqrt{T}})$. Then, with Assumption 4.1, we have

$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\big[J(\mu_{t+1}, \Sigma_t) - J(\mu^*, 0)\big] = O\Big(\frac{\log T}{T}\Big)$.

Based on Theorem 4.3 and Proposition 4.2, when $\beta_t = O(1)$ and $\rho = O(\frac{1}{\sqrt{T}})$, we have

$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\big[F(\mu_{t+1}) - F(\mu^*)\big] = O\Big(\frac{\log T}{T}\Big)$.

Therefore, the proposed SABO algorithm with a full-batch function query possesses a convergence rate of $O(\frac{\log T}{T})$ for a convex objective function. Additionally, in Theorem 4.3, when $\beta = O(1)$ and $\rho = O(1)$, the proposed SABO algorithm still maintains a convergence rate of $O(T^{-\frac{1}{2}})$. The detailed discussion is provided in Remark D.1.

For the mini-batch setting, we make an additional assumption on the objective function $F(x; \mathcal{D})$.

Assumption 4.4 It is assumed that the datasets $\mathcal{D}$ and $\mathcal{B}$ are i.i.d. sampled from a data distribution $P(X, y)$, and the variance of the mini-batch estimation of the objective function is bounded, i.e., $\|F(x; \mathcal{B}) - \mathbb{E}F(x; \mathcal{B})\|_2^2 \le \varepsilon_B^2$ and $\|F(x; \mathcal{D}) - \mathbb{E}F(x; \mathcal{D})\|_2^2 \le \varepsilon_D^2$.

Note that for the standard stochastic gradient descent (SGD) method (Shamir & Zhang, 2013), the unbiasedness and bounded-variance assumptions were made on the approximated gradient. However, in black-box optimization, the gradient of the objective function $F(x)$ w.r.t. $x$ is unavailable. Hence, we can only make assumptions on the batch estimations $F(x; \mathcal{B})$ and $F(x; \mathcal{D})$. Then we have the following result for the mini-batch estimation.

Proposition 4.5 Suppose Assumption 4.4 holds. Then we have $\mathbb{E}F(x; \mathcal{B}) = \mathbb{E}F(x; \mathcal{D})$ and $\|F(x; \mathcal{B}) - F(x; \mathcal{D})\|_2^2 \le \varepsilon^2$, where $\varepsilon^2 = 2(\varepsilon_B^2 + \varepsilon_D^2)$.

Then, with Assumption 4.4, the following theorem shows the convergence of $\mu$ for Algorithm 2.

Theorem 4.6 Suppose that $F(x)$ is a convex function, $J(\theta)$ is $c$-strongly convex w.r.t. $\mu$, the gradient estimator $G_t$ (w.r.t. the covariance matrix) is a positive semi-definite matrix such that $\xi I \preceq G_t \preceq \frac{c}{4} I$ with $\xi \ge 0$, $\Sigma_0 \in \mathcal{S}_+$, and $\Sigma_0 \preceq R I$ where $R > 0$. Suppose the sequence $\{\mu_t\}$ generated by Algorithm 2 satisfies that the distance between the sequence $\{\mu_t\}$ and the optimal solution $\mu^*$ of $F(x)$ is bounded, i.e., $\|\mu_t - \mu^*\| \le D$, that $\|\nabla_\Sigma J(\theta)|_{\Sigma = \Sigma_t}\|_F \le H$, $\beta_t = O(1)$, and $\rho < \frac{d}{2}$ satisfies $\rho = O(1)$.
Then, with Assumptions 4.1 and 4.4, we have

$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\big[J(\mu_{t+1}, \Sigma_t) - J(\mu^*, 0)\big] = O\Big(\frac{1}{\sqrt{T}}\Big)$.

Based on Theorem 4.6 and Proposition 4.2, the proposed SABO algorithm with a mini-batch function query possesses a convergence rate of $O(T^{-\frac{1}{2}})$.

Remark 4.7 Note that in Theorem 4.3 and Theorem 4.6, the convergence results do not require the objective function $F(x)$ to be strongly convex or differentiable. Hence, the convergence holds for a non-smooth convex function $F(x)$ (as long as $J(\theta)$ is a $c$-strongly convex function w.r.t. $\mu$).

4.2 GENERALIZATION ERROR ANALYSIS

In this subsection, we analyze the generalization bound of the proposed SABO algorithm. Specifically, we bound the expectation of the objective function over the Gaussian perturbation. We denote by $(X, y)$ a data pair drawn from a data distribution $P(X, y)$ and by $F(x; (X, y))$ the corresponding loss of parameter $x$ on $(X, y)$, so we have $F(x; (X, y)) = l(x; (X, y))$. We define the population loss over the data distribution $P(X, y)$ as $\mathbb{E}_{P(X,y)}[F(x; (X, y))]$, and the empirical loss over a dataset $S$, which consists of $M$ i.i.d. samples drawn from $P(X, y)$, as $F(x; S) = \frac{1}{M}\sum_{i=1}^{M} l(x; (X_i, y_i))$. Then we have the following result.

Theorem 4.8 Let the loss function $F(x; (X, y))$ be a convex function w.r.t. $x$. Then, for any $\mu \in \mathbb{R}^d$, with probability at least $1 - \kappa$, we have

$\mathbb{E}_{P(X,y)}[F(\mu; (X, y))] \le \max_{\delta \in C(\theta)} \mathbb{E}_{p_{\theta+\delta}}\big[F(x; S)\big] + \sqrt{\frac{\rho^2 + \log(\frac{M}{\kappa})}{2(M-1)}}$, (27)

where $p_\theta := \mathcal{N}(\mu, \Sigma)$, $C(\theta) = \{\delta \mid \mathrm{KL}(p_{\theta+\delta} \,\|\, p_\theta) \le \rho^2\}$, and $S$ denotes the training set that consists of $M$ i.i.d. samples drawn from the data distribution $P(X, y)$.

Theorem 4.8 provides a generalization bound for the proposed SABO algorithm. Compared with the generalization bound of SAM presented in Appendix A.1 of Foret et al. (2021), Theorem 4.8 has an asymptotically identical order in the complexity term. However, the expected generalization loss on the right-hand side of Eq. (27) is different in that we reparameterize the objective function and perturb $\theta$ within its neighborhood on a statistical manifold, i.e., $\delta \in C(\theta)$, while SAM bounds the generalization loss averaged over a spherical Gaussian perturbation on the parameters.

5 RELATED WORKS

Black-Box Optimization. Many methods have been proposed for black-box optimization, including Bayesian optimization (BO) methods (Srinivas et al., 2010; Gardner et al., 2017; Nayebi et al., 2019), stochastic optimization methods such as evolution strategies (ES) (Hansen, 2006; Wierstra et al., 2014; Lyu & Tsang, 2021), and genetic algorithms (GA) (Srinivas & Patnaik, 1994; Mirjalili, 2019). Among those methods, BO achieves good performance on low-dimensional problems, but it often fails to handle high-dimensional problems through a global surrogate model, as shown in Eriksson et al. (2019) and Nguyen et al. (2022). As a result, TuRBO (Eriksson et al., 2019) and GIBO (Nguyen et al., 2022) try to address this problem with a local BO approach. Ziomek & Ammar (2023) further showed that decomposition is important for alleviating the high-dimensionality problem in BO. Although BO is not our main focus, we further compare our method with these local Bayesian optimization methods on the black-box prompt fine-tuning problem, and the corresponding exploration is shown in Appendix H.1. GA methods are computationally expensive for machine learning problems and usually lack convergence analysis. Stochastic optimization methods such as CMA-ES (Hansen, 2006) and INGO (Lyu & Tsang, 2021) can scale up to higher-dimensional problems compared with BO.
Hence, we mainly consider stochastic optimization methods as baseline methods in our experiments.

Sharpness-Aware Minimization. SAM has been widely studied for improving model generalization. Among previous works on SAM (Kwon et al., 2021; Zhuang et al., 2022; Zhao et al., 2022; Jiang et al., 2023; Tahmasebi et al., 2024), the most relevant method to our approach is the FSAM method (Kim et al., 2022), which also finds the worst-case objective function via a statistical manifold instead of the Euclidean space. However, the loss function of the model studied in FSAM is a predictive distribution conditioned on both model parameters and data, while in our case, we treat the parameter as a Gaussian distribution. The bSAM method (Möllenhoff & Khan, 2022) builds a connection between the SAM objective and a Bayes objective via the Fenchel biconjugate of the loss function. Möllenhoff & Khan (2022) shares a similarity with our work in developing SAM w.r.t. the expected loss. However, the bSAM method relies on the derivation of a convex lower bound of the expected loss by the Fenchel biconjugate, and the perturbation is still w.r.t. each point inside the expected loss, as in standard SAM. Hence, FSAM and bSAM are different from the proposed SABO method. Additionally, like other variants of SAM, FSAM and bSAM are both inapplicable to black-box optimization. The STABLEOPT method (Bogunovic et al., 2018) proposes a SAM-like optimization formulation in the Bayesian optimization area. It aims to improve robustness w.r.t. adversarial perturbations of the returned point by a GP-based optimization. The method relies on a GP surrogate model that is expensive for training and inference. In addition, it is challenging for the proposed adversarially robust GP-based optimization to handle high-dimensional problems. In contrast, our work aims to improve the generalization property in high-dimensional black-box optimization.

6 EMPIRICAL STUDY

In this section, we empirically evaluate the proposed SABO method and compare it with four representative black-box methods, i.e., CMA-ES (Hansen, 2006), MMES (He et al., 2020), BES (Gao & Sener, 2022), and INGO (Lyu & Tsang, 2021). All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU.

6.1 SYNTHETIC PROBLEMS

To verify the convergence results of the proposed SABO method in Section 4, we compare the proposed SABO method with the baseline methods on minimizing four d-dimensional synthetic benchmark test functions, i.e., the ellipsoid function, the $\ell_{1/2}$-ellipsoid function, the different-powers function, and the Levy function. All the test functions are listed in Appendix G.1. The results are evaluated by calculating the Euclidean distance between the solution $x$ and the optimal solution $x^*$, i.e., $E = \|x - x^*\|_2$. We then assess the baseline methods using varying dimensions, i.e., $d \in \{200, 500, 1000\}$. Due to the page limitation, the implementation details and more detailed experimental results are put in Appendix G.1.

[Figure 1: Results on the four test functions — (a) Ellipsoid, (b) $\ell_{1/2}$-Ellipsoid, (c) Different Powers, and (d) Levy — with problem dimension d = 500 and N = 50. Each panel plots the curves of CMA-ES, MMES, BES, INGO, and SABO over iterations T (×10³).]
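For concreteness, the following NumPy sketch shows how a single iteration of Algorithm 1 could be implemented for the diagonal-Gaussian case used here; it is our own minimal illustration under stated assumptions (the covariance is stored as a variance vector, all names are illustrative, and small clamping constants are added for numerical safety), not the authors' released code.

```python
import numpy as np

def sabo_step(F, mu, sigma2, rho, beta, n_samples=50, rng=None):
    """One SABO iteration (Algorithm 1) for a diagonal Gaussian N(mu, diag(sigma2)),
    using only function values of F. Illustrative sketch, not reference code."""
    rng = np.random.default_rng() if rng is None else rng

    def grads(m, s2):
        # Monte Carlo estimates of the gradients w.r.t. mu and diag(Sigma)
        # (Eqs. (16)-(17) / (19)-(20)), with F(m) subtracted as a baseline.
        z = rng.standard_normal((n_samples, m.shape[0]))
        x = m + np.sqrt(s2) * z
        f = np.array([F(xi) for xi in x]) - F(m)
        g = np.mean(f[:, None] * (x - m) / s2, axis=0)
        G = 0.5 * np.mean(f[:, None] * (((x - m) ** 2) / s2 ** 2 - 1.0 / s2), axis=0)
        return g, G

    # Gradients at the current distribution and the KL-ball multiplier (Eqs. (16)-(18)).
    g1, G1 = grads(mu, sigma2)
    lam = np.sqrt(np.sum((sigma2 * G1) ** 2) + 0.5 * np.sum(sigma2 * g1 ** 2)) / rho
    d_mu = sigma2 * g1 / lam                                   # perturbation of the mean
    d_sigma2 = 2.0 * sigma2 * G1 / (lam / sigma2 - 2.0 * G1)   # perturbation of the variances

    # Gradients at the perturbed distribution, then the actual update (Eqs. (19)-(21)).
    g2, G2 = grads(mu + d_mu, np.maximum(sigma2 + d_sigma2, 1e-12))
    mu_new = mu - beta * sigma2 * g2
    sigma2_new = 1.0 / np.maximum(1.0 / sigma2 + 2.0 * beta * G2, 1e-12)
    return mu_new, sigma2_new

# Illustrative usage on an ellipsoid-style objective (not the exact benchmark definition):
# F = lambda x: float(np.sum(np.arange(1, x.size + 1) * x ** 2))
# mu, sigma2 = np.ones(500), np.ones(500)
# for _ in range(100):
#     mu, sigma2 = sabo_step(F, mu, sigma2, rho=1.0, beta=0.5)
```

The mini-batch variant of Appendix B only changes how `F` is evaluated inside `grads`.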
Result. Figure 1 shows the results on the four d-dimensional synthetic problems with d = 500 and population size N = 50. The proposed SABO method approximately achieves a linear convergence rate similar to the INGO method. This is reasonable, as these two methods have the same theoretical convergence rate, i.e., $O(\frac{\log T}{T})$, according to Theorem 4.3 and Theorem 5 in Lyu & Tsang (2021). Since the SABO method perturbs the main objective in each iteration, its practical convergence speed is slightly slower than that of INGO. The CMA-ES and MMES methods can converge on the ellipsoid problem and the different-powers problem, but they do not converge as fast as the SABO and INGO methods. Moreover, CMA-ES and MMES fail on the $\ell_{1/2}$-ellipsoid problem and the Levy problem. The BES method fails on all test problems. This shows that it could be challenging for BES to optimize non-smooth or high-dimensional test functions without adaptively updating the mean and covariance. These results demonstrate the superiority of the SABO method in optimizing high-dimensional problems and verify our theoretical convergence results.

6.2 BLACK-BOX PROMPT FINE-TUNING

Black-box prompt fine-tuning of large language models (Ding et al., 2023; Sun et al., 2022b;a; 2023) is a promising direction to obtain specialized models efficiently for downstream tasks. In such an LMaaS setting, we cannot access the model parameters and can only tune the prompts without backpropagation. We evaluate the proposed SABO method in improving generalization performance on the black-box prompt fine-tuning task.

Datasets. We conduct experiments on six language understanding benchmark datasets: SST-2 (Socher et al., 2013) and Yelp polarity (Zhang et al., 2015) for sentiment analysis, AG's News (Zhang et al., 2015) for topic classification, MRPC (Dolan & Brockett, 2005) for paraphrase detection, and RTE (Wang et al., 2018) and SNLI (Bowman et al., 2015) for natural language inference. Each dataset contains a classification task. The statistics of the six datasets are summarized in Table 1 of Sun et al. (2022b). Following Sun et al. (2022b), the testing accuracy is used to measure the performance of all the methods on the SST-2, AG's News, RTE, SNLI, and Yelp P. datasets, and the F1 score is used to measure the performance on the MRPC dataset.

Implementation Details. Following Sun et al. (2022b), we employ a fixed randomly initialized matrix $A \in \mathbb{R}^{d \times D}$ to project a vector $v \in \mathbb{R}^d$ onto the token embedding space $\mathbb{R}^D$.

Table 1: Performance (%) on the SST-2, AG's News, MRPC, RTE, SNLI, and Yelp P. datasets. We report the mean and standard deviation over 3 random seeds. The best result across all groups is highlighted in bold, and the best result in each group is underlined.

Methods | SST-2 | AG's News | MRPC | RTE | SNLI | Yelp P.
Zero-shot | 79.82 | 76.96 | 67.40 | 51.62 | 38.82 | 89.64

Dimension d = 200
CMA-ES | 85.74±0.35 | 82.09±0.56 | 74.98±2.16 | 51.02±2.14 | 34.27±1.18 | 90.57±0.05
MMES | 83.98±0.78 | 80.52±0.99 | 76.54±4.34 | 48.50±0.45 | 40.39±1.83 | 90.94±0.36
BES | 83.52±0.11 | 75.44±0.31 | 79.23±0.20 | 53.07±0.29 | 38.73±0.17 | 89.65±0.01
INGO | 83.57±0.11 | 76.47±0.03 | 78.87±0.20 | 53.07±0.00 | 38.86±0.06 | 89.84±0.04
SABO | 87.88±0.53 | 82.22±0.41 | 79.35±0.12 | 53.67±0.17 | 40.72±0.15 | 91.50±0.13

Dimension d = 500
CMA-ES | 86.12±0.59 | 82.50±0.23 | 77.10±1.90 | 52.71±0.51 | 41.34±1.49 | 91.19±0.44
MMES | 85.28±0.94 | 81.67±0.80 | 77.31±1.24 | 48.74±0.59 | 42.07±2.62 | 91.39±0.24
BES | 83.56±0.05 | 75.93±0.17 | 79.21±0.09 | 52.95±0.17 | 38.64±0.28 | 89.62±0.07
INGO | 84.29±0.34 | 76.54±0.20 | 79.09±0.15 | 53.19±0.17 | 38.91±0.10 | 89.90±0.13
SABO | 87.31±0.38 | 82.65±0.59 | 79.62±0.07 | 53.55±0.17 | 42.29±2.48 | 91.83±0.16

Dimension d = 1000
CMA-ES | 86.85±0.57 | 82.21±0.36 | 78.98±0.17 | 52.35±0.17 | 38.40±1.83 | 90.46±0.62
MMES | 84.98±0.52 | 80.86±1.95 | 76.43±0.82 | 49.22±1.23 | 39.82±3.43 | 91.63±0.20
BES | 83.11±0.11 | 75.66±0.09 | 79.09±0.08 | 53.19±0.17 | 38.57±0.13 | 89.61±0.04
INGO | 84.36±0.23 | 76.35±0.14 | 78.97±0.08 | 53.07±0.29 | 39.05±0.06 | 89.95±0.08
SABO | 87.96±0.83 | 82.77±0.41 | 79.68±0.23 | 53.31±0.17 | 40.32±0.27 | 91.96±0.41

Then we optimize the vector $v \in \mathbb{R}^d$ instead of directly optimizing the prompt $p \in \mathbb{R}^D$. The pre-trained RoBERTa-LARGE model (Liu et al., 2019) is used as the backbone model. The matrix $A$ is sampled from a normal distribution as described in Sun et al. (2022a), with its scale determined by $\sigma_e$, the standard deviation of the word embeddings in RoBERTa-LARGE. The templates and label words in Table 1 of Sun et al. (2022b) are used to construct the zero-shot baseline. For the CMA-ES, MMES, BES, INGO, and SABO methods, we employ the cross-entropy loss on the training data as the black-box objective for the six datasets and optimize the vector $v$ for 100 iterations. The Gaussian distributions are initialized as $\mu_0 = 0$ and $\Sigma_0 = I$, and the population size $N$ is set to 100. We perform a grid search over the hyperparameters of the INGO, SABO, and BES methods. Specifically, we search the learning rate $\beta$ over $\{0.1, 0.5, 1, 5\}$ for INGO, SABO, and BES, the neighborhood size $\rho$ over $\{10, 50, 100, 500\}$ for SABO, and the spacing $c$ over $\{0.1, 1, 10\}$ for BES. Additionally, we evaluate the performance of all methods with different dimensions of $v$, specifically $d \in \{200, 500, 1000\}$. All the experiments are performed in three independent runs, and the mean ± standard deviation of the results is reported.

Results. Table 1 presents the experimental results on the six benchmark datasets for three different dimensions of the vector $v$. We can see that the SABO method consistently outperforms all baselines in terms of testing classification accuracy or testing F1 score across different settings, highlighting its effectiveness in improving generalization performance. Notably, even in the high-dimensional setting (i.e., d = 1000), our method maintains good performance. Moreover, we can see that when d = 1000, SABO achieves the best performance on the SST-2, AG's News, MRPC, and Yelp P. datasets. The SABO method also achieves the best performance on the RTE and SNLI datasets with d = 200 and d = 500, respectively.

7 CONCLUSION

In this work, we have introduced SABO, a novel black-box optimization algorithm that improves generalization by utilizing a sharpness-aware minimization strategy. Theoretically, we provide a convergence guarantee for the proposed SABO algorithm in both the full-batch and mini-batch function query settings. Additionally, we prove a generalization bound for the proposed method.
Empirical studies on synthetic numerical problems verify the convergence properties of the proposed method. Moreover, extensive experimental results on black-box prompt fine-tuning problems demonstrate the effectiveness of the proposed SABO method in improving the generalization performance. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENTS This work is supported by National Key R&D Program of China 2022ZD0160300 and NSFC key grant under grant no. 62136005. Dr. Yueming Lyu is supported by Career Development Fund no. C243512014 and no. C233312007 of the Agency for Science, Technology and Research. Prof. Sugiyama was supported by the Institute for AI and Beyond, UTokyo and by a grant from Apple, Inc. Any views, opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as reflecting the views, policies or position, either expressed or implied, of Apple Inc. Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016. Dara Bahri, Hossein Mobahi, and Yi Tay. Sharpness-aware minimization improves language model generalization. In Annual Meeting of the Association for Computational Linguistics, 2022. Hans-Georg Beyer. Convergence analysis of evolutionary algorithms that are based on the paradigm of information geometry. Evolutionary Computation, 22(4):679 709, 2014. Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with gaussian processes. Advances in neural information processing systems, 31, 2018. Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. ar Xiv preprint ar Xiv:1508.05326, 2015. Niladri S Chatterji, Behnam Neyshabur, and Hanie Sedghi. The intriguing role of module criticality in the generalization of deep networks. ar Xiv preprint ar Xiv:1912.00528, 2019. Can Chen, Yingxueff Zhang, Jie Fu, Xue Steve Liu, and Mark Coates. Bidirectional learning for offline infinite-width model-based optimization. Advances in Neural Information Processing Systems, 35:29454 29467, 2022. Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Neural Information Processing Systems, 2018. Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220 235, 2023. Bill Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing, 2005. Yinpeng Dong, Hang Su, Baoyuan Wu, Zhifeng Li, Wei Liu, Tong Zhang, and Jun Zhu. Efficient decision-based black-box adversarial attacks on face recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. ar Xiv preprint ar Xiv:1703.11008, 2017. David Eriksson, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek. Scalable global optimization via local bayesian optimization. Advances in neural information processing systems, 32, 2019. Clara Fannjiang and Jennifer Listgarten. 
Autofocused oracles for model-based design. Advances in Neural Information Processing Systems, 33:12945 12956, 2020. Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021. Published as a conference paper at ICLR 2025 Katelyn Gao and Ozan Sener. Generalizing gaussian smoothing for random search. In International Conference on Machine Learning, pp. 7077 7101. PMLR, 2022. Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for bayesian optimization. In Artificial Intelligence and Statistics, 2017. Nikolaus Hansen. The cma evolution strategy: a comparing review. Towards a new evolutionary computation: Advances in the estimation of distribution algorithms, pp. 75 102, 2006. Xiaoyu He, Zibin Zheng, and Yuren Zhou. Mmes: Mixture model-based evolution strategy for large-scale optimization. IEEE Transactions on Evolutionary Computation, 25(2):320 333, 2020. Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, 2018. Weisen Jiang, Hansi Yang, Yu Zhang, and James Kwok. An adaptive policy to employ sharpnessaware minimization. In International Conference on Learning Representations, 2023. Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13:455 492, 1998. Minyoung Kim, Da Li, Shell X Hu, and Timothy Hospedales. Fisher SAM: Information geometry and sharpness aware minimisation. In International Conference on Machine Learning, 2022. Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. Advances in neural information processing systems, 33:5126 5137, 2020. Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, 2021. Tianyi Lin, Zeyu Zheng, and Michael Jordan. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization. Advances in Neural Information Processing Systems, 35: 26160 26175, 2022. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Yueming Lyu and Ivor W Tsang. Black-box optimizer with stochastic implicit natural gradient. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2021. David A Mc Allester. Pac-bayesian model averaging. In Annual Conference on Computational Learning Theory, 1999. Leland Mc Innes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. ar Xiv preprint ar Xiv:1802.03426, 2018. Seyedali Mirjalili. Evolutionary algorithms and neural networks. In Studies in computational intelligence, volume 780. Springer, 2019. Thomas M ollenhoff and Mohammad Emtiyaz Khan. Sam as an optimal relaxation of bayes. 
In The Eleventh International Conference on Learning Representations, 2022. Amin Nayebi, Alexander Munteanu, and Matthias Poloczek. A framework for bayesian optimization in embedded subspaces. In International Conference on Machine Learning, 2019. Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527 566, 2017. Published as a conference paper at ICLR 2025 Behnam Neyshabur, Srinadh Bhojanapalli, David Mc Allester, and Nati Srebro. Exploring generalization in deep learning. In Neural Information Processing Systems, 2017. Quan Nguyen, Kaiwen Wu, Jacob Gardner, and Roman Garnett. Local bayesian optimization via maximizing probability of descent. Advances in neural information processing systems, 35: 13190 13202, 2022. Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. In Neural Information Processing Systems, 2021. Han Qi, Yi Su, Aviral Kumar, and Sergey Levine. Data-driven offline decision-making via invariant representation learning. Advances in Neural Information Processing Systems, 35:13226 13237, 2022. Shogo Sagawa and Hideitsu Hino. Gradual domain adaptation via normalizing flows. ar Xiv preprint ar Xiv:2206.11492, 2022. Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. ar Xiv preprint ar Xiv:1703.03864, 2017. Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, 2013. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing, 2013. Mandavilli Srinivas and Lalit M Patnaik. Genetic algorithms: A survey. Computer, 27(6):17 26, 1994. Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, 2010. Tianxiang Sun, Zhengfu He, Hong Qian, Yunhua Zhou, Xuan-Jing Huang, and Xipeng Qiu. Bbtv2: towards a gradient-free future with large language models. In Conference on Empirical Methods in Natural Language Processing, 2022a. Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 2022b. Tianxiang Sun, Zhengfu He, Qin Zhu, Xipeng Qiu, and Xuan-Jing Huang. Multitask pre-training of modular prompt for chinese few-shot learning. In Annual Meeting of the Association for Computational Linguistics, 2023. Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, and Patrick Jaillet. A universal class of sharpness-aware minimization algorithms. In International Conference on Machine Learning, pp. 47418 47440. PMLR, 2024. Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline model-based optimization. In International Conference on Machine Learning, pp. 10358 10368. PMLR, 2021. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. ar Xiv preprint ar Xiv:1804.07461, 2018. 
Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and J urgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949 980, 2014. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Published as a conference paper at ICLR 2025 Feiyang Ye, Yueming Lyu, Xuehao Wang, Yu Zhang, and Ivor Tsang. Adaptive stochastic gradient algorithm for black-box multi-objective learning. In The Twelfth International Conference on Learning Representations, 2024. Haishan Ye. Mirror natural evolution strategies. ar Xiv preprint ar Xiv:2308.00469, 2023. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021. Xiang Zhang, Junbo Zhao, and Yann Le Cun. Character-level convolutional networks for text classification. In Neural Information Processing Systems, 2015. Yang Zhao, Hao Zhang, and Xiuyuan Hu. Penalizing gradient norm for efficiently improving generalization in deep learning. In International Conference on Machine Learning, 2022. Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations, 2022. Juliusz Krzysztof Ziomek and Haitham Bou Ammar. Are random decompositions all we need in high dimensional bayesian optimisation? In International Conference on Machine Learning, pp. 43347 43368. PMLR, 2023. Published as a conference paper at ICLR 2025 A ADDITIONAL MATERIAL FOR SECTION 3 A.1 DETERMINE THE PERTURBATION δ The Lagrangian of problem (8) is L(δ, λ) = θJ(θ), δ + λ(KL(pθ+δ pθ) ρ2) (28) = δ µ µJ(θt) tr(δΣ ΣJ(θt)) tr(Σ 1(Σ + δΣ)) + δ µ Σ 1δµ + log |Σ| |(Σ + δΣ)| d λρ2. (29) Taking the derivative δµ and δΣ and setting them to zero, we can obtain that µJ(θ) λΣ 1δµ = 0, (30) 2 [(Σ + δΣ) 1 Σ 1] = 0. (31) Note that Σ is a diagonal matrix. Therefore, we can achieve that λΣ µJ(θ), (32) δΣ(θ) = 2Σ ΣJ(θ) λΣ 1 2 ΣJ(θ). (33) A.2 DETERMINE THE OPTIMAL λ Note that we have KL(pθ+δ pθ) = 1 tr(Σ 1(Σ + δΣ)) + δ µ Σ 1δµ + log |Σ| |(Σ + δΣ)| d (34) λ ΣJ(θ)(Σ + δΣ)) + δ µ Σ 1δµ + log |Σ| |(Σ + δΣ)| d (35) λ ΣJ(θ)δΣ) + δ µ Σ 1δµ + Q , (36) λΣ ΣJ(θ)| + log(I 2 λΣ ΣJ(θ)). (37) Since Σ and ΣJ(θ) are both diagonal matrix, we denote diag(Σ ΣJ(θ)) = (v1, . . . , vd), then we have λ2 (vi)2 O( 1 (38) We denote diag( ΣJ(θ)) = (ˆv1, . . . , ˆvd), then we have ˆviσi = vi and λ ΣJ(θ)δΣ) = 2 λ ˆvi( 1 (σi) 1 2 λ ˆvi σi) = 2 λ ˆviσi( 2 λσiˆvi) + O( 4 λ2 (σiˆvi)2). (39) Then substituting δµ(θ) and δΣ(θ) into the inequality KL(pθ+δ pθt) ρ2, we can obtain that KL(pθ+δ pθ) = 1 λ2 Σ ΣJ(θ) 2 F + 1 λ2 Σ 1 2 µJ(θ) 2 2 + ϵ ρ2, (40) where ϵ = O( 4(σiˆvi)2 λ2 ) = O( 1 λ2 ). Let the equality holds and solve Eq. (40), we have Σ ΣJ(θ) 2 F + 0.5 Σ 1 2 µJ(θ) 2 2. (41) Published as a conference paper at ICLR 2025 A.3 UPDATE RULE FOR θt Note that we have θ θt, θJ(θt + δt) + 1 βt KL(pθ pθt) = (µ µt) µJ(θt + δt) + tr((Σ Σt) ΣJ(θt + δt)) tr(Σ 1 t Σ) + (µ µt) Σ 1 t (µ µt) + log |Σt| where µJ(θt +δt) and ΣJ(θt +δt) denotes the derivative w.r.t. µ and Σ taking at µ = µt +δµt and Σ = Σt + δΣt, respectively. We can see the above problem is convex with respect to µ and Σ. Taking the derivative w.r.t. 
µ and Σ and setting them to zero, we can obtain that µJ(θt + δt) + 1 βt Σ 1 t (µ µt) = 0, (42) ΣJ(θt + δt) + 1 2βt [Σ 1 t Σ 1] = 0. (43) Therefore, we obtain the update rule for θt as µt+1 = µt βtΣt µJ(θt + δt), (44) Σ 1 t+1 = Σ 1 t + 2βt ΣJ(θt + δt). (45) A.4 THE GRADIENTS OF J(θ) To obtain the gradients of the reparameterized objective J(θ) w.r.t. µ and Σ, we use the following theorem to show that only function queries are needed. Theorem A.1 (Wierstra et al., 2014) The gradient of the expectation of an integrable function F(x) under a Gaussian distribution pθ := N(µ, Σ) with respect to the mean µ and the covariance Σ can be expressed as µJ(θ) = Epθ[Σ 1(x µ)F(x)], (46) 2Epθ[ Σ 1(x µ)(x µ) Σ 1 Σ 1 F(x)]. (47) B MINI-BATCH SABO For the proposed SABO with a mini-batch function query, the stochastic approximation of the gradients µJ(θt) and ΣJ(θt) using Monte Carlo sampling are given as i=1 Σ 1 t (x j µt) F(x j; (X i, y i)) F(µt; (X i, y i)) , (48) G t = 1 2NM i=1 diag h Σ 1 t diag (x j µt)(x j µt) Σ 1 t I (F(x j; (X i, y i)) F(µt; (X i, y i))) i , (49) where x j denotes the j-th sample sampled from the distribution N(x | µt, Σt), and (X i, y i) denotes the i-th data in the mini-batch dataset B . Then by setting δt = { 1 λΣtg t, 2Σt G t λΣ 1 t 2G t }, the gradients Published as a conference paper at ICLR 2025 Algorithm 2 Mini-batch SABO Require: Neighborhood size ρ, learning rate βt, batch size M 1: Initialized θ0 = (µ0, Σ0) ; 2: for t = 0 to T 1 do 3: Take i.i.d. samples z j N(0, I) and set x j = µt + Σ 1 2 t z j for j {1, . . . , N}; 4: Sample a batch of data B ; 5: Query the batch observations {F(x 1; B ), . . . , F(x N; B )}; 6: Set the gradient g t via Eq. (48) and set the gradient G t via Eq. (49); 7: Compute λ = 1 Σt G t 2 F + 0.5 Σ 1 2 t g t 2 2; 8: Compute δµt = 1 λΣtg t and δΣt = 2Σt G t λΣ 1 t 2G t ; 9: Take i.i.d. samples zj N(0, I) for j {1, . . . , N}; 10: Set xj = µt + δµt + (Σt + δΣt) 1 2 zj for j {1, . . . , N}; 11: Sample a batch of data B; 12: Query the batch observations {F(x1; B), . . . , F(x N; B)}; 13: Set the gradient gt via Eq. (50) and set the gradient Gt via Eq. (51); 14: Set µt+1 = µt βtΣtgt; 15: Set Σ 1 t+1 = Σ 1 t + 2βt Gt; 16: end for 17: return θT = (µT , ΣT ). µJ(θt + δt) and ΣJ(θt + δt) can also be approximated as i=1 bΣ 1 t (xj bµt) F(xj; (Xi, yi)) F(bµt; (Xi, yi)) , (50) i=1 diag h bΣ 1 t diag (xj bµt)(xj bµt) bΣ 1 t I (F(xj; (Xi, yi)) F(bµt; (Xi, yi))) i , (51) where bΣt = Σt + δΣt, bµt = µt + δµt, xj denotes the j-th sample sampled from the distribution N(x | bµt, bΣt), and (Xi, yi) denotes the i-th data in the mini-batch dataset B. Then we can update θt by Eq. (21). The entire algorithm of SABO with a mini-batch function query is shown in Algorithm 2. C TECHNICAL LEMMAS In this section, we introduce the following technical lemmas for analysis. The proof of all technical lemmas is put in Appendix E. Lemma C.1 Suppose Σ and ˆΣ are two d-dimensional diagonal matrix and z is a d-dimensional vector, then we have Σz Σ F z and Σ ˆΣ F Σ F ˆΣ F. Lemma C.2 Given a convex function F(x), for Gaussian distribution with parameters θ := {µ, Σ 1 2 }, let J(θ) := Ep(x;θ)[F(x)]. Then J(θ) is a convex function with respect to θ. Lemma C.3 Suppose that the gradient Gt are positive semi-definite matrix and satisfies ξI Gt b I. Then for algorithm 1 and 2, we have the following results. (a) The (diagonal) covariance matrix ΣT satisfies 1 2b PT t=1 βt I+Σ 1 0 ΣT 1 2ξ PT t=1 βt I+Σ 1 0 . (b) The Frobenius norm for the covariance matrix Σt satisfies Σt F d 2ξ Pt k=1 βk . 
Published as a conference paper at ICLR 2025 Lemma C.4 For given θt, denote the approximated gradients of µJ(θt) and ΣJ(θt) by g t and G t, respectively. Then for Algorithm 1 and Algorithm 2, if ρ d 2 , then the perturbation satisfies, 1 2 t F and δΣt F 2ρr Σt F, where r is a positive constant. Lemma C.5 In Algorithm 1, suppose the gradient estimator g t in t-th iteration as 2 t z F(µt + Σ 1 2 t z) F(µt) , (52) where z N(0, I). Then g t is an unbiased estimator of the gradient µEpθt[F(x)]. Lemma C.6 In Algorithm 1, suppose the gradient estimator gt in t-th iteration as 2 t z F(bµt + bΣ 1 2 t z) F(bµt) , (53) where z N(0, I). Suppose that Assumption 4.1 holds, ρ < d 2 , the gradient Gt is positive semi-definite matrix and satisfies ξI Gt b I and Σ0 RI, where ξ, b, R 0. Then we have 1 2 t gt 2] L2 F (1 + 2ρr )(d + 4)2 2ξ(Pt k=1 βk) , (54) where r is a positive constant. Lemma C.7 In Algorithm 2, suppose assumption 4.4 holds and the gradient estimator g t in t-th iteration as 2 t z F(µt + Σ 1 2 t z; B) F(µt; B) , (55) where z N(0, I). Then g t is an unbiased estimator of the gradient µEpθt[F(x)]. Lemma C.8 In Algorithm 2, suppose the gradient estimator gt in t-th iteration as 2 t z F(bµt + bΣ 1 2 t z; B) F(bµt; B) , (56) where z N(0, I). Suppose that Assumption 4.1 and Assumption 4.4 hold, ρ < d 2 , the gradient Gt is positive semi-definite matrix and satisfies ξI Gt b I and Σ0 RI, where ξ, b, R 0. Then we have 1 2 t gt 2] L2 F (1 + 2ρr )(d + 4)2 6ξ(Pt k=1 βk) + 2(d + 4)ε2 where r is a positive constant, ε2 = 2ε2 B + 2ε2 D. D PROOF OF THE RESULT IN SECTION 4 In this section, we provide the proof of the result in Section 4. D.1 PROOF OF THE PROPOSITION 4.2 Note that F(x) is a convex function, we have F(µ) = F(Ex N(µ,Σ)[x]) Ex N(µ,Σ)[F(x)] = J(µ, Σ). (58) Since F(µ ) = J(µ , 0), it follows that F(µ) F(µ ) J(µ, Σ) J(µ , 0), (59) where we reach the conclusion. Published as a conference paper at ICLR 2025 D.2 PROOF OF THEOREM 4.3 The update rule of µ can be represented as µt+1 = µt βtΣtgt. Since F(x) is convex function, we have J(θ) is convex w.r.t. θ = {µ, Σ 1 2 } by Lemma C.2, together with J(θ) is c-strongly convex w.r.t. µ we obtain J(θt) J(µ , 0) + µJ(θt) (µt µ ) + Σ 1 2 J(θt) Σ 2 µt µ 2. (60) Note that Σ 1 2 J(θt) = Σ 1 2 t ΣJ(θt) + ΣJ(θt)Σ 1 2 t , we have J(θt) J(µ , 0) + µJ(θt) (µt µ ) + 2 ΣJ(θt)Σt c 2 µt µ 2. (61) Let At = J(µt, Σt) J(µ , 0), we have βt E[At] βt Ez[ µJ(θt) (µt µ )] + 2βt Ez[ ΣJ(θt) Σt] cβt 2 µt µ 2. (62) µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t = µt µ 2 Σ 1 t µt βtΣtgt µ 2 Σ 1 t (63) = µt µ 2 Σ 1 t µt µ 2 Σ 1 t 2βt µt µ , gt + β2 t Σtgt, gt (64) = 2βtg t (µt µ ) β2 t (Σtgt) gt. (65) Therefore we have g t (µt µ ) = 1 2βt µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t + βt 2 (Σtgt) gt. (66) According to Lemma C.5, we have Egt = µJ(θt + δt). Then we have E ( µJ(θt) gt) (µt µ ) = E ( µJ(θt) µJ(θt + δt)) (µt µ ) + E ( µJ(θt + δt) gt) (µt µ ) (67) µt µ µJ(θt) µJ(θt + δt) (68) DL δt (69) DLρt Ut, (70) where the first inequality is due to the Cauchy-Schwarz inequality, the second inequality is due to µt µ D and Assumption 4.1, the last inequality is due to Lemma C.4, and Ut = 1 2 t F, 4r Σt F). Then we have Ez µJ(θt) (µt µ ) (71) = Ez g t (µt µ ) + ( µJ(θt) gt) (µt µ ) (72) 1 2βt Ez[ µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t ] + βt(Σtgt) gt] + DLρt Ut, (73) where the inequality is due to Eq. (66), Eq. (70), and βt 0. Note that E ΣJ(θt) Σt E ΣJ(θt) Σt F H Σt F, (74) where the first inequality is due to Lemma C.1 and the second inequality is due to the Lipschitz continuous assumption of the function J(θ). Then substituting Eq. 
(73) and Eq. (74) into Eq. (62) and multiplying βt on both sides of the inequality, we have 2E[ µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t ] cβt 2 µt µ 2 + 2β2 t H Σt F + DLρt Ut + β2 t Σ 1 2 t gt 2, (75) Published as a conference paper at ICLR 2025 We further obtain that 2E[ µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t ] cβt 2 µt µ 2 (76) µt µ 2 Σ 1 t µt 1 µ 2 Σ 1 t 1 cβt 2 µt µ 2 (77) h µ0 µ 2 Σ 1 0 µT µ 2 Σ 1 T 1 µt µ 2 2βt Gt cβt 2 µt µ 2 + Σ 1 0 FD2 (79) 2 µt µ 2 cβt 2 µt µ 2 + Σ 1 0 FD2 (80) = Σ 1 0 FD2, (81) where the second inequality is due to the update rule of Σt and µt µ D, and the third inequality is due to Cauchy-Schwarz inequality and Gt c 4I. Then we have t=0 βt E[At] Σ 1 0 FD2 + 2Hβ2 t Σt F + DLρt Ut + β2 t Σ 1 2 t gt 2 ! Σ 1 0 FD2 + d 2ξ Pt k=1 βk + DLρt Ut + L2 F R 1 2 (1 + 2ρr )(d + 4)2β2 t 2ξ(Pt k=1 βk) where the first inequality is due to Eq. (81) and Eq. (75), and the second inequality is due to Lemma 1 2 t F R 1 2 . According to C.3 (b), we have Ut max( 2 (2ξ Pt k=1 βk) 1 2 , 4r d 2ξ Pt k=1 βk ). Therefore, there exists a constant t , when t > t 1, Ut 2 (2ξ Pt k=1 βk) 1 2 . Denote Pt 1 t=0 4rρt d 2ξ Pt k=1 βk by a constant Γ. Then if T > t , we have t=0 E[At] Σ 1 0 FD2 d 2ξ Pt k=1 βk + L2 F R 1 2 (1 + 2ρr )(d + 4)2βt 2ξ Pt k=1 βk βt(2ξ Pt k=1 βk) 1 2 Let βt = β and ρt = ρ0 t, we can obtain that t=0 E[At] Σ 1 0 FD2 2DLρ0d 1 4 2ξβ 3 2 t where C = 2H d + L2 F R 1 2 (1 + 2ρr )(d + 4)2. Since we have PT t=1 1 t 1 + log(T), we obtain t=0 E [J(µt+1, Σt) J(µ , 0)] = 1 t=0 E[At] = O 1 where we reach the conclusion. Remark D.1 In Theorem 4.3, if we set β = O(1) and ρ = O( 1 T ), then the third term in right-hand side of Eq. (84) has a convergence rate of O( log T T ). If we set β = O(1) and ρ = O(1), then Published as a conference paper at ICLR 2025 the third term in right-hand side of Eq. (84) has a convergence rate of O( 1 T ), since we have PT t=1 1 T. Then we obtain t=0 E [J(µt+1, Σt) J(µ , 0)] = 1 t=0 E[At] = O 1 This implies 1 T PT 1 t=0 E [J(µt+1, Σt) J(µ , 0)] = O 1 D.3 PROOF OF THE PROPOSITION 4.5 Since the datasets D and B are i.i.d. sampled from one data distribution P(X, y). We have E(X,y) P [F(x; (X, y))] = E[F(x; B)] = E[F(x; D)]. We obtain that F(x; B) F(x; D) 2 2 2 F(x; B) EF(x; B) 2 2+2 F(x; D) EF(x; D) 2 2 = 2(ε2 B+ε2 D) (88) D.4 PROOF OF THEOREM 4.6 Note that for SABO with a mini-batch function query, the update rule of µ can be represented as µt+1 = µt βtΣtgt. Let Bt = J(µt, Σt) J(µ , 0), then by using Eq. (61), we have βt E[Bt] βt E[ µJ(θt) (µt µ )] + 2βt E[ ΣJ(θt) Σt] cβt 2 µt µ 2 (89) βt E[ µJ(θt) (µt µ )] + 2βt H Σt F cβt 2 µt µ 2, (90) where the second inequality is due to Lemma C.1 and Lipschitz continuous assumption of the function J(θ). Note that µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t = µt µ 2 Σ 1 t µt βtΣtgt µ 2 Σ 1 t (91) = µt µ 2 Σ 1 t µt µ 2 Σ 1 t 2βt µt µ , gt + β2 t Σtgt, gt (92) = 2βtg t (µt µ ) β2 t (Σtgt) gt. (93) It follows that g t (µt µ ) = 1 2βt µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t + βt 2 (Σtgt) gt (94) µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t + βt Σ 1 2 t gt 2, (95) where the inequality is due to βt 0 and Lemma C.1. According to Lemma C.7, we have Egt = µJ(θt + δt). Therefore E ( µJ(θt) gt) (µt µ ) = Ez ( µJ(θt) µJ(θt + δt)) (µt µ ) + Ez ( µJ(θt + δt) gt) (µt µ ) (96) µt µ µJ(θt) µJ(θt + δt) (97) DL δt (98) DLρt Ut, (99) where the first inequality is due to the Cauchy-Schwarz inequality, the second inequality is due to µt µ D and smoothness assumption of the function J(θ), the last inequality is due to Lemma C.4, and Ut = max(2 1 2 t F, 4r Σt F). 
Then we have Ez µJ(θt) (µt µ ) (100) = Ez g t (µt µ ) + ( µJ(θt) gt) (µt µ ) (101) 1 2βt Ez[ µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t ] + βt Σ 1 2 t gt 2 + DLρt Ut, (102) Published as a conference paper at ICLR 2025 where the inequality is due to Eq. (95) and Eq. (99). Then substituting Eq. (102) into Eq. (90) and multiplying βt on both sides of the inequality, we have 2E[ µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t ] cβt 2 µt µ 2 + 2β2 t H Σt F + DLρt Ut + β2 t Σ 1 2 t gt 2, (103) We further obtain that 2E[ µt µ 2 Σ 1 t µt+1 µ 2 Σ 1 t ] cβt 2 µt µ 2 (104) µt µ 2 Σ 1 t µt 1 µ 2 Σ 1 t 1 cβt 2 µt µ 2 (105) h µ0 µ 2 Σ 1 0 µT µ 2 Σ 1 T 1 µt µ 2 2βt Gt cβt 2 µt µ 2 + Σ 1 0 FD2 (107) 2 µt µ 2 cβt 2 µt µ 2 + Σ 1 0 FD2 (108) = Σ 1 0 FD2, (109) where the second inequality is due to the update rule of Σt and µt µ D, and the third inequality is due to Cauchy-Schwarz inequality and Gt c 4I. Then we have t=0 βt E[Bt] Σ 1 0 FD2 + 2Hβ2 t Σt F + DLρt Ut + β2 t Σ 1 2 t gt 2 ! Σ 1 0 FD2 + d 2ξ Pt k=1 βk + DLρt Ut + 2β2 t Σ 1 2 t F(d + 4)ε2 + L2 F (1 + 2ρr )(d + 4)2β2 t Σ 1 2 t F 6ξ(Pt k=1 βk) Σ 1 0 FD2 + d 2ξ Pt k=1 βk + DLρt Ut + 2β2 t (d + 4)d 1 4 ε2 3(2ξ Pt k=1 βk) 1 2 + L2 F (1 + 2ρr )(d + 4)2β2 t R 1 2 6ξ(Pt k=1 βk) where the first inequality is due to Eq. (103) and Eq. (109), the second inequality is due to Lemma C.3 (b) and Lemma C.8, and the third inequality is due to Σ 1 2 t F R 1 2 and Lemma C.3 (b). According to C.3 (b), we have Ut max( 2 (2ξ Pt k=1 βk) 1 2 , 4r d 2ξ Pt k=1 βk ). Therefore, there exists a constant t , when t > t 1, Ut 2 (2ξ Pt k=1 βk) 1 2 . Denote Pt 1 t=0 4rρt d 2ξ Pt k=1 βk by a constant Γ. Then if T > t , t=0 E[Bt] Σ 1 0 FD2 d 2ξ Pt k=1 βk + L2 F R 1 2 (1 + 2ρr )(d + 4)2βt 6ξ Pt k=1 βk βt(2ξ Pt k=1 βk) 1 2 2βt(d + 4)d 1 4 ε2 3(2ξ Pt k=1 βk) 1 2 Published as a conference paper at ICLR 2025 Let βt = β and ρt = ρ, we have t=0 E[Bt] Σ 1 0 FD2 d 2ξt + L2 F R 1 2 (1 + 2ρr )(d + 4)2 2DLρd 1 4 2ξβ 3 2 2(d + 4)d 1 4 ε2 Since we have PT t=1 1 t 1 + log(T) and PT t=1 1 T, we obtain t=0 E [J(µt+1, Σt) J(µ , 0)] = 1 t=0 E[Bt] = O 1 where we reach the conclusion. D.5 PROOF OF THEOREM 4.8 Using PAC-Bayesian bound (Mc Allester, 1999) and following (Dziugaite & Roy, 2017), for any prior distribution p and any posterior distribution q that may be dependent on finite dataset S, where data set S with M i.i.d. samples drawn from data distribution P(X, y), we have q, Eq(x) EP (X,y)[F(x; (X, y))] Eq(x) F(x; S) + KL(q||p) + log( M κ ) 2(M 1) . (116) Set the posterior distribution as q = pθ := N(µ, Σ). Denote the set M(θ) = {δ | KL(pθ+δ pθ) + KL(pδ pθ+δ) ρ2}. We can choose p M(θ). Then, we know that for the prior distribution p, the following inequality holds with a probability at least 1 κ, Epθ(x) EP (X,y)[F(x; (X, y))] Epθ(x) F(x; S) + KL(pθ||p) + log( M κ ) 2(M 1) . (117) Note that for any density p, q, we have KL(p||q) 0. Thus, we know M(θ) C(θ). It follows that Epθ(x) EP (X,y)[F(x; (X, y))] Epθ(x) C(θ) F(x; S) + KL(pθ||p) + log( M κ ) 2(M 1) (118) max δ C(θ) Epθ+δ F(x; S) + max δ M(θ) KL(pθ||pθ+δ) + log( M κ ) 2(M 1) (119) max δ C(θ) Epθ+δ F(x; S) + ρ2 + log( M κ ) 2(M 1) . (120) Note that F(x; (X, y)) is convex function w.r.t. x, we know that EP (X,y)[F(x; (X, y))] is a convex function w.r.t. x. It follows that EP (X,y)[F(µ; (X, y))] = EP (X,y)[F(Epθ(x)[x]; (X, y))] (121) Epθ(x) EP (X,y)[F(x; (X, y))] . (122) Finally, we know that with a probability of at least 1 κ, the following inequality holds. EP (X,y)[F(µ; z, y)] max δ C(θ) Epθ+δ F(x; S) + ρ2 + log( M κ ) 2(M 1) . 
(123) E PROOF OF TECHNICAL LEMMAS In this section, we provide the proof of lemmas in Appendix C. Note that Lemma C.1 and Lemma C.2 can be directly obtained by Lemma B.1. and Lemma B.2. in (Ye et al., 2024), respectively. Published as a conference paper at ICLR 2025 E.1 PROOF OF LEMMA C.3 (a): Since we have Σ 1 t+1 = Σ 1 t + 2βt Gt. We can obtain that Σ 1 t + 2bβt I Σ 1 t+1 Σ 1 t + 2ξβt I. (124) Summing up it over t = 0, . . . , T 1, we have t=1 βt I Σ 1 T Σ 1 0 + 2ξ t=1 βt I. (125) Therefore, we have 2b PT t=1 βt I + Σ 1 0 ΣT 1 2ξ PT t=1 βt I + Σ 1 0 . (126) (b): We have 1 2ξ Pt k=1 βk I + Σ 1 0 1 2ξ Pt k=1 βk I d 2ξ Pt k=1 βk . (127) E.2 PROOF OF LEMMA C.4 In Algorithm 1 and Algorithm 2, for given θt, the perturbation δt satisfies λΣtg t = ρΣtg t q Σt G t 2 F + 0.5 Σ 1 2 t g t 2 2 1 2 t g t q 1 2 t g t 2 2 Therefore, we have δµt ρ 1 2 t F. For δΣt, we have δΣt = 2Σt G t λΣ 1 t 2G t = Σt 2 λΣt G t I 2 λΣt G t . (129) Note that 2 λ 2ρ ΣG t F . Therefore, we can obtain that δΣt F Σt F 2ρ I 2 λΣt G t F . (130) d 2 , there exist a constant r > 0, I 2 λΣt G t F > 1 r holds. Therefore we have δΣt F 2ρr Σt F. E.3 PROOF OF LEMMA C.5 Ez[g t] = Ez[Σ 1 2 t z F(µt + Σ 1 2 t z)] Ez[Σ 1 2 t z F(µt)] (131) 2 t z F(µt + Σ 1 2 t z)] (132) = Ex N(µt,Σt)[Σ 1 t (x µt)F(x)] (133) = µEpθt[F(x)], (134) where we reach the conclusion. E.4 PROOF OF LEMMA C.6 Denote the diagonal elements of Σ and bΣ by σ and bσ, respectively. Then we have 1 2 t gt,j 2 = σ 2 t zj(F(bµt + bσ 1 2 t zj) F(bµt)) 2. (135) Published as a conference paper at ICLR 2025 Note that bΣ 1 t+1 = Σ 1 t 2λG t, we have σt bσt. Since we have 1 2 t gt 2 2 = E 1 2 t gt,j 2. (136) It follows that 1 2 t gt 2 2 1 2 t zj(F(bµt + bσ 1 2 t zj) F(bµt)) 2 (137) j=1 E h zj 2(F(bµt + bσ 1 2 t zj) F(bµt))2i (138) j=1 E L2 F zj 4 bσt . (139) bΣt = Σt + δΣt = Σt + Σt 2 λΣt G t I 2 λΣt G t . (140) Note that 2 λ 2ρ ΣG t F . Then if ρ < d 2 , there exist a constant r > 0, the inequality bσt σt + 2ρr σt holds. Therefore, we have bσt (1 + 2ρr ) σt 1 + 2ρr σ 1 0 min + 2(Pt k=1 βk)ξ , (141) where min denotes the minimum element in the input. Noticed that Ez[ z 2] d + 4 as shown in (Nesterov & Spokoiny, 2017), we obtain 1 2 t gt 2 2 L2 F (1 + 2ρr )(d + 4)2 2ξ(Pt k=1 βk) , (142) where we reach the conclusion. E.5 PROOF OF LEMMA C.7 E[g t] = E[Σ 1 2 t z F(µt + Σ 1 2 t z; B)] E[Σ 1 2 t z F(µt; B)] (143) 2 t z F(µt + Σ 1 2 t z; D)] Ez[Σ 1 2 t z F(µt; D)] (144) 2 t z F(µt + Σ 1 2 t z)] (145) = Ex N(µt,Σt)[Σ 1 t (x µt)F(x)] (146) = µEpθt[F(x)], (147) where the second equality is due to Proposition 4.5. E.6 PROOF OF LEMMA C.8 Denote the diagonal elements of Σ and bΣ by σ and bσ, respectively. Then we have 1 2 t gt,j 2 = σ 2 t zj(F(bµt + bσ 1 2 t zj; B) F(bµt); B) 2 (148) Note that bΣ 1 t+1 = Σ 1 t 2λG t, we have σt bσt. Since we have 1 2 t gt 2 2 = E 1 2 t gt,j 2. (149) Published as a conference paper at ICLR 2025 It follows that 1 2 t gt 2 2 1 2 t zj(F(bµt + bσ 1 2 t zj; B) F(bµt; B)) 2 (150) j=1 E h zj 2(F(bµt + bσ 1 2 t zj; B) F(bµt; B))2i (151) j=1 E zj 2(1 1 2 t zj 2 + 2 3ε2) , (152) where the third inequality is due to Proposition 4.5. Since bΣt = Σt+Σt 2 λ Σt G t I 2 λ Σt G t , and 2 λ 2ρ ΣG t F . Then if ρ < d 2 , there exist a constant r > 0, the inequality bσt σt + 2ρr σt holds. Therefore, we have bσt (1 + 2ρr ) σt 1 + 2ρr σ 1 0 min + 2(Pt k=1 βk)ξ , (153) where min denotes the minimum element in the input. 
Noticing that Ez[ z 2] d + 4 as shown in (Nesterov & Spokoiny, 2017), we obtain 1 2 t gt 2 2 1 j=1 E (1 + 2ρr )L2 F 3 σt zj 4 + 2ε2 3 zj 2 (154) L2 F (1 + 2ρr )(d + 4)2 6ξ(Pt k=1 βk) + 2(d + 4)ε2 where we reach the conclusion.

F UPDATE RULE UNDER TRANSFORMATION

To avoid the scaling problem, we can employ a monotonic transformation of the objective F(x), i.e., h(F(x_j)) = (F(x_j) − μ̄)/σ̄, where μ̄ and σ̄ denote the mean and standard deviation of the function values F(x_j), j = 1, . . . , N. By applying this rescaling strategy, the update rules for µ_t and Σ_t in the t-th iteration can be written as

µ_{t+1} = µ_t − (β_t/N) Σ_{j=1}^{N} Σ_t Σ̂_t^{-1} (x_j − µ̂_t) · (F(x_j) − μ̄_t)/σ̄_t, (156)

Σ_{t+1}^{-1} = Σ_t^{-1} + (β_t/N) Σ_{j=1}^{N} diag[ Σ̂_t^{-1} diag((x_j − µ̂_t)(x_j − µ̂_t)^T) Σ̂_t^{-1} ] · (F(x_j) − μ̄_t)/σ̄_t, (157)

where Σ̂_t = Σ_t + δΣ_t, µ̂_t = µ_t + δµ_t, and x_j denotes the j-th sample drawn from the distribution N(x | µ̂_t, Σ̂_t).

G ADDITIONAL MATERIALS FOR SECTION 6

In this section, we provide additional experiments and more implementation details for Section 6.

G.1 SYNTHETIC PROBLEMS

The four numerical benchmark test functions employed in Section 6.1 are listed as follows:

F(x) = Σ_{i=1}^{d} 10^{6(i−1)/(d−1)} x_i^2, (158)

F(x) = Σ_{i=1}^{d} 10^{6(i−1)/(d−1)} |x_i|^{1/2}, (159)

F(x) = Σ_{i=1}^{d} |x_i|^{2+4(i−1)/(d−1)}, (160)

F(x) = sin^2(πω_1) + Σ_{i=1}^{d−1} (ω_i − 1)^2 (1 + 10 sin^2(πω_i + 1)) + (ω_d − 1)^2 (1 + sin^2(2πω_d)), where ω_i = 1 + (x_i − 1)/4, i ∈ {1, . . . , d}. (161)

Test functions (158)-(161) are called the ellipsoid function, the ℓ1/2-ellipsoid function, the different powers function, and the Levy function, respectively.

Implementation Details. For all the methods, we initialize µ0 from the uniform distribution Uni[0, 1] and set Σ0 = I. For the INGO, BES, and SABO methods, we use a fixed step size of β = 0.1. According to our assumption in Theorem 4.3, we set ρ = 100/√(T + 1) for the proposed SABO method. We set the spacing c = 1 for the BES method and employ the default hyperparameter setting from He et al. (2020) for the MMES method. We then assess these methods with varying dimensions, i.e., d ∈ {200, 500, 1000}. For d = 200, we also assess these methods with varying sample sizes, i.e., N ∈ {10, 50, 100}. The mean value of E over 3 independent runs is reported.

Results. Figures 2 and 3 show the results on the four test functions with problem dimensions d = 200 and d = 1000, respectively. Figures 4 and 5 show the results on the 200-dimensional test functions with sample sizes N = 10 and N = 100, respectively. Combining these results with the result in Figure 1, we observe consistent performance for the proposed SABO method. It achieves convergence similar to the INGO method, as the two share the same theoretical convergence rate. In some cases, e.g., Figure 3 (d) and Figure 4 (d), it converges slightly faster than INGO. The CMA-ES and MMES methods both work on the ellipsoid and different powers functions but fail on the ℓ1/2-ellipsoid and Levy functions. In most cases, they converge more slowly than INGO and SABO. With a large sample size, i.e., N = 100, the CMA-ES method can maintain a fast convergence rate according to Figure 5 (a). The BES method fails to achieve high precision on all test functions: it diverges on the ellipsoid, ℓ1/2-ellipsoid, and Levy functions and only reaches low precision on the different powers function. These results demonstrate the superiority of the SABO method in optimizing high-dimensional problems and verify our theoretical convergence results in Section 4.
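For reference, the following is a minimal NumPy sketch of the four test functions (158)-(161) as reconstructed above; the 10^{6(i−1)/(d−1)} conditioning of the two ellipsoid-type functions follows the standard benchmark convention and should be treated as an assumption here.

```python
import numpy as np

def ellipsoid(x):
    # Eq. (158): sum_i 10^{6(i-1)/(d-1)} x_i^2 (standard 1e6 conditioning assumed)
    d = x.size
    w = 10.0 ** (6.0 * np.arange(d) / (d - 1))
    return np.sum(w * x ** 2)

def l_half_ellipsoid(x):
    # Eq. (159): same conditioning applied to |x_i|^{1/2}
    d = x.size
    w = 10.0 ** (6.0 * np.arange(d) / (d - 1))
    return np.sum(w * np.abs(x) ** 0.5)

def different_powers(x):
    # Eq. (160): sum_i |x_i|^{2 + 4(i-1)/(d-1)}
    d = x.size
    p = 2.0 + 4.0 * np.arange(d) / (d - 1)
    return np.sum(np.abs(x) ** p)

def levy(x):
    # Eq. (161): Levy function with w_i = 1 + (x_i - 1) / 4
    w = 1.0 + (x - 1.0) / 4.0
    term1 = np.sin(np.pi * w[0]) ** 2
    term2 = np.sum((w[:-1] - 1.0) ** 2 * (1.0 + 10.0 * np.sin(np.pi * w[:-1] + 1.0) ** 2))
    term3 = (w[-1] - 1.0) ** 2 * (1.0 + np.sin(2.0 * np.pi * w[-1]) ** 2)
    return term1 + term2 + term3
```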
Figure 2: Results on the four test functions with problem dimension d = 200 and N = 50. (Panels: (a) Ellipsoid, (b) ℓ1/2-Ellipsoid, (c) Different Powers, (d) Levy; x-axis: iteration T (×10^3); curves: CMA-ES, MMES, BES, INGO, SABO.)

Figure 3: Results on the four test functions with problem dimension d = 1000 and N = 50. (Same panels and curves as Figure 2.)

Figure 4: Results on the four test functions with problem dimension d = 200 and N = 10. (Same panels and curves as Figure 2.)

Figure 5: Results on the four test functions with problem dimension d = 200 and N = 100. (Same panels and curves as Figure 2.)

H ADDITIONAL RESULTS ON SYNTHETIC PROBLEMS

We conduct additional experiments to compare the proposed SABO method with the GFM method (Lin et al., 2022). For GFM, we employ the default hyperparameter setting from Lin et al. (2022): we set its smoothing parameter δ = 0.2 and run experiments with step sizes in {0.001, 0.0005, 0.0001}. The initial point is set the same as for SABO. The results are shown in Figure 6 with problem dimension d = 200 and N = 50. The GFM method converges only slowly on the ellipsoid problem, fails on the ℓ1/2-ellipsoid problem, and cannot achieve high precision on the different powers and Levy problems. This shows that it can be challenging for GFM to optimize non-smooth or high-dimensional test functions without adaptively updating the mean and covariance. The proposed SABO method takes advantage of second-order information, which greatly accelerates solving these problems.
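For intuition, the sketch below shows a generic two-point Gaussian-smoothing gradient estimator of the kind used by gradient-free baselines such as GFM, with a fixed smoothing parameter and a fixed isotropic sampling distribution. It is an illustrative stand-in rather than the exact GFM update, and the function and parameter names are ours.

```python
import numpy as np

def gaussian_smoothing_grad(f, x, delta=0.2, num_samples=50, rng=None):
    """Monte Carlo estimate of (1/delta) * E_z[ z * (f(x + delta*z) - f(x)) ], z ~ N(0, I).
    The smoothing radius delta and the isotropic covariance stay fixed throughout."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    g = np.zeros(d)
    fx = f(x)
    for _ in range(num_samples):
        z = rng.standard_normal(d)
        g += z * (f(x + delta * z) - fx)
    return g / (delta * num_samples)

def gradient_free_minimize(f, x0, step=0.001, iters=2500, delta=0.2):
    # plain descent with a fixed step size, as in the step-size grid above
    x = x0.copy()
    for _ in range(iters):
        x -= step * gaussian_smoothing_grad(f, x, delta)
    return x
```

In contrast, the SABO update in Eqs. (156)-(157) adapts both the mean and the covariance of the sampling distribution using the rescaled fitness values.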
Table 2: Performance (%) on the SST-2, AG's News, MRPC, RTE, SNLI, and Yelp P. datasets. We report the mean and standard deviation over 3 random seeds. The best result across all groups is highlighted in bold and the best result in each group is underlined.

Methods    SST-2          AG's News      MRPC           RTE            SNLI           Yelp P.
Zero-shot  79.82          76.96          67.40          51.62          38.82          89.64
Dimension d = 200
GIBO       83.53 ± 0.15   75.79 ± 0.08   79.21 ± 0.09   53.07 ± 0.29   38.73 ± 0.09   89.63 ± 0.03
TuRBO      83.30 ± 0.19   79.01 ± 1.68   69.59 ± 8.51   46.57 ± 2.30   40.27 ± 0.69   90.16 ± 0.19
SABO       87.88 ± 0.53   82.22 ± 0.41   79.35 ± 0.12   53.67 ± 0.17   40.72 ± 0.15   91.50 ± 0.13
Dimension d = 500
GIBO       83.49 ± 0.09   75.70 ± 0.05   79.03 ± 0.08   52.95 ± 0.17   38.71 ± 0.16   89.65 ± 0.02
TuRBO      84.52 ± 0.65   80.03 ± 1.97   75.30 ± 2.34   48.01 ± 0.59   38.82 ± 0.34   90.20 ± 0.45
SABO       87.31 ± 0.38   82.65 ± 0.59   79.62 ± 0.07   53.55 ± 0.17   42.29 ± 2.48   91.83 ± 0.16
Dimension d = 1000
GIBO       83.45 ± 0.11   75.67 ± 0.10   79.15 ± 0.00   52.95 ± 0.17   38.87 ± 0.18   89.65 ± 0.04
TuRBO      85.90 ± 0.95   82.36 ± 0.21   77.30 ± 0.86   50.30 ± 1.11   39.87 ± 1.07   90.14 ± 0.20
SABO       87.96 ± 0.83   82.77 ± 0.41   79.68 ± 0.23   53.31 ± 0.17   40.32 ± 0.27   91.96 ± 0.41

Figure 6: Results on the four test functions with problem dimension d = 200 and N = 50. (Panels: (a) Ellipsoid, (b) ℓ1/2-Ellipsoid, (c) Different Powers, (d) Levy; x-axis: iteration T (×10^3); curves: SABO and GFM with step sizes 0.001, 0.0005, 0.0001.)

H.1 ADDITIONAL RESULTS ON BLACK-BOX PROMPT FINE-TUNING

We conduct an additional experiment on the black-box prompt fine-tuning task to compare the proposed SABO method with the high-dimensional BO methods discussed in Section 5, namely TuRBO (Eriksson et al., 2019) and GIBO (Nguyen et al., 2022). We employ the default setting of TuRBO-1 from He et al. (2020) for the TuRBO method, and the default setting from Nguyen et al. (2022) for the GIBO method. The results on the six benchmark datasets are reported in Table 2. According to the results, the SABO method consistently outperforms these two baselines, highlighting its effectiveness.

H.2 SYNTHETIC IMAGE CLASSIFICATION PROBLEM

We conduct experiments on a synthetic image classification problem to empirically evaluate the performance of the proposed mini-batch SABO method. Specifically, we apply black-box optimization methods to train a model to classify the images accurately. Moreover, following Foret et al. (2021), we conduct additional experiments in the noisy-label setting. In particular, we train the model on a corrupted version of the Fashion-MNIST dataset, where some of the training labels are randomly flipped according to different noise rates, while the test set is kept clean. To construct problems with different dimensions, we first adopt UMAP (McInnes et al., 2018) to reduce the dimension of Fashion-MNIST to 8 while preserving the class discriminability (Sagawa & Hino, 2022), and then employ a fixed randomly initialized matrix P ∈ R^{8×d̃} to project the extracted features to a d̃-dimensional space. After preprocessing, a linear layer is used as the classifier. Therefore, the total number of trainable parameters is d = 10d̃.
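To make this construction concrete, here is a minimal NumPy sketch of the resulting black-box objective, assuming the 8-dimensional UMAP features have already been computed; the function and parameter names are illustrative rather than the exact pipeline used in our experiments.

```python
import numpy as np

def make_problem(features, labels, d_tilde, noise_rate=0.0, num_classes=10, seed=0):
    """features: (M, 8) UMAP-reduced Fashion-MNIST features (assumed precomputed).
    Returns a black-box objective F(x) = cross-entropy loss of a linear classifier
    whose 10 * d_tilde parameters are flattened into the vector x."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((8, d_tilde))           # fixed random projection
    Z = features @ P                                # (M, d_tilde) projected inputs
    y = labels.copy()
    flip = rng.random(len(y)) < noise_rate          # randomly flip some training labels
    y[flip] = rng.integers(0, num_classes, flip.sum())

    def objective(x, batch_idx):
        # x has d = 10 * d_tilde entries and is the only thing the optimizer sees
        W = x.reshape(num_classes, d_tilde)
        logits = Z[batch_idx] @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(batch_idx)), y[batch_idx]].mean()

    return objective
```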
Table 3: Performance (%) on the Fashion-MNIST dataset with noisy labels. The best result across all groups is highlighted in bold and the best result in each group is underlined.

Methods   noise=0%       noise=20%      noise=40%      noise=60%      noise=80%
Dimension d = 100
CMA-ES    76.68 ± 0.44   77.25 ± 1.26   75.71 ± 1.15   70.81 ± 1.01   57.86 ± 3.65
MMES      76.20 ± 1.45   74.68 ± 0.33   74.46 ± 0.32   71.12 ± 2.24   65.92 ± 1.30
BES       76.26 ± 0.21   73.60 ± 0.50   67.60 ± 0.72   55.14 ± 6.26   45.65 ± 4.35
INGO      76.68 ± 0.90   75.16 ± 0.55   71.26 ± 0.74   61.70 ± 3.89   46.75 ± 1.35
SABO      78.10 ± 0.90   77.41 ± 0.68   77.10 ± 1.69   73.41 ± 0.33   66.45 ± 0.43
Dimension d = 1000
CMA-ES    76.75 ± 0.54   74.85 ± 0.06   73.21 ± 0.52   67.32 ± 0.84   50.00 ± 0.98
MMES      75.85 ± 0.43   72.39 ± 3.59   73.73 ± 0.50   72.51 ± 1.47   56.60 ± 2.72
BES       76.18 ± 0.63   72.45 ± 0.44   66.64 ± 2.09   58.74 ± 1.50   46.99 ± 4.30
INGO      74.73 ± 0.62   74.87 ± 1.72   69.75 ± 0.73   63.18 ± 4.04   41.80 ± 1.36
SABO      76.87 ± 1.41   77.73 ± 0.85   74.97 ± 1.12   73.81 ± 1.26   61.96 ± 0.08

Datasets. Fashion-MNIST (Xiao et al., 2017) is an image classification dataset. It contains 60,000 training samples and 10,000 test samples, each representing a 28×28-pixel grayscale image of fashion items from 10 classes.

Implementation Details. For a fair comparison, we use the same population size, number of batch samples, and initialization for CMA-ES (Hansen, 2006), INGO (Lyu & Tsang, 2021), BES (Gao & Sener, 2022), and SABO. The population size and the number of batch samples are set to N = 100 and M = 2048, respectively. The Gaussian distributions are initialized with µ0 = 0 and Σ0 = 0.5I. For the BES, INGO, and SABO methods, we search the learning rate β over {0.1, 0.5, 1, 5}. Moreover, we perform a grid search on ρ over {100, 500, 1000, 5000} for SABO. We employ the default hyperparameter setting from He et al. (2020) for the MMES method. For the BES method, we perform a grid search on the spacing c over {0.1, 1, 10}. The cross-entropy loss is used as the training objective. All experiments are repeated with three independent seeds, and the mean and standard deviation are reported.

Results. Table 3 shows the experimental results on the Fashion-MNIST dataset with different d̃ and different noise rates. The SABO method consistently outperforms all baselines across different noise rates and dimensions, highlighting its effectiveness. These results show that the proposed SABO method achieves good robustness and demonstrate its effectiveness in improving model generalization performance.

I DISCUSSION ON OFFLINE BLACK-BOX OPTIMIZATION

The black-box optimization problem we study in this work can also be called the online black-box optimization problem: since we have access to the objective function F, the problem can be solved in an online iterative manner, where in each iteration the solver proposes a new x and queries the objective function for feedback that informs better solution proposals at the next iteration. Offline black-box optimization (Chen et al., 2022; Qi et al., 2022) is a different research line from standard online black-box optimization. In offline black-box optimization, access to the true objective F is not available. Instead, the algorithm is provided with a static dataset D = {(x_i, F(x_i))} of variables x_i and their corresponding objective values F(x_i). Therefore, the basic settings of offline and online black-box optimization are different.
Moreover, the challenges of offline and online black-box optimization are also different. Offline black-box optimization focuses on producing query candidates with a surrogate model trained on a prior static dataset (Trabucco et al., 2021; Kumar & Levine, 2020). The goal is to produce a good query set from the fixed model in one shot, without considering query-feedback updates or the long-term balance between exploration and exploitation. Along this line, Kumar & Levine (2020) proposed a model-based offline optimization method that trains a generative model conditioned on the objective value. Fannjiang & Listgarten (2020) formulated the problem as a non-zero-sum game and proposed an alternating ascent-descent algorithm for model-based offline optimization. Trabucco et al. (2021) proposed conservative objective models, a technique similar to adversarial training that avoids overestimating out-of-distribution inputs. In contrast, standard online black-box optimization needs to balance exploration and exploitation and focuses on long-term convergence performance. As a result, offline black-box optimization is not suitable for our online prompt fine-tuning tasks.

J DISCUSSIONS ABOUT THE SOCIETAL IMPACT AND LIMITATIONS

This work focuses only on black-box optimization in deep learning, so it has no negative societal impact. The main theoretical analysis in this work focuses on convex black-box functions. It is technically challenging to analyze non-convex cases considering both the black-box nature and the sharpness-aware properties, and we leave this as future work. Additionally, the SABO method employs standard Monte Carlo sampling for gradient approximation. Other sampling techniques might be more efficient, but they are beyond the scope of this work and will be studied in the future.