# Scalable Acceleration for Classification-Based Derivative-Free Optimization

Tianyi Han*, Jingya Li, Zhipeng Guo, Yuan Jin*
Beijing Supreium Technology, Haidian District, Beijing, China
{ty.han, jy.li, zp.guo, y.jin}@supreium.com

Abstract

Derivative-free optimization algorithms play an important role in scientific and engineering design optimization problems, especially when derivative information is not accessible. In this paper, we study the framework of sequential classification-based derivative-free optimization algorithms. By introducing the learning-theoretic concept of the hypothesis-target shattering rate, we revisit the computational complexity upper bound of SRACOS. Inspired by the revisited upper bound, we propose an algorithm named RACE-CARS, which adds a random region-shrinking step to SRACOS. We further establish theorems showing the acceleration brought by region shrinking. Experiments on synthetic functions as well as black-box tuning for language-model-as-a-service demonstrate empirically the efficiency of RACE-CARS. An ablation experiment on the introduced hyper-parameters is also conducted, revealing the mechanism of RACE-CARS and putting forward empirical hyper-parameter tuning guidance.

Introduction

In recent years, there has been a growing interest in the field of derivative-free optimization (DFO) algorithms, also known as zeroth-order optimization. These algorithms aim to optimize objective functions without relying on explicit gradient information, making them suitable for scenarios where obtaining derivatives is either infeasible or computationally expensive (Conn, Scheinberg, and Vicente 2009; Larson, Menickelly, and Wild 2019). For example, DFO techniques can be applied to hyperparameter tuning, which involves optimizing complex objective functions with unavailable first-order information (Falkner, Klein, and Hutter 2018; Akiba et al.
2019; Yang and Shami 2020). Moreover, DFO algorithms find applications in engineering design optimization, where the objective functions are computationally expensive to evaluate and lack explicit derivatives (Ray and Saini 2001; Liao 2010; Akay and Karaboga 2012). Classical DFO methods such as the Nelder-Mead method (Nelder and Mead 1965) or directional direct-search (DDS) methods (Céa 1971; Yu 1979) have shown great success on convex problems. However, their performance is compromised when the objective is nonconvex, which is the commonly faced situation for black-box problems.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this paper, we focus on optimization problems of the form

$$\min_{x \in \Omega} f(x), \quad (1)$$

where the $n$-dimensional compact cube $\Omega \subseteq \mathbb{R}^n$ is the solution space. We do not stipulate any convexity, smoothness or separability assumption on $f$. In recent decades, progress has been made in the extensive exploration of DFO algorithms for nonconvex problems, and various kinds of algorithms have been established: for example, evolutionary algorithms (Bäck and Schwefel 1993; Fortin et al. 2012; Hansen 2016; Opara and Arabas 2019), Bayesian optimization (BO) methods (Snoek, Larochelle, and Adams 2012; Shahriari et al. 2015; Frazier 2018) and gradient-approximation surrogate modeling algorithms (Nesterov and Spokoiny 2017; Chen et al. 2019; Ge et al. 2022; Ragonneau and Zhang 2023). Given the nonconvex and black-box nature of the problem, the quest for enhanced sample efficiency inevitably presents the classic trade-off between exploration and exploitation. Like typical nonlinear/nonconvex numerical methods, the aforementioned algorithms are also susceptible to the curse of dimensionality (Bickel and Levina 2004; Hall, Marron, and Neeman 2005; Fan and Fan 2008; Shi et al. 2021; Scheinberg 2022; Yue et al. 2023).
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Improving scalability and efficiency can be distilled into a few key areas, such as function decomposition (Wang et al. 2018b), dimension reduction (Wang et al. 2016; Nayebi, Munteanu, and Poloczek 2019) and sampling-strategy refinement. Regarding the latter, previous algorithms have focused on preventing over-exploitation through, for instance, more efficient sampling dynamics (Ros and Hansen 2008; Hensman, Fusi, and Lawrence 2013; Yi et al. 2024), search-space discretization (Sobol 1967) or search-region restriction (Eriksson et al. 2019; Wang, Fonseca, and Tian 2020). Regardless of the improvements, many of these model-based DFO algorithms share the mechanism of, locally or globally, fitting the objective function. The motivation of classification-based DFO algorithms is different: they learn a hypothesis (classifier) to fit the sub-level set $\Omega_{\alpha_t} := \{x \in \Omega : f(x) - f^* \le \alpha_t\}$ at each iteration $t = 1, \ldots, T$, with which new candidates are generated for the successive round (Michalski 2000; Gotovos 2013; Bogunovic et al. 2016; Yu, Qian, and Hu 2016; Hashimoto, Yadlowsky, and Duchi 2018). Sub-level sets are considered to contain less information than objective functions, since they are already determined whenever the objective function is known. By this means, sample usage is more efficient. Moreover, classification-based methods are less sensitive to function oscillation or noise (Liu et al. 2017) and easy to parallelize (Liu et al. 2017, 2019b; Hashimoto, Yadlowsky, and Duchi 2018). Training strategies for hypotheses have branched out in various directions. Hashimoto, Yadlowsky, and Duchi pursue accurate hypotheses using a conservative strategy raised by El-Yaniv and Wiener. They demonstrate that for a hypothesis class $\mathcal{H}$ with VC-dimension $VC(\mathcal{H})$, an $\epsilon$-accurate hypothesis requires $O(\epsilon^{-1}(VC(\mathcal{H}) \log(\epsilon^{-1})))$ samples per batch.
While a computationally efficient approximation has indeed been developed, its success was demonstrated only in the context of low-dimensional problems; the extent of its effectiveness in high-dimensional scenarios is yet to be determined. A contrasting approach, however, successfully led to improvements in sample efficiency. Yu, Qian, and Hu propose to RAndomly COordinate-wise Shrink the solution region (RACOS), a more radical approach that diverges from the goal of producing accurate hypotheses. They prove that RACOS converges to the global minimum in polynomial time when the objective is locally Hölder continuous, and experiments indicate success on high-dimensional problems ranging from hundreds to thousands of dimensions. An added benefit is the ease of sampling through the generated hypotheses, as their active regions are coordinate-orthogonal boxes. Building on this, the sequential counterpart of RACOS, known as SRACOS, takes an even more radical stance by sampling only once per iteration and learning from biased data. Despite this, SRACOS has been shown to outperform RACOS under certain mild restrictions (Hu, Qian, and Yu 2017).

Outline and Contributions

1. Our research extends the groundwork laid by (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), yet we discover that the upper bound they proposed for SRACOS does not fully encapsulate its behavior, potentially allowing for an overflow. We further construct a counter-example (eq. (3)) illustrating that the upper bound is not tight enough to delineate the convergence accurately.

2. We identify a contentious assumption regarding the concept of error-target dependence that seems to underpin these limitations. We introduce a novel learning-theoretic concept termed the hypothesis-target η-shattering rate, which serves as a foundation for our reassessment of the upper bound on the computational complexity of SRACOS.

3.
Inspired by the reanalysis, we recognize that the accuracy of the trained hypotheses is even less critical than previously thought. With this insight, we formulate a novel algorithm, RACE-CARS (an acronym for RAndomized CoordinatE Classifying And Region Shrinking), which perpetuates the essence of SRACOS while promising a theoretical enhancement in convergence. We design experiments on both synthetic functions and black-box tuning for LMaaS. These experiments juxtapose RACE-CARS against a selection of DFO algorithms, empirically highlighting its superior efficacy.

4. In the discussion and appendix, we further substantiate the empirical prowess of RACE-CARS beyond the confines of continuity, supported by a theoretical framework. An ablation study is also conducted to demystify the selection of hyperparameters for our proposed algorithm.

The rest of the paper is organized in five sections, sequentially presenting the background, theoretical study, experiments, discussion and conclusion.

Assumption 1. In eq. (1), we presume $f(x)$ is lower bounded within $\Omega$, with $f^* := \min_{x \in \Omega} f(x)$.

In our theoretical analysis, we reframe the procedure as a stochastic process. Let $\mathcal{F}$ denote the Borel σ-algebra on $\Omega$, and $P$ the probability measure defined on $\mathcal{F}$. For instance, if $\Omega$ is a continuous space, $P$ is derived from the Lebesgue measure $m$, such that $P(B) := m(B)/m(\Omega)$ for all $B \in \mathcal{F}$.

Assumption 2. Define $\Omega_\epsilon := \{x \in \Omega : f(x) - f^* \le \epsilon\}$, with the assumption $|\Omega_\epsilon| := P(\Omega_\epsilon) > 0$ for all $\epsilon > 0$.

A hypothesis (or classifier) $h$ is a function mapping the solution space $\Omega$ to $\{0, 1\}$. Define

$$\mathcal{D}_h(x) := \begin{cases} P(\{x \in \Omega : h(x) = 1\})^{-1}, & h(x) = 1, \\ 0, & \text{otherwise}, \end{cases} \quad (2)$$

a probability distribution on $\Omega$. Let $X_h$ be the random vector in $(\Omega, \mathcal{F}, P)$ drawn from $\mathcal{D}_h$, implying that $\Pr(X_h \in B) = \int_B \mathcal{D}_h(x)\,dP$ for all $B \in \mathcal{F}$. Denote $\mathcal{T} := \{1, 2, \ldots, T\}$, and let $\mathbb{F} := (\mathcal{F}_t)_{t \in \mathcal{T}}$ be a filtration of σ-algebras on $\Omega$ indexed by $\mathcal{T}$, satisfying $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_T \subseteq \mathcal{F}$.
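The distribution $\mathcal{D}_h$ in eq. (2) is simply the uniform distribution restricted to the active region $\{x : h(x) = 1\}$. A minimal rejection-sampling sketch of drawing $X_h \sim \mathcal{D}_h$ (function and variable names here are ours, for illustration only):

```python
import random

def sample_from_hypothesis(h, omega, rng=None):
    """Draw one realization of X_h ~ D_h (eq. (2)) by rejection:
    sample uniformly from the box Omega until the point lands in the
    active region {x : h(x) = 1}."""
    rng = rng or random.Random(0)
    while True:
        x = tuple(rng.uniform(a, b) for a, b in omega)
        if h(x) == 1:
            return x
```

Rejection sampling is only a didactic stand-in; for RACOS-style hypotheses the active region is an axis-aligned box, so one can sample it directly without rejection.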
A typical classification-based optimization algorithm learns an $\mathbb{F}$-adapted stochastic process $X := (X_t)_{t \in \mathcal{T}}$, where each $X_t$ is induced by $X_{h_t}$ and $h_t$ is the hypothesis updated at step $t$. Subsequently, it samples new data with a stochastic process $Y := (Y_t)_{t \in \mathcal{T}}$ generated by $X$. Typically, the new candidates at step $t \ge r + 1$ are sampled from

$$Y_t := \begin{cases} X_t, & \text{with probability } \lambda, \\ X_\Omega, & \text{with probability } 1 - \lambda, \end{cases}$$

where $X_\Omega$ is a random vector drawn from the uniform distribution $U_\Omega$ and $\lambda \in [0, 1]$ is the exploitation rate. A simplified batch-mode classification-based optimization algorithm is outlined in Algorithm 1. At each step $t$, it selects a positive set Spositive from $S$ containing the best $m$ samples, with the remainder placed in the negative set Snegative. It then trains a hypothesis $h_t$ distinguishing the positive set from the negative set, ensuring $h_t(x_j) = 0$ for all $(x_j, y_j) \in$ Snegative. Finally, it samples $r$ new solutions with the sampling random vector $Y_t$. The sub-procedure $h_t \leftarrow \mathcal{T}($Spositive, Snegative$)$ denotes hypothesis training.

Algorithm 1: Batch-Mode Classification-Based Optimization Algorithm
Input: T: budget; r: training size; m: positive size.
Output: (xbest, ybest).
  Collect: S = {(x_1^0, y_1^0), ..., (x_r^0, y_r^0)} i.i.d. from U_Ω;
  (xbest, ybest) = arg min{y : (x, y) ∈ S};
  for t = 1, ..., T/r do
    Classify: (Spositive, Snegative) ← S;
    Train: h_t ← T(Spositive, Snegative);
    Sample: {(x_1^t, y_1^t), ..., (x_r^t, y_r^t)} i.i.d. with Y_t;
    Select: S ← S ∪ {(x_1^t, y_1^t), ..., (x_r^t, y_r^t)};
    (xbest, ybest) = arg min{y : (x, y) ∈ S};
  end for
Return: (xbest, ybest)

RACOS is the abbreviation of RAndomized COordinate Shrinking; literally, it trains the hypothesis by the following means.

Algorithm 2: RACOS
Input: Ω: boundary; (Spositive, Snegative): binary sets; I = {1, ..., n}: index of dimensions.
Output: h: hypothesis.
  Randomly select: x+ = (x+^1, ..., x+^n) ∈ Spositive;
  Initialize: h(x) ≡ 1;
  while ∃ x ∈ Snegative s.t. h(x) = 1 do
    Randomly select: k ∈ I;
    Randomly select: x− = (x−^1, . . .
, x−^n) ∈ Snegative;
    if x+^k < x−^k then
      s ← random(x+^k, x−^k);
      Shrink: h(x) = 0 for all x = (x^1, ..., x^n) ∈ Ω with x^k > s;
    else
      s ← random(x−^k, x+^k);
      Shrink: h(x) = 0 for all x = (x^1, ..., x^n) ∈ Ω with x^k < s;
    end if
  end while
Return: h

This is the training sub-procedure of RACOS (Yu, Qian, and Hu 2016): coordinates are shrunk randomly until all negative samples are excluded from the active region of the resulting hypothesis. Algorithm 2 shows a continuous version of RACOS. Aside from sampling only once per iteration, another difference between the sequential and batch modes of classification-based DFO is that the sequential version replaces the training set with a new one under certain rules to finish step $t$ (see Algorithm 3). In the rest of this paper, we omit the details of the Replacing sub-procedure, which can be found in (Hu, Qian, and Yu 2017).

Algorithm 3: Sequential-Mode Classification-Based Optimization Algorithm
Input: T: budget; r: training size; m: positive size; Replacing: replacing sub-procedure.
Output: (xbest, ybest).
  Collect: S = {(x_1, y_1), ..., (x_r, y_r)} i.i.d. from U_Ω;
  (xbest, ybest) = arg min{y : (x, y) ∈ S};
  for t = r + 1, ..., T do
    Classify: (Spositive, Snegative) ← S;
    Train: h_t ← T(Spositive, Snegative);
    Sample: (x_t, y_t) with Y_t;
    Replace: S ← Replacing((x_t, y_t), S);
    (xbest, ybest) = arg min{y : (x, y) ∈ S};
  end for
Return: (xbest, ybest)

Classification-based DFO algorithms admit a bound on the query complexity (Yu and Qian 2014), quantifying the total number of function evaluations required to identify a solution that achieves an approximation level of $\epsilon$ with probability at least $1 - \delta$.

Definition 1 ((ϵ, δ)-Query Complexity). Given $f$, $0 < \delta < 1$ and $\epsilon > 0$, the $(\epsilon, \delta)$-query complexity of an algorithm $\mathcal{A}$ is the number of calls to $f$ such that, with probability at least $1 - \delta$, $\mathcal{A}$ finds at least one solution $\tilde{x} \in \Omega$ satisfying $f(\tilde{x}) - f^* \le \epsilon$.

Definitions 2 and 3 are given by (Yu, Qian, and Hu 2016).
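For concreteness, the coordinate-shrinking loop of Algorithm 2 can be sketched as follows, representing the hypothesis's active region as an axis-aligned box (a didactic sketch; identifier names are ours, and negatives coinciding with the positive sample are assumed away, as in the continuous setting):

```python
import random

def racos_train(omega, s_positive, s_negative, rng=None):
    """Sketch of RACOS hypothesis training (Algorithm 2): shrink an
    axis-aligned box, one random coordinate at a time, until every
    negative sample falls outside the active region."""
    rng = rng or random.Random(0)
    lo = [a for a, b in omega]
    hi = [b for a, b in omega]
    x_pos = rng.choice(s_positive)

    def active(x):  # h(x) = 1 iff x lies inside the current box
        return all(l <= xi <= u for xi, l, u in zip(x, lo, hi))

    violating = [x for x in s_negative if active(x)]
    while violating:
        k = rng.randrange(len(lo))        # random coordinate
        x_neg = rng.choice(violating)     # random violating negative
        if x_pos[k] < x_neg[k]:
            hi[k] = rng.uniform(x_pos[k], x_neg[k])  # cut above s
        elif x_pos[k] > x_neg[k]:
            lo[k] = rng.uniform(x_neg[k], x_pos[k])  # cut below s
        violating = [x for x in s_negative if active(x)]
    return lo, hi  # active region of the trained hypothesis
```

Note that each cut lands strictly between the positive and the chosen negative coordinate, so the positive sample always stays inside the box, which is why sampling from the trained hypothesis stays easy.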
The first characterizes the so-called dependence between classification error and the target region, which is expected to be small to ensure efficiency. The second characterizes the portion of the active region of the hypothesis, and is also expected to be as small as possible.

Definition 2 (Error-Target θ-Dependence). The error-target dependence $\theta \ge 0$ of a classification-based optimization algorithm is its infimum such that, for any $\epsilon > 0$ and any $t = 1, \ldots, T$,

$$|\Omega_\epsilon| \cdot P(R_t) - P(\Omega_\epsilon \cap R_t) \le \theta |\Omega_\epsilon|,$$

where $R_t := \Omega_{\alpha_t} \,\triangle\, \{x \in \Omega : h_t(x) = 1\}$ denotes the relative error, $\Omega_{\alpha_t}$ is the sub-level set at step $t$ with $\alpha_t := \min_{1 \le i \le t} f(x_i) - f^*$, and the operator $\triangle$ is the symmetric difference of two sets, defined as $A_1 \,\triangle\, A_2 = (A_1 \cup A_2) \setminus (A_1 \cap A_2)$.

Definition 3 (γ-Shrinking Rate). The shrinking rate $\gamma > 0$ of a classification-based optimization algorithm is its infimum such that $P(x \in \Omega : h_t(x) = 1) \le \gamma |\Omega_{\alpha_t}|$ for all $t = 1, \ldots, T$.

Theoretical Study

Previous studies gave a general bound on the query complexity of Algorithm 3 under mild error-target θ-dependence and γ-shrinking rate assumptions:

Theorem 1 (Hu, Qian, and Yu 2017). Given $0 < \delta < 1$ and $\epsilon > 0$, if a sequential classification-based optimization algorithm has error-target θ-dependence and γ-shrinking rate, then its $(\epsilon, \delta)$-query complexity is upper bounded by

$$O\left(\max\left\{\frac{1}{|\Omega_\epsilon|}\left(\lambda + \frac{1-\lambda}{\gamma(T-r)}\sum_{t=r+1}^{T}\Phi_t\right)^{-1}\ln\frac{1}{\delta},\; T\right\}\right),$$

where $\Phi_t = \left(1 - \theta - P(R_{\mathcal{D}_t}) - m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathcal{D}_t \,\|\, U_\Omega)}\right)|\Omega_{\alpha_t}|^{-1}$, with the notations $\mathcal{D}_t := \lambda \mathcal{D}_{h_t} + (1-\lambda)U_\Omega$ and $P(R_{\mathcal{D}_t}) := \int_{R_t} \mathcal{D}_t(x)\,dP$.

Issues Introduced by Error-Target Dependence

1. Overflow of the Upper Bound. Under the assumptions entailed in (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), it can be observed that lower values of θ or γ correlate with improved query complexity. However, this is not a hard-and-fast rule: even with small values of these parameters, we can encounter scenarios where the expected performance does not materialize.
Following the lemma given by (Yu, Qian, and Hu 2016),

$$P(R_t) \le P(R_{\mathcal{D}_t}) + m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathcal{D}_t \,\|\, U_\Omega)},$$

we have the following inequality:

$$\Phi_t = \left(1 - \theta - P(R_{\mathcal{D}_t}) - m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathcal{D}_t \,\|\, U_\Omega)}\right)|\Omega_{\alpha_t}|^{-1} \le (1 - \theta - P(R_t))\,|\Omega_{\alpha_t}|^{-1}.$$

The concept of error-target θ-dependence reveals that a small θ does not guarantee a small relative error $P(R_t)$. On the contrary, a small θ coupled with a large $P(\Omega_\epsilon \cap R_t)/|\Omega_\epsilon|$ can result in a significant $P(R_t)$, which can even be 1 as long as $\Omega_{\alpha_t}$ lies totally outside the active region of $h_t$, i.e., when the hypothesis $h_t$ is completely wrong. Problematically, as a divisor in the proof of Theorem 1, $\Phi_t \le (1 - \theta - P(R_t))|\Omega_{\alpha_t}|^{-1}$ is then less than or equal to 0, disrupting the established inequality. Moreover, for the inequality to be disrupted, $P(R_t)$ need not be 1, since θ is inherently nonnegative; a series of inaccurate hypotheses suffices to undermine the validity of the upper bound. This finding challenges the principle of SRACOS's radical training strategy.

2. Tightness of the Upper Bound. Consider an extreme but plausible situation where the hypotheses generated at each step are defined as follows:

$$h_t(x) = \begin{cases} 1, & x \in \Omega_\epsilon, \\ 0, & x \notin \Omega_\epsilon. \end{cases} \quad (3)$$

In the context of sequential-mode classification-based optimization algorithms, where the training sets are not only small but also potentially biased, it is reasonable to expect large relative errors $P(R_t)$. This scenario could lead to a series of hypotheses that are inaccurate with respect to $\Omega_{\alpha_t}$ but, by chance, accurate with respect to $\Omega_\epsilon$. Consequently, the error-target dependence $\theta = \max_{1 \le t \le T} P(R_t)$ can be unexpectedly large. Even with all $\Phi_t$ values positive, the query complexity bound given in Theorem 1 is therefore not optimistic. However, the probability of failing to identify an ϵ-minimum,

$$\Pr\left(\min_{1 \le t \le T} f(x_t) - f^* > \epsilon\right) = (1 - |\Omega_\epsilon|)^r \left((1-\lambda)(1-|\Omega_\epsilon|)\right)^{T-r},$$

is less than δ for a reasonably sized $T$. This indicates that the upper bound may not be as tight as initially thought.
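The failure probability above collapses quickly even for tiny target volumes. A quick numeric check (the budget and volume values here are illustrative, not taken from the paper):

```python
def failure_prob(vol_eps, lam, T, r):
    """Probability of missing Omega_eps under the hypotheses of eq. (3):
    the r uniform warm-up draws all miss, and each later step fails only
    when the exploration branch (probability 1 - lam) misses too."""
    return (1 - vol_eps) ** r * ((1 - lam) * (1 - vol_eps)) ** (T - r)

p_fail = failure_prob(vol_eps=1e-3, lam=0.5, T=50, r=10)  # about 9e-13
```

Even with a target region occupying only 0.1% of the volume, 50 evaluations already drive the failure probability far below any reasonable δ, which is what makes the Theorem 1 bound look pessimistic here.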
Revisiting the Query Complexity Upper Bound

It is evident that even minimal error-target dependence cannot encapsulate the issues arising from substantial relative error, because error-target dependence alone is insufficient to account for relative error. Intuitively, one might consider introducing an additional assumption to cap the relative error. However, such an assumption would be impractical, given the inherently small and biased nature of the training datasets in the process. To this end, we give a new concept that stands apart from the constraints of relative error:

Definition 4 (Hypothesis-Target η-Shattering Rate). Given $\eta \in [0, 1]$ and a family of hypotheses $\mathcal{H}$ defined on $\Omega$, we say $\Omega_\epsilon$ is η-shattered by $h \in \mathcal{H}$ if

$$P(\Omega_\epsilon \cap \{x \in \Omega : h(x) = 1\}) \ge \eta|\Omega_\epsilon|,$$

and η is called the hypothesis-target shattering rate.

The hypothesis-target shattering rate mirrors the error-target dependence in its relation to the hypothesis's target-accuracy. Importantly, it bounds the error-target dependence in conjunction with the relative error:

$$\theta \le \max\{P(R_t),\; |1 - P(R_t) - \eta|\}.$$

This rate η measures the overlap between the target set $\Omega_\epsilon$ and the active region of a hypothesis. Crucially, it also mitigates the influence of relative error on error-target dependence. Utilizing the hypothesis-target shattering rate, we re-examine the upper bound of the $(\epsilon, \delta)$-query complexity in the subsequent theorem.

Theorem 2. For the sequential-mode classification-based DFO Algorithm 3, let $X_t = X_{h_t}$, $\epsilon > 0$ and $0 < \delta < 1$. When $\Omega_\epsilon$ is η-shattered by $h_t$ for all $t = r + 1, \ldots, T$ and $\max_{t=r+1,\ldots,T} P(\{x \in \Omega : h_t(x) = 1\}) \le p \le 1$, the $(\epsilon, \delta)$-query complexity is upper bounded by

$$O\left(\max\left\{\left(\frac{\lambda\eta|\Omega_\epsilon|}{p} + (1-\lambda)|\Omega_\epsilon|\right)^{-1}\ln\frac{1}{\delta} + r,\; T\right\}\right).$$

The Region-Shrinking Acceleration

In the analysis of the $(\epsilon, \delta)$-query complexity for classification-based optimization, the focus shifts away from minimizing the relative error $P(R_t)$: our goal is to identify optima, not to develop a sequence of accurate hypotheses.
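The shattering rate of Definition 4 can be estimated empirically. A Monte-Carlo sketch (function names are ours) that estimates $P(\Omega_\epsilon \cap \{h = 1\})/|\Omega_\epsilon|$ from uniform samples:

```python
import random

def shattering_rate(h, in_target, omega, n=20000, rng=None):
    """Monte-Carlo estimate of the hypothesis-target shattering rate
    (Definition 4): the fraction of the target region Omega_eps that is
    covered by the active region {x : h(x) = 1}."""
    rng = rng or random.Random(0)
    hits = cover = 0
    for _ in range(n):
        x = tuple(rng.uniform(a, b) for a, b in omega)
        if in_target(x):
            hits += 1
            cover += h(x)
    return cover / hits if hits else 0.0
```

For instance, on the unit square, a hypothesis active on $x^1 < 0.25$ shatters the target $x^1 < 0.5$ at rate roughly 0.5: half of the target's volume is covered.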
The counterexample in equation (3), despite a potentially high relative error, represents an optimal hypothesis scenario. This realization allows for the consideration of more radical hypotheses, directing our attention to the overlap between $\Omega_\epsilon$ and the active region of the hypotheses, which is quantified by the hypothesis-target shattering rate. The γ-shrinking rate, as defined in Definition 3, measures the decay of $P(x \in \Omega : h_t(x) = 1)$. However, the rapid decrease of $|\Omega_{\alpha_t}|$ as $\alpha_t$ approaches zero makes it impractical to sustain a series of hypotheses with a small γ through our training process. Thus, the γ-shrinking assumption is often not feasible for minimal γ. Moving beyond the pursuit of minimal relative error and γ-shrinking relative to $|\Omega_{\alpha_t}|$, we introduce Algorithm 4, which adaptively shrinks the active region of the sampling random vector $Y_t$ through a Projection sub-procedure.

Algorithm 4: Accelerated Sequential-Mode Classification-Based Optimization Algorithm
Input: Ω: boundary; T ∈ N+: budget; r = m + k; Replacing: replacing sub-procedure; γ: region-shrinking rate; ρ: region-shrinking frequency.
Output: (xbest, ybest).
  Collect: S = {(x_1, y_1), ..., (x_r, y_r)} i.i.d. from U_Ω;
  (xbest, ybest) = arg min{y : (x, y) ∈ S};
  Initialize: k = 1, Ω̃ = Ω;
  for t = r + 1, ..., T do
    Train: h_t ← T(Spositive, Snegative);
    s ← random(0, 1);
    if s ≤ ρ then
      Shrink region: Ω̃ = Ω̃ ∩ [xbest − (1/2)γ^k‖Ω‖, xbest + (1/2)γ^k‖Ω‖];
      k = k + 1;
    end if
    Project: Y_t ← Proj(h_t, Ω̃);
    Sample: (x_t, y_t) with Y_t;
    Replace: S ← Replacing((x_t, y_t), S);
    (xbest, ybest) = arg min{y : (x, y) ∈ S};
  end for
Return: (xbest, ybest)

The operator $\|\cdot\|$ returns a tuple representing the diameter of each dimension of the region; for instance, when $\Omega = [\omega_1^1, \omega_2^1] \times [\omega_1^2, \omega_2^2]$, we have $\|\Omega\| = (\omega_2^1 - \omega_1^1, \omega_2^2 - \omega_1^2)$. The projection operator $\mathrm{Proj}(h_t, \tilde{\Omega})$ generates a random vector $\tilde{X}_t$ with probability distribution $\mathcal{D}_{\tilde{h}_t} := \tilde{h}_t / P(\{x \in \tilde{\Omega} : h_t(x) = 1\})$, where $\tilde{h}_t(x) = 1$ whenever $h_t(x) = 1$ and $x \in \tilde{\Omega}$. The sampling random vector $Y_t$ is induced by $\tilde{X}_t$.
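The Shrink-region step of Algorithm 4 is a one-liner per dimension. A sketch under the same notation (γ^k scales the original per-dimension diameter ‖Ω‖; names are ours):

```python
import random

def maybe_shrink(region, diam0, x_best, gamma, k, rho, rng=None):
    """Sketch of the region-shrinking step of Algorithm 4: with
    probability rho, intersect the current region with a box of
    per-dimension half-width 0.5 * gamma**k * ||Omega|| centred at
    x_best, then advance the shrink counter k."""
    rng = rng or random.Random(1)
    if rng.random() > rho:
        return region, k           # no shrink this step
    new_region = []
    for (a, b), c, d in zip(region, x_best, diam0):
        half = 0.5 * (gamma ** k) * d
        new_region.append((max(a, c - half), min(b, c + half)))
    return new_region, k + 1
```

Because the box is centred at the incumbent xbest and intersected with the previous region, the shrink never discards the current best solution; only the volume available to the projected sampler decreases.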
The subsequent theorem presents the upper bound of the query complexity for Algorithm 4.

Theorem 3. Consider Algorithm 4 with region-shrinking rate $0 < \gamma < 1$ and region-shrinking frequency $0 < \rho < 1$. Let $\epsilon > 0$ and $0 < \delta < 1$. When $\Omega_\epsilon$ is η-shattered by $h_t$ for all $t = r + 1, \ldots, T$, with $p$ as in Theorem 2, the $(\epsilon, \delta)$-query complexity is upper bounded by

$$O\left(\max\left\{\left(\frac{2\lambda\eta|\Omega_\epsilon|}{p(\gamma^{\rho} + \gamma^{(T-r)\rho})} + (1-\lambda)|\Omega_\epsilon|\right)^{-1}\ln\frac{1}{\delta} + r,\; T\right\}\right).$$

The condition $\gamma \in (0, 1)$ ensures that the factor $(\gamma^{\rho} + \gamma^{(T-r)\rho})/2$ is significantly less than 1, so according to Theorem 3 the $(\epsilon, \delta)$-query complexity of Algorithm 4 is lower than that of Algorithm 3 provided $\eta > 0$. Theorem 3 establishes an upper bound on the $(\epsilon, \delta)$-query complexity that is applicable to a wide range of scenarios, assuming only that the objective function $f$ is lower-bounded. Building on this, we identify a sufficient condition for acceleration that applies to dimensionally locally Hölder continuous functions (Definition 5), detailed in the appendix. Within this context, the SRACOS algorithm, which utilizes RACOS for the Training phase in Algorithm 3, exhibits polynomial convergence (Hu, Qian, and Yu 2017). We adopt the same RACOS approach for the Training sub-procedure in Algorithm 4, yielding the RAndomized CoordinatE Classifying And Region Shrinking (RACE-CARS) algorithm.

Definition 5 (Dimensionally Local Hölder Continuity). Assume that $x^* = (x_*^1, \ldots, x_*^n)$ is the unique global minimum, i.e., $f(x^*) = f^*$. We call $f$ dimensionally locally Hölder continuous if, for all $i = 1, \ldots, n$,

$$L_1^i|x^i - x_*^i|^{\beta_1^i} \le |f(x_*^1, \ldots, x^i, \ldots, x_*^n) - f^*| \le L_2^i|x^i - x_*^i|^{\beta_2^i}$$

for all $x = (x^1, \ldots, x^n)$ in a neighborhood of $x^*$, where $\beta_1^i, \beta_2^i, L_1^i, L_2^i$ are positive constants for $i = 1, \ldots, n$.

Experiments

In this section, we design experiments to test RACE-CARS on synthetic functions and language-model tasks respectively. We use the same budget to compare RACE-CARS with a selection of DFO algorithms, including SRACOS (Hu, Qian, and Yu 2017), the zeroth-order adaptive momentum method (ZO-Adam) (Chen et al.
2019), differential evolution (DE) (Opara and Arabas 2019) and covariance matrix adaptation evolution strategies (CMA-ES) (Hansen 2016). All the baseline algorithms are fine-tuned, and the essential hyperparameters of RACE-CARS can be found in the appendix.

On Synthetic Functions

We commence our empirical experiments with four benchmark functions: Ackley, Levy, Rastrigin and Sphere. Their analytic forms and 2-dimensional illustrations are detailed in the appendix. Characterized by extreme non-convexity and numerous local minima and saddle points (with the exception of the Sphere function), each is minimized within the boundary $\Omega = [-10, 10]^n$, with a global minimum value of 0. We choose the dimension of the solution space $n$ to be 50 and 500, with corresponding function-evaluation budgets of 5000 and 50000. Notably, as the results indicate, the convergence of RACE-CARS requires only a fraction of this budget. The region-shrinking rate is configured as γ = 0.9 and 0.95, with shrinking frequency ρ = 0.01 and 0.001, for n = 50 and 500 respectively. Each algorithm is executed over 30 trials, and the mean convergence trajectories of the best-so-far values under 100n function evaluations are depicted in Figure 1 and Figure 2. The numerical values adjacent to the algorithm names in the legends represent the mean of the attained minima. It is evident that RACE-CARS performs best on both convergence speed and optimal value, with a slight edge to CMA-ES on the strongly convex Sphere function. Yet this comes at the cost of scalability, due to CMA-ES's reliance on an n-dimensional covariance matrix, which is significantly more computationally intensive compared to the other algorithms.
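For reference, two of the four benchmarks can be written down directly (these are the standard textbook forms; the exact variants used in the experiments are given in the paper's appendix):

```python
import math

def ackley(x):
    """Standard Ackley function: highly multimodal, global minimum 0
    at the origin."""
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(2.0 * math.pi * v) for v in x) / n
    return -20.0 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20.0 + math.e

def rastrigin(x):
    """Standard Rastrigin function: a regular grid of local minima,
    global minimum 0 at the origin."""
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)
```

Both are cheap to evaluate at any dimension, which is what makes them convenient probes for the n = 50 and n = 500 settings above.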
Figure 1: Comparison of synthetic functions with n = 50. Mean attained minima: (a) Ackley: RACE-CARS 0.031, SRACOS 1.705, ZO-Adam 4.629, DE 0.755, CMA-ES 0.51; (b) Levy: RACE-CARS 0.001, SRACOS 0.772, ZO-Adam 56.26, DE 3.599, CMA-ES 2.891; (c) Rastrigin: RACE-CARS 8.847, SRACOS 58.152, ZO-Adam 547.534, DE 311.519, CMA-ES 259.639; (d) Sphere: RACE-CARS 0.003, SRACOS 1.977, ZO-Adam 0.014, DE 2.54, CMA-ES 0.0.

Figure 2: Comparison of synthetic functions with n = 500. Mean attained minima: (a) Ackley: RACE-CARS 0.003, SRACOS 3.988, ZO-Adam 0.162, DE 2.93, CMA-ES 1.141; (b) Levy: RACE-CARS 2.873, SRACOS 47.556, ZO-Adam 538.24, DE 54.4, CMA-ES 153.155; (c) Rastrigin: RACE-CARS 589.776, SRACOS 1861.062, ZO-Adam 4357.937, DE 3173.669, CMA-ES 2121.274; (d) Sphere: RACE-CARS 0.108, SRACOS 209.387, ZO-Adam 0.177, DE 30.461, CMA-ES 0.0.

On Black-Box Tuning for LMaaS

Prompt tuning for extremely large pre-trained language models (PTMs) has shown great power. PTMs such as GPT-3 (Brown et al. 2020) are usually released as a service, due to commercial considerations and the potential risk of misuse, allowing users to design individual prompts to query the PTMs through black-box APIs. This scenario is called Language-Model-as-a-Service (LMaaS) (Diao et al. 2022; Sun et al. 2022). In this part we follow the experiments designed by (Sun et al. 2022), whose code is available at https://github.com/txsun1997/Black-Box-Tuning, where a language-understanding task is formulated as a classification task: for a batch of PTM-modified input texts $X$, predict the labels $Y$ in the PTM vocabulary. Namely, we need to tune the prompt such that the black-box PTM inference API $f$ takes a continuous prompt $p$ satisfying $Y = f(p; X)$. Moreover, to handle the high-dimensional prompt $p$, (Sun et al.
2022) proposed to randomly embed the $D$-dimensional prompt $p$ into a lower-dimensional space $\mathbb{R}^d$ via a random projection matrix $A \in \mathbb{R}^{D \times d}$. The objective therefore becomes

$$\min_{z \in \mathcal{Z}} \mathcal{L}\left(f(Az + p_0; X), Y\right),$$

where $\mathcal{Z} = [-50, 50]^d$ is the search space and $\mathcal{L}(\cdot, \cdot)$ is the cross-entropy loss. In our experimental setup, we configure the search-space dimension to d = 500 and the prompt length to 50, with RoBERTa (Liu et al. 2019a) serving as the backbone model. We evaluate performance on the datasets SST-2 (Socher et al. 2013), Yelp Polarity and AG's News (Zhang, Zhao, and LeCun 2015), and RTE (Wang et al. 2018a). With a fixed API-call budget of T = 8000, we pit RACE-CARS against SRACOS and the default DFO algorithm CMA-ES utilized in (Sun et al. 2022). Our choice to exclude ZO-Adam and DE is based on their suboptimal performance on high-dimensional nonconvex black-box functions, as demonstrated in the last section.

Figure 3: Comparisons on SST-2: (a) training loss, (b) training accuracy, (c) development loss and (d) development accuracy, each plotted against the number of API calls for RACE-CARS, SRACOS and CMA-ES.

For our tests, the shrinking rate is γ = 0.7, with shrinking frequency ρ = 0.002. Each algorithm is repeated 5 times independently with unique seeds. We assess the algorithms based on the mean and deviation of training loss, training accuracy, development loss and development accuracy. The SST-2 results are highlighted in Figure 3, with additional findings for Yelp Polarity, AG's News and RTE detailed in the appendix. The results indicate that RACE-CARS consistently accelerates the convergence of SRACOS. While CMA-ES shows superior performance on Yelp Polarity, AG's News and RTE, it also exhibits signs of overfitting.
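The random embedding can be sketched in a few lines (we use Gaussian entries for $A$ as an illustrative assumption; the actual distribution and normalization follow (Sun et al. 2022)):

```python
import random

def random_projection(D, d, rng=None):
    """Illustrative D x d random projection matrix A. A point z in R^d
    is mapped to a D-dimensional prompt offset Az, which is added to
    the initial prompt p0. Gaussian entries with scale 1/sqrt(d) are an
    assumption for illustration, not the paper's exact choice."""
    rng = rng or random.Random(0)
    scale = 1.0 / d ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(D)]

def embed(A, z, p0):
    """Compute A z + p0, the prompt actually sent to the black-box API."""
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) + p_i
            for row, p_i in zip(A, p0)]
```

The optimizer only ever sees the d-dimensional $z$; the fixed matrix $A$ and offset $p_0$ turn each candidate into a full D-dimensional prompt, which is what makes a 500-dimensional search tractable for prompts of much higher ambient dimension.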
RACE-CARS achieves comparable performance to CMA-ES, despite the latter's hyperparameters being finely tuned. Notably, the hyperparameters for RACE-CARS were empirically adjusted on the SST-2 dataset and then applied to the other three datasets without further tuning.

Discussion

Beyond Continuity

1. Discontinuous objective functions. Dimensionally local Hölder continuity (Definition 5) imposes certain constraints on the objective through a set of continuous envelopes, whereas the objective itself is not required to be continuous. Beyond the continuous cases discussed in the previous section, RACE-CARS thus remains applicable to discontinuous objectives as well; refer to the appendix for a comprehensive treatment.

2. Discrete optimization. Like the SRACOS algorithm, RACE-CARS retains the capability to tackle discrete optimization problems. The convergence theorems, presented in Theorem 2 and Theorem 3, encompass this situation by altering the measure of the probability space, for example to one induced by the counting measure. We extend the experiments to mixed-integer programming problems, substantiating the acceleration of RACE-CARS empirically; see the appendix for details.

On the Concept of Hypothesis-Target Shattering

The concept of hypothesis-target shattering, central to our discussion, draws inspiration from the established learning-theory notion of shattering and its deep ties to Vapnik-Chervonenkis (VC) theory (Vapnik et al. 1998). At the heart of VC theory lies the VC dimension, a measure of a hypothesis family's capacity to distinguish among data points based on their labels. Specifically, for a collection of data points with binary labels $S$, a subset $S' \subseteq S$ is said to be shattered by a hypothesis family $\mathcal{H}$ if there exists a hypothesis $h \in \mathcal{H}$ that perfectly aligns with the labels of points in $S'$ and contrasts with those outside:

$$h(x) = \begin{cases} 1, & x \in S', \\ 0, & x \in S \setminus S'. \end{cases}$$

The shattering coefficient $S(\mathcal{H}, n)$ signifies the variety of point-label combinations that $\mathcal{H}$ can produce for $n$ points.
The VC dimension is then defined as $VC(\mathcal{H}) := \sup\{n : S(\mathcal{H}, n) = 2^n\}$, indicating the maximum number of points that can be arbitrarily labeled by $\mathcal{H}$. In the context of classification-based DFO, we represent the target-representative capability of a family of hypotheses through hypothesis-target shattering, measuring the overlap between a hypothesis's active region and the target. The quintessence of algorithm design therefore hinges on maximizing this quantity. However, discerning the target-representative capacity within the intricate landscape of nonconvex black-box optimization problems is nontrivial. Nonetheless, the hypothesis family generated by RACOS, although analyzed under the previous framework, empirically exhibits sufficient efficacy in scenarios where the objective function is locally Hölder continuous. Looking ahead, altering the Training and Replacing sub-procedures inherited from SRACOS, which may ideally lead to a bigger shattering rate while maintaining the easy-to-sample characterization, is another extension direction of the current study.

Ablation Experiments

While it is an appealing goal to develop a universally effective DFO algorithm for black-box functions without the need for hyperparameter tuning, this remains an unrealistic aspiration. Our proposed algorithm RACE-CARS introduces two hyperparameters: the shrinking rate γ and the shrinking frequency ρ. For an n-dimensional optimization problem, we call $\gamma^{n\rho}$ the shrinking factor of RACE-CARS. We take Ackley as a case study and design ablation experiments on the two hyperparameters of RACE-CARS to reveal the mechanism. We stipulate that we do not aim to identify an optimal combination of hyperparameters maximizing the overlap with the target hypothesis; instead, our aim is to provide empirical guidance for tuning these hyperparameters effectively. For further details, the reader is directed to the appendix.
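The shrinking factor admits a direct reading: each shrink multiplies every dimension by γ (the volume by γ^n) and fires with probability ρ per evaluation, so γ^{nρ} is the per-evaluation geometric-mean volume multiplier. For the two synthetic-function settings above it works out to similar values:

```python
def shrinking_factor(gamma, rho, n):
    """Shrinking factor gamma**(n*rho) of RACE-CARS: the per-evaluation
    geometric-mean volume multiplier, for shrink rate gamma per
    dimension, shrink frequency rho, and problem dimension n."""
    return gamma ** (n * rho)

f_50 = shrinking_factor(0.9, 0.01, 50)      # 0.9**0.5, about 0.949
f_500 = shrinking_factor(0.95, 0.001, 500)  # 0.95**0.5, about 0.975
```

This is one way to read the ablation: pairs (γ, ρ) with the same shrinking factor decay the search volume at the same average rate per evaluation, even though they distribute the shrinks differently over time.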
Conclusion
In this paper, we refine the framework of classification-based DFO as a stochastic process and propose a novel learning concept named the hypothesis-target shattering rate. Our research delves into the convergence properties of sequential-mode classification-based DFO algorithms and provides a fresh perspective on their query complexity upper bound. Motivated by the computational complexity upper bound under the new framework, we propose a theoretically grounded region-shrinking technique to accelerate convergence. In the empirical analysis, we study the scalability of RACE-CARS on both synthetic functions and black-box tuning for LMaaS, showing its superiority over SRACOS.

References
Akay, B.; and Karaboga, D. 2012. Artificial bee colony algorithm for large-scale problems and engineering design optimization. Journal of Intelligent Manufacturing, 23: 1001-1014.
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623-2631.
Bäck, T.; and Schwefel, H.-P. 1993. An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1(1): 1-23.
Bickel, P. J.; and Levina, E. 2004. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6): 989-1010.
Bogunovic, I.; Scarlett, J.; Krause, A.; and Cevher, V. 2016. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. Advances in Neural Information Processing Systems, 29.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877-1901.
Céa, J. 1971. Optimisation: théorie et algorithmes.
Dunod.
Chen, X.; Liu, S.; Xu, K.; Li, X.; Lin, X.; Hong, M.; and Cox, D. 2019. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. Advances in Neural Information Processing Systems, 32.
Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to Derivative-Free Optimization. SIAM.
Diao, S.; Huang, Z.; Xu, R.; Li, X.; Lin, Y.; Zhou, X.; and Zhang, T. 2022. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531.
El-Yaniv, R.; and Wiener, Y. 2012. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(2).
Eriksson, D.; Pearce, M.; Gardner, J.; Turner, R. D.; and Poloczek, M. 2019. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32.
Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, 1437-1446. PMLR.
Fan, J.; and Fan, Y. 2008. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6): 2605.
Fortin, F.-A.; De Rainville, F.-M.; Gardner, M.-A. G.; Parizeau, M.; and Gagné, C. 2012. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1): 2171-2175.
Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
Ge, D.; Liu, T.; Liu, J.; Tan, J.; and Ye, Y. 2022. SOLNP+: A derivative-free solver for constrained nonlinear optimization. arXiv preprint arXiv:2210.07160.
Gotovos, A. 2013. Active learning for level set estimation. Master's thesis, Eidgenössische Technische Hochschule Zürich, Department of Computer Science.
Hall, P.; Marron, J. S.; and Neeman, A. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(3): 427-444.
Hansen, N. 2016.
The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772.
Hashimoto, T.; Yadlowsky, S.; and Duchi, J. 2018. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, 2027-2036. PMLR.
Hensman, J.; Fusi, N.; and Lawrence, N. D. 2013. Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
Hu, Y.-Q.; Qian, H.; and Yu, Y. 2017. Sequential classification-based optimization for direct policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Larson, J.; Menickelly, M.; and Wild, S. M. 2019. Derivative-free optimization methods. Acta Numerica, 28: 287-404.
Liao, T. W. 2010. Two hybrid differential evolution algorithms for engineering design optimization. Applied Soft Computing, 10(4): 1188-1199.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019a. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; Qian, C.; and Yu, Y. 2017. ZOOpt: Toolbox for derivative-free optimization. arXiv preprint arXiv:1801.00329.
Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; and Yu, Y. 2019b. Asynchronous classification-based optimization. In Proceedings of the First International Conference on Distributed Artificial Intelligence, 1-8.
Michalski, R. S. 2000. Learnable evolution model: Evolutionary processes guided by machine learning. Machine Learning, 38: 9-40.
Nayebi, A.; Munteanu, A.; and Poloczek, M. 2019. A framework for Bayesian optimization in embedded subspaces. In International Conference on Machine Learning, 4752-4761. PMLR.
Nelder, J. A.; and Mead, R. 1965. A simplex method for function minimization. The Computer Journal, 7(4): 308-313.
Nesterov, Y.; and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17: 527-566.
Opara, K. R.; and Arabas, J. 2019.
Differential Evolution: A survey of theoretical analyses. Swarm and Evolutionary Computation, 44: 546-558.
Ragonneau, T. M.; and Zhang, Z. 2023. PDFO: A cross-platform package for Powell's derivative-free optimization solvers. arXiv preprint arXiv:2302.13246.
Ray, T.; and Saini, P. 2001. Engineering design optimization using a swarm with an intelligent information sharing among individuals. Engineering Optimization, 33(6): 735-748.
Ros, R.; and Hansen, N. 2008. A simple modification in CMA-ES achieving linear time and space complexity. In International Conference on Parallel Problem Solving from Nature, 296-305. Springer.
Scheinberg, K. 2022. Finite difference gradient approximation: To randomize or not? INFORMS Journal on Computing, 34(5): 2384-2388.
Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148-175.
Shi, H.-J. M.; Xuan, M. Q.; Oztoprak, F.; and Nocedal, J. 2021. On the numerical performance of derivative-free optimization methods based on finite-difference approximations. arXiv preprint arXiv:2102.09762.
Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25.
Sobol', I. M. 1967. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 7(4): 784-802.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631-1642.
Sun, T.; Shao, Y.; Qian, H.; Huang, X.; and Qiu, X. 2022. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 20841-20855. PMLR.
Vapnik, V.; et al. 1998.
Statistical learning theory.
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Wang, L.; Fonseca, R.; and Tian, Y. 2020. Learning search space partition for black-box optimization using Monte Carlo tree search. Advances in Neural Information Processing Systems, 33: 19511-19522.
Wang, Z.; Gehring, C.; Kohli, P.; and Jegelka, S. 2018b. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, 745-754. PMLR.
Wang, Z.; Hutter, F.; Zoghi, M.; Matheson, D.; and De Freitas, N. 2016. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55: 361-387.
Yang, L.; and Shami, A. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415: 295-316.
Yi, Z.; Wei, Y.; Cheng, C. X.; He, K.; and Sui, Y. 2024. Improving sample efficiency of high dimensional Bayesian optimization with MCMC. arXiv preprint arXiv:2401.02650.
Yu, W.-C. 1979. Positive basis and a class of direct search techniques. Scientia Sinica, Special Issue of Mathematics, 1(26): 53-68.
Yu, Y.; and Qian, H. 2014. The sampling-and-learning framework: A statistical view of evolutionary algorithms. In 2014 IEEE Congress on Evolutionary Computation (CEC), 149-158. IEEE.
Yu, Y.; Qian, H.; and Hu, Y.-Q. 2016. Derivative-free optimization via classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Yue, P.; Yang, L.; Fang, C.; and Lin, Z. 2023. Zeroth-order optimization with weak dimension dependency. In The Thirty Sixth Annual Conference on Learning Theory, 4429-4472. PMLR.
Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.