# Scalable Acceleration for Classification-Based Derivative-Free Optimization

Tianyi Han*, Jingya Li, Zhipeng Guo, Yuan Jin*
Beijing Supreium Technology, Haidian District, Beijing, China
{ty.han, jy.li, zp.guo, y.jin}@supreium.com

Abstract

Derivative-free optimization algorithms play an important role in scientific and engineering design optimization problems, especially when derivative information is not accessible. In this paper, we study the framework of sequential classification-based derivative-free optimization algorithms. By introducing the learning-theoretic concept of the hypothesis-target shattering rate, we revisit the computational complexity upper bound of SRACOS. Inspired by the revisited upper bound, we propose an algorithm named RACE-CARS, which adds a random region-shrinking step to SRACOS. We further establish theorems showing the acceleration brought by region shrinking. Experiments on synthetic functions as well as black-box tuning for language-model-as-a-service demonstrate empirically the efficiency of RACE-CARS. An ablation experiment on the introduced hyper-parameters is also conducted, revealing the mechanism of RACE-CARS and putting forward empirical hyper-parameter tuning guidance.

Introduction

In recent years, there has been a growing interest in the field of derivative-free optimization (DFO) algorithms, also known as zeroth-order optimization. These algorithms aim to optimize objective functions without relying on explicit gradient information, making them suitable for scenarios where obtaining derivatives is either infeasible or computationally expensive (Conn, Scheinberg, and Vicente 2009; Larson, Menickelly, and Wild 2019). For example, DFO techniques can be applied to hyperparameter tuning, which involves optimizing complex objective functions with unavailable first-order information (Falkner, Klein, and Hutter 2018; Akiba et al.
2019; Yang and Shami 2020). Moreover, DFO algorithms find applications in engineering design optimization, where the objective functions are computationally expensive to evaluate and lack explicit derivatives (Ray and Saini 2001; Liao 2010; Akay and Karaboga 2012). Classical DFO methods such as the Nelder-Mead method (Nelder and Mead 1965) or directional direct-search (DDS) methods (Céa 1971; Yu 1979) have shown great success on convex problems. However, their performance is compromised when the objective is nonconvex, which is the commonly faced situation for black-box problems.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this paper, we focus on optimization problems of the form

$$\min_{x \in \Omega} f(x), \quad (1)$$

where the $n$-dimensional compact cube $\Omega \subseteq \mathbb{R}^n$ is the solution space. We do not stipulate any convexity, smoothness or separability assumption on $f$. In recent decades, progress has been made in the extensive exploration of DFO algorithms for nonconvex problems, and various kinds of algorithms have been established: for example, evolutionary algorithms (Bäck and Schwefel 1993; Fortin et al. 2012; Hansen 2016; Opara and Arabas 2019), Bayesian optimization (BO) methods (Snoek, Larochelle, and Adams 2012; Shahriari et al. 2015; Frazier 2018) and gradient-approximation surrogate modeling algorithms (Nesterov and Spokoiny 2017; Chen et al. 2019; Ge et al. 2022; Ragonneau and Zhang 2023). Given the nonconvex and black-box nature of the problem, the quest for enhanced sample efficiency inevitably presents the classic trade-off between exploration and exploitation. Like typical nonlinear/nonconvex numerical methods, the aforementioned algorithms are also susceptible to the curse of dimensionality (Bickel and Levina 2004; Hall, Marron, and Neeman 2005; Fan and Fan 2008; Shi et al. 2021; Scheinberg 2022; Yue et al. 2023).
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Improving scalability and efficiency can be distilled into a few key areas, such as function decomposition (Wang et al. 2018b), dimension reduction (Wang et al. 2016; Nayebi, Munteanu, and Poloczek 2019) and sampling-strategy refinement. Regarding the latter, previous algorithms have focused on preventing over-exploitation through, for instance, more efficient sampling dynamics (Ros and Hansen 2008; Hensman, Fusi, and Lawrence 2013; Yi et al. 2024), search-space discretization (Sobol 1967) or search-region restriction (Eriksson et al. 2019; Wang, Fonseca, and Tian 2020). Regardless of the improvements, many of these model-based DFO algorithms share the mechanism of, locally or globally, fitting the objective function. The motivation of classification-based DFO algorithms is different: they learn a hypothesis (classifier) to fit the sub-level set $\Omega_{\alpha_t} := \{x \in \Omega : f(x) - f^* \le \alpha_t\}$ at each iteration $t = 1, \ldots, T$, with which new candidates are generated for the successive round (Michalski 2000; Gotovos 2013; Bogunovic et al. 2016; Yu, Qian, and Hu 2016; Hashimoto, Yadlowsky, and Duchi 2018). Sub-level sets are considered to contain less information than objective functions, since they are already determined whenever the objective function is known. By this means, sample usage is more efficient. Moreover, classification-based methods are less sensitive to function oscillation or noise (Liu et al. 2017) and easy to parallelize (Liu et al. 2017, 2019b; Hashimoto, Yadlowsky, and Duchi 2018). Training strategies for hypotheses have branched out in various directions. Hashimoto, Yadlowsky, and Duchi pursue accurate hypotheses using a conservative strategy raised by El-Yaniv and Wiener. They demonstrate that for a hypothesis class $\mathcal{H}$ with VC-dimension $VC(\mathcal{H})$, an $\epsilon$-accurate hypothesis requires $O(\epsilon^{-1}(VC(\mathcal{H}) \log(\epsilon^{-1})))$ samples per batch.
While a computationally efficient approximation has indeed been developed, its success was demonstrated only in the context of low-dimensional problems; the extent of its effectiveness in high-dimensional scenarios is yet to be determined. A contrasting approach, however, successfully led to improvements in sample efficiency. Yu, Qian, and Hu propose to RAndomly COordinate-wise Shrink the solution region (RACOS), a more radical approach that diverges from the goal of producing accurate hypotheses. They prove that RACOS converges to the global minimum in polynomial time when the objective is locally Hölder continuous, and experiments indicate success on high-dimensional problems ranging from hundreds to thousands of dimensions. An added benefit is the ease of sampling through the generated hypotheses, as their active regions are coordinate-orthogonal boxes. Building on this, the sequential counterpart of RACOS, known as SRACOS, takes an even more radical stance by sampling only once per iteration and learning from biased data. Despite this, SRACOS has been shown to outperform RACOS under certain mild restrictions (Hu, Qian, and Yu 2017).

Outline and Contributions

1. Our research extends the groundwork laid by (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), yet we discover that the upper bound they proposed for SRACOS does not fully encapsulate its behavior, potentially allowing for an overflow. We further construct a counter-example (eq. (3)) illustrating that the upper bound is not tight enough to delineate the convergence accurately.

2. We identify a contentious assumption regarding the concept of error-target dependence that seems to underpin these limitations. We introduce a novel learning-theoretic concept termed the hypothesis-target η-shattering rate, which serves as a foundation for our reassessment of the upper bound on the computational complexity of SRACOS.

3.
Inspired by the reanalysis, we recognize that the accuracy of the trained hypotheses is even less critical than previously thought. With this insight, we formulate a novel algorithm, RACE-CARS (an acronym for RAndomized CoordinatE Classifying And Region Shrinking), which perpetuates the essence of SRACOS while promising a theoretical enhancement in convergence. We design experiments on both synthetic functions and black-box tuning for LMaaS. These experiments juxtapose RACE-CARS against a selection of DFO algorithms, empirically highlighting its superior efficacy.

4. In the discussion and appendix, we further substantiate the empirical prowess of RACE-CARS beyond the confines of continuity, supported by a theoretical framework. An ablation study is also conducted to demystify the selection of hyperparameters for our proposed algorithm.

The rest of the paper is organized in five sections, sequentially presenting the background, theoretical study, experiments, discussion and conclusion.

Assumption 1. In eq. (1), we presume $f(x)$ is lower bounded within $\Omega$, with $f^* := \min_{x \in \Omega} f(x)$.

In our theoretical analysis, we reframe the procedure as a stochastic process. Let $\mathcal{F}$ denote the Borel σ-algebra on $\Omega$, and $P$ the probability measure defined on $\mathcal{F}$. For instance, if $\Omega$ is a continuous space, $P$ is derived from the Lebesgue measure $m$, such that $P(B) := m(B)/m(\Omega)$ for all $B \in \mathcal{F}$.

Assumption 2. Define $\Omega_\epsilon := \{x \in \Omega : f(x) - f^* \le \epsilon\}$, with the assumption $|\Omega_\epsilon| := P(\Omega_\epsilon) > 0$ for all $\epsilon > 0$.

A hypothesis (or classifier) $h$ is a function mapping the solution space $\Omega$ to $\{0, 1\}$. Define

$$\mathcal{D}_h(x) := \begin{cases} P(\{x \in \Omega : h(x) = 1\})^{-1}, & h(x) = 1, \\ 0, & \text{otherwise}, \end{cases} \quad (2)$$

a probability distribution on $\Omega$. Let $X_h$ be the random vector in $(\Omega, \mathcal{F}, P)$ drawn from $\mathcal{D}_h$, implying that $\Pr(X_h \in B) = \int_B \mathcal{D}_h(x)\,dP$ for all $B \in \mathcal{F}$. Denote $\mathcal{T} := \{1, 2, \ldots, T\}$, and let $\mathbb{F} := (\mathcal{F}_t)_{t \in \mathcal{T}}$ be a filtration of σ-algebras on $\Omega$ indexed by $\mathcal{T}$, satisfying $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_T \subseteq \mathcal{F}$.
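The distribution $\mathcal{D}_h$ in eq. (2) is simply the uniform distribution restricted to the active region $\{x : h(x) = 1\}$. A minimal rejection-sampling sketch of drawing $X_h \sim \mathcal{D}_h$ (function and variable names here are ours, for illustration only):

```python
import random

def sample_from_hypothesis(h, omega, rng=None):
    """Draw one realization of X_h ~ D_h (eq. (2)) by rejection:
    sample uniformly from the box Omega until the point lands in the
    active region {x : h(x) = 1}."""
    rng = rng or random.Random(0)
    while True:
        x = tuple(rng.uniform(a, b) for a, b in omega)
        if h(x) == 1:
            return x
```

Rejection sampling is only a didactic stand-in; for RACOS-style hypotheses the active region is an axis-aligned box, so one can sample it directly without rejection.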
A typical classification-based optimization algorithm learns an $\mathbb{F}$-adapted stochastic process $X := (X_t)_{t \in \mathcal{T}}$, where each $X_t$ is induced by $X_{h_t}$ and $h_t$ is the hypothesis updated at step $t$. Subsequently, it samples new data with a stochastic process $Y := (Y_t)_{t \in \mathcal{T}}$ generated by $X$. Typically, the new candidates at step $t \ge r + 1$ are sampled from

$$Y_t := \begin{cases} X_t, & \text{with probability } \lambda, \\ X_\Omega, & \text{with probability } 1 - \lambda, \end{cases}$$

where $X_\Omega$ is a random vector drawn from the uniform distribution $U_\Omega$ and $\lambda \in [0, 1]$ is the exploitation rate. A simplified batch-mode classification-based optimization algorithm is outlined in Algorithm 1. At each step $t$, it selects a positive set Spositive from $S$ containing the best $m$ samples, with the remainder placed in the negative set Snegative. It then trains a hypothesis $h_t$ distinguishing the positive set from the negative set, ensuring $h_t(x_j) = 0$ for all $(x_j, y_j) \in$ Snegative. Finally, it samples $r$ new solutions with the sampling random vector $Y_t$. The sub-procedure $h_t \leftarrow \mathcal{T}($Spositive, Snegative$)$ denotes hypothesis training.

Algorithm 1: Batch-Mode Classification-Based Optimization Algorithm
Input: T: budget; r: training size; m: positive size.
Output: (xbest, ybest).
  Collect: S = {(x_1^0, y_1^0), ..., (x_r^0, y_r^0)} i.i.d. from U_Ω;
  (xbest, ybest) = arg min{y : (x, y) ∈ S};
  for t = 1, ..., T/r do
    Classify: (Spositive, Snegative) ← S;
    Train: h_t ← T(Spositive, Snegative);
    Sample: {(x_1^t, y_1^t), ..., (x_r^t, y_r^t)} i.i.d. with Y_t;
    Select: S ← S ∪ {(x_1^t, y_1^t), ..., (x_r^t, y_r^t)};
    (xbest, ybest) = arg min{y : (x, y) ∈ S};
  end for
Return: (xbest, ybest)

RACOS is the abbreviation of RAndomized COordinate Shrinking; literally, it trains the hypothesis by the following means.

Algorithm 2: RACOS
Input: Ω: boundary; (Spositive, Snegative): binary sets; I = {1, ..., n}: index of dimensions.
Output: h: hypothesis.
  Randomly select: x+ = (x+^1, ..., x+^n) ∈ Spositive;
  Initialize: h(x) ≡ 1;
  while ∃ x ∈ Snegative s.t. h(x) = 1 do
    Randomly select: k ∈ I;
    Randomly select: x− = (x−^1, . . .
, x−^n) ∈ Snegative;
    if x+^k < x−^k then
      s ← random(x+^k, x−^k);
      Shrink: h(x) = 0 for all x = (x^1, ..., x^n) ∈ Ω with x^k > s;
    else
      s ← random(x−^k, x+^k);
      Shrink: h(x) = 0 for all x = (x^1, ..., x^n) ∈ Ω with x^k < s;
    end if
  end while
Return: h

This is the training sub-procedure of RACOS (Yu, Qian, and Hu 2016): coordinates are shrunk randomly until all negative samples are excluded from the active region of the resulting hypothesis. Algorithm 2 shows a continuous version of RACOS. Aside from sampling only once per iteration, another difference between the sequential and batch modes of classification-based DFO is that the sequential version replaces the training set with a new one under certain rules to finish step $t$ (see Algorithm 3). In the rest of this paper, we omit the details of the Replacing sub-procedure, which can be found in (Hu, Qian, and Yu 2017).

Algorithm 3: Sequential-Mode Classification-Based Optimization Algorithm
Input: T: budget; r: training size; m: positive size; Replacing: replacing sub-procedure.
Output: (xbest, ybest).
  Collect: S = {(x_1, y_1), ..., (x_r, y_r)} i.i.d. from U_Ω;
  (xbest, ybest) = arg min{y : (x, y) ∈ S};
  for t = r + 1, ..., T do
    Classify: (Spositive, Snegative) ← S;
    Train: h_t ← T(Spositive, Snegative);
    Sample: (x_t, y_t) with Y_t;
    Replace: S ← Replacing((x_t, y_t), S);
    (xbest, ybest) = arg min{y : (x, y) ∈ S};
  end for
Return: (xbest, ybest)

Classification-based DFO algorithms admit a bound on the query complexity (Yu and Qian 2014), quantifying the total number of function evaluations required to identify a solution that achieves an approximation level of $\epsilon$ with probability at least $1 - \delta$.

Definition 1 ((ϵ, δ)-Query Complexity). Given $f$, $0 < \delta < 1$ and $\epsilon > 0$, the $(\epsilon, \delta)$-query complexity of an algorithm $\mathcal{A}$ is the number of calls to $f$ such that, with probability at least $1 - \delta$, $\mathcal{A}$ finds at least one solution $\tilde{x} \in \Omega$ satisfying $f(\tilde{x}) - f^* \le \epsilon$.

Definitions 2 and 3 are given by (Yu, Qian, and Hu 2016).
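For concreteness, the coordinate-shrinking loop of Algorithm 2 can be sketched as follows, representing the hypothesis's active region as an axis-aligned box (a didactic sketch; identifier names are ours, and negatives coinciding with the positive sample are assumed away, as in the continuous setting):

```python
import random

def racos_train(omega, s_positive, s_negative, rng=None):
    """Sketch of RACOS hypothesis training (Algorithm 2): shrink an
    axis-aligned box, one random coordinate at a time, until every
    negative sample falls outside the active region."""
    rng = rng or random.Random(0)
    lo = [a for a, b in omega]
    hi = [b for a, b in omega]
    x_pos = rng.choice(s_positive)

    def active(x):  # h(x) = 1 iff x lies inside the current box
        return all(l <= xi <= u for xi, l, u in zip(x, lo, hi))

    violating = [x for x in s_negative if active(x)]
    while violating:
        k = rng.randrange(len(lo))        # random coordinate
        x_neg = rng.choice(violating)     # random violating negative
        if x_pos[k] < x_neg[k]:
            hi[k] = rng.uniform(x_pos[k], x_neg[k])  # cut above s
        elif x_pos[k] > x_neg[k]:
            lo[k] = rng.uniform(x_neg[k], x_pos[k])  # cut below s
        violating = [x for x in s_negative if active(x)]
    return lo, hi  # active region of the trained hypothesis
```

Note that each cut lands strictly between the positive and the chosen negative coordinate, so the positive sample always stays inside the box, which is why sampling from the trained hypothesis stays easy.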
The first characterizes the so-called dependence between classification error and the target region, which is expected to be small to ensure efficiency. The second characterizes the portion of the active region of the hypothesis, and is also expected to be as small as possible.

Definition 2 (Error-Target θ-Dependence). The error-target dependence $\theta \ge 0$ of a classification-based optimization algorithm is its infimum such that, for any $\epsilon > 0$ and any $t = 1, \ldots, T$,

$$|\Omega_\epsilon| \cdot P(R_t) - P(\Omega_\epsilon \cap R_t) \le \theta |\Omega_\epsilon|,$$

where $R_t := \Omega_{\alpha_t} \,\triangle\, \{x \in \Omega : h_t(x) = 1\}$ denotes the relative error, $\Omega_{\alpha_t}$ is the sub-level set at step $t$ with $\alpha_t := \min_{1 \le i \le t} f(x_i) - f^*$, and the operator $\triangle$ is the symmetric difference of two sets, defined as $A_1 \,\triangle\, A_2 = (A_1 \cup A_2) \setminus (A_1 \cap A_2)$.

Definition 3 (γ-Shrinking Rate). The shrinking rate $\gamma > 0$ of a classification-based optimization algorithm is its infimum such that $P(x \in \Omega : h_t(x) = 1) \le \gamma |\Omega_{\alpha_t}|$ for all $t = 1, \ldots, T$.

Theoretical Study

Previous studies gave a general bound on the query complexity of Algorithm 3 under mild error-target θ-dependence and γ-shrinking rate assumptions:

Theorem 1 (Hu, Qian, and Yu 2017). Given $0 < \delta < 1$ and $\epsilon > 0$, if a sequential classification-based optimization algorithm has error-target θ-dependence and γ-shrinking rate, then its $(\epsilon, \delta)$-query complexity is upper bounded by

$$O\left(\max\left\{\frac{1}{|\Omega_\epsilon|}\left(\lambda + \frac{1-\lambda}{\gamma(T-r)}\sum_{t=r+1}^{T}\Phi_t\right)^{-1}\ln\frac{1}{\delta},\; T\right\}\right),$$

where $\Phi_t = \left(1 - \theta - P(R_{\mathcal{D}_t}) - m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathcal{D}_t \,\|\, U_\Omega)}\right)|\Omega_{\alpha_t}|^{-1}$, with the notations $\mathcal{D}_t := \lambda \mathcal{D}_{h_t} + (1-\lambda)U_\Omega$ and $P(R_{\mathcal{D}_t}) := \int_{R_t} \mathcal{D}_t(x)\,dP$.

Issues Introduced by Error-Target Dependence

1. Overflow of the Upper Bound. Under the assumptions entailed in (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), it can be observed that lower values of θ or γ correlate with improved query complexity. However, this is not a hard-and-fast rule: even with small values of these parameters, we can encounter scenarios where the expected performance does not materialize.
Following the lemma given by (Yu, Qian, and Hu 2016),

$$P(R_t) \le P(R_{\mathcal{D}_t}) + m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathcal{D}_t \,\|\, U_\Omega)},$$

we have the following inequality:

$$\Phi_t = \left(1 - \theta - P(R_{\mathcal{D}_t}) - m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\mathcal{D}_t \,\|\, U_\Omega)}\right)|\Omega_{\alpha_t}|^{-1} \le (1 - \theta - P(R_t))\,|\Omega_{\alpha_t}|^{-1}.$$

The concept of error-target θ-dependence reveals that a small θ does not guarantee a small relative error $P(R_t)$. On the contrary, a small θ coupled with a large $P(\Omega_\epsilon \cap R_t)/|\Omega_\epsilon|$ can result in a significant $P(R_t)$, which can even be 1 as long as $\Omega_{\alpha_t}$ lies totally outside the active region of $h_t$, i.e., when the hypothesis $h_t$ is completely wrong. Problematically, as a divisor in the proof of Theorem 1, $\Phi_t \le (1 - \theta - P(R_t))|\Omega_{\alpha_t}|^{-1}$ is then less than or equal to 0, disrupting the established inequality. Moreover, for the inequality to be disrupted, $P(R_t)$ need not be 1, since θ is inherently nonnegative; a series of inaccurate hypotheses suffices to undermine the validity of the upper bound. This finding challenges the principle of SRACOS's radical training strategy.

2. Tightness of the Upper Bound. Consider an extreme but plausible situation where the hypotheses generated at each step are defined as follows:

$$h_t(x) = \begin{cases} 1, & x \in \Omega_\epsilon, \\ 0, & x \notin \Omega_\epsilon. \end{cases} \quad (3)$$

In the context of sequential-mode classification-based optimization algorithms, where the training sets are not only small but also potentially biased, it is reasonable to expect large relative errors $P(R_t)$. This scenario could lead to a series of hypotheses that are inaccurate with respect to $\Omega_{\alpha_t}$ but, by chance, accurate with respect to $\Omega_\epsilon$. Consequently, the error-target dependence $\theta = \max_{1 \le t \le T} P(R_t)$ can be unexpectedly large. Even with all $\Phi_t$ values positive, the query complexity bound given in Theorem 1 is therefore not optimistic. However, the probability of failing to identify an ϵ-minimum,

$$\Pr\left(\min_{1 \le t \le T} f(x_t) - f^* > \epsilon\right) = (1 - |\Omega_\epsilon|)^r \left((1-\lambda)(1-|\Omega_\epsilon|)\right)^{T-r},$$

is less than δ for a reasonably sized $T$. This indicates that the upper bound may not be as tight as initially thought.
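The failure probability above collapses quickly even for tiny target volumes. A quick numeric check (the budget and volume values here are illustrative, not taken from the paper):

```python
def failure_prob(vol_eps, lam, T, r):
    """Probability of missing Omega_eps under the hypotheses of eq. (3):
    the r uniform warm-up draws all miss, and each later step fails only
    when the exploration branch (probability 1 - lam) misses too."""
    return (1 - vol_eps) ** r * ((1 - lam) * (1 - vol_eps)) ** (T - r)

p_fail = failure_prob(vol_eps=1e-3, lam=0.5, T=50, r=10)  # about 9e-13
```

Even with a target region occupying only 0.1% of the volume, 50 evaluations already drive the failure probability far below any reasonable δ, which is what makes the Theorem 1 bound look pessimistic here.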
Revisiting the Query Complexity Upper Bound

It is evident that even minimal error-target dependence cannot encapsulate the issues arising from substantial relative error, because error-target dependence alone is insufficient to account for relative error. Intuitively, one might consider introducing an additional assumption to cap the relative error. However, such an assumption would be impractical, given the inherently small and biased nature of the training datasets in the process. To this end, we give a new concept that stands apart from the constraints of relative error:

Definition 4 (Hypothesis-Target η-Shattering Rate). Given $\eta \in [0, 1]$ and a family of hypotheses $\mathcal{H}$ defined on $\Omega$, we say $\Omega_\epsilon$ is η-shattered by $h \in \mathcal{H}$ if

$$P(\Omega_\epsilon \cap \{x \in \Omega : h(x) = 1\}) \ge \eta|\Omega_\epsilon|,$$

and η is called the hypothesis-target shattering rate.

The hypothesis-target shattering rate mirrors the error-target dependence in its relation to the hypothesis's target-accuracy. Importantly, it bounds the error-target dependence in conjunction with the relative error:

$$\theta \le \max\{P(R_t),\; |1 - P(R_t) - \eta|\}.$$

This rate η measures the overlap between the target set $\Omega_\epsilon$ and the active region of a hypothesis. Crucially, it also mitigates the influence of relative error on error-target dependence. Utilizing the hypothesis-target shattering rate, we re-examine the upper bound of the $(\epsilon, \delta)$-query complexity in the subsequent theorem.

Theorem 2. For the sequential-mode classification-based DFO Algorithm 3, let $X_t = X_{h_t}$, $\epsilon > 0$ and $0 < \delta < 1$. When $\Omega_\epsilon$ is η-shattered by $h_t$ for all $t = r + 1, \ldots, T$ and $\max_{t=r+1,\ldots,T} P(\{x \in \Omega : h_t(x) = 1\}) \le p \le 1$, the $(\epsilon, \delta)$-query complexity is upper bounded by

$$O\left(\max\left\{\left(\frac{\lambda\eta|\Omega_\epsilon|}{p} + (1-\lambda)|\Omega_\epsilon|\right)^{-1}\ln\frac{1}{\delta} + r,\; T\right\}\right).$$

The Region-Shrinking Acceleration

In the analysis of the $(\epsilon, \delta)$-query complexity for classification-based optimization, the focus shifts away from minimizing the relative error $P(R_t)$: our goal is to identify optima, not to develop a sequence of accurate hypotheses.
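The shattering rate of Definition 4 can be estimated empirically. A Monte-Carlo sketch (function names are ours) that estimates $P(\Omega_\epsilon \cap \{h = 1\})/|\Omega_\epsilon|$ from uniform samples:

```python
import random

def shattering_rate(h, in_target, omega, n=20000, rng=None):
    """Monte-Carlo estimate of the hypothesis-target shattering rate
    (Definition 4): the fraction of the target region Omega_eps that is
    covered by the active region {x : h(x) = 1}."""
    rng = rng or random.Random(0)
    hits = cover = 0
    for _ in range(n):
        x = tuple(rng.uniform(a, b) for a, b in omega)
        if in_target(x):
            hits += 1
            cover += h(x)
    return cover / hits if hits else 0.0
```

For instance, on the unit square, a hypothesis active on $x^1 < 0.25$ shatters the target $x^1 < 0.5$ at rate roughly 0.5: half of the target's volume is covered.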
The counterexample in equation (3), despite a potentially high relative error, represents an optimal hypothesis scenario. This realization allows for the consideration of more radical hypotheses, directing our attention to the overlap between $\Omega_\epsilon$ and the active region of the hypotheses, which is quantified by the hypothesis-target shattering rate. The γ-shrinking rate, as defined in Definition 3, measures the decay of $P(x \in \Omega : h_t(x) = 1)$. However, the rapid decrease of $|\Omega_{\alpha_t}|$ as $\alpha_t$ approaches zero makes it impractical to sustain a series of hypotheses with a small γ through our training process. Thus, the γ-shrinking assumption is often not feasible for minimal γ. Moving beyond the pursuit of minimal relative error and γ-shrinking relative to $|\Omega_{\alpha_t}|$, we introduce Algorithm 4, which adaptively shrinks the active region of the sampling random vector $Y_t$ through a Projection sub-procedure.

Algorithm 4: Accelerated Sequential-Mode Classification-Based Optimization Algorithm
Input: Ω: boundary; T ∈ N+: budget; r = m + k; Replacing: replacing sub-procedure; γ: region-shrinking rate; ρ: region-shrinking frequency.
Output: (xbest, ybest).
  Collect: S = {(x_1, y_1), ..., (x_r, y_r)} i.i.d. from U_Ω;
  (xbest, ybest) = arg min{y : (x, y) ∈ S};
  Initialize: k = 1, Ω̃ = Ω;
  for t = r + 1, ..., T do
    Train: h_t ← T(Spositive, Snegative);
    s ← random(0, 1);
    if s ≤ ρ then
      Shrink region: Ω̃ = Ω̃ ∩ [xbest − (1/2)γ^k‖Ω‖, xbest + (1/2)γ^k‖Ω‖];
      k = k + 1;
    end if
    Project: Y_t ← Proj(h_t, Ω̃);
    Sample: (x_t, y_t) with Y_t;
    Replace: S ← Replacing((x_t, y_t), S);
    (xbest, ybest) = arg min{y : (x, y) ∈ S};
  end for
Return: (xbest, ybest)

The operator $\|\cdot\|$ returns a tuple representing the diameter of each dimension of the region; for instance, when $\Omega = [\omega_1^1, \omega_2^1] \times [\omega_1^2, \omega_2^2]$, we have $\|\Omega\| = (\omega_2^1 - \omega_1^1, \omega_2^2 - \omega_1^2)$. The projection operator $\mathrm{Proj}(h_t, \tilde{\Omega})$ generates a random vector $\tilde{X}_t$ with probability distribution $\mathcal{D}_{\tilde{h}_t} := \tilde{h}_t / P(\{x \in \tilde{\Omega} : h_t(x) = 1\})$, where $\tilde{h}_t(x) = 1$ whenever $h_t(x) = 1$ and $x \in \tilde{\Omega}$. The sampling random vector $Y_t$ is induced by $\tilde{X}_t$.
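The Shrink-region step of Algorithm 4 is a one-liner per dimension. A sketch under the same notation (γ^k scales the original per-dimension diameter ‖Ω‖; names are ours):

```python
import random

def maybe_shrink(region, diam0, x_best, gamma, k, rho, rng=None):
    """Sketch of the region-shrinking step of Algorithm 4: with
    probability rho, intersect the current region with a box of
    per-dimension half-width 0.5 * gamma**k * ||Omega|| centred at
    x_best, then advance the shrink counter k."""
    rng = rng or random.Random(1)
    if rng.random() > rho:
        return region, k           # no shrink this step
    new_region = []
    for (a, b), c, d in zip(region, x_best, diam0):
        half = 0.5 * (gamma ** k) * d
        new_region.append((max(a, c - half), min(b, c + half)))
    return new_region, k + 1
```

Because the box is centred at the incumbent xbest and intersected with the previous region, the shrink never discards the current best solution; only the volume available to the projected sampler decreases.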
The subsequent theorem presents the upper bound of the query complexity for Algorithm 4.

Theorem 3. Consider Algorithm 4 with region-shrinking rate $0 < \gamma < 1$ and region-shrinking frequency $0 < \rho < 1$. Let $\epsilon > 0$ and $0 < \delta < 1$. When $\Omega_\epsilon$ is η-shattered by $h_t$ for all $t = r + 1, \ldots, T$, with $p$ as in Theorem 2, the $(\epsilon, \delta)$-query complexity is upper bounded by

$$O\left(\max\left\{\left(\frac{2\lambda\eta|\Omega_\epsilon|}{p(\gamma^{\rho} + \gamma^{(T-r)\rho})} + (1-\lambda)|\Omega_\epsilon|\right)^{-1}\ln\frac{1}{\delta} + r,\; T\right\}\right).$$

The condition $\gamma \in (0, 1)$ ensures that the factor $(\gamma^{\rho} + \gamma^{(T-r)\rho})/2$ is significantly less than 1, so according to Theorem 3 the $(\epsilon, \delta)$-query complexity of Algorithm 4 is lower than that of Algorithm 3 provided $\eta > 0$. Theorem 3 establishes an upper bound on the $(\epsilon, \delta)$-query complexity that is applicable to a wide range of scenarios, assuming only that the objective function $f$ is lower-bounded. Building on this, we identify a sufficient condition for acceleration that applies to dimensionally locally Hölder continuous functions (Definition 5), detailed in the appendix. Within this context, the SRACOS algorithm, which utilizes RACOS for the Training phase in Algorithm 3, exhibits polynomial convergence (Hu, Qian, and Yu 2017). We adopt the same RACOS approach for the Training sub-procedure in Algorithm 4, yielding the RAndomized CoordinatE Classifying And Region Shrinking (RACE-CARS) algorithm.

Definition 5 (Dimensionally Local Hölder Continuity). Assume that $x^* = (x_*^1, \ldots, x_*^n)$ is the unique global minimum, i.e., $f(x^*) = f^*$. We call $f$ dimensionally locally Hölder continuous if, for all $i = 1, \ldots, n$,

$$L_1^i|x^i - x_*^i|^{\beta_1^i} \le |f(x_*^1, \ldots, x^i, \ldots, x_*^n) - f^*| \le L_2^i|x^i - x_*^i|^{\beta_2^i}$$

for all $x = (x^1, \ldots, x^n)$ in a neighborhood of $x^*$, where $\beta_1^i, \beta_2^i, L_1^i, L_2^i$ are positive constants for $i = 1, \ldots, n$.

Experiments

In this section, we design experiments to test RACE-CARS on synthetic functions and language-model tasks respectively. We use the same budget to compare RACE-CARS with a selection of DFO algorithms, including SRACOS (Hu, Qian, and Yu 2017), the zeroth-order adaptive momentum method (ZO-Adam) (Chen et al.
2019), differential evolution (DE) (Opara and Arabas 2019) and covariance matrix adaptation evolution strategies (CMA-ES) (Hansen 2016). All the baseline algorithms are fine-tuned, and the essential hyperparameters of RACE-CARS can be found in the appendix.

On Synthetic Functions

We commence our empirical experiments with four benchmark functions: Ackley, Levy, Rastrigin and Sphere. Their analytic forms and 2-dimensional illustrations are detailed in the appendix. Characterized by extreme non-convexity and numerous local minima and saddle points (with the exception of the Sphere function), each is minimized within the boundary $\Omega = [-10, 10]^n$, with a global minimum value of 0. We choose the dimension of the solution space $n$ to be 50 and 500, with corresponding function-evaluation budgets of 5000 and 50000. Notably, as the results indicate, the convergence of RACE-CARS requires only a fraction of this budget. The region-shrinking rate is configured as γ = 0.9 and 0.95, with shrinking frequency ρ = 0.01 and 0.001, for n = 50 and 500 respectively. Each algorithm is executed over 30 trials, and the mean convergence trajectories of the best-so-far values under 100n function evaluations are depicted in Figure 1 and Figure 2. The numerical values adjacent to the algorithm names in the legends represent the mean of the attained minima. It is evident that RACE-CARS performs best on both convergence speed and optimal value, with a slight edge to CMA-ES on the strongly convex Sphere function. Yet this comes at the cost of scalability, due to CMA-ES's reliance on an n-dimensional covariance matrix, which is significantly more computationally intensive compared to the other algorithms.
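For reference, two of the four benchmarks can be written down directly (these are the standard textbook forms; the exact variants used in the experiments are given in the paper's appendix):

```python
import math

def ackley(x):
    """Standard Ackley function: highly multimodal, global minimum 0
    at the origin."""
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(2.0 * math.pi * v) for v in x) / n
    return -20.0 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20.0 + math.e

def rastrigin(x):
    """Standard Rastrigin function: a regular grid of local minima,
    global minimum 0 at the origin."""
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in x)
```

Both are cheap to evaluate at any dimension, which is what makes them convenient probes for the n = 50 and n = 500 settings above.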
Figure 1: Comparison of synthetic functions with n = 50. Mean attained minima: (a) Ackley: RACE-CARS 0.031, SRACOS 1.705, ZO-Adam 4.629, DE 0.755, CMA-ES 0.51; (b) Levy: RACE-CARS 0.001, SRACOS 0.772, ZO-Adam 56.26, DE 3.599, CMA-ES 2.891; (c) Rastrigin: RACE-CARS 8.847, SRACOS 58.152, ZO-Adam 547.534, DE 311.519, CMA-ES 259.639; (d) Sphere: RACE-CARS 0.003, SRACOS 1.977, ZO-Adam 0.014, DE 2.54, CMA-ES 0.0.

Figure 2: Comparison of synthetic functions with n = 500. Mean attained minima: (a) Ackley: RACE-CARS 0.003, SRACOS 3.988, ZO-Adam 0.162, DE 2.93, CMA-ES 1.141; (b) Levy: RACE-CARS 2.873, SRACOS 47.556, ZO-Adam 538.24, DE 54.4, CMA-ES 153.155; (c) Rastrigin: RACE-CARS 589.776, SRACOS 1861.062, ZO-Adam 4357.937, DE 3173.669, CMA-ES 2121.274; (d) Sphere: RACE-CARS 0.108, SRACOS 209.387, ZO-Adam 0.177, DE 30.461, CMA-ES 0.0.

On Black-Box Tuning for LMaaS

Prompt tuning for extremely large pre-trained language models (PTMs) has shown great power. PTMs such as GPT-3 (Brown et al. 2020) are usually released as a service, due to commercial considerations and the potential risk of misuse, allowing users to design individual prompts to query the PTMs through black-box APIs. This scenario is called Language-Model-as-a-Service (LMaaS) (Diao et al. 2022; Sun et al. 2022). In this part we follow the experiments designed by (Sun et al. 2022), whose code is available at https://github.com/txsun1997/Black-Box-Tuning, where a language-understanding task is formulated as a classification task: for a batch of PTM-modified input texts $X$, predict the labels $Y$ in the PTM vocabulary. Namely, we need to tune the prompt such that the black-box PTM inference API $f$ takes a continuous prompt $p$ satisfying $Y = f(p; X)$. Moreover, to handle the high-dimensional prompt $p$, (Sun et al.
2022) proposed to randomly embed the $D$-dimensional prompt $p$ into a lower-dimensional space $\mathbb{R}^d$ via a random projection matrix $A \in \mathbb{R}^{D \times d}$. The objective therefore becomes

$$\min_{z \in \mathcal{Z}} \mathcal{L}\left(f(Az + p_0; X), Y\right),$$

where $\mathcal{Z} = [-50, 50]^d$ is the search space and $\mathcal{L}(\cdot, \cdot)$ is the cross-entropy loss. In our experimental setup, we configure the search-space dimension to d = 500 and the prompt length to 50, with RoBERTa (Liu et al. 2019a) serving as the backbone model. We evaluate performance on the datasets SST-2 (Socher et al. 2013), Yelp Polarity and AG's News (Zhang, Zhao, and LeCun 2015), and RTE (Wang et al. 2018a). With a fixed API-call budget of T = 8000, we pit RACE-CARS against SRACOS and the default DFO algorithm CMA-ES utilized in (Sun et al. 2022). Our choice to exclude ZO-Adam and DE is based on their suboptimal performance on high-dimensional nonconvex black-box functions, as demonstrated in the last section.

Figure 3: Comparisons on SST-2: (a) training loss, (b) training accuracy, (c) development loss and (d) development accuracy, each plotted against the number of API calls for RACE-CARS, SRACOS and CMA-ES.

For our tests, the shrinking rate is γ = 0.7, with shrinking frequency ρ = 0.002. Each algorithm is repeated 5 times independently with unique seeds. We assess the algorithms based on the mean and deviation of training loss, training accuracy, development loss and development accuracy. The SST-2 results are highlighted in Figure 3, with additional findings for Yelp Polarity, AG's News and RTE detailed in the appendix. The results indicate that RACE-CARS consistently accelerates the convergence of SRACOS. While CMA-ES shows superior performance on Yelp Polarity, AG's News and RTE, it also exhibits signs of overfitting.
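The random embedding can be sketched in a few lines (we use Gaussian entries for $A$ as an illustrative assumption; the actual distribution and normalization follow (Sun et al. 2022)):

```python
import random

def random_projection(D, d, rng=None):
    """Illustrative D x d random projection matrix A. A point z in R^d
    is mapped to a D-dimensional prompt offset Az, which is added to
    the initial prompt p0. Gaussian entries with scale 1/sqrt(d) are an
    assumption for illustration, not the paper's exact choice."""
    rng = rng or random.Random(0)
    scale = 1.0 / d ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(D)]

def embed(A, z, p0):
    """Compute A z + p0, the prompt actually sent to the black-box API."""
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) + p_i
            for row, p_i in zip(A, p0)]
```

The optimizer only ever sees the d-dimensional $z$; the fixed matrix $A$ and offset $p_0$ turn each candidate into a full D-dimensional prompt, which is what makes a 500-dimensional search tractable for prompts of much higher ambient dimension.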
RACE-CARS achieves comparable performance to CMA-ES, despite the latter's hyperparameters being finely tuned. Notably, the hyperparameters for RACE-CARS were empirically adjusted on the SST-2 dataset and then applied to the other three datasets without further tuning.

Discussion

Beyond Continuity

1. Discontinuous objective functions. Dimensionally local Hölder continuity (Definition 5) imposes certain constraints on the objective through a set of continuous envelopes, whereas the objective itself is not required to be continuous. Beyond the continuous cases discussed in the previous section, RACE-CARS thus remains applicable to discontinuous objectives as well; refer to the appendix for a comprehensive treatment.

2. Discrete optimization. Like the SRACOS algorithm, RACE-CARS retains the capability to tackle discrete optimization problems. The convergence theorems, presented in Theorem 2 and Theorem 3, encompass this situation by altering the measure of the probability space, for example to one induced by the counting measure. We extend the experiments to mixed-integer programming problems, substantiating the acceleration of RACE-CARS empirically; see the appendix for details.

On the Concept of Hypothesis-Target Shattering

The concept of hypothesis-target shattering, central to our discussion, draws inspiration from the established learning-theory notion of shattering and its deep ties to Vapnik-Chervonenkis (VC) theory (Vapnik et al. 1998). At the heart of VC theory lies the VC dimension, a measure of a hypothesis family's capacity to distinguish among data points based on their labels. Specifically, for a collection of data points with binary labels $S$, a subset $S' \subseteq S$ is said to be shattered by a hypothesis family $\mathcal{H}$ if there exists a hypothesis $h \in \mathcal{H}$ that perfectly aligns with the labels of points in $S'$ and contrasts with those outside:

$$h(x) = \begin{cases} 1, & x \in S', \\ 0, & x \in S \setminus S'. \end{cases}$$

The shattering coefficient $S(\mathcal{H}, n)$ signifies the variety of point-label combinations that $\mathcal{H}$ can produce for $n$ points.
The VC dimension is then defined as $VC(\mathcal{H}) := \sup\{n : S(\mathcal{H}, n) = 2^n\}$, indicating the maximum number of points that can be arbitrarily labeled by $\mathcal{H}$. In the context of classification-based DFO, we represent the target-representative capability of a family of hypotheses through hypothesis-target shattering, measuring the overlap between a hypothesis's active region and the target. The quintessence of algorithm design therefore hinges on maximizing this quantity. However, discerning the target-representative capacity within the intricate landscape of nonconvex black-box optimization problems is nontrivial. Nonetheless, the hypothesis family generated by RACOS, although analyzed under the previous framework, empirically exhibits sufficient efficacy in scenarios where the objective function is locally Hölder continuous. Looking ahead, altering the Training and Replacing sub-procedures inherited from SRACOS, which may ideally lead to a bigger shattering rate while maintaining the easy-to-sample characterization, is another extension direction of the current study.

Ablation Experiments

While it is an appealing goal to develop a universally effective DFO algorithm for black-box functions without the need for hyperparameter tuning, this remains an unrealistic aspiration. Our proposed algorithm RACE-CARS introduces two hyperparameters: the shrinking rate γ and the shrinking frequency ρ. For an n-dimensional optimization problem, we call $\gamma^{n\rho}$ the shrinking factor of RACE-CARS. We take Ackley as a case study and design ablation experiments on the two hyperparameters of RACE-CARS to reveal the mechanism. We stipulate that we do not aim to identify an optimal combination of hyperparameters maximizing the overlap with the target hypothesis; instead, our aim is to provide empirical guidance for tuning these hyperparameters effectively. For further details, the reader is directed to the appendix.
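The shrinking factor admits a direct reading: each shrink multiplies every dimension by γ (the volume by γ^n) and fires with probability ρ per evaluation, so γ^{nρ} is the per-evaluation geometric-mean volume multiplier. For the two synthetic-function settings above it works out to similar values:

```python
def shrinking_factor(gamma, rho, n):
    """Shrinking factor gamma**(n*rho) of RACE-CARS: the per-evaluation
    geometric-mean volume multiplier, for shrink rate gamma per
    dimension, shrink frequency rho, and problem dimension n."""
    return gamma ** (n * rho)

f_50 = shrinking_factor(0.9, 0.01, 50)      # 0.9**0.5, about 0.949
f_500 = shrinking_factor(0.95, 0.001, 500)  # 0.95**0.5, about 0.975
```

This is one way to read the ablation: pairs (γ, ρ) with the same shrinking factor decay the search volume at the same average rate per evaluation, even though they distribute the shrinks differently over time.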
Conclusion
In this paper, we refine the framework of classification-based DFO as a stochastic process and propose a novel learning concept named the hypothesis-target shattering rate. Our research delves into the convergence properties of sequential-mode classification-based DFO algorithms and provides a fresh perspective on their query complexity upper bound. Motivated by the computational complexity upper bound under the new framework, we propose a theoretically grounded region-shrinking technique to accelerate convergence. In the empirical analysis, we study the scalability of RACE-CARS on both synthetic functions and black-box tuning for LMaaS, showing its superiority over SRACOS.

References
Akay, B.; and Karaboga, D. 2012. Artificial bee colony algorithm for large-scale problems and engineering design optimization. Journal of Intelligent Manufacturing, 23: 1001-1014.
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623-2631.
Bäck, T.; and Schwefel, H.-P. 1993. An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1(1): 1-23.
Bickel, P. J.; and Levina, E. 2004. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6): 989-1010.
Bogunovic, I.; Scarlett, J.; Krause, A.; and Cevher, V. 2016. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. Advances in Neural Information Processing Systems, 29.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877-1901.
Céa, J. 1971. Optimisation: théorie et algorithmes.
Dunod.
Chen, X.; Liu, S.; Xu, K.; Li, X.; Lin, X.; Hong, M.; and Cox, D. 2019. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. Advances in Neural Information Processing Systems, 32.
Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to Derivative-Free Optimization. SIAM.
Diao, S.; Huang, Z.; Xu, R.; Li, X.; Lin, Y.; Zhou, X.; and Zhang, T. 2022. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531.
El-Yaniv, R.; and Wiener, Y. 2012. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(2).
Eriksson, D.; Pearce, M.; Gardner, J.; Turner, R. D.; and Poloczek, M. 2019. Scalable global optimization via local Bayesian optimization. Advances in Neural Information Processing Systems, 32.
Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, 1437-1446. PMLR.
Fan, J.; and Fan, Y. 2008. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6): 2605.
Fortin, F.-A.; De Rainville, F.-M.; Gardner, M.-A. G.; Parizeau, M.; and Gagné, C. 2012. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1): 2171-2175.
Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
Ge, D.; Liu, T.; Liu, J.; Tan, J.; and Ye, Y. 2022. SOLNP+: A derivative-free solver for constrained nonlinear optimization. arXiv preprint arXiv:2210.07160.
Gotovos, A. 2013. Active learning for level set estimation. Master's thesis, Eidgenössische Technische Hochschule Zürich, Department of Computer Science.
Hall, P.; Marron, J. S.; and Neeman, A. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(3): 427-444.
Hansen, N. 2016.
The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772.
Hashimoto, T.; Yadlowsky, S.; and Duchi, J. 2018. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, 2027-2036. PMLR.
Hensman, J.; Fusi, N.; and Lawrence, N. D. 2013. Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
Hu, Y.-Q.; Qian, H.; and Yu, Y. 2017. Sequential classification-based optimization for direct policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Larson, J.; Menickelly, M.; and Wild, S. M. 2019. Derivative-free optimization methods. Acta Numerica, 28: 287-404.
Liao, T. W. 2010. Two hybrid differential evolution algorithms for engineering design optimization. Applied Soft Computing, 10(4): 1188-1199.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019a. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; Qian, C.; and Yu, Y. 2017. ZOOpt: Toolbox for derivative-free optimization. arXiv preprint arXiv:1801.00329.
Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; and Yu, Y. 2019b. Asynchronous classification-based optimization. In Proceedings of the First International Conference on Distributed Artificial Intelligence, 1-8.
Michalski, R. S. 2000. Learnable evolution model: Evolutionary processes guided by machine learning. Machine Learning, 38: 9-40.
Nayebi, A.; Munteanu, A.; and Poloczek, M. 2019. A framework for Bayesian optimization in embedded subspaces. In International Conference on Machine Learning, 4752-4761. PMLR.
Nelder, J. A.; and Mead, R. 1965. A simplex method for function minimization. The Computer Journal, 7(4): 308-313.
Nesterov, Y.; and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17: 527-566.
Opara, K. R.; and Arabas, J. 2019.
Differential Evolution: A survey of theoretical analyses. Swarm and Evolutionary Computation, 44: 546-558.
Ragonneau, T. M.; and Zhang, Z. 2023. PDFO: A cross-platform package for Powell's derivative-free optimization solvers. arXiv preprint arXiv:2302.13246.
Ray, T.; and Saini, P. 2001. Engineering design optimization using a swarm with an intelligent information sharing among individuals. Engineering Optimization, 33(6): 735-748.
Ros, R.; and Hansen, N. 2008. A simple modification in CMA-ES achieving linear time and space complexity. In International Conference on Parallel Problem Solving from Nature, 296-305. Springer.
Scheinberg, K. 2022. Finite difference gradient approximation: To randomize or not? INFORMS Journal on Computing, 34(5): 2384-2388.
Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148-175.
Shi, H.-J. M.; Xuan, M. Q.; Oztoprak, F.; and Nocedal, J. 2021. On the numerical performance of derivative-free optimization methods based on finite-difference approximations. arXiv preprint arXiv:2102.09762.
Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25.
Sobol', I. M. 1967. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 7(4): 784-802.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631-1642.
Sun, T.; Shao, Y.; Qian, H.; Huang, X.; and Qiu, X. 2022. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 20841-20855. PMLR.
Vapnik, V.; et al. 1998.
Statistical learning theory.
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Wang, L.; Fonseca, R.; and Tian, Y. 2020. Learning search space partition for black-box optimization using Monte Carlo tree search. Advances in Neural Information Processing Systems, 33: 19511-19522.
Wang, Z.; Gehring, C.; Kohli, P.; and Jegelka, S. 2018b. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, 745-754. PMLR.
Wang, Z.; Hutter, F.; Zoghi, M.; Matheson, D.; and De Freitas, N. 2016. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55: 361-387.
Yang, L.; and Shami, A. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415: 295-316.
Yi, Z.; Wei, Y.; Cheng, C. X.; He, K.; and Sui, Y. 2024. Improving sample efficiency of high dimensional Bayesian optimization with MCMC. arXiv preprint arXiv:2401.02650.
Yu, W.-C. 1979. Positive basis and a class of direct search techniques. Scientia Sinica, Special Issue of Mathematics, 1(26): 53-68.
Yu, Y.; and Qian, H. 2014. The sampling-and-learning framework: A statistical view of evolutionary algorithms. In 2014 IEEE Congress on Evolutionary Computation (CEC), 149-158. IEEE.
Yu, Y.; Qian, H.; and Hu, Y.-Q. 2016. Derivative-free optimization via classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Yue, P.; Yang, L.; Fang, C.; and Lin, Z. 2023. Zeroth-order optimization with weak dimension dependency. In The Thirty Sixth Annual Conference on Learning Theory, 4429-4472. PMLR.
Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.