# adaptive_partitioning_schemes_for_optimistic_optimization__4fe7d449.pdf

Adaptive Partitioning Schemes for Optimistic Optimization

Raja Sunkara 1 Ardhendu Tripathy 2

Applications such as engineering design often require us to optimize a black-box function, i.e., a system whose inner processing is not analytically known and whose gradients are not available. Practitioners often have a fixed budget for the number of function evaluations and the performance of an optimization algorithm is measured by its simple regret. In this paper, we study the class of Optimistic Optimization algorithms for black-box optimization that use a partitioning scheme for the domain. We develop algorithms that learn a good partitioning scheme and use flexible surrogate models such as neural networks in the optimization procedure. For multi-index functions on an m-dimensional subspace within d dimensions, our algorithm attains O(n β/d) regret, where β = 1+ d m

2m 1, as opposed to O(n 1/d) for Sequ OOL, a state-of-the-art optimistic optimization algorithm. We use our approach to improve the quality of Activation-aware Weight Quantization (AWQ) of the OPT-1.3B model, achieving 10% improvement in performance relative to the best possible unquantized model.

1. INTRODUCTION AND MOTIVATION

Optimization of black-box functions is often carried out by a class of algorithms called Optimistic Optimization algorithms (Munos, 2014). These algorithms are preferred due to their mild assumptions; however, they require a partition scheme to be provided for the search space. The quality of a partitioning scheme depends on the function being optimized (Grill et al., 2015). If information on the function is lacking, then a default partitioning scheme, i.e., axis-aligned rectangles, is used (Jones et al., 1993; Bartlett et al., 2019). But this default choice might not be a good choice for the

1Missouri University of Science & Technology, Rolla MO, US 2Ops Canvas, Alexandria VA, US. Correspondence to: Raja Sunkara <rs5cq@mst.edu>, Ardhendu Tripathy <ardhendu@opscanvas.com>.

Proceedings of the 42 st International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

function. Additionally, an axis-aligned rectangle scheme limits the application to low-dimensional search spaces (as the number of rectangles grows as kdn, where k is the number of splits at each height, d is the dimension and n is the number of splits needed in each dimension), or it may fail to exploit low-dimensional structures in high-dimensional spaces. Thus, there is a need to develop partitioning schemes that can adapt to the low dimensional structure that may be present in a black-box function.

Global optimization is impossible without restricting the class of functions: a degenerate function that takes the value 1 at particular location and zero everywhere else can only be optimized by sampling every point in the domain. The class of functions is typically restricted by assumptions on its smoothness . One of the first global optimization algorithms with provable convergence was obtained for the class of Lipschitz functions. The Di Rect algorithm (Jones et al., 1993; Jones & Martins, 2021) is a well-known algorithm that can optimize Lipschitz functions without knowing the Lipschitz constant. A different class of functions studied in the Bayesian Optimization literature is that of the Gaussian Process (GP) prior for the unknown function f. Combining the observed samples with the prior mean and covariance kernel, a posterior distribution for f is obtained and used to guide the sampling strategy, see e.g. GP-UCB (Srinivas et al., 2012). Since its sampling strategy required maximizing the Upper Confidence Bound obtained from the posterior, its could feasibly be applied only on low-dimensional X. Later works (Shekhar & Javidi, 2018; Salgia et al., 2021) incorporated domain partitioning ideas to reduce the computational cost. GP-UCB can also be applied if f belongs to a Reproducing Kernel Hilbert Space corresponding to the covariance kernel. However, these methods require the kernel associated with f as an input.

Unlike the above methods, which assume a global characteristic for f, the Optimistic Optimization class of algorithms (HOO (Bubeck et al., 2011), SOO (Munos, 2011), Sequ OOL (Bartlett et al., 2019)) just assume a local smoothness condition around the global maximizer. This local smoothness ensures that the function does not rapidly decrease around its maximum value. These algorithms require as input a hierarchical partitioning of X that is well-behaved with respect to a semi-metric on X. Although the semi-metric is not needed to be known, the performance of these algorithms

Adaptive Partitioning Schemes for Optimistic Optimization

can be heavily influenced by the choice of the partitioning scheme. Absent any additional information, the default partitioning is the axis-aligned rectangles from Di Rect, leading to the shortcomings described in the beginning. We propose to develop an algorithm that adaptively builds a partitioning scheme as new samples are collected. Additionally, it refines partitions using Sequ OOL in the low dimensional subspace that accounts for most of the variation in the function.

We sumarize the main contributions of our paper below.

We demonstrate the benefit of using a learned partitioning scheme for existing derivative-free optimization algorithms such as Sequ OOL.

When the function is a low-dimensional multi-index function we theoretically prove improved regret bounds shown in Table 1.

Empirically, we demonstrate the improvement in optimization error for several benchmark functions including Rastrigin (multi-modal), Branin (multiple minima), and Sharp Ridge (non-differentiable).

We pose the quantization of Large Language Model (LLM) as a high-dimensional black-box optimization problem and obtain an improved perplexity value.

SOO Sequ OOL Our Method η = 0 ρ n ρ Ω(n) ρβ Ω(n)

η > 0 O(n 1/η) O(n 1/η) O(n β/η)

Table 1. Regret bounds on a budget of n evaluations for a mdimensional multi-index function in d dimensions. β = 1+ d m

2m 1 and ρ, η are parameters for the default partitioning scheme.

Related works Here we summarize the methods we have compared in our experiments. Perhaps the closest related work is Random Embedding Simultaneous Optimistic Optimization (RESOO) (Qian & Yu, 2016), which scales SOO to high-dimensional optimization problems by applying SOO in a random low-dimensional search space. The simple regret of RESOO depends only on the effective dimension of the problem, rather than the full dimension of the solution space. REMBO (Random EMbedding Bayesian Optimization) (Wang et al., 2016) uses a random projection matrix to create a lower-dimensional embedding for high-dimensional optimization problems. It then applies Bayesian optimization on this low-dimensional space, allowing it to efficiently search for optima in the reduced space. He SBO (Nayebi et al., 2019) uses hashing-enhanced embeding subspaces. ALEBO (Adaptive Linear Embedding Bayesian Optimization) (Letham et al., 2020) builds upon and improves the original REMBO by proposing a new linear-embedding method. However, these algorithms requires an lower-bound

to the low-dimensional subspace dimension, which is difficult to obtain in real-world problems. Additionally, the Bayesian Optimization algorithms can be computationally expensive for large budgets.

Latent Action Monte Carlo Tree Search (LA-MCTS) (Wang et al., 2020) recursively partitions the high-dimensional search space into regions with high/low function values using nonlinear decision boundaries. It serves as a metaalgorithm by using existing black-box optimizers (e.g., BO, Tu RBO (Eriksson et al., 2019)) as its local model. LA-MCTS creates sub-regions by partitioning the variable ranges without reducing dimensionality. Each sub-region retains the full dimensionality of the original problem space. While LA-MCTS has shown good empirical performance, it encounters challenges in high-dimensional spaces. Voronoi Optimistic Optimization (VOO) (Kim et al., 2020) is a method for optimizing functions in high-dimensional spaces. Instead of using fixed partitions, VOO creates flexible sections called Voronoi cells. Each cell contains points closest to a specific evaluated point. As new points are evaluated, the cells automatically adjust based on the best results.

Evolutionary algorithms such as CMA-ES (Hansen, 2023) and simulated annealing (Xiang et al., 2013) are other popular approaches for black-box optimization. The CMA-ES technique performs well in finding the best solutions in high-dimensional optimization problems. However, these methods do not have convergence guarantees (Loshchilov & Hutter, 2016; Nomura et al., 2021).

2. PROBLEM SETUP & ADAPTIVE PARTITIONING SCHEMES

We consider the problem of optimizing a function f : X 7 R using only its evaluations at appropriately chosen points in its domain X, which is assumed to be a closed and compact set. Given a budget of n evaluations, at each t {1, 2, . . . , n} the algorithm queries a point xt X and observes a real number yt = f(xt). After exhausting its budget, the algorithm returns the estimated maximizer ˆx(n). The optimization error is quantified by the simple regret rn, defined as

rn f f(ˆx(n)), where f sup x X f(x),

and x X such that f = f(x ).

Our focus is on the class of optimistic optimization algorithms (Munos, 2011; Bubeck et al., 2011; Valko et al., 2013; Munos, 2014; Grill et al., 2015; Bartlett et al., 2019). These algorithms require as input a hierarchical partitioning of the domain X for their search procedure.

Definition 2.1. Partitioning scheme (Bartlett et al., 2019). Let P denote a tree representation of the domain X. All the

Adaptive Partitioning Schemes for Optimistic Optimization

h = 0 : X = [ 1, 1]2

Depth h = 2: axis 2 trisection

Depth h = 1: axis 1 trisection

Depth h = 3: axis 1 trisection

A blue dot represents a cell that is trisected at height h

An orange dot represents a cell that is not trisected at height h

Figure 1. First four depths of the default partitioning P on [ 1, 1]2

cells at depth h form a partition of X and are denoted as Ph. We index the cells at depth h with an additional index i, i.e., Ph,i is a cell in the tree at depth h. A cell Ph,i is partitioned to obtain one or more cells of Ph+1. P h denotes a cell at depth h containing a maximizer x of f and xh,i denotes a representative location within the Ph,i cell.

A common choice for the domain is obtained when we have interval constraints on each of its components. In this case, without loss of generality, we can consider X = [ 1, 1]d. And a default choice for the partitioning scheme that is often used in practice is the axis-aligned trisection scheme (Jones et al., 1993). We use Hm 1 to denote the unit hypercube in m-dimensional space. i.e., [ 1, 1]m domain.

Definition 2.2. The default partitioning scheme P = {Ph,i : h, i N0} of X = [ 1, 1]d uses an axis-aligned trisection method in a round-robin manner (see Figure 1). At depth h = 0, there is a single cell P0,1 = X. Each cell Ph,i at depth h is split into three children cells Ph+1,j at depth h+1. The trisection occurs along the (h mod d) + 1 axis, i.e., the new cells are created by introducing (d 1)-dimensional hyperplanes orthogonal to the chosen axis. For all h, i the representative xh,i is the midpoint of the cell Ph,i.

We consider partitioning schemes that are not aligned with the standard axes. We define such a rotated low-dimensional partitioning scheme using a matrix of orthonormal rows.

Definition 2.3. Given A Rm d such that AA = Im and a scalar α > 0, we establish a partitioning scheme denoted as A. Let the default partitioning scheme (Definition 2.2) on [ α, α]m be denoted as T . For any depth h and

index i, the cell Ah,i {A t : t Th,i} is a cell in the partitioning on the m-dimensional projection of X onto the subspace spanned by rows of A.

Since the projection of X onto A can result in points outside X, the value of α is chosen to ensure that the projection is covered by the A partitioning scheme. An appropriate value of α is in the following definition.

Definition 2.4. Let cj { 1, 1}d denote the 2d corners, indexed by j, of the default axis-aligned P partitioning scheme. Given a matrix A with m orthonormal rows denoted as a1, a2, . . . , am, we define

αmax h max 1 j 2d a 1 cj, max 1 j 2d a 2 cj, max 1 j 2d a mcj i

as the extent of the A partitioning scheme. We choose the largest component of the extent as α maxi,j a i cj.

In this paper, we will provably demonstrate the advantage of our method when there is a low-dimensional ridge structure present in the black-box function. We consider the class of multi-index functions (Fornasier et al., 2012), i.e., there is a matrix A Rm d with orthonormal rows and a Lipschitz function g such that

f(x) = g(Ax). (1)

Then optimizing over the low-dimensional subspace is sufficient to recover the maximizer. All proofs and omitted details can be found in the supplementary.

Proposition 2.5. Suppose f : Rd 7 R is a multi-index function with f(x) = g(Ax). If x X = [ 1, 1]d is an optimizer and α is chosen as per definition 2.4, then there is a z Rm such that f(A z ) = f(x ) and z α.

Proof Sketch. For an optimizer x [ 1, 1]d let z = Ax . Then f(A z ) = f(A Ax ) = f(x ) since A A is identity. We then use that x lies in a convex hull of cube corners and apply the definition of α to show that z α, see details in Appendix 11.4.

The above proposition implies that if we have access to the true subspace matrix A, we can compute α and restrict our optimization to the m dimensional space αHm 1 using the A partitioning scheme. We theoretically characterize the benefit of using the partitioning scheme A for the class of multi-index functions, i.e., if f satisfies (1) then using the partitioning scheme A can decrease rn at a faster rate with n. To show this, we use the partitioning scheme assumption made by Grill et al. (2015) and Bartlett et al. (2019) that states that the suboptimality of any point in the P h cell improves as depth h increases. The rate of this improvement is characterized by a parameter ρ (0, 1).

Adaptive Partitioning Schemes for Optimistic Optimization

Assumption 2.6. (Bartlett et al., 2019) For any global optimum x , there is a ν > 0 and ρ (0, 1) such that for all h N0 and all x P h, we have that f(x) f(x ) νρh.

Bartlett et al. (2019) define the near-optimality dimension which characterizes the difficulty of optimizing a black-box function using a partitioning scheme.

Definition 2.7. (Bartlett et al., 2019). Consider a partitioning P that satisfies Assumption 2.6 for some ν, ρ. For any C > 1, the near-optimality dimension ηP(ν, ρ, C) of f with respect to the partitioning P is defined as

ηP(ν, ρ, C) inf{η 0 : |NP(3νρh)| Cρ ηh h 0},

where NP(3νρh) is the set of near-optimal cells Ph,i at depth h for which supx Ph,i f(x) f(x ) 3νρh.

Intuitively, a larger ρ implies that the function is only improving slowly near the maximizer, and a larger η implies that there are many near-optimal cells which must be ruled out to get the true maximizer. In both cases, we need a larger budget of evaluations to converge. The following example shows a partitioning scheme A with a lower near-optimality dimension than the default partitioning scheme.

Example 2.8. Consider the function f(x1, x2) = g(Ax) = 1 |x1| with A = [1, 0] and g(z) = 1 |z|. Let ηP, ηA be the near-optimality dimensions for the partitioning schemes P, A as defined in Definitions 2.2 and 2.3, respectively. Then we have that ηP = 1 and ηA = 0.

In the next example, the subspace defined by A is not aligned with the canonical axes and we see that ρA is smaller than ρ of the default partitioning scheme.

Example 2.9. Consider the function f(x1, x2) = g(Ax) = 1 |x1 + x2| with A = [1, 1] and g(k) = 1 |k|. Let ηP, ηA be the near-optimality dimensions for the partitioning schemes P, A associated with (ν, ρ), (νA, ρA) respectively. Then ρ = 1/

3 and ρA = 1/3 and ηP = ηA = 0.

In addition to identifying the important subspace spanned by m orthonormal directions, an adaptive partitioning scheme can choose which direction to split at each depth.

Definition 2.10. Direction selection strategy. For a partitioning A in Definition 2.3, a direction selection strategy τh : H [1 : m] defined for each height h takes the history H of all past function evaluations till depth h 1 and outputs the index of the direction to be split at depth h.

In the following example, a direction selection strategy that splits the x1 axis twice as often as the x2 axis results in a lower near-optimality dimension than that of the default partition P which splits both the axes in equal proportion.

Example 2.11. The near-optimality dimension of the default partitioning scheme for the function f(x1, x2) =

Algorithm 1 Obtaining directions for an adaptive partitioning scheme Require: T, oracle for f which is a multi-index function defined using A (see (1))

1: Sample f at T points chosen as x(i) iid N(0, Id) and fit a single hidden-layer neural network ˆy 2: ˆA top right singular vectors of the hidden layer weight matrix preserving 95% variance 3: u Upper bound to dist (A, ˆA) obtained in Lemma 9.11 or Theorem 9.12 4: return ˆA and ˆα (see Equation 4) used to specify the partitioning scheme A in Definition 2.3

1 |x1| x2 2 is ηP = 0.5. For the partitioning scheme A from Definition 2.3 with A = I2, α = 1 and direction selection strategy τh = 1 if h mod 3 = 0 and τh = 2 otherwise, its near-optimality dimension ηA = 0.

Empirically, as we increase the optimization budget, we observe that Sequ OOL on A decreases regret at a faster rate than Sequ OOL on P for the above example functions.

3. PROPOSED ALGORITHMS FOR BLACK-BOX OPTIMIZATION

To utilize an adaptive partitioning scheme, we develop two kinds of algorithms: 1. a two-stage algorithm where the first stage learns an adaptive partitioning scheme and the second stage uses it for optimization, and 2. an interleaved algorithm where learning and optimization happen iteratively.

Two-stage algorithm. In the first stage, we use a learning algorithm to obtain ˆA, i.e., the directions used to define the adaptive partitioning scheme A. If f is a multi-index function satisfying (1), the quality of this estimate is measured by the subspace distance between ˆA and the true A. The value of ˆα chosen in Lemma 4.5 is such that f( ˆA t) = f(x ) for some t [ ˆα, ˆα]m and the optimization can find the maximizer in the low-dimensional subspace.

Algorithm 1 learns ˆA by fitting a single hidden-layer neural network to the function evaluations at random locations in its domain. A single hidden-layer neural network can model the fact that only a subspace of the domain explains all the variation in the function values (see Proposition 8.1). In practice, we can choose the value of m to explain a desired percentage (such as 95%) of the total variation in the SVD step calculating ˆA. Another approach to learn ˆA is from Fornasier et al. (2012, Algorithm 2) which estimates the gradient of the function using finite differences. The second stage applies Sequ OOL to the partitioning scheme A returned by Algorithm 1.

Adaptive Partitioning Schemes for Optimistic Optimization

Algorithm 2 Sequ OOL on an adaptive partitioning scheme with a direction selection strategy Require: Total number of openings n, number of samples T for updating ˆf, integer c stating how often ˆf is updated, number of dimensions m, oracle for f, direction selection strategy τh. Setup a partitioning scheme ˆ A with ˆA and ˆα obtained by (Fornasier et al., 2012, Algorithm 2) Obtain ˆf by fitting a neural network on available samples, set hmax n2/ nlogn + T n

3c for h 1 to hmax do

Th the cell with the largest function sample value at height h if h mod c = 0 then

Obtain evaluations of f at T uniform random locations in T h and update ˆf end if Open hmax/h cells at depth h and trisect them along the direction returned by τh( ˆf) end for return ˆx(n) arg maxall h,i f(xh,i)

Interleaved learning and optimization. Instead of separating the learning and optimization in two distinct stages, an interleaved algorithm updates the neural network fit at regular intervals. The updated approximant is used to specify the direction selection strategy in Definition 2.10.

Algorithm 2 uses a parameter τh that takes the updated approximation ˆf as input. Our proposed method for τh is the lookahead direction selection strategy, which is detailed in Algorithm 3 in Appendix section 3.

A different value of hmax is used in Algorithm 2, compared to hmax = j n2

k used in Sequ OOL, to account for the additional evaluations required to update the surrogate model ˆf at regular intervals. This adjustment ensures that the total number of evaluations, including both Sequ OOL openings and model updates, remains within the overall budget. The following proposition guarantees that the chosen hmax ensures the total number of function evaluations does not exceed 3n.

Proposition 3.1. Let n, T, c, hmax be the parameters defined in Algorithm 2 with logn Pn t=1 1 t . Then the total number of function evaluations taken by the algorithm will not exceed 3n.

4. THEORETICAL ANALYSIS

We show that for the class of multi-index functions (1), the partitioning ˆ A obtained using the learned ˆA can have a lower η and a lower ρ compared to that of the default partitioning P. Our proof proceeds by analyzing the rela-

tionships between three partitioning schemes: the default scheme P, the scheme A based on the true subspace A, and the scheme ˆ A based on the estimated subspace ˆA. Our analysis proceeds in two main stages: We first relate the parameters of the default partitioning scheme P to those of the scheme A. This includes comparing their Sequ OOL parameters (ν, ρ), characterizing their star cells, and bounding the number of near-optimal cells. We then extend this analysis to the estimated scheme ˆ A, relating its properties to those of A. This involves quantifying the impact of using an estimated subspace and establishing relationships between the Sequ OOL parameters of ˆ A and A.

We will use the notation of lattices to relate the number of near-optimal cells in two different partitioning schemes.

Definition 4.1. (Vaikuntanathan, 2011) Given m linearly independent vectors b1, . . . , bm Rm, the lattice generated by them is defined as L(b1, . . . , bm) = {Pm i=1 aibi | ai Z} . We call b1, . . . , bm a basis of the lattice. We denote lattices formed by the standard basis vectors e1, e2, . . . , em as Λ. Thus, Λ = L(e1, e2, . . . , em) = Zm. The lattice Λ scaled by a scalar κ is the same as L(κe1, κe2, . . . , κen).

For a d-dimensional vector x = [x1, . . . , xd] , we use x p and x to denote its ℓp and ℓ norm respectively. The following lemma bounds the number of cells from a finer partitioning scheme that cover a given region in the domain.

Lemma 4.2. Let κ, κ R+ with κ > κ , and let B(x, r) = {y Rm : y x r}. Consider a lattice Λ as defined in Definition 4.1 , scaled by κ , and translated by a vector t to form the lattice Λ + t. Define the subset of all lattice points that cover B(0, κ) as C(κ, κ , t) Λ + t, i.e.,

C(κ, κ , t) = {ci Λ + t : B(ci, κ ) B(0, κ) =

and B(0, κ) [

i B(ci, κ )}.

Then, the cardinality of C satisfies: κ

m |C(κ, κ , t)| κ

We switch from A to T to simplify analysis by working directly in the low-dimensional space. Detailed mathematical justification is in Appendix Section 11.13.

The following lemma provides an upper bound on the number of near-optimal cells in scheme A relative to scheme P. This relationship is fundamental for comparing the nearoptimality dimensions of the two schemes, which will be addressed in a subsequent lemma (Lemma 4.4).

Lemma 4.3. For the partitioning schemes P and T from Definition 2.7, let NP(ϵ) be the number of cells Ph,i at depth h for which supx Ph,i f(x) f(x ) ϵ, and similarly for

Adaptive Partitioning Schemes for Optimistic Optimization

NT (ϵ). Then

NT (ϵ) CNP(ϵ) where C = 3ddd m(12 m)m.

Proof Sketch. Consider the following POpt relation:

POpt {(Th,i, Ph,j) : Th,i NT (ϵ), Ph,j NP(ϵ),

Th,i APh,j = },

APh,j {Ax | x Ph,j}.

For any h, i, consider the cell Th,i NT (ϵ). Let l be the lower bound to the number of elements in POpt that are of the form (Th,i, ), implying |POpt| |NT (ϵ)| l. Similarly, for any h, j, consider the cell Ph,j NP(ϵ). Let u be the upper bound to the number of elements in POpt that are of the form ( , Ph,j). Then, we have: |NP(ϵ)| u |POpt|. Combining these two inequalities, we get:

|NT (ϵ)| |NP(ϵ)| u

In our proof in Appendix 11.10, we further refine the upper bound on u to (12 m)m3d3k(d m) and the lower bound on l to (1/d)d m3k(d m), leading to the result.

The above lemma is a key technical result that enables us to relate the parameters of the partitioning schemes A and P.

Lemma 4.4. For a function in the multi-index class (1) with known A Rm d and m < d, let (ν, ρ, ηP, C), (νA, ρA, ηA, CA) be parameters of P, A. Let lf f infx κ1Hd 1 f(x). Then we have that

νA = max{ν, lf}ρ(1 β)(m 1) h1,

ηA(νA, ρA, CA) ηP(ν, ρ, C)/β,

and CA = 3ddd m(12 m)m Cρ ηP h3,

where β 1 + d m

2m 1, h1 d log3 κ1 and

κ1 3 log3 mα , h3 j logρ νA

Proof Sketch. Lemma 9.4 shows that A is a valid partitioning scheme and relates its parameters to those of the default partitioning scheme P. We then use Lemma 4.3 to bound the number of near-optimal cells in A, see details in Appendix 11.11.

Since β 1, the previous lemma shows that ηA ηP. Furthermore, since ρ (0, 1), ρA ρ and ν νA. As an illustration, Example 2.9 satisfies the conditions of Lemma 4.4 with d = 2, m = 1, and hence β = 1 + d m

2m 1 = 2. This implies ρA = ρ2 and ηA(νA, ρA, CA) ηP(ν, ρ, C)/2.

Figure 2. Illustration of Lemma 4.5 for d = 2, m = 1, showing the true subspace A and the estimated subspace ˆA used in defining the expansion factor.

We previously computed ρ = 1/

3 and ρA = 1/3, which confirms the first conclusion. Additionally, since both ηP and ηA are 0, the second conclusion holds with equality.

When the low rank matrix A is unknown, we use the estimation guarantees for the learning algorithms of Fornasier et al. (2012) and Mousavi-Hosseini et al. (2023) that bound the subspace distance dist(A, ˆA). The subspace distance between the two row subspaces A and ˆA and is given by A A ˆA ˆA 2, where 2 denotes the spectral norm.

The following lemma shows that we need an upper bound on the subspace distance dist(A, ˆA) to ensure that an optimizer exists within the low-dimensional search space.

Lemma 4.5. Consider a multi-index function f(x) = g(Ax) over Rd. Let x be an optimizer of f within [ 1, 1]d, α be as in definition 2.4 and ˆA be an estimate of A satisfying dist(A, ˆA) < 1. Then there exists a z Rm such that

f( ˆA z ) = f(x ) and z m

1 dist2 (A, ˆA)α.

Proof Sketch. Consider a z satisfying A ˆA z = Ax .

Since dist(A, ˆA) < 1, the matrix A ˆA is invertible, giving

that z = (A ˆA ) 1Ax and f( ˆA z ) = g(A ˆA z ) =

g(A ˆA (A ˆA ) 1Ax ) = f(x ). Finally, we bound z using matrix norm inequalities and the subspace distance dist(A, ˆA) in Appendix 11.12.

Figure 2 shows an illustration of the lemma. Specifically, optimizing f over the blue segment is enough to obtain its optimum value over X = [ 1, 1]2. The value of ˆα used in Algorithm 1 is obtained from this lemma. If we have a u dist(A, ˆA), then since α

dm/(1 u2) (4)

to ensure it is larger than m

1 dist2 (A, ˆA)α. The value of u

Adaptive Partitioning Schemes for Optimistic Optimization

is obtained from Lemma 9.11 or Theorem 9.12, along with Lemma 9.8. In practice, m is chosen to obtain a A that preserves 95% variance in the hidden layer weight matrix.

The next lemma bounds the number of near-optimal cells in ˆT in terms of those in T . This is a key step toward comparing their near-optimality dimensions (see Lemma 4.7).

Lemma 4.6. For the partitioning schemes T and ˆT in Definition 2.3, let NT (3νT ρh T ) be the number of cells Th,i at depth h for which supz Th,i g(z) f 3νT ρh T . Similarly, let N ˆT (3ν ˆT ρh ˆT ) be the number of cells ˆTh,i at depth h for

which supz ˆTh,i g(A ˆA z) f 3ν ˆT ρh ˆT . Then

h 0, N ˆT (3ν ˆT ρh ˆT ) 4m NT (3νT ρh T ). (5)

Proof Sketch. We aim to upper-bound the number of nearoptimal cells in the partitioning scheme ˆT in terms of those in T . The key idea is to relate the two schemes through the

transformation A ˆA , under which we optimize the function

g(A ˆA z) over ˆT . For a near-optimal cell Th,i T , its preimage under this transformation defines a bounded set B in Rm, which we enclose within a hypercube of computable size. To bound the number of near-optimal cells in ˆT , we count how many of its cells at height h can intersect this hypercube. We invoke the Lemma 4.2 and show that at most 4m such cells can intersect the enlarged region. The complete proof is provided in Appendix 11.17

Lemma 4.7. Consider the partitioning scheme ˆ A obtained using ˆA. Let lg = g infz κ2αHm 1 g(A ˆA z). Then,

ν ˆ A = max{νA, lg}ρ h2 A , ρ ˆ A = ρA, η ˆ A ηA

C ˆ A = CA4mρηA h4 A , h2 = m + m log3 2 mκ2

1 dist2 (A, ˆA) , h4 = logρA ν ˆ A νA

Proof Sketch. We first relate the Sequ OOL parameters (ν, ρ) of the partitioning scheme ˆ A to the scheme A using Lemma 9.6, yielding expressions for (ν ˆ A, ρ ˆ A). Next, using the bound on the number of near-optimal cells in A (Definition 2.7), we upper bound the number of near-optimal cells in ˆ A by composing with the Lemma 4.6. The complete proof is provided in Appendix 11.18.

Theorem 4.8. For a function in the multi-index class (1), the regret of Sequ OOL applied on the partitioning scheme using ˆA returned by Algorithm 1 and ˆα =

1 dist2(A, ˆA) satisfies

γ(ν, ρ)ρ β h2ρ

β C1 n log n if ηP = 0,

γ(ν, ρ)ρ β h2 n log n β

ηP if ηP > 0,

where γ(ν, ρ) = max{max{ν, lf}ρ(1 β)(m 1) h1, lg},

C1 = 3ddd m(12 m)m Cρ ηP h34mρηP h4 and n = n/logn ηP log(1/ρ)/C1,

with h2 = m + m l log3 2 mκ2

κ2 1 m , h1 = d log3 mα and

1 dist2 (A, ˆA), where h3, h4 are from equations (3)

and (6) respectively.

When ηP > 0, Algorithm 1 has rn = O(n β/ηP) = O(n (1+ d m

2m 1 )/ηP) while default Sequ OOL would give O(n 1/ηP), showing our approach reduces the regret at a faster rate. The proof of this theorem follows from applying the Sequ OOL parameters derived in Corollary 9.7 to Theorem 5 of Bartlett et al. (2019). Detailed proof is in Appendix 11.20.

5. EXPERIMENTS

All implementation details, benchmark functions, and experiment scripts can be found at our Git Hub repository: https://github.com/raja-sunkara/ Learned-Partitions-Sequ OOL

We evaluate our algorithms against baselines from various optimization categories. These include Bayesian Optimization (REMBO (Wang et al., 2016), Hes BO (Nayebi et al., 2019)), Evolutionary Algorithms (CMA-ES (Hansen et al., 2003)), Dual Annealing (Pincus, 1970), Optimistic Optimization (SOO, Sequ OOL, RESOO (Qian & Yu, 2016), Di Rect), and Random Search. The optimization functions used in our experiments include Sphere, Rastrigin, Different Powers, Rosenbrock, Styblinski-Tang, Hartmann-6, Branin, Ellipsoid, Sharp Ridge, and the CUSTOM function defined as 1 + (x1 1)2 + Pd i=d m+2(xi 1)4.

In our experimental setup, we construct the multi-index function as f(x) = g(Ax), where x Rd, g : Rm R is a standard optimization test function, and A Rm d

is a randomly generated matrix satisfying AA = Im. The Branin and Hartmann-6 functions are defined in 2 and 6 dimensions respectively; thus, we choose m = 2 for Branin and m = 6 for Hartmann-6 when constructing the multi-index function. To further evaluate the efficacy of our algorithm, we conducted benchmarks on low-dimensional multi-index functions with d = 5 and m = 2. Additional experimental results are in Appendix 12.2.

Our algorithm demonstrates superior performance on the Rastrigin, Sphere and Styblinski-Tang functions, achieving

Adaptive Partitioning Schemes for Optimistic Optimization

Figure 3. Regret Plots: Algorithm 1 (Sequ OOL on ˆ A) uses 650 additional samples to learn the subspace. Regret is plotted for 100 equally spaced budget values between 1 and 2000. For the randomized algorithms, we took 10 trials and plotted the median curve (thick line) and 0 and 95 percentile curves. Random Search is run on ˆ A.

zero regret with fewer samples compared to other methods. On the different powers function, we perform comparable to the other methods, however, on the (x1 1)2 + (x72 1)4, Ellipsoid and Sharp Ridge function, our algorithm perform slightly worse than the RESOO. This may be attributed to the use of several (650) samples for the first stage in a 2000 budgeted experiment.

5.1. LLM Quantization

The AWQ (Lin et al., 2024) method for quantizing large language models formulates optimization problem as:

α = arg min α [0,1] L(sα X)

L(s) = Q(W s)(s 1 X) WX 2

Where X is the input features to the block which is cached from a calibration dataset. It uses the parameterization s = sα X, where s X is the activation scale computed from X and α [0, 1] and Q as the quantization function and W as the original weights (full-precision). To determine the optimal α , AWQ applies a 1D grid search over the interval

Adaptive Partitioning Schemes for Optimistic Optimization

[0, 1]. This parameter controls the scale of activations and influences quantization error.

Each layer of the LLM contains three primary components: the attention matrices (WQ, WK, WV , and WO: all four matrices share a single α per layer), the first fully connected layer (Wfc1: one α per layer), and the second fully connected layer (Wfc2:one α per layer). Consequently, this leads to three optimization parameters per layer, resulting in a total of 3M parameters to optimize across M layers. Each of these parameters is derived from separate optimization problems, all of which are solved through the grid search method in the interval [0, 1] to find the optimal value of α .

We propose a new approach which involves solving this LLM Quantization as high-dimensional black-box optimization problem. In our approach, we jointly optimize all layers to minimize perplexity: So, our approach has one optimization problem in 3M dimensional space, compared to AWQ which has 3M one-dimensional optimization problems. Let α = [α1, ..., α3M] represent the scales for all M layers, with each layer having three parameters. We define our proposed optimization problem as:

α = arg min α [0,1]3M PPL(α)

Where, PPL(α) is the perplexity on the calibration set after quantization using the scaling factors derived from α.

We evaluated our approach on the OPT-1.3B model (Zhang et al., 2022), with results presented in Table 2. Our proposed objective function using Sequ OOL over 72 dimensions outperformed AWQ, achieving lower perplexity on both Wiki Text-2 (Merity et al., 2016) and the calibration set (Pile dataset (Gao et al., 2020)). More details are in Appendix 12.3.

To quantify this improvement, we compare perplexity scores across methods. The perplexity of the unquantized model is 14.47, which we treat as a reference point for comparison. Quantization with the AWQ baseline increases perplexity to 16.92, resulting in a perplexity gap of 2.45. In contrast, our proposed method (Algorithm 2) achieves a perplexity of 16.68, reducing the gap to 2.21. This corresponds to a relative improvement of approximately 10% in perplexity degradation compared to AWQ, demonstrating the benefit of jointly optimizing the scaling factors across all layers.

6. DISCUSSION AND CONCLUSION

We have proposed an adaptive approach to learning a good partitioning scheme for black-box optimization. When the function belongs to a multi-index class, we prove that the adaptive partitioning scheme results in lower regret. To achieve this, we have developed a novel theoretical contribu-

Table 2. LLM Quantization Experiment Results

Algorithm Compute Wiki Text-2 Calibration Time PPL -Set PPL

Grid Search 9 hours 16.92 14.62 Sequ OOL 10 hours 16.83 14.28 Algorithm 1 10 hours 16.96 14.42 Algorithm 2 12 hours 16.68 14.29

tion by relating the near-optimality dimensions of different partitioning schemes for the same function. Empirically, we observe that our proposed approach is better or comparable to existing methods over a wide range of benchmark functions and can be applied to high-dimensional functions. For example, on the Rastrigin function, Sequ OOL achieved a regret of 4.22 10 5, while our method reached zero regret. Against RESOO, on both the Rastrigin and Styblinski-Tang functions, our method achieved zero regret compared to 5.5 10 3 for RESOO on Rastrigin. For Styblinski-Tang, our method reached zero regret with approximately 900 evaluations, versus 2000 for RESOO.

Some limitations of our approach are estimating the lowdimensional subspace dimension and using an upper bound on the subspace distance in our algorithm. When the multi index assumption is not valid, we see only a mild benefit over Sequ OOL, e.g., in the AWQ experiments, our proposed objective function has function variation in all the directions and improvement over Sequ OOL is minimal. In the twostage algorithm, the first stage chooses samples randomly for the subspace learning and these samples reduce the budget available for optimization.

There are several avenues for future work. We can extend our approach to the case when the function evaluations are noisy. We can consider other low-dimensional functions beyond the class of multi-index functions, e.g., functions with bounded second order variation in the Radon domain (Parhi & Nowak, 2022). Obtaining theoretical guarantees for direction selection strategy could be possible using active learning techniques, e.g., in (Mukherjee et al., 2022).

Adaptive Partitioning Schemes for Optimistic Optimization

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 2246187. The authors thank the reviewers and meta-reviewers for their careful reading and insightful suggestions, which significantly improved the quality of the paper. Ardhendu Tripathy acknowledges the helpful discussions with Robert Nowak.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Bartlett, P. L., Gabillon, V., and Valko, M. A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption. In Garivier, A. and Kale, S. (eds.), Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pp. 184 206. PMLR, 22 24 Mar 2019. URL https://proceedings.mlr.press/v98/ bartlett19a.html.

Bubeck, S., Munos, R., Stoltz, G., and Szepesv ari, C. Xarmed bandits. Journal of Machine Learning Research, 12(5), 2011.

Chen, Y., Chi, Y., Fan, J., Ma, C., et al. Spectral methods for data science: A statistical perspective. Foundations and Trends in Machine Learning, 14(5):566 806, 2021.

Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local bayesian optimization. Advances in neural information processing systems, 32, 2019.

Fornasier, M., Schnass, K., and Vybiral, J. Learning functions of few arbitrary linear parameters in high dimensions. Foundations of Computational Mathematics, 12: 229 262, 2012.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. ar Xiv preprint ar Xiv:2101.00027, 2020.

Grill, J.-B., Valko, M., and Munos, R. Black-box optimization of noisy functions with unknown smoothness. Advances in Neural Information Processing Systems, 28, 2015.

Hansen, N. The cma evolution strategy: A tutorial, 2023. URL https://arxiv.org/abs/1604.00772.

Hansen, N., M uller, S. D., and Koumoutsakos, P. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation, 11(1):1 18, 2003.

Jones, D. R. and Martins, J. R. The direct algorithm: 25 years later. Journal of global optimization, 79(3):521 566, 2021.

Jones, D. R., Perttunen, C. D., and Stuckman, B. E. Lipschitzian optimization without the lipschitz constant. Journal of optimization Theory and Applications, 79:157 181, 1993.

Kim, B., Lee, K., Lim, S., Kaelbling, L., and Lozano-P erez, T. Monte carlo tree search in continuous spaces using voronoi optimistic optimization with regret bounds. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 9916 9924, 2020.

Letham, B., Calandra, R., Rai, A., and Bakshy, E. Re-examining linear embeddings for high-dimensional bayesian optimization. Advances in neural information processing systems, 33:1546 1558, 2020.

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87 100, 2024.

Loshchilov, I. and Hutter, F. Cma-es for hyperparameter optimization of deep neural networks. ar Xiv preprint ar Xiv:1604.07269, 2016.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ar Xiv preprint ar Xiv:1609.07843, 2016.

Mousavi-Hosseini, A., Park, S., Girotti, M., Mitliagkas, I., and Erdogdu, M. A. Neural networks efficiently learn lowdimensional representations with SGD. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=6taykzqc PD.

Mukherjee, S., Tripathy, A. S., and Nowak, R. Chernoff sampling for active testing and extension to active regression. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 7384 7432. PMLR, 28 30 Mar 2022. URL https://proceedings.mlr.press/ v151/mukherjee22a.html.

Adaptive Partitioning Schemes for Optimistic Optimization

Munos, R. Optimistic optimization of a deterministic function without the knowledge of its smoothness. Advances in neural information processing systems, 24, 2011.

Munos, R. From bandits to monte-carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1 129, 2014. ISSN 1935-8237. doi: 10. 1561/2200000038. URL http://dx.doi.org/10. 1561/2200000038.

Nayebi, A., Munteanu, A., and Poloczek, M. A framework for bayesian optimization in embedded subspaces. In International Conference on Machine Learning, pp. 4752 4761. PMLR, 2019.

Nomura, M., Watanabe, S., Akimoto, Y., Ozaki, Y., and Onishi, M. Warm starting cma-es for hyperparameter optimization. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp. 9188 9196, 2021.

Parhi, R. and Nowak, R. D. Near-minimax optimal estimation with shallow relu neural networks. IEEE Transactions on Information Theory, 69(2):1125 1140, 2022.

Pincus, M. A monte carlo method for the approximate solution of certain types of constrained optimization problems. Operations research, 18(6):1225 1228, 1970.

Qian, H. and Yu, Y. Scaling simultaneous optimistic optimization for high-dimensional non-convex functions with low effective dimensions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Salgia, S., Vakili, S., and Zhao, Q. A domain-shrinking based bayesian optimization algorithm with orderoptimal regret performance. Advances in Neural Information Processing Systems, 34:28836 28847, 2021.

Shekhar, S. and Javidi, T. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829 3874, 2018.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. W. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE transactions on information theory, 58(5):3250 3265, 2012.

Vaikuntanathan, V. Csc 2414 lattices in computer science. https://people.csail.mit.edu/ vinodv/COURSES/CSC2414-F11/L1.pdf, 2011. [Online; accessed 19-July-2008].

Valko, M., Carpentier, A., and Munos, R. Stochastic simultaneous optimistic optimization. In International Conference on Machine Learning, pp. 19 27. PMLR, 2013.

Wang, L., Fonseca, R., and Tian, Y. Learning search space partition for black-box optimization using monte carlo tree search. Advances in Neural Information Processing Systems, 33:19511 19522, 2020.

Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and De Feitas, N. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361 387, 2016.

Xiang, Y., Gubian, S., Suomela, B., and Hoeng, J. Generalized simulated annealing for global optimization: the gensa package. R J., 5(1):13, 2013.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. ar Xiv preprint ar Xiv:2205.01068, 2022.

Adaptive Partitioning Schemes for Optimistic Optimization

7. Notation

Let N0 = N {0} denote the set of non-negative integers and [d] represents the set {1, 2, . . . , d}. For vectors and matrices, we use ( ) to denote the transpose. For a d-dimensional vector x = [x1, . . . , xd] , we use x p and x to denote its ℓp and ℓ norm respectively. For any matrix X = [x1, . . . , xn], X 2 and X F represent its spectral and Frobenius norms. We use σi(X) for the ith largest singular value, with σmax(X) and σmin(X) denoting the largest and smallest singular values. Suppose A Rm d, with AA = Im, then for any vector v Rd, we denote its orthogonal projection onto the span of the rows of A as v = A Av, with the orthogonal component given by v = v v . For a matrix W Rp d, we denote W and W as the projections applied to each row, i.e., W = WA A and W = W W . For a given scalar κ > 0, we denote κX as the set {κx : x X}. By a partitioning scheme P with κ, we mean the domain for the partitioning scheme is the set {κx : x [ 1, 1]d}. By the default partitioning scheme P, we mean partitioning scheme P with κ = 1. To describe side-length of a hyper-rectangle, we use the notation [3 h+m i

m ]m i=1. This expressions represents a vector of m components, where each component is given by 3 h+m i

m , with i ranging from 1 to m.

8. Omitted details for Section 3

A single hidden layer neural network with p hidden neurons maps input x Rd to the scalar

ˆy(x, W, a, b) =

i=1 aiσ(w i x + bi), (7)

where σ is the non-linear activation function, W is the hidden layer weight matrix consisting of p weight vectors denoted as wi, bi is the scalar bias for the ith hidden neuron, and ai are the components of the output layer weight vector. The following proposition shows that the class of single hidden layer neural networks can represent the important subspace of multi-index functions. Proposition 8.1. Consider a function f(x) = Pp i=1 viσ(w i x + bi), where x Rd and σ is a non-linear function. Let x = Px + (I P)x, where P is the projection matrix that maps any x Rd to Span{w1, w2, . . . , wp}. Then f(x) = Pp i=1 viσ(w i Px + bi). And if x, x Rd are such that (x x ) Span{w1, w2, . . . , wp}, then f(x) = f(x ).

We use SGD with weight decay to fit the neural network on the function evaluations obtained at uniform random locations in X. The top m right singular vectors of the learned weight matrix of the hidden layer is used as the estimated ˆA for obtaining the partitioning scheme ˆ A.

Algorithm 3 describes our lookahead direction selection strategy τh( ˆf).

Querying outside the domain and optimizing ˆf In practice, with our optimization domain set as X = [ 1, 1]d, we

may encounter situations during low-dimensional subspace optimization where t [ α, α]m results in ˆA t / X. To ensure that f can be evaluated in all cases, we employ a two-step approach. First, we attempt to solve the optimization problem: arg mintc Rd m ˆA c tc 2 subject to ˆA t + ˆA c tc X, where ˆAc consists of the remaining d m columns

of ˆA that are not in ˆA. If this optimization problem has no feasible solution, we then employ Euclidean projection onto

X: arg minx X x ˆA t 2. This projection method is applied whenever ˆA t / X, ensuring that we always have a

valid point within our optimization domain.In practice, we estimate the minimum of ˆf on the domain Th+1 using random sampling or any of the other black-box optimization algorithms, since ˆf is cheap to evaluate and gradients are also available.

9. Omitted details for Section 4

Proofs are in Section 11.

First, we start with relating the default partitioning scheme P with A partitioning scheme. To establish the relationship between the near-optimality dimensions of the two schemes, we first need to compare the parameters (ν, ρ) of Sequ OOL across both partitioning schemes. This requires a characterization of the cells A hand P h. The following proposition provides this characterization. Proposition 9.1. Let κ > 0 and α h Rm be the representative of the A h cell containing a global maxima of the function. Using fraction of two vectors to denote component-wise division,

Adaptive Partitioning Schemes for Optimistic Optimization

Algorithm 3 Implementing lookahead direction selection strategy τh( ˆf)

Require: Current partition tree T , height h, estimated function ˆf

ˆxh arg maxi f(xh, i), representative of cell T h with the largest function value at height h T h cell at height h in T whose representative is ˆxh , axis to split 0, minimum for i 1 m do

T h + 1 child cell of T h after trisecting axis i and having the same representative ˆx h temp Compute minimum of ˆf on the domain Th+1 if temp minimum then

axis to split i minimum temp end if end for return axis to split

A h = {A α : α Rm, α α h s

α with s = [3 h+m i

m ]m i=1}. (8)

Similarly, if x h Rd is the representative of the P h cell containing the same global maxima,

P h = {x : x Rd, x x h c

κ with c = [3 h+d i

d ]d i=1}. (9)

Proposition 2.5 shows that we can use A partitioning scheme to perform optimization. We now relate the ν and ρ parameters (see Assumption 2.6) of the partitioning schemes P and A.

To establish relationships between the parameters of the partitioning schemes A and the default scheme P, we need to connect the sets P h and A h. The following lemma provides this connection:

Lemma 9.2. Let the domain for P be κHd 1 = {κx | x Hd 1} with κ 3 log3 mα . Suppose x P h is such that x = A Ax . Then, i [1 : m 1], k N0 we have that A km+i P kd+i.

Having established the relationship between the star cells of the partitioning schemes A and P, we can now proceed to relate their respective parameters. The following two lemmas establish these relationships.

This lemma connects the parameters (ν, ρ) of the partitioning scheme P with κ 3 log3 mα to the parameters (νA, ρA) of the scheme A. Lemma 9.3. Let the parameters for the partitioning schemes P, A be (ν, ρ), (νA, ρA) respectively. If Lemma 9.2 is applicable and P satisfies Assumption 2.6. Then we have that νA = νρ(1 β)(m 1), ρA = ρβ where β = 1 + d m

This lemma establishes the relationship between the parameters (νA, ρA) of scheme A and (ν, ρ) of the default partitioning scheme P. Lemma 9.4. The parameters (νA, ρA), (ν, ρ) associated with partitioning schemes A and P with κ = 1. Let lf = f infx κ1Hd 1 f(x). Then

ρA = ρβ, νA = max{ν, lf}ρ(1 β)(m 1) h1

where h1 = d log3 κ1 , β = 1 + d m

2m 1 and κ1 = 3 log3 mα .

The previous lemmas established relationships between the parameters of the partitioning schemes A and P. They demonstrate that A is a valid partitioning scheme with a reduced sequ OOL parameter ρA compared to the default partitioning scheme ρ.

9.1. Using ˆ A defined over the estimated ˆA

We obtain ˆA using a subspace learning algorithm with evaluations of f at selected points in X. We then use ˆA to define a partitioning scheme ˆ A (as per Definition 2.3) and apply Sequ OOL to it. The impact of using an estimated matrix ˆA instead

Adaptive Partitioning Schemes for Optimistic Optimization

of the true matrix A in our optimization problem can be quantified using subspace distance. The following is the definition of the subspace distance.

Definition 9.5. ((Chen et al., 2021, Lemma 2.5)) Let A, ˆA Rm d consists of orthonormal rows such that AA = Im and ˆA ˆA = Im. Define two more matrices A , ˆA R(d m) d such that A A and h ˆA ˆA

orthonormal bases for Rd. Then the subspace distance between the two row subspaces (A, ˆA) is given by

dist (A, ˆA) = A A ˆA ˆA 2 = ˆA A 2 = A ˆA 2 = sin Θ(A, ˆA) 2 , (10)

where 2 denotes the spectral norm, and sin Θ is a diagonal matrix of {sin(arccos(σi)) : i = 1, 2, . . . , m} where σi are the singular values of ˆAA in decreasing order.

Moreover, we have the following equality:

σmin(ˆAA ) = cos θm = q

1 sin2 θm =

1 sin Θ(A, ˆA) 2

1 dist2(A, ˆA) (11)

We observe that assuming dist(A, ˆA) < 1 implies that σmin(ˆAA ) = 0 and rank(ˆAA ) = m.

Given that we only have an estimate ˆA of the true matrix A, and we perform optimization on the subspace spanned by ˆA, Lemma 4.5 guarantees that we can use ˆA and recover f . Now, we relate the Sequ OOL parameters between the partitioning schemes A and ˆ A partitioning schemes.

Lemma 9.6. Let A and ˆ A be the partitioning schemes defined in Definition 2.3. Suppose A satisfies Assumption 2.6 with parameters νA, ρA. Let lg = g infz κ2αHm 1 g(A ˆA z). Then ˆ A satisfies Assumption 2.6 with parameters

ν ˆ A = max{νA, lg}ρ h2 A , ρ ˆ A = ρA

where h2 = m + m l log3 2 mκ2

κ2 1 m and κ2 = m

1 dist2 (A, ˆA).

The above lemma demonstrates that the partitioning scheme ˆ A is valid and satisfies Assumption 2.6 and it establishes the relationship between the Sequ OOL parameters of the ˆ A and A partitioning schemes.

This corollary synthesizes the relationships established in the preceding lemmas, providing a direct comparison between the estimated scheme ˆ A and the default partitioning scheme P. It combines the two-step process of relating A to P and then ˆ A to A, yielding a comprehensive set of relationships for the Sequ OOL parameters, near-optimality dimensions, and associated constants.

Corollary 9.7. Referring to Lemma 4.4, for the partitioning scheme P with κ = 1, we have ρA = ρβ, νA = max{ν, lf}ρ(1 β)(m 1) h1, ηA(νA, ρA, CA) ηP(ν,ρ,C)

β and CA = 3ddd m(12 m)m Cρ ηP h3. By utilizing

Lemma 9.6 and Lemma 4.7, we establish the following relationships among the parameters associated with ˆ A and P.

ρ ˆ A = ρβ, ν ˆ A = max{max{ν, lf}ρ(1 β)(m 1) h1, lg}ρ h2 A ,

η ˆ A(ν ˆ A, ρ ˆ A, C ˆ A) ηP(ν, ρ, C)

β , C ˆ A = 3ddd m(12 m)m Cρ ηP h34mρηP h4.

with h2 = m + m l log3 2 mκ2

κ2 1 m and h1 = d l log3 3 log3 mα m with κ2 = m

1 dist2 (A, ˆA)

h3 = j logρ(max{ν, lf}ρ(1 β)(m 1) h1) logρ(ν) k , h4 = j logρA(max{νA, lg}ρ h2 A ) logρA(νA) k

(Mousavi-Hosseini et al., 2023), controls the closeness between true and estimated subspace expressed in terms of W F . However, for out Algorithm 2, we require an upper bound on the distance between the true subspace A and its estimate ˆA. This lemma bridges this gap by providing an upper bound on dist (A, ˆA) in terms of W F and the singular values of W.

Adaptive Partitioning Schemes for Optimistic Optimization

Lemma 9.8. Given A Rm d satisfying AA = Im, let W Rp d be any matrix such that rank(W) m and d m. Consider the singular value decomposition of W = USV and collect the top m right singular vectors in the matrix ˆA = v1 v2 vm , where vi is the ith column of V. Recollect the definition of sin Θ(A, ˆA) and dist (A, ˆA) from Definition 9.5 and the notation of W from section 7. Using F to denote Frobenius norm and σm to denote the mth singular value of W, we have that

dist (A, ˆA) = sin Θ(A, ˆA) 2 W F

9.2. Supporting lemmas

The following lemmas and assumptions are needed to obtain guarantees on learning a good estimate ˆA.

Assumption 9.9. (Mousavi-Hosseini et al., 2023) The student model is a two-layer neural network Equation (7) trained over the data set {(x(i), y(i))}i 1, where the target values y(i) are generated according to the teacher model Equation (1)

and the inputs satisfy x(i) iid N(0, Id). The link function g( ) is weakly differentiable.

Assumption 9.10. (Mousavi-Hosseini et al., 2023) For all 1 i m, 1 j d, we initialize the NN weights and biases with

d W0 ij iid N(0, 1), ma0 i iid Unif([ 1, 1]), and b0 i iid Unif({ 1, 1})

Lemma 9.11 ((Mousavi-Hosseini et al., 2023), Theorem 3). Consider running T SGD iterations over samples satisfying Assumption 9.9, with an initialization satisfying Assumption 9.10, and using the following decaying step size schedule. Assuming Re LU non-linearity, let ζ := 2 p

2/eπ. Choose the decreasing step size ηt = m 2(t+t )+1

γ(t+t +1)2 , λ γ +ζ and t λ γ for any γ > 0. Then, for λ = λ m, with probability at least 1 δ,

(d + log(1/δ)

whenever m log(1/δ) and T λ2 d+log(1/δ).

The following Lemma is to control the subspace distance using compressed sensing algorithm.

Theorem 9.12 ((Fornasier et al., 2012), Theorem 4.1). Let f(x) = g(Ax) be a function where A is a k d matrix with orthonormal rows, and g is a twice continuously differentiable function. Assume that Hf = A Hg A is well-conditioned with σk(Hf) α > 0. Let ˆA be the matrix obtained from the Dantzig Selector approximation ˆX of the matrix X of gradients of f at m X random points. Then, with high probability, the distance between the subspaces spanned by the rows of A and ˆA is bounded by: A A ˆA ˆA F 2ν2 p

ν2 = Ck1/q mΦ log(d/mΦ)

1/2 1/q + ϵk2 mΦ

and C is a constant depending on the parameters C1 and C2 from the conditions on A and g, mΦ is the number of derivative directions, ϵ is the step size used in the finite difference approximation, d is the ambient dimension, and s (0, 1) is a parameter.

10. Illustrative experiments to motivate lookahead direction selection

Using a partitioning scheme with a lower near-optimality dimension can lead to a faster decrease in regret. Empirically, we observe that the regret for Sequ OOL applied for a budget of 200 evaluations of the function in the Example 2.11 was 5 10 10 for the default partitioning scheme and 5.8 10 12 for the direction selection strategy in Example (2.11). This example indicates that it can be beneficial to use a direction splitting strategy that adapts to the function being optimized.

Adaptive Partitioning Schemes for Optimistic Optimization

Figure 4. Optimal splitting ratio between the first and second directions. Green curve is obtained using the strategy in Definition 10.1

Additionally, we evaluate the regret for different choices of A, by parameterizing A = cos θ sin θ sin θ cos θ and changing the rotation angle from 0 to π/8 while keeping the direction selection strategy the same as in Example (2.11). Figure 5 shows that the regret varies significantly over the range of angles. This shows that minimizing the angle of discrepancy between A and the true directions of variation (which are the standard x1 and x2 axes in this example) is beneficial in reducing the regret.

Given a particular A and function f, we consider the question of identifying an appropriate direction selection strategy that minimizes NA(3νAρh A) at all heights. The two example strategies for f(x1, x2) = 1 |x1| x2 2 we have seen so far are the default round-robin (equivalently 1:1) splitting and the 2:1 splitting in Example(2.11). For an A with θ = π/48, we minimize the number of near-optimal cells NA(3νAρh A) at different heights by choosing the best splitting ratio at each height and plot the ratios as the blue line in Figure 4. We see that as the height increases to infinity the optimal split ratio converges to 1. However, at lower heights, the optimal split ratio is greater than 1 and takes its maximum value 1.83 at h = 3. Definition 10.1. Lookahead strategy for direction selection. Given an estimated ˆf and the current tree of partitions till depth h, the lookahead strategy first evaluates the different number of near-optimal cells for ˆf at depth h + 1 by splitting along each of the m directions. It then greedily selects the direction to be split at h + 1 as the direction that results in the lowest number of near-optimal cells.

Figure 5. Regret at n = 300 for Sequ OOL on A with varying θ.

In a numerical experiment, we see that the lookahead strategy closely matches the optimal split ratio (shown as green points in Fig 4) over all heights. This also results in it having a low regret than the default partitioning scheme. Empirically, we observe that the regret for Sequ OOL applied for a budget of 500 evaluations of the function in the previous lemma was 5.68 10 10 for the partitioning scheme A using the lookahead strategy for direction selection, 1.98 10 5 for A with 1:1 splitting, and 1.8 10 4 for A with the 2:1 splitting strategy from Example (2.11).

Some of the facts which we use in our proofs.

Key inequalities for the matrix norms include: x x 2 x 1. Holder s inequality, applicable for p, q 1 where 1

q = 1, states that |x y| x p y q. The triangle inequality, valid for any p 1, asserts that x + y p x p + y p.

Matrix Norms: For a matrix A Rm n, the operator norm is:

A p = sup x =0

When p = , A = max1 i m Pn j=1 |aij| and when p = 2, A 2 = σmax(A), Where σmax(A) represents the

largest singular value of matrix A. Additionally, the Frobenius norm is given by A F = q Pm i=1 Pn j=1 |aij|2.

For a matrix A Rm n of rank r, the following inequalities hold:

A 2 A F r A 2 , 1 n A A 2 m A (13)

Adaptive Partitioning Schemes for Optimistic Optimization

For the operator norm 2, one has

AB 2 A 2 B 2 , AB 2 A 2 σmin(B), AB 2 B 2 σmin(A) (14)

Suppose a matrix A consists of orthonormal rows or columns, then A 2 = A 2 = 1

11.1. Proof of Example 2.8

Proof. Consider the function f(x1, x2) = g(Ax) = 1 |x1| with A = [1, 0] and g(z) = 1 |z|. Let P, A be the partitioning schemes defined in Definition 2.2 with κ = 1 and parameters (ν, ρ), (νA, ρA) respectively. For the P partitioning scheme, along the X axis, the side lengths of the children at depth h are given by 3 h/2 and along the Y axis, it is 3 h/2 . Consider f(x 1, x 2) f(x1, x2) = |x1|, and the cell with the representative origin is the P h cell, therefore

|x1| = 3 h/2 3 h/2 (1/

According to Assumption 2.6, the appropriate values are ν = 1, ρ = 1/

3. Now consider a rectangle region R with corners {( 3νρh, 1), (3νρh, 1), (3νρh, 1), ( 3νρh, 1)}. Since f f(x1, x2) = |x1|, (x1, x2) R, f f(x1, x2) 3νρh. Thus any cell in P that has a non-empty intersection with R is a near-optimal cell. Each cell at depth h has an area of 3 h/2 3 h/2 , therefore

NP(3νρh) Area(R) Area(Ph,i) = 4(3(1/

3)h) 3 h/2 3 h/2 ,

yielding that since NP(3νρh) = Ω(ρ h) , from the Definition 2.7 we get ηP = 1.

For the A partitioning scheme, we first note that α = 1. Along X axis, the side length of the children at depth h is 3 h, therefore f(x 1, x 2) f(x1, x2) = |x1| = 3 h (1/3)h, yielding the values are νA = 1, ρA = 1/3. Now consider a line segment L with endpoints {( 3νAρh A, 0), (3νAρh A, 0)}. Since f f(x1, x2) = |x1|, (x1, 0) L, f f(x1, x2) 3νAρh A. Thus any cell in A that has an intersection with the L is a near-optimal cell. Every Ah,i cell at depth h has a length of 3 h, therefore,

NA(3νAρh A) 2 + len(L) len(A(h, i)) = 2 + 2(3(1/3)h)

where the additional term 2 accounts for cells in A that partially intersect L at its endpoints. Hence NA(3νAρh A) is a constant and ηA = 0.

11.2. Proof of Example 2.9

Proof. Consider the function g(k) = 1 |k| with A = [1, 1]. Then, f(x1, x2) = g(Ax) = 1 |x1 + x2|. For the P partitioning scheme with κ = 1, along the X axis, the side lengths of the children at depth h along the X, Y axes are 3 h/2 , 3 h/2 respectively. Consider f(x 1, x 2) f(x1, x2) = |x1 + x2|, and the cell with the representative origin is the P h cell, therefore |x1 + x2| = 3 h/2 + 3 h/2 2 3 (h 1)/2 2

3)h. According to Assumption 2.6, ν = 2

3. Consider a rhombus region T formed by coordinates {( 3νρh, 0), (0, 3νρh), (3νρh, 0), (0, 3νρh)}. Since f f(x1, x2) = |x1+x2|, (x1, x2) T, f f(x1, x2) 3νρh. This says, any cell in P that has an intersection with the T is near-optimal cell. Each cell at depth h has an area of 3 h/2 3 h/2 , therefore,

NP(3νρh) 1 + 3νρh

Hence NP(3νρh) is independent of h and ηP = 0.

For the A partitioning scheme, along the X axis and Y axis, the side lengths of the children at depth h is given by 3 h and 3 h, giving that f(x 1, x 2) f(x1, x2) = |x1 + x2| = 2 3 h 2(1/3)h.

According to Assumption 2.6, the values are νA = 2, ρA = 1/3.

Further, we see from Definition 2.4 that α =

2 in this example.

Adaptive Partitioning Schemes for Optimistic Optimization

11.3. Proof of Example 2.11

Proof. For the P partitioning scheme, the side lengths of the children at depth h are as follows: along the X axis, 3 h/2 . Along the Y axis, it is 3 h/2 . Consider f(x 1, x 2) f(x1, x2) = |x1| + x2 2, and the cell with the representative origin is the P h cell, therefore

|x1| + x2 2 = 3 h/2 + 3 2 h/2 3 h/2 + 3 2 h/2

2 3 h/2 2 3 (h 1)/2 2

According to Assumption 2.6, the appropriate values are ν = 2

3 and νρh = 2

Consider a region R which is given by |x1|+x2 2 3νρh. Since f f(x1, x2) = |x1|+x2 2, (x1, x2) R, f f(x1, x2) 3νρh. Thus any cell in P that has a non-empty intersection with R is a near-optimal cell. Each cell at depth h has an area of 3 h/2 3 h/2 , therefore

NP(3νρh) Area(R) Area(Ph,i)

Now, we compute the area of the region formed by the curve |x1| + x2 2 = 3νρh which is given by

Z 3νρh x2 2

0 dx1dx2 = 8/3(3νρh) 3/2

NP(3νρh) Area(R) Area(Ph,i) = 8/3(6

3 h/2 3 h/2 = Ω ( 1

Hence, ηP 0.5

The P h cell contains the point (0, 0). Since P is an axis-aligned partitioning scheme, we can bound the number of cells directly above, i.e., having the same x coordinate of their representative as that of, P h by the value p

3νρh/3 h/2 . In a similar manner, we can bound the maximum number of cells having their representative s y coordinate to be the same as that of P h by 3νρh/3 h/2 . Since P is an axis aligned partitioning scheme, the previous two bounds imply that number of near optimal cells are upper bounded by the product of number of cells along x axis times number of cells along y axis. Thus,

Thus, ηP 0.5, hence ηP = 0.5

For the A partitioning scheme, the side lengths of the children at depth h are as follows: along the X axis, 3 (h+1) if h is odd; otherwise, 3 h. Along the Y axis, it is 3 h/2 . Let h is even. consider f(x 1, x 2) f(x1, x2) = |x1| + x2 2 and the cell with the representative origin is the A h cell, therefore

|x1| + x2 2 = 3 h + 3 2 h/2 3 h + 3 2(h 1)/2 4 3 h

According to Assumption 2.6, we have νA = 4, ρA = 1/3, and 3νAρh A = 12 3 h.

The A h cell contains the point (0, 0). Since A is an axis-aligned partitioning scheme, we can bound the number of cells directly above, i.e., having the same x coordinate of their representative as that of, A h by the value p

3νρh/3 h/2. In a similar manner, we can bound the maximum number of cells having their representative s y coordinate to be the same as that

Adaptive Partitioning Schemes for Optimistic Optimization

of P h by 3νρh/3 h. Since P is an axis aligned partitioning scheme, the previous two bounds imply that number of near optimal cells are upper bounded by the product of number of cells along x axis times number of cells along y axis. Thus,

NA(3νAρh A)

'! 1 + 3 4 3 h

When h is odd, following the same steps will give NA(νAρh A) 148.

Hence, NA(νAρh A) is a constant and ηA = 0.

11.4. Proof of Proposition 2.5

Proof. We claim that z = Ax , where x X is the optimizer of f. This choice of z satisfies f(A z ) = f(x ) as required in the Lemma. we will now show that the infinity norm of this z is less than or equal to α.

First, observe that for any x [ 1, 1]d, we can express x as a convex combination of corner points:

j=1 cjαj, αj 0,

j=1 αj = 1 (15)

Now, let s consider the infinity norm of Ax :

(Using Equation 15)

j=1 Acjαj (Triangle Inequality of Norms)

j=1 |αj| Acj (Absolute Homogeneity property of Norms)

j=1 |αj| α (From the definition of α in the Proposition statement)

Therefore, we have shown that z = Ax α, which completes the proof.

11.5. Proof of Proposition 9.1

Proof. Let x h be the representative point (midpoint) of the P h cell and x is within the P h cell.

Let x be a point in a hyperrectangle centered at x h with side lengths 2κc, where c is a vector of side lengths. Then, this point x satisfies |xi x h,i| κci for all i {1, . . . , d}. Equivalently, this set of inequalities can be written as

maxi {1,...,d} xi x h,i ci

κ. Using the infinity norm, we can concisely express this conditions as x x h c κ, where the division is performed element-wise.

For the partition scheme, we perform trisection along each axis in a round-robin manner. After h iterations, the side lengths are given by:

ci = κ3 h+d i

d , i {1, . . . , d} (16)

where h+d i

d represents the number of trisections applied to dimension i. Therefore, the cell P h can be described as:

P h = x Rd : x x h c

κ with c = [3 h+d i

d ]d i=1. (17)

Adaptive Partitioning Schemes for Optimistic Optimization

From the Definition 2.3, we have A h {A α : α T h } and

T h = α Rm : α α h s

α with s = [3 h+m i

m ]d i=1. (18)

A h = A α : α Rm, α α h s

α with s = [3 h+m i

m ]d i=1. (19)

11.6. Proof of Lemma 9.2

Proof. Following the Definition 2.2, consider the partitioning schemes P and A.

According to Proposition 9.1, we can express the cell A h as:

A h = {A α : α Rm, α α h s

α with s = [3 h+m i

At depth h = km + i, for i [1 : m 1] and k N0, the side lengths simplifies to: s = [3 1(j i) k]m j=1 Similarly,

P h = {x : x Rd, x x h c

1} with c = [3 h+d i

At depth h = kd + i, for i [1 : d 1] and k N0, the side lengths for P h become: c = [3 1(j i) k]d j=1.

And let us consider another partitioning scheme G = κP. Let us denote x h to be the representative of the G h cell. Using Proposition 9.1, the cell G h can be written as

G kd+i = {x : x Rd, x x kd+i c

κ} with c = [3 1(j i) k]d j=1

Consider an element x = A α A km+i. we now proceed with the following chain of inequalities:

x x kd+i x x + x x kd+i (Vector Norm property)

= A α A α + x x kd+i (x = A α )

= A (α α ) + x x kd+i

A α α + x x kd+i (Matrix Norm definition)

m3 kα + 3 kκ (From the Matrix Inequality 13)

2 3 kκ ( From the Lemma statement, κ 3 log3 mα )

This sequence of inequalities demonstrates that x G (k 1)d. Moreover, we know that G (k 1)d G kd+i. Therefore,

partitioning scheme P with κ 3 log3 mα will satisfy:

A km+i P kd+i i [1 : m 1], k N0

Adaptive Partitioning Schemes for Optimistic Optimization

11.7. Proof of Lemma 9.3

Proof. Since Lemma 9.2 is assumed to be applicable, consider a partitioning scheme P with κ = 3 log3 mα and suppose f 1 = supx κHd 1 f(x). Let P satisfies Assumption 2.6, then, there exist constants ν and 0 < ρ < 1 such that

sup x P h (f 1 f(x)) νρh. h N0 (20)

Since κ 3 log3 mα , we can incorporate Lemma 9.2, which gives,

A km+i P kd+i i [1 : m 1], k N0

Therefore, we get sup x A km+i (f f(x)) sup x P kd+i (f f(x)) i [1 : m 1], k N0 (21)

Combining Equations (20) and (21), and using the fact f f 1 , we have

sup x A km+i (f f(x)) sup x P kd+i (f f(x)) νρkd+i ν(ρ kd+i km+i )km+i (22)

Therefore, we have

sup x A km+i (f f(x)) ν(ρ kd+i km+i )km+i i [1 : m 1], k N0 (23)

Consider ρA = ρβ where

β = min kd + i

i [1 : m 1], k N (24)

Then the sequence of inequalities supx A km+i(f f(x)) ν(ρβ)km+i i [1 : m 1], k N ensures that (23) is

satisfied for all heights h m. We show that β = d+m 1

2m 1 is the solution to (24). We consider the sequence for k = 1:

m + i, i [1, m 1]

First, we show that this sequence is decreasing. For any i in the range [2, m 2], consider the difference between consecutive terms:

ti+1 ti = d + (i + 1)

m + (i + 1) d + i

m + i = m d (m + i)(m + i + 1)

Since m > d, we have m d > 0, but the denominator (m + i)(m + i + 1) is positive. Thus, ti+1 < ti, which says that sequence ti is decreasing and the minimum value of ti occurs at i = m 1:

tm 1 = d + (m 1)

m + (m 1) = d + m 1

We now consider the general form for any k N:

β = min kd + i

i [1 : m 1]

With the observation that (k + 1)d + m 1 (k + 1)m + m 1 kd + m 1

we get β = (d + m 1)/(2m 1). With this choice of β we have

sup x A km+i (f f(x)) ν(ρ kd+i km+i )km+i ν(ρβ A)km+i i [1 : m 1], k N (25)

Adaptive Partitioning Schemes for Optimistic Optimization

Now, we choose νA to satisfy (23) for the first m 1 heights. From Equation 22,

sup x A i (f f(x)) νρi = νρi βiρβi i [1 : m 1] (26)

νρ(1 β)(m 1)ρβi i [1 : m 1] (27)

Therefore, νA = νρ(1 β)(m 1) and ρA = ρβ will satisfy the Assumption 2.6 for the A partitioning scheme.

11.8. Proof of Lemma 9.4

Proof. Consider partitioning schemes P (κ = 1) and G (κ = κ1 = 3 log3 mα ) with star cells:

P h = {x : x Rd, x x h c

1}, G h = {x : x Rd, x x h c

. where c = [3 h+d i

At height h1 = d log3 mα , there are log3 mα divisions along each of the d-axis. Since at each division the cell is divided into three parts, after log3 mα divisions, side length along any axis becomes κ1/3 log3 mα = 1. Since we are focusing on the x which lies in the domain P 0, we have G h1 = P 0. Subsequently, we deduce that G h+ h1 P h.

sup x G h1+h (f f(x)) = sup x P h (f f(x)) h N0 (28)

By lemma assumption P satisfies Assumption 2.6, with parameters (ν, ρ). Therefore we have

sup x P h (f f(x)) νρh h N0

sup x G h1+h (f f(x)) νρh = νρ h1ρ h1+h h N0 (29)

For depths h [1 : h1 1]:

sup x G h (f f(x)) f inf x κ1Hd 1 f(x)

(f inf x κ1Hd 1 f(x))ρ h1ρh

Using inequality 29 and the above inequality, we conclude that G satisfies Assumption 2.6 with parameters (ρ, max{ν, f infx κ1Hd 1 f(x)}ρ h1).

By Lemma 9.3 and κ 3 log3 mα , we conclude:

ρA = ρβ, νA = max{ν, f inf x κ1Hd 1 f(x)}ρ(1 β)(m 1) h1

11.9. Proof of Lemma 4.2

Proof. Let B (x, r) be the open ball corresponding to the closed ball B(x, r). We will first demonstrate that B (ci, κ ) B (cj, κ ) = i = j. Suppose there exists a point y B (ci, κ ) B (cj, κ ), then:

ci cj ci y + cj y < 2κ .

Adaptive Partitioning Schemes for Optimistic Optimization

However, for lattice points, we have ci cj 2κ , contradicting the above inequality.

Next, we show that

B(ci, κ ) (κ + 2κ )Hm 1 ci C. (30)

For any y B(ci, κ ) we have that:

y y ci + ci κ + ci (since y B(xi, κ )

= κ + ci z + z (for some z B(0, κ) B(ci, κ ) = )

κ + ci z + z κ + κ + κ

Therefore, y (κ + 2κ )Hm 1 , which proves (30). Let N be the number of balls B(ci, κ ). Since B (xi, κ ) are disjoint and all these balls are contained in (κ + 2κ )Hm 1 , we have:

N Vol(B (xi, κ )) Vol((κ + 2κ )Hm 1 )

N (2κ )m (κ + 2κ )m,

giving that N 2 + κ

κ m. For the lower bound to N, since, B(0, κ) S

i B(ci, κ ), we have that

Vol(B(0, κ)) = (2κ)m Vol( [

i B(ci, κ ))

i=1 Vol(B(ci, κ )) = NVol(B(c0, κ )) = N(2κ )m.

Lemma 11.1. Let P = [ 1/

d]d, Prot = {Q x : x P}, where Q = A A is an orthonormal matrix, with A Rm d and A R(d m) d. Denote, Proj A (S) = {β Rd m : β = A x, x S Rd} and

α max max1 j 2d q 1 cj, , max1 j 2d q mcj, max1 j 2d q m+1cj, , max1 j 2d q d cj ,

where cj are corners of hypercube P. Then:

Vol(Proj A (Prot)) =

i=m+1 2α max,i and Vol(Proj A ([ 1, 1]d))

i=m+1 2α max,i.

Proof. The projection of Prot onto the subspace orthogonal to A is determined by:

Proj A (Prot) = n A y : y Prot o .

Let us denote α max[1 : m] and α max[m + 1 : d] as first m components and last d m components of the vector α max. Since Prot = {Q x : x P}, the corners of Prot are given by:

Crot = n A (α max[1 : m] s1) + A (α max[m + 1 : d] s2) : s1 { 1/

d}m, s2 { 1/

The volume of the projection onto the subspace orthogonal to A is:

Vol(Proj A (Prot)) =

i=m+1 2α max,i.

Adaptive Partitioning Schemes for Optimistic Optimization

Let y Prot. By definition, y = Q x for some x P. Then:

Using the properties of the norm and orthonormality of Q:

Since Q is orthonormal, Q = 1. Additionally, x 1/

d because x [ 1/

showing that y [ 1, 1]d. Hence, Prot [ 1, 1]d.

Since Prot [ 1, 1]d, the volume of the projection satisfies:

Vol(Proj A (Prot)) Vol(Proj A ([ 1, 1]d)).

Vol(Proj A ([ 1, 1]d))

i=m+1 2α max,i.

11.10. Proof of Lemma 4.3

Proof. Recollect the definition of the partitioning scheme T from Definition 2.3. T represents the equivalent partitioning scheme of A. For the partitioning scheme T , at a specific depth h, each cell can be indexed by i, i.e, Th,i 1 i 3h, where, 3h represents the total number of cells at depth h, and i is the index of a specific cell within that depth. Similarly, For the partitioning scheme P, at a depth h, each cell can be indexed by j, i.e, Ph,j 1 j 3h.

Definition of the POpt Relation Consider the following definition of the POpt relation:

POpt = {(Th,i, Ph,j) : Th,i NT (ϵ), Ph,j NP(ϵ), Th,i APh,j = }

The sets NT (ϵ) and NP(ϵ) denote near-optimal cells in the respective partitioning schemes.

Figure 6. Caption

For any h, i, consider the cell Th,i NT (ϵ). Let l be the lower bound on the number of elements in POpt that are of the form (Th,i, ). Then, we have: |POpt| |NT (ϵ)| l. Similarly, for any h, j, consider the cell Ph,j NP(3νρh). Let u be the upper bound on the number of elements in POpt that are of the form ( , Ph,j). Then, we have: |NP(ϵ)| u |POpt|. Combining these two inequalities, we get: |NT (ϵ)| |NP(ϵ)| u

This inequality provides a relationship between the near-optimal cells in the two partitioning schemes, taking into account the bounds on the number of elements in the POpt relation.

Adaptive Partitioning Schemes for Optimistic Optimization

11.10.1. ESTIMATING UPPER BOUND u

If Ph,j is a near-optimal cell, then there is a x Ph,j such that f(x) f ϵ. For a y = Ax we get g(y) = f(AT y) = f(x). If y Th,i, then Th,i is a near optimal cell and y Th,i APh,j. Thus the pair (Th,i, Ph,j) POpt. To obtain the upper bound u on the number of such pairs, suppose that all x Ph,j satisfy f(x) f ϵ. Then the entire region APh,j is near-optimal, and we calculate the maximum number of distinct Th,i that can be present in POpt.

We use Lemma 4.2 to obtain an upper bound to u. To apply the lemma, we first approximate the domain APh,j with a hypercube, which we use as κ and we obtain a lower bound on the side-lengths of Th,i cells which will be used as a κ . These approximations allows us to effectively utilize Lemma 4.2 in our analysis.

Using Proposition 9.1, the cell Th,i and Ph,j can be written as

Th,i = {α Rm, α αh,i

α with s = [3 h+m j

Ph,i = {x Rd, x xh,i

κ with c = [3 h+d j

For all x1, x2 Ph,i, consider Ax1 Ax2 , we have:

= Ax1 Ax2 A x1 x2 (from the Matrix operator norm definition)

m x1 x2 (From Matrix Inequality 13 and A 2 = 1)

= m x1 x2 + xh,i xh,i m x1 xh,i + m x1 xh,i 2 m3 k (Let h = kd + i)

Now, we estimate the side lengths for the Th,i cell, by computing s

m = kd + m + i j

m kd + m + d

m (k + 1) d

Hence 3 h+m j

m 3 (1+(k+1) d

We set κ = 2 m3 k, κ = α3 (1+(k+1) d

m ) and invoke Lemma 4.2, to derive the upper bound to u. Thus,

u 2 + 2 m3 k

α3 13 (k+1) d

2 + 6 m3d/m3k( d m

2 + 6 m3d/m3k( d m

m ) m (12 m)m3d3k(d m)

11.10.2. ESTIMATING LOWER BOUND l

Recall the definition of A from Definition 9.5. If Ah,i is a near-optimal cell, then there is a x Ah,i s.t. f(x) f 3νρh. Consider x = x + A y, where y Rd m. Then, f(x ) = g(Ax ) = g(Ax) = f(x). Hence the entire domain {x + A y | y Rd m, x + A y [ 1, 1]d} is a near optimal region. Suppose Ph,j has a not zero intersection with the above domain, then the cell Ph,j is a near-optimal cell. We thus count all these cells to get a lower bound l.

Adaptive Partitioning Schemes for Optimistic Optimization

n2 Vol(Proj A (P0,0))

Vol(Proj A (Ph,0))

Qd i=m+1 2(α max)i Qd i=m+1 2 max1 j 2d q i xj (Numerator is through Lemma 11.1)

Qd i=m+1(1/

d)2 max1 j 2d q i cj Qd i=m+1 2 max1 j 2d q i xj (Here cj are corners of hypercube [ 1, 1]d)

d)d m Qd i=m+1 max1 j 2d q i cj Qd i=m+1 3 k max1 j 2d q i cj

3 k(d m) (1/d)d m3k(d m).

Hence, the lower bound l = (1/d)d m3k(d m) Thus, using Equation 31, we conclude that

|NA(ϵ)| |NT (3νρh)| 3ddd m(12 m)m (32)

11.11. Proof of Lemma 4.4

Proof. To start, we recall the relationship between the parameters of the two partitioning schemes as established in Lemma 9.4. Specifically, we have:

ρA = ρβ, νA = max{ν, f inf x κ1Hd 1 f(x)}ρ(1 β)(m 1) h1

For brevity, we denote ν = max{ν, f infx κ1Hd 1 f(x)}.

From the definiton of near-optimality dimension for the P partioning scheme, we have:

NP(3νρh) Cρ ηP h (33)

Now, consider 3νAρh A:

= 3ν ρβh+(1 β)(m 1) h1

= 3νρβh+logρ(ν ρ(1 β)(m 1) h1) logρ(ν)

3νρh+logρ(ν ρ(1 β)(m 1) h1) logρ(ν)

3νρ h+ j logρ(ν ρ(1 β)(m 1) h1) logρ(ν) k

= 3νρh h3 (Denote h3 = j logρ(ν ρ(1 β)(m 1) h1) logρ(ν) k )

Now, consider

NA(3νA/ρ h A ) NA(3νρh h3) (3νAρh A 3νρh h3)

3ddd m(12 m)m NP(3νρh h3) (Using Lemma 4.3)

3ddd m(12 m)m Cρ ηP(h h3) (Using Inequality 33)

= 3ddd m(12 m)m CρηP h3ρ ηP h

Therefore, we have: NA(3νA/ρ h A ) 3ddd m(12 m)m CρηP h3ρ ηP h h h3

Adaptive Partitioning Schemes for Optimistic Optimization

For heights h [0 : h3 1], since the right-hand side quantity is monotonically increasing, we can use the value of the right-hand side at depth h = h3 in the above expression, which is 3ddd m(12 m)m C.

Therefore, we have

NA(3νA/ρ h A ) 3ddd m(12 m)m Cρ ηP h3ρ ηP h = 3ddd m(12 m)m Cρ ηP h3ρ ηP

This implies that ηA ηP /β, and the constant CA is given by 3ddd m(12 m)m Cρ ηP h3.

11.12. Proof of Lemma 4.5

Proof. Consider z = (A ˆA ) 1Ax , then

f( ˆA z ) = g(A ˆA z ) = g(A ˆA (A ˆA ) 1Ax )

= g(Ax ) = f(x ).

Now, we show that

z = (A ˆA ) 1Ax

(A ˆA ) 1Ax 2 (Vector Norm Inequality)

(A ˆA ) 1 2 Ax 2 (Operator Norm Definition)

(A ˆA ) 1 2 σmin(A ˆA )

σmin(A ˆA ) Ax 2 (Since, σmin(A ˆA ) > 0)

(A ˆA ) 1A ˆA 2 σmin(A ˆA ) Ax 2 (From Matrix Inequality 14)

Ax 2 1 dist2(A, ˆA) (From the Equation 11 and I 2 = 1)

m Ax 1 dist2(A, ˆA) (Vector Norm Inequality)

1 dist2(A, ˆA) .

For some of the proofs, we use the following equivalence of partitioning schemes.

11.13. Equivalence of partitioning schemes

Consider the partitioning schemes A and T which are defined in Definition 2.3. For every α Th,i, there exists a unique x Ah,i such that x = A α. This establishes an equivalence:

x Ah,i, f(x) = f(A α) = g(AA α) = g(α) (34)

where f is optimized on Ah,i and g on Th,i. Thus, optimizing f over A is equivalent to optimizing g over T .

Similarly, for the partitioning schemes ˆ A and ˆT , we have: α ˆTh,i, x ˆ Ah,i such that x = ˆA α. This leads to:

x ˆ Ah,i, f(x) = f( ˆA α) = g(A ˆA α) def = ˆg(α) (35)

where g is defined on Th,i and ˆg on ˆTh,i. Therefore, optimizing f over ˆ A is equivalent to optimizing ˆg over ˆT .

Adaptive Partitioning Schemes for Optimistic Optimization

11.14. Proof of Lemma 9.8

Proof. From the notation section 7, we have W = W WA A. Next, consider the singular value decomposition (SVD) of W, given by W = USV , where U Rp p and V Rd d are orthogonal matrices, satisfying UU = U U = Ip, VV = V V = Id and S Rp d is a diagonal matrix.

We have this relation W 2 = σ1(W) σr(W) 0, r = min{p, d}. From the Lemma assumption, we have rank(W) = r m. Thus, we collect the respective leading m columns of U and V, denoted as U1 Rp m and V1 Rd m, respectively. The remaining columns are denote by U2 Rp p m and V2 Rd d m. Since, columns of U1 and U2 forms a orthonormal basis for Rp, we have:

U 1 U1 = Im and U 1 U2 = 0m p m. (36)

Then the SVD of W can be written as: W = U1S1V 1 + U2S2V 2 . and from the lemma hypothesis ˆA = V 1 . Now, consider the implications of this decomposition for the proof. Consider,

= W(I A A) F

W(I A A) 2 (From Matrix Norm Inequality 13)

= U1S1V 1 (I A A) + U2S2V 2 (I A A) 2 (SVD of W)

U1S1V 1 (I A A) + U2S2V 2 (I A A) 2 ( U 1 2 = 1)

U 1 (U1S1V 1 (I A A) + U2S2V 2 (I A A)) 2 (From Matrix Norm Inequality 14)

= S1V 1 (I A A) 2 (Using Equation 36)

σmin(S1) V 1 (I A A) 2 (From Matrix Norm Inequality 14)

= σm ˆA(I A A) 2

= σm ˆAA A 2 ( h A , A i is a orthonormal matrix, thus A A + A A = Id)

= σm ˆAA A 2

A 2 ( A 2 = 1)

σm ˆAA A A 2 (From Matrix Norm Inequality 14)

= σm ˆAA 2 (A consists of orthonormal rows)

= σm sin Θ(A, ˆA) 2 (From Definition 9.5)

Since σm > 0, we have

sin Θ(A, ˆA) 2

W(I A A) 2 σm

W(I A A) F σm .

11.15. Proof of Lemma 9.6

Proof. Using the partitioning scheme equivalence from Section 11.13, we can work with partitioning schemes T and ˆT in place of the A and ˆ A partitioning schemes.

Adaptive Partitioning Schemes for Optimistic Optimization

Let zh, ˆzh be the representatives of T h and ˆT h cells respectively. Then

T h = z Rm, z zh

α with s = [3 h+m i

z Rm, z ˆzh

1 dist2(A, ˆA) with s = [3 h+m i

For brevity, denote κ2 = m

1 dist2(A, ˆA). Let z denote a maximizer of the function g, i.e., z arg maxz αHm 1 g(z).

Given that ˆg(z) = g(A ˆA z) and that A ˆA an invertible matrix, we can identify a corresponding maximizer z ˆg for ˆg such

that: z g = A ˆA z ˆg. Under this transformation, ˆg(z ˆg) will be a maximizer of the function ˆg.

From lemma assumption, T satisfies Assumption 2.6, therefore we have:

h N0, sup z T h (g g(z)) νAρh A (37)

We aim to control

h N0, sup z ˆT h (g ˆg(z)) = sup z ˆT h (g g(A ˆA z)) (38)

We choose some height h and represent it as h = km + i. Let us denote:

Rh = {A ˆA z : z ˆT h }.

Now, we show that: Rkm T km mk ,

where k = l log3 2 mκ2

Consider a point A ˆA z from the set Rkm, where z ˆT km. By definition, z satisfies the following condition:

z ˆzkm 3 kκ2α. (39)

Now, we show that A ˆA z T km mk . We begin with the following inequality:

= z(k k )m A ˆA z

= z(k k )m z g + A ˆA z ˆg A ˆA z (Using z g = A ˆA z ˆg)

z(k k )m z g + A ˆA (z ˆg z) (Triangle Inequality of Norms)

3 (k k )α + A ˆA (z ˆg z) (z g T km cell. Thus zkm z 3 kα)

3 (k k )α + A ˆA

z ˆg z (Operator Norm Definition)

3 (k k )α + m z ˆg z (From Matrix Inequality 13, 14 and A 2 = ˆA 2 = 1)

3 (k k )α + m ˆzkm + z ˆg + m ˆzkm z

3 (k k )α + m3 k2κ2α (using Inequality 39 and z ˆg ˆ T km)

3 (k k )α + 3 (k k )α(κ2 1) (Suppose k is chosen such that 2 mκ2 (κ2 1)3k )

= ακ23 (k k )

Adaptive Partitioning Schemes for Optimistic Optimization

From the above inequality, we can conclude:

Rkm T km m l log3 2 mκ2

Using the above set containment, for any height h = km + i, we have

Rkm+i Rkm T km m l log3 2 mκ2

κ2 1 m T km m l log3 2 mκ2

κ2 1 m +i m

First and the last set containment are valid from the round robin paritioning scheme. Substituting, h = km + i, we get:

Rh T h m l log3 2 mκ2

Define: h2 = m l log3 2 mκ2

κ2 1 m + m. Using the Inequalities 37, 38 and the above set containment, we conclude:

sup z ˆT h (g ˆg(z)) νAρh h2 A h h2 (40)

For height h [0 : h2 1], we know that: inf z ˆT h ˆg(z) inf z ˆT 0 ˆg(z)

which implies:

sup z ˆT h (g ˆg(z)) g inf z ˆT 0 ˆg(z) (g inf z ˆT 0 ˆg(z))ρ h2 A ρh A

Combining 40 and the above inequality, we conclude that ˆT satisfies Assumption 2.6 with parameters:

(ρA, max{νA, g inf z κ2αHm 1 g(A ˆAz)}ρ h2 A )

11.16. Proof of Proposition 8.1

Proof. Let A = [w1, w2, w3, . . . , wp] Rd p, where we assume without loss of generality that the vectors {w1, w2, , wp} are linearly independent. If the vectors were dependent, we could consider only the independent vectors without changing the Span{w1, w2, . . . , wp}. The orthogonal projection matrix P onto Span{w1, w2, . . . , wp} is given by P = A(A A) 1A .

For all x Rd, consider (w T i P)x = (P wi) x = (Pwi) x = w i x, where the equalities hold due to the following: First, P = P because P is a projection matrix and thus symmetric. Second, Pwi = wi since wi is in the column span of A, and P projects onto this span. Therefore, we can express f(x) as

i=1 viσ(w i x + bi) =

i=1 viσ(w i Px + bi). (41)

Since (x x ) Span{w1, w2, . . . , wp}, Px = Px and therefore using Equation (41), f(x) = f(x ).

11.17. Proof of Lemma 4.6

Proof. Using the partitioning scheme equivalence from Section 11.13, we can work with partitioning schemes T and ˆT in place of the A and ˆ A partitioning schemes.

Adaptive Partitioning Schemes for Optimistic Optimization

An arbitrary cell Th,i in T is defined as:

Th,i = z : z Rm, z zh,i

α with s = [3 h+m i

and an arbitrary cell ˆTh,i in ˆT is defined as:

z : z Rm, z zh,i

1 dist2(A, ˆA) with s = [3 h+m i

We optimize the function ˆg(z) = g(A ˆA z) over ˆT , while optimizing g(z) over T . For a near-optimal cell Th,i NT (3νT ρh T ), the following holds:

sup z Th,i g(z) = sup

z (A ˆA ) 1Th,i g(A ˆA z) g 3νT ρh T , (44)

where the inequality holds because Th,i is a near-optimal cell by assumption.

We aim to count the number of cells in ˆT that satisfy the following near-optimality cell condition:

sup z ˆTh,i g(A ˆA z) g 3ν ˆT ρh ˆT . (45)

Since, ν ˆT ρh ˆT > νT ρh T , the above condition is satisfied by

sup z ˆTh,i g(A ˆA z) g 3νT ρh T . (46)

Using relations (44) and (46), we observe that since the function g(A ˆA z) is identical in both cases, it suffices to work with the domain of the cells. Define:

B = {(A ˆA ) 1z : z Th,i}.

To obtain an upper bound to the number of near-optimal cells in the partitioning scheme ˆT , we consider every cell in ˆT ( ˆTh,i) that intersects with B as potentially near-optimal. For simplicity, we enlarge the domain B into a hypercube in Rm.

The set B is the transformation of the cell Th,i under (A ˆA ) 1. Since the transformation is linear and invertible, B remains a bounded region. However, to simplify analysis, we enclose B in a hypercube. By ensuring that the hypercube fully contains B, any cell ˆTh,i that intersects this hypercube is considered a candidate near-optimal cell.

Adaptive Partitioning Schemes for Optimistic Optimization

z1, z2 Th,i, consider, (A ˆA ) 1z1 (A ˆA ) 1z1

(A ˆA ) 1 z1 z2 (From the Matrix Operator Norm definition)

m (A ˆA ) 1 2 z1 z2 (From the Matrix Inequality 13)

m (A ˆA ) 1 2 z1 z2 σmin(A ˆA )

σmin(A ˆA ) (Since, σmin(A ˆA) > 0)

(A ˆA ) 1A ˆA 2 z1 z2

σmin(A ˆA ) (From the Matrix Inequality 14)

σmin(A ˆA ) ( I 2 = 1)

σmin(A ˆA ) 2α3 k

= 2α3 k m q

1 dist2(A, ˆA) (From the Equation 11)

(1) is true from the following inequality,

z1 z2 = s z1 zh,i + zh,i z2

= diag (s) z1 zh,i + zh,i z2

z1 zh,i + zh,i z2

B 2α3 k m q

1 dist2(A, ˆA) Hm 1 .

To establish an upper bound for the near-optimal cells of ˆT partitioning scheme, we consider the hypercube 2α3 k m

1 dist2(A, ˆA)Hm 1 as potentially near-optimal. We can then count the maximum number of cells in the ˆT par-

titioning scheme at height h that can intersect with this region.

To invoke the Lemma 4.2, we have κ = 2α3 k m

1 dist2(A, ˆA) and we need the side-length of the cell ˆTh,i cell or for

simplicity, lower-bound to side-length will be κ . And the side-length of ˆTh,i cell is greater than 2 3 (k+1) α m

1 dist2(A, ˆA).

Hence, the maximum number of cells of ˆTh,i that can be tiled inside the hypercube with side length = 2α3 k m

1 dist2(A, ˆA) are

1 dist2(A, ˆA)

2 3 (k+1) α m

1 dist2(A, ˆA)

Hence, upper-bound to the near-optimal calls of ˆT partitioning scheme is

h 0, N ˆT (3ν ˆT ρh ˆT ) 4m NT (3νT ρh T ) (47)

Adaptive Partitioning Schemes for Optimistic Optimization

11.18. Proof of Lemma 4.7

Proof. From the Lemma 4.6, we have

h 0, N ˆT (3ν ˆT ρh ˆT ) 4m NT (3νT ρh T ) (48)

Using the above relation, we relate the near-optimality dimension. Suppose, ηT (νT , ρT , CT ) is the near optimality dimension of T then, h 0, NT (3νT ρh T ) CT ρ ηT h T (49)

From Lemma 9.6, we have the following Sequ OOL parameter relations:

ν ˆT = max{νT , g inf z κ2αHm 1 g(A ˆA z)}/ρ h2 T , ρ ˆT = ρT where h2 = m+m log3 2 mκ2

and κ2 = m q

1 dist2 (A, ˆA) .

For brevity, denote ν T = max{νT , g infz κ2αHm 1 g(A ˆA z)}. Now, consider 3ν ˆT ρh ˆT :

= 3ν T ρh h2 T

= 3νT ρ logρT (ν T ρ h2 T ) logρT (νT ) T ρh T

j logρT (ν T ρ h2 T ) logρT (νT ) k

= 3νT ρ h4 T ρh T (Denote h4 = j logρT (ν T ρ h2 T ) logρT (νT ) k )

Next, we examine N ˆT (3ν ˆT ρh ˆT ):

N ˆT (3νT ρh h4 T )

4m NT (3νT ρh h4 T ) (Using Inequality 48 and N ˆT (3νT ρh T ) N ˆT (3ν ˆT ρh ˆT ))

4m CT ρ ηT (h h4) T (Using Inequality 49 and h h4)

Therefore, we have:

N ˆT (3ν ˆT ρh ˆT ) 4m CT ρηT h4 T ρ ηT h T h h4 (50)

For heights h [0 : h4 1], we can use the value of the right-hand side at depth h = h4, which is 4m CT . Hence, we have:

N ˆT (3ν ˆT ρh ˆT ) 4m CT ρηT h4 T ρ ηT h T h 0

= 4m CT ρηT h4 T ρ ηT h ˆT h 0

Therefore, we conclude that η ˆT ηT and C ˆT = CT 4mρηT h4 T .

11.19. Proof of Proposition 3.1

Proof. Let hmax = j n2

k as given in the lemma statement. Sequ OOL opens hmax

h cells at depth h for all

h [1, hmax]. Additionally, we utilize T samples for every c heights to learn ˆf. The total number of samples used to learn ˆf up to height hmax is thus T hmax

c . Each cell opening in Sequ OOL requires 3 samples. Therefore, the total number of openings performed by Algorithm 3 is given by Phmax i=1 hmax

3c . According to the proposition, we need to show

Adaptive Partitioning Schemes for Optimistic Optimization

that this quantity is n. Consider, total number of openings:

nlogn + T n

3c) (Definition of hmax)

nlogn + T n

3c (logn + T

3c) (Recall the definition: logn Pn t=1 1 t )

Hence, the number of openings made in Algorithm 3 does not exceed n.

11.20. Proof of Theorem 4.8

We start with the Theorem 5 of (Bartlett et al., 2019). We restate the theorem, adapting it to our notation and incorporating the dependency of the parameters on the partitioning scheme P:

Theorem 11.2 ((Bartlett et al., 2019), Theorem 5). Let W be the standard Lambert W function. Suppose f along the partitioning scheme P satisfies Assumption 2.6 with associated (νP, ρP), CP > 1, and near-optimality dimension ηP = ηP(νP, CP, ρP) parameters. Then, after n rounds, the simple regret of Sequ OOL is bounded as follows: For ηP > 0, we use Corollary 6 of (Bartlett et al., 2019). Let n = n/logn ηP log(1/ρP)/(CP).

If ηP = 0, rn νPρ

1 CP n log n

P If ηP > 0, rn νP n log n 1 ηP

To invoke this Theorem for our proof, first we apply the theorem for the partitioning scheme ˆ A. 4.7 shows that ˆ A is a valid partioning scheme, i.e., it satisfies 2.6, hence we can invoke Theorem 11.2.

Thus, for our partitioning scheme ˆ A, denoting n = n/logn η ˆ A log(1/ρ ˆ A)/(C ˆ A), the regret is bounded by

If η ˆ A = 0, rn ν ˆ Aρ

1 C ˆ A n log n

ˆ A If η ˆ A > 0, rn ν ˆ A n log n 1 η ˆ A

Corollary 9.7 relates Sequ OOL parameters and gives,

ρ ˆ A = ρβ, ν ˆ A = max{max{ν, lf}ρ(1 β)(m 1) h1, lg}ρ h2 A ,

η ˆ A(ν ˆ A, ρ ˆ A, C ˆ A) ηP(ν, ρ, C)

β , C ˆ A = 3ddd m(12 m)m Cρ ηP h34mρηP h4.

Now, we substitute these relations in our regret bound to get the upper bound in terms of default partitioning scheme P parameters.

max{max{ν, lf}ρ(1 β)(m 1) h1, lg}ρ β h2ρ

β C1 n log n if ηP = 0,

max{max{ν, lf}ρ(1 β)(m 1) h1, lg}ρ β h2 n log n β

ηP if ηP > 0,

Where C1 = 3ddd m(12 m)m Cρ ηP h34mρηP h4 and n = n/logn ηP log(1/ρ)/C1

with h2 = m + m l log3 2 mκ2

κ2 1 m and h1 = d l log3 3 log3 mα m with κ2 = m

1 dist2 (A, ˆA)

h3 = j logρ(max{ν, lf}ρ(1 β)(m 1) h1) logρ(ν) k , h4 = j logρA(max{νA, lg}ρ h2 A ) logρA(νA) k

Adaptive Partitioning Schemes for Optimistic Optimization

12. Additional Experiment Details & Results

12.1. Test Functions Experiments

We implemented Sequ OOL, SOO, and RESOO ourselves due to the absence of publicly available open-source code for these algorithms. For Di Rect and Dual Annealing, we utilized the implementations provided in the Sci Py library s optimize module. The CMA-ES algorithm was sourced from its dedicated project repository1. REMBO and Hes BO implementations were derived from the original Hes BO repository2. Bayesian Optimization was implemented using the repository 3.

12.2. Multi-Index Functions Results

We present additional experimental results to further demonstrate the effectiveness of our approach. Figure 7 showcases the performance of various algorithms on low-dimensional multi-index functions with d = 5 and m = 2. Our algorithm consistently achieves lower regret across different test functions, including Sphere, Branin, Ellipsoid, and Rastrigin, often reaching zero regret with fewer samples compared to competing methods.

12.3. Training Details of LLM Quantization Experiment

We implemented our Large Language Model (LLM) code on hardware equipped with one Quadro RTX 5000 GPU having 16GB VRAM. For comparison, we ran AWQ baselines using the original authors code, which also served as a foundation for developing our proposed method.

To optimize the neural network used in Algorithm 3 for our LLM Quantization objective function, we employed the Ray package for hyper-parameter tuning 4. We used Adam optimizer and our search space included hidden layer sizes (500, 1000, 2000, 3000), learning rates (log-uniform from 1 10 4 to 1 10 1), weight decay (log-uniform from 1 10 2 to 1 10 1), and learning rate Step Decay with gamma values (uniform from 0.9 to 0.99), and step sizes (500, 1000, 2000). We utilized early stopping to prevent overfitting.

The neural network was retrained on Sequ OOL-collected samples every 5 heights, with the look-ahead strategy applied up to a height of 60 and performed round-robin direction selection after this height.

1https://github.com/Cyber Agent AILab/cmaes 2https://github.com/aminnayebi/Hes BO 3https://github.com/bayesian-optimization/Bayesian Optimization 4https://github.com/ray-project/ray

Adaptive Partitioning Schemes for Optimistic Optimization

Figure 7. Regret Plots: Legend for the plots are arranged in the order of their performance. Algorithm 1 (Sequ OOL on ˆ A) uses 100 additional samples to learn the subspace through the Fornasier et al. (2012) approach. Our Algorithm, SOO, RESOO, Sequ OOL are budget algorithms, so we run these algorithms using 100 equally spaced budget values between 1 and 2000 and plot the regret at the end of each run. For the randomized algorithms, we took 10 trials and plotted the median curve (thick line) and 0 and 95 percentile curves. Random Search is run on ˆ A.