# Bayesian Optimization with Cost-varying Variable Subsets

Sebastian Shenghong Tay1,2, Chuan Sheng Foo2,3, Daisuke Urano4, Richalynn Chiu Xian Leong4, Bryan Kian Hsiang Low1

1Department of Computer Science, National University of Singapore; 2Institute for Infocomm Research (I2R), A*STAR, Singapore; 3Centre for Frontier AI Research (CFAR), A*STAR, Singapore; 4Temasek Life Sciences Laboratory, Singapore

sebastian.tay@u.nus.edu, foo_chuan_sheng@i2r.a-star.edu.sg, daisuke@tll.org.sg, richalynn@tll.org.sg, lowkh@comp.nus.edu.sg

We introduce the problem of Bayesian optimization with cost-varying variable subsets (BOCVS) where in each iteration, the learner chooses a subset of query variables and specifies their values while the rest are randomly sampled. Each chosen subset has an associated cost. This presents the learner with the novel challenge of balancing between choosing more informative subsets for more directed learning versus leaving some variables to be randomly sampled to reduce incurred costs. This paper presents a novel Gaussian process upper confidence bound-based algorithm for solving the BOCVS problem that is provably no-regret. We analyze how the availability of cheaper control sets helps in exploration and reduces overall regret. We empirically show that our proposed algorithm can find significantly better solutions than comparable baselines with the same budget.

1 Introduction

Bayesian optimization (BO) is a powerful framework for the sample-efficient optimization of costly-to-evaluate black-box objective functions [11] and has been successfully applied to many experimental design problems of significance such as hyperparameter optimization [6, 39], chemical synthesis [30], and particle accelerator control [29], among others. Conventional BO assumes that the learner has full control over all query variables (i.e., all variables in the input to the objective function). However, in many real-world optimization problems, some of the query variables may be subject to randomness affecting their values. In some cases, the randomness affecting a specific variable can be eliminated (by allowing the learner to select its value), but at a cost. We illustrate with a few concrete scenarios:

In precision agriculture, consider a farm aiming to find the optimal conditions for the largest crop yield where the query variables are a set of soil nutrient concentrations (e.g., Ca, B, NH3, K) and pH. The farm may rely on the naturally-occurring quantities of these nutrients in the available soil, but these quantities will be randomly sampled. Alternatively, it may control some subset of these quantities (via manufactured soil and fertilizers) at a higher cost.

In advanced manufacturing where random variation occurs in every operation [34], certain specifications of a product may be left unspecified by the manufacturer and randomly determined, or specified but at a higher cost.

In ad revenue maximization or crowdsourcing where information is gathered from a large number of individuals via ad platforms or crowdsourcing platforms such as Amazon Mechanical Turk, suppose that the query variables describe the demographics of the individual, such as country of origin or income level. The learner may allow the platform to randomly assign the task to any individuals, or the learner may demand a specific subgroup of individuals at a higher cost.
In all these practical scenarios, the goal is to find the maximizer with as little incurred cost as possible. At each query iteration, the learner is faced with the non-trivial problem of deciding which variables to specify (for more directed learning) vs. which variables to allow to be randomly sampled (to reduce incurred costs), in addition to the usual BO problem of deciding the specified variables' values. To the best of our knowledge, there are no existing works that tackle this problem precisely. The work of Hayashi et al. [13] introduced the problem of BO with partially specified queries (BOPSQ) in which the subset of deterministically selected variables (control set) and randomly sampled variables (random set) can also be chosen by the learner, but it does not consider the costs incurred by such choices. This is a non-trivial limitation as the presence of costs can significantly alter the learner's decisions. Under such a formulation, if a control set is a strict subset of another, then the former will never be chosen as there is no benefit to having variable values be randomly sampled instead of chosen by the learner. Consequently, if there exists a control set that includes all the variables in a query, then all other control sets will not be used and the problem reduces to conventional BO. In practice, however, the availability of other control sets confers an advantage if these other control sets are cheaper. Having access to cheaper but more random control sets allows the learner to explore the query space cheaply and then use costlier but more deterministic control sets to exploit high-value regions. BOPSQ in its current formulation excludes the analysis of such strategies and is akin to multi-fidelity BO [15] without modeling the costs of the different information sources: In this case, the learner would simply choose the highest-fidelity information source all the time, thus making the problem setting trivial.

This paper introduces the problem of BO with cost-varying variable subsets (BOCVS) that explicitly models the cost of each control set and is more useful in practical scenarios. Our work generalizes BOPSQ and argues that BOCVS problems are much richer when analyzed from a similar perspective as multi-fidelity BO, where the various control sets are treated as information sources with different levels of usefulness and costs. By using cheap control sets for exploration and expensive control sets for exploitation, we show that with an appropriately designed algorithm, a learner can find significantly better solutions with a lower cost expenditure. To achieve this, we leverage the Gaussian process upper confidence bound (GP-UCB) acquisition function [7, 32] to design a novel no-regret algorithm, i.e., its incurred simple regret tends to 0 as the number of iterations tends to infinity, and the algorithm's best chosen query converges to the optimal solution. We additionally analyze the impact of the availability of cheaper control sets on the regret incurred by the most expensive control set. We observe that our algorithm generally outperforms the non-cost-aware baselines, while simple extensions based on Thompson sampling, or on maximizing UCB or expected improvement-based acquisition scores per unit cost [31, Sec. 3.2], either fail to converge or fail to utilize cheap control sets effectively.
Concretely, the contributions of our work in this paper include the following:

- We introduce the BOCVS problem (Sec. 4) and solve it by designing a novel UCB-based algorithm (Sec. 4.1) with a theoretical analysis of its properties, including the conditions under which it is provably no-regret and the impact of the availability of cheaper control sets on the regret incurred by the most expensive control set, and discuss the practical considerations (Sec. 4.2);
- We empirically evaluate the performance of our proposed algorithm against the baselines under several experimental settings with synthetic and real-world datasets (Sec. 5), including a plant growth dataset and an airfoil self-noise dataset corresponding, respectively, to the precision agriculture and advanced manufacturing use cases motivated earlier in this section.

2 Related Work

The work of Hayashi et al. [13] introduced BO with partially specified queries (BOPSQ) and tackled the problem with Thompson sampling. However, it fails to consider the relative costs of control sets, which hinders the learner's ability to take advantage of all control sets even in the presence of more deterministic control sets. The work of Oliveira et al. [25] proposed BO with uncertain inputs in which the executed query is sampled from a probability distribution depending on the proposed query. Though related, its problem setting is motivated more by uncertainty in the input query even post-observation and does not involve variable subset selection. These two works are part of a line of research investigating BO in situations where the learner may not have full control over all variables in a query, which includes BO for expected values [38], risk-averse BO [5, 21, 22], and distributionally robust BO [17, 24, 37]. These works also do not consider variable subset selection.

Our treatment of the BOCVS problem is inspired by multi-fidelity BO in which the learner has access to cheap, low-fidelity surrogates of the true objective function [15, 27, 35, 36]. In such works (and in ours), modeling costs is crucial as the learner would otherwise simply choose the highest-fidelity information source (in ours, the maximally deterministic control set). While the general idea of paying less for potentially less informative queries is similar, our problem setting is fundamentally different: The lack of informativeness comes from the uncertainty of the executed query as opposed to a bias in the observed function values.

The BOCVS setting may be viewed as a special case of causal BO as formulated by Aglietti et al. [1] and continued in several works [2, 4]. Specifically, our setting is a case in which there are no non-manipulative variables and the causal DAG is such that all input variables have no parents and are parents of the output variable. Nevertheless, we believe our focus on this special case has value as it allows us to derive useful theoretical results such as algorithm regret bounds that, to the best of our knowledge, do not exist for the completely general causal BO setting at the time of writing. The work of Sussex et al. [33] includes a regret bound, but is also a special case of [1], and has little overlap with our work as it does not consider costs of control sets or explicit probability distributions over input variables. We believe that our work is sufficiently general to be useful for practical scenarios (where the full causal BO apparatus may be unnecessary), and is also a stepping stone towards theory for the general case.
3 BO and Gaussian Processes

We will first give a brief review of conventional BO [11]. Given a query set $\mathcal{X}$ and an objective function $f : \mathcal{X} \to \mathbb{R}$, a learner wishes to find the maximizing query $x^* := \mathrm{argmax}_{x \in \mathcal{X}} f(x)$. However, $f$ is black-box (i.e., not available in closed form) and can only be learned by submitting a query $x_t \in \mathcal{X}$ in each iteration $t$ for function evaluation and receiving a noisy observation $y_t := f(x_t) + \xi_t$ where each $\xi_t$ is i.i.d. $\sigma$-sub-Gaussian noise with zero mean. Each function evaluation is assumed to be expensive in some way, such as in terms of money or time spent. So, the learner must be sample-efficient and find $x^*$ in as few iterations as possible.

BO achieves sample efficiency by leveraging a Bayesian model to represent a probabilistic belief of the function values at unobserved regions of $\mathcal{X}$ in a principled manner. While any Bayesian model may be used for BO, Gaussian processes (GPs) [42] are a common choice as they enable exact posterior inference: The GP posterior belief of $f$ at any query $x \in \mathcal{X}$ after $t$ iterations is a Gaussian with posterior mean and variance given by

$$\mu_t(x) := k_t(x)^\top (K_t + \lambda I)^{-1} y_t\,, \qquad \sigma_t^2(x) := k(x, x) - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x) \qquad (1)$$

where $y_t := (y_j)_{j=1}^t \in \mathbb{R}^t$, $k$ is a positive semidefinite kernel (covariance function), $k_t(x) := (k(x, x_j))_{j=1}^t \in \mathbb{R}^t$, $K_t := (k(x_j, x_{j'}))_{j,j'=1}^t \in \mathbb{R}^{t \times t}$, and $\lambda$ is an algorithm parameter; if the noise is a Gaussian with variance $\sigma^2$, then the true posterior is recovered with $\lambda = \sigma^2$. The kernel $k$ is an important modeling choice as the GP posterior mean will reside in the reproducing kernel Hilbert space (RKHS) associated with $k$. For simplicity, we assume w.l.o.g. that $k(x, x') \le 1$ for any pair of queries $x, x' \in \mathcal{X}$. Kernel $k$ affects the maximum information gain (MIG) defined as

$$\gamma_T(\mathcal{X}) := \max_{\{x_t\}_{t=1}^T \subset \mathcal{X}} \; 0.5 \log\left|I + \lambda^{-1} K_T\right| .$$

The MIG characterizes the statistical complexity of a problem and plays an integral role in the theoretical analysis. For the commonly used squared exponential kernel, $\gamma_T(\mathcal{X}) = O((\log T)^{d+1})$, while for the Matérn kernel with $\nu > 1$, $\gamma_T(\mathcal{X}) = O(T^{d(d+1)/(2\nu + d(d+1))} \log T)$ [32]. Importantly, $\gamma_T(\mathcal{X})$ is increasing in the volume of $\mathcal{X}$ [32, Theorem 8].
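As a concrete companion to (1) and the MIG definition, here is a minimal NumPy sketch of the GP posterior and the information gain of a fixed query set. It is our own illustration under assumed choices (squared exponential kernel with unit outputscale and a hand-picked $\lambda$), not the paper's experimental code.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.1):
    # Squared exponential kernel: k(x, x') = exp(-||x - x'||^2 / (2 l^2)), so k(x, x) = 1.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_query, lam=1e-4):
    # Eq. (1): mu_t(x) = k_t(x)^T (K_t + lam*I)^{-1} y_t and
    # sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + lam*I)^{-1} k_t(x).
    K_inv = np.linalg.inv(se_kernel(X_obs, X_obs) + lam * np.eye(len(X_obs)))
    k_q = se_kernel(X_query, X_obs)  # one row of k_t(x) per query
    mu = k_q @ K_inv @ y_obs
    var = 1.0 - np.einsum("ij,jk,ik->i", k_q, K_inv, k_q)
    return mu, var

def information_gain(X_pts, lam=1e-4):
    # 0.5 * log|I + lam^{-1} K_T|; the MIG gamma_T(X) is its maximum over T-point subsets of X.
    K = se_kernel(X_pts, X_pts)
    return 0.5 * np.linalg.slogdet(np.eye(len(X_pts)) + K / lam)[1]

# Usage: posterior at random queries given 5 random observations of a toy function.
rng = np.random.default_rng(0)
X_obs = rng.random((5, 3)); y_obs = np.sin(X_obs.sum(-1))
mu, var = gp_posterior(X_obs, y_obs, rng.random((4, 3)))
```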
4 BO with Cost-varying Variable Subsets (BOCVS)

The BOCVS problem consists of a compact query set $\mathcal{X} \subset \mathbb{R}^d$ and an objective function $f : \mathcal{X} \to \mathbb{R}$ in the RKHS of $k$ with the RKHS norm upper bounded by $B$. For simplicity, assume w.l.o.g. that $\mathcal{X} = [0, 1]^d$. Let $[d] := \{1, 2, \ldots, d\}$. The learner is given a collection $\mathcal{I} \subseteq 2^{[d]}$ of control sets indexed by $1, 2, \ldots, m := |\mathcal{I}|$. Each control set $i \in [m]$, denoted by $I_i \subseteq [d]$, indicates the variables in a query with values that can be chosen by the learner. The complement $\bar{I}_i := [d] \setminus I_i$ of $I_i$ is the corresponding random set indicating the variables in a query with values that will be randomly sampled from some distribution. A query $x \in \mathcal{X}$ can be represented by a combination of partial queries $[x^i, x^{-i}]$ comprising the control partial query $x^i := (x_\ell)_{\ell \in I_i}$ (i.e., $x^i$ collects the variables indexed by $I_i$) and the random partial query $x^{-i} := (x_\ell)_{\ell \in \bar{I}_i}$ where $x_\ell$ denotes the $\ell$-th variable in the query vector $x$. Note that $[x^i, x^{-i}]$ is not a simple vector concatenation as the variables may need to be reordered according to their indices. Furthermore, let $\mathcal{X}^i := \{x^i \mid x \in \mathcal{X}\}$.

In iteration $t$, the learner chooses control set $i_t \in \mathcal{I}$ and specifies the values in control partial query $x^{i_t}$. The random partial query $x^{-i_t}$ will then be randomly sampled from the environment. For example, if $d = 4$ and $I_{i_t} = \{1, 3\}$, then $\bar{I}_{i_t} = \{2, 4\}$ and the learner will be able to choose the values in $x^{i_t}$ (i.e., the 1st and 3rd variables) but not those in $x^{-i_t}$ (i.e., the 2nd and 4th variables). The full query in iteration $t$ is then $x_t = [x^{i_t}, x^{-i_t}] = (x_{t,\ell})_{\ell \in [d]}$. Each observed variable $x_{t,\ell}$ for $\ell \in \bar{I}_{i_t}$ is a realization of a random variable $X_{t,\ell} \sim P_\ell$. The observed $x^{-i_t}$ is then a realization of the random vector $X^{-i_t} := (X_{t,\ell})_{\ell \in \bar{I}_{i_t}} \sim P^{-i_t}$ where $P^{-i_t}$ is the product measure $\bigotimes_{\ell \in \bar{I}_{i_t}} P_\ell$. In other words, each variable in a random partial query is independently sampled from a probability distribution that governs that variable. All distributions are assumed to be known. The learner then observes $y_t := f(x_t) + \xi_t$ where each $\xi_t$ is i.i.d. $\sigma$-sub-Gaussian noise with a zero mean. Fig. 1 illustrates two iterations in a BOCVS problem setting.

Figure 1: Two iterations in a BOCVS problem setting. The grey boxes are isometric views of a query set $\mathcal{X} \subset \mathbb{R}^3$. The blue regions depict the probability densities of random vectors $[x^{i_t}, X^{-i_t}]$ and $[x^{i_{t+1}}, X^{-i_{t+1}}]$. In iteration $t$, the learner chooses the control set $i_t = 1$ and specifies the value (of the first variable $x_{t,1}$) in control partial query $x^{i_t}$, while the last two variables $X_{t,2}, X_{t,3}$ in random partial query $X^{-i_t}$ will be randomly sampled. In iteration $t + 1$, the learner chooses the control set $i_{t+1} = 2$ and specifies the values (of the first two variables $x_{t+1,1}, x_{t+1,2}$) in control partial query $x^{i_{t+1}}$, while the last variable $X_{t+1,3}$ in random partial query $X^{-i_{t+1}}$ will be randomly sampled.

The learner wishes to find the optimal control set $i^*$ and specified values in control partial query $x^{i^*}$ that maximize the expected value of $f([x^i, X^{-i}])$ where the expectation is w.r.t. $X^{-i} \sim P^{-i}$:

$$(i^*, x^{i^*}) := \mathrm{argmax}_{(i, x^i) \in [m] \times \mathcal{X}^i} \; \mathbb{E}\big[f([x^i, X^{-i}])\big] .$$

The learner has an initial budget $C \in \mathbb{R}^+$ and every control set $I_i$ has an associated cost $c_i > 0$ for all $i \in [m]$. Let the control set indices be defined such that $c_1 \le c_2 \le \ldots \le c_m$.¹ In every iteration $t$, the learner pays $c_{i_t}$. The learning procedure ends after $T$ iterations when $C - \sum_{t=1}^T c_{i_t} < c_{i_{T+1}}$, i.e., the learner does not have enough budget left to pay for the chosen control set. $T$ will now be a random variable depending on the algorithm and the random outcomes of the learning procedure. The cost-varying cumulative regret is defined as

$$R_T := \sum_{t=1}^T c_{i_t} \Big( \mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big] \Big) .$$

The regret incurred by choosing a sub-optimal control set and specifying sub-optimal values in the control partial query is weighted by the cost of that control set. This naturally incorporates the notion that the penalty for sub-optimal plays is lower if the play was cheap, while also penalizing using the entire budget on sub-optimal plays, regardless of whether those plays are cheap or expensive. Intuitively, to minimize the cost-varying regret, a learner would attempt to use the cheap control sets (i.e., low $c_i$, low $\mathbb{E}[f([x^i, X^{-i}])]$) to explore the query space, and use the expensive control sets (i.e., high $c_i$, high $\mathbb{E}[f([x^i, X^{-i}])]$) to exploit control partial queries with high expected function values. When all $c_i = 1$, we recover the BOPSQ problem [13], and $C$ is simply the number of iterations in the learning trajectory.

¹While our problem definition does not require that $c_i \le c_j \implies \max_{x^i \in \mathcal{X}^i} \mathbb{E}[f([x^i, X^{-i}])] \le \max_{x^j \in \mathcal{X}^j} \mathbb{E}[f([x^j, X^{-j}])]$, one might reasonably expect this to be the case in real-world problems, i.e., "better" control sets cost more to specify. This also implies that $I_i \subseteq I_j \implies c_i \le c_j$.
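To make the partial-query notation concrete, the following sketch (our own illustration; the function name is hypothetical) assembles a full query $[x^i, x^{-i}]$ by scattering the control and random partial queries back to their variable indices, with the random partial query sampled from the environment:

```python
import numpy as np

def assemble_query(d, control_set, x_ctrl, x_rand):
    # [x^i, x^{-i}]: scatter control values to the indices in I_i and random
    # values to the complement indices (not a plain vector concatenation).
    x = np.empty(d)
    rand_set = [l for l in range(d) if l not in control_set]
    x[list(control_set)] = x_ctrl
    x[rand_set] = x_rand
    return x

rng = np.random.default_rng(0)
d, control_set = 4, [0, 2]                        # I_i = {1, 3} in the paper's 1-indexed notation
x_ctrl = np.array([0.7, 0.3])                     # chosen by the learner
x_rand = rng.uniform(size=d - len(control_set))   # X^{-i} ~ P^{-i} (uniform here for illustration)
print(assemble_query(d, control_set, x_ctrl, x_rand))
```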
Algorithm 1 UCB-CVS
1: Input: GP with kernel $k$, budget $C$, control sets $\mathcal{I}$, costs $(c_i)_{i=1}^m$, $\epsilon$-schedule $(\epsilon_t)_{t=1}^\infty$
2: for iteration $t = 1$ to $\infty$ do
3: &nbsp;&nbsp; $g_t := \max_{(i, x^i) \in [m] \times \mathcal{X}^i} \mathbb{E}\big[u_{t-1}([x^i, X^{-i}])\big]$
4: &nbsp;&nbsp; $S_1 := \{i \in [m] \mid \max_{x^i \in \mathcal{X}^i} \mathbb{E}[u_{t-1}([x^i, X^{-i}])] + \epsilon_t \ge g_t\}$
5: &nbsp;&nbsp; $S_2 := \{i \in S_1 \mid c_i = \min_{j \in S_1} c_j\}$
6: &nbsp;&nbsp; $(i_t, x^{i_t}) := \mathrm{argmax}_{(i, x^i) \in S_2 \times \mathcal{X}^i} \mathbb{E}\big[u_{t-1}([x^i, X^{-i}])\big]$
7: &nbsp;&nbsp; break if $C - \sum_{\tau=1}^{t-1} c_{i_\tau} < c_{i_t}$
8: &nbsp;&nbsp; Observe $x^{-i_t}$ drawn from $P^{-i_t}$
9: &nbsp;&nbsp; Observe $y_t := f(x_t) + \xi_t$
10: &nbsp;&nbsp; $D_t := \{(x_\tau, y_\tau)\}_{\tau=1}^t$
11: end for
12: return $D_t$

In fact, BOPSQ reduces to a simpler problem if there exists a full query control set that allows the learner to choose the values of all $d$ variables. If $[d] \in \mathcal{I}$, then $I_{i^*} = [d]$ and $\mathbb{E}[f([x^{i^*}, X^{-i^*}])] = \max_{x \in \mathcal{X}} f(x)$ since expectations of a function are never greater than the maximum value of the function. In other words, the full query control set is guaranteed to be the optimal control set and the BOPSQ problem reduces to one of conventional BO. In general, under BOPSQ, any control set that is a strict subset of another will never be chosen.

4.1 UCB-CVS

Alg. 1 describes our UCB-CVS algorithm for solving the BOCVS problem. In iteration $t$, it uses the GP posterior belief of $f$ to construct an upper confidence bound (UCB) $u_{t-1}$ of $f$: $u_{t-1}(x) = \mu_{t-1}(x) + \beta_t \sigma_{t-1}(x)$ where the sequence $(\beta_t)_{t \ge 1}$ is an algorithm parameter that controls the tradeoff between exploration and exploitation. UCB-based algorithm design is a classic strategy in the stochastic bandits [19, Ch. 7] and BO literature [7, 32] and makes use of the optimism in the face of uncertainty (OFU) principle [18]: Queries with a large posterior standard deviation (i.e., high uncertainty) are given high acquisition scores as the function values at those queries may be potentially high. UCB-CVS adapts this strategy by taking the expectation of the UCB as part of the acquisition process. Due to the monotonicity of expectation, if $u_{t-1}$ is an upper bound of $f$ (i.e., $u_{t-1}(x) \ge f(x)$ for any $x \in \mathcal{X}$), then $\mathbb{E}[u_{t-1}([x^i, X^{-i}])]$ is also an upper bound of $\mathbb{E}[f([x^i, X^{-i}])]$ for any $i \in [m]$, $x^i \in \mathcal{X}^i$.

UCB-CVS also takes as input an $\epsilon$-schedule $(\epsilon_t)_{t=1}^\infty$ where $\epsilon_t \ge 0$ for all $t$. To choose the control set in iteration $t$, it first computes $g_t$, which is the expected UCB of the best control set and specified values in the control partial query (Step 3). It then collects every control set $i$ that fulfills the condition $\max_{x^i \in \mathcal{X}^i} \mathbb{E}[u_{t-1}([x^i, X^{-i}])] + \epsilon_t \ge g_t$ into a set $S_1$ (Step 4). It further reduces this set $S_1$ to $S_2$ by retaining only the control sets with the lowest cost (Step 5). Finally, it chooses the control set from $S_2$ with the largest expected UCB value (Step 6). Each $\epsilon_t$ thus serves as a relaxation that enables exploration with cheaper control sets. Choosing many $\epsilon_t$ to be large results in many iterations of choosing cheaper control sets; conversely, choosing $\epsilon_t = 0$ for all $t$ ignores all costs.
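Below is a minimal sketch of Steps 3–6 of Alg. 1 on a finite grid of candidate control partial queries, with each expectation over $X^{-i}$ approximated by Monte Carlo sampling (as in the paper's experiments; see Appendix C). It reuses the hypothetical assemble_query helper from the earlier sketch, and gp_ucb stands for any callable returning $u_{t-1}$ on a batch of full queries, e.g., $\mu_{t-1} + \beta_t \sigma_{t-1}$ built from gp_posterior; it is a sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def expected_ucb(gp_ucb, d, ctrl_idx, x_ctrl, sampler, n_mc=1024):
    # Monte Carlo estimate of E[u_{t-1}([x^i, X^{-i}])] with X^{-i} ~ P^{-i}.
    X = np.tile(assemble_query(d, ctrl_idx, x_ctrl, np.zeros(d - len(ctrl_idx))), (n_mc, 1))
    rand_idx = [l for l in range(d) if l not in ctrl_idx]
    X[:, rand_idx] = sampler(n_mc)  # overwrite the random partial query with MC draws
    return gp_ucb(X).mean()

def ucb_cvs_step(gp_ucb, d, control_sets, costs, eps_t, cand_grids, samplers):
    # Steps 3-6 of Alg. 1 over a finite candidate grid per control set.
    best = [max((expected_ucb(gp_ucb, d, I, x, samplers[i]), tuple(x))
                for x in cand_grids[i])
            for i, I in enumerate(control_sets)]
    g_t = max(v for v, _ in best)                                   # Step 3
    S1 = [i for i, (v, _) in enumerate(best) if v + eps_t >= g_t]   # Step 4
    c_min = min(costs[i] for i in S1)
    S2 = [i for i in S1 if costs[i] == c_min]                       # Step 5
    i_t = max(S2, key=lambda i: best[i][0])                         # Step 6
    return i_t, np.array(best[i_t][1])
```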
Our first result upper bounds the cost-varying cumulative regret incurred by UCB-CVS. Define the feasible set $\tilde{\mathcal{X}}_i := \prod_{\ell=1}^d [a_\ell^i, b_\ell^i]$ for each control set $i$ such that $a_\ell^i = 0$, $b_\ell^i = 1$ if $\ell \in I_i$, and $a_\ell^i = \sup\{a \in [0, 1] \mid F_\ell(a) = 0\}$, $b_\ell^i = \inf\{b \in [0, 1] \mid F_\ell(b) = 1\}$ otherwise, where $F_\ell$ is the CDF of $X_\ell \sim P_\ell$. $\tilde{\mathcal{X}}_i$ is a subset of $\mathcal{X}$ in which any query chosen with control set $i$ must reside. Define $T_i$ as the total number of iterations in which control set $i$ is chosen.

Theorem 4.1. With probability at least $1 - \delta$, UCB-CVS (Alg. 1) incurs a cost-varying cumulative regret bounded by

$$R_T \le c_m \sum_{t=1}^T \epsilon_t + O\left( \left( B + \sqrt{\gamma_T(\mathcal{X}) + \log \tfrac{m+1}{\delta}} \right) \sum_{i=1}^m c_i \left( \sqrt{T_i \, \gamma_{T_i}(\tilde{\mathcal{X}}_i)} + \log \tfrac{m+1}{\delta} \right) \right)$$

by setting $\beta_t = B + \sigma \sqrt{2(\gamma_{t-1}(\mathcal{X}) + 1 + \log((m+1)/\delta))}$.

For any appropriately chosen kernel such that $\gamma_T(\mathcal{X}) < O(\sqrt{T})$ (e.g., the commonly used squared exponential kernel, see Sec. 3) and $\epsilon$-schedule such that $\sum_{t=1}^T \epsilon_t$ is sublinear in $T$, the cumulative regret incurred will be sublinear in $T$: $\lim_{T \to \infty} R_T / T = 0$. For example, the decaying schedule $\epsilon_t = \epsilon_0 t^{-1/2}$ satisfies this condition since $\sum_{t=1}^T t^{-1/2} \le 2\sqrt{T}$. Since the mean of a sequence is no less than its minimum, and all $c_i > 0$, this further implies the desired no-regret property: $\lim_{T \to \infty} \min_{1 \le t \le T} \big(\mathbb{E}[f([x^{i^*}, X^{-i^*}])] - \mathbb{E}[f([x^{i_t}, X^{-i_t}])]\big) = 0$, i.e., the best control set and specified values in control partial query among the algorithm's choices eventually converge to the optimal solution. The proof of Theorem 4.1 relies on choosing an appropriate sequence of $\beta_t$ such that $u_{t-1}(x) \ge f(x)$ for any $x \in \mathcal{X}, t \ge 1$ with high probability [7, Theorem 2]. The cumulative regret is bounded by a sum of expectations of posterior standard deviations, which can then be bounded by a sum of posterior standard deviations plus some additional terms [16, Lemma 3] and in turn bounded in terms of the MIG [7, Lemma 4]. The proofs of all results in this paper are provided in Appendix A.

Since each $\gamma_{T_i}(\tilde{\mathcal{X}}_i)$ is increasing in the volume of $\tilde{\mathcal{X}}_i$, Theorem 4.1 states that control sets with smaller feasible sets will incur less regret. If the size of a feasible set is taken to be a reasonable surrogate for the diffuseness of the probability distributions involved, Theorem 4.1 then suggests that control sets with corresponding random sets whose probability distributions are less diffuse will incur less regret.² Theorem 4.1 also informs us that one sufficient condition on the $\epsilon$-schedule for the cost-varying regret to be sublinear in $T$ is that $\sum_{t=1}^T \epsilon_t$ is sublinear in $T$. Our next proposition provides an alternative condition (neither is more general than the other):

Proposition 4.2. If there exists an $\bar{\epsilon} > 0$ s.t. for all $i \ne i^*$, $\epsilon_t \le \mathbb{E}[f([x^{i^*}, X^{-i^*}])] - \max_{x^i \in \mathcal{X}^i} \mathbb{E}[f([x^i, X^{-i}])] - \bar{\epsilon}$ eventually (i.e., the inequality holds for all $t \ge q$ for some $q \ge 1$), and $\gamma_T(\mathcal{X}) < O(\sqrt{T})$, then, with probability at least $1 - \delta$, $\lim_{T \to \infty} T_i / T = 0$ for all $i \ne i^*$ and UCB-CVS incurs a cost-varying cumulative regret that is sublinear in $T$ by setting $\beta_t = B + \sigma \sqrt{2(\gamma_{t-1}(\mathcal{X}) + 1 + \log((m+1)/\delta))}$.

²The feasible set of control set $i$ is defined in a worst-case manner, which may be too conservative to be a good surrogate for diffuseness, especially for concentrated probability distributions with non-zero density everywhere. Nevertheless, it facilitates the worst-case analysis of the regret bounds.

The above results have shown that with an appropriately chosen $\epsilon$-schedule, UCB-CVS satisfies the no-regret property. However, ignoring all costs by setting $\epsilon_t = 0$ for all $t$ also achieves no-regret. This begs the question: In what way does a good $\epsilon$-schedule improve UCB-CVS? Supposing the most expensive control set is the full query control set, the presence of queries chosen with cheaper control sets should reduce the cost-varying regret incurred by the full query control set by ruling out low function value regions and directing the full queries towards high function value regions. Additionally, it is reasonable to conjecture that the more diffuse each variable's (indexed by $\ell$) probability distribution $P_\ell$ is, the more the cheaper control sets would explore the query space and thus, the lower the cost-varying regret incurred by the full query control set. To derive such a result, the plan of attack is to relate the variances (i.e., notion of diffuseness) of the probability distributions to the distances between queries chosen with the cheaper control sets, followed by analyzing the effect of these distances and the number of times cheaper control sets were played on the MIG term of the most expensive control set.
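The variance-to-distance relation that this plan rests on can be sanity-checked numerically: for two i.i.d. draws of a random partial query, $\mathbb{E}\|X^{-i}_1 - X^{-i}_2\|^2 = 2\sum_{\ell \in \bar{I}_i} \mathbb{V}[X_\ell]$ (this identity appears as (6) in the proof of Lemma 4.3). A quick Monte Carlo check, with an arbitrary illustrative choice of distributions of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, variances = 200_000, np.array([0.02, 0.04, 0.08])  # one target variance per random-set variable

# Two i.i.d. draws of the random partial query; each P_l here is a normal at 0.5 clipped to [0, 1].
X1 = np.clip(rng.normal(0.5, np.sqrt(variances), size=(n, 3)), 0.0, 1.0)
X2 = np.clip(rng.normal(0.5, np.sqrt(variances), size=(n, 3)), 0.0, 1.0)

emp = ((X1 - X2) ** 2).sum(axis=1).mean()                      # E ||X1 - X2||^2
thy = 2 * sum(X1[:, l].var() for l in range(3))                # 2 * sum_l V[X_l] (post-clipping)
print(f"empirical {emp:.4f} vs twice the summed variances {thy:.4f}")
```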
Lemma 4.3. Suppose that for each control set $i$, the random variable $Y_i := \|[0, X^{-i}_1] - [0, X^{-i}_2]\|^2$ has a median $M_i$ s.t. $\mathbb{E}[Y_i \mid Y_i > M_i] \le h_i M_i$ for some $h_i > 0$ where $X^{-i}_1, X^{-i}_2 \sim P^{-i}$. With probability at least $1 - \delta$, there will be at least $N_i$ non-overlapping pairs of queries $x$ and $x'$ chosen by UCB-CVS (Alg. 1) with control set $i$ s.t. $\|x - x'\|^2 \ge M_i$ where

$$N_i = \left\lfloor \frac{T_i - 1}{4} - \sqrt{\frac{T_i}{4} \log \frac{1}{\delta}} \right\rfloor \quad \text{and} \quad M_i \ge \frac{4}{h_i + 1} \sum_{\ell \in \bar{I}_i} \mathbb{V}[X_\ell] \,. \qquad (2)$$

From (2), the higher the variances of the distributions that govern the variables in the random set, the larger the lower bound $M_i$ on the squared distance between at least $N_i$ pairs of queries chosen with control set $i$. As expected, the number $N_i$ of pairs increases with $T_i$ (i.e., the total number of iterations in which control set $i$ is chosen). The assumption on $Y_i$ is mild: As long as $Y_i$ has at least one non-zero median, it will hold. The assumption only excludes the case in which $P_\ell$ for all $\ell \in \bar{I}_i$ are degenerate with all probability mass on a single point. With Lemma 4.3, we now derive an alternative regret bound that depends on the variances of the distributions and the number of plays of cheaper control sets:

Theorem 4.4. Suppose that the following hold: the assumption of Lemma 4.3 holds; $k(x, x')$ is an isotropic kernel which depends only on the distance between $x$ and $x'$ and can be written as $\tilde{k}(\|x - x'\|)$; and there exists an iteration $r$ s.t. $i_t \le m - 1$ for all $t \le r$ and $i_t = m$ for all $t > r$. Then, with probability at least $1 - \delta$, UCB-CVS (Alg. 1) incurs a cost-varying cumulative regret bounded by

$$R_T \le c_m \sum_{t=1}^T \epsilon_t + O\left( \left( B + \sqrt{\gamma_T(\mathcal{X}) + \log \tfrac{2m}{\delta}} \right) \left[ c_m \left( \sqrt{T \, \gamma_T(\mathcal{X})} - L + \log \tfrac{2m}{\delta} \right) + \sum_{i=1}^{m-1} c_i \left( \sqrt{T_i \, \gamma_{T_i}(\tilde{\mathcal{X}}_i)} + \log \tfrac{2m}{\delta} \right) \right] \right)$$

with $L := \lambda \big( \sum_{i=1}^{m-1} N_i \log\big( V_i - 2\tilde{k}(\sqrt{M_i}) - \tilde{k}(\sqrt{M_i})^2 \big) + W \big)$, by setting $\beta_t = B + \sigma \sqrt{2(\gamma_{t-1}(\mathcal{X}) + 1 + \log((2m)/\delta))}$, where $N_i$ and $M_i$ are previously defined in Lemma 4.3, and $V_i$ and $W$ are residual terms defined in Appendix A.5.

Theorem 4.4 shows that the MIG term pertaining to the most expensive control set $m$ is reduced by $L$, which increases as each $N_i$ increases, which in turn increases as $T_i$ increases. This suggests that an $\epsilon$-schedule that increases the number of times cheaper control sets are played can reduce the MIG term. $L$ also increases as $\tilde{k}(\sqrt{M_i})$ decreases. For common kernels such as the squared exponential or Matérn kernel with $\nu > 1$ (which satisfy the second assumption on isotropic kernels), $\tilde{k}(\sqrt{M_i})$ decreases as $M_i$ increases, from which we may conclude that higher variance probability distributions governing each $X_\ell$ lead to a larger $L$ due to (2) and hence a larger decrease in the MIG term. In cases where $c_m \gg c_i$ for all $i \ne m$, a carefully chosen $\epsilon$-schedule can thus lead to a large decrease in the regret bound via $L$. The third assumption is (informally) approximately true in practice due to the design of UCB-CVS: If a decreasing $\epsilon$-schedule is used, the algorithm will choose the cheaper but sub-optimal control sets at the start. After $\epsilon_t$ has decreased past a certain value, the algorithm will only choose the optimal (and likely most expensive) control set.

The proof sketch upper bounds the sum of posterior standard deviations of queries chosen with control set $m$ by the MIG term minus the sum of posterior standard deviations of queries chosen with all other control sets. This latter sum is then lower bounded by a log determinant of the prior covariance matrix, which is then decomposed into a sum of log determinants of pairs of queries. The dependence on the distances between the pairs can be made explicit in this form. Neither Theorem 4.1 nor 4.4 is more general than the other.
4.2 Practical Considerations

UCB-CVS is presented with the $\epsilon$-schedule formulation for generality and ease of theoretical analysis. In practice, however, the $\epsilon$-schedule is a hyperparameter that is difficult to interpret and choose. We propose a simple explore-then-commit (ETC) variant with which the learner only chooses the number of plays of each cost group (defined as a collection of control sets with the same cost that is not the maximum cost). In each iteration, the algorithm chooses the cost group with the lowest cost and non-zero remaining plays, and then chooses the control set within that cost group with the largest expected UCB (similar to Step 6 in Alg. 1). Once all cost groups have zero remaining plays, the algorithm chooses the control set with the largest expected UCB among all control sets. This algorithm is highly interpretable and is equivalent to UCB-CVS with a specific sublinear $\epsilon$-schedule (that cannot be known a priori). Furthermore, the learner should choose the number of plays adaptively depending on the cost of each cost group; a sketch of this selection rule is given at the end of this subsection.

On computational considerations, UCB-CVS may be computationally expensive if the number $m$ of control sets is large (e.g., if every subset of variables is available as a control set and $m = 2^d$) as each control set requires a maximization of the expected UCB (which can be approximated with Monte Carlo sampling). In such cases, the learner has the option to simply ignore any number of control sets to reduce $m$, as long as $i^*$ is not ignored.
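A minimal sketch of the ETC variant's per-iteration selection rule follows. It reuses the hypothetical expected_ucb helper and candidate grids from the earlier sketches; only the plays-per-cost-group bookkeeping is new, and the concrete numbers are illustrative assumptions.

```python
def etc_cvs_step(gp_ucb, d, control_sets, costs, remaining_plays, cand_grids, samplers):
    # remaining_plays: dict mapping each non-maximum cost value (cost group) to its
    # remaining number of forced plays, e.g. {0.01: 50, 0.1: 50} for ETC-50.
    open_groups = [c for c, n in remaining_plays.items() if n > 0]
    if open_groups:
        c_low = min(open_groups)                   # cheapest cost group with plays left
        candidates = [i for i, c in enumerate(costs) if c == c_low]
        remaining_plays[c_low] -= 1
    else:
        candidates = list(range(len(control_sets)))  # committed: free choice over all sets
    scored = [(max((expected_ucb(gp_ucb, d, control_sets[i], x, samplers[i]), tuple(x))
                   for x in cand_grids[i]), i) for i in candidates]
    (val, x_best), i_t = max(scored)               # largest expected UCB within the candidates
    return i_t, x_best

# ETC-Ada uses 4 / c_j plays per cost group, e.g. {0.01: 400, 0.1: 40} under the cheap cost set.
```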
Figure 2: Mean and standard error (over 10 RNG seeds) of the simple regret (lower is better) incurred against cost spent (budget) $C$ by TS-PSQ, UCB-PSQ, ETC-50, ETC-100, and ETC-Ada with varying objective functions, cost sets, and variances of distributions. A diamond indicates the average budget after which an algorithm only chooses the optimal control set.

5 Experiments and Discussion

This section empirically evaluates the performance of the tested algorithms with 4 objective functions: (a) function samples from a GP prior (3-D), (b) the Hartmann synthetic function (3-D), (c) a plant growth simulator built from real-world data where the variables are nutrients such as NH3 and pH (5-D), and (d) a simulator built from the airfoil self-noise dataset (5-D) from the UCI Machine Learning Repository [9]. For the first 2 objective functions, the control sets are all possible subsets of the 3 variables except the empty set, which leads to 7 control sets. For the plant growth objective function, we pick 7 control sets including the full query control set. For the airfoil self-noise objective function, similar to that of [13], we pick 7 control sets of 2 variables each that are not subsets of each other.

We use 3 different sets of costs for the 7 control sets: cheap ({0.01, 0.01, 0.01, 0.1, 0.1, 0.1, 1}), moderate ({0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 1}), and expensive ({0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 1}). Using these sets of costs, the control sets are ordered such that $c_i < c_j \implies \max_{x^i \in \mathcal{X}^i} \mathbb{E}[f([x^i, X^{-i}])] \le \max_{x^j \in \mathcal{X}^j} \mathbb{E}[f([x^j, X^{-j}])]$. These cost sets fix the optimal (i.e., last) control set to have a cost of 1. While these cost sets may (at first glance) seem arbitrary, it is the algorithms' relative performance across these cost sets, rather than the absolute performance on a single cost set, that allows us to understand the conditions under which particular algorithms perform better or worse. Real-world applications (unlike the experiments conducted here) will come with their own cost sets defined by real-world constraints. If the real costs can also be categorized in a similar relative way like the above cheap, moderate, and expensive cost sets, then the results are expected to be similar. Every probability distribution $P_\ell$ is a truncated normal distribution with mean 0.5 and the same variance, which is one of 0.02, 0.04, and 0.08 (the uniform distribution on [0, 1] has variance 1/12).
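For reference, per-variable distributions of this kind can be drawn with SciPy. The sketch below matches the variance of the truncated variable numerically; the paper does not spell out this calibration detail, so treat it as our own assumption:

```python
import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import brentq

def make_trunc_normal(var_target, mean=0.5, lo=0.0, hi=1.0):
    # Find the pre-truncation scale s so that the [lo, hi]-truncated normal
    # centered at `mean` has variance `var_target`.
    def var_gap(s):
        a, b = (lo - mean) / s, (hi - mean) / s
        return truncnorm.var(a, b, loc=mean, scale=s) - var_target
    s = brentq(var_gap, 1e-3, 10.0)
    a, b = (lo - mean) / s, (hi - mean) / s
    return truncnorm(a, b, loc=mean, scale=s)

P = make_trunc_normal(0.04)
samples = P.rvs(size=1024, random_state=0)
print(samples.mean(), samples.var())  # approximately 0.5 and 0.04
```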
We compare the performance of our algorithm against that of the baseline Thompson sampling (TS-PSQ) algorithm developed in [13]. We test UCB-PSQ ($\epsilon$-schedule with $\epsilon_t = 0$ for all $t$) along with the ETC variant of UCB-CVS (Sec. 4.2) with 3 sets of hyperparameters: 50 plays per cost group (ETC-50), 100 plays per cost group (ETC-100), and a cost-adaptive version with $4/c_j$ plays per cost group where $c_j$ is the cost of the control sets in that cost group (ETC-Ada). We also investigated simple extensions of TS-PSQ, UCB-PSQ, and expected improvement (adapted for BOPSQ) for the BOCVS problem by dividing the acquisition score of a control set by its cost in a manner similar to that in [31, Sec. 3.2]. We observed that these naive methods generally do not work well; we defer the results and discussion of these methods to Appendix B. Refer to Appendix C for full descriptions of all experimental settings and algorithm hyperparameters. The code for the experiments may be found at https://github.com/sebtsh/bocvs.

Fig. 2 shows the mean and standard error (over 10 RNG seeds) of the simple regret $\min_{1 \le t \le T(C)} \big(\mathbb{E}[f([x^{i^*}, X^{-i^*}])] - \mathbb{E}[f([x^{i_t}, X^{-i_t}])]\big)$ (lower is better) incurred against cost spent (budget) $C$ by each algorithm with varying objective functions, cost sets, and variances of distributions, where $T(C)$ denotes the maximum iteration reached after spending $C$. The simple regret encodes the value of the best solution an algorithm has chosen within a certain budget and is a measure of cost efficiency. We report the salient observations below:

(1) UCB-CVS variants outperform TS-PSQ and UCB-PSQ under cheap/moderate costs when the full query control set is available. With the GP sample, Hartmann, and plant growth objective functions, the full query control set is available. TS-PSQ and UCB-PSQ only choose the full query control set in every iteration and are very cost inefficient under cheap and moderate costs, while UCB-CVS variants are able to use the cheaper control sets for exploration, followed by the full query control set for exploitation, and find much better solutions with the same budget. As expected, their performance advantage shrinks as the costs increase and $c_m$ gets closer to $c_i$ for all $i \ne m$.

(2) Cost-adaptive UCB-CVS (ETC-Ada) can maintain competitive performance under expensive costs. The non-cost-adaptive variants, ETC-50 and ETC-100, perform worse than TS-PSQ and UCB-PSQ under expensive costs. In contrast, ETC-Ada generally performs well under all costs by tuning the number of plays of suboptimal cost groups according to their costs. We recommend that practitioners use adaptive algorithms to achieve good performance under any cost set. In particular, the results suggest that an $O(c_i^{-1})$ threshold is likely to work well across different sets of costs and is a robust choice for practitioners that keeps the number of hyperparameters to a minimum.

(3) TS-PSQ and UCB-PSQ perform relatively well when the control sets are not subsets of each other. With the airfoil self-noise objective function, TS-PSQ and UCB-PSQ perform better as the control sets with this objective function are not subsets of each other and thus, they can also use the cheaper control sets during learning, while the UCB-CVS variants suffer worse performance here due to artificially selecting suboptimal control sets and queries with the $\epsilon$-relaxations. This worse performance is encoded in Theorems 4.1 and 4.4 as the sum of $\epsilon_t$ terms.

(4) Increasing the variance of the probability distributions has competing effects on the simple regret. Of the 42 experimental settings (combinations of objective function, cost set, and algorithm) in which the variance makes a difference (excluding TS-PSQ and UCB-PSQ for all objective functions except airfoil), the settings with variance 0.02, 0.04, and 0.08 achieved the lowest mean simple regret by the end 11, 6, and 25 times, respectively. This generally supports Theorem 4.4's prediction that higher variances decrease the upper bound on regret. However, due to the looseness of the bound, this effect is not guaranteed and there are still cases where lower variances lead to a lower regret, as suggested by the argument about feasible sets when discussing Theorem 4.1; note that the same MIGs of the feasible sets for control sets 1 to $m - 1$ appear in Theorem 4.4. We observe competing effects and conclude that the effect of increasing variance is problem- and algorithm-dependent. While higher variances may lead to more exploration, they may also result in too much smoothing of function values, which may hinder the learner's ability to focus on high-value query regions.

6 Conclusion

This paper introduces the BOCVS problem and describes the UCB-CVS algorithm that is provably no-regret in solving this problem. We show that our algorithm performs well across several different experimental settings and achieves the desired goal of finding significantly better solutions within the same budget. This work opens up avenues of future research: In particular, an entropy search-based algorithm [14, 23, 41] that chooses control sets and queries based on expected information gain per unit cost is a non-trivial and promising direction for alternative methods of solving BOCVS.

Acknowledgements and Disclosure of Funding

This research/project is supported by the Agency for Science, Technology and Research, Singapore (A*STAR), under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Funds (Award A20H6b0151) and its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund Pre-Positioning (IAF-PP) (Award A19E4a0101). Sebastian Shenghong Tay is supported by A*STAR.

References

[1] V. Aglietti, X. Lu, A. Paleyes, and J. González. Causal Bayesian optimization. In Proc. AISTATS, pages 3155–3164. PMLR, 2020.
[2] V. Aglietti, N. Dhir, J. González, and T. Damoulas. Dynamic causal Bayesian optimization. In Proc. NeurIPS, volume 34, pages 10549–10560, 2021.
[3] M. Balandat, B. Karrer, D. R. Jiang, S. Daulton, B. Letham, A. G. Wilson, and E. Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In Proc. NeurIPS, pages 21524–21538, 2020.
[4] N. Branchini, V. Aglietti, N. Dhir, and T. Damoulas.
Causal entropy optimization. In Proc. AISTATS, pages 8586–8605. PMLR, 2023.
[5] S. Cakmak, R. Astudillo Marban, P. Frazier, and E. Zhou. Bayesian optimization of risk measures. In Proc. NeurIPS, pages 20130–20141, 2020.
[6] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas. Bayesian optimization in AlphaGo. arXiv:1812.06855, 2018.
[7] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proc. ICML, pages 844–853, 2017.
[8] D. E. Crabtree and E. V. Haynsworth. An identity for the Schur complement of a matrix. Proceedings of the American Mathematical Society, 22(2):364–366, 1969.
[9] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[10] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Proc. NeurIPS, pages 7587–7597, 2018.
[11] R. Garnett. Bayesian Optimization. Cambridge University Press, 2023. To appear.
[12] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
[13] S. Hayashi, J. Honda, and H. Kashima. Bayesian optimization with partially specified queries. Machine Learning, 111(3):1019–1048, 2022.
[14] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Proc. NeurIPS, pages 918–926, 2014.
[15] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. In Proc. NeurIPS, pages 1000–1008, 2016.
[16] J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. In Proc. COLT, pages 358–384, 2018.
[17] J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally robust Bayesian optimization. In Proc. AISTATS, pages 2174–2184, 2020.
[18] T. L. Lai, H. Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[19] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[20] J. Mockus, V. Tiešis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. In L. C. W. Dixon and G. P. Szegö, editors, Towards Global Optimization 2, pages 117–129. North-Holland Publishing Company, 1978.
[21] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Optimizing conditional value-at-risk of black-box functions. In Proc. NeurIPS, pages 4170–4180, 2021.
[22] Q. P. Nguyen, Z. Dai, B. K. H. Low, and P. Jaillet. Value-at-risk optimization with Gaussian processes. In Proc. ICML, pages 8063–8072, 2021.
[23] Q. P. Nguyen, Z. Wu, B. K. H. Low, and P. Jaillet. Trusted-maximizers entropy search for efficient Bayesian optimization. In Proc. UAI, pages 1486–1495, 2021.
[24] T. Nguyen, S. Gupta, H. Ha, S. Rana, and S. Venkatesh. Distributionally robust Bayesian quadrature optimization. In Proc. AISTATS, pages 1921–1931, 2020.
[25] R. Oliveira, L. Ott, and F. Ramos. Bayesian optimisation under uncertain inputs. In Proc. AISTATS, pages 1177–1184, 2019.
[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z.
Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. NeurIPS, pages 8026–8037, 2019.
[27] M. Poloczek, J. Wang, and P. Frazier. Multi-information source optimization. In Proc. NeurIPS, pages 4288–4298, 2017.
[28] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proc. NeurIPS, pages 1177–1184, 2007.
[29] R. Shalloo, S. Dann, J.-N. Gruse, et al. Automation and control of laser wakefield accelerators using Bayesian optimization. Nature Communications, 11(1):1–8, 2020.
[30] B. J. Shields, J. Stevens, J. Li, M. Parasram, F. Damani, J. I. M. Alvarado, J. M. Janey, R. P. Adams, and A. G. Doyle. Bayesian reaction optimization as a tool for chemical synthesis. Nature, 590(7844):89–96, 2021.
[31] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. NeurIPS, pages 2951–2959, 2012.
[32] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proc. ICML, pages 1015–1022, 2010.
[33] S. Sussex, A. Makarova, and A. Krause. Model-based causal Bayesian optimization. In Proc. ICLR, 2023.
[34] P. M. Swamidass. Encyclopedia of Production and Manufacturing Management. Springer Science & Business Media, 2000.
[35] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Proc. NeurIPS, pages 2004–2012, 2013.
[36] S. Takeno, H. Fukuoka, Y. Tsukada, T. Koyama, M. Shiga, I. Takeuchi, and M. Karasuyama. Multi-fidelity Bayesian optimization with max-value entropy search and its parallelization. In Proc. ICML, pages 9334–9345, 2020.
[37] S. S. Tay, C. S. Foo, U. Daisuke, R. Leong, and B. K. H. Low. Efficient distributionally robust Bayesian optimization with worst-case sensitivity. In Proc. ICML, pages 21180–21204, 2022.
[38] S. Toscano-Palmerin and P. I. Frazier. Bayesian optimization with expensive integrands. SIAM Journal on Optimization, 32(2):417–444, 2022.
[39] R. Turner, D. Eriksson, M. McCourt, J. Kiili, E. Laaksonen, Z. Xu, and I. Guyon. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. In Proc. NeurIPS 2020 Competition and Demonstration Track, pages 3–26, 2021.
[40] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020.
[41] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proc. ICML, pages 3627–3635, 2017.
[42] C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

A.1 Proof of Theorem 4.1

Theorem 4.1. With probability at least $1 - \delta$, UCB-CVS (Alg. 1) incurs a cost-varying cumulative regret bounded by

$$R_T \le c_m \sum_{t=1}^T \epsilon_t + O\left( \left( B + \sqrt{\gamma_T(\mathcal{X}) + \log \tfrac{m+1}{\delta}} \right) \sum_{i=1}^m c_i \left( \sqrt{T_i \, \gamma_{T_i}(\tilde{\mathcal{X}}_i)} + \log \tfrac{m+1}{\delta} \right) \right)$$

by setting $\beta_t = B + \sigma \sqrt{2(\gamma_{t-1}(\mathcal{X}) + 1 + \log((m+1)/\delta))}$.
Proof. With probability at least $1 - \delta$,

$$\begin{aligned}
R_T &= \sum_{t=1}^T c_{i_t}\Big(\mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big]\Big) \\
&\le \sum_{t=1}^T c_{i_t}\Big(\mathbb{E}\big[u_{t-1}([x^{i^*}, X^{-i^*}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big]\Big) \\
&\overset{(i)}{\le} \sum_{t=1}^T c_{i_t}\Big(\mathbb{E}\big[u_{t-1}([x^{i_t}, X^{-i_t}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big] + \epsilon_t\Big) \\
&\le \sum_{t=1}^T c_{i_t}\,\mathbb{E}\big[u_{t-1}([x^{i_t}, X^{-i_t}]) - f([x^{i_t}, X^{-i_t}])\big] + c_m \sum_{t=1}^T \epsilon_t \\
&\overset{(ii)}{\le} \sum_{i=1}^m c_i (2\beta_T) \sum_{t\in\tilde{T}_i} \mathbb{E}\big[\sigma_{t-1}([x^{i_t}, X^{-i_t}])\big] + c_m \sum_{t=1}^T \epsilon_t \qquad (3) \\
&\overset{(iii)}{\le} \sum_{i=1}^m c_i (2\beta_T)\Big(2\sum_{t\in\tilde{T}_i} \sigma_{t-1}(x_t) + 4\log\tfrac{m+1}{\delta} + 8\log(4) + 1\Big) + c_m \sum_{t=1}^T \epsilon_t \qquad (4) \\
&\overset{(iv)}{\le} \sum_{i=1}^m c_i (2\beta_T)\Big(2\sqrt{4(T_i+2)\gamma_{T_i}(\tilde{\mathcal{X}}_i)} + 4\log\tfrac{m+1}{\delta} + 8\log(4) + 1\Big) + c_m \sum_{t=1}^T \epsilon_t \qquad (5) \\
&= c_m \sum_{t=1}^T \epsilon_t + O\left(\Big(B + \sqrt{\gamma_T(\mathcal{X}) + \log\tfrac{m+1}{\delta}}\Big)\sum_{i=1}^m c_i\Big(\sqrt{T_i\,\gamma_{T_i}(\tilde{\mathcal{X}}_i)} + \log\tfrac{m+1}{\delta}\Big)\right)
\end{aligned}$$

where $\tilde{T}_i$ is the ordered sequence of iterations at which control set $i$ is chosen, (i) follows from the algorithm's choice of $x^{i_t}$, (ii) follows from Lemma A.4 with probability $\delta/(m+1)$, (iii) follows from Lemma A.1 with probability $\delta/(m+1)$ applied once for each control set $i$, and (iv) follows from Lemma A.3 and the definition of $\tilde{\mathcal{X}}_i$ as the feasible set for control set $i$; the final equality also uses $\beta_T \le B + \sigma\sqrt{2(\gamma_T(\mathcal{X}) + 1 + \log((m+1)/\delta))}$. A union bound over the $m + 1$ events comprising the $m$ applications of Lemma A.1 and the single application of Lemma A.4 yields the desired $1 - \delta$ probability bound.

A.2 Proof of Lemma A.1

Lemma A.1. Let $k(x, x) = 1$ and let $\tilde{T}_i$ be the ordered sequence of iterations at which control set $i$ is chosen by UCB-CVS. For any $i \in [m]$, with probability at least $1 - \delta$,

$$\sum_{t \in \tilde{T}_i} \mathbb{E}\big[\sigma_{t-1}([x^{i_t}, X^{-i_t}])\big] \le 2 \sum_{t \in \tilde{T}_i} \sigma_{t-1}(x_t) + 4 \log \frac{1}{\delta} + 8 \log 4 + 1 \,.$$

Proof. For this proof, define a probability space $(\Omega, \mathcal{F}, P)$ and a filtration $\mathbb{F} = \{\mathcal{F}_t\}_{t=1}^\infty$, where $\mathcal{F}_t := \sigma(i_1, x^{i_1}, x^{-i_1}, y_1, i_2, x^{i_2}, x^{-i_2}, y_2, \ldots, i_t, x^{i_t}, x^{-i_t}, y_t)$ is the sigma-algebra generated by all the random variables in the BO procedure known by the end of iteration $t$. In advance of proving this result, it should be clarified that, in the main paper and all proofs excluding that of this lemma, $\mathbb{E}[\sigma_{t-1}([x^{i_t}, X^{-i_t}])]$ denotes the quantity obtained by treating $\sigma_{t-1}$ and $x^{i_t}$ as deterministic and treating $X^{-i_t}$ as a random vector distributed according to the probability distribution $P^{-i_t}$. This is for ease of exposition. In the following proof, however, when using the formalism of random processes, what was previously referred to as $\mathbb{E}[\sigma_{t-1}([x^{i_t}, X^{-i_t}])]$ is actually $\mathbb{E}[\sigma_{t-1}(x_t) \mid \mathcal{F}_{t-1}]$. The meaning is equivalent since $\sigma_{t-1}$, $i_t$, and $x^{i_t}$ are $\mathbb{F}$-predictable, and the only uncertainty about $\sigma_{t-1}(x_t)$ arises from the lack of knowledge about $x^{-i_t}$.

Now we begin the proof proper. Define $m$ stochastic processes $\{X_t^{(1)}\}_{t=1}^\infty, \{X_t^{(2)}\}_{t=1}^\infty, \ldots, \{X_t^{(m)}\}_{t=1}^\infty$ where, using $\mathbb{1}[A]$ to denote the indicator function that is equal to 1 when the event $A$ is true and 0 otherwise,

$$X_t^{(i)} := \sigma_{t-1}(x_t) \, \mathbb{1}[i_t = i] \,.$$

Since each $X_t^{(i)}$ is $\mathcal{F}_t$-measurable, each stochastic process is adapted to $\mathbb{F}$. Next, define

$$m_t^{(i)} := \mathbb{E}\big[X_t^{(i)} \mid \mathcal{F}_{t-1}\big] = \mathbb{E}\big[\sigma_{t-1}(x_t) \mathbb{1}[i_t = i] \mid \mathcal{F}_{t-1}\big] = \mathbb{1}[i_t = i] \, \mathbb{E}[\sigma_{t-1}(x_t) \mid \mathcal{F}_{t-1}]$$

where the last equality uses the pull-through property since $i_t$ is $\mathcal{F}_{t-1}$-measurable. Using Lemma A.5 with $b_t = 1$ since $k(x, x) = 1$, with probability at least $1 - \delta$, for any $T \ge 1$,

$$\begin{aligned}
\sum_{t=1}^T m_t^{(i)} &\le 2 \sum_{t=1}^T X_t^{(i)} + 4 \log \frac{1}{\delta} + 8 \log 4 + 1 \\
\sum_{t=1}^T \mathbb{1}[i_t = i] \, \mathbb{E}[\sigma_{t-1}(x_t) \mid \mathcal{F}_{t-1}] &\le 2 \sum_{t=1}^T \sigma_{t-1}(x_t) \mathbb{1}[i_t = i] + 4 \log \frac{1}{\delta} + 8 \log 4 + 1 \\
\sum_{t \in \tilde{T}_i} \mathbb{E}[\sigma_{t-1}(x_t) \mid \mathcal{F}_{t-1}] &\le 2 \sum_{t \in \tilde{T}_i} \sigma_{t-1}(x_t) + 4 \log \frac{1}{\delta} + 8 \log 4 + 1
\end{aligned}$$

which completes the proof.

A.3 Proof of Proposition 4.2

Proposition 4.2. If there exists an $\bar{\epsilon} > 0$ s.t. for all $i \ne i^*$,

$$\epsilon_t \le \mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \max_{x^i \in \mathcal{X}^i} \mathbb{E}\big[f([x^i, X^{-i}])\big] - \bar{\epsilon}$$

eventually (i.e., the inequality holds for all $t \ge q$ for some $q \ge 1$), and $\gamma_T(\mathcal{X}) < O(\sqrt{T})$, then, with probability at least $1 - \delta$, $\lim_{T \to \infty} T_i / T = 0$ for all $i \ne i^*$ and UCB-CVS incurs a cost-varying cumulative regret that is sublinear in $T$ by setting $\beta_t = B + \sigma \sqrt{2(\gamma_{t-1}(\mathcal{X}) + 1 + \log((m+1)/\delta))}$.

Proof.
Define $x_t^i := \mathrm{argmax}_{x^i \in \mathcal{X}^i} \mathbb{E}[u_{t-1}([x^i, X^{-i}])]$ and $j_t := \mathrm{argmax}_{i \in [m]} \max_{x^i \in \mathcal{X}^i} \mathbb{E}[u_{t-1}([x^i, X^{-i}])]$. Using $\mathbb{1}[A]$ to denote the indicator function that is equal to 1 when the event $A$ is true and 0 otherwise, for any $i \ne i^*$,

$$\begin{aligned}
T_i &\overset{(i)}{\le} \sum_{t=1}^T \mathbb{1}\Big[\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] + \epsilon_t \ge \mathbb{E}\big[u_{t-1}([x_t^{j_t}, X^{-j_t}])\big]\Big] \\
&\overset{(ii)}{\le} \sum_{t=1}^T \mathbb{1}\Big[\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] + \epsilon_t \ge \mathbb{E}\big[u_{t-1}([x^{i^*}, X^{-i^*}])\big]\Big] \\
&\le \sum_{t=1}^T \mathbb{1}\Big[\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] + \epsilon_t \ge \mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big]\Big] \\
&\le q - 1 + \sum_{t=q}^T \mathbb{1}\Big[\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] \ge \mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \epsilon_t\Big] \\
&\le q - 1 + \sum_{t=q}^T \mathbb{1}\Big[\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] \ge \max_{x^i \in \mathcal{X}^i} \mathbb{E}\big[f([x^i, X^{-i}])\big] + \bar{\epsilon}\Big] \\
&\le q - 1 + \sum_{t=q}^T \mathbb{1}\Big[\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] - \mathbb{E}\big[f([x_t^i, X^{-i}])\big] \ge \bar{\epsilon}\Big] \\
&\le q - 1 + \frac{1}{\bar{\epsilon}} \sum_{t=q}^T \Big(\mathbb{E}\big[u_{t-1}([x_t^i, X^{-i}])\big] - \mathbb{E}\big[f([x_t^i, X^{-i}])\big]\Big) \\
&\overset{(iii)}{\le} q - 1 + \tilde{O}\left(\frac{1}{\bar{\epsilon}} \sqrt{T - q + 1}\Big(B\sqrt{\gamma_{T-q+1}(\mathcal{X})} + \gamma_{T-q+1}(\mathcal{X})\Big)\right)
\end{aligned}$$

where (i) follows since control set $i$ is only played when the condition on the RHS is true, (ii) follows from the definitions of $j_t$ and $x_t^{j_t}$, and (iii) follows from the steps from (3) to (5) in the proof of Theorem 4.1, and $\tilde{O}$ denotes suppressing logarithmic factors. Now dividing both sides by $T$ and taking the limit as $T$ goes to infinity,

$$\lim_{T \to \infty} \frac{T_i}{T} \le \lim_{T \to \infty} \frac{1}{T}\left(q - 1 + \tilde{O}\left(\frac{1}{\bar{\epsilon}} \sqrt{T - q + 1}\Big(B\sqrt{\gamma_{T-q+1}(\mathcal{X})} + \gamma_{T-q+1}(\mathcal{X})\Big)\right)\right) = 0$$

which follows from $\gamma_T(\mathcal{X}) < O(\sqrt{T})$ and completes the proof that, if the conditions in the proposition are fulfilled, suboptimal control sets will only be played a number of times that is sublinear in $T$. The proof that $R_T$ will then also be sublinear in $T$ is straightforward. Assuming without loss of generality that $i^* = m$,

$$\begin{aligned}
R_T &= \sum_{t=1}^T c_{i_t}\Big(\mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big]\Big) = \sum_{i=1}^m \sum_{t \in \tilde{T}_i} c_i\Big(\mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big]\Big) \\
&\le \sum_{i=1}^{m-1} T_i C_i + c_m \sum_{t \in \tilde{T}_m}\Big(\mathbb{E}\big[f([x^{i^*}, X^{-i^*}])\big] - \mathbb{E}\big[f([x^{i_t}, X^{-i_t}])\big]\Big) \\
&\overset{(i)}{\le} \sum_{i=1}^{m-1} T_i C_i + O\left(\Big(B + \sqrt{\gamma_T(\mathcal{X}) + \log\tfrac{m+1}{\delta}}\Big) c_m\Big(\sqrt{T_m\,\gamma_{T_m}(\tilde{\mathcal{X}}_m)} + \log\tfrac{m+1}{\delta}\Big)\right)
\end{aligned}$$

where $\tilde{T}_i$ is the ordered sequence of iterations at which control set $i$ is chosen and $C_i := c_i\big(\mathbb{E}[f([x^{i^*}, X^{-i^*}])] - \min_{x^i \in \mathcal{X}^i} \mathbb{E}[f([x^i, X^{-i}])]\big)$, and (i) follows from the steps in the proof of Theorem 4.1 but only for control set $m$ and without accounting for the $\epsilon$-schedule. Since each $T_i$ is sublinear in $T$, dividing both sides by $T$, using the fact that $\gamma_T(\mathcal{X}) < O(\sqrt{T})$, and taking the limit as $T \to \infty$ yields the desired result and completes the proof.

A.4 Proof of Lemma 4.3

Lemma 4.3. Assume that, for each control set $i$, the random variable $Y_i := \|[0, X^{-i}_1] - [0, X^{-i}_2]\|^2$ has a median $M_i$ such that $\mathbb{E}[Y_i \mid Y_i > M_i] \le h_i M_i$ for some $h_i > 0$, where $X^{-i}_1, X^{-i}_2 \sim P^{-i}$. With probability at least $1 - \delta$, there will be at least $N_i$ non-overlapping pairs of queries $x$ and $x'$ chosen by UCB-CVS (Alg. 1) with control set $i$ such that $\|x - x'\|^2 \ge M_i$, where

$$N_i = \left\lfloor \frac{T_i - 1}{4} - \sqrt{\frac{T_i}{4} \log \frac{1}{\delta}} \right\rfloor \quad \text{and} \quad M_i \ge \frac{4}{h_i + 1} \sum_{\ell \in \bar{I}_i} \mathbb{V}[X_\ell] \,.$$

Proof. Consider two queries $x = [x^i, x^{-i}]$ and $x' = [x'^i, x'^{-i}]$ chosen with control set $i$. The learner only selects $x^i$ and $x'^i$ while $x^{-i}$ and $x'^{-i}$ are sampled from the environment. Before they are sampled, the queries may be considered themselves random vectors composed of one deterministic partial query and one random partial query. Denote these random vectors as $X = [x^i, X^{-i}_1]$ and $X' = [x'^i, X^{-i}_2]$. $\|X - X'\|^2$ is therefore a random variable as well. Observe that

$$\|X - X'\|^2 = \sum_{j \in I_i} (x_j - x'_j)^2 + \sum_{\ell \in \bar{I}_i} (X_\ell - X'_\ell)^2 \ge \sum_{\ell \in \bar{I}_i} (X_\ell - X'_\ell)^2 = \big\|[0, X^{-i}_1] - [0, X^{-i}_2]\big\|^2$$

where the last expression is a random variable $Y'_i$ that is i.i.d. with $Y_i$. Therefore, any $\|X - X'\|^2$ can be treated as a random variable equal to some $Y'_i$ that is i.i.d. with $Y_i$ plus some non-negative term. The rest of this proof will use lower bounds on random variables i.i.d. with $Y_i$, which will in turn imply lower bounds on $\|X - X'\|^2$.
First, with $\mu_\ell := \mathbb{E}[X_\ell]$,

$$\mathbb{E}[Y_i] = \mathbb{E}\Big[\sum_{\ell \in \bar{I}_i} (X_\ell - X'_\ell)^2\Big] = \sum_{\ell \in \bar{I}_i} \mathbb{E}\big[\big((X_\ell - \mu_\ell) - (X'_\ell - \mu_\ell)\big)^2\big] = \sum_{\ell \in \bar{I}_i} \Big(\mathbb{E}[(X_\ell - \mu_\ell)^2] - 2\,\mathbb{E}[(X_\ell - \mu_\ell)(X'_\ell - \mu_\ell)] + \mathbb{E}[(X'_\ell - \mu_\ell)^2]\Big) = \sum_{\ell \in \bar{I}_i} 2\,\mathbb{V}[X_\ell] \qquad (6)$$

where the cross term vanishes since $X_\ell$ and $X'_\ell$ are independent. We will now construct a lower bound for a median of $Y_i$ denoted $M_i$:

$$\begin{aligned}
\mathbb{E}[Y_i] &= \mathbb{E}[Y_i \mid Y_i < M_i]\,P(Y_i < M_i) + \mathbb{E}[Y_i \mid Y_i = M_i]\,P(Y_i = M_i) + \mathbb{E}[Y_i \mid Y_i > M_i]\,P(Y_i > M_i) \\
&\le M_i\,P(Y_i \le M_i) + \mathbb{E}[Y_i \mid Y_i > M_i]\,P(Y_i > M_i) \\
&\overset{(i)}{\le} M_i\,P(Y_i \le M_i) + h_i M_i\,P(Y_i > M_i) \overset{(ii)}{\le} \frac{h_i + 1}{2}\,M_i
\end{aligned}$$

where (i) follows from our assumption on the median $M_i$ and (ii) follows from the definition of a median, $P(Y_i \le M_i) \ge 1/2$ (together with $h_i \ge 1$, which the assumption implies since $\mathbb{E}[Y_i \mid Y_i > M_i] > M_i$). Substituting in (6) completes our construction of the lower bound for $M_i$:

$$M_i \ge \frac{2}{h_i + 1}\,\mathbb{E}[Y_i] = \frac{4}{h_i + 1} \sum_{\ell \in \bar{I}_i} \mathbb{V}[X_\ell] \,.$$

Now consider the $\lfloor T_i/2 \rfloor$ non-overlapping pairs of queries chosen with control set $i$.³ Associate each pair with a random variable $Y_{ij}$ such that we have $\lfloor T_i/2 \rfloor$ i.i.d. random variables $Y_{i1}, Y_{i2}, \ldots, Y_{i\lfloor T_i/2 \rfloor}$. From the definition of a median, $P(Y_i \ge M_i) \ge 1/2$. Without loss of generality, assume the worst case such that $P(Y_i \ge M_i) = 1/2$. We can now construct $\lfloor T_i/2 \rfloor$ i.i.d. Bernoulli random variables $Z_1, Z_2, \ldots, Z_n$, $n = \lfloor T_i/2 \rfloor$, with $p = 1/2$ where a success ($Z_j = 1$) corresponds to $Y_{ij} \ge M_i$ and a failure ($Z_j = 0$) corresponds to $Y_{ij} < M_i$. Further define the random variable $Z := \sum_{j=1}^n Z_j$. Applying Hoeffding's inequality,

$$P\Big(\sum_{j=1}^n (Z_j - p) \le -tn\Big) \le \exp(-2nt^2) \implies P\big(Z \le n(p - t)\big) \le \exp(-2nt^2)\,.$$

Choosing $t = \sqrt{\alpha/n}$ for some constant $\alpha$, $P\big(Z \le np - \sqrt{\alpha n}\big) \le \exp(-2\alpha)$. For $\exp(-2\alpha) \le \delta$, it suffices to take $\alpha = \frac{1}{2}\log\frac{1}{\delta}$, and with $p = 1/2$ and $(T_i - 1)/2 \le n \le T_i/2$,

$$np - \sqrt{\alpha n} \ge \frac{T_i - 1}{4} - \sqrt{\frac{T_i}{4}\log\frac{1}{\delta}}\,.$$

Therefore, with probability more than $1 - \delta$, $Z \ge N_i := \big\lfloor \frac{T_i - 1}{4} - \sqrt{\frac{T_i}{4}\log\frac{1}{\delta}} \big\rfloor$, i.e., the number of non-overlapping pairs with squared distance at least $M_i$ is at least $N_i$, which completes the proof.

³While we technically have $\binom{T_i}{2}$ (overlapping) pairs, the squared distances between each such pair will be identically distributed but not independent. For example, if $T_i \ge 3$ and we knew that $\binom{T_i}{2} - 1$ of the squared distances were equal to 0 (i.e., all the queries are exactly the same), the last squared distance must also be equal to 0.

A.5 Proof of Theorem 4.4

Theorem 4.4. If the following assumptions hold: 1. the assumption of Lemma 4.3 holds; 2. the kernel $k(x, x')$ is an isotropic kernel (which only depends on distance and can be written as $\tilde{k}(\|x - x'\|)$); 3. there exists an iteration $r$ such that $i_t \le m - 1$ for all $t \le r$ and $i_t = m$ for all $t > r$; then with probability at least $1 - \delta$, UCB-CVS (Alg. 1) incurs a cost-varying cumulative regret bounded by

$$R_T \le c_m \sum_{t=1}^T \epsilon_t + O\left( \left( B + \sqrt{\gamma_T(\mathcal{X}) + \log \tfrac{2m}{\delta}} \right) \left[ c_m \left( \sqrt{T \, \gamma_T(\mathcal{X})} - L + \log \tfrac{2m}{\delta} \right) + \sum_{i=1}^{m-1} c_i \left( \sqrt{T_i \, \gamma_{T_i}(\tilde{\mathcal{X}}_i)} + \log \tfrac{2m}{\delta} \right) \right] \right)$$

with $L := \lambda \big( \sum_{i=1}^{m-1} N_i \log\big( V_i - 2\tilde{k}(\sqrt{M_i}) - \tilde{k}(\sqrt{M_i})^2 \big) + W \big)$, by setting $\beta_t = B + \sigma \sqrt{2(\gamma_{t-1}(\mathcal{X}) + 1 + \log((2m)/\delta))}$, where $N_i$ and $M_i$ are defined as in Lemma 4.3, and $V_i$ and $W$ are residual terms defined in (10).

Proof. We first construct a lower bound on the sum of posterior standard deviations of the queries up to iteration $r$, i.e., the queries that were chosen with any control set except the last:

$$\sum_{t=1}^r \sigma_{t-1}(x_t) \overset{(i)}{\ge} \sum_{t=1}^r \sigma_{t-1}^2(x_t) = \lambda \sum_{t=1}^r \lambda^{-1}\sigma_{t-1}^2(x_t) \overset{(ii)}{\ge} \lambda \sum_{t=1}^r \log\big(1 + \lambda^{-1}\sigma_{t-1}^2(x_t)\big) \overset{(iii)}{=} \lambda \log\left|I + \lambda^{-1}K_r\right| = \lambda \log\big(\lambda^{-r}|\lambda I + K_r|\big) = \lambda\big({-r}\log\lambda + \log|\lambda I + K_r|\big) \overset{(iv)}{\ge} \lambda\big(\log|\lambda I + K_r| - 2\big) \qquad (7)$$

where (i) follows from the assumption that $k(x, x) = 1$, which implies $\sigma_{t-1}(x) \le 1$ for all $x \in \mathcal{X}$ and all $t \ge 1$, (ii) follows since $\log(1 + x) \le x$ for all $x > -1$, (iii) follows from Lemma A.2, and (iv) follows from $\lambda = 1 + 2/T$ (Lemma A.4) and $r \le T$, which give $r \log \lambda \le r \cdot (2/T) \le 2$.
Applying Lemma 4.3 with failure probability $\delta/(2m)$ for each control set, there will be at least $N_i$ pairs of queries chosen with control set $i$ with squared distance at least $M_i$, where

$$N_i = \left\lfloor \frac{T_i - 1}{4} - \sqrt{\frac{T_i}{4} \log \frac{2m}{\delta}} \right\rfloor \quad \text{and} \quad M_i \ge \frac{4}{h_i + 1} \sum_{\ell \in \bar{I}_i} \mathbb{V}[X_\ell] \,.$$

Gather these $2\sum_{i=1}^{m-1} N_i$ queries in an ordered sequence $S$ and keep paired queries adjacent to each other. The sequence should be ordered such that, for any control sets $i$ and $j$, if $i < j$, then queries chosen with $i$ should appear in the sequence before queries chosen with $j$. Denote as $\tilde{T}$ the ordered sequence of iterations at which each of these queries was chosen by the learner, where the order corresponds to the order in $S$. Using row and column swaps on $K_r$, construct a new Gram matrix $K_s$ such that, for all $j, \ell \le 2\sum_{i=1}^{m-1} N_i$, $[K_s]_{j\ell} = [K_r]_{\tilde{T}_j \tilde{T}_\ell}$. In other words, we have simply reordered the underlying queries that result in the Gram matrix $K_r$ to produce a new Gram matrix $K_s$ such that the first $2\sum_{i=1}^{m-1} N_i$ rows (and columns) correspond to the $2N_i$ queries, and paired queries (that have at least $M_i$ squared distance between them) are kept in adjacent rows (and columns). Note that $[\lambda I + K_r]_{\tilde{T}_j \tilde{T}_\ell} = [\lambda I + K_s]_{j\ell}$, i.e., the same row and column swap operations on $\lambda I + K_r$ result in $\lambda I + K_s$. Note also that swapping the positions of two queries corresponds to a row swap and a column swap in the Gram matrix. We can thus conclude that

$$|\lambda I + K_r| = |\lambda I + K_s| \qquad (8)$$

since determinants are invariant under an even number of row or column swaps. Write $\lambda I + K_s$ as

$$\lambda I + K_s = \begin{pmatrix} A_1 & B_1 \\ C_1 & D_1 \end{pmatrix}$$

where $A_1$ is a $2 \times 2$ matrix. Since $A_1$ is invertible, $|\lambda I + K_s| = |A_1|\,\big|D_1 - C_1 A_1^{-1} B_1\big|$ where $D_1 - C_1 A_1^{-1} B_1$ is the Schur complement of $A_1$. Observe that

$$D_1 - C_1 A_1^{-1} B_1 = \lambda I + K_{s-2} - k_{2,s-2}^\top (K_2 + \lambda I)^{-1} k_{2,s-2} = \lambda I + \hat{K}_{s-2}$$

where $K_2$ and $K_{s-2}$ are the prior covariance matrices of the first 2 queries and the last $r - 2$ queries respectively, $k_{2,s-2}$ is the prior covariance between the first 2 queries and the last $r - 2$ queries, and $\hat{K}_{s-2}$ is the posterior covariance matrix of the last $r - 2$ queries conditioned on observations at the first 2 queries. We can repeat this decomposition:

$$\lambda I + \hat{K}_{s-2} = \begin{pmatrix} A_2 & B_2 \\ C_2 & D_2 \end{pmatrix}, \qquad \big|\lambda I + \hat{K}_{s-2}\big| = |A_2|\,\big|D_2 - C_2 A_2^{-1} B_2\big|, \qquad D_2 - C_2 A_2^{-1} B_2 = \lambda I + \hat{K}_{s-4}$$

where $\hat{K}_{s-4}$ is the posterior covariance matrix of the last $r - 4$ queries conditioned on observations at the first 4 queries, by the quotient property of the Schur complement [8]. Define $N := \sum_{i=1}^{m-1} N_i$. Performing this decomposition $N$ times yields

$$|\lambda I + K_s| = \left(\prod_{j=1}^N |A_j|\right) \left|\lambda I + \hat{K}_{s-2N}\right|$$

where each $A_j$ is the $2 \times 2$ posterior covariance matrix of a pair of queries chosen with some control set $i$ that have at least $M_i$ squared distance between them, conditioned on observations at the first $2(j-1)$ queries in the sequence, plus $\lambda I$. From (7) and (8),

$$\sum_{t=1}^r \sigma_{t-1}(x_t) \ge \lambda\left(\sum_{j=1}^N \log|A_j| + \log\left|\lambda I + \hat{K}_{s-2N}\right| - 2\right) . \qquad (9)$$

Let $\hat{x}_j$ and $\hat{x}'_j$ refer to the pair of queries associated with $A_j$, and $\bar{k}_j$ to the posterior covariance function conditioned on observations at the first $2(j-1)$ queries in the sequence. Define $k_j$ and $k'_j$ as the $\mathbb{R}^{2(j-1)}$ vectors of the prior covariance between the first $2(j-1)$ queries in the sequence and $\hat{x}_j$ and $\hat{x}'_j$ respectively. Further define $M_j := K_{2(j-1)} + \lambda I$. Use $\langle u, v \rangle_M$ to denote $u^\top M v$, and $\|u\|_M$ to denote $\sqrt{\langle u, u \rangle_M}$.
Let $\hat x_j$ and $\hat x'_j$ refer to the pair of queries associated with $A_j$, and $\tilde k_j$ to the posterior covariance function conditioned on observations at the first $2(j-1)$ queries in the sequence. Define $k_j$ and $k'_j$ as the $\mathbb{R}^{2(j-1)}$ vectors of the prior covariance between the first $2(j-1)$ queries in the sequence and $\hat x_j$ and $\hat x'_j$ respectively. Further define $M_j := K_{2(j-1)} + \lambda I$. Use $\langle u, v\rangle_M$ to denote $u^\top M v$, and $\|u\|_M$ to denote $\sqrt{\langle u, u\rangle_M}$. Each $|A_j|$ can be lower bounded as
$$
\begin{aligned}
|A_j| &= \big(\tilde k_j(\hat x_j, \hat x_j) + \lambda\big)\big(\tilde k_j(\hat x'_j, \hat x'_j) + \lambda\big) - \tilde k_j(\hat x_j, \hat x'_j)^2 \\
&= \tilde k_j(\hat x_j, \hat x_j)\,\tilde k_j(\hat x'_j, \hat x'_j) + \lambda\,\tilde k_j(\hat x_j, \hat x_j) + \lambda\,\tilde k_j(\hat x'_j, \hat x'_j) + \lambda^2 - \tilde k_j(\hat x_j, \hat x'_j)^2 \\
&\overset{(i)}{=} \big(1 - \|k_j\|^2_{M_j^{-1}}\big)\big(1 - \|k'_j\|^2_{M_j^{-1}}\big) + \lambda\big(1 - \|k_j\|^2_{M_j^{-1}}\big) + \lambda\big(1 - \|k'_j\|^2_{M_j^{-1}}\big) + \lambda^2 \\
&\qquad - \Big(k(\hat x_j, \hat x'_j)^2 - 2k(\hat x_j, \hat x'_j)\,\langle k_j, k'_j\rangle_{M_j^{-1}} + \langle k_j, k'_j\rangle^2_{M_j^{-1}}\Big) \\
&\overset{(ii)}{\ge} \big(1 - \|k_j\|^2_{M_j^{-1}}\big)\big(1 - \|k'_j\|^2_{M_j^{-1}}\big) + \lambda\big(1 - \|k_j\|^2_{M_j^{-1}}\big) + \lambda\big(1 - \|k'_j\|^2_{M_j^{-1}}\big) + \lambda^2 \\
&\qquad - k(\hat x_j, \hat x'_j)^2 - 2k(\hat x_j, \hat x'_j)\,\|k_j\|_{M_j^{-1}}\|k'_j\|_{M_j^{-1}} - \|k_j\|^2_{M_j^{-1}}\|k'_j\|^2_{M_j^{-1}} \\
&= 1 - \|k_j\|^2_{M_j^{-1}} - \|k'_j\|^2_{M_j^{-1}} + \lambda\big(1 - \|k_j\|^2_{M_j^{-1}}\big) + \lambda\big(1 - \|k'_j\|^2_{M_j^{-1}}\big) + \lambda^2 \\
&\qquad - k(\hat x_j, \hat x'_j)^2 - 2k(\hat x_j, \hat x'_j)\,\|k_j\|_{M_j^{-1}}\|k'_j\|_{M_j^{-1}} \\
&\overset{(iii)}{\ge} \lambda^2 - 1 + (\lambda + 1)\big(1 - \|k_j\|^2_{M_j^{-1}}\big) + (\lambda + 1)\big(1 - \|k'_j\|^2_{M_j^{-1}}\big) - 2k(\hat x_j, \hat x'_j) - k(\hat x_j, \hat x'_j)^2 \\
&= \lambda^2 - 1 + (\lambda + 1)\big(\tilde k_j(\hat x_j, \hat x_j) + \tilde k_j(\hat x'_j, \hat x'_j)\big) - 2k(\hat x_j, \hat x'_j) - k(\hat x_j, \hat x'_j)^2
\end{aligned}
$$
where (i) follows from our assumption that $k(x, x) = 1$ together with the posterior GP equations, $\tilde k_j(\hat x_j, \hat x_j) = 1 - \|k_j\|^2_{M_j^{-1}}$ and $\tilde k_j(\hat x_j, \hat x'_j) = k(\hat x_j, \hat x'_j) - \langle k_j, k'_j\rangle_{M_j^{-1}}$; (ii) follows from the Cauchy-Schwarz inequality; and (iii) follows since $\|k_j\|^2_{M_j^{-1}} \le 1$ and $\|k'_j\|^2_{M_j^{-1}} \le 1$ (posterior variances are non-negative).

Define $S_i := \sum_{\ell=1}^{i} N_\ell$ (with $S_0 := 0$) and $v_i := \min_{S_{i-1}+1 \le j \le S_i} \frac{1}{2}\big(\tilde k_j(\hat x_j, \hat x_j) + \tilde k_j(\hat x'_j, \hat x'_j)\big)$, along with the residual terms
$$V_i := \lambda^2 - 1 + 2(\lambda + 1)\,v_i, \qquad W := \log\big|\lambda I + \hat K_{s-2N}\big| - 2. \qquad (10)$$
Substituting this result into (9),
$$
\begin{aligned}
\sum_{t=1}^{r} \sigma_{t-1}(x_t)
&\ge \lambda\Big(\sum_{j=1}^{N} \log\big(\lambda^2 - 1 + (\lambda + 1)(\tilde k_j(\hat x_j, \hat x_j) + \tilde k_j(\hat x'_j, \hat x'_j)) - 2k(\hat x_j, \hat x'_j) - k(\hat x_j, \hat x'_j)^2\big) + W\Big) \\
&= \lambda\Big(\sum_{i=1}^{m-1} \sum_{j=S_{i-1}+1}^{S_i} \log\big(\lambda^2 - 1 + (\lambda + 1)(\tilde k_j(\hat x_j, \hat x_j) + \tilde k_j(\hat x'_j, \hat x'_j)) - 2k(\hat x_j, \hat x'_j) - k(\hat x_j, \hat x'_j)^2\big) + W\Big) \\
&\ge \lambda\Big(\sum_{i=1}^{m-1} \sum_{j=S_{i-1}+1}^{S_i} \log\big(\lambda^2 - 1 + 2(\lambda + 1)v_i - 2k(\hat x_j, \hat x'_j) - k(\hat x_j, \hat x'_j)^2\big) + W\Big) \\
&\overset{(i)}{\ge} \lambda\Big(\sum_{i=1}^{m-1} \sum_{j=S_{i-1}+1}^{S_i} \log\big(V_i - 2k(\sqrt{M_i}) - k(\sqrt{M_i})^2\big) + W\Big) \\
&= \lambda\Big(\sum_{i=1}^{m-1} N_i \log\big(V_i - 2k(\sqrt{M_i}) - k(\sqrt{M_i})^2\big) + W\Big) =: L \qquad (11)
\end{aligned}
$$
where (i) follows from our assumption that the kernel $k$ is isotropic and can be written in the single-argument form $k(\|x - x'\|) = k(x, x')$, together with the fact that every pair of queries in $S$ chosen with control set $i$ has squared distance at least $M_i$, so that $k(\hat x_j, \hat x'_j) \le k(\sqrt{M_i})$.

Starting from (4) in the proof of Theorem 4.1, except replacing the failure probabilities of all events with $\delta/(2m)$, and writing $C_\delta := 4\log\frac{2m}{\delta} + 8\log(4) + 1$ and $\widetilde{\mathcal T}_i$ for the set of iterations at which control set $i$ was chosen,
$$
\begin{aligned}
R_T &\le \sum_{i=1}^{m} c_i (2\beta_T)\Big(\sum_{t \in \widetilde{\mathcal T}_i} \sigma_{t-1}(x_t) + C_\delta\Big) \\
&= \sum_{i=1}^{m-1} c_i (2\beta_T)\Big(\sum_{t \in \widetilde{\mathcal T}_i} \sigma_{t-1}(x_t) + C_\delta\Big) + c_m (2\beta_T)\Big(\sum_{t \in \widetilde{\mathcal T}_m} \sigma_{t-1}(x_t) + C_\delta\Big) \\
&\overset{(i)}{\le} \sum_{i=1}^{m-1} c_i (2\beta_T)\Big(\sqrt{4(T_i + 2)\gamma_{T_i}(\widetilde X_i)} + C_\delta\Big) + c_m (2\beta_T)\Big(\sum_{t \in \widetilde{\mathcal T}_m} \sigma_{t-1}(x_t) + C_\delta\Big) \\
&\overset{(ii)}{\le} \sum_{i=1}^{m-1} c_i (2\beta_T)\Big(\sqrt{4(T_i + 2)\gamma_{T_i}(\widetilde X_i)} + C_\delta\Big) + c_m (2\beta_T)\Big(\sqrt{4(T + 2)\gamma_T(X)} - \sum_{t=1}^{r} \sigma_{t-1}(x_t) + C_\delta\Big) \\
&\overset{(iii)}{\le} \sum_{i=1}^{m-1} c_i (2\beta_T)\Big(\sqrt{4(T_i + 2)\gamma_{T_i}(\widetilde X_i)} + C_\delta\Big) + c_m (2\beta_T)\Big(\sqrt{4(T + 2)\gamma_T(X)} - L + C_\delta\Big) \\
&= O\bigg(\Big(B + \sqrt{\gamma_T(X) + \log\tfrac{2m}{\delta}}\Big)\Big(c_m\big(\sqrt{T\,\gamma_T(X)} - L + \log\tfrac{2m}{\delta}\big) + \sum_{i=1}^{m-1} c_i\big(\sqrt{T_i\,\gamma_{T_i}(\widetilde X_i)} + \log\tfrac{2m}{\delta}\big)\Big)\bigg)
\end{aligned}
$$
where (i) follows from Lemma A.3, (ii) follows from Lemma A.3 again and the resulting inequality $\sum_{t=1}^{r} \sigma_{t-1}(x_t) + \sum_{t=r+1}^{T} \sigma_{t-1}(x_t) \le \sqrt{4(T + 2)\gamma_T(X)}$ (note that $\widetilde{\mathcal T}_m = \{r+1, \dots, T\}$ by assumption 3), and (iii) follows from substituting in (11). A union bound over the events of the single application of Lemma A.4, $m$ applications of Lemma A.5, and $m - 1$ applications of Lemma 4.3 (a total of $2m$ events, each failing with probability at most $\delta/(2m)$) yields the desired $1 - \delta$ probability bound, which completes the proof.

A.6 Other Lemmas

Lemma A.2 ([7], Lemma 3). Let $(x_t)_{t=1}^{T}$ be a sequence of queries selected by some algorithm. Then the mutual information $I(y_{1:T}; f_{1:T})$ between the noisy observations $y_{1:T}$ and the function values $f_{1:T}$ at the queries is given by
$$I(y_{1:T}; f_{1:T}) = \frac{1}{2}\log\big|I + \lambda^{-1} K_T\big| = \frac{1}{2}\sum_{t=1}^{T}\log\big(1 + \lambda^{-1}\sigma^2_{t-1}(x_t)\big).$$

Lemma A.3 ([7], Lemma 4). Let $(x_t)_{t=1}^{T}$ be a sequence of queries selected by some algorithm. Then
$$\sum_{t=1}^{T} \sigma_{t-1}(x_t) \le \sqrt{4(T + 2)\gamma_T(X)}.$$
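Lemma A.2's telescoping identity, which step (iii) of (7) relies on, can be verified numerically: the log-determinant of $I + \lambda^{-1}K_T$ equals the sum of one-step posterior-variance terms. A minimal sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(6, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 0.25 ** 2))  # squared exponential kernel, k(x, x) = 1
lam = 1.05

# Left-hand side: (1/2) log |I + lam^{-1} K_T|
lhs = 0.5 * np.linalg.slogdet(np.eye(6) + K / lam)[1]

# Right-hand side: (1/2) sum_t log(1 + lam^{-1} sigma_{t-1}^2(x_t)),
# where sigma_{t-1}^2(x_t) is the posterior variance given the first t-1 queries
rhs = 0.0
for t in range(6):
    if t == 0:
        var = K[0, 0]  # prior variance
    else:
        k_t = K[:t, t]
        var = K[t, t] - k_t @ np.linalg.solve(K[:t, :t] + lam * np.eye(t), k_t)
    rhs += 0.5 * np.log(1.0 + var / lam)

assert np.isclose(lhs, rhs)
print("information gain identity verified")
```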
Lemma A.4 ([7], Theorem 2). Let $\beta_t := B + \sigma\sqrt{2(\gamma_{t-1}(X) + 1 + \log(1/\delta))}$, where $B$ is an upper bound on the RKHS norm of $f$. With probability at least $1 - \delta$, for all $x \in X$ and $t \ge 1$, $|\mu_{t-1}(x) - f(x)| \le \beta_t \sigma_{t-1}(x)$, where $\mu_{t-1}$ and $\sigma_{t-1}$ are defined in (1) with $\lambda = 1 + \eta$ and $\eta := 2/T$.

Lemma A.5 ([16], Lemma 3). Let $(X_t)_{t \ge 1}$ be a non-negative stochastic process adapted to a filtration $\{\mathcal F_t\}$, and define $m_t := \mathbb{E}[X_t \mid \mathcal F_{t-1}]$. Further assume that $X_t \le b_t$ for a fixed, non-decreasing sequence $(b_t)_{t \ge 1}$. If $b_T \ge 1$, then with probability at least $1 - \delta$, for any $T \ge 1$,
$$\sum_{t=1}^{T} m_t \le \sum_{t=1}^{T} X_t + 4 b_T \log\frac{1}{\delta} + 8 b_T \log(4 b_T) + 1.$$

B Comparison to Naive Baselines

We investigated simple extensions of TS-PSQ, UCB-PSQ, and EI-PSQ (i.e., the classic expected improvement algorithm [20] adapted for BOPSQ; see Appendix C for details) to the cost-varying problem by dividing the acquisition score of a control set by its cost, in a manner similar to Snoek et al. [31, Sec. 3.2]. Fig. 3 shows the mean and standard error of the simple regret incurred over 10 RNG seeds for one set of experiments. We found that these naive methods generally do not work well. For TS-PSQ per unit cost and UCB-PSQ per unit cost, if a suboptimal control set is very cheap, its acquisition score may remain artificially high throughout, and the algorithm fails to converge. EI-PSQ per unit cost was slightly more promising but suffered from the inverse problem: the suboptimal control sets reached an expected improvement of 0 very quickly, after which dividing by the cost had no effect. This algorithm thus fails to encourage exploration with cheaper control sets. Furthermore, the EI algorithm was computationally expensive due to the double Monte Carlo expectation computation. In general, we see that UCB-CVS is able to use the cheaper control sets much more effectively for exploration and hence finds better solutions.

Figure 3: Mean and standard error (over 10 RNG seeds) of the simple regret (lower is better) incurred against cost spent (budget) $C$ by all algorithms, including TS-PSQ per unit cost, UCB-PSQ per unit cost, and EI-PSQ per unit cost, with samples from the GP prior as the objective function, the moderate cost set, and all variances. A diamond indicates the average budget after which an algorithm only chooses the optimal control set.
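For concreteness, the following sketch (hypothetical names and illustrative values, not our implementation) shows the naive per-unit-cost selection rule and the failure mode described above: as long as a control set's UCB score stays bounded away from zero, dividing by a very small cost keeps it artificially attractive.

```python
import numpy as np

def ucb_per_unit_cost(mu, sigma, beta, costs):
    """Naive baseline sketch: divide each control set's UCB score by its cost.

    mu, sigma : posterior mean/std of each control set's best partial query
    beta      : UCB exploration coefficient
    costs     : cost of each control set
    """
    scores = (mu + np.sqrt(beta) * sigma) / np.asarray(costs)
    return int(np.argmax(scores))

# A cheap but suboptimal control set (index 0) keeps winning even when its
# posterior mean is far below that of the expensive, near-optimal set (index 2).
mu = np.array([0.3, 0.6, 0.9])
sigma = np.array([0.05, 0.05, 0.05])  # posterior already fairly certain
costs = np.array([0.01, 0.1, 1.0])
print(ucb_per_unit_cost(mu, sigma, beta=2.0, costs=costs))  # -> 0
```

Because the UCB score of a nonnegative objective never collapses to zero, the cheap set's divided score dominates indefinitely, which matches the non-convergence observed in Fig. 3.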
C Experimental Details

All experiments use a squared exponential kernel with ARD lengthscales that depend on the objective function, $k(x, x) = 1$, Gaussian observation noise with $\sigma = 0.01$, and 5 initial query-observation pairs with queries drawn uniformly at random. All expectations are approximated with Monte Carlo sampling with 1024 samples. All acquisition maximizations are performed with L-BFGS-B with random restarts. All query sets are $[0, 1]^d$.

C.1 Objective functions

The control sets described here are given in an order corresponding to their costs given in Sec. 5. For example, for the GP samples objective, under the cheap cost set, control set {1} has cost 0.01, control set {1, 2} has cost 0.1, and control set {1, 2, 3} has cost 1.

Samples from GP prior (3-D): We use samples from the same kernel $k$ used to model the GP posteriors during learning. We use a kernel lengthscale of 0.1 and control sets {{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}.

Hartmann (3-D): We use a kernel lengthscale of 0.1 and control sets {{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}.

Plant growth simulator (5-D): The plant growth simulator is a GP built from private data collected on the maximum leaf area achieved by Marchantia plants depending on the input variables Ca, B, NH3, K, and pH. We use min-max feature scaling to scale all input variables to [0, 1] and standardize the output values. We use the posterior mean of the GP as the objective function. We use a kernel lengthscale of 0.2 and control sets {{1, 2}, {3, 4}, {4, 5}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 2, 3, 4, 5}}.

Airfoil self-noise (5-D): We use the airfoil self-noise dataset from the UCI Machine Learning Repository [9]. To scale all input variables to [0, 1], we first take the natural logarithm of variables 1 and 5, then apply min-max feature scaling to all input variables. We also standardize the output values. We then fit a default SingleTaskGP from BoTorch to the data and use its posterior mean as the objective function. We use a kernel lengthscale of 0.2 and control sets {{4, 5}, {2, 5}, {1, 4}, {2, 3}, {3, 5}, {1, 2}, {3, 4}}.

C.2 Algorithms

UCB-PSQ and UCB-CVS: For the experiments, we set $\beta_t = 2$ for all $t$.

TS-PSQ: Following [13], we use random Fourier features (RFF) [28] to approximately sample from a GP posterior. We use RFF with 1024 features.

EI-PSQ: We adapt the BoTorch acquisition NoisyExpectedImprovement to the BOPSQ problem setting. To evaluate the acquisition score of a partial query, we first sample 32 fantasy models of $f$ from the GP posterior. For each fantasy model, we compute the expected value of the partial query and take as the incumbent the value of the best observation so far (assuming the full query control set is available). We then compute the improvement score as the expected value minus the incumbent value, and average the improvement scores over all fantasy models.

C.3 Implementation

The experiments were implemented in Python. The major libraries used were NumPy [12], SciPy [40], PyTorch [26], GPyTorch [10], and BoTorch [3]. For more details, please refer to the code repository.

C.4 Compute

The following CPU times in seconds were collected on a server running Ubuntu 20.04.4 LTS with 2 Intel(R) Xeon(R) Gold 6326 CPUs @ 2.90GHz and 256 GB of RAM. We measure the CPU time for 1 iteration of TS-PSQ and UCB-CVS with a dataset of 100 observations. In general, none of the algorithms in the settings tested in this paper requires a significant amount of compute.

            GP sample   Hartmann   Plant   Airfoil
TS-PSQ           6.27       4.14    8.96    232.27
UCB-CVS         37.92      52.34   61.96     87.09

D Limitations

A limitation of our work is that the theoretical guarantees of UCB-CVS rely on a few assumptions that may not hold in practice. For example, the regularity assumption that the objective function $f$ resides in some RKHS may not hold in some problems, and the kernel corresponding to this RKHS may not be known either. The work also assumes that the probability distributions governing each variable are independent and fixed. In practice, these assumptions may be violated if the probability distributions depend on one another or change over time.