# Active Set Ordering

Quoc Phong Nguyen1,3, Sunil Gupta1, Svetha Venkatesh1, Bryan Kian Hsiang Low2, Patrick Jaillet3
1Applied Artificial Intelligence Institute, Deakin University, Australia
2School of Computing, National University of Singapore, Singapore
3LIDS and EECS, Massachusetts Institute of Technology, USA
qphongmp@gmail.com, sunil.gupta@deakin.edu.au, svetha.venkatesh@deakin.edu.au, lowkh@comp.nus.edu.sg, jaillet@mit.edu

Abstract

In this paper, we formalize the active set ordering problem, which involves actively discovering a set of inputs based on their orderings determined by expensive evaluations of a blackbox function. We then propose the mean prediction (MP) algorithm and theoretically analyze it in terms of the regret of predicted pairwise orderings between inputs. Notably, as a special case of this framework, we can cast Bayesian optimization as an active set ordering problem by recognizing that maximizers can be identified solely by comparison rather than by precisely estimating the function evaluations. As a result, we are able to construct the popular Gaussian process upper confidence bound (GP-UCB) algorithm through the lens of ordering with several nuanced insights. We empirically validate the performance of our proposed solution using various synthetic functions and real-world datasets.

1 Introduction

In real-world applications, we often encounter the problem of estimating an unknown function, known as a blackbox function (i.e., one without a closed-form expression or derivatives), using its expensive and noisy evaluations. Under these circumstances, an efficient sequential process of evaluating the function is desired. On one extreme, experimental design (ED) aims to estimate the function over its entire input domain, e.g., by decreasing the uncertainty of the function globally in Bayesian ED [4, 18]. On the other extreme, the renowned Bayesian optimization (BO) targets inputs with extreme function values such as the maximizers and the minimizers [2, 6, 7]. While the connection between ED and BO is studied in the classic work of [19], we still lack a problem formulation that strikes a balance between the prohibitively expensive process of estimating the entire function globally in ED and the lack of information about the function away from extreme locations in BO. One may consider a related problem, called level set estimation (LSE), which focuses on estimating inputs with function evaluations above or below a given (or implicit) threshold [1, 3, 8, 14]. However, without domain knowledge of the blackbox function, it is easy to set a threshold that leads to undesirably large or small level sets.

Let us consider an environmental monitoring problem of estimating a chemical concentration in a field. The blackbox function is the mapping from locations of the field to the chemical concentration measurement. It can be of greater interest to estimate the maximizers, the minimizers, the top-k locations (with the highest chemical concentration), and the bottom-k locations. On one hand, these estimates provide more information about the blackbox function than just the maximizers or minimizers in BO. On the other hand, they may require fewer resources (i.e., fewer evaluations of the blackbox function) than estimating the entire function in ED. Besides, as the top-k set consists of exactly k locations in the field, it circumvents the issue of undesirably large or small level sets in LSE.
Our main contribution in this paper is to formulate the above challenge and resolve it with a theoretically grounded solution. Specifically, we propose the active set ordering problem to capture the above scenario (Sec. 2.1). It aims to estimate subsets of the input domain that are defined based on pairwise comparisons/orderings between the blackbox function evaluations.¹ These subsets include the maximizers, the minimizers, and the top-k inputs with the highest function evaluations. Like Bayesian ED, BO, and LSE, we adopt the pool-based active learning setting [18] in constructing a solution that sequentially selects a sampling input from the domain at each iteration. The knowledge from observing function evaluations at the sampling inputs helps predict the subsets of interest and directs the algorithm to select the next sampling input.

To facilitate the presentation of our method, we begin with the building block of our ordering-based problem: pairwise comparison/ordering between function evaluations in Sec. 3. Specifically, we propose a new kind of regret to quantify the loss of a pairwise ordering (Sec. 3.1), a prediction of the top-k inputs based on only the posterior mean (Sec. 3.2), and a sampling strategy that is equipped with a theoretical performance guarantee for the proposed prediction (Sec. 3.3). Subsequently, these concepts of the regret, the prediction, and the sampling strategy are extended to orderings between sets, which ultimately addresses the active set ordering problem in Sec. 4. Notably, the regret simplifies to the well-known regret in BO (Remark 4.1). Hence, we recover both the theoretical analysis and the GP-UCB algorithm [19] when considering a special case of our problem setting (Remark 4.5). In Sec. 5, we empirically validate the performance of our solution using several synthetic functions and real-world datasets.

2 Preliminaries and Problem Statement

2.1 Top-k set

Adopting an assumption in existing level set estimation (LSE) works [1, 8], we consider a blackbox function $f: \mathcal{X} \rightarrow \mathbb{R}$ where the domain $\mathcal{X}$ is a finite set of $n$ elements in $\mathbb{R}^d$. Let $S^c \triangleq \mathcal{X} \setminus S$ denote the complement of any subset $S \subseteq \mathcal{X}$. In this paper, the ordering between inputs is determined with respect to their corresponding blackbox function evaluations. Hence, we use the terms "the ordering between $x$ and $x'$" and "the ordering between $f(x)$ and $f(x')$" interchangeably.

Definition 2.1 (Top-k set). The top-k set, denoted as $S(k)$, is the set of $k$ inputs with the highest function evaluations. Specifically, $|S(k)| = k$ and $\forall x \in S(k), \forall x' \in S^c(k),\ f(x) \ge f(x')$.

In this work, we propose the active set ordering problem to estimate the top-k set $S(k)$ of a blackbox function $f$ by efficiently gathering noisy function evaluations in a sequential manner. Furthermore, it includes the Bayesian optimization (BO) problem when $k = 1$ because the top-1 set $S(1)$ contains a maximizer of $f$.

2.2 Gaussian Process

The noisy function evaluation mentioned in the previous section is denoted as $y(x) \triangleq f(x) + \epsilon(x)$ where the noise $\epsilon(x) \sim \mathcal{N}(0, \sigma_n^2)$ is a Gaussian random variable with a known (or estimated) variance $\sigma_n^2$. To obtain the posterior distribution of the unknown function $f$ given these noisy evaluations, we model $f$ using a Gaussian process (GP), that is, every subset of $\{f(x)\}_{x \in \mathcal{X}}$ follows a multivariate Gaussian distribution [16]. A GP is fully specified by its prior mean and its kernel $k_{x,x'} \triangleq \mathrm{cov}(f(x), f(x'))$ which measures the covariance between function values.
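As a concrete illustration of Definition 2.1 and the noisy evaluation model above, the following minimal Python sketch builds the top-k set of a toy finite domain and draws a noisy observation; the array values and names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_set(f_values: np.ndarray, k: int) -> np.ndarray:
    """Indices of the top-k set S(k) (Definition 2.1): the k inputs with the
    highest blackbox evaluations over the finite domain."""
    return np.argsort(-f_values)[:k]

def observe(f_values: np.ndarray, x_index: int, noise_std: float = 0.1) -> float:
    """Noisy evaluation y(x) = f(x) + eps(x) with eps(x) ~ N(0, sigma_n^2)."""
    return float(f_values[x_index] + rng.normal(0.0, noise_std))

# Toy domain of n = 5 inputs with known evaluations (for illustration only).
f_values = np.array([0.3, 1.2, -0.5, 0.9, 1.1])
print(top_k_set(f_values, k=2))  # indices of the two highest evaluations: [1 4]
print(observe(f_values, 1))      # a noisy evaluation of f at input index 1
```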
Let $D_t$ denote the set of sampling inputs in the first $t - 1$ iterations. Then, given $\mathbf{y}_{D_t} \triangleq (y(x))_{x \in D_t}$, the predictive distribution of any function evaluation $f(x)$ follows a Gaussian distribution with the following mean and variance:

$\mu_t(x) \triangleq \mathbf{k}_t(x)^\top (\mathbf{K}_t + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}_{D_t}$
$\sigma_t^2(x) \triangleq k(x, x) - \mathbf{k}_t(x)^\top (\mathbf{K}_t + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_t(x)$

where $\mathbf{k}_t(x) \triangleq (k(x, x'))_{x' \in D_t}$, $\mathbf{K}_t \triangleq (k(x, x'))_{x, x' \in D_t}$, and $\mathbf{I}$ is the identity matrix [16]. Assuming $f$ belongs to a reproducing kernel Hilbert space with its norm bounded by $B > 0$, due to [5], we have the following confidence bound of $f(x)$.²

¹ Existing research on best-k arm identification in the multi-armed bandit literature poses a similar problem, but it often focuses on the pure exploration setting and assumes independent arms, which are inadequate for modelling blackbox functions [10-12].
² Alternatively, we may consider using the confidence bounds established in [19].

Lemma 2.2. Pick $\delta \in (0, 1)$ and set $\beta_t = \big(B + \sigma_n \sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}\big)^2$. Then, the following event happens with probability of at least $1 - \delta$: $\forall x \in \mathcal{X}, \forall t \ge 1$, $l_t(x) \le f(x) \le u_t(x)$ where $l_t(x) \triangleq \mu_t(x) - \beta_t^{1/2} \sigma_t(x)$, $u_t(x) \triangleq \mu_t(x) + \beta_t^{1/2} \sigma_t(x)$, and $\gamma_{t-1} \triangleq \max_{A \subseteq \mathcal{X}: |A| = t-1} I(\mathbf{y}_A; \mathbf{f}_A)$ is the maximum information gain about $\mathbf{f}_A \triangleq \{f(x)\}_{x \in A}$ through observing $\mathbf{y}_A$ over all subsets $A \subseteq \mathcal{X}$ of size $|A| = t - 1$.

To ease notational clutter, we denote the above confidence interval of $f(x)$ as $C_t(x) \triangleq [l_t(x), u_t(x)]$ and its length as $|C_t(x)| \triangleq u_t(x) - l_t(x)$.

3 Active Pairwise Ordering: n = 2

From Definition 2.1, pairwise orderings (or pairwise comparisons) are the building blocks of our active set ordering problem. Hence, to facilitate the exposition of the key ideas, let us begin with a simplistic setting where the input domain $\mathcal{X}$ consists of only $n = 2$ inputs, i.e., $\mathcal{X} = \{\tilde{x}, \tilde{x}'\}$ and $f(\tilde{x}) \neq f(\tilde{x}')$. The problem is to determine the top-1 set $S(1)$, i.e., the maximizer of $f$ (equivalently, the minimizer of $f$). In essence, the goal is to check whether $f(\tilde{x}) > f(\tilde{x}')$ by strategically collecting noisy evaluations $y(\tilde{x})$ and $y(\tilde{x}')$. In particular, at iteration $t$, the algorithm proposes a sampling input $x_t \in \mathcal{X}$ to obtain a noisy evaluation $y(x_t)$. Then, the GP posterior distribution of $f$ is updated and used to construct a predicted ordering between $\tilde{x}$ and $\tilde{x}'$ (i.e., the ordering between $f(\tilde{x})$ and $f(\tilde{x}')$). The problem boils down to the strategy of selecting the sampling input $x_t$ such that a performance metric of the predicted ordering between $f(\tilde{x})$ and $f(\tilde{x}')$ is satisfactory. In the next section, we introduce a regret definition to serve as a performance metric.

3.1 Regret

Let us denote the (unknown) true ordering between $\tilde{x}$ and $\tilde{x}'$ according to the evaluations of the blackbox function $f$ as $\pi^*$:

$\pi^*(\tilde{x}, \tilde{x}') \triangleq \mathbb{1}_{f(\tilde{x}) \ge f(\tilde{x}')}$   (1)

where the indicator function $\mathbb{1}_{f(\tilde{x}) \ge f(\tilde{x}')} = 1$ if $f(\tilde{x}) \ge f(\tilde{x}')$ and $0$ otherwise. For any ordering $\pi: \{(\tilde{x}, \tilde{x}')\} \rightarrow \{0, 1\}$, we define the following regret of $\pi(\tilde{x}, \tilde{x}')$:

$r_\pi(\tilde{x}, \tilde{x}') \triangleq \max\big(0,\ (2\pi(\tilde{x}, \tilde{x}') - 1)(f(\tilde{x}') - f(\tilde{x}))\big)$.   (2)

In particular, $r_{\pi(\tilde{x}, \tilde{x}')=1} = \max(0, f(\tilde{x}') - f(\tilde{x}))$ and $r_{\pi(\tilde{x}, \tilde{x}')=0} = \max(0, f(\tilde{x}) - f(\tilde{x}'))$. The rationale is to ensure that poor performance leads to large regret: the regret is $|f(\tilde{x}) - f(\tilde{x}')|$ (which increases as the gap between $f(\tilde{x})$ and $f(\tilde{x}')$ increases) if the ordering $\pi$ does not align with the true ordering, i.e., $\pi(\tilde{x}, \tilde{x}') \neq \pi^*(\tilde{x}, \tilde{x}')$. On the contrary, the regret is $0$ (i.e., the best performance) if $\pi(\tilde{x}, \tilde{x}') = \pi^*(\tilde{x}, \tilde{x}')$. However, since $f$ is a blackbox function, the regret $r_\pi(\tilde{x}, \tilde{x}')$ cannot be evaluated directly. Thus, we resort to relying on the GP posterior distribution of $f$ at iteration $t$ to construct an upper bound of the above regret in Lemma 3.1 below (proof in Appendix A).
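Before stating the lemma, a small numerical illustration of the regret in (2) may help; this is a minimal sketch with made-up function values, not an experiment from the paper.

```python
def pairwise_regret(f_x: float, f_xp: float, ordering: int) -> float:
    """Regret r_pi(x, x') in (2) of a candidate ordering pi(x, x') in {0, 1},
    given the (normally unknown) evaluations f(x) and f(x')."""
    return max(0.0, (2 * ordering - 1) * (f_xp - f_x))

# With f(x) = 1.0 and f(x') = 0.4, the true ordering is 1: it incurs zero regret,
# while the wrong ordering 0 incurs regret |f(x) - f(x')| = 0.6.
print(pairwise_regret(1.0, 0.4, ordering=1))  # 0.0
print(pairwise_regret(1.0, 0.4, ordering=0))  # 0.6
```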
Lemma 3.1. For all $t \ge 1$, let us define

$\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')} \triangleq \begin{cases} \max(0,\ u_t(\tilde{x}') - l_t(\tilde{x})) & \text{if } \pi(\tilde{x}, \tilde{x}') = 1 \\ \max(0,\ u_t(\tilde{x}) - l_t(\tilde{x}')) & \text{if } \pi(\tilde{x}, \tilde{x}') = 0 \end{cases}$   (3)

Then, $\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}$ is an upper confidence bound of the regret $r_\pi(\tilde{x}, \tilde{x}')$, i.e., $P\big(\forall t \ge 1,\ r_\pi(\tilde{x}, \tilde{x}') \le \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}\big) \ge 1 - \delta$ where $\delta$ is as defined in Lemma 2.2.

3.2 Prediction

Definition 3.2 (Predicted pairwise ordering $\pi_{\mu_t}$). Given the above upper bound $\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}$ of the regret, we would like to make a prediction $\pi_{\mu_t}$ that minimizes $\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}$. Therefore, $\pi_{\mu_t}$ is defined as follows:

$\pi_{\mu_t}(\tilde{x}, \tilde{x}') \triangleq \arg\min_{a \in \{0, 1\}} \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')=a}$.   (4)

As a result, the upper confidence bound of the regret in (3) is minimized at $\pi(\tilde{x}, \tilde{x}') = \pi_{\mu_t}(\tilde{x}, \tilde{x}')$ and its minimum value is

$\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') = \max\big(0,\ \min(u_t(\tilde{x}') - l_t(\tilde{x}),\ u_t(\tilde{x}) - l_t(\tilde{x}'))\big)$.   (5)

We note that the upper confidence bound of the regret can be interpreted as a measure of the approximation quality or the uncertainty reduction, as discussed in the following two remarks.

Remark 3.3 (Approximation quality). Since $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \ge r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ for all $t \ge 1$ with probability of at least $1 - \delta$, the regret incurred by the ordering $\pi_{\mu_t}$ cannot exceed the worst-case regret $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$, as shown in Fig. 1a. Hence, if $|f(\tilde{x}) - f(\tilde{x}')| > \rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$, then $\pi_{\mu_t}(\tilde{x}, \tilde{x}')$ is the true ordering, i.e., $\pi_{\mu_t}(\tilde{x}, \tilde{x}') = \pi^*(\tilde{x}, \tilde{x}')$, with probability of at least $1 - \delta$.

Remark 3.4 (Minimum uncertainty reduction). In Fig. 1b, one can interpret $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ as the minimum amount by which the confidence intervals $C_t(\tilde{x})$ and $C_t(\tilde{x}')$ (representing the uncertainty) must shrink so that $r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') = 0$ with probability of at least $1 - \delta$.

Moreover, the prediction $\pi_{\mu_t}$ can be obtained using only the GP posterior mean (proof in Appendix B), which explains the name of our approach, mean prediction (MP), and the notation $\pi_{\mu_t}$.

Lemma 3.5 (Mean prediction). The predicted ordering $\pi_{\mu_t}$ defined in (4) can be determined from the GP posterior mean:

$\pi_{\mu_t}(\tilde{x}, \tilde{x}') = \mathbb{1}_{\mu_t(\tilde{x}) \ge \mu_t(\tilde{x}')}$.   (6)

[Figure 1: Interpretations of the upper bound $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ when $\pi_{\mu_t}(\tilde{x}, \tilde{x}') = 1$. (a) $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ is the worst-case regret, which happens when $f(\tilde{x}) = l_t(\tilde{x})$ and $f(\tilde{x}') = u_t(\tilde{x}')$. (b) $r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') = 0$ when $(u_t, l_t)$ are refined to $(u'_t, l'_t)$ following an observation, i.e., the reduction in the uncertainty, represented as the sum of the two red dashed segments, is at least $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$.]

3.3 Sampling Strategy

Given the regret in Sec. 3.1 and the predicted ordering $\pi_{\mu_t}(\tilde{x}, \tilde{x}')$ in Sec. 3.2, we would like to select a sampling input $x_t \in \mathcal{X} = \{\tilde{x}, \tilde{x}'\}$ such that the regret $r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ of the predicted ordering $\pi_{\mu_t}(\tilde{x}, \tilde{x}')$ reduces quickly. While $r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ is unknown, it is bounded by $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ with probability of at least $1 - \delta$. Hence, to reduce $r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$, we aim to reduce $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$. It is also noted that observing $y(x_t)$ decreases the length of the confidence interval $|C_t(x_t)|$. Hence, to induce the reduction in the regret $r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ through observing $y(x_t)$, we select $x_t$ such that its confidence interval satisfies $|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ (which guarantees that $|C_t(x_t)| \ge r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$ with probability of at least $1 - \delta$). For instance, choosing $x_t = \tilde{x}$ satisfies this condition in the left plot of Fig. 1a, but not in the right plot of Fig. 1a. In Lemma 3.6 below, we show 4 choices of the sampling input that satisfy the proposed condition (see Appendix C).
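For concreteness, the regret upper bound in (3)/(5) and the mean prediction rule in (6) are simple to compute from the confidence bounds and posterior means; the following minimal Python sketch (with illustrative names, not code from the paper) spells them out.

```python
def rho_pairwise(l: dict, u: dict, x, x_prime, ordering: int) -> float:
    """Upper confidence bound rho^(t)_{pi(x, x') = ordering} in (3), given
    dictionaries l, u mapping inputs to their lower/upper confidence bounds."""
    if ordering == 1:
        return max(0.0, u[x_prime] - l[x])
    return max(0.0, u[x] - l[x_prime])

def mean_prediction(mu: dict, x, x_prime) -> int:
    """Predicted pairwise ordering pi_{mu_t}(x, x') in (6): compare posterior means."""
    return 1 if mu[x] >= mu[x_prime] else 0

def rho_at_prediction(l: dict, u: dict, x, x_prime) -> float:
    """Minimized bound (5): by Lemma 3.5 the mean prediction attains this value."""
    return max(0.0, min(u[x_prime] - l[x], u[x] - l[x_prime]))
```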
Lemma 3.6. Let $Q_t \triangleq \{x_t^U, x_t^L, x_t^W, x_t^N\}$ denote a set³ of inputs at iteration $t$ where

$x_t^U \triangleq \arg\max_{x \in \{\tilde{x}, \tilde{x}'\}} u_t(x)$,  $x_t^L \triangleq \arg\min_{x \in \{\tilde{x}, \tilde{x}'\}} l_t(x)$,  $x_t^W \triangleq \arg\max_{x \in \{\tilde{x}, \tilde{x}'\}} |C_t(x)|$,  $x_t^N \triangleq \arg\min_{x \in \{x_t^U, x_t^L\}} |C_t(x)|$.

For any $x_t \in Q_t$, $|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$.

³ For convenience, we allow duplicate elements in $Q_t$.

Theorem 3.7. By sampling the input $x_t$ following Lemma 3.6, we obtain the following regret bound: with probability of at least $1 - \delta$, for all $T \ge 1$ and all $(x_t)_{t=1}^T \in \prod_{t=1}^T Q_t$,

$R_T \triangleq \sum_{t=1}^T r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le O(\sqrt{T \beta_T \gamma_T})$

where $\beta_T$, $\gamma_T$, and $\delta$ are as defined in Lemma 2.2.

Remark 3.8 (Sublinear cumulative regret). If $\gamma_T$ is sublinear, our average cumulative regret is sublinear. This requirement is similar to most BO and LSE algorithms. It is noted that $\gamma_T$ is sublinear for many popular kernels. For instance, $\gamma_T = O((\log T)^{d+1})$ for the squared exponential (SE) kernel, as discussed in [19]. In this case, our cumulative regret bound $R_T \le O^*(\sqrt{T (\log T)^{2d}})$ is the same as that of GP-UCB [19] (where $O^*(\cdot)$ denotes asymptotic expressions up to dimension-independent logarithmic factors and $d$ is the dimension of the input). We defer the pseudocode to the next section where $n \ge 2$.

4 Active Set Ordering: n ≥ 2

In this section, we utilize the results in Sec. 3 to present the mean prediction (MP) algorithm for the active set ordering problem with $n \ge 2$. When $\mathcal{X}$ consists of $n > 2$ inputs, there are multiple pairwise orderings between inputs in $\mathcal{X}$.

4.1 Regret

We overload the notation $\pi^*$ of the true pairwise ordering between 2 inputs in (1) to the ordering between 2 sets as follows: for any subsets $X_0 \subseteq \mathcal{X}$ and $X_1 \subseteq X_0^c$,

$\pi^*(X_0, X_1) = \begin{cases} 1 & \text{if } \forall x \in X_0, \forall x' \in X_1,\ \pi^*(x, x') = 1 \\ 0 & \text{if } \forall x \in X_0, \forall x' \in X_1,\ \pi^*(x, x') = 0 \end{cases}$   (7)

It is noted that $\pi^*(X_0, X_1)$ remains undefined if neither of the two cases above is satisfied. However, this situation does not arise in our solution. We define the regret of a set ordering (i.e., multiple pairwise orderings) as the maximum regret over all pairwise orderings:

$r_{\pi(X_0, X_1)=i} \triangleq \max_{(x, x') \in X_0 \times X_1} r_{\pi(x, x')=i}, \quad i \in \{0, 1\},$

where $r_{\pi(x, x')}$ is defined in (2) and $X_0 \times X_1$ is the Cartesian product of $X_0$ and $X_1$, i.e.,

$r_{\pi(X_0, X_1)=i} = \max\big(0,\ (2i - 1) \max_{(x, x') \in X_0 \times X_1} (f(x') - f(x))\big), \quad i \in \{0, 1\}.$   (8)

Remark 4.1. It is noted that $r_{\pi(X_0, X_1)}$ coincides with the well-known regret in BO when we consider the problem of predicting a maximizer of $f$. In particular, predicting $\hat{x}^*$ as a maximizer of $f$ is equivalent to predicting the set ordering $\pi(\{\hat{x}^*\}, \mathcal{X} \setminus \{\hat{x}^*\}) = 1$. Its regret is $r_{\pi(\{\hat{x}^*\}, \mathcal{X} \setminus \{\hat{x}^*\})=1} = \max_{x \in \mathcal{X}} f(x) - f(\hat{x}^*)$, as shown in Appendix E.1.

Following the upper confidence bound of the regret of pairwise orderings in (3), we show in Appendix F that, with probability of at least $1 - \delta$, for all $t \ge 1$ and for all subsets $X_0 \subseteq \mathcal{X}$, $X_1 \subseteq X_0^c$, $r_{\pi(X_0, X_1)} \le \rho^{(t)}_{\pi(X_0, X_1)}$ where

$\rho^{(t)}_{\pi(X_0, X_1)} \triangleq \max_{(x, x') \in X_0 \times X_1} \rho^{(t)}_{\pi(x, x') = \pi(X_0, X_1)}$.

4.2 Prediction

In this section, we generalize the prediction in Sec. 3.2 to set orderings. From Lemma 3.5, there is no contradiction among the pairwise orderings $\pi_{\mu_t}(x, x')$ (defined in (4)) for all $\{x, x'\} \subseteq \mathcal{X}$. In other words, the transitivity property holds for the binary relation $\pi_{\mu_t}$, as shown in Appendix G.

Definition 4.2 (Predicted top-k set $S_{\mu_t}(k)$). Let $S_{\mu_t}(k)$ be a subset of $\mathcal{X}$ such that

$|S_{\mu_t}(k)| = k, \quad \pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k)) = 1$   (9)

where the set ordering $\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))$ is obtained by substituting $\pi^*$ with the pairwise ordering $\pi_{\mu_t}$ (see Definition 3.2) in (7). From Lemma 3.5, $S_{\mu_t}(k)$ is simply the set of $k$ inputs with the highest GP posterior mean values.
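Since Lemma 3.5 reduces the prediction to a ranking of posterior means, the predicted top-k set can be formed in one line; a minimal sketch with illustrative names, not code from the paper:

```python
import numpy as np

def predicted_top_k(mu: np.ndarray, k: int):
    """Predicted top-k set S_{mu_t}(k) of Definition 4.2 and its complement:
    by Lemma 3.5, it is the k inputs with the highest posterior means."""
    order = np.argsort(-mu)      # indices sorted by decreasing posterior mean
    return order[:k], order[k:]  # (S_{mu_t}(k), S^c_{mu_t}(k))
```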
As $\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k)) = 1$ implies that $\pi_{\mu_t}(x, x') = 1$ for all $(x, x') \in S_{\mu_t}(k) \times S^c_{\mu_t}(k)$, the upper confidence bound of the regret is

$\rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))} = \max_{(x, x') \in S_{\mu_t}(k) \times S^c_{\mu_t}(k)} \rho^{(t)}_{\pi_{\mu_t}(x, x')}$.   (10)

4.3 Sampling Strategy

Like in Sec. 3.3, our key idea is to select $x_t$ such that the length $|C_t(x_t)|$ of its confidence interval bounds $\rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$ in (10). Since $\rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$ is the maximum upper confidence bound of the regret over all pairwise orderings involved in defining $S_{\mu_t}(k)$, we first determine the input pair $(\bar{x}_t, \bar{x}'_t)$ that incurs the maximum upper confidence bound of the regret. This is also the input pair on which $\pi_{\mu_t}$ is most likely to make a mistake, following the intuition from [12]:

$(\bar{x}_t, \bar{x}'_t) \triangleq \arg\max_{(x, x') \in S_{\mu_t}(k) \times S^c_{\mu_t}(k)} \rho^{(t)}_{\pi_{\mu_t}(x, x')}$.   (11)

It is noted that $(\bar{x}_t, \bar{x}'_t)$ is constructed from an input in the predicted top-k set $S_{\mu_t}(k)$ and an input in its complement $S^c_{\mu_t}(k)$. As the estimation of the top-k set improves, we expect these 2 inputs to lie on both sides of the boundary of the top-k set: inside the top-k set vs. outside the top-k set. Then, extending Lemma 3.6, the following lemma shows that the above desirable property is satisfied by choosing the sampling input $x_t$ from the set $Q_t \triangleq \{x_t^U, x_t^L, x_t^W, x_t^N\}$ of the four candidates defined in Lemma 3.6, with the argmax/argmin taken over the pair $\{\bar{x}_t, \bar{x}'_t\}$.

Lemma 4.3. For any $x_t \in Q_t$, $|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$.

The proof is shown in Appendix H. As a result, the cumulative regret incurred by choosing $x_t$ following Lemma 4.3 is bounded in the following theorem (proof in Appendix I).

Theorem 4.4. By sampling $x_t$ following Lemma 4.3, we obtain the following cumulative regret bound: with probability of at least $1 - \delta$, for all $T \ge 1$ and all $(x_t)_{t=1}^T \in \prod_{t=1}^T Q_t$,

$R_{T,k} \triangleq \sum_{t=1}^T r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))} \le O(\sqrt{T \beta_T \gamma_T})$

where $\beta_T$, $\gamma_T$, and $\delta$ are as defined in Lemma 2.2.

Algorithm 1 Mean Prediction (MP) for Active Set Ordering
Require: $\mathcal{X}$, $D_0$, $k$, $T$
1: for $t = 1$ to $T$ do
2:   Update GP posterior belief: $\{\mu_t(x)\}_{x \in \mathcal{X}}$, $\{\sigma_t(x)\}_{x \in \mathcal{X}}$.
3:   Construct $S_{\mu_t}(k)$ as the top-k inputs with the highest values of $\mu_t$.   ▷ Prediction
4:   $(\bar{x}_t, \bar{x}'_t) \triangleq \arg\max_{(x, x') \in S_{\mu_t}(k) \times S^c_{\mu_t}(k)} \rho^{(t)}_{\pi_{\mu_t}(x, x')}$
5:   Select $x_t \in Q_t = \{x_t^U, x_t^L, x_t^W, x_t^N\}$.   ▷ Sampling input
6:   $\mathbf{y}_{D_t} \triangleq \mathbf{y}_{D_{t-1}} \cup \{y(x_t)\}$
7: end for
8: Update GP posterior belief: $\{\mu_{T+1}(x)\}_{x \in \mathcal{X}}$, $\{\sigma_{T+1}(x)\}_{x \in \mathcal{X}}$.
9: Construct $S_{\mu_{T+1}}(k)$ as the top-k inputs with the highest values of $\mu_{T+1}$.
10: return $S_{\mu_{T+1}}(k)$.

We call the algorithm that makes the prediction using the GP posterior mean and selects the sampling input $x_t$ following Lemma 4.3 the mean prediction (MP) algorithm. Its pseudocode is shown in Algorithm 1. Theorem 4.4 indicates that MP incurs a sublinear cumulative regret for several commonly used kernels with sublinear $\gamma_T$ [19].
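To make Algorithm 1 concrete, the following minimal Python sketch performs one prediction and sampling-input selection step over a finite domain, given arrays of posterior means and standard deviations. The function and variable names are illustrative assumptions; the pair in (11) is computed via the separable form of the bound, which holds because the predicted ordering over $S_{\mu_t}(k) \times S^c_{\mu_t}(k)$ is 1.

```python
import numpy as np

def mp_step(mu: np.ndarray, sigma: np.ndarray, beta: float, k: int):
    """One prediction + sampling-input selection step of MP (Algorithm 1)."""
    l = mu - np.sqrt(beta) * sigma                  # l_t(x) from Lemma 2.2
    u = mu + np.sqrt(beta) * sigma                  # u_t(x)
    order = np.argsort(-mu)
    S, Sc = order[:k], order[k:]                    # S_{mu_t}(k) and its complement
    # Comparison pair (11): for (x, x') in S x S^c the predicted ordering is 1, so
    # rho^(t) = max(0, u(x') - l(x)); its maximum separates over the two sets.
    x_bar = S[np.argmin(l[S])]                      # input in S with the lowest l_t
    x_bar_p = Sc[np.argmax(u[Sc])]                  # input in S^c with the highest u_t
    pair = np.array([x_bar, x_bar_p])
    # The four candidate sampling inputs of Lemma 3.6 / Lemma 4.3 on this pair.
    width = u - l
    x_U = pair[np.argmax(u[pair])]
    x_L = pair[np.argmin(l[pair])]
    x_W = pair[np.argmax(width[pair])]
    x_N = min((x_U, x_L), key=lambda i: width[i])
    return S, {"U": int(x_U), "L": int(x_L), "W": int(x_W), "N": int(x_N)}
```

Any of the four returned candidates can serve as $x_t$; the GP posterior is then updated with the new observation and the loop repeats.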
Remark 4.5 (Bayesian optimization as an active set ordering problem with k = 1). We show in Appendix E.2 that when $k = 1$, $x_t^U \in \arg\max_{x \in \mathcal{X}} u_t(x)$, which is the sampling input of the GP-UCB algorithm [19]. Additionally, as discussed in Sec. 4.1, the regret $r_{\pi(S(1), S^c(1))}$ is the well-known regret in BO. Hence, we recover both the GP-UCB algorithm and its regret bound when $k = 1$ (although we consider the regret of the prediction rather than that of the sampling input). Moreover, this new construction of GP-UCB leads to some subtle insights. Firstly, while the GP posterior mean has been used in computing the inference regret of entropy search methods [9, 21], there has not been any theoretical justification for using the posterior mean. In contrast, the theoretical analysis in our work justifies the use of the maximizer of the GP posterior mean as an estimate of the maximizer of $f$. Secondly, by predicting the maximizer using the GP posterior mean, $x_t^U$ is not the only sampling input that achieves a sublinear cumulative regret. In fact, there are other choices of the sampling input, as shown in Lemma 4.3. Similarly, we note that the LCB algorithm to find the minimizer of a blackbox function can be recovered by setting $k = n - 1$ and $x_t = x_t^L$.

Remark 4.6 (Lower bound of the active set ordering problem). Let the lower bound of the active set ordering problem be the lower bound of the cumulative regret of the worst-case problem instance over all possible values of $k$. Then, it should be at least as large as the lower bound of the special case where $k = 1$, which is the BO problem according to Remark 4.5. Furthermore, BO has known lower bounds for several common kernels, e.g., for the SE kernel, the lower bound of the cumulative regret is $\Omega(\sqrt{T (\log T)^{d/2}})$ [17]. Hence, the lower bound of the active set ordering problem is at least $\Omega(\sqrt{T (\log T)^{d/2}})$. Additionally, similar to Remark 3.8, the cumulative regret of our solution in Theorem 4.4 is bounded by $R_T \le O^*(\sqrt{T (\log T)^{2d}})$. Hence, it matches the lower bound up to the replacement of $d/2$ by $2d + O(1)$.

Remark 4.7. Updating the GP posterior belief incurs $O(|D_t|^3 + n|D_t|^2)$ time (including $O(|D_t|^3)$ for training and $O(n|D_t|^2)$ for prediction). Given the GP posterior belief, Algorithm 1 involves the following 2 major steps. First, in line 3 of Algorithm 1, it takes $O(n \log k)$ time to find the top-k inputs $S_{\mu_t}(k)$ by using a max heap of size $k$ and scanning through the GP posterior mean of all $n$ inputs. Second, in line 4 of Algorithm 1, it takes $O(k(n - k))$ time to scan through the elements in $S_{\mu_t}(k) \times S^c_{\mu_t}(k)$. Therefore, an iteration of Algorithm 1 takes $O(|D_t|^3 + n|D_t|^2 + n \log k + k(n - k))$ time.

Remark 4.8 (Active multiple set ordering). Let us consider the problem of estimating $m$ top-k sets $S(k_1), S(k_2), \ldots, S(k_m)$ simultaneously (motivated in Sec. 1). This problem is analogous to finding $m$ contour lines of a blackbox function, where each contour line represents the boundary between $S(k_i)$ and its complement $S^c(k_i)$. To solve this problem, we define the following input pair:

$(\bar{x}_t, \bar{x}'_t) \triangleq \arg\max_{(x, x') \in \bigcup_{i=1}^m S_{\mu_t}(k_i) \times S^c_{\mu_t}(k_i)} \rho^{(t)}_{\pi_{\mu_t}(x, x')}$.   (12)

In other words, we aim to reduce the maximum regret incurred by the predicted pairwise orderings of all $m$ top-k sets. Given $(\bar{x}_t, \bar{x}'_t)$ in (12), MP proceeds by sampling the input $x_t$ according to Lemma 4.3, i.e., $x_t \in Q_t = \{x_t^U, x_t^L, x_t^W, x_t^N\}$. The approach is elaborated in Appendix J.

[Figure 2: Plots of the sampling inputs, the GP posterior distribution, and the performance of (a) MP with a sampling input from Lemma 4.3 and (b) Var with $x_t = \arg\max_{x \in \mathcal{X}} \sigma_t^2(x)$ in estimating $S(20)$ of a synthetic function. The comparison pair is $(\bar{x}_t, \bar{x}'_t)$ in (11). The histogram on the horizontal axis shows the frequency of sampling inputs in 40 iterations. Legend: blackbox function, GP mean, upper bound, lower bound, correct predicted top-k, incorrect predicted top-k, missing top-k, comparison pair, current observation, past observation.]

We also note that active multiple set ordering is able to find both the maximizers $S(1)$ and the minimizers $S^c(n-1)$ simultaneously, a problem that has not been studied in GP-UCB [19].
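A minimal sketch of the pair selection in (12) over a finite domain, assuming arrays of posterior means and confidence bounds are available (names are illustrative; see Algorithm 2 in Appendix J for the full procedure):

```python
import numpy as np

def multi_set_pair(mu: np.ndarray, l: np.ndarray, u: np.ndarray, ks):
    """Comparison pair in (12) across the m top-k sets with sizes ks = [k_1, ..., k_m]:
    the pair with the largest regret upper bound among all predicted set orderings."""
    order = np.argsort(-mu)
    best_pair, best_rho = None, -np.inf
    for k in ks:
        S, Sc = order[:k], order[k:]
        # For (x, x') in S x S^c the predicted ordering is 1, so rho = max(0, u(x') - l(x)),
        # whose maximum over the product set separates into the argmin/argmax below.
        x = S[np.argmin(l[S])]
        xp = Sc[np.argmax(u[Sc])]
        rho = max(0.0, u[xp] - l[x])
        if rho > best_rho:
            best_pair, best_rho = (int(x), int(xp)), rho
    return best_pair
```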
5 Experiments

5.1 Active Set Ordering

In this section, we validate the empirical performance of our MP algorithm with the different choices of the sampling input in Lemma 4.3, namely $x_t^U$, $x_t^L$, $x_t^W$, and $x_t^N$, by comparing with 2 baselines: an uncertainty sampling approach, called Var, that selects the sampling input with the highest GP posterior variance, i.e., $x_t \in \arg\max_{x \in \mathcal{X}} \sigma_t^2(x)$, and a baseline, called Rand, that selects the sampling input at random. The regret $r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$ is used to measure the performance of each algorithm, i.e., the prediction of the top-k set consists of the $k$ inputs with the highest GP posterior mean.

To begin with, we visualize the sampling inputs and the accuracy of $S_{\mu_t}(20)$ that come from our MP algorithm (with a sampling input from Lemma 4.3) and the Var algorithm in Fig. 2. In Fig. 2a, the histogram shows that the sampling inputs are at the boundary of $S(20)$. This is highly desirable as it is challenging to decide whether an input at the boundary belongs to $S(20)$. Similarly, the input pair in (11) also consists of inputs around this boundary (depicted as vertical orange lines). On the other hand, precisely estimating the function evaluations of inputs far from the boundary, e.g., inputs around $x = 0.85$ (in $S(20)$) and inputs around $x = 0.1$ (not part of $S(20)$), is unnecessary. We observe that the uncertainty of the GP posterior distribution at these inputs remains high in Fig. 2a. Hence, our MP algorithm is able to efficiently concentrate its sampling budget on the important inputs at the boundary of the top-k set. Interestingly, this boundary serves as a contour line of the blackbox function, indicating that our solution could potentially be applied to estimate a contour line by specifying the proportion of the input domain where function evaluations exceed this contour.

Regarding the Var algorithm (i.e., uncertainty sampling) in Fig. 2b, the histogram shows that the sampling inputs are distributed evenly across the input domain. This is because Var aims to reduce the uncertainty of the function evaluation throughout the input domain without considering the current prediction $S_{\mu_t}(20)$. For example, it is inefficient to select sampling inputs far away from the boundary of $S(20)$. It is observed that the estimation of the function evaluations at the boundary of $S(20)$ using Var is more uncertain than that using the MP algorithm given the same number of sampling inputs. This results in erroneously including certain inputs in the predicted top-k set (depicted as red dots) and overlooking several inputs that belong to $S(20)$ (depicted as black dots).⁴

⁴ The complete animation is available in the supplementary materials.

We numerically report the performance using the proposed regret $r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$. The experiments are conducted on 4 synthetic functions: a function sampled from a GP, the Branin-Hoo function, the Goldstein-Price function with a noise of $\sigma_n = 0.1$, and the Hartmann-6D function with a noise of $\sigma_n = 0.01$ [20]. For the first three synthetic functions, the input domain is discretized into a set of 100 points, whereas for the Hartmann-6D function, it is discretized into a set of 1000 points.
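The reported regret can be computed directly from the true evaluations over the discretized domain and the current posterior mean; a minimal sketch with illustrative names, not code released with the paper:

```python
import numpy as np

def top_k_regret(f_values: np.ndarray, mu: np.ndarray, k: int) -> float:
    """Regret r_{pi_{mu_t}(S_{mu_t}(k), S^c_{mu_t}(k))} of predicting the top-k set
    as the k inputs with the highest posterior mean, following (8) with i = 1."""
    order = np.argsort(-mu)
    pred, comp = order[:k], order[k:]
    # max over (x, x') in S x S^c of f(x') - f(x), clipped at zero
    return max(0.0, float(np.max(f_values[comp]) - np.min(f_values[pred])))
```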
Motivated by environmental monitoring problems, we generate 3 active set ordering problems that estimate the top-5 set using the dataset of NO3 concentration in Lake Zurich (downloaded from https://wldb.ilec.or.jp/Lake/EUR-06/datalist), the dataset of the phosphorus concentration at Brooms Barn [22], and the dataset of the humidity in the Intel Lab (downloaded from https://db.csail.mit.edu/labdata/labdata.html). The environment field is discretized into a set of 100 locations in the experiments with the NO3 and humidity datasets and 400 locations in the experiment with the phosphorus dataset. The experiments are repeated 15 times to account for the randomness in the generation of the observations. Further details are provided in Appendix K.

The average and the standard error of the regret are shown in Figs. 3s:a-g. There is no significant difference among the performance of MP with the different sampling inputs in Lemma 4.3. Nevertheless, the MP algorithm with any choice of the sampling input in $Q_t$ outperforms the 2 baselines by converging to a lower regret.

[Figure 3: Plots of the regret against the iteration in estimating (s:a-g) the top-5 set $S(5)$ and (m:a-g) multiple top-k sets $S(1)$, $S(10)$, and $S(20)$ on the GP sample, Branin-Hoo, Goldstein-Price, Hartmann-6D, NO3, phosphorus, and humidity problems. Legend: Rand, Var, and MP with the sampling inputs $x_t^U$, $x_t^L$, $x_t^W$, and $x_t^N$.]

5.2 Active Multiple Set Ordering

To empirically validate the performance of our MP algorithm in solving the problem of estimating multiple top-k sets, we consider the problem of estimating $S(1)$ (i.e., the maximizers), $S(10)$, and $S(20)$ simultaneously, i.e., $k_1 = 1$, $k_2 = 10$, $k_3 = 20$ in Remark 4.8. We utilize the same set of synthetic functions and real-world environmental datasets as in the previous section to compare the performance of MP with Var and Rand. The plots of the average and standard error of the maximum regret $\max_{k \in \{1, 10, 20\}} r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$ over 15 repeated experiments are shown in Figs. 3m:a-g. The MP algorithm outperforms the other 2 baselines by converging to a lower regret. In some active multiple set ordering experiments (e.g., in Figs. 3m:a, 3m:c), the performance gaps between MP and Var are smaller than those in the previous active set ordering experiments (e.g., Figs. 3s:a, 3s:c). This is because estimating multiple top-k sets requires more observations, which makes the performance of MP tend towards that of Var, which estimates the entire function.

5.3 Bayesian Optimization

When $k = 1$, the active set ordering problem reduces to the BO problem and a sampling input of our MP algorithm, namely $x_t = x_t^U$, is the same as that of GP-UCB [19]. Therefore, this section empirically demonstrates the performance of MP with the different sampling inputs ($x_t \in \{x_t^U, x_t^L, x_t^W, x_t^N\}$) in solving BO. Our aim is not to show that MP achieves state-of-the-art performance as a BO solver, but rather to demonstrate that it performs comparably to the well-known GP-UCB algorithm. In addition to comparing with GP-UCB (equivalently, MP with $x_t = x_t^U$), we also compare with 3 classical BO solutions: probability of improvement (PI) [13], expected improvement (EI) [15], and max-value entropy search (MES) [21].

[Figure 4: Plots of the regret of the predicted maximizer against the iteration on (a) Branin-Hoo, (b) NO3, (c) Phosphorus, and (d) Humidity. Legend: EI, PI, MES, and MP with the sampling inputs $x_t^U$, $x_t^L$, $x_t^W$, and $x_t^N$.]
The average and the standard error of the regret $r_{\pi_{\mu_t}(S_{\mu_t}(1), S^c_{\mu_t}(1))}$ over 15 repeated experiments are shown in Fig. 4. We observe that MP performs comparably with the well-known GP-UCB algorithm (labelled as $x_t^U$). Expectedly, EI and MES outperform GP-UCB (and hence MP) in some experiments, such as those in Figs. 4b and 4c.

6 Conclusion

This paper presents a new problem formulation, namely active set ordering, that aims to strike a balance between the expensive estimation of the entire function in ED and the estimation of only the maximizers in BO. We propose the mean prediction (MP) algorithm to address this problem with a theoretical no-regret guarantee. Interestingly, BO can be framed as a special instance of active set ordering, which leads to several subtle new insights regarding the predicted maximizer and alternative sampling inputs. Lastly, the performance of MP is empirically evaluated using various synthetic functions and real-world datasets.

Acknowledgments and Disclosure of Funding

This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-018). DesCartes: this research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. This research was partially supported by the Australian Government through the Australian Research Council's Discovery Projects funding scheme (project DP210102798). The views expressed herein are those of the authors and are not necessarily those of the Australian Government or Australian Research Council.

References

[1] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Proc. NIPS, pages 1507-1515, 2016.
[2] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
[3] B. Bryan, R. C. Nichol, C. R. Genovese, J. Schneider, C. J. Miller, and L. Wasserman. Active learning for identifying function threshold boundaries. In Proc. NeurIPS, volume 18, 2005.
[4] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273-304, 1995.
[5] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proc. ICML, pages 844-853, 2017.
[6] P. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
[7] R. Garnett. Bayesian Optimization. Cambridge University Press, 2022.
[8] A. Gotovos, N. Casati, G. Hitz, and A. Krause. Active learning for level set estimation. In Proc. IJCAI, 2013.
[9] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Proc. NIPS, pages 918-926, 2014.
[10] H. Jiang, J. Li, and M. Qiao. Practical algorithms for best-k identification in multi-armed bandits. arXiv preprint arXiv:1705.06894, 2017.
[11] S. Kalyanakrishnan and P. Stone. Efficient selection of multiple bandit arms: Theory and practice. In Proc. ICML, volume 10, pages 511-518, 2010.
[12] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In Proc. ICML, volume 12, pages 655-662, 2012.
[13] H. J. Kushner.
A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97-106, 1964.
[14] B. Mason, R. Camilleri, S. Mukherjee, K. Jamieson, R. Nowak, and L. Jain. Nearly optimal algorithms for level set estimation. In Proc. AISTATS, 2022.
[15] J. Močkus. The Bayesian approach to global optimization. In System Modeling and Optimization, volume 38, pages 473-481, 1982.
[16] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[17] J. Scarlett, I. Bogunovic, and V. Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723-1742, 2017.
[18] B. Settles. Active learning literature survey. 2009.
[19] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pages 1015-1022, 2010.
[20] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: Test functions and datasets. Retrieved May 21, 2024, from http://www.sfu.ca/~ssurjano.
[21] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proc. ICML, pages 3627-3635, 2017.
[22] R. Webster and M. A. Oliver. Geostatistics for Environmental Scientists. John Wiley & Sons, 2007.

A Proof of Lemma 3.1

We will show that $\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}$ is an upper confidence bound of the regret $r_\pi(\tilde{x}, \tilde{x}')$, i.e.,

$P\big(\forall t \ge 1,\ r_\pi(\tilde{x}, \tilde{x}') \le \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}\big) \ge 1 - \delta$.

The regret $r_\pi(\tilde{x}, \tilde{x}')$ is defined as

$r_\pi(\tilde{x}, \tilde{x}') \triangleq \max\big(0,\ (2\pi(\tilde{x}, \tilde{x}') - 1)(f(\tilde{x}') - f(\tilde{x}))\big)$.

Furthermore, from Lemma 2.2, with probability of at least $1 - \delta$, for all $t \ge 1$,

$l_t(\tilde{x}) \le f(\tilde{x}) \le u_t(\tilde{x})$ and $l_t(\tilde{x}') \le f(\tilde{x}') \le u_t(\tilde{x}')$.

Hence, with probability of at least $1 - \delta$, for all $t \ge 1$,

$f(\tilde{x}) - f(\tilde{x}') \le u_t(\tilde{x}) - l_t(\tilde{x}')$ and $f(\tilde{x}') - f(\tilde{x}) \le u_t(\tilde{x}') - l_t(\tilde{x})$.

Multiplying both sides by $(2\pi(\tilde{x}, \tilde{x}') - 1)$, we obtain

$(2\pi(\tilde{x}, \tilde{x}') - 1)(f(\tilde{x}') - f(\tilde{x})) \le \begin{cases} u_t(\tilde{x}) - l_t(\tilde{x}') & \text{if } \pi(\tilde{x}, \tilde{x}') = 0 \\ u_t(\tilde{x}') - l_t(\tilde{x}) & \text{if } \pi(\tilde{x}, \tilde{x}') = 1 \end{cases} \le \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}$.

Therefore, with probability of at least $1 - \delta$, for all $t \ge 1$,

$r_\pi(\tilde{x}, \tilde{x}') = \max\big(0,\ (2\pi(\tilde{x}, \tilde{x}') - 1)(f(\tilde{x}') - f(\tilde{x}))\big) \le \begin{cases} \max(0,\ u_t(\tilde{x}) - l_t(\tilde{x}')) & \text{if } \pi(\tilde{x}, \tilde{x}') = 0 \\ \max(0,\ u_t(\tilde{x}') - l_t(\tilde{x})) & \text{if } \pi(\tilde{x}, \tilde{x}') = 1 \end{cases} = \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')}$.

B Proof of Lemma 3.5

We note that $\mathbb{1}_{\mu_t(\tilde{x}) \ge \mu_t(\tilde{x}')} = 1$ happens when

$\mu_t(\tilde{x}) \ge \mu_t(\tilde{x}')$, i.e., $\frac{u_t(\tilde{x}) + l_t(\tilde{x})}{2} \ge \frac{u_t(\tilde{x}') + l_t(\tilde{x}')}{2}$, i.e., $u_t(\tilde{x}) - l_t(\tilde{x}') \ge u_t(\tilde{x}') - l_t(\tilde{x})$,

which implies

$\max(0,\ u_t(\tilde{x}) - l_t(\tilde{x}')) \ge \max(0,\ u_t(\tilde{x}') - l_t(\tilde{x}))$, i.e., $\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')=0} \ge \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')=1}$.

Hence,

$\mathbb{1}_{\mu_t(\tilde{x}) \ge \mu_t(\tilde{x}')} = \mathbb{1}_{\rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')=0} \ge \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')=1}} = \arg\min_{a \in \{0, 1\}} \rho^{(t)}_{\pi(\tilde{x}, \tilde{x}')=a} = \pi_{\mu_t}(\tilde{x}, \tilde{x}')$.

C Proof of Lemma 3.6

We will show that $\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le |C_t(x_t)|$ when $x_t$ is taken from the set $Q_t \triangleq \{x_t^U, x_t^L, x_t^W, x_t^N\}$.

Case 1: $x_t = x_t^U \triangleq \arg\max_{x \in \{\tilde{x}, \tilde{x}'\}} u_t(x)$.

$\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') = \max\big(0,\ \min(u_t(\tilde{x}') - l_t(\tilde{x}),\ u_t(\tilde{x}) - l_t(\tilde{x}'))\big)$
$\le \max\big(0,\ \min(u_t(x_t^U) - l_t(\tilde{x}),\ u_t(x_t^U) - l_t(\tilde{x}'))\big)$   (13)
$\le \max\big(0,\ u_t(x_t^U) - l_t(x_t^U)\big)$   (14)
$= u_t(x_t^U) - l_t(x_t^U)$   (15)
$= |C_t(x_t^U)|$   (16)

where (13) is because $x_t^U \in \arg\max_{x \in \{\tilde{x}, \tilde{x}'\}} u_t(x)$, (14) is because $x_t^U \in \{\tilde{x}, \tilde{x}'\}$, and (15) is because $u_t(x_t^U) - l_t(x_t^U) \ge 0$.

Case 2: $x_t = x_t^L \triangleq \arg\min_{x \in \{\tilde{x}, \tilde{x}'\}} l_t(x)$.

$\rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') = \max\big(0,\ \min(u_t(\tilde{x}') - l_t(\tilde{x}),\ u_t(\tilde{x}) - l_t(\tilde{x}'))\big)$   (17)
$\le \max\big(0,\ \min(u_t(\tilde{x}') - l_t(x_t^L),\ u_t(\tilde{x}) - l_t(x_t^L))\big)$   (18)
$\le \max\big(0,\ u_t(x_t^L) - l_t(x_t^L)\big)$   (19)
$= u_t(x_t^L) - l_t(x_t^L)$   (20)
$= |C_t(x_t^L)|$   (21)

where (18) is because $x_t^L \in \arg\min_{x \in \{\tilde{x}, \tilde{x}'\}} l_t(x)$, (19) is because $x_t^L \in \{\tilde{x}, \tilde{x}'\}$, and (20) is because $u_t(x_t^L) - l_t(x_t^L) \ge 0$.
Case 3: $x_t = x_t^W \triangleq \arg\max_{x \in \{\tilde{x}, \tilde{x}'\}} |C_t(x)|$, i.e.,

$|C_t(x_t^W)| = \max_{x \in \{\tilde{x}, \tilde{x}'\}} |C_t(x)|$   (22)
$\ge |C_t(x_t^L)|$   (23)
$\ge \rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$   (24)

where (23) is because $x_t^L \in \{\tilde{x}, \tilde{x}'\}$, and (24) follows from the proof of Case 2 in (21).

Case 4: $x_t = x_t^N \triangleq \arg\min_{x \in \{x_t^U, x_t^L\}} |C_t(x)|$. From (16) and (21), it follows that $|C_t(x_t^N)| \ge \rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}')$.

D Proof of Theorem 3.7

By choosing $x_t$ following Lemma 3.6, with probability of at least $1 - \delta$, for all $t \ge 1$,

$r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le \rho^{(t)}_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le |C_t(x_t)| = 2\beta_t^{1/2} \sigma_t(x_t)$.

Hence, with probability of at least $1 - \delta$, for all $T \ge 1$,

$\sum_{t=1}^T r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le \sum_{t=1}^T 2\beta_t^{1/2} \sigma_t(x_t)$.

Since $\beta_t$ is a non-decreasing sequence, $\beta_t \le \beta_T$ for all $t \le T$. Therefore, with probability of at least $1 - \delta$, for all $T \ge 1$,

$\sum_{t=1}^T r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le 2\beta_T^{1/2} \sum_{t=1}^T \sigma_t(x_t)$.

From Lemma 4 in [5],

$\sum_{t=1}^T \sigma_t(x_t) \le \sqrt{4(T + 2)\gamma_T} = O(\sqrt{T \gamma_T})$.

Hence, with probability of at least $1 - \delta$, for all $T \ge 1$,

$\sum_{t=1}^T r_{\pi_{\mu_t}}(\tilde{x}, \tilde{x}') \le O(\sqrt{T \beta_T \gamma_T})$.

E Bayesian Optimization as Active Set Ordering with k = 1

E.1 Regret

For a predicted top-1 set $S_{\mu_t}(1)$, $\pi_{\mu_t}(S_{\mu_t}(1), S^c_{\mu_t}(1)) = 1$. Hence, the regret for predicting $S_{\mu_t}(1)$ is expressed as follows:

$r_{\pi_{\mu_t}(S_{\mu_t}(1), S^c_{\mu_t}(1))} = \max\Big(0,\ \max_{(x, x') \in S_{\mu_t}(1) \times S^c_{\mu_t}(1)} f(x') - f(x)\Big)$.

Let $S_{\mu_t}(1) \triangleq \{\hat{x}^*\}$ and $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$. Then,

$r_{\pi_{\mu_t}(S_{\mu_t}(1), S^c_{\mu_t}(1))} = \max\Big(0,\ \max_{x \in S^c_{\mu_t}(1)} (f(x) - f(\hat{x}^*))\Big) = \max\Big(0,\ \max_{x \in S^c_{\mu_t}(1)} f(x) - f(\hat{x}^*)\Big)$.

Since $\mathcal{X} = S_{\mu_t}(1) \cup S^c_{\mu_t}(1)$, there are 2 cases:
- If $x^* \in S^c_{\mu_t}(1)$, then $\max_{x \in S^c_{\mu_t}(1)} f(x) = f(x^*) \ge f(\hat{x}^*)$ and $r_{\pi_{\mu_t}(S_{\mu_t}(1), S^c_{\mu_t}(1))} = \max(0,\ f(x^*) - f(\hat{x}^*)) = f(x^*) - f(\hat{x}^*)$.
- If $x^* \in S_{\mu_t}(1)$, i.e., $x^* = \hat{x}^*$, then $\max_{x \in S^c_{\mu_t}(1)} f(x) \le f(\hat{x}^*)$ and $r_{\pi_{\mu_t}(S_{\mu_t}(1), S^c_{\mu_t}(1))} = 0 = f(x^*) - f(\hat{x}^*)$.

E.2 Sampling Input

We will show that $x_t^U \in \arg\max_{x \in \mathcal{X}} u_t(x)$. When $k = 1$, due to the definition of $S_{\mu_t}(k)$ in Definition 4.2,

$\bar{x}_t \in \arg\max_{x \in \mathcal{X}} \mu_t(x)$.

Let

$\hat{x}_t \in \arg\max_{x \in \mathcal{X}} u_t(x)$.

We consider the following 2 cases.

Case 1: If $u_t(\hat{x}_t) = u_t(\bar{x}_t)$, then $x_t^U \in \arg\max_{x \in \mathcal{X}} u_t(x)$.

Case 2: If $u_t(\hat{x}_t) \neq u_t(\bar{x}_t)$, which implies that $\hat{x}_t \in S^c_{\mu_t}(k)$ and $u_t(\hat{x}_t) > u_t(\bar{x}_t)$, then we prove that $u_t(\hat{x}_t) = u_t(\bar{x}'_t)$ by contradiction. Assume that

$u_t(\hat{x}_t) > u_t(\bar{x}'_t)$.   (25)

Let us consider the following upper confidence bounds of pairwise orderings:

$\rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \hat{x}_t)} = \max(0,\ u_t(\hat{x}_t) - l_t(\bar{x}_t)) = u_t(\hat{x}_t) - l_t(\bar{x}_t) > 0$
$\rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \bar{x}'_t)} = \max(0,\ u_t(\bar{x}'_t) - l_t(\bar{x}_t))$.

Furthermore, from the choice of $(\bar{x}_t, \bar{x}'_t)$ in (11) and $\hat{x}_t \in S^c_{\mu_t}(k)$,

$\rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \bar{x}'_t)} \ge \rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \hat{x}_t)}$.

Hence,

$\max(0,\ u_t(\bar{x}'_t) - l_t(\bar{x}_t)) \ge \max(0,\ u_t(\hat{x}_t) - l_t(\bar{x}_t)) = u_t(\hat{x}_t) - l_t(\bar{x}_t) > 0$,

which implies $u_t(\bar{x}'_t) - l_t(\bar{x}_t) \ge u_t(\hat{x}_t) - l_t(\bar{x}_t)$, i.e., $u_t(\bar{x}'_t) \ge u_t(\hat{x}_t)$, which contradicts assumption (25). Therefore, $u_t(\hat{x}_t) = u_t(\bar{x}'_t)$ and $x_t^U \in \arg\max_{x \in \mathcal{X}} u_t(x)$.

F Upper Confidence Bound of the Regret $r_{\pi(X_0, X_1)}$

Recall that

$r_{\pi(X_0, X_1)} \triangleq \max_{(x, x') \in X_0 \times X_1} r_{\pi(x, x') = \pi(X_0, X_1)}$.

With probability of at least $1 - \delta$, for all $t \ge 1$ and for all $\{x, x'\} \subseteq \mathcal{X}$, $r_{\pi(x, x') = \pi(X_0, X_1)} \le \rho^{(t)}_{\pi(x, x') = \pi(X_0, X_1)}$. Therefore, with probability of at least $1 - \delta$, for all $t \ge 1$,

$r_{\pi(X_0, X_1)} \le \max_{(x, x') \in X_0 \times X_1} \rho^{(t)}_{\pi(x, x') = \pi(X_0, X_1)} \triangleq \rho^{(t)}_{\pi(X_0, X_1)}$.

G Transitivity of Pairwise Orderings

The transitivity property of the binary relation $\pi_{\mu_t}$ follows directly from its connection to the GP posterior mean in Lemma 3.5. In particular, we would like to show that if

$\pi_{\mu_t}(x, x') = 1$ and $\pi_{\mu_t}(x', x'') = 1$,   (26)

then $\pi_{\mu_t}(x, x'') = 1$. From Lemma 3.5, the premise (26) implies that

$\mu_t(x) \ge \mu_t(x')$ and $\mu_t(x') \ge \mu_t(x'')$,

which implies that $\mu_t(x) \ge \mu_t(x'')$. Applying Lemma 3.5 again, we conclude that $\pi_{\mu_t}(x, x'') = 1$.

H Proof of Lemma 4.3

Applying Lemma 3.6 to the input pair $(\bar{x}_t, \bar{x}'_t)$, by selecting $x_t \in Q_t = \{x_t^U, x_t^L, x_t^W, x_t^N\}$,

$|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \bar{x}'_t)}$.

Furthermore, from the choice of $(\bar{x}_t, \bar{x}'_t)$ in (11),

$\rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \bar{x}'_t)} = \max_{(x, x') \in S_{\mu_t}(k) \times S^c_{\mu_t}(k)} \rho^{(t)}_{\pi_{\mu_t}(x, x')} = \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$.
Hence, $|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))}$.

I Proof of Theorem 4.4

By choosing $x_t$ following Lemma 4.3, with probability of at least $1 - \delta$, for all $t \ge 1$,

$r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))} \le \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))} \le |C_t(x_t)| = 2\beta_t^{1/2} \sigma_t(x_t)$.

Furthermore, from the non-decreasing property of the sequence $(\beta_t)_{t=1}^T$,

$\sum_{t=1}^T r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))} \le \sum_{t=1}^T 2\beta_t^{1/2} \sigma_t(x_t) \le 2\beta_T^{1/2} \sum_{t=1}^T \sigma_t(x_t)$.

By utilizing the result from [5] as in Appendix D, we obtain

$\sum_{t=1}^T \sigma_t(x_t) \le O(\sqrt{T \gamma_T})$.

Therefore, with probability of at least $1 - \delta$, for all $T \ge 1$,

$\sum_{t=1}^T r_{\pi_{\mu_t}(S_{\mu_t}(k), S^c_{\mu_t}(k))} \le 2\beta_T^{1/2} \sum_{t=1}^T \sigma_t(x_t) \le O(\sqrt{T \beta_T \gamma_T})$.

J Active Multiple Set Ordering

The pseudocode of the MP algorithm for the active multiple set ordering problem is shown in Algorithm 2.

Algorithm 2 Mean Prediction (MP) for Active Multiple Set Ordering
Require: $\mathcal{X}$, $D_0$, $\{k_1, k_2, \ldots, k_m\}$, $T$
1: for $t = 1$ to $T$ do
2:   Update GP posterior belief: $\{\mu_t(x)\}_{x \in \mathcal{X}}$, $\{\sigma_t(x)\}_{x \in \mathcal{X}}$.
3:   Construct $\{S_{\mu_t}(k_i)\}_{i=1}^m$ as the collection of $m$ top-k sets predicted using $\mu_t$.   ▷ Prediction
4:   $(\bar{x}_t, \bar{x}'_t) \triangleq \arg\max_{(x, x') \in \bigcup_{i=1}^m S_{\mu_t}(k_i) \times S^c_{\mu_t}(k_i)} \rho^{(t)}_{\pi_{\mu_t}(x, x')}$
5:   Select $x_t \in Q_t = \{x_t^U, x_t^L, x_t^W, x_t^N\}$.   ▷ Sampling input
6:   $\mathbf{y}_{D_t} \triangleq \mathbf{y}_{D_{t-1}} \cup \{y(x_t)\}$
7: end for
8: Update GP posterior belief: $\{\mu_{T+1}(x)\}_{x \in \mathcal{X}}$, $\{\sigma_{T+1}(x)\}_{x \in \mathcal{X}}$.
9: Construct $\{S_{\mu_{T+1}}(k_i)\}_{i=1}^m$ as the collection of $m$ top-k sets predicted using $\mu_{T+1}$.
10: return $\{S_{\mu_{T+1}}(k_i)\}_{i=1}^m$.

In the rest of this section, we prove the cumulative regret bound of Algorithm 2. We recall that in (12),

$(\bar{x}_t, \bar{x}'_t) \in \arg\max_{(x, x') \in \bigcup_{i=1}^m S_{\mu_t}(k_i) \times S^c_{\mu_t}(k_i)} \rho^{(t)}_{\pi_{\mu_t}(x, x')}$.

Hence, for all $i \in \{1, 2, \ldots, m\}$,

$\rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \bar{x}'_t)} \ge \max_{(x, x') \in S_{\mu_t}(k_i) \times S^c_{\mu_t}(k_i)} \rho^{(t)}_{\pi_{\mu_t}(x, x')} = \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k_i), S^c_{\mu_t}(k_i))}$.

Furthermore, applying Lemma 3.6 to the input pair $(\bar{x}_t, \bar{x}'_t)$, for any $x_t \in Q_t = \{x_t^U, x_t^L, x_t^W, x_t^N\}$,

$|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}(\bar{x}_t, \bar{x}'_t)}$.

Hence, for all $i \in \{1, 2, \ldots, m\}$, $|C_t(x_t)| \ge \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k_i), S^c_{\mu_t}(k_i))}$. Therefore, with probability of at least $1 - \delta$, for all $t \ge 1$ and $i \in \{1, 2, \ldots, m\}$,

$r_{\pi_{\mu_t}(S_{\mu_t}(k_i), S^c_{\mu_t}(k_i))} \le \rho^{(t)}_{\pi_{\mu_t}(S_{\mu_t}(k_i), S^c_{\mu_t}(k_i))} \le |C_t(x_t)|$.

As a result, with probability of at least $1 - \delta$, for all $T \ge 1$ and $i \in \{1, 2, \ldots, m\}$,

$R_{T, k_i} \triangleq \sum_{t=1}^T r_{\pi_{\mu_t}(S_{\mu_t}(k_i), S^c_{\mu_t}(k_i))} \le \sum_{t=1}^T |C_t(x_t)| \le O(\sqrt{T \beta_T \gamma_T})$

where the last inequality follows from Appendix I.

K Experiments

All experiments were conducted on a computer equipped with an AMD Ryzen 7 6800HS processor and 16GB of RAM.

To generate a function sampled from a GP, we randomly generate 3 observations $\{(x_i, y(x_i))\}_{i=1}^3$ and fit a GP model to these observations. Then, we sample function evaluations at all inputs in the domain from the GP posterior distribution. These function evaluations are treated as the evaluations of the blackbox function. To generate an observation at a sampling input, we add Gaussian noise (with $\sigma_n = 0.1$) to the evaluation of the blackbox function at the sampling input.

The expressions for the Branin-Hoo and Goldstein-Price functions are described in [20]. We transform the input domain of these functions to $[0, 1]^2$ and standardize the function evaluations. The input domain is discretized into $n = 100$ points randomly selected in the domain $[0, 1]^2$. The noise is chosen with $\sigma_n = 0.1$.

To perform experiments with the NO3 dataset from Lake Zurich (available at https://wldb.ilec.or.jp/Lake/EUR-06/datalist), we standardize the NO3 measurements. Then, a GP model is trained on the standardized dataset to generate the noisy evaluations of the blackbox function over $n = 100$ randomly chosen locations.
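The dataset-based blackbox construction just described can be sketched as follows, assuming scikit-learn for the GP fit; the kernel choice, length scale, and function names are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def blackbox_from_measurements(locations, measurements, noise_std=0.1, seed=0):
    """Standardize the measurements, fit a GP, and use its posterior mean at the
    given locations as the blackbox evaluations; return a noisy-observation oracle."""
    z = (measurements - measurements.mean()) / measurements.std()
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1) + WhiteKernel())
    gp.fit(locations, z)                      # locations: array of shape (n, d)
    f_values = gp.predict(locations)          # blackbox evaluations over the domain
    rng = np.random.default_rng(seed)
    def observe(idx):
        # noisy evaluation y(x) = f(x) + N(0, noise_std^2) at the idx-th location
        return float(f_values[idx] + rng.normal(0.0, noise_std))
    return f_values, observe
```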
We use the logarithmic values of the phosphorus measurements in the soil survey of Brooms Barn [22] to construct a blackbox function. The locations are normalized to the range $[0, 1]^2$ and the logarithmic values of the phosphorus measurements are standardized. Then, we train a GP model to generate the noisy evaluations of the blackbox function over $n = 400$ randomly chosen locations.

To perform experiments with the humidity dataset, we extract the humidity measurements at different locations with the same mote id of 31167 from the Intel Lab data (available at https://db.csail.mit.edu/labdata/labdata.html). The humidity measurements are standardized. Then, a GP model is trained on this extracted dataset to generate the noisy evaluations of the blackbox function over $n = 100$ randomly chosen locations.

To remove the potential inefficiency due to repeated sampling in Rand and Var, we provide additional experiments by replacing the Rand and Var baselines with Rand No Repl and Var No Repl, which do not allow repeated sampling. This modification potentially gives Rand No Repl (random sampling without replacement across different iterations) and Var No Repl (uncertainty sampling without replacement across different iterations) an additional advantage over our solutions, which allow repeated sampling. By avoiding repeated sampling, Rand No Repl and Var No Repl can sample the input domain more uniformly, whereas our methods might re-sample certain input regions. However, as shown in Figure 5, Rand No Repl and Var No Repl still do not outperform our solutions. The justification for the efficiency of our solutions lies in the nature of noisy observations: with a noise standard deviation of $\sigma_n = 0.1$, a single observation at each input may not suffice to accurately determine the ordering with its neighboring inputs in terms of the function value. Hence, spreading the sampling budget across the whole input domain may not perform well. In contrast, our approach allocates more sampling inputs to the boundary of the top-k set, where it is particularly challenging to check whether inputs belong to the top-k set (see Figure 2).

[Figure 5: Plots of the regret against the iteration in estimating the top-5 set $S(5)$ on (a) Branin-Hoo and (b) Goldstein-Price. Legend: Rand, Rand No Repl, Var, Var No Repl, and MP with the sampling inputs $x_t^U$, $x_t^L$, $x_t^W$, and $x_t^N$.]

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our contributions include introducing the active set ordering problem, designing a novel solution with a theoretical performance guarantee, and investigating a connection to Bayesian optimization. The experiments are performed on environmental monitoring datasets, which aligns with the motivation in the introduction.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The limitations are elaborated as assumptions in the paper, including a finite input domain of the blackbox function (see Sec. 2.1) and the blackbox function belonging to a reproducing kernel Hilbert space with a bounded norm (see Sec. 2.2).

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We have provided the proofs for all theoretical results in the appendix of the paper.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We have described the experiments and provided the code and datasets for reproducing the experimental results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We have provided the source code, datasets, and scripts to reproduce the experiment results.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We have provided the URLs where the datasets are downloaded and described the parameters used in the experiments, such as the size of the input domain and the noise variance. Other parameters, such as the random seed, are configured in the submitted code.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We have repeated the experiments with different random seeds and reported the average and standard error of the experiment results.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We describe the computer resources in Appendix K.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU), internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We ensure that the research conforms with the NeurIPS Code of Ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work proposes a general active learning problem that does not have any direct path to negative applications.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We have cited the sources and/or provided the URLs to all public datasets used in the paper.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.