Selective Sampling for Online Best-arm Identification

Romain Camilleri*, Zhihan Xiong*, Maryam Fazel, Lalit Jain, Kevin Jamieson
University of Washington, Seattle, WA
{camilr,zhihanx,mfazel,lalitj,jamieson}@uw.edu
*Equal contribution. Alphabetical order.

Abstract

This work considers the problem of selective sampling for best-arm identification. Given a set of potential options $\mathcal{Z} \subset \mathbb{R}^d$, a learner aims to compute, with probability greater than $1-\delta$, $\arg\max_{z \in \mathcal{Z}} z^\top \theta^*$ where $\theta^*$ is unknown. At each time step, a potential measurement $x_t \in \mathcal{X} \subset \mathbb{R}^d$ is drawn IID and the learner can either choose to take the measurement, in which case they observe a noisy measurement of $x_t^\top \theta^*$, or abstain from taking the measurement and wait for a potentially more informative point to arrive in the stream. Hence the learner faces a fundamental trade-off between the number of labeled samples they take and when they have collected enough evidence to declare the best arm and stop sampling. The main results of this work precisely characterize this trade-off between labeled samples and stopping time and provide an algorithm that nearly-optimally achieves the minimal label complexity given a desired stopping time. In addition, we show that the optimal decision rule has a simple geometric form based on deciding whether a point is in an ellipse or not. Finally, our framework is general enough to capture binary classification, improving upon previous works.

1 Introduction

In this work we consider selective sampling for online best-arm identification. In this setting, at every time step $t = 1, 2, \ldots$, Nature reveals a potential measurement $x_t \in \mathcal{X} \subset \mathbb{R}^d$ to the learner. The learner can choose to either query $x_t$ ($\xi_t = 1$) or abstain ($\xi_t = 0$) and immediately move on to the next time. If the learner chooses to take a query ($\xi_t = 1$), then Nature reveals a noisy linear measurement of an unknown $\theta^* \in \mathbb{R}^d$, i.e., $y_t = \langle x_t, \theta^* \rangle + \epsilon_t$ where $\epsilon_t$ is mean-zero sub-Gaussian noise. Before the start of the game, the learner has knowledge of a set $\mathcal{Z} \subset \mathbb{R}^d$. The objective of the learner is to identify $z^* := \arg\max_{z \in \mathcal{Z}} \langle z, \theta^* \rangle$ with probability at least $1-\delta$ at a learner-specified stopping time $U$. It is desirable to minimize both the stopping time $U$, which counts the total number of unlabeled or labeled queries, and the number of labeled queries requested, $L := \sum_{t=1}^{U} \mathbf{1}\{\xi_t = 1\}$.

In this setting, at each time $t$ the learner must decide whether to accept the available measurement $x_t$, or abstain and wait for an even more informative measurement. While abstention may result in a smaller total labeled sample complexity $L$, the stopping time $U$ may be very large. This paper characterizes the set of feasible pairs $(U, L)$ that are necessary and sufficient to identify $z^*$ with probability at least $1-\delta$ when the $x_t$ are drawn IID at each time $t$ from a distribution $\nu$. Moreover, we propose an algorithm that nearly obtains the minimal information-theoretic label sample complexity $L$ for any desired unlabeled sample complexity $U$.

While characterizing the sample complexity of selective sampling for online best-arm identification is the primary theoretical goal of this work, the study was initially motivated by fundamental questions about how to optimally trade off the value of information versus time. Even for this idealized linear setting, it is far from obvious a priori what an optimal decision rule $\xi_t$ looks like and whether it can even be succinctly described, or if it is simply the solution to an opaque optimization problem. Remarkably,
we show that for every feasible, optimal operating pair $(U, L)$ there exists a matrix $A \in \mathbb{R}^{d \times d}$ such that the optimal decision rule takes the form $\xi_t = \mathbf{1}\{x_t^\top A x_t \geq 1\}$ when $x_t \sim \nu$ IID. The fact that for any smooth distribution $\nu$ the decision rule is a hard decision, equivalent to $x_t$ falling outside a fixed ellipse or not, and not a stochastic rule that varies complementarily with the density of $\nu$ over space, is perhaps unexpected.

To motivate the problem description, suppose on each day $t = 1, 2, \ldots$ a food blogger posts the Cocktail of the Day with a recipe described by a feature vector $x_t \in \mathbb{R}^d$. You have the ingredients (and skills) to make any possible cocktail in the space of all cocktails $\mathcal{Z}$, but you don't know which one you'd like the most, i.e., $z^* := \arg\max_{z \in \mathcal{Z}} \langle z, \theta^* \rangle$, where $\theta^*$ captures your preferences over cocktail recipes. You decide to use the Cocktail of the Day to inform your search. That is, each day you are presented with the cocktail recipe $x_t \in \mathbb{R}^d$, and if you choose to make it ($\xi_t = 1$) you observe your preference for the cocktail $y_t$ with $E[y_t] = \langle x_t, \theta^* \rangle$. Of course, making cocktails can get costly, so you don't want to make each day's cocktail, but rather you will only make the cocktail if $x_t$ is informative about $\theta^*$ (e.g., uses a new combination of ingredients). At the same time, waiting too many days before making the next cocktail of the day may mean that you never get to learn (and hence drink) the cocktail $z^*$ you like best. The setting above is not limited to cocktails, but naturally generalizes to discovering the efficacy of drugs and other therapeutics where blood and tissue samples come to the clinic in a stream and the researcher has to choose whether to take a potentially costly measurement.

Our results hold for arbitrary $\theta^* \in \mathbb{R}^d$, sets $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Z} \subset \mathbb{R}^d$, and measures $\nu \in \triangle_{\mathcal{X}}$ (we denote the set of probability measures over $\mathcal{X}$ by $\triangle_{\mathcal{X}}$) for which we assume $x_t \sim \nu$ is drawn IID. The assumption that each $x_t$ is IID allows us to make very strong statements about optimality. To summarize, our contributions are as follows:

- We present fundamental limits on the trade-off between the amount of unlabeled data and labeled data in the form of (the first) information-theoretic lower bounds for selective sampling problems that we are aware of. Naturally, they say that there is an absolute minimum amount of unlabeled data that is necessary to solve the problem, but then for any amount of unlabeled data beyond this critical value, the bounds say that the amount of labeled data must exceed some value as a function of the unlabeled data used.

- We propose an algorithm that nearly matches the lower bound at all feasible trade-off points in the sense that given any unlabeled data budget that exceeds the critical threshold, the algorithm takes no more labels than the lower bound suggests. Thus, the upper and lower bounds sketch out a curve of all possible operating points, and the algorithm achieves any point on this curve.

- We characterize the optimal decision rule of whether to take a sample or not, and at any operating point it is a simple test: accept $x_t \in \mathbb{R}^d$ if $x_t^\top A x_t \geq 1$ for some matrix $A$ that depends on the desired operating point and the geometry of the task. Geometrically, this is equivalent to $x_t$ falling inside or outside an ellipsoid. A minimal sketch of this rule is given after this list.

- Our framework is also general enough to capture binary classification, and consequently, we prove results there that improve upon the state of the art.
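To make the geometric form of this rule concrete, here is a minimal numpy sketch of the ellipsoidal accept test; the matrix `A` and the function name are illustrative placeholders of ours, since in our framework $A$ is produced by the optimal-design computation of Section 4.

```python
import numpy as np

# Minimal sketch of the ellipsoidal decision rule: query the label of x
# iff x falls outside the ellipsoid {u : u^T A u < 1}. The matrix A
# (d x d, positive semidefinite) is a hypothetical placeholder here.
def should_query(x: np.ndarray, A: np.ndarray) -> bool:
    return float(x @ A @ x) >= 1.0

# Example: with A = diag(1/9, 4), points far along the second coordinate
# (the informative direction in the experiment of Section 5) are queried.
A = np.diag([1.0 / 9.0, 4.0])
print(should_query(np.array([1.0, 0.0]), A))  # False: inside the ellipsoid
print(should_query(np.array([0.0, 1.0]), A))  # True: outside the ellipsoid
```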
1.1 Related Work

Selective Sampling in the Streaming Setting: Online prediction, the setting in which the selective sampling framework was introduced, is a closely related problem to the one studied in this paper and enjoys a much more developed literature [6, 9, 1, 7]. In the linear online prediction setting, for $t = 1, 2, \ldots$, Nature reveals $x_t \in \mathbb{R}^d$, the learner predicts $\hat{y}_t$ and incurs a loss $\ell(\hat{y}_t, y_t)$, and then the learner decides whether to observe $y_t$ (i.e., $\xi_t = 1$) or not ($\xi_t = 0$), where $y_t$ is a label generated by a composition of a known link function with a linear function of $x_t$. For example, in the classification setting [1, 6, 9], one setting assumes $y_t \in \{-1, 1\}$ with $E[y_t | x_t] = \langle x_t, \theta^* \rangle$ for some unknown $\theta^* \in \mathbb{R}^d$, and $\ell(\hat{y}_t, y_t) = \mathbf{1}\{\hat{y}_t \neq y_t\}$. In the regression setting [7], one observes $y_t \in [-1, 1]$ with $E[y_t | x_t] = \langle x_t, \theta^* \rangle$ again, and $\ell(\hat{y}_t, y_t) = (\hat{y}_t - y_t)^2$. After any amount of time $U$, the learner is incentivized to minimize both the number of requested labels $\sum_{t=1}^{U} \mathbf{1}\{\xi_t = 1\}$ and the cumulative loss $\sum_{t=1}^{U} \ell(y_t, \hat{y}_t)$ (or some measure of regret which compares to predictions using the unknown $\theta^*$). If every label $y_t$ is requested then $L = U$ and this is just the classical online learning setting.

These works give a guarantee on the regret and the number of labeled points taken in terms of the hardness of the stream relative to a learner which would see the label at every time. Most do not give the learner the ability to select an operating point that provides a trade-off between the amount of unlabeled versus labeled data taken. Those few works that propose algorithms providing this functionality do not provide lower bounds matching their upper bounds, leaving it unclear whether their algorithms optimally negotiate this trade-off. In contrast, our work fully characterizes the trade-off between the amount of unlabeled and labeled data through an information-theoretic lower bound and a matching upper bound. Specifically, our algorithm includes a tuning parameter, call it $\tau$, that controls the trade-off between the evaluation metric of interest (for us, the quality of the recommended $z \in \mathcal{Z}$), the label complexity $L$, and the amount of unlabeled data $U$ that is necessary before the metric of interest can be non-trivial. We prove that the possible settings of $\tau$ parametrize all possible trade-offs between unlabeled and labeled data.

Our work is perhaps closest to the streaming setting for agnostic active classification [8, 15] where each $x_s$ is drawn i.i.d. from an underlying distribution $\nu$ on $\mathcal{X}$, and indeed our results can be specialized to this setting as we discuss in Section 3. These papers also evaluate themselves at a single point on the trade-off curve, namely the number of samples needed in passive supervised learning to obtain a learner with excess risk at most $\epsilon$. They provide minimax guarantees on the amount of labeled data needed in terms of the disagreement coefficient [12]. In contrast, again, our results characterize the full trade-off between the amount of unlabeled data seen and the amount of labeled data needed to achieve the target excess risk $\epsilon$. We note that using online-to-batch conversion methods, [9, 1, 6] also provide results on the amount of labeled data needed, but they assume a very specific parametric form for their label distribution, unlike our setting which is agnostic.
Other works have characterized selective sampling for classification in the realizable setting, which assumes there exists a classifier among the set under consideration that perfectly labels every $y_t$ [13]; our work addresses the agnostic setting where no such assumption is made. Finally, our results apply under the more general setting of domain adaptation under covariate shift, where we observe data drawn from the stream $\nu$ but evaluate the excess risk of our resulting classifier on a different stream $\pi$ [22, 23, 26].

Best-Arm Identification and Online Experimental Design. Our techniques are based on experimental design methods for best-arm identification in linear bandits, see [24, 11, 5]. In the setting of these works, there exists a pool of examples $\mathcal{X}$ and at each time any $x \in \mathcal{X}$ can be selected with replacement. The goal is to identify the best arm using as few total selections (labels) as possible. Their algorithms are based on arm elimination. Specifically, they select examples with probability proportional to an approximate G-optimal design with respect to the currently remaining arms. Then, during each round after taking measurements, arms that are suboptimal with high probability are eliminated. Remarkably, near-optimal sample complexity has been achieved under this setting. While we apply these techniques of arm elimination and sampling through G-optimal design, the major difference is that we face a stream instead of a pool of examples. Finally, [10] considers a different online experimental design setup where (adversarially chosen) experiments arrive sequentially and a primal-dual algorithm decides whether to choose each, subject to a total budget. [10] studies the competitive ratio of such algorithms (in the manner of online packing algorithms) for problems such as D-optimal experiment design.

2 Selective Sampling for Best Arm Identification

Consider the following game: Given known $\mathcal{X}, \mathcal{Z} \subset \mathbb{R}^d$ and unknown $\theta^* \in \mathbb{R}^d$, at each time $t = 1, 2, \ldots$:

1. Nature reveals $x_t \overset{\text{iid}}{\sim} \nu$ with $\mathrm{support}(\nu) = \mathcal{X}$.
2. Player chooses $Q_t \in \{0, 1\}$. If $Q_t = 1$ then Nature reveals $y_t$ with $E[y_t] = \langle x_t, \theta^* \rangle$.
3. Player optionally decides to stop at time $t$ and output some $\hat{z} \in \mathcal{Z}$.

If the player stops at time $U$ after observing $L = \sum_{t=1}^{U} Q_t$ labels, the objective is to identify $z^* = \arg\max_{z \in \mathcal{Z}} \langle z, \theta^* \rangle$ with probability at least $1-\delta$ while minimizing a trade-off of $U$ and $L$. This paper studies the relationship between $U$ and $L$ in the context of necessary and sufficient conditions to identify $z^*$ with probability at least $1-\delta$. Clearly $U$ must be large enough for $z^*$ to be identifiable even if all labels are requested (i.e., $L = U$). But if $U$ is very large, the player can start to become more picky with their decision to observe the label or not. Indeed, one can easily imagine scenarios in which it is advantageous for a player to forgo requesting the label of the current example in favor of waiting for a more informative example to arrive later, if they wished to minimize $L$ alone. Intuitively, $L$ should decrease as $U$ increases, but how?

Any selective sampling algorithm for the above protocol at time $t$ is defined by 1) a selection rule $P_t : \mathcal{X} \to [0, 1]$ where $Q_t \sim \mathrm{Bernoulli}(P_t(x_t))$, 2) a stopping rule $U$, and 3) a recommendation rule $\hat{z} \in \mathcal{Z}$. The algorithm's behavior at time $t$ can use all information collected up to time $t$. A minimal interface capturing these three rules is sketched below.

Definition 1. For any $\delta \in (0, 1)$ we say a selective sampling algorithm is $\delta$-PAC for $\nu \in \triangle_{\mathcal{X}}$ if for all $\theta^* \in \mathbb{R}^d$ the algorithm terminates at a time $U$ which is finite almost surely and outputs $\arg\max_{z \in \mathcal{Z}} \langle z, \theta^* \rangle$ with probability at least $1-\delta$.
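The following skeleton is a sketch of ours, not a prescribed implementation: it simply renders the three components of a selective sampling algorithm as code, with the class name and the `label_oracle` argument being illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass, field

# Illustrative skeleton of the protocol above: a selective-sampling method
# consists of a selection rule P_t, a stopping rule, and a recommendation
# rule, each allowed to depend on the full history collected so far.
@dataclass
class SelectiveSampler:
    Z: np.ndarray                                 # candidate arms, shape (m, d)
    history: list = field(default_factory=list)   # tuples (x_t, Q_t, y_t or None)

    def selection_prob(self, x: np.ndarray) -> float:
        """P_t(x): probability of requesting the label of x."""
        raise NotImplementedError

    def should_stop(self) -> bool:
        """Stopping rule defining U."""
        raise NotImplementedError

    def recommend(self) -> np.ndarray:
        """Recommendation rule: an element of Z."""
        raise NotImplementedError

    def step(self, x: np.ndarray, label_oracle) -> None:
        Q = np.random.rand() < self.selection_prob(x)   # Q_t ~ Bernoulli(P_t(x_t))
        y = label_oracle(x) if Q else None              # y_t revealed only if Q_t = 1
        self.history.append((x, Q, y))
```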
2.1 Optimal design

Before introducing our own algorithm, let us consider a seemingly optimal procedure. For any $\lambda \in \triangle_{\mathcal{X}} = \{p : \sum_{x \in \mathcal{X}} p_x = 1,\ p_x \geq 0\ \forall x \in \mathcal{X}\}$ define

$$\rho(\lambda) := \max_{z \in \mathcal{Z} \setminus \{z^*\}} \frac{\|z^* - z\|^2_{E_{X \sim \lambda}[XX^\top]^{-1}}}{\langle \theta^*, z^* - z \rangle^2}. \qquad (1)$$

Intuitively, $\rho(\lambda)$ captures the number of labeled examples drawn from distribution $\lambda$ needed to identify $z^*$. Specifically, for any $\tau \geq \rho(\lambda) \log(|\mathcal{Z}|/\delta)$, if $x_1, \ldots, x_\tau \sim \lambda$ and $y_i = \langle x_i, \theta^* \rangle + \epsilon_i$ where $\epsilon_i$ is iid 1-sub-Gaussian noise, then there exists an estimator $\hat{\theta} := \hat{\theta}(\{(x_i, y_i)\}_{i=1}^\tau)$ such that $\langle \hat{\theta}, z^* \rangle > \max_{z \in \mathcal{Z} \setminus \{z^*\}} \langle \hat{\theta}, z \rangle$ with probability at least $1-\delta$ [11]. In particular, $\tau \geq \rho(\lambda) \log(|\mathcal{Z}|/\delta)$ samples suffice to guarantee that $\arg\max_{z \in \mathcal{Z}} \langle \hat{\theta}, z \rangle = \arg\max_{z \in \mathcal{Z}} \langle \theta^*, z \rangle =: z^*$. Thus, if our $\tau$ samples are coming from $\nu$, we would expect any reasonable algorithm to require at least $\rho(\nu) \log(|\mathcal{Z}|/\delta)$ examples and labels. However, since we only want to take informative examples, we instead choose to select the $t$th example $x_t = x$ according to a probability $P(x)$ so that our final labeled samples come from the distribution $\lambda$ where $\lambda(x) \propto P(x)\nu(x)$. In particular, $P(x)$ should be chosen according to the following optimization problem:

$$P^* = \operatorname*{argmin}_{P : \mathcal{X} \to [0,1]} \tau\, E_{X \sim \nu}[P(X)] \quad \text{subject to} \quad \max_{z \in \mathcal{Z} \setminus \{z^*\}} \frac{\|z^* - z\|^2_{E_{X \sim \nu}[\tau P(X) XX^\top]^{-1}}}{\langle z^* - z, \theta^* \rangle^2}\, \beta_\delta \leq 1 \qquad (2)$$

for $\beta_\delta = \log(|\mathcal{Z}|/\delta)$, where the objective captures the number of samples we select using $P$, and the constraint captures the fact that we have solved the problem. Remarkably, we can reparametrize this result in terms of an optimization problem over $\lambda \in \triangle_{\mathcal{X}}$ instead of $P : \mathcal{X} \to [0, 1]$ as

$$\tau\, E_{X \sim \nu}[P^*(X)] = \min_{\lambda \in \triangle_{\mathcal{X}}} \rho(\lambda)\beta_\delta \quad \text{subject to} \quad \tau \geq \|\lambda/\nu\|_\infty\, \rho(\lambda)\beta_\delta,$$

where $\|\lambda/\nu\|_\infty = \max_{x \in \mathcal{X}} \lambda(x)/\nu(x)$, as shown in Proposition 2. Note that as $\tau \to \infty$ the constraint becomes inconsequential. Also notice that $\rho(\nu)\beta_\delta$ appears to be a necessary number of labels to solve the problem even if $P(x) \equiv 1$ (albeit, by arguing about minimizing the upper bound above).

2.2 Main results

In this section we formally justify the sketched argument of the previous section, showing nearly matching upper and lower bounds.

Theorem 1 (Lower bound). Fix any $\delta \in (0, 1)$, $\mathcal{X}, \mathcal{Z} \subset \mathbb{R}^d$, and $\theta^* \in \mathbb{R}^d$. Any selective sampling algorithm that is $\delta$-PAC for $\nu \in \triangle_{\mathcal{X}}$ and terminates after drawing $U$ unlabeled examples from $\nu$ and requesting the labels of just $L$ of them satisfies $E[U] \geq \rho(\nu) \log(1/\delta)$, and

$$E[L] \geq \min_{\lambda \in \triangle_{\mathcal{X}}} \rho(\lambda) \log(1/\delta) \quad \text{subject to} \quad E[U] \geq \|\lambda/\nu\|_\infty\, \rho(\lambda) \log(1/\delta).$$

The first part of the theorem quantifies the number of rounds or unlabeled draws $U$ that any algorithm must observe before it could hope to stop and output $z^*$ correctly. The second part describes a trade-off between $U$ and $L$. One extreme is $E[U] \to \infty$, which effectively removes the constraint so that the number of observed labels must scale like $\min_{\lambda \in \triangle_{\mathcal{X}}} \rho(\lambda) \log(1/\delta)$. Note that this is precisely the number of labels required in the pool-based setting where the agent can choose any $x \in \mathcal{X}$ she desires at each time $t$ (e.g., [11]). In the other extreme, $E[U] = \rho(\nu) \log(1/\delta)$, so that the constraint in the label complexity $E[L]$ is equivalent to $\rho(\nu) \geq \|\lambda/\nu\|_\infty\, \rho(\lambda)$. This implies that the minimizing $\lambda$ must either stay very close to $\nu$, or must obtain a substantially smaller value of $\rho(\lambda)$ relative to $\rho(\nu)$ to account for the inflation factor $\|\lambda/\nu\|_\infty$. In some sense, this latter extreme is the most interesting point on the trade-off curve because it is asking the algorithm to stop as quickly as the algorithm that observes all labels, but after requesting a minimal number of labels. Note that this lower bound holds even for algorithms that know $\nu$ exactly. The proof of Theorem 1 relies on standard techniques from best-arm identification lower bounds (see e.g. [17, 11]).
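To make the quantities in these bounds concrete, the following numpy sketch evaluates $\rho(\lambda)$ from Eq. (1). It assumes, for illustration, finite $\mathcal{X}$ and $\mathcal{Z}$, a known instance $\theta^*$ (as in the lower bound, which fixes an instance), an invertible design matrix, and a unique maximizer.

```python
import numpy as np

# Sketch: evaluate rho(lambda) from Eq. (1) on a finite instance.
def rho(lam, X, Z, theta_star):
    """lam: (n,) weights on the rows of X; X: (n, d) support; Z: (m, d) arms."""
    A = (X * lam[:, None]).T @ X          # E_{X~lam}[X X^T], assumed invertible
    A_inv = np.linalg.inv(A)
    values = Z @ theta_star
    z_star, best = Z[np.argmax(values)], values.max()
    ratios = []
    for z, v in zip(Z, values):
        if v == best:
            continue                       # skip z_star (and exact ties)
        y = z_star - z
        ratios.append((y @ A_inv @ y) / (best - v) ** 2)
    return max(ratios)

# Minimizing rho(lam) * log(1/delta) over lam, subject to
# E[U] >= max_x(lam(x)/nu(x)) * rho(lam) * log(1/delta),
# traces out the lower-bound trade-off curve of Theorem 1.
```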
Remarkably, every point on the trade-off suggested by the lower bound is nearly achievable.

Theorem 2 (Upper bound). Fix any $\delta \in (0, 1)$, $\mathcal{X}, \mathcal{Z} \subset \mathbb{R}^d$, and $\theta^* \in \mathbb{R}^d$. Let $\Delta = \min_{z \in \mathcal{Z} \setminus \{z^*\}} \langle z^* - z, \theta^* \rangle$ and $\beta_\delta \approx \log(\log(\Delta^{-1})|\mathcal{Z}|/\delta)$, where the precise constant is given in the appendix. For any $\tau \geq \rho(\nu)\beta_\delta$ there exists a $\delta$-PAC selective sampling algorithm that observes $U$ unlabeled examples and requests just $L$ labels satisfying, with probability at least $1-\delta$,

$$L \leq 3 \log_2(4/\Delta) \min_{\lambda \in \triangle_{\mathcal{X}}} \rho(\lambda)\beta_\delta \quad \text{subject to} \quad \tau \geq \|\lambda/\nu\|_\infty\, \rho(\lambda)\beta_\delta.$$

Aside from the $\log(\Delta^{-1})$ factor and the $\log(|\mathcal{Z}|)$ that appears in the $\beta_\delta$ term, this nearly matches the lower bound. Note that the parameter $\tau$ parameterizes the algorithm and makes the trade-off between $U$ and $L$ explicit. The next section describes the algorithm that achieves this theorem.

2.3 Selective Sampling Algorithm

Algorithm 1 contains the pseudo-code of our selective sampling algorithm for best-arm identification. Note that it takes a confidence level $\delta \in (0, 1)$ and a parameter $\tau$ that controls the unlabeled-labeled budget trade-off as input. The algorithm is effectively an elimination-style algorithm and closely mirrors the RAGE algorithm for the pool-based setting of the best-arm identification problem [11]. The key difference, of course, is that instead of being able to plan over the pool of measurements, this algorithm must plan over the $x$'s that it may potentially see and account for the case that it might not see the $x$'s it wants.

Algorithm 1 Selective Sampling for Best-arm Identification
1: Input $\mathcal{Z} \subset \mathbb{R}^d$, $\delta \in (0, 1)$, $\tau$; initialize $\ell = 1$, $\mathcal{Z}_1 = \mathcal{Z}$
2: while $|\mathcal{Z}_\ell| > 1$ do
3:   Let $\hat{P}_\ell, \hat{\Sigma}_{\hat{P}_\ell} \leftarrow$ OPTIMIZEDESIGN($\mathcal{Z}_\ell, 2^{-\ell}, \tau$)   // $\hat{\Sigma}_{\hat{P}_\ell}$ approximates $E_{X \sim \nu}[\hat{P}_\ell(X)XX^\top]$
4:   for $t = (\ell-1)\tau + 1, \ldots, \ell\tau$ do
5:     Nature reveals $x_t$ drawn iid from $\nu$ (with support $\subset \mathbb{R}^d$)
6:     Sample $Q_t(x_t) \sim \mathrm{Bernoulli}(\hat{P}_\ell(x_t))$. If $Q_t = 1$ then observe $y_t$   // $E[y_t | x_t] = \langle \theta^*, x_t \rangle$
7:   end for
8:   Let $\hat{\theta}_\ell \leftarrow$ RIPS($\{\hat{\Sigma}_{\hat{P}_\ell}^{-1} Q_s(x_s) x_s y_s\}_{s=(\ell-1)\tau+1}^{\ell\tau}, \mathcal{Z} - \mathcal{Z}$)   // $\hat{\theta}_\ell$ approximates $\theta^*$
9:   $\mathcal{Z}_{\ell+1} = \mathcal{Z}_\ell \setminus \{z \in \mathcal{Z}_\ell : \max_{z' \in \mathcal{Z}_\ell} \langle z' - z, \hat{\theta}_\ell \rangle \geq 2^{-\ell}\}$; $\ell \leftarrow \ell + 1$
10: end while

In round $\ell$, the algorithm maintains an active set $\mathcal{Z}_\ell \subseteq \mathcal{Z}$ with the guarantee that each remaining $z \in \mathcal{Z}_\ell$ satisfies $\langle z^* - z, \theta^* \rangle \leq 8 \cdot 2^{-\ell}$. In each round, on Line 3, the algorithm calls out to a sub-routine OPTIMIZEDESIGN($\mathcal{Z}, \epsilon, \tau$) that tries to approximate the ideal optimal design of (2). In particular, the ideal response to OPTIMIZEDESIGN($\mathcal{Z}, \epsilon, \tau$) would return $P_\epsilon$ and $\Sigma_{P_\epsilon} = E_{X \sim \nu}[P_\epsilon(X)XX^\top]$, where $P_\epsilon$ is the solution to Equation (2) with the one exception that the denominator of the constraint is replaced with $\max\{\epsilon^2, \langle \theta^*, z^* - z \rangle^2\}$. Of course, $\theta^*$ is unknown, so we cannot solve Equation (2) (there are also other outstanding issues that we will address shortly). Consequently, our implementation will aim to approximate the optimization problem of Equation (2). But assuming our sample complexity is not too far off from this ideal, each round should not request more labels than the number of labels requested by the ideal program with $\epsilon = 0$. Thus, the total number of samples should be bounded by the ideal sample complexity times the number of rounds, which is $O(\log(\Delta^{-1}))$. We will return to implementation issues in the next section.

Assuming we are returned $(\hat{P}_\ell, \hat{\Sigma}_{\hat{P}_\ell})$ that approximate their ideals as just described, the algorithm then proceeds to process the incoming stream of $x_t \sim \nu$. As described above, the decision to request the label of $x_t$ is determined by a coin flip coming up heads with probability $\hat{P}_\ell(x_t)$; otherwise we do not request the label.
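A minimal sketch of one round of Algorithm 1 follows; `optimize_design` and `rips` are hypothetical stand-ins for the OPTIMIZEDESIGN subroutine of Section 2.4 and the RIPS estimator of [5], both treated as black boxes here.

```python
import numpy as np

# Sketch of round ell of Algorithm 1 (assumed helper signatures, not the
# paper's code). Z_active: (m, d) array of remaining arms.
def run_round(Z_active, ell, tau, sample_x, label_oracle, optimize_design, rips):
    P_hat, Sigma_hat = optimize_design(Z_active, 2.0 ** (-ell), tau)
    Sigma_inv = np.linalg.inv(Sigma_hat)
    terms = []
    for _ in range(tau):
        x = sample_x()                          # Nature reveals x_t ~ nu
        if np.random.rand() < P_hat(x):         # Q_t ~ Bernoulli(P_hat_ell(x_t))
            y = label_oracle(x)                 # request the label
            terms.append(Sigma_inv @ (x * y))   # Sigma^{-1} Q_s x_s y_s
    # Robust estimate of theta* (the paper passes the difference set Z - Z).
    theta_hat = rips(terms, Z_active)
    # Eliminate z if some z' beats it by at least 2^{-ell} under theta_hat.
    scores = Z_active @ theta_hat
    keep = scores > scores.max() - 2.0 ** (-ell)
    return Z_active[keep]
```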
Given the collected dataset $\{(x_t, y_t, Q_t, \hat{P}_\ell(x_t))\}_t$, Line 8 then computes an estimate $\hat{\theta}_\ell$ of $\theta^*$ using the RIPS estimator of [5], which satisfies

$$\big|\langle z^* - z, \hat{\theta}_\ell - \theta^* \rangle\big| \leq O\Big( \|z^* - z\|_{E_{X \sim \nu}[\tau \hat{P}_\ell(X)XX^\top]^{-1}} \sqrt{\log(2\ell^2 |\mathcal{Z}|^2/\delta)} \Big) \leq 2^{-\ell}$$

for all $z \in \mathcal{Z}_\ell$ simultaneously with probability at least $1-\delta$. Thus, the final line of the algorithm eliminates any $z \in \mathcal{Z}_\ell$ such that there exists another $z' \in \mathcal{Z}_\ell$ (think $z^*$) that satisfies $\langle \hat{\theta}_\ell, z' - z \rangle > 2^{-\ell}$. The process continues until $\mathcal{Z}_\ell = \{z^*\}$.

2.4 Implementation of OPTIMIZEDESIGN

For the subroutine OPTIMIZEDESIGN passed $(\mathcal{Z}_\ell, \epsilon, \tau)$, the next best thing to computing Equation (2) with the denominator of the constraint replaced with $\max\{\epsilon^2, \langle \theta^*, z^* - z \rangle^2\}$ is to compute

$$P_\epsilon = \operatorname*{argmin}_{P : \mathcal{X} \to [0,1]} E_{X \sim \nu}[P(X)] \quad \text{subject to} \quad \max_{z, z' \in \mathcal{Z}_\ell} \frac{\|z - z'\|^2_{E_{X \sim \nu}[\tau P(X)XX^\top]^{-1}}}{\epsilon^2}\, \beta_\delta \leq 1 \qquad (3)$$

and $\Sigma_{P_\epsilon} = E_{X \sim \nu}[P_\epsilon(X)XX^\top]$ for an appropriate choice of $\beta_\delta = \Theta(\log(|\mathcal{Z}|/\delta))$. To see this, firstly, any $z \in \mathcal{Z}$ with a gap $\langle \theta^*, z^* - z \rangle$ that we could accurately estimate would not be included in $\mathcal{Z}_\ell$, thus we don't need it in the max of the denominator. Secondly, to get rid of $z^*$ in the numerator (which is unknown, of course), we note that for any norm, $\max_{z, z'} \|z - z'\| \leq 2 \max_z \|z^* - z\| \leq 2 \max_{z, z'} \|z - z'\|$.

Assuming we could solve this directly and compute $\Sigma_{P_\epsilon} = E_{X \sim \nu}[P_\epsilon(X)XX^\top]$, we can obtain the result of Theorem 2 (proven in the Appendix). However, even if we knew $\nu$ exactly, the optimization problem of Equation (3) is quite daunting as it is a potentially infinite-dimensional optimization problem over $\mathcal{X}$. Fortunately, after forming the Lagrangian with dual variables for each $z - z' \in \mathcal{Z} - \mathcal{Z}$, optimizing the dual amounts to a finite-dimensional optimization problem over the finite number of dual variables. Moreover, this optimization problem maximizes a simple expectation with respect to $\nu$, and thus we can apply standard stochastic gradient ascent and results from stochastic approximation [20]. Given the connection to stochastic approximation, instead of sampling a fresh $\tilde{x} \sim \nu$ each iteration, it suffices to replay a sequence of $\tilde{x}$'s from historical data. Summing up, this construction allows us to compute a satisfactory $P_\epsilon$ and avoid both an infinite-dimensional optimization problem and requiring knowledge of $\nu$ (as long as historical data is available). Meanwhile, with historical data, we can also empirically compute $E_{X \sim \nu}[P_\epsilon(X)XX^\top]$. Historical data could mean offline samples from $\nu$ or just samples from previous rounds. In this setting, Theorem 2 still holds, albeit with larger constants. Theorem 7 in the appendix characterizes the necessary amount of historical data. Unfortunately (in full disclosure), the theoretical guarantee on the amount of historical data needed is absurdly large, though we suspect this arises from a looseness in our analysis. Similar assumptions and approaches to historical or offline data have been used in other works in the streaming setting, e.g., [15].

3 Selective Sampling for Binary Classification

We now review streaming binary classification in the agnostic setting [8, 12, 15] and show that our approach can be adapted to this setting. Consider a binary classification problem where $\mathcal{X}$ is the example space and $\mathcal{Y} = \{-1, 1\}$ is the label space. Fix a hypothesis class $\mathcal{H}$ such that each $h \in \mathcal{H}$ is a classifier $h : \mathcal{X} \to \mathcal{Y}$. Assume there exists a fixed regression function $\eta : \mathcal{X} \to [0, 1]$ such that the label of $x$ is Bernoulli with probability $\eta(x) = P(Y = 1 | X = x)$. Being in the agnostic setting, we make no assumption on the relationship between $\mathcal{H}$ and $\eta$. Finally, fix any $\nu \in \triangle_{\mathcal{X}}$ and $\pi \in \triangle_{\mathcal{X}}$. Given known $\mathcal{X}$, $\mathcal{H}$ and unknown regression function $\eta$, at each time $t = 1, 2, \ldots$:

1. Nature reveals $x_t \sim \nu$.
2. Player chooses $Q_t \in \{0, 1\}$. If $Q_t = 1$ then Nature reveals $y_t \sim \mathrm{Bernoulli}(\eta(x_t))$ over $\{-1, 1\}$.
3. Player optionally decides to stop at time $t$ and output some $\hat{h} \in \mathcal{H}$.

Define the risk of any $h \in \mathcal{H}$ as $R_\pi(h) := P_{X \sim \pi, Y \sim \eta(X)}(Y \neq h(X))$. If the player stops at time $U$ after observing $L = \sum_{t=1}^{U} Q_t$ labels, the objective is to identify $h^* = \arg\min_{h \in \mathcal{H}} R_\pi(h)$ with probability at least $1-\delta$ while minimizing a trade-off of $U$ and $L$. Note that $h^*$ is the true risk minimizer with respect to distribution $\pi$ but we observe samples $x_t \sim \nu$; $\pi$ is not necessarily equal to $\nu$. While we have posed the problem as identifying the potentially unique $h^*$, our setting naturally generalizes to identifying an $\epsilon$-good $h$ such that $R_\pi(h) - R_\pi(h^*) \leq \epsilon$.

We will now reduce selective sampling for binary classification to selective sampling for best-arm identification, and thus immediately obtain a result on the sample complexity. For simplicity, assume that $\mathcal{X}$ and $\mathcal{H}$ are finite. Enumerate $\mathcal{X}$ and for each $h \in \mathcal{H}$ define a vector $z^{(h)} = [z^{(h)}_x]_{x \in \mathcal{X}} \in [0, 1]^{|\mathcal{X}|}$ with $z^{(h)}_x := \pi(x)\mathbf{1}\{h(x) = 1\}$. Moreover, define $\theta^* := [\theta^*_x]_{x \in \mathcal{X}}$ where $\theta^*_x := 2\eta(x) - 1$. Then

$$R_\pi(h) = E_{X \sim \pi, Y \sim \eta(X)}[\mathbf{1}\{Y \neq h(X)\}] = \sum_{x \in \mathcal{X}} \pi(x)\big(\eta(x)\mathbf{1}\{h(x) = -1\} + (1-\eta(x))\mathbf{1}\{h(x) = 1\}\big) = \sum_{x \in \mathcal{X}} \pi(x)\eta(x) + \sum_{x \in \mathcal{X}} \pi(x)(1 - 2\eta(x))\mathbf{1}\{h(x) = 1\} = c - \langle z^{(h)}, \theta^* \rangle,$$

where $c = \sum_{x \in \mathcal{X}} \pi(x)\eta(x)$ does not depend on $h$. Thus, if $\mathcal{Z} := \{z^{(h)}\}_{h \in \mathcal{H}}$ then identifying $h^* = \arg\min_{h \in \mathcal{H}} R_\pi(h)$ is equivalent to identifying $z^* = \arg\max_{z \in \mathcal{Z}} \langle z, \theta^* \rangle$. We can now apply Theorem 2 to obtain a result describing the sample complexity trade-off. First define

$$\rho_\pi(\lambda, \epsilon) := \max_{z \in \mathcal{Z} \setminus \{z^*\}} \frac{\|z^* - z\|^2_{E_{X \sim \lambda}[XX^\top]^{-1}}}{\max\{\langle \theta^*, z^* - z \rangle^2, \epsilon^2\}} = \max_{h \in \mathcal{H} \setminus \{h^*\}} \frac{E_{X \sim \pi}\big[\mathbf{1}\{h(X) \neq h^*(X)\}\, \pi(X)/\lambda(X)\big]}{\max\{(R_\pi(h) - R_\pi(h^*))^2, \epsilon^2\}}.$$

An important case of the above setting is when $\pi = \nu$, i.e., we are evaluating the performance of a classifier relative to the same distribution our samples are drawn from. This is the setting of [8, 15, 12]. The following theorem shows that the sample complexity obtained by our algorithm is at least as good as the results they present.

Theorem 3. Fix any $\delta \in (0, 1)$, domain $\mathcal{X}$ with distribution $\nu$, finite hypothesis class $\mathcal{H}$, and regression function $\eta : \mathcal{X} \to [0, 1]$. Set $\epsilon > 0$ and $\beta_\delta = 2048 \log(4 \log_2^2(4/\epsilon)|\mathcal{H}|/\delta)$. Then for $\tau \geq \rho_\pi(\nu, \epsilon)\beta_\delta$ there exists a selective sampling algorithm that returns $h \in \mathcal{H}$ satisfying $R_\pi(h) - R_\pi(h^*) \leq \epsilon$ by observing $U$ unlabeled examples and requesting just $L$ labels such that $U \leq \log_2(4/\epsilon)\tau$ and

$$L \leq 3 \log_2(4/\epsilon) \min_{\lambda \in \triangle_{\mathcal{X}}} \rho_\pi(\lambda, \epsilon)\beta_\delta \quad \text{s.t.} \quad \tau \geq \|\lambda/\nu\|_\infty\, \rho_\pi(\lambda, \epsilon)\beta_\delta$$

with probability at least $1-\delta$. Furthermore, when $\nu = \pi$ and if $\tau \geq 16\rho(\nu, \epsilon)\beta_\delta$, we have that

$$L \leq 36 \log_2(4/\epsilon) \left( \frac{R_\nu(h^*)^2}{\epsilon^2} + 4 \right) \sup_{\xi \geq \epsilon} \theta(2R_\nu(h^*) + \xi, \nu)\, \beta_\delta,$$

where $\theta(u, \nu)$ is the disagreement coefficient, defined in Appendix E.

Note that if $\tau$ is sufficiently large then the labeled sample complexity we obtain, $\min_{\lambda \in \triangle_{\mathcal{X}}} \rho_\pi(\lambda, \epsilon)$, could be significantly smaller than previous results in the streaming setting, e.g., see [16]. The proof of Theorem 3 can be found in Appendix E.

4 Solving the Optimization Problem

Recall that in Algorithm 1, during round $\ell$, we need to solve optimization problem (3). Solving this optimization problem is not trivial because the number of variables can potentially be infinite if $\mathcal{X}$ is an infinite set. In this section, we will demonstrate how to reduce it to a finite-dimensional problem by considering its dual. To simplify the notation, let $\mathcal{Y}_\ell = \{z - z' : z, z' \in \mathcal{Z}_\ell, z \neq z'\}$, and rewrite the problem as follows, where $c_\ell > 0$ is a constant that may depend on the round $\ell$:

$$\min_P\ E_{X \sim \nu}[P(X)] \quad \text{subject to} \quad y^\top \big(E_{X \sim \nu}[P(X)XX^\top]\big)^{-1} y \leq c_\ell^2\ \ \forall y \in \mathcal{Y}_\ell, \qquad 0 \leq P(x) \leq 1\ \ \forall x \in \mathcal{X}. \qquad (4)$$
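When $\mathcal{X}$ is finite, problem (4) is a semidefinite program that an off-the-shelf solver can handle directly. The following sketch assumes the cvxpy package (function and variable names are ours); it encodes each inverse constraint via the Schur-complement equivalence derived in the next paragraph.

```python
import cvxpy as cp
import numpy as np

# Sketch: problem (4) for finite X as a semidefinite program (assumes cvxpy).
# Each constraint y^T E[P X X^T]^{-1} y <= c^2 is encoded via the Schur
# complement: it holds iff [[A, y], [y^T, c^2]] is PSD, where A = E[P X X^T].
def solve_design(X, nu, Y, c):
    """X: (n, d) support points; nu: (n,) probabilities; Y: list of numpy vectors."""
    n = X.shape[0]
    p = cp.Variable(n)                                 # p_i = P(x_i)
    A = sum(nu[i] * p[i] * np.outer(X[i], X[i]) for i in range(n))
    constraints = [p >= 0, p <= 1]
    for y in Y:
        M = cp.bmat([[A, y.reshape(-1, 1)],
                     [y.reshape(1, -1), np.array([[c ** 2]])]])
        # M is symmetric by construction; symmetrize explicitly so the
        # solver front end accepts the PSD constraint.
        constraints.append((M + M.T) / 2 >> 0)
    prob = cp.Problem(cp.Minimize(nu @ p), constraints)  # objective E_{X~nu}[P(X)]
    prob.solve(solver=cp.SCS)
    return p.value
```

This direct formulation scales poorly with $|\mathcal{X}|$, which is exactly why the dual, stochastic approach developed next is needed when $\mathcal{X}$ is large or infinite.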
Using the Schur complement technique, we show in Lemma 13 (Appendix C) the following equivalence:

$$y^\top \big(E_{X \sim \nu}[P(X)XX^\top]\big)^{-1} y \leq c_\ell^2 \iff E_{X \sim \nu}[P(X)XX^\top] \succeq c_\ell^{-2}\, yy^\top.$$

This transforms a constraint involving matrix inversion into one with an ordering between PSD matrices. Then, we remove the bound constraints $0 \leq P(x) \leq 1, \forall x \in \mathcal{X}$, by introducing the barrier function $-\log(1-x) - \log(x)$. That is, instead of working with the objective $E_{X \sim \nu}[P(X)]$ directly, we consider the following problem:

$$\min_P\ E_{X \sim \nu}\big[P(X) - \mu_b\big(\log(1 - P(X)) + \log(P(X))\big)\big] \quad \text{subject to} \quad E_{X \sim \nu}[P(X)XX^\top] \succeq c_\ell^{-2}\, yy^\top\ \ \forall y \in \mathcal{Y}_\ell. \qquad (5)$$

Here, $\mu_b \in (0, 1)$ is a small constant that controls how strong the barrier is. Intuitively, a smaller $\mu_b$ makes problem (5) closer to the original problem. We now show that, unlike the primal, the dual problem is indeed finite-dimensional. For each constraint $y \in \mathcal{Y}_\ell$, let the matrix $\Lambda_y \succeq 0$ be its dual variable. Further, let $\Lambda = \sum_{y \in \mathcal{Y}_\ell} \Lambda_y$ and $\vec{\Lambda} = (\Lambda_y)_{y \in \mathcal{Y}_\ell}$. The corresponding Lagrangian is

$$\mathcal{L}(\vec{\Lambda}, P) = E_{X \sim \nu}\Big[P(X) - \mu_b\big(\log(1 - P(X)) + \log(P(X))\big) - P(X)X^\top \Lambda X\Big] + \frac{1}{c_\ell^2} \sum_{y \in \mathcal{Y}_\ell} y^\top \Lambda_y y.$$

The dual problem is $\max_{\Lambda_y \succeq 0,\, y \in \mathcal{Y}_\ell} \min_P \mathcal{L}(\vec{\Lambda}, P)$. Notice that the minimization over $P : \mathcal{X} \to [0, 1]$ can be done by minimizing $P(x)$ point-wise for each $x \in \mathcal{X}$. To do this, we take the gradient with respect to each $P(x)$ and set it to zero to get

$$1 + \frac{\mu_b}{1 - P(x)} - \frac{\mu_b}{P(x)} - x^\top \Lambda x = 0. \qquad (6)$$

Solving this equation and defining $q_\Lambda(x) = x^\top \Lambda x - 1$, we get

$$P_\Lambda(x) = \frac{q_\Lambda(x) - 2\mu_b + \sqrt{(2\mu_b - q_\Lambda(x))^2 + 4\mu_b q_\Lambda(x)}}{2 q_\Lambda(x)}. \qquad (7)$$

Note that if $\mu_b = 0$ (no barrier), the above reduces to the threshold decision rule $P_\Lambda(x) = \frac{1}{2} + \frac{|q_\Lambda(x)|}{2 q_\Lambda(x)}$, which gives 0 when $q_\Lambda(x) < 0$ and 1 when $q_\Lambda(x) > 0$.² This is exactly the hard elliptical threshold rule mentioned before, in which whether to query the label for $x$ depends on whether it falls inside ($x^\top \Lambda x < 1$) or outside ($x^\top \Lambda x > 1$) the ellipsoid defined by the positive semidefinite matrix $\Lambda$. A visualization of the decision rule $P_\Lambda$ is given in Figure 2 in the Appendix.

Now, by plugging in $P_\Lambda(x)$, our dual problem becomes $\max_{\Lambda_y \succeq 0,\, y \in \mathcal{Y}_\ell} D(\vec{\Lambda}) := \mathcal{L}(\vec{\Lambda}, P_\Lambda)$. This is a finite-dimensional optimization problem, and it can be solved by projected gradient ascent (or projected stochastic gradient ascent when we only have samples from $\nu$). The gradient of $D(\vec{\Lambda})$ is

$$\nabla_{\Lambda_y} D(\vec{\Lambda}) = E_{X \sim \nu}\Big[\Big(1 + \frac{\mu_b}{1 - P_\Lambda(X)} - \frac{\mu_b}{P_\Lambda(X)} - X^\top \Lambda X\Big)\nabla_{\Lambda_y} P_\Lambda(X) - P_\Lambda(X)XX^\top\Big] + \frac{yy^\top}{c_\ell^2} = \frac{yy^\top}{c_\ell^2} - E_{X \sim \nu}\big[P_\Lambda(X)XX^\top\big],$$

where the first term vanishes since $P_\Lambda(X)$ solves Eq. (6). The procedure is summarized in Algorithm 2, in which the gradient at the $k$th iteration is replaced by its unbiased estimator $\frac{yy^\top}{c_\ell^2} - P_{\hat{\Lambda}^{(k)}}(x_k) x_k x_k^\top$. The adaptive learning rate is chosen following the discussion in Chapter 4 of [21]. Optimizing the assignment of $\hat{\Lambda}_y$ to each $y$ in line 10 ensures that the re-scaling step in line 11 increases the function value in an optimized way. Finally, the re-scaling step is used to ensure that the output primal objective value $E_{X \sim \nu}[P(X)]$ is well bounded, which is explained in more detail in Appendix C.

²When $q_\Lambda(x) = 0$, $P_\Lambda(x)$ is undetermined from the dual.

Algorithm 2 Projected Stochastic Gradient Ascent to Solve OPTIMIZEDESIGN
1: Input: number of iterations $K$; number of samples $u$; barrier weight $\mu_b \in (0, 1)$
2: Initialize $\hat{\Lambda}^{(0)}_y = 0$ for each $y \in \mathcal{Y}_\ell$
3: for $k = 0, 1, 2, \ldots, K-1$ do
4:   Sample $x_k \sim \nu$
5:   Set $g_{k,y} = \frac{yy^\top}{c_\ell^2} - P_{\hat{\Lambda}^{(k)}}(x_k) x_k x_k^\top$, where $P_\Lambda$ is defined in Eq. (7)
6:   Set $\hat{\Lambda}^{(k+1)}_y \leftarrow \hat{\Lambda}^{(k)}_y + \eta_k g_{k,y}$ for each $y \in \mathcal{Y}_\ell$, where $\eta_k = 1\big/\sqrt{\sum_{s \leq k} \sum_{y \in \mathcal{Y}_\ell} \|g_{s,y}\|_2^2}$
7:   Update $\hat{\Lambda}^{(k+1)}_y \leftarrow \Pi_{\mathbb{S}^d_+}(\hat{\Lambda}^{(k+1)}_y)$ for each $y \in \mathcal{Y}_\ell$, a projection onto the set of $d \times d$ PSD matrices
8: end for
9: Let $\hat{\Lambda}_y = \frac{1}{K}\sum_{k=1}^{K} \hat{\Lambda}^{(k)}_y$ for each $y \in \mathcal{Y}_\ell$ and $\hat{\Lambda} = \sum_{y \in \mathcal{Y}_\ell} \hat{\Lambda}_y$
10: Update $(\hat{\Lambda}_y)_{y \in \mathcal{Y}_\ell} \leftarrow \operatorname*{argmax}_{\vec{\Lambda}} \sum_{y \in \mathcal{Y}_\ell} y^\top \Lambda_y y$, subject to $\sum_{y \in \mathcal{Y}_\ell} \Lambda_y = \hat{\Lambda}$, $\Lambda_y \succeq 0\ \forall y \in \mathcal{Y}_\ell$
11: Find $s^* \in \operatorname*{argmax}_{s \in [0,1]} D_E(s\hat{\Lambda})$, where $D_E$ empirically evaluates $D$ using $u$ i.i.d. samples
12: return $\tilde{\Lambda} = s^* \hat{\Lambda}$

Let $\vec{\Lambda}^*$ be an optimal solution for $D(\vec{\Lambda})$. Intuitively, as long as we run this algorithm with a sufficiently large number of iterations $K$ and number of samples $u$, we can guarantee that $D(\tilde{\Lambda})$ and $D(\vec{\Lambda}^*)$ are close with high probability, which in turn guarantees that the primal constraints are violated by only a tiny amount and that $E_{X \sim \nu}[P_{\tilde{\Lambda}}(X)]$ is close to the optimal value. Specifically, we can prove the following theorem.

Theorem 4. Suppose $\|x\|_2 \leq M$ for any $x \in \mathrm{supp}(\nu)$ and that $\Sigma = E_{X \sim \nu}[XX^\top]$ is invertible. Let $\vec{\Lambda}^* \in \operatorname*{argmax}_{\Lambda_y \succeq 0,\, y \in \mathcal{Y}_\ell} D(\vec{\Lambda})$ and let $\kappa(\Sigma) = \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)$ be the condition number of $\Sigma$. Assume $\|\Lambda^*\|_F > 0$ and define $\omega = \min_{\Gamma \in \mathbb{S}^d : \|\Gamma\|_F = 1} E_{X \sim \nu}\big[(X^\top \Gamma X)^2\big]$, where $\mathbb{S}^d$ is the set of $d \times d$ symmetric matrices. Then $\Lambda^* = \sum_{y \in \mathcal{Y}_\ell} \Lambda^*_y$ is unique. Further, for any $\epsilon > 0$ and $\delta > 0$, if it holds that

$$\mu_b \leq O\Big(\sqrt{\|\Lambda^*\|_F}\, \kappa(\Sigma) M \sqrt{(1+\epsilon)/\epsilon}\Big), \qquad K \geq \frac{|\mathcal{Y}_\ell|^3 \kappa(\Sigma)^2 \|\Lambda^*\|_F^8 M^{16} \log(1/\delta)}{\omega^2 \mu_b^6}, \qquad u \geq \frac{\kappa(\Sigma)^2 \|\Lambda^*\|_F^6 M^{16} \log(1/\delta)}{\omega^2 \mu_b^6},$$

then, with probability at least $1-\delta$, Algorithm 2 will output $\tilde{\Lambda}$ that satisfies

$$y^\top \big(E_{X \sim \nu}[P_{\tilde{\Lambda}}(X)XX^\top]\big)^{-1} y \leq (1+\epsilon)c_\ell^2\ \ \forall y \in \mathcal{Y}_\ell, \qquad E_{X \sim \nu}[P_{\tilde{\Lambda}}(X)] \leq E_{X \sim \nu}[\tilde{P}(X)] + 4\mu_b,$$

where $\tilde{P}$ is the optimal solution to problem (4) with the bound constraint replaced by $0 \leq P(x) \leq 1 - \mu_b, \forall x \in \mathcal{X}$.

The proof is in Appendix C. Although $\tilde{P}$ is not exactly the same as the optimal solution of the original problem (4), when $\mu_b$ is sufficiently small they will be very close. Meanwhile, it should be noted that Theorem 4 mainly reveals that with a sufficiently large number of iterations and number of samples, Algorithm 2 can output a sufficiently good solution. In future work, we plan to examine how much this bound can be improved via a tighter analysis. Finally, notice that Algorithm 2 needs to maintain $|\mathcal{Y}_\ell| \cdot d^2 = O(|\mathcal{Z}_\ell|^2 d^2)$ variables, which can be large when we have a large set $\mathcal{Z}_\ell$. Therefore, as an alternative, we also propose Algorithm 3, which only needs to maintain $d^2$ variables but requires more computation in each iteration. The details are given in Appendix C.

5 Empirical results

In this section we present a benchmark experiment validating the fundamental trade-offs that are theoretically characterized in Theorem 1 and Theorem 2. We take inspiration from [24] to define our experimental protocol: $d = 2$, a two-dimensional problem; $\mathcal{Z} = \{e_1, e_2, (\cos(\omega), \sin(\omega))\}$ for $\omega = 0.3$, where $e_1, e_2$ are the canonical basis vectors; $\theta^* = 2e_1$ and $y = x^\top \theta^* + \eta$, where $\eta \sim N(0, 1)$. The distribution $\nu$ for the streaming measurements $x_t \overset{\text{i.i.d.}}{\sim} \nu$ is such that $x_t = (\cos(2I_t\pi/N), \sin(2I_t\pi/N))$ where $I_t \in \{0, \ldots, N-1\}$, $P(I_t = i) \propto \cos(2i\pi/N)^2$, and $N = 30$. In this problem, the angle $\omega$ is small enough that the item $(\cos(\omega), \sin(\omega))$ is hard to discriminate from the best item $e_1$. As argued in [24], an efficient sampling strategy for this problem instance would be to pull arms in the direction of $e_2$ in order to reduce the uncertainty in the direction of interest, $e_1 - (\cos(\omega), \sin(\omega))$. However, the distribution $\nu$ is defined such that it is more likely to produce a vector $x_t$ in the direction of $e_1$ rather than $e_2$. Thus, if one seeks a small label complexity, then $P$ should be taken to reject measurements in the direction of $e_1$. In the benchmark experiment, we compare the following three algorithms, which all use Algorithm 1 as a meta-algorithm and just swap out the definition of $\hat{P}_\ell$: the Naive Algorithm uses no selective sampling, so that $\hat{P}_\ell(x) = 1$ for all $x$; the Oracle Algorithm uses $\hat{P}_\ell = P^*$, where $P^*$ is the ideal solution to (2); and Our Algorithm uses the solution to (5) for $\hat{P}_\ell$, where we take $\mu_b = 2 \times 10^{-5}$.
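For concreteness, here is a minimal numpy sketch of this benchmark instance together with the barrier-softened query rule of Eq. (7); the function names are ours, not from the paper's code.

```python
import numpy as np

# Sketch of the Section 5 benchmark instance.
rng = np.random.default_rng(0)
N, omega = 30, 0.3
Z = np.array([[1.0, 0.0], [0.0, 1.0], [np.cos(omega), np.sin(omega)]])
theta_star = np.array([2.0, 0.0])                      # theta* = 2 e_1

angles = 2.0 * np.pi * np.arange(N) / N
support = np.stack([np.cos(angles), np.sin(angles)], axis=1)
probs = np.cos(angles) ** 2
probs /= probs.sum()                                   # P(I = i) ∝ cos(2iπ/N)^2

def sample_x():
    """Draw x_t ~ nu: mass concentrates near ±e_1, rarely near e_2."""
    return support[rng.choice(N, p=probs)]

def label(x):
    """y = x^T theta* + eta, eta ~ N(0, 1)."""
    return float(x @ theta_star) + rng.standard_normal()

def P_soft(x, Lam, mu_b=2e-5):
    """Barrier-softened query rule of Eq. (7); mu_b -> 0 gives 1{x^T Lam x >= 1}."""
    q = float(x @ Lam @ x) - 1.0
    if q == 0.0:
        return 0.5       # undetermined from the dual; any value in [0,1] works
    return (q - 2 * mu_b + np.sqrt((2 * mu_b - q) ** 2 + 4 * mu_b * q)) / (2 * q)
```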
We swept over values of $\tau$ and plotted the amount of labeled data needed before termination on the y-axis, as shown in Figure 1.

Figure 1: (left) For each value of $\tau$, we plot the average label complexity over 50 repeated trials. (middle) Visualization of $P^*(x)$ and $\nu(x)$ vs. $x$, where $x$ is indexed by $I$ such that $x_I = (\cos(2I\pi/N), \sin(2I\pi/N))$. Here, $P^*$ is solved with $\tau = 4 \times 10^5$ and the distribution $\nu$ is not normalized. (right) A heat map of $P^*(x)$ along with the setting of the experimental protocol.

We observe in Figure 1 that the algorithms using non-naive selection rules require far fewer labels than the naive algorithm for all $\tau$. This reflects the intuition that selection strategies that focus on requesting the more informative streaming measurements are much more efficient than naively observing every streaming measurement. Meanwhile, the trade-off between label complexity $L$ and sample complexity $U$ characterized in Theorem 1 and Theorem 2 is precisely illustrated in Figure 1. Indeed, we see the number of labels queried by the two selective sampling algorithms decrease as the number of unlabeled data points seen in each round increases.

6 Conclusion

In this paper, we proposed a new approach for the important problem of selective sampling for best-arm identification. We provide a lower bound that quantifies the trade-off between labeled samples and stopping time and also present an algorithm that nearly achieves the minimal label complexity given a desired stopping time. One of the main limitations of this work is that our approach depends on a well-specified model following stationary stochastic assumptions. In practice, dependencies over time and model mismatch are common. Utilizing the proposed algorithm outside of our assumptions may lead to poor performance and unexpected behavior with adverse consequences. While negative results justify some of the most critical assumptions we make (e.g., allowing the stream $x_t$ to be arbitrary, rather than iid, can lead to trivial algorithms; see Theorem 7 of [7]), exploring what theoretical guarantees are possible under relaxed assumptions is an important topic of future work.

Acknowledgements

We sincerely thank Chunlin Sun for the insightful discussion on the alternative approach to the optimal design. This work was supported in part by the NSF TRIPODS II grant DMS 2023166, NSF TRIPODS CCF 1740551, NSF CCF 2007036 and NSF TRIPODS+X DMS 1839371.

References

[1] Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. In International Conference on Machine Learning, pages 1220-1228. PMLR, 2013.
[2] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.
[3] Dimitri P. Bertsekas. Convex Optimization Theory. Athena Scientific, Belmont, MA, 2009.
[4] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19-26. JMLR Workshop and Conference Proceedings, 2011.
[5] Romain Camilleri, Julian Katz-Samuels, and Kevin Jamieson. High-dimensional experimental design and kernel bandits, 2021.
[6] Nicolò Cesa-Bianchi, Claudio Gentile, and Francesco Orabona. Robust bounds for classification via selective sampling. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 121-128, 2009.
[7] Yining Chen, Haipeng Luo, Tengyu Ma, and Chicheng Zhang. Active online learning with hidden shifting domains. In International Conference on Artificial Intelligence and Statistics, pages 2053-2061. PMLR, 2021.
[8] S. Dasgupta, D. J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. Advances in Neural Information Processing Systems, 2008.
[9] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. The Journal of Machine Learning Research, 13(1):2655-2697, 2012.
[10] Reza Eghbali, James Saunderson, and Maryam Fazel. Competitive online algorithms for resource allocation over the positive semidefinite cone. Mathematical Programming, 170(1):267-292, 2018.
[11] Tanner Fiez, Lalit Jain, Kevin Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. arXiv preprint arXiv:1906.08399, 2019.
[12] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.
[13] Steve Hanneke and Liu Yang. Toward a general theory of online selective sampling: Trading off mistakes and queries. In International Conference on Artificial Intelligence and Statistics, pages 3997-4005. PMLR, 2021.
[14] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.
[15] Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. arXiv preprint arXiv:1506.08669, 2015.
[16] Julian Katz-Samuels, Jifan Zhang, Lalit Jain, and Kevin Jamieson. Improved algorithms for agnostic pool-based active classification, 2021.
[17] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1-42, 2016.
[18] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145-1190, 2019.
[19] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[20] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[21] Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
[22] Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pages 27-32, 2010.
[23] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L. DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97-112. Springer, 2011.
[24] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. arXiv preprint arXiv:1409.6110, 2014.
[25] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, 2011.
[26] Min Xiao and Yuhong Guo. Online active learning for cost sensitive domain adaptation, 2013.