# Finding All ε-Good Arms in Stochastic Bandits

Blake Mason, University of Wisconsin, Madison, WI 53706, bmason3@wisc.edu
Lalit Jain, University of Washington, Seattle, WA 98115, lalitj@uw.edu
Ardhendu Tripathy, University of Wisconsin, Madison, WI 53706, astripathy@wisc.edu
Robert Nowak, University of Wisconsin, Madison, WI 53706, rdnowak@wisc.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

The pure-exploration problem in stochastic multi-armed bandits aims to find one or more arms with the largest (or near largest) means. Examples include finding an ε-good arm, best-arm identification, top-k arm identification, and finding all arms with means above a specified threshold. However, the problem of finding all ε-good arms has been overlooked in past work, although arguably it may be the most natural objective in many applications. For example, a virologist may conduct preliminary laboratory experiments on a large candidate set of treatments and move all ε-good treatments into more expensive clinical trials. Since the ultimate clinical efficacy is uncertain, it is important to identify all ε-good candidates. Mathematically, the all-ε-good arm identification problem presents significant new challenges and surprises that do not arise in the pure-exploration objectives studied in the past. We introduce two algorithms to overcome these challenges and demonstrate their strong empirical performance on a large-scale crowd-sourced dataset of 2.2M ratings collected by the New Yorker Caption Contest, as well as a dataset testing hundreds of possible cancer drugs.

1 Introduction

We propose a new multi-armed bandit problem in which the objective is to return all arms that are ε-good relative to the best arm. Concretely, if the arms have means µ1, …, µn with µ1 = max_{1≤i≤n} µi, the goal is to return the set {i : µi ≥ µ1 − ε} in the additive case, and {i : µi ≥ (1 − ε)µ1} in the multiplicative case. The ALL-ε problem is a novel setting in the bandits literature, adjacent to two other methods for finding many good arms: TOP-k, where the goal is to return the arms with the k highest means, and threshold bandits, where the goal is to identify all arms above a fixed threshold. Building on a metaphor given by [1], if TOP-k is a contest and thresholding bandits is an exam, ALL-ε organically decides which arms are above a bar set relative to the highest score. We argue that the ALL-ε problem formulation is more appropriate in many applications, and we show that it presents some unique challenges that make its solution distinct from TOP-k and threshold bandits.

A Natural and Robust Objective. A motivating example is drug discovery, where pharmacologists want to identify a set of highly potent drug candidates from potentially millions of compounds using various in vitro and in silico assays, and only the selected compounds undergo more expensive testing [2]. Since performing the assays can be costly, one would like to use an adaptive, sequential experiment design that requires fewer experiments than a fixed experiment design. In sequential experiment design, it is important to fix the objective at the beginning, as that choice affects the experimentation process. Both the objective of finding the top-k performing drugs and that of finding all drugs above a threshold can result in failure. In TOP-k, choosing k too small may miss potent compounds, and choosing k too large may yield many ineffective compounds and require an excessively large number of experiments.
Setting a threshold suffers from the same issues, with the additional concern that if the threshold is set too high, potentially no drug discoveries are made. In contrast, the ALL-ε objective of finding all arms whose potency is within, say, 20% of the best avoids these concerns by giving a robust and natural guarantee: no significantly suboptimal arms will be returned, and every near-optimal arm will be discovered. We emphasize that unlike TOP-k or thresholding, which require some prior knowledge about the distribution of arms to guarantee a good set of returned arms, choosing arms relative to the best is a natural, distribution-free criterion for finding good arms.

As an example, we consider the New Yorker Cartoon Caption Contest (NYCCC). Each week, contestants submit thousands of supposedly funny captions for a cartoon (see Appendix A), which are rated from 1 (unfunny) to 3 (funny) through a crowdsourcing process. The New Yorker editors select final winners from a set with the highest average crowd ratings (typically over 1 million ratings per contest).

[Figure 1: Mean ratings from contests 627, 651, 690]

The number of truly funny captions varies from week to week, and this makes setting a choice of k or a fixed threshold difficult. In Figure 1, we plot the distribution of ratings from 3 different contests. Horizontal lines depict a reasonable threshold of 0.8µ1 in each, and vertical lines show the number of arms that exceed this threshold. Both of these quantities vary across weeks, and the differences can be stark. In contest 627, only k = 27 arms are within 20% of µ1, but k = 748 are in contest 651. Additionally, a fixed threshold of 1.5 admits captions within 30% of the best in contest 627, but only those within 15% of the best in contest 651. These examples show that it would be imprudent, and indeed incorrect, to choose a value of k or a threshold based on past contests; the far more principled decision is to optimize for the objective of finding the captions that are within a given percentage of the best every week.

Though the ALL-ε objective is natural and easy to state, it has not been studied in the literature. As we will show, admitting arms relative to the best makes the ALL-ε problem inherently more challenging than either TOP-k or thresholding. In particular, it is not easily possible to adapt TOP-k or thresholding algorithms to achieve the instance-dependent lower bound for ALL-ε. In this work, we provide a careful investigation of the ALL-ε problem, including theoretical and empirical guarantees.

1.1 Problem Statement and Notation

Fix ε > 0 and a failure probability δ > 0. Let ρ := {ρ1, …, ρn} be an instance of n distributions (or arms), each 1-sub-Gaussian with unknown means µ1 ≥ µ2 ≥ ⋯ ≥ µn. We now formally define our notions of additive and multiplicative ε-good arms.

Definition 1 (additive ε-good). For a given ε > 0, arm i is additive ε-good if µi ≥ µ1 − ε.

Definition 2 (multiplicative ε-good). For a given ε > 0, arm i is multiplicative ε-good if µi ≥ (1 − ε)µ1.

Additionally, we define the sets

G_ε(ρ) := {i : µi ≥ µ1 − ε} and M_ε(ρ) := {i : µi ≥ (1 − ε)µ1}   (1)

to be the sets of additive and multiplicative ε-good arms respectively. Where clear, we write G = G_ε(ρ) and M = M_ε(ρ). Consider an algorithm that at each time s selects an arm I_s ∈ [n] based on the history F_{s−1} = σ(I1, X1, …, I_{s−1}, X_{s−1}) and observes a reward X_s drawn i.i.d. from ρ_{I_s}. The objective of the algorithm is to return G_ε or M_ε using as few total samples as possible.
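To make Definitions 1 and 2 concrete, the following minimal Python sketch (our illustration, not part of the original paper) computes the target sets G_ε(ρ) and M_ε(ρ) when the true means are known, e.g., to form ground truth when evaluating algorithms on simulated instances; the function name is ours.

```python
import numpy as np

def good_sets(mu, eps):
    """Return (G_eps, M_eps): indices of additive and multiplicative
    eps-good arms for a vector of true means mu (Definitions 1 and 2)."""
    mu = np.asarray(mu, dtype=float)
    mu1 = mu.max()                               # mean of the best arm
    G = np.flatnonzero(mu >= mu1 - eps)          # additive: mu_i >= mu_1 - eps
    M = np.flatnonzero(mu >= (1.0 - eps) * mu1)  # multiplicative: mu_i >= (1 - eps) * mu_1
    return G, M

# Example: with eps = 0.1, arms 0 and 1 are eps-good under both notions.
G, M = good_sets([1.0, 0.95, 0.5], eps=0.1)
```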
Definition 3 (ALL-ε problem). An algorithm for the ALL-ε problem is δ-PAC if (a) the algorithm has a finite stopping time τ with respect to (F_t), and (b) at time τ it recommends a set Ĝ such that with probability at least 1 − δ, Ĝ = G_ε in the additive case, or Ĝ = M_ε in the multiplicative case.

Notation: For any arm i ∈ [n], let µ̂i(t) denote its empirical mean after t pulls. For all i ∈ [n], define the suboptimality gap ∆i := µ1 − µi. Without loss of generality, we denote k = |G_ε| (resp. k = |M_ε|). Throughout, we keep track of the quantity α_ε := min_{i∈G_ε} µi − (µ1 − ε), which is the distance from the smallest additive ε-good arm, denoted µk, to the threshold µ1 − ε. Additionally, if G_ε^c is non-empty, we consider β_ε := min_{i∈G_ε^c} (µ1 − ε) − µi, the distance of the largest arm that is not additive ε-good, denoted µ_{k+1}, to the threshold. Analogously, in the case of returning multiplicative ε-good arms, we define α_ε := min_{i∈M_ε} µi − (1 − ε)µ1 and β_ε := min_{i∈M_ε^c} (1 − ε)µ1 − µi, the smallest distances of arms in M_ε and M_ε^c to (1 − ε)µ1, with µk and µ_{k+1} defined accordingly.

For our sample complexity results, we also consider a relaxed version of the ALL-ε problem where, for a user-given slack γ ≥ 0, we allow the algorithm to return a set Ĝ that satisfies G_ε ⊆ Ĝ ⊆ G_{ε+γ} in the additive case, or M_ε ⊆ Ĝ ⊆ M_{ε+γ} in the multiplicative case. As we will see, this prevents large or potentially unbounded sample complexities when arm means are very close to or equal to µ1 − ε.

1.2 Contributions and Summary of Main Results

[Figure 2: An example instance]

In this paper we propose the new problem of finding all ε-good arms and give a precise characterization of its complexity. Our contribution is threefold:

- Information-theoretic lower bounds for the ALL-ε problem.
- A novel algorithm, (ST)², that is nearly optimal, is easy to implement, and has excellent empirical performance on real-world data.
- An instance-optimal algorithm, FAREAST.

We now summarize our results in the additive setting (the multiplicative setting is analogous).

Lower Bound and Algorithms. As a preview of our results, we highlight the impact of three key quantities that affect the sample complexity: the user-provided ε and the instance-dependent quantities α_ε and β_ε (see Figure 2). In this case, Theorem 2.1 implies that any δ-PAC algorithm requires an expected number of samples exceeding

c Σ_{i=1}^n max{ 1/(µ1 − ε − µi)², 1/(µ1 + α_ε − µi)² } log(1/δ).

We provide two algorithms, (ST)² and FAREAST, for the ALL-ε problem. Our starting point, (ST)², is a novel combination of UCB [3] and LUCB [4]; it is easy to implement and has good empirical performance. (ST)² is nearly optimal, but in some instances it does not achieve the lower bound. To close this gap, we provide an instance-optimal algorithm, FAREAST, which achieves the lower bound but suffers from larger constants and is not always better in practical applications.

To highlight the difficulty of developing optimal algorithms for the ALL-ε problem, we briefly discuss a naive elimination approach that uniformly samples all arms and eliminates an arm once it is known to be above or below µ1 − ε and not the best arm. Intuitively, such an algorithm keeps pulling arms until µ1 is estimated to an accuracy of O(min(α_ε, β_ε)), which is needed to resolve the arms around the threshold (see Figure 2). An elimination algorithm pays a high cost of exploration, potentially over-pulling arms close to µ1 compared to the lower bound, until µ1 is estimated sufficiently well. Our algorithm FAREAST provides a novel approach that overcomes these issues. However, as we will show in Section 4, in certain instances a dependence on Σ_{i=1}^n 1/(µ1 + β_ε − µi)² is present in moderate confidence, i.e., it is not multiplied by log(1/δ), unlike the lower bound in equation (2), and it becomes negligible compared to other terms as δ → 0.
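The quantities α_ε and β_ε and the lower-bound preview above are simple functions of the instance. The sketch below (again our illustration, with a function name of our own) computes them for a given vector of means, assuming no arm lies exactly at the threshold µ1 − ε; otherwise individual terms are infinite, which is exactly the situation the slack γ guards against.

```python
import numpy as np

def instance_hardness(mu, eps, delta):
    """Compute alpha_eps, beta_eps, and the additive lower-bound sum of
    Theorem 2.1 (up to its universal constant c)."""
    mu = np.asarray(mu, dtype=float)
    mu1 = mu.max()
    thresh = mu1 - eps                                 # the unknown threshold mu_1 - eps
    good, bad = mu[mu >= thresh], mu[mu < thresh]
    alpha = good.min() - thresh                        # margin of the worst eps-good arm
    beta = thresh - bad.max() if bad.size else np.inf  # margin of the best non-eps-good arm
    lb = np.maximum((mu1 - eps - mu) ** -2,
                    (mu1 + alpha - mu) ** -2).sum() * np.log(1.0 / delta)
    return alpha, beta, lb

# e.g. instance_hardness([1.0, 0.97, 0.5], eps=0.1, delta=0.05)
```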
Empirical results. We demonstrate the empirical success of (ST)² on a real-world dataset of 9,250 captions from the NYCCC. In Figure 4a, we compare (ST)² to other methods that have been used to run this contest. We show that (ST)² is better able to detect which arms have means within 10% of the best. The plot demonstrates the sub-optimality of using existing sampling schemes such as UCB or LUCB with an incorrect k for the ALL-ε problem, providing additional empirical validation for the study in this paper.

1.3 Connections to prior art

Our problem is related to several prior pure-exploration settings in the multi-armed bandit literature, including TOP-k bandits and threshold bandits.

TOP-k. In the TOP-k problem, the goal is to identify the set {µ1, …, µk} with probability greater than 1 − δ [4–9]. The ALL-ε problem reduces to the TOP-k problem with k = |G_ε| when |G_ε| is known. In particular, lower bounds for the TOP-k problem apply to our setting. A lower bound (with precise logarithmic factors) given in [9] is

Σ_{i=1}^k 1/(µi − µ_{k+1})² log((n − k)/δ) + Σ_{i=k+1}^n 1/(µi − µk)² log(k/δ).

In general, this is smaller than our lower bound in Theorem 2.1 since µ_{k+1} ≤ µ1 − ε ≤ µk. A particular case of this problem is best-arm identification, where k = 1. Approximate versions of the TOP-k problem have also been considered, where the goal is to return a set of arms S with |S| = k such that, with probability greater than 1 − δ, each i ∈ S satisfies µi ≥ µk − ε [4, 10]. In the case k = 1, this is also known as the problem of identifying a (single) ε-good arm [4, 7, 9–17], which has received a large amount of interest. If |G_ε| = k, [6] demonstrate a lower bound of Ω((k/ε² + Σ_{i=k+1}^n 1/(µ1 − µi)²) log(1/δ)) samples in expectation to find such an arm, and [10] provide an algorithm that matches this up to doubly logarithmic factors, though methods such as [4, 9, 18, 19] achieve better empirical performance. A particular instance of interest is when it is known that one arm has mean ε and the rest have mean zero. In this setting, [11] show a lower bound on the sample complexity of Ω(n/ε² + log(1/δ)/ε²), highlighting that the dependence on n only occurs in moderate confidence, i.e., for a fixed value of δ. They also provide a matching upper bound that motivates our procedure in FAREAST. Finally, [15] considers the unverifiable regime where there are potentially many ε-good arms. In such cases, sample-efficient algorithms exist that return an ε-good arm with high probability, but verifying that it is ε-good requires far more samples. Extending these ideas to the ALL-ε setting is a goal of future work.

Threshold Bandits. In the threshold bandit problem, we are given a threshold τ and the goal is to identify the set of arms whose means are greater than τ [1, 20]. If the value of µ1 were known, the ALL-ε problem would reduce to a threshold bandit with τ = µ1 − ε. A naive sequential sampling scheme that stops sampling an arm when its upper or lower confidence bound clears the threshold has sample complexity O(Σ_{i=1}^n 1/(µi − τ)² log(n/δ)). Up to factors of log(n), this can also be shown to be a lower bound for threshold bandits, and as a result it is bounded above by the bound of Theorem 2.1. Hence, ALL-ε is intrinsically more difficult than threshold bandits.
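To make the naive scheme above concrete, here is a minimal sketch, assuming the threshold τ is known exactly; `pull` and the particular anytime width are our own stand-ins, and an arm lying exactly at τ would never resolve (compare the slack γ of Section 1.1).

```python
import numpy as np

def naive_threshold_bandit(pull, n, tau, delta):
    """Sample each arm until its confidence interval clears the known
    threshold tau; return the arms confidently above it.
    `pull(i)` draws one reward from 1-sub-Gaussian arm i."""
    # One valid anytime width: a union bound over all arms and times.
    width = lambda t: np.sqrt(2.0 * np.log(4.0 * n * t**2 / delta) / t)
    sums, counts = np.zeros(n), np.zeros(n, dtype=int)
    active, above = set(range(n)), []
    while active:
        for i in list(active):
            sums[i] += pull(i)
            counts[i] += 1
            mu_hat, c = sums[i] / counts[i], width(counts[i])
            if mu_hat - c > tau:        # confidently above the threshold
                above.append(i)
                active.remove(i)
            elif mu_hat + c < tau:      # confidently below the threshold
                active.remove(i)
    return above

# Usage: means = [0.9, 0.5, 0.1]
# naive_threshold_bandit(lambda i: np.random.randn() + means[i], 3, tau=0.6, delta=0.05)
```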
A naive approach to the ALL-ε problem is to first identify the index and mean of the best arm using a best-arm identification algorithm and then use it to build an estimate of the threshold µ1 − ε. In general, this two-step procedure is sub-optimal: if there are many arms close to the best arm, identifying the best arm is both unnecessary and expends unnecessary samples. In the fixed confidence setting, threshold bandits are closely related to multiple hypothesis testing, and recent work [21] achieves tight upper and lower bounds for this problem, including tighter logarithmic factors similar to those for TOP-k. If µ1 is known, the additive ALL-ε problem reduces to the FWER (family-wise error rate) and FWPD (family-wise probability of detection) setting in [21]. Finally, in the fixed budget setting, [1] proposes an optimal anytime method, APT, whose sampling strategy we use as a comparison in Section 5.

2 Lower Bound

Theorem 2.1 (additive and multiplicative lower bounds). Fix δ, ε > 0. Consider n arms, where the ith is distributed according to N(µi, 1). Any δ-PAC algorithm for the additive setting satisfies

E[T] ≥ c Σ_{i=1}^n max{ 1/(µ1 − ε − µi)², 1/(µ1 + α_ε − µi)² } log(1/δ),   (2)

and if µ1 > 0, any δ-PAC algorithm for the multiplicative setting satisfies

E[T] ≥ c Σ_{i=1}^n max{ 1/((1 − ε)µ1 − µi)², 1/(µ1 + α_ε/(1 − ε) − µi)² } log(1/δ).

The bounds are different but share a common interpretation. Consider the additive case. First, every arm must be sampled a number of times inversely proportional to its squared distance to µ1 − ε. In a manner similar to thresholding [1], even if µ1 were known, this many samples are necessary to decide whether an arm's mean is above or below that quantity. This gives the first term in the max{·, ·}. The second term in the max{·, ·} states that every arm must be sampled a number of times inversely proportional to its squared distance to µ1 + α_ε. Recall that α_ε = µk − (µ1 − ε) is the margin by which arm k is good. Hence, to verify that k ∈ G_ε, it is also necessary to confirm that all means are below µk + ε = µ1 + α_ε, since any arm with mean above this value would imply that k is bad. This reflects the necessity of estimating the threshold and leads to the second term. For arms in G_ε^c, comparing against µ1 − ε is always more difficult, but for arms in G_ε, either constraint may be the more challenging one to certify. We state the bound for Gaussian distributions, but the same technique can be used to prove equivalent results for other distributions. Lastly, we note that it is possible to prove bounds with tighter logarithmic terms. For an instance where O(n^φ) arms have mean 2ε for some φ ∈ (0, 1) and the remaining arms have mean 0, Theorem 1 of [22] suggests that Ω(n/ε² log(n/δ)) samples are necessary, exceeding the above bounds by a factor of log(n).

3 An Optimism Algorithm for ALL-ε

We propose Algorithm 1, called (ST)² (Sample the Threshold, Split the Threshold), to return a set containing all ε-good arms and none worse than (ε + γ)-good with probability 1 − δ. Intuitively, (ST)² runs UCB and LUCB1 in parallel. At all times, (ST)² pulls three arms. We pull the arm with the highest upper confidence bound, as in the UCB algorithm [3], to refine an estimate of the threshold using the highest empirical mean (Sample the Threshold). Using the empirical estimate of the threshold, we pull one arm above it and one arm below it whose confidence bounds cross it, similar to LUCB1 [4] (Split the Threshold). From these bounds, (ST)² forms upper and lower bounds on the true threshold, i.e., µ1 − ε (resp. (1 − ε)µ1), and terminates when it can declare that every arm is either in G_{ε+γ} or in G_ε^c.

To do so, (ST)² maintains anytime confidence widths C_{δ/n}(t) such that, for an empirical mean µ̂i(t) of t samples, P(∪_{t=1}^∞ {|µ̂i(t) − µi| > C_{δ/n}(t)}) ≤ δ/n. In this work we take C_δ(t) = √(c_φ log(log₂(2t)/δ)/t) for a constant c_φ. It suffices to take c_φ = 4, though tighter bounds are known and should be used in practice, e.g., [6, 23, 24].
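This width is straightforward to implement; below is a minimal sketch with the conservative constant c_φ = 4 mentioned above (in (ST)², each arm uses failure probability δ/n).

```python
import math

def anytime_width(t, delta, c_phi=4.0):
    """Anytime confidence width C_delta(t) = sqrt(c_phi * log(log2(2t)/delta) / t):
    with probability at least 1 - delta, the empirical mean of a 1-sub-Gaussian
    arm satisfies |mu_hat(t) - mu| <= C_delta(t) simultaneously for all t >= 1."""
    return math.sqrt(c_phi * math.log(math.log2(2.0 * t) / delta) / t)

# In (ST)^2: C = anytime_width(T_i, delta / n)
```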
Algorithm 1 (ST)²: Sample the Threshold, Split the Threshold
Require: ε, δ > 0, γ ≥ 0, instance ρ
1: Pull each arm once; initialize Ti ← 1 and update µ̂i for each i ∈ {1, 2, …, n}
2: Empirically good arms: Ĝ = {i : µ̂i ≥ max_j µ̂j − ε} (additive), Ĝ = {i : µ̂i ≥ (1 − ε) max_j µ̂j} (multiplicative)
3: Additive: U_t = max_j {µ̂j(Tj) + C_{δ/n}(Tj)} − ε − γ and L_t = max_j {µ̂j(Tj) − C_{δ/n}(Tj)} − ε
4: Multiplicative: U_t = (1 − ε − γ) max_j {µ̂j(Tj) + C_{δ/n}(Tj)} and L_t = (1 − ε) max_j {µ̂j(Tj) − C_{δ/n}(Tj)}
5: Known arms: K = {i : µ̂i(Ti) + C_{δ/n}(Ti) < L_t or µ̂i(Ti) − C_{δ/n}(Ti) > U_t}
6: while K ≠ [n] do
7:   Pull arm i1(t) = argmin_{i∈Ĝ\K} µ̂i(Ti) − C_{δ/n}(Ti); update T_{i1}, µ̂_{i1}
8:   Pull arm i2(t) = argmax_{i∈Ĝ^c\K} µ̂i(Ti) + C_{δ/n}(Ti); update T_{i2}, µ̂_{i2}
9:   Pull arm i*(t) = argmax_i µ̂i(Ti) + C_{δ/n}(Ti); update T_{i*}, µ̂_{i*}
10:  Update the bounds L_t, U_t and the sets Ĝ, K
11: return the set of good arms {i : µ̂i(Ti) − C_{δ/n}(Ti) > U_t}
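For intuition, the additive case of Algorithm 1 can be rendered in Python roughly as follows. This is a simplified sketch of our own, not the authors' reference implementation: it inlines the anytime width from above and omits some of the bookkeeping in the pseudocode.

```python
import math
import numpy as np

def st2_additive(pull, n, eps, delta, gamma=0.0):
    """Sketch of the (ST)^2 loop (Algorithm 1, additive case).
    `pull(i)` draws one reward from 1-sub-Gaussian arm i."""
    width = lambda t: math.sqrt(4.0 * math.log(math.log2(2.0 * t) / (delta / n)) / t)
    sums = np.zeros(n)
    T = np.zeros(n, dtype=int)

    def sample(i):
        sums[i] += pull(i)
        T[i] += 1

    for i in range(n):                        # pull each arm once (line 1)
        sample(i)
    while True:
        mu = sums / T
        C = np.array([width(t) for t in T])
        ucb, lcb = mu + C, mu - C
        U = ucb.max() - eps - gamma           # upper bound on the relaxed threshold
        L = lcb.max() - eps                   # lower bound on the threshold mu_1 - eps
        known = (ucb < L) | (lcb > U)         # arms resolved either way
        if known.all():
            return np.flatnonzero(lcb > U)    # declared good arms
        good = mu >= mu.max() - eps           # empirically good arms
        lo = np.flatnonzero(good & ~known)    # Split the Threshold: uncertain good arm...
        hi = np.flatnonzero(~good & ~known)   # ...and uncertain bad arm
        if lo.size:
            sample(lo[np.argmin(lcb[lo])])
        if hi.size:
            sample(hi[np.argmax(ucb[hi])])
        sample(int(np.argmax(ucb)))           # Sample the Threshold: UCB pull
```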
3.1 Theoretical guarantees

Next we present a pair of theorems on the sample complexity of (ST)². For clarity, we omit doubly logarithmic terms and defer the full statements to Appendix B. Below we denote a ∧ b := min{a, b}.

Theorem 3.1 (Additive Case). Fix ε > 0, 0 < δ ≤ 1/2, γ ≤ 16, and an instance such that max(∆i, |ε − ∆i|) ≤ 8 for all i. With probability at least 1 − δ, there is a constant c1 such that (ST)² returns a set Ĝ with G_ε ⊆ Ĝ ⊆ G_{ε+γ} in at most

c1 Σ_{i=1}^n [ max{ 1/(µ1 − ε − µi)², 1/(µ1 + α_ε − µi)², 1/(µ1 + β_ε − µi)² } ∧ 1/γ² ] log(n/δ)   (3)

samples. Given a positive slack γ, we are allowed to return arms that are (ε + γ)-good. Thus a confidence width smaller than O(γ) is never needed for any arm, resulting in the 1/γ² term in Theorem 3.1. In particular, this prevents unbounded sample complexities when an arm lies exactly at the threshold µ1 − ε. For γ = 0, the first two terms inside the max are also present in the lower bound (Theorem 2.1). When α_ε is within a constant factor of β_ε, the second and third terms in the max have the same order, and the upper bound matches the lower bound up to a log(n) factor. If β_ε ≪ α_ε, (3) scales differently from the lower bound, and in such settings the upper bound above can be significantly larger. In the next section, we provide an algorithm that overcomes these issues and is optimal over all parameter regimes. The multiplicative case has different terms but follows the same intuition.

Theorem 3.2 (Multiplicative Case). Fix ε ∈ (0, 1/2], γ ∈ [0, min(16/µ1, 1/2)], 0 < δ ≤ 1/2, and an instance such that µ1 ≥ 0 and max(∆i, |εµ1 − ∆i|) ≤ 2 for all i. With probability at least 1 − δ, for a constant c1, (ST)² returns a set Ĝ with M_ε ⊆ Ĝ ⊆ M_{ε+γ} with sample complexity at most

c1 Σ_{i=1}^n [ max{ 1/((1 − ε)µ1 − µi)², 1/(µ1 + α_ε/(1 − ε) − µi)², 1/(µ1 + β_ε/(1 − ε) − µi)² } ∧ 1/(γµ1)² ] log(n/δ).

4 Surprising Complexity of Finding All ε-Good Arms

When α_ε and β_ε are not of the same order, (ST)² is not optimal. In this section we present an algorithm that is optimal in all parameter regimes. We focus on the additive case here and defer the multiplicative case to Appendix E. We first state an improved sample complexity lower bound, for a family of problem instances, that makes the moderate-confidence terms explicit.

Theorem 4.1. Fix δ ≤ 1/16, n ≥ 2/δ, and ε > 0. Let ρ be an instance of n arms such that the ith is distributed as N(µi, 1), |G_{2β_ε}| = 1, and β_ε < ε/2. Select a permutation π : [n] → [n] uniformly from the set of n! permutations, and consider the permuted instance ρ(π). Any algorithm that returns G_ε(ρ(π)) on ρ(π) correctly with probability at least 1 − δ requires, in expectation over the randomness in π and ρ, at least

c2 ( Σ_{i=1}^n max{ 1/(µ1 − ε − µi)², 1/(µ1 + α_ε − µi)² } log(1/δ) + Σ_{i=1}^n 1/(µ1 + β_ε − µi)² )   (4)

samples, for a universal constant c2.

Proof. (Sketch) To give a tight lower bound in the setting where |G_{2β_ε}| = 1 and β_ε < ε/2, we break our argument into pieces, performing a series of reductions that link the ALL-ε problem to a hypothesis test, and then link the hypothesis test to the problem of identifying the best arm. We apply the Simulator technique from [9] to compute precise moderate-confidence bounds. Other works that prove strong lower bounds in moderate confidence include [25]. We extend the Simulator technique via a novel reduction to composite hypothesis testing in order to connect it to ALL-ε. In all cases, we consider sample complexity in expectation with respect to the randomness in the outcomes and a randomly chosen permutation of the means.

Step 1. Finding an isolated best arm: Consider the problem of finding the best arm where µ1 = β > 0 and µ2, …, µn ≤ −β. This relates to the problem of finding a β-good arm when µ1 is known, studied by [11]. We use the Simulator technique [9] to show that any algorithm requires Ω(Σ_{i=2}^n 1/(µ1 + β − µi)²) samples in expectation. This can be significantly larger than the asymptotically optimal rate of O(β^{−2} log(1/δ)) (proven by [11]) for non-asymptotic δ, e.g., δ = 0.05.

Step 2. Deciding if any mean is positive: We then consider a composite hypothesis test on n distributions where the null hypothesis H0 is that the mean of each distribution is less than −β, and the alternative hypothesis H1 is that there exists a single distribution i* with mean β while the remainder have means less than −β. Importantly, an algorithm does not need to declare which arm is i*; otherwise the bound from Step 1 would apply immediately. Instead, to link this to Step 1, we develop a novel extension of the Simulator technique and use it to show that if an algorithm could solve this composite hypothesis test with fewer samples than the Step 1 bound, then one could design a method that solves the Step 1 problem with fewer samples as well, a contradiction. Hence any algorithm for this hypothesis test requires as many samples in expectation as Step 1.

Step 3. Reducing ALL-ε to Step 2: Finally, we show that a generic algorithm for ALL-ε can be used to solve the hypothesis test in Step 2. Hence the lower bound from Step 2 applies to finding all ε-good arms as well. For the instances considered in the theorem statement, that bound is Ω(Σ_{i=2}^n 1/(µ1 + β_ε − µi)²). Combining this bound, which is independent of δ, with the result from Theorem 2.1 gives the result.

Theorem 4.1 states that an additional Ω(Σ_{i=1}^n 1/(µ1 + β_ε − µi)²) samples are necessary, compared to the lower bound of Theorem 2.1, for instances where no arm is within 2β_ε of µ1. Somewhat surprisingly, these samples are necessary in moderate confidence: the term is independent of δ and becomes negligible as δ → 0. For non-asymptotic values of δ, such as the common choice δ = 0.05 in scientific applications, this term is present and can even dominate the sample complexity when β_ε ≪ α_ε. As an extreme example, if µ1 = β > 0, µ2 = ⋯ = µ_{n−1} = −β, and µn = −ε, the first term in (4) scales like ((n − 1)/ε² + 1/β²) log(1/δ), but the second term scales like n/β², which is a factor of order n larger than the first term for small β and fixed δ.
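To see the scale of this separation numerically, the following quick computation (ours, with illustrative parameter values) evaluates both terms of (4) on the extreme instance just described.

```python
import numpy as np

# Extreme instance: mu_1 = beta, mu_2 = ... = mu_{n-1} = -beta, mu_n = -eps.
n, eps, beta, delta = 1000, 0.5, 0.01, 0.05
mu = np.r_[beta, -beta * np.ones(n - 2), -eps]
mu1, thresh = mu[0], mu[0] - eps
alpha = mu[mu >= thresh].min() - thresh      # alpha_eps = eps - 2*beta here
beta_eps = thresh - mu[mu < thresh].max()    # beta_eps = beta here
term1 = np.maximum((mu1 - eps - mu) ** -2,
                   (mu1 + alpha - mu) ** -2).sum() * np.log(1.0 / delta)
term2 = ((mu1 + beta_eps - mu) ** -2).sum()
# term2 (moderate confidence, ~ n / beta^2) dwarfs term1 for small beta:
print(f"log(1/delta) term: {term1:.2e}, moderate-confidence term: {term2:.2e}")
```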
Furthermore, we point out that Theorem 4.1 highlights that (ST)² is optimal on these instances up to a log factor! The algorithm we present next, FAREAST, improves (ST)²'s dependence on δ and matches the lower bound in Theorem 4.1 for certain instances. Though moderate-confidence terms can dominate the sample complexity in practice, few works have focused on understanding their effect.

4.1 FAREAST

We focus on the additive case with γ = 0 in Algorithm 4.1, FAREAST, and defer the more general case (multiplicative and γ > 0) to Algorithm E.1 in the supplementary material. FAREAST matches the instance-dependent lower bound in Theorem 2.1 as δ → 0. At a high level, FAREAST (Fast Arm Removal Elimination Algorithm for a Sampled Threshold) proceeds in rounds r and maintains sets Ĝr and B̂r of arms so far declared good or bad. It sorts unknown arms into either set through the use of a good filter to detect arms in G_ε and a bad filter to detect arms in G_ε^c.

Good Filter: The good filter is a simple elimination scheme. It maintains an upper bound Ut and a lower bound Lt on µ1 − ε. If an arm's upper bound drops below Lt (line 20), the good filter eliminates that arm; otherwise, if an arm's lower bound rises above Ut (line 19), the good filter adds the arm to Ĝr, but only eliminates this arm if its upper bound falls below the highest lower bound. This ensures that the best arm is never eliminated and that Ut and Lt are always valid bounds.¹ As sampling is split across rounds, the good filter always samples the least-sampled active arm, breaking ties arbitrarily. The number of samples given to the good filter in each round is chosen so that both filters receive identically many samples. This prevents the good filter from over-sampling bad arms and vice versa. In our proof we show that in some (unknown) round, Ĝr = G_ε, i.e., all good arms have been found, having used at most O(Σ_{i=1}^n max{1/(µ1 − ε − µi)², 1/(µ1 + α_ε − µi)²} log(1/δ)) samples, matching the lower bound. FAREAST cannot yet terminate, however, as it must also verify that all remaining arms are in G_ε^c.

Bad Filter: The bad filter removes arms that are not ε-good. To show that an arm i is in G_ε^c, it suffices to find any j such that µj − µi > ε. To motivate the idea of lines 8–12, consider the following procedure in the special case where β_i := µ1 − ε − µi is known. In each round we first run Median-Elimination [12], with failure probability 1/16, to find an arm î that is β_i/2-good in O(n/β_i²) samples.² We then pull both i and î roughly O(log(1/δ)/β_i²) times, which lets us check whether µ_î − µi > ε with probability greater than 1 − δ. This procedure relies on Median-Elimination succeeding, which happens with probability at least 15/16. In the case that it fails and we declare µ_î − µi < ε, we merely repeat the process until it succeeds, which takes O(1) attempts on average. This gives an expected sample complexity of O(n/β_i² + log(1/δ)/β_i²) for any i ∈ G_ε^c. Of course, β_i is unknown to the algorithm. Instead, in each round r, the bad filter guesses that β_i ≈ 2^{−r} for all unknown arms i ∉ Ĝr ∪ B̂r and performs the above procedure, as sketched below.

¹This scheme works as an independent algorithm; we analyze it in Appendix E.5.
²Median-Elimination is used for ease of analysis. One can use LUCB [4] or another method instead.
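A minimal sketch of one round of the bad-filter check described above, under our own simplifications: `find_near_best` is a hypothetical stand-in for the Median-Elimination call with failure probability 1/16, and the per-arm budget `tau_r` is chosen loosely for illustration rather than with the exact constants of Appendix E.

```python
import numpy as np

def bad_filter_round(pull, unknown, r, eps, delta_r, find_near_best):
    """One round of FAREAST's bad filter (sketch). Guess beta_i ~ 2^{-r}:
    find a 2^{-r}-near-best arm i_r, then compare every unknown arm to it."""
    i_r = find_near_best(accuracy=2.0 ** -r, fail_prob=1.0 / 16)  # Median-Elimination stand-in
    # Enough pulls to estimate means to accuracy ~2^{-r} (loose illustrative constant).
    tau_r = int(np.ceil(2.0 ** (2 * r + 3) * np.log(2.0 / delta_r)))
    mu_ir = np.mean([pull(i_r) for _ in range(tau_r)])
    declared_bad = []
    for i in unknown:
        mu_i = np.mean([pull(i) for _ in range(tau_r)])
        if mu_ir - mu_i > eps + 2.0 ** -(r - 1):  # confident that mu_1 - mu_i > eps
            declared_bad.append(i)                # i cannot be eps-good
    return declared_bad
```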
The following theorem demonstrates that FAREAST matches our lower bounds asymptotically as δ → 0.

Theorem 4.2. Fix ε > 0, 0 < δ < 1/8, and an instance of n arms such that max(∆i, |ε − ∆i|) ≤ 8 for all i. There exists an event E with P(E) ≥ 1 − δ such that, on E, FAREAST terminates and returns G_ε. Letting T denote the number of samples taken, for constants c3 and c″ (omitting doubly logarithmic factors),

T ≤ c3 Σ_{i=1}^n max{ 1/(µ1 − ε − µi)², 1/(µ1 + α_ε − µi)² } log(n/δ) + c″ Σ_{i=1}^n 1/(µ1 + β_ε − µi)²,

where the second, moderate-confidence term carries no dependence on δ. Additionally, for γ ≤ 16, FAREAST terminates on E and returns a set Ĝ with G_ε ⊆ Ĝ ⊆ G_{ε+γ} in a number of samples no more than a constant times (3), the complexity of (ST)².

[Figure 3: Comparison of (ST)² and FAREAST averaged over 250 trials, plotted with 3 standard errors. (a) A typical setting of diverse means. (b) A more challenging setting.]

Algorithm 4.1: additive FAREAST with γ = 0
Input: ε, δ, instance ρ
1: Let Ĝ0 = ∅ be the set of arms declared good and B̂0 = ∅ the set of arms declared bad
2: Let A = [n] be the active set; let Ni = 0 track the total number of samples of arm i by the Good Filter
3: Let t = 0 denote the total number of times the condition in line 16 holds in the Good Filter
4: for r = 1, 2, … do
5:   Let δr = δ/(2r²) and let τr be the per-arm budget of round r, of order 2^{2r} log(1/δr) (exact constants in Appendix E)
6:   Initialize Ĝr = Ĝ_{r−1} and B̂r = B̂_{r−1}
7:   // Bad Filter: find bad arms in G_ε^c
8:   Let ir = MedianElimination(ρ, 2^{−r}, 1/16)
9:   Sample arm ir τr times and compute µ̂_{ir}
10:  for i ∉ Ĝ_{r−1} ∪ B̂_{r−1} do
11:    Sample arm i τr times and compute µ̂i
12:    if µ̂_{ir} ≥ µ̂i + ε + 2^{−r+1}: add i to B̂r   // bad arm detected
13:  // Good Filter: find good arms in G_ε
14:  for s = 1, …, H_ME(n, 2^{−r}, 1/16) + (|(Ĝ_{r−1} ∪ B̂_{r−1})^c| + 1)τr do
15:    Pull arm Is ∈ argmin_{j∈A} Nj and set N_{Is} ← N_{Is} + 1
16:    if min_{j∈A} Nj = max_{j∈A} Nj then
17:      Update t ← t + 1
18:      Let Ut = max_{j∈A} µ̂j(t) + C_{δ/2n}(t) − ε and Lt = max_{j∈A} µ̂j(t) − C_{δ/2n}(t) − ε
19:      for i ∈ A: if µ̂i(t) − C_{δ/2n}(t) ≥ Ut: add i to Ĝr   // good arm detected
20:      for i ∈ A: if µ̂i(t) + C_{δ/2n}(t) ≤ Lt: remove i from A and add i to B̂r   // bad arm removed
21:      for i ∈ Ĝr ∩ A: if µ̂i(t) + C_{δ/2n}(t) ≤ max_{j∈A} µ̂j(t) − C_{δ/2n}(t): remove i from A   // good arm removed
22:  if A ⊆ Ĝr or Ĝr ∪ B̂r = [n] then
23:    return the set Ĝr

Here H_ME(n, 2^{−r}, 1/16) denotes the number of samples used by the Median-Elimination call in line 8; matching the two budgets in line 14 ensures both filters receive identically many samples.

5 Empirical Performance

We begin by comparing (ST)² and FAREAST on simulated data. FAREAST is asymptotically optimal but suffers worse constant factors than (ST)².³ (ST)² is optimal except when β_ε ≪ α_ε. We compare (ST)² and FAREAST on two instances in the additive case, shown in Figure 3. All arms are Gaussian with σ = 1. In the first example (left panel), δ = 0.1 and α_ε = β_ε = 0.05. Both (ST)² and FAREAST are optimal in this setting; we show the scaling of their sample complexity as the number of arms increases while keeping the threshold, α_ε, and β_ε constant. In the second example, ε = α_ε = 0.99 and β_ε = 0.01. When 1/β_ε² ≫ n/ε², Theorem 2.1 suggests that O(log(1/δ)/β_ε²) samples are necessary, independent of n. Indeed, in Figure 3, for δ = 0.01, the average complexity of FAREAST is constant, but (ST)² scales linearly with n, as Theorem 3.1 suggests. Finally, a naive uniform sampling strategy performed very poorly; additional experiments, including the uniform sampling method and settings with γ > 0, are in Appendix A.

5.1 Finding all ε-good arms in real-world data, fast

As discussed in the introduction, in many applications such as the New Yorker Cartoon Caption Contest (NYCCC), the ALL-ε objective returns a set of good arms which can then be screened further to choose a favorite. We considered Contest 651, which had 9,250 captions whose means we estimated from a total of 2.2 million ratings. We set ε = 0.1 and focus on the multiplicative setting, i.e., the objective of recovering all captions within 10% of the funniest one. In this experiment, we contrast (ST)² with several other methods, including two oracle methods (marked in Figure 4): LUCB1 [4] with k set to the number of ε-good arms (here, 46), and a threshold bandit, APT [1], given the value of 0.9µ1.

³Implementations of all algorithms and baselines used in this paper are available on GitHub.

[Figure 4: F1 scores averaged over 600 trials with 95% confidence widths for each dataset. (a) NYCCC with ε = 0.1. (b) Cancer drug discovery with ε = 0.8.]
[Figure 5: Precision and recall averaged over 600 trials with 95% confidence widths on NYCCC data. (a) Precision curves for ε = 0.1. (b) Recall curves for ε = 0.1.]

We focus on a common practical requirement: each algorithm's ability to balance precision and recall as it samples. With every new sample, each method recommends an empirical set of ε-good arms based on the empirical means, and we consider the F1 score of this set.⁴ We focus on the F1 score as it is practically relevant and provides a continuous measure of each method's performance. F1 = 1 indicates that an algorithm has found all ε-good arms.

⁴F1 is the harmonic mean of precision (the fraction of captions returned that are actually good) and recall (the fraction of all good captions that are actually returned).
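The evaluation metric is easy to compute; below is a small sketch (ours) scoring a recommended set against the true set of ε-good arms, per footnote 4.

```python
def f1_score(recommended, true_good):
    """F1 of a recommended set vs. the true eps-good set: the harmonic
    mean of precision and recall."""
    rec, good = set(recommended), set(true_good)
    if not rec or not good:
        return 0.0
    precision = len(rec & good) / len(rec)
    recall = len(rec & good) / len(good)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. f1_score({0, 1, 5}, {0, 1, 2}) -> precision 2/3, recall 2/3, F1 = 2/3
```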
As can be seen in Figure 4a, (ST)² outperforms all baselines, including the oracle APT, and almost matches the performance of the TOP-k oracle! We transition from a solid line to a dashed one at 2.2M pulls to mark the number of samples drawn in the real contest from which we gathered the data. To illustrate the importance of knowing the correct value of k, we also plot LUCB1 given k = 46/2 = 23 and k = 46 × 2 = 92, settings where the experimenter under- or over-estimates the number of good arms by as little as a factor of 2. Both cases result in poor performance. We have also included UCB, currently being used to run the contest [26]; the plot shows that UCB is not able to estimate the ε-good set. In Figure 5, we show precision and recall curves for each method on the NYCCC data. (ST)² achieves near-perfect precision quickly, matched only by UCB. APT's poor performance is a consequence of its low precision, shown in Figure 5a. (ST)² achieves high recall more slowly, but is still competitive with the other methods. In practical experiments, high precision early on may be more important than high recall, as it guarantees that practitioners can trust the declarations the algorithm has made, even if some arms are yet to be found. In the supplementary material we show plots for more values of ε.

Additionally, motivated by drug discovery, we performed an experiment on a dataset [27] of 189 inhibitors whose activities were tested against ACVRL1, a kinase associated with cancer [28]. Here we use the multiplicative case of ALL-ε with ε = 0.8 and δ = 0.001 to promote high precision. In this experiment as well, (ST)² performs best (Figure 4b), with only the oracle methods competitive with it. We plot on a log scale to emphasize the early regime.

6 Broader Impacts and Funding Transparency Statement

6.1 Broader Impacts

The application of machine learning (ML) in domains such as advertising, biology, or medicine brings the possibility of utilizing large computational power and large datasets to solve new problems. It is tempting to use powerful, if not fully understood, ML tools to maximize scientific discovery. However, at times the gap between a tool's theoretical guarantees and its practical performance can lead to sub-optimal behavior. This is especially true in adaptive data collection, where misspecifying the model or the desired output (e.g., return the top k performing compounds vs. return all compounds with a potency above a given threshold) may bias data collection and hinder post-hoc consideration of different objectives. In this paper we highlight several such instances in real-life data collection using multi-armed bandits. We believe that the objective studied in this work, that of returning all arms whose mean is quantifiably near-best, more naturally aligns with practical objectives as diverse as finding funny captions and performing medical tests.
We point out that methods from adaptive data collection and multi-armed bandits can also be used on content-recommendation platforms such as social media or news-aggregator sites. In these scenarios, time and again, we have seen that recommendation systems can be greedy, attempting purely to maximize click-through, with the long-term effect of a less informed public. Adjacent to one of the main themes of this paper, we recommend that practitioners not focus solely on the objective of recommendation for immediate profit maximization, but rather keep track of a more holistic set of metrics. We are excited to see our work used in practical applications and believe it can have a major impact on driving the process of scientific discovery.

6.2 Funding Transparency Statement

The work presented in this paper was supported by ARO grant W911NF-15-1-0479. Additionally, this work was partially supported by the MADLab AF Center of Excellence FA9550-18-1-0166.

References

[1] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning, pages 1690–1698. JMLR.org, 2016.
[2] Serge Christmann-Franck, Gerard J. P. van Westen, George Papadatos, Fanny Beltran Escudie, Alexander Roberts, John P. Overington, and Daniel Domine. Unprecedently large-scale kinase inhibitor set enabling the accurate prediction of compound-kinase activities: A way toward selective promiscuity by design? Journal of Chemical Information and Modeling, 56(9):1654–1675, 2016.
[3] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
[5] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pages 258–265, 2013.
[6] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
[7] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012.
[8] Wenbo Ren, Jia Liu, and Ness B. Shroff. Exploring k out of top ρ fraction of arms in stochastic bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2820–2828, 2019.
[9] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.
[10] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
[11] Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
[12] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
[13] Eyal Even-Dar, Shie Mannor, and Yishay Mansour.
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
[14] Shivaram Kalyanakrishnan and Peter Stone. Efficient selection of multiple bandit arms: Theory and practice. In ICML, volume 10, pages 511–518, 2010.
[15] Julian Katz-Samuels and Kevin Jamieson. The true sample complexity of identifying good arms. arXiv preprint arXiv:1906.06594, 2019.
[16] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems, pages 14564–14573, 2019.
[17] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228–251, 2013.
[18] Arghya Roy Chaudhuri and Shivaram Kalyanakrishnan. PAC identification of a bandit arm relative to a reward quantile. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[19] Arghya Roy Chaudhuri and Shivaram Kalyanakrishnan. PAC identification of many good arms in stochastic multi-armed bandits. In International Conference on Machine Learning, pages 991–1000, 2019.
[20] Hideaki Kano, Junya Honda, Kentaro Sakamaki, Kentaro Matsuura, Atsuyoshi Nakamura, and Masashi Sugiyama. Good arm identification via bandit feedback. Machine Learning, 108(5):721–745, 2019.
[21] Kevin Jamieson and Lalit Jain. A bandit approach to multiple testing with false discovery control. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 3664–3674, Red Hook, NY, USA, 2018. Curran Associates Inc.
[22] Matthew L. Malloy and Robert D. Nowak. Sequential testing for sparse recovery. IEEE Transactions on Information Theory, 60(12):7862–7873, 2014.
[23] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
[24] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
[25] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110, 2017.
[26] Ervin Tanczos, Robert Nowak, and Bob Mankoff. A KL-LUCB algorithm for large-scale crowdsourcing. In Advances in Neural Information Processing Systems, pages 5894–5903, 2017.
[27] David H. Drewry, Carrow I. Wells, David M. Andrews, Richard Angell, Hassan Al-Ali, Alison D. Axtman, Stephen J. Capuzzi, Jonathan M. Elkins, Peter Ettmayer, Mathias Frederiksen, et al. Progress towards a public chemogenomic set for protein kinases and a call for contributions. PLoS ONE, 12(8), 2017.
[28] Matteo Bocci, Jonas Sjölund, Ewa Kurzejamska, David Lindgren, Michael Bartoschek, Mattias Höglund, Kristian Pietras, et al. Activin receptor-like kinase 1 is associated with immune cell infiltration and regulates CLEC14A transcription in cancer. Angiogenesis, 22(1):117–131, 2019.
[29] Patricia Dranchak, Ryan MacArthur, Rajarshi Guha, William J. Zuercher, David H. Drewry, Douglas S. Auld, and James Inglese. Profile of the GSK published protein kinase inhibitor set across ATP-dependent and -independent luciferases: Implications for reporter-gene assays. PLoS ONE, 8(3), 2013.