# Thresholding Bandit with Optimal Aggregate Regret

Chao Tao (Computer Science Department, Indiana University at Bloomington), Saúl A. Blanco (Computer Science Department, Indiana University at Bloomington), Jian Peng (Computer Science Department, University of Illinois at Urbana-Champaign), Yuan Zhou (Computer Science Department, Indiana University at Bloomington; Department of ISE, University of Illinois at Urbana-Champaign)

We consider the thresholding bandit problem, whose goal is to find the arms with mean rewards above a given threshold $\theta$, within a fixed budget of $T$ trials. We introduce LSA, a new, simple, and anytime algorithm that aims to minimize the aggregate regret (or the expected number of mis-classified arms). We prove that our algorithm is instance-wise asymptotically optimal. We also provide comprehensive empirical results to demonstrate the algorithm's superior performance over existing algorithms under a variety of different scenarios.

1 Introduction

The stochastic Multi-Armed Bandit (MAB) problem has been extensively studied in the past decade (Auer, 2002; Audibert et al., 2010; Bubeck et al., 2009; Gabillon et al., 2012; Karnin et al., 2013; Jamieson et al., 2014; Garivier and Kaufmann, 2016; Chen et al., 2017). In the classical framework, at each trial of the game a learner faces a set of $K$ arms, pulls an arm, and receives an unknown stochastic reward. Of particular interest is the fixed-budget setting, in which the learner is only given a limited number of total pulls. Based on the received rewards, the learner recommends the best arm, i.e., the arm with the highest mean reward.

In this paper, we study a variant of the MAB problem, called the Thresholding Bandit Problem (TBP). In TBP, instead of finding the best arm, we expect the learner to identify all the arms whose mean rewards $\theta_i$ ($i \in \{1, 2, \dots, K\}$) are greater than or equal to a given threshold $\theta$. This is a very natural setting with direct real-world applications to active binary classification and anomaly detection (Locatelli et al., 2016; Steinwart et al., 2005).

In this paper, we propose to study TBP under the notion of aggregate regret, which is defined as the expected number of errors after $T$ samples of the bandit game. Specifically, for a given algorithm $A$ and a TBP instance $\mathcal{I}$ with $K$ arms, if we use $e_i$ to denote the probability that the algorithm makes an incorrect decision regarding arm $i$ after $T$ rounds of samples, the aggregate regret is defined as $R_A(\mathcal{I}; T) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{K} e_i$. In contrast, most previous works on TBP aim to minimize the simple regret, which is the probability that at least one of the arms is incorrectly labeled. Note that the definition of aggregate regret directly reflects the overall classification accuracy of the TBP algorithm, which is more meaningful than the simple regret in many real-world applications.
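To make the two notions of regret concrete, the following small simulation (a minimal Python sketch of our own; the instance, the uniform budget split, and all names are illustrative and not part of the paper) estimates the per-arm error probabilities $e_i$ of the naive rule that labels each Bernoulli arm by its empirical mean, and reports both the aggregate regret $\sum_i e_i$ and the simple regret.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, 0.3, 0.45, 0.499, 0.55, 0.7])  # hypothetical mean rewards / soft-labels
threshold = 0.5
T = 600                       # total budget
pulls = T // len(theta)       # uniform allocation: each arm sampled T/K times
runs = 20000

errors = np.zeros(len(theta))         # mis-classification counts per arm
any_error = 0                         # runs with at least one mis-classified arm
for _ in range(runs):
    # empirical means from `pulls` Bernoulli samples per arm
    emp = rng.binomial(pulls, theta) / pulls
    wrong = (emp >= threshold) != (theta >= threshold)
    errors += wrong
    any_error += wrong.any()

e = errors / runs
print("per-arm error probabilities e_i:", np.round(e, 3))
print("aggregate regret  sum_i e_i   :", round(e.sum(), 3))
print("simple regret P(>=1 mistake)  :", round(any_error / runs, 3))
# The arm with mean close to 0.5 dominates the simple regret, while the
# aggregate regret still credits the correctly labeled easier arms.
```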
For example, in the crowdsourced binary labeling problem, the learner faces $K$ binary classification tasks, where each task $i$ is associated with a latent true label $z_i \in \{0, 1\}$ and a latent soft-label $\theta_i$. The soft-label $\theta_i$ may be used to model the labeling difficulty/ambiguity of the task, in the sense that a $\theta_i$ fraction of the crowd will label task $i$ as 1 and the rest label it as 0. The crowd is also assumed to be reliable, i.e., $z_i = 1$ if and only if $\theta_i \ge \frac{1}{2}$. The goal of the crowdsourcing problem is to sequentially query a random worker from the large crowd about his/her label on task $i$, for a budget of $T$ times, and then label the tasks with as high (expected) accuracy as possible. If we treat each binary classification task as a Bernoulli arm with mean reward $\theta_i$, this crowdsourced problem becomes aggregate regret minimization in TBP with $\theta = \frac{1}{2}$. If a few tasks are extremely ambiguous (i.e., $\theta_i \approx \frac{1}{2}$), the simple regret would trivially approach 1 (i.e., every algorithm would almost always fail to correctly label all tasks). In such cases, however, a good learner may turn to accurately labeling the less ambiguous tasks and still achieve a meaningful aggregate regret.

A new challenge arising for TBP with aggregate regret is how to balance the exploration of each arm given a fixed budget. Different from the exploration vs. exploitation trade-off in classical MAB problems, where exploration only aims at finding the best arm, TBP asks to maximize the classification accuracy over all arms. Let $\Delta_i \stackrel{\mathrm{def}}{=} |\theta_i - \theta|$ be the hardness parameter, or gap, of arm $i$. An arm with smaller $\Delta_i$ needs more samples to achieve the same classification confidence. A TBP learner thus faces the following dilemma: whether to allocate samples to determine the classification of one hard arm, or use them to improve the accuracy of another, easier arm.

Related Work. Since we focus on the TBP problem in this paper, due to space limitations we are unable to include the significant body of references on other MAB variants. In a previous work (Locatelli et al., 2016), the authors introduced the APT (Anytime Parameter-free Thresholding) algorithm with the goal of simple regret minimization. In this algorithm, a precision parameter $\epsilon$ is used to determine the tolerance of errors (a.k.a. the indifference zone), and the APT algorithm only attempts to correctly classify the arms with hardness gap $\Delta_i > \epsilon$. This variant of the simple regret objective partly alleviates the trivialization problem, mentioned previously, caused by extremely hard arms. In detail, at any time $t$, APT selects the arm that minimizes $\sqrt{T_i(t)}\,\hat\Delta_i(t)$, where $T_i(t)$ is the number of times arm $i$ has been pulled until time $t$, $\hat\Delta_i(t)$ is defined as $|\hat\theta_i(t) - \theta| + \epsilon$, and $\hat\theta_i(t)$ is the empirical mean reward of arm $i$ at time $t$. In their experiments, Locatelli et al. (2016) also adapted the UCBE algorithm from (Audibert et al., 2010) to the TBP problem and showed that APT performs better than UCBE. When the goal is to minimize the aggregate regret, the APT algorithm also works better than UCBE. However, we notice that the choice of the precision parameter $\epsilon$ has a significant influence on the algorithm's performance. A large $\epsilon$ ensures that, when the sample budget is limited, the APT algorithm is not lured by a hard arm into spending overwhelmingly many samples on it without achieving a confident label. However, when the sample budget is ample, a large $\epsilon$ also prevents the algorithm from making enough samples for the arms with hardness gap $\Delta_i < \epsilon$. Theoretically, the optimal selection of this precision parameter $\epsilon$ may differ significantly across instances, and it also depends on the budget $T$. In this work, we propose an algorithm that does not require such a precision parameter and demonstrates improved robustness in practice.
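For reference, here is a minimal sketch (our own Python rendering, with hypothetical function and variable names; see Locatelli et al. (2016) for the authoritative description) of the APT selection rule just described: at each step it pulls the arm minimizing $\sqrt{T_i(t)}\,(|\hat\theta_i(t)-\theta|+\epsilon)$.

```python
import numpy as np

def apt_select(counts, means, threshold, eps):
    """Return the index of the arm the APT rule would pull next.

    counts[i]  -- number of times arm i has been pulled so far (assumed >= 1)
    means[i]   -- empirical mean reward of arm i
    threshold  -- the threshold theta
    eps        -- the precision (indifference-zone) parameter
    """
    index = np.sqrt(counts) * (np.abs(means - threshold) + eps)
    return int(np.argmin(index))

# tiny usage example with hypothetical statistics after a few pulls
counts = np.array([3, 3, 3])
means = np.array([0.2, 0.48, 0.9])
print(apt_select(counts, means, threshold=0.5, eps=0.05))
# -> 1, the arm whose empirical mean is closest to the threshold
```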
Furthermore, a simple corollary of our main theorem (Theorem 1) shows that, for the simple regret with no indifference zone ($\epsilon = 0$), our LSA algorithm achieves optimality up to a $\ln K$ factor in the budget $T$ compared with APT(0). We attach experimental results in Appendix F.1 to show that LSA performs better than APT(0) with respect to the simple regret objective. Another natural approach to TBP is the uniform sampling method, where the learner plays each arm the same number of times (about $T/K$ times). In Appendix C, we show that the uniform sampling approach may need $\Omega(K)$ times more budget than the optimal algorithm to achieve the same aggregate regret. Finally, Chen et al. (2015) proposed the optimistic knowledge gradient heuristic algorithm for budget allocation in crowdsourced binary classification with Beta priors, which is closely related to the TBP problem in the Bayesian setting.

Our Results and Contributions. Let $R_A(\mathcal{I}; T)$ denote the aggregate regret of algorithm $A$ on instance $\mathcal{I}$ after $T$ time steps. Given a sequence of hardness parameters $\Delta_1, \Delta_2, \dots, \Delta_K$, let $\mathcal{I}_{\Delta_1,\dots,\Delta_K}$ be the class of all $K$-arm instances where the gap between $\theta_i$ of the $i$-th arm and the threshold $\theta$ is $\Delta_i$, and let
$$\mathrm{OPT}(\{\Delta_i\}_{i=1}^{K}, T) \stackrel{\mathrm{def}}{=} \inf_{A}\ \sup_{\mathcal{I}\in\mathcal{I}_{\Delta_1,\dots,\Delta_K}} R_A(\mathcal{I}; T) \qquad (1)$$
be the minimum possible aggregate regret that any algorithm can achieve among all instances with the given set of gap parameters. We say that an algorithm $A$ is instance-wise asymptotically optimal if for every $T$, any set of gap parameters $\{\Delta_i\}_{i=1}^{K}$, and any instance $\mathcal{I}\in\mathcal{I}_{\Delta_1,\dots,\Delta_K}$, it holds that
$$R_A(\mathcal{I}; T) \le O(1)\cdot \mathrm{OPT}\big(\{\Delta_i\}_{i=1}^{K}, \Omega(T)\big). \qquad (2)$$
While it may appear that a constant factor multiplying $T$ can affect the regret when the optimal regret is an exponential function of $T$, we note that our definition aligns with the mainstream multi-armed bandit literature (e.g., fixed-budget best arm identification (Gabillon et al., 2012; Carpentier and Locatelli, 2016) and thresholding bandits with simple regret (Locatelli et al., 2016)). Indeed, according to our definition, if the universally optimal algorithm requires a budget of $T$ to achieve $\epsilon$ regret, an asymptotically optimal algorithm requires a budget of only a constant multiple of $T$ to achieve the same order of regret. On the other hand, if one wishes to pin down the optimal constant in front of $T$, even for the single-arm case this boils down to the question of the optimal (and distribution-dependent) constant in the exponent of existing concentration bounds such as the Chernoff bound, Hoeffding's inequality, and Bernstein's inequalities, which is beyond the scope of this paper.
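As a rough illustration of this point (a back-of-the-envelope example, not taken from the paper): for a single Bernoulli arm with gap $\Delta = |\theta_1 - \theta|$, with $\theta_1$ bounded away from 0 and 1, labeling the arm by its empirical mean after $T$ pulls errs with probability $\exp(-\Theta(T\Delta^2))$, and standard two-point lower bound arguments show that no algorithm can do better except for the constant in the exponent; hence
$$\mathrm{OPT}(\{\Delta\}, T) = \exp\big(-\Theta(T\Delta^2)\big).$$
Replacing $T$ by a constant multiple of $T$ therefore only rescales the constant hidden inside the $\Theta(\cdot)$, which is exactly the constant that existing concentration bounds do not pin down.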
We address the challenges mentioned previously and introduce a simple and elegant algorithm, the Logarithmic-Sample Algorithm (LSA). LSA has a very similar form to the APT algorithm of (Locatelli et al., 2016) but introduces an additive term that is proportional to the logarithm of the number of samples made to each arm, in order to more carefully allocate the budget among the arms (see Line 4 of Algorithm 1). This logarithmic term arises from the optimal sample allocation scheme of an offline algorithm when the gap parameters are known beforehand. The log-sample additive term of LSA can be interpreted as an incentive that encourages sampling arms with bigger gaps and/or less explored arms, which shares a similar idea with the incentive term in the famous Upper Confidence Bound (UCB) type of algorithms dating back to (Lai and Robbins, 1985; Agrawal, 1995; Auer, 2002), while, interestingly, the mathematical forms of the two incentive terms are very different.

As the main theoretical result of this paper, we analyze the aggregate regret upper bound of LSA in Theorem 1. We complement the upper bound result with a lower bound theorem (Theorem 20) for any online algorithm. In Remark 2, we compare the upper and lower bounds and show that LSA is instance-wise asymptotically optimal.

We now highlight, at a very high level, the technical contributions made in our regret upper bound analysis; please refer to Section 4 for more detailed explanations. In our proof of the upper bound theorem, we first define a global class of events $\{F_C\}$ (in (14)) which serves as a measurement of how well the arms are explored. Our analysis then goes in two steps. In the first step, we show that $F_C$ happens with high probability, which intuitively means that all arms are well explored. In the second step, we show a quantitative upper bound on the mis-classification probability for each arm, conditioned on $F_C$. The final regret bound follows by summing up the mis-classification probabilities of all arms via linearity of expectation. Using this approach, we successfully bypass the analysis that involves pairs of (or even more) arms, which usually brings in union bound arguments and extra $\ln K$ terms. Indeed, such a $\ln K$ slack appears between the upper and lower bounds proved in (Locatelli et al., 2016). In contrast, our LSA algorithm is asymptotically optimal, without any super-constant slack.

Another important technical ingredient that is crucial to the asymptotic optimality analysis is a new concentration inequality for the empirical mean of an arm that holds uniformly over all time periods, which we refer to as the Variable Confidence Level Bound. This new inequality helps to remove an extra $\ln\ln T$ factor in the upper bound. It is also a strict improvement of Hoeffding's celebrated Maximal Inequality, and it might be useful in many other problems.

Finally, we highlight that LSA is anytime, i.e., it does not need to know the time horizon $T$ beforehand. LSA does use a universal tuning parameter; however, this parameter does not depend on the instance. As we will show in Section 5, the choice of this parameter is quite robust, and the natural parameter setting leads to superior performance of LSA on a set of very different instances, while APT may suffer from poor performance if its precision parameter is not chosen well for an instance.

Organization. The organization of the rest of the paper is as follows. In Section 2 we provide the necessary notation and definitions. We then present the details of the LSA algorithm in Section 3 and upper bound its aggregate regret in Section 4. In Section 5, we present experiments establishing the empirical advantages of LSA over other algorithms. The instance-wise aggregate regret lower bound theorem is deferred to Appendix E.

2 Problem Formulation and Notation

Given an integer $K > 1$, we let $S = [K] \stackrel{\mathrm{def}}{=} \{1, 2, \dots, K\}$ be the set of $K$ arms in an instance $\mathcal{I}$. Each arm $i\in S$ is associated with a distribution $D_i$ supported on $[0, 1]$ which has an unknown mean $\theta_i$. We are interested in the following dynamic game setting: at any round $t \ge 1$, the learner chooses to pull an arm $i_t$ from $S$ and receives an i.i.d. reward sampled from $D_{i_t}$. We let $T$, with $T \ge K$, be the time horizon, or the budget, of the game, which is not necessarily known beforehand. We furthermore let $\theta \in (0, 1)$ be the threshold of the game. After $T$ rounds, the learner $A$ has to determine, for every arm $i\in S$, whether or not its mean reward is greater than or equal to $\theta$.
The learner thus outputs a vector $(d_1, \dots, d_K) \in \{0, 1\}^K$, where $d_i = 0$ if and only if $A$ decides that $\theta_i < \theta$. The goal of the Thresholding Bandit Problem (TBP) in this paper is to maximize the expected number of correct labels after $T$ rounds of the game. More specifically, for any algorithm $A$, we use $\mathcal{E}^A_i(T)$ to denote the event that $A$'s decision regarding arm $i$ is correct after $T$ rounds of the game. The goal of the TBP algorithm is to minimize the aggregate regret, which is the expected number of incorrect classifications among the $K$ arms, i.e.,
$$R_A(T) = R_A(\mathcal{I}; T) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[\sum_{i=1}^{K}\mathbb{I}\big\{\overline{\mathcal{E}^{A}_i(T)}\big\}\right], \qquad (3)$$
where $\overline{\mathcal{E}}$ denotes the complement of the event $\mathcal{E}$ and $\mathbb{I}\{\cdot\}$ denotes the indicator function.

Let $X_{i,t}$ denote the random variable representing the sample received by pulling arm $i$ for the $t$-th time. We further write
$$\hat\theta_{i,t} \stackrel{\mathrm{def}}{=} \frac{1}{t}\sum_{s=1}^{t}X_{i,s} \quad\text{and}\quad \hat\Delta_{i,t} \stackrel{\mathrm{def}}{=} |\hat\theta_{i,t} - \theta| \qquad (4)$$
to denote the empirical mean and the empirical gap of arm $i$ after being pulled $t$ times, respectively. For a given algorithm $A$, let $T^A_i(t)$ and $\hat\theta^A_i(t)$ denote the number of times arm $i$ is pulled and the empirical mean reward of arm $i$ after $t$ rounds of the game, respectively. For each arm $i\in S$, we use $\hat\Delta^A_i(t) \stackrel{\mathrm{def}}{=} |\hat\theta^A_i(t) - \theta|$ to denote the empirical gap after $t$ rounds of the game. We will omit the reference to $A$ when it is clear from the context.

3 Our Algorithm

We now motivate our Logarithmic-Sample Algorithm by first designing an optimal but unrealistic algorithm under the assumption that the hardness gaps $\{\Delta_i\}_{i\in S}$ are known beforehand. Consider the following algorithm $O$. Suppose the algorithm pulls arm $i$ a total of $x_i$ times and makes a decision based on the empirical mean $\hat\theta_{i,x_i}$: if $\hat\theta_{i,x_i} \ge \theta$, the algorithm decides that $\theta_i \ge \theta$, and it decides $\theta_i < \theta$ otherwise. Note that this is all an algorithm can do when the gaps $\Delta_i$ are known. We upper bound the aggregate regret of the algorithm by
$$\sum_{i=1}^{K}\Pr\big(\overline{\mathcal{E}^{O}_i(T)}\big) \le \sum_{i=1}^{K}\Pr\big(|\hat\theta_{i,x_i}-\theta_i| \ge \Delta_i\big) \le \sum_{i=1}^{K}2\exp(-2x_i\Delta_i^2), \qquad (5)$$
where the last inequality follows from the Chernoff–Hoeffding inequality (Proposition 5). We would now like to minimize the right-hand side (RHS) of (5), and we upper bound the aggregate regret of the optimal algorithm $O$ by
$$2\min_{\substack{x_1+\cdots+x_K=T\\ x_1,\dots,x_K\in\mathbb{N}}}\ \sum_{i=1}^{K}\exp(-2x_i\Delta_i^2) = 2P^{\mathbb{N}}_2(\{\Delta_i\}_{i\in S}, T).$$
Here, for every $c > 0$, we define
$$P^{\mathbb{N}}_c(\{\Delta_i\}_{i\in S}, T) \stackrel{\mathrm{def}}{=} \min_{\substack{x_1+\cdots+x_K=T\\ x_1,\dots,x_K\in\mathbb{N}}}\ \sum_{i=1}^{K}\exp(-cx_i\Delta_i^2). \qquad (6)$$
We naturally introduce the following continuous relaxation of the program $P^{\mathbb{N}}_c$ by defining
$$P_c(\{\Delta_i\}_{i\in S}, T) \stackrel{\mathrm{def}}{=} \min_{\substack{x_1+\cdots+x_K=T\\ x_1,\dots,x_K\ge 0}}\ \sum_{i=1}^{K}\exp(-cx_i\Delta_i^2). \qquad (7)$$
$P_c$ approximates $P^{\mathbb{N}}_c$ well, as it is straightforward to see that
$$P_c(\{\Delta_i\}_{i\in S}, T) \le P^{\mathbb{N}}_c(\{\Delta_i\}_{i\in S}, T) \le P_c(\{\Delta_i\}_{i\in S}, T-K). \qquad (8)$$
We apply the Karush–Kuhn–Tucker (KKT) conditions to the optimization problem $P_2(\{\Delta_i\}_{i\in S}, T)$ and find that the optimal solution satisfies
$$x_i\Delta_i^2 + \ln\frac{1}{\Delta_i} \approx \Phi \quad\text{for } i\in S, \qquad (9)$$
where $\Phi \stackrel{\mathrm{def}}{=} \max\Big\{x : \sum_{i=1}^{K}\max\Big\{\frac{x-\ln\frac{1}{\Delta_i}}{\Delta_i^2},\, 0\Big\} \le T\Big\}$ is independent of $i\in S$. Furthermore, since $\sum_{i=1}^{K}\max\{\frac{x-\ln\frac{1}{\Delta_i}}{\Delta_i^2}, 0\}$ is an increasing continuous function of $x$, $\Phi$ is indeed well defined. Please refer to Lemma 10 of Appendix B for the details of the relevant calculations. In light of (8) and (9), the following algorithm $O'$ (still with the unrealistic assumption that the gaps $\{\Delta_i\}_{i\in S}$ are known) incrementally solves $P_c$ and approximates the algorithm $O$: at each time $t$, it selects the arm $i$ that minimizes $T_i(t-1)\Delta_i^2 + \ln\frac{1}{\Delta_i}$ and plays it.
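The allocation in (9) can be computed numerically by a one-dimensional search for $\Phi$. The sketch below (a Python illustration of our own, with hypothetical gap values; it is not part of the paper's algorithms) solves the continuous program (7) with $c = 2$ by bisecting on $\Phi$ and prints the resulting budget split.

```python
import numpy as np

def offline_allocation(gaps, T, iters=100):
    """Water-filling solution of the continuous program P_2 in (7).

    Returns x with x_i = max((Phi - ln(1/Delta_i)) / Delta_i^2, 0), where Phi is
    found by bisection as the largest value keeping sum_i x_i <= T, cf. (9).
    """
    gaps = np.asarray(gaps, dtype=float)
    log_inv = np.log(1.0 / gaps)

    def used(phi):
        return np.maximum((phi - log_inv) / gaps**2, 0.0).sum()

    lo, hi = log_inv.min(), log_inv.max() + T * gaps.max()**2  # used(lo) = 0 <= T <= used(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if used(mid) <= T else (lo, mid)
    phi = lo
    return np.maximum((phi - log_inv) / gaps**2, 0.0)

gaps = np.array([0.30, 0.10, 0.05, 0.01])      # hypothetical gaps Delta_i
x = offline_allocation(gaps, T=10_000)
print(np.round(x))                              # harder arms (smaller gaps) get far more budget
print(np.exp(-2 * x * gaps**2).sum())           # value of the objective in (7) with c = 2
```

On the printed allocation one can check that $x_i\Delta_i^2 + \ln\frac{1}{\Delta_i}$ is (approximately) equal across the arms receiving a positive budget, which is exactly the relation (9) that motivates $O'$ and, ultimately, LSA.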
Our proposed algorithm is very close to $O'$. Since in reality the algorithm does not have access to the precise gap quantities, we use the empirical estimate $\hat\Delta_i^2$ instead of $\Delta_i^2$ in the $T_i(t-1)\Delta_i^2$ term. For the logarithmic term, if we were also to use $\ln\frac{1}{\hat\Delta_i}$ instead of $\ln\frac{1}{\Delta_i}$, we might encounter extremely small empirical estimates when the arm has not been sufficiently sampled, which would lead to unbounded values of $\ln\frac{1}{\hat\Delta_i}$ and render the arm almost impossible to be sampled in the future. To solve this problem, we note that $O'$ tries to keep $T_i(t-1)\Delta_i^2$ roughly the same across the arms (ignoring the $\ln\frac{1}{\Delta_i}$ term). In light of this, we use $\sqrt{T_i(t-1)}$ to roughly estimate the order of $\frac{1}{\Delta_i}$. This encourages the exploration of both the arms with larger gaps and the ones with fewer trials. To summarize, at each time $t$, our algorithm selects the arm $i$ that minimizes
$$\alpha\, T_i(t-1)\big(\hat\Delta_i(t-1)\big)^2 + 0.5\ln T_i(t-1),$$
where $\alpha > 0$ is a universal tuning parameter, and plays that arm. The details of the algorithm are presented in Algorithm 1.

Algorithm 1 Logarithmic-Sample Algorithm, LSA(S, θ)
1: Input: a set of arms $S = [K]$, threshold $\theta$
2: Initialization: pull each arm once
3: for $t = K + 1$ to $T$ do
4: Pull arm $i_t = \arg\min_{i\in S}\big\{\alpha T_i(t-1)(\hat\Delta_i(t-1))^2 + 0.5\ln T_i(t-1)\big\}$
5: For each arm $i\in S$, let $d_i \leftarrow 1$ if $\hat\theta_i(T) \ge \theta$ and $d_i \leftarrow 0$ otherwise
6: Output: $(d_1, \dots, d_K)$

4 Regret Upper Bound for LSA

In this section, we show the upper bound on the aggregate regret of Algorithm 1. Let $x = \Lambda$ be the solution to the following equation:
$$\sum_{i=1}^{K}\Big(\mathbb{I}\{x \le \ln\tfrac{1}{\Delta_i}\}\exp(2x) + \mathbb{I}\{x > \ln\tfrac{1}{\Delta_i}\}\cdot\frac{x-\ln\frac{1}{\Delta_i}+\alpha}{\alpha\Delta_i^2}\Big) = \frac{T}{\max\{40/\alpha+1,\,40\}}. \qquad (10)$$
Notice that the left-hand side of (10) is a strictly increasing, continuous function of $x \ge 0$ that equals $K$ when $x = 0$ and goes to infinity as $x \to \infty$. Hence $\Lambda$ is guaranteed to exist and is uniquely defined when $T$ is large. Furthermore, for any $i\in S$, we let
$$\lambda_i \stackrel{\mathrm{def}}{=} \mathbb{I}\{\Lambda \le \ln\tfrac{1}{\Delta_i}\}\exp(2\Lambda) + \mathbb{I}\{\Lambda > \ln\tfrac{1}{\Delta_i}\}\cdot\frac{\Lambda-\ln\frac{1}{\Delta_i}+\alpha}{\alpha\Delta_i^2}. \qquad (11)$$
We note that $\{\lambda_i\}_{i\in S}$ is the optimal solution to $P_{2\alpha}\big(\{\max\{\Delta_i, \exp(-\Lambda)\}\}_{i\in S},\ T/\max\{40/\alpha+1, 40\}\big)$. Please refer to Lemma 11 of Appendix B for the detailed calculations. The goal of this section is to prove the following theorem.

Theorem 1. Let $R_{\mathrm{LSA}}(T)$ be the aggregate regret incurred by Algorithm 1. When $0 < \alpha \le 8$ and $T \ge \max\{40/\alpha+1, 40\}\cdot K$, we have
$$R_{\mathrm{LSA}}(T) \le \Upsilon(\alpha)\sum_{i\in S}\exp\Big(-\frac{\lambda_i\Delta_i^2}{10}\Big), \qquad (12)$$
where $\Upsilon(\alpha)$ is a constant that only depends on the universal tuning parameter $\alpha$.

Remark 2. If we set $\alpha = 1/20$, then the right-hand side of (12) is at most a universal constant times $\sum_{i\in S}\exp(-\lambda_i\Delta_i^2/10)$. One can verify that
$$\sum_{i\in S}\exp\Big(-\frac{\lambda_i\Delta_i^2}{10}\Big) \le O\big(P_{1/10}(\{\max\{\Delta_i,\exp(-\Lambda)\}\}_{i\in S},\, T/801)\big) = O\big(P_{16}(\{\max\{\Delta_i,\exp(-\Lambda)\}\}_{i\in S},\, T/128160)\big) \le O\big(P_{16}(\{\Delta_i\}_{i\in S},\, T/128160)\big),$$
where the first inequality is due to Lemma 12 of Appendix B and the equality is because of Lemma 13 of Appendix B. This matches the lower bound stated in Theorem 20 up to constant factors.¹

¹While the constants may seem large, we emphasize that (i) we make no effort to optimize the constants in the asymptotic bounds, (ii) most of the constants come from the lower bound, while the constant factor in our upper bound is 10, and (iii) we believe that the actual constant of our algorithm is quite small, as the experimental evaluation in the later section demonstrates that our algorithm performs very well in practice.
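To make the quantities in Theorem 1 concrete, the following sketch (our own numerical illustration with hypothetical gaps and parameters; the helper name compute_lambda is ours) solves (10) for $\Lambda$ by bisection and evaluates the $\lambda_i$ of (11) together with the per-arm terms $\exp(-\lambda_i\Delta_i^2/10)$ appearing in the bound (12).

```python
import numpy as np

def compute_lambda(gaps, T, alpha, iters=200):
    """Solve (10) for Lambda by bisection and return (Lambda, lambda_i) as in (11).

    Assumes T / max{40/alpha + 1, 40} >= K, so that Lambda >= 0 exists.
    """
    gaps = np.asarray(gaps, dtype=float)
    log_inv = np.log(1.0 / gaps)
    budget = T / max(40.0 / alpha + 1.0, 40.0)

    def lhs(x):
        small = x <= log_inv          # arms with ln(1/Delta_i) >= x
        return np.where(small, np.exp(2 * x),
                        (x - log_inv + alpha) / (alpha * gaps**2)).sum()

    lo, hi = 0.0, 1.0
    while lhs(hi) < budget:           # the LHS is increasing and unbounded in x
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) <= budget else (lo, mid)
    lam = lo
    small = lam <= log_inv
    lam_i = np.where(small, np.exp(2 * lam),
                     (lam - log_inv + alpha) / (alpha * gaps**2))
    return lam, lam_i

gaps = np.array([0.30, 0.10, 0.05, 0.005])          # hypothetical gaps Delta_i
Lambda, lam_i = compute_lambda(gaps, T=2_000_000, alpha=1.0)
print(round(Lambda, 3))
print(np.round(lam_i, 1))                            # the "allocations" lambda_i
print(np.round(np.exp(-lam_i * gaps**2 / 10), 4))    # per-arm terms in the bound (12)
```

Arms with $\ln\frac{1}{\Delta_i} \ge \Lambda$ saturate at $\lambda_i = \exp(2\Lambda)$, so their terms in (12) stay $\Omega(1)$; these are exactly the arms that the proof below treats separately.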
The rest of this section is devoted to the proof of Theorem 1. Before proceeding, we note that the analysis of the APT algorithm (Locatelli et al., 2016) crucially depends on a favorable event stating that the empirical mean of any arm at any time does not deviate too much from the true mean. This requires a union bound that introduces extra factors such as $\ln K$ and $\ln\ln T$. Our analysis adopts a novel approach that does not need a union bound over all arms, and hence avoids the extra $\ln K$ factor. In the second step of our analysis, we introduce the new Variable Confidence Level Bound to save the extra doubly logarithmic term in $T$. Now we dive into the details of the proof.

Let $B \stackrel{\mathrm{def}}{=} \{i\in S : \ln\frac{1}{\Delta_i} < \Lambda\}$. Intuitively, $B$ contains the arms that can be well classified by the ideal algorithm $O$ (described in Section 3), while even the ideal algorithm $O$ suffers $\Omega(1)$ regret for each arm in $S \setminus B$. In light of this, the key of the proof is to upper bound the regret incurred by the arms in $B$. Let $R^{\mathrm{LSA}}_B(T)$ denote the regret incurred by the arms in $B$. Note that $\Upsilon(\alpha)\exp(-\lambda_i\Delta_i^2/10) \ge 1$ for every arm $i\in S\setminus B$, and the regret incurred by each arm is at most 1. Therefore, to establish (12), we only need to show that
$$R^{\mathrm{LSA}}_B(T) \le \Upsilon(\alpha)\sum_{i\in B}\exp\Big(-\frac{\lambda_i\Delta_i^2}{10}\Big). \qquad (13)$$
We set up a few notations to facilitate the proof of (13). We define
$$\xi_i(t) \stackrel{\mathrm{def}}{=} \alpha T_i(t)\big(\hat\Delta_i(t)\big)^2 + 0.5\ln T_i(t)$$
to be the expression inside the $\arg\min(\cdot)$ operator in Line 4 of the algorithm, for arm $i$ at time $t$. We also define $\xi_{i,t} \stackrel{\mathrm{def}}{=} \alpha t(\hat\Delta_{i,t})^2 + 0.5\ln t$. Intuitively, when $\xi_i(t)$ is large, we usually have a larger value of $T_i(t)$, and arm $i$ is better explored. Therefore, $\xi_i(t)$ can be used as a measurement of how well arm $i$ is explored, which directly relates to the mis-classification probability for the arm. We say that arm $i$ is $C$-well explored at time $T$ if there exists $T' \le T$ such that $\xi_i(T') > C$. For any $C > 0$, we also define the event $F_C$ to be
$$F_C \stackrel{\mathrm{def}}{=} \big\{\exists\, T' \le T : \forall i \in S,\ \xi_i(T') > C\big\}. \qquad (14)$$
When $F_C$ happens, we know that all arms are $C$-well explored.

At a high level, the proof of (13) goes in two steps. First, we show that for $C$ almost as large as $\Lambda$, $F_C$ happens with high probability, which means that every arm is $C$-well explored. Second, we quantitatively relate being $C$-well explored to the mis-classification probability of each arm, which can then be used to deduce a regret upper bound given the event $F_C$.

We start by revealing more details about the first step. The following Lemma 3 gives a lower bound on the probability of the event $F_C$.

Lemma 3. $\Pr(F_{\Lambda-k}) \ge 1 - \exp(-40k/\alpha)$ for $0 \le k < \Lambda$.

We now introduce the high-level ideas for proving Lemma 3 and defer the formal proofs to Appendix D.2. For any arm $i\in S$ and $C > 0$, let $\tau_{i,C}$ be the random variable representing the smallest positive integer such that $\xi_{i,\tau_{i,C}} > C$ (i.e., $\xi_{i,t} \le C$ for all $1 \le t < \tau_{i,C}$). Intuitively, $\tau_{i,C}$ denotes the first time arm $i$ becomes $C$-well explored. We first show that the distribution of $\tau_{i,C}$ has an exponential tail; hence, the sum of these variables with the same $C$ also has an exponential tail. Next, we show that $\sum_{i=1}^{K}\tau_{i,\Lambda-k} \le T$ holds with high probability, where the failure probability vanishes exponentially as $k$ increases. In the last step, thanks to the design of the algorithm, we are able to argue that $\sum_{i=1}^{K}\tau_{i,\Lambda-k} \le T$ implies $F_{\Lambda-k}$.

We now proceed to the second step of the proof of (13). The following lemma (whose proof is deferred to Appendix D.3) gives an upper bound on the regret incurred by the arms in $B$ conditioned on $F_C$.
Lemma 4. If $k \ge 0.1\alpha$, then conditioned on $F_{\Lambda-k}$, the regret incurred by the arms in $B$ is at most a factor depending only on $\alpha$ times $\sum_{i\in B}\exp\big(-\frac{\lambda_i\Delta_i^2}{10} + k\big)$.

As mentioned before, the key to proving Lemma 4 is to pin down the quantitative relation between the event $F_C$ and the probability of mis-classifying an arm conditioned on $F_C$; the expected regret upper bound is then obtained by summing the mis-classification probabilities over all arms in $B$. A key technical challenge in our analysis is to design a concentration bound for the empirical mean of an arm (namely, arm $i$) that holds uniformly over all time periods. A typical method is to let the length of the confidence band scale linearly with $\sqrt{1/t}$, where $t$ is the number of samples made for the arm. However, this would worsen the failure probability and lead to an extra $\ln\ln T$ factor in the regret upper bound. To remove the iterated logarithmic factor, we introduce a novel uniform concentration bound in which the ratio between the length of the confidence band and $\sqrt{1/t}$ is almost constant for large $t$, but becomes larger for smaller $t$. Since this ratio is related to the confidence level of the corresponding confidence band, we refer to this new concentration inequality as the Variable Confidence Level Bound. More specifically, in Appendix D.3.1, we prove the following lemma.

Lemma 19 (Variable Confidence Level Bound, pre-stated). Let $X_1, \dots, X_L$ be i.i.d. random variables supported on $[0, 1]$ with mean $\mu$. For any $a > 0$ and $b > 0$, it holds that
$$\Pr\left[\exists\, t\in\{1,\dots,L\} : \Big|\frac{1}{t}\sum_{s=1}^{t}X_s - \mu\Big| \ge \sqrt{\frac{a + b\ln(L/t)}{2t}}\,\right] \le \frac{2\cdot 2^{b/2}}{2^{b/2}-1}\exp(-a/2).$$

This new inequality greatly helps the analysis of our algorithm: the intuition is that, conditioned on the event $F_C$, it is much less likely that only a few samples have been made for arm $i$, and therefore we can afford a less accurate (i.e., wider) confidence band for its mean value. It is notable that a similar idea is also adopted in the analysis of the MOSS algorithm (Audibert and Bubeck, 2009), which gives a minimax optimal regret bound for the ordinary multi-armed bandit. However, our Variable Confidence Level Bound is more general: it can replace the usage of Hoeffding's Maximal Inequality in the analysis of MOSS and may find other applications. We additionally remark that in Hoeffding's celebrated Maximal Inequality the confidence level also changes with time; however, the blow-up factor applied to the confidence level in our inequality is only the logarithm of that of Hoeffding's Maximal Inequality. Therefore, if constant factors are ignored, our inequality strictly improves Hoeffding's Maximal Inequality.
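As a quick numerical illustration of the Variable Confidence Level Bound, the following Monte Carlo sketch (our own code; the horizon, mean, and the parameters a and b are hypothetical, and the right-hand-side constant used is the one stated above) estimates the probability that the running empirical mean of Bernoulli samples ever leaves the variable-level confidence band.

```python
import numpy as np

rng = np.random.default_rng(1)

L, mu = 200, 0.3          # hypothetical horizon and mean
a, b = 4.0, 2.0           # hypothetical parameters of the bound
runs = 20000

t = np.arange(1, L + 1)
band = np.sqrt((a + b * np.log(L / t)) / (2 * t))   # variable-level confidence band

violations = 0
for _ in range(runs):
    x = rng.binomial(1, mu, size=L)
    dev = np.abs(np.cumsum(x) / t - mu)             # |empirical mean - mu| after t samples
    violations += (dev >= band).any()

lhs = violations / runs
rhs = 2 * 2 ** (b / 2) / (2 ** (b / 2) - 1) * np.exp(-a / 2)
print(f"estimated P[any t violates the band] = {lhs:.4f} <= bound {rhs:.4f}")
```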
The formal proof of Theorem 1 involves a few technical tricks to combine Lemma 3 and Lemma 4 to deduce the final regret bound, and it is deferred to Appendix D.1. The lower bound theorem (Theorem 20) that complements Theorem 1 is deferred to Appendix E due to space constraints.

5 Experiments

In our experiments, we assume that each arm follows an independent Bernoulli distribution with a different mean. To guarantee a fair comparison, we vary the total number of samples $T$ and compare the empirical average aggregate regret on a logarithmic scale, averaged over 5,000 independent runs. We consider three different choices of $\{\theta_i\}_{i\in S}$:

1. (Arithmetic progression I.) $K = 10$; $\theta_{1:4} = 0.2 + (0{:}3)\cdot 0.05$, $\theta_5 = 0.45$, $\theta_6 = 0.55$, and $\theta_{7:10} = 0.65 + (0{:}3)\cdot 0.05$ (see Setup 1 in Figure 1).
2. (Arithmetic progression II.) $K = 20$; $\theta_i = 0.405 + (i-1)/100$ for $i = 1, \dots, 20$ (see Setup 2 in Figure 1).
3. (Two-group setting.) $K = 10$; $\theta_{1:5} = 0.45$ and $\theta_{6:10} = 0.505$ (see Setup 3 in Figure 1).

Figure 1: Average aggregate regret on a logarithmic scale for different settings.

In our experiments, we fix $\theta = 0.5$. We notice that the choice of $\alpha$ in our LSA is quite robust (see Appendix F.4 for experimental results). To illustrate the performance, we fix $\alpha = 1.35$ in LSA and compare it with four existing algorithms for the TBP problem under a variety of settings. We now discuss these algorithms and their parameter settings in more detail.

Uniform: Given the budget $T$, this method pulls the arms sequentially from 1 to $K$ until the budget $T$ is exhausted, so that each arm is sampled roughly $T/K$ times. It then outputs $\theta_i \ge \theta$ whenever $\hat\theta_i \ge \theta$.

APT($\epsilon$): Introduced and analyzed in (Locatelli et al., 2016), this algorithm aims to output a set of arms ($\{i\in S : \hat\mu_i \ge \theta\}$) serving as an estimate of the set of arms with means over $\theta + \epsilon$. The natural adaptation of the APT algorithm to our problem corresponds to changing the output: it outputs $\theta_i \ge \theta$ if $\hat\theta_i \ge \theta$ and $\theta_i < \theta$ otherwise. In the experiments, we test the following choices of $\epsilon$: 0, 0.025, 0.05, and 0.1.

UCBE($b$): Introduced and analyzed in (Audibert and Bubeck, 2010), this algorithm aims to identify the best arm (the arm with the largest mean reward). A natural adaptation of this algorithm to TBP is, at each time $t$, to pull $\arg\min_{i\in S}\big(\hat\Delta_i - \sqrt{a/T_i(t-1)}\big)$, where $a$ is a tuning parameter. In (Audibert and Bubeck, 2010), the algorithm has been proved optimal when $a$ is of order $\frac{T-K}{H}$ (with the constant $\frac{25}{36}$), where $H = \sum_{i\in S}\Delta_i^{-2}$. Here we set $a = 4^{b}\cdot\frac{T-K}{H}$ and test three different choices of $b$: $-1$, 0, and 4.

Opt-KG($a$, $b$): Introduced in (Chen et al., 2015), this algorithm also aims to minimize the aggregate regret. It models TBP as a Bayesian Markov decision process where $\{\theta_i\}_{i\in S}$ is assumed to be drawn from a known Beta prior Beta($a$, $b$). Here we choose two different priors: Beta(1, 1) (the uniform prior) and Beta(0.5, 0.5) (the Jeffreys prior).

Comparisons. In Setup 1, which is a relatively easy setting, LSA works best among all choices of budget $T$. With the right choice of parameter, APT and Opt-KG also achieve satisfactory performance. Though the performance gaps appear to be small, two-tailed paired t-tests of the aggregate regrets indicate that LSA is significantly better than most of the other methods, except APT(.05) and APT(.025) (see Table 1 in Appendix F.2). In Setups 2 and 3, where ambiguous arms close to the threshold $\theta$ are present, the performance difference between LSA and the other methods is more noticeable. LSA consistently outperforms the other methods in both settings over almost all choices of budget $T$ with statistical significance. It is worth noting that, although APT also works reasonably well in Setup 2 when $T$ is small, the best parameter $\epsilon$ there differs from that for larger $T$ and for the other setups. On the other hand, the parameters chosen for LSA are fixed across all setups, indicating that our algorithm is more robust. We perform additional experiments that, due to space limitations, are included in Appendix F.3. In all setups, LSA outperforms its competitors with various parameter choices.
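For readers who wish to reproduce a scaled-down version of this comparison, the sketch below (our own code, not the authors' implementation; the instance, budget, and number of runs are illustrative and much smaller than the paper's setups) runs LSA (Algorithm 1 with $\alpha = 1.35$), the adapted APT ($\epsilon = 0.05$), and uniform sampling, and reports empirical aggregate regrets.

```python
import numpy as np

def run(policy, theta, T, rng, threshold=0.5, alpha=1.35, eps=0.05):
    """Play one TBP game with the given sampling policy; return #mis-classified arms."""
    K = len(theta)
    counts = np.ones(K)                                  # initialization: pull each arm once
    sums = rng.binomial(1, theta).astype(float)
    for _ in range(K, T):
        gaps = np.abs(sums / counts - threshold)
        if policy == "LSA":                              # index of Line 4 of Algorithm 1
            idx = alpha * counts * gaps**2 + 0.5 * np.log(counts)
        elif policy == "APT":                            # APT index with indifference zone eps
            idx = np.sqrt(counts) * (gaps + eps)
        else:                                            # uniform: pull the least-pulled arm
            idx = counts
        i = int(np.argmin(idx))
        sums[i] += rng.binomial(1, theta[i])
        counts[i] += 1
    decisions = (sums / counts) >= threshold
    return int(np.sum(decisions != (theta >= threshold)))

theta = np.array([0.35, 0.4, 0.45, 0.48, 0.52, 0.55, 0.6, 0.65])  # hypothetical small instance
T, runs = 4000, 200
rng = np.random.default_rng(2)
for policy in ["LSA", "APT", "Uniform"]:
    regret = np.mean([run(policy, theta, T, rng) for _ in range(runs)])
    print(f"{policy:8s} empirical aggregate regret ~ {regret:.2f}")
```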
6 Conclusion

In this paper we introduce an algorithm that minimizes the aggregate regret for the thresholding bandit problem. Our algorithm, LSA, makes use of a novel approach inspired by the optimal allocation scheme of the budget when the reward gaps are known ahead of time. When compared to APT, LSA uses an additional term, similar in spirit to the UCB-type algorithms though mathematically different, that encourages the exploration of arms that have bigger gaps and/or have not been sufficiently explored. Moreover, LSA is anytime and robust, while the precision parameter $\epsilon$ needed in the APT algorithm is highly sensitive and hard to choose. Besides showing empirically that LSA performs better than APT (for different values of $\epsilon$) and than other algorithms in a variety of settings, we also employ novel proof ideas that eliminate the logarithmic terms usually brought in by the straightforward union bound argument, design the new Variable Confidence Level Bound that strictly improves Hoeffding's celebrated Maximal Inequality, and prove that LSA achieves instance-wise asymptotically optimal aggregate regret.

References

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pages 41–53, 2010.

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory (ALT), pages 23–37, 2009.

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems (NIPS), pages 3212–3220, 2012.

Zohar Shay Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning (ICML), pages 1238–1246, 2013.

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory (COLT), pages 423–439, 2014.

Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory (COLT), pages 998–1027, 2016.

Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory (COLT), pages 535–592, 2017.

Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning (ICML), pages 1690–1698, 2016.

Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(Feb):211–232, 2005.

Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16(1):1–46, 2015.

Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory (COLT), pages 590–604, 2016.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory (COLT), 2009.

Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), 2010.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1–6, 2018.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.

Jiří Matoušek and Bernd Gärtner. Understanding and Using Linear Programming (Universitext). Springer, 2006.