# Thresholding Bandits with Augmented UCB

Subhojyoti Mukherjee¹, Naveen Kolar Purushothama², Nandan Sudarsanam³, Balaraman Ravindran⁴
¹,⁴ Department of Computer Science & Engineering, Indian Institute of Technology Madras
² Department of Electrical Engineering, Indian Institute of Technology Madras
³ Department of Management Studies, Indian Institute of Technology Madras
¹ subho@cse.iitm.ac.in, ² naveenkp@iitm.ac.in, ³ nandan@iitm.ac.in, ⁴ ravi@cse.iitm.ac.in

In this paper we propose the Augmented-UCB (AugUCB) algorithm for a fixed-budget version of the thresholding bandit problem (TBP), where the objective is to identify a set of arms whose quality is above a threshold. A key feature of AugUCB is that it uses both mean and variance estimates to eliminate arms that have been sufficiently explored; to the best of our knowledge this is the first algorithm to employ such an approach for the considered TBP. Theoretically, we obtain an upper bound on the loss (probability of mis-classification) incurred by AugUCB. Although the UCBEV algorithm in the literature provides a better guarantee, it is important to emphasize that UCBEV requires the problem complexity as input (whose computation in turn requires the means and variances of the arms), and hence is not realistic in practice; this is in contrast to AugUCB, whose implementation requires no such complexity input. We conduct extensive simulation experiments to validate the performance of AugUCB. Through our simulation work, we establish that AugUCB, owing to its utilization of variance estimates, performs significantly better than the state-of-the-art APT, CSAR and other non-variance-based algorithms.

## 1 Introduction

Stochastic multi-armed bandit (MAB) problems are instances of the classic sequential decision-making scenario; specifically, an MAB problem comprises a learner and a collection of actions (or arms), denoted A. In each trial the learner plays (or pulls) an arm i ∈ A, which yields independent and identically distributed (i.i.d.) reward samples from a distribution (corresponding to arm i) whose expectation is denoted by r_i. The learner's objective is to identify an arm with the maximum expected reward, denoted r*. Thus, at each time-step the learner faces the exploration-vs.-exploitation dilemma: it can pull the arm that has yielded the highest empirical mean reward (denoted r̂_i) so far (exploitation), or continue to explore other arms in the hope of finding a better arm whose performance has not yet been observed sufficiently (exploration).

Pure-exploration MAB problems are unlike their traditional (exploration vs. exploitation) counterparts, where the objective is to minimize the cumulative regret (the total loss incurred by the learner for not playing the optimal arm throughout the time horizon T). Instead, in pure-exploration problems a learning algorithm can, until time T, invest entirely in exploring the arms without being concerned about the loss incurred while exploring; the objective is to minimize the probability that the arm recommended at time T is not the best arm. In this paper, we further consider a combinatorial version of the pure-exploration MAB called the thresholding bandit problem (TBP). Here, the learning algorithm is provided with a threshold τ, and the objective, after exploring for T rounds, is to output all arms i whose r_i is above τ.
It is important to emphasize that the thresholding bandit problem is different from the threshold bandit setup studied in Abernethy et al. [2016], where the learner receives a unit reward whenever the value of an observation is above a threshold.

Formally, the problem we consider is the following. First, we define the set S_τ = {i ∈ A : r_i ≥ τ}. Note that S_τ is the set of all arms whose reward mean is at least τ. Let S_τᶜ denote the complement of S_τ, i.e., S_τᶜ = {i ∈ A : r_i < τ}. Next, let Ŝ_τ = Ŝ_τ(T) ⊆ A denote the recommendation of a learning algorithm (under consideration) after T time units of exploration, and let Ŝ_τᶜ denote its complement. The performance of the learning agent is measured by the accuracy with which it can classify the arms into S_τ and S_τᶜ after the time horizon T. Equivalently, using I(E) to denote the indicator of an event E, the loss L(T) is defined as

$$ L(T) = I\Big( \{S_\tau \cap \hat S_\tau^c \neq \emptyset\} \cup \{\hat S_\tau \cap S_\tau^c \neq \emptyset\} \Big). $$

Finally, the goal of the learning agent is to minimize the expected loss:

$$ \mathbb{E}[L(T)] = P\Big( \{S_\tau \cap \hat S_\tau^c \neq \emptyset\} \cup \{\hat S_\tau \cap S_\tau^c \neq \emptyset\} \Big). $$

Note that the expected loss is simply the probability of misclassification (i.e., error), which occurs either if a good arm is rejected or if a bad arm is accepted as a good one. The above TBP formulation has several applications, ranging from anomaly detection and classification (see Locatelli et al. [2016]) to industrial applications and mobile communications (see Audibert and Bubeck [2010]), where users are to be allocated only those channels whose quality is above an acceptable threshold.

### 1.1 Related Work

A significant amount of literature is available on the stochastic MAB setting with respect to minimizing the cumulative regret. While the seminal works of Robbins [1952], Thompson [1933], and Lai and Robbins [1985] prove asymptotic lower bounds on the cumulative regret, the more recent work of Auer et al. [2002] proposes the UCB1 algorithm, which provides finite time-horizon guarantees. Subsequent work, such as Audibert and Bubeck [2009] and Auer and Ortner [2010], has improved the upper bounds on the cumulative regret. The authors in Auer and Ortner [2010] propose a round-based¹ version of the UCB algorithm, referred to as UCB-Improved. Of special mention is the work of Audibert et al. [2009], where the authors introduce a variance-aware UCB algorithm, referred to as UCB-V; it is shown that algorithms that take variance estimates into account along with mean estimates tend to perform better than algorithms that focus solely on mean estimation, such as UCB1. For a more detailed survey of the literature on UCB algorithms, we refer the reader to Bubeck and Cesa-Bianchi [2012].

In this work we are particularly interested in pure-exploration MABs, where the focus is primarily on the simple regret rather than the cumulative regret. The relationship between cumulative regret and simple regret is established in Bubeck et al. [2011], where the authors prove that minimizing the simple regret necessarily results in maximizing the cumulative regret. The pure-exploration problem has been explored mainly under the following two settings:

1. Fixed Budget setting: Here the learning algorithm has to suggest the best arm(s) within a fixed time horizon T, which is usually given as an input. The objective is to maximize the probability of returning the best arm(s). This is the scenario we consider in this paper.
In Audibert and Bubeck [2010] the authors propose the UCBE and Successive Rejects (SR) algorithms, and prove simple-regret guarantees for the problem of identifying the single best arm. In the combinatorial fixed-budget setup, Gabillon et al. [2011] propose the GapE and GapE-V algorithms that suggest, with high probability, the best m arms at the end of the time budget. Similarly, Bubeck et al. [2013] introduce the Successive Accept Reject (SAR) algorithm, which is an extension of the SR algorithm; SAR is a round-based algorithm whereby at the end of each round an arm is either accepted or rejected (based on certain confidence conditions), until the top m arms are suggested at the end of the budget with high probability. A similar combinatorial setup was explored in Chen et al. [2014], where the authors propose the Combinatorial Successive Accept Reject (CSAR) algorithm, which is similar in concept to SAR but with a more general setup.

¹ An algorithm is said to be round-based if it pulls all the arms an equal number of times in each round, and then proceeds to eliminate one or more arms that it identifies as sub-optimal.

2. Fixed Confidence setting: In this setting the learning algorithm has to suggest the best arm(s) with a fixed confidence (given as input), using as few pulls as possible. Single best-arm identification has been studied in Even-Dar et al. [2006], while for the combinatorial setup Kalyanakrishnan et al. [2012] have proposed the LUCB algorithm which, on termination, returns m arms that are ε-close to the true top-m arms with probability at least 1 − δ. For a detailed survey of this setting we refer the reader to Jamieson and Nowak [2014].

Apart from these two settings, a unified approach has also been suggested in Gabillon et al. [2012], which proposes the UGapEb and UGapEc algorithms that can work in both of the above settings. The thresholding bandit problem is a specific instance of the pure-exploration setup of Chen et al. [2014]. More recently, Locatelli et al. [2016] proposed the Anytime Parameter-free Thresholding (APT) algorithm, which provides an improved anytime guarantee over CSAR for the thresholding bandit problem.

### 1.2 Our Contribution

In this paper we propose the Augmented-UCB (AugUCB) algorithm for the fixed-budget setting of a specific combinatorial, pure-exploration, stochastic MAB called the thresholding bandit problem. AugUCB essentially combines the approaches of the UCB-Improved, CCB [Liu and Tsuruoka, 2016] and APT algorithms. Our algorithm takes into account the empirical variances of the arms along with the mean estimates; to the best of our knowledge this is the first variance-based algorithm for the considered TBP. Thus, we also address an open problem discussed in Auer and Ortner [2010], namely designing an algorithm that can eliminate arms based on variance estimates. In this regard, note that neither CSAR nor APT is variance-based.

Our theoretical contribution comprises proving an upper bound on the expected loss incurred by AugUCB (Theorem 3.1). In Table 1 we compare the upper bounds on the expected loss incurred by the various algorithms, including AugUCB. The terms H_1, H_2, H_{CSAR,2}, H_{σ,1} and H_{σ,2} represent various problem complexities and are defined in Section 3. From Section 3 we note that, for all K ≥ 8, we have log(K log K)·H_{σ,2} > log(2K)·H_{σ,2} ≥ H_{σ,1}. Thus, it follows that the upper bound for UCBEV is better than that for AugUCB.

Table 1: AugUCB vs. the state of the art.

| Algorithm | Upper bound on expected loss |
| --- | --- |
| AugUCB | exp(−T / (4096 log(K log K) H_{σ,2}) + log(2KT)) |
| UCBEV | exp(−(T − 2K) / (512 H_{σ,1}) + log(6KT)) |
| APT | exp(−T / (64 H_1) + 2 log((log(T) + 1) K)) |
| CSAR | exp(−(T − K) / (72 log(K) H_{CSAR,2}) + 2 log(K)) |
However, implementation of the UCBEV algorithm requires H_{σ,1} as input, whose computation is not realistic in practice. In contrast, our AugUCB algorithm requires no such complexity factor as input. Proceeding with the comparisons, we emphasize that the upper bound for AugUCB is, in fact, not directly comparable with those of APT and CSAR; this is because the complexity term H_{σ,2} is not explicitly comparable with either H_1 or H_{CSAR,2}. However, through extensive simulation experiments we find that AugUCB significantly outperforms APT, CSAR and other non-variance-based algorithms. AugUCB also outperforms UCBEV when the latter is run with non-optimal values of H_{σ,1}. In particular, we consider experimental scenarios comprising a large number of arms, with the variances of the arms in S_τ being large. AugUCB, being variance-based, exhibits superior performance under these settings.

The remainder of the paper is organized as follows. In Section 2 we present our AugUCB algorithm. Section 3 contains our main theorem on the expected loss, while Section 4 contains simulation experiments. We finally draw our conclusions in Section 5.

## 2 Augmented-UCB Algorithm

Notation and assumptions: A denotes the set of arms, and |A| = K is the number of arms in A. For arm i ∈ A, we use r_i to denote the true mean of the distribution from which its rewards are sampled, while r̂_i(t) denotes the estimated mean at time t. Formally, using n_i(t) to denote the number of times arm i has been pulled until time t, we have r̂_i(t) = (1/n_i(t)) Σ_{z=1}^{n_i(t)} X_{i,z}, where X_{i,z} is the reward sample received when arm i is pulled for the z-th time. Similarly, we use σ_i² to denote the true variance of the reward distribution corresponding to arm i, while v̂_i(t) is the estimated variance, i.e., v̂_i(t) = (1/n_i(t)) Σ_{z=1}^{n_i(t)} (X_{i,z} − r̂_i)². Whenever there is no ambiguity about the underlying time index t, for simplicity we drop t from the notation and simply write r̂_i, v̂_i and n_i. Let Δ_i = |τ − r_i| denote the distance of the true mean from the threshold τ. The rewards are assumed to take values in [0, 1].

The algorithm: The Augmented-UCB (AugUCB) algorithm is presented in Algorithm 1. AugUCB is essentially based on the arm-elimination method of UCB-Improved [Auer and Ortner, 2010], adapted to the thresholding bandit setting proposed in Locatelli et al. [2016]. However, unlike UCB-Improved (which is based on mean estimation alone), our algorithm employs variance estimates (as in Audibert et al. [2009]) for arm elimination; to the best of our knowledge this is the first variance-aware algorithm for the thresholding bandit problem. Further, we allow arm elimination at each time-step, in contrast to earlier work (e.g., Auer and Ortner [2010]; Chen et al. [2014]) where the arm-elimination task is deferred to the end of the respective exploration rounds. The details are presented below. The active set B_0 is initialized with all the arms in A. We divide the entire budget T into rounds/phases as in UCB-Improved, CCB, SAR and CSAR. At every time-step AugUCB checks the arm-elimination conditions, while parameters are updated only at the end of each round.
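Before turning to the pseudocode, here is a minimal sketch of how the per-arm statistics n_i, r̂_i and v̂_i defined above can be maintained online; the class and method names are illustrative and not from the paper.

```python
class ArmStats:
    """Tracks n_i, the empirical mean r_hat_i and the (1/n)-normalized empirical
    variance v_hat_i of a single arm, matching the definitions in Section 2."""

    def __init__(self):
        self.n = 0          # n_i: number of pulls so far
        self.sum = 0.0      # running sum of rewards
        self.sum_sq = 0.0   # running sum of squared rewards

    def update(self, x):
        """Record one reward sample X_{i,z}."""
        self.n += 1
        self.sum += x
        self.sum_sq += x * x

    @property
    def mean(self):
        """r_hat_i = (1/n_i) * sum_z X_{i,z}."""
        return self.sum / self.n

    @property
    def var(self):
        """v_hat_i = (1/n_i) * sum_z (X_{i,z} - r_hat_i)^2 = E[X^2] - (E[X])^2."""
        return max(self.sum_sq / self.n - self.mean ** 2, 0.0)
```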
As suggested by Liu and Tsuruoka [2016], to keep AugUCB from spending too much of its budget on early exploration, we no longer pull all the arms an equal number of times in each round. Instead, at each time-step we pull an arm from the active set B_m that minimizes (|r̂_i − τ| − 2s_i), where

$$ s_i = \sqrt{\frac{\rho\,\psi_m\,(\hat v_i + 1)\,\log(T\epsilon_m)}{4\,n_i}}, $$

with ρ being the arm-elimination parameter and ψ_m the exploration regulatory factor. This condition ensures that an arm closer to the threshold τ is pulled; the parameter ρ can be used to fine-tune the elimination interval. The choice of the exploration factor ψ_m comes directly from Audibert and Bubeck [2010] and Bubeck et al. [2011], where it is argued that in the pure-exploration setup the exploration factor must be linear in T (so that an exponentially small probability of error is achieved) rather than logarithmic in T (which is more suited to minimizing cumulative regret).

Algorithm 1: AugUCB
Input: time budget T; parameter ρ; threshold τ
Initialization: B_0 = A; m = 0; ε_0 = 1; ψ_0 = Tε_0 / (128 (log((3/16) K log K))²); ℓ_0 = ⌈2ψ_0 log(Tε_0) / ε_0⌉
Pull each arm once
for t = K + 1, …, T do
    Pull arm j ∈ argmin_{i ∈ B_m} { |r̂_i − τ| − 2s_i }
    for i ∈ B_m do
        if (r̂_i + s_i < τ − s_i) or (r̂_i − s_i > τ + s_i) then
            B_m ← B_m \ {i}   (arm deletion)
        end if
    end for
    if t ≥ N_m and m ≤ M then
        Reset parameters:
            ε_{m+1} ← ε_m / 2
            B_{m+1} ← B_m
            ψ_{m+1} ← Tε_{m+1} / (128 (log((3/16) K log K))²)
            ℓ_{m+1} ← ⌈2ψ_{m+1} log(Tε_{m+1}) / ε_{m+1}⌉
            N_{m+1} ← t + |B_{m+1}| ℓ_{m+1}
            m ← m + 1
    end if
end for
Output: Ŝ_τ = {i : r̂_i ≥ τ}.
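To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch under the stated assumptions (rewards in [0, 1]). The round cap M and the initialization of N_0 are not fixed by the pseudocode above, so the values chosen here are illustrative assumptions, as are all function names.

```python
import math

def aug_ucb(pull, K, T, tau, rho=1.0 / 3.0):
    """Sketch of AugUCB (Algorithm 1). `pull(i)` must return a reward in [0, 1]
    for arm i. Returns the recommended set {i : r_hat_i >= tau}."""
    a = math.log((3.0 / 16.0) * K * math.log(K))   # a = log((3/16) K log K)
    n = [0] * K                                    # pull counts n_i
    s1 = [0.0] * K                                 # running reward sums
    s2 = [0.0] * K                                 # running squared-reward sums

    def observe(i):
        x = pull(i)
        n[i] += 1; s1[i] += x; s2[i] += x * x

    def mean(i):
        return s1[i] / n[i]

    def conf(i, psi, eps):
        v = max(s2[i] / n[i] - mean(i) ** 2, 0.0)  # v_hat_i
        return math.sqrt(rho * psi * (v + 1.0) * math.log(T * eps) / (4.0 * n[i]))

    eps = 1.0
    psi = T * eps / (128.0 * a * a)
    ell = math.ceil(2.0 * psi * math.log(T * eps) / eps)
    M = int(0.5 * math.log2(T / math.e))           # round cap (assumed; not in the pseudocode)
    B = set(range(K))                              # active set B_0
    for i in range(K):                             # pull each arm once
        observe(i)
    m, N = 0, K + len(B) * ell                     # N_0 initialization (assumed)
    for t in range(K + 1, T + 1):
        if not B:
            break
        # pull the active arm whose index |r_hat_i - tau| - 2 s_i is smallest
        j = min(B, key=lambda i: abs(mean(i) - tau) - 2.0 * conf(i, psi, eps))
        observe(j)
        for i in list(B):                          # per-step arm-deletion check
            s = conf(i, psi, eps)
            if mean(i) + s < tau - s or mean(i) - s > tau + s:
                B.discard(i)
        if t >= N and m <= M:                      # end of round: reset parameters
            eps /= 2.0
            psi = T * eps / (128.0 * a * a)
            ell = math.ceil(2.0 * psi * math.log(T * eps) / eps)
            N = t + len(B) * ell
            m += 1
    return {i for i in range(K) if mean(i) >= tau}
```

A caller supplies `pull(i)`, for example a function drawing samples from a simulator; the sketch mirrors the per-step elimination check and the end-of-round parameter resets of Algorithm 1.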
## 3 Theoretical Results

Let us begin by recalling the following definitions of the problem complexity, as introduced in Locatelli et al. [2016]:

$$ H_1 = \sum_{i \in A} \frac{1}{\Delta_i^2} \qquad\text{and}\qquad H_2 = \max_{i \in A}\; i\,\Delta_{(i)}^{-2}, $$

where (Δ_{(i)} : i ∈ A) is obtained by arranging (Δ_i : i ∈ A) in increasing order. Also, from Chen et al. [2014] we have

$$ H_{CSAR,2} = \max_{i \in A}\; i\,\Delta_{(i)}^{-2}, $$

with the gaps defined as in that work; H_{CSAR,2} is the complexity term appearing in the bound for the CSAR algorithm. The relations between the above complexity terms are as follows (see Locatelli et al. [2016]): H_1 ≤ log(2K)·H_2 and H_1 ≤ log(K)·H_{CSAR,2}.

As ours is a variance-aware algorithm, we also require H_{σ,1} (as defined in Gabillon et al. [2011]), which incorporates the reward variances:

$$ H_{\sigma,1} = \sum_{i \in A} \frac{\sigma_i + \sqrt{\sigma_i^2 + (16/3)\,\Delta_i}}{\Delta_i^2}. $$

Finally, analogous to H_{CSAR,2}, in this paper we introduce the complexity term H_{σ,2}, given by

$$ H_{\sigma,2} = \max_{i \in A}\; i\,\tilde\Delta_{(i)}^{-2}, \qquad\text{where}\quad \tilde\Delta_i^2 = \frac{\Delta_i^2}{\sigma_i + \sqrt{\sigma_i^2 + (16/3)\,\Delta_i}} $$

and (Δ̃_{(i)}) is an increasing ordering of (Δ̃_i). Following the results in Audibert and Bubeck [2010], we can show that H_{σ,2} ≤ H_{σ,1} ≤ log(K)·H_{σ,2} ≤ log(2K)·H_{σ,2}.

Our main result is summarized in the following theorem, where we prove an upper bound on the expected loss.

Theorem 3.1. For K ≥ 4 and ρ = 1/3, the expected loss of the AugUCB algorithm satisfies

$$ \mathbb{E}[L(T)] \;\le\; 2KT\exp\!\Big(-\frac{T}{4096\,\log(K\log K)\,H_{\sigma,2}}\Big). $$

Proof. The proof comprises two modules. In the first module we investigate the conditions required for an arm to be eliminated within a specified number of rounds, which is motivated by the technique in Auer and Ortner [2010]. Bounds on the arm-elimination probability are then obtained; however, since we use variance estimates, we invoke the Bernstein inequality (as in Audibert et al. [2009]) rather than the Chernoff-Hoeffding bound (which is appropriate for UCB-Improved [Auer and Ortner, 2010]). In the second module, as in Locatelli et al. [2016], we first define a favourable event that yields an upper bound on the expected loss. Using the union bound, we then incorporate the result from the first module (on the arm-elimination probability), and finally derive the result through a series of simplifications. The details are as follows.

Arm elimination: Recall the notation used in the algorithm. For each arm i ∈ A, define

$$ m_i = \min\Big\{ m : \sqrt{\rho\,\epsilon_m} < \frac{\Delta_i}{2} \Big\}. $$

In the m_i-th round, whenever n_i ≥ ℓ_{m_i} = ⌈2ψ_{m_i} log(Tε_{m_i})/ε_{m_i}⌉, we obtain (since v̂_i ∈ [0, 1])

$$ s_i \;\le\; \sqrt{\frac{\rho(\hat v_i + 1)\,\epsilon_{m_i}}{8}} \;\le\; \sqrt{\frac{\rho\,\epsilon_{m_i}}{4}} \;<\; \frac{\Delta_i}{4}. \qquad (1) $$

First, let us consider a bad arm i ∈ A (i.e., r_i < τ, so that Δ_i = τ − r_i). We note that, in the m_i-th round, whenever r̂_i ≤ r_i + 2s_i the arm i is eliminated as a bad arm. This is easy to verify: using (1) we obtain

$$ \hat r_i \;\le\; r_i + 2 s_i \;<\; r_i + \Delta_i - 2 s_i \;=\; \tau - 2 s_i, $$

i.e., r̂_i + s_i < τ − s_i, which is precisely one of the elimination conditions in Algorithm 1. Thus, the probability that a bad arm is not eliminated correctly in the m_i-th round (or before) satisfies

$$ P(\hat r_i > r_i + 2 s_i) \;\le\; P(\hat r_i > r_i + 2 \bar s_i) + P(\hat v_i \ge \sigma_i^2 + \rho\,\epsilon_{m_i}), \qquad (2) $$

where s̄_i := √(ρ ψ_{m_i}(σ_i² + ρε_{m_i} + 1) log(Tε_{m_i})/(4n_i)) is the value of s_i with v̂_i replaced by σ_i² + ρε_{m_i}. Note that, substituting n_i = ℓ_{m_i}, s̄_i simplifies to

$$ \bar s_i \;=\; \sqrt{\frac{\rho\,\epsilon_{m_i}\,(\sigma_i^2 + \rho\,\epsilon_{m_i} + 1)}{8}}. \qquad (3) $$

The first term on the right-hand side of (2) is bounded using the Bernstein inequality:

$$ P(\hat r_i > r_i + 2\bar s_i) \;\le\; \exp\!\Big(-\frac{3\rho\,T\epsilon_{m_i}\log(T\epsilon_{m_i})}{128\,a^2}\cdot \frac{\sigma_i^2 + \rho\,\epsilon_{m_i} + 1}{3\sigma_i^2 + \rho\,\epsilon_{m_i}}\Big) \;=:\; \exp(-Z_i), \qquad (4) $$

where we have used (2s̄_i)²n_i = ρψ_{m_i}(σ_i² + ρε_{m_i} + 1) log(Tε_{m_i}) and ψ_{m_i} = Tε_{m_i}/(128a²), with a = log((3/16) K log K); for simplicity, Z_i denotes the resulting exponent. For the second term on the right-hand side of (2), write v̂_i = (1/n_i) Σ_{t=1}^{n_i}(X_{i,t} − r_i)² − (r̂_i − r_i)² and drop the (non-negative) second term, so that

$$ P(\hat v_i \ge \sigma_i^2 + \rho\,\epsilon_{m_i}) \;\le\; P\Big(\frac{1}{n_i}\sum_{t=1}^{n_i}(X_{i,t} - r_i)^2 \ge \sigma_i^2 + \rho\,\epsilon_{m_i}\Big) \;\le\; \exp(-Z_i), \qquad (5) $$

where the last inequality follows from (3) and another application of the Bernstein inequality. Thus, using (4) and (5) in (2), we obtain P(r̂_i > r_i + 2s_i) ≤ 2 exp(−Z_i). Proceeding similarly, for a good arm i ∈ A (r_i ≥ τ), the probability that it is not correctly eliminated in the m_i-th round (or before) is bounded as P(r̂_i < r_i − 2s_i) ≤ 2 exp(−Z_i). In general, for any i ∈ A we have

$$ P(|\hat r_i - r_i| > 2 s_i) \;\le\; 4\exp(-Z_i). \qquad (6) $$

Favourable event: Following the notation in Locatelli et al. [2016], we define the event

$$ \xi = \big\{ \forall i \in A,\ \forall t = 1, 2, \ldots, T : |\hat r_i - r_i| \le 2 s_i \big\}. $$

Note that on ξ each arm i ∈ A is eliminated correctly in the m_i-th round (or before). Thus, it follows that E[L(T)] ≤ P(ξᶜ). Since ξᶜ can be expressed as a union of the events {|r̂_i − r_i| > 2s_i} over all i ∈ A and all t = 1, 2, …, T, the union bound together with (6) gives

$$ P(\xi^c) \;\le\; \sum_{i\in A}\sum_{t=1}^{T} P(|\hat r_i - r_i| > 2 s_i) \;\le\; \sum_{i\in A}\sum_{t=1}^{T} 4\exp(-Z_i). $$

Substituting for Z_i and successively simplifying the resulting expression yields

$$ P(\xi^c) \;\le\; 4KT\exp\!\Big(-\frac{T}{4096\,\log(K\log K)\,H_{\sigma,2}}\Big). $$

The simplification relies on the following observations: (i) in round m_i we have, by the definition of m_i, √(ρε_{m_i}) < Δ_i/2, which allows ε_{m_i} to be expressed in terms of Δ_i; (ii) the function x ↦ x exp(−Cx²), x ∈ [0, 1], is decreasing on [1/√(2C), 1] for any C > 0 (see Bubeck et al. [2011]; Auer and Ortner [2010]), applied with C = T/e and min_{j∈A} Δ_j taken to be at least √(e/T); (iii) the inequality Δ_i ≤ √(σ_i² + (16/3)Δ_i), which holds because Δ_i ∈ [0, 1]; (iv) the substitutions Δ̃_i² = Δ_i²/(σ_i + √(σ_i² + (16/3)Δ_i)) and a = log((3/16) K log K); and (v) Δ̃_i^{-2} ≤ max_{j∈A} j Δ̃_{(j)}^{-2} = H_{σ,2} for every i, together with a ≤ log(K log K).
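Theorem 3.1 is stated in terms of H_{σ,2}, and Table 1 also involves H_1 and H_{σ,1}. Under the complexity definitions reconstructed above (treat the exact form, in particular the (16/3)Δ_i term, as following that reconstruction), a small helper that computes these terms for a given problem instance might look as follows; the function name is illustrative.

```python
import math

def complexities(means, variances, tau):
    """Problem-complexity terms of Section 3, computed from true means and
    variances (sigma_i^2). Assumes r_i != tau for every arm.
    H1        = sum_i 1 / Delta_i^2
    Hsigma_1  = sum_i 1 / Delta_tilde_i^2
    Hsigma_2  = max_i i * Delta_tilde_(i)^{-2}  (Delta_tilde sorted increasingly),
    where Delta_tilde_i^2 = Delta_i^2 / (sigma_i + sqrt(sigma_i^2 + (16/3) Delta_i))."""
    deltas = [abs(tau - r) for r in means]
    H1 = sum(1.0 / d ** 2 for d in deltas)
    dtilde_sq = [d ** 2 / (math.sqrt(v) + math.sqrt(v + (16.0 / 3.0) * d))
                 for d, v in zip(deltas, variances)]
    Hsigma_1 = sum(1.0 / x for x in dtilde_sq)
    ordered = sorted(dtilde_sq)                         # increasing order of Delta_tilde_i
    Hsigma_2 = max((i + 1) / x for i, x in enumerate(ordered))
    return H1, Hsigma_1, Hsigma_2
```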
## 4 Numerical Experiments

In this section we empirically compare the performance of AugUCB against the APT, UCBE, UCBEV, CSAR and uniform-allocation (UA) algorithms. A brief description of these algorithms follows:

- APT: This algorithm is from Locatelli et al. [2016]; we set ε = 0.05, which is the margin of error within which APT suggests the set of good arms.
- AugUCB: This is the Augmented-UCB algorithm proposed in this paper; as in Theorem 3.1, we set ρ = 1/3.
- UCBE: This is a modification of the UCBE algorithm of Audibert and Bubeck [2010] (which was originally proposed for the best-arm identification problem); here we set a = (T − K)/H_1, and at each time-step an arm i ∈ argmin_i { |r̂_i − τ| − √(a/n_i) } is pulled.
- UCBEV: This is a modification of the algorithm in Gabillon et al. [2011] (proposed for the Top-M problem); its implementation is identical to UCBE, but with a = (T − 2K)/H_{σ,1}. As mentioned earlier, UCBEV's implementation would not be possible in real scenarios, as it requires computing the problem complexity H_{σ,1}; however, for theoretical comparison we show the best performance achievable by UCBEV. In Experiment-6 we explore UCBEV further with alternative settings of a.
- CSAR: A modification of the successive accept-reject algorithm of Chen et al. [2014]; here, we reject the arm farthest from τ after each round.
- UA: The naive strategy in which, at each time-step, an arm is sampled uniformly at random from A; UA is known to be optimal if all arms are equally difficult to classify.

Motivated by the settings considered in Locatelli et al. [2016], we design six experimental scenarios obtained by varying the arm means and variances. Each experiment consists of K = 100 arms (indexed i = 1, 2, …, 100), of which S_τ = {6, 7, …, 10}, with the threshold fixed at τ = 0.5. In each experiment, every algorithm is run independently for 10000 time-steps. At every time-step, the output set Ŝ_τ suggested by each algorithm is recorded; the output is counted as an error if Ŝ_τ ≠ S_τ. In Figure 1, for each experiment, we report the percentage of error incurred by the different algorithms as a function of time; the error percentage is obtained by repeating each experiment independently for 500 iterations and computing the fraction of errors. The details of the experiments are as follows.

Experiment-1: The reward distributions are Gaussian with means r_{1:4} = 0.2 + (0:3)·0.05, r_5 = 0.45, r_6 = 0.55, r_{7:10} = 0.65 + (0:3)·0.05 and r_{11:100} = 0.4. Thus, the means of the first 10 arms follow an arithmetic progression. The remaining arms have identical means; this setting is chosen because a significant budget is then required to explore these arms, which increases the problem complexity. The corresponding variances are σ²_{1:5} = 0.5 and σ²_{6:10} = 0.6, while the σ²_{11:100} are chosen independently and uniformly in the interval [0.38, 0.42]; note that the variances of the arms in S_τ are higher than those of the other arms.
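For concreteness, the following sketch builds the Experiment-1 instance and estimates the error percentage of the UA baseline by Monte Carlo; the function names, seed and sampling details are illustrative assumptions, not code from the paper. The other experiments below change only the `means` and `variances` arrays.

```python
import numpy as np

def experiment_1_instance(seed=0):
    """Arm means and variances for Experiment-1 (0-based indices: arm 1 of the
    paper is index 0 here)."""
    rng = np.random.default_rng(seed)
    means = np.full(100, 0.4)                     # r_{11:100} = 0.4
    means[0:4] = 0.2 + np.arange(4) * 0.05        # r_{1:4}
    means[4], means[5] = 0.45, 0.55               # r_5, r_6
    means[6:10] = 0.65 + np.arange(4) * 0.05      # r_{7:10}
    variances = np.empty(100)
    variances[0:5] = 0.5                          # sigma^2_{1:5}
    variances[5:10] = 0.6                         # sigma^2_{6:10}
    variances[10:] = rng.uniform(0.38, 0.42, 90)  # sigma^2_{11:100} ~ U[0.38, 0.42]
    return means, variances

def error_percentage_ua(means, variances, tau=0.5, T=10000, runs=500, seed=1):
    """Monte-Carlo estimate of the UA error percentage: each run allocates the
    budget (almost) equally, samples Gaussian rewards, and outputs {i : r_hat_i >= tau}."""
    rng = np.random.default_rng(seed)
    K = len(means)
    S_tau = set(np.where(means >= tau)[0])
    errors = 0
    for _ in range(runs):
        per_arm = T // K
        r_hat = [rng.normal(means[i], np.sqrt(variances[i]), per_arm).mean()
                 for i in range(K)]
        S_hat = {i for i, m in enumerate(r_hat) if m >= tau}
        errors += int(S_hat != S_tau)             # error iff the output set is wrong
    return 100.0 * errors / runs
```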
The corresponding results are shown in Figure 1(a), from which we see that UCBEV, which has access to the problem complexity while also being variance-aware, outperforms all the other algorithms (including UCBE, which also has access to the problem complexity but does not take the variances of the arms into account). Interestingly, the performance of our AugUCB (which requires no complexity input) is comparable with that of UCBEV, while it outperforms UCBE, APT and the other non-variance-aware algorithms that we have considered.

[Figure 1: Performance of the various TBP algorithms, in terms of error percentage vs. time-step, for the six experimental scenarios. Panels: (a) Expt-1: Arithmetic Progression (Gaussian); (b) Expt-2: Geometric Progression (Gaussian); (c) Expt-3: Three-Group Setting (Gaussian); (d) Expt-4: Two-Group Setting (Gaussian); (e) Expt-5: Two-Group Setting (Advanced); (f) Expt-6: Two-Group Setting (Advanced). Each panel plots error percentage against time-step (up to 10⁴) for APT, AugUCB, UCBE, UCBEV, CSAR and UA; panel (f) compares AugUCB with UCBEV(0.25), UCBEV(1) and UCBEV(256).]

Experiment-2: We again consider Gaussian reward distributions. However, here the means of the first 10 arms form a geometric progression. Specifically, the reward means are r_{1:4} = 0.4 − (0.2)^{1:4}, r_5 = 0.45, r_6 = 0.55, r_{7:10} = 0.6 + (0.2)^{5−(1:4)} and r_{11:100} = 0.4; the arm variances are as in Experiment-1. The corresponding results are shown in Figure 1(b). We again observe AugUCB outperforming all the other algorithms except UCBEV.

Experiment-3: Here, the first 10 arms are partitioned into three groups; the reward distributions are again Gaussian. Specifically, the reward means are r_{1:3} = 0.1, r_{4:7} = {0.35, 0.45, 0.55, 0.65} and r_{8:10} = 0.9; as before, r_{11:100} = 0.4 and all the variances are as in Experiment-1. The results for this scenario are presented in Figure 1(c). The observations are in line with those made in the previous experiments.

Experiment-4: The setting is similar to that of Experiment-3, but with the first 10 arms partitioned into two groups; the respective means are r_{1:5} = 0.45 and r_{6:10} = 0.55. The corresponding results are shown in Figure 1(d), which again validates the good performance of AugUCB.

Experiment-5: This is again the two-group setting involving Gaussian reward distributions. The reward means are as in Experiment-4, while the variances are σ²_{1:5} = 0.3 and σ²_{6:10} = 0.8; the σ²_{11:100} are chosen independently and uniformly in the interval [0.2, 0.3]. The corresponding results are shown in Figure 1(e). We refer to this setup as "Advanced" because the chosen variance values are such that only variance-aware algorithms can be expected to perform well. Indeed, UCBEV performs very well in comparison with the other algorithms. However, it is interesting to note that the performance of AugUCB catches up with that of UCBEV as the time-step increases, while significantly outperforming the other non-variance-aware algorithms.
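Since Experiment-6 below reuses this "Advanced" instance, here is a small sketch constructing it (0-based indices; the means of arms 11 to 100, which the text does not restate, are assumed to remain at 0.4 as in the earlier experiments):

```python
import numpy as np

def experiment_5_instance(seed=0):
    """Arm means and variances for the 'Advanced' two-group setting (Expt-5/6)."""
    rng = np.random.default_rng(seed)
    means = np.full(100, 0.4)                     # arms 11..100 (assumed unchanged at 0.4)
    means[0:5] = 0.45                             # r_{1:5}
    means[5:10] = 0.55                            # r_{6:10}
    variances = np.empty(100)
    variances[0:5] = 0.3                          # sigma^2_{1:5}
    variances[5:10] = 0.8                         # sigma^2_{6:10}
    variances[10:] = rng.uniform(0.2, 0.3, 90)    # sigma^2_{11:100} ~ U[0.2, 0.3]
    return means, variances
```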
Experiment-6: We use the same setting as in Experiment-5, but explore UCBEV further with different values of the exploration parameter a. The corresponding results are shown in Figure 1(f). As studied in Locatelli et al. [2016], we implement UCBEV with a_i = 4^i (T − 2K)/H_{σ,1} for i = −1, 0, 4. Here, a_0 corresponds to UCBEV(1) in Figure 1(f), which is UCBEV run with the optimal choice of H_{σ,1}; the other choices correspond to UCBEV(0.25) and UCBEV(256). For these non-optimal choices of a_i we see that UCBEV(a_i) is significantly outperformed by AugUCB.

Finally, note that in all the above experiments the CSAR algorithm, although it performs well initially, quickly exhausts its budget and saturates at a higher error percentage. This is because it pulls all arms equally in each round, with the round lengths being non-adaptive.

## 5 Conclusion

We proposed the AugUCB algorithm for a fixed-budget, pure-exploration TBP. Our algorithm employs both mean and variance estimates for arm elimination; to our knowledge, this is the first variance-based algorithm for the specific TBP that we have considered. We first proved an upper bound on the expected loss incurred by AugUCB. We then conducted simulation experiments to validate its performance. In comparison with APT, CSAR and other non-variance-based algorithms, we find that the performance of AugUCB is significantly better. Further, the performance of AugUCB is comparable with that of UCBEV (which is also variance-based), although the latter exhibits slightly better performance. However, UCBEV is not implementable in practice as it requires computing the problem complexity H_{σ,1}, while AugUCB (requiring no such input) can be easily deployed in real-life scenarios. Designing an anytime version of the AugUCB algorithm is an interesting direction for future work.

## Acknowledgements

This work was supported by funding from IIT Madras under project CSE/14-15/831/RFTP/BRAV. The work of the second author is supported by an INSPIRE Faculty Award of the Department of Science and Technology.

## References

Jacob D. Abernethy, Kareem Amin, and Ruihao Zhu. Threshold bandits, with and without censored feedback. In Advances in Neural Information Processing Systems, pages 4889–4897, 2016.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852, 2011.

Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In ICML (1), pages 258–265, 2013.
Shouyuan Chen, Tian Lin, Irwin King, Michael R. Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. The Journal of Machine Learning Research, 7:1079–1105, 2006.

Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Sébastien Bubeck. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems, pages 2222–2230, 2011.

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pages 3212–3220, 2012.

Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014.

Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 655–662, 2012.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Yun-Ching Liu and Yoshimasa Tsuruoka. Modification of improved upper confidence bounds for regulating exploration in Monte-Carlo tree search. Theoretical Computer Science, 2016.

Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. arXiv preprint arXiv:1605.08671, 2016.

Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1952.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.