Adaptive Multiple-Arm Identification

Jiecao Chen*1, Xi Chen*2, Qin Zhang*1, Yuan Zhou*1

Abstract

We study the problem of selecting K arms with the highest expected rewards in a stochastic n-armed bandit game. This problem has a wide range of applications, e.g., A/B testing, crowdsourcing, and simulation optimization. Our goal is to develop a PAC algorithm, which, with probability at least 1 − δ, identifies a set of K arms with aggregate regret at most ϵ. The notion of aggregate regret for multiple-arm identification was first introduced in Zhou et al. (2014); it is defined as the difference of the averaged expected rewards between the selected set of arms and the best K arms. In contrast to Zhou et al. (2014), which only provides an instance-independent sample complexity, we introduce a new hardness parameter for characterizing the difficulty of any given instance. We further develop two algorithms and establish the corresponding sample complexity in terms of this hardness parameter. The derived sample complexity can be significantly smaller than state-of-the-art results for a large class of instances, and matches the instance-independent lower bound up to a log(ϵ^{-1}) factor in the worst case. We also prove a lower bound result showing that the extra log(ϵ^{-1}) is necessary for instance-dependent algorithms using the introduced hardness parameter.

1. Introduction

Given a set of alternatives with different quality, identifying high-quality alternatives via a sequential experiment is

*Equal contribution. 1Computer Science Department, Indiana University, Bloomington, IN, USA. 2Stern School of Business, New York University, New York, NY, USA. Correspondence to: Jiecao Chen, Xi Chen, Qin Zhang, Yuan Zhou.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
an important problem in the multi-armed bandit (MAB) literature, where it is also known as the pure-exploration problem. This problem has a wide range of applications. For example, consider the A/B/C testing problem with multiple website designs, where each candidate design corresponds to an alternative. In order to select high-quality designs, an agent could display different designs to website visitors and measure the attractiveness of each design. The question is: how should the agent adaptively select the design to display next so that the high-quality designs can be quickly and accurately identified? As another example, in crowdsourcing it is critical to identify high-quality workers from a large pool of noisy workers. An effective strategy is to test workers with gold questions, i.e., questions whose answers are known, provided by domain experts. Since the agent has to pay a fixed monetary reward for each answer from a worker, it is important to implement a cost-effective strategy that selects the top workers with the minimum number of tests. Other applications include simulation optimization, clinical trials, etc.

More formally, we assume that there are n alternative arms, where the i-th arm is associated with an unknown reward distribution Di with mean θi. For ease of illustration, we assume each Di is supported on [0, 1]. In practice, this assumption is easy to satisfy by proper scaling; for example, the traffic of a website, or the correctness of an answer from a crowd worker (which simply takes the value 0 or 1), can be scaled to [0, 1]. The mean reward θi characterizes the quality of the i-th alternative. The agent sequentially pulls arms, and upon each pull of the i-th arm an i.i.d. reward from Di is observed. The goal of top-K arm identification is to design an adaptive arm-pulling strategy so that the top K arms with the largest mean rewards can be identified with the minimum number of trials.
In practice, identifying the exact top-K arms usually requires a large number of arm pulls, which can be wasteful. In many applications (e.g., crowdsourcing), it suffices to find an approximate set of top-K arms. To measure the quality of the selected arms, we adopt the notion of aggregate regret (or regret for short) from Zhou et al. (2014). In particular, we assume that the arms are ordered by their means, θ1 ≥ θ2 ≥ ⋯ ≥ θn, so that the set of the best K arms is {1, . . . , K}. For a selected arm set T of size |T| = K, the aggregate regret RT is defined as

RT = (1/K) (Σ_{i=1}^{K} θi − Σ_{i∈T} θi). (1)

A set of arms T with aggregate regret at most a pre-determined tolerance level ϵ (i.e., RT ≤ ϵ) is called a set of ϵ-top-K arms. In this paper, we consider the ϵ-top-K-arm problem in the fixed-confidence setting: given a target confidence level δ > 0, the goal is to find a set of ϵ-top-K arms with probability at least 1 − δ. This is also known as the PAC (probably approximately correct) learning setting. We are interested in achieving this goal with as few arm pulls (the sample complexity) as possible. To solve this problem, Zhou et al. (2014) proposed the OptMAI algorithm and established its sample complexity Θ((n/ϵ²)(1 + ln(δ^{-1})/K)), which is shown to be asymptotically optimal. However, the algorithm and the corresponding sample complexity in Zhou et al. (2014) are non-adaptive to the underlying instance. In other words, the algorithm does not use the information obtained from past samples to adjust its future sampling strategy; as a result, the sample complexity only involves the parameters K, n, δ, and ϵ, but is independent of {θi}_{i=1}^{n}. Chen et al. (2014) developed the CLUCB-PAC algorithm and established an instance-dependent sample complexity for a more general class of problems, which includes the ϵ-top-K-arm identification problem as one of the key examples.
When applying the CLUCB-PAC algorithm to identify ϵ-top-K arms, the sample complexity becomes O((log H(0,ϵ) + log δ^{-1}) · H(0,ϵ)), where H(0,ϵ) = Σ_{i=1}^{n} min{Δi^{-2}, ϵ^{-2}} with Δi = θi − θ_{K+1} for i ≤ K and Δi = θK − θi for i > K. The reason why we adopt the notation H(0,ϵ) will become clear in Section 1.1. However, this bound can be improved for the following two reasons. First, intuitively, the hardness parameter H(0,ϵ) is the total number of pulls needed to decide, for each arm, whether it is among the top-K arms or the rest, so that the algorithm can accept or reject the arm (when an arm's mean is ϵ-close to the boundary between the top-K arms and the rest, it can be either selected or rejected). However, in many cases, even if an arm's mean is ϵ-far from the boundary, we may still remain undecided about the comparison between its mean and the boundary, because either selecting or rejecting the arm satisfies the aggregate regret bound. This can lead to fewer pulls and a smaller hardness parameter for the same instance. Second, the worst-case sample complexity of CLUCB-PAC is O((log n + log ϵ^{-1} + log δ^{-1}) · nϵ^{-2}). When δ is a constant, this bound is a log n factor larger than that of the best non-adaptive algorithm in Zhou et al. (2014).

In this paper, we explore both of the above directions and introduce new instance-sensitive algorithms for identifying ϵ-top-K arms. These algorithms significantly improve on the sample complexity of CLUCB-PAC for many common instances, and almost match the best non-adaptive algorithm in the worst case. Specifically, we first introduce a new parameter H to characterize the hardness of a given instance. This new hardness parameter H can be smaller than the hardness parameter H̃ used in the literature for many natural instances.
For example, we show in Lemma 1 that when {θi}_{i=1}^{n} are sampled from a continuous distribution with a bounded probability density function (a common assumption in Bayesian MAB and natural for many applications), for K = γn with γ ≤ 0.5, our hardness parameter is H = O(n/√ϵ) while H̃ = Ω(n/ϵ). Using this new hardness parameter H, we first propose an easy-to-implement algorithm, ADAPTIVETOPK, and relate its sample complexity to H. In Theorem 1, we show that ADAPTIVETOPK uses O((log log ϵ^{-1} + log n + log δ^{-1}) · H) pulls to identify ϵ-top-K arms with probability at least 1 − δ. Note that this bound has a similar form to the one in Chen et al. (2014), but, as mentioned above, the hardness parameter improves by a factor of ϵ^{-1/2} for the instances to which Lemma 1 applies. We then propose a second algorithm (IMPROVEDTOPK) with an even smaller sample complexity, which removes the log n factor. In Theorem 2, we show that this algorithm uses O((log ϵ^{-1} + log δ^{-1}) · H) pulls to identify ϵ-top-K arms with probability 1 − δ. Since H is always O(nϵ^{-2}) (which will become clear when H is defined in Section 1.1), the worst-case sample complexity of IMPROVEDTOPK matches the best instance-independent bound of Zhou et al. (2014) up to an extra log(ϵ^{-1}) factor (for constant δ). We also show that this extra log(ϵ^{-1}) factor is a necessary cost of being instance-adaptive (Theorem 3). It is also noteworthy that, as a by-product of establishing IMPROVEDTOPK, we develop an algorithm that approximately identifies the k-th best arm, which may be of independent interest. Details are deferred to the full version of this paper.1 We are now ready to introduce our new hardness parameter and summarize the main results in technical detail.

1.1. Summary of Main Results

Following the existing literature (see, e.g., Bubeck et al. (2013)), we first define the gap of the i-th arm:

Δi(K) = θi − θ_{K+1} if i ≤ K, and Δi(K) = θK − θi if i ≥ K + 1. (2)
1 The full version of this paper is available online at https://arxiv.org/abs/1706.01026.

Note that when K = 1, Δi(K) becomes θ1 − θi for all i ≥ 2, and Δ1(K) = θ1 − θ2. When K is clear from the context, we simply write Δi for Δi(K). One commonly used hardness parameter for quantifying the sample complexity in the existing literature (see, e.g., Bubeck et al. (2013); Karnin et al. (2013)) is H̃ ≜ Σ_{i=1}^{n} Δi^{-2}. If some gap Δi is extremely small, the value of H̃, and thus the corresponding sample complexity, can be extremely large. This hardness parameter is natural when the goal is to identify the exact top-K arms, where a sufficient gap between an arm and the boundary (i.e., θK and θ_{K+1}) is necessary. However, in many applications (e.g., finding high-quality workers in crowdsourcing), it is an overkill to select the exact top-K arms. For example, if all the top-M arms with M > K have very close means, then any subset of them of size K forms an ϵ-top-K set in terms of the aggregate regret in (1). Therefore, to quantify the sample complexity when the metric is the aggregate regret, we need to construct a new hardness parameter.

Given K and an error bound ϵ, let us define t = t(ϵ, K) to be the largest t ∈ {0, 1, 2, . . . , K − 1} such that

Δ_{K−t} · t ≤ Kϵ and Δ_{K+t+1} · t ≤ Kϵ. (3)

Note that Δ_{K−t} · t = (θ_{K−t} − θ_{K+1}) · t upper-bounds the total gap of the t worst arms among the top-K arms, and Δ_{K+t+1} · t = (θK − θ_{K+t+1}) · t upper-bounds the total gap of the t best arms among the non-top-K arms. Intuitively, the definition in (3) means that we can tolerate exchanging at most the t best arms among the non-top-K arms with the t worst arms among the top-K arms. Given t = t(ϵ, K), we define

Ψt = min(Δ_{K−t}, Δ_{K+t+1}), (4)
Ψϵt = max(ϵ, Ψt). (5)

We now introduce the following parameter to characterize the hardness of a given instance:

H = H(t,ϵ) = Σ_{i=1}^{n} min{Δi^{-2}, (Ψϵt)^{-2}}. (6)

It is worth noting that, no matter how small the gaps Δi are, since Ψϵt ≥ ϵ we always have H(t,ϵ) ≤ nϵ^{-2}.
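The quantities in (2)-(6) are directly computable from a sorted mean vector. The following sketch is our own illustration (the function name and 0-indexing conventions are assumptions, not from the paper):

```python
def hardness(theta, K, eps):
    """Compute t(eps, K), Psi^eps_t, and H^(t,eps) following Eqs. (2)-(6)."""
    theta = sorted(theta, reverse=True)
    n = len(theta)
    # Eq. (2): gaps Delta_i; position i (0-indexed) holds the arm of rank i+1.
    delta = [theta[i] - theta[K] if i < K else theta[K - 1] - theta[i]
             for i in range(n)]
    # Eq. (3): largest t with Delta_{K-t}*t <= K*eps and Delta_{K+t+1}*t <= K*eps.
    t = 0
    for cand in range(K):
        ok = (K + cand + 1 <= n
              and delta[K - cand - 1] * cand <= K * eps
              and delta[K + cand] * cand <= K * eps)
        if ok:
            t = cand
    # Eqs. (4)-(5): Psi_t = min(Delta_{K-t}, Delta_{K+t+1}); Psi^eps_t = max(eps, Psi_t).
    psi = min(delta[K - t - 1], delta[K + t])
    psi_eps = max(eps, psi)
    # Eq. (6); a zero gap simply falls back to the (Psi^eps_t)^{-2} term.
    H = sum(min(d ** -2 if d > 0 else float("inf"), psi_eps ** -2) for d in delta)
    return t, psi_eps, H
```

Since Ψϵt ≥ ϵ, the returned H never exceeds nϵ^{-2}, matching the remark above.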
We also note that since Ψt is non-decreasing in t, H(t,ϵ) is non-increasing in t. Our first result is an easy-to-implement algorithm (see Algorithm 1) that identifies ϵ-top-K arms with sample complexity governed by H(t,ϵ).

Theorem 1 There is an algorithm that computes ϵ-top-K arms with probability at least (1 − δ), and pulls the arms at most O((log log ϵ^{-1} + log n + log δ^{-1}) · H(t,ϵ)) times.

We also develop a more sophisticated algorithm with an improved sample complexity; its details are deferred to the full version of this paper.

Theorem 2 There is an algorithm that computes ϵ-top-K arms with probability at least (1 − δ), and pulls the arms at most O((log ϵ^{-1} + log δ^{-1}) · H(t,ϵ)) times.

Since Ψϵt ≥ ϵ and H(t,ϵ) ≤ nϵ^{-2}, the worst-case sample complexity in Theorem 2 is O((n/ϵ²)(log ϵ^{-1} + log δ^{-1})). While the asymptotically optimal instance-independent sample complexity is Θ((n/ϵ²)(1 + ln(δ^{-1})/K)) (by Zhou et al. (2014)), we show that the log ϵ^{-1} factor in Theorem 2 is necessary for instance-dependent algorithms using H(t,ϵ) as the hardness parameter. In particular, we establish the following lower bound, whose detailed proof is deferred to the full version of this paper.

Theorem 3 For any n, K with n = 2K, and any ϵ = Ω(n^{-1}), there exists an instance on n arms such that H(t,ϵ) = Θ(n), and Ω(n log ϵ^{-1}) pulls are required to identify a set of ϵ-top-K arms with probability at least 0.9.

Note that since H(t,ϵ) = Θ(n) in our lower bound instances, Theorem 3 shows that the sample complexity has to be at least Ω(H(t,ϵ) log ϵ^{-1}) on these instances. In other words, our lower bound shows that for any instance-dependent algorithm and any ϵ = Ω(n^{-1}), there exists an instance on which the sample complexity has to be Ω(H(t,ϵ) log ϵ^{-1}). While Theorem 3 establishes the necessity of the log ϵ^{-1} factor in Theorem 2, it is not a lower bound for every instance of the problem.

1.2. Review of and Comparison with Related Works

The problem of identifying the single best arm (i.e.
the top-K arms with K = 1) has been studied extensively (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004; Audibert et al., 2010; Gabillon et al., 2011; 2012; Karnin et al., 2013; Jamieson et al., 2014; Kaufmann et al., 2016; Garivier & Kaufmann, 2016; Russo, 2016; Chen et al., 2016b). More specifically, in the special case K = 1, our problem reduces to identifying an ϵ-best arm, i.e., an arm whose expected reward differs from that of the best arm by an additive error of at most ϵ, with probability at least (1 − δ). For this problem, Even-Dar et al. (2006) gave an algorithm with instance-independent sample complexity O((n/ϵ²) log δ^{-1}) (proved to be asymptotically optimal by Mannor & Tsitsiklis (2004)). An instance-dependent algorithm for this problem was given by Bubeck et al. (2013), and an improved algorithm was given by Karnin et al. (2013) with an instance-dependent sample complexity of O(Σ_{i=2}^{n} max{Δi, ϵ}^{-2} (log δ^{-1} + log log max{Δi, ϵ}^{-1})). In the worst case, this bound becomes O((n/ϵ²)(log δ^{-1} + log log ϵ^{-1})), almost matching the instance-independent bound of Even-Dar et al. (2006). When K = 1, we have t(ϵ, K) = 0 and thus H(t,ϵ) = H(0,ϵ) = Θ(Σ_{i=2}^{n} max{Δi, ϵ}^{-2}). Therefore, the sample complexity in our Theorem 2 becomes O((log ϵ^{-1} + log δ^{-1}) · H) = O((n/ϵ²)(log ϵ^{-1} + log δ^{-1})) in the worst case, almost matching the bound of Karnin et al. (2013).

For the problem of identifying the top-K arms with K > 1, different notions of ϵ-optimal solution have been proposed. One popular metric is the misidentification probability (MISPROB), i.e., Pr(T ≠ {1, . . . , K}). In the PAC setting (i.e., controlling the MISPROB to be less than ϵ with probability at least 1 − δ), many algorithms have been developed recently, e.g., by Bubeck et al. (2013) in the fixed-budget setting and by Chen et al. (2014) for both the fixed-confidence and fixed-budget settings. Gabillon et al. (2016) further improved the sample complexity of Chen et al.
(2014); however, the current implementation of their algorithm has exponential running time. As argued in Zhou et al. (2014), the MISPROB metric requires identifying the exact top-K arms, which may be too stringent for some applications (e.g., crowdsourcing). The MISPROB requires a certain gap between θK and θ_{K+1} to identify the top-K arms, and this requirement is not necessary when using the aggregate regret. As shown in Zhou et al. (2014), when the gap between every pair of consecutive means θi and θ_{i+1} among the first 2K arms is o(1/n), the sample complexity has to be huge (ω(n²)) to make the MISPROB less than ϵ, while any K arms among the first 2K form a desirable set of ϵ-top-K arms in terms of aggregate regret. Therefore, we follow Zhou et al. (2014) and adopt the aggregate regret to define the approximate solution in this paper. Kalyanakrishnan et al. (2012) proposed the so-called EXPLORE-K metric, which requires each arm i in the selected set T to satisfy θi > θK − ϵ, where θK is the mean of the K-th best arm. Cao et al. (2015) proposed a more restrictive notion of optimality, ELEMENTWISE-ϵ-OPTIMAL, which requires the mean reward of the i-th best arm in the selected set T to be at least θi − ϵ for all 1 ≤ i ≤ K. It is clear that ELEMENTWISE-ϵ-OPTIMAL is a stronger guarantee than our ϵ-top-K in regret, while the latter is stronger than EXPLORE-K. Chen et al. (2016a) further extended Cao et al. (2015) to pure-exploration problems under matroid constraints. Audibert et al. (2010) and Bubeck et al. (2013) considered the expected aggregate regret, i.e., (1/K)(Σ_{i=1}^{K} θi − E[Σ_{i∈T} θi]), where the expectation is taken over the randomness of the algorithm; note that this is a weaker objective than the aggregate regret. Moreover, there are other recent works studying best-arm identification in different setups, e.g., linear contextual bandits (Soare et al., 2014) and batch arm pulls (Jun et al., 2016).
For our ϵ-top-K-arm problem, the state-of-the-art instance-dependent sample complexity was given by Chen et al. (2014) (see Section B.2 in the appendix of their paper). More specifically, Chen et al. (2014) proposed the CLUCB-PAC algorithm, which finds ϵ-top-K arms with probability at least (1 − δ) using O((log δ^{-1} + log H(0,ϵ)) · H(0,ϵ)) pulls. Since H(0,ϵ) ≥ H(t,ϵ), our Theorem 1 is never worse than the bound in Chen et al. (2014). Indeed, in many common settings H(t,ϵ) can be much smaller than H(0,ϵ), so that Theorem 1 (and therefore Theorem 2) yields a much smaller sample complexity. We explain this argument in more detail as follows. In many real-world applications, it is common to assume that the means θi are sampled from a prior distribution D over [0, 1] with cumulative distribution function FD(θ). In fact, this is the most fundamental assumption in the Bayesian multi-armed bandit literature (e.g., best-arm identification in the Bayesian setup (Russo, 2016)). In crowdsourcing applications, Chen et al. (2015) and Abbasi-Yadkori et al. (2015) also made this assumption to model workers' accuracies, which correspond to the expected rewards. Under this assumption, it is natural to let θi be the (1 − i/n)-quantile of the distribution D, i.e., θi = F_D^{-1}(1 − i/n). If the probability density function f_D = dF_D/dθ of the prior distribution D is bounded (a few common examples include the uniform distribution over [0, 1], Beta distributions, and truncated Gaussian distributions), the arms' mean rewards {θi}_{i=1}^{n} can be characterized by the following property with c = O(1).

Definition 1 We call a set of n arms θ1 ≥ θ2 ≥ ⋯ ≥ θn c-spread (for some c ≥ 1) if for all i, j ∈ [n] we have |θi − θj| ∈ [|i − j|/(cn), c|i − j|/n].

The following lemma upper-bounds H(t,ϵ) for O(1)-spread arms, and shows the improvement of our algorithms over Chen et al. (2014) on O(1)-spread arms. The proof of Lemma 1 follows from simple calculations and is deferred to the full version of this paper.
Lemma 1 Given a set of n c-spread arms, let K = γn ≤ n/2. When c = O(1) and γ = Ω(1), we have H(t,ϵ) = O(n/√ϵ). In contrast, H(0,ϵ) = Ω(n/ϵ) for O(1)-spread arms and every K ∈ [n].

2. An Instance-Dependent Algorithm for ϵ-top-K Arms

In this section, we prove Theorem 1 by presenting Algorithm 1 and establishing the following theorem.

Algorithm 1 ADAPTIVETOPK(n, ϵ, K, δ)
Input: n: number of arms; K and ϵ: parameters of the ϵ-top-K-arm problem; δ: error probability
Output: ϵ-top-K arms
1: Let r denote the current round, initialized to 0. Let Sr ⊆ [n] denote the set of candidate arms at round r; S1 is initialized to [n].
2: Set A, B ← ∅
3: while 2 · 2^{−r} · (K − |A|) > ϵK do
4:   r ← r + 1
5:   Pull each arm in Sr for 2^{2r} ln(2nr²/δ) times, and let θ̂i^r be its empirical mean
6:   Let θ̂a(Sr) and θ̂b(Sr) be the (K − |A| + 1)-th and the (K − |A|)-th largest empirical means in Sr, respectively, and define
       Δ̂i(Sr) = max{θ̂i^r − θ̂a(Sr), θ̂b(Sr) − θ̂i^r}   (7)
7:   while max_{i∈Sr} Δ̂i(Sr) > 2 · 2^{−r} do
8:     x ← arg max_{i∈Sr} Δ̂i(Sr)
9:     if θ̂x^r > θ̂a(Sr) then
10:      A ← A ∪ {x}
11:    else
12:      B ← B ∪ {x}
13:    Sr ← Sr \ {x}
14:  end while
15:  S_{r+1} ← Sr
16: Set A′ to be the (K − |A|) arms with the largest empirical means in S_{r+1}
17: return A ∪ A′

Theorem 4 Algorithm 1 computes ϵ-top-K arms with probability at least 1 − δ, and pulls the arms at most

O((log log(Δϵt)^{-1} + log(n/δ)) · Σ_{i=1}^{n} min{Δi^{-2}, (Δϵt)^{-2}})

times, where t ∈ {0, 1, 2, . . . , K − 1} is the largest integer satisfying Δ_{K−t} · t ≤ Kϵ, and Δϵt = max(ϵ, Δ_{K−t}).

Note that Theorem 4 implies Theorem 1 for the following reasons: 1) the t defined in Theorem 4 is always at least the t(ϵ, K) defined in (3); and 2) Δϵt ≥ Ψϵt ≥ ϵ.

Algorithm 1 is similar to accept-reject style algorithms such as the one in Bubeck et al. (2013). The algorithm proceeds in rounds r = 1, 2, 3, . . . and keeps a set of undecided arms Sr ⊆ [n] at round r. All other arms (in [n] \ Sr) have been either accepted (into A) or rejected (into B). At each round, all undecided arms are pulled an equal number of times.
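The pseudocode of Algorithm 1 can be rendered as the following runnable sketch. This is an illustrative simplification under assumptions not in the paper: a generic `pull` oracle is passed in, a `max_rounds` cap guards against the exponentially growing per-round pull counts, and ties are broken arbitrarily.

```python
import math

def adaptive_top_k(pull, n, K, eps, delta, max_rounds=20):
    """Sketch of ADAPTIVETOPK: accept/reject arms round by round.

    `pull(i)` returns one reward of arm i in [0, 1] (oracle supplied by caller).
    """
    S = set(range(n))            # undecided arms (S_r in the paper)
    A, B = set(), set()          # accepted / rejected arms
    mean = {i: 0.0 for i in S}
    r = 0
    # Outer loop (Line 3): stop once the residual regret budget is met.
    while 2 * 2 ** (-r) * (K - len(A)) > eps * K and r < max_rounds:
        r += 1
        m = math.ceil(2 ** (2 * r) * math.log(2 * n * r * r / delta))
        for i in S:              # fresh empirical means for this round (Line 5)
            mean[i] = sum(pull(i) for _ in range(m)) / m
        # Inner loop (Lines 7-13): decide arms clearly above/below the boundary.
        while len(A) < K and len(S) > K - len(A):
            k = K - len(A)
            order = sorted(S, key=lambda i: -mean[i])
            theta_b, theta_a = mean[order[k - 1]], mean[order[k]]
            gap = {i: max(mean[i] - theta_a, theta_b - mean[i]) for i in S}  # Eq. (7)
            x = max(S, key=lambda i: gap[i])
            if gap[x] <= 2 * 2 ** (-r):
                break
            (A if mean[x] > theta_a else B).add(x)
            S.remove(x)
    # Line 16: fill remaining slots with the empirically best undecided arms.
    rest = sorted(S, key=lambda i: -mean[i])[: K - len(A)]
    return A | set(rest)
```

With a deterministic oracle (each pull returning the true mean) and two clearly separated groups, the sketch returns the top-K index set after a couple of rounds.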
This number is chosen to ensure that the event E, defined as "the empirical mean of every arm is within a small neighborhood of its true mean," happens with probability at least 1 − δ (see Definition 2 and Claim 1). Note that E is defined over all rounds, and the radius of the neighborhood shrinks as the algorithm proceeds. We are able to prove that when E happens, the algorithm returns a desired set of ϵ-top-K arms and has small query complexity.

To prove the correctness of the algorithm, we first show that, conditioned on E, the algorithm only accepts top-K arms into A (Lemma 3) and only rejects non-top-K arms into B (Lemma 4). The key observation here is that our algorithm never incurs any regret due to the arms in A and B. We then use the key Lemma 5 to upper-bound the regret that may be incurred by the remaining arms. Once this upper bound is at most ϵK (i.e., the total budget for regret), we can choose the remaining (K − |A|) arms without further sampling. The details of this analysis can be found in Section 2.1.

We analyze the query complexity of our algorithm in Section 2.2. We establish a data-dependent bound by relating the number of pulls of each arm to both its gap Δi and Δ_{K−t} (Lemma 6 and Lemma 7).

2.1. Correctness of Algorithm 1

We first define an event E on which we will condition in the rest of the analysis.

Definition 2 Let E be the event that |θ̂i^r − θi| < 2^{−r} for all r ≥ 1 and all i ∈ Sr.

Claim 1 Pr[E] ≥ 1 − δ.

Proof: By Hoeffding's inequality, for any fixed r and i,

Pr[|θ̂i^r − θi| ≥ 2^{−r}] ≤ 2(δ/(2nr²))² ≤ δ/(2nr²).

By a union bound,

Pr[¬E] ≤ Σ_{r≥1} Σ_{i∈Sr} δ/(2nr²) ≤ Σ_{r≥1} δ/(2r²) ≤ δ.  □

The following lemma is a very useful tool in our analysis; its proof is deferred to the full version of this paper.

Lemma 2 Given μ1 ≥ . . . ≥ μn and ε > 0, assume that |μ̂i − μi| ≤ ε for all i ∈ [n], and let y1 ≥ . . . ≥ yn be the sorted version of μ̂1, . . . , μ̂n. Then |yi − μi| ≤ ε for all i ∈ [n].
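Lemma 2 is easy to sanity-check numerically. The sketch below (our own illustration; the helper name is hypothetical) verifies its conclusion: sorting both the true and estimated means in decreasing order keeps each pair within the perturbation radius.

```python
import random

def sorted_close(mu_hat, mu, eps):
    """Check the conclusion of Lemma 2: the i-th largest estimate stays
    within eps of the i-th largest true mean."""
    ys = sorted(mu_hat, reverse=True)
    mus = sorted(mu, reverse=True)
    return all(abs(y - m) <= eps for y, m in zip(ys, mus))

# Perturb each mean by at most eps; by Lemma 2 the check must pass.
random.seed(0)
mu = [random.random() for _ in range(100)]
mu_hat = [m + random.uniform(-0.01, 0.01) for m in mu]
assert sorted_close(mu_hat, mu, 0.01)
```

Note that the claim is about positions after sorting, not about the identities of the arms; two arms with nearly equal means may swap ranks, yet the sorted sequences remain ε-close.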
We now prove that, conditioned on E, the algorithm only accepts desired arms into A.

Lemma 3 Conditioned on E, throughout the run of Algorithm 1, A ⊆ {1, 2, . . . , K}; that is, all arms in A are among the top-K arms.

Proof: We prove by induction on the round r. The lemma holds trivially when r = 0 (A = ∅). Now fix a round r ≥ 1, and let x be an arm added to A at Line 10 of Algorithm 1. By the induction hypothesis, all arms in A before this step are in [K]; our goal is to show that x ∈ [K]. By the inner while condition, we have

θ̂x^r − θ̂a(Sr) > 2 · 2^{−r}. (8)

For any m ∈ [K − |A| + 1, |Sr|], let j be the arm with the m-th largest true mean in Sr, and let j′ be the arm with the m-th largest empirical mean in Sr. Since m ≥ K − |A| + 1, we have θ̂_{j′}^r ≤ θ̂a(Sr). By Lemma 2 we also have |θ̂_{j′}^r − θj| < 2^{−r}. We thus have

θx > θ̂x^r − 2^{−r} (by E) > θ̂a(Sr) + 2^{−r} (by (8)) ≥ θ̂_{j′}^r + 2^{−r} > θj.

That is, at least |Sr| − K + |A| arms in Sr have true means smaller than that of arm x. On the other hand, |Sr| − K + |A| arms in Sr are not in [K]. We therefore conclude that x must be in [K]. □

By symmetry, we also have the following lemma, stating that when E happens the algorithm only rejects non-top-K arms into B. We omit the proof because it is almost identical to that of Lemma 3.

Lemma 4 Conditioned on E, throughout the run of Algorithm 1, B ⊆ {K + 1, K + 2, . . . , n}.

Lemma 5 Conditioned on E, for all rounds r and all i ∈ Sr, it holds that

θ̂i^r − θ̂a(Sr) > θi − θ_{K+1} − 2 · 2^{−r} (9)

and

θ̂b(Sr) − θ̂i^r > θK − θi − 2 · 2^{−r}. (10)

Consequently, Δ̂i(Sr) > Δi − 2 · 2^{−r} for all rounds r and all i ∈ Sr.

Proof: Consider a particular round r. Let j be the arm with the (K − |A| + 1)-th largest true mean in Sr. Since A ⊆ [K] by Lemma 3, it holds that j ≥ K + 1, and hence θj ≤ θ_{K+1}. By Lemma 2, we also have |θ̂a(Sr) − θj| < 2^{−r}. Therefore, for any i ∈ Sr,

θ̂i^r − θ̂a(Sr) > θi − θj − 2 · 2^{−r} ≥ θi − θ_{K+1} − 2 · 2^{−r}. (11)

By a symmetric argument (by symmetry and using Lemma 4), we can show that

θ̂b(Sr) − θ̂i^r > θK − θi − 2 · 2^{−r}. (12)
Combining (11), (12), and the definitions of Δ̂i(Sr) and Δi, the lemma follows. □

Now we are ready to prove the correctness part of Theorem 4. By Lemma 3, all arms added to A at Line 10 are in [K]. It remains to account for the arms in the set A′. When the algorithm exits the outer while loop (at round r = r*) and reaches Line 16, the condition of the outer while loop gives

2 · 2^{−r*} · (K − |A|) ≤ ϵK. (13)

Let m = K − |A| and C = [K] \ A = {i1, i2, . . . , im}, where i1 < i2 < . . . < im. Let θ̂j1 ≥ θ̂j2 ≥ . . . ≥ θ̂jm be the empirical means of the (K − |A|) arms that we pick at Line 16 (note that it is not necessarily the case that j1 < . . . < jm). By Lemma 2 and E, for any s ∈ [K − |A|], we have |θ̂js − θis| ≤ 2^{−r*} and |θ̂js − θjs| ≤ 2^{−r*}. By the triangle inequality,

|θjs − θis| ≤ 2 · 2^{−r*}. (14)

We can thus bound the regret introduced by the arms in A′ by

Σ_{i∈C} θi − Σ_{i∈A′} θi ≤ Σ_{s=1}^{m} |θis − θjs| ≤ 2 · 2^{−r*} · (K − |A|) ≤ ϵK,

where the last two steps use (14) and (13), respectively.

2.2. Query Complexity of Algorithm 1

Recall (from the statement of Theorem 4) that t ∈ {0, 1, 2, . . . , K − 1} is the largest integer satisfying

Δ_{K−t} · t ≤ ϵK. (15)

Lemma 6 If the algorithm exits the outer while loop at round r = r*, then we must have

8 · 2^{−r*} ≥ Δ_{K−t}. (16)

The proof is deferred to the full version of this paper.

Lemma 7 For any arm i, let ri be the round at which arm i is removed from the candidate set, if this ever happens; otherwise set ri = r*. We must have

8 · 2^{−ri} ≥ Δi. (17)

The proof is deferred to the full version of this paper.

With Lemma 6 and Lemma 7, we are ready to analyze the query complexity of the algorithm in Theorem 4. The number of pulls of each arm i is at most

Σ_{j=1}^{ri} 2^{2j} log(2nj²/δ) ≤ O(log(ri · n · δ^{-1}) · 2^{2ri}). (18)

Now let us upper-bound the right-hand side of (18). First, by (17) we have ri ≤ log2 Δi^{-1} + O(1). Second, by (16) we have ri ≤ r* ≤ log2 Δ_{K−t}^{-1} + O(1). Third, since 2^{−r} > ϵ/2 for every round the algorithm enters (otherwise the algorithm would exit the outer while loop), we have ri ≤ r* ≤ log2 ϵ^{-1} + O(1).
To summarize, we have ri ≤ log2 min{Δi^{-1}, Δ_{K−t}^{-1}, ϵ^{-1}} + O(1) = log2 min{Δi^{-1}, (Δϵt)^{-1}} + O(1) (recall that Δϵt = max{ϵ, Δ_{K−t}}). We can thus upper-bound the right-hand side of (18) by

O((log log(Δϵt)^{-1} + log(n/δ)) · min{Δi^{-2}, (Δϵt)^{-2}}).

The total cost is the summation of this bound over all n arms.

Remark 1 As stated in Theorem 2, the log n factor in Theorem 1 can be removed. On the other hand, our lower bound result (Theorem 3) shows that the extra log(ϵ^{-1}) factor in Theorem 2 is necessary once the algorithm is made instance-adaptive. The proofs of Theorem 2 and Theorem 3 are deferred to the full version of this paper.

3. Experiments

In this section we present our experimental results. While our theorems are stated in the PAC form, they are in general difficult to verify directly, because the parameter ϵ is merely an upper bound and the actual aggregate regret may deviate from it. In our experiments, we convert Algorithm 1 to a fixed-budget version (that is, we fix the budget on the number of pulls and measure the aggregate regret). We compare our Algorithm 1 (AdaptiveTopK) with two state-of-the-art methods: OptMAI in Zhou et al. (2014) and CLUCB-PAC in Chen et al. (2014). Comparisons between OptMAI/CLUCB-PAC and earlier methods (e.g., the methods in Bubeck et al. (2013) and Kalyanakrishnan et al. (2012)) have already been demonstrated in Zhou et al. (2014) and Chen et al. (2014), and are thus omitted due to space constraints. To convert our algorithm to the fixed-budget version, we remove the outer while loop of Algorithm 1; as a replacement, we keep track of the total number of pulls and stop pulling arms once the budget is exhausted. We test our algorithm on both synthetic and real datasets (due to space constraints, the results on real datasets are deferred to the full version of this paper), described as follows.

TWOGROUP: the mean reward of each of the top K arms is set to 0.7, and that of each of the remaining arms is set to 0.3.

UNIFORM: we set θi = 1 − i/n for 1 ≤ i ≤ n.
SYNTHETIC-p: we set θi = (1 − K/n) + ((K − i)/n)^p for each i ≤ K, and θi = (1 − K/n) − ((i − K)/n)^p for each i > K. Note that SYNTHETIC-1 is identical to UNIFORM. When p is larger than 1, arms are pushed closer to the boundary that separates the top-K arms from the rest (i.e., 1 − K/n); when p is smaller than 1, arms are pushed farther from the boundary. We normalize all the arms so that the mean values still span the whole interval [0, 1]. We consider p = .5, 1, 6.

We set the total number of arms to n = 1,000 and the tolerance parameter to ϵ = 0.01, and vary the parameter K. AdaptiveTopK and CLUCB-PAC require an additional parameter δ (the failure probability), which we set to δ = 0.01. For each dataset, we first fix the budget (the total number of pulls allowed) and run each algorithm 200 times. For each algorithm, we calculate the empirical probability (over the 200 runs) that the aggregate regret of the selected arms exceeds the tolerance threshold ϵ = 0.01, which we call the failure probability; a smaller failure probability means better performance. For each dataset and each value of K, we plot the curve of the failure probability as the number of pulls varies. The results are shown in Figures 1-4.

It can be observed from the experimental results that AdaptiveTopK (Algorithm 1) outperforms CLUCB-PAC on almost all the datasets. When K is relatively small, OptMAI has the best performance on most datasets; when K is large, AdaptiveTopK outperforms OptMAI. The details of the experimental results are elaborated as follows. On the TWOGROUP dataset (see Figure 1), AdaptiveTopK outperforms the other algorithms significantly for all values of K. The advantage comes from the adaptivity of our algorithm: in the TWOGROUP dataset, the top-K arms are very well separated from the rest, and once our algorithm identifies this situation, it needs only a few pulls to classify the arms.
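The three synthetic instance families can be generated as follows. This sketch is our own reconstruction: the SYNTHETIC-p formula is inferred from the constraints that SYNTHETIC-1 coincides with UNIFORM (before normalization) and that larger p pushes arms toward the boundary 1 − K/n.

```python
def make_instance(name, n, K, p=1.0):
    """Mean vectors for the synthetic datasets (reconstructed definitions)."""
    if name == "TWOGROUP":
        return [0.7] * K + [0.3] * (n - K)
    if name == "UNIFORM":
        return [1 - i / n for i in range(1, n + 1)]
    if name == "SYNTHETIC":
        b = 1 - K / n   # boundary separating the top-K arms from the rest
        theta = [b + ((K - i) / n) ** p if i <= K else b - ((i - K) / n) ** p
                 for i in range(1, n + 1)]
        # Normalize so that the means span the whole interval [0, 1].
        lo, hi = min(theta), max(theta)
        return [(t - lo) / (hi - lo) for t in theta]
    raise ValueError(name)
```

The paper uses n = 1,000; smaller n works identically for inspection.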
In detail, the inner while loop (Line 7) of Algorithm 1 makes it possible to accept/reject a large number of arms in a single round, as long as the algorithm is confident. As K increases, the advantage of AdaptiveTopK over the other algorithms (OptMAI in particular) becomes more significant. This can be explained by the definition of H(t,ϵ): t = t(ϵ, K) usually becomes larger as K grows, leading to a smaller hardness parameter H(t,ϵ). A comparison among SYNTHETIC-.5, UNIFORM, and SYNTHETIC-6 reveals that the advantage of AdaptiveTopK over the other algorithms (OptMAI in particular) becomes significant in both extreme scenarios, i.e., when the arms are very well separated (p ≪ 1) and when the arms are very close to the separation boundary (p ≫ 1).

Figure 1: TWOGROUP dataset; panels (a)-(c): K = 100, 250, 500.
Figure 2: SYNTHETIC-.5 dataset; panels (a)-(c): K = 100, 250, 500.
Figure 3: UNIFORM dataset; panels (a)-(c): K = 100, 250, 500.
Figure 4: SYNTHETIC-6 dataset; panels (a)-(c): K = 100, 250, 500.

4. Conclusion and Future Work

In this paper, we proposed two algorithms for a PAC version of the multiple-arm identification problem in a stochastic multi-armed bandit (MAB) game. We introduced a new hardness parameter for characterizing the difficulty of an instance when the aggregate regret is used as the evaluation metric, and established instance-dependent sample complexity bounds based on this hardness parameter. We also established lower bound results to show the optimality of our algorithm in the worst case. Although we only consider reward distributions supported on [0, 1], it is straightforward to extend our results to sub-Gaussian reward distributions. For future directions, it is worthwhile to consider the more general problem of pure exploration of MAB under matroid constraints, which includes multiple-arm identification as a special case, or under other polynomial-time-computable combinatorial constraints such as matchings.
It is also interesting to extend the current work to finding top-K arms in a linear contextual bandit framework.

References

Abbasi-Yadkori, Yasin, Bartlett, Peter, Chen, Xi, and Malek, Alan. Large-scale Markov decision problems with KL control cost and its application to crowdsourcing. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In Proceedings of the Conference on Learning Theory (COLT), 2010.

Bubeck, Sébastien, Wang, Tengyao, and Viswanathan, Nitin. Multiple identifications in multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

Cao, Wei, Li, Jian, Tao, Yufei, and Li, Zhize. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.

Chen, Lijie, Gupta, Anupam, and Li, Jian. Pure exploration of multi-armed bandit under matroid constraints. In Proceedings of the Conference on Learning Theory (COLT), 2016a.

Chen, Lijie, Li, Jian, and Qiao, Mingda. Towards instance optimal bounds for best arm identification. arXiv preprint arXiv:1608.06031, 2016b.

Chen, Shouyuan, Lin, Tian, King, Irwin, Lyu, Michael R., and Chen, Wei. Combinatorial pure exploration of multi-armed bandits. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2014.

Chen, Xi, Lin, Qihang, and Zhou, Dengyong. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16:1–46, 2015.

Even-Dar, Eyal, Mannor, Shie, and Mansour, Yishay. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Annual Conference on Learning Theory (COLT), 2002.

Even-Dar, Eyal, Mannor, Shie, and Mansour, Yishay. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Gabillon, Victor, Ghavamzadeh, Mohammad, Lazaric, Alessandro, and Bubeck, Sébastien. Multi-bandit best arm identification. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2011.

Gabillon, Victor, Ghavamzadeh, Mohammad, and Lazaric, Alessandro. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012.

Gabillon, Victor, Lazaric, Alessandro, Ghavamzadeh, Mohammad, Ortner, Ronald, and Bartlett, Peter. Improved learning complexity in combinatorial pure exploration bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Garivier, A. and Kaufmann, E. Optimal best-arm identification with fixed confidence. In Proceedings of the Conference on Learning Theory (COLT), 2016.

Jamieson, Kevin, Malloy, Matthew, Nowak, Robert, and Bubeck, Sébastien. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the Conference on Learning Theory (COLT), 2014.

Jun, Kwang-Sung, Jamieson, Kevin, Nowak, Robert, and Zhu, Xiaojin. Top arm identification in multi-armed bandits with batch arm pulls. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Kalyanakrishnan, Shivaram, Tewari, Ambuj, Auer, Peter, and Stone, Peter. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), 2012.

Karnin, Zohar, Koren, Tomer, and Somekh, Oren. Almost optimal exploration in multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

Kaufmann, Emilie, Cappé, Olivier, and Garivier, Aurélien. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1–42, 2016.

Mannor, Shie and Tsitsiklis, John N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.

Russo, Daniel. Simple Bayesian algorithms for best arm identification. In Proceedings of the Conference on Learning Theory (COLT), 2016.

Soare, Marta, Lazaric, Alessandro, and Munos, Rémi. Best-arm identification in linear bandits. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2014.

Zhou, Yuan, Chen, Xi, and Li, Jian. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the International Conference on Machine Learning (ICML), 2014.