ϵ-Best-Arm Identification in Pay-Per-Reward Multi-Armed Bandits

Sivan Sabato
Department of Computer Science
Ben-Gurion University of the Negev
Beer-Sheva, Israel 8410501
sabatos@cs.bgu.ac.il

Abstract

We study ϵ-best-arm identification, in a setting where during the exploration phase, the cost of each arm pull is proportional to the expected future reward of that arm. We term this setting Pay-Per-Reward. We provide an algorithm for this setting, that with a high probability returns an ϵ-best arm, while incurring a cost that depends only linearly on the total expected reward of all arms, and does not depend at all on the number of arms. Under mild assumptions, the algorithm can be applied also to problems with infinitely many arms.

1 Introduction

Consider placing an ad next to search engine results, based on the search query. In a preliminary survey for a future promotion, a retailer wishes to identify the best query expression to link to its ad, that is, the expression that maximizes the expected number of clicks on the ad. The payment per ad is proportional to the number of clicks. However, during the survey, clicks do not lead to profit, while the payment is still due. This problem can be formulated as a type of best-arm-identification problem in a stochastic multi-armed bandit (MAB) setting [see, e.g., 14, 15, 8, 3]. In stochastic MAB, there is a set of arms, each associated with a non-negative reward distribution. Each pull of an arm draws an instantaneous reward from the corresponding reward distribution. The MAB algorithm iteratively pulls arms and observes their instantaneous rewards. At the end of the run, the algorithm selects one arm. The goal is to find an arm with a near-maximal expected reward with a high probability. In the example above, each arm in the MAB framework can be mapped to a specific search expression, and pulling the arm is equivalent to linking the ad to the expression represented by this arm for a given period of time, and then observing the number of clicks.

In standard MAB, each arm pull incurs a unit cost. However, in the example above, the cost of pulling an arm is the number of clicks on the ad, assuming the common pay-per-click advertisement model. Thus, the expected cost in the survey stage is proportional to the expected reward after the survey stage. We term this setting Pay-Per-Reward (PPR). Other applications of this model include any case where there is an exploration stage in which rewards are not collected, but payment is still proportional to the quality of the arm. For instance, consider a sensor field, where the goal is to find the most active sensor, but each activity report has a communication cost. Here, pulling an arm is done by activating the sensor's reporting option for a given period of time. Another example is finding the best crowd worker on a crowd-sourcing platform. Here, an "arm pull" is equivalent to providing the worker with a test task. The workers are paid according to their success rate in the test task, thus a more successful worker will be paid more during the exploration stage.

Our contribution. We provide an algorithm, MAB-PPR, for the pay-per-reward setting, that finds an ϵ-optimal arm with a probability of at least 1 − δ, for given ϵ, δ ∈ (0, 1). We show that the cost incurred by MAB-PPR has no dependence on the number of arms, and only a linear dependence on the total expected reward of all arms.
Our results generalize beyond bounded reward distributions, and support some heavy-tailed distributions with bounded second moments. Under mild assumptions, the algorithm can be applied also to problems with infinitely many arms. Since the cost of MAB-PPR is independent of the number of arms, it is especially useful when the number of arms is potentially very large, while the total expected reward is bounded. For instance, in the ad-placement example, the number of arms is proportional to the number of admissible search expressions, which can be practically unbounded, but only a small number of those are expected to have large click rates when linked to a specific ad. Other applications in which the number of arms can be very large include, for instance, recommender systems [32], content personalization [24] and web-content optimization [1].

For the standard setting, in which each arm pull has a unit cost, [14] provided an ϵ-best-arm identification algorithm whose sample complexity depends linearly on the number of arms, using an algorithm that they term the median-elimination algorithm. This algorithm halves the number of candidate arms in each round based on their empirical mean rewards. Instead, in our setting, one should try to halve the total expected reward of the arms that remain in each round. This requires a new estimation scheme. Another challenge in the PPR setting is that due to the potentially unbounded number of arms, sums of instantaneous rewards of sets of arms do not have a bounded support, even if single-arm rewards are bounded. Moreover, in some settings one might prefer to simultaneously pull an entire set of arms from the joint distribution of rewards of all arms. Thus, standard assumptions on independence between arms do not necessarily hold. MAB-PPR addresses these challenges using a two-step estimation procedure in each round, combined with heavy-tailed mean estimation. This guarantees success even under some dependence between arms. Support for some unbounded reward distributions of single arms is easily obtained within this framework, by using heavy-tailed mean estimation for the expected rewards of single arms as well. Our analysis shows that the PPR setting can be successfully handled in a wide range of regimes.

Paper structure. After discussing related work, we present the setting and preliminaries in Section 2. The main result is stated in Section 3. The MAB-PPR algorithm is listed and described in Section 4. We prove the main result in Section 5, and conclude in Section 6.

1.1 Related work

[14] formulated the stochastic MAB ϵ-best-arm identification problem (also termed the PAC setting) for bounded rewards. They showed that the number of pulls can be linear in the number of arms, without the logarithmic factors that would be necessary in a naive algorithm that pulls all arms the same number of times. Closely matching lower bounds for this setting were given in [22]. Instance-specific bounds in this setting were studied in [19, 17, 12]. A fixed-budget variant was studied in [8, 3, 10]. The works above assume the standard cost model for MAB, in which each arm pull costs one unit. A different cost model has been studied in the setting of budgeted MAB [25, 26, 13, 27, 29, 20].
Here, arms are associated with a random or deterministic cost, in addition to a reward distribution, and the goal is to maximize the cumulative reward within total cost constraints. [30] studied a best-arm variant of this problem. MAB with infinitely many arms, also termed many-armed bandits, has been studied in the unit-cost model, both for the cumulative-regret setting [5, 6, 28, 20] and for exploration settings [11, 4]. [20] study infinitely many arms in the budgeted MAB setting. A useful tool for mean estimation for heavy-tailed random variables is median-of-means estimation [2]. This tool has recently been used in various contexts in machine learning [e.g., 23, 7, 16, 21, 18]. MAB with heavy-tailed rewards under the standard unit-cost model was studied in the regret setting [9] and for pure exploration with unbounded rewards [31].

2 Setting and Preliminaries

For an integer $i$, denote $[i] := \{1, \ldots, i\}$. Let $K$ be the total number of available arms, and denote the set of available arms by $[K]$ (we discuss in Section 4 how to use MAB-PPR when the set of available arms is infinite or unknown). Let $X_1, \ldots, X_K$ be the random variables associated with the reward distributions of the corresponding arms. We assume that the random variables are non-negative, and have a finite expectation and variance. Pulling arm $i$ is equivalent to drawing an independent copy of $X_i$ and observing its value. We study here a more general setting, in which sets of arms can be pulled simultaneously. We assume some joint distribution of the random vector $X = (X_1, \ldots, X_K)$. A pull of a specific set of arms is equivalent to drawing an independent copy of $X$ and observing the rewards of all the arms in the set. The standard setting, in which arms are pulled one by one, is equivalent to assuming that $X_1, \ldots, X_K$ are statistically independent in $X$. Thus, we generally assume below that sets are simultaneously pulled.

Denote the expected reward of arm $i$ by $w_i := \mathbb{E}[X_i]$, termed below the arm's weight. Let $W := \sum_{i \in [K]} w_i$. For a set $A \subseteq [K]$, denote $X_A := \sum_{i \in A} X_i$ and $W_A := \sum_{i \in A} w_i$. Consider a run of the algorithm in which the total number of arm pulls (regardless of whether some of the arms were pulled simultaneously) is $n$, and let $I_1, \ldots, I_n$ be the sequence of pulled arms. The cost of the algorithm in this run is $\sum_{i \in [n]} w_{I_i}$.

We make two assumptions on the distribution of $X$.

Assumption 2.1. For all $i \in [K]$, $\sigma_i^2 := \mathrm{Var}(X_i) \le w_i \le 1$.

This assumption clearly holds if the $X_i$ are bounded in $[0, 1]$. Nonetheless, it is more general, and can hold also for reward distributions with unbounded support.

Assumption 2.2. There is a constant $V \ge 1$, such that for any subset of the arms $A \subseteq [K]$, $\sigma_A^2 := \mathrm{Var}(X_A) \le V \cdot W_A$.

Assumption 2.2 trivially holds with $V = 1$ if arm-pulls of individual arms are statistically independent, as when they are pulled separately. In this case, $\sigma_A^2 = \sum_{i \in A} \sigma_i^2 \le \sum_{i \in A} w_i = W_A$, where Assumption 2.1 was used to bound $\sigma_i^2$. If arm rewards are not statistically independent, then the existence and value of the constant $V$ depend on the distribution of $X$. For instance, if each arm is positively correlated with at most $Q$ other arms, then $V$ can be upper bounded by $1 + 2Q$ (see the supplementary material, Appendix A). We note that our results below are indifferent to values of $V$ as large as $1/\epsilon$.

We use the median-of-means method [2] to estimate the mean of a distribution with a bounded variance. In this method, $q$ batches of $\ell$ independent samples are used; a separate empirical mean is calculated from each batch, and the estimate is the median of these means. Formally, the $(q, \ell)$-MoM estimator of a random variable $X$ from a sample of i.i.d. draws $\{x_{i,j}\}_{i \in [q], j \in [\ell]}$ of $X$ is defined as $\mathrm{Median}\{\frac{1}{\ell}\sum_{j \in [\ell]} x_{i,j} \mid i \in [q]\}$. We use the following formulation of the result of [2] on the convergence of the MoM estimator to the true mean, similarly to [16]. The proof is provided for completeness in the supplementary material, Appendix B.

Lemma 2.3 (Median-of-Means). Let $X$ be a random variable with mean $w$ and variance $\sigma^2 < \infty$. Let $q, \ell$ be integers, and let $\hat w$ be the $(q, \ell)$-MoM estimate of $w$ from a sample of $q\ell$ i.i.d. draws of copies of $X$. With a probability at least $1 - \exp(-\frac{2}{9}q)$, we have $|w - \hat w| \le \sqrt{6\sigma^2/\ell}$.
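For concreteness, the following is a minimal NumPy sketch of the $(q, \ell)$-MoM estimator defined above (an illustration of the definition, not code from the paper; the function name and the toy Pareto reward distribution are our own choices):

```python
import numpy as np

def mom_estimate(samples: np.ndarray, q: int, ell: int) -> float:
    """(q, ell)-median-of-means: split q*ell i.i.d. samples into q batches
    of size ell, average each batch, and return the median of the averages."""
    batch_means = samples[: q * ell].reshape(q, ell).mean(axis=1)
    return float(np.median(batch_means))

# Sanity check on a heavy-tailed reward distribution with finite variance:
# numpy's pareto(a) is the Lomax distribution, with mean 1/(a-1) for a > 1.
rng = np.random.default_rng(0)
q, ell = 20, 500
x = rng.pareto(2.5, size=q * ell)
print(f"MoM estimate: {mom_estimate(x, q, ell):.3f}, true mean: {1 / 1.5:.3f}")
```

With $q = \frac{9}{2}\log(1/\delta')$ batches, Lemma 2.3 gives accuracy $\sqrt{6\sigma^2/\ell}$ with probability at least $1 - \delta'$, which is how MAB-PPR sets its batch counts below.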
3 Main result

We propose the MAB-PPR algorithm, which accepts parameters ϵ, δ ∈ (0, 1), iteratively pulls arms, and outputs a single arm. MAB-PPR satisfies the following guarantee.

Theorem 3.1. Suppose that Assumptions 2.1 and 2.2 hold. Let $W := \sum_{i \in [K]} w_i$, and let $V$ be as defined in Assumption 2.2. With a probability of at least 1 − δ, MAB-PPR returns an ϵ-optimal arm, and its cost is upper bounded by
$$O\Big(W \log(1/\delta) \max\Big(\frac{1}{\epsilon^2}, \frac{V}{\epsilon}\Big)\Big).$$

Here and below, $O(\cdot)$ stands for universal multiplicative constants. In comparison, previous algorithms for the unit-cost model, such as the ones studied in [14], would cost here at least $\Omega(W \log(K/\delta)/\epsilon^2)$, where $K$ is the number of arms, assuming bounded rewards. Qualitatively, removing the dependence on $K$ is of significance. MAB-PPR obtains a quantitatively significant improvement when $K$ is large and $V \le O(1/\epsilon)$.
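To get a feel for the scale of this improvement, here is a toy calculation of our own (universal constants are dropped, so only the ratio is meaningful):

```python
import math

# Shapes of the two cost bounds, ignoring universal constants.
W, K = 10.0, 10**6            # total expected reward; number of arms (toy values)
eps, delta, V = 0.1, 0.05, 1.0

ppr_bound = W * math.log(1 / delta) * max(1 / eps**2, V / eps)   # Theorem 3.1
unit_cost_bound = W * math.log(K / delta) / eps**2               # unit-cost model

print(f"PPR: ~{ppr_bound:.0f}   unit-cost: ~{unit_cost_bound:.0f}   "
      f"ratio: {unit_cost_bound / ppr_bound:.1f}x")
# The log(K/delta) factor grows with K, while the PPR bound does not depend on K.
```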
Algorithm 1 MAB-PPR: ϵ-Best-Arm Identification with Pay-Per-Reward
Input: ϵ, δ ∈ (0, 1), $V \ge 1$, $K \in \mathbb{N}$; universal constants $C_q, C'_q, C_1, C_2, C_3, N_{\mathrm{final}} > 0$; $\rho, \theta \in (0, 1)$.
Output: $r \in [K]$
1: $t \leftarrow 1$, $S_1 \leftarrow [K]$, $\epsilon_1 \leftarrow (1 - \rho)\epsilon/2$, $\delta_1 \leftarrow \delta/2$.
2: while $|S_t| > N_{\mathrm{final}}$ do
3:   First batch of pulls:
4:   Set $q_1 \leftarrow C_q \log(C'_q/\delta_t)$ and $\ell_t \leftarrow C_1 \max(1/\epsilon_t^2, V/\epsilon_t)$; pull the arms in $S_t$ for $q_1 \ell_t$ times.
5:   For each $i \in S_t$, use the pulls of arm $i$ to calculate $\hat w_{i,t}$, the $(q_1, \ell_t)$-MoM estimate of $w_i$.
6:   Set $\{j_l\}_{l \in [|S_t|]}$ to be an ordering of $S_t$ such that $\hat w_{j_l,t} \le \hat w_{j_{l+1},t}$ for all $l \in [|S_t| - 1]$.
7:   Second batch of pulls:
8:   Set $q_2 \leftarrow \frac{9}{2}\log(12/\delta_t)$, $\ell^{(2)} \leftarrow C_2 V/\epsilon$; pull the arms in $S_t$ for $q_2 \ell^{(2)}$ times.
9:   For $i \in S_t$, let $\{x_i^{j,l'}\}_{j \in [q_2], l' \in [\ell^{(2)}]}$ be the i.i.d. copies of $X_i$ obtained in line 8.
10:  for $l \in [|S_t|]$ do
11:    Define $A_l := \{j_1, \ldots, j_l\}$.
12:    Use $\{\sum_{i \in A_l} x_i^{j,l'}\}_{j \in [q_2], l' \in [\ell^{(2)}]}$ to calculate $\widetilde W_{A_l,t}$, the $(q_2, \ell^{(2)})$-MoM estimate of $W_{A_l}$.
13:  end for
14:  Decide which arms to remove:
15:  If $\widetilde W_{S_t,t} \le \epsilon/4$, return some $r \in S_t$ and terminate.
16:  Find the largest integer $N_t$ such that $\widetilde W_{A_{N_t},t}/\widetilde W_{S_t,t} \le \theta$.
17:  Set $S_{t+1} \leftarrow S_t \setminus A_{N_t}$, $\epsilon_{t+1} \leftarrow \rho\epsilon_t$, $\delta_{t+1} \leftarrow \delta_t/2$, $t \leftarrow t + 1$.
18: end while
19: Final round ($|S_t| \le N_{\mathrm{final}}$):
20: Set $q_3 \leftarrow \frac{9}{2}\log(N_{\mathrm{final}}/\delta_t)$, $\ell^{(3)} \leftarrow C_3/\epsilon^2$; pull the arms in $S_t$ for $q_3 \ell^{(3)}$ times.
21: For $i \in S_t$, use the pulls of arm $i$ to calculate $\hat w_{i,t}$, the $(q_3, \ell^{(3)})$-MoM estimate of $w_i$.
22: return some $r \in \arg\max_{i \in S_t} \hat w_{i,t}$.

4 The MAB-PPR algorithm

The MAB-PPR algorithm is listed in Alg. 1. The main ideas in MAB-PPR are described below. We use several universal constants in Alg. 1. In the analysis below, we show that there exist values for these constants such that the guarantees of MAB-PPR hold. For readability, we do not explicitly round up non-integer sample sizes. Such rounding would not affect our final cost upper bound.

A central challenge in designing MAB-PPR is avoiding a union bound on the number of arms. In [14], where arm pulls have unit cost, the median-elimination algorithm was proposed for this purpose. This algorithm removes half of the arms from the candidate set in each round. In our case, because of the different cost model, it is necessary to remove a constant fraction of the total weight, regardless of the number of arms. To achieve this, MAB-PPR uses two batches of arm pulls in each round. The universal constants $C_q, C'_q, C_1, C_2$ are used for calculating the required numbers of arm pulls in each batch so that subsequent estimates hold. The first batch of pulls (line 4) is used to estimate the weight of each arm with the MoM estimator (line 5). The arms are then ordered according to their estimated weight. The second batch of pulls (line 8) is used to decide how many arms to remove from the bottom of this ordering, to make sure that the removed fraction of the total weight is within a certain range. This is done by estimating the total weight of each possible subset of arms, again using MoM (line 12). Here, it is important to note that taking the sum of MoM estimates is not equivalent to taking the MoM estimate of the sum, thus these sums are estimated directly. The selected number of arms is then removed from the bottom of the ordering (line 16). The threshold θ is used to upper bound the empirical fraction of the removed arms. After each round, the values of $\epsilon_t$ and $\delta_t$ are updated, where the constant ρ controls the rate of decay of $\epsilon_t$. MAB-PPR terminates in one of two ways: if at some point, the estimated total weight of the remaining arms is small, then some remaining arm is returned (line 15). Otherwise, when the number of remaining arms is smaller than the constant $N_{\mathrm{final}}$, all their weights are directly estimated (where the constant $C_3$ is used to set the number of pulls), and the arm with the maximal estimated weight is returned (line 22).

Supporting an unbounded number of arms. For simplicity of presentation, the notations and Alg. 1 are given for a finite and known number of arms $K$. Nonetheless, MAB-PPR can be used even if the number of arms is unbounded or infinite. This is possible if there is a way to pull all arms together in finite time, and get a finite result listing all the non-zero instantaneous rewards. For instance, in the search-query example, pulling all arms together can be done by placing an ad and linking it to all search queries, and then recording which search queries actually came up and resulted in an ad-click. Assuming that $W = \sum_i w_i < \infty$, Alg. 1 can then support an unbounded number of arms as follows. In line 1, $S_1$ should be (implicitly) set to the entire set of available arms. In line 6 of the first round, only arms with a non-zero reward in any of the pulls of the first batch need to be explicitly ordered. The other arms can be implicitly placed at the bottom of the ordering. After the second batch, let $l_0$ be the smallest index such that arm $j_{l_0}$ got a positive reward at least once. Then for all $l < l_0$, $\widetilde W_{A_l,1}$ can be (implicitly) set to zero. Clearly, the value of $N_1$, set in line 16 of the first round, satisfies $N_1 \ge l_0 - 1$. Thus, in the first round, any arm that got no reward in any of the pulls is removed in line 17. As a result, the set of arms $S_2$ set for the second round is finite and known, and so from this point on, the algorithm can proceed as listed.
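The following Python sketch mirrors the control flow of Alg. 1 for the case of independently pulled arms (an illustrative implementation of our own: the `pull` interface, the default constant values, and the tie-breaking are placeholder assumptions, not the constants required by the analysis):

```python
import numpy as np

def mom(samples, q, ell):
    """(q, ell)-median-of-means over the first q*ell entries of `samples`."""
    return float(np.median(samples[: q * ell].reshape(q, ell).mean(axis=1)))

def mab_ppr(pull, K, eps, delta, V=1.0, N_final=8, rho=0.5, theta=0.5,
            Cq=10.0, Cqp=8.0, C1=8.0, C2=48.0, C3=96.0):
    S = np.arange(K)                              # S_1 = [K]
    eps_t, delta_t = (1 - rho) * eps / 2, delta / 2
    while len(S) > N_final:
        # First batch: MoM estimate of each arm's weight, then order by it.
        q1 = int(np.ceil(Cq * np.log(Cqp / delta_t)))
        ell = int(np.ceil(C1 * max(1 / eps_t**2, V / eps_t)))
        X = pull(S, q1 * ell)                     # shape (q1*ell, |S|)
        w_hat = np.array([mom(X[:, i], q1, ell) for i in range(len(S))])
        order = np.argsort(w_hat)                 # ascending estimated weight
        # Second batch: MoM estimates of the prefix-set weights W_{A_l}.
        q2 = int(np.ceil(4.5 * np.log(12 / delta_t)))
        ell2 = int(np.ceil(C2 * V / eps))
        prefix = pull(S, q2 * ell2)[:, order].cumsum(axis=1)
        W_tilde = np.array([mom(prefix[:, l], q2, ell2) for l in range(len(S))])
        if W_tilde[-1] <= eps / 4:                # remaining total weight is tiny:
            return int(S[0])                      # any remaining arm is eps-best
        # Drop the bottom N_t arms (approximates "largest N_t with ratio <= theta";
        # the analysis guarantees a constant weight fraction is removed each round).
        N_t = int(np.sum(W_tilde <= theta * W_tilde[-1]))
        S = S[order][N_t:]
        eps_t, delta_t = rho * eps_t, delta_t / 2
    # Final round: estimate the few remaining arms directly.
    q3 = int(np.ceil(4.5 * np.log(N_final / delta_t)))
    ell3 = int(np.ceil(C3 / eps**2))
    X = pull(S, q3 * ell3)
    w_hat = np.array([mom(X[:, i], q3, ell3) for i in range(len(S))])
    return int(S[np.argmax(w_hat)])

# Toy run: independent Bernoulli arms; a pull of a set S costs sum_{i in S} w_i.
rng = np.random.default_rng(1)
w = np.array([0.9, 0.5, 0.3, 0.1, 0.05] + [0.01] * 50)
pull = lambda S, n: (rng.random((n, len(S))) < w[S]).astype(float)
print("chosen arm:", mab_ppr(pull, len(w), eps=0.4, delta=0.1))
```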
5 Proof of the main result

In this section we prove Theorem 3.1, restated below in more detail as Theorem 5.6. Recall that $S_t$ is the set of candidate arms in round $t$. Denote by $W_t := W_{S_t}$ the total weight of the arms in $S_t$. As in Alg. 1, the estimates made by MAB-PPR are denoted by $\{\hat w_{i,t}\}_{i \in S_t}$ and $\{\widetilde W_{A_l,t}\}_{l \in [|S_t|]}$. We omit the subscript $t$ when it is clear from context. Denote $\widetilde W_t := \widetilde W_{S_t,t}$.

We say that a round $t$ of MAB-PPR is standard if $|S_t| > N_{\mathrm{final}}$ and $W_t \ge \epsilon/8$. We first show that with a high probability, the only non-standard round (if at all) is the last round. Define the following event:
$$E_0(t) := \{(W_t < \epsilon/8 \Rightarrow \widetilde W_t \le \epsilon/4) \text{ and } (W_t \ge \epsilon/2 \Rightarrow \widetilde W_t \ge \epsilon/4)\}.$$

Lemma 5.1. For a large enough constant $C_2$, for any round $t$ with $|S_t| > N_{\mathrm{final}}$, $E_0(t)$ holds with a probability at least $1 - \delta_t/4$.

Proof. If $|S_t| > N_{\mathrm{final}}$ then line 8 is invoked, and $\widetilde W_t$ is a $(q_2, \ell^{(2)})$-MoM estimate of $W_t$. By Assumption 2.2, $\sigma_{S_t}^2 \le V W_t$. Hence, by Lemma 2.3, since $q_2 \ge \frac{9}{2}\log(4/\delta_t)$, with a probability at least $1 - \delta_t/4$, $|W_t - \widetilde W_t| \le \sqrt{6 V W_t/\ell^{(2)}}$. We have $\ell^{(2)} = C_2 V/\epsilon$. By setting $C_2 \ge 48$, we get $|W_t - \widetilde W_t| \le \sqrt{W_t \epsilon/8}$. If $W_t < \epsilon/8$, it follows that $\widetilde W_t \le W_t + \epsilon/8 \le \epsilon/4$, which shows that the first part of $E_0(t)$ holds. If $W_t \ge \epsilon/2$, then $\widetilde W_t \ge W_t - \sqrt{W_t \epsilon/8} \ge W_t/2 \ge \epsilon/4$, which shows that the second part of $E_0(t)$ holds.

If $E_0(t)$ holds and $W_t < \epsilon/8$, then in round $t$ the condition in line 15 is invoked, leading to the termination of MAB-PPR. The other option for a round $t$ to be non-standard is if $|S_t| \le N_{\mathrm{final}}$. This case also leads to the termination of the algorithm. Therefore, if $E_0(t)$ holds for all rounds, then all rounds except for (perhaps) the last one are standard. We use this observation in the proofs below.

We prove the correctness of MAB-PPR by guaranteeing that with a high probability, an ϵ-best arm remains in the set $S_t$ throughout the algorithm. We upper bound the cost of MAB-PPR by showing that under the same events, the total weight of the arms is reduced in every round by at least a certain fraction. This general analysis structure is also used for the median-elimination algorithm in [14], where half of the arms are removed in each round. However, once weights are used instead of numbers of arms, new techniques are required, and the estimation scheme, as well as the analysis, become more involved.

First, we define several events. We later prove that they all hold together with a high probability. Let $i^*_t \in \arg\max_{i \in S_t} w_i$ be an arm with the largest weight in $S_t$. Let $w^*_t := w_{i^*_t}$, and $\hat w^*_t := \hat w_{i^*_t,t}$. Note that $i^* := i^*_1$ is the optimal arm and its weight is $w^* := w^*_1$. For any arm $j \in S_t$, we say that $j$ is a t-bad arm if $w_j < w^*_t - \epsilon_t$, where $\epsilon_t$ is as defined in Alg. 1. We also define the set $M_t := \{i \in S_t \mid i \text{ is t-bad and } \hat w_{i,t} \ge \hat w^*_t\}$.

We define three constants $\phi_L, \phi_M, \phi_U \in (0, 1)$ (standing for "Lower", "Middle" and "Upper"), such that $\phi_L < \phi_M < 1 - \phi_U$. These constants are used to analyze the fraction of removed arms in each round, as follows: $\phi_L$ is a lower bound on the fraction of the arm weight which is guaranteed to be removed in each round. $\phi_U$ is an upper bound on the fraction of the arm weight in bad arms. $\phi_M$ is used to define a set of arms guaranteed to be removed in each round. Denote the set of arms that appear to be the worst in round $t$ according to the estimates $\{\hat w_{i,t}\}$ by $L_t = \{j_1, \ldots, j_{k_t}\}$, where $\{j_l\}_{l \in [|S_t|]}$ is the ordering defined in line 6 of round $t$, and $k_t$ is the largest integer such that $\sum_{i \in L_t} w_i \le \phi_M W_t$. We define the following events for each round $t$:
$$E_1(t) := \{W_{M_t} < \phi_U W_t\},$$
$$E_2(t) := \{W_{L_t} \ge \phi_L W_t \text{ or } |S_t \setminus L_t| \le N_{\mathrm{final}}\},$$
$$E_3(t) := \{(S_{t+1} \setminus M_t \ne \emptyset) \text{ and } (S_{t+1} \cap L_t = \emptyset)\}.$$
We omit the argument $(t)$ on events when it is clear from context.
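As a concrete illustration of these definitions, the following toy computation (the numbers are our own) constructs the t-bad set, $M_t$ and $L_t$ from a vector of true weights and a vector of round-$t$ estimates:

```python
import numpy as np

w     = np.array([0.9, 0.5, 0.45, 0.2, 0.05])   # true weights; W_t = 2.1
w_hat = np.array([0.85, 0.55, 0.4, 0.25, 0.1])  # round-t MoM estimates
eps_t, phi_M = 0.3, 0.3

i_star = int(np.argmax(w))                       # i*_t (here arm 0)
bad = w < w[i_star] - eps_t                      # t-bad: w_j < w*_t - eps_t
M_t = np.nonzero(bad & (w_hat >= w_hat[i_star]))[0]  # bad arms estimated above i*_t
order = np.argsort(w_hat)                        # the ordering j_1, j_2, ... of line 6
k_t = int(np.sum(np.cumsum(w[order]) <= phi_M * w.sum()))  # largest k_t with W_{L_t} <= phi_M * W_t
L_t = order[:k_t]                                # the arms that appear worst
print("t-bad:", np.nonzero(bad)[0], "M_t:", M_t, "L_t:", L_t)
# -> t-bad: [1 2 3 4]  M_t: []  L_t: [4 3]
```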
The following lemmas state that in a standard round $t$, all of these events hold with a high probability, as long as the constants are selected appropriately. Their proofs are provided after the statement and proof of Theorem 5.6.

Lemma 5.2. For any $\phi_U \in (0, 1)$ and sufficiently large constants $C_q, C'_q, C_1 > 0$, we have that for any standard round $t$, with a probability at least $1 - \delta_t/4$, $E_1(t)$ holds.

Lemma 5.3. For any $\phi_L, \phi_M, \phi_U$ such that $0 < \phi_L < \phi_M < 1 - \phi_U < 1$, there are sufficiently large constants $C_q, C'_q, C_1, N_{\mathrm{final}} > 0$ such that for any standard round $t$, with a probability at least $1 - \delta_t/4$, $E_2(t)$ holds.

Lemma 5.4. For any $\phi_L, \phi_M, \phi_U$ such that $0 < \phi_L < \phi_M < 1 - \phi_U < 1$, there is some $\theta \in (0, 1)$ such that for a sufficiently large $C_2 > 0$, in any standard round $t$, with a probability at least $1 - \delta_t/4$, $E_1(t)$ implies $E_3(t)$.

The following lemma shows that these events guarantee that a good arm always remains in the candidate set, and that the reduction of the candidate set in each round is as required.

Lemma 5.5. For any standard round $t$ in which $E_1, E_2, E_3$ all hold, we have, with a probability at least $1 - \delta_t$, that (1) $w^*_{t+1} \ge w^*_t - \epsilon_t$, and (2) $W_{t+1} \le (1 - \phi_L)W_t$ or $|S_{t+1}| \le N_{\mathrm{final}}$.

Proof. To prove the first part, observe that by $E_3$, $S_{t+1} \setminus M_t \ne \emptyset$. Let $j \in \arg\max_{j \in S_{t+1} \setminus M_t} \hat w_{j,t}$. By the definition of $S_{t+1}$ in Alg. 1, it includes the arms in $S_t$ with the largest estimated weights. Thus, we also have $j \in \arg\max_{j \in S_t \setminus M_t} \hat w_{j,t}$. Therefore, since $i^*_t \in S_t \setminus M_t$, it follows that $\hat w_{j,t} \ge \hat w^*_t$. Thus, since $j \notin M_t$, it follows that $j$ is not t-bad, so $w_j \ge w^*_t - \epsilon_t$. Combined with $w^*_{t+1} \ge w_j$, this completes the proof of the first part.

For the second part, note that $S_{t+1} \subseteq S_t$ and $L_t \subseteq S_t$. By $E_3$, $S_{t+1} \cap L_t = \emptyset$. Therefore, $W_{t+1} + W_{L_t} \le W_t$. If $|S_{t+1}| > N_{\mathrm{final}}$ then by $E_2$, $W_{L_t} \ge \phi_L W_t$. Therefore, $W_{t+1} \le W_t(1 - \phi_L)$.

Combining the lemmas above, the main result is shown in the following theorem.

Theorem 5.6. For any setting of the constants except for $C_3$ and ρ that satisfies the lemmas above, there is a setting of $C_3 > 0$ and $\rho \in (0, 1)$ such that with a probability at least 1 − δ, MAB-PPR terminates and returns an ϵ-best arm, and its cost is $O(W \log(1/\delta)\max(1/\epsilon^2, V/\epsilon))$.

Proof. Condition below on $E_0$ occurring in all rounds, and on $E_1, E_2, E_3$ occurring in all standard rounds. By Lemmas 5.1, 5.2, 5.3, and 5.4, and the fact that $\sum_{t=1}^{\infty} \delta_t \le \delta$, this condition holds with a probability at least 1 − δ. By Lemma 5.5, in all standard rounds $W_t \le (1 - \phi_L)^{t-1}W$. From $E_0(t)$ it follows that in a round with $W_t < \epsilon/8$, we have $\widetilde W_t \le \epsilon/4$, which guarantees that MAB-PPR terminates in line 15. Therefore, MAB-PPR terminates at the latest when $t$ satisfies $(1 - \phi_L)^{t-1}W \le \epsilon/8$. Let $T$ be the last round of the run. By Lemma 5.5, in standard rounds we have $w^*_{t+1} \ge w^*_t - \epsilon_t$. Thus,
$$w^*_T \ge w^* - \sum_{t=1}^{T-1}\epsilon_t \ge w^* - (1 - \rho)\sum_{t=1}^{\infty}\rho^{t-1}\epsilon/2 = w^* - \epsilon/2. \quad (1)$$

Consider the two possible ways of terminating: if the algorithm terminates in line 15, then $\widetilde W_T \le \epsilon/4$. By $E_0(t)$, this means that $W_T \le \epsilon/2$, thus $w_r \ge w^*_T - \epsilon/2$. The other way of terminating is when $|S_T| \le N_{\mathrm{final}}$. In this case, for all $i \in S_T$, $\hat w_{i,T}$ is a $(q_3, \ell^{(3)})$-MoM estimate of $w_i$, where $q_3 \ge \frac{9}{2}\log(N_{\mathrm{final}}/\delta_T)$ and $\ell^{(3)} = C_3/\epsilon^2$. Set $C_3 \ge 96$. By Lemma 2.3 and a union bound on $N_{\mathrm{final}}$ arms, it follows that with a probability at least $1 - \delta_T$, for all $i \in S_T$,
$$|w_i - \hat w_{i,T}| \le \sqrt{6\sigma_i^2/\ell^{(3)}} \le \sqrt{6/\ell^{(3)}} \le \epsilon/4,$$
where the second inequality follows from Assumption 2.1. The returned arm in this case is $r \in \arg\max_{i \in S_T}\hat w_{i,T}$, thus $w_r \ge \hat w_{r,T} - \epsilon/4 \ge \hat w^*_T - \epsilon/4 \ge w^*_T - \epsilon/2$. Therefore, in both cases we have $w_r \ge w^*_T - \epsilon/2$. Combined with Eq. (1), this proves that $w_r \ge w^* - \epsilon$.

The cost of any round of MAB-PPR is upper bounded by $W_t(q_1\ell_t + q_2\ell^{(2)} + q_3\ell^{(3)})$. Therefore, the total cost is at most $W\sum_{t=1}^{T}(1 - \phi_L)^{t-1}(q_1\ell_t + q_2\ell^{(2)} + q_3\ell^{(3)})$. We have, for constants $C, C' > 0$ (that depend on the constants used in Alg. 1), that
$$q_1\ell_t + q_2\ell^{(2)} + q_3\ell^{(3)} \le C\log(C'/\delta_t)\max\Big(\frac{1}{\epsilon_t^2}, \frac{V}{\epsilon_t}\Big),$$
where $\delta_t = \delta/2^t$ and $\epsilon_t = \frac{1}{2}(1 - \rho)\rho^{t-1}\epsilon$. Therefore, the total cost is upper bounded by
$$W\sum_{t=1}^{\infty}(1 - \phi_L)^{t-1}\,C\log(C'/\delta_t)\max\Big(\frac{1}{\epsilon_t^2}, \frac{V}{\epsilon_t}\Big) \le C''\,W\log(C'''/\delta)\max\Big(\frac{1}{\epsilon^2}, \frac{V}{\epsilon}\Big),$$
for constants $C'', C''' > 0$. Setting $\rho \in (\sqrt{1 - \phi_L}, 1)$, the sum over $t$ is upper bounded by a constant, giving the required upper bound on the cost.
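To spell out the last step (our expansion of the argument; constants are not optimized, and we assume $C' \ge 1$): with $\delta_t = \delta/2^t$ and $\epsilon_t = \frac{1}{2}(1-\rho)\rho^{t-1}\epsilon$, each summand satisfies

```latex
(1-\phi_L)^{t-1}\,\log(C'/\delta_t)\,
  \max\!\Big(\tfrac{1}{\epsilon_t^2},\tfrac{V}{\epsilon_t}\Big)
\;\le\;
\frac{4\,(t+1)\,\log(2C'/\delta)}{(1-\rho)^{2}}\,
  \max\!\Big(\tfrac{1}{\epsilon^{2}},\tfrac{V}{\epsilon}\Big)
  \Big(\frac{1-\phi_L}{\rho^{2}}\Big)^{t-1},
```

and $\sum_{t \ge 1} t\,x^{t-1} = (1-x)^{-2} < \infty$ for $x := (1-\phi_L)/\rho^2$, which is smaller than $1$ exactly when $\rho > \sqrt{1-\phi_L}$.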
We now prove the lemmas stating that the events $E_1, E_2, E_3$ hold with a high probability. We start with Lemma 5.2, which shows this for $E_1(t)$, which states that $W_{M_t} < \phi_U W_t$.

Proof of Lemma 5.2. Fix a round $t$. Define the event $E_4 := \{\hat w^*_t \ge w^*_t - \epsilon_t/2\}$. Setting $C_q \ge \frac{9}{2}$, $C'_q \ge 8$ and $C_1 \ge 24$, we have $q_1 \ge \frac{9}{2}\log(8/\delta_t)$ and $\ell_t \ge 24/\epsilon_t^2$. By Assumption 2.1, $\sigma_{i^*_t}^2 \le 1$. Hence, by Lemma 2.3, with a probability $1 - \delta_t/8$, $|w^*_t - \hat w^*_t| \le \sqrt{6/\ell_t} \le \epsilon_t/2$, hence $E_4$ holds.

Let $i \in S_t$ be a t-bad arm. Let $\hat w_i^j$ for $j \in [q_1]$ be the $j$'th mean estimate of $w_i$ used to get the estimate $\hat w_{i,t}$ in line 5. Let $\alpha_1 \in (1 - \phi_U/2, 1)$. Set $\alpha_2 := 1 - \frac{1 - \phi_U/2}{\alpha_1}$, and set $\alpha_3 \in (0, \alpha_2(1 - \alpha_1))$. If $E_4$ holds, then since $i$ is t-bad, $\hat w_i^j \ge \hat w^*_t$ implies $\hat w_i^j \ge \hat w^*_t \ge w_i + \epsilon_t/2$. Denote $\mathbb{P}_t[\cdot] := \mathbb{P}[\cdot \mid S_t]$. Then,
$$\mathbb{P}_t[\hat w_i^j \ge \hat w^*_t \mid E_4] \le \mathbb{P}_t[\hat w_i^j \ge w_i + \epsilon_t/2]/\mathbb{P}_t[E_4] \le 2\,\mathbb{P}_t[\hat w_i^j \ge w_i + \epsilon_t/2] \le \frac{2}{\ell_t(\epsilon_t/2)^2}.$$
The last inequality follows from Chebyshev's inequality, since $\sigma_i^2 \le 1$. Setting $C_1 \ge \frac{8}{\alpha_3}$, we have $\ell_t \ge \frac{8}{\alpha_3\epsilon_t^2}$, hence $\mathbb{P}_t[\hat w_i^j \ge \hat w^*_t \mid E_4] \le \alpha_3$.

Denote $M_t^j := \{i \mid i \text{ is t-bad and } \hat w_i^j \ge \hat w^*_t\}$. We conclude that
$$\mathbb{E}_t[W_{M_t^j} \mid E_4] = \sum_{i \in S_t} w_i\,\mathbb{P}_t[i \in M_t^j \mid E_4] \le \sum_{i \in S_t} w_i\,\mathbb{P}_t[\hat w_i^j \ge \hat w^*_t \mid E_4] \le \alpha_3 W_t.$$
By Markov's inequality, it follows that $\mathbb{P}_t[W_{M_t^j} \ge \alpha_2 W_t \mid E_4] \le \frac{\alpha_3 W_t}{\alpha_2 W_t} < 1 - \alpha_1$. Define the event $E_5 := \{\frac{1}{q_1}\sum_{j \in [q_1]}\mathbb{I}[W_{M_t^j} \ge \alpha_2 W_t] < 1 - \alpha_1\}$. We have $q_1 \ge C_q\log(C'_q/\delta_t)$. Setting $C_q \ge 1/(2(1 - \alpha_1 - \alpha_3/\alpha_2)^2)$ and $C'_q \ge 8$, we have by Hoeffding's inequality and the independence of $\{M_t^j\}$ for $j \in [q_1]$ that $\mathbb{P}_t[E_5 \mid E_4] \ge 1 - \delta_t/8$. Therefore, combined with $\mathbb{P}_t[E_4] \ge 1 - \delta_t/8$, we get that for any $S_t$, $\mathbb{P}_t[E_5] \ge 1 - \delta_t/4$. Under $E_5$, we have
$$\frac{1}{q_1}\sum_{j} W_{M_t^j} < \alpha_2\alpha_1 W_t + (1 - \alpha_1)W_t = (\alpha_2\alpha_1 + 1 - \alpha_1)W_t = \phi_U W_t/2, \quad (2)$$
where the last equality follows from the definition of $\alpha_2$. On the other hand, since $\hat w_{i,t}$ is the median of $\{\hat w_i^j\}_{j \in [q_1]}$, and $i \in M_t$ implies $\hat w_{i,t} \ge \hat w^*_t$, we have
$$W_{M_t} = \sum_{i \in S_t} w_i\,\mathbb{I}[i \in M_t] \le \sum_{i \in S_t} w_i\,\mathbb{I}\big[|\{j \mid i \in M_t^j\}| \ge q_1/2\big] \le \sum_{i \in S_t} w_i \cdot \frac{2}{q_1}\sum_{j \in [q_1]}\mathbb{I}[i \in M_t^j] = \frac{2}{q_1}\sum_{j \in [q_1]} W_{M_t^j}.$$
Combined with Eq. (2), we get that with a probability of $1 - \delta_t/4$, $W_{M_t} < \phi_U W_t$.

We now prove Lemma 5.3, which states that with a probability at least $1 - \delta_t/4$, $E_2(t)$ holds in a standard round, i.e., $W_{L_t} \ge \phi_L W_t$ or $|S_t \setminus L_t| \le N_{\mathrm{final}}$. In this proof we use the following lemma, proved in the supplementary material (Appendix C).

Lemma 5.7. Let $Y_1, \ldots, Y_z$ be random variables with means $\mu_1, \ldots, \mu_z$. Let $\bar Y := \sum_{i \in [z]} Y_i$, $\bar\mu := \sum_{i \in [z]}\mu_i = \mathbb{E}[\bar Y]$. Let $\bar\sigma^2 := \mathrm{Var}(\bar Y) < \infty$. For $i \in [z]$, let $\hat\mu_i$ be the $(q, \ell)$-MoM estimate of $\mu_i$ based on $q\ell$ i.i.d. draws of copies of $Y_i$. Then for any $\gamma > 0$, with a probability at least $1 - \exp(-q/18)$,
$$|\{i \in [z] \mid \hat\mu_i \ge \gamma\bar\mu\}| \le \frac{4}{\gamma}\Big(1 + \sqrt{\frac{6\bar\sigma^2}{\ell\,\bar\mu^2}}\Big).$$
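As a quick numerical sanity check of Lemma 5.7 (a toy simulation of our own, using the bound as stated above), take many arms with tiny means and count how many per-arm MoM estimates reach a γ-fraction of the total mean:

```python
import numpy as np

def mom(x, q, ell):
    return float(np.median(x[: q * ell].reshape(q, ell).mean(axis=1)))

rng = np.random.default_rng(2)
z, q, ell, gamma = 200, 30, 50, 0.1
mu = np.full(z, 1.0 / z)                 # z tiny Bernoulli means, mu_bar = 1
x = (rng.random((q * ell, z)) < mu).astype(float)
mu_hat = np.array([mom(x[:, i], q, ell) for i in range(z)])

count = int(np.sum(mu_hat >= gamma * mu.sum()))
sigma_bar2 = np.sum(mu * (1 - mu))       # variance of the sum (independent arms)
bound = (4 / gamma) * (1 + np.sqrt(6 * sigma_bar2 / (ell * mu.sum() ** 2)))
print(f"count: {count}  <=  bound: {bound:.1f}  (w.p. >= 1 - exp(-q/18))")
```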
Proof of Lemma 5.3. We will assume that $W_{L_t} < \phi_L W_t$ and conclude the required upper bound on $|S_t \setminus L_t|$ with a high probability. Let $\beta := \phi_M - \phi_L$. Let $\nu \in (0, \beta)$, and set $\alpha := \frac{1}{2}(\beta - \nu)/\sqrt{\beta}$. Let $j' := j_{k_t+1}$, where $k_t$ is given in the definition of $L_t$ and $\{j_l\}$ is the ordering set in line 6 in round $t$. By the definition of $L_t$, $W_{L_t} + w_{j'} > \phi_M W_t$. By our assumption, $W_{L_t} < \phi_L W_t$. Therefore, $w_{j'} > (\phi_M - \phi_L)W_t = \beta W_t$. Define $J := \{i \in S_t \mid w_i \ge \beta W_t\}$. We have $|J| \le 1/\beta$. Set $C_q \ge \frac{9}{2}$ and $C'_q \ge 8/\beta$, so that $q_1 \ge \frac{9}{2}\log((8/\beta)/\delta_t)$. Recall that $\sigma_i^2 \le w_i$ by Assumption 2.1. Therefore, by Lemma 2.3 and a union bound, with a probability at least $1 - \delta_t/8$, for all $i \in J$, $|w_i - \hat w_i| \le \sqrt{6w_i/\ell_t}$. Setting $C_1 \ge 6/\alpha^2$, it follows that $\ell_t \ge 6/(\alpha^2\epsilon_t)$. We have $\epsilon_t \le \epsilon/2$ for all $t$. In addition, since $t$ is a standard round, $\epsilon/8 \le W_t$. Therefore, for $i \in J$, $\epsilon_t \le 4W_t \le 4w_i/\beta$. It follows that with a probability at least $1 - \delta_t/8$, for all $i \in J$,
$$|w_i - \hat w_i| \le \sqrt{6w_i/\ell_t} \le \alpha\sqrt{\epsilon_t w_i} \le (2\alpha/\sqrt{\beta})\,w_i = (1 - \nu/\beta)\,w_i,$$
where the equality follows from the definition of $\alpha$. Hence $\hat w_i \ge (\nu/\beta)\,w_i \ge \nu W_t$. In particular, this holds for $j' = j_{k_t+1}$ as set above. Condition below on this event. By the definition of $L_t$ and $j'$, for all $i \in S_t \setminus L_t$, $\hat w_i \ge \hat w_{j'}$. Therefore, by the event above, for all $i \in S_t \setminus L_t$, $\hat w_i \ge \nu W_t$. Hence, $|S_t \setminus L_t| \le |\{i \in S_t \mid \hat w_i \ge \nu W_t\}|$.

To bound the RHS, recall that $\sigma_{S_t}^2 \le V W_t$ by Assumption 2.2. Thus, applying Lemma 5.7, we have that by setting $C_q \ge 18$, w.p. $1 - \delta_t/8$,
$$|\{i \in S_t \mid \hat w_i \ge \nu W_t\}| \le \frac{4}{\nu}\Big(1 + \sqrt{\frac{6V}{W_t\ell_t}}\Big).$$
We have $\ell_t \ge C_1 V/\epsilon_t \ge C_1 V/(4W_t)$. Therefore,
$$|\{i \in S_t \mid \hat w_i \ge \nu W_t\}| \le \frac{4}{\nu}\big(1 + \sqrt{24/C_1}\big) =: N_{\mathrm{final}}. \quad (3)$$
By a union bound, with a probability $1 - \delta_t/4$, the event above and Eq. (3) hold simultaneously, in which case $W_{L_t} < \phi_L W_t$ implies $|S_t \setminus L_t| \le N_{\mathrm{final}}$.

Lastly, we prove Lemma 5.4, which states that in a standard round $t$, with a probability at least $1 - \delta_t/4$, $E_1(t)$ implies $E_3(t)$, i.e., $S_{t+1} \setminus M_t \ne \emptyset$ and $S_{t+1} \cap L_t = \emptyset$.

Proof of Lemma 5.4. Fix a round $t$. Let $A_l := \{j_1, \ldots, j_l\}$, where $\{j_l\}$ is the ordering set in line 6, and let $\widetilde W_{A_l} := \widetilde W_{A_l,t}$ be the estimate of $W_{A_l}$ calculated in line 12 of round $t$. By the assumption of the lemma, $\phi_M < 1 - \phi_U$. Let $R \in (0, 1 - \phi_U)$ be the solution to
$$\frac{\phi_M + R\sqrt{\phi_M}}{1 - R} - \frac{1 - \phi_U - R}{1 + R} = 0.$$
The solution is in this range, since for $R = 0$, the LHS is equal to $\phi_M - (1 - \phi_U) < 0$, and for $R = 1 - \phi_U$, the LHS is positive. Set $\theta := \frac{1 - \phi_U - R}{1 + R}$.

Recall that $q_2 = \frac{9}{2}\log(12/\delta_t)$. By setting $C_2 \ge 48/R^2$, we get $\ell^{(2)} \ge 48V/(R^2\epsilon)$. Therefore, by Lemma 2.3 and Assumption 2.2, for any fixed $l$ the following holds with a probability $1 - \delta_t/12$:
$$|\widetilde W_{A_l} - W_{A_l}| \le \sqrt{6V W_{A_l}/\ell^{(2)}} \le \sqrt{R^2 W_{A_l}\epsilon/8} \le R\sqrt{W_{A_l} W_t}. \quad (4)$$
The last inequality follows since $t$ is a standard round, hence $W_t \ge \epsilon/8$. Let $n_t$ be the smallest integer such that $W_{A_{n_t}} \ge W_t - W_{M_t}$. With a probability $1 - \delta_t/4$, Eq. (4) holds simultaneously for $l = n_t$, $l = k_t$ (so $A_l = L_t$) and $l = |S_t|$ (so $A_l = S_t$). Condition below on this combined event. It follows from the definition of $L_t$ that $\widetilde W_{L_t} \le W_{L_t} + R\sqrt{W_{L_t}W_t} \le (\phi_M + R\sqrt{\phi_M})W_t$. Suppose that $E_1(t)$ holds, so $W_{M_t} < \phi_U W_t$. By the definition of $n_t$, $W_{A_{n_t}} > (1 - \phi_U)W_t$. Therefore,
$$\widetilde W_{A_{n_t}} \ge W_{A_{n_t}} - R\sqrt{W_{A_{n_t}}W_t} > (1 - \phi_U)W_t - RW_t = (1 - \phi_U - R)W_t.$$
Lastly, from Eq. (4) for $l = |S_t|$, we get $W_t(1 - R) \le \widetilde W_t \le W_t(1 + R)$. Now, MAB-PPR sets $S_{t+1} = S_t \setminus A_{N_t}$, where $N_t$ is the largest number that satisfies $\widetilde W_{A_{N_t}}/\widetilde W_t \le \theta$. From the inequalities above, we have
$$\frac{\widetilde W_{A_{n_t}}}{\widetilde W_t} > \frac{1 - \phi_U - R}{1 + R} = \theta \quad\text{and}\quad \frac{\widetilde W_{L_t}}{\widetilde W_t} \le \frac{\phi_M + R\sqrt{\phi_M}}{1 - R} = \theta.$$
Hence, $k_t \le N_t < n_t$. From $k_t \le N_t$, we have $S_{t+1} \cap L_t = \emptyset$. From $N_t < n_t$, we have $S_{t+1} = S_t \setminus A_{N_t} \supseteq S_t \setminus A_{n_t-1}$. Therefore $W_{t+1} \ge W_t - W_{A_{n_t-1}}$. By the definition of $n_t$, we have $W_{A_{n_t-1}} < W_t - W_{M_t}$. Therefore, $W_{t+1} > W_{M_t}$. It follows that $S_{t+1} \setminus M_t \ne \emptyset$. Thus, $E_1(t)$ implies $E_3(t)$.

Having proved Lemmas 5.2, 5.3 and 5.4, this finalizes the proof of Theorem 5.6.

6 Conclusion

In this work we showed that it is possible to identify an ϵ-best arm in a pay-per-reward multi-armed-bandit setting at a cost that does not depend on the number of arms, and depends only linearly on the total expected reward. In future work, we plan to study instance-specific bounds for this setting.

Acknowledgements

Sivan Sabato was supported in part by the Israel Science Foundation (grant No. 555/15).

References
[1] D. Agarwal, B.-C. Chen, and P. Elango. Explore/exploit schemes for web content optimization. In 2009 Ninth IEEE International Conference on Data Mining, pages 1–10. IEEE, 2009.
[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[3] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
[4] M. Aziz, J. Anderton, E. Kaufmann, and J. Aslam. Pure exploration in infinitely-armed bandit models with fixed-confidence. arXiv preprint arXiv:1803.04665, 2018.
[5] J. S. Banks and R. K. Sundaram. Denumerable-armed bandits. Econometrica: Journal of the Econometric Society, pages 1071–1096, 1992.
[6] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103–2116, 1997.
[7] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
[8] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37. Springer, 2009.
[9] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, Nov 2013. ISSN 0018-9448. doi: 10.1109/TIT.2013.2277869.
[10] A. Carpentier and A. Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pages 590–604, 2016.
[11] A. Carpentier and M. Valko. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pages 1133–1141, 2015.
[12] L. Chen, J. Li, and M. Qiao. Towards instance optimal bounds for best arm identification. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 535–592, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
[13] W. Ding, T. Qin, X.-D. Zhang, and T.-Y. Liu. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
[14] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
[15] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
[16] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.
[17] K. G. Jamieson and R. D. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6, 2014.
[18] E. Joly, G. Lugosi, and R. I. Oliveira. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11(1):440–451, 2017.
[19] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
[20] H. Li and Y. Xia. Infinitely many-armed bandits with budget constraints. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[21] G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.
[22] S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
[23] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015. doi: 10.3150/14-BEJ645. URL https://doi.org/10.3150/14-BEJ645.
[24] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 216–227. SIAM, 2007.
[25] L. Tran-Thanh, A. Chapman, E. M. de Cote, A. Rogers, and N. R. Jennings. Epsilon-first policies for budget-limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[26] L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[27] H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann, and A. Krause. Discovering valuable items from massive data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1195–1204. ACM, 2015.
[28] Y. Wang, J.-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, pages 1729–1736, 2009.
[29] Y. Xia, T. Qin, W. Ma, N. Yu, and T.-Y. Liu. Budgeted multi-armed bandits with multiple plays. In IJCAI, pages 2210–2216, 2016.
[30] Y. Xia, T. Qin, N. Yu, and T.-Y. Liu. Best action selection in a stochastic environment. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 758–766. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
[31] X. Yu, H. Shao, M. R. Lyu, and I. King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937–946, 2018.
[32] Q. Zhou, X. Zhang, J. Xu, and B. Liang. Large-scale bandit approaches for recommender systems. In Neural Information Processing, pages 811–821, Cham, 2017. Springer International Publishing.