ϵ-Best-Arm Identification in Pay-Per-Reward Multi-Armed Bandits

Sivan Sabato
Department of Computer Science
Ben-Gurion University of the Negev
Beer-Sheva, Israel 8410501
sabatos@cs.bgu.ac.il

Abstract

We study ϵ-best-arm identification, in a setting where during the exploration phase, the cost of each arm pull is proportional to the expected future reward of that arm. We term this setting Pay-Per-Reward. We provide an algorithm for this setting, that with a high probability returns an ϵ-best arm, while incurring a cost that depends only linearly on the total expected reward of all arms, and does not depend at all on the number of arms. Under mild assumptions, the algorithm can be applied also to problems with infinitely many arms.

1 Introduction

Consider placing an ad next to search engine results, based on the search query. In a preliminary survey for a future promotion, a retailer wishes to identify the best query expression to link to its ad, that is, the expression that maximizes the expected number of clicks on the ad. The payment per ad is proportional to the number of clicks. However, during the survey, clicks do not lead to profit, while the payment is still due. This problem can be formulated as a type of best-arm-identification problem in a stochastic multi-armed bandit (MAB) setting [see, e.g., 14, 15, 8, 3]. In stochastic MAB, there is a set of arms, each associated with a non-negative reward distribution. Each pull of an arm draws an instantaneous reward from the corresponding reward distribution. The MAB algorithm iteratively pulls arms and observes their instantaneous rewards. At the end of the run, the algorithm selects one arm. The goal is to find an arm with a near-maximal expected reward with a high probability. In the example above, each arm in the MAB framework can be mapped to a specific search expression, and pulling the arm is equivalent to linking the ad to the expression represented by this arm for a given period of time, and then observing the number of clicks.

In standard MAB, each arm pull incurs a unit cost. However, in the example above, the cost of pulling an arm is the number of clicks on the ad, assuming the common pay-per-click advertisement model. Thus, the expected cost in the survey stage is proportional to the expected reward after the survey stage. We term this setting Pay-Per-Reward (PPR). Other applications of this model include any case where there is an exploration stage in which rewards are not collected, but payment is still proportional to the quality of the arm. For instance, consider a sensor field, where the goal is to find the most active sensor, but each activity report has a communication cost. Here, pulling an arm is done by activating the sensor's reporting option for a given period of time. Another example is finding the best crowd worker on a crowd-sourcing platform. Here, an "arm pull" is equivalent to providing the worker with a test task. The workers are paid according to their success rate in the test task, thus a more successful worker will be paid more during the exploration stage.

Our contribution. We provide an algorithm, MAB-PPR, for the pay-per-reward setting, that finds an ϵ-optimal arm with a probability of at least 1 − δ, for given ϵ, δ ∈ (0, 1). We show that the cost incurred by MAB-PPR has no dependence on the number of arms, and only a linear dependence on the total expected reward of all arms.
Our results generalize beyond bounded reward distributions, and support some heavy-tailed distributions with bounded second moments. Under mild assumptions, the algorithm can be applied also to problems with infinitely many arms. Since the cost of MAB-PPR is independent of the number of arms, it is especially useful when the number of arms is potentially very large, while the total expected reward is bounded. For instance, in the ad-placement example, the number of arms is proportional to the number of admissible search expressions, which can be practically unbounded, but only a small number of those are expected to have large click rates when linked to a specific ad. Other applications in which the number of arms can be very large include, for instance, recommender systems [32], content personalization [24] and web-content optimization [1].

For the standard setting, in which each arm pull has a unit cost, [14] provided an ϵ-best-arm identification algorithm whose sample complexity depends linearly on the number of arms, using an algorithm that they term the median-elimination algorithm. This algorithm halves the number of candidate arms in each round based on their empirical mean rewards. Instead, in our setting, one should try to halve the total expected reward of the arms that remain in each round. This requires a new estimation scheme. Another challenge in the PPR setting is that due to the potentially unbounded number of arms, sums of instantaneous rewards of sets of arms do not have a bounded support, even if single-arm rewards are bounded. Moreover, in some settings one might prefer to simultaneously pull an entire set of arms from the joint distribution of rewards of all arms. Thus, standard assumptions on independence between arms do not necessarily hold. MAB-PPR addresses these challenges using a two-step estimation procedure in each round, combined with heavy-tailed mean estimation. This guarantees success even under some dependence between arms. Support for some unbounded reward distributions of single arms is easily obtained within this framework, by using heavy-tailed mean estimation for the expected rewards of single arms as well. Our analysis shows that the PPR setting can be successfully handled in a wide range of regimes.

Paper structure. After discussing related work, we present the setting and preliminaries in Section 2. The main result is stated in Section 3. The MAB-PPR algorithm is listed and described in Section 4. We prove the main result in Section 5, and conclude in Section 6.

1.1 Related work

[14] formulated the stochastic MAB ϵ-best-arm identification problem (also termed the PAC setting) for bounded rewards. They showed that the number of pulls can be linear in the number of arms, without the logarithmic factors that would be necessary in a naive algorithm that pulls all arms the same number of times. Closely matching lower bounds for this setting were given in [22]. Instance-specific bounds in this setting were studied in [19, 17, 12]. A fixed-budget variant was studied in [8, 3, 10]. The works above assume the standard cost model for MAB, in which each arm pull costs one unit. A different cost model has been studied in the setting of budgeted MAB [25, 26, 13, 27, 29, 20].
Here, arms are associated with a random or deterministic cost, in addition to a reward distribution, and the goal is to maximize the cumulative reward within total cost constraints. [30] studied a best-arm variant of this problem. MAB with infinitely many arms, also termed many-armed bandits, has been studied in the unit-cost model, both for the cumulative-regret setting [5, 6, 28, 20] and for exploration settings [11, 4]. [20] study infinitely many arms in the budgeted MAB setting. A useful tool for mean estimation for heavy-tailed random variables is median-of-means estimation [2]. This tool has recently been used in various contexts in machine learning [e.g., 23, 7, 16, 21, 18]. MAB with heavy-tailed rewards under the standard unit-cost model was studied in the regret setting [9] and for pure exploration with unbounded rewards [31].

2 Setting and Preliminaries

For an integer $i$, denote $[i] := \{1, \ldots, i\}$. Let $K$ be the total number of available arms, and denote the set of available arms by $[K]$ (we discuss in Section 4 how to use MAB-PPR when the set of available arms is infinite or unknown). Let $X_1, \ldots, X_K$ be the random variables associated with the reward distributions of the corresponding arms. We assume that the random variables are non-negative, and have a finite expectation and variance. Pulling arm $i$ is equivalent to drawing an independent copy of $X_i$ and observing its value. We study here a more general setting, in which sets of arms can be pulled simultaneously. We assume some joint distribution of the random vector $X = (X_1, \ldots, X_K)$. A pull of a specific set of arms is equivalent to drawing an independent copy of $X$ and observing the rewards of all the arms in the set. The standard setting, in which arms are pulled one by one, is equivalent to assuming that $X_1, \ldots, X_K$ are statistically independent in $X$. Thus, we generally assume below that sets are simultaneously pulled.

Denote the expected reward of arm $i$ by $w_i := \mathbb{E}[X_i]$, termed below the arm's weight. Let $W := \sum_{i \in [K]} w_i$. For a set $A \subseteq [K]$, denote $X_A := \sum_{i \in A} X_i$ and $W_A := \sum_{i \in A} w_i$. Consider a run of the algorithm in which the total number of arm pulls (regardless of whether some of the arms were pulled simultaneously) is $n$, and let $I_1, \ldots, I_n$ be the sequence of pulled arms. The cost of the algorithm in this run is $\sum_{i \in [n]} w_{I_i}$.

We make two assumptions on the distribution of $X$.

Assumption 2.1. For all $i \in [K]$, $\sigma_i^2 := \mathrm{Var}(X_i) \le w_i \le 1$.

This assumption clearly holds if the $X_i$ are bounded in $[0, 1]$. Nonetheless, it is more general, and can hold also for reward distributions with unbounded support.

Assumption 2.2. There is a constant $V \ge 1$, such that for any subset of the arms $A \subseteq [K]$, $\sigma_A^2 := \mathrm{Var}(X_A) \le V \cdot W_A$.

Assumption 2.2 trivially holds with $V = 1$ if arm-pulls of individual arms are statistically independent, as when they are pulled separately. In this case, $\sigma_A^2 = \sum_{i \in A} \sigma_i^2 \le \sum_{i \in A} w_i = W_A$, where Assumption 2.1 was used to bound $\sigma_i^2$. If arm rewards are not statistically independent, then the existence and value of the constant $V$ depend on the distribution of $X$. For instance, if each arm is positively correlated with at most $Q$ other arms, then $V$ can be upper bounded by $1 + 2Q$ (see the supplementary material, Appendix A). We note that our results below are indifferent to values of $V$ as large as $1/\epsilon$.

We use the median-of-means method [2] to estimate the mean of a distribution with a bounded variance. In this method, $q$ batches of $\ell$ independent samples are used; a separate empirical mean is calculated from each batch, and the estimate is the median of these means. Formally, the $(q, \ell)$-MoM estimator of a random variable $X$ from a sample of i.i.d. draws $\{x_{i,j}\}_{i \in [q], j \in [\ell]}$ of $X$ is defined as $\mathrm{Median}\{\frac{1}{\ell}\sum_{j \in [\ell]} x_{i,j} \mid i \in [q]\}$. We use the following formulation of the result of [2] on the convergence of the MoM estimator to the true mean, similarly to [16]. The proof is provided for completeness in the supplementary material, Appendix B.

Lemma 2.3 (Median-of-Means). Let $X$ be a random variable with mean $w$ and variance $\sigma^2 < \infty$. Let $q, \ell$ be integers, and let $\hat w$ be the $(q, \ell)$-MoM estimate of $w$ from a sample of $q\ell$ i.i.d. draws of copies of $X$. With a probability at least $1 - \exp(-\frac{2}{9}q)$, we have $|w - \hat w| \le \sqrt{6\sigma^2/\ell}$.
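For concreteness, the following is a minimal NumPy sketch of the $(q, \ell)$-MoM estimator defined above (an illustration of the definition, not code from the paper; the function name and the toy Pareto reward distribution are our own choices):

```python
import numpy as np

def mom_estimate(samples: np.ndarray, q: int, ell: int) -> float:
    """(q, ell)-median-of-means: split q*ell i.i.d. samples into q batches
    of size ell, average each batch, and return the median of the averages."""
    batch_means = samples[: q * ell].reshape(q, ell).mean(axis=1)
    return float(np.median(batch_means))

# Sanity check on a heavy-tailed reward distribution with finite variance:
# numpy's pareto(a) is the Lomax distribution, with mean 1/(a-1) for a > 1.
rng = np.random.default_rng(0)
q, ell = 20, 500
x = rng.pareto(2.5, size=q * ell)
print(f"MoM estimate: {mom_estimate(x, q, ell):.3f}, true mean: {1 / 1.5:.3f}")
```

With $q = \frac{9}{2}\log(1/\delta')$ batches, Lemma 2.3 gives accuracy $\sqrt{6\sigma^2/\ell}$ with probability at least $1 - \delta'$, which is how MAB-PPR sets its batch counts below.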
3 Main result

We propose the MAB-PPR algorithm, which accepts parameters ϵ, δ ∈ (0, 1), iteratively pulls arms, and outputs a single arm. MAB-PPR satisfies the following guarantee.

Theorem 3.1. Suppose that Assumptions 2.1 and 2.2 hold. Let $W := \sum_{i \in [K]} w_i$, and let $V$ be as defined in Assumption 2.2. With a probability of at least 1 − δ, MAB-PPR returns an ϵ-optimal arm, and its cost is upper bounded by
$$O\Big(W \log(1/\delta) \max\Big(\frac{1}{\epsilon^2}, \frac{V}{\epsilon}\Big)\Big).$$

Here and below, $O(\cdot)$ stands for universal multiplicative constants. In comparison, previous algorithms for the unit-cost model, such as the ones studied in [14], would cost here at least $\Omega(W \log(K/\delta)/\epsilon^2)$, where $K$ is the number of arms, assuming bounded rewards. Qualitatively, removing the dependence on $K$ is of significance. MAB-PPR obtains a quantitatively significant improvement when $K$ is large and $V \le O(1/\epsilon)$.
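To get a feel for the scale of this improvement, here is a toy calculation of our own (universal constants are dropped, so only the ratio is meaningful):

```python
import math

# Shapes of the two cost bounds, ignoring universal constants.
W, K = 10.0, 10**6            # total expected reward; number of arms (toy values)
eps, delta, V = 0.1, 0.05, 1.0

ppr_bound = W * math.log(1 / delta) * max(1 / eps**2, V / eps)   # Theorem 3.1
unit_cost_bound = W * math.log(K / delta) / eps**2               # unit-cost model

print(f"PPR: ~{ppr_bound:.0f}   unit-cost: ~{unit_cost_bound:.0f}   "
      f"ratio: {unit_cost_bound / ppr_bound:.1f}x")
# The log(K/delta) factor grows with K, while the PPR bound does not depend on K.
```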
Algorithm 1 MAB-PPR: ϵ-Best-Arm Identification with Pay-Per-Reward
Input: ϵ, δ ∈ (0, 1), $V \ge 1$, $K \in \mathbb{N}$; universal constants $C_q, C'_q, C_1, C_2, C_3, N_{\mathrm{final}} > 0$; $\rho, \theta \in (0, 1)$.
Output: $r \in [K]$
1: $t \leftarrow 1$, $S_1 \leftarrow [K]$, $\epsilon_1 \leftarrow (1 - \rho)\epsilon/2$, $\delta_1 \leftarrow \delta/2$.
2: while $|S_t| > N_{\mathrm{final}}$ do
3:   First batch of pulls:
4:   Set $q_1 \leftarrow C_q \log(C'_q/\delta_t)$ and $\ell_t \leftarrow C_1 \max(1/\epsilon_t^2, V/\epsilon_t)$; pull the arms in $S_t$ for $q_1 \ell_t$ times.
5:   For each $i \in S_t$, use the pulls of arm $i$ to calculate $\hat w_{i,t}$, the $(q_1, \ell_t)$-MoM estimate of $w_i$.
6:   Set $\{j_l\}_{l \in [|S_t|]}$ to be an ordering of $S_t$ such that $\hat w_{j_l,t} \le \hat w_{j_{l+1},t}$ for all $l \in [|S_t| - 1]$.
7:   Second batch of pulls:
8:   Set $q_2 \leftarrow \frac{9}{2}\log(12/\delta_t)$, $\ell^{(2)} \leftarrow C_2 V/\epsilon$; pull the arms in $S_t$ for $q_2 \ell^{(2)}$ times.
9:   For $i \in S_t$, let $\{x_i^{j,l'}\}_{j \in [q_2], l' \in [\ell^{(2)}]}$ be the i.i.d. copies of $X_i$ obtained in line 8.
10:  for $l \in [|S_t|]$ do
11:    Define $A_l := \{j_1, \ldots, j_l\}$.
12:    Use $\{\sum_{i \in A_l} x_i^{j,l'}\}_{j \in [q_2], l' \in [\ell^{(2)}]}$ to calculate $\widetilde W_{A_l,t}$, the $(q_2, \ell^{(2)})$-MoM estimate of $W_{A_l}$.
13:  end for
14:  Decide which arms to remove:
15:  If $\widetilde W_{S_t,t} \le \epsilon/4$, return some $r \in S_t$ and terminate.
16:  Find the largest integer $N_t$ such that $\widetilde W_{A_{N_t},t}/\widetilde W_{S_t,t} \le \theta$.
17:  Set $S_{t+1} \leftarrow S_t \setminus A_{N_t}$, $\epsilon_{t+1} \leftarrow \rho\epsilon_t$, $\delta_{t+1} \leftarrow \delta_t/2$, $t \leftarrow t + 1$.
18: end while
19: Final round ($|S_t| \le N_{\mathrm{final}}$):
20: Set $q_3 \leftarrow \frac{9}{2}\log(N_{\mathrm{final}}/\delta_t)$, $\ell^{(3)} \leftarrow C_3/\epsilon^2$; pull the arms in $S_t$ for $q_3 \ell^{(3)}$ times.
21: For $i \in S_t$, use the pulls of arm $i$ to calculate $\hat w_{i,t}$, the $(q_3, \ell^{(3)})$-MoM estimate of $w_i$.
22: return some $r \in \arg\max_{i \in S_t} \hat w_{i,t}$.

4 The MAB-PPR algorithm

The MAB-PPR algorithm is listed in Alg. 1. The main ideas in MAB-PPR are described below. We use several universal constants in Alg. 1. In the analysis below, we show that there exist values for these constants such that the guarantees of MAB-PPR hold. For readability, we do not explicitly round up non-integer sample sizes. Such rounding would not affect our final cost upper bound.

A central challenge in designing MAB-PPR is avoiding a union bound on the number of arms. In [14], where arm pulls have unit cost, the median-elimination algorithm was proposed for this purpose. This algorithm removes half of the arms from the candidate set in each round. In our case, because of the different cost model, it is necessary to remove a constant fraction of the total weight, regardless of the number of arms. To achieve this, MAB-PPR uses two batches of arm pulls in each round. The universal constants $C_q, C'_q, C_1, C_2$ are used for calculating the required numbers of arm pulls in each batch so that subsequent estimates hold. The first batch of pulls (line 4) is used to estimate the weight of each arm with the MoM estimator (line 5). The arms are then ordered according to their estimated weight. The second batch of pulls (line 8) is used to decide how many arms to remove from the bottom of this ordering, to make sure that the removed fraction of the total weight is within a certain range. This is done by estimating the total weight of each possible subset of arms, again using MoM (line 12). Here, it is important to note that taking the sum of MoM estimates is not equivalent to taking the MoM estimate of the sum, thus these sums are estimated directly. The selected number of arms is then removed from the bottom of the ordering (line 16). The threshold θ is used to upper bound the empirical fraction of the removed arms. After each round, the values of $\epsilon_t$ and $\delta_t$ are updated, where the constant ρ controls the rate of decay of $\epsilon_t$. MAB-PPR terminates in one of two ways: if at some point, the estimated total weight of the remaining arms is small, then some remaining arm is returned (line 15). Otherwise, when the number of remaining arms is smaller than the constant $N_{\mathrm{final}}$, all their weights are directly estimated (where the constant $C_3$ is used to set the number of pulls), and the arm with the maximal estimated weight is returned (line 22).

Supporting an unbounded number of arms. For simplicity of presentation, the notations and Alg. 1 are given for a finite and known number of arms $K$. Nonetheless, MAB-PPR can be used even if the number of arms is unbounded or infinite. This is possible if there is a way to pull all arms together in finite time, and get a finite result listing all the non-zero instantaneous rewards. For instance, in the search-query example, pulling all arms together can be done by placing an ad and linking it to all search queries, and then recording which search queries actually came up and resulted in an ad-click. Assuming that $W = \sum_i w_i < \infty$, Alg. 1 can then support an unbounded number of arms as follows. In line 1, $S_1$ should be (implicitly) set to the entire set of available arms. In line 6 of the first round, only arms with a non-zero reward in any of the pulls of the first batch need to be explicitly ordered. The other arms can be implicitly placed at the bottom of the ordering. After the second batch, let $l_0$ be the smallest index such that arm $j_{l_0}$ got a positive reward at least once. Then for all $l < l_0$, $\widetilde W_{A_l,1}$ can be (implicitly) set to zero. Clearly, the value of $N_1$, set in line 16 of the first round, satisfies $N_1 \ge l_0 - 1$. Thus, in the first round, any arm that got no reward in any of the pulls is removed in line 17. As a result, the set of arms $S_2$ set for the second round is finite and known, and so from this point on, the algorithm can proceed as listed.
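The following Python sketch mirrors the control flow of Alg. 1 for the case of independently pulled arms (an illustrative implementation of our own: the `pull` interface, the default constant values, and the tie-breaking are placeholder assumptions, not the constants required by the analysis):

```python
import numpy as np

def mom(samples, q, ell):
    """(q, ell)-median-of-means over the first q*ell entries of `samples`."""
    return float(np.median(samples[: q * ell].reshape(q, ell).mean(axis=1)))

def mab_ppr(pull, K, eps, delta, V=1.0, N_final=8, rho=0.5, theta=0.5,
            Cq=10.0, Cqp=8.0, C1=8.0, C2=48.0, C3=96.0):
    S = np.arange(K)                              # S_1 = [K]
    eps_t, delta_t = (1 - rho) * eps / 2, delta / 2
    while len(S) > N_final:
        # First batch: MoM estimate of each arm's weight, then order by it.
        q1 = int(np.ceil(Cq * np.log(Cqp / delta_t)))
        ell = int(np.ceil(C1 * max(1 / eps_t**2, V / eps_t)))
        X = pull(S, q1 * ell)                     # shape (q1*ell, |S|)
        w_hat = np.array([mom(X[:, i], q1, ell) for i in range(len(S))])
        order = np.argsort(w_hat)                 # ascending estimated weight
        # Second batch: MoM estimates of the prefix-set weights W_{A_l}.
        q2 = int(np.ceil(4.5 * np.log(12 / delta_t)))
        ell2 = int(np.ceil(C2 * V / eps))
        prefix = pull(S, q2 * ell2)[:, order].cumsum(axis=1)
        W_tilde = np.array([mom(prefix[:, l], q2, ell2) for l in range(len(S))])
        if W_tilde[-1] <= eps / 4:                # remaining total weight is tiny:
            return int(S[0])                      # any remaining arm is eps-best
        # Drop the bottom N_t arms (approximates "largest N_t with ratio <= theta";
        # the analysis guarantees a constant weight fraction is removed each round).
        N_t = int(np.sum(W_tilde <= theta * W_tilde[-1]))
        S = S[order][N_t:]
        eps_t, delta_t = rho * eps_t, delta_t / 2
    # Final round: estimate the few remaining arms directly.
    q3 = int(np.ceil(4.5 * np.log(N_final / delta_t)))
    ell3 = int(np.ceil(C3 / eps**2))
    X = pull(S, q3 * ell3)
    w_hat = np.array([mom(X[:, i], q3, ell3) for i in range(len(S))])
    return int(S[np.argmax(w_hat)])

# Toy run: independent Bernoulli arms; a pull of a set S costs sum_{i in S} w_i.
rng = np.random.default_rng(1)
w = np.array([0.9, 0.5, 0.3, 0.1, 0.05] + [0.01] * 50)
pull = lambda S, n: (rng.random((n, len(S))) < w[S]).astype(float)
print("chosen arm:", mab_ppr(pull, len(w), eps=0.4, delta=0.1))
```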
5 Proof of the main result

In this section we prove Theorem 3.1, restated below in more detail as Theorem 5.6. Recall that $S_t$ is the set of candidate arms in round $t$. Denote by $W_t := W_{S_t}$ the total weight of the arms in $S_t$. As in Alg. 1, the estimates made by MAB-PPR are denoted by $\{\hat w_{i,t}\}_{i \in S_t}$ and $\{\widetilde W_{A_l,t}\}_{l \in [|S_t|]}$. We omit the subscript $t$ when it is clear from context. Denote $\widetilde W_t := \widetilde W_{S_t,t}$.

We say that a round $t$ of MAB-PPR is standard if $|S_t| > N_{\mathrm{final}}$ and $W_t \ge \epsilon/8$. We first show that with a high probability, the only non-standard round (if at all) is the last round. Define the following event:
$$E_0(t) := \{(W_t < \epsilon/8 \Rightarrow \widetilde W_t \le \epsilon/4) \text{ and } (W_t \ge \epsilon/2 \Rightarrow \widetilde W_t \ge \epsilon/4)\}.$$

Lemma 5.1. For a large enough constant $C_2$, for any round $t$ with $|S_t| > N_{\mathrm{final}}$, $E_0(t)$ holds with a probability at least $1 - \delta_t/4$.

Proof. If $|S_t| > N_{\mathrm{final}}$ then line 8 is invoked, and $\widetilde W_t$ is a $(q_2, \ell^{(2)})$-MoM estimate of $W_t$. By Assumption 2.2, $\sigma_{S_t}^2 \le V W_t$. Hence, by Lemma 2.3, since $q_2 \ge \frac{9}{2}\log(4/\delta_t)$, with a probability at least $1 - \delta_t/4$, $|W_t - \widetilde W_t| \le \sqrt{6 V W_t/\ell^{(2)}}$. We have $\ell^{(2)} = C_2 V/\epsilon$. By setting $C_2 \ge 48$, we get $|W_t - \widetilde W_t| \le \sqrt{W_t \epsilon/8}$. If $W_t < \epsilon/8$, it follows that $\widetilde W_t \le W_t + \epsilon/8 \le \epsilon/4$, which shows that the first part of $E_0(t)$ holds. If $W_t \ge \epsilon/2$, then $\widetilde W_t \ge W_t - \sqrt{W_t \epsilon/8} \ge W_t/2 \ge \epsilon/4$, which shows that the second part of $E_0(t)$ holds.

If $E_0(t)$ holds and $W_t < \epsilon/8$, then in round $t$ the condition in line 15 is invoked, leading to the termination of MAB-PPR. The other option for a round $t$ to be non-standard is if $|S_t| \le N_{\mathrm{final}}$. This case also leads to the termination of the algorithm. Therefore, if $E_0(t)$ holds for all rounds, then all rounds except for (perhaps) the last one are standard. We use this observation in the proofs below.

We prove the correctness of MAB-PPR by guaranteeing that with a high probability, an ϵ-best arm remains in the set $S_t$ throughout the algorithm. We upper bound the cost of MAB-PPR by showing that under the same events, the total weight of the arms is reduced in every round by at least a certain fraction. This general analysis structure is also used for the median-elimination algorithm in [14], where half of the arms are removed in each round. However, once weights are used instead of numbers of arms, new techniques are required, and the estimation scheme, as well as the analysis, become more involved.

First, we define several events. We later prove that they all hold together with a high probability. Let $i^*_t \in \arg\max_{i \in S_t} w_i$ be an arm with the largest weight in $S_t$. Let $w^*_t := w_{i^*_t}$, and $\hat w^*_t := \hat w_{i^*_t,t}$. Note that $i^* := i^*_1$ is the optimal arm and its weight is $w^* := w^*_1$. For any arm $j \in S_t$, we say that $j$ is a t-bad arm if $w_j < w^*_t - \epsilon_t$, where $\epsilon_t$ is as defined in Alg. 1. We also define the set $M_t := \{i \in S_t \mid i \text{ is t-bad and } \hat w_{i,t} \ge \hat w^*_t\}$.

We define three constants $\phi_L, \phi_M, \phi_U \in (0, 1)$ (standing for "Lower", "Middle" and "Upper"), such that $\phi_L < \phi_M < 1 - \phi_U$. These constants are used to analyze the fraction of removed arms in each round, as follows: $\phi_L$ is a lower bound on the fraction of the arm weight which is guaranteed to be removed in each round. $\phi_U$ is an upper bound on the fraction of the arm weight in bad arms. $\phi_M$ is used to define a set of arms guaranteed to be removed in each round. Denote the set of arms that appear to be the worst in round $t$ according to the estimates $\{\hat w_{i,t}\}$ by $L_t = \{j_1, \ldots, j_{k_t}\}$, where $\{j_l\}_{l \in [|S_t|]}$ is the ordering defined in line 6 of round $t$, and $k_t$ is the largest integer such that $\sum_{i \in L_t} w_i \le \phi_M W_t$. We define the following events for each round $t$:
$$E_1(t) := \{W_{M_t} < \phi_U W_t\},$$
$$E_2(t) := \{W_{L_t} \ge \phi_L W_t \text{ or } |S_t \setminus L_t| \le N_{\mathrm{final}}\},$$
$$E_3(t) := \{(S_{t+1} \setminus M_t \ne \emptyset) \text{ and } (S_{t+1} \cap L_t = \emptyset)\}.$$
We omit the argument $(t)$ on events when it is clear from context.
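As a concrete illustration of these definitions, the following toy computation (the numbers are our own) constructs the t-bad set, $M_t$ and $L_t$ from a vector of true weights and a vector of round-$t$ estimates:

```python
import numpy as np

w     = np.array([0.9, 0.5, 0.45, 0.2, 0.05])   # true weights; W_t = 2.1
w_hat = np.array([0.85, 0.55, 0.4, 0.25, 0.1])  # round-t MoM estimates
eps_t, phi_M = 0.3, 0.3

i_star = int(np.argmax(w))                       # i*_t (here arm 0)
bad = w < w[i_star] - eps_t                      # t-bad: w_j < w*_t - eps_t
M_t = np.nonzero(bad & (w_hat >= w_hat[i_star]))[0]  # bad arms estimated above i*_t
order = np.argsort(w_hat)                        # the ordering j_1, j_2, ... of line 6
k_t = int(np.sum(np.cumsum(w[order]) <= phi_M * w.sum()))  # largest k_t with W_{L_t} <= phi_M * W_t
L_t = order[:k_t]                                # the arms that appear worst
print("t-bad:", np.nonzero(bad)[0], "M_t:", M_t, "L_t:", L_t)
# -> t-bad: [1 2 3 4]  M_t: []  L_t: [4 3]
```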
The following lemmas state that in a standard round $t$, all of these events hold with a high probability, as long as the constants are selected appropriately. Their proofs are provided after the statement and proof of Theorem 5.6.

Lemma 5.2. For any $\phi_U \in (0, 1)$ and sufficiently large constants $C_q, C'_q, C_1 > 0$, we have that for any standard round $t$, with a probability at least $1 - \delta_t/4$, $E_1(t)$ holds.

Lemma 5.3. For any $\phi_L, \phi_M, \phi_U$ such that $0 < \phi_L < \phi_M < 1 - \phi_U < 1$, there are sufficiently large constants $C_q, C'_q, C_1, N_{\mathrm{final}} > 0$ such that for any standard round $t$, with a probability at least $1 - \delta_t/4$, $E_2(t)$ holds.

Lemma 5.4. For any $\phi_L, \phi_M, \phi_U$ such that $0 < \phi_L < \phi_M < 1 - \phi_U < 1$, there is some $\theta \in (0, 1)$ such that for a sufficiently large $C_2 > 0$, in any standard round $t$, with a probability at least $1 - \delta_t/4$, $E_1(t)$ implies $E_3(t)$.

The following lemma shows that these events guarantee that a good arm always remains in the candidate set, and that the reduction of the candidate set in each round is as required.

Lemma 5.5. For any standard round $t$ in which $E_1, E_2, E_3$ all hold, we have, with a probability at least $1 - \delta_t$, that (1) $w^*_{t+1} \ge w^*_t - \epsilon_t$, and (2) $W_{t+1} \le (1 - \phi_L)W_t$ or $|S_{t+1}| \le N_{\mathrm{final}}$.

Proof. To prove the first part, observe that by $E_3$, $S_{t+1} \setminus M_t \ne \emptyset$. Let $j \in \arg\max_{j \in S_{t+1} \setminus M_t} \hat w_{j,t}$. By the definition of $S_{t+1}$ in Alg. 1, it includes the arms in $S_t$ with the largest estimated weights. Thus, we also have $j \in \arg\max_{j \in S_t \setminus M_t} \hat w_{j,t}$. Therefore, since $i^*_t \in S_t \setminus M_t$, it follows that $\hat w_{j,t} \ge \hat w^*_t$. Thus, since $j \notin M_t$, it follows that $j$ is not t-bad, so $w_j \ge w^*_t - \epsilon_t$. Combined with $w^*_{t+1} \ge w_j$, this completes the proof of the first part.

For the second part, note that $S_{t+1} \subseteq S_t$ and $L_t \subseteq S_t$. By $E_3$, $S_{t+1} \cap L_t = \emptyset$. Therefore, $W_{t+1} + W_{L_t} \le W_t$. If $|S_{t+1}| > N_{\mathrm{final}}$ then by $E_2$, $W_{L_t} \ge \phi_L W_t$. Therefore, $W_{t+1} \le W_t(1 - \phi_L)$.

Combining the lemmas above, the main result is shown in the following theorem.

Theorem 5.6. For any setting of the constants except for $C_3$ and ρ that satisfies the lemmas above, there is a setting of $C_3 > 0$ and $\rho \in (0, 1)$ such that with a probability at least 1 − δ, MAB-PPR terminates and returns an ϵ-best arm, and its cost is $O(W \log(1/\delta)\max(1/\epsilon^2, V/\epsilon))$.

Proof. Condition below on $E_0$ occurring in all rounds, and on $E_1, E_2, E_3$ occurring in all standard rounds. By Lemmas 5.1, 5.2, 5.3, and 5.4, and the fact that $\sum_{t=1}^{\infty} \delta_t \le \delta$, this condition holds with a probability at least 1 − δ. By Lemma 5.5, in all standard rounds $W_t \le (1 - \phi_L)^{t-1}W$. From $E_0(t)$ it follows that in a round with $W_t < \epsilon/8$, we have $\widetilde W_t \le \epsilon/4$, which guarantees that MAB-PPR terminates in line 15. Therefore, MAB-PPR terminates at the latest when $t$ satisfies $(1 - \phi_L)^{t-1}W \le \epsilon/8$. Let $T$ be the last round of the run. By Lemma 5.5, in standard rounds we have $w^*_{t+1} \ge w^*_t - \epsilon_t$. Thus,
$$w^*_T \ge w^* - \sum_{t=1}^{T-1}\epsilon_t \ge w^* - (1 - \rho)\sum_{t=1}^{\infty}\rho^{t-1}\epsilon/2 = w^* - \epsilon/2. \quad (1)$$

Consider the two possible ways of terminating: if the algorithm terminates in line 15, then $\widetilde W_T \le \epsilon/4$. By $E_0(t)$, this means that $W_T \le \epsilon/2$, thus $w_r \ge w^*_T - \epsilon/2$. The other way of terminating is when $|S_T| \le N_{\mathrm{final}}$. In this case, for all $i \in S_T$, $\hat w_{i,T}$ is a $(q_3, \ell^{(3)})$-MoM estimate of $w_i$, where $q_3 \ge \frac{9}{2}\log(N_{\mathrm{final}}/\delta_T)$ and $\ell^{(3)} = C_3/\epsilon^2$. Set $C_3 \ge 96$. By Lemma 2.3 and a union bound on $N_{\mathrm{final}}$ arms, it follows that with a probability at least $1 - \delta_T$, for all $i \in S_T$,
$$|w_i - \hat w_{i,T}| \le \sqrt{6\sigma_i^2/\ell^{(3)}} \le \sqrt{6/\ell^{(3)}} \le \epsilon/4,$$
where the second inequality follows from Assumption 2.1. The returned arm in this case is $r \in \arg\max_{i \in S_T}\hat w_{i,T}$, thus $w_r \ge \hat w_{r,T} - \epsilon/4 \ge \hat w^*_T - \epsilon/4 \ge w^*_T - \epsilon/2$. Therefore, in both cases we have $w_r \ge w^*_T - \epsilon/2$. Combined with Eq. (1), this proves that $w_r \ge w^* - \epsilon$.

The cost of any round of MAB-PPR is upper bounded by $W_t(q_1\ell_t + q_2\ell^{(2)} + q_3\ell^{(3)})$. Therefore, the total cost is at most $W\sum_{t=1}^{T}(1 - \phi_L)^{t-1}(q_1\ell_t + q_2\ell^{(2)} + q_3\ell^{(3)})$. We have, for constants $C, C' > 0$ (that depend on the constants used in Alg. 1), that
$$q_1\ell_t + q_2\ell^{(2)} + q_3\ell^{(3)} \le C\log(C'/\delta_t)\max\Big(\frac{1}{\epsilon_t^2}, \frac{V}{\epsilon_t}\Big),$$
where $\delta_t = \delta/2^t$ and $\epsilon_t = \frac{1}{2}(1 - \rho)\rho^{t-1}\epsilon$. Therefore, the total cost is upper bounded by
$$W\sum_{t=1}^{\infty}(1 - \phi_L)^{t-1}\,C\log(C'/\delta_t)\max\Big(\frac{1}{\epsilon_t^2}, \frac{V}{\epsilon_t}\Big) \le C''\,W\log(C'''/\delta)\max\Big(\frac{1}{\epsilon^2}, \frac{V}{\epsilon}\Big),$$
for constants $C'', C''' > 0$. Setting $\rho \in (\sqrt{1 - \phi_L}, 1)$, the sum over $t$ is upper bounded by a constant, giving the required upper bound on the cost.
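To spell out the last step (our expansion of the argument; constants are not optimized, and we assume $C' \ge 1$): with $\delta_t = \delta/2^t$ and $\epsilon_t = \frac{1}{2}(1-\rho)\rho^{t-1}\epsilon$, each summand satisfies

```latex
(1-\phi_L)^{t-1}\,\log(C'/\delta_t)\,
  \max\!\Big(\tfrac{1}{\epsilon_t^2},\tfrac{V}{\epsilon_t}\Big)
\;\le\;
\frac{4\,(t+1)\,\log(2C'/\delta)}{(1-\rho)^{2}}\,
  \max\!\Big(\tfrac{1}{\epsilon^{2}},\tfrac{V}{\epsilon}\Big)
  \Big(\frac{1-\phi_L}{\rho^{2}}\Big)^{t-1},
```

and $\sum_{t \ge 1} t\,x^{t-1} = (1-x)^{-2} < \infty$ for $x := (1-\phi_L)/\rho^2$, which is smaller than $1$ exactly when $\rho > \sqrt{1-\phi_L}$.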
We now prove the lemmas stating that the events $E_1, E_2, E_3$ hold with a high probability. We start with Lemma 5.2, which shows this for $E_1(t)$, which states that $W_{M_t} < \phi_U W_t$.

Proof of Lemma 5.2. Fix a round $t$. Define the event $E_4 := \{\hat w^*_t \ge w^*_t - \epsilon_t/2\}$. Setting $C_q \ge \frac{9}{2}$, $C'_q \ge 8$ and $C_1 \ge 24$, we have $q_1 \ge \frac{9}{2}\log(8/\delta_t)$ and $\ell_t \ge 24/\epsilon_t^2$. By Assumption 2.1, $\sigma_{i^*_t}^2 \le 1$. Hence, by Lemma 2.3, with a probability $1 - \delta_t/8$, $|w^*_t - \hat w^*_t| \le \sqrt{6/\ell_t} \le \epsilon_t/2$, hence $E_4$ holds.

Let $i \in S_t$ be a t-bad arm. Let $\hat w_i^j$ for $j \in [q_1]$ be the $j$'th mean estimate of $w_i$ used to get the estimate $\hat w_{i,t}$ in line 5. Let $\alpha_1 \in (1 - \phi_U/2, 1)$. Set $\alpha_2 := 1 - \frac{1 - \phi_U/2}{\alpha_1}$, and set $\alpha_3 \in (0, \alpha_2(1 - \alpha_1))$. If $E_4$ holds, then since $i$ is t-bad, $\hat w_i^j \ge \hat w^*_t$ implies $\hat w_i^j \ge \hat w^*_t \ge w_i + \epsilon_t/2$. Denote $\mathbb{P}_t[\cdot] := \mathbb{P}[\cdot \mid S_t]$. Then,
$$\mathbb{P}_t[\hat w_i^j \ge \hat w^*_t \mid E_4] \le \mathbb{P}_t[\hat w_i^j \ge w_i + \epsilon_t/2]/\mathbb{P}_t[E_4] \le 2\,\mathbb{P}_t[\hat w_i^j \ge w_i + \epsilon_t/2] \le \frac{2}{\ell_t(\epsilon_t/2)^2}.$$
The last inequality follows from Chebyshev's inequality, since $\sigma_i^2 \le 1$. Setting $C_1 \ge \frac{8}{\alpha_3}$, we have $\ell_t \ge \frac{8}{\alpha_3\epsilon_t^2}$, hence $\mathbb{P}_t[\hat w_i^j \ge \hat w^*_t \mid E_4] \le \alpha_3$.

Denote $M_t^j := \{i \mid i \text{ is t-bad and } \hat w_i^j \ge \hat w^*_t\}$. We conclude that
$$\mathbb{E}_t[W_{M_t^j} \mid E_4] = \sum_{i \in S_t} w_i\,\mathbb{P}_t[i \in M_t^j \mid E_4] \le \sum_{i \in S_t} w_i\,\mathbb{P}_t[\hat w_i^j \ge \hat w^*_t \mid E_4] \le \alpha_3 W_t.$$
By Markov's inequality, it follows that $\mathbb{P}_t[W_{M_t^j} \ge \alpha_2 W_t \mid E_4] \le \frac{\alpha_3 W_t}{\alpha_2 W_t} < 1 - \alpha_1$. Define the event $E_5 := \{\frac{1}{q_1}\sum_{j \in [q_1]}\mathbb{I}[W_{M_t^j} \ge \alpha_2 W_t] < 1 - \alpha_1\}$. We have $q_1 \ge C_q\log(C'_q/\delta_t)$. Setting $C_q \ge 1/(2(1 - \alpha_1 - \alpha_3/\alpha_2)^2)$ and $C'_q \ge 8$, we have by Hoeffding's inequality and the independence of $\{M_t^j\}$ for $j \in [q_1]$ that $\mathbb{P}_t[E_5 \mid E_4] \ge 1 - \delta_t/8$. Therefore, combined with $\mathbb{P}_t[E_4] \ge 1 - \delta_t/8$, we get that for any $S_t$, $\mathbb{P}_t[E_5] \ge 1 - \delta_t/4$. Under $E_5$, we have
$$\frac{1}{q_1}\sum_{j} W_{M_t^j} < \alpha_2\alpha_1 W_t + (1 - \alpha_1)W_t = (\alpha_2\alpha_1 + 1 - \alpha_1)W_t = \phi_U W_t/2, \quad (2)$$
where the last equality follows from the definition of $\alpha_2$. On the other hand, since $\hat w_{i,t}$ is the median of $\{\hat w_i^j\}_{j \in [q_1]}$, and $i \in M_t$ implies $\hat w_{i,t} \ge \hat w^*_t$, we have
$$W_{M_t} = \sum_{i \in S_t} w_i\,\mathbb{I}[i \in M_t] \le \sum_{i \in S_t} w_i\,\mathbb{I}\big[|\{j \mid i \in M_t^j\}| \ge q_1/2\big] \le \sum_{i \in S_t} w_i \cdot \frac{2}{q_1}\sum_{j \in [q_1]}\mathbb{I}[i \in M_t^j] = \frac{2}{q_1}\sum_{j \in [q_1]} W_{M_t^j}.$$
Combined with Eq. (2), we get that with a probability of $1 - \delta_t/4$, $W_{M_t} < \phi_U W_t$.

We now prove Lemma 5.3, which states that with a probability at least $1 - \delta_t/4$, $E_2(t)$ holds in a standard round, i.e., $W_{L_t} \ge \phi_L W_t$ or $|S_t \setminus L_t| \le N_{\mathrm{final}}$. In this proof we use the following lemma, proved in the supplementary material (Appendix C).

Lemma 5.7. Let $Y_1, \ldots, Y_z$ be random variables with means $\mu_1, \ldots, \mu_z$. Let $\bar Y := \sum_{i \in [z]} Y_i$, $\bar\mu := \sum_{i \in [z]}\mu_i = \mathbb{E}[\bar Y]$. Let $\bar\sigma^2 := \mathrm{Var}(\bar Y) < \infty$. For $i \in [z]$, let $\hat\mu_i$ be the $(q, \ell)$-MoM estimate of $\mu_i$ based on $q\ell$ i.i.d. draws of copies of $Y_i$. Then for any $\gamma > 0$, with a probability at least $1 - \exp(-q/18)$,
$$|\{i \in [z] \mid \hat\mu_i \ge \gamma\bar\mu\}| \le \frac{4}{\gamma}\Big(1 + \sqrt{\frac{6\bar\sigma^2}{\ell\,\bar\mu^2}}\Big).$$
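As a quick numerical sanity check of Lemma 5.7 (a toy simulation of our own, using the bound as stated above), take many arms with tiny means and count how many per-arm MoM estimates reach a γ-fraction of the total mean:

```python
import numpy as np

def mom(x, q, ell):
    return float(np.median(x[: q * ell].reshape(q, ell).mean(axis=1)))

rng = np.random.default_rng(2)
z, q, ell, gamma = 200, 30, 50, 0.1
mu = np.full(z, 1.0 / z)                 # z tiny Bernoulli means, mu_bar = 1
x = (rng.random((q * ell, z)) < mu).astype(float)
mu_hat = np.array([mom(x[:, i], q, ell) for i in range(z)])

count = int(np.sum(mu_hat >= gamma * mu.sum()))
sigma_bar2 = np.sum(mu * (1 - mu))       # variance of the sum (independent arms)
bound = (4 / gamma) * (1 + np.sqrt(6 * sigma_bar2 / (ell * mu.sum() ** 2)))
print(f"count: {count}  <=  bound: {bound:.1f}  (w.p. >= 1 - exp(-q/18))")
```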
Proof of Lemma 5.3. We will assume that $W_{L_t} < \phi_L W_t$ and conclude the required upper bound on $|S_t \setminus L_t|$ with a high probability. Let $\beta := \phi_M - \phi_L$. Let $\nu \in (0, \beta)$, and set $\alpha := \frac{1}{2}(\beta - \nu)/\sqrt{\beta}$. Let $j' := j_{k_t+1}$, where $k_t$ is given in the definition of $L_t$ and $\{j_l\}$ is the ordering set in line 6 in round $t$. By the definition of $L_t$, $W_{L_t} + w_{j'} > \phi_M W_t$. By our assumption, $W_{L_t} < \phi_L W_t$. Therefore, $w_{j'} > (\phi_M - \phi_L)W_t = \beta W_t$. Define $J := \{i \in S_t \mid w_i \ge \beta W_t\}$. We have $|J| \le 1/\beta$. Set $C_q \ge \frac{9}{2}$ and $C'_q \ge 8/\beta$, so that $q_1 \ge \frac{9}{2}\log((8/\beta)/\delta_t)$. Recall that $\sigma_i^2 \le w_i$ by Assumption 2.1. Therefore, by Lemma 2.3 and a union bound, with a probability at least $1 - \delta_t/8$, for all $i \in J$, $|w_i - \hat w_i| \le \sqrt{6w_i/\ell_t}$. Setting $C_1 \ge 6/\alpha^2$, it follows that $\ell_t \ge 6/(\alpha^2\epsilon_t)$. We have $\epsilon_t \le \epsilon/2$ for all $t$. In addition, since $t$ is a standard round, $\epsilon/8 \le W_t$. Therefore, for $i \in J$, $\epsilon_t \le 4W_t \le 4w_i/\beta$. It follows that with a probability at least $1 - \delta_t/8$, for all $i \in J$,
$$|w_i - \hat w_i| \le \sqrt{6w_i/\ell_t} \le \alpha\sqrt{\epsilon_t w_i} \le (2\alpha/\sqrt{\beta})\,w_i = (1 - \nu/\beta)\,w_i,$$
where the equality follows from the definition of $\alpha$. Hence $\hat w_i \ge (\nu/\beta)\,w_i \ge \nu W_t$. In particular, this holds for $j' = j_{k_t+1}$ as set above. Condition below on this event. By the definition of $L_t$ and $j'$, for all $i \in S_t \setminus L_t$, $\hat w_i \ge \hat w_{j'}$. Therefore, by the event above, for all $i \in S_t \setminus L_t$, $\hat w_i \ge \nu W_t$. Hence, $|S_t \setminus L_t| \le |\{i \in S_t \mid \hat w_i \ge \nu W_t\}|$.

To bound the RHS, recall that $\sigma_{S_t}^2 \le V W_t$ by Assumption 2.2. Thus, applying Lemma 5.7, we have that by setting $C_q \ge 18$, w.p. $1 - \delta_t/8$,
$$|\{i \in S_t \mid \hat w_i \ge \nu W_t\}| \le \frac{4}{\nu}\Big(1 + \sqrt{\frac{6V}{W_t\ell_t}}\Big).$$
We have $\ell_t \ge C_1 V/\epsilon_t \ge C_1 V/(4W_t)$. Therefore,
$$|\{i \in S_t \mid \hat w_i \ge \nu W_t\}| \le \frac{4}{\nu}\big(1 + \sqrt{24/C_1}\big) =: N_{\mathrm{final}}. \quad (3)$$
By a union bound, with a probability $1 - \delta_t/4$, the event above and Eq. (3) hold simultaneously, in which case $W_{L_t} < \phi_L W_t$ implies $|S_t \setminus L_t| \le N_{\mathrm{final}}$.

Lastly, we prove Lemma 5.4, which states that in a standard round $t$, with a probability at least $1 - \delta_t/4$, $E_1(t)$ implies $E_3(t)$, i.e., $S_{t+1} \setminus M_t \ne \emptyset$ and $S_{t+1} \cap L_t = \emptyset$.

Proof of Lemma 5.4. Fix a round $t$. Let $A_l := \{j_1, \ldots, j_l\}$, where $\{j_l\}$ is the ordering set in line 6, and let $\widetilde W_{A_l} := \widetilde W_{A_l,t}$ be the estimate of $W_{A_l}$ calculated in line 12 of round $t$. By the assumption of the lemma, $\phi_M < 1 - \phi_U$. Let $R \in (0, 1 - \phi_U)$ be the solution to
$$\frac{\phi_M + R\sqrt{\phi_M}}{1 - R} - \frac{1 - \phi_U - R}{1 + R} = 0.$$
The solution is in this range, since for $R = 0$, the LHS is equal to $\phi_M - (1 - \phi_U) < 0$, and for $R = 1 - \phi_U$, the LHS is positive. Set $\theta := \frac{1 - \phi_U - R}{1 + R}$.

Recall that $q_2 = \frac{9}{2}\log(12/\delta_t)$. By setting $C_2 \ge 48/R^2$, we get $\ell^{(2)} \ge 48V/(R^2\epsilon)$. Therefore, by Lemma 2.3 and Assumption 2.2, for any fixed $l$ the following holds with a probability $1 - \delta_t/12$:
$$|\widetilde W_{A_l} - W_{A_l}| \le \sqrt{6V W_{A_l}/\ell^{(2)}} \le \sqrt{R^2 W_{A_l}\epsilon/8} \le R\sqrt{W_{A_l} W_t}. \quad (4)$$
The last inequality follows since $t$ is a standard round, hence $W_t \ge \epsilon/8$. Let $n_t$ be the smallest integer such that $W_{A_{n_t}} \ge W_t - W_{M_t}$. With a probability $1 - \delta_t/4$, Eq. (4) holds simultaneously for $l = n_t$, $l = k_t$ (so $A_l = L_t$) and $l = |S_t|$ (so $A_l = S_t$). Condition below on this combined event. It follows from the definition of $L_t$ that $\widetilde W_{L_t} \le W_{L_t} + R\sqrt{W_{L_t}W_t} \le (\phi_M + R\sqrt{\phi_M})W_t$. Suppose that $E_1(t)$ holds, so $W_{M_t} < \phi_U W_t$. By the definition of $n_t$, $W_{A_{n_t}} > (1 - \phi_U)W_t$. Therefore,
$$\widetilde W_{A_{n_t}} \ge W_{A_{n_t}} - R\sqrt{W_{A_{n_t}}W_t} > (1 - \phi_U)W_t - RW_t = (1 - \phi_U - R)W_t.$$
Lastly, from Eq. (4) for $l = |S_t|$, we get $W_t(1 - R) \le \widetilde W_t \le W_t(1 + R)$. Now, MAB-PPR sets $S_{t+1} = S_t \setminus A_{N_t}$, where $N_t$ is the largest number that satisfies $\widetilde W_{A_{N_t}}/\widetilde W_t \le \theta$. From the inequalities above, we have
$$\frac{\widetilde W_{A_{n_t}}}{\widetilde W_t} > \frac{1 - \phi_U - R}{1 + R} = \theta \quad\text{and}\quad \frac{\widetilde W_{L_t}}{\widetilde W_t} \le \frac{\phi_M + R\sqrt{\phi_M}}{1 - R} = \theta.$$
Hence, $k_t \le N_t < n_t$. From $k_t \le N_t$, we have $S_{t+1} \cap L_t = \emptyset$. From $N_t < n_t$, we have $S_{t+1} = S_t \setminus A_{N_t} \supseteq S_t \setminus A_{n_t-1}$. Therefore $W_{t+1} \ge W_t - W_{A_{n_t-1}}$. By the definition of $n_t$, we have $W_{A_{n_t-1}} < W_t - W_{M_t}$. Therefore, $W_{t+1} > W_{M_t}$. It follows that $S_{t+1} \setminus M_t \ne \emptyset$. Thus, $E_1(t)$ implies $E_3(t)$.

Having proved Lemmas 5.2, 5.3 and 5.4, this finalizes the proof of Theorem 5.6.

6 Conclusion

In this work we showed that it is possible to identify an ϵ-best arm in a pay-per-reward multi-armed-bandit setting at a cost that does not depend on the number of arms, and depends only linearly on the total expected reward. In future work, we plan to study instance-specific bounds for this setting.

Acknowledgements

Sivan Sabato was supported in part by the Israel Science Foundation (grant No. 555/15).

References
[1] D. Agarwal, B.-C. Chen, and P. Elango. Explore/exploit schemes for web content optimization. In 2009 Ninth IEEE International Conference on Data Mining, pages 1–10. IEEE, 2009.
[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
[3] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 2010.
[4] M. Aziz, J. Anderton, E. Kaufmann, and J. Aslam. Pure exploration in infinitely-armed bandit models with fixed-confidence. arXiv preprint arXiv:1803.04665, 2018.
[5] J. S. Banks and R. K. Sundaram. Denumerable-armed bandits. Econometrica: Journal of the Econometric Society, pages 1071–1096, 1992.
[6] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103–2116, 1997.
[7] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
[8] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37. Springer, 2009.
[9] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, Nov 2013. ISSN 0018-9448. doi: 10.1109/TIT.2013.2277869.
[10] A. Carpentier and A. Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pages 590–604, 2016.
[11] A. Carpentier and M. Valko. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pages 1133–1141, 2015.
[12] L. Chen, J. Li, and M. Qiao. Towards instance optimal bounds for best arm identification. In S. Kale and O. Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 535–592, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR.
[13] W. Ding, T. Qin, X.-D. Zhang, and T.-Y. Liu. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
[14] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
[15] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
[16] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.
[17] K. G. Jamieson and R. D. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6, 2014.
[18] E. Joly, G. Lugosi, and R. I. Oliveira. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11(1):440–451, 2017.
[19] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246, 2013.
[20] H. Li and Y. Xia. Infinitely many-armed bandits with budget constraints. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[21] G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.
[22] S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
[23] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015. doi: 10.3150/14-BEJ645. URL https://doi.org/10.3150/14-BEJ645.
[24] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 216–227. SIAM, 2007.
[25] L. Tran-Thanh, A. Chapman, E. M. de Cote, A. Rogers, and N. R. Jennings. Epsilon-first policies for budget-limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[26] L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[27] H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann, and A. Krause. Discovering valuable items from massive data. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1195–1204. ACM, 2015.
[28] Y. Wang, J.-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, pages 1729–1736, 2009.
[29] Y. Xia, T. Qin, W. Ma, N. Yu, and T.-Y. Liu. Budgeted multi-armed bandits with multiple plays. In IJCAI, pages 2210–2216, 2016.
[30] Y. Xia, T. Qin, N. Yu, and T.-Y. Liu. Best action selection in a stochastic environment. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 758–766. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
[31] X. Yu, H. Shao, M. R. Lyu, and I. King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937–946, 2018.
[32] Q. Zhou, X. Zhang, J. Xu, and B. Liang. Large-scale bandit approaches for recommender systems. In Neural Information Processing, pages 811–821, Cham, 2017. Springer International Publishing.