# Infinitely Many-Armed Bandits with Budget Constraints

Haifang Li (Institute of Automation, Chinese Academy of Sciences, lihaifang@amss.ac.cn) and Yingce Xia (University of Science and Technology of China, yingce.xia@gmail.com)

## Abstract

We study the infinitely many-armed bandit problem with budget constraints, where the number of arms can be infinite and much larger than the number of possible experiments. The player aims at maximizing his/her total expected reward under a budget constraint $B$ on the cost of pulling arms. We introduce a weak stochastic assumption on the ratio of expected reward to expected cost of a newly pulled arm, which characterizes its probability of being a near-optimal arm. We propose an algorithm named RCB-I for this new problem, in which the player first randomly picks $K$ arms, where $K$ is sub-linear in $B$, and then runs an algorithm for the finite-arm setting on the selected arms. Theoretical analysis shows that this simple algorithm enjoys a sub-linear regret in terms of the budget $B$. We also provide a lower bound for any algorithm under the Bernoulli setting; the regret bound of RCB-I matches this lower bound up to a logarithmic factor. We further extend the algorithm to the any-budget setting (i.e., the budget is not known in advance) and conduct the corresponding theoretical analysis.

## 1 Introduction

The multi-armed bandit (MAB) problem is a classical sequential decision problem, in which a player receives a random reward by pulling one of the $K$ arms of a slot machine at each round and wants to maximize the expected cumulative reward. Each arm has an unknown reward distribution, and the player can only observe the reward of the pulled arm at each round. The core problem of MAB is the tradeoff between exploration (i.e., pulling the less-pulled arms) and exploitation (i.e., sticking to the empirically best arms). Numerous real-world applications can be described by MAB models, such as personalized news recommendation (Li et al. 2010), auction mechanism design (Mohri and Munoz 2014), online advertising (Tran-Thanh et al. 2014), and crowdsourcing (Zhou, Chen, and Li 2014). MAB problems have been extensively studied in the machine learning and statistics communities (Lai and Robbins 1985; Beygelzimer et al. 2011; Bubeck and Cesa-Bianchi 2012). Many algorithms have been proposed, such as UCB1 and $\epsilon_n$-greedy (Auer, Cesa-Bianchi, and Fischer 2002), UCB-V (Audibert, Munos, and Szepesvári 2007; 2009), KL-UCB (Garivier and Cappé 2011), and Bayes-UCB (Kaufmann, Cappé, and Garivier 2012).

Recently, the MAB problem with budget constraints (i.e., budgeted MAB) has been used to model new industrial applications, for example, online bidding optimization in sponsored search (Amin et al. 2012; Tran-Thanh et al. 2014) and on-spot instance bidding in cloud computing (Agmon Ben-Yehuda et al. 2013; Ardagna, Panicucci, and Passacantando 2011). In budgeted MAB, besides the random reward, each arm is also associated with an unknown random cost. At each round, the player pulls an arm, receives a random reward and pays a random cost, until his/her budget runs out. Many algorithms have been developed for budgeted MAB under different settings.
Tran-Thanh et al. (2010), Tran-Thanh et al. (2012), and Vanchinathan et al. (2015) designed the $\epsilon$-first, KUBE and GP-SELECT algorithms, respectively, to study budgeted MAB with a deterministic cost for each arm. Ding et al. (2013) studied the problem with unknown discrete cost distributions and proposed the UCB-BV algorithm. Xia et al. (2015a) designed the budget-UCB algorithm to handle random continuous costs. Another line of research, the best arm identification problem for budgeted MAB, was proposed in (Xia et al. 2016b).

Berry et al. (1997) first addressed the MAB problem with infinitely many arms, where each arm is associated with a Bernoulli reward distribution. They tried to characterize the case in which the number of arms can be infinite and much larger than the possible number of experiments. Problems with general reward distributions were studied by Wang, Audibert, and Munos (2009) and Carpentier and Valko (2015): Wang, Audibert, and Munos (2009) proposed the UCB-V($\infty$) algorithm for this case, while Carpentier and Valko (2015) addressed the optimal arm identification problem in the infinite-arm setting and provided the SiRI algorithm. However, the cost per round in the aforementioned works is always 1. In this paper, we introduce random costs into the infinitely many-armed setting and propose the infinitely many-armed bandit problem with budget constraints (denoted as $\infty$-BMAB, where the first B refers to "Budgeted").

All prior works on budgeted MAB focus on a finite number of arms, in which each arm can be sufficiently explored. These algorithms cannot be directly applied to an infinitely many-armed bandit problem, since it is impossible to explore each arm. We design efficient algorithms for this new problem and perform a formal theoretical analysis. Our contributions can be summarized in the following two aspects.

Algorithm: We propose a Ratio Confidence Bound algorithm for $\infty$-BMAB (briefly denoted as RCB-I). The algorithm consists of two steps. First, we randomly pick $K$ arms, where the choice of $K$ has to balance the tradeoff between exploring enough arms (in order to include an arm whose expected-reward to expected-cost ratio, an important quantity according to both the previous literature and our analysis, is close enough to the supremum ratio over all candidates) and not selecting too many arms (in order not to waste too much budget on exploring them). Then we adopt a UCB-style algorithm to tackle the resulting finite-arm problem (see Section 3). The choice of $K$ usually depends on the budget $B$; on the other hand, we often come across the case in which $B$ is unknown in advance. We therefore also propose an any-budget variant of RCB-I, named RCB-AIR, which checks at the beginning of each round whether a new arm should be explored (see Section 5).

Theoretical analysis: We conduct an extensive theoretical analysis of $\infty$-BMAB. We prove that both RCB-I and RCB-AIR enjoy sub-linear regrets in terms of the budget. Compared with prior works, two challenges have to be tackled for $\infty$-BMAB: (i) we cannot fully explore all the arms, since the number of arms is infinite; (ii) we cannot decompose the expected pulling number of each suboptimal arm as in the finite-arm setting, otherwise we cannot obtain a finite regret bound. For the first challenge, we analyze the finitely many chosen arms and bridge the regret between the finitely selected arms and all the infinitely many arms through the stochastic regularity assumption on the ratio distribution (see Eqn. (1)).
As to the second challenge, we decompose the expected pulling number of each suboptimal arm up to round $\tau_B$ as in Eqn. (10), which includes a product term; it is this product form that enables us to utilize Eqn. (1) and obtain a finite regret bound. We then provide a lower bound for any algorithm for Bernoulli bandits (whose rewards and costs are either 0 or 1) with infinitely many arms, and show that our proposed RCB-I matches this lower bound up to a logarithmic factor.

## 2 Problem Formulation

We consider a stochastic infinitely many-armed bandit problem with budget constraints. At each round $t \in \mathbb{N}_+$, the player pulls an arm $i \in \mathbb{N}_+$, receives a random reward $r_i(t)$ and pays a random cost $c_i(t)$, until he/she runs out of the budget $B$. Here $B$ is a positive real number, which might or might not be known in advance. Both the reward $r_i(t)$ and the cost $c_i(t)$ are supported in $[0,1]$. We follow the setting in (Xia et al. 2015a) and make the following independence assumptions: (i) the rewards (costs) of the same arm at different rounds are independent and identically distributed; (ii) the rewards and costs of an arm are independent of those of the other arms. Note that we do not assume that the rewards of an arm are independent of its costs.

For any $i \in \mathbb{N}_+$, denote the expected reward and expected cost of arm $i$ by $u_i^r$ and $u_i^c$, respectively. Without loss of generality, we assume $u_i^r \in (0,1)$ and $u_i^c \in [\lambda, 1)$, where $\lambda \in (0,1)$. This assumption is reasonable and natural, since in practice whatever non-trivial action a player takes, he/she needs to pay a certain non-zero cost. Denote the ratio of expected reward to expected cost of arm $i$ by $\rho_i$, i.e., $\rho_i = u_i^r / u_i^c$.

Since there are infinitely many arms in our setting, it is impossible to pull each arm even once. Similar to (Berry et al. 1997; Wang, Audibert, and Munos 2009; Carpentier and Valko 2015), we make the following stochastic regularity assumption on the ratio of expected reward to expected cost: given $\rho^*$, the supremum (or maximum) expected-reward to expected-cost ratio, the probability that a newly pulled arm is $\epsilon$-optimal is of order $\epsilon^\beta$ for small $\epsilon > 0$, where $\beta \ (\ge 0)$ is a parameter known in advance. Mathematically,

$$\mathbb{P}\{\rho > \rho^* - \epsilon\} = \Theta(\epsilon^\beta), \quad \text{for } \epsilon \to 0, \tag{1}$$

where $\rho$ is the ratio of the expected reward to the expected cost of the newly pulled arm, and $\Theta(\epsilon^\beta)$ means that there exist two positive constants $c_1$ and $c_2$ such that for any $\epsilon \in [0, \rho^*]$,

$$c_1 \epsilon^\beta \le \mathbb{P}\{\rho > \rho^* - \epsilon\} \le \mathbb{P}\{\rho \ge \rho^* - \epsilon\} \le c_2 \epsilon^\beta. \tag{2}$$

One can verify that the uniform distribution on $(0, \rho^*)$ satisfies Eqn. (1) with $\beta = 1$, any $c_1 \in (0, 1/\rho^*]$ and any $c_2 \in [1/\rho^*, \infty)$.

Our goal is to design algorithms for $\infty$-BMAB that maximize the expected cumulative reward, or equivalently minimize the pseudo-regret, before the budget runs out. We define the pseudo-regret of any algorithm as follows:

$$R_{\mathrm{alg}} = R^* - \mathbb{E}\Big[\sum_{t \ge 1} r_{I_t}(t)\, \mathbb{1}\{B_t \ge 0\}\Big], \tag{3}$$

where $R^*$ is the supremum of the expected reward over all possible pulling algorithms (according to the analysis in Appendix B, $R^* \le (B+1)\rho^*$, which shows that $R^*$ exists; due to space limitations, all the appendices are left to the online full version of this work (Li and Xia 2016)), $I_t$ denotes the arm chosen at round $t$, $\mathbb{1}\{\cdot\}$ is the indicator function, $B_t$ is the remaining budget at round $t$, i.e., $B_t = B - \sum_{s=1}^{t} c_{I_s}(s)$, $R_{\mathrm{alg}}$ is the regret of algorithm "alg", and the expectation $\mathbb{E}$ is taken w.r.t. the randomness of the rewards, the costs and the pulling algorithm (all the notations are summarized in Table 1 at the end of this paper). Note that in our setting there is no hard constraint forcing the pulling procedure to stop after a fixed number of rounds, which makes our work different from (György et al. 2007; Badanidiyuru, Kleinberg, and Slivkins 2013; Badanidiyuru, Langford, and Slivkins 2014; Wu et al. 2015).
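To make the regularity assumption concrete, the following minimal Python sketch (our own illustration; the construction and all names, such as `draw_arm`, are not from the paper) builds a Bernoulli arm reservoir that realizes the uniform-ratio example above, i.e., Eqn. (1) with $\beta = 1$: every arm has expected cost exactly $\lambda$ and an expected reward drawn uniformly from $(0,1)$, so the ratio $\rho$ is uniform on $(0, \rho^*)$ with $\rho^* = 1/\lambda$.

```python
import numpy as np

# Illustrative arm reservoir (our construction, not from the paper): every arm
# has expected cost exactly LAMBDA and an expected reward drawn uniformly from
# (0, 1), so the ratio rho = u_r / u_c is uniform on (0, RHO_STAR) and
# assumption (1) holds with beta = 1.

rng = np.random.default_rng(0)
LAMBDA = 0.25                # uniform lower bound on the expected costs
RHO_STAR = 1.0 / LAMBDA      # supremum of the ratio distribution

def draw_arm():
    """Sample the latent parameters (u_r, u_c) of a freshly pulled arm."""
    return rng.uniform(0.0, 1.0), LAMBDA

def pull(u_r, u_c):
    """One pull of an arm: Bernoulli reward and Bernoulli cost, supported in {0, 1}."""
    return float(rng.random() < u_r), float(rng.random() < u_c)

# Empirical check of P{rho > rho* - eps} = Theta(eps^beta) with beta = 1:
ratios = np.array([u_r / u_c for u_r, u_c in (draw_arm() for _ in range(200_000))])
for eps in (0.4, 0.2, 0.1, 0.05):
    p_hat = np.mean(ratios > RHO_STAR - eps)
    print(f"eps={eps:.2f}  empirical P={p_hat:.4f}  eps/rho*={eps / RHO_STAR:.4f}")
```

For small $\epsilon$, the empirical frequency should be close to $\epsilon/\rho^*$, i.e., linear in $\epsilon$, which is the $\beta = 1$ behavior required by Eqn. (1).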
## 3 Algorithms

As pointed out in (Xia et al. 2016a), even when the number of arms is finite and the reward and cost of each arm are deterministic, the budgeted MAB problem is an unbounded knapsack problem (Martello and Toth 1990), which is NP-hard (Lueker 1975). Consequently, it is even more challenging to obtain optimal algorithms for the infinitely many-armed setting. According to (Xia et al. 2016a), for budgeted MAB with finitely many arms: (a) when the reward and cost distributions of each arm are known, always pulling the arm with the maximum ratio of expected reward to expected cost (denote this ratio by $\bar\rho$) brings almost the same expected reward as the optimal policy (the policy yielding the maximum expected reward given the reward and cost distributions of all the arms); the gap between the two is at most $2\bar\rho$; (b) when the reward and cost distributions of the arms are not known, at each round we should try to pull the arm with the maximum ratio of empirical reward to empirical cost, while maintaining enough exploration of the less-pulled arms.

Given the assumption in Eqn. (1), we know that if we randomly pick $K$ arms with $K$ large enough, the maximum expected-reward to expected-cost ratio of the selected arms is close enough to that of the infinitely many arms. We can then run an algorithm for finite-arm budgeted MAB on the randomly picked arms. When choosing $K$, we have to consider the tradeoff between exploring more arms (in order to find a candidate arm whose expected-reward to expected-cost ratio is close enough to the best one) and exploiting the empirically best arm (in order not to waste too much budget on exploration).

### Algorithm for the Finite-arm Case

For ease of reference, let $T_i(t-1)$, $\bar r_{i,t}$, $\bar c_{i,t}$ and $E_{i,t}$ denote the number of pulls, the empirical average reward, the empirical average cost and a confidence term of arm $i$ before (excluding) round $t$, respectively. Mathematically,

$$T_i(t-1) = \sum_{s=1}^{t-1} \mathbb{1}\{I_s = i\}, \quad E_{i,t} = \sqrt{\frac{E_{t-1}}{2\, T_i(t-1)}}, \quad \bar r_{i,t} = \frac{\sum_{s=1}^{t-1} r_i(s)\,\mathbb{1}\{I_s = i\}}{T_i(t-1)}, \quad \bar c_{i,t} = \frac{\sum_{s=1}^{t-1} c_i(s)\,\mathbb{1}\{I_s = i\}}{T_i(t-1)},$$

where $\{E_t\}_{t \ge 0}$ is a nondecreasing sequence of nonnegative numbers, which we call the exploration sequence. We adapt a ratio-confidence-bound style algorithm introduced in (Xia et al. 2016a) as a subroutine of our algorithm; it deals with the finite-arm budgeted MAB, and for ease of reference we denote this subroutine by RCB. At each round, one pulls the arm with the maximum index defined as follows:

$$D_{i,T_i(t-1),t} = \frac{\min\{\bar r_{i,t} + E_{i,t},\, 1\}}{\max\{\bar c_{i,t} - E_{i,t},\, 0\}}. \tag{4}$$

Note that arms with a larger ratio of empirical average reward to empirical average cost, or with fewer pulls, have larger indices. RCB is formally described in Algorithm 1.

### Algorithm for the Infinite-arm Case

Given the algorithm for the finite-arm case, we only need to decide the number of randomly picked arms. After careful derivations, we find that if we randomly select $K$ arms, where

$$K = \begin{cases} \Theta\big(B^{\beta/2}\big) & \text{if } \beta < 1, \\ \Theta\big(B^{\frac{\beta}{1+\beta}}\big) & \text{if } \beta \ge 1, \end{cases} \tag{5}$$

we can achieve a sub-linear regret w.r.t. the budget.
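The following short Python sketch (our own, illustrative; the function names and the parameter `const`, which stands in for the constant hidden in the $\Theta(\cdot)$ of Eqn. (5), are assumptions not specified by the paper) spells out the index of Eqn. (4) and the choice of $K$ in Eqn. (5). The exploration sequence $E_t = \log t$ used here is one admissible choice under Theorem 1 below; Algorithms 1 and 2 below give the formal procedure.

```python
import numpy as np

# Sketch of the RCB index of Eqn. (4) and of the arm-number choice of Eqn. (5).
# All names are ours; `const` absorbs the unspecified Theta(.) constant of Eqn. (5).

def explore_seq(t: int) -> float:
    """One admissible exploration sequence, E_t = log t (see Theorem 1)."""
    return np.log(max(t, 2))

def rcb_index(mean_reward: float, mean_cost: float, pulls: int, t: int) -> float:
    """D_{i,T_i(t-1),t} = min{rbar + E, 1} / max{cbar - E, 0}, Eqn. (4).

    Assumes the arm has already been pulled at least once (pulls >= 1).
    """
    e = np.sqrt(explore_seq(t - 1) / (2 * pulls))
    num = min(mean_reward + e, 1.0)
    den = max(mean_cost - e, 0.0)
    return num / den if den > 0 else float("inf")  # a zero denominator gives an infinite index

def num_preselected_arms(budget: float, beta: float, const: float = 1.0) -> int:
    """K = Theta(B^{beta/2}) if beta < 1, Theta(B^{beta/(1+beta)}) if beta >= 1, Eqn. (5)."""
    exponent = beta / 2 if beta < 1 else beta / (1 + beta)
    return max(1, int(np.ceil(const * budget ** exponent)))
```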
Algorithm 1: RCB subroutine
1. Input: the randomly picked $K$ arms.
2. Pull each arm once during the first $K$ rounds and set $t \leftarrow K+1$.
3. While the budget has not run out, repeat steps 4 and 5:
4. Pull the arm $I_t$ with the largest index in Eqn. (4), i.e., $I_t = \arg\max_{i \in [K]} D_{i,T_i(t-1),t}$ (if more than one arm attains the maximum index, pick one of them at random).
5. Update $T_{I_t}(t)$, $\bar r_{I_t,t}$, $\bar c_{I_t,t}$ and the remaining budget; set $t \leftarrow t+1$.

Our proposed algorithm for budgeted MAB with infinitely many arms, Ratio Confidence Bound for Infinitely many-armed bandits with budget constraints (briefly denoted as RCB-I), is shown in Algorithm 2.

Algorithm 2: RCB-I
1. Input: the ratio-distribution regularity parameter $\beta$ and the budget $B$.
2. Randomly choose $K$ arms, with $K$ defined in Eqn. (5).
3. Run the RCB subroutine (Algorithm 1) on the selected $K$ arms.

## 4 Theoretical Analysis

In this section, we conduct a theoretical analysis of our proposed algorithm. We first give an upper bound on the regret of RCB-I; then we derive a lower bound for any algorithm under the budgeted Bernoulli setting (whose rewards and costs are either 0 or 1); at last, we discuss the upper and lower bounds. For ease of reference, we introduce the following two notations, which will be used frequently:
(i) $\Delta_k := \rho^* - \rho_k$, which describes the difference between the (supremum) optimal expected-reward to expected-cost ratio and that of a suboptimal arm $k$;
(ii) $\tau_B := 2B/\lambda$, which can be seen as the pseudo stopping time of the budgeted MAB problem: when $B$ is large enough, the probability that the number of pulling rounds exceeds $\tau_B$ is very small; it is bounded by $X(B)$, where $X(B)$ denotes a term of order $O\big(B\exp(-B\lambda/2)\big)$.

### 4.1 Upper Bound on $R_{\mathrm{RCB\text{-}I}}$

In this subsection, we derive an upper bound on the regret of the RCB-I algorithm, as shown in Theorem 1.

Theorem 1. For the $\infty$-BMAB satisfying Eqn. (1), when the exploration sequence $E_t$ satisfies $2\log(4(\log_2 t + 1)) \le E_t \le \log t$, the regret of RCB-I is upper bounded as follows:

$$R_{\mathrm{RCB\text{-}I}} \le \begin{cases} C\, B^{1/2} \log B & \text{if } \beta < 1, \\ C\, B^{1/2} (\log B)^2 & \text{if } \beta = 1, \\ C\, B^{\frac{\beta}{1+\beta}} \log B & \text{if } \beta > 1, \end{cases} \tag{6}$$

where $C$ is a constant depending only on $c_1$, $c_2$, $\beta$ and $\lambda$.

Our proof consists of three main steps. First, we analyze the regret on the randomly chosen $K$ arms. Then we bridge the regret on the randomly chosen $K$ arms and the regret over the infinitely many arms through the stochastic regularity assumption on the ratio distribution (see (1)). Finally, we combine all the derivations and obtain Theorem 1. To increase readability, we leave some detailed derivations to Appendix C and provide only the proof sketch.

(S1): Regret analysis on the selected $K$ arms. Define the regret of RCB on the given $K$ arms (for ease of reference, throughout step (S1) we denote the indices of the selected $K$ arms by $1, 2, \dots, K$), compared to the optimal policy over the infinitely many arms, as

$$R_K^{\mathrm{RCB\text{-}I}} = R^* - \mathbb{E}\Big[\sum_{t \ge 1}\sum_{k=1}^{K} r_k(t)\,\mathbb{1}\{I_t = k,\ B_t \ge 0\} \,\Big|\, S_K\Big], \tag{7}$$

where $S_K$ represents the event "randomly select $K$ arms" and $R^*$ is defined in (3). Conditioned on $S_K$, by derivations similar to those of Eqn. (10) in (Xia et al. 2016a), we obtain (throughout (S1), the expectation $\mathbb{E}(\cdot)$ and the probability $\mathbb{P}\{\cdot\}$ stand for $\mathbb{E}(\cdot \mid S_K)$ and $\mathbb{P}\{\cdot \mid S_K\}$; we omit the conditioning on $S_K$ for simplicity)

$$R_K^{\mathrm{RCB\text{-}I}} \le \sum_{k=1}^{K} \Delta_k\, \mathbb{E}[T_k(\tau_B)] + O\big(B\exp(-\tfrac{1}{2}B\lambda)\big), \tag{8}$$

where $T_k(\tau_B)$ denotes the number of pulls of arm $k$ from round 1 to round $\tau_B$. The $O(\cdot)$ term in (8) comes from the randomness of the stopping time. The reasons why we bridge $R_K^{\mathrm{RCB\text{-}I}}$ and $T_k(\tau_B)$ are: (i) the randomness of the stopping time is removed once we introduce the pseudo stopping time $\tau_B$, so that we can use concentration inequalities safely; (ii) we can adapt the techniques for bounding the number of pulls of suboptimal arms from finite-armed MABs.
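As a rough sanity check on the choice $\tau_B = 2B/\lambda$ (a back-of-the-envelope bound of our own, not the paper's derivation; we ignore integrality issues and apply the Hoeffding-Azuma inequality to the adaptively chosen costs), note that every conditional expected cost is at least $\lambda$, so the expected total cost over $\tau_B$ rounds is at least $2B$ and

$$\mathbb{P}\Big\{\sum_{t=1}^{\tau_B} c_{I_t}(t) < B\Big\} \le \mathbb{P}\Big\{\sum_{t=1}^{\tau_B}\big(c_{I_t}(t) - \mathbb{E}[c_{I_t}(t)\mid \mathcal{F}_{t-1}]\big) < -B\Big\} \le \exp\Big(-\frac{2B^2}{\tau_B}\Big) = \exp(-\lambda B),$$

where $\mathcal{F}_{t-1}$ denotes the history up to round $t-1$. Thus the probability that the budget is not exhausted within $\tau_B$ rounds decays exponentially in $B$, consistent with (indeed smaller than) the $O(B\exp(-B\lambda/2))$ order used above.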
Next, we only need to focus on upper bounding $\mathbb{E}[T_k(\tau_B)]$.

(S1-1): Decompose $\mathbb{E}[T_k(\tau_B)]$. Given $S_K$, for any positive integer $L_k$, $\mathbb{E}[T_k(\tau_B)]$ can be decomposed into three components: a constant that does not depend on $t$, and two probability terms. Specifically,

$$\mathbb{E}[T_k(\tau_B)] \le L_k + \sum_{t=K+1}^{\tau_B}\sum_{s=L_k}^{t-1} \mathbb{P}\{E_{k,s,t}\} \tag{9}$$
$$\qquad\qquad\qquad + \sum_{t=K+1}^{\tau_B}\ \prod_{k' \ne k} \mathbb{P}\{\exists\, s' \in [1, t-1],\ E'_{k',s',t}\}, \tag{10}$$

where $L_k = \big\lceil \frac{2\log\tau_B}{\eta(\lambda)^2\Delta_k^2}\big\rceil$, $\eta(\lambda) = \frac{\lambda^2}{3+2\lambda}$, $\phi_k = \rho_k + \frac{1}{2}\Delta_k$, and $E_{k,s,t}$, $E'_{k',s',t}$ denote the two events $\{D_{k,s,t} > \phi_k\}$ and $\{D_{k',s',t} \le \phi_k\}$, respectively. In the following two sub-steps, we bound the two probability terms in (9) and (10).

Remark: the product $\prod_{k'\ne k}\mathbb{P}\{\cdot\}$ in term (10) does not appear in the analysis of finite-armed budgeted bandits. If we directly used the proof technique for the finite-armed setting, term (10) would become linear w.r.t. $K$ and, consequently, the regret bound would not be finite.

(S1-2): Bound term (9). It is easy to verify that the following inequality holds for any $k \ge 1$:

$$\phi_k = \rho_k + \frac{1}{2}\Delta_k \ \ge\ \frac{u_k^r + \eta(\lambda)\Delta_k}{u_k^c - \eta(\lambda)\Delta_k}. \tag{11}$$

For any $k \in [K]$, $s \in [L_k, t-1]$ and $t \in [1, \tau_B]$, we define the following two events:
(i) $E^r_{k,s,t}$: $\bar r_{k,t} - u_k^r > \eta(\lambda)\Delta_k - E_{k,t}$;
(ii) $E^c_{k,s,t}$: $\bar c_{k,t} - u_k^c < -\eta(\lambda)\Delta_k + E_{k,t}$.
If $E_{k,s,t}$ holds, at least one of the two events $E^r_{k,s,t}$ and $E^c_{k,s,t}$ holds. As a result,

$$\mathbb{P}\{E_{k,s,t}\} \le \mathbb{P}\{E^r_{k,s,t}\} + \mathbb{P}\{E^c_{k,s,t}\}. \tag{12}$$

Applying Hoeffding's inequality to the two terms on the right-hand side of (12) (note that $s \ge L_k$ implies $E_{k,t} \le \eta(\lambda)\Delta_k/2$), we obtain

$$\mathbb{P}\{E^r_{k,s,t}\} \le \exp\Big(-\frac{s\,\eta(\lambda)^2\Delta_k^2}{2}\Big), \qquad \mathbb{P}\{E^c_{k,s,t}\} \le \exp\Big(-\frac{s\,\eta(\lambda)^2\Delta_k^2}{2}\Big). \tag{13}$$

Consequently, after some further derivations, we have

$$\sum_{t=K+1}^{\tau_B}\sum_{s=L_k}^{t-1} \mathbb{P}\{E_{k,s,t}\} \le \frac{5}{\eta(\lambda)^2\Delta_k^2}. \tag{14}$$

(S1-3): Bound term (10). For ease of reference, we define another three events, for any $k' \ne k$ with $k' \in [K]$, $s' \in [1, t-1]$ and $t \in [1, \tau_B]$:
(iii) $E_{k',s',t}$: $D_{k',s',t} \le \rho_{k'}$;
(iv) $E^r_{k',s',t}$: $\bar r_{k',t} + E_{k',t} \le u_{k'}^r$;
(v) $E^c_{k',s',t}$: $\bar c_{k',t} - E_{k',t} \ge u_{k'}^c$.
It is easy to obtain that

$$\prod_{k' \ne k}\mathbb{P}\{\exists\, s' \in [1,t-1],\ E'_{k',s',t}\} \ \le \prod_{k':\, \rho_{k'} > \rho^* - \Delta_k/2}\mathbb{P}\{\exists\, s' \in [1,t-1],\ E_{k',s',t}\}, \tag{15}$$

since $\{D_{k',s',t} \le \phi_k\}$ implies $\{D_{k',s',t} \le \rho_{k'}\}$ whenever $\rho_{k'} > \phi_k = \rho^* - \Delta_k/2$, and the remaining factors are bounded by 1. If $E_{k',s',t}$ holds, at least one of $E^r_{k',s',t}$ and $E^c_{k',s',t}$ holds. Thus,

$$\mathbb{P}\{\exists\, s' \in [1,t-1],\ E_{k',s',t}\} \le \mathbb{P}\{\exists\, s' \in [1,t-1],\ E^r_{k',s',t}\} + \mathbb{P}\{\exists\, s' \in [1,t-1],\ E^c_{k',s',t}\}. \tag{16}$$

The two terms on the right-hand side of (16) can be upper bounded as follows. Applying the peeling argument with a geometric grid over the time interval $[1, t-1]$ together with Hoeffding's maximal inequality (Bubeck 2010), we have

$$\mathbb{P}\{\exists\, s' \in [1,t-1],\ E^r_{k',s',t}\} \le \big(\log_2(t-1) + 1\big)\exp\Big(-\frac{E_{t-1}}{2}\Big), \tag{17}$$

and similarly for $\mathbb{P}\{\exists\, s' \in [1,t-1],\ E^c_{k',s',t}\}$. Since $E_t \ge 2\log(4(\log_2 t + 1))$, it follows that $\mathbb{P}\{\exists\, s' \in [1,t-1],\ E_{k',s',t}\} \le \frac{1}{2}$, and therefore

$$\sum_{t=K+1}^{\tau_B}\ \prod_{k' \ne k}\mathbb{P}\{\exists\, s' \in [1,t-1],\ E'_{k',s',t}\} \le \tau_B\, 2^{-N_{\Delta_k}}, \tag{18}$$

where $N_{\Delta_k}$ is the cardinality of $\{k' \in [K] : \rho_{k'} > \phi_k\}$. In conclusion, combining (9) with (14) from (S1-2) and (18) from (S1-3), we get

$$\mathbb{E}[T_k(\tau_B)] \le \frac{2\log\tau_B}{\eta(\lambda)^2\Delta_k^2} + \frac{5}{\eta(\lambda)^2\Delta_k^2} + \tau_B\, 2^{-N_{\Delta_k}}. \tag{19}$$

(S2): Bridge $R_K^{\mathrm{RCB\text{-}I}}$ and $R_{\mathrm{RCB\text{-}I}}$. The first step (S1) is carried out conditioned on the randomly chosen $K$ arms. In this step, we utilize the stochastic regularity assumption on the expected-reward to expected-cost ratio distribution (see (1) in Section 2) to bridge $R_K^{\mathrm{RCB\text{-}I}}$ and $R_{\mathrm{RCB\text{-}I}}$. According to Eqn. (1), the quantities $\Delta_1, \dots, \Delta_K$ are i.i.d. random variables satisfying $0 \le \Delta_k \le \rho^*$ and $\mathbb{P}\{\Delta_k \le \epsilon\} = \Theta(\epsilon^\beta)$. Combining (8) and (19) and taking expectations w.r.t. all sources of randomness, we have

$$R_{\mathrm{RCB\text{-}I}} = \mathbb{E}\big[R_K^{\mathrm{RCB\text{-}I}}\big] \le \frac{2}{\lambda} + K\,\mathbb{E}\Big[\Big(\frac{10\log\tau_B}{\eta(\lambda)^2\Delta_1}\Big)\wedge(\tau_B\Delta_1)\Big] + \tau_B\, K\,\mathbb{E}\big[\Delta_1 2^{-N_{\Delta_1}}\big] + O\big(B\exp(-\tfrac{1}{2}B\lambda)\big), \tag{20}$$

where $a \wedge b := \min\{a, b\}$.
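To see where the $(\log\tau_B)^2$ factor comes from, here is a quick illustration of the first expectation term in (20) for the uniform-ratio example of Section 2 (our own worked example, not part of the paper's proof): there $\beta = 1$ and $\Delta_1$ is uniform on $(0, \rho^*)$. Writing $a = \frac{10\log\tau_B}{\eta(\lambda)^2}$ and splitting the integral at $\Delta_0 = \sqrt{a/\tau_B}$,

$$\mathbb{E}\Big[\frac{a}{\Delta_1}\wedge(\tau_B\Delta_1)\Big] = \frac{1}{\rho^*}\Big(\int_0^{\Delta_0}\tau_B\,\delta\, d\delta + \int_{\Delta_0}^{\rho^*}\frac{a}{\delta}\, d\delta\Big) = \frac{1}{\rho^*}\Big(\frac{a}{2} + a\log\frac{\rho^*}{\Delta_0}\Big) = O\big((\log\tau_B)^2\big),$$

since $a = \Theta(\log\tau_B)$ and $\log(\rho^*/\Delta_0) = \Theta(\log\tau_B)$ (for $B$ large enough that $\Delta_0 < \rho^*$). This matches the $\beta = 1$ case of step (S2-1) below.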
It remains to bound the two expectation terms on the right-hand side of (20), which is done in the next two sub-steps.

(S2-1): Bound the first expectation term in (20). Since $\mathbb{P}\{\Delta_1 \le \epsilon\} = \Theta(\epsilon^\beta)$, a direct computation of the expectation gives

$$\mathbb{E}\Big[\Big(\frac{10\log\tau_B}{\eta(\lambda)^2\Delta_1}\Big)\wedge(\tau_B\Delta_1)\Big] = \begin{cases} O\big(B^{\frac{1-\beta}{2}}\log\tau_B\big) & \text{if } \beta < 1, \\ O\big((\log\tau_B)^2\big) & \text{if } \beta = 1, \\ O(\log\tau_B) & \text{if } \beta > 1. \end{cases}$$

(S2-2): Bound the second expectation term in (20). Conditioned on $\Delta_1$, $N_{\Delta_1}$ follows a binomial distribution with parameters $K-1$ and $\mathbb{P}\{\rho > \rho^* - \Delta_1/2 \mid \Delta_1\}$, where $\rho$ denotes the ratio of another, independently drawn arm. Then, by the law of total expectation and a direct computation,

$$\mathbb{E}\big[\Delta_1 2^{-N_{\Delta_1}}\big] \le O\big(K^{-1-1/\beta}\log K\big). \tag{21}$$

(S3): Bound the regret $R_{\mathrm{RCB\text{-}I}}$ of the RCB-I algorithm. Combining the above steps, we get

$$R_{\mathrm{RCB\text{-}I}} \le \begin{cases} C\big(K B^{\frac{1-\beta}{2}}\log B + B K^{-1/\beta}\log K\big) & \text{if } \beta < 1, \\ C\big(K(\log B)^2 + B K^{-1/\beta}\log K\big) & \text{if } \beta = 1, \\ C\big(K\log B + B K^{-1/\beta}\log K\big) & \text{if } \beta > 1, \end{cases}$$

where $C$ is a constant depending only on $c_1$, $c_2$, $\beta$ (see Eqn. (2)) and $\lambda$. Substituting the choice of $K$ from Eqn. (5) gives the desired result.

### 4.2 Lower Bound for Any Algorithm

In this subsection, we derive a lower bound on the regret of any algorithm for the Bernoulli $\infty$-BMAB. A Bernoulli $\infty$-BMAB is a special bandit in which the rewards and costs of each arm follow two Bernoulli distributions with unknown parameters. The lower bound is as follows.

Theorem 2. For any Bernoulli $\infty$-BMAB, if the parameters of a newly pulled arm satisfy Eqn. (1), then for any $\beta > 0$ any algorithm suffers a regret larger than $c\, B^{\frac{\beta}{1+\beta}}$ for some small enough constant $c$ depending on $c_2$, $\beta$ and $\lambda$.

Due to space restrictions, we only show the proof sketch of Theorem 2 here and leave the complete proof to Appendix D.

(S1): Lower bound analysis conditioned on $S_K$. Given $K$ randomly selected arms (the event $S_K$), under the Bernoulli setting, according to (Xia et al. 2015b) we have

$$R_K^{\mathrm{alg}} \ \ge\ \sum_{k=1}^{K} u_k^c\,\Delta_k\,\mathbb{E}[T_k(B)] \ \ge\ \lambda\sum_{k=1}^{K}\Delta_k\,\mathbb{E}[T_k(B)], \tag{22}$$

where $T_k(B)$ denotes the number of pulls of arm $k$ until the budget $B$ runs out. Since the cost of each arm is no larger than 1, the algorithm runs for at least $B$ rounds, i.e., $\sum_{k=1}^{K}\mathbb{E}[T_k(B)] \ge B$. Now let $0 < \delta < \delta' < \rho^*$. Similarly to the derivation of Theorem 3 in (Wang, Audibert, and Munos 2009), we have

$$R_K^{\mathrm{alg}} \ \ge\ \lambda B\delta\,\mathbb{1}\{\hat\rho \le \rho^* - \delta\} + \lambda\kappa\delta'\,\mathbb{1}\{\hat\rho > \rho^* - \delta;\ \tilde K \ge \kappa\}, \tag{23}$$

where $\kappa > 0$ is a parameter to be determined, $\tilde K$ denotes the cardinality of $\{k \in \{\tilde I_1, \dots, \tilde I_{\bar K - 1}\} : \rho_k \le \rho^* - \delta'\}$, $\bar K := \min\{l \in \mathbb{N}_+ : \rho_{\tilde I_l} > \rho^* - \delta\}$, $\tilde I_l$ is the $l$-th arm drawn, and $\hat\rho$ denotes the expected-reward to expected-cost ratio of the best arm in $\{\tilde I_1, \dots, \tilde I_{\bar K}\}$.

(S2): Bridge $R_K^{\mathrm{alg}}$ and $R^{\mathrm{alg}}$. First, take $\kappa = B\delta/\delta'$, so that $\lambda\kappa\delta' = \lambda B\delta$, and take expectations on both sides of (23). Therefore,

$$R^{\mathrm{alg}} = \mathbb{E}\big[R_K^{\mathrm{alg}}\big] \ \ge\ \lambda B\delta\,\mathbb{P}\{\tilde K \ge \kappa\}. \tag{24}$$

Next, we need the distribution of $\tilde K$. Since $\bar K$ follows a geometric distribution with parameter $\mathbb{P}\{\rho > \rho^* - \delta\}$ and, given $\bar K$, $\tilde K$ follows a binomial distribution with parameters $\bar K - 1$ and $\mathbb{P}\{\rho \le \rho^* - \delta' \mid \rho \le \rho^* - \delta\}$, a technical derivation shows that

$$\mathbb{P}\{\tilde K = \iota\} = \frac{\mathbb{P}\{\rho \le \rho^* - \delta'\}^{\iota}\ \mathbb{P}\{\rho > \rho^* - \delta\}}{\mathbb{P}\{\rho \notin (\rho^* - \delta', \rho^* - \delta]\}^{\iota + 1}}. \tag{25}$$

Therefore (treating $\kappa$ as an integer at the level of this sketch),

$$R^{\mathrm{alg}} \ \ge\ \lambda B\delta \sum_{\iota \ge \kappa} \frac{\mathbb{P}\{\rho \le \rho^* - \delta'\}^{\iota}\ \mathbb{P}\{\rho > \rho^* - \delta\}}{\mathbb{P}\{\rho \notin (\rho^* - \delta', \rho^* - \delta]\}^{\iota + 1}} \ =\ \lambda B\delta\,\Big(\frac{\mathbb{P}\{\rho \le \rho^* - \delta'\}}{\mathbb{P}\{\rho \notin (\rho^* - \delta', \rho^* - \delta]\}}\Big)^{\kappa}. \tag{26}$$

Taking $\delta = \delta' B^{-\frac{1}{1+\beta}}$, where $\delta'$ can be any constant in $(0, \rho^*)$, we have $\kappa = B^{\frac{\beta}{1+\beta}}$, and we obtain the desired result.
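To make this last step explicit (our own back-of-the-envelope completion of the sketch, with constants handled loosely): with $\delta = \delta' B^{-1/(1+\beta)}$ and $\kappa = B^{\beta/(1+\beta)}$, Eqn. (2) gives $\mathbb{P}\{\rho > \rho^* - \delta\} \le c_2\delta^\beta$ and $\mathbb{P}\{\rho \le \rho^* - \delta'\} \ge 1 - c_2\delta'^\beta$, so if $\delta'$ is chosen small enough that $c_2\delta'^\beta \le 1/2$, then

$$\Big(\frac{\mathbb{P}\{\rho \le \rho^* - \delta'\}}{\mathbb{P}\{\rho \notin (\rho^* - \delta', \rho^* - \delta]\}}\Big)^{\kappa} = \Big(1 + \frac{\mathbb{P}\{\rho > \rho^* - \delta\}}{\mathbb{P}\{\rho \le \rho^* - \delta'\}}\Big)^{-\kappa} \ \ge\ \exp\big(-2c_2\,\kappa\,\delta^\beta\big) = \exp\big(-2c_2\,\delta'^{\beta}\big),$$

using $\kappa\delta^\beta = \delta'^\beta$ and $(1+x)^{-\kappa} \ge e^{-\kappa x}$. Plugging this into (26) yields $R^{\mathrm{alg}} \ge \lambda\delta' e^{-2c_2\delta'^{\beta}} B^{\beta/(1+\beta)}$, which exhibits a constant $c$ depending only on $c_2$, $\beta$ and $\lambda$, as stated in Theorem 2.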
### 4.3 Discussions

In this subsection, we discuss Theorem 1 and Theorem 2.
(1) Theorem 1 shows that RCB-I achieves a sub-linear regret bound with respect to the budget $B$: we have $\lim_{B\to\infty} R_{\mathrm{RCB\text{-}I}}/B = 0$.
(2) Comparing Theorem 1 with Theorem 2, the upper bound on the regret of RCB-I matches the lower bound up to a logarithmic factor $\log B$. (More precisely, the upper bound matches the lower bound, up to a logarithmic factor, in the case $\beta \ge 1$; we will consider the case $\beta < 1$ in future work.)
(3) Compared with budgeted MAB with finitely many arms, neither the regret bound of the proposed algorithm nor the lower bound for any algorithm under Bernoulli reward/cost distributions can achieve $O(\ln B)$. This is because we have to explore enough arms to obtain, with high probability, an arm whose expected-reward to expected-cost ratio is close enough to $\rho^*$.

## 5 Extension to Any Budget

In practice, we often come across the case in which $B$ is not known in advance, or changes over time. Inspired by (Wang, Audibert, and Munos 2009), in this section we present an any-budget variant of the RCB-I algorithm, named RCB-AIR (short for Arm Increasing Rule), to deal with this case. The main idea is that, at the beginning of each round, we check whether a new arm should be explored (which is determined by the number of arms explored so far and by the round number), and then run the procedure for the finite-arm case. Denote by $\mathcal{K}_t$ the set of arms pulled up to round $t$ and define $K_t = |\mathcal{K}_t|$; we set $\mathcal{K}_0 = \emptyset$ and $K_0 = 0$. The RCB-AIR algorithm is shown in Algorithm 3 (a Python sketch of the arm-increasing rule is given at the end of this section).

Algorithm 3: RCB-AIR
1. Input: the ratio-distribution regularity parameter $\beta > 0$.
2. While the budget has not run out, repeat steps 3 and 4:
3. At round $t$, if $K_{t-1} < t^{\beta/2}$ (when $\beta < 1$) or $K_{t-1} < t^{\frac{\beta}{\beta+1}}$ (when $\beta \ge 1$), randomly pick a new arm $a_t$ and set $\mathcal{K}_t \leftarrow \mathcal{K}_{t-1}\cup\{a_t\}$; otherwise, set $\mathcal{K}_t \leftarrow \mathcal{K}_{t-1}$.
4. Run Steps 4 and 5 of Algorithm 1 on the selected arms $\mathcal{K}_t$.

The regret bound of the RCB-AIR algorithm is as follows.

Theorem 3. For the $\infty$-BMAB satisfying Eqn. (1), when the exploration sequence $\{E_t\}$ satisfies $2\log(4(\log_2 t + 1)) \le E_t \le \log t$, for any budget $B$ the regret of RCB-AIR is upper bounded as follows:

$$R_{\mathrm{RCB\text{-}AIR}} \le \begin{cases} C(\log B)^2 B^{1/2} & \text{if } \beta < 1, \\ C(\log B)^2 B^{\frac{\beta}{1+\beta}} & \text{if } \beta \ge 1, \end{cases} \tag{27}$$

where $C$ is a constant depending only on $c_1$, $c_2$, $\beta$ and $\lambda$.

Similar to the proof of Theorem 1, the proof of Theorem 3 consists of three steps. First, we analyze the regret on the randomly chosen arms $\{\mathcal{K}_t\}_{t=1}^{\tau_B}$; note that the arms chosen up to round $\tau_B$ progressively enter the competition, which differs from the RCB-I setting. Second, we relate $R_{\mathcal{K}_{\tau_B}}^{\mathrm{RCB\text{-}AIR}}$ to $R_{\mathrm{RCB\text{-}AIR}}$ by leveraging the stochastic assumption on the expected-reward to expected-cost ratio distribution. Third, we combine the results of the two previous steps. The detailed proof is left to Appendix E.
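To make the arm-increasing rule concrete, here is a minimal Python sketch of RCB-AIR (our own illustration, not the paper's implementation; the bookkeeping is simplified, e.g. each arm in play is pulled once before its index is used, and ties are broken arbitrarily). The helpers `draw_arm`, `pull` and `rcb_index` are the hypothetical sketches given earlier.

```python
# Sketch of RCB-AIR (Algorithm 3); all names are ours and the bookkeeping is
# simplified: each arm in play is pulled once before its index is used, and the
# last pull may slightly overrun the budget.
# Usage (with the hypothetical helpers sketched earlier):
#   rcb_air(budget=500.0, beta=1.0, draw_arm=draw_arm, pull=pull, rcb_index=rcb_index)

def arm_budget(t: int, beta: float) -> float:
    """Maximum number of arms allowed in play at round t (step 3 of Algorithm 3)."""
    return t ** (beta / 2) if beta < 1 else t ** (beta / (beta + 1))

def rcb_air(budget: float, beta: float, draw_arm, pull, rcb_index):
    arms, stats = [], []          # stats[i] = [pull count, mean reward, mean cost]
    t = 1
    while budget > 0:
        if len(arms) < arm_budget(t, beta):      # arm-increasing rule
            arms.append(draw_arm())
            stats.append([0, 0.0, 0.0])
        fresh = [i for i, s in enumerate(stats) if s[0] == 0]
        if fresh:                                # initialize newly added arms first
            i = fresh[0]
        else:                                    # otherwise pull the max-index arm, Eqn. (4)
            i = max(range(len(arms)),
                    key=lambda j: rcb_index(stats[j][1], stats[j][2], stats[j][0], t))
        r, c = pull(*arms[i])
        n, mr, mc = stats[i]
        stats[i] = [n + 1, (mr * n + r) / (n + 1), (mc * n + c) / (n + 1)]
        budget -= c
        t += 1
    return stats
```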
## 6 Conclusion and Future Work

In this paper, we design an arm-pulling algorithm, RCB-I, for the $\infty$-BMAB. We prove that when the budget is known in advance, RCB-I achieves a sub-linear regret bound with respect to the budget and matches the lower bound up to a logarithmic factor. We further extend it to the any-budget setting, propose the RCB-AIR algorithm, and conduct the corresponding theoretical analysis. For future work, there are several important and interesting directions: (1) distribution-free regret bounds for our proposed algorithms remain open; (2) whether the lower bound for $\beta < 1$ can be improved is another intriguing question; (3) the case in which the rewards and costs of different arms are correlated requires further investigation; (4) the best arm identification problem for $\infty$-BMAB is another important problem.

| Notation | Meaning |
| --- | --- |
| $B$ | budget |
| $B_t$ | remaining budget at round $t$ |
| $r_i(t), c_i(t)$ | reward (cost) of arm $i$ at round $t$ |
| $u_i^r, u_i^c$ | expected reward (cost) of arm $i$ |
| $\lambda$ | uniform lower bound on the expected costs |
| $\rho_i, \rho^*$ | expected-reward to expected-cost ratio of arm $i$; the maximum (supremum) ratio |
| $\beta$ | parameter of the ratio distribution |
| $\Delta_k$ | difference between $\rho^*$ and $\rho_k$ |
| $T_i(t-1)$ | number of pulls of arm $i$ before round $t$ |
| $I_t$ | the arm pulled at round $t$ |
| $K, \mathcal{K}_t$ | number of randomly chosen arms; the set of arms chosen up to round $t$ |
| $E_t$ | exploration sequence at round $t$ |
| $D_{i,T_i(t-1),t}$ | estimated ratio index of arm $i$ at round $t$ |
| $R^*$ | supremum of the expected reward over all possible pulling algorithms |
| $R_{\mathrm{alg}}, R_{\mathrm{RCB\text{-}I}}, R_{\mathrm{RCB\text{-}AIR}}$ | regret of any algorithm, of RCB-I, and of RCB-AIR |
| $\tau_B$ | pseudo stopping time |

Table 1: Summary of the notations used in this paper.

## Acknowledgments

The authors wish to thank Tie-Yan Liu, Wei Chen, Tao Qin and Wensheng Zhang for helpful discussions related to this work.

## References

Agmon Ben-Yehuda, O.; Ben-Yehuda, M.; Schuster, A.; and Tsafrir, D. 2013. Deconstructing Amazon EC2 spot instance pricing. ACM Transactions on Economics and Computation 1(3):16.

Amin, K.; Kearns, M.; Key, P.; and Schwaighofer, A. 2012. Budget optimization for sponsored search: Censored learning in MDPs. arXiv preprint.

Ardagna, D.; Panicucci, B.; and Passacantando, M. 2011. A game theoretic formulation of the service provisioning problem in cloud systems. In WWW, 177-186. ACM.

Audibert, J.-Y.; Munos, R.; and Szepesvári, C. 2007. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory. Springer.

Audibert, J.-Y.; Munos, R.; and Szepesvári, C. 2009. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19):1876-1902.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.

Badanidiyuru, A.; Kleinberg, R.; and Slivkins, A. 2013. Bandits with knapsacks. In FOCS, 207-216. IEEE.

Badanidiyuru, A.; Langford, J.; and Slivkins, A. 2014. Resourceful contextual bandits. In COLT, 1109-1134.

Berry, D. A.; Chen, R. W.; Zame, A.; Heath, D. C.; and Shepp, L. A. 1997. Bandit problems with infinitely many arms. The Annals of Statistics 2103-2116.

Beygelzimer, A.; Langford, J.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandit algorithms with supervised learning guarantees. In AISTATS, 19-26.

Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.

Bubeck, S. 2010. Bandits games and clustering foundations. Ph.D. Dissertation, Université des Sciences et Technologie de Lille - Lille I.

Carpentier, A., and Valko, M. 2015. Simple regret for infinitely many armed bandits. arXiv preprint arXiv:1505.04627.

Ding, W.; Qin, T.; Zhang, X.-D.; and Liu, T.-Y. 2013. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Garivier, A., and Cappé, O. 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 359-376.

György, A.; Kocsis, L.; Szabó, I.; and Szepesvári, C. 2007. Continuous time associative bandit problems. In IJCAI, 830-835.

Kaufmann, E.; Cappé, O.; and Garivier, A. 2012. On Bayesian upper confidence bounds for bandit problems. In AISTATS, 592-600.

Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4-22.
Li, H., and Xia, Y. 2016. Infinitely many-armed bandits with budget constraints. https://goo.gl/bCuuS5.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661-670. ACM.

Lueker, G. S. 1975. Two NP-complete problems in nonnegative integer programming. Princeton University, Department of Electrical Engineering.

Martello, S., and Toth, P. 1990. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Inc.

Mohri, M., and Munoz, A. 2014. Optimal regret minimization in posted-price auctions with strategic buyers. In NIPS, 1871-1879.

Tran-Thanh, L.; Chapman, A.; de Cote, E. M.; Rogers, A.; and Jennings, N. R. 2010. Epsilon-first policies for budget-limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Tran-Thanh, L.; Chapman, A.; Rogers, A.; and Jennings, N. R. 2012. Knapsack based optimal policies for budget-limited multi-armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence.

Tran-Thanh, L.; Stavrogiannis, L. C.; Naroditskiy, V.; Robu, V.; Jennings, N. R.; and Key, P. 2014. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions.

Vanchinathan, H. P.; Marfurt, A.; Robelin, C.-A.; Kossmann, D.; and Krause, A. 2015. Discovering valuable items from massive data. In KDD. ACM.

Wang, Y.; Audibert, J.-Y.; and Munos, R. 2009. Algorithms for infinitely many-armed bandits. In NIPS, 1729-1736.

Wu, H.; Srikant, R.; Liu, X.; and Jiang, C. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In NIPS.

Xia, Y.; Ding, W.; Zhang, X.-D.; Yu, N.; and Qin, T. 2015a. Budgeted bandit problems with continuous random costs. In ACML.

Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015b. Thompson sampling for budgeted multi-armed bandits. In International Conference on Artificial Intelligence.

Xia, Y.; Qin, T.; Ma, W.; Yu, N.; and Liu, T.-Y. 2016a. Budgeted multi-armed bandits with multiple plays. In IJCAI.

Xia, Y.; Qin, T.; Yu, N.; and Liu, T.-Y. 2016b. Best action selection in a stochastic environment. In AAMAS, 758-766.

Zhou, Y.; Chen, X.; and Li, J. 2014. Optimal PAC multiple arm identification with applications to crowdsourcing. In ICML, 217-225.