# Finite Continuum-Armed Bandits

Solenne Gaucher
Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay, 91405, Orsay, France
solenne.gaucher@math.u-psud.fr

We consider a situation where an agent has T resources to be allocated to a larger number N of actions. Each action can be completed at most once and results in a stochastic reward with unknown mean. The goal of the agent is to maximise her cumulative reward. Non-trivial strategies are possible when side information on the actions is available, for example in the form of covariates. Focusing on a nonparametric setting, where the mean reward is an unknown function of a one-dimensional covariate, we propose an optimal strategy for this problem. Under natural assumptions on the reward function, we prove that the optimal regret scales as O(T^{1/3}) up to poly-logarithmic factors when the budget T is proportional to the number of actions N. When T becomes small compared to N, a smooth transition occurs. When the ratio T/N decreases from a constant to N^{-1/3}, the regret increases progressively up to the O(T^{1/2}) rate encountered in continuum-armed bandits.

## 1 Introduction

### 1.1 Motivations

Stochastic multi-armed bandits have been extensively used to model online decision problems under uncertainty: at each time step, an agent must choose an action from a finite set, and receives a reward drawn i.i.d. from a distribution depending on the action she has selected. By choosing the same action over and over again, she can learn the distribution of the rewards for performing this action. The agent then faces a trade-off between collecting information on the mechanism generating the rewards, and taking the best action with regard to the information collected, so as to maximise her immediate reward.

In some real-life situations, the agent can complete each action at most once, and does not have enough resources to complete all of them. Her decisions can be rephrased in terms of allocating limited resources between many candidates. The agent cannot estimate the reward of an action by performing it several times, and must rely on additional information to construct her strategy. In many situations, covariates providing information on the actions are available to the agent. Then, the expected reward for taking an action can be modelled as a (regular) function of the corresponding covariate. Thus, similar actions give rise to similar rewards. This problem is motivated by the following examples.

**Allocation of scarce resources.** The response of an individual to a medical treatment can be inferred from contextual information describing this patient. When this treatment is expensive or in short supply, decision-makers aim at efficiently selecting recipients who will be treated, so as to maximise the number of beneficial interventions [Kleinberg et al., 2015]. During epidemic crises, lack of medical resources may force hospital staff to progressively identify patients that are more likely to recover, based on indicators of their general health status, and prioritize them in the resource allocation. Similar questions arise when determining college admission so as to optimize the number of successful students [Kleinberg et al., 2018], or allocating financial aid to individuals most likely to benefit from it.

**Contextual advertisement with budget.**
A common form of payment used in online advertisement is pay-per-impression: the advertiser pays a fixed fee each time an ad is displayed [Combes et al., 2015], and the budget of an advertising campaign determines the number of users who view the advertisement. It has been shown in [Agarwal et al., 2009] that click-through rates decrease steeply as users are exposed over and over to the same recommendation. Advertisers may therefore prefer to display their campaign to a new potential customer rather than to an already jaded one, so that each user will view the campaign at most once. Those users are often described by features including demographic information as well as previous online activities. Advertisers want to leverage this contextual information so as to focus on users that are more likely to click on the ad banner.

**Pair matching.** Finding good matches between pairs of individuals is a ubiquitous problem. Each pair of individuals represents an action: the agent sequentially selects T pairs, and receives a reward each time the selected pair corresponds to a good match [Giraud et al., 2019]. In many settings, the agent has access to information describing either individuals or pairs of individuals. For example, online gaming sites may want to pair up players of similar level or complementary strengths; dating applications may use information provided by the users to help them find a partner. Similarly, biologists studying protein-protein interaction networks will sequentially test pairs of proteins to discover possible interactions. Such experiments are however costly, difficult and time-consuming, and leveraging information describing those proteins can help researchers focus on pairs more likely to interact [Szilagyi et al., 2005].

In these settings, the decision maker can complete each action (i.e., select each internet user, patient, college candidate or pair of individuals) at most once; however, by selecting an action, she learns about the expected rewards of similar actions. We model the dependence of the expected reward on the variable describing this action in a non-parametric fashion, and rephrase our problem using terminology from the bandit literature.

**The Finite Continuum-Armed Bandit (F-CAB) problem:** An agent is presented with a set of N arms described by covariates {a_1, a_2, ..., a_N} in a continuous space X (the arm i will henceforth be identified with its covariate a_i). The agent is given a budget T to spend on those arms, where T is typically a fraction p of the number of available arms N. At each step t ≤ T, the agent pulls an arm φ(t) among the arms that have not been pulled yet, and receives the corresponding reward y_{φ(t)} ∈ [0, 1]. Conditionally on {a_1, a_2, ..., a_N}, the rewards y_i are sampled independently from some distribution with mean m(a_i), where m : X → [0, 1] is the (unknown) mean reward function. The aim of the agent is to maximise the sum of the rewards she receives.

The F-CAB problem is closely related to the classical continuum-armed bandit problem. This problem, first introduced in [Kleinberg, 2004], extends multi-armed bandits to continuous sets of actions. At each step, the agent takes an action indexed by a point of her choosing in a continuous space X. In order to maximise her gains, she must explore the space X so as to find and exploit one of the maxima of the mean reward function.
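To fix ideas, the following minimal Python sketch simulates one F-CAB instance as defined above. The Bernoulli rewards, the specific mean reward function `m` and the helper name `make_fcab_instance` are illustrative choices of ours, not part of the formal model.

```python
import numpy as np

def make_fcab_instance(N, p, m, rng):
    """Draw one F-CAB instance: N arms with uniform covariates on [0, 1],
    a budget T = p * N, and a single Bernoulli reward attached to each arm."""
    arms = rng.uniform(0.0, 1.0, size=N)              # covariates a_1, ..., a_N
    rewards = rng.binomial(1, m(arms)).astype(float)  # y_i ~ Bernoulli(m(a_i)), revealed only if arm i is pulled
    T = int(p * N)
    return arms, rewards, T

# Illustrative mean reward function with values in [0, 1] (not from the paper).
m = lambda a: 0.5 + 0.4 * np.sin(2 * np.pi * a)

rng = np.random.default_rng(0)
arms, rewards, T = make_fcab_instance(N=10_000, p=0.3, m=m, rng=rng)
```

Each arm index can appear at most once in the sequence of pulls, which is what distinguishes this setting from the classical continuum-armed bandit recalled below.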
The assumption that the agent can choose any action corresponding to any covariate, unrealistic in many real-life situations, is relaxed in the F-CAB model. Moreover, in the F-CAB setting, the agent can pull each arm at most once. Thus she must endeavour to find and exploit a large set of good arms, as she cannot focus on a single arm corresponding to a maximum.

### 1.2 Related work

To the best of the author's knowledge, continuum-armed bandits without replacement have not been considered before. On the other hand, variants of the multi-armed bandit problem have been proposed to relax the assumption that the agent can choose any action an infinite number of times. In [Chakrabarti et al., 2009], the authors consider a multi-armed bandit problem with infinitely many arms, whose rewards are drawn i.i.d. from some known distribution. Each arm can only be pulled a finite number of times before it dies. Algorithms developed for this problem heavily rely on the knowledge of the distribution of the arms, and on the fact that an infinite number of good arms is always available to the player, both assumptions that are violated in our setting. Closer to our problem is [Féraud and Urvoy, 2012]: the authors study the problem of scratch games, where each arm can be pulled a limited number of times before dying. They bound the weak regret, defined as the difference between T·m(a_{φ*(T)}) and the cumulative reward of the player, where m(a_{φ*(T)}) is the expected reward of the T-th arm pulled by an oracle strategy. Since the reward of the arm pulled by this oracle strategy decreases at each step, its cumulative reward can be much larger than T·m(a_{φ*(T)}) (the two can differ by a linear factor). Thus, the weak regret can be significantly lower than the classical regret, which we control in this paper.

Another related problem is that of budgeted bandits with different budgets for each arm: the decision maker faces a multi-armed bandit problem with constraints on the number of pulls of each arm. This problem is studied in [Agarwal et al., 2009]: the authors assume that the number of arms is fixed, and that the budget of each arm increases proportionally to the number of steps T. They provide numerical simulations as well as asymptotic theoretical bounds on the regret of their algorithm. More precisely, they show that in the limit T → ∞, all optimal arms but one have died before time T: thus, when the budget of each arm and the total number of pulls T are sufficiently large, the problem reduces to a classical multi-armed bandit. By contrast, in the F-CAB setting we can pull each arm at most once and do not attain this regime. Our proof techniques therefore require more involved, non-asymptotic regret bounds.

### 1.3 Contribution and outline

In this paper, we present a new model for finite continuum-armed bandits motivated by real-world applications. In this resource allocation problem, each action is described by a continuous covariate, and can be taken at most once. After some preliminary discussions, we restrict our attention to one-dimensional covariates and introduce further assumptions on the distribution of the covariates a_i and on the mean payoff function m in Section 2. In Section 3, we present an algorithm for this problem, and establish a non-asymptotic upper bound on the regret of this algorithm. More precisely, we prove that when the budget T is a fixed proportion of the number of arms, with high probability, R_T = O(T^{1/3} log(T)^{4/3}).
This rate is faster than all regret rates achievable in the classical continuum-armed bandit under similar assumptions on the mean reward function. Indeed, the authors of [Auer et al., 2007] show that the regret for the classical continuum-armed bandit problem is typically of order O(√(T log(T))). On the other hand, we show that when the budget T becomes small compared to the number of arms N, the regret rate smoothly increases. In the limit where the ratio T/N decreases to N^{-1/3} log(N)^{2/3}, the regret increases progressively up to the O(√(T log(T))) rate encountered in classical continuum-armed bandit problems. Moreover, we derive matching lower bounds on the regret, showing that our rate is sharp up to a poly-logarithmic factor. Extensions of our methods to multi-dimensional covariates are discussed in Section 5 and detailed in the Appendix. We provide high-level ideas behind those results throughout the paper but defer all proofs to the Appendix.

## 2 Problem set-up

### 2.1 Preliminary discussion

In the F-CAB problem, each arm can be pulled at most once, and exploration is made possible by the existence of covariates describing the arms. This framework is related to the classical continuum-armed bandit problem, which we recall here.

**The Continuum-Armed Bandit (CAB) problem:** At each step t, an agent selects any covariate a_t ∈ X, pulls an arm indexed by this covariate and receives the corresponding reward y_t ∈ [0, 1]. Here again, the rewards for pulling an arm a ∈ [0, 1] are drawn i.i.d. conditionally on a from some distribution with mean m(a). The agent aims at maximising her cumulative reward.

By contrast to the CAB setting, where the agent is free to choose any covariate in X, in the F-CAB setting she must restrict her choice to the ever-diminishing set of available arms. The usual trade-off between exploration and exploitation breaks down, as the agent can pull but a finite number of arms in any region considered as optimal. Once those arms have been pulled, all effort spent on identifying this optimal region may become useless. On the contrary, in the CAB setting the agent may pull arms in a region identified as optimal indefinitely. For this reason, strategies lead to lower cumulative rewards in the F-CAB setting than in the less constrained CAB setting.

Nonetheless, this does not imply that F-CAB problems are more difficult than CAB ones in terms of regret. The difficulty of a problem is often defined, in a minimax sense, as the performance of the best algorithm on a worst problem instance. In bandit problems, the performance of a strategy φ is often characterised as the difference between its expected cumulative reward and that of an agent knowing in hindsight the expected rewards of the different arms. At each step t = 1, ..., T, this oracle agent greedily pulls the arm φ*(t), where φ* denotes a permutation of {1, ..., N} such that m(a_{φ*(1)}) ≥ m(a_{φ*(2)}) ≥ ... ≥ m(a_{φ*(N)}). Note that this agent receives an expected cumulative reward of Σ_{t≤T} m(a_{φ*(t)}), which is lower than T·max_a m(a). Thus the regret, defined as the difference between the cumulative reward of φ* and that of our strategy, is given by

R_T = Σ_{t≤T} m(a_{φ*(t)}) − Σ_{t≤T} m(a_{φ(t)}).

The difficulty of the F-CAB problem is governed by the ratio p = T/N. In the limit p → 1, the problem becomes trivial as any strategy must pull all arms, and all cumulative rewards are equal. Opposite to this case, in the limit p → 0, choosing a_{φ(t)} from the large set of remaining arms becomes less and less restrictive, and we expect the problem to become more and more similar to a CAB.
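On a simulated instance such as the one sketched in Section 1.1, the oracle benchmark and the regret just defined can be computed directly, as in the short sketch below; the helper names `oracle_value` and `regret` are hypothetical, and the computation compares expected rewards m(a_i), matching the definition of R_T above.

```python
import numpy as np

def oracle_value(arms, m, T):
    """Expected cumulative reward of the oracle phi*, which pulls the T arms
    with the largest mean rewards m(a_i)."""
    return np.sort(m(arms))[::-1][:T].sum()

def regret(arms, m, pulled):
    """Regret R_T of a strategy that pulled the (distinct) arm indices in `pulled`,
    computed with expected rewards as in the definition above."""
    pulled = np.asarray(pulled)
    return oracle_value(arms, m, len(pulled)) - m(arms[pulled]).sum()

# Example: a blind strategy that pulls the first T arms regardless of covariates.
# print(regret(arms, m, np.arange(T)))
```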
To highlight this phenomenon, we derive upper and lower bounds on the regret that explicitly depend on p. We show that when p ∈ (0, 1) is a fixed constant, i.e. when the budget is proportional to the number of arms, lower regret rates can be achieved for the F-CAB problem than for the CAB problem. To the best of the author's knowledge, it is the first time that this somewhat counter-intuitive phenomenon is observed; however, it is consistent with previous observations on rotting bandits [Levine et al., 2017], in which the expected reward for pulling an arm decreases every time this arm is selected. As in the F-CAB model, in rotting bandits the oracle agent receives ever-decreasing rewards. The authors of [Seznec et al., 2019] show that this problem is no harder than the classical multi-armed bandit: although the cumulative rewards are lower than those in the classical multi-armed bandit setting, this does not imply that strategies should suffer greater regrets. This phenomenon is all the more striking in the F-CAB setting, as we show that strategies can in fact achieve lower regrets. Finally, we verify that when p → 0, the regret rate increases. In the limit where p = N^{-1/3} log(N)^{2/3}, the problem becomes similar to a CAB and the regret rate increases up to the rate encountered in this setting.

### 2.2 Assumptions on the covariates and the rewards

While in general the covariates a_i could be multivariate, we restrict our attention to the one-dimensional case, and assume that X = [0, 1]. The multivariate case is discussed and analysed in Section 5 and in the Appendix. Focusing on the one-dimensional case allows us to highlight the main novelties of this problem while avoiding cumbersome details. We make the following assumption on the distribution of the arms.

**Assumption 1.** For i = 1, ..., N, a_i ~ i.i.d. U([0, 1]).

By contrast to the CAB setting, where one aims at finding and pulling arms with rewards close to the maxima of m, in the F-CAB setting the agent aims at finding and pulling the T best arms: the difficulty of the problem thus depends on the behaviour of m around the reward of the T-th best arm, m(a_{φ*(T)}). Under Assumption 1, we note that E[m(a_{φ*(T)})] ≈ M, where M is defined as

M = min{A : λ({x : m(x) ≥ A}) < p}

and λ is the Lebesgue measure. In words, we aim at identifying and exploiting arms with expected rewards above the threshold M. We therefore say that an arm a_i is optimal if m(a_i) ≥ M, and that it is otherwise sub-optimal. Moreover, we say that an arm a_i is sub-optimal (respectively optimal) by a gap Δ if 0 ≤ M − m(a_i) ≤ Δ (respectively 0 ≤ m(a_i) − M ≤ Δ).

We make the following assumptions on the mean reward function. First, note that if m varies sharply, the problem becomes much more difficult, as we cannot infer the value of m at a point based on rewards obtained from neighbouring arms. In fact, if m presents sharp peaks located at the T optimal arms, any reasonable strategy must suffer a linear regret. In order to control the fluctuations of m, we assume that it is weakly Lipschitz continuous around the threshold M.

**Assumption 2** (Weak Lipschitz condition). There exists L > 0 such that, for all (x, y) ∈ [0, 1]^2,

|m(x) − m(y)| ≤ max{|M − m(x)|, L|x − y|}.   (1)

Assumption 2 is closely related to Assumption A2 in [Bubeck et al., 2011]. It requires that the mean reward function m is L-Lipschitz at any point x_0 such that m(x_0) = M: indeed, in this case the condition states that for any y, |m(x_0) − m(y)| ≤ L|x_0 − y|.
On the other hand, m may fluctuate more strongly around any point x whose expected reward is far from the threshold M.

Bandit problems become more difficult when many arms are slightly sub-optimal. Similarly, the F-CAB problem becomes more difficult if there are many arms with rewards slightly above or under the threshold M, since it is hard to classify those arms respectively as optimal and sub-optimal. This difficulty is captured by the measure of points with expected rewards close to M.

**Assumption 3** (Margin condition). There exists Q > 0 such that for all Δ ∈ (0, 1),

λ({x : |M − m(x)| ≤ Δ}) ≤ QΔ.   (2)

In the classical CAB setting, lower bounds on the regret are of the order O(T^{1/2}) under similar margin assumptions, and they become O(T^{2/3}) when these margin assumptions are not satisfied. In the F-CAB setting, Assumption 3 allows us to improve regret bounds up to O(T^{1/3} p^{-1/3}). It is altogether not too restrictive, as it is verified if m has finitely many points x such that m(x) = M, and has non-vanishing first derivatives at those points. Note that if the margin assumption and the weak Lipschitz assumption hold simultaneously for some L, Q > 0, we must have QL ≥ 1.

## 3 UCBF: Upper Confidence Bound algorithm for Finite continuum-armed bandits

### 3.1 Algorithm

We now describe our strategy, the Upper Confidence Bound algorithm for Finite continuum-armed bandits (UCBF). It is inspired by the algorithm UCBC introduced in [Auer et al., 2007] for CAB.

**Algorithm 1** Upper Confidence Bound for Finite continuum-armed bandits (UCBF)

Parameters: K, δ.
Initialisation: Divide [0, 1] into K intervals I_k with I_k = [(k−1)/K, k/K) for k ∈ {1, ..., K−1} and I_K = [(K−1)/K, 1]. Let N_k = Σ_{1≤i≤N} 1{a_i ∈ I_k} be the number of arms in the interval I_k. Define the set of alive intervals as the set of intervals I_k such that N_k ≥ 2. Pull an arm uniformly at random in each alive interval.
for t = K + 1, ..., T do
  Select an interval I_k that maximizes m̂_k(n_k(t−1)) + √(log(T/δ) / (2 n_k(t−1))) among the set of alive intervals, where n_k(t−1) is the number of arms pulled from I_k by the algorithm before time t, and m̂_k(n_k(t−1)) is the average reward obtained from those n_k(t−1) samples.
  Pull an arm selected uniformly at random among the arms in I_k. Remove this arm from I_k. If I_k is empty, remove I_k from the set of alive intervals.
end for

In order to bound the regret of UCBF, we show that it can be decomposed into the sum of a discretization term and of the cost of learning on a finite multi-armed bandit. First, we discuss the optimal number of intervals K. Second, we present new arguments for bounding the discretization error more tightly. Then, we show that, by contrast to the classical CAB, the contribution of slightly sub-optimal arms to the regret is much more limited in F-CAB problems, before obtaining a high-probability bound on the regret of our algorithm.

By dividing the continuous space of covariates into intervals, we approximate the X-armed setting with a finite multi-armed bandit problem, which we define below.

**The Finite Multi-armed Bandit (F-MAB) problem:** An agent is given a budget T and a set of K arms. At each step, the agent pulls an arm k_t and receives a reward y_t sampled independently with mean m_{k_t}. Each arm k ∈ {1, ..., K} can only be pulled a finite number of times, denoted N_k, before it dies. The agent aims at maximising the sum of her rewards.
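To complement the pseudocode, here is a minimal Python sketch of Algorithm 1, written to run on the simulated instance from the earlier snippets; the data layout, the function name `ucbf` and the handling of a budget smaller than K are our own illustrative choices rather than part of the authors' implementation.

```python
import numpy as np

def ucbf(arms, rewards, T, K, delta, rng):
    """Sketch of UCBF (Algorithm 1). `arms[i]` is the covariate a_i in [0, 1] and
    `rewards[i]` the single reward revealed if arm i is pulled.
    Returns the indices of the pulled arms."""
    bins = np.minimum((arms * K).astype(int), K - 1)           # interval index of each arm
    remaining = {k: list(np.flatnonzero(bins == k)) for k in range(K)}
    alive = {k for k in range(K) if len(remaining[k]) >= 2}    # as in the pseudocode above
    sums = np.zeros(K)     # cumulated reward collected from each interval
    counts = np.zeros(K)   # n_k(t): number of arms pulled from I_k so far
    pulled = []

    def pull(k):
        i = remaining[k].pop(rng.integers(len(remaining[k])))  # arm chosen uniformly at random
        pulled.append(i)
        sums[k] += rewards[i]
        counts[k] += 1
        if not remaining[k]:        # interval exhausted: it dies
            alive.discard(k)

    # Initialisation: pull one arm in every alive interval.
    for k in sorted(alive):
        if len(pulled) == T:
            return pulled
        pull(k)

    # Main loop: pull from the alive interval with the largest upper confidence bound.
    while len(pulled) < T and alive:
        ucb = {k: sums[k] / counts[k] + np.sqrt(np.log(T / delta) / (2 * counts[k]))
               for k in alive}
        pull(max(ucb, key=ucb.get))
    return pulled
```

With the parameter choices analysed in Section 3.3, one would call, for instance, `ucbf(arms, rewards, T, K=int(len(arms)**(1/3) * np.log(len(arms))**(-2/3)), delta=len(arms)**(-4/3), rng=rng)`.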
The approximation of the N_k arms in an interval I_k as a single arm that can be pulled N_k times comes at the price of a discretization error, as we are now forced to treat all arms in the same interval equally, regardless of possible differences of rewards within an interval. The choice of the number of intervals K determines both the cost of this approximation and the difficulty of the F-MAB problem. To analyse the dependence of those quantities on K, we introduce the strategy of an oracle agent facing the corresponding F-MAB problem (i.e., of an agent knowing in hindsight the expected mean reward m_k = (1/λ(I_k)) ∫_{I_k} m(a) da for pulling an arm in any interval I_k, and treating all arms in the same interval equally). We denote this strategy by φ_d. Assume, for the sake of simplicity, that the intervals I_1, ..., I_K have been reordered by decreasing mean reward, and that there exists f ∈ {1, ..., K} such that T = N_1 + ... + N_f. Then, φ_d pulls all arms in the intervals I_1 up to I_f. We can conveniently rewrite the regret R_T as the sum of the regret of φ_d and of the difference between the cumulative rewards of φ_d and of the strategy φ:

R_T = [Σ_{t≤T} m(a_{φ*(t)}) − Σ_{t≤T} m(a_{φ_d(t)})] + [Σ_{t≤T} m(a_{φ_d(t)}) − Σ_{t≤T} m(a_{φ(t)})] = R_T^{(d)} + R_T^{(F-MAB)}.   (3)

The regret R_T^{(d)} is the regret suffered by an agent with hindsight knowledge of the expected mean rewards for the different intervals. It can be viewed as the discretization error. The additional regret R_T^{(F-MAB)} corresponds to the cost of learning in a F-MAB setting. All arms in an interval I_k have a reward close to m_k, so by definition of φ_d,

R_T^{(F-MAB)} ≈ Σ_{k≤f} (N_k − n_k(T)) m_k − Σ_{k>f} n_k(T) m_k,   (4)

where we recall that N_k denotes the number of arms belonging to interval I_k, and n_k(T) denotes the number of arms pulled in this interval by UCBF at time T. Choosing the number of intervals thus yields the following trade-off: a low value of K implies an easier F-MAB problem and a low value of R_T^{(F-MAB)}, while a high value of K allows for a reduction of the discretization error.

In finite bandits, exploration is limited: indeed, when increasing the number of intervals in a F-CAB setting, we simultaneously reduce the number of arms in each interval, and we may become unable to differentiate the mean rewards of two intervals close to the threshold M. Under the weak Lipschitz assumption, gaps between the rewards of two adjacent intervals are of the order 1/K. Classical results indicate that of the order of K^2 pulls are needed to differentiate the mean rewards of those intervals. On the other hand, under Assumption 1, the number of arms in each interval is of the order N/K. Thus, choosing K larger than N^{1/3} will only increase the difficulty of the multi-armed problem, without reducing the discretization error (since K^2 ≥ N/K when K ≥ N^{1/3}).

### 3.2 Bounding the discretization error

Equation (3) indicates that the regret can be decomposed as the sum of a discretization error and of the regret on the corresponding multi-armed bandit. In order to bound this discretization error, the usual methods from continuum-armed bandits rely on bounding the difference between the expected reward of an arm and that of its interval by L/K. Thus, at each step, an algorithm knowing only the best interval may suffer a regret of the order O(1/K), and the difference between the cumulative rewards of φ_d and φ* is of the order O(T/K). This argument yields sub-optimal bounds in F-CAB problems: indeed, the majority of the terms appearing in R_T^{(d)} are zero, as φ* and φ_d mostly select the same arms. To obtain a sharper bound on the discretization error R_T^{(d)}, we analyse more carefully the difference between those strategies.
More precisely, we use concentration arguments to show that under Assumption 1, m(a_{φ*(T)}) and m_f are close to M. This result implies that under the weak Lipschitz assumption, for any pair of arms (a_i, a_j) respectively selected by φ* but not by φ_d and vice versa, m(a_i) − m(a_j) = O(L/K). Finally, the margin assumption allows us to bound the number of those pairs, thus proving the following lemma.

**Lemma 1.** Assume that K ≤ N^{2/3} and K > p^{-1} ∨ (1−p)^{-1}. Under Assumptions 1, 2 and 3, there exists a constant C_{L,Q} depending on L and Q such that, with probability larger than 1 − 6e^{−2N/K^2} − 2e^{−N^{1/3}/3},

R_T^{(d)} ≤ C_{L,Q} N/K^2.

We underline that this discretization error is lower than the unavoidable error of order O(T/K) encountered in classical CAB settings.

### 3.3 Upper bound on the regret of UCBF

Before stating our result, we bound the regret due to slightly sub-optimal arms. It is known that in the classical CAB model, slightly sub-optimal arms contribute strongly to the regret, as any agent needs at least of the order of Δ^{-2} pulls to detect an interval sub-optimal by a gap Δ. When Δ is smaller than 1/√T, the agent spends a budget proportional to T to test whether this interval is optimal or not, which leads to a regret of the order O(ΔT). By contrast, in a F-CAB setting, pulling arms from an interval sub-optimal by a gap Δ until it dies contributes to the regret by at most ΔN/K. Under Assumptions 1, 2 and 3, the number of intervals with mean rewards sub-optimal by a gap smaller than Δ is O(KΔ). Thus, we are prevented from mistakenly selecting those slightly sub-optimal intervals too many times. This is summarised in the following remark.

**Remark 1.** Under Assumptions 1, 2 and 3, intervals sub-optimal by a gap smaller than Δ contribute to the regret by at most O(Δ^2 T/p).

Remark 1, along with Lemma 1, helps us bound with high probability the regret of Algorithm UCBF for any mean payoff function m satisfying Assumptions 2 and 3, for the choice K = ⌊N^{1/3} log(N)^{-2/3}⌋ and δ = N^{-4/3}. The proof of Theorem 1 is deferred to the Appendix.

**Theorem 1.** Assume that ⌊N^{1/3} log(N)^{-2/3}⌋ > p^{-1} ∨ (1−p)^{-1}. Under Assumptions 1, 2 and 3, there exists a constant C_{L,Q} depending only on L and Q such that for the choice K = ⌊N^{1/3} log(N)^{-2/3}⌋ and δ = N^{-4/3},

R_T ≤ C_{L,Q} (T/p)^{1/3} log(T/p)^{4/3}

with probability at least 1 − 12(N^{-1} ∨ e^{−N^{1/3}/3}).

*Sketch of proof.* We use Lemma 1 to bound the discretization error R_T^{(d)}. The decomposition in Equation (3) shows that it is enough to bound R_T^{(F-MAB)}. Recall that φ_d pulls all arms in the intervals I_1, I_2, up to I_f, while UCBF pulls n_k(T) arms in each interval I_k. Using Equation (4), we find that

R_T^{(F-MAB)} ≈ Σ_{k≤f} (N_k − n_k(T))(m_k − M) + Σ_{k>f} n_k(T)(M − m_k),

where we have used that Σ_{k≤f} N_k = T = Σ_{k≤K} n_k(T), which in turn implies Σ_{k≤f} (N_k − n_k(T)) = Σ_{k>f} n_k(T).

On the one hand, R_subopt = Σ_{k>f} n_k(T)(M − m_k) corresponds to the regret of pulling arms in sub-optimal intervals. We use Remark 1 to bound the contribution of intervals sub-optimal by a gap O(1/K) by a factor of the order O(T/(pK^2)). Classical bandit techniques allow us to bound the contribution of the remaining sub-optimal intervals: under Assumptions 1-3, they contribute to the regret by a term O(K log(T) log(K)). Thus, for the choice K = N^{1/3} log(N)^{-2/3}, we can show that R_subopt = O((T/p)^{1/3} log(T/p)^{4/3}).

On the other hand, the term R_opt = Σ_{k≤f} (N_k − n_k(T))(m_k − M) is specific to finite bandit problems. The following argument shows that UCBF kills the majority of optimal intervals, and that the optimal intervals I_k alive at time T are such that f − k is bounded by a constant. Let I_k be an interval still alive at time T such that m_k > M.
Then the interval I_k was alive at every round, and any interval selected by φ must have appeared as a better candidate than I_k. Using the definition of UCBF and Assumption 3, we can show that the number of arms pulled from intervals with mean reward lower than m_k is bounded by a term O(N/K + K^2 log(T)). Since T = N_1 + ... + N_f arms are pulled in total, the number of arms pulled from intervals with mean reward lower than m_k is at least T − (N_1 + ... + N_k) = N_{k+1} + ... + N_f, which is of the order (f − k)N/K. Therefore, no interval I_k such that (f − k)N/K is larger than O(N/K + K^2 log(T)) can be alive at time T. For the choice of K described above, (1 + K^3 log(T)/N) is upper bounded by a constant. Thus, there exists a constant C > 0 such that for all k ≤ f − C, all intervals I_k have died before time T. We note that the number of arms in any interval is of the order N/K, so R_opt = Σ_{f−C ≤ k ≤ f} (N_k − n_k(T))(m_k − M) ≤ C(m_{f−C} − M) N/K. To conclude, we use Assumption 2 to show that m_{f−C} − M = O(CL/K), and find that R_opt = O(N/K^2) = O(T/(pK^2)).

Under assumptions similar to Assumptions 2 and 3, [Auer et al., 2007] show that the regret of UCBC in CAB problems is O(√(T log(T))) for the optimal choice K = √(T/log(T)). By contrast, in the F-CAB problem, Theorem 1 indicates that when p is a fixed constant, i.e. when the number of arms is proportional to the budget, the optimal choice for K is of the order T^{1/3} log(T)^{-2/3} and the regret scales as O(T^{1/3} log(T)^{4/3}). In this regime, regrets lower than those in CAB settings are thus achievable.

As N → ∞ and p → 0, both the regret and the optimal number of intervals increase. To highlight this phenomenon, we consider regimes where T = 0.5N^α for some α ∈ [0, 1] (the choice T ≤ 0.5N reflects the fact that we are interested in settings where T may be small compared to N, and the constant 0.5 is arbitrary). Theorem 1 directly implies the following corollary.

**Corollary 1.** Assume that T = 0.5N^α for some α ∈ (2/3 + ε_N, 1], where we define ε_N = (3 log log(N) + log(2)) / log(N). Then, for the choice δ = N^{-4/3} and K = ⌊α^{2/3} (2T)^{1/(3α)} log(2T)^{-2/3}⌋, with probability at least 1 − 12(N^{-1} ∨ e^{−N^{1/3}/3}),

R_T ≤ C_{Q,L} T^{1/(3α)} log(T)^{4/3}

for some constant C_{Q,L} depending on Q and L.

Corollary 1 indicates that as α decreases, the regret increases progressively from a F-CAB regime to a CAB regime. When the budget is a fixed proportion of the number of arms, the regret scales as O(T^{1/3} log(T)^{4/3}) for an optimal number of intervals K of the order T^{1/3} log(T)^{-2/3}. As p decreases and α ∈ (2/3 + ε_N, 1], the regret increases as O(T^{1/(3α)} log(T)^{4/3}) for K of the order T^{1/(3α)} log(T)^{-2/3}. In the limit α → 2/3 + ε_N, the regret rate becomes R_T = O(√(T log(T))) for an optimal number of intervals of the order √(T/log(T)), which corresponds to their respective values in the CAB setting.

To understand why α = 2/3 + ε_N corresponds to a transition from a F-CAB to a CAB setting, note that α = 2/3 + ε_N implies T = N/K: in other words, the budget becomes of the order of the number of arms per interval. Thus, when α > 2/3 + ε_N, the oracle strategy exhausts all arms in the best interval, and it must select arms in intervals with lower mean rewards. In this regime, we see that the finiteness of the arms is indeed a constraining issue. On the contrary, if α ≤ 2/3 + ε_N, no interval is ever exhausted. The oracle strategy only selects arms from the interval with the highest mean reward, and our problem becomes similar to a CAB problem. Finally, we underline that when α ≤ 2/3 + ε_N the analysis becomes much simpler.
Indeed, results can be directly inferred from [Auer et al., 2007] by noticing that no interval is ever exhausted, and that Algorithm UCBF is therefore a variant of Algorithm UCBC. In this case, the optimal choice for the number of intervals remains K = √(T/log(T)), and yields a regret bound R_T = O(√(T log(T))).

## 4 A lower bound

A careful analysis of the proof of Theorem 1 reveals that all intervals with mean reward larger than M plus a gap O(L/K) have died before time T. On the other hand, all intervals with mean reward lower than M minus a gap O(L/K) have been selected but a logarithmic number of times. In other words, the algorithm UCBF is able to identify the set corresponding to the best p-fraction of the rewards, and it is only mistaken on a subset of measure O(1/K) corresponding to arms a_i such that |m(a_i) − M| = O(1/K). We use this remark to derive a lower bound on the regret of any strategy for mean payoff functions m in the set F_{p,L,Q} defined below.

**Definition 1.** For p ∈ (0, 1), L > 0 and Q > 0, we denote by F_{p,L,Q} the set of functions m : [0, 1] → [0, 1] that satisfy Equations (1) and (2).

To obtain our lower bound, we construct two functions m_1 and m_2 that are identical except on two intervals, each of length N^{-1/3}. On those intervals, m_1 and m_2 are close to the threshold M separating the rewards of the fraction p of best arms from the rewards of the remaining arms. One of these intervals corresponds to arms with rewards above M under the payoff function m_1: more precisely, on this interval m_1 increases linearly from M to M + 0.5L N^{-1/3}, and decreases back to M. On this interval, m_2 decreases linearly from M to M − 0.5L N^{-1/3}, and increases back to M. We define similarly m_1 and m_2 on the second interval by exchanging their roles, and choose the values of m_1 and m_2 outside of those intervals so as to ensure that both functions belong to the set F_{p,L,Q} for some Q large enough.

Now, any reasonable strategy pulls arms in both intervals until it is able to differentiate the two mean reward functions, or equivalently until it is able to determine which interval contains optimal arms. As the average payoffs of those two intervals differ by Θ(N^{-1/3}), this strategy must pull Ω(N^{2/3}) arms in both intervals. This is possible since there are about N^{2/3} arms in each interval. Since arms in one of those intervals are sub-optimal by a gap of the order N^{-1/3}, this strategy suffers a regret Ω(N^{1/3}).

In order to formalise this result, we stress the dependence of the regret on the strategy φ and the mean reward function m by denoting it R_T^φ(m). Our results are proved for rewards y that are Bernoulli random variables (note that this is a special case of the F-CAB problem).

**Assumption 4.** For i ∈ {1, ..., N}, y_i ~ Bernoulli(m(a_i)).

In order to simplify the exposition of our results, we assume that the arms a_i are deterministic.

**Assumption 5.** For i ∈ {1, ..., N}, a_i = i/N.

**Theorem 2.** For all p ∈ (0, 1), all L > 0 and all Q > (6/L ∨ 12), there exists a constant C_L depending on L such that under Assumptions 4 and 5, for all N ≥ C_L (p^{-3} ∨ (1−p)^{-3}),

inf_φ sup_{m ∈ F_{p,L,Q}} R_T^φ(m) ≥ 0.01 T^{1/3} p^{-1/3}.

Theorem 2 shows that the bound on the regret of UCBF obtained in Theorem 1 is minimax optimal up to a poly-logarithmic factor. The proof of Theorem 2 is deferred to the Appendix. Again, we stress the dependence of this regret bound on T by considering regimes where T = 0.5N^α. The following corollary follows directly from Theorem 2.

**Corollary 2.**
For all L > 0 and Q > (6/L ∨ 12), there exists a constant C_L depending on L such that for all N > exp(3C_L) and all T such that T = 0.5N^α for some α ∈ (2/3 + C_L/log(N), 1], under Assumptions 4 and 5,

inf_φ sup_{m ∈ F_{0.5N^{α−1},L,Q}} R_T^φ(m) ≥ 0.01 T^{1/(3α)}.

## 5 Discussion

We have introduced a new model for budget allocation under short supply, when each action can be taken at most once and side information is available on those actions. We have shown that, when the covariates describing those actions are uniformly distributed in [0, 1], the expected reward function m satisfies Assumptions 2 and 3, and the budget is proportional to the number of arms, then the optimal choice of the number of intervals K is of the order T^{1/3} log(T)^{-2/3}, and the regret is O(T^{1/3} log(T)^{4/3}). Our lower bound shows that this rate is sharp up to poly-logarithmic factors.

Those results can readily be generalized to d-dimensional covariates. Assume that m : [0, 1]^d → [0, 1] is such that the weak Lipschitz Assumption 2 holds for the Euclidean distance, and that the margin Assumption 3 is verified. Then, if the arms a_i are drawn i.i.d. uniformly on [0, 1]^d, we can adapt UCBF by dividing the space [0, 1]^d into K^d boxes of equal size. Looking more closely at our methods of proof, we note that the discretization error R_T^{(d)} remains of order O(T/K^2), while the cost of learning R_T^{(F-MAB)} is now bounded by K^d log(T) log(K). Thus, the optimal number of intervals K is of the order T^{1/(d+2)} log(T)^{-2/(d+2)}, and the regret is of the order O(T^{d/(d+2)} log(T)^{4/(d+2)}). We refer the interested reader to the Appendix, where precise statements of our hypotheses and results are provided, along with a description of the extension of Algorithm UCBF to multi-dimensional covariates.

## Broader impact

We present an algorithm for the problem of allocating a limited budget among competing candidates. This algorithm is easy to implement, and enjoys strong theoretical guarantees on its performance, making it attractive and reliable in relevant applications. Nevertheless, we emphasise that the considered framework is based on the premise that the decision-maker is purely utility-driven. We leave it to the decision-maker to take additional domain-specific considerations into account.

## Acknowledgements

We would like to thank Christophe Giraud, Vianney Perchet and Gilles Stoltz for their stimulating suggestions and discussions.

## References

[Agarwal et al., 2009] Agarwal, D., Chen, B.-C., and Elango, P. (2009). Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 21–30, New York, NY, USA. Association for Computing Machinery.

[Auer et al., 2007] Auer, P., Ortner, R., and Szepesvári, C. (2007). Improved rates for the stochastic continuum-armed bandit problem. In Bshouty, N. H. and Gentile, C., editors, Learning Theory, pages 454–468, Berlin, Heidelberg. Springer Berlin Heidelberg.

[Bubeck et al., 2011] Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2011). X-Armed Bandits. Journal of Machine Learning Research, 12:1655–1695.

[Chakrabarti et al., 2009] Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. (2009). Mortal multi-armed bandits. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 273–280. Curran Associates, Inc.

[Combes et al., 2015] Combes, R., Jiang, C., and Srikant, R. (2015). Bandits with budgets: Regret lower bounds and optimal algorithms. SIGMETRICS Perform. Eval. Rev., 43(1):245–257.

[Féraud and Urvoy, 2012] Féraud, R. and Urvoy, T. (2012).
A stochastic bandit algorithm for scratch games. In Hoi, S. C. H. and Buntine, W., editors, Proceedings of the Asian Conference on Machine Learning, volume 25 of Proceedings of Machine Learning Research, pages 129–143, Singapore Management University, Singapore. PMLR.

[Giraud et al., 2019] Giraud, C., Issartel, Y., Lehéricy, L., and Lerasle, M. (2019). Pair matching: When bandits meet stochastic block model.

[Kleinberg et al., 2015] Kleinberg, J., Ludwig, J., Mullainathan, S., and Obermeyer, Z. (2015). Prediction policy problems. American Economic Review, 105(5):491–495.

[Kleinberg et al., 2018] Kleinberg, J., Ludwig, J., Mullainathan, S., and Rambachan, A. (2018). Algorithmic fairness. AEA Papers and Proceedings, 108:22–27.

[Kleinberg, 2004] Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit problem. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS'04, pages 697–704, Cambridge, MA, USA. MIT Press.

[Lattimore and Szepesvári, 2020] Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

[Levine et al., 2017] Levine, N., Crammer, K., and Mannor, S. (2017). Rotting bandits. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 3074–3083. Curran Associates, Inc.

[Seznec et al., 2019] Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. (2019). Rotting bandits are no harder than stochastic ones. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 2564–2572. PMLR.

[Szilagyi et al., 2005] Szilagyi, A., Grimm, V., Arakaki, A., and Skolnick, J. (2005). Prediction of physical protein-protein interactions. Physical Biology, 2:S1–S16.

[Vershynin, 2018] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.