# Linear Contextual Bandits with Knapsacks

Shipra Agrawal (Columbia University, sa3305@columbia.edu) and Nikhil R. Devanur (Microsoft Research, nikdev@microsoft.com)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

**Abstract.** We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption does not exceed the budget for each resource. The objective is once again to maximize the total reward. This problem turns out to be a common generalization of classic linear contextual bandits (linContextual) [8, 11, 1], bandits with knapsacks (BwK) [3, 9], and the online stochastic packing problem (OSPP) [4, 14]. We present algorithms with near-optimal regret bounds for this problem. Our bounds compare favorably to results on the unstructured version of the problem [5, 10], where the relation between the contexts and the outcomes could be arbitrary, but the algorithm only competes against a fixed set of policies accessible through an optimization oracle. We combine techniques from the work on linContextual, BwK and OSPP in a nontrivial manner, while also tackling new difficulties that are not present in any of these special cases.

## 1 Introduction

In the contextual bandit problem [8, 2], the decision maker observes a sequence of contexts (or features). In every round she needs to pull one out of $K$ arms, after observing the context for that round. The outcome of pulling an arm may be used along with the contexts to decide future arms. Contextual bandit problems have found many useful applications such as online recommendation systems, online advertising, and clinical trials, where the decision in every round needs to be customized to the features of the user being served. The linear contextual bandit problem [1, 8, 11] is a special case of the contextual bandit problem, where the outcome is linear in the feature vector encoding the context. As pointed out by [2], contextual bandit problems represent a natural half-way point between supervised learning and reinforcement learning: the use of features to encode contexts and the models for the relation between these feature vectors and the outcome are often inherited from supervised learning, while managing the exploration-exploitation tradeoff is necessary to ensure good performance in reinforcement learning. The linear contextual bandit problem can thus be thought of as a midway point between the linear regression model of supervised learning and reinforcement learning.

Recently, there has been significant interest in introducing multiple global constraints in the standard bandit setting [9, 3, 10, 5]. Such constraints are crucial for many important real-world applications. For example, in clinical trials, the treatment plans may be constrained by the total availability of medical facilities, drugs and other resources. In online advertising, there are budget constraints that restrict the number of times an ad is shown. Other applications include dynamic pricing, dynamic procurement, crowdsourcing, etc.; see [9, 3] for many such examples.
In this paper, we consider the linear contextual bandit with knapsacks problem (henceforth, linCBwK). In this problem, the context vectors are generated i.i.d. in every round from some unknown distribution, and on picking an arm, a reward and a consumption vector are observed, which depend linearly on the context vector. The aim of the decision maker is to maximize the total reward while ensuring that the total consumption of every resource remains within a given budget. Below, we give a more precise definition of this problem.

We use the following notational convention throughout: vectors are denoted by bold face lower case letters, while matrices are denoted by regular face upper case letters. Other quantities such as sets, scalars, etc. may be of either case, but are never bold faced. All vectors are column vectors, i.e., a vector in $n$ dimensions is treated as an $n \times 1$ matrix. The transpose of matrix $A$ is $A^\top$.

**Definition 1 (linCBwK).** There are $K$ arms, which we identify with the set $[K]$. The algorithm is initially given as input a budget $B \in \mathbb{R}_+$. In every round $t$, the algorithm first observes context $x_t(a) \in [0,1]^m$ for every arm $a$, then chooses an arm $a_t \in [K]$, and finally observes a reward $r_t(a_t) \in [0,1]$ and a $d$-dimensional consumption vector $v_t(a_t) \in [0,1]^d$. The algorithm has a no-op option, which is to pick none of the arms and get $0$ reward and $0$ consumption. The goal of the algorithm is to pick arms such that the total reward $\sum_{t=1}^T r_t(a_t)$ is maximized, while ensuring that the total consumption does not exceed the budget, i.e., $\sum_t v_t(a_t) \le B\mathbf{1}$.

We make the following stochastic assumption for context, reward, and consumption vectors. In every round $t$, the tuple $\{x_t(a), r_t(a), v_t(a)\}_{a=1}^K$ is generated from an unknown distribution $D$, independent of everything in previous rounds. Also, there exist an unknown vector $\mu_* \in [0,1]^m$ and a matrix $W_* \in [0,1]^{m \times d}$ such that for every arm $a$, given context $x_t(a)$ and history $H_{t-1}$ before time $t$,
$$\mathbb{E}[r_t(a) \mid x_t(a), H_{t-1}] = \mu_*^\top x_t(a), \qquad \mathbb{E}[v_t(a) \mid x_t(a), H_{t-1}] = W_*^\top x_t(a). \tag{1}$$

For succinctness, we will denote the tuple of contexts for the $K$ arms at time $t$ as a matrix $X_t \in [0,1]^{m \times K}$, with $x_t(a)$ being the $a$th column of this matrix. Similarly, the rewards and consumption vectors at time $t$ are represented as the vector $r_t \in [0,1]^K$ and the matrix $V_t \in [0,1]^{d \times K}$, respectively.

As we discuss later in the text, the assumption in equation (1) forms the primary distinction between our linear contextual bandit setting and the general contextual bandit setting considered in [5]. Exploiting this linearity assumption allows us to obtain regret bounds that do not depend on the number of arms $K$, making the approach especially useful when the number of arms is large. Some examples of this include recommendation systems with a large number of products (e.g., retail products, travel packages, ad creatives, sponsored facebook posts). Another advantage over the general contextual bandit setting of [5] is that we do not need oracle access to a certain optimization problem, which in this case would require solving an NP-Hard problem. (See Section 1.1 for a more detailed discussion.)
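To make Definition 1 and the linear outcome model (1) concrete, here is a minimal simulator sketch. The class name, the noise model, and the $1/m$ normalization that keeps the means in $[0,1]$ are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

class LinCBwKEnvironment:
    """Synthetic linCBwK instance: i.i.d. contexts, outcomes with linear means as in (1)."""

    def __init__(self, K, m, d, B, T, seed=0):
        self.K, self.m, self.d, self.B, self.T = K, m, d, B, T
        self.rng = np.random.default_rng(seed)
        self.mu_star = self.rng.uniform(0, 1, size=m)       # unknown reward parameter
        self.W_star = self.rng.uniform(0, 1, size=(m, d))   # unknown consumption parameters

    def contexts(self):
        """One i.i.d. draw of the context matrix X_t in [0,1]^{m x K}."""
        return self.rng.uniform(0, 1, size=(self.m, self.K))

    def pull(self, X, a):
        """Reward and consumption for arm a: means mu*'x and W*'x plus bounded noise."""
        x = X[:, a]
        noise = 0.05 * self.rng.standard_normal(1 + self.d)
        r = float(np.clip(self.mu_star @ x / self.m + noise[0], 0, 1))
        v = np.clip(self.W_star.T @ x / self.m + noise[1:], 0, 1)
        return r, v

# Interaction protocol of Definition 1: observe X_t, pick an arm (or no-op),
# observe (r_t(a_t), v_t(a_t)), and stop once some resource exceeds the budget B.
```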
We compare the performance of an algorithm to that of an optimal adaptive policy that knows the distribution $D$ and the parameters $(\mu_*, W_*)$, and can take into account the history up to that point, as well as the current context, to decide (possibly with randomization) which arm to pull at time $t$. However, it is easier to work with an upper bound on this, namely the optimal expected reward of a static policy that is required to satisfy the constraints only in expectation. This technique has been used in several related problems and is standard by now [14, 9].

**Definition 2 (Optimal Static Policy).** A context-dependent non-adaptive policy $\pi$ is a mapping from the context space $[0,1]^{m \times K}$ to $\Omega := \{p \in [0,1]^K : \|p\|_1 \le 1\}$, where $\pi(X)_i$ denotes the probability of playing arm $i$ when the context is $X$, and $1 - \sum_{i=1}^K \pi(X)_i$ is the probability of no-op. Define $r(\pi)$ and $v(\pi)$ to be the expected reward and consumption vector of policy $\pi$, respectively, i.e.,
$$r(\pi) := \mathbb{E}_{(X,r,V) \sim D}\left[r^\top \pi(X)\right] = \mathbb{E}_{X \sim D}\left[\mu_*^\top X \pi(X)\right], \tag{2}$$
$$v(\pi) := \mathbb{E}_{(X,r,V) \sim D}\left[V \pi(X)\right] = \mathbb{E}_{X \sim D}\left[W_*^\top X \pi(X)\right]. \tag{3}$$
Let
$$\pi^* := \arg\max_{\pi} \; T\, r(\pi) \quad \text{such that} \quad T\, v(\pi) \le B\mathbf{1} \tag{4}$$
be the optimal static policy. Note that since no-op is allowed, a feasible policy always exists. We denote the value of this optimal static policy by $\text{OPT} := T\, r(\pi^*)$.

The following lemma proves that OPT upper bounds the value of an optimal adaptive policy. The proof is in Appendix B in the supplement.

**Lemma 1.** The value of an optimal adaptive policy that knows the distribution $D$ and the parameters $\mu_*, W_*$ is at most OPT.

**Definition 3 (Regret).** Let $a_t$ be the arm played at time $t$ by the algorithm. Then, regret is defined as
$$\text{regret}(T) := \text{OPT} - \sum_{t=1}^T r_t(a_t).$$
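Once the expectation over $D$ is replaced by an empirical average over sampled contexts, the benchmark (4) becomes a linear program with one sub-distribution over arms per sample; the same sample-average form reappears in the estimate (12) of Section 3.3. The following sketch assumes numpy/scipy; the function name and interface are illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import linprog

def empirical_static_lp(r_list, V_list, B, T):
    """LP value of the static benchmark (4) evaluated on n sampled contexts.

    r_list[i]: length-K vector of expected rewards (e.g., mu' X_i) for sample i.
    V_list[i]: (d x K) matrix of expected consumptions (e.g., W' X_i) for sample i.
    Variables: one sub-distribution p_i over the K arms (plus implicit no-op) per sample.
    """
    n, d, K = len(r_list), V_list[0].shape[0], len(r_list[0])
    c = -(T / n) * np.concatenate(r_list)             # maximize => minimize the negative
    A_res = (T / n) * np.hstack(V_list)               # (T/n) sum_i V_i p_i <= B 1
    A_simplex = np.kron(np.eye(n), np.ones((1, K)))   # 1' p_i <= 1 (the rest is no-op)
    res = linprog(c,
                  A_ub=np.vstack([A_res, A_simplex]),
                  b_ub=np.concatenate([np.full(d, B), np.ones(n)]),
                  bounds=(0, 1), method="highs")
    return -res.fun

# With known parameters this evaluates (4) on an i.i.d. sample of contexts:
#   r_list = [mu_star @ X for X in X_samples]; V_list = [W_star.T @ X for X in X_samples]
```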
### 1.1 Main results

Our main result is an algorithm with a near-optimal regret bound for linCBwK.

**Theorem 1.** There is an algorithm for linCBwK such that if $B > m^{1/2}T^{3/4}$, then with probability at least $1-\delta$,
$$\text{regret}(T) = O\left(\left(\tfrac{\text{OPT}}{B} + 1\right) m \sqrt{T \log(dT/\delta)}\, \log(T)\right).$$

**Relation to general contextual bandits.** There have been recent papers [5, 10] that solve problems similar to linCBwK but for general contextual bandits. In these papers the relation between contexts and outcome vectors is arbitrary, and the algorithms compete with an arbitrary fixed set of context-dependent policies $\Pi$ accessible via an optimization oracle, with regret bounds of the form $O\big((\tfrac{\text{OPT}}{B} + 1)\sqrt{KT \log(dT|\Pi|/\delta)}\big)$. These approaches could potentially be applied to the linear setting using a set $\Pi$ of linear context-dependent policies. Comparing their bounds with ours, in our results essentially a $\sqrt{K \log(|\Pi|)}$ factor is replaced by a factor of $m$. Most importantly, we have no dependence on $K$ (similar to the regret bounds for linear contextual bandits [8, 1, 11]), which enables us to consider problems with large action spaces. Further, suppose that we want to use their result with the set of linear policies, i.e., policies of the form $\arg\max_{a \in [K]} \{x_t(a)^\top \theta\}$ for some fixed $\theta \in \mathbb{R}^m$. Then, their algorithms would require access to an Arg-Max Oracle that can find the best such policy (maximizing total reward) for a given set of contexts and rewards (no resource consumption). In fact, by a reduction from the problem of learning halfspaces with noise [16], we can show that the optimization problem underlying such an Arg-Max Oracle is NP-Hard, making such an approach computationally expensive. The proof of this is in Appendix C in the supplement. The only downside to our results is that we need the budget $B$ to be $\Omega(m^{1/2}T^{3/4})$. Getting similar bounds for budgets as small as $B = \Theta(m\sqrt{T})$ is an interesting open problem. (This also indicates that this is indeed a harder problem than all the special cases.)

**Near-optimality of regret bounds.** In [12], it was shown that for the linear contextual bandits problem, no online algorithm can achieve a regret bound better than $\Omega(m\sqrt{T})$. In fact, they prove this lower bound for linear contextual bandits with static contexts. Since that problem is a special case of the linCBwK problem with $d=1$, this shows that the dependence on $m$ and $T$ in the above regret bound is optimal up to log factors. For general contextual bandits with resource constraints, the bounds of [5, 10] are near optimal.

**Relation to BwK [3] and OSPP [4].** It is easy to see that the linCBwK problem is a generalization of the linear contextual bandits problem [1, 8, 11]. There, the outcome is a scalar reward and the goal is to simply maximize the sum of these rewards. Remarkably, the linCBwK problem also turns out to be a common generalization of the bandits with knapsacks (BwK) problem considered in [9, 3], and the online stochastic packing problem (OSPP) studied by [13, 6, 15, 14, 4]. In both BwK and OSPP, the outcome of every round $t$ is a reward $r_t$ and a vector $v_t$, and the goal of the algorithm is to maximize $\sum_{t=1}^T r_t$ while ensuring that $\sum_{t=1}^T v_t \le B\mathbf{1}$. The problems differ in how these rewards and vectors are picked. In the OSPP problem, in every round $t$, the algorithm may pick any reward/vector pair from a given set $A_t$ of $(d+1)$-dimensional vectors. The set $A_t$ is drawn i.i.d. from an unknown distribution over sets of vectors. This corresponds to the special case of linCBwK where $m = d+1$ and the context $x_t(a)$ itself is equal to $(r_t(a), v_t(a))$. In the BwK problem, there is a fixed set of arms, and for each arm there is an unknown distribution over reward/vector pairs. The algorithm picks an arm and a reward/vector pair is drawn from the corresponding distribution for that arm. This corresponds to the special case of linCBwK where $m = K$ and the context $X_t = I$, the identity matrix, for all $t$. (A short sketch of both embeddings is given at the end of this section.)

We use techniques from all three special cases: our algorithms follow the primal-dual paradigm and use an online learning algorithm to search the dual space, as was done in [3]. In order to deal with linear contexts, we use techniques from [1, 8, 11] to estimate the weight matrix $W_*$, and define optimistic estimates of $W_*$. We also use the technique of combining the objective and the constraints using a certain tradeoff parameter, which was introduced in [4]. Further new difficulties arise, such as in estimating the optimum value from the first few rounds, a task that follows from standard techniques in each of the special cases but is very challenging here. We develop a new way of exploration that uses the linear structure, so that one can evaluate all possible choices that could have led to an optimum solution on the historic sample. This technique might be of independent interest for estimating optimum values. One can see that the problem is indeed more than the sum of its parts from the fact that we get the optimal bound for linCBwK only when $B \ge \Omega(m^{1/2}T^{3/4})$, unlike either special case, for which the optimal bound holds for all $B$ (but is meaningful only for $B = \Omega(m\sqrt{T})$).

The approach in [3] (for BwK) extends to the case of static contexts, where each arm has a context that does not change over time. (It was incorrectly claimed in [3] that the approach can be extended to dynamic contexts without much modification.) The OSPP of [4] is not a special case of linCBwK with static contexts; this is one indication of the additional difficulty of dynamic over static contexts.

**Other related work.** Recently, [17] showed an $O(\sqrt{T})$ regret bound in the linear contextual setting with a single budget constraint, when costs depend only on contexts and not on arms. Due to space constraints, we have moved many proofs from the main part of the paper to the supplement.
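To see the two embeddings described above concretely, here is a small sketch of how BwK and OSPP instances map to linCBwK contexts; the helper names are illustrative.

```python
import numpy as np

def bwk_as_lincbwk(K):
    """BwK as linCBwK (m = K): the context is always the identity matrix, so the unknown
    mu* and W* are simply the per-arm mean reward and mean consumption."""
    X_t = np.eye(K)   # X_t = I for every round t
    return X_t

def ospp_as_lincbwk(options):
    """OSPP as linCBwK (m = d + 1): each option a contributes the context
    x_t(a) = (r_t(a), v_t(a)); then mu* = e_1 and W* = [0 ; I_d] recover the reward
    and the consumption vector exactly."""
    options = np.asarray(options, dtype=float)   # shape (K, d + 1): rows are (reward, consumption)
    X_t = options.T                              # context matrix of shape (d + 1) x K
    d = options.shape[1] - 1
    mu_star = np.eye(d + 1)[:, 0]                # e_1 picks out the reward coordinate
    W_star = np.vstack([np.zeros((1, d)), np.eye(d)])
    return X_t, mu_star, W_star
```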
## 2 Preliminaries

### 2.1 Confidence Ellipsoid

Consider a stochastic process which in each round $t$ generates a pair of observations $(r_t, y_t)$, such that $r_t$ is an unknown linear function of $y_t$ plus some zero-mean bounded noise, i.e., $r_t = \mu_*^\top y_t + \eta_t$, where $y_t, \mu_* \in \mathbb{R}^m$, $|\eta_t| \le 2R$, and $\mathbb{E}[\eta_t \mid y_1, r_1, \ldots, y_{t-1}, r_{t-1}, y_t] = 0$.

At any time $t$, a high-confidence estimate of the unknown vector $\mu_*$ can be obtained by building a confidence ellipsoid around the $\ell_2$-regularized least-squares estimate $\hat{\mu}_t$ constructed from the observations made so far. This technique is common in prior work on linear contextual bandits (e.g., in [8, 11, 1]). For any regularization parameter $\lambda > 0$, let
$$M_t := \lambda I + \sum_{i=1}^{t-1} y_i y_i^\top, \qquad \hat{\mu}_t := M_t^{-1} \sum_{i=1}^{t-1} y_i r_i.$$
The following result from [1] shows that $\mu_*$ lies with high probability in an ellipsoid with center $\hat{\mu}_t$. For any positive semi-definite (PSD) matrix $M$, define the $M$-norm as $\|\mu\|_M := \sqrt{\mu^\top M \mu}$. The confidence ellipsoid at time $t$ is defined as
$$C_t := \left\{ \mu \in \mathbb{R}^m : \|\mu - \hat{\mu}_t\|_{M_t} \le R\sqrt{m \log\left(\tfrac{1+tm/\lambda}{\delta}\right)} + \sqrt{\lambda m} \right\}.$$

**Lemma 2 (Theorem 2 of [1]).** If for all $t$, $\|\mu_*\|_2 \le \sqrt{m}$ and $\|y_t\|_2 \le \sqrt{m}$, then with probability $1-\delta$, $\mu_* \in C_t$.

Another useful observation about this construction is stated below. It first appeared as Lemma 11 of [8], and was also proved as Lemma 3 in [11].

**Lemma 3 (Lemma 11 of [8]).** $\sum_{t=1}^T \|y_t\|_{M_t^{-1}} \le \sqrt{m T \log(T)}$.

As a corollary of the above two lemmas, we obtain a bound on the total error in the estimate provided by any point from the confidence ellipsoid. (The proof is in Appendix D in the supplement.)

**Corollary 1.** For $t = 1, \ldots, T$, let $\tilde{\mu}_t \in C_t$ be a point in the confidence ellipsoid, with $\lambda = 1$ and $2R = 1$. Then, with probability $1-\delta$,
$$\sum_{t=1}^T \left|\tilde{\mu}_t^\top y_t - \mu_*^\top y_t\right| \le 2m\sqrt{T \log\left(\tfrac{1+Tm}{\delta}\right)}\, \log(T).$$
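The construction above amounts to a few lines of linear algebra. Below is a sketch (class name and default constants are illustrative); the radius follows the definition of $C_t$, and the optimistic value uses the fact that $\max_{\mu \in C_t} y^\top \mu = y^\top \hat{\mu}_t + \text{radius} \cdot \|y\|_{M_t^{-1}}$.

```python
import numpy as np

class ConfidenceEllipsoid:
    """Regularized least squares with the confidence ellipsoid C_t of Section 2.1."""

    def __init__(self, m, lam=1.0, R=0.5, delta=0.01):
        self.M = lam * np.eye(m)          # M_t = lam*I + sum_i y_i y_i'
        self.b = np.zeros(m)              # sum_i y_i r_i
        self.m, self.lam, self.R, self.delta, self.t = m, lam, R, delta, 0

    def update(self, y, r):
        self.M += np.outer(y, y)
        self.b += r * y
        self.t += 1

    def center(self):
        return np.linalg.solve(self.M, self.b)   # hat{mu}_t = M_t^{-1} sum_i y_i r_i

    def radius(self):
        # R*sqrt(m*log((1 + t*m/lam)/delta)) + sqrt(lam*m), as in the definition of C_t
        t = max(self.t, 1)
        return (self.R * np.sqrt(self.m * np.log((1 + t * self.m / self.lam) / self.delta))
                + np.sqrt(self.lam * self.m))

    def optimistic_value(self, y):
        """max_{mu in C_t} y'mu, which equals y'hat{mu}_t + radius * ||y||_{M_t^{-1}}."""
        return y @ self.center() + self.radius() * np.sqrt(y @ np.linalg.solve(self.M, y))
```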
### 2.2 Online Learning

Consider a $T$-round game played between an online learner and an adversary, where in round $t$, the learner chooses a $\theta_t \in \Omega := \{\theta : \|\theta\|_1 \le 1, \theta \ge 0\}$, and then observes a linear function $g_t : \Omega \to [-1,1]$ picked by the adversary. The learner's choice $\theta_t$ may only depend on the learner's and adversary's choices in previous rounds. The goal of the learner is to minimize regret, defined as the difference between the learner's objective value and the value of the best single choice in hindsight:
$$\mathcal{R}(T) := \max_{\theta \in \Omega} \sum_{t=1}^T g_t(\theta) - \sum_{t=1}^T g_t(\theta_t).$$
The multiplicative weight update (MWU) algorithm (in the generalization of [7]) is a fast and efficient online learning algorithm for this problem. Let $g_{t,j} := g_t(\mathbf{1}_j)$. Then, given a parameter $\epsilon > 0$, in round $t+1$ the choice of this algorithm takes the following form:
$$\theta_{t+1,j} = \frac{w_{t,j}}{1 + \sum_{j'} w_{t,j'}}, \quad \text{where} \quad w_{t,j} = \begin{cases} w_{t-1,j}(1+\epsilon)^{g_{t,j}} & \text{if } g_{t,j} > 0, \\ w_{t-1,j}(1-\epsilon)^{-g_{t,j}} & \text{if } g_{t,j} \le 0, \end{cases} \tag{5}$$
with initialization $w_{0,j} = 1$ for all $j = 1, \ldots, d$.

**Lemma 4 ([7]).** For any $0 < \epsilon \le \tfrac{1}{2}$, the MWU algorithm provides the following regret bound for the online learning problem described above: $\mathcal{R}(T) \le \epsilon T + \tfrac{\log(d+1)}{\epsilon}$. In particular, for $\epsilon = \sqrt{\tfrac{\log(d+1)}{T}}$, we have $\mathcal{R}(T) \le 2\sqrt{\log(d+1)\, T}$.

For the rest of the paper, we refer to the MWU algorithm with $\epsilon = \sqrt{\tfrac{\log(d+1)}{T}}$ as the online learning (OL) algorithm, and to the update in (5) as the OL update at time $t+1$.
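The update (5) transcribes directly into code; the sketch below (class name illustrative) keeps the weight vector $w_t$ and exposes $\theta_t$, with the "$1+$" in the denominator ensuring $\|\theta_t\|_1 \le 1$.

```python
import numpy as np

class MultiplicativeWeights:
    """Online learner over Omega = {theta >= 0, ||theta||_1 <= 1}, following update (5)."""

    def __init__(self, dim, T):
        self.eps = np.sqrt(np.log(dim + 1) / T)   # the epsilon of Lemma 4
        self.w = np.ones(dim)                     # w_{0,j} = 1

    def theta(self):
        # theta_{t+1,j} = w_{t,j} / (1 + sum_{j'} w_{t,j'}); keeps ||theta||_1 <= 1
        return self.w / (1.0 + self.w.sum())

    def update(self, g):
        """g: vector of linear gains g_{t,j} = g_t(1_j), each in [-1, 1]."""
        g = np.asarray(g, dtype=float)
        grow = self.w * (1.0 + self.eps) ** g          # used where g_{t,j} > 0
        shrink = self.w * (1.0 - self.eps) ** (-g)     # used where g_{t,j} <= 0
        self.w = np.where(g > 0, grow, shrink)
```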
## 3 Algorithm

### 3.1 Optimistic estimates of unknown parameters

Let $a_t$ denote the arm played by the algorithm at time $t$. At the beginning of every round, we use the outcomes and contexts from previous rounds to construct a confidence ellipsoid for $\mu_*$ and for every column of $W_*$. The construction of the confidence ellipsoid for $\mu_*$ follows directly from the techniques in Section 2.1 with $y_t = x_t(a_t)$ and $r_t$ being the reward at time $t$. To construct a confidence ellipsoid for a column $j$ of $W_*$, we use the techniques in Section 2.1 while substituting $y_t = x_t(a_t)$ and $r_t = v_t(a_t)_j$, for every $j$.

As in Section 2.1, let $M_t := I + \sum_{i=1}^{t-1} x_i(a_i) x_i(a_i)^\top$, and construct the regularized least-squares estimates for $\mu_*$ and $W_*$, respectively, as
$$\hat{\mu}_t := M_t^{-1} \sum_{i=1}^{t-1} x_i(a_i) r_i(a_i), \tag{6}$$
$$\hat{W}_t := M_t^{-1} \sum_{i=1}^{t-1} x_i(a_i) v_i(a_i)^\top. \tag{7}$$
Define the confidence ellipsoid for the parameter $\mu_*$ as
$$C_{t,0} := \left\{ \mu \in \mathbb{R}^m : \|\mu - \hat{\mu}_t\|_{M_t} \le \sqrt{m \log\left(\tfrac{d+tmd}{\delta}\right)} + \sqrt{m} \right\},$$
and, for every arm $a$, the optimistic estimate of $\mu_*$ as
$$\tilde{\mu}_t(a) := \arg\max_{\mu \in C_{t,0}} x_t(a)^\top \mu. \tag{8}$$
Let $w_j$ denote the $j$th column of a matrix $W$. We define a confidence ellipsoid for each column $j$ as
$$C_{t,j} := \left\{ w \in \mathbb{R}^m : \|w - \hat{w}_{tj}\|_{M_t} \le \sqrt{m \log\left(\tfrac{d+tmd}{\delta}\right)} + \sqrt{m} \right\},$$
and denote by $G_t$ the Cartesian product of all these ellipsoids: $G_t := \{W \in \mathbb{R}^{m \times d} : w_j \in C_{t,j}\}$. Note that Lemma 2 implies that $W_* \in G_t$ with probability $1-\delta$. Now, given a vector $\theta_t \in \mathbb{R}^d$, we define the optimistic estimate of the weight matrix at time $t$ w.r.t. $\theta_t$, for every arm $a \in [K]$, as
$$\tilde{W}_t(a) := \arg\min_{W \in G_t} x_t(a)^\top W \theta_t. \tag{9}$$

Intuitively, for the reward we want an upper confidence bound and for the consumption we want a lower confidence bound as optimistic estimates. This intuition aligns with the above definitions, where a maximizer was used in the case of reward and a minimizer was used for consumption. The utility and precise meaning of $\theta_t$ will become clearer when we describe the algorithm and present the regret analysis. Using the definitions of $\tilde{\mu}_t, \tilde{W}_t$, along with the results in Lemma 2 and Corollary 1 about confidence ellipsoids, the following can be derived.

**Corollary 2.** With probability $1-\delta$, for any sequence $\theta_1, \theta_2, \ldots, \theta_T$:
1. $x_t(a)^\top \tilde{\mu}_t(a) \ge x_t(a)^\top \mu_*$, for all arms $a \in [K]$ and all times $t$.
2. $x_t(a)^\top \tilde{W}_t(a)\theta_t \le x_t(a)^\top W_* \theta_t$, for all arms $a \in [K]$ and all times $t$.
3. $\left|\sum_{t=1}^T \left(\tilde{\mu}_t(a_t) - \mu_*\right)^\top x_t(a_t)\right| \le 2m\sqrt{T \log\left(\tfrac{1+Tm}{\delta}\right)}\, \log(T)$.
4. $\sum_{t=1}^T \left(\tilde{W}_t(a_t) - W_*\right)^\top x_t(a_t) \le \mathbf{1}_d \cdot 2m\sqrt{T \log\left(\tfrac{d+Tmd}{\delta}\right)}\, \log(T)$ (componentwise).

Essentially, the first two claims ensure that we have optimistic estimates, and the last two claims ensure that the estimates quickly converge to the true parameters.

### 3.2 The core algorithm

In this section, we present an algorithm and its analysis under the assumption that a parameter $Z$ satisfying certain properties is given. Later, we show how to use the first $T_0$ rounds to compute such a $Z$, and also bound the additional regret due to these $T_0$ rounds. We define $Z$ now.

**Assumption 1.** Let $Z$ be such that, for some universal constants $c, c'$,
$$\frac{\text{OPT}}{B} \le Z \le c\,\frac{\text{OPT}}{B} + c'.$$

The algorithm constructs the estimates $\hat{\mu}_t$ and $\hat{W}_t$ as in Section 3.1. It also runs the OL algorithm for an instance of the online learning problem. The vector played by the OL algorithm in time step $t$ is $\theta_t$. After observing the context, the optimistic estimates for each arm are constructed using $\theta_t$, as defined in (8) and (9). Intuitively, $\theta_t$ is used here as a multiplier to combine different columns of the weight matrix, to get an optimistic weight vector for every arm. An adjusted estimated reward for arm $a$ is then defined by using $Z$ to linearly combine the optimistic estimate of the reward with the optimistic estimate of the consumption, as $\left(x_t(a)^\top \tilde{\mu}_t(a)\right) - Z\left(x_t(a)^\top \tilde{W}_t(a)\theta_t\right)$. The algorithm chooses the arm which appears to be the best according to the adjusted estimated reward. After observing the resulting reward and consumption vectors, the estimates are updated. The online learning algorithm is advanced by one step, by defining the profit vector to be $v_t(a_t) - \frac{B}{T}\mathbf{1}$. The algorithm ends either after $T$ time steps or as soon as the total consumption exceeds the budget along some dimension.

**Algorithm 1** Algorithm for linCBwK, with given $Z$
- Initialize $\theta_1$ as per the online learning (OL) algorithm.
- Initialize $Z$ that satisfies Assumption 1.
- **for all** $t = 1, \ldots, T$ **do**
  - Observe $X_t$.
  - For every $a \in [K]$, compute $\tilde{\mu}_t(a)$ and $\tilde{W}_t(a)$ as per (8) and (9), respectively.
  - Play the arm $a_t := \arg\max_{a \in [K]} x_t(a)^\top \left(\tilde{\mu}_t(a) - Z\, \tilde{W}_t(a)\theta_t\right)$.
  - Observe $r_t(a_t)$ and $v_t(a_t)$. If for some $j = 1, \ldots, d$, $\sum_{t' \le t} v_{t'}(a_{t'})^\top e_j \ge B$, then EXIT.
  - Use $x_t(a_t)$, $r_t(a_t)$ and $v_t(a_t)$ to obtain $\hat{\mu}_{t+1}$, $\hat{W}_{t+1}$ and $G_{t+1}$.
  - Choose $\theta_{t+1}$ using the OL update (refer to (5)) with $g_t(\theta_t) := \theta_t^\top \left(v_t(a_t) - \frac{B}{T}\mathbf{1}\right)$.
- **end for**
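Here is a compact sketch of Algorithm 1's loop, reusing the hypothetical `LinCBwKEnvironment` and `MultiplicativeWeights` classes from the earlier sketches. Because $\theta_t \ge 0$ and $G_t$ is a product of ellipsoids, the optima in (8) and (9) have the closed forms $x^\top \hat{\mu}_t + \beta \|x\|_{M_t^{-1}}$ and $x^\top \hat{W}_t \theta_t - \beta \|x\|_{M_t^{-1}} \|\theta_t\|_1$, which is what the code uses; everything else is illustrative rather than the paper's implementation.

```python
import numpy as np

def run_algorithm1(env, Z, delta=0.01):
    """Sketch of Algorithm 1: optimistic estimates combined via the OL multiplier theta_t."""
    m, d, K, B, T = env.m, env.d, env.K, env.B, env.T
    M = np.eye(m)                       # M_t = I + sum_s x_s(a_s) x_s(a_s)'
    b_r = np.zeros(m)                   # sum_s x_s(a_s) r_s(a_s)
    b_v = np.zeros((m, d))              # sum_s x_s(a_s) v_s(a_s)'
    ol = MultiplicativeWeights(dim=d, T=T)
    spent, total_reward = np.zeros(d), 0.0

    for t in range(1, T + 1):
        X = env.contexts()
        mu_hat = np.linalg.solve(M, b_r)                     # (6)
        W_hat = np.linalg.solve(M, b_v)                      # (7)
        beta = np.sqrt(m * np.log((d + t * m * d) / delta)) + np.sqrt(m)   # ellipsoid radius
        theta = ol.theta()

        # adjusted optimistic reward of each arm: x'tilde_mu - Z * x'tilde_W theta
        scores = np.empty(K)
        for a in range(K):
            x = X[:, a]
            width = np.sqrt(x @ np.linalg.solve(M, x))       # ||x||_{M_t^{-1}}
            opt_reward = x @ mu_hat + beta * width           # max over C_{t,0}, i.e. (8)
            opt_cons = x @ W_hat @ theta - beta * width * theta.sum()   # min over G_t, i.e. (9)
            scores[a] = opt_reward - Z * opt_cons
        a_t = int(np.argmax(scores))

        r, v = env.pull(X, a_t)
        spent += v
        total_reward += r
        if np.any(spent >= B):          # EXIT once some resource is exhausted
            break

        x = X[:, a_t]                   # update the estimates and the OL multiplier
        M += np.outer(x, x)
        b_r += x * r
        b_v += np.outer(x, v)
        ol.update(v - B / T)            # g_t(theta) = theta'(v_t(a_t) - (B/T) 1)
    return total_reward
```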
**Theorem 2.** Given a $Z$ as per Assumption 1, Algorithm 1 achieves the following, with probability $1-\delta$:
$$\text{regret}(T) = O\left(\left(\tfrac{\text{OPT}}{B} + 1\right) m \sqrt{T \log(dT/\delta)}\, \log(T)\right).$$

*(Proof Sketch)* We provide a sketch of the proof here; a full proof is given in Appendix E in the supplement. Let $\tau$ be the stopping time of the algorithm. The proof is in three steps.

**Step 1:** Since $\mathbb{E}[v_t(a_t) \mid X_t, a_t, H_{t-1}] = W_*^\top x_t(a_t)$, we apply the Azuma-Hoeffding inequality to get that, with high probability, $\left\|\sum_{t=1}^\tau v_t(a_t) - W_*^\top x_t(a_t)\right\|$ is small. Therefore, we can work with $\sum_{t=1}^\tau W_*^\top x_t(a_t)$ instead of $\sum_{t=1}^\tau v_t(a_t)$. A similar application of the Azuma-Hoeffding inequality is used to bound the gap $\left|\sum_{t=1}^\tau r_t(a_t) - \mu_*^\top x_t(a_t)\right|$, so that a lower bound on $\sum_{t=1}^\tau \mu_*^\top x_t(a_t)$ is sufficient to lower bound the total reward $\sum_{t=1}^\tau r_t(a_t)$.

**Step 2:** Using Corollary 2, with high probability we can bound $\left\|\sum_{t=1}^T \left(W_* - \tilde{W}_t(a_t)\right)^\top x_t(a_t)\right\|$. It is therefore sufficient to work with the sum of vectors $\tilde{W}_t(a_t)^\top x_t(a_t)$ instead of $W_*^\top x_t(a_t)$, and similarly with $\tilde{\mu}_t(a_t)^\top x_t(a_t)$ instead of $\mu_*^\top x_t(a_t)$.

**Step 3:** The proof is completed by showing the desired bound on $\text{OPT} - \sum_{t=1}^\tau \tilde{\mu}_t(a_t)^\top x_t(a_t)$. This part is similar to the online stochastic packing problem; if the actual reward and consumption vectors were $\tilde{\mu}_t(a_t)^\top x_t(a_t)$ and $\tilde{W}_t(a_t)^\top x_t(a_t)$, then it would be exactly that problem. We adapt techniques from [4]: use the OL algorithm and the $Z$ parameter to combine constraints into the objective. If a dimension is being consumed too fast, then the multiplier for that dimension should increase, making the algorithm pick arms that are not likely to consume too much along this dimension. Regret is then bounded by a combination of the online learning regret and the error in the optimistic estimates.

### 3.3 Algorithm with Z computation

In this section, we present a modification of Algorithm 1 which computes the required parameter $Z$ satisfying Assumption 1, and therefore does not need to be provided with a $Z$ as input. The algorithm computes $Z$ using observations from the first $T_0$ rounds. Once $Z$ is computed, Algorithm 1 can be run for the remaining time steps. However, it needs to be modified slightly to take into account the budget consumed during the first $T_0$ rounds. We handle this by using a smaller budget $B' = B - T_0$ in the computations for the remaining rounds. The modified algorithm is given below.

**Algorithm 2** Algorithm for linCBwK, with $Z$ computation
- Inputs: $B$, $T_0$, $B' = B - T_0$.
- Using observations from the first $T_0$ rounds, compute a $Z$ that satisfies Assumption 1.
- Run Algorithm 1 for $T - T_0$ rounds with budget $B'$.

Next, we provide the details of how to compute $Z$ from observations in the first $T_0$ rounds, and how to choose $T_0$. We provide a method that takes advantage of the linear structure of the problem, and explores in the $m$-dimensional space of contexts and weight vectors to obtain bounds independent of $K$. In every round $t = 1, \ldots, T_0$, after observing $X_t$, let $p_t \in \Delta^{[K]}$ be
$$p_t := \arg\max_{p \in \Delta^{[K]}} \|X_t p\|_{M_t^{-1}}, \tag{10}$$
$$\text{where} \quad M_t := I + \sum_{i=1}^{t-1} (X_i p_i)(X_i p_i)^\top. \tag{11}$$
Select arm $a_t = a$ with probability $p_t(a)$. In fact, since $M_t$ is a PSD matrix, due to the convexity of the function $\|X_t p\|^2_{M_t^{-1}}$, this is the same as playing $a_t = \arg\max_{a \in [K]} \|x_t(a)\|_{M_t^{-1}}$.

Construct estimates $\hat{\mu}_t, \hat{W}_t$ of $\mu_*, W_*$ at time $t$ as
$$\hat{\mu}_t := M_t^{-1} \sum_{i=1}^{t-1} (X_i p_i)\, r_i(a_i), \qquad \hat{W}_t := M_t^{-1} \sum_{i=1}^{t-1} (X_i p_i)\, v_i(a_i)^\top.$$
And, for some value of $\gamma$ defined later, obtain an estimate $\widehat{\text{OPT}}_\gamma$ of OPT as
$$\widehat{\text{OPT}}_\gamma := \max_\pi \; \frac{T}{T_0} \sum_{i=1}^{T_0} \hat{\mu}_i^\top X_i \pi(X_i) \quad \text{such that} \quad \frac{T}{T_0} \sum_{i=1}^{T_0} \hat{W}_i^\top X_i \pi(X_i) \le (B + \gamma)\mathbf{1}. \tag{12}$$

For an intuition about the choice of arm in (10), observe from the discussion in Section 2.1 that every column $w_{*j}$ of $W_*$ is guaranteed to lie inside the confidence ellipsoid centered at column $\hat{w}_{tj}$ of $\hat{W}_t$, namely the ellipsoid $\|w - \hat{w}_{tj}\|^2_{M_t} \le 4m \log(Tm/\delta)$. Note that this ellipsoid has principal axes along the eigenvectors of $M_t$, and the lengths of the semi-principal axes are given by the inverse eigenvalues of $M_t$. Therefore, by maximizing $\|X_t p\|_{M_t^{-1}}$ we are choosing the context closest to the direction of the longest principal axis of the confidence ellipsoid, i.e., the direction of maximum uncertainty. Intuitively, this corresponds to pure exploration: by making an observation in the direction where uncertainty is large, we can reduce the uncertainty in our estimate most effectively.

A more algebraic explanation is as follows. In order to get a good estimate of OPT by $\widehat{\text{OPT}}_\gamma$, we want the estimates $\hat{W}_t$ and $W_*$ (and $\hat{\mu}_t$ and $\mu_*$) to be close enough so that $\sum_{t=1}^{T_0} (\hat{W}_t - W_*)^\top X_t \pi(X_t)$ (and $|\sum_{t=1}^{T_0} (\hat{\mu}_t - \mu_*)^\top X_t \pi(X_t)|$) is small for all policies $\pi$, and in particular for sample optimal policies. Now, using Cauchy-Schwarz, these are bounded by $\sum_{t=1}^{T_0} \|\hat{\mu}_t - \mu_*\|_{M_t} \|X_t \pi(X_t)\|_{M_t^{-1}}$ and $\sum_{t=1}^{T_0} \|\hat{W}_t - W_*\|_{M_t} \|X_t \pi(X_t)\|_{M_t^{-1}}$, where we define $\|W\|_M$, the $M$-norm of a matrix $W$, to be the maximum of the column-wise $M$-norms. Using Lemma 2, the term $\|\hat{\mu}_t - \mu_*\|_{M_t}$ is bounded by $2\sqrt{m \log(T_0 m/\delta)}$, and $\|\hat{W}_t - W_*\|_{M_t}$ is bounded by $2\sqrt{m \log(T_0 m d/\delta)}$, with probability $1-\delta$. Lemma 3 bounds the second term $\sum_{t=1}^{T_0} \|X_t \pi(X_t)\|_{M_t^{-1}}$, but only when $\pi$ is the played policy. This is where we use that the played policy $p_t$ was chosen to maximize $\|X_t p_t\|_{M_t^{-1}}$, so that $\sum_{t=1}^{T_0} \|X_t \pi(X_t)\|_{M_t^{-1}} \le \sum_{t=1}^{T_0} \|X_t p_t\|_{M_t^{-1}}$, and the bound $\sum_{t=1}^{T_0} \|X_t p_t\|_{M_t^{-1}} \le \sqrt{m T_0 \log(T_0)}$ given by Lemma 3 actually bounds $\sum_{t=1}^{T_0} \|X_t \pi(X_t)\|_{M_t^{-1}}$ for all $\pi$. Combining, we get a bound of $2m\sqrt{T_0 \log(T_0) \log(T_0 d/\delta)}$ on the deviations $\sum_{t=1}^{T_0} (\hat{W}_t - W_*)^\top X_t \pi(X_t)$ and $|\sum_{t=1}^{T_0} (\hat{\mu}_t - \mu_*)^\top X_t \pi(X_t)|$ for all $\pi$. We prove the following lemma.

**Lemma 5.** For $\gamma = \frac{T}{T_0}\, 2m\sqrt{T_0 \log(T_0) \log(T_0 d/\delta)}$, with probability $1 - O(\delta)$,
$$\text{OPT} - 2\gamma \le \widehat{\text{OPT}}_{2\gamma} \le \text{OPT} + 9\gamma\left(\tfrac{\text{OPT}}{B} + 1\right).$$

**Corollary 3.** Set $Z = \frac{\widehat{\text{OPT}}_{2\gamma} + 2\gamma}{B} + 1$, with the above value of $\gamma$. Then, with probability $1 - O(\delta)$,
$$\frac{\text{OPT}}{B} + 1 \le Z \le \left(1 + \frac{11\gamma}{B}\right)\left(\frac{\text{OPT}}{B} + 1\right).$$

Corollary 3 implies that as long as $B \ge \gamma$, i.e., $B \ge \Omega(mT/\sqrt{T_0})$, $Z$ is a constant factor approximation of $\frac{\text{OPT}}{B} + 1$, and therefore Theorem 2 should provide an $O\big((\frac{\text{OPT}}{B} + 1)\, m\sqrt{T}\big)$ regret bound. However, this bound does not account for the budget consumed in the first $T_0$ rounds. Considering that (at most) $T_0$ amount can be consumed from the budget in the first $T_0$ rounds, we have an additional regret of $\frac{\text{OPT}}{B} T_0$. Further, since we have $B' = B - T_0$ budget for the remaining $T - T_0$ rounds, we need a $Z$ that satisfies the required assumption for $B'$ instead of $B$ (i.e., we need $\frac{\text{OPT}}{B'} \le Z \le O(1)\big(\frac{\text{OPT}}{B'} + 1\big)$). If $B \ge 2T_0$, then $B' \ge B/2$, and using $2$ times the $Z$ computed in Corollary 3 would satisfy the required assumption. Together, these observations give Theorem 3.

**Theorem 3.** Using Algorithm 2 with $T_0$ such that $B > \max\{2T_0,\; mT/\sqrt{T_0}\}$, and twice the $Z$ given by Corollary 3, we get a high-probability regret bound of
$$O\left(\left(\tfrac{\text{OPT}}{B} + 1\right)\left(T_0 + m\sqrt{T \log(dT/\delta)}\, \log(T)\right)\right).$$
In particular, for $B > m^{1/2}T^{3/4}$ and $m \le \sqrt{T}$, we can use $T_0 = m\sqrt{T}$ to get a regret bound of $O\big(\big(\tfrac{\text{OPT}}{B} + 1\big)\, m\sqrt{T \log(dT/\delta)}\, \log(T)\big)$.
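Below is a sketch of the first phase of Algorithm 2, reusing the hypothetical `LinCBwKEnvironment` and `empirical_static_lp` helpers from the earlier sketches; the expression for $\gamma$ follows the reading of Lemma 5 above, and the whole routine is illustrative rather than the paper's code.

```python
import numpy as np

def estimate_Z(env, T0, delta=0.01):
    """First phase of Algorithm 2: pure exploration for T0 rounds, then (12) and Z."""
    m, d, B, T = env.m, env.d, env.B, env.T
    M = np.eye(m)
    b_r, b_v = np.zeros(m), np.zeros((m, d))
    r_list, V_list = [], []

    for i in range(1, T0 + 1):
        X = env.contexts()
        # exploration step (10): play the context with the largest M_t^{-1}-norm
        widths = np.einsum('ma,mn,na->a', X, np.linalg.inv(M), X)   # ||x_t(a)||^2_{M_t^{-1}}
        a_i = int(np.argmax(widths))
        r, v = env.pull(X, a_i)
        # store the per-round estimated reward/consumption rows used in (12)
        mu_hat, W_hat = np.linalg.solve(M, b_r), np.linalg.solve(M, b_v)
        r_list.append(mu_hat @ X)
        V_list.append(W_hat.T @ X)
        x = X[:, a_i]
        M += np.outer(x, x)
        b_r += x * r
        b_v += np.outer(x, v)

    # gamma as read from Lemma 5 (constant illustrative)
    gamma = (T / T0) * 2 * m * np.sqrt(T0 * np.log(T0) * np.log(T0 * d / delta))
    opt_hat = empirical_static_lp(r_list, V_list, B + 2 * gamma, T)   # (12) with slack 2*gamma
    Z = (opt_hat + 2 * gamma) / B + 1                                 # Corollary 3
    return 2 * Z                                                      # Theorem 3 uses twice this Z
```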
## References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, 2012.
[2] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In ICML, June 2014.
[3] S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC '14, 2014.
[4] S. Agrawal and N. R. Devanur. Fast algorithms for online stochastic convex programming. In SODA, pages 1405-1424, 2015.
[5] S. Agrawal, N. R. Devanur, and L. Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In COLT, 2016.
[6] S. Agrawal, Z. Wang, and Y. Ye. A dynamic near-optimal algorithm for online linear programming. Operations Research, 62:876-890, 2014.
[7] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(6):121-164, 2012.
[8] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3, Mar. 2003.
[9] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In FOCS, pages 207-216, 2013.
[10] A. Badanidiyuru, J. Langford, and A. Slivkins. Resourceful contextual bandits. In Proceedings of The Twenty-Seventh Conference on Learning Theory (COLT-14), pages 1109-1134, 2014.
[11] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.
[12] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[13] N. R. Devanur and T. P. Hayes. The adwords problem: online keyword matching with budgeted bidders under random permutations. In EC, 2009.
[14] N. R. Devanur, K. Jain, B. Sivan, and C. A. Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In EC, 2011.
[15] J. Feldman, M. Henzinger, N. Korula, V. S. Mirrokni, and C. Stein. Online stochastic packing applied to display ad allocation. In Proceedings of the 18th Annual European Conference on Algorithms: Part I, ESA '10, 2010.
[16] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742-765, 2009.
[17] H. Wu, R. Srikant, X. Liu, and C. Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2015.