# Smoothed Adversarial Linear Contextual Bandits with Knapsacks

Vidyashankar Sivakumar 1, Shiliang Zuo 2, Arindam Banerjee 2

Many bandit problems are characterized by the learner making decisions under constraints. The learner in Linear Contextual Bandits with Knapsacks (LinCBwK) receives a resource consumption vector in addition to a scalar reward in each time step, both of which are linear functions of the context corresponding to the chosen arm. For a fixed time horizon $T$, the goal of the learner is to maximize rewards while ensuring resource consumptions do not exceed a pre-specified budget. We present algorithms and characterize regret for LinCBwK in the smoothed setting, where base context vectors are assumed to be perturbed by Gaussian noise. We consider both the stochastic and adversarial settings for the base contexts, and our analysis of stochastic LinCBwK can be viewed as a warm-up to the more challenging adversarial LinCBwK. For the stochastic setting, we obtain $O(\sqrt{T})$ additive regret bounds compared to the best context-dependent fixed policy. The analysis combines ideas for greedy parameter estimation in (Kannan et al., 2018; Sivakumar et al., 2020) and the primal-dual paradigm first explored in (Agrawal & Devanur, 2016; 2014a). Our main contribution is an algorithm with $O(\log T)$ competitive ratio relative to the best context-dependent fixed policy for the adversarial setting. The algorithm for the adversarial setting employs ideas from the primal-dual framework (Agrawal & Devanur, 2016; 2014a) and a novel adaptation of the doubling trick (Immorlica et al., 2019).

1 Amazon. 2 Department of Computer Science, University of Illinois, Urbana-Champaign. Correspondence to: Vidyashankar Sivakumar, Shiliang Zuo, Arindam Banerjee. All work by VS was done prior to starting employment with Amazon.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

The contextual bandits framework (Langford & Zhang, 2007; Lattimore & Szepesvari, 2019; Slivkins, 2021) is a popular sequential decision making framework used in multiple practical applications such as clinical trials, web search, and content optimization. Contextual bandit problems have multiple rounds where in each round a learner observes a set of context vectors corresponding to $K$ arms and chooses an arm with the goal of maximizing cumulative rewards. In linear contextual bandits (LinCB) (Chu et al., 2011; Li et al., 2010), the reward for an arm is a linear function of the context vector. Algorithms for contextual bandit problems typically have to balance between exploration, choosing potentially sub-optimal arms to acquire more information, and exploitation, choosing arms to optimize immediate rewards. Motivated by fairness and ethics considerations (Bird et al., 2016; Raghavan et al., 2018; Kannan et al., 2018) or to avoid inefficient exploration strategies (Bastani et al., 2018), recent work has studied settings where an exploration-free greedy algorithm can be employed. The crux of all such prior work is an assumption of inherent randomness in context vectors aiding exploration. In the smoothed bandits framework (Kannan et al., 2018) the inherent randomness is due to stochastic perturbations of the context vectors.
Often a learner has to operate under constraints while maximizing rewards (Babaioff et al., 2015; Badanidiyuru et al., 2012; Besbes & Zeevi, 2012; Singla & Krause, 2013; Combes et al., 2015). For example, clinical trials are constrained by available medical resources, and optimizing ad placements should account for advertiser budgets and user reach. There is now a body of work under the theme bandits with knapsacks (BwK) addressing the tension between maximizing rewards and satisfying constraints (Badanidiyuru et al., 2013; Agrawal & Devanur, 2014a; 2016; Immorlica et al., 2019; Agrawal et al., 2016; Badanidiyuru et al., 2014).

Smoothed LinCBwK. In this work, we consider the linear contextual bandits with knapsacks (LinCBwK) problem (Agrawal & Devanur, 2016) under the smoothed assumption on context vectors. The LinCBwK problem has $d$ resources with budget $B \in \mathbb{R}_+$ for each resource. In each round $t$, there are $K$ context vectors $\{x_t(a)\}_{a=1}^{K}$, $x_t(a) \in \mathbb{R}^m$, one corresponding to each arm $a \in [K]$. The smoothed context vectors are of the form
$$x_t(a) = \nu_t(a) + g_t(a),$$
where $\nu_t(a) \in B_2^m$ (unit ball) is the base context vector and $g_t(a) \sim N(0, \sigma^2 I_{m \times m})$ is a Gaussian perturbation. Typically $\sigma^2 = 1/m$ (Kannan et al., 2018; Sivakumar et al., 2020). On choosing an arm $a_t \in [K]$, the learner receives a noisy reward $r_t(a_t) \in \mathbb{R}$ and consumption vector $v_t(a_t) \in \mathbb{R}^d$ satisfying
$$\mathbb{E}[r_t(a) \mid x_t(a), H_{t-1}] = \mu_*^\top x_t(a), \quad (1)$$
$$\mathbb{E}[v_t(a) \mid x_t(a), H_{t-1}] = W_*^\top x_t(a), \quad (2)$$
where $H_{t-1}$ denotes the history, and the reward vector $\mu_* \in S^{m-1}$ (unit sphere) and the consumption matrix $W_* \in \mathbb{R}^{m \times d}$ with columns in $S^{m-1}$ are fixed but unknown to the learner. Additionally, the parameter $\mu_*$ and the columns of $W_*$ can be assumed to have some structure like sparsity. In each round, the learner also has a "no-op" option, which is to choose none of the arms and receive 0 reward with 0 resource consumption. The learner's goal is to choose arms which maximize the total reward over $T$ time steps while ensuring the total consumption of each resource does not exceed $B$:
$$\max \sum_{t=1}^{T} r_t(a_t) \quad \text{s.t.} \quad \sum_{t=1}^{T} v_t(a_t) \le B \mathbf{1}. \quad (3)$$

Stochastic and Adversarial Contexts. We study algorithms under two assumptions on the base context vectors: (a) stochastic LinCBwK, where the base context vectors $\nu_t(a)$ are sampled from an unknown distribution, and (b) adversarial LinCBwK, where the $\nu_t(a)$ are chosen by an adaptive adversary. In both cases, we compare the learner's performance against an optimal adaptive policy with knowledge of $\mu_*, W_*$ which, unlike LinCB, is no longer a single arm but a probability distribution over the arms (Agrawal & Devanur, 2014a; Badanidiyuru et al., 2013; Immorlica et al., 2019). Compared to LinCB, the learner's algorithm is more complicated because it must not only optimize the per-step reward but also account for how much of each resource is consumed now versus conserved for the future. The tension between consumption and conservation differentiates the algorithms for stochastic and adversarial LinCBwK. While in the stochastic setting historical data can be extrapolated to guide decisions to consume or conserve, in the adversarial setting it is impossible to use historical data to plan for the future. Thus, for stochastic LinCBwK, we are able to achieve stronger additive regret bounds with respect to the optimal adaptive policy.
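To make the observation model in (1)-(3) concrete, the following is a minimal simulation sketch of a smoothed LinCBwK environment. The additive sub-Gaussian observation noise, the class/function names, and the specific noise level are illustrative assumptions and not part of the formal setup.

```python
import numpy as np

def smooth_contexts(base_contexts, sigma, rng):
    """Perturb base contexts nu_t(a) (rows, in the unit ball) with N(0, sigma^2 I) noise."""
    K, m = base_contexts.shape
    return base_contexts + sigma * rng.standard_normal((K, m))

class SmoothedLinCBwKEnv:
    """Smoothed LinCBwK environment: linear expected reward mu_star^T x and linear
    expected consumption W_star^T x as in (1)-(2); observation noise is illustrative."""
    def __init__(self, mu_star, W_star, sigma, noise_std=0.1, seed=0):
        self.mu_star, self.W_star = mu_star, W_star     # mu_star in S^{m-1}, columns of W_star in S^{m-1}
        self.sigma, self.noise_std = sigma, noise_std
        self.rng = np.random.default_rng(seed)

    def contexts(self, base_contexts):
        return smooth_contexts(base_contexts, self.sigma, self.rng)

    def pull(self, x):
        """Noisy reward and consumption for the chosen context x (pass zeros for no-op)."""
        r = self.mu_star @ x + self.noise_std * self.rng.standard_normal()
        v = self.W_star.T @ x + self.noise_std * self.rng.standard_normal(self.W_star.shape[1])
        return r, v

# Example: m = 5 dimensions, K = 4 arms, d = 2 resources, sigma^2 = 1/m.
m, K, d = 5, 4, 2
rng = np.random.default_rng(1)
mu_star = rng.standard_normal(m); mu_star /= np.linalg.norm(mu_star)
W_star = rng.standard_normal((m, d)); W_star /= np.linalg.norm(W_star, axis=0)
env = SmoothedLinCBwKEnv(mu_star, W_star, sigma=1 / np.sqrt(m))
nu = rng.standard_normal((K, m)); nu /= np.maximum(1.0, np.linalg.norm(nu, axis=1, keepdims=True))
X = env.contexts(nu)       # one round of smoothed contexts
r, v = env.pull(X[0])      # pull arm 0
```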
Let $\text{OPT}_1$ denote the optimal adaptive policy's reward for smoothed stochastic LinCBwK. We present an algorithm whose reward REW satisfies
$$\text{REW} \ge \text{OPT}_1 - O(\sqrt{T}), \quad (4)$$
where the $O(\cdot)$ notation hides dependence on logarithmic factors and the dimension of the problem. For adversarial LinCBwK, we are only able to bound the competitive ratio, i.e., the ratio between the optimal adaptive policy's reward and the algorithm's reward. With $\text{OPT}_2$ denoting the optimal adaptive policy's reward for smoothed adversarial LinCBwK, we present an algorithm whose reward REW satisfies
$$\text{REW} \ge \frac{\text{OPT}_2}{O(d \lceil \log T \rceil)} - O(\sqrt{T}). \quad (5)$$
Our framework is general enough to handle structural assumptions like sparsity on the parameter vectors.

LinCBwK Algorithms. Algorithms for both stochastic and adversarial LinCBwK broadly perform two steps in each round: 1) estimating the reward and consumptions for each arm, and 2) pulling arms while balancing earned reward and resource consumptions. In the smoothed setting, we use (episodic) greedy estimates of the reward and consumptions of each arm obtained from constrained least squares estimates of the reward and consumption parameters (Kannan et al., 2018; Sivakumar et al., 2020). This is in sharp contrast with most existing work in contextual bandits, including prior work on stochastic LinCBwK (Agrawal & Devanur, 2016), which computes rewards based on upper confidence bounds (UCB). For choosing arms while balancing the tradeoff between reward and consumptions, multiple primal-dual approaches have been explored in BwK (Badanidiyuru et al., 2013; Agrawal & Devanur, 2016; 2014b; Immorlica et al., 2019) and the related literature on online stochastic packing problems (Agarwal et al., 2014; Mehta, 2013; Mehta et al., 2007; Buchbinder & Naor, 2009; Williamson & Shmoys, 2011; Devanur & Hayes, 2009; Agrawal & Devanur, 2014b; Devanur et al., 2011; Feldman et al., 2010; Molinaro & Ravi, 2012). Our algorithm is built on the framework developed in (Agrawal & Devanur, 2014b; 2016), where in each round the algorithm maximizes a linear combination of the reward and consumptions with a tradeoff parameter $Z$. The function can be viewed as the Lagrangian of the constrained linear program (LP) in (3) with suitable modifications due to the sequential and bandit nature of the problem. $Z$ can be viewed as a global parameter which captures the tradeoff between the optimal reward and the budget. The specifics of the computation of $Z$ differ for stochastic and adversarial LinCBwK, being substantially more challenging in the adversarial setting. Our algorithms provide ways of computing $Z$ and come with regret bounds as in (4), (5).

Comparison with Prior Work. We analyze smoothed LinCBwK where the (stochastic or adversarial) contexts are perturbed by a small amount of Gaussian noise. The smoothing assumption (Kannan et al., 2018; Sivakumar et al., 2020) is a middle ground between assuming stochastic independent contexts and adversarial contexts, both of which arguably are not representative of most real world problems.

Table 1. Comparisons with prior work on regret upper bounds and budget constraints. OPT is an upper bound on the optimal reward, $K$ is the number of arms, $B$ is the budget, $m$ is the context dimension for the LinCBwK problem, and $d$ is the number of constraints.

| Setting | Regret | Budget $B$ |
|---|---|---|
| Stochastic BwK (Badanidiyuru et al., 2013) | $O(\sqrt{K\,\text{OPT}} + \text{OPT}\sqrt{K/B})$ | — |
| Adversarial BwK (Immorlica et al., 2019) | $\text{OPT}/O(d\log T) - O(T^{7/4}K/B)$ | $\Omega(KT^{3/4})$ |
| Stochastic LinCBwK (Agrawal & Devanur, 2016) | $O(m\sqrt{T})$ | $\Omega(m T^{3/4})$ |
| Smoothed Stochastic LinCBwK (this paper) | $O(m\sqrt{T})$ | $\Omega(m^{2/3} T^{3/4})$ |
| Smoothed Adversarial LinCBwK (this paper) | $\text{OPT}/O(d\log T) - O(m\sqrt{T})$ | $\Omega(T^{3/4})$ |

In sharp contrast to the explore-exploit bandit algorithms in prior literature (Agrawal & Devanur, 2016; Immorlica et al., 2019), we show that an exploration-free greedy algorithm theoretically achieves optimal regret in the smoothed setting. Practically, such exploration-free bandit algorithms are desirable in problems where ethics, fairness (Bird et al., 2016), or computational efficiency (Bastani et al., 2018) are concerns. For LinCB, (Bietti et al., 2018) empirically show the greedy algorithm to perform as well as state-of-the-art explore-exploit bandit algorithms on many practical datasets. Table 1 summarizes prior results for variants of the bandits with knapsacks problem. Our results for stochastic LinCBwK match, up to log factors, the regret bounds for LinCBwK in (Agrawal & Devanur, 2016) and the bounds for LinCB with smoothed contexts in (Sivakumar et al., 2020). Our main contribution is an algorithm and regret analysis for smoothed adversarial LinCBwK. Although we do not rigorously analyze lower bounds, our algorithm achieves a worst-case competitive ratio which matches, up to constant factors, the bounds in (Immorlica et al., 2019) for the arguably simpler setting of adversarial multi-armed bandits (MAB) with knapsacks. We also analyze regret bounds in the high-dimensional regularized setting, e.g., Lasso.

Notations: Vectors are denoted by bold symbols, e.g., $\mu, y$, and matrices are denoted by upper case letters, e.g., $W, X$. For contexts $x_t(a) = \nu_t(a) + g_t(a)$, $a \in [K]$, we will use $X_t, N_t, G_t \in \mathbb{R}^{m \times (K+1)}$ with $X_t = N_t + G_t$ to denote the observed context vectors, base context vectors, and Gaussian perturbation vectors respectively in round $t$. The last column of $X_t, N_t, G_t$ is $\mathbf{0}$, corresponding to the no-op arm. $r_t \in \mathbb{R}^{K+1}$ and $V_t \in \mathbb{R}^{d \times (K+1)}$ will denote the reward and consumption vectors at time $t$.

2. Smoothed LinCBwK

As discussed in Section 1, in the smoothed setting the context vectors are Gaussian-perturbed versions of base context vectors, i.e., for all $a \in [K]$,
$$x_t(a) = \nu_t(a) + g_t(a), \quad (6)$$
where $\nu_t(a) \in B_2^m$ are the base context vectors and $g_t(a) \sim N(0, \sigma^2 I_{m \times m})$ are Gaussian perturbations. We consider two settings for the base context vectors $\nu_t(a)$: (a) stochastic, where the $\nu_t(a)$ are chosen independently from a fixed but unknown distribution, and (b) adversarial, where the $\nu_t(a)$ are chosen by an adaptive adversary. The following result establishes a high probability bound on $\|x_t(a)\|_2$.

Lemma 1 Let $\beta := \max_{t \in [T], a \in [K]} \|x_t(a)\|_2$. Then with probability at least $1 - \delta$,
$$\beta \le O\big(1 + \sigma\sqrt{m \log(TK/\delta)}\big). \quad (7)$$

We will assume $\sigma^2 = 1/m$. Since the parameter vectors have unit $L_2$ norm, the rewards and consumptions in each time step are bounded by $O(\beta)$. Our framework can handle sparsity assumptions on $\mu_*$ and the columns of $W_*$. Without loss of generality, we will assume the same structure for $\mu_*$ and the columns of $W_*$. The sparsity structure is represented by an atomic norm $J(\cdot)$ (Chandrasekaran et al., 2012). We use a constrained least squares estimator in each step for estimation:
$$\hat{\mu}_t = \arg\min_{\mu \in \mathbb{R}^m} \frac{1}{2t}\|y_t - S_t\mu\|_2^2 \quad \text{s.t.} \quad J(\mu) \le J(\mu_*),$$
$$(\widehat{W}_t)_{:j} = \arg\min_{w \in \mathbb{R}^m} \frac{1}{2t}\|(Q_t)_{:j} - S_t w\|_2^2 \quad \text{s.t.} \quad J(w) \le J(W_{*j}),$$
where $S_t \in \mathbb{R}^{t \times m}$ is the matrix with the contexts chosen before time $t$ as rows, $y_t \in \mathbb{R}^t$ are the rewards observed before time $t$, and $Q_t \in \mathbb{R}^{t \times d}$ are the consumption vectors observed before time $t$. Let $E_c := \{e \mid J(\mu_* + e) \le J(\mu_*)\}$ denote the error set and $A = \text{cone}(E_c) \cap S^{m-1}$ denote the error cone (Wainwright, 2019; Vershynin, 2018; Banerjee et al., 2014).
Error sets for the columns of $W_*$ are defined similarly. Our regret results will depend on the Gaussian width $w(A)$ of the error cone (Talagrand, 2005; 2014). For example, $w(A) = \Theta(\sqrt{m})$ if no structure is assumed on the parameter vectors, and $w(A) = \Theta(\sqrt{s \log m})$ if the parameters are $s$-sparse and the Lasso (Tibshirani, 1996) is used for parameter estimation. Henceforth we overload $A$ to denote the error cone for constrained estimation of $\mu_*$ and the columns of $W_*$.

We will compare our algorithms to the class of dynamic policies. Formally, let $\phi(T) = (X_t, r_t, V_t)_{t=1}^{T}$ denote a particular instantiation of contexts for $T$ time steps; $r_t \in \mathbb{R}^{K+1}$ and $V_t \in \mathbb{R}^{d \times (K+1)}$ denote the reward and consumption vectors at time $t$, including the no-op arm. A dynamic policy $p_t : \phi(T) \mapsto \Delta_{K+1}$ maps the contexts at each time step to a distribution over $K+1$ options, viz. the $K$ arms and the no-op option. In particular, we develop regret bounds with respect to the optimal dynamic policy $p_*(\cdot)$, which has foreknowledge of $\mu_*, W_*$ and the mechanism by which the contexts are generated. Interestingly, we will in fact work with a suitably defined context-dependent static policy as a benchmark, by showing that its reward upper bounds the reward of the optimal dynamic policy while satisfying the resource constraints.

Benchmarks: Stochastic setting. In the stochastic setting the $X_t$'s are generated from an unknown distribution $D$. Here $D$ is the distribution of the convolution $X_t = N_t + G_t$, with $X_t, N_t, G_t \in \mathbb{R}^{m \times (K+1)}$ denoting the observed context matrix, base vector matrix, and Gaussian perturbation matrix respectively. The last column of all matrices is the $\mathbf{0}$ vector corresponding to the no-op arm. Let $\pi : \mathbb{R}^{m \times (K+1)} \mapsto \Delta_{K+1}$ denote a context-dependent probability distribution over arms, and let $r(\pi)$ and $v(\pi)$ denote the expected reward and consumption vector of policy $\pi(\cdot)$, i.e., $r(\pi) := \mathbb{E}_{X \sim D}[r^\top \pi(X; D)]$, $v(\pi) := \mathbb{E}_{X \sim D}[V \pi(X; D)]$. The optimal static policy is
$$\pi_* := \arg\max_{\pi} \; T r(\pi) \quad \text{s.t.} \quad T v(\pi) \le B \mathbf{1}. \quad (8)$$
We revisit a result from (Agrawal & Devanur, 2016) which shows that in the stochastic setting the reward obtained by the optimal dynamic policy $p_*(\cdot)$ can be upper bounded by that of an optimal context-dependent static policy $\pi_*(\cdot)$.

Lemma 2 (Lemma 1 in (Agrawal & Devanur, 2016)) Let $\overline{\text{OPT}}_1 := \mathbb{E}_{\phi(T)}[\sum_{t=1}^{T} \mu_*^\top X_t p_t^*(\phi(T))]$ denote the value of the optimal feasible dynamic policy $p_*$, which knows the parameters $\mu_*, W_*$ and the distribution $D$. Then the optimal static policy $\pi_*$ in (8) satisfies $T r(\pi_*) \ge \overline{\text{OPT}}_1$ and $T v(\pi_*) \le B\mathbf{1}$.

Since the optimal static policy $\pi_*$ is competitive with the optimal dynamic policy $p_*$, for the stochastic setting it suffices to compare the performance of our proposed algorithm with the performance of $\pi_*$. If the sequence of arms chosen by our algorithm is $\{a_t\}$, we define regret as $R_1(T) := \text{OPT}_1 - \sum_{t=1}^{T} r_t(a_t)$, where $\text{OPT}_1 := T r(\pi_*)$.

Benchmarks: Adversarial setting. In the smoothed adversarial setting, the base context vectors $\nu_t(a)$, $a \in [K]$, are chosen by an adaptive adversary, and the optimal feasible dynamic policy $p_*$ is the one which maximizes the cumulative reward, $p_* := \arg\max_{p} \sum_{t=1}^{T} \mu_*^\top X_t p_t(\phi(T))$, while staying feasible. As in the smoothed stochastic setting, we consider the optimal static policy $\pi_*$ that depends on the context $X_t$ and the realization $\phi(T)$. The policy $\pi_*$ optimizes the cumulative reward while staying feasible, i.e.,
$$\pi_* := \arg\max_{\pi} \sum_{t=1}^{T} \mu_*^\top X_t \pi(X_t; \phi(T)) \quad \text{s.t.} \quad \sum_{t=1}^{T} W_*^\top X_t \pi(X_t; \phi(T)) \le B \mathbf{1}. \quad (9)$$
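The static benchmarks in (8) and (9) are linear programs over per-round distributions. As a concrete illustration, the following is a sketch that computes the hindsight benchmark (9) with scipy.optimize.linprog, assuming full knowledge of $\mu_*, W_*$ and the realized contexts; the function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def hindsight_static_benchmark(X_list, mu_star, W_star, B):
    """Solve the hindsight LP (9): pick a distribution over K+1 options per round
    (K arms + the all-zeros no-op column), maximizing total expected reward while
    keeping total expected consumption of every resource at most B."""
    T = len(X_list)
    K1 = X_list[0].shape[1]              # K + 1 options per round
    d = W_star.shape[1]
    n = T * K1                           # one variable per (round, option)

    # linprog minimizes, so negate the per-option expected rewards mu_star^T X_t.
    c = -np.concatenate([mu_star @ X for X in X_list])

    # Knapsack constraints: sum_t W_star^T X_t pi_t <= B for each resource.
    A_ub = np.hstack([W_star.T @ X for X in X_list])      # shape (d, n)
    b_ub = np.full(d, B)

    # Simplex constraints: each round's probabilities sum to one.
    A_eq = np.zeros((T, n))
    for t in range(T):
        A_eq[t, t * K1:(t + 1) * K1] = 1.0
    b_eq = np.ones(T)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    pis = res.x.reshape(T, K1)
    return -res.fun, pis                 # benchmark value and per-round policies
```

The algorithms below never solve this LP with the true parameters; they only ever work with estimated, partially observed versions of it.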
Lemma 3 Let $\overline{\text{OPT}}_2 := \sum_{t=1}^{T} \mu_*^\top X_t p_t^*(\phi(T))$ and $\text{OPT}_2 := \sum_{t=1}^{T} \mu_*^\top X_t \pi_*(X_t; \phi(T))$ respectively denote the values of the optimal feasible dynamic and static policies, which have knowledge of the parameters $\mu_*, W_*$. Then $\text{OPT}_2 \ge \overline{\text{OPT}}_2$.

Since the optimal static policy again dominates the optimal dynamic policy under feasibility, it suffices to compare the performance of our algorithm to that of the static policy $\pi_*$. If the sequence of arms chosen by our algorithm is $\{a_t\}$, we define regret as $R_2(T) := \text{OPT}_2 - \sum_{t=1}^{T} r_t(a_t)$, where $\text{OPT}_2 := \sum_{t=1}^{T} \mu_*^\top X_t \pi_*(X_t; \phi(T))$.

3. Greedy Algorithm

In this section we discuss the greedy algorithm GreedyLinCBwK (Algorithm 1), used in both the stochastic and adversarial settings. GreedyLinCBwK is based on the primal-dual paradigm explored in prior work on BwK (Agrawal & Devanur, 2014a;b; 2016; Immorlica et al., 2019) with a key unique aspect: our algorithm keeps updating estimates $\hat{\mu}_t, \widehat{W}_t$ of the true reward and consumption parameters $\mu_*, W_*$ and picks arms greedily using these estimates. The greedy approach to arm selection in GreedyLinCBwK is in sharp contrast with both the existing UCB-based approach for LinCBwK in the stochastic setting (Agrawal & Devanur, 2014a;b; 2016) and the approach for the adversarial BwK setting in (Immorlica et al., 2019). The overall approach in GreedyLinCBwK has similarities with prior work on both stochastic and adversarial bandits with knapsacks (Agrawal & Devanur, 2016; Immorlica et al., 2019) in that the primal arm selection is with bandit feedback and the dual update is essentially a full information online convex optimization (OCO).

Algorithm 1 GreedyLinCBwK
  Input: parameter $Z \in \mathbb{R}_+$, budget $B$, context matrix $S_1$, reward vector $y_1$, consumption matrix $Q_1$
  Initialize $\theta_1$ using the uniform distribution
  for $t = 1$ to $T$ do
    $(S, y, Q, \theta, T_{\text{eff}})_{t+1} =$ GreedyStep$(Z, B, (S, y, Q, \theta, T_{\text{eff}})_t)$
    if the cumulative consumption of any resource $j \in [d]$ is at least $B$ then EXIT
  end for

Algorithm 2 GreedyStep
  Input: parameter $Z \in \mathbb{R}_+$, budget $B$, context matrix $S_t$, reward vector $y_t$, consumption matrix $Q_t$, dual vector $\theta_t$, $T_{\text{eff}}(t)$
  Estimate $\hat{\mu}_t, \widehat{W}_t =$ Estimate$(S_t, y_t, Q_t, T_{\text{eff}}(t))$
  Set $a_t = \arg\max_{a \in [K]} x_t(a)^\top (\hat{\mu}_t - Z \widehat{W}_t \theta_t)$
  if $x_t(a_t)^\top (\hat{\mu}_t - Z \widehat{W}_t \theta_t) \ge -2\zeta_t$ with $\zeta_t$ as in (10) then
    Select arm $a_t$, set $T_{\text{eff}}(t+1) = T_{\text{eff}}(t) + 1$
    Observe reward $r_t(a_t)$ and consumption $v_t(a_t) \in \mathbb{R}^d$
    Append $x_t(a_t), r_t(a_t), v_t(a_t)$ to $S_t, y_t, Q_t$ to get $S_{t+1}, y_{t+1}, Q_{t+1}$
  else
    Set $a_t$ to the no-op arm, $T_{\text{eff}}(t+1) = T_{\text{eff}}(t)$
    Set $S_{t+1}, y_{t+1}, Q_{t+1} = S_t, y_t, Q_t$
  end if
  Update $\theta_{t+1}$ using OMD with $g_t(\theta) := \langle \theta, v_t(a_t) - \frac{B_0}{T}\mathbf{1}\rangle$ and $\theta_t \in \Delta_{d+1}$

3.1. The GreedyLinCBwK Algorithm

The algorithm has three key steps, as illustrated in GreedyStep (Algorithm 2): parameter estimation, arm selection to receive a reward and consumption vector, and a dual update, with some steps picking the no-op arm.

Estimation. In each step, the algorithm estimates the reward vector $\hat{\mu}_t$ and constraint parameter $\widehat{W}_t$ using a constrained least squares formulation as shown in Estimate (Algorithm 3). In the smoothed stochastic setting, one can show that the estimated parameters $\hat{\mu}_t, \widehat{W}_t$ approach the true parameters at an $O(1/\sqrt{T_{\text{eff}}(t)})$ rate using standard techniques. Showing a similar result for the smoothed adversarial setting needs more care, and our analysis leverages recent advances in such analysis (Sivakumar et al., 2020).
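As an illustration of the estimation step, the following is a minimal sketch of norm-constrained least squares via projected gradient descent with an $\ell_1$-ball constraint (the Lasso-type case $J(\cdot) = \|\cdot\|_1$); the step size, iteration count, and the use of projected gradient rather than a dedicated convex solver are illustrative assumptions.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius
    (sort-based algorithm of Duchi et al., 2008)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def constrained_least_squares(S, y, radius, n_iters=500):
    """min_mu (1/2t) ||y - S mu||^2  s.t.  ||mu||_1 <= radius, via projected gradient."""
    t, m = S.shape
    mu = np.zeros(m)
    L = np.linalg.norm(S, 2) ** 2 / t + 1e-12     # Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = S.T @ (S @ mu - y) / t
        mu = project_l1_ball(mu - grad / L, radius)
    return mu

# Usage sketch: estimate mu and each column of W from the data collected so far,
# with the radius J(mu_*) assumed known, as in the constrained estimator of Section 2.
# mu_hat = constrained_least_squares(S_t, y_t, radius=J_mu_star)
# W_hat[:, j] = constrained_least_squares(S_t, Q_t[:, j], radius=J_W_star_j)
```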
Arm Selection. The arm selection is based on
$$a_t = \arg\max_{a \in [K]} \; x_t(a)^\top \big(\hat{\mu}_t - Z \widehat{W}_t \theta_t\big),$$
as long as $x_t(a_t)^\top(\hat{\mu}_t - Z \widehat{W}_t \theta_t) \ge -2\zeta_t$, where
$$\zeta_t = O\Big(\big(w(A) + \sqrt{\log(Td/\delta)}\big)\, \frac{Z \beta \log K}{\sigma \sqrt{T_{\text{eff}}(t)}}\Big); \quad (10)$$
otherwise the no-op arm is selected. Such an arm choice was first explored in (Agrawal & Devanur, 2014b; 2016), who prescribed using $Z = \Omega(\text{OPT}/B)$ to obtain optimal regret rates, where OPT is the reward obtained by the optimal static policy. As we show in subsequent sections, the algorithms for the stochastic and adversarial settings differ in how OPT, and by extension $Z$, is estimated and used.

Algorithm 3 Estimate
  Input: context matrix $S_t$, reward vector $y_t$, consumption matrix $Q_t$, effective steps $T_{\text{eff}}(t)$
  Compute the SVD of the design matrix: $\frac{1}{\sqrt{T_{\text{eff}}(t)}} S_t = U D V^\top$
  Compute the Puffer transformation $F = U D^{-1} U^\top$ and define $\tilde{S}_t = F S_t$, $\tilde{y}_t = F y_t$, $\tilde{Q}_t = F Q_t$
  Estimate parameters using constrained least squares:
    $\hat{\mu}_t = \arg\min_{\mu \in \mathbb{R}^m} \frac{1}{2 T_{\text{eff}}(t)} \|\tilde{y}_t - \tilde{S}_t \mu\|_2^2$ s.t. $J(\mu) \le J(\mu_*)$
    $(\widehat{W}_t)_{:j} = \arg\min_{w \in \mathbb{R}^m} \frac{1}{2 T_{\text{eff}}(t)} \|(\tilde{Q}_t)_{:j} - \tilde{S}_t w\|_2^2$ s.t. $J(w) \le J(W_{*j})$

Dual Update. For a chosen arm $a_t$, the corresponding consumption vector $v_t(a_t)$ needs to stay within the per-step budget $\frac{B}{T}\mathbf{1}$, i.e., $v_t(a_t) - \frac{B}{T}\mathbf{1} \le 0$. Similar to (Agrawal & Devanur, 2016; Immorlica et al., 2019), we assume the presence of a dummy time resource with $B/T$ consumption in each round irrespective of the chosen arm. Following related work (Agrawal & Devanur, 2016; Abernethy et al., 2011), the dual update sequentially maximizes $\langle \theta_{t+1}, v_t(a_t) - \frac{B}{T}\mathbf{1}\rangle$ under the constraint $\theta_{t+1} \in \Delta_{d+1}$, which can be done with online mirror descent (OMD).

No-op Arm. Note that GreedyStep (Algorithm 2) chooses the no-op arm only when the objective for all arms is below $-2\zeta_t$, with $\zeta_t$ as in (10), which changes with time. The negative threshold $-2\zeta_t$ accounts for the objective value estimation error incurred by using $\hat{\mu}_t, \widehat{W}_t$ to estimate the objective instead of $\mu_*, W_*$. In particular, in steps where GreedyStep chooses the no-op arm, an algorithm using $\mu_*, W_*$ would also have chosen the no-op arm with high probability.

3.2. Primal-Dual Perspective

We develop an understanding of GreedyLinCBwK by considering the optimization problem in hindsight that finds the optimum static policy over $T$ steps. The purpose is to gain insights into key steps of the algorithm; more rigorous technical results are presented in subsequent sections. Consider the optimization problem:
$$\max_{\pi} \; \sum_{t=1}^{T} \mu_*^\top X_t \pi(X_t) \quad \text{s.t.} \quad W_*^\top X_t \pi(X_t) - \frac{B}{T}\mathbf{1} \le 0, \;\; t \in [T]. \quad (11)$$
Let $y_t \in \mathbb{R}^{(d+1) \times 1}$ be the Lagrange multipliers corresponding to the $t$-th constraint, and let $\theta_t = y_t / \|y_t\|_1$ be the normalized Lagrange multipliers. The Lagrangian is given by
$$L(\pi, y) = \sum_{t=1}^{T} \mu_*^\top X_t \pi(X_t) - \Big\langle y_t, W_*^\top X_t \pi(X_t) - \frac{B}{T}\mathbf{1} \Big\rangle = \sum_{t=1}^{T} \mu_*^\top X_t \pi(X_t) - Z \Big\langle \theta_t, W_*^\top X_t \pi(X_t) - \frac{B}{T}\mathbf{1} \Big\rangle,$$
where $Z$ is a suitable constant. Note that we are not showing the primal constraints $\pi(X_t) \in \Delta_{K+1}$ and dual constraints $\theta_t \in \Delta_{d+1}$ explicitly to avoid clutter.

Scaling constant $Z$. Let OPT denote the (primal) optimal value of (11). Since the primal is a feasible bounded LP, strong duality holds (Bertsimas & Tsitsiklis, 1997; Boyd & Vandenberghe, 2005), and the optimal values of the primal and dual problems match, i.e., both equal OPT. With $y_t^*$ denoting the optimal Lagrange dual parameters, by strong duality $\text{OPT} = \sum_{t=1}^{T} \langle y_t^*, \frac{B}{T}\mathbf{1} \rangle$. Then, with $Z_t^* := \|y_t^*\|_1$ and $\theta_t^* = y_t^*/\|y_t^*\|_1$, we have
$$\sum_{t=1}^{T} \Big\langle y_t^*, \frac{B}{T}\mathbf{1} \Big\rangle = \sum_{t=1}^{T} Z_t^* \Big\langle \theta_t^*, \frac{B}{T}\mathbf{1} \Big\rangle = \sum_{t=1}^{T} Z_t^* \frac{B}{T} = B Z^*,$$
where $Z^* := \frac{1}{T}\sum_{t=1}^{T} Z_t^* = \frac{1}{T}\sum_{t=1}^{T} \|y_t^*\|_1$, so that $Z^* = \frac{\text{OPT}}{B}$. Thus, the optimal scaling constant $Z^*$ is simply the per-step average of the sums of optimal Lagrange multipliers $\|y_t^*\|_1$, and satisfies $Z^* = \text{OPT}/B$, so that an estimate of OPT will yield an estimate of $Z^*$.
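Before turning to how $Z$ is estimated online, here is a minimal sketch of a single GreedyStep round, putting Section 3.1 together: arms are scored with the current estimates, the no-op threshold is applied, and the dual $\theta$ is updated by exponentiated gradient, one standard instance of the OMD update in Algorithm 2. The step size and function names are illustrative assumptions.

```python
import numpy as np

def select_arm(X, mu_hat, W_hat, theta, Z, zeta):
    """Greedy primal step: score each arm with the estimated Lagrangian direction
    mu_hat - Z * W_hat * theta and fall back to the no-op arm (last, all-zeros
    column of X) when every score is below the -2*zeta threshold."""
    d = W_hat.shape[1]
    # The dummy time resource is not context dependent, so it does not enter the score.
    scores = X.T @ (mu_hat - Z * (W_hat @ theta[:d]))
    a = int(np.argmax(scores[:-1]))
    return a if scores[a] >= -2.0 * zeta else X.shape[1] - 1

def dual_update(theta, v, B, T, eta=0.1):
    """Entropic OMD (exponentiated gradient) ascent step on the linear dual objective
    g_t(theta) = <theta, v_tilde - (B/T) 1>, where v_tilde appends the dummy time
    resource that always consumes B/T."""
    g = np.append(v, B / T) - B / T
    theta = theta * np.exp(eta * g)
    return theta / theta.sum()

# One round, assuming an environment callback and current estimates are available:
# a = select_arm(X, mu_hat, W_hat, theta, Z, zeta)
# r, v = (0.0, np.zeros(d)) if a == K else env.pull(X[:, a])
# theta = dual_update(theta, v, B, T)
```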
In the online setting, the $y_t^*$ are not known and are estimated sequentially. As a result, GreedyLinCBwK will keep a running estimate of OPT, and hence of $Z^*$, based on (a) the current estimates $\hat{\mu}_t, \widehat{W}_t$ of $\mu_*, W_*$, and (b) the contexts $X_{1:t}$ observed till time step $t$.

Primal updates. The primal update would ideally be based on $\frac{\partial L(\pi, \theta)}{\partial \pi(X_t)}$, i.e., the gradient of the Lagrangian with respect to the policy. Since we can pull only one arm, i.e., a bandit step, the update has to be based on the partials with respect to $\pi(X_t)(a)$, $a \in [K]$, picking the arm with the largest gradient, i.e., a Frank-Wolfe, conditional gradient, or coordinate ascent type update (Frank & Wolfe, 1956). Note that the partial derivative is simply $\frac{\partial L(\pi,\theta)}{\partial \pi(X_t)(a)} = x_t(a)^\top(\mu_* - Z W_* \theta_t)$, and the arm chosen is the one which maximizes this.

Dual updates. Since the true consumption vector at step $t$ from pulling arm $a$ is $v_t(a)$, and since the primal is a maximization, the cumulative dual objective to be minimized is $\sum_{t=1}^{T} \langle \theta_t, -(v_t(a) - \frac{B}{T}\mathbf{1})\rangle$ over $\theta_t \in \Delta_{d+1}$. For convenience, we will equivalently consider maximizing $\sum_{t=1}^{T} \langle \theta_t, v_t(a) - \frac{B}{T}\mathbf{1}\rangle$. Since the problem needs to be solved sequentially, this is a simple instance of OCO with a linear objective and a simplex constraint, which can be solved by online mirror descent (OMD) or exponentiated gradient (EG) descent (Shalev-Shwartz et al., 2012) with sublinear regret.

Proposition 1 With $g_t(\theta) = \langle \theta, v_t(a) - \frac{B}{T}\mathbf{1}\rangle$, OMD with the constraint $\theta_t \in \Delta_{d+1}$ achieves the following regret:
$$R_D(T) = \max_{\theta \in \Delta_{d+1}} \sum_{t=1}^{T} g_t(\theta) - \sum_{t=1}^{T} g_t(\theta_t) = O\big(\beta\sqrt{T \log d}\big).$$

3.3. Main Result: Reward from GreedyLinCBwK

Theorem 1 lower bounds the reward obtained by the greedy algorithm. $T_{\text{stop}}$ is either $T$ or the first round in which some resource consumption exceeds $B$. The algorithm incurs additive regret $R(T)$ due to the use of the estimates $\hat{\mu}, \widehat{W}$ instead of $\mu_*, W_*$ to compute the reward and constraint objectives. The reward can be lower bounded by either of two quantities: the primal update ensures the reward is greater than $\text{OPT}(T_{\text{stop}}) - Z(\gamma_a(T_{\text{stop}}) - \gamma_\pi(T_{\text{stop}}))$, and the presence of the no-op arm ensures the reward is lower bounded by $Z \gamma_a(T_{\text{stop}})$.

Theorem 1 Let $T_{\text{stop}} \le T$ denote the stopping time of the algorithm. Let $\beta = \max_{t \in [T], a \in [K]} \|x_t(a)\|_2$ be the maximum $\ell_2$ norm of the context vectors. Then with probability at least $1 - 2\delta$,
$$\text{REW} \ge \max\big\{\text{OPT}(T_{\text{stop}}) - Z\big(\gamma_a(T_{\text{stop}}) - \gamma_\pi(T_{\text{stop}})\big),\; Z\gamma_a(T_{\text{stop}})\big\} - O(Z R(T)),$$
where
$$\text{REW} = \sum_{t=1}^{T_{\text{stop}}} \langle \mu_*, x_t(a_t)\rangle, \quad (12)$$
$$\text{OPT}(T_{\text{stop}}) = \sum_{t=1}^{T_{\text{stop}}} \langle \mu_*, X_t \pi_*(X_t)\rangle, \quad (13)$$
$$\gamma_\pi(T_{\text{stop}}) = \sum_{t=1}^{T_{\text{stop}}} \langle W_*^\top X_t \pi_*(X_t), \theta_t\rangle, \quad (14)$$
$$\gamma_a(T_{\text{stop}}) = \sum_{t=1}^{T_{\text{stop}}} \langle W_*^\top x_t(a_t), \theta_t\rangle, \quad (15)$$
and, with $\sigma^2$ being the Gaussian perturbation variance,
$$R(T) = O\Big(\big(w(A) + \sqrt{\log(Td/\delta)}\big)\, \frac{\beta \sqrt{T} \log K}{\sigma}\Big).$$

4. Smoothed Stochastic LinCBwK

In this section, we discuss the algorithm for stochastic LinCBwK. The algorithm has two phases. In the initial warm start phase of $T_0$ time steps, arms are chosen uniformly at random. The parameter $Z$ is estimated at the end of the warm start phase; we discuss the estimation of $Z$ further in Section 4.1. In the exploit phase, the greedy Algorithm 1 is run with the fixed $Z$ computed at the end of the warm start phase and a budget $B_1 = B - \beta T_0$. This is due to Lemma 1, which bounds the maximum consumption of any resource per round by $\beta$. The execution of the algorithm stops either after $T$ time steps or when the consumption of some resource exceeds $B$.
Our algorithm differs from (Agrawal & Devanur, 2016) in two aspects. First, it chooses arms uniformly at random in the warm start phase instead of the more involved exploration technique employed in (Agrawal & Devanur, 2016). Second, it computes greedy estimates of the reward and constraint objectives in each time step of the exploit phase instead of UCB estimates.

4.1. Estimation of Z

Recall the discussion in Section 3.2, where we establish $Z^* = \frac{\text{OPT}_1}{B}$, with $\text{OPT}_1$ the optimal static policy reward. To estimate $Z$ after the warm start phase, we estimate the optimal reward $\widehat{\text{OPT}}(T_0; \omega(T_0))$, where $\omega(T_0) = (X_t, r_t, v_t, a_t)_{t=1}^{T_0}$ are the observations in the warm start phase, and extrapolate it. Let $\hat{\mu}_{T_0}$ and $\widehat{W}_{T_0}$ denote the parameter estimates at the end of the warm start phase. Then
$$\widehat{\text{OPT}}(T_0; \omega(T_0)) := \max_{\pi} \sum_{t=1}^{T_0} \hat{\mu}_{T_0}^\top X_t \pi(X_t) \quad \text{s.t.} \quad \sum_{t=1}^{T_0} \widehat{W}_{T_0}^\top X_t \pi(X_t) \le \big(B_0 + R(T_0)\big)\mathbf{1},$$
$$\widehat{\text{OPT}}(T; \omega(T_0)) := \frac{T}{T_0}\, \widehat{\text{OPT}}(T_0; \omega(T_0)), \quad (17)$$
where $B_0 = (T_0/T) B$ and $R(T_0)$ is an upper bound on the error $\sum_{t=1}^{T_0} (W_* - \widehat{W}_{T_0})^\top X_t \pi(X_t)$ after the warm start phase. Note that $R(T_0) = O\big((w(A) + \sqrt{\log(T_0 d/\delta)})\, \beta\sqrt{T_0}\log K/\sigma\big)$ is the regret at the end of the warm start phase. We set $Z(T_0)$ as follows:
$$Z(T_0) = \frac{\widehat{\text{OPT}}(T; \omega(T_0)) + 2\frac{T}{T_0} R(T_0)}{B} + 1.$$

Algorithm 4 Smoothed Stochastic LinCBwK
  Input: budget $B \in \mathbb{R}_+$, total time steps $T$
  Initialize $S_1 = [\,]$, $y_1 = [\,]$, $Q_1 = [\,]$
  for $\tau = 1$ to $T_0$ do
    Select an arm $x_\tau(a_\tau)$ uniformly at random, observe the noisy reward $r_\tau(a_\tau)$ and consumption vector $v_\tau(a_\tau) \in \mathbb{R}^d$
    Obtain $S_{\tau+1}, y_{\tau+1}, Q_{\tau+1}$ by appending $x_\tau(a_\tau), r_\tau(a_\tau), v_\tau(a_\tau)$ to $S_\tau, y_\tau, Q_\tau$
  end for
  Let $B_1 = B - \beta T_0$. Compute a parameter $Z$ such that $\frac{\text{OPT}_1}{B_1} \le Z \le O\big(\frac{\text{OPT}_1}{B_1} + 1\big)$
  Run GreedyLinCBwK with budget $B_1$ from time $T_0 + 1$ to $T$

We revisit the following result from (Agrawal & Devanur, 2016) on error bounds for the estimator $\widehat{\text{OPT}}(T; \omega(T_0))$ with respect to the optimal reward $\text{OPT}_1$. The bounds account for both the extrapolation error and the use of $\hat{\mu}_{T_0}, \widehat{W}_{T_0}$ instead of the true parameters.

Lemma 4 [Lemma 5 in (Agrawal & Devanur, 2016)] For $R(T_0) = O\big((w(A) + \sqrt{\log(T_0 d/\delta)})\, \beta\sqrt{T_0}\log K/\sigma\big)$, with probability at least $1 - \delta$,
$$\widehat{\text{OPT}}(T; \omega(T_0)) \ge \text{OPT}_1 - 2\Big(\frac{T}{T_0}\Big) R(T_0), \qquad \widehat{\text{OPT}}(T; \omega(T_0)) \le \text{OPT}_1 + 9\Big(\frac{T}{T_0}\Big) R(T_0)\Big(\frac{\text{OPT}_1}{B} + 1\Big).$$
Set $Z(T_0) = \frac{\widehat{\text{OPT}}(T; \omega(T_0)) + 2 (T/T_0) R(T_0)}{B} + 1$. Then with probability $1 - O(\delta)$, $Z(T_0) \ge \frac{\text{OPT}_1}{B} + 1$. Moreover, if $B \ge \Omega\big(\frac{T}{\sqrt{T_0}} \frac{w(A) \log K}{\sigma}\big)$, then $Z(T_0) \le O\big(\frac{\text{OPT}_1}{B} + 1\big)$.

4.2. Regret Analysis

Let $T_{\text{stop}} \le T$ denote the stopping time of the algorithm, which is either $T$ or the first time the consumption of some resource exceeds $B$. The final regret is obtained by combining the results of Lemma 4 and Theorem 1. Since arms are chosen at random, we accrue linear regret in the warm start phase. Here $R_D(T)$ is the regret of the dual OCO algorithm from Proposition 1.

Theorem 2 Assume $B \ge \Omega\big(\max\big(T_0, \frac{T}{\sqrt{T_0}}\frac{w(A)\log K}{\sigma}\big)\big)$. Then, if $Z(T_0)$ is set as outlined in Lemma 4, we have $\frac{\text{OPT}_1}{B} + 1 \le Z(T_0) \le O\big(\frac{\text{OPT}_1}{B} + 1\big)$. Moreover, with probability at least $1 - 6\delta$, the regret of the greedy Algorithm 4 satisfies
$$\text{regret}(T) \le O\Big(\Big(\frac{\text{OPT}_1}{B} + 1\Big) \max\big(T_0, R(T)\big)\Big), \quad (21)$$
where $R(T) = O\big((w(A) + \sqrt{\log(Td/\delta)})\, \beta\sqrt{T}\log K/\sigma\big)$.

We wrap up with special cases of our results.

Corollary 1 Assume the variance of the Gaussian noise in the smoothed setting is $\sigma^2 = \Omega(1/m)$. Then with probability at least $1 - O(\delta)$: if $J(\cdot)$ is the $\ell_2$ norm, $T_0 = O(m^{2/3}\sqrt{T})$, and $B \ge \Omega(m^{2/3} T^{3/4})$, then $\text{regret}(T) \le O\big(\big(\frac{\text{OPT}_1}{B} + 1\big)\, m\sqrt{T}\big)$; if $J(\cdot) = \|\cdot\|_1$, $\mu_*$ and the columns of $W_*$ are $s$-sparse, $T_0 = O((m s \log m)^{1/3}\sqrt{T})$, and $B \ge \Omega((m s \log m)^{1/3} T^{3/4})$, then $\text{regret}(T) \le O\big(\big(\frac{\text{OPT}_1}{B} + 1\big)\sqrt{m s \log m}\,\sqrt{T}\big)$.
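As an illustration of the warm-start estimate in Section 4.1, the following sketch solves the empirical LP for $\widehat{\text{OPT}}(T_0; \omega(T_0))$ with the plug-in parameters and extrapolates it as in (17) to set $Z(T_0)$. The linprog-based solver, the way the slack $R(T_0)$ is passed in, and the exact plug-in formula for $Z(T_0)$ are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def estimate_Z(X_warm, mu_hat, W_hat, B, T, R_T0):
    """Estimate OPT from warm-start contexts with plug-in parameters, extrapolate
    to horizon T as in (17), and return an illustrative plug-in choice of Z(T_0)."""
    T0 = len(X_warm)
    K1 = X_warm[0].shape[1]
    d = W_hat.shape[1]

    # Empirical LP over per-round distributions: prorated budget B_0 = (T_0/T) B,
    # relaxed by the estimation error bound R(T_0).
    c = -np.concatenate([mu_hat @ X for X in X_warm])
    A_ub = np.hstack([W_hat.T @ X for X in X_warm])
    b_ub = np.full(d, (T0 / T) * B + R_T0)
    A_eq = np.kron(np.eye(T0), np.ones((1, K1)))   # each round's probabilities sum to 1
    b_eq = np.ones(T0)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    opt_T0 = -res.fun                              # OPT-hat(T_0; omega(T_0))
    opt_T = (T / T0) * opt_T0                      # extrapolation (17)
    Z_T0 = (opt_T + 2.0 * (T / T0) * R_T0) / B + 1.0
    return Z_T0, opt_T
```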
5. Smoothed Adversarial LinCBwK

In this section, we consider adversarial LinCBwK, where the base context vectors $\nu_t(a)$, $a \in [K]$, of the smoothed contexts $x_t(a) = \nu_t(a) + g_t(a)$ are chosen by an adaptive adversary. The basic ideas behind the algorithm remain the same as in stochastic LinCBwK, except for the estimation of the tradeoff parameter $Z$ which balances optimizing the reward and the consumption. But unlike the stochastic setting, it is not possible to estimate $\text{OPT}_2$, the reward of the optimal feasible static policy, without knowledge of the adversarially chosen contexts $\nu_t(a)$ over all time steps.

5.1. Estimation of Z

We address this challenge by adapting a variant of the standard doubling trick in online learning (Shalev-Shwartz et al., 2012). At any stage, the algorithm has an estimate of the value of OPT and, by extension, of $Z$. In each time step $t$, the realized cumulative reward is compared with the estimate of OPT; when the realized reward is double the current estimate of OPT, the algorithm updates the estimate of OPT and, with it, the estimate of $Z$. A related doubling trick adaptation is used in (Immorlica et al., 2019) for the adversarial multi-armed bandits with knapsacks problem, although the algorithmic details are different.

Similar to the stochastic setting, Algorithm 5 has a warm start phase with $T_0$ rounds. But unlike the stochastic setting, the warm start phase is only required to obtain good estimates of the reward and constraint parameters. The exploit phase in Algorithm 5 is further subdivided into multiple epochs. Let $T_1$ denote the start time step of an epoch, and let $\omega(T_1) = (X_t, r_t, v_t, a_t)_{t=1}^{T_1}$ denote the data observed until $T_1$. At the first step of the epoch, $\widehat{\text{OPT}}(T_1; \omega(T_1))$ and $Z(T_1)$ are computed from the data $\omega(T_1)$ as follows:
$$\widehat{\text{OPT}}(T_1; \omega(T_1)) := \max_{\pi} \sum_{t=1}^{T_1} \hat{\mu}_{T_1}^\top X_t \pi(X_t) \;\; \text{s.t.} \;\; \sum_{t=1}^{T_1} \widehat{W}_{T_1}^\top X_t \pi(X_t) \le \big(B + R(T_1)\big)\mathbf{1}, \qquad Z(T_1) := \frac{\widehat{\text{OPT}}(T_1; \omega(T_1))}{2 d B}. \quad (24)$$
The value of $\widehat{\text{OPT}}(T_2; \omega(T_2))$ is monitored at each time step $T_2 > T_1$ of the epoch until $\widehat{\text{OPT}}(T_2; \omega(T_2)) \ge 2\, \widehat{\text{OPT}}(T_1; \omega(T_1))$, at which point the epoch is terminated and a new epoch is started. Each epoch is allocated a budget $B_1 / (2\lceil \log T \rceil)$. In each time step of an epoch, GreedyStep (Algorithm 2) is run until a new epoch is started or the allocated budget for some resource is exhausted. In case of budget exhaustion, an arm is chosen uniformly at random with a fixed probability until a new epoch starts.

5.2. Regret Analysis

It is evident that the estimates of $Z$ improve as the epochs progress. We therefore focus on the last completed epoch and show that there is sufficient reward to be collected in this epoch. Moreover, we show that the greedy algorithm receives at least an $\Omega(1/(d \log T))$ fraction of the reward available in the final completed epoch. $\text{OPT}_2$ denotes the reward of the optimal static policy from Lemma 3.

Theorem 3 Assume $B \ge \Omega(T^{3/4})$. Then with probability at least $1 - 6\delta$, the reward obtained by the greedy algorithm is lower bounded as
$$\text{REW} \ge \frac{\text{OPT}_2}{16 d \lceil \log T\rceil} - O\Big(\Big(\frac{\text{OPT}_2}{B} + 1\Big) R(T)\Big),$$
where $R(T) = O\big((w(A) + \sqrt{\log(Td/\delta)})\, \beta\sqrt{T}\log K/\sigma\big)$.

Algorithm 5 Smoothed Adversarial LinCBwK
  Input: budget $B$, number of time steps $T$
  Initialize $S_1, y_1, Q_1 = [\,]$
  for $t = 1$ to $T_0$ do
    Select an arm $x_t(a_t)$ uniformly at random, observe the noisy reward $r_t(a_t)$ and consumption vector $v_t(a_t)$
    Get $S_{t+1}, y_{t+1}, Q_{t+1}$ by appending $x_t(a_t), r_t(a_t), v_t(a_t)$ to $S_t, y_t, Q_t$
  end for
  Let $B_1 = B - T_0$. Initialize the epoch index $j = \lfloor \log_2 \widehat{\text{OPT}}(T_0; \omega(T_0)) \rfloor$
  for each epoch do
    Let $T_1$ be the initial time step of the epoch. Estimate $\widehat{\text{OPT}}(T_1; \omega(T_1))$ and $Z(T_1)$ using (24)
    Compute the budget $B_0 = \frac{B_1}{2\lceil \log T\rceil}$
    Initialize $\theta_{T_1}$ using the uniform distribution
    for each time step $T_2$ in the epoch do
      Recompute $\widehat{\text{OPT}}(T_2; \omega(T_2))$ using (24)
      if $\widehat{\text{OPT}}(T_2; \omega(T_2)) \ge 2^{j+1}$ then
        Increment $j = j + 1$ and start a new epoch
      else if the consumption of every resource in this epoch is below $B_0$ then
        Run GreedyStep with $B = B_0$ and $Z = Z(T_1)$
      else
        Play an arm chosen uniformly at random with probability $T^{-1/4}$; otherwise play the no-op arm
      end if
    end for
  end for
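The epoch logic of Algorithm 5 can be summarized in a short controller sketch: it tracks the current power-of-two estimate of OPT, recomputes the plug-in estimate each step, and opens a new epoch (with budget $B_1/(2\lceil\log T\rceil)$ and a fresh $Z$ as in (24)) whenever the estimate doubles. The class name, the log base, and the structure are illustrative rather than a line-by-line rendering of Algorithm 5, and the $\widehat{\text{OPT}}$ values are assumed to come from an external LP solver as in the earlier sketches.

```python
import math

class DoublingEpochController:
    """Doubling-trick epoch controller for adversarial LinCBwK (Section 5.1):
    monitors the plug-in estimate OPT-hat, opens a new epoch when it doubles,
    and refreshes Z via (24)."""
    def __init__(self, opt_hat_T0, B, B1, T, d):
        self.j = int(math.floor(math.log2(max(opt_hat_T0, 1.0))))   # current epoch index
        self.B, self.B1, self.T, self.d = B, B1, T, d
        self.epoch_budget = B1 / (2.0 * math.ceil(math.log(T)))     # per-epoch budget
        self.Z = None
        self.spent = [0.0] * d

    def start_epoch(self, opt_hat_T1):
        self.Z = opt_hat_T1 / (2.0 * self.d * self.B)               # Z(T_1) as in (24)
        self.spent = [0.0] * self.d

    def step(self, opt_hat_T2, last_consumption):
        """Decide the action for this round: open a new epoch, run GreedyStep,
        or (after epoch-budget exhaustion) explore uniformly w.p. T^{-1/4}."""
        if opt_hat_T2 >= 2.0 ** (self.j + 1):                       # estimate has doubled
            self.j += 1
            return "new_epoch"
        self.spent = [s + c for s, c in zip(self.spent, last_consumption)]
        if all(s < self.epoch_budget for s in self.spent):
            return "greedy"
        return "random_or_noop"
```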
(Immorlica et al., 2019) obtain an $O(d\lceil\log T\rceil)$ competitive ratio for adversarial multi-armed bandits (MAB) with knapsacks and show it to be optimal. We match this bound for the linear contextual bandit setting with smoothed contexts, with better additive regret rates. Note that (Immorlica et al., 2019) consider oblivious adversaries, whereas we consider adaptive adversaries in the smoothed setting. Further, the additive regret in (Immorlica et al., 2019) is $O(KT^{7/4}/B)$, so for their regret bounds to be meaningful the condition $\frac{B}{K}\,\text{OPT} \ge \Omega(T^{7/4})$ is required. Our setup requires much milder assumptions. We wrap up with special cases of our result.

Corollary 2 Assume the variance of the Gaussian noise in the smoothed setting is $\sigma^2 = \Omega(1/m)$. Let OPT be the optimal reward and $\kappa \le O(d\lceil\log T\rceil)$. Then with probability at least $1 - O(\delta)$: if $J(\cdot) = \|\cdot\|_2$ and $B \ge \Omega(T^{3/4})$, then $\text{REW} \ge \frac{\text{OPT}}{\kappa} - O\big(\big(\frac{\text{OPT}}{B} + 1\big) m\sqrt{T}\big)$; if $J(\cdot) = \|\cdot\|_1$, $\mu_*$ and the columns of $W_*$ are $s$-sparse, and $B \ge \Omega(T^{3/4})$, then $\text{REW} \ge \frac{\text{OPT}}{\kappa} - O\big(\big(\frac{\text{OPT}}{B} + 1\big)\sqrt{m s \log m}\,\sqrt{T}\big)$.

6. Conclusion

We considered the linear contextual bandits with knapsacks (LinCBwK) problem in the smoothed setting, where the contexts chosen by nature, stochastically or adversarially, are perturbed by Gaussian noise. Such smoothed analysis, especially in the adversarial setting, is a mechanism for softening worst-case analysis, and prior work has illustrated that simpler algorithms often work when the worst case is avoided. Our results in this paper continue this tradition. As the first work to analyze smoothed adversarial LinCBwK with adaptive adversaries, we show that a greedy strategy for choosing arms combined with a doubling trick yields a suitable algorithm for the setting. Scope for future work includes more flexible models for the rewards as well as the budgets, including dynamic budgets.

Acknowledgements. We would like to thank the reviewers for valuable comments and questions. The research was supported by NSF grants IIS 21-31335, OAC 21-30835, DBI 20-21898, and a C3.ai research award.

References

Abernethy, J., Bartlett, P. L., and Hazan, E. Blackwell approachability and no-regret learning are equivalent. In Conference on Learning Theory, pp. 27-46. PMLR, 2011.

Agarwal, S., Wang, Z., and Ye, Y. A dynamic near-optimal algorithm for online linear programming. Operations Research, 62(4):876-890, 2014.

Agrawal, S. and Devanur, N. Linear contextual bandits with knapsacks. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Agrawal, S. and Devanur, N. R. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pp. 989-1006, 2014a.

Agrawal, S. and Devanur, N. R. Fast algorithms for online stochastic convex programming. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1405-1424. SIAM, 2014b.

Agrawal, S., Devanur, N. R., and Li, L. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pp. 4-18. PMLR, 2016.

Babaioff, M., Dughmi, S., Kleinberg, R. D., and Slivkins, A. Dynamic pricing with limited supply. In ACM Transactions on Economics and Computation, volume 3, 2015.

Badanidiyuru, A., Kleinberg, R., and Singer, Y. Dynamic pricing with limited supply.
In ACM Transactions on Economics and Computation, 2012. Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 207 216. IEEE, 2013. Badanidiyuru, A., Langford, J., and Slivkins, A. Resourceful contextual bandits. In Conference on Learning Theory, pp. 1109 1134. PMLR, 2014. Banerjee, A., Chen, S., Fazayeli, F., and Sivakumar, V. Estimation with Norm Regularization. In Neural Information Processing Systems (NIPS), 2014. Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. ar Xiv:1704.09011, 2018. Bertsimas, D. and Tsitsiklis, J. Introduction to Linear Optimization. Athena Scientific, 1997. Besbes, O. and Zeevi, A. J. Blind network revenue management. Operations Research, 60(6):1537 1550, 2012. Bietti, A., Agarwal, A., and Langford, J. Practical evaluation and optimization of contextual bandit algorithms. Co RR ar Xiv:1802.04064, 2018. Bird, S., Barocas, S., Crawford, K., Diaz, F., and Wallach, H. Exploring or exploiting? social and ethical implications of automonous experimentation. In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016. Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2005. Buchbinder, N. and Naor, J. S. Online primal-dual algorithms for covering and packing. Mathematics of Operations Research, 34(2), 2009. Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805 849, 2012. Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011. Combes, R., Jiang, C., and Srikant, R. Bandits with budgets: Regret lower bounds and optimal algorithms. ACM SIGMETRICS Performance Evaluation Review, 43(1): 245 257, 2015. Devanur, N. R. and Hayes, T. P. The adwords problem: Online keyword matching with budgeted bidders under random permutations. In ACM Conference on Electronic Commerce (ACM-EC), 2009. Devanur, N. R., Jain, K., Sivan, B., and Wilkens, C. A. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In ACM Conference on Electronic Commerce (ACM-EC), 2011. Feldman, J., Henzinger, M., Korula, N., Mirrokni, V. S., and Stein, C. Online stochastic packing applied to display ad allocation. In European Symposium on Algorithms (ESA), 2010. Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics, 3, 1956. Immorlica, N., Sankararaman, K. A., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 202 219. IEEE, 2019. Kannan, S., Morgenstern, J., Roth, A., Waggoner, B., and Wu, Z. S. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Co RR ar Xiv:1801.04323, 2018. Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS), 2007. Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press (To appear), 2019. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302 1338, 2000. Li, L., Chu, W., Langford, J., and Schapire, R. E. 
A contextual-bandit approach to personalized news article recommendation. In International World Wide Web Conference (WWW), 2010. Mehta, A. Online matching and ad allocation. Foundations and Trends in Machine Learning, 8(4):265 368, 2013. Mehta, A., Saberi, A., Vazirani, U., and Vazirani, V. Adwords and generalized online matching. Journal of the ACM, 54(5), 2007. Molinaro, M. and Ravi, R. Geometry of online packing linear problems. In International Colloquium on Automata, Languages and Programming (ICALP), 2012. Smoothed Linear Contextual Bandits with Knapsacks Raghavan, M., Slivkins, A., Vaughan, J. W., and Wu, Z. S. The externalities of exploration and how data diversity helps exploitation. In Conference on Learning Theory (COLT), pp. 1724 1738, 2018. Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107 194, 2012. Singla, A. and Krause, A. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In International World Wide Web Conference (WWW), 2013. Sivakumar, V., Wu, Z. S., and Banerjee, A. Structured linear contextual bandits: A sharp and geometric smoothed analysis. ar Xiv:2002.11332v1, 2020. Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, 2021. Talagrand, M. The Generic Chaining. Springer Monographs in Mathematics. Springer Berlin, 2005. Talagrand, M. Upper and Lower Bounds of Stochastic Processes. Springer, 2014. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1): 267 288, 1996. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. Wainwright, M. High-Dimensional Statistics: A Non Asymptotic Viewpoint. Cambridge University Press (To appear), 2019. Williamson, D. P. and Shmoys, D. B. The Design of Approximation Algorithms. Cambridge University Press, 2011. Smoothed Linear Contextual Bandits with Knapsacks A. Concentration Inequalities We restate the following Hoeffding-type result for dependent variables from (Sivakumar et al., 2020). Lemma 5 (Lemma 11 from (Sivakumar et al., 2020)) Let t Ztu be a sub-Gaussian martingale difference sequence (MDS) and let z1:t denote a realization of Z1:t. Let tatu be a sequence of random variables such that at ftpz1:pt 1qq for some sequence function ft with |at| ď αt a.s. for suitable constants α1, . . . , αT . Then, for any τ ą 0, we have 4cκ2 řT t 1 α2 t for absolute constants c ą 0 and where κ is the ψ2-norm of the conditional sub Gaussian random variables. Corollary 3 Let z1:T denote denote a realization of a sub-Gaussian martingale difference sequence such that Er Zi|z1:i 1s 0 and }Zi}ψ2 ď κ1. Then using τ a 4cκ2 1T logp1{δq Op a T logp1{δqq, 4cκ2 1T logp1{δq B. Benchmark and Reward, Consumption Bounds Lemma 1: Let β argmax t Pr T s,a Pr Ks }xtpaq}2. Then with probability atleast 1 δ, β ď O 1 σ a m logp TK{δq . (28) Proof: We have xtpaq νtpaq gtpaq where νtpaq P Bm 2 and gtpaq Np0, σ2Imˆmq. Therefore we have the following, }xtpaq}2 ď }νtpaq}2 }gtpaq}2 ď 1 }gtpaq}2 . (29) To obtain bounds on }gtpaq}2, note that Er}gtpaq}2s ď a Er}gtpaq}2 2s σ?m. Moreover we have the following result. Lemma 6 (Lemma 1 from Laurent and Massart (Laurent & Massart, 2000)) Support x χ2 m, i.e., x řm i 1 g2 i for gi P Np0, 1q independently, then P x ě m 2?mϵ 2ϵ ď expp ϵq . 
(30) Therefore from the above lemma the following can be deduced, Pr}gtpaq}2 ě 5σ?mϵs ď expp ϵq . Taking the max over all time steps by union bound argument, P max t Pr T s,a Pr Ks}gtpaq}2 ě 5σ?mϵ ď TK expp ϵq . Now choosing ϵ logp TK{δq and by simple arithmatic manipulations and combining with (29) we get the stated result. Lemma 2: Let Ę OPT1 : Eφp T qrřT t 1 µ Xtp t pφp Tqqs denote the value of the optimal feasible dynamic policy p which knows parameters µ , W and distribution D. Here p t pφp Tqq denotes the distribution over arms in the t-th time step. Then, the optimal static policy π in (8) satisfies: Trpπ q ě Ę OPT1 and Tvpπ q ď B1. Proof: Given the dynamic policy p , construct the following static context dependent policy π as follows. t 1 Eφp T qrp t pφp Tqq|Xs . Smoothed Linear Contextual Bandits with Knapsacks Now by definition, Trpπ q TEX Drr π p X; Dqs Eφp T qr t 1 r p t pφp Tqqs Ę OPT1 Tvpπ q TEX Dr Vtπ p X; Dqs Eφp T qr t 1 Vtp t pφp Tqqs ď B1 . Lemma 3: Let Ę OPT2 : řT t 1 µ Xtp t pφp Tqq and OPT : řT t 1 µ Xtπ p Xt; φp Tqq denote the value of the optimal feasible dynamic policy and the optimal static policy which has knowledge of the parameters µ , W . Then OPT ě Ę OPT2. Proof: For each context set X observed in φp Tq, construct the following static context dependent policy, π p Xq : avgrp t pφp Tqq|Xs . It therefore follows by definition that Ę OPT2 : OPT2 and řT t 1 W Xtπ p Xt; φp Tqq řT t 1 W Xtp t pφp Tqq ď 1B. C. Reward and Constraint Parameter Estimation We prove upper bounds on parameter estimation errors in each round for the reward and constraint parameters. To be concise we will directly reference results from (Sivakumar et al., 2020) wherever possible. C.1. Covariance Matrix Minimum Eigenvalue Theorem 4 Consider any time step t in Algorithm 1 when the no-op arm is not chosen. Let xtpatq be the context corresponding to the chosen arm. The following is a lower bound on the minimum eigenvalue of the covariance matrix, λmin Extpatq rxtpatqxtpatq s ě c1 σ2 Proof: Let xtpatq νtpatq gtpatq where at : argmax a Pr Ks xtpaq T pˆµt Z ˆWtθtq. Set φt ˆµt Z ˆWtθt. By definition, xtpatqxtpatq ˇˇˇˇˇ xtpatq argmax xtpaq xxtpaq, φty min w:}w}2 1w xtpatqxtpatq ˇˇˇˇˇ xtpatq argmax xtpaq xxtpaq, φty min w:}w}2 1 w xtpatqxtpatq w ˇˇˇˇˇ xtpatq argmax xtpaq xxtpaq, φty ě min w:}w}2 1Var xw, νtpatq gtpatqy ˇˇˇˇˇ xtpatq argmax xtpaq xxtpaq, φty ě min w:}w}2 1Var xw, gtpatqy ˇˇˇˇˇ gtpatq argmax gtpaq xνtpaq gtpaq, φty Let Q P Rmˆm be an orthogonal matrix such that Qφt p}φt}2, 0, . . . , 0q. Also let pgtpa1q, . . . , gtpa Kqq p Q ϵtpa1q, . . . , Q ϵtpa Kqq. Due to rotational invariance property of multivariate Gaussian distributions ϵtpaiq Smoothed Linear Contextual Bandits with Knapsacks Np0, σ2Imˆmq. The r.h.s. in (32) evaluates to the following, min w:}w}2 1Var xgtpatq, wy ˇˇˇˇˇ gtpatq argmax gtpaq xνtpaq gtpaq, φty min w:}w}2 1Var x Qgtpatq, Qwy ˇˇˇˇˇ gtpatq argmax gtpaq x Qνtpaq Qgtpaq, Qφty min w:}w}2 1Var xϵtpatq, wy ˇˇˇˇˇ ϵtpatq argmax ϵtpaq x Qνtpaq ϵtpaq, Qφty min w:}w}2 1Var xϵtpatq, wy ˇˇˇˇˇ ϵtpatq argmax ϵtpaq p Qνtpaq ϵtpaqq1 min w:}w}2 1 w2 1Varpϵtpatq1q ˇˇˇˇˇ ϵtpatq argmax ϵtpaq p Qνtpaq ϵtpaqq1 j 1 w2 j Varpϵtpatqjq ˇˇˇˇˇ ϵtpatq argmax ϵtpaq p Qνtpaq ϵtpaqq1 Note that in the above, p Qνtpaq ϵtpaqq1 denotes the first coordinate of the vector Qνtpaq ϵtpaq. The below lemma establishes lower bounds on Varpϵtpatq1q ˇˇˇˇˇ ϵtpatq argmax ϵtpaq p Qνtpaq ϵtpaqq1 Lemma 7 Let g1, . . . , g K be independent Gaussian random variables sampled from Np0, σ2q. Let b1, . . . 
, b K denote K random real numbers. Then, ˇˇˇˇˇ gi argmax gj pgj bjq ˇˇˇˇˇ gi argmax gj gj log K , (34) for some positive constant c1. We first prove the inequality p Varpgiq | gi argmax gj pgj bjqq ě p Varpgiq | gi argmax gj gjq. Let gp1q, , gp Kq denote the order statistics of the Gaussian random variables such that gp1q ě ě gp Kq. Now there are K! permutations of the sum bi gpjq, 1 ď i, j ď K. Assume we have K buckets A1, , AK and we partition the K! permutations into the K buckets such that A1 Y Y AK contains all the K! permutations and Ak X Ak1 φ, 1 ď k, k1 ď K, k k1. Consider a single permutation when bi gpjq has the highest value for some 1 ď i, j ď K. Assume ties are broken randomly with equal probability. For example, if bi gpjq, bi1 gpj1q, have equal values either is selected at random with equal probability. Case 1: If j K we assign the permutation to the bucket AK. Case 2: If case 1 is not satisfied, consider indices pi1, i2, j1q such that the following conditions are satisfied: 1. 1 ď i1, i2, j1 ă K, i1 i2, j ď j1 2. bi gpjq ě bi1 gpj1q and bi gpjq ą bi2 gpj1 1q in the current permutation 3. bi gpj1q ą bi1 gpjq in the permutation derived from the current permutation by swapping gpjq, gpj1q but bi gpj1 1q ă bi2 gpjq in the permutation obtained by swapping gpjq, gpj1 1q Condition 2, basically, is the assumption that bi gpjq has the highest value for the current permutation. Condition 3 finds an index j1 ą j such in spite of swapping gpj1q and gpjq to obtain a new permutation, bi gpj1q has the highest value, but if Smoothed Linear Contextual Bandits with Knapsacks gpj1 1q and gpjq are swapped bi gpj1 1q no longer has the highest value. Note that one possible assignment is i1 i and j1 j In this case we assign the permutation to bucket Aj1. Now by construction, the buckets partition the K! permutations, i.e., Ak X Ak1 φ, k k1. Also in bucket Aj, gpj1q s are present in equal proportion @ j1 ď j. This is because of the following reason. Bucket Aj has no permutation with bi1 gpj1q, j1 ą j with highest value due to the construction above. Let bi gpjq have the highest value for a permutation from bucket j. Then since gpj1q ą gpjq, @j1 ă j, the permutation obtained by swapping gpjq, gpj1q will have bi gpj1q as the highest value and all the permutations are equally probable. p Varpgiq | gi argmax gj pgj bjqq i 1 p Varpgiq | gi ě gpjqq Pp Ajq , (35) where Pp Ajq is the proportion of the K! permutations in bucket j. Finally, we observe that p V arpgiq | gi ě gpjqq has the lowest value when j 1 and hence řK i 1p Varpgiq | gi ě gpjqq Pp Ajq has the lowest value when Pp A1q 1 and Pp Ajq 0, j 1, thus proving p Varpgiq | gi argmax gj pgj bjqq ě p Varpgiq | gi argmax gj gjq . (36) C.2. Reward and Constraint Parameters Estimation Error Bounds The following result is borrowed from (Sivakumar et al., 2020). Note that the result in (Sivakumar et al., 2020) was proved for the setting when the base vectors are chosen adversarially and by extension are also applicable to the simpler setting when the base vectors are chosen stochastically. Theorem 5 [Theorem 7 in (Sivakumar et al., 2020)] Consider any time step t after the warm start phase. Let Teffptq denote the number of time steps before t when the no-op arm was not chosen. Assume the rows of the design matrix in time step t satisfy the following covariance minimum eigenvalue condition. λmin Extpatq rxtpatqxtpatq s ě c1 σ2 log K . 
(37) Then with probability atleast 1 δ{T following is the estimation error bounds for the reward and constraint parameters. }ˆµt µ }2 ď O logp Td{δq log K max 1ďjďd }p Wtq:j p W q:j}2 ď O logp Td{δq log K Proof: Theorem 4 proves the covariance minimum eigenvalue condition. The parameter estimation error upper bounds can be obtained using arguments from Theorem 7 in (Sivakumar et al., 2020) with slight modifications. D. Greedy Algorithm Subroutine Analysis We provide proof for Theorem 1. Smoothed Linear Contextual Bandits with Knapsacks Theorem 1 Let Tstop ď T denote the stopping time of the algorithm. Let β maxt Pr T s,a Pr Ks }xtpaq}2 be the maximum ℓ2 norm of context vectors. Then with probability at least 1 2δ , REW ě max OPTp Tstopq Zpγap Tstopq γπp Tstopqq, Zγap Tstopq Op ZRp Tqq , t 1 xµ , xtpatqy (12) OPTp Tstopq t 1 xµ , Xtπ p Xtqy (13) t 1 x W Xtπ p Xtq, θty (14) t 1 x W xtpatq, θty . (15) and, with σ2 being the Gaussian perturbation variance, logp Td{δq β ? Proof: We first revise the notations that will be used in the proof. Let Teff Teffp Tstopq be the effective number of time steps where any arm other than no-op is chosen after T time steps. The arm chosen by the algorithm at time step t satisfies one of the following two conditions. 1. The no-op arm is chosen by the algorithm when xtpa1 tq ppµt Zx Wtθtq ă 2ζt with a1 t argmax a Pr Ks xtpaq ppµt 2. Arm at is chosen when xtpatq ppµt Zx Wtθtq ě 2ζt, where ζt O ˆ wp Aq ? logp T d{δq Zβ log K Let Υno denote the set of time steps when condition 1 is triggered and Υa denotes the set of time steps when condition 2 is triggered. The size of the sets |Υno| Tstop Teff and |Υa| Teff. Assume Algorithm 1 has knowledge of µ , W . Then in time step t arm xtpa t q will be chosen and both of the below conditions are true, t 1 pxµ , xtpa t qy Zx W xtpa t q, θtyq ě 0 t 1 pxµ , xtpa t qy Zx W xtpa t q, θtyq ě t 1 pxµ , Xtπ p Xtqy Zx W Xtπ p Xtq, θtyq . (40) The first line is because we have the no-op option which has 0 rewards and 0 consumption. Note that since we have knowledge of µ , W we no longer need to consider the negative threshold ζt which was to account for parameter estimation errors. The below lemma shows that with high probability when the algorithm chooses the no-op arm in rounds in Υno the actual arm chosen with foreknowledge of µ , W will also be the no-op arm. Smoothed Linear Contextual Bandits with Knapsacks Lemma 8 Denote optimal arm in each round by a t which is the arm that would have been chosen if the true parameters µ , W were known and optimal arm other than no-op in each time step by a 1 t : argmax a Pr Ks xtpaq pµ ZW θtq. Let a1 t : argmax a Pr Ks xtpaq ppµt Zx Wtθtq be the optimal arm without considering no-op chosen in the Algorithm. Then for any round t in Υno with ζt O ˆ wp Aq ? logp T d{δq Zβ log K xtpa1 tq ppµt Zx Wtθtq ă 2ζt . (41) Then with probability atleast 1 2δ T , a t is also the no-op arm. Proof: For the algorithm with knowledge of µ , W to choose the no-op arm we have to prove xµ , xtpa 1 t qy Zx W xtpa 1 t q, θty ă 0. 
xµ , xtpa 1 t qy Zx W xtpa 1 t q, θty xµ pµt, xtpa 1 t qy Zxp W x Wtq xtpa 1 t q, θty xpµt, xtpa 1 t qy Zxx W t xtpa 1 t q, θty ă xµ pµt, xtpa 1 t qy Zxp W x Wtq xtpa 1 t q, θty xpµt, xtpa1 tqy Zxx W t xtpa1 tq, θty looooooooooooooooooomooooooooooooooooooon ă }xtpa 1 t q}2}µ pµt}2 loooooooooooomoooooooooooon Z }xtpa 1 t q}2}W x Wt}8}θt}1 looooooooooooooooomooooooooooooooooon ă 0 with probability 1 2δ In the second line we use xpµt, xtpa 1 t qy Zxx W t xtpa 1 t q, θty ď xpµt, xtpa1 tqy Zxx W t xtpa1 tq, θty by definition. Also since the no-op arm was chosen we have xpµt, xtpa1 tqy Zxx W t xtpa1 tq, θty ă 2ζt. In the third line we use }xtpa t q}2 ď β from Lemma 1 and use estimation error bounds from Theorem 5. Now it can be deduced from Lemma 8 using a union bound argument over rounds in Υno when the no-op arm is chosen with high probability the optimal arm is also the no-op arm. Therefore with probability 1 2δp Tstop Teffq t PΥno xµ , xpatqy ÿ t PΥno xµ , xpa t qy ÿ t PΥno x W xpatq, θty ÿ t PΥno x W xpa t q, θty 0 . (42) The following is true for time steps in set Υa due to the greedy choice made in Algorithm 2. xpµt, xtpatqy Zxx W t xtpatq, θty ě ÿ xpµt, xtpa t qy Zxx W t xtpa t q, θty t PΥa pxµ , xtpatqy Zx W xtpatq, θtyq ÿ t PΥa xpµt µ , xtpatq xtpa t qy looooooooooooooooooomooooooooooooooooooon ďRµp Tstopq t PΥa xpx Wt W q pxtpatq xtpa t qq, θty looooooooooooooooooooooooomooooooooooooooooooooooooon ďRW p Tstopq t PΥa pxµ , xtpa t qy Zx W xtpa t q, θtyq (44) Smoothed Linear Contextual Bandits with Knapsacks Combining results (40), (42) and (44) leads to the following with probability atleast 1 2δp Tstop Teffq t 1 xµ , xtpatqy loooooooomoooooooon t 1 x W xtpatq, θty loooooooooooomoooooooooooon t PΥa xpµt µ , xtpatq xtpa t qy looooooooooooooooooomooooooooooooooooooon ďRµp Tstopq t PΥa xp W x Wtq pxtpatq xtpa t qq, θty looooooooooooooooooooooooomooooooooooooooooooooooooon ďRW p Tstopq t 1 xµ , xtpatqy loooooooomoooooooon t 1 xµ , Xtπ p Xtqyy looooooooooomooooooooooon OPTp Tstopq t PΥa xpµt µ , xtpatq xtpa t qy looooooooooooooooooomooooooooooooooooooon ďRµp Tstopq t PΥa xp W x Wtq pxtpatq xtpa t qq, θty looooooooooooooooooooooooomooooooooooooooooooooooooon ďRW p Tstopq t 1 x W xtpatq, θty looooooooooomooooooooooon t 1 x W Xtπ p Xtq, θty loooooooooooooomoooooooooooooon The below lemma upper bounds the regret Rµp Tstopq, RW p Tstopq. Lemma 9 Let, logp Td{δq β ? where β max t Pr T s,a Pr Ks }xtpaq}2 ď O 1 σ a m logp TK{δq . With probability atleast 1 2δTeff Rµp Tstopq ď Rp Tq RW p Tstopq ď Rp Tq . (47) Proof: Let Υa denote the set of time steps when any arm other than no-op was chosen. Then we have the following, Rµp Tstopq ď 2β ÿ t PΥa }µ pµt}2 logp Td{δq log K Teffptq ....... with prob. ě 1 δTeff T , using Theorem 5 and union bound logp Td{δq a Teffptq log K σ ď Rp Tq , (48) where c is a positive constant and in line 2 we use the reward estimation error rates from Theorem 5 together with a union bounding argument. Similarly for bounding the constraint regret consider the following, RW p Tstopq ď 2β ÿ t PΥa }x Wt W }8}θ}1 t PΥa max 1ďjďd }px Wtq:j p W q:j}2 .... since }θ}1 ď 1 logp Td{δq log K Teffptq ....... with prob. ě 1 δTeff T , Theorem 5 and union bound logp Td{δq a Teffptq log K σ ď Rp Tq . (49) Now from equation (45) and the result of Lemma 9 we get the advertised result with the observation that Tstop ď T. Smoothed Linear Contextual Bandits with Knapsacks E. 
E. Regret for Smoothed Stochastic Linear Bandits with Knapsacks

Lemma 4 [Lemma 5 in (Agrawal & Devanur, 2016)] For $R(T_0) = O\Big(\frac{w(A)\sqrt{\log(T_0 d/\delta)}\,\beta\sqrt{T_0}\log K}{\sigma}\Big)$, with probability at least $1 - \delta$,
$$\mathrm{OPT}_1 - 2\,\frac{T}{T_0}R(T_0) \ \le\ \widehat{\mathrm{OPT}}(T;\omega(T_0)) \ \le\ \mathrm{OPT}_1 + 9\,\frac{T}{T_0}R(T_0)\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big).$$
Then with probability $1 - O(\delta)$, $Z(T_0) \ge \frac{\mathrm{OPT}_1}{B}$. Moreover, if $B \ge \Omega\Big(\frac{T}{\sqrt{T_0}}\cdot\frac{w(A)\log K}{\sigma}\Big)$ then $Z(T_0) \le O\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big)$.

Proof: We borrow from the proof of Lemma 5 in (Agrawal & Devanur, 2016). The only difference compared to (Agrawal & Devanur, 2016) is the step that bounds the empirical rewards and consumptions by their expectations, since with Gaussian perturbations these quantities are no longer bounded. We have to prove the following:
$$\left|\sum_{t=1}^{T_0} \big(\langle \mu_*, X_t\pi(X_t)\rangle - \mathbb{E}_X[\langle \mu_*, X_t\pi(X_t)\rangle]\big)\right| \le R(T_0), \qquad \left|\sum_{t=1}^{T_0} \big(\langle (W_*)_{:j}, X_t\pi(X_t)\rangle - \mathbb{E}_X[\langle (W_*)_{:j}, X_t\pi(X_t)\rangle]\big)\right| \le R(T_0). \quad (50)$$
This is equivalent to bounding $\big|\sum_{t=1}^{T_0} \langle \mu_*, X_t\pi(X_t) - \mathbb{E}_X[X_t\pi(X_t)]\rangle\big| = \big|\sum_{t=1}^{T_0} z_t\big|$, where $z_t = \langle \mu_*, X_t\pi(X_t) - \mathbb{E}_X[X_t\pi(X_t)]\rangle$. Note that the $z_t$ are (conditionally on the history) sub-Gaussian with sub-Gaussian norm $c_2(1+\sigma) = O(1)$, because the base context vectors lie in $B^m_2$ and $\sigma^2$ is the variance of the Gaussian perturbations. Therefore, applying Corollary 3,
$$\mathbb{P}\left(\left|\sum_{t=1}^{T_0} \langle \mu_*, X_t\pi(X_t) - \mathbb{E}_X[X_t\pi(X_t)]\rangle\right| \ge \sqrt{4c_2^2(1+\sigma)^2\,T_0\log(T_0 d/\delta)}\right) \le \frac{\delta}{T_0 d}, \quad (51)$$
$$\mathbb{P}\left(\left|\sum_{t=1}^{T_0} \langle (W_*)_{:j}, X_t\pi(X_t) - \mathbb{E}_X[X_t\pi(X_t)]\rangle\right| \ge \sqrt{4c_2^2(1+\sigma)^2\,T_0\log(T_0 d/\delta)}\right) \le \frac{\delta}{T_0 d}. \quad (52)$$
Taking a union bound over all $1 \le j \le d$, and observing that $\sqrt{4c_2^2(1+\sigma)^2\,T_0\log(T_0 d/\delta)} \le R(T_0)$, we obtain (50) with probability at least $1 - \delta/T_0$. The rest of the proof follows Lemma 5 in (Agrawal & Devanur, 2016) and Lemmas F.4, F.6 in (Agrawal & Devanur, 2014a).

We now provide the proof of the final regret bound for the greedy algorithm in the stochastic setting.

Theorem 2 Assume $B \ge \Omega\Big(\max\Big\{T_0,\ \frac{T}{\sqrt{T_0}}\cdot\frac{w(A)\log K}{\sigma}\Big\}\Big)$. Then if $Z(T_0)$ is set as outlined in Lemma 4, we have
$$\frac{\mathrm{OPT}_1}{B} \ \le\ Z(T_0) \ \le\ O\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big).$$
Moreover, with probability at least $1 - 6\delta$, the regret of the greedy Algorithm 4 satisfies
$$\mathrm{regret}(T) \ \le\ O\Big(\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big)\max\big(T_0, R(T)\big)\Big), \quad (21)$$
where $R(T) = O\Big(\frac{w(A)\sqrt{\log(Td/\delta)}\,\beta\sqrt{T}\log K}{\sigma}\Big)$.

Proof: The conditions on $B$ and $Z(T_0)$ are a consequence of Lemma 4. We now focus on bounding the regret. We accrue at most $O(\beta T_0)$ regret during the warm-start phase. To characterize the regret during the exploit phase, we start with the lower bound on REW from Theorem 1. With probability at least $1 - 2\delta$,
$$\mathrm{REW} \ \ge\ \mathrm{OPT}(T_{\mathrm{stop}}) + Z(T_0)\gamma_a(T_{\mathrm{stop}}) - Z(T_0)\gamma_\pi(T_{\mathrm{stop}}) - O\big(Z(T_0)R(T)\big), \quad (53)$$
where
$$\mathrm{REW} = \sum_{t=1}^{T_{\mathrm{stop}}} \langle \mu_*, x_t(a_t)\rangle, \quad (54)$$
$$\mathrm{OPT}(T_{\mathrm{stop}}) = \sum_{t=1}^{T_{\mathrm{stop}}} \langle \mu_*, X_t\pi^*(X_t)\rangle, \quad (55)$$
$$\gamma_\pi(T_{\mathrm{stop}}) = \sum_{t=1}^{T_{\mathrm{stop}}} \langle W_*^\top X_t\pi^*(X_t), \theta_t\rangle, \quad (56)$$
$$\gamma_a(T_{\mathrm{stop}}) = \sum_{t=1}^{T_{\mathrm{stop}}} \langle W_*^\top x_t(a_t), \theta_t\rangle. \quad (57)$$
For stochastic Lin CBw K, $t = 1$ refers to the first round after the warm-start phase, i.e., round $T_0 + 1$, and if the algorithm stops at time step $T'$ then $T_{\mathrm{stop}} = T' - T_0$.

Taking expectations with respect to $X$ on both sides, with probability at least $1 - 2\delta$,
$$\mathbb{E}_X[\mathrm{REW}] \ \ge\ \mathbb{E}_X[\mathrm{OPT}(T_{\mathrm{stop}})] + Z(T_0)\,\mathbb{E}_X[\gamma_a(T_{\mathrm{stop}}) - \gamma_\pi(T_{\mathrm{stop}})] - O\big(Z(T_0)R(T)\big). \quad (58)$$
Also, by the definition of the optimal reward,
$$\mathbb{E}_X[\mathrm{OPT}(T_{\mathrm{stop}})] = \mathbb{E}_X\Big[\sum_{t=1}^{T_{\mathrm{stop}}} \langle \mu_*, X_t\pi^*(X_t)\rangle\Big] = \frac{T_{\mathrm{stop}}}{T}\,\mathrm{OPT}_1. \quad (59)$$

Lemma 10 The following is a lower bound on $\mathbb{E}_X[\gamma_a(T_{\mathrm{stop}}) - \gamma_\pi(T_{\mathrm{stop}})]$:
$$\mathbb{E}_X[\gamma_a(T_{\mathrm{stop}}) - \gamma_\pi(T_{\mathrm{stop}})] \ \ge\ B\Big(1 - \frac{T_{\mathrm{stop}}}{T}\Big) - \beta T_0 - R_D(T_{\mathrm{stop}}), \quad (60)$$
where $R_D(T_{\mathrm{stop}})$ is the regret of the OCO algorithm.

Proof: We make the following observations:
$$\mathbb{E}_X[\gamma_a(T_{\mathrm{stop}}) - \gamma_\pi(T_{\mathrm{stop}})] = \mathbb{E}_X\Big[\sum_{t=1}^{T_{\mathrm{stop}}} \langle W_*^\top(x_t(a_t) - X_t\pi^*(X_t)), \theta_t\rangle\Big] \quad (61)$$
$$= \sum_{t=1}^{T_{\mathrm{stop}}} \Big\langle \theta_t,\ \mathbb{E}_{X_t}[W_*^\top x_t(a_t)\,|\,H_{t-1}] - \mathbb{E}_{X_t}[W_*^\top X_t\pi^*(X_t)]\Big\rangle \ \ge\ \sum_{t=1}^{T_{\mathrm{stop}}} \Big\langle \theta_t,\ v_t - \frac{B}{T}\mathbf{1}\Big\rangle \ =\ \sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta_t), \quad (62)$$
where the inequality uses $\mathbb{E}_{X_t}[W_*^\top X_t\pi^*(X_t)] \le \frac{B}{T}\mathbf{1}$ (by the definition of $\pi^*$) together with $\theta_t \ge 0$, and $g_t(\theta_t)$ is the objective maximized by the OCO algorithm in each step.
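Before lower bounding $\sum_t g_t(\theta_t)$, it may help to make the dual player concrete. The analysis only requires an online convex optimization algorithm over $\{\theta : \theta \ge 0, \|\theta\|_1 \le 1\}$ whose regret on the linear objectives $g_t$ is $R_D(T_{\mathrm{stop}}) = O(\sqrt{T\log d})$ (cf. Proposition 1). One standard choice achieving this rate is exponentiated gradient with an extra slack coordinate, so that the played $\theta$ can have $\ell_1$ norm strictly less than one; the sketch below is an illustrative instantiation, not necessarily the algorithm used in the paper, and the step size is only indicative.

```python
import numpy as np

class DualEG:
    """Exponentiated-gradient dual player over {theta >= 0, ||theta||_1 <= 1}.

    A slack coordinate (index d) with zero payoff allows the played theta to
    have l1 norm strictly less than 1. For linear objectives
    g_t(theta) = <theta, grad_t> with |grad_t|_inf <= scale, the regret
    against the best fixed theta in the set is O(scale * sqrt(T log d)).
    """

    def __init__(self, d, T, scale=1.0):
        self.w = np.ones(d + 1)                        # weights: d resources + 1 slack
        self.eta = np.sqrt(np.log(d + 1) / T) / scale  # indicative step size

    def play(self):
        theta_full = self.w / self.w.sum()
        return theta_full[:-1]                         # drop the slack coordinate

    def update(self, grad):
        """grad: length-d payoff vector for this round, e.g. v_t - (B / T) * ones(d)."""
        grad_full = np.append(grad, 0.0)               # slack coordinate earns zero
        self.w = self.w * np.exp(self.eta * grad_full)
```

In the stochastic analysis above, the per-round payoff vector passed to update would be $v_t - \frac{B}{T}\mathbf{1}$ as in (62), and in the per-epoch adversarial analysis of Appendix F it would be $v_t(a_t) - \frac{B_0}{T}\mathbf{1}$.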
We now lower bound $\sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta_t)$. Let $\theta^* = \operatorname{argmax}_{\|\theta\|_1\le 1,\ \theta_i\ge 0} \sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta)$. If $R_D(T_{\mathrm{stop}})$ is the regret of the OCO algorithm up to time $T_{\mathrm{stop}}$, then
$$\sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta_t) \ \ge\ \sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta^*) - R_D(T_{\mathrm{stop}}).$$
Consider the two cases $T_{\mathrm{stop}} < T$ and $T_{\mathrm{stop}} = T$.

Case 1: If $T_{\mathrm{stop}} < T$, then $\sum_{t=1}^{T_{\mathrm{stop}}} \langle e_j, v_t\rangle \ge B_1 = B - \beta T_0$ for some $j \in \{1, \ldots, d\}$. Therefore,
$$\sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta_t) \ \ge\ \sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta^*) - R_D(T_{\mathrm{stop}}) \ \ge\ \sum_{t=1}^{T_{\mathrm{stop}}} g_t(e_j) - R_D(T_{\mathrm{stop}}) \ =\ \Big\langle e_j, \sum_{t=1}^{T_{\mathrm{stop}}}\Big(v_t - \frac{B}{T}\mathbf{1}\Big)\Big\rangle - R_D(T_{\mathrm{stop}}) \ \ge\ B\Big(1 - \frac{T_{\mathrm{stop}}}{T}\Big) - \beta T_0 - R_D(T_{\mathrm{stop}}).$$

Case 2: If $T_{\mathrm{stop}} = T$, then $B\big(1 - \frac{T_{\mathrm{stop}}}{T}\big) = 0$. Therefore,
$$\sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta_t) \ \ge\ \sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta^*) - R_D(T_{\mathrm{stop}}) \ \ge\ \sum_{t=1}^{T_{\mathrm{stop}}} g_t(0) - R_D(T_{\mathrm{stop}}) \ =\ 0 - R_D(T_{\mathrm{stop}}) \ =\ B\Big(1 - \frac{T_{\mathrm{stop}}}{T}\Big) - R_D(T_{\mathrm{stop}}).$$

Therefore, from the results for Case 1 and Case 2,
$$\sum_{t=1}^{T_{\mathrm{stop}}} g_t(\theta_t) \ \ge\ B\Big(1 - \frac{T_{\mathrm{stop}}}{T}\Big) - \beta T_0 - R_D(T_{\mathrm{stop}}). \quad (63)$$
Therefore, from (62) and (63) we get the advertised result,
$$\mathbb{E}_X[\gamma_a(T_{\mathrm{stop}}) - \gamma_\pi(T_{\mathrm{stop}})] \ \ge\ B\Big(1 - \frac{T_{\mathrm{stop}}}{T}\Big) - \beta T_0 - R_D(T_{\mathrm{stop}}).$$

From Proposition 1, with probability at least $1 - \delta$,
$$R_D(T_{\mathrm{stop}}) = O\big(\sqrt{T\log d}\big) \le O(R(T)). \quad (64)$$
Also, from the result of Lemma 1, with probability at least $1 - \delta$ and under the assumption $\sigma = \Omega(1/\sqrt{m})$, $\beta$ is bounded as
$$\beta \le O\big((1+\sigma)\sqrt{m\log(TK/\delta)}\big). \quad (65)$$
From the result of Lemma 4, with probability at least $1 - \delta$,
$$\frac{\mathrm{OPT}_1}{B} \ \le\ Z(T_0) \ \le\ O\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big). \quad (66)$$
Therefore, from equations (58), (59), (64), (65), (66) and the result of Lemma 10, with probability at least $1 - 5\delta$,
$$\mathbb{E}_X[\mathrm{REW}] \ \ge\ \frac{T_{\mathrm{stop}}}{T}\mathrm{OPT}_1 + Z(T_0)B\Big(1 - \frac{T_{\mathrm{stop}}}{T}\Big) - O\Big(\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big)R(T)\Big),$$
$$\mathbb{E}_X[\mathrm{REW}] \ \ge\ \mathrm{OPT}_1 - O\Big(\Big(\frac{\mathrm{OPT}_1}{B} + 1\Big)R(T)\Big), \quad (67)$$
which is the result in expectation. To bound the actual total reward we use the result of Corollary 3 with $z_t = \langle x_t(a_t) - \mathbb{E}_X[x_t(a_t)], \mu_*\rangle$. The vector $x_t(a_t) - \mathbb{E}_X[x_t(a_t)]$ is a sub-Gaussian random vector with sub-Gaussian norm $\|x_t(a_t) - \mathbb{E}_X[x_t(a_t)]\|_{\psi_2} = \sup_{u\in S^{m-1}} \|\langle x_t(a_t) - \mathbb{E}_X[x_t(a_t)], u\rangle\|_{\psi_2} \le c_1(1+\sigma)$, and therefore $z_t$ is a $c_2(1+\sigma)$ sub-Gaussian random variable. Applying Corollary 3, with probability at least $1 - \delta$,
$$\mathrm{REW} \ \ge\ \mathbb{E}_X[\mathrm{REW}] - O\big(\sqrt{T\log(1/\delta)}\big) \ \ge\ \mathbb{E}_X[\mathrm{REW}] - O(R(T)). \quad (68)$$
Combining equations (67) and (68) gives the advertised result.

F. Regret for Smoothed Adversarial Linear Bandits with Knapsacks

Theorem 3 Assume $B \ge \Omega(T^{3/4})$. Then with probability at least $1 - 6\delta$, the following is a lower bound on the reward obtained by the greedy algorithm:
$$\mathrm{REW} \ \ge\ \frac{\mathrm{OPT}_2}{16d\lceil\log T\rceil} - O\Big(\Big(\frac{\mathrm{OPT}_2}{B} + 1\Big)R(T)\Big), \quad (25)$$
where $R(T) = O\Big(\frac{w(A)\sqrt{\log(Td/\delta)}\,\beta\sqrt{T}\log K}{\sigma}\Big)$.

Proof: We first show that, by design, the resource consumption does not exceed the budget over the $T$ time steps. The algorithm allocates a budget $B_0 = B/(2\lceil\log T\rceil)$ to each epoch. Since $\lceil\log T\rceil$ is the maximum number of epochs, a total budget of $B/2$ is allocated to running the greedy algorithm. The other $B/2$ of the budget is reserved for the random exploration rounds in each epoch, which begin once any resource's consumption within the epoch exceeds $B_0$. In the random exploration rounds an arm is chosen uniformly at random with probability $T^{-1/4}$. Therefore, over $T$ rounds, with probability $1 - \delta$, at most $T^{3/4} + 3\sqrt{T^{3/4}\log(1/\delta)}$ rounds are random exploration rounds, and so if $B > 4T^{3/4}$ the resource consumptions do not exceed the budget over the $T$ time steps.

We now focus on the last completed epoch of the algorithm, running from time step $T_1$ to $T_2$. Let $\kappa$ be the value of the parameter $j$ in the algorithm during time step $T_1$. We then have the following relationships, where $\widehat{\mathrm{OPT}}(T_1;\omega(T_1))$, $\widehat{\mathrm{OPT}}(T_2;\omega(T_2))$, $\widehat{\mathrm{OPT}}(T;\omega(T))$ denote the maximum reward that can be obtained using only the contexts observed before time steps $T_1$, $T_2$, $T$ respectively.
$$\widehat{\mathrm{OPT}}(T;\omega(T)) \le 2^{\kappa+2} \quad \text{(a new epoch was not started after } T_2\text{)},$$
$$\widehat{\mathrm{OPT}}(T_2;\omega(T_2)) \ge 2^{\kappa+1} \quad \text{(a new epoch begins at time step } T_2\text{)},$$
$$2^{\kappa} \le \widehat{\mathrm{OPT}}(T_1;\omega(T_1)) < 2^{\kappa} + 2\beta \quad \text{(a new epoch begins at time step } T_1\text{; the per-step reward is bounded by } \beta\text{)}. \quad (69)$$

Also, using the estimation error bounds for $\widehat{\mathrm{OPT}}(T;\omega(T))$, $\widehat{\mathrm{OPT}}(T_1;\omega(T_1))$ and $\widehat{\mathrm{OPT}}(T_2;\omega(T_2))$ from Lemma 4, with probability at least $1 - \delta$,
$$\mathrm{OPT}(T;\omega(T)) - 2R(T) \ \le\ \widehat{\mathrm{OPT}}(T;\omega(T)) \ \le\ \mathrm{OPT}(T;\omega(T)) + 9R(T)\Big(\frac{\mathrm{OPT}(T;\omega(T))}{B} + 1\Big),$$
$$\mathrm{OPT}(T_1;\omega(T_1)) - 2R(T) \ \le\ \widehat{\mathrm{OPT}}(T_1;\omega(T_1)) \ \le\ \mathrm{OPT}(T_1;\omega(T_1)) + 9R(T)\Big(\frac{\mathrm{OPT}(T_1;\omega(T_1))}{B} + 1\Big),$$
$$\mathrm{OPT}(T_2;\omega(T_2)) - 2R(T) \ \le\ \widehat{\mathrm{OPT}}(T_2;\omega(T_2)) \ \le\ \mathrm{OPT}(T_2;\omega(T_2)) + 9R(T)\Big(\frac{\mathrm{OPT}(T_2;\omega(T_2))}{B} + 1\Big), \quad (70)$$
where $\mathrm{OPT}(T;\omega(T))$, $\mathrm{OPT}(T_1;\omega(T_1))$, $\mathrm{OPT}(T_2;\omega(T_2))$ are the optimal values computed at time steps $T$, $T_1$, $T_2$ using the data observed before $T$, $T_1$, $T_2$ respectively. Also, $Z(T_1)$ for the last completed epoch is set at the beginning of round $T_1$ as
$$Z(T_1) = \frac{\widehat{\mathrm{OPT}}(T_1;\omega(T_1))}{2dB}. \quad (71)$$

We now analyze the performance of the algorithm during the last completed epoch. Let $T_{\mathrm{stop}} \le T_2$ denote the stopping time: $T_{\mathrm{stop}} = T_2$ if no resource is exhausted, and otherwise the first time step at which some resource's consumption exceeds the allocation $B_0$.

Case 1: $T_{\mathrm{stop}} < T_2$. By Theorem 1 and Lemma 9,
$$\sum_{t=T_1}^{T_{\mathrm{stop}}} \langle \mu_*, x_t(a_t)\rangle \ \ge\ Z(T_1)\,\gamma_a(T_1, T_{\mathrm{stop}}) - O\big(Z(T_1)R(T)\big), \quad (72)$$
where $\gamma_a(T_1, T_{\mathrm{stop}}) = \sum_{t=T_1}^{T_{\mathrm{stop}}} \langle W_*^\top x_t(a_t), \theta_t\rangle$. Since $\mathbb{E}[v_t(a_t)\,|\,X_t, a_t, H_{t-1}] = W_*^\top x_t(a_t)$, by Azuma-Hoeffding $\big\|\sum_{t=T_1}^{T_{\mathrm{stop}}} \big(v_t(a_t) - W_*^\top x_t(a_t)\big)\big\|_\infty \le R(T)$. Therefore, to bound $\gamma_a(T_1, T_{\mathrm{stop}})$ we bound $\sum_t \langle v_t(a_t), \theta_t\rangle$. For the dual OCO algorithm with $g_t(\theta_t) = \langle \theta_t, v_t(a_t) - \frac{B_0}{T}\mathbf{1}\rangle$, let $\theta^* = \operatorname{argmax}_{\theta:\|\theta\|_1\le 1,\ \theta_i\ge 0} \sum_{t=T_1}^{T_{\mathrm{stop}}} g_t(\theta)$. Then, by the OCO algorithm bounds, with probability at least $1 - \delta$,
$$\sum_{t=T_1}^{T_{\mathrm{stop}}} g_t(\theta_t) \ \ge\ \sum_{t=T_1}^{T_{\mathrm{stop}}} g_t(\theta^*) - R_D(T_{\mathrm{stop}}) \quad\Rightarrow\quad \sum_{t=T_1}^{T_{\mathrm{stop}}} \langle v_t(a_t), \theta_t\rangle \ \ge\ \sum_{t=T_1}^{T_{\mathrm{stop}}} \langle v_t(a_t), \theta^*\rangle - R_D(T_{\mathrm{stop}}), \quad (73)$$
where $R_D(T_{\mathrm{stop}})$ is the regret of the dual OCO algorithm. Now, since $\theta^*$ maximizes the objective, for some $1 \le i \le d$,
$$\sum_{t=T_1}^{T_{\mathrm{stop}}} \langle v_t(a_t), \theta^*\rangle \ \ge\ \sum_{t=T_1}^{T_{\mathrm{stop}}} \langle v_t(a_t), e_i\rangle \ \ge\ B_0. \quad (74)$$
Since $R_D(\tau) \le R(T)$, from equation (72) we get
$$\sum_{t=T_1}^{T_{\mathrm{stop}}} \langle \mu_*, x_t(a_t)\rangle \ \ge\ Z(T_1)B_0 - O\big(Z(T_1)R(T)\big). \quad (75)$$

Case 2: $T_{\mathrm{stop}} = T_2$. Denote by $\mathrm{OPT}(T_1, T_2;\omega(T_2))$ the optimal reward between time steps $T_1$ and $T_2$, computed with reference to time step $T_2$ using all data observed before $T_2$. By Theorem 1 and Lemma 9,
$$\mathrm{REW} \ \ge\ \mathrm{OPT}(T_1, T_2;\omega(T_2)) + Z(T_1)\gamma_1(T_2) - Z(T_1)\gamma_2(T_2) - O\big(Z(T_1)R(T)\big). \quad (76)$$
The following always holds:
$$\gamma_2(T_2) = \sum_{t=T_1}^{T_2} \langle W_*^\top X_t\pi^*(X_t;\omega(T_2)), \theta_t\rangle \ \le\ dB. \quad (77)$$
Also, applying the OCO regret bounds, with probability at least $1 - \delta$,
$$\gamma_1(T_2) \ \ge\ -R_D(T_2), \quad (78)$$
with $R_D(T_2) = O(R(T))$. Therefore, from (76), (77) and (78), we have
$$\mathrm{REW} \ \ge\ \mathrm{OPT}(T_1, T_2;\omega(T_2)) - Z(T_1)dB - O\big(Z(T_1)R(T)\big). \quad (79)$$

Combining the results (75) and (79),
$$\mathrm{REW} \ \ge\ \min\big(\mathrm{OPT}(T_1, T_2;\omega(T_2)) - Z(T_1)dB,\ Z(T_1)B_0\big) - O\big(Z(T_1)R(T)\big). \quad (80)$$
Next we express $Z(T_1)$ and $\mathrm{OPT}(T_1, T_2;\omega(T_2))$ in terms of the optimal reward $\mathrm{OPT} = \mathrm{OPT}(T;\omega(T))$ and the budget $B$.
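Before doing so, it is worth recording the short calculation, implied by (69) and (71), that is responsible for the constant factors below: since $Z(T_1)dB = \widehat{\mathrm{OPT}}(T_1;\omega(T_1))/2$ by (71), while $\widehat{\mathrm{OPT}}(T_1;\omega(T_1)) < 2^{\kappa} + 2\beta$ and $\widehat{\mathrm{OPT}}(T_2;\omega(T_2)) \ge 2^{\kappa+1}$ by (69),
$$\widehat{\mathrm{OPT}}(T_2;\omega(T_2)) - \tfrac{3}{2}\,\widehat{\mathrm{OPT}}(T_1;\omega(T_1)) \ \ge\ 2^{\kappa+1} - \tfrac{3}{2}\big(2^{\kappa} + 2\beta\big) \ =\ \frac{2^{\kappa}}{2} - 3\beta \ =\ \frac{2^{\kappa+2}}{8} - 3\beta,$$
and the $3\beta$ term can be absorbed into the $O(R(T))$ error since $\beta \le O(R(T))$.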
Consider first the quantity $\mathrm{OPT}(T_1, T_2;\omega(T_2)) - Z(T_1)dB$:
$$\mathrm{OPT}(T_1, T_2;\omega(T_2)) - Z(T_1)dB \ \ge\ \mathrm{OPT}(T_2;\omega(T_2)) - \mathrm{OPT}(T_1;\omega(T_2)) - Z(T_1)dB$$
$$\ge\ \mathrm{OPT}(T_2;\omega(T_2)) - \tfrac{3}{2}\widehat{\mathrm{OPT}}(T_1;\omega(T_1)) - O(R(T)) \quad \text{[using (71) and the upper bound on } \widehat{\mathrm{OPT}}(T_1;\omega(T_1)) \text{ from (70)]}$$
$$\ge\ \widehat{\mathrm{OPT}}(T_2;\omega(T_2)) - \tfrac{3}{2}\widehat{\mathrm{OPT}}(T_1;\omega(T_1)) - O\Big(\Big(\frac{\mathrm{OPT}}{B} + 1\Big)R(T)\Big) \quad \text{[using the lower bound on } \mathrm{OPT}(T_2;\omega(T_2)) \text{ from (70)]}$$
$$\ge\ 2^{\kappa+1} - \tfrac{3}{2}\big(2^{\kappa} + 2\beta\big) - O\Big(\Big(\frac{\mathrm{OPT}}{B} + 1\Big)R(T)\Big) \quad \text{[substituting the bounds from (69)]}$$
$$\ge\ \frac{2^{\kappa+2}}{8} - O\Big(\Big(\frac{\mathrm{OPT}}{B} + 1\Big)R(T)\Big) \ \ge\ \frac{\mathrm{OPT}}{8} - O\Big(\Big(\frac{\mathrm{OPT}}{B} + 1\Big)R(T)\Big) \quad \text{[using the upper bound on } \widehat{\mathrm{OPT}}(T;\omega(T)) \text{ from (69), then (70)]}. \quad (81)$$

In the first line above, the quantities $\mathrm{OPT}(T_1;\omega(T_2))$ and $\mathrm{OPT}(T_2;\omega(T_2))$ are the optimal rewards at time steps $T_1$ and $T_2$ with time step $T_2$ as the reference:
$$\pi^* := \operatorname{argmax}_{\pi} \sum_{t=1}^{T_2} \mu_*^\top X_t\pi(X_t;\omega(T_2)) \quad \text{s.t.} \quad \sum_{t=1}^{T_2} W_*^\top X_t\pi(X_t;\omega(T_2)) \le B\mathbf{1},$$
$$\mathrm{OPT}(T_1;\omega(T_2)) = \sum_{t=1}^{T_1} \mu_*^\top X_t\pi^*(X_t;\omega(T_2)), \qquad \mathrm{OPT}(T_2;\omega(T_2)) = \sum_{t=1}^{T_2} \mu_*^\top X_t\pi^*(X_t;\omega(T_2)).$$
The first line holds because we can always assume that our algorithm ends at time $T_2$ by choosing the no-op arm after $T_2$; thus $\mathrm{OPT}(T_2;\omega(T_2))$ is the maximum reward that can be obtained, and we reference the optimal rewards at all time steps before $T_2$ to time step $T_2$. Also, $\widehat{\mathrm{OPT}}(T_1;\omega(T_1)) \le 2^{\kappa} + 2\beta$, estimated at the end of time step $T_1$, is the maximum reward that can be obtained by using all of the resources and only the contexts observed before time step $T_1$; hence $\widehat{\mathrm{OPT}}(T_1;\omega(T_2))$, computed with reference time $T_2$, is also at most $2^{\kappa} + 2\beta$.

Next consider the quantity $Z(T_1)B_0$:
$$Z(T_1)B_0 \ =\ \frac{\widehat{\mathrm{OPT}}(T_1;\omega(T_1))}{4d\lceil\log T\rceil} \ \ge\ \frac{2^{\kappa}}{4d\lceil\log T\rceil} \ =\ \frac{2^{\kappa+2}}{16d\lceil\log T\rceil} \ \ge\ \frac{\mathrm{OPT} - O(R(T))}{16d\lceil\log T\rceil} \ \ge\ \frac{\mathrm{OPT}}{16d\lceil\log T\rceil} - O(R(T)), \quad (82)$$
where the equality uses (71) together with $B_0 = B/(2\lceil\log T\rceil)$, the first inequality uses the lower bound on $\widehat{\mathrm{OPT}}(T_1;\omega(T_1))$ from (69), and the last two inequalities use the upper bound $2^{\kappa+2} \ge \widehat{\mathrm{OPT}}(T;\omega(T))$ from (69) followed by (70). Also,
$$Z(T_1) \ =\ \frac{\widehat{\mathrm{OPT}}(T_1;\omega(T_1))}{2dB} \ \le\ O\Big(\frac{\mathrm{OPT}}{B} + 1\Big). \quad (83)$$
Therefore, from equations (80), (81), (82) and (83), we get
$$\mathrm{REW} \ \ge\ \frac{\mathrm{OPT}}{16d\lceil\log T\rceil} - O\Big(\Big(\frac{\mathrm{OPT}}{B} + 1\Big)R(T)\Big). \quad (84)$$
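To summarize the structure the proof above relies on, the following is a schematic sketch (not the paper's pseudocode) of the epoch bookkeeping for the adversarial algorithm: a per-epoch greedy budget $B_0 = B/(2\lceil\log T\rceil)$, the trade-off parameter $Z$ set from the running estimate $\widehat{\mathrm{OPT}}$ at the start of each epoch as in (71), a switch to uniform exploration played with probability $T^{-1/4}$ once some resource in the epoch exceeds $B_0$, and a new epoch whenever $\widehat{\mathrm{OPT}}$ crosses the next power of two as in (69). The hooks estimate_opt_hat, greedy_step, explore and observe stand in for the estimation and selection subroutines analyzed earlier and are purely illustrative.

```python
import math
import random

def epoch_schedule(T, B, d, estimate_opt_hat, greedy_step, explore, observe):
    """Schematic bookkeeping of the epoch/doubling structure used in Theorem 3.

    estimate_opt_hat(t) -> running estimate of the optimal reward from data before t
    greedy_step(t, Z)   -> arm chosen by the greedy subroutine with trade-off Z (None = no-op)
    explore(t)          -> an arm chosen uniformly at random
    observe(t, arm)     -> length-d resource consumption incurred by playing arm at step t
    """
    B0 = B / (2 * math.ceil(math.log2(T)))   # per-epoch greedy budget B0 = B / (2 * ceil(log T))
    kappa = 0                                # doubling index: epoch starts when OPT_hat reaches 2^kappa
    Z = 1.0 / (2 * d * B)                    # placeholder until the first epoch sets it
    consumption = [0.0] * d                  # per-epoch consumption of each resource
    exploring = False                        # has this epoch switched to random exploration?

    for t in range(1, T + 1):
        opt_hat = estimate_opt_hat(t)
        if opt_hat >= 2 ** (kappa + 1):      # doubling condition: start a new epoch
            while opt_hat >= 2 ** (kappa + 1):
                kappa += 1
            Z = opt_hat / (2 * d * B)        # cf. Z(T1) = OPT_hat(T1; omega(T1)) / (2 d B) in (71)
            consumption = [0.0] * d
            exploring = False

        if exploring:
            # Random exploration rounds: play a uniformly random arm with probability T^(-1/4).
            arm = explore(t) if random.random() < T ** (-0.25) else None
        else:
            arm = greedy_step(t, Z)

        if arm is not None:
            v = observe(t, arm)
            consumption = [c + vi for c, vi in zip(consumption, v)]
            if not exploring and max(consumption) > B0:
                exploring = True             # greedy phase of this epoch used up its allocation B0
```

The two halves of the budget, $B/2$ across the greedy phases and $B/2$ reserved for exploration rounds, correspond to the accounting at the start of the proof of Theorem 3.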