# High-dimensional Linear Bandits with Knapsacks

Wanteng Ma¹, Dong Xia¹, Jiashuo Jiang²

We study the contextual bandits with knapsacks (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. We investigate how to exploit the sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs optimal sparse estimation. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach achieves a sublinear regret that depends only logarithmically on the feature dimension, improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimensional contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime.

¹Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong SAR. ²Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong SAR. Correspondence to: Jiashuo Jiang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Introduced in the seminal paper of Badanidiyuru et al. (2013), the bandits with knapsacks (BwK) problem is defined by solving an online knapsack problem with global size constraints. It is a special but important case of the online allocation problem that imposes a reward-agnostic assumption on resource allocations. The BwK problem has been broadly applied to many scenarios, e.g., ad allocation, dynamic pricing, and repeated auctions. In several applications such as online recommendation and online advertising, the contexts (or features, covariates) of the rewards that we observe are possibly high-dimensional; they contribute significantly to decision-making and motivate us to consider a variant of the BwK problem, namely the contextual bandits with knapsacks (CBwK) problem (Badanidiyuru et al., 2014). Although the CBwK problem has been extensively studied under different settings (Agrawal & Devanur, 2014; 2016; Immorlica et al., 2022; Liu et al., 2022), previous studies largely neglect the inherent high dimensionality of the covariates and, in turn, incur regrets that depend polynomially on the large dimension d, making these methods less feasible in the high-dimensional setting. This motivates us to explore the BwK problem further in the high-dimensional case, an emergent topic in online learning.

In this paper, we address this challenge by proposing efficient methods to solve the high-dimensional linear contextual bandits with knapsacks problem. Our method consists of two parts: primal estimation and dual-based allocation. We show that our online primal estimation achieves exact sparse recovery with optimal statistical error, comparable with the renowned LASSO method but at a lower computational cost.
Together with the dual allocation step, our primal-dual method controls the regret of the BwK problem at the order $\tilde O\big(\frac{V^{UB}}{C_{\min}}\sqrt{T} + \big(\frac{V^{UB}}{C_{\min}}\big)^{1/3} T^{2/3}\big)$, which depends only logarithmically on the dimension d. Moreover, we show that the regret can be further improved to $\tilde O\big(\frac{V^{UB}}{C_{\min}}\sqrt{T}\big)$ under an additional diverse covariate condition.

Our method also brings new insights into general online sparse estimation and sparse bandit problems. For sparse bandit problems, most of the existing literature relies heavily on LASSO, which exploits sparsity via regularized sample average approximation. Although LASSO enjoys good theoretical guarantees, it is hard to run in an online fashion. In this paper, we solve the sparse recovery problem through a novel stochastic approximation approach with hard thresholding, which is more aligned with online learning and is also statistically optimal. This estimation algorithm leads to a by-product: a unified sparse bandit algorithm framework that attains the desired optimal regrets $\tilde O(s_0^{2/3} T^{2/3})$ and $\tilde O(\sqrt{s_0 T})$ in the data-poor and data-rich regimes respectively, and hence achieves the so-called best of two worlds (Hao et al., 2020).

1.1. Main Results and Comparison to Literature

We summarize our main results and contributions. First, we develop a new online sparse estimation algorithm, named Online HT, that performs sparse estimation in an online manner. Previous methods for sparse estimation, such as LASSO (e.g., (Hao et al., 2020; Li et al., 2022; Ren & Zhou, 2023)) and iterative hard thresholding (Blumensath & Davies, 2009; Nguyen et al., 2017), perform offline estimation and thus require storing the entire historical data set, of size $O(dT)$, which can be costly when T is large. In contrast, our algorithm, featuring gradient averaging with online hard thresholding, only requires storing running averages of the previous observations, of size $O(d^2)$, instead of the entire data set. Moreover, our approach reduces the computational complexity of the sparse estimation step. Specifically, the computational complexity of Online HT is $O(d^2)$ per iteration and $O(d^2 T)$ in total, while the computational complexity of the classical LASSO solution is $O(d^3 + d^2 t)$ per iteration (Efron et al., 2004), and $O(d^3 T + d^2 T^2)$ in total if we require constant updates of the estimation, e.g., (Kim & Paik, 2019; Ren & Zhou, 2023). In this way, our online estimator enjoys a greater computational benefit than the offline estimators established in the previous literature; a back-of-the-envelope comparison is given below.
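To make the comparison concrete, here is a purely illustrative instance of the complexity counts above (the values of d and T are hypothetical, not taken from the paper's experiments):

$$d = 10^3,\ T = 10^4:\qquad \text{Online HT: } O(d^2 T) \approx 10^{10}\ \text{operations},\qquad \text{sequentially re-solved LASSO: } O(d^3 T + d^2 T^2) \approx 10^{13} + 10^{14}\ \text{operations}.$$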
Second, we show that the online update of our Online HT algorithm can be naturally combined with a primal-dual framework to solve the high-dimensional CBwK problem. Specifically, we introduce a dual variable for each resource constraint. Previous work on the BwK and CBwK problems (e.g., (Badanidiyuru et al., 2013; Agrawal & Devanur, 2016)) has shown that a sublinear regret can be achieved by applying online learning algorithms to update the dual variables and control the resource consumption, but these regret bounds depend polynomially on the feature dimension, for example, the $O(d\sqrt{T})$ regret bound in (Agrawal & Devanur, 2016) and the $O(\sqrt{dT})$ regret bound in (Han et al., 2023b). The difference in our approach is that we use the output of the Online HT algorithm at the current step as the primal estimate for the dual update. In this way, we alternately update the primal estimate by Online HT and the dual variable by online mirror descent in each iteration. We show that this integrated approach effectively exploits the sparsity structure of our problem and achieves a regret that depends logarithmically on both the dimension d and the number of constraints m. Thus, our approach performs the online allocation of the CBwK problem more efficiently when d is relatively large.

Finally, our Online HT algorithmic framework can be broadly applied to many other high-dimensional problems to achieve the statistically optimal estimation rate. For example, we apply Online HT to the high-dimensional contextual bandit problem, which can be regarded as a special case of the high-dimensional CBwK problem where the resource constraints are absent. We show that our algorithm attains the desired optimal regrets $\tilde O(s_0^{2/3} T^{2/3})$ for the data-poor regime and $\tilde O(\sqrt{s_0 T})$ for the general data-rich regime under the extra diverse covariate condition. In this way, we achieve the so-called best of two worlds (Hao et al., 2020) without the additional phase splitting and signal requirements of (Hao et al., 2020; Jang et al., 2022). We further review other papers on BwK problems and online sparse estimation problems in the appendix.

2. Notations

Throughout the paper, we use $\tilde O(\cdot)$ to denote the big-O rate that omits logarithmic terms. We write [K] for the set of positive integers from 1 to K, i.e., {1, 2, . . . , K}. We denote scalars by normal symbols and vectors/matrices by bold symbols. For matrix norms, we use $\|\cdot\|_{\max}$ for the maximum absolute value of the entries and $\|\cdot\|_{2,\max}$ for the maximum $\ell_2$ norm over the rows, i.e., $\|M\|_{2,\max} = \max_i \|e_i^\top M\|_2$.

3. High-dimensional Contextual BwK

We consider the high-dimensional contextual bandit with knapsacks problem over a finite horizon of T periods. There are m resources and each resource $i \in [m]$ has an initial capacity $C_i$. The capacity vector is denoted by $C \in \mathbb{R}^m$. We normalize C such that $C_i/T \in [0, 1]$ for each $i \in [m]$, and denote $C_{\min} = \min_{i\in[m]} C_i$. There are K arms and a null arm that generates no reward and consumes no resources, representing the void action. At each period $t \in [T]$, one query arrives, denoted by query t, and is associated with a feature $x_t \in \mathbb{R}^d$. We assume that $x_t$ is drawn from a distribution $F(\cdot)$ independently at each period t. For each arm $a \in [K]$, query t is associated with an unknown reward $r_t(a, x_t)$ and an unknown size $b(a, x_t) = (b_1(a, x_t), \ldots, b_m(a, x_t)) \in \mathbb{R}^m_{\ge 0}$. Note that both the reward $r(a, x_t)$ and the size $b(a, x_t)$ depend on the feature $x_t$ and the arm a. For each arm $a \in [K]$, we assume that the size $b(a, x_t)$ follows
$$b(a, x_t) = W_a^* x_t + \omega_t, \tag{1}$$
where $W_a^* \in \mathbb{R}^{m \times d}$ is an unknown weight matrix specified for each arm $a \in [K]$, and $\omega_t \in \mathbb{R}^m$ is an m-dimensional random noise vector drawn independently for each t, with each entry $\omega_{t,i}$ following a mean-zero sub-Gaussian distribution with parameter σ. The reward $r(a, x_t)$ is stochastic and is assumed to follow
$$r(a, x_t) = (\mu_a^*)^\top x_t + \xi_t, \tag{2}$$
where $\mu_a^* \in \mathbb{R}^d$ is an unknown weight vector specified for each arm $a \in [K]$, and $\xi_t$ is an independent random noise following a sub-Gaussian distribution with parameter σ and zero mean. In the following discussion, we will sometimes write the reward $r(a, x_t)$ as $r_t$ for simplicity.
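As a concrete illustration of the data-generating process in (1)-(2), the following minimal simulator draws sparse arm parameters and returns the stochastic reward and consumption of a pulled arm; all names and parameter values are illustrative and do not correspond to the paper's experimental setup.

```python
import numpy as np

def make_cbwk_instance(d=200, K=5, m=5, s0=10, sigma=0.5, seed=0):
    """Sparse CBwK instance following models (1)-(2):
    each mu_a and each row of W_a has at most s0 nonzero entries."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((K, d))
    W = np.zeros((K, m, d))
    for a in range(K):
        supp = rng.choice(d, size=s0, replace=False)    # support of arm a
        mu[a, supp] = rng.uniform(-1, 1, size=s0)
        for i in range(m):
            W[a, i, supp] = rng.uniform(0, 1, size=s0)  # row-wise sparse consumption weights

    def pull(a, x):
        """Observed reward r(a, x) and consumption b(a, x) with sub-Gaussian noise."""
        r = mu[a] @ x + sigma * rng.standard_normal()
        b = W[a] @ x + sigma * rng.standard_normal(m)
        return r, b

    return mu, W, pull
```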
After seeing the feature $x_t$, a decision maker must decide online which arm to pull. If arm $a_t$ is pulled for query t, then each resource $i \in [m]$ is consumed by $b_i(a_t, x_t)$ units and a reward $r_t(a_t, x_t)$ is collected. The realized value of $r_t(a_t, x_t)$ is also observed. Note that query t can only be served if the remaining capacities exceed $b(a_t, x_t)$ component-wise. The decision maker's goal is to maximize the total collected reward subject to the resource capacity constraints. The benchmark is the offline decision maker who knows the values of $\mu_a^*$ and $x_t$ for all $a \in [K]$, $t \in [T]$ and always makes the optimal decision in hindsight. We denote by $\{y^{\mathrm{off}}_{a,t}, a \in [K]\}_{t=1}^T$ the decisions of the offline optimum, which is an optimal solution to the following offline problem:
$$V^{\mathrm{Off}}(I) = \max_{y}\ \sum_{t=1}^T \sum_{a\in[K]} \big((\mu_a^*)^\top x_t + \xi_t\big)\, y_{a,t} \quad \text{s.t.}\quad \sum_{t=1}^T \sum_{a\in[K]} \big(W_a^* x_t + \omega_t\big)\, y_{a,t} \le C,\quad y_{a,t} \in \{0,1\},\quad \sum_{a\in[K]} y_{a,t} = 1,\ \forall t\in[T].$$
For any feasible online policy π, we measure its performance by the regret, defined as
$$\mathrm{Regret}(\pi) := \mathbb{E}_{I\sim F}\big[V^{\mathrm{Off}}(I)\big] - \mathbb{E}_{I\sim F}\big[V^{\pi}(I)\big], \tag{3}$$
where $I = \{(x_t, \xi_t)\}_{t=1}^T \sim F$ denotes that $x_t$ follows the distribution $F(\cdot)$ independently for each $t\in[T]$, and $V^{\pi}(I)$ denotes the total value collected under the policy π. A common upper bound on $\mathbb{E}_{I\sim F}[V^{\mathrm{Off}}(I)]$ can be formulated as follows:
$$V^{UB} = \max_{y}\ \sum_{t=1}^T \sum_{a\in[K]} \mathbb{E}_{x_t\sim F}\big[(\mu_a^*)^\top x_t\, y_{a,t}(x_t)\big] \quad \text{s.t.}\quad \sum_{t=1}^T \sum_{a\in[K]} \mathbb{E}_{x_t\sim F}\big[W_a^* x_t\, y_{a,t}(x_t)\big] \le C,\quad y_{a,t}(x_t) \in [0,1],\quad \sum_{a\in[K]} y_{a,t}(x_t) = 1,\ \forall t\in[T],\ \forall x_t.$$
The following result is standard in the literature and formally establishes that $V^{UB}$ can be used to upper bound the regret of any policy π.

Lemma 3.1 (folklore). We have $\mathbb{E}_{I\sim F}[V^{\mathrm{Off}}(I)] \le V^{UB}$.

Therefore, in what follows, we benchmark against $V^{UB}$ and exploit the structure of $V^{UB}$ to derive our online policy and bound the regret.

3.1. High-dimensional features and sparsity structures

We consider the case where the feature dimension d is very large and a sparsity structure exists for the weight vectors $\mu_a^*$. Specifically, we assume a uniform sparsity level $s_0$ such that $\|\mu_a^*\|_0 \le s_0$ for each $a \in [K]$ with $s_0 \ll d$, together with a bound on the range of the arms: $\|\mu_a^*\|_\infty \le 1$. Accordingly, we assume that the $W_a^*$ are also row-wise sparse, with each row satisfying $\max_{a\in[K], i\in[m]} \|W^*_{a,i\cdot}\|_0 \le s_0$ and with maximum entry satisfying $\max_{a\in[K]} \|W_a^*\|_{\max} \le 1$. To establish the theory of online learning, one must ensure that the information of each $\mu_a^*$ and $W_a^*$ can be retrieved statistically from the observations. The following basic assumptions are necessary for such sparse learning.

Assumption 3.2. We make the following assumptions throughout the paper.
(a) There exists a constant D such that the covariate is uniformly bounded: $\|x_t\|_\infty \le D$.
(b) There exists a constant D′ such that for any arm a and covariate $x_t$, it holds that $\|b(a, x_t)\|_\infty \le D'$.
(c) For any s, the covariance matrix $\Sigma := \mathbb{E}\, x_t x_t^\top$ has the 2s-sparse minimal eigenvalue $\phi_{\min}(s)$ and 2s-sparse maximal eigenvalue $\phi_{\max}(s)$ (Meinshausen & Yu, 2008), where $\phi_{\min}(s)$ is defined as
$$\phi_{\min}(s) = \min_{\beta:\,\|\beta\|_0 \le 2s} \frac{\beta^\top \Sigma \beta}{\|\beta\|_2^2},$$
and $\phi_{\max}(s)$ is defined correspondingly. The condition number is then denoted by $\kappa = \phi_{\max}(s)/\phi_{\min}(s)$.

The sparse minimal eigenvalue condition essentially shares the same idea as the restricted eigenvalue conditions that have been broadly used in the high-dimensional sparse bandit problem (Hao et al., 2020; Oh et al., 2021; Li et al., 2022). It ensures that the sparse structure can be detected from the sampling.
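Computing $\phi_{\min}(s)$ exactly requires searching over all sparse supports, but a quick sanity check follows from the full spectrum of Σ: for any 2s-sparse unit vector β, $\lambda_{\min}(\Sigma) \le \beta^\top\Sigma\beta \le \lambda_{\max}(\Sigma)$. The snippet below applies this bound to a power-decaying covariance of the kind used later in Section 7; it is an illustrative check, not part of the paper's algorithms.

```python
import numpy as np

def power_decay_cov(d, alpha=0.5):
    """Sigma_ij = alpha^{|i-j|} (the covariance used in the experiments of Section 7)."""
    idx = np.arange(d)
    return alpha ** np.abs(idx[:, None] - idx[None, :])

Sigma = power_decay_cov(d=1000, alpha=0.5)
eigs = np.linalg.eigvalsh(Sigma)
# For every 2s-sparse unit vector beta:  lam_min <= beta' Sigma beta <= lam_max,
# hence phi_min(s) >= eigs[0], phi_max(s) <= eigs[-1], and kappa <= eigs[-1] / eigs[0].
print("lower bound on phi_min(s):", eigs[0])
print("upper bound on phi_max(s):", eigs[-1])
print("upper bound on condition number kappa:", eigs[-1] / eigs[0])
```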
4. Optimal Online Sparse Estimation

The primal task for our online learning problem is to estimate the high-dimensional arms during exploration, which serves as the foundation of our learning strategy. To this end, we focus in this section on estimating one specific arm, say $\mu_a^*$ for one $a \in [K]$, from the observations $x_t$ and $r_t$. Estimating $W_a^*$ can be conducted similarly by treating $b_i(a, x_t)$ as the response for each row $W^*_{a,i\cdot}$. Since $\|\mu_a^*\|_0 \le s_0$ with $s_0 \ll d$, for the linear problem, recovering $\mu_a^*$ is equivalent to the following $\ell_0$-constrained optimization problem:
$$\min_{\|\mu\|_0 \le s_0} f(\mu) := \mathbb{E}\big(r_t - \mu^\top x_t\big)^2 = \|\mu - \mu_a^*\|_{\Sigma}^2 + \sigma^2. \tag{4}$$
To solve (4), LASSO is used extensively in the literature. Despite its statistical optimality, such a method relies heavily on the accumulated data to perform the $\ell_1$-regularized optimization, which cannot easily be adapted to the online setting, especially for sequential estimation. Thus, in high-dimensional online learning, it is imperative to find a sparse estimation algorithm that runs fully online and still achieves the optimal statistical rate. We describe our proposed optimal online sparse estimation algorithm in Algorithm 1 in the context of an ϵ-greedy sampling strategy. To ease notation, we define the sparse projection $H_s(x)$ as the hard thresholding operator that zeros out all entries of x except the s largest in absolute value, and we denote by $\rho = s_0/s$ the relative sparsity level.

Algorithm 1 Online Hard Thresholding with Averaged Gradient (Online HT)
1: Input: T, step size $\eta_t$, sparsity levels s, $s_0$, arm a, $\mu_{a,0} = 0$
2: for t = 1, . . . , T do
3:   Sample the reward according to the decision variable $y_{a,t} \sim \mathrm{Ber}(p_{a,t})$, where $p_{a,t} \in \sigma(\mathcal{H}_{t-1}, x_t)$
4:   if $p_{a,t} = 0$ then
5:     Treat $y_{a,t}/p_{a,t} = 0$
6:   end if
7:   Compute the covariance matrix $\widehat\Sigma_{a,t} = \frac{1}{t}\big[(t-1)\widehat\Sigma_{a,t-1} + y_{a,t}\, x_t x_t^\top / p_{a,t}\big]$
8:   Get the averaged stochastic gradient: $g_{a,t} = 2\widehat\Sigma_{a,t}\mu_{a,t-1} - \frac{2}{t}\sum_{j=1}^{t} y_{a,j}\, x_j r_j / p_{a,j}$
9:   Gradient descent with hard thresholding: $\mu_{a,t} = H_s(\mu_{a,t-1} - \eta_t\, g_{a,t})$
10:  Exact $s_0$-sparse estimation: $\mu^s_{a,t} = H_{s_0}(\mu_{a,t})$
11: end for
12: Output: $\{\mu^s_{a,t}\}$, $t \in [T]$

Theorem 4.1. Define $\underline{p}_j = \inf p_{a,j}$ as the lower bound of each $p_{a,j}$ and suppose $\underline{p}_j \ge \Omega(j^{-1/3})$. If we take the relative sparsity level satisfying $\rho := s_0/s = \frac{1}{9\kappa^4}$ and $\eta_t = \frac{1}{4\kappa\,\phi_{\max}(s)}$, then under Assumption 3.2, the output of Algorithm 1 satisfies
$$\mathbb{E}\,\|\mu^s_{a,t} - \mu_a^*\|_2^2 \lesssim \frac{\sigma^2 D^2 s_0}{\phi_{\min}^2(s)}\cdot\frac{\big(\sum_{j=1}^t 1/\underline{p}_j\big)\log d}{t^2},$$
and the high-probability bound
$$\|\mu^s_{a,t} - \mu_a^*\|_2^2 \lesssim \frac{\sigma^2 D^2 s_0}{\phi_{\min}^2(s)}\cdot\frac{\big(\sum_{j=1}^t 1/\underline{p}_j\big)\log(dT/\varepsilon)}{t^2}$$
holds for all t with probability at least $1-\varepsilon$.

Algorithm 1 serves as an online counterpart of the classic LASSO method. It achieves the statistically optimal rate of sparse estimation in the sense that, if we force $p_{a,j} = 1$ for each j, then the estimation error becomes $O\big(\frac{s_0\sigma^2 \log d}{\phi_{\min}^2(s)\, t}\big)$, which matches the well-known optimal sparse estimation error rate (Ye & Zhang, 2010; Tsybakov & Rigollet, 2011). Algorithm 1 needs to continuously maintain an empirical covariance matrix $\widehat\Sigma_{a,t}$, which takes $O(d^2)$ storage space; however, $\widehat\Sigma_{a,t}$ and the stochastic gradients $g_{a,t}$ can be updated incrementally at $O(d^2)$ cost per round, which leads to the fast $O(d^2 T)$ total computational complexity. Moreover, our bound can be easily extended to the uniform bound over all arms, $\mathbb{E}\max_{a\in[K]}\|\mu^s_{a,t} - \mu_a^*\|_2^2$, with only an additional $\log K$ term in the error rate. See the supplementary materials for details.
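A minimal sketch of Algorithm 1 for a single arm, assuming the observation model of Section 3; class and variable names are illustrative, and the code omits the tuning of s, s0 and the step size prescribed by Theorem 4.1.

```python
import numpy as np

def hard_threshold(v, s):
    """H_s(v): keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if s > 0:
        keep = np.argsort(np.abs(v))[-s:]
        out[keep] = v[keep]
    return out

class OnlineHT:
    """Online Hard Thresholding with Averaged Gradient for one arm (Algorithm 1)."""
    def __init__(self, d, s, s0, eta):
        self.s, self.s0, self.eta = s, s0, eta
        self.mu = np.zeros(d)              # s-sparse iterate mu_{a,t}
        self.Sigma_hat = np.zeros((d, d))  # importance-weighted empirical covariance
        self.xr_sum = np.zeros(d)          # running sum of y_j x_j r_j / p_j
        self.t = 0

    def update(self, x, r, pulled, p):
        """One round: x = feature, r = observed reward, pulled = 1 if this arm was played,
        p = sampling probability of this arm (used for inverse-probability weighting)."""
        self.t += 1
        w = pulled / p if p > 0 else 0.0
        self.Sigma_hat = ((self.t - 1) * self.Sigma_hat + w * np.outer(x, x)) / self.t
        self.xr_sum += w * r * x
        grad = 2 * self.Sigma_hat @ self.mu - 2 * self.xr_sum / self.t
        self.mu = hard_threshold(self.mu - self.eta * grad, self.s)
        return hard_threshold(self.mu, self.s0)   # exact s0-sparse estimate mu^s_{a,t}
```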
The lower bound $\underline{p}_j$ is used to adapt our algorithm to the ϵ-greedy exploration strategy. If at each round j the arm a is sampled with minimum probability $\epsilon_j$, then we have $p_{a,j} = 1 - (K-1)\epsilon_j$ or $p_{a,j} = \epsilon_j$ for arm a, implying that $\underline{p}_j = \epsilon_j$. The inverse probability weight $1/p_{a,j}$ used in Algorithm 1 corrects the empirical covariance matrix and the gradients of each iteration by importance sampling (Chen et al., 2021), making the gradient estimation consistent. In fact, the error rate in Theorem 4.1 also applies to the s-sparse estimator $\mu_{a,t}$; thus, when $s_0$ is unknown, we can simply use $\mu_{a,t}$ instead.

For hard-thresholding type methods, the major challenge in online algorithm design is the gradient information loss caused by truncation. In the online update, the hard thresholding operator zeros out all the small signals, which contain valuable gradient information for the next update (Murata & Suzuki, 2018; Zhou et al., 2018). Moreover, this missing information accumulates during the online iterations, making it difficult for previous methods to recover a sparse structure (Nguyen et al., 2017; Murata & Suzuki, 2018; Zhou et al., 2018). To tackle this issue, we choose a slightly larger sparsity level that allows us to preserve more information on the gradient. We show that a larger sparsity level (which depends on the condition number κ) retains enough information that the truncation effect becomes negligible.

The fundamental reason for the gradient averaging in Algorithm 1 is the poor smoothness property of the hard thresholding operator, i.e., the projection onto the $\ell_0$-constrained space. Unlike the convex projection or higher-order low-rank projection, the projection onto the $\ell_0$-constrained space exhibits an inflating smoothness behavior. To be specific, the projection onto a convex space enjoys the nice property $\|P(x+\Delta) - x\|_2 \le \|\Delta\|_2$, with no inflation of the error. The projection onto a low-rank space (e.g., SVD or HOSVD for a low-rank matrix or tensor) also satisfies $\|P(x+\Delta) - x\|_F \le \|\Delta\|_F + C\|\Delta\|_F^2$ if Δ lies in the tangent space of the manifold (Kressner et al., 2014; Cai et al., 2022), which leads to only tiny inflation in online tensor learning (Cai et al., 2023). However, the projection onto the $\ell_0$-constrained space can only ensure $\|P(x+\Delta) - x\|_2^2 \le (1+\delta)\|\Delta\|_2^2$, where δ is a non-zero parameter depending on the relative sparsity level and is unimprovable (Shen & Li, 2017), which causes trouble for online sparse recovery. To mitigate this inevitable inflation, gradient averaging is employed to decrease the variance, thereby enabling us to achieve the optimal convergence rate.

For the BwK problem, since $b(a, x_t)$ is also unknown to the decision maker, we need to consecutively estimate the size, or equivalently, $W_a^*$. To this end, we can treat each row $W^*_{a,i\cdot}$ as a sparse vector (substituting $\mu_a^*$) with $b_i(a, x_t)$ as the response (substituting $r_t$), and estimate them using Algorithm 1. The error of estimating $W^*_{a,i\cdot}$ is of the same order as that of estimating $\mu_a^*$. See the supplementary materials for the exact error bound of the estimator $\widehat W_{a,t}$; a row-wise sketch is given below.
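Under the reduction just described, the consumption matrix can be estimated row by row with the same routine. A minimal sketch, assuming the OnlineHT class from the previous sketch is available:

```python
import numpy as np

class OnlineHTMatrix:
    """Estimate W_a row-wise: row i uses b_i(a, x_t) as the response in Algorithm 1."""
    def __init__(self, d, m, s, s0, eta):
        self.rows = [OnlineHT(d, s, s0, eta) for _ in range(m)]  # one estimator per resource

    def update(self, x, b, pulled, p):
        """x = feature, b = observed m-dimensional consumption, pulled/p as in OnlineHT.
        Returns the current s0-sparse estimate of the m x d matrix W_a."""
        return np.vstack([row.update(x, b[i], pulled, p) for i, row in enumerate(self.rows)])
```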
5. Online Allocation: BwK Problem

In this section, we handle the BwK problem described in Section 3. Our algorithm adopts a primal-dual framework, where we introduce a dual variable to reflect the capacity consumption of each resource. The dual variable can be interpreted as the Lagrangian dual variable of $V^{UB}$, with dual function
$$L(\eta) = \sum_{t=1}^T \mathbb{E}_{x_t\sim F}\Big[\max_{y_t(x_t)\in\Delta_K}\ \sum_{a\in[K]} \Big((\mu_a^*)^\top x_t\, y_{a,t}(x_t) - Z\,\big(W_a^* x_t\big)^\top \eta\, y_{a,t}(x_t)\Big)\Big] + C^\top \eta,$$
where $\Delta_K$ denotes the unit simplex $\Delta_K = \{y \in \mathbb{R}^K : y_a \ge 0,\ a\in[K],\ \sum_{a\in[K]} y_a = 1\}$ and Z is a scaling parameter that we will specify later. Note that if the weight vectors $\mu_a^*$ and costs $W_a^*$ were given for each arm $a\in[K]$ and the distribution $F(\cdot)$ were known, one could directly solve the dual problem $\min_{\eta} L(\eta)$ to obtain the optimal dual variable $\eta^*$, and the primal variable $y_{a,t}(x_t)$ could then be decided by solving the inner maximization problem in the definition of the dual function $L(\eta)$. However, since they are all unknown, one cannot directly solve the dual problem. Instead, we employ an online learning algorithm and use the information obtained at each period as feedback to update the dual variable $\eta_t$. Then, we plug the dual variable $\eta_t$, as well as the estimates of $\mu_a^*$ and $W_a^*$ for each $a\in[K]$, into the inner maximization problem in the definition of $L(\eta)$ to obtain the primal variable $y_{a,t}(x_t)$. Note that this primal-dual framework has been developed previously in the literature on bandits with knapsacks for UCB algorithms (e.g., (Badanidiyuru et al., 2013; Agrawal & Devanur, 2016)). The innovation of our algorithm is that, instead of using UCB in the primal selection, we use ϵ-greedy exploration with finer estimates of $\mu_a^*$ and $W_a^*$ obtained via Algorithm 1, which enables us to exploit the sparsity structure of the problem and obtain an improved regret bound. Our formal algorithm is presented in Algorithm 2.

Algorithm 2 Primal-Dual High-dimensional BwK Algorithm
1: Input: Z, ϵ-greedy probability $\epsilon_t$ for each t, δ.
2: In the first m rounds, pull each arm once and initialize $\eta_m = \frac{1}{m}\mathbf{1}_m$. Set $\mu^s_{a,m} = 0$, $\widehat W_{a,m} = 0$.
3: for t = m + 1, . . . , T do
4:   Observe the feature $x_t$.
5:   Estimate $\mathrm{EstCost}(a) = x_t^\top \widehat W_{a,t-1}^\top \eta_{t-1}$ for each arm $a\in[K]$.
6:   Sample a random variable $\nu_t \sim \mathrm{Ber}(K\epsilon_t)$.
7:   if $\nu_t = 0$ then
8:     $y_t = \arg\max_{a\in[K]}\big\{(\mu^s_{a,t-1})^\top x_t - Z\cdot \mathrm{EstCost}(a)\big\}$
9:   else
10:    $y_t$ is uniformly selected from [K]
11:  end if
12:  Receive $r_t$ and $b(y_t, x_t)$. If one of the constraints is violated, then EXIT.
13:  Update, for each resource $i\in[m]$, $\alpha_t(i) = \alpha_{t-1}(i)\,(1+\delta)^{\,b_i(y_t,x_t) - C_i/T}$, and project $\alpha_t$ onto the unit simplex $\{\eta : \|\eta\|_1 \le 1,\ \eta \ge 0\}$ to obtain $\eta_t$ as follows: $\eta_t(i) = \alpha_t(i)\big/\sum_{i'\in[m]}\alpha_t(i')$, $i\in[m]$.
14:  For each arm $a\in[K]$, obtain the estimates $\mu^s_{a,t}$ and $\widehat W_{a,t}$ from Algorithm 1.
15: end for
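The two per-round updates at the core of Algorithm 2, the primal arm selection (step 8) and the multiplicative-weights dual update (step 13), can be sketched as follows. This is a minimal illustration under the notation above; the exponent is left unnormalized as in step 13, and all helper names are illustrative.

```python
import numpy as np

def select_arm(x, mu_hat, W_hat, eta, Z):
    """Step 8: pick the arm maximizing estimated reward minus Z times the estimated dual cost.
    mu_hat: K x d array; W_hat: K x m x d array; eta: dual vector on the simplex."""
    scores = [mu_hat[a] @ x - Z * (eta @ (W_hat[a] @ x)) for a in range(mu_hat.shape[0])]
    return int(np.argmax(scores))

def dual_update(alpha, b, C, T, delta):
    """Step 13: multiplicative-weights update driven by the per-resource overconsumption
    b_i - C_i/T, followed by normalization onto the unit simplex."""
    alpha = alpha * (1.0 + delta) ** (b - C / T)
    eta = alpha / alpha.sum()
    return alpha, eta
```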
5.1. Regret analysis

In this section, we conduct the regret analysis of Algorithm 2. We first show how the regret depends on the choice of $\epsilon_t$ for each $t\in[T]$, as well as on the estimation errors of our estimators of $\mu_a^*$ and $W_a^*$ for each $a\in[K]$. We then specify the exact value of $\epsilon_t$ and utilize the estimation error characterized in Theorem 4.1 to derive our final regret bound.

Theorem 5.1. Denote by π the process of our Algorithm 2 and by τ the stopping time of Algorithm 2. If Z satisfies $Z \ge \frac{V^{UB}}{C_{\min}}$, then, under Assumption 3.2, the regret of the policy π can be upper bounded as
$$\mathrm{Regret}(\pi) \le Z\cdot O\big(\sqrt{T\log m}\big) + \mathbb{E}\sum_{t=1}^{T}\Big[\max_{a}\,\big\langle x_t,\, \mu_a^* - \mu^s_{a,t-1}\big\rangle + Z D\,\max_a\big\|\widehat W_{a,t} - W_a^*\big\|_{2,\max}\Big] + \big(4R_{\max} + 2D'Z\big)\sum_{t=1}^T K\epsilon_t,$$
by setting $\delta = O\big(\sqrt{\log m / T}\big)$, where $R_{\max} = \sup_{x,\, a\in[K]} |\langle x, \mu_a^*\rangle|$ and D′ denotes an upper bound on $b_i(y_t, x_t)$ as specified in Assumption 3.2.

The three terms in Theorem 5.1 correspond to the distinct components of Algorithm 2 that contribute to the final regret bound. The first term represents the effect of the dual update using the Hedge algorithm (Freund & Schapire, 1997), while the last two terms arise from the online sparse estimation and the ϵ-greedy exploration, both of which can be categorized as consequences of the primal update. Given that the estimation error is controlled by Corollary B.1 and Proposition B.2, we can establish the following regret bound.

Theorem 5.2. Under Assumption 3.2, if Z satisfies $\frac{V^{UB}}{C_{\min}} \le Z \le O\big(\frac{V^{UB}}{C_{\min}} + 1\big)$, then the regret of Algorithm 2 can be upper bounded by
$$\mathrm{Regret}(\pi) \le \tilde O\Big(\frac{V^{UB}}{C_{\min}}\sqrt{T} + \frac{1}{\phi_{\min}^{2/3}(s)}\Big(R_{\max} + D'\,\frac{V^{UB}}{C_{\min}}\Big)^{1/3} s_0^{2/3}\, T^{2/3}\Big)$$
by setting $\delta = O\big(\sqrt{\log m/T}\big)$ and $\epsilon_t = \Theta(t^{-1/3})$.

The result generally shows a two-phase regret for the high-dimensional BwK problem, i.e., $\mathrm{Regret}(\pi) = \tilde O\big(\frac{V^{UB}}{C_{\min}}\sqrt{T} + \big(\frac{V^{UB}}{C_{\min}}\big)^{1/3} T^{2/3}\big)$, which reveals the leading effects of the primal or dual updates on the regret under different situations. That is, if $\frac{V^{UB}}{C_{\min}} = O(T^{1/4})$, then our constraints are sufficiently abundant for decision-making, so that learning the primal information is the bottleneck of the problem, which leads to $\mathrm{Regret}(\pi) = \tilde O\big(\big(\frac{V^{UB}}{C_{\min}}\big)^{1/3} T^{2/3}\big)$; if instead $\frac{V^{UB}}{C_{\min}} = \omega(T^{1/4})$, our constraints are considered scarce, positioning the dual information as the bottleneck of the problem, and thus $\mathrm{Regret}(\pi) = \tilde O\big(\frac{V^{UB}}{C_{\min}}\sqrt{T}\big)$. Most notably, our regret shows only logarithmic dependence on the dimension d, which improves the polynomial dependence on d in previous results (Agrawal & Devanur, 2016) and makes the algorithm more feasible for high-dimensional problems.

5.2. Estimating reward-constraint ratio

Algorithm 2 requires an estimate of the reward-constraint ratio Z. Such an estimate can be obtained from a linear program similar to that in (Agrawal & Devanur, 2016). Differently from (Agrawal & Devanur, 2016), however, we use the estimators obtained by Algorithm 1 to construct a relaxed linear program. To be specific, we choose a parameter $T_0$ and use the first $T_0$ periods to obtain an approximation $\hat V$ of $V^{UB}$ by uniform sampling. We show that as long as $T_0 = \tilde O\big(s_0^2 T^2 / C_{\min}^2\big)$, we have $Z = O\big(\frac{V^{UB}}{C_{\min}} + 1\big)$ with high probability. If, in addition, the constraints grow linearly, i.e., $C_{\min} = \Omega(T)$, we only require $T_0 = \tilde O(s_0^2)$ in general. See the appendix for details.

5.3. Improved regret with diverse covariate

Theorem 5.2 shows that the primal update may become the bottleneck of the regret. This happens because we have to compromise between exploration and exploitation. However, in some cases, when the covariates are diverse enough, our dual allocation algorithm will naturally explore sufficiently many arms, leading to a significant improvement in exploitation. We now describe such a case with the notion of the diverse covariate condition (Ren & Zhou, 2023).

Assumption 5.3 (Diverse covariate). There are (possibly K-dependent) positive constants γ(K) and ζ(K) such that for any unit vector $v\in\mathbb{R}^d$, $\|v\|_2 = 1$, and any $a\in[K]$, conditional on the history $\mathcal{H}_{t-1}$,
$$P\Big(v^\top x_t x_t^\top v\cdot \mathbb{1}\{y_t = a\} \ge \gamma(K)\ \Big|\ \mathcal{H}_{t-1}\Big) \ge \zeta(K),$$
where $y_t = \arg\max_{a\in[K]}\{(\mu^s_{a,t-1})^\top x_t - Z\cdot\mathrm{EstCost}(a)\}$.

Such a diverse covariate condition states that, when we perform the online allocation task, our dual-based algorithm can ensure sufficient exploration. This can be viewed as a primal-dual version of the diverse covariate condition for greedy algorithms (Han et al., 2020; Ren & Zhou, 2023). If our covariates are diverse enough, we can simply set $\epsilon_t = 0$ in Algorithm 2 and still obtain good primal exploration. We present the primal behavior of our algorithm in the following Theorem 5.4.
Theorem 5.4. Denote $\kappa_1 = \frac{\phi_{\max}(s)}{\gamma(K)\zeta(K)}$. If we take $\rho = \frac{1}{9\kappa_1^4}$ and $\eta_t = \frac{1}{4\kappa_1\phi_{\max}(s)}$, then under Assumptions 3.2 and 5.3, setting $\epsilon_t = 0$, the output of Algorithm 1 satisfies
$$\mathbb{E}\,\|\mu_{a,t} - \mu_a^*\|_2^2 \lesssim \frac{\sigma^2 D^2 s_0}{\gamma^2(K)\,\zeta^2(K)}\cdot\frac{\log d}{t},$$
and the high-probability bound
$$\|\mu_{a,t} - \mu_a^*\|_2^2 \lesssim \frac{\sigma^2 D^2 s_0}{\gamma^2(K)\,\zeta^2(K)}\cdot\frac{\log(dTK/\varepsilon)}{t}$$
holds with probability at least $1-\varepsilon$.

Theorem 5.4 suggests that, under the diverse covariate condition, our algorithm recovers the sparse arms with a statistical error rate that is optimal in t. This greatly improves the primal performance of our algorithm and thus leads to a sharper regret bound for the BwK problem. We describe this improved regret in Theorem 5.5.

Theorem 5.5. If Z satisfies $\frac{V^{UB}}{C_{\min}} \le Z \le c\,\frac{V^{UB}}{C_{\min}} + c$, then the regret of Algorithm 2 can be upper bounded by
$$\mathrm{Regret}(\pi) \le O\Big(\frac{V^{UB}}{C_{\min}}\sqrt{T\,\log K\,\log(mdK)}\Big)$$
by setting $\delta = O\big(\sqrt{\log m/T}\big)$ and $\epsilon_t = 0$ for each $t\in[T]$.

The rationale behind setting $\epsilon_t = 0$ in Algorithm 2 is that, when the covariate vectors exhibit sufficient diversity, our strategy automatically explores enough arms while simultaneously optimizing the regret. This condition is typically met in online allocation problems, where the optimal strategy is often a distribution over arms rather than a single arm (Badanidiyuru et al., 2018). This starkly contrasts with the classical multi-armed bandit problem, where the optimal solution is typically confined to a single arm. Theorem 5.5 significantly reduces the impact of the primal update on the regret from $\tilde O\big(\big(\frac{V^{UB}}{C_{\min}}\big)^{1/3} T^{2/3}\big)$ to a sharper $\tilde O\big(s_0\sqrt{T}\big)$, which makes the impact of the dual update the dominating factor of the regret, giving the bound $\mathrm{Regret}(\pi) = \tilde O\big(\frac{V^{UB}}{C_{\min}}\sqrt{T}\big)$.

6. Optimal High-dimensional Bandit Algorithm

An important application of our Algorithm 1 is the high-dimensional bandit problem (Carpentier & Munos, 2012; Hao et al., 2020), where we do not consider knapsacks and only focus on reward maximization (equivalently, we can treat the bandit problem as a special BwK problem where the constraints are always met). Here we combine our algorithm with an ϵ-greedy strategy and show that our high-dimensional bandit algorithm based on Online HT achieves both the $\tilde O(s_0^{2/3} T^{2/3})$ optimal regret in the data-poor regime and the $\tilde O(\sqrt{s_0 T})$ optimal regret in the data-rich regime, thus enjoying the so-called best of two worlds.

Algorithm 3 High-Dimensional Bandit by Online HT
1: Input: ϵ-greedy sampling probability $\epsilon_t$ for each t, $\mu^s_{a,0} = 0$, step size $\eta_t$.
2: for t = 1, . . . , T do
3:   Observe the feature $x_t$.
4:   Sample a random variable $\nu_t \sim \mathrm{Ber}(K\epsilon_t)$.
5:   Pull the arm $y_t$ with the $\epsilon_t$-greedy strategy defined as
$$y_t = \begin{cases} \arg\max_{a\in[K]} \langle x_t, \mu^s_{a,t-1}\rangle, & \text{if } \nu_t = 0,\\ a \text{ w.p. } 1/K \text{ for each } a\in[K], & \text{if } \nu_t = 1,\end{cases}$$
and receive a reward $r_t$.
6:   For each $a\in[K]$, update the sparse estimate $\mu^s_{a,t}$ by Algorithm 1 with $p_{a,t} = (1 - K\epsilon_t)\,y_{a,t} + \epsilon_t$.
7: end for

Theorem 6.1. Let $R_{\max} = \sup |\langle x_t, \mu_a^*\rangle|$. Choosing $\epsilon_t = \sigma^{2/3} D^{4/3} s_0^{2/3} \big(\log(dK)\big)^{1/3}\, t^{-1/3} \big/ (R_{\max} K)^{2/3} \wedge 1/K$, our Algorithm 3 incurs the regret
$$\mathrm{Regret}(\pi) \lesssim \frac{R_{\max}^{1/3}\, K^{1/3}\, \sigma^{2/3}\, D^{4/3}\, s_0^{2/3}\, T^{2/3}\, \big(\log(dK)\big)^{1/3}}{\phi_{\min}^{2/3}(s)}.$$

Theorem 6.1 establishes the optimality of our high-dimensional bandit algorithm under minimal assumptions: it matches the $\Omega\big(\phi_{\min}^{-2/3}\, s_0^{2/3}\, T^{2/3}\big)$ lower bound (Jang et al., 2022) in the data-poor regime $d \gtrsim T^{1/3} s_0^{4/3}$.
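A minimal sketch of the ϵ-greedy loop in Algorithm 3, assuming the OnlineHT class sketched in Section 4 is available; the probability fed to each arm's estimator follows step 6, and all helper names are illustrative.

```python
import numpy as np

def run_sparse_bandit(T, K, draw_context, pull, estimators, eps_schedule, seed=1):
    """estimators: list of K OnlineHT objects; draw_context() -> x_t; pull(a, x) -> reward;
    eps_schedule(t) -> epsilon_t. Returns the total collected reward."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for t in range(1, T + 1):
        x = draw_context()
        eps = eps_schedule(t)
        greedy = int(np.argmax([est.mu @ x for est in estimators]))
        a = int(rng.integers(K)) if rng.random() < K * eps else greedy   # epsilon-greedy pull
        r = pull(a, x)
        total += r
        for b, est in enumerate(estimators):
            # Sampling probability of arm b this round: 1-(K-1)*eps for the greedy arm, eps otherwise.
            p = 1 - (K - 1) * eps if b == greedy else eps
            est.update(x, r if b == a else 0.0, int(b == a), p)
    return total
```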
We further show that the same algorithmic framework achieves a better regret under the diverse covariate condition, matching the regret lower bound for the data-rich regime. We present this result in Theorem 6.2.

Theorem 6.2. Suppose $x_t$ is furthermore sparse marginally sub-Gaussian: $\mathbb{E}\exp(u^\top x_t) \le \exp\big(c\,\phi_{\max}(s_0)\,\|u\|_2^2/2\big)$ for any $2s_0$-sparse vector u. Assume the following diverse covariate condition (Ren & Zhou, 2023) holds: there are positive constants γ(K) and ζ(K) such that for any unit vector $v\in\mathbb{R}^d$ and any $a\in[K]$,
$$P\Big(v^\top x_t x_t^\top v\cdot\mathbb{1}\{a_t^* = a\} \ge \gamma(K)\ \Big|\ \mathcal{H}_{t-1}\Big) \ge \zeta(K),$$
where $a_t^* = \arg\max_{a\in[K]}\langle x_t, \mu^s_{a,t-1}\rangle$ is selected greedily. Denote $\kappa_1 = \frac{\phi_{\max}(s)}{\gamma(K)\zeta(K)}$. Setting $\epsilon_t = 0$, the regret of Algorithm 3 is bounded by $\tilde O\big(\sqrt{s_0 T}\big)$ up to a factor depending on $\kappa_1$.

The regret of our bandit algorithm indeed matches the known lower bound for high-dimensional bandit problems, $\Omega(\sqrt{s_0 T})$ (Chu et al., 2011; Ren & Zhou, 2023). Compared with previous LASSO-based frameworks, no additional assumption on the range of the arms (e.g., the $\ell_2$-norm bound on $\mu_a^*$ in (Ren & Zhou, 2023)) or on the minimum signal strength (Hao et al., 2020; Jang et al., 2022) is needed for our algorithm to achieve the optimal regret in the data-rich regime, as long as the diverse covariate condition holds. The sparse marginal sub-Gaussian assumption is used to yield a more precise characterization of the error with respect to $s_0$; without this assumption, there is no $\kappa_1$ term in the regret bound.

7. Numerical Results

7.1. Sparse recovery

We first examine the feasibility of our primal algorithm on the sparse recovery problem. To check the performance of Algorithm 1, suppose we only consider one arm $\mu^*$ and we want to estimate it in an online process. To this end, we always choose $y_t = 1$ and thus $p_t = 1$. At each t, we measure the sparse estimation error $\|\mu^s_t - \mu^*\|_2^2$ and the support recovery rate $|\mathrm{supp}(\mu^s_t)\cap\Omega^*|/s_0$, which indicates the fraction of the support set we have detected. The result is presented in Figure 1. Here we set d = 1000, $s_0$ = 10, σ = 0.5, and Σ to be the power-decaying covariance matrix $\Sigma_{ij} = \alpha^{|i-j|}$ with α = 0.5. Compared with the prevalent LASSO method used in online high-dimensional bandit problems (Kim & Paik, 2019; Hao et al., 2020; Ren & Zhou, 2023), our method enjoys an efficient computational cost while achieving a better estimation error; see Figure 1 for the arm estimation and support set recovery of our method. To be specific, the computational cost of Online HT is $O(d^2)$ per iteration and $O(d^2 T)$ in total, while the computational cost of the classical LASSO solution is $O(d^3 + d^2 t)$ per iteration (Efron et al., 2004) and $O(d^3 T + d^2 T^2)$ in total if we require constant updates of the estimation, e.g., (Kim & Paik, 2019; Ren & Zhou, 2023). For LASSO, we select the regularization level $\lambda = c\sqrt{\log d / t}$, where c is set to {5, 1, 0.1}, respectively. One huge advantage that distinguishes our method from LASSO or soft thresholding methods (Han et al., 2023a) is that we achieve a guaranteed exact $s_0$-sparse estimation without parameter tuning.

Figure 1. Primal performance of Online HT vs LASSO.
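The two metrics tracked in this experiment can be computed as below; a small sketch in which the correlated covariates are drawn from the power-decaying covariance via a Cholesky factor (illustrative code, not the authors' implementation).

```python
import numpy as np

def support_recovery_rate(mu_hat, true_support):
    """|supp(mu_hat) ∩ Omega*| / s0: fraction of the true support detected."""
    detected = set(np.flatnonzero(mu_hat))
    return len(detected & set(true_support)) / len(true_support)

def draw_covariates(n, Sigma, seed=0):
    """Draw n covariates x_t ~ N(0, Sigma), e.g. Sigma_ij = 0.5^{|i-j|} as in Section 7.1."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    return rng.standard_normal((n, Sigma.shape[0])) @ L.T

# Example usage with a sparse ground truth mu_star supported on Omega_star:
# err_t  = np.sum((mu_hat_t - mu_star) ** 2)
# rate_t = support_recovery_rate(mu_hat_t, Omega_star)
```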
7.2. Online bandit problem

We then apply Algorithm 3 to the high-dimensional linear bandit problem and the primal-dual Algorithm 2 to the linear BwK problem to corroborate our study of the regret. For the bandit problem, we choose d = 100, $s_0$ = 10, K = 5. The covariates are still generated following Section 7.1. We study the regret accumulation for a fixed T and the regret growth with respect to different values of T, respectively. The result is presented in Figure 2. Here, we mainly compare our ϵ-greedy Online HT method with the LASSO bandit algorithm (Explore-Then-Commit method) in, e.g., (Hao et al., 2020; Li et al., 2022; Jang et al., 2022). In our simulation, we try different lengths of the exploration phase $t_1$ for the LASSO bandit algorithm, namely $t_1 = 0.3\,T^{2/3}$ and $t_1 = 0.5\,T^{2/3}$. The greedy Online HT means we simply set each $\epsilon_t = 0$. It can be observed that our method outperforms the LASSO bandit algorithm in regret growth, and the greedy Online HT shows far slower regret growth than the other algorithms.

Figure 2. Regret of Online HT vs LASSO Bandit.

7.3. High-dimensional BwK

We now focus on the linear BwK problem with high-dimensional sparse arms. We show the performance of our algorithm, together with the classic UCB-based linear BwK algorithm, i.e., linCBwK (Agrawal & Devanur, 2016), to demonstrate the feasibility of the Online HT method. Notice that, in the original paper (Agrawal & Devanur, 2016), the linCBwK algorithm is designed for the Model-C bandit problem, but it can be easily generalized to our Model-P setting by computing the UCBs of multiple arms at the same time. We set d = 200, $s_0$ = 10, K = 5, with covariates generated following Section 7.1. The constraints are randomly generated following a uniform distribution with m = 5, and each row of $W_a^*$ is also sparse with the same support set as $\mu_a^*$. We present our method's regret and relative regret in Figure 3, where the relative regret is defined by Regret/OPT. It can be observed that when T is small, linCBwK fails to control the cumulative regret due to the high dimensionality of the problem. As T grows larger, the impact of high dimensionality decreases and the two methods behave comparably. The relative regret curves also show this phenomenon: our Online HT method has a faster convergence rate of the relative regret in the data-poor regime.

Figure 3. Regret of Online HT vs linCBwK for the CBwK problem.

Impact Statement

This paper derives a new online sparse estimator and develops a unified approach to solve the online allocation problem with high-dimensional covariates. The work presented in this paper advances the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Agrawal, S. and Devanur, N. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.

Agrawal, S. and Devanur, N. R. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pp. 989-1006, 2014.

Ariu, K., Abe, K., and Proutière, A. Thresholded lasso bandit. In International Conference on Machine Learning, pp. 878-928. PMLR, 2022.

Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 207-216. IEEE, 2013.

Badanidiyuru, A., Langford, J., and Slivkins, A. Resourceful contextual bandits. In Conference on Learning Theory, pp. 1109-1134. PMLR, 2014.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1-55, 2018.

Balseiro, S. R., Lu, H., and Mirrokni, V. The best of many worlds: Dual mirror descent for online allocation problems. Operations Research, 71(1):101-119, 2023.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294, 2020.

Blumensath, T. and Davies, M. E.
Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265 274, 2009. Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. univ. press, 2013. Cai, J.-F., Li, J., and Xia, D. Generalized low-rank plus sparse tensor estimation by fast riemannian optimization. Journal of the American Statistical Association, pp. 1 17, 2022. Cai, J.-F., Li, J., and Xia, D. Online tensor learning: Computational and statistical trade-offs, adaptivity and optimal regret. ar Xiv preprint ar Xiv:2306.03372, 2023. High-dimensional Linear Bandits with Knapsacks Carpentier, A. and Munos, R. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pp. 190 198. PMLR, 2012. Chen, H., Lu, W., and Song, R. Statistical inference for online decision making via stochastic gradient descent. Journal of the American Statistical Association, 116(534): 708 719, 2021. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407 451, 2004. Freedman, D. A. On tail probabilities for martingales. the Annals of Probability, pp. 100 118, 1975. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997. Han, R., Luo, L., Lin, Y., and Huang, J. Online inference with debiased stochastic gradient descent. Biometrika, pp. asad046, 2023a. Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. Sequential batch learning in finite-action linear contextual bandits. ar Xiv preprint ar Xiv:2004.06321, 2020. Han, Y., Zeng, J., Wang, Y., Xiang, Y., and Zhang, J. Optimal contextual bandits with knapsacks under realizability via regression oracles. In International Conference on Artificial Intelligence and Statistics, pp. 5011 5035. PMLR, 2023b. Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753 10763, 2020. Immorlica, N., Sankararaman, K., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. Journal of the ACM, 69(6):1 47, 2022. Jang, K., Zhang, C., and Jun, K.-S. Popart: Efficient sparse regression and experimental design for optimal sparse linear bandits. Advances in Neural Information Processing Systems, 35:2102 2114, 2022. Jiang, J., Li, X., and Zhang, J. Online stochastic optimization with wasserstein based non-stationarity. ar Xiv preprint ar Xiv:2012.06961, 2020. Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. Advances in Neural Information Processing Systems, 32, 2019. Kressner, D., Steinlechner, M., and Vandereycken, B. Lowrank tensor completion by riemannian optimization. BIT Numerical Mathematics, 54:447 468, 2014. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207 1216, Stanford, CA, 2000. Morgan Kaufmann. Li, W., Barik, A., and Honorio, J. A simple unified framework for high dimensional bandit problems. In International Conference on Machine Learning, pp. 12619 12655. PMLR, 2022. Li, X., Sun, C., and Ye, Y. 
The symmetry between arms and knapsacks: A primal-dual approach for bandits with knapsacks. In International Conference on Machine Learning, pp. 6483 6492. PMLR, 2021. Liu, S., Jiang, J., and Li, X. Non-stationary bandits with knapsacks. Advances in Neural Information Processing Systems, 35:16522 16532, 2022. Ma, W., Cao, Y., Tsang, D. H., and Xia, D. Optimal regularized online convex allocation by adaptive re-solving. ar Xiv preprint ar Xiv:2209.00399, 2022. Meinshausen, N. and Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1), 2008. Murata, T. and Suzuki, T. Sample efficient stochastic gradient iterative hard thresholding method for stochastic sparse linear regression with limited attribute observation. Advances in Neural Information Processing Systems, 31, 2018. Nguyen, N., Needell, D., and Woolf, T. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63 (11):6869 6895, 2017. Oh, M.-h., Iyengar, G., and Zeevi, A. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, pp. 8271 8280. PMLR, 2021. Ren, Z. and Zhou, Z. Dynamic batch learning in highdimensional sparse linear contextual bandits. Management Science, 2023. Shen, J. and Li, P. A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650 7691, 2017. High-dimensional Linear Bandits with Knapsacks Slivkins, A., Sankararaman, K. A., and Foster, D. J. Contextual bandits with packing and covering constraints: A modular lagrangian approach via regression. In The Thirty Sixth Annual Conference on Learning Theory, pp. 4633 4656. PMLR, 2023. Tsybakov, A. and Rigollet, P. Exponential screening and optimal rates of sparse estimation. Annals of Statistics, 39(2):731 771, 2011. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge university press, 2019. Wang, X., Wei, M., and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pp. 5200 5208. PMLR, 2018. Ye, F. and Zhang, C.-H. Rate minimaxity of the lasso and dantzig selector for the lq loss in lr balls. The Journal of Machine Learning Research, 11:3519 3540, 2010. Yuan, X. and Li, P. Stability and risk bounds of iterative hard thresholding. In International Conference on Artificial Intelligence and Statistics, pp. 1702 1710. PMLR, 2021. Zhou, P., Yuan, X., and Feng, J. Efficient stochastic gradient hard thresholding. Advances in Neural Information Processing Systems, 31, 2018. High-dimensional Linear Bandits with Knapsacks Supplement to High-dimensional Linear Bandits with Knapsacks A. Further Related Literature A.1. Related Literature Bandit with knapsacks problem (Badanidiyuru et al., 2013; Agrawal & Devanur, 2014) can be viewed as a special case of online allocation problem, where reward functions are unknown for decision-makers. Unlike other resource allocation problems (Jiang et al., 2020; Balseiro et al., 2023; Ma et al., 2022), the Bw K problem poses strong demands on balancing exploration and exploitation. In the face of uncertainty, this trade-off is mainly handled by, e.g., elimination-based algorithms (Badanidiyuru et al., 2013; 2018), or UCB (Agrawal & Devanur, 2014), or primal-dual algorithms (Badanidiyuru et al., 2013; Li et al., 2021). 
In the contextual Bw K problem (CBw K), some well-established methods have been proposed, including policy elimination (Badanidiyuru et al., 2014) and UCB-type algorithm (Agrawal & Devanur, 2016). Recently, (Slivkins et al., 2023) summarized a general primal-dual framework for contextual Bw K with a regression-based primal algorithm. However, the currently well-known CBw K methods (Badanidiyuru et al., 2014; Agrawal & Devanur, 2016; Slivkins et al., 2023) all suffer from O( d) dependence on the dimension in the regret, which hugely confines their applicants to the low-dimensional case. The failure of classic CBw K methods for large d strongly motivates us to explore the CBw K problem with high-dimensional contexts, which is frequently encountered in the real world, like user-specific recommendations and personalized treatments (Bastani & Bayati, 2020). To study high-dimensional CBw K problems, naturally, we may think of learning experiences from high-dimensional contextual bandit problems. As the origin of the CBw K problem, the contextual bandit problem has been more actively studied in high-dimensional settings. Based on the LASSO method, many sampling strategies have been devised. Noticeable force-sampling strategy in (Bastani & Bayati, 2020) achieves a regret O s2 0 (log d + log T)2 under the margin condition, and has been improved by (Wang et al., 2018) to a sharper minimax rate O s2 0 (log d + s0) log T . (Kim & Paik, 2019) has constructed a doubly-robust ε-greedy sampling strategy by re-solving LASSO, yielding a regret of order e O(s0 T). (Hao et al., 2020; Li et al., 2022; Jang et al., 2022) introduced an Explore-then-Commit LASSO bandit framework with regret e O(s2/3 0 T 2/3). As is shown in (Jang et al., 2022), the regret lower bound of sparse bandit problem is Ω ϕ 2/3 min s2/3 0 T 2/3 in the data-poor regime d T 1 3 s 4 3 0 . However, another stream of work showed that, for the general data-rich regime, the optimal regret is of order Ω( s0T) (Chu et al., 2011; Ren & Zhou, 2023) and can be obtained with additional covariate conditions, for example, diverse covariate condition (Ren & Zhou, 2023), and balanced covariance condition, (Oh et al., 2021; Ariu et al., 2022), etc. The two-phase optimal regret of the sparse bandit problem leads to an open question, i.e., can we achieve the best of two worlds of sparse bandit problem in both data-poor and data-rich regimes with a unified framework (Hao et al., 2020)? In our paper, we will answer this question affirmatively by providing our Online HT algorithm in the sparse bandit setting. The idea of hard thresholding is applied in our methodology for the consecutive online estimation. Hard thresholding finds its application in sparse recovery primarily for the iterative hard thresholding methods (Blumensath & Davies, 2009). One of the most intriguing properties of hard thresholding is that it can return an exact sparse estimation given any sparsity level. Nonetheless, the poor smoothness behavior inhered in the hard thresholding projector (Shen & Li, 2017) makes it difficult to analyze the error for iterative methods, especially for stochastic gradient descent methods with large variances. Therefore, current applications of hard thresholding mainly focus on batch learning (Nguyen et al., 2017; Yuan & Li, 2021) or hybrid learning (Zhou et al., 2018), while hard thresholding methods for online learning are still largely unexplored. B. Addidtional Results B.1. Estimation errors Corollary B.1 is for the uniform error bound of estimating µ a. 
Corollary B.1. Under the same condition as Theorem 4.1, we have the following uniform bound for the estimations over all arms E max a [K] µs a,t µ a 2 2 σ2D2s0 ϕ2 min(s) log(d K) High-dimensional Linear Bandits with Knapsacks Moreover, the exact uniform error bound for estimating W a is given in the following proposition: Proposition B.2. Under the same conditions as Theorem 4.1, using Algorithm 1 to estimate W a will lead to the uniform error bound: E max a [K] c Wa,t W a 2 2,max σ2D2s0 ϕ2 min(s) log(md K) B.2. Obtain parameter Z We now show the procedure for computing the parameter Z to serve as an input to Algorithm 2. The procedure is similar to that in (Agrawal & Devanur, 2016), however, we will use the estimator obtained in Algorithm 1. To be specific, we specify a parameter T0 and we use the first T0 periods to obtain an estimate of V UB. Then, the estimate can be obtained by solving the following linear programming. ˆV = max T T0 a [K] (µs a,T0) xt ya,t (5a) a [K] W a xt ya,t C + δ (5b) ya,t [0, 1], X a [K] ya,t = 1, t [T0]. (5c) We have the following bound regarding the gap between the value of V UB and its estimate ˆV . Lemma B.3. If setting δ 2TD maxa [K] W a c Wa,T0 µ a µs a,T0 1, then with probability at least 1 β, it holds that ˆV V UB C V UB Cmin + (1 + δ)TD r 1 Cmin T0 log 1 Therefore, by uniform sampling from time 1 to T0, we can simply set Z = O ˆV Cmin , and as long as T0 = C2 min log 1 β , we get that Z = O( V UB Cmin + 1) with probability at least 1 β from the high probability bound of our sparse estimator in Theorem 4.1. If further the constraints grow linearly, i.e., Cmin = Ω(T), we only require T0 = O s2 0 log 1 β in general. C. Proofs of Main Results C.1. Proof of Theorem 4.1 Proof. We first denote eµt = µt 1 ηtgt, and the support Ω= Ωt Ωt 1 Ω as the union of the support set of µt, µt 1, and µ . We shall use PΩ(x) to represent the projection onto the support Ω. In the following proof, we will mainly focus on the s-sparse estimation µt rather than the exact s0-sparse estimation µs t since µs t = Hs0(µt) and thus µs t µ 2 2 µt µ 2 by (Shen & Li, 2017). For the iterative method, we have µt µ 2 2 = Hs(PΩ(eµt)) µ 2 2 PΩ(eµt) µ 2 2, High-dimensional Linear Bandits with Knapsacks by the tight bound of hard thresholding operator (Shen & Li, 2017). Here ρ = s0/s is the relative sparsity level. By selecting a small enough ρ (e.g., ρ 1 4), it is clear that µt µ 2 2 1 + 3 2 ρ PΩ(eµt) µ 2 2 2 ρ µt 1 µ 2 2 2ηt PΩ(gt), µt 1 µ + η2 t PΩ(gt) 2 2 2 ρ µt 1 µ 2 2 2ηt f(µt 1), µt 1 µ + 2η2 t PΩ(gt f(µt 1)) 2 2 +2η2 t PΩ( f(µt 1)) 2 2 + 2ηt PΩ(gt f(µt 1)) 2 µt 1 µ 2 , where we use the fact that f(µt 1), µt 1 µ = PΩ( f(µt 1)), µt 1 µ by the definition of PΩ( ). Now, applying the restricted strong convexity and smoothness condition from Assumption 3.2: f(µt 1), µt 1 µ 2ϕmin(s) µt 1 µ 2 2 PΩ( f(µt 1)) 2ϕmax(s) µt 1 µ 2, We can show that µt µ 2 2 1 + 3 2 ρ 1 4ϕmin(s)ηt + 8η2 t ϕ2 max(s) µt 1 µ 2 2 + 6η2 t PΩ(gt f(µt 1)) 2 2 + 6ηt PΩ(gt f(µt 1)) 2 µt 1 µ 2 2 ρ 1 4ϕmin(s)ηt + 8η2 t ϕ2 max(s) µt 1 µ 2 2 + 18sη2 t max i [d] | gt f(µt 1), ei |2 + 18ηt s max i [d] | gt f(µt 1), ei | µt 1 µ 2 The following lemma quantifies the variation of the stochastic gradient: Lemma C.1. Define {ei}d 1 as the canonical basis of Rd. 
The variance of stochastic gradient gt at the point µt 1 can be bounded by the following inequality: E max i [d] | gt f(µt 1), ei |2 C s D2 log(dt) E µt 1 µ 2 2 + C σ2D2(Pt j=1 1/pj) log d Moreover, the following inequality also holds with probability at least 1 ϵ max i [d] | gt f(µt 1), ei |2 Cs D2 log(d/ϵ) µt 1 µ 2 2 + C σ2D2(Pt j=1 1/pj) log(d/ϵ) With Lemma C.1, we are able to derive the expectation bound and probability bound respectively. For the expectation bound, we have E µt µ 2 2 1 + 3 2 ρ 1 4ϕmin(s)ηt + 8η2 t ϕ2 max(s) E µt 1 µ 2 2 + 18sη2 t E max i [d] | gt f(µt 1), ei |2 E max i [d] | gt f(µt 1), ei |2q E µt 1 µ 2 2 High-dimensional Linear Bandits with Knapsacks We set ρ = 1 9κ4 , and ηt = 1 4κϕmax(s). Plugging in the expectation bound in Lemma C.1, we have 1 1 4κ4 + C s0D p log(dt) ϕmin(s)t E µt 1 µ 2 2 + C s0σ2D2(Pt j=1 1/pj) log d ϕ2 min(s)t2 + C s0σ2D2(Pt j=1 1/pj) log d ϕ2 min(s)t2 E µt 1 µ 2 2. When t is sufficiently large, essentially we have E µt µ 2 2 1 1 5κ4 E µt 1 µ 2 2 + C s0σ2D2(Pt j=1 1/pj) log d ϕ2 min(s)t2 + C s0σ2D2(Pt j=1 1/pj) log d ϕ2 min(s)t2 E µt 1 µ 2 2. This instantly gives us the expectation bound E µt µ 2 2 σ2D2s0 ϕ2 min(s) log d which proves the first claim. Following a similar fashion, we can also prove the high-probability bound: with probability at least 1 ϵ, we have 1 1 4κ4 + C s0D p log(d T/ϵ) ϕmin(s)t + C s0σ2(Pt j=1 1/pj) log(d T/ϵ) ϕ2 min(s)t2 + C s0σ2(Pt j=1 1/pj) log(d T/ϵ) ϕ2 min(s)t2 µt 1 µ 2, for all the t [T]. When t is sufficiently large, essentially we have µt µ 2 2 1 1 5κ4 + C s0σ2D2(Pt j=1 1/pj) log(d T/ϵ) ϕ2 min(s)t2 + C s0σ2D2(Pt j=1 1/pj) log(d T/ϵ) ϕ2 min(s)t2 µt 1 µ 2. It is therefore clear that µt µ 2 2 σ2D2s0 ϕ2 min(s) log(d T/ε) holds for all t [T] with probability at least 1 ϵ. Thus, we finish the proof. C.2. Proof of Lemma C.1 Proof. Define {ei}d 1 as the canonical basis of Rd. Since gt = 2bΣtµt 1 2 j=1 yjxjrj/pt = 2 (µt 1 µ ) 2 j=1 yjxjξj/pt, = 2bΣt(µt 1 µ ) 2 j=1 yjxjξj/pt High-dimensional Linear Bandits with Knapsacks | gt f(µt 1), ei | = 2 bΣt Σ (µt 1 µ ) 2 j=1 yjxjξj/pt, ei 2 D bΣt Σ (µt 1 µ ), ei E | {z } Part 1 j=1 yjxj,iξj/pt | {z } Part 2 We consider the two parts separately. Notice that, in the first part, µt 1 µ is at most 2s-sparse, which means that the first part can be bounded by D bΣt Σ (µt 1 µ ), ei E 2 max i,j [d] bΣt,ij Σij µt 1 µ ℓ1 2s max i,k [d] j=1 yjxj,ixj,k/pj Σik Here we use the H older s inequality. The concentration of maxi,k [d] 1 t Pt j=1 yjxj,ixj,k/pj Σik implies that: max i,k [d] j=1 yjxj,ixj,k/pj Σik d2 max i,k [d] P j=1 yjxj,ixj,k/pj Σik By the martingale structure of 1 t Pt j=1 yjxj,ixj,k/pj Σik: E [yjxj,ixj,k/pj Σik|Hj 1] = 0, |yjxj,ixj,k/pj Σik| 2D2/pj, We can use the Bernstein-type martingale concentration inequality in Lemma C.2 to derive the following bound: j=1 yjxj,ixj,k/pj Σik D4(Pt j=1 1/pj)/t2 + 2D2z/(tpt) where we select v2 = D4(Pt j=1 1/pj)/t2, and b = 2D2/(tpt). Thus, with the probability at least 1 ϵ, we can control the concentration at the level: j=1 yjxj,ixj,k/pj Σik log(1/ϵ) + CD2 1 For simplicity, we only consider pj = j α. Then, when α 1 3, the tail can be controlled by the level j=1 yjxj,ixj,k/pj Σik log(1/ϵ) = Lϵ For the bound on the expectation, we have E max i [d] D bΣt Σ (µt 1 µ ), ei E 2 8s E max i,k [d] j=1 yjxj,ixj,k/pj Σik High-dimensional Linear Bandits with Knapsacks Define µ as an upper bound of the µ 2 which can as large as O(Poly(d)). We choose ϵ = σ2 s2d2(Pt j=1 1/pj) µ2D2 . 
It follows E max i [d] D bΣt Σ (µt 1 µ ), ei E 2 max i,k [d] j=1 yjxj,ixj,k/pj Σik L2 ϵ µt 1 µ 2 2 max i,k [d] j=1 yjxj,ixj,k/pj Σik Cs L2 ϵE µt 1 µ 2 2 + C σ2 j=1 1/pj) log(dt) + log µD2 E µt 1 µ 2 2 + C σ2D2 This gives the upper bound of Part 1. We now proceed to control Part 2 analogously. Invoke Lemma C.2 again, we select v2 = σ2D2(Pt j=1 1/pj)/t2, and b = σD/(tpt). We then have the concentration bound: j=1 yjxj,iξj/pt σ2(Pt j=1 1/pj)/t2 + 2σz/(tpt) 2σ2D2(Pt j=1 1/pj)/t2 + 4 exp cz 4σD/(tpt) and the tail on the maximum: j=1 yjxj,iξj/pt 2σ2D2(Pt j=1 1/pj)/t2 + 4d exp cz 4σD/(tpt) 2σ2D2(Pt j=1 1/pj)/t2 + log d + 4d exp cz 4σD/(tpt) + log d According to the tail-to-expectation formula: EX2 = 2 R z P(|X| > z)dz, we have E max i [d] j=1 yjxj,iξj/pt 2σ2D2(Pt j=1 1/pj)/t2 + log d 0 z exp cz 4σD/(tpt) + log d dz 0 zdz + 8 Z 2σ2D2(Pt j=1 1/pj)/t2 + log d 0 zdz + 8 Z z2 z exp cz 4σD/(tpt) + log d dz σ2D2(Pt j=1 1/pj) log d t2 + σD log d tpt + σ2D2 log d2 C σ2D2(Pt j=1 1/pj) log d High-dimensional Linear Bandits with Knapsacks Here in the second inequality we choose z1 = q cσ2D2(Pt j=1 1/pj) log d/t2, and z2 = cσD log d/(tpj), and compute the integration by substitution. Combining Part 1 and Part 2, we have E max i [d] | gt f(µt 1), ei |2 8E max i [d] D bΣt Σ (µt 1 µ ), ei E 2 + 8E max i [d] j=1 yjxj,iξj/pt j=1 1/pj) log(dt) + log µD2 E µt 1 µ 2 2 + C σ2D2(Pt j=1 1/pj) log d C s D2 log(dt) j=1 1/pj)E µt 1 µ 2 2 + C σ2D2(Pt j=1 1/pj) log d which gives us the first claim, the expectation bound. For the second claim, the probability bound, we only need to apply the aforementioned tail bound to Part 1 and 2 again. With Lemma C.2, it is clear that with probability at least 1 ϵ, max i,k [d] j=1 yjxj,ixj,k/pj Σik and with probability at least 1 ϵ, j=1 yjxj,iξj/pt σD log(d/ϵ) log(d/ϵ) C σD Therefore, with probability at least 1 ϵ, the variation can be controlled by max i [d] | gt f(µt 1), ei |2 Cs D2 log(d/ϵ) µt 1 µ 2 2 + C σ2D2(Pt j=1 1/pj) log(d/ϵ) Lemma C.2 (Bernstein-type Martingale Concentration for Heterogeneous Variables). Suppose {Dt}T t=1 are martingale differences that are adapted to the filtration {Ft}T 1 t=0 , i.e., E[Dt|Ft 1] = 0. If {Dt}T t=1 satisfies 1. PT t=1 Var(Dt|Ft 1) v2, 2. E h |Dt|k Ft 1 i k!bk 2, for any k 3. Then, there exists a universal constant c such that the following probability bound holds This is a general version of Bernstein-type martingale concentration inequality (Freedman, 1975). The Lemma C.2 can be easily justified by applying the martingale argument to the classic Bernstein inequality (see, for example, (Boucheron et al., 2013), (Wainwright, 2019)). The key idea is to show that, conditional on the history Ft 1, the moment-generating function of each Dt can be bounded by exp λ2σ2 t 1 b|λ| (up to some constant factor) with the individual variance σ2 t . High-dimensional Linear Bandits with Knapsacks C.3. Proof of Corollary B.1 Proof. From the proof of Theorem 4.1, we can easily derive the following bound from equation (6): max a µa,t µ a 2 2 1 + 3 2 ρ 1 4ϕmin(s)ηt + 8η2 t ϕ2 max(s) max a µa,t µ a 2 2 + 18sη2 t max i [d],a | ga,t fa(µa,t 1), ei |2 + 18ηt s max i [d],a | ga,t fa(µa,t 1), ei | max a µa,t 1 µ a 2. (9) Analogous to the proof of Lemma C.1, we can also prove that Lemma C.3. We have E max i [d],a | ga,t fa(µa,t 1), ei |2 C s D2 log(d Kt) E max a µa,t µ a 2 2 + C σ2D2(Pt j=1 1/pj) log(d K) Here, we have an extra log K term compared with Lemma C.1 because we take the maximum overall arms. 
Together with (9), we can essentially show that
\[
\mathbb{E}\max_a\big\|\mu^s_{a,t}-\mu^*_a\big\|_2^2 \;\lesssim\; \frac{\sigma^2D^2s_0}{\phi_{\min}^2(s)}\cdot\frac{\big(\sum_{j=1}^t 1/p_j\big)\log(dK)}{t^2}.
\]

C.4. Proof of Proposition B.2

Proof. The proof is analogous to the proof of Corollary B.1. Notice that, if we substitute $\mu^*_a$ with $W^*_{a,i}$ and substitute $r_t$ with $b_i(a, x_t)$, then for each $i$ we have
\[
\max_a\|W_{a,i,t}-W^*_{a,i}\|_2^2 \;\le\; \Big(1+\tfrac32\rho\Big)\Big(1-4\phi_{\min}(s)\eta_t+8\eta_t^2\phi_{\max}^2(s)\Big)\max_a\|W_{a,i,t-1}-W^*_{a,i}\|_2^2 + 18s\eta_t^2\max_{j\in[d],a}\big|\langle \bar g^i_{a,t}-\nabla f^i_a(W_{a,i,t-1}), e_j\rangle\big|^2 + 18\eta_t\sqrt{s}\max_{j\in[d],a}\big|\langle \bar g^i_{a,t}-\nabla f^i_a(W_{a,i,t-1}), e_j\rangle\big|\,\max_a\|W_{a,i,t-1}-W^*_{a,i}\|_2. \quad (10)
\]
Here $W_{a,i,t}$ is the $s$-sparse estimation of $W^*_{a,i}$, and $\widehat W_{a,i,t} = H_{s_0}(W_{a,i,t})$ is the exact $s_0$-sparse estimation; $\bar g^i_{a,t}$ denotes the corresponding averaged stochastic gradient for estimating $W^*_{a,i}$. Taking the maximum over $i\in[m]$ on both sides of (10), we can derive that the $(2,\max)$-norm, i.e., $\max_a\|W_{a,t}-W^*_a\|_{2,\max}^2 = \max_{i\in[m],a}\|W_{a,i,t}-W^*_{a,i}\|_2^2$, can be controlled by the variance of the gradient:
\[
\mathbb{E}\max_{i\in[m],j\in[d],a}\big|\langle \bar g^i_{a,t}-\nabla f^i_a(W_{a,i,t-1}), e_j\rangle\big|^2 \;\le\; \frac{C\,s\,D^2\log(dKmt)\big(\sum_{j=1}^t 1/p_j\big)}{t^2}\,\mathbb{E}\max_{i\in[m],a}\|W_{a,i,t-1}-W^*_{a,i}\|_2^2 + \frac{C\,\sigma^2D^2\big(\sum_{j=1}^t 1/p_j\big)\log(dKm)}{t^2}.
\]
This, similar to Lemma C.3, can be derived from the proof of Lemma C.1 by just changing the number of elements over which the maximum is taken. This leads to the expectation bound for estimating $W^*_a$:
\[
\mathbb{E}\max_{a\in[K]}\big\|W_{a,t}-W^*_a\big\|_{2,\max}^2 \;\lesssim\; \frac{\sigma^2D^2s_0}{\phi_{\min}^2(s)}\cdot\frac{\big(\sum_{j=1}^t 1/p_j\big)\log(mdK)}{t^2}.
\]
Using the property of the hard thresholding operator (Shen & Li, 2017), we can conclude our proof of the bound on $\mathbb{E}\max_{a\in[K]}\|\widehat W_{a,t}-W^*_a\|_{2,\max}$.

C.5. Proof of Theorem 5.1

Proof. For simplicity, we write the sparse estimations of all the $\mu^s_{a,t}$ collectively as $M_t \in \mathbb{R}^{d\times K}$ in the following regret analysis of the CBwK problem, with the corresponding true parameter matrix $M^* \in \mathbb{R}^{d\times K}$. We denote by $\tau$ the time period at which one of the resources is depleted, or let $\tau = T$ if there are remaining resources at the end of the horizon. Note that, by the decision rule of the algorithm, for each $t$, with probability $1-K\epsilon_t$, we have
\[
(M_{t-1}^\top x_t)^\top y_t(x_t) - Z\,\eta_t^\top \sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y_{a,t}(x_t) \;\ge\; (M_{t-1}^\top x_t)^\top y^*(x_t) - Z\,\eta_t^\top \sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y^*_a(x_t), \quad (11)
\]
where we denote by $y^*\in\mathbb{R}^K$ one optimal solution to $V^{UB}$. On the other hand, with probability $K\epsilon_t$, we pull an arm randomly in the execution of Algorithm 2, which implies that
\[
(M_{t-1}^\top x_t)^\top y_t(x_t) - Z\,\eta_t^\top \sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y_{a,t}(x_t) \;\ge\; (M_{t-1}^\top x_t)^\top y^*(x_t) - Z\,\eta_t^\top \sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y^*_a(x_t) - 2R_{\max} - 2\bar D Z, \quad (12)
\]
since $(M_{t-1}^\top x_t)^\top y_t(x_t) - Z\,\eta_t^\top\sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y_{a,t}(x_t) \ge -R_{\max}-\bar D Z$ and $(M_{t-1}^\top x_t)^\top y^*(x_t) - Z\,\eta_t^\top\sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y^*_a(x_t) \le R_{\max}+\bar D Z$. Then, we take expectations on both sides of (11) and (12) and sum over $t$ from $t=1$ to $t=\tau$ to obtain
\[
\mathbb{E}\sum_{t=1}^{\tau}\Big[(M_{t-1}^\top x_t)^\top y_t(x_t) - Z\,\eta_t^\top \sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y_{a,t}(x_t)\Big] \;\ge\; \mathbb{E}\sum_{t=1}^{\tau}\Big[(M_{t-1}^\top x_t)^\top y^*(x_t) - Z\,\eta_t^\top \sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y^*_a(x_t)\Big] - 2(R_{\max}+\bar D Z)\sum_{t=1}^{T}K\epsilon_t. \quad (13)
\]
We can substitute $M_{t-1}$ and $\widehat W_{a,t}$ with their true values $M^*$ and $W^*_a$ using the following inequalities:
\[
\mathbb{E}\sum_{t=1}^{\tau}(M_{t-1}^\top x_t)^\top y^*(x_t) \;\ge\; \mathbb{E}\sum_{t=1}^{\tau}\big((M^*)^\top x_t\big)^\top y^*(x_t) - \mathbb{E}\sum_{t=1}^{T}\max_a\big|\big\langle x_t, \mu^*_a-\mu^s_{a,t-1}\big\rangle\big|, \qquad \mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\sum_{a\in[K]}\widehat W_{a,t}^\top x_t\, y^*_a(x_t) \;\le\; \mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\sum_{a\in[K]}W_a^{*\top} x_t\, y^*_a(x_t) + \bar D\,\mathbb{E}\sum_{t=1}^{T}\max_a\big\|\widehat W_{a,t}-W^*_a\big\|. \quad (14)
\]
Also, notice that $y^*\in\mathbb{R}^K$ is the optimal solution to $V^{UB}$, which means that
\[
\mathbb{E}\sum_{t=1}^{\tau}\sum_{a\in[K]}W_a^{*\top} x_t\, y^*_a(x_t) \;\le\; \frac{\tau}{T}\,C. \quad (15)
\]
Moreover, from the dual update rule, we have the following result.

Lemma C.4. For any $\eta$, it holds that
\[
\mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\Big(b(y_t,x_t) - \frac{C}{T}\Big) \;\ge\; \mathbb{E}\sum_{t=1}^{\tau}\eta^\top\Big(b(y_t,x_t) - \frac{C}{T}\Big) - R(T,\eta),
\]
where $R(T,\eta)$ denotes the regret of the Hedge algorithm given any $\eta$. For simplicity, we shall write this regret as $R(T)$ in the following discussion. Therefore, from Lemma C.4, we know that, for any fixed $\eta$,
\[
\mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\Big(\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}(x_t) - \frac{C}{T}\Big) \;\ge\; \mathbb{E}\sum_{t=1}^{\tau}\eta^\top\Big(\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}(x_t) - \frac{C}{T}\Big) - \mathbb{E}\sup_{\eta}R(T,\eta). \quad (16)
\]
Here we use the fact that $\mathbb{E}\,\eta_t^\top b(y_t, x_t) = \mathbb{E}\,\eta_t^\top\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}(x_t)$ because $\eta_t\in\sigma(\mathcal{H}_{t-1})$.
We now lower bound the right-hand side of (16). We consider two cases.

(I) If $\tau < T$, which implies that $\sum_{t=1}^{\tau}\sum_{a\in[K]}W_{a,i}^{*\top}x_t\,y_{a,t}(x_t) \ge C_i$ for some resource $i\in[m]$, we set $\eta = e_i$ in (16) and we have
\[
\mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\Big(\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}(x_t)-\frac{C}{T}\Big) \;\ge\; \mathbb{E}\Big[\mathbf{1}\{\tau<T\}\Big(C_i-\frac{\tau}{T}C_i\Big)\Big] - R(T,e_i) - 2R_{\max} \;\ge\; \mathbb{E}\Big[\mathbf{1}\{\tau<T\}\,\frac{T-\tau}{T}\,C_{\min}\Big] - R(T,e_i) - 2R_{\max}. \quad (17)
\]

(II) If $\tau = T$, which implies $\frac{T-\tau}{T} = 0$, we set $\eta = 0$ in (16) and we have
\[
\mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\Big(\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}(x_t)-\frac{C}{T}\Big) \;\ge\; -R(T,0) - 2R_{\max} \;=\; \mathbb{E}\Big[\mathbf{1}\{\tau=T\}\,\frac{T-\tau}{T}\,C_{\min}\Big] - R(T,0) - 2R_{\max}, \quad (18)
\]
where $C_{\min} = \min_{i\in[m]}\{C_i\}$. Therefore, combining (17) and (18) as the lower bound of (16), we obtain
\[
\mathbb{E}\sum_{t=1}^{\tau}\eta_t^\top\Big(\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}(x_t)-\frac{C}{T}\Big) \;\ge\; \mathbb{E}\Big[\frac{T-\tau}{T}\,C_{\min}\Big] - \mathbb{E}\sup_{\eta}R(T,\eta) - 2R_{\max}. \quad (19)
\]
Plugging (14) and (19) into (13), we obtain
\[
\mathbb{E}\sum_{t=1}^{\tau}(M_{t-1}^\top x_t)^\top y_t(x_t) \;\ge\; \frac{\tau}{T}\,V^{UB} + Z\,\mathbb{E}\Big[\frac{T-\tau}{T}\,C_{\min}\Big] - Z\,\mathbb{E}\sup_{\eta}R(T,\eta) - \mathbb{E}\sum_{t=1}^{T}\Big[\max_a\big|\big\langle x_t,\mu^*_a-\mu^s_{a,t-1}\big\rangle\big| + \bar D\max_a\big\|\widehat W_{a,t}-W^*_a\big\|\Big] - (4R_{\max}+2\bar DZ)\sum_{t=1}^{T}K\epsilon_t.
\]
Note that $Z \ge \frac{V^{UB}}{C_{\min}}$. We have
\[
\mathbb{E}\sum_{t=1}^{\tau}\big(\mu^*_{y_t(x_t)}\big)^\top x_t \;\ge\; V^{UB} - Z\,\mathbb{E}\sup_{\eta}R(T,\eta) - \mathbb{E}\sum_{t=1}^{T}\Big[\max_a\big|\big\langle x_t,\mu^*_a-\mu^s_{a,t-1}\big\rangle\big| + \bar D\max_a\big\|\widehat W_{a,t}-W^*_a\big\|\Big] - (4R_{\max}+2\bar DZ)\sum_{t=1}^{T}K\epsilon_t.
\]
Finally, we plug in the regret bound of the Hedge algorithm (Theorem 2 of (Freund & Schapire, 1997); see also the multiplicative weights update method (Arora et al., 2012)), which is the algorithm used to update the dual variable $\eta_t$, and we obtain that
\[
\mathbb{E}\sup_{\eta}R(T,\eta) \;\lesssim\; \bar D\sqrt{T\log(m+1)}
\]
by setting $\delta = \sqrt{\log(m+1)/T}\,/\,\bar D$, where $\bar D$ denotes an upper bound of $b_i(y_t,x_t)$ for each $i\in[m]$, $t\in[T]$ and every $y_t, x_t$. Therefore, our proof is completed.

C.6. Proof of Lemma C.4

Proof. We denote by $T'$ the number of periods from $t=1$ to $t=\tau$ such that $\nu_t = 0$. Then, from the regret bound of the embedded OCO algorithm, we know that
\[
\sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=0}\,\eta_t^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) \;\ge\; \sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=0}\,\eta^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) - R(T'). \quad (22)
\]
Moreover, from the boundedness of $\eta_t$ and $x_t$, each term $\eta_t^\top(b(y_t,x_t)-C/T)$ and $\eta^\top(b(y_t,x_t)-C/T)$ is uniformly bounded by some constant $C_0$, so we know that
\[
\sum_{t=1}^{\tau}\eta_t^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) \;\ge\; \sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=0}\,\eta_t^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) - C_0\sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=1}, \quad (23)
\]
\[
\sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=0}\,\eta^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) \;\ge\; \sum_{t=1}^{\tau}\eta^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) - C_0\sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=1}. \quad (24)
\]
Therefore, plugging (23) and (24) into (22), we have that
\[
\sum_{t=1}^{\tau}\eta_t^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) \;\ge\; \sum_{t=1}^{\tau}\eta^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) - 2C_0\sum_{t=1}^{\tau}\mathbf{1}_{\nu_t=1} - R(T') \;\ge\; \sum_{t=1}^{\tau}\eta^\top\Big(b(y_t,x_t)-\frac{C}{T}\Big) - R(T),
\]
where the last step absorbs the correction terms into $R(T)$, which completes our proof. (A schematic multiplicative-weights update of the kind Lemma C.4 refers to is sketched below.)
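Before turning to the proof of Lemma B.3, we give a small illustration of the dual update that Lemma C.4 refers to. The following is a minimal Python sketch of a Hedge / multiplicative-weights update over the m resource constraints plus a null expert; the function and variable names, constants, and toy data are ours and purely illustrative — this is not the paper's Algorithm 2 nor released code.

```python
import numpy as np

def hedge_dual_updates(consumptions, capacity, T, delta):
    """Multiplicative-weights sketch for the dual variable eta_t.

    `consumptions` is an iterable of length-m vectors b_t = b(y_t, x_t);
    `capacity` is the vector C of knapsack budgets.  Expert i corresponds to
    resource i (dual direction e_i) and an extra expert 0 corresponds to
    eta = 0.  The gain of expert i at round t is b_{t,i} - C_i / T, so the
    update pushes dual mass toward resources that are being over-consumed.
    """
    m = len(capacity)
    weights = np.ones(m + 1)          # expert 0 is the "null" dual eta = 0
    etas = []
    for b in consumptions:
        p = weights / weights.sum()
        etas.append(p[1:])            # eta_t: mass placed on each resource
        gains = np.concatenate(([0.0], b - capacity / T))
        weights *= np.exp(delta * gains)
    return etas

# Usage on toy data: 2 resources, resource 0 consumed too fast.
rng = np.random.default_rng(1)
T, C = 100, np.array([30.0, 80.0])
bs = [np.array([0.6, 0.5]) + 0.05 * rng.normal(size=2) for _ in range(T)]
etas = hedge_dual_updates(bs, C, T, delta=0.5)
print(np.round(etas[-1], 3))   # most dual mass should sit on resource 0
```

The design point this is meant to convey is the one used in the proof above: the learner's dual sequence is compared against the fixed comparators $\eta = e_i$ and $\eta = 0$, and the Hedge regret against those comparators is exactly the $R(T,\eta)$ term.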
C.7. Proof of Lemma B.3

Proof. The proof follows from (Agrawal & Devanur, 2016). We define an intermediate benchmark as follows: for a perturbation level $\delta' \ge 0$,
\[
\bar V(\delta') \;=\; \max\;\; \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\mu^*_a)^\top x_t\, y_{a,t} \quad (25a)
\]
\[
\text{s.t.}\quad \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}W_a^{*\top}x_t\, y_{a,t} \;\le\; C + \delta', \quad (25b)
\]
\[
\sum_{a\in[K]}y_{a,t} = 1,\quad t\in[T_0], \quad (25c)
\]
\[
y_{a,t}\in[0,1],\quad a\in[K],\; t\in[T_0]. \quad (25d)
\]
The only difference between $\bar V(\delta')$ in (25) and $\hat V$ is that the estimations $\mu^s_{a,T_0}$ and $\widehat W_{a,T_0}$ are replaced by the true values $\mu^*_a$ and $W^*_a$ for all $a\in[K]$. Then, we can bound the gap between $\hat V$ and $V^{UB}$ by bounding the two terms $|V^{UB}-\bar V(\delta/2)|$ and $|\bar V(\delta/2)-\hat V|$ separately.

Bound the term $|\bar V(\delta/2) - V^{UB}|$: We denote by $L(\eta)$ the dual function of $V^{UB}$ as follows:
\[
L(\eta) = C^\top\eta + T\,\mathbb{E}_{x\sim F}\max_{\sum_{a\in[K]}y_a(x)=1}\Big[\sum_{a\in[K]}(\mu^*_a)^\top x\,y_a(x) - \eta^\top\sum_{a\in[K]}W_a^{*\top}x\,y_a(x)\Big].
\]
We also denote by $\bar L(\eta)$ the dual function of $\bar V(\delta/2)$ as follows:
\[
\bar L(\eta) = \Big(C+\frac{\delta}{2}\Big)^{\!\top}\eta + \frac{T}{T_0}\sum_{t=1}^{T_0}\max_{\sum_{a\in[K]}y_{a,t}=1}\Big[\sum_{a\in[K]}(\mu^*_a)^\top x_t\,y_{a,t} - \eta^\top\sum_{a\in[K]}W_a^{*\top}x_t\,y_{a,t}\Big].
\]
Then, the function $\bar L(\eta)$ can be regarded as a sample average approximation of $L(\eta)$. We then proceed to bound the range of the optimal dual variables for $V^{UB}$ and $\bar V(\delta/2)$. Denote by $\eta^*$ an optimal dual variable for $V^{UB}$. Then, it holds that $\eta^*\in\Omega := \{\eta\ge 0: \|\eta\|_1 \le \frac{V^{UB}}{C_{\min}}\}$. Similarly, denote by $\bar\eta^*$ an optimal dual variable for $\bar V(\delta/2)$, and we can obtain that $\bar\eta^*\in\bar\Omega := \{\eta\ge 0: \|\eta\|_1 \le \frac{\bar V(\delta/2)}{C_{\min}+\delta/2}\}$. By duality,
\[
V^{UB} = L(\eta^*) \;\ge\; \bar L(\eta^*) - |L(\eta^*)-\bar L(\eta^*)| \;\ge\; \bar L(\bar\eta^*) - |L(\eta^*)-\bar L(\eta^*)| = \bar V(\delta/2) - |L(\eta^*)-\bar L(\eta^*)|, \quad (28)
\]
\[
\bar V(\delta/2) = \bar L(\bar\eta^*) \;\ge\; L(\bar\eta^*) - |\bar L(\bar\eta^*)-L(\bar\eta^*)| \;\ge\; L(\eta^*) - |\bar L(\bar\eta^*)-L(\bar\eta^*)| = V^{UB} - |\bar L(\bar\eta^*)-L(\bar\eta^*)|. \quad (29)
\]
Define a random variable $H(x) = \max_{\sum_{a\in[K]}y_a(x)=1}\big[\sum_{a\in[K]}(\mu^*_a)^\top x\,y_a(x) - (\eta^*)^\top\sum_{a\in[K]}W_a^{*\top}x\,y_a(x)\big]$, where $x\sim F$. It is clear that $|H(x)| \le R_{\max} + \frac{V^{UB}}{C_{\min}}\bar D$, where $\bar D$ denotes an upper bound on $W_a^{*\top}x$ for every $a\in[K]$ and $x$. Then, we have
\[
|\bar L(\eta^*) - L(\eta^*)| \;\le\; \frac{\delta}{2}\|\eta^*\|_1 + \Big|\frac{T}{T_0}\sum_{t=1}^{T_0}H(x_t) - T\,\mathbb{E}_{x\sim F}[H(x)]\Big| \;\le\; \frac{\delta\,V^{UB}}{2C_{\min}} + T\Big(R_{\max}+\frac{V^{UB}}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}}, \quad (30)
\]
which holds with probability at least $1-\frac{\beta}{2}$, where the second inequality follows from the standard Hoeffding's inequality. Similarly, we have
\[
|\bar L(\bar\eta^*) - L(\bar\eta^*)| \;\le\; \frac{\delta}{2}\|\bar\eta^*\|_1 + T\Big(R_{\max}+\frac{\bar V(\delta/2)}{C_{\min}+\delta/2}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}} \;\le\; \frac{\delta\,\bar V(\delta/2)}{2C_{\min}} + T\Big(R_{\max}+\frac{\bar V(\delta/2)}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}}, \quad (31)
\]
which holds with probability at least $1-\frac{\beta}{2}$. From the union bound, we know that with probability at least $1-\beta$, both (30) and (31) hold. Therefore, from (28) and (29), we have the following two inequalities:
\[
\bar V(\delta/2) - V^{UB} \;\le\; \frac{\delta\,V^{UB}}{2C_{\min}} + T\Big(R_{\max}+\frac{V^{UB}}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}}, \quad (32)
\]
\[
V^{UB} - \bar V(\delta/2) \;\le\; \frac{\delta\,\bar V(\delta/2)}{2C_{\min}} + T\Big(R_{\max}+\frac{\bar V(\delta/2)}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}}, \quad (33)
\]
which hold with probability at least $1-\beta$.

Bound the term $|\bar V(\delta/2) - \hat V|$: We first denote by $\bar y$ an optimal solution to $\bar V(\delta/2)$. Note that
\[
\Big|\frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(W^*_a)^\top x_t\,\bar y_{a,t} - \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\widehat W_{a,T_0})^\top x_t\,\bar y_{a,t}\Big| \;\le\; T D\max_{a\in[K]}\big\|W^*_a - \widehat W_{a,T_0}\big\|. \quad (34)
\]
Hence, as long as $\frac{\delta}{2} \ge T D\max_{a\in[K]}\|W^*_a-\widehat W_{a,T_0}\|$, we know that $\bar y$ is a feasible solution to $\hat V$. Also, note that
\[
\Big|\frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\mu^*_a)^\top x_t\,\bar y_{a,t} - \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\mu^s_{a,T_0})^\top x_t\,\bar y_{a,t}\Big| \;\le\; T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1. \quad (35)
\]
Therefore, we know that
\[
\bar V(\delta/2) \;\le\; \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\mu^s_{a,T_0})^\top x_t\,\bar y_{a,t} + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1 \;\le\; \hat V + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1. \quad (36)
\]
On the other hand, we denote by $\hat y$ an optimal solution to $\hat V$. Then, note that
\[
\Big|\frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(W^*_a)^\top x_t\,\hat y_{a,t} - \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\widehat W_{a,T_0})^\top x_t\,\hat y_{a,t}\Big| \;\le\; T D\max_{a\in[K]}\big\|W^*_a-\widehat W_{a,T_0}\big\|, \quad (37)
\]
so that
\[
\frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(W^*_a)^\top x_t\,\hat y_{a,t} \;\le\; \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\widehat W_{a,T_0})^\top x_t\,\hat y_{a,t} + T D\max_{a\in[K]}\big\|W^*_a-\widehat W_{a,T_0}\big\| \;\le\; C + \delta + T D\max_{a\in[K]}\big\|W^*_a-\widehat W_{a,T_0}\big\|.
\]
Thus, we know that $\hat y$ is a feasible solution to $\bar V(\frac{3}{2}\delta)$ and, again from (35), it holds that
\[
\hat V \;\le\; \frac{T}{T_0}\sum_{t=1}^{T_0}\sum_{a\in[K]}(\mu^*_a)^\top x_t\,\hat y_{a,t} + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1 \;\le\; \bar V\Big(\frac{3}{2}\delta\Big) + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1. \quad (38)
\]
Therefore, combining (32) and (38), we have
\[
\hat V \;\le\; V^{UB} + \frac{3\delta\,V^{UB}}{2C_{\min}} + T\Big(R_{\max}+\frac{V^{UB}}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}} + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1. \quad (39)
\]
Also, combining (33) and (36), we have
\[
V^{UB} \;\le\; \hat V + \frac{\delta\,\bar V(\delta/2)}{C_{\min}} + T\Big(R_{\max}+\frac{\bar V(\delta/2)}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}} + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1. \quad (40)
\]
We further use the bound on $\bar V(\delta/2) - V^{UB}$ in (32) to plug into (40), and we obtain that
\[
\big|\hat V - V^{UB}\big| \;\le\; C\Big(\frac{\delta\,V^{UB}}{C_{\min}} + (1+\delta)\,T\Big(R_{\max}+\frac{V^{UB}}{C_{\min}}\bar D\Big)\sqrt{\frac{\log(4/\beta)}{T_0}} + T D\max_{a\in[K]}\big\|\mu^*_a-\mu^s_{a,T_0}\big\|_1\Big),
\]
which completes our proof.

C.8. Proof of Theorems 6.1 and 6.2

Proof. Our proof essentially follows the basic ideas of the regret analysis for $\epsilon$-greedy algorithms, with a more fine-grained treatment of the estimation error. For the $\epsilon$-greedy algorithm, we have
\[
\text{Regret} \;\le\; \sum_{t=1}^{T}\mathbb{E}\Big[\big\langle x_t, \mu^*_{\mathrm{opt}}(x_t)-\mu^s_{\mathrm{opt},t-1}(x_t)\big\rangle + \big\langle x_t, \mu^s_{\mathrm{opt},t-1}(x_t)-\mu^s_{y^*_t,t-1}\big\rangle + \big\langle x_t, \mu^s_{y^*_t,t-1}-\mu^*_{y^*_t}\big\rangle + \big\langle x_t, \mu^*_{y^*_t}-\mu^*_{y_t}\big\rangle\Big]
\]
\[
\le\; \sum_{t=1}^{T}\mathbb{E}\Big[\|x_t\|_\infty\Big(\big\|\mu^*_{\mathrm{opt}}(x_t)-\mu^s_{\mathrm{opt},t-1}(x_t)\big\|_1 + \big\|\mu^s_{y^*_t,t-1}-\mu^*_{y^*_t}\big\|_1\Big)\Big] + 2\sum_{t=1}^{T}K\epsilon_t R_{\max},
\]
where $y^*_t$ denotes the greedy action $y^*_t = \arg\max_{a\in[K]}\langle x_t, \mu^s_{a,t-1}\rangle$, and $\mu^s_{\mathrm{opt},t-1}(x_t)$ indicates the estimation of the optimal arm $\mu^*_{\mathrm{opt}}(x_t)$. The inequality uses the property of the greedy action and the uniform risk bound. This leads to the regret bound
\[
\text{Regret} \;\le\; D\sqrt{s_0}\sum_{t=1}^{T}\mathbb{E}\max_a\big\|\mu^s_{a,t}-\mu^*_a\big\|_2 + 2\sum_{t=1}^{T}K\epsilon_t R_{\max} \;\lesssim\; \frac{\sigma D^2 s_0\sqrt{\log(dK)}}{\phi_{\min}(s)}\sum_{t=1}^{T}\sqrt{\frac{\sum_{j=1}^t 1/p_j}{t^2}} + 2\sum_{t=1}^{T}K\epsilon_t R_{\max}.
\]
Choosing $\epsilon_t = \sigma^{2/3}D^{4/3}s_0^{2/3}\big(\log(dK)\big)^{1/3}t^{-1/3}\big/(KR_{\max})^{2/3} \wedge 1/K$ (a schematic transcription of this schedule is given after the end of this proof), the statement in Theorem 6.1 can be justified.

For Theorem 6.2, since it can be viewed as a special case of the $\epsilon$-greedy strategy (with $\epsilon = 0$), we have
\[
\text{Regret} \;\le\; \sum_{t=1}^{T}\mathbb{E}\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle,
\]
where the estimation error can be guaranteed by
\[
\mathbb{E}\max_a\big\|\mu^s_{a,t}-\mu^*_a\big\|_2^2 \;\lesssim\; \frac{\sigma^2D^2s_0}{\gamma^2(K)\zeta^2(K)}\cdot\frac{\log(dK)}{t}. \quad (41)
\]
This error bound can be easily derived from the proof of Theorem 5.4.
Here each term $\mathbb{E}\max_a\langle x_t, \mu^s_{a,t-1}-\mu^*_a\rangle$ in the regret can be controlled in two ways:
\[
\mathbb{E}\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \;\le\; D\,\mathbb{E}\max_a\big\|\mu_{a,t-1}-\mu^*_a\big\|_1, \quad (42)
\]
and
\[
\mathbb{E}\Big[\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle - \mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle\Big] \;\le\; \int_0^\infty \mathbb{P}\Big(\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle - \mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \ge z\Big)\,dz. \quad (43)
\]
Combining (41) with (42), it is easy to show the regret bound
\[
\sum_{t=1}^{T}\mathbb{E}\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \;\lesssim\; \frac{\sigma D^2 s_0\sqrt{\log(dK)\,T}}{\gamma(K)\zeta(K)}.
\]
We use (43) to give another bound. Notice that $x_t$ is independent of the history $\mathcal{H}_{t-1}$, which implies that, conditional on $\mathcal{H}_{t-1}$,
\[
\mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \;\le\; \sqrt{\mathbb{E}\big(\mu^s_{a,t-1}-\mu^*_a\big)^\top x_tx_t^\top\big(\mu^s_{a,t-1}-\mu^*_a\big)} \;\le\; \sqrt{\phi_{\max}(s_0)}\,\big\|\mu^s_{a,t-1}-\mu^*_a\big\|_2.
\]
Since $x_t$ is marginally sub-Gaussian, $\langle x_t, \mu^s_{a,t-1}-\mu^*_a\rangle$ has the following tail behavior by the Chernoff bound:
\[
\mathbb{P}\Big(\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle - \mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \ge z\Big) \;\le\; \exp\Big(-\frac{cz^2}{\phi_{\max}(s_0)\,\|\mu^s_{a,t-1}-\mu^*_a\|_2^2}\Big),
\]
and also
\[
\mathbb{P}\Big(\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle - \mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \ge z\Big) \;\le\; K\exp\Big(-\frac{cz^2}{\phi_{\max}(s_0)\,\max_a\|\mu^s_{a,t-1}-\mu^*_a\|_2^2}\Big).
\]
This instantly gives rise to the following maximal inequality by (43):
\[
\mathbb{E}\Big[\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle - \mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle\Big] \;\lesssim\; \sqrt{\log K\cdot\phi_{\max}(s_0)}\;\max_a\big\|\mu^s_{a,t-1}-\mu^*_a\big\|_2.
\]
We thus have
\[
\mathbb{E}\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \;\le\; \mathbb{E}\Big[\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle - \mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle\Big] + \max_a\mathbb{E}\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \;\lesssim\; \sqrt{\log K\cdot\phi_{\max}(s_0)}\;\max_a\big\|\mu^s_{a,t-1}-\mu^*_a\big\|_2,
\]
conditional on the history $\mathcal{H}_{t-1}$. Together with the estimation error (41), we can derive another regret bound:
\[
\sum_{t=1}^{T}\mathbb{E}\max_a\big\langle x_t, \mu^s_{a,t-1}-\mu^*_a\big\rangle \;\lesssim\; \sqrt{\log K\cdot\phi_{\max}(s_0)}\cdot\frac{\sigma D\sqrt{s_0\log(dK)\,T}}{\gamma(K)\zeta(K)} \;\asymp\; \frac{\sigma D\sqrt{\phi_{\max}(s_0)}\,\sqrt{s_0\log K\log(dK)\,T}}{\gamma(K)\zeta(K)}.
\]
Combining these two regret bounds, we finish the proof.
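As a companion to the proof of Theorem 6.1 above, the exploration schedule used there can be written out directly. The snippet below is a schematic Python transcription of the choice of epsilon_t and of an epsilon-greedy arm draw; the function and variable names, the handling of constants, and the estimator array `mu_hat` are ours and purely illustrative, not part of the paper's algorithms.

```python
import numpy as np

def epsilon_t(t, K, d, s0, sigma, D, R_max):
    """Exploration rate from the Theorem 6.1 analysis (constants omitted,
    as in the proof); capped at 1/K."""
    rate = (sigma ** (2 / 3) * D ** (4 / 3) * s0 ** (2 / 3)
            * np.log(d * K) ** (1 / 3) * t ** (-1 / 3)
            / (K * R_max) ** (2 / 3))
    return min(rate, 1.0 / K)

def choose_arm(t, x_t, mu_hat, K, d, s0, sigma, D, R_max, rng):
    """With probability K * eps_t pick a uniformly random arm; otherwise act
    greedily on the current sparse estimates mu_hat (a K x d array)."""
    eps = epsilon_t(t, K, d, s0, sigma, D, R_max)
    if rng.random() < K * eps:
        return int(rng.integers(K))
    return int(np.argmax(mu_hat @ x_t))
```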
C.9. Proof of Theorem 5.4

Proof. The proof follows a similar approach to the proof of Theorem 4.1. The key difference is that, instead of focusing on the concentration of the gradient $\bar g_{a,t}$ around the population version $\nabla f_a(\mu_{a,t-1})$, we consider a series of new objective functions $\{f^a_t\}$ that change over time, and derive the concentration of $\bar g_{a,t}$ around $\nabla f^a_t(\mu_{a,t-1})$. To this end, we define the history-dependent covariance matrices $\mathbb{E}[x_tx_t^\top\mathbf{1}\{y_t=a\}\,|\,\mathcal{H}_{t-1}]$ and their average
\[
\bar\Sigma_{a,t} = \frac{1}{t}\sum_{j=1}^{t}\mathbb{E}\big[x_jx_j^\top\mathbf{1}\{y_j=a\}\,\big|\,\mathcal{H}_{j-1}\big].
\]
We write the corresponding objective function that $\bar\Sigma_{a,t}$ represents as $f^a_t(\mu) = \|\mu-\mu^*_a\|^2_{\bar\Sigma_{a,t}}$. In the following proof, since we mainly focus on one arm, we will write $\mu_t, \mu^*, \bar g_t, f_t, \widehat\Sigma_t, \bar\Sigma_t$, etc., instead of $\mu_{a,t}, \mu^*_a, \bar g_{a,t}, f^a_t, \widehat\Sigma_{a,t}, \bar\Sigma_{a,t}$, etc., to ease the notation. An argument analogous to the proof of Theorem 4.1 gives
\[
\|\mu_t-\mu^*\|_2^2 \;\le\; \Big(1+\tfrac32\rho\Big)\Big[\|\mu_{t-1}-\mu^*\|_2^2 - 2\eta_t\big\langle P_\Omega(\bar g_t), \mu_{t-1}-\mu^*\big\rangle + \eta_t^2\big\|P_\Omega(\bar g_t)\big\|_2^2\Big]
\]
\[
\le\; \Big(1+\tfrac32\rho\Big)\Big[\|\mu_{t-1}-\mu^*\|_2^2 - 2\eta_t\big\langle \nabla f_t(\mu_{t-1}), \mu_{t-1}-\mu^*\big\rangle + 2\eta_t^2\big\|P_\Omega\big(\bar g_t-\nabla f_t(\mu_{t-1})\big)\big\|_2^2 + 2\eta_t^2\big\|P_\Omega\big(\nabla f_t(\mu_{t-1})\big)\big\|_2^2 + 2\eta_t\big\|P_\Omega\big(\bar g_t-\nabla f_t(\mu_{t-1})\big)\big\|_2\,\|\mu_{t-1}-\mu^*\|_2\Big],
\]
where we use the fact that $\langle \nabla f_t(\mu_{t-1}), \mu_{t-1}-\mu^*\rangle = \langle P_\Omega(\nabla f_t(\mu_{t-1})), \mu_{t-1}-\mu^*\rangle$ by the definition of $P_\Omega(\cdot)$. Because we are interested in the new objective function $f_t(\mu) = \|\mu-\mu^*\|^2_{\bar\Sigma_t}$, we need to check the sparse eigenvalues of $\bar\Sigma_t$. Since for any $\beta$ with $\|\beta\|_0 \le 2s$ we have
\[
\beta^\top\mathbb{E}\big[x_tx_t^\top\mathbf{1}\{y_t=a\}\,\big|\,\mathcal{H}_{t-1}\big]\beta \;\le\; \beta^\top\mathbb{E}\big[x_tx_t^\top\,\big|\,\mathcal{H}_{t-1}\big]\beta \;\le\; \phi_{\max}(s)\,\|\beta\|_2^2,
\]
it is clear that the $2s$-sparse maximal eigenvalue of $\bar\Sigma_t$ is bounded by $\phi_{\max}(s)$. For the minimum eigenvalue, it follows from Assumption 5.3 that, given any unit vector $v$,
\[
v^\top\mathbb{E}\big[x_tx_t^\top\mathbf{1}\{y_t=a\}\,\big|\,\mathcal{H}_{t-1}\big]v \;\ge\; \mathbb{E}\big[v^\top x_tx_t^\top v\,\mathbf{1}\{y_t=a\}\,\mathbf{1}\{v^\top x_tx_t^\top v \ge \gamma(K)\}\,\big|\,\mathcal{H}_{t-1}\big] \;\ge\; \gamma(K)\,\mathbb{E}\big[\mathbf{1}\{y_t=a\}\,\mathbf{1}\{v^\top x_tx_t^\top v \ge \gamma(K)\}\,\big|\,\mathcal{H}_{t-1}\big] \;\ge\; \gamma(K)\zeta(K).
\]
It is clear that the $2s$-sparse minimum eigenvalue of $\bar\Sigma_t$ can be lower bounded by $\gamma(K)\zeta(K)$. We therefore take the condition number of $\bar\Sigma_t$ as $\kappa_1 = \frac{\phi_{\max}(s)}{\gamma(K)\zeta(K)}$. The eigenvalue bounds on $\bar\Sigma_t$ also imply
\[
\big\langle \nabla f_t(\mu_{t-1}), \mu_{t-1}-\mu^*\big\rangle \;\ge\; 2\gamma(K)\zeta(K)\,\|\mu_{t-1}-\mu^*\|_2^2, \qquad \big\|P_\Omega\big(\nabla f_t(\mu_{t-1})\big)\big\|_2 \;\le\; 2\phi_{\max}(s)\,\|\mu_{t-1}-\mu^*\|_2.
\]
We can therefore show that
\[
\|\mu_t-\mu^*\|_2^2 \;\le\; \Big(1+\tfrac32\rho\Big)\Big(1-4\gamma(K)\zeta(K)\eta_t+8\eta_t^2\phi_{\max}^2(s)\Big)\|\mu_{t-1}-\mu^*\|_2^2 + 6\eta_t^2\big\|P_\Omega\big(\bar g_t-\nabla f_t(\mu_{t-1})\big)\big\|_2^2 + 6\eta_t\big\|P_\Omega\big(\bar g_t-\nabla f_t(\mu_{t-1})\big)\big\|_2\,\|\mu_{t-1}-\mu^*\|_2
\]
\[
\le\; \Big(1+\tfrac32\rho\Big)\Big(1-4\gamma(K)\zeta(K)\eta_t+8\eta_t^2\phi_{\max}^2(s)\Big)\|\mu_{t-1}-\mu^*\|_2^2 + 18s\eta_t^2\max_{i\in[d]}\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big|^2 + 18\eta_t\sqrt{s}\max_{i\in[d]}\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big|\,\|\mu_{t-1}-\mu^*\|_2. \quad (45)
\]
The following lemma, which echoes the aforementioned Lemma C.1, quantifies the variation of the averaged stochastic gradient under the diverse covariate condition, without the $\epsilon$-greedy strategy.

Lemma C.5. Define $\{e_i\}_{i=1}^d$ as the canonical basis of $\mathbb{R}^d$. Under Assumptions 3.2 and 5.3, the variance of the averaged stochastic gradient $\bar g_t$ at the point $\mu_{t-1}$ given in Algorithm 1 can be bounded by the following inequality:
\[
\mathbb{E}\max_{i\in[d]}\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big|^2 \;\le\; \frac{C\,s\,D^2\log(dt)}{t}\,\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,\sigma^2D^2\log d}{t}.
\]
Moreover, the following inequality also holds with probability at least $1-\epsilon$:
\[
\max_{i\in[d]}\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big|^2 \;\le\; \frac{C\,s\,D^2\log(d/\epsilon)}{t}\,\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,\sigma^2D^2\log(d/\epsilon)}{t}.
\]
We defer the proof of Lemma C.5 to the next subsection. We set $\rho = \frac{1}{9\kappa_1^4}$ and $\eta_t = \frac{1}{4\kappa_1\phi_{\max}(s)}$. Plugging the expectation bound of Lemma C.5 into (45), we have
\[
\mathbb{E}\|\mu_t-\mu^*\|_2^2 \;\le\; \Big(1-\frac{1}{4\kappa_1^4}+\frac{C\sqrt{s_0}\,D\sqrt{\log(dt)}}{\gamma(K)\zeta(K)\sqrt{t}}\Big)\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,s_0\sigma^2D^2\log d}{\gamma^2(K)\zeta^2(K)\,t} + C\sqrt{\frac{s_0\sigma^2D^2\log d}{\gamma^2(K)\zeta^2(K)\,t}}\,\sqrt{\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2}.
\]
When $t$ is sufficiently large, essentially we have
\[
\mathbb{E}\|\mu_t-\mu^*\|_2^2 \;\le\; \Big(1-\frac{1}{5\kappa_1^4}\Big)\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,s_0\sigma^2D^2\log d}{\gamma^2(K)\zeta^2(K)\,t} + C\sqrt{\frac{s_0\sigma^2D^2\log d}{\gamma^2(K)\zeta^2(K)\,t}}\,\sqrt{\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2}.
\]
This instantly gives us the expectation bound
\[
\mathbb{E}\|\mu_t-\mu^*\|_2^2 \;\lesssim\; \frac{\sigma^2D^2s_0}{\gamma^2(K)\zeta^2(K)}\cdot\frac{\log d}{t},
\]
which proves the first claim. Applying Lemma C.5 again to the recursive relationship in (45), we also obtain the second claim:
\[
\|\mu_t-\mu^*\|_2^2 \;\lesssim\; \frac{\sigma^2D^2s_0}{\gamma^2(K)\zeta^2(K)}\cdot\frac{\log(dT/\epsilon)}{t}
\]
holds for all $t\in[T]$ with probability at least $1-\epsilon$.

C.10. Proof of Lemma C.5

Proof. The idea essentially follows the proof of Lemma C.1, with some modifications in the martingale concentration argument. Notice that, in Algorithm 1, for any arm $a\in[K]$, we have
\[
\bar g_t = 2\widehat\Sigma_t\mu_{t-1} - \frac{2}{t}\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_jr_j = \frac{2}{t}\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_jx_j^\top(\mu_{t-1}-\mu^*) - \frac{2}{t}\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_j\xi_j = 2\widehat\Sigma_t(\mu_{t-1}-\mu^*) - \frac{2}{t}\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_j\xi_j.
\]
Still, we can write
\[
\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big| \;\le\; \underbrace{2\,\big|\big\langle(\widehat\Sigma_t-\bar\Sigma_t)(\mu_{t-1}-\mu^*), e_i\big\rangle\big|}_{\text{Part 1}} + \underbrace{\frac{2}{t}\Big|\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_{j,i}\xi_j\Big|}_{\text{Part 2}}.
\]
We consider the two parts separately. In Part 1, for any $i,k\in[d]$, the quantity $\frac1t\sum_{j=1}^t\mathbf{1}\{y_j=a\}x_{j,i}x_{j,k} - \bar\Sigma_{t,ik}$ has a martingale structure:
\[
\sum_{j=1}^{t}\mathbb{E}\big[\mathbf{1}\{y_j=a\}x_{j,i}x_{j,k}\,\big|\,\mathcal{H}_{j-1}\big] - t\,\bar\Sigma_{t,ik} = 0, \qquad \big|\mathbf{1}\{y_j=a\}x_{j,i}x_{j,k} - \mathbb{E}[\mathbf{1}\{y_j=a\}x_{j,i}x_{j,k}\,|\,\mathcal{H}_{j-1}]\big| \le 2D^2.
\]
We can therefore use the Bernstein-type martingale concentration inequality in Lemma C.2 to derive the following bound:
\[
\mathbb{P}\Big(\Big|\frac1t\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_{j,i}x_{j,k} - \bar\Sigma_{t,ik}\Big| \ge z\Big) \;\le\; 2\exp\Big(-\frac{cz^2}{D^4/t + 2D^2z/t}\Big),
\]
where we select $v^2 = D^4/t$ and $b = 2D^2/t$. This leads to the concentration that, with probability at least $1-\epsilon$,
\[
\max_{i,k\in[d]}\Big|\frac1t\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_{j,i}x_{j,k} - \bar\Sigma_{t,ik}\Big| \;\le\; CD^2\sqrt{\frac{\log(d/\epsilon)}{t}}.
\]
It follows from the same process as in (8) that
\[
\mathbb{E}\max_{i\in[d]}\big\langle(\widehat\Sigma_t-\bar\Sigma_t)(\mu_{t-1}-\mu^*), e_i\big\rangle^2 \;\le\; \frac{C\,s\,D^2\big(\log(dt)+\log(\bar\mu D^2/\sigma)\big)}{t}\,\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,\sigma^2D^2}{t}.
\]
We now proceed to control Part 2 analogously. We invoke Lemma C.2 again by selecting $v^2 = \sigma^2D^2/t$ and $b = \sigma D/t$.
We then have the concentration bound
\[
\mathbb{P}\Big(\Big|\frac1t\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_{j,i}\xi_j\Big| \ge z\Big) \;\le\; 2\exp\Big(-\frac{cz^2}{\sigma^2D^2/t + 2\sigma Dz/t}\Big) \;\le\; 4\exp\Big(-\frac{ctz^2}{2\sigma^2D^2}\Big) + 4\exp\Big(-\frac{ctz}{4\sigma D}\Big),
\]
which gives the tail bound: with probability at least $1-\epsilon$,
\[
\max_{i\in[d]}\Big|\frac1t\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_{j,i}\xi_j\Big| \;\le\; C\sigma D\sqrt{\frac{\log(d/\epsilon)}{t}},
\]
and also the expectation bound for the maxima:
\[
\mathbb{E}\max_{i\in[d]}\Big|\frac1t\sum_{j=1}^{t}\mathbf{1}\{y_j=a\}x_{j,i}\xi_j\Big|^2 \;\le\; \frac{C\,\sigma^2D^2\log d}{t}.
\]
Combining Part 1 and Part 2 gives us the first claim, the expectation bound:
\[
\mathbb{E}\max_{i\in[d]}\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big|^2 \;\le\; \frac{C\,s\,D^2\log(dt)}{t}\,\mathbb{E}\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,\sigma^2D^2\log d}{t}.
\]
The high-probability bounds for Part 1 and Part 2 directly lead to the probability bound: with probability at least $1-\epsilon$, the variation can be controlled by
\[
\max_{i\in[d]}\big|\langle \bar g_t-\nabla f_t(\mu_{t-1}), e_i\rangle\big|^2 \;\le\; \frac{C\,s\,D^2\log(d/\epsilon)}{t}\,\|\mu_{t-1}-\mu^*\|_2^2 + \frac{C\,\sigma^2D^2\log(d/\epsilon)}{t}.
\]
This completes the proof of Lemma C.5.

To better illustrate the experimental results, we present all the figures from the main text here at a larger size:

Figure 4. Primal performance of Online HT vs. LASSO.

Figure 5. Regret of Online HT vs. LASSO Bandit.

Figure 6. Regret of Online HT vs. linCBwK for the CBwK problem.
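Finally, for readers who prefer a concrete view of the primitives analyzed throughout this appendix, the following is a minimal Python sketch of the hard-thresholding operator and of an averaged-gradient hard-thresholding update in a plain sparse-regression setting (no arms and no knapsack constraints). It is an illustrative toy under our own simplifying assumptions, with hypothetical function names, and is not the authors' implementation of Online HT.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if s > 0:
        idx = np.argpartition(np.abs(v), -s)[-s:]
        out[idx] = v[idx]
    return out

def online_ht(stream, d, s, eta):
    """Averaged-gradient hard-thresholding sketch for r_t = <x_t, mu*> + noise.

    Only the running averages of x_t x_t^T and x_t r_t are stored, so the
    per-round work and memory stay O(d^2) regardless of the horizon.
    """
    mu = np.zeros(d)
    cov_avg = np.zeros((d, d))   # running average of x_t x_t^T
    xr_avg = np.zeros(d)         # running average of x_t r_t
    for t, (x, r) in enumerate(stream, start=1):
        cov_avg += (np.outer(x, x) - cov_avg) / t
        xr_avg += (x * r - xr_avg) / t
        g = 2.0 * (cov_avg @ mu - xr_avg)   # averaged stochastic gradient
        mu = hard_threshold(mu - eta * g, s)
    return mu

# Tiny usage example on synthetic sparse data.
rng = np.random.default_rng(0)
d, s0, T = 50, 3, 2000
mu_star = np.zeros(d)
mu_star[:s0] = [1.0, -2.0, 0.5]

def make_stream(T):
    for _ in range(T):
        x = rng.normal(size=d)
        yield x, x @ mu_star + 0.1 * rng.normal()

mu_hat = online_ht(make_stream(T), d, s=2 * s0, eta=0.05)
print(np.round(mu_hat[:5], 2))   # should be close to [1.0, -2.0, 0.5, 0.0, 0.0]
```

The point of keeping only the averaged second-moment matrix and averaged response is that the same update can be interleaved with a dual step inside an online primal-dual loop, which is how the estimator is used in the regret analyses above.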