# High-Dimensional Sparse Linear Bandits

Botao Hao, DeepMind, haobotao000@gmail.com
Tor Lattimore, DeepMind, lattimore@google.com
Mengdi Wang, Department of Electrical Engineering, Princeton University, mengdiw@princeton.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

Stochastic linear bandits with high-dimensional sparse features are a practical model for a variety of domains, including personalized medicine and online advertising [Bastani and Bayati, 2020]. We derive a novel $\Omega(n^{2/3})$ dimension-free minimax regret lower bound for sparse linear bandits in the data-poor regime, where the horizon is smaller than the ambient dimension and where the feature vectors admit a well-conditioned exploration distribution. This is complemented by a nearly matching upper bound for an explore-then-commit algorithm, showing that $\Theta(n^{2/3})$ is the optimal rate in the data-poor regime. The results complement existing bounds for the data-rich regime and provide another example where carefully balancing the trade-off between information and regret is necessary. Finally, we prove a dimension-free $O(\sqrt{n})$ regret upper bound under an additional assumption on the magnitude of the signal for relevant features.

## 1 Introduction

Stochastic linear bandits generalize the standard reward model for multi-armed bandits by associating each action with a feature vector and assuming the mean reward is the inner product between the feature vector and an unknown parameter vector [Auer, 2002, Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010, Chu et al., 2011, Abbasi-Yadkori et al., 2011]. In most practical applications, there are many candidate features but no clear indication about which are relevant. Therefore, it is crucial to consider stochastic linear bandits in the high-dimensional regime but with low-dimensional structure, captured here by the notion of sparsity.

Previous work on sparse linear bandits has mostly focused on the data-rich regime, where the time horizon is larger than the ambient dimension [Abbasi-Yadkori et al., 2012, Carpentier and Munos, 2012, Wang et al., 2018, Kim and Paik, 2019, Bastani and Bayati, 2020]. The focus on the data-rich regime is partly justified by minimax lower bounds showing that for smaller time horizons the regret is linear in the worst case. Minimax bounds, however, do not tell the whole story: a crude maximisation over all environments hides much of the rich structure of linear bandits with sparsity.

We study sparse linear bandits in the high-dimensional regime where the ambient dimension is much larger than the time horizon. In order to sidestep existing lower bounds, we refine the minimax notion by introducing a dependence in our bounds on the minimum eigenvalue of a suitable exploration distribution over the actions. Similar quantities appear already in the vast literature on high-dimensional statistics [Bühlmann and Van De Geer, 2011, Wainwright, 2019].

Table 1: Comparison with existing regret upper and lower bounds for sparse linear bandits. Here, $s$ is the sparsity, $d$ is the feature dimension, $n$ is the number of rounds, $K$ is the number of arms, $C_{\min}$ is the minimum eigenvalue of the data matrix for an exploration distribution (Definition 3.1), and $\tau$ is a problem-dependent parameter that may have a complicated form and vary across the literature.

| Upper bound | Regret | Assumptions | Regime |
| --- | --- | --- | --- |
| Abbasi-Yadkori et al. [2012] | $O(\sqrt{sdn})$ | none | rich |
| Sivakumar et al. [2020] | $O(\sqrt{sdn})$ | adver. + Gaussian noise | rich |
| Bastani and Bayati [2020] | $O(\tau K s^2 (\log n)^2)$ | compatibility condition | rich |
| Wang et al. [2018] | $O(\tau K s^3 \log n)$ | compatibility condition | rich |
| Kim and Paik [2019] | $O(\tau s \sqrt{n})$ | compatibility condition | rich |
| Lattimore et al. [2015] | $O(s\sqrt{n})$ | action set is hypercube | rich |
| This paper (Thm. 4.2) | $O(C_{\min}^{-2/3} s^{2/3} n^{2/3})$ | action set spans $\mathbb{R}^d$ | poor |
| This paper (Thm. 5.2) | $O(C_{\min}^{-1/2} \sqrt{sn})$ | action set spans $\mathbb{R}^d$ + min. signal | rich |

| Lower bound | Regret | Assumptions | Regime |
| --- | --- | --- | --- |
| Multi-task bandits¹ | $\Omega(\sqrt{sdn})$ | N.A. | rich |
| This paper (Thm. 3.3) | $\Omega(C_{\min}^{-1/3} s^{1/3} n^{2/3})$ | N.A. | poor |

¹Section 24.3 of Lattimore and Szepesvári [2020].
Contributions. Our first result is a lower bound showing that $\Omega(n^{2/3})$ regret is generally unavoidable when the dimension is large, even if the action set admits an exploration policy for which the minimum eigenvalue of the associated data matrix is large. The lower bound is complemented by an explore-the-sparsity-then-commit algorithm that first solves a convex optimization problem to find the most informative design for the exploration stage. The algorithm then explores for a number of rounds by sampling from the design distribution and uses the Lasso [Tibshirani, 1996] to estimate the unknown parameters. Finally, it greedily chooses the action that maximizes the reward given the estimated parameters. We derive an $O(n^{2/3})$ dimension-free regret bound that depends instead on the minimum eigenvalue of the covariance matrix associated with the exploration distribution. Our last result is a post-model-selection linear bandits algorithm that invokes the phased elimination algorithm [Lattimore et al., 2020] on the model selected by a first-step regularized estimator. Under a sufficient condition on the minimum signal of the feature covariates, we prove that a dimension-free $O(\sqrt{n})$ regret is achievable, even if the data is scarce.

The analysis reveals a rich structure that has much in common with partial monitoring, where $\Theta(n^{2/3})$ regret occurs naturally in settings for which some actions are costly but highly informative [Bartók et al., 2014]. A similar phenomenon appears here when the dimension is large relative to the horizon. There is an interesting transition as the horizon grows, since $O(\sqrt{dn})$ regret is optimal in the data-rich regime.

Related work. Most previous work focuses on the data-rich regime. For an arbitrary action set, Abbasi-Yadkori et al. [2012] proposed an online-to-confidence-set conversion approach that achieves an $O(\sqrt{sdn})$ regret upper bound, where $s$ is a known upper bound on the sparsity. The algorithm is generally not computationally efficient, which is believed to be unavoidable. Additionally, an $\Omega(\sqrt{sdn})$ regret lower bound for the data-rich regime was established in Section 24.3 of Lattimore and Szepesvári [2020], which means polynomial dependence on $d$ is generally not avoidable without additional assumptions. For this reason, it has recently become popular to study the contextual setting, where the action set changes from round to round and careful assumptions are made on the context distribution. The assumptions are chosen so that techniques from high-dimensional statistics can be borrowed. Suppose $\tau$ is a problem-dependent parameter that may have a complicated form and varies across the literature. Kim and Paik [2019] developed a doubly-robust Lasso bandit approach with an $O(\tau s\sqrt{n})$ upper bound, but required that the average of the feature vectors for each arm satisfies the compatibility condition [Bühlmann and Van De Geer, 2011]. Bastani and Bayati [2020] and Wang et al.
[2018] considered a multi-parameter setting (each arm has its own underlying parameter) and assumed that the distribution of contexts satisfies a variant of the compatibility condition as well as other separation conditions. Bastani and Bayati [2020] derived an $O(\tau K s^2(\log n)^2)$ upper bound, which was sharpened to $O(\tau K s^2 \log n)$ by Wang et al. [2018], where $K$ is the number of arms. Although these results are dimension-free, they require strong assumptions on the context distribution that are hard to verify in practice. As a result, the aforementioned regret bounds involve complicated problem-dependent parameters that may be very large when the assumptions fail to hold.

Another thread of the literature considers specific action sets. Lattimore et al. [2015] proposed a selective explore-then-commit algorithm that only works when the action set is exactly the binary hypercube. They derived an optimal $O(s\sqrt{n})$ upper bound as well as an optimal gap-dependent bound. Sivakumar et al. [2020] assumed the action set is generated adversarially but perturbed artificially by standard Gaussian noise; they proposed a structured greedy algorithm that achieves an $O(s\sqrt{n})$ upper bound. Deshpande and Montanari [2012] studied the data-poor regime in a Bayesian setting but did not consider sparsity. Carpentier and Munos [2012] considered a special case where the action set is the unit sphere and the noise is vector-valued, so that the noise becomes smaller as the dimension grows. We summarize the comparisons in Table 1.

## 2 Problem setting

At the beginning, the agent receives a compact action set $\mathcal{A} \subseteq \mathbb{R}^d$, where $d$ may be larger than the number of rounds $n$. At each round $t$, the agent chooses an action $A_t \in \mathcal{A}$ and receives a reward
$$Y_t = \langle A_t, \theta \rangle + \eta_t\,, \qquad (2.1)$$
where $(\eta_t)_{t=1}^n$ is a sequence of independent standard Gaussian random variables and $\theta \in \mathbb{R}^d$ is an unknown parameter vector. We make the mild boundedness assumption that $\|x\|_\infty \le 1$ for all $x \in \mathcal{A}$. The parameter vector $\theta$ is assumed to be $s$-sparse: $\sum_{j=1}^{d} \mathbb{1}\{\theta_j \neq 0\} \le s$. The Gaussian assumption can be relaxed to a conditional sub-Gaussian assumption for the regret upper bound, but is necessary for the regret lower bound.

The performance metric is the cumulative expected regret, which measures the difference between the expected cumulative reward collected by the omniscient policy that knows $\theta$ and that of the learner. The optimal action is $x^* = \operatorname{argmax}_{x \in \mathcal{A}} \langle x, \theta\rangle$ and the regret of the agent when facing the bandit determined by $\theta$ is
$$R_\theta(n) = \mathbb{E}\left[\sum_{t=1}^{n} \big(\langle x^*, \theta\rangle - \langle A_t, \theta\rangle\big)\right],$$
where the expectation is over the interaction sequence induced by the agent and the environment. Our primary focus is on finite-time bounds in the data-poor regime where $d \gg n$.

Notation. Let $[n] = \{1, 2, \ldots, n\}$. For a vector $x$ and a positive semidefinite matrix $A$, we let $\|x\|_A = \sqrt{x^\top A x}$ be the weighted $\ell_2$-norm and $\sigma_{\min}(A)$, $\sigma_{\max}(A)$ be the minimum and maximum eigenvalues of $A$, respectively. The cardinality of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. The support of a vector $x$, $\operatorname{supp}(x)$, is the set of indices $i$ such that $x_i \neq 0$, and $\mathbb{1}\{\cdot\}$ is the indicator function. The suboptimality gap of action $x \in \mathcal{A}$ is $\Delta_x = \langle x^*, \theta\rangle - \langle x, \theta\rangle$ and the minimum gap is $\Delta_{\min} = \min\{\Delta_x : x \in \mathcal{A}, \Delta_x > 0\}$.

## 3 Minimax lower bound

As promised, we start by proving a kind of minimax regret lower bound. We first define a quantity that measures the degree to which there exist good exploration distributions over the actions.

Definition 3.1. Let $\mathcal{P}(\mathcal{A})$ be the space of probability measures over $\mathcal{A}$ with the Borel $\sigma$-algebra and define
$$C_{\min}(\mathcal{A}) = \max_{\mu \in \mathcal{P}(\mathcal{A})} \sigma_{\min}\Big(\mathbb{E}_{A \sim \mu}\big[A A^\top\big]\Big)\,.$$

Remark 3.2. $C_{\min}(\mathcal{A}) > 0$ if and only if $\mathcal{A}$ spans $\mathbb{R}^d$.

Two illustrative examples are the hypercube and the probability simplex. Sampling uniformly from the corners of each set shows that $C_{\min}(\mathcal{A}) \ge 1$ for the former and $C_{\min}(\mathcal{A}) \ge 1/d$ for the latter.
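As a quick numerical check of Definition 3.1 (our own illustration, not part of the paper), the snippet below evaluates $\sigma_{\min}(\mathbb{E}_{A\sim\mu}[AA^\top])$ under the uniform design for the two examples above. The helper name `design_min_eig` is ours, and any fixed design only certifies a lower bound on $C_{\min}(\mathcal{A})$, which maximises over designs.

```python
import itertools
import numpy as np

def design_min_eig(actions, probs):
    # sigma_min(E_{A~mu}[A A^T]) for a discrete design mu on a finite action set.
    # Any fixed design mu lower-bounds C_min(A), which maximises over all designs.
    second_moment = sum(p * np.outer(a, a) for a, p in zip(actions, probs))
    return float(np.linalg.eigvalsh(second_moment)[0])

d = 6

# Hypercube corners {-1, 1}^d under the uniform design: E[A A^T] = I, so the value is 1.
cube = [np.array(c, dtype=float) for c in itertools.product([-1, 1], repeat=d)]
print(design_min_eig(cube, [1 / len(cube)] * len(cube)))  # -> 1.0

# Simplex corners e_1, ..., e_d under the uniform design: E[A A^T] = I/d, value 1/d.
simplex = list(np.eye(d))
print(design_min_eig(simplex, [1 / d] * d))               # -> 1/d ~= 0.1667
```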
The next theorem is a kind of minimax lower bound for sparse linear bandits. The key steps of the proof follow, with details and technical lemmas deferred to Appendix B.

Theorem 3.3. Consider the sparse linear bandit described in Eq. (2.1). Then for any policy $\pi$ there exists an action set $\mathcal{A}$ with $C_{\min}(\mathcal{A}) > 0$ and an $s$-sparse parameter $\theta \in \mathbb{R}^d$ such that
$$R_\theta(n) \;\ge\; \exp(-4)\, C_{\min}^{-1/3}(\mathcal{A})\, s^{1/3}\, n^{2/3}\,. \qquad (3.1)$$

Theorem 3.3 holds for any data regime and suggests an intriguing transition between $n^{2/3}$ and $n^{1/2}$ regret, depending on the relation between the horizon and the dimension. When $d > n^{1/3} s^{2/3}$ the bound is $\Omega(n^{2/3})$, which is independent of the dimension. On the other hand, when $d \le n^{1/3} s^{2/3}$, we recover the standard $\Omega(\sqrt{sdn})$ dimension-dependent lower bound up to a $\sqrt{s}$-factor. In Section 4, we prove that the $\Omega(n^{2/3})$ minimax lower bound is tight by presenting a nearly matching upper bound in the data-poor regime.

Remark 3.4. Theorem 3.3 has a worst-case flavor: for each algorithm we construct a problem instance with the given dimension, sparsity and value of $C_{\min}$ for which the stated regret bound holds. The main property of this type of hard instance is that it includes an informative but high-regret action set, so that the learning algorithm must balance the trade-off between information and regret. This leaves open the possibility of constructing analogous minimax lower bounds for other problems of interest.

Proof of Theorem 3.3. The proof uses the standard information-theoretic machinery, but with a novel construction and KL divergence calculation.

Step 1: construct a hard instance. We first construct a low-regret action set $\mathcal{S}$ and an informative action set $\mathcal{H}$ as follows:
$$\mathcal{S} = \Big\{ x \in \mathbb{R}^d : x_j \in \{-1, 0, 1\} \text{ for } j \in [d-1],\ \|x\|_1 = s-1,\ x_d = 0 \Big\}\,,$$
$$\mathcal{H} = \Big\{ x \in \mathbb{R}^d : x_j \in \{-\kappa, \kappa\} \text{ for } j \in [d-1],\ x_d = 1 \Big\}\,, \qquad (3.2)$$
where $0 < \kappa \le 1$ is a constant. The action set is the union $\mathcal{A} = \mathcal{S} \cup \mathcal{H}$ and we let
$$\theta = \big(\underbrace{\varepsilon, \ldots, \varepsilon}_{s-1}, 0, \ldots, 0, -1\big)\,,$$
where $\varepsilon > 0$ is a small constant to be tuned later. Because $\theta_d = -1$, actions in $\mathcal{H}$ are associated with a large regret. On the other hand, actions in $\mathcal{H}$ are also highly informative, which hints at an interesting trade-off between regret and information. Note that $\mathcal{H}$ is nearly the whole binary hypercube, while actions in $\mathcal{S}$ are $(s-1)$-sparse. The optimal action is in the action set $\mathcal{A}$:
$$x^* = \operatorname{argmax}_{x \in \mathcal{A}} \langle x, \theta\rangle = \big(\underbrace{1, \ldots, 1}_{s-1}, 0, \ldots, 0\big) \in \mathcal{A}\,. \qquad (3.3)$$

Step 2: construct an alternative bandit. The second step is to construct an alternative bandit $\tilde\theta$ that is hard to distinguish from $\theta$ and for which the optimal action for $\theta$ is suboptimal for $\tilde\theta$ and vice versa. Denote by $P_\theta$ and $P_{\tilde\theta}$ the measures on the sequence of outcomes $(A_1, Y_1, \ldots, A_n, Y_n)$ induced by the interaction between a fixed bandit algorithm and the bandits determined by $\theta$ and $\tilde\theta$, respectively, and let $\mathbb{E}_\theta$, $\mathbb{E}_{\tilde\theta}$ be the corresponding expectation operators. We define the set
$$\mathcal{S}' = \Big\{ x \in \mathbb{R}^d : x_j \in \{-1, 0, 1\} \text{ for } j \in \{s, s+1, \ldots, d-1\},\ x_j = 0 \text{ for } j \in \{1, \ldots, s-1, d\},\ \|x\|_1 = s-1 \Big\}\,. \qquad (3.4)$$
Clearly, $\mathcal{S}'$ is a subset of $\mathcal{S}$ and, for any $x \in \mathcal{S}'$, its support has no overlap with $\{1, \ldots, s-1\}$. We then set
$$\tilde{x} = \operatorname{argmin}_{x \in \mathcal{S}'} \mathbb{E}_\theta\left[\sum_{t=1}^{n} \langle A_t, x\rangle^2\right] \qquad (3.5)$$
and construct the alternative bandit $\tilde\theta$ as
$$\tilde\theta = \theta + 2\varepsilon \tilde{x}\,. \qquad (3.6)$$
Note that $\tilde\theta$ is $(2s-1)$-sparse, since $\tilde{x}$ belongs to $\mathcal{S}'$, a set of $(s-1)$-sparse vectors.
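To make the separation concrete, here is a short check, using only the definitions above, of the claim restated in the next sentence. Since $\operatorname{supp}(\tilde x) \cap \operatorname{supp}(x^*) = \emptyset$,
$$\langle x^*, \tilde\theta\rangle = \langle x^*, \theta\rangle + 2\varepsilon\,\langle x^*, \tilde x\rangle = (s-1)\varepsilon\,,$$
while the action $x \in \mathcal{S}$ supported on $\operatorname{supp}(\tilde x)$ with signs matching $\tilde x$ has $\langle x, \tilde\theta\rangle = 2\varepsilon(s-1)$. Hence the gap of $x^*$ under $\tilde\theta$ is at least $2\varepsilon(s-1) - (s-1)\varepsilon = (s-1)\varepsilon$, and for small $\varepsilon$ no action in $\mathcal{H}$ does better, since each such action pays $\tilde\theta_d = -1$ through its last coordinate.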
This design guarantees that the optimal arm $x^*$ in bandit $\theta$ is suboptimal in the alternative bandit $\tilde\theta$, and the suboptimality gap of $x^*$ in bandit $\tilde\theta$ is $\max_{x \in \mathcal{A}} \langle x - x^*, \tilde\theta\rangle = (s-1)\varepsilon$. Define the event
$$D = \left\{ \sum_{t=1}^{n} \mathbb{1}(A_t \in \mathcal{S}) \sum_{j=1}^{s-1} A_{tj} \;\le\; \frac{n(s-1)}{2} \right\}\,.$$
The next claim shows that when $D$ occurs, the regret is large in bandit $\theta$, while if it does not occur, then the regret is large in bandit $\tilde\theta$. The detailed proof is deferred to Appendix B.1.

Claim 3.5. Regret lower bounds with respect to the event $D$:
$$R_\theta(n) \ge \frac{n(s-1)\varepsilon}{2} P_\theta(D) \quad \text{and} \quad R_{\tilde\theta}(n) \ge \frac{n(s-1)\varepsilon}{2} P_{\tilde\theta}(D^c)\,.$$

By the Bretagnolle-Huber inequality (Lemma C.1 in the appendix),
$$R_\theta(n) + R_{\tilde\theta}(n) \;\ge\; \frac{n(s-1)\varepsilon}{2}\big(P_\theta(D) + P_{\tilde\theta}(D^c)\big) \;\ge\; \frac{n(s-1)\varepsilon}{4} \exp\big(-\mathrm{KL}(P_\theta, P_{\tilde\theta})\big)\,,$$
where $\mathrm{KL}(P_\theta, P_{\tilde\theta})$ is the KL divergence between the probability measures $P_\theta$ and $P_{\tilde\theta}$.

Step 3: calculating the KL divergence. We make use of the following bound on the KL divergence between $P_\theta$ and $P_{\tilde\theta}$, which formalises the intuitive notion of information: when the KL divergence is small, the algorithm is unable to distinguish the two environments. The detailed proof is deferred to Appendix B.2.

Claim 3.6. Define $T_n(\mathcal{H}) = \sum_{t=1}^{n} \mathbb{1}(A_t \in \mathcal{H})$. The KL divergence between $P_\theta$ and $P_{\tilde\theta}$ is upper bounded by
$$\mathrm{KL}(P_\theta, P_{\tilde\theta}) \;\le\; 2\varepsilon^2\left(\frac{n(s-1)^2}{d} + \kappa^2 (s-1)\, \mathbb{E}_\theta[T_n(\mathcal{H})]\right)\,. \qquad (3.7)$$

The first term on the right-hand side of the bound is the contribution from actions in the low-regret action set $\mathcal{S}$, while the second term is due to actions in $\mathcal{H}$. The fact that actions in $\mathcal{S}$ are not very informative is captured by the presence of the dimension in the denominator of the first term: when $d$ is very large, the algorithm simply does not gain much information by playing actions in $\mathcal{S}$.

When $\mathbb{E}_\theta[T_n(\mathcal{H})] < 1/(\kappa^2(s-1)\varepsilon^2)$, it is easy to see that
$$R_\theta(n) + R_{\tilde\theta}(n) \;\ge\; \frac{n(s-1)\varepsilon}{4} \exp\left(-\frac{2n\varepsilon^2(s-1)^2}{d}\right)\exp(-2)\,. \qquad (3.8)$$
On the other hand, when $\mathbb{E}_\theta[T_n(\mathcal{H})] > 1/(\kappa^2\varepsilon^2(s-1))$, we have
$$R_\theta(n) \;\ge\; \mathbb{E}_\theta[T_n(\mathcal{H})]\, \min_{x \in \mathcal{H}} \Delta_x \;\ge\; \frac{1}{\kappa^2\varepsilon^2(s-1)} + \frac{1-\kappa}{\kappa^2\varepsilon}\,, \qquad (3.9)$$
since $\min_{x \in \mathcal{H}} \Delta_x = 1 + (s-1)\varepsilon(1-\kappa)$ from the definition of $\mathcal{H}$ and $\theta$.

Step 4: conclusion. Combining the above two cases, we have
$$R_\theta(n) + R_{\tilde\theta}(n) \;\gtrsim\; \min\left\{ n s \varepsilon \exp(-2),\ \frac{1}{\kappa^2\varepsilon^2 s} + \frac{1-\kappa}{\kappa^2\varepsilon} \right\}\,,$$
where we replaced $s-1$ by $s$ in the final result for notational simplicity. Consider a sampling distribution $\mu$ that samples actions uniformly from $\mathcal{H}$. A simple calculation shows that $C_{\min}(\mathcal{A}) \ge C_{\min}(\mathcal{H}) \ge \kappa^2 > 0$. This is because
$$\sigma_{\min}\Big(\sum_{x \in \mathcal{H}} \mu(x)\, x x^\top\Big) = \sigma_{\min}\big(\mathbb{E}_{X \sim \mu}[X X^\top]\big) = \kappa^2\,,$$
where the first $d-1$ coordinates of the random vector $X \in \mathbb{R}^d$ are sampled independently and uniformly from $\{-\kappa, \kappa\}$ and $X_d = 1$. In the data-poor regime, when $d \ge n^{1/3} s^{2/3}$, we choose $\varepsilon = \kappa^{-2/3} s^{-2/3} n^{-1/3}$, so that
$$\max\big(R_\theta(n), R_{\tilde\theta}(n)\big) \;\ge\; \frac{R_\theta(n) + R_{\tilde\theta}(n)}{2} \;\ge\; \exp(-4)\, \kappa^{-2/3} s^{1/3} n^{2/3} \;\ge\; \exp(-4)\, C_{\min}^{-1/3}(\mathcal{A})\, s^{1/3} n^{2/3}\,.$$
Finally, in the data-rich regime, when $d < n^{1/3} s^{2/3}$, we choose $\varepsilon = \sqrt{d/(ns^2)}$ so that the exponential term is a constant, and then
$$\max\big(R_\theta(n), R_{\tilde\theta}(n)\big) \;\ge\; \frac{R_\theta(n) + R_{\tilde\theta}(n)}{2} \;\ge\; \exp(-4)\,\sqrt{dn}\,.$$

## 4 Matching upper bound

We now propose a simple algorithm based on the explore-then-commit paradigm² and show that the minimax lower bound in Eq. (3.1) is essentially achievable. As one might guess, the algorithm has two stages. First, it solves the following optimization problem to find the most informative design:
$$\max_{\mu \in \mathcal{P}(\mathcal{A})} \sigma_{\min}\left(\int_{x\in\mathcal{A}} x x^\top \,\mathrm{d}\mu(x)\right)\,. \qquad (4.1)$$
In the exploration stage, the agent samples its actions from the maximiser $\hat\mu$ of (4.1) for $n_1$ rounds, collecting a data set $\{(A_1, Y_1), \ldots, (A_{n_1}, Y_{n_1})\}$. The agent uses the data collected in the exploration stage to compute the Lasso estimator $\hat\theta_{n_1}$. In the commit stage, the agent executes the greedy action for the remaining $n - n_1$ rounds. The detailed algorithm is summarized in Algorithm 1.

Remark 4.1. The minimum eigenvalue is a concave function of the design [Boyd et al., 2004], which means that the solution to (4.1) can be approximated efficiently using standard tools such as CVXPY [Diamond and Boyd, 2016].
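As a concrete illustration of Remark 4.1 (a sketch of ours, not code from the paper), the design problem (4.1) restricted to a finite action set can be written directly in CVXPY. The function name `optimal_design` and the random action set below are ours.

```python
import cvxpy as cp
import numpy as np

def optimal_design(actions):
    # Approximately solve (4.1) for a finite action set:
    #   maximize over designs mu of sigma_min( sum_i mu_i * a_i a_i^T ).
    # `actions` is a (K, d) array; returns the design weights mu_hat.
    K, d = actions.shape
    mu = cp.Variable(K, nonneg=True)
    second_moment = sum(mu[i] * np.outer(actions[i], actions[i]) for i in range(K))
    problem = cp.Problem(cp.Maximize(cp.lambda_min(second_moment)), [cp.sum(mu) == 1])
    problem.solve()  # lambda_min is concave, so this is a valid convex program
    return mu.value

# Example: 50 random actions in dimension 10 with entries in [-1, 1].
rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(50, 10))
mu_hat = optimal_design(A)
print("estimated C_min:", np.linalg.eigvalsh((A * mu_hat[:, None]).T @ A).min())
```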
The following theorem states a regret upper bound for Algorithm 1. The proof is deferred to Appendix B.3.

Theorem 4.2. Consider the sparse linear bandit described in Eq. (2.1) and assume the action set $\mathcal{A}$ spans $\mathbb{R}^d$. Suppose $R_{\max}$ is an upper bound on the maximum expected reward, i.e. $\max_{x\in\mathcal{A}} \langle x, \theta\rangle \le R_{\max}$. In Algorithm 1, choose
$$n_1 = n^{2/3} \big(s^2 \log(2d)\big)^{1/3} R_{\max}^{-2/3} \big(2/C_{\min}^2(\mathcal{A})\big)^{1/3} \qquad (4.3)$$
and $\lambda_1 = 4\sqrt{\log(d)/n_1}$. Then the following regret upper bound holds:
$$R_\theta(n) \;\le\; \big(2\log(2d)\, R_{\max}\big)^{1/3} C_{\min}^{-2/3}(\mathcal{A})\, s^{2/3} n^{2/3} + 3 n R_{\max} \exp(-c_1 n_1)\,. \qquad (4.4)$$

²The explore-then-commit template is also considered in other works [Deshmukh et al., 2018], but both the exploration and exploitation stages are very different: Deshmukh et al. [2018] consider simple regret minimization, while we focus on cumulative regret minimization.

Algorithm 1: Explore the sparsity then commit (ESTC)
1: Input: time horizon $n$, action set $\mathcal{A}$, exploration length $n_1$, regularization parameter $\lambda_1$.
2: Solve the optimization problem in Eq. (4.1) and denote the solution by $\hat\mu$.
3: for $t = 1, \ldots, n_1$ do
4:   Independently pull arm $A_t$ according to $\hat\mu$ and receive a reward $Y_t = \langle A_t, \theta\rangle + \eta_t$.
5: end for
6: Compute the Lasso estimator [Tibshirani, 1996]:
$$\hat\theta_{n_1} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n_1}\sum_{t=1}^{n_1}\big(Y_t - \langle A_t, \theta\rangle\big)^2 + \lambda_1 \|\theta\|_1\,. \qquad (4.2)$$
7: for $t = n_1 + 1$ to $n$ do
8:   Take the greedy action $A_t = \operatorname{argmax}_{x\in\mathcal{A}} \langle \hat\theta_{n_1}, x\rangle$.
9: end for
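For concreteness, here is a compact sketch of the ESTC template for a finite action set (our own simplification, not the authors' code). The helper name `estc`, the callback `reward_fn`, and the use of scikit-learn's Lasso are assumptions of this sketch; scikit-learn's `alpha` matches $\lambda_1$ only up to the constant-factor difference in how the squared loss is normalized.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estc(actions, reward_fn, n, n1, lam1, design, rng):
    # Explore-then-commit sketch of Algorithm 1 (ESTC).
    # actions: (K, d) finite action set; design: exploration weights over its rows;
    # reward_fn(x): returns one noisy reward for action x. Returns total reward collected.
    K, d = actions.shape
    X, y = [], []
    for _ in range(n1):                      # exploration stage: sample from the design
        a = actions[rng.choice(K, p=design)]
        X.append(a)
        y.append(reward_fn(a))
    theta_hat = Lasso(alpha=lam1, fit_intercept=False).fit(np.array(X), np.array(y)).coef_
    greedy = actions[np.argmax(actions @ theta_hat)]
    total = float(sum(y))
    for _ in range(n - n1):                  # commit stage: play the greedy action
        total += reward_fn(greedy)
    return total
```

In an experiment one would set $n_1$ and $\lambda_1$ as in Theorem 4.2 and pass in the design returned by the sketch after Remark 4.1.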
Together with the minimax lower bound in Theorem 3.3, this shows that the ESTC algorithm is minimax optimal in the time horizon $n$ in the data-poor regime.

Remark 4.3. The regret upper bound (4.4) may still depend on $d$, because $1/C_{\min}(\mathcal{A})$ could be as large as $d$. Indeed, if the action set consists of the standard basis vectors, then the problem reduces to the standard multi-armed bandit, for which the minimax regret is $\Theta(\sqrt{dn})$ even with sparsity. If we restrict our attention to the class of action sets for which $1/C_{\min}(\mathcal{A})$ is dimension-free, then we obtain a dimension-free upper bound.

Remark 4.4. Another notion frequently appearing in high-dimensional statistics is the restricted eigenvalue condition. Demanding a lower bound on the restricted eigenvalue is weaker than demanding one on the minimum eigenvalue, which can lead to stronger results. As it happens, however, the two coincide in the lower bound construction. The upper bound may also be sharpened, but the resulting optimization problem would (a) depend on the sparsity $s$ and (b) have a complicated objective for which an efficient algorithm is not yet apparent.

Remark 4.5. There is still an $(s/C_{\min}(\mathcal{A}))^{1/3}$ gap between the lower bound (Eq. (3.1)) and the upper bound (Eq. (4.4)), ignoring logarithmic factors. We conjecture that the use of the $\ell_1/\ell_\infty$ inequality in the proof of Theorem 4.2 is quite conservative. Specifically, we bound the following using the $\ell_1$-norm bound for the Lasso (see Eq. (B.15) in Appendix B.3 for details):
$$\langle \theta - \hat\theta_{n_1},\, x^* - A_t\rangle \;\le\; \|\theta - \hat\theta_{n_1}\|_1 \,\|x^* - A_t\|_\infty\,.$$
This inequality ignores the sign information of $\hat\theta_{n_1}$ and the correlation between $x^* - A_t$ and $\hat\theta_{n_1}$. A similar phenomenon has been observed by Javanmard et al. [2018] and resolved by means of a delicate leave-one-out analysis to decouple the correlation. An interesting question is whether or not a similar technique could be used in our case to improve the above bound to $\sqrt{s\log(d)/n_1}$, closing the gap between the regret upper and lower bounds. On the other hand, surprisingly, even in classical statistical settings there are still gaps between upper and lower bounds in terms of $C_{\min}(\mathcal{A})$ [Raskutti et al., 2011]. We speculate that the upper bound may be improvable, though at present we do not know how to do it.

Remark 4.6. The algorithm uses knowledge of the sparsity to tune the length of the exploration stage in Eq. (4.3). When the sparsity is not known, the length of exploration can be set to $n_1 = n^{2/3}$; the price is an additional factor of $O(s^{1/3})$ in the regret. This is an advantage relative to the algorithm of Abbasi-Yadkori et al. [2012], for which knowledge of the sparsity is apparently essential for constructing the confidence set.

Remark 4.7. We do not expect explicit optimism-based algorithms [Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010, Chu et al., 2011, Abbasi-Yadkori et al., 2011] or implicit ones, such as Thompson sampling [Agrawal and Goyal, 2013], to achieve the minimax lower bound in the data-poor regime. The reason is that the optimism principle does not balance the trade-off between information and regret, a phenomenon that has been observed before in linear and structured bandits [Lattimore and Szepesvari, 2017, Combes et al., 2017, Hao et al., 2020].

## 5 Improved upper bound

In this section, we show that under an additional minimum-signal condition, the restricted phase elimination algorithm can achieve a sharper $O(\sqrt{sn})$ regret upper bound. The algorithm shares a similar two-step structure with Carpentier and Munos [2012]: a feature selection step followed by a restricted linear bandits step. In the feature selection step, the agent pulls arms for $n_2$ rounds following $\hat\mu$ as in (4.1), and the Lasso is then used to conduct feature selection. Based on the support the Lasso selects, the algorithm invokes the phased elimination algorithm for linear bandits [Lattimore et al., 2020] on the selected support.

Algorithm 2: Restricted phase elimination
1: Input: time horizon $n$, action set $\mathcal{A}$, exploration length $n_2$, regularization parameter $\lambda_2$.
2: Solve the optimization problem in Eq. (4.1) and denote the solution by $\hat\mu$.
3: for $t = 1, \ldots, n_2$ do
4:   Independently pull arm $A_t$ according to $\hat\mu$ and receive a reward $Y_t = \langle A_t, \theta\rangle + \eta_t$.
5: end for
6: Compute the Lasso estimator $\hat\theta_{n_2}$ as in Eq. (4.2) with $\lambda_2$.
7: Identify the support: $\hat{S} = \operatorname{supp}(\hat\theta_{n_2})$.
8: for $t = n_2 + 1$ to $n$ do
9:   Invoke the phased elimination algorithm for linear bandits on $\hat{S}$.
10: end for

Condition 5.1 (Minimum signal). We assume the minimum non-zero component of $\theta$ satisfies $\min_{j \in \operatorname{supp}(\theta)} |\theta_j| > 1/C_{\min}(\mathcal{A})$.

Theorem 5.2. Consider the sparse linear bandit described in Eq. (2.1). Assume the action set $\mathcal{A}$ spans $\mathbb{R}^d$ with $|\mathcal{A}| = K < \infty$, and suppose Condition 5.1 holds. Let $n_2 = C_1 s \log(d)$ for a suitably large constant $C_1$ and choose $\lambda_2 = 4\sqrt{\log(d)/n_2}$. Denote $\phi_{\max} = \sigma_{\max}\big(\sum_{t=1}^{n_2} A_t A_t^\top / n_2\big)$. Then the following regret upper bound holds for Algorithm 2:
$$R_\theta(n) \;\le\; C\left(s\log(d) + \sqrt{\frac{9\,\phi_{\max}\log(Kn)\, s\, n}{C_{\min}(\mathcal{A})}}\right) \qquad (5.1)$$
for a universal constant $C > 0$.

The proof is deferred to Appendix B.4. It utilizes the sparsity and variable-screening properties of the Lasso. More precisely, under the minimum signal condition, the Lasso estimator identifies all the important covariates, i.e., $\operatorname{supp}(\hat\theta_{n_2}) \supseteq \operatorname{supp}(\theta)$, and the selected model is sufficiently sparse, i.e., $|\operatorname{supp}(\hat\theta_{n_2})| \lesssim s$. Therefore, it is enough to run a linear bandit algorithm on $\operatorname{supp}(\hat\theta_{n_2})$.

Remark 5.3. It is possible to remove the dependence on $\phi_{\max}$ in Eq. (5.1) with a more dedicated analysis, using Theorem 3 of Belloni et al. [2013].

The reason we choose a phased-elimination-type algorithm is that it has an optimal regret guarantee when the size of the action set is moderately large. When the action set has an infinite number of actions, we could switch to the linear UCB algorithm [Abbasi-Yadkori et al., 2011] or appeal to a discretisation argument.
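The feature-selection stage of Algorithm 2 can be sketched as follows (again our own simplification rather than the authors' code; the threshold `tol` and the helper name are ours). The second stage then runs phased elimination, or any off-the-shelf linear bandit routine, on the returned low-dimensional action set.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_support(actions, design, reward_fn, n2, lam2, rng, tol=1e-8):
    # Feature-selection stage of Algorithm 2: explore with the design for n2 rounds,
    # fit the Lasso, and return the selected support with the restricted action set.
    K, d = actions.shape
    idx = rng.choice(K, size=n2, p=design)
    X = actions[idx]
    y = np.array([reward_fn(a) for a in X])
    theta_hat = Lasso(alpha=lam2, fit_intercept=False).fit(X, y).coef_
    support = np.flatnonzero(np.abs(theta_hat) > tol)
    return support, actions[:, support]  # actions restricted to the selected coordinates
```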
## 6 Experiment

We compare ESTC (our algorithm) with LinUCB [Abbasi-Yadkori et al., 2011] and doubly-robust (DR) Lasso bandits [Kim and Paik, 2019]. For ESTC, we use the theoretically suggested length of the exploration stage. For LinUCB, we use the theoretically suggested confidence interval. For DR-Lasso, we use the code made available online by the authors.

Case 1: linear contextual bandits. We use the setting in Section 5 of Kim and Paik [2019] with $N = 20$ arms, dimension $d = 100$ and sparsity $s = 5$. At round $t$, we generate the action set from $\mathcal{N}(0_N, V)$, where $V_{ii} = 1$ and $V_{ik} = \rho^2$ for every $i \neq k$. Larger $\rho$ corresponds to a higher-correlation setting that is more favorable to DR-Lasso. The noise is drawn from $\mathcal{N}(0, 1)$ and $\|\theta\|_0 = s$.

Case 2: hard problem instance. We consider the hard problem instance from the proof of the minimax lower bound (Theorem 3.3), consisting of an informative action set and an uninformative action set. Since the size of the action set constructed in the hard instance grows exponentially with $d$, we sample 500 actions uniformly at random from the full informative action set and 200 from the uninformative action set.

Conclusion: The experiments confirm our theoretical findings. Although our theory focuses on the fixed action set setting, ESTC works well in the contextual setting. DR-Lasso bandits rely heavily on the context distribution assumption and almost fail on the hard instance. LinUCB suffers in the data-poor regime since it ignores the sparsity information.

Figure 1: The top two figures are for Case 1 and the bottom two figures are for Case 2.

## 7 Discussion

In this paper, we provide a thorough investigation of high-dimensional sparse linear bandits and show that $\Theta(n^{2/3})$ is the optimal rate in the data-poor regime. Our work leaves open many questions about how the shape of the action set affects the regret, which reveals the subtle trade-off between information and regret. For instance, it is unclear how the regret lower bound depends on $C_{\min}(\mathcal{A})$ in the data-rich regime, and whether $C_{\min}(\mathcal{A})$ is the best quantity to describe the shape of the action set $\mathcal{A}$. On the other hand, the ESTC algorithm only achieves the optimal regret bound in the data-poor regime and becomes suboptimal in the data-rich regime. It would be interesting to design an algorithm that achieves the optimal regret in both regimes ("best of both worlds"). Information-directed sampling [Russo and Van Roy, 2014] might be a good candidate, since it delicately balances the trade-off between information and regret, which is necessary in sparse linear bandits.

## Broader Impact

We believe the presented research should be categorized as basic research and we are not targeting any specific application area. The theorems may inspire new algorithms and theoretical investigations. The algorithms presented here can be used for many different applications, and a particular use may have both positive and negative impacts. We are not aware of any immediate short-term negative implications of this research and we believe that a broader impact statement is not required for this paper.

## Acknowledgments and Disclosure of Funding

Mengdi Wang gratefully acknowledges funding from the U.S.
National Science Foundation (NSF) grant CMMI-1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI.

## References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9, 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

G. Bartók, D. P. Foster, D. Pál, A. Rakhlin, and Cs. Szepesvári. Partial monitoring: classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.

Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.

Alexandre Belloni, Victor Chernozhukov, et al. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 2013.

Peter J. Bickel, Ya'acov Ritov, Alexandre B. Tsybakov, et al. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Peter Bühlmann and Sara Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190–198, 2012.

Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.

Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1763–1771, 2017.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

Aniket Anand Deshmukh, Srinagesh Sharma, James W. Cutler, Mark Moldwin, and Clayton Scott. Simple regret minimization for contextual bandits. arXiv preprint arXiv:1810.07371, 2018.

Yash Deshpande and Andrea Montanari. Linear bandits in high dimension and recommendation systems. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1750–1754. IEEE, 2012.

Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.

Botao Hao, Tor Lattimore, and Csaba Szepesvári. Adaptive exploration in linear contextual bandit. AISTATS, 2020.

Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.

Adel Javanmard, Andrea Montanari, et al. Debiasing the Lasso: Optimal sample size for Gaussian designs. The Annals of Statistics, 46(6A):2593–2622, 2018.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust Lasso bandit.
In Advances in Neural Information Processing Systems, pages 5869–5879, 2019.

Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737, 2017.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972, 2015.

Tor Lattimore, Csaba Szepesvári, and Gellért Weisz. Learning with good feature representations in bandits and in RL with a generative model. International Conference on Machine Learning, 2020.

G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434–3447, 2013.

Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Vidyashankar Sivakumar, Zhiwei Steven Wu, and Arindam Banerjee. Structured linear contextual bandits: A sharp and geometric smoothed analysis. International Conference on Machine Learning, 2020.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519, 9780387790510.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Xue Wang, Mingcheng Wei, and Tao Yao. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pages 5200–5208, 2018.