# Thresholded Lasso Bandit

Kaito Ariu¹ ², Kenshi Abe², Alexandre Proutière¹

¹EECS and Digital Futures, KTH Royal Institute of Technology, Stockholm, Sweden. ²CyberAgent, Inc., Tokyo, Japan. Correspondence to: Kaito Ariu.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension d, but where the reward function depends on a few, say s₀ ≪ d, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., significant feature elements, using the Lasso framework with thresholding, and (ii) selects an arm greedily according to this estimate projected on its support. The algorithm does not require prior knowledge of the sparsity index s₀ and can be parameter-free under some symmetry assumptions. For this simple algorithm, we establish non-asymptotic regret upper bounds scaling as O(log d + √T) in general, and as O(log d + log T) under the so-called margin condition (a probabilistic condition on the separation of the arm rewards). The regret of previous algorithms scales as O(log d + √(T log(dT))) and O(log T · log d) in the two settings, respectively. Through numerical experiments, we confirm that our algorithm outperforms existing methods.

## 1. Introduction

The linear contextual bandit (Abe & Long, 1999; Li et al., 2010) is a sequential decision-making problem that generalizes the classical stochastic Multi-Armed Bandit (MAB) problem (Lai & Robbins, 1985; Robbins, 1952), where (i) in each round, the decision-maker is provided with a context in the form of a feature vector for each arm, and (ii) the expected reward is a linear function of these vectors. More precisely, at the beginning of round t ≥ 1, the decision-maker receives for each arm k a feature vector A_{t,k} ∈ R^d. She then selects an arm, say k, and observes a sample of a random reward with mean ⟨A_{t,k}, θ⟩. The parameter vector θ ∈ R^d is fixed but initially unknown.

Linear contextual bandits have been extensively applied in online services such as online advertisement and recommendation systems (Li et al., 2010; 2016; Zeng et al., 2016), and arguably constitute the most relevant structured bandit model in practice. The major challenge in the design of efficient algorithms for contextual linear bandits stems from the high dimensionality of the feature space. For example, for the display ad systems studied in Chapelle & Li (2011); Weinberger et al. (2009), the joint information about a user, an ad, and its publisher is encoded in a feature vector of dimension d = 2²⁴. Fortunately, typically only a few features significantly impact the expected reward. This observation has motivated the analysis of problems where the unknown parameter vector θ is sparse (Bastani & Bayati, 2015; Kim & Paik, 2019; Oh et al., 2021; Wang et al., 2018). In this paper, we also investigate sparse contextual linear bandits, and assume that θ has at most s₀ ≪ d non-zero components. The set of these components and its cardinality s₀ are unknown to the decision-maker. Sparse contextual linear bandits have attracted a lot of attention recently.
State-of-the-art algorithms developed to exploit the sparse structure achieve regrets scaling as O(log d + √(T log(dT))) in general, and O(log d · log T) under the so-called margin condition (a setting where arms are well separated); refer to Section 2 for details. We develop a novel algorithm, referred to as Thresholded Lasso bandit¹, with improved regret guarantees. Our algorithm first uses the Lasso framework with thresholding to maintain and update, in each round, estimates of the parameter vector θ and of its support. It then greedily picks an arm based on these estimates (the thresholded estimates of θ). The regret of the algorithm strongly depends on the accuracy of these estimates. We derive strong guarantees on this accuracy, which in turn lead to regret guarantees. Our contributions are as follows.

¹ An implementation of our method is available at https://github.com/CyberAgentAILab/thresholded-lasso-bandit.

(i) Thresholded Lasso Estimation Performance. The performance of the Lasso-based estimation procedure is by now fairly well understood, see e.g., Bühlmann & Van De Geer (2011); Tibshirani (1996); Zhou (2010). For example, Zhou (2010) provides an analysis of the estimation of the support of θ, and specifically gives upper bounds on the number of false positives (components that are not in the support, but estimated as part of it) and false negatives (components that are in the support, but not estimated as part of it). These analyses, however, critically rely on the assumption that the observed data are i.i.d. This assumption does not hold for the bandit problem, as the algorithm adapts its arm selection strategy to past observations. Despite the non-i.i.d. nature of the data, we manage to derive performance guarantees for the estimate of θ. In particular, we establish high-probability guarantees that are independent of the dimension d (see Lemma 5.7).

(ii) Regret Guarantees. Based on the analysis of the Thresholded Lasso estimation procedure, we provide a finite-time analysis of the regret of our algorithm under certain symmetry assumptions made in Oh et al. (2021). The regret scales at most as O(log d + √T) in general and O(log d + log T) under the margin condition. More precisely, the estimation error of θ induces a regret scaling as O(√T) (or O(log T) under the margin condition). The additional O(log d) term in our regret upper bounds comes from the errors made when estimating the support of θ. It is worth noting that when using the plain Lasso estimator (without thresholding), one would obtain weaker performance guarantees for the estimation of θ, typically depending on d, see e.g., Bastani & Bayati (2015). This dependence causes an additional multiplicative log d term in the regret.

(iii) Numerical Experiments. We present extensive numerical experiments to illustrate the performance of the Thresholded Lasso bandit algorithm. We compare the estimation accuracy for θ and the regret of our algorithm to those of the Lasso bandit, Doubly-Robust Lasso bandit, and Sparsity-Agnostic Lasso bandit algorithms (Bastani & Bayati, 2020; Kim & Paik, 2019; Oh et al., 2021). These experiments confirm the benefit of using the Lasso procedure with thresholding.

## 2. Related Work

Stochastic linear bandit problems have attracted a lot of attention over the last decade. Carpentier & Munos (2012) addresses sparse linear bandits where ‖θ‖₀ ≤ s₀ and where the set of arms is restricted to the ℓ₂ unit ball.
For regimes where the time horizon is much smaller than the dimension d, i.e., T ≪ d, the authors propose an algorithm whose regret scales at most as O(s₀√(T log(dT))). Hao et al. (2020b) studied high-dimensional linear bandit problems where the number of actions is larger than or equal to d. Under some signal-strength conditions, they propose an algorithm that achieves a regret of O(s₀ log d + √(s₀ T log(dT))). These studies, however, do not consider problems with contextual information.

Recently, high-dimensional contextual linear bandits have been investigated under the sparsity assumption ‖θ‖₀ ≤ s₀. In this line of research, the decision-maker is provided in each round with a set of arms defined by a finite set of feature vectors, and this set is uniformly bounded across rounds. In this setting, the authors of Abbasi-Yadkori et al. (2012) devise an algorithm with both a minimax (problem-independent) regret upper bound Õ(√(s₀ d T)) and a problem-dependent upper bound O(d s₀ (log T)²) (the notation Õ hides polylogarithmic terms), without any assumption on the distribution (other than assumptions similar to our Assumption 3.1). In Bastani & Bayati (2020) (initially published in 2015 as Bastani & Bayati (2015)), the authors address a high-dimensional contextual linear bandit problem where the unknown parameter defining the reward function is arm-specific (θ differs across arms). In the proposed algorithm, arms are explored uniformly at random for O(s₀² log d log T) prespecified rounds. Under the margin condition, similar to our Assumption 5.1, the algorithm achieves a regret of O(s₀²(log d + log T)²). For the same problem, Wang et al. (2018) develops the so-called MCP-Bandit algorithm. The latter also uses uniform exploration for O(s₀² log d log T) prespecified rounds, and has improved regret guarantees: the regret scales as O(s₀²(log d + s₀) log T).

High-dimensional contextual linear bandits have also been studied without the margin condition, but with a unique parameter θ defining the reward function. Kim & Paik (2019) designs an algorithm with uniform exploration phases of O(√(T log(dT)) log T) rounds, and with regret O(s₀√(T log(dT))). All the aforementioned algorithms require knowledge of the sparsity index s₀. In Oh et al. (2021), the authors propose an algorithm, referred to as SA Lasso bandit, that does not require this knowledge, and with regret O(s₀² log d + s₀√(T log(dT))). In addition, SA Lasso bandit does not include any uniform exploration phase. Its regret guarantees are derived under specific assumptions on the context distribution, the so-called relaxed symmetry assumption and the balanced covariance assumption. The authors also establish an O(s₀² log d + √(s₀ T log(dT))) regret upper bound under the so-called restricted eigenvalue condition. In Ren & Zhou (2020), under the restricted eigenvalue condition induced by the restricted bounded density assumption, the authors establish a regret bound of O(s₀ polylog(d) + √(s₀ T log d) · polylog(T)).

One notable recent development in contextual bandits is the regret analysis of exploration-free, or greedy, algorithms. Under some symmetry assumptions on the context distribution, these algorithms exhibit sub-linear regret (Bastani et al., 2021; Kannan et al., 2018; Ren & Zhou, 2020; Oh et al., 2021). Our analysis is also inspired by the results on exploration-free algorithms. In this paper, we develop an algorithm with improved regret guarantees with and without the margin condition. The algorithm does not rely on knowledge of s₀ and can be made parameter-free; such a parameter-free algorithm has not been proposed in other recent papers. We summarize the relevant studies and our work in Table 1, and we come back to this table when discussing our assumptions.
Table 1. Algorithms and their regret guarantees, with scaling shown with respect to d and T. The O(·) notation hides sub-logarithmic factors in d and logarithmic factors in s₀. The Compatibility and Margin columns refer to Assumptions 3.2 and 5.1.

| Paper | Regret | Compatibility | Margin | Other |
|---|---|---|---|---|
| Abbasi-Yadkori et al. (2012) | O(d s₀ (log T)²) | | | |
| Bastani & Bayati (2020) | O(s₀² (log d + log T)²) | ✓ | ✓ | |
| Wang et al. (2018) | O(s₀² (log d + s₀) log T) | ✓ | ✓ | |
| Kim & Paik (2019) | O(s₀ √(T log(dT))) | ✓ | | |
| Ren & Zhou (2020) | O(s₀ polylog(d) + log(T) √(s₀ T log d)) | | | Restricted Bounded Density |
| Oh et al. (2021) | O(s₀² log d + s₀ √(T log(dT))) | ✓ | | Asm 3.3, 3.4 |
| This work: Theorem 5.2 | O(s₀² log d + s₀ log T) | ✓ | ✓ | Asm 3.3, 3.4, 3.5 |
| This work: Theorem 5.3 | O(s₀² log d + √(s₀ T)) | ✓ | | Asm 3.3, 3.4, 3.5 |
| Bastani et al. (2021) (non-sparse setting) | O(d log d log T) | | | Covariate diversity condition (stronger than Asm 3.5) |

## 3. Model and Assumptions

### 3.1. Model and Notation

We consider a contextual linear stochastic bandit problem in a high-dimensional space. In each round t ∈ [T] := {1, . . . , T}, the algorithm is given a set of context vectors 𝒜_t = {A_{t,k} ∈ R^d : k ∈ [K]}. The successive sets (𝒜_t)_{t≥1} form an i.i.d. sequence with distribution p_A. In round t, the algorithm selects an arm A_t ∈ 𝒜_t based on past observations, and collects a random reward r_t. Formally, if F_t is the σ-algebra generated by the random variables (𝒜_1, A_1, r_1, . . . , 𝒜_{t−1}, A_{t−1}, r_{t−1}, 𝒜_t), then A_t is F_t-measurable. We assume that r_t = ⟨A_t, θ⟩ + ε_t, where ε_t is a zero-mean sub-Gaussian random variable with variance proxy σ², given F_t and A_t. Our objective is to devise an algorithm with minimal regret, where the regret is defined as:

R(T) := E[ Σ_{t=1}^T ( max_{A∈𝒜_t} ⟨A, θ⟩ − r_t ) ] = E[ Σ_{t=1}^T ( max_{A∈𝒜_t} ⟨A, θ⟩ − ⟨A_t, θ⟩ ) ].

Notation. The ℓ₀-norm of a vector θ ∈ R^d is ‖θ‖₀ = Σ_{i=1}^d 1{θ_i ≠ 0}. We denote by Σ̂_t = (1/t) Σ_{s=1}^t A_s A_s^⊤ the empirical Gram matrix generated by the arms selected under a specific algorithm. For any B ⊆ [d], we define θ_B := (θ_{1,B}, . . . , θ_{d,B}), where for all i ∈ [d], θ_{i,B} := θ_i 1{i ∈ B}. For each B ⊆ [d], we define the submatrix A(B) ∈ R^{n×|B|} of A ∈ R^{n×d} obtained by keeping the columns of A with indices in B. We denote by supp(x) the set of indices of the non-zero elements of x ∈ R^d. We also define θ_min as the minimal value of |θ_i| on the support: θ_min := min_{i∈supp(θ)} |θ_i|. We denote by S(θ) = supp(θ) = {i ∈ [d] : θ_i ≠ 0} the support of θ. Definitions and notations are also summarized in Appendix A, Table 2.
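To make this setting concrete, the following minimal Python sketch instantiates the model just described: a sparse parameter θ, i.i.d. context sets, rewards r_t = ⟨A_t, θ⟩ + ε_t, and the per-round gap appearing in R(T). It is only an illustration of the problem formulation (sizes and the context distribution are placeholders we chose), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, s0, sigma = 1000, 2, 5, 1.0            # illustrative sizes, not the paper's defaults

# sparse reward parameter theta with s0 non-zero entries
theta = np.zeros(d)
support = rng.choice(d, size=s0, replace=False)
theta[support] = rng.uniform(1.0, 2.0, size=s0)

def sample_context_set():
    """One i.i.d. context set A_t = {A_{t,1}, ..., A_{t,K}} (placeholder distribution p_A)."""
    return rng.normal(size=(K, d))

def step(A_t, k):
    """Observed reward r_t = <A_{t,k}, theta> + eps_t with Gaussian (hence sub-Gaussian) noise."""
    return A_t[k] @ theta + sigma * rng.normal()

def instantaneous_regret(A_t, k):
    """One summand of R(T): max_{A in A_t} <A, theta> - <A_{t,k}, theta>."""
    values = A_t @ theta
    return values.max() - values[k]
```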
### 3.2. Assumptions

We present a set of assumptions used throughout the paper. Many assumptions are essentially from Oh et al. (2021); there are, however, some differences, which will be discussed. We also discuss and compare these assumptions to those made in the related literature.

Assumption 3.1 (Sparsity and parameter constraints). The parameter θ defining the reward function is sparse, i.e., ‖θ‖₀ ≤ s₀ for some fixed but unknown integer s₀ (s₀ does not depend on d). We further assume that ‖θ‖₁ ≤ s₁ for some unknown constant² s₁ and that θ_min ≥ s₂/s₀ for some unknown constant s₂ (< s₁). Finally, we assume that the ℓ∞-norm of the context vectors is bounded: for all t and for all A ∈ 𝒜_t, ‖A‖∞ ≤ s_A, where s_A > 0 is a constant.

² Given this assumption that s₁ is a constant, note that the value of θ_min would scale as Θ(1/s₀), since s₀ θ_min ≤ ‖θ‖₁ ≤ s₁ may hold. Note that in Oh et al. (2021), ‖θ‖₂ ∈ Θ(1) and ‖A‖₂ ∈ Θ(1) (for all A ∈ 𝒜_t) are assumed.

Assumption 3.2 (Compatibility condition). For a matrix M ∈ R^{d×d} and a set S₀ ⊆ [d], we define the compatibility constant φ(M, S₀) through

φ²(M, S₀) := min_{x: ‖x_{S₀}‖₁ ≠ 0} { |S₀| x⊤ M x / ‖x_{S₀}‖₁² : ‖x_{S₀^c}‖₁ ≤ 3 ‖x_{S₀}‖₁ }.

We assume that the Gram matrix of the action set, Σ := (1/K) Σ_{k=1}^K E_{A∼p_A}[A_k A_k^⊤], satisfies φ²(Σ, S(θ)) ≥ φ₀², where φ₀ > 0 is some positive constant.

The compatibility condition was introduced in the high-dimensional statistics literature (Bühlmann & Van De Geer, 2011). It ensures that the Lasso estimate (Tibshirani, 1996) of the parameter θ approaches its true value as the number of samples grows large. It is easy to check that the compatibility condition is strictly weaker than assuming positive definiteness of Σ; it allows us to consider feature vectors with strongly correlated components. Assumption 3.2 is considered essential for the Lasso estimate to be consistent, and it is assumed in many of the relevant studies; see Table 1 for studies using the compatibility condition. The constant φ₀ can be chosen independent of d. This is the case, for example, when the context distribution is a multivariate Gaussian or a uniform distribution: in these examples, the minimum eigenvalue of the Gram matrix is lower bounded by some constant, and when the minimum eigenvalue is lower bounded by a constant, the compatibility constant is as well (Bühlmann & Van De Geer, 2011).

Assumption 3.3 (Relaxed symmetry (Oh et al., 2021)). For the context distribution p_A, there exists a constant ν ≥ 1 such that for all A ∈ R^{K×d} with p_A(A) > 0, p_A(A)/p_A(−A) ≤ ν.

Assumption 3.4 (Balanced covariance (Oh et al., 2021)). For a fixed θ, there exists a constant C_b > 1 such that for any permutation γ of [K] and any integer k ∈ {2, . . . , K−1},

C_b E_{A∼p_A}[ (A_{γ(1)} A_{γ(1)}^⊤ + A_{γ(K)} A_{γ(K)}^⊤) 1{⟨A_{γ(1)}, θ⟩ < · · · < ⟨A_{γ(K)}, θ⟩} ] ⪰ E_{A∼p_A}[ A_{γ(k)} A_{γ(k)}^⊤ 1{⟨A_{γ(1)}, θ⟩ < · · · < ⟨A_{γ(K)}, θ⟩} ].

Assumption 3.5 (Sparse positive definiteness). For each B ⊆ [d], define Σ_B := (1/K) Σ_{k=1}^K E_{A∼p_A}[A_k(B) A_k(B)^⊤], where A_k(B) is the |B|-dimensional vector extracted from the elements of A_k with indices in B. There exists a positive constant α > 0 such that for all B ⊆ [d] with |B| ≤ s₀ + (4νC_b√s₀)/φ₀² and S(θ) ⊆ B,

min_{v∈R^{|B|}: ‖v‖₂=1} v⊤ Σ_B v ≥ α.

The parameters φ₀, ν, and C_b are those of Assumptions 3.2, 3.3, and 3.4.

Assumption 3.3 comes from Oh et al. (2021). It is satisfied for a wide range of distributions, including multivariate Gaussian, uniform, and Bernoulli distributions. Assumption 3.4 is also adopted from Oh et al. (2021). It holds for a wide range of distributions, including the multivariate Gaussian distribution and the uniform distribution on the sphere; it also holds when contexts are independent across arms, with arbitrary distributions (Oh et al., 2021). Assumption 3.5 requires that the context distribution be diverse enough in the neighborhood of the support of θ. Note that Assumption 3.5 is standard in the low-dimensional linear bandit literature (e.g., Lattimore & Szepesvári (2020); Degenne et al. (2020); Jedra & Proutiere (2020); Hao et al. (2020a)): there, if S(θ) = [d], the only choice for B is [d], and the set of actions has to span R^d (hence Assumption 3.5 is satisfied).
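As noted above, a strictly positive minimum eigenvalue of Σ is a simple sufficient condition for Assumption 3.2, and, by eigenvalue interlacing for the principal sub-matrices Σ_B, also for Assumption 3.5. The sketch below estimates λ_min(Σ) by Monte Carlo for a given context distribution; it is our own sanity check with a placeholder distribution, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, n_samples = 50, 2, 20_000      # small d keeps the eigendecomposition cheap

def sample_context_set():
    # placeholder context distribution; replace with the distribution of interest
    return rng.normal(size=(K, d))

# Monte Carlo estimate of Sigma = (1/K) * sum_k E[A_k A_k^T]
Sigma = np.zeros((d, d))
for _ in range(n_samples):
    A = sample_context_set()
    Sigma += A.T @ A / K
Sigma /= n_samples

lam_min = np.linalg.eigvalsh(Sigma).min()
print(f"estimated lambda_min(Sigma) = {lam_min:.3f}")
# If lambda_min(Sigma) >= c > 0, then phi^2(Sigma, S) >= c (Assumption 3.2), and every
# principal sub-matrix Sigma_B also satisfies lambda_min(Sigma_B) >= c by eigenvalue
# interlacing, which yields the sparse positive definiteness of Assumption 3.5.
```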
We will see that, once an accurate estimate of the support S(θ) is available (Lemma 5.4), Assumption 3.5 is used only to analyze the performance of the least squares estimator of a low-dimensional (of order O(s₀)) vector. Assumption 3.5 is strictly weaker than the covariate diversity condition of Bastani et al. (2021), where positive definiteness must be guaranteed for the Gram matrix generated by the greedy algorithm. We discuss the assumptions in more detail in Appendix B.

## 4. Algorithm

In this section, we present the Thresholded (TH) Lasso bandit algorithm. The algorithm adapts the idea of Lasso with thresholding proposed in Zhou (2010) to estimate θ and its support. The main challenge in the analysis of the Lasso with thresholding stems from the fact that here, the data is not i.i.d. (the arm selection is adaptive).

Algorithm 1: TH Lasso Bandit
1: Input: λ₀
2: for t = 1, . . . , T do
3:   Receive a context set 𝒜_t := {A_{t,k} : k ∈ [K]}
4:   Pull arm A_t = argmax_{A∈𝒜_t} ⟨A, θ̂_t⟩ (ties are broken uniformly at random) and observe r_t
5:   λ_t ← λ₀ √( (2 log t · log d) / t )
6:   A ← (A_1, A_2, . . . , A_t)^⊤,  R ← (r_1, r_2, . . . , r_t)^⊤
7:   θ̂₀^(t) ← argmin_θ { (1/t) ‖R − Aθ‖₂² + λ_t ‖θ‖₁ }
8:   Ŝ₀^(t) ← { j ∈ [d] : |(θ̂₀^(t))_j| > 4λ_t }
9:   Ŝ₁^(t) ← { j ∈ Ŝ₀^(t) : |(θ̂₀^(t))_j| ≥ 4λ_t √|Ŝ₀^(t)| }
10:  A_S ← (A_{1,Ŝ₁^(t)}, A_{2,Ŝ₁^(t)}, . . . , A_{t,Ŝ₁^(t)})^⊤
11:  θ̂_{t+1} ← argmin_θ ‖R − A_S θ‖₂²
12: end for

The pseudo-code of our algorithm is presented in Algorithm 1. In round t, the algorithm pulls an arm greedily using the current estimate θ̂_t of θ. From the previously selected arms and observed rewards, we compute via the Lasso a first estimate θ̂₀^(t) of θ. This estimate is then used to estimate the support of θ using appropriate thresholding. The regularizer λ_t := λ₀ √((2 log t · log d)/t) is set to a much larger value than in previous work (which typically uses regularizers of order √((log d + log t)/t)), as we only aim at support recovery here. Note that we apply the thresholding procedure twice to θ̂₀^(t) to obtain the support estimate Ŝ₁^(t). The final estimate θ̂_{t+1} is the least squares estimator of θ restricted to Ŝ₁^(t). The initial support estimate produced by the Lasso contains too many false positives; by including the thresholding steps, we remove unnecessary false positives and improve the support estimate. We quantify this improvement in the next section.
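For illustration, here is a minimal Python sketch of Algorithm 1 built on scikit-learn's Lasso. The environment interface (sample_context_set, step) is the hypothetical one sketched in Section 3.1, and scikit-learn minimizes (1/(2n))‖R − Aθ‖₂² + α‖θ‖₁, so α = λ_t/2 matches the objective on line 7. This is a sketch of our reading of the pseudo-code, not the authors' released implementation (linked in the footnote of Section 1).

```python
import numpy as np
from sklearn.linear_model import Lasso

def th_lasso_bandit(T, d, lambda0, sample_context_set, step, rng):
    """Sketch of Algorithm 1 (TH Lasso Bandit); line numbers refer to the pseudo-code above."""
    theta_hat = np.zeros(d)                   # estimate used for the greedy choice (line 4)
    A_hist, r_hist = [], []
    for t in range(1, T + 1):
        A_t = sample_context_set()                                    # line 3
        scores = A_t @ theta_hat
        arm = rng.choice(np.flatnonzero(scores == scores.max()))      # line 4, random ties
        r_t = step(A_t, arm)
        A_hist.append(A_t[arm]); r_hist.append(r_t)

        A_mat, R = np.asarray(A_hist), np.asarray(r_hist)             # line 6
        lam_t = lambda0 * np.sqrt(2 * np.log(t) * np.log(d) / t)      # line 5
        # line 7: sklearn's Lasso objective is (1/(2n))||R - A theta||^2 + alpha*||theta||_1,
        # so alpha = lam_t / 2 matches (1/t)||R - A theta||^2 + lam_t*||theta||_1
        # (the max() only guards against lam_t = 0 at t = 1).
        theta0 = Lasso(alpha=max(lam_t, 1e-12) / 2, fit_intercept=False,
                       max_iter=10_000).fit(A_mat, R).coef_
        S0 = np.flatnonzero(np.abs(theta0) > 4 * lam_t)               # line 8, first threshold
        S1 = S0[np.abs(theta0[S0]) >= 4 * lam_t * np.sqrt(len(S0))]   # line 9, second threshold
        theta_hat = np.zeros(d)                                       # lines 10-11: OLS on S1
        if len(S1) > 0:
            sol, *_ = np.linalg.lstsq(A_mat[:, S1], R, rcond=None)
            theta_hat[S1] = sol
    return theta_hat
```

With the hypothetical environment from the sketch in Section 3.1, a call such as th_lasso_bandit(T=1000, d=1000, lambda0=0.02, sample_context_set=sample_context_set, step=step, rng=np.random.default_rng(0)) runs the greedy, exploration-free loop; λ₀ = 0.02 is the value tuned for TH Lasso bandit in Section 6.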
## 5. Performance Guarantees

We analyze the regret of the Thresholded Lasso bandit algorithm both when the margin condition holds and when it does not. We show that better guarantees can be obtained with a single-parameter or parameter-free algorithm.

### 5.1. With the Margin Condition

Assumption 5.1 (Margin condition). There exists a constant C_m > 0 such that for all κ > 0 and all k ≠ k′, P_{A∼p_A}( 0 < |⟨A_k − A_{k′}, θ⟩| ≤ κ ) ≤ C_m κ.

The margin condition controls the probability that, under p_A, two arms yield very similar rewards (and hence are hard to separate). It is widely used in the classification literature (see e.g., Tsybakov (2004); Audibert & Tsybakov (2007)). In the (low-dimensional) linear bandit literature, it was first introduced in Goldenshluger & Zeevi (2013). The margin condition holds for the most usual context distributions (including the uniform distribution and multivariate Gaussian distributions) and is a much weaker assumption than requiring strict separation between the arms. The following theorem provides a non-asymptotic regret upper bound for TH Lasso bandit under the margin condition.

To simplify the presentation of our regret upper bounds, define τ = ⌊ (2 log(2d²)/C₀²) (log s₀)(log log d) ⌋, where C₀ = min{ 1/2, φ₀²/(512 s₀ s_A² ν C_b) }. Note that τ = O( s₀² (log s₀)(log d)(log log d) ).

Theorem 5.2. Assume that Assumptions 3.1-3.5 and 5.1 hold.
(i) (TH Lasso Bandit with parameter-dependent input) There are positive constants c₁, c₂, c₃ depending only on σ, s_A, s₁, s₂, φ₀, ν, C_b, K, α, C_m such that if we set λ₀ = c₁, then for all d ≥ c₂ and all T ≥ 2:
R(T) ≤ c₃ ( τ + s₀ (log s₀)^{3/2} log T + s₀² ).
(ii) (TH Lasso Bandit with parameter-free input) There are positive constants c₄, c₅ depending only on σ, s_A, s₁, s₂, φ₀, ν, C_b, K, α, C_m such that if we set λ₀ = 1/(log log d)^{1/4} in TH Lasso Bandit, then for all d ≥ c₄ and all T ≥ 2:
R(T) ≤ c₅ ( τ + s₀ (log s₀)^{3/2} log T + s₀² ).

The precise definitions of c₁-c₅ are given in Appendix E.1, together with the proof of Theorem 5.2.

### 5.2. Without the Margin Condition

Theorem 5.3. Assume that Assumptions 3.1-3.5 hold.
(i) (TH Lasso Bandit with parameter-dependent input) There are positive constants c₁, c₂, c₃ depending only on σ, s_A, s₁, s₂, φ₀, ν, C_b, K, α such that if we set λ₀ = c₁, then for all d ≥ c₂ and all T ≥ 2:
R(T) ≤ c₃ ( τ + (log s₀) √(s₀ T) + s₀² ).
(ii) (TH Lasso Bandit with parameter-free input) There are positive constants c₄, c₅ depending only on σ, s_A, s₁, s₂, φ₀, ν, C_b, K, α such that if we set λ₀ = 1/(log log d)^{1/4} in TH Lasso Bandit, then for all d ≥ c₄ and all T ≥ 2:
R(T) ≤ c₅ ( τ + (log s₀) √(s₀ T) + s₀² ).

The precise definitions of c₁-c₅ are given in Appendix E.2, together with the proof of Theorem 5.3.

Theorems 5.2 and 5.3 state that TH Lasso bandit achieves much lower regret than existing algorithms: regret upper bounds for the latter contain a term scaling as log d · log T (resp. log d + √(T log(dT))) with (resp. without) the margin condition. TH Lasso bandit removes the log d and log T multiplicative factors. In most applications of sparse linear contextual bandits, both T and d are very large, so the regret improvement obtained by TH Lasso bandit is significant. Also note that our regret upper bounds match the minimax lower bound Ω(√(s₀ T)) proved in Ren & Zhou (2020).

### 5.3. Sketch of the Proof of the Theorems

We sketch below the proofs of Theorems 5.2 and 5.3. Complete proofs of the theorems and of the associated lemmas are presented in Appendix E and Appendix F, respectively.

(1) Estimation of the Support of θ. First, we prove that the estimated support contains the true support S(θ) with high probability.

Lemma 5.4. Let t ≥ 2 log(2d²)/C₀² be such that 4λ_t ( 4νC_b s₀/φ₀² + √( s₀ (1 + 4νC_b/φ₀²) ) ) ≤ θ_min. Under Assumptions 3.1, 3.2, 3.3, and 3.4,

P( S(θ) ⊆ Ŝ₁^(t) and |Ŝ₁^(t) \ S(θ)| ≤ 4νC_b √s₀ / φ₀² ) ≥ 1 − 2 exp( −tλ_t²/(32σ²s_A²) + log d ) − exp( −t C₀²/2 ).

Lemma 5.4 extends the support recovery result for the Thresholded Lasso (Zhou, 2010) to the case of non-i.i.d. data (generated by the bandit algorithm). The dependence on s₀ is analogous to the offline result (Theorem 3.1 of Zhou (2010)). As can be seen from the proof, even after the single-step thresholding, for all sufficiently large t we have the guarantee that S(θ) ⊆ Ŝ₀^(t) and |Ŝ₀^(t) \ S(θ)| = O(s₀) with high probability. With the two-step thresholding, however, we have |Ŝ₁^(t) \ S(θ)| = O(√s₀) (see Appendix C for a detailed discussion of the benefit of the two-step thresholding).

Remark 5.5. In the proof of Lemma 5.4, we also obtain a bound on the estimation error of θ̂₀^(t) (the Lasso estimate). One may directly use θ̂₀^(t) for the arm selection, as in Oh et al. (2021);
however, this results in a weaker performance guarantee, of order O(log d + √(T log(dT))) (without the margin condition). This is because the estimation error of θ̂₀^(t) depends on d, which impacts the order of the regret, and it is what motivates the thresholding procedure. With this procedure (i.e., using θ̂_t), we remove the dependence on d of the estimation error once t is larger than τ, which in turn leads to an instantaneous regret bound independent of d. In summary, the thresholding procedure allows us to derive better regret bounds than those in existing work (e.g., Oh et al. (2021)).

Define E_t := { S(θ) ⊆ Ŝ₁^(t) and |Ŝ₁^(t) \ S(θ)| ≤ 4νC_b √s₀ / φ₀² }. In the remainder of the proof, in view of Lemma 5.4, we can assume that the event E_t holds.

(2) Minimal Eigenvalue of the Empirical Gram Matrix. We write Ŝ = Ŝ₁^(t) for simplicity. Let Σ̂_Ŝ := (1/t) Σ_{s=1}^t A_s(Ŝ) A_s(Ŝ)^⊤ be the empirical Gram matrix on the estimated support. We prove that the positive definiteness of the empirical Gram matrix on the estimated support is guaranteed.

Lemma 5.6. Let t ∈ [T]. Under Assumptions 3.1 and 3.5, we have:

P( λ_min(Σ̂_Ŝ) ≥ α/(4νC_b) | E_t ) ≥ 1 − exp( log( s₀ + 4νC_b√s₀/φ₀² ) − tα / ( 20 s_A² νC_b ( s₀ + (4νC_b√s₀)/φ₀² ) ) ).

(3) Estimation of θ after Thresholding. Next, we study the accuracy of θ̂_t.

Lemma 5.7. Let t ∈ [T] and s = s₀ + 4νC_b√s₀/φ₀². Under Assumption 3.1, we have, for all x, λ > 0:

P( ‖θ̂_{t+1} − θ‖₂ ≥ x and λ_min(Σ̂_Ŝ) ≥ λ | E_t ) ≤ 2s exp( − λ² t x² / (2 σ² s_A² s) ).

From the above lemma, we conclude that θ is well estimated with high probability. Note that, in the above estimation error, the dependence on s₀ is also improved from linear to square root compared with the analysis of Lemma 1 (the oracle inequality) in Oh et al. (2021) (SA Lasso bandit). This stems from the fact that the compatibility condition only allows one to control the ℓ₁ norm of the estimation error of θ, while using the OLS leads to an ℓ₂ guarantee.

(4) Instantaneous Regret Upper Bounds. From the previous lemmas, we can derive an upper bound on the instantaneous regret under the margin condition. Define h₀ = √( log( 4( s₀ + 4νC_b√s₀/φ₀² ) ) ) + 1.

Lemma 5.8 (With the margin condition). Define G_t^{α/(4νC_b)} := { λ_min(Σ̂_Ŝ) ≥ α/(4νC_b) }. Let t ≥ 2. Under Assumptions 3.1, 3.2, 3.3, 3.4, 3.5, and 5.1, the expected instantaneous regret E[ max_{A∈𝒜_t} ⟨A − A_t, θ⟩ ] is upper bounded by:

( 1408 σ² s_A⁴ C_m (K−1) h₀³ ν² C_b² ( s₀ + 4νC_b√s₀/φ₀² ) / α² ) · (1/t) + 2(K−1) s_A s₁ ( P(E_t^c) + P( (G_t^{α/(4νC_b)})^c | E_t ) ).

Notice that the first term of this instantaneous regret bound does not depend on d; this leads to the better regret order. Without the margin condition, we present the following key lemma, which is proven by a novel application of a discretization technique:

Lemma 5.9 (Without the margin condition). Under Assumptions 3.1, 3.2, 3.3, 3.4, and 3.5, for any t ∈ [T], E[ max_{A∈𝒜_t} ⟨A − A_t, θ⟩ ] is upper bounded by:

( 36 σ s_A (K−1) h₀² ν C_b / α ) √( 2 ( s₀ + 4νC_b√s₀/φ₀² ) / t ) + 2(K−1) s_A s₁ ( P(E_t^c) + P( (G_t^{α/(4νC_b)})^c | E_t ) ).

Again, the first term in the above bound is independent of d. Note also that, compared to existing work, we improve the dependence on t: we get a 1/√t instantaneous upper bound, while other recent work (e.g., Kim & Paik (2019); Oh et al. (2021)) obtains √(log t / t). As a consequence, we obtain better regret guarantees without the margin condition. More precisely, we manage to remove the unnecessary log T factors in the regret that were present in all previous studies. By summing up these instantaneous regret bounds, we obtain Theorems 5.2 and 5.3.
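To get a feel for the support-recovery behaviour that Lemma 5.4 formalizes, and for the effect of the second thresholding step discussed further in Appendix C, the following standalone sketch runs the Lasso-plus-thresholding steps of Algorithm 1 on synthetic data and counts false positives and negatives after each step. The constants (λ₀ = 0.02, a Gaussian design) are illustrative choices of ours, and the i.i.d. data here ignores the adaptive sampling handled by the actual analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d, s0, sigma, lambda0 = 500, 1000, 5, 1.0, 0.02    # illustrative sizes

theta = np.zeros(d)
support = rng.choice(d, size=s0, replace=False)
theta[support] = rng.uniform(1.0, 2.0, size=s0)

X = rng.normal(size=(n, d))              # i.i.d. design (the bandit data is adaptive instead)
y = X @ theta + sigma * rng.normal(size=n)

lam = lambda0 * np.sqrt(2 * np.log(n) * np.log(d) / n)
theta0 = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=10_000).fit(X, y).coef_

S_lasso = np.flatnonzero(theta0)                             # raw Lasso support
S0 = np.flatnonzero(np.abs(theta0) > 4 * lam)                # first thresholding step
S1 = S0[np.abs(theta0[S0]) >= 4 * lam * np.sqrt(len(S0))]    # second thresholding step
for name, S in [("Lasso", S_lasso), ("step 1", S0), ("step 2", S1)]:
    fp = np.setdiff1d(S, support).size    # false positives
    fn = np.setdiff1d(support, S).size    # false negatives
    print(f"{name}: {fp} false positives, {fn} false negatives")
```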
In Appendix D, we also provide regret guarantees without Assumption 3.4 when K = 2, both with the margin condition (Theorem D.2) and without it (Theorem D.3). These theorems are established using the relaxed symmetry assumption only; see also the proof of Lemma 2 in Oh et al. (2021).

## 6. Experiments

In this section, we empirically evaluate the TH Lasso bandit algorithm. We compare its performance to those of the Lasso Bandit (Bastani & Bayati, 2020), Doubly-Robust (DR) Lasso bandit (Kim & Paik, 2019), and SA Lasso bandit (Oh et al., 2021) algorithms. Note that the Lasso bandit algorithm (Bastani & Bayati, 2020) deals with a slightly different problem setting (θ varies across arms in their setting). We follow the comparison approach of Kim & Paik (2019) (Kd-dimensional context vectors and Kd-dimensional regression parameters are considered for each arm; for details, see Kim & Paik (2019)).

Reward Parameter and Contexts. We consider problems where θ ∈ R^d is sparse, i.e., ‖θ‖₀ = s₀. We generate the non-zero components of θ i.i.d. from the uniform distribution on [1, 2]. In each round t, for each component i ∈ [d], we sample ((A_{t,1})_i, . . . , (A_{t,K})_i) ∈ R^K from a Gaussian distribution N(0_K, V) with V_{k,k} = 1 for all k ∈ [K] and V_{j,k} = ρ² = 0.7 for all j ≠ k ∈ [K]. We then normalize each A_{t,k} = ((A_{t,k})_1, . . . , (A_{t,k})_d) ∈ R^d so that its ℓ₂-norm is at most A_max, for all k ∈ [K]. Note that the components of the feature vectors are correlated over [d] and over [K]. The noise process is Gaussian, i.i.d. over rounds: ε_t ∼ N(0, 1). We test the algorithms for different values of K, d, s₀, and A_max. For each experimental setting, we average the results over 20 instances. We also provide additional experimental results with non-Gaussian distributions in Appendix H.4.
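For concreteness, here is a sketch of the context-generation process just described (per-component Gaussian correlation ρ² across arms, then ℓ₂-normalization to A_max); it reflects our reading of the setup, not the authors' experiment code.

```python
import numpy as np

def sample_contexts(d, K, rho2=0.7, A_max=10.0, rng=None):
    """Sample one context set: for each component i, the K arm values are jointly Gaussian
    with unit variance and pairwise correlation rho2; each arm's d-dimensional feature
    vector is then rescaled so that its l2-norm is at most A_max."""
    rng = np.random.default_rng() if rng is None else rng
    V = np.full((K, K), rho2)
    np.fill_diagonal(V, 1.0)
    A = rng.multivariate_normal(np.zeros(K), V, size=d).T   # shape (K, d): one row per arm
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.minimum(1.0, A_max / norms)
```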
Remark 6.1. In most of our experiments, the context is drawn from a multivariate Gaussian or a uniform distribution on [−1, 1]^d. In this case, the minimum eigenvalue of the Gram matrix Σ is lower bounded by some constant; hence, Assumptions 3.2 and 3.5 are satisfied. Clearly, Assumption 3.3 is satisfied by the symmetry of the distribution. When the distribution is independent over the arms, by Proposition 1 in Oh et al. (2021), Assumption 3.4 is satisfied. Since each element of the context distribution has a bounded density everywhere, Assumption 5.1 is also satisfied. Furthermore, in Appendix H.5, we empirically test our algorithm on some hard problems where the covariate diversity condition (Bastani et al., 2021) does not hold.

Algorithms. For DR Lasso bandit and Lasso bandit, we use the tuned hyperparameters from https://github.com/gisoo1989/Doubly-Robust-Lasso-Bandit. For the SA Lasso bandit and TH Lasso bandit algorithms, we tune the hyperparameter λ₀ in [0.01, 0.5] to roughly optimize the algorithm performance when K = 2, d = 1000, A_max = 10, and s₀ = 5. As a result, we set λ₀ = 0.16 for SA Lasso bandit and λ₀ = 0.02 for TH Lasso bandit.

Results. We first compare the regret of each algorithm with A_max = 10, K ∈ {2, 50}, d ∈ {1000, 2000, 10000}, and s₀ ∈ {5, 20}. We also experimented with larger values of d than those used in existing studies. Figure 1 shows the average cumulative regret for each algorithm. We find that TH Lasso bandit outperforms the other algorithms in all scenarios. We provide additional experimental results, including experiments with different correlation levels ρ² ∈ {0, 0.3} and dimensions d, in Appendix H.4.

Next, we compare the estimation accuracy for θ under three algorithms (DR, SA, and TH Lasso bandit) in the scenario K = 2, d = 1000, A_max = 10, s₀ = 5. Figure 2 shows the number of false positives |Ŝ₁^(t) \ S(θ)|, the number of false negatives |S(θ) \ Ŝ₁^(t)|, and the ℓ₂-norm error ‖θ̂_t − θ‖₂. Note that, for DR Lasso bandit and SA Lasso bandit, we define the estimated support as Ŝ₁^(t) = {i ∈ [d] : θ̂_{t,i} ≠ 0}. We observe that the number of false positives of our algorithm converges to zero faster than those of DR Lasso bandit and SA Lasso bandit. Furthermore, our algorithm yields a smaller estimation error ‖θ̂_t − θ‖₂ than the two other algorithms, as shown in the right column of Figure 2.

We also conduct experiments varying A_max ∈ {2.5, 5, 10, 20, 40}. As in the previous experiments, for each A_max, we normalize each feature vector A_{t,k} so that its ℓ₂-norm is at most A_max for all k ∈ [K]. We set K = 2, d = 1000, and s₀ = 5. Figure 3 shows the average cumulative regret at t = 1000 of TH Lasso bandit and SA Lasso bandit for each A_max. This experiment confirms that TH Lasso bandit exhibits lower regret than SA Lasso bandit. Additional results for K = 50 and s₀ = 20 are included in Appendix H.3.

Finally, we examine the robustness of TH Lasso bandit and SA Lasso bandit with respect to the hyperparameter λ₀. We vary λ₀ ∈ [0.1λ*, 2.5λ*], where λ* = 0.02 for TH Lasso bandit and λ* = 0.16 for SA Lasso bandit. We set K = 2, d = 1000, s₀ = 5, and A_max = 10. Figure 4 shows the average cumulative regret at t = 1000 for TH Lasso bandit and SA Lasso bandit for different ratios λ₀/λ*. Observe that the regret of TH Lasso bandit is more stable than that of SA Lasso bandit as the ratio grows. Indeed, the performance of TH Lasso bandit is not very sensitive to the choice of λ₀: it is robust. This contrasts with the SA Lasso bandit algorithm, for which a careful tuning of λ₀ is needed to get good performance.

Figure 1. Cumulative regret of the algorithms with ρ² = 0.7, A_max = 10 in six scenarios selected using K ∈ {2, 50}, d ∈ {1000, 2000, 10000}, and s₀ ∈ {5, 20}. The shaded area represents the standard errors.
Figure 2. (Left) Number of false positives |Ŝ₁^(t) \ S(θ)|, (center) number of false negatives |S(θ) \ Ŝ₁^(t)|, and (right) ℓ₂-norm error ‖θ̂_t − θ‖₂ of the three algorithms with ρ² = 0.7, A_max = 10, K = 2, s₀ = 5, and d = 1000. The shaded area represents the standard errors.

Figure 3. Cumulative regret at round t = 1000 of TH Lasso bandit and SA Lasso bandit with ρ² = 0.7, K = 2, d = 1000, s₀ = 5, and varying A_max ∈ {2.5, 5, 10, 20, 40}. The error bars represent the standard errors.

Figure 4. Cumulative regret at round t = 1000 of TH Lasso bandit and SA Lasso bandit with ρ² = 0.7, A_max = 10, K = 2, d = 1000, s₀ = 5, and varying λ₀/λ* ∈ [0.1, 2.5]. The error bars represent the standard errors.

## 7. Conclusion

In this paper, we studied the high-dimensional contextual linear bandit problem with sparsity. We devised TH Lasso bandit, a simple algorithm that applies a Lasso procedure with thresholding to estimate the support of the unknown parameter. We established finite-time regret upper bounds under various assumptions, in particular with and without the margin condition. These bounds exhibit a better regret scaling than those derived for previous algorithms. We also numerically compared TH Lasso bandit to previous algorithms in a variety of settings, and showed that it outperforms them in these settings.

In future work, it would be interesting to consider scenarios where the assumptions made in this paper may not hold. In particular, it is worth investigating the case where the relaxed symmetry condition (Assumption 3.3) is not satisfied. In this case, being greedy in the successive arm selections may not work. It is intriguing to know whether devising an algorithm without forced uniform exploration and with reasonable regret guarantees is possible.

## Acknowledgements

We would like to thank Komei Fujita, Yusuke Kaneko, Hiroaki Shiino, and Shota Yasui for fruitful discussions. We also thank the anonymous reviewers for helpful comments on the previous version of our manuscript. K. Ariu was partially supported by the Nakajima Foundation Scholarship. A. Proutiere's research is partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and by Digital Futures.

## References

Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, 2012.

Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, 1999.

Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. The Annals of Statistics, 2007.

Bastani, H. and Bayati, M. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 2020.

Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. Management Science, 2021.

Bühlmann, P. and Van De Geer, S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.

Carpentier, A. and Munos, R.
Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, 2012. Chapelle, O. and Li, L. An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems, 2011. Degenne, R., Ménard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, 2020. Goldenshluger, A. and Zeevi, A. A linear response bandit problem. Stochastic Systems, 2013. Hao, B., Lattimore, T., and Szepesvari, C. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics, 2020a. Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits. In Advances in Neural Information Processing Systems, 2020b. Jedra, Y. and Proutiere, A. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 2020. Kannan, S., Morgenstern, J. H., Roth, A., Waggoner, B., and Wu, Z. S. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems, 2018. Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, 2019. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 1985. Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In International conference on World wide web, 2010. Thresholded Lasso Bandit Li, L., Chu, W., Langford, J., and Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In International conference on Web search and data mining, pp. 297 306, 2011. Li, S., Karatzoglou, A., and Gentile, C. Collaborative filtering bandits. In International ACM SIGIR conference on Research and Development in Information Retrieval, 2016. Oh, M.-h., Iyengar, G., and Zeevi, A. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, 2021. Ren, Z. and Zhou, Z. Dynamic batch learning in highdimensional sparse linear contextual bandits, 2020. URL https://arxiv.org/abs/2008.11918. Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952. Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996. Tropp, J. A. User-friendly tail bounds for matrix martingales. Technical report, 2011. Tsybakov, A. B. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge University Press, 2019. Wang, X., Wei, M., and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, 2018. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In International Conference on Machine Learning, 2009. Zeng, C., Wang, Q., Mokhtari, S., and Li, T. Online contextaware recommendation with time varying multi-armed bandit. In International conference on Knowledge discovery and data mining, 2016. Zhou, S. 
Thresholded lasso for high dimensional variable selection and statistical estimation, 2010. URL https: //arxiv.org/abs/1002.1583. Thresholded Lasso Bandit A. Table of Notations Table 2 summarizes the notations used in the paper. B. Discussion on the Assumptions and Regret Dependence on K Our assumptions are in principle following the literature Oh et al. (2021). In the contextual linear bandit setting, Assumptions 3.3 and 3.4 or the covariate diversity condition are standard (at least in the experimental settings). They hold for many context distributions including multivariate Gaussian distribution, uniform distribution on the sphere, and arbitrary independent distribution for each arm (Oh et al., 2021). For example, the covariate diversity condition holds in the experimental settings of Bastani & Bayati (2020) and Wang et al. (2018). Regarding the regret dependence of K, we have at least linear scaling with K. The constant Cb does not scale with K when the context distribution is a multivariate Gaussian distribution or a uniform distribution on a unit sphere (see Proposition 1 of Oh et al. (2021)). However, for general distribution, Cb can scale exponentially with K. We conjecture that we can improve this dependency: numerical results show that the dependence on K is mild (See Appendix H). C. On the Benefit of the Two Step Thresholding Procedure In our choice of the thresholding parameter (4λt in the first step and 4λt q | ˆS(t) 0 | in the second step), we aim at a partial recovery of the support so that the trade-off between the duration of the phase with linearly growing regret and the support recovery accuracy is optimized in the design. Using two-step thresholding, we achieve better regret guarantees than single-step thresholding. This improvement is due to the fact that with two-step thresholding, the estimated support of θ is improved (with two-step thresholding, we have O( s0) false positives on the estimated support, whereas with single thresholding, there are O(s0)). While this difference in results does not contribute to changing the order of the regret, it does contribute to improving the coefficients on log T and T terms in regret. D. Additional Theorems Before presenting the additional theorems, we introduce the following assumption (which is a slightly modified version of Assumption 3.5). Assumption D.1 (Sparse positive definiteness, K = 2). Let K = 2. Define ΣB := 1 2 P2 k=1 EA p A Ak(B)Ak(B) , for any B [d], where Ak(B) is a |B|-dimensional vector, which is extracted from the elements of Ak with indices in B. There exists a positive constant α > 0 such that B [d], |B| s0 + (2ν s0)/φ2 0 and S(θ) B = min v R|B|: v 2=1 v ΣBv α . The parameters φ0, ν are those of Assumptions 3.2, 3.3. We redefine the parameters τ = j 2 log(2d2) C2 0 (log s0)(log log d) k , where C0 = min n 1 2, φ2 0 256s0s2 Aν o . The following theorem provides the regret guarantees when K = 2, without Balanced covariance (Assumption 3.4), and with the margin condition. Theorem D.2 (with margin, without balanced covariance). Assume that Assumptions 3.1 3.3, D.1, and 5.1 hold and K = 2. (i) (TH Lasso Bandit with parameter-dependent input) There are universal positive constants c1, c2, c3 depending on σ, s A, s1, s2, φ0, ν, α, Cm, such that if we set λ0 = c1, then for all d c2, for all T 2: R(T) c3 τ + s0(log s0) 3 2 log T + s2 0 . 
(ii) (TH Lasso Bandit with parameter-free input) There are universal positive constants c4, c5 depending on σ, s A, s1, s2, φ0, ν, α, Cm, such that if we set λ0 = 1/(log log d) 1 4 in TH Lasso Bandit, then for all d c4, for all Thresholded Lasso Bandit Table 2. Table of notations Problem-specific notations At,k Feature vector associated with the arm k θ Parameter vector d Dimension of feature vectors s0 Sparsity index T Total number of rounds At Set of context vectors at round t p A Distribution for At rt Reward at round t Ft σ-algebra generated by random variables (A1, A1, r1, . . . , At 1, At 1, rt 1, At) εt Zero mean sub-Gaussian noise σ2 Variance proxy of εt R(T) Regret ˆΣt Empirical Gram matrix generated by the arms selected under a specific algorithm, i.e., 1 t Pt s=1 As A s S(θ) Support of θ: {i [d] : θi = 0} θmin mini S |θi| s A ℓ norm upper bound on A At (see Assumption 3.1) s1 ℓ1 norm upper bound on θ (see Assumption 3.1) s2 Used for the lower bound on θmin (see Assumption 3.1) φ2(M, S0) Compatibility constant (see Assumption 3.2) Σ Expected Gram matrix 1 K PK k=1 EA p A Ak A k φ2 0 Lower bound on φ2(Σ, S(θ)) ν Constant for Relaxed symmetry (see Assumption 3.3) Cb Constant for Balanced covariance (see Assumption 3.4) α Constant for Sparse positive definiteness (see Assumption 3.5 and D.1) λt Regularizer at round t λ0 Coefficient of the regularizer ˆS(t) 0 , ˆS(t) 1 Estimate of the support after the first and the second thresholding, respectively. ˆθt Estimated vector of θ Cm Constant for the margin condition (see Assumption 5.1) h0 Term whose order is O((log s0) 1 2 ) (see definitions before the Lemmas) C0 Term whose order is O(1/s0) (see definitions before the Theorems) τ j 2 log(2d2) C2 0 (log s0)(log log d) k Et Event n S ˆS(t) 1 and | ˆS(t) 1 \ S| 4νCb s0 φ2 0 o or n S ˆS(t) 1 and | ˆS(t) 1 \ S| 2ν s0 ˆS Estimate of the support after the second thresholding (Equivalent to ˆS(t) 1 ) ˆΣ ˆ S 1 t Pt s=1 As( ˆS)As( ˆS) Gλ t Event n λmin(ˆΣ ˆ S) λ o Amax ℓ2 norm bound on At,k (used in the experiments) Generic notations x 0 ℓ0 norm of x, i.e., x 0 = Pd i=1 1{θi = 0} [x] Set of positive integers upto x, i.e., [x] = {1, . . . , x} x, y Inner product of x and y P(A) Probability that event A occurs E[a] Expected value of a θi,B θi1{i B} θB (θ1,B, . . . , θd,B) A(B) n |B| submatrix of A Rn d where B [d] supp(x) Set of the non-zero element indices of x Thresholded Lasso Bandit R(T) c5 τ + s0(log s0) 3 2 log T + s2 0 . The precise definitions of c1-c5 are given in Appendix E.3. We provide the proof of Theorem D.2 in Appendix E.3. Next, the following theorem provides the regret guarantees when K = 2, without Balanced covariance (Assumption 3.4), and without the margin condition. Theorem D.3 (without margin, without balanced covariance). Assume that Assumptions 3.1 3.3, and D.1 hold and K = 2. (i) (TH Lasso Bandit with parameter-dependent input) There are universal positive constants c1, c2, c3 depending on σ, s A, s1, s2, φ0, ν, α such that if we set λ0 = c1, then for all d c2. for all T 2: R(T) c3 τ + (log s0) p s0T + s2 0 . (ii) (TH Lasso Bandit with parameter-free input) There are universal positive constants c4, c5 depending on σ, s A, s1, s2, φ0, ν, α such that if we set λ0 = 1/(log log d) 1 4 in TH Lasso Bandit, then for all d c4, for all T 2, R(T) c5 τ + (log s0) p s0T + s2 0 . The precise definitions of c1-c5 are given in Appendix E.4. We present the proof of Theorem D.3 in Appendix E.4. 
Furthermore, Lemmas associated with Theorems D.2 and D.3 and their proofs are presented in Appendix G. E. Proof of Theorems E.1. Proof of Theorem 5.2 (with margin) First, we determine the constants c1, c2 as follows. Set λ0 = 4σs A c with constant c > 0 (independent of d, T, and s0) such that 4 4νCbs0 φ2 0 + r 1 + 4νCb 2 log τ log d θmin. Note that such a constant c exists as λτ = 4σs A c 2 log τ log d 2 log(Θ(s2 0(log s0)(log log d)(log d))) log d Θ(s2 0(log s0)(log log d)(log d)) 2 log(Θ(s2 0(log s0)(log log d)(log d))) Θ(s2 0(log s0)(log log d)) r 1 log log d + 1 log s0 θmin s2/s0 (from Assumption 3.1). We can take c1 = 4σs A c. Assume that τ (increasing function of d) satisfies τ exp(4/c). This facilitates a constant lower bound on d, hence c2 is determined. We upper bound the instantaneous regret in round t 1. We have: E[max A At A At, θ ] = E[max A At A At, θ ] E[| max A At A, θ |] + E[| At, θ |] s As1 + s As1 = 2s As1, (1) Thresholded Lasso Bandit where the second inequality stems from Hölder s inequality. We deduce the following upper bound on the expected regret up to round T: t=1 max A At A At, θ (a) 2s As1τ + t=τ+1 E max A At A At, θ (b) 2s As1τ + 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et (c) 2s As1τ + 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 + 2(K 1)s As1 2 exp tλ2 t 32σ2s2 A + log d + exp t C2 0 2 log s0 + 4νCb s0 tα 20s2 AνCb s0 + (4νCb s0)/φ2 0 where for (a), we used equation (1) for 1 t τ; for (b), we used Lemma 5.8; for (c), we used Lemma 5.4 (for Et) and Lemma 5.6 (for G α 4νCb t ). Now we have: 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 α2 (1 + Z T = 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 α2 (1 + log T), Thresholded Lasso Bandit t=τ+1 exp tλ2 t 32σ2s2 A + log d = t=τ+1 exp ( c log t log d + log d) t=τ+1 exp c log d log t t=τ+1 exp ( 2 log t) where for (a) and (b), we used the assumption τ exp(4/c). In addition, t=τ+1 exp t C2 0 2 log s0 + 4νCb s0 tα 20s2 AνCb s0 + (4νCb s0)/φ2 0 exp t C2 0 2 + s0 + 4νCb s0 tα 20s2 AνCb s0 + (4νCb s0)/φ2 0 C2 0 + s0 + 4νCb s0 2 20s2 AνCb In summary, we obtain: R(T) 2s As1τ + 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 α2 (1 + log T) + 2(K 1)s As1 C2 0 + s0 + 4νCb s0 2 20s2 AνCb Looking at the scaling with respect to d, T, and s0, one can determine c3 (note that 1/C2 0 = O(s2 0) and h3 0 = O((log(s0)) 3 2 )). This concludes the proof of the first part of the theorem. Regarding the second part of the theorem, we impose the following condition on d: (i) log log d 4096σ4s4 A (ii) 4 4νCbs0 φ2 0 + r 1 + 4νCb (iii) d 100. When d c4 with some constant c4, it should be noted that the condition (ii) can hold, as r 2 log τ log d r 1 log log d + 1 log s0 Thresholded Lasso Bandit λ0 = 1/(log log d) 1 4 (3) = o(1) (as d ). (4) Then, we have the following computations: t=τ+1 exp tλ2 t 32σ2s2 A + log d = 16σ2s2 A(log log d) 1 2 1 log t log d 32σ2s2 A(log log d) 1 2 log t log τ 32σ2s2 A(log log d) 1 2 t=τ+1 exp ( 2 log t) where for (a), we used the fact that log τ 64σ2s2 A(log log d) 1 2 from (i); for (b), we used the fact that log τ log d from d 100; for (c), we used again log τ 64σ2s2 A(log log d) 1 2 . Therefore, similarly, we get the regret bound R(T) 2s As1τ + 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 α2 (1 + log T) + 2(K 1)s As1 C2 0 + s0 + 4νCb s0 2 20s2 AνCb and we can find the constant c5. This concludes the proof. E.2. 
Proof of Theorem 5.3 (without margin) First, we determine the constants c1, c2 similarly to those of Theorem 5.2. Using Lemma 5.9, we proceed as in the proof of Theorem 5.2, and deduce that: t=1 max A At A At, θ t=τ+1 E max A At A At, θ 36σs A(K 1)h2 0νCb α v u u t2 s0 + 4νCb s0 t 1 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et Thresholded Lasso Bandit We also show that: t=τ+1 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et 2(K 1)s As1 C2 0 + s0 + 4νCb s0 2 20s2 AνCb Now we have: 36σs A(K 1)h2 0νCb α v u u t2 s0 + 4νCb s0 36σs A(K 1)h2 0νCb α v u u t2 s0 + 4νCb s0 36σs A(K 1)h2 0νCb 2 s0 + 4νCb s0 36σs A(K 1)h2 0νCb 8 s0 + 4νCb s0 In summary, we get: R(T) 2s As1τ + 36σs A(K 1)h2 0νCb 8 s0 + 4νCb s0 + 2(K 1)s As1 C2 0 + s0 + 4νCb s0 2 20s2 AνCb Looking at the scaling with respect to d, T, and s0, one can determine c3 (note that 1/C2 0 = O(s2 0) and h2 0 = O(log(s0))). This concludes the proof of the first part of the theorem. Regarding the second part of the theorem, we can determine the constant c4 similarly as in Theorem 5.2. Then, we proceed as in the proof of Theorem 5.2 and get the regret bound: R(T) 2s As1τ + 36σs A(K 1)h2 0νCb 8 s0 + 4νCb s0 + 2(K 1)s As1 C2 0 + s0 + 4νCb s0 2 20s2 AνCb We can find the constant c5 from this upper bound. This concludes the proof. E.3. Proof of Theorem D.2 (with margin, without balanced covariance) This proof follows that of Theorem 5.2 to some extent. First, we determine the constants c1, c2 as follows. Set λ0 = 4σs A c with constant c > 0 (independent of d, T, and s0) such that 4 2νs0 φ2 0 + r 1 + 2ν 2 log τ log d Thresholded Lasso Bandit θmin. Note that such a constant c exists as λτ = 4σs A c 2 log τ log d 2 log(Θ(s2 0(log s0)(log log d)(log d))) log d Θ(s2 0(log s0)(log log d)(log d)) 2 log(Θ(s2 0(log s0)(log log d)(log d))) Θ(s2 0(log s0)(log log d)) r 1 log log d + 1 log s0 θmin s2/s0 (from Assumption 3.1). We can take c1 = 4σs A c. Assume that τ (increasing function of d) satisfies τ exp(4/c). This facilitates a constant lower bound on d, hence c2 is determined. We deduce the following upper bound on the expected regret up to round T: t=1 max A At A At, θ (a) 2s As1τ + t=τ+1 E max A At A At, θ (b) 2s As1τ + 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 1 t 1 + 2s As1 P(Ec t ) + P G α 2ν t c Et (c) 2s As1τ + 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 2 exp tλ2 t 32σ2s2 A + log d + exp t C2 0 2 log s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 where for (a), we used equation (1) for 1 t τ; for (b), we used Lemma G.4; for (c), we used Lemma G.1 (for Et) and Lemma G.2 (for G α 2ν t ). Now we have: 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 352σ2s4 ACmh3 0 s0 + 2ν s0 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 (1 + Z T = 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 (1 + log T), Thresholded Lasso Bandit t=τ+1 exp tλ2 t 32σ2s2 A + log d = t=τ+1 exp ( c log t log d + log d) t=τ+1 exp c log d log t t=τ+1 exp ( 2 log t) where for (a) and (b), we used the assumption τ exp(4/c). In addition, exp t C2 0 2 log s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 exp t C2 0 2 + s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 C2 0 + s0 + 2ν s0 2 10s2 Aν α . In summary, we obtain that: R(T) 2s As1τ + 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 (log T + 1) C2 0 + s0 + 2ν s0 2 10s2 Aν α Looking at the scaling with respect to d, T, and s0, one can determine c3. This concludes the first part of the theorem. Regarding the second part of the theorem, we impose the following condition on d: (i) log log d 4096σ4s4 A (ii) 4 2νs0 φ2 0 + r 1 + 2ν (iii) d 100. 
With some constant c4, when d c4, it should be noted that the condition (ii) holds, as r 2 log τ log d r 1 log log d + 1 log s0 λ0 = 1/(log log d) 1 4 = o(1) (as d ). Thresholded Lasso Bandit t=τ+1 exp tλ2 t 32σ2s2 A + log d = 16σ2s2 A(log log d) 1 2 1 log t log d 32σ2s2 A(log log d) 1 2 log t log τ 32σ2s2 A(log log d) 1 2 t=τ+1 exp ( 2 log t) where for (a), we used the fact that log τ 64σ2s2 A(log log d) 1 2 ; for (b), we used the fact that log τ log d from d 100; for (c), we used again log τ 64σ2s2 A(log log d) 1 2 . Therefore, a similar regret upper bound can be obtained in this case: R(T) 2s As1τ + 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 (log T + 1) C2 0 + s0 + 2ν s0 2 10s2 Aν α and we can find the constant c5. This concludes the proof. E.4. Proof of Theorem D.3 (without margin, without balanced covariance) This proof follows the proof of Theorem 5.3 mostly. First, we determine the constants c1, c2 similarly to those of Theorem D.2. Using Lemma G.5, we proceed as in the proof of Theorem 5.3, and deduce that: t=1 max A At A At, θ ] t=τ+1 E[max A At A At, θ ] 18σs Ah2 0ν α v u u t2 s0 + 2ν s0 t 1 + 2s As1 P(Ec t ) + P G α 2ν t c |Et Thresholded Lasso Bandit t=τ+1 2s As1 P(Ec t ) + P G α 2ν t c Et t=τ+1 2s As1 2 exp tλ2 t 32σ2s2 A + log d + exp t C2 0 2 log s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 Regarding the series involving p 1/t, we have: The bounds for t=τ+1 2s As1 2 exp tλ2 t 32σ2s2 A + log d + exp t C2 0 2 log s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 hold similarly as is in Theorem D.2. In summary, we get: R(T) 2s As1τ + 36σs Ah2 0ν r 2 s0 + 2ν s0 C2 0 + s0 + 2ν s0 2 10s2 Aν α Looking at the scaling with respect to d, T, and s0, one can determine c3. This concludes the proof of the first part of the theorem. Regarding the second part of the theorem, we impose the following condition on d: (i) log log d 4096σ4s4 A (ii) 4 2νs0 φ2 0 + r 1 + 2ν (iii) d 100. With some constant c4, when d c4, it should be noted that the condition (ii) holds, as r 2 log τ log d r 1 log log d + 1 log s0 λ0 = 1/(log log d) 1 4 = o(1) (as d ). Thresholded Lasso Bandit Then, similarly to the proof in Appendix E.3, we obtain R(T) 2s As1τ + 36σs Ah2 0ν r 2 s0 + 2ν s0 C2 0 + s0 + 2ν s0 2 10s2 Aν α Investigating the scaling with respect to d, T, and s0, one can determine c5. F. Proof of Lemmas F.1. Proof of Lemma 5.4 We define v := ˆθ(t) 0 θ. We first analyze the performance of the initial Lasso estimate. Lemma F.1. Let ˆΣt := Pt s=1 As A s t be the empirical covariance matrix of the selected context vectors. Suppose ˆΣt satisfies the compatibility condition with the support S(θ) with the compatibility constant φt. Then, under Assumption 3.1, we have: P v 1 4s0λt 1 2 exp tλ2 t 32σ2s2 A + log d . The next lemma then states that the compatibility constant of ˆΣt does not deviate much from the compatibility constant of Σ. Lemma F.2. Let C0 := min n 1 2, φ2 0 512s0s2 AνCb o . For all t 2 log(2d2) C2 0 , we have: P φ2(ˆΣt, S(θ)) φ2 0 4νCb 1 exp t C2 0 2 Then, we follow the steps of the proof given by Zhou (2010). Let us define the event Gt as: Gt := v 1 4s0λt For the rest of this section, we assume that the event Gt holds. Note that: v 1 v S(θ)c 1 = X j S(θ)c |(ˆθ(t) 0 )j| j S(θ)c ˆS(t) 0 |(ˆθ(t) 0 )j| j ˆS(t) 0 \S(θ) |(ˆθ(t) 0 )j| (a) | ˆS(t) 0 \ S(θ)|4λt, where for (a), we used the construction of ˆS(t) 0 in the algorithm. We get: | ˆS(t) 0 \ S(θ)| v 1 Thresholded Lasso Bandit where for (a), we used the definition of Gt. 
We have: j S(θ), |(ˆθ(t) 0 )j| θmin v S(θ) θmin v S(θ) 1 Therefore, when t is large enough so that 4λt θmin 4s0λt φ2 t , we have: S(θ) ˆS(t) 0 . Using a similar argument, when t is large enough so that 4λt r 1 + 1 φ2 t s0 θmin 4s0λt φ2 t , it holds that S(θ) ˆS(t) 1 . From the construction of ˆS(t) 1 in the algorithm, it also holds that: ˆS(t) 1 ˆS(t) 0 . Therefore, i ˆS(t) 0 \S(θ) |(ˆθ(t) 0 )i| i ˆS(t) 1 \S(θ) |(ˆθ(t) 0 )i| | ˆS(t) 1 \ S(θ)|4λt q | ˆS(t) 0 |, | ˆS(t) 1 \ S(θ)| v 1 | ˆS(t) 0 | | ˆS(t) 0 | 4s0λt Note that the condition 4λt r 1 + 1 φ2 t s0 θmin 4s0λt φ2 t is equivalent to 4λt r 1 + 1 φ2 t concludes the proof of Lemma 5.4 by substituting φ2 t = φ2 0/(4νCb). F.2. Proof of Lemmas used in the proof of Lemma 5.4 F.2.1. PROOF OF LEMMA F.1 The proof is similar to that given by Oh et al. (2021). For the sake of brevity, let S = S(θ). Let us define the loss function: s=1 (rt θ, At )2. The initial Lasso estimate is given by: ˆθt := arg min θ {ℓt(θ ) + λt θ 1} . From this definition, we get: ℓt(ˆθt) + λt ˆθt 1 ℓt(θ) + λt θ 1. Let us denote E[ ] as the expectation over rt in this section. Note that in view of the previous inequality, we have: ℓt(ˆθt) E[ℓt(ˆθt)] + E[ℓt(ˆθt)] E[ℓt(θ)] + λt ˆθt 1 ℓt(θ) E[ℓt(θ)] + λt θ 1. Thresholded Lasso Bandit Denoting vt(θ) := ℓt(θ) E[ℓt(θ)] and E(θ ) := E[ℓt(θ )] E[ℓt(θ)], vt(ˆθt) + E(ˆθt) + λt ˆθt 1 vt(θ) + λt θ 1. Let us define the event Tt: Tt := {|vt(ˆθt) vt(θ)| 1 2λt ˆθt θ 1}. We can condition on this event in the rest of the proof: Lemma F.3. We have: P |vt(ˆθt) vt(θ)| 1 2λt ˆθt θ 1 1 2 exp tλ2 t 32σ2s2 A + log d . Given the event Tt, we have: 2E(ˆθt) 2λt( θ 1 ˆθt 1) + λt ˆθt θ 1. By the triangle inequality, ˆθt 1 = ˆθt,S 1 + ˆθt,Sc 1 θS 1 ˆθt,S θS + ˆθt,Sc 1. We also have: ˆθt θ 1 = (ˆθt θ)S 1 + (ˆθt θ)Sc 1 = ˆθt,S θS 1 + ˆθt,Sc 1. Therefore, we get: 2E(ˆθt) 2λt θ 1 2λt( θS 1 ˆθt,S θS 1 + ˆθt,Sc 1) + λt( ˆθt,S θS 1 + ˆθt,Sc 1) = 3λt ˆθt,S θS 1 λt ˆθt,Sc 1. (5) From the compatibility condition, we get: ˆθt,S θS 2 1 s0(ˆθt θ) ˆΣt(ˆθt θ) Using inequality (5), we get: 2E(ˆθt) + λt ˆθt θ 1 = 2E(ˆθt) + λt ˆθt,Sc 1 + λt ˆθt,S θS 1 3λt ˆθt,S θS 1 + λt ˆθt,S θS 1 = 4λt ˆθt,S θS 1 s0(ˆθt θ) ˆΣt(ˆθt θ) (ˆθt θ) ˆΣt(ˆθt θ) + 4λ2 ts0 φ2 t 2E(ˆθt) + 4λ2 ts0 φ2 t , where for the third inequality, we used 4uv u2 + 4v2 with u = q (ˆθt θ) ˆΣt(ˆθt θ) and v = λt s0 φt . The last inequality is due to Lemma F.4: Thresholded Lasso Bandit Lemma F.4. We have: 2(ˆθt θ) ˆΣt(ˆθt θ) . Thus, we get: ˆθt θ 1 4λts0 This concludes the proof. F.2.2. PROOF OF LEMMA F.2 For the sake of brevity, let S = S(θ). First, we define the adapted Gram matrix Σt := 1 t Pt s=1 E[As A s |Fs 1]. From the construction of the algorithm, E[As A s |Fs 1] = E[PK k=1 As,k A s,k1{k = argmax k As,k, ˆθs }| ˆθs]. The following lemma characterizes the expected Gram matrix generated by the algorithm. Lemma F.5 (Lemma 10 of Oh et al. (2021)). Under Assumptions 3.3 and 3.4, for each fixed vector θ Rd, we have: k=1,2 Ak A k 1{k = argmax k Ak, θ } where A B means that A B is positive semidefinite. Using Lemma F.5, we have Σt 1 2νCb Σ. (7) By Lemma 6.18 of Bühlmann & Van De Geer (2011), Assumption 3.2, and the definition of the compatibility constant, we get: φ2(Σt, S) φ2( 1 2νCb Σ, S) φ2 0 2νCb . (8) Furthermore, we have a following adaptive matrix concentration results for ˆΣt: Lemma F.6. Let C0 := min n 1 2, φ2 0 512s0s2 AνCb o . 
We have, for all t 2 log(2d2) 2s2 A ˆΣt Σt φ2(Σt, S) exp t C2 0 2 We use a following result from Bühlmann & Van De Geer (2011): Lemma F.7 (Corollary 6.8 in Bühlmann & Van De Geer (2011)). Suppose Σ0 satisfies the compatibility condition for the set S with |S| = s0, with the compatibility constant φ2(Σ0, S) > 0, and that Σ0 Σ1 λ, where 32λs0 φ2(Σ0,S) 1. Then, the compatibility condition also holds for Σ1 with the compatibility constant φ2(Σ0,S) 2 , i.e., φ2(Σ1, S) φ2(Σ0,S) Combining the above results, we get, for all t 2 log(2d2) φ2(ˆΣt, S) φ2(Σt, S) φ2 0 4νCb , with probability at least 1 exp t C2 0 2 . This concludes the proof. Thresholded Lasso Bandit F.2.3. PROOF OF LEMMA F.3 Let us denote ˆθ = ˆθt for simplicity. We compute vt(ˆθ) as: vt(ˆθ) = ℓt(ˆθ) E[ℓt(ˆθ)] s=1 (rs ˆθ, As )2 1 s=1 E h (rs ˆθ, As )2i s=1 ( θ, As + εs ˆθ, As )2 1 s=1 E h ( θ, As + εs ˆθ, As )2i s=1 (2εs θ ˆθ, As + ε2 s E[ε2 s]). We also have that: s=1 (ε2 s E[ε2 s]). Therefore, we can compute: vt(ˆθ) vt(θ) = 1 s=1 2εs θ ˆθ, As where we used Hölder s inequality in the above inequality. We have that: s=1 εs(As)i where (As)i is the i-th element of As. Define e Ft as the σ-algebra generated by the random variables (A1, A1, ε1, . . . , At, At, εt, At+1). For each i [d], we get E[εs(As)i| e Fs 1] = (As)i E[εs| e Fs 1] = 0. Thus, for each i [d], {εs(As)i}t s=1 is a martingale difference sequence adapted to the filtration e F1 . . . e Ft 1. By Assumption 3.1, we have |(As)i| s A. We compute, for each α R, E[exp(αεs(As)i)| e Fs 1] E[exp(αεss A)| e Fs 1] exp α2s2 Aσ2 Therefore εs(As)i is also a sub-Gaussian random variable with the variance proxy (s Aσ)2. Next, we use the concentration results by Wainwright (2019), Theorem 2.19: Theorem F.8. Let (Zt, e Ft) t=1 be a martingale difference sequence, and assume that for all α R, E[exp(αZs)| e Fs 1] exp( α2σ2 2 ) with probability one. Then, for all x 0, we get: From these results, we get: s=1 εs(As)i Thresholded Lasso Bandit Taking λ = 1 1 2d exp tλ2 t 32σ2s2 A = 1 2 exp tλ2 t 32σ2s2 A + log d . F.2.4. PROOF OF LEMMA F.4 We denote ˆθt = ˆθ for brevity. From the definitions of E(θ ) and ℓt(θ ), E(ˆθ) = E[ℓt(ˆθ)] E[ℓt(θ)] s=1 ( ˆθ θ, As + εs)2 s=1 ˆθ θ, As 2 s=1 (ˆθ θ) As A s (ˆθ θ) = (ˆθ θ) ˆΣt(ˆθ θ) 2(ˆθ θ) ˆΣt(ˆθ θ), where for the inequality, we used the positive semi-definiteness of ˆΣt. F.2.5. PROOF OF LEMMA F.5 The proof is almost identical to the proof of Lemma 10 in Oh et al. (2021). F.2.6. PROOF OF LEMMA F.6 Let us define γij t (At) as: γij t (At) := 1 2s2 A ((At)i(At)j E[(At)i(At)j | Ft 1]) , where (At)i is the i-th element of At. We have, following a Bernstein-like inequality for the adapted data: Lemma F.9 (Bernstein-like inequality for the adapted data (Oh et al., 2021)). Suppose for all t 1, for all 1 i j d, E[γij t (At)|Ft 1] = 0 and E[|γij t (At)|m | Ft 1] m! for all integer m 2. Then, for all x > 0, and for all integer t 1, we have: max 1 i j d s=1 γij s (As) t + 2 log(2d2) Note that 1 2s2 A ˆΣt Σt = max1 i j d 1 t Pt s=1 γij s (As) , E[γij t (At)|Ft 1] = 0, and E[|γij t (At)|m | Ft 1] 1 for all integer m 2. Therefore, we can apply Lemma F.9: 1 2s2 A ˆΣt Σt x + t + 2 log(2d2) Thresholded Lasso Bandit For all t 2 log(2d2) C2 0 with C0 := min n 1 2, φ2 0 256s0s2 AνCb o , taking x = C2 0, t + 2 log(2d2) t 2C2 0 + 2 φ2 0 128s0s2 AνCb In summary, for all t 2 log(2d2) C2 0 , we get: 2s2 A ˆΣt Σt φ2(Σt, S) 1 2s2 A ˆΣt Σt C2 0 + t + 2 log(2d2) exp t C2 0 2 This concludes the proof. F.3. 
Proof of Lemma 5.6 For a fixed ˆS, first we define the adapted Gram matrix on the estimated support as s=1 E[As( ˆS)As( ˆS) |Fs 1]. From the construction of the algorithm, E[As( ˆS)As( ˆS) |Fs 1] = E[PK k=1 As,k( ˆS)As,k( ˆS) 1{k = argmax k As,k, ˆθs }| ˆθs]. Recall that for each B [d], ΣB := 1 K PK k=1 EA p A Ak(B)Ak(B) , where Ak(B) is a |B|-dimensional vector extracted the elements of Ak with indices in B. The following lemma characterizes the expected Gram matrix generated by the algorithm. Lemma F.10. Fix ˆS such that S(θ) ˆS and | ˆS| s0 + (4νCb s0)/φ2 0. Fix θ Rd. Under Assumption 3.2, 3.3, and 3.4, we have: k [K] Ak( ˆS)Ak( ˆS) 1{k = argmax k Ak, θ ˆS } 1 2νCb Σ ˆS, where A B means that A B is positive semidefinite. First, we prove the lower bound on the smallest eigenvalue of the expected covariance matrices. Let Σ ˆS := 1 t Pt s=1 E[As( ˆS)As( ˆS) | Ft 1]. By Assumption 3.5 and the construction of the algorithm, under the event Et, we get: λmin(Σ ˆS) = λmin k=1 As,k, ˆSAs,k, ˆS1{k = argmax k Ak , ˆθs } | ˆθs k=1 As,k, ˆSAs,k, ˆS1{k = argmax k Ak , ˆθs } | ˆθs Thresholded Lasso Bandit where for the first inequality, we used the concavity of λmin( ) over the positive semi-definite matrices. Next, we prove the upper bound on the largest eigenvalue of As( ˆS)As( ˆS) : λmax(As( ˆS)As( ˆS) ) = max v 2=1 v As( ˆS)As( ˆS) v (a) max v 2=1 v 2 1 As( ˆS) 2 | ˆS|s2 A s0 + (4νCb s0)/φ2 0 s2 A, where for (a), we used Hölder s inequality. Now recall the matrix Chernoff inequality by Tropp (2011): Theorem F.11 (Matrix Chernoff, Theorem 3.1 of Tropp (2011)). Let F1 F2 . . . Ft be a filtration and a consider a finite sequence {Xs} of positive semi-definite matrices with dimension d, adapted to the filtration. Suppose λmax(Xk) R almost surely. Define the finite series: Y := Pt s=1 Xs and W := Pt s=1 E[Xs | Fs 1]. Then, for all µ 0, for all δ [0, 1), we have: P (λmin(Y ) (1 δ)µ and λmin(W) µ) d e δ Taking R = s0 + (4νCb s0)/φ2 0 s2 A, Xs = As, ˆSA s, ˆS, Y = tˆΣ ˆS, W = tΣ ˆS, δ = 1/2, µ = t α 2νCb : P λmin(tˆΣ ˆS) 1 2t α 2νCb and λmin(tΣ ˆS) t α 2νCb s0 + 4νCb s0 log s0 + 4νCb s0 tα 20s2 AνCb s0 + (4νCb s0)/φ2 0 where for the last inequality, we used 0.5 0.5 log(0.5) < 1 10. This concludes the proof. F.4. Proof of Lemma 5.7 In this proof, we denote ˆS = ˆS(t) 1 and ε = (ε1, . . . , εt) . Assume λmin(ˆΣ ˆS) λ. We have: ˆθt+1 θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) R θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) (Aθ + ε) θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) (A( ˆS)θ( ˆS) + ε) θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) ε 2 (A( ˆS) A( ˆS)) 1 2 A( ˆS) ε 2 λt A( ˆS) ε 2. Thresholded Lasso Bandit We get (note that we are conditioning on a fixed ˆS during the proof): P ˆθt+1 θ 2 x and λmin(ˆΣ ˆS) λ = P ˆθt+1 θ 2 x λmin(ˆΣ ˆS) λ P(λmin(ˆΣ ˆS) λ) P A( ˆS) ε 2 λtx λmin(ˆΣ ˆS) λ P(λmin(ˆΣ ˆS) λ) P A( ˆS) ε 2 λtx s=1 εs(As)i1 n i ˆS o λtx q s0 + 4νCb s0 s=1 εs(As)i s0 + 4νCb s0 (a) 2 s0 + 4νCb s0 2σ2s2 A s0 + 4νCb s0 where for (a), we used Theorem F.8. This concludes the proof. F.5. Proof of Lemma 5.8 We follow the proof strategy of Lemma 6 in Bastani et al. (2021). Let rπ t be the instantaneous expected regret of algorithm π at round t defined as: rπ t := E max A At A At, θ . Let us define the events Rk := {At RK d : k argmax k At,k , θ } and Gλ t := n λmin(ˆΣ ˆS) λ o . We have: k=1 E [rπ t | At Rk] P (At Rk) . 
The term E [rπ t | At Rk] can be further computed as: E [rπ t | At Rk] = E [ At,k At, θ | At Rk] E h 1 n At, ˆθt At,k, ˆθt o At,k At, θ | At Rk i ℓ =k E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk i ℓ =k E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk, Et, G + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et Let us denote the event Ih := {At RK d : At,k At,ℓ, θ (2δs Ah, 2δs A(h + 1)]} where δ = σs AνCb v u u t32 s0 + 4νCbs0 Thresholded Lasso Bandit By conditioning on Ih, we get: E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G h=0 E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G α 4νCb t i P(At Ih) h=0 2δs A(h + 1) E h 1 n At,ℓ, ˆθt At,k, ˆθt o | At Rk Ih, Et, G P ( At,k At,ℓ, θ (0, 2δs A(h + 1)]) h=0 4δ2s2 A(h + 1)2Cm P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G where for (a), we used the definition of Ih and for (b), we used Assumption 5.1. Under the event At Ih, the event At,ℓ, ˆθt At,k, ˆθt happens only when at least one of the events At,k, θ ˆθ δs Ah or At,ℓ, ˆθ θ δs Ah holds. Therefore, P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G P At,k, θ ˆθ δs Ah | At Rk Ih, Et, G α 4νCb t + P At,ℓ, ˆθ θ δs Ah | At Rk Ih, Et, G (a) P θ ˆθ 2 δh | At Rk Ih, Et, G α 4νCb t + P θ ˆθ 2 δh | At Rk Ih, Et, G = 2P θ ˆθ 2 δh | At Rk Ih, Et, G where for (a), we used the Cauchy Schwarz inequality. Let us denote s = s0 + 4νCb s0 φ2 0 . Then, using Lemma 5.7, we get: P θ ˆθ 2 δh | At Rk Ih, Et, G α 4νCb t 2s exp α2tδ2h2 32σ2s2 Aν2C2 b s = 2s exp h2 . We also trivially have that: P θ ˆθ 2 δh | At Rk Ih, Et, G α 4νCb t 1. Therefore, we can bound the expected instantaneous regret as: k=1 E [rπ t | At Rk] P (At Rk) 4δ2s2 A(h + 1)2Cm min 1, 4s exp h2 +2(K 1)s As1 P(Ec t ) + P G α 4νCb t c P (At Rk) h=0 (h + 1)2 min 1, 4s exp h2 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et h=0 (h + 1)2 + h=h0+1 4s (h + 1)2 exp h2 ! + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et Thresholded Lasso Bandit where for brevity G = 4δ2s2 ACm(K 1), where for (a), we used PK k=1 P (At Rk) = 1 from Assumption 5.1 and we set h0 := log 4s + 1 . We have: h=h0+1 (h + 1)2 exp( h2) = h=h0+1 h2 exp( h2) + 2 h=h0+1 h exp( h2) + h=h0+1 exp( h2) h0 h2 exp( h2)dh + 2 Z h0 h exp( h2)dh + Z h0 exp( h2)dh. Using an integration by parts, the inequality R h0 exp( h2 0)dh exp( h2 0)/(h0 + p h2 0 + 4/π) exp( h2 0), and h0 1 from s0 1, we get: Z h0 h2 exp( h2)dh 1 2h0 exp( h2 0) + 1 2 exp( h2 0) h0 h exp( h2)dh = exp( h2 0) Z h0 exp( h2)dh exp( h2 0). h=h0+1 (h + 1)2 exp( h2) 1 2h0 exp( h2 0) + 5 2 exp( h2 0) h0 exp( h2 0) + 5 exp( h2 0). h=0 (h + 1)2 + h=h0+1 4s (h + 1)2 exp h2 (h0 + 1)(h0 + 2)(2h0 + 3) 6 + 4s (h0 + 5) exp( h2 0) 2h3 0 + 9h2 0 + 13h0 + 6 6 + 4s (h0 + 5) 1 (a) 11h3 0, where for (a), we used h0 1. Finally, we get: rπ t 44δ2s2 ACm(K 1)h3 0 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et 1408σ2s4 ACm(K 1)h3 0ν2C2 b s0 + 4νCb s0 α2 1 t 1 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et This concludes the proof. F.6. Proof of Lemma 5.9 Let rπ t be the instantaneous expected regret of algorithm π in round t defined as: rπ t := E max A At A At, θ . Let us define the events Rk := {At RK d : k argmax k At,k , θ } and Gλ t := n λmin(ˆΣ ˆS) λ o . 
As in the proof of Lemma 5.8, we get: k=1 E [rπ t | At Rk] P (At Rk) , Thresholded Lasso Bandit E [rπ t | At Rk] = E [ At,k At, θ | At Rk] ℓ =k E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk, Et, G + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et Let us denote the event Ih := {At RK d : At,k At,ℓ, θ (2δs Ah, 2δs A(h + 1)]} where δ = σs AνCb v u u t32 s0 + 4νCbs0 By conditioning on Ih, we get: E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G h=0 E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G α 4νCb t i P(At Ih) h=0 2δs A(h + 1) E h 1 n At,ℓ, ˆθt At,k, ˆθt o | At Rk Ih, Et, G P ( At,k At,ℓ, θ (0, 2δs A(h + 1)]) h=0 2δs A(h + 1)P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G where for (a), we used the definition of Ih. Under the event At Ih, the event At,ℓ, ˆθt At,k, ˆθt happens only when at least one of the events At,k, θ ˆθ δs Ah or At,ℓ, ˆθ θ δs Ah holds. Therefore, P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G P At,k, θ ˆθ δs Ah | At Rk Ih, Et, G + P At,ℓ, ˆθ θ δs Ah | At Rk Ih, Et, G (a) P θ ˆθ 2 δh | At Rk Ih, Et, G α 4νCb t + P θ ˆθ 2 δh | At Rk Ih, Et, G = 2P θ ˆθ 2 δh | At Rk Ih, Et, G where for (a), we used the Cauchy Schwarz inequality. Let us denote s = s0 + 4νCb s0 φ2 0 . Then, using Lemma 5.7, we get: P θ ˆθ 2 δh | At Rk Ih, Et, G α 4νCb t 2s exp α2tδ2h2 32σ2s2 Aν2C2 b s = 2s exp h2 . We also trivially have that: P θ ˆθ 2 δh | At Rk Ih, Et, G α 4νCb t 1. Thresholded Lasso Bandit Therefore, we can upper bound the expected instantaneous regret as: k=1 E [rπ t | At Rk] P (At Rk) 2δs A(h + 1) min 1, 4s exp h2 +2(K 1)s As1 P(Ec t ) + P G α 4νCb t c P (At Rk) (a) 2δs A(K 1) h=0 (h + 1) min 1, 4s exp h2 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et h=0 (h + 1) + h=h0+1 4s (h + 1) exp h2 ! + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et where for (a), we used PK k=1 P (At Rk) = 1 and we set h0 := log 4s + 1 . We have: h=h0+1 (h + 1) exp( h2) = h=h0+1 h exp( h2) + h=h0+1 exp( h2) h0 h exp( h2)dh + Z h0 exp( h2)dh. Since h0 1 from s0 1, we get: Z h0 h exp( h2)dh = 1 2 exp( h2 0) Z h0 exp( h2)dh exp( h2 0). h=h0+1 (h + 1) exp( h2) 3 2 exp( h2 0). h=0 (h + 1) + h=h0+1 4s (h + 1) exp h2 (h0 + 1)(h0 + 2) 2 exp( h2 0) h2 0 + 3h0 + 2 where for (a), we used h0 1. Finally, we get: rπ t 9δs A(K 1)h2 0 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et 36σs A(K 1)h2 0νCb α v u u t2 s0 + 4νCb s0 t 1 + 2(K 1)s As1 P(Ec t ) + P G α 4νCb t c Et Thresholded Lasso Bandit This concludes the proof. G. Proof of Lemmas (without balanced covariance) Lemma G.1. Let t 2 log(2d2) C2 0 such that 4 2νs0 φ2 0 + r 1 + 2ν λt θmin. Under Assumptions 3.1, 3.2, 3.3, and 3.4, P S(θ) ˆS(t) 1 and | ˆS(t) 1 \ S(θ)| 2ν s0 1 2 exp tλ2 t 32σ2s2 A + log d exp t C2 0 2 . We redefine the event Et as Et = S ˆS(t) 1 and | ˆS(t) 1 \ S| 2ν s0 Lemma G.2. Let t [T]. Under Assumptions 3.1 and D.1, we have: P λmin(ˆΣ ˆS) α log s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 Lemma G.3. Let t [T] and s = s0 + 2ν s0/φ2 0. Under Assumption 3.1, we have for all x, λ > 0: P ˆθt+1 θ 2 x and λmin(ˆΣ ˆS) λ Et 2s exp λ2tx2 We redefine the parameter h0 = q log(4(s0 + 2ν s0 φ2 0 )) + 1 . Lemma G.4. Define G α 2ν t := n λmin(ˆΣ ˆS) α 2ν o . Let t 2. Under Assumptions 3.1, 3.2, 3.3, 3.4, 5.1, and D.1, the expected instantaneous regret E[max A At A At, θ ] is upper bounded by: 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 1 t 1 + 2s As1 P(Ec t ) + P G α 2ν t c Et . Lemma G.5. Under Assumptions 3.1, 3.2, 3.3, 3.4, and D.1, for any t [T], E[max A At A At, θ ] is upper bounded by: 18σs Ah2 0ν α v u u t2 s0 + 2ν s0 t 1 + 2s As1 P(Ec t ) + P G α 2ν t c Et . G.1. 
Proof of Lemma G.1 For the sake of brevity, let S = S(θ). We define v := ˆθ(t) 0 θ. We first analyze the performance of the initial Lasso estimate. Lemma G.6. Let ˆΣt := Pt s=1 As A s t be the empirical covariance matrix of the selected context vectors. Suppose ˆΣt satisfies the compatibility condition with the support S with the compatibility constant φt. Then, under Assumption 3.1, we have: P v 1 4s0λt 1 2 exp tλ2 t 32σ2s2 A + log d . The next lemma then states that the compatibility constant of ˆΣt does not deviate much from the compatibility constant of Σ. Lemma G.7. Assume K = 2. Let C0 := min n 1 2, φ2 0 256s0s2 Aν o . For all t 2 log(2d2) C2 0 , we have: P φ2(ˆΣt, S) φ2 0 2ν 1 exp t C2 0 2 Thresholded Lasso Bandit Then, we follow the steps of the proof given by Zhou (2010). Let us define the event Gt as: Gt := v 1 4s0λt For the rest of this section, we assume that the event Gt holds. Note that: j Sc |(ˆθ(t) 0 )j| j Sc ˆS(t) 0 |(ˆθ(t) 0 )j| j ˆS(t) 0 \S |(ˆθ(t) 0 )j| (a) | ˆS(t) 0 \ S|4λt, where for (a), we used the construction of ˆS(t) 0 in the algorithm. We get: | ˆS(t) 0 \ S| v 1 where for (a), we used the definition of Gt. We have: j S, |(ˆθ(t) 0 )j| θmin v S θmin v S 1 Therefore, when t is large enough so that 4λt θmin 4s0λt φ2 t , we have: S ˆS(t) 0 . Using a similar argument, when t is large enough so that 4λt r 1 + 1 φ2 t s0 θmin 4s0λt φ2 t , it holds that S ˆS(t) 1 . From the construction of ˆS(t) 1 in the algorithm, it also holds that: ˆS(t) 1 ˆS(t) 0 . Therefore, i ˆS(t) 0 \S |(ˆθ(t) 0 )i| i ˆS(t) 1 \S |(ˆθ(t) 0 )i| | ˆS(t) 1 \ S|4λt q | ˆS(t) 0 |, | ˆS(t) 1 \ S| v 1 | ˆS(t) 0 | | ˆS(t) 0 | 4s0λt Note that the condition 4λt r 1 + 1 φ2 t s0 θmin 4s0λt φ2 t is equivalent to 4λt r 1 + 1 φ2 t concludes the proof of Lemma G.1 by substituting φ2 t = φ2 0/(2ν). Thresholded Lasso Bandit G.2. Proof of Lemmas used in the proof of Lemma G.1 G.2.1. PROOF OF LEMMA G.6 The proof is identical to that of Lemma F.1. G.2.2. PROOF OF LEMMA G.7 For the sake of brevity, let S = S(θ). First, we define the adapted Gram matrix Σt := 1 t Pt s=1 E[As A s |Fs 1]. From the construction of the algorithm, E[As A s |Fs 1] = E[PK k=1 As,k A s,k1{k = argmax k As,k, ˆθs }| ˆθs]. The following lemma characterizes the expected Gram matrix generated by the algorithm. Lemma G.8. Assume K = 2. Under Assumption 3.2 and Assumption 3.3, for each fixed vector θ Rd, we have: k=1,2 Ak A k 1{k = argmax k Ak, θ } where A B means that A B is positive semidefinite. Using Lemma G.8, we have By Lemma 6.18 of Bühlmann & Van De Geer (2011), Assumption 3.2, and the definition of the compatibility constant, we get: φ2(Σt, S) φ2(1 ν Σ, S) φ2 0 ν . (10) Furthermore, we have a following adaptive matrix concentration results for ˆΣt: Lemma G.9. Let C0 := min n 1 2, φ2 0 256s0s2 Aν o . We have, for all t 2 log(2d2) 2s2 A ˆΣt Σt φ2(Σt, S) exp t C2 0 2 Combining Lemmas G.9 and F.7, we get, for all t 2 log(2d2) φ2(ˆΣt, S) φ2(Σt, S) with probability at least 1 exp t C2 0 2 . This concludes the proof. G.2.3. PROOF OF LEMMA G.8 The proof is almost identical to the proof of Lemma 2 in Oh et al. (2021). G.2.4. PROOF OF LEMMA G.9 Let us define γij t (At) as: γij t (At) := 1 2C2 A ((At)i(At)j E[(At)i(At)j | Ft 1]) , Thresholded Lasso Bandit where (At)i is the i-th element of At. Note that 1 2s2 A ˆΣt Σt = max1 i j d 1 t Pt s=1 γij s (As) , E[γij t (At)|Ft 1] = 0, and E[|γij t (At)|m | Ft 1] 1 for all integer m 2. 
Therefore, we can apply Lemma F.9: 1 2s2 A ˆΣt Σt x + t + 2 log(2d2) For all t 2 log(2d2) C2 0 with C0 := min n 1 2, φ2 0 256s0s2 Aν o , taking x = C2 0, t + 2 log(2d2) t 2C2 0 + 2 φ2 0 64s0s2 Aν 64s0s2 Aν . In summary, for all t 2 log(2d2) C2 0 , we get: 2s2 A ˆΣt Σt φ2(Σt, S) 1 2s2 A ˆΣt Σt C2 0 + t + 2 log(2d2) exp t C2 0 2 This concludes the proof. G.3. Proof of Lemma G.2 For a fixed ˆS, first we define the adapted Gram matrix on the estimated support as s=1 E[As( ˆS)As( ˆS) |Fs 1]. From the construction of the algorithm, E[As( ˆS)As( ˆS) |Fs 1] = E[PK k=1 As,k( ˆS)As,k( ˆS) 1{k = argmax k As,k, ˆθs }| ˆθs]. Recall that for each B [d], ΣB := 1 K PK k=1 EA p A Ak(B)Ak(B) , where Ak(B) is a |B|-dimensional vector extracted the elements of Ak with indices in B. The following lemma characterizes the expected Gram matrix generated by the algorithm. Lemma G.10. Assume K = 2. Fix ˆS such that S(θ) ˆS and | ˆS| s0 +(2ν s0)/φ2 0. Fix θ Rd. Under Assumption 3.2 and Assumption 3.3, we have: k=1,2 Ak( ˆS)Ak( ˆS) 1{k = argmax k Ak, θ ˆS } where A B means that A B is positive semidefinite. First, we prove the lower bound on the smallest eigenvalue of the expected covariance matrices. Let Σ ˆS := 1 t Pt s=1 E[As( ˆS)As( ˆS) | Ft 1]. By Assumption D.1 and the construction of the algorithm, under the event Et, we Thresholded Lasso Bandit λmin(Σ ˆS) = λmin k=1 As,k, ˆSAs,k, ˆS1{k = argmax k Ak , ˆθs } | ˆθs k=1 As,k, ˆSAs,k, ˆS1{k = argmax k Ak , ˆθs } | ˆθs where for the inequality, we used the concavity of λmin( ) over the positive semi-definite matrices. Next, we prove the upper bound on the largest eigenvalue of As( ˆS)As( ˆS) : λmax(As( ˆS)As( ˆS) ) = max v =1 v As( ˆS)As( ˆS) v (a) max v =1 v 2 1 As( ˆS) 2 | ˆS|s2 A s0 + (2ν s0)/φ2 0 s2 A where for (a), we used Hölder s inequality and Assumption 3.1. Taking R = s0 + (2ν s0)/φ2 0 s2 A, Xs = As, ˆSA s, ˆS, Y = tˆΣ ˆS, W = tΣ ˆS, δ = 1/2, µ = t α ν in Theorem F.11, we have: P λmin(tˆΣ ˆS) 1 ν and λmin(tΣ ˆS) tα log s0 + 2ν s0 tα 10s2 Aν s0 + (2ν s0)/φ2 0 where for the last inequality, we used 0.5 0.5 log(0.5) < 1 10. This concludes the proof. G.3.1. PROOF OF LEMMA G.10 The proof is almost identical to the proof of Lemma 2 in Oh et al. (2021). G.4. Proof of Lemma G.3 In this proof, we denote ˆS = ˆS(t) 1 and ε = (ε1, . . . , εt) . Assume λmin(ˆΣ ˆS) λ. We have: ˆθt+1 θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) R θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) (Aθ + ε) θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) (A( ˆS)θ( ˆS) + ε) θ 2 = (A( ˆS) A( ˆS)) 1A( ˆS) ε 2 (A( ˆS) A( ˆS)) 1 2 A( ˆS) ε 2 λt A( ˆS) ε 2. Thresholded Lasso Bandit We get (note that we are conditioning on a fixed ˆS during the proof): P ˆθt+1 θ 2 x and λmin(ˆΣ ˆS) λ = P ˆθt+1 θ 2 x λmin(ˆΣ ˆS) λ P(λmin(ˆΣ ˆS) λ) P A( ˆS) ε 2 λtx λmin(ˆΣ ˆS) λ P(λmin(ˆΣ ˆS) λ) P A( ˆS) ε 2 λtx s=1 εs(As)i1 n i ˆS o λtx q s=1 εs(As)i (a) 2 s0 + 2ν s0 2σ2s2 A s0 + 2ν s0 where for (a), we used Theorem F.8. This concludes the proof. G.5. Proof of Lemma G.4 We follow the proof strategy of Lemma 6 in Bastani et al. (2021). Let rπ t be the instantaneous expected regret of algorithm π at round t defined as: rπ t := E max A At A At, θ . Let us define the events Rk := {At RK d : k argmax k At,k , θ } and Gλ t := n λmin(ˆΣ ˆS) λ o . We have: k=1 E [rπ t | At Rk] P (At Rk) . 
The term E [rπ t | At Rk] can be further computed as: E [rπ t | At Rk] = E [ At,k At, θ | At Rk] E h 1 n At, ˆθt At,k, ˆθt o At,k At, θ | At Rk i ℓ =k E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk i ℓ =k E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk, Et, G + 2s As1 P(Ec t ) + P G α 2ν t c Et . Let us denote the event Ih := {At RK d : At,k At,ℓ, θ (2δs Ah, 2δs A(h + 1)]} where v u u t8 s0 + 2ν s0 Thresholded Lasso Bandit By conditioning on Ih, we get: E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G h=0 E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G α 2ν t i P(At Ih) h=0 2δs A(h + 1) E h 1 n At,ℓ, ˆθt At,k, ˆθt o | At Rk Ih, Et, G P ( At,k At,ℓ, θ (0, 2δs A(h + 1)]) h=0 4δ2s2 A(h + 1)2Cm P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G where for (a), we used the definition of Ih and for (b), we used Assumption 5.1. Under the event At Ih, the event At,ℓ, ˆθt At,k, ˆθt happens only when at least one of the events At,k, θ ˆθ δs Ah or At,ℓ, ˆθ θ δs Ah holds. Therefore, P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G P At,k, θ ˆθ δs Ah | At Rk Ih, Et, G + P At,ℓ, ˆθ θ δs Ah | At Rk Ih, Et, G (a) P θ ˆθ 2 δh | At Rk Ih, Et, G α 2ν t + P θ ˆθ 2 δh | At Rk Ih, Et, G = 2P θ ˆθ 2 δh | At Rk Ih, Et, G where for (a), we used the Cauchy Schwarz inequality. Let us denote s = s0 + 2ν s0 φ2 0 . Then, using Lemma G.3, we get: P θ ˆθ 2 δh | At Rk Ih, Et, G α 2ν t 2s exp α2tδ2h2 = 2s exp h2 . We also trivially have that: P θ ˆθ 2 δh | At Rk Ih, Et, G Therefore, we can upper bound the expected instantaneous regret as: k=1 E [rπ t | At Rk] P (At Rk) 4δ2s2 A(h + 1)2Cm min 1, 4s exp h2 +2s As1 P(Ec t ) + P G α 2ν t c P (At Rk) (a) 4δ2s2 ACm h=0 (h + 1)2 min 1, 4s exp h2 + 2s As1 P(Ec t ) + P G α 2ν t c Et h=0 (h + 1)2 + h=h0+1 4s (h + 1)2 exp h2 ! + 2s As1 P(Ec t ) + P G α 2ν t c Et Thresholded Lasso Bandit where for (a), we used P2 k=1 P (At Rk) = 1 from Assumption 5.1 and we set h0 := log 4s + 1 . We have: h=h0+1 (h + 1)2 exp( h2) = h=h0+1 h2 exp( h2) + 2 h=h0+1 h exp( h2) + h=h0+1 exp( h2) h0 h2 exp( h2)dh + 2 Z h0 h exp( h2)dh + Z h0 exp( h2)dh. Using an integration by parts, the inequality R h0 exp( h2 0)dh exp( h2 0)/(h0 + p h2 0 + 4/π) exp( h2 0), and h0 1 from s0 1, we get: Z h0 h2 exp( h2)dh 1 2h0 exp( h2 0) + 1 2 exp( h2 0) h0 h exp( h2)dh = exp( h2 0) Z h0 exp( h2)dh exp( h2 0). h=h0+1 (h + 1)2 exp( h2) 1 2h0 exp( h2 0) + 5 2 exp( h2 0) h0 exp( h2 0) + 5 exp( h2 0). h=0 (h + 1)2 + h=h0+1 4s (h + 1)2 exp h2 (h0 + 1)(h0 + 2)(2h0 + 3) 6 + 4s (h0 + 5) exp( h2 0) 2h3 0 + 9h2 0 + 13h0 + 6 6 + 4s (h0 + 5) 1 (a) 11h3 0, where for (a), we used h0 1. Finally, we get: rπ t 44δ2s2 ACmh3 0 + 2s As1 P(Ec t ) + P G α 2ν t c Et 352σ2s4 ACmh3 0ν2 s0 + 2ν s0 α2 1 t 1 + 2s As1 P(Ec t ) + P G α 2ν t c Et . This concludes the proof. G.6. Proof of Lemma G.5 Let rπ t be the instantaneous expected regret of algorithm π in round t defined as: rπ t := E max A At A At, θ . Let us define the events Rk := {At RK d : k argmax k At,k , θ } and Gλ t := n λmin(ˆΣ ˆS) λ o . As in the proof of Lemma 5.8, we get: k=1 E [rπ t | At Rk] P (At Rk) , Thresholded Lasso Bandit E [rπ t | At Rk] = E [ At,k At, θ | At Rk] ℓ =k E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk, Et, G + 2s As1 P(Ec t ) + P G α 2ν t c Et . 
Let us denote the event Ih := {At RK d : At,k At,ℓ, θ (2δs Ah, 2δs A(h + 1)]} where v u u t8 s0 + 2νs0 By conditioning on Ih, we get: E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G h=0 E h 1 n At,ℓ, ˆθt At,k, ˆθt o At,k At,ℓ, θ | At Rk Ih, Et, G α 2ν t i P(At Ih) h=0 2δs A(h + 1) E h 1 n At,ℓ, ˆθt At,k, ˆθt o | At Rk Ih, Et, G P ( At,k At,ℓ, θ (0, 2δs A(h + 1)]) h=0 2δs A(h + 1)P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G where for (a), we used the definition of Ih. Under the event At Ih, the event At,ℓ, ˆθt At,k, ˆθt happens only when at least one of the events At,k, θ ˆθ δs Ah or At,ℓ, ˆθ θ δs Ah holds. Therefore, P At,ℓ, ˆθt At,k, ˆθt | At Rk Ih, Et, G P At,k, θ ˆθ δs Ah | At Rk Ih, Et, G + P At,ℓ, ˆθ θ δs Ah | At Rk Ih, Et, G (a) P θ ˆθ 2 δh | At Rk Ih, Et, G α 2ν t + P θ ˆθ 2 δh | At Rk Ih, Et, G = 2P θ ˆθ 2 δh | At Rk Ih, Et, G where for (a), we used the Cauchy Schwarz inequality. Let us denote s = s0 + 2ν s0 φ2 0 . Then, using Lemma 5.7, we get: P θ ˆθ 2 δh | At Rk Ih, Et, G α 2ν t 2s exp α2tδ2h2 8σ2s2 Aν2C2 b s = 2s exp h2 . We also trivially have that: P θ ˆθ 2 δh | At Rk Ih, Et, G Thresholded Lasso Bandit Therefore, we can upper bound the expected instantaneous regret as: k=1 E [rπ t | At Rk] P (At Rk) 2δs A(h + 1) min 1, 4s exp h2 +2s As1 P(Ec t ) + P G α 2ν t c P (At Rk) h=0 (h + 1) min 1, 4s exp h2 + 2s As1 P(Ec t ) + P G α 2ν t c Et h=0 (h + 1) + h=h0+1 4s (h + 1) exp h2 ! + 2s As1 P(Ec t ) + P G α 2ν t c Et where for (a), we used PK k=1 P (At Rk) = 1 and we set h0 := log 4s + 1 . We have: h=h0+1 (h + 1) exp( h2) = h=h0+1 h exp( h2) + h=h0+1 exp( h2) h0 h exp( h2)dh + Z h0 exp( h2)dh. Since h0 1 from s0 1, we get: Z h0 h exp( h2)dh = 1 2 exp( h2 0) Z h0 exp( h2)dh exp( h2 0). h=h0+1 (h + 1) exp( h2) 3 2 exp( h2 0). h=0 (h + 1) + h=h0+1 4s (h + 1) exp h2 (h0 + 1)(h0 + 2) 2 exp( h2 0) h2 0 + 3h0 + 2 where for (a), we used h0 1. Finally, we get: rπ t 9δs Ah2 0 + 2s As1 P(Ec t ) + P G α 2ν t c Et 18σs Ah2 0ν α v u u t2 s0 + 2ν s0 t 1 + 2s As1 P(Ec t ) + P G α 2ν t c Et . This concludes the proof. Thresholded Lasso Bandit H. Additional Experimental Results and Details H.1. The implementation of Lasso bandit Although the problem formulation for DR Lasso and SA Lasso is the same as ours, the problem formulation for Lasso bandit Bastani & Bayati (2020) is different from ours. In Bastani & Bayati (2020), the unknown regression vectors are defined armwise and a common context is observed among the arms. In the numerical experiments, we followed the comparison idea in Kim & Paik (2019) and Oh et al. (2021) to apply Lasso bandit of Bastani & Bayati (2020) to our problem setting. The idea is explained as follows: from the action set At, the Kd-dimensional context vector Xt = (A t,1, A t,2, . . . , A t,K) RKd and the Kd-dimensional arm-wise unknown regression vector βk = (1{k = 1}θ , 1{k = 2}θ , . . . , 1{k = K}θ ) RKd for each k [K] is considered to enable the comparison. Under these transformations, we have problem dimension Kd (instead of d). Thus, the assumptions and the regret guarantees have different scalings mainly in K. H.2. Additional Results with Various Correlation Levels Figures 5-7 show the numerical results with correlation levels between two arms ρ2 {0.0, 0.3, 0.7} and dimension d {100, 1000, 2000, 10000}, respectively. We find that TH Lasso bandit exhibits lower regret than SA Lasso bandit and DR Lasso bandit in all scenarios. In particular, the difference between TH Lasso and SA Lasso becomes more apparent as the dimension d increases, just as the theorem shows. 
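To make the arm-wise reduction described in Section H.1 above concrete, here is a minimal sketch (our own illustration, not the released implementation; the helper names are ours) that builds the stacked Kd-dimensional context Xt and the arm-wise regression vectors βk from an action set At and the shared parameter θ, and checks that ⟨Xt, βk⟩ = ⟨At,k, θ⟩ for every arm:

import numpy as np

def stack_context(A_t: np.ndarray) -> np.ndarray:
    # Flatten the action set A_t (shape K x d) into the Kd-dimensional
    # context X_t = (A_{t,1}^T, ..., A_{t,K}^T) used by the arm-wise Lasso bandit.
    return A_t.reshape(-1)

def armwise_parameters(theta: np.ndarray, K: int) -> np.ndarray:
    # Row k is beta_k = (1{k=1} theta^T, ..., 1{k=K} theta^T):
    # equal to theta on block k and zero elsewhere.
    d = theta.shape[0]
    betas = np.zeros((K, K * d))
    for k in range(K):
        betas[k, k * d:(k + 1) * d] = theta
    return betas

# Sanity check: <X_t, beta_k> = <A_{t,k}, theta> for every arm k.
rng = np.random.default_rng(0)
K, d = 3, 5
A_t, theta = rng.normal(size=(K, d)), rng.normal(size=d)
assert np.allclose(armwise_parameters(theta, K) @ stack_context(A_t), A_t @ theta)

Under this reduction the effective dimension becomes Kd while each βk keeps sparsity s0, which is why the assumptions and regret guarantees carry a different dependence on K.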
Case 1: ρ2 = 0.0

[Figure 5: cumulative regret vs. round for TH Lasso, SA Lasso, and DR Lasso bandit across 16 panels (K ∈ {2, 50}, d ∈ {100, 1000, 2000, 10000}, s0 ∈ {5, 20}).]

Figure 5. Cumulative regret of the three algorithms with ρ2 = 0.0, Amax = 10, K ∈ {2, 50}, d ∈ {100, 1000, 2000, 10000}, and s0 ∈ {5, 20}. The shaded area represents the standard errors.
Case 2: ρ2 = 0.3

[Figure 6: cumulative regret vs. round for TH Lasso, SA Lasso, and DR Lasso bandit across 16 panels (K ∈ {2, 50}, d ∈ {100, 1000, 2000, 10000}, s0 ∈ {5, 20}).]

Figure 6. Cumulative regret of the three algorithms with ρ2 = 0.3, Amax = 10, K ∈ {2, 50}, d ∈ {100, 1000, 2000, 10000}, and s0 ∈ {5, 20}. The shaded area represents the standard errors.
Case 3: ρ2 = 0.7

[Figure 7: cumulative regret vs. round for TH Lasso, SA Lasso, and DR Lasso bandit across 16 panels (K ∈ {2, 50}, d ∈ {100, 1000, 2000, 10000}, s0 ∈ {5, 20}).]

Figure 7. Cumulative regret of the three algorithms with ρ2 = 0.7, Amax = 10, K ∈ {2, 50}, d ∈ {100, 1000, 2000, 10000}, and s0 ∈ {5, 20}. The shaded area represents the standard errors.

H.3. Additional Results with Various Amax for K-Armed Bandits

We also present experimental results with varying Amax ∈ {2.5, 5, 10, 20, 40, ∞} and a different parameter setting. We set K = 50, d = 1000, and s0 = 20. Figure 8 shows the average cumulative regret at round t = 1000 of TH Lasso bandit and SA Lasso bandit for each Amax. We observe that TH Lasso bandit outperforms SA Lasso bandit for all Amax.

[Figure 8: cumulative regret at round t = 1000 vs. Amax for TH Lasso bandit and SA Lasso bandit.]

Figure 8. Cumulative regret at round t = 1000 of TH Lasso bandit and SA Lasso bandit with ρ2 = 0.7, K = 50, d = 1000, s0 = 20, and varying Amax ∈ {2.5, 5, 10, 20, 40, ∞}. The error bars represent the standard errors.

H.4. Additional Results with Non-Gaussian Distributions

Figure 9 shows the numerical results with the uniform distribution and with non-Gaussian elliptical distributions. For the experiments with the uniform distribution, in each round t, we sample each feature vector At,k independently and uniformly from the d-dimensional hypercube [−1, 1]^d.
For the experiments with the elliptical distribution, we construct each feature vector At,k as

At,k = µ + R · A · U(l),

where µ ∈ R^d is a mean vector, U(l) ∈ R^l is uniformly distributed on the unit sphere in R^l, R ∈ R is a random variable independent of U(l), and A is a d × l matrix of rank l. We sample R from the Gaussian distribution N(0, 1) and sample each element of A i.i.d. from the uniform distribution on [−1, 1]. We set µ = 0_d and l = 200. As in the previous experiments, we generate each non-zero component of θ i.i.d. from the uniform distribution on [1, 2]. The noise process is Gaussian and i.i.d. over rounds: εt ∼ N(0, 1). We find that TH Lasso bandit exhibits lower regret than SA Lasso bandit and DR Lasso bandit under both distributions.

[Figure 9: six panels of cumulative regret vs. round, for K ∈ {2, 50} with s0 = 5, under the uniform distribution (d ∈ {1000, 2000}) and the elliptical distribution with l = 200 (d = 1000).]

Figure 9. Cumulative regret of the three algorithms with non-Gaussian distributions. The figures in the left and center columns show the experimental results with the uniform distribution; the figures in the right column show the experimental results with the elliptical distribution. The shaded area represents the standard errors.

H.5. Additional Results in Hard Instances

We further investigate the performance of TH Lasso bandit in hard instances where the feature vectors of the best arms do not cover the support of θ (situations where the covariate diversity condition (Bastani et al., 2021) does not hold). In this experiment, we set K = 3 and θ = (1, 0.1, 1, 0, 0, . . . , 0)⊤ so that S = {1, 2, 3}. We generate the arm set by generating feature vectors separately on the support S and the non-support S^c = [d] \ S. First, in each round t, the set of support feature vectors A^S_t = {A^S_{t,1}, A^S_{t,2}, A^S_{t,3}} is set to A1 with probability 0.3 and to A2 with probability 0.7, where A1 = {(1, 0, 0)⊤, (0, 1, 0)⊤, (0.9, 0.5, 0)⊤} and A2 = {(0, 1, 0)⊤, (0, 0, 1)⊤, (0.0, 0.5, 0.9)⊤}. Second, for each component i ∈ S^c, we sample ((A^{S^c}_{t,1})_i, (A^{S^c}_{t,2})_i, (A^{S^c}_{t,3})_i)⊤ ∈ R^3 from the Gaussian distribution N(0_3, V), where V_{j,j} = 1 for all j ∈ [3] and V_{j,k} = ρ2 for all j ≠ k ∈ [3]. We then define A^{S^c}_{t,k} = ((A^{S^c}_{t,k})_1, . . . , (A^{S^c}_{t,k})_{|S^c|})⊤ ∈ R^{|S^c|}. Finally, by concatenating the feature vectors on S and S^c, we construct the feature vector At,k = ((A^S_{t,k})⊤, (A^{S^c}_{t,k})⊤)⊤ ∈ R^d. Note that (1, 0, 0)⊤ ∈ A1 and (0, 0, 1)⊤ ∈ A2 are always among the best arms, and they do not span R^3; a minimal sketch of this construction is given after this subsection. Figure 10 shows the numerical results with correlation levels between two arms ρ2 ∈ {0.0, 0.3, 0.7} and dimension d ∈ {1000, 2000}. We find that TH Lasso bandit exhibits lower regret than SA Lasso bandit and DR Lasso bandit in all scenarios.
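Below is a minimal sketch of this hard-instance context generator (our own illustration of the setup above; the function and variable names are ours):

import numpy as np

def hard_instance_contexts(d: int, rho2: float, rng: np.random.Generator) -> np.ndarray:
    # Sample the K = 3 feature vectors of one round: a two-point mixture on the
    # support S = {1, 2, 3} and correlated Gaussian components on the non-support.
    K, s = 3, 3
    A1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.5, 0.0]])
    A2 = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.9]])
    support_block = A1 if rng.random() < 0.3 else A2
    # Non-support: for each coordinate, the three arms' entries have unit variance
    # and pairwise correlation rho2.
    V = np.full((K, K), rho2)
    np.fill_diagonal(V, 1.0)
    nonsupport_block = rng.multivariate_normal(np.zeros(K), V, size=d - s).T  # shape (K, d - s)
    return np.hstack([support_block, nonsupport_block])  # shape (K, d)

rng = np.random.default_rng(0)
theta = np.zeros(1000)
theta[:3] = [1.0, 0.1, 1.0]
A_t = hard_instance_contexts(d=1000, rho2=0.3, rng=rng)
expected_rewards = A_t @ theta  # expected rewards of the three arms in this round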
[Figure 10: six panels of cumulative regret vs. round in the hard instances, for d ∈ {1000, 2000} and ρ2 ∈ {0.0, 0.3, 0.7}.]

Figure 10. Cumulative regret of the three algorithms in hard instances. The shaded area represents the standard errors.

H.6. Additional Results with Real-world Dataset

In this section, we empirically evaluate the TH Lasso bandit algorithm on a real-world dataset. We compare its performance to those of the (uniformly) random policy and LinUCB (Li et al., 2010) with the exploration parameter α = 0.1. We use the R6A dataset³, which contains a part of the user view/click log for articles displayed on Yahoo!'s Today Module. Specifically, we use the data corresponding to May 1st, 2009. We evaluate each algorithm using the replay method (Li et al., 2011) with T = 10,000 valid events. To estimate the expected value and the variance, we subsample the data so that each event is used with probability 0.9 in each instance. For a fair comparison between LinUCB and TH Lasso bandit, we modify the TH Lasso bandit algorithm so that the unknown vector θ varies across arms. To emulate a high-dimensional setting, we generate each context vector At,k by concatenating the original 12-dimensional feature vector with a random dummy vector sampled from the uniform distribution on [0, 1]^88 (a minimal sketch of this augmentation is given after Figure 11). We present the average number of clicks from users over 20 instances in Figure 11. We observe that TH Lasso bandit outperforms the random policy and LinUCB.

³https://webscope.sandbox.yahoo.com

[Figure 11: number of clicks vs. round for TH Lasso bandit, LinUCB, and the random policy.]

Figure 11. Average number of clicks for the three algorithms on a real-world dataset. The error bars represent the standard errors.
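As referenced above, a minimal sketch of the context augmentation used in Section H.6 (our own illustration; the function name is ours):

import numpy as np

def augment_context(raw_features: np.ndarray, rng: np.random.Generator,
                    dummy_dim: int = 88) -> np.ndarray:
    # Concatenate the original 12-dimensional feature vector with dummy coordinates
    # drawn uniformly from [0, 1]^dummy_dim, giving a 100-dimensional context.
    return np.concatenate([raw_features, rng.uniform(0.0, 1.0, size=dummy_dim)])

rng = np.random.default_rng(0)
raw = rng.normal(size=12)  # placeholder for one article's 12-dimensional feature vector
assert augment_context(raw, rng).shape == (100,)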