# fixedbudget_bestarm_identification_in_sparse_linear_bandits__0df578aa.pdf

Published in Transactions on Machine Learning Research (01/2024)

Fixed-Budget Best-Arm Identification in Sparse Linear Bandits

Recep Can Yavas recep.yavas@cnrsatcreate.sg CNRS at CREATE, Singapore

Vincent Y. F. Tan vtan@nus.edu.sg Department of Mathematics, Department of Electrical and Computer Engineering, National University of Singapore

Reviewed on Open Review: https: // openreview. net/ forum? id= Igxp7FC8uf

We study the best-arm identification problem in sparse linear bandits under the fixed-budget setting. In sparse linear bandits, the unknown feature vector θ may be of large dimension d, but only a few, say s d of these features have non-zero values. We design a two-phase algorithm, Lasso and Optimal-Design- (Lasso-OD) based linear best-arm identification. The first phase of Lasso-OD leverages the sparsity of the feature vector by applying the thresholded Lasso introduced by Zhou (2009), which estimates the support of θ correctly with high probability using rewards from the selected arms and a judicious choice of the design matrix. The second phase of Lasso-OD applies the OD-Lin BAI algorithm by Yang and Tan (2022) on that estimated support. We derive a non-asymptotic upper bound on the error probability of Lasso-OD by carefully choosing hyperparameters (such as Lasso s regularization parameter) and balancing the error probabilities of both phases. For fixed sparsity s and budget T, the exponent in the error probability of Lasso-OD depends on s but not on the dimension d, yielding a significant performance improvement for sparse and high-dimensional linear bandits. Furthermore, we show that Lasso-OD is almost minimax optimal in the exponent. Finally, we provide numerical examples to demonstrate the significant performance improvement over the existing algorithms for non-sparse linear bandits such as OD-Lin BAI, Bayes Gap, Peace, Linear Exploration, and GSE.

1 Introduction

The stochastic multi-armed bandit (MAB) is a model that provides a mathematical formulation to study the sequential design of experiments and exploration-exploitation trade-off, where a learner pulls an arm out of a total K and receives a reward drawn from a fixed and unknown distribution according to the chosen arm. This model has several applications including online advertising, recommendation systems, and drug tests. While in the standard reward model, the arms are uncorrelated with each other, stochastic linear bandits introduced in Auer (2002) generalize the standard model by associating each arm with a d-dimensional feature vector and the reward is equal to the inner product between the feature vector and an unknown global parameter. Therefore, the arms are correlated in linear bandits, meaning that pulling an arm gives information about the rewards of some other arms.

Most prior work including Auer (2002); Thompson (1933); Robbins (1952); Bubeck & Cesa-Bianchi (2012); Dani et al. (2008) on MABs focuses on regret minimization, where the goal is to maximize the cumulative reward after T arm pulls by optimizing the trade-off between exploration and exploitation. Recently, the pure exploration setting has drawn attention from researchers. One example of pure exploration is the bestarm identification (BAI) problem, where the goal is to identify the arm with the largest mean reward. The

Published in Transactions on Machine Learning Research (01/2024)

BAI problem is studied in two settings: (1) the fixed-budget setting considers a budget T N and aims to minimize the probability of failing to identify the best arm in at most T arm pulls; (2) the fixed-confidence setting considers a confidence level δ (0, 1) and aims to minimize the average number of arm pulls while identifying the best arm with probability at least 1 δ.

For the standard reward model with uncorrelated arms, the works in Even-Dar et al. (2006); Karnin et al. (2013); Kaufmann et al. (2016) and Carpentier & Locatelli (2016); Audibert & Bubeck (2010) consider the BAI problem in the fixed-confidence and fixed-budget settings, respectively. For the linear model, the works in Soare et al. (2014); Xu et al. (2018); Fiez et al. (2019); Tao et al. (2018); Jedra & Proutiere (2020); Zaki et al. (2022) develop several algorithms under the fixed-confidence setting. For the linear model under the fixed-budget setting, Hoffman et al. (2014) develop the first algorithm, Bayes Gap, which is a gap-based exploration algorithm using a Bayesian approach. Katz-Samuels et al. (2020) develop the Peace algorithm that has equally-sized rounds, where the arm-pulling strategy within each round is based on the Gaussian width of the underlying arm set. Alieva et al. (2021) develop Linear Exploration that exploits the linear structure of the model and is robust to unknown levels of observation noise and misspecification in the linear model. Yang & Tan (2022) develop the Optimal-Design-Based Linear Best Arm Identification (OD-Lin BAI) algorithm, which also employs almost equally-sized rounds, but the arm-pulling strategy within each round is based on the G-optimal design. In the first round, OD-Lin BAI aggressively eliminates all empirically suboptimal arms except the top d

2 arms; in the subsequent rounds, half of the remaining arms are eliminated in each round until a single arm remains. Azizi et al. (2022) develop the Generalized Successive Elimination (GSE) algorithm that has similar principles as OD-Lin BAI with the difference that GSE eliminates the half of the remaining arms in all rounds. Among these algorithms, only OD-Lin BAI is shown to be asymptotically minimax optimal.

In many practical applications of MABs, there are a large number of features available to the learner, but only a few of these features significantly affect the value of the reward of an arm. Sparse linear bandits are a mathematical abstraction of this phenomenon by assuming that the d-dimensional unknown parameter θ in the linear model has only s nonzero values, i.e., θ 0 = s, where s is usually much smaller than d. The performance in the MAB problems (e.g., cumulative regret, probability of identification error) usually deteriorates as the ambient dimension d increases. Therefore, the goal in the sparse setting is to design an algorithm whose performance is a function of s but not d. Some works that study the regret minimization problem for sparse linear bandits include Abbasi-Yadkori et al. (2012); Kim & Paik (2019); Hao et al. (2020); Oh et al. (2021); Ariu et al. (2022); Li et al. (2022); Jang et al. (2022); Wang et al. (2023); Chakraborty et al. (2023). The OFUL algorithm of Abbasi-Yadkori et al. (2012) keeps track of a high probability confidence set for θ and pulls an arm that maximizes the reward with respect to the arm vectors and the confidence set for θ . The DR Lasso algorithm of Kim & Paik (2019) combines Lasso with a doubly-robust technique used in the missing data literature. The ESTC algorithm of Hao et al. (2020) uses Lasso to estimate θ at the end of the first phase and then in the second phase commits to the best arm with respect to the Lasso estimate. The SA Lasso Bandit algorithm of Oh et al. (2021) estimates θ at each time using Lasso and pulls the best arm with respect to the Lasso estimate. TH Lasso Bandit algorithm of Ariu et al. (2022) estimates the support of θ using Lasso and a thresholding procedure at each time and pulls the best arm with respect to the ordinary least squares estimation restricted to the estimated support in the first phase. Li et al. (2022) generalize the ESTC algorithm of Hao et al. (2020) to general bandit problems with low-dimensional structures such as low-rank matrix bandits. The Pop Art algorithm of Jang et al. (2022) takes the population covariance of arms as input and uses a thresholding step to estimate θ in the first phase; in the second phase, it commits to the best arm with respect to the estimate of θ in the first phase. The LRP-Bandit algorithm of Wang et al. (2023) combines the thresholded Lasso with random projection where random projection is used to mitigate the negative influence of model misspecification due to the Lasso phase; their algorithm is also computationally efficient since Lasso is computed only at times with exponentially increasing gaps. Finally, Chakraborty et al. (2023) develop a Thompson Sampling algorithm for sparse linear contextual bandits.

In this paper, we study the BAI problem in sparse linear bandits under the fixed-budget setting. To the best of our knowledge, this paper presents the first result on the BAI problem in linear bandits with sparse structure, and we show that our bound on the error probability is almost minimax optimal in the exponent.

Published in Transactions on Machine Learning Research (01/2024)

Contributions Our main contributions are summarized as follows.

1. We design an algorithm, Lasso and Optimal-Design- (Lasso-OD) based Linear Best Arm Identification. This algorithm has two phases. In the first phase, we pull arms to estimate a support set ˆS that captures the support of the unknown parameter θ with high probability and has size as small as possible. This goal is accomplished by the thresholded Lasso (TL) introduced by Zhou (2009). TL obtains an initial estimation ˆθinit for the parameter θ from Lasso (Tibshirani, 1996) and passes it through an absolute value threshold to obtain ˆθthres. The support of ˆθthres is the output of the first phase. In the second phase, we apply OD-Lin BAI from Yang & Tan (2022). Lasso-OD has 3 hyperparameters: (i) T1 < T, the budget allocated for the first phase; (ii) λinit > 0, the parameter in the initial Lasso problem; and (iii) λthres > 0, the threshold value in TL. The choice of the design matrix (i.e., number of times each arm is pulled) in the first phase is crucial in attaining a good performance. Inspired by Hao et al. (2020), we design it by maximizing the smallest eigenvalue of the Gram matrix associated with the design matrix; this is known as the E-optimal design (Boyd & Vandenberghe, 2004, Sec. 7.5.2). This particular choice minimizes an upper bound on a probability term related to the performance of TL.

2. We derive a non-asymptotic upper bound on the error probability of Lasso-OD as a function of the total budget T, the number of arms K, the ambient dimension d, the sparsity s, and the arm vectors a(k), k = 1, . . . , K, the first few suboptimality gaps, and the hyperparameters T1, λinit, and λthres. As a corollary to this bound, with the knowledge of s, we carefully choose the hyperparameters so that firstly, with high probability, phase 1 selects all variables in θ and at most s2 additional variables and secondly, the probability terms due to phases 1 and 2 are approximately balanced . Under the assumption that the compatibility constant, which is a quantity that governs the performance of the Lasso, is lower bounded by a constant that is independent of s and d, our particular choice achieves the error probability exp Ω T (log2 s)H2,lin(s+s2) for s, d, T , s

d 0, and K and d not growing exponentially with T (see Corollary 1). Here, H2,lin(s + s2) is a hardness parameter that depends only on the first s + s2 1 suboptimality gaps. Note that the exponent is independent of dimension d, implying that increase in d does not significantly increase the error probability. For OD-Lin BAI, this exponent is given by exp Ω T (log2 d)H2,lin(d) ; therefore, Lasso-OD improves the

error probability exponent by a factor of Ω( log2 d

log2 s) for d s + s2.

3. We empirically compare the identification error of Lasso-OD with that of other existing algorithms in the literature on several synthetic datasets, including one that is a sparsity-based version of examples used in other papers (Jedra & Proutiere, 2020). The empirical results support our theoretical result that claims that the scaling of the error probability of Lasso-OD is characterized by the sparsity s while the performances of other algorithms significantly depend on d. We additionally demonstrate that algorithms that do not exploit the sparsity of θ are computationally prohibitive, while the computational complexity of Lasso-OD scales well even as d grows.

2 Problem Formulation

We consider a standard linear bandit with K arms with a d-dimensional unknown global parameter θ . Let the arm set be [K] {1, . . . , K}, where each arm k [K] is associated with a known arm vector a(k) Rd. A set of K arms, {a(1), . . . , a(k)}, together with θ define a linear bandit instance η. At each time t, the agent chooses an arm At [K] and observes a noisy reward

yt = θ , a(At) + ϵt, (1)

where ϵ1, ϵ2, . . . are independent 1-subgaussian noise variables. For the arm selection, the agent uses an online algorithm, that is, the arm pull At [K] may depend only on the previous t 1 arm pulls A1, . . . , At 1 and their corresponding rewards y1, . . . , yt 1. Denote the mean rewards of the arm vectors by

µk θ , a(k) , k [K]. (2)

Published in Transactions on Machine Learning Research (01/2024)

Without loss of generality, we assume that µ1 > µ2 µ3 µK, i.e., arm 1 is the unique best arm. We denote the mean gaps by k µ1 µk for 2 k K.

Under the fixed-budget setting of BAI, the agent is given a fixed time T, and makes an estimate ˆI for the best arm with no more than T arm pulls. The goal is to design an online algorithm with the identification error probability, P[ˆI = 1], as small as possible.

Notation: For any integer n, we denote [n] {1, . . . , n}. Let x = (x1, . . . , xd) be a d-dimensional vector and S [d], we denote x S (xs : s S) R|S|. We denote x A

x Ax. The minimum eigenvalue of a symmetric A is denoted by σmin(A). We denote the set of distributions on the set A as P(A). Let A1, . . . , At [K] be a sequence of arm pulls. The matrix X Rt d whose j-th row is a(Aj) is called the design matrix. Let ν P([K]) be the vector of fractions of arm pulls associated with this strategy, i.e., νk = 1

t Pt j=1 1{Aj = k} for k [K]. The Gram matrix associated with this strategy is denoted by M(ν) = 1 t X X = P

k [K] νka(k)a(k) Rd d. When we use asymptotic notation such as O( ) and Ω( ), somewhat unconventionally, we are referring to nonnegative sequences, e.g., an O(bn) if and only if lim supn an bn < and {an}n 1 is a nonnegative sequence.

Model assumptions: Denote the support of θ by S(θ ) {j [d]: θ j = 0}. We assume that the unknown parameter θ and the arm vectors {a(k)}k [K] are of length d but θ is sparse, i.e., the number of non-zero coefficients in θ satisfies θ 0 |S(θ )| = s < d. We assume that S(θ ) is unknown, but s and θmin minj S(θ ) |θ j | are known. We further assume that |µk| 1 for all arms k [K] and that there exists a positive constant θ0 independent of s and d such that θmin θ0.

3 Our Algorithm: Lasso-OD

We now present our algorithm, Lasso and Optimal-Design- (Lasso-OD) based linear best-arm identification which has two phases. In phase 1, we pull a judiciously chosen set of arms to learn the support of the unknown parameter θ . Specifically, we design phase 1 so that it outputs a subset of variables ˆS [d] whose support ˆS captures the true variables, S(θ ), with high probability, and its cardinality | ˆS| is small. To do this, we use the thresholded Lasso introduced by Zhou (2009). Once ˆS is obtained, we eliminate all variables in the arm vectors except the ones in ˆS. Note that given that ˆS S(θ ), this variable elimination would have no effect on the mean values µ1, . . . , µK since by assumption, we only eliminate some variables j [d] with θ j = 0. Therefore, the best arm is also preserved after variable elimination. Building upon this principle, in phase 2, we project the arms on the estimated support ˆS and pull arms according to the OD-Lin BAI algorithm by Yang & Tan (2022), which is designed for linear bandits without the sparsity assumption.

3.1 Motivation for Lasso-OD Algorithm

OD-Lin BAI used in phase 2 is a minimax optimal algorithm up to a multiplicative factor in the exponent in the sense that it achieves an asymptotic error probability exp Ω T (log2 d)H2,lin(d) , and for every algorithm, there exists a bandit instance η whose asymptotic error probability is lower bounded by exp O T (log2 d)H2,lin(d) . The hardness parameter

H2,lin(d) max 2 i d i 2 i (3)

determines how difficult it is to identify the best arm for a given bandit instance η (Yang & Tan, 2022). For sparse linear bandits, if an oracle knew the support of the unknown parameter θ , then the lower bound in Yang & Tan (2022, Th. 3) would be improved to exp O T (log2 s)H2,lin(s) . The purpose of TL in phase 1 is to provide an estimate for the support of θ with high accuracy while also pulling arms few enough that the resulting error probability is a function of s rather than d as in the oracle lower bound. Below, we provide the details on two phases of Lasso-OD.

Published in Transactions on Machine Learning Research (01/2024)

3.2 Phase 1 (TL)

Consider a linear model y = Xθ +ϵ, where X RT1 d is a fixed design matrix, θ Rd is a fixed unknown feature vector, y RT1 is the response vector, and ϵ RT1 is a noise vector whose entries are independent and 1-subgaussian. Tibshirani (1996) introduces the Lasso optimization problem to identify a sparse solution to the least squares estimation problem

ˆθinit = arg min θ Rd 1 T1 y Xθ 2 2 + λinit θ 1 , (4)

where λinit > 0 is a suitably chosen regularization parameter. The Lasso (4) is a convex program and can be solved efficiently, e.g., using Alternating Direction Method of Multipliers (ADMM) algorithm (Boyd et al., 2011).

For the task of variable selection, i.e., recovering the support of the unknown parameter θ without missing any of its non-zero variables, we want to obtain an estimate ˆθ that satisfies S(ˆθ) S(θ ) while ensuring that |S(ˆθ) \ S(θ )| is as small as possible. Zhou (2009) introduces the following thresholding procedure that has this property

(ˆθthres)j = (ˆθinit)j 1{|(ˆθinit)j| λthres}, j [d], (5)

where the initial estimate ˆθinit is given in (4), and λthres > 0 is the threshold. The set of selected variables by TL is S(ˆθthres). A variation of TL is used by Ariu et al. (2022) to derive refined regret guarantees in sparse stochastic contextual linear bandits. Their main idea is to find the support estimate S(ˆθ(t) thres) at each time instance t using TL and then to compute the ordinary least squares (OLS) estimation restricted on the variables in S(ˆθ(t) thres). Ariu et al. (2022) tune the free parameters λ(t) init and λ(t) thres in a way that with high probability, S(ˆθ(t) thres) S(θ ) and S(ˆθ(t) thres) is small enough, which is s + O( s) in their case. Note that on the event {S(ˆθ(t) thres) S(θ )}, the OLS solution restricted on the subset S(ˆθ(t) thres) is equal to that for the unrestricted case where all d variables are used. Our approach is similar to that in Ariu et al. (2022) in using TL to reduce the effective dimension of the problem.

Let T1 < T be the budget allocated to the variable selection procedure described above.

Design matrix optimization First, we need to specify the number of pulls for each arm during phase 1, which corresponds to determining the design matrix X RT1 d in the Lasso problem (4). To do this, we solve the optimization problem, known as the E-optimal design (Boyd & Vandenberghe, 2004, Sec. 7.5.2), given by

ν = arg max ν P([K]) σmin

i=1 νia(i)a(i) !

Since the function A 7 σmin(A) is concave and ν 7 PK i=1 νia(i)a(i) is linear, (6) is a convex optimization problem, and can be solved efficiently, for example, using the CVX toolbox (Boyd et al., 2011).

The design matrix determined by the allocation in (6) minimizes an upper bound on a probability term related to phase 1; hence, it approximately optimizes the penalty term due to incorrectly estimating the variables of θ . More discussion on this choice of the design matrix appears in Appendix A. The optimization problem (6) also appears in Hao et al. (2020) on their regret analysis in sparse linear bandits. The allocation ν can lead to fractional number of pulls T1 ν i for some arm i [K]. To guarantee integer number of pulls for all arms, we apply a rounding procedure given in Pukelsheim (2006, Ch. 12), the ROUND function in Appendix B, which is also employed in the fixed-confidence BAI algorithm in Fiez et al. (2019).

Support estimation We compute the number of pulls for each arm using (6) and ROUND, and then estimate the support from (4) and (5). Algorithm 1 below delineates the pseudo-code of this procedure.

Published in Transactions on Machine Learning Research (01/2024)

Algorithm 1 Thresholded Lasso (TL) input Time budget T1, Lasso parameters λinit and λthres, and arm vectors a(1), . . . , a(K).

1: Compute the arm pull fractions ν from (6). 2: Update ν ROUND( ν , T1) to ensure integer number of arm pulls. 3: Pull each arm i [K] exactly T1 ν i times. Denote the vector of rewards by y RT1.

4: Form the design matrix X RT1 d so that it has T1 ν i rows equal to a(i) for i [K]. Compute ˆθthres from (4) and (5). output the support ˆS = S(ˆθthres).

3.3 Phase 2 (OD-Lin BAI)

In this section, we review the OD-Lin BAI algorithm by Yang & Tan (2022). OD-Lin BAI divides the budget T into log2 d phases, where each phase has roughly the same length.

At the start of round r, OD-Lin BAI applies a dimensionality reduction step to maintain that the set of modified arms spans the space of its reduced dimension. The arm allocation during each round is determined by the G-optimal design (Kiefer & Wolfowitz, 1960), which takes a set of arm vectors {a(1), . . . , a(K)} Rd

and solves the optimization problem

π = arg min π P([K]) max i [K] a(i) 2 M(π) 1 , (7)

where M(π) PK i=1 πia(i)a(i) is the Gram matrix associated with the allocation π. At the start of each round, we solve (7) for the set of active arms and then apply the ROUND function in Appendix B to the resulting allocation to ensure integer number of pulls. The latter step replaces the procedure in Line 17 Yang & Tan (2022, Algorithm 1). This slight modification may improve the performance of the algorithm especially if the budget T is small. At the end of round 1, we eliminate all arms except the top d

2 with respect to the OLS estimator; in the rest, we halve the remaining arms at the end each round. At the end of the last round, only one arm remains and that arm is declared to be the best one. The pseudo-code of OD-Lin BAI can be found in Yang & Tan (2022) and a slight modification of it which leads to the improved error probability bound in Theorem 2 can be found in Appendix B.

3.4 Lasso-OD Algorithm

The pseudo-code of Lasso-OD described above is given in Algorithm 2. Notice that since the two phases of Lasso-OD operate independently, one can replace either or both of TL and OD-Lin BAI with their alternatives, e.g., the Pop Art algorithm (Jang et al., 2022) and the adaptive Lasso (Bühlmann & van de Geer, 2011, Ch. 2.8) for TL and any of the algorithms in Alieva et al. (2021); Katz-Samuels et al. (2020); Azizi et al. (2022); Hoffman et al. (2014) for OD-Lin BAI. We discuss some of the variants of our algorithm in Appendix G.

Algorithm 2 Lasso and Optimal-Design Based Linear Best Arm Identification (Lasso-OD) input Time budgets T1 and T2 so that T = T1 + T2, Lasso parameters λinit and λthres, and arm vectors a(1), . . . , a(K) Rd.

1: Run TL (Algorithm 1) with T1, λinit, and λthres and get the output ˆS [d]. 2: Project the arm vectors on the subset ˆS by setting a (i) = (a(i)) ˆ S for i [K].

3: Run OD-Lin BAI from Yang & Tan (2022) with budget T2 and arm vectors {a (1), . . . , a (K)} R| ˆ S|

with Line 17 of Algorithm 1 in Yang & Tan (2022) replaced by ROUND. output the only remaining arm ˆI as the output of OD-Lin BAI.

Published in Transactions on Machine Learning Research (01/2024)

4 Main Results

This section presents three non-asymptotic upper bounds on the performances of TL, OD-Lin BAI, and Lasso-OD algorithms.

4.1 Thresholded Lasso

Recall the linear model y = Xθ +ϵ, where X RT1 d is a fixed design matrix, θ Rd is a fixed unknown feature vector, y RT1 is the response vector, and ϵ RT1 is a noise vector whose entries are independent and 1-subgaussian. For any set S [d], define the set of vectors

C(S) {θ Rd : θSc 1 3 θS 1}. (8)

van de Geer & Bühlmann (2009) introduce the following compatibility condition that allows one to control the ℓ1-norm error for the sparse estimation of the unknown parameter θ where the components of the design matrix X are not highly correlated. For the rest of the section, let M = 1 T1 X X denote the Gram matrix associated with X.

Definition 1 (Compatibility condition). Given a fixed design matrix X RT1 d (whose Gram matrix is M) and a subset S [d], the compatibility constant ϕ2(M, S) is defined as

ϕ2(M, S) min θ Rd : θS 1 =0

( |S| θ 2 M θS 2 1 : θ C(S)

With some abuse of notation, we also define

ϕ2(M, s) min S [d]: |S|=s ϕ2(M, S). (10)

The following result controls the ℓ1-norm error of the initial Lasso estimator in (4).

Lemma 1 (Ariu et al. (2022), Lemma G.6). Assume that ϕ2(M, s) > 0. The Lasso estimator ˆθinit in (4) satisfies

P ˆθinit θ 1 4λinits ϕ2(M, s)

32 1 T1 max j [d] X:,j 2 2

Using Lemma 1, we derive the following bound on the event that the size of the support of the TL output (5) is below a threshold and it captures the true support S(θ ).

Theorem 1. Fix a design matrix X RT1 d and parameters λinit, λthres > 0. Let c = λthres

λinit . Assume that

ϕ2 ϕ2(M, s) > 0 and θmin λinit c + 4 ϕ2 s holds. Then,

P |S(ˆθthres)| s 1 + 4 ϕ2c

\ {S(ˆθthres) S(θ )} 1 2d exp

32 1 T1 max j [d] X:,j 2 2

The proofs of Lemma 1 and Theorem 1 are deferred to Appendix C. Theorem 1 follows steps similar to those in Ariu et al. (2022, Lemma 5.4). The interested reader can refer to Bühlmann & van de Geer (2011, Ch. 6 and 7) for more results and discussions on Lasso, TL, and their variants.

Published in Transactions on Machine Learning Research (01/2024)

4.2 An Improved Upper Bound on the Error Probability of OD-Lin BAI

The theorem below gives an improved upper bound on the error probability of OD-Lin BAI (Yang & Tan, 2022).

Theorem 2. Let T = j T log2 d k . For any linear bandit instance, the output of OD-Lin BAI satisfies

P h ˆI = 1 i (K + log2 d) exp

The right-hand side of (13) is slightly different than the one presented in Yang & Tan (2022, Th. 2). First, in Yang & Tan (2022, Th. 2), the numerator in the exponent is equal to some constant m that is approximately equal to T log2 d just like T; this is due to the modification in the distribution rounding technique. Second, the

pre-factor in Yang & Tan (2022, Th. 2) is 4K

d +3 log2 d instead of (our smaller) K +log2 d. More importantly, in (13), the constant 32 in the denominator of the exponent in Yang & Tan (2022, Th. 2) is improved to 16. The last two differences are due to a refinement in the proof technique. Lastly, our result includes a rounding error factor 1 + d2

T , which becomes negligible as T becomes large. This factor appears due to the fact that the G-optimal design may yield fractional number of pulls for some arms, which is obviously not allowed in practice. The proof of Theorem 2 is deferred to Appendix D.

4.3 Upper Bound on the Error Probability of Lasso-OD

The theorem below bounds the probability of incorrectly identifying the best arm using Lasso-OD.

Theorem 3. Let T1 < T be the length of phase 1, and let T2 = T T1 be the length of phase 2. Let λinit and λthres be some positive scalars. Let c = λthres

λinit . Let ν be the solution to (6), and let ν = ROUND( ν , T1) be its rounded version for length T1. Assume that the compatibility constant associated with the E-optimal design is positive, i.e., ϕ2 ϕ2 PK i=1 νia(i)a(i) , s > 0, and θmin λinit c + 4 ϕ2 s . Then, the output of Algorithm 2 satisfies

P h ˆI = 1 i (K + log2 d) exp

j T2 log2(s1) k

16 (1 + ϵ) H2,lin(s1)

+ 2d exp T1λ2 init 32x2max

s1 = s 1 + 4 ϕ2c

, x2 max = max j [d]

k=1 νk(a(k)j)2, and ϵ = s2 1 T2 . (15)

Proof. The proof uses Theorems 1 and 2 for the probability terms due TL and OD-Lin BAI, respectively. Let ˆS [d] denote the output of phase 1. Define the events E {| ˆS| s1} and F { ˆS S(θ )}. By the law of total probability, we have

P h ˆI = 1 i P h ˆI = 1 E F i + P [Ec Fc] . (16)

Given E F, the error probability is bounded by the right-hand side of (13) with the budget T replaced by the length of phase 2, T2, and with the dimension d replaced by s1. This follows since on the event F, the mean rewards are preserved after the arm vectors and θ are projected on ˆS and since the right-hand side of (13) is non-decreasing in d. From Theorem 1 and the arm-pulling strategy described in Line 2 of Algorithm 2, we have

P [Ec Fc] 2d exp T1λ2 init 32x2max

Combining (16) with (13) and (17), we complete the proof.

Published in Transactions on Machine Learning Research (01/2024)

The following corollary is obtained by choosing the free parameters T1, λinit, and λthres suitably to meet the conditions of Theorem 3. These nontrivial choices use the knowledge of θmin and s but not the hardness parameter and achieve an exponent of the error probability that depends only on s, T, and the hardness parameter.

Corollary 1. For any linear bandit instance that satisfies ϕ2 > 0, the output of Algorithm 2 satisfies

P h ˆI = 1 i (K + log2 d + 2d) exp T 16 log2(s + s2) (1 + ϵ)H2,lin(s + s2)(1 + c0)

c0 = 400x2 max 3ϕ4θ2 min log2(s + s2) and ϵ = (1 + c0)(s + s2)2

Here, c0 = T1

T2 is the fraction of lengths of two phases of Lasso-OD, and 1 + ϵ is the small multiplicative penalty due to rounding.

Assume that for a sequence of bandit instances (of growing dimension d), there exist positive constants ϕ2 0 and x2 0, both independent of s and d, such that ϕ2 ϕ2 0 and x2 max x2 0.1 A paradigmatic example of this is the scenario where the arm vectors are generated independently from a zero-mean, unit-variance distribution with a finite fourth-order moment (e.g., the Gaussian distribution N(0, 1), equiprobable distribution on { 1, 1}). Now, consider the setting in which K = T1 = rd where r > 1 is independent of s and d, and we pull each arm once in the Lasso phase. Then, from Bai & Yin (1993, Theorem 2), we know that the minimum eigenvalue of the Gram matrix σmin(M) converges to 1 1 r 2 with probability one, and x2 max converges to

1 also with probability one. For this example, we may take the constants ϕ2 0 and x2 0 to be ϕ2 0 = 1

and x2 0 = 2. Note that, for this example, ϕ2 0 and x2 0 are independent of s and d. Therefore, the ratio c0 is O(1) as s, d, and T grow. Under these conditions, Corollary 1 implies that the error probability of Lasso-OD is upper bounded by

exp Ω T (log2 s)H2,lin(s + s2)

for s, d, T , s

d 0, and K and d not growing exponentially with T, all of which are realistic assumptions in practice. Unlike the non-sparse case in Yang & Tan (2022), the error probability exponent under these assumptions is independent of the dimension d, but instead, depends on the sparsity s, which yields much smaller error probabilities for high dimensional sparse linear bandits. The choices of the parameters that achieve the exponent in (20) is nontrivial; we carefully choose λinit and λthres so that the condition in Theorem 1 is satisfied with equality and c0 is decreasing in s and choose T1 so that two exponents in (14) resulting from phases 1 and 2 are approximately equal. The proof of Corollary 1 is presented in Appendix E.

Assume that the agent knows the support of θ . Then, following the construction in the proof of Yang & Tan (2022, Th. 3), for any algorithm, there exists a bandit instance whose error probability is lower bounded by exp O T (log2 s)H2,lin(s) . This implies that the upper bound in (20) is indeed almost minimax optimal in the exponent. In Appendix G, we develop a variant of Lasso-OD, called Pop Art-OD, which replaces TL in phase 1 of Lasso-OD with Pop Art from Jang et al. (2022). Thanks to the fact that Pop Art provides a guarantee on the ℓ norm of the difference between the estimated parameter θ and θ , we derive an upper bound on the probability P h ˆSPA = S(θ ) i , where ˆSPA denotes the estimated support using the Pop Art algorithm. Using this bound, we show that the error probability of Pop Art-OD is upper bounded by exp Ω T (log2 s)H2,lin(s) , matching the lower bound up to a constant factor in the exponent. Due to its superior empirical performance over Pop Art-OD, we focus on Lasso-OD in the paper.

1In general, ϕ2 depends on the geometry of the set of arm vectors. One can construct instances where ϕ2 vanishes as s and d grow.

Published in Transactions on Machine Learning Research (01/2024)

Error probability

Error probability

Error probability

Error probability

Lasso-OD-CV Lasso-OD-Analytical OD-Lin BAI Bayes-Gap-Adaptive GSE

Figure 1: Comparison of several algorithms with T [200, 10000] and s = 2.

Table 1: Performance comparison of several algorithms for T = 800, d = 10, K = 50, and s = 2.

Lasso-OD-CV Lasso-OD-An. Peace Linear Exploration

Error probability 0.0275 0.045 0.40 0.39 Std. deviation 0.0026 0.0033 0.049 0.0153

5 Experiments

In this section, we numerically evaluate the performance of Lasso-OD on several synthetic sparse linear bandit instances and compare it with those of OD-Lin BAI (Yang & Tan, 2022), Bayes Gap (Hoffman et al., 2014), GSE (Azizi et al., 2022), Peace (Katz-Samuels et al., 2020), and Linear Exploration (Alieva et al., 2021). In each setting, we report the empirical error probabilities for Lasso-OD, Bayes Gap, and GSE over 4000 independent trials and for Peace and Linear Exploration over 100 independent trials.

5.1 Synthetic Dataset with Sparse Unknown Parameter Vector

To illustrate the efficacy and robustness of our algorithm on synthetic data, we consider three different sets of experiments.

In the first example, we draw K arms independently from the uniform distribution on the d-dimensional sphere of radius p

d/s, i.e., x Rd : x 2 2 = d

s , and the sparse unknown parameter is taken as θ = (1, 1, 0, . . . , 0), i.e., s = 2. Figure 1 reports the empirical error probabilities for d {10, 20}, K {50, 100} and T [200, 10000], except Peace (Katz-Samuels et al., 2020) and Linear Exploration (Alieva et al., 2021). Since the computational complexities of Peace and Linear Exploration are much higher than the rest of the algorithms, we compare Lasso-OD with Peace and Linear Exploration only for T = 800 in Table 1. Among these algorithms, Lasso-OD has the best performance for all sparse instances shown in Figure 1 and Table 1.

Lasso-OD-CV sets the budgets for phase 1 and phase 2 as T1 = T

5 and T2 = 4T

5 and tunes the Lasso parameters λinit and λthres using a K-fold cross-validation procedure that uses the value of s in its loss function. See Appendix F for the details of the cross-validation procedure. As an alternative to crossvalidation, Lasso-OD-Analytical uses the knowledge of s, θmin, and the hardness parameter H2,lin(s1) in (14),

Published in Transactions on Machine Learning Research (01/2024)

0 0.005 0.01 0.015 0.02 0.025 0

Error probability

0 0.005 0.01 0.015 0.02 0.025 0

Error probability

Lasso-OD-CV Lasso-OD-Analytical OD-Lin BAI Bayes Gap-Adaptive GSE

Figure 2: Comparison of several algorithms with T {800, 2000}, s = 2, and δ [0, 0.025].

Error probability

ODLin BAI Lasso-OD GSE

Figure 3: Comparison of Lasso-OD, OD-Lin BAI, and GSE for d = 500, s = 4, and T [3 103, 3 105].

and sets λinit, λthres, and T1 so that s1 in (15) equals s + s2, θmin = λinit(c + bs), and two exponents in (14) are equal. Note that H2,lin(s1) is usually not available to the agent.

In the second example, we test the robustness of our algorithm with respect to the variables in θ that are assumed to be zero by keeping the same arms as in the previous example and setting θ as θ j = 1 for j [2], and θ j = δRj for j {3, . . . , d}, where Rj, j = 3, . . . , d, are independent Rademacher (i.e., { 1}-valued) random variables, and δ > 0 is a constant. Figure 2 reports the empirical error probabilities for s = 2, d = 10, K = 50, T {800, 2000}, and δ [0, 0.025]. The phase transition for Lasso-OD in Figure 2 suggests that Lasso-OD achieves a smaller error probability as long as δ is small enough that the approximately sparse instance (i.e., δ > 0) and the sparse instance (i.e., δ = 0) have the same best arm. Some examples including an instance where the hyperparameters are set as in Corollary 1 without cross-validation or knowing the hardness parameter are discussed in Appendix G.

The third experiment is similar to the first, except that the parameters are larger to demonstrate that our method scales well, while others do not. Namely, we set s = 4, d = 500, and K = 1000. Since only Lasso-OD, OD-Lin BAI, and GSE are the only computationally feasible algorithms for such large dimensions, we only include the performances of these algorithms in Figure 3. The empirical error probabilities are obtained from 4000 independent trials for Lasso-OD and from 100 independent trials for OD-Lin BAI and GSE.2 We observe that Lasso-OD outperforms OD-Lin BAI for this instance at all T values shown.

5.2 Real-World Dataset with Sparse Unknown Parameter Vector

We conduct an experiment on an online news popularity dataset published by Mashable (Fernandes et al., 2015), which includes 39,797 news articles, each having 58 attributes. Some of these attributes are the

2We use a smaller number of independent trials for OD-Lin BAI and GSE as it takes too long to run since it does not exploit the sparsity of the problem.

Published in Transactions on Machine Learning Research (01/2024)

Error probability

Error probability

Error probability

Error probability

Lasso-OD Lasso-OD-Analytical

Lasso-XY Lasso-Bayes Gap

Pop Art-OD OD-Lin BAI

Bayes Gap-Adaptive GSE

Figure 4: Comparison of several algorithms for the example bandit instance in Yang & Tan (2022).

103 104 105 0

Error probability

ODLin BAI Lasso OD Bayes Gap GSE

Figure 5: Comparison of several algorithms for the real-world dataset with T [103, 105], s = 3, and d = 55.

number of words, the number of links, the number of keywords, the day that the news is published, and the channel of the news. The target of the dataset is the number of shares in social network. To adapt the dataset to the sparse linear bandit framework, we first normalize each attribute (and the target) so that the mean and the standard deviation of each attribute is 0 and 1, respectively. Since three of the attributes have > 99% correlation with some other attribute, we remove these three attributes; hence the overall dimension d = 55. Then, to obtain a sparse ground truth vector θ R55 (assuming a linear model between the attributes and the target), we estimate θ using the thresholded Lasso, where the hyperparameters λinit and λthres are determined via cross-validation. The resulting θ has 3 nonzero values, hence s = 3 (all 3 active attributes are related to the number of keywords in the news). We select 500 news articles with the largest shares as the arms, i.e., K = 500.

In Figure 5, we see that Lasso-OD, where its two hyperparameters λinit and λthres are determined by crossvalidation, has the smallest error probability for all T values shown. We are not able to conduct experiments using Peace and Linear Exploration due to their prohibitively high computational complexities for such large dimensions. To corroborate this claim, the CPU runtimes of the algorithms compared in the experiments

Published in Transactions on Machine Learning Research (01/2024)

Table 2: The empirical means of the CPU runtimes for d {10, 55, 500} and T = 104.

CPU runtimes (seconds)

d Lasso-OD OD-Lin BAI GSE Bayes Gap-Ad. Linear Exploration Peace

10 0.0019 0.0084 0.011 0.75 2.33 31.69 55 0.0068 0.048 0.092 83.2 >1800 >1800 500 0.017 4.95 5.29 >1800 >1800 >1800

above are shown in Table 2.3 In all experiments, Lasso-OD is the fastest algorithm, and the gap between the CPU runtimes of the Lasso-OD and of the other algorithms increases with the dimension d, as the other algorithms do not exploit the sparsity of the unknown parameter vector θ .

6 Conclusion

In this work, we study the BAI problem in linear bandits with sparse structure under fixed-budget setting and develop the first BAI algorithm, Lasso-OD, that exploits the sparsity of the unknown parameter θ . Lasso-OD combines TL for support estimation with the minimax optimal BAI algorithm, OD-Lin BAI. We analyze the error probability of Lasso-OD and show that the error exponent depends on the sparsity s rather than the dimension d. Unlike other algorithms in the literature, the empirical performance of Lasso-OD does not deteriorate at large dimensions.

One future direction is to derive an instance-dependent asymptotic or non-asymptotic lower bound for the BAI problem in sparse linear bandits; however, such a bound remains open even in the non-sparse scenario. Another possible direction is to extend the TL technique used in Lasso-OD to the fixed-confidence setting. Although such an extension is relatively easy to analyze, the empirical performances of most fixed-confidence BAI algorithms in linear bandits are not heavily dependent on the dimension unlike the fixed-budget setting (see, for example, Zaki et al. (2022); Tao et al. (2018); Fiez et al. (2019)). Therefore, the benefit of adding a TL phase in the fixed-confidence setting could be limited.

7 Acknowledgments and Disclosure of Funding

This research is part of the programme Des Cartes and is supported by the National Research Foundation, Prime Minister s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Neil D. Lawrence and Mark Girolami (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 1 9, La Palma, Canary Islands, 21 23 Apr. 2012. URL https://proceedings.mlr.press/v22/abbasi-yadkori12.html.

Ayya Alieva, Ashok Cutkosky, and Abhimanyu Das. Robust pure exploration in linear bandits with limited budget. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 187 195, 18 24 July 2021. URL https://proceedings.mlr.press/v139/alieva21a.html.

Kaito Ariu, Kenshi Abe, and Alexandre Proutiere. Thresholded Lasso bandit. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th

3Note that for Table 2, we terminate experiments when their runtimes exceed 30 minutes.

Published in Transactions on Machine Learning Research (01/2024)

International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 878 928, 17 23 July 2022. URL https://proceedings.mlr.press/v162/ariu22a.html.

Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23th Conference on Learning Theory - 2010, pp. 13 p., Haifa, Israel, June 2010. URL https://hal-enpc. archives-ouvertes.fr/hal-00654404.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397 422, Nov. 2002.

Mohammad Javad Azizi, Branislav Kveton, and Mohammad Ghavamzadeh. Fixed-budget best-arm identification in structured bandits. In Lud De Raedt (ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 2798 2804. International Joint Conferences on Artificial Intelligence Organization, July 2022. doi: 10.24963/ijcai.2022/388. URL https: //doi.org/10.24963/ijcai.2022/388.

Z. D. Bai and Y. Q. Yin. Limit of the Smallest Eigenvalue of a Large Dimensional Sample Covariance Matrix. The Annals of Probability, 21(3):1275 1294, 1993. doi: 10.1214/aop/1176989118. URL https: //doi.org/10.1214/aop/1176989118.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 3(1):1 122, 2011.

Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, Dec. 2012.

Peter Bühlmann and Sara A. van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. ISBN 978-3-642-20191-2. doi: 10.1007/978-3-642-20192-9. URL http://dx. doi.org/10.1007/978-3-642-20192-9.

Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 590 604, Columbia University, New York, New York, USA, 23 26 Jun 2016. URL https://proceedings.mlr.press/v49/ carpentier16.html.

Sunrit Chakraborty, Saptarshi Roy, and Ambuj Tewari. Thompson sampling for high-dimensional sparse linear contextual bandits. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 3979 4008. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/v202/chakraborty23b.html.

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 21st Annual Conference on Learning Theory, pp. 355 366, 2008.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(39): 1079 1105, 2006. URL http://jmlr.org/papers/v7/evendar06a.html.

Kelwin Fernandes, Pedro Vinagre, Paulo Cortez, and Pedro Sernadela. Online News Popularity. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C5NS3V.

Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'AlchéBuc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32, 10 12 Dec. 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/ 8ba6c657b03fc7c8dd4dff8e45defcd2-Paper.pdf.

Published in Transactions on Machine Learning Research (01/2024)

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.0. http: //cvxr.com/cvx, August 2012.

Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 10753 10763, 7 12 Dec. 2020. URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/7a006957be65e608e863301eb98e1808-Paper.pdf.

Matthew Hoffman, Bobak Shahriari, and Nando Freitas. On correlation and budget constraints in modelbased bandit optimization with application to automatic machine learning. In Samuel Kaski and Jukka Corander (eds.), Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pp. 365 374, Reykjavik, Iceland, 22 25 Apr. 2014. URL https://proceedings.mlr.press/v33/hoffman14.html.

Kyoungseok Jang, Chicheng Zhang, and Kwang-Sung Jun. Popart: Efficient sparse regression and experimental design for optimal sparse linear bandits. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 2102 2114. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 0e5cce15e1bfc6b3d7b71f24cc5da821-Paper-Conference.pdf.

Yassir Jedra and Alexandre Proutiere. Optimal best-arm identification in linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 10007 10017, 7 12 Dec. 2020. URL https://proceedings.neurips.cc/paper_ files/paper/2020/file/7212a6567c8a6c513f33b858d868ff80-Paper.pdf.

Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Sanjoy Dasgupta and David Mc Allester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1238 1246, Atlanta, Georgia, USA, 17 19 June 2013. URL https://proceedings.mlr.press/v28/karnin13.html.

Julian Katz-Samuels, Lalit Jain, Zohar Karnin, and Kevin G Jamieson. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 10371 10382, 7 12 Dec. 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/75800f73fa80f935216b8cfbedf77bfa-Paper.pdf.

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1 42, 2016.

J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363 366, 1960. doi: 10.4153/CJM-1960-030-4.

Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust lasso bandit. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/ 2019/file/d60678e8f2ba9c540798ebbde31177e8-Paper.pdf.

Wenjie Li, Adarsh Barik, and Jean Honorio. A simple unified framework for high dimensional bandit problems. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 12619 12655. PMLR, 17 23 Jul 2022. URL https:// proceedings.mlr.press/v162/li22a.html.

J. Löfberg. Yalmip: A toolbox for modeling and optimization in MATLAB. In In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.

Min-Hwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-agnostic lasso bandit. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139

Published in Transactions on Machine Learning Research (01/2024)

of Proceedings of Machine Learning Research, pp. 8271 8280. PMLR, 18 24 Jul 2021. URL https: //proceedings.mlr.press/v139/oh21a.html.

Friedrich Pukelsheim. Optimal Design of Experiments. Society for Industrial and Applied Mathematics, 2006. doi: 10.1137/1.9780898719109. URL https://epubs.siam.org/doi/abs/10.1137/1.9780898719109.

Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58:527 535, 1952.

Marta Soare, Alessandro Lazaric, and Remi Munos. Best-arm identification in linear bandits. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27, 8 11 Dec. 2014. URL https://proceedings.neurips.cc/paper_files/ paper/2014/file/f387624df552cea2f369918c5e1e12bc-Paper.pdf.

Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4877 4886, 10 15 July 2018. URL https://proceedings.mlr.press/v80/tao18a.html.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933.

Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267 288, 1996. doi: https://doi.org/10.1111/j.2517-6161.1996.tb02080.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1996.tb02080.x.

Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3(none):1360 1392, 2009. doi: 10.1214/09-EJS506. URL https://doi. org/10.1214/09-EJS506.

Xue Wang, Mike Mingcheng Wei, and Tao Yao. Efficient sparse linear bandits under high dimensional data. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 23, pp. 2431 2443, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701030. doi: 10.1145/3580305.3599329. URL https://doi.org/10.1145/3580305.3599329.

Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In Amos Storkey and Fernando Perez-Cruz (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 843 851, 09 11 Apr. 2018. URL https://proceedings.mlr.press/v84/xu18d.html.

Junwen Yang and Vincent Y. F. Tan. Minimax optimal fixed-budget best arm identification in linear bandits. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, Nov. 2022. URL https://openreview.net/forum?id=PLm NPSKJr8e.

Mohammadi Zaki, Avi Mohan, and Aditya Gopalan. Improved pure exploration in linear bandits with noregret learning. In Lud De Raedt (ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 3709 3715. International Joint Conferences on Artificial Intelligence Organization, July 2022. doi: 10.24963/ijcai.2022/515. URL https://doi.org/10.24963/ijcai.2022/ 515.

Shuheng Zhou. Thresholding procedures for high dimensional variable selection and statistical estimation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta (eds.), Advances in Neural Information Processing Systems, volume 22, 7 10 Dec. 2009. URL https://proceedings.neurips.cc/paper_files/ paper/2009/file/92fb0c6d1758261f10d052e6e2c1123c-Paper.pdf.

Published in Transactions on Machine Learning Research (01/2024)

A Design Matrix Optimization in Thresholded Lasso

The performance of TL in Lasso-OD is characterized by the bound on the probability in (12). For any pair of (λinit, λthres), the bound in (12) is optimized by setting the design matrix X to maximize the ratio ϕ2(M,s) p

1 T1 maxj [d] X:,j 2 .

Note that in the application of the Lasso, we can normalize the dataset so that P k [K] a(k)2 j are equal for all j [d]. Under the assumptions that the number of arms, K, is large and that the arm vectors are drawn from the same distribution, the value of 1 T1 maxj [d] X:,j 2 2 cannot vary too much with respect to different fraction of arm pulls. Therefore, we relax the quantity 1 T1 maxj [d] X:,j 2 2 in (12) by its upper bound maxk [K] a(k) 2 . Then, Theorem 1 implies that the best choice of X maximizes the compatibility constant ϕ2(M, s).

Computation of the compatibility constant Let ν PT1([K]) be the T1-type distribution describing the fractions of the number of pulls for each arm. Then, M = 1 T1 X X = P i [K] νia(i)a(i) . Rewriting the compatibility constant ϕ2(M, S) from Definition 1, with some overload of notation, we obtain

ϕ2(ν, S) ϕ2(M, S) = min θ Rd

i [K] νia(i)a(i) : θS 1 = 1, θSc 1 3 (21)

ϕ2(ν, s) min S [d]: |S|=s ϕ2(ν, S). (22)

Given a fixed ν, the program in (21) is non-convex due to the ℓ1-norm equality constraint; however, by introducing binary variables, it can be turned into a mixed-integer discipled convex program (MIDCP) and be efficiently solved using CVX toolbox (Grant & Boyd, 2012). If we relaxed the equality constraint to θS 1 1, then (21) would be a quadratic program (QP).

Relaxing the optimization problem According to the arguments above, the optimization problem that we originally need to solve is

ν = arg max ν PT1([K]) ϕ2(ν, s), (23)

which is computationally intractable since the maximization constraint makes it an integer program; and even if we relaxed it to allow fractional number of pulls, the program would involve d s ds MIDCPs in its constraints.

Lemma 2. For any ν P([K]) and any S [d], it holds that ϕ2(ν, S) σmin PK i=1 νia(i)a(i) .

Lemma 2 follows from |S| θS 2 1 θS 2 2 θ 2 2 and relaxing the inequality constraint in (21).

Replacing ϕ2(ν, S) by its lower bound and allowing fractional number of pulls, we get the relaxed optimization problem in (6), which can be solved efficiently.

With the inclusion of 1 T1 maxj [d] X:,j 2 2, the optimization problem can be re-formulated as

max ν P([K])

σmin PK i=1 νia(i)a(i)

max diag PK i=1 νia(i)a(i) , (24)

which does not satisfy the MIDCP ruleset. In our experiments, we observe that the convergence of the numerical solution to the problem in (24) heavily depends on the initialization; therefore, (24) could not be considered as a reliable approach.

Published in Transactions on Machine Learning Research (01/2024)

B Pseudo-codes of ROUND and OD-Lin BAI

The pseudo-codes of the rounding procedure from Pukelsheim (2006, Ch. 12) that is used in Algorithm 1 and OD-Lasso from Yang & Tan (2022) are given below.

Algorithm 3 ROUND(π, T) input a distribution π on a set with cardinality d and a positive integer T.

1: Initialize Ti = (T d

2)πi for i = 1, . . . , d.

2: while Pd i=1 Ti = T do

3: if Pd i=1 Ti < T then 4: Set j arg mini [d] Ti πi . Update Tj Tj + 1.

5: else if Pd i=1 Ti > T then 6: Set j arg maxi [d] Ti 1

πi . Update Tj Tj 1.

7: end if 8: end while output Distribution π = T1

T , . . . , Td

Algorithm 4 Optimal Design-Based Linear Best Arm Identification (OD-Lin BAI)

input time budget T, arm set A = [K], and arm vectors {a(1), . . . , a(K)} Rd.

1: Initialize t0 0, A0 A, d0 d. For each i A0, set a0(i) = a(i). Set R = log2 d , Tr = T

R for r = 1, . . . , R 1, and TR = T PR 1 i=1 Ti. 2: for r = 1 to R do 3: \\ Dimensionality reduction: 4: Set X so that its columns are {ar 1(i): i Ar 1}. Set dr rank(X). Set ar(i) ar 1(i) for i Ar 1. 5: if dr < dr 1 then 6: Find the singular value decomposition of X = UDV , where U Rdr 1 dr. 7: Update X U X and ar(i) Xi for i Ar 1. 8: end if

9: \\ G-optimal design: 10: Input the set {ar(i): i Ar 1} to the G-optimal design, and set π(r) as the output of (7). 11: Set π(r) = ROUND(π(r), Tr) from Algorithm 2. 12: \\ Arm pulling:

13: Pull each arm i Ar 1 Tr(i) = π(r) i Tr times, which determines Atr 1+1, . . . , Atr 1+Tr. Observe the corresponding rewards ytr 1+1, . . . , ytr 1+Tr. 14: Compute the OLS estimator

i Ar 1 Tr(i)ar(i)ar(i) (25)

ˆθ(r) = V (r) 1 tr 1+Tr X

t=tr 1+1 ar(At)yt. (26)

15: \\ Arm elimination: 16: Estimate the mean rewards for each i Ar 1 as

ˆµr(i) = ˆθ(r), ar(i) . (27)

Set Ar the set of d

2r arms in Ar 1 with the largest estimated mean rewards. Set tr tr 1 + Tr. 17: end for output ˆI = the only remaining arm in AR.

Published in Transactions on Machine Learning Research (01/2024)

C Proofs Related to Lasso

In the following, let n be the number of samples. The linear model is given by y = Xθ + ϵ, where y Rn

are the rewards, X Rn d is the design matrix, and ϵ Rn are i.i.d. 1-subgaussian random variables. Recall the initial Lasso estimator

ˆθ = arg min θ Rd 1 n y Xθ 2 2 + λ θ 1 . (28)

Define the event

T = max j [d] 1 n|X :,jϵ| λ

The following result, known as the oracle inequality, is the main tool to control the performance of the initial lasso estimator.

Lemma 3 (Oracle Inequality: Theorem 6.1 from Bühlmann & van de Geer (2011)). On the event T , the initial Lasso estimator ˆθ (28) satisfies

2 + λ ˆθ θ 1 4λ2s ϕ2(M, S(θ )). (30)

Furthermore, it holds that

P [T ] 1 2d exp

32 1 n maxj [d] X:,j 2 2

Proof of Lemma 3. Since ˆθ minimizes (28), we have

2 + λ ˆθ 1 1

n y Xθ 2 2 + λ θ 1 . (32)

Plugging y = Xθ + ϵ into (32), after some algebra, we get the basic inequality

2 + λ ˆθ 1 2

nϵ X(ˆθ θ ) + λ θ 1 . (33)

Let T be the event

T = max j [d] 2 n|ϵ X:,j| λ0

Then, on T , we have using the Hölder inequality that

2 λ0 ˆθ θ 1 + λ θ 1 λ ˆθ 1 . (35)

Let S = S(θ ). By the triangle inequality, we have ˆθ 1 = ˆθS 1 + ˆθSc 1 θ S 1 ˆθS θ S 1 + ˆθSc 1 . (36)

Applying (36) to (35), we get

2 λ0 ˆθS θ S 1 + ˆθSc θ Sc 1

+ λ θ 1 θ S 1 + ˆθS θ S 1 ˆθSc 1

= (λ0 + λ) ˆθS θ S 1 + (λ0 λ) ˆθSc θ Sc 1 , (38)

Published in Transactions on Machine Learning Research (01/2024)

where the last step uses the fact that θ Sc = 0. We set λ0 = λ

2 . Then, (38) implies that on the event T , ˆθSc θ Sc 1 3 ˆθS θ S 1 . (39)

Therefore, ˆθ θ C(S), and from the definition of compatibility constant in Definition 1, we have

s(ˆθ θ ) X X(ˆθ θ )

nϕ(M, S) . (40)

We now continue with (38) with λ0 = λ

2 . We have

2 + λ ˆθ θ 1 = 2

2 + λ ˆθS θ S 1 + λ ˆθSc θ Sc 1 (41)

(3λ + λ) ˆθS θ S 1 λ ˆθSc θ Sc 1 + λ ˆθSc θ Sc 1 (42)

= 4λ ˆθS θ S 1 (43)

4λ s X(ˆθ θ ) 2 nϕ(M, S) (44)

2 + 4λ2s ϕ2(M, S), (45)

where (44) applies (40), and (45) applies the inequality 4uv u2 + 4v2 to (44). Inequality (45) completes the proof of (30).

Next, we upper bound the probability P T c . We have

n|X :,jϵ| > λ

n X :,jϵ > λ

n X :,jϵ > λ

2 42 1 n2 X:,j 2 2

where the last inequality follows since 1 n X :,jϵ 1

n X :,jϵ are subgaussian with variance proxy 1 n2 X:,j 2 2 as ϵ1, . . . , ϵn are independent 1-subgaussian random variables. Bounding each summand in (48) by the maximum of summands completes the proof of (31).

Proof of Lemma 1. The right-hand side of (30) depends on the unknown set S(θ ); however, we can further upper bound the right-hand side of (30) by replacing ϕ2(M, S(θ )) by its lower bound ϕ2(M, s), which is computable using only X and s. Therefore, Lemma 1 is a corollary to Lemma 3.

Proof of Theorem 1. Define the event G n ˆθinit θ 1 4λinits ϕ2(M,s) o . We have ˆθinit θ 1 (ˆθinit θ )S(θ )c 1 (49)

j S(θ )c |(ˆθinit)j| (50)

j S( ˆθthres)\S(θ ) |(ˆθinit)j| (51)

|S(ˆθthres) \ S(θ )|λthres, (52)

Published in Transactions on Machine Learning Research (01/2024)

where (50) follows since θ S(θ )c = 0 by assumption, and (52) follows from the thresholding step in (5). Therefore, on the event G, it holds that

|S(ˆθthres) \ S(θ )|

ˆθinit θ 1 λthres 4λinits λthres ϕ2(M, s). (53)

For all j S(θ ), on the event G, we have

|(ˆθinit)j| θmin (θ ˆθinit)S(θ ) (54)

θmin (θ ˆθinit)S(θ ) 1 (55)

θmin θ ˆθinit 1 (56)

θmin 4λinits ϕ2(M, s). (57)

Therefore, if

λthres θmin 4λinits ϕ2(M, s), (58)

S(ˆθthres) S(θ ) is satisfied on G. Combining Lemma 1, (53), and (58) completes the proof of Theorem 1.

D Proof of Theorem 2

The proof of Theorem 2 closely follows the proof of Yang & Tan (2022, Th. 2). Therefore, we only explain the differences, which are as follows.

(i) Due to our construction, m in Yang & Tan (2022) is replaced by T = j T log2 d m .

(ii) Let Ar be the active arms in round r and let {ar(i): i Ar} Rdr be the dimensionality-reduced arm vectors. From Fiez et al. (2019, Appendix B), it holds that

max i Ar ar(i) 2 M( π(r)) 1 dr

where π(r) is the rounded version of the G-optimal design output π(r).

(iii) In the proof of Yang & Tan (2022, Lemma 3), the set Br is the set of arms in Ar 1 excluding the best arm and d 2r+1 1 suboptimal arms with the largest mean rewards. We re-define Br as the set of arms in Ar 1 excluding the best arm and d

2r 1 suboptimal arms with the largest mean rewards.

With the modifications in items (i) and (ii) and following the steps in the proof of Yang & Tan (2022, Lemma 2), we get for any arm i Ar 1

P [ˆµr(1) < ˆµr(i)|1 Ar 1] exp

T 2 i 8 d 2r 1 (1 + d2

where ˆµr(i) denotes the estimated mean of arm i in round r.

Published in Transactions on Machine Learning Research (01/2024)

Using item (iii) and (60), we go through the proof of Yang & Tan (2022, Lemma 3) and get

P [1 / Ar|1 Ar 1]

2r +1)(1+ d2

2r + 1) exp

2r +1)(1+ d2

, if r > 1. (61)

Finally, following the steps in the proof of Yang & Tan (2022, Th. 2) with (61), we get

P h ˆI = 1 i

d 2r + log2 d 1

(K + log2 d) exp

which completes the proof.

E Proof of Corollary 1

We set κ, λinit and λthres as

κ = 16 ϕ4θ2 min

25 24, (64)

λinit = 1 p

κ(s + s2) , and (65)

λthres = 4 ϕ2sλinit. (66)

Note that (66) sets s1 = s + s2 in (15), and for any s N, we check that the condition in Theorem 3 holds:

θmin = 4 ϕ2 κ

25 24 4 ϕ2 κ s + 1

s + s2 = λinit

ϕ2 s . (67)

Next, we would like to set c0 = T1

T2 so that the two exponents in (14) are equal. However, since H2,lin(s + s2) is not available to us, we use the lower bound

H2,lin(s + s2) = max i {2,...,s+s2} i 2 i s + s2

2 s+s2 s + s2

where the last inequality follows from the assumption |µk| 1 for all k [K].4 One can further upper bound s+s2 by using the values of K arm vectors and searching for θ that gives the largest s+s2.

We set the ratio c0 = T1

c0 = 25 16x2 max 3ϕ4θ2 min log2(s + s2), (69)

which together with (64) (68) ensures that

j T2 log2(s1) k

16 (1 + ϵ) H2,lin(s1)

exp T1λ2 init 32x2max

Combining (70) with s1 = s + s2 completes the proof.

The ratio c0 decreasing with s as in (69) is consistent since as s approaches d, the sparse linear bandit approaches the standard linear bandit, and we would expect to spend more budget on phase 2 than phase 1 for large s. We deliberately choose the parameters in (64) (66) to maintain this property.

4This step is the only place where the assumption on the mean rewards is used.

Published in Transactions on Machine Learning Research (01/2024)

F Implementation Details

Our code is available at https://github.com/recepcyavas/TMLR_sparse_linear_bandit. In all computations of the Lasso problem (4), we use the ADMM algorithm (Boyd et al., 2011).

F.1 Lasso-OD with K-fold Cross-Validation

In the implementation of Lasso-OD-CV, the ratio of the budgets, T1 T2 , is set to the default value 1

4, i.e., naturally, the algorithm spends more budget for the BAI algorithm than for the support estimation.

For tuning the hyperparameters λinit and λthres, we pull T1 arms according to the allocation given in (6). We use the following cross-validation steps to tune the parameters.

(i) Fix two sets of hyperparameters {λinit,1, λinit,2, . . . , λinit,m} and {λthres,1, λthres,2, . . . , λthres,m} that are candidates for λinit and λthres respectively.

(ii) Iteratively tune the parameters by fixing one of them and searching for the best parameter for the other one.

(iii) In each cross-validation round, the objective is to minimize the loss function

T1 E y X ˆθthres 2

+ c1P h ˆθthres 0 < s i + c2E h 1 n ˆθthres 0 > s o ˆθthres 0

where c1 and c2 are ℓ0-norm regularization parameters. Here, the first term in (71) is the mean-squared error; the second and the third terms penalize the ℓ0-norm error and force the hyperparameters to output an estimate with s variables. In application, we set c1 = 200 and c2 = 5, giving more importance to detecting at least s variables. If the value of s is not available, we can still use this technique by setting c1 = c2 = 0. As standard, we approximate (71) by training the parameters in K 1 blocks and testing in the remaining block.

(iv) To speed up convergence, in each round of cross-validation, we exponentially narrow down the candidate sets.

(v) To reduce the variance in cross-validation, we employ Monte-Carlo simulations, i.e., we independently partition the data into K blocks for multiple times and then take the average loss.

F.2 Lasso-OD-Analytical

In the implementation of Lasso-OD-Analytical, we set the hyperparameters λinit and λthres as in (64) (66), and c0 is set so that (69) holds with equality. This setup effectively uses the hardness parameter H2,lin(s+s2). To compute these hyperparameters, we first need to compute ϕ2(M, s). As discussed in Appendix A, this requires solving a MIDCP. We do this by using the YALMIP toolbox (Löfberg, 2004) because it does not ask to convert the problem into another one. Alternatively, one can use the CVX toolbox (Grant & Boyd, 2012) with a little bit more effort, or use Lemma 2 and solve an easier problem at the expense of some performance loss.

F.3 OD-Lin BAI and Other BAI Algorithms

We implement OD-Lin BAI and other BAI algorithms shown in section 5 using the methods described in Yang & Tan (2022, Appendix E).

Bayes Gap-Adaptive: In general, Bayes Gap algorithm (Hoffman et al., 2014) requires the knowledge of the hardness parameter. As in Hoffman et al. (2014); Yang & Tan (2022), at the beginning of each time instant, we input the estimated hardness parameter according to the three-sigma rule. In the experiments, we omit the oracle version of Bayes Gap that directly uses the knowledge of the hardness parameter.

Published in Transactions on Machine Learning Research (01/2024)

G Additional Experiments

G.1 Variants of Our Algorithm

We present two variants of Lasso-OD that modify the operations in phase 2 and one variant that replaces the thresholded Lasso in phase 1.

Lasso-XY-Allocation: This algorithm is identical to Lasso-OD except that the G-optimal design used to determine the allocations within each round is replaced by the XY-allocation from Soare et al. (2014). Let X = {a(i): i A} be the set of arms in an active set A. Let Y = {x x : x, x X, x = x } be the set of arm differences. The XY-allocation solves the problem

π = arg min π P(A) max y Y y M(π) 1 . (72)

Lasso-XY-Allocation replaces Line 11 of Algorithm 4 with (72). In the experiments, we compute (72) using the Frank Wolfe algorithm as in Fiez et al. (2019).

From Yang & Tan (2022, Proof of Lemma 2), the probability that a sub-optimal arm i has a smaller estimated mean than the optimal arm 1 is bounded as

P [ˆµ(1) < ˆµ(i)] exp

2 i 2 a(1) a(i) M(π) 1

2 i 2 maxi =j a(j) a(i) M(π) 1

2 i 8 maxi A a(i) M(π) 1

where π is the allocation within the round. Here, (75) follows from the triangle inequality. The G-optimal design optimizes the allocation π in (75), and XY-allocation optimizes (74). Since XY-allocation optimizes a tighter bound, Lasso-XY-allocation is expected to perform better than Lasso-OD.

Lasso-Bayes Gap: Since Bayes Gap performs better than OD-Lin BAI in the examples in section 5, we propose the variant Lasso-Bayes Gap where in phase 2, OD-Lin BAI is replaced by Bayes Gap-Adaptive from Hoffman et al. (2014).

In our implementations of Lasso-XY-Allocation and Lasso-Bayes Gap (as described above), the parameters of Lasso are tuned via cross-validation.

Pop Art-OD: Recently, Jang et al. (2022) develop the Pop Art algorithm, which estimates the unknown parameter θ similarly to Lasso and TL. Due to its superior ℓ1 error to Lasso, Pop Art also guarantees that the support of θ is estimated efficiently. Similar to TL, the Pop Art estimate is obtained by thresholding an initial estimate. Our variant, Pop Art-OD, replaces the TL in phase 1 of Lasso-OD with Pop Art, and retains phase 2 as is. We give its pseudo-code in Algorithm 5 and analyze its performance in the section below. Line 1 of Pop Art involves an optimization problem that yields the optimal covariance matrix with respect to an upper bound on the error probability; this process reflects the design matrix optimization of TL described in Appendix A. In Pop Art, if the population covariance M in Line 3 was replaced with the empirical covariance and if the Catoni estimator in Line 4 was replaced with averaging, the resulting algorithm would be the thresholded OLS estimator.

Pop Art has one hyperparameter (the estimate on the error probability δ in Jang et al. (2022)). In the experiments to follow, we tune the hyperparameter using a K-fold cross validation procedure.

G.2 Analysis of Pop Art-OD

The following theorem bounds the error probability of Pop Art-OD.

Published in Transactions on Machine Learning Research (01/2024)

Algorithm 5 Pop Art-OD input Time budget T, arm vectors a(1), . . . , a(K) Rd, θmin, and sparsity s.

1: Solve the convex optimization problem ν = arg min ν P[K] max i [d]

{PK i=1 νia(i)a(i) } 1

ii. Let the objective

value of the minimum be H2 and M = PK i=1 ν i a(i)a(i) . Set

λPA = min p

, c PA = 2H2 λ2 PAs log2 s, T1 = T c PA 1 + c PA

, T2 = T T1, g = λ2 PA 8H2 .

2: Sample T1 arms, A1, . . . , AT1 i.i.d. with ν , and observe rewards y1, . . . , y T1.

3: For t = 1, . . . , T1, let θt = M 1a(At)yt Rd.

4: Set for i [d], θ i = Catoni ( θ1,i, . . . , θT1,i), q g (M 1)ii(1+ 2g 1 2g)

, where Catoni((Z1, . . . , Zn), α) is the

unique solution y to the equation

i=1 ψ(α(Zi y)) = 0, ψ(x) = sign(x)(1 + |x| + x2/2). (77)

5: ˆθPA = θ i1 |θ i| q

8M 1 ii g : i [d] and ˆSPA = S(ˆθPA).

6: Run OD-Lin BAI (Algorithm 4) restricted to the support ˆSPA using T2 pulls. output ˆI is the only remaining arm as the output of Algorithm 4.

Theorem 4. For any linear bandit instance, the error probability of Pop Art-OD given in Algorithm 5 is bounded as

P h ˆI = 1 i (K + log2 d + 2d) exp T 16 log2(s) (1 + ϵPA)H2,lin(s)(1 + c PA)

where c PA is defined in (76), below, and ϵPA = (1+c PA)s2

Proof. By Line 5 of Pop Art, if i [d] satisfies that θ i λPA, then i ˆSPA. From Jang et al. (2022, Prop. 1

and Th. 1), with probability at least 1 2d exp n T1λ2 PA 8H2

o , the initial Pop Art estimator θ satisfies

θ θ λPA. (79)

and the Pop Art estimator satisfies S(ˆθPA) S(θ ). By selecting λPA θmin

2 , we further ensure that S(θ ) S(ˆθPA), giving S(ˆθPA) = S(θ ). Therefore, by the union bound and Theorem 2, the error probability of Pop Art-OD is bounded as

P h ˆI = 1 i P h S(ˆθPA) = S(θ ) i + P h ˆI = 1 S(ˆθPA) = S(θ ) i (80)

2d exp T1λ2 PA 8H2

+ (K + log2 d) exp T2 16 log2 s H2,lin(s)(1 + ϵPA)

The rest of the proof follows steps similar to the proof of Theorem 3, which aims to balance the two exponents in (81). The choices of c PA, T1, and T2 together with the lower bound H2,lin(s) s

4 (see, (68)) imply that

T1λ2 PA 8H2 T2 16 log2 s H2,lin(s)(1 + ϵPA), (82)

which completes the proof. Note that we select λPA p

2H2 to ensure that 1 2g 1

2 > 0, making the Catoni parameter in Line 4 of Algorithm 5 valid.

Published in Transactions on Machine Learning Research (01/2024)

Error probability

Error probability

Error probability

Error probability

Error probability

Error probability

Lasso-OD Lasso-OD-Analytical

Lasso-XY Lasso-Bayes Gap

Pop Art-OD OD-Lin BAI

Bayes Gap-Adaptive GSE

Figure 6: Comparison of several algorithms with s {2, 3, 4}.

Theorem 4 shows that Pop Art-OD has an error probability that scales as

exp Ω T (log2 s)H2,lin(s)

as T and s grow. This is the same theoretical result as that for Lasso-OD. However, we observe from the next section (specifically Appendix G.3.1) that Lasso-OD outperforms Pop Art-OD empirically.

G.3 Experiments

In the experiments below, we include Lasso-XY-allocation and Lasso-Bayes Gap to the list of algorithms in Section 5.

G.3.1 First Example

In the first example, we test the performance of the various BAI algorithms for sparsities of at least 2. We generate K d-dimensional arm vectors a(k) = (a(k)i : i [d]), k [K], where a(k)i s are distributed N(0, 1

s) independent across arms k [K] and coordinates i [d]. The s-sparse unknown vector θ is set as θ i = 1 s for i [s] and θ i = 0 for i = s + 1, . . . , d. Figure 6 compares the performances of several algorithms in the literature and variants of our algorithm for s {2, 3, 4}, K = 50, d {10, 20}, and T [200, 104]. For all bandit instances under consideration, the performances of Lasso-OD and Lasso-XY-allocation are almost identical. However, due to its low computational complexity (see Tables 3 and 4 for the CPU runtimes), Lasso-OD is preferred over Lasso-XY-allocation. Among different variants of Lasso-OD, Lasso OD and Lasso-XY-allocation have the best performance for the instances with s = 2. For s = 3 and s = 4,

Published in Transactions on Machine Learning Research (01/2024)

Error probability

Error probability

Error probability

Error probability

Lasso-OD Lasso-OD-An.-LB

Lasso-XY Lasso-Bayes Gap

Pop Art-OD OD-Lin BAI

Bayes Gap-Adaptive GSE

Figure 7: Comparison of several algorithms with θ belonging to a finite set.

among the algorithms shown, Lasso-Bayes Gap performs the best for a large enough budget T. The poor performance of Lasso-based algorithms for small budgets is because at larger s, the minimum budget that should be allocated to phase 1 to reliably estimate the support of θ increases with s. For the instances with s {3, 4}, Bayes Gap-Adaptive performs remarkably well, but it is outperformed by Lasso-based algorithms for s = 2. Pop Art-OD algorithm is outperformed by Lasso-OD for all instances shown, which implies that the variable selection property of Pop Art is poorer than that of the thresholded Lasso.

In Tables 3 and 4,5 we report the average CPU runtimes for the instances in the first example with s = 2 and s = 3. Lasso-OD is superior to all other algorithms in terms of the computational complexity.6

G.3.2 Second Example

In the second example, we assume that θ belongs to the finite set H {θ Rd : θ 0 = s, θi n 1 s, 0, 1 s o , i [d]}. In other words, the non-zero coordinates of θ are assumed to have mag-

nitudes all equal to 1 s. We generate K d-dimensional arm vectors in the vicinity of H as follows: a(k)i = Rk,i cos(π/4 + Zk,i), where Rk,i s are independently and identically distributed (i.i.d.) generated with distribution Unif({ 1, 1}), Zk,i s are i.i.d. generated with distribution N(0, 0.01), and Rk,i s and Zk,i s are independent. For this bandit instance, given the arm vectors and using the assumption that θ H, we can lower bound the hardness parameter H2,lin(s + s2) by computing the minimum hardness parameter for the vectors θ H. In Figure 7, Lasso-OD-An.-LB computes the hyperparameters analytically and obtains T1 from the lower bound on H2,lin(s + s2) above instead of the true value of H2,lin(s + s2). Figure 7 shows that Lasso-OD-An.-LB outperforms all other algorithms in the literature and achieves similar performance as Lasso-OD and Lasso-XY-allocation for a large enough time budget.

5Pre-calculation in Tables 3 and 4 refers to the calculation of ϕ2(M, s) that is used to determine the hyperparameters for Lasso-OD-Analytical. 6All experiments are implemented on MATLAB 2023a on an Intel(R) Core(TM) i9-12900H processor.

Published in Transactions on Machine Learning Research (01/2024)

Table 3: The empirical means of the CPU runtimes for s = 2, d = 10, K = 50.

CPU runtimes (milliseconds)

T Pre-calc. Lasso Tuning Lasso-OD Lasso-XY Lasso-Bayes G. Lasso-OD-An. Bayes Gap-Ad. OD-Lin BAI GSE

100 9680 3300 0.98 9.0 1.6 1.7 3.2 2.2 4.0 200 9680 3500 0.61 5.2 2.9 1.2 5.1 2.2 4.0 400 9680 3800 0.55 3.6 5.5 0.83 9 2.1 4.0 800 9680 4380 0.48 3.4 11 0.76 17 2.1 4.0 1600 9680 6570 0.66 4.1 23 0.68 34 2.1 4.0 3200 9680 7520 0.66 4.0 46 0.70 69 2.1 4.0 6400 9680 7710 0.70 4.1 89 1.1 139 2.1 4.0

G.3.3 Third Example

In the third example, we extend the example in Yang & Tan (2022); Jedra & Proutiere (2020); Fiez et al. (2019) to sparse linear bandits. We set θ = ( 1

2, 0, . . . , 0), i.e., s = 2, and S = S(θ ) = {1, 2}. For the coordinates in S, we pull arms as in Yang & Tan (2022); we set a(1)S = (cos(π/4), sin(π/4)), a(K)S = (cos(5π/4), sin(5π/4)), and a(i)S = (cos(π/2 + ϕi), sin(π/2 + ϕi)) for i = 2, . . . , K 1, where ϕi are independently drawn from N(0, 0.09). For any i [K], we draw a(i)Sc independently from the uniform

distribution on the (d s)-dimensional centered sphere of radius q

s . Recall that since θ Sc = 0, the values of arms on the coordinates Sc have no effect on the best arm or the value of the hardness parameter. The problem would be identical to that in Yang & Tan (2022) if the agent knew the support S. In this bandit instance, arm 1 is the best arm and there are K 2 arms whose mean values are close to that of the second best arm. In the non-sparse case, i.e., d = s = 2, Yang & Tan (2022) demonstrate that OD-Lin BAI outperforms the other algorithms. Figure 8 compares the performance of variants of our algorithm with the other algorithms in the literature. We report the empirical performances for d {10, 20}, K {50, 100}, and T [200, 104]. Among the algorithms shown, Lasso-OD and its variants Lasso-XY-allocation and Pop Art OD significantly outperform the other algorithms. Unlike the previous two examples, for this example, Lasso-Bayes Gap is not the best performing algorithm.

Error probability

Error probability

Error probability

Error probability

Lasso-OD Lasso-OD-Analytical

Lasso-XY Lasso-Bayes Gap

Pop Art-OD OD-Lin BAI

Bayes Gap-Adaptive GSE

Figure 8: Comparison of several algorithms for the example bandit instance in Yang & Tan (2022).

Published in Transactions on Machine Learning Research (01/2024)

Table 4: The empirical means of the CPU runtimes for s = 3, d = 10, K = 50.

CPU runtimes (milliseconds)

T Pre-calc. Lasso Tuning Lasso-OD Lasso-XY Lasso-Bayes G. Lasso-OD-An. Bayes Gap-Ad. OD-Lin BAI GSE

100 27100 4200 1.1 12 1.6 2.2 2.5 2.4 4.2 200 27100 4500 0.96 8.9 2.9 2.1 5.2 2.4 4.1 400 27100 4900 0.94 8.1 5.7 2.3 10 2.4 4.1 800 27100 5000 0.98 7.6 12 2.3 17 2.4 4.2 1600 27100 5100 1.5 9.2 24 2.1 34 2.4 4.2 3200 27100 7000 1.5 9.2 47 2 68 2.4 4.2 6400 27100 9800 1.6 9.4 93 2 137 2.4 4.1

G.3.4 Fourth Example

In the final example, we test the performance of thresholded Lasso in which the whole horizon of length T is used for learning the support of θ . We draw each entry of the design matrix X RT d i.i.d. from N(0, 1

s) and set θ = ( 1 s, . . . , 1 s, 0, . . . , 0) where θ has s non-zero entries. In Figure 9, we report the

empirical probability of detection error P h S(ˆθthres) S(θ ) i and the empirical mean E[|S(ˆθthres)|] over 10,000 independent trials. For s = 2, the empirical error probability is 0 for T 400; for s = 4, the empirical error probability is 0 for T 800. Figure 9 shows that for T 100 and s {2, 4}, thresholded Lasso is capable of correctly detecting the active variables in θ with high probability while also keeping the average number of false positives close to zero. As expected, the average number of false positives increases with s.

Published in Transactions on Machine Learning Research (01/2024)

101 102 103 104 0

101 102 103 104 0

101 102 103 104 0

101 102 103 104 0

Average support size Probability of detection error

Figure 9: The empirical detection error probability and the empirical size of the thresholded Lasso output.