# sparsityagnostic_linear_bandits_with_adaptive_adversaries__41bfc4d6.pdf

Sparsity-Agnostic Linear Bandits with Adaptive Adversaries

Tianyuan Jin Department of Electrical and Computer Engineering National University of Singapore Singapore tianyuan@nus.edu.sg

Kyoungseok Jang Dipartimento di Informatica Università degli Studi di Milano Milano, Italy ksajks@gmail.com

Nicolò Cesa-Bianchi Università degli Studi di Milano Politecnico di Milano Milano, Italy nicolo.cesa-bianchi@unimi.it

We study stochastic linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors), from which it chooses an element and obtains a stochastic reward. The expected reward is a fixed but unknown linear function of the chosen action. We study sparse regret bounds, that depend on the number S of non-zero coefficients in the linear reward function. Previous works focused on the case where S is known, or the action sets satisfy additional assumptions. In this work, we obtain the first sparse regret bounds that hold when S is unknown and the action sets are adversarially generated. Our techniques combine online to confidence set conversions with a novel randomized model selection approach over a hierarchy of nested confidence sets. When S is known, our analysis recovers state-of-the-art bounds for adversarial action sets. We also show that a variant of our approach, using Exp3 to dynamically select the confidence sets, can be used to improve the empirical performance of stochastic linear bandits while enjoying a regret bound with optimal dependence on the time horizon.

1 Introduction

K-armed bandits are a basic model of sequential decision-making in which a learner sequentially chooses which arm to pull in a set of K arms. After each pull, the learner only observes the reward returned by the chosen arm. After T pulls, the learner must obtain a total reward as close as possible to the reward obtained by always pulling the overall best arm. Linear bandits extend K-armed bandits to a setting in which arms belong to a d-dimensional feature space. In each round t of a linear bandit problem, the learner receives an action set At Rd from the environment, chooses an arm At At based on the past observations, and then receives a reward Xt. In this work, we consider the stochastic setting in which rewards are defined by Xt = θ , At + εt, where θ Rd is a fixed latent parameter and εt is zero-mean independent noise. In linear bandits, the learner s goal is to minimize the difference between the total reward obtained by pulling in each round t the arm a At maximizing θ , a and the total reward obtained by the learner.

In stochastic linear bandits, the regret after T rounds is known to be of order d

T up to logarithmic factors. The linear dependence on the number d of features implies that the learner is better off by ignoring features corresponding to negligible components of the latent target vector θ . Hence,

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

one would like to design algorithms that depend on the number S d of relevant features without requiring any preliminary knowledge on θ . This is captured by the setting of sparse linear bandits, where θ is assumed to have only 0 < S d nonzero components.

In the sparse setting, Lattimore and Szepesvári [17, Section 24.3] show that a regret of Ω

is unavoidable for any algorithm, even with knowledge of S. When S is known, this lower bound is matched (up to log factors) by an algorithm of Abbasi-Yadkori et al. [2] who, under the same assumptions and for the same algorithm, also prove an instance-dependent regret bound of e O Sd

. Here is the minimum gap, over all T rounds, between the expected reward of the optimal arm and that of any suboptimal arm. In this work we focus on the sparsity-agnostic setting, i.e., when S is unknown. Fewer results are known for this case, and all of them rely on additional assumptions on the action set, or assumptions on the sparsity structure. For example, if the action set is stochastically generated, Oh et al. [22] prove a e O S

T sparsity-agnostic regret bound. More recently, Dai et al. [9] showed a sparsity-agnostic bound e O S2

d T when the action set is fixed and equal to the unit sphere. In a model selection setting, Cutkosky et al. [8] prove a e O S2

T sparsity-agnostic regret bound for adversarial action sets, but under an additional nestedness assumption: (θ )i = 0 for i = 1, . . . , S. Surprisingly, no bounds improving on the e O d

T regret of the OFUL algorithm [1] in the sparsity-agnostic case are known that avoid additional assumptions on the sparsity structure or on the action set generation.

Main contributions. Here is the summary of our main contributions. All the proofs of our results can be found in the appendix.

We introduce a randomized sparsity-agnostic linear bandit algorithm, Sparse Lin UCB, achieving regret e O S

d T with no assumptions on the sparsity structure (e.g., nestedness) or on the action set (which may be controlled by an adaptive adversary). When S = o(

d), our bound is strictly better than the OFUL bound e O d

T . Our analysis of Sparse Lin UCB simultaneously guarantees an instance-dependent regret bound e O max{d2, S2d}/ , where is the smallest suboptimality gap over the T rounds.

If the sparsity level is known, our algorithm recovers the optimal bound eΘ(

Sd T). We also introduce Ada Lin UCB, a variant of Sparse Lin UCB that uses Exp3 to learn the probability distribution over a hierarchy of confidence sets in stochastic linear bandits. Unlike previous works, which only showed a e O T 2/3 regret bound for similar approaches, Ada Lin UCB has a e O

T regret bound. In experiments on synthetic data, Ada Lin UCB performs better than OFUL.

Technical challenges. Recall that the arm chosen in each round by OFUL is

At = argmax a At a, bθt + γt a V 1 t 1 (1.1)

where bθt is the regularized least-squares estimate of θ , Vt 1 = I + P

s<t As A s is the regularized covariance matrix of past actions, and γt is the radius of the confidence set n θ Rd : θ bθt 1 2 Vt 1 γt o . (1.2)

The squared radius γt = O(d ln t) is such that θ belongs to (1.2) with high probability simultaneously for all t 1. Our approach, instead, builds on the online to confidence set conversion technique of Abbasi-Yadkori et al. [2], where they show how to design a different confidence set for θ based on the predictions of an arbitrary algorithm for online linear regression, such that the squared radius of the confidence set is roughly equal to the regret bound of the algorithm. Using the algorithm Seq SEW for sparse online linear regression [13], whose regret bound is O(S log T), they obtain the optimal regret e O

Sd T for sparse linear bandits. Unfortunately, this result requires knowing S to properly set the radius of the confidence set. Our strategy Sparse Lin UCB (Algorithm 1) bypasses this problem by running the online to confidence set conversion technique over a hierarchy of nested confidence sets with radii αi = 2i log T for i = 1, . . . , n = Θ(log d). The framework of Abbasi-Yadkori et al. [2] guarantees that, for any sparsity value S [d], there is a critical radius αo = O(S log T) such that, with high probability, θ lies in the set with radius αi for all i o. Sparse Lin UCB randomizes the choice of the index i of the confidence radius αi, used for selecting the action at time t. If

Table 1: Comparison with other sparse linear bandit works. S [d] is the sparsity level and is the suboptimality gap (3.3). The nested assumption refers to (θ )i = 0 for i = 1, . . . , S. The minimum signal and the compatibility condition refer to assumptions on the distribution of the action set and on the smallest value of the non-zero elements in θ . Smoothed adversary refers to adversarially selected action sets with added Gaussian noise. The regret bounds listed in [1, 2, 23, 8, 25, 18, 9] are high-probability bounds: with high probability, the regret is of the same order as the bound in the table.

Reference Sparsity Agnostic Adaptive Adversary Expected Regret Assumptions

Abbasi-Yadkori et al. [1] e O(min d

Abbasi-Yadkori et al. [2] e O(min{

Sd T, d S/ ) -

Pacchiano et al. [23, 24] e O(S2

T) Nested, i.i.d. actions

Cutkosky et al. [8] e O(S2

Pacchiano et al. [25] e O(S2

e O(S2d2/ ) Nested, i.i.d. actions

Lattimore et al. [18] e O(S

T) Action set is hypercube εt [ 1, 1]

Sivakumar et al. [26] e O(S

T) Smoothed adversary

Hao et al. [15] e O(

ST) Actions set spans Rd

Minimum signal

Oh et al. [22] e O(S

T) Compatibility

Dai et al. [9] e O(S2

d T) Action set is unit sphere

Lower bound [17] Ω

This paper e O min S

max{d2, S2d} -

This paper e O

the random index It is such that It o, then we can bound the regret incurred at step t using standard techniques [1, 2]. By choosing P(It = i) proportional to 2 i, we make sure that larger confidence sets (delivering suboptimal regret bounds) are chosen with exponentially small probability. If It < o, then θ is not guaranteed to lie in the confidence set of radius αIt with high probability. Our main technical contribution is to show that the regret summed over these bad rounds is bounded by e O p

Sd T/Q , where Q = P(It o). The proof of this bound requires showing that the regret in a bad round t (when It < o) can be bounded by αo Ao t V 1 t 1. Proving that Ao t V 1 t 1 shrinks fast enough uses the fact that det Vt grows fast enough due to the exploration in the good rounds t (when It o). This is done by a carefully designed peeling technique that partitions [T] in blocks based on the value of det Vt.

To extend our analysis of Sparse Lin UCB and obtain instance-dependent regret bound, we apply the techniques of Abbasi-Yadkori et al. [1] to show that the regret over the good rounds is bounded by e O d2/ . The regret over a bad round t is controlled by (αo/ ) Ao t 2 V 1 t 1 and using techniques

similar to the instance-independent analysis we bound the regret summed over all bad rounds with e O Sd/(Q ) .

Given that Sparse Lin UCB uses a fixed probability of order 2 i to choose its confidence radius αi, it is tempting to explore adaptive probability assignments, that increase the probability of a confidence set proportionally to the rewards obtained by the actions that were selected based on that set. Algorithm Ada Lin UCB (see Algorithm 2) is a variant of Sparse Lin UCB using Exp3 [4] to assign probabilities to confidence sets. The analysis of Ada Lin UCB combines in a non-trivial way the analysis of Exp3 (including a forced exploration term q) with that of Sparse Lin UCB. Although the resulting regret bound does not improve on OFUL, our algorithm provides a new principled solution to the problem of tuning the radius in (1.1). Experiments show that Sparse Lin UCB can perform better than OFUL.

1.1 Additional related work

Sparse linear bandits. With the goal of obtaining sparsity-agnostic regret bounds, different types of assumptions on the action set have been considered in the past. Starting from the e O(S

T) regret upper bound of [18], where the action set is assumed fixed and equal to the hypercube, some works considered stochastic action sets and proved regret bounds depending on spectral parameters of the action distribution, such as the minimum eigenvalue of the covariance matrix [9, 15, 16, 20]. Others assumed a stochastic action set with strong properties, such as compatibility conditions or margin conditions [3, 6, 7, 19, 22]. As far as we know, there has been no research on adaptive adversarial action sets after [2].

Model selection. Sparse linear bandits can be naturally viewed as a bandit model selection problem. For example, Ghosh et al. [14] establish a regret bound of e O(

ST + d2/α4.65) for a fixed action set, where α is the minimum absolute value of the nonzero components of θ . Quite a bit of work has been devoted to sparse regret bounds in the nested setting. With i.i.d. and fixed-size actions sets, Foster et al. [12] achieve a regret bound of order e O S1/3T 2/3/γ3 in the nested setting, where γ is the smallest eigenvalue of the covariance matrix of At. Under the same assumption on the action set, Pacchiano et al. [24, 23] obtain a regret bound of e O(S2

T). For adversarial action sets, Cutkosky et al. [8] obtain e O(S2

T) in the nested setting. When actions are sampled i.i.d., Cutkosky et al. [8] and Pacchiano et al. [25] obtain a regret bound of e O(S2

T) for nested settings. They also obtain simultaneous instance-dependent bounds, in particular, Pacchiano et al. [25] achieve e O (Sd)2/ . Compared to the instance-dependent regret bound, our results are more general, as we allow the action set to be adaptively chosen by an adversary and do not require the nested assumption.

Parameter tuning. Although the theoretical anlysis of OFUL only holds for γt = O(d log t), smaller choices of the radius in (1.2) are known to perform better in practice. Our design of Sparse Lin UCB and Ada Lin UCB borrows ideas from the parameter tuning setting, which is typically addressed using a set of base algorithms and a randomized master algorithm that adaptively changes the probability of selecting each base algorithm [21, 23, 24]. In particular, Ada Lin UCB builds on [11], where they show that running Exp3 as the master algorithm over instances of OFUL with different radii has a better empirical performance than Thompson Sampling and UCB. Yet, they only show a regret bound of e O T 2/3 when the action set is drawn i.i.d. in each round (they also prove a bound of order

T, but only under additional assumptions on the best model). This is consistent with the results of Pacchiano et al. [24], who also obtained a regret of the same order using Exp3 as master algorithm.

2 Problem definition

In linear bandits, a learner and an adversary interact over T rounds. In each round t = 1, . . . , T:

1. The adversary chooses an arm set At Rd; 2. The learner choose an arm At At; 3. The learner obtains a reward Xt.

We assume the adversary is adaptive, i.e., At can depend in an arbitrary way on the (possibly randomized) past choices of the learner. The reward in each round t satisfies

Xt = At, θ + εt . (2.1)

Here θ Rd is a fixed and unknown target vector and {εt}t [T ] are independent conditionally 1-subgaussian random variables. We also assume θ 2 1 and a 2 1 for all a At and all t [T].1 The regret of a strategy over T rounds is defined as the difference between the reward obtained by the optimal policy, always choosing the best arm in At, and the reward obtained by the strategy choosing arm At At for t [T],

t=1 max a At a, θ

t=1 At, θ .

1The choice of the constant 1 is arbitrary. Choosing different constants would scale our bounds similarly to the scaling of the bounds in [2].

In the sparse setting, we would like to devise strategies whose regret depends on

i=1 1{θi = 0}

corresponding to the number of nonzero components of θ .

2.1 Online to confidence set conversions

Establishing a confidence set including the target vector with high probability is at the core of linear bandit algorithms, and our approach for designing a sparsity-agnostic algorithm is based on a result by Abbasi-Yadkori et al. [2]. They show how to construct a confidence set for OFUL [1] based on the predictions of a generic algorithm for online linear regression, a sequential decision-making setting defined as follows. For t = 1, . . . , T:

1. The adversary privately chooses input At Rd and outcome Xt R;

2. The learner observes At and chooses prediction b Xt R;

3. The adversary reveals Xt and the learner suffers loss ( b Xt Xt)2.

The learner s goal in online linear regression is to minimize the following notion of regret against any comparator θ Rd

t=1 (Xt b Xt)2

t=1 (Xt At, θ )2.

The confidence set proposed by Abbasi-Yadkori et al. [2] is established by the following result.

Lemma 2.1 (Abbasi-Yadkori et al. [2, Corollary 2]). Let δ (0, 1/4] and θ 2 1. Assume a sequence (At, Xt)}t [T ], where Xt satisfies (2.1) for all t [T], is fed to an online linear regression algorithm B generating predictions b Xt}t [T ]. Then P t [T] : θ / Ct δ, where

θ Rd : θ 2 2 +

s=1 ( b Xs As, θ )2 γ(δ, θ )

γ(δ, θ ) = 2 + 2BT (θ ) + 32 log

1 + BT (θ )

and BT (θ ) is an upper bound on the regret ρT (θ ) of B. When understood from the context, we will abuse the notation and denote the best radius in hindsight by γ(δ) := γ(δ, θ ) and the regret bound by BT .

Gerchinovitz [13] designed an algorithm, Seq SEW, for sparse linear regression that bounds ρT (θ) in terms of θ 0 simultaneously for all comparators θ Rd. Below here, we state his bound in the formulation of Lattimore and Szepesvári [17].

Lemma 2.2 (Lattimore and Szepesvári [17, Theorem 23.6]). Assume maxt [T ] At 2 1 and maxt [T ] |Xt| 1. There exists a universal constant c such that algorithm Seq SEW achieves, for any θ Rd,

ρT (θ) BT (θ) := c θ 0

log(e + T 1/2) + CT log 1 + θ 1

where CT = 2 + log2 log(e + T 1/2).

Using the confidence set (2.2) with B set to Seq SEW, Abbasi-Yadkori et al. [2] achieved the minimax optimal regret bound of e O(

Sd T). However, to construct Ct, the learner must know γ(δ), which depends on the unknown sparsity level S = θ 0 through BT .

Algorithm 1 Sparse Lin UCB

1: Input: T N, n N and q n := {(q1, . . . , qn) [0, 1]n : Pn i=1 qi = 1}

2: Initialization: Let V0 = I, bθ0 = (0, . . . , 0) 3: for t = 1, 2, . . . , T do 4: Receive action set At and draw It from distribution q

5: Choose action At = argmax a At

a, bθt 1 + a V 1 t 1 p

6: Receive reward Xt 7: Vt = Vt 1 + At A t 8: Feed (At, Xt) to Seq SEW and obtain prediction b Xt

9: Compute regularized least squares estimate bθt = argmin θ Rd

b Xs θ, As 2

10: end for

3 A multi-level sparse linear bandit algorithm

In this section, we introduce our main algorithm, Sparse Lin UCB, whose pseudo-code is shown in Algorithm 1. The algorithm, which runs Seq SEW as base algorithm B, uses a hierarchy of confidence sets of increasing radius. In each round t = 1, . . . , T, after receiving the action set At, the algorithm draws the index It of the confidence set for time t by sampling from the distribution q n := {(q1, . . . , qn) [0, 1]n : Pn i=1 qi = 1}. Then the algorithm plays the action At using the confidence set CIt t := {θ Rd : θ bθt 1 2 Vt 1 2It log T} (where a larger It implies a larger radius, and thus more exploration). Following the online to confidence set approach, upon receiving the reward Xt, the algorithm feeds the pair (At, Xt) to Seq SEW and uses the prediction b Xt to update the regularized least squares estimate bθt.

Let αi = 2i log T for all i N and set n N as

max θ Rd: θ 2 1 γ(1/T, θ)

where γ(δ, θ) is defined in Lemma 2.1 for B = Seq SEW.

One can check that n = Θ(log d) (when θ 0 = d), which gives αn = Θ(d log T). Our bounds depend on the following quantity, which defines the index of the smallest safe confidence set (i.e., the smallest i [n] such that θ Ci t for all t [T]),

o := argmin i [n]

The choice of our confidence set (Line 4 in Algorithm 1) is justified by the following result, which implies that o is safe. Lemma 3.1. For Ct defined in (2.2), we have that Ct Co t for all t [T].

As we use Seq SEW as base algorithm B, αo = O(S log T). Our main result is an upper bound on the regret of Sparse Lin UCB. Theorem 3.2. The expected regret of Sparse Lin UCB run with the number of models n in (3.1) and a distribution q = {qs}s [n] satisfies

d2s Tqs + (log T) p

where Q = P

If the sparsity level S is indeed known, then o in (3.2) can be computed and we get the following bound, which is tight up to log factors [17].

Corollary 3.3. Assume that the sparsity level S is known and choose the number of models n > o and the distribution {qs}s [n] with qo = 1, where o is set as in (3.2). Then, the expected regret of Sparse Lin UCB is E[RT ] = O

Sd T log T .

An instance-dependent bound. Sparse Lin UCB also enjoys an instance-dependent regret bound comparable to that of OFUL. Let be the minimum gap between the optimal arm and any suboptimal arms over all rounds, = min t [T ] min a At\A t A t a, θ , (3.3)

where A t = maxa At a, θ is the optimal arm for round t. Theorem 3.4. The expected regret of Sparse Lin UCB run with the number of models n in (3.1), a distribution q = {qs}s [n] and using Seq SEW as base algorithm satisfies

E[RT ] = O (d S/Q) + d2

where Q = P s o qs.

Sparsity-agnostic tuning of randomization. Next, we look at a specific choice of q. Fix C 1 and let

qs = C22 s if C22 s < 1 κ otherwise, (3.4)

where κ > 0 is chosen so to normalize the probabilities. It is easy to verify that for any C 1, X

s [n] 1 C22 s < 1 qs 1

implying that κ can be chosen in [0, 1]. Combining Theorem 3.2 and 3.4, we obtain the following corollary providing a hybrid distribution-free and distribution-dependent bound. Corollary 3.5. Pick any C 1. Let the number of models n as in (3.1) and q = {qs}s [n] be chosen as in (3.4). Then the expected regret of Sparse Lin UCB is

E[RT ] = e O

d T, max d2, S2d/C2

For C = 1 the above bound is e O(S

d T), which is tight up to the factor

S due to the lower bound of Ω(

Sd T) [17]. However, as mentioned in Lattimore and Szepesvári [17, Section 23.5], no algorithm can enjoy the regret of e O(

Sd T) simultaneously for all possible sparsity levels S. While our worst-case regret bound improves with a smaller S, the problem-dependent regret bound scales at least as (d2/ ) log T, which is independent of S. This raises an interesting question: could the problem-dependent bound also benefit from sparsity? Even with a very small probability p of choosing radius αn, the expected number of steps using αn would be p T. The results in [2] demonstrate that running the OFUL algorithm with αn over p T steps results in a regret of e O(d2/ ). One simple way is to decrease the frequency of selecting radius αn. However, selecting αn less than d2/ 2 times may prevent the algorithm from obtaining a good enough estimate of θ in certain settings. Remark 3.6. At first glance, it may seem straightforward to select C in Corollary 3.5, as setting C =

d yields a regret of e O(d2/ ) without apparent trade-offs. However, the trade-off lies in balancing the instance-dependent and worst-case regret bounds. Opting for C = d indeed yields an instance-dependent bound of e O(d2/ ). However, this comes at the expense of the worst-case bound, which remains e O(d

T), negating any advantages derived from the sparsity assumption S d.

If the sparsity level S is indeed known, then o in (3.2) can be computed and we get the following bound. Corollary 3.7. Assume that the sparsity level S is known and choose {qs}s [n] with qo = 1, where o is set as in (3.2). Then, the expected regret of Sparse Lin UCB is E[RT ] = e O Sd

We note that by setting qo = 1 in Theorem 3.4, the regret bound becomes e O(d2/ ). This result, as detailed in Theorem 3.4, arises from the parameter qn > 0. However, in this case, qn = 0, which allows us to achieve a more favorable regret bound.

Algorithm 2 Ada Lin UCB

1: Input: T N, η > 0, q (0, 1] 2: Initialization: Let Si,0 = 0 for all i [n], V0 = I, bθ0 = (0, . . . , 0) 3: for t = 1, 2, , T do 4: Receive action set At and draw a Bernoulli random variable Zt with P(Zt = 1) = q 5: if Zt = 1 then

6: Choose optimistic action At = argmax a At

a, bθt 1 + a V 1 t 1 p

7: Receive reward Xt; 8: else

9: Draw It from the distribution Pt,i = exp (ηSt 1,i) Pn j=1 exp ηSt 1,j for i [n];

10: Choose action At = argmax a At

a, bθt 1 + a V 1 t 1 p

11: Receive reward Xt;

12: Compute St,j = St 1,j 1{It = j}(2 Xt)/4

Pt,j for j [n];

13: end if 14: Vt = Vt 1 + At A t ; 15: Feed (At, Xt) to Seq SEW and obtain prediction b Xt;

16: Compute regularized least squares estimate bθt = argmin θ Rd

b Xs θ, As 2 ;

17: end for

4 Adaptive model selection for stochastic linear bandits

Sparse Lin UCB is also designed to handle adaptive adversarial action sets. A crucial parameter of Sparse Lin UCB is {qs}s [n], the distribution from which the radius of the confidence set is drawn. It is a natural question whether there exists an algorithm that adaptively updates this distribution based on the observed rewards. In this section we introduce Ada Lin UCB (Algorithm 2), which runs Exp3 to dynamically adjust the distribution used by Sparse Lin UCB.

Ada Lin UCB takes as input a forced exploration term q and the learning rate η for Exp3. Similarly to Sparse Lin UCB, Ada Lin UCB designs confidence sets of various radii, but its selection method differs in two aspects. First, with probability q, the algorithm performs exploration based on the confidence set with the largest radius. With probability 1 q, the algorithm instead draws the action based on Exp3. The distribution Pt used by Exp3 at round t is based on exponential weights applied to the total estimated loss, denoted by St (for technical reasons, we translate losses into rewards). The algorithm then draws Ii from Pt and selects the action At based on the confidence set with radius 2It log T. Finally, reward Xt is observed and the pair (At, Xt) is fed to Seq SEW. The prediction b Xt returned by Seq SEW is used to update the regularized least squares estimate bθt.

The following theorem states the theoretical regret upper bound of Ada Lin UCB. Theorem 4.1. If the random independent noise εt in (2.1) satisfies εt [ 1, 1] for all t [T], then the regret of Ada Lin UCB run with η = p

(log n)/(Tn) for n in (3.1) and q (0, 1] satisfies

d T log 1 + TL2

= e O max np

Although Ada Lin UCB dynamically adjusts the distribution used by Sparse Lin UCB and may achieve better empirical performance, its regret bound is no better than that of Sparse Lin UCB. The issue is that the action chosen by Ada Lin UCB in Line 10 does not ensure enough exploration to control the regret. Consequently, the algorithm needs to choose the optimistic action in Line 6 with constant probability q. Sparse Lin UCB has a similar parameter, Q, that bounds from the above the probability

of choosing the optimistic action. The key difference is that Q can be optimized for an unknown S by carefully selecting the distribution q = {qs}s 1, whereas the parameter q does not provide a similar flexibility.

5 Model selection experiments

In this section we describe some experiments we performed on synthetic data to verify whether Ada Lin UCB could outperform OFUL in a model selection task. We also test the empirical performance of Sparse Lin UCB on the same data (additional details on all the algorithms and the experimental setting are in Appendix E). The data for our model selection experiments are generated using targets θ with different sparsity levels, as we know that sparsity affects the radius of the optimal confidence set. On the other hand, since no efficient implementation of Seq SEW is known [17, Section 23.5], we cannot implement the online to confidence set approach as described in [2] to capture sparsity. Instead, we run Sparse Lin UCB and Ada Lin UCB with b Xt = Xt for all t [T], which due to the form of our confidence sets amounts to running the algorithms over multiple instances of OFUL with different choices of radius αi for i [n].

We run Sparse Lin UCB and Ada Lin UCB with αi = αi,t = 2i log t (a mildly time-dependent choice) for i = 0, 1, , log2 d. We also include α0 = 0 corresponding to the greedy strategy At = argmaxa At a, bθt 1 . The suffix _Unif indicates {qs}s [n] set to ( 1

n). The suffix _Theory indicates qs = Θ(2 s) for s = 0, . . . , n as prescribed by (3.4). Finally, we included Sparse Lin UCB_Known using qi = 1{i = o} to test the performance when the optimal index o (for the given S) is known in advance (see Corollary 3.3). We run our experiments with a fixed set of random actions, At = A for all t [T], where |A| = 30 and A is a set of vectors drawn i.i.d. from the unit sphere in R16. The target vector θ is a S-sparse (S = 1, 2, 4, 8, 16) vector whose non-zero coordinates are drawn from the unit sphere in RS. The noise εt is drawn i.i.d. from the uniform distribution over [ 1, 1]. Each curve is an average over 20 repetitions with T = 104 where, in each repetition, we draw fresh instances of A and θ .

As our implementations are not sparsity-aware, we cannot expect the regret to strongly depend on the sparsity level. Indeed, only the regret of Sparse Lin UCB_Known (which is tuned to the sparsity S) is significantly affected by sparsity. The theory-driven choice of {qs}s [n] (Sparse Lin UCB_Theory) performs better than the uniform assignment (Sparse Lin UCB_Unif), and is in the same ballpark as OFUL. On the other hand, Ada Lin UCB_Unif and Ada Lin UCB_Theory outperform all the competitors, including OFUL. This provides evidence that using Exp3 for adaptive model selection may significantly boost the empirical performance of stochastic linear bandits.

6 Limitations and open problems

Unlike previous works, we prove sparsity-agnostic regret bounds with no assumptions on the action sets or on θ (other than boundedness of θ and a for a At, which are rather standard assumptions). For Ada Lin UCB, however, we do require boundedness of the noise εt (instead of just subgaussianity). We conjecture this requirement could be dropped at the expense of a further log T factor in the regret. Finally, for efficiency reasons our implementations are not designed to capture sparsity. Hence our experiments are limited to testing the impact of model selection.

Our work leaves some open problems:

1. Proving a lower bound on the regret of sparse linear bandits when the sparsity level is unknown to the learner would be important. Citing again [17, Section 23.5], no algorithm can enjoy regret e O(

Sd T) simultaneously for all sparsity levels S. However, we do not know whether the known lower bound Ω(

Sd T) can be strengthened to Ω(S

d T) in the agnostic case.

2. Our instance-dependent regret bound is of order e O max{S2, d} d

. It would be interesting to prove an upper bound that improves on the factor d2/ , or a lower bound showing that d2/ cannot be improved on. 3. Our bound on the regret of Ada Lin UCB looks pessimistic due to the presence of the constant exploration probability q. It would be interesting to prove a bound that more closely reflects the good empirical performance of this algorithm.

Figure 1: Experimental results with different sparsity levels S {1, 2, 4, 8, 16}. In each plot, the X-axis are time steps in [1, 104] and the Y -axis is cumulative regret. AL stands for Ada Lin UCB and SL stands for Sparse Lin UCB.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their helpful comments. This research is supported by the MUR PRIN grant 2022EKNE5K (Learning in Markets and Society), the FAIR (Future Artificial Intelligence Research) project, funded by the Next Generation EU program within the PNRR-PE-AI scheme, the EU Horizon CL4-2022-HUMAN-02 research, innovation action under grant agreement 101120237, project ELIAS (European Lighthouse of AI for Sustainability), a Singapore Ministry of Education Ac RF Tier 2 grant (A-8000423-00-00), and the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-Ph D/2021-01-004[T]).

In particular, NCB and KJ acknowledge the financial support from the MUR PRIN grant, the FAIR project, and the ELIAS project. TJ is funded by a Singapore Ministry of Education Ac RF Tier 2 grant (A-8000423-00-00) and the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-Ph D/202101-004[T]).

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.

[2] Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1 9. PMLR, 2012.

[3] Kaito Ariu, Kenshi Abe, and Alexandre Proutière. Thresholded lasso bandit. In International Conference on Machine Learning, pages 878 928. PMLR, 2022.

[4] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002.

[5] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012.

[6] Sunrit Chakraborty, Saptarshi Roy, and Ambuj Tewari. Thompson sampling for highdimensional sparse linear contextual bandits. In International Conference on Machine Learning, pages 3979 4008. PMLR, 2023.

[7] Yi Chen, Yining Wang, Ethan X Fang, Zhaoran Wang, and Runze Li. Nearly dimensionindependent sparse linear bandit over small action spaces via best subset selection. Journal of the American Statistical Association, 119(545):246 258, 2024.

[8] Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish Purohit. Dynamic balancing for model selection in bandits and rl. In International Conference on Machine Learning, pages 2276 2285. PMLR, 2021.

[9] Yan Dai, Ruosong Wang, and Simon Shaolei Du. Variance-aware sparse linear bandits. In The Eleventh International Conference on Learning Representations, 2023.

[10] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.

[11] Qin Ding, Yue Kang, Yi-Wei Liu, Thomas Chun Man Lee, Cho-Jui Hsieh, and James Sharpnack. Syndicated bandits: A framework for auto tuning hyper-parameters in contextual bandit algorithms. Advances in Neural Information Processing Systems, 35:1170 1181, 2022.

[12] Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. Advances in Neural Information Processing Systems, 32, 2019.

[13] Sébastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Proceedings of the 24th Annual Conference on Learning Theory, pages 377 396. JMLR Workshop and Conference Proceedings, 2011.

[14] Avishek Ghosh, Abishek Sankararaman, and Ramchandran Kannan. Problem-complexity adaptive model selection for stochastic linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 1396 1404. PMLR, 2021.

[15] Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753 10763, 2020.

[16] Kyoungseok Jang, Chicheng Zhang, and Kwang-Sung Jun. Popart: Efficient sparse regression and experimental design for optimal sparse linear bandits. Advances in Neural Information Processing Systems, 35:2102 2114, 2022.

[17] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.

[18] Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. Advances in Neural Information Processing Systems, 28, 2015.

[19] Ke Li, Yun Yang, and Naveen N Narisetty. Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit. Electronic Journal of Statistics, 15(2):5652 5695, 2021.

[20] Wenjie Li, Adarsh Barik, and Jean Honorio. A simple unified framework for high dimensional bandit problems. In International Conference on Machine Learning, pages 12619 12655. PMLR, 2022.

[21] Teodor Vanislavov Marinov and Julian Zimmert. The pareto frontier of model selection for general contextual bandits. Advances in Neural Information Processing Systems, 34:17956 17967, 2021.

[22] Min-Hwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, pages 8271 8280. PMLR, 2021.

[23] Aldo Pacchiano, Christoph Dann, Claudio Gentile, and Peter Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. ar Xiv preprint ar Xiv:2012.13045, 2020.

[24] Aldo Pacchiano, My Phan, Yasin Abbasi Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33:10328 10337, 2020.

[25] Aldo Pacchiano, Christoph Dann, and Claudio Gentile. Best of both worlds model selection. Advances in Neural Information Processing Systems, 35:1883 1895, 2022.

[26] Vidyashankar Sivakumar, Steven Wu, and Arindam Banerjee. Structured linear contextual bandits: A sharp and geometric smoothed analysis. In International Conference on Machine Learning, pages 9026 9035. PMLR, 2020.

Table 2: Notation.

Symbol Description

At Action set at time t θ True parameter of the linear model d Dimension of θ S Number of nonzero components in θ Xt Random reward of pulling arm At At b Xt Prediction of Seq SEW at time t Cs t The confidence set with radius 2s log T and center bθt 1 It The index of the confidence set drawn at time t qs The probability P(It = s) of drawing index s of the confidence set in Sparse Lin UCB o The index of the smallest safe confidence set, defined in (3.2) Q The probability P(It o) = qo + + qn of drawing a safe confidence set in Sparse Lin UCB As t The optimistic action for Cs t A t The optimal action in At E Event that θ Ct for all t [T] det V Determinant of V

In Table 2 we list the most used notations. Next, we recall some definitions that are used throughout this appendix. We have Vt = I + Pt s=1 As A s and

bθt = argmin θ Rd

b Xs θ, As 2

where As As is the action chosen by the learner at round t. Recall that αi = 2i log T. For i [n],

Ai t = argmax a At max θ Ci t θ, a = argmax a At

a, bθt 1 + a V 1 t 1 αi

Ci t = θ Rd : θ bθt 1 2 Vt 1 αi

Finally, recall definition (2.2) with δ = 1/T,

θ Rd : θ 2 2 +

s=1 ( b Xs As, θ )2 γ(1/T)

and recall that E = T

t [T ]{θ Ct}.

B Analysis of Sparse Lin UCB

Theorem 3.2. The expected regret of Sparse Lin UCB run with the number of models n in (3.1) and a distribution q = {qs}s [n] satisfies

d2s Tqs + (log T) p

where Q = P s o qs.

Proof. Lemma 3.1 implies Ct Co t . Hence, if E is true and s < o, then

θ , As t bθt 1, As t αo As t V 1 t 1 (θ Ct Co t )

= bθt 1, As t + αs As t V 1 t 1

( αo + αs) As t V 1 t 1

bθt 1, Ao t + αs Ao t V 1 t 1

( αo + αs) As t V 1 t 1 (maximality of As t in Cs t )

= bθt 1, Ao t + αo Ao t V 1 t 1 ( αo αs) Ao t V 1 t 1 ( αo + αs) As t V 1 t 1 θ , Ao t ( αo αs) Ao t V 1 t 1 ( αo + αs) As t V 1 t 1 (θ Ct Co t )

θ , Ao t ( αo αs) Ao t V 1 t 1 ( αo + αs) Ao t V 1 t 1 (by Lemma D.3)

= θ , Ao t 2 αo Ao t V 1 t 1 θ , A t 3 αo Ao t V 1 t 1 (B.1)

where in the first and the third inequalities, we use the fact that for any A Rd,

θ bθt 1, A θ bθt 1 Vt 1 A V 1 t 1, (Cauchy Schwarz inequality)

αo A V 1 t 1 (due to θ Co t )

and for the last inequality,

θ , A t bθt 1, Ao t + α0 Ao t V 1 t 1 (θ Co t , maximality of Ao t.)

We can decompose the regret as follows,

t=1 1{E} θ , A t AIt t +

t=1 1{Ec} θ , A t AIt t

t=1 1{Ec} θ , A t AIt t +

t=1 1{It o, E} θ , A t AIt t +

t=1 1{It < o, E} θ , A t AIt t

t=1 1{Ec} θ , A t AIt t +

t=1 1{It o, E} θ , A t AIt t

t=1 1{It < o, E} min n 2, 3 αo Ao t V 1 t 1

t=1 1{Ec} θ , A t AIt t +

t=1 1{It o, E} θ , A t AIt t

n 2, 3 αo Ao t V 1 t 1

Since P(Ec) δ 1

T , the first sum in the above line is easily bounded,

t=1 1{Ec} θ , A t AIt t 2TP(Ec) 2 .

Bounding term . For each s 0, let

T s = t [T] : det(Vt 1) h 2sd, 2(s+1)d .

Note that det(Vt) is monotone increasing w.r.t t. Define s = log2 det(VT )/d . Then,

t T s min n 2, 3 αo Ao t V 1 t 1

By applying Lemma D.5, we obtain

t T s min n 2, 3 αo Ao t V 1 t 1

t T s min n 1, Ao t V 1 t 1

v u u t T E

t T s min 1, Ao t 2 V 1 t 1

(Cauchy Schwarz inequality)

s=1 (2d/Q + 1/Q) (due to Lemma D.5)

(2d + 1)s /Q

(2d + 1)/Q log2 det(VT )/d

= O log T p

Bounding term . Let Ts = PT t=1 1{It = s}.

t=1 1{It o, E} θ , A t AIt t

t=1 1{It o, E} min 2, αIt AIt t V 1 t 1

t [T ] 1{It = s} min 1, As t 2 V 1 t 1

(Cauchy Schwarz inequality)

t [T ] min 1, At 2 V 1 t 1

2 log det VT i (Lemma D.1)

αs Ts i O( p

d log T) (upper bound on det(VT ))

E [Ts] O( p

d log T) (Jensen s inequality)

αsd Tqs log T (because P(It = s) = qs.)

Substituting the bounds of and to (B.2), we have

dαs T log Tqs + log T p

concluding the proof.

Theorem 3.4. The expected regret of Sparse Lin UCB run with the number of models n in (3.1), a distribution q = {qs}s [n] and using Seq SEW as base algorithm satisfies

E[RT ] = O (d S/Q) + d2

where Q = P s o qs.

t=1 1{E} θ , A t AIt t +

t=1 1{Ec} θ , A t AIt t

t=1 1{Ec} θ , A t AIt t +

t=1 1{It o, E} θ , A t AIt t 2

t=1 1{It < o, E} θ , A t AIt t 2

(minimality of )

t=1 1{Ec} θ , A t AIt t +

t=1 1{It o, E} θ , A t AIt t 2

t=1 1{It < o, E} min n 1, αo Ao t 2 V 1 t 1

(by applying (B.1) to )

Bounding term . Let s = log2 det(VT )/d . By applying Lemma D.5, we obtain

t T s min n 1, αo Ao t 2 V 1 t 1

t T s min n 1, Ao t 2 V 1 t 1

s=1 (2d/Q + 1/Q) (due to Lemma D.5)

d S(log T)2/Q

(because s = O(log T).)

Bounding term .

t=1 1{It o, E} θ , A t AIt t 2

t=1 1{It o, E} min 1, αIt AIt t 2 V 1 t 1

t [T ] min n 1, At 2 V 1 t 1

log det VT (due to Lemma D.1)

= O d2(log T)2

where the first inequality comes from

θ , A t AIt t max θ CIt t 1 θ θ , AIt t

max θ CIt t 1 θ θ Vt 1 AIt t V 1 t 1 (Cauchy-Schwarz Inequality)

2 αIt AIt t V 1 t 1

where the last inequality holds because E and It o both hold, and so θ Ct 1 CIt t 1 due to Lemma 3.1. The factor 2 is due to an application of the triangular inequality. Substituting the bounds of and in (B.2), we have

E[RT ] = O max{S/Q, d} d(log T)2

Corollary B.1. Pick any C 1 and let {qs}s [n] be chosen as in (3.4). Then the expected regret of Sparse Lin UCB is

E[RT ] = e O

d T, max d2, S2d/C2

Proof. We consider the following cases based on the value of C. Case 1: C2 < 2o. qo = C2 2 o. According to Theorem 3.2,

d2s Tqs + (log T) p

O (log T)n p

d2o Tqo + (log T) p

= e O max{C, S/C}

d T . (due to n = O(log d) and 2o = Θ(S log T))

Besides, according to Theorem 3.4,

E[RT ] = O max{S/Q, d} d log2 T

= O max{S/qo, d} d log2 T

= e O max{d2, S2d/C2}

E[RT ] = e O

d T, max d2, S2d/C2

Case 2: C2 [2o, 2n). For s with C2 < 2s, qs = C22 s. Let o = arg mins>o{C2 < 2s}. Then, qo 1/4. According to Theorem 3.2,

d2s Tqs + (log T) p

d2s T + (log T) X

d2s Tqs + (log T) p

d2o T + n C

= e O max{C, S/C}

d T . (due to C2 = Θ(2o ) and 2o 2o S)

Besides, according to Theorem 3.4,

E[RT ] = O max{S/Q, d} d(log T)2

= O max{S/qo , d} d(log T)2

= e O max{d2, S2d/C2}

E[RT ] = e O

d T, max d2, S2d/C2

Case C2 2n: qs = 1/n for all s [n]. It is easy to verify that E[RT ] = e O(d

T) and E[RT ] = e O(d2/ ). Thus,

E[RT ] = e O

d T, max d2, S2d/C2

Corollary B.2. Assume that the sparsity level S is known and choose a distribution q = {qs}s [n] with qo = 1, where o is set as in (3.2). Then, the expected regret of Sparse Lin UCB is E[RT ] = e O Sd

Proof. We follow the same steps as in the proof of Theorem 3.4. The only difference lies in the bounding term in (B.3). We have

t=1 1{It o, E} θ , A t AIt t 2

t=1 1{E} θ , A t At 2

t [T ] min n 1, At 2 V 1 t 1

= O Sd(log T)2

where the second equality is because qo = 1 and so AIt t = Ao t = At, and the first inequality comes from

θ , A t At max θ Co t 1 θ θ , At

max θ Co t 1 θ θ Vt 1 Ao t V 1 t 1 (Cauchy-Schwarz Inequality)

2 αo At V 1 t 1 where the last inequality holds because E and It = o both hold, and so θ Ct 1 Co t 1 due to Lemma 3.1. Substituting the bounds of and in (B.2), we have

E[RT ] = O Sd(log T)2

C Analysis of Ada Lin UCB

Theorem 4.1. If the random independent noise εt in (2.1) satisfies εt [ 1, 1] for all t [T], then the regret of Ada Lin UCB run with η = p

(log n)/(Tn) for n in (3.1) and q (0, 1] satisfies

d T log 1 + TL2

= e O max np

Proof. Let T1 be the set of time steps where Zt = 1 in Line 4 of Ada Lin UCB and let T2 = [T] \ T1. For all t T2 the action At At is chosen by Exp3 through an adversarial mapping µt : [n] At defined in Lines 9 10. The resulting reward Xt [ 2, 2] is fed to Exp3 as a [0, 1]-valued loss ℓt(It) = (2 Xt)/4, where µt(It) = At. Since T2 is selected through independent coin tosses with fixed bias q, we can apply the Exp3 regret analysis (for losses chosen by a non-oblivious adversary) to bound the regret in T2 and show that X

t T2 Ao t At, θ = 4 X

ℓt(It) ℓt(o) = O p

n log n (C.1)

where we used µt(o) = Ao t for all t [T].

t T2 argmax a At a At, θ

t T2 argmax a At a Ao t + Ao t At, θ

t T2 argmax a At a Ao t, θ

| {z } Reg Img

t T2 Ao t At, θ

| {z } Reg Exp3

Reg Exp3 is bounded in (C.1), hence we only need to bound

Reg Img = X

t T2 argmax a At a Ao t, θ . (C.2)

Let eθi t = argmax θ Ci t max a At a, θ and A t = argmax a At a, θ .

Recall E is the event that θ Ct for all t [T]. Assume E holds. We obtain that for t [T],

θ , A t Ao t eθo t θ , Ao t (because θ Ct Co t )

eθo t θ Vt 1 Ao t V 1 t 1 (Cauchy Schwarz inequality)

αo Ao t V 1 t 1, (C.3)

where the last inequality is because eθo t Co t . Then, assuming E holds, we can bound Reg Img as

Reg Img = X

t T2 θ , A t Ao t

t T2 min n 2, αo Ao t V 1 t 1

t T2 min n 2, p

2γ(1/T) Ao t V 1 t 1

o (due to αo 2γ(1/T))

t T2 min n 1, Ao t V 1 t 1

where in the first inequality, we use the facts θ , A t Ao t αo Ao t V 1 t 1 and θ , A t Ao t θ , A t 2.

From Lemma D.3, we have X

t T2 min n 1, Ao t 2 V 1 t 1

t T2 min n 1, An t 2 V 1 t 1

For each s 0, let

T s 2 = t T2 : det(Vt 1) h 2ds, 2d(s+1)

s 0 T s 2 .

Let s = log2 det(VT )/d . We have

E Reg Img 1[E] E

t T2 min n 1, Ao t V 1 t 1

(due to (C.4))

t T2 min n 1, Ao t V 1 t 1

(due to α0 2γ(1/T))

t T2 min 1, Ao t 2 V 1 t 1

(Cauchy Schwarz inequality)

t T s 2 min 1, Ao t 2 V 1 t 1

(where d ln s log det(VT ))

t T s 2 min 1, An t 2 V 1 t 1

(because An t V 1 t 1 Ao t V 1 t 1)

s=1 (2d + 1)/q (due to Lemma D.4)

2(2d + 1)αo Ts /q .

To obtain the final results, we have

dtrace(VT ) d 1 + TL2

where λ1, , λd are the eigenvalues of VT . Therefore, we have s log2(1 + TL2/d) .

E Reg Img 1[E] 2 p

2(2d + 1)αo T/q

(6dαo T/q) log2(1 + TL2/d) .

By substituting the bounds on Reg Img and Reg Exp3 into RT2, we obtain

E[RT2] = E[Reg Exp31{E}] + E[Reg Img] + T P(Ec)

log(1 + TL2/d) + O( p

n T log n),

where the last inequality is because P(Ec) 1/T from Lemma 2.1 with our choice of δ = 1/δ.

Finally, we bound RT1. Conditioned on event E and using (D.3), t [T], θ Ct C0 t Cn t . Hence, we can obtain

θ , A t An t eθn t θ , An t (because θ Ct Cn t )

eθn t θ Vt 1 An t V 1 t 1 (Cauchy Schwarz inequality)

αn An t V 1 t 1 (because eθn t Cn t .)

Hence, conditioned on event E,

t T1 min {2, θ , A t An t }

t T1 min n 2, αn An t V 1 t 1

t T1 min 1, An t 2 V 1 t 1

(Cauchy Schwarz inequality)

t T1 min 1, An t 2 V 1 t 1

t T2 min 1, At 2 V 1 t 1

t [T ] min 1, At 2 V 1 t 1

2 log(det(VT )), (C.7)

where the last inequality is due to Lemma D.1. Therefore, the expected regret for t T1 is

E[RT1] E[RT11{E}] + T P(Ec)

2 log(det(VT )) + 1

2d log 1 + TL2

= 2 αn E hp

2d log 1 + TL2

2d log 1 + TL2

+ 1, (Jensen s inequality)

where the first inequality is due to (C.7) and P(Ec) 1/T from Lemma 2.1 and the last inequality is due to the fact that for any t [T], with probability q, t T1. Finally, we obtain

E[RT ] = E[RT1] + E[RT2]

2d log 1 + TL2

log(1 + TL2/d) + O( p

n T log n).

Note that n = Θ(log d) and αo = Θ(S log T). We obtain

which completes the proof.

D Supporting lemmas

Lemma D.1 (Dani et al. [10]). Let A1, A2, . . . , AT Rd and Vt = I + Pt s=1 As A s for all t [T]. Then

t=1 min n 1, At 2 V 1 t 1

o 2 log det(VT ). (D.1)

min n 1, At 2 V 1 t 1

o 2 log 1 + At 2 V 1 t 1

= 2 log det(Vt)

Lemma D.2. For Ct defined in (2.2), we have that Ct Co t for all t [T].

Proof. Recall

θ : θ 2 2 +

b Xs θ, As 2 γ(1/T)

b Xs θ, As 2 bθt 2 2

b Xs bθt, As 2

= θ 2 2 2θ t X

s=1 b Xs As

bθt 2 2 + 2bθ t

s=1 b Xs As

= θ 2 Vt + 2(bθt θ) Vtbθt bθt 2 Vt (Vt = I + Pt s=1 As A s , bθt = V 1 t Pt s=1 As b Xs)

= θ 2 Vt 2θ Vtbθt + bθt 2 Vt = θ bθt 2 Vt .

Hence, we can express the ellipsoid as

Ct+1 = θ bθt 2 Vt + bθt 2 2 +

b Xs bθt, As 2 γ(1/T) . (D.2)

Therefore, for all t 0,

θ : θ bθt 2 Vt + bθt 2 2 +

b Xs bθt, As 2 γ(1/T)

(due to (D.2))

n θ : θ bθt 2 Vt γ(1/T) o

n θ : θ bθt 2 Vt αo o (by definition of α0)

= Co t+1 (D.3)

concluding the proof.

Lemma D.3. For any 1 p q n and t [T],

Ap t V 1 t 1 Aq t V 1 t 1.

Proof. Note that

Ap t = argmax a At max θ Cp t θ, a = argmax a At a, bθt 1 + αp a V 1 t 1,

Aq t = argmax a At max θ Cq t θ, a = argmax a At a, bθt 1 + αq a V 1 t 1.

For contradiction, we assume Aq t V 1 t 1 < Ap t V 1 t 1. Since Aq t V 1 t 1 < Ap t V 1 t 1, Aq t V 1 t 1 =

Ap t V 1 t 1. Besides, according to the definition of Ap t and Aq t, we have

Aq t, bθt 1 + αp Aq t V 1 t 1 Ap t , bθt 1 + αp Ap t V 1 t 1 (D.4)

< Ap t , bθt 1 + αq Ap t V 1 t 1 (D.5)

Aq t, bθt 1 + αq Aq t V 1 t 1, (D.6)

where the first inequality is due to the definition of Ap t , the second inequality is due to αp < αq, and the last inequality is due to the definition of Aq t. From the above results, we further have

Aq t Ap t , bθt 1 αp Ap t V 1 t 1 Aq t V 1 t 1 (Due to (D.4))

< αq Ap t V 1 t 1 Aq t V 1 t 1

(Due to assumption Aq t V 1 t 1 < Ap t V 1 t 1 and αq > αp)

Aq t Ap t , bθt 1 . (Due to (D.6).)

We obtain a contradiction. Therefore,

Aq t V 1 t 1 Ap t V 1 t 1. (D.7)

t T s 2 min n 1, An t 2 V 1 t 1

2d/q + 1/q.

Proof. Let det(Vt 1) = (qt 1)d and det(Vt 1 + An t (An t ) ) = (qt 1 + xt)d. For each s 0, let

T s = t [T] : det(Vt 1) h 2sd, 2d(s+1) .

Since T s 2 T s, to show

t T s 2 min n 1, An t 2 V 1 t 1

2d/q + 1/q,

we only need to prove

t T s min n 1, An t 2 V 1 t 1

2d/q + 1/q.

We let It = 1 if the coin tosses in the t s round is Head and 0 otherwise. We divide T s into two disjoint parts T s and T s. Specifically,

for T s, it holds that for t T s, det(Vt 1 + An t (An t ) ) 2d(s+1).

for T s, it holds that for t T s, det(Vt 1 + An t (An t ) ) > 2d(s+1).

From definition of T s and the fact that if It = 1, Vt = Vt 1 + An t (An t ) , we have X

From Algorithm 2, with probability q, It = 1. Therefore, if we let {Ft}t [T ] be the natural filtration of {At, It, Xt}t [T ], we have

t [T ] xt It 1{t T s}

t [T ] E [xt It 1{t T s}]

t [T ] E [E [xt It 1{t T s}|Ft 1]] (Law of total expectation)

t [T ] E [xt 1{t T s} E [It|Ft 1]] (xt and 1{t T s} are Ft 1-measurable)

t [T ] E [xt 1{t T s} q] (It is an independent coin-tossing)

and therefore E h P

t T s xt i 2s/q. For t T s, we further have

min n 1, An t 2 V 1 t 1

o 2 log det(Vt 1 + An t (An t ) ) det(Vt 1)

(due to Lemma D.1)

= 2d log 1 + xt

2d log 1 + xt

(since t T s, qt 1 2s)

2s (due to log(1 + x) x for x > 0.)

min n 1, An t 2 V 1 t 1

From definition of T s, if It = 1 and t T s, det(Vt) > 2d(s+1). Then, for all τ > t, τ / T s. Therefore, there is at most one t T s with It = 1. We obtain

t T s min n 1, An t 2 V 1 t 1

where the last inequality is because with probability q, It = 1. By combining the bounds for t T s and t T s together, lemma follows.

Lemma D.5. Let Q = P

s o qs. Then,

t T s min n 1, Ao t 2 V 1 t 1

2d/Q + 1/Q.

Proof. Let det(Vt 1) = (qt 1)d. Let xt satisfies det(Vt 1 + Ao t(Ao t) ) = (qt 1 + xt)d. We let It = 1{It o}. We divide T s into two disjoint parts T s and T s. Specifically,

for T s, it holds that for t T s, det(Vt 1 + Ao t(Ao t) ) 2d(s+1).

for T s, it holds that for t T s, det(Vt 1 + Ao t(Ao t) ) > 2d(s+1).

Note that for It o,

log(det(Vt)) log(det(Vt 1 + Ao t(Ao t) ))

= log(det(Vt)) log(Vt 1) log(det(Vt 1 + Ao t(Ao t) )) log(Vt 1)

(due to Lemma D.1)

= log(1 + AIt t V 1 t 1) log(1 + Ao t V 1 t 1)

0. (due to Lemma D.3)

Therefore, if we let det(Vt 1 + AIt t (AIt t ) ) = (qt 1 + x t)d and It o, then x t xt. Hence, X

Note that P(It o) = P

s o qs = Q. Let {Ft}t [T ] be the natural filtration of {At, It, Xt}t [T ]. We have

t [T ] xt It 1{t T s}

t [T ] E [xt It 1{t T s}]

t [T ] E [E [xt It 1{t T s}|Ft 1]] (Law of total expectation)

t [T ] E [xt 1{t T s} E [It|Ft 1]] (xt and 1{t T s} are Ft 1-measurable)

t [T ] E [xt 1{t T s} Q] (It is an independent coin-tossing)

and therefore

For t T s, we further have

min n 1, Ao t 2 V 1 t 1

o min n 1, AIt t 2 V 1 t 1

2 log det(Vt 1 + AIt t (AIt t ) ) det(Vt 1)

(due to Lemma D.1)

= 2d log 1 + xt

2d log 1 + xt

(since t T s, qt 1 2s)

2s (due to log(1 + x) x for x > 0.)

min n 1, Ao t 2 V 1 t 1

From definition of T s, if It = 1 and t T s, then It o and

det(Vt) = det(Vn 1 + AIt t (AIt t ) ) > det(Vn 1 + Ao t(Ao t) ) > (s + 1)d.

Then, for all τ > t, τ / T s. Therefore, there is at most one t T s with It = 1. We obtain

t T s min n 1, Ao t 2 V 1 t 1

where the last inequality is because with probability P

s o qs = Q, It = 1. By combining the bounds for t T s and t T s together, lemma follows.

E Experimental details

The code used in the experiments can be found in the following repository: https://github.com/ jajajang/sparsity_agnostic_model_selection.

E.1 Settings common to all algorithms

Arm set: At = A for all t [T]. A Sd 1 (the unit sphere in Rd) is a set of ddimensional vectors drawn independently and uniformly at random from Sd 1, with d = 16 and |A| = 30. θ is an S-sparse (S = 1, 2, 4, 8, 16) vector generated as follows: before the game starts, draw (θ )1, . . . , (θ )S SS 1, and (θ )k = 0 for all k > S. The noise on rewards: {εt}t [T ] are i.i.d. with ξt Unif([ 1, 1]).

Number of iterations: T = 104. Number of models: n = 6. Radius of confidence sets: α0 = 0, and αi = 2i log t for i = 1, , 5. Prior distribution {qs}s [6]

For _Unif,{qs}s [6] = 1

For _Theory, {qs}s [6] = C

64 where C = 63 64 is a normalizing constant. Each plot is the result of 20 repetitions for each method. The shade represents the 1-standard deviation bound. Hardware: Lenovo Thinkpad P16s Gen 2 Laptop - Type 21HL

CPU: 13th Gen Intel(R) Core(TM) i7-1360P 2.20 GHz RAM: 32GB Computation time: total 1338.38 seconds.

E.2 Ada Lin UCB details

Since we empirically observed that Exp3 provided enough exploration, we aggressively set the forced exploration parameter q to zero.

The learning rate of Exp3 was set to ηt = 2 q

nt , see [5].

Given the prior distribution {qs}s [6], we set Pt as follows:

Pt,s = qs exp (ηt St,s) Pn j=1 qj exp (ηt St,j)

E.3 OFUL details

We used the log-determinant form of the confidence set based on Abbasi-Yadkori et al. [1, Theorem 2], which gives the choice γt = p

2 log T + log det(Vt) + 1

for the parameter γt in (1.1) when λ = 1 and δ = 1/T.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope?

Answer: [Yes]

Justification: The main claims presented in the abstract and introduction accurately represent the contributions and scope of the paper.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We have added a paragraph in the conclusions section discussing the limitations of our work.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes] Justification: We provide the full set of assumptions and complete proofs. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We include our code in our supplementary material and will make the full code public if this work gets accepted.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide all the details of our experimental setting.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We reported 1-sigma error bars in our plots.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We state all information on the computer resources in Appendix E. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our research conform with the Neur IPS Code of Ethics. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: There is no societal impact of the work performed. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: Our paper poses no such risks.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: Our paper does not use existing assets.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Our paper does not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Our paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Our paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.