Anytime Model Selection in Linear Bandits

Parnian Kassraie¹, Nicolas Emmenegger¹, Andreas Krause¹, Aldo Pacchiano²,³
¹ETH Zurich  ²Broad Institute of MIT and Harvard  ³Boston University
{pkassraie, nicolaem, krausea}@ethz.ch  pacchian@bu.edu

Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat the different models as experts. Existing methods, however, scale poorly (polynomially in $M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, whose regret has an exponentially improved ($\log M$) dependence on $M$. ALEXP has anytime guarantees on its regret: it neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.

1 Introduction

When solving bandit problems or performing Bayesian optimization, we need to commit to a reward model a priori, based on which we estimate the reward function and build a policy for selecting the next action. In practice, there are many ways to model the reward by considering different feature maps or hypothesis spaces, e.g., for optimizing gene knockouts [Gonzalez et al., 2015, Pacchiano et al., 2022] or parameter estimation in biological dynamical systems [Ulmasov et al., 2016, Imani et al., 2019]. It is not known a priori which model will yield the most sample-efficient bandit algorithm, and we can only select the right model as we gather empirical evidence. This leads us to ask: can we perform adaptive model selection while simultaneously optimizing for reward?

In an idealized setting with no sampling limits, given a model class of size $M$, we could initialize $M$ bandit algorithms (a.k.a. agents) in parallel, each using one of the available reward models. Then, as the algorithms run, at every step we could select the most promising agent according to the cumulative rewards obtained so far. Model selection can then be cast as an online optimization problem on an $M$-dimensional probability simplex, where the probability of selecting an agent depends on its cumulative reward, and the optimizer seeks the distribution with the best return in hindsight. This approach is impractical for large model classes, since at every step it requires drawing $M$ different samples in parallel from the environment so that the reward of each agent is realized. In realistic applications of bandit optimization, we can only draw one sample at a time, and so we need to design an algorithm which allocates more samples to the agents that are deemed more promising. Prior work [e.g., Maillard and Munos, 2011, Agarwal et al., 2017] proposes to run a single meta-algorithm which interacts with the environment by first selecting an agent, and then selecting an action according to the suggestion of that agent. The online model selection problem can still be emulated in this setup; however, this time the optimizer receives partial feedback, coming from only one agent.
Consequently, many agents need to be queried, and the overall regret scales with $\mathrm{poly}(M)$, again restricting this approach to small model classes. In fact, addressing the limited scope of such algorithms, Agarwal et al. [2017] raise an open problem on the feasibility of obtaining a $\log M$ dependency for the regret.

We show that this rate is achievable in the particular case of linearly parametrizable rewards. We develop a technique to hallucinate the reward for every agent that was not selected, and run the online optimizer with emulated full-information feedback. This allows the optimizer to assess the quality of the agents without ever having queried them. As a result, our algorithm ALEXP satisfies a regret of rate $\mathcal{O}(\max\{\sqrt{n\log^3 M},\ n^{3/4}\log M\})$, with high probability, simultaneously for all $n \geq 1$ (Theorem 1). Our key idea, leading to the $\log M$ dependency, is to employ the Lasso as a low-variance online regression oracle and to estimate the reward of the agents that were not chosen. This trick is made possible through our novel time-uniform analysis of online Lasso regression (Theorem 3). Consequently, ALEXP is horizon-independent, and explores adaptively without requiring an initial exploration stage. Empirically, we find that ALEXP consistently outperforms prior work across a range of environments.

Table 1: Overview of literature on online model selection (MS) for bandit optimization.

  Approach                     | MS technique               | Regret (in M) | MS guarantee | Adaptive exploration | Anytime
  Sparse Linear Bandits        | Lasso                      | log M         | no           | no                   | no
  MS for Black Box Bandits     | OMD with bandit feedback   | poly M        | yes          | yes                  | no
  MS for Linear Bandits (Ours) | EXP4 with full-information | log M         | yes          | yes                  | yes

2 Related Work

Online model selection (MS) for bandits considers combining a number of agents in a master algorithm, with the goal of performing as well as the best agent [Maillard and Munos, 2011, Agarwal et al., 2017, Pacchiano et al., 2020, Luo et al., 2022]. This literature operates on black-box model classes of size $M$ and uses variants of Online Mirror Descent (OMD) to sequentially select the agents. The optimizer operates on importance-weighted estimates of the agents' rewards, which have high variance ($\mathrm{poly}(M)$) and are non-zero only for the selected agent. Effectively, the optimizer receives partial (a.k.a. bandit) feedback, and agents are at risk of starvation, since at every step only the selected agent gets the new data point. These works assume knowledge of the horizon; however, we suspect that this may be lifted with a finer analysis of OMD.

Sparse linear bandits use sparsity-inducing methods, most often the Lasso [Tibshirani, 1996], for estimating the reward in the presence of many features, as an alternative to model selection. Early results are in the data-rich regime, where the stopping time $n$ is known and larger than the number of models, and some suffer from a $\mathrm{poly}(M)$ regret dependency [e.g., Abbasi-Yadkori et al., 2012]. Recent efforts often consider the contextual case, where at every step only a finite stochastic subset of the action domain is presented to the agent, and where the distribution of these points is assumed to be i.i.d. and sufficiently diverse [Li et al., 2022, Bastani and Bayati, 2020, Kim and Paik, 2019, Oh et al., 2021, Cella and Pontil, 2021]. We do not rely on such assumptions. Most sparse bandit algorithms either start with a purely exploratory phase [Kim and Paik, 2019, Li et al., 2022, Hao et al., 2020, Jang et al., 2022], or rely on a priori scheduled exploration [Bastani and Bayati, 2020].
The exploration budget is set according to the horizon $n$. Therefore, such algorithms inherently require knowledge of $n$ and can be made anytime only via the doubling trick [Auer et al., 1995]. Table 2 presents an in-depth overview. ALEXP inherits the best of both worlds (Table 1): its regret enjoys the $\log M$ dependency of sparse linear bandits even on compact domains, and it has adaptive probabilistic exploration with anytime guarantees. In contrast to prior literature, we perform model selection with an online optimizer (EXP4), which hallucinates full-information feedback using a low-variance Lasso estimator instead of importance-weighted estimates. Moreover, our anytime approach lifts the horizon dependence and the exploration requirement of sparse linear bandits.

Our work is inspired by and contributes to the field of learning with expert advice, which analyzes how to incorporate the advice of $M$ oblivious (non-adaptive) experts, with bandit or full-information feedback [Haussler et al., 1998, Auer et al., 2002b, McMahan and Streeter, 2009]. The idea of employing an online optimizer for learning stems from this literature, and has been used in various applications of online learning [Foster et al., 2017, Singla et al., 2018, Muthukumar et al., 2019, Karimi et al., 2021, Liu et al., 2022]. In particular, we are inspired by Foster and Rakhlin [2020] and Moradipari et al. [2022], who apply EXP4 to least-squares estimates for arm selection in K-armed contextual bandits. However, their algorithms are not anytime, and due to the choice of estimator, the corresponding regret scales with $\mathcal{O}(\min(M, K\log M))$.

3 Problem Setting

We consider a bandit problem where a learner interacts with the environment in rounds. At step $t$ the learner selects an action $x_t \in \mathcal{X}$, where $\mathcal{X} \subset \mathbb{R}^{d_0}$ is a compact domain, and observes a noisy reward $y_t = r(x_t) + \varepsilon_t$, where $\varepsilon_t$ is an i.i.d. zero-mean sub-Gaussian variable with parameter $\sigma^2$. We assume the reward function $r: \mathcal{X} \to \mathbb{R}$ is linearly parametrizable by some unknown feature map, and that the model class $\{\phi_j : \mathbb{R}^{d_0} \to \mathbb{R}^d,\ j = 1, \dots, M\}$ contains the set of plausible feature maps. We consider the setting where $M$ can be very large, and while the set $\{\phi_j\}$ may include misspecified feature maps, it contains at least one feature map that represents the reward function, i.e., there exists $j^\star \in [M]$ such that $r(\cdot) = \theta_{j^\star}^\top \phi_{j^\star}(\cdot)$. We assume the image of each $\phi_j$ spans $\mathbb{R}^d$, and that no two feature maps are linearly dependent, i.e., for any $j \neq j'$ in $[M]$ there exists no $\alpha \in \mathbb{R}$ such that $\phi_j(\cdot) = \alpha\phi_{j'}(\cdot)$. This assumption, which is satisfied by design in practice, ensures that the features are not ill-posed and that we can explore in all relevant directions. We assume that the concatenated feature map $\phi(x) := (\phi_1(x), \dots, \phi_M(x))$ is normalized, $\lVert\phi(x)\rVert \leq 1$ for all $x \in \mathcal{X}$, and that $\lVert\theta_{j^\star}\rVert \leq B$, which implies $|r(x)| \leq B$ for all $x \in \mathcal{X}$.

We model this problem in the language of model selection, where a meta-algorithm aims to optimize the unknown reward function by relying on a number of base learners. In order to interact with the environment, the meta-algorithm selects an agent, which in turn selects an action. In our setting, we think of each of the $M$ feature maps as being controlled by a base agent running its own algorithm. Base agent $j$ uses the feature map $\phi_j$ for modeling the reward. At step $t$ of the bandit problem, each agent $j$ is given access to the full history $H_{t-1} := \{(x_1, y_1), \dots, (x_{t-1}, y_{t-1})\}$, and uses it to locally estimate the reward as $\hat\beta_{t-1,j}^\top \phi_j(\cdot)$, where $\hat\beta_{t-1,j} \in \mathbb{R}^d$ is the estimated coefficient vector.
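As a concrete illustration of this setup, the following is a minimal NumPy sketch of $M$ base agents computing their local ridge estimates from the shared history. The toy feature maps, the choice of regularizer, and all names here are illustrative assumptions rather than the paper's implementation; agent 0 plays the role of the oracle agent.

```python
import numpy as np

def ridge_estimate(Phi_j, y, lam=1.0):
    """Closed-form ridge estimate (Phi_j^T Phi_j + lam I)^{-1} Phi_j^T y for one agent."""
    d = Phi_j.shape[1]
    return np.linalg.solve(Phi_j.T @ Phi_j + lam * np.eye(d), Phi_j.T @ y)

rng = np.random.default_rng(0)
d, M, t = 3, 5, 20

# Toy model class: M distinct nonlinear feature maps R -> R^d (purely illustrative).
feature_maps = [
    (lambda x, k=j + 1: np.array([x ** k, np.sin(k * x), np.cos(k * x)]))
    for j in range(M)
]

# Shared history H_{t-1}: every agent observes the same action-reward pairs.
theta_star = np.array([0.5, -0.3, 0.2])        # reward is linear in phi_0, so agent 0 is the oracle
x_hist = rng.uniform(-1.0, 1.0, size=t)
y_hist = np.array([theta_star @ feature_maps[0](x) for x in x_hist]) + 0.01 * rng.standard_normal(t)

# Each agent j fits its own local estimate beta_hat_{t,j} on its own features phi_j.
beta_hat = []
for phi_j in feature_maps:
    Phi_j = np.stack([phi_j(x) for x in x_hist])   # t x d feature matrix of agent j
    beta_hat.append(ridge_estimate(Phi_j, y_hist))
print(beta_hat[0], beta_hat[1])                     # oracle agent's estimate vs. a misspecified agent's
```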
The agent then uses this estimate to develop its action selection policy $p_{t,j} \in \mathcal{M}(\mathcal{X})$, where $\mathcal{M}(\mathcal{X})$ denotes the space of probability measures defined on $\mathcal{X}$. The condition on the existence of $j^\star$ ensures that there is at least one agent which uses a correct model for the reward, and can thereby solve the bandit problem if executed in isolation. We refer to agent $j^\star$ as the oracle agent. Our goal is to find a sample-efficient strategy for iterating over the agents, such that their suggested actions maximize the cumulative reward achieved over any horizon $n \geq 1$. This is equivalent to minimizing the cumulative regret $R(n) = \sum_{t=1}^n r(x^\star) - r(x_t)$, where $x^\star$ is a global maximizer of the reward function. We neither fix $n$, nor assume knowledge of it.

As a warm-up, consider an example with deterministic agents, i.e., when $p_{t,j}$ is a Dirac measure on a specific action $x_{t,j}$. Suppose it were practically feasible to draw the action suggested by every agent and observe the corresponding reward vector $r_t = (r_{t,j})_{j=1}^M$. In this case, model selection becomes a full-information online optimization problem, and we can design a minimax optimal algorithm as follows. We assign a probability distribution $q_t = (q_{t,j})_{j=1}^M$ to the models, and update it such that the overall average return $\sum_{t=1}^n q_t^\top r_t$ is competitive with the best agent's average return $\sum_{t=1}^n r_{t,j}$. At every step, we update $q_{t+1,j} \propto \exp(\sum_{s=1}^t r_{s,j})$, since such exponential weighting is known to lead to an optimal solution for this classic online learning problem [Cesa-Bianchi and Lugosi, 2006]. In our setting, however, the agents are stochastic, and we do not have access to the full vector $r_t$.

We propose the Anytime EXPonential weighting algorithm based on Lasso reward estimates (ALEXP), summarized in Algorithm 1. At step $t$ we first sample an agent $j_t$, and then sample an action $x_t$ according to the agent's policy $p_{t,j_t}$. Let $\Delta_M$ denote the $M$-dimensional probability simplex. We maintain a probability distribution $q_t \in \Delta_M$ over the agents, and update it sequentially as we accumulate evidence on the performance of each agent. Ideally, we would adjust $q_{t,j}$ according to the average return of model $j$, that is, $\mathbb{E}_{x\sim p_{t,j}} r(x)$. However, since $r$ is unknown, we estimate the average reward with some $\hat r_{t,j}$. We then update $q_t$ for the next step via

$q_{t+1,j} = \frac{\exp\left(\eta_t \sum_{s=1}^t \hat r_{s,j}\right)}{\sum_{i=1}^M \exp\left(\eta_t \sum_{s=1}^t \hat r_{s,i}\right)} \quad \text{for all } j = 1, \dots, M,$

where the learning rate $\eta_t$ controls the sensitivity of the updates. This rule allows us to imitate the full-information example mentioned above. By utilizing $\hat r_{t,j}$ and hallucinating feedback from all agents, we can reduce the probability of selecting a badly performing agent without ever having sampled it (cf. Fig. 4).

It remains to design the estimator $\hat r_{t,j}$. We concatenate all feature maps and, knowing that many features are redundant, use a sparsity-inducing estimator over the resulting coefficient vector. Specifically, let $\theta = (\theta_1, \dots, \theta_M) \in \mathbb{R}^{Md}$ be the concatenated coefficient vector. We then solve

$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^{Md}} L(\theta; H_t, \lambda_t) = \arg\min_{\theta \in \mathbb{R}^{Md}} \frac{1}{t}\left\lVert y_t - \Phi_t\theta\right\rVert_2^2 + 2\lambda_t \sum_{j=1}^M \left\lVert\theta_j\right\rVert_2, \quad (1)$

where $\Phi_t = [\phi^\top(x_s)]_{s \leq t} \in \mathbb{R}^{t \times Md}$ is the feature matrix, $y_t \in \mathbb{R}^t$ is the concatenated reward vector, and $\lambda_t$ is an adaptive regularization parameter. Problem (1) is the online variant of the group Lasso [Lounici et al., 2011]. The second term in the loss is the mixed (2,1)-norm of $\theta$, which can be seen as the $\ell_1$-norm of the vector $(\lVert\theta_1\rVert_2, \dots, \lVert\theta_M\rVert_2) \in \mathbb{R}^M$.
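To see why this penalty zeroes out entire sub-vectors, the following short NumPy sketch evaluates the mixed (2,1)-norm and its proximal operator (block soft-thresholding), which is the group-wise shrinkage step that a proximal solver for (1) would apply. This is an illustrative sketch, not the solver used in the paper.

```python
import numpy as np

def mixed_21_norm(theta, M, d):
    """Mixed (2,1)-norm: sum_j ||theta_j||_2 for theta = (theta_1, ..., theta_M) in R^{Md}."""
    return np.sum(np.linalg.norm(theta.reshape(M, d), axis=1))

def block_soft_threshold(theta, tau, M, d):
    """Proximal operator of tau * mixed (2,1)-norm: shrinks every block, zeroing out weak groups."""
    blocks = theta.reshape(M, d)
    norms = np.maximum(np.linalg.norm(blocks, axis=1, keepdims=True), 1e-12)
    return (np.maximum(1.0 - tau / norms, 0.0) * blocks).reshape(-1)

rng = np.random.default_rng(0)
M, d = 5, 3
theta = rng.standard_normal(M * d)
theta[d:] *= 0.05                         # groups 2..M are weak (redundant feature maps)
shrunk = block_soft_threshold(theta, tau=0.3, M=M, d=d)
print(mixed_21_norm(theta, M, d))
print(np.linalg.norm(shrunk.reshape(M, d), axis=1))   # weak groups are mapped exactly to zero
```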
This norm induces sparsity at the group level, and therefore the sub-vectors $\hat\theta_{t,j} \in \mathbb{R}^d$ that correspond to redundant feature maps are expected to be $\mathbf{0}$, the null vector. We then estimate the average return of each model by simply taking an expectation, $\hat r_{t,j} = \mathbb{E}_{x \sim p_{t+1,j}}[\hat\theta_t^\top \phi(x)]$. This quantity is the average return of the agent's policy $p_{t+1,j}$ according to our Lasso estimator. In Section 5.2 we explain why the particular choice of the Lasso is crucial for obtaining a $\log M$ rate for the regret.

For action selection, with probability $1 - \gamma_t$ we sample agent $j$ with probability $q_{t,j}$ and draw $x_t \sim p_{t,j}$, as per the suggestion of the agent. With probability $\gamma_t$, we choose the action according to some exploratory distribution $\pi \in \mathcal{M}(\mathcal{X})$ which aims to sample informative actions; this can be any design with $\mathrm{supp}(\pi) = \mathcal{X}$. We mix $p_{t,j}$ with $\pi$ to collect sufficiently diverse data for model selection: we do not restrict the agents' policies, and therefore cannot rely on them to explore adequately. In Theorem 1, we choose a decreasing sequence of exploration probabilities $(\gamma_t)_{t\geq 1}$, since less exploration is required as data accumulates. To conclude, the action selection policy of ALEXP is formally described by the mixture

$p_t(x) = \gamma_t\,\pi(x) + (1 - \gamma_t)\sum_{j=1}^M q_{t,j}\, p_{t,j}(x).$

5 Main Results

For the regret guarantee, we consider specific choices of base agents and exploratory distribution. Our analysis may be extended to include other policies, since ALEXP can be wrapped around any bandit agent described by some $p_{t,j}$, and allows for random exploration with any distribution $\pi$.

Base Agents. We assume that the oracle agent follows either a UCB [Abbasi-Yadkori et al., 2011] or a GREEDY [Auer et al., 2002a] policy, and all other agents are free to follow any arbitrary policy. Similar treatment can be applied when the oracle uses other (sublinear) policies for solving linear or contextual bandits [e.g., Thompson, 1933, Kaufmann et al., 2012, Agarwal et al., 2014]. In either case, agent $j$ calculates a ridge estimate of the coefficient vector based on the history $H_t$:

$\hat\beta_{t,j} := \arg\min_{\beta \in \mathbb{R}^d} \lVert y_t - \Phi_{t,j}\beta\rVert_2^2 + \lambda\lVert\beta\rVert_2^2 = \left(\Phi_{t,j}^\top\Phi_{t,j} + \lambda I\right)^{-1}\Phi_{t,j}^\top y_t.$

Here $\Phi_{t,j} \in \mathbb{R}^{t \times d}$ is the feature matrix whose $s$-th row is $\phi_j^\top(x_s)$, and $\lambda$ is the regularization constant. Then at step $t$, a GREEDY oracle suggests the action which maximizes the reward estimate $\hat\beta_{t-1,j^\star}^\top\phi_{j^\star}(x)$, and a UCB oracle queries $\arg\max_x u_{t-1,j^\star}(x)$, where $u_{t-1,j^\star}(\cdot)$ is the upper confidence bound that this agent calculates for $r(\cdot)$. Proposition 21 shows that the sequence $(u_{t,j^\star})_{t\geq 1}$ is in fact an anytime-valid upper bound for $r$ over the entire domain.

Exploratory policy. The performance of ALEXP depends on the quality of the samples that $\pi$ suggests. The eigenvalues of the covariance matrix $\Sigma(\pi, \phi) := \mathbb{E}_{x\sim\pi}\,\phi(x)\phi^\top(x)$ reflect how diverse the data is, and thus are a good indicator of data quality. van de Geer and Bühlmann [2009] present a survey of the notions of diversity defined based on these eigenvalues. Let $\lambda_{\min}(A)$ denote the minimum eigenvalue of a matrix $A$. Similar to Hao et al. [2020], we assume that $\pi$ is the maximizer of the problem below and present our regret bounds in terms of

$C_{\min} = C_{\min}(\mathcal{X}, \phi) := \max_{\pi \in \mathcal{M}(\mathcal{X})} \lambda_{\min}\left(\Sigma(\pi, \phi)\right), \quad (2)$

which is greater than zero under the conditions specified in our problem setting.

Algorithm 1 ALEXP
Inputs: $\pi$, $(\gamma_t, \eta_t, \lambda_t)$ for $t \geq 1$.
Let $q_1 \leftarrow \mathrm{Unif}([M])$ and initialize base agents $(p_{1,1}, \dots, p_{1,M})$.
for $t \geq 1$ do
  Draw $\alpha_t \sim \mathrm{Bernoulli}(\gamma_t)$.   ▷ Decide to explore or exploit
  if $\alpha_t = 1$ then
    Choose action $x_t$ randomly according to $\pi$.   ▷ Explore
  else
    Draw $j_t \sim q_t$.   ▷ Select an agent
    Draw $x_t \sim p_{t,j_t}$.   ▷ Select the action suggested by the agent
  end if
  Observe $y_t = r(x_t) + \varepsilon_t$.   ▷ Receive reward
  $H_t \leftarrow H_{t-1} \cup \{(x_t, y_t)\}$.   ▷ Append history
  $\hat\theta_t \leftarrow \arg\min L(\theta; H_t, \lambda_t)$.   ▷ Update the parameter estimate
  Report $H_t$ to all agents, and get updated policies $p_{t+1,1}, \dots, p_{t+1,M}$.   ▷ Update agents
  Update the estimated average return of every agent: $\hat r_{t,j} \leftarrow \mathbb{E}_{x\sim p_{t+1,j}}[\hat\theta_t^\top\phi(x)]$, $j = 1, \dots, M$.
  Update agent probabilities: $q_{t+1,j} \leftarrow \exp\big(\eta_t\sum_{s=1}^t \hat r_{s,j}\big)\big/\sum_{i=1}^M\exp\big(\eta_t\sum_{s=1}^t \hat r_{s,i}\big)$.
end for

Prior works in the sparse bandit literature all require a similar or stronger assumption of this kind, and Table 2 gives an overview. Alternatively, one can work with an arbitrary $\pi$, e.g., $\mathrm{Unif}(\mathcal{X})$, as long as $\lambda_{\min}(\Sigma(\pi, \phi))$ is bounded away from zero. Appendix C.1 reviews some configurations of $(\phi, \mathcal{X}, \pi)$ that lead to a nonzero minimum eigenvalue, and Corollary 12 bounds the regret of ALEXP with uniform exploration.

For this choice of agents and exploratory distribution, Theorem 1 presents an informal regret bound. Here, we use the $\mathcal{O}$ notation and only include the fastest-growing terms. The inequality is made exact in Theorem 14, up to constant multiplicative factors.

Theorem 1 (Cumulative Regret of ALEXP, Informal). Let $\delta \in (0, 1]$ and set $\pi$ to be the maximizer of (2). Choose learning rate $\eta_t = \mathcal{O}\big(C_{\min}\, t^{-1/2}/C(M,\delta,d)\big)$, exploration probability $\gamma_t = \mathcal{O}(t^{-1/4})$ and Lasso regularization parameter $\lambda_t = \mathcal{O}\big(C_{\min}\, t^{-1/2}\, C(M,\delta,d)\big)$, where
$C(M,\delta,d) = \mathcal{O}\Big(\sqrt{1 + \log(M/\delta) + (\log\log d)_+} + \sqrt{d\,\big(\log(M/\delta) + (\log\log d)_+\big)}\Big).$
Then ALEXP satisfies the cumulative regret bound
$R(n) = \mathcal{O}\Big(n^{3/4}B + n^{3/4}C(M,\delta,d) + C_{\min}^{-1}\sqrt{n}\,C(M,\delta,d)\log M + C_{\min}^{-1/2}\, n^{5/8}\sqrt{d\log n + \log(1/\delta)}\Big)$
simultaneously for all $n \geq 1$, with probability greater than $1 - \delta$.

In this bound, the first term is the regret incurred at exploratory steps (when $\alpha_t = 1$), the second term is due to the estimation error of the Lasso (i.e., $\lVert\theta - \hat\theta_t\rVert$), and the third term is the regret of the exponential weights sub-algorithm. The fourth term is the regret bound for the oracle agent $j^\star$ when run within the ALEXP framework. It does not depend on the agent's policy (greedy or optimistic), and is worse than the minimax optimal rate of $\sqrt{nd\log n}$. This is because the oracle suggests actions based on the history $H_t$, which includes uninformative action-reward pairs queried by other, potentially misspecified, agents. In Corollary 12, we provide a regret bound independent of $C_{\min}$ for orthogonal feature maps, and show that the same $\mathcal{O}(\max\{\sqrt{n\log^3 M},\ n^{3/4}\log M\})$ rate may be achieved even with the simple choice $\pi = \mathrm{Unif}(\mathcal{X})$.

5.1 Proof Sketch

The regret has two sources: selecting a sub-optimal agent, and an agent selecting a sub-optimal action. Accordingly, for any $j \in \{1, \dots, M\}$, we decompose the regret as
$\sum_{t=1}^n r(x^\star) - r(x_t) = \sum_{t=1}^n \big(r(x^\star) - r_{t,j}\big) + \sum_{t=1}^n \big(r_{t,j} - r(x_t)\big). \quad (3)$
The first term is the cumulative regret of agent $j$ when run within ALEXP. The second term evaluates the received reward against the cumulative average reward of model $j$. We bound each term separately.

Virtual Regret. The first term, $R_j(n) := \sum_{t=1}^n r(x^\star) - r_{t,j}$, compares the suggestions of agent $j$ against the optimal action. We call it the virtual regret, since the sequence $(x_{t,j})_{t\geq 1}$ of actions suggested by model $j$ is not necessarily selected by the meta-algorithm, unless $j_t = j$. This regret is merely a technical tool, and is not actually realized when running ALEXP.
The virtual regret of the oracle agent may still be analyzed using standard techniques for linear bandits, e.g., Abbasi-Yadkori et al. [2011]; however, we need to adapt the analysis to account for a subtle difference: the confidence sequence of model $j$ is constructed from the true sequence of actions $(x_t)_{t\geq 1}$, while its virtual regret is calculated based on the virtual sequence $(x_{t,j})_{t\geq 1}$ which the model suggests. The two sequences only match at the steps where model $j$ is selected. Adapting the analysis of Abbasi-Yadkori et al. [2011] to this subtlety, we obtain in Lemma 15 that with probability greater than $1 - \delta$, simultaneously for all $n \geq 1$,
$R_{j^\star}(n) = \mathcal{O}\Big(n^{5/8}\, C_{\min}^{-1/2}\sqrt{d\log n + \log(1/\delta)}\Big).$

Model Selection Regret. The second term in (3) is the model selection regret, $\bar R(n, j) := \sum_{t=1}^n r_{t,j} - r(x_t)$, which evaluates the actions chosen by ALEXP against the suggestions of the $j$-th agent. Our analysis relies on a careful decomposition of $\bar R(n, j)$:
$\bar R(n, j) = \sum_{t=1}^n\Big[\underbrace{r_{t,j} - \hat r_{t,j}}_{\text{(I)}} + \underbrace{\hat r_{t,j} - \sum_{i=1}^M q_{t,i}\hat r_{t,i}}_{\text{(II)}} + \underbrace{\sum_{i=1}^M q_{t,i}\big(\hat r_{t,i} - r_{t,i}\big)}_{\text{(III)}} + \underbrace{\sum_{i=1}^M q_{t,i}\, r_{t,i} - r(x_t)}_{\text{(IV)}}\Big].$
We bound each term in a modular manner, until we are left with the regret of the standard exponential weights algorithm. The terms (I) and (III) are controlled by the bias of the Lasso estimator, and are $\mathcal{O}(n^{3/4}C(M,\delta,d))$ (Lemma 19). The last term (IV) is zero in expectation, and reflects the deviation of $r(x_t)$ from its mean. We observe that the summands form a bounded martingale difference sequence, and their sum grows as $\mathcal{O}(\sqrt{n})$ (Lemma 18). Term (II) is the regret of our online optimizer, which depends on the variance of the Lasso estimator. We bound this term by $\mathcal{O}(\sqrt{n}\,C(M,\delta,d)\log M)$, by first conducting a standard anytime analysis of exponential weights (Lemma 17), and then incorporating the anytime Lasso variance bound (Lemma 20). We highlight that none of the above steps require assumptions about the base agents. Combining these steps, Lemma 16 establishes the formal bound on the model selection regret.

Anytime Lasso. We develop novel confidence intervals for the Lasso with history-dependent data, which are uniformly valid over an unbounded time horizon. This result may be of independent interest in applications of the Lasso for online learning or sequential decision making. Here, we use these confidence intervals to bound the bias and variance terms that appear in our treatment of the model selection regret. The width of the Lasso confidence intervals depends on the quality of the feature matrix $\Phi_t$, often quantified by the restricted eigenvalue property [Bickel et al., 2009, van de Geer and Bühlmann, 2009, Javanmard and Montanari, 2014]:

Definition 2. For the feature matrix $\Phi_t \in \mathbb{R}^{t\times dM}$ we define $\kappa(\Phi_t, s)$, for any $1 \leq s \leq M$, as
$\kappa(\Phi_t, s) := \inf_{(J, b)} \frac{\lVert\Phi_t b\rVert_2}{\sqrt{t}\sqrt{\sum_{j\in J}\lVert b_j\rVert_2^2}} \quad \text{s.t.} \quad b \in \mathbb{R}^{dM}\setminus\{0\},\; \sum_{j\notin J}\lVert b_j\rVert_2 \leq 3\sum_{j\in J}\lVert b_j\rVert_2,\; J\subseteq\{1,\dots,M\},\; |J|\leq s.$
Our analysis is in terms of this quantity, and Lemma 8 explains the connection between $\kappa(\Phi_t, s)$ and $C_{\min}$ as defined in (2), in particular that $\kappa(\Phi_t, 2)$ is positive with high probability.

Theorem 3 (Anytime Lasso Confidence Sequences). Consider the data model $y_t = \theta^\top\phi(x_t) + \varepsilon_t$ for all $t \geq 1$, where $\varepsilon_t$ is i.i.d. zero-mean sub-Gaussian noise, and $(x_t)_{t\geq 1}$ is $(\mathcal{F}_t)_{t\geq 1}$-predictable, where $\mathcal{F}_t := \sigma(x_1, \dots, x_t, \varepsilon_1, \dots, \varepsilon_{t-1})$. Then, with probability greater than $1 - \delta$, the solution of (1) guarantees
$\forall t \geq 1: \quad \lVert\theta - \hat\theta_t\rVert_2 \leq \frac{4\sqrt{10}\,\lambda_t}{\kappa^2(\Phi_t, 2)},$
provided that, for every $t \geq 1$, the regularization parameter $\lambda_t$ is chosen large enough, of order
$\frac{1}{\sqrt{t}}\Big(\sqrt{\log(2M/\delta) + (\log\log d)_+} + \sqrt{d\big(\log(2M/\delta) + (\log\log d)_+\big)}\Big)$
(the exact condition is given in Lemma 5). Our confidence bound enjoys the same rate as the Lasso with fixed (offline) data, up to $\log\log d$ factors.
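As a toy illustration of Theorem 3 in action, the following NumPy sketch refits the group Lasso (1) on a growing, randomly explored history with a regularization schedule decaying like $\sqrt{\log M / t}$, and tracks the estimation error over time. The proximal-gradient solver, the exploration distribution, and all scaling constants are ad-hoc assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def group_prox(theta, tau, M, d):
    """Block soft-thresholding: proximal operator of tau * sum_j ||theta_j||_2."""
    blocks = theta.reshape(M, d)
    norms = np.maximum(np.linalg.norm(blocks, axis=1, keepdims=True), 1e-12)
    return (np.maximum(1.0 - tau / norms, 0.0) * blocks).reshape(-1)

def group_lasso(Phi, y, lam, M, d, n_iter=300):
    """Proximal gradient for (1/t)||y - Phi theta||_2^2 + 2*lam*sum_j ||theta_j||_2, cf. problem (1)."""
    t = Phi.shape[0]
    step = t / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    theta = np.zeros(M * d)
    for _ in range(n_iter):
        grad = (2.0 / t) * Phi.T @ (Phi @ theta - y)
        theta = group_prox(theta - step * grad, step * 2.0 * lam, M, d)
    return theta

rng = np.random.default_rng(1)
M, d, sigma = 20, 3, 0.1
theta_star = np.zeros(M * d)
theta_star[:d] = rng.standard_normal(d)              # only the block of the oracle model is active
theta_star /= np.linalg.norm(theta_star)

rows, ys, errors = [], [], []
for t in range(1, 201):
    phi_x = rng.uniform(-1.0, 1.0, size=M * d)       # stand-in for phi(x_t) under exploratory sampling
    phi_x /= np.linalg.norm(phi_x)
    rows.append(phi_x)
    ys.append(theta_star @ phi_x + sigma * rng.standard_normal())
    if t % 50 == 0:                                  # periodic refits keep this sketch cheap
        lam_t = 0.5 * np.sqrt(np.log(M) / t)         # lambda_t decaying like sqrt(log M / t); constant is ad hoc
        theta_hat = group_lasso(np.stack(rows), np.array(ys), lam_t, M, d)
        errors.append(round(float(np.linalg.norm(theta_hat - theta_star)), 3))
print(errors)                                        # estimation error typically shrinks as t grows
```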
We prove this theorem by constructing a self-normalized martingale sequence based on the $\ell_2$-norm of the empirical process error $\Phi_t^\top\varepsilon_t$. We then apply the stitched time-uniform boundary of Howard et al. [2021]. Appendix B elaborates on this technique. Previous work on sparse linear bandits also includes analyses of the Lasso in an online setting, where $x_t$ is $\mathcal{F}_t$-measurable. Cella and Pontil [2021] imitate the offline analysis and then apply a union bound over the time steps, which multiplies the width by $\log n$ and requires knowledge of the horizon. Bastani and Bayati [2020] also rely on knowledge of $n$ and employ a scalar-valued Bernstein inequality on the $\ell_\infty$-norm of the empirical process error, which inflates the width of the confidence sets by a factor of $d$. We work directly on the $\ell_2$-norm, and use a curved boundary for sub-Gamma martingale sequences, which according to Howard et al. [2021] is uniformly tighter than a Bernstein bound, especially for small $t$.

5.2 Discussion

In light of Theorem 1 and Theorem 3, we discuss some properties of ALEXP.

Sparse EXP4. Our approach presents a new connection between online learning and high-dimensional statistics. The rule for updating the probabilities in ALEXP is inspired by the Exponential weighting for Exploration and Exploitation with Experts (EXP4) algorithm, which was proposed by Auer et al. [2002b] and has been extensively studied in the adversarial bandit and learning-with-expert-advice literature [e.g., McMahan and Streeter, 2009, Beygelzimer et al., 2011]. EXP4 classically uses importance-weighted (IW) or ordinary least squares (LS) estimators to estimate the return $r_{t,j}$, both of which are unbiased but high-variance choices [Bubeck et al., 2012]. In particular, in our linearly parametrized setting, the variance of IW and LS scales with $Md$, which leads to a $\mathrm{poly}(M)$ regret. However, it is known that introducing bias can be useful if it reduces the variance [Zimmert and Lattimore, 2022]. For instance, EXP3-IX [Kocák et al., 2014] and EXP4-IX [Neu, 2015] construct a biased IW estimator. Equivalently, others craft regularizers for the reward of the online optimizer, seeking to improve the bias-variance balance [e.g., Abernethy et al., 2008, Bartlett et al., 2008, Abernethy and Rakhlin, 2009, Bubeck et al., 2017, Lee et al., 2020, Zimmert and Lattimore, 2022]. A key technical observation in this work is that our online Lasso estimator leads EXP4 to achieve sublinear regret which depends only logarithmically on $M$: while the estimator itself is $Md$-dimensional, its squared bias and variance scale with $d\log M$. To the best of our knowledge, this work is the first to instantiate the EXP4 algorithm with a sparse low-variance estimator.

Adaptive and Anytime. To estimate the reward, prior work on sparse bandits commonly emulates the Lasso analysis on offline data or on a martingale sequence of known length [Hao et al., 2020, Bastani and Bayati, 2020]. These works require a long enough sequence of exploratory samples, and knowledge of the horizon to plan this sequence. ALEXP removes both of these constraints, and presents a fully adaptive algorithm. Crucially, we employ the elegant martingale bounds of Howard et al. [2020] to present the first time-uniform analysis of the Lasso with history-dependent data (Theorem 3). This way we can use all the data points and explore with a probability which vanishes at a $\mathcal{O}(t^{-1/4})$ rate.
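For concreteness, the following small sketch spells out the anytime schedules of Theorem 1 as functions of $t$. The scaling constants and the closed form used for $C(M, \delta, d)$ (taken from the informal statement, up to constants) are placeholder assumptions that would be tuned in practice.

```python
import numpy as np

def alexp_schedules(t, M, d, delta=0.1, c_min=1.0, c_gamma=1.0, c_eta=1.0, c_lam=1.0):
    """Anytime hyperparameter schedules with the rates of Theorem 1 (constants are placeholders)."""
    log_term = np.log(M / delta) + max(np.log(np.log(d)), 0.0)   # (log log d)_+, assuming d > 1
    C = np.sqrt(1.0 + log_term) + np.sqrt(d * log_term)          # C(M, delta, d), up to constants
    gamma_t = min(1.0, c_gamma * t ** (-1.0 / 4.0))              # exploration probability, O(t^{-1/4})
    eta_t = c_eta * c_min / (np.sqrt(t) * C)                     # learning rate, O(C_min t^{-1/2} / C)
    lam_t = c_lam * c_min * C / np.sqrt(t)                       # Lasso regularization, O(C_min t^{-1/2} C)
    return gamma_t, eta_t, lam_t

for t in (1, 10, 100, 1000):
    print(t, alexp_schedules(t, M=100, d=5))
```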
Our anytime confidence bound for the Lasso, together with the horizon-independent analysis of the exponential weights algorithm, also allows ALEXP to be stopped at any time with valid guarantees.

Rate Optimality. For $M \geq n$, we obtain a $\mathcal{O}(\sqrt{n\log^3 M})$ regret, which matches the rate conjectured by Agarwal et al. [2017]. However, if $M$ is comparable to $n$ or smaller, our regret scales with $\mathcal{O}(n^{3/4}\log M)$, and while this is still sublinear and scales logarithmically with $M$, the dependency on $n$ is sub-optimal. This may be due to the conservative nature of our model selection analysis, during which we do not make assumptions about the dynamics of the base agents. Therefore, to ensure sufficiently diverse data for successful model selection, we need to occasionally choose exploratory actions with a vanishing probability $\gamma_t$. We conjecture that this is avoidable if we make more assumptions about the agents, e.g., that a sufficient number of agents can achieve sublinear regret if executed in isolation. Banerjee et al. [2023] show that the data collected by sublinear algorithms organically satisfies a minimum eigenvalue lower bound, which may also be sufficient for model selection. We leave this as an open question for future work.

[Figure 1: ALEXP can model-select in both orthogonal and correlated classes (M = 55). Two panels of cumulative regret vs. t: Lightly Correlated (p = 10, s = 2) and Highly Correlated (p = 10, s = 8); compared methods: Oracle UCB, ALEXP, Naive UCB, ETC, ETS, Corral.]
[Figure 2: ALEXP performs well on a large class (M = 165). Cumulative regret vs. t for the Many Models instance (p = 10, s = 3).]

6 Experiments

Experiment Setup. We create a synthetic dataset based on our data model (Section 3), and choose the domain to be the 1-dimensional interval $\mathcal{X} = [-1, 1]$. As a natural choice of features, we consider the set of degree-$p$ Legendre polynomials, since they form an orthonormal basis for $L^2(\mathcal{X})$ as $p$ grows unboundedly. We construct each feature map by choosing $s$ different polynomials from this set, thereby obtaining $M = \binom{p+1}{s}$ different models. More formally, we let $\phi_j(x) = (P_{j_1}(x), \dots, P_{j_s}(x)) \in \mathbb{R}^s$, where $\{j_1, \dots, j_s\} \subseteq \{0, \dots, p\}$ and $P_j$ denotes the degree-$j$ Legendre polynomial. To construct the reward function, we randomly sample $j^\star$ from $[M]$, and draw $\theta_{j^\star}$ from an i.i.d. standard Gaussian distribution, normalized so that $\lVert\theta_{j^\star}\rVert = 1$. When sampling from the reward, we add Gaussian noise with standard deviation $\sigma = 0.01$. Figure 5 in the appendix shows how the random reward functions may look. For all experiments we set $n = 100$, and plot the cumulative regret $R(n)$ averaged over 20 different random seeds; the shaded areas in all figures show the standard error across these runs.

Algorithms. We run two UCB algorithms: one with oracle knowledge of $j^\star$, and a naive one which takes into account all $M$ feature maps. We run Explore-then-Commit (ETC) by Hao et al. [2020], which explores for a horizon of $n_0$ steps, performs the Lasso once, and then selects actions greedily for the remaining steps. As another baseline, we introduce Explore-then-Select (ETS), which explores for $n_0$ steps and performs model selection using the sparsity pattern of the Lasso estimate; for the remaining steps, the policy switches to UCB, calculated based on the selected features. The performance of ETC and ETS depends highly on $n_0$, so we tune this hyperparameter separately for each experiment. We also run CORRAL as proposed by Agarwal et al. [2017], with UCB agents similar to ALEXP, and tune its hyper-parameters as well.
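To make the construction in the experiment setup concrete, here is a minimal NumPy sketch of the Legendre feature maps and the random reward. It follows the description in the text rather than the released code, and omits the normalization of the concatenated feature map for brevity.

```python
import numpy as np
from itertools import combinations
from numpy.polynomial.legendre import Legendre

p, s = 10, 2                                    # maximum degree and number of polynomials per model
basis = [Legendre.basis(j) for j in range(p + 1)]
subsets = list(combinations(range(p + 1), s))   # M = binom(p+1, s) models
M = len(subsets)                                # 55 for p = 10, s = 2

def phi_j(x, j):
    """Feature map of model j: its s selected Legendre polynomials evaluated at x."""
    return np.array([basis[k](x) for k in subsets[j]])

def phi(x):
    """Concatenated feature map phi(x) in R^{M*s}."""
    return np.concatenate([phi_j(x, j) for j in range(M)])

rng = np.random.default_rng(0)
j_star = rng.integers(M)                        # oracle model
theta_star = rng.standard_normal(s)
theta_star /= np.linalg.norm(theta_star)

def reward(x, sigma=0.01):
    """Noisy reward y = theta_star^T phi_{j*}(x) + noise."""
    return theta_star @ phi_j(x, j_star) + sigma * rng.standard_normal()

print(M, phi(0.3).shape, reward(0.3))
```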
To initialize ALEXP we set the rates of $\lambda_t$, $\gamma_t$ and $\eta_t$ according to Theorem 1, and perform a light hyper-parameter tuning to choose the scaling constants. We include the details and results of our hyper-parameter tuning in Appendix F.1. To solve (1), we use CELER, a fast solver for the group Lasso [Massias et al., 2018]. Whenever a UCB policy is used, we set the exploration coefficient $\beta_t = 2$, and whenever exploration is required, we sample according to $\pi = \mathrm{Unif}(\mathcal{X})$. Appendix F includes the pseudo-code for all baseline algorithms.¹

¹The Python code for reproducing the experiments is accessible at github.com/lasgroup/ALEXP.

Easy vs. Hard Cases. We construct an easy problem instance with $s = 2$, $p = 10$, and thus $M = 55$. The models are lightly correlated, since any two models have at most one Legendre polynomial in common. We also generate an instance with highly correlated feature maps, where $s = 8$ and $p = 10$. This is a harder problem: out of the total $M = 55$ models, there are 36 models which have at least 6 Legendre polynomials in common with the oracle model $j^\star$. Figure 1 shows that ALEXP is not only unaffected by the correlations between the models, but also achieves a performance competitive with the oracle in both cases, implying that our exponential weights technique for model selection is robust to the choice of features. ETC and ETS rely on the Lasso for model selection, which performs poorly in the case of correlated features. CORRAL uses log-barrier OMD with an importance-weighted estimator, which has significantly higher variance. The curve for CORRAL in Figures 1 and 2 is cropped since the regret values get very large; Figure 6 shows the complete results. We construct another hard instance (Fig. 2), where the model class is large ($s = 3$, $p = 10$, $M = 165$). ALEXP continues to outperform all baselines by a significant gap. The regret curves also make clear how explore-then-commit style algorithms are inherently horizon-dependent and may exhibit linear regret if stopped at an arbitrary time. This is not an issue for the other algorithms.

[Figure 3: ALEXP is hardly affected by increasing the number of models. Panels of cumulative regret vs. t for M = 45, 55, 66, 78, 91 (y-axes have different scales).]

Scaling with M. Figure 3 shows how well the algorithms scale as $M$ grows. For this experiment we set $s = 2$ and vary $p \in \{9, \dots, 13\}$. While increasing $M$ hardly affects ALEXP and Oracle UCB, the other baselines become less and less sample-efficient.

[Figure 4: ALEXP can rule out models without ever having queried them (M = 165). Panels: (a) the model selection distribution $q_t$ at t = 20, (b) the ratio of visited agents, (c) the probability of selecting the oracle agent.]

Learning Dynamics of ALEXP. Figure 4 gives some insight into the dynamics of ALEXP when $M = 165$. In particular, it shows how ALEXP can rule out sub-optimal agents without ever having queried them. Figure (a) shows the distribution $q_t$ at $t = 20$, which is roughly equal to the optimal $n_0$ for ETC in this configuration. The oracle model $j^\star$ is annotated with a star, and has the highest probability of selection. We observe that, already at this time step, more than 80% of the agents are practically ruled out, due to their small probability of selection. However, according to Figure (b), which shows $M_t$, the total number of visited models, less than 10% of the models have been queried at $t = 20$. This is the key practical benefit of ALEXP compared to black-box algorithms such as CORRAL. Lastly, Figure (c) shows how $q_{t,j^\star}$, the probability of selecting the oracle agent, changes with time.
While this probability is higher than that of the other agents, Figure (c) shows that qt,j is not exceeding 0.25, therefore there is always a probability of greater than 0.75 that we sample another agent, making ALEXP robust to hard problem instances where many agents perform efficiently. We conclude that ALEXP seems to rapidly recognize the better performing agents, and select among them with high probability. 7 Conclusion We proposed ALEXP, an algorithm for simultaneous online model selection and bandit optimization. As a first, our approach leads to anytime valid guarantees for model selection and bandit regret, and does not rely on a priori determined exploration schedule. Further, we showed how the Lasso can be used together with the exponential weights algorithm to construct a low-variance online learner. This new connection between high-dimensional statistics and online learning opens up avenues for future research on high-dimensional online learning. We established empirically that ALEXP has favorable exploration exploitation dynamics, and outperforms existing baselines. We tackled the open problem of Agarwal et al. [2017], and showed that log M dependency for regret is achievable for linearly parametrizable rewards. This problem remains open for more general, non-parametric reward classes. Acknowledgments We thank Jonas Rothfuss and Miles Wang-Henderson for their valuable suggestions regarding the writing. We thank Scott Sussex and Felix Schur for their thorough feedback. This research was supported by the European Research Council (ERC) under the European Union s Horizon 2020 research and Innovation Program Grant agreement no. 815943. Nicolas Emmenegger was supported by the Zeno Karl Schindler Foundation and the Swiss Study Foundation. Aldo Pacchiano would like to thank the support of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. This work was supported in part by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 2011. Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics. PMLR, 2012. Jacob Abernethy and Alexander Rakhlin. Beating the adaptive bandit with high probability. In 2009 Information Theory and Applications Workshop. IEEE, 2009. Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, 2008. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning. PMLR, 2014. Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory. PMLR, 2017. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science. IEEE, 1995. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 2002a. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 2002b. 
Debangshu Banerjee, Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Exploration in linear bandits with rich action sets and its implications for inference. In International Conference on Artificial Intelligence and Statistics. PMLR, 2023. Peter L Bartlett, Varsha Dani, Thomas P Hayes, Sham M Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Conference on Learning Theory, 2008. Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 2020. Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, 2011. Peter J. Bickel, Ya acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 2009. Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012. Sébastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. In ACM SIGACT Symposium on Theory of Computing, pages 72 85, 2017. Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011. Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics. PMLR, 2012. Leonardo Cella and Massimiliano Pontil. Multi-task and meta-learning with sparse linear bandits. In Conference on Uncertainty in Artificial Intelligence, 2021. Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006. Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning. PMLR, 2020. Dylan J Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. Advances in Neural Information Processing Systems, 2017. Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. Model selection for contextual bandits. Advances in Neural Information Processing Systems, 32, 2019. Sébastien Gerchinovitz. Sparsity regret bounds for individual sequences in online linear regression. In Conference on Learning Theory, 2011. Apostolos Giannopoulos. Notes on isotropic convex bodies, October 2003. Javier Gonzalez, Joseph Longworth, David C James, and Neil D Lawrence. Bayesian optimization for synthetic gene design. ar Xiv preprint, 2015. Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 2020. David Haussler, Jyrki Kivinen, and Manfred K Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 1998. Steven R. Howard, Aaditya Ramdas, Jon Mc Auliffe, and Jasjeet Sekhon. Time-uniform chernoff bounds via nonnegative supermartingales. Probability Surveys, 2020. Steven R Howard, Aaditya Ramdas, Jon Mc Auliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 2021. Mahdi Imani, Seyede Fatemeh Ghoreishi, Douglas Allaire, and Ulisses M Braga-Neto. Mfbo-ssm: Multi-fidelity bayesian optimization for fast inference in state-space models. 
In Proceedings of the AAAI Conference on Artificial Intelligence, 2019. Kyoungseok Jang, Chicheng Zhang, and Kwang-Sung Jun. Popart: Efficient sparse regression and experimental design for optimal sparse linear bandits. In Advances in Neural Information Processing Systems, 2022. Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for highdimensional regression. The Journal of Machine Learning Research, 2014. Mohammad Reza Karimi, Nezihe Merve Gürel, Bojan Karlaš, Johannes Rausch, Ce Zhang, and Andreas Krause. Online active model selection for pre-trained classifiers. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021. Parnian Kassraie, Jonas Rothfuss, and Andreas Krause. Meta-learning hypothesis spaces for sequential decision-making. In International Conference on Machine Learning. PMLR, 2022. Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics. PMLR, 2012. Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust lasso bandit. Advances in Neural Information Processing Systems, 32, 2019. Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. Advances in Neural Information Processing Systems, 2014. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in Neural Information Processing Systems, 2020. Wenjie Li, Adarsh Barik, and Jean Honorio. A simple unified framework for high dimensional bandit problems. In International Conference on Machine Learning. PMLR, 2022. Xuefeng Liu, Fangfang Xia, Rick L Stevens, and Yuxin Chen. Cost-effective online contextual model selection. ar Xiv preprint, 2022. Karim Lounici, Massimiliano Pontil, Sara Van De Geer, and Alexandre B Tsybakov. Oracle inequalities and optimal inference under group sparsity. The annals of statistics, 2011. Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. Corralling a larger band of bandits: A case study on switching regret for linear bandits. In Conference on Learning Theory. PMLR, 2022. Odalric-Ambrym Maillard and Rémi Munos. Adaptive bandits: Towards the best history-dependent strategy. In International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011. Mathurin Massias, Alexandre Gramfort, and Joseph Salmon. Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, 2018. H. Brendan Mc Mahan and Matthew Streeter. Tighter bounds for multi-armed bandits with expert advice. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009. Ahmadreza Moradipari, Berkay Turan, Yasin Abbasi-Yadkori, Mahnoosh Alizadeh, and Mohammad Ghavamzadeh. Feature and parameter selection in stochastic linear bandits. In International Conference on Machine Learning. PMLR, 2022. Vidya Muthukumar, Mitas Ray, Anant Sahai, and Peter Bartlett. Best of many worlds: Robust model selection for online supervised learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2019. Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 2015. 
Min-hwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning. PMLR, 2021. Aldo Pacchiano, My Phan, Yasin Abbasi Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 2020. Aldo Pacchiano, Drausin Wulsin, Robert A Barton, and Luis Voloch. Neural design for genetic perturbation experiments. ar Xiv preprint, 2022. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007. Felix Schur, Parnian Kassraie, Jonas Rothfuss, and Andreas Krause. Lifelong bandit optimization: No prior and no regret. In Conference on Uncertainty in Artificial Intelligence, 2023. Adish Singla, Hamed Hassani, and Andreas Krause. Learning to interact with learning agents. In AAAI Conference on Artificial Intelligence, 2018. William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996. Doniyor Ulmasov, Caroline Baroukh, Benoit Chachuat, Marc Peter Deisenroth, and Ruth Misener. Bayesian optimization with dimension scheduling: Application to biological systems. In Computer Aided Chemical Engineering. Elsevier, 2016. Sara A. van de Geer and Peter Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 2009. Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge university press, 2019. Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory. PMLR, 2022. Contents of Appendix A Extended Literature Review 14 B Time Uniform Lasso Analysis 15 C Results on Exploration 18 C.1 ALEXP with Uniform Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 20 C.2 Proof of Results on Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 D Proof of Regret Bound 23 D.1 Proof of Model Selection Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 D.2 Proof of Virtual Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E Time-Uniform Concentration Inequalities 32 F Experiment Details 34 F.1 Hyper-Parameter Tuning Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 A Extended Literature Review The sparse linear bandit literature considers linear reward functions of the form θ x, where x Rp, however a sub-vector of size d is sufficient to span the reward function. This can be formulated as model selection among M = p d different linear parametrizations, where each ϕj is a d-dimensional feature map. We present the bounds in terms of d and M for coherence with the rest of the text, assuming that M = O(p), which is the case when d p. Table 2 compares recent work on sparse linear bandits based on a number of important factors. 
In this table, the ETC algorithms follow the general format of exploring, performing parameter estimation once at t = n0, and then repeatedly suggesting the same action which maximizes ˆθ n0ϕ(x). Explorethen-(ϵ)Greedy takes a similar approach, however it does not settle on ˆθn0, rather it continues to update the parameter estimate and select xt = arg max ˆθ t ϕ(x). The UCB algorithms iteratively update upper confidence bound, and choose actions which maximize them. The regret bounds in Table 2 are simplified to the terms with largest rate of growth, the reader should check the corresponding papers for rigorous results. Some of the mentioned bounds depend on problem-dependent parameters (e.g. c K), which may not be treated as absolute constants and have complicated forms. To indicate such parameters we use τ in Table 2, following the notation of Hao et al. [2020]. Note that τ varies across the rows of the table, and is just an indicator for existence of other terms. Abbasi-Yadkori et al. [2012] use the SEQSEW online regression oracle [Gerchinovitz, 2011] for estimating the parameter vector, together with a UCB policy. The regression oracle is an exponential weights algorithm, which runs on the squared error loss. This subroutine, and thereby the algorithm proposed by Abbasi-Yadkori et al. [2012] are not computationally efficient, and this is believed to be unavoidable. This work considers the data-rich regime and shows R(n) = O( d Mn), matching the lower bound of Theorem 24.3 in Lattimore and Szepesvári [2020]. Carpentier and Munos [2012] assume that the action set is a Euclidean ball, and that the noise is directly added to the parameter vector, i.e. yt = x (θ + εt). Roughly put, linear bandits with parameter noise are easier to solve than stochastic linear bandits with reward noise, since the noise is scaled proportionally to the features xi and does less damage [Chapter 29.3 Lattimore and Szepesvári, 2020]. In this setting, Carpentier and Munos [2012] present a O(d n) regret bound. Recent work considers contextual linear bandits, where at every step At, a stochastic finite subset of size K from A, is presented to the agent. It is commonly assumed that members of At are i.i.d., and the sampling distribution is diverse and time-independent. The diversity assumption is often in the form of a restricted eigenvalue condition (Definition 2) on the covariance of the context distribution [e.g. in, Kim and Paik, 2019, Bastani and Bayati, 2020]. Li et al. [2022] require a stronger condition which directly assumes that λmin(Φt) the minimum eigenvalue of the empirical covariance matrix is lower bounded. This is generally not true, but may hold with high probability. Hao et al. [2020] assume that the action set spans Rd M. We believe that this assumption is the weakest in the literature, and conjecture that it is necessary for model selection. If not met, the agent can not explore in all relevant directions, and may not identify the relevant features. Our diversity assumption is similar to Hao et al. [2020], adapted to our problem setting. Mainly, we consider reward functions which are linearly parametrizable, i.e. θ ϕ(x), as oppose to linear rewards, i.e. θ x. A key distinguishing factor between ALEXP and existing work on sparse linear bandit is that ALEXP is horizon-independent and does not rely on a forced exploration schedule. 
As shown on Table 2, majority of prior work relies either on an initial exploration stage, the length of which is determined according to n [e.g., Carpentier and Munos, 2012, Kim and Paik, 2019, Li et al., 2022, Hao et al., 2020, Jang et al., 2022], or on a hand crafted schedule, which is again designed for a specific horizon [Bastani and Bayati, 2020]. Oh et al. [2021], which analyzes K-armed contextual bandits, does not require explicit exploration, and instead imposes restrictive assumptions on the diversity of context distribution, e.g. relaxed symmetry and balanced covariance. Regardless, the regret bounds hold in expectation, and are not time-uniform. Table 2: Overview of recent work on high-dimensional Bandits. Parameter τ shows existence of other problem-dependent terms which are not constants, and varies across different rows. The regret bounds are simplified and are not rigorous. |At| datapoor action selection policy context or action assumpt. Abbasi-Yadkori et al. UCB EXP4 on Sqrd error A is compact d Mn, w.h.p. Foster et al. K UCB EXP4 on Sqrd error λmin(Σ) cλ (Mn)3/4K1/4 + Kd Mn w.h.p Carpentier and Munos UCB Hard Thresh. A is a ball param. noise d n, w.h.p. Bastani and Bayati K Explore then Greedy Lasso κ(Σ) > c K τKd2(log n + log M)2 w.h.p. Kim and Paik K Explore then ϵ-Greedy Lasso κ(Σ) > c K τd n log(Mn), w.h.p. Oh et al. K Greedy Lasso κ(Σ) > cκ + other assums. n log(Mn) in expectation Li et al. K ETC Lasso λmin(ˆΣ) > cλ τ(n2d)1/3 log Mn in expectation Hao et al. ETC Lasso A spans Rd M + is compact (nd C 1 min)2/3(log M)1/3 w.h.p. Jang et al. ETC Hard Thresh. A [ 1, 1]Md + spans RMd (nd)2/3(C 1 min log M)1/3 w.h.p. ALEXP (Ours) Greedy or UCB EXP4 on reward est. Im(ϕj) spans Rd A is compact n log M(n1/4 + C 1 min log M) w.h.p B Time Uniform Lasso Analysis We start by showing that the sum of squared sub-gaussian variables is a sub-Gamma process (c.f. Definition 22). Lemma 4 (Empirical Process is sub-Gamma). For t 1, suppose ξt are a sequence conditionally standard sub-Gaussians adapted to the filtration Ft = σ(ξ1, . . . , ξt). Let vt R, and Zt := ξ2 t 1. Define the processes St := Pt i=1 Zivi and Vt := 4 Pt i=1 v2 i . Then (St) t=0 is sub-Gamma with variance process (Vt) t=0 and scale parameter c = 4 maxi 1 vi. Proof of Lemma 4. By definition [c.f. Definition 1, Howard et al., 2021], St is sub-Gamma if for each λ [0, 1/c), there exists a supermartingale (Mt(λ)) t=0 w.r.t. Ft, such that E M0 = 1 and for all t 1: We show the above holds in equality by proving that the left hand side itself, is a supermartingale w.r.t. Ft. We define, Mt(λ) := exp{λSt λ2Vt/2(1 cλ)}, therefore, E [Mt|Ft 1] E exp λSt 1 λ2 2(1 cλ)Vt 1 + λZtvt 2λ2v2 t (1 cλ) = E [Mt 1|Ft 1]E [exp (λZtvt) |Ft 1] exp 2λ2v2 t 1 cλ = Mt 1E [exp (λZtvt) |Ft 1] exp 2λ2v2 t 1 cλ Note that Zt is Ft 1-measurable, conditionally centered and conditionally sub-exponential with parameters (ν, α) = (2, 4) (c.f. Vershynin [2018, Lemma 2.7.6] and Wainwright [2019, Example 2.8]). Therefore, for λ < 1/c, E [exp (λvt Zt) |Ft 1] exp 2λ2v2 t exp 2λ2v2 t 1 cλ where the last inequality holds due to the fact that 0 1 cλ < 1. Therefore, E [Mt|Ft 1] Mt 1 exp 2λ2v2 t 1 cλ exp 2λ2v2 t 1 cλ for λ [0, 1/c), concluding the proof. We now construct a self-normalizing martingale sequence based on ℓ2-norm of the empirical process error term, and recognize that it is a sub-gamma process. We then employ our curved Bernstein bound Lemma 25 to control the boundary. 
This step will allow us to ignore the empirical process error term later in the lasso analysis. Lemma 5 (Empirical Process is dominated by regularization.). Let Aj = t 1 : (Φ t εt)j 2/t λt/2 . Then, for any 0 δ < 1, the event A = M j=1Aj happens with probability 1 δ, if for all t 1, d (log(2M/δ) + (log log d)+) + 12 2 (log(2M/δ) + (log log d)+). Proof of Lemma 5. This proof includes a treatment of the empirical process similar to Lemma B.1 in Kassraie et al. [2022], but adapts it to our time-uniform setting. Since εi are zero-mean sub-gaussian variables, as driven in Lemma 3.1 [Lounici et al., 2011], it holds that Ac j = t : 1 t2 εT t Φt,jΦ t,jεt λ2 t : Pt i=1 vi(ξ2 i 1) where ξi are sub-gaussian variables with variance proxy 1, scalar vi denotes the i-th eigenvalue of Φt,jΦ t,j/t with the concatenated vector vt = (v1, . . . , vt), and αt,j = t2λ2/(4σ2) tr(Φ t,jΦt,j) 2 Φ t,jΦt,j Fr . We can apply Lemma 25 to control the probability of event Ac j by tuning λ. Mainly, for Ac j to happen with probability less than δ/M, Lemma 25 states that the following must hold for all t, 2 vt 2αt,j 5 max n 4 vt 2 2, 1 o ωδ/M( vt 2) + 12ωδ/M( vt 2) max t 1 vt (4) Recall that w.l.o.g. feature maps are bounded everywhere ϕj( ) 2 1, and rank(Φj) d which allows for the following matrix inequalities, tr(Φ t,jΦt,j) = tr(Φt,jΦ t,j) = i=1 ϕ j (xi)ϕj(xi) t Φt,jΦ t,j tr(Φt,jΦ t,j) t Φt,jΦ t,j Φt,jΦ t,j Fr d Φt,jΦ t,j t Therefore, vt = Φt,jΦ t,j Fr/t d, max t 1 vt = max t 1 Φt,jΦ t,j /t 1. For Eq. (4) to hold, is suffices that for all t 1, 4d (log(2M/δ) + (log log d)+) + 12 2 (log(2M/δ) + (log log d)+). Therefore, if λt are chosen to satisfy the above inequality, each Ac j happens with probability less than δ/M. Then by applying union bound, M j=1Ac j happens with probability less than δ. Proof of Theorem 3. The theorem statement requires that the regularization parameter λt is chosen such that condition of Lemma 5 is met, and therefore event A happens with probability 1 δ. Throughout this proof, which adapts the analysis of Theorem 3.1. Lounici et al. [2011] to the time-uniform setting, we condition on A happening, and later incorporate the probability. Step 1. Let ˆθt be a minimizer of L and θ be the true coefficients vector, then L(ˆθt; Ht, λt) L(θ; Ht, λt). Writing out the loss and re-ordering the inequality we obtain, Φt(ˆθt θ) 2 t εT t Φt(ˆθt θ) + 2λt θj 2 ˆθt,j 2 which is often referred to as the Basic inequality [Bühlmann and Van De Geer, 2011]. By Cauchy Schwarz and conditioned on event A, εT t Φt(ˆθt θ) (ΦT t εt)j 2 ˆθt,j θj 2 tλ then adding λt PM j=1 ˆθt,j θj 2 to both sides, applying the triangle inequality, and recalling from Section 3 that θj = 0 for j = j gives Φt(ˆθt θ) 2 ˆθt,j θj 2 2λt ˆθt,j θj 2 + 2λt θj 2 ˆθt,j 2 4λt ˆθt,j θj 2. Since each term on the left hand side is positive, then each is also individually smaller than the right hand side, and we obtain, Φt(ˆθt θ) 2 2 4λt ˆθt,j θj 2 (5) ˆθt,j θj 2 3 ˆθt,j θj 2 (6) Step 2. Consider a sequence (c1, . . . , ck, . . . ), where c1 ck . . . , then Define J1 = {j } and J2 = {j , j } where j = arg max j [M] j =j ˆθt,j θj 2. For any J [M] the complementing set is denoted as Jc = [M] \ J. For simplicity let cj = ˆθt,j θj 2, and let π(k) denote the index of the k-th largest element of {cj : j Jc 1}. By definition of Jc 2 we have, k>1 π(k) Jc 1 k>1 π(k) Jc 1 i Jc 1 ci)2 i Jc 1 ci 2 X k>1 π(k) Jc 1 9(c2 j + c2 j ) = 9 X which, in turn, gives Step 3. On the other hand, due to (6), and by definition of J2 it also holds that X ˆθt,j θj 2 3 X ˆθt,j θj 2. 
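As a small numerical illustration of the quantities appearing in this section, the sketch below estimates the empirical covariance matrix (11) under uniform exploration and reports its entrywise distance to $\Sigma(\pi, \phi)$ (the quantity bounded in Lemma 13) together with its minimum eigenvalue. The toy feature map is an illustrative assumption, not the Legendre features used elsewhere in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Toy concatenated feature map on X = [-1, 1] (illustrative only)."""
    return np.array([1.0, x, x ** 2, x ** 3, np.sin(np.pi * x), np.cos(np.pi * x)]) / np.sqrt(6.0)

# Population covariance Sigma(pi, phi) for pi = Unif([-1, 1]), approximated on a fine grid.
grid = np.linspace(-1.0, 1.0, 20001)
Sigma = np.mean([np.outer(phi(x), phi(x)) for x in grid], axis=0)

# Empirical covariance (11) from t exploratory samples, and the quantities of Lemma 13 / C_min.
for t in (50, 500, 5000):
    xs = rng.uniform(-1.0, 1.0, size=t)
    Sigma_hat = np.mean([np.outer(phi(x), phi(x)) for x in xs], axis=0)
    d_inf = np.max(np.abs(Sigma - Sigma_hat))          # entrywise distance d_inf(Sigma, Sigma_hat)
    lam_min = np.linalg.eigvalsh(Sigma_hat).min()      # stays bounded away from zero as t grows
    print(t, round(float(d_inf), 4), round(float(lam_min), 4))
```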
From the theorem assumptions and Definition 2, we know that there exists 0 < κ(Φt, 2), therefore by Definition 2, the feature matrix Φt satisfies, X 2 1 tκ2(Φt, 2) Φt(ˆθt θ) 2 (5) 1 κ2(Φt, 2)4λt ˆθt,j θj 2 1 κ2(Φt, 2)4λt From here, by applying (8) we get, ˆθt θ 2 4 10λt κ2(Φt, 2). If λt are chosen according to Lemma 5, event A and, in turn, the inequality above hold with probability greater than 1 δ. C Results on Exploration In this section we present lower-bounds on the eigenvalues of the covariance matrix ΦtΦ t , as it is later used in our regret analysis. In particular, we show that the feature matrix Φt satisfies the restricted eigenvalue condition (Definition 2) required for valid Lasso confidence set (Theorem 3), and calculate a lower bound on κ(Φt, 2). The lower bound is later used by Lemma 19 and Lemma 20 to develop the model selection regret. We show this bound in three steps. Equivalent to Definition 2, we write κ(Φt, s) = infb Ξs Φtb 2/ Ξs := n b Rd\{0} P j / J bj 2 3 P j J bj 2, q P j J bj 2 2 1 s.t. J {1, . . . , M}, |J| s. o . (9) For simplicity in notation, we further define κ(A, s) := min b Ξs b Ab. (10) since κ( Φ t Φt t , s) = κ2(Φt, s). Step I. Consider the exploratory steps at which αt = 1. Let Φπ,t be a sub-matrix of Φt where only rows from exploratory steps are included. Note that Φπ,t Rt d M is a random matrix, where the number of rows t are also random. We show that κ2(Φt, s) is lower bounded by κ2(Φt,π, s). Lemma 6. Suppose Φπ,t has t rows. Then, κ2(Φt, s) t t κ2(Φπ,t, s) Proof of Lemma 6. Let Ψ(t) = ϕ(xt)ϕ (xt) Rd M d M for all t = 1, . . . , n. Note that Ψ(t) is positive semi-definite by construction. We have, s=1 b Ψ(s)b = s Tπ b Ψ(s)b + s/ Tπ b Ψ(s)b where the set Tπ contains the indices of the exploratory steps at which the action is selected according to π. Therefore, κ2(Φt, s) = 1 t min b Ξs Φtb 2 2 = min b Ξs 1 s Tπ b Ψ(s)b + 1 s/ Tπ b Ψ(s)b s Tπ b Ψ(s)b where the last inequality holds due to Ψ(s) being PSD. Then we have, κ2(Φt, s) min b Ξs b 1 s Tπ Ψ(s) ! t κ2(Φπ,t, s) While the number of rows of Φπ,t is a random variable, we continue to condition on the event that Φπ,t has t rows, and investigate the distribution of its restricted eigenvalues. Step II. The restricted eigenvalues of the exploratory submatrix are well bounded away from zero. Lemma 7. Let π be the solution to (2), and s N. Suppose Φπ,t has t rows. Then for all δ > 0, P t : κ2(Φπ,t, s) κ(Σ, s) 80s log(2Md/δ) + (log log 4t )+ where Σ = Σ(π, ϕ) := Ex πϕ(x)ϕ (x) and κ is defined in (10). Step III. Remains to combine the two above lemmas and incorporate a high probability bound on t , showing that it is close to Pt s=1 γs. Lemma 8. There exist absolute constants C1, C2 which satisfy, P t 1 : κ2(Φt, 2) C1 κ(Σ, 2)t 1/4 C2t 5/8p log(Md/δ) + (log log t)+ 1 δ if γt = O(t 1/4). Let π be the solution to (2), then it further holds that P t 1 : κ2(Φt, 2) C1Cmint 1/4 C2t 5/8p log(Md/δ) + (log log t)+ 1 δ The regret analysis of Hao et al. [2020] also relies on connecting κ(Φt, s) to Cmin, and for this, they use Theorem 2.4 of Javanmard and Montanari [2014]. This theorem states that there exists a problem-dependent constant C1 for which κ2(Φt, s) C1Cmin with high probability, if t n0 and roughly n0 = O( 3p n2 log M). We highlight that Lemma 8, presents a lower-bound which holds for all t 1, however this comes at the cost of getting a looser lower bound than the result of Javanmard and Montanari [2014] for the larger time steps t. 
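(A remark on computing the quantity appearing in these bounds: $\kappa^2(\Phi_t, s)$ is an infimum over the cone $\Xi_s$ of Eq. (9) and is not directly computable, but it can be probed numerically. The sketch below is our own diagnostic, assuming the features of group $j$ occupy the $j$-th consecutive block of size $d$; since it only samples the cone at random, it returns an optimistic upper estimate of $\kappa^2$, useful for checking the order of magnitude of the lower bounds above on simulated data, not as a certified bound.)

```python
import numpy as np

def kappa_sq_upper_estimate(Phi, d, M, s=2, n_samples=20000, seed=0):
    """Monte Carlo search over the cone Xi_s of Eq. (9).  Returns the smallest sampled
    value of b^T (Phi^T Phi / t) b with sqrt(sum_{j in J} ||b_j||^2) = 1, which upper
    bounds kappa^2(Phi, s) since the true quantity is an infimum over the whole cone."""
    rng = np.random.default_rng(seed)
    t = Phi.shape[0]
    Sigma_hat = Phi.T @ Phi / t
    best = np.inf
    for _ in range(n_samples):
        J = rng.choice(M, size=s, replace=False)
        b = np.zeros(M * d)
        for j in J:                                   # random direction on the active groups J
            b[j * d:(j + 1) * d] = rng.standard_normal(d)
        b /= np.linalg.norm(b)                        # enforce sqrt(sum_{j in J} ||b_j||^2) = 1
        budget = 3 * sum(np.linalg.norm(b[j * d:(j + 1) * d]) for j in J)
        off = rng.standard_normal(M * d)
        off[np.concatenate([np.arange(j * d, (j + 1) * d) for j in J])] = 0.0
        norms = sum(np.linalg.norm(off[j * d:(j + 1) * d]) for j in range(M))
        off *= rng.uniform(0.0, budget) / norms       # sum_{j not in J} ||b_j|| <= 3 sum_{j in J} ||b_j||
        v = b + off
        best = min(best, v @ Sigma_hat @ v)
    return best
```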
In fact, due to the sub-optimal dependency of Lemma 8 on t, we later obtain sub-optimal dependency on the horizon for the case when n M. It is unclear to us if this rate can be improved without assuming knowledge of n, or that n n0. For the last lemma in this section we show that the empirical sub-matrices Φt,j are also bounded away from zero. This will be required later to prove Lemma 15. Lemma 9 (Base Model λmin Bound). Assume π is the maximizer of Eq. (2). Then, with probability greater than 1 δ, simultaneously for all j = 1, . . . , M and t 1, λmin(Φ t,jΦt,j) C1Cmint3/4 C2t3/8p log(Md/δ) + (log log t)+ if γt = O(t 1/4). C.1 ALEXP with Uniform Exploration We presented our main regret bound (Theorem 1) in terms of Cmin, which only depends on properties of the feature maps and the action domain. We give a lower-bound on Cmin for a toy scenario which corresponds to the problem of linear feature selection over convex action sets. Proposition 10 (Invertible Features). Suppose ϕ(x) := Ax : Rd Rd is an invertible linear map, and X Rd is a convex body. Then, Cmin λmin(A) λ2max(T) > 0 where T is the transformation which maps X to an isotropic body. The lower-bound of Proposition 10 is achieved by simply exploring via π = Unif(X). Inspired by Schur et al. [2023, Lemma E.13], we show that even for non-convex action domains and orthogonal feature maps, the uniform exploration yields a constant lower-bound on restricted eigenvalues. Proposition 11 (Orthonormal Features). Suppose ϕj : X R are chosen from an orthogonal basis of L2(X), and satisfy ϕi L2µ(X)/Vol(X) 1. Then there exist absolute constants C1 and C2 for which the exploration distribution π = Unif(X) satisfies P t 1 : κ2(Φt, 2) C1t 1/4 C2t 5/8p log(Md/δ) + (log log t)+ 1 δ. The d = 1 condition is met without loss of generality, by splitting the higher dimensional feature maps and introducing more base features, which will increase M. Moreover, the orthonormality condition is met by orthogonalizing and re-scaling the feature maps. Basis functions such as Legendre polynomials and Fourier features [Rahimi et al., 2007] satisfy these conditions. By invoking Proposition 11, instead of Lemma 8 in the proof of Theorem 1, we obtain the regret of ALEXP with uniform exploration. Corollary 12 (ALEXP with Uniform Exploration). Let δ (0, 1]. Suppose ϕj : X R are chosen from an orthogonal basis of L2(X), and satisfy ϕi L2 µ(X)/Vol(X) 1. Assume the oracle agent employs a UCB or a Greedy policy, as laid out in Section 5. Choose ηt = O(1/ t C(M, δ, d)) and γt = O(t 1/4) and λt = O(C(M, δ, d)/ t), then ALEXP with uniform exploration π = Unif(X) attains the regret R(n) = O Bn3/4 + n C(M, δ, d) log M + B2 n + B p n ((log log n B2)+ + log(1/δ)) + (n3/4 + log n)C(M, δ, d) + n5/8p d log n + log(1/δ) + B2 with probability greater than 1 δ, simultaneously for all n 1. Here, C(M, δ, d) = O q d (log(M/δ) + (log log d)+) + (log(M/δ) + (log log d)+) . C.2 Proof of Results on Exploration As an intermediate step, we consider the restricted eigenvalue property of the empirical covariance matrix. Given t samples, the empirical estimate of Σ is s=1 ϕ(xs)ϕ (xs) (11) where xs are sampled according to π. We show that every entry of ˆΣt is close to the corresponding entry in Σ, and later use it in the proofs of eigenvalue lemmas. Lemma 13 (Anytime Bound for The Entries of Empirical Covariance Matrix). Let ˆΣt be the empirical covariance matrix corresponding to Σ(π, ϕ) given t samples. Then, P t : d (Σ, ˆΣt ) 5 ((log log 4t )+ + log(2Md/δ)) δ where d (A, B) := maxi,j|Ai,j Bi,j|. 
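(A small numerical illustration of the setting of Proposition 11, before the proofs of this section: for an orthonormal dictionary under uniform exploration, the exploration covariance $\Sigma(\pi, \phi)$ is approximately the identity, so $\kappa(\Sigma, s) \ge \lambda_{\min}(\Sigma) \approx 1$. The basis, sample size, and constants below are illustrative choices of ours.)

```python
import numpy as np

def fourier_features(x, m):
    """Orthonormal Fourier basis on X = [0, 2*pi] w.r.t. the uniform measure:
    sqrt(2)*cos(kx), sqrt(2)*sin(kx) for k = 1..m, i.e. M = 2m one-dimensional base features."""
    feats = []
    for k in range(1, m + 1):
        feats.append(np.sqrt(2) * np.cos(k * x))
        feats.append(np.sqrt(2) * np.sin(k * x))
    return np.stack(feats, axis=-1)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2 * np.pi, size=200_000)   # pi = Unif(X)
Phi = fourier_features(x, m=5)
Sigma = Phi.T @ Phi / len(x)                    # Monte Carlo estimate of Sigma(pi, phi)
print(np.round(Sigma, 2))                       # approximately the identity matrix
print(np.linalg.eigvalsh(Sigma).min())          # close to 1, so kappa(Sigma, s) >= lambda_min ~ 1
```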
Proof of Lemma 13. We show the element-wise convergence of Σ to ˆΣt for the (i, j) entry where i, j = 1, . . . , d M. Consider the random sequence Xs := Σi,j ϕi(xs)ϕj(xs). We show that X1, . . . , Xn satisfies conditions of Lemma 26. We first observe that E[Xs|X1:s 1] = EXs = Σ(i, j) Ex πϕi(x)ϕj(x) = 0 since by definition Σi,j = Ex πϕi(x) ϕj(x). Moreover, we have normalized features ϕ( ) 1, therefore, each entry ϕi( )ϕj( ) is also bounded, yielding |Xs| 2. Then Lemma 26 implies that for all δ > 0, r (log log 4t )+ + log(2/ δ) Setting δ = δ/(d M) and taking a union bound over all indices concludes the proof. We are now ready to present the proofs to the lemmas in Appendix C. Proof of Lemma 7. By (11) we have ˆΣt = Φ π,tΦπ,t t , and thereby κ2(Φπ,t, s) = min b Ξs b ˆΣt b = κ(ˆΣt , s). Inspired by Lemma 10.1 in van de Geer and Bühlmann [2009], we show that element-wise closeness of matrices Σ and ˆΣt (c.f. Lemma 13) implies closeness in κ: κ2(Φπ,t, s) κ(Σ, s) = κ(ˆΣt , s) κ(Σ, s) = κ ˆΣt Σ, s min b Ξs d (Σ, ˆΣt ) b 2 1 where the last line holds due to Hölder s. Moreover, since b Ξs, for any J [d M] where |J| s it additionally holds that b J 2 1 and b 1 (1 + 3) b J 1 4 s b J 2 4 s which gives, κ2(Φπ,t, s) κ(Σ, s) 16sd (Σ, ˆΣt ). Therefore by Lemma 13, κ2(Φπ,t, s) κ(Σ, s) 80s ((log log 4t )+ + log(2Md/δ)) (12) with probability greater than 1 δ, simultaneously for all t 1. Proof of Lemma 8. In Lemma 6 we showed that κ2(Φt, s) t t κ2(Φπ,t, s) where t indicates the number of rows in the exploratory sub-matrix of Φt. Recall that t = Pt s=1 αs where αs are i.i.d Bernoulli random variables with success probability of γs. Due to Lemma 24, P ( t 1 : |t Γt| t) 1 δ/2 (13) (log log t)+ + log(8/δ) Due to Lemma 7, with probability greater than 1 δ/2 the following holds for all t 1 κ2(Φt, 2) t t κ(Σ, 2) 160 (log log 4t )+ + log(4Md/δ) t κ(Σ, 2) 160 (log log (4Γt + t))+ + log(4Md/δ) where the second inequality holds with probability 1 δ, by incorporating (13) and taking a union bound. For the rest of the proof and to keep the calculations simple, we ignore the values of the absolute constants. We use the notation g(t) = o(f(t)) to show that f(t) grows much faster than g(t). More formally, if for every constant c there exists t0, where g(t) c|f(t)| for all t t0. Since γs = O(s 1/4) there exists C such that Γt = Ct3/4, then it is straightforward to observe that there exists absolute constants Ci which satisfy, κ2(Φt, 2) C1t 1/4 κ(Σ, 2) 5t 3/2 κ(Σ, 2) (log log t)+ + log(8/δ) log(Md/δ) + (log log t)+ o t 5/8p log(Md/δ) + (log log t)+ C1t 1/4 κ(Σ, 2) C3t 5/8p log(Md/δ) + (log log t)+ The last inequality holds since t 3/2 log log t = o(t 5/8 log log t). The above chain of inequalities imply that there exist absolute constants C1, C2, for which P t 1 : κ2(Φt, 2) C1 κ(Σ, 2)t 1/4 C2t 5/8p log(Md/δ) + (log log t)+ 1 δ. If π is chosen according to (2), then κ(Σ, 2) Cmin yielding the lemma s second argument. Proof of Lemma 9. Fix j {1, . . . , M}, and construct the set Ξ1,j = n b Rd \ {0} b = (b1, . . . , b M), s.t. bj Rd, bj 2 1 and j = j : bj = 0 o . Note that Ξ1,j Ξs. Therefore, inf b Ξ1,j Φtb 2 inf b Ξs Φtb 2 = Moreover, by construction of Ξ1,j we have for all b Ξ1,j that Φtb = Φt,jbj, therefore, inf b Ξ1,j Φtb 2 2 = inf bj Rd Φt,jbj 2 2 = λmin(Φ t,jΦt,j). From the above equations we conclude that λmin(Φ t,jΦt,j) tκ2(Φt, s), for all j = 1, . . . , M. Therefore, using Lemma 8 we obtain that there exists C1, C2 such that P t 1, j = 1, . . . 
, M : λmin(Φ t,jΦt,j) C1Cmint3/4 C2t3/8p log(Md/δ) + (log log t)+ 1 δ Proof of Proposition 10. Since X is a convex body, then there exists an invertible map T, such that T(X) is an isotropic body [e.g. Proposition 1.1.1., Giannopoulos, 2003]. Then by definition, X Unif(T(X)) is an isotropic distribution and Cov( X) = Id [e.g., c.f. Chapter 3.3.5 Vershynin, 2018]. Since ϕ is linear and invertible, it may be written is as ϕ(x) = Ax, where A is an invertible matrix. Therefore, Σ(π, ϕ) = Cov(ϕ(X)) = A Cov(X)A = A Cov T 1 X A = A (T 1)2A. As for the minimum eigenvalue, suppose v Rd and v = 1, then Cmin λmin (Σ(π, ϕ)) v A (T 1)2Av Av 2λmin(T 2) = Av 2 λ2max(T) λmin(A) Proof of Proposition 11. By the assumption of the proposition, for all i M [Σ(π, ϕ)]i,i = Ex πϕ2 i (x) = 1 Vol(X) X ϕ2 i (x)dµ(x) 1 and for all i = j, [Σ(π, ϕ)]i,j = Ex πϕi(x)ϕj(x) = 1 Vol(X) X ϕi(x)ϕj(x)dµ(x) = 0 We use Σ = Σ(π, ϕ). For any b RMd where b 1, i,j [M] b j Σi,jbi = X i [M] b i Σi,ibi + X i,j [M],i =j b j Σi,jbi i [M] b i Σi,ibi i [M] bi 2 2 1. Which implies, κ(Σ, s) = min b Ξs b Σb 1. By Lemma 8, there exist absolute constants C1 and C2 for which, P t 1 : κ2(Φt, 2) κ(Σ, 2)C1t 1/4 C2t 5/8p log(Md/δ) + (log log t)+ 1 δ. concluding the proof. D Proof of Regret Bound Theorem 14 (Anytime Regret, Formal). Let δ (0, 1] and π be the maximizer of (2). Assume the oracle agent employs a UCB or a Greedy policy, as laid out in Section 5. Suppose ηt = O(Cmint 1/2C(M, δ, d)) and γt = O(t 1/4) and λt = O(C(M, δ, d)t 1/2), then exists absolute constants C1, . . . , C6 for which ALEXP attains the regret R(n) C1Bn3/4 + C2 n C 1 min C(M, δ, d) log M + C3B2Cmin n + C4B p n ((log log n B2)+ + log(1/δ)) + C5 1 + C 1 minn 3/8p log(Md/δ) + (log log n)+ Bn1/4 + (n3/4 + log n Cmin )C(M, δ, d) + n5/8 Cmin d log n + log(1/δ) + B2 with probability greater than 1 δ, simultaneously for all n 1. Here, C(M, δ, d) = C6σ q d (log(M/δ) + (log log d)+) + (log(M/δ) + (log log d)+). Our main regret bound is an immediate corollary of Lemma 15 and Lemma 16, considering the regret decomposition of (3). Lemma 15 (Virtual Regret of the Oracle). Let δ (0, 1] and λ > 0. Assume the oracle agent employs a UCB or a Greedy policy, as laid out in Section 5. If γt = O(t1/4), there exists an absolute constant C1 for which with probability greater than 1 δ, simultaneously for all n 1, Rj (n) = C1n5/8 1 + n 3/8C 1 min p log(Md/δ) + (log log n)+ σ2d log n λd + 1 + 2σ2 log(1/δ) + λB2 Lemma 16 (Any-Time Model-Selection Regret, Formal). Let δ (0, 1] and π be the maximizer of (2). Suppose ηt = O(Cmin/ t C(M, δ, d)) and γt = O(t 1/4) and λt = O(C(M, δ, d)/ t), then exists absolute constants Ci for which ALEXP attains the model selection regret R(n, j) C1Bn3/4 + C2 n C 1 min C(M, δ, d) log M + C3B2Cmin n + C4B p n ((log log n B2)+ + log(1/δ)) Bn1/4 + (n3/4 + log n Cmin )C(M, δ, d) 1 + C 1 minn 3/8p log(Md/δ) + (log log n)+ with probability greater than 1 δ, simultaneously for all n 1. Here, C(M, δ, d) = C6σ q d (log(M/δ) + (log log d)+) + (log(M/δ) + (log log d)+). D.1 Proof of Model Selection Regret Our technique for bounding the model selection regret relies on a classic horizon-independent analysis of the exponential weights algorithm, presented in Lemma 17. Lemma 17 (Anytime Exponential Weights Guarantee). Assume ηtˆrt,j 1 for all 1 j M and t 1. If the sequence (ηt)t 1 is non-increasing, then for all n 1, j=1 qt,jˆrt,j log M j=1 qt,jˆr2 t,j for any arm k [M]. Proof of Lemma 17. Define ˆRt,i := Pt s=1 ˆrs,i to be the expected cumulative reward of agent i after t steps. 
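(As an aside before the computation: the update analyzed in Lemma 17 is the standard exponential-weights rule $q_{t,j} \propto \exp(\eta_t \hat R_{t-1,j})$ with a non-increasing learning rate, fed with estimated rewards rather than bandit feedback. A minimal sketch with illustrative inputs of our own; the condition $\eta_t\hat r_{t,j} \le 1$ is assumed to be enforced by the caller.)

```python
import numpy as np

def anytime_exp_weights(est_rewards, eta_schedule, seed=0):
    """Sketch of the update analyzed in Lemma 17: q_{t,j} proportional to
    exp(eta_t * Rhat_{t-1,j}) with a non-increasing learning rate eta_t, where Rhat
    are cumulative *estimated* rewards (emulated full-information feedback).
    est_rewards is an (n x M) array standing in for the rhat_{t,j}."""
    rng = np.random.default_rng(seed)
    n, M = est_rewards.shape
    R_hat = np.zeros(M)
    picks = np.empty(n, dtype=int)
    for t in range(n):
        logits = eta_schedule(t + 1) * R_hat
        q = np.exp(logits - logits.max())
        q /= q.sum()                                  # q_t on the M-simplex
        picks[t] = rng.choice(M, p=q)                 # model index drawn at round t
        R_hat += est_rewards[t]                       # full-information cumulative update
    return picks

# toy run: M = 20 models, model 3 slightly better on average, eta_t ~ 1/sqrt(t)
rhat = 0.05 * np.random.default_rng(1).standard_normal((500, 20))
rhat[:, 3] += 0.1
picks = anytime_exp_weights(rhat, eta_schedule=lambda t: 1.0 / np.sqrt(t))
print(np.mean(picks[-100:] == 3))                     # concentrates on the best model
```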
We rewrite for a fixed k j=1 qt,jˆrt,j = t=1 Ej qt[ˆrt,j]. (14) We focus on a single term in the second sum. For any t, we have Ej qt[ˆrt,j] = log(exp( Ej qt[ηt ηt ˆrt,j])) = log(exp( Ej qt[ηtˆrt,j])1/ηt) ηt log(exp( Ej qt[ηtˆrt,j])) ηt log(Ei qt exp( Ej qt[ηtˆrt,j])) (15) The inner expectation is over j, while the outer one is over i and therefore has no effect. Moreover, 1 ηt log Ei qt exp( ηt Ej qt[ˆrt,j] + ηtˆrt,i) = 1 ηt log (exp( ηt Ej qt[ˆrt,j])Ei qt exp(ηtˆrt,i)) ηt log Ei qt exp( ηt Ej qt[ˆrt,j]) ηt log Ei qt exp(ηtˆrt,i) (16) where again, the expectation can be reintroduced to get the last line. Combining (15) and (16), Ej qt[ˆrt,j] = 1 ηt log Ei qt exp( ηt Ej qt[ˆrt,j] + ηtˆrt,i) 1 ηt log Ei qt exp(ηtˆrt,i) (17) This transformation is at the core of many exponential weight proofs [Bubeck et al., 2012, Lattimore and Szepesvári, 2020]. We first bound the first term in (17): log Ei qt exp( ηt Ej qt[ˆrt,j] + ηtˆrt,i) = log Ei qt exp(ηtˆrt,i) ηt Ej qtˆrt,j (I) Ei qt exp(ηtˆrt,i) 1 ηt Ej qtˆrt,j = Ei qt [exp(ηtˆrt,i) 1 ηtˆrt,i] (II) Ei qt η2 t ˆr2 t,i (18) where in (I) we use the fact that log(z) z 1 and in (II) we use the fact that for x 1, we have exp(x) 1 + x + x2, and hence exp(x) 1 x x2. For the second term in (17), we will mirror the potential argument in Bubeck et al. [2012], but with a slightly different potential function. We expand the definition of qt: ηt log Ei qt exp(ηtˆrt,i) = 1 ηt log PM i=1 exp(ηt ˆRt,i) PM i=1 exp(ηt ˆRt 1,i) i=1 exp(ηt ˆRt,i) + 1 i=1 exp(ηt ˆRt 1,i) = Jt(ηt) Jt 1(ηt), (19) where we define Jt(η) = 1 M PM i=1 exp(η ˆRt,i). We also define Ft(η) = M PM i=1 exp( η ˆRt,i). We observe the relation J(η) = F( η). From this, it follows that for any η, we have J (η) = F ( η) 0, by the argument in Bubeck et al. [2012, Theorem 3.1] that shows F (η) 0 for any η. Putting together the pieces Now, we can bound (17) by inputing (18) and (19): Ej qt[ˆrt,j] Ei qt ηtˆr2 t,i + Jt(ηt) Jt 1(ηt) With this, we rewrite (14) as t=1 Ej qt[ˆrt,j] = t=1 ˆrt,k + t=1 Ei qt ηtˆr2 t,i + t=1 Jt(ηt) Jt 1(ηt) (20) Potential manipulation We can do an Abel transformation on the sum of potentials in (20), namely obtaining n X t=1 Jt(ηt) Jt 1(ηt) = t=1 (Jt(ηt) Jt(ηt+1)) + Jn(ηn), where we used that J0(η) = 0. We know J (η) 0 and so J is decreasing and since ηt+1 ηt, we have J(ηt+1) J(ηt) or (Jt(ηt) Jt(ηt+1)) 0, so that for any fixed k t=1 Jt(ηt) Jt 1(ηt) Jn(ηn) log(M) i=1 exp(ηn ˆRn,i) ηn log exp(ηn ˆRn,k) t=1 ˆrt,k (21) where ( ) follows because exp is positive and log is decreasing (notice that we drop M 1 terms from the sum). Plugging (21) into (20), we obtain t=1 Ej qt[ˆrt,j] t=1 ˆrt,k + t=1 Ei qt ηtˆr2 t,i + t=1 Jt(ηt) Jt 1(ηt) t=1 ˆrt,k + t=1 Ei qt ηtˆr2 t,i + log(M) t=1 Ei qt ηtˆr2 t,i + log(M) j=1 qt,jˆr2 t,j + log(M) We expressed in Section 5.2, that the model selection regret of ALEXP, is closely tied to the bias and variance of the reward estimates ˆrt,j. The following lemma formalizes this claim. Lemma 18. (Anytime Generic regret bound) If ηt is picked such that ηtˆrt,j 1 for all 1 j M and 1 t almost surely, then Algorithm 1 satisfies with probability greater than 1 2δ/3, that simultaneously for all n 1 t=1 γt + log M j=1 qt,jˆr2 t,j + t=1 (ωt,i + j=1 qt,jωt,j) n ((log log n B2)+ + log(12/δ)) where ωt,i = |rt,i ˆrt,i|. Proof of Lemma 18. Let αt denote the Bernoulli random variable that is equal to 1 if at step t we select actions according to π and 0 otherwise. At each step t with αt = 1 ALEXP accumulates a regret of at most 2B, since θ B and ϕ( ) 1. 
We can decompose the regret as, t=1 2Bαt + (rt,i rt)(1 αt) For the first term, by Lemma 24, we have n ((log log n)+ + log(4/δ1)) simultaneously for all n 1, with probability 1 δ1. Let ˆrt := PM j=1 qt,jˆrt,j. We may re-write the second term of the regret as follows, t=1 (1 αt) rt,i rt t=1 (1 αt) h (rt,i ˆrt,i) + (ˆrt,i ˆrt) + (ˆrt rt) i t=1 ωt,i + (1 αt) h (ˆrt,i ˆrt) + (ˆrt rt) i We bound the second term on the right hand side, using Lemma 17 t=1 (1 αt)(ˆrt,i ˆrt) t=1 (ˆrt,i ˆrt) log M j=1 qt,jˆr2 t,j. As for the third term, j=1 qt,jˆrt,j rt) = (1 αt) h M X j=1 qt,j(ˆrt,j rt,j + rt,j) rt i j=1 qt,jωt,j + (1 αt) j=1 qt,jrt,j It remains to bound the deviation term. For all t that satisfy αt = 0, the action/model is selected according to qt,j, therefore the conditional expectation of rt can be written as j 1 qt,jrt,j The sequence Xt := rt Et 1 rt is a martingale difference sequence adapted to the history Ht, since for every t 1, Et 1 Xt = E [rt Et 1 rt|Ht 1] = 0. Since rt B, then Xt 2B almost surely, which allows for an application of anytime Azuma Hoeffding (Lemma 26): j=1 qt,jrt,j n ((log log n B2)+ + log(2/δ2)) which, in turn, leads us to j=1 qt,jrt,j j=1 qt,jrt,j n ((log log n B2)+ + log(2/δ2)) simultaneously for all n 1. We set δ1 = δ2 = δ/3, take a union bound and put the terms together obtaining, t=1 γt + log M j=1 qt,jˆr2 t,j + t=1 (ωt,i + j=1 qt,jωt,j) n ((log log n B2)+ + log(6/δ)) + 5B p n ((log log n)+ + log(12/δ)) We upper bound the sum of last two terms to conclude the proof. The next two lemmas bound the bias and variance terms which appear in Lemma 18. Lemma 19 (Anytime Bound on the Bias Term). If the regularization parameter of Lasso is chosen at every step as d (log(2M/δ) + (log log d)+) + 12 2 (log(2M/δ) + (log log d)+) and γt = O(t 1/4), then with probability greater than 1 δ, simultaneously for all n 1, t=1 |ˆrt,i rt,i| n3/4C 1 min C(M, δ, d) 1 + n 3/8C 1 min p log(Md/δ) + (log log n)+ C(M, δ, d) := Cσ q d (log(M/δ) + (log log d)+) + (log(M/δ) + (log log d)+) and C is an absolute constant. Proof of Lemma 19. By the definition of the expected reward and its estimate, t=1 |ˆrt,i rt,i| = X (r(x) ˆrt(x))dpt+1,i(x) X |r(x) ˆrt(x)|dpt+1,i(x) θ ˆθt 2 ϕ(x) 2dpt+1,i(x) X dpt+1,i(x) = We highlight that the Cauchy-Schwarz step may be refined. By further assuming that θj is bounded away from zero (also called the beta-min condition [Bühlmann and Van De Geer, 2011]) one can show that θ ˆθt is a 2-sparse vector. This will then allow one to only rely on boundedness of ϕj rather than ϕ to derive the last inequality, and relax our assumption of ϕ( ) 1 to ϕj( ) 1 for all j [M]. From Theorem 3, with probability greater than 1 δ/2 simultaneously for all n 1, t=1 |ˆrt,i rt,i| 10λt κ2(Φt, 2) = C(M, δ, d) 1 κ2(Φt, 2) C(M, δ, d) := 8σ d (log(4M/δ) + (log log d)+) + 12 2 (log(4M/δ) + (log log d)+). From Lemma 8, there exist absolute constants C1, C2 for which, P t 1 : κ2(Φt, 2) C1Cmint 1/4 C2t 5/8p log(Md/δ) + (log log t)+ 1 δ. Using Taylor approximation we observe that, 1 1 x 1 = 1+x 1 +o(x 1) = O(1+x 1). Therefore, these exists absolute constant C3, C4, for which with probability greater than 1 δ for all t 1 C(M, δ, d) κ2(Φt, 2) C1Cmint 1/4 C2t 5/8p log(Md/δ) + (log log t)+ C3 Cmint 1/4 1 + C2t 5/8p log(Md/δ) + (log log t)+ C1Cmint 1/4 C3 C(M, δ, d)t 1/4 1 + C4 t 3/8 log(Md/δ) + (log log t)+ = C3 C(M, δ, d)n3/4 1 + C4 n 3/8 log(Md/δ) + (log log n)+ Lemma 20 (Anytime Bound on Variance Term). Suppose λt is chosen according to Lemma 19, γt = O(t 1/4) and ηt = O(Cmint 1/2/C(M, δ, d)). 
Then with probability greater than 1 δ, the following holds simultaneously for all n 1 and t 1 10λt κ2(Φt, 2) + B, j [M] j=1 qt,jˆr2 t,j C1B2Cmin n + C2Bn1/4 1 + n 3/8 log(Md/δ) + (log log n)+ + C(M, δ, d)log n log(Md/δ) + (log log n)+ where Ci are absolute constants, and C(M, δ, d) is as defined in Lemma 19, up to constant factors. Proof of Lemma 20. We start by upper bounding ˆrt,j. For all j and t it holds that: X ˆθt, ϕ(x) dpt+1,j(x) ˆθt Z X ϕ(x) dpt+1,j(x) ˆθt 2 B + (ˆθt θ) 2 since θ 2 B. To bound the last term, we only need to invoke Theorem 3, which, in turn, will simultaneously bound ˆrt,j for all j = 1, . . . , M: t 1, j [M] : ˆrt,j 4 10λt c2 κ,t + B Which implies for all t 1, j=1 qt,jˆr2 t,j 10λt c2 κ,t + B j=1 qt,j = 160λ2 t c4 κ,t + B2 + 8B 10λt c2 κ,t . For the last term, similar to the proof of Lemma 19 we have, 10λt c2 κ,t C1Bn1/4 1 + n 3/8 log(Md/δ) + (log log n)+ for some absolute constant C1. We treat the squared term similarly, t=1 ηt 160λ2 t c4 κ,t C(M, δ, d) 1 + C4 t 3/8 log(Md/δ) + (log log t)+ C(M, δ, d)C3 log n 1 + C4 n 3/8 log(Md/δ) + (log log n)+ Note that the last inequality is not tight. This term will not be fastest growing term in the regret, so we have little motivation to bound it tightly. Therefore, j=1 qt,jˆr2 t,j C1B2Cmin n + C2Bn1/4 1 + n 3/8 log(Md/δ) + (log log t)+ + C(M, δ, d)log n 1 + C4 n 3/8 log(Md/δ) + (log log n)+ where Ci are absolute constants. Proof of Lemma 16. We start by conditioning on the event E that ηt is picked such that ηtˆrt,j 1 for all t 1 and j = 1, . . . , M. Then by application of Lemma 18 we get with probability greater than 1 2δ/3, t=1 γt + log M j=1 qt,jˆr2 t,j + t=1 (ωt,i + j=1 qt,jωt,j) n ((log log n B2)+ + log(12/δ)) We invoke Lemma 19 and Lemma 20 with δ δ/3 take a union bound, to bound the variance and ωt,i terms as well. These lemmas require one application of Theorem 3 to hold simultaneously and no additional union bound is required between them, since the randomness comes only from the confidence interval over ˆθt. R(n, i) C1Bn3/4 + log M + C2B2Cmin n + C3Bn1/4 1 + n 3/8 log(Md/δ) + (log log n)+ + C(M, δ, d)log n log(Md/δ) + (log log n)+ + n3/4C(M, δ, d) 1 + n 3/8 log(Md/δ) + (log log n)+ n ((log log n B2)+ + log(12/δ)) with probability greater than 1 δ, conditioned on event E. Assuming that event E happens with probability 1 2δ, let B = B( δ, M, d, n, B, σ, Cmin) denote the right-hand-side of the regret inequality above. By the chain rule we may write, P Reg(n, i) B P R(n, i) B t [n], j [M] : ηtˆrt,j 1 P ( t [n], j [M] : ηtˆrt,j 1) P R(n, i) B t [n], j [M] : ηtˆrt,j 1 (1 2δ) (1 δ)(1 2δ) 1 3δ. It remains to verify that event E is met with probability 1 2δ. Recall that ηt = O(Cmin/ t C(M, δ, d)), and that from Lemma 8 with probability 1 δ, t C(M, δ, d) C1Cmint1/4 C2t 1/8p log(Md/δ) + (log log t)+ 4 10C(M, δ, d) κ2(Φt, 2) Therefore, from Lemma 20, there exists Cη such that ηt = CηCmin/B t C(M, δ, d) satisfying, P ( t 1, j [M] : ηtˆrt,j 1) 1 2δ The proof is then finished by setting δ 3δ (and updating the absolute constants). D.2 Proof of Virtual Regret Proposition 21. For any fixed λ > 0, there exists an absolute constant C1 such that P t 1 : ˆβt,j θj 2 ω(t, δ, d) 1 δ. ω(t, δ, d) := C1 λd +1 +2σ2 log(1/δ)+ λB2 λ+Cmint3/4 1 + C 1 mint 3/8p log(Md/δ) + (log log t)+ . Moreover, for ut,j ( ) := ˆβ t,j ϕj ( ) + ω(t, δ, d), P ( t 1, x X : r(x) ut,j (x)) 1 δ. Proof of Proposition 21. Define for convenience Vt = Φ t,j Φt,j + λI. We first observe that ˆβt,j = V 1 t (Φt,j ) yt We can apply results from Abbasi-Yadkori et al. 
[2011] to get an anytime-valid confidence set. Their Theorem 2 asserts that with probability 1 δ, for all t 1 we have2 βt = 2σ2 log det(Vt)1/2 det( λI)1/2δ Clearly, Vt λmin(Vt)I, and therefore with high probability, βt λmin(Vt) uniformly over time. Our assumption is that ϕj(x) 1, and hence, denoting by νi the eigenvalues of Vt, the geometric-arithmetic mean inequality yields dtrace(Vt) d . 2Their theorem statement is slightly different, but they prove the stronger version we state below. trace(Vt) = s=1 (ϕj (x))2 i + λd t + λd we can conclude that (t/d + λ)d/2 + λB2 = dσ2 log t λd + 1 + 2σ2 log(1/δ) + λB2 We note that λmin(Vt) = λmin(Φ t,jΦt,j) + λ. Then due to Lemma 9, there exist absolute constants C1 and C2 such that for all t 1, λmin(Vt) λ + C1Cmint3/4 C2t3/8p log(Md/δ) + (log log t)+ therefore, there exists C3 and C4 such that λmin(Vt) 1 q λ + C1Cmint3/4 C2t3/8p log(Md/δ) + (log log t)+ C3 p λ + C1Cmint3/4 log(Md/δ) + (log log t)+ λ + C1Cmint3/4 C4 p λ + C1Cmint3/4 1 + C 1 mint 3/8p log(Md/δ) + (log log t)+ with high probability for all t 1. Setting ω(t, δ, d) = C5 λd +1)+2σ2 log(1/δ)+ λB2 λ+Cmint3/4 1 + C 1 mint 3/8p log(Md/δ) + (log log t)+ where C5 is an absolute constant concludes the parametric confidence bound. The upper confidence bound then simply follows: for any x X r(x) ˆβ t,j ϕj (x) = θj ˆβt,j , ϕj (x) θj ˆβt,j 2 ϕj (x) 2 ω(t, δ, d) where the last inequality holds with high probability simultaneously for all t 1. Proof of Lemma 15. Using Proposition 21 and the Cauchy-Schwarz inequality we obtain, t=1 r(x ) r( xt,j) t=1 r(x ) ˆrt(x ) + ˆrt(x ) ˆrt( xt,j) + ˆrt( xt,j) r( xt,j) ϕj(x ) 2 + ϕj( xt,j) 2 + ˆrt(x ) ˆrt( xt,j) t=1 ω(t, δ, d) ϕj(x ) 2 + ϕj( xt,j) 2 + ˆrt(x ) ˆrt( xt,j) with probability 1 δ. If the agent selects actions greedily, then ˆrt(x ) ˆrt( xt,j), and t=1 ω(t, δ, d) ϕj(x ) 2 + ϕj( xt,j) 2 t=1 2ω(t, δ, d) since the feature map is normalized to satisfy ϕj( ) 1. If the agent selects actions optimistically according to the upper confidence bound of Proposition 21, then ˆrt( xt,j) + ω(t, δ, d) ϕj( xt,j) ˆrt(x ) + ω(t, δ, d) ϕj(x ) which implies ˆrt(x ) ˆrt( xt,j) ω(t, δ, d) ϕj( xt,j) ω(t, δ, d) ϕj(x ) and therefore, t=1 2ω(t, δ, d) ϕj( xt,j) 2 t=1 2ω(t, δ, d). Then due to Proposition 21, with probability greater than 1 δ, simultaneously for all n 1, t=1 ω(t, δ, d) λd +1 +2σ2 log(1/δ)+ λB2 δ ) + (log log t)+ λd +1 +2σ2 log(1/δ)+ λB2 1 + C 1 minn 3 δ ) + (log log n)+ concluding the proof. E Time-Uniform Concentration Inequalities We will make use of the elegant concentration results in Howard et al. [2021], which analyzes the boundary of sub-Gamma processes. Definition 22 (Sub-Gamma process). Let (St) t=0 and (Vt) t=0 be real-valued processes adapted to (Ft) t=1 with S0 = V0 = 0 and Vt non-negative. We say that St is sub-Gamma if for λ [0, 1/c), there exists a supermartingale (Mt(λ)) t=0 w.r.t. Ft, such that E M0 = 1 and for all t 1: 2(1 cλ)Vt} Mt(λ) a.s. The following is a special case of Theorem 1 in Howard et al. [2021]. We have simplified it by making a few straightforward choices for the parameters used originally by Howard et al. [2021], which will yield an easier-to-use bound in our scenario. Proposition 23 (Curved Boundary of Sub-Gamma Processes). Let (St)t 0 be sub-Gamma with scale parameter c and variance process (Vt)t 0. Define the boundary max{v, 1} (log log ev)+ + log 2 + 3c (log log ev)+ + log 2 for v > 0, where (x)+ = max(0, x). Then, P( t : St Bα(Vt)) α. Proof of Proposition 23. Suppose ξ( ) denotes the Riemann zeta function. 
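(A numerical note before the details: the boundary of Proposition 23 is cheap to evaluate and easy to sanity-check by simulation. The sketch below uses the multiplicative constants 5 and 3 in the form in which the boundary is applied in Lemmas 24-26 below; the simulation setup is our own illustration, not the paper's.)

```python
import numpy as np

def curved_boundary(v, c, alpha):
    """Sketch of the curved sub-Gamma boundary B_alpha(v) of Proposition 23,
    with the multiplicative constants (5 and 3) as used in Lemmas 24-26."""
    w = max(np.log(np.log(np.e * max(v, 1.0))), 0.0) + np.log(2.0 / alpha)
    return 5.0 * np.sqrt(max(v, 1.0) * w) + 3.0 * c * w

# Monte Carlo check in the setting of Lemma 24: X_s ~ Bernoulli(gamma_s), c = 0, V_t = t.
rng = np.random.default_rng(0)
n, reps, alpha = 2000, 500, 0.05
gamma = 0.3 * np.ones(n)
t = np.arange(1, n + 1)
bound = np.array([curved_boundary(v, c=0.0, alpha=alpha / 2) for v in t])  # alpha/2 per side
crossed = 0
for _ in range(reps):
    S = np.cumsum(rng.binomial(1, gamma) - gamma)
    crossed += np.any(np.abs(S) > bound)
print(crossed / reps)   # empirical crossing frequency, should stay well below alpha
```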
Theorem 1 in Howard et al. [2021] states that if $(S_t)_{t\ge0}$ is a sub-Gamma process with variance process $(V_t)_{t\ge0}$ and scale parameter $c$, then the boundary
\[
\mathcal S_\alpha(v^*) = k_1\sqrt{v^*\Big(s\log\log(\eta v^*) + \log\tfrac{\zeta(s)}{\alpha\log^s\eta}\Big)} + k_2\,c\,\Big(s\log\log(\eta v^*) + \log\tfrac{\zeta(s)}{\alpha\log^s\eta}\Big)
\]
satisfies $P(\exists t \ge 1: S_t \ge \mathcal S_\alpha(\max(V_t, 1))) \le \alpha$, where $k_1 := (\eta^{1/4} + \eta^{-1/4})/\sqrt2$, $k_2 := (\sqrt\eta + 1)/2$ and $s, \eta > 1$. Choosing $s = 2$ and $\eta = e$, we obtain $\zeta(2) = \pi^2/6 \le 2$. Furthermore, we have $k_1 \le 3/2$ and $k_2 \le 3/2$. Then if $v^* \ge 1$ (which we will enforce by the construction $v^* = \max(1, v)$), we compute
\[
s\log\log(\eta v^*) + \log\tfrac{\zeta(s)}{\alpha\log^s\eta} \le 2(\log\log ev^*)_+ + \log\tfrac2\alpha \le 2\big((\log\log ev^*)_+ + \log\tfrac2\alpha\big).
\]
Therefore, we can upper bound (using our bounds on $k_1$, $k_2$)
\[
\mathcal S_\alpha(v^*) \le \tfrac32\sqrt{2v^*\big((\log\log ev^*)_+ + \log\tfrac2\alpha\big)} + 3c\big((\log\log ev^*)_+ + \log\tfrac2\alpha\big).
\]
Now, since the boundary is given by $\mathcal S_\alpha(\max(v,1))$ and $v^* = \max(v,1) \ge 1$, we deduce that
\[
\mathcal B_\alpha(v) = 5\sqrt{\max\{v,1\}\big((\log\log ev)_+ + \log\tfrac2\alpha\big)} + 3c\big((\log\log ev)_+ + \log\tfrac2\alpha\big)
\]
is an anytime-valid boundary.

Lemma 24 (Time-Uniform Two-sided Bernoulli). Let $X_1, \dots, X_t$ be an adapted sequence of Bernoulli random variables with conditional means $\gamma_s$. Then for all $\delta > 0$,
\[
P\Big(\forall t \ge 1: \Big|\sum_{s=1}^t (X_s - \gamma_s)\Big| \le 5\sqrt{t\big((\log\log t)_+ + \log(4/\delta)\big)}\Big) \ge 1 - \delta.
\]

Proof of Lemma 24. By Proposition 23, we know that if $S_t$ is sub-Gamma with variance process $V_t$ and scale parameter $c$, then $P(\exists t: S_t \ge \mathcal B_\delta(V_t)) \le \delta$, where
\[
\mathcal B_\delta(v) = 5\sqrt{\max\{1,v\}\big((\log\log ev)_+ + \log(2/\delta)\big)} + 3c\big((\log\log ev)_+ + \log(2/\delta)\big).
\]
By Howard et al. [2020], we know that if $(X_t)_{t\ge1}$ is a Bernoulli sequence, then $S_t = \sum_{s=1}^t(X_s - \gamma_s)$ is sub-Gamma with variance process $V_t = t$ and scale parameter $c = 0$ (hence, sub-Gaussian). This implies
\[
P\Big(\exists t: \sum_{s=1}^t (X_s - \gamma_s) \ge 5\sqrt{t\big((\log\log t)_+ + \log(2/\delta)\big)}\Big) \le \delta.
\]
The above argument also holds for the sequence $Z_s = -X_s$. Then taking a union bound and adjusting $\delta \to \delta/2$ concludes the proof.

Lemma 25 (Time-Uniform Bernstein). Let $(\xi_i)_{i\ge1}$ be a sequence of conditionally standard sub-Gaussian variables, where each $\xi_i$ is $\mathcal F_i$-measurable with $\mathcal F_i = \sigma(\xi_1, \dots, \xi_i)$. Then, for $v_i \in \mathbb R$ and $\delta \in (0,1]$, with probability at least $1-\delta$, simultaneously for all $t \ge 1$,
\[
\sum_{i=1}^t (\xi_i^2 - 1)v_i \le 5\sqrt{\max\{1, 4\lVert v_t\rVert_2^2\}\,\omega_\delta(\lVert v_t\rVert_2)} + 12\,\omega_\delta(\lVert v_t\rVert_2)\max_{i\ge1} v_i,
\]
where $v_t = (v_1, \dots, v_t) \in \mathbb R^t$ and $\omega_\delta(v) := (\log\log(4ev^2))_+ + \log(2/\delta)$.

Proof of Lemma 25. From Lemma 4, $S_t = \sum_{i=1}^t(\xi_i^2 - 1)v_i$ is sub-Gamma with variance process $V_t = 4\sum_{i=1}^t v_i^2$ and $c = 4\max_{i\ge1} v_i$. By Proposition 23, we know that if $S_t$ is sub-Gamma with variance process $V_t$ and scale parameter $c$, then $P(\exists t: S_t \ge \mathcal B_\delta(V_t)) \le \delta$, with $\mathcal B_\delta$ as above. Plugging in $V_t = 4\lVert v_t\rVert_2^2$ and $c = 4\max_{i\ge1} v_i$ gives the claimed bound, since $(\log\log(4e\lVert v_t\rVert_2^2))_+ + \log(2/\delta) = \omega_\delta(\lVert v_t\rVert_2)$.

Lemma 26 (Time-Uniform Azuma-Hoeffding). Let $X_1, \dots, X_n$ be a martingale difference sequence such that $|X_t| \le B$ for all $t \ge 1$ almost surely. Then for all $\delta > 0$, with probability at least $1-\delta$, simultaneously for all $t \ge 1$,
\[
\sum_{s=1}^t X_s \le 5B\sqrt{t\big((\log\log etB^2)_+ + \log(2/\delta)\big)}.
\]

Proof of Lemma 26. By Proposition 23, we know that if $S_t$ is sub-Gamma with variance process $V_t$ and scale parameter $c$, then $P(\exists t: S_t \ge \mathcal B_\delta(V_t)) \le \delta$, with $\mathcal B_\delta$ as above. By Howard et al. [2020], we know that if $(X_t)_{t\ge1}$ is a $B$-bounded martingale difference sequence, then $S_t = \sum_{s=1}^t X_s$ is sub-Gamma with variance process $V_t = tB^2$ and scale parameter $c = 0$. This implies
\[
P\Big(\exists t: S_t \ge 5B\sqrt{t\big((\log\log etB^2)_+ + \log(2/\delta)\big)}\Big) \le \delta,
\]
concluding the proof.

F Experiment Details

[Figure 5: Examples of possible reward functions r(·) in our experiments.]

F.1 Hyper-Parameter Tuning Results

We implement 6 algorithms in our experiments: ETC [Algorithm 4, Hao et al., 2020], ETS (Algorithm 5), CORRAL [Algorithm 6, Agarwal et al., 2017], ALEXP (Algorithm 1), and lastly UCB (Algorithm 3) with the oracle feature map $\phi_{j^*}$ (Oracle) and UCB with the concatenated feature map $\phi$ (Naive). The Python code is available at github.com/lasgroup/ALEXP.
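To make the setup concrete, the following is a small stand-in (ours, not the repository's implementation) for the kind of environment used in the experiments: a dictionary of M feature groups of which a single group $j^*$ is active, so that $r(x) = \langle\theta_{j^*}, \phi_{j^*}(x)\rangle$, in the spirit of the reward functions shown in Figure 5.

```python
import numpy as np

def sample_reward_function(M=10, d=3, rng=None):
    """Illustrative environment: M polynomial feature groups on X = [-1, 1],
    of which a single group j* is active, r(x) = <theta_{j*}, phi_{j*}(x)>.
    The actual environments are defined in the repository; names and constants
    here are our own."""
    rng = np.random.default_rng(rng)
    j_star = rng.integers(M)
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)            # ||theta||_2 <= B = 1

    def phi_j(x, j):
        # group j holds the monomials x^(j*d + 1), ..., x^(j*d + d), all bounded on [-1, 1]
        return np.stack([x ** (j * d + k + 1) for k in range(d)], axis=-1)

    def r(x):
        return phi_j(x, j_star) @ theta

    return r, j_star

r, j_star = sample_reward_function(rng=0)
xs = np.linspace(-1, 1, 5)
print(j_star, np.round(r(xs), 3))
```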
When algorithms require exploration, e.g., in the case of ETC or ALEXP, we simply set $\pi = \mathrm{Unif}(\mathcal X)$. Figure 7 shows the results of our hyper-parameter tuning experiment. To ensure that the curves are valid, we run each configuration for 20 different random seeds, i.e., on different random environments. The shaded areas in Figure 7 show the standard error.

[Figure 6: Benchmarking ALEXP and other baselines (Oracle UCB, ALEXP, Naive UCB, ETC, ETS, Corral) on three problem instances: Lightly Correlated (p = 10, s = 2), Highly Correlated (p = 10, s = 8), and Many Models (p = 10, s = 3). Complete version of Fig. 1 and Fig. 2.]

Algorithm 2 Get Posterior
  Inputs: $\mathcal H_t$, $\phi$, $\lambda$
  Let $K_t \leftarrow [\phi^\top(x_i)\phi(x_j)]_{i,j\le t}$, $V_t \leftarrow (K_t + \lambda^2 I)$, and $k(\cdot) \leftarrow [\phi^\top(x_i)\phi(\cdot)]_{i\le t}$
  Calculate $\mu_t(\cdot) \leftarrow k^\top(\cdot)V_t^{-1}y_t$
  Calculate $\sigma_t(\cdot) \leftarrow \sqrt{\phi^\top(\cdot)\phi(\cdot) - k^\top(\cdot)V_t^{-1}k(\cdot)}$
  Return: $\mu_t$, $\sigma_t$

Algorithm 3 UCB
  Inputs: $\lambda$, $\beta_t$, $\phi$
  for $t = 1, \dots, n$ do
    Choose $x_t \leftarrow \arg\max_x u_{t-1}(x) = \mu_{t-1}(x) + \beta_t\sigma_{t-1}(x)$   (Choose actions optimistically)
    Observe $y_t = r(x_t) + \varepsilon_t$   (Receive reward)
    $\mathcal H_t \leftarrow \mathcal H_{t-1} \cup \{(x_t, y_t)\}$   (Append history)
    Update $\mu_t, \sigma_t \leftarrow$ Get Posterior($\mathcal H_t$, $\phi$, $\lambda$)
  end for

UCB. For all the experiments, we set the exploration coefficient of UCB to $\beta_t = 2$ (see the footnote below) and choose the regression regularizer from $\lambda \in \{0.01, 0.1, 0.5\}$. We use PYTORCH [Paszke et al., 2017] for updating the upper confidence bounds, which requires more regularization for longer feature maps (e.g., when s = 8, p = 2) to be computationally stable.

Lasso. Every time we need to solve Eq. (1), we set $\lambda_t$ according to the rate suggested by Theorem 3. To find a suitable constant scaling coefficient, we perform a hyper-parameter tuning experiment, sampling 20 values in $[10^{-5}, 10^{0}]$. We choose $\lambda_0 = 0.009$ and scale $\lambda_t$ with it across all experiments.

ALEXP. We set the rates for $\gamma_t$ and $\eta_t$ as prescribed by Theorem 1. For the scaling constants, we perform a hyper-parameter tuning experiment, log-uniformly sampling 20 different configurations from $\gamma_0 \in [10^{-4}, 10^{-1}]$ and $\eta_0 \in [10^{0}, 10^{2}]$. For each problem instance (i.e., as s and p change) we repeat this process. However, we observe that the optimal hyper-parameters work well across all problem instances.

ETC/ETS. For these algorithms, we separately tune $n_0$ for each problem instance. We set $\lambda_1 \propto \sqrt{\log M / n_0}$ according to Theorem 4.2 of Hao et al. [2020] and scale it with $\lambda_0 = 0.009$, as stated before. We uniformly sample 10 different values with $n_0 \in [2, 80]$, since the horizon is $n = 100$. The optimal value often occurs around $n_0 = 20$.

CORRAL. We set the rates of the parameters as $\gamma = O(1/n)$ and $\eta = O(\sqrt{M/n})$ according to Agarwal et al. [2017, Theorem 5]. Then, similar to ALEXP, we tune the scaling constants. The procedure for tuning the constants is identical to ALEXP: we use the same search interval and try 10 different configurations for $\gamma$ and $\eta$.

Footnote: To achieve the $d\sqrt{T}\log T$ regret, one has to set $\beta_t = O(\sqrt{d\log T})$, as shown in Proposition 21.

Algorithm 4 ETC [Hao et al., 2020]
  Inputs: $n_0$, $n$, $\lambda_1$, $\pi$
  Let $\mathcal H_0 = \emptyset$
  for $t = 1, \dots, n_0$ do
    Draw $x_t \sim \pi$   (Explore)
    Observe $y_t = r(x_t) + \varepsilon_t$   (Receive reward)
    $\mathcal H_t \leftarrow \mathcal H_{t-1} \cup \{(x_t, y_t)\}$   (Append history)
  end for
  $\hat\theta_{n_0} \leftarrow \arg\min_\theta L(\theta; \mathcal H_{n_0}, \lambda_1)$   (Perform Lasso once)
  for $t = n_0+1, \dots, n$ do
    Choose $x_t = \arg\max_x \hat\theta_{n_0}^\top\phi(x)$   (Choose actions greedily)
  end for

Algorithm 5 ETS
  Inputs: $n_0$, $n$, $\lambda_1$, $\lambda$, $\beta_t$, $\pi$
  Let $\mathcal H_0 = \emptyset$
  for $t = 1, \dots, n_0$ do
    Draw $x_t \sim \pi$   (Explore)
    Observe $y_t = r(x_t) + \varepsilon_t$   (Receive reward)
    $\mathcal H_t \leftarrow \mathcal H_{t-1} \cup \{(x_t, y_t)\}$   (Append history)
  end for
  $\hat\theta_{n_0} \leftarrow \arg\min_\theta L(\theta; \mathcal H_{n_0}, \lambda_1)$   (Perform Lasso once)
  $\hat J \leftarrow \{j \mid \hat\theta_{n_0,j} \ne 0,\ j \in [M]\}$   (Get sparsity pattern)
  $\phi_{\hat J}(\cdot) \leftarrow [\phi_j(\cdot)]_{j\in\hat J}$   (Model-select according to $\hat J$)
  for $t = n_0+1, \dots, n$ do
    Choose $x_t = \arg\max_x u_{t-1}(x) = \mu_{t-1}(x) + \beta_t\sigma_{t-1}(x)$   (Choose actions optimistically)
    Observe $y_t = r(x_t) + \varepsilon_t$
    $\mathcal H_t \leftarrow \mathcal H_{t-1} \cup \{(x_t, y_t)\}$
    Update $\mu_t, \sigma_t \leftarrow$ Get Posterior($\mathcal H_t$, $\phi_{\hat J}$, $\lambda$)
  end for
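For concreteness, the posterior update of Algorithm 2, which is shared by UCB, ETS and the Oracle/Naive baselines, can be written in a few lines of numpy. This is a sketch of ours mirroring the pseudocode above; the repository's implementation uses PyTorch and may differ in details.

```python
import numpy as np

def get_posterior(X_hist, y_hist, phi, lam):
    """Sketch of Algorithm 2 (Get Posterior): kernel ridge mean and standard deviation
    with kernel k(x, x') = phi(x)^T phi(x') and regularizer lam^2 * I."""
    F = np.stack([phi(x) for x in X_hist])            # t x p feature matrix
    K = F @ F.T                                        # K_t
    V_inv = np.linalg.inv(K + lam ** 2 * np.eye(len(X_hist)))

    def mu(x):
        k = F @ phi(x)                                 # k_t(x)
        return k @ V_inv @ y_hist

    def sigma(x):
        k = F @ phi(x)
        return np.sqrt(max(phi(x) @ phi(x) - k @ V_inv @ k, 0.0))

    return mu, sigma

# one UCB step (Algorithm 3) over a finite grid of candidate actions:
#   x_next = max(grid, key=lambda x: mu(x) + beta_t * sigma(x))
```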
Algorithm 6 CORRAL [Agarwal et al., 2017]
  Inputs: $n$, $\gamma$, $\eta$
  Initialize $\beta = e^{1/\ln n}$, $\eta_{1,j} = \eta$, $\rho_{1,j} = 2M$ for all $j = 1, \dots, M$
  Set $q_1 = \bar q_1 = \frac1M\mathbf 1$ and initialize base agents $(p_{1,1}, \dots, p_{1,M})$
  for $t = 1, \dots, n$ do
    Choose $j_t \sim q_t$   (Sample agent)
    Draw $x_t \sim p_{t,j_t}$   (Play action according to agent $j_t$)
    Observe $y_t = r(x_t) + \varepsilon_t$
    Calculate IW estimates $\hat r_{t,j} = \frac{y_t}{q_{t,j}}\mathbb I\{j = j_t\}$ for all $j = 1, \dots, M$
    Send $\hat r_{t,j}$ to the agents and get updated policies $p_{t+1,j}$
    $\bar q_{t+1} =$ LOG-BARRIER-OMD($\bar q_t$, $\hat r_{t,j_t}e_{j_t}$, $\eta_t$)   (Update agent probabilities)
    $q_{t+1} = (1-\gamma)\bar q_{t+1} + \gamma\frac1M$   (Mix with exploratory distribution)
    for $j = 1, \dots, M$ do   (Update parameters)
      if $1/q_{t+1,j} > \rho_{t,j}$ then $\rho_{t+1,j} \leftarrow 2/q_{t+1,j}$ and $\eta_{t+1,j} \leftarrow \beta\eta_{t,j}$
      else $\rho_{t+1,j} \leftarrow \rho_{t,j}$ and $\eta_{t+1,j} \leftarrow \eta_{t,j}$
      end if
    end for
  end for

Algorithm 7 LOG-BARRIER-OMD
  Inputs: $q_t$, $\ell_t$, $\eta_t$
  Find $\xi \in [\min_j \ell_{t,j}, \max_j \ell_{t,j}]$ such that $\sum_{j=1}^M \big(q_{t,j}^{-1} + \eta_{t,j}(\ell_{t,j} - \xi)\big)^{-1} = 1$
  Return: $q_{t+1}$ where $q_{t+1,j}^{-1} = q_{t,j}^{-1} + \eta_{t,j}(\ell_{t,j} - \xi)$ for all $j \in [M]$

[Figure 7: Results for different hyper-parameters across different problem instances (s = 2 with p = 10, 11, 12, 13; s = 3 with p = 10; s = 8 with p = 10). ALEXP is robust to the choice of hyper-parameters.]
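The normalization step of Algorithm 7 amounts to a one-dimensional root-finding problem in $\xi$, which can be solved by bisection since the left-hand side is increasing in $\xi$. A minimal sketch (ours, not the repository's implementation):

```python
import numpy as np

def log_barrier_omd(q, loss, eta, tol=1e-10):
    """Sketch of Algorithm 7: find xi in [min_j loss_j, max_j loss_j] such that
    sum_j 1 / (1/q_j + eta_j * (loss_j - xi)) = 1, then return the updated q.
    Assumes eta * (max(loss) - min(loss)) stays below min_j 1/q_j so that every
    denominator remains positive on the search interval."""
    eta = np.broadcast_to(eta, q.shape).astype(float)

    def total(xi):
        return np.sum(1.0 / (1.0 / q + eta * (loss - xi)))

    lo, hi = loss.min(), loss.max()          # total(lo) <= 1 <= total(hi)
    for _ in range(200):                     # bisection; total(xi) is increasing in xi
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    xi = 0.5 * (lo + hi)
    q_new = 1.0 / (1.0 / q + eta * (loss - xi))
    return q_new / q_new.sum()               # renormalize for numerical safety

q = np.ones(4) / 4
print(log_barrier_omd(q, loss=np.array([0.2, 0.0, 1.0, 0.5]), eta=0.1))
```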