Published as a conference paper at ICLR 2023

VARIANCE-AWARE SPARSE LINEAR BANDITS

Yan Dai, IIIS, Tsinghua University (yan-dai20@mails.tsinghua.edu.cn)
Ruosong Wang, University of Washington (ruosongw@cs.washington.edu)
Simon S. Du, University of Washington (ssdu@cs.washington.edu)

ABSTRACT

It is well-known that for sparse linear bandits, when ignoring the dependency on sparsity (which is much smaller than the ambient dimension), the worst-case minimax regret is $\widetilde\Theta(\sqrt{dT})$, where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde O(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde O\big(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\big)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates between the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regime (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm into a variance-aware algorithm for sparse linear bandits in a black-box manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.

1 INTRODUCTION

This paper studies the sparse linear stochastic bandit problem, which is a special case of linear stochastic bandits. In linear bandits (Dani et al., 2008), the agent faces a sequential decision-making problem lasting for $T$ rounds. In the $t$-th round, the agent chooses an action $x_t \in \mathcal X \subseteq \mathbb R^d$, where $\mathcal X$ is an action set, and receives a noisy reward $r_t = \langle \theta^\ast, x_t \rangle + \eta_t$, where $\theta^\ast \in \mathcal X$ is the (hidden) parameter of the game and $\eta_t$ is random zero-mean noise. The goal of the agent is to minimize her regret $R_T$, that is, the difference between $\max_{x \in \mathcal X} \sum_{t=1}^T \langle \theta^\ast, x \rangle$ and her cumulative reward $\sum_{t=1}^T \langle \theta^\ast, x_t \rangle$ (see Eq. (1) for a definition). Dani et al. (2008) proved that the minimax optimal regret for linear bandits is $\widetilde\Theta(d\sqrt T)$ when the noises are independent Gaussian random variables with mean $0$ and variance $1$ and both $\theta^\ast$ and the actions $x_t$ lie in the unit sphere in $\mathbb R^d$.¹

In real-world applications such as recommendation systems, only a few features may be relevant despite a large candidate feature space. In other words, the high-dimensional linear model may actually admit a low-dimensional structure. As a result, if we still use the linear bandit model, we will always suffer $\Omega(d\sqrt T)$ regret no matter how many features are useful. Motivated by this, the sparse linear stochastic bandit problem was introduced (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). This problem has an additional constraint that the hidden parameter $\theta^\ast$ is sparse, i.e., $\lVert \theta^\ast \rVert_0 \le s$ for some $s \ll d$. However, the agent has no prior knowledge about $s$, and thus the interaction protocol is exactly the same as that of linear bandits. The minimax optimal regret for sparse linear bandits is $\widetilde\Theta(\sqrt{sdT})$ (Abbasi-Yadkori et al., 2012; Antos & Szepesvári, 2009).²

¹Throughout the paper, we use the notations $\widetilde O(\cdot)$ and $\widetilde\Theta(\cdot)$ to hide $\log T$, $\log d$, $\log s$ (where $s$ is the sparsity parameter, which will be introduced later) and $\log\log\frac{1}{\delta}$ factors (where $\delta$ is the failure probability).
This bound bypasses the $\Omega(d\sqrt T)$ lower bound for linear bandits, as we always have $s = \lVert\theta^\ast\rVert_0 \le d$, and the agent does not have access to $s$ either (though a few previous works assumed a known $s$). However, both the $\widetilde O(d\sqrt T)$ and the $\widetilde O(\sqrt{sdT})$ bounds are worst-case regret bounds and are sometimes too pessimistic, especially when $d$ is large. On the other hand, many problems with delicate structures permit a regret bound much smaller than the worst-case bound. The structure this paper focuses on is the magnitude of the noise. Consider the following motivating example.

Motivating Example (Deterministic Sparse Linear Bandits). Consider the case where the action set is the unit sphere $\mathcal X = \mathbb S^{d-1}$, and there is no noise, i.e., the feedback is $r_t = \langle\theta^\ast, x_t\rangle$ for each round $t\in[T]$. In this case, one can identify all non-zero coordinates of $\theta^\ast$ in $O(s\log d)$ steps with high probability via a divide-and-conquer algorithm, and thus obtain a dimension-free regret $\widetilde O(s)$ (see Appendix C for more details).³ However, this divide-and-conquer algorithm is specific to deterministic sparse linear bandit problems and does not work for noisy models.

Henceforth, we study the following natural question:

Can we design an algorithm whose regret adapts to the noise level, interpolating between the $\sqrt{dT}$-type bound in the worst case and the dimension-free bound in the deterministic case?

Before introducing our results, we would like to mention that there are recent works that studied noise-adaptivity in linear bandits (Zhou et al., 2021; Zhang et al., 2021; Kim et al., 2021). They gave variance-aware regret bounds of the form $\widetilde O\big(\operatorname{poly}(d)\sqrt{\sum_{t=1}^T \sigma_t^2} + \operatorname{poly}(d)\big)$, where $\sigma_t^2$ is the (conditional) variance of the noise $\eta_t$. This bound reduces to the standard $\widetilde O(\operatorname{poly}(d)\sqrt T)$ bound in the worst case when $\sigma_t = \Omega(1)$, and to a constant-type regret $\widetilde O(\operatorname{poly}(d))$ that is independent of $T$ when $\sigma_t \equiv 0$. However, compared with the linear bandit setting, the variance-aware bound for sparse linear bandits is more significant because it reduces to a dimension-free bound in the noiseless setting. Despite this, to our knowledge, no variance-aware regret bounds exist for sparse linear bandits.

1.1 OUR CONTRIBUTIONS

This paper gives the first set of variance-aware regret bounds for sparse linear bandits. We design a general framework, VASLB, to reduce variance-aware sparse linear bandits to variance-aware linear bandits with little overhead in regret. For ease of presentation, we define the following notation to characterize the variance-awareness of a sparse linear bandit algorithm:

Definition 1. A variance-aware sparse linear bandit algorithm $\mathcal F$ is $(f(s,d), g(s,d))$-variance-aware if, for any given failure probability $\delta > 0$, with probability $1-\delta$, $\mathcal F$ ensures
$$R^{\mathcal F}_T \le f(s,d)\sqrt{\sum_{t=1}^T \sigma_t^2}\cdot\operatorname{polylog}\frac{1}{\delta} + g(s,d)\cdot\operatorname{polylog}\frac{1}{\delta},$$
where $R^{\mathcal F}_T$ is the regret of $\mathcal F$ in $T$ rounds, $d$ is the ambient dimension and $s$ is the maximum number of non-zero coordinates. Specifically, for linear bandits, $f$ and $g$ are functions of $d$ only.

Hence, an $(f,g)$-variance-aware algorithm achieves $\widetilde O(f(s,d)\sqrt T\operatorname{polylog}\frac{1}{\delta})$ worst-case regret and $\widetilde O(g(s,d)\operatorname{polylog}\frac{1}{\delta})$ deterministic-case regret. Ideally, we would like $g(s,d)$ to be independent of $d$, making the bound dimension-free in deterministic cases, as with the divide-and-conquer approach.
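To make this deterministic-case target concrete, the following is a minimal sketch (our illustration, not the authors' code) of the divide-and-conquer identification procedure from the motivating example and Appendix C. In the noiseless case, querying one half of a candidate coordinate set with random signs and mass $1/\sqrt{|\text{subset}|}$ per coordinate reveals whether that half contains a non-zero coordinate, so the support of $\theta^\ast$ is recovered in $O(s\log d)$ queries. The function names, the number of trials, and the numerical tolerance below are our choices; the random-sign trick to avoid two coordinates cancelling each other follows the discussion in Appendix C.

```python
import numpy as np

def noiseless_feedback(theta, x):
    """Deterministic bandit feedback r = <theta, x> (the sigma_t = 0 case)."""
    return float(theta @ x)

def find_support(theta, coords=None, num_trials=3):
    """Recursively locate coordinates i with theta_i != 0 (noiseless case only).

    `num_trials` random-sign queries per half guard against non-zero coordinates
    cancelling each other; a constant number of trials suffices w.h.p. here.
    """
    d = len(theta)
    if coords is None:
        coords = list(range(d))
    if len(coords) == 1:                      # single candidate: query it directly
        x = np.zeros(d)
        x[coords[0]] = 1.0
        return set(coords) if noiseless_feedback(theta, x) != 0 else set()

    half, rest = coords[:len(coords) // 2], coords[len(coords) // 2:]

    def half_is_nonzero(subset):
        for _ in range(num_trials):
            x = np.zeros(d)
            # random +-(1/sqrt(|subset|)) mass on each queried coordinate, so ||x||_2 = 1
            x[subset] = np.random.choice([-1.0, 1.0], size=len(subset)) / np.sqrt(len(subset))
            if abs(noiseless_feedback(theta, x)) > 1e-12:
                return True
        return False

    support = set()
    if half_is_nonzero(half):
        support |= find_support(theta, half, num_trials)
    if half_is_nonzero(rest):
        support |= find_support(theta, rest, num_trials)
    return support

# toy run: d = 1024, s = 3
rng = np.random.default_rng(0)
theta = np.zeros(1024)
theta[[7, 100, 900]] = rng.normal(size=3)
theta /= np.linalg.norm(theta)
print(sorted(find_support(theta)))            # -> [7, 100, 900]
```

Each non-zero coordinate appears in at most $\log_2 d$ recursive calls, which is where the $O(s\log d)$ query count comes from; after that, the agent can play $\theta^\ast$ (up to sign recovery of each entry) and incur no further regret, matching the dimension-free $g(s,d)$ we aim for.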
In this paper, we provide a general framework that can convert any linear bandit algorithm F to a corresponding sparse linear bandit algorithm G in a black-box manner. Moreover, it is varianceaware-preserving, in the sense that, if F enjoys the variance-aware property, so does G. Generally 2Carpentier & Munos (2012) and Lattimore et al. (2015) obtained an O(s T) regret bound under different models. The former one assumed a component-wise noise model, while the latter one assumed a θ 1 1 ground-truth as well as a xt 1 action space. See Appendix A for more discussions on this. 3We also remark that some assumptions on the action is needed. For example, if every action can only query one coordinate (each action corresponds to one vector of the standard basis) then an Ω(d) regret lower bound is unavoidable. Hence, in this paper, we only consider the benign case that action set is the unit sphere. Published as a conference paper at ICLR 2023 speaking, if the plug-in linear bandit algorithm F is (f(d), g(d))-variance-aware, then our framework directly gives an (s(f(s) + d), s(g(s) + 1))-variance-aware algorithm G for sparse linear bandits. Besides presenting our framework, we also illustrate its usefulness by plugging in two existing variance-aware linear bandit algorithms, where the first one is variance-aware (i.e., works in unknownvariance cases) but computationally inefficient. In contrast, the second one is efficient but requires the variance σ2 t to be delivered together with feedback rt. Their regret guarantees are stated as follows. 1. The first variance-aware linear bandit algorithm we plug in is VOFUL, which was proposed by Zhang et al. (2021) and improved by Kim et al. (2021). This algorithm is computationally inefficient but deals with unknown variances. Using this VOFUL, our framework generates a (s2.5 + s d, s3)-variance-aware algorithm for sparse linear bandits. Compared to the Ω( sd T) regret lower-bound for sparse linear bandits (Lattimore & Szepesvári, 2020, 24.3), our worst-case regret bound is near-optimal up to a factor s. Moreover, our bound is independent of d and T in the deterministic case, nearly matching the bound of divide-and-conquer algorithm dedicated to the deterministic setting up to poly(s) factors. 2. The second algorithm we plug in is Weighted OFUL (Zhou et al., 2021), which requires known variances but is computationally efficient. We obtain an (s2 + s T)-variance-aware efficient algorithm. In the deterministic case, this algorithm can only achieve a T-type regret bound (albeit still independent of d). We note that this is not due to our framework but due to Weighted OFUL which itself cannot gives constant regret bound in the deterministic setting. Moreover, we would like to remark that our deterministic regret can be further improved if a better variance-aware linear bandit algorithm is deployed: The current ones either have e O(d2) (Kim et al., 2021) or e O( d T) (Zhou et al., 2021) regret in the deterministic case, which are both sub-optimal compared with the Ω(d) lower bound. 1.2 RELATED WORK Linear Bandits. This problem was first introduced by Dani et al. (2008), where an algorithm with regret O(d T(log T) 3/2) and a near-matching regret lower-bound Ω(d T) were given. After that, an improved upper bound O(d T log T) (Abbasi-Yadkori et al., 2011) together with an improved lower bound Ω(d T log T) (Li et al., 2019) were derived. 
An extension of it, namely linear contextual bandits, where the action set allowed for each step can vary with time (Chu et al., 2011; Kannan et al., 2018; Li et al., 2019; 2021), is receiving more and more attention. The best-arm identification problem where the goal of the agent is to approximate θ with as few samples as possible (Soare et al., 2014; Degenne et al., 2019; Jedra & Proutiere, 2020; Alieva et al., 2021) is also of great interest. Sparse Linear Bandits. Abbasi-Yadkori et al. (2011) and Carpentier & Munos (2012) concurrently considered the sparse linear bandit problem, where the former work assumed a noise model of rt = xt, θ + ηt such that ηt is R-sub-Gaussian and achieved e O(R sd T) regret, while the latter one considered the noise model of rt = xt + ηt, θ such that ηt 2 σ and θ 2 θ, achieving e O((σ + θ)2s T) regret. Lattimore et al. (2015) assumed an hypercube (i.e., X = [ 1, 1]d) action set and a θ 1 1 ground-truth, yielding e O(s T) regret. Antos & Szepesvári (2009) proved a Ω( d T) lower-bound when s = 1 with the unit sphere as X. Some recent works considered data-poor regimes where d T (Hao et al., 2020; 2021a;b; Wang et al., 2020), which is beyond the scope of this paper. Another work worth mentioning is the recent work by Dong et al. (2021), which studies bandits or MDPs with deterministic rewards. Their result implies an e O(T 15/16s 1/16) bound for deterministic sparse linear bandits, which is independent of d. They also provided an ad-hoc divide-and-conquer algorithm, which achieves O(s log d) regret only for deterministic cases. Variance-Aware Online Learning. For tabular MDPs, the variance information is widely used in both discounted settings (Lattimore & Hutter, 2012) and episodic settings (Azar et al., 2017; Jin et al., 2018), where Zanette & Brunskill (2019) used variance information to derive problemdependent regret bounds for tabular MDPs. For bandits, Audibert et al. (2009) made use of variance information in multi-armed bandits, giving an algorithm outperforming existing ones when the variances for suboptimal arms are relatively small. For bandits with high-dimensional structures, Faury et al. (2020) studied variance adaptation for logistic bandits, Zhou et al. (2021) considered linear bandits and linear mixture MDPs where the variance information is revealed to the agent, Published as a conference paper at ICLR 2023 Table 1: An overview of the proposed algorithms/results and comparisons with related works. Algorithm Setting Worst-case Regret a Deterministic-case Regret b Efficiency Variances Confidence Ball2 (Dani et al., 2008) Lin Bandit e O(d N/A OFUL (Abbasi-Yadkori et al., 2012) Sparse Lin Bandit e O( SL-UCB (Carpentier & Munos, 2012) Sparse Lin Bandit e O(s Lattimore et al. (2015, Algorithm 4) Sparse Lin Bandit e O(s Weighted OFUL (Zhou et al., 2021) Lin Bandit e O(d VOFUL2 (Kim et al., 2021) Lin Bandit e O(d1.5 T) e O(d2) Unknown VASLB (This work) Sparse Lin Bandit e O(s2 d T) e O(s1.5 Sparse Lin Bandit e O(s2.5 d T) e O(s3) Unknown Lower Bound (Antos & Szepesvári, 2009) Sparse Lin Bandit Ω( d T) e N/A N/A N/A a Worst-case means the variances σ2 t are all 1. Here, d is the ambient dimension, T is the number of rounds, and s is the sparsity parameter (only applicable to sparse linear bandits). b Deterministic-case means the variances σ2 t are all 0. Only applicable to variance-aware algorithms. c With a different feedback model; see Appendix A for more comparison. 
d With a different action set and a different assumption on $\theta^\ast$; see Appendix A for more comparison.
e This bound holds even if $s = 1$ and the action set is fixed to be the unit sphere.

giving an $\widetilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + \sqrt{dT}\big)$ guarantee for linear bandits, and Zhang et al. (2021) proposed another algorithm for linear bandits and linear mixture MDPs, which does not require any variance information and whose regret can be improved to $\widetilde O\big(d^{1.5}\sqrt{\sum_{t=1}^T\sigma_t^2} + d^2\big)$, as shown by Kim et al. (2021). The recent work by Hou et al. (2022) considered variance-constrained best arm identification, where the feedback noise only depends on the action chosen by the agent (whereas ours can depend on time, which is more general). Another recent work (Zhao et al., 2022) studied variance-aware regret bounds for bandits with general function approximation in the known-variance case.

Stochastic Contextual Sparse Linear Bandits. In this setting, the action set for each round $t$ is i.i.d. sampled (called the "context"). It is known that $\widetilde O(\sqrt{sT})$ regret is achievable in this setting (Kim & Paik, 2019; Ren & Zhou, 2020; Oh et al., 2021; Ariu et al., 2022). However, in our setting where both the action set $\mathcal X$ and the ground-truth $\theta^\ast$ are fixed, a polynomial dependency on $d$ is in general unavoidable because it is impossible to learn more than one parameter per arm (Bastani & Bayati, 2020), agreeing with the $\Omega(\sqrt{dT})$ lower bound when $s = 1$ (Antos & Szepesvári, 2009; Abbasi-Yadkori et al., 2012).

2 PROBLEM SETUP

Notations. We use $[N]$ to denote the set $\{1, 2, \ldots, N\}$ for $N \in \mathbb N$. For a vector $x \in \mathbb R^d$, we use $\lVert x\rVert_p$ to denote its $L_p$-norm, namely $\lVert x\rVert_p \triangleq (\sum_{i=1}^d |x_i|^p)^{1/p}$. We use $\mathbb S^{d-1}$ to denote the $(d-1)$-dimensional unit sphere, i.e., $\mathbb S^{d-1} \triangleq \{x\in\mathbb R^d \mid \lVert x\rVert_2 = 1\}$. We use $\widetilde O(\cdot)$ and $\widetilde\Theta(\cdot)$ to hide all logarithmic factors in $T$, $s$, $d$ and $\log\frac1\delta$ (see Footnote 1). For a random event $E$, we denote its indicator by $\mathbb 1[E]$.

We assume the action space and the ground-truth space are both the $(d-1)$-dimensional unit sphere, denoted by $\mathcal X \triangleq \mathbb S^{d-1}$. Denote the ground-truth by $\theta^\ast \in \mathcal X$. There are $T \ge 1$ rounds in which the agent makes decisions sequentially. At the beginning of round $t\in[T]$, the agent has to choose an action $x_t \in \mathcal X$. At the end of round $t$, the agent receives noisy feedback $r_t = \langle x_t, \theta^\ast\rangle + \eta_t$, $\forall t\in[T]$, where $\eta_t$ is an independent zero-mean Gaussian random variable. Denote by $\sigma_t^2 = \operatorname{Var}(\eta_t)$ the variance of $\eta_t$. For a fair comparison with non-variance-aware algorithms, we assume that $\sigma_t^2 \le 1$. The agent then receives a (deterministic and unrevealed) reward of magnitude $\langle x_t,\theta^\ast\rangle$ for this round. The agent is allowed to make the decision $x_t$ based on all historical actions $x_1,\ldots,x_{t-1}$, all historical feedback $r_1,\ldots,r_{t-1}$, and any amount of private randomness. The agent's goal is to minimize the regret, defined as follows.

Definition 2 (Regret). The following random variable is the regret of a linear bandit algorithm:
$$R_T = \max_{x\in\mathcal X}\sum_{t=1}^T\langle x,\theta^\ast\rangle - \sum_{t=1}^T\langle x_t,\theta^\ast\rangle = \sum_{t=1}^T\langle\theta^\ast - x_t,\theta^\ast\rangle, \qquad (1)$$
where the second equality is due to our assumption that $\mathcal X = \mathbb S^{d-1}$.

For the sparse linear bandit problem, we have an additional restriction that $\lVert\theta^\ast\rVert_0 \le s$, i.e., at most $s$ coordinates of $\theta^\ast$ are non-zero. However, as mentioned in the introduction, the agent does not know anything about $s$; she only knows that she is facing a (probably sparse) linear environment.

Algorithm 1 Variance-Aware Sparse Linear Bandits (VASLB) Framework
Input: Number of dimensions $d$, linear bandit algorithm $\mathcal F$ and its regret estimator $\bar R^{\mathcal F}_n$
1: Initialize gap threshold $\Delta \leftarrow \frac14$, estimated good coordinates $S \leftarrow \emptyset$, current round $t \leftarrow 0$.
2: while $t < T$ do ⊳ The algorithm automatically increases $t$ by 1 per query.
3:   if $S \neq \emptyset$ then ⊳ Have some coordinates to "commit".
4:     Initialize a new linear bandit instance $\mathcal F$ on coordinates $S$. ⊳ Commit phase.
5:     Execute $\mathcal F$ for $n^a_\Delta \ge 1$ steps & maintain pessimistic estimation $\bar R^{\mathcal F}_{n^a_\Delta}$, until $\frac{1}{n^a_\Delta}\bar R^{\mathcal F}_{n^a_\Delta} < \Delta^2$.
6:     Suppose that $\mathcal F$ plays $x_1, x_2, \ldots, x_{n^a_\Delta}$. Set $\hat\theta = \frac{1}{n^a_\Delta}\sum_{i=1}^{n^a_\Delta} x_i$ as the estimate for $\{\theta^\ast_i\}_{i\in S}$.
7:   if $\sum_{i\in S}\hat\theta_i^2 \le 1 - \Delta^2$ then ⊳ Still have undiscovered coordinates with $|\theta^\ast_i| > \Delta$.
8:     Set $R \leftarrow \sqrt{1 - \sum_{i\in S}\hat\theta_i^2}$, $K \leftarrow d - |S|$. ⊳ Explore phase.
9:     Perform $n^b_\Delta \ge 1$ calls to RANDOMPROJECTION($K, R, S, \hat\theta$) in Algorithm 2, until
$$\frac{2}{n^b_\Delta}\sqrt{\sum_{k=1}^{n^b_\Delta}\big(r_{k,i}-\bar r_i\big)^2\ln\frac{4}{\delta}} \le \frac{\Delta}{4}, \qquad \forall\, 1\le i\le K, \qquad (2)$$
       where $r_k$ is the $k$-th return vector of RANDOMPROJECTION and $\bar r \triangleq \frac{1}{n^b_\Delta}\sum_{k=1}^{n^b_\Delta} r_k$.
10:    for $i = 1, 2, \ldots, K$ do
11:      if $|\bar r_i| > \Delta$ then add the $i$-th element that is not in $S$ to $S$.

Algorithm 2 The RANDOMPROJECTION Subroutine
1: function RANDOMPROJECTION($K, R, S, \hat\theta$)
2:   Generate $K$ i.i.d. samples $y_1, y_2, \ldots, y_K$, each with equal probability being $\pm\frac{R}{\sqrt K}$.
3:   Play $x \in \mathcal X$ constructed as $x_i = \hat\theta_i$ if $i\in S$, and $x_i = y_j$ if $i$ is the $j$-th element that is not in $S$.
4:   return $\frac{K}{R^2}\big(r - \sum_{i\in S}\hat\theta_i^2\big)\, y$, where $r = \langle x, \theta^\ast\rangle + \eta$ is the (noisy) feedback.
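To make the explore phase concrete, below is a minimal Python sketch (our illustration, not the authors' implementation) of Algorithm 2 together with the stopping rule of Eq. (2) and the thresholding step of Lines 10-11. The environment interaction is mocked by a `noisy_feedback` function, the function and variable names are ours, and the empirical-Bernstein radius follows the reconstruction of Eq. (2) above, so absolute constants should be read as indicative rather than definitive.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_feedback(theta_star, x, sigma):
    """Mock environment: r = <x, theta*> + eta with eta ~ N(0, sigma^2)."""
    return float(x @ theta_star) + rng.normal(scale=sigma)

def random_projection(theta_star, sigma, S, theta_hat, d):
    """One call of Algorithm 2: a K-dim vector whose j-th entry estimates theta*_i
    for the j-th coordinate i outside S (up to the O(Delta^2) bias of theta_hat)."""
    out = [i for i in range(d) if i not in S]          # coordinates not yet committed
    K = len(out)
    R = np.sqrt(max(1.0 - sum(theta_hat[i] ** 2 for i in S), 0.0))
    y = rng.choice([-1.0, 1.0], size=K) * R / np.sqrt(K)
    x = np.zeros(d)
    for i in S:
        x[i] = theta_hat[i]                            # reserve mass for identified coordinates
    x[out] = y
    r = noisy_feedback(theta_star, x, sigma)
    return (K / R ** 2) * (r - sum(theta_hat[i] ** 2 for i in S)) * y

def explore_phase(theta_star, sigma, S, theta_hat, d, delta_gap,
                  fail_prob=0.01, max_calls=100_000):
    """Repeat RANDOMPROJECTION until the empirical-Bernstein radius of Eq. (2)
    drops below delta_gap / 4, then add coordinates with |mean| > delta_gap to S."""
    samples = []
    while len(samples) < max_calls:
        samples.append(random_projection(theta_star, sigma, S, theta_hat, d))
        rs = np.asarray(samples)
        n_b = len(rs)
        radius = (2.0 / n_b) * np.sqrt(((rs - rs.mean(axis=0)) ** 2).sum(axis=0)
                                       * np.log(4.0 / fail_prob))
        if n_b > 1 and radius.max() <= delta_gap / 4:
            break
    mean = np.asarray(samples).mean(axis=0)
    out = [i for i in range(d) if i not in S]
    for j, i in enumerate(out):
        if abs(mean[j]) > delta_gap:
            S.add(i)
    return S
```

In the full framework, this explore routine alternates with commit phases in which the plug-in linear bandit algorithm $\mathcal F$ runs on the coordinates in $S$ and $\hat\theta$ is refreshed via the regret-to-sample-complexity conversion of Eq. (3); the halving of $\Delta$ and the global round budget $T$ are omitted here for brevity.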
3 FRAMEWORK AND ANALYSIS

Our framework VASLB is presented in Algorithm 1. We explain its design in Section 3.1 and sketch its analysis in Section 3.2. Then we give two applications, using VOFUL2 (Kim et al., 2021) and Weighted OFUL (Zhou et al., 2021) as $\mathcal F$, whose analyses are sketched in Sections 4.1 and 4.2.

3.1 MAIN DIFFICULTIES AND TECHNICAL OVERVIEW

At a high level, our framework follows the spirit of the classic explore-then-commit approach (which is directly adopted by Carpentier & Munos (2012)), where the agent first identifies the large entries of $\theta^\ast$ and then runs a linear bandit algorithm on them. However, it is hard to incorporate variances into this vanilla idea to make it variance-aware: the desired regret depends on the variances and is thus unknown to the agent, so it is difficult to determine a gap threshold $\Delta$ (that is, a magnitude such that the agent stops exploring and commits after identifying all coordinates with $|\theta^\ast_i| \ge \Delta$) within a few rounds. For example, in the deterministic case, the agent must identify all non-zero entries to make the regret independent of $T$; on the other hand, in the worst case where $\sigma_t \equiv 1$, the agent only needs to identify all entries with magnitude at least $T^{-1/4}$ to yield $\sqrt T$-style regret bounds. At the same time, the actual setting might be a mixture of them (e.g., $\sigma_t \equiv 0$ for $t \le t_0$ and $\sigma_t \equiv 1$ for $t > t_0$ for some $t_0\in[T]$). As a result, such an idea cannot always succeed in determining the correct threshold and getting the desired regret.

In our proposed framework, we tackle this issue by performing explore-then-commit multiple times. We reduce the uncertainty gently and alternate between "explore" and "commit" modes. We decrease a gap threshold $\Delta$ in a halving manner and, at the same time, maintain a set $S$ of coordinates that we believe to have magnitude larger than $\Delta$. For each $\Delta$, we "explore" (estimating $\theta^\ast_i$ and adding those greater than $\Delta$ into $S$) and "commit" (running the linear bandit algorithm on the coordinates in $S$). However, as we explore again after committing, we face a unique challenge: Suppose that some entry $i\in[d]$ is identified to have magnitude at least $2\Delta$ by previous explore phases. During the next explore phase, we cannot directly do pure exploration over the remaining unidentified coordinates; otherwise, coordinate $i$ will incur $4\Delta^2$ regret for each round.
Fortunately, we can get an estimation bθi of θ i during the previous commit phase thanks to the regret-to-sample-complexity conversion (Eq. (3)). Guarded with this estimation, we can reserve bθi mass for arm i and subtract bθ2 i from the feedback in subsequent explore phases. More preciously, we do the following. 1. In the commit phase where we apply the black-box F, we estimate {θ i }i S by the regret-tosample-complexity conversion: Suppose F plays x1, x2, . . . , xn and achieves regret RF n , then θ bθ, θ RF n n , where bθ 1 i=1 xi. (3) Hence, if we take {bθi}i S as an estimate of {θ i }i S, the estimation error shrinks as RF n is sublinear and the LHS of Eq. (3) is non-negative. Moreover, as we can show that bθ is not away from X by a lot (Lemma 18), we can safely use {bθi}i S to estimate {θ i }i S in subsequent phases. More importantly, if we are granted access to RF n , we know how close the estimate is; we can proceed to the next stage once it becomes satisfactory. But it is unrevealed. Fortunately, we know the regret guarantee of F, namely RF n , which can serve as a pessimistic estimation of RF n . Hence, terminating when 1 n RF n < 2 can ensure θ bθ, θ < 2 to hold with high probability. 2. In the exploration phase, as mentioned before, we can keep the regret incurred by the coordinates identified in S small by putting mass bθi for each i S. For the remaining ones, we use random projection, an idea borrowed from compressed sensing literature (Blumensath & Davies, 2009; Carpentier & Munos, 2012), to find those with large magnitudes to add them to S. One may notice that putting mass bθi for all i S will induce bias to our estimation as P i S bθ2 i = P i S bθiθ i . However, as bθi is close to θ i , this bias will be bounded by O( 2) and become dominated by 4 as decreases. Hence, if we omit this bias, we can overestimate the estimation error due to standard concentration inequalities like Empirical Bernstein (Maurer & Pontil, 2009; Zhang et al., 2021). Once it becomes small enough, we alternate to the commit phase again. Therefore, with high probability, we can ensure all coordinates not in S have magnitudes no more than O( ) and all coordinates in S will together contribute regret bounded by O( 2). Hence, the regret in each step is (roughly) bounded by O(s 2). Upper bounding the number of steps needed for each stage and exploiting the regret guarantees of the chosen F then gives well-bounded regret. 3.2 ANALYSIS OF THE FRAMEWORK Notations. For each , let T be the set of rounds associated with . By our algorithm, each T should be an interval. Moreover, {T } forms a partition of [T]. Define T a as all the rounds in the commit phase when the gap threshold is (where F is executed), and T b as the explore phase (i.e., those executing RANDOMPROJECTION). Let f T a and f T b be the steps that the agent decided not to proceed in T a and T b , respectively, which are formally defined as f T i = {t T i | t = maxt T i t }, i = a, b. Define the final value of as f. Denote na = |T a | and nb = |T b | (both are stopping times). We have P =2 2,..., f (na + nb ) = T. We can then decompose RT into Ra T and Rb T : =2 2,..., f t T a θ xt, θ , Rb T = X =2 2,..., f where Ra T may depend on the choice of F and Rb T only depends on the framework (Algorithm 1) itself. We now show that, as long as the regret estimation RF n is indeed an overestimation of RF n Published as a conference paper at ICLR 2023 with high probability, we can get a good upper bound of Rb T , which is formally stated as Theorem 3. 
The full proof of Theorem 3 will be presented in Appendix F and is only sketched here. Theorem 3. Suppose that for any execution of F that last for n steps, RF n RF n holds with probability 1 δ, i.e., RF n is pessimistic. Then the total regret incurred by the second phase satisfies t=1 σ2 t log 1 δ + s log 1 with probability 1 δ. Remark. This theorem indicates that our framework itself will only induce an (s d, s)-varianceawareness to the resulting algorithm. As noticed by Abbasi-Yadkori et al. (2011), when σt 1, Ω( sd T) regret is unavoidable, which means that it is only sub-optimal by a factor no more than s. Moreover, for deterministic cases, the e O(s) regret also matches the aforementioned divide-andconquer algorithm, which is specially designed and can only work for deterministic cases. Proof Sketch of Theorem 3. We define two good events with high probability for a given gap threshold : G and H . Informally, G means P i S θ i (θ i bθi) < 2 (i.e., bθ is close to θ after commit ) and H stands for |θ i | Ω( ) if and only if i S (i.e., we explore correctly). Check Eq. (10) in the appendix for formal definitions. For G , from Eq. (3), we know that it happens as long as RF n RF n . It remains to argue that Pr{H | G , H2 } 1 sδ. By Algorithm 2, the i-th coordinate of each rk (1 k nb ) is an independent sample of K R2 (yi)2 θ i + X j S bθj(θ j bθj) + X K R yi is an independent Rademacher random variable. After conditioning on G and H2 , P i S bθ2 i and P i S bθiθ i will be close. Therefore, the first term is exactly θ i (the magnitude we want to estimate), the second term is a small bias bounded by O( 2) and the last two terms are zero-mean noises, which are bounded by 4 according to Empirical Bernstein Inequality (Theorem 10) and our choice of nb (Eq. (2)). Hence, Pr{H | G , H2 } 1 sδ. Let us focus on an arm i never identified into S in Algorithm 1. By definition of nb (Eq. (2)), (rt,i ri )2 ln 4 (rt,i E[rt,i ])2 ln 4 where the second inequality is due to properties of sample variances. By G , those coordinates in S will incur regret of P i S(θ i xt,i)θ i = P i S(θ i bθi)θ i < 2 for all t T b . Moreover, by H2 , each arm outside S will roughly incur nb (θ i )2 = O(nb 2) regret, as yi s are independent and zero-mean. As there are at most s non-zero coordinates, the total regret for T b will be roughly bounded by O(nb s 2) (there exists another term due to randomized yi s, which is dominated and omitted here; check Lemma 21 for more details). Hence, the total regret is bounded by O(snb 2) = s e O (rt,i E[rt,i ])2 ln 4 To avoid undesired poly(T) factors, we cannot directly apply Cauchy-Schwartz inequality to the sum of square roots (as there are a lot of s). Instead, again by definition of nb (Eq. (2)), we observe the following lower bound of nb , which holds for all s except for f: t T b (rt,i E[rt,i ])2 ln 1 nb T, some arithmetic calculation gives Published as a conference paper at ICLR 2023 (intuitively, by thresholding, we manage to move the summation over into the square root, though suffering an extra logarithmic factor; see Eq. (15) in the appendix for more details) (rt,i E[rt,i ])2 = e O (rt,i E[rt,i ])2 For a given and any 1 k nb , the expectation of (rk,i E[rk,i ])2 is bounded by 1 + K (Eq. (19) in the appendix), which is no more than 1 + 4d 2 σ2 k . 
By concentration properties in the sample variances (Theorem 14 in the appendix), the empirical (rk,i E[rk,i ])2 should also be close to (1 + 4 d2 σ2 k); hence, one can write (omitting all log 1 nb 2 appears on both sides, we can apply the self-bounding property (Efroni et al., 2020, Lemma 38) to conclude Rb T = O(P snb 2) = e O s q d PT t=1 σ2 t + s , as claimed. 4 APPLICATIONS OF THE PROPOSED FRAMEWORK After showing Theorem 16, it only remains to bound Ra T , which depends on the choice of the plug-in algorithm F. In this section, we give two specific choices of F, VOFUL2 (Kim et al., 2021) and Weighted OFUL (Zhou et al., 2021). The former algorithm does not require the information of σt s (i.e., it works in unknown-variance cases), albeit computationally inefficient. In contrast, the latter is computationally efficient but requires σ2 t to be revealed with the feedback rt at round t. 4.1 COMPUTATIONALLY INEFFICIENT ALGORITHM FOR UNKNOWN VARIANCES We first use the VOFUL2 algorithm from Kim et al. (2021) as the plug-in algorithm F, which has the following regret guarantee. Note that this is slightly stronger than the original bound: We derive a strengthened self-bounding version of it (the first inequality), which is critical to our analysis. Proposition 4 (Kim et al. (2021, Variant of Theorem 2)). VOFUL2 executed for n rounds on d dimensions guarantees, w.p. at least 1 δ, there exists a constant C = e O(1) such that RF n C d1.5q Pn k=1 η2 k ln 1 δ + d2 ln 4 δ = e O d1.5p Pn k=1 σ2 k log 1 δ + d2 log 4 δ , where n is a stopping time finite a.s. and σ2 1, σ2 2, . . . , σ2 n are the variances of the independent Gaussians η1, η2, . . . , ηn. We now construct the regret over-estimation RF n . Due to unknown variances, it is not straightforward. Our rescue is to use ridge linear regression bβ argminβ Rd Pn k=1(rk xk, β )2 + λ β 2 for samples {(xk, rk)}n k=1, which ensures that the empirical variance estimation Pn k=1(rk xk, bβ )2 differs from the true sample variance Pn k=1 η2 k = Pn k=1(rk xk, β )2 by no more than e O(s log 1 δ ) (check Appendix E for a formal version). Accordingly, from Proposition 4, we can see that RF n RF n C k=1 (rk xk, bβ )2 ln 1 δ + s2 ln 1 Moreover, one can observe that the total sample variance Pn k=1 η2 k is bounded by (a constant multiple of) the total variance Pn k=1 σ2 k (which is formally stated as Theorem 13 in the appendix). Therefore, with Eq. (5) as our pessimistic regret estimation RF n , we have the following regret guarantee. Theorem 5 (Regret of Algorithm 1 with VOFUL2). Algorithm 1 with VOFUL2 as F and RF n defined in Eq. (5) ensures that RT = e O (s2.5 + s d) q PT t=1 σ2 t log 1 δ + s3 log 1 δ with probability 1 δ. Due to space limitations, we defer the full proof to Appendix G.1 and only sketch it here. Published as a conference paper at ICLR 2023 Proof Sketch of Theorem 5. To bound Ra T , we consider the regret from the coordinates in and outside S separately. For the former, the total regret in a single phase with gap threshold is simply controlled by e O s1.5q P t T a η2 t log 1 δ + s2 log 1 δ (thanks to Proposition 4). For the latter, each non-zero coordinate outside S can at most incur O( 2) regret for each t T a . By definition of na (Line 5), we have na = |T a | = O s1.5 t f T a η2 t ln 1 δ , just like the proof of Theorem 3. 
As the regret from the second part is bounded by O(s 2 na ), these two parts together sum to s2.5 v u u t t T a η2 t log 1 δ + s3 log 1 As in Theorem 3, we notice that na = Ω s1.5 t f T a η2 t ln 1 δ for all = f again by definition of na . This will move the summation over into the square root. Moreover, by the fact that η2 t = O(σ2 t log 1 δ ) (Theorem 14 in the appendix), we have Ra T = e O s2.5 q PT t=1 σ2 t log 1 δ + s3 log 1 δ . Combining this with the bound of Rb T provided by Theorem 3 concludes the proof. 4.2 COMPUTATIONALLY EFFICIENT ALGORITHM FOR KNOWN VARIANCES In this section, we consider a computational efficient algorithm Weighted OFUL (Zhou et al., 2021), which itself requires σ2 t to be presented at the end of round t. Their algorithm guarantees: Proposition 6 (Zhou et al. (2021, Corollary 4.3)). With probability at least 1 δ, Weighted OFUL executed for n steps on d dimensions guarantees RF T C( q δ + d q Pn k=1 σ2 k log 1 C = e O(1), n is a stopping time finite a.s., and σ2 1, σ2 2, . . . , σ2 n are the variances of η1, η2, . . . , ηn. Taking F as Weighted OFUL, we will have the following regret guarantee for sparse linear bandits: Theorem 7 (Regret of Algorithm 1 with Weighted OFUL). Algorithm 1 with Weighted OFUL as F and RF n defined as C q δ + s q Pn k=1 σ2 k ln 1 δ guarantees RT = e O (s2 + d) q PT t=1 σ2 t log 1 δ with probability 1 δ. The proof is similar to that of Theorem 5, i.e., bounding na by Line 5 of Algorithm 1 and then using summation techniques to move the summation over into the square root. The only difference is that we will need to bound O(P 2), which seems to be as large as T if we follow the analysis of Theorem 5. However, as we included an additive factor q δ in the regret over-estimation RF n , we have na 2q δ, which means na = Ω(s 4). From P na T, we can consequently bound P s ). The remaining part is just an analog of Theorem 5. Therefore, the proof is omitted in the main text and postponed to Appendix H. 5 CONCLUSION We considered the sparse linear bandit problem with heteroscedastic noises and provided a general framework to reduce any variance-aware linear bandit algorithm F to an algorithm G for sparse linear bandits that is also variance-aware. We first applied the computationally inefficient algorithm VOFUL from Zhang et al. (2021) and Kim et al. (2021). The resulting algorithm works for the unknown- variance case and gets e O((s2.5 + s d) q PT t=1 σ2 t log 1 δ + s3 log 1 δ ) regret, which, when regarding the sparsity factor s d as a constant, not only is worst-case optimal but also enjoys constant regret in deterministic cases. We also applied the efficient algorithm Weighted OFUL by Zhou et al. (2021) that requires known variance; we got e O((s2 + s d) q PT t=1 σ2 t log 1 s T + s) log 1 δ ) regret, still independent of d in deterministic cases. See Appendix B for several future directions. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS We greatly acknowledge Csaba Szepesvári for sharing the manuscript (Antos & Szepesvári, 2009) with us, which shows the Ω( d T) lower bound for sparse linear bandits when s = 1 and the action sets are unit balls. The authors thank Xiequan Fan from Tianjin University for the constructive discussion about the application of Propositions 8 and 11. At last, we thank the anonymous reviewers for their detailed reviews, from which we benefit greatly. This work was supported in part by NSF CCF 2212261, NSF IIS 2143493, NSF DMS-2134106, NSF CCF 2019844 and NSF IIS 2110170. 
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1 9. PMLR, 2012. Ayya Alieva, Ashok Cutkosky, and Abhimanyu Das. Robust pure exploration in linear bandits with limited budget. In International Conference on Machine Learning, pp. 187 195. PMLR, 2021. András Antos and Csaba Szepesvári. Stochastic bandits with large action sets revisited. Personal communication, 2009. Kaito Ariu, Kenshi Abe, and Alexandre Proutière. Thresholded lasso bandit. In International Conference on Machine Learning, pp. 878 928. PMLR, 2022. Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876 1902, 2009. Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263 272. PMLR, 2017. Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276 294, 2020. Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265 274, 2009. Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pp. 190 198. PMLR, 2012. Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pp. 355 366. Omnipress, 2008. Rémy Degenne, Wouter M Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. Advances in Neural Information Processing Systems, 32, 2019. Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. Advances in Neural Information Processing Systems, 34, 2021. Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. ar Xiv preprint ar Xiv:2003.02189, 2020. Xiequan Fan, Ion Grama, and Quansheng Liu. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20:1 22, 2015. Published as a conference paper at ICLR 2023 Louis Faury, Marc Abeille, Clément Calauzènes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp. 3052 3060. PMLR, 2020. David A Freedman. On tail probabilities for martingales. the Annals of Probability, pp. 100 118, 1975. Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753 10763, 2020. Botao Hao, Tor Lattimore, and Wei Deng. Information directed sampling for sparse linear bandits. Advances in Neural Information Processing Systems, 34, 2021a. Botao Hao, Tor Lattimore, Csaba Szepesvári, and Mengdi Wang. Online sparse reinforcement learning. 
In International Conference on Artificial Intelligence and Statistics, pp. 316 324. PMLR, 2021b. Jean Honorio and Tommi Jaakkola. Tight bounds for the expected risk of linear classifiers and pac-bayes finite-sample guarantees. In Artificial Intelligence and Statistics, pp. 384 392. PMLR, 2014. Yunlong Hou, Vincent YF Tan, and Zixin Zhong. Almost optimal variance-constrained best arm identification. ar Xiv preprint ar Xiv:2201.10142, 2022. Yassir Jedra and Alexandre Proutiere. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007 10017, 2020. Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018. Sampath Kannan, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. Advances in neural information processing systems, 31, 2018. Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust lasso bandit. Advances in Neural Information Processing Systems, 32, 2019. Yeoneung Kim, Insoon Yang, and Kwang-Sung Jun. Improved regret analysis for variance-adaptive linear bandits and horizon-free linear mixture mdps. ar Xiv preprint ar Xiv:2111.03289, 2021. Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp. 358 384. PMLR, 2018. Tor Lattimore and Marcus Hutter. Pac bounds for discounted mdps. In International Conference on Algorithmic Learning Theory, pp. 320 334. Springer, 2012. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semibandit feedback. Advances in Neural Information Processing Systems, 28, 2015. Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pp. 2173 2174. PMLR, 2019. Yingkai Li, Yining Wang, Xi Chen, and Yuan Zhou. Tight regret bounds for infinite-armed linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pp. 370 378. PMLR, 2021. Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample-variance penalization. In COLT 2009 - The 22nd Conference on Learning Theory, 2009. Min-hwan Oh, Garud Iyengar, and Assaf Zeevi. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, pp. 8271 8280. PMLR, 2021. Published as a conference paper at ICLR 2023 Zhimei Ren and Zhengyuan Zhou. Dynamic batch learning in high-dimensional sparse linear contextual bandits. ar Xiv preprint ar Xiv:2008.11918, 2020. Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014. Yining Wang, Yi Chen, Ethan X Fang, Zhaoran Wang, and Runze Li. Nearly dimensionindependent sparse linear bandit over small action spaces via best subset selection. ar Xiv preprint ar Xiv:2009.02003, 2020. Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304 7312. PMLR, 2019. Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S Du. Improved variance-aware confidence sets for linear bandits and linear mixture mdp. 
Advances in Neural Information Processing Systems, 34, 2021. Heyang Zhao, Dongruo Zhou, Jiafan He, and Quanquan Gu. Bandit learning with general function classes: Heteroscedastic noise and variance-dependent regret bounds. ar Xiv preprint ar Xiv:2202.13603, 2022. Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp. 4532 4576. PMLR, 2021. Published as a conference paper at ICLR 2023 Supplementary Materials A More on Related Works 13 B Future Directions 14 C Divide-and-Conquer Algorithm for Deterministic Settings 14 D Concentration Inequalities 15 D.1 Sample Mean Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 D.2 Sample Variance Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 E Ridge Linear Regression 17 F Omitted Proof in Section 3.2 (Analysis of Framework) 19 F.1 Proof of Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 F.2 Regret-to-Sample-Complexity Conversion . . . . . . . . . . . . . . . . . . . . . . 22 F.3 Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 F.4 Single-Phase Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 G Omitted Proof in Section 4.1 (Analysis of VOFUL2) 24 G.1 Proof of Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 G.2 Regret Over-Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 G.3 Extension to Unknown Variance Cases . . . . . . . . . . . . . . . . . . . . . . . . 29 H Omitted Proof in Section 4.2 (Analysis of Weighted OFUL) 30 H.1 Proof of Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 H.2 Regret Over-estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 I Auxilliary Lemmas 36 A MORE ON RELATED WORKS In this section, we briefly compare to several related works on sparse linear bandits in terms of regret guarantees, noise assumptions and query models. The regret of Abbasi-Yadkori et al. (2012) is e O(Rs d T) when assuming conditionally R-sub Gaussian noises (i.e., ηt | Ft 1 sub G(R2), which will be formally defined in ??). At the same time, they allow an arbitrary varying action set D1, D2, . . . , DT Bd (though they in fact allows arbitrary decision sets D1, D2, . . . , DT Rd, their regret bound scales with maxx Dt x 2, so we assume Dt Bd without loss of generality). This model is less strictive than ours, as we only allow D1 = D2 = = DT = Bd (as explained in Footnote 3 in the main text. When the noises are Gaussian with variance 1 and the ground-truth θ is one-hot (i.e., s = 1), their regret bound reduces to e O( d T), which matches the Ω( d T) bound in Antos & Szepesvári (2009) when the actions sets are allowed to be the entire unit ball (which means the agent will be more powerful than that of Abbasi-Yadkori et al. (2012)). Published as a conference paper at ICLR 2023 The regret of Carpentier & Munos (2012) is e O(( θ 2 + σ 2)s T), assuming a unit-ball action set, a θ 2 θ 2 ground-truth and ηt 2 σ 2 noises, where ηt will be defined later. This bound seems to bypass the Ω( d T) lower bound when s = 1. However, this is due to a different noise model: They assumed the noise is component-wise, i.e., rt = θ + ηt, xt where ηt Rd. In contrast, our model assumed a θ , xt + ηt noise model where ηt R. 
Therefore, the maxt ηt 2 dependency can be of order O( d) to ensure a similar noise model as ours. The regret of Lattimore et al. (2015, Appendix G) is also of order e O(s T), assuming a [ 1, 1]- bounded noises, a hypercube X = [ 1, 1]d action set and a θ 1 1 ground-truth. We will then explain why this does not violate the Ω( sd T) regret lower bound as well. Consider an extreme query (1, 1, . . . , 1) X, which is valid in their query model (in fact, their algorithm is a random projection procedure with some carefully designed regularity conditions, so this type of queries appears all the time). However, in our query model where the action set is the unit ball Bd, we have to scale it by 1 d. As the noise will never be scaled, this will amplify the noises by d, so we will need poly(d) more times of queries to get the same confidence set, making the regret bound have a polynomial dependency on d. Moreover, their ground-truth θ needs to satisfy θ 1 1. However, an s-sparse ground-truth θ with 2-norm 1 can have 1-norm as much as s. Therefore, another s should also be multiplied for a fair comparison with our algorithm. In conclusion, the second and the third work assumed different noise or query models to amplify the signal-to-noise ratio and thus avoid a polynomial dependency on d, compared to the regret bounds of Abbasi-Yadkori et al. (2012) and ours. However, we have to admit that Abbasi-Yadkori et al. (2011) allows a drifting action set, whereas ours only allow a unit-sphere action set, just like Carpentier & Munos (2012). The reason is discussed in Footnote 3 in the main text. B FUTURE DIRECTIONS First of all, there is still a gap in the worst-case regret in terms of s, as the lower bound for sparse linear bandits is Ω( sd T) instead of our e O(s d T) when σt 1. Closing this gap in s is an interesting future work. Our current algorithm, unfortunately, is incapable of a O( sd T)-style worst-case regret guarantee: Suppose that T = ds2, θ i = s 1/2 for i = 1, 2, . . . , s (so θ 2 1), and σt 1. Then we have nb 1 q nb (1 + d 2), which gives nb d 4. Hence, the total regret will be Ps i=1 P θ i nb (θ i )2 d Ps i=1(θ i )2 = ds2 = O(s d T). Thus, algorithmic improvements must be made to better dependency on s. We leave this for future research. Moreover, the current work relies on the random projection procedure (Carpentier & Munos, 2012), which only works when the action set is the unit sphere. Such an assumption is unrealistic in practice. We wonder whether there is an alternative that only requires a looser condition. At last, deriving a variance-aware lower bound rather than a minimax one is also important, as it can better illustrate the inherent hardness of the problem with different noise levels. We remark that extending the proof of current minimax lower bounds (see, e.g., (Antos & Szepesvári, 2009)) to variance-aware ones is not straightforward. C DIVIDE-AND-CONQUER ALGORITHM FOR DETERMINISTIC SETTINGS In this section, we discuss how to solve the deterministic sparse linear bandit problem in e O(s) steps using a divide-and-conquer algorithm, as we briefly mentioned in the main text. We mainly adopt the idea mentioned by Dong et al. (2021, Footnote 6). For each divide-andconquer subroutine working on several coordinates i1, i2, . . . , ik [d], we query half of them (e.g., i1, i2, . . . , ik/2 when assuming 2 | k) with p 2/k mass on each coordinate. This will reveal whether there is a non-zero coordinate among them. 
If the feedback is non-zero, we then conclude that there exists a non-zero coordinate in this half. Hence, we dive into this half and conquer this sub-problem (i.e., divide-and-conquer). Otherwise, we simply discard this half and consider the other half. Published as a conference paper at ICLR 2023 However, this vanilla algorithm proposed by Dong et al. (2021) fails to consider the possibility that two coordinates cancel each other (e.g., two coordinates with magnitude p 1/2 will make the feedback equal to zero). Fortunately, this problem can be resolved via randomly putting magnitude p 2/k on each coordinate, which is similar to the idea illustrated in Algorithm 2. As the environment is deterministic, each step will give the correct feedback with probability 1. Therefore, a constant number of trials is enough to tell whether there exists a non-zero coordinate. At last, we analyze the number of interactions needed for this approach. As it is guaranteed that each divide-and-conquer subrountine will be working on a set of coordinates where at least one of them is non-zero, we can bound the number of interactions as O P θ i =0 #subrountines containing i . As for each time we will divide the coordinates into half, there can be at most log2 d subrountines containing i for each individual i. Therefore, the number of interactions will be O(s log d). After that, we will be sure to find out all coordinates with non-zero magnitudes. Asking each of them once then reveals their actual magnitude. Therefore, we can recover θ in O(s log d+s) = O(s log d) rounds and will not suffer any regret after that. So the regret of this algorithm will indeed be e O(s), which is (nearly) independent of d and T. D CONCENTRATION INEQUALITIES D.1 SAMPLE MEAN UPPER BOUND We shall make use of the following self-normalizing result. Proposition 8 (Fan et al. (2015, Remark 2.9)). Suppose that {ξi}n i=1 are independent and symmetric. Then for all x > 0, Pk i=1 ξi p Pn i=1 ξ2 i x Corollary 9. Let X1, X2, . . . , Xn be a sequence of independent and symmetric random variables where n is a stopping time that is finite a.s. Then for any δ > 0, with probability 1 δ, we have i=1 (Xi µi) i=1 (Xi µi)2 ln 2 Proof. This immediately follows by picking x = 2 q δ and then applying Fatou s lemma. Therefore, we can present our Empirical Bernstein Inequality for conditional symmetric stochastic processes with a common mean, as follows: Theorem 10 (Empirical Bernstein Inequality). For a sequence of independent and symmetric random variables X1, X2, . . . , Xn that shares a common mean (i.e., E[Xi] = µ for some µ for all i), we have the following inequality where n is a stopping time finite a.s. i=1 (Xi X)2 ln 4 1 δ, δ (0, 1), where X = 1 n Pn i=1 Xi is the sample mean. Proof. By direct calculation, we have i=1 (Xi X)2 = i=1 X2 i 2n X 2 + n X 2 = i=1 X2 i n X 2 = i=1 (Xi µ)2 n(X µ)2. Published as a conference paper at ICLR 2023 Applying Corollary 9 to {Xi}n i=1 gives i=1 (Xi µ)2 ln 4 Therefore, with probability 1 δ i=1 (Xi X)2 i=1 (Xi µ)2 i=1 (Xi X)2 + 4 i=1 (Xi µ)2 ln 4 Hence, with probability 1 δ i=1 (Xi µ)2 2 i=1 (Xi X)2. (7) By the union bound, with probability 1 δ, we thus have i=1 (Xi X)2 ln 4 as claimed. D.2 SAMPLE VARIANCE UPPER BOUND Recall that a random variable X is sub-Gaussian with variance proxy σ2 if and only if E[exp(λ(X E[X]))] exp( 1 2λ2σ2) for all λ > 0. We shall denote such a random variable by X sub G(σ2). We first state the following generalized Freedman s inequality for sub-Gaussian random variables. 
Proposition 11 (Fan et al. (2015, Theorem 2.6)). Suppose that {ξi}n i=1 is a sequence of zero-mean random variables, i.e., E[ξi] = 0. Suppose that E[exp(λξi)] exp(f(λ)Vi) for some deterministic function f(λ) and some fixed {Vi}n i=1 for all λ (0, ), then, for all x, v > 0 and λ > 0, we have i=1 Vi v2 ) exp λx + f(λ)v2 . To derive a bound related to P(Xi X)2 and P σ2 i , we will need to characterize the concentration of the square of a sub-Gaussian random variable, which is a sub-exponential random variable: Proposition 12 (Honorio & Jaakkola (2014, Appendix B)). For a sub-Gaussian random variable X with variance proxy σ2 and mean µ, we have E exp λ(X2 E[X2]) exp 16λ2σ4 , |λ| 1 4σ2 . Theorem 13. For a sequence of sub-Gaussian random variables {Xi}n i=1 such that E[Xi] = µi, Xi sub G(σ2 i ), and n is a stopping time finite a.s., (Xi µi)2 E[(Xi µi)2] > 4 i=1 σ2 i ln 2 δ, δ (0, 1). Proof. We first consider a non-stopping time n. Apply Proposition 11 to the sequence {(Xi µi)2 E[(Xi µi)2]} with Vi = σ4 i , f(λ) = 16λ2 for λ < 1 4σ2max and f(λ) = otherwise, where σmax is defined as max{σ1, σ2, . . . , σn}. Then for all x, v > 0 and λ (0, 1 σ2 max ), we have (Xi µi)2 E[(Xi µi)2] > x exp λx + 16λ2v2 . Published as a conference paper at ICLR 2023 Picking v2 = Pn i=1 σ4 i q δ and x = 4 (Xi µi)2 E[(Xi µi)2] > 4 i=1 σ4 i ln 2 where λ is set to x 32v2 = 1 4 2v < 1 σ2max (as v2 > Pn i=1 σ4 i σ4 max). A union bound by applying Proposition 11 to the sequence {E[(Xi µi)2] (Xi µi)2} with the same parameters and noticing the fact that p Pn i=1 σ4 i Pn i=1 σ2 i then shows that our conclusion hold for any fixed n. By Fatou s lemma, we conclude that it also holds for a stopping time n that is finite a.s. Theorem 14 (Variance Concentration). Let {Xi}n i=1 be a sequence of random variables with a common mean µ such that Xi sub G(σ2 i ), (Xi µ) is symmetric, and n is a stopping time finite a.s. Then, δ (0, 1), with probability 1 δ, we have the following three inequalities: i=1 (Xi X)2 i=1 (Xi µ)2 i=1 (Xi X)2 1 i=1 (Xi µ)2 i=1 (Xi µ)2 8 i=1 σ2 i ln 2 where X = 1 n Pn i=1 Xi is the sample mean. Proof. The first two inequalities follow from Eqs. (6) and (7). The last one follows from Theorem 13 together with the fact that E[(Xi µi)2] σ2 i by definition of sub-Gaussian random variables. E RIDGE LINEAR REGRESSION Lemma 15. Suppose that we are given n samples yi = xi, β + ϵi, i = 1, 2, . . . , n, where β Bd and {xi}n i=1, {ϵi}n i=1 are stochastic processes adapted to the filtration {Fi}n i=0 such that ϵi is conditionally σ-Gaussian, i.e., ϵi | Fi 1 sub G(σ2). Define the following quantity as the estimate for β : bβ = argmin β i=1 (yi x T i β)2 + λ β 2 Then with probability 1 δ, the following inequality holds: i=1 (yi x T i β )2 i=1 (yi x T i bβ)2 2dσ2 ln n λdδ2 + 1 Proof. Denote y = (y1, y2, . . . , yn)T, X = (XT 1 , XT 2 , . . . , XT n) and ϵ = (ϵ1, ϵ2, . . . , ϵn)T. Denote Var = Pn i=1(yi x T i β )2 and c Var = Pn i=1(yi x T i bβ)2. We have the following representation of bβ, which is by direct calculation (check, e.g., (Kirschner & Krause, 2018)) bβ = (XTX + λI) 1XTy. (8) Furthermore, by Abbasi-Yadkori et al. (2011, Proof of Theorem 2), we have bβ β = (XTX + λI) 1XTϵ λ(XTX + λI) 1β . (9) Published as a conference paper at ICLR 2023 Therefore, we can write Var c Var = (y Xβ )T(y Xβ ) (y X bβ)T(y X bβ) = (β )TXTXβ (β )TXTy y TXβ bβTXTX bβ + bβTXTy + y TX bβ = bβT(XTy XTX bβ) (β )TXT(y Xβ ) + y TX(bβ β ). By Eq. (8), we have XTy = (XTX + λI)bβ. So the first term is just λbβT bβ. 
As y = Xβ + ϵ, the second term is just (β )TXTϵ. By Eq. (9), the last term becomes y TX(XTX + λI) 1XTϵ λy TX(XTX + λI) 1β . For the sake of simplicity, we define a, b M = a T(XTX + λI) 1b (note that a, b M = b, a M as XTX + λI is symmetric) and denote the induced norm by M. Therefore, we have Var c Var = λbβT bβ (β )TXTϵ + XTy, XTϵ M λ XTy, β M. Again by Eq. (8), we have XTy, XTϵ M = bβTXTϵ. Therefore, the second and third term together give (β )TXTϵ + XTy, XTϵ M = (bβ β )TXTϵ = (XTX + λI) 1XTϵ λ(XTX + λI) 1β T XTϵ = XTϵ 2 M λ β , XTϵ = XTϵ 2 M λ(β )T(bβ β ) + λ2 β , β M, where the last step is yielded from using Eq. (9) reversely. Then note that XTy, β = bβ β , so taking expectation on both sides gives Var c Var = λbβT bβ + XTϵ 2 M λ(β )T(bβ β ) + λ2 β , β M λbβTβ = XT ϵ 2 M + λ bβ β 2 2 + λ2 β 2 M. By Cauchy-Schwartz inequality, we have bβ β 2 2 XTϵ 2 M (XTX + λI) 1/2 2 + β 2 M (XTX + λI) 1/2 2, where the matrix norms can further be bounded by 1/λmin(XTX + λI) 1 λ. Similarly, we can conclude that β 2 M 1/λmin(XTX + λI) β 2 2 1/λ. Consequently, we have Var c Var = 2 XT ϵ 2 M + 1 As proved by Abbasi-Yadkori et al. (2011, Theorem 1), with probability 1 δ, we have XTϵ 2 M 2σ2 ln det(XTX + λI) p δ2 (λ + n/s)s where σ2 is the (maximum) variance proxy and the second last step is due to the Determinant-Trace Inequality (Abbasi-Yadkori et al., 2011, Lemma 10). Hence, with probability 1 δ, we have Var c Var 2dσ2 ln n λdδ2 + 1 as claimed. Published as a conference paper at ICLR 2023 F OMITTED PROOF IN SECTION 3.2 (ANALYSIS OF FRAMEWORK) F.1 PROOF OF MAIN THEOREM In this section, we prove Theorem 3, which is restated as follows for the ease of reading: Theorem 16 (Restatement of Theorem 3). Suppose that for any execution of F that last for n steps, RF n RF n holds with probability 1 δ, i.e., RF n is pessimistic. Then the total regret incurred by the second phase satisfies t=1 σ2 t log 1 δ + s log 1 with probability 1 δ. Proof. As mentioned in the main body, we define the following good events for each = 2 2, 2 2, . . . , f where f is the final value of : i S θ i (θ i bθi) < 2 after T a , H : (|θ i | > 3 i S) i S |θ i | > 1 2 after T b . (10) It is by definition to see that H 1 2 indeed holds, as all |θ i | < 3 2 and S is initially empty. We can then use induction to prove that all good events hold with high probability. Here, we list several technical lemmas we informally referred to in the main body, whose proofs are left to subsequent sections. The first one is about the regret-to-sample-complexity conversion. Lemma 17 (Regret-to-sample-complexity Conversion). If for any execution of F that lasts for n steps, we have RF n RF n with probability 1 δ, then we have Pr{G } 1 δ. The second term bounds the bias term, i.e., the second term of Eq. (4) for random projection. Lemma 18 (Bias Term of the Random Projection). Conditioning on G and H2 , we have |P i S(bθ2 i (θ i )2)| < 3 2, which further gives |P i S bθi(θ i bθi)| < 4 2. Furthermore, we can bound the estimation error of the random projection process. Lemma 19 (Concentration of the Random Projection). For any given i [K], we will have Pr |ri θ i | > 3 2 + As long as the estimation errors are small, we can ensure, with high probability that the good event for , namely H will also hold. Lemma 20 (Identification of Non-zero Coordinates). Pr{H | G , H2 } 1 dδ. Therefore, by combining all lemmas above, we can ensure that all good events, namely {G H } =2 2,..., f , hold simultaneously with probability 1 d Tδ. 
At last, we bound the total regret incurred in Phase B for . Lemma 21 (Single-Phase Regret Bound). Conditioning on G and H2 , we have θ xt, θ 36snb 2 + 6s R δ , with probability 1 sδ. Published as a conference paper at ICLR 2023 Then we follow the analysis sketched in the main body. We assume, without loss of generality, that s < d. Then, conditioning on all G and H , some coordinate i must never be included into S as H holds for all . Therefore, by property of the sample variances (Theorem 14), for such i and for each phase, with probability 1 δ, we have PN n=1(rn,i ri )2 PN n=1(rn,i E[rn,i ])2. Together with Eq. (2), (rt,i E[rt,i ])2 ln 4 By using Lemma 21, the total regret from Phase B is bounded by =20,2 2,..., f 36snb 2 + 6s R By writing nb as (nb 1) + 1, for any given , the total regret of Phase b is bounded by (rt,i E[rt,i ])2 ln 4 v u u u t4 2 (rt,i E[rt,i ])2 ln 4 v u u t 2 X (rt,i E[rt,i ])2 ln 4 | {z } Part (a) v u u t 2 X (rt,i E[rt,i ])2 ln 4 | {z } Part (b) As mentioned in the main text, we make use of the following lower bound of nb which again follows from Eq. (2) and holds for all s except f: (rt,i ri )2 ln 4 (rt,i E[rt,i ])2 ln 4 where the last step is due to Theorem 14. For simplicity, define S = 2 P t T b (rn,i E[rn,i])2 and therefore nb C 2 S where C = 8 q δ , a constant if we regard δ as a constant. Therefore, X = f nb T (14) and our goal is to upper bound P S . Define a threshold X = T/ C q P denote X = 2 log4 X so that 2 X 1 X . We will have X =2 2,..., X = X/2,...,2 f Published as a conference paper at ICLR 2023 =2 2,..., X S + X 1 4 X S + 1 1 4 X S + 1 X T C = ( p log4 X + 1) s X = f S , (15) where (a) applied Cauchy-Schwartz to the first summation, (b) applied the fact that 2 X 1 X and (c) used Eq. (14). So Part (a) of the regret, namely Eq. (12), can be bounded by v u u t 2 X (rt,i E[rt,i])2 ln 4 δ + 72s (b) 288 =2 2,..., f 2 X (rt,i E[rt,i ])2 ln 4 δ log4(4T) + 72s =2 2,..., f 8 X ( 2 + dσ2 t ) ln 4 δ log4(4T) + 72s (16) t=1 σ2 t ln 4 δ log4(4T) + 1152 δ log4(4T) + 72s. where (a) used the fact that P 2 2 as = 2 i, (b) used Eq. (15), (c) again used Cauchy Schwartz inequality and (d) used Theorem 14(3), the variance concentration result, Eq. (19), the magnitude of the variance, and the facts that R , K d. Therefore, we can conclude that t=1 σ2 t log 4 Notice that the left-handed-side and the right-handed-side has a common term (the RHS one is inside the square-root sign). Hence, by the self-bounding property Lemma 37, we can conclude that (note that we divided s on both sides) t=1 σ2 t log 4 which means that Part (a) (Eq. (12)), or equivalently O(s P nb 2), is bounded by t=1 σ2 t log 4 δ + s log 4 Now consider Part (b) of the regret, namely Eq. (13). The second term in each summand will sum up δ ). For the first term, using the notation S , we want to bound Published as a conference paper at ICLR 2023 Conditioning on the same events as in Eq. (16), we will have t=1 dσ2 t + t=1 dσ2 t log 1 t=1 dσ2 t log 1 where the second step comes from Eq. (17). Therefore, t=1 σ2 t log 1 t=1 σ2 t log 1 which gives (by combining the bound of Eq. (12) and Eq. (13) together): t=1 σ2 t log 4 t=1 σ2 t log 1 δ + s log 4 Further notice that, if d PT t=1 σ2 t 1, then the square-root is larger than the 4th-root. When d PT t=1 σ2 t 1, then either root is bounded by s. Hence, in either case, the 4th-root will be hidden by other factors. 
Henceforth, we indeed have the following conclusion with probability 1 (s + 3T)δ: t=1 σ2 t log 1 δ + s log 1 By setting the actual δ as (s + 3T)δ, we will still have the same regret bound as O(log c δ) = O(log 1 δ + log c) = e O(log 1 δ ) as all logarithmic factors will be hidden by e O. F.2 REGRET-TO-SAMPLE-COMPLEXITY CONVERSION Proof of Lemma 17. By Fatou s lemma and the fact that nb is finite a.s. (as it is truncated according to T), the probability that RF n is a pessimistic estimation for all n = 1, 2, . . . , nb is bounded by 1 δ. Conditioning on this, by definition, we will have n=1 θ i , θ i xn RF nb RF nb , which means our stopping criterion (Line 5) will ensure n=1 θ i , θ i bθ RF nb nb < 2 and thus G is ensured. F.3 RANDOM PROJECTION Proof of Lemma 18. By H2 , we have |θ i | > , i S, which gives |θ i bθi| < by G . So we can bound |P i S(bθ2 i (θ i )2)| by 2|P i S θ i (θ i bθi)| + |P i S(θ i bθi)2| < 2 2 + 2 = 3 2. The second claim is then straightforward as bθi(θ i bθi) = θ i (θ i bθi) (bθ2 i (θ i )2) . Published as a conference paper at ICLR 2023 Proof of Lemma 19. In the random projection procedure, let r(n) i be the random variable defined as (which is the i-th coordinate of rn in the algorithm) xn, θ + ηn X y(n) i 2 θ i + X j S bθj(θ j bθj) + X R2 y(n) i y(n) j Then ri is just the sample mean estimator of {r(n) i }nb n=1. Firstly, observe that y2 i is always R2 K , so the first term of Eq. (18) is just θ i , which is exactly the magnitude we want to estimate. Moreover, by Lemma 18, the second term is a small (deterministic) bias added to θ i and is bounded by 3 2. For the third term, as each yj is independent, K R2 yiyj is an i.i.d. Rademacher random variable, denoted by z(n) j . Hence, the last two terms sum to j / S,j =i z(n) j θ j + K By definition of Rademacher random variables, they are all zero-mean and symmetric. Henceforth they can be viewed as noises with variances at most j S,j =i z(n) j θ j + K R2 y(n) i ηn z(n) j θ j 2 + K where (a) is due to the mutual independence between z(n) j and y(n) i . Moreover, as each y(n) i is also redrawn for every n and i, we know that all {z(n) j }j S,1 n nb and {y(n) i }1 n nb are all mutually independent. Because the first two terms Eq. (18) is not random, all {r(n) i }1 n nb are independent, symmetric and sub-Gaussian (as θ j is bounded and ηn is Gaussian) random variables. Therefore, as nb is indeed a stopping time finite a.s., we shall apply Empirical Bernstein Inequality (Theorem 10) to {r(n) i }n where Var(r(n) i ) is characterized by Eq. (19), giving j / S,j =i z(n) j θ j + K n=1 (rn,i ri)2 ln 4 In other words, our choice of nb in Eq. (2) will ensure the average noise is bounded by 2 . By Lemma 18, we conclude that Pr{|ri θ i | > 3 2 + 4 | G , H2 } δ. Proof of Lemma 20. If we skipped due to P i S bθ2 i > 1 2, which is Line 7 of Algorithm 1, then by Lemma 18, we will have X i S (θ i )2 > 1 5 2 conditioning on G H2 , and thus all remaining coordinates are smaller than 3 . Moreover, by H2 , all discovered coordinates are with magnitude at least 2 2 . Hence H automatically holds in this case. Otherwise, suppose that the conclusion of Lemma 19 holds for all i [K] (which happens with probability 1 Kδ conditioning on G and H2 ). As we only pick those coordinates with |ri| > , all coordinates with magnitude at last + 4 + 3 2 < 3 will be picked as 1 4, so the first Published as a conference paper at ICLR 2023 condition of H indeed holds. Moreover, all picked coordinates will have magnitudes at least ( 4 + 3 2). 
But when = 1 4, there will be no coordinates in S as it is initially empty. And after that, we will have 2 8 . Hence, all coordinates with magnitude 2 will surely be identified, which means the second condition of H also holds. We then have Pr{H | G , H2 } 1 Kδ by putting these two cases together. F.4 SINGLE-PHASE REGRET Proof of Lemma 21. Conditioning on G , as we are playing xt,i = bθi for all i S and t T b , we will have X i S (θ i xt,i)θ i = X i S (θ i bθi)θ i < nb 2. Now consider a single coordinate not in S. For each t T b , we will equiprobably play R K for this coordinate. Hence, the total regret will become θ i = nb (θ i )2 + θ i X By H2 , the first term is bounded by 36nb 2. By Chernoff bound, the absolute value of the summation in the second term will be bounded by R δ with probability 1 δ. As there are at most s non-zero coordinates by sparsity, from a union bound, we can conclude that θ xt, θ 36snb 2 + 6s R with probability 1 sδ. G OMITTED PROOF IN SECTION 4.1 (ANALYSIS OF VOFUL2) G.1 PROOF OF MAIN THEOREM In this section, we prove Theorem 5, which is restated as Theorem 22. We first assume that Proposition 4 is indeed correct, whose discussion is left to Appendix G.2. Theorem 22 (Regret of Algorithm 1 with VOFUL2 in Unkown-Variance Case). Consider Algorithm 1 with F as VOFUL2 (Kim et al., 2021) and RF n as k=1 (rk xk, bβ )2 ln 1 δ + s2 ln 1 where C = e O(1) is the constant hidden in the e O notation of Proposition 4, x1, x2, . . . , xn are the actions made by the agent, r1, r2, . . . , rn are the corresponding (noisy) feedback, and bβ is defined as bβ = argmin β Rs k=1 (rk xk, β )2 + β 2 The algorithm ensures the following regret bound with probability 1 δ: t=1 σ2 t log 1 δ + s3 log 1 Published as a conference paper at ICLR 2023 Proof. By Proposition 4, the total regret incurred by algorithm F satisfies k=1 η2 k ln 1 δ + s2 ln 1 where C = e O(1) is a constant (with some logarithmic factors) and ηk N(0, σ2 k) is the noise for the k-th round executing F. We now consider our regret estimator RF n . We show that, with high probability, it is a pessimistic estimation. Lemma 23. For any given , with probability 1 δ, RF n RF n . Therefore, for the explore phase, we can make use of Theorem 16 which only requires RF n to be an over-estimate of RF n w.h.p., giving t=1 σ2 t log 1 δ + s log 1 So we only need to bound the regret for the commit phase, namely Ra T . As mentioned in the main text, we will consider the regret contributed from inside and outside S seprately. Formally, we will write Ra T as =2 2,..., f i S θ i (θ i xt,i) + X i/ S (θ i )2 ! where the equality is because we will not put any mass on those i / S during the commit phase. We still assume the good events G , H hold for all (defined in Eq. (10)). By H2 , for a given , P i S(θ i )2 36s 2. Moreover, by Lemma 23, we will have =2 2,..., f RF na + 36s X =2 2,..., f na 2. By the terminating criterion 1 n RF n < 2 (Line 5 of Algorithm 1), for all , we will have na 1 C 2 RF na , which means s1.5 v u u t (rt xt, bβ )2 ln 1 2 ln na sδ2 ln 1 δ + s2 ln 1 Therefore, plugging back into the expression of Ra T gives (the term related with RF na is dominated by the second term, as intuitively explained in the main text): =2 2,..., f s2.5 v u u t (rt xt, bβ )2 ln 1 The last term is simply bounded by O(s) after summing up over . Let us focus on the first term, where we need to upper bound the magnitude of RF n . 
Applying Lemma 15 shows that the following with probability 1 δ: rk xk, bβ 2 k=1 η2 k + 2s ln n Published as a conference paper at ICLR 2023 Therefore, we can write the regret (conditioning on this good event holds for all , which is with probability at least 1 d Tδ according to Theorem 16) as =2 2,..., f s2.5 v u u t t T a η2 t log 1 Again by the technique we used for Appendix F, we will need the following lower bound for all na except for the last one, which follows from the fact that RF n RF n (Lemma 23): By the summation technique that we used in Appendix F, more preciously, by the derivation of Eq. (15), we will have the following derivation: = f Cs1.5 v u u t t T a η2 t ln 1 t T a C2s3η2 t ln 1 t T a η2 t ln 1 where X is defined as t T a C2s3η2 t ln 1 and X = 2 log4 X , which means 2 X 1 X . Hence, we will have (the second summation will be bounded by T = f Cs1.5 v u u t t T a σ2 t ln 1 t T a C2s3η2 t log 1 t=1 η2 t log 1 In other words, the first term of Eq. (21) will be bounded by =2 2,..., f t T a η2 t ln 1 t=1 η2 t log 1 So we are done with the first and the last term of Eq. (21). Now consider the second term, which is equivalent to bounding P 1. Use the following property guaranteed by (again) Line 5 of Algorithm 1: X = f 2Cs2 ln 4 which means P 1 log4(T/s2) = e O(1). Combining them together gives t=1 η2 t log 1 At last, due to Theorem 13, with probability 1 δ, i=1 (ri xi, β )2 = k=1 E[η2 k | Fk 1] + 4 k=1 σ2 k ln 2 k=1 σ2 k ln 2 Published as a conference paper at ICLR 2023 Algorithm 3 VOFUL2 Algorithm (Kim et al., 2021) 1: for t = 1, 2, . . . , T do 2: Compute the action for the t-th round as xt = argmax x X max θ Tt 1 s=1 Θs x, θ , (22) where Θs is defined in Eq. (24). 3: Observe reward rt = xt, θ + ηt. where (a) is due to Theorem 13 and (b) is due to E[η2 k | Fk 1] σ2 k. Hence, t=1 σ2 t log 1 Combining this with the regret for Rb T (Theorem 16) gives t=1 σt t log 1 t=1 σt t log 1 δ + s3 log 1 as claimed. G.2 REGRET OVER-ESTIMATION In this section, we first discuss why the strengthened Proposition 4 holds. After that, we argue that our pessimistic estimation RF n is indeed an over-estimation of RF n . We state the VOFUL2 algorithm in Algorithm 3 and also restate Proposition 4 as Proposition 24 for the ease of presentation. Proposition 24. VOFUL2 on d dimensions guarantees, with probability at least 1 δ, t=1 η2 t log 1 δ + d2 log 1 t=1 σ2 t log 1 δ + d2 log 1 where T is a stopping time finite a.s. and σ2 1, σ2 2, . . . , σ2 T are the variances of η1, η2, . . . , ηT . Proof. We will follow the proof of Kim et al. (2021) and highlight the different steps. We first argue that an analog to their Empirical Bernstein Inequality (Zhang et al., 2021, Theorem 4) still holds. Lemma 25 (Analog of Theorem 4 from Zhang et al. (2021)). Let {Xi}n i=1 be a sequence of zeromean Gaussian random variables such that n is a stopping time finite a.s. Then for all n 8 and any δ (0, 1), i=1 X2 i ln 4 Proof. This is a direct corollary of Theorem 10. Thanks to Theorem 14, the following analog of Lemma 17 from Zhang et al. (2021) also holds: Published as a conference paper at ICLR 2023 Lemma 26 (Analog of Lemma 17 from Zhang et al. (2021)). Let {Xi}n i=1 be a sequence of zero-mean Gaussian random variables such that n is a stopping time finite a.s. Then for all δ (0, 1), i=1 σ2 i ln 2 where σ2 i is the variance of Xi. Then consider their Elliptical Potential Counting Lemma (Kim et al., 2021, Lemma 5), which holds as long as xt 2 1. 
In our setting, we indeed have this property as xt Bd (only the noises can be unbounded, instead of actions). Hence, this lemma still holds. Proposition 27 (Kim et al. (2021, Lemma 5)). Let x1, x2, . . . , xk Rd be such that xi 2 1 for all s [k]. Let Vk = λI + Pk i=1 xix T i . Let J = {i [k] | xi 2 V 1 i 1 q}, then |J| 2d ln(1 + q) ln 1 + 2/e ln(1 + q) 1 λ Then consider the confidence set construction: (x Ts µ)ℓϵs(θ) (x Ts µ) 2 ℓϵ2s(θ)ι, µ Bd where ϵs(θ) = rs x T s θ, ϵ2 s(θ) = (ϵs(θ))2, ι = 128 ln((12K2L)d+2/δ) = O( d), L = max{1, log2(1 + T d ) } and (x)ℓ= min{|x|, 2 ℓ} x |x|. By using Lemma 25 together with their original ϵ-net coverage argument, we have θ Θt for all t [T] with high probability: Lemma 28 (Analog of Lemma 1 from Kim et al. (2021)). The good event E1 = { t [T], θ Θt} happens with probability 1 δ. Similar to Kim et al. (2021), we define θt be the maximizer of Eq. (22) in the t-th round and define µt = θt θ . We also consider the following good event E2 : t [T], s=1 ϵ2 s(θ ) 8 s=1 σ2 s ln 8T which happens with probability 1 δ due to Lemma 26. Define Wℓ,t 1(µ) = 2 ℓλI + Abbreviate Wℓ,t 1(µt) as Wℓ,t 1, then we have the following lemma, which is slightly different from the original one, whose proof will be presented later. Lemma 29 (Analog of Lemma 4 from Kim et al. (2021)). Conditioning on E1 and setting λ = 1, we have 1. µt 2 Wℓ,t 1 C12 ℓ( p At 1ι + ι) for some absolute constant C1, where At Pt s=1 η2 s. 2. For all s t, we have µt 2 Wℓ,s 1 C12 ℓ( p As 1ι + ι). 3. There exists absolute constant C2 such that xtµt C2 x T t 2 W 1 ℓ,t 1( p At 1ι + ι). Therefore, as we have the same E1, Lemma 5 and a similar Lemma 4 (which uses ηs instead of σs), we can conclude that s=1 η2sι + ι s=1 η2sι + ι Published as a conference paper at ICLR 2023 t=1 η2 t log 1 δ + d2 log 1 as in Kim et al. (2021, Theorem 2) (recall that ι = O( d)). Conditioning on E2, we then further have t=1 σ2 t log 1 δ + d2 log 1 as claimed. Proof of Lemma 29. By definition and abbreviating ( )ℓas ( ), we have µt 2 Wℓ,t 1 2 ℓλI = (x Ts µt)(x T s µt) = (x Ts µs)(xsθt rt + rt xsθ ) (x Ts µt)( ϵs(θt) + ϵs(θ )) s=1 (x Ts µt)2ϵ2s(θt)ι + (x Ts µt) 2ϵ2s(θ )ι (x Ts µt) 22(x Ts µt)2ι + 2 (x Ts µt) 22ϵ2s(θ )ι v u u t2 ℓ t 1 X (x Ts µt) 22(x Ts µt)2ι + 2 ℓ16 (x Ts µt)(x Ts µt)ι + 2 ℓ16 2 ℓ8 µt 2 Vt 1 λIι + 2 ℓ16 where (a) used ϵ2 s(θt) = (rs xsθt)2 = (x T s (θ θt) + ϵ2 s(θ )) 2(x T s µt)2 + 2ϵ2 s and (b) used ϵ2 s(θ ) = η2 s. By the self-bounding property Lemma 37, we have µt 2 Vt 1 λI 16 s=1 σ2s + 8ι, which means µt 2 Vt 1 4λ + 16 q Pt 1 s=1 σ2sι + 8ι. Setting λ = 1 gives the first conclusion. Based on this, the second and third conclusion directly follow according to Kim et al. (2021). G.3 EXTENSION TO UNKNOWN VARIANCE CASES Based on Proposition 4, we then show that, our regret estimation RF n Eq. (20) is indeed pessimistic (i.e., Lemma 23). Proof of Lemma 23. From Lemma 15 with λ = 1, with probability 1 δ, we will have (recall the assumption that σ2 t 1 for all t [T]) k=1 (rk xk, β )2 = k=1 (rk xk, bβ )2 + 2s2 ln n Published as a conference paper at ICLR 2023 Therefore we have k=1 η2 k ln 1 δ + s2 ln 1 k=1 (rk xk, bβ )2 + 2s ln n δ + s2 ln 1 k=1 (rk xk, bβ )2 ln 1 δ + s2 ln 1 In other words, our RF n is an over-estimation of RF n with probability 1 δ. H OMITTED PROOF IN SECTION 4.2 (ANALYSIS OF WEIGHTED OFUL) H.1 PROOF OF MAIN THEOREM Similar to the VOFUL2 algorithm, we still assume Proposition 6 indeed holds and defer the discussions to the next section. 
We restate the regret guarantee of Weighted OFUL (Zhou et al., 2021), namely Theorem 7, for the ease of reading, as follows: Theorem 30 (Regret of Algorithm 1 with Weighted OFUL in Known-Variance Case). Consider Algorithm 1 with F as Weighted OFUL (Zhou et al., 2021) and RF n as k=1 σ2 k ln 1 The algorithm ensures the following regret bound with probability 1 δ: t=1 σ2 t log 1 Proof. Firstly, from Proposition 6, the condition of applying Theorem 16 holds. Therefore, we have t=1 σ2 t log 1 δ + s log 1 For Ra T , similar to Appendix H, we decompose it into two parts: those from S and from outside of S. The former case is bounded by Proposition 6, as Regret from S with gap threshold C t T a σ2 t ln 1 For those outside S, we will bound it as O(sna 2), where we only need to bound na . From Line 5 of Algorithm 1, we have s(na 1) ln 1 Published as a conference paper at ICLR 2023 By the self-bounding property that x a + b x implies x O(a + b2) (Lemma 37), we have Therefore, we can conclude that (the regret from S is dominated) δ + s2 v u u t where the last term is simply bounded by O(s). Again from Line 5 of Algorithm 1, we will have the following property for all = f: t T a σ2 t ln 1 We first bound the second term, which basically follow the summation technique (Eq. (15)) that we used in Appendices F and G.1: t T a σ2 t ln 1 t T a C2s2σ2 t ln 1 t T a σ2 t ln 1 where X is defined as t T a C2s2σ2 t ln 1 and X = 2 log4 X , which means 2 X 1 X . Hence, we will have (the second summation will be bounded by T X as P na T) t T a σ2 t ln 1 t T a C2s2σ2 t log 1 t=1 σ2 t log 1 Hence, for the second term, we have t=1 σ2 t log 1 t=1 σ2 t log 1 At last, we consider the first term. From the same lower bound of na (Eq. (25)), we will have δ = na > C2 By the fact that P = f na T, we will have = f 4 = O(1)C2s ln 1 Henceforth, =2 2,..., f 2 = O s2 log 1 Published as a conference paper at ICLR 2023 Algorithm 4 Weighted OFUL Algorithm (Zhou et al., 2021) 1: Intialize A0 λI, c0 0, bθ0 A 1 0 c0, bβ0 = 0 and Θ0 {θ | θ bθ0 A0 bβ0 + λB}. 2: for t = 1, 2, . . . , T do 3: Compute the action for the t-th round as xt = argmax x X max θ Θt 1 x, θ . (26) 4: Observe reward rt = xt, θ + ηt and variance information σ2 t , set σt = max{1/ d, σt}, set confidence radius bβt as v u u td ln 1 + t dλσ2 min,t where σmin,t mint s=1 σs. 5: Calculate At At 1 + xtx T t /σ2 t, ct ct 1 + rtxt/σ2 t, bθt A 1 t ct and Θt {θ | θ bθt At bβt + Combining all above together gives RT = Ra T + Rb T O t=1 σ2 t log 1 t=1 σ2 t log 1 δ + s log 1 t=1 σ2 t log 1 as claimed. H.2 REGRET OVER-ESTIMATION We again briefly argue that Proposition 6 holds under our noise model. We present their algorithm in Algorithm 4. Proposition 31. With probability at least 1 δ, Weighted OFUL executed for T steps on d dimensions guarantees t=1 σ2 t log 1 where C = e O(1), T is a stopping time finite a.s., and σ2 1, σ2 2, . . . , σ2 T are the variances of η1, η2, . . . , ηT . Proof Sketch. We mainly follow the original proof by Zhou et al. (2021) and highlight the differences. We first highlight their Bernstein Inequality for vector-valued martingales also holds under our assumptions, as: Lemma 32 (Analog of Theorem 4.1 from Zhou et al. (2021)). Let {xi}n i=1 be sequence of ddimensional random vectoes such that xt 2 L. Let {ηi}n i=1 be a sequence of independent, symmetric and {σ2 i }n i=1-sub-Gaussian random variables. Let rt = θ , xt + ηt for all t [n]. Set Zt = λI + Pt s=1 xsx T s , bt = Pt s=1 rsxs and θt = Z 1 t bt. Let n be a stopping time finite a.s. 
Then, δ (0, 1), βt, θt θ Zt βt + λ θ 2, t [n] Published as a conference paper at ICLR 2023 where βt = 8σ q d ln(1 + t L2 dλ ) ln 8t2 The proof, which will be presented later, mainly follows from the idea of their proof of Theorem 4.1, except that we are using Proposition 11. Check the proof below for more details about this. With this theorem, we can consequently conclude that the confidence construction is indeed valid by applying Lemma 32 to the sequence {ηt/σt}t [T ], which gives bθt θ At bβt + λ θ 2 bβt + λ, t [T], with probability 1 δ, where bβt = 8 q d ln(1 + t dλσ2 min,t ) ln 4t2 δ , as defined in Eq. (27). Therefore, we can conclude their (B.19) from exactly the same argument, namely t=1 min n 1, σt(bβt 1 + λ) xt/σt A 1 t 1 Similar to their proof, define I1 = {t [T] | xt/σt A 1 t 1 1} and I2 = [T] \ I1, then we have (where σmin is the abbreviation of σmin,T = min T s=1 σs) t I1 min{1, xt/σt 2 A 1 t 1} t=1 min{1, xt/σt 2 A 1 t 1} 2d ln 1 + T dλσ2 min where the last step uses Proposition 34 and the fact that xt/σt 2 σ 1 min. Therefore, we are having the same Eq. (B.21) as theirs, which gives 2d ln 1 + T dλσ2 min t=1 (bβt 1 + λ)2σ2 t + 4d ln 1 + T dλσ2 min By the choice of σt = max{1/ d, σt} and λ = 1, we have ln 1 + T dλσ2 min d log T log T which gives t=1 σ2 t log 1 as claimed, while the second step uses σ2 t = min{ 1 d, σ2 t } 1 Proof of Lemma 32. Their original proof mainly use the following two auxiliary results: The first one is the well-known Freedman inequality (Freedman, 1975), which is originally for bounded martingale difference sequences, while the second one is Lemma 11 from Abbasi-Yadkori et al. (2011). For the former one, from its variant for sub-Gaussian random variables (Proposition 11), we have: Published as a conference paper at ICLR 2023 Corollary 33. Suppose that {ξi}n i=1 is a sequence of zero-mean random variables where ξi sub G(σ2 i ) for some sequence {σi}n i=1. Let n be a stopping time finite a.s. Then for all x, v > 0 and λ > 0, i=1 Vi v2 ) Moreover, for any δ (0, 1), with probability 1 δ, we have i=1 σ2 i ln 2 Proof. The first conclusion is done by applying Proposition 11 optimally with f(λ) = 1 2λ2, Vi = σ2 i and λ = x v2 . The second conclusion is consequently proved by taking v2 = Pn i=1 σ2 i and x = For their second auxiliary lemma (Abbasi-Yadkori et al., 2011, Lemma 11), one can see that the original lemma indeed holds for sub-Gaussian random variables. Therefore, we still have the following lemma: Proposition 34 (Abbasi-Yadkori et al. (2011, Lemma 11)). Let {xt}T t=1 be a sequence in Rd and define Vt = λI + Pt s=1 xsx T s for some λ > 0. Then, if we have xt 2 L for all t [T], then t=1 min n 1, xt 2 V 1 t 1 o 2 log det(Vt) det(λI) 2d ln dλ + TL2 Recall the definition that Zt = λI + Pt s=1 xsx T s . Further define dt = Pt s=1 xiηi, wt = xt Z 1 t 1 and let Et be the event that ds Z 1 s 1 βs for all s t. They proved the following lemma, which still applies to our case: Lemma 35 (Analog of Lemma B.3 from Zhou et al. (2021)). With probability 1 δ 2, with the definitions of xt and ηt in Lemma 32, the following inequality holds for all t 1: 2ηsx T s Z 1 s 1ds 1 1 + w2s 1[Es 1] 3 Proof. We only need to verify whether we can apply our Freedman s inequality to ℓs 2ηsx T s Z 1 s 1ds 1 1+w2 s 1[Es 1]. It is obvious that E[ℓs | Fs 1] = 0. Moreover, from the following inequality (which is their Eq. 
(B.3)) |ℓs| 2 xs Z 1 s 1 1 + w2s ds 1 Z 1 s 11[Es 1] 2wi 1 + w2 i βs 1 min{1, 2wi}βi 1, and the fact that ηs | Fs 1 sub G(σ2), we have ℓs | Fs 1 sub G((σβs 1 min{1, 2ws})2). Denote the sub-Gaussian parameter as eσs for simplicity. We have s=1 eσ2 s σ2β2 t s=1 (min{1, 2ws})2 4σ2β2 t s=1 min{1, w2 s} 8σ2β2 t d ln 1 + t L2 where the first inequality is due to the non-decreasing property of {βs} and the last one is due to Proposition 34. Therefore, from our Freedman s inequality (Corollary 33), we can conclude that with probability 1 δ/(4t2), s=1 8σ2β2 t d ln 1 + t L2 Published as a conference paper at ICLR 2023 β2 t 2 + 16σ2d ln 1 + t L2 Taking a union bound over t and make use of the fact that P t=1 t 2 < 2 completes the proof. We also have the following lemma: Lemma 36 (Analog of Lemma B.4 from Zhou et al. (2021)). Under the same conditions as the previous lemma, with probability 1 δ 2, we will have the following for all 1 simultaneously: η2 sw2 s 1 + w2s 1 Proof. We still need to apply our Freedman s inequality (Corollary 33) to ℓs = E η2 sw2 s 1 + w2s η2 sw2 s 1 + w2s . As ηs | Fs 1 sub G(σ2), η2 s | Fs 1 is a sub-exponential random variable (Proposition 12) such that E exp(λ(η2 s E[η2 s])) Fs 1 exp(16λ2σ4), |λ| 1 4σ2 , which consequently means E [exp(λℓs)|Fs 1] exp 16λ2σ4(min{1, w2 s})2 , |λ| 1 4σ2 . where the second step used the fact that w2 s 1+w2s min{1, w2 s} as we used in the proof of Lemma 35. Again by Proposition 34, we can conclude that s=1 σ4(min{1, w2 s})2 2σ2d ln 1 + t L2 Then, as we did in the proof of Theorem 13, we will apply Proposition 11 to the martingale difference sequence {ℓs}t s=1 with Vs = σ4(min{1, w2 s})2, f(λ) = 16λ2 for λ < 1 4σ2 and f(λ) = otherwise. Then for all x, v > 0 and λ (0, 1 σ2 ), we have i=1 σ4 i (min{1, w2s})2 v exp λx + 16λ2v2 . Picking v2 = Pt s=1 σ4 i (min{1, w2 s})2 and x = 4 s=1 σ4 i (min{1, w2s})2 ln 2 where λ is set to x 32v2 < 1 σ2 . Hence, with probability 1 δ 4t2 , we indeed have s=1 σ4 i (min{1, w2s})2 ln 2 2σ2d ln 1 + t L2 Moreover, due to Proposition 34, we have s=1 E η2 sw2 s 1 + w2s w2 s 1 + ws 2σ2d ln 1 + t L2 Published as a conference paper at ICLR 2023 which means, as η2 sw2 s 1+w2s ℓs + E[ η2 sw2 s 1+w2s ], η2 sw2 s 1 + w2s 2σ2d ln 1 + t L2 2σ2d ln 1 + t L2 Taking a union bound over all t and again making use of the fact that P t=1 t 2 < 2 gives our conclusion. Therefore, as long as their Lemmas B.3 and B.4 still hold, we can conclude exactly the same conclusion from their derivation. One may refer to their proof for the details. I AUXILLIARY LEMMAS Lemma 37 (Self Bounding Inequality, Efroni et al. (2020, Lemma 38)). Let 0 x a + b x where a, b, x 0, then we have x 4a + 2b2. Proof. As x b x a 0, we have 1 4b2 + 4a b 4a = b + 2 a from the fact that b. As x 0, we have x (b + 2 a)2 2b2 + 4a due to the relation that (a + b)2 2a2 + 2b2.