# Improved Regret of Linear Ensemble Sampling

Harin Lee (Seoul National University, Seoul, South Korea, harinboy@snu.ac.kr) · Min-hwan Oh (Seoul National University, Seoul, South Korea, minoh@snu.ac.kr)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

**Abstract.** In this work, we close the fundamental gap between theory and practice by providing an improved regret bound for linear ensemble sampling. We prove that with an ensemble size logarithmic in $T$, linear ensemble sampling can achieve a frequentist regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$, matching state-of-the-art results for randomized linear bandit algorithms, where $d$ and $T$ are the dimension of the parameter and the time horizon, respectively. Our approach introduces a general regret analysis framework for linear bandit algorithms. Additionally, we reveal a significant relationship between linear ensemble sampling and Linear Perturbed-History Exploration (LinPHE), showing that LinPHE is a special case of linear ensemble sampling when the ensemble size equals $T$. This insight allows us to derive a new regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ for LinPHE, independent of the number of arms. Our contributions advance the theoretical foundation of ensemble sampling, bringing its regret bounds in line with the best known bounds for other randomized exploration algorithms.

## 1 Introduction

Ensemble sampling [16] has emerged as an empirically effective randomized exploration technique in various online decision-making problems, such as online recommendation [17, 27, 26] and deep reinforcement learning [18-20]. Despite its popularity, the theoretical understanding of ensemble sampling has lagged behind, even for the linear bandit problem, with previous results revealing suboptimal outcomes. For instance, a prior work [21] demonstrated that linear ensemble sampling could achieve $O(\sqrt{T})$ Bayesian regret with an ensemble size growing at least linearly with $T$. However, the requirement for the ensemble size to be linear in $T$ is highly unfavorable and prohibitive in many practical settings. A recent work [10] showed that a symmetrized version of linear ensemble sampling could reduce the required ensemble size to $\Theta(d \log T)$ and attain a frequentist regret bound of $\tilde{O}(d^{5/2}\sqrt{T})$. However, this regret bound clearly falls short of the existing frequentist regret achieved by standard randomized algorithms such as Thompson Sampling (TS) [4, 2] and Perturbed-History Exploration (PHE) [11-13].[^1]

In this work, we close this fundamental gap by providing an improved regret bound for linear ensemble sampling. We prove that linear ensemble sampling with an ensemble size logarithmic in $T$ can still attain a frequentist regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$, marking the first time that linear ensemble sampling achieves a state-of-the-art result for randomized linear bandit algorithms.

[^1]: LinTS [4, 2] has regret of $\tilde{O}(\min(d^{3/2}\sqrt{T}, d\sqrt{T\log K}))$, and LinPHE [12] has regret of $\tilde{O}(d\sqrt{T\log K})$. For exchanges between factors of $O(\sqrt{d})$ and $O(\sqrt{\log K})$, we refer to the analysis of Agrawal and Goyal [4]. Therefore, the existing result for linear ensemble sampling [10] has a gap of at least $O(d)$ compared to LinTS and LinPHE. Besides this gap in regret bounds, there appears to be a rather counter-intuitive dependence on ensemble size in Janz et al. [10] (see Remark 2 in Section 4).
Our approach not only improves upon the regret bound but also simplifies the algorithm by avoiding the use of symmetrized perturbations, making it more practical for implementation. For the regret analysis, we present a general, concise framework for analyzing linear bandit algorithms, which may be of independent interest. Furthermore, we rigorously reveal the significant relationship between ensemble sampling and PHE for the first time, showing that in the regime where the ensemble size equals $T$, linear PHE (LinPHE) is a special case of linear ensemble sampling. With this new insight, we can use the regret analysis for ensemble sampling to derive a new regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ for LinPHE, which is independent of the number of arms $K$.

Our main contributions are summarized as follows:

- We prove a $\tilde{O}(d^{3/2}\sqrt{T})$ regret bound (Theorem 1) for linear ensemble sampling with an ensemble size of $m = \Omega(K \log T)$, where $K$ denotes the number of arms. Importantly, our regret bound does not depend on $K$ or $m$ even logarithmically. Our result is the first to establish $\tilde{O}(d^{3/2}\sqrt{T})$ regret for linear ensemble sampling with an ensemble size sublinear in $T$, improving the previous bound by a factor of $d$ while keeping the ensemble size logarithmic in $T$.
- As part of the regret analysis, we present a general regret analysis framework (Theorem 2) for linear bandit algorithms. This framework not only generalizes the regret analysis of randomized algorithms such as ensemble sampling and PHE but also applies to optimism-based deterministic algorithms. This result can be of independent interest beyond ensemble sampling.
- We rigorously investigate the relationship between linear ensemble sampling and LinPHE. We show that in the regime of ensemble size $m = T$, LinPHE is a special case of the linear ensemble sampling algorithm. To our best knowledge, this is the first result to show the equivalence between linear ensemble sampling and LinPHE.
- As a byproduct of this new insight into the relationship between linear ensemble sampling and LinPHE, we provide an alternate analysis for LinPHE as an extension of the analysis for linear ensemble sampling, achieving a $\tilde{O}(d^{3/2}\sqrt{T})$ regret bound with no dependence on $K$.

### 1.1 Related Work

The stochastic linear bandit problem [5, 1, 14] is a foundational sequential decision-making problem and a core model for multi-armed bandits with features. Numerous algorithms have been developed for this problem, including deterministic approaches such as UCB-based methods [5, 8, 1] and randomized algorithms such as Thompson sampling [24, 7, 4, 2] and PHE [11-13]. Thompson sampling [24], a classical randomized method, utilizes the posterior distribution of the hidden parameters based on observed data. Initially proposed for Bayesian settings [22, 23], it has also demonstrated strong performance in frequentist settings [3, 4, 2]. For the stochastic linear bandit, Agrawal and Goyal [4] showed that Thompson sampling with a Gaussian prior achieves a regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$, which can be reduced to $\tilde{O}(d\sqrt{T\log K})$ for small $K$. However, applying Thompson sampling to more complex problems remains challenging, especially when posterior computation becomes intractable, though approximate methods have been proposed [16, 25]. PHE [11-13] is another class of randomized algorithms that does not rely on posterior distributions, making it potentially applicable to more complex settings.
In the finite-armed linear bandit model, PHE achieves a $\tilde{O}(d\sqrt{T\log K})$ regret bound, matching the performance of Thompson sampling for finite arms [4]. However, the relationship between PHE and ensemble sampling remains unexplored in previous studies.

Ensemble sampling [16] has gained popularity as a randomized exploration method across various decision-making tasks [17, 27, 26, 18-20]. Despite its empirical success, its theoretical foundation, particularly for linear bandits, is still relatively underdeveloped. Qin et al. [21] showed that linear ensemble sampling achieves $O(\sqrt{T})$ Bayesian regret but requires an impractically large ensemble size that scales linearly with $T$. More recently, Janz et al. [10] reduced the required ensemble size to $\Theta(d \log T)$ and achieved a frequentist regret bound of $\tilde{O}(d^{5/2}\sqrt{T})$. However, a frequentist regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ for ensemble sampling has remained out of reach.

**Algorithm 1: Linear Ensemble Sampling**

**Input:** regularization parameter $\lambda > 0$; ensemble size $m \in \mathbb{N}$; initial perturbation distribution $P_I$ on $\mathbb{R}^d$; reward-perturbation distribution $P_R$ on $\mathbb{R}$; ensemble sampling distributions $\{J_t\}_{t=1}^T$ on $[m]$.

1. Sample $W^j \sim P_I$ for each $j \in [m]$. Initialize $V_0 = \lambda I_d$, $S_0^j = W^j$, and $\theta_0^j = V_0^{-1} S_0^j$ for each $j \in [m]$.
2. For $t = 1, 2, \ldots, T$:
   1. Sample $j_t \sim J_t$.
   2. Pull arm $X_t = \operatorname{argmax}_{x \in \mathcal{X}} x^\top \theta_{t-1}^{j_t}$ and observe $Y_t$.
   3. Update $V_t = V_{t-1} + X_t X_t^\top$.
   4. For $j = 1, 2, \ldots, m$: sample $Z_t^j \sim P_R$, then update $S_t^j = S_{t-1}^j + X_t (Y_t + Z_t^j)$ and $\theta_t^j = V_t^{-1} S_t^j$.

## 2 Preliminaries

### 2.1 Notations

$\mathbb{N}$ denotes the set of natural numbers starting from 1. For a positive integer $M$, $[M]$ denotes the set $\{1, 2, \ldots, M\}$. $0_d$ denotes the zero vector in $\mathbb{R}^d$ and $I_d$ denotes the identity matrix in $\mathbb{R}^{d \times d}$. We work within a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is the sample space, $\mathcal{F}$ is the event set, and $\mathbb{P}$ is the probability measure. $N(\mu, \Sigma)$ denotes the uni- or multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$. $\wedge$ denotes logical conjunction ("and") and $\vee$ denotes logical disjunction ("or"). With slight abuse of notation, we write $\{\omega \in \Omega : A\}$ and $A$ interchangeably when $A$ is some condition, for simplicity. $O(\cdot)$ denotes the asymptotic growth rate with respect to the problem parameters $d$, $T$, and $K$. $\tilde{O}(\cdot)$ further hides logarithmic factors of $T$ and $d$.

### 2.2 Problem Setting

We consider the stochastic linear bandit problem. The learning agent is presented with a non-empty arm set $\mathcal{X} \subseteq \mathbb{R}^d$. For $T$ time steps, where $T \in \mathbb{N}$ is the time horizon, the agent selects an arm $X_t \in \mathcal{X}$ and receives a real-valued reward $Y_t$, where the reward is generated based on a hidden true parameter vector $\theta^* \in \mathbb{R}^d$. Specifically, $Y_t$ is defined as $Y_t = X_t^\top \theta^* + \eta_t$, where $\eta_t$ is zero-mean random noise. The objective of the agent is to maximize the cumulative reward, or equivalently, to minimize the cumulative regret $R(T)$ defined as
$$R(T) := \sum_{t=1}^{T} \left( \sup_{x \in \mathcal{X}} x^\top \theta^* - X_t^\top \theta^* \right).$$

## 3 Ensemble Sampling for Linear Bandits

Algorithm 1 describes linear ensemble sampling. The learner maintains an ensemble of $m$ estimators, where each estimator fits perturbed rewards. For the $j$-th estimator, a random vector $W^j \in \mathbb{R}^d$ acts as an initial perturbation on the estimator, and a random variable $Z_t^j$ perturbs the reward at time $t$. Specifically, $\theta_t^j$ is the solution of the following minimization problem:
$$\operatorname*{minimize}_{\theta \in \mathbb{R}^d} \;\; \lambda \left\| \theta - W^j/\lambda \right\|_2^2 + \sum_{i=1}^{t} \left( X_i^\top \theta - (Y_i + Z_i^j) \right)^2 \tag{1}$$
When selecting an arm, one of the $m$ estimators is chosen according to the ensemble sampling distribution $J_t$, and the agent acts greedily with respect to the sampled estimator.
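To make Algorithm 1 concrete, here is a minimal Python sketch (ours, not from the paper). It assumes a finite arm set given as rows of a NumPy array, Gaussian perturbations, and a uniform sampling policy $J_t$; the function and parameter names are our own.

```python
import numpy as np

def linear_ensemble_sampling(arms, T, m, lam, sigma_I, sigma_R, reward_fn, rng):
    """Minimal sketch of Algorithm 1 (linear ensemble sampling).

    arms      : (K, d) array, one row per arm
    sigma_I   : std of the initial perturbation, P_I = N(0, sigma_I^2 I_d)
    sigma_R   : std of the reward perturbation,  P_R = N(0, sigma_R^2)
    reward_fn : callable X -> noisy reward (the environment)
    """
    K, d = arms.shape
    V = lam * np.eye(d)                         # V_0 = lambda * I_d
    S = rng.normal(0.0, sigma_I, size=(m, d))   # S_0^j = W^j ~ P_I, one row per model
    V_inv = np.linalg.inv(V)
    history = []
    for t in range(1, T + 1):
        j = rng.integers(m)                     # j_t ~ Unif([m])
        theta = V_inv @ S[j]                    # theta_{t-1}^{j_t} = V_{t-1}^{-1} S^{j_t}
        X = arms[np.argmax(arms @ theta)]       # greedy arm w.r.t. the sampled estimator
        Y = reward_fn(X)
        V += np.outer(X, X)                     # V_t = V_{t-1} + X X^T
        V_inv = np.linalg.inv(V)                # (Sherman-Morrison would be cheaper)
        Z = rng.normal(0.0, sigma_R, size=m)    # fresh Z_t^j ~ P_R for every model
        S += np.outer(Y + Z, X)                 # S_t^j = S_{t-1}^j + X (Y_t + Z_t^j)
        history.append((X, Y))
    return history
```

Note that the $O(d^3)$ matrix inversion per round is only for clarity; a rank-one (Sherman-Morrison) update of $V_t^{-1}$ would bring the per-round cost down to $O(d^2 + md)$.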
Previous ensemble sampling algorithms [16, 21, 10] sample the estimators uniformly from the ensemble, but we allow any policy for selecting the estimator. Further distinguishing our method from the algorithm presented in Janz et al. [10], we do not sample Rademacher random variables for symmetrization, making our algorithm simpler. Ensemble sampling can be generalized to complex settings whenever solving minimization problem (1) is tractable. Especially when incremental updates of the minimization problem are cheap, for instance with neural networks or other gradient-descent-based models, ensemble sampling can be an efficient exploration strategy: the algorithm simply stores an ensemble of models, samples one to select an action, and then updates the models incrementally based on the observed reward and the generated perturbations.

## 4 Regret Bound of Linear Ensemble Sampling

Before we present the regret bound of linear ensemble sampling (Algorithm 1), we state the following standard assumptions on the problem structure.

**Assumption 1** (Arm set and parameter). $\mathcal{X}$ is closed and for all $x \in \mathcal{X}$, $\|x\|_2 \le 1$. There exists $S > 0$ such that $\|\theta^*\|_2 \le S$. Both bounds are known to the agent.

**Remark 1.** Under Assumption 1, $\mathcal{X}$ is a compact set. Therefore, we can define $x^* := \operatorname{argmax}_{x \in \mathcal{X}} x^\top \theta^*$ and rewrite the definition of $R(T)$ as $\sum_{t=1}^{T} (x^{*\top}\theta^* - X_t^\top\theta^*)$.

For a rigorous statement of the second assumption, we define several filtrations. For $t \in [T] \cup \{0\}$, let $\mathcal{F}_t^X := \sigma(X_1, \ldots, X_t)$ and $\mathcal{F}_t^\eta := \sigma(\eta_1, \ldots, \eta_t)$ be the $\sigma$-algebras generated by $X_i$ and $\eta_i$ up to time $t$, respectively. We also define the $\sigma$-algebra generated by the algorithm's internal randomness up until the choice of $X_t$ as $\mathcal{F}_t^A$. Let $\mathcal{F}_t := \sigma(\mathcal{F}_t^A \cup \mathcal{F}_t^X \cup \mathcal{F}_t^\eta)$ be the $\sigma$-algebra generated by the first $t$ iterations of the interaction between the environment and the agent. In addition, let $\mathcal{F}_t^- := \sigma(\mathcal{F}_t^A \cup \mathcal{F}_t^X \cup \mathcal{F}_{t-1}^\eta)$ be the $\sigma$-algebra generated in the same way as $\mathcal{F}_t$, but excluding $\eta_t$.

**Assumption 2** (Noise). There exists $\sigma \ge 0$ such that $\eta_t$ is $\mathcal{F}_t^-$-conditionally $\sigma$-sub-Gaussian for all $t \in [T]$, i.e., $\mathbb{E}[\exp(s\eta_t) \mid \mathcal{F}_t^-] \le \exp(\sigma^2 s^2/2)$ holds almost surely for all $s \in \mathbb{R}$.

Now, we define a value $\beta_t$ that calibrates the variance of the generated perturbation values. Define a sequence $\{\beta_t(\delta)\}_{t=0}^\infty$ as
$$\beta_t(\delta) := \sigma\sqrt{d\log\left(1 + \frac{t}{d\lambda}\right) + 2\log\frac{1}{\delta}} + \sqrt{\lambda}\,S .$$
We may omit $\delta$ when its value is clear from the context. The definition of $\beta_t$ comes from Abbasi-Yadkori et al. [1] as a confidence radius of the ridge estimator, which we make precise later in Lemma 1. We now present the regret bound of linear ensemble sampling (Algorithm 1).

**Theorem 1** (Regret bound of linear ensemble sampling). Fix $\delta \in (0, 1]$. Assume $|\mathcal{X}| = K < \infty$ and run Algorithm 1 with $\lambda \ge 1$, $m \ge C(K\log T + \log\frac{1}{\delta})$, $P_I = N(0_d, \lambda\beta_T^2 I_d)$, $P_R = N(0, \beta_T^2)$, and $J_t = \mathrm{Unif}(m)$, where $C$ is a universal constant and $\mathrm{Unif}(m)$ denotes the uniform distribution over $[m]$. Then, with probability at least $1 - 4\delta$, the cumulative regret of Algorithm 1 satisfies
$$R(T) = O\left( (d\log T)^{3/2}\sqrt{T} \right).$$

**Discussion of Theorem 1.** Theorem 1 shows that Algorithm 1 achieves a $\tilde{O}(d^{3/2}\sqrt{T})$ frequentist regret bound with an ensemble size of $m = \Omega(K\log T)$. Importantly, our regret bound does not depend on $K$ or $m$ even logarithmically. Hence, this regret bound matches the state-of-the-art frequentist regret bound of linear Thompson sampling [4, 2]. Our result is the first to establish $\tilde{O}(d^{3/2}\sqrt{T})$ regret for linear ensemble sampling with an ensemble size sublinear in $T$, improving the previous bound by a factor of $d$ compared to the existing result in Janz et al. [10].
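For concreteness, the following small sketch (ours) instantiates the inputs of Theorem 1 numerically. The universal constant $C$ is not given a numerical value in the theorem, so it is treated here as a tunable input.

```python
import numpy as np

def theorem1_parameters(d, T, K, sigma, S, lam=1.0, delta=0.05, C=8.0):
    """Hedged sketch: instantiate the hyperparameters of Theorem 1.

    beta_T(delta) = sigma * sqrt(d*log(1 + T/(d*lam)) + 2*log(1/delta)) + sqrt(lam)*S
    is the confidence radius of Abbasi-Yadkori et al. [1]; C is the unspecified
    universal constant of the theorem, treated as an assumption here.
    """
    beta_T = sigma * np.sqrt(d * np.log(1 + T / (d * lam)) + 2 * np.log(1 / delta)) \
             + np.sqrt(lam) * S
    m = int(np.ceil(C * (K * np.log(T) + np.log(1 / delta))))   # ensemble size
    return {
        "beta_T": beta_T,
        "m": m,
        "P_I_cov_scale": lam * beta_T ** 2,   # P_I = N(0_d, lam * beta_T^2 * I_d)
        "P_R_var": beta_T ** 2,               # P_R = N(0, beta_T^2)
    }

# Example: d = 10, T = 10_000, K = 50, sigma = 1, S = 1
print(theorem1_parameters(10, 10_000, 50, 1.0, 1.0))
```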
We conjecture that $\tilde{O}(d^{3/2}\sqrt{T})$ regret is highly likely to be the best achievable bound for linear ensemble sampling, based on the negative result of Hamidi and Bayati [9] for LinTS.[^2] Compared with the algorithm in Janz et al. [10], our version of linear ensemble sampling does not utilize Rademacher random variables for symmetrized perturbation. This allows our algorithm to be simpler than that of Janz et al. [10]. Partially due to this algorithmic difference, our regret analysis is quite distinct from the analysis of Janz et al. [10] (see the proof in Section 5.2).

[^2]: Hamidi and Bayati [9] have shown that LinTS without the posterior variance inflation by a factor of $d$ (compared to optimism-based algorithms such as LinUCB or OFUL [1]) can incur regret linear in $T$. That is, the frequentist regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ for LinTS is the best one can derive for the algorithm.

**Table 1:** Comparison of regret bounds for linear ensemble sampling

| Paper | Frequentist / Bayesian | Regret Bound | Ensemble Size |
|---|---|---|---|
| Lu and Van Roy [16] | Frequentist | Invalid | Invalid |
| Qin et al. [21] | Bayesian | $\tilde{O}(\sqrt{dT\log K})$ | $\Omega(KT)$ |
| Janz et al. [10] | Frequentist | $\tilde{O}(d^{5/2}\sqrt{T})$ | $\Theta(d\log T)$ |
| This work | Frequentist | $\tilde{O}(d^{3/2}\sqrt{T})$ | $\Omega(K\log T)$ |

As in Lu and Van Roy [16] and Qin et al. [21], we study the finite-armed problem setting. Both studies analyze the excess regret of ensemble sampling compared to Thompson sampling through an information-theoretic approach. However, direct comparisons of the regret bounds are non-trivial: the analysis by Lu and Van Roy [16] includes an error admitted by the authors, and Qin et al. [21] analyze the Bayesian regret, which is a weaker notion of regret than the frequentist regret we analyze in this work. Along with Janz et al. [10], our result also makes progress in reducing the size of the ensemble compared to Lu and Van Roy [16] and Qin et al. [21]. The ensemble size required by Janz et al. [10] is $\Theta(d\log T)$. This requirement implies that their ensemble size may be smaller than ours when $K$ is larger than $d$, and it also allows $K$ to be infinite. However, it is important to note that their resulting regret bound of $\tilde{O}(d^{5/2}\sqrt{T})$ is clearly sub-optimal compared to the regret bounds of other randomized exploration algorithms. Theorem 1 achieves the tighter regret bound while simultaneously reducing the size of the ensemble.

**Remark 2** (Counter-intuitive dependence on ensemble size in Janz et al. [10]). The regret bound in Janz et al. [10] actually grows super-linearly with the ensemble size, which is counter-intuitive: it implies that as the ensemble size increases, the performance of the algorithm deteriorates. This fails to explain the superior empirical performance observed for ensemble sampling even with a large ensemble. On the contrary, our result in Theorem 1 does not show any performance degradation as the ensemble size increases.

**Remark 3** (Generalizability of perturbation distributions). We show that the Gaussian distribution for the perturbation is not essential. The only properties of the Gaussian distribution we utilize are its tail probability and anti-concentration property, stated as Lemma 4 and Fact 1 in Section 5.2. Therefore, any other distribution exhibiting similar behavior can be adopted instead. In Appendix H, we rigorously demonstrate that any symmetric sub-Gaussian distribution with lower-bounded variance can be employed, possibly at the cost of a constant factor. A large class of distributions, including the uniform distribution, spherical distributions, the Rademacher distribution, and the centered binomial distribution with $p = 1/2$, satisfies this condition, as the sketch below illustrates. This result can be of independent interest.
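The following Monte Carlo check (ours, under the assumptions of Remark 3) illustrates that non-Gaussian unit-variance perturbations exhibit the same constant-probability anti-concentration that Fact 1 provides for the Gaussian case; the sampler choices and helper name are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def anti_concentration(sampler, n=50, trials=200_000):
    """Estimate P(u^T Z >= ||u||_2) for Z with i.i.d. unit-variance coordinates.

    Remark 3 only needs a constant lower bound on this probability (Fact 1
    gives p_N >= 0.15 for the Gaussian); the samplers below are illustrative.
    """
    u = rng.normal(size=n)
    u /= np.linalg.norm(u)
    Z = sampler((trials, n))
    return np.mean(Z @ u >= 1.0)

samplers = {
    "gaussian":   lambda s: rng.normal(size=s),
    "rademacher": lambda s: rng.choice([-1.0, 1.0], size=s),
    "uniform":    lambda s: rng.uniform(-np.sqrt(3), np.sqrt(3), size=s),  # variance 1
}
for name, f in samplers.items():
    print(f"{name:10s} P(u^T Z >= ||u||_2) ~= {anti_concentration(f):.3f}")
```

All three samplers report an estimate around $0.159 \approx \mathbb{P}(z \ge 1)$ for a standard normal $z$, consistent with the claim that only a constant lower bound on this probability is needed.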
## 5 Regret Analysis

### 5.1 General Regret Analysis for Linear Bandits

We begin by presenting a general regret bound for any algorithm that selects the best arm based on an estimated parameter. This general bound and its analysis serve as a framework that includes the regret analysis of linear ensemble sampling (Theorem 1), and can be of independent interest.

**Theorem 2** (General regret bound for linear bandit algorithms). Fix $T \in \mathbb{N}$. Assume that at each time step $t \in [T]$, the agent chooses $X_t = \operatorname{argmax}_{x \in \mathcal{X}} x^\top\theta_t$, where $\theta_t \in \mathbb{R}^d$ is chosen by the agent under some (either deterministic or random) policy. Let $\lambda > 0$ and $V_t = \lambda I + \sum_{i=1}^{t} X_i X_i^\top$. Let $\{E_{1,t}\}_{t=1}^T$ and $\{E_{2,t}\}_{t=1}^T$ be sequences of events that satisfy the following two conditions:

1. (Concentration) There exists a constant $\gamma > 0$ such that
$$\|\theta_t - \theta^*\|_{V_{t-1}}\mathbf{1}\{E_{1,t}\} \le \gamma \tag{2}$$
holds almost surely for all $t \in [T]$.
2. (Optimism) $E_{2,t} \in \mathcal{F}_{t-1}$ holds and there exists a constant $p \in (0, 1]$ such that
$$\mathbb{P}\left( \left( x^{*\top}\theta^* \le X_t^\top\theta_t \,\wedge\, E_{1,t} \right) \vee E_{2,t}^C \,\middle|\, \mathcal{F}_{t-1} \right) \ge p \tag{3}$$
holds almost surely for all $t \in [T]$.

Take $E = \bigcap_{t=1}^{T}(E_{1,t} \cap E_{2,t})$ and any $\delta \in (0, 1]$. Then, under the event $E$ and an additional event whose probability is at least $1 - \delta$, the cumulative regret is bounded as follows:
$$R(T) \le \gamma\left(1 + \frac{2}{p}\right)\sqrt{2dT\log\left(1 + \frac{T}{d\lambda}\right)} + \frac{2\gamma}{p}\sqrt{\frac{T}{2\lambda}\log\frac{1}{\delta}} \tag{4}$$

**Discussion of Theorem 2.** Readers well-versed in the regret analysis of randomized algorithms such as TS and PHE will recognize that bounding the regret using the probability of being optimistic is a standard procedure, also presented in Theorem 1 of Kveton et al. [12] and Theorem 2 of Janz et al. [10] as generalizations of the results in Agrawal and Goyal [4] and Abeille and Lazaric [2], respectively. However, our regret analysis offers a much more concise approach than the existing techniques, which can be of independent interest beyond the analysis of ensemble sampling. Theorem 2 states that if the two conditions, concentration in (2) and optimism in (3), are met, then the algorithm achieves $\sqrt{T}$ regret. We provide the proof of Theorem 2 in Appendix B. Our proof technique generalizes the well-studied analysis of Abeille and Lazaric [2]. While their work poses conditions on a $d$-dimensional perturbation vector that is added to the ridge estimator, we assume neither the use of ridge regression nor that the estimator is perturbed. Instead, we only pose conditions on the final estimator the algorithm exploits. Due to this generalization, Theorem 2 is even capable of recovering the regret bound of LinUCB [1], which always opts for an optimistic estimator, by setting $\gamma = 2\beta_T$ and $p = 1$ with appropriate concentration events assigned to $E_{1,t}$ and $E_{2,t}$.

In addition, there are several improvements that simplify the proof and are worth noting. To exploit the optimism condition, we apply Markov's inequality to a well-defined random variable. The proof of Abeille and Lazaric [2] relies on defining a conditional distribution, conditioned on both the history and the event of being optimistic. However, such a distribution may not be well-defined if the probability of the event is 0 for a given history. Janz et al. [10] try to solve this problem by separately handling such exceptional cases using conditional measures. However, their conditional measures depend on the random history, leading the probability $p$ to be a random variable, which complicates the analysis.
We also note that our proof does not require the convex analysis employed in Abeille and Lazaric [2].

**Remark 4** (Role of event $E_{2,t}$). Previous results that utilize the probability of being optimistic [4, 2, 12, 10] do not explicitly define events $\{E_{2,t}\}_t$. However, their existence is crucial in our analysis of linear ensemble sampling. Since the perturbation sequences are also part of the history in ensemble sampling, the probability of $\theta_t$ being optimistic may be extremely small under some events in $\mathcal{F}_{t-1}$ that sample unfavorable sequences. The role of $E_{2,t}$ is to confine our analysis to the case where such undesirable events do not occur.

### 5.2 Proof of Regret Bound in Theorem 1

We prove the regret bound of linear ensemble sampling stated in Theorem 1. To apply Theorem 2, high probabilities of the sequences of events $\{E_{1,t}, E_{2,t}\}_{t=1}^T$ should be guaranteed with appropriate values of $\gamma$ and $p$. We show that separate constraints can be imposed on the randomness of the rewards and of the perturbations, respectively, to guarantee the probabilities of these events. We begin by decomposing the estimator into two parts: one that fits the observed rewards and one that perturbs the estimator:
$$\theta_t^j = V_t^{-1}S_t^j = V_t^{-1}\left( W^j + \sum_{i=1}^{t} X_i(Y_i + Z_i^j) \right) = V_t^{-1}\sum_{i=1}^{t} X_i Y_i + V_t^{-1}\left( W^j + \sum_{i=1}^{t} X_i Z_i^j \right) =: \hat{\theta}_t + \tilde{\theta}_t^j , \tag{5}$$
where we define $\hat{\theta}_t := V_t^{-1}\sum_{i=1}^{t} X_i Y_i$ and $\tilde{\theta}_t^j := V_t^{-1}(W^j + \sum_{i=1}^{t} X_i Z_i^j)$. $\hat{\theta}_t$ is the ridge regression estimator of the observed data, and its randomness mainly comes from the noise of the rewards, $\{\eta_i\}_{i=1}^t$. $\tilde{\theta}_t^j$ is the perturbation added to $\hat{\theta}_t$, and its randomness comes from the generated perturbations, $W^j$ and $\{Z_i^j\}_{i=1}^t$. The following lemma states the well-known concentration result for the ridge estimator.

**Lemma 1** (Theorem 2 of Abbasi-Yadkori et al. [1]). Fix $\delta \in (0, 1]$. For $t \in \mathbb{N} \cup \{0\}$, define a sequence of events with $\beta_t(\delta) = \sigma\sqrt{d\log(1 + \frac{t}{d\lambda}) + 2\log\frac{1}{\delta}} + \sqrt{\lambda}S$:
$$\hat{E}_t := \left\{ \omega \in \Omega : \|\hat{\theta}_t - \theta^*\|_{V_t} \le \beta_t(\delta) \right\},$$
and their intersection $\hat{E} := \bigcap_{t=0}^{T}\hat{E}_t$. Then, $\mathbb{P}(\hat{E}) \ge 1 - \delta$.

Now, we address $\tilde{\theta}_t^j$. Define a perturbation vector that represents the perturbation sequence for each model as follows:
$$Z_t^j := \left( \tfrac{1}{\sqrt{\lambda}}W^{j\top},\; Z_1^j, \ldots, Z_{t-1}^j \right)^\top \in \mathbb{R}^{d+t-1}, \quad j \in [m] . \tag{6}$$
Let $Z_t := Z_t^{j_t}$, so that $Z_t$ is the perturbation vector of the model chosen at time $t$. The following lemma demonstrates that optimism condition (3) can be satisfied by an anti-concentration property of $Z_t$ alone.

**Lemma 2** (Sufficient condition for optimism). For $t \in [T]$, define a vector $U_{t-1} \in \mathbb{R}^{d+t-1}$ by
$$U_{t-1}^\top := x^{*\top}V_{t-1}^{-1}\left( \sqrt{\lambda}I_d \;\; X_1 \;\cdots\; X_{t-1} \right) .$$
Then, $x^{*\top}\theta^* \le X_t^\top\theta_t$ holds whenever there exists a constant $c > 0$ such that $U_{t-1}^\top Z_t \ge c\|U_{t-1}\|_2$ and $\|\theta^* - \hat{\theta}_{t-1}\|_{V_{t-1}} \le c$ hold.

We present a straightforward proof of Lemma 2 in Appendix C. We significantly deviate from the analyses of Abeille and Lazaric [2] and Janz et al. [10] in the method of guaranteeing the optimism condition. Their analyses require a $d$-dimensional perturbation vector to have a constant probability of having a positive component, i.e., to be anti-concentrated, in every possible direction in $\mathbb{R}^d$, since they only prove the existence of a direction that implies optimism. We observe and exploit the fact that it suffices to consider just one direction, specifically $U_{t-1}$, to produce an optimistic estimator, as the sketch below illustrates. Since $U_{t-1}$ depends only on the sequence of selected arms, the dependency between $U_{t-1}$ and $Z_t$ decouples once the dependency between $\{X_t\}_{t=1}^T$ and $\{Z_t\}_{t=1}^T$ is decoupled, which we later achieve by taking the union bound in a particular way.
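The following numerical check (ours, on a synthetic instance with slightly simplified indexing) illustrates Lemma 2: on draws where the single-direction anti-concentration event $U^\top Z \ge c\|U\|_2$ holds and the ridge estimator is within the $c$-ball of $\theta^*$ in the $V$-norm, the greedy arm for $\theta = \hat{\theta} + V^{-1}\Phi Z$ is never worse than $x^{*\top}\theta^*$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedged check of Lemma 2: count violations of optimism under its conditions.
d, t, lam, c = 5, 30, 1.0, 0.7
X = rng.normal(size=(t, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
arms = np.vstack([X, rng.normal(size=(20, d))])
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
x_star = arms[np.argmax(arms @ theta_star)]

V = lam * np.eye(d) + X.T @ X
Phi = np.hstack([np.sqrt(lam) * np.eye(d), X.T])   # (sqrt(lam) I_d, X_1, ..., X_t)
U = Phi.T @ np.linalg.solve(V, x_star)             # U^T = x*^T V^{-1} Phi

# A theta_hat inside the c-ball around theta* in the V-norm (radius 0.9c <= c).
w = rng.normal(size=d)
theta_hat = theta_star + 0.9 * c * np.linalg.solve(np.linalg.cholesky(V).T, w) / np.linalg.norm(w)

violations = 0
for _ in range(10_000):
    Z = rng.normal(size=d + t)
    if U @ Z >= c * np.linalg.norm(U):             # anti-concentration event holds
        theta = theta_hat + np.linalg.solve(V, Phi @ Z)
        X_t = arms[np.argmax(arms @ theta)]        # greedy arm for perturbed theta
        violations += (x_star @ theta_star > X_t @ theta + 1e-9)
print("violations of Lemma 2:", violations)        # expected: 0
```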
The following lemma shows that concentration and anti-concentration properties of the perturbation are sufficient conditions for Theorem 2. We provide a sketch of its proof and defer the remaining details to Appendix D.

**Lemma 3.** Suppose the agent runs Algorithm 1 with some parameters. Fix $\tilde{\gamma} > 0$ and $p \in (0, 1]$. For each $t \in [T]$, define two events
$$\tilde{E}_{1,t} := \left\{ \omega \in \Omega : \|\tilde{\theta}_{t-1}^{j_t}\|_{V_{t-1}} \le \tilde{\gamma} \right\}, \qquad \tilde{E}_{2,t} := \left\{ \omega \in \Omega : \mathbb{P}\left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \,\wedge\, \tilde{E}_{1,t} \,\middle|\, \mathcal{F}_{t-1} \right) \ge p \right\},$$
where $U_{t-1}$ is defined as in Lemma 2. Take $E_{1,t} = \tilde{E}_{1,t} \cap \hat{E}_{t-1}$ and $E_{2,t} = \tilde{E}_{2,t} \cap \hat{E}_{t-1}$. Then, $E_{1,t}$ and $E_{2,t}$ satisfy concentration condition (2) with $\gamma = \tilde{\gamma} + \beta_T$ and optimism condition (3) with the same value of $p$. Consequently, with probability at least $1 - 2\delta - \mathbb{P}(\tilde{E}^C)$, where $\tilde{E} := \bigcap_{t=1}^{T}(\tilde{E}_{1,t} \cap \tilde{E}_{2,t})$, Algorithm 1 achieves regret bound (4) of Theorem 2.

**Remark 5.** Lemma 3 applies to any perturbation-based algorithm that exploits $\theta_t = \hat{\theta}_{t-1} + \tilde{\theta}_{t-1}$, where $\tilde{\theta}_{t-1}$ is a linear transform of a random perturbation vector $Z_t$. A version of linear Thompson sampling [2] and LinPHE also fall into this category.

Lemma 3 shifts the problem of establishing the regret bound of Algorithm 1 to lower-bounding the probabilities of the two sequences of events, $\{\tilde{E}_{1,t}\}_{t=1}^T$ and $\{\tilde{E}_{2,t}\}_{t=1}^T$. Note that $\tilde{E}_{1,t}$ and $\tilde{E}_{2,t}$ concern $\{X_i\}_{i=1}^{t-1}$ and $Z_t$ only, and are independent of the further randomness of $\{\eta_t\}_{t=1}^T$.

**Sketch of Proof of Lemma 3.** The concentration condition follows immediately from the triangle inequality. To show the optimism condition, we verify the following logical implication:
$$\left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \,\wedge\, \tilde{E}_{1,t} \right) \vee E_{2,t}^C \implies \left( x^{*\top}\theta^* \le X_t^\top\theta_t \,\wedge\, E_{1,t} \right) \vee E_{2,t}^C ,$$
where Lemma 2 bridges the anti-concentration on the left-hand side to the optimism on the right-hand side. This implication is converted to the following probability inequality:
$$\mathbb{P}\left( \left( x^{*\top}\theta^* \le X_t^\top\theta_t \,\wedge\, E_{1,t} \right) \vee E_{2,t}^C \,\middle|\, \mathcal{F}_{t-1} \right) \ge \mathbb{P}\left( \left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \,\wedge\, \tilde{E}_{1,t} \right) \vee E_{2,t}^C \,\middle|\, \mathcal{F}_{t-1} \right) .$$
By the definition of $\tilde{E}_{2,t}$, the right-hand side is bounded below by $p$, implying optimism condition (3). The probability of failure is bounded by the union bound and Lemma 1. $\square$

Theorem 1 employs Gaussian perturbations for a concrete instantiation of Algorithm 1. We define two values $\tilde{\gamma}_T$ and $\gamma_T$, which serve as the confidence radii of $\tilde{\theta}_t$ and $\theta_t$ for $t \in [T]$ under the Gaussian perturbation:
$$\tilde{\gamma}_T := \beta_T\left( \sqrt{d} + \sqrt{2\log\frac{2T}{\delta}} + \sqrt{d\log\left(1 + \frac{T}{d\lambda}\right) + 2\log\frac{2T}{\delta}} \right), \qquad \gamma_T := \tilde{\gamma}_T + \beta_T . \tag{7}$$
Note that in terms of $d$ and $T$, both $\tilde{\gamma}_T$ and $\gamma_T$ are $O(d\log T)$. Lemma 4 states the concentration result. Its proof is a simple application of Lemma 9 and is presented in Appendix E.

**Lemma 4.** Suppose Algorithm 1 is run with the parameters specified in Theorem 1. Fix $t \in [T]$ and $j \in [m]$. Suppose the sequence of arms $X_1, \ldots, X_t$ is chosen arbitrarily, possibly randomly, and not necessarily by the agent. Let $\tilde{\theta}_t^j$ be defined as in Eq. (5). Then, with probability at least $1 - \delta/T$, $\|\tilde{\theta}_t^j\|_{V_t} \le \tilde{\gamma}_T$ holds.

The following fact describes an anti-concentration property of the Gaussian distribution, which follows from the fact that a linear combination of independent Gaussians is again Gaussian.

**Fact 1.** If $Z \sim N(0, \alpha^2 I_n)$ for some $\alpha \ge 0$ and $u \in \mathbb{R}^n$ is a fixed vector for some $n \in \mathbb{N}$, then $\mathbb{P}(u^\top Z \ge \alpha\|u\|_2) \ge \mathbb{P}(z \ge 1) =: p_N$, where $z \sim N(0, 1)$. We note that $p_N \ge 0.15$.

All the building blocks needed to prove Theorem 1 are now ready. The proof shows that the events specified in Lemma 3 occur with high probability.
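The constant $p_N$ in Fact 1 is dimension-free: $u^\top Z \sim N(0, \alpha^2\|u\|_2^2)$, so the event reduces to a standard normal exceeding 1. A two-line computation (ours) makes the value explicit:

```python
import math

# Fact 1 via the reduction u^T Z ~ N(0, alpha^2 ||u||_2^2): the probability
# P(u^T Z >= alpha * ||u||_2) equals P(z >= 1) for a standard normal z,
# independently of n, u, and alpha.
p_N = 0.5 * math.erfc(1 / math.sqrt(2))   # = P(z >= 1)
print(f"p_N = {p_N:.4f}")                 # ~0.1587 >= 0.15, as used in Theorem 1
```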
**Proof of Theorem 1.** Define $\tilde{E}_{1,t}$ and $\tilde{E}_{2,t}$ as in Lemma 3 with $\tilde{\gamma} = \tilde{\gamma}_T$ and $p = p_N/4$, where $\tilde{\gamma}_T$ is defined in Eq. (7) and $p_N$ is defined in Fact 1. We show that these events occur with high probability; the rest follows from Lemma 3. For the sake of the analysis, assume that $\delta/T \le p_N/2 \approx 0.08$, which holds whenever $T \ge 14$ or $\delta \le 0.07$.

Assume that the perturbation values $W^j$ and $Z_t^j$ are $\mathcal{F}_0^A$-measurable for all $j \in [m]$ and $t \in [T]$. An interpretation of this assumption is that the algorithm samples all the required values in advance. Note that we still obtain an equivalent algorithm, and this modification need not actually take place in the execution. Under this assumption, the uniform sampling of $j_t \sim J_t$ is the only source of randomness in the choice of $\theta_t$ and $X_t$ when conditioned on the history $\mathcal{F}_{t-1}$. Though it may seem unintuitive, this modification simplifies the proof because we only need to deal with $j_t$.

We first lower-bound the probability of $\tilde{E}_{1,t}$. When $j \in [m]$ is fixed, we can apply Lemma 4 and obtain $\mathbb{P}(\|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} \le \tilde{\gamma}_T) \ge 1 - \delta/T$. Since $j_t$ is sampled independently of $\tilde{\theta}_{t-1}^j$, it holds that
$$\mathbb{P}\left( \|\tilde{\theta}_{t-1}^{j_t}\|_{V_{t-1}} \le \tilde{\gamma}_T \right) = \sum_{j=1}^{m}\mathbb{P}\left( j_t = j,\; \|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} \le \tilde{\gamma}_T \right) = \frac{1}{m}\sum_{j=1}^{m}\mathbb{P}\left( \|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} \le \tilde{\gamma}_T \right) \ge 1 - \frac{\delta}{T} . \tag{8}$$

Now, we bound the probability of $\tilde{E}_{2,t}$. Fix $j \in [m]$. Recall that $Z_t^j$ is the perturbation vector of the $j$-th model, defined in Eq. (6). The choice of Gaussian perturbation implies that $Z_t^j \sim N(0_{d+t-1}, \beta_T^2 I_{d+t-1})$. Suppose that the sequence of arms $X_1, \ldots, X_T$ is fixed. Then, we can apply Fact 1 and obtain $\mathbb{P}(U_{t-1}^\top Z_t^j \ge \beta_{t-1}\|U_{t-1}\|_2) \ge p_N$, where the probability is over the randomness of the perturbation sequence $Z_t^j$ (note that $\beta_{t-1} \le \beta_T$). Let $I_t^j := \mathbf{1}\{(U_{t-1}^\top Z_t^j \ge \beta_{t-1}\|U_{t-1}\|_2) \wedge (\|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} \le \tilde{\gamma}_T)\}$. Then, we have
$$\mathbb{P}(I_t^j = 1) \ge \mathbb{P}\left( U_{t-1}^\top Z_t^j \ge \beta_{t-1}\|U_{t-1}\|_2 \right) - \mathbb{P}\left( \|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} > \tilde{\gamma}_T \right) \ge p_N - \delta/T \ge p_N/2 , \tag{9}$$
where the first inequality uses that $\mathbb{P}(A \wedge B) \ge \mathbb{P}(A) - \mathbb{P}(B^C)$ holds for any events $A$ and $B$, and the last inequality holds by the assumption $\delta/T \le p_N/2$. Recall that we assumed the perturbation sequence is $\mathcal{F}_0^A$-measurable; hence it is $\mathcal{F}_{t-1}$-measurable, and therefore $I_t^j$ is also $\mathcal{F}_{t-1}$-measurable. This means the value of $I_t^j$ is determined once the history up to time $t-1$ is fixed. The only remaining source of randomness in choosing $\theta_t$ conditioned on $\mathcal{F}_{t-1}$ is the sampling of $j_t \sim J_t$. Therefore, it holds that
$$\mathbb{P}\left( U_{t-1}^\top Z_t^{j_t} \ge \beta_{t-1}\|U_{t-1}\|_2 \,\wedge\, \|\tilde{\theta}_{t-1}^{j_t}\|_{V_{t-1}} \le \tilde{\gamma}_T \,\middle|\, \mathcal{F}_{t-1} \right) = \frac{1}{m}\sum_{j=1}^{m}I_t^j . \tag{10}$$

**Algorithm 2: Linear Perturbed-History Exploration (LinPHE)**

**Input:** regularization parameter $\lambda > 0$; initial perturbation distribution $P_I$ on $\mathbb{R}^d$; reward-perturbation distribution $P_R$ on $\mathbb{R}$.

1. Initialize $V_0 = \lambda I_d$.
2. For $t = 1, 2, \ldots, T$:
   1. Sample $W_t \sim P_I$ and $Z_{t,1}, \ldots, Z_{t,t-1} \overset{\text{i.i.d.}}{\sim} P_R$.
   2. Update $\theta_t = V_{t-1}^{-1}\left( W_t + \sum_{i=1}^{t-1} X_i(Y_i + Z_{t,i}) \right)$.
   3. Pull arm $X_t = \operatorname{argmax}_{x \in \mathcal{X}} x^\top\theta_t$ and observe $Y_t$.
   4. Update $V_t = V_{t-1} + X_t X_t^\top$.

Since we have verified that the expectation of the right-hand side of (10) is greater than $p_N/2$, the Azuma-Hoeffding inequality (Lemma 12) implies that $\mathbb{P}(\frac{1}{m}\sum_{j=1}^{m}I_t^j < p_N/4) \le \exp(-p_N^2 m/8)$. Recall that this result is obtained assuming that $X_1, \ldots, X_T$ are fixed. We take the union bound over all possible sequences of arms. However, a naive union bound multiplies the failure probability by $K^T$, which leads to an undesirable requirement of $m$ scaling linearly with $T$. We present the following proposition, inspired by an observation from Lu and Van Roy [16] that permutations of the selected arms can be regarded as equivalent. We note that the stronger result of Lemma 6 in Lu and Van Roy [16] is not applicable to our setting, since we do not assume that $\{\eta_t\}_{t=1}^T$ is distributed identically or independently. Although we present all the main ideas supporting Proposition 1 in this section, a few points may require further justification. Due to limited space, we provide a full, rigorous justification of Proposition 1 in Appendix F, where we present a different perspective on the sampling of the perturbations.
**Proposition 1.** There exists an event $E_2'$ such that under $E_2'$, $\frac{1}{m}\sum_{j=1}^{m}I_t^j \ge p_N/4$ holds for $t = 1, \ldots, T$, and $\mathbb{P}(E_2'^C) \le T^K\exp(-p_N^2 m/8)$.

The key observation in Proposition 1 is that the perturbation vector consists of i.i.d. components, hence its distribution is invariant under permutations. Therefore, the distributions of $\tilde{\theta}_{t-1}^j$ and $U_{t-1}^\top Z_t^j$ remain invariant under permutations of the selected arms. Although the sequence of arms and the perturbation vector are not independent as a whole, the permutation that sorts the selected arms preserves the distribution of $Z_t$, since $X_t$ and $Z_t$ are independent for all $t \in [T]$. The number of equivalence classes up to permutation over sequences of arms with length at most $T-1$ is less than $T^K$, since each arm can be selected between $0$ and $T-1$ times inclusively. Therefore, we take the union bound over the $T^K$ equivalence classes and obtain $E_2'$.

Taking $m \ge \frac{8}{p_N^2}(K\log T + \log\frac{1}{\delta})$, we obtain that $\mathbb{P}(E_2'^C) \le \delta$. Eq. (10) implies that $\mathbb{P}(\bigcap_{t=1}^{T}\tilde{E}_{2,t}) \ge \mathbb{P}(E_2') \ge 1 - \delta$. Therefore, by Lemma 3, with probability at least $1 - 4\delta$, the cumulative regret is bounded as follows:
$$R(T) \le \gamma_T\left(1 + \frac{8}{p_N}\right)\sqrt{2dT\log\left(1 + \frac{T}{d\lambda}\right)} + \frac{8\gamma_T}{p_N}\sqrt{\frac{T}{2\lambda}\log\frac{1}{\delta}} = O\left( (d\log T)^{3/2}\sqrt{T} \right). \qquad \square$$

## 6 Ensemble Sampling and Perturbed-History Exploration

In this section, we rigorously investigate the relationship between linear ensemble sampling and LinPHE. A generalized version of LinPHE is described in Algorithm 2. Note that the perturbed estimator, $\theta_t = V_{t-1}^{-1}(W_t + \sum_{i=1}^{t-1}X_i(Y_i + Z_{t,i}))$, resembles the estimator of linear ensemble sampling, which becomes evident when compared with Eq. (5). The main difference is that in LinPHE (Algorithm 2), the perturbation sequence is generated independently of the history at every time step, whereas in linear ensemble sampling (Algorithm 1), the sequence is not renewed but is extended at each time step. However, we further observe that in linear ensemble sampling, as long as an estimator has not been sampled for arm selection, its perturbation sequence is independent of the selected arms and rewards. This implies that an estimator in the ensemble that is selected for the first time is equivalent to the estimator computed by the policy of LinPHE. Specifically, if the $j$-th estimator is selected for the first time at time step $t$, then the perturbation values of the estimator, $W^j$ and $\{Z_i^j\}_{i=1}^{t-1}$, have had no effect on previous interactions. Therefore, freshly sampling them as $W_t^j \sim P_I$ and $\{Z_{t,i}^j\}_{i=1}^{t-1} \overset{\text{i.i.d.}}{\sim} P_R$, as in LinPHE, does not alter future interactions. We conclude that when the ensemble size is greater than or equal to $T$, linear ensemble sampling becomes equivalent to LinPHE when the estimators are selected in a round robin.

**Proposition 2.** Linear ensemble sampling (Algorithm 1) with $m = T$ and a deterministic policy of choosing a model, e.g., $J_t \equiv t$ for $t = 1, \ldots, T$, is equivalent to LinPHE (Algorithm 2).

Proposition 2 shows that LinPHE is a special case of linear ensemble sampling and provides insightful consequences in both directions of the equivalence. To our best knowledge, Proposition 2 is the first result to formally demonstrate the relationship between linear ensemble sampling and LinPHE.

**Linear ensemble sampling with $T$ models is LinPHE:** Since an ensemble of $T$ models is equivalent to LinPHE, which achieves a regret bound of $\tilde{O}(d\sqrt{T\log K})$, an ensemble size larger than $T$ is never necessary. This implication underscores the sub-optimal ensemble-size requirements in Lu and Van Roy [16] and Qin et al. [21]. Even when $K > T$ in our problem setting, this equivalence provides the ground for upper-bounding the ensemble size by $T$. (A code sketch of Algorithm 2 is given below, before the converse direction.)
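For concreteness, here is a hedged Python sketch (ours) of Algorithm 2 with the Gaussian perturbations used in Corollary 1; the names and the environment interface are our own.

```python
import numpy as np

def lin_phe(arms, T, lam, sigma_I, sigma_R, reward_fn, rng):
    """Hedged sketch of Algorithm 2 (LinPHE). With P_I = N(0, sigma_I^2 I_d)
    and P_R = N(0, sigma_R^2), this matches the instantiation of Corollary 1
    up to the choice of sigma_I and sigma_R.
    """
    K, d = arms.shape
    V = lam * np.eye(d)
    xs, ys = [], []                               # history of arms and rewards
    for t in range(1, T + 1):
        W = rng.normal(0.0, sigma_I, size=d)      # fresh W_t ~ P_I every round
        Z = rng.normal(0.0, sigma_R, size=t - 1)  # fresh Z_{t,1..t-1} ~ P_R every round
        s = W + sum(x * (y + z) for x, y, z in zip(xs, ys, Z))
        theta = np.linalg.solve(V, s)             # theta_t = V_{t-1}^{-1} (W_t + ...)
        X = arms[np.argmax(arms @ theta)]
        Y = reward_fn(X)
        V += np.outer(X, X)
        xs.append(X); ys.append(Y)
    return xs, ys
```

Comparing with the ensemble-sampling sketch after Algorithm 1, the only structural change is that the entire perturbation sequence is redrawn each round rather than extended, which is exactly the content of Proposition 2 when each of the $T$ models is used exactly once.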
**LinPHE is linear ensemble sampling with $T$ models:** Conversely, since LinPHE can be regarded as linear ensemble sampling with $T$ models, it is possible to derive a regret bound for LinPHE by following the proof of Theorem 1. We present Corollary 1, which states a new regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ for LinPHE. Note that in this case, the regret bound is independent of $K$.

**Corollary 1** (Regret bound of LinPHE). Fix $\delta \in (0, 1]$. Algorithm 2 with $\lambda \ge 1$, $P_I = N(0_d, \lambda\beta_T^2 I_d)$, and $P_R = N(0, \beta_T^2)$ achieves $O((d\log T)^{3/2}\sqrt{T})$ cumulative regret with probability at least $1 - 3\delta$.

**Discussion of Corollary 1.** Kveton et al. [12] provide a $\tilde{O}(d\sqrt{T\log K})$ regret bound when the number of arms is finite. Our result is the first to prove that LinPHE achieves a $\tilde{O}(d^{3/2}\sqrt{T})$ regret bound that is independent of the number of arms. Hence, we view our overall results as a generalization of Kveton et al. [12]. It is widely observed that assuming the arm set has size $K$ may allow an interchange of a $\sqrt{d}$ factor with a $\sqrt{\log K}$ factor in the regret bound [4], although attaining such a reduction may not always be trivial [5, 8, 6]. Our focus is not merely on proving another regret bound for LinPHE, but rather on highlighting the close relationship between linear ensemble sampling and LinPHE, which, to our knowledge, has been overlooked in the literature.

The proof of Corollary 1 follows the proof of Theorem 1. Note that the latter part of the proof of Theorem 1 focuses on decoupling the dependency between $\{X_t\}_{t=1}^T$ and $Z_t$. In the case of LinPHE, however, they are already independent, since the perturbation sequence is freshly sampled at every time step, enabling a more concise proof. In particular, as it skips the parts that require the number of arms to be finite, for instance the use of Proposition 1, Corollary 1 holds even when the number of arms is infinite. The full proof is presented in Appendix G.

## 7 Conclusion

We prove that linear ensemble sampling achieves a $\tilde{O}(d^{3/2}\sqrt{T})$ regret bound, marking the first such result in the frequentist setting and matching the best-known regret bound for randomized algorithms. The required ensemble size scales logarithmically with the time horizon as $\Omega(K\log T)$. Additionally, we extend our analysis to LinPHE, demonstrating that it is a special case of linear ensemble sampling with an ensemble of $T$ models and achieves the same regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$. While our work focuses on linear bandits, ensemble sampling has shown superior empirical performance in more complex settings. This suggests that theoretical extensions beyond the linear setting are worth pursuing, with our results providing an important foundation for understanding such extensions. Extending the results to general contextual settings, where the arm set may change over time with potentially non-linear reward functions, represents a promising direction for future work.

**Acknowledgements.** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1C1C1006859, 2022R1A4A1030579, and RS-2023-00222663) and by an AI-Bio Research Grant through Seoul National University.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312-2320, 2011.

[2] Marc Abeille and Alessandro Lazaric.
Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 176-184. PMLR, 2017.

[3] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1-39.26. JMLR Workshop and Conference Proceedings, 2012.

[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127-135. PMLR, 2013.

[5] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

[6] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41.1-41.14. JMLR Workshop and Conference Proceedings, 2012.

[7] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24, 2011.

[8] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214. JMLR Workshop and Conference Proceedings, 2011.

[9] Nima Hamidi and Mohsen Bayati. On frequentist regret of linear Thompson sampling. arXiv preprint arXiv:2006.06790, 2020.

[10] David Janz, Alexander E. Litvak, and Csaba Szepesvári. Ensemble sampling for linear bandits: small ensembles suffice. arXiv preprint arXiv:2311.08376, 2023.

[11] Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. In International Joint Conference on Artificial Intelligence, 2019. URL https://api.semanticscholar.org/CorpusID:67856126.

[12] Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. In Uncertainty in Artificial Intelligence, pages 530-540. PMLR, 2020.

[13] Branislav Kveton, Manzil Zaheer, Csaba Szepesvári, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 2066-2076. PMLR, 2020.

[14] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

[15] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

[16] Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. Advances in Neural Information Processing Systems, 30, 2017.

[17] Xiuyuan Lu, Zheng Wen, and Branislav Kveton. Efficient online recommendation via low-rank ensemble sampling. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 460-464, 2018.

[18] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, 29, 2016.

[19] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

[20] Ian Osband, Benjamin Van Roy, Daniel J. Russo, Zheng Wen, et al. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1-62, 2019.

[21] Chao Qin, Zheng Wen, Xiuyuan Lu, and Benjamin Van Roy.
An analysis of ensemble sampling. Advances in Neural Information Processing Systems, 35:21602-21614, 2022.

[22] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

[23] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1-96, 2018.

[24] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

[25] Runzhe Wan, Haoyu Wei, Branislav Kveton, and Rui Song. Multiplier bootstrap-based exploration. In International Conference on Machine Learning, pages 35444-35490. PMLR, 2023.

[26] Jie Zhou, Botao Hao, Zheng Wen, Jingfei Zhang, and Will Wei Sun. Stochastic low-rank tensor bandits for multi-dimensional online decision making. Journal of the American Statistical Association, pages 1-14, 2024.

[27] Zheqing Zhu and Benjamin Van Roy. Deep exploration for recommendation systems. In Proceedings of the 17th ACM Conference on Recommender Systems, pages 963-970, 2023.

## A Notations

We summarize the notations used in this paper in Table 2 and Table 3.

**Table 2:** Notations specific to this paper

| Symbol | Meaning |
|---|---|
| *Linear bandit* | |
| $\mathcal{X}$ | Set of arms |
| $\theta^*$ | True parameter vector |
| $x^*$ | Optimal arm |
| $X_t$ | Chosen arm at time $t$ |
| $Y_t$ | Observed reward at time $t$ |
| $\eta_t$ | Zero-mean noise at time $t$ |
| $\sigma$ | Sub-Gaussian variance proxy of $\eta_t$ |
| $d$ | Dimension of arms and true parameter vector |
| $K$ | Number of arms (if $\lvert\mathcal{X}\rvert < \infty$) |
| $T$ | Time horizon |
| $R(T)$ | Cumulative regret |
| *Algorithm* | |
| $\lambda$ | Regularization parameter |
| $V_t$ | $\lambda I_d + \sum_{i=1}^{t}X_iX_i^\top$ |
| $\theta_t$ | Perturbed estimator |
| $W^j$ / $W_t$ | Initial perturbation vector |
| $Z_t^j$ / $Z_{t,i}$ | Reward perturbation |
| $P_I$ | Distribution of the initial perturbation $W$ |
| $P_R$ | Distribution of the reward perturbation $Z_t$ |
| $J_t$ | Sampling policy at time $t$ |
| *Analysis* | |
| $\delta$ | Probability of failure |
| $\mathcal{F}_t^X$ | $\sigma(\{X_i\}_{i=1}^t)$ |
| $\mathcal{F}_t^\eta$ | $\sigma(\{\eta_i\}_{i=1}^t)$ |
| $\mathcal{F}_t^A$ | $\sigma$-algebra generated by the algorithm's internal randomness until time $t$ |
| $\mathcal{F}_t$ | $\sigma(\mathcal{F}_t^A \cup \mathcal{F}_t^X \cup \mathcal{F}_t^\eta)$, $\sigma$-algebra of the first $t$ iterations of interaction |
| $\hat{\theta}_t$ | Ridge regression estimator from $t$ samples |
| $\tilde{\theta}_t$ | Perturbation in the estimator, $\theta_{t+1} - \hat{\theta}_t$ |
| $\beta_t$ | Confidence radius of $\hat{\theta}_t$ |
| $\gamma$ | Confidence radius of $\theta_t$ |
| $\tilde{\gamma}$ | Confidence radius of $\tilde{\theta}_t$ |
| $p$ | Probability of optimistic arm selection |
| $p_N$ | $\mathbb{P}(Z \ge 1) \ge 0.15$, where $Z \sim N(0,1)$ |
| $E$ | High-probability event |
| $E_{1,t}$ | Concentration event |
| $E_{2,t}$ | Optimism event |
| $\hat{E}_t$ | Concentration event of $\hat{\theta}_t$ |
| $\tilde{E}_{1,t}$ | Concentration event of $\tilde{\theta}_{t-1}$ |
| $\tilde{E}_{2,t}$ | Anti-concentration event of $\tilde{\theta}_{t-1}$ |
| $Z_t$ | Perturbation vector at time $t$ |
| $\Phi_t$ | $(\sqrt{\lambda}I_d \;\; X_1 \;\cdots\; X_t) \in \mathbb{R}^{d \times (d+t)}$ |
| $U_t$ | $x^{*\top}V_t^{-1}\Phi_t$ |

**Table 3:** Generic notations

| Symbol | Meaning |
|---|---|
| $\mathbb{N}$ | Set of natural numbers, starting from 1 |
| $[M]$ | Set of natural numbers up to $M$, i.e., $\{1, 2, \ldots, M\}$ |
| $\mathbf{1}$ | Indicator function |
| $\lVert\cdot\rVert_2$ | $\ell_2$ norm of a vector |
| $(\cdot)_i$ | $i$-th element of a vector |
| $0_d$ | Zero vector in $\mathbb{R}^d$ |
| $I_d$ | Identity matrix in $\mathbb{R}^{d \times d}$ |
| $(\Omega, \mathcal{F}, \mathbb{P})$ | Probability space |
| $\mathbb{E}$ | Expectation |
| $N(\mu, \Sigma)$ | Gaussian distribution with mean $\mu$ and covariance $\Sigma$ |
| $\wedge$ | Logical conjunction ("and") |
| $\vee$ | Logical disjunction ("or") |

## B Proof of Theorem 2

**Lemma 5** (Lemmas 10 and 11 in Abbasi-Yadkori et al. [1]). Let $\lambda \ge 1$, let $\{X_t\}_{t=1}^T$ be any sequence of $d$-dimensional vectors such that $\|X_t\|_2 \le 1$ for all $t \in [T]$, and let $V_t = \lambda I + \sum_{i=1}^{t}X_iX_i^\top$. Then,
$$\sum_{t=1}^{T}\|X_t\|_{V_{t-1}^{-1}}^2 \le 2d\log\left(1 + \frac{T}{d\lambda}\right).$$
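A quick numerical check (ours) of the elliptical-potential bound in Lemma 5, here with a random sequence of unit-norm vectors:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hedged check of Lemma 5: the sum of squared V-weighted norms stays below
# 2 d log(1 + T/(d*lam)) for any sequence of unit-norm vectors.
d, T, lam = 8, 5_000, 1.0
V = lam * np.eye(d)
total = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    total += x @ np.linalg.solve(V, x)   # ||x||^2 in the V_{t-1}^{-1} norm
    V += np.outer(x, x)
print(f"sum = {total:.2f}  vs  bound = {2 * d * np.log(1 + T / (d * lam)):.2f}")
```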
As alluded to in the discussion of Theorem 2, we utilize Markov's inequality, and the random variable of interest is $X_t^\top\theta_t\mathbf{1}\{E_{1,t} \cap E_{2,t}\}$. However, this random variable may not be non-negative, precluding a direct use of Markov's inequality. The following lemma shows that subtracting an appropriate term guarantees non-negativity.

**Lemma 6.** Assume the conditions of Theorem 2. Let $J(\theta) = \sup_{x \in \mathcal{X}}x^\top\theta$ for $\theta \in \mathbb{R}^d$. Let $\Theta_t = \{\theta \in \mathbb{R}^d : \|\theta - \theta^*\|_{V_{t-1}} \le \gamma\}$. Define $\theta_t^- = \operatorname{argmin}_{\theta \in \Theta_t}J(\theta)$ and $X_t^- = \operatorname{argmax}_{x \in \mathcal{X}}x^\top\theta_t^-$. For any $\theta \in \mathbb{R}^d$ and an event $E'$, we introduce the notation
$$g_t(\theta, E') = \left( J(\theta) - J(\theta_t^-) \right)\mathbf{1}\{E'\} .$$
Then, $g_t(\theta^*, E') \ge 0$ holds for any event $E' \in \mathcal{F}$, and $g_t(\theta_t, E') \ge 0$ holds almost surely for any event such that $E' \subseteq E_{1,t}$.

**Proof.** We first prove $g_t(\theta^*, E') \ge 0$. Since $\theta^* \in \Theta_t$, $J(\theta^*) \ge \inf_{\theta \in \Theta_t}J(\theta) = J(\theta_t^-)$ always holds. Therefore, for any event $E'$, $g_t(\theta^*, E') \ge 0$ holds. We now suppose $E' \subseteq E_{1,t}$ and prove $g_t(\theta_t, E') \ge 0$. We consider the two cases where $E'$ does and does not hold. Under $E'$, since $E' \subseteq E_{1,t}$ and by concentration condition (2), $\theta_t \in \Theta_t$ holds almost surely. Then, $J(\theta_t) \ge \inf_{\theta \in \Theta_t}J(\theta) = J(\theta_t^-)$ holds. Under $E'^C$, $g_t(\theta_t, E') = 0 \ge 0$ trivially holds. Therefore, $g_t(\theta_t, E') \ge 0$ holds almost surely. $\square$

**Proof of Theorem 2.** We show that with probability at least $1 - \delta$, it holds that
$$R(T)\mathbf{1}\{E\} \le \gamma\left(1 + \frac{2}{p}\right)\sqrt{2dT\log\left(1 + \frac{T}{d\lambda}\right)} + \frac{2\gamma}{p}\sqrt{\frac{T}{2\lambda}\log\frac{1}{\delta}} .$$
We first bound the instantaneous regret under the event $E$ at time $t$:
$$\left( x^{*\top}\theta^* - X_t^\top\theta^* \right)\mathbf{1}\{E\} = \underbrace{\left( x^{*\top}\theta^* - X_t^\top\theta_t \right)\mathbf{1}\{E\}}_{I_1} + \underbrace{\left( X_t^\top\theta_t - X_t^\top\theta^* \right)\mathbf{1}\{E\}}_{I_2} .$$
$I_2$ is directly bounded under $E$:
$$I_2 = X_t^\top(\theta_t - \theta^*)\mathbf{1}\{E\} \le \|X_t\|_{V_{t-1}^{-1}}\|\theta_t - \theta^*\|_{V_{t-1}}\mathbf{1}\{E\} \le \gamma\|X_t\|_{V_{t-1}^{-1}} ,$$
where the first inequality is due to the Cauchy-Schwarz inequality and the second inequality comes from condition (2). Now, we bound $I_1$. Let $X_t^-$, $\theta_t^-$, and $g_t$ be defined as in Lemma 6. By Lemma 6, $g_t(\theta_t, E) \ge 0$ holds almost surely. Then,
$$I_1 = x^{*\top}\theta^*\mathbf{1}\{E\} - X_t^{-\top}\theta_t^-\mathbf{1}\{E\} + X_t^{-\top}\theta_t^-\mathbf{1}\{E\} - X_t^\top\theta_t\mathbf{1}\{E\} = g_t(\theta^*, E) - g_t(\theta_t, E) \le g_t(\theta^*, E) \le g_t(\theta^*, E_{2,t}) ,$$
where the last inequality holds since $E \subseteq E_{2,t}$. Again by Lemma 6, $g_t(\theta^*, E_{2,t})$ and $g_t(\theta_t, E_{1,t} \cap E_{2,t})$ are non-negative almost surely. Note that by the first part of condition (3), $E_{2,t}$ is $\mathcal{F}_{t-1}$-measurable, and hence $g_t(\theta^*, E_{2,t}) = (x^{*\top}\theta^* - X_t^{-\top}\theta_t^-)\mathbf{1}\{E_{2,t}\}$ is also $\mathcal{F}_{t-1}$-measurable. Applying Markov's inequality conditioned on $\mathcal{F}_{t-1}$, we obtain
$$g_t(\theta^*, E_{2,t})\,\mathbb{P}\left( g_t(\theta^*, E_{2,t}) \le g_t(\theta_t, E_{1,t} \cap E_{2,t}) \,\middle|\, \mathcal{F}_{t-1} \right) \le \mathbb{E}\left[ g_t(\theta_t, E_{1,t} \cap E_{2,t}) \,\middle|\, \mathcal{F}_{t-1} \right] . \tag{11}$$
We lower-bound the probability on the left-hand side using condition (3). Suppose the event of interest in condition (3), namely $((x^{*\top}\theta^* \le X_t^\top\theta_t) \wedge E_{1,t}) \vee E_{2,t}^C$, holds. Under this event, either $E_{2,t}^C$ holds or $(x^{*\top}\theta^* \le X_t^\top\theta_t) \wedge E_{1,t} \wedge E_{2,t}$ holds. Under $E_{2,t}^C$, the relation $g_t(\theta^*, E_{2,t}) \le g_t(\theta_t, E_{1,t} \cap E_{2,t})$ reduces to $0 \le 0$, which trivially holds. Otherwise, we have $(x^{*\top}\theta^* \le X_t^\top\theta_t) \wedge E_{1,t} \wedge E_{2,t}$, under which
$$x^{*\top}\theta^* \le X_t^\top\theta_t \implies x^{*\top}\theta^* - X_t^{-\top}\theta_t^- \le X_t^\top\theta_t - X_t^{-\top}\theta_t^- \implies \left( x^{*\top}\theta^* - X_t^{-\top}\theta_t^- \right)\mathbf{1}\{E_{2,t}\} \le \left( X_t^\top\theta_t - X_t^{-\top}\theta_t^- \right)\mathbf{1}\{E_{1,t} \cap E_{2,t}\} \implies g_t(\theta^*, E_{2,t}) \le g_t(\theta_t, E_{1,t} \cap E_{2,t}) .$$
Therefore, we have shown that
$$\left( x^{*\top}\theta^* \le X_t^\top\theta_t \,\wedge\, E_{1,t} \right) \vee E_{2,t}^C \implies g_t(\theta^*, E_{2,t}) \le g_t(\theta_t, E_{1,t} \cap E_{2,t}) ,$$
which implies $\mathbb{P}(g_t(\theta^*, E_{2,t}) \le g_t(\theta_t, E_{1,t} \cap E_{2,t}) \mid \mathcal{F}_{t-1}) \ge p$ by condition (3). Therefore, from inequality (11) we obtain $g_t(\theta^*, E_{2,t}) \le \frac{1}{p}\mathbb{E}[g_t(\theta_t, E_{1,t} \cap E_{2,t}) \mid \mathcal{F}_{t-1}]$. Lastly, we bound $g_t(\theta_t, E_{1,t} \cap E_{2,t})$:
$$g_t(\theta_t, E_{1,t} \cap E_{2,t}) = \left( X_t^\top\theta_t - X_t^{-\top}\theta_t^- \right)\mathbf{1}\{E_{1,t} \cap E_{2,t}\} \le \left( X_t^\top\theta_t - X_t^\top\theta_t^- \right)\mathbf{1}\{E_{1,t} \cap E_{2,t}\} \le \|X_t\|_{V_{t-1}^{-1}}\|\theta_t - \theta_t^-\|_{V_{t-1}}\mathbf{1}\{E_{1,t}\} \le \|X_t\|_{V_{t-1}^{-1}}\left( \|\theta_t - \theta^*\|_{V_{t-1}} + \|\theta_t^- - \theta^*\|_{V_{t-1}} \right)\mathbf{1}\{E_{1,t}\} \le 2\gamma\|X_t\|_{V_{t-1}^{-1}} ,$$
where the first inequality uses that $X_t^{-\top}\theta_t^- = \sup_{x \in \mathcal{X}}x^\top\theta_t^- \ge X_t^\top\theta_t^-$, the second inequality is due to the Cauchy-Schwarz inequality, the third inequality holds by the triangle inequality, and the last inequality comes from condition (2) and the fact that $\theta_t^- \in \Theta_t$ as defined in Lemma 6.
Combining all of the above, the instantaneous regret at time $t$ under the event $E$ is bounded as
$$\left( x^{*\top}\theta^* - X_t^\top\theta^* \right)\mathbf{1}\{E\} \le \gamma\|X_t\|_{V_{t-1}^{-1}} + \frac{2\gamma}{p}\,\mathbb{E}\left[ \|X_t\|_{V_{t-1}^{-1}} \,\middle|\, \mathcal{F}_{t-1} \right] = \gamma\left(1 + \frac{2}{p}\right)\|X_t\|_{V_{t-1}^{-1}} + \frac{2\gamma}{p}\left( \mathbb{E}\left[ \|X_t\|_{V_{t-1}^{-1}} \,\middle|\, \mathcal{F}_{t-1} \right] - \|X_t\|_{V_{t-1}^{-1}} \right) .$$
Then, the cumulative regret under $E$ is bounded as
$$R(T)\mathbf{1}\{E\} \le \gamma\left(1 + \frac{2}{p}\right)\sum_{t=1}^{T}\|X_t\|_{V_{t-1}^{-1}} + \frac{2\gamma}{p}\sum_{t=1}^{T}\left( \mathbb{E}\left[ \|X_t\|_{V_{t-1}^{-1}} \,\middle|\, \mathcal{F}_{t-1} \right] - \|X_t\|_{V_{t-1}^{-1}} \right) .$$
To bound the first sum, we apply the Cauchy-Schwarz inequality and then Lemma 5:
$$\sum_{t=1}^{T}\|X_t\|_{V_{t-1}^{-1}} \le \sqrt{T\sum_{t=1}^{T}\|X_t\|_{V_{t-1}^{-1}}^2} \le \sqrt{2dT\log\left(1 + \frac{T}{d\lambda}\right)} .$$
The second sum is bounded by the Azuma-Hoeffding inequality. Note that $0 \le \|X_t\|_{V_{t-1}^{-1}} \le \sqrt{\lambda_{\max}(V_{t-1}^{-1})}\,\|X_t\|_2 \le \frac{1}{\sqrt{\lambda}}$, where $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue. By Lemma 12, with probability at least $1 - \delta$, it holds that
$$\sum_{t=1}^{T}\left( \mathbb{E}\left[ \|X_t\|_{V_{t-1}^{-1}} \,\middle|\, \mathcal{F}_{t-1} \right] - \|X_t\|_{V_{t-1}^{-1}} \right) \le \frac{1}{\sqrt{\lambda}}\sqrt{\frac{T}{2}\log\frac{1}{\delta}} .$$
Therefore, with probability at least $1 - \delta$, the cumulative regret under $E$ is bounded as
$$R(T)\mathbf{1}\{E\} \le \gamma\left(1 + \frac{2}{p}\right)\sqrt{2dT\log\left(1 + \frac{T}{d\lambda}\right)} + \frac{2\gamma}{p}\sqrt{\frac{T}{2\lambda}\log\frac{1}{\delta}} . \qquad \square$$

## C Proof of Lemma 2

**Proof of Lemma 2.** The proof is simple algebra utilizing a useful matrix $\Phi_t$. Define $\Phi_t$ to be the matrix that stacks the $X_i$ next to a scaled identity matrix:
$$\Phi_t := \left( \sqrt{\lambda}I_d \;\; X_1 \;\cdots\; X_t \right) \in \mathbb{R}^{d \times (d+t)} .$$
We can express the relevant matrix and vectors using $\Phi_t$: $V_t = \Phi_t\Phi_t^\top$, $\tilde{\theta}_t^j = V_t^{-1}\Phi_tZ_{t+1}^j$, and $U_t^\top = x^{*\top}V_t^{-1}\Phi_t$. Defining $\tilde{\theta}_{t-1} = \tilde{\theta}_{t-1}^{j_t}$ to be the perturbation in the selected estimator at time $t$, we obtain
$$x^{*\top}\tilde{\theta}_{t-1} = x^{*\top}V_{t-1}^{-1}\Phi_{t-1}Z_t = U_{t-1}^\top Z_t .$$
In addition, it holds that $\|U_{t-1}\|_2^2 = x^{*\top}V_{t-1}^{-1}\Phi_{t-1}\Phi_{t-1}^\top V_{t-1}^{-1}x^* = x^{*\top}V_{t-1}^{-1}x^* = \|x^*\|_{V_{t-1}^{-1}}^2$. Since $X_t^\top\theta_t = \sup_{x \in \mathcal{X}}x^\top\theta_t$, it holds that $x^{*\top}\theta_t \le X_t^\top\theta_t$. Then, we have
$$X_t^\top\theta_t - x^{*\top}\theta^* \ge x^{*\top}\theta_t - x^{*\top}\theta^* = x^{*\top}\left( \tilde{\theta}_{t-1} + \hat{\theta}_{t-1} - \theta^* \right) = U_{t-1}^\top Z_t + x^{*\top}\left( \hat{\theta}_{t-1} - \theta^* \right) .$$
By the condition of the lemma, there exists a positive constant $c$ such that $\|\hat{\theta}_{t-1} - \theta^*\|_{V_{t-1}} \le c$ and $U_{t-1}^\top Z_t \ge c\|U_{t-1}\|_2 \ge 0$. By the Cauchy-Schwarz inequality, it holds that
$$x^{*\top}\left( \hat{\theta}_{t-1} - \theta^* \right) \ge -\|x^*\|_{V_{t-1}^{-1}}\|\hat{\theta}_{t-1} - \theta^*\|_{V_{t-1}} \ge -c\|U_{t-1}\|_2 .$$
Therefore,
$$X_t^\top\theta_t - x^{*\top}\theta^* \ge U_{t-1}^\top Z_t - c\|U_{t-1}\|_2 \ge 0 ,$$
which proves that $x^{*\top}\theta^* \le X_t^\top\theta_t$. $\square$

## D Proof of Lemma 3

**Proof of Lemma 3.** Under $E_{1,t} = \tilde{E}_{1,t} \cap \hat{E}_{t-1}$, the concentration condition, specifically condition (2) in Theorem 2, holds by the triangle inequality:
$$\|\theta_t - \theta^*\|_{V_{t-1}} = \|\tilde{\theta}_{t-1} + \hat{\theta}_{t-1} - \theta^*\|_{V_{t-1}} \le \|\tilde{\theta}_{t-1}\|_{V_{t-1}} + \|\hat{\theta}_{t-1} - \theta^*\|_{V_{t-1}} \le \tilde{\gamma} + \beta_{t-1} \le \gamma .$$
To show the optimism condition, condition (3) in Theorem 2, we first show that $E_{2,t} \in \mathcal{F}_{t-1}$. $\hat{E}_{t-1} \in \mathcal{F}_{t-1}$ holds since it concerns $\{X_i, \eta_i\}_{i=1}^{t-1}$ only. Since $\mathbb{P}((U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2) \wedge \tilde{E}_{1,t} \mid \mathcal{F}_{t-1})$ is an $\mathcal{F}_{t-1}$-measurable random variable, and $\tilde{E}_{2,t}$ is the event that this random variable is greater than or equal to $p$, $\tilde{E}_{2,t}$ is in $\mathcal{F}_{t-1}$. Therefore, we obtain $E_{2,t} = \hat{E}_{t-1} \cap \tilde{E}_{2,t} \in \mathcal{F}_{t-1}$. To prove the remaining part of condition (3), we demonstrate the following chain of logical implications:
$$\left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \right) \vee E_{2,t}^C \implies \left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \wedge E_{2,t} \right) \vee E_{2,t}^C \implies \left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \wedge \hat{E}_{t-1} \right) \vee E_{2,t}^C \implies \left( x^{*\top}\theta^* \le X_t^\top\theta_t \wedge \tilde{E}_{1,t} \wedge \hat{E}_{t-1} \right) \vee E_{2,t}^C \implies \left( x^{*\top}\theta^* \le X_t^\top\theta_t \wedge E_{1,t} \right) \vee E_{2,t}^C ,$$
where the first implication follows from $A \vee B^C \implies (A \wedge B) \vee B^C$, the second holds since $E_{2,t} \subseteq \hat{E}_{t-1}$, the third holds by Lemma 2, which states that $(U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \hat{E}_{t-1}) \implies (x^{*\top}\theta^* \le X_t^\top\theta_t)$, and the last by $E_{1,t} = \tilde{E}_{1,t} \cap \hat{E}_{t-1}$. This chain of implications shows that
$$\mathbb{P}\left( \left( x^{*\top}\theta^* \le X_t^\top\theta_t \wedge E_{1,t} \right) \vee E_{2,t}^C \,\middle|\, \mathcal{F}_{t-1} \right) \ge \mathbb{P}\left( \left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \right) \vee E_{2,t}^C \,\middle|\, \mathcal{F}_{t-1} \right) .$$
We bound the right-hand side using the definition of $E_{2,t}$:
$$\mathbb{P}\left( \left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \right) \vee E_{2,t}^C \,\middle|\, \mathcal{F}_{t-1} \right) = \mathbb{E}\left[ \mathbf{1}\left\{ U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \right\}\mathbf{1}\{E_{2,t}\} + \mathbf{1}\{E_{2,t}^C\} \,\middle|\, \mathcal{F}_{t-1} \right] = \mathbb{E}\left[ \mathbf{1}\left\{ U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t} \right\} \,\middle|\, \mathcal{F}_{t-1} \right]\mathbf{1}\{E_{2,t}\} + \mathbf{1}\{E_{2,t}^C\} \ge p\,\mathbf{1}\{E_{2,t}\} + \mathbf{1}\{E_{2,t}^C\} \ge p ,$$
where the second equality uses that $E_{2,t} \in \mathcal{F}_{t-1}$, and the first inequality holds since under $E_{2,t} \subseteq \tilde{E}_{2,t}$ we have $\mathbb{E}[\mathbf{1}\{U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \wedge \tilde{E}_{1,t}\} \mid \mathcal{F}_{t-1}] \ge p$ by the definition of $\tilde{E}_{2,t}$.
Therefore, $E_{1,t}$ and $E_{2,t}$ satisfy conditions (2) and (3). By Theorem 2, the regret bound stated in inequality (4) holds with probability at least $1 - \delta - \mathbb{P}(E^C)$, where the union bound is taken. Note that
$$E = \bigcap_{t=1}^{T}\left( E_{1,t} \cap E_{2,t} \right) = \bigcap_{t=1}^{T}\left( \tilde{E}_{1,t} \cap \tilde{E}_{2,t} \cap \hat{E}_{t-1} \right) \supseteq \hat{E} \cap \tilde{E} .$$
The failure probability is bounded as
$$\delta + \mathbb{P}(E^C) \le \delta + \mathbb{P}(\hat{E}^C) + \mathbb{P}(\tilde{E}^C) \le 2\delta + \mathbb{P}(\tilde{E}^C) ,$$
where the first inequality takes the union bound over $E^C \subseteq \hat{E}^C \cup \tilde{E}^C$ and the second inequality is due to Lemma 1. $\square$

## E Proof of Lemma 4

Lemma 4 is a special case of Lemma 9, which generalizes the Gaussian distribution to any sub-Gaussian distribution. We first provide a general chi-squared concentration result, which is required to bound the perturbation induced by $W$. A generalized version of Lemma 7 is presented in Lemma 10.

**Lemma 7.** If $Z \sim N(0, I_d)$ is a $d$-dimensional multivariate Gaussian vector, then for any $\delta \in (0, 1]$,
$$\mathbb{P}\left( \|Z\|_2 \le \sqrt{d} + \sqrt{2\log\frac{1}{\delta}} \right) \ge 1 - \delta .$$

**Proof.** By Lemma 13 with $x = \log\frac{1}{\delta}$, it holds that
$$\mathbb{P}\left( \|Z\|_2^2 \ge d + 2\sqrt{d\log\frac{1}{\delta}} + 2\log\frac{1}{\delta} \right) \le \delta .$$
Since $d + 2\sqrt{d\log\frac{1}{\delta}} + 2\log\frac{1}{\delta} \le \left( \sqrt{d} + \sqrt{2\log\frac{1}{\delta}} \right)^2$, it holds that $\|Z\|_2^2 \le \left( \sqrt{d} + \sqrt{2\log\frac{1}{\delta}} \right)^2$ with probability at least $1 - \delta$. $\square$

**Proof of Lemma 4.** Applying Lemma 7 with $\delta/2T$ in place of $\delta$ yields
$$\mathbb{P}\left( \|W^j\|_2 \le \sqrt{\lambda}\,\beta_T\left( \sqrt{d} + \sqrt{2\log\frac{2T}{\delta}} \right) \right) \ge 1 - \frac{\delta}{2T} ,$$
since $W^j \sim N(0_d, \lambda\beta_T^2I_d)$. Applying Lemma 9 with $\delta/2T$ in place of $\delta$ then yields that
$$\|\tilde{\theta}_t^j\|_{V_t} \le \beta_T\left( \sqrt{d} + \sqrt{2\log\frac{2T}{\delta}} \right) + \beta_T\sqrt{d\log\left(1 + \frac{T}{d\lambda}\right) + 2\log\frac{2T}{\delta}}$$
holds with probability at least $1 - \delta/T$. The right-hand side equals $\tilde{\gamma}_T$ as defined in Eq. (7). $\square$

## F Rigorous Justification of Proposition 1

In this section, we rigorously justify Proposition 1, which is stated in the proof of Theorem 1. To do so, we present a different viewpoint on the perturbation sequences. Denote the arms as $\mathcal{X} = \{x_1, x_2, \ldots, x_K\}$. For the sake of the analysis, assume that $\delta/T \le p_N/2 \approx 0.08$, which holds whenever $T \ge 14$ or $\delta \le 0.07$.

We reconstruct the perturbations sampled by Algorithm 1. Assume that, in addition to $\{W^j\}_{j=1}^m \overset{\text{i.i.d.}}{\sim} P_I$, Algorithm 1 samples $m \cdot KT$ values $\{\{Z_{k,t}^j\}_{(k,t)}\}_{j=1}^m \overset{\text{i.i.d.}}{\sim} P_R$ at the beginning, where the subscript $(k, t)$ enumerates from $(1, 1)$ to $(K, T)$. Define $N_{k,t} = \sum_{i=1}^{t}\mathbf{1}\{X_i = x_k\}$ to be the number of times arm $k$ has been chosen up to time $t$. If the $a_t$-th arm, $x_{a_t}$, is selected at time $t$, then we assign $Z_t^j = Z_{a_t, N_{a_t,t}}^j$. Since $Z_{a_t, N_{a_t,t}}^j$ is still an i.i.d. sample of $P_R$ conditioned on the history, specifically on $\sigma(\mathcal{F}_t^X \cup \mathcal{F}_t^\eta \cup \sigma(\{\{Z_i^j\}_{i=1}^{t-1}\}_{j=1}^m))$, we attain an algorithm equivalent to Algorithm 1. We note that these modifications need not take place in the actual execution of the algorithm; their purpose is purely analytical. Define $\mathcal{F}_0^A = \sigma(\{W^j, \{Z_{k,t}^j\}_{(k,t)}\}_{j=1}^m)$, which reflects the fact that the perturbations are sampled in advance. For $t \in [T]$, define $\mathcal{F}_t^A = \sigma(\mathcal{F}_{t-1} \cup \sigma(j_t))$, which indicates that the only additional randomness of the algorithm when choosing $\theta_t$ is the sampling of $j_t \sim J_t$. Define the extended perturbation vector as
$$\bar{Z}_{KT}^j := \left( \tfrac{1}{\sqrt{\lambda}}W^{j\top},\; Z_{1,1}^j, \ldots, Z_{1,T}^j,\; Z_{2,1}^j, \ldots, Z_{K,T}^j \right)^\top \in \mathbb{R}^{d+KT} .$$
Removing some components of $\bar{Z}_{KT}^j$ and reordering yields $Z_t^j$, where the removal and reordering depend on the sequence of chosen arms, namely $a_1, \ldots, a_{t-1}$. Note that $\bar{Z}_{KT}^j \sim N(0_{d+KT}, \beta_T^2I_{d+KT})$. We also define the corresponding extensions of $\Phi_t$ and $U_t$. Define a matrix with $n$ columns, the first $a$ of which are copies of $v \in \mathbb{R}^d$ and the rest of which are $0_d$:
$$\mathrm{rep}(v, a, n) := \left( v \;\cdots\; v \;\; 0_d \;\cdots\; 0_d \right) \in \mathbb{R}^{d \times n} .$$
Define the extended version of $\Phi_t$ as
$$\bar{\Phi}_t = \left( \sqrt{\lambda}I_d \;\; \mathrm{rep}(x_1, N_{1,t}, T) \;\cdots\; \mathrm{rep}(x_K, N_{K,t}, T) \right) \in \mathbb{R}^{d \times (d+KT)} .$$
$\bar{\Phi}_t$ extends $\Phi_t$ by permuting the columns so that the feature vectors from the same arm appear in consecutive columns, and then inserting copies of $0_d$ appropriately. Then, we have $V_t = \bar{\Phi}_t\bar{\Phi}_t^\top$ and $\tilde{\theta}_t^j = V_t^{-1}\bar{\Phi}_t\bar{Z}_{KT}^j$, since $\bar{Z}_{KT}^j$ is permuted and extended from $Z_t^j$ in the corresponding manner.
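The bookkeeping behind $\mathrm{rep}(\cdot)$ and $\bar{\Phi}_t$ can be made concrete with a few lines of code (ours); the sanity check verifies the identity $V_t = \bar{\Phi}_t\bar{\Phi}_t^\top$ on a random instance.

```python
import numpy as np

def rep(v, a, n):
    """rep(v, a, n): d x n matrix whose first a columns are copies of v, rest 0_d."""
    v = np.asarray(v, dtype=float)
    out = np.zeros((len(v), n))
    if a > 0:
        out[:, :a] = v[:, None]
    return out

def phi_bar(arm_vectors, counts, T, lam):
    """Hedged sketch of Phi_bar_t = (sqrt(lam) I_d, rep(x_1, N_{1,t}, T), ...,
    rep(x_K, N_{K,t}, T)); counts[k] plays the role of N_{k,t}."""
    d = arm_vectors.shape[1]
    blocks = [np.sqrt(lam) * np.eye(d)]
    blocks += [rep(x, c, T) for x, c in zip(arm_vectors, counts)]
    return np.hstack(blocks)                 # shape (d, d + K*T)

# Sanity check: Phi_bar Phi_bar^T = lam*I_d + sum_k N_{k,t} * x_k x_k^T = V_t
rng = np.random.default_rng(3)
arms = rng.normal(size=(3, 4)); counts = [2, 0, 5]; lam, T = 1.0, 6
P = phi_bar(arms, counts, T, lam)
V = lam * np.eye(4) + sum(c * np.outer(x, x) for x, c in zip(arms, counts))
print(np.allclose(P @ P.T, V))               # True
```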
We define the extended version of $U_t$ as $\bar{U}_t^\top = x^{*\top}V_t^{-1}\bar{\Phi}_t \in \mathbb{R}^{d+KT}$. $\bar{U}_t$ is a permutation of $U_t$ with additional zeros inserted. It holds that $U_{t-1}^\top Z_t^j = \bar{U}_{t-1}^\top\bar{Z}_{KT}^j$ and $\|\bar{U}_{t-1}\|_2 = \|U_{t-1}\|_2$. Let $\mathcal{X}_t$ be the set of all possible $\bar{\Phi}_t$. Since $\bar{\Phi}_t$ is fully determined by $N_{1,t}, \ldots, N_{K,t}$, and each $N_{k,t}$ takes a value between $0$ and $T-1$ inclusively when $0 \le t \le T-1$, we obtain $|\bigcup_{t=0}^{T-1}\mathcal{X}_t| \le T^K$. For any $t \in [T]$, take any $\bar{\Phi}_{t-1} \in \mathcal{X}_{t-1}$. Note that $\bar{U}_{t-1}^\top = x^{*\top}(\bar{\Phi}_{t-1}\bar{\Phi}_{t-1}^\top)^{-1}\bar{\Phi}_{t-1}$ is fully determined by $\bar{\Phi}_{t-1}$, and $\tilde{\theta}_{t-1}^j = (\bar{\Phi}_{t-1}\bar{\Phi}_{t-1}^\top)^{-1}\bar{\Phi}_{t-1}\bar{Z}_{KT}^j$ is determined by $\bar{\Phi}_{t-1}$ and $\bar{Z}_{KT}^j$. Assuming that $\bar{\Phi}_{t-1}$ is fixed, $\bar{U}_{t-1}$ is also fixed; therefore we can apply Fact 1 and obtain $\mathbb{P}(\bar{U}_{t-1}^\top\bar{Z}_{KT}^j \ge \beta_T\|\bar{U}_{t-1}\|_2) \ge p_N$, where the only source of randomness is $\bar{Z}_{KT}^j$. Applying Lemma 4, we obtain $\mathbb{P}(\|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} \le \tilde{\gamma}_T) \ge 1 - \delta/T$. Let $I^j(\bar{\Phi}_{t-1}) = \mathbf{1}\{(\bar{U}_{t-1}^\top\bar{Z}_{KT}^j \ge \beta_T\|\bar{U}_{t-1}\|_2) \wedge (\|\tilde{\theta}_{t-1}^j\|_{V_{t-1}} \le \tilde{\gamma}_T)\}$. Then, $\mathbb{P}(I^j(\bar{\Phi}_{t-1}) = 1) \ge p_N - \delta/T \ge p_N/2$. Since the only randomness in choosing the arm conditioned on $\mathcal{F}_{t-1}$ comes from sampling $j_t$, it holds that
$$\mathbb{P}\left( \bar{U}_{t-1}^\top\bar{Z}_{KT}^{j_t} \ge \beta_T\|\bar{U}_{t-1}\|_2 \,\wedge\, \|\tilde{\theta}_{t-1}^{j_t}\|_{V_{t-1}} \le \tilde{\gamma}_T \,\middle|\, \mathcal{F}_{t-1} \right) = \frac{1}{m}\sum_{j=1}^{m}I^j(\bar{\Phi}_{t-1}) .$$
We apply the Azuma-Hoeffding inequality to show that $\frac{1}{m}\sum_{j=1}^{m}I^j(\bar{\Phi}_{t-1})$ is bounded below with high probability. Since $\{I^j(\bar{\Phi}_{t-1})\}_{j=1}^m$ are i.i.d. Bernoulli random variables with success probability greater than $p_N/2$, it holds that
$$\mathbb{P}\left( \frac{1}{m}\sum_{j=1}^{m}I^j(\bar{\Phi}_{t-1}) < \frac{p_N}{4} \right) \le \mathbb{P}\left( \frac{1}{m}\sum_{j=1}^{m}I^j(\bar{\Phi}_{t-1}) < \frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[ I^j(\bar{\Phi}_{t-1}) \right] - \frac{p_N}{4} \right) \le \exp\left( -\frac{p_N^2m}{8} \right) ,$$
where Lemma 12 is applied at the end. By taking the union bound over $\bigcup_{t=0}^{T-1}\mathcal{X}_t$, we obtain
$$\mathbb{P}\left( \exists\,\bar{\Phi} \in \bigcup_{t=0}^{T-1}\mathcal{X}_t : \frac{1}{m}\sum_{j=1}^{m}I^j(\bar{\Phi}) < \frac{p_N}{4} \right) \le T^K\exp\left( -\frac{p_N^2m}{8} \right) .$$
The event $E_2'$ is defined as the complement of the event above:
$$E_2' := \left\{ \omega \in \Omega : \forall\,\bar{\Phi} \in \bigcup_{t=0}^{T-1}\mathcal{X}_t,\; \frac{1}{m}\sum_{j=1}^{m}I^j(\bar{\Phi}) \ge \frac{p_N}{4} \right\} .$$
The proof is complete. $\square$

## G Proof of Corollary 1

**Proof of Corollary 1.** Let $\tilde{E}_{1,t}$ and $\tilde{E}_{2,t}$ be defined as in Lemma 3 with $\tilde{\gamma} = \tilde{\gamma}_T$ and $p = p_N/2$. We redefine a couple of notations to fit Algorithm 2. Let
$$Z_t := \left( \tfrac{1}{\sqrt{\lambda}}W_t^\top,\; Z_{t,1}, \ldots, Z_{t,t-1} \right)^\top \in \mathbb{R}^{d+t-1}$$
be the perturbation vector at time $t$, and let
$$\tilde{\theta}_{t-1} := V_{t-1}^{-1}\left( W_t + \sum_{i=1}^{t-1}X_iZ_{t,i} \right)$$
be the perturbation in the estimator $\theta_t$. Regarding $\tilde{\theta}_{t-1}$ as one of the $\tilde{\theta}_{t-1}^j$ in the proof of Theorem 1, we obtain that $\mathbb{P}(\tilde{E}_{1,t}) \ge 1 - \delta/T$ and $\mathbb{P}((U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2) \wedge \tilde{E}_{1,t}) \ge p_N/2$ hold, analogously to inequalities (8) and (9), respectively. Moreover, in contrast to the proof of Theorem 1, the perturbation vector $Z_t$ is now independent of $\mathcal{F}_{t-1}$. Noting that $U_{t-1}$ is $\mathcal{F}_{t-1}$-measurable, it always holds that
$$\mathbb{P}\left( \left( U_{t-1}^\top Z_t \ge \beta_{t-1}\|U_{t-1}\|_2 \right) \wedge \tilde{E}_{1,t} \,\middle|\, \mathcal{F}_{t-1} \right) \ge \frac{p_N}{2} .$$
This proves that $\tilde{E}_{2,t}$ is in fact the whole sample space. Taking the union bound, we obtain $\mathbb{P}(\tilde{E}^C) \le \sum_{t=1}^{T}\mathbb{P}(\tilde{E}_{1,t}^C) \le \delta$. By Lemma 3, with probability at least $1 - 3\delta$, the cumulative regret is bounded by
$$\gamma_T\left(1 + \frac{4}{p_N}\right)\sqrt{2dT\log\left(1 + \frac{T}{d\lambda}\right)} + \frac{4\gamma_T}{p_N}\sqrt{\frac{T}{2\lambda}\log\frac{1}{\delta}} = O\left( (d\log T)^{3/2}\sqrt{T} \right). \qquad \square$$

## H Generalizability of Perturbation Distributions

In this section, we demonstrate that any distribution that is symmetric, sub-Gaussian, and has lower-bounded variance satisfies the results of Lemma 4 and Fact 1, possibly up to a constant factor. As mentioned in Remark 3, this implies that our results remain valid when the Gaussian distribution is replaced with any symmetric non-degenerate sub-Gaussian distribution. The following lemma is a standard concentration result for vector martingales with sub-Gaussian noise.

**Lemma 8** (Theorem 1 in Abbasi-Yadkori et al. [1]). Let $\{\mathcal{F}_t\}_{t=0}^\infty$ be a filtration. Let $\{\xi_t\}_{t=1}^\infty$ be a sequence of real-valued random variables such that $\xi_t$ is $\mathcal{F}_t$-measurable and $\mathcal{F}_{t-1}$-conditionally $\sigma$-sub-Gaussian for some $\sigma \ge 0$. Let $\{X_t\}_{t=1}^\infty$ be a sequence of $\mathbb{R}^d$-valued random vectors such that $X_t$ is $\mathcal{F}_{t-1}$-measurable and $\|X_t\|_2 \le 1$ almost surely for all $t \ge 1$. Fix $\lambda \ge 1$ and let $V_t = \lambda I + \sum_{i=1}^{t}X_iX_i^\top$.
Then, for any δ (0, 1], with probability at least 1 δ, the following inequality holds for all t 0: d log 1 + t Next lemma is a simple application of Lemma 8, which proves the concentration result of eθt under the sub Gaussianity of PR. Lemma 9 (Sufficient condition for concentration). Fix any δ (0, 1] and t [T]. Assume that P W 2 > Lδ t,0 δ, and {Zi}t 1 i=1 are mutually independent of each other and are FX i - conditionally σ2 R-sub Gaussian respectively for each i [t 1]. Then, with probability at least 1 2δ, it holds that eθt 1 Vt 1 σR d log 1 + T Proof. Recall that eθt 1 = V 1 t 1(W + Pt 1 i=1 Xi Zi). Therefore, eθt 1 Vt 1 = Vt 1eθt 1 V 1 t 1 i=1 Xi Zt,i V 1 t 1 W V 1 t 1 + i=1 Xi Zi V 1 t 1 , where the triangle inequality is used for the last inequality. To bound the first term, we use the fact that λmax(V 1 t 1) 1 λ, where λmax( ) denotes the maximum eigenvalue. It implies that W V 1 t 1 λ W 2. Since P( W 2 > Lδ t,0) δ by assumption, W V 1 t 1 Lδ t,0/ λ holds with probability at least 1 δ. The second term is bounded by Lemma 8. With probability 1 δ, it holds that i=1 Xi Zt,i d log 1 + t By taking the union bound over the two events, we obtain that eθt 1 Vt 1 Lδ t,0 d log 1 + t holds with probability at least 1 2δ, which proves the the lemma. We also provide that the ℓ2-norm of the vector W whose components are i.i.d. samples of a sub Gaussian distribution is upper-bounded with high-probability. This lemma, combined with Lemma 9, justifies PI to be a distribution over Rd such that each component is an i.i.d. sample of a sub Gaussian distribution. Lemma 10. Suppose that W Rd and each component of W is sampled i.i.d. from a σ2 Isub Gaussian distribution. Take any δ (0, 1]. Then, with probability 1 δ, it holds that 2d + 4 log 1 Proof. Take X1 = e1, . . . , Xd = ed, where {e1, . . . , ed} is the standard basis of Rd. By Lemma 8 with ξi = (W)i and λ = 1, it holds that d log 2 + 2 log 1 with probability at least 1 δ. The proof is completed by noting that Pd i=1(W)iei (2Id) 1 = 1 Finally, we demonstrate that sub Gaussian distribution with lower-bounded variance satisfies the anticoncentration condition analogous to Fact 1. We normalize the distribution so that it is 1-sub Gaussian. Lemma 11 (Sufficient condition for anti-concentration). Suppose that P is a real-valued distribution that is symmetric, 1-sub Gaussian, and has variance at least 1/2. Suppose Z Rn for some n N and its components are i.i.d. samples of P. Then, for any fixed u Rn, it holds that Proof. Without loss of generality, we may assume that u 2 = 1. Let Y = u Z. Then, Var(Y ) = Var(Pn i=1(u)i(Z)i) = Pn i=1(u)2 i Var((Z)i) = Var(P) 1 2. On the other hand, we attain an upper bound of Var(Y ) as follows: Var(Y ) = E Y 2 = E Y 21 |Y | 1 3 < |Y | 4 + E Y 21 {|Y | 4} 3 < |Y | 4 + E Y 21 {|Y | 4} 3 < |Y | + E Y 21 {|Y | 4} . (12) We upper-bound E[Y 21{|Y | 4}] using its sub Gaussian property. Note that Y = u Z is 1sub Gaussian since u 2 = 1 and the components of Z are independent and 1-sub Gaussian. Applying the standard tail bound of sub Gaussian random variables, it holds that P (|Y | x) 2 exp x2/2 , or equivalently, P Y 2 x 2 exp ( x/2). Then, it holds that E Y 21 Y 2 16 = Z 0 P Y 21 Y 2 16 x dx 0 P Y 21 Y 2 16 x dx + Z 16 P Y 21 Y 2 16 x dx = 16P Y 2 16 + Z 16 P Y 2 x dx = 32e 8 + 4e 8 0.0121 . (13) By plugging in the upper bound of (13) to inequality (12) and reordering the terms, we obtain that 9 0.0121 0.023 . Finally, recall that Z is symmetric, therefore Y is symmetric. 
Therefore, we have P Y 1 I Auxiliary Lemmas Lemma 12 (Azuma-Hoeffding Inequality). Fix n N. Let {Zi}n i=1 be a sequence of real-valued random variables adapted to a filtration {Fi}n i=0. Suppose that there exists a < b such that Zi [a, b] holds almost surely for all i [n]. Then, for any δ (0, 1], the following inequality holds with probability at least 1 δ: i=1 (Zi E [Zi | Ft 1]) (b a) Lemma 13 (Lemma 1 of Laurent and Massart [15]). Let Y1, . . . , Yd be i.i.d. standard Gaussian variables. Set Z = Pd i=1(Y 2 i 1). Then, the following inequality holds for any x > 0, dx + 2x) e x . Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We accurately describe the problem we set and the results we obtain in the abstract and introduction, specifically Section 1. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss the limitations of the work at the end of Section 7. Our work is limited to the linear bandit setting, while ensemble sampling also exhibit superior performance in complex settings. However, our results provide a necessary stepping stone to expand the theoretical analysis of ensemble sampling to such regimes. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. 
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: The full set of assumptions is presented in Section 4. The complete and correct proof is provided throughout Section 5 and the Appendix. All the important ideas are provided as a sketch in the main paper, if not provided as a whole. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] . Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. 
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] . Justification: The paper does not include experiments. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] . Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] . Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. 
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] . Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] . Justification: There is no societal impact of the work performed. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. 
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] . Justification: The paper poses no such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] . Justification: The paper does not use existing assets. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. 
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] . Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] . Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] . Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.