# An Analysis of Ensemble Sampling

Chao Qin (Columbia University, cqin22@gsb.columbia.edu); Zheng Wen, Xiuyuan Lu, Benjamin Van Roy (DeepMind, {zhengwen,lxlu,benvanroy}@google.com)

**Abstract.** Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable. In this paper, we establish a regret bound that ensures desirable behavior when ensemble sampling is applied to the linear bandit problem. This represents the first rigorous regret analysis of ensemble sampling and is made possible by leveraging information-theoretic concepts and novel analytic techniques that may prove useful beyond the scope of this paper.

## 1 Introduction

Thompson sampling (TS) is a popular heuristic for balancing exploration and exploitation in bandit learning. In its basic form, TS maintains a posterior distribution over model parameters. When the prior distribution and likelihood function exhibit conjugacy properties, the posterior distribution can be maintained through computationally tractable Bayesian inference. However, many practical contexts call for more complex models for which exact Bayesian inference becomes computationally intractable. For such contexts, ensemble sampling (ES) can serve as a practical approximation to TS [Lu and Van Roy, 2017]. Instead of the posterior distribution, ES maintains an ensemble of statistically plausible models that can be updated in an efficient incremental manner. The corresponding discrete distribution represents an approximation to the posterior distribution. At the start of each timestep, a model is sampled uniformly from the ensemble and an action is selected to optimize expected reward with respect to the sampled model. Each model is initially sampled from the prior distribution and evolves through an updating process that adapts to observations and random perturbations.

While a growing literature [Osband et al., 2016, Russo et al., 2018, Osband et al., 2018, Lu et al., 2018, Osband et al., 2019, Dwaracherla et al., 2020, Osband et al., 2022] presents applications of ES, there has been no rigorous theory that ensures desirable behavior similar to that enjoyed by TS. A regret bound for ES applied to linear-Gaussian bandits is provided by Lu and Van Roy [2017], but a flaw in the associated analysis has brought this result into question. Since the publication of that paper, several researchers have tried to remedy the analysis or otherwise establish performance guarantees for ES, but these efforts have not succeeded.

We offer in this paper the first rigorous regret analysis of ES. Like Lu and Van Roy [2017], we study ES applied to linear-Gaussian bandits, which serve as a simple sanity check for the approach. We establish a Bayesian regret bound (Theorem 1) that consists of two terms. The first term is identical to a Bayesian regret bound established for TS. The second term accounts for posterior mismatch and can be made arbitrarily small by increasing the ensemble size. Our analysis leverages information-theoretic concepts and novel analytic techniques that may prove useful beyond the scope of this paper. As a stepping stone in our analysis, we establish a general Bayesian regret bound that applies to any learning algorithm for linear bandits (Theorem 2).
In particular, this regret bound is based on the Hellinger distance between the action-selection distribution specified by the learning algorithm and the posterior distribution of the optimal action at each timestep. We believe that this regret bound is of independent interest and might be useful in analyzing other learning algorithms, such as other approximate TS algorithms.

Note that for the linear-Gaussian bandits considered in this paper, Kalman filtering offers a tractable means of exact inference and, as such, an approximation method like ES is not required. However, it is worth mentioning that the analysis in this paper also offers insight into more complex models that call for approximate inference. For example, consider a neural network with very wide hidden layers. A recent research thrust (e.g., Jacot et al. [2018], Lee et al. [2017]) highlights that, under suitable technical conditions, training such a neural network via stochastic gradient descent (SGD) approximates Gaussian process inference. Our result extends directly to finite-armed Gaussian process bandits. This suggests that ES applied with neural networks trained via SGD can be effective in complex bandit problems, though formalizing this extension requires significant technical work, which we leave for future research.

The remainder of this paper is organized as follows: Section 2 reviews related work, and Section 3 formulates the linear bandit problem under consideration and describes the ensemble sampling algorithm. Section 4 presents the main result of this paper: a Bayesian regret bound for ES applied to linear bandits. Before diving into the analysis, we define some notation in Section 5. We then motivate and derive a general regret bound in Section 6 and apply it to the analysis of ES in Section 7. Finally, we conclude the paper and discuss future directions in Section 8.

## 2 Related work

Thompson sampling [Thompson, 1933, Chapelle and Li, 2011, Russo et al., 2018] is a commonly used heuristic for balancing exploration and exploitation in bandits [Lattimore and Szepesvári, 2020] and reinforcement learning [Sutton and Barto, 2018]. However, the basic form of Thompson sampling can be computationally intractable unless the prior distribution and likelihood function exhibit conjugacy properties. To overcome this computational challenge, approximate versions of TS have been developed (see Chapter 5 of Russo et al. [2018]), including approximate TS using Laplace approximation [Chapelle and Li, 2011], bootstrapping [Eckles and Kaptein, 2019, Kveton et al., 2019], variational inference [Urteaga and Wiggins, 2018, Yu et al., 2020], Markov chain Monte Carlo [Casella and George, 1992, Roberts and Tweedie, 1996], hypermodels [Dwaracherla et al., 2020], and optimal transport [Zhang et al., 2019].

Ensemble sampling is an approximate version of Thompson sampling, formally proposed by Lu and Van Roy [2017]. It has become one of the most popular methods in approximate Bayesian inference and has wide applications in sequential decision making. For example, variations of ES have been applied to (deep) reinforcement learning [Osband et al., 2016, 2018, 2019], online recommendation [Lu et al., 2018, Hao et al., 2020, Zhu and Van Roy, 2021], multi-agent reinforcement learning [Dimakopoulou and Van Roy, 2018], behavioral sciences [Eckles and Kaptein, 2019], and marketing strategies [Yang et al., 2020].
Recently, Osband et al. [2022] developed a testbed to compare and rank agents designed for uncertainty modeling, including variants of ensemble agents.

Several works have attempted to analyze ensemble sampling, but none has been successful. In their original paper, Lu and Van Roy [2017] try to analyze ES in linear-Gaussian bandits, but there is a mistake in the final step of the analysis, where they compare the regret of ES to that of TS. In particular, in the proof of their Lemma 8, the hypothetical actions selected by TS are defined on the histories of ES, instead of TS. Therefore, the cumulative regret incurred by these hypothetical TS actions is not equal to the cumulative regret of applying TS throughout all timesteps, while the paper incorrectly claims that they are equal. In addition to Lu and Van Roy [2017], Phan et al. [2019] aim to analyze general approximate TS with approximation errors measured in α-divergence for the K-armed bandit. To demonstrate the main result, Phan et al. [2019] use ES as an example, but thereby inherit the technical flaw in Lu and Van Roy [2017]. In addition, though Phan et al. [2019] show that an approximate version of TS with uniform sampling achieves sublinear regret, this result benefits mainly from the forced exploration.

There are also papers aiming to analyze bandit algorithms with approximate inference. A recent paper [Huang et al., 2022] studies an upper-confidence-bound (UCB) type algorithm for bandit problems with approximate inference. Another recent paper [Ash et al., 2021] proposes and analyzes an ensemble-based UCB algorithm for bandit problems. Kveton et al. [2020] study linear bandits in the frequentist setting and provide a general regret bound tailored to analyzing randomized algorithms with good concentration and anti-concentration properties, but it cannot be directly applied to analyzing ES due to the discrete and incremental nature of ES's updates.

## 3 Preliminaries

We begin by introducing a formulation of the linear bandit and an ensemble sampling algorithm suited for this environment.

### 3.1 Linear bandit

The linear bandit [Abbasi-Yadkori et al., 2011, Lattimore and Szepesvári, 2020] has received much attention in the bandit learning literature. It serves as a simple didactic environment, often used to sanity-check agent designs. In this spirit, we analyze ES as applied to the linear bandit.

We consider the linear bandit with a Gaussian prior distribution and likelihood function. An instance is characterized by a tuple $\mathcal{E} = (\mathcal{A}, \mu_0, \Sigma_0, \sigma^2)$, where $\mathcal{A} \subset \mathbb{R}^d$ is a finite action set, $\mu_0$ and $\Sigma_0$ are the prior mean vector and covariance matrix, and $\sigma^2$ is the noise variance. Let $K = |\mathcal{A}|$ denote the cardinality of the action set. Note that actions are vectors of dimension $d$. The prior distribution over the unknown coefficient vector $\theta \in \mathbb{R}^d$, $P(\theta \in \cdot)$, is multivariate Gaussian $N(\mu_0, \Sigma_0)$. Rewards are generated by a random sequence $(R_t : t = 1, 2, \dots)$ of $K$-dimensional vectors, which is i.i.d. conditioned on $\theta$. In particular, for $t = 1, 2, \dots$,
$$R_{t,a} = a^\top \theta + W_{t,a} \qquad \forall a \in \mathcal{A},$$
where $W_t = (W_{t,a})_{a \in \mathcal{A}}$ is a $K$-dimensional vector with each element distributed as $N(0, \sigma^2)$.

An agent interacts with the linear bandit as follows: at timestep $t = 0, 1, \dots$, the agent executes an action $A_t$; then it observes a reward $R_{t+1,A_t}$. Note that only the reward associated with the executed action $A_t$ is observed, through what is termed bandit feedback. The agent's experience through timestep $t$ is encoded by a history $H_t = (A_0, R_{1,A_0}, \dots, A_{t-1}, R_{t,A_{t-1}})$.
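As a concrete special case (our own illustrative example, not one from the paper), taking $d = K$ and $\mathcal{A} = \{e_1, \dots, e_K\}$, the standard basis vectors, recovers a $K$-armed Gaussian bandit:
$$R_{t,e_k} = e_k^\top \theta + W_{t,e_k} = \theta_k + W_{t,e_k}, \qquad \theta \sim N(\mu_0, \Sigma_0), \quad W_{t,e_k} \sim N(0, \sigma^2),$$
so arm $k$ has unknown mean reward $\theta_k$, and any prior correlation between arms is encoded by the off-diagonal entries of $\Sigma_0$.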
The agent's objective is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_{t=0}^{T-1} R_{t+1,A_t}\big]$ over some duration $T$. This is equivalent to minimizing the Bayesian regret
$$\mathrm{Regret}(T) = \sum_{t=0}^{T-1} \mathbb{E}\left[R_{t+1,A^*} - R_{t+1,A_t}\right],$$
where $A^* \sim \mathrm{unif}\, \arg\max_{a \in \mathcal{A}} a^\top \theta$. The expectations in both expressions integrate over random reward realizations, algorithmic randomness, and the coefficient vector $\theta$. Note that $A^*$ is random since it depends on the random $\theta$ and on the uniform sampling among maximizers (which breaks ties).

### 3.2 Ensemble sampling

In the absence of conjugacy properties that enable efficient Bayesian inference, TS in its exact form becomes computationally infeasible. ES serves as a practical approximation to TS [Lu and Van Roy, 2017], often suitable for such contexts. The key idea behind ES is to maintain an ensemble of statistically plausible models that can be updated in an efficient incremental manner, and to treat the discrete distribution represented by this ensemble of models as an approximation to the posterior distribution. The algorithm begins by sampling $M$ models from the prior distribution, where $M$ is a hyperparameter. At the start of each timestep, a model is sampled uniformly from the ensemble and an action is selected to optimize the immediate expected reward with respect to the sampled model. Then, each model is updated incrementally based on the observed reward and a random perturbation. Algorithm 1 presents the associated pseudo-code for ES.

**Algorithm 1** Ensemble Sampling
1: Input: $M$ and $P(\theta \in \cdot)$
2: Sample: $\theta_{0,1}, \dots, \theta_{0,M} \sim P(\theta \in \cdot)$
3: for $t = 0, 1, \dots$ do
4: Sample: $m_t \sim \mathrm{unif}\{1, \dots, M\}$
5: Execute: $A_t \sim \mathrm{unif}\, \arg\max_{a \in \mathcal{A}} a^\top \theta_{t,m_t}$
6: Observe: $R_{t+1,A_t}$
7: Update: $\theta_{t,m} \to \theta_{t+1,m}$, $\forall m \in [M]$
8: end for

For the Gaussian prior $N(\mu_0, \Sigma_0)$, by conjugacy, the posterior distribution of $\theta$ is also Gaussian, with covariance matrix and mean vector updated as follows:
$$\Sigma_{t+1} = \left(\Sigma_t^{-1} + \frac{1}{\sigma^2} A_t A_t^\top\right)^{-1} \quad\text{and}\quad \mu_{t+1} = \Sigma_{t+1}\left(\Sigma_t^{-1} \mu_t + \frac{1}{\sigma^2} A_t R_{t+1,A_t}\right). \tag{1}$$
While this highlights conjugacy properties of a sort that allow for efficient Bayesian inference, and thus tractable implementation of exact TS, the linear bandit with a Gaussian prior distribution and likelihood function serves as a useful context for studying the behavior of ES in relation to TS. Like Lu and Van Roy [2017], we consider a version of ES that updates the $m$-th model according to
$$\theta_{t+1,m} = \Sigma_{t+1}\left(\Sigma_t^{-1} \theta_{t,m} + \frac{1}{\sigma^2} A_t \left(R_{t+1,A_t} + \tilde{W}_{t+1,m}\right)\right), \tag{2}$$
where $\tilde{W}_{t+1} = (\tilde{W}_{t+1,m})_{m \in [M]}$ is an $M$-dimensional vector with each element distributed as $N(0, \sigma^2)$ and $[M] = \{1, \dots, M\}$.

## 4 A regret bound for ensemble sampling

In this section, we establish the following regret bound for ES (Algorithm 1) when it is applied to the linear bandits described in Section 3.1.

**Theorem 1.** Under ensemble sampling,
$$\mathrm{Regret}(T) \;\le\; \underbrace{\sqrt{\bar\lambda\, d\, T\, H(A^*)}}_{(a)} \;+\; \underbrace{\bar\nu\, T \sqrt{\frac{K \log(6TM)}{M}}}_{(b)},$$
where $H(A^*)$ is the Shannon entropy of the optimal action $A^*$ under the prior, and $\bar\lambda$ and $\bar\nu$ are constants defined in Equation (3) in terms of quantities such as $\max_{a \in \mathcal{A}} a^\top \Sigma_0 a + \sigma^2$.

Note that $\bar\lambda = O(1)$ and Lemma 7 in Appendix A shows $\bar\nu = O\big(\sqrt{\min\{d, \log K\}}\big)$, so
$$\mathrm{Regret}(T) \;\le\; O\!\left(\sqrt{d\, T\, H(A^*)} \;+\; T \sqrt{\frac{\min\{d, \log K\}\, K \log(6TM)}{M}}\right).$$

Theorem 1 represents the first rigorously proved regret bound that ensures some degree of robustness in the application of ES to a nontrivial bandit problem. One limitation is that the second term (b) of our regret bound depends on the action set size $K$, since our current analysis uses the method of (finite) types [Cover and Thomas, 2006], which might not be tight for linear bandits. We conjecture that the second term (b) can be improved to depend only on the dimension $d$. Another limitation is that the result applies only to the linear bandit with Gaussian prior distribution and likelihood function, which admits tractable application of exact TS.
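For concreteness, the setting and algorithm above (Gaussian prior and likelihood, Algorithm 1 with the incremental update (2)) can be simulated in a few lines. The sketch below is our own illustration under those Gaussian assumptions, not the authors' code; the toy problem sizes and variable names are ours, and averaging the printed quantity over many draws of $\theta$ would estimate the Bayesian regret that Theorem 1 bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance E = (A, mu0, Sigma0, sigma2) from Section 3.1 (sizes are illustrative).
d, K, M, T = 5, 20, 30, 500
sigma2 = 1.0
A = rng.normal(size=(K, d))                    # rows are the K actions a in A
mu0, Sigma0 = np.zeros(d), np.eye(d)
theta = rng.multivariate_normal(mu0, Sigma0)   # unknown coefficient vector, theta ~ N(mu0, Sigma0)

# Algorithm 1: ensemble of M models, each initially sampled from the prior.
models = rng.multivariate_normal(mu0, Sigma0, size=M)   # theta_{0,1}, ..., theta_{0,M}
prec = np.linalg.inv(Sigma0)                   # Sigma_t^{-1}, shared by all models
mean_rewards = A @ theta
regret = 0.0

for t in range(T):
    m = rng.integers(M)                        # sample a model uniformly (line 4)
    a_idx = int(np.argmax(A @ models[m]))      # greedy action for the sampled model (line 5);
                                               # ties occur with probability zero here
    a = A[a_idx]
    reward = a @ theta + rng.normal(0.0, np.sqrt(sigma2))   # bandit feedback R_{t+1, A_t}
    regret += mean_rewards.max() - mean_rewards[a_idx]

    # Incremental update (2): refit every model to the observed reward plus an
    # independent Gaussian perturbation W~_{t+1,m} ~ N(0, sigma2).
    new_prec = prec + np.outer(a, a) / sigma2              # Sigma_{t+1}^{-1}
    Sigma_next = np.linalg.inv(new_prec)
    w_tilde = rng.normal(0.0, np.sqrt(sigma2), size=M)
    models = (Sigma_next @ (prec @ models.T + np.outer(a, reward + w_tilde) / sigma2)).T
    prec = new_prec

print(f"pseudo-regret over T={T} steps with M={M} models: {regret:.1f}")
```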
Nevertheless, Theorem 1 represents a beginning of a theory of ES, suggesting that the approach is sound and that the analysis might be extended to broader problem classes.

**Comparison with the regret bound for TS.** The regret bound in Theorem 1 consists of two terms. The first term (a) is exactly the regret bound achieved by TS with exact posterior [Russo and Van Roy, 2016]. Since the action set has size $K$, the entropy of the optimal action satisfies $H(A^*) \le \log K$, and, as discussed in Russo and Van Roy [2016], when the prior is informative, $H(A^*) \ll \log K$. Therefore, the first term (a) is order optimal and improves upon the worst-case regret bounds achieved by other algorithms, e.g., upper-confidence-bound type algorithms. On the other hand, the second term (b) is an additional term that accounts for posterior mismatch. Note that as the ensemble size $M$ approaches infinity, the second term (b) converges to zero and our regret bound for ES reduces to that for TS with exact posterior. Moreover, as long as the ensemble size $M$ is of order $KT/d$, our regret bound for ES has the same order as that for TS in terms of $T$ and $d$ (up to logarithmic factors).

**Comparison with the regret bound in Lu and Van Roy [2017].** Theorem 1 presents a Bayesian regret bound under the prior $N(\mu_0, \Sigma_0)$ over $\theta$, while Lu and Van Roy [2017] study ES in the frequentist setting where the unknown $\theta$ is fixed. They try to upper bound the frequentist regret of ES by that of TS plus a cumulative approximation error due to posterior mismatch, similar to the second term (b) in our Bayesian regret bound. Although there are technical issues in their analysis, as discussed in Section 2, we compare the two regret bounds. The frequentist regret bound in Theorem 3 of Lu and Van Roy [2017] suggests that for any $\epsilon > 0$, when the ensemble size $M = \tilde\Omega(K/\epsilon^2)$ ($\tilde\Omega$ hides logarithmic factors), their cumulative approximation error is $\epsilon\, \Delta(\theta)\, T$, where $\Delta(\theta) = \max_{a \in \mathcal{A}} a^\top \theta - \min_{a \in \mathcal{A}} a^\top \theta$. To make the second term (b) in our regret bound scale as $\tilde O(\epsilon T)$ ($\tilde O$ hides logarithmic factors), we need the ensemble size $M$ to be $\tilde\Omega(K/\epsilon^2)$ as well. We conjecture that the required ensemble size can be improved to depend only on the dimension $d$.

## 5 Information measures

Before proceeding, we present several information measures, including the Kullback–Leibler divergence, Hellinger distance, Shannon entropy, and mutual information.

### 5.1 Kullback–Leibler divergence and Hellinger distance

For two probability distributions $P$ and $Q$, the Kullback–Leibler (KL) divergence between $P$ and $Q$ is defined as
$$d_{\mathrm{KL}}(P \,\|\, Q) \triangleq \int \log\!\left(\frac{dP}{dQ}\right) dP$$
if $P$ is absolutely continuous with respect to $Q$; otherwise, we define $d_{\mathrm{KL}}(P \,\|\, Q) = +\infty$. Note that $d_{\mathrm{KL}}(P \,\|\, Q) \neq d_{\mathrm{KL}}(Q \,\|\, P)$ in general. The Hellinger distance between $P$ and $Q$ is defined as
$$d_{\mathrm{H}}(P \,\|\, Q) \triangleq \left(\int \left(\sqrt{dP} - \sqrt{dQ}\right)^2\right)^{1/2},$$
which is symmetric in $P$ and $Q$. In the special case of discrete distributions $P = (p_1, \dots, p_n)$ and $Q = (q_1, \dots, q_n)$, the KL divergence and (squared) Hellinger distance between $P$ and $Q$ can be written as
$$d_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^n p_i \log(p_i / q_i) \quad\text{and}\quad d_{\mathrm{H}}^2(P \,\|\, Q) = \sum_{i=1}^n \left(\sqrt{p_i} - \sqrt{q_i}\right)^2.$$

### 5.2 Shannon entropy and mutual information

Consider a random variable $X$ that takes values in a countable set $\mathcal{X}$. The Shannon entropy of $X$ is
$$H(X) \triangleq -\sum_{x \in \mathcal{X}} P(X = x) \log P(X = x),$$
with the convention that $0 \log 0 = 0$. For ease of exposition, let $Y$ be another random variable that takes values in a countable set $\mathcal{Y}$. For $y \in \mathcal{Y}$ such that $P(Y = y) > 0$, the realized conditional entropy of $X$ given $Y = y$ is
$$H(X \,|\, Y = y) \triangleq -\sum_{x \in \mathcal{X}} P(X = x \,|\, Y = y) \log P(X = x \,|\, Y = y),$$
and the conditional entropy of $X$ given $Y$ is
$$H(X \,|\, Y) \triangleq \mathbb{E}_Y\left[-\sum_{x \in \mathcal{X}} P(X = x \,|\, Y) \log P(X = x \,|\, Y)\right],$$
i.e., the expectation over $Y$ of the realized conditional entropy.
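Before turning to mutual information, a quick numerical illustration of the discrete formulas in Section 5.1 may help fix ideas; the numbers below are our own toy example, not from the paper. Take $P = (1/2, 1/2)$ and $Q = (1, 0)$. Then
$$d_{\mathrm{KL}}(P \,\|\, Q) = +\infty, \qquad d_{\mathrm{KL}}(Q \,\|\, P) = \log 2 \approx 0.693, \qquad d_{\mathrm{H}}^2(P \,\|\, Q) = \left(1 - \tfrac{1}{\sqrt{2}}\right)^2 + \tfrac{1}{2} = 2 - \sqrt{2} \approx 0.586.$$
The two directions of the KL divergence can thus behave very differently, which is why the bounds below (Fact 1 and Corollary 1) keep track of the minimum of the two, whereas the Hellinger distance is symmetric and always finite.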
Note that the above definitions can be extended to general random variables. The mutual information between $X$ and $Y$ is
$$I(X; Y) \triangleq d_{\mathrm{KL}}\big(P(X \in \cdot, Y \in \cdot) \,\|\, P(X \in \cdot)\, P(Y \in \cdot)\big) = H(X) - H(X \,|\, Y),$$
and the conditional mutual information between $X$ and $Y$ given a general random variable $Z$ is
$$I(X; Y \,|\, Z) \triangleq \mathbb{E}_Z\big[d_{\mathrm{KL}}\big(P(X \in \cdot, Y \in \cdot \,|\, Z) \,\|\, P(X \in \cdot \,|\, Z)\, P(Y \in \cdot \,|\, Z)\big)\big] = H(X \,|\, Z) - H(X \,|\, Y, Z).$$

## 6 A general regret bound

In this section, we derive a general regret bound that holds for any learning algorithm (Theorem 2), based on the Hellinger distance between the (conditional) action-selection distribution specified by the algorithm and the posterior distribution of the optimal action $A^*$, and we apply it to establish the regret bound for ES in Theorem 1. We believe that this regret bound is of independent interest and might be used to analyze other bandit algorithms, such as variants of approximate TS, in the future.

We first define some shorthand notation. We use the subscript $t$ to denote conditioning on $H_t$; specifically, we define $P_t(\cdot) \triangleq P(\cdot \,|\, H_t)$ and $\mathbb{E}_t[\cdot] \triangleq \mathbb{E}[\cdot \,|\, H_t]$. In addition, we define two probability distributions $p_t$ and $p^*_t$ over the action set $\mathcal{A}$ as
$$p_t(\cdot) \triangleq P_t(A_t = \cdot) \quad\text{and}\quad p^*_t(\cdot) \triangleq P_t(A^* = \cdot).$$
Conditional on the history $H_t$, $p_t$ is the sampling distribution of the action $A_t$ and $p^*_t$ is the posterior distribution of the optimal action $A^*$. Both $p_t$ and $p^*_t$ are specific to the algorithm. Note that given the history $H_t$, $p_t$ and $p^*_t$ are the same under TS, but they are not under other algorithms. For approximate TS algorithms, if the approximation is accurate, $p_t$ is close to $p^*_t$.

Now we introduce a general regret bound that holds for any learning algorithm.

**Theorem 2.** Under any learning algorithm,
$$\mathrm{Regret}(T) \;\le\; \sqrt{\bar\lambda\, d\, T\, H(A^*)} \;+\; \bar\nu \sum_{t=0}^{T-1} \sqrt{\mathbb{E}\left[d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)\right]},$$
where $\bar\lambda$ and $\bar\nu$ are defined in Equation (3).

The regret bound in Theorem 2 holds for any learning algorithm. Note that $H(A^*)$ is the entropy of the optimal action $A^*$ and we always have $H(A^*) \le \log K$. The first term matches the regret bound for TS derived in Russo and Van Roy [2016]. In the second term, $\sum_{t=0}^{T-1} \sqrt{\mathbb{E}[d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)]}$ quantifies the cumulative distance between the sampling distribution of the action $A_t$ and the posterior distribution of the optimal action $A^*$. Under TS with exact posterior, $d_{\mathrm{H}}^2(p_t \,\|\, p^*_t) = 0$ due to the posterior matching property of TS, i.e., $p_t = p^*_t$. Moreover, for approximate TS algorithms, if the approximation is accurate, $d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)$ should be small.

**Regret bounds in terms of the KL divergence between $p_t$ and $p^*_t$.** It is sometimes more convenient to analyze the KL divergence between $p_t$ and $p^*_t$. Recall that the squared Hellinger distance is symmetric and upper bounded by the KL divergence.

**Fact 1** (Lemma 2.4 in Tsybakov [2009]). For two probability distributions $P$ and $Q$, we have
$$d_{\mathrm{H}}^2(P \,\|\, Q) \;\le\; \min\{d_{\mathrm{KL}}(P \,\|\, Q),\; d_{\mathrm{KL}}(Q \,\|\, P)\}.$$

We thus obtain a regret upper bound with the approximation error measured in either of the two forms of KL divergence between $p_t$ and $p^*_t$.

**Corollary 1.** Under any learning algorithm,
$$\mathrm{Regret}(T) \;\le\; \sqrt{\bar\lambda\, d\, T\, H(A^*)} \;+\; \bar\nu \sum_{t=0}^{T-1} \sqrt{\mathbb{E}\left[\min\{d_{\mathrm{KL}}(p_t \,\|\, p^*_t),\; d_{\mathrm{KL}}(p^*_t \,\|\, p_t)\}\right]}.$$

We will show in Section 7 that, for ensemble sampling, $d_{\mathrm{KL}}(p_t \,\|\, p^*_t)$ can be bounded in terms of the ensemble size $M$. It is noteworthy that in some other problems it is more convenient to bound the other form, $d_{\mathrm{KL}}(p^*_t \,\|\, p_t)$. One such example is the regret bound developed in Wen et al. [2022].
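Before turning to the proof of Theorem 2, the approximation error that Corollary 1 asks us to control can also be made concrete numerically. The sketch below is our own illustration (the names, sizes, and the idealization that ensemble members are drawn exactly from the posterior are ours, not the paper's): it forms the ensemble-induced action distribution, a Monte Carlo estimate of the posterior distribution of the optimal action, and the two KL directions appearing in Corollary 1. In actual ensemble sampling the models come from the perturbed update (2) and are only approximately posterior-distributed, so this is a best-case picture of the finite-$M$ error.

```python
import numpy as np

rng = np.random.default_rng(1)

def action_dist(A, thetas):
    """Empirical distribution of argmax_a a^T theta over a set of parameter samples."""
    best = np.argmax(A @ thetas.T, axis=0)               # best action index per sample
    return np.bincount(best, minlength=A.shape[0]) / len(thetas)

def kl(p, q, eps=1e-12):
    """Discrete KL divergence sum_i p_i log(p_i/q_i); zeros are smoothed, so this is
    a numerical surrogate (the exact value can be +infinity)."""
    p, q = np.maximum(p, eps), np.maximum(q, eps)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

d, K, M = 5, 20, 30
A = rng.normal(size=(K, d))                               # K actions in R^d
mu_t, Sigma_t = rng.normal(size=d), 0.5 * np.eye(d)       # stand-in posterior N(mu_t, Sigma_t)

ensemble = rng.multivariate_normal(mu_t, Sigma_t, size=M)           # M ensemble models
posterior = rng.multivariate_normal(mu_t, Sigma_t, size=100_000)    # many posterior samples

p_hat = action_dist(A, ensemble)      # ensemble sampling's action-selection distribution
p_star = action_dist(A, posterior)    # Monte Carlo estimate of p*_t, i.e. P_t(A* = .)

print("d_KL(p_hat || p_star) =", kl(p_hat, p_star))
print("d_KL(p_star || p_hat) =", kl(p_star, p_hat))
```

Increasing $M$ drives both quantities toward zero, which is the role played by the $K \log(6(t+1)M)/M$ bound of Lemma 5 below.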
### 6.1 Proof of Theorem 2

For $t = 0, 1, \dots$, we define the per-timestep main regret $G_t$ and per-timestep approximation error $D_t$:
$$G_t \triangleq \sum_{a \in \mathcal{A}} \sqrt{p_t(a)\, p^*_t(a)}\, \big(\mathbb{E}_t[R_{t+1,a} \,|\, A^* = a] - \mathbb{E}_t[R_{t+1,a}]\big),$$
$$D_t \triangleq \sum_{a \in \mathcal{A}} \left(p^*_t(a) - \sqrt{p_t(a)\, p^*_t(a)}\right) \mathbb{E}_t[R_{t+1,a} \,|\, A^* = a] + \sum_{a \in \mathcal{A}} \left(\sqrt{p_t(a)\, p^*_t(a)} - p_t(a)\right) \mathbb{E}_t[R_{t+1,a}].$$

**Lemma 1.** The cumulative regret can be written as
$$\mathrm{Regret}(T) = \sum_{t=0}^{T-1} \mathbb{E}[G_t + D_t].$$

*Proof.* By the tower property of conditional expectation, $\mathrm{Regret}(T) = \sum_{t=0}^{T-1} \mathbb{E}\big[\mathbb{E}_t[R_{t+1,A^*} - R_{t+1,A_t}]\big]$, and for each $t$,
$$\mathbb{E}_t[R_{t+1,A^*} - R_{t+1,A_t}] = \sum_{a \in \mathcal{A}} p^*_t(a)\, \mathbb{E}_t[R_{t+1,a} \,|\, A^* = a] - \sum_{a \in \mathcal{A}} p_t(a)\, \mathbb{E}_t[R_{t+1,a} \,|\, A_t = a] = \sum_{a \in \mathcal{A}} p^*_t(a)\, \mathbb{E}_t[R_{t+1,a} \,|\, A^* = a] - \sum_{a \in \mathcal{A}} p_t(a)\, \mathbb{E}_t[R_{t+1,a}] = G_t + D_t,$$
where the second equality uses that, conditional on $H_t$, $R_{t+1} = (R_{t+1,a})_{a \in \mathcal{A}}$ is independent of the action $A_t$. $\square$

Next we upper bound $\sum_{t=0}^{T-1} \mathbb{E}[G_t]$ and $\sum_{t=0}^{T-1} \mathbb{E}[D_t]$, respectively. We first analyze the per-timestep main regret $G_t$. The following lemma generalizes Proposition 5 and Corollary 1 in Russo and Van Roy [2016] to the case where $A_t$ and $A^*$ are not identically distributed given the history.

**Lemma 2.** With $\bar\lambda$ defined in Equation (3), for $t = 0, 1, \dots, T-1$,
$$G_t \le \sqrt{\bar\lambda\, d\, I_t(A^*; (A_t, R_{t+1,A_t}))},$$
where $I_t(A^*; (A_t, R_{t+1,A_t}))$ is the mutual information between the optimal action $A^*$ and the action–reward pair $(A_t, R_{t+1,A_t})$ conditioned on a given history $H_t$ at timestep $t$.

*Proof.* Let the action set be $\mathcal{A} = \{a_1, \dots, a_K\}$. Fix $t$ and define $\mathbf{M} \in \mathbb{R}^{K \times K}$ by
$$\mathbf{M}_{i,j} = \sqrt{p_t(a_i)\, p^*_t(a_j)}\, \big(\mathbb{E}_t[R_{t+1,a_i} \,|\, A^* = a_j] - \mathbb{E}_t[R_{t+1,a_i}]\big) \qquad \forall i, j \in [K].$$
Then $G_t = \mathrm{Trace}(\mathbf{M})$. By the proof of Proposition 2 in Russo and Van Roy [2016],
$$I_t(A^*; (A_t, R_{t+1,A_t})) = I_t(A^*; A_t) + I_t(A^*; R_{t+1,A_t} \,|\, A_t) = I_t(A^*; R_{t+1,A_t} \,|\, A_t) = \sum_{i \in [K]} p_t(a_i)\, I_t(A^*; R_{t+1,A_t} \,|\, A_t = a_i) = \sum_{i \in [K]} p_t(a_i)\, I_t(A^*; R_{t+1,a_i}) = \sum_{i \in [K]} p_t(a_i) \sum_{j \in [K]} p^*_t(a_j)\, d_{\mathrm{KL}}\big(P_t(R_{t+1,a_i} \in \cdot \,|\, A^* = a_j) \,\|\, P_t(R_{t+1,a_i} \in \cdot)\big) = \sum_{i,j \in [K]} p_t(a_i)\, p^*_t(a_j)\, d_{\mathrm{KL}}\big(P_t(R_{t+1,a_i} \in \cdot \,|\, A^* = a_j) \,\|\, P_t(R_{t+1,a_i} \in \cdot)\big),$$
where the first equality uses the chain rule of mutual information (Fact 2); the second and fourth equalities follow from the fact that, conditional on the history $H_t$, $A_t$ is jointly independent of $A^*$ and $R_{t+1} = (R_{t+1,a})_{a \in \mathcal{A}}$; and the fifth equality uses the KL-divergence form of mutual information (Fact 3).

By the updating rule for the covariance matrix in Equation (1), conditional on $H_t$, $\theta \sim N(\mu_t, \Sigma_t)$ with $\Sigma_t \preceq \Sigma_0$. Hence, for any $i \in [K]$, $R_{t+1,a_i} = a_i^\top \theta + W_{t+1,a_i}$ is, conditional on $H_t$, distributed as $N\big(a_i^\top \mu_t,\, a_i^\top \Sigma_t a_i + \sigma^2\big)$. This implies that $R_{t+1,a_i}$ is sub-Gaussian with variance proxy $\max_{a \in \mathcal{A}} a^\top \Sigma_0 a + \sigma^2$. Then, by Lemma 10,
$$I_t(A^*; (A_t, R_{t+1,A_t})) \ge \frac{\sum_{i,j \in [K]} p_t(a_i)\, p^*_t(a_j)\, \big(\mathbb{E}_t[R_{t+1,a_i} \,|\, A^* = a_j] - \mathbb{E}_t[R_{t+1,a_i}]\big)^2}{2\left(\max_{a \in \mathcal{A}} a^\top \Sigma_0 a + \sigma^2\right)} = \frac{\sum_{i,j \in [K]} \mathbf{M}_{i,j}^2}{2\left(\max_{a \in \mathcal{A}} a^\top \Sigma_0 a + \sigma^2\right)} = \frac{\|\mathbf{M}\|_F^2}{2\left(\max_{a \in \mathcal{A}} a^\top \Sigma_0 a + \sigma^2\right)}.$$
Then, by Fact 4,
$$G_t = \mathrm{Trace}(\mathbf{M}) \le \sqrt{\mathrm{Rank}(\mathbf{M})}\, \|\mathbf{M}\|_F \le \sqrt{2\, \mathrm{Rank}(\mathbf{M}) \left(\max_{a \in \mathcal{A}} a^\top \Sigma_0 a + \sigma^2\right) I_t(A^*; (A_t, R_{t+1,A_t}))}.$$
Now we show that $\mathrm{Rank}(\mathbf{M}) \le d$. Define $\bar\mu = \mathbb{E}_t[\theta]$ and $\bar\mu_j = \mathbb{E}_t[\theta \,|\, A^* = a_j]$ for all $j \in [K]$. Then, for any $i, j \in [K]$,
$$\mathbf{M}_{i,j} = \sqrt{p_t(a_i)\, p^*_t(a_j)}\, \big(a_i^\top \bar\mu_j - a_i^\top \bar\mu\big) = \left(\sqrt{p_t(a_i)}\, a_i\right)^\top \left(\sqrt{p^*_t(a_j)}\, (\bar\mu_j - \bar\mu)\right),$$
so $\mathbf{M} = \mathbf{B}\mathbf{C}$, where $\mathbf{B} \in \mathbb{R}^{K \times d}$ has rows $\sqrt{p_t(a_i)}\, a_i^\top$ and $\mathbf{C} \in \mathbb{R}^{d \times K}$ has columns $\sqrt{p^*_t(a_1)}\, (\bar\mu_1 - \bar\mu), \dots, \sqrt{p^*_t(a_K)}\, (\bar\mu_K - \bar\mu)$. Since $\mathbf{M}$ is the product of a $K \times d$ matrix and a $d \times K$ matrix, $\mathrm{Rank}(\mathbf{M}) \le d$. $\square$

Now we are ready to bound $\sum_{t=0}^{T-1} \mathbb{E}[G_t]$ and derive the first term of the regret bound in Theorem 2, following the analysis of Proposition 1 in Russo and Van Roy [2016].

**Lemma 3.** With $\bar\lambda$ defined in Equation (3), the following inequality holds:
$$\sum_{t=0}^{T-1} \mathbb{E}[G_t] \le \sqrt{\bar\lambda\, d\, T\, H(A^*)}.$$

*Proof.* By Lemma 2,
$$\sum_{t=0}^{T-1} \mathbb{E}[G_t] \le \sum_{t=0}^{T-1} \mathbb{E}\left[\sqrt{\bar\lambda\, d\, I_t(A^*; (A_t, R_{t+1,A_t}))}\right] \le \sqrt{\bar\lambda\, d\, T \sum_{t=0}^{T-1} \mathbb{E}\left[I_t(A^*; (A_t, R_{t+1,A_t}))\right]} = \sqrt{\bar\lambda\, d\, T \sum_{t=0}^{T-1} I\big(A^*; (A_t, R_{t+1,A_t}) \,\big|\, H_t\big)} = \sqrt{\bar\lambda\, d\, T\, \big(H(A^*) - H(A^* \,|\, H_T)\big)} \le \sqrt{\bar\lambda\, d\, T\, H(A^*)},$$
where the second inequality applies the Cauchy–Schwarz inequality, and the second-to-last equality uses the chain rule for mutual information (Fact 2) and the definition of mutual information. $\square$

The next lemma controls the expected per-timestep approximation error $\mathbb{E}[D_t]$ under the (sub-)Gaussian prior distribution and noise. Summing over timesteps then completes the proof of Theorem 2.

**Lemma 4.** With $\bar\nu$ defined in Equation (3), for $t = 0, 1, \dots, T-1$,
$$\mathbb{E}[D_t] \le \bar\nu\, \sqrt{\mathbb{E}\left[d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)\right]}.$$

*Proof.* Fix $t$. We write
$$D_t = \sum_{a \in \mathcal{A}} \sqrt{p^*_t(a)}\left(\sqrt{p^*_t(a)} - \sqrt{p_t(a)}\right) \mathbb{E}_t[R_{t+1,a} \,|\, A^* = a] + \sum_{a \in \mathcal{A}} \sqrt{p_t(a)}\left(\sqrt{p^*_t(a)} - \sqrt{p_t(a)}\right) \mathbb{E}_t[R_{t+1,a}]$$
$$\le \sqrt{\sum_{a \in \mathcal{A}} \left(\sqrt{p^*_t(a)} - \sqrt{p_t(a)}\right)^2}\left(\sqrt{\sum_{a \in \mathcal{A}} p^*_t(a)\, \big(\mathbb{E}_t[R_{t+1,a} \,|\, A^* = a]\big)^2} + \sqrt{\sum_{a \in \mathcal{A}} p_t(a)\, \big(\mathbb{E}_t[R_{t+1,a}]\big)^2}\right)$$
$$\le d_{\mathrm{H}}(p_t \,\|\, p^*_t)\, \sqrt{2 \sum_{a \in \mathcal{A}} p^*_t(a)\, \mathbb{E}_t\big[R_{t+1,a}^2 \,\big|\, A^* = a\big] + 2 \sum_{a \in \mathcal{A}} p_t(a)\, \mathbb{E}_t\big[R_{t+1,a}^2\big]} = d_{\mathrm{H}}(p_t \,\|\, p^*_t)\, \sqrt{2\, \mathbb{E}_t\big[R_{t+1,A^*}^2\big] + 2\, \mathbb{E}_t\big[R_{t+1,A_t}^2\big]},$$
where the first inequality applies the Cauchy–Schwarz inequality to each sum; the second inequality uses the definition of the Hellinger distance, Jensen's inequality, and $(x+y)^2 \le 2x^2 + 2y^2$; and the last equality holds since, conditional on $H_t$, $A_t$ is independent of $R_{t+1} = (R_{t+1,a})_{a \in \mathcal{A}}$.
Taking the expectation of both sides, we have
$$\mathbb{E}[D_t] \le \sqrt{\mathbb{E}\left[d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)\right]}\, \sqrt{2\, \mathbb{E}\big[\big(A^{*\top}\theta + W_{t+1,A^*}\big)^2\big] + 2\, \mathbb{E}\big[\big(A_t^\top\theta + W_{t+1,A_t}\big)^2\big]} \le \sqrt{\mathbb{E}\left[d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)\right]}\, \sqrt{2\, \mathbb{E}\big[(A^{*\top}\theta)^2 + \sigma^2\big] + 2\, \mathbb{E}\big[(A_t^\top\theta)^2 + \sigma^2\big]} \le \sqrt{\mathbb{E}\left[d_{\mathrm{H}}^2(p_t \,\|\, p^*_t)\right]}\, \sqrt{4\, \mathbb{E}\left[\max_{a \in \mathcal{A}} (a^\top\theta)^2\right] + 4\sigma^2},$$
where the first inequality uses the Cauchy–Schwarz inequality and the tower property of conditional expectation, and the second-to-last inequality holds because the noise terms are independent, mean-zero, (sub-)Gaussian with variance bounded by $\sigma^2$. With $\bar\nu$ as defined in Equation (3), this yields the bound in Lemma 4; summing over timesteps and combining with Lemmas 1 and 3 completes the proof of Theorem 2. $\square$

## 7 Completion of the proof of Theorem 1

In this section, we complete the proof of Theorem 1 by controlling, for ensemble sampling, the cumulative approximation error $\sum_{t=0}^{T-1} \sqrt{\mathbb{E}[d_{\mathrm{KL}}(p_t \,\|\, p^*_t)]}$ appearing in Corollary 1, where $p_t$ is the sampling distribution of the action $A_t$ and $p^*_t$ is the posterior distribution of the optimal action $A^*$. It suffices to show Lemma 5 below for the per-timestep approximation error $\mathbb{E}[d_{\mathrm{KL}}(p_t \,\|\, p^*_t)]$, since summing over timesteps yields the second term of the regret bound for ensemble sampling in Theorem 1.

**Lemma 5.** Under ensemble sampling, for $t = 0, 1, \dots, T-1$,
$$\mathbb{E}\left[d_{\mathrm{KL}}(p_t \,\|\, p^*_t)\right] \le \frac{K \log(6(t+1)M)}{M}.$$

*Proof.* We first discuss the behavior of ensemble sampling. Ensemble sampling uniformly samples a model $m_t \in \{1, \dots, M\}$ and then chooses the action $A_t$ uniformly from the set $\mathcal{A}_{t,m_t} \triangleq \arg\max_{a \in \mathcal{A}} a^\top \theta_{t,m_t}$. We define the following approximation of $p_t(a)$:
$$\hat p_t(a) \triangleq \frac{1}{M} \sum_{m=1}^{M} \frac{\mathbb{1}\{a \in \mathcal{A}_{t,m}\}}{|\mathcal{A}_{t,m}|},$$
where $|\mathcal{A}_{t,m}|$ is the cardinality of the set $\mathcal{A}_{t,m}$ (with probability one, $|\mathcal{A}_{t,m}| = 1$). At timestep $t$, ES samples an action from the approximation $\hat p_t$. Notice that the history $H_t$ does not include the independent random perturbations $\tilde W_s = (\tilde W_{s,m})_{m \in [M]} \sim N(0, \sigma^2 I)$, $s \le t$, which are used in updating the $M$ models in Equation (2). Therefore, given the history $H_t$, $\hat p_t(a)$ is still a random variable, and its expectation is the conditional probability of sampling action $a$, i.e., $\mathbb{E}_t[\hat p_t(a)] = p_t(a)$. By the convexity of the KL divergence, the per-timestep approximation error satisfies
$$d_{\mathrm{KL}}(p_t \,\|\, p^*_t) = d_{\mathrm{KL}}\big(\mathbb{E}_t[\hat p_t] \,\|\, p^*_t\big) \le \mathbb{E}_t\left[d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t)\right].$$
Taking the expectation of both sides over $H_t$ gives $\mathbb{E}[d_{\mathrm{KL}}(p_t \,\|\, p^*_t)] \le \mathbb{E}[d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t)]$.

Next we upper bound $\mathbb{E}[d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t)]$ by first writing it in terms of the tail probabilities:
$$\mathbb{E}\left[d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t)\right] = \int_0^\infty P\big(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon\big)\, d\epsilon. \tag{4}$$
Then we use the following result to upper bound $P(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon)$ in the above integral.

**Lemma 6** (Lemma 4 in Lu and Van Roy [2017]). Let $\epsilon > 0$. For $t = 0, 1, \dots, T-1$,
$$P\big(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon \,\big|\, \theta\big) \le (t+1)^K (M+1)^K e^{-M\epsilon}.$$

Taking the expectation of both sides over $\theta$ gives the same upper bound on $P(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon)$, and thus, for any $\delta \ge 0$, the integral in Equation (4) can be decomposed and bounded as follows:
$$\int_0^\infty P\big(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon\big)\, d\epsilon = \int_0^\delta P\big(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon\big)\, d\epsilon + \int_\delta^\infty P\big(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon\big)\, d\epsilon \le \delta + (t+1)^K (M+1)^K \int_\delta^\infty e^{-M\epsilon}\, d\epsilon = \delta + \frac{(t+1)^K (M+1)^K e^{-M\delta}}{M}.$$
Setting the derivative of the right-hand side with respect to $\delta$ to zero gives the optimal $\delta = \frac{K[\log(t+1) + \log(M+1)]}{M}$, and plugging in this $\delta$ yields
$$\int_0^\infty P\big(d_{\mathrm{KL}}(\hat p_t \,\|\, p^*_t) > \epsilon\big)\, d\epsilon \le \frac{K[\log(t+1) + \log(M+1)] + 1}{M} \le \frac{K \log(6(t+1)M)}{M}.$$
This completes the proof of Lemma 5, and summing over timesteps gives Theorem 1. $\square$

## 8 Concluding remarks

We present the first rigorous analysis of ensemble sampling, an approximate Thompson sampling method. ES maintains an ensemble of statistically plausible models that can be updated in an efficient incremental manner. We develop a regret bound for ES applied to linear bandits in Theorem 1, which approaches the regret bound for TS as the ensemble size increases. The linear bandit problem serves only as a sanity check, and, as discussed in Section 2, ES has been developed for broader and more challenging bandit and reinforcement learning problems.
It is noteworthy that in many practical problems, ES with a moderate ensemble size (e.g., 30 models) outperforms other state-of-the-art benchmarks [Lu and Van Roy, 2017, Lu et al., 2018, Osband et al., 2019]. As a stepping stone, we also establish a general regret bound in Theorem 2 that applies to any learning algorithm for linear bandits. This general regret bound may be of independent interest, since it is particularly useful for analyzing other varieties of approximate TS: one only needs to bound the (cumulative) approximation error, measured by the Hellinger distance between the action-selection distribution specified by the algorithm and the posterior distribution of the optimal action. Finally, Theorem 2 is established by bounding the information ratio for linear bandits, and we believe it could be generalized to other bandit and online learning problems as long as the corresponding information ratios can be appropriately bounded. Such problems include generalized linear bandits, combinatorial semi-bandits, online learning with full information feedback, and problems with pure exploration objectives. We also conjecture that the results of this paper can be extended to episodic reinforcement learning, but we leave these extensions to future work.

## References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, 2011.

Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade. Anti-concentrated confidence bonuses for scalable exploration. arXiv preprint arXiv:2110.11202, 2021.

George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24, 2011.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2006.

Maria Dimakopoulou and Benjamin Van Roy. Coordinated exploration in concurrent reinforcement learning. In International Conference on Machine Learning, volume 80, pages 1271–1279, 2018.

Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, and Benjamin Van Roy. Hypermodels for exploration. In International Conference on Learning Representations, 2020.

Dean Eckles and Maurits Kaptein. Bootstrap Thompson sampling and sequential decision problems in the behavioral sciences. SAGE Open, 9(2), 2019.

Botao Hao, Jie Zhou, Zheng Wen, and Will Wei Sun. Low-rank tensor bandits. arXiv preprint arXiv:2007.15788, 2020.

Jean Honorio and Tommi Jaakkola. Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In International Conference on Artificial Intelligence and Statistics, volume 33, pages 384–392, 2014.

Ziyi Huang, Henry Lam, Amirhossein Meisami, and Haofeng Zhang. Generalized Bayesian upper confidence bound with approximate inference for bandit problems. arXiv preprint arXiv:2201.12955, 2022.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018.

Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pages 3601–3610, 2019.
Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, volume 115, pages 530–540, 2020.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.

Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, volume 30, 2017.

Xiuyuan Lu, Zheng Wen, and Branislav Kveton. Efficient online recommendation via low-rank ensemble sampling. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 460–464, 2018.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, volume 29, 2016.

Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 31, 2018.

Ian Osband, Benjamin Van Roy, Daniel J. Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1–62, 2019.

Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Dieterich Lawson, Botao Hao, Brendan O'Donoghue, and Benjamin Van Roy. The neural testbed: Evaluating joint predictions. In Advances in Neural Information Processing Systems, 2022.

My Phan, Yasin Abbasi-Yadkori, and Justin Domke. Thompson sampling and approximate inference. In Advances in Neural Information Processing Systems, volume 32, 2019.

Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1–30, 2016.

Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

Iñigo Urteaga and Chris Wiggins. Variational inference for the multi-armed contextual bandit. In International Conference on Artificial Intelligence and Statistics, volume 84, pages 698–706, 2018.

Zheng Wen, Ian Osband, Chao Qin, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Mohammad Asghari, and Benjamin Van Roy. From predictions to decisions: The importance of joint predictive distributions. arXiv preprint arXiv:2107.09224, 2022.

Jeremy Yang, Dean Eckles, Paramveer Dhillon, and Sinan Aral. Targeting for long-term outcomes. arXiv preprint arXiv:2110.15835, 2020.

Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, and Ole J. Mengshoel. Graphical models meet bandits: A variational Thompson sampling approach. In International Conference on Machine Learning, volume 119, pages 10902–10912, 2020.

Ruiyi Zhang, Zheng Wen, Changyou Chen, Chen Fang, Tong Yu, and Lawrence Carin. Scalable Thompson sampling via optimal transport. In International Conference on Artificial Intelligence and Statistics, volume 89, pages 87–96, 2019.
Zheqing Zhu and Benjamin Van Roy. Deep exploration for recommendation systems. arXiv preprint arXiv:2109.12509, 2021.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] In this paper, we establish a regret bound for ES in linear bandits (Theorem 1), which is based on a novel general regret bound (Theorem 2). These contributions have been accurately reflected in the abstract and introduction.
   (b) Did you describe the limitations of your work? [Yes] We discuss the limitations of this work and future directions in Sections 4 and 8.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] This is a theoretical paper. We do not see any potential negative societal impacts.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] We clearly state the full set of assumptions.
   (b) Did you include complete proofs of all theoretical results? [Yes] We indeed include (almost) all the proofs in the main body.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] This paper does not include any experimental results.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]