# Meta-Thompson Sampling

Branislav Kveton 1, Mikhail Konobeev 2, Manzil Zaheer 1, Chih-wei Hsu 1, Martin Mladenov 1, Craig Boutilier 1, Csaba Szepesvári 3,2

1 Google Research, 2 University of Alberta, 3 DeepMind. Correspondence to: Branislav Kveton.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. Efficient exploration in bandits is a fundamental online learning problem. We propose a variant of Thompson sampling that learns to explore better as it interacts with bandit instances drawn from an unknown prior. The algorithm meta-learns the prior, and thus we call it MetaTS. We propose several efficient implementations of MetaTS and analyze it in Gaussian bandits. Our analysis shows the benefit of meta-learning and is of broader interest, because we derive a novel prior-dependent Bayes regret bound for Thompson sampling. Our theory is complemented by an empirical evaluation, which shows that MetaTS quickly adapts to the unknown prior.

## 1. Introduction

A stochastic bandit (Lai & Robbins, 1985; Auer et al., 2002; Lattimore & Szepesvari, 2019) is an online learning problem where a learning agent sequentially interacts with an environment over $n$ rounds. In each round, the agent pulls an arm and then receives the arm's stochastic reward. The agent aims to maximize its expected cumulative reward over $n$ rounds. It does not know the mean rewards of the arms a priori, so it must learn them by pulling the arms. This induces the well-known exploration-exploitation trade-off: explore, and learn more about an arm; or exploit, and pull the arm with the highest estimated reward. In a clinical trial, for example, the arm might be a treatment and its reward the outcome of that treatment for a patient.

Bandit algorithms are typically designed to have low regret for some problem class of interest to the algorithm designer (Lattimore & Szepesvari, 2019). In practice, however, the problem class may not be perfectly specified at design time. For instance, consider applying Thompson sampling (TS) (Thompson, 1933; Chapelle & Li, 2012; Agrawal & Goyal, 2012; Russo et al., 2018) to a 2-armed bandit in which the prior distribution over mean arm rewards, a vital part of TS, is unknown. While the prior is unknown, the designer may know that it is one of two possible priors, under which either arm 1 or arm 2 is optimal with high probability. If the agent could learn which of the two priors has been realized, for instance by interacting repeatedly with bandit instances drawn from that prior, it could adapt its exploration strategy to the realized prior, and thereby incur much lower regret than would be possible without this adaptation.

We formalize this learning problem as follows. A learning agent sequentially interacts with $m$ bandit instances. Each interaction lasts $n$ rounds and we refer to it as a task. The instances share a common structure: their mean arm rewards are drawn from an unknown instance prior $P_*$. While $P_*$ is not known, we assume that it is sampled from a meta-prior $Q$, which the agent knows with certainty. The goal of the agent is to minimize its regret in the sampled instances, almost as well as if it knew $P_*$. This is achieved by adapting to $P_*$ through interactions with the instances. This is a form of meta-learning (Thrun, 1996; 1998; Baxter, 1998; 2000), where the agent learns to act from interactions with bandit instances. We make the following contributions.
First, we formalize the problem of Bayes regret minimization where the prior $P_*$ is unknown and is learned through interactions with bandit instances sampled from it over $m$ tasks. Second, we propose MetaTS, a meta-Thompson sampling algorithm that solves this problem. MetaTS maintains a distribution over the unknown $P_*$ in each task, which we call a meta-posterior $Q_s$, and acts optimistically with respect to it. More specifically, in task $s$, it samples an estimate of $P_*$ as $P_s \sim Q_s$ and then runs TS with prior $P_s$ for $n$ rounds. We show how to implement MetaTS efficiently in Bernoulli and Gaussian bandits. In addition, we bound its Bayes regret in Gaussian bandits. Our analysis is conservative because it relies only on a single pull of each arm in each task. Nevertheless, it yields an improved regret bound due to adapting to $P_*$. The analysis is of broader interest, as we derive a novel prior-dependent upper bound on the Bayes regret of TS. Our theoretical results are complemented by synthetic experiments, which show that MetaTS adapts quickly to the unknown prior $P_*$ and that its regret is comparable to that of TS with a known $P_*$.

## 2. Setting

We start with introducing our notation. The set $\{1, \dots, n\}$ is denoted by $[n]$. The indicator $\mathbb{1}\{E\}$ denotes that event $E$ occurs. The $i$-th entry of vector $v$ is denoted by $v_i$; sometimes we write $v(i)$ to avoid clutter. A diagonal matrix with entries $v$ is denoted by $\mathrm{diag}(v)$. We write $\tilde{O}$ for the big-O notation up to polylogarithmic factors.

Our setting is defined as follows. We have $K$ arms, and a bandit problem instance is a vector of arm means $\theta \in \mathbb{R}^K$. The agent sequentially interacts with $m$ bandit instances, which we index by $s \in [m]$. We refer to each interaction as a task. At the beginning of task $s \in [m]$, an instance $\theta_{s,*}$ is sampled i.i.d. from an instance prior distribution $P_*$. The agent interacts with $\theta_{s,*}$ for $n$ rounds. In round $t \in [n]$, it pulls one arm and observes a stochastic realization of its reward. We denote the pulled arm in round $t$ of task $s$ by $A_{s,t} \in [K]$, the stochastic rewards of all arms in round $t$ of task $s$ by $Y_{s,t} \in \mathbb{R}^K$, and the reward of arm $i \in [K]$ by $Y_{s,t}(i)$. The result of the interactions in task $s$ is the history

$$H_s = (A_{s,1}, Y_{s,1}(A_{s,1}), \dots, A_{s,n}, Y_{s,n}(A_{s,n}))\,.$$

We denote by $H_{1:s}$ the concatenated vector of the histories in tasks $1$ to $s$. We assume that the realized rewards $Y_{s,t}$ are i.i.d. with respect to both $s$ and $t$, and that their means are $\mathbb{E}[Y_{s,t} \mid \theta_{s,*} = \theta] = \theta$. For now, we need not assume that the reward noise is sub-Gaussian; but we do adopt this in our analysis (Section 4).

The $n$-round Bayes regret of a learning agent or algorithm over $m$ tasks with instance prior $P_*$ is

$$\mathcal{R}(m, n; P_*) = \mathbb{E}\left[\sum_{s=1}^{m}\sum_{t=1}^{n} \theta_{s,*}(A_{s,*}) - \theta_{s,*}(A_{s,t})\right],$$

where $A_{s,*} = \arg\max_{i \in [K]} \theta_{s,*}(i)$ is the optimal arm in the problem instance $\theta_{s,*}$ in task $s \in [m]$. The above expectation is over problem instances $\theta_{s,*}$ sampled from $P_*$, their realized rewards, and the pulled arms. We note that $\mathcal{R}(1, n; P_*)$ is the standard definition of the $n$-round Bayes regret in a $K$-armed bandit (Russo & Van Roy, 2014), and that it is $\tilde{O}(\sqrt{Kn})$ for Thompson sampling with prior $P_*$. Since all bandit instances $\theta_{s,*}$ are drawn i.i.d. from the same $P_*$, they provide no additional information about each other. Thus, the regret of TS with prior $P_*$ in $m$ such instances is $\tilde{O}(m\sqrt{Kn})$. We validate this dependence empirically in Section 5.

Note that Thompson sampling requires $P_*$ as an input. In this work, we try to attain the same regret without assuming that $P_*$ is known.
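To make the regret definition above concrete, it can be estimated by simulation: sample $P_*$ once, then repeatedly sample tasks from it and accumulate the per-round gaps against the optimal arm. The sketch below is our minimal illustration, not part of the paper: `sample_instance_prior` and `run_agent` are hypothetical helpers, and the Gaussian instantiation with $\sigma_0 = 0.1$ is assumed purely for concreteness (it matches the setup used later in Section 5).

```python
import numpy as np

def estimate_bayes_regret(run_agent, sample_instance_prior, m, n, K,
                          sigma=1.0, sigma0=0.1, num_sims=100, seed=0):
    """Monte Carlo estimate of the m-task, n-round Bayes regret R(m, n; P_*).

    sample_instance_prior(rng) -> mean vector mu of P_* (hypothetical helper)
    run_agent(rewards) -> length-n sequence of pulled arms for one task (hypothetical helper)
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_sims):
        mu_star = sample_instance_prior(rng)              # P_* is fixed across the m tasks
        for _ in range(m):
            theta = rng.normal(mu_star, sigma0)           # theta_{s,*} ~ P_*
            rewards = rng.normal(theta[:, None], sigma, size=(K, n))  # Y_{s,t}(i)
            arms = np.asarray(run_agent(rewards))         # a real agent only sees its own pulls
            total += n * theta.max() - theta[arms].sum()  # per-task regret against A_{s,*}
    return total / num_sims
```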
We formalize this problem in a Bayesian fashion. In particular, we assume the availability of a prior distribution $Q$ over problem instance priors, and that $P_* \sim Q$. We refer to $Q$ as a meta-prior, since it is a prior over priors. In Bayesian statistics, this would also be known as a hyper-prior (Gelman et al., 2013). The agent knows $Q$ but not $P_*$. We try to learn $P_*$ from sequential interactions with instances $\theta_{s,*}$, which are drawn i.i.d. from $P_*$ in each task. Note that $\theta_{s,*}$ is also unknown; the agent only observes its noisy realizations $Y_{s,t}$. We visualize the relations of $Q$, $P_*$, $\theta_{s,*}$, and $Y_{s,t}$ in a graphical model in Figure 1.

Figure 1. Graphical model of our bandit environment.

One motivating example for using a meta-prior for exploration arises in recommender systems, in which exploration is used to assess the latent interests of users for different items, such as movies. In this case, each user can be treated as a bandit instance where the items are arms. A standard prior over user latent interests could readily be used by TS (Hong et al., 2020). However, in many cases, the algorithm designer may be uncertain about its true form. For instance, the designer may believe that most users have a strong but noisy affinity for items in exactly one of several classes, but it is unclear which. Our work can be viewed as formalizing the problem of learning such a prior over user interests, which could be used to start exploring the preferences of cold-start users.

## 3. Meta-Thompson Sampling

In this section, we present our approach to meta-learning in TS. We provide a general description in Section 3.1. In Sections 3.2 and 3.3, we implement it in Bernoulli bandits with a categorical meta-prior and Gaussian bandits with a Gaussian meta-prior, respectively. In Section 3.4, we justify our approach beyond these specific instances.

### 3.1. Algorithm MetaTS

Thompson sampling (Thompson, 1933; Chapelle & Li, 2012; Agrawal & Goyal, 2012; Russo et al., 2018) is arguably the most popular and practical bandit algorithm. TS is parameterized by a prior, which is specified by the algorithm designer. In this work, we study a more general setting where the designer can model uncertainty over an unknown prior $P_*$ using a meta-prior $Q$. Our proposed algorithm meta-learns $P_*$ from sequential interactions with bandit instances drawn i.i.d. from $P_*$. Therefore, we call it meta-Thompson sampling (MetaTS). In this subsection, we present MetaTS under the assumption that sample spaces are discrete. This eases exposition and guarantees that all conditional expectations are properly defined. We treat this topic more rigorously in Section 3.4.

MetaTS is a variant of TS that models uncertainty over the instance prior distribution $P_*$. This uncertainty is captured by a meta-posterior, a distribution over possible instance priors. We denote the meta-posterior in task $s$ by $Q_s$ and assume that each $Q_s$ belongs to the same family as $Q$. By definition, $Q_1 = Q$ is the meta-prior. MetaTS samples the instance prior distribution $P_s$ in task $s$ from $Q_s$. Then it applies TS with the sampled prior $P_s$ to the bandit instance $\theta_{s,*}$ in task $s$ for $n$ rounds.

Algorithm 1 (MetaTS: meta-learning Thompson sampling)
1. Input: meta-prior $Q$; set $Q_1 \leftarrow Q$.
2. For $s = 1, \dots, m$:
   1. Sample $P_s \sim Q_s$.
   2. Apply Thompson sampling with prior $P_s$ to problem instance $\theta_{s,*} \sim P_*$ for $n$ rounds.
   3. Update the meta-posterior $Q_{s+1}$, as defined in (1).
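Algorithm 1 is short enough to sketch directly. The skeleton below is ours, not the authors' implementation: `sample_prior`, `run_thompson_sampling`, and `update_meta_posterior` are passed in as placeholders for the three steps inside the loop, and their concrete conjugate forms are derived in Sections 3.2 and 3.3.

```python
def meta_ts(meta_prior, bandit_tasks, n,
            sample_prior, run_thompson_sampling, update_meta_posterior):
    """Skeleton of Algorithm 1 (MetaTS).

    meta_prior            -- parameters of Q (e.g., categorical weights or Gaussian moments)
    bandit_tasks          -- iterable over the m bandit instances theta_{s,*}
    sample_prior          -- draws P_s ~ Q_s from the current meta-posterior
    run_thompson_sampling -- runs TS with prior P_s on one instance for n rounds, returns H_s
    update_meta_posterior -- computes Q_{s+1} from Q_s and H_s, as in (1)
    """
    meta_posterior = meta_prior                               # Q_1 = Q
    histories = []
    for task in bandit_tasks:                                 # s = 1, ..., m
        prior_sample = sample_prior(meta_posterior)           # P_s ~ Q_s
        history = run_thompson_sampling(prior_sample, task, n)
        meta_posterior = update_meta_posterior(meta_posterior, history)
        histories.append(history)
    return meta_posterior, histories
```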
Once the task is complete, it updates the meta-posterior in a standard Bayesian fashion,

$$Q_{s+1}(\mathcal{P}) \propto \mathbb{P}(H_{1:s} \mid P_* = \mathcal{P})\, Q(\mathcal{P}) = \mathbb{P}(H_s \mid P_* = \mathcal{P}) \prod_{\ell=1}^{s-1} \mathbb{P}(H_\ell \mid P_* = \mathcal{P})\, Q(\mathcal{P}) \propto \mathbb{P}(H_s \mid P_* = \mathcal{P})\, Q_s(\mathcal{P}) = \left[\int_\theta \mathbb{P}(H_s \mid \theta_{s,*} = \theta)\, \mathbb{P}(\theta_{s,*} = \theta \mid P_* = \mathcal{P})\, d\theta\right] Q_s(\mathcal{P})\,, \qquad (1)$$

where $\mathbb{P}(H_s \mid P_* = \mathcal{P})$ and $\mathbb{P}(H_s \mid \theta_{s,*} = \theta)$ are the probabilities of the observations in task $s$ given that the instance prior is $\mathcal{P}$ and that the problem instance is $\theta$, respectively. A rigorous justification of this update is given in Section 3.4. Specific instances of this update are in Sections 3.2 and 3.3.

The pseudocode for MetaTS is presented in Algorithm 1. The algorithm is simple, natural, and general; but it has two potential shortcomings. First, it is unclear if it can be implemented efficiently. To address this, we develop efficient implementations for both Bernoulli and Gaussian bandits in Sections 3.2 and 3.3, respectively. Second, it is unclear whether MetaTS explores enough. Ideally, it should learn to perform as well as TS with the true prior $P_*$. Intuitively, we expect this, since the meta-posterior samples $P_s \sim Q_s$ should vary significantly in the directions of high variance in $Q_s$, which represent high uncertainty that can be reduced by exploring. We confirm this in Section 4, where MetaTS is analyzed in Gaussian bandits.

### 3.2. Bernoulli Bandit with a Categorical Meta-Prior

Bernoulli Thompson sampling was the first instance of TS that was analyzed (Agrawal & Goyal, 2012). In this section, we apply MetaTS to this problem class. We consider a Bernoulli bandit with $K$ arms that is parameterized by arm means $\theta \in [0, 1]^K$. The reward of arm $i$ in instance $\theta$ is drawn i.i.d. from $\mathrm{Ber}(\theta_i)$. To model uncertainty in the prior, we assume access to $L$ potential instance priors $\mathcal{P} = \{P^{(j)}\}_{j=1}^{L}$. Each prior $P^{(j)}$ is factored across the arms as

$$P^{(j)}(\theta) = \prod_{i=1}^{K} \mathrm{Beta}(\theta_i; \alpha_{i,j}, \beta_{i,j}) = \prod_{i=1}^{K} \frac{\Gamma(\alpha_{i,j} + \beta_{i,j})}{\Gamma(\alpha_{i,j})\Gamma(\beta_{i,j})}\, \theta_i^{\alpha_{i,j}-1}(1-\theta_i)^{\beta_{i,j}-1}$$

for some fixed $(\alpha_{i,j})_{i=1}^K$ and $(\beta_{i,j})_{i=1}^K$. The meta-prior is a categorical distribution over the $L$ classes of tasks. That is, $Q(j) = \mathrm{Cat}(j; w) = w_j$ for $w \in \Delta_{L-1}$, where $w$ is a vector of initial beliefs in each instance prior and $\Delta_{L-1}$ is the probability simplex over $L$ outcomes. The tasks are generated as follows. First, the instance prior is set as $P_* = P^{(j_*)}$, where $j_* \sim Q$. Then, in each task $s$, a Bernoulli bandit instance is sampled as $\theta_{s,*} \sim P_*$.

MetaTS is implemented as follows. The meta-posterior in task $s$ is $Q_s(j) = \mathrm{Cat}(j; \hat{w}_s) = \hat{w}_{s,j}$, where $\hat{w}_s \in \Delta_{L-1}$ is a vector of posterior beliefs in each instance prior. The instance prior in task $s$ is $P_s = P^{(j_s)}$, where $j_s \sim Q_s$. After interacting with bandit instance $\theta_{s,*}$, the meta-posterior is updated using $Q_{s+1}(j) \propto f(j)\, Q_s(j)$, where

$$f(j) = \int_\theta \mathbb{P}(H_s \mid \theta_{s,*} = \theta)\, \mathbb{P}(\theta_{s,*} = \theta \mid j_* = j)\, d\theta = \prod_{i=1}^{K} \frac{\Gamma(\alpha_{i,j} + \beta_{i,j})}{\Gamma(\alpha_{i,j})\Gamma(\beta_{i,j})} \int_{\theta_i} \theta_i^{\alpha_{i,j} + N^+_{i,s} - 1}(1-\theta_i)^{\beta_{i,j} + N^-_{i,s} - 1}\, d\theta_i = \prod_{i=1}^{K} \frac{\Gamma(\alpha_{i,j} + \beta_{i,j})\, \Gamma(\alpha_{i,j} + N^+_{i,s})\, \Gamma(\beta_{i,j} + N^-_{i,s})}{\Gamma(\alpha_{i,j})\, \Gamma(\beta_{i,j})\, \Gamma(\alpha_{i,j} + \beta_{i,j} + T_{i,s})}\,.$$

Here $A_{i,s} = \{t \in [n] : A_{s,t} = i\}$ is the set of rounds where arm $i$ is pulled in task $s$ and $T_{i,s} = |A_{i,s}|$ is the number of these rounds. In addition, $N^+_{i,s} = \sum_{t \in A_{i,s}} Y_{s,t}(i)$ denotes the number of positive observations of arm $i$ and $N^-_{i,s} = T_{i,s} - N^+_{i,s}$ is the number of its negative observations.

The above derivation can be generalized in a straightforward fashion to any categorical meta-prior whose instance priors $P^{(j)}$ lie in some exponential family.
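In log space, this update is only a few lines of code. The sketch below is an illustration with our own naming, not the authors' implementation; it assumes SciPy's `gammaln` for $\log\Gamma$ and computes $\log f(j)$ for all $j$ at once before renormalizing $\hat{w}_s$.

```python
import numpy as np
from scipy.special import gammaln  # log Gamma, elementwise

def bernoulli_meta_update(log_w, alpha, beta, n_pos, n_neg):
    """One meta-posterior update for the categorical/Beta model of Section 3.2.

    log_w        -- log of hat{w}_s, shape (L,)
    alpha, beta  -- Beta parameters alpha_{i,j}, beta_{i,j}, shape (K, L)
    n_pos, n_neg -- N^+_{i,s} and N^-_{i,s} from task s, shape (K,)
    Returns the log of hat{w}_{s+1}.
    """
    T = n_pos + n_neg                                    # T_{i,s}
    # log f(j): sum over arms of the Beta-Bernoulli evidence of arm i's data under prior j
    log_f = (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
             + gammaln(alpha + n_pos[:, None]) + gammaln(beta + n_neg[:, None])
             - gammaln(alpha + beta + T[:, None])).sum(axis=0)
    log_w_next = log_w + log_f
    return log_w_next - np.logaddexp.reduce(log_w_next)  # renormalize so the weights sum to 1
```

Sampling $j_s \sim Q_s$ for the next task is then a categorical draw from `np.exp(log_w_next)`; working in log space simply avoids underflow when $n$ is large.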
### 3.3. Gaussian Bandit with a Gaussian Meta-Prior

Gaussian distributions have many properties that allow for a tractable analysis, such as the fact that the posterior variance is independent of the observations, which we exploit in Section 4. In this section, we present a computationally efficient implementation of MetaTS for this problem class.

We consider a Gaussian bandit with $K$ arms that is parameterized by arm means $\theta \in \mathbb{R}^K$. The reward of arm $i$ in instance $\theta$ is drawn i.i.d. from $\mathcal{N}(\theta_i, \sigma^2)$. We have a continuum of instance priors, parameterized by a vector of means $\mu \in \mathbb{R}^K$ and defined as $P(\theta) = \mathcal{N}(\theta; \mu, \sigma_0^2 I_K)$. The prior width $\sigma_0$ is fixed. The meta-prior is a Gaussian distribution over the instance prior means, $Q(\mu) = \mathcal{N}(\mu; 0, \sigma_q^2 I_K)$, where $\sigma_q$ is assumed to be known. The tasks are generated as follows. First, the instance prior is set as $P_* = \mathcal{N}(\mu_*, \sigma_0^2 I_K)$, where $\mu_* \sim Q$. Then, in each task $s$, a Gaussian bandit instance is sampled as $\theta_{s,*} \sim P_*$.

MetaTS is implemented as follows. The meta-posterior in task $s$ is $Q_s(\mu) = \mathcal{N}(\mu; \hat{\mu}_s, \hat{\Sigma}_s)$, where $\hat{\mu}_s \in \mathbb{R}^K$ is an estimate of $\mu_*$ and $\hat{\Sigma}_s \in \mathbb{R}^{K \times K}$ is a diagonal covariance matrix. The instance prior in task $s$ is $P_s(\theta) = \mathcal{N}(\theta; \mu_s, \sigma_0^2 I_K)$, where $\mu_s \sim Q_s$. After interacting with bandit instance $\theta_{s,*}$, the meta-posterior is updated as $Q_{s+1}(\mu) \propto f(\mu)\, Q_s(\mu)$, where

$$f(\mu) = \int_\theta \mathbb{P}(H_s \mid \theta_{s,*} = \theta)\, \mathbb{P}(\theta_{s,*} = \theta \mid \mu_* = \mu)\, d\theta = \prod_{i=1}^{K} \int_{\theta_i} \prod_{t \in A_{i,s}} \mathcal{N}(Y_{s,t}(i); \theta_i, \sigma^2)\; \mathcal{N}(\theta_i; \mu_i, \sigma_0^2)\, d\theta_i\,.$$

Here $A_{i,s} = \{t \in [n] : A_{s,t} = i\}$ is the set of rounds where arm $i$ is pulled in task $s$ and $T_{i,s} = |A_{i,s}|$ is the number of such rounds, as in Section 3.2. Since $\hat{\Sigma}_s = \mathrm{diag}(\hat{\sigma}^2_s)$ is a diagonal covariance matrix, it is fully characterized by a vector of individual arm variances $\hat{\sigma}^2_s \in \mathbb{R}^K$. The parameters $\hat{\mu}_s$ and $\hat{\sigma}^2_s$ are updated, based on Lemma 7 in Appendix B, as

$$\hat{\mu}_{s+1,i} = \hat{\sigma}^2_{s+1,i}\left(\frac{\hat{\mu}_{s,i}}{\hat{\sigma}^2_{s,i}} + \frac{T_{i,s}}{T_{i,s}\sigma_0^2 + \sigma^2}\, \bar{Y}_{i,s}\right), \qquad \hat{\sigma}^{-2}_{s+1,i} = \hat{\sigma}^{-2}_{s,i} + \frac{T_{i,s}}{T_{i,s}\sigma_0^2 + \sigma^2}\,,$$

where $\bar{Y}_{i,s} = T_{i,s}^{-1}\sum_{t \in A_{i,s}} Y_{s,t}(i)$ is the mean reward of arm $i$ observed in task $s$. This update can also be derived using (17) in Appendix D, when all covariance matrices are assumed to be diagonal.

The above update has a very nice interpretation. The posterior mean $\hat{\mu}_{s+1,i}$ of arm $i$ is a weighted sum of the mean reward estimate of arm $i$ in task $s$ and the earlier posterior mean $\hat{\mu}_{s,i}$. The weight of the estimate depends on how good it is. Specifically, it varies from $1/(\sigma_0^2 + \sigma^2)$, when arm $i$ is pulled only once, to $1/\sigma_0^2$, as $T_{i,s} \to \infty$. The latter is the minimum amount of uncertainty that cannot be reduced by more pulls, due to the fact that $\theta_{s,*}$ is a single observation of the unknown $\mu_*$ with covariance $\sigma_0^2 I_K$.
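As a sanity check on the closed form above, the update can be implemented directly. The vectorized sketch below uses our own naming, and it treats arms with $T_{i,s} = 0$ as contributing nothing, which is the limit of the formulas above.

```python
import numpy as np

def gaussian_meta_update(mu_hat, var_hat, pull_counts, reward_sums, sigma, sigma0):
    """One meta-posterior update for the Gaussian model of Section 3.3.

    mu_hat, var_hat -- hat{mu}_s and hat{sigma}^2_s, per-arm arrays of shape (K,)
    pull_counts     -- T_{i,s}, number of pulls of each arm in task s, shape (K,)
    reward_sums     -- sum of Y_{s,t}(i) over the pulls of arm i in task s, shape (K,)
    sigma, sigma0   -- reward noise and instance-prior width
    Returns (hat{mu}_{s+1}, hat{sigma}^2_{s+1}).
    """
    T = np.asarray(pull_counts, dtype=float)
    denom = T * sigma0**2 + sigma**2
    task_precision = T / denom              # T_{i,s} / (T_{i,s} sigma_0^2 + sigma^2); 0 if unpulled
    var_next = 1.0 / (1.0 / var_hat + task_precision)
    mu_next = var_next * (mu_hat / var_hat + reward_sums / denom)
    return mu_next, var_next
```

The prior mean for the next task is then a draw $\mu_{s+1} \sim \mathcal{N}(\hat{\mu}_{s+1}, \mathrm{diag}(\hat{\sigma}^2_{s+1}))$, for example `rng.normal(mu_next, np.sqrt(var_next))`.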
### 3.4. Measure-Theoretic View and the General Case

We now present a more general, measure-theoretic specification of our meta-bandit setting. Let $\mathcal{Z}$ be the set of outcomes for the hidden variable $Z$ that is sampled from the meta-prior, and let $\sigma(\mathcal{Z})$ be the $\sigma$-algebra over this set. Similarly, let $\Theta$ be the set of possible bandit environments $\theta \in \Theta$ and $\sigma(\Theta)$ be the $\sigma$-algebra over this set. While in this work we focus on environments characterized only by their mean reward vectors, this parameterization could be more general and, for example, include the variance of the mean reward vectors. The formal definition of a $K$-armed Bayesian meta-bandit is as follows.

Definition 1. A $K$-armed Bayesian meta-bandit is a tuple $B = (\mathcal{Z}, \sigma(\mathcal{Z}), Q, \Theta, \sigma(\Theta), P, \rho)$, where $(\mathcal{Z}, \sigma(\mathcal{Z}))$ is a measurable space; the meta-prior $Q$ is a probability measure over $(\mathcal{Z}, \sigma(\mathcal{Z}))$; the prior $P$ is a probability kernel from $(\mathcal{Z}, \sigma(\mathcal{Z}))$ to $(\Theta, \sigma(\Theta))$; and $\rho = (\rho_{\theta,i} : \theta \in \Theta, i \in [K])$ is a probability kernel from $\Theta \times [K]$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, where $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra of $\mathbb{R}$ and $\rho_{\theta,i}$ is the reward distribution associated with arm $i$ in bandit $\theta$.

We use lowercase letters to denote realizations of random variables. Let $P_z$ be the distribution of bandit instances under $Z = z$. We assume that a new environment $\theta \in \Theta$ is sampled from the same measure $P_z$ at the beginning of each task, with the same realization of the hidden variable $Z$ sampled from the meta-prior $Q$ beforehand.

A bandit algorithm consists of kernels $\pi_{s,t}$ that take as input a history of interactions, consisting of the pulled arms and observed rewards up to round $t$ in task $s$, and output a probability measure over the arms. A bandit algorithm is connected with a Bayesian meta-bandit environment $B$ to produce a sequence of chosen arms and observed rewards. Formally, let $\Omega_{s,t} = ([K] \times \mathbb{R})^{(s-1)n + t - 1}$ for each $t \in [n]$ and $s \in [m]$. Then a bandit algorithm, or policy, is a tuple $\pi = (\pi_{s,t})_{s,t=1}^{m,n}$ such that each $\pi_{s,t}$ is a kernel from $(\Omega_{s,t}, \mathcal{B}(\mathbb{R}^{2((s-1)n + t - 1)}))$ to $([K], 2^{[K]})$. The policy interacts with the Bayesian meta-bandit $B$ over $m$ tasks, each lasting $n$ rounds, and produces a sequence of random variables

$$A_{1,1}, X_{1,1}, \dots, A_{1,n}, X_{1,n}, \dots, A_{m,1}, X_{m,1}, \dots, A_{m,n}, X_{m,n}\,,$$

where $X_{s,t} = Y_{s,t}(A_{s,t})$ is the reward in round $t$ of task $s$. The probability measure over these variables, $\mathbb{P}_{z, \theta_1, \dots, \theta_m, \pi}$, is guaranteed to exist by the Ionescu-Tulcea theorem (Tulcea, 1949). Furthermore, the conditional probabilities of the transitions of this measure are equal to the kernels

$$\mathbb{P}(\theta_s \in \cdot \mid z, \theta_1, \dots, \theta_{s-1}, H_{1:s-1}) = P_z(\cdot)\,,$$
$$\mathbb{P}(A_{s,t} \in \cdot \mid z, \theta_1, \dots, \theta_s, H_{1:s-1}, H_{s,t}) = \pi_{s,t}(\cdot \mid H_{1:s-1}, H_{s,t})\,,$$
$$\mathbb{P}(X_{s,t} \in \cdot \mid z, \theta_1, \dots, \theta_s, H_{1:s-1}, H_{s,t}, A_{s,t}) = \rho_{\theta_s, A_{s,t}}(\cdot)\,,$$

where $H_{s,t} = (X_{s,\ell}, A_{s,\ell})_{\ell=1}^{t-1}$. The following lemma says that both the task-posterior $P_z(\cdot \mid h_{s,t})$, for any $z \in \mathcal{Z}$, and the meta-posterior $Q(\cdot \mid h_{1:s-1})$ depend only on the arms pulled according to $\pi$, but not on the exact form of $\pi$.

Lemma 2. Assume that there exists a $\sigma$-finite measure $\lambda_\rho$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\rho_{\theta,i}$ is absolutely continuous with respect to $\lambda_\rho$ for all $\theta \in \Theta$ and $i \in [K]$. Then the task-posterior and meta-posterior exist and have the form

$$P_z(S_1 \mid h_{s,t}) = \frac{\int_{S_1} \prod_{j=1}^{t-1} p_{\theta_s, a_{s,j}}(x_{s,j})\, dP_z(\theta_s)}{\int_{\Theta} \prod_{j=1}^{t-1} p_{\theta_s, a_{s,j}}(x_{s,j})\, dP_z(\theta_s)}\,,$$

$$Q(S_2 \mid h_{1:s-1}) = \frac{\int_{S_2} \left[\prod_{\ell=1}^{s-1} \prod_{j=1}^{n} \int_{\Theta} p_{\theta, a_{\ell,j}}(x_{\ell,j})\, dP_z(\theta)\right] dQ(z)}{\int_{\mathcal{Z}} \left[\prod_{\ell=1}^{s-1} \prod_{j=1}^{n} \int_{\Theta} p_{\theta, a_{\ell,j}}(x_{\ell,j})\, dP_z(\theta)\right] dQ(z)}\,,$$

for any $S_1 \in \sigma(\Theta)$, $z \in \mathcal{Z}$, and $S_2 \in \sigma(\mathcal{Z})$.

The proof is provided in Appendix C. MetaTS samples $Z_s$ from the meta-posterior $Q_s = Q(\cdot \mid h_{1:s-1})$ at the beginning of each task $s \in [m]$ and then pulls arms according to the samples from $P_{Z_s}(\cdot \mid h_{s,t})$. The above lemma shows that, to compute the posteriors, we only need the reward distributions, integrated over the environments (for the task-posterior) or over both the environments and the hidden variable (for the meta-posterior). These integrals can be derived analytically, for example, in the case of the conjugate priors of Sections 3.2 and 3.3.
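To make the conjugate case concrete, the inner integral in the Gaussian model of Section 3.3 has a closed form. The short derivation below is ours (not reproduced from the paper's appendices) and is what the update of $\hat{\mu}_s$ and $\hat{\sigma}^2_s$ rests on: the sample mean of an arm's rewards is a sufficient statistic, and integrating out $\theta_i$ leaves a Gaussian likelihood in $\mu_i$.

```latex
% Marginal likelihood of arm i's observations in task s, as a function of \mu_i.
% Write \bar{Y}_{i,s} = T_{i,s}^{-1} \sum_{t \in A_{i,s}} Y_{s,t}(i).
\[
\int_{\theta_i} \prod_{t \in A_{i,s}} \mathcal{N}\big(Y_{s,t}(i); \theta_i, \sigma^2\big)\,
  \mathcal{N}\big(\theta_i; \mu_i, \sigma_0^2\big)\, d\theta_i
\;\propto\;
\mathcal{N}\!\Big(\bar{Y}_{i,s};\; \mu_i,\; \sigma_0^2 + \sigma^2 / T_{i,s}\Big),
\]
% where the proportionality constant does not depend on \mu_i. Multiplying these per-arm
% Gaussian likelihoods into Q_s(\mu) = N(\mu; \hat{\mu}_s, \hat{\Sigma}_s) adds
% (\sigma_0^2 + \sigma^2 / T_{i,s})^{-1} = T_{i,s} / (T_{i,s}\sigma_0^2 + \sigma^2)
% to the posterior precision of arm i, which is exactly the update in Section 3.3.
```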
## 4. Analysis

We bound the Bayes regret of MetaTS in Gaussian bandits (Section 3.3). This section is organized as follows. We state the bound and sketch its proof in Section 4.1, and discuss it in Section 4.2. In Section 4.3, we present the key lemmas. Finally, in Section 4.4, we discuss how our analysis can be extended beyond Gaussian bandits.

### 4.1. Regret Bound

We analyze MetaTS under the assumption that each arm is pulled at least once per task. Although this suffices to show the benefits of meta-learning, it is conservative. A less conservative analysis would require understanding how Thompson sampling with a misspecified prior pulls arms. In particular, we would require a high-probability lower bound on the number of pulls of each arm. To the best of our knowledge, such a bound does not exist and is non-trivial to derive.

To guarantee that each arm is pulled at least once, we pull each arm in the last $K$ rounds of each task. This is to avoid any interference with our posterior sampling analysis in the earlier rounds. Other than this, MetaTS is analyzed exactly as described in Sections 3.1 and 3.3.

Recall the following definitions in our setting (Section 3.3). The meta-prior is $Q = \mathcal{N}(0, \sigma_q^2 I_K)$. The instance prior is $P_* = \mathcal{N}(\mu_*, \sigma_0^2 I_K)$, where $\mu_* \sim Q$ is chosen before the learning agent interacts with the tasks. Then, in each task $s$, a problem instance is drawn i.i.d. as $\theta_{s,*} \sim P_*$. Our main result is the following Bayes regret bound.

Theorem 3. The Bayes regret of MetaTS over $m$ tasks with $n$ rounds each is

$$\mathcal{R}(m, n; P_*) \le c_1\left(\sqrt{n + \sigma^2\sigma_0^{-2}K} - \sqrt{\sigma^2\sigma_0^{-2}K}\right) m + c_2(\delta)\, c_3(\delta)\, K n^2 \sqrt{m} + \tilde{O}(Km + n)$$

with probability at least $1 - (2m+1)\delta$, where

$$c_1 = \sqrt{2\sigma^2 K \log n}\,, \quad c_2(\delta) = 2\sqrt{2\sigma_q^2 \log(2K/\delta)} + \sqrt{2\sigma_0^2 \log n}\,, \quad c_3(\delta) = 8\sqrt{(\sigma_0^2 + \sigma^2)\log(4K/\delta)/(\pi\sigma_0^2)}\,.$$

The probability is over the realizations of $\mu_*$, $\theta_{s,*}$, and $\hat{\mu}_s$.

Proof. First, we bound the magnitude of $\mu_*$. Specifically, since $\mu_* \sim \mathcal{N}(0, \sigma_q^2 I_K)$,

$$\|\mu_*\|_\infty \le \sqrt{2\sigma_q^2 \log(2K/\delta)} \qquad (2)$$

holds with probability at least $1 - \delta$.

Now we fix task $s \ge 2$ and decompose its regret. Let $A_{s,*}$ be the optimal arm in instance $\theta_{s,*}$, $A_{s,t}$ be the arm pulled in round $t$ by TS with the misspecified prior $P_s$, and $\tilde{A}_{s,t}$ be the arm pulled in round $t$ by TS with the correct prior $P_*$. Then

$$\mathbb{E}\left[\sum_{t=1}^{n} \theta_{s,*}(A_{s,*}) - \theta_{s,*}(A_{s,t})\right] = R_{s,1} + R_{s,2}\,,$$

$$R_{s,1} = \mathbb{E}\left[\sum_{t=1}^{n} \theta_{s,*}(A_{s,*}) - \theta_{s,*}(\tilde{A}_{s,t})\right], \qquad R_{s,2} = \mathbb{E}\left[\sum_{t=1}^{n} \theta_{s,*}(\tilde{A}_{s,t}) - \theta_{s,*}(A_{s,t})\right].$$

The term $R_{s,1}$ is the regret of a hypothetical TS that knows $P_*$. This TS is introduced only for the purpose of analysis and is the optimal policy. The term $R_{s,2}$ is the difference in the expected $n$-round rewards of TS with priors $P_s$ and $P_*$, and it vanishes as the number of tasks $s$ increases.

To bound $R_{s,1}$, we apply Lemma 4 with $\delta = 1/n$ and get

$$R_{s,1} \le \sqrt{2\sigma^2 K \log n}\left(\sqrt{n + \sigma^2\sigma_0^{-2}K} - \sqrt{\sigma^2\sigma_0^{-2}K}\right) + \tilde{O}(K)\,,$$

where the $\tilde{O}(K)$ corresponds to the $c(\delta)$ term in Lemma 4. To bound $R_{s,2}$, we apply Lemma 5 with $\delta = 1/n$ and get

$$R_{s,2} \le \left(2\|\mu_*\|_\infty + \sqrt{2\sigma_0^2 \log n}\right)\sqrt{\frac{2}{\pi\sigma_0^2}}\, K n^2\, \|\mu_s - \mu_*\|_\infty + \tilde{O}(K)\,,$$

where the $\tilde{O}(K)$ is the first term in Lemma 5, after we bound $\|\mu_*\|_\infty$ in it using (2).

Now we sum up our bounds on $R_{s,1} + R_{s,2}$ over all tasks $s \ge 2$ and get

$$c_1\left(\sqrt{n + \sigma^2\sigma_0^{-2}K} - \sqrt{\sigma^2\sigma_0^{-2}K}\right) m + c_2(\delta)\sqrt{\frac{2}{\pi\sigma_0^2}}\, K n^2 \sum_{s=2}^{m} \|\mu_s - \mu_*\|_\infty + \tilde{O}(Km)\,,$$

where $c_1$ and $c_2(\delta)$ are defined in the main claim. Then we apply Lemma 6 to each term $\|\mu_s - \mu_*\|_\infty$ and have, with probability at least $1 - m\delta$, that

$$\sum_{s=2}^{m} \|\mu_s - \mu_*\|_\infty \le 4\sqrt{2(\sigma_0^2 + \sigma^2)\, m \log(4K/\delta)}\,,$$

where the $\sqrt{m}$ arises from summing up the $O(1/\sqrt{s})$ terms in Lemma 6, using Lemma 8 in Appendix B. This concludes the main part of the proof.

We finish with an upper bound on the regret in task 1 and the cost of pulling each arm once at the end of each task. This can be done as follows. Since $\theta_{s,*} \sim \mathcal{N}(\mu_*, \sigma_0^2 I_K)$,

$$\|\theta_{s,*} - \mu_*\|_\infty \le \sqrt{2\sigma_0^2 \log(2K/\delta)} \qquad (3)$$

holds with probability at least $1 - \delta$ in any task $s$. From (2) and (3), we have with high probability that

$$\|\theta_{s,*}\|_\infty \le \|\theta_{s,*} - \mu_*\|_\infty + \|\mu_*\|_\infty = O\!\left(\sqrt{(\sigma_q^2 + \sigma_0^2)\log(2K/\delta)}\right).$$

This yields a high-probability upper bound of order $\sqrt{(\sigma_q^2 + \sigma_0^2)\log(2K/\delta)}\,(n + Km)$ on the regret in task 1 and on pulling each arm once at the end of all tasks. This concludes our proof.

### 4.2. Discussion

If we assume a regime of large $m$ and $n$, where the learning agent improves with more tasks but also needs to perform well in each task, the most important terms in Theorem 3 are those where $m$ and $n$ interact. Using these terms, our bound can be summarized as

$$\tilde{O}\!\left(\left(\sqrt{n + \sigma^2\sigma_0^{-2}K} - \sqrt{\sigma^2\sigma_0^{-2}K}\right) m + K n^2 \sqrt{m}\right) \qquad (4)$$

and holds with probability $1 - (2m+1)\delta$ for any $\delta > 0$. Our bound in (4) can be viewed as follows. The first term is the regret of Thompson sampling with the correct prior $P_*$.
It is linear in the number of tasks $m$, since MetaTS solves $m$ exploration problems. The second term captures the cost of learning $P_*$. Since it is sublinear in the number of tasks $m$, MetaTS is near optimal in the regime of large $m$.

We compare our bound to two baselines. The first baseline is TS with a known prior $P_*$. The regret of this TS can be bounded using Lemma 4 and includes only the first term in (4). In the regime of large $m$, this term dominates the regret of MetaTS, and thus MetaTS is near optimal. The second baseline is agnostic Thompson sampling, which does not use the structure $\theta_{s,*} \sim P_* \sim Q$. Instead, it marginalizes out $Q$. In our setting, this can be equivalently viewed as assuming $\theta_{s,*} \sim \mathcal{N}(0, (\sigma_q^2 + \sigma_0^2) I_K)$. For this prior, the Bayes regret is $\mathbb{E}[\mathcal{R}(m, n; P_*)]$, where the expectation is over $P_* \sim Q$. Again, we can apply Lemma 4 and show that $\mathbb{E}[\mathcal{R}(m, n; P_*)]$ has an upper bound of

$$c_1\left(\sqrt{n + \sigma^2\bar{\sigma}^{-2}K} - \sqrt{\sigma^2\bar{\sigma}^{-2}K}\right) m\,, \qquad \text{where } \bar{\sigma}^2 = \sigma_q^2 + \sigma_0^2\,.$$

Clearly, $\bar{\sigma}^2 > \sigma_0^2$, and therefore the difference of the above square roots is always larger than in (4). So, in the regime of large $m$, MetaTS has a lower regret than this baseline.

### 4.3. Key Lemmas

Now we present the three key lemmas used in the proof of Theorem 3. They are proved in Appendix A. The first lemma is a prior-dependent upper bound on the Bayes regret of TS.

Lemma 4. Let $\theta$ be the arm means in a $K$-armed Gaussian bandit, generated as $\theta \sim P_* = \mathcal{N}(\mu_*, \sigma_0^2 I_K)$. Let $A_*$ be the optimal arm under $\theta$ and $A_t$ be the arm pulled in round $t$ by TS with prior $P_*$. Then for any $\delta > 0$,

$$\mathbb{E}\left[\sum_{t=1}^{n} \theta(A_*) - \theta(A_t)\right] \le \sqrt{2\sigma^2 K \log(1/\delta)}\left(\sqrt{n + \sigma^2\sigma_0^{-2}K} - \sqrt{\sigma^2\sigma_0^{-2}K}\right) + c(\delta)\,,$$

where $c(\delta) = 2\sqrt{2\sigma_0^2 \log(1/\delta)}\, K + \sqrt{2\sigma_0^2/\pi}\, K n \delta$.

The effect of the prior is reflected in the difference of the square roots. As the prior narrows and $\sigma_0 \to 0$, the difference decreases, which shows that a more concentrated prior leads to less exploration. The algebraic form of the bound is also expected. Roughly speaking, $\sqrt{\sigma^2\sigma_0^{-2}K}$ accounts for the sum of confidence-interval widths in the Bayes regret analysis that cannot arise, because the prior width is $\sigma_0$.

The bound in Lemma 4 differs from other similar bounds in the literature (Lu & Van Roy, 2019). One difference is that the Cauchy-Schwarz inequality is not used in its proof. Therefore, $\sigma_0$ appears in the square root instead of the logarithm. The resulting bound is tighter for $\sigma^2\sigma_0^{-2}K \ge n$. Another difference is that information-theoretic arguments are not used in the proof. The dependence on $\sigma_0$ is a result of carefully characterizing the posterior variance of $\theta$ in each round. Our proof is simple and easy to follow.

The second lemma bounds the difference in the expected $n$-round rewards of TS with different priors.

Lemma 5. Let $\theta$ be the arm means in a $K$-armed Gaussian bandit, generated as $\theta \sim P_* = \mathcal{N}(\mu_*, \sigma_0^2 I_K)$. Let $\mathcal{N}(\hat{\mu}, \sigma_0^2 I_K)$ and $\mathcal{N}(\tilde{\mu}, \sigma_0^2 I_K)$ be two TS priors such that $\|\hat{\mu} - \tilde{\mu}\|_\infty \le \varepsilon$. Let $\hat{\theta}_t$ and $\tilde{\theta}_t$ be their respective posterior samples in round $t$, and $\hat{A}_t$ and $\tilde{A}_t$ be the arms pulled under these samples. Then for any $\delta > 0$,

$$\mathbb{E}\left[\sum_{t=1}^{n} \theta(\hat{A}_t) - \theta(\tilde{A}_t)\right] \le \sqrt{\frac{2\sigma_0^2}{\pi}}\, K n \delta + \left(2\|\tilde{\mu}\|_\infty + \sqrt{2\sigma_0^2 \log(1/\delta)}\right)\sqrt{\frac{2}{\pi\sigma_0^2}}\, K n^2 \varepsilon\,.$$

The key dependence in Lemma 5 is that the bound is linear in the difference of the prior means $\varepsilon$. The bound is also $O(n^2)$. Although this is unfortunate, it cannot be improved in general if we want to keep the linear dependence on $\varepsilon$. The $O(n^2)$ dependence arises in the proof as follows. We bound the difference in the expected $n$-round rewards of TS with two different priors by the probability that the two TS instances deviate in each round, multiplied by the maximum reward that can be earned from that round onward.
The probability is $O(\varepsilon)$ and the maximum reward is $O(n)$. This bound is applied $n$ times, once per round, and thus the $O(n^2\varepsilon)$ dependence.

The last lemma shows the concentration of the meta-posterior sample means.

Lemma 6. Let $\mu_* \sim \mathcal{N}(0, \sigma_q^2 I_K)$ and the prior parameters in task $s$ be sampled as $\mu_s \mid H_{1:s-1} \sim \mathcal{N}(\hat{\mu}_s, \hat{\Sigma}_s)$. Then

$$\|\mu_s - \mu_*\|_\infty \le 2\sqrt{\frac{\sigma_0^2 + \sigma^2}{(\sigma_0^2 + \sigma^2)\sigma_q^{-2} + s - 1}\,\log(4K/\delta)}$$

holds jointly over all tasks $s \in [m]$ with probability at least $1 - m\delta$.

The key dependence is that the bound is $O(1/\sqrt{s})$ in task $s$, which provides an upper bound on $\varepsilon$ in Lemma 5. After we sum up these upper bounds over all $s \in [m]$ tasks, we get the $O(\sqrt{m})$ term in Theorem 3.

### 4.4. Beyond Gaussian Bandits

We analyze Gaussian bandits with a known prior covariance matrix (Section 4.1) because this simplifies the algebra and is easy to interpret. We believe that a similar analysis can be conducted for other bandit problems, based on the following high-level interpretation of our key lemmas (Section 4.3). Lemma 4 says that a more concentrated prior in TS yields lower regret. This is expected in general, as less uncertainty about the problem instance leads to lower regret. Lemma 5 says that the difference in the expected $n$-round rewards of TS with different priors can be bounded by the difference of the prior parameters. This is expected for any prior that is smooth in its parameters. Lemma 6 says that the meta-posterior concentrates as the number of tasks increases. When each arm is pulled at least once per task, as we assume in Section 4.1, MetaTS gets at least one noisy observation of the prior per task, and any exponential-family meta-posterior would concentrate.

## 5. Experiments

We experiment with three problems. In each problem, we have $m = 20$ tasks with a horizon of $n = 200$ rounds. All results are averaged over 100 runs, where $P_* \sim Q$ in each run.

Figure 2. Comparison of MetaTS to two variants of Thompson sampling, where the instance prior $P_*$ is known (Oracle TS) and the meta-prior $Q$ is marginalized out (TS). Panels: Bernoulli ($K = 2$), Gaussian ($K = 2$, $\sigma_0 = 0.1$), and Linear ($d = 2$, $\sigma_0 = 0.1$); the horizontal axes show the number of tasks.

The first problem is a Bernoulli bandit with $K = 2$ arms and a categorical meta-prior (Section 3.2). We have $L = 2$ instance priors, which are defined as

$$P^{(1)}(\theta) = \mathrm{Beta}(\theta_1; 6, 2)\, \mathrm{Beta}(\theta_2; 2, 6)\,, \qquad P^{(2)}(\theta) = \mathrm{Beta}(\theta_1; 2, 6)\, \mathrm{Beta}(\theta_2; 6, 2)\,.$$

In instance prior $P^{(1)}$, arm 1 is more likely to be optimal than arm 2, while arm 2 is more likely to be optimal in prior $P^{(2)}$. The meta-prior is a categorical distribution $\mathrm{Cat}(w)$ with $w = (0.5, 0.5)$. This problem is designed such that if the agent knew $P_*$, it would know the optimal arm with high probability, and could significantly reduce exploration in future tasks.

The second problem is a Gaussian bandit with $K = 2$ arms and a Gaussian meta-prior (Section 3.3). The meta-prior width is $\sigma_q = 0.5$, the instance prior width is $\sigma_0 = 0.1$, and the reward noise is $\sigma = 1$. In this problem, $\sigma_q \gg \sigma_0$ and we expect major gains from meta-learning. In particular, based on our discussion in Section 4.2,

$$\sqrt{n + \sigma^2\sigma_0^{-2}K} - \sqrt{\sigma^2\sigma_0^{-2}K} \;\ll\; \sqrt{n + \sigma^2(\sigma_0^2 + \sigma_q^2)^{-1}K} - \sqrt{\sigma^2(\sigma_0^2 + \sigma_q^2)^{-1}K}\,.$$

The third problem is a linear bandit in $d = 2$ dimensions with $K = 5d$ arms. We sample the arm features uniformly at random from $[-0.5, 0.5]^d$. The meta-prior, prior, and noise are set as in the Gaussian experiment. The main difference is that $\theta_{s,*}$ is a parameter vector of a linear model, where the mean reward of an arm with features $x$ is $x^\top \theta_{s,*}$. The posterior updates are computed as described in Appendix D.
Even in this more complex setting, they have a closed form.

We compare MetaTS to two baselines. The first, Oracle TS, is an idealized TS with the true prior $P_*$. This baseline shows the lowest attainable regret. The second baseline is agnostic TS, which does not use the structure of our problem. In the Gaussian and linear bandit experiments, we implement it as TS with prior $\mathcal{N}(\theta; 0, (\sigma_q^2 + \sigma_0^2) I_K)$, as this is the marginal distribution of $\theta_{s,*}$. In the Bernoulli bandit experiment, we use an uninformative prior $\prod_{i=1}^{K} \mathrm{Beta}(\theta_i; 1, 1)$, since the marginal distribution does not have a closed form. We call this baseline TS.

Our results are reported in Figure 2. We plot the cumulative regret as a function of the number of experienced tasks $s$, as it accumulates round-by-round within tasks. The regret of algorithms that do not learn $\mu_*$, such as Oracle TS and TS, is linear in $s$, since they solve $s$ similar tasks using the same policy (Section 2). A lower slope of the regret indicates a better policy. Since Oracle TS is optimal in our problems, no algorithm can have regret sublinear in $s$.

In all plots in Figure 2, we observe significant gains due to meta-learning $P_*$. MetaTS outperforms TS, which does not adapt to $P_*$, and performs comparably to Oracle TS. This can be seen from the slope of the regret. Specifically, the slope of the MetaTS regret approaches that of Oracle TS after just a few tasks. The slopes of TS and Oracle TS do not change, as these methods do not adapt between tasks.

In Appendix E, we report additional experimental results. We observe that the benefits of meta-learning are preserved as the number of arms $K$ or dimensions $d$ increases. However, as expected, they diminish when the prior width $\sigma_0$ approaches the meta-prior width $\sigma_q$. In this case, there is little benefit from adapting to $P_*$ and all methods perform similarly. We also experiment with a misspecified MetaTS and show that the impact of the misspecification is relatively minor. This attests to the robustness of MetaTS.
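For concreteness, one way to script the per-task runs behind the Gaussian panel of Figure 2 is sketched below. This is our illustration, not the authors' code: the three agents differ only in the prior mean and width handed to the same per-task TS routine, and MetaTS additionally feeds the returned statistics into the `gaussian_meta_update` sketch from Section 3.3.

```python
import numpy as np

def gaussian_ts_task(theta, prior_mu, prior_width, sigma, n, rng):
    """Run Gaussian TS with prior N(prior_mu, prior_width^2 I) on one instance for n rounds.

    Returns per-arm pull counts and reward sums (the statistics MetaTS needs) and the realized regret.
    """
    K = len(theta)
    mu = prior_mu.astype(float).copy()
    var = np.full(K, float(prior_width) ** 2)
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for _ in range(n):
        arm = int(np.argmax(rng.normal(mu, np.sqrt(var))))   # one posterior sample per arm
        reward = rng.normal(theta[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        regret += theta.max() - theta[arm]
        # Conjugate per-arm posterior, recomputed from the prior and the sufficient statistics.
        var[arm] = 1.0 / (1.0 / prior_width**2 + counts[arm] / sigma**2)
        mu[arm] = var[arm] * (prior_mu[arm] / prior_width**2 + sums[arm] / sigma**2)
    return counts, sums, regret

# Priors of the three agents in the Gaussian experiment (sigma_q = 0.5, sigma_0 = 0.1):
#   Oracle TS:   prior_mu = mu_star,             prior_width = sigma_0
#   agnostic TS: prior_mu = zeros(K),            prior_width = sqrt(sigma_q**2 + sigma_0**2)
#   MetaTS:      prior_mu = mu_s drawn from Q_s, prior_width = sigma_0, then update Q_s after the task
```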
## 6. Related Work

The closest related work is that of Bastani et al. (2019), who propose a TS that learns an instance prior from a sequence of pricing experiments. Their approach is tailored to pricing and learns through forced exploration using a conservative variant of TS, resulting in a meta-learning algorithm that is more conservative and less general than our work. Bastani et al. (2019) also do not derive improved regret bounds due to meta-learning.

MetaTS is an instance of meta-learning (Thrun, 1996; 1998; Baxter, 1998; 2000), where the agent learns to act under an unknown prior $P_*$ from interactions with bandit instances. Earlier works on a similar topic are Azar et al. (2013) and Gentile et al. (2014), who proposed UCB algorithms for multi-task learning in the bandit setting. Multi-task learning in contextual bandits, where the arms are similar tasks, was studied by Deshmukh et al. (2017). Cella et al. (2020) proposed a LinUCB algorithm that meta-learns mean parameter vectors in linear models. Yang et al. (2020) studied a setting where the learning agent interacts with multiple bandit instances in parallel and tries to learn their shared subspace. A general template for sequential meta-learning is outlined in Ortega et al. (2019). Our work departs from most of the above approaches in two aspects. First, we have a posterior sampling algorithm that naturally represents the uncertainty in the unknown prior $P_*$. Second, we have a Bayes regret analysis. The shortcoming of the Bayes regret is that it is a weaker optimality criterion than the frequentist regret. To the best of our knowledge, this is the first work to propose meta-learning for Thompson sampling that is natural and has provable guarantees on improvement.

It is well known that the regret of bandit algorithms can be reduced by tuning (Vermorel & Mohri, 2005; Maes et al., 2012; Kuleshov & Precup, 2014; Hsu et al., 2019). All of these works are empirical and focus on the offline setting, where the bandit algorithms are optimized against a known instance distribution. Several recent approaches formulated the learning of bandit policies as policy-gradient optimization (Duan et al., 2016; Boutilier et al., 2020; Kveton et al., 2020; Yang & Toni, 2020; Min et al., 2020). Notably, both Kveton et al. (2020) and Min et al. (2020) proposed policy-gradient optimization of TS. These works are in the offline setting and have no global optimality guarantees, except for some special cases (Boutilier et al., 2020; Kveton et al., 2020).

## 7. Conclusions

Thompson sampling (Thompson, 1933), a very popular and practical bandit algorithm (Chapelle & Li, 2012; Agrawal & Goyal, 2012; Russo et al., 2018), is parameterized by a prior, which is specified by the algorithm designer. We study a more general setting where the designer can specify an uncertain prior, and the actual prior is learned from sequential interactions with bandit instances. We propose MetaTS, a computationally efficient algorithm for this problem. Our analysis of MetaTS shows the benefit of meta-learning and builds on a novel prior-dependent upper bound on the Bayes regret of Thompson sampling. MetaTS shows considerable promise in our synthetic experiments.

Our work is a step in the exciting direction of meta-learning state-of-the-art exploration algorithms with guarantees. It has several limitations that should be addressed in future work. First, our regret analysis relies only on a single pull of an arm per task. While this simplifies the analysis and is sufficient to show improvements due to meta-learning, it is inherently conservative. Second, our analysis is limited to Gaussian bandits and relies heavily on the properties of Gaussian posteriors. While we believe that a generalization is possible (Section 4.4), it is likely to be more algebraically demanding. Finally, we hope to analyze our method in contextual bandits. As we show in Appendix D, meta-posterior updates in linear bandits with Gaussian noise have a closed form. We believe that our work lays the foundations for a potential analysis of this approach, which could yield powerful contextual bandit algorithms that adapt to an unknown problem class.

## References

Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, pp. 39.1-39.26, 2012.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
Azar, M. G., Lazaric, A., and Brunskill, E. Sequential transfer in multi-armed bandit with finite set of models. In Advances in Neural Information Processing Systems 26, pp. 2220-2228, 2013.
Bastani, H., Simchi-Levi, D., and Zhu, R. Meta dynamic pricing: Transfer learning across experiments. CoRR, abs/1902.10918, 2019. URL https://arxiv.org/abs/1902.10918.
Baxter, J. Theoretical models of learning to learn. In Learning to Learn, pp. 71-94. Springer, 1998.
Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
Boutilier, C., Hsu, C.-W., Kveton, B., Mladenov, M., Szepesvari, C., and Zaheer, M. Differentiable meta-learning of bandit policies. In Advances in Neural Information Processing Systems 33, 2020.
Cella, L., Lazaric, A., and Pontil, M. Meta-learning with stochastic linear bandits. In Proceedings of the 37th International Conference on Machine Learning, 2020.
Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pp. 2249-2257, 2012.
Deshmukh, A. A., Dogan, U., and Scott, C. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems 30, pp. 4848-4856, 2017.
Duan, Y., Schulman, J., Chen, X., Bartlett, P., Sutskever, I., and Abbeel, P. RL^2: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016. URL http://arxiv.org/abs/1611.02779.
Gelman, A. and Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York, NY, 2007.
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. Chapman & Hall, 2013.
Gentile, C., Li, S., and Zappella, G. Online clustering of bandits. In Proceedings of the 31st International Conference on Machine Learning, pp. 757-765, 2014.
Hong, J., Kveton, B., Zaheer, M., Chow, Y., Ahmed, A., and Boutilier, C. Latent bandits revisited. In Advances in Neural Information Processing Systems 33, 2020.
Hsu, C.-W., Kveton, B., Meshi, O., Mladenov, M., and Szepesvari, C. Empirical Bayes regret minimization. CoRR, abs/1904.02664, 2019. URL http://arxiv.org/abs/1904.02664.
Kuleshov, V. and Precup, D. Algorithms for multi-armed bandit problems. CoRR, abs/1402.6028, 2014. URL http://arxiv.org/abs/1402.6028.
Kveton, B., Mladenov, M., Hsu, C.-W., Zaheer, M., Szepesvari, C., and Boutilier, C. Differentiable meta-learning in contextual bandits. CoRR, abs/2006.05094, 2020. URL http://arxiv.org/abs/2006.05094.
Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, 2019.
Lindley, D. and Smith, A. Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B (Methodological), 34(1):1-41, 1972.
Lu, X. and Van Roy, B. Information-theoretic confidence bounds for reinforcement learning. In Advances in Neural Information Processing Systems 32, 2019.
Maes, F., Wehenkel, L., and Ernst, D. Meta-learning of exploration/exploitation strategies: The multi-armed bandit case. In Proceedings of the 4th International Conference on Agents and Artificial Intelligence, pp. 100-115, 2012.
Min, S., Moallemi, C., and Russo, D. Policy gradient optimization of Thompson sampling policies. CoRR, abs/2006.16507, 2020. URL http://arxiv.org/abs/2006.16507.
Ortega, P., Wang, J., Rowland, M., Genewein, T., Kurth-Nelson, Z., Pascanu, R., Heess, N., Veness, J., Pritzel, A., Sprechmann, P., Jayakumar, S., McGrath, T., Miller, K., Azar, M. G., Osband, I., Rabinowitz, N., Gyorgy, A., Chiappa, S., Osindero, S., Teh, Y. W., van Hasselt, H., de Freitas, N., Botvinick, M., and Legg, S. Meta-learning of sequential strategies. CoRR, abs/1905.03030, 2019. URL http://arxiv.org/abs/1905.03030.
Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.
Russo, D., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1-96, 2018.
Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.
Thrun, S. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. PhD thesis, University of Bonn, 1996.
Thrun, S. Lifelong learning algorithms. In Learning to Learn, pp. 181-209. Springer, 1998.
Tulcea, C. I. Mesures dans les espaces produits. Atti Accad. Naz. Lincei Rend. Cl. Sci. Fis. Mat. Nat., 8(7), 1949.
Vermorel, J. and Mohri, M. Multi-armed bandit algorithms and empirical evaluation. In Proceedings of the 16th European Conference on Machine Learning, pp. 437-448, 2005.
Yang, J., Hu, W., Lee, J., and Du, S. Provable benefits of representation learning in linear bandits. CoRR, abs/2010.06531, 2020. URL http://arxiv.org/abs/2010.06531.
Yang, K. and Toni, L. Differentiable linear bandit algorithm. CoRR, abs/2006.03000, 2020. URL http://arxiv.org/abs/2006.03000.