# Finite-Time Logarithmic Bayes Regret Upper Bounds

Alexia Atsidakou (University of Texas, Austin), Branislav Kveton (AWS AI Labs), Sumeet Katariya (Amazon), Constantine Caramanis (University of Texas, Austin), Sujay Sanghavi (University of Texas, Austin / Amazon)*

*The work started at Amazon Search.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In a multi-armed bandit, we obtain $O(c \log n)$ and $O(c_h \log^2 n)$ upper bounds for an upper confidence bound algorithm, where $c_h$ and $c$ are constants depending on the prior distribution and on the gaps of bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while remaining simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of the prior in the Bayesian setting, both in the objective and as side information given to the learner. They significantly improve upon existing $O(\sqrt{n})$ bounds, which have become standard in the literature despite the logarithmic lower bound of Lai (1987).

## 1 Introduction

A stochastic multi-armed bandit [Lai and Robbins, 1985, Auer et al., 2002, Lattimore and Szepesvari, 2019] is an online learning problem where a learner sequentially interacts with an environment over $n$ rounds. In each round, the learner takes an action and receives its stochastic reward. The goal of the learner is to maximize its expected cumulative reward over $n$ rounds. The mean rewards of the actions are unknown a priori but can be learned by taking the actions. Therefore, the learner faces the exploration-exploitation dilemma: explore, and learn more about the actions; or exploit, and take the action with the highest estimated reward.

Bandits have been successfully applied to problems where uncertainty modeling and subsequent adaptation are beneficial. One example is recommender systems [Li et al., 2010, Zhao et al., 2013, Kawale et al., 2015, Li et al., 2016], where the actions are recommended items and their rewards are clicks. Another example is hyper-parameter optimization [Li et al., 2018], where the actions are values of the optimized parameters and their reward is the optimized metric.

Cumulative regret minimization in stochastic bandits has traditionally been studied in two settings: frequentist [Lai and Robbins, 1985, Auer et al., 2002, Abbasi-Yadkori et al., 2011] and Bayesian [Gittins, 1979, Tsitsiklis, 1994, Lai, 1987, Russo and Van Roy, 2014, Russo et al., 2018]. In the frequentist setting, the learner minimizes the regret with respect to a fixed unknown bandit instance. In the Bayesian setting, the learner minimizes the average regret with respect to bandit instances drawn from a prior distribution. The instance is unknown but the learner knows its prior distribution. The Bayesian setting allows surprisingly simple and insightful analyses of Thompson sampling. One fundamental result in this setting is that linear Thompson sampling [Russo and Van Roy, 2014] has a regret bound comparable to that of LinUCB in the frequentist setting [Abbasi-Yadkori et al., 2011, Agrawal and Goyal, 2013, Abeille and Lazaric, 2017].
Moreover, many recent meta- and multi-task bandit works [Bastani et al., 2019, Kveton et al., 2021, Basu et al., 2021, Simchowitz et al., 2021, Wang et al., 2021, Hong et al., 2022, Aouali et al., 2023] adopt the Bayes regret to analyze the stochastic structure of their problems, namely that the bandit tasks are similar because their parameters are sampled i.i.d. from a task distribution.

Many bandit algorithms have frequentist regret bounds that match a lower bound. As an example, in a $K$-armed bandit with minimum gap $\Delta_{\min}$ and horizon $n$, the gap-dependent $O(K \Delta_{\min}^{-1} \log n)$ regret bound of UCB1 [Auer et al., 2002] matches the gap-dependent $\Omega(K \Delta_{\min}^{-1} \log n)$ lower bound of Lai and Robbins [1985]. Moreover, the gap-free $O(\sqrt{Kn})$ regret bound of UCB1 matches, up to logarithmic factors, the gap-free $\Omega(\sqrt{Kn})$ lower bound of Auer et al. [1995]. The extra logarithmic factor in the $O(\sqrt{Kn})$ bound can be eliminated by modifying UCB1 [Audibert and Bubeck, 2009]. In contrast, and despite the popularity of the model, matching upper and lower bounds mostly do not exist in the Bayesian setting. Specifically, Lai [1987] proved asymptotic $c_h \log^2 n$ upper and lower bounds, where $c_h$ is a prior-dependent constant. However, all recent Bayes regret bounds are $O(\sqrt{n})$ [Russo and Van Roy, 2014, 2016, Lu and Van Roy, 2019, Hong et al., 2020, Kveton et al., 2021]. This leaves open the question of finite-time logarithmic regret bounds in the Bayesian setting. In this work, we answer this question positively and make the following contributions:

1. We derive the first finite-time logarithmic Bayes regret upper bounds for a Bayesian upper confidence bound (UCB) algorithm. The bounds are $O(c \log n)$ and $O(c_h \log^2 n)$, where $c_h$ and $c$ are constants depending on the prior distribution $h$ and the gaps of random bandit instances sampled from $h$, respectively. The latter matches the lower bound of Lai [1987] asymptotically. When compared to prior $O(\sqrt{n})$ bounds, we better characterize low-regret regimes, where the random gaps are large.
2. To show the value of the prior as side information, we also derive a finite-time logarithmic Bayes regret upper bound for a frequentist UCB algorithm. The bound changes only a little as the prior becomes more informative, while the regret bound for the Bayesian algorithm eventually goes to zero. The bounds match asymptotically when $n \to \infty$ and the prior is overtaken by data.
3. To show the generality of our approach, we prove an $O(d\, c \log^2 n)$ Bayes regret bound for a Bayesian linear bandit algorithm, where $d$ denotes the number of dimensions and $c$ is a constant depending on random gaps. This bound also improves with a better prior.
4. Our analyses are a major departure from all recent Bayesian bandit analyses, starting with Russo and Van Roy [2014]. Roughly speaking, we first bound the regret in a fixed bandit instance, similarly to frequentist analyses, and then integrate out the random gap.
5. We show the tightness of our bounds empirically and compare them to prior bounds.

This paper is organized as follows. In Section 2, we introduce the setting of Bayesian bandits. In Section 3, we present a Bayesian upper confidence bound algorithm called Bayes UCB [Kaufmann et al., 2012]. In Section 4, we derive finite-time logarithmic Bayes regret bounds for Bayes UCB, in both multi-armed and linear bandits. These are the first such bounds ever derived.
In Section 5, we compare our bounds to prior works and show that one matches an existing lower bound [Lai, 1987] asymptotically. In Section 6, we evaluate the bounds empirically. We conclude in Section 7.

## 2 Setting

We start by introducing our notation. Random variables are capitalized, except for Greek letters like $\theta$. For any positive integer $n$, we define $[n] = \{1, \ldots, n\}$. The indicator function is $\mathbb{1}\{\cdot\}$. The $i$-th entry of vector $v$ is $v_i$. If the vector is already indexed, such as $v_j$, we write $v_{j,i}$. We denote the maximum and minimum eigenvalues of a matrix $M \in \mathbb{R}^{d \times d}$ by $\lambda_1(M)$ and $\lambda_d(M)$, respectively.

Our setting is defined as follows. We have a multi-armed bandit [Lai and Robbins, 1985, Lai, 1987, Auer et al., 2002, Abbasi-Yadkori et al., 2011] with an action set $\mathcal{A}$. Each action $a \in \mathcal{A}$ is associated with a reward distribution $p_a(\cdot\,; \theta)$, which is parameterized by an unknown model parameter $\theta$ shared by all actions. The learner interacts with the bandit instance for $n$ rounds indexed by $t \in [n]$. In each round $t$, it takes an action $A_t \in \mathcal{A}$ and observes its stochastic reward $Y_t \sim p_{A_t}(\cdot\,; \theta)$. The rewards are sampled independently across the rounds. We denote the mean of $p_a(\cdot\,; \theta)$ by $\mu_a(\theta)$ and call it the mean reward of action $a$. The optimal action is $A_* = \arg\max_{a \in \mathcal{A}} \mu_a(\theta)$ and its mean reward is $\mu_*(\theta) = \mu_{A_*}(\theta)$. For a fixed model parameter $\theta$, the $n$-round regret of a policy is defined as

$$R(n; \theta) = \mathbb{E}\left[\sum_{t=1}^{n} \mu_*(\theta) - \mu_{A_t}(\theta)\right],$$

where the expectation is taken over both random observations $Y_t$ and actions $A_t$. The suboptimality gap of action $a$ is $\Delta_a = \mu_*(\theta) - \mu_a(\theta)$ and the minimum gap is $\Delta_{\min} = \min_{a \in \mathcal{A} \setminus \{A_*\}} \Delta_a$.

Two settings are common in stochastic bandits. In the frequentist setting [Lai and Robbins, 1985, Auer et al., 2002, Abbasi-Yadkori et al., 2011], the learner has no additional information about $\theta$ and its objective is to minimize the worst-case regret for any bounded $\theta$. We study the Bayesian setting [Gittins, 1979, Lai, 1987, Russo and Van Roy, 2014, Russo et al., 2018], where the model parameter $\theta$ is drawn from a prior distribution $h$ that is given to the learner as side information. The goal of the learner is to minimize the $n$-round Bayes regret $R(n) = \mathbb{E}[R(n; \theta)]$, where the expectation is taken over the random model parameter $\theta \sim h$. Note that $A_*$, $\Delta_a$, and $\Delta_{\min}$ are random because they depend on the random instance $\theta$.

## 3 Algorithm

We study a Bayesian upper confidence bound algorithm called Bayes UCB [Kaufmann et al., 2012]. The algorithm was analyzed in the Bayesian setting by Russo and Van Roy [2014]. The key idea in Bayes UCB is to take the action with the highest UCB with respect to the posterior distribution of model parameter $\theta$. This differentiates it from frequentist algorithms, such as UCB1 [Auer et al., 2002] and LinUCB [Abbasi-Yadkori et al., 2011], where the UCBs are computed using a frequentist maximum likelihood estimate (MLE) of the model parameter.

Let $H_t = (A_\ell, Y_\ell)_{\ell \in [t-1]}$ be the history of taken actions and their observed rewards up to round $t$. The Bayesian UCB for the mean reward of action $a$ at round $t$ is

$$U_{t,a} = \mu_a(\hat\theta_t) + C_{t,a}\,, \qquad (1)$$

where $\hat\theta_t$ is the posterior mean estimate of $\theta$ at round $t$ and $C_{t,a}$ is a confidence interval width for action $a$ at round $t$.

Algorithm 1 Bayes UCB
1: for t = 1, ..., n do
2:   Compute the posterior distribution of θ using prior h and history H_t
3:   for each action a ∈ A do
4:     Compute U_{t,a} according to (1)
5:   Take action A_t = argmax_{a ∈ A} U_{t,a} and observe its reward Y_t
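To make the loop in Algorithm 1 concrete, here is a minimal Python sketch of the generic Bayes UCB template. The hooks `posterior_update`, `ucb_index`, and `observe` are placeholder names of ours for the posterior computation, the index in (1), and the environment; they are not part of the paper.

```python
import numpy as np

def bayes_ucb(n, actions, posterior_update, ucb_index, observe):
    """Generic Bayes UCB loop: maintain a posterior and act greedily w.r.t. U_{t,a}."""
    history = []                                   # H_t = (A_l, Y_l) for l < t
    for t in range(n):
        posterior = posterior_update(history)      # posterior of theta given prior h and H_t
        ucbs = np.array([ucb_index(posterior, a) for a in actions])  # U_{t,a} as in (1)
        a_t = actions[int(np.argmax(ucbs))]        # A_t = argmax_a U_{t,a}
        y_t = observe(a_t)                         # observe Y_t ~ p_{A_t}(. ; theta)
        history.append((a_t, y_t))
    return history
```

The next sections instantiate the two hooks with closed-form posteriors.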
The posterior distribution of model parameter $\theta$ is computed from the prior $h$ and history $H_t$ using Bayes rule. The width is chosen so that $|\mu_a(\hat\theta_t) - \mu_a(\theta)| \leq C_{t,a}$ holds with a high probability conditioned on any history $H_t$. Technically speaking, $C_{t,a}$ is a half-width but we call it a width to simplify terminology.

Our algorithm is presented in Algorithm 1. We instantiate it in a Gaussian bandit in Section 3.1, in a Bernoulli bandit in Section 3.2, and in a linear bandit with Gaussian rewards in Section 3.3. These settings are of practical interest because they lead to computationally efficient implementations that can be analyzed due to closed-form posteriors [Lu and Van Roy, 2019, Kveton et al., 2021, Basu et al., 2021, Wang et al., 2021, Hong et al., 2022]. While we focus on deriving logarithmic Bayes regret bounds for Bayes UCB, we believe that similar analyses can be done for Thompson sampling [Thompson, 1933, Chapelle and Li, 2012, Agrawal and Goyal, 2012, 2013, Russo and Van Roy, 2014, Russo et al., 2018]. This extension is non-trivial because a key step in our analysis is that the action with the highest UCB is taken (Section 5.3).

### 3.1 Gaussian Bandit

In a $K$-armed Gaussian bandit, the action set is $\mathcal{A} = [K]$ and the model parameter is $\theta \in \mathbb{R}^K$. Each action $a \in \mathcal{A}$ has a Gaussian reward distribution, $p_a(\cdot\,; \theta) = \mathcal{N}(\cdot\,; \theta_a, \sigma^2)$, where $\theta_a$ is its mean and $\sigma > 0$ is a known reward noise. Thus $\mu_a(\theta) = \theta_a$. The model parameter $\theta$ is drawn from a known Gaussian prior $h(\cdot) = \mathcal{N}(\cdot\,; \mu_0, \sigma_0^2 I_K)$, where $\mu_0 \in \mathbb{R}^K$ is a vector of prior means and $\sigma_0 > 0$ is a prior width.

The posterior distribution of the mean reward of action $a$ at round $t$ is $\mathcal{N}(\cdot\,; \hat\theta_{t,a}, \hat\sigma_{t,a}^2)$, where $\hat\sigma_{t,a}^2 = (\sigma_0^{-2} + \sigma^{-2} N_{t,a})^{-1}$ is the posterior variance, $N_{t,a} = \sum_{\ell=1}^{t-1} \mathbb{1}\{A_\ell = a\}$ is the number of observations of action $a$ up to round $t$, and

$$\hat\theta_{t,a} = \hat\sigma_{t,a}^2 \left(\sigma_0^{-2} \mu_{0,a} + \sigma^{-2} \sum_{\ell=1}^{t-1} \mathbb{1}\{A_\ell = a\}\, Y_\ell\right)$$

is the posterior mean. This follows from a classic result, that the posterior distribution of the mean of a Gaussian random variable with a Gaussian prior is a Gaussian [Bishop, 2006]. The Bayesian UCB of action $a$ at round $t$ is $U_{t,a} = \hat\theta_{t,a} + C_{t,a}$, where $C_{t,a} = \sqrt{2 \hat\sigma_{t,a}^2 \log(1/\delta)}$ is the confidence interval width and $\delta \in (0, 1)$ is a failure probability of the confidence interval.

### 3.2 Bernoulli Bandit

In a $K$-armed Bernoulli bandit, the action set is $\mathcal{A} = [K]$ and the model parameter is $\theta \in \mathbb{R}^K$. Each action $a \in \mathcal{A}$ has a Bernoulli reward distribution, $p_a(\cdot\,; \theta) = \mathrm{Ber}(\cdot\,; \theta_a)$, where $\theta_a$ is its mean. Hence $\mu_a(\theta) = \theta_a$. Each parameter $\theta_a$ is drawn from a known prior $\mathrm{Beta}(\cdot\,; \alpha_a, \beta_a)$, where $\alpha_a > 0$ and $\beta_a > 0$ are positive and negative prior pseudo-counts, respectively.

The posterior distribution of the mean reward of action $a$ at round $t$ is $\mathrm{Beta}(\cdot\,; \alpha_{t,a}, \beta_{t,a})$, where

$$\alpha_{t,a} = \alpha_a + \sum_{\ell=1}^{t-1} \mathbb{1}\{A_\ell = a\}\, Y_\ell\,, \qquad \beta_{t,a} = \beta_a + \sum_{\ell=1}^{t-1} \mathbb{1}\{A_\ell = a\}\, (1 - Y_\ell)\,.$$

This follows from a classic result, that the posterior distribution of the mean of a Bernoulli random variable with a beta prior is a beta distribution [Bishop, 2006]. The corresponding Bayesian UCB is $U_{t,a} = \hat\theta_{t,a} + C_{t,a}$, where

$$\hat\theta_{t,a} = \frac{\alpha_{t,a}}{\alpha_{t,a} + \beta_{t,a}}\,, \qquad C_{t,a} = \sqrt{\frac{\log(1/\delta)}{2(\alpha_{t,a} + \beta_{t,a} + 1)}} = \sqrt{\frac{\log(1/\delta)}{2(\alpha_a + \beta_a + N_{t,a} + 1)}}$$

denote the posterior mean and confidence interval width, respectively, of action $a$ at round $t$; and $\delta \in (0, 1)$ is a failure probability of the confidence interval. The confidence interval is derived using the fact that $\mathrm{Beta}(\cdot\,; \alpha_{t,a}, \beta_{t,a})$ is a sub-Gaussian distribution with variance proxy $\frac{1}{4(\alpha_a + \beta_a + N_{t,a} + 1)}$ [Marchal and Arbel, 2017].
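The closed-form indices above translate directly into code. The following sketch implements the Gaussian and Bernoulli UCBs of Sections 3.1 and 3.2; the function names and the array-based interface are our own choices, not part of the paper.

```python
import numpy as np

def gaussian_ucb(mu0, sigma0, sigma, counts, reward_sums, delta):
    """U_{t,a} in a Gaussian bandit: posterior mean plus sqrt(2 * post_var * log(1/delta))."""
    post_var = 1.0 / (1.0 / sigma0**2 + counts / sigma**2)              # hat{sigma}^2_{t,a}
    post_mean = post_var * (mu0 / sigma0**2 + reward_sums / sigma**2)   # hat{theta}_{t,a}
    return post_mean + np.sqrt(2.0 * post_var * np.log(1.0 / delta))

def bernoulli_ucb(alpha0, beta0, counts, reward_sums, delta):
    """U_{t,a} in a Bernoulli bandit with Beta(alpha_a, beta_a) priors."""
    alpha = alpha0 + reward_sums                  # alpha_{t,a}
    beta = beta0 + counts - reward_sums           # beta_{t,a}
    post_mean = alpha / (alpha + beta)            # hat{theta}_{t,a}
    width = np.sqrt(np.log(1.0 / delta) / (2.0 * (alpha + beta + 1.0)))
    return post_mean + width
```

Both functions take per-action pull counts and reward sums, so they can be plugged into the generic loop shown after Algorithm 1.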
### 3.3 Linear Bandit with Gaussian Rewards

We also study linear bandits [Dani et al., 2008, Abbasi-Yadkori et al., 2011] with a finite number of actions $\mathcal{A} \subset \mathbb{R}^d$ in $d$ dimensions. The model parameter is $\theta \in \mathbb{R}^d$. All actions $a \in \mathcal{A}$ have Gaussian reward distributions, $p_a(\cdot\,; \theta) = \mathcal{N}(\cdot\,; a^\top \theta, \sigma^2)$, where $\sigma > 0$ is a known reward noise. Therefore, the mean reward of action $a$ is $\mu_a(\theta) = a^\top \theta$. The parameter $\theta$ is drawn from a known multivariate Gaussian prior $h(\cdot) = \mathcal{N}(\cdot\,; \theta_0, \Sigma_0)$, where $\theta_0 \in \mathbb{R}^d$ is its mean and $\Sigma_0 \in \mathbb{R}^{d \times d}$ is its covariance, represented by a positive semi-definite (PSD) matrix.

The posterior distribution of $\theta$ at round $t$ is $\mathcal{N}(\cdot\,; \hat\theta_t, \hat\Sigma_t)$, where

$$\hat\theta_t = \hat\Sigma_t \left(\Sigma_0^{-1} \theta_0 + \sigma^{-2} \sum_{\ell=1}^{t-1} A_\ell Y_\ell\right), \qquad \hat\Sigma_t = \left(\Sigma_0^{-1} + G_t\right)^{-1}, \qquad G_t = \sigma^{-2} \sum_{\ell=1}^{t-1} A_\ell A_\ell^\top\,.$$

Here $\hat\theta_t$ and $\hat\Sigma_t$ are the posterior mean and covariance of $\theta$, respectively, and $G_t$ is the scaled sum of outer products of the feature vectors of the taken actions up to round $t$. These formulas follow from a classic result, that the posterior distribution of a linear model parameter with a Gaussian prior and observations is a Gaussian [Bishop, 2006]. The Bayesian UCB of action $a$ at round $t$ is $U_{t,a} = a^\top \hat\theta_t + C_{t,a}$, where $C_{t,a} = \sqrt{2 \log(1/\delta)}\, \|a\|_{\hat\Sigma_t}$ is the confidence interval width, $\delta \in (0, 1)$ is a failure probability of the confidence interval, and $\|a\|_M = \sqrt{a^\top M a}$.

## 4 Logarithmic Bayes Regret Upper Bounds

In this section, we present finite-time logarithmic Bayes regret bounds for Bayes UCB. We derive them for both $K$-armed and linear bandits. One bound matches an existing lower bound of Lai [1987] asymptotically and all improve upon prior $O(\sqrt{n})$ bounds. We discuss this in detail in Section 5.

### 4.1 Bayes UCB in Gaussian Bandit

Our first regret bound is for Bayes UCB in a $K$-armed Gaussian bandit. It depends on random gaps. To control the gaps, we clip them as $\Delta_a^\varepsilon = \max\{\Delta_a, \varepsilon\}$. The bound is stated below.

Theorem 1. For any $\varepsilon > 0$ and $\delta \in (0, 1)$, the $n$-round Bayes regret of Bayes UCB in a $K$-armed Gaussian bandit is bounded as

$$R(n) \leq \sum_{a \in \mathcal{A}} \mathbb{E}\left[\left(\frac{8 \sigma^2 \log(1/\delta)}{\Delta_a^{\varepsilon}} - \frac{\sigma^2 \Delta_a^{\varepsilon}}{\sigma_0^2}\right) \mathbb{1}\{\Delta_a \geq \varepsilon\}\right] + C\,,$$

where $C = \varepsilon n + 2\left(\sqrt{2 \log(1/\delta)} + 2K\right) \sigma_0 K n \delta$ is a low-order term.

The proof is in Appendix A.1. For $\varepsilon = 1/n$ and $\delta = 1/n$, the bound is $O(c \log n)$, where $c$ is a constant depending on the gaps of random bandit instances. The dependence on $\sigma_0$ in the low-order term $C$ can be reduced to $\min\{\sigma_0, \sigma\}$ by a more elaborate analysis, where the regret of taking each action for the first time is bounded separately. This also applies to Corollary 2.
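Since the bound in Theorem 1 is an expectation over the prior, it can be estimated by Monte Carlo; this is essentially how "leading term" curves like those in Section 6 can be produced. The sketch below does this for the Gaussian case with $\varepsilon = \delta = 1/n$. The restriction to arms with $\Delta_a \geq \varepsilon$ and the clipping of negative contributions at zero reflect our reading of the theorem and are assumptions of this sketch, as are the function and parameter names.

```python
import numpy as np

def theorem1_leading_term(mu0, sigma0, sigma, n, num_samples=10_000, seed=0):
    """Monte Carlo estimate of the sum in Theorem 1 for eps = delta = 1/n."""
    rng = np.random.default_rng(seed)
    eps, delta = 1.0 / n, 1.0 / n
    total = 0.0
    for _ in range(num_samples):
        theta = rng.normal(mu0, sigma0)            # random instance theta ~ h
        gaps = theta.max() - theta                 # Delta_a
        keep = gaps >= eps                         # arms with a non-trivial gap
        gaps_eps = np.maximum(gaps[keep], eps)     # clipped gaps Delta_a^eps
        term = 8.0 * sigma**2 * np.log(1.0 / delta) / gaps_eps \
               - sigma**2 * gaps_eps / sigma0**2
        total += np.maximum(term, 0.0).sum()       # clip negative contributions at zero
    return total / num_samples
```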
Now we derive an upper bound on Theorem 1 that eliminates the dependence on random gaps. To state it, we need to introduce additional notation. For any action $a$, we denote all action parameters except for $a$ by $\theta_{-a} = (\theta_1, \ldots, \theta_{a-1}, \theta_{a+1}, \ldots, \theta_K)$ and the corresponding optimal value in $\theta_{-a}$ by $\theta_{-a}^* = \max_{j \in \mathcal{A} \setminus \{a\}} \theta_j$. We denote by $h_a$ the prior density of $\theta_a$ and by $h_{-a}$ the prior density of $\theta_{-a}$. Since the prior is factored (Section 3.1), note that $h(\theta) = h_a(\theta_a)\, h_{-a}(\theta_{-a})$ for any $\theta$ and action $a$.

To keep the result clean, we state it for a sufficiently large prior variance. A complete statement for all prior variances is given in Appendix B. We note that the setting of small prior variances favors Bayesian algorithms since their regret decreases with a more informative prior. In fact, we show in Appendix B that the regret of Bayes UCB is $O(1)$ for a sufficiently small $\sigma_0$.

Corollary 2. Let $\sigma_0^2 \geq \frac{1}{8 \log(1/\delta)\, n^2 \log\log n}$. Then there exist functions $\xi_a: \mathbb{R} \to \left[\frac{1}{n}, \frac{1}{\sqrt{\log n}}\right]$ such that the $n$-round Bayes regret of Bayes UCB in a $K$-armed Gaussian bandit is bounded as

$$R(n) \leq \left(8 \sigma^2 \log(1/\delta) \log n - \frac{\sigma^2}{2 \sigma_0^2 \log n}\right) \sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a\!\left(\theta_{-a}^* - \xi_a(\theta_{-a}^*)\right) h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a} + C\,,$$

where $C = 8 \sigma^2 K \log(1/\delta) \sqrt{\log n} + 2\left(\sqrt{2 \log(1/\delta)} + 2K\right) \sigma_0 K n \delta + 1$ is a low-order term.

The proof is in Appendix A.2. For $\delta = 1/n$, the bound is $O(c_h \log^2 n)$, where $c_h$ depends on the prior $h$ but not on the gaps of random bandit instances. This bound is motivated by Lai [1987]. The terms $\xi_a$ arise due to the intermediate value theorem for the function $h_a$. Similar terms appear in Lai [1987] but vanish in their final asymptotic claims. The rate $1/\sqrt{\log n}$ in the definition of $\xi_a$ cannot be reduced to $1/\mathrm{polylog}(n)$ without increasing the dependence on $n$ in other parts of the bound.

The complexity term $\sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a(\theta_{-a}^* - \xi_a(\theta_{-a}^*))\, h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a}$ in Corollary 2 is the same as in Lai [1987] and can be interpreted as follows. Consider the asymptotic regime of $n \to \infty$. Then, since the range of $\xi_a$ is $\left[\frac{1}{n}, \frac{1}{\sqrt{\log n}}\right]$, the term simplifies to $\sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a(\theta_{-a}^*)\, h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a}$ and can be viewed as the distance between prior means. In a Gaussian bandit with $K = 2$ actions, it has a closed form of $\frac{1}{\sqrt{\pi \sigma_0^2}} \exp\left[-\frac{(\mu_{0,1} - \mu_{0,2})^2}{4 \sigma_0^2}\right]$. A general upper bound for $K > 2$ actions is given below.

Lemma 3. In a $K$-armed Gaussian bandit with prior $h(\cdot) = \mathcal{N}(\cdot\,; \mu_0, \sigma_0^2 I_K)$, we have

$$\sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a(\theta_{-a}^*)\, h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a} \leq \frac{1}{2\sqrt{\pi \sigma_0^2}} \sum_{a \in \mathcal{A}} \sum_{a' \neq a} \exp\left[-\frac{(\mu_{0,a} - \mu_{0,a'})^2}{4 \sigma_0^2}\right].$$

The bound is proved in Appendix A.3 and has several interesting properties that capture low-regret regimes. First, as the prior becomes more informative and concentrated, $\sigma_0 \to 0$, the bound goes to zero. Second, when the gaps of bandit instances sampled from the prior are large, low regret is also expected. This can happen when the prior means become more separated, $|\mu_{0,a} - \mu_{0,a'}| \to \infty$, or the prior becomes wider, $\sigma_0 \to \infty$. Our bound goes to zero in both of these cases. This also implies that Bayes regret bounds are not necessarily monotone in prior parameters, such as $\sigma_0$.

### 4.2 UCB1 in Gaussian Bandit

Using a similar approach, we prove a Bayes regret bound for UCB1 [Auer et al., 2002]. We view it as Bayes UCB (Section 3.1) where $\sigma_0 = \infty$ and each action $a \in \mathcal{A}$ is initially taken once at round $t = a$. This generalizes classic UCB1 to $\sigma^2$-sub-Gaussian noise. An asymptotic Bayes regret bound for UCB1 was proved by Lai [1987] (claim (i) in their Theorem 3). We derive a finite-time prior-dependent Bayes regret bound below.

Theorem 4. There exist functions $\xi_a: \mathbb{R} \to \left[\frac{1}{n}, \frac{1}{\sqrt{\log n}}\right]$ such that the $n$-round Bayes regret of UCB1 in a $K$-armed Gaussian bandit is bounded as

$$R(n) \leq 8 \sigma^2 \log(1/\delta) \log n \sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a\!\left(\theta_{-a}^* - \xi_a(\theta_{-a}^*)\right) h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a} + C\,,$$

where $C = 8 \sigma^2 K \log(1/\delta) \sqrt{\log n} + 2\left(\sqrt{2 \log(1/\delta)} + 2K\right) \sigma K n \delta + \sum_{a \in \mathcal{A}} \mathbb{E}[\Delta_a] + 1$.

The proof is in Appendix A.4. For $\delta = 1/n$, the bound is $O(c_h \log^2 n)$ and similar to Corollary 2. The main difference is the additional $\frac{\sigma^2}{2 \sigma_0^2 \log n}$ term in Corollary 2, which decreases the bound. This means that the regret of Bayes UCB improves as $\sigma_0$ decreases while that of UCB1 may not change much. In fact, the regret bound of Bayes UCB is $O(1)$ as $\sigma_0 \to 0$ (Appendix B) while that of UCB1 remains logarithmic. This is expected because Bayes UCB has more information about the random instance $\theta$ as $\sigma_0$ decreases, while the frequentist algorithm is oblivious to the prior.
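The contrast between Bayes UCB and UCB1 comes down to their confidence widths: with a finite $\sigma_0$, the posterior variance $(\sigma_0^{-2} + \sigma^{-2} N_{t,a})^{-1}$ is never larger than the frequentist $\sigma^2 / N_{t,a}$. A small sketch with our own function names illustrates this.

```python
import numpy as np

def bayes_ucb_width(counts, sigma, sigma0, delta):
    """Bayes UCB width from Section 3.1."""
    post_var = 1.0 / (1.0 / sigma0**2 + counts / sigma**2)   # hat{sigma}^2_{t,a}
    return np.sqrt(2.0 * post_var * np.log(1.0 / delta))

def ucb1_width(counts, sigma, delta):
    """UCB1 width: the sigma0 -> infinity limit, where the variance is sigma^2 / N_{t,a}."""
    return np.sqrt(2.0 * sigma**2 * np.log(1.0 / delta) / counts)

counts = np.arange(1, 11)
print(bayes_ucb_width(counts, sigma=1.0, sigma0=0.5, delta=1e-3))
print(ucb1_width(counts, sigma=1.0, delta=1e-3))   # always at least as wide
```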
### 4.3 Bayes UCB in Bernoulli Bandit

Theorem 1 and Corollary 2 can be straightforwardly extended to Bernoulli bandits because $\mathbb{P}\left(|\theta_a - \hat\theta_{t,a}| \geq C_{t,a} \mid H_t\right) \leq 2\delta$ holds for any action $a$ and history $H_t$ (Section 3.2). We state the extension below and prove it in Appendix A.5.

Theorem 5. For any $\varepsilon > 0$ and $\delta \in (0, 1)$, the $n$-round Bayes regret of Bayes UCB in a $K$-armed Bernoulli bandit is bounded as

$$R(n) \leq \sum_{a \in \mathcal{A}} \mathbb{E}\left[\left(\frac{2 \log(1/\delta)}{\Delta_a^{\varepsilon}} - (\alpha_a + \beta_a + 1)\, \Delta_a^{\varepsilon}\right) \mathbb{1}\{\Delta_a \geq \varepsilon\}\right] + C\,,$$

where $C = \varepsilon n + 2 K n \delta$ is a low-order term. Moreover, let $\lambda = \min_{a \in \mathcal{A}} \alpha_a + \beta_a + 1$ and $\lambda \leq 2 \log(1/\delta)\, n^2 \log\log n$. Then there exist functions $\xi_a: \mathbb{R} \to \left[\frac{1}{n}, \frac{1}{\sqrt{\log n}}\right]$ such that

$$R(n) \leq \left(2 \log(1/\delta) \log n - \frac{\lambda}{2 \log n}\right) \sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a\!\left(\theta_{-a}^* - \xi_a(\theta_{-a}^*)\right) h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a} + C'\,,$$

where $C' = 2 K \log(1/\delta) \sqrt{\log n} + 2 K n \delta + 1$ is a low-order term.

### 4.4 Bayes UCB in Linear Bandit

Now we present a gap-dependent Bayes regret bound for Bayes UCB in a linear bandit with a finite number of actions. The bound depends on a random minimum gap. To control the gap, we clip it as $\Delta_{\min}^\varepsilon = \max\{\Delta_{\min}, \varepsilon\}$.

Theorem 6. Suppose that $\|\theta\|_2 \leq L$ holds with probability at least $1 - \delta'$. Let $\|a\|_2 \leq \bar{L}$ for any action $a \in \mathcal{A}$. Then for any $\varepsilon > 0$ and $\delta \in (0, 1)$, the $n$-round Bayes regret of linear Bayes UCB is bounded as

$$R(n) \leq 8\, \mathbb{E}\left[\frac{1}{\Delta_{\min}^{\varepsilon}}\right] \frac{\sigma_{0,\max}^2\, d}{\log\!\left(1 + \sigma^{-2} \sigma_{0,\max}^2\right)} \log\!\left(1 + \frac{\sigma^{-2} \sigma_{0,\max}^2\, n}{d}\right) \log(1/\delta) + \varepsilon n + 4 L \bar{L} K n \delta$$

with probability at least $1 - \delta'$, where $\sigma_{0,\max} = \sqrt{\lambda_1(\Sigma_0)}\, \bar{L}$.

The proof is in Appendix A.6. For $\varepsilon = 1/n$ and $\delta = 1/n$, the bound is $O(d\, c \log^2 n)$, where $c$ is a constant depending on the gaps of random bandit instances. The bound is remarkably similar to the frequentist $O(d \Delta_{\min}^{-1} \log^2 n)$ bound in Theorem 5 of Abbasi-Yadkori et al. [2011], where $\Delta_{\min}$ is the minimum gap. There are two differences. First, we integrate $\frac{1}{\Delta_{\min}}$ over the prior. Second, our bound decreases as the prior becomes more informative, $\sigma_{0,\max} \to 0$. In a Gaussian bandit, the bound becomes $O(K\, \mathbb{E}[1/\Delta_{\min}^{\varepsilon}] \log^2 n)$. Therefore, it is comparable to Theorem 1 up to an additional logarithmic factor in $n$. This is due to a more general proof technique, which allows for dependencies between the mean rewards of actions. We also note that Theorem 6 does not assume that the prior is factored, unlike Theorem 1.
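For completeness, the linear-Gaussian posterior update of Section 3.3, which underlies the algorithm analyzed in Theorem 6, can be sketched as follows. The function name and the dense matrix inversions are our own simplifications; an actual implementation would update $\hat\Sigma_t$ incrementally.

```python
import numpy as np

def linear_bayes_ucb_index(actions, theta0, Sigma0, sigma, history, delta):
    """Return U_{t,a} = a^T theta_hat_t + sqrt(2 log(1/delta)) * ||a||_{Sigma_hat_t}.

    actions: (K, d) array of action feature vectors; history: list of (a_vec, reward).
    """
    d = len(theta0)
    G = np.zeros((d, d))
    b = np.zeros(d)
    for a_vec, y in history:                      # history H_t = (A_l, Y_l)
        G += np.outer(a_vec, a_vec) / sigma**2    # G_t
        b += a_vec * y / sigma**2
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_hat = np.linalg.inv(Sigma0_inv + G)                 # hat{Sigma}_t
    theta_hat = Sigma_hat @ (Sigma0_inv @ theta0 + b)         # hat{theta}_t
    norms = np.sqrt(np.einsum('ij,jk,ik->i', actions, Sigma_hat, actions))  # ||a||_{Sigma_hat_t}
    return actions @ theta_hat + np.sqrt(2.0 * np.log(1.0 / delta)) * norms
```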
## 5 Comparison to Prior Works

This section is organized as follows. In Section 5.1, we show that the bound in Theorem 5 matches an existing lower bound of Lai [1987] asymptotically. In Section 5.2, we compare our logarithmic bounds to prior $O(\sqrt{n})$ bounds. Finally, in Section 5.3, we outline the key steps in our analyses and how they differ from prior works.

### 5.1 Matching Lower Bound

In frequentist bandit analyses, it is standard to compare asymptotic lower bounds to finite-time upper bounds because finite-time logarithmic lower bounds do not exist [Lattimore and Szepesvari, 2019]. We follow the same approach when arguing that our finite-time upper bounds are order optimal. The results in Lai [1987] are for single-parameter exponential-family reward distributions, which excludes Gaussian rewards. Therefore, we argue about the tightness of Theorem 5 only. Specifically, we take the second bound in Theorem 5, set $\delta = 1/n$, and let $n \to \infty$. In this case, $\frac{\lambda}{2 \log n} \to 0$ and $\xi_a(\cdot) \to 0$, and the bound matches up to constant factors the lower bound in Lai [1987] (claim (ii) in their Theorem 3), which is

$$\log^2 n \sum_{a \in \mathcal{A}} \int_{\theta_{-a}} h_a(\theta_{-a}^*)\, h_{-a}(\theta_{-a})\, \mathrm{d}\theta_{-a}\,. \qquad (2)$$

Lai [1987] also matched this lower bound with an asymptotic upper bound for a frequentist policy.

Our finite-time upper bounds also reveal an interesting difference from the asymptotic lower bound in (2), which may deserve more future attention. More specifically, the regret bound of Bayes UCB (Corollary 2) improves with prior information while that of UCB1 (Theorem 4) does not. We observe these improvements empirically as well (Section 6 and Appendix D). However, both bounds are the same asymptotically. This is because the benefit of knowing the prior vanishes in asymptotic analyses, since $\frac{\sigma^2}{2 \sigma_0^2 \log n} \to 0$ in Corollary 2 as $n \to \infty$. This motivates the need for finite-time logarithmic Bayes regret lower bounds, which do not exist.

### 5.2 Prior Bayes Regret Upper Bounds

Theorem 1 and Corollary 2 are major improvements upon existing $O(\sqrt{n})$ bounds. For instance, take the prior-dependent bound in Lemma 4 of Kveton et al. [2021], which holds for both Bayes UCB and Thompson sampling due to a well-known equivalence of their analyses [Russo and Van Roy, 2014, Hong et al., 2020]. For $\delta = 1/n$, their leading term becomes

$$\sqrt{2 \sigma^2 K \log n}\left(\sqrt{n + \sigma^2 \sigma_0^{-2} K} - \sqrt{\sigma^2 \sigma_0^{-2} K}\right). \qquad (3)$$

Similarly to Theorem 1 and Corollary 2, (3) decreases as the prior concentrates and becomes more informative, $\sigma_0 \to 0$. However, the bound is $O(\sqrt{n})$. Moreover, it does not depend on the prior means $\mu_0$ or the gaps of random bandit instances. Therefore, it cannot capture low-regret regimes due to large random gaps $\Delta_a^{\varepsilon}$ in Theorem 1 or a small complexity term in Corollary 2. We demonstrate this empirically in Section 6.

When the random gaps $\Delta_a^{\varepsilon}$ in Theorem 1 are small or the complexity term in Corollary 2 is large, our bounds can be worse than $O(\sqrt{n})$ bounds. This is analogous to the relation of the gap-dependent and gap-free frequentist bounds [Lattimore and Szepesvari, 2019]. Specifically, a gap-dependent bound of UCB1 in a $K$-armed bandit with 1-sub-Gaussian rewards (their Theorem 7.1) is $O(K \Delta_{\min}^{-1} \log n)$, where $\Delta_{\min}$ is the minimum gap. A corresponding gap-free bound (their Theorem 7.2) is $O(\sqrt{K n \log n})$. The latter is smaller when the gap is small, $\Delta_{\min} = o(\sqrt{(K \log n)/n})$. To get the best bound, the minimum of the two should be taken, and the same is true in our Bayesian setting.

No prior-dependent Bayes regret lower bound exists in linear bandits. Thus we treat $\mathbb{E}[1/\Delta_{\min}^{\varepsilon}]$ in Theorem 6 as the complexity term and do not bound it further as in Corollary 2. To compare our bound fairly to existing $O(\sqrt{n})$ bounds, we derive an $O(\sqrt{n})$ bound in Appendix C, by a relatively minor change in the proof of Theorem 6. A similar bound can be obtained by adapting the proofs of Lu and Van Roy [2019] and Hong et al. [2022] to a linear bandit with a finite number of actions. The leading term of the bound is

$$\sqrt{\frac{2 \sigma_{0,\max}^2\, d\, n}{\log\!\left(1 + \sigma^{-2} \sigma_{0,\max}^2\right)} \log\!\left(1 + \frac{\sigma^{-2} \sigma_{0,\max}^2\, n}{d}\right) \log(1/\delta)}\,. \qquad (4)$$

Similarly to Theorem 6, (4) decreases as the prior becomes more informative, $\sigma_{0,\max} \to 0$. However, the bound is $O(\sqrt{n})$ and does not depend on the gaps of random bandit instances. Hence it cannot capture low-regret regimes due to a large random minimum gap $\Delta_{\min}$ in Theorem 6. We validate this empirically in Section 6.
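The take-the-minimum recommendation above can be illustrated numerically. The sketch below compares only the scalings $K \Delta_{\min}^{-1} \log n$ and $\sqrt{Kn \log n}$ with constants dropped, so it is illustrative of the crossover behavior rather than a faithful evaluation of the bounds.

```python
import numpy as np

def best_frequentist_scaling(K, n, gap_min):
    """Minimum of the gap-dependent and gap-free scalings (constants omitted)."""
    gap_dependent = K * np.log(n) / gap_min
    gap_free = np.sqrt(K * n * np.log(n))
    return min(gap_dependent, gap_free)

n, K = 1000, 10
for gap in [0.01, 0.1, 1.0]:
    # for very small gaps the gap-free term wins; for large gaps the gap-dependent one does
    print(gap, best_frequentist_scaling(K, n, gap))
```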
### 5.3 Technical Novelty

All modern Bayesian analyses follow Russo and Van Roy [2014], who derived the first finite-time $O(\sqrt{n})$ Bayes regret bounds for Bayes UCB and Thompson sampling. The key idea in their analyses is that, conditioned on history, the optimal and taken actions are identically distributed, and that the upper confidence bounds are deterministic functions of the history. This is where the randomness of instances in Bayesian bandits is used. Using this, the regret at round $t$ is bounded by the confidence interval width of the taken action, and the usual $O(\sqrt{n})$ bounds can be obtained by summing up the confidence interval widths over $n$ rounds.

The main difference in our work is that we first bound the regret in a fixed bandit instance, similarly to frequentist analyses. The bound involves $\frac{1}{\Delta_a}$ and is derived using biased Bayesian confidence intervals. The rest of our analysis is Bayesian in two parts: we prove that the confidence intervals fail with a low probability and bound the random $\frac{1}{\Delta_a}$, following a similar technique to Lai [1987]. The resulting logarithmic Bayes regret bounds cannot be derived using the techniques of Russo and Van Roy [2014], as these become loose when the confidence interval widths are introduced.

Asymptotic logarithmic Bayes regret bounds were derived in Lai [1987]. From this analysis, we use only the technique for bounding $\frac{1}{\Delta_a}$ when proving Corollary 2 and Theorem 5. The central part of our proof is a finite-time per-instance bound on the number of times that a suboptimal action is taken. This quantity is bounded based on the assumption that the action with the highest UCB is taken. A comparable argument in Theorem 2 of Lai [1987] is asymptotic and on average over random bandit instances.

## 6 Experiments

We experiment with UCB algorithms in two environments: Gaussian bandits (Section 3.1) and linear bandits with Gaussian rewards (Section 3.3). In both experiments, the horizon is $n = 1\,000$ rounds. All results are averaged over $10\,000$ random runs. Shaded regions in the plots are standard errors of the estimates. They are generally small because the number of runs is high.

### 6.1 Gaussian Bandit

The first problem is a $K$-armed bandit with $K = 10$ actions (Section 3.1). The prior width is $\sigma_0 = 1$. The prior mean is $\mu_0$, where $\mu_{0,1} = \Delta_0$ and $\mu_{0,a} = 0$ for $a > 1$. We set $\Delta_0 = 1$ and call it the prior gap. We vary $\sigma_0$ and $\Delta_0$ in our experiments, and observe how the regret and its upper bounds change as the problem hardness varies (Sections 4.1 and 5.2). We plot five trends:

(a) Bayes regret of Bayes UCB.
(b) Bayes regret of UCB1 (Section 4.2), which is a comparable frequentist algorithm to Bayes UCB.
(c) A leading term of the Bayes UCB regret bound in Theorem 1, where $\varepsilon = 1/n$ and $\delta = 1/n$.
(d) A leading term of the UCB1 regret bound: this is the same as (c) with $\sigma_0 = \infty$.
(e) An existing $O(\sqrt{n})$ regret bound in (3).

Our results are reported in Figure 1.

[Figure 1: Gaussian bandit as (a) the prior width $\sigma_0$ and (b) the prior gap $\Delta_0$ change.]

We observe three major trends. First, the regret of Bayes UCB decreases as the problem becomes easier, either $\sigma_0 \to 0$ or $\Delta_0 \to \infty$. It is also lower than that of UCB1, which does not leverage the prior. Second, the regret bound of Bayes UCB is tighter than that of UCB1, due to capturing the benefit of the prior. Finally, the logarithmic regret bounds are tighter than the $O(\sqrt{n})$ bound. In addition, the $O(\sqrt{n})$ bound depends on the prior only through $\sigma_0$ and thus remains constant as the prior gap $\Delta_0$ changes.

In Appendix D, we compare Bayes UCB to UCB1 more comprehensively for various $K$, $\sigma$, $\Delta_0$, and $\sigma_0$. In all experiments, Bayes UCB has a lower regret than UCB1. This also happens when the noise is not Gaussian, which is a testament to the robustness of Bayesian methods to model misspecification.
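A minimal sketch of this experiment: sample instances from the prior, run Bayes UCB as in Section 3.1, and average the realized regret over runs. The helper name, the number of runs, and the seed are arbitrary choices of ours, not the paper's exact protocol.

```python
import numpy as np

def bayes_regret_gaussian(mu0, sigma0, sigma, n, delta, runs=1000, seed=0):
    """Estimate the n-round Bayes regret of Bayes UCB by simulation."""
    rng = np.random.default_rng(seed)
    K = len(mu0)
    regrets = np.zeros(runs)
    for r in range(runs):
        theta = rng.normal(mu0, sigma0)          # random instance theta ~ h
        counts = np.zeros(K)
        sums = np.zeros(K)
        for t in range(n):
            post_var = 1.0 / (1.0 / sigma0**2 + counts / sigma**2)
            post_mean = post_var * (mu0 / sigma0**2 + sums / sigma**2)
            ucb = post_mean + np.sqrt(2.0 * post_var * np.log(1.0 / delta))
            a = int(np.argmax(ucb))
            y = rng.normal(theta[a], sigma)      # reward Y_t
            counts[a] += 1
            sums[a] += y
            regrets[r] += theta.max() - theta[a]  # per-round regret mu_* - mu_{A_t}
    return regrets.mean()

n = 1000
mu0 = np.array([1.0] + [0.0] * 9)                # prior gap Delta_0 = 1, K = 10
print(bayes_regret_gaussian(mu0, sigma0=1.0, sigma=1.0, n=n, delta=1.0 / n))
```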
### 6.2 Linear Bandit with Gaussian Rewards

The second problem is a linear bandit with $K = 30$ actions in $d = 10$ dimensions (Section 3.3). The prior covariance is $\Sigma_0 = \sigma_0^2 I_d$. The prior mean is $\theta_0$, where $\theta_{0,1} = \Delta_0$ and $\theta_{0,i} = 1$ for $i > 1$. As in Section 6.1, we set $\Delta_0 = 1$ and call it the prior gap. The action set $\mathcal{A}$ is generated as follows. The first $d$ actions are the canonical basis in $\mathbb{R}^d$. The remaining $K - d$ actions are sampled uniformly at random from the positive orthant and scaled to unit length. This ensures that the first action has the highest mean reward, of $\Delta_0$, under the prior mean $\theta_0$.

We vary $\sigma_0$ and $\Delta_0$, and observe how the regret and its upper bounds change as the problem hardness varies (Sections 4.4 and 5.2). We plot three trends:

(a) Bayes regret of Bayes UCB.
(b) A leading term of the Bayes UCB regret bound in Theorem 6, where $\sigma_{0,\max} = \sigma_0$, $\varepsilon = 1/n$, and $\delta = 1/n$.
(c) An existing $O(\sqrt{n})$ regret bound in (4).

Our results are reported in Figure 2.

[Figure 2: Linear bandit as (a) the prior width $\sigma_0$ and (b) the prior gap $\Delta_0$ change.]

We observe three major trends. First, the regret of Bayes UCB decreases as the problem becomes easier, either $\sigma_0 \to 0$ or $\Delta_0 \to \infty$. Second, the regret bound of Bayes UCB decreases as the problem becomes easier. Finally, our logarithmic regret bound can also be tighter than the $O(\sqrt{n})$ bound. In particular, the $O(\sqrt{n})$ bound depends on the prior only through $\sigma_0$, and thus remains constant as the prior gap $\Delta_0$ changes. We discuss when our bounds could be looser than $O(\sqrt{n})$ bounds in Section 5.2.

## 7 Conclusions

Finite-time logarithmic frequentist regret bounds are the standard way of analyzing $K$-armed bandits [Auer et al., 2002, Garivier and Cappe, 2011, Agrawal and Goyal, 2012]. In our work, we prove the first comparable finite-time bounds, logarithmic in $n$, in the Bayesian setting. This is a major step in theory and a significant improvement upon prior $O(\sqrt{n})$ Bayes regret bounds that have become standard in the literature. Compared to frequentist regret bounds, our bounds capture the value of prior information given to the learner. Our proof technique is general and we also apply it to linear bandits.

This work can be extended in many directions. First, our analyses only need closed-form posteriors, which are available for other reward distributions than Gaussian and Bernoulli. Second, our linear bandit analysis (Section 4.4) seems preliminary when compared to our multi-armed bandit analyses. As an example, the complexity term $\mathbb{E}[1/\Delta_{\min}^{\varepsilon}]$ in Theorem 6 could be bounded as in Corollary 2 and Theorem 5. We do not do this because the main reason for deriving the $O(c_h \log^2 n)$ bound in Theorem 5, an upper bound on the corresponding $O(c \log n)$ bound, is that it matches the lower bound in (2). No such instance-dependent lower bound exists in Bayesian linear bandits. Third, we believe that our approach can be extended to information-theoretic bounds [Russo and Van Roy, 2016] and discuss this in Appendix E. Fourth, although we only analyze Bayes UCB, we believe that similar guarantees can be obtained for Thompson sampling. Finally, we would like to extend our results to reinforcement learning, for instance by building on the work of Lu and Van Roy [2019].

There have been recent attempts in theory [Wagenmaker and Foster, 2023] to design general adaptive algorithms with finite-time instance-dependent bounds based on optimal allocations. The promise of these methods is a higher statistical efficiency than exploring by optimism, which we adopt in this work. One of their shortcomings is that they are not guaranteed to be computationally efficient, as discussed in Section 2.2 of Wagenmaker and Foster [2023]. This work is also frequentist.

## References

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312-2320, 2011.

Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited.
In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017. Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, pages 39.1 39.26, 2012. Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127 135, 2013. Imad Aouali, Branislav Kveton, and Sumeet Katariya. Mixed-effect Thompson sampling. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 2023. Jean-Yves Audibert and Sebastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322 331, 1995. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002. Hamsa Bastani, David Simchi-Levi, and Ruihao Zhu. Meta dynamic pricing: Transfer learning across experiments. Co RR, abs/1902.10918, 2019. URL https://arxiv.org/abs/1902.10918. Soumya Basu, Branislav Kveton, Manzil Zaheer, and Csaba Szepesvari. No regrets for learning the prior in bandits. In Advances in Neural Information Processing Systems 34, 2021. Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006. Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249 2257, 2012. Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pages 355 366, 2008. Aurelien Garivier and Olivier Cappe. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359 376, 2011. John Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), 41:148 177, 1979. Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, and Craig Boutilier. Latent bandits revisited. In Advances in Neural Information Processing Systems 33, 2020. Joey Hong, Branislav Kveton, Manzil Zaheer, and Mohammad Ghavamzadeh. Hierarchical Bayesian bandits. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022. Emilie Kaufmann, Olivier Cappe, and Aurelien Garivier. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 592 600, 2012. Jaya Kawale, Hung Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems 28, pages 1297 1305, 2015. Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-Wei Hsu, Martin Mladenov, Craig Boutilier, and Csaba Szepesvari. Meta-Thompson sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021. Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091 1114, 1987. 
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 22, 1985. Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019. Lihong Li, Wei Chu, John Langford, and Robert Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010. Lisha Li, Kevin Jamieson, Giulia De Salvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1 52, 2018. Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th Annual International ACM SIGIR Conference, 2016. Xiuyuan Lu and Benjamin Van Roy. Information-theoretic confidence bounds for reinforcement learning. In Advances in Neural Information Processing Systems 32, 2019. Olivier Marchal and Julyan Arbel. On the sub-Gaussianity of the beta and Dirichlet distributions. Electronic Communications in Probability, 22:1 14, 2017. Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221 1243, 2014. Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1 30, 2016. Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1 96, 2018. Max Simchowitz, Christopher Tosh, Akshay Krishnamurthy, Daniel Hsu, Thodoris Lykouris, Miro Dudik, and Robert Schapire. Bayesian decision-making under misspecified priors with applications to meta-learning. In Advances in Neural Information Processing Systems 34, 2021. William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. John Tsitsiklis. A short proof of the gittins index theorem. Neural Computation, 4(1):194 199, 1994. Andrew Wagenmaker and Dylan Foster. Instance-optimality in interactive decision making: Toward a non-asymptotic theory. In Proceedings of the 36th Annual Conference on Learning Theory, 2023. Zhi Wang, Chicheng Zhang, Manish Kumar Singh, Laurel Riek, and Kamalika Chaudhuri. Multitask bandit learning through heterogeneous feedback aggregation. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021. Xiaoxue Zhao, Weinan Zhang, and Jun Wang. Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 1411 1420, 2013. We present a general approach for deriving finite-time logarithmic Bayes regret bounds. We start with K-armed bandits and then extend it to linear bandits. A.1 Proof of Theorem 1 Let Et = n a A : |θa ˆθt,a| Ct,a o be the event that all confidence intervals at round t hold. Fix ε > 0. We start with decomposing the n-round regret as t=1 E [ At] t=1 E [ At1{ At ε, Et}] + t=1 E [ At1{ At < ε}] + (5) t=1 E At1 Et . We bound the first term using the design of Bayes UCB and its closed-form posteriors. Case 1: Event Et occurs and the gap is large, At ε. Since Et occurs and the action with the highest UCB is taken, the regret at round t can be bounded as At = θA θAt θA Ut,A + Ut,At θAt Ut,At θAt 2Ct,At . In the second inequality, we use that θA Ut,A on event Et. 
This implies that on event Et, action a can be taken only if a 2Ct,a = 2 q 2ˆσ2 t,a log(1/δ) = 2 2 log(1/δ) σ 2 0 + σ 2Nt,a , which can be rearranged as Nt,a 8σ2 log(1/δ) 2a σ2σ 2 0 . Therefore, the number of times that action a is taken in n rounds while all confidence intervals hold, Na = Pn t=1 1{At = a, Et}, is bounded as Na 8σ2 log(1/δ) 2a σ2σ 2 0 . (6) Now we apply this inequality to bound the first term in (5) as t=1 E [ At1{ At ε, Et}] E 8σ2 log(1/δ) a σ2σ 2 0 a Case 2: The gap is small, At < ε. Then naively Pn t=1 E [ At1{ At < ε}] < εn. Case 3: Event Et does not occur. The last term in (5) can be bounded as E At1 Et (8) = E E (θA θAt)1 Et Ht E E (θA Ut,A )1 Et Ht + E (Ut,At θAt)1 Et Ht E h E h (θA ˆθt,A )1 Et Ht i + E h (ˆθt,At θAt)1 Et Ht i + E Ct,At1 Et Ht i . To bound the resulting terms, we use that θa ˆθt,a | Ht N(0, ˆσ2 t,a). Lemma 7. For any action a A, round t [n], and history Ht, E h (θa ˆθt,a)1 Et Ht i 2σ0Kδ . Proof. Let Et,a = n |θa ˆθt,a| Ct,a o . We start with decomposing Et into individual Et,a as E h (θa ˆθt,a)1 Et Ht i E h |θa ˆθt,a|1 Et,a Ht i + X a =a E h |θa ˆθt,a|1 Et,a Ht i . To bound the first term, we use that θa ˆθt,a | Ht N(0, ˆσ2 t,a). Thus E h |θa ˆθt,a|1 Et,a Ht i 2 q x=Ct,a x exp x2 2ˆσ2 t,a π δ σ0δ . To bound the second term, we use the independence of the distributions for a and a , E h |θa ˆθt,a|1 Et,a Ht i = E h |θa ˆθt,a| Ht i P Et,a Ht . The probability is at most 2δ and the expectation can be bounded as E h |θa ˆθt,a| Ht i = E q (θa ˆθt,a)2 Ht E h (θa ˆθt,a)2 Ht i = ˆσt,a σ0 . This completes the proof. The first two terms in (8) can be bounded using a union bound over a A and Lemma 7. For the last term, we use that Ct,a p 2σ2 0 log(1/δ) and a union bound in P Et Ht to get E Ct,At1 Et Ht q 2σ2 0 log(1/δ)P Et Ht 2 p 2 log(1/δ)σ0Kδ . Finally, we sum up the upper bounds on (8) over all rounds t [n]. A.2 Proof of Corollary 2 The key idea in the proof is to integrate out the random gap in (7). Fix action a A and thresholds ε2 > ε > 0. We consider two cases. Case 1: Diminishing gaps ε < a ε2. Let ξa(θ a) = arg max x [ε,ε2] ha(θ a x) and Na be defined as in Appendix A.1. Then E [ a Na1{ε < a < ε2}] = Z θa=θ a ε2 a Naha(θa) dθa h a(θ a) dθ a θa=θ a ε2 a Na dθa ha(θ a ξa(θ a)) h a(θ a) dθ a , where the inequality is by the definition of ξa. Now the inner integral is independent of ha and thus can be easily bounded. Specifically, the upper bound in (6) and simple integration yield Z θ a ε θa=θ a ε2 a Na dθa Z θ a ε 8σ2 log(1/δ) θ a θa σ2σ 2 0 (θ a θa) dθa = 8σ2 log(1/δ)(log ε2 log ε) σ2(ε2 2 ε2) 2σ2 0 . For ε = 1/n and ε2 = 1/ log n, we get Z θ a ε θa=θ a ε2 a Na dθa 8σ2 log(1/δ) log n σ2 2σ2 0 log n + σ2 2σ2 0n2 4σ2 log(1/δ) log log n (9) 8σ2 log(1/δ) log n σ2 2σ2 0 log n . The last inequality holds for σ2 0 1 8 log(1/δ) n2 log log n. Case 2: Large gaps a > ε2. Here we use (6) together with ε2 = 1/ log n to get E [ a Na1{ a > ε2}] E 8σ2 log(1/δ) a 1{ a > ε2} < 8σ2 log(1/δ) p log n . (10) Finally, we chain all inequalities. A.3 Proof of Lemma 3 We have that X θ a ha(θ a) h a(θ a) dθ a a =a ha(θa ) a =a ha (θa ) θa ha(θa ) ha (θa ) dθa θa exp (θa µ0,a)2 2σ2 0 (θa µ0,a )2 a =a exp (µ0,a µ0,a )2 where the last step is by completing the square and integrating out θa . A.4 Proof of Theorem 4 The regret bound of UCB1 is proved similarly to Theorem 1 and Corollary 2. This is because UCB1 can be viewed as Bayes UCB where σ0 = and each action a A is initially taken once at round t = a. 
Since σ0 = , the confidence interval becomes 2σ2 log(1/δ) The proof differs in two steps. First, the regret in the first K rounds is bounded by P a A E [ a]. Second, the concentration argument (Case 3 in Appendix A.1) changes because the bandit instance θ is fixed and the estimated model parameter ˆθt is random. We detail it below. Case 3: Event Et does not occur. The last term in (5) can be bounded as E At1 Et (11) = E E (θA θAt)1 Et θ E E (θA Ut,A )1 Et θ + E (Ut,At θAt)1 Et θ E h E h (θA ˆθt,A )1 Et θ i + E h (ˆθt,At θAt)1 Et θ i + E Ct,At1 Et θ i . To bound the resulting terms, we use that θa ˆθt,a | θ N(0, σ2/Nt,a). Lemma 8. For any action a A, round t > K, and Nt,a 1, E h (θa ˆθt,a)1 Et θ i 2σKδ . Proof. Let Et,a = n |θa ˆθt,a| Ct,a o . We start with decomposing Et into individual Et,a as E h (θa ˆθt,a)1 Et θ i E h |θa ˆθt,a|1 Et,a θ i + X a =a E h |θa ˆθt,a|1 Et,a θ i . To bound the first term, we use that θa ˆθt,a | θ N(0, σ2/Nt,a). Thus E h |θa ˆθt,a|1 Et θ i 2 p x=Ct,a x exp x2 πNt,a δ σδ . To bound the second term, we use the independence of the distributions for a and a , E h |θa ˆθt,a|1 Et,a θ i = E h |θa ˆθt,a| θ i P Et,a θ . The probability is at most 2δ and the expectation can be bounded as E h |θa ˆθt,a| θ i = E q (θa ˆθt,a)2 θ E h (θa ˆθt,a)2 θ i = This completes the proof. The first two terms in (11) can be bounded using a union bound over a A and Lemma 8. For the last term, we use that Ct,a p 2σ2 log(1/δ) and a union bound in P Et θ to get E Ct,At1 Et θ p 2σ2 log(1/δ)P Et θ 2 p 2 log(1/δ)σKδ . Finally, we sum up the upper bounds on (11) over all rounds t [n]. A.5 Proof of Theorem 5 Let Et = n a A : |θa ˆθt,a| Ct,a o be the event that all confidence intervals at round t hold. Fix ε > 0. We decompose the n-round regret as in (5) and then bound each resulting term next. Case 1: Event Et occurs and the gap is large, At ε. As in Appendix A.1, At = θA θAt θA Ut,A + Ut,At θAt Ut,At θAt 2Ct,At . In the second inequality, we use that θA Ut,A on event Et. This implies that on event Et, action a can be taken only if Nt,a 2 log(1/δ) 2a (αa + βa + 1) . Now we apply this inequality to bound the first term in (5) as t=1 E [ At1{ At ε, Et}] E a (αa + βa + 1) a Case 2: The gap is small, At < ε. Then naively Pn t=1 E [ At1{ At < ε}] < εn. Case 3: Event Et does not occur. Since θa [0, 1], the last term in (5) can be bounded as E At1 Et E P Et Ht 2Kδ . This completes the first part of the proof. The second claim is proved as in Appendix A.2 and we only comment on what differs. For ε = 1/n and ε2 = 1/ log n, (9) becomes Z θ a ε θa=θ a ε2 a Na dθa Z θ a ε θ a θa λ(θ a θa) dθa = 2 log(1/δ)(log ε2 log ε) λ(ε2 2 ε2) = 2 log(1/δ) log n λ 2 log n + λ 2n2 log(1/δ) log log n 2 log(1/δ) log n λ 2 log n . The last inequality holds for λ 2 log(1/δ) n2 log log n. Moreover, (10) becomes E [ a Na1{ a > ε2}] E 2 log(1/δ) a 1{ a > ε2} < 2 log(1/δ) p This completes the second part of the proof. A.6 Proof of Theorem 6 Et = n a A : |a (θ ˆθt)| p 2 log(1/δ) a ˆΣt be an event that high-probability confidence intervals for mean rewards at round t hold. Our proof has three parts. Case 1: Event Et occurs and the gap is large, At ε. Then At = 1 At 2 At 1 ε min (A θ A t θ)2 1 ε min (A θ Ut,A + Ut,At A t θ)2 1 ε min (Ut,At A t θ)2 4 ε min C2 t,At = 8 log(1/δ) ε min At 2 ˆΣt . The first inequality follows from definitions of At and ε min; and that the gap is large, At ε. The second inequality holds because A θ A t θ 0 by definition and Ut,At Ut,A 0 by the design of Bayes UCB. 
The third inequality holds because A θ Ut,A 0 on event Et. Specifically, for any action a A on event Et, a θ Ut,a = a (θ ˆθt) Ct,a Ct,a Ct,a = 0 . The last inequality follows from the definition of event Et. Specifically, for any action a A on event Et, Ut,a a θ = a (ˆθt θ) + Ct,a Ct,a + Ct,a = 2Ct,a . Case 2: The gap is small, At ε. Then naively At ε. Case 3: Event Et does not occur. Then At1 Et 2 At 2 θ 21 Et 2LL 1 Et , where 2LL is a trivial upper bound on At. We bound the event in expectation as follows. Lemma 9. For any round t [n] and history Ht, we have that P Et Ht 2Kδ. Proof. First, note that for any history Ht, a A P |a (θ ˆθt)| p 2 log(1/δ) a ˆΣt By definition, θ ˆθt | Ht N(0d, ˆΣt), and therefore a (θ ˆθt)/ a ˆΣt | Ht N(0, 1) for any action a A. It immediately follows that P |a (θ ˆθt)| p 2 log(1/δ) a ˆΣt This completes the proof. Finally, we chain all inequalities, add them over all rounds, and get t=1 At 2 ˆΣt log(1/δ) + εn + 4LL Knδ . The sum can bounded using a worst-case argument below, which yields our claim. Lemma 10. The sum of posterior variances is bounded as t=1 At 2 ˆΣt σ2 0,maxd log 1 + σ2 0,max 1 + σ2 0,maxn Proof. We start with an upper bound on the posterior variance of the mean reward estimate of any action. In any round t [n], by Weyl s inequalities, we have λ1(ˆΣt) = λ1((Σ 1 0 + Gt) 1) = λ 1 d (Σ 1 0 + Gt) λ 1 d (Σ 1 0 ) = λ1(Σ0) . Thus, when a 2 L for any action a A, we have maxa A a ˆΣt p λ1(Σ0)L = σ0,max. Now we bound the sum of posterior variances Pn t=1 At 2 ˆΣt. Fix round t and note that At 2 ˆΣt = σ2 A t ˆΣt At σ2 c1 log(1 + σ 2A t ˆΣt At) = c1 log det(Id + σ 2 ˆΣ 1 2 t At A t ˆΣ 1 2 t ) (13) c1 = σ2 0,max log(1 + σ 2σ2 0,max) . This upper bound is derived as follows. For any x [0, u], x = x log(1 + x) log(1 + x) max x [0,u] x log(1 + x) log(1 + x) = u log(1 + u) log(1 + x) . Then we set x = σ 2A t ˆΣt At and use the definition of σ0,max. The next step is bounding the logarithmic term in (13), which can be rewritten as log det(Id + σ 2 ˆΣ 1 2 t At A t ˆΣ 1 2 t ) = log det(ˆΣ 1 t + σ 2At A t ) log det(ˆΣ 1 t ) . Because of that, when we sum over all rounds, we get telescoping and the total contribution of all terms is at most n X t=1 log det(Id + σ 2 ˆΣ 1 2 t At A t ˆΣ 1 2 t ) = log det(ˆΣ 1 n+1) log det(ˆΣ 1 1 ) = log det(Σ 1 2 0 ˆΣ 1 n+1Σ 1 2 0 ˆΣ 1 n+1Σ 1 2 0 At A t Σ t=1 A t Σ0At 1 + σ2 0,maxn This completes the proof. B Complete Statement of Corollary 2 Theorem 11. Let σ2 0 1 8 log(1/δ) n2 log log n. Then there exist functions ξa : R h 1 n, 1 log n i such that the n-round Bayes regret of Bayes UCB in a K-armed Gaussian bandit is bounded as R(n) 8σ2 log(1/δ) log n σ2 2σ2 0 log n θ a ha(θ a ξa(θ a)) h a(θ a) dθ a + C , where C = 8σ2K log(1/δ) log n + (2 p 2 log(1/δ) + 1)σ0Knδ + 1 is a low-order term. Moreover, when σ2 0 < 1 8 log(1/δ) n2 log log n, the regret is bounded as 2 log(1/δ) + 1 p 8 log(1/δ) log log n Kδ + 1 . Proof. The first claim is proved in Appendix A.2. The second claim can be proved as follows. Take Theorem 1, set ε = 0, and consider the three cases in Appendix A.1. Case 1: Event Et occurs and the gap is large, At ε. On event Et, action a can be taken only if 2 log(1/δ) σ 2 0 + σ 2Nt,a 2 q 2σ2 0 log(1/δ) 2 r 1 4n2 log log n < 1 Therefore, the corresponding n-round regret is bounded by 1. Case 2: The gap is small, At < ε. This case cannot happen because ε = 0. Case 3: Event Et does not occur. The n-round regret is bounded by 2 log(1/δ) + 1)σ0Knδ 2 p 2 log(1/δ) + 1 p 8 log(1/δ) log log n Kδ . This completes the proof. 
C Gap-Free Regret Bound of Bayes UCB in Linear Bandit Let Et be the event in (12). Our proof has three parts. Case 1: Event Et occurs and the gap is large, At ε. Then At = A θ A t θ A θ Ut,A + Ut,At A t θ Ut,At A t θ 2Ct,At . The first inequality holds because Ut,At Ut,A 0 by the design of Bayes UCB. The second one uses that A θ Ut,A 0. Specifically, for any action a A on event Et, a θ Ut,a = a (θ ˆθt) Ct,a Ct,a Ct,a = 0 . The last inequality follows from the definition of event Et. Specifically, for any action a A on event Et, Ut,a a θ = a (ˆθt θ) + Ct,a Ct,a + Ct,a = 2Ct,a . Cases 2 and 3 are bounded as in Appendix A.6. Now we chain all inequalities, add them over all rounds, and get 2 log(1/δ) + εn + 4LL Knδ t=1 At 2 ˆΣt 2n log(1/δ) + εn + 4LL Knδ , where the last inequality uses the Cauchy-Schwarz inequality and the concavity of the square root. Finally, the sum Pn t=1 At 2 ˆΣt is bounded using Lemma 10. This completes the proof. 0 10 20 30 40 50 60 70 80 Regret difference (a) Gaussian noise 0 10 20 30 40 50 60 70 80 Regret difference (b) Box noise Figure 3: The difference in regret of UCB1 and Bayes UCB on 81 Bayesian bandit instances, sorted by the difference. In plot (a), the noise is Gaussian N(0, σ2). In plot (b), the noise is σ with probability 0.5 and σ otherwise. D Comparison of Bayes UCB and UCB1 We report the difference in regret of UCB1 and Bayes UCB on 81 Bayesian bandit instances. These instances are obtained by all combinations of K {5, 10, 20} actions, reward noise σ {0.5, 1, 2}, prior gap 0 {0.5, 1, 2}, and prior width σ0 {0.5, 1, 2}. The horizon is n = 1 000 rounds and all results are averaged over 1 000 random runs. Our results are reported in Figure 3. In Figure 3a, the noise is Gaussian N(0, σ2). In Figure 3b, the noise is σ with probability 0.5 and σ otherwise. Therefore, this noise is σ2-sub-Gaussian, of the same magnitude as N(0, σ2) but far from it in terms of the distribution. This tests the robustness of Bayes UCB to Gaussian posterior updates. UCB1 only needs σ2-sub-Gaussian noise. In both plots, and in all 81 Bayesian bandit instances, Bayes UCB has a lower regret than UCB1. It is also remarkably robust to noise misspecification, although we cannot prove it. E Note on Information-Theory Bounds Our approach could be used to derive Bayesian information-theory bounds [Russo and Van Roy, 2016]. The key step in these bounds, where the information-theory term It,a for action a at round t arises, is At Γ p It,At, where Γ is the highest possible ratio of regret to information gain. As in Case 1 in Appendix A.6, the n-round regret can be bounded as 1 At 2 At 1 ε min t=1 2 At Γ2 t=1 It,At . The term Pn t=1 It,At can be bounded using a worst-case argument by a O(log n) bound.