# MOTS: Minimax Optimal Thompson Sampling

Tianyuan Jin¹, Pan Xu², Jieming Shi³, Xiaokui Xiao¹, Quanquan Gu²

¹School of Computing, National University of Singapore, Singapore. ²Department of Computer Science, University of California, Los Angeles, USA. ³Department of Computing, The Hong Kong Polytechnic University, Hong Kong. Correspondence to: Xiaokui Xiao, Quanquan Gu.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity in implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems, where $T$ is the total time horizon. In this paper, we solve this long-standing open problem by proposing a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the asymptotically optimal regret bound for Gaussian rewards when $T$ approaches infinity. To our knowledge, MOTS is the first Thompson sampling type algorithm that achieves minimax optimality for multi-armed bandit problems.

1. Introduction

The Multi-Armed Bandit (MAB) problem is a sequential decision process, typically described as a game between an agent and an environment with $K$ arms. The game proceeds in $T$ time steps. In each time step $t = 1, \dots, T$, the agent plays an arm $A_t \in \{1, 2, \dots, K\}$ based on the observations from the previous $t-1$ time steps, and then observes a reward $r_t$ that is independently generated from a 1-subGaussian distribution with mean $\mu_{A_t}$, where $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}$ are unknown. The goal of the agent is to maximize the cumulative reward over $T$ time steps. The performance of a strategy for MAB is measured by the expected cumulative difference over $T$ time steps between playing the best arm and playing the arm chosen by the strategy, which is also called the regret of the bandit strategy. Formally, the regret $R_\mu(T)$ is defined as

$$R_\mu(T) = T\max_{i\in\{1,2,\dots,K\}}\mu_i \;-\; \mathbb{E}_\mu\bigg[\sum_{t=1}^{T} r_t\bigg].$$

For a fixed time horizon $T$, the problem-independent lower bound (Auer et al., 2002b) states that any strategy incurs a regret of order at least $\Omega(\sqrt{KT})$, which is called the minimax optimal regret. On the other hand, for a fixed model (i.e., $\mu_1, \dots, \mu_K$ are fixed), Lai & Robbins (1985) proved that any strategy must have at least $C(\mu)\log(T)(1-o(1))$ regret as the horizon $T$ approaches infinity, where $C(\mu)$ is a constant depending on the model. Therefore, a strategy whose regret is upper bounded by $C(\mu)\log(T)(1+o(1))$ is asymptotically optimal.
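To make the interaction protocol and the regret $R_\mu(T)$ concrete, here is a minimal simulation sketch of our own (not from the paper), using unit-variance Gaussian rewards as a 1-subGaussian example; the helper `run_bandit` and its `policy` interface are hypothetical names introduced only for illustration.

```python
import numpy as np

def run_bandit(policy, mu, T, seed=0):
    """Simulate a K-armed bandit with N(mu_i, 1) rewards (a 1-subGaussian case)
    and return the empirical regret T * max_i mu_i - sum_t mu_{A_t}.

    `policy(counts, sums, t)` returns the arm to play at step t, given the
    pull counts and reward sums observed so far (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K, dtype=int)   # T_i(t): number of pulls of arm i
    sums = np.zeros(K)                # running reward sums per arm
    pulled = 0.0                      # sum_t mu_{A_t}
    for t in range(T):
        arm = policy(counts, sums, t)
        r = rng.normal(mu[arm], 1.0)  # observe reward r_t
        counts[arm] += 1
        sums[arm] += r
        pulled += mu[arm]
    return T * np.max(mu) - pulled    # averaging over runs estimates R_mu(T)

# Example: round-robin play on a 3-armed instance.
regret = run_bandit(lambda c, s, t: t % 3, np.array([1.0, 0.8, 0.8]), T=3000)
```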
This paper studies the earliest bandit strategy, Thompson sampling (TS) (Thompson, 1933). It has been observed in practice that TS often achieves a smaller regret than many upper confidence bound (UCB)-based algorithms (Chapelle & Li, 2011; Wang & Chen, 2018). In addition, TS is simple and easy to implement. Despite these advantages, the theoretical analysis of TS algorithms was not established until the past decade. In particular, the seminal work of Agrawal & Goyal (2012) provided the first finite-time analysis of TS. Kaufmann et al. (2012) and Agrawal & Goyal (2013) showed that the regret bound of TS is asymptotically optimal when using Beta priors. Subsequently, Agrawal & Goyal (2017) showed that TS with Beta priors achieves an $O(\sqrt{KT\log T})$ problem-independent regret bound while maintaining asymptotic optimality. In addition, they proved that TS with Gaussian priors can achieve an improved regret bound $O(\sqrt{KT\log K})$. Agrawal & Goyal (2017) also established the following regret lower bound for TS: the TS strategy with Gaussian priors has a worst-case regret of $\Omega(\sqrt{KT\log K})$.

Main Contributions. It remains an open problem (Li & Chapelle, 2012) whether TS type algorithms can achieve the minimax optimal regret bound $O(\sqrt{KT})$ for MAB problems. In this paper, we solve this open problem by proposing a variant of Thompson sampling, referred to as Minimax Optimal Thompson Sampling (MOTS), which clips the sampling instance for each arm based on the history of pulls. We prove that MOTS achieves $O(\sqrt{KT})$ problem-independent regret, which is minimax optimal and improves the existing best result, i.e., $O(\sqrt{KT\log K})$. Furthermore, we show that when the reward distributions are Gaussian, a variant of MOTS with clipped Rayleigh distributions, namely MOTS-J, can simultaneously achieve asymptotically and minimax optimal regret bounds. Our result also conveys the important message that the lower bound for TS with Gaussian priors in Agrawal & Goyal (2017) may not always hold in the more general case where non-Gaussian priors are used. Our experiments demonstrate the superiority of MOTS over state-of-the-art bandit algorithms such as UCB (Auer et al., 2002a), MOSS (Audibert & Bubeck, 2009), and TS (Thompson, 1933) with Gaussian priors. We provide a detailed comparison of the minimax optimality and asymptotic optimality of TS type algorithms in Table 1.

Table 1. Comparison of different TS type algorithms. The minimax ratio is the ratio (up to constant factors) between the problem-independent regret bound of the algorithm and the minimax optimal regret $O(\sqrt{KT})$. For instance, when the ratio equals 1, the algorithm is minimax optimal; otherwise, it is minimax suboptimal. The results in Kaufmann et al. (2012); Agrawal & Goyal (2013; 2017) are obtained for rewards bounded in [0, 1], but the techniques in their papers also work for Gaussian rewards (see Korda et al. (2013) for details).

| Algorithm | Reward type | Minimax ratio | Asym. optimal | Reference |
| --- | --- | --- | --- | --- |
| TS | Bernoulli | $\sqrt{\log T}$ | Yes | Kaufmann et al. (2012) |
| TS | Bernoulli | $\sqrt{\log T}$ | Yes | Agrawal & Goyal (2013) |
| TS | Bernoulli | $\sqrt{\log K}$ | * | Agrawal & Goyal (2017) |
| MOTS | subGaussian | 1 | No** | Theorems 1, 2 |
| MOTS | subGaussian | $\sqrt{\mathrm{ilog}^{(m-1)}(T)}$*** | Yes | Theorem 3 |
| MOTS-J | Gaussian | 1 | Yes | Theorem 4 |

* It has been proved by Agrawal & Goyal (2017) that the $\sqrt{\log K}$ term in the problem-independent regret is unimprovable for Thompson sampling using Gaussian priors.
** As shown in Theorem 2, MOTS is asymptotically optimal up to a multiplicative factor $1/\rho$, where $\rho \in (1/2, 1)$ is a fixed constant.
*** $\mathrm{ilog}^{(r)}(\cdot)$ is the iterated logarithm of order $r$, and $m \ge 2$ is an arbitrary integer independent of $T$.

Notations. A random variable $X$ is said to follow a 1-subGaussian distribution if $\mathbb{E}_X[\exp(\lambda X - \lambda\mathbb{E}_X[X])] \le \exp(\lambda^2/2)$ for all $\lambda \in \mathbb{R}$. We denote $\log^+(x) = \max\{0, \log x\}$. We let $T$ be the total number of time steps, $K$ be the number of arms, and $[K] = \{1, 2, \dots, K\}$. Without loss of generality, we assume that $\mu_1 = \max_{i\in[K]}\mu_i$ throughout this paper. We use $\Delta_i$ to denote the gap between arm 1 and arm $i$, i.e., $\Delta_i := \mu_1 - \mu_i$ for $i \in [K]\setminus\{1\}$.
We denote $T_i(t) := \sum_{j=1}^{t}\mathbb{1}\{A_j = i\}$ as the number of times that arm $i$ has been played up to time step $t$, and $\hat\mu_i(t) := \sum_{j=1}^{t}\mathbb{1}\{A_j = i\}\, r_j / T_i(t)$ as the average reward for pulling arm $i$ up to time $t$, where $r_j$ is the reward received by the algorithm at time $j$.

2. Related Work

Existing works on regret minimization for stochastic bandit problems mainly consider two notions of optimality: asymptotic optimality and minimax optimality. KL-UCB (Garivier & Cappé, 2011; Maillard et al., 2011), Bayes-UCB (Kaufmann, 2016), and Thompson sampling (Kaufmann et al., 2012; Agrawal & Goyal, 2017; Korda et al., 2013) have all been shown to be asymptotically optimal. Meanwhile, MOSS (Audibert & Bubeck, 2009) was the first method proved to be minimax optimal. Subsequently, two UCB-based methods, AdaUCB (Lattimore, 2018) and KL-UCB++ (Ménard & Garivier, 2017), were also shown to achieve minimax optimality. In addition, AdaUCB is proved to be almost instance-dependent optimal for Gaussian multi-armed bandit problems (Lattimore, 2018). There are many other methods for regret minimization in stochastic bandits, including explore-then-commit (Auer & Ortner, 2010; Perchet et al., 2016), $\epsilon$-Greedy (Auer et al., 2002a), and RandUCB (Vaswani et al., 2019). However, these methods are proved to be suboptimal (Auer et al., 2002a; Garivier et al., 2016; Vaswani et al., 2019). One exception is the recently proposed double explore-then-commit algorithm (Jin et al., 2020), which achieves asymptotic optimality. Another line of work studies different variants of the problem setting, such as the batched bandit problem (Gao et al., 2019) and bandits with delayed feedback (Pike-Burke et al., 2018). We refer interested readers to Lattimore & Szepesvári (2020) for a more comprehensive overview of bandit algorithms.

For Thompson sampling, Russo & Van Roy (2014) studied the Bayesian regret, and Bubeck & Liu (2013) improved it to $O(\sqrt{KT})$ using the confidence bound analysis of MOSS (Audibert & Bubeck, 2009). However, it should be noted that the Bayesian regret is known to be less informative than the frequentist regret $R_\mu(T)$ studied in this paper. In fact, one can show that our minimax optimal regret $R_\mu(T) = O(\sqrt{KT})$ immediately implies that the Bayesian regret is also of order $O(\sqrt{KT})$, but the reverse is not true (Lattimore & Szepesvári, 2020). We refer interested readers to Russo et al. (2018) for a thorough introduction to Thompson sampling and its various applications.

3. Minimax Optimal Thompson Sampling Algorithm

3.1. General Thompson sampling strategy

We first describe the general Thompson sampling (TS) strategy. In the first $K$ time steps, TS plays each arm $i\in[K]$ once and updates the average reward estimate $\hat\mu_i(K+1)$ for each arm $i$. (This is a standard warm-start in the bandit literature.) Subsequently, the algorithm maintains a distribution $D_i(t)$ for each arm $i\in[K]$ at time step $t = K+1, \dots, T$, whose update rule will be elaborated shortly. At step $t$, the algorithm samples instances $\theta_i(t)$ independently from the distributions $D_i(t)$ for all $i\in[K]$. Then, the algorithm plays the arm that maximizes $\theta_i(t)$, i.e., $A_t = \mathrm{argmax}_{i\in[K]}\theta_i(t)$, and receives a reward $r_t$. The average reward $\hat\mu_i(t)$ and the number of pulls $T_i(t)$ for each arm $i\in[K]$ are then updated accordingly. We refer to algorithms that follow the general TS strategy described above (e.g., those in Chapelle & Li (2011); Agrawal & Goyal (2017)) as TS type algorithms.
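As a reference point for how MOTS later modifies this template, here is a schematic sketch of a generic TS type algorithm; this is our own illustration (the helper `ts_type_algorithm` and its `sample_theta` interface are hypothetical names, not the paper's code), and the lambda at the end instantiates vanilla TS with Gaussian posteriors $N(\hat\mu_i(t), 1/T_i(t))$.

```python
import numpy as np

def ts_type_algorithm(sample_theta, mu, T, seed=0):
    """Generic TS type strategy of Section 3.1 (schematic sketch).

    `sample_theta(rng, i, mu_hat_i, n_i)` draws theta_i(t) from D_i(t);
    vanilla Gaussian TS and MOTS differ only in this one call.
    Returns the number of pulls T_i(T) of each arm.
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.ones(K, dtype=int)        # warm start: play each arm once
    means = rng.normal(mu, 1.0)           # mu_hat_i(K + 1)
    for _ in range(K, T):
        theta = np.array([sample_theta(rng, i, means[i], counts[i])
                          for i in range(K)])
        arm = int(np.argmax(theta))       # A_t = argmax_i theta_i(t)
        r = rng.normal(mu[arm], 1.0)      # observe reward r_t
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # update mu_hat_i
    return counts

# Vanilla TS with Gaussian posteriors: theta_i(t) ~ N(mu_hat_i(t), 1 / T_i(t)).
pulls = ts_type_algorithm(lambda rng, i, m, n: rng.normal(m, np.sqrt(1.0 / n)),
                          mu=np.array([1.0, 0.9, 0.9]), T=10_000)
```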
Following the above definition, our MOTS method is a TS type algorithm, but it differs from other algorithms of this type in the choice of the distribution $D_i(t)$: existing algorithms (e.g., Agrawal & Goyal (2017)) typically use Gaussian or Beta distributions as the posterior distribution, whereas MOTS uses a clipped Gaussian distribution, which we detail in Section 3.2. Nevertheless, we should note that MOTS fits exactly into the description of Thompson sampling in Li & Chapelle (2012); Chapelle & Li (2011).

3.2. Thompson sampling using clipped Gaussian distributions

Algorithm 1 shows the pseudo-code of MOTS, with $D_i(t)$ formulated as follows.

Algorithm 1 Minimax Optimal Thompson Sampling with Clipping (MOTS)
1: Input: Arm set $[K]$.
2: Initialization: Play each arm $i\in[K]$ once and set $T_i(K+1) = 1$; let $\hat\mu_i(K+1)$ be the observed reward of playing arm $i$.
3: for $t = K+1, K+2, \dots, T$ do
4:   For all $i\in[K]$, sample $\theta_i(t)$ independently from $D_i(t)$, which is defined in Section 3.2.
5:   Play arm $A_t = \arg\max_{i\in[K]}\theta_i(t)$ and observe the reward $r_t$.
6:   For all $i\in[K]$: $\hat\mu_i(t+1) = \dfrac{T_i(t)\,\hat\mu_i(t) + r_t\,\mathbb{1}\{i=A_t\}}{T_i(t) + \mathbb{1}\{i=A_t\}}$.
7:   For all $i\in[K]$: $T_i(t+1) = T_i(t) + \mathbb{1}\{i=A_t\}$.
8: end for

First, at time step $t$, for every arm $i\in[K]$, we define a confidence range $(-\infty, \tau_i(t))$, where
$$\tau_i(t) = \hat\mu_i(t) + \sqrt{\frac{\alpha}{T_i(t)}\log^{+}\!\Big(\frac{T}{K\,T_i(t)}\Big)}, \qquad (2)$$
$\log^+(x) = \max\{0, \log x\}$, and $\alpha > 0$ is a constant. Given $\tau_i(t)$, we first sample an instance $\tilde\theta_i(t)$ from the Gaussian distribution $N(\hat\mu_i(t), 1/(\rho T_i(t)))$, where $\rho\in(1/2, 1)$ is a tuning parameter (the intuition behind this choice can be found in Lemma 5). Then, we set $\theta_i(t)$ in Line 4 of Algorithm 1 as
$$\theta_i(t) = \min\big\{\tilde\theta_i(t),\, \tau_i(t)\big\}. \qquad (3)$$
In other words, $\theta_i(t)$ follows a clipped Gaussian distribution with the following PDF:
$$f_{i,t}(x) = \begin{cases} \phi_t(x) + \big(1-\Phi_t(x)\big)\,\delta\big(x-\tau_i(t)\big), & \text{if } x \le \tau_i(t); \\ 0, & \text{otherwise,} \end{cases} \qquad (4)$$
where $\phi_t(x)$ and $\Phi_t(x)$ denote the PDF and CDF of the Gaussian distribution $N(\hat\mu_i(t), \frac{1}{\rho T_i(t)})$, respectively, and $\delta(\cdot)$ is the Dirac delta function.

Remark 1. The threshold in (2) depends on the horizon $T$, which is sometimes unknown. By using the anytime version of MOSS, i.e., replacing $T$ with $t$ in (2) (see Degenne & Perchet (2016) for details), one can make MOTS an anytime algorithm.
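The sampling step just described (Line 4 of Algorithm 1 together with (2) and (3)) can be written compactly as follows. This is our own sketch, as we read (2) and (3); the helper name `mots_sample_theta` is hypothetical, and the default values are taken from the theory ($\alpha \ge 4$ in Theorem 1) and the experiments ($\rho = 0.9999$).

```python
import numpy as np

def mots_sample_theta(rng, mu_hat, n, T, K, rho=0.9999, alpha=4.0):
    """One draw of theta_i(t) for MOTS, as we read (2) and (3).

    mu_hat : empirical mean of arm i,   n : T_i(t) (number of pulls so far).
    Sample theta_tilde ~ N(mu_hat, 1 / (rho * n)) and clip it at
    tau_i(t) = mu_hat + sqrt( (alpha / n) * log+( T / (K * n) ) ).
    """
    log_plus = max(0.0, np.log(T / (K * n)))                    # log^+(T/(K n))
    tau = mu_hat + np.sqrt(alpha / n * log_plus)                # threshold (2)
    theta_tilde = rng.normal(mu_hat, np.sqrt(1.0 / (rho * n)))  # inflated variance
    return min(theta_tilde, tau)                                # clipping (3)
```

Plugged into the generic loop sketched in Section 3.1 (e.g., `lambda rng, i, m, n: mots_sample_theta(rng, m, n, T, K)` with the horizon `T` and arm count `K` in scope), this yields a runnable approximation of Algorithm 1.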
3.3. Overestimation and underestimation in Thompson sampling

Compared with vanilla TS (see Agrawal & Goyal (2017) for an example), MOTS differs in two aspects: (1) $\theta_i(t)$ in (3) is sampled from a clipped Gaussian distribution instead of a standard Gaussian distribution; (2) before the clipping step, $\tilde\theta_i(t)$ is sampled from a Gaussian distribution whose variance is inflated by a factor of $1/\rho$. Both the clipping step and the inflation are crucial for improving the regret bound of vanilla TS to minimax optimality: they address the overestimation of suboptimal arms and the underestimation of the optimal arm in vanilla TS, respectively.

Overestimation of suboptimal arms: at any time step $t$, vanilla TS needs to ensure that the posterior sample of the optimal arm, $\theta_1(t)$, is larger than that of all $K-1$ suboptimal arms. If the sample from each suboptimal arm has probability $p$ of exceeding $\theta_1(t)$, then by the union bound, the total probability of identifying the wrong arm can be as large as $(K-1)p$, which leads to a $\log K$ factor in the regret bound. MOTS clips the posterior samples at a carefully chosen threshold for each arm, which prevents the suboptimal arms from being overestimated to a large extent.

Underestimation of the optimal arm: even with the clipping step, vanilla TS will still fail to find the optimal arm if its posterior sample $\theta_1(t)$ is too small. In this case, the clipping threshold of arm 1 becomes smaller in the next step, and the arm is then further underestimated. This means the underestimation of the optimal arm has severe consequences: once underestimated, the arm will hardly be picked again. We refer readers to Lemma 5 and its discussion for more details. In contrast, MOTS inflates the variance of the posterior by a factor of $1/\rho$ (equivalently, the standard deviation by $\sqrt{1/\rho}$), where $\rho\in(1/2,1)$. This increases the probability that the posterior sample of arm 1 before clipping is larger than or equal to the threshold, i.e., $\mathbb{P}(\tilde\theta_1 \ge \tau_1)$ becomes larger. In Section 4, we show that with the help of clipping and variance inflation, MOTS achieves minimax optimality.

4. Theoretical Analysis of MOTS

4.1. Regret of MOTS for subGaussian rewards

We first show that MOTS is minimax optimal.

Theorem 1 (Minimax Optimality of MOTS). Assume that the reward of arm $i\in[K]$ is 1-subGaussian with mean $\mu_i$. For any fixed $\rho\in(1/2,1)$ and $\alpha\ge 4$, the regret of Algorithm 1 satisfies
$$R_\mu(T) \le C\bigg(\sqrt{KT} + \sum_{i=2}^{K}\Delta_i\bigg), \qquad (5)$$
where $C>0$ is a universal constant.

The second term on the right-hand side of (5) is due to the fact that we need to pull each arm at least once in Algorithm 1. Following the convention in the literature (Audibert & Bubeck, 2009; Agrawal & Goyal, 2017), we only need to consider the case where $\sum_{i=2}^{K}\Delta_i$ is dominated by $\sqrt{KT}$.

Remark 2. Compared with the results in Agrawal & Goyal (2017), the regret bound of MOTS improves that of TS with Beta priors by a factor of $O(\sqrt{\log T})$ and that of TS with Gaussian priors by a factor of $O(\sqrt{\log K})$. To the best of our knowledge, MOTS is the first TS type algorithm that achieves the minimax optimal regret $\Omega(\sqrt{KT})$ for MAB problems (Auer et al., 2002a).

The next theorem presents the asymptotic regret bound of MOTS for subGaussian rewards.

Theorem 2. Under the same conditions as in Theorem 1, the regret $R_\mu(T)$ of Algorithm 1 satisfies
$$\lim_{T\to\infty}\frac{R_\mu(T)}{\log(T)} = \sum_{i:\Delta_i>0}\frac{2}{\rho\,\Delta_i}. \qquad (6)$$

Lai & Robbins (1985) proved that for Gaussian rewards, the asymptotic regret rate $\lim_{T\to\infty} R_\mu(T)/\log T$ is at least $\sum_{i:\Delta_i>0} 2/\Delta_i$. Therefore, Theorem 2 indicates that the asymptotic regret rate of MOTS matches this lower bound up to a multiplicative factor $1/\rho$, where $\rho\in(1/2,1)$ is arbitrarily fixed. In the following theorem, by setting $\rho$ to be time-varying, we show that MOTS is able to exactly match the asymptotic lower bound.

Theorem 3. Assume that the reward of each arm $i\in[K]$ is 1-subGaussian with mean $\mu_i$. In Algorithm 1, if we choose $\alpha\ge 4$ and $\rho = 1 - \big(\mathrm{ilog}^{(m)}(T)/40\big)^{-1/2}$, then the regret of MOTS satisfies
$$R_\mu(T) = O\bigg(\sqrt{KT\,\mathrm{ilog}^{(m-1)}(T)} + \sum_{i=2}^{K}\Delta_i\bigg) \quad\text{and}\quad \lim_{T\to\infty}\frac{R_\mu(T)}{\log(T)} = \sum_{i:\Delta_i>0}\frac{2}{\Delta_i},$$
where $m = O_T(1)$ is an arbitrary integer independent of $T$ and $\mathrm{ilog}^{(m)}(T)$ is the result of iteratively applying the logarithm function to $T$ for $m$ times, i.e., $\mathrm{ilog}^{(m)}(x) = \max\{\log(\mathrm{ilog}^{(m-1)}(x)),\, e\}$ and $\mathrm{ilog}^{(0)}(x) = x$.

Theorem 3 indicates that MOTS can exactly match the asymptotic lower bound of Lai & Robbins (1985), at the cost of forgoing minimax optimality by up to a factor of $O\big(\sqrt{\mathrm{ilog}^{(m-1)}(T)}\big)$. For instance, if we choose $m = 4$, then MOTS is minimax optimal up to a factor of $O(\sqrt{\log\log\log T})$. Although this problem-independent bound is slightly worse than that in Theorem 1, it is still a significant improvement over the best known problem-independent bound $O(\sqrt{KT\log T})$ for asymptotically optimal TS type algorithms (Agrawal & Goyal, 2017).
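To give a sense of how weak the extra $\mathrm{ilog}$ factor is, the short snippet below (ours, not from the paper) implements the iterated logarithm exactly as defined after Theorem 3 and evaluates it at the horizon used later in the experiments.

```python
import math

def ilog(m, x):
    """Iterated logarithm of order m from Theorem 3:
    ilog^(0)(x) = x and ilog^(m)(x) = max(log(ilog^(m-1)(x)), e)."""
    for _ in range(m):
        x = max(math.log(x), math.e)
    return x

# For m = 4 and T = 10^7, the extra minimax factor sqrt(ilog^(m-1)(T)) is tiny:
T = 10**7
print(ilog(3, T))        # ~2.72, so sqrt(ilog^(3)(T)) is about 1.65
```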
Finally, it should be noted that the lower bound on the asymptotic regret rate, $\lim_{T\to\infty} R_\mu(T)/\log T \ge \sum_{i:\Delta_i>0} 2/\Delta_i$, in Lai & Robbins (1985) was established for Gaussian rewards. Since the Gaussian distribution is a special case of subGaussian distributions, the lower bound for the Gaussian case is also a valid lower bound for the general subGaussian case. Therefore, MOTS is asymptotically optimal. Similar arguments are widely adopted in the literature (Lattimore & Szepesvári, 2020).

4.2. Regret of MOTS for Gaussian rewards

In this subsection, we present a variant of MOTS, called MOTS-J, which simultaneously achieves minimax and asymptotic optimality for Gaussian reward distributions.

Algorithm 2 MOTS-J
1: Input: Arm set $[K]$.
2: Initialization: Play each arm $i\in[K]$ once and set $T_i(K+1) = 1$; let $\hat\mu_i(K+1)$ be the observed reward of playing arm $i$.
3: for $t = K+1, K+2, \dots, T$ do
4:   For all $i\in[K]$, sample $\theta_i(t)$ independently from $D_i(t)$ as follows: sample $\tilde\theta_i(t)$ from $\mathcal{J}(\hat\mu_i(t), 1/T_i(t))$; set $\theta_i(t) = \min\{\tilde\theta_i(t), \tau_i(t)\}$, where $\tau_i(t)$ is defined in (2).
5:   Play arm $A_t = \arg\max_{i\in[K]}\theta_i(t)$ and observe the reward $r_t$.
6:   For all $i\in[K]$: $\hat\mu_i(t+1) = \dfrac{T_i(t)\,\hat\mu_i(t) + r_t\,\mathbb{1}\{i=A_t\}}{T_i(t) + \mathbb{1}\{i=A_t\}}$.
7:   For all $i\in[K]$: $T_i(t+1) = T_i(t) + \mathbb{1}\{i=A_t\}$.
8: end for

Algorithm 2 shows the pseudo-code of MOTS-J. Observe that MOTS-J is identical to MOTS, except that in Line 4 of MOTS-J, it samples $\tilde\theta_i(t)$ from a distribution $\mathcal{J}(\hat\mu_i(t), 1/T_i(t))$ instead of the Gaussian distribution used in Section 3.2 for MOTS. The distribution $\mathcal{J}(\mu, \sigma^2)$ has the following PDF:
$$\phi_{\mathcal{J}}(x) = \frac{|x-\mu|}{2\sigma^{2}}\exp\Big(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\Big). \qquad (8)$$
Note that $\mathcal{J}$, restricted to $x\ge\mu$, is a (shifted) Rayleigh distribution. The following theorem shows the minimax and asymptotic optimality of MOTS-J for Gaussian rewards.

Theorem 4. Assume that the reward of each arm $i$ follows a Gaussian distribution $N(\mu_i, 1)$, and that $\alpha\ge 2$ in (2). The regret of MOTS-J satisfies
$$R_\mu(T) = O\bigg(\sqrt{KT} + \sum_{i=2}^{K}\Delta_i\bigg) \quad\text{and}\quad \lim_{T\to\infty}\frac{R_\mu(T)}{\log(T)} = \sum_{i:\Delta_i>0}\frac{2}{\Delta_i}.$$

Remark 3. To our knowledge, MOTS-J is the first TS type algorithm that simultaneously achieves minimax and asymptotic optimality. Although the clipping threshold of MOTS-J in (2) looks like the MOSS index in Audibert & Bubeck (2009), there are some key differences in the choice of $\alpha$, the theoretical analysis, and the result. Specifically, Audibert & Bubeck (2009) proved that MOSS with exploration index $\alpha = 4$ achieves minimax optimality for MAB. It remained an open problem how to make MOSS both minimax and asymptotically optimal until Ménard & Garivier (2017) proposed the KL-UCB++ algorithm for exponential families of distributions, which implies that MOSS with exploration index $\alpha = 2$ leads to asymptotically optimal regret for Gaussian rewards. For more details on the choice of $\alpha$ in MOSS, we refer interested readers to the discussion in Chapter 9.3 of Lattimore & Szepesvári (2020). Compared with MOSS-index-based UCB algorithms, our proposed MOTS-J is both minimax and asymptotically optimal as long as $\alpha\ge 2$. This flexibility is due to the fact that our theoretical analysis (for the asymptotic optimality), which is based on Thompson sampling, is quite different from the UCB-based analyses in Audibert & Bubeck (2009); Ménard & Garivier (2017). Since it is not confined to a particular choice of the exploration index $\alpha$, MOTS-J is better suited as a basis for designing improved algorithms, e.g., ones achieving instance-dependent optimality (see Lattimore (2015) for details) while retaining asymptotic optimality.
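In the experiments (Section 6), samples from $\mathcal{J}(\mu,\sigma^2)$ are drawn by inverse transform sampling. Assuming the density takes the form reconstructed in (8) above, its CDF can be inverted in closed form; the sketch below is ours (the helper `sample_J` is a hypothetical name), not the authors' implementation.

```python
import numpy as np

def sample_J(rng, mu, sigma2, size=None):
    """Inverse-transform sampling from J(mu, sigma^2) with density
    |x - mu| / (2 sigma^2) * exp(-(x - mu)^2 / (2 sigma^2))  (assumed form of (8)).

    CDF:  F(x) = 1/2 * exp(-(x-mu)^2 / (2 sigma^2))       for x <  mu,
          F(x) = 1 - 1/2 * exp(-(x-mu)^2 / (2 sigma^2))    for x >= mu.
    """
    u = rng.uniform(size=size)
    sigma = np.sqrt(sigma2)
    # distance from mu on each branch (clipping keeps the unused branch finite)
    d_left = sigma * np.sqrt(-2.0 * np.log(np.clip(2.0 * u, 1e-300, 1.0)))
    d_right = sigma * np.sqrt(-2.0 * np.log(np.clip(2.0 * (1.0 - u), 1e-300, 1.0)))
    return np.where(u < 0.5, mu - d_left, mu + d_right)

# Example: one MOTS-J draw before clipping, theta_tilde ~ J(mu_hat, 1/T_i(t)).
rng = np.random.default_rng(0)
theta_tilde = sample_J(rng, mu=0.8, sigma2=1.0 / 25)
```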
5. Proof of the Minimax Optimality of MOTS

In what follows, we prove our main result in Theorem 1; the proofs of all other results are deferred to the appendix. We first present several useful lemmas. Lemmas 1 and 2 characterise the concentration properties of subGaussian random variables.

Lemma 1 (Lemma 9.3 in Lattimore & Szepesvári (2020)). Let $X_1, X_2, \dots$ be independent 1-subGaussian random variables with zero means. Denote $\hat\mu_t = \frac{1}{t}\sum_{s=1}^{t} X_s$. Then, for $\alpha\ge 4$ and any $\Delta>0$,
$$\mathbb{P}\bigg[\exists\, s\in[T]:\ \hat\mu_s + \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)} + \Delta \le 0\bigg] \le \frac{15K}{T\Delta^{2}}.$$

Lemma 2. Let $\omega>0$ be a constant and $X_1, X_2, \dots, X_n$ be independent 1-subGaussian random variables with zero means. Denote $\hat\mu_n = \frac{1}{n}\sum_{s=1}^{n} X_s$. Then, for $\alpha>0$ and any $N\le T$,
$$\sum_{n=1}^{N}\mathbb{P}\bigg[\hat\mu_n + \sqrt{\frac{\alpha}{n}\log^{+}\!\Big(\frac{N}{n}\Big)} \ge \omega\bigg] \le \frac{1+\alpha\log^{+}(N\omega^{2})}{\omega^{2}} + \frac{\sqrt{2\alpha\pi\log^{+}(N\omega^{2})}}{\omega}.$$

Next, we introduce a few notations for ease of exposition. Recall that we have defined $\hat\mu_i(t)$ to be the average reward of arm $i$ up to time $t$. Now, let $\hat\mu_{is}$ be the average reward of arm $i$ up to the time it is played for the $s$-th time. In addition, similar to the definitions of $D_i(t)$ and $\theta_i(t)$, we define $D_{is}$ as the distribution of arm $i$ when it is played for the $s$-th time, and $\theta_{is}$ as a sample from the distribution $D_{is}$.

The following lemma upper bounds the expected total number of pulls of each arm. We note that it was first proved by Agrawal & Goyal (2017); here, we use an improved version presented in Lattimore & Szepesvári (2020) (since MOTS plays every arm once at the beginning, (12) below starts with $t = K+1$ and $s = 1$).

Lemma 3 (Theorem 36.2 in Lattimore & Szepesvári (2020)). Let $\epsilon\in\mathbb{R}^{+}$. Then, the expected number of times that Algorithm 1 plays arm $i$ is bounded by
$$\mathbb{E}[T_i(T)] = \mathbb{E}\bigg[\sum_{t=1}^{T}\mathbb{1}\{A_t=i, E_i(t)\}\bigg] + \mathbb{E}\bigg[\sum_{t=1}^{T}\mathbb{1}\{A_t=i, E_i^{c}(t)\}\bigg]$$
$$\le 1 + \mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G_{1s}(\epsilon)}-1\Big)\bigg] + \mathbb{E}\bigg[\sum_{t=K+1}^{T-1}\mathbb{1}\{A_t=i, E_i^{c}(t)\}\bigg] \qquad (12)$$
$$\le 2 + \mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G_{1s}(\epsilon)}-1\Big)\bigg] + \mathbb{E}\bigg[\sum_{s=1}^{T-1}\mathbb{1}\{G_{is}(\epsilon) > 1/T\}\bigg], \qquad (13)$$
where $G_{is}(\epsilon) = 1 - F_{is}(\mu_1-\epsilon)$, $F_{is}$ is the CDF of $D_{is}$, and $E_i(t) = \{\theta_i(t) \le \mu_1-\epsilon\}$.

Based on the decomposition in (12), one can easily prove a problem-independent regret bound for Thompson sampling by setting $\epsilon = \Delta_i/2$ and summing up over $i = 1, \dots, K$ (Agrawal & Goyal, 2017). Similar techniques are also used in proving the regret bounds of UCB algorithms (Lattimore & Szepesvári, 2020). Note that by the definition of $D_{is}$, $G_{is}(\epsilon)$ is a random variable depending on $\hat\mu_{is}$. For brevity, however, we do not explicitly indicate this dependency by writing $G_{is}(\epsilon)$ as $G_{is}(\epsilon, \hat\mu_{is})$; such shortened notation is also used in Agrawal & Goyal (2017); Lattimore & Szepesvári (2020). Though $G_{is}(\epsilon)$ is defined based on the clipped Gaussian distribution $D_{is}$, the right-hand sides of (12) and (13) can be bounded in the same way as for Gaussian distributions in Agrawal & Goyal (2017).

We need some further notation. Let $F'_{is}$ be the CDF of the Gaussian distribution $N(\hat\mu_{is}, 1/(\rho s))$ for any $s\ge 1$, and let $G'_{is}(\epsilon) = 1 - F'_{is}(\mu_1-\epsilon)$. We have the following lemma.

Lemma 4. Let $\rho\in(1/2,1)$ be a constant. Under the conditions of Theorem 1, for any $\epsilon>0$, there exists a universal constant $c>0$ such that
$$\mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G'_{1s}(\epsilon)}-1\Big)\bigg] \le \frac{c}{\epsilon^{2}}. \qquad (14)$$

Similar quantities are also bounded in Agrawal & Goyal (2017); Lattimore & Szepesvári (2020), and they are essential for proving near-optimal problem-independent regret bounds for Thompson sampling. However, the upper bound in Lemma 4 is sharper than those in previous papers, due to the scaling parameter $\rho$ we choose in our MOTS algorithm. In fact, the requirement $\rho\in(1/2,1)$ is necessary to obtain such an improved upper bound. In the next lemma, we show that if we choose $\rho = 1$, as is done in existing work, the second term on the right-hand side of (12) has a nontrivial lower bound.

Lemma 5. Assume $K\log T \le T$.
If we set $\rho = 1$, then there exists a bandit instance with $\Delta_i = 2\sqrt{K\log T/T}$ for all $i\in\{2,\dots,K\}$ such that
$$\mathbb{E}\bigg[\frac{1}{G'_{1s}(\epsilon)} - 1\bigg] \ge \frac{e^{-s\epsilon^{2}/2}}{s\epsilon^{2}}, \qquad \forall\,\epsilon>0, \qquad (15)$$
and the decomposition in (12) will lead to
$$\sum_{i=2}^{K}\Delta_i\,\mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G'_{1s}(\Delta_i/2)}-1\Big)\bigg] = \Omega\big(\sqrt{KT\log T}\big).$$

The above lemma shows that if we set $\rho = 1$, the decomposition in (12) leads to an unavoidable $\Omega(\sqrt{KT\log T})$ problem-independent regret. Combined with Lemma 4, it indicates that our choice of $\rho\in(1/2,1)$ in MOTS is crucial for improving the previous analysis and obtaining a better regret bound.

When the reward distribution is Bernoulli, it is worth noting that Agrawal & Goyal (2017) achieved an improved regret $O(\sqrt{KT\log K})$ by using Gaussian priors. Meanwhile, they also proved that this regret bound is unimprovable for Thompson sampling using Gaussian priors, which leaves a gap to the minimax optimal regret $O(\sqrt{KT})$. In the following proof of Theorem 1, we show that the clipped Gaussian distribution suffices to close this gap and achieve the $O(\sqrt{KT})$ minimax regret. Moreover, in Theorem 4, we further show that MOTS-J can achieve the minimax optimal regret by using the Rayleigh distribution, without the requirement on the scaling parameter $\rho$, which is crucial for proving the asymptotic optimality simultaneously.

Now, we prove the minimax optimality of MOTS.

Proof of Theorem 1. Recall that $\hat\mu_{is}$ is the average reward of arm $i$ when it has been played $s$ times. We define $\Lambda$ as follows:
$$\Lambda := \mu_1 - \min_{1\le s\le T}\bigg(\hat\mu_{1s} + \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)}\bigg). \qquad (16)$$
The regret of Algorithm 1 can be decomposed as follows:
$$\sum_{i:\Delta_i>0}\Delta_i\,\mathbb{E}[T_i(T)] \le \mathbb{E}[2T\Lambda] + \mathbb{E}\bigg[\sum_{i:\Delta_i>2\Lambda}\Delta_i T_i(T)\bigg] \le \mathbb{E}[2T\Lambda] + 8\sqrt{KT} + \mathbb{E}\bigg[\sum_{i\in S}\Delta_i T_i(T)\bigg], \qquad (17)$$
where $S = \{i : \Delta_i > \max\{2\Lambda, 8\sqrt{K/T}\}\}$ is an index set. (The arms with $\Delta_i \le 8\sqrt{K/T}$ contribute at most $8\sqrt{K/T}\cdot T = 8\sqrt{KT}$ to the regret, which accounts for the middle term.)

The first term in (17) can be bounded as
$$\mathbb{E}[2T\Lambda] \le 2T\int_{0}^{\infty}\mathbb{P}(\Lambda\ge x)\,\mathrm{d}x \le 2T\int_{0}^{\infty}\min\Big\{1, \frac{15K}{Tx^{2}}\Big\}\,\mathrm{d}x = O\big(\sqrt{KT}\big), \qquad (18)$$
where the second inequality comes from Lemma 1, since for any $x>0$,
$$\mathbb{P}(\Lambda\ge x) = \mathbb{P}\bigg(\exists\, 1\le s\le T:\ \mu_1 - \hat\mu_{1s} - \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)} \ge x\bigg) \le \frac{15K}{Tx^{2}}.$$

Now we focus on the term $\sum_{i\in S}\Delta_i T_i(T)$. Note that the update rules of Algorithm 1 ensure $D_i(t+1) = D_i(t)$ ($t\ge K+1$) whenever $A_t\ne i$. We define
$$\tau_{is} := \hat\mu_{is} + \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)}. \qquad (19)$$
By the definition in (2), we have $\tau_{is} = \tau_i(t)$ when $T_i(t) = s$. From the definition of $\Lambda$ in (16), for $i\in S$ we have
$$\tau_{1s} = \hat\mu_{1s} + \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)} \ge \mu_1 - \Lambda \ge \mu_1 - \frac{\Delta_i}{2}. \qquad (20)$$
Recall the definition of $D_{1s}$ and let $\theta_{1s}$ be a sample from the clipped distribution $D_{1s}$. As described in Section 3.2, $\theta_{1s}$ is obtained by the following procedure: we first sample $\tilde\theta_{1s}$ from the distribution $N(\hat\mu_{1s}, 1/(\rho s))$; if $\tilde\theta_{1s} < \tau_{1s}$, we set $\theta_{1s} = \tilde\theta_{1s}$; otherwise, we set $\theta_{1s} = \tau_{1s}$. Inequality (20) implies that $\mu_1 - \Delta_i/2 \le \tau_{1s}$, where $\tau_{1s}$ is the boundary for clipping. Therefore, $\mathbb{P}(\tilde\theta_{1s}\ge\mu_1-\Delta_i/2) = \mathbb{P}(\theta_{1s}\ge\mu_1-\Delta_i/2)$. By definition, $F'_{1s}$ is the CDF of $N(\hat\mu_{1s}, 1/(\rho s))$ and $G'_{1s}(\epsilon) = 1 - F'_{1s}(\mu_1-\epsilon)$. Therefore, for any $i\in S$,
$$G_{1s}(\Delta_i/2) = \mathbb{P}(\theta_{1s}\ge\mu_1-\Delta_i/2) = \mathbb{P}(\tilde\theta_{1s}\ge\mu_1-\Delta_i/2) = G'_{1s}(\Delta_i/2).$$
Using (12) of Lemma 3 with $\epsilon = \Delta_i/2$, for any $i\in S$,
$$\Delta_i\,\mathbb{E}[T_i(T)] \le \Delta_i + \Delta_i\,\mathbb{E}\bigg[\sum_{t=K+1}^{T-1}\mathbb{1}\{A_t=i, E_i^{c}(t)\}\bigg] + \Delta_i\,\mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G_{1s}(\Delta_i/2)}-1\Big)\bigg]$$
$$= \Delta_i + \underbrace{\Delta_i\,\mathbb{E}\bigg[\sum_{t=K+1}^{T-1}\mathbb{1}\{A_t=i, E_i^{c}(t)\}\bigg]}_{I_1} + \underbrace{\Delta_i\,\mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G'_{1s}(\Delta_i/2)}-1\Big)\bigg]}_{I_2}. \qquad (21)$$

Bounding term $I_1$: note that $E_i^{c}(t) = \{\theta_i(t) > \mu_1-\Delta_i/2\}$ and
$$\theta_i(t) \le \tau_i(t) = \hat\mu_i(t) + \sqrt{\frac{\alpha}{T_i(t)}\log^{+}\!\Big(\frac{T}{K\,T_i(t)}\Big)}.$$
We define the following notation:
$$\kappa_i := \sum_{s=1}^{T-1}\mathbb{1}\bigg\{\hat\mu_{is} + \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)} > \mu_1 - \frac{\Delta_i}{2}\bigg\},$$
which immediately implies that
$$I_1 = \Delta_i\,\mathbb{E}\bigg[\sum_{t=K+1}^{T-1}\mathbb{1}\{A_t=i, E_i^{c}(t)\}\bigg] \le \Delta_i\,\mathbb{E}[\kappa_i]. \qquad (22)$$
To further bound (22), we have
$$\mathbb{E}[\kappa_i] \le \sum_{s=1}^{T-1}\mathbb{P}\bigg(\hat\mu_{is} - \mu_i + \sqrt{\frac{\alpha}{s}\log^{+}\!\Big(\frac{T}{Ks}\Big)} \ge \frac{\Delta_i}{2}\bigg) \le \frac{4 + 4\alpha\log^{+}\!\big(\frac{T\Delta_i^{2}}{4K}\big)}{\Delta_i^{2}} + \frac{2\sqrt{2\alpha\pi\log^{+}\!\big(\frac{T\Delta_i^{2}}{4K}\big)}}{\Delta_i}, \qquad (23)$$
where the first inequality is due to the fact that $\mu_1 - \mu_i = \Delta_i$ and the second one is by Lemma 2. For $a>0$, it can be verified that $h(x) = x^{-1}\log^{+}(ax^{2})$ is monotonically decreasing for $x \ge e/\sqrt{a}$. Since $\Delta_i \ge 8\sqrt{K/T} \ge e/\sqrt{T/(4K)}$, we have $\log^{+}\!\big(T\Delta_i^{2}/(4K)\big)/\Delta_i = O\big(\sqrt{T/K}\big)$.
Plugging this into (23), we have $\mathbb{E}[\Delta_i\kappa_i] = O\big(\sqrt{T/K}\big)$.

Bounding term $I_2$: applying Lemma 4, we immediately obtain
$$I_2 = \Delta_i\,\mathbb{E}\bigg[\sum_{s=1}^{T-1}\Big(\frac{1}{G'_{1s}(\Delta_i/2)}-1\Big)\bigg] = O\big(\sqrt{T/K}\big). \qquad (24)$$

Substituting (18), (21), (23), and (24) into (17), we complete the proof of Theorem 1.

6. Experiments

In this section, we experimentally compare our proposed algorithms MOTS and MOTS-J with existing algorithms for multi-armed bandit problems with Gaussian rewards. The baseline algorithms include MOSS (Audibert & Bubeck, 2009), UCB (Katehakis & Robbins, 1995), and Thompson sampling with Gaussian priors (TS for short) (Agrawal & Goyal, 2017). Throughout the experiments, we assume the reward of each arm follows an independent Gaussian distribution $N(\mu, 1)$ with unit variance and mean $\mu$, which is specified differently in the different settings. Specifically, we consider three settings with different numbers of arms $K$ and different mean rewards $\{\mu_i\}_{i=1}^{K}$: (1) $K = 50$, $\mu_1 = 1$, and $\mu_2 = \dots = \mu_{50} = 1-\epsilon$, where $\epsilon$ varies over $\{0.2, 0.1, 0.05\}$ across experiments; (2) $K = 100$, $\mu_1 = 1$, and $\mu_2 = \dots = \mu_{100} = 1-\epsilon$, where $\epsilon$ varies over $\{0.2, 0.1, 0.05\}$ across experiments; (3) $K = 50$, $\mu_1 = 1$, $\mu_{5i+j} = 1 - 0.1i$ for $i = 1, \dots, 9$ and $j = -3, -2, -1, 0, 1$, and $\mu_{47} = \mu_{48} = \mu_{49} = \mu_{50} = 0$. It is worth noting that Setting (1) and Setting (2) differ only in the number of suboptimal arms, where all suboptimal arms share the same mean value, distinct from the mean value of the best arm. In contrast, Setting (3) is a more challenging bandit instance in which the suboptimal arms have rather diverse means.

In all the experiments, the total number of time steps $T$ is set to $10^7$. The parameter $\rho$ for MOTS defined in Section 3.2 is set to 0.9999. Since we focus on Gaussian rewards, we set $\alpha = 2$ in (2) for both MOTS and MOTS-J. Furthermore, for MOTS-J, we need to sample instances from the distribution $\mathcal{J}(\mu, \sigma^2)$, whose PDF is defined in (8). To this end, we use the well-known inverse transform sampling technique: we first compute the corresponding inverse CDF, then draw a uniform random number in $[0, 1]$, which is mapped through the inverse CDF to obtain a sample from $\mathcal{J}(\mu, \sigma^2)$.
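For readers who want a quick, self-contained sanity check of this comparison, the sketch below is ours (not the paper's experiment code): it runs a single short trial of vanilla TS and MOTS on a scaled-down version of Setting (1), whereas the paper uses $T = 10^7$ and averages over 6000 repetitions.

```python
import numpy as np

# Scaled-down sanity check of Setting (1): K = 50, eps = 0.2 (the paper uses
# T = 10^7 and 6000 repetitions; here we run a single short trial).
K, eps, T, rho, alpha = 50, 0.2, 100_000, 0.9999, 2.0
mu = np.full(K, 1.0 - eps); mu[0] = 1.0
rng = np.random.default_rng(0)

def run(clip):                          # clip=False: vanilla TS, clip=True: MOTS
    counts = np.ones(K)                 # warm start: each arm pulled once
    means = rng.normal(mu, 1.0)
    for _ in range(K, T):
        sd = np.sqrt(1.0 / ((rho if clip else 1.0) * counts))
        theta = rng.normal(means, sd)
        if clip:                        # clip each sample at tau_i(t) from (2)
            tau = means + np.sqrt(alpha / counts
                                  * np.maximum(0.0, np.log(T / (K * counts))))
            theta = np.minimum(theta, tau)
        i = int(np.argmax(theta))
        counts[i] += 1
        means[i] += (rng.normal(mu[i], 1.0) - means[i]) / counts[i]
    return float(np.sum((mu.max() - mu) * counts))   # sum_i Delta_i * T_i(T)

print("vanilla TS:", run(False), " MOTS:", run(True))
```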
[Figure 1. The regret of different algorithms in Setting (1): K = 50 and ε ∈ {0.2, 0.1, 0.05}. Panels (a)-(c) correspond to ε = 0.2, 0.1, 0.05; each plots regret against the time step t for UCB, MOSS, TS, MOTS, MOTS-J, and the asymptotic lower bound. The experiments are averaged over 6000 repetitions.]

[Figure 2. The regret of different algorithms in Setting (2): K = 100 and ε ∈ {0.2, 0.1, 0.05}. Panels (a)-(c) correspond to ε = 0.2, 0.1, 0.05; each plots regret against the time step t for UCB, MOSS, TS, MOTS, MOTS-J, and the asymptotic lower bound. The experiments are averaged over 6000 repetitions.]

For Setting (1), Figures 1(a), 1(b), and 1(c) report the regret of all algorithms for ε = 0.2, 0.1, and 0.05, respectively. For all ε values, MOTS consistently outperforms the baselines at every time step t, and MOTS-J outperforms the baselines especially when t is large. For instance, in Figure 1(c), at time step $t = T = 10^7$, the regrets of MOTS and MOTS-J are 9615 and 9245, respectively, while the regrets of TS, MOSS, and UCB are 14058, 14721, and 37781, respectively. The pink solid line represents the asymptotic lower bound, i.e., $R_\mu(T) = \log(T)\sum_{i:\Delta_i>0} 2/\Delta_i$. All the experimental results indicate that our algorithms are asymptotically optimal.

For Setting (2), Figures 2(a), 2(b), and 2(c) report the regrets of MOTS, MOTS-J, MOSS, TS, and UCB for ε = 0.2, 0.1, and 0.05, respectively. Again, for all ε values and all time steps t, MOTS consistently has the smallest regret, outperforming all baselines, and MOTS-J outperforms all baselines especially when t is large.

[Figure 3. The regret of different algorithms in Setting (3): K = 50 and the mean values of the arms are drawn from a diverse set; regret is plotted against the time step t for UCB, MOSS, TS, MOTS, MOTS-J, and the asymptotic lower bound. The experiments are averaged over 6000 repetitions.]

The experimental results for Setting (3) are reported in Figure 3 and are again consistent with the theoretical findings that MOTS and MOTS-J outperform the baseline algorithms, even though the suboptimal arms are extremely diverse and the bandit instance is thus relatively hard. In addition, our algorithms remain asymptotically optimal. In summary, our algorithms consistently outperform TS, MOSS, and UCB in various settings.

7. Conclusion and Future Work

We solved the open problem on the minimax optimality of Thompson sampling (Li & Chapelle, 2012). We proposed the MOTS algorithm and proved that it achieves the minimax optimal regret $O(\sqrt{KT})$ when rewards are generated from subGaussian distributions. In addition, we proposed a variant of MOTS called MOTS-J that simultaneously achieves the minimax and asymptotically optimal regret for K-armed bandit problems when rewards are generated from Gaussian distributions. Our experiments demonstrate the superior performance of MOTS and MOTS-J compared with state-of-the-art bandit algorithms. Interestingly, our experimental results show that the performance of MOTS is never worse than that of MOTS-J. Therefore, an interesting future direction is to investigate whether the proposed MOTS with clipped Gaussian distributions can also achieve both minimax and asymptotic optimality for multi-armed bandits.

Acknowledgement

We thank the anonymous reviewers for their helpful comments. X. Xiao is supported by the Ministry of Education, Singapore, under Tier-2 Grant R-252-000-A70-112. T. Jin is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-PhD/2021-01-004[T]). P. Xu and Q. Gu are partially supported by the National Science Foundation IIS-1904183. J. Shi is supported by the financial support (1-BE3T) of research project (P0033898) from Hong Kong Polytechnic University. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Abramowitz, M. and Stegun, I. A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. US Department of Commerce, National Bureau of Standards Applied Mathematics Series 55, 1965.

Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012.

Agrawal, S. and Goyal, N. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pp. 99–107, 2013.
Agrawal, S. and Goyal, N. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):30, 2017.

Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, pp. 217–226, 2009.

Auer, P. and Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

Bubeck, S. and Liu, C.-Y. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pp. 638–646, 2013.

Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.

Degenne, R. and Perchet, V. Anytime optimal algorithms in stochastic multi-armed bandits. In International Conference on Machine Learning, pp. 1587–1595. PMLR, 2016.

Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. In Advances in Neural Information Processing Systems, pp. 501–511, 2019.

Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 359–376, 2011.

Garivier, A., Lattimore, T., and Kaufmann, E. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, pp. 784–792, 2016.

Jin, T., Xu, P., Xiao, X., and Gu, Q. Double explore-then-commit: Asymptotic optimality and beyond. arXiv preprint arXiv:2002.09174, 2020.

Katehakis, M. N. and Robbins, H. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995.

Kaufmann, E. On Bayesian index policies for sequential resource allocation. arXiv preprint arXiv:1601.01190, 2016.

Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pp. 199–213. Springer, 2012.

Korda, N., Kaufmann, E., and Munos, R. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pp. 1448–1456, 2013.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Lattimore, T. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint arXiv:1507.07880, 2015.

Lattimore, T. Refining the confidence level for optimistic bandit strategies. The Journal of Machine Learning Research, 19(1):765–796, 2018.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Li, L. and Chapelle, O. Open problem: Regret bounds for Thompson sampling. In Conference on Learning Theory, pp. 43–1, 2012.

Maillard, O.-A., Munos, R., and Stoltz, G. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 497–514, 2011.

Ménard, P. and Garivier, A. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pp. 223–237, 2017.
Perchet, V., Rigollet, P., Chassang, S., Snowberg, E., et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.

Pike-Burke, C., Agrawal, S., Szepesvari, C., and Grunewalder, S. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pp. 4105–4113, 2018.

Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Vaswani, S., Mehrabian, A., Durand, A., and Kveton, B. Old dog learns new tricks: Randomized UCB for bandit problems. arXiv preprint arXiv:1910.04928, 2019.

Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114–5122, 2018.