# Bounded Regret for Finite-Armed Structured Bandits

Tor Lattimore, Department of Computing Science, University of Alberta, Canada. tlattimo@ualberta.ca
Rémi Munos, INRIA Lille, France¹. remi.munos@inria.fr

¹ Current affiliation: Google DeepMind.

**Abstract.** We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret showing that, at least in special cases, the new algorithm is nearly optimal.

## 1 Introduction

The multi-armed bandit problem is a reinforcement learning problem with $K$ actions. At each time-step a learner must choose an action $i$, after which it receives a reward distributed with mean $\mu_i$. The goal is to maximise the cumulative reward. This is perhaps the simplest setting in which the well-known exploration/exploitation dilemma becomes apparent, with a learner being forced to choose between exploring arms about which she has little information, and exploiting by choosing the arm that currently appears optimal.

*Figure 1: Examples.*

We consider a general class of $K$-armed bandit problems where the expected return of each arm may be dependent on other arms. This model has already been considered when the dependencies are linear [18] and also in the general setting studied here [12, 1]. Let $\Theta \ni \theta^*$ be an arbitrary parameter space and define the expected return of arm $i$ by $\mu_i(\theta^*) \in \mathbb{R}$. The learner is permitted to know the functions $\mu_1, \dots, \mu_K$, but not the true parameter $\theta^*$, which determines the mean reward of each arm. The performance of a learner is measured by the (expected) cumulative regret, which is the difference between the expected return of the optimal policy and the (expected) return of the learner's policy,

$$R_n := n \max_{i \in \{1,\dots,K\}} \mu_i(\theta^*) - \sum_{t=1}^{n} \mu_{I_t}(\theta^*),$$

where $I_t$ is the arm chosen at time-step $t$.

A motivating example is as follows. Suppose a long-running company must decide each week whether or not to purchase some new form of advertising with unknown expected returns. The problem may be formulated using the new setting by letting $K = 2$ and taking $\Theta$ to be a real interval of possible returns. We assume the baseline performance without purchasing the advertising is known, and so define $\mu_1(\theta) = 0$ for all $\theta$. The expected return of choosing to advertise is $\mu_2(\theta) = \theta$ (see Figure 1(b)).

Our main contribution is a new algorithm based on UCB [6] for the structured bandit problem, with strong problem-dependent guarantees on the regret. The key improvement over UCB is that the new algorithm enjoys finite regret in many cases, while UCB suffers logarithmic regret unless all arms have the same return. For example, in Figures 1(a) and 1(c) we show that finite regret is possible for all $\theta^*$, while in the advertising problem finite regret is attainable if $\theta^* > 0$. The improved algorithm exploits the known structure and so avoids the famous negative results of Lai and Robbins [17]. One insight from this work is that knowing the return of the optimal arm and a bound on the minimum gap is not the only information that leads to the possibility of finite regret. In the examples given above neither quantity is known, but the assumed structure is nevertheless sufficient for finite regret.
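To make the setting concrete, here is a small Python sketch of a structured bandit environment for the advertising example just described. It is purely illustrative and not part of the paper; the class name `StructuredBandit`, the Gaussian reward model and the helper `regret_of` are assumptions made here for demonstration (they do, however, match the sub-gaussian reward assumption and the regret definition used below).

```python
import numpy as np

# Illustrative sketch (not from the paper): a structured bandit is a list of mean
# functions mu_k : Theta -> R plus a hidden true parameter theta_star.
class StructuredBandit:
    def __init__(self, mus, theta_star, sigma=1.0, rng=None):
        self.mus = mus                  # mean functions, one per arm
        self.theta_star = theta_star    # true (hidden) parameter
        self.sigma = sigma              # sub-gaussian scale (Gaussian here)
        self.rng = rng or np.random.default_rng(0)

    def pull(self, i):
        """Sample a reward from arm i (0-indexed)."""
        return self.mus[i](self.theta_star) + self.sigma * self.rng.normal()

    def regret_of(self, pulls):
        """Cumulative pseudo-regret of a sequence of arm indices."""
        means = np.array([mu(self.theta_star) for mu in self.mus])
        return float(np.sum(means.max() - means[pulls]))

# Advertising example: arm 0 is the known baseline, arm 1 has mean theta.
bandit = StructuredBandit(mus=[lambda th: 0.0, lambda th: th], theta_star=0.3)
print(bandit.pull(1), bandit.regret_of([0, 0, 1]))
```

The same class covers the other examples discussed later simply by swapping in different mean functions.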
Despite the enormous literature on bandits, as far as we are aware this is the first time this setting has been considered with the aim of achieving finite regret. There has been substantial work on exploiting various kinds of structure to reduce an otherwise impossible problem to one where sub-linear (or even logarithmic) regret is possible [19, 4, 10, and references therein], but the focus is usually on efficiently dealing with large action spaces rather than on sub-logarithmic/finite regret. The most comparable previous work studies the case where both the return of the best arm and a bound on the minimum gap between the best arm and any sub-optimal arm are known [11, 9]; this line of work extended the permutation bandits studied by Lai and Robbins [16] and more general results by the same authors [15]. Also relevant is the paper by Agrawal et al. [1], which studied a similar setting, but where $\Theta$ was finite. Graves and Lai [12] extended the aforementioned contribution to continuous parameter spaces (and also to MDPs). Their work differs from ours in a number of ways. Most notably, their objective is to compute exactly the asymptotically optimal regret in the case where finite regret is not possible. In the case where finite regret is possible they prove only that the optimal regret is sub-logarithmic, and do not present any explicit bounds on the actual regret. Aside from this, their results depend on the parameter space being a metric space, and they assume that the optimal policy is locally constant about the true parameter.

## 2 Notation

General. Most of our notation is common with [8]. The indicator function is denoted by $\mathbb{1}\{\mathrm{expr}\}$, which is 1 if expr is true and 0 otherwise. We use $\log$ for the natural logarithm. Logical and/or are denoted by $\wedge$ and $\vee$ respectively. Define the function $\omega(x) := \min\{y \in \mathbb{N} : z \geq x\log z \text{ for all } z \geq y\}$, which satisfies $\log\omega(x) \in O(\log x)$; in fact, $\lim_{x\to\infty} \log(\omega(x))/\log(x) = 1$.

Bandits. Let $\Theta$ be a set. A $K$-armed structured bandit is characterised by a set of functions $\mu_k : \Theta \to \mathbb{R}$, where $\mu_k(\theta)$ is the expected return of arm $k \in A := \{1,\dots,K\}$ given unknown parameter $\theta$. We define the mean of the optimal arm by the function $\mu^* : \Theta \to \mathbb{R}$ with $\mu^*(\theta) := \max_i \mu_i(\theta)$. The true unknown parameter that determines the means is $\theta^* \in \Theta$. The best arm is $i^* := \arg\max_i \mu_i(\theta^*)$. The arm chosen at time-step $t$ is denoted by $I_t$, while $X_{i,s}$ is the $s$-th reward obtained when sampling from arm $i$. We denote the number of times arm $i$ has been chosen up to time-step $t$ by $T_i(t)$. The empirical estimate of the mean of arm $i$ based on the first $s$ samples is $\hat\mu_{i,s}$. We define the gap between the means of the best arm and arm $i$ by $\Delta_i := \mu^*(\theta^*) - \mu_i(\theta^*)$. The set of sub-optimal arms is $\{i \in A : \Delta_i > 0\}$. The minimum gap is $\Delta_{\min} := \min_{i : \Delta_i > 0}\Delta_i$, while the maximum gap is $\Delta_{\max} := \max_{i : \Delta_i > 0}\Delta_i$. The cumulative regret is defined as above, $R_n := n\mu^*(\theta^*) - \sum_{t=1}^{n}\mu_{I_t}(\theta^*)$, so that $\mathbb{E}R_n = \sum_{i=1}^{K}\Delta_i\,\mathbb{E}T_i(n)$. Note that quantities like $\Delta_i$ and $i^*$ depend on $\theta^*$, which is omitted from the notation. As is rather common, we assume that the returns are sub-gaussian, which means that if $X$ is the return sampled from some arm, then $\ln\mathbb{E}\exp(\lambda(X-\mathbb{E}X)) \leq \lambda^2\sigma^2/2$ for all $\lambda \in \mathbb{R}$. As usual we assume that $\sigma^2$ is known and does not depend on the arm. If $X_1,\dots,X_n$ are sampled independently from some arm with mean $\mu$ and $S_n = \sum_{t=1}^{n}X_t$, then the following maximal concentration inequality is well known:

$$\mathbb{P}\left\{\max_{1\leq t\leq n}|S_t - t\mu| \geq \varepsilon\right\} \leq 2\exp\left(-\frac{\varepsilon^2}{2n\sigma^2}\right).$$

A straightforward corollary is that $\mathbb{P}\{|\hat\mu_{i,n} - \mu_i| \geq \varepsilon\} \leq 2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2}\right)$. It is an important point that $\Theta$ is completely arbitrary. The classic multi-armed bandit can be obtained by setting $\Theta = \mathbb{R}^K$ and $\mu_k(\theta) = \theta_k$, which removes all dependencies between the arms.
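Before moving on, here is a small numerical illustration (mine, not the paper's) of the function $\omega$ defined above, which reappears later in the bound of Theorem 3. The simplification of searching only over $y \geq x$ is valid for $x > e$, as noted in the comments.

```python
import math

# Illustrative sketch: omega(x) = min{ y in N : z >= x log z for all z >= y },
# as defined in the notation section.
def omega(x: float) -> int:
    # For x > e the map z -> z - x*log(z) is increasing on [x, inf), so it suffices
    # to find the smallest integer y >= x with y >= x*log(y).
    y = max(2, math.ceil(x))
    while y < x * math.log(y):
        y += 1
    return y

# log(omega(x)) / log(x) slowly approaches 1 as x grows.
for x in [10, 100, 1000, 10000]:
    print(x, omega(x), math.log(omega(x)) / math.log(x))
```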
The setting where the optimal expected return is known to be zero and a bound $\Delta_i \geq \varepsilon$ on the gaps is known can be recovered by choosing $\Theta = (-\infty, -\varepsilon]^K \times \{1,\dots,K\}$ and $\mu_k(\theta_1,\dots,\theta_K,i) = \theta_k\,\mathbb{1}\{k \neq i\}$. We do not demand that $\mu_k : \Theta \to \mathbb{R}$ be continuous, or even that $\Theta$ be endowed with a topology.

## 3 Structured UCB

We propose a new algorithm called UCB-S, a straightforward modification of UCB [6] in which the known structure of the problem is exploited. At each time-step it constructs a confidence interval about the mean of each arm. From these a subspace $\Theta_t \subseteq \Theta$ is constructed, which contains the true parameter $\theta^*$ with high probability. The algorithm then takes the optimistic action over all $\tilde\theta \in \Theta_t$.

Algorithm 1 UCB-S
1: Input: functions $\mu_1, \dots, \mu_K : \Theta \to [0,1]$
2: for $t = 1, 2, \dots$ do
3: Define the confidence set $\Theta_t := \left\{\tilde\theta \in \Theta : \left|\mu_i(\tilde\theta) - \hat\mu_{i,T_i(t-1)}\right| < \sqrt{\dfrac{2\alpha\sigma^2\log t}{T_i(t-1)}}\ \text{for all } i\right\}$
4: if $\Theta_t = \emptyset$ then
5: Choose the arm arbitrarily
6: else
7: Compute the optimistic arm $i \in \arg\max_i \sup_{\tilde\theta\in\Theta_t}\mu_i(\tilde\theta)$
8: Choose arm $i$

Remark 1. The choice of arm when $\Theta_t = \emptyset$ does not affect the regret bounds in this paper. In practice it would be possible to simply increase $t$ without taking an action, but this complicates the analysis.

In many cases the true parameter $\theta^*$ is never identified, in the sense that we do not expect that $\Theta_t \to \{\theta^*\}$. The computational complexity of UCB-S depends on the difficulty of computing $\Theta_t$ and the optimistic arm within this set. This is efficient in simple cases, for instance when each $\mu_k$ is piecewise linear, but may be intractable for complex functions.

We present two main theorems bounding the regret of the UCB-S algorithm. The first holds for arbitrary $\theta^*$ and gives a logarithmic bound on the regret comparable to that obtained for UCB by [6]. The analysis is slightly different because UCB-S maintains upper and lower confidence bounds and selects its actions optimistically from the model class, rather than by maximising the upper confidence bound as UCB does.

Theorem 2. If $\alpha > 2$ and $\theta^* \in \Theta$, then the algorithm UCB-S suffers an expected regret of at most
$$\mathbb{E}R_n \leq \frac{2\Delta_{\max}K(\alpha-1)}{\alpha-2} + \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\sigma^2\alpha\log n}{\Delta_i}\right).$$

If the samples from the optimal arm are sufficient to learn the optimal action, then finite regret is possible. In Section 6 we give something of a converse by showing that if knowing the mean of the optimal arm is insufficient to act optimally, then logarithmic regret is unavoidable.

Theorem 3. Let $\alpha = 4$ and assume there exists an $\varepsilon > 0$ such that
$$(\forall\theta\in\Theta)\qquad |\mu_{i^*}(\theta^*) - \mu_{i^*}(\theta)| < \varepsilon \implies \mu_{i^*}(\theta) > \mu_i(\theta)\ \text{for all } i \neq i^*. \tag{1}$$
Then
$$\mathbb{E}R_n \leq \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\alpha\sigma^2\log\omega'}{\Delta_i}\right) + 3\Delta_{\max}K + \Delta_{\max}K^3, \quad\text{with } \omega' := \max\left\{\omega\!\left(\frac{8\sigma^2\alpha K}{\varepsilon^2}\right),\ \omega\!\left(\frac{8\sigma^2\alpha K}{\Delta_{\min}^2}\right)\right\}.$$

Remark 4. For small $\varepsilon$ and large $n$ the expected regret looks like $\mathbb{E}R_n \in O\!\left(\sum_{i:\Delta_i>0}\frac{\sigma^2}{\Delta_i}\log\frac{\sigma^2 K}{\varepsilon^2}\right)$ (for small $n$ the regret is, of course, even smaller). The explanation of the bound is as follows. If at some time-step $t$ all confidence intervals contain the truth and the width of the confidence interval about $i^*$ drops below $\varepsilon$, then by condition (1) arm $i^*$ is the optimistic arm within $\Theta_t$, so UCB-S suffers no regret at this time-step. Since the number of samples of each sub-optimal arm grows at most logarithmically (by the proof of Theorem 2), the number of samples of the best arm must grow linearly. Therefore the number of time-steps before the best arm has been pulled $O(\varepsilon^{-2})$ times is also of order $\varepsilon^{-2}$, up to logarithmic factors. After this point the algorithm suffers only a constant cumulative penalty for the possibility that the confidence intervals do not contain the truth, which is finite for suitably chosen values of $\alpha$. Note that Agrawal et al. [1] required essentially the same condition as (1) to achieve finite regret, but specialised to the case where $\Theta$ is finite.
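The following is a minimal sketch of Algorithm 1 in Python. It is not the authors' implementation (theirs is referenced only via the supplementary material). To keep the confidence set $\Theta_t$ and the optimistic arm computable by brute force it assumes $\Theta$ is discretised to a finite grid `thetas`, and it plays each arm once before using the confidence radius; both are practical conventions adopted here rather than part of the algorithm as stated.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (UCB-S). Theta is discretised to the grid `thetas`
# so that the confidence set and the optimistic arm can be found by enumeration.
def ucb_s(mus, thetas, pull, n, alpha=4.0, sigma=1.0):
    K = len(mus)
    counts = np.zeros(K, dtype=int)      # T_i(t-1)
    means = np.zeros(K)                  # empirical means \hat mu_{i, T_i(t-1)}
    mu_table = np.array([[mu(th) for mu in mus] for th in thetas])  # |Theta| x K
    history = []
    for t in range(1, n + 1):
        if np.any(counts == 0):
            i = int(np.argmin(counts))   # initialisation: play unpulled arms first
        else:
            radius = np.sqrt(2 * alpha * sigma**2 * np.log(t) / counts)
            # Theta_t: parameters whose means lie inside every confidence interval
            in_set = np.all(np.abs(mu_table - means) < radius, axis=1)
            if not in_set.any():
                i = 0                    # Theta_t empty: choose arbitrarily (Remark 1)
            else:
                # optimistic arm: argmax_i sup_{theta in Theta_t} mu_i(theta)
                i = int(np.argmax(mu_table[in_set].max(axis=0)))
        x = pull(i)
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
        history.append(i)
    return history
```

With the `StructuredBandit` sketch from the introduction, for example, `ucb_s([lambda th: 0.0, lambda th: th], np.linspace(-1, 1, 201), bandit.pull, n=1000)` runs UCB-S on the advertising example.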
An interesting question is raised by comparing the bound in Theorem 3 to those given by Bubeck et al. [11], where, if the expected return of the best arm is known and $\varepsilon$ is a known bound on the minimum gap, a regret bound of
$$O\left(\sum_{i:\Delta_i>0}\frac{1}{\Delta_i}\left(1 + \log\log\frac{1}{\varepsilon}\right)\right) \tag{2}$$
is achieved. If $\varepsilon$ is close to $\Delta_i$, then this bound is an improvement over the bound given by Theorem 3, although our theorem is more general. The improved UCB algorithm [7] enjoys a bound on the expected regret of $O\big(\sum_{i:\Delta_i>0}\frac{1}{\Delta_i}\log(n\Delta_i^2)\big)$. If we follow the same reasoning as above we obtain a bound comparable to (2). Unfortunately, though, the extension of the improved UCB algorithm to the structured setting is rather challenging, with the main obstruction being the extreme growth of the phases used by improved UCB. Refining the phases leads to super-logarithmic regret, a problem we ultimately failed to resolve. Nevertheless, we feel that there is some hope of obtaining a bound like (2) in this setting.

## 4 Examples

Before the proofs of Theorems 2 and 3 we give some example structured bandits and indicate the regions where the conditions for Theorem 3 are (or are not) met. In Figure 2, areas where Theorem 3 can be applied to obtain finite regret are unshaded, while those with logarithmic regret are shaded.

*Figure 2: Examples (panels (a)–(f)).*

- (a) The conditions for Theorem 3 are met for all $\theta^* \neq 0$, while for $\theta^* = 0$ the regret vanishes for all policies; therefore the regret is bounded by $\mathbb{E}R_n \in O\!\left(\mathbb{1}\{\theta^* \neq 0\}\frac{1}{|\theta^*|}\log\frac{1}{|\theta^*|}\right)$.
- (b) The uninformative arm (whose mean does not depend on $\theta$) is not globally optimal, so Theorem 3 does not apply for $\theta^* < 0$, where that arm is optimal. For $\theta^* > 0$ the optimal arm is the informative one, the conditions are met, and finite regret is again achieved: $\mathbb{E}R_n \in O\!\left(\mathbb{1}\{\theta^* < 0\}\frac{\log n}{|\theta^*|} + \mathbb{1}\{\theta^* > 0\}\frac{1}{\theta^*}\log\frac{1}{\theta^*}\right)$.
- (c) The conditions for Theorem 3 are again met for all non-zero $\theta^*$, which leads, as in (a), to a regret of $\mathbb{E}R_n \in O\!\left(\mathbb{1}\{\theta^* \neq 0\}\frac{1}{|\theta^*|}\log\frac{1}{|\theta^*|}\right)$.
- Examples (d) and (e) illustrate the potential complexity of the regions in which finite regret is possible. Note especially that in (e) the regret for $\theta^* = 1/2$ is logarithmic in the horizon, but finite for $\theta^*$ arbitrarily close to it.
- Example (f) is a permutation bandit with 3 arms, for which it can be clearly seen that the conditions of Theorem 3 are satisfied.

## 5 Proof of Theorems 2 and 3

We start by bounding the probability that some mean does not lie inside the confidence set.

Lemma 5. $\mathbb{P}\{F_t\} \leq 2Kt\exp(-\alpha\log t) = 2Kt^{1-\alpha}$, where $F_t$ is the event that there exists an arm $i$ with $\left|\hat\mu_{i,T_i(t-1)} - \mu_i(\theta^*)\right| \geq \sqrt{2\alpha\sigma^2\log(t)/T_i(t-1)}$.

Proof. We use the concentration guarantees:
$$\mathbb{P}\{F_t\} \overset{(a)}{=} \mathbb{P}\left\{\exists i : \left|\hat\mu_{i,T_i(t-1)} - \mu_i(\theta^*)\right| \geq \sqrt{\tfrac{2\alpha\sigma^2\log t}{T_i(t-1)}}\right\} \overset{(b)}{\leq} \sum_{i=1}^{K}\mathbb{P}\left\{\left|\hat\mu_{i,T_i(t-1)} - \mu_i(\theta^*)\right| \geq \sqrt{\tfrac{2\alpha\sigma^2\log t}{T_i(t-1)}}\right\} \overset{(c)}{\leq} \sum_{i=1}^{K}\sum_{s=1}^{t}\mathbb{P}\left\{\left|\hat\mu_{i,s} - \mu_i(\theta^*)\right| \geq \sqrt{\tfrac{2\alpha\sigma^2\log t}{s}}\right\} \overset{(d)}{\leq} \sum_{i=1}^{K}\sum_{s=1}^{t}2\exp(-\alpha\log t) \overset{(e)}{=} 2Kt^{1-\alpha},$$
where (a) follows from the definition of $F_t$; (b) by the union bound; (c) also follows from the union bound and is the standard trick to deal with the random variable $T_i(t-1)$; (d) follows from the concentration inequalities for sub-gaussian random variables; and (e) is trivial.

Proof of Theorem 2. Let $i$ be an arm with $\Delta_i > 0$ and suppose that $I_t = i$. Then either $F_t$ holds or
$$T_i(t-1) < \frac{8\sigma^2\alpha\log n}{\Delta_i^2} =: u_i(n). \tag{3}$$
To see this, note that if $F_t$ does not hold then the true parameter lies within the confidence set, $\theta^* \in \Theta_t$. Suppose, for the sake of contradiction, that $F_t$ and (3) are both false. Then
$$\sup_{\tilde\theta\in\Theta_t}\mu_{i^*}(\tilde\theta) \overset{(a)}{\geq} \mu^*(\theta^*) \overset{(b)}{=} \mu_i(\theta^*) + \Delta_i \overset{(c)}{>} \Delta_i + \hat\mu_{i,T_i(t-1)} - \sqrt{\frac{2\alpha\sigma^2\log t}{T_i(t-1)}} \overset{(d)}{\geq} \hat\mu_{i,T_i(t-1)} + \sqrt{\frac{2\alpha\sigma^2\log t}{T_i(t-1)}} \overset{(e)}{\geq} \sup_{\tilde\theta\in\Theta_t}\mu_i(\tilde\theta),$$
where (a) follows since $\theta^* \in \Theta_t$; (b) is the definition of the gap; (c) holds since $F_t$ is false; (d) is true because (3) is false, so the confidence radius is at most $\Delta_i/2$; and (e) follows from the definition of $\Theta_t$. Therefore arm $i$ is not chosen, a contradiction.
We now bound the expected number of times that arm $i$ is played within the first $n$ time-steps:
$$\mathbb{E}T_i(n) \overset{(a)}{=} \mathbb{E}\sum_{t=1}^{n}\mathbb{1}\{I_t=i\} \overset{(b)}{\leq} u_i(n) + \mathbb{E}\sum_{t=u_i+1}^{n}\mathbb{1}\{I_t=i \wedge \text{(3) is false}\} \overset{(c)}{\leq} u_i(n) + \mathbb{E}\sum_{t=u_i+1}^{n}\mathbb{1}\{F_t \wedge I_t=i\},$$
where (a) follows from the linearity of expectation and the definition of $T_i(n)$; (b) by Equation (3) and the definition of $u_i(n)$; and (c) by recalling that playing arm $i$ at time-step $t$ implies that either $F_t$ or (3) must be true. Therefore
$$\mathbb{E}R_n = \sum_{i:\Delta_i>0}\Delta_i\,\mathbb{E}T_i(n) \leq \sum_{i:\Delta_i>0}\Delta_i u_i(n) + \Delta_{\max}\,\mathbb{E}\sum_{t=1}^{n}\mathbb{1}\{F_t\}. \tag{4}$$
Bounding the second summation,
$$\mathbb{E}\sum_{t=1}^{n}\mathbb{1}\{F_t\} \overset{(a)}{=} \sum_{t=1}^{n}\mathbb{P}\{F_t\} \overset{(b)}{\leq} \sum_{t=1}^{n}2Kt^{1-\alpha} \overset{(c)}{\leq} \frac{2K(\alpha-1)}{\alpha-2},$$
where (a) follows by exchanging the expectation and the sum and because the expectation of an indicator function is the probability of the corresponding event; (b) by Lemma 5; and (c) is trivial. Substituting into (4) leads to
$$\mathbb{E}R_n \leq \frac{2\Delta_{\max}K(\alpha-1)}{\alpha-2} + \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\sigma^2\alpha\log n}{\Delta_i}\right),$$
which completes the proof of Theorem 2.

Before the proof of Theorem 3 we need a high-probability bound on the number of times arm $i$ is pulled, which is proven along the lines of similar results by [5].

Lemma 6. Let $i$ be some sub-optimal arm, $\Delta_i > 0$. If $z > u_i(n)$, then $\mathbb{P}\{T_i(n) > z\} \leq \frac{2K}{\alpha-2}z^{2-\alpha}$.

Proof. As in the proof of Theorem 2, if $t \leq n$, $F_t$ is false and $T_i(t-1) > u_i(n) \geq u_i(t)$, then arm $i$ is not chosen. Therefore
$$\mathbb{P}\{T_i(n) > z\} \leq \sum_{t=z+1}^{n}\mathbb{P}\{F_t\} \overset{(a)}{\leq} \sum_{t=z+1}^{n}2Kt^{1-\alpha} \overset{(b)}{\leq} 2K\int_{z}^{n}t^{1-\alpha}\,\mathrm{d}t \overset{(c)}{\leq} \frac{2K}{\alpha-2}z^{2-\alpha},$$
where (a) follows from Lemma 5, and (b) and (c) are trivial.

Lemma 7. Assume the conditions of Theorem 3 hold, and additionally that $T_{i^*}(t-1) \geq \left\lceil\frac{8\alpha\sigma^2\log t}{\varepsilon^2}\right\rceil$ and that $F_t$ is false. Then $I_t = i^*$.

Proof. Since $F_t$ is false, for every $\tilde\theta \in \Theta_t$ we have
$$|\mu_{i^*}(\tilde\theta) - \mu_{i^*}(\theta^*)| \overset{(a)}{\leq} |\mu_{i^*}(\tilde\theta) - \hat\mu_{i^*,T_{i^*}(t-1)}| + |\hat\mu_{i^*,T_{i^*}(t-1)} - \mu_{i^*}(\theta^*)| \overset{(b)}{<} 2\sqrt{\frac{2\sigma^2\alpha\log t}{T_{i^*}(t-1)}} \overset{(c)}{\leq} \varepsilon,$$
where (a) is the triangle inequality; (b) follows from the definition of the confidence interval and because $F_t$ is false; and (c) by the assumed lower bound on $T_{i^*}(t-1)$. Therefore, by (1), for all $\tilde\theta \in \Theta_t$ the best arm under $\tilde\theta$ is $i^*$. Finally, since $F_t$ is false, $\theta^* \in \Theta_t$, which means that $\Theta_t \neq \emptyset$. Therefore $I_t = i^*$ as required.

Proof of Theorem 3. Let $\omega'$ be some constant to be chosen later. Then the regret may be written as
$$\mathbb{E}R_n \leq \mathbb{E}\sum_{t=1}^{\omega'}\sum_{i=1}^{K}\Delta_i\mathbb{1}\{I_t=i\} + \Delta_{\max}\,\mathbb{E}\sum_{t=\omega'+1}^{n}\mathbb{1}\{I_t\neq i^*\}. \tag{5}$$
The first summation is bounded as in the proof of Theorem 2 by
$$\mathbb{E}\sum_{t=1}^{\omega'}\sum_{i=1}^{K}\Delta_i\mathbb{1}\{I_t=i\} \leq \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\alpha\sigma^2\log\omega'}{\Delta_i}\right) + \Delta_{\max}\sum_{t=1}^{\omega'}\mathbb{P}\{F_t\}. \tag{6}$$
We now bound the second sum in (5) and choose $\omega'$. By Lemma 6, if $n/K > u_i(n)$, then
$$\mathbb{P}\left\{T_i(n) > \frac{n}{K}\right\} \leq \frac{2K}{\alpha-2}\left(\frac{n}{K}\right)^{2-\alpha}. \tag{7}$$
Suppose $t \geq \omega' := \max\left\{\omega\!\left(\frac{8\sigma^2\alpha K}{\varepsilon^2}\right),\ \omega\!\left(\frac{8\sigma^2\alpha K}{\Delta_{\min}^2}\right)\right\}$. Then $t/K > u_i(t)$ for all $i \neq i^*$ and $t/K \geq \frac{8\sigma^2\alpha\log t}{\varepsilon^2}$. By the union bound,
$$\mathbb{P}\left\{T_{i^*}(t) < \frac{8\sigma^2\alpha\log t}{\varepsilon^2}\right\} \overset{(a)}{\leq} \mathbb{P}\left\{T_{i^*}(t) < \frac{t}{K}\right\} \overset{(b)}{\leq} \mathbb{P}\left\{\exists i\neq i^* : T_i(t) > \frac{t}{K}\right\} \overset{(c)}{\leq} \frac{2K^2}{\alpha-2}\left(\frac{t}{K}\right)^{2-\alpha}, \tag{8}$$
where (a) is true since $t/K \geq 8\sigma^2\alpha\log(t)/\varepsilon^2$; (b) since $\sum_{i=1}^{K}T_i(t) = t$; and (c) by the union bound and (7). Now, if $T_{i^*}(t-1) \geq 8\sigma^2\alpha\log(t)/\varepsilon^2$ and $F_t$ is false, then by Lemma 7 the chosen arm is $i^*$. Therefore
$$\mathbb{E}\sum_{t=\omega'+1}^{n}\mathbb{1}\{I_t\neq i^*\} \overset{(a)}{\leq} \sum_{t=\omega'+1}^{n}\mathbb{P}\{F_t\} + \sum_{t=\omega'+1}^{n}\mathbb{P}\left\{T_{i^*}(t-1) < \frac{8\sigma^2\alpha\log t}{\varepsilon^2}\right\} \leq \sum_{t=\omega'+1}^{n}\mathbb{P}\{F_t\} + \frac{2K^2}{\alpha-2}\sum_{t=\omega'+1}^{n}\left(\frac{t-1}{K}\right)^{2-\alpha} \overset{(b)}{\leq} \sum_{t=\omega'+1}^{n}\mathbb{P}\{F_t\} + \frac{2K^{\alpha}}{\alpha-3}(\omega')^{3-\alpha}, \tag{9}$$
where (a) follows from (8) and Lemma 7, and (b) by straightforward calculus. Therefore, by combining (5), (6) and (9), we obtain
$$\mathbb{E}R_n \leq \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\alpha\sigma^2\log\omega'}{\Delta_i}\right) + \Delta_{\max}\left(\sum_{t=1}^{n}\mathbb{P}\{F_t\} + \frac{2K^{\alpha}}{(\alpha-3)(\omega')^{\alpha-3}}\right) \leq \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\alpha\sigma^2\log\omega'}{\Delta_i}\right) + \frac{2\Delta_{\max}K(\alpha-1)}{\alpha-2} + \frac{2\Delta_{\max}K^{\alpha}}{(\alpha-3)(\omega')^{\alpha-3}}.$$
Setting $\alpha = 4$, and using $\omega' \geq 2K$ to simplify the last term, leads to
$$\mathbb{E}R_n \leq \sum_{i:\Delta_i>0}\left(\Delta_i + \frac{8\alpha\sigma^2\log\omega'}{\Delta_i}\right) + 3\Delta_{\max}K + \Delta_{\max}K^3.$$

## 6 Lower Bounds and Ambiguous Examples

We prove lower bounds for two illustrative examples of structured bandits. Some previous work is also relevant: the famous paper by Lai and Robbins [17] shows that the bound of Theorem 2 cannot in general be greatly improved, and many of the techniques here are borrowed from Bubeck et al. [11]. Given a fixed algorithm and varying $\theta$, we denote the regret and expectation by $R_n(\theta)$ and $\mathbb{E}_\theta$ respectively. Returns are assumed to be sampled from a normal distribution with unit variance, so that $\sigma^2 = 1$.
The proofs of the following theorems may be found in the supplementary material.

*Figure 3: Counter-examples.*

Theorem 8. Given the structured bandit depicted in Figure 3(a) or Figure 2(c), for all $\theta > 0$ and all algorithms the regret satisfies
$$\max\left\{\mathbb{E}_{-\theta}R_n(-\theta),\ \mathbb{E}_{\theta}R_n(\theta)\right\} \geq \frac{1}{8\theta}$$
for sufficiently large $n$.

Theorem 9. Let $(\Theta, \{\mu_1, \mu_2\})$ be a structured bandit where returns are sampled from a normal distribution with unit variance. Assume there exists a pair $\theta_1, \theta_2 \in \Theta$ and a constant $\Delta > 0$ such that $\mu_1(\theta_1) = \mu_1(\theta_2)$, $\mu_1(\theta_1) \geq \mu_2(\theta_1) + \Delta$ and $\mu_2(\theta_2) \geq \mu_1(\theta_2) + \Delta$. Then the following hold:
1. $\mathbb{E}_{\theta_1}R_n(\theta_1) + \mathbb{E}_{\theta_2}R_n(\theta_2) \geq \dfrac{1+\log(2n\Delta^2)}{4\Delta}$;
2. $\mathbb{E}_{\theta_2}R_n(\theta_2) \geq \dfrac{n\Delta}{2}\exp\left(-4\Delta\,\mathbb{E}_{\theta_1}R_n(\theta_1)\right) - \mathbb{E}_{\theta_1}R_n(\theta_1)$.

A natural example where the conditions are satisfied is depicted in Figure 3(b), by choosing $\theta_1 = -1$ and $\theta_2 = 1$. We know from Theorem 3 that UCB-S enjoys finite regret $\mathbb{E}_{\theta_2}R_n(\theta_2) \in O(1/\Delta)$ and logarithmic regret $\mathbb{E}_{\theta_1}R_n(\theta_1) \in O\big(\frac{1}{\Delta}\log n\big)$. Part 1 of Theorem 9 shows that if we demand finite regret, $\mathbb{E}_{\theta_2}R_n(\theta_2) \in O(1)$, then the regret $\mathbb{E}_{\theta_1}R_n(\theta_1)$ is necessarily logarithmic. On the other hand, part 2 shows that if we demand $\mathbb{E}_{\theta_1}R_n(\theta_1) \in o(\log n)$, then $\mathbb{E}_{\theta_2}R_n(\theta_2) \in \Omega(n)$. Therefore the trade-off made by UCB-S essentially cannot be improved.

Discussion of Figure 3(c/d). In both examples there is an ambiguous region for which the lower bound (Theorem 9) does not show that logarithmic regret is unavoidable, but where Theorem 3 cannot be applied to show that UCB-S achieves finite regret. We managed to show that finite regret is possible in both cases by using a different algorithm. For (c) we could construct a carefully tuned algorithm for which the regret is at most $O(1)$ if $\theta^* \geq 0$ and $O\big(\frac{1}{|\theta^*|}\log\log\frac{1}{|\theta^*|}\big)$ otherwise. This result contradicts a claim by Bubeck et al. [11, Thm. 8]. Additional discussion of the ambiguous case in general, as well as of this specific example, may be found in the supplementary material. One observation is that unbridled optimism is the cause of the failure of UCB-S in these cases. This is illustrated by Figure 3(d) with $\theta^* \leq 0$: no matter how narrow the confidence interval about $\mu_1$, if the second action has not been taken sufficiently often, then there will still be some belief that $\theta > 0$ is possible, in which case the second action is optimistic, and this leads to logarithmic regret. Adapting the algorithm to be slightly risk averse solves this problem.

## 7 Experiments

We tested Algorithm 1 on a selection of the structured bandits depicted in Figure 2 and compared it to UCB [6, 8]. Rewards were sampled from normal distributions with unit variance. For UCB we chose $\alpha = 2$, while we used the theoretically justified $\alpha = 4$ for Algorithm 1. All code is available in the supplementary material. Each data-point is the average of 500 independent samples, with the blue crosses and red squares indicating the regret of UCB-S and UCB respectively.

*[Plots of experimental results. Panel settings: $K = 2$, $\mu_1(\theta) = \theta$, $\mu_2(\theta) = -\theta$, $n = 50\,000$ (see Figure 2(a)); $K = 2$, $\mu_1(\theta) = \theta$, $\mu_2(\theta) = -\theta$, $\theta^* = 0.04$ (see Figure 2(a)); $K = 2$, $\mu_1(\theta) = 0$, $\mu_2(\theta) = \theta$, $n = 50\,000$ (see Figure 2(b)).]*

The results show that Algorithm 1 typically out-performs regular UCB. The exception is the top right experiment, where UCB performs slightly better for $\theta^* < 0$. This is not surprising, since in this case the structured version of UCB cannot exploit the additional structure and suffers due to worse constant factors. On the other hand, if $\theta^* > 0$, then UCB endures logarithmic regret and performs significantly worse than its structured counterpart.
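As a rough, scaled-down reproduction of the first experiment (and not the authors' code, which is in the supplementary material), the sketch below compares plain UCB with $\alpha = 2$ against the `ucb_s` sketch from Section 3 with $\alpha = 4$ on the Figure 2(a) bandit, reusing the `StructuredBandit` class from the introduction. The reduced horizon and number of runs are choices made here for speed.

```python
import numpy as np

def ucb(pull, K, n, alpha=2.0, sigma=1.0):
    """Standard UCB: play each arm once, then maximise the upper confidence bound."""
    counts = np.zeros(K, dtype=int)
    means = np.zeros(K)
    history = []
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1
        else:
            i = int(np.argmax(means + np.sqrt(2 * alpha * sigma**2 * np.log(t) / counts)))
        x = pull(i)
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
        history.append(i)
    return history

# Figure 2(a): mu_1(theta) = theta, mu_2(theta) = -theta, unit-variance rewards.
mus = [lambda th: th, lambda th: -th]
thetas = np.linspace(-0.25, 0.25, 101)   # discretised Theta used by the ucb_s sketch
n, runs = 5000, 20                       # much smaller than the paper's n = 50 000
for theta_star in (-0.1, 0.1):
    r_ucb, r_ucbs = [], []
    for seed in range(runs):
        b = StructuredBandit(mus, theta_star, rng=np.random.default_rng(seed))
        r_ucb.append(b.regret_of(ucb(b.pull, 2, n)))
        r_ucbs.append(b.regret_of(ucb_s(mus, thetas, b.pull, n)))
    print(theta_star, np.mean(r_ucb), np.mean(r_ucbs))
```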
The superiority of Algorithm 1 would be accentuated in the top left and bottom right experiments by increasing the horizon.

*[Fourth panel settings: $K = 2$, $\mu_1(\theta) = \theta\,\mathbb{1}\{\theta > 0\}$, $\mu_2(\theta) = -\theta\,\mathbb{1}\{\theta < 0\}$, $n = 50\,000$ (see Figure 2(c)).]*

## 8 Conclusion

The limitation of the new approach is that the proof techniques and the algorithm are most suited to the case where the number of actions is relatively small. Generalising the techniques to large action spaces is therefore an important open problem. There is still a small gap between the upper and lower bounds, and the lower bounds have only been proven for special examples. Proving a general problem-dependent lower bound is an interesting question, but probably extremely challenging given the flexibility of the setting. We are also curious to know whether there exist problems for which the optimal regret is somewhere between finite and logarithmic. Another question is how to define Thompson sampling for structured bandits. Thompson sampling has recently attracted a great deal of attention [13, 2, 14, 3, 9], but so far we are unable even to define an algorithm resembling Thompson sampling for the general structured bandit problem.

Acknowledgements. Tor Lattimore was supported by the Google Australia Fellowship for Machine Learning, Alberta Innovates Technology Futures and NSERC. The majority of this work was completed while Rémi Munos was visiting Microsoft Research, New England. This research was partially supported by the European Community's Seventh Framework Programme under grant agreement no. 270327 (project CompLACS).

## References

[1] Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Transactions on Automatic Control, 34(12):1249–1259, 1989.
[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.
[3] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, volume 31, pages 99–107, 2013.
[4] Kareem Amin, Michael Kearns, and Umar Syed. Bandits, query learning, and the haystack dimension. Journal of Machine Learning Research-Proceedings Track, 19:87–106, 2011.
[5] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Variance estimates and exploration function in multi-armed bandit. Technical report, research report 07-31, Certis-Ecole des Ponts, 2007.
[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[7] Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
[8] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012.
[9] Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638–646, 2013.
[10] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in X-armed bandits. In NIPS, pages 201–208, 2008.
[11] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.
[12] Todd L. Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.
[13] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213. Springer, 2012.
[14] Nathaniel Korda, Emilie Kaufmann, and Rémi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456, 2013.
[15] Tze Leung Lai and Herbert Robbins. Asymptotically optimal allocation of treatments in sequential experiments. In T. J. Santner and A. C. Tamhane, editors, Design of Experiments: Ranking and Selection, pages 127–142, 1984.
[16] Tze Leung Lai and Herbert Robbins. Optimal sequential sampling from two populations. Proceedings of the National Academy of Sciences, 81(4):1284–1286, 1984.
[17] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[18] Adam J. Mersereau, Paat Rusmevichientong, and John N. Tsitsiklis. A structured multiarmed bandit problem and the greedy policy. IEEE Transactions on Automatic Control, 54(12):2787–2802, 2009.
[19] Dan Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.