On Explore-Then-Commit Strategies

Aurélien Garivier
Institut de Mathématiques de Toulouse; UMR 5219
Université de Toulouse; CNRS
UPS IMT, F-31062 Toulouse Cedex 9, France
aurelien.garivier@math.univ-toulouse.fr

Emilie Kaufmann
Univ. Lille, CNRS, Centrale Lille, Inria SequeL
UMR 9189, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille
F-59000 Lille, France
emilie.kaufmann@univ-lille1.fr

Tor Lattimore
University of Alberta
116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada
tor.lattimore@gmail.com

This work was partially supported by the CIMI (Centre International de Mathématiques et d'Informatique) Excellence program while Emilie Kaufmann visited Toulouse in November 2015. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grants ANR-13-BS01-0005 (project SPADRO) and ANR-13-CORD-0020 (project ALICIA).

We study the problem of minimising regret in two-armed bandit problems with Gaussian rewards. Our objective is to use this simple setting to illustrate that strategies based on an exploration phase (up to a stopping time) followed by exploitation are necessarily suboptimal. The results hold regardless of whether or not the difference in means between the two arms is known. Besides the main message, we also refine existing deviation inequalities, which allow us to design fully sequential strategies with finite-time regret guarantees that are (a) asymptotically optimal as the horizon grows and (b) order-optimal in the minimax sense. Furthermore we provide empirical evidence that the theory also holds in practice and discuss extensions to the non-Gaussian and multiple-armed cases.

1 Introduction

It is now a very frequent issue for companies to optimise their daily profits by choosing between one of two possible website layouts. A natural approach is to start with a period of A/B Testing (exploration) during which the two versions are uniformly presented to users. Once the testing is complete, the company displays the version believed to generate the most profit for the rest of the month (exploitation). The time spent exploring may be chosen adaptively based on past observations, but could also be fixed in advance. Our contribution is to show that strategies of this form are much worse than if the company is allowed to dynamically select which website to display without restrictions for the whole month.

Our analysis focusses on a simple sequential decision problem played over $T$ time-steps. In time-step $t \in \{1, 2, \ldots, T\}$ the agent chooses an action $A_t \in \{1, 2\}$ and receives a normally distributed reward $Z_t \sim \mathcal{N}(\mu_{A_t}, 1)$, where $\mu_1, \mu_2 \in \mathbb{R}$ are the unknown mean rewards for actions 1 and 2 respectively. The goal is to find a strategy $\pi$ (a way of choosing each action $A_t$ based on past observations) that maximises the cumulative reward over $T$ steps in expectation, or equivalently minimises the regret
$$R^\pi_\mu(T) = T \max\{\mu_1, \mu_2\} - \mathbb{E}_\mu\left[\sum_{t=1}^{T} Z_t\right]. \qquad (1)$$
This framework is known as the multi-armed bandit problem, which has many applications and has been studied for almost a century [Thompson, 1933]. Although this setting is now quite well understood, the purpose of this article is to show that strategies based on distinct phases of exploration and exploitation are necessarily suboptimal.
This is an important message because exploration followed by exploitation is the most natural approach, and it is often implemented in applications (including the website optimisation problem described above). Moreover, strategies of this kind have been proposed in the literature for more complicated settings [Auer and Ortner, 2010, Perchet and Rigollet, 2013, Perchet et al., 2015]. Recent progress on optimal exploration policies (e.g., by Garivier and Kaufmann [2016]) could have suggested that well-tuned variants of two-phase strategies might be near-optimal. We show, on the contrary, that optimal strategies for multi-armed bandit problems must be fully sequential and, in particular, should mix exploration and exploitation. It has been known since the work of Wald [1945] on simple hypothesis testing that sequential procedures can lead to significant gains. Here, the superiority of fully sequential procedures is consistent with intuition: if one arm first appears to be better, but subsequent observations are disappointing, the obligation to commit at some point can be restrictive. In this paper, we give a crisp and precise description of how restrictive it is: it leads to a regret asymptotically twice as large on average. The proof of this result combines classical techniques from sequential analysis and from the bandit literature.

We study two settings, one where the gap $\Delta = |\mu_1 - \mu_2|$ is known and the other where it is not. The most straightforward strategy in the former case is to explore each action a fixed number of times $n$ and subsequently exploit by choosing the action that appeared best while exploring. It is easy to calculate the optimal $n$ and consequently show that this strategy suffers a regret of $R^\pi_\mu(T) \sim 4\log(T)/\Delta$. A more general approach is to use a so-called Explore-Then-Commit (ETC) strategy, following a nomenclature introduced by Perchet et al. [2015]. An ETC strategy explores each action alternately until some data-dependent stopping time and subsequently commits to a single action for the remaining time-steps. We show in Theorem 2 that by using a sequential probability ratio test (SPRT) it is possible to design an ETC strategy for which $R^\pi_\mu(T) \sim \log(T)/\Delta$, which improves on the above result by a factor of 4. We also prove a lower bound showing that no ETC strategy can improve on this result. Surprisingly, it is possible to do even better by using a fully sequential strategy inspired by the UCB algorithm for multi-armed bandits [Katehakis and Robbins, 1995]. We design a new strategy for which $R^\pi_\mu(T) \sim \log(T)/(2\Delta)$, which improves on the fixed-design strategy by a factor of 8 and on the SPRT-based strategy by a factor of 2. Again we prove a lower bound showing that no strategy can improve on this result.

For the case where $\Delta$ is unknown, fixed-design strategies are hopeless because there is no reasonable tuning for the exploration budget $n$. However, it is possible to design an ETC strategy for unknown gaps. Our approach uses a modified fixed-budget best arm identification (BAI) algorithm in its exploration phase (see e.g., Even-Dar et al. [2006], Garivier and Kaufmann [2016]) and chooses the recommended arm for the remaining time-steps. In Theorem 5 we show that a strategy based on this idea satisfies $R^\pi_\mu(T) \sim 4\log(T)/\Delta$, which again we show is optimal within the class of ETC strategies. As before, strategies based on ETC are suboptimal by a factor of 2 relative to the optimal rates achieved by fully sequential strategies such as UCB, which satisfies $R^\pi_\mu(T) \sim 2\log(T)/\Delta$ [Katehakis and Robbins, 1995].
In a nutshell, strategies based on fixed-design or ETC are necessarily suboptimal. That this failure occurs even in the simple setting considered here is a strong indicator that they are suboptimal in more complicated settings. Our main contribution, presented in more detail in Section 2, is to fully characterise the achievable asymptotic regret when $\Delta$ is either known or unknown and the strategies are either fixed-design, ETC or fully sequential. All upper bounds have explicit finite-time forms, which allow us to derive optimal minimax guarantees. For the lower bounds we give a novel and generic proof of all results. All proofs contain new, original ideas that we believe are fundamental to the understanding of sequential analysis.

2 Notation and Summary of Results

We assume that the horizon $T$ is known to the agent. The optimal action is $a^\star = \arg\max_i \mu_i$, its mean reward is $\mu^\star = \mu_{a^\star}$, and the gap between the means is $\Delta = |\mu_1 - \mu_2|$. Let $\mathcal{H} = \mathbb{R}^2$ be the set of all possible pairs of means, and $\mathcal{H}_\Delta = \{\mu \in \mathbb{R}^2 : |\mu_1 - \mu_2| = \Delta\}$. For $i \in \{1, 2\}$ and $n \in \mathbb{N}$ let $\hat\mu_{i,n}$ be the empirical mean of the $i$th action based on the first $n$ samples. Let $A_t$ be the action chosen in time-step $t$ and $N_i(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = i\}$ be the number of times the $i$th action has been chosen during the first $t$ time-steps. We denote by $\hat\mu_i(t) = \hat\mu_{i,N_i(t)}$ the empirical mean of the $i$th arm after time-step $t$. A strategy is denoted by $\pi$; it maps past actions and rewards to a distribution over the next action.

An ETC strategy is governed by a sampling rule (which determines which arm to sample at each step), a stopping rule (which specifies when to stop the exploration phase) and a decision rule indicating which arm is chosen in the exploitation phase. As we consider two-armed Gaussian bandits with equal variances, we focus here on uniform sampling rules, which have been shown in Kaufmann et al. [2014] to be optimal in that setting. For this reason, we define an ETC strategy as a pair $(\tau, \hat a)$, where $\tau$ is an even stopping time with respect to the filtration $(\mathcal{F}_t = \sigma(Z_1, \ldots, Z_t))_t$ and $\hat a \in \{1, 2\}$ is $\mathcal{F}_\tau$-measurable. In all the ETC strategies presented in this paper, the stopping time $\tau$ depends on the horizon $T$ (although this is not reflected in the notation). At time $t$, the action picked by the ETC strategy is
$$A_t = \begin{cases} 1 & \text{if } t \le \tau \text{ and } t \text{ is odd}, \\ 2 & \text{if } t \le \tau \text{ and } t \text{ is even}, \\ \hat a & \text{otherwise}. \end{cases}$$
The regret for strategy $\pi$, given in Eq. (1), depends on $T$ and $\mu$. Assuming, for example, that $\mu_1 = \mu_2 + \Delta$, an ETC strategy $\pi$ chooses the suboptimal arm $N_2(T) = \frac{\tau \wedge T}{2} + (T - \tau)^+\, \mathbb{1}\{\hat a = 2\}$ times, and the regret $R^\pi_\mu(T) = \Delta\,\mathbb{E}_\mu[N_2(T)]$ thus satisfies
$$\frac{\Delta}{2}\,\mathbb{E}_\mu[\tau \wedge T] \;\le\; R^\pi_\mu(T) \;\le\; \frac{\Delta}{2}\,\mathbb{E}_\mu[\tau \wedge T] + T\Delta\,\mathbb{P}_\mu(\tau < T,\ \hat a \ne a^\star)\,. \qquad (2)$$
We denote the set of all ETC strategies by $\Pi_{\mathrm{ETC}}$. A fixed-design strategy is an ETC strategy for which there exists an integer $n$ such that $\tau = 2n$ almost surely, and the set of all such strategies is denoted by $\Pi_{\mathrm{DETC}}$. The set of all strategies is denoted by $\Pi_{\mathrm{ALL}}$. For $S \in \{\mathcal{H}, \mathcal{H}_\Delta\}$, we are interested in strategies $\pi$ that are uniformly efficient on $S$, in the sense that
$$\forall \mu \in S,\ \forall \alpha > 0,\quad R^\pi_\mu(T) = o(T^\alpha)\,. \qquad (3)$$
We show in this paper that any uniformly efficient strategy in $\Pi$ has a regret at least equal to $C^\Pi_S \log(T)/|\mu_1 - \mu_2|\,(1 - o_T(1))$ for every parameter $\mu \in S$, where $C^\Pi_S$ is given in the following table:

  S \ Π        Π_ALL   Π_ETC   Π_DETC
  H              2       4       NA
  H_Δ           1/2      1        4

Furthermore, we prove that these results are tight. In each case, we propose a uniformly efficient strategy matching this bound. In addition, we prove tight and non-asymptotic regret bounds which also imply, in particular, minimax rate-optimality.
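To fix ideas, the protocol just described can be sketched in a few lines of code (a minimal illustration, not taken from the paper; the function names, the NumPy-based implementation and the default number of replications are our own choices). An ETC strategy is specified by a stopping rule and a decision rule, and its expected regret can be estimated by Monte-Carlo, in line with the decomposition (2).

```python
import numpy as np

def run_etc(T, mu, stop_rule, rng):
    """One ETC episode on a two-armed Gaussian bandit: sample arms 1 and 2
    alternately until stop_rule fires (or the horizon is reached), then commit
    to the empirically best arm.  Returns Delta * N_suboptimal(T)."""
    assert T >= 2, "at least one exploration round is needed"
    mu = np.asarray(mu, dtype=float)
    sums, counts = np.zeros(2), np.zeros(2)
    t = 0
    while t + 2 <= T:                        # uniform (alternating) sampling
        for a in (0, 1):
            sums[a] += rng.normal(mu[a], 1.0)
            counts[a] += 1
        t += 2
        if stop_rule(sums / counts, int(counts[0]), T):
            break
    a_hat = int(np.argmax(sums / counts))    # decision rule: empirical best arm
    pulls = counts.copy()
    pulls[a_hat] += T - t                    # exploitation phase
    return float(np.sum((mu.max() - mu) * pulls))

def estimate_regret(T, mu, stop_rule, n_runs=1000, seed=0):
    """Monte-Carlo estimate of the expected regret of an ETC strategy."""
    rng = np.random.default_rng(seed)
    return float(np.mean([run_etc(T, mu, stop_rule, rng) for _ in range(n_runs)]))
```

A fixed-design strategy corresponds to a stopping rule of the form lambda means, n, T: n >= n_fix; the ETC stopping rules analysed in Section 3 below are data-dependent.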
The paper is organised as follows. First we consider ETC and fixed-design strategies when $\Delta$ is known and unknown (Section 3). We then analyse fully sequential strategies that interleave exploration and exploitation in an optimal way (Section 4). For known $\Delta$ we present a novel algorithm that exploits the additional information to improve the regret. For unknown $\Delta$ we briefly recall the well-known results, but also propose a new regret analysis of the UCB* algorithm, a variant of UCB that can be traced back to Lai [1987], for which we also obtain order-optimal minimax regret. Numerical experiments illustrate and empirically support our results in Section 5. We conclude with a short discussion on non-uniform exploration, and on models with more than 2 arms, possibly non-Gaussian. All the proofs are given in the supplementary material. In particular, our simple, unified proof for all the lower bounds is given in Appendix A.

3 Explore-Then-Commit Strategies

Fixed Design Strategies for Known Gaps. As a warm-up we start with the fixed-design ETC setting where $\Delta$ is known and where the agent chooses each action $n$ times before committing for the remainder.

input: $T$ and $\Delta$
$n := \left\lceil \frac{2}{\Delta^2}\, W\!\left(\frac{T^2\Delta^4}{32\pi}\right) \right\rceil$
for $k \in \{1, \ldots, n\}$ do
  choose $A_{2k-1} = 1$ and $A_{2k} = 2$
end for
$\hat a := \arg\max_i \hat\mu_{i,n}$
for $t \in \{2n+1, \ldots, T\}$ do
  choose $A_t = \hat a$
end for
Algorithm 1: FB-ETC algorithm

The optimal decision rule is obviously $\hat a = \arg\max_i \hat\mu_{i,n}$ with ties broken arbitrarily. The formal description of the strategy is given in Algorithm 1, where $W$ denotes the Lambert function implicitly defined for $y > 0$ by $W(y)\exp(W(y)) = y$. We denote the regret associated to the choice of $n$ by $R^n_\mu(T)$. The following theorem is not especially remarkable except that the bound is sufficiently refined to show certain negative lower-order terms that would otherwise not be apparent.

Theorem 1. Let $\mu \in \mathcal{H}_\Delta$, and let $n$ be chosen as in Algorithm 1. Then, whenever $T\Delta^2 > 4\sqrt{2\pi e}$, $R^n_\mu(T)$ is at most $4\log(T\Delta^2)/\Delta$ plus negative lower-order terms of order $\log\log(T\Delta^2)/\Delta$, and $R^n_\mu(T) \le \Delta T/2 + \Delta$ otherwise. In all cases, $R^n_\mu(T) \le 2.04\sqrt{T} + \Delta$. Furthermore, for all $\varepsilon > 0$, $T \ge 1$ and $n \le 4(1-\varepsilon)\log(T)/\Delta^2$, the regret $R^n_\mu(T)$ is lower bounded by a quantity growing polynomially with $T$. As $R^n_\mu(T) \ge n\Delta$, this entails that $\inf_{1 \le n \le T} R^n_\mu(T) \sim 4\log(T)/\Delta$.

The proof of Theorem 1 is in Appendix B. Note that the "asymptotic lower bound" $4\log(T)/\Delta$ is actually not a lower bound, even up to an additive constant: $R^n_\mu(T) - 4\log(T)/\Delta \to -\infty$ when $T \to \infty$. Actually, the same phenomenon appears in many other cases, and it should be no surprise that, in numerical experiments, some algorithms reach a regret smaller than the Lai and Robbins asymptotic lower bound, as was already observed in several articles (see e.g. Garivier et al. [2016]). Also note that the $\Delta$ term at the end of the upper bound is necessary: if $\Delta$ is large, the problem is statistically so simple that one single observation is sufficient to identify the best arm; but that observation cannot be avoided.

Explore-Then-Commit Strategies for Known Gaps. We now show the existence of ETC strategies that improve on the optimal fixed-design strategy. Surprisingly, the gain is significant. We describe an algorithm inspired by ideas from hypothesis testing and prove an upper bound on its regret that is minimax optimal and that asymptotically matches our lower bound. Let $P$ be the law of $X - Y$, where $X$ (resp. $Y$) is a reward from arm 1 (resp. arm 2). As $\Delta$ is known, the exploration phase of an ETC algorithm can be viewed as a statistical test of the hypothesis $H_1 : (P = \mathcal{N}(\Delta, 2))$ against $H_2 : (P = \mathcal{N}(-\Delta, 2))$.
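To make the link with the stopping rule used in the next section explicit, it is worth writing out the log-likelihood ratio of this test (a standard computation, included here for convenience rather than reproduced from the paper). Based on $n$ paired differences $D_i = X_i - Y_i$ drawn from $P$,
$$\log\frac{\mathrm{d}\mathcal{N}(\Delta, 2)^{\otimes n}}{\mathrm{d}\mathcal{N}(-\Delta, 2)^{\otimes n}}(D_1, \ldots, D_n) \;=\; \sum_{i=1}^{n} \frac{(D_i + \Delta)^2 - (D_i - \Delta)^2}{4} \;=\; \Delta \sum_{i=1}^{n} D_i \;=\; n\,\Delta\,(\hat\mu_{1,n} - \hat\mu_{2,n})\,.$$
Stopping as soon as the absolute value of this quantity exceeds $\log(1/\delta)$, with $\delta = 1/(T\Delta^2)$, is exactly the while-condition of Algorithm 2 below.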
The work of Wald [1945] shows that a significant gain in terms of expected number of samples can be obtained by using a sequential rather than a batch test. Indeed, for a batch test, a sample size of $n \simeq (4/\Delta^2)\log(1/\delta)$ is necessary to guarantee that both type I and type II errors are upper bounded by $\delta$. In contrast, when a random number of samples is permitted, there exists a sequential probability ratio test (SPRT) with the same guarantees that stops after a random number $N$ of samples with expectation $\mathbb{E}[N] \simeq \log(1/\delta)/\Delta^2$ under both $H_1$ and $H_2$. The SPRT stops when the absolute value of the log-likelihood ratio between $H_1$ and $H_2$ exceeds some threshold. Asymptotic upper bounds on the expected number of samples used by an SPRT, as well as the (asymptotic) optimality of such procedures among the class of all sequential tests, can be found in [Wald, 1945, Siegmund, 1985].

input: $T$ and $\Delta$
$A_1 = 1$, $A_2 = 2$, $t := 2$
while $(t/2)\,\Delta\,|\hat\mu_1(t) - \hat\mu_2(t)| < \log(T\Delta^2)$ do
  choose $A_{t+1} = 1$ and $A_{t+2} = 2$, $t := t + 2$
end while
$\hat a := \arg\max_i \hat\mu_i(t)$
while $t \le T$ do
  choose $A_t = \hat a$, $t := t + 1$
end while
Algorithm 2: SPRT-ETC algorithm

Algorithm 2 is an ETC strategy that explores each action alternately, halting when sufficient confidence is reached according to an SPRT. The threshold depends on the gap $\Delta$ and the horizon $T$, corresponding to a risk of $\delta = 1/(T\Delta^2)$. The exploration phase ends at the stopping time
$$\tau = \inf\left\{ t = 2n : n\,\Delta\,|\hat\mu_{1,n} - \hat\mu_{2,n}| \ge \log(T\Delta^2) \right\}.$$
If $\tau < T$ then the empirical best arm $\hat a$ at time $\tau$ is played until time $T$. If $T\Delta^2 \le 1$, the threshold is non-positive and the exploration phase ends immediately (one could even define $\tau = 0$ and pick a random arm). The following theorem gives a non-asymptotic upper bound on the regret of the algorithm. The results rely on non-asymptotic upper bounds on the expectation of $\tau$, which are interesting in their own right.

Theorem 2. If $T\Delta^2 \ge 1$, then the regret of the SPRT-ETC algorithm is upper bounded by $\log(e\,T\Delta^2)/\Delta$ plus a lower-order term of order $\sqrt{\log(T\Delta^2)}/\Delta$. Otherwise it is upper bounded by $\Delta T/2 + \Delta$, and for all $T$ and $\Delta$ the regret is less than $10\sqrt{T} + \Delta$.

The proof of Theorem 2 is given in Appendix C. The following lower bound shows that no uniformly efficient ETC strategy can improve on the asymptotic regret of Algorithm 2. The proof is given in Appendix A together with the other lower bounds.

Theorem 3. Let $\pi$ be an ETC strategy that is uniformly efficient on $\mathcal{H}_\Delta$. Then for all $\mu \in \mathcal{H}_\Delta$,
$$\liminf_{T \to \infty} \frac{R^\pi_\mu(T)}{\log(T)} \ge \frac{1}{\Delta}\,.$$
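As an illustration, the SPRT stopping rule of Algorithm 2 can be written as a small predicate compatible with the simulation sketch of Section 2 (a hypothetical rendering; in particular the handling of the degenerate case $T\Delta^2 \le 1$ is our own choice).

```python
import math

def sprt_stop_rule(delta_gap):
    """SPRT-ETC stopping rule (Algorithm 2): stop exploring as soon as
    n * Delta * |mu1_hat - mu2_hat| reaches log(T * Delta^2)."""
    def stop(means, n, T):
        # if T * Delta^2 <= 1 the threshold is non-positive: stop immediately
        threshold = max(math.log(T * delta_gap ** 2), 0.0)
        return n * delta_gap * abs(means[0] - means[1]) >= threshold
    return stop
```

Passing sprt_stop_rule(abs(mu[0] - mu[1])) together with estimate_regret from the earlier sketch gives a quick empirical look at the $\log(T)/\Delta$ behaviour described in Theorem 2.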
Explore-Then-Commit Strategies for Unknown Gaps. When the gap $\Delta$ is unknown it is not possible to tune a fixed-design strategy that achieves logarithmic regret. ETC strategies can enjoy logarithmic regret, and these are now analysed. We start with the asymptotic lower bound.

Theorem 4. Let $\pi$ be a uniformly efficient ETC strategy on $\mathcal{H}$. For all $\mu \in \mathcal{H}$ with $\Delta = |\mu_1 - \mu_2| > 0$,
$$\liminf_{T \to \infty} \frac{R^\pi_\mu(T)}{\log(T)} \ge \frac{4}{\Delta}\,.$$

A simple idea for constructing an algorithm that matches the lower bound is to use a (fixed-confidence) best arm identification algorithm for the exploration phase. Given a risk parameter $\delta$, a $\delta$-PAC BAI algorithm consists of a sampling rule $(A_t)$, a stopping rule $\tau$ and a recommendation rule $\hat a$ which is $\mathcal{F}_\tau$-measurable and satisfies, for all $\mu \in \mathcal{H}$ such that $\mu_1 \ne \mu_2$, $\mathbb{P}_\mu(\hat a = a^\star) \ge 1 - \delta$. In a bandit model with two Gaussian arms, Kaufmann et al. [2014] propose a $\delta$-PAC algorithm using a uniform sampling rule and a stopping rule $\tau_\delta$ that asymptotically attains the minimal sample complexity $\mathbb{E}_\mu[\tau_\delta] \simeq (8/\Delta^2)\log(1/\delta)$. Using the regret decomposition (2), it is easy to show that the ETC algorithm using the stopping rule $\tau_\delta$ for $\delta = 1/T$ matches the lower bound of Theorem 4.

input: $T$ ($\ge 3$)
$A_1 = 1$, $A_2 = 2$, $t := 2$
while $|\hat\mu_1(t) - \hat\mu_2(t)| < \sqrt{\frac{8}{t}\log\frac{T}{t}}$ do
  choose $A_{t+1} = 1$ and $A_{t+2} = 2$, $t := t + 2$
end while
$\hat a := \arg\max_i \hat\mu_i(t)$
while $t \le T$ do
  choose $A_t = \hat a$, $t := t + 1$
end while
Algorithm 3: BAI-ETC algorithm

Algorithm 3 is a slight variant of this optimal BAI algorithm, based on the stopping time
$$\tau = \inf\left\{ t = 2n : |\hat\mu_{1,n} - \hat\mu_{2,n}| > \sqrt{\frac{4\log\bigl(T/(2n)\bigr)}{n}} \right\}.$$
The motivation for the difference (which comes from a more carefully tuned threshold featuring $\log(T/(2n))$ in place of $\log(T)$) is that the confidence level should depend on the unknown gap $\Delta$, which determines the regret when a mis-identification occurs. The improvement only appears in the non-asymptotic regime, where it allows us to prove both asymptotic optimality and order-optimal minimax regret. The latter would not be possible using a fixed-confidence BAI strategy. The proof of this result can be found in Appendix D. The main difficulty is developing a sufficiently strong deviation bound, which we do in Appendix G, and which may be of independent interest. Note that a similar strategy was proposed and analysed by Lai et al. [1983], but in the continuous-time framework and with asymptotic analysis only.

Theorem 5. If $T\Delta^2 > 4e^2$, the regret of the BAI-ETC algorithm is upper bounded by $4\log(T\Delta^2)/\Delta$ plus lower-order terms. It is upper bounded by $\Delta T$ otherwise, and by $32\sqrt{T} + 2\Delta$ in any case.

4 Fully Sequential Strategies for Known and Unknown Gaps

In the previous section we saw that allowing a random stopping time leads to a factor of 4 improvement in terms of the asymptotic regret relative to the naive fixed-design strategy. We now turn our attention to fully sequential strategies, when $\Delta$ is known and unknown. The latter case is the classic 2-armed bandit problem and is now quite well understood. Our modest contribution in that case is the first algorithm that is simultaneously asymptotically optimal and order optimal in the minimax sense. For the former case, we are not aware of any previous research where the gap is known, except the line of work by Bubeck et al. [2013], Bubeck and Liu [2013], where different questions are treated. In both cases we see that fully sequential strategies improve on the best ETC strategies by a factor of 2.

Known Gaps. We start by stating the lower bound (proved in Appendix A), which is a straightforward generalisation of the Lai and Robbins lower bound.

Theorem 6. Let $\pi$ be a strategy that is uniformly efficient on $\mathcal{H}_\Delta$. Then for all $\mu \in \mathcal{H}_\Delta$,
$$\liminf_{T \to \infty} \frac{R^\pi_\mu(T)}{\log T} \ge \frac{1}{2\Delta}\,.$$

We are not aware of any existing algorithm matching this lower bound, which motivates us to introduce a new strategy called $\Delta$-UCB that exploits the knowledge of $\Delta$ to improve the performance of UCB. In each round the algorithm chooses the arm that has been played most often so far, unless the other arm has an upper confidence bound that is close to $\Delta$ larger than the empirical estimate of the most played arm. Like ETC strategies, $\Delta$-UCB is not anytime in the sense that it requires the knowledge of both the horizon $T$ and the gap $\Delta$.

1: input: $T$ and $\Delta$
2: $\varepsilon_T := \log^{-1/8}\bigl((e + T\Delta^2)/4\bigr)$
3: for $t \in \{1, \ldots, T\}$ do
4:   let $A_{t,\min} := \arg\min_{i \in \{1,2\}} N_i(t-1)$ and $A_{t,\max} := 3 - A_{t,\min}$
5:   if $\hat\mu_{A_{t,\min}}(t-1) + \sqrt{\frac{2}{N_{A_{t,\min}}(t-1)}\log\frac{T}{N_{A_{t,\min}}(t-1)}} \ge \hat\mu_{A_{t,\max}}(t-1) + \Delta - 2\varepsilon_T$ then
6:     choose $A_t = A_{t,\min}$
7:   else
8:     choose $A_t = A_{t,\max}$
9:   end if
10: end for
Algorithm 4: $\Delta$-UCB
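For concreteness, here is one possible reading of the decision rule of Algorithm 4 as code (a sketch only: the exact form of $\varepsilon_T$ and the tie-breaking convention are assumptions based on the pseudocode above, not the authors' reference implementation).

```python
import math

def delta_ucb_action(means, counts, T, gap):
    """One step of the Delta-UCB rule (Algorithm 4), as read above.  means and
    counts hold the empirical means and numbers of pulls of the two arms (each
    assumed to have been pulled at least once); gap is the known Delta.
    Assumes T * gap**2 is large enough for eps_T to be well defined, as in the
    conditions of Theorem 7."""
    eps_t = math.log((math.e + T * gap ** 2) / 4.0) ** (-1.0 / 8.0)  # assumed form of eps_T
    a_min = 0 if counts[0] <= counts[1] else 1    # least-played arm (ties broken towards arm 0)
    a_max = 1 - a_min
    bonus = math.sqrt(2.0 / counts[a_min] * math.log(T / counts[a_min]))
    # play the least-played arm only if its UCB comes within (gap - 2*eps_T)
    # of the empirical mean of the most-played arm
    if means[a_min] + bonus >= means[a_max] + gap - 2.0 * eps_t:
        return a_min
    return a_max
```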
Theorem 7. If $T(2\Delta - 3\varepsilon_T)^2 \ge 2$ and $T\varepsilon_T^2 \ge e^2$, the regret of the $\Delta$-UCB algorithm is upper bounded by
$$\frac{\log(2T\Delta^2)}{2\Delta\,\bigl(1 - 3\varepsilon_T/(2\Delta)\bigr)^2}$$
plus lower-order terms involving $\sqrt{\pi\log(2T\Delta^2)}$, $\log(\varepsilon_T^2 T)/\varepsilon_T^2$, $1/\varepsilon_T^2$ and $(2\Delta - 3\varepsilon_T)^{-2}$. Moreover, $\limsup_{T\to\infty} R^{\Delta\text{-UCB}}_\mu(T)/\log(T) \le (2\Delta)^{-1}$ and, for all $\mu \in \mathcal{H}_\Delta$, $R^{\Delta\text{-UCB}}_\mu(T)$ is of order $\sqrt{T}$ up to a universal constant.

The proof may be found in Appendix E.

Unknown Gaps. In the classical bandit setting where $\Delta$ is unknown, UCB by Katehakis and Robbins [1995] is known to be asymptotically optimal: $R^{\mathrm{UCB}}_\mu(T) \sim 2\log(T)/\Delta$, which matches the lower bound of Lai and Robbins [1985]. Non-asymptotic regret bounds are given for example by Auer et al. [2002], Cappé et al. [2013]. Unfortunately, UCB is not optimal in the minimax sense, which is so far only achieved by algorithms that are not asymptotically optimal [Audibert and Bubeck, 2009, Lattimore, 2015]. Here, with only two arms, we are able to show that Algorithm 5 below is simultaneously minimax order-optimal and asymptotically optimal. The strategy is essentially the same as suggested by Lai [1987], but with a fractionally smaller confidence bound. The proof of Theorem 8 is given in Appendix F. Empirically, the smaller confidence bonus used by UCB* leads to a significant improvement relative to UCB.

1: input: $T$
2: for $t \in \{1, \ldots, T\}$ do
3:   $A_t = \arg\max_{i \in \{1,2\}} \left[ \hat\mu_i(t-1) + \sqrt{\frac{2}{N_i(t-1)}\log\frac{T}{N_i(t-1)}} \right]$
4: end for
Algorithm 5: UCB*

Theorem 8. For all $\varepsilon \in (0, \Delta)$, if $T(\Delta - \varepsilon)^2 \ge 2$ and $T\varepsilon^2 \ge e^2$, the regret of the UCB* strategy admits a finite-time upper bound whose leading term is $2\log(T\Delta^2)/(\Delta - \varepsilon)$, up to lower-order terms involving $\log(\varepsilon^2 T)$ and $16e$. Moreover, $\limsup_{T\to\infty} R^{\mathrm{UCB}^*}_\mu(T)/\log(T) = 2/\Delta$ and, for all $\mu \in \mathcal{H}$, $R^{\mathrm{UCB}^*}_\mu(T) \le 33\sqrt{T} + \Delta$.

Note that if there are $K > 2$ arms, then the strategy above is still asymptotically optimal, but suffers a minimax regret of $\Omega(\sqrt{TK\log(K)})$, which is a factor of $\sqrt{\log(K)}$ suboptimal.

5 Numerical Experiments

We represent here the regret of the five strategies presented in this article on a bandit problem with $\Delta = 1/5$, for different values of the horizon. The regret is estimated by $4 \cdot 10^5$ Monte-Carlo replications. In the legend, the estimated slopes of $R^\pi(T)$ (in logarithmic scale) are indicated after the policy names.

[Figure: estimated regret of the five strategies as a function of the horizon $T$ (logarithmic scale, $T$ ranging from 50 to 50,000). Estimated slopes: FB-ETC: 3.65, BAI-ETC: 2.98, UCB*: 1.59, SPRT-ETC: 1.03, $\Delta$-UCB: 0.77.]

The experimental behavior of the algorithms reflects the theoretical results presented above: the regret asymptotically grows as the logarithm of the horizon, the experimental coefficients correspond approximately to the theory, and the relative ordering of the policies is respected. However, it should be noted that for short horizons the hierarchy is not quite the same, and the growth rate is not logarithmic; this question is raised in Garivier et al. [2016]. In particular, on short horizons the Best-Arm Identification procedure performs very well with respect to the others, and starts to be beaten (even by the gap-aware strategies) only when $T\Delta^2$ is much larger than 10.
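To complement the figure, here is a minimal Monte-Carlo sketch in the same spirit for one of the policies, UCB* (Algorithm 5). The instance $\mu = (1/5, 0)$, the number of replications and all names are illustrative choices and do not reproduce the exact experimental setup.

```python
import numpy as np

def ucb_star_regret(T, mu=(0.2, 0.0), n_runs=10_000, seed=0):
    """Monte-Carlo estimate of the regret of UCB* (Algorithm 5) on a
    two-armed Gaussian bandit with unit variance."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    gap = mu.max() - mu.min()
    worst = int(mu.argmin())
    suboptimal_pulls = np.empty(n_runs)
    for r in range(n_runs):
        counts, sums = np.zeros(2), np.zeros(2)
        for t in range(T):
            if t < 2:
                a = t                       # play each arm once to initialise
            else:
                # UCB* index: empirical mean + sqrt((2 / N_i) * log(T / N_i))
                index = sums / counts + np.sqrt(2.0 / counts * np.log(T / counts))
                a = int(index.argmax())
            sums[a] += rng.normal(mu[a], 1.0)
            counts[a] += 1
        suboptimal_pulls[r] = counts[worst]
    return float(gap * suboptimal_pulls.mean())   # estimate of E[Delta * N_2(T)]
```

Sweeping $T$ over a grid as in the figure and regressing the estimated regret on $\log T$ yields slope estimates of the kind reported in the legend.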
6 Conclusion: Beyond Uniform Exploration, Two Arms and Gaussian Distributions

It is worth emphasising the impossibility of non-trivial lower bounds on the regret of ETC strategies using any possible (non-uniform) sampling rule. Indeed, using UCB as a sampling rule together with an almost-surely infinite stopping rule defines an artificial but formally valid ETC strategy that achieves the best possible rate for general strategies. This strategy is not a faithful counter-example to our claim that ETC strategies are suboptimal, because UCB is not a satisfying exploration rule: if exploration is the objective, then uniform sampling is known to be optimal in the two-armed Gaussian case [Kaufmann et al., 2014], which justifies the uniform sampling assumption.

The use of ETC strategies for regret minimisation (e.g., as presented by Perchet and Rigollet [2013]) is certainly not limited to bandit models with 2 arms. The extension to multiple arms is based on the successive elimination idea, in which a set of active arms is maintained with arms chosen according to a round robin within the active set. Arms are eliminated from the active set once their optimality becomes implausible, and the exploration phase terminates when the active set contains only a single arm (an example is given by Auer and Ortner [2010]). The Successive Elimination algorithm was introduced by Even-Dar et al. [2006] for best-arm identification in the fixed-confidence setting. It was shown to be rate-optimal, and thus a good compromise for both minimising regret and finding the best arm. If one looks more precisely at multiplicative constants, however, Garivier and Kaufmann [2016] showed that it is suboptimal for the best arm identification task in almost all settings except two-armed Gaussian bandits. Regarding regret minimisation, the present paper shows that it is suboptimal by a factor 2 on every two-armed Gaussian problem.

It is therefore interesting to investigate the performance in terms of regret of an ETC algorithm using an optimal BAI algorithm. This is actually possible not only for Gaussian distributions, but more generally for one-parameter exponential families, for which Garivier and Kaufmann [2016] propose the asymptotically optimal Track-and-Stop strategy. Denoting by $d(\mu, \mu') = \mathrm{KL}(\nu_\mu, \nu_{\mu'})$ the Kullback-Leibler divergence between two distributions parameterised by $\mu$ and $\mu'$, they provide results which can be adapted to obtain the following bound.

Proposition 1. For $\mu$ such that $\mu_1 > \max_{a \ne 1} \mu_a$, the regret of the ETC strategy using Track-and-Stop exploration with risk $1/T$ satisfies
$$R_\mu(T) \lesssim T^\star(\mu)\left(\sum_{a=2}^{K} w^\star_a(\mu)\,(\mu_1 - \mu_a)\right)\log T\,,$$
where $T^\star(\mu)$ is the inverse of the maximum, and $w^\star(\mu)$ the maximiser, of the optimisation problem
$$\max_{w \in \Sigma_K}\ \inf_{a \ne 1}\ \left[ w_1\, d\!\left(\mu_1, \frac{w_1\mu_1 + w_a\mu_a}{w_1 + w_a}\right) + w_a\, d\!\left(\mu_a, \frac{w_1\mu_1 + w_a\mu_a}{w_1 + w_a}\right) \right],$$
where $\Sigma_K$ is the set of probability distributions on $\{1, \ldots, K\}$.

In general, it is not easy to quantify the difference to the lower bound of Lai and Robbins,
$$\liminf_{T \to \infty} \frac{R^\pi_\mu(T)}{\log T} \ge \sum_{a \ne 1} \frac{\mu_1 - \mu_a}{d(\mu_a, \mu_1)}\,.$$
Even for Gaussian distributions, there is no general closed-form formula for $T^\star(\mu)$ and $w^\star(\mu)$ except when $K = 2$. However, we conjecture that the worst case is when $\mu_1$ and $\mu_2$ are much larger than the other means: then, the regret is almost the same as in the 2-arm case, and ETC strategies are suboptimal by a factor 2. On the other hand, the most favourable case (in terms of relative efficiency) seems to be when $\mu_2 = \cdots = \mu_K$: then, writing $\Delta = \mu_1 - \mu_2$,
$$w^\star_2(\mu) = \cdots = w^\star_K(\mu) = \frac{1}{K - 1 + \sqrt{K-1}} \quad\text{and}\quad T^\star(\mu) = \frac{2(\sqrt{K-1} + 1)^2}{\Delta^2}\,,$$
leading to
$$R^{\mathrm{TaS}}_\mu(T) \lesssim \frac{2(K-1)\left(1 + \frac{1}{\sqrt{K-1}}\right)\log(T)}{\Delta}\,,$$
while the Lai and Robbins lower bound yields $2(K-1)\log(T)/\Delta$. Thus, the difference grows with $K$ as $2\sqrt{K-1}\log(T)/\Delta$, but the relative difference decreases.

References

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226, 2009.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638–646, 2013.

Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Conference On Learning Theory, pages 122–134, 2013.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory (to appear), 2016.

Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. arXiv preprint arXiv:1602.07182, 2016.

Abdolhossein Hoorfar and Mehdi Hassani. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math., 9(2):5–9, 2008.

Michael N. Katehakis and Herbert Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995.

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of A/B testing. In Proceedings of the 27th Conference On Learning Theory, 2014.

Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Tze Leung Lai, Herbert Robbins, and David Siegmund. Sequential design of comparative clinical trials. In M. Haseeb Rizvi, Jagdish Rustagi, and David Siegmund, editors, Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday, pages 51–68. Academic Press, 1983.

Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical report, 2015. URL http://arxiv.org/abs/1507.07880.

Peter Mörters and Yuval Peres. Brownian Motion, volume 30. Cambridge University Press, 2010.

Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 2013.

Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. In Proceedings of the 28th Conference On Learning Theory, 2015.

David Siegmund. Sequential Analysis. Springer-Verlag, 1985.

William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2):117–186, 1945.