Non-Asymptotic Pure Exploration by Solving Games

Rémy Degenne, Centrum Wiskunde & Informatica, Science Park 123, 1098 XG Amsterdam, remy.degenne@cwi.nl
Wouter M. Koolen, Centrum Wiskunde & Informatica, Science Park 123, 1098 XG Amsterdam, wmkoolen@cwi.nl
Pierre Ménard, Inria Lille, 40 Avenue Halley, 59650 Villeneuve-d'Ascq, menardprr@gmail.com

Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state-of-the-art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite-confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best-response oracle instead of fully solving the optimisation problem.

1 Introduction

We study fundamental trade-offs arising in sequential interactive learning. We adopt the framework of Pure Exploration, in which the learning system interacts with its environment by performing a sequence of experiments, with the goal of maximising information gain. We aim to design general, efficient systems that can answer a given query with few experiments yet few mistakes. As usual, we model the environment by a multi-armed bandit model with exponential family arms, and work in the fixed confidence (δ-PAC) setting. Information-theoretic lower bounds [13] show that a certain number of samples is unavoidable to reach a certain confidence. Moreover, algorithms are developed [13] that match these lower bounds asymptotically, in the small-confidence $\delta \to 0$ regime. Our contribution is a framework for obtaining efficient algorithms with non-asymptotic guarantees.

The main object of study is the Pure Exploration Game [9], a two-player zero-sum game that is central to lower bounds as well as to the widely used GLRT-based stopping rules. We develop iterative methods that provably converge to saddle-point behaviour. The game itself is not known to the learner, and has to be explored and estimated on the fly. Our methods are based on pairs of low-regret algorithms, combined with optimism and tracking. We prove sample complexity guarantees for several combinations of algorithms, and discuss their computational and statistical trade-offs.

The rest of the introduction provides more detail on pure exploration problems, the pure exploration game, the connection between them, and expands on our contribution. We also review related work.

Our model for the environment is a K-armed bandit, i.e. distributions $(\nu_1, \ldots, \nu_K)$ on $\mathbb{R}$. We assume throughout that these distributions come from a one-dimensional exponential family, and we denote by $d(\mu, \lambda)$ the relative entropy (Kullback-Leibler divergence) from the distribution with mean $\mu$ to that with mean $\lambda$.
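For concreteness, the divergence $d(\mu, \lambda)$ has a simple closed form in the two families used as running examples below. The following is a minimal sketch (our own illustration, with hypothetical function names, not code from the paper), assuming Gaussian arms with a known common variance and Bernoulli arms.

```python
import math

def d_gaussian(mu, lam, sigma2=1.0):
    # KL divergence between Gaussians with common variance sigma2,
    # parameterised by their means: d(mu, lam) = (mu - lam)^2 / (2 sigma2)
    return (mu - lam) ** 2 / (2 * sigma2)

def d_bernoulli(mu, lam, eps=1e-12):
    # KL divergence between Bernoulli(mu) and Bernoulli(lam), means in (0, 1)
    mu, lam = min(max(mu, eps), 1 - eps), min(max(lam, eps), 1 - eps)
    return mu * math.log(mu / lam) + (1 - mu) * math.log((1 - mu) / (1 - lam))
```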
A pure exploration problem is parameterised by a set $\mathcal{M}$ of K-armed bandit models (the possible environments), a finite set $\mathcal{I}$ of candidate answers and a correct-answer function $i^\star : \mathcal{M} \to \mathcal{I}$. We focus on Best Arm Identification, for which $i^\star(\mu) = \arg\max_i \mu_i$, and the Minimum Threshold problem, which is defined for any fixed threshold $\gamma$ by $i^\star(\mu) = \mathbb{1}\{\min_i \mu_i < \gamma\}$. The goal of the learner is to learn $i^\star(\mu)$ confidently and efficiently by means of sequentially sampling from the arms of $\mu$, no matter which $\mu \in \mathcal{M}$ it faces. When an algorithm sequentially interacts with $\mu$, we denote by $N^k_t$ and $\hat\mu^k_t$ the sample count and empirical mean estimate (these form a sufficient statistic) for each arm $k$ after $t$ rounds. We write $\tau_\delta$ for the time at which the algorithm stops and $\hat\imath$ for the answer it recommends. The algorithm is correct (on a particular run) if it recommends $\hat\imath = i^\star(\mu)$, the correct answer for $\mu$. An algorithm is δ-PAC (or δ-correct) if $\mathbb{P}_\mu(\hat\imath \neq i^\star(\mu)) \le \delta$ for each $\mu \in \mathcal{M}$. Among δ-PAC algorithms, we are interested in those minimising the sample complexity $\mathbb{E}_\mu[\tau_\delta]$. As it turns out, what can be achieved, and how, is captured by a certain game.

For each $\mu \in \mathcal{M}$, [9] define the two-player zero-sum simultaneous-move Pure Exploration Game: MAX plays an arm $k \in [K]$, MIN plays an alternative bandit model $\lambda \in \mathcal{M}$ with a different correct answer $i^\star(\lambda) \neq i^\star(\mu)$. We denote the set of such alternatives to answer $i$ by $\neg i = \{\lambda \in \mathcal{M} : i^\star(\lambda) \neq i\}$. MAX then receives payoff $d(\mu_k, \lambda_k)$ from MIN. As the payoff is neither concave in $k$ (since discrete) nor convex in $\lambda$ (both domain and divergence are problematic), we analyse the game by sequencing the moves and considering a mixed strategy for the player moving first. With MAX moving first and playing a mixed strategy $k \sim w \in \triangle_K$ (we identify distributions over $[K]$ and the simplex $\triangle_K$), the value of the game is
$$D_\mu := \sup_{w \in \triangle_K} D_\mu(w) \quad\text{where}\quad D_\mu(w) := \inf_{\lambda \in \mathcal{M} : i^\star(\lambda) \neq i^\star(\mu)} \sum_{k=1}^K w_k\, d(\mu_k, \lambda_k). \qquad (1)$$
We denote a maximiser of $D_\mu(\cdot)$ by $w^\star(\mu)$ and call it an oracle allocation. The analogue where MIN plays first using a mixed strategy $\lambda \sim q \in \mathcal{P}(\neg i^\star(\mu))$ (distributions over that set) is proposed and analysed in [9]. Despite the baroque domain of $\lambda$ in (1), there always exists a minimax $q$ supported on $K$ points due to dimension constraints.

The Pure Exploration Game is essential both to characterising the complexity of learning and to algorithm design. Namely, first, any δ-correct algorithm has sample complexity on each bandit $\mu \in \mathcal{M}$ at least $\mathbb{E}_\mu[\tau_\delta] \ge \mathrm{kl}(\delta, 1-\delta)/D_\mu \approx \frac{\ln\frac{1}{\delta}}{D_\mu}$, and matching this rate requires sampling proportions $\mathbb{E}_\mu[N_{\tau_\delta}]/\mathbb{E}_\mu[\tau_\delta]$ converging to $w^\star(\mu)$ as $\delta \to 0$ [see 13]. Moreover, second, the general approach for obtaining δ-correct algorithms is based on the Generalised Likelihood Ratio Test (GLRT) statistic $Z_t := t\, D_{\hat\mu_t}(N_t/t)$. There are universal thresholds $\beta(t, \delta) \approx \ln\frac{1}{\delta}$ [see e.g. 12, 13, 19, 23] such that $\mathbb{P}_\mu\{\exists t : Z_t \ge \beta(t, \delta) \text{ and } i^\star(\hat\mu_t) \neq i^\star(\mu)\} \le \delta$ for any $\mu \in \mathcal{M}$. Hence stopping when $Z_t \ge \beta(t, \delta)$ and recommending $\hat\imath = i^\star(\hat\mu_t)$ is δ-correct for any sampling rule. Maximising the GLRT to stop as early as possible is achieved by the sampling proportions $N_t/t = w^\star(\hat\mu_t)$.

These considerations show that any successful Pure Exploration agent needs to (approximately) solve the Pure Exploration Game $D_\mu$. The Track-and-Stop approach, pioneered by [13], ensures that $\hat\mu_t \to \mu$ using forced exploration, and $N_t/t \to w^\star(\hat\mu_t)$ using tracking. Continuity of $w^\star$ and $D_\mu$ then yields $Z_t \approx t\, D_\mu(w^\star(\mu)) = t\, D_\mu$. The GLRT stopping rule triggers when $t \approx \beta(t, \delta)/D_\mu \approx \frac{\ln\frac{1}{\delta}}{D_\mu}$, meeting the lower bound in the asymptotic regime $\delta \to 0$.
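To make (1) and the GLRT statistic concrete, here is a minimal sketch (our own illustration, not code from the paper) for the Minimum Threshold query with unit-variance Gaussian arms, where the inner infimum has a closed form [20]: if all means are above γ, the cheapest alternative pushes a single arm down to γ; otherwise it raises every arm below γ up to γ.

```python
import numpy as np

def d(mu, lam, sigma2=1.0):
    return (mu - lam) ** 2 / (2 * sigma2)   # Gaussian KL in terms of the means

def D_min_threshold(mu, w, gamma):
    """Best-response value D_mu(w) from (1) for i*(mu) = 1{min_k mu_k < gamma}."""
    mu, w = np.asarray(mu, float), np.asarray(w, float)
    if mu.min() >= gamma:
        # answer "all arms above gamma": cheapest alternative moves one arm below gamma
        return min(w[k] * d(mu[k], gamma) for k in range(len(mu)))
    # answer "some arm below gamma": an alternative must raise every low arm to gamma
    return sum(w[k] * d(mu[k], gamma) for k in range(len(mu)) if mu[k] < gamma)

def glrt_statistic(mu_hat, counts, gamma):
    """GLRT statistic Z_t = t * D_{mu_hat}(N_t / t) for the same query."""
    counts = np.asarray(counts, float)
    t = counts.sum()
    return t * D_min_threshold(mu_hat, counts / t, gamma)

# Usage: stop as soon as glrt_statistic(...) exceeds the threshold beta(t, delta).
```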
Our contributions. We explore methods to solve the Pure Exploration game $D_\mu$ associated with the unknown bandit model $\mu$, and discuss their statistical and computational trade-offs. We look at solving the game iteratively, by instantiating a low-regret online learner for each player. In particular, for the k-player we use a self-tuning instance of Exponentiated Gradient called AdaHedge [8]. The λ-player needs to play a distribution to deal with non-convexity; we consider Follow the Perturbed Leader as well as an ensemble of Online Gradient Descent experts. We show how optimistic gradient estimates, concentration-of-measure arguments and regret guarantees combine to deliver the first non-asymptotic sample complexity guarantees (which retain asymptotic optimality as $\delta \to 0$). The advantage of this approach is that it only requires a best-response oracle (the infimum on the right of (1)) instead of the computationally more costly max-min oracle (the sup-inf on the left of (1)) employed by Track-and-Stop. At the other extreme, we also develop Optimistic Track-and-Stop based on a max-max-min oracle (the outer max implementing optimism over a confidence region for $\mu$), which trades increased computation for tighter sample complexity guarantees with simpler proofs. Our cocktail sheds new light on the trade-offs involved in the design of pure exploration algorithms. We show how big-hammer forced exploration can be refined using problem-adapted optimism. We show how tracking is unnecessary when the k-player goes second. We show how computational complexity can be traded off using oracles of various sophistication. And finally, we validate our approach empirically in benchmark experiments at practical δ, and find that our algorithms are either competitive with Track-and-Stop (dense $w^\star$) or dominate it (sparse $w^\star$).

Related work. Besides maximising information gain, there is a vast literature on maximising reward in multi-armed bandit models, for which a good starting point is [21]. The canonical Pure Exploration problem is Best Arm Identification [10, 3], which is actively studied in the fixed confidence, fixed budget and simple regret settings [21, Ch. 33]. Its sample complexity as a function of the confidence level δ has been analysed very thoroughly in the (sub-)Gaussian case, where we have a rather complete picture, even including lower-order terms [5]. [18] initiated the quest for the correct instance-dependent constants for arms from any exponential family. [26] stresses the importance of the moderate confidence regime $\delta \not\to 0$. Although it is not the focus here, we do believe that it is crucial to obtain the right problem dependence not only in $\ln\frac{1}{\delta}$ but also in $K$ and other structural parameters, as the latter may in practice dominate the sample complexity. Pure Exploration queries beyond Best Arm include Top-M [15], Thresholding [22], Minimum Threshold [20], Combinatorial Bandits [6], pure-strategy Nash equilibria [29] and Monte-Carlo Tree Search [27]. There is also significant interest in these problems in structured bandit models, including Rank-one [17], Lipschitz [23], Monotonic [14], Unimodal [7] and Unit-Sum [26]. Our framework applies to all these cases. Problems with multiple correct answers were recently considered by [9]. Existing learning strategies do not work unmodified; some fail and others need to be generalised. Optimism is ubiquitous in bandit optimisation since [1], and was adapted to pure exploration by [16]. We are not aware of optimism being used to solve unknown min-max problems.
Optimism was employed in the UCB Frank-Wolfe method by [2] for maximising an unknown smooth function faster. We do not currently know how to make use of such fast-rate results: for games, the best-response value is a non-smooth function of the action. Using a pair of independent no-regret learners to solve a fixed and known game goes back to [11]. More recently, game dynamics were used to explain (Nesterov) acceleration in offline optimisation [28]. Ensuring faster convergence with coordinating learners is an active area of research [25]. Unfortunately, we currently do not know how to obtain an advantage in this way, as our main learning overhead comes from concentration, not regret.

2 Algorithms with finite confidence sample complexity bounds

We introduce a family of algorithms, presented as Algorithm 1, with sample complexity bounds for non-asymptotic confidence δ. It uses the following ingredients: the GLRT stopping rule, a saddle-point algorithm (possibly formed by two regret minimisation algorithms) and optimistic loss estimates.

2.1 Model and assumption: sub-Gaussian exponential families

We suppose that the arm distributions belong to a known one-parameter exponential family. That is, there is a reference measure $\nu_0$ and parameters $\eta_1, \ldots, \eta_K \in \mathbb{R}$ such that the distribution of arm $k \in [K]$ is defined by $\mathrm{d}\nu_k/\mathrm{d}\nu_0(x) \propto e^{\eta_k x}$. Examples include Gaussians with a given variance or Bernoulli with means in (0, 1). All results can be extended to arms each in a possibly different known exponential family. Let $\Theta$ be the open interval of possible means of such distributions. A distribution $\nu$ is said to be $\sigma^2$-sub-Gaussian if for all $u \in \mathbb{R}$, $\log \mathbb{E}_{X \sim \nu}\, e^{u(X - \mathbb{E}_{X \sim \nu}[X])} \le \frac{\sigma^2}{2} u^2$. An exponential family has all distributions sub-Gaussian with constant $\sigma^2$ iff for all $\mu, \lambda \in \Theta$ it verifies $d(\mu, \lambda) \ge \frac{1}{2\sigma^2}(\mu - \lambda)^2$.

Assumption 1. The arm distributions belong to sub-Gaussian exponential families with constant $\sigma^2$.

Assumption 2. There exists a closed interval $[\mu_{\min}, \mu_{\max}] \subseteq \Theta$ such that $\mathcal{M} \subseteq [\mu_{\min}, \mu_{\max}]^K$.

Algorithm 1: Pure exploration meta-algorithm.
Require: learners $\mathcal{A}^k$ and $\mathcal{A}^\lambda$, stopping threshold $\beta(t, \delta)$ and exploration bonus $f(t)$.
1: Sample each arm once and form estimate $\hat\mu_K$.
2: for $t = K+1, \ldots$ do
3:   For $k \in [K]$, let $[\alpha^k_t, \beta^k_t] = \{\xi : N^k_{t-1}\, d(\hat\mu^k_{t-1}, \xi) \le f(t-1)\}$.  ▷ KL confidence intervals
4:   Let $\tilde\mu_{t-1} = \arg\min_{\lambda \in \mathcal{M} \cap \prod_{k=1}^K [\alpha^k_t, \beta^k_t]} \sum_{k=1}^K N^k_{t-1}\, d(\hat\mu^k_{t-1}, \lambda_k)$.  ▷ equals $\hat\mu_{t-1}$ if $\hat\mu_{t-1} \in \mathcal{M}$
5:   Let $i_t = i^\star(\tilde\mu_{t-1})$.
6:   Stop and output $\hat\imath = i_t$ if $\inf_{\lambda \in \neg i_t} \sum_k N^k_{t-1}\, d(\hat\mu^k_{t-1}, \lambda_k) > \beta(t, \delta)$.  ▷ GLRT stopping rule
7:   Get $w_t$ and $q_t$ from $\mathcal{A}^k_{i_t}$ and $\mathcal{A}^\lambda_{i_t}$.
8:   For $k \in [K]$, let $U^k_t = \max\bigl\{ f(t-1)/N^k_{t-1},\ \max_{\xi \in \{\alpha^k_t, \beta^k_t\}} \mathbb{E}_{\lambda \sim q_t}\, d(\xi, \lambda_k) \bigr\}$.  ▷ Optimism
9:   Feed $\mathcal{A}^k_{i_t}$ the loss $\ell^w_t(w) = -\sum_{k=1}^K w^k U^k_t$.
10:  Feed $\mathcal{A}^\lambda_{i_t}$ the loss $\ell^\lambda_t(q) = \mathbb{E}_{\lambda \sim q} \sum_{k=1}^K w^k_t\, d(\hat\mu^k_{t-1}, \lambda_k)$.
11:  Pick arm $k_t = \arg\min_k N^k_{t-1} / \sum_{s=1}^t w^k_s$.  ▷ Cumulative tracking
12:  Observe sample $X_t \sim \nu_{k_t}$. Update $\hat\mu_t$.
13: end for

As a consequence of Assumption 2, there exist $L, D > 0$ such that for all $y \in [\mu_{\min}, \mu_{\max}]$, the function $x \mapsto d(x, y)$ is $L$-Lipschitz on $[\mu_{\min}, \mu_{\max}]$ and $d(x, y) \le D$. Assumption 1 is implied by Assumption 2. Both are discussed in Appendix F. In particular, Assumption 2 can often be relaxed. $L$ and $D$ will appear in the sample complexity bounds, but none of our algorithms use them explicitly. Everywhere below, $\hat\mu_t$ denotes the orthogonal projection of the empirical mean onto $[\mu_{\min}, \mu_{\max}]^K$, with one possible exception: the GLRT stopping rule may use it either projected or not, indifferently.
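As an illustration of the bookkeeping in Algorithm 1 (a sketch under the assumption of Gaussian arms with known variance σ², with our own function names, not the authors' implementation), the KL confidence interval of line 3 is an explicit interval and the cumulative tracking rule of line 11 is a single argmin:

```python
import math
import numpy as np

def kl_interval_gaussian(mu_hat_k, N_k, f_t, sigma2=1.0):
    # Line 3: {xi : N_k * d(mu_hat_k, xi) <= f_t} for Gaussian arms is the interval
    # mu_hat_k +/- sqrt(2 * sigma2 * f_t / N_k)
    r = math.sqrt(2 * sigma2 * f_t / N_k)
    return mu_hat_k - r, mu_hat_k + r

def tracked_arm(counts, cum_weights):
    # Line 11 (cumulative tracking): pull the arm whose count N_{t-1}^k lags the most
    # behind its cumulative target sum_{s<=t} w_s^k
    counts = np.asarray(counts, float)
    cum_weights = np.maximum(np.asarray(cum_weights, float), 1e-12)  # guard zero weights
    return int(np.argmin(counts / cum_weights))
```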
2.2 Algorithmic ingredients

Stopping and recommendation rules. The algorithm stops if any one of $|\mathcal{I}|$ GLRT tests succeeds [13]. Let $L_\mu$ denote the likelihood under the model parametrised by $\mu$. The generalised log-likelihood ratio between a set $\Lambda$ and the whole parameter space $\Theta^K$ is
$$\mathrm{GLR}^{\Theta^K}_t(\Lambda) = \log \frac{\sup_{\mu \in \Theta^K} L_\mu(X_1, \ldots, X_t)}{\sup_{\lambda \in \Lambda} L_\lambda(X_1, \ldots, X_t)} = \inf_{\lambda \in \Lambda} \sum_{k \in [K]} N^k_t\, d(\hat\mu^k_t, \lambda_k)\,.$$
By concentration-of-measure arguments, we may find $\beta(t, \delta)$ such that with probability greater than $1 - \delta$, for all $t \in \mathbb{N}$, $\mathrm{GLR}^{\Theta^K}_t(\{\mu\}) \le \beta(t, \delta)$ [see 12, 13, 19, 23]. Test $i \in \mathcal{I}$ succeeds if $\mathrm{GLR}^{\Theta^K}_t(\neg i) > \beta(t, \delta)$. If the algorithm stops because of test $i$, it recommends $\hat\imath = i$ (if several tests succeed at the same time, it chooses arbitrarily among them).

Theorem 1. Any algorithm using the GLRT stopping and recommendation rules with a threshold $\beta(t, \delta)$ such that $\mathbb{P}_\mu\{\exists t \in \mathbb{N} : \mathrm{GLR}^{\Theta^K}_t(\{\mu\}) > \beta(t, \delta)\} \le \delta$ is δ-correct.

A game with two players. An algorithm is unable to stop at time $t$ if the stopping condition is not met, i.e.
$$\beta(t, \delta) \;\ge\; \inf_{\lambda \in \neg i^\star(\hat\mu_t)} \sum_{k \in [K]} N^k_t\, d(\hat\mu^k_t, \lambda_k)\,.$$
In order to stop early, the right-hand side has to be maximised, i.e. made close to $t \sup_{w \in \triangle_K} \inf_{\lambda \in \neg i^\star(\hat\mu_t)} \sum_{k \in [K]} w^k\, d(\hat\mu^k_t, \lambda_k) = t\, D_{\hat\mu_t} \approx t\, D_\mu$. Then with $\beta(t, \delta) \approx \log(1/\delta) + o(t)$ we obtain $t \approx \log(1/\delta)/D_\mu$ up to lower-order terms, i.e. the stopping time is close to optimal.

We propose to approach that max-min saddle point by implementing two iterative algorithms, $\mathcal{A}^k$ and $\mathcal{A}^\lambda$, for the k-player and the λ-player. Our sample complexity bound is a function of two quantities $R^k_t$ and $R^\lambda_t$, regret bounds of algorithms $\mathcal{A}^k$ and $\mathcal{A}^\lambda$ when used for $t$ steps on appropriate losses. One player of our choice goes first. The second player can see the action of the first, see the corresponding loss function and use an algorithm with zero regret (e.g. Best-Response or Be-The-Leader). One of the players has to play distributions on its action set. We have one of the following:
1. The λ-player plays first and uses a distribution in $\mathcal{P}(\neg i_t)$. The k-player plays $k_t \in [K]$.
2. The k-player plays first and uses $w_t \in \triangle_K$ (a distribution over $[K]$). The λ-player plays $\lambda_t \in \neg i_t$.
3. Both players play distributions and go in any order, or concurrently.
Algorithm 1 presents the two players playing concurrently, but it can be modified: if for example the λ-player plays second, then it gets to see $\ell^\lambda_t(q)$ before computing $q_t$. The sampling rule at stage $t$ first computes the most likely answer $i_t$ from $\tilde\mu_{t-1}$. If the set over which the algorithm optimises at line 4 is empty, $i_t$ is arbitrary. The k-player plays $w_t$ coming from $\mathcal{A}^k_{i_t}$, an instance of $\mathcal{A}^k$ running only on the rounds in which the selected answer is $i_t$. The λ-player similarly uses an instance $\mathcal{A}^\lambda_{i_t}$ of $\mathcal{A}^\lambda$.

Tracking. Since a single arm has to be pulled, if the k-player plays $w \in \triangle_K$, an additional procedure is needed to translate that play into a sampling rule. We use a so-called tracking procedure, $k_t = \arg\min_{k \in [K]} N^k_{t-1} / \sum_{s=1}^t w^k_s$, which ensures that $N^k_t \ge \sum_{s=1}^t w^k_s - (K-1)$, i.e. the counts track the cumulative weights $\sum_{s=1}^t w^k_s$.

Optimism in the face of uncertainty. Existing algorithms for general pure exploration use forced exploration to ensure convergence of $\hat\mu_t$ to $\mu$, making sure that every arm is sampled more than e.g. $\sqrt{t}$ times. We replace that method by the optimism-in-the-face-of-uncertainty principle, which gives a more adaptive exploration scheme. While that heuristic is widely used in the bandit literature, this work is its first successful implementation for general pure exploration. In Algorithm 1, the k-player algorithm gets an optimistic loss depending on $w_t$ and $q_t$. The λ-player gets a non-optimistic loss.
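For instance, the optimistic quantity $U^k_t$ of line 8 can be computed directly once the λ-player's mixed play $q_t$ is represented by its finite support. The following sketch assumes unit-variance Gaussian arms and is our own illustration, not the paper's code.

```python
def optimistic_gain(k, alpha_k, beta_k, N_k, f_t, q, sigma2=1.0):
    """U_t^k of Algorithm 1, line 8: an optimistic estimate of the k-player's unknown
    gain E_{lambda~q_t} d(mu_k, lambda_k), where q_t is a finitely supported mixed play
    given as a list of (probability, lambda_vector) pairs (Gaussian arms)."""
    def d(x, y):
        return (x - y) ** 2 / (2 * sigma2)
    def expected_div(xi):                      # E_{lambda ~ q_t} d(xi, lambda_k)
        return sum(p * d(xi, lam[k]) for p, lam in q)
    # maximise over the endpoints of the KL confidence interval, floored at f(t-1)/N_{t-1}^k
    return max(f_t / N_k, expected_div(alpha_k), expected_div(beta_k))
```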
2.3 Proof scheme and sample complexity result

In order to bound the sample complexity, we introduce a sequence of concentration events
$$\mathcal{E}_t = \Bigl\{ \forall s \le t,\ \forall k \in [K]:\ d(\hat\mu^k_s, \mu^k) \le \frac{\overline{W}\bigl((1+a)\log t\bigr)}{N^k_s} \Bigr\}$$
for $a > 0$ and $\overline{W}(x) = x + \log x + 1/2$. It verifies $\sum_{t=3}^{+\infty} \mathbb{P}_\mu(\mathcal{E}_t^c) \le \frac{2eK}{a^2}$ (see Appendix B for a proof). The concentration intervals used in Algorithm 1 are a function of $f(t) = \overline{W}\bigl((1+a)(1+b)\log t\bigr)$ for $b > 0$.

Lemma 1. Let $\mathcal{E}_t$ be an event and $T_0(\delta) \in \mathbb{N}$ be such that for $t \ge T_0(\delta)$, $\mathcal{E}_t \subseteq \{\tau_\delta \le t\}$. Then
$$\sum_{t=1}^{\infty} \mathbb{P}\{\tau_\delta > t\} \;\le\; T_0(\delta) + \sum_{t = T_0(\delta)}^{\infty} \mathbb{P}_\mu(\mathcal{E}_t^c)\,.$$

We now briefly present the steps of the proof of the stopping time upper bound before stating our main theorem on the sample complexity of Algorithm 1. These steps are inexact and should be regarded as a guideline and not as rigorous computations. A full proof of our results can be found in the appendices (Appendix B for concentration results, C for tracking and D for the main sample complexity proof). We simplify the presentation by supposing that $i_t = i^\star(\mu)$ throughout (the main proof shows this may fail in only $o(t)$ rounds). For $t < \tau_\delta$, under the concentration event $\mathcal{E}_t$,
$$\begin{aligned}
\beta(t, \delta) &\ge \inf_{\lambda \in \neg i^\star(\mu)} \sum_{k \in [K]} N^k_t\, d(\hat\mu^k_t, \lambda_k) && \text{(stopping condition)} \\
&\ge \inf_{\lambda \in \neg i^\star(\mu)} \sum_{s \in [t]} \sum_{k \in [K]} w^k_s\, d(\hat\mu^k_t, \lambda_k) - KD && \text{(tracking)} \\
&\ge \inf_{\lambda \in \neg i^\star(\mu)} \sum_{s \in [t]} \sum_{k \in [K]} w^k_s\, d(\hat\mu^k_{s-1}, \lambda_k) - O\bigl(\sqrt{t \log t}\bigr)\,. && \text{(concentration)}
\end{aligned}$$
The first term is now the infimum of a sum of losses, $\inf_{\lambda \in \neg i^\star(\mu)} \sum_{s \in [t]} \ell^\lambda_s(\lambda)$. We use the regret property of the λ-player's algorithm on those losses, then we introduce optimistic values $U^k_s$ such that for $\xi_k \in \{\mu_k, \hat\mu^k_{s-1}\}$ we have $\mathbb{E}_{\lambda \sim q_s}\, d(\xi_k, \lambda_k) \le U^k_s \le \mathbb{E}_{\lambda \sim q_s}\, d(\xi_k, \lambda_k) + (\text{error})$, where the errors accumulate into the $O(\sqrt{t})$ terms below:
$$\begin{aligned}
\inf_{\lambda \in \neg i^\star(\mu)} \sum_{s \in [t]} \sum_{k \in [K]} w^k_s\, d(\hat\mu^k_{s-1}, \lambda_k)
&\ge \sum_{s \in [t]} \mathbb{E}_{\lambda \sim q_s} \sum_{k \in [K]} w^k_s\, d(\hat\mu^k_{s-1}, \lambda_k) - R^\lambda_t && \text{(regret $\lambda$)} \\
&\ge \sum_{s \in [t]} \sum_{k \in [K]} w^k_s\, U^k_s - O(\sqrt{t}) - R^\lambda_t && \text{(optimism)} \\
&\ge \max_{k \in [K]} \sum_{s \in [t]} U^k_s - R^k_t - O(\sqrt{t}) - R^\lambda_t && \text{(regret $w$)} \\
&\ge \max_{k \in [K]} \sum_{s \in [t]} \mathbb{E}_{\lambda \sim q_s}\, d(\mu_k, \lambda_k) - R^k_t - O(\sqrt{t}) - R^\lambda_t\,. && \text{(optimism)}
\end{aligned}$$
Finally, $\frac{1}{t}\sum_{s \in [t]} \mathbb{E}_{\lambda \sim q_s}$ is itself the expectation under a distribution in $\mathcal{P}(\neg i^\star(\mu))$. Hence
$$\max_{k \in [K]} \sum_{s \in [t]} \mathbb{E}_{\lambda \sim q_s}\, d(\mu_k, \lambda_k) \;\ge\; t \inf_{q} \max_{k} \mathbb{E}_{\lambda \sim q}\, d(\mu_k, \lambda_k) \;=\; t\, D_\mu\,.$$
Putting these inequalities together, we finally get an inequality on any such $t < \tau_\delta$. The exact result we obtain is the following theorem, proved in Appendix D.

Theorem 2. Under Assumption 2, the sample complexity of Algorithm 1 on model $\mu \in \mathcal{M}$ satisfies
$$\mathbb{E}_\mu[\tau_\delta] \le T_0(\delta) + \frac{2eK}{a^2} \quad\text{with}\quad T_0(\delta) = \max\Bigl\{ t \in \mathbb{N} : t \le \frac{\beta(t, \delta)}{D_\mu} + C_\mu\bigl(R^\lambda_t + R^k_t + h(t)\bigr) \Bigr\}\,,$$
where $C_\mu$ depends on $\mu$ and $\mathcal{M}$ and $h(t) = O(\sqrt{t \log t})$. See Appendix D for an exact definition.

The forms of $h(t)$ and of $T_0(\delta)$ depend on the particular algorithm, but we now show how an inequality of that type translates into an explicit bound on $T_0(\delta)$. The next lemma is a consequence of the concavity of $t \mapsto \sqrt{t \log t}$ (linearise that function at $T^\delta$ and solve for $t$).

Lemma 2. Suppose that $t \in \mathbb{R}$ verifies $t \le C\sqrt{t \log t} + \frac{\log(1/\delta)}{D_\mu}$. Then, for $T^\delta = \frac{\log(1/\delta)}{D_\mu}$,
$$t \;\le\; T^\delta + \frac{C\sqrt{T^\delta \log T^\delta}}{1 - C\,\frac{1 + \log T^\delta}{2\sqrt{T^\delta \log T^\delta}}}\,.$$

3 Practical Implementations

Next we discuss instantiating no-regret learners. We consider a hierarchy of computational oracles:
1. Min aka Best-Response oracle: obtain, for any $i \in \mathcal{I}$, $w \in \triangle_K$ and $\xi \in \Theta^K$, a minimiser over $\neg i$ of $\lambda \mapsto \sum_{k \in [K]} w_k\, d(\xi_k, \lambda_k)$.
2. Max-min aka Game-Solving oracle: obtain, for any $i \in \mathcal{I}$ and $\xi \in \Theta^K$, a vector $w^\star \in \triangle_K$ such that there is a Nash equilibrium $(w^\star, q^\star) \in \triangle_K \times \mathcal{P}(\neg i)$ of the zero-sum game with reward $d(\xi_k, \lambda_k)$ in which the k-player uses the mixed strategy $w^\star$.
3. Max-max-min oracle: for any confidence region $\mathcal{C} = [a_1, b_1] \times \cdots \times [a_K, b_K]$, obtain $(\mu^+, i^+, w^+)$ with $(\mu^+, i^+) = \arg\max_{\xi \in \mathcal{C},\, i \in \mathcal{I}} \sup_{w \in \triangle_K} \inf_{\lambda \in \neg i} \sum_{k=1}^K w_k\, d(\xi_k, \lambda_k)$ and $w^+$ a k-player strategy of a Nash equilibrium of the game with reward $d(\mu^+_k, \lambda_k)$.

For Minimum Threshold all oracles can be evaluated in closed form in O(K) time, and the same is true for Best-Response in Best Arm Identification (see the sketch below). Max-min for Best Arm requires binary search [13] and Max-max-min requires O(K) max-min calls. See [24] for run-time data on Track-and-Stop (max-min oracle) and gradient ascent (min oracle) for Best Arm. Our approach also extends naturally to min-max and max-min-max oracles, which we plan to incorporate in full detail in future work.
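As an example of the cheapest oracle in the hierarchy, the Best-Response oracle for Best Arm Identification with Gaussian arms reduces to K−1 pairwise projections onto weighted averages [13]. A minimal sketch follows (our own code, assuming unit-variance Gaussian arms and that arm i is the unique best arm of mu):

```python
import numpy as np

def best_response_bai_gaussian(mu, w, i, sigma2=1.0):
    """Oracle 1 for Best Arm Identification: returns (value, lambda) where lambda is a
    minimiser over {i*(lambda) != i} of sum_k w_k d(mu_k, lambda_k), assuming arm i is
    the unique best arm of mu."""
    mu, w = np.asarray(mu, float), np.asarray(w, float)
    best_val, best_lam = np.inf, None
    for j in range(len(mu)):
        if j == i:
            continue
        # closest model in which arm j matches arm i: both means move to their
        # w-weighted average (closed form in the Gaussian case)
        x = (w[i] * mu[i] + w[j] * mu[j]) / (w[i] + w[j] + 1e-12)
        val = (w[i] * (mu[i] - x) ** 2 + w[j] * (mu[j] - x) ** 2) / (2 * sigma2)
        if val < best_val:
            lam = mu.copy()
            lam[i], lam[j] = x, x
            best_val, best_lam = val, lam
    return best_val, best_lam
```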
3.1 A Learning Algorithm for the k-Player vs Best-Response for the λ-Player

In this section the k-player plays first, employing a regret minimisation algorithm for linear losses on the simplex to produce $w_t \in \triangle_K$ at time $t$. We pick AdaHedge [8], which runs in O(K) time per round and adapts to the scale of the losses. The λ-player goes second and can use a zero-regret algorithm: Best-Response. It plays $q_t$, a Dirac at $\lambda_t \in \arg\min_{\lambda \in \neg i_t} \sum_{k \in [K]} w^k_t\, d(\hat\mu^k_{t-1}, \lambda_k)$.

Lemma 3. AdaHedge has regret $R^k_t \le \sqrt{\sum_{s \le t} b_s^2 \ln K} + \max_{s \le t} b_s\, (\tfrac{4}{3}\ln K + 2)$, where $b_s = \max_k U^k_s - \min_k U^k_s \le \max\{D, f(s)\}$ is the loss scale in round $s$, so that $R^k_t = O(\sqrt{t \ln K}\, \ln t)$. Best-Response has no regret, $R^\lambda_t \le 0$.

The sample complexity is bounded per Theorem 2. We expect that in practice the scale converges to $b_s \approx D_\mu$ after a transitory startup phase. Computational complexity: one best-response oracle call per time step.
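Section 3.1 only names the k-player's learner; for completeness, here is a minimal self-contained sketch of the AdaHedge update of [8] (our own rendering, not the authors' implementation). In Algorithm 1 the k-player would call play() to obtain $w_t$ and then update() with the loss vector $-U_t$.

```python
import numpy as np

class AdaHedge:
    """Hedge with the self-tuned learning rate of de Rooij et al. (2014)."""

    def __init__(self, K):
        self.L = np.zeros(K)      # cumulative loss of each arm
        self.delta = 0.0          # cumulative mixability gap

    def _eta(self):
        return np.inf if self.delta == 0 else np.log(len(self.L)) / self.delta

    def play(self):
        eta = self._eta()
        if np.isinf(eta):
            w = (self.L == self.L.min()).astype(float)     # follow the leader
        else:
            w = np.exp(-eta * (self.L - self.L.min()))
        return w / w.sum()

    def update(self, loss):
        loss = np.asarray(loss, float)
        w, eta = self.play(), self._eta()
        h = float(w @ loss)                                 # Hedge (dot) loss
        if np.isinf(eta):
            m = loss[w > 0].min()                           # mix loss as eta -> infinity
        else:
            m = loss.min() - np.log(w @ np.exp(-eta * (loss - loss.min()))) / eta
        self.delta += max(0.0, h - m)                       # mixability gap is nonnegative
        self.L += loss
```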
3.2 Learning Algorithms for the λ-Player vs Best-Response for the k-Player

Using a learner for the λ-player removes the need for a tracking procedure. In this section the k-player goes second and uses Best-Response, with zero regret, i.e. $k_t = \arg\max_{k \in [K]} U^k_t$ (see Algorithm 1). After playing $q_t \in \mathcal{P}(\neg i_t)$, the λ-player suffers loss $\mathbb{E}_{\lambda \sim q_t}\, d(\hat\mu^{k_t}_{t-1}, \lambda_{k_t})$. Most existing regret minimisation algorithms do not apply, since the function $\lambda \mapsto d(\mu, \lambda)$ is not convex in general and the action set $\neg i_t$ is also not convex. The challenge is to come up with an algorithm able to play distributions with only access to a best-response oracle.

Follow-The-Perturbed-Leader. Follow-The-Perturbed-Leader can sample points from a distribution on $\mathcal{P}(\neg i)$ by only using best-response oracle calls on $\neg i$. The version we use here incorporates all the information available to the λ-player: the loss of $\lambda \in \neg i_t$ will be $d(\hat\mu^{k_t}_{t-1}, \lambda_{k_t})$, where the only unknown quantity is $k_t$. Let $\sigma_t \in \mathbb{R}^K$ be a random vector with independent exponentially distributed coordinates. The idea is that the distribution $q_t$ played by the λ-player should be the distribution of
$$\arg\min_{\lambda \in \neg i_t}\ \sum_{s=1}^{t-1} d(\hat\mu^{k_s}_{s-1}, \lambda_{k_s}) + \sum_{k=1}^K \sigma^k_t\, d(\hat\mu^k_{t-1}, \lambda_k)\,.$$
We show in Appendix E.2 that this argmin can be computed by a single best-response oracle call. However, the k-player has to be able to compute the best response to $q_t$. Since we cannot get the above distribution exactly, we instead take for $q_t$ an empirical distribution of $t$ samples. A regret bound $R^\lambda_t = O(\sqrt{t \log t})$ for that algorithm is given in Appendix E.2. The sample complexity is then bounded by Theorem 2. Computational complexity: $t$ best-response oracle calls at time step $t$.

Online Gradient Descent. While the learning problem for λ is hard in general, in several common cases the sets $\neg i$ have a simple structure. If these sets are unions of a finite number $J$ of convex sets and $\lambda \mapsto d(\mu, \lambda)$ is convex (as for Gaussian or Bernoulli arm distributions), then we can use off-the-shelf regret algorithms. One gradient descent learner can be used on each convex set, and these $J$ experts are then aggregated by an exponential weights algorithm. This procedure has $O(\sqrt{t})$ regret. The computational complexity is $J$ (convex) best-response oracle calls per time step.

3.3 Optimistic Track-and-Stop

At stage $t$, this algorithm computes $(\mu^+, i_t) = \arg\max_{\xi, i} \sup_{w \in \triangle_K} \inf_{\lambda \in \neg i} \sum_{k=1}^K w_k\, d(\xi_k, \lambda_k)$, where $\xi$ ranges over all points of $\Theta^K$ in a confidence region around $\hat\mu_{t-1}$ and $i \in \mathcal{I}$. Then the k-player plays $w_t$ such that there exists a Nash equilibrium $(w_t, q_t)$ of the game with reward $d(\mu^+_k, \lambda_k)$. The proof of its sample complexity bound proceeds slightly differently from the sketch of Section 2.3, although the ingredients are still the GLRT, concentration, optimism and game solving. The proof of the following lemma can be found in Appendix E.2.

Lemma 4. Take $b = 1$ in the definition of $f(t)$. Let $h(t) = 2\sqrt{t D_\mu} + 3L\sqrt{2\sigma^2 f(t)}\,\bigl(K^2 + \sqrt{2Kt}\bigr) + f(t)\bigl(K^2 + 2K\log(t/K)\bigr) + KD$. Then the expected sample complexity is at most $T_0(\delta) + \frac{2eK}{a^2}$, where $T_0(\delta)$ is the maximal $t \in \mathbb{N}$ such that $t \le (\beta(t, \delta) + h(t))/D_\mu$.

Note: the $K^2$ factors are due to the tracking. We conjecture that they should be $K \log K$ instead. Computational complexity: one max-max-min oracle call per time step.

[Figure 1: Selected experiments; in both cases δ = 0.1, plots based on 3000 and 5000 runs. (a) Best Arm for the Bernoulli bandit model µ = (0.3, 0.21, 0.2, 0.19, 0.18); the oracle weights are w⋆ = (0.34, 0.25, 0.18, 0.13, 0.10). (b) Minimum Threshold for the Gaussian bandit model µ = (0.5, 0.6) with threshold γ = 0.6 and w⋆ = e₁; note the excessive sample complexity of T-C/T-D. Each panel compares the sampling rules D-C, D-D, M-C, M-D, T-C, T-D, O-C, O-D, RR and opt against the lower bound and its practical version.]

This algorithm is the most computationally expensive, but it has the best sample complexity upper bound, has a simpler proof, and works well in experiments where computing the max-max-min oracle is feasible, like the Best Arm and Minimum Threshold problems (see Section 4).

4 Experiments

The goal of our experiments is to empirically validate Algorithm 1 on benchmark problems for practical δ. We use the stylised stopping threshold $\beta(t, \delta) = \ln\frac{1 + \ln t}{\delta}$ and exploration bonus $f(t) = \ln t$. Both are unlicensed by theory yet conservative in practice (the error frequency is way below δ). We use the following letter coding to designate sampling rules: D for AdaHedge vs Best-Response as advocated in Section 3.1, T for Track-and-Stop of [13], M for the Gradient Ascent algorithm of [24], O for Optimistic Track-and-Stop from Section 3.3, RR for uniform, and opt for following the oracle proportions $w^\star(\mu)$. We also ran all our experiments on a simplification of D that uses a single learner instead of partitioning the rounds according to $i_t$. We omit it from the results, as it was always within a few percent of D. We append -C or -D to indicate whether cumulative ($N_t \approx \sum_{s \le t} w_s$) or direct ($N_t \approx t\, w_t$) tracking [13] is employed. We finally note that we tune the learning rate of M in terms of (the unknown) $D_\mu$.

We perform two series of experiments, one on Best Arm instances from [13, 24], and one on Minimum Threshold instances from [20]. Two selected experiments are shown in Figure 1; the others are included in Appendix G. We contrast the empirical sample complexity with the lower bound $\mathrm{kl}(\delta, 1-\delta)/D_\mu$, and with a more practical version, namely the time $t$ for which $t = \beta(t, \delta)/D_\mu$, which is, approximately, the first time at which the GLRT stopping rule crosses the threshold β. We see in Figures 1(a) and 1(b) that direct tracking -D has the advantage over cumulative tracking -C across the board, and that uniform sampling RR is sub-optimal, as expected.
In Figure 1(a) we see that T performs best, closely followed by M and O. Sampling from the oracle weights opt performs surprisingly poorly (as also observed in [26, Table 1]). The main message of Figure 1(b) is that T can be highly sub-optimal. We comment on the reason in Appendix G.2. Asymptotic optimality of T implies that this effect disappears as $\delta \to 0$; however, for this example that kicks in excruciatingly slowly: Figure 5 shows that T is still not competitive at $\delta = 10^{-20}$. On the other hand, O performs best, closely followed by M and then D. Practically, we recommend using O if its computational cost is acceptable, M if an estimate of the problem scale is available for tuning, and D otherwise. The gap between opt and T (or O) shows that Track-and-Stop outperforms its design motivation. It is an exciting open problem to understand exactly why, and to optimise for stopping early ($N_t/t \approx w^\star(\hat\mu_t)$) while ensuring optimality ($\mathbb{E}_\mu[N_\tau]/\mathbb{E}_\mu[\tau] \to w^\star(\mu)$).

5 Conclusion

We leveraged the game point of view of the pure exploration problem, together with the use of the optimism principle, to derive algorithms with sample complexity guarantees at non-asymptotic confidence. Varying the flavours of optimism and saddle-point strategies leads to procedures with diverse trade-offs between sample and computational complexities. Our sample complexity bounds attain asymptotic optimality while offering guarantees for moderate confidence, and the obtained algorithms are empirically sound. Our bounds, however, most probably do not depend optimally on the problem parameters, like the number of arms K. For BAI and the Top-K arms problems, lower bounds with lower-order terms as well as matching algorithms were derived by [26]. A generalisation of such lower bounds to the general pure exploration problem could shed light on the optimal complexity across the full confidence spectrum. The richness of existing saddle-point iterative algorithms may bring improved performance over our relatively simple choices. A smart algorithm could possibly take advantage of the stochastic nature of the losses instead of treating them as completely adversarial.

Acknowledgements. We are grateful to Zakaria Mhammedi and Emilie Kaufmann for multiple generous discussions. Travel funding was provided by INRIA Associate Team 6PAC. The experiments were carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[2] Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems (NeurIPS), pages 2222–2231, 2017.
[3] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely armed and continuous armed bandits. Theoretical Computer Science, 412:1832–1852, 2011.
[4] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[5] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Proceedings of the 2017 Conference on Learning Theory (COLT), volume 65 of Proceedings of Machine Learning Research, pages 535–592. PMLR, 2017.
[6] S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, 2014.
[7] Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning (ICML), pages 521–529, 2014.
[8] Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, April 2014.
[9] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems (NeurIPS), December 2019.
[10] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In 15th Annual Conference on Learning Theory (COLT), volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002.
[11] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
[12] Aurélien Garivier. Informational confidence bounds for self-normalized averages and applications. In 2013 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2013.
[13] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
[14] Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017.
[15] S. Kalyanakrishnan and P. Stone. Efficient selection in multiple bandit arms: Theory and practice. In International Conference on Machine Learning (ICML), 2010.
[16] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.
[17] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 392–401. PMLR, 2017.
[18] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016.
[19] Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Preprint, October 2018.
[20] Emilie Kaufmann, Wouter M. Koolen, and Aurélien Garivier. Sequential test for the lowest mean: From Thompson to Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 6333–6343, December 2018.
[21] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.
[22] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning (ICML), volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698. JMLR.org, 2016.
[23] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999, 2014.
[24] Pierre Ménard. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019.
[25] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (NeurIPS), pages 3066–3074, 2013.
[26] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794–1834, 2017.
[27] Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions, 97-D(3):392–398, 2014.
[28] Jun-Kun Wang and Jacob D. Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems (NeurIPS), pages 3828–3838, 2018.
[29] Yichi Zhou, Jialian Li, and Jun Zhu. Identify the Nash equilibrium in static games with random payoffs. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 4160–4169. PMLR, 2017.