# best_arm_identification_in_spectral_bandits__49d9ab6c.pdf Best Arm Identification in Spectral Bandits Tom aˇs Koc ak , Aur elien Garivier Ecole Normale Sup erieur de Lyon tomas.kocak@gmail.com, aurelien.garivier@ens-lyon.fr We study best-arm identification with fixed confidence in bandit models with graph smoothness constraint. We provide and analyze an efficient gradient ascent algorithm to compute the sample complexity of this problem as a solution of a non-smooth max-min problem (providing in passing a simplified analysis for the unconstrained case). Building on this algorithm, we propose an asymptotically optimal strategy. We furthermore illustrate by numerical experiments both the strategy s efficiency and the impact of the smoothness constraint on the sample complexity. Best Arm Identification (BAI) is an important challenge in many applications ranging from parameter tuning to clinical trials. It is now very well understood in vanilla bandit models, but real-world problems typically involve some dependency between arms that requires more involved models. Assuming a graph structure on the arms is an elegant practical way to encompass this phenomenon, but this had been done so far only for regret minimization. Addressing BAI with graph constraints involves delicate optimization problems for which the present paper offers a solution. 1 Introduction This work is devoted to the optimization of a stochastic function on a structured, discrete domain A = [K] {1, . . . , K}. We consider the noisy black-box model: a call to the function at point a A yields an independent draw of an unknown distribution νa F with mean µa, where F is some family of probability laws. At each time step t N, a point At A can be chosen (based on past observations) and the corresponding random outcome Yt of law νAt is observed. The signal µ (µ1, . . . , µK)T is assumed to be smooth in the following sense: A is equipped with a fixed and weighted graph structure G with adjacency matrix W = (wa,b)a,b A, and µ satisfies the graph smoothness property: a,b A wa,b (µa µb)2 2 = µ TLµ = µ 2 L R Contact Author for some (known) smoothness parameter R, where L is the graph Laplacian defined as: La, b = wa, b for a = b and La, a = P b =a wa, b. This enforces values of means (µa)a at two points a, b A to be close to each another if the weight wa,b is large. Following a classical framework in statistical testing, a risk parameter δ is fixed, and the goal is to design an algorithm that successfully maximizes µ on A as fast as possible, with failure probability at most δ. More specifically, the algorithm consists of a sampling rule (ψt)t 1 that chooses to observe At = ψt(A1, Y1, . . . , At 1, Yt 1) at step t, and a stopping rule τ such that whatever the distributions ν = (νa)a A FK, satisfies the constraint Pν Aτ+1 a (µ) arg max i [K] µi 1 δ . Among all such algorithms, the goal is to minimize the sample complexity Eµ[τ]. This problem is known in the multi-armed bandit literature as best-arm identification with fixed confidence [Even-Dar et al., 2006; Gabillon et al., 2012]. In that context, points of the domain A are called arms and the set of distributions ν is called a model. Bandit models have raised strong interest in the past years, as multi-armed bandits were gaining popularity thanks to their ability to model a variety of real-world scenarios while providing strong theoretical guarantees for the proposed algorithms. For an introduction to the theory of bandit models and a recent list of references, see [Lattimore and Szepesv ari, 2019] and references therein. Best-arm identification, in particular, has many implications in artificial intelligence and data science. For example, in pharmacology, the optimal drug dosage depends on the drug formulation but also on the targeted individuals (genetic characters, age, sex) and environmental interactions. In the industrial context, the shelf life and performance of a manufacturing device depend on its design, but also on its environment and on uncertainties during the manufacturing process. As for targeted recommender systems, their efficiency depends on the ability to understand users preferences despite the high variability of choices. Lastly, the performance of a deep neural network depends on its (hyper-)parameters and on the quality of the training set. This short list is of course not comprehensive, but the automation of these representative optimization tasks is one of the great challenges of modern artificial in- Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) telligence in the fields of personalized medicine, predictive maintenance, online marketing, and autonomous learning. The problem of best-arm identification has recently received a precise solution for the vanilla version of the multiarmed bandit problem, where the learner gains information only about the arm played. [Garivier and Kaufmann, 2016a; 2019] (see also [Russo, 2016]) have described the information-theoretic barriers of the sample complexity in the form an instance-optimal lower bound on Eµ[τ]; furthermore, they have proposed a strategy, called Track-and-Stop, that asymptotically reaches this lower bound. However, the vanilla bandit model is extremely limiting in many applications, and some additional information has to be incorporated into the learning process. Recently, several papers studied multi-armed bandit problems with the ability to capture the rich structure of real-world problems. In particular, [Degenne and Koolen, 2019; Degenne et al., 2019] have extended the results mentioned above to classes of possibly structured bandits with possibly different goals. In parallel, a fruitful approach to handle large and organized domains is assume the existence of a graph structure on top of arms which provides additional information. There are two most dominant approaches to use this graph structure. The first approach is to use graphs to encode additional information about other arms; playing an arm reveals (fully or partially) the rewards of all the neighbors [Mannor and Shamir, 2011; Alon et al., 2013; Koc ak et al., 2014; Alon et al., 2017; Koc ak et al., 2016b; 2016a]. The second approach is to encode the similarities between arms by the weights of the edges; the rewards of connected arms are similar [Valko et al., 2014; Koc ak et al., 2014; Koc ak et al., 2019]. This second approach, which we adopt in this paper, easily permits to capture a lot of information about the problem by the creation of a similarity graph, which explains its wide use in signal processing, manifold, and semi-supervised learning. However, most of the work in multi-armed bandits with structure has been done so far under the regret minimization objective, while best arm identification (a practically crucial objective in auto ML, clinical trials, A/B testing, etc.) was mostly neglected. 1.1 Our Contributions In this paper, we aim to fill this gap by combining the best arm identification work of [Garivier and Kaufmann, 2016a] with the spectral setting of [Valko et al., 2014], resulting in spectral best arm identification problem. This setting captures a rich variety of real-world problems that have been neglected so far. We also analyze this setting and provide instance-optimal bounds on the sample complexity. Inspired by [Degenne and Koolen, 2019], we use a game-theoretical point of view on the problem which enables us to obtain two main contributions of this paper: Contribution 1: simplified and enlightening gametheoretic analysis for the unconstrained case (case without the spectral constraint) of [Garivier and Kaufmann, 2016a] that brings new insights to the problem which enable us to extend the analysis to the constrained case. Contribution 2: building upon the unconstrained case, we provide and analyze an algorithm that computes the sample complexity as well as the optimal arm pulling proportions in the constrained case. We use this algorithm to propose an asymptotically optimal strategy for fixed-confidence best-arm identification and we provide numerical experiments supporting the theoretical results and showing the interest of the graph constraint for reducing the sample complexity. 1.2 Assumptions A standard assumption on the family F of considered probability laws is that they form an exponential family of distributions parametrized by theirs means so that a bandit problem is fully described by the vector µ (µi, . . . , µK)T. To avoid technicalities and to emphasize the methodological contributions of this paper (ie. how to perform optimal bestarm identification with graph regularity constraints), we focus here on the case of Gaussian distributions with variance 1. For the same reason, we assume here that the vector µ has a unique maximum also denoted by a (µ). We furthermore denote by MR = {λ RK : λ TLλ R} the constrained set of considered signals (bandit problems), and by µ maxi [K] µi = µa (µ) the maximum of the signal µ (we identify the best arm with s singleton a (µ)). 1.3 Paper Structure The paper is organized as follows: Section 2 discusses the information-theoretic lower bound and a simple algorithm for the unconstrained case, previously analyzed by [Garivier and Kaufmann, 2016a], with focus on new techniques that considerably simplify the analysis. The constrained case requires more care: we propose in Section 3 a mirror gradient ascent algorithm to compute optimal arm allocation and show non-asymptotic bounds of convergence for this algorithm. Using this procedure permits us in Section 4 to present the Spectral Ta S algorithm as an extension of the Track-and Stop devoted to best-arm identification with graph regularity constraint, that reaches the instance-optimal sample complexity. We illustrate its performance and the impact of the smoothness constraint on the sample complexity by reporting some numerical experiments. 2 Sample Complexity The general information-theoretical lower bound analysis provided by [Garivier and Kaufmann, 2016b] applies to any set of bandit problems, in particular to the set of all the problems with smoothness bounded from above by R. Proposition 1. For any δ-correct strategy and any bandit problem µ Eµ[Tδ] T R(µ)k(δ, 1 δ) T R(µ) 1 sup ω K inf λ AR(µ) a [K] ωak(µa, λa) (1) for K being K-dimensional simplex, k(µa, λa) KLdivergence between distributions parametrized by µa and λa, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) and for the set AR(µ) of bandit problems with different best arm than µ, defined as AR,i(µ) λ MR : λi λa (µ) AR(µ) i =a (µ)AR,i(µ). Indeed, even if the constrained case is not explicitly covered there, this result is easily shown by following the lines of the proof of Theorem 1 in [Garivier and Kaufmann, 2016b], with AR(µ) as the set of alternatives to µ. This lower bound proves to play an important role in designing algorithms for the best arm identification since, roughly speaking, the time needed to distinguish between bandit problems µ and λ, if the number of pulls of arm i is proportional to ωi, scales with inverse of P a [K] ωak(µa, λa). Therefore, the minimization part of problem (1) selects the alternative problem which is the most difficult to distinguish from µ while playing according to ω while, on the other hand, the maximization part of problem (1) chooses ω such that the expected stopping time is as small as possible, even in the worst case scenario chosen by the minimization part of the problem. Thus, a sampling strategy playing according to ω (µ) that maximizes expression (1) is optimal in the worst case, and having a procedure that computes optimal ω (µ) enables to design a best arm identification algorithm with optimal sample complexity. In this section, we provide an algorithm to compute this sample complexity T R(µ) as well as the optimal proportion ω (µ) of the arm allocation for the learner. We proceed in three steps. Step 1: we introduce the best response oracle that computes the best response λ (ω) to fixed ω by solving the minimization part of problem (1) for chosen ω: λ (ω) arg min λ AR(µ) a [K] ωak(µa, λa) . Step 2: we show that substituting the minimization part of problem (1) by λ (ω), the resulting maximization problem is concave with respect to ω. Moreover, supergradient for this problem is computable using the same best response oracle we used for the minimization part of the problem. Step 3: using supergradient we can apply any supergradient ascent algorithm to find optimal arm allocation ω . The algorithm of our choice is mirror ascent with generalized negative entropy as a mirror map. We chose this algorithm due to its strong guarantees on convergence rate in simplex setup however, even basic gradient-based algorithms could be applied. The choice of mirror ascent algorithm will be apparent later in Section 3, we show that this algorithm enjoys convergence rate proportional to log K instead of K of the basic supergradient ascent algorithm. As briefly mentioned in the introduction, for the sake of simplicity of presentation, we assume that the distributions νµa are Gaussian. This simplifies some of the calculations while our bounds remain true even in the case of sub Gaussian random variables, including a wide class of distri- butions among which any distributions with bounded support or Bernoulli distribution. Even though using a general exponential family of probability distributions uses similar techniques, all the proofs are more technical and rather out of the scope of a conference paper. With assumption that the distributions associated with arms are Gaussian with normalized variance, KL-divergence of νµa and νλa can be expressed as k(µa, λa) = (µa λa)2 2.1 Best Arm Identification without Spectral Constraint This section is dedicated to the vanilla setting of [Garivier and Kaufmann, 2016a]. Even though this problem can be seen as a special case of spectral setting either (for the edgeless graph or for smoothness parameter R being ), we have decided to analyze it separately. The reason is that we can demonstrate techniques later used in the constrained case while proving the main result of [Garivier and Kaufmann, 2016a] with more insightful arguments that are significantly more elegant. Also, not having the spectral constraint allows us to have best response oracle with a closed-form solution which implies a simple way of finding optimal arm allocation ω . This is very different from the spectral setting where best response oracle does not have a closed-form solution and therefore, we can not avoid a numerical procedure to find ω . Theorem 2. Without loss of generality, assume that µ1 > µ2 µK and define {xa(c)}K a=2, for some positive constant c, recursively as xa 1(c) = (1 + xa(c)) µ1 µa 1 for 2 < a K. Let f(c) be a function with parameter c defined as Then there exist c R+ s.t. f(c ) = 1 and we obtain ω (µ) as 1 + PK a=2 xa(c ) ω a(µ) = xa(c )ω 1(µ) Remark. Since µ1 > µa 1 µa, we see that µ1 µa 1 which in consequence means that. If xa(c) is negative, xa 1(c) is negative as well. If xa(c) is positive, xa 1(c) xa(c). Thus, f(0) 0. Moreover, f is an increasing and continuous function with limc f(c) = . Therefore, the existence of c is guaranteed. In the rest of this section, we build the necessary tools to provide the proof of Theorem 2. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Best Response Oracle - Vanilla Setting In the case of smoothness R = , the spectral constraint is always satisfied which provides a very simple closed form solution for the best response oracle given by the following lemma. Lemma 3. Let ω be a fixed vector from K. Then the best response λi(ω) A ,i(µ) with the best arm i is λi(ω) = (t, µ2, . . . , µi 1, t, µi+1, . . . , µK) T for t being a weighted average of µ1 and µi with weights ω1 and ωi t = µ1ω1 + µiωi The proof of this Lemma can be obtained by simple calculus. Game-Theoretical Point of View Our optimization problem (1) can be seen as a zero sum game where Player 1 plays a vector from simplex K and Player 2 plays point λ from AR(µ). We would like to have guarantees on the existence of Nash equilibrium however AR(µ) is not a convex set which poses a problem. Also, we can not directly convexify the problem since it would change the value of T R(µ) 1. To get around this obstacle, we define DR,i(µ) {(k(µ1, λ1), . . . , k(µK, λK)) T : λ AR,i(µ)} , DR(µ) i =a (µ)DR,i(µ) . This enables us to rewrite the optimization problem in the following form T R(µ) 1 sup ω K inf d DR(µ) ω Td (2) where Player 2, instead of playing λ, plays vectors of divergences (k(µ1, λ1), . . . , k(µK, λK))T where the element on position i of the vector is divergence k(µi, λi) of the distributions corresponding to i-th elements of µ and λ. This still haven t solve our issue with non-convexity of set DR(µ), however, now we can simply optimize over the convex hull of DR(µ) T R(µ) 1 sup ω K inf d Conv(DR(µ)) ω Td (3) and the following lemma guarantees that the values of optimization problems (2) and (3) are the same. Lemma 4. Let ω be a vector in RK and D be a compact subset of RK then inf d D ω Td = inf d Conv(D) ω Td where Conv(D) is the convex hull of D. Now we are left a zero sum game (3) with convex and compact sets of actions for both players. This guarantees existence of Nash equilibrium with the best actions denoted by ω (µ) and d (µ). Whole purpose of seeing (3) as a zerosum game is the following lemma Lemma 5. Let ω (µ) and d (µ) are vectors for which Nash equilibrium of game (3) is attained then there exist a constant c such that d i (µ) = c, for all i [K]. Proof of Theorem 2 Now we have all necessary tools to prove Theorem 2. Note that we still assume that µ1 > µ2 µK. Let ω be the Nash equilibrium strategy. As we showed in Lemma 3, the best response λi(ω ) A ,i(µ) to ω has a particular form λi(ω ) = (t, µ2, . . . , µi 1, t, µi+1, . . . , µK) T , where t = µ1ω 1 + µiω i ω 1 + ω i . This λi(ω ) corresponds to point di(ω ) from DR,i defined as di(ω ) = (yi, 0, . . . , 0, zi, 0, . . . , 0) T for yi = (µ1 ui)2/2 and zi = (µi ui)2/2. It is important to notice that the only two non-zero elements are at positions 1 and i. The Nash equilibrium strategy d of Player 2 can be expressed as a convex combination of optimal points from DR(µ) as showed in Lemma 4. The only candidates are points di(ω ) for i {2, . . . , K} since all the other points from DR(µ) are sub-optimal. Moreover, using Lemma 5 we know that all the elements of d are equal which in particular means that d needs to be a convex combination of all K 1 vectors di(ω ) and all of them are equally good and from the definition of di(ω ) we have (ω ) Tdi(ω ) = (ω ) Tdj(ω ) for every i, j 2. This expression can be further modified to obtain ω i ω 1 = 1 + ω j ω 1 Denoting ω 2 ω 1 by c and using i = j 1 we directly obtain that the recurrence from the theorem holds for xi(c ) = ω i ω 1 . Therefore, it is enough to know c and ω 1 to reconstruct ω . Lemma 5 provides condition for d . We know that it can be written as a convex combination of K 1 vectors di(ω ). This is possible only if vector 1 of ones can be expressed as their linear combination which gives us condition We know that all the elements of the resulting vector are equal to 1 except for the first element which has a value i=2 xi(c )2 = f(c ). Therefore, this value has to be 1 too, which gives us the second part of the theorem. Now we can recover all the ratios ω i /ω 1 and we also know that the sum of all ωi is 1 which provides all the necessary ingredients for the proof of the last part of the theorem. 2.2 Best Response Oracle - Spectral Setting As we showed in Section 2.1, Lemma 3, finding the best response to ω, in the vanilla setting, is not a difficult problem since the optimization problem involves only linear constraints. The situation in the spectral setting is a little bit more complicated. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) In this part, we again assume that µ1 > µ2 µK and focus on the best response oracle λ (ω) = arg min λ AR(µ) a [K] ωa (µa λa)2 in spectral setting, where R is a finite upper bound on smoothness of µ with respect to graph G given by its Laplacian L. We first find best responses λi(ω) = arg min λ AR,i(µ) a [K] ωa (µa λa)2 with respect to convex sets AR,i(µ) and then take the best out of these K 1 vectors. First of all, notice that if the response of vanilla oracle denoted by λi (ω) has smoothness smaller than R, the response of spectral oracle should be the same. In the case where SG(λi (ω)) R we can restrict our search and look for λ with SG(λ) exactly R thanks to the following lemma. Lemma 6. Let λi (ω) be the response of vanilla oracle such that λi (ω)TLλi (ω) > R then the response of spectral oracle λi(ω) satisfies λi(ω)TLλi(ω) = R. 2.3 Best Response Oracle Implementation Now lets consider the case where vanilla oracle produces a vector that does not satisfy smoothness constraint. In this case we know, using Lemma 6, that the best response λi(ω) of spectral oracle has smoothness R. Therefore, we can use standard Lagrange multiplier method to solve this problem. a [K] ωa (µa λa)2 2 + γ (λ TLλ R) We should not forget that we still need to ensure that λ1 = λi since we want to find the solution in AR,i(µ) with best arm i. Therefore, we can simplify the notation and later calculations by the following definitions λ: created from λ by removing the first component ω: created from ω by removing the first component µ: created from µ by averaging components 1st and i-th components w.r. to ω and removing the first component µi 1 = ω1µ1 + ωiµi L: created from L by adding the first row and column to i-th row and column, adjusting diagonal element Li,i so that the sum in the i-th row is zero, removing the first row and column Using previous definitions we can define a [K 1] ωa ( µa λa)2 2 + γ( λ T L λ R) for which F( λ, γ) = F(λ, γ) + c holds, assuming that λ1 = λi. The identity can be checked by using previous definitions. The main advantage of this transformation is that now we don t have any constraint on λ. Now we can take partial derivatives with respect to λj and γ and set them equal to 0 in order to find the best λ and later reconstruct λ: λ F( λ, γ) = Ω( λ µ) + 2γ L λ = 0 , γ = λ T L λ R = 0 , where Ωis a diagonal matrix with ω on its diagonal. First expression is just a system of K 1 linear equations with parameter γ with solution λ(γ) = ( Ω+ 2γ L) 1 Ω µ . To find the best γ so that also the second expression is true, we can use for example bisection method. The last step is choosing i > 1 that minimizes F(λ, γ). Properties of Maximization Problem Before we proceed to the algorithm computing optimal ω (µ) we need the following two lemmas to be able to compute supergradients and guarantee the convergence of the algorithm. Lemma 7. Let D RK be a compact set. Then function f : K R defined as f(ω) = infd D ωTd Is a concave function and d (ω) = arg mind D ωTd is a supergradient of f at ω. Lemma 8. Let f : K R be a function such that f(ω) = inf λ AR(µ) i=1 ωik(µi, λi) . Then function f is L-Lipschitz with respect to 1 for any L max i,j [K] k(µi, µj) . 3 Best Arm Allocation Algorithm Now that we are able to compute best response λ (ω) and therefore, supergradient of concave function f(ω) = P a [K] ωak(µa, λ a(ω)) at ω, we have all the ingredients needed for any supergradient-based algorithm to compute best arm allocation ω = arg maxω K f(ω). The algorithm of our choice is mirror ascent algorithm with generalized negative entropy as the mirror map: a [K] (ωi log(ωi) ωi) . Theorem 9. Let ω1 = ( 1 K , . . . , 1 K )T and learning rate η = t . Then mirror ascent algorithm optimizing function f defined on K with generalized negative entropy Φ as the mirror map enjoys the following guarantees for any L maxi,j [K] k(µi, µj). Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) 0.02 0.04 0.06 0.08 0.10 Figure 1: Theoretical vs empirical expected stopping time as a function of smoothness parameter R with optimal value of R being 0.01. Proof. It is easy to show that Φ is 1-strongly convex w.r.t. 1 and supω K Φ(ω) Φ(ω1) log K since X ωa log ωa 1 holds for any ω K. Using Lemma 8 we have that f is LLipschitz w.r.t. 1 for any L maxi,j [K] k(µi, µj). This gives us all the necessary ingredients mirror ascent guarantees in Theorem 4.2 from [Bubeck, 2015]. 3.1 Value of Spectral Constraint Having an assumption that captures the problem might improve the learning speed significantly. We demonstrate this effect in a simplistic scenario where µ = (0.9, 0.5, 0.6)T, graph has only one edge between nodes 2 and 3, and the range of R is from 0.01 = µTLµ to 0.1. R = 0.1 is completely non-restrictive and the best response to every ω is the same as in the best arm identification problem without the spectral constraint. This can be seen on Figure 1 where we plot the value of T R(µ) (red curve) as a function of R which is proportional to the stopping time lower bound. 4 Spectral BAI Algorithm The algorithm for the best arm identification with the spectral constraint Spectral Ta S (Algorithm 1) is a variant of Track-and-Stop algorithm introduced in [Garivier and Kaufmann, 2016b]. We discuss the main ingredients of the algorithm in the next part of this section. Sampling rule. As a by-product of the lower bound analysis, Proposition 1 provides the existence of optimal sampling weights ω (µ) that need to be respected in order to reach the optimal sample complexity. Spectral Ta S simply tracks, at every timestep, some guess of these optimal proportions that is obtained by solving the sample complexity optimization problem associated with the current estimates ˆµt of the means µ. In order to capture a possibility of the initial underestimation of an arm, some level of exploration is needed and enforced by the algorithm: for every ε (0, 1/K], let ω ,ε(µ) be an L projection of ω (µ) onto ε K defined as (w1, . . . , w K) [ε, 1]K : w1 + + w K = 1 . Then the sampling rule is At+1 arg max a [K] s=0 ω ,εs a ˆµ(s) Na(t) . (4) Algorithm 1 Spectral Ta S 1: Input and initialization: 2: L : graph Laplacian 3: δ : confidence parameter 4: R : upper bound on the smoothness of µ 5: Play each arm a once and observe rewards ra 6: ˆµ1 = (r1, . . . , r K)T : empirical estimate of µ 7: while Stopping Rule (5) not satisfied do 8: Compute ω (ˆµt) by mirror ascent 9: Choose At according to Sampling Rule (4) 10: Obtain reward rt of arm At 11: Update ˆµt according to rt 12: end while 13: Output arm A = arg maxa [K] ˆµa As shown in [Garivier and Kaufmann, 2016a] (Lemma 7 and Proposition 9), the choice εs = (K2 + s) 1/2/2 ensures that the number Na(t) of draws of arm a converges almost-surely to ω a(µ) as t goes to infinity. Stopping rule. The algorithm should stop as soon as it has gathered sufficient evidence on the superiority of one of the arms with risk δ. The design of an optimal sequential testing procedure for the hypothesis µa > maxb =a µb can be traced back to [Chernoff, 1959], and is discussed in detail in [Garivier and Kaufmann, 2019]. We simply recall its form here: for two arms a, b [K], denote by ˆµa,b(t) = (Na(t)ˆµa(t) + Nb(t)ˆµb(t))/(Na(t) + Nb(t)) and by Za,b = sign ˆµa(t) ˆµb(t) Na(t)(ˆµa(t) ˆµa,b(t))2 + Nb(t)(ˆµb(t) ˆµa,b(t))2 /2 the generalized likelihood ratio statistics for the test µa > µb. Then the stopping rule is given by τ = inf n t N : max a [K] min b =a Za,b(t) > β(t, δ) o , (5) where β( , ) is a threshold function to be chosen typically slightly larger than log(1/δ). Theorem 10 in [Garivier and Kaufmann, 2016a] shows that the choice β(t, δ) = log(2t(K 1)/δ) and Aτ+1 = arg maxa [K] ˆµa(τ) yields a probability of failure Pν (Aτ+1 / a (µ)) δ. Empirical evaluation. Using the same experiment setting as in Section 3.1 we have evaluated Spectral Ta S for 10 different values of R ranging from true R = µTLµ = 0.01 to R = 0.1 and we plotted theoretical stopping time (red line) as well as average and standard deviation of 20 runs of Spectral Ta S algorithm (blue line). Figure 1 shows that the algorithm can utilize the spectral constraint and improve stopping time significantly whenever the value of R is close to the real smoothness of the problem, as suggested by theory. Acknowledgements Aur elien Garivier (ANR chaire Seq ALO) and Tom aˇs Koc ak acknowledge the support of the Project IDEXLYON of the University of Lyon, in the framework of the Programme Investissements d Avenir (ANR-16-IDEX-0005). Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) [Alon et al., 2013] Noga Alon, Nicol o Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In Neural Information Processing Systems, 2013. [Alon et al., 2017] Noga Alon, Nicol o Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graphstructured feedback. SIAM Journal on Computing, 46(6):1785 1826, 2017. [Bubeck, 2015] S ebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015. [Chernoff, 1959] Herman Chernoff. Sequential design of Experiments. The Annals of Mathematical Statistics, 30(3):755 770, 1959. [Degenne and Koolen, 2019] R emy Degenne and Wouter M. Koolen. Pure Exploration with Multiple Correct Answers. technical report, feb 2019. [Degenne et al., 2019] R emy Degenne, Wouter M. Koolen, and Pierre M enard. Non-Asymptotic Pure Exploration by Solving Games. technical report, jun 2019. [Even-Dar et al., 2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079 1105, 2006. [Gabillon et al., 2012] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Neural Information Processing Systems, 2012. [Garivier and Kaufmann, 2016a] Aur elien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory, 2016. [Garivier and Kaufmann, 2016b] Aur elien Garivier and Emilie Kaufmann. Optimal Best Arm Identification with Fixed Confidence. 29th Annual Conference on Learning Theory, 2016. [Garivier and Kaufmann, 2019] Aur elien Garivier and Emilie Kaufmann. Non-Asymptotic Sequential Tests for Overlapping Hypotheses Applied to Near Optimal Arm Identification in Bandit Models. technical report, jun 2019. [Koc ak et al., 2014] Tom aˇs Koc ak, Gergely Neu, Michal Valko, and R emi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Neural Information Processing Systems, 2014. [Koc ak et al., 2016a] Tom aˇs Koc ak, Gergely Neu, and Michal Valko. Online learning with Erdos-R enyi sideobservation graphs. In Conference on Uncertainty in Artificial Intelligence, 2016. [Koc ak et al., 2016b] Tom aˇs Koc ak, Gergely Neu, and Michal Valko. Online learning with noisy side observations. In International Conference on Artificial Intelligence and Statistics, 2016. [Koc ak et al., 2019] Tom aˇs Koc ak, R emi Munos, Branislav Kveton, Shipra Agrawal, and Michal Valko. Spectral Bandits. Journal of Machine Learning Research, 2019. [Koc ak et al., 2014] Tom aˇs Koc ak, Michal Valko, R emi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the National Conference on Artificial Intelligence, 2014. [Lattimore and Szepesv ari, 2019] Tor Lattimore and Csaba Szepesv ari. Bandit Algorithms. Cambridge University Press, 2019. [Mannor and Shamir, 2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of sideobservations. In Neural Information Processing Systems, 2011. [Russo, 2016] Daniel Russo. Simple bayesian algorithms for best arm identification. In Proceedings of the 29th Conference On Learning Theory, 2016. [Valko et al., 2014] Michal Valko, R emi Munos, Branislav Kveton, and Tom aˇs Koc ak. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, 2014. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)