# best_arm_identification_in_spectral_bandits__49d9ab6c.pdf

Best Arm Identiﬁcation in Spectral Bandits

Tom aˇs Koc ak , Aur elien Garivier Ecole Normale Sup erieur de Lyon tomas.kocak@gmail.com, aurelien.garivier@ens-lyon.fr

We study best-arm identiﬁcation with ﬁxed conﬁdence in bandit models with graph smoothness constraint. We provide and analyze an efﬁcient gradient ascent algorithm to compute the sample complexity of this problem as a solution of a non-smooth max-min problem (providing in passing a simpliﬁed analysis for the unconstrained case). Building on this algorithm, we propose an asymptotically optimal strategy. We furthermore illustrate by numerical experiments both the strategy s efﬁciency and the impact of the smoothness constraint on the sample complexity. Best Arm Identiﬁcation (BAI) is an important challenge in many applications ranging from parameter tuning to clinical trials. It is now very well understood in vanilla bandit models, but real-world problems typically involve some dependency between arms that requires more involved models. Assuming a graph structure on the arms is an elegant practical way to encompass this phenomenon, but this had been done so far only for regret minimization. Addressing BAI with graph constraints involves delicate optimization problems for which the present paper offers a solution.

1 Introduction This work is devoted to the optimization of a stochastic function on a structured, discrete domain A = [K] {1, . . . , K}. We consider the noisy black-box model: a call to the function at point a A yields an independent draw of an unknown distribution νa F with mean µa, where F is some family of probability laws. At each time step t N, a point At A can be chosen (based on past observations) and the corresponding random outcome Yt of law νAt is observed. The signal µ (µ1, . . . , µK)T is assumed to be smooth in the following sense: A is equipped with a ﬁxed and weighted graph structure G with adjacency matrix W = (wa,b)a,b A, and µ satisﬁes the graph smoothness property:

a,b A wa,b (µa µb)2

2 = µ TLµ = µ 2 L R

Contact Author

for some (known) smoothness parameter R, where L is the graph Laplacian deﬁned as: La, b = wa, b for a = b and La, a = P b =a wa, b. This enforces values of means (µa)a at two points a, b A to be close to each another if the weight wa,b is large. Following a classical framework in statistical testing, a risk parameter δ is ﬁxed, and the goal is to design an algorithm that successfully maximizes µ on A as fast as possible, with failure probability at most δ. More speciﬁcally, the algorithm consists of a sampling rule (ψt)t 1 that chooses to observe At = ψt(A1, Y1, . . . , At 1, Yt 1) at step t, and a stopping rule τ such that whatever the distributions ν = (νa)a A FK, satisﬁes the constraint

Pν Aτ+1 a (µ) arg max i [K] µi 1 δ .

Among all such algorithms, the goal is to minimize the sample complexity Eµ[τ]. This problem is known in the multi-armed bandit literature as best-arm identiﬁcation with ﬁxed conﬁdence [Even-Dar et al., 2006; Gabillon et al., 2012]. In that context, points of the domain A are called arms and the set of distributions ν is called a model. Bandit models have raised strong interest in the past years, as multi-armed bandits were gaining popularity thanks to their ability to model a variety of real-world scenarios while providing strong theoretical guarantees for the proposed algorithms. For an introduction to the theory of bandit models and a recent list of references, see [Lattimore and Szepesv ari, 2019] and references therein. Best-arm identiﬁcation, in particular, has many implications in artiﬁcial intelligence and data science. For example, in pharmacology, the optimal drug dosage depends on the drug formulation but also on the targeted individuals (genetic characters, age, sex) and environmental interactions. In the industrial context, the shelf life and performance of a manufacturing device depend on its design, but also on its environment and on uncertainties during the manufacturing process. As for targeted recommender systems, their efﬁciency depends on the ability to understand users preferences despite the high variability of choices. Lastly, the performance of a deep neural network depends on its (hyper-)parameters and on the quality of the training set. This short list is of course not comprehensive, but the automation of these representative optimization tasks is one of the great challenges of modern artiﬁcial in-

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

telligence in the ﬁelds of personalized medicine, predictive maintenance, online marketing, and autonomous learning. The problem of best-arm identiﬁcation has recently received a precise solution for the vanilla version of the multiarmed bandit problem, where the learner gains information only about the arm played. [Garivier and Kaufmann, 2016a; 2019] (see also [Russo, 2016]) have described the information-theoretic barriers of the sample complexity in the form an instance-optimal lower bound on Eµ[τ]; furthermore, they have proposed a strategy, called Track-and-Stop, that asymptotically reaches this lower bound. However, the vanilla bandit model is extremely limiting in many applications, and some additional information has to be incorporated into the learning process. Recently, several papers studied multi-armed bandit problems with the ability to capture the rich structure of real-world problems. In particular, [Degenne and Koolen, 2019; Degenne et al., 2019] have extended the results mentioned above to classes of possibly structured bandits with possibly different goals. In parallel, a fruitful approach to handle large and organized domains is assume the existence of a graph structure on top of arms which provides additional information. There are two most dominant approaches to use this graph structure. The ﬁrst approach is to use graphs to encode additional information about other arms; playing an arm reveals (fully or partially) the rewards of all the neighbors [Mannor and Shamir, 2011; Alon et al., 2013; Koc ak et al., 2014; Alon et al., 2017; Koc ak et al., 2016b; 2016a]. The second approach is to encode the similarities between arms by the weights of the edges; the rewards of connected arms are similar [Valko et al., 2014; Koc ak et al., 2014; Koc ak et al., 2019]. This second approach, which we adopt in this paper, easily permits to capture a lot of information about the problem by the creation of a similarity graph, which explains its wide use in signal processing, manifold, and semi-supervised learning. However, most of the work in multi-armed bandits with structure has been done so far under the regret minimization objective, while best arm identiﬁcation (a practically crucial objective in auto ML, clinical trials, A/B testing, etc.) was mostly neglected.

1.1 Our Contributions

In this paper, we aim to ﬁll this gap by combining the best arm identiﬁcation work of [Garivier and Kaufmann, 2016a] with the spectral setting of [Valko et al., 2014], resulting in spectral best arm identiﬁcation problem. This setting captures a rich variety of real-world problems that have been neglected so far. We also analyze this setting and provide instance-optimal bounds on the sample complexity. Inspired by [Degenne and Koolen, 2019], we use a game-theoretical point of view on the problem which enables us to obtain two main contributions of this paper:

Contribution 1: simpliﬁed and enlightening gametheoretic analysis for the unconstrained case (case without the spectral constraint) of [Garivier and Kaufmann, 2016a] that brings new insights to the problem which enable us to extend the analysis to the constrained case.

Contribution 2: building upon the unconstrained case, we provide and analyze an algorithm that computes the sample complexity as well as the optimal arm pulling proportions in the constrained case. We use this algorithm to propose an asymptotically optimal strategy for ﬁxed-conﬁdence best-arm identiﬁcation and we provide numerical experiments supporting the theoretical results and showing the interest of the graph constraint for reducing the sample complexity.

1.2 Assumptions A standard assumption on the family F of considered probability laws is that they form an exponential family of distributions parametrized by theirs means so that a bandit problem is fully described by the vector µ (µi, . . . , µK)T. To avoid technicalities and to emphasize the methodological contributions of this paper (ie. how to perform optimal bestarm identiﬁcation with graph regularity constraints), we focus here on the case of Gaussian distributions with variance 1. For the same reason, we assume here that the vector µ has a unique maximum also denoted by a (µ). We furthermore denote by MR = {λ RK : λ TLλ R} the constrained set of considered signals (bandit problems), and by µ maxi [K] µi = µa (µ) the maximum of the signal µ (we identify the best arm with s singleton a (µ)).

1.3 Paper Structure The paper is organized as follows: Section 2 discusses the information-theoretic lower bound and a simple algorithm for the unconstrained case, previously analyzed by [Garivier and Kaufmann, 2016a], with focus on new techniques that considerably simplify the analysis. The constrained case requires more care: we propose in Section 3 a mirror gradient ascent algorithm to compute optimal arm allocation and show non-asymptotic bounds of convergence for this algorithm. Using this procedure permits us in Section 4 to present the Spectral Ta S algorithm as an extension of the Track-and Stop devoted to best-arm identiﬁcation with graph regularity constraint, that reaches the instance-optimal sample complexity. We illustrate its performance and the impact of the smoothness constraint on the sample complexity by reporting some numerical experiments.

2 Sample Complexity

The general information-theoretical lower bound analysis provided by [Garivier and Kaufmann, 2016b] applies to any set of bandit problems, in particular to the set of all the problems with smoothness bounded from above by R.

Proposition 1. For any δ-correct strategy and any bandit problem µ Eµ[Tδ] T R(µ)k(δ, 1 δ)

T R(µ) 1 sup ω K inf λ AR(µ)

a [K] ωak(µa, λa) (1)

for K being K-dimensional simplex, k(µa, λa) KLdivergence between distributions parametrized by µa and λa,

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

and for the set AR(µ) of bandit problems with different best arm than µ, deﬁned as

AR,i(µ) λ MR : λi λa (µ)

AR(µ) i =a (µ)AR,i(µ).

Indeed, even if the constrained case is not explicitly covered there, this result is easily shown by following the lines of the proof of Theorem 1 in [Garivier and Kaufmann, 2016b], with AR(µ) as the set of alternatives to µ. This lower bound proves to play an important role in designing algorithms for the best arm identiﬁcation since, roughly speaking, the time needed to distinguish between bandit problems µ and λ, if the number of pulls of arm i is proportional to ωi, scales with inverse of P a [K] ωak(µa, λa). Therefore, the minimization part of problem (1) selects the alternative problem which is the most difﬁcult to distinguish from µ while playing according to ω while, on the other hand, the maximization part of problem (1) chooses ω such that the expected stopping time is as small as possible, even in the worst case scenario chosen by the minimization part of the problem. Thus, a sampling strategy playing according to ω (µ) that maximizes expression (1) is optimal in the worst case, and having a procedure that computes optimal ω (µ) enables to design a best arm identiﬁcation algorithm with optimal sample complexity.

In this section, we provide an algorithm to compute this sample complexity T R(µ) as well as the optimal proportion ω (µ) of the arm allocation for the learner. We proceed in three steps. Step 1: we introduce the best response oracle that computes the best response λ (ω) to ﬁxed ω by solving the minimization part of problem (1) for chosen ω:

λ (ω) arg min λ AR(µ)

a [K] ωak(µa, λa) .

Step 2: we show that substituting the minimization part of problem (1) by λ (ω), the resulting maximization problem is concave with respect to ω. Moreover, supergradient for this problem is computable using the same best response oracle we used for the minimization part of the problem. Step 3: using supergradient we can apply any supergradient ascent algorithm to ﬁnd optimal arm allocation ω . The algorithm of our choice is mirror ascent with generalized negative entropy as a mirror map. We chose this algorithm due to its strong guarantees on convergence rate in simplex setup however, even basic gradient-based algorithms could be applied. The choice of mirror ascent algorithm will be apparent later in Section 3, we show that this algorithm enjoys convergence rate proportional to log K instead of

K of the basic supergradient ascent algorithm.

As brieﬂy mentioned in the introduction, for the sake of simplicity of presentation, we assume that the distributions νµa are Gaussian. This simpliﬁes some of the calculations while our bounds remain true even in the case of sub Gaussian random variables, including a wide class of distri-

butions among which any distributions with bounded support or Bernoulli distribution. Even though using a general exponential family of probability distributions uses similar techniques, all the proofs are more technical and rather out of the scope of a conference paper. With assumption that the distributions associated with arms are Gaussian with normalized variance, KL-divergence of νµa and νλa can be expressed as

k(µa, λa) = (µa λa)2

2.1 Best Arm Identiﬁcation without Spectral Constraint This section is dedicated to the vanilla setting of [Garivier and Kaufmann, 2016a]. Even though this problem can be seen as a special case of spectral setting either (for the edgeless graph or for smoothness parameter R being ), we have decided to analyze it separately. The reason is that we can demonstrate techniques later used in the constrained case while proving the main result of [Garivier and Kaufmann, 2016a] with more insightful arguments that are signiﬁcantly more elegant. Also, not having the spectral constraint allows us to have best response oracle with a closed-form solution which implies a simple way of ﬁnding optimal arm allocation ω . This is very different from the spectral setting where best response oracle does not have a closed-form solution and therefore, we can not avoid a numerical procedure to ﬁnd ω . Theorem 2. Without loss of generality, assume that µ1 > µ2 µK and deﬁne {xa(c)}K a=2, for some positive constant c, recursively as

xa 1(c) = (1 + xa(c)) µ1 µa 1

for 2 < a K. Let f(c) be a function with parameter c deﬁned as

Then there exist c R+ s.t. f(c ) = 1 and we obtain ω (µ) as

1 + PK a=2 xa(c ) ω a(µ) = xa(c )ω 1(µ)

Remark. Since µ1 > µa 1 µa, we see that µ1 µa 1

which in consequence means that. If xa(c) is negative, xa 1(c) is negative as well. If xa(c) is positive, xa 1(c) xa(c). Thus, f(0) 0. Moreover, f is an increasing and continuous function with limc f(c) = . Therefore, the existence of c is guaranteed. In the rest of this section, we build the necessary tools to provide the proof of Theorem 2.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

Best Response Oracle - Vanilla Setting In the case of smoothness R = , the spectral constraint is always satisﬁed which provides a very simple closed form solution for the best response oracle given by the following lemma. Lemma 3. Let ω be a ﬁxed vector from K. Then the best response λi(ω) A ,i(µ) with the best arm i is

λi(ω) = (t, µ2, . . . , µi 1, t, µi+1, . . . , µK) T

for t being a weighted average of µ1 and µi with weights ω1 and ωi t = µ1ω1 + µiωi

The proof of this Lemma can be obtained by simple calculus.

Game-Theoretical Point of View Our optimization problem (1) can be seen as a zero sum game where Player 1 plays a vector from simplex K and Player 2 plays point λ from AR(µ). We would like to have guarantees on the existence of Nash equilibrium however AR(µ) is not a convex set which poses a problem. Also, we can not directly convexify the problem since it would change the value of T R(µ) 1. To get around this obstacle, we deﬁne

DR,i(µ) {(k(µ1, λ1), . . . , k(µK, λK)) T : λ AR,i(µ)} ,

DR(µ) i =a (µ)DR,i(µ) .

This enables us to rewrite the optimization problem in the following form

T R(µ) 1 sup ω K inf d DR(µ) ω Td (2)

where Player 2, instead of playing λ, plays vectors of divergences (k(µ1, λ1), . . . , k(µK, λK))T where the element on position i of the vector is divergence k(µi, λi) of the distributions corresponding to i-th elements of µ and λ. This still haven t solve our issue with non-convexity of set DR(µ), however, now we can simply optimize over the convex hull of DR(µ)

T R(µ) 1 sup ω K inf d Conv(DR(µ)) ω Td (3)

and the following lemma guarantees that the values of optimization problems (2) and (3) are the same. Lemma 4. Let ω be a vector in RK and D be a compact subset of RK then

inf d D ω Td = inf d Conv(D) ω Td

where Conv(D) is the convex hull of D. Now we are left a zero sum game (3) with convex and compact sets of actions for both players. This guarantees existence of Nash equilibrium with the best actions denoted by ω (µ) and d (µ). Whole purpose of seeing (3) as a zerosum game is the following lemma Lemma 5. Let ω (µ) and d (µ) are vectors for which Nash equilibrium of game (3) is attained then there exist a constant c such that

d i (µ) = c, for all i [K].

Proof of Theorem 2 Now we have all necessary tools to prove Theorem 2. Note that we still assume that µ1 > µ2 µK. Let ω be the Nash equilibrium strategy. As we showed in Lemma 3, the best response λi(ω ) A ,i(µ) to ω has a particular form

λi(ω ) = (t, µ2, . . . , µi 1, t, µi+1, . . . , µK) T ,

where t = µ1ω 1 + µiω i ω 1 + ω i . This λi(ω ) corresponds to point

di(ω ) from DR,i deﬁned as

di(ω ) = (yi, 0, . . . , 0, zi, 0, . . . , 0) T

for yi = (µ1 ui)2/2 and zi = (µi ui)2/2. It is important to notice that the only two non-zero elements are at positions 1 and i. The Nash equilibrium strategy d of Player 2 can be expressed as a convex combination of optimal points from DR(µ) as showed in Lemma 4. The only candidates are points di(ω ) for i {2, . . . , K} since all the other points from DR(µ) are sub-optimal. Moreover, using Lemma 5 we know that all the elements of d are equal which in particular means that d needs to be a convex combination of all K 1 vectors di(ω ) and all of them are equally good and from the deﬁnition of di(ω ) we have

(ω ) Tdi(ω ) = (ω ) Tdj(ω )

for every i, j 2. This expression can be further modiﬁed to obtain ω i ω 1 = 1 + ω j ω 1

Denoting ω 2 ω 1 by c and using i = j 1 we directly obtain

that the recurrence from the theorem holds for xi(c ) = ω i ω 1 . Therefore, it is enough to know c and ω 1 to reconstruct ω . Lemma 5 provides condition for d . We know that it can be written as a convex combination of K 1 vectors di(ω ). This is possible only if vector 1 of ones can be expressed as their linear combination which gives us condition

We know that all the elements of the resulting vector are equal to 1 except for the ﬁrst element which has a value

i=2 xi(c )2 = f(c ).

Therefore, this value has to be 1 too, which gives us the second part of the theorem. Now we can recover all the ratios ω i /ω 1 and we also know that the sum of all ωi is 1 which provides all the necessary ingredients for the proof of the last part of the theorem.

2.2 Best Response Oracle - Spectral Setting As we showed in Section 2.1, Lemma 3, ﬁnding the best response to ω, in the vanilla setting, is not a difﬁcult problem since the optimization problem involves only linear constraints. The situation in the spectral setting is a little bit more complicated.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

In this part, we again assume that µ1 > µ2 µK and focus on the best response oracle

λ (ω) = arg min λ AR(µ)

a [K] ωa (µa λa)2

in spectral setting, where R is a ﬁnite upper bound on smoothness of µ with respect to graph G given by its Laplacian L. We ﬁrst ﬁnd best responses

λi(ω) = arg min λ AR,i(µ)

a [K] ωa (µa λa)2

with respect to convex sets AR,i(µ) and then take the best out of these K 1 vectors. First of all, notice that if the response of vanilla oracle denoted by λi (ω) has smoothness smaller than R, the response of spectral oracle should be the same. In the case where SG(λi (ω)) R we can restrict our search and look for λ with SG(λ) exactly R thanks to the following lemma. Lemma 6. Let λi (ω) be the response of vanilla oracle such that λi (ω)TLλi (ω) > R then the response of spectral oracle λi(ω) satisﬁes λi(ω)TLλi(ω) = R.

2.3 Best Response Oracle Implementation Now lets consider the case where vanilla oracle produces a vector that does not satisfy smoothness constraint. In this case we know, using Lemma 6, that the best response λi(ω) of spectral oracle has smoothness R. Therefore, we can use standard Lagrange multiplier method to solve this problem.

a [K] ωa (µa λa)2

2 + γ (λ TLλ R)

We should not forget that we still need to ensure that λ1 = λi since we want to ﬁnd the solution in AR,i(µ) with best arm i. Therefore, we can simplify the notation and later calculations by the following deﬁnitions

λ: created from λ by removing the ﬁrst component ω: created from ω by removing the ﬁrst component µ: created from µ by averaging components 1st and i-th components w.r. to ω and removing the ﬁrst component

µi 1 = ω1µ1 + ωiµi

L: created from L by adding the ﬁrst row and column to i-th row and column, adjusting diagonal element Li,i so that the sum in the i-th row is zero, removing the ﬁrst row and column Using previous deﬁnitions we can deﬁne

a [K 1] ωa ( µa λa)2

2 + γ( λ T L λ R)

for which F( λ, γ) = F(λ, γ) + c holds, assuming that λ1 = λi. The identity can be checked by using previous deﬁnitions. The main advantage of this transformation is that

now we don t have any constraint on λ. Now we can take partial derivatives with respect to λj and γ and set them equal to 0 in order to ﬁnd the best λ and later reconstruct λ:

λ F( λ, γ) = Ω( λ µ) + 2γ L λ = 0 ,

γ = λ T L λ R = 0 ,

where Ωis a diagonal matrix with ω on its diagonal. First expression is just a system of K 1 linear equations with parameter γ with solution

λ(γ) = ( Ω+ 2γ L) 1 Ω µ .

To ﬁnd the best γ so that also the second expression is true, we can use for example bisection method. The last step is choosing i > 1 that minimizes F(λ, γ).

Properties of Maximization Problem Before we proceed to the algorithm computing optimal ω (µ) we need the following two lemmas to be able to compute supergradients and guarantee the convergence of the algorithm.

Lemma 7. Let D RK be a compact set. Then function f : K R deﬁned as f(ω) = infd D ωTd Is a concave function and d (ω) = arg mind D ωTd is a supergradient of f at ω.

Lemma 8. Let f : K R be a function such that

f(ω) = inf λ AR(µ)

i=1 ωik(µi, λi) .

Then function f is L-Lipschitz with respect to 1 for any

L max i,j [K] k(µi, µj) .

3 Best Arm Allocation Algorithm

Now that we are able to compute best response λ (ω) and therefore, supergradient of concave function f(ω) = P a [K] ωak(µa, λ a(ω)) at ω, we have all the ingredients needed for any supergradient-based algorithm to compute best arm allocation ω = arg maxω K f(ω). The algorithm of our choice is mirror ascent algorithm with generalized negative entropy as the mirror map:

a [K] (ωi log(ωi) ωi) .

Theorem 9. Let ω1 = ( 1

K , . . . , 1

K )T and learning rate η =

t . Then mirror ascent algorithm optimizing function f deﬁned on K with generalized negative entropy Φ as the mirror map enjoys the following guarantees

for any L maxi,j [K] k(µi, µj).

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

0.02 0.04 0.06 0.08 0.10

Figure 1: Theoretical vs empirical expected stopping time as a function of smoothness parameter R with optimal value of R being 0.01.

Proof. It is easy to show that Φ is 1-strongly convex w.r.t. 1 and supω K Φ(ω) Φ(ω1) log K since X

ωa log ωa 1

holds for any ω K. Using Lemma 8 we have that f is LLipschitz w.r.t. 1 for any L maxi,j [K] k(µi, µj). This gives us all the necessary ingredients mirror ascent guarantees in Theorem 4.2 from [Bubeck, 2015].

3.1 Value of Spectral Constraint Having an assumption that captures the problem might improve the learning speed signiﬁcantly. We demonstrate this effect in a simplistic scenario where µ = (0.9, 0.5, 0.6)T, graph has only one edge between nodes 2 and 3, and the range of R is from 0.01 = µTLµ to 0.1. R = 0.1 is completely non-restrictive and the best response to every ω is the same as in the best arm identiﬁcation problem without the spectral constraint. This can be seen on Figure 1 where we plot the value of T R(µ) (red curve) as a function of R which is proportional to the stopping time lower bound.

4 Spectral BAI Algorithm The algorithm for the best arm identiﬁcation with the spectral constraint Spectral Ta S (Algorithm 1) is a variant of Track-and-Stop algorithm introduced in [Garivier and Kaufmann, 2016b]. We discuss the main ingredients of the algorithm in the next part of this section. Sampling rule. As a by-product of the lower bound analysis, Proposition 1 provides the existence of optimal sampling weights ω (µ) that need to be respected in order to reach the optimal sample complexity. Spectral Ta S simply tracks, at every timestep, some guess of these optimal proportions that is obtained by solving the sample complexity optimization problem associated with the current estimates ˆµt of the means µ. In order to capture a possibility of the initial underestimation of an arm, some level of exploration is needed and enforced by the algorithm: for every ε (0, 1/K], let ω ,ε(µ) be an L projection of ω (µ) onto ε K deﬁned as (w1, . . . , w K) [ε, 1]K : w1 + + w K = 1 . Then the sampling rule is

At+1 arg max a [K]

s=0 ω ,εs a ˆµ(s) Na(t) . (4)

Algorithm 1 Spectral Ta S

1: Input and initialization: 2: L : graph Laplacian 3: δ : conﬁdence parameter 4: R : upper bound on the smoothness of µ 5: Play each arm a once and observe rewards ra 6: ˆµ1 = (r1, . . . , r K)T : empirical estimate of µ 7: while Stopping Rule (5) not satisﬁed do 8: Compute ω (ˆµt) by mirror ascent 9: Choose At according to Sampling Rule (4) 10: Obtain reward rt of arm At 11: Update ˆµt according to rt 12: end while 13: Output arm A = arg maxa [K] ˆµa

As shown in [Garivier and Kaufmann, 2016a] (Lemma 7 and Proposition 9), the choice εs = (K2 + s) 1/2/2 ensures that the number Na(t) of draws of arm a converges almost-surely to ω a(µ) as t goes to inﬁnity. Stopping rule. The algorithm should stop as soon as it has gathered sufﬁcient evidence on the superiority of one of the arms with risk δ. The design of an optimal sequential testing procedure for the hypothesis µa > maxb =a µb can be traced back to [Chernoff, 1959], and is discussed in detail in [Garivier and Kaufmann, 2019]. We simply recall its form here: for two arms a, b [K], denote by ˆµa,b(t) = (Na(t)ˆµa(t) + Nb(t)ˆµb(t))/(Na(t) + Nb(t)) and by Za,b = sign ˆµa(t) ˆµb(t) Na(t)(ˆµa(t) ˆµa,b(t))2 + Nb(t)(ˆµb(t) ˆµa,b(t))2 /2 the generalized likelihood ratio statistics for the test µa > µb. Then the stopping rule is given by

τ = inf n t N : max a [K] min b =a Za,b(t) > β(t, δ) o , (5)

where β( , ) is a threshold function to be chosen typically slightly larger than log(1/δ). Theorem 10 in [Garivier and Kaufmann, 2016a] shows that the choice β(t, δ) = log(2t(K 1)/δ) and Aτ+1 = arg maxa [K] ˆµa(τ) yields a probability of failure Pν (Aτ+1 / a (µ)) δ. Empirical evaluation. Using the same experiment setting as in Section 3.1 we have evaluated Spectral Ta S for 10 different values of R ranging from true R = µTLµ = 0.01 to R = 0.1 and we plotted theoretical stopping time (red line) as well as average and standard deviation of 20 runs of Spectral Ta S algorithm (blue line). Figure 1 shows that the algorithm can utilize the spectral constraint and improve stopping time signiﬁcantly whenever the value of R is close to the real smoothness of the problem, as suggested by theory.

Acknowledgements Aur elien Garivier (ANR chaire Seq ALO) and Tom aˇs Koc ak acknowledge the support of the Project IDEXLYON of the University of Lyon, in the framework of the Programme Investissements d Avenir (ANR-16-IDEX-0005).

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

[Alon et al., 2013] Noga Alon, Nicol o Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In Neural Information Processing Systems, 2013.

[Alon et al., 2017] Noga Alon, Nicol o Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graphstructured feedback. SIAM Journal on Computing, 46(6):1785 1826, 2017.

[Bubeck, 2015] S ebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 357, 2015.

[Chernoff, 1959] Herman Chernoff. Sequential design of Experiments. The Annals of Mathematical Statistics, 30(3):755 770, 1959.

[Degenne and Koolen, 2019] R emy Degenne and Wouter M. Koolen. Pure Exploration with Multiple Correct Answers. technical report, feb 2019.

[Degenne et al., 2019] R emy Degenne, Wouter M. Koolen, and Pierre M enard. Non-Asymptotic Pure Exploration by Solving Games. technical report, jun 2019.

[Even-Dar et al., 2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079 1105, 2006.

[Gabillon et al., 2012] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identiﬁcation: A uniﬁed approach to ﬁxed budget and ﬁxed conﬁdence. In Neural Information Processing Systems, 2012.

[Garivier and Kaufmann, 2016a] Aur elien Garivier and Emilie Kaufmann. Optimal best arm identiﬁcation with ﬁxed conﬁdence. In Proceedings of the 29th Conference On Learning Theory, 2016.

[Garivier and Kaufmann, 2016b] Aur elien Garivier and Emilie Kaufmann. Optimal Best Arm Identiﬁcation with Fixed Conﬁdence. 29th Annual Conference on Learning Theory, 2016.

[Garivier and Kaufmann, 2019] Aur elien Garivier and Emilie Kaufmann. Non-Asymptotic Sequential Tests for Overlapping Hypotheses Applied to Near Optimal Arm Identiﬁcation in Bandit Models. technical report, jun 2019.

[Koc ak et al., 2014] Tom aˇs Koc ak, Gergely Neu, Michal Valko, and R emi Munos. Efﬁcient learning by implicit exploration in bandit problems with side observations. In Neural Information Processing Systems, 2014.

[Koc ak et al., 2016a] Tom aˇs Koc ak, Gergely Neu, and Michal Valko. Online learning with Erdos-R enyi sideobservation graphs. In Conference on Uncertainty in Artiﬁcial Intelligence, 2016.

[Koc ak et al., 2016b] Tom aˇs Koc ak, Gergely Neu, and Michal Valko. Online learning with noisy side observations. In International Conference on Artiﬁcial Intelligence and Statistics, 2016. [Koc ak et al., 2019] Tom aˇs Koc ak, R emi Munos, Branislav Kveton, Shipra Agrawal, and Michal Valko. Spectral Bandits. Journal of Machine Learning Research, 2019. [Koc ak et al., 2014] Tom aˇs Koc ak, Michal Valko, R emi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the National Conference on Artiﬁcial Intelligence, 2014. [Lattimore and Szepesv ari, 2019] Tor Lattimore and Csaba Szepesv ari. Bandit Algorithms. Cambridge University Press, 2019. [Mannor and Shamir, 2011] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of sideobservations. In Neural Information Processing Systems, 2011. [Russo, 2016] Daniel Russo. Simple bayesian algorithms for best arm identiﬁcation. In Proceedings of the 29th Conference On Learning Theory, 2016. [Valko et al., 2014] Michal Valko, R emi Munos, Branislav Kveton, and Tom aˇs Koc ak. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, 2014.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)