# Epsilon Best Arm Identification in Spectral Bandits

Tomáš Kocák and Aurélien Garivier
Unité de Mathématiques Pures et Appliquées et Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, Université de Lyon
tomas.kocak@gmail.com, aurelien.garivier@ens-lyon.fr

We propose an analysis of Probably Approximately Correct (PAC) identification of an ε-best arm in graph bandit models with Gaussian distributions. We consider finite but potentially very large bandit models where the set of arms is endowed with a graph structure, and we assume that the arm expectations $\mu$ are smooth with respect to this graph. Our goal is to identify an arm whose expectation is at most ε below the largest of all means. We focus on the fixed-confidence setting: given a risk parameter δ, we consider sequential strategies that yield an ε-optimal arm with probability at least $1 - \delta$. All such strategies use at least $T_{R,\varepsilon}(\mu)\log(1/\delta)$ samples, where $R$ is the smoothness parameter. We identify the complexity term $T_{R,\varepsilon}(\mu)$ as the solution of a min-max problem for which we give a game-theoretic analysis and an approximation procedure. This procedure is the key element required by the asymptotically optimal Track-and-Stop strategy.

## 1 Introduction

A bandit model (see [Lattimore and Szepesvári, 2019] and references therein) is a set of probability distributions $\nu = \{\nu_a : a \in \mathcal{A}\}$. These distributions are called arms, and the statistician can sample one of them at each time step $t \ge 1$. Best-arm identification (BAI) consists of using those samples to find which arm has the highest expectation $\mu_a = \mathbb{E}(\nu_a)$, while ε-BAI aims at identifying an arm $a$ such that $\mu_a \ge \max_b \mu_b - \varepsilon$. A fixed-confidence algorithm for a given risk δ consists of a sampling rule $A_t$, which chooses, based on past observations, which arm is sampled at each time step $t$, and of a stopping rule $\tau$ (a stopping time); it is called δ-correct if $A_{\tau+1}$ is ε-optimal with probability at least $1 - \delta$. The efficiency of such an algorithm is measured by the mean number $\mathbb{E}_\nu[\tau]$ of samples needed.

Since the work of [Mannor and Tsitsiklis, 2004] and [Even-Dar et al., 2006], best-arm identification has received considerable interest. It has been proved that good strategies require no more than $A(\nu) + B(\nu)\log(1/\delta)$ samples, for some functions $A$ and $B$ of the model that were progressively improved. While, for example, [Karnin et al., 2013] investigated the term $A$, other authors emphasized that $B$ is the dominant term when δ is small and focused on obtaining the best possible term $B$. For BAI (with ε = 0), the latter was identified by [Garivier and Kaufmann, 2016] and [Russo, 2016]. The first of those articles provided a generic analysis that reduces BAI to the identification of an information-theoretic complexity term, which gives at the same time a lower bound on the performance of any algorithm and a key ingredient of an asymptotically optimal strategy called Track-and-Stop. This term appears as the solution of a min-max optimization program, for which an ad-hoc solution was given in the aforementioned article. While these first works were limited to the δ-correct identification of the best arm (which was assumed to exist and be unique), [Garivier and Kaufmann, 2019] proposed an extension to the problem of identifying ε-optimal arms. Simultaneously, [Degenne and Koolen, 2019] leveraged the game-theoretic nature of the complexity term to encompass even more general objectives.
In parallel with this progress, vanilla bandit models have shown their limitations in settings (such as recommendation systems) where the number of arms is huge. It is then not uncommon that the set of arms is naturally endowed with some structure. One simple way to take this structure into account is to assume the existence of some notion of similarity: some arms are close to one another, in the sense that their outcomes are expected to have similar distributions. Graph bandits provide a theoretical framework for this setting: the set of arms is equipped with a graph structure where the weight $w_{a,b}$ of the link between arm $a$ and arm $b$ measures their similarity. The vector of means $\mu = (\mu_a : 1 \le a \le K)$ is assumed to be smooth with respect to this graph in the sense that

$$\sum_{a,b \in \mathcal{A}} \frac{w_{a,b}\,(\mu_a - \mu_b)^2}{2} = \mu^\mathsf{T} L \mu \le R, \qquad (1)$$

where $L$ denotes the graph's Laplacian and $R$ is some known upper bound. This means that two arms connected by an edge with significant weight should have similar expectations. Recently, [Kocák and Garivier, 2020] showed how to optimally and δ-correctly identify the best arm in a graph bandit model, provided it exists. But the consideration of very large sets of arms, and the fact that many of them might be close to optimal, suggests that it is often more relevant to identify (more quickly) any ε-optimal arm instead of the unique best one.

In the present work, we address the problem of PAC identification of an ε-best arm in graph bandit models, hence encompassing all the aforementioned difficulties. We focus on finite bandit models with $K$ arms: $\mathcal{A} = \{1, \dots, K\}$. In order to avoid technicalities, we consider Gaussian arms $\nu_a = \mathcal{N}(\mu_a, 1)$ and identify the bandit problem with the vector $\mu = (\mu_1, \dots, \mu_K)$ of arm expectations. Following the approach of [Garivier and Kaufmann, 2016], we start in Section 2 by identifying the complexity of the problem as the optimal ratio $\mathbb{E}_\nu[\tau]/\log(1/\delta)$ of any (ε, δ)-PAC algorithm when $\delta \to 0$. This complexity term appears to be the solution of an interesting max-min program. Taking advantage of its game-theoretic structure, we propose a solution that is radically different from [Garivier and Kaufmann, 2019] even in the case $R = \infty$, which we treat separately in Section 3. The structured case $R < \infty$, which is the main contribution of this paper, is treated in Section 4.

## 2 The Complexity of ε-BAI for Graph Bandits

### 2.1 Characteristic Time

Several factors contribute to the complexity of identifying one of the ε-best arms in a bandit problem. The difficulty is captured by a quantity $T_{R,\varepsilon}(\mu)$ called the characteristic time.

**Definition 1.** The characteristic time $T_{R,\varepsilon}(\mu)$ is defined by

$$T_{R,\varepsilon}(\mu)^{-1} \;\triangleq\; \max_{\omega \in \Delta_K}\; \max_{i \in \mathcal{A}_\varepsilon(\mu)}\; \min_{j \neq i}\; \inf_{\lambda \in \mathcal{M}^{i,j}_{R,\varepsilon}} \;\sum_{a \in [K]} \omega_a \frac{(\mu_a - \lambda_a)^2}{2}, \qquad (2)$$

where $\Delta_K$ is the $K$-dimensional simplex, $\mathcal{A}_\varepsilon(\mu)$ is the set of all ε-best arms of the bandit problem $\mu$,

$$\mathcal{A}_\varepsilon(\mu) \triangleq \Big\{a \in \mathcal{A} : \mu_a \ge \max_{b \in \mathcal{A}} \mu_b - \varepsilon\Big\},$$

and $\mathcal{M}^{i,j}_{R,\varepsilon}$ is the set of bandit problems with smoothness at most $R$ in which arm $j$ is better than arm $i$ by a margin of at least $\varepsilon$:

$$\mathcal{M}^{i,j}_{R,\varepsilon} \triangleq \big\{\lambda \in \mathbb{R}^K : \lambda^\mathsf{T} L \lambda \le R,\; \lambda_i \le \lambda_j - \varepsilon\big\}. \qquad (3)$$

The constraint $\lambda^\mathsf{T} L \lambda \le R$ is often called a spectral constraint, hence the name spectral bandits.

**Remark.** This definition is backward compatible with previous papers. Setting $R$ to infinity makes every problem satisfy the spectral constraint, and we recover the setting of [Garivier and Kaufmann, 2019]. Setting $\varepsilon$ to zero means identifying only the best arm, which is the setting of [Kocák and Garivier, 2020]. Setting $R$ to infinity and $\varepsilon$ to zero at the same time yields the original setting of [Garivier and Kaufmann, 2016].
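To make the smoothness constraint (1) and the set $\mathcal{A}_\varepsilon(\mu)$ of ε-best arms concrete, here is a minimal numerical sketch in Python (using only NumPy; the path graph and the mean vector are the ones used in the experiments of Section 6; this illustration is ours and not part of the original paper):

```python
import numpy as np

def laplacian(weights):
    """Graph Laplacian L = D - W for a symmetric weight matrix W."""
    degrees = np.diag(weights.sum(axis=1))
    return degrees - weights

def smoothness(mu, L):
    """Quadratic form mu^T L mu = sum_{a,b} w_ab (mu_a - mu_b)^2 / 2."""
    return float(mu @ L @ mu)

def eps_best_arms(mu, eps):
    """Indices of all eps-optimal arms: mu_a >= max_b mu_b - eps."""
    return np.flatnonzero(mu >= mu.max() - eps)

# Path graph over K = 11 arms (the experimental setup of Section 6).
K = 11
W = np.zeros((K, K))
for a in range(K - 1):
    W[a, a + 1] = W[a + 1, a] = 1.0
L = laplacian(W)

mu = np.array([0, .1, .2, .3, .4, .5, .4, .3, .2, .1, 0])
print(smoothness(mu, L))          # mu^T L mu = 0.1, so the constraint holds for any R >= 0.1
print(eps_best_arms(mu, 0.05))    # only arm 5 here; eps = 0.15 would add arms 4 and 6
```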
The starting point of this paper is the following proposition, which can be obtained along the lines of the recent paper [Garivier and Kaufmann, 2019]. It connects the expected stopping time of any δ-correct algorithm identifying ε-best arms to the characteristic time defined above.

**Proposition 1.** For any δ-correct strategy and any bandit problem $\mu$, the expected stopping time $\tau_\delta$ of the strategy is lower bounded as

$$\liminf_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} \;\ge\; T_{R,\varepsilon}(\mu),$$

where the characteristic time $T_{R,\varepsilon}(\mu)$ is defined in Eq. (2).

The most significant part of this proposition is that it provides a lower bound that scales with the characteristic time. The proof assumes that the learner plays according to a strategy $\omega$ from the definition of $T_{R,\varepsilon}(\mu)$, i.e. plays arm $a$ with probability $\omega_a$, while the environment chooses some bandit problem $\lambda$ for the learner. By choosing the best possible strategy $\omega$ for the learner while the environment chooses the hardest possible bandit problem $\lambda$, we obtain the lower bound of the proposition. The argument also works the other way around: if the learner plays according to the optimal strategy $\omega^*$, the strategy maximizing the expression in the definition of $T_{R,\varepsilon}(\mu)$, then the expected stopping time of the learner is also proportional to the characteristic time and therefore matches the lower bound (possibly up to a multiplicative constant). The main focus of this paper is therefore on analyzing the characteristic time, computing the optimal weights $\omega^*$ for the learner, and providing an algorithm that uses these weights.

### 2.2 Game-Theoretic Point of View

As hinted in the previous section, computing the inverse of the characteristic time, $T_{R,\varepsilon}(\mu)^{-1}$, can be seen as a game between the learner and the environment where:

- the first player (learner) chooses one of the ε-best arms $i \in \mathcal{A}_\varepsilon(\mu)$ and $\omega \in \Delta_K$ while trying to maximize the value of the optimization problem;
- the second player (environment) chooses an alternative arm $j \neq i$ and a bandit problem $\lambda \in \mathcal{M}^{i,j}_{R,\varepsilon}$ while trying to minimize the value of the optimization problem.

In fact, the objective function in the definition of $T_{R,\varepsilon}(\mu)^{-1}$ is very simple: linear in $\omega$ and quadratic in $\lambda$. To make this more apparent, we use the following definition to rewrite the optimization problem in a slightly different way.

**Definition 2.** Let $\mathcal{M}^{i,j}_{R,\varepsilon}$ be the set of problems with smoothness at most $R$ in which arm $j$ is better than arm $i$ by at least $\varepsilon$ (expression (3)). Define the set of element-wise divergences from $\mu$ to $\mathcal{M}^{i,j}_{R,\varepsilon}$ as

$$\mathcal{D}^{i,j}_{R,\varepsilon} \triangleq \left\{ d \in \mathbb{R}^K : \exists\, \lambda \in \mathcal{M}^{i,j}_{R,\varepsilon} \text{ s.t. } d_a = \frac{(\mu_a - \lambda_a)^2}{2} \text{ for all } a \in [K] \right\}.$$

This definition enables us to rewrite $T_{R,\varepsilon}(\mu)^{-1}$ in the more compact form

$$T_{R,\varepsilon}(\mu)^{-1} = \sup_{\omega \in \Delta_K}\; \max_{i \in \mathcal{A}_\varepsilon(\mu)}\; \min_{j \neq i}\; \inf_{d \in \mathcal{D}^{i,j}_{R,\varepsilon}} \omega^\mathsf{T} d. \qquad (4)$$

Thanks to this reparametrization, the objective is linear in $\omega$ for a fixed $d$, with $\omega$ and $d$ ranging over independent sets. We use this form later to simplify the presentation of some ideas and proofs. We approach this optimization problem in two steps (see the sketch below):

1. Given $i$ and $\omega$ of the first player (learner), we compute the best response $d$ (resp. $\lambda$) of the second player.
2. Having the best response, we find the optimal $\omega$ (either directly or numerically, depending on the problem).

In any case, it is essential to be able to compute the best response of the second player.
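The following schematic sketch (Python; purely illustrative and not part of the paper) shows how the two steps fit together: `best_response` stands for any oracle returning the divergence vector $d^{i,j}(\omega)$, and `maximize_over_simplex` for any routine maximizing a concave function over the simplex; both placeholders are instantiated in Sections 3 and 4.

```python
import numpy as np

def f_i(omega, i, best_response, K):
    """Step 1 plus the inner minimum: min over j != i of omega^T d^{i,j}(omega).
    The oracle is assumed to close over mu, eps and R and to return the
    divergence vector d^{i,j}(omega) (closed form when R = inf, numerical otherwise)."""
    return min(omega @ best_response(omega, i, j) for j in range(K) if j != i)

def characteristic_time_inverse(mu, eps, best_response, maximize_over_simplex, K):
    """Step 2 plus the outer maximum: T_{R,eps}(mu)^{-1} = max over eps-best i
    of sup_omega f^i(omega).  `maximize_over_simplex(f)` should return the
    supremum of a concave f over the simplex, e.g. via the mirror ascent of
    Section 4.2."""
    eps_best = np.flatnonzero(mu >= mu.max() - eps)
    return max(maximize_over_simplex(lambda w: f_i(w, i, best_response, K))
               for i in eps_best)
```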
When computing the best response, it is usually helpful to split the optimization problem into several smaller problems and solve them separately. We do so by considering the sub-problem in which $i$ is fixed.

**Definition 3.** Fixing $i$ in $T_{R,\varepsilon}(\mu)^{-1}$, we define

$$T^i_{R,\varepsilon}(\mu)^{-1} \triangleq \sup_{\omega \in \Delta_K}\; \min_{j \neq i}\; \inf_{d \in \mathcal{D}^{i,j}_{R,\varepsilon}} \omega^\mathsf{T} d.$$

This definition enables us to compute $T_{R,\varepsilon}(\mu)^{-1}$ as

$$T_{R,\varepsilon}(\mu)^{-1} = \max_{i \in \mathcal{A}_\varepsilon(\mu)} T^i_{R,\varepsilon}(\mu)^{-1}.$$

## 3 BAI Problem Without Structure

In this section, we focus on the setting without structure. This setting was previously studied by [Garivier and Kaufmann, 2019] and is recovered simply by setting the smoothness parameter $R$ to $\infty$, which makes every bandit problem satisfy the smoothness constraint (expression (1)). The main contribution of this section is to significantly simplify the proofs of [Garivier and Kaufmann, 2019] by using the game-theoretic approach described above, while introducing the ideas and reasoning used later in the more difficult spectral case. As mentioned earlier, this game-theoretic approach was initiated in [Degenne and Koolen, 2019].

Finding the values of the individual $T^i_{\infty,\varepsilon}(\mu)^{-1}$ is reminiscent of the problems solved by [Garivier and Kaufmann, 2016] and [Kocák and Garivier, 2020]. However, this time we assume only that $i$ is at most $\varepsilon$ away from the optimal arm, not that it is the optimal arm itself. The following theorem states the main result of the unconstrained case and gives a convenient way of computing the optimal weights $\omega^*(\mu)$.

**Theorem 2.** Assume that $i$ is one of the ε-optimal arms of the bandit problem $\mu$, i.e. $\mu_i > \mu_j - \varepsilon$ for every $j \in [K]$. Let $I$ be any arm different from $i$, set $x_I(c) = c$, and define the sequence $\{x_a(c)\}_{a \in [K]\setminus\{i\}}$ by

$$x_j(c) = \left[\left(1 + x_I(c)^{-1}\right)\left(\frac{\delta^{i,j}_\varepsilon}{\delta^{i,I}_\varepsilon}\right)^{2} - 1\right]^{-1}$$

for any $j \in [K]\setminus\{i, I\}$ and constant $c$, where $\delta^{i,j}_\varepsilon = \mu_i - \mu_j + \varepsilon$. Let $f(c) = \sum_{j \in [K]\setminus\{i\}} x_j(c)^2$. Then there exists $c^* \in \mathbb{R}_+$ such that $f(c^*) = 1$, and the optimal weights $\omega^*(\mu)$ are given by

$$\omega^*_i(\mu) = \frac{1}{1 + \sum_{j \in [K]\setminus\{i\}} x_j(c^*)}, \qquad \omega^*_j(\mu) = x_j(c^*)\,\omega^*_i(\mu) \quad\text{for } j \in [K]\setminus\{i\}.$$

The rest of this section is dedicated to building the tools needed for the proof of Theorem 2, given in Section 3.2.

### 3.1 Best Response Oracle: Setting Without Constraint

The following lemma gives the form of the best response of Player 2 in the game $T^i_{\infty,\varepsilon}(\mu)^{-1}$.

**Lemma 3.** Let $\omega \in \Delta_K$ and let $\mu$ be a bandit problem. Then the best response $\lambda^{i,j}_{\infty,\varepsilon}(\omega) \in \mathcal{M}^{i,j}_{\infty,\varepsilon}$ to $\omega$, with arm $j$ better than arm $i$ by at least $\varepsilon$, coincides with $\mu$ on all coordinates except $i$ and $j$, where

$$\lambda^{i,j}_{\infty,\varepsilon}(\omega)_i = t, \qquad \lambda^{i,j}_{\infty,\varepsilon}(\omega)_j = t + \varepsilon, \qquad t = \frac{\mu_i\omega_i + \mu_j\omega_j}{\omega_i + \omega_j} - \varepsilon\,\frac{\omega_j}{\omega_i + \omega_j}.$$

*Proof.* Assume that the $j$-th coordinate of $\lambda^{i,j}_{\infty,\varepsilon}(\omega)$ is exactly $\varepsilon'$ larger than its $i$-th coordinate, for some $\varepsilon' \ge \varepsilon$. Simple calculus shows that the optimal value of the $i$-th coordinate is

$$t = \frac{\mu_i\omega_i + \mu_j\omega_j}{\omega_i + \omega_j} - \varepsilon'\,\frac{\omega_j}{\omega_i + \omega_j},$$

while the $j$-th coordinate takes the value

$$\frac{\mu_i\omega_i + \mu_j\omega_j}{\omega_i + \omega_j} + \varepsilon'\,\frac{\omega_i}{\omega_i + \omega_j}.$$

The first term of both expressions is a weighted average of $\mu_i$ and $\mu_j$, which always lies in the interval $[\mu_j, \mu_i]$; therefore, increasing $\varepsilon'$ increases the objective function. This makes $\varepsilon' = \varepsilon$ the optimal choice.

**Corollary 1.** Let $\omega \in \Delta_K$ and let $\mu$ be a bandit problem. Then the best response $d^{i,j}_{\infty,\varepsilon}(\omega) \in \mathcal{D}^{i,j}_{\infty,\varepsilon}$ to $\omega$, with arm $j$ better than arm $i$ by at least $\varepsilon$, is zero everywhere except at the $i$-th and $j$-th positions:

$$d^{i,j}_{\infty,\varepsilon}(\omega) = \Big(0, \dots, 0, \underbrace{\tfrac{1}{2}\,\bar\omega_j^2\, (\delta^{i,j}_\varepsilon)^2}_{\text{position } i}, \dots, \underbrace{\tfrac{1}{2}\,\bar\omega_i^2\, (\delta^{i,j}_\varepsilon)^2}_{\text{position } j}, 0, \dots, 0\Big)^\mathsf{T},$$

where $\bar\omega_j = \frac{\omega_j}{\omega_i + \omega_j}$, $\bar\omega_i = \frac{\omega_i}{\omega_i + \omega_j}$, and $\delta^{i,j}_\varepsilon = \mu_i - \mu_j + \varepsilon$.

*Proof.* The proof follows by combining Lemma 3 with the definition of $\mathcal{D}^{i,j}_{\infty,\varepsilon}$.
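Lemma 3 and Corollary 1 translate directly into code. The sketch below (our transcription, with variable names of our choosing) returns both the alternative problem $\lambda$ and the divergence vector $d$:

```python
import numpy as np

def best_response_unconstrained(omega, mu, i, j, eps):
    """Closed-form best response of Player 2 when R = infinity.
    Returns (lambda, d): the alternative problem of Lemma 3 and the
    divergence vector of Corollary 1 (non-zero only at positions i and j)."""
    wi, wj = omega[i], omega[j]
    t = (mu[i] * wi + mu[j] * wj) / (wi + wj) - eps * wj / (wi + wj)
    lam = mu.copy()
    lam[i], lam[j] = t, t + eps
    d = (mu - lam) ** 2 / 2.0
    return lam, d

# The value omega^T d equals
# (mu_i - mu_j + eps)^2 / 2 * omega_i * omega_j / (omega_i + omega_j).
mu = np.array([0.5, 0.4, 0.2])
omega = np.ones(3) / 3
lam, d = best_response_unconstrained(omega, mu, i=0, j=2, eps=0.05)
print(lam, omega @ d)
```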
### 3.2 Proof of Theorem 2

Now that we know the exact form of the best response provided by the oracle, we are ready to prove Theorem 2 in the non-spectral setting. Using Corollary 1, the problem $T^i_{\infty,\varepsilon}(\mu)^{-1}$ becomes

$$T^i_{\infty,\varepsilon}(\mu)^{-1} = \sup_{\omega \in \Delta_K}\; \min_{j \neq i}\; \omega^\mathsf{T} d^{i,j}_{\infty,\varepsilon}(\omega),$$

where Player 2 now only chooses $j \neq i$. The following lemma shows that Player 2 can play a mixed strategy (a convex combination of pure strategies, each playing a single arm $j$) without changing the value of the game.

**Lemma 4.** Let $\omega$ be a vector in $\mathbb{R}^K$ and $D$ a compact subset of $\mathbb{R}^K$. Then

$$\inf_{d \in D} \omega^\mathsf{T} d = \inf_{d \in \mathrm{Conv}(D)} \omega^\mathsf{T} d,$$

where $\mathrm{Conv}(D)$ is the convex hull of $D$.

*Proof.* Since $D$ is compact, there exists a vector $d^*$ such that $\omega^\mathsf{T} d^* = \inf_{d \in \mathrm{Conv}(D)} \omega^\mathsf{T} d$. Since $d^*$ belongs to the convex hull of $D$, it can be expressed as a convex combination $\sum_{j \in [K]} q_j d_j$, with $q = (q_1, \dots, q_K) \in \Delta_K$, of at most $K$ points $d_j \in D$. Therefore,

$$\omega^\mathsf{T} d^* = \sum_{j \in [K]} q_j\, \omega^\mathsf{T} d_j \;\ge\; \inf_{d \in D} \omega^\mathsf{T} d \;\ge\; \inf_{d \in \mathrm{Conv}(D)} \omega^\mathsf{T} d = \omega^\mathsf{T} d^*.$$

All inequalities are therefore equalities; in particular, $\omega^\mathsf{T} d_j = \inf_{d \in D} \omega^\mathsf{T} d$ for every $j$ with $q_j \neq 0$. Since at least one $q_j$ is strictly positive, the corresponding $d_j \in D$ is a minimizer of the left-hand side, which proves the claim.

As mentioned, Lemma 4 allows Player 2 to play a mixed strategy without changing the value of the game. The main benefit of this change is that the game now admits a Nash equilibrium. At the equilibrium, any change in the strategy of the first player must leave the value of the game unchanged as long as the second player plays the optimal mixed strategy. This also means that, under the optimal mixed strategy of the second player, all the coordinates of the mixed divergence vector must be equal. Therefore, there must exist a convex combination with coefficients $\{q_j \ge 0\}_{j \in [K]\setminus\{i\}}$ such that

$$d^{i,*}_{\infty,\varepsilon}(\omega) = \sum_{j \neq i} q_j\, d^{i,j}_{\infty,\varepsilon}(\omega) \quad\text{and}\quad d^{i,*}_{\infty,\varepsilon}(\omega) = r\,\mathbf{1} \qquad (5)$$

for some constant $r$. Since for any $j \neq i$ only $d^{i,j}_{\infty,\varepsilon}(\omega)$ is non-zero at position $j$, the coefficient $q_j$ cannot be $0$ and

$$q_j\, d^{i,j}_{\infty,\varepsilon}(\omega)_j = q_k\, d^{i,k}_{\infty,\varepsilon}(\omega)_k \quad\text{for any } j, k \in [K]\setminus\{i\}. \qquad (6)$$

We now have everything needed to find the convex combination for a given $\omega$, as well as the optimal $\omega^*$ for the first player.

### 3.3 Finding the Weights $q_j$ and the Optimal $\omega^*$

Instead of finding the convex combination directly, we can look for a linear combination whose coordinates all equal $1$ and then renormalize it to obtain a convex combination. Since $d^{i,j}_{\infty,\varepsilon}(\omega)$ is the only term contributing to the $j$-th coordinate, dividing it by $d^{i,j}_{\infty,\varepsilon}(\omega)_j$ yields a $1$ at the $j$-th position; therefore $q_j$ is proportional to $d^{i,j}_{\infty,\varepsilon}(\omega)_j^{-1}$. This produces a vector of ones everywhere except at the $i$-th coordinate, for which we have no guarantee. To be at the Nash equilibrium, the $i$-th coordinate must equal $1$ as well, that is,

$$d^{i,*}_{\infty,\varepsilon}(\omega)_i = \sum_{j \neq i} \frac{d^{i,j}_{\infty,\varepsilon}(\omega)_i}{d^{i,j}_{\infty,\varepsilon}(\omega)_j} = \sum_{j \neq i} \left(\frac{\omega_j}{\omega_i}\right)^2 = 1. \qquad (7)$$

By Lemma 4, and because both $d^{i,j}_{\infty,\varepsilon}(\omega)$ and $d^{i,k}_{\infty,\varepsilon}(\omega)$ contribute to the optimal mixed strategy, the pure strategies $j$ and $k$ must be equally good when the first player chooses the optimal $\omega^*$:

$$\omega^*_i\, d^{i,j}_{\infty,\varepsilon}(\omega^*)_i + \omega^*_j\, d^{i,j}_{\infty,\varepsilon}(\omega^*)_j = \omega^*_i\, d^{i,k}_{\infty,\varepsilon}(\omega^*)_i + \omega^*_k\, d^{i,k}_{\infty,\varepsilon}(\omega^*)_k,$$

which, after a few steps, leads to

$$x_j(c^*) = \left[\left(1 + x_k(c^*)^{-1}\right)\left(\frac{\delta^{i,j}_\varepsilon}{\delta^{i,k}_\varepsilon}\right)^{2} - 1\right]^{-1},$$

where $x_j = \omega_j/\omega_i$. This expresses $x_j(c^*)$ in terms of $x_k(c^*)$ for any pair of arms $j$ and $k$; in particular, setting $k = I$ recovers Theorem 2, and condition (7) becomes $f(c^*) = 1$. A numerical sketch of the resulting procedure follows.
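The following sketch turns Theorem 2 into a numerical procedure: since $f$ is increasing in $c$, a simple bisection finds $c^*$ with $f(c^*) = 1$. Choosing the reference arm $I$ as the arm with the smallest gap $\delta^{i,j}_\varepsilon$, the bracketing, and the tolerance are our implementation choices, not part of the theorem.

```python
import numpy as np

def optimal_weights_unconstrained(mu, i, eps, tol=1e-10):
    """Optimal sampling weights omega*(mu) for a fixed eps-optimal arm i when
    R = infinity (Theorem 2).  Requires mu_i > mu_j - eps for all j, so that
    every gap delta_j is strictly positive."""
    K = len(mu)
    others = [j for j in range(K) if j != i]
    delta = np.array([mu[i] - mu[j] + eps for j in others])   # all > 0
    ref = delta.min()                  # reference arm I := smallest-gap arm

    def x(c):
        # x_j(c) = [(1 + 1/c)(delta_j / delta_I)^2 - 1]^{-1}; for j = I this
        # reduces to x_I(c) = c, so no special case is needed.
        return 1.0 / ((1.0 + 1.0 / c) * (delta / ref) ** 2 - 1.0)

    lo, hi = tol, 1.0
    while np.sum(x(hi) ** 2) < 1.0:    # find an upper bracket for c*
        hi *= 2.0
    while hi - lo > tol:               # bisection: f(c) = sum_j x_j(c)^2 is increasing
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if np.sum(x(mid) ** 2) < 1.0 else (lo, mid)

    xs = x(hi)
    omega = np.zeros(K)
    omega[i] = 1.0 / (1.0 + xs.sum())
    omega[others] = xs * omega[i]
    return omega

mu = np.array([0.5, 0.4, 0.2])
print(optimal_weights_unconstrained(mu, i=0, eps=0.05))   # sums to 1
```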
## 4 BAI Problem with Structure

The main difficulty of the spectral setting comes from the fact that the best response of the second player in the game $T^i_{R,\varepsilon}(\mu)^{-1}$ does not have a closed form; it is therefore impossible to compute $\omega^*$ directly as in the case without structure. We solve this problem in several steps:

- computing the best response to $\omega$ numerically;
- restating $T^i_{R,\varepsilon}(\mu)^{-1}$ as a function of $\omega$;
- computing a supergradient of this function;
- applying a gradient-based algorithm to compute the optimal $\omega^*$.

### 4.1 Best Response Oracle: Spectral Setting

The oracle needs to find the $\lambda^{i,j}_{R,\varepsilon}(\omega)$ attaining

$$\inf_{\lambda \in \mathcal{M}^{i,j}_{R,\varepsilon}} \sum_{a \in [K]} \omega_a \frac{(\mu_a - \lambda_a)^2}{2}.$$

If the oracle for the setting without structure returns a $\lambda^{i,j}_{\infty,\varepsilon}(\omega)$ that already satisfies the spectral constraint, i.e. $\lambda^{i,j}_{\infty,\varepsilon}(\omega)^\mathsf{T} L\, \lambda^{i,j}_{\infty,\varepsilon}(\omega) \le R$, we are done and the spectral oracle returns $\lambda^{i,j}_{R,\varepsilon}(\omega) = \lambda^{i,j}_{\infty,\varepsilon}(\omega)$. Otherwise, we can restrict the search to problems $\lambda^{i,j}_{R,\varepsilon}(\omega)$ with smoothness exactly $R$, thanks to the following lemma.

**Lemma 5.** Let $\lambda^{i,j}_{\infty,\varepsilon}(\omega)$ be the response of the non-spectral oracle and assume $\lambda^{i,j}_{\infty,\varepsilon}(\omega)^\mathsf{T} L\, \lambda^{i,j}_{\infty,\varepsilon}(\omega) > R$. Then the response of the spectral oracle $\lambda^{i,j}_{R,\varepsilon}(\omega)$ satisfies $\lambda^{i,j}_{R,\varepsilon}(\omega)^\mathsf{T} L\, \lambda^{i,j}_{R,\varepsilon}(\omega) = R$.

*Proof idea.* Suppose the smoothness of $\lambda^{i,j}_{R,\varepsilon}(\omega)$ were strictly smaller than $R$. Define $\lambda$ as a convex combination of the spectral and non-spectral oracle responses with parameter $\alpha \in (0, 1)$:

$$\lambda = (1 - \alpha)\, \lambda^{i,j}_{R,\varepsilon}(\omega) + \alpha\, \lambda^{i,j}_{\infty,\varepsilon}(\omega).$$

For small enough $\alpha$, $\lambda$ still satisfies the spectral constraint while improving the objective, contradicting the optimality of $\lambda^{i,j}_{R,\varepsilon}(\omega)$.

Knowing that the smoothness of the oracle response is exactly $R$, we use the method of Lagrange multipliers and minimize

$$F(\lambda, \gamma) = \sum_{a \in [K]} \omega_a \frac{(\mu_a - \lambda_a)^2}{2} + \gamma\,\big(\lambda^\mathsf{T} L \lambda - R\big),$$

subject to $\lambda_i = \lambda_j - \varepsilon$, where $\gamma$ is the Lagrange multiplier. Before solving this problem, we need to eliminate the coupling between the $i$-th and $j$-th coordinates of $\lambda$ and introduce some notation that simplifies the expressions in the next lemma. Fix the indices $i$ and $j$ for the moment and define:

- Any tilde index $\tilde a$ equals $a$ if $a < j$ and $a - 1$ if $a > j$; after removing the $j$-th coordinate of a vector, $\tilde a$ thus refers to the original $a$-th coordinate.
- $\tilde\omega$ is the vector with $\tilde\omega_{\tilde a} = \omega_a$ for all $a \in [K]\setminus\{i, j\}$ and $\tilde\omega_{\tilde i} = \omega_i + \omega_j$.
- $\tilde\Omega$ is the diagonal matrix with $\tilde\omega$ on its diagonal.
- $\tilde\lambda$ is the vector with $\tilde\lambda_{\tilde a} = \lambda_a$ for all $a \in [K]\setminus\{j\}$.
- $\tilde\mu$ is the vector with $\tilde\mu_{\tilde a} = \mu_a$ for all $a \in [K]\setminus\{i, j\}$ and
$$\tilde\mu_{\tilde i} = \frac{\omega_i\mu_i + \omega_j\mu_j}{\omega_i + \omega_j} - \frac{\varepsilon\,\omega_j}{\omega_i + \omega_j}.$$
- $\tilde L$ is the matrix created from $L$ by adding the $j$-th row and column to the $i$-th row and column, updating the diagonal entries so that every row and column sums to zero, and then removing the $j$-th row and column:
$$\tilde L_{\tilde a, \tilde b} = L_{a,b} \;\text{ for all } a, b \in [K]\setminus\{i, j\}, \qquad \tilde L_{\tilde i, \tilde a} = \tilde L_{\tilde a, \tilde i} = L_{i,a} + L_{j,a} \;\text{ for all } a \in [K]\setminus\{i, j\}, \qquad \tilde L_{\tilde i, \tilde i} = -\sum_{a \in [K]\setminus\{i, j\}} \tilde L_{\tilde i, \tilde a}.$$
- $\tilde L_j$ is the vector created from the $j$-th column of $L$ by setting its $i$-th element to $0$, updating its $j$-th element so that the elements sum to zero, and removing the $i$-th element.

**Lemma 6.** Let $i$ and $j$ be fixed and let $\tilde L_j$ be the vector defined above. Define

$$\tilde\lambda(\gamma) \triangleq \big(\tilde\Omega + 2\gamma \tilde L\big)^{-1}\big(\tilde\Omega\tilde\mu + 2\gamma\varepsilon\, \tilde L_j\big).$$

There exists $\gamma^*$ such that $\lambda(\gamma^*)^\mathsf{T} L\, \lambda(\gamma^*) = R$, where $\lambda(\gamma^*)$ denotes the full-dimensional vector corresponding to $\tilde\lambda(\gamma^*)$ whose $j$-th coordinate exceeds its $i$-th coordinate by exactly $\varepsilon$; this $\lambda(\gamma^*)$ is the best response.

*Proof idea.* The statement of the lemma is obtained by taking the partial derivatives of $F(\lambda, \gamma)$ and solving the resulting system of equations. A numerical sketch of the resulting oracle is given below.
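A numerical version of this oracle solves the stationarity condition for a fixed multiplier $\gamma$ and then root-finds on the smoothness constraint. The sketch below uses an explicit reduction matrix $E$ (equivalent in spirit to, but not literally, the tilde notation above); the value of `gamma_max`, the bracketing, and the bisection are our implementation choices.

```python
import numpy as np

def best_response_spectral(omega, mu, L, i, j, eps, R, gamma_max=1e6):
    """Numerical best-response oracle for the spectral constraint (a sketch of
    the procedure behind Lemmas 5 and 6).  Assumes all omega_a > 0, which is
    guaranteed by the projection onto the gamma-simplex of Section 5.1."""
    K = len(mu)
    Omega = np.diag(omega)

    # Reduction: lambda = E z + shift enforces lambda_j = lambda_i + eps.
    free = [a for a in range(K) if a != j]
    E = np.zeros((K, K - 1))
    for col, a in enumerate(free):
        E[a, col] = 1.0
    E[j, free.index(i)] = 1.0
    shift = np.zeros(K)
    shift[j] = eps

    def lam(gamma):
        # Stationary point of the Lagrangian F for a fixed multiplier gamma.
        M = Omega + 2.0 * gamma * L
        z = np.linalg.solve(E.T @ M @ E, E.T @ (Omega @ mu - M @ shift))
        return E @ z + shift

    def smooth(gamma):
        l = lam(gamma)
        return float(l @ L @ l)

    if smooth(0.0) <= R:            # the unconstrained response is already feasible
        return lam(0.0)
    lo, hi = 0.0, 1.0               # otherwise bracket gamma*: smoothness decreases with gamma
    while smooth(hi) > R and hi < gamma_max:
        hi *= 2.0
    for _ in range(100):            # bisection on lambda(gamma)^T L lambda(gamma) = R (Lemma 5)
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if smooth(mid) > R else (lo, mid)
    return lam(hi)
```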
### 4.2 Supergradient as the Best Response

Now that we have a way to compute the best response to Player 1, we are ready to restate $T^i_{R,\varepsilon}(\mu)^{-1}$ as a function of $\omega$ and to identify a supergradient of this function. Define

$$f^i(\omega) \triangleq \min_{j \neq i}\; \inf_{d \in \mathcal{D}^{i,j}_{R,\varepsilon}} \omega^\mathsf{T} d,$$

and note that $T^i_{R,\varepsilon}(\mu)^{-1} = \sup_{\omega \in \Delta_K} f^i(\omega)$. The following lemma gives a convenient way to compute a supergradient of $f^i$ at $\omega$: the best-response divergence vector $d^{i,j}(\omega)$ computed by the oracle of Section 4.1 is such a supergradient.

**Lemma 7.** Let $D \subseteq \mathbb{R}^K$ be a compact set. Then the function $f : \Delta_K \to \mathbb{R}$ defined by $f(\omega) = \inf_{d \in D} \omega^\mathsf{T} d$ is concave, and $d^*(\omega) = \arg\min_{d \in D} \omega^\mathsf{T} d$ is a supergradient of $f$ at $\omega$.

*Proof.* Let $d^*(\omega) \in D$ be a vector attaining the infimum in the definition of $f(\omega)$; such a vector is well defined since $D$ is compact. We prove that $d^*(\omega)$ is a supergradient of $f$ at any point $\omega$; since the existence of a supergradient at every point implies concavity, this proves both claims. Let $\omega_1$ and $\omega_2$ be any two points in the domain of $f$. From the definition of $d^*(\omega_2)$ we have

$$\omega_2^\mathsf{T} d^*(\omega_1) \ge \omega_2^\mathsf{T} d^*(\omega_2), \quad\text{i.e.}\quad \omega_1^\mathsf{T} d^*(\omega_1) + (\omega_2 - \omega_1)^\mathsf{T} d^*(\omega_1) \ge \omega_2^\mathsf{T} d^*(\omega_2),$$

which, using the definition of $f$, can be rewritten as

$$f(\omega_1) + d^*(\omega_1)^\mathsf{T}(\omega_2 - \omega_1) \ge f(\omega_2).$$

Thus $d^*(\omega_1)$ is a supergradient of $f$ at $\omega_1$, and $f$ is concave.

With a supergradient of $f^i(\omega)$ at hand, we can apply a gradient-based algorithm to find the optimal $\omega^*$ and, therefore, the value of $T^i_{R,\varepsilon}(\mu)^{-1}$. The algorithm of our choice is mirror ascent, which comes with strong guarantees.

**Theorem 8.** Let $\omega_1 = (\frac{1}{K}, \dots, \frac{1}{K})^\mathsf{T}$ and let the learning rate be $\eta = \frac{1}{L}\sqrt{\frac{2\log K}{T}}$. Then mirror ascent optimizing a concave function $f$ defined on $\Delta_K$ that is $L$-Lipschitz with respect to $\|\cdot\|_1$, with the generalized negative entropy $\Phi$ as the mirror map, enjoys the guarantee

$$f(\omega^*) - f\Big(\frac{1}{T}\sum_{t=1}^{T} \omega_t\Big) \le L\sqrt{\frac{2\log K}{T}}.$$

*Proof.* This result can be adapted from [Bubeck, 2015].

The last step is to use the geometry of the problem and the form of the best response oracle to show that $f^i(\omega)$ is Lipschitz for some constant $L$. This is captured in the following lemma.

**Lemma 9.** Let $f^i : \Delta_K \to \mathbb{R}$ be the function

$$f^i(\omega) = \min_{j \neq i}\; \inf_{\lambda \in \mathcal{M}^{i,j}_{R,\varepsilon}} \sum_{a \in [K]} \omega_a \frac{(\mu_a - \lambda_a)^2}{2}.$$

Then $f^i$ is $L$-Lipschitz with respect to $\|\cdot\|_1$ for any

$$L \ge \max_{a,b \in [K]} (\mu_a - \mu_b + \varepsilon)^2.$$
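Mirror ascent on the simplex with the negative-entropy mirror map is the exponentiated-gradient update. Below is a short sketch of the procedure invoked in Theorem 8 (the iterate averaging and the constant learning rate follow the standard analysis of [Bubeck, 2015]; the `lipschitz` argument is the constant from Lemma 9, and the whole block is our illustration rather than the paper's pseudocode):

```python
import numpy as np

def mirror_ascent_weights(f_supergrad, K, T, lipschitz):
    """Entropic mirror ascent (exponentiated gradient) on the simplex.
    `f_supergrad(omega)` must return a supergradient of the concave
    objective f^i at omega; returns the average of the iterates."""
    eta = np.sqrt(2.0 * np.log(K) / T) / lipschitz
    omega = np.full(K, 1.0 / K)
    avg = np.zeros(K)
    for _ in range(T):
        avg += omega / T
        g = f_supergrad(omega)
        omega = omega * np.exp(eta * g)   # ascent step in the dual (entropy) geometry
        omega /= omega.sum()              # Bregman projection back onto the simplex
    return avg
```

Here `f_supergrad(omega)` may return the divergence vector $d^{i,j^*}(\omega)$ at the arm $j^*$ attaining the minimum in $f^i(\omega)$, computed with the oracle sketched in Section 4.1; by Lemma 7 this is a valid supergradient.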
## 5 Algorithm

Solving the optimization problem that defines the characteristic time allows us to rely on Spectral TaS (Algorithm 1) of [Kocák and Garivier, 2020], a variant of the asymptotically optimal Track-and-Stop algorithm introduced in [Garivier and Kaufmann, 2016]. Essentially, this algorithm tracks the optimal sampling distribution $\omega^*$ with respect to the current estimate of the unknown bandit problem $\mu$ and plays accordingly. By playing arms, the estimate of $\mu$ improves over time, which in turn leads to a more precise sampling distribution $\omega^*$. The two remaining ingredients of the algorithm are the sampling and stopping rules. We recall them for self-containment and refer to [Garivier and Kaufmann, 2016; Kocák and Garivier, 2020] for more details.

### 5.1 Sampling Rule

In order to capture and correct a possible underestimation of an arm, the algorithm introduces a small amount of extra exploration. For every $\gamma \in (0, 1/K]$, let $\omega^{*,\gamma}(\mu)$ be the $L^\infty$ projection of $\omega^*(\mu)$ onto

$$\Delta_K^\gamma \triangleq \big\{(\omega_1, \dots, \omega_K) \in [\gamma, 1]^K : \omega_1 + \dots + \omega_K = 1\big\}.$$

Then the sampling rule is

$$A_{t+1} \in \arg\max_{a \in [K]} \;\sum_{s=0}^{t} \omega^{*,\gamma_s}_a\big(\hat\mu(s)\big) - N_a(t), \qquad (8)$$

where $\gamma_s$ is of order $1/\sqrt{s}$, which provides as much exploration as possible without significantly affecting the bounds.

### 5.2 Stopping Rule

The algorithm should stop as soon as it has gathered sufficient evidence, at confidence level $1 - \delta$, of the superiority of one of the arms. For two arms $i \in \mathcal{A}_\varepsilon(\hat\mu)$ and $j \in [K]$, denote by

$$Z_{i,j}(t) = \inf_{\lambda \in \mathcal{M}^{i,j}_{R,\varepsilon}} \sum_{a \in [K]} N_a(t)\,\frac{(\hat\mu_a(t) - \lambda_a)^2}{2}$$

the generalized likelihood ratio statistic for the test $\mu_i > \mu_j$. Then the stopping rule is given by

$$\tau = \inf\Big\{t \in \mathbb{N} : \max_{i \in \mathcal{A}_\varepsilon(\hat\mu)}\; \min_{j \neq i}\; Z_{i,j}(t) > \beta(t, \delta)\Big\}, \qquad (9)$$

where $\beta(\cdot, \cdot)$ is a threshold function, typically chosen slightly larger than $\log(1/\delta)$. Theorem 10 in [Garivier and Kaufmann, 2016] shows that the choice $\beta(t, \delta) = \log(2t(K-1)/\delta)$ together with $A_{\tau+1} = \arg\max_{a \in [K]} \hat\mu_a(\tau)$ yields a probability of failure $\mathbb{P}_\nu\big(A_{\tau+1} \notin \mathcal{A}_\varepsilon(\mu)\big) \le \delta$.

**Algorithm 1** Spectral TaS
1: Input and initialization:
2: $L$: graph Laplacian
3: $\varepsilon$, $\delta$: tolerance and confidence parameters
4: $R$: upper bound on the smoothness of $\mu$
5: Play each arm $a$ once and observe the rewards $r_a$
6: $\hat\mu_1 = (r_1, \dots, r_K)^\mathsf{T}$: empirical estimate of $\mu$
7: while Stopping Rule (9) is not satisfied do
8:   Compute $\omega^*(\hat\mu_t)$ by mirror ascent
9:   Choose $A_t$ according to Sampling Rule (8)
10:  Obtain the reward $r_t$ of arm $A_t$
11:  Update $\hat\mu_t$ using $r_t$
12: end while
13: Output arm $A^* = \arg\max_{a \in [K]} \hat\mu_a$
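For concreteness, the sampling rule (8) and the stopping rule (9) can be sketched as follows. The interface of `best_response` is an assumption of this illustration: any oracle returning the minimizing alternative $\lambda$ of $Z_{i,j}(t)$ (the closed-form oracle of Section 3.1 or the numerical one of Section 4.1, called with the pull counts playing the role of the weights) can be plugged in.

```python
import numpy as np

def sampling_rule(weight_history, counts):
    """C-tracking rule (8): pull the arm whose cumulative target weight
    sum_{s <= t} omega^{*,gamma_s}_a(mu_hat(s)) is furthest ahead of N_a(t).
    `weight_history` has shape (t + 1, K); `counts` has shape (K,)."""
    return int(np.argmax(np.sum(weight_history, axis=0) - counts))

def glr_stopping(mu_hat, counts, eps, delta, best_response):
    """Stopping rule (9) with the threshold beta(t, delta) = log(2 t (K - 1) / delta).
    `best_response(weights, mu, i, j)` is assumed to return the minimizing
    alternative lambda of Z_{i,j}(t)."""
    K, t = len(mu_hat), int(counts.sum())
    beta = np.log(2.0 * t * (K - 1) / delta)
    glr = max(
        min(counts @ ((mu_hat - best_response(counts, mu_hat, i, j)) ** 2) / 2.0
            for j in range(K) if j != i)
        for i in np.flatnonzero(mu_hat >= mu_hat.max() - eps))
    return glr > beta
```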
## 6 Experiments

For the experiments, we used the bandit problem $\mu = (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1, 0)$ with $K = 11$ arms, a graph connecting every pair of neighboring actions $a$ and $a+1$ for $a \in [K-1]$, $\varepsilon = 0.05$, and different values of $R$. Figure 1 shows the effect of the smoothness parameter $R$ on both the theoretical and the empirical stopping times: the green curve is the average stopping time over 10 runs of Spectral TaS, while the red curve is the characteristic time.

[Figure 1: Effect of R on the characteristic and stopping time.]

## 7 Conclusion and Open Problems

We identified the characteristic time of fixed-confidence ε-best arm identification in bandit models with graph smoothness. It appears as a delicate min-max optimization problem, but thanks to a game-theoretic analysis of this complexity we could provide an efficient procedure for its computation, leading to an asymptotically optimal algorithm. While this provides a complete treatment of the fixed-confidence setting, the dual fixed-budget setting is still not understood. How good is a strategy that follows the estimated optimal weights but stops at a given time $n$ rather than at a chosen stopping time? Are improvements that exploit the budget $n$ possible? These natural questions are open for further investigation.

## Acknowledgements

Aurélien Garivier (ANR chaire SeqALO) and Tomáš Kocák acknowledge the support of the Project IDEXLYON of the University of Lyon, in the framework of the Programme Investissements d'Avenir (ANR-16-IDEX-0005).

## References

[Bubeck, 2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.

[Degenne and Koolen, 2019] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems, 2019.

[Even-Dar et al., 2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[Garivier and Kaufmann, 2016] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, 2016.

[Garivier and Kaufmann, 2019] Aurélien Garivier and Emilie Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. Preprint arXiv:1905.03495, 2019.

[Karnin et al., 2013] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, 2013.

[Kocák and Garivier, 2020] Tomáš Kocák and Aurélien Garivier. Best arm identification in spectral bandits. In International Joint Conference on Artificial Intelligence, 2020.

[Lattimore and Szepesvári, 2019] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.

[Mannor and Tsitsiklis, 2004] Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 2004.

[Russo, 2016] Daniel Russo. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Conference on Learning Theory, 2016.