# Optimal Best-arm Identification in Linear Bandits

Yassir Jedra, KTH, Stockholm, Sweden, jedra@kth.se
Alexandre Proutiere, KTH, Stockholm, Sweden, alepro@kth.se

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We study the problem of best-arm identification with fixed confidence in stochastic linear bandits. The objective is to identify the best arm with a given level of certainty while minimizing the sampling budget. We devise a simple algorithm whose sample complexity matches known instance-specific lower bounds, asymptotically almost surely and in expectation. The algorithm relies on an arm sampling rule that tracks an optimal proportion of arm draws, and that, remarkably, can be updated as rarely as we wish without compromising its theoretical guarantees. Moreover, unlike existing best-arm identification strategies, our algorithm uses a stopping rule that does not depend on the number of arms. Experimental results suggest that our algorithm significantly outperforms existing algorithms. The paper further provides a first analysis of the best-arm identification problem in linear bandits with a continuous set of arms.

1 Introduction

The stochastic linear bandit [1, 2] is a sequential decision-making problem that generalizes the classical stochastic Multi-Armed Bandit (MAB) problem [3, 4] by assuming that the average reward is a linear function of the arm. Linear bandits have been extensively applied in online services such as online advertisement and recommendation systems [5, 6, 7], and constitute arguably the most relevant structured bandit model in practice. Most existing analyses of stochastic linear bandits concern regret minimization [2, 8, 9, 10, 11], i.e., the problem of devising an online algorithm maximizing the expected reward accumulated over a given time horizon. When the set of arms is finite, this problem is solved in the sense that we know an instance-specific regret lower bound, and a simple algorithm whose regret matches this fundamental limit [10, 11]. The best-arm identification problem (also referred to as the pure exploration problem) in linear bandits with a finite set of arms has received less attention [12, 13, 14, 15, 16], and does not admit a fully satisfactory solution.

In the pure exploration problem with fixed confidence, one has to design a δ-PAC algorithm (able to identify the best arm with probability at least 1 − δ) using as few samples as possible. Such an algorithm consists of a sampling rule (an active policy to sequentially select arms), a stopping rule, and a decision rule that outputs the estimated best arm. The number of rounds before the algorithm stops is referred to as its sample complexity. An instance-specific information-theoretical lower bound on the expected sample complexity has been derived in [17]. However, we are lacking simple and practical algorithms achieving this bound. Importantly, existing algorithms exhibit scalability issues, as they always include subroutines that explicitly depend on the number of arms (refer to the related work for details). They may also be computationally involved.

In this paper, we present a new best-arm identification algorithm for linear bandits with a finite set of arms, whose sample complexity matches the information-theoretical lower bound. The algorithm follows the track-and-stop principle proposed in [18] for pure exploration in bandits without structure.
Its sampling rule tracks the optimal proportion of arm draws, predicted by the sample complexity lower bound and estimated using the least-squares estimator of the system parameter. Remarkably, this tracking procedure can be made as lazy as we wish (the estimated optimal proportion of draws can be updated rarely, not every round) without compromising the asymptotic optimality of the algorithm. The stopping rule of our algorithm is classically based on a generalized likelihood ratio test. However, the exploration threshold defining its stopping condition is novel and, critically, we manage to make it independent of the number of arms. Overall, our algorithm is simple, scalable, and yet asymptotically optimal. In addition, its computational complexity can be tuned by changing the frequency at which the tracking rule is updated, without affecting its theoretical guarantees.

We also study the pure exploration problem in linear bandits with a continuous set of arms. We restrict our attention to the case where the set of arms is the (d − 1)-dimensional unit sphere. We establish a sample complexity lower bound satisfied by any (ε, δ)-PAC algorithm (such algorithms identify an ε-optimal arm with probability at least 1 − δ). This bound scales as $(d/\varepsilon)\log(1/\delta)$. We finally propose an algorithm whose sample complexity matches the lower bound order-wise.

Related work. Best-arm identification algorithms in linear bandits with a finite set of K arms have been proposed and analyzed in [12, 13, 14, 15, 16]. Soare et al. [12] leverage tools from G-optimal experimental design to devise the XY-adaptive algorithm, which returns the best arm with sample complexity τ satisfying $\tau \lesssim \big(M \vee T^\star_\mu \log(K^2/\delta)\big)\big(\log\log(K^2/\delta) + \log(1/\Delta_{\min}^2)\big)$ w.p. 1 − δ, where µ is the parameter defining the reward function, $\Delta_{\min}$ is the minimal gap between the best and a sub-optimal arm, $T^\star_\mu \log(1/\delta)$ is the information-theoretical lower bound on the expected sample complexity of δ-PAC algorithms, and M is an instance-dependent constant. XY-adaptive runs in phases, and eliminates arms at the end of each phase. The use of phases requires rounding procedures, which come with $d^2$ additional rounds in the sample complexity. The algorithm also requires solving in each round an optimization problem similar to that leading to the sample complexity lower bound (see Section 3.1). Improved versions of XY-adaptive have been proposed in [15, 16]. ALBA [15] relies on a novel estimator for µ (removing the need for rounding procedures). RAGE [16] offers an improved sample complexity $\tau \lesssim T^\star_\mu \log(1/\Delta_{\min}^2)\big(\log(K^2/\delta) + d\log(1/\Delta_{\min}^2)\big)$ (slightly simplifying the expression). The aforementioned algorithms are rather complicated, and explicitly use the number K of arms in some of their components: K is present in the arm elimination function in XY-adaptive, and in the phase durations in [15, 16]. Importantly, their sample complexity does not match the information-theoretical lower bound when δ decreases. There are also no guarantees on their expected sample complexity. [13] proposes an algorithm based on an explore-and-verify framework with an asymptotically optimal sample complexity. The algorithm is not practical, but is the first to demonstrate that the lower bound derived in [17] is achievable. In [14], the authors present LinGapE, an algorithm as simple as ours.
However, its sampling and stopping rules are both sub-optimal (e.g., the algorithm needs to sample all arms at least once), which in turn leads to weak performance guarantees, with a sample complexity satisfying $\tau \lesssim K \log(1/\delta)$. Recently, Degenne et al. [19] proposed LinGame and LinGame-C, two track-and-stop algorithms that achieve the best possible sample complexity asymptotically. Their stopping rule is somewhat similar to ours, but has the disadvantage of requiring a boundedness assumption on the unknown parameter µ. Their sampling rule is different from ours and relies on a game-theoretic approach. The latter consists in viewing the information-theoretical constant $T^\star_\mu$ as the value of a zero-sum game between the agent and the environment. It is not clear how computationally efficient this algorithm is, as it requires solving a saddle-point problem in each round. Finally, we would like to mention that [19] was not available to us at the time of submission.

The algorithm we present is as simple as LinGapE, does not run in phases, does not explicitly use the number of arms in its sampling and stopping rules, and has an asymptotically optimal sample complexity, both almost surely and in expectation. We are not aware of any work on best-arm identification in linear bandits with a continuous set of arms. We provide here the first results.

2 Model and Objective

We consider a bandit problem with a set $\mathcal{A} \subset \mathbb{R}^d$ of arms. In round $t \ge 1$, if the decision maker selects arm a, she observes as feedback a random reward $r_t = \mu^\top a + \eta_t$, where $\mu \in \mathbb{R}^d$ is unknown and $(\eta_t)_{t\ge 1}$ is a sequence of i.i.d. Gaussian random variables, $\eta_t \sim \mathcal{N}(0, \sigma^2)$. The objective is to learn the arm $a^\star_\mu = \arg\max_{a\in\mathcal{A}} \mu^\top a$ with the highest expected reward. Throughout the paper, we assume that µ and $\mathcal{A}$ are such that the best arm $a^\star_\mu$ is unique. We also assume that the set of arms $\mathcal{A}$ spans $\mathbb{R}^d$.

A best-arm identification algorithm consists of a sampling rule, a stopping rule, and a decision rule. The sampling rule decides which arm $a_t$ is selected in round t based on past observations: $a_t$ is $\mathcal{F}_{t-1}$-measurable, where $\mathcal{F}_t$ is the σ-algebra generated by $(a_1, r_1, \ldots, a_t, r_t)$. The stopping rule decides when to stop sampling, and is defined by τ, a stopping time w.r.t. the filtration $(\mathcal{F}_t)_{t\ge 1}$. The decision rule outputs a guess $\hat a_\tau$ of the best arm based on the observations collected up to round τ, i.e., $\hat a_\tau$ is $\mathcal{F}_\tau$-measurable. The performance of an identification algorithm is assessed through its probabilistic guarantees and through its sample complexity τ. We consider different probabilistic guarantees, depending on whether the set of arms $\mathcal{A}$ is finite or continuous. Specifically, for ε, δ > 0:

Definition 1 (Finite set of arms $\mathcal{A}$). An algorithm is δ-PAC if for all µ, $\mathbb{P}_\mu[\hat a_\tau \neq a^\star_\mu] \le \delta$ and $\mathbb{P}_\mu[\tau < \infty] = 1$.

Definition 2 (Continuous set of arms $\mathcal{A}$). An algorithm is (ε, δ)-PAC if for all µ, $\mathbb{P}_\mu[\mu^\top(a^\star_\mu - \hat a_\tau) > \varepsilon] \le \delta$ and $\mathbb{P}_\mu[\tau < \infty] = 1$.

When the set of arms $\mathcal{A}$ is finite (resp. continuous), the objective is to devise a δ-PAC (resp. (ε, δ)-PAC) algorithm with minimal expected sample complexity $\mathbb{E}_\mu[\tau]$.

Notation. Let $K = |\mathcal{A}|$ when $\mathcal{A}$ is finite. $\Lambda = \{x \in [0, 1]^{\mathcal{A}} : \sum_{a\in\mathcal{A}} x_a = 1\}$ denotes the simplex in dimension K. For $a, b \in [0, 1]$, $\mathrm{kl}(a, b)$ is the KL divergence between two Bernoulli distributions of respective means a and b. For any $w, w' \in \mathbb{R}^{\mathcal{A}}$, we denote $d_\infty(w, w') = \max_{a\in\mathcal{A}} |w_a - w'_a|$, and for any compact set $C \subset \mathbb{R}^{\mathcal{A}}$, $d_\infty(w, C) = \min_{w'\in C} d_\infty(w, w')$. For $w \in \mathbb{R}^{\mathcal{A}}$, $\mathrm{supp}(w) = \{a \in \mathcal{A} : w_a \neq 0\}$ denotes the support of w. $\mathbb{P}_\mu$ (resp. $\mathbb{E}_\mu$) denotes the probability measure (resp. expectation) of observations generated under µ; in the absence of ambiguity, we simply write $\mathbb{P}$ (resp. $\mathbb{E}$). For two functions f and g, we write $f \lesssim g$ iff there exists a universal constant C such that for all x, $f(x) \le C g(x)$.
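To make the model concrete, here is a minimal Python sketch of the reward feedback $r_t = \mu^\top a + \eta_t$; the class and method names are ours (not from the paper), and later sketches in this document assume such an environment.

```python
import numpy as np

class LinearBanditEnv:
    """Stochastic linear bandit: pulling arm a yields r = mu^T a + Gaussian noise."""

    def __init__(self, arms, mu, sigma=1.0, seed=0):
        self.arms = np.asarray(arms, dtype=float)    # arm set A, shape (K, d)
        self.mu = np.asarray(mu, dtype=float)        # unknown parameter mu, shape (d,)
        self.sigma = sigma                           # noise standard deviation
        self.rng = np.random.default_rng(seed)

    def pull(self, idx):
        """Observe a noisy reward for the arm with index idx."""
        a = self.arms[idx]
        return float(a @ self.mu + self.sigma * self.rng.normal())

    def best_arm(self):
        """Index of a*_mu = argmax_a mu^T a (only used to evaluate a guess)."""
        return int(np.argmax(self.arms @ self.mu))
```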
3 Finite set of arms

Consider a finite set $\mathcal{A}$ of K arms. We first recall existing lower bounds on the expected sample complexity of δ-PAC algorithms, and then present our algorithm along with an analysis of its sample complexity.

3.1 Sample complexity lower bound

Soare [17] derived the following sample complexity lower bound, using the method developed by Garivier and Kaufmann [20] in the case of bandits without structure.

Theorem 1. The sample complexity of any δ-PAC algorithm satisfies: for all µ, $\mathbb{E}_\mu[\tau] \ge \sigma^2\, T^\star_\mu\, \mathrm{kl}(\delta, 1-\delta)$, where
$$ (T^\star_\mu)^{-1} = \sup_{w\in\Lambda}\ \min_{a\in\mathcal{A}\setminus\{a^\star_\mu\}}\ \frac{\big(\mu^\top(a^\star_\mu - a)\big)^2}{2\,(a^\star_\mu - a)^\top \big(\sum_{a'\in\mathcal{A}} w_{a'}\, a' a'^\top\big)^{-1} (a^\star_\mu - a)}. \qquad (1)$$

In the above lower bound, w may be interpreted as the proportions of arm draws, also referred to as an allocation. For $a \in \mathcal{A}$, $w_a$ represents the fraction of rounds where arm a is selected. This interpretation stems from the proof of Theorem 1, where $w_a = \mathbb{E}_\mu[N_a(\tau)]/\mathbb{E}_\mu[\tau]$ and $N_a(t)$ is the number of times a is selected up to and including round t (see [17]). The lower bound is obtained by taking the supremum over w, i.e., over the best possible allocation. A different way to define $(T^\star_\mu)^{-1}$ is $\sup_{w\in\Lambda} \psi(\mu, w)$ (a convex program) [17], where
$$ \psi(\mu, w) = \min_{\{\lambda :\ \exists a \neq a^\star_\mu,\ \lambda^\top(a^\star_\mu - a) < 0\}}\ \frac{1}{2}(\mu - \lambda)^\top \Big(\sum_{a\in\mathcal{A}} w_a\, a a^\top\Big)(\mu - \lambda). \qquad (2)$$
The next lemmas, proved in Appendix B, confirm that the two definitions of $T^\star_\mu$ are equivalent, and provide useful properties of the function ψ and of its maximizers.

Lemma 1. We have:
$$ \psi(\mu, w) = \begin{cases} \min_{a\in\mathcal{A}\setminus\{a^\star_\mu\}} \dfrac{\langle \mu,\, a^\star_\mu - a\rangle^2}{2\,(a^\star_\mu - a)^\top\big(\sum_{i=1}^K w_i a_i a_i^\top\big)^{-1}(a^\star_\mu - a)} & \text{if } \sum_{a\in\mathcal{A}} w_a\, a a^\top \succ 0, \\[2mm] 0 & \text{otherwise.}\end{cases} \qquad (3)$$
In addition, ψ is continuous in both µ and w, and $w \mapsto \psi(\mu, w)$ attains its maximum over Λ at a point $w^\star_\mu$ such that $\sum_{a\in\mathcal{A}} (w^\star_\mu)_a\, a a^\top$ is invertible.

Lemma 2. (Maximum theorem) Let $\mu \in \mathbb{R}^d$ be such that $a^\star_\mu$ is unique. Define $\psi^\star(\mu) = \max_{w\in\Lambda} \psi(\mu, w)$ and $C^\star(\mu) = \arg\max_{w\in\Lambda} \psi(\mu, w)$. Then $\psi^\star$ is continuous at µ, and $C^\star(\mu)$ is convex, compact and non-empty. Furthermore, for any open neighborhood $\mathcal{V}$ of $C^\star(\mu)$, there exists an open neighborhood $\mathcal{U}$ of µ such that for all $\mu' \in \mathcal{U}$, we have $C^\star(\mu') \subseteq \mathcal{V}$.

3.2 Least-squares estimator

Our algorithm and its analysis rely on the least-squares estimator of µ and on its performance. This estimator, based on the observations collected in the first t rounds, is $\hat\mu_t = \big(\sum_{s=1}^t a_s a_s^\top\big)^{\dagger}\big(\sum_{s=1}^t a_s r_s\big)$, where † denotes the pseudo-inverse. The following result provides a sufficient condition on the sampling rule for the convergence of $\hat\mu_t$ to µ. This condition depends on the asymptotic spectral properties of the covariates matrix $\sum_{s=1}^t a_s a_s^\top$. We also provide a concentration result for the least-squares estimator. Refer to Appendix C for the proofs of the following lemmas.

Lemma 3. Assume that the sampling rule satisfies $\liminf_{t\to\infty} \lambda_{\min}\big(\frac{1}{t^\alpha}\sum_{s=1}^t a_s a_s^\top\big) > 0$ almost surely (a.s.), for some $\alpha \in (0, 1)$. Then $\lim_{t\to\infty} \hat\mu_t = \mu$ a.s. More precisely, for all $\beta \in (0, \alpha/2)$, $\|\hat\mu_t - \mu\| = o(t^{-\beta})$ a.s.

Lemma 4. Let α > 0 and $L = \max_{a\in\mathcal{A}} \|a\|$. Assume that $\lambda_{\min}\big(\sum_{s=1}^t a_s a_s^\top\big) \ge c\, t^\alpha$ a.s. for all $t \ge t_0$, for some $t_0 \ge 0$ and some constant c > 0. Then for all $t \ge t_0$,
$$ \mathbb{P}\big(\|\hat\mu_t - \mu\| \ge \varepsilon\big) \le \big(c^{-1/2} L\big)^d\, t^{d/2}\, \exp\Big(-\frac{c\,\varepsilon^2\, t^\alpha}{2\sigma^2}\Big). $$

The least-squares estimator is used in our decision rule. After the algorithm stops, at round τ, it returns the arm $\hat a_\tau \in \arg\max_{a\in\mathcal{A}} \hat\mu_\tau^\top a$.
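As an illustration, the decision rule's estimate can be computed directly from the arms played and the rewards observed; the sketch below (function and variable names are ours) uses a pseudo-inverse so that the estimate is defined even before the covariates matrix becomes invertible.

```python
import numpy as np

def least_squares_estimate(arms_played, rewards):
    """Compute mu_hat_t = (sum_s a_s a_s^T)^+ (sum_s a_s r_s)."""
    A = np.asarray(arms_played, dtype=float)   # stacked covariates, shape (t, d)
    r = np.asarray(rewards, dtype=float)       # observed rewards, shape (t,)
    design = A.T @ A                           # sum_s a_s a_s^T
    moment = A.T @ r                           # sum_s a_s r_s
    return np.linalg.pinv(design) @ moment     # pseudo-inverse handles rank deficiency
```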
3.3 Sampling rule

To design an algorithm with minimal sample complexity, the sampling rule should match the optimal proportions of arm draws, i.e., an allocation in the set $C^\star(\mu)$. Since µ is unknown, our sampling rule will track, in round t, allocations in the plug-in estimate $C^\star(\hat\mu_t)$. To successfully apply this certainty-equivalence principle, we need to at least make sure that, under our sampling rule, $\hat\mu_t$ converges to µ. Using Lemma 3, we can design a family of sampling rules with this guarantee.

Lemma 5. (Forced exploration) Let $A_0 = \{a_0(1), \ldots, a_0(d)\} \subseteq \mathcal{A}$ be such that $\lambda_{\min}\big(\sum_{a\in A_0} a a^\top\big) > 0$. Let $(b_t)_{t\ge 1}$ be an arbitrary sequence of arms. Furthermore, define for all $t \ge 1$, $f(t) = c_{A_0}\sqrt{t}$, where $c_{A_0} = \lambda_{\min}\big(\sum_{a\in A_0} a a^\top\big)/\sqrt{d}$. Consider the sampling rule defined recursively as: $i_0 = 1$, and for $t \ge 0$,
$$ i_{t+1} = (i_t \bmod d) + \mathbb{1}\Big\{\lambda_{\min}\Big(\sum_{s=1}^{t} a_s a_s^\top\Big) < f(t)\Big\}, \qquad a_{t+1} = \begin{cases} a_0(i_{t+1}) & \text{if } \lambda_{\min}\big(\sum_{s=1}^{t} a_s a_s^\top\big) < f(t), \\ b_{t+1} & \text{otherwise.} \end{cases} \qquad (5)$$
Then the sampling rule (5) satisfies $\liminf_{t} \lambda_{\min}\big(\frac{1}{\sqrt{t}}\sum_{s=1}^t a_s a_s^\top\big) > 0$ a.s., so that by Lemma 3, $\lim_t \hat\mu_t = \mu$ a.s.

The tracking rule. The sequence $(b_t)_{t\ge 1}$ in (5) is used to track a sequence of allocations $(w(t))_{t\ge 1}$, as formalized in the following lemma.

Lemma 6. (Tracking) Let $C \subseteq \Lambda$ be a compact set, and let $(w(t))_{t\ge 1}$ be a sequence in Λ such that for all ε > 0, there exists $t_0(\varepsilon) \ge 1$ such that for all $t \ge t_0(\varepsilon)$, $d_\infty(w(t), C) \le \varepsilon$. Consider a sampling rule defined by (5) and
$$ b_{t+1} \in \arg\min_{a \in \mathrm{supp}(\sum_{s=1}^{t} w(s))} \Big( N_a(t) - \sum_{s=1}^{t} w_a(s) \Big), \qquad (6)$$
where $N_a(0) = 0$ and, for $t \ge 0$, $N_a(t+1) = N_a(t) + \mathbb{1}\{a_{t+1} = a\}$. Then there exists $t_1(\varepsilon) \ge t_0(\varepsilon)$ such that for all $t \ge t_1(\varepsilon)$, $d_\infty\big((N_a(t)/t)_{a\in\mathcal{A}}, C\big) \le (p_t + d - 1)\varepsilon$, where $p_t = \big|\mathrm{supp}\big(\sum_{s=1}^{t} w(s)\big)\setminus A_0\big| \le K - d$.

The lazy tracking rule. The design of our tracking rule is completed by choosing the sequence $(w(t))_{t\ge 1}$ in (6). The only requirement we actually impose on this sequence is the following condition: there exists a non-decreasing sequence $(\ell(t))_{t\ge 1}$ of integers with $\ell(1) = 1$, $\ell(t) \le t - 1$ for $t > 1$, and $\lim_t \ell(t) = \infty$, such that
$$ \lim_{t\to\infty}\ \min_{s \ge \ell(t)}\ d_\infty\big(w(t), C^\star(\hat\mu_s)\big) = 0 \quad \text{a.s.} \qquad (7)$$
This condition is referred to as the lazy condition, since it is very easy to ensure in practice. For example, it holds for the following lazy tracking rule. Let $\mathcal{T} = \{t_n : n \ge 1\}$ be a deterministic increasing set of integers such that $t_n \to \infty$ as $n \to \infty$; we can define $(w(t))_{t\ge 1}$ so that it tracks $C^\star(\hat\mu_t)$ only when $t \in \mathcal{T}$. Specifically, if $t \in \mathcal{T}$, $w(t+1) \in C^\star(\hat\mu_t)$, and $w(t+1) = w(t)$ otherwise. For this sequence, (7) holds with $\ell(t) = t - 1$. The lazy condition is sufficient to guarantee the almost sure asymptotic optimality of the algorithm. To achieve optimality in terms of expected sample complexity, we will need a slightly stronger condition, also easily satisfied under the above lazy tracking rules; see the details in Section 3.5.

The following proposition states that the lazy sampling rule is able to track the set $C^\star(\mu)$. It follows from the fact that $\hat\mu_t$ converges to µ (thanks to Lemmas 3 and 5), and from combining the maximum theorem (Lemma 2) with Lemma 6. All proofs related to the sampling rule are presented in Appendix E.

Proposition 1. Under any sampling rule (5)-(6) satisfying the lazy condition (7), the proportions of arm draws approach $C^\star(\mu)$: $\lim_t d_\infty\big((N_a(t)/t)_{a\in\mathcal{A}}, C^\star(\mu)\big) = 0$ a.s.

3.4 Stopping rule

We use the classical Chernoff stopping rule. Define the generalized log-likelihood ratio, for every pair of arms $a, b \in \mathcal{A}$, $t \ge 1$, and $\varepsilon \ge 0$, as
$$ Z_{a,b,\varepsilon}(t) = \log \frac{\max_{\{\mu' :\ \mu'^\top(a-b) \ge -\varepsilon\}} f_{\mu'}(r_t, a_t, \ldots, r_1, a_1)}{\max_{\{\mu' :\ \mu'^\top(a-b) \le -\varepsilon\}} f_{\mu'}(r_t, a_t, \ldots, r_1, a_1)}, $$
where $f_{\mu'}(r_t, a_t, \ldots, r_1, a_1) \propto \exp\big(-\frac{1}{2}\sum_{s=1}^t (r_s - \mu'^\top a_s)^2\big)$ under our Gaussian noise assumption. We may actually derive an explicit expression of $Z_{a,b,\varepsilon}(t)$ (see Appendix D for a proof):

Lemma 7. Let $t \ge 1$ and assume that $\sum_{s=1}^t a_s a_s^\top \succ 0$. For all $a, b \in \mathcal{A}$, we have:
$$ Z_{a,b,\varepsilon}(t) = \mathrm{sgn}\big(\hat\mu_t^\top(a - b) + \varepsilon\big)\ \frac{\big(\hat\mu_t^\top(a - b) + \varepsilon\big)^2}{2\,(a - b)^\top\big(\sum_{s=1}^t a_s a_s^\top\big)^{-1}(a - b)}. $$

Here we use $Z_{a,b}(t) = Z_{a,b,0}(t)$ ($Z_{a,b,\varepsilon}$ will be used in the case of a continuous set of arms). Note that $Z_{a,b}(t) \ge 0$ for all $b \in \mathcal{A}$ if and only if $a \in \arg\max_{a'\in\mathcal{A}} \hat\mu_t^\top a'$. Denoting $Z(t) = \max_{a\in\mathcal{A}} \min_{b\in\mathcal{A}\setminus\{a\}} Z_{a,b}(t)$, the stopping rule is defined as follows:
$$ \tau = \inf\Big\{ t \in \mathbb{N} : Z(t) > \beta(\delta, t)\ \text{ and }\ \sum_{s=1}^t a_s a_s^\top \succeq c\, I_d \Big\}, \qquad (8)$$
where $\beta(\delta, t)$ is referred to as the exploration threshold and c is some positive constant (refer to Remark 1 for a convenient choice of c).
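A minimal Python sketch of the statistic of Lemma 7 and of the stopping test (8), using the exploration threshold β(δ, t) defined in Proposition 2 below; the function names and default parameters are ours, and the covariates matrix is assumed invertible once the second condition of (8) holds.

```python
import numpy as np

def glr_statistic(mu_hat, V_inv, a, b, eps=0.0):
    """Z_{a,b,eps}(t) = sgn(m) * m^2 / (2 (a-b)^T V_t^{-1} (a-b)),  with m = mu_hat^T (a-b) + eps."""
    gap = float(mu_hat @ (a - b)) + eps
    quad = float((a - b) @ V_inv @ (a - b))        # (a-b)^T (sum_s a_s a_s^T)^{-1} (a-b)
    return np.sign(gap) * gap ** 2 / (2.0 * quad)

def should_stop(arms, mu_hat, V, delta, sigma2=1.0, u=1.0, c=1.0):
    """Stopping test (8)-(9): Z(t) > beta(delta, t) and sum_s a_s a_s^T >= c I_d."""
    d = V.shape[0]
    if np.linalg.eigvalsh(V).min() < c:            # second condition of (8) not yet met
        return False
    V_inv = np.linalg.inv(V)
    best = int(np.argmax(arms @ mu_hat))           # Z(t) is attained at the empirical best arm
    Z = min(glr_statistic(mu_hat, V_inv, arms[best], arms[b])
            for b in range(len(arms)) if b != best)
    # beta(delta, t) = (1 + u) sigma^2 log( det((u c)^{-1} V_t + I_d)^{1/2} / delta )
    beta = (1.0 + u) * sigma2 * (0.5 * np.linalg.slogdet(V / (u * c) + np.eye(d))[1]
                                 - np.log(delta))
    return Z > beta
```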
The exploration threshold β(δ, t) should be chosen so that the algorithm is δ-PAC. We also wish to design a threshold that does not depend on the number K of arms. These requirements lead to the exploration threshold defined in the proposition below (its proof is presented in Appendix D and relies on a concentration result for self-normalized processes [9]).

Proposition 2. Let u > 0, and define:
$$ \beta(\delta, t) = (1 + u)\,\sigma^2 \log\Big(\det\big((uc)^{-1}\textstyle\sum_{s=1}^t a_s a_s^\top + I_d\big)^{\frac{1}{2}}\,\delta^{-1}\Big). \qquad (9)$$
Under any sampling rule, and a stopping rule (8) with exploration threshold (9), we have: $\mathbb{P}\big[\tau < \infty,\ \mu^\top(a^\star_\mu - \hat a_\tau) > 0\big] \le \delta$.

The above proposition is valid for any sampling rule, but only ensures that if the algorithm stops, it does not make any mistake w.p. 1 − δ. To get a δ-PAC algorithm, we need to specify the sampling rule.

Remark 1. (Choosing c and u) A convenient choice for the constant c involved in (8) and (9) is $c = \max_{a\in\mathcal{A}} \|a\|^2$. With this choice, we have $\det\big(c^{-1}\sum_{s=1}^t a_s a_s^\top + I_d\big) \le (t+1)^d$. The constant u should be chosen so that the threshold in (9) is lowered; for instance, one may choose u = 1. With these choices, the threshold can be as simple as $\beta(\delta, t) = 2\sigma^2 \log(t^{d/2}/\delta)$. In addition, if we use a sampling rule with forced exploration as in (5), then in view of Lemma 5, the second stopping condition $\sum_{s=1}^t a_s a_s^\top \succeq c\, I_d$ is satisfied as soon as t exceeds $d + 1 + c^2 d / \lambda_{\min}\big(\sum_{a\in A_0} a a^\top\big)^2$.

3.5 Sample complexity analysis

In this section, we establish that combining a sampling rule (5)-(6) satisfying the lazy condition (7) with the stopping rule (8)-(9) yields an asymptotically optimal algorithm. Refer to Appendix F for proofs. An example of such an algorithm is the Lazy Track-and-Stop (LT&S) algorithm, whose pseudo-code is presented in Algorithm 1; LT&S updates the tracking rule only in rounds belonging to a set $\mathcal{T}$.

Algorithm 1: Lazy Track-and-Stop (LT&S)
    Input: arms $\mathcal{A}$, confidence level δ, set $\mathcal{T}$ of lazy updates.
    Initialization: t = 0, i = 0, covariates matrix $A_0 = 0$, Z(0) = 0, $N(0) = (N_a(0))_{a\in\mathcal{A}} = 0$.
    while $\lambda_{\min}(A_t) < c$ or $Z(t) < \beta(\delta, t)$ do
        if $\lambda_{\min}(A_t) < f(t)$ then $a \leftarrow a_0(i+1)$, $i \leftarrow (i+1) \bmod d$   (forced exploration)
        else $a \leftarrow \arg\min_{b\in\mathrm{supp}(\sum_{s=1}^t w(s))} \big(N_b(t) - \sum_{s=1}^t w_b(s)\big)$   (tracking)
        $t \leftarrow t + 1$; sample arm a and update N(t), $\hat\mu_t$, Z(t); $A_t \leftarrow A_{t-1} + a a^\top$; $w(t) \leftarrow w(t-1)$
        if $t \in \mathcal{T}$ then $w(t) \leftarrow \arg\max_{w\in\Lambda} \psi(\hat\mu_t, w)$
    return $\hat a_\tau \in \arg\max_{a\in\mathcal{A}} \hat\mu_\tau^\top a$

Theorem 2. (Almost sure sample complexity upper bound) An algorithm defined by (5)-(6)-(8)-(9) with a lazy sampling rule (satisfying (7)) is δ-PAC. Its sample complexity verifies:
$$ \mathbb{P}\Big(\limsup_{\delta\to 0} \frac{\tau}{\log(1/\delta)} \le \sigma^2 T^\star_\mu\Big) = 1. $$

To obtain an algorithm with optimal expected sample complexity, we need to consider lazy tracking rules that satisfy the following condition: there exist α > 0 and a non-decreasing sequence $(\ell(t))_{t\ge 1}$ of integers with $\ell(1) = 1$, $\ell(t) \le t$ and $\liminf_t \ell(t)/t^\gamma > 0$ for some γ > 0, such that
$$ \forall \varepsilon > 0,\ \exists h(\varepsilon) :\ \forall t \ge 1,\quad \mathbb{P}\Big(\min_{s \ge \ell(t)} d_\infty\big(w(t), C^\star(\hat\mu_s)\big) > \varepsilon\Big) \le \frac{h(\varepsilon)}{t^{2+\alpha}}. \qquad (10)$$
The condition (10) is again easy to ensure. Assume that we update w(t) only if $t \in \mathcal{T} = \{t_n : n \ge 1\}$, where $(t_n)$ is an increasing sequence of integers such that $t_n \to \infty$ as $n \to \infty$. Then (10) holds for the sequence $(\ell(t))_{t\ge 1}$ such that $\ell(t_{i+1}) = t_i$ for all i, provided that $\liminf_n t_n / t_{n+1}^\gamma > 0$ for some γ > 0. Examples include: (i) periodic updates of w(t): $\mathcal{T} = \{1 + kP, k \in \mathbb{N}\}$ and $\ell(t) = \max\{1, t - P\}$; (ii) exponential updates: $\mathcal{T} = \{2^k, k \in \mathbb{N}\}$ and $\ell(t) = \max\{1, \lfloor t/2 \rfloor\}$. The condition (10) may seem loose, but we have to keep in mind that, in practice, the performance of the algorithm will depend on the update frequency of w(t). For asymptotic optimality, however, (10) is enough (the key point is to have some concentration of $\hat\mu_t$ around µ, which is guaranteed via the forced exploration part of the sampling rule).
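For concreteness, here is a Python sketch of the main loop of Algorithm 1, reusing the LinearBanditEnv and should_stop sketches given earlier. The allocation oracle optimal_allocation (standing for $\arg\max_{w\in\Lambda}\psi(\hat\mu_t, w)$, e.g. computed with Frank-Wolfe as in Section 5), the forced-exploration constant c_A0, and all names are our illustrative choices rather than the paper's exact ones.

```python
import numpy as np

def lazy_track_and_stop(env, arms, A0_idx, delta, sigma2, optimal_allocation,
                        lazy_set, max_rounds=100_000):
    """Sketch of the LT&S main loop (Algorithm 1). `optimal_allocation(mu_hat)` should
    return an allocation in C*(mu_hat); `lazy_set` is the set T of update rounds;
    `A0_idx` indexes the d forced-exploration arms."""
    K, d = arms.shape
    c = float(np.max(np.sum(arms ** 2, axis=1)))   # c = max_a ||a||^2, u = 1 (Remark 1)
    # assumed forced-exploration constant in f(t) = c_A0 * sqrt(t) (cf. Lemma 5)
    c_A0 = np.linalg.eigvalsh(arms[A0_idx].T @ arms[A0_idx]).min() / np.sqrt(d)
    V = np.zeros((d, d))        # covariates matrix  sum_s a_s a_s^T
    S = np.zeros(d)             # sum_s a_s r_s
    N = np.zeros(K)             # arm draw counts
    w = np.ones(K) / K          # current target allocation w(t)
    w_sum = np.zeros(K)         # running sum w(1) + ... + w(t)
    i = 0                       # round-robin index over A0
    for t in range(max_rounds):
        if np.linalg.eigvalsh(V).min() < c_A0 * np.sqrt(max(t, 1)):   # forced exploration
            a = int(A0_idx[i])
            i = (i + 1) % d
        else:                                                          # tracking rule (6)
            support = np.where(w_sum > 0)[0]
            a = int(support[np.argmin(N[support] - w_sum[support])])
        r = env.pull(a)
        N[a] += 1
        V += np.outer(arms[a], arms[a])
        S += arms[a] * r
        mu_hat = np.linalg.pinv(V) @ S
        if (t + 1) in lazy_set:              # lazy update of the target allocation
            w = optimal_allocation(mu_hat)
        w_sum += w
        if should_stop(arms, mu_hat, V, delta, sigma2=sigma2, u=1.0, c=c):
            return int(np.argmax(arms @ mu_hat)), t + 1
    return int(np.argmax(arms @ mu_hat)), max_rounds
```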
Theorem 3. (Expected sample complexity upper bound) An algorithm defined by (5)-(6)-(8)-(9) with a sampling rule satisfying (7) and (10) is δ-PAC. Its sample complexity verifies:
$$ \limsup_{\delta\to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \le \sigma^2 T^\star_\mu. $$

4 Continuous set of arms

We now investigate the case where $\mathcal{A} = \mathbb{S}^{d-1}$ is the (d − 1)-dimensional unit sphere. Without loss of generality, we restrict our attention to problems where $\mu \in \mathcal{M}(\varepsilon_0) = \{\eta : \eta^\top a^\star_\eta > \varepsilon_0\}$ for some $\varepsilon_0 > 0$. The results of this section are proved in Appendix G.

4.1 Sample complexity lower bound

Theorem 4. Let $\varepsilon \in (0, \varepsilon_0/5)$ and $\delta \in (0, 1)$. The sample complexity of any (ε, δ)-PAC algorithm satisfies: for all $\mu \in \mathcal{M}(\varepsilon_0)$,
$$ \mathbb{E}_\mu[\tau] \ge \frac{\sigma^2 (d-1)}{20\,\|\mu\|\,\varepsilon}\, \mathrm{kl}(\delta, 1 - \delta). $$

The above theorem is obtained by first applying the classical change-of-measure argument (see e.g. Lemma 19 in [20]). Such an argument implies that under any (ε, δ)-PAC algorithm, for all confusing parameters λ such that $\{a \in \mathbb{S}^{d-1} : \mu^\top(a^\star_\mu - a) \le \varepsilon\}$ and $\{a \in \mathbb{S}^{d-1} : \lambda^\top(a^\star_\lambda - a) \le \varepsilon\}$ are disjoint, we have $(\mu - \lambda)^\top \mathbb{E}_\mu\big[\sum_{s=1}^{\tau} a_s a_s^\top\big](\mu - \lambda) \ge 2\sigma^2\,\mathrm{kl}(\delta, 1-\delta)$. We then study the solution of the following max-min problem: $\max_{(a_t)_{t\ge 1}} \min_{\lambda \in B_\varepsilon(\mu)} (\mu - \lambda)^\top \mathbb{E}\big[\sum_{s=1}^{\tau} a_s a_s^\top\big](\mu - \lambda)$, where $B_\varepsilon(\mu)$ denotes the set of confusing parameters. The continuous set of arms makes this analysis challenging. We show that the value of the max-min problem is at most of order $\mathbb{E}_\mu[\tau]\,\|\mu\|\,\varepsilon/(d-1)$, which leads to the claimed lower bound.

4.2 Algorithm

We present a simple algorithm whose sample complexity approaches our lower bound. We describe its three components below. The decision rule is the same as before, based on the least-squares estimator of µ: $\hat a_t \in \arg\max_{a\in\mathcal{A}} \hat\mu_t^\top a$.

Sampling rule. Let $U = \{u_1, u_2, \ldots, u_d\}$ be a subset of $\mathbb{S}^{d-1}$ that forms an orthonormal basis of $\mathbb{R}^d$. The sampling rule just consists in selecting arms from U in a round-robin manner: for all $t \ge 1$, $a_t = u_{(t \bmod d) + 1}$.

Stopping rule. As in the case of a finite set of arms, the stopping rule relies on a generalized log-likelihood ratio test. Define $Z(t) = \inf_{\{b\in\mathcal{A} : |\hat\mu_t^\top(\hat a_t - b)| \le \varepsilon_t\}} Z_{\hat a_t, b, \varepsilon_t}(t)$, where an expression of $Z_{\hat a_t, b, \varepsilon_t}(t)$ is given in Lemma 7. We consider the following stopping time:
$$ \tau = \inf\Big\{ t \in \mathbb{N} : Z(t) \ge \beta(\delta, t)\ \text{ and }\ \lambda_{\min}\Big(\sum_{s=1}^t a_s a_s^\top\Big) \ge \max\big(c, \rho(\delta, t)\big) \Big\}. \qquad (11)$$
Hence, compared to the case of a finite set of arms, we add a stopping condition defined by the threshold ρ(δ, t) and related to the spectral properties of the covariates matrix.

Proposition 3. Let $(\delta_t)_{t\ge 1}$ and $(\varepsilon_t)_{t\ge 1}$ be two sequences with values in (0, 1) and (0, ε), respectively, such that $\sum_{t=1}^\infty \delta_t < \delta$ and $\lim_t \varepsilon_t = \varepsilon$. Let $\zeta_t = \log\Big(2\det\big(c^{-1}\sum_{s=1}^t a_s a_s^\top + I_d\big)^{\frac{1}{2}}\Big) - \log(\delta_t)$, and define:
$$ \beta(\delta, t) = 2\sigma^2 \zeta_t \quad \text{and} \quad \rho(\delta, t) = \frac{4\sigma^2 \varepsilon_t^2\, \zeta_t}{(\varepsilon - \varepsilon_t)^2}. \qquad (12)$$
Then, under the stopping rule (11)-(12), we have: $\mathbb{P}_\mu\big[\tau < \infty,\ \mu^\top(a^\star_\mu - \hat a_\tau) > \varepsilon\big] \le \delta$.

4.3 Sample complexity analysis

Under specific choices for the sequence $(\varepsilon_t)_{t\ge 1}$, we can analyze the sample complexity of our algorithm and show its optimality order-wise.

Theorem 5. Choose in the stopping rule $\varepsilon_t = \varepsilon\big(1 + \varepsilon\,(4\sigma^2 \log(4/\delta_t))^{-1/2}\big)^{-1}$ (observe that $\varepsilon_t < \varepsilon$ and $\lim_t \varepsilon_t = \varepsilon$). Then, under the aforementioned sampling rule and the stopping rule (11)-(12), we have:
$$ \mathbb{P}\Big(\limsup_{\delta\to 0} \frac{\tau}{\log(1/\delta)} \lesssim \frac{\sigma^2 d}{\|\mu\|\,\varepsilon}\Big) = 1 \quad \text{and} \quad \limsup_{\delta\to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \lesssim \frac{\sigma^2 d}{\|\mu\|\,\varepsilon}. $$
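As an illustration of the continuous-arm procedure, the sketch below (all names ours) samples the canonical orthonormal basis of R^d in a round-robin fashion and returns the least-squares guess projected on the sphere; the stopping rule (11)-(12) is replaced by a fixed budget to keep the example short.

```python
import numpy as np

def sphere_round_robin(mu, sigma, n_rounds, rng=None):
    """Round-robin sampling over an orthonormal basis (Section 4.2), fixed budget.
    Returns hat_a = mu_hat / ||mu_hat||, the maximizer of mu_hat^T a over the unit sphere."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = len(mu)
    basis = np.eye(d)                              # U = {u_1, ..., u_d}
    V = np.zeros((d, d))                           # sum_s a_s a_s^T
    S = np.zeros(d)                                # sum_s a_s r_s
    for t in range(1, n_rounds + 1):
        a = basis[t % d]                           # a_t = u_{(t mod d)+1} (0-based here)
        r = float(mu @ a) + sigma * rng.normal()   # observed reward mu^T a_t + eta_t
        V += np.outer(a, a)
        S += a * r
    mu_hat = np.linalg.pinv(V) @ S
    return mu_hat / np.linalg.norm(mu_hat)
```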
5 Experiments

We present here a few experimental results comparing the performance of our algorithm to that of RAGE, the state-of-the-art algorithm [16], in the case of a finite set of arms. We compare LT&S and RAGE only, because they outperform other existing algorithms. Further experimental results can be found in Appendix A.

Experimental set-up. We use the following toy experiment, which corresponds to the "many arms" example in [16]: d = 2 and $\mathcal{A} = \{(1, 0),\ e^{j3\pi/4},\ e^{j(\pi/4 + \phi_i)},\ i \in [n-2]\} \subset \mathbb{C}$ (arms viewed as points of $\mathbb{R}^2$), where the $\phi_i$ are i.i.d. $\mathcal{N}(0, 0.09)$, and µ = (1, 0). Experiments are run with the risk δ = 0.05.

Implementation of LT&S. To update the allocation w(t), we use the Frank-Wolfe algorithm [21] (without any rounding procedure). Our implementation of Frank-Wolfe is similar to that of Fiez et al. [16] (refer to Appendix A.3 for further comments). At each update, the previous allocation is fed as the initial value of the new optimization problem. We implement the exponential lazy update scheme $\mathcal{T} = \{2^k, k \in \mathbb{N}\}$. The parameters of our stopping rule are $c = c_{A_0}\sqrt{d}$ (so that after d steps the second condition of the stopping rule is satisfied) and u = 1; we use the threshold $\beta(6\delta/(\pi^2 t^2), t)$. The initial exploration set $A_0$ is chosen at random. We implemented two versions of LT&S. The first one does not track the average but only the current allocation w(t): $a \in \arg\min_{b\in\mathrm{supp}(w(t))} \big(N_b(t) - t\, w_b(t)\big)$. The second version tracks the average allocation, as described in Algorithm 1. We further compare our results to those of the Oracle algorithm proposed by [12]. The Oracle samples from a true optimal allocation $w \in C^\star(\mu)$, and applies a stopping rule that depends on K.

Results. From the table below, LT&S outperforms RAGE most of the time, and the performance improvement grows with the number of arms. LT&S without averaging shows better performance than with averaging. In Appendix A, we present results for another version of LT&S, with even better performance.

| Number of arms | LT&S (no averaging), Mean (Std) | LT&S, Mean (Std) | RAGE, Mean (Std) | Oracle, Mean (Std) |
|---|---|---|---|---|
| K = 1000 | 1206.55 (42.2) | 1409 (57) | 1148.45 (49.82) | 476.45 (40.74) |
| K = 2500 | 1253.60 (47.70) | 1404 (57) | 1440.75 (149.24) | 492.15 (43.88) |
| K = 5000 | 1247.05 (81.07) | 1401 (86) | 1540.3 (158.90) | 515.60 (47.64) |
| K = 7500 | 1296.55 (76.78) | 1434 (78) | 1598.0 (164.60) | 547.65 (45.77) |

Table 1: Sample complexity (mean and standard deviation) for the "many arms" experiment of [16].

6 Conclusion

In this paper, we present LT&S, an algorithm that solves the best-arm identification problem in stochastic linear bandits. The sampling rule of the algorithm simply tracks the optimal allocation predicted by the sample complexity lower bound. Its stopping rule is defined through a generalized log-likelihood ratio and an exploration threshold that does not depend on the number of arms, but on the ambient dimension only. LT&S is asymptotically optimal: we have guarantees on its sample complexity, both almost surely and in expectation. The first experimental results are very promising, as LT&S seems to exhibit a much better sample complexity than existing algorithms. We also provide the first results on the pure exploration problem in linear bandits with a continuous set of arms.

The analysis presented in this paper suggests several extensions. We can easily generalize the results to non-Gaussian reward distributions (e.g., bounded, or from a one-parameter exponential family). It would be interesting to extend our results in the continuous setting to generic convex sets of arms (we believe that the instance-specific sample complexity lower bound would then depend only on the local smoothness of the set of arms around the best arm). A more challenging but exciting question is to derive tight non-asymptotic sample complexity upper bounds for LT&S, so as to characterize the trade-off between the laziness of the algorithm and its sample complexity.

Broader impact

This work is mostly theoretical. Our results may provide guidelines and insights towards an improved design of algorithms for linear bandits.
Linear bandit algorithms are versatile, and are in particular used in clinical trials and recommendation systems. Hence our results can benefit users and developers of such systems. In clinical trials, minimizing the sample complexity is crucial, and the use of our algorithm there can be really beneficial. Our algorithm has the additional advantage of being computationally efficient. Overall, we do not foresee any direct negative impact of our work. However, it is worth noting that there is some concern about the potential use of recommender systems for opinion influence.

Acknowledgement

This research was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

[1] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003.
[2] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[3] Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, 1952.
[4] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[5] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670, New York, NY, USA, 2010.
[6] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, 2011. PMLR.
[7] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548, 2016.
[8] Paat Rusmevichientong and John Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2), 2010.
[9] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[10] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. AISTATS, 2016.
[11] Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1763–1771. Curran Associates, Inc., 2017.
[12] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.
[13] Zohar S. Karnin. Verification based solution for structured MAB problems. In Advances in Neural Information Processing Systems, pages 145–153, 2016.
[14] Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits.
In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 843–851, Playa Blanca, Lanzarote, Canary Islands, 2018. PMLR.
[15] Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4877–4886, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.
[16] Tanner Fiez, Lalit Jain, Kevin G. Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676, 2019.
[17] Marta Soare. Sequential Resource Allocation in Linear Stochastic Bandits. PhD thesis, Université Lille 1 - Sciences et Technologies, December 2015.
[18] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
[19] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, 2020.
[20] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
[21] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 427–435, Atlanta, Georgia, USA, 2013. PMLR.
[22] Rangarajan K. Sundaram. A First Course in Optimization Theory. Cambridge University Press, 1996.
[23] Victor H. de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
[24] Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. arXiv preprint arXiv:1812.01251, 2018.
[25] Pierpaolo Natalini and Biagio Palumbo. Inequalities for the incomplete gamma function. Math. Inequal. Appl., 3(1):69–77, 2000.
[26] Jonathan M. Borwein and O-Yeat Chan. Uniform bounds for the complementary incomplete gamma function. Mathematical Inequalities and Applications, 12:115–121, 2009.