# Optimal Best-arm Identification in Linear Bandits

Yassir Jedra, KTH, Stockholm, Sweden, jedra@kth.se
Alexandre Proutiere, KTH, Stockholm, Sweden, alepro@kth.se

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We study the problem of best-arm identification with fixed confidence in stochastic linear bandits. The objective is to identify the best arm with a given level of certainty while minimizing the sampling budget. We devise a simple algorithm whose sample complexity matches known instance-specific lower bounds, asymptotically almost surely and in expectation. The algorithm relies on an arm sampling rule that tracks an optimal proportion of arm draws, and that, remarkably, can be updated as rarely as we wish without compromising its theoretical guarantees. Moreover, unlike existing best-arm identification strategies, our algorithm uses a stopping rule that does not depend on the number of arms. Experimental results suggest that our algorithm significantly outperforms existing algorithms. The paper further provides a first analysis of the best-arm identification problem in linear bandits with a continuous set of arms.

1 Introduction

The stochastic linear bandit [1, 2] is a sequential decision-making problem that generalizes the classical stochastic Multi-Armed Bandit (MAB) problem [3, 4] by assuming that the average reward is a linear function of the arm. Linear bandits have been extensively applied in online services such as online advertisement and recommendation systems [5, 6, 7], and constitute arguably the most relevant structured bandit model in practice. Most existing analyses of stochastic linear bandits concern regret minimization [2, 8, 9, 10, 11], i.e., the problem of devising an online algorithm maximizing the expected reward accumulated over a given time horizon. When the set of arms is finite, this problem is solved in the sense that we know an instance-specific regret lower bound, and a simple algorithm whose regret matches this fundamental limit [10, 11]. The best-arm identification problem (also referred to as the pure exploration problem) in linear bandits with a finite set of arms has received less attention [12, 13, 14, 15, 16], and does not admit a fully satisfactory solution.

In the pure exploration problem with fixed confidence, one has to design a δ-PAC algorithm (able to identify the best arm with probability at least 1 − δ) using as few samples as possible. Such an algorithm consists of a sampling rule (an active policy to sequentially select arms), a stopping rule, and a decision rule that outputs the estimated best arm. The number of rounds before the algorithm stops is referred to as its sample complexity. An instance-specific information-theoretical lower bound on the expected sample complexity has been derived in [17]. However, we are lacking simple and practical algorithms achieving this bound. Importantly, existing algorithms exhibit scalability issues, as they always include subroutines that explicitly depend on the number of arms (refer to the related work for details). They may also be computationally involved.

In this paper, we present a new best-arm identification algorithm for linear bandits with a finite set of arms, whose sample complexity matches the information-theoretical lower bound. The algorithm follows the track-and-stop principle proposed in [18] for pure exploration in bandits without structure.
Its sampling rule tracks the optimal proportion of arm draws, predicted by the sample complexity lower bound and estimated using the least-squares estimator of the system parameter. Remarkably, this tracking procedure can be made as lazy as we wish (the estimated optimal proportion of draws can be updated rarely, not every round) without compromising the asymptotic optimality of the algorithm. The stopping rule of our algorithm is classically based on a generalized likelihood ratio test. However, the exploration threshold defining its stopping condition is novel and, critically, we manage to make it independent of the number of arms. Overall, our algorithm is simple, scalable, and yet asymptotically optimal. In addition, its computational complexity can be tuned by changing the frequency at which the tracking rule is updated, without affecting its theoretical guarantees.

We also study the pure exploration problem in linear bandits with a continuous set of arms. We restrict our attention to the case where the set of arms is the (d − 1)-dimensional unit sphere. We establish a sample complexity lower bound satisfied by any (ε, δ)-PAC algorithm (such algorithms identify an ε-optimal arm with probability at least 1 − δ). This bound scales as $(d/\varepsilon)\log(1/\delta)$. We finally propose an algorithm whose sample complexity matches the lower bound order-wise.

Related work. Best-arm identification algorithms in linear bandits with a finite set of K arms have been proposed and analyzed in [12, 13, 14, 15, 16]. Soare et al. [12] leverage tools from G-optimal experimental design to devise the XY-adaptive algorithm, which returns the best arm with sample complexity τ satisfying $\tau \lesssim \big(M \vee T^\star_\mu \log(K^2/\delta)\big)\big(\log\log(K^2/\delta) + \log(1/\Delta_{\min}^2)\big)$ w.p. 1 − δ, where µ is the parameter defining the reward function, $\Delta_{\min}$ is the minimal gap between the best and a sub-optimal arm, $T^\star_\mu \log(1/\delta)$ is the information-theoretical lower bound on the expected sample complexity of δ-PAC algorithms, and M is an instance-dependent constant. XY-adaptive runs in phases, and eliminates arms at the end of each phase. The use of phases requires rounding procedures, which come with $d^2$ additional rounds in the sample complexity. The algorithm also requires solving in each round an optimization problem similar to that leading to the sample complexity lower bound (see Section 3.1). Improved versions of XY-adaptive have been proposed in [15, 16]. ALBA [15] relies on a novel estimator for µ (removing the need for rounding procedures). RAGE [16] offers an improved sample complexity $\tau \lesssim T^\star_\mu \log(1/\Delta_{\min}^2)\big(\log(K^2/\delta) + d\log(1/\Delta_{\min}^2)\big)$ (slightly simplifying the expression). The aforementioned algorithms are rather complicated, and explicitly use the number K of arms in some of their components: K is present in the arm elimination function in XY-adaptive, and in the phase durations in [15, 16]. Importantly, their sample complexity does not match the information-theoretical lower bound when δ decreases. There are also no guarantees on their expected sample complexity. [13] proposes an algorithm based on an explore-and-verify framework with an asymptotically optimal sample complexity. The algorithm is not practical, but is the first to demonstrate that the lower bound derived in [17] is achievable. In [14], the authors present LinGapE, an algorithm as simple as ours.
However, its sampling and stopping rules are both sub-optimal (e.g., the algorithm needs to sample all arms at least once), which in turn leads to weak performance guarantees, with a sample complexity satisfying $\tau \lesssim K \log(1/\delta)$. Recently, Degenne et al. [19] proposed LinGame and LinGame-C, two track-and-stop algorithms that achieve the best possible sample complexity asymptotically. Their stopping rule is somewhat similar to ours, but has the disadvantage of requiring a boundedness assumption on the unknown parameter µ. Their sampling rule is different from ours and relies on a game-theoretic approach. The latter consists in viewing the information-theoretical constant $T^\star_\mu$ as the value of a zero-sum game between the agent and the environment. It is not clear how computationally efficient this algorithm is, as it requires solving a saddle-point problem in each round. Finally, we would like to mention that [19] was not available to us at the time of submission.

The algorithm we present is as simple as LinGapE, does not run in phases, does not explicitly use the number of arms in its sampling and stopping rules, and has an asymptotically optimal sample complexity, both almost surely and in expectation. We are not aware of any work on best-arm identification in linear bandits with a continuous set of arms. We provide here the first results.

2 Model and Objective

We consider a bandit problem with a set $\mathcal{A} \subset \mathbb{R}^d$ of arms. In round $t \ge 1$, if the decision maker selects arm a, she observes as feedback a random reward $r_t = \mu^\top a + \eta_t$, where $\mu \in \mathbb{R}^d$ is unknown and $(\eta_t)_{t\ge 1}$ is a sequence of i.i.d. Gaussian random variables, $\eta_t \sim \mathcal{N}(0, \sigma^2)$. The objective is to learn the arm $a^\star_\mu = \arg\max_{a\in\mathcal{A}} \mu^\top a$ with the highest expected reward. Throughout the paper, we assume that µ and $\mathcal{A}$ are such that the best arm $a^\star_\mu$ is unique. We also assume that the set of arms $\mathcal{A}$ spans $\mathbb{R}^d$.

A best-arm identification algorithm consists of a sampling rule, a stopping rule, and a decision rule. The sampling rule decides which arm $a_t$ is selected in round t based on past observations: $a_t$ is $\mathcal{F}_{t-1}$-measurable, where $\mathcal{F}_t$ is the σ-algebra generated by $(a_1, r_1, \ldots, a_t, r_t)$. The stopping rule decides when to stop sampling, and is defined by τ, a stopping time w.r.t. the filtration $(\mathcal{F}_t)_{t\ge 1}$. The decision rule outputs a guess $\hat a_\tau$ of the best arm based on the observations collected up to round τ, i.e., $\hat a_\tau$ is $\mathcal{F}_\tau$-measurable. The performance of an identification algorithm is assessed through its probabilistic guarantees and through its sample complexity τ. We consider different probabilistic guarantees, depending on whether the set of arms $\mathcal{A}$ is finite or continuous. Specifically, for ε, δ > 0:

Definition 1 (Finite set of arms $\mathcal{A}$). An algorithm is δ-PAC if for all µ, $\mathbb{P}_\mu[\hat a_\tau \neq a^\star_\mu] \le \delta$ and $\mathbb{P}_\mu[\tau < \infty] = 1$.

Definition 2 (Continuous set of arms $\mathcal{A}$). An algorithm is (ε, δ)-PAC if for all µ, $\mathbb{P}_\mu[\mu^\top(a^\star_\mu - \hat a_\tau) > \varepsilon] \le \delta$ and $\mathbb{P}_\mu[\tau < \infty] = 1$.

When the set of arms $\mathcal{A}$ is finite (resp. continuous), the objective is to devise a δ-PAC (resp. (ε, δ)-PAC) algorithm with minimal expected sample complexity $\mathbb{E}_\mu[\tau]$.

Notation. Let $K = |\mathcal{A}|$ when $\mathcal{A}$ is finite. $\Lambda = \{x \in [0, 1]^{\mathcal{A}} : \sum_{a\in\mathcal{A}} x_a = 1\}$ denotes the simplex in dimension K. For $a, b \in [0, 1]$, $\mathrm{kl}(a, b)$ is the KL divergence between two Bernoulli distributions of respective means a and b. For any $w, w' \in \mathbb{R}^{\mathcal{A}}$, we denote $d_\infty(w, w') = \max_{a\in\mathcal{A}} |w_a - w'_a|$, and for any compact set $C \subset \mathbb{R}^{\mathcal{A}}$, $d_\infty(w, C) = \min_{w'\in C} d_\infty(w, w')$. For $w \in \mathbb{R}^{\mathcal{A}}$, $\mathrm{supp}(w) = \{a \in \mathcal{A} : w_a \neq 0\}$ denotes the support of w. $\mathbb{P}_\mu$ (resp. $\mathbb{E}_\mu$) denotes the probability measure (resp. expectation) of observations generated under µ; in the absence of ambiguity, we simply write $\mathbb{P}$ (resp. $\mathbb{E}$). For two functions f and g, we write $f \lesssim g$ iff there exists a universal constant C such that for all x, $f(x) \le C g(x)$.
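To make the model concrete, here is a minimal Python sketch of the reward feedback $r_t = \mu^\top a + \eta_t$; the class and method names are ours (not from the paper), and later sketches in this document assume such an environment.

```python
import numpy as np

class LinearBanditEnv:
    """Stochastic linear bandit: pulling arm a yields r = mu^T a + Gaussian noise."""

    def __init__(self, arms, mu, sigma=1.0, seed=0):
        self.arms = np.asarray(arms, dtype=float)    # arm set A, shape (K, d)
        self.mu = np.asarray(mu, dtype=float)        # unknown parameter mu, shape (d,)
        self.sigma = sigma                           # noise standard deviation
        self.rng = np.random.default_rng(seed)

    def pull(self, idx):
        """Observe a noisy reward for the arm with index idx."""
        a = self.arms[idx]
        return float(a @ self.mu + self.sigma * self.rng.normal())

    def best_arm(self):
        """Index of a*_mu = argmax_a mu^T a (only used to evaluate a guess)."""
        return int(np.argmax(self.arms @ self.mu))
```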
3 Finite set of arms

Consider a finite set $\mathcal{A}$ of K arms. We first recall existing lower bounds on the expected sample complexity of δ-PAC algorithms, and then present our algorithm along with an analysis of its sample complexity.

3.1 Sample complexity lower bound

Soare [17] derived the following sample complexity lower bound, using the method developed by Garivier and Kaufmann [20] in the case of bandits without structure.

Theorem 1. The sample complexity of any δ-PAC algorithm satisfies: for all µ, $\mathbb{E}_\mu[\tau] \ge \sigma^2\, T^\star_\mu\, \mathrm{kl}(\delta, 1-\delta)$, where
$$ (T^\star_\mu)^{-1} = \sup_{w\in\Lambda}\ \min_{a\in\mathcal{A}\setminus\{a^\star_\mu\}}\ \frac{\big(\mu^\top(a^\star_\mu - a)\big)^2}{2\,(a^\star_\mu - a)^\top \big(\sum_{a'\in\mathcal{A}} w_{a'}\, a' a'^\top\big)^{-1} (a^\star_\mu - a)}. \qquad (1)$$

In the above lower bound, w may be interpreted as the proportions of arm draws, also referred to as an allocation. For $a \in \mathcal{A}$, $w_a$ represents the fraction of rounds where arm a is selected. This interpretation stems from the proof of Theorem 1, where $w_a = \mathbb{E}_\mu[N_a(\tau)]/\mathbb{E}_\mu[\tau]$ and $N_a(t)$ is the number of times a is selected up to and including round t (see [17]). The lower bound is obtained by taking the supremum over w, i.e., over the best possible allocation. A different way to define $(T^\star_\mu)^{-1}$ is $\sup_{w\in\Lambda} \psi(\mu, w)$ (a convex program) [17], where
$$ \psi(\mu, w) = \min_{\{\lambda :\ \exists a \neq a^\star_\mu,\ \lambda^\top(a^\star_\mu - a) < 0\}}\ \frac{1}{2}(\mu - \lambda)^\top \Big(\sum_{a\in\mathcal{A}} w_a\, a a^\top\Big)(\mu - \lambda). \qquad (2)$$
The next lemmas, proved in Appendix B, confirm that the two definitions of $T^\star_\mu$ are equivalent, and provide useful properties of the function ψ and of its maximizers.

Lemma 1. We have:
$$ \psi(\mu, w) = \begin{cases} \min_{a\in\mathcal{A}\setminus\{a^\star_\mu\}} \dfrac{\langle \mu,\, a^\star_\mu - a\rangle^2}{2\,(a^\star_\mu - a)^\top\big(\sum_{i=1}^K w_i a_i a_i^\top\big)^{-1}(a^\star_\mu - a)} & \text{if } \sum_{a\in\mathcal{A}} w_a\, a a^\top \succ 0, \\[2mm] 0 & \text{otherwise.}\end{cases} \qquad (3)$$
In addition, ψ is continuous in both µ and w, and $w \mapsto \psi(\mu, w)$ attains its maximum over Λ at a point $w^\star_\mu$ such that $\sum_{a\in\mathcal{A}} (w^\star_\mu)_a\, a a^\top$ is invertible.

Lemma 2. (Maximum theorem) Let $\mu \in \mathbb{R}^d$ be such that $a^\star_\mu$ is unique. Define $\psi^\star(\mu) = \max_{w\in\Lambda} \psi(\mu, w)$ and $C^\star(\mu) = \arg\max_{w\in\Lambda} \psi(\mu, w)$. Then $\psi^\star$ is continuous at µ, and $C^\star(\mu)$ is convex, compact and non-empty. Furthermore, for any open neighborhood $\mathcal{V}$ of $C^\star(\mu)$, there exists an open neighborhood $\mathcal{U}$ of µ such that for all $\mu' \in \mathcal{U}$, we have $C^\star(\mu') \subseteq \mathcal{V}$.

3.2 Least-squares estimator

Our algorithm and its analysis rely on the least-squares estimator of µ and on its performance. This estimator, based on the observations collected in the first t rounds, is $\hat\mu_t = \big(\sum_{s=1}^t a_s a_s^\top\big)^{\dagger}\big(\sum_{s=1}^t a_s r_s\big)$, where † denotes the pseudo-inverse. The following result provides a sufficient condition on the sampling rule for the convergence of $\hat\mu_t$ to µ. This condition depends on the asymptotic spectral properties of the covariates matrix $\sum_{s=1}^t a_s a_s^\top$. We also provide a concentration result for the least-squares estimator. Refer to Appendix C for the proofs of the following lemmas.

Lemma 3. Assume that the sampling rule satisfies $\liminf_{t\to\infty} \lambda_{\min}\big(\frac{1}{t^\alpha}\sum_{s=1}^t a_s a_s^\top\big) > 0$ almost surely (a.s.), for some $\alpha \in (0, 1)$. Then $\lim_{t\to\infty} \hat\mu_t = \mu$ a.s. More precisely, for all $\beta \in (0, \alpha/2)$, $\|\hat\mu_t - \mu\| = o(t^{-\beta})$ a.s.

Lemma 4. Let α > 0 and $L = \max_{a\in\mathcal{A}} \|a\|$. Assume that $\lambda_{\min}\big(\sum_{s=1}^t a_s a_s^\top\big) \ge c\, t^\alpha$ a.s. for all $t \ge t_0$, for some $t_0 \ge 0$ and some constant c > 0. Then for all $t \ge t_0$,
$$ \mathbb{P}\big(\|\hat\mu_t - \mu\| \ge \varepsilon\big) \le \big(c^{-1/2} L\big)^d\, t^{d/2}\, \exp\Big(-\frac{c\,\varepsilon^2\, t^\alpha}{2\sigma^2}\Big). $$

The least-squares estimator is used in our decision rule. After the algorithm stops, at round τ, it returns the arm $\hat a_\tau \in \arg\max_{a\in\mathcal{A}} \hat\mu_\tau^\top a$.
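As an illustration, the decision rule's estimate can be computed directly from the arms played and the rewards observed; the sketch below (function and variable names are ours) uses a pseudo-inverse so that the estimate is defined even before the covariates matrix becomes invertible.

```python
import numpy as np

def least_squares_estimate(arms_played, rewards):
    """Compute mu_hat_t = (sum_s a_s a_s^T)^+ (sum_s a_s r_s)."""
    A = np.asarray(arms_played, dtype=float)   # stacked covariates, shape (t, d)
    r = np.asarray(rewards, dtype=float)       # observed rewards, shape (t,)
    design = A.T @ A                           # sum_s a_s a_s^T
    moment = A.T @ r                           # sum_s a_s r_s
    return np.linalg.pinv(design) @ moment     # pseudo-inverse handles rank deficiency
```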
3.3 Sampling rule

To design an algorithm with minimal sample complexity, the sampling rule should match the optimal proportions of arm draws, i.e., an allocation in the set $C^\star(\mu)$. Since µ is unknown, our sampling rule will track, in round t, allocations in the plug-in estimate $C^\star(\hat\mu_t)$. To successfully apply this certainty-equivalence principle, we need to at least make sure that, under our sampling rule, $\hat\mu_t$ converges to µ. Using Lemma 3, we can design a family of sampling rules with this guarantee.

Lemma 5. (Forced exploration) Let $A_0 = \{a_0(1), \ldots, a_0(d)\} \subseteq \mathcal{A}$ be such that $\lambda_{\min}\big(\sum_{a\in A_0} a a^\top\big) > 0$. Let $(b_t)_{t\ge 1}$ be an arbitrary sequence of arms. Furthermore, define for all $t \ge 1$, $f(t) = c_{A_0}\sqrt{t}$, where $c_{A_0} = \lambda_{\min}\big(\sum_{a\in A_0} a a^\top\big)/\sqrt{d}$. Consider the sampling rule defined recursively as: $i_0 = 1$, and for $t \ge 0$,
$$ i_{t+1} = (i_t \bmod d) + \mathbb{1}\Big\{\lambda_{\min}\Big(\sum_{s=1}^{t} a_s a_s^\top\Big) < f(t)\Big\}, \qquad a_{t+1} = \begin{cases} a_0(i_{t+1}) & \text{if } \lambda_{\min}\big(\sum_{s=1}^{t} a_s a_s^\top\big) < f(t), \\ b_{t+1} & \text{otherwise.} \end{cases} \qquad (5)$$
Then the sampling rule (5) satisfies $\liminf_{t} \lambda_{\min}\big(\frac{1}{\sqrt{t}}\sum_{s=1}^t a_s a_s^\top\big) > 0$ a.s., so that by Lemma 3, $\lim_t \hat\mu_t = \mu$ a.s.

The tracking rule. The sequence $(b_t)_{t\ge 1}$ in (5) is used to track a sequence of allocations $(w(t))_{t\ge 1}$, as formalized in the following lemma.

Lemma 6. (Tracking) Let $C \subseteq \Lambda$ be a compact set, and let $(w(t))_{t\ge 1}$ be a sequence in Λ such that for all ε > 0, there exists $t_0(\varepsilon) \ge 1$ such that for all $t \ge t_0(\varepsilon)$, $d_\infty(w(t), C) \le \varepsilon$. Consider a sampling rule defined by (5) and
$$ b_{t+1} \in \arg\min_{a \in \mathrm{supp}(\sum_{s=1}^{t} w(s))} \Big( N_a(t) - \sum_{s=1}^{t} w_a(s) \Big), \qquad (6)$$
where $N_a(0) = 0$ and, for $t \ge 0$, $N_a(t+1) = N_a(t) + \mathbb{1}\{a_{t+1} = a\}$. Then there exists $t_1(\varepsilon) \ge t_0(\varepsilon)$ such that for all $t \ge t_1(\varepsilon)$, $d_\infty\big((N_a(t)/t)_{a\in\mathcal{A}}, C\big) \le (p_t + d - 1)\varepsilon$, where $p_t = \big|\mathrm{supp}\big(\sum_{s=1}^{t} w(s)\big)\setminus A_0\big| \le K - d$.

The lazy tracking rule. The design of our tracking rule is completed by choosing the sequence $(w(t))_{t\ge 1}$ in (6). The only requirement we actually impose on this sequence is the following condition: there exists a non-decreasing sequence $(\ell(t))_{t\ge 1}$ of integers with $\ell(1) = 1$, $\ell(t) \le t - 1$ for $t > 1$, and $\lim_t \ell(t) = \infty$, such that
$$ \lim_{t\to\infty}\ \min_{s \ge \ell(t)}\ d_\infty\big(w(t), C^\star(\hat\mu_s)\big) = 0 \quad \text{a.s.} \qquad (7)$$
This condition is referred to as the lazy condition, since it is very easy to ensure in practice. For example, it holds for the following lazy tracking rule. Let $\mathcal{T} = \{t_n : n \ge 1\}$ be a deterministic increasing set of integers such that $t_n \to \infty$ as $n \to \infty$; we can define $(w(t))_{t\ge 1}$ so that it tracks $C^\star(\hat\mu_t)$ only when $t \in \mathcal{T}$. Specifically, if $t \in \mathcal{T}$, $w(t+1) \in C^\star(\hat\mu_t)$, and $w(t+1) = w(t)$ otherwise. For this sequence, (7) holds with $\ell(t) = t - 1$. The lazy condition is sufficient to guarantee the almost sure asymptotic optimality of the algorithm. To achieve optimality in terms of expected sample complexity, we will need a slightly stronger condition, also easily satisfied under the above lazy tracking rules; see the details in Section 3.5.

The following proposition states that the lazy sampling rule is able to track the set $C^\star(\mu)$. It follows from the fact that $\hat\mu_t$ converges to µ (thanks to Lemmas 3 and 5), and from combining the maximum theorem (Lemma 2) with Lemma 6. All proofs related to the sampling rule are presented in Appendix E.

Proposition 1. Under any sampling rule (5)-(6) satisfying the lazy condition (7), the proportions of arm draws approach $C^\star(\mu)$: $\lim_t d_\infty\big((N_a(t)/t)_{a\in\mathcal{A}}, C^\star(\mu)\big) = 0$ a.s.

3.4 Stopping rule

We use the classical Chernoff stopping rule. Define the generalized log-likelihood ratio, for every pair of arms $a, b \in \mathcal{A}$, $t \ge 1$, and $\varepsilon \ge 0$, as
$$ Z_{a,b,\varepsilon}(t) = \log \frac{\max_{\{\mu' :\ \mu'^\top(a-b) \ge -\varepsilon\}} f_{\mu'}(r_t, a_t, \ldots, r_1, a_1)}{\max_{\{\mu' :\ \mu'^\top(a-b) \le -\varepsilon\}} f_{\mu'}(r_t, a_t, \ldots, r_1, a_1)}, $$
where $f_{\mu'}(r_t, a_t, \ldots, r_1, a_1) \propto \exp\big(-\frac{1}{2}\sum_{s=1}^t (r_s - \mu'^\top a_s)^2\big)$ under our Gaussian noise assumption. We may actually derive an explicit expression of $Z_{a,b,\varepsilon}(t)$ (see Appendix D for a proof):

Lemma 7. Let $t \ge 1$ and assume that $\sum_{s=1}^t a_s a_s^\top \succ 0$. For all $a, b \in \mathcal{A}$, we have:
$$ Z_{a,b,\varepsilon}(t) = \mathrm{sgn}\big(\hat\mu_t^\top(a - b) + \varepsilon\big)\ \frac{\big(\hat\mu_t^\top(a - b) + \varepsilon\big)^2}{2\,(a - b)^\top\big(\sum_{s=1}^t a_s a_s^\top\big)^{-1}(a - b)}. $$

Here we use $Z_{a,b}(t) = Z_{a,b,0}(t)$ ($Z_{a,b,\varepsilon}$ will be used in the case of a continuous set of arms). Note that $Z_{a,b}(t) \ge 0$ for all $b \in \mathcal{A}$ if and only if $a \in \arg\max_{a'\in\mathcal{A}} \hat\mu_t^\top a'$. Denoting $Z(t) = \max_{a\in\mathcal{A}} \min_{b\in\mathcal{A}\setminus\{a\}} Z_{a,b}(t)$, the stopping rule is defined as follows:
$$ \tau = \inf\Big\{ t \in \mathbb{N} : Z(t) > \beta(\delta, t)\ \text{ and }\ \sum_{s=1}^t a_s a_s^\top \succeq c\, I_d \Big\}, \qquad (8)$$
where $\beta(\delta, t)$ is referred to as the exploration threshold and c is some positive constant (refer to Remark 1 for a convenient choice of c).
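A minimal Python sketch of the statistic of Lemma 7 and of the stopping test (8), using the exploration threshold β(δ, t) defined in Proposition 2 below; the function names and default parameters are ours, and the covariates matrix is assumed invertible once the second condition of (8) holds.

```python
import numpy as np

def glr_statistic(mu_hat, V_inv, a, b, eps=0.0):
    """Z_{a,b,eps}(t) = sgn(m) * m^2 / (2 (a-b)^T V_t^{-1} (a-b)),  with m = mu_hat^T (a-b) + eps."""
    gap = float(mu_hat @ (a - b)) + eps
    quad = float((a - b) @ V_inv @ (a - b))        # (a-b)^T (sum_s a_s a_s^T)^{-1} (a-b)
    return np.sign(gap) * gap ** 2 / (2.0 * quad)

def should_stop(arms, mu_hat, V, delta, sigma2=1.0, u=1.0, c=1.0):
    """Stopping test (8)-(9): Z(t) > beta(delta, t) and sum_s a_s a_s^T >= c I_d."""
    d = V.shape[0]
    if np.linalg.eigvalsh(V).min() < c:            # second condition of (8) not yet met
        return False
    V_inv = np.linalg.inv(V)
    best = int(np.argmax(arms @ mu_hat))           # Z(t) is attained at the empirical best arm
    Z = min(glr_statistic(mu_hat, V_inv, arms[best], arms[b])
            for b in range(len(arms)) if b != best)
    # beta(delta, t) = (1 + u) sigma^2 log( det((u c)^{-1} V_t + I_d)^{1/2} / delta )
    beta = (1.0 + u) * sigma2 * (0.5 * np.linalg.slogdet(V / (u * c) + np.eye(d))[1]
                                 - np.log(delta))
    return Z > beta
```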
The exploration threshold β(δ, t) should be chosen so that the algorithm is δ-PAC. We also wish to design a threshold that does not depend on the number K of arms. These requirements lead to the exploration threshold defined in the proposition below (its proof is presented in Appendix D and relies on a concentration result for self-normalized processes [9]).

Proposition 2. Let u > 0, and define:
$$ \beta(\delta, t) = (1 + u)\,\sigma^2 \log\Big(\det\big((uc)^{-1}\textstyle\sum_{s=1}^t a_s a_s^\top + I_d\big)^{\frac{1}{2}}\,\delta^{-1}\Big). \qquad (9)$$
Under any sampling rule, and a stopping rule (8) with exploration threshold (9), we have: $\mathbb{P}\big[\tau < \infty,\ \mu^\top(a^\star_\mu - \hat a_\tau) > 0\big] \le \delta$.

The above proposition is valid for any sampling rule, but only ensures that if the algorithm stops, it does not make any mistake w.p. 1 − δ. To get a δ-PAC algorithm, we need to specify the sampling rule.

Remark 1. (Choosing c and u) A convenient choice for the constant c involved in (8) and (9) is $c = \max_{a\in\mathcal{A}} \|a\|^2$. With this choice, we have $\det\big(c^{-1}\sum_{s=1}^t a_s a_s^\top + I_d\big) \le (t+1)^d$. The constant u should be chosen so that the threshold in (9) is lowered; for instance, one may choose u = 1. With these choices, the threshold can be as simple as $\beta(\delta, t) = 2\sigma^2 \log(t^{d/2}/\delta)$. In addition, if we use a sampling rule with forced exploration as in (5), then in view of Lemma 5, the second stopping condition $\sum_{s=1}^t a_s a_s^\top \succeq c\, I_d$ is satisfied as soon as t exceeds $d + 1 + c^2 d / \lambda_{\min}\big(\sum_{a\in A_0} a a^\top\big)^2$.

3.5 Sample complexity analysis

In this section, we establish that combining a sampling rule (5)-(6) satisfying the lazy condition (7) with the stopping rule (8)-(9) yields an asymptotically optimal algorithm. Refer to Appendix F for proofs. An example of such an algorithm is the Lazy Track-and-Stop (LT&S) algorithm, whose pseudo-code is presented in Algorithm 1; LT&S updates the tracking rule only in rounds belonging to a set $\mathcal{T}$.

Algorithm 1: Lazy Track-and-Stop (LT&S)
    Input: arms $\mathcal{A}$, confidence level δ, set $\mathcal{T}$ of lazy updates.
    Initialization: t = 0, i = 0, covariates matrix $A_0 = 0$, Z(0) = 0, $N(0) = (N_a(0))_{a\in\mathcal{A}} = 0$.
    while $\lambda_{\min}(A_t) < c$ or $Z(t) < \beta(\delta, t)$ do
        if $\lambda_{\min}(A_t) < f(t)$ then $a \leftarrow a_0(i+1)$, $i \leftarrow (i+1) \bmod d$   (forced exploration)
        else $a \leftarrow \arg\min_{b\in\mathrm{supp}(\sum_{s=1}^t w(s))} \big(N_b(t) - \sum_{s=1}^t w_b(s)\big)$   (tracking)
        $t \leftarrow t + 1$; sample arm a and update N(t), $\hat\mu_t$, Z(t); $A_t \leftarrow A_{t-1} + a a^\top$; $w(t) \leftarrow w(t-1)$
        if $t \in \mathcal{T}$ then $w(t) \leftarrow \arg\max_{w\in\Lambda} \psi(\hat\mu_t, w)$
    return $\hat a_\tau \in \arg\max_{a\in\mathcal{A}} \hat\mu_\tau^\top a$

Theorem 2. (Almost sure sample complexity upper bound) An algorithm defined by (5)-(6)-(8)-(9) with a lazy sampling rule (satisfying (7)) is δ-PAC. Its sample complexity verifies:
$$ \mathbb{P}\Big(\limsup_{\delta\to 0} \frac{\tau}{\log(1/\delta)} \le \sigma^2 T^\star_\mu\Big) = 1. $$

To obtain an algorithm with optimal expected sample complexity, we need to consider lazy tracking rules that satisfy the following condition: there exist α > 0 and a non-decreasing sequence $(\ell(t))_{t\ge 1}$ of integers with $\ell(1) = 1$, $\ell(t) \le t$ and $\liminf_t \ell(t)/t^\gamma > 0$ for some γ > 0, such that
$$ \forall \varepsilon > 0,\ \exists h(\varepsilon) :\ \forall t \ge 1,\quad \mathbb{P}\Big(\min_{s \ge \ell(t)} d_\infty\big(w(t), C^\star(\hat\mu_s)\big) > \varepsilon\Big) \le \frac{h(\varepsilon)}{t^{2+\alpha}}. \qquad (10)$$
The condition (10) is again easy to ensure. Assume that we update w(t) only if $t \in \mathcal{T} = \{t_n : n \ge 1\}$, where $(t_n)$ is an increasing sequence of integers such that $t_n \to \infty$ as $n \to \infty$. Then (10) holds for the sequence $(\ell(t))_{t\ge 1}$ such that $\ell(t_{i+1}) = t_i$ for all i, provided that $\liminf_n t_n / t_{n+1}^\gamma > 0$ for some γ > 0. Examples include: (i) periodic updates of w(t): $\mathcal{T} = \{1 + kP, k \in \mathbb{N}\}$ and $\ell(t) = \max\{1, t - P\}$; (ii) exponential updates: $\mathcal{T} = \{2^k, k \in \mathbb{N}\}$ and $\ell(t) = \max\{1, \lfloor t/2 \rfloor\}$. The condition (10) may seem loose, but we have to keep in mind that, in practice, the performance of the algorithm will depend on the update frequency of w(t). For asymptotic optimality, however, (10) is enough (the key point is to have some concentration of $\hat\mu_t$ around µ, which is guaranteed via the forced exploration part of the sampling rule).
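For concreteness, here is a Python sketch of the main loop of Algorithm 1, reusing the LinearBanditEnv and should_stop sketches given earlier. The allocation oracle optimal_allocation (standing for $\arg\max_{w\in\Lambda}\psi(\hat\mu_t, w)$, e.g. computed with Frank-Wolfe as in Section 5), the forced-exploration constant c_A0, and all names are our illustrative choices rather than the paper's exact ones.

```python
import numpy as np

def lazy_track_and_stop(env, arms, A0_idx, delta, sigma2, optimal_allocation,
                        lazy_set, max_rounds=100_000):
    """Sketch of the LT&S main loop (Algorithm 1). `optimal_allocation(mu_hat)` should
    return an allocation in C*(mu_hat); `lazy_set` is the set T of update rounds;
    `A0_idx` indexes the d forced-exploration arms."""
    K, d = arms.shape
    c = float(np.max(np.sum(arms ** 2, axis=1)))   # c = max_a ||a||^2, u = 1 (Remark 1)
    # assumed forced-exploration constant in f(t) = c_A0 * sqrt(t) (cf. Lemma 5)
    c_A0 = np.linalg.eigvalsh(arms[A0_idx].T @ arms[A0_idx]).min() / np.sqrt(d)
    V = np.zeros((d, d))        # covariates matrix  sum_s a_s a_s^T
    S = np.zeros(d)             # sum_s a_s r_s
    N = np.zeros(K)             # arm draw counts
    w = np.ones(K) / K          # current target allocation w(t)
    w_sum = np.zeros(K)         # running sum w(1) + ... + w(t)
    i = 0                       # round-robin index over A0
    for t in range(max_rounds):
        if np.linalg.eigvalsh(V).min() < c_A0 * np.sqrt(max(t, 1)):   # forced exploration
            a = int(A0_idx[i])
            i = (i + 1) % d
        else:                                                          # tracking rule (6)
            support = np.where(w_sum > 0)[0]
            a = int(support[np.argmin(N[support] - w_sum[support])])
        r = env.pull(a)
        N[a] += 1
        V += np.outer(arms[a], arms[a])
        S += arms[a] * r
        mu_hat = np.linalg.pinv(V) @ S
        if (t + 1) in lazy_set:              # lazy update of the target allocation
            w = optimal_allocation(mu_hat)
        w_sum += w
        if should_stop(arms, mu_hat, V, delta, sigma2=sigma2, u=1.0, c=c):
            return int(np.argmax(arms @ mu_hat)), t + 1
    return int(np.argmax(arms @ mu_hat)), max_rounds
```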
Theorem 3. (Expected sample complexity upper bound) An algorithm defined by (5)-(6)-(8)-(9) with a sampling rule satisfying (7) and (10) is δ-PAC. Its sample complexity verifies:
$$ \limsup_{\delta\to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \le \sigma^2 T^\star_\mu. $$

4 Continuous set of arms

We now investigate the case where $\mathcal{A} = \mathbb{S}^{d-1}$ is the (d − 1)-dimensional unit sphere. Without loss of generality, we restrict our attention to problems where $\mu \in \mathcal{M}(\varepsilon_0) = \{\eta : \eta^\top a^\star_\eta > \varepsilon_0\}$ for some $\varepsilon_0 > 0$. The results of this section are proved in Appendix G.

4.1 Sample complexity lower bound

Theorem 4. Let $\varepsilon \in (0, \varepsilon_0/5)$ and $\delta \in (0, 1)$. The sample complexity of any (ε, δ)-PAC algorithm satisfies: for all $\mu \in \mathcal{M}(\varepsilon_0)$,
$$ \mathbb{E}_\mu[\tau] \ge \frac{\sigma^2 (d-1)}{20\,\|\mu\|\,\varepsilon}\, \mathrm{kl}(\delta, 1 - \delta). $$

The above theorem is obtained by first applying the classical change-of-measure argument (see e.g. Lemma 19 in [20]). Such an argument implies that under any (ε, δ)-PAC algorithm, for all confusing parameters λ such that $\{a \in \mathbb{S}^{d-1} : \mu^\top(a^\star_\mu - a) \le \varepsilon\}$ and $\{a \in \mathbb{S}^{d-1} : \lambda^\top(a^\star_\lambda - a) \le \varepsilon\}$ are disjoint, we have $(\mu - \lambda)^\top \mathbb{E}_\mu\big[\sum_{s=1}^{\tau} a_s a_s^\top\big](\mu - \lambda) \ge 2\sigma^2\,\mathrm{kl}(\delta, 1-\delta)$. We then study the solution of the following max-min problem: $\max_{(a_t)_{t\ge 1}} \min_{\lambda \in B_\varepsilon(\mu)} (\mu - \lambda)^\top \mathbb{E}\big[\sum_{s=1}^{\tau} a_s a_s^\top\big](\mu - \lambda)$, where $B_\varepsilon(\mu)$ denotes the set of confusing parameters. The continuous set of arms makes this analysis challenging. We show that the value of the max-min problem is at most of order $\mathbb{E}_\mu[\tau]\,\|\mu\|\,\varepsilon/(d-1)$, which leads to the claimed lower bound.

4.2 Algorithm

We present a simple algorithm whose sample complexity approaches our lower bound. We describe its three components below. The decision rule is the same as before, based on the least-squares estimator of µ: $\hat a_t \in \arg\max_{a\in\mathcal{A}} \hat\mu_t^\top a$.

Sampling rule. Let $U = \{u_1, u_2, \ldots, u_d\}$ be a subset of $\mathbb{S}^{d-1}$ that forms an orthonormal basis of $\mathbb{R}^d$. The sampling rule just consists in selecting arms from U in a round-robin manner: for all $t \ge 1$, $a_t = u_{(t \bmod d) + 1}$.

Stopping rule. As in the case of a finite set of arms, the stopping rule relies on a generalized log-likelihood ratio test. Define $Z(t) = \inf_{\{b\in\mathcal{A} : |\hat\mu_t^\top(\hat a_t - b)| \le \varepsilon_t\}} Z_{\hat a_t, b, \varepsilon_t}(t)$, where an expression of $Z_{\hat a_t, b, \varepsilon_t}(t)$ is given in Lemma 7. We consider the following stopping time:
$$ \tau = \inf\Big\{ t \in \mathbb{N} : Z(t) \ge \beta(\delta, t)\ \text{ and }\ \lambda_{\min}\Big(\sum_{s=1}^t a_s a_s^\top\Big) \ge \max\big(c, \rho(\delta, t)\big) \Big\}. \qquad (11)$$
Hence, compared to the case of a finite set of arms, we add a stopping condition defined by the threshold ρ(δ, t) and related to the spectral properties of the covariates matrix.

Proposition 3. Let $(\delta_t)_{t\ge 1}$ and $(\varepsilon_t)_{t\ge 1}$ be two sequences with values in (0, 1) and (0, ε), respectively, such that $\sum_{t=1}^\infty \delta_t < \delta$ and $\lim_t \varepsilon_t = \varepsilon$. Let $\zeta_t = \log\Big(2\det\big(c^{-1}\sum_{s=1}^t a_s a_s^\top + I_d\big)^{\frac{1}{2}}\Big) - \log(\delta_t)$, and define:
$$ \beta(\delta, t) = 2\sigma^2 \zeta_t \quad \text{and} \quad \rho(\delta, t) = \frac{4\sigma^2 \varepsilon_t^2\, \zeta_t}{(\varepsilon - \varepsilon_t)^2}. \qquad (12)$$
Then, under the stopping rule (11)-(12), we have: $\mathbb{P}_\mu\big[\tau < \infty,\ \mu^\top(a^\star_\mu - \hat a_\tau) > \varepsilon\big] \le \delta$.

4.3 Sample complexity analysis

Under specific choices for the sequence $(\varepsilon_t)_{t\ge 1}$, we can analyze the sample complexity of our algorithm and show its optimality order-wise.

Theorem 5. Choose in the stopping rule $\varepsilon_t = \varepsilon\big(1 + \varepsilon\,(4\sigma^2 \log(4/\delta_t))^{-1/2}\big)^{-1}$ (observe that $\varepsilon_t < \varepsilon$ and $\lim_t \varepsilon_t = \varepsilon$). Then, under the aforementioned sampling rule and the stopping rule (11)-(12), we have:
$$ \mathbb{P}\Big(\limsup_{\delta\to 0} \frac{\tau}{\log(1/\delta)} \lesssim \frac{\sigma^2 d}{\|\mu\|\,\varepsilon}\Big) = 1 \quad \text{and} \quad \limsup_{\delta\to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \lesssim \frac{\sigma^2 d}{\|\mu\|\,\varepsilon}. $$
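As an illustration of the continuous-arm procedure, the sketch below (all names ours) samples the canonical orthonormal basis of R^d in a round-robin fashion and returns the least-squares guess projected on the sphere; the stopping rule (11)-(12) is replaced by a fixed budget to keep the example short.

```python
import numpy as np

def sphere_round_robin(mu, sigma, n_rounds, rng=None):
    """Round-robin sampling over an orthonormal basis (Section 4.2), fixed budget.
    Returns hat_a = mu_hat / ||mu_hat||, the maximizer of mu_hat^T a over the unit sphere."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = len(mu)
    basis = np.eye(d)                              # U = {u_1, ..., u_d}
    V = np.zeros((d, d))                           # sum_s a_s a_s^T
    S = np.zeros(d)                                # sum_s a_s r_s
    for t in range(1, n_rounds + 1):
        a = basis[t % d]                           # a_t = u_{(t mod d)+1} (0-based here)
        r = float(mu @ a) + sigma * rng.normal()   # observed reward mu^T a_t + eta_t
        V += np.outer(a, a)
        S += a * r
    mu_hat = np.linalg.pinv(V) @ S
    return mu_hat / np.linalg.norm(mu_hat)
```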
5 Experiments

We present here a few experimental results comparing the performance of our algorithm to that of RAGE, the state-of-the-art algorithm [16], in the case of a finite set of arms. We compare LT&S and RAGE only, because they outperform other existing algorithms. Further experimental results can be found in Appendix A.

Experimental set-up. We use the following toy experiment, which corresponds to the "many arms" example in [16]: d = 2 and $\mathcal{A} = \{(1, 0),\ e^{j3\pi/4},\ e^{j(\pi/4 + \phi_i)},\ i \in [n-2]\} \subset \mathbb{C}$ (arms viewed as points of $\mathbb{R}^2$), where the $\phi_i$ are i.i.d. $\mathcal{N}(0, 0.09)$, and µ = (1, 0). Experiments are run with the risk δ = 0.05.

Implementation of LT&S. To update the allocation w(t), we use the Frank-Wolfe algorithm [21] (without any rounding procedure). Our implementation of Frank-Wolfe is similar to that of Fiez et al. [16] (refer to Appendix A.3 for further comments). At each update, the previous allocation is fed as the initial value of the new optimization problem. We implement the exponential lazy update scheme $\mathcal{T} = \{2^k, k \in \mathbb{N}\}$. The parameters of our stopping rule are $c = c_{A_0}\sqrt{d}$ (so that after d steps the second condition of the stopping rule is satisfied) and u = 1; we use the threshold $\beta(6\delta/(\pi^2 t^2), t)$. The initial exploration set $A_0$ is chosen at random. We implemented two versions of LT&S. The first one does not track the average but only the current allocation w(t): $a \in \arg\min_{b\in\mathrm{supp}(w(t))} \big(N_b(t) - t\, w_b(t)\big)$. The second version tracks the average allocation, as described in Algorithm 1. We further compare our results to those of the Oracle algorithm proposed by [12]. The Oracle samples from a true optimal allocation $w \in C^\star(\mu)$, and applies a stopping rule that depends on K.

Results. From the table below, LT&S outperforms RAGE most of the time, and the performance improvement grows with the number of arms. LT&S without averaging shows better performance than with averaging. In Appendix A, we present results for another version of LT&S, with even better performance.

| Number of arms | LT&S (no averaging), Mean (Std) | LT&S, Mean (Std) | RAGE, Mean (Std) | Oracle, Mean (Std) |
|---|---|---|---|---|
| K = 1000 | 1206.55 (42.2) | 1409 (57) | 1148.45 (49.82) | 476.45 (40.74) |
| K = 2500 | 1253.60 (47.70) | 1404 (57) | 1440.75 (149.24) | 492.15 (43.88) |
| K = 5000 | 1247.05 (81.07) | 1401 (86) | 1540.3 (158.90) | 515.60 (47.64) |
| K = 7500 | 1296.55 (76.78) | 1434 (78) | 1598.0 (164.60) | 547.65 (45.77) |

Table 1: Sample complexity (mean and standard deviation) for the "many arms" experiment of [16].

6 Conclusion

In this paper, we present LT&S, an algorithm that solves the best-arm identification problem in stochastic linear bandits. The sampling rule of the algorithm simply tracks the optimal allocation predicted by the sample complexity lower bound. Its stopping rule is defined through a generalized log-likelihood ratio and an exploration threshold that does not depend on the number of arms, but on the ambient dimension only. LT&S is asymptotically optimal: we have guarantees on its sample complexity, both almost surely and in expectation. The first experimental results are very promising, as LT&S seems to exhibit a much better sample complexity than existing algorithms. We also provide the first results on the pure exploration problem in linear bandits with a continuous set of arms.

The analysis presented in this paper suggests several extensions. We can easily generalize the results to non-Gaussian reward distributions (e.g., bounded, or from a one-parameter exponential family). It would be interesting to extend our results in the continuous setting to generic convex sets of arms (we believe that the instance-specific sample complexity lower bound would then depend only on the local smoothness of the set of arms around the best arm). A more challenging but exciting question is to derive tight non-asymptotic sample complexity upper bounds for LT&S, so as to characterize the trade-off between the laziness of the algorithm and its sample complexity.

Broader impact

This work is mostly theoretical. Our results may provide guidelines and insights towards an improved design of algorithms for linear bandits.
Linear bandit algorithms are versatile, and are in particular used in clinical trials and recommendation systems. Hence our results can benefit users and developers of such systems. In clinical trials, minimizing the sample complexity is crucial, and the use of our algorithm there can be really beneficial. Our algorithm has the additional advantage of being computationally efficient. Overall, we do not foresee any direct negative impact of our work. However, it is worth noting that there is some concern about the potential use of recommender systems for opinion influence.

Acknowledgement

This research was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

[1] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003.
[2] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[3] Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, 1952.
[4] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[5] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670, New York, NY, USA, 2010.
[6] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208–214, Fort Lauderdale, FL, USA, 2011. PMLR.
[7] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548, 2016.
[8] Paat Rusmevichientong and John Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2), 2010.
[9] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[10] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. AISTATS, 2016.
[11] Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1763–1771. Curran Associates, Inc., 2017.
[12] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.
[13] Zohar S. Karnin. Verification based solution for structured MAB problems. In Advances in Neural Information Processing Systems, pages 145–153, 2016.
[14] Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits.
In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 843–851, Playa Blanca, Lanzarote, Canary Islands, 2018. PMLR.
[15] Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4877–4886, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.
[16] Tanner Fiez, Lalit Jain, Kevin G. Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676, 2019.
[17] Marta Soare. Sequential Resource Allocation in Linear Stochastic Bandits. PhD thesis, Université Lille 1 - Sciences et Technologies, December 2015.
[18] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.
[19] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, 2020.
[20] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
[21] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 427–435, Atlanta, Georgia, USA, 2013. PMLR.
[22] Rangarajan K. Sundaram. A First Course in Optimization Theory. Cambridge University Press, 1996.
[23] Victor H. de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
[24] Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. arXiv preprint arXiv:1812.01251, 2018.
[25] Pierpaolo Natalini and Biagio Palumbo. Inequalities for the incomplete gamma function. Math. Inequal. Appl., 3(1):69–77, 2000.
[26] Jonathan M. Borwein and O-Yeat Chan. Uniform bounds for the complementary incomplete gamma function. Mathematical Inequalities and Applications, 12:115–121, 2009.