# Best-Arm Identification in Linear Bandits

Marta Soare, Alessandro Lazaric, Rémi Munos
INRIA Lille - Nord Europe, SequeL Team
{marta.soare,alessandro.lazaric,remi.munos}@inria.fr

*This work was done when the author was a visiting researcher at Microsoft Research New England. Current affiliation: Google DeepMind.*

**Abstract.** We study the best-arm identification problem in linear bandits, where the rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the G-optimality criterion used in optimal experimental design.

## 1 Introduction

The stochastic multi-armed bandit problem (MAB) [16] offers a simple formalization for the study of sequential design of experiments. In the standard model, a learner sequentially chooses an arm out of $K$ and receives a reward drawn from a fixed, unknown distribution associated with the chosen arm. While most of the literature in bandit theory has focused on the maximization of cumulative rewards, where the learner needs to trade off exploration and exploitation, the pure-exploration setting [5] has recently gained a lot of attention. Here, the learner uses the available budget to identify as accurately as possible the best arm, without trying to maximize the sum of rewards. Although many results are by now available in a wide range of settings (e.g., best-arm identification with fixed budget [2, 11] and fixed confidence [7], subset selection [6, 12], and multi-bandit [9]), most of this work considers only the multi-armed setting, with $K$ independent arms.

An interesting variant of the MAB setup is the stochastic linear bandit problem (LB), introduced in [3]. In the LB setting, the input space $\mathcal{X}$ is a subset of $\mathbb{R}^d$ and, when pulling an arm $x$, the learner observes a reward whose expected value is a linear combination of $x$ and an unknown parameter $\theta^* \in \mathbb{R}^d$. Due to the linear structure of the problem, pulling an arm gives information about the parameter $\theta^*$ and, indirectly, about the value of other arms. Therefore, the estimation of $K$ mean rewards is replaced by the estimation of the $d$ features of $\theta^*$. While in the exploration-exploitation setting the LB has been widely studied both in theory and in practice (e.g., [1, 14]), in this paper we focus on the pure-exploration scenario.

The fundamental difference between MAB and LB best-arm identification strategies stems from the fact that in MAB an arm is no longer pulled as soon as its sub-optimality is evident (with high probability), while in the LB setting even a sub-optimal arm may offer valuable information about the parameter vector $\theta^*$ and thus improve the accuracy of the estimation in discriminating among near-optimal arms. For instance, consider the situation when $K - 2$ out of $K$ arms are already discarded. In order to identify the best arm, MAB algorithms would concentrate the sampling on the two remaining arms to increase the accuracy of the estimates of their mean rewards until the discarding condition is met for one of them.
On the contrary, an LB pure-exploration strategy would seek to pull the arm $x \in \mathcal{X}$ whose observed reward best refines the estimate of $\theta^*$ along the dimensions most suited to discriminating between the two remaining arms. Best-arm identification in linear bandits has recently been studied in a fixed-budget setting [10]; in this paper we study the sample complexity required to identify the best linear arm with a fixed confidence.

## 2 Preliminaries

**The setting.** We consider the standard linear bandit model. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a finite set of arms, where $|\mathcal{X}| = K$ and the $\ell_2$-norm of any arm $x \in \mathcal{X}$, denoted $\|x\|$, is upper-bounded by $L$. Given an unknown parameter $\theta^* \in \mathbb{R}^d$, we assume that each time an arm $x \in \mathcal{X}$ is pulled, a random reward $r(x)$ is generated according to the linear model $r(x) = x^\top \theta^* + \varepsilon$, where $\varepsilon$ is zero-mean i.i.d. noise bounded in $[-\sigma, \sigma]$. Arms are evaluated according to their expected reward $x^\top \theta^*$ and we denote by $x^* = \arg\max_{x \in \mathcal{X}} x^\top \theta^*$ the best arm in $\mathcal{X}$. Also, we use $\Pi(\theta) = \arg\max_{x \in \mathcal{X}} x^\top \theta$ to refer to the best arm corresponding to an arbitrary parameter $\theta$. Let $\Delta(x, x') = (x - x')^\top \theta^*$ be the value gap between two arms; then we denote by $\Delta(x) = \Delta(x^*, x)$ the gap of $x$ w.r.t. the optimal arm and by $\Delta_{\min} = \min_{x \neq x^*} \Delta(x)$ the minimum gap, where $\Delta_{\min} > 0$. We also introduce the sets $\mathcal{Y} = \{y = x - x' : x, x' \in \mathcal{X}\}$ and $\mathcal{Y}^* = \{y = x^* - x : x \in \mathcal{X}\}$ containing all the directions obtained as the difference of two arms (or of the optimal arm and another arm), and we redefine accordingly the gap of a direction as $\Delta(y) = \Delta(x, x')$ whenever $y = x - x'$.

**The problem.** We study the best-arm identification problem. Let $\hat{x}(n)$ be the estimated best arm returned by a bandit algorithm after $n$ steps. We evaluate the quality of $\hat{x}(n)$ by the simple regret $R_n = (x^* - \hat{x}(n))^\top \theta^*$. While different settings can be defined (see [8] for an overview), here we focus on the $(\epsilon, \delta)$-best-arm identification problem (the so-called PAC setting), where, given $\epsilon$ and $\delta \in (0, 1)$, the objective is to design an allocation strategy and a stopping criterion such that, when the algorithm stops, the returned arm $\hat{x}(n)$ satisfies $\mathbb{P}(R_n \geq \epsilon) \leq \delta$, while minimizing the number of steps. More specifically, we focus on the case $\epsilon = 0$ and provide high-probability bounds on the sample complexity $n$.

**The multi-armed bandit case.** In MAB, the complexity of best-arm identification is characterized by the gaps between arm values, following the intuition that the more similar the arms, the more pulls are needed to distinguish between them. More formally, the complexity is given by the problem-dependent quantity $H_{\text{MAB}} = \sum_{x \neq x^*} 1/\Delta(x)^2$, i.e., the sum of the inverses of the squared gaps between the best arm and the suboptimal arms. In the fixed-budget case, $H_{\text{MAB}}$ determines the probability of returning the wrong arm [2], while in the fixed-confidence case it characterizes the sample complexity [7].

**Technical tools.** Unlike in the multi-armed bandit scenario, where pulling one arm does not provide any information about other arms, in a linear model we can leverage the rewards observed over time to estimate the expected reward of all the arms in $\mathcal{X}$. Let $\mathbf{x}_n = (x_1, \ldots, x_n) \in \mathcal{X}^n$ be a sequence of arms and $(r_1, \ldots, r_n)$ the corresponding observed (random) rewards. An unbiased estimate of $\theta^*$ can be obtained by ordinary least squares (OLS) as $\hat\theta_n = A_{\mathbf{x}_n}^{-1} b_{\mathbf{x}_n}$, where $A_{\mathbf{x}_n} = \sum_{t=1}^{n} x_t x_t^\top \in \mathbb{R}^{d \times d}$ and $b_{\mathbf{x}_n} = \sum_{t=1}^{n} x_t r_t \in \mathbb{R}^d$.
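As a concrete illustration, the OLS estimate amounts to a few lines of linear algebra. The following Python sketch is our own (it is not part of the paper); variable names are hypothetical and it assumes $A_{\mathbf{x}_n}$ is invertible, i.e., that the pulled arms span $\mathbb{R}^d$:

```python
import numpy as np

def ols_estimate(arms_pulled: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """OLS estimate of theta*: theta_hat_n = A^{-1} b.

    arms_pulled: (n, d) array whose row t is the arm x_t pulled at step t.
    rewards:     (n,) array of the observed rewards r_t.
    """
    A = arms_pulled.T @ arms_pulled   # A_{x_n} = sum_t x_t x_t^T
    b = arms_pulled.T @ rewards       # b_{x_n} = sum_t x_t r_t
    return np.linalg.solve(A, b)      # solve A theta = b; avoids forming A^{-1}
```

In practice one would add a small ridge term to $A$ when fewer than $d$ linearly independent arms have been pulled.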
For any fixed sequence $\mathbf{x}_n$, through Azuma's inequality, the prediction error of the OLS estimate is upper-bounded with high probability as follows.

**Proposition 1.** Let $c = 2\sigma\sqrt{2}$ and $c' = 6/\pi^2$. For every fixed sequence $\mathbf{x}_n$, we have[^1]

$$\mathbb{P}\Big(\forall n \in \mathbb{N}, \forall x \in \mathcal{X},\ \big|x^\top \theta^* - x^\top \hat\theta_n\big| \leq c\, \|x\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log(c' n^2 K/\delta)}\Big) \geq 1 - \delta. \tag{1}$$

While in the previous statement $\mathbf{x}_n$ is fixed, a bandit algorithm adapts the allocation in response to the rewards observed over time. In this case a different high-probability bound is needed.

**Proposition 2** (Thm. 2 in [1]). Let $\hat\theta_n^\eta$ be the solution of the regularized least-squares problem with regularizer $\eta$ and let $A^\eta_{\mathbf{x}_n} = \eta I_d + A_{\mathbf{x}_n}$. Then for all $x \in \mathcal{X}$ and every adaptive sequence $\mathbf{x}_n$ such that, at any step $t$, $x_t$ depends only on $(x_1, r_1, \ldots, x_{t-1}, r_{t-1})$, with probability $1 - \delta$ we have

$$\big|x^\top \theta^* - x^\top \hat\theta_n^\eta\big| \leq \|x\|_{(A^\eta_{\mathbf{x}_n})^{-1}} \Big(\sigma \sqrt{d \log\big((1 + nL^2/\eta)/\delta\big)} + \eta^{1/2} \|\theta^*\|\Big). \tag{2}$$

The crucial difference w.r.t. Eq. 1 is the additional factor $\sqrt{d}$, which is the price to pay for adapting $\mathbf{x}_n$ to the samples.

In the sequel we often resort to the notion of design (or soft allocation) $\lambda \in \mathcal{D}^K$, which prescribes the proportion of pulls allocated to each arm $x$, where $\mathcal{D}^K$ denotes the simplex over $\mathcal{X}$. The counterpart of the design matrix $A$ for a design $\lambda$ is the matrix $\Lambda_\lambda = \sum_{x \in \mathcal{X}} \lambda(x)\, x x^\top$. From an allocation $\mathbf{x}_n$ we can derive the corresponding design $\lambda_{\mathbf{x}_n}$ as $\lambda_{\mathbf{x}_n}(x) = T_n(x)/n$, where $T_n(x)$ is the number of times arm $x$ is selected in $\mathbf{x}_n$; the corresponding design matrix is $A_{\mathbf{x}_n} = n \Lambda_{\lambda_{\mathbf{x}_n}}$.

[^1]: Whenever Prop. 1 is used for all directions $y \in \mathcal{Y}$, the logarithmic term becomes $\log(c' n^2 K^2/\delta)$ because of an additional union bound. For the sake of simplicity, in the sequel we always write $\log_n(K^2/\delta) = \log(c' n^2 K^2/\delta)$.
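To make these quantities concrete, the sketch below (our own minimal illustration, with hypothetical names, using the union-bound version of Prop. 1 over directions) computes $\Lambda_\lambda$ for a design and the high-probability width $c\,\|y\|_{A_{\mathbf{x}_n}^{-1}}\sqrt{\log_n(K^2/\delta)}$ for a direction $y$:

```python
import numpy as np

def design_matrix(arms: np.ndarray, design: np.ndarray) -> np.ndarray:
    """Lambda_lambda = sum_x lambda(x) x x^T.

    arms: (K, d) array of arms; design: (K,) proportions summing to 1.
    """
    return (arms * design[:, None]).T @ arms

def width(y, A_inv, n, K, delta, sigma=1.0):
    """c * ||y||_{A^{-1}} * sqrt(log_n(K^2/delta)), with c = 2*sigma*sqrt(2)
    and c' = 6/pi^2 as in Prop. 1 (union-bound version over directions)."""
    c, c_prime = 2.0 * sigma * np.sqrt(2.0), 6.0 / np.pi**2
    return c * np.sqrt(y @ A_inv @ y) * np.sqrt(np.log(c_prime * n**2 * K**2 / delta))
```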
## 3 The Complexity of the Linear Best-Arm Identification Problem

*Figure 1: The cones corresponding to three arms (dots) in $\mathbb{R}^2$. Since $\theta^* \in \mathcal{C}(x_1)$, then $x^* = x_1$. The confidence set $\mathcal{S}^*(\mathbf{x}_n)$ (in green) is aligned with directions $x_1 - x_2$ and $x_1 - x_3$. Given the uncertainty in $\mathcal{S}^*(\mathbf{x}_n)$, both $x_1$ and $x_3$ may be optimal.*

As reviewed in Sect. 2, in the MAB case the complexity of the best-arm identification task is characterized by the reward gaps between the optimal and suboptimal arms. In this section, we propose an extension of the notion of complexity to the case of linear best-arm identification. In particular, we characterize the complexity by the performance of an oracle with access to the parameter $\theta^*$.

**Stopping condition.** Let $\mathcal{C}(x) = \{\theta \in \mathbb{R}^d : x = \Pi(\theta)\}$ be the set of parameters $\theta$ which admit $x$ as an optimal arm. As illustrated in Fig. 1, $\mathcal{C}(x)$ is the cone defined by the intersection of half-spaces, $\mathcal{C}(x) = \bigcap_{x' \in \mathcal{X}} \{\theta \in \mathbb{R}^d : (x - x')^\top \theta \geq 0\}$, and all the cones together form a partition of the Euclidean space $\mathbb{R}^d$. We assume that the oracle knows the cone $\mathcal{C}(x^*)$ containing all the parameters for which $x^*$ is optimal. Furthermore, we assume that for any allocation $\mathbf{x}_n$ it is possible to construct a confidence set $\mathcal{S}^*(\mathbf{x}_n) \subseteq \mathbb{R}^d$ such that $\theta^* \in \mathcal{S}^*(\mathbf{x}_n)$ and the (random) OLS estimate $\hat\theta_n$ belongs to $\mathcal{S}^*(\mathbf{x}_n)$ with high probability, i.e., $\mathbb{P}\big(\hat\theta_n \in \mathcal{S}^*(\mathbf{x}_n)\big) \geq 1 - \delta$. As a result, the oracle stopping criterion simply checks whether the confidence set $\mathcal{S}^*(\mathbf{x}_n)$ is contained in $\mathcal{C}(x^*)$ or not. In fact, whenever for an allocation $\mathbf{x}_n$ the set $\mathcal{S}^*(\mathbf{x}_n)$ overlaps the cones of different arms $x \in \mathcal{X}$, there is ambiguity in the identity of the arm $\Pi(\hat\theta_n)$. On the other hand, when all possible values of $\hat\theta_n$ are included with high probability in the right cone $\mathcal{C}(x^*)$, the optimal arm is returned.

**Lemma 1.** Let $\mathbf{x}_n$ be an allocation such that $\mathcal{S}^*(\mathbf{x}_n) \subseteq \mathcal{C}(x^*)$. Then $\mathbb{P}\big(\Pi(\hat\theta_n) \neq x^*\big) \leq \delta$.

**Arm selection strategy.** From the previous lemma[^2] it follows that the objective of an arm selection strategy is to define an allocation $\mathbf{x}_n$ which leads to $\mathcal{S}^*(\mathbf{x}_n) \subseteq \mathcal{C}(x^*)$ as quickly as possible.[^3] Since this condition depends only on deterministic objects ($\mathcal{S}^*(\mathbf{x}_n)$ and $\mathcal{C}(x^*)$), it can be computed independently from the actual reward realizations. From a geometric point of view, this corresponds to choosing arms so that the confidence set $\mathcal{S}^*(\mathbf{x}_n)$ shrinks into the optimal cone $\mathcal{C}(x^*)$ within the smallest number of pulls. To characterize this strategy we need to make the form of $\mathcal{S}^*(\mathbf{x}_n)$ explicit. Intuitively speaking, the more $\mathcal{S}^*(\mathbf{x}_n)$ is aligned with the boundaries of the cone, the easier it is to shrink it into the cone. More formally, the condition $\mathcal{S}^*(\mathbf{x}_n) \subseteq \mathcal{C}(x^*)$ is equivalent to

$$\forall x \in \mathcal{X},\ \forall \theta \in \mathcal{S}^*(\mathbf{x}_n),\ (x^* - x)^\top \theta \geq 0 \iff \forall y \in \mathcal{Y}^*,\ \forall \theta \in \mathcal{S}^*(\mathbf{x}_n),\ y^\top(\theta^* - \theta) \leq \Delta(y).$$

Then we can simply use Prop. 1 to directly control the term $y^\top(\theta^* - \theta)$ and define

$$\mathcal{S}^*(\mathbf{x}_n) = \Big\{\theta \in \mathbb{R}^d : \forall y \in \mathcal{Y}^*,\ y^\top(\theta^* - \theta) \leq c\, \|y\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log_n(K^2/\delta)}\Big\}. \tag{3}$$

Thus the stopping condition $\mathcal{S}^*(\mathbf{x}_n) \subseteq \mathcal{C}(x^*)$ is equivalent to the condition that, for any $y \in \mathcal{Y}^*$,

$$c\, \|y\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log_n(K^2/\delta)} \leq \Delta(y). \tag{4}$$

From this condition, the oracle allocation strategy follows as

$$\mathbf{x}^*_n = \arg\min_{\mathbf{x}_n} \max_{y \in \mathcal{Y}^*} \frac{c\, \|y\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log_n(K^2/\delta)}}{\Delta(y)} = \arg\min_{\mathbf{x}_n} \max_{y \in \mathcal{Y}^*} \frac{\|y\|_{A_{\mathbf{x}_n}^{-1}}}{\Delta(y)}. \tag{5}$$

Notice that this strategy does not return a uniformly accurate estimate of $\theta^*$; rather, it pulls arms that reduce the uncertainty of the estimate of $\theta^*$ over the directions of interest (i.e., $\mathcal{Y}^*$) below their corresponding gaps. This implies that the objective of Eq. 5 is to exploit the global linear assumption by pulling any arm in $\mathcal{X}$ that could give information about $\theta^*$ over the directions in $\mathcal{Y}^*$, so that directions with small gaps are estimated more accurately than those with bigger gaps.

[^2]: For all the proofs in this paper, we refer the reader to the long version of the paper [18].
[^3]: Notice that, by definition of the confidence set and since $\hat\theta_n \to \theta^*$ as $n \to \infty$, any strategy repeatedly pulling all the arms would eventually meet the stopping condition.

**Sample complexity.** We are now ready to define the sample complexity of the oracle, which corresponds to the minimum number of steps needed by the allocation in Eq. 5 to achieve the stopping condition in Eq. 4. From a technical point of view, it is more convenient to express the complexity of the problem in terms of the optimal design (soft allocation) rather than the discrete allocation $\mathbf{x}_n$. Let $\rho^*(\lambda) = \max_{y \in \mathcal{Y}^*} \|y\|^2_{\Lambda_\lambda^{-1}} / \Delta^2(y)$ be the square of the objective function in Eq. 5 for any design $\lambda \in \mathcal{D}^K$. We define the complexity of a linear best-arm identification problem as the performance achieved by the optimal design $\lambda^* = \arg\min_\lambda \rho^*(\lambda)$, i.e.,

$$H_{\text{LB}} = \min_{\lambda \in \mathcal{D}^K} \max_{y \in \mathcal{Y}^*} \frac{\|y\|^2_{\Lambda_\lambda^{-1}}}{\Delta^2(y)} = \rho^*(\lambda^*). \tag{6}$$

This definition of complexity is less explicit than that of $H_{\text{MAB}}$, but it contains similar elements, notably the inverses of the squared gaps. Nonetheless, instead of summing the inverses over all the arms, $H_{\text{LB}}$ implicitly takes into consideration the correlation between the arms through the term $\|y\|^2_{\Lambda_\lambda^{-1}}$, which represents the uncertainty in the estimation of the gap between $x$ and $x^*$ (when $y = x^* - x$). As a result, from Eq. 4 the sample complexity becomes

$$N^* = c^2 H_{\text{LB}} \log_n(K^2/\delta), \tag{7}$$

where we use the fact that, if implemented over $n$ steps, $\lambda^*$ induces a design matrix $A_{\lambda^*} = n \Lambda_{\lambda^*}$ and $\max_y \|y\|^2_{A_{\lambda^*}^{-1}} / \Delta^2(y) = \rho^*(\lambda^*)/n$.
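For illustration, $\rho^*(\lambda)$ can be evaluated directly from its definition; the sketch below is our own (not from the paper), with a small ridge added purely for numerical invertibility:

```python
import numpy as np

def rho_star(design, arms, theta_star, ridge=1e-9):
    """rho*(lambda) = max over y in Y* of ||y||^2_{Lambda^{-1}} / Delta(y)^2."""
    Lam_inv = np.linalg.inv((arms * design[:, None]).T @ arms
                            + ridge * np.eye(arms.shape[1]))
    i_star = int(np.argmax(arms @ theta_star))      # x* = Pi(theta*)
    vals = []
    for i, x in enumerate(arms):
        if i != i_star:
            y = arms[i_star] - x                    # direction in Y*
            vals.append((y @ Lam_inv @ y) / (y @ theta_star) ** 2)
    return max(vals)
```

Minimizing `rho_star` over the simplex (e.g., by projected gradient or an incremental Frank-Wolfe-type scheme) yields $H_{\text{LB}}$ of Eq. 6. Of course, none of this is available to the learner, since it requires knowledge of $\theta^*$.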
Finally, we bound the range of the complexity.

**Lemma 2.** Given an arm set $\mathcal{X} \subseteq \mathbb{R}^d$ and a parameter $\theta^*$, the complexity $H_{\text{LB}}$ (Eq. 6) is such that

$$\max_{y \in \mathcal{Y}^*} \|y\|^2 / (L \Delta^2_{\min}) \leq H_{\text{LB}} \leq 4d / \Delta^2_{\min}. \tag{8}$$

Furthermore, if $\mathcal{X}$ is the canonical basis, the problem reduces to a MAB and $H_{\text{MAB}} \leq H_{\text{LB}} \leq 2 H_{\text{MAB}}$.

The previous bounds show that $\Delta_{\min}$ plays a significant role in defining the complexity of the problem, while the specific shape of $\mathcal{X}$ impacts the numerator in different ways. In the worst case the full dimensionality $d$ appears (upper bound), while more arm-set-specific quantities, such as the norm of the arms $L$ and of the directions in $\mathcal{Y}^*$, appear in the lower bound.

## 4 Static Allocation Strategies

```
Input: decision space X ⊆ R^d, confidence δ > 0
Set: t = 0; Y = {y = x − x' : x ≠ x' ∈ X}
while Eq. 11 is not true do
    if G-allocation then
        x_t = arg min_{x ∈ X} max_{x' ∈ X} x'^T (A + x x^T)^{-1} x'
    else if XY-allocation then
        x_t = arg min_{x ∈ X} max_{y ∈ Y} y^T (A + x x^T)^{-1} y
    end if
    Update θ̂_t = A_t^{-1} b_t, t = t + 1
end while
Return arm Π(θ̂_t)
```

*Figure 2: Static allocation algorithms.*

The oracle stopping condition (Eq. 4) and allocation strategy (Eq. 5) cannot be implemented in practice, since $\theta^*$, the gaps $\Delta(y)$, and the directions $\mathcal{Y}^*$ are unknown. In this section we investigate how to define algorithms that rely only on the information available from $\mathcal{X}$ and the samples collected over time. We introduce an empirical stopping criterion and two static allocations.

**Empirical stopping criterion.** The stopping condition $\mathcal{S}^*(\mathbf{x}_n) \subseteq \mathcal{C}(x^*)$ cannot be tested, since $\mathcal{S}^*(\mathbf{x}_n)$ is centered at the unknown parameter $\theta^*$ and $\mathcal{C}(x^*)$ depends on the unknown optimal arm $x^*$. Nonetheless, we notice that, given $\mathcal{X}$, the cones $\mathcal{C}(x)$ for each $x \in \mathcal{X}$ can be constructed beforehand. Let $\hat{\mathcal{S}}(\mathbf{x}_n)$ be a high-probability confidence set such that for any $\mathbf{x}_n$, $\hat\theta_n \in \hat{\mathcal{S}}(\mathbf{x}_n)$ and $\mathbb{P}\big(\theta^* \in \hat{\mathcal{S}}(\mathbf{x}_n)\big) \geq 1 - \delta$. Unlike $\mathcal{S}^*$, $\hat{\mathcal{S}}$ can be computed directly from samples, and we can stop whenever there exists an $x$ such that $\hat{\mathcal{S}}(\mathbf{x}_n) \subseteq \mathcal{C}(x)$.

**Lemma 3.** Let $\mathbf{x}_n = (x_1, \ldots, x_n)$ be an arbitrary allocation sequence. If after $n$ steps there exists an arm $x \in \mathcal{X}$ such that $\hat{\mathcal{S}}(\mathbf{x}_n) \subseteq \mathcal{C}(x)$, then $\mathbb{P}\big(\Pi(\hat\theta_n) \neq x^*\big) \leq \delta$.

**Arm selection strategy.** Similarly to the oracle algorithm, we should design an allocation strategy that guarantees that the (random) confidence set $\hat{\mathcal{S}}(\mathbf{x}_n)$ shrinks into one of the cones $\mathcal{C}(x)$ within the fewest number of steps. Let $\hat\Delta_n(x, x') = (x - x')^\top \hat\theta_n$ be the empirical gap between arms $x, x'$. Then the stopping condition $\hat{\mathcal{S}}(\mathbf{x}_n) \subseteq \mathcal{C}(x)$ can be written as

$$\exists x \in \mathcal{X},\ \forall x' \in \mathcal{X},\ \forall \theta \in \hat{\mathcal{S}}(\mathbf{x}_n),\ (x - x')^\top \theta \geq 0 \iff \exists x \in \mathcal{X},\ \forall x' \in \mathcal{X},\ \forall \theta \in \hat{\mathcal{S}}(\mathbf{x}_n),\ (x - x')^\top (\hat\theta_n - \theta) \leq \hat\Delta_n(x, x'). \tag{9}$$

This suggests that the empirical confidence set can be defined as

$$\hat{\mathcal{S}}(\mathbf{x}_n) = \Big\{\theta \in \mathbb{R}^d : \forall y \in \mathcal{Y},\ y^\top(\hat\theta_n - \theta) \leq c\, \|y\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log_n(K^2/\delta)}\Big\}. \tag{10}$$

Unlike $\mathcal{S}^*(\mathbf{x}_n)$, $\hat{\mathcal{S}}(\mathbf{x}_n)$ is centered at $\hat\theta_n$ and it considers all directions $y \in \mathcal{Y}$. As a result, the stopping condition in Eq. 9 can be reformulated as

$$\exists x \in \mathcal{X},\ \forall x' \in \mathcal{X},\quad c\, \|x - x'\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log_n(K^2/\delta)} \leq \hat\Delta_n(x, x'). \tag{11}$$

Although similar to Eq. 4, this condition unfortunately cannot be used directly to derive an allocation strategy. In fact, it is considerably more difficult to define a suitable allocation strategy that fits a random confidence set $\hat{\mathcal{S}}$ into a cone $\mathcal{C}(x)$ for an $x$ which is not known in advance. In the following we propose two allocations that try to achieve the condition in Eq. 11 as fast as possible by implementing a static arm selection strategy, while we present a more sophisticated adaptive strategy in Sect. 5. The general structure of the static allocations is summarized in Fig. 2.
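A minimal runnable version of Fig. 2 might look as follows. This is our own sketch, not the paper's implementation: `pull(i)` is a hypothetical reward oracle, a small ridge keeps $A$ invertible, and the stopping test instantiates Eq. 11 with $x$ set to the empirically best arm (a sufficient choice):

```python
import numpy as np
from itertools import combinations

def static_allocation(arms, pull, delta, mode="XY", sigma=1.0, max_n=200000):
    """Greedy incremental G- or XY-allocation with the stopping rule of Eq. 11."""
    K, d = arms.shape
    # one orientation per pair suffices, since ||y||_{A^{-1}} = ||-y||_{A^{-1}}
    dirs = [arms[i] - arms[j] for i, j in combinations(range(K), 2)]
    targets = list(arms) if mode == "G" else dirs
    c, c_p = 2.0 * sigma * np.sqrt(2.0), 6.0 / np.pi**2
    A, b, theta_hat = 1e-6 * np.eye(d), np.zeros(d), np.zeros(d)
    for n in range(1, max_n + 1):
        # pick the arm whose pull most reduces the worst-case uncertainty
        scores = []
        for x in arms:
            M_inv = np.linalg.inv(A + np.outer(x, x))
            scores.append(max(t @ M_inv @ t for t in targets))
        i = int(np.argmin(scores))
        A += np.outer(arms[i], arms[i])
        b += arms[i] * pull(i)
        theta_hat = np.linalg.solve(A, b)
        # stopping condition (Eq. 11), tested for the empirically best arm
        A_inv = np.linalg.inv(A)
        log_t = np.log(c_p * n**2 * K**2 / delta)
        best = int(np.argmax(arms @ theta_hat))
        if all(c * np.sqrt((arms[best] - arms[j]) @ A_inv @ (arms[best] - arms[j]))
               * np.sqrt(log_t) <= (arms[best] - arms[j]) @ theta_hat
               for j in range(K) if j != best):
            return best, n
    return int(np.argmax(arms @ theta_hat)), max_n
```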
**G-allocation strategy.** The definition of the G-allocation strategy follows directly from the observation that for any pair $(x, x') \in \mathcal{X}^2$ we have $\|x - x'\|_{A_{\mathbf{x}_n}^{-1}} \leq 2 \max_{x'' \in \mathcal{X}} \|x''\|_{A_{\mathbf{x}_n}^{-1}}$. This suggests that an allocation minimizing $\max_{x \in \mathcal{X}} \|x\|_{A_{\mathbf{x}_n}^{-1}}$ reduces an upper bound on the quantity tested in the stopping condition in Eq. 11. Thus, for any fixed $n$, we define the G-allocation as

$$\mathbf{x}^G_n = \arg\min_{\mathbf{x}_n} \max_{x \in \mathcal{X}} \|x\|_{A_{\mathbf{x}_n}^{-1}}. \tag{12}$$

We notice that this formulation coincides with the standard G-optimal design (hence the name of the allocation) defined in experimental design theory [15, Sect. 9.2] to minimize the maximal mean-squared prediction error in linear regression. The G-allocation can be interpreted as the design that allows $\theta^*$ to be estimated uniformly well over all the arms in $\mathcal{X}$. Notice that the G-allocation in Eq. 12 is well defined only for a fixed number of steps $n$ and cannot be implemented directly in our case, since $n$ is unknown in advance. Therefore we have to resort to a more incremental implementation. In the experimental design literature, a wide range of approximate solutions have been proposed for the NP-hard discrete optimization problem in Eq. 12 (see [4, 17] for some recent results and [18] for a more thorough discussion). For any approximate G-allocation strategy with performance no worse than a factor $(1 + \beta)$ of the optimal strategy $\mathbf{x}^G_n$, the sample complexity $N^G$ is bounded as follows.

**Theorem 1.** If the G-allocation strategy is implemented with a $\beta$-approximate method and the stopping condition in Eq. 11 is used, then

$$\mathbb{P}\bigg(N^G \leq \frac{16\, c^2 d (1 + \beta) \log_n(K^2/\delta)}{\Delta^2_{\min}} \;\wedge\; \Pi(\hat\theta_{N^G}) = x^*\bigg) \geq 1 - \delta. \tag{13}$$

Notice that this result matches (up to constants) the worst-case value of $N^*$ given the upper bound on $H_{\text{LB}}$. This means that, although completely static, the G-allocation is already worst-case optimal.

**XY-allocation strategy.** Despite being worst-case optimal, the G-allocation minimizes a rather loose upper bound on the quantity used to test the stopping criterion. We therefore define an alternative static allocation that targets the stopping condition in Eq. 11 more directly, by reducing its left-hand side for every possible direction in $\mathcal{Y}$. For any fixed $n$, we define the XY-allocation as

$$\mathbf{x}^{XY}_n = \arg\min_{\mathbf{x}_n} \max_{y \in \mathcal{Y}} \|y\|_{A_{\mathbf{x}_n}^{-1}}. \tag{14}$$

XY-allocation is based on the observation that the stopping condition in Eq. 11 requires only the empirical gaps $\hat\Delta(x, x')$ to be well estimated; hence arms are pulled with the objective of increasing the accuracy of the directions in $\mathcal{Y}$ rather than of the arms in $\mathcal{X}$. This problem can be seen as a transductive variant of the G-optimal design [19], where the target vectors $\mathcal{Y}$ are different from the vectors $\mathcal{X}$ used in the design. The sample complexity of the XY-allocation is as follows.

**Theorem 2.** If the XY-allocation strategy is implemented with a $\beta$-approximate method and the stopping condition in Eq. 11 is used, then

$$\mathbb{P}\bigg(N^{XY} \leq \frac{32\, c^2 d (1 + \beta) \log_n(K^2/\delta)}{\Delta^2_{\min}} \;\wedge\; \Pi(\hat\theta_{N^{XY}}) = x^*\bigg) \geq 1 - \delta. \tag{15}$$

Although the previous bound suggests that the XY-allocation achieves a performance comparable to the G-allocation, XY may in fact be arbitrarily better than the G-allocation (for an example, see [18]).

## 5 XY-Adaptive Allocation Strategy

```
Input: decision space X ⊆ R^d; parameter α; confidence δ
Set: j = 1; X̂_1 = X; Ŷ_1 = Y; ρ̂^0 = 1; n_0 = d(d + 1) + 1
while |X̂_j| > 1 do
    ρ̂^j = ρ̂^{j-1}; t = 1; A_0 = I_d
    while ρ̂^j / t ≥ α ρ̂^{j-1} / n_{j-1} do
        Select arm x_t = arg min_{x ∈ X} max_{y ∈ Ŷ_j} y^T (A + x x^T)^{-1} y
        Update A_t = A_{t-1} + x_t x_t^T, t = t + 1
        ρ̂^j = max_{y ∈ Ŷ_j} y^T A_t^{-1} y
    end while
    Compute b = Σ_{s=1}^{t} x_s r_s; θ̂_j = A_t^{-1} b
    X̂_{j+1} = ∅
    for x ∈ X̂_j do
        if ∄ x' : c ||x' − x||_{A_t^{-1}} √(log_n(K²/δ)) < Δ̂_j(x', x) then
            X̂_{j+1} = X̂_{j+1} ∪ {x}
        end if
    end for
    Ŷ_{j+1} = {y = x − x' : x, x' ∈ X̂_{j+1}}; j = j + 1
end while
Return Π(θ̂_j)
```

*Figure 3: XY-Adaptive allocation algorithm.*
**Fully adaptive allocation strategies.** Although both the G- and the XY-allocation are sound, since they minimize upper bounds on the quantities used by the stopping condition (Eq. 11), they may be very suboptimal w.r.t. the ideal performance of the oracle introduced in Sect. 3. Typically, an improvement can be obtained by moving to strategies that adapt to the rewards observed over time. Nonetheless, as reported in Prop. 2, whenever $\mathbf{x}_n$ is not a fixed sequence, the bound in Eq. 2 must be used. As a result, an additional factor $\sqrt{d}$ would appear in the definition of the confidence sets and in the stopping condition. This directly implies that the sample complexity of a fully adaptive strategy would scale linearly with the dimensionality $d$ of the problem, thus removing any advantage w.r.t. static allocations. In fact, the sample complexities of the G- and XY-allocations already scale linearly with $d$, and from Lem. 2 we cannot expect to improve the dependency on $\Delta_{\min}$. Thus, on the one hand we need to use the tighter bound in Eq. 1, and on the other hand we need to be adaptive w.r.t. the samples. In the sequel we propose a phased algorithm that meets both requirements by using a static allocation within each phase, while choosing the type of allocation depending on the samples observed in previous phases.

**Algorithm.** The ideal case would be to define an empirical version of the oracle allocation in Eq. 5, adjusting the accuracy of the prediction only on the directions of interest $\mathcal{Y}^*$ and according to their gaps $\Delta(y)$. As discussed in Sect. 4, this cannot be obtained by a direct adaptation of Eq. 11. In the following, we describe a safe alternative that adjusts the allocation strategy to the gaps.

**Lemma 4.** Let $\mathbf{x}_n$ be a fixed allocation sequence and $\hat\theta_n$ its corresponding estimate of $\theta^*$. If an arm $x \in \mathcal{X}$ is such that

$$\exists x' \in \mathcal{X}\ \text{s.t.}\ c\, \|x' - x\|_{A_{\mathbf{x}_n}^{-1}} \sqrt{\log_n(K^2/\delta)} < \hat\Delta_n(x', x), \tag{16}$$

then arm $x$ is sub-optimal. Moreover, if Eq. 16 holds, we say that $x'$ dominates $x$.

Lem. 4 allows us to easily construct the set of potentially optimal arms, denoted $\hat{\mathcal{X}}(\mathbf{x}_n)$, by removing from $\mathcal{X}$ all the dominated arms. As a result, we can replace the stopping condition in Eq. 11 by simply testing whether the number of non-dominated arms $|\hat{\mathcal{X}}(\mathbf{x}_n)|$ equals 1, which corresponds to the case where the confidence set is fully contained in a single cone. Using $\hat{\mathcal{X}}(\mathbf{x}_n)$, we construct $\hat{\mathcal{Y}}(\mathbf{x}_n) = \{y = x - x' : x, x' \in \hat{\mathcal{X}}(\mathbf{x}_n)\}$, the set of directions along which the estimation of $\theta^*$ needs to be improved in order to further shrink $\hat{\mathcal{S}}(\mathbf{x}_n)$ into a single cone and trigger the stopping condition. Note that if $\mathbf{x}_n$ were an adaptive strategy, we could not use Lem. 4 to discard arms but would have to rely on the bound in Prop. 2. To avoid this problem, an effective solution is to run the algorithm in phases. Let $j \in \mathbb{N}$ be the index of a phase and $n_j$ its length. We denote by $\hat{\mathcal{X}}_j$ the set of non-dominated arms constructed on the basis of the samples collected in phase $j - 1$. This set is used to identify the directions $\hat{\mathcal{Y}}_j$ and to define a static allocation which focuses on reducing the uncertainty of $\theta^*$ along the directions in $\hat{\mathcal{Y}}_j$. Formally, in phase $j$ we implement the allocation

$$\mathbf{x}^j_{n_j} = \arg\min_{\mathbf{x}_{n_j}} \max_{y \in \hat{\mathcal{Y}}_j} \|y\|_{A_{\mathbf{x}_{n_j}}^{-1}}, \tag{17}$$

which coincides with an XY-allocation (see Eq. 14) restricted to $\hat{\mathcal{Y}}_j$. Notice that $\mathbf{x}^j_{n_j}$ may still use any arm in $\mathcal{X}$ that is useful in reducing the confidence set along any of the directions in $\hat{\mathcal{Y}}_j$. Once phase $j$ is over, the OLS estimate $\hat\theta_j$ is computed using the rewards observed within phase $j$ and is then used to test the stopping condition in Eq. 11. Whenever the stopping condition does not hold, a new set $\hat{\mathcal{X}}_{j+1}$ is constructed using the discarding condition of Lem. 4 and a new phase is started.
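The discarding step of Lem. 4 translates directly into code. In the sketch below (our own, with hypothetical names) `active` holds the indices of the current non-dominated set $\hat{\mathcal{X}}_j$:

```python
import numpy as np

def discard_dominated(arms, active, theta_hat, A_inv, n, K, delta, sigma=1.0):
    """Return the indices in `active` that are not dominated (Lemma 4):
    x is dropped if some x' satisfies
    c * ||x' - x||_{A^{-1}} * sqrt(log_n(K^2/delta)) < (x' - x)^T theta_hat."""
    c, c_p = 2.0 * sigma * np.sqrt(2.0), 6.0 / np.pi**2
    log_t = np.log(c_p * n**2 * K**2 / delta)
    kept = []
    for i in active:
        for j in active:
            y = arms[j] - arms[i]                   # direction x' - x
            if j != i and c * np.sqrt(y @ A_inv @ y) * np.sqrt(log_t) < y @ theta_hat:
                break                               # arms[j] dominates arms[i]
        else:
            kept.append(i)                          # no dominating arm found
    return kept
```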
Notice that through this process, at each phase $j$, the allocation $\mathbf{x}^j_{n_j}$ is static conditioned on the previous allocations, so that using the bound from Prop. 1 remains correct. A crucial aspect of this algorithm is the length of the phases $n_j$. On the one hand, short phases allow a high rate of adaptivity, since $\hat{\mathcal{X}}_j$ is recomputed very often. On the other hand, if a phase is too short, it is very unlikely that the estimate $\hat\theta_j$ will be accurate enough to actually discard any arm. An effective way to define the length of a phase deterministically is to relate it to the actual uncertainty of the allocation in estimating the value of all the active directions in $\hat{\mathcal{Y}}_j$. In phase $j$, let $\hat\rho^j(\lambda) = \max_{y \in \hat{\mathcal{Y}}_j} \|y\|^2_{\Lambda_\lambda^{-1}}$; then, given a parameter $\alpha \in (0, 1)$, we define

$$n_j = \min\Big\{n \in \mathbb{N} : \hat\rho^j(\lambda_{\mathbf{x}^j_n})/n \leq \alpha\, \hat\rho^{j-1}(\lambda^{j-1})/n_{j-1}\Big\}, \tag{18}$$

where $\mathbf{x}^j_n$ is the allocation defined in Eq. 17 and $\lambda^{j-1}$ is the design corresponding to $\mathbf{x}^{j-1}_{n_{j-1}}$, the allocation performed at phase $j - 1$. In words, $n_j$ is the minimum number of steps needed by the XY-adaptive allocation to achieve an uncertainty over all the directions of interest which is a fraction $\alpha$ of the performance obtained in the previous phase. Notice that, given $\hat{\mathcal{Y}}_j$ and $\hat\rho^{j-1}$, this quantity can be computed before the actual beginning of phase $j$. The resulting algorithm using the XY-Adaptive allocation strategy is summarized in Fig. 3.

**Sample complexity.** Although the XY-Adaptive allocation strategy is designed to approach the oracle sample complexity $N^*$, in early phases it basically implements an XY-allocation and no significant improvement can be expected until some directions are discarded from $\mathcal{Y}$. At that point, XY-Adaptive starts focusing on directions which only contain near-optimal arms and begins approaching the behavior of the oracle. As a result, in studying the sample complexity of XY-Adaptive we have to take into consideration the unavoidable price of discarding the suboptimal directions. This cost is directly related to the geometry of the arm space, which influences the number of samples needed before arms can be discarded from $\mathcal{X}$. To take this problem-dependent quantity into account, we introduce a slightly relaxed definition of complexity. More precisely, we consider the number of steps needed to discard all the directions which do not contain $x^*$, i.e., $\mathcal{Y} \setminus \mathcal{Y}^*$. From a geometric point of view, this corresponds to the case when, for any pair of suboptimal arms $(x, x')$, the confidence set $\mathcal{S}^*(\mathbf{x}_n)$ does not intersect the hyperplane separating the cones $\mathcal{C}(x)$ and $\mathcal{C}(x')$. Fig. 1 offers a simple illustration of such a situation: $\mathcal{S}^*$ no longer intersects the boundary between $\mathcal{C}(x_2)$ and $\mathcal{C}(x_3)$, which implies that the direction $x_2 - x_3$ can be discarded. More formally, the hyperplane containing the parameters $\theta$ for which $x$ and $x'$ are equivalent is simply $\mathcal{C}(x) \cap \mathcal{C}(x')$, and the quantity

$$M^* = \min\big\{n \in \mathbb{N} : \forall x \neq x^*,\ \forall x' \neq x^*,\ x' \neq x,\ \mathcal{S}^*(\mathbf{x}^{XY}_n) \cap \big(\mathcal{C}(x) \cap \mathcal{C}(x')\big) = \emptyset\big\} \tag{19}$$

corresponds to the minimum number of steps needed by the static XY-allocation strategy to discard all the suboptimal directions. This term, together with the oracle complexity $N^*$, characterizes the sample complexity of the phases of the XY-Adaptive allocation. In fact, the length of the phases is such that either they correspond to the complexity of the oracle or they never last longer than the number of steps needed to discard all the sub-optimal directions.
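Since $A_{\mathbf{x}_n} = n \Lambda_{\lambda_{\mathbf{x}_n}}$, the test in Eq. 18 reduces to comparing $\max_{y \in \hat{\mathcal{Y}}_j} \|y\|^2_{A_n^{-1}}$ (which equals $\hat\rho^j(\lambda_{\mathbf{x}^j_n})/n$) against $\alpha\, \hat\rho^{j-1}(\lambda^{j-1})/n_{j-1}$. The sketch below is our own illustration of this phase-planning step (a small ridge stands in for the initialization matrix $A_0$); as the text notes, the plan is deterministic given $\hat{\mathcal{Y}}_j$ and $\hat\rho^{j-1}$, so it can be computed before the phase starts:

```python
import numpy as np

def plan_phase(arms, dirs, alpha, rho_prev_over_n, ridge=1e-6, max_n=200000):
    """Phase length n_j of Eq. 18 for the greedy restricted XY-allocation (Eq. 17).

    dirs:            active directions in Y_hat_j, as a list of d-vectors.
    rho_prev_over_n: rho^{j-1}(lambda^{j-1}) / n_{j-1} from the previous phase.
    """
    A, pulls = ridge * np.eye(arms.shape[1]), []
    for n in range(1, max_n + 1):
        # greedy step: pull the arm that most reduces the worst direction
        scores = []
        for x in arms:
            M_inv = np.linalg.inv(A + np.outer(x, x))
            scores.append(max(y @ M_inv @ y for y in dirs))
        i = int(np.argmin(scores))
        A += np.outer(arms[i], arms[i])
        pulls.append(i)
        # max_y ||y||^2_{A_n^{-1}} = rho_j(lambda)/n; stop when it is small enough
        if max(y @ np.linalg.inv(A) @ y for y in dirs) <= alpha * rho_prev_over_n:
            return n, pulls
    return max_n, pulls
```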
As a result, the overall sample complexity of the XY-Adaptive algorithm is bounded as in the following theorem.

**Theorem 3.** If the XY-Adaptive allocation strategy is implemented with a $\beta$-approximate method and the stopping condition in Eq. 11 is used, then

$$\mathbb{P}\bigg(N \leq (1 + \beta)\, \max\Big\{M^*,\ \frac{16}{\alpha} N^*\Big\}\, \frac{\log(c/\delta)}{\log(1/\alpha)} \;\wedge\; \Pi(\hat\theta_N) = x^*\bigg) \geq 1 - \delta. \tag{20}$$

We first remark that, unlike for G and XY, the sample complexity of XY-Adaptive does not depend directly on $d$ and $\Delta_{\min}$ (except in the logarithmic term), but rather scales with the oracle complexity $N^*$ and the cost $M^*$ of discarding the suboptimal directions. Although this additional cost is probably unavoidable, one might have expected XY-Adaptive to need to discard all the suboptimal directions before performing as well as the oracle, which would give a sample complexity of order $O(M^* + N^*)$. Instead, $N$ scales with the maximum of $M^*$ and $N^*$, implying that XY-Adaptive may actually catch up with the performance of the oracle (with only a multiplicative factor of $16/\alpha$) whenever discarding the suboptimal directions is less expensive than actually identifying the best arm.

## 6 Numerical Simulations

We illustrate the performance of XY-Adaptive and compare it to the XY-Oracle strategy (Eq. 5), the static allocations XY and G, as well as to the fully adaptive version of XY, in which $\hat{\mathcal{X}}$ is updated after every pull and the bound from Prop. 2 is used. For a fixed confidence $\delta = 0.05$, we compare the sampling budget needed to identify the best arm with probability at least $1 - \delta$. We consider a set of arms $\mathcal{X} \subseteq \mathbb{R}^d$ with $|\mathcal{X}| = d + 1$, including the canonical basis $(e_1, \ldots, e_d)$ and an additional arm $x_{d+1} = [\cos(\omega)\ \sin(\omega)\ 0\ \ldots\ 0]^\top$. We choose $\theta^* = [2\ 0\ 0\ \ldots\ 0]^\top$ and fix $\omega = 0.01$, so that $\Delta_{\min} = (x_1 - x_{d+1})^\top \theta^*$ is much smaller than the other gaps. In this setting, an efficient sampling strategy should focus on reducing the uncertainty in the direction $y = x_1 - x_{d+1}$ by pulling the arm $x_2 = e_2$, which is almost aligned with $y$. Indeed, from the rewards obtained from $x_2$ it is easier to decrease the uncertainty about the second component of $\theta^*$, which is precisely the dimension that allows us to discriminate between $x_1$ and $x_{d+1}$. We fix $\alpha = 1/10$ and take the noise $\varepsilon \sim \mathcal{N}(0, 1)$. Each phase begins with an initialization matrix $A_0$ obtained by pulling each canonical arm once. In Fig. 4 we report the sampling budget of the algorithms, averaged over 100 runs, for $d = 2, \ldots, 10$.

*Figure 4: The sampling budget needed to identify the best arm as the dimension of the input space grows from $d = 2$ to $d = 10$, for the Fully adaptive, G, XY, XY-Adaptive, and XY-Oracle strategies.*

**The results.** The numerical results show that XY-Adaptive is effective in allocating the samples so as to shrink the uncertainty in the direction $y$. Indeed, XY-Adaptive identifies the most important direction after a few phases and performs an allocation that mimics that of the oracle. On the contrary, XY and G do not adjust to the empirical gaps and consider all directions as equally important. This behavior forces XY and G to allocate samples until the uncertainty is smaller than $\Delta_{\min}$ in all directions. Even though the fully adaptive algorithm also identifies the most informative direction rapidly, the additional $\sqrt{d}$ factor in the bound delays the discarding of arms and prevents the algorithm from gaining any advantage over XY and G. As shown in Fig. 4, the difference between the budget of XY-Adaptive and that of the static strategies increases with the number of dimensions. In fact, while additional dimensions have little to no impact on XY-Oracle and XY-Adaptive (the only important direction remains $y$, independently of the number of unknown features of $\theta^*$), for the static allocations more dimensions imply more directions to consider and more features of $\theta^*$ to estimate uniformly well until the uncertainty falls below $\Delta_{\min}$.
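For reference, the simulation setup described above is easy to reproduce. The sketch below is our own (function and variable names are hypothetical); it assumes $d \geq 2$:

```python
import numpy as np

def make_instance(d, omega=0.01, seed=0):
    """Arm set of Sect. 6: e_1..e_d plus x_{d+1} = [cos(w), sin(w), 0, ..., 0],
    with theta* = [2, 0, ..., 0] and N(0, 1) reward noise. Requires d >= 2."""
    arms = np.vstack([np.eye(d), np.zeros(d)])
    arms[d, 0], arms[d, 1] = np.cos(omega), np.sin(omega)
    theta_star = np.zeros(d)
    theta_star[0] = 2.0
    rng = np.random.default_rng(seed)
    pull = lambda i: arms[i] @ theta_star + rng.normal()
    return arms, theta_star, pull

# Delta_min = (x_1 - x_{d+1})^T theta* = 2 * (1 - cos(0.01)) ~ 1e-4,
# far smaller than the gaps of the other canonical arms (which equal 2).
arms, theta_star, pull = make_instance(d=5)
```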
## 7 Conclusions

In this paper we studied the problem of best-arm identification with a fixed confidence in the linear bandit setting. We first offered a preliminary characterization of the problem-dependent complexity of the best-arm identification task and showed its connection with the complexity of the MAB setting. We then designed and analyzed efficient sampling strategies for this problem. The G-allocation strategy allowed us to point out a close connection with optimal experimental design techniques, and in particular with the G-optimality criterion. With the second proposed strategy, XY-allocation, we introduced a novel optimal design problem in which the testing arms do not coincide with the arms chosen in the design. Lastly, we pointed out the limits that a fully adaptive allocation strategy may have in the linear bandit setting and proposed a phased algorithm, XY-Adaptive, that learns from previous observations without suffering from the dimensionality of the problem. Since this is one of the first works to analyze pure-exploration problems in the linear bandit setting, it opens the way for a number of problems already studied in the MAB setting. For instance, one could investigate strategies for identifying the best linear arm with a limited budget, or study best-arm identification when the set of arms is very large (or infinite). Some interesting extensions also emerge from the optimal experimental design literature, such as the study of sampling strategies for meeting the G-optimality criterion when the noise is heteroscedastic, or the design of efficient strategies for satisfying other related optimality criteria, such as V-optimality.

**Acknowledgments.** This work was supported by the French Ministry of Higher Education and Research, the Nord-Pas de Calais Regional Council and FEDER through the "Contrat de Projets État Région 2007-2013", and by the European Community's Seventh Framework Programme under grant agreement no. 270327 (project CompLACS).

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), 2011.
[2] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory (COLT), 2010.
[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.
[4] Mustapha Bouhtou, Stéphane Gaubert, and Guillaume Sagnol. Submodularity and randomized rounding techniques for optimal experimental design. Electronic Notes in Discrete Mathematics, 36:679-686, 2010.
[5] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT), 2009.
[6] Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), pages 258-265, 2013.
[7] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, December 2006.
[8] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
[9] Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Sébastien Bubeck. Multi-bandit best arm identification. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), pages 2222-2230, 2011.
[10] Matthew D. Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 365-374, 2014.
[11] Kevin G. Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Conference on Learning Theory (COLT), 2014.
[12] Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Proceedings of the 26th Conference on Learning Theory (COLT), pages 228-251, 2013.
[13] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.
[14] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661-670, 2010.
[15] Friedrich Pukelsheim. Optimal Design of Experiments. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 2006.
[16] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.
[17] Guillaume Sagnol. Approximation of a maximum-submodular-coverage problem involving spectral functions, with application to experimental designs. Discrete Applied Mathematics, 161(1-2):258-276, January 2013.
[18] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. Technical report, http://arxiv.org/abs/1409.6110, 2014.
[19] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 1081-1088, 2006.