# Fast Pure Exploration via Frank-Wolfe

Po-An Wang, KTH Royal Institute of Technology, Stockholm, Sweden, wang9@kth.se
Ruo-Chun Tzeng, KTH Royal Institute of Technology, Stockholm, Sweden, rctzeng@kth.se
Alexandre Proutiere, EECS and Digital Futures, KTH, Stockholm, Sweden, alepro@kth.se

Abstract

We study the problem of active pure exploration with fixed confidence in generic stochastic bandit environments. The goal of the learner is to answer a query about the environment with a given level of certainty while minimizing her sampling budget. For this problem, instance-specific lower bounds on the expected sample complexity reveal the optimal proportions of arm draws an Oracle algorithm would apply. These proportions solve an optimization problem whose tractability strongly depends on the structural properties of the environment, but they may be instrumental in the design of efficient learning algorithms. We devise Frank-Wolfe-based Sampling (FWS), a simple algorithm whose sample complexity matches the lower bounds for a wide class of pure exploration problems. The algorithm is computationally efficient as, to learn and track the optimal proportions of arm draws, it relies on a single iteration of the Frank-Wolfe algorithm applied to the lower-bound optimization problem. We apply FWS to various pure exploration tasks, including best arm identification in unstructured, thresholded, linear, and Lipschitz bandits. Despite its simplicity, FWS is competitive with state-of-the-art algorithms.

1 Introduction

Pure exploration in stochastic bandits [24] refers to the task of answering a given question about the reward distributions of the different arms, using as few arm pulls (or samples) as possible. The task may correspond to identifying the best arm [13], the top-m arms [37], all ϵ-good arms [27], a set of arms whose expected rewards exceed a given threshold [26], etc. To reduce the sample complexity of such a task, the learner needs to leverage as much as possible the information available about the reward distributions, which typically comes as known structural properties of the set of their expected rewards. Exploiting particular structures (e.g., unimodal, Lipschitz, convex, linear) has been thoroughly studied in the regret minimization setting (see [6] and references therein), but less so in the pure exploration framework, where most efforts have focused on linear structures [35, 20, 39, 36, 10, 18, 9].

In this paper, we investigate a generic learning problem proposed in [8] and covering the aforementioned pure exploration tasks with or without structure. Consider K arms whose reward distributions (ν_1, . . . , ν_K) come from a one-dimensional exponential family and have unknown means µ = (µ_1, . . . , µ_K). The parameter µ is known to belong to Λ ⊆ R^K, the set of possible instances. For each µ ∈ Λ, we assume that there is a unique true answer i*(µ) that belongs to the finite set I of possible answers¹ (e.g., for the best arm identification task, i*(µ) = arg max_k µ_k). We consider pure exploration tasks in the fixed confidence setting where the learner wishes, for any possible µ ∈ Λ, to discover i*(µ) with a certain level of confidence 1 − δ, for some δ ∈ (0, 1).

¹Scenarios with several correct answers require a more involved analysis, see [7].
The learner's strategy is defined by (i) an adaptive sampling rule dictating the sequence of arm pulls, (ii) a stopping rule defining τ, the round where, based on the data gathered so far, the learner decides to stop pulling arms, and (iii) a decision rule specifying her answer. The goal is to devise a δ-PAC strategy (one that outputs the right answer with probability at least 1 − δ for any µ ∈ Λ) minimizing the expected sample complexity E_µ[τ].

Using the same arguments as those used in [13] for classical MAB problems, we may derive a lower bound on the expected sample complexity satisfied by any δ-PAC strategy. This lower bound, whose proof can be found in Appendix B for completeness, is given by T*(µ) kl(δ, 1 − δ), where the characteristic time T*(µ) is defined through the following optimization problem:

$$T^*(\mu)^{-1} = \sup_{\omega \in \Sigma}\ \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{k=1}^{K} \omega_k\, d(\mu_k, \lambda_k), \qquad (1)$$

where Σ is the (K − 1)-dimensional simplex, Alt(µ) is the set of confusing parameters λ ∈ Λ such that i*(µ) ≠ i*(λ), kl(a, b) is the KL divergence between two Bernoulli distributions of means a and b, and d(µ_k, λ_k) denotes the KL divergence of arm-k reward distributions under parameters µ and λ. A solution ω*(µ) of (1) can be interpreted as an optimal allocation, in the sense that pulling each arm i a proportion of rounds equal to ω*_i(µ) (in expectation) constitutes an optimal sampling rule.

Most existing algorithms achieving an asymptotically (when δ goes to 0) minimal sample complexity leverage the Track-and-Stop (TaS) framework [13]. In each round t, they plug µ̂(t), the estimated expected arm rewards, into the lower bound optimization problem (1), and track the allocation ω*(µ̂(t)). As already noticed in [28], the main drawback of the Track-and-Stop framework is that it requires recurrent access to an Oracle able to solve (1) (existing analyses usually even assume that the Oracle outputs the exact solution for any µ). (1) is a concave program but can become difficult to solve depending on the underlying structure Λ. Indeed, for complex structures, identifying the most confusing parameters attaining the objective inf_{λ ∈ Alt(µ)} Σ_{k=1}^K ω_k d(µ_k, λ_k) can be hard.

Contributions. 1) Instead of solving (1) in each round as in the TaS framework, we propose an online iterative method to approach the optimal allocation of arm pulls. Specifically, we devise Frank-Wolfe-based Sampling (FWS), a computationally efficient algorithm that just relies, in each round, on a single iteration of the Frank-Wolfe (FW) algorithm applied to (1) instantiated at µ̂(t). 2) For a wide class of pure exploration problems with or without structure, we derive an upper bound on the expected sample complexity of FWS for any certainty level δ, and show that this bound matches the lower bound T*(µ) kl(δ, 1 − δ) asymptotically as δ goes to 0. 3) We illustrate the performance of FWS on various pure exploration problems, including best arm identification in unstructured, linear, and Lipschitz bandits. In all tested scenarios, and despite its simplicity, FWS matches the performance of the best existing algorithms.
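For concreteness, the following sketch (our illustrative code, not the authors' implementation) solves (1) numerically for a small unstructured Bernoulli best-arm-identification instance: the inner infimum is computed over the cell where a given challenger j beats the best arm, and the outer supremum is taken over the simplex. This is the computation a TaS oracle repeats in every round, and that FWS replaces by a single Frank-Wolfe iteration. Function names and the SLSQP-based solver are assumptions of the sketch.

```python
# Brute-force numerical sketch of the oracle computation behind Track-and-Stop for
# unstructured Bernoulli best-arm identification: evaluate
# F_mu(omega) = min_{j != i*} inf_{lambda: lambda_j >= lambda_{i*}} sum_k omega_k d(mu_k, lambda_k),
# then maximize over the simplex to obtain omega*(mu) and T*(mu).
import numpy as np
from scipy.optimize import minimize

def d_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def f_j(omega, mu, j, i_star):
    """inf over the cell where arm j beats arm i*: only coordinates i* and j move."""
    obj = lambda lam: (omega[i_star] * d_bernoulli(mu[i_star], lam[0])
                       + omega[j] * d_bernoulli(mu[j], lam[1]))
    cons = [{"type": "ineq", "fun": lambda lam: lam[1] - lam[0]}]  # lambda_j >= lambda_{i*}
    res = minimize(obj, x0=[mu[i_star], mu[i_star]], bounds=[(1e-6, 1 - 1e-6)] * 2,
                   constraints=cons, method="SLSQP")
    return res.fun

def F_mu(omega, mu):
    i_star = int(np.argmax(mu))
    return min(f_j(omega, mu, j, i_star) for j in range(len(mu)) if j != i_star)

def oracle(mu):
    """Solve sup_{omega in Sigma} F_mu(omega); returns (omega*(mu), T*(mu))."""
    K = len(mu)
    res = minimize(lambda w: -F_mu(w, mu), x0=np.full(K, 1.0 / K),
                   bounds=[(1e-6, 1.0)] * K,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return res.x, -1.0 / res.fun

mu = np.array([0.6, 0.5, 0.45, 0.3])
w_star, T_star = oracle(mu)
print("omega*(mu) ~", np.round(w_star, 3), "  T*(mu) ~", round(T_star, 1))
```

A real oracle would exploit the closed forms recalled in Example 1 (Section 3) rather than nested numerical optimization; the sketch only illustrates the per-round cost that TaS incurs.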
The use of the FW algorithm was suggested in [13] for the best arm identification problem in unstructured bandits. In this case, FW iterations take a very simple and intuitive form (see Example 1 introduced in Section 3). The corresponding sampling rule is referred to as Best Challenger in [13], and leads to algorithms with remarkably low empirical sample complexity, sometimes lower than that of TaS algorithms solving (1) in each round. So far, however, as discussed in [28], the analysis of FW-type sampling rules, and even their convergence, has eluded researchers. Towards the design of the FWS algorithm, we devise a simple variant of the FW algorithm that yields a sampling rule whose sample complexity can be analyzed. We confirm the asymptotic optimality of FWS as well as its empirical superiority, not only for best arm identification in unstructured bandits as predicted by [13], but also for a wide class of pure exploration problems. We believe that our analysis also brings interesting solutions to the three important obstacles we needed to tackle to devise and analyze a FW-type sampling rule: (i) the objective function in (1) is not smooth; (ii) its curvature generally becomes infinite close to the boundary of Σ; and (iii) the estimate µ̂(t) evolves over time and might be far from µ.

2 Related Work

Best Arm Identification (BAI) has recently received a lot of attention, either in unstructured bandit problems, see [13, 33], or in problems with various kinds of structure, e.g., linear [35, 20, 39, 36, 10, 18, 9, 31], combinatorial [23, 19, 32], spectral [22], monotone [14], cascading [41]. For BAI in unstructured bandits with fixed confidence, [13] developed the celebrated Track-and-Stop framework, leading to algorithms able to asymptotically converge towards the optimal allocation of arm draws and, in turn, to achieve the lowest possible sample complexity in the high-confidence regime (as δ goes to 0). It is possible to apply the TaS framework to specific structures, as proposed in [18] for linear bandits. However, for more involved structures, this might become computationally too difficult. Indeed, TaS requires the learner to repeatedly solve the optimization problem (1).

The authors of [8] propose and exploit an interpretation of the lower bound optimization problem (1) as the solution of a 2-player game (the ω-player playing the sup and the λ-player playing the inf). The algorithm presented in [8] combines two zero-regret algorithms applied sequentially by the two players, and converges to an optimal allocation. Interestingly, the algorithm uses the optimism-in-the-face-of-uncertainty principle to remove the need for forced exploration (the ω-player is fed with upper confidence bounds on her rewards). As shown later, the algorithm does not perform as well as FWS. The applicability of the framework used in [8] remains unclear to us: in [9] and [19], the authors claim that the framework cannot be applied to linear and combinatorial bandits, respectively.

In [28], the author proposes a solution close to ours. His algorithm, LMA (Lazy Mirror Ascent), just runs in each round one iteration of a sub-gradient ascent algorithm applied to (1). Fortunately, the projection step usually involved in such algorithms is simple. Numerically, as illustrated later in the paper, we found that LMA may not be as efficient as TaS or FWS. One possible explanation is that LMA has similarities with the Exponential Weights algorithm (see Appendix F in [28]), an algorithm designed for adversarial online optimization problems, and may hence be too conservative in a stochastic setting. As already mentioned in the introduction, FW-based algorithms for BAI in unstructured bandits were first mentioned in [13] for their simplicity and good performance. Applying FW as if the objective function were smooth may experimentally fail to converge [28].
We believe that our algorithm makes the minimal modification of the FW-based approach needed to guarantee convergence and asymptotic optimality. Finally, note that [2] uses FW in a regret minimization problem, but with a smooth objective function.

We conclude this section by mentioning existing works on the FW algorithm applied to optimizing non-smooth functions. The proposed solutions consist in either smoothing the objective function or enlarging the set of differentials (the second approach is the one we adopt). [11, 15] apply FW to a randomly smoothed surrogate instead of the original non-smooth objective. However, computing the gradient at each iteration then requires many queries to the objective function, which may not be practical. [1, 29] use a proximal operator to replace the objective function, but as pointed out in [4], the smoothing parameters of the proximal operator are not trivial to tune. Our solution is close to those developed in [30, 4]. There, inspired by the approximate subdifferential [38], the authors propose to collect the set of gradients in a neighborhood of the current iterate at each round. They show that this collection is continuous even when the objective function is non-smooth, which allows for the use of FW. The way we deal with the non-smoothness issue is similar, but simplified by the specific form of our objective function.

3 Preliminaries

We consider the pure exploration task described in the introduction. This section presents the additional assumptions made towards the design and analysis of our algorithm. These assumptions are illustrated here on the classical Best Arm Identification (BAI) task in unstructured bandits (see Example 1); they will be verified for all other examples of pure exploration problems presented in Section 5. This section also provides useful properties of the lower bound optimization problem (1), and finally describes our choice of stopping and decision rules.

3.1 Assumptions and properties of the lower bound optimization problem

The answer map i*: Λ → I allows us to decompose Λ into a union of non-overlapping sets: Λ = ∪_{i ∈ I} S_i, where S_i = {µ ∈ Λ : i*(µ) = i} for all i ∈ I. The answer map is known (i.e., knowing µ is enough to output the right answer), and hence, without loss of generality, we can assume that S_i ≠ ∅ for all i ∈ I. Using this notation, the set of confusing parameters can be written as Alt(µ) = ∪_{i ≠ i*(µ)} S_i.

Assumption 1. For each i ∈ I, S_i is an open set and the complement of S_i is a finite union of convex sets. Namely, there exists a finite collection J_i of convex sets C^i_j such that Λ \ S_i = ∪_{j ∈ J_i} C^i_j.

Example 1. The BAI task in unstructured bandits with Bernoulli rewards. For this task, we have Λ = (0, 1)^K, I = {1, . . . , K}, and for each arm i, the set of parameters for which arm i is the best arm is S_i = {µ ∈ Λ : µ_i > µ_k, ∀k ≠ i}. We have Λ \ S_i = ∪_{j ∈ J_i} C^i_j, where J_i = I \ {i} is the set of arms different from i, and C^i_j = {µ ∈ Λ : µ_j > µ_i} is the convex set of parameters for which arm j is better than arm i.

Now under Assumption 1, we can decompose the lower bound optimization problem as follows: T*(µ)^{-1} = sup_{ω ∈ Σ} F_µ(ω), where F_µ(ω) = min_{j ∈ J_{i*(µ)}} f_j(ω, µ) and, for all j ∈ J_{i*(µ)},

$$f_j(\omega, \mu) = \inf_{\lambda \in C^{i^*(\mu)}_j} \sum_{k=1}^{K} \omega_k\, d(\mu_k, \lambda_k). \qquad (2)$$

Note that (2) is a convex program (by convexity of the KL divergence), and that f_j is a concave function of ω (as the minimum of concave functions). As a consequence, the objective function F_µ is also concave, but not smooth.
The following proposition summarizes insightful properties of the functions f_j, j ∈ J_{i*(µ)}. It is a consequence of the envelope theorem and is proved in Appendix K.2.

Proposition 1. Let i ∈ I and j ∈ J_i. Define, for all (ω, µ) ∈ Σ° × S_i,

$$\lambda_j(\omega, \mu) = \arg\min_{\lambda \in \mathrm{cl}(C^i_j)} \sum_{k=1}^{K} \omega_k\, d(\mu_k, \lambda_k), \qquad (3)$$

where cl(C^i_j) is the closure of C^i_j. Then under Assumption 1, λ_j(ω, µ) is unique for all (ω, µ) ∈ Σ° × S_i, where Σ° is the interior of Σ. In addition, f_j is continuously differentiable on Σ° × S_i, and, for all (ω, µ) ∈ Σ° × S_i,

$$\nabla_\omega f_j(\omega, \mu) = \sum_{k=1}^{K} d\big(\mu_k, \lambda_j(\omega, \mu)_k\big)\, e_k, \qquad (4)$$

where e_k denotes the K-dimensional vector whose k-th coordinate is 1 and whose other coordinates are 0.

A key insight from the above result is that the objective function F_µ is the minimum of a finite number of continuously differentiable functions. This observation will make the use of a slightly modified FW algorithm possible (remember that the FW algorithm is known to converge for smooth functions only). We use an additional assumption on the gradient and curvature of f_j. A controlled curvature is an essential ingredient when analyzing the convergence of FW-based algorithms, see e.g. [17]. Define Σ_γ = {ω ∈ Σ : min_k ω_k ≥ γ} for any γ ∈ (0, 1/K). Following [17], we define C_ψ(𝒦), the curvature constant of the concave differentiable function ψ : 𝒦 → R with respect to the compact set 𝒦, as

$$C_\psi(\mathcal{K}) = \sup_{\substack{x, z \in \mathcal{K},\ \alpha \in (0,1] \\ y = x + \alpha(z - x)}} \frac{1}{\alpha^2}\Big[\psi(x) - \psi(y) + \langle y - x, \nabla\psi(x)\rangle\Big]. \qquad (5)$$

Refer to [17] for the intuition behind this definition and for examples.

Assumption 2. For all µ ∈ Λ, (i) there exists L > 0 such that, for all j ∈ J_{i*(µ)} and ω ∈ Σ, ‖∇_ω f_j(ω, µ)‖ ≤ L; (ii) there exists D > 0 such that, for all γ ∈ (0, 1/K) and j ∈ J_{i*(µ)}, C_{f_j(·, µ)}(Σ_γ) ≤ D/γ.

There is a simple way to verify whether a pure exploration problem satisfies Assumption 2, by looking at the second derivative of the function y ↦ d(x, y) at the points (µ_k, λ_j(ω, µ)_k) for all k. Refer to Appendix C for details.

Example 1 (cont'd). For unstructured bandits with Bernoulli rewards, we can easily compute f_j and its gradient [13, 28]: for all j ≠ i*(µ) and all ω ∈ Σ°, define

$$m_j(\omega, \mu) = \frac{\omega_{i^*(\mu)}\,\mu_{i^*(\mu)} + \omega_j\,\mu_j}{\omega_{i^*(\mu)} + \omega_j}.$$

Then λ_j(ω, µ)_k = µ_k if k ∉ {i*(µ), j}, and λ_j(ω, µ)_k = m_j(ω, µ) otherwise. As a consequence:

$$f_j(\omega, \mu) = \omega_{i^*(\mu)}\, d\big(\mu_{i^*(\mu)}, m_j(\omega, \mu)\big) + \omega_j\, d\big(\mu_j, m_j(\omega, \mu)\big),$$
$$\nabla_\omega f_j(\omega, \mu) = d\big(\mu_{i^*(\mu)}, m_j(\omega, \mu)\big)\, e_{i^*(\mu)} + d\big(\mu_j, m_j(\omega, \mu)\big)\, e_j. \qquad (6)$$

For this example, we can verify that Assumption 2 holds, either directly or using the tool described in Appendix C.
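A short sketch (our illustrative code, under the Bernoulli BAI assumptions of Example 1) of these closed forms, which FWS evaluates in every round instead of calling an oracle:

```python
# Closed forms of Example 1 (Bernoulli BAI): the weighted mid-point m_j(omega, mu),
# the value f_j(omega, mu), its gradient from (6), and F_mu = min_j f_j.
import numpy as np

def d_bernoulli(p, q):
    """Bernoulli KL divergence d(p, q)."""
    p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def f_and_grad(omega, mu, j):
    """Return f_j(omega, mu) and its gradient with respect to omega, cf. (6)."""
    K, i_star = len(mu), int(np.argmax(mu))
    m = (omega[i_star] * mu[i_star] + omega[j] * mu[j]) / (omega[i_star] + omega[j])
    val = omega[i_star] * d_bernoulli(mu[i_star], m) + omega[j] * d_bernoulli(mu[j], m)
    grad = np.zeros(K)
    grad[i_star], grad[j] = d_bernoulli(mu[i_star], m), d_bernoulli(mu[j], m)
    return val, grad

def F_mu(omega, mu):
    """F_mu(omega) = min over challengers j != i*(mu) of f_j(omega, mu)."""
    i_star = int(np.argmax(mu))
    return min(f_and_grad(omega, mu, j)[0] for j in range(len(mu)) if j != i_star)

mu, omega = np.array([0.6, 0.5, 0.45]), np.array([0.5, 0.3, 0.2])
print(F_mu(omega, mu))
```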
3.2 Stopping and decision rules

Next we present the last two components of the FWS algorithm, namely the stopping and decision rules. These components are standard and borrowed from the existing literature. We need a few notations. For any t ≥ 1, let A_t denote the arm selected in round t. Define N_k(t) = Σ_{s=1}^t 1{A_s = k}, the number of times arm k has been selected up to round t, and ω_k(t) = N_k(t)/t, the corresponding empirical proportion of draws. When N_k(t) > 0, the empirical average reward of arm k up to round t is denoted by µ̂_k(t) = Σ_{s=1}^t X_k(s) 1{A_s = k} / N_k(t), where X_k(s) is the random reward received from pulling arm k in round s. Let τ denote the stopping time defining when the algorithm stops exploring and has to output a decision. Our decision rule is simply to output the best empirical answer: ı̂_τ = i*(µ̂(τ)). For the stopping rule, as in other existing algorithms, we leverage a Generalized Likelihood Ratio Test (GLRT). Our test boils down to comparing t F_{µ̂(t)}(ω(t)) to a threshold β(t, δ) (recall that F_µ is the objective function of the lower bound optimization problem):

$$\tau = \inf\{t \ge 1 : t\, F_{\hat\mu(t)}(\omega(t)) \ge \beta(t, \delta)\}. \qquad (7)$$

Many thresholds β(t, δ) have been proposed in the literature [21, 13, 18, 28]. For FWS and its analysis, we just need the threshold to satisfy the two following properties:

$$\forall t \ge 1, \quad t\, F_{\hat\mu(t)}(\omega(t)) \ge \beta(t, \delta) \implies \mathbb{P}_\mu\big[i^*(\hat\mu(t)) \ne i^*(\mu)\big] \le \delta, \qquad (8)$$
$$\exists c_1(\Lambda), c_2(\Lambda) > 0 : \forall t \ge c_1(\Lambda), \quad \beta(t, \delta) \le \log\frac{c_2(\Lambda)\, t}{\delta}. \qquad (9)$$

The first of the above properties naturally implies that FWS returns the true answer with probability at least 1 − δ when it stops, whereas the second will be instrumental in the sample complexity analysis (there, c_1(Λ) and c_2(Λ) may depend on the set of possible instances and on the reward distributions). In [21], the authors provide, for any generic pure exploration task, a single threshold satisfying (8)-(9). Unless otherwise mentioned, we use the stopping rule implementing this threshold.

4 The FWS Algorithm and its Sample Complexity

In the FWS algorithm, we use the FW algorithm to learn an optimal allocation ω*(µ). In each round, an iteration of FW updates the allocation that the FWS algorithm then aims at approaching using some tracking procedure. We describe this learning and tracking procedure below.

4.1 Adapting Frank-Wolfe to the non-smooth function F_µ

The FW algorithm [12] solves smooth convex programs by linearizing, in each iteration, the objective function and moving towards a minimizer of this linear function. Compared to projected gradient and proximal methods, FW is computationally more efficient (e.g., it avoids the projection step) and is particularly well-suited when optimizing over polyhedra [3] (which is our case here). For a contemporary treatment of FW, refer to [17]. FW was suggested in [13] for BAI in unstructured bandits to update the allocation to be tracked. For this BAI problem, an iteration of the FW algorithm takes an intuitive form (see also Appendix A2 in [28]):

Example 1 (cont'd). For BAI in unstructured bandits, the optimal allocation ω*(µ) is the maximizer of the function ω ↦ F_µ(ω) = min_j f_j(ω, µ). F_µ is smooth at points ω where the minimum is attained at a single arm j̄ = arg min_j f_j(ω, µ), and there, in view of (6), its gradient is ∇F_µ(ω) = d(µ_{i*(µ)}, m_{j̄}(ω, µ)) e_{i*(µ)} + d(µ_{j̄}, m_{j̄}(ω, µ)) e_{j̄}. In an iteration of the FW algorithm, one would then follow the direction given by arg max_{z ∈ Σ} ⟨z, ∇F_µ(ω)⟩. This direction is e_{j̄} if d(µ_{j̄}, m_{j̄}(ω, µ)) > d(µ_{i*(µ)}, m_{j̄}(ω, µ)), and e_{i*(µ)} otherwise. This is precisely what the FW-type sampling rule suggested in [13] does: in round (t + 1), the best challenger is defined as j̄ = arg min_j f_j(ω(t), µ̂(t)), and the arm selected corresponds to the direction given by arg max_{z ∈ Σ} ⟨z, ∇F_{µ̂(t)}(ω(t))⟩, i.e., it is either the best challenger j̄ or the best empirical arm i*(µ̂(t)).
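The best-challenger step above can be written in a few lines; the sketch below is our illustration for Bernoulli rewards (the function names are ours), reusing the closed forms of Example 1:

```python
# Best-challenger step for Bernoulli BAI, as suggested in [13]: find the challenger
# j_bar minimizing f_j(omega(t), mu_hat(t)), then pull whichever of {j_bar, empirical best}
# carries the larger gradient coordinate in (6).
import numpy as np

def d_bernoulli(p, q):
    p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def best_challenger_arm(omega, mu_hat):
    i_star = int(np.argmax(mu_hat))
    best_val, j_bar, m_bar = np.inf, None, None
    for j in range(len(mu_hat)):
        if j == i_star:
            continue
        m = (omega[i_star] * mu_hat[i_star] + omega[j] * mu_hat[j]) / (omega[i_star] + omega[j])
        val = omega[i_star] * d_bernoulli(mu_hat[i_star], m) + omega[j] * d_bernoulli(mu_hat[j], m)
        if val < best_val:
            best_val, j_bar, m_bar = val, j, m
    # FW direction: e_{j_bar} if its gradient coordinate dominates, else e_{i*}
    if d_bernoulli(mu_hat[j_bar], m_bar) > d_bernoulli(mu_hat[i_star], m_bar):
        return j_bar
    return i_star

print(best_challenger_arm(np.array([0.5, 0.3, 0.2]), np.array([0.62, 0.55, 0.40])))
```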
The convergence analysis of FW usually requires that the objective function be smooth and that its curvature be controlled. When applying FW-type algorithms to design an optimal sampling rule (a rule that converges to the allocation ω*(µ) maximizing F_µ), we face three issues: (i) F_µ is not smooth; (ii) F_µ has, in general, an unbounded curvature close to the boundary of Σ; (iii) µ is initially unknown, so the FW iteration in round t can only be applied to F_{µ̂(t)}. We discuss below how we circumvent these issues in the design of our algorithm.

(i) Non-smoothness of F_µ. In view of Proposition 1, F_µ is the minimum of a finite number of smooth concave functions f_j. Hence, at points ω where two of these functions coincide, F_µ is not differentiable. The FW algorithm has been adapted to cope with non-smooth functions, see e.g. [30]. Typically, one constructs continuous approximations of the gradient close to the non-smooth points of the function. This construction often involves the r-subdifferential [16]², which would be too costly to compute for F_µ. Instead, we leverage the fact that F_µ is the minimum of concave functions, and construct the so-called r-subdifferential subspace: for r ∈ (0, 1),

$$H_{F_\mu}(\omega, r) = \mathrm{cov}\big\{\nabla f_j(\omega, \mu) : j \in J_{i^*(\mu)},\ f_j(\omega, \mu) < F_\mu(\omega) + r\big\}, \qquad (10)$$

where cov{S} denotes the convex hull of the set S. This choice greatly simplifies the construction, because it does not require computing the gradients of the f_j in a neighborhood of ω. Since the f_j are continuously differentiable, we can prove that ω ↦ H_{F_µ}(ω, r) is continuous (i.e., upper- and lower-hemicontinuous). Using the r-subdifferential subspace, the modified FW update is given as follows. Let x(t) be the estimated optimizer of F_µ in round t. In round (t + 1), it is updated as:

$$z(t+1) \in \arg\max_{z \in \Sigma}\ \min_{h \in H_{F_\mu}(x(t), r_t)} \langle z - x(t), h\rangle \quad \text{(ties broken arbitrarily)},$$
$$x(t+1) = \frac{t}{t+1}\, x(t) + \frac{1}{t+1}\, z(t+1). \qquad (11)$$

Of course, in the FWS algorithm, µ is unknown and will simply be replaced by µ̂(t) in the above update. The way we choose the sequence of parameters {r_t}_{t≥1} will be discussed later. Computing z(t) is equivalent to solving a zero-sum game, which can be further formulated as an LP [40] (Chapter 20). Refer to Appendix H for a detailed description of this LP.

²For r ∈ (0, 1), the r-subdifferential of ψ : 𝒦 → R (where 𝒦 ⊆ R^K is compact and convex) is defined as ∂_r ψ(x) = {h ∈ R^K : ψ(y) < ψ(x) + ⟨y − x, h⟩ + r for all y ∈ 𝒦}.

(ii) Unbounded curvature of F_µ and (iii) unknown µ. These two issues are solved by a single trick. We impose that, in the FW iterations, the update directions z(t) cover all e_k, k = 1, . . . , K, sufficiently often. This ensures that the target allocation x(t) stays away from the boundary of Σ, which in turn allows us to control the curvature of F_{µ̂(t)} thanks to Assumption 2. This imposed constraint can be seen as a form of forced exploration, and further implies (thanks to our tracking procedure) that each arm is played often enough. With this kind of forced exploration, µ̂(t) concentrates around the true µ.

4.2 Algorithm

The FWS algorithm proceeds as follows. FWS maintains a target allocation, denoted by x(t), its empirical counterpart ω(t), and the empirical average rewards µ̂(t) after round t. After an initialization phase (K rounds where each arm is selected once), FWS alternates between forced exploration and FW updates. More precisely: forced exploration occurs at rounds t where √(t/K) is an integer and at rounds where µ̂(t − 1) ∉ Λ (in this case, we cannot compute the objective function). In a forced exploration round t, the target allocation is updated towards the center of the simplex: x(t) = ((t − 1)/t) x(t − 1) + (1/t)(1/K, . . . , 1/K). FW updates happen in the other rounds. There, the target allocation is updated according to our adapted version of FW (11), where in round t the unknown µ is replaced by µ̂(t − 1). In the successive FW updates, we use r-subdifferential subspaces with a varying parameter r. For the analysis of FWS, we select a sequence of parameters {r_t}_{t≥1} with an appropriate decay rate. After the target allocation is updated in round t, the algorithm tracks this allocation by selecting the arm maximizing over k the ratio x_k(t)/ω_k(t − 1). Finally, FWS, whose pseudo-code is presented below, uses the stopping and decision rules described in Section 3.2.

Algorithm 1: FWS algorithm
Input: confidence level δ, sequence {r_t}_{t≥1}
Initialization: sample each arm once and update ω(K), x(K) = (1/K, . . . , 1/K), and µ̂(K); t ← K
While t F_{µ̂(t)}(ω(t)) < β(t, δ) or µ̂(t) ∉ Λ:
    t ← t + 1
    If √(t/K) ∈ N or µ̂(t − 1) ∉ Λ:  (forced exploration)
        z(t) ← (1/K, . . . , 1/K)
    Else:  (FW update)
        z(t) ← arg max_{z ∈ Σ} min_{h ∈ H_{F_{µ̂(t−1)}}(x(t−1), r_t)} ⟨z − x(t − 1), h⟩  (ties broken arbitrarily)
    Update x(t) ← ((t − 1)/t) x(t − 1) + (1/t) z(t)
    Sample the arm A_t ∈ arg max_k x_k(t)/ω_k(t − 1)  (ties broken arbitrarily)
    Update ω(t) and µ̂(t)
Output: i*(µ̂(t))
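The max-min in the FW update is the zero-sum game mentioned above; the sketch below is our own LP formulation of it (the paper's exact formulation is in its Appendix H), written with scipy.optimize.linprog. Since the inner minimum over the convex hull H is attained at one of its generating gradients, one can maximize an auxiliary scalar c subject to c ≤ ⟨z − x, g⟩ for every r-active gradient g, with z in the simplex.

```python
# Sketch of the modified FW direction: z in argmax_{z in simplex} min_{h in H} <z - x, h>,
# where H is the convex hull of the gradients g_j of the r-active functions f_j.
# Variables (z_1, ..., z_K, c); maximize c subject to c <= <z - x, g_j> for each g_j.
import numpy as np
from scipy.optimize import linprog

def fw_direction(x, active_grads):
    """x: current target allocation (length K); active_grads: list of gradient vectors."""
    K = len(x)
    cost = np.zeros(K + 1)
    cost[-1] = -1.0                                    # minimize -c, i.e. maximize c
    A_ub = np.array([np.concatenate([-g, [1.0]]) for g in active_grads])
    b_ub = np.array([-np.dot(g, x) for g in active_grads])   # -g.z + c <= -g.x
    A_eq = np.concatenate([np.ones(K), [0.0]]).reshape(1, -1)  # sum_k z_k = 1
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * K + [(None, None)], method="highs")
    return res.x[:K]

# Toy call with two active gradients (numbers are purely illustrative).
x = np.array([0.5, 0.3, 0.2])
grads = [np.array([0.04, 0.09, 0.0]), np.array([0.03, 0.0, 0.12])]
print(fw_direction(x, grads))  # the maximizing direction z
```

When only one function f_j is r-active, the LP reduces to the classical FW step, i.e., z is the vertex e_k with the largest gradient coordinate, recovering the best-challenger rule of Example 1.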
4.3 Sample complexity

In the following theorem, we establish the asymptotic optimality of FWS.

Theorem 1. Consider the FWS algorithm with a sequence {r_t}_{t≥1} of strictly positive reals satisfying (i) lim_{t→∞} (1/t) Σ_{s=1}^t r_s = 0 and (ii) lim_{t→∞} t r_t = ∞ (for instance, r_t = t^{−α} for any α ∈ (0, 1) satisfies both conditions). Under Assumptions 1 and 2, the algorithm terminates in finite time almost surely and is δ-PAC. Its sample complexity τ satisfies:

$$\mathbb{P}_\mu\Big[\limsup_{\delta \to 0} \frac{\tau}{\log(1/\delta)} \le T^*(\mu)\Big] = 1, \qquad \text{and} \qquad \limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau]}{\log(1/\delta)} \le T^*(\mu).$$

The proof is given in Appendix I. We sketch the proof of the guarantee in expectation. The proof relies on classical concentration results, but more critically combines continuity arguments (developed in Appendix K) to account for the evolving µ̂(t), and tools to analyze the convergence of the modified FW algorithm (reported in Appendix L).
1. First, using concentration inequalities and the fact that FWS includes forced exploration rounds, we can define, for round t, a "good" event E_t under which µ̂(t) is very close to µ, and such that Σ_{t=1}^∞ P_µ[E_t^c] < ∞. Then, several continuity arguments have to be made. In Lemma 6 (Appendix K), we show that µ ↦ F_µ is continuous (w.r.t. the uniform convergence norm). In Theorem 3 (Appendix K), we also prove that the solution z(t + 1) of the FW update (11) is continuous in µ. The arguments above allow us to analyze the convergence of the FW updates almost as if µ̂(t) were replaced by µ, provided that the event E_t occurs.
2. We can then study, under the event E_t, the impact of the FW update on the target allocation. The main step of our proof is Theorem 6 (Appendix L), characterizing how F_µ(x(t)) gets closer to F_µ(ω*(µ)) in each FW update. We then deduce that, after a time T_1, F_{µ̂(t)}(x(t)) is a good approximation of F_µ(ω*(µ)).
3. We conclude the proof using arguments similar to those in [13]. According to our stopping rule, τ > t implies t F_{µ̂(t)}(ω(t)) < β(t, δ). Hence E_µ[τ] = Σ_{t≥0} P_µ[τ > t] ≤ Σ_{t≥1} P_µ[t F_{µ̂(t)}(ω(t)) < β(t, δ)], which can be approximately upper bounded by T_1 + Σ_{t≥T_1} P_µ[E_t^c] + Σ_{t≥1} P_µ[t F_µ(ω*(µ)) ≤ β(t, δ)]. The proof is concluded by remarking that, in view of property (9) of our stopping threshold, the last sum is close to T*(µ) log(1/δ) as δ → 0.

Note that our proof of Theorem 1 accounts for the possibility, in certain structures (e.g., linear), of having multiple optimal allocations (these allocations form a convex set). We then just reason in terms of the objective function (as in [18] for linear bandits). Under the following additional assumption, we can derive a non-asymptotic sample complexity upper bound for FWS. The proof of the following theorem is presented in Appendix N.

Assumption 3. For any µ ∈ Λ, there exist constants κ, E > 0 such that, if ‖π − µ‖ ≤ κ, then π ∈ S_{i*(µ)} and, for all ω ∈ Σ, j ∈ J_{i*(µ)}, and k = 1, . . . , K, ∇_π d(π_k, λ_j(ω, π)_k) is continuous and ‖∇_π d(π_k, λ_j(ω, π)_k)‖_1 ≤ E.

Theorem 2. Consider the FWS algorithm with a sequence {r_t}_{t≥1} as in Theorem 1.
Under Assumptions 1, 2, and 3, the sample complexity τ of the algorithm satisfies: for any µ ∈ Λ, δ ∈ (0, 1), and any ϵ < min{κE/2, 1},

$$\mathbb{E}_\mu[\tau] \le \frac{1+\epsilon}{F_\mu(\omega^*(\mu)) - 6\epsilon}\left[\log\!\Big(\frac{(1+\epsilon)\, c_2(\Lambda)\, e}{\delta\,(F_\mu(\omega^*(\mu)) - 6\epsilon)}\Big) + \log\log\!\Big(\frac{(1+\epsilon)\, c_2(\Lambda)}{\delta\,(F_\mu(\omega^*(\mu)) - 6\epsilon)}\Big)\right] + \Psi(K, D, E, L, c_1(\Lambda), \epsilon) + T_{\epsilon,L},$$

where T_{ϵ,L} is a constant such that if t ≥ T_{ϵ,L}, then Σ_{s=1}^t r_s < tϵ and t r_t > L. The constant Ψ is polynomial in (D, E, L, c_1(Λ), 1/ϵ) and exponential in K. The precise definition of Ψ is given in Appendix N.

5 Examples and Experiments for Linear Bandits

5.1 Examples

Our framework can be applied to many pure exploration problems, including BAI in unstructured (see Example 1), linear, and Lipschitz bandits. It further covers threshold bandits (the problem of identifying all arms with rewards greater than a threshold), linear threshold bandits, top-m bandits (where we wish to identify the m best arms), and dueling bandits. All these examples are presented in the Appendix. Using numerical experiments, we show that FWS is competitive with state-of-the-art algorithms for BAI in unstructured, linear, and Lipschitz bandits, see Appendices D, E, and F, respectively. To the best of our knowledge, we report the first results for BAI in Lipschitz bandits. We quote some of our results for BAI in linear bandits below. When facing a new pure exploration problem, one can check whether it falls into our framework by first directly verifying Assumption 1. In Appendix C, we provide a simple sufficient condition ensuring that Assumption 2 holds, and explain why all the aforementioned pure exploration problems satisfy this condition.

5.2 BAI in linear bandits

Linear bandits constitute arguably the most popular and important bandit problems with structure, and have found many applications [25, 5]. BAI in linear bandits has received a lot of attention recently, see Section 2. To model linear bandits, we slightly modify our framework. The reason for this modification is that the linear structure is so strong that, using our initial framework, the set Λ would be small, and we would have problems ensuring that µ̂(t) ∈ Λ after some reasonable time t. Alternatively (rather than modifying the framework), we could modify the FWS algorithm so that µ̂(t) is projected onto Λ.

Consider a set of K arms. Arm k is attached a d-dimensional feature vector a_k, and its average reward is ⟨a_k, µ⟩, where µ ∈ R^d is unknown. Without loss of generality, we assume that {a_k}_{k ∈ [K]} spans R^d. We modify the definition of Λ as follows: Λ = {µ ∈ R^d : ∃k ∈ [K] s.t. ⟨a_k − a_i, µ⟩ > 0, ∀i ≠ k}. Hence µ parametrizes the average rewards of the arms, but µ_k is not the average reward of arm k. The true answer is i*(µ) = arg max_k ⟨a_k, µ⟩. The lower bound optimization problem (1) becomes sup_{ω ∈ Σ} F_µ(ω), where

$$F_\mu(\omega) = \inf_{\lambda \in \mathrm{Alt}(\mu)} \frac{1}{2}(\mu - \lambda)^\top \Big(\sum_k \omega_k a_k a_k^\top\Big)(\mu - \lambda)$$

and Alt(µ) = {λ ∈ Λ : ∃k ≠ i*(µ) s.t. ⟨a_k − a_{i*(µ)}, λ⟩ > 0}, see e.g. [18]. From there, we can reproduce our framework: for Assumption 1, for all j ≠ i*(µ), C^{i*(µ)}_j = {λ ∈ Λ : ⟨a_j − a_{i*(µ)}, λ⟩ > 0}; as for the functions f_j, they are defined through:

$$\lambda_j(\omega, \mu) = \mu + \frac{\langle a_{i^*(\mu)} - a_j,\ \mu\rangle}{\|a_{i^*(\mu)} - a_j\|^2_{V_\omega^{-1}}}\, V_\omega^{-1}\big(a_j - a_{i^*(\mu)}\big), \qquad (12)$$

where V_ω = Σ_k ω_k a_k a_k^⊤. In the FWS algorithm for linear bandits, we use the Least-Squares Estimator (LSE) µ̂(t) given past observations, see [18] or Appendix E for an explicit expression. It can be readily seen that this slight modification of our framework does not affect the validity of Theorem 1. We just need to use the concentration inequalities derived in [18] for µ̂(t) in the first step of its proof.
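For illustration, plugging (12) into the quadratic objective gives the familiar gap-over-norm expression for f_j (as in [18]). The sketch below is our own code, assuming unit-variance Gaussian noise; the helper name and the toy instance (a small version of the benchmark of [35]) are our choices.

```python
# Closed forms for BAI in linear bandits (unit-variance Gaussian noise): lambda_j from (12),
# the value f_j(omega, mu) = <a_{i*} - a_j, mu>^2 / (2 ||a_{i*} - a_j||^2_{V^{-1}}),
# and the gradient coordinates <a_k, mu - lambda_j>^2 / 2, cf. (4).
import numpy as np

def f_j_linear(omega, mu, A, j):
    """A: (K, d) matrix of feature vectors; returns (f_j, grad_f_j, lambda_j)."""
    i_star = int(np.argmax(A @ mu))
    V = A.T @ (omega[:, None] * A)            # V_omega = sum_k omega_k a_k a_k^T
    v = A[i_star] - A[j]
    Vinv_v = np.linalg.solve(V, v)
    gap = float(v @ mu)                       # <a_{i*} - a_j, mu>
    norm2 = float(v @ Vinv_v)                 # ||a_{i*} - a_j||^2_{V_omega^{-1}}
    lam = mu - (gap / norm2) * Vinv_v         # equation (12)
    val = gap ** 2 / (2.0 * norm2)            # = (1/2)(mu - lam)^T V_omega (mu - lam)
    grad = (A @ (mu - lam)) ** 2 / 2.0        # gradient of f_j w.r.t. omega
    return val, grad, lam

# Toy instance in the spirit of the benchmark of [35], with d = 2 and phi = 0.1.
phi = 0.1
A = np.array([[1.0, 0.0], [0.0, 1.0], [np.cos(phi), np.sin(phi)]])
mu = np.array([1.0, 0.0])                     # mu = e_1
omega = np.full(len(A), 1.0 / len(A))
print(f_j_linear(omega, mu, A, j=2)[0])
```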
Numerical experiments. We consider the example proposed by [35]. The unknown parameter is µ = e_1, and there are d + 1 arms, e_1, . . . , e_d, cos(φ)e_1 + sin(φ)e_2 in R^d, where (e_1, . . . , e_d) form the standard orthonormal basis. We set d = 6 and φ = 0.1. To assess the performance of the FWS algorithm, we compare it with the following algorithms: the Lazy Track-and-Stop algorithm (LT) from [18]; LinGame-C (CG-C) and LinGame (Lk-C) from [9], as implemented by [34]; and the XY-Adaptive algorithm (XY-A) from [35]. For reference, we also run the Round-Robin algorithm (RR) selecting each arm equally often. For comparison, we finally compute the sample complexity lower bound LB_lin(δ) (equal to T*(µ) kl(δ, 1 − δ)). Except for XY-A, all algorithms implement the same stopping rule defined in (7) with threshold β(t, δ) = log((log(t) + 1)/δ) (this threshold was initially suggested in [13], and is also used in [34] for CG-C and Lk-C). For XY-A, we use the stopping rule advocated in the corresponding paper. Refer to Appendix E for the detailed implementations.

In Table 1, we present the sample complexity (the number of samples gathered before the algorithm stops) averaged over 1000 runs for the various algorithms and for different confidence levels δ ∈ {0.1, 0.01, 0.001, 0.0001}. In Appendix E, we provide detailed results, e.g., including box-plots (to show how confident we are about the values displayed in Table 1), as well as the empirical allocations achieved under the various algorithms.

Table 1: Sample complexity for the linear bandit benchmark example of [35], averaged over 1000 runs. Refer to Appendix E for details, including box-plots.

               FWS     LT    CG-C   Lk-C   XY-A     RR   LB_lin(δ)
δ = 0.1       1030    919    2498   2319   7016   5451     359
δ = 0.01      1614   1464    3501   3431   7779   8814     920
δ = 0.001     2229   1982    4324   4326   9090  12101    1408
δ = 0.0001    2839   2518    5118   5120   9723  15314    1881

6 Conclusion

We have developed FWS, a computationally and statistically efficient algorithm for active pure exploration in bandit problems with fixed confidence. In each round, FWS performs a single iteration of a modified FW algorithm to approach an optimal allocation of arm draws predicted by the asymptotic lower bound. In the FWS algorithm, the FW iterations aim at maximizing a non-smooth function. Our main contribution here is to adapt the design of FW so that its convergence can be analyzed even for this non-smooth function. FW-based pure exploration algorithms have been discussed in the literature, with the belief that they would perform well. We confirm this belief, and even establish the asymptotic optimality of FWS for a wide class of pure exploration problems.

Many interesting research directions could be investigated. Our analysis of the sample complexity in the moderate confidence regime has the advantage of being applicable to generic pure exploration problems, but may not always be tight. For bandits with specific structures, we may refine the analysis in this regime to get better upper bounds. We are also interested in investigating whether the iterative approach used in the FWS algorithm can be extended to more complex problems, such as learning an optimal policy in MDPs, as well as to regret minimization problems. There, instance-specific regret lower bounds and the corresponding optimal exploration process are characterized by the solution of an optimization problem, just as in pure exploration problems.

Acknowledgments and Disclosure of Funding

The authors would like to thank the anonymous reviewers whose comments helped us to improve the manuscript. R.-C. Tzeng is supported by ERC Advanced Grant REBOUND (834862).
A. Proutiere's research is supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. This work was also in part financially supported by Digital Futures.

References

[1] Andreas Argyriou, Marco Signoretto, and Johan Suykens. Hybrid conditional gradient-smoothing algorithms with applications to sparse and low rank regularization. Regularization, Optimization, Kernels, and Support Vector Machines, 2014.
[2] Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Proc. of NeurIPS, 2017.
[3] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. 2012.
[4] Edward Cheung and Yuying Li. Solving separable nonsmooth problems using Frank-Wolfe with uniform affine approximations. In Proc. of IJCAI, 2018.
[5] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proc. of AISTATS, 2011.
[6] Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Proc. of NeurIPS, 2017.
[7] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. In Proc. of NeurIPS, 2019.
[8] Rémy Degenne, Wouter M. Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. In Proc. of NeurIPS, 2019.
[9] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In Proc. of ICML, 2020.
[10] Tanner Fiez, Lalit Jain, Kevin G. Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Proc. of NeurIPS, 2019.
[11] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proc. of SODA, 2005.
[12] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
[13] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proc. of COLT, 2016.
[14] Aurélien Garivier, Pierre Ménard, Laurent Rossi, and Pierre Menard. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017.
[15] Elad Hazan and Satyen Kale. Projection-free online learning. In Proc. of ICML, 2012.
[16] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms I: Fundamentals. Springer Science & Business Media, 2013.
[17] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proc. of ICML, 2013.
[18] Yassir Jedra and Alexandre Proutiere. Optimal best-arm identification in linear bandits. In Proc. of NeurIPS, 2020.
[19] Marc Jourdan, Mojmír Mutný, Johannes Kirschner, and Andreas Krause. Efficient pure exploration for combinatorial bandits with semi-bandit feedback. In Proc. of ALT, 2021.
[20] Zohar S. Karnin. Verification based solution for structured MAB problems. In Proc. of NeurIPS, 2016.
[21] Emilie Kaufmann and Wouter Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419, 2018.
[22] Tomáš Kocák and Aurélien Garivier. Best arm identification in spectral bandits. In Proc. of IJCAI, 2020.
[23] Yuko Kuroki, Junya Honda, and Masashi Sugiyama. Combinatorial pure exploration with full-bandit feedback and beyond: Solving combinatorial optimization under uncertainty with limited observation. arXiv preprint arXiv:2012.15584, 2020.
[24] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
[25] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. of WWW, 2010.
[26] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proc. of ICML, 2016.
[27] Blake Mason, Lalit Jain, Ardhendu Tripathy, and Robert Nowak. Finding all ϵ-good arms in stochastic bandits. In Proc. of NeurIPS, 2020.
[28] Pierre Ménard. Gradient ascent for active exploration in bandit problems. arXiv, 2019.
[29] Federico Pierucci, Zaid Harchaoui, and Jérôme Malick. A smoothing approach for composite conditional gradient with nonsmooth loss. PhD thesis, INRIA Grenoble, 2014.
[30] Sathya N. Ravi, Maxwell D. Collins, and Vikas Singh. A deterministic nonsmooth Frank-Wolfe algorithm with coreset guarantees. INFORMS Journal on Optimization, 2019.
[31] Clémence Réda, Emilie Kaufmann, and Andrée Delahaye-Duriez. Top-m identification for linear bandits. In Proc. of AISTATS, 2021.
[32] Idan Rejwan and Yishay Mansour. Top-k combinatorial bandits with full-bandit feedback. In Proc. of ALT, 2020.
[33] Daniel Russo. Simple Bayesian algorithms for best arm identification. In Annual Conference on Learning Theory. PMLR, 2016.
[34] Xuedong Shang. LinBAI: Gamification of pure exploration for linear bandits. https://github.com/xuedong/LinBAI.jl, 2021. [Online; accessed 09-May-2021].
[35] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Proc. of NeurIPS, 2014.
[36] Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In Proc. of ICML, 2018.
[37] Tengyao Wang, Nitin Viswanathan, and Sébastien Bubeck. Multiple identifications in multi-armed bandits. In Proc. of ICML, 2013.
[38] D. J. White. Extension of the Frank-Wolfe algorithm to concave nondifferentiable objective functions. Journal of Optimization Theory and Applications, 1993.
[39] Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In Proc. of AISTATS, 2018.
[40] Peyton Young and Shmuel Zamir. Handbook of Game Theory. Elsevier, 2014.
[41] Zixin Zhong, Wang Chi Cheung, and Vincent Tan. Best arm identification for cascading bandits in the fixed confidence setting. In Proc. of ICML, 2020.