# feasible_arm_identification__0ddfea25.pdf

Feasible Arm Identiﬁcation

Julian Katz-Samuels 1 Clayton Scott 1

We introduce the feasible arm identiﬁcation problem, a pure exploration multi-armed bandit problem where the agent is given a set of Ddimensional arms and a polyhedron P tx : Ax ď bu Ă RD. Pulling an arm gives a random vector and the goal is to determine, using a ﬁxed budget of T pulls, which of the arms have means belonging to P. We propose three algorithms MD-UCBE, MD-SAR, and MD-APT and provide a uniﬁed analysis establishing upper bounds for each of them. We also establish a lower bound that matches up to constants the upper bounds of MDUCBE and MD-APT. Finally, we demonstrate the effectiveness of our algorithms on synthetic and real-world datasets.

1. Introduction

Pure exploration multi-armed bandit (MAB) problems provide a framework for determining via a sequential experiment which of a set of distributions meet some criteria. In this setting, there are K distributions ν1, . . . , νK and the agent sequentially chooses from which distribution to sample an observation. At the end of the sampling stage, the agent outputs the distributions which he believes meet the desired criteria and the performance of the agent is measured based on the quality of this decision. In the MAB literature, distributions are also referred to as arms, and sampling a realization from a distribution νi is referred to as pulling arm i. The most well-studied of these problems is top-k arm identiﬁcation. In this problem, the goal is to ﬁnd the k best arms, that is, k arms with the largest means. This problem and other pure exploration problems have applications in a wide range of areas, including crowdsourcing, A/B testing, and online advertising.

In many application domains, the arms and the criteria for

1Department of Computer Science and Electrical Engineering, University of Michigan. Correspondence to: Julian Katz-Samuels <jkatzsam@umich.edu>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

a good arm are multi-dimensional in nature. For example, in crowdsourcing it is important to distinguish good workers from bad workers. For a multilabel classiﬁcation task (where examples are associated with multiple labels), a worker can be modeled as a multi-dimensional arm where each dimension corresponds to her accuracy at identifying a particular label, and a natural deﬁnition for a good worker is that her accuracy is above some threshold for each label (e.g., 90%). A common approach for ﬁnding such workers is to use a collection of examples labeled by domain experts as a set of tests. Since workers are paid for each example that they label, often an organization is only willing to spend a limited number of queries to ﬁnd good workers and an effective method under this budget constraint is needed.

As another example, consider A/B testing for designing products such as websites, ads, and video games. In this setting, there are several options for a product and a company diverts trafﬁc to each of the options to determine which one to choose. Multi-dimensional criteria arise naturally in this domain, as well. For example, a company that wants to grow its user base for its website might desire the rate of new subscriptions to be above some level, while still maintaining a certain level of user retention among its current users. If the product is a video game, the company might also be interested in maintaining some metric of user engagement above a certain threshold.

The pure exploration MAB literature lacks (i) a simple framework for describing problems where the arms and criteria are multi-dimensional and (ii) practical algorithms for addressing these problems. In this paper, we aim to address this gap. We introduce the feasible arm identiﬁcation problem in which arms are associated with multi-dimensional distributions and the goal is to ﬁnd arms whose means belong to a given polyhedron1 P tx : Ax ď bu. Polyhedra encompass a large class of regions that can model common user-deﬁned constraints, including thresholds or ranges on individual dimensions and linear constraints involving multiple dimensions. We propose several algorithms for the ﬁxed budget setting and provide upper and lower bounds. Finally, we demonstrate through experiments on synthetic and real-world datasets that by leveraging the geometry of the

1There are several conﬂicting deﬁnitions of polyhedra. We deﬁne a polyhedron as the intersection of a ﬁnite number of closed halfspaces (Boyd and Vandenberghe, 2004).

Feasible Arm Identiﬁcation

problem, our methods signiﬁcantly outperform a uniform allocation strategy. Indeed, in several of our experiments, our methods ﬁnd the feasible arms with a probability that is a factor of 10 better than that of a uniform allocation strategy. All proofs are contained in the supplementary material.

2. Related Work

MABs have received a signiﬁcant amount of attention. Most work considers minimizing the cumulative regret instead of a pure exploration objective. There have been relatively few works on multi-dimensional arms and criteria in this regime (Drugan and Now, 2013; Busa-Fekete et al., 2017; Tekin and Turgay, 2017). Drugan and Now (2013) modify a UCB algorithm to ﬁnd all arms on the Pareto front. Busa-Fekete et al. (2017) use the Generalized Gini Index to optimize all objectives in a fair way. Tekin and Turgay (2017) consider a contextual MAB setting where the goal is to maximize the total reward in a non-dominant objective, subject to the constraint that the total reward in a dominant objective is maximized. These works differ from our work in that (i) they consider the cumulative regret setting, which is fundamentally different from the pure exploration setting (Bubeck et al., 2009), and (ii) they aim to either balance multiple objective functions or ﬁnd arms on the Pareto front, whereas we aim to ﬁnd feasible arms, where feasibility is deﬁned by membership in a given polyhedron.

In recent years, there have been many advances in pure exploration MABs in the ﬁxed conﬁdence and ﬁxed budget settings (Mannor and Tistisklis, 2004; Gabillon et al., 2012; Bubeck et al., 2013; Chen et al., 2014; Jamieson et al., 2014). A limited number of works have considered multi-dimensional feedback. Auer et al. (2016) considered a variant of the top arm identiﬁcation problem where arms are multi-dimensional with each dimension corresponding to a distinct objective that an agent wishes to optimize, and the goal is to identify the Pareto front of the arms. In contrast to our work, they consider the ﬁxed conﬁdence setting. More importantly, Pareto front identiﬁcation and feasible arm identiﬁcation are mathematically very different problems and apply to distinct situations. Whereas Pareto front identiﬁcation is relevant to multi-objective optimization, the feasible arm identiﬁcation problem is useful for situations where there are user-deﬁned criteria for what qualiﬁes as a good arm.

Chen et al. (2017) recently proposed the general sampling problem, which can model a setting where arms are multidimensional and the goal is to ﬁnd arms with means belonging to a given polyhedron. There are several major differences with our work. First, Chen et al. (2017) do not consider multi-dimensional feedback as the agent can sample from one dimension of one arm at a time. Second, whereas they study the ﬁxed conﬁdence setting, we

study the ﬁxed budget setting. Third, they assume isotropic Gaussian arms, whereas we assume each arm is associated with a multi-dimensional sub-Gaussian distribution. Finally, their proposed algorithm (see their Algorithm 7) is sampleinefﬁcient and impractical since in its ﬁrst stage, it employs a uniform allocation strategy until the conﬁdence bounds (deﬁned with δ 0.01) of all of the means either intersect with the given polyhedron or do not intersect with the given polyhedron.

Locatelli et al. (2016) introduced the thresholding bandit problem (TBP), which is essentially the scalar version of the feasible arm identiﬁcation problem, and the algorithm APT. In TBP, there are K scalar-valued distributions, a threshold τ, and a budget T. The goal is to identify all of the distributions with means above τ. Our work signiﬁcantly generalizes TBP by considering multi-dimensional arms and the problem of identifying those arms with means belonging to a given polyhedron. Unlike Locatelli et al. (2016) who only analyze APT, we provide an uniﬁed analysis of three algorithms for the feasible arm identiﬁcation problem. One of our algorithm, MD-APT, reduces to APT in the onedimensional thresholding case and our upper bound also reduces to the upper bound of APT (up to constant factors). To deal with this general setting, we introduce a novel complexity measure that characterizes the hardness of determining whether an arm is in P. This measure is essentially the distance of the mean of an arm to the boundary of the polyhedron. In addition, our general setting introduces technical challenges for establishing upper and lower bounds. We overcome these by using tools from convex analysis, properties of multi-dimensional sub-Gaussian distributions, and change-of-measure arguments involving multi-dimensional distributions.

Recently, Zheng et al. (2017) considered a problem with a polyhedral constraint, but their setup is very different from our own. In their setting, the goal is to solve a linear program where either the constraints are not fully known or the cost function is not fully known but can be estimated by adaptive sampling. In our work, the constraints are known and we wish to learn which out of a collection of distributions have feasible means.

In this section, we formalize the feasible arm identiﬁcation problem. To begin, we deﬁne some notation. For all n P N, let rns t1, . . . , nu. For any x P RD and A Ă RD, let distpx, Aq infy PA }x y}2. Let 1 p1, . . . , 1qt P RD

and 1t u denote the indicator function. Deﬁne SD 1 tx P RD : }x}2 1u.

Suppose we are given K stochastic arms. When the ith arm is pulled, a reward is drawn i.i.d. from a D-dimensional

Feasible Arm Identiﬁcation

distribution νi. Denote µi EX νi X. We assume that the agent is given a polyhedron P tx : Ax ď bu where A P RMˆD and b P RM. Let at j denote the jth row of A. By dividing each constraint by }aj}2, we can assume without loss of generality that }aj}2 1 for all j P r Ms. Let BP denote the boundary of P, i.e., BP s Pz P where s P denotes the closure of P and P denotes the interior of P. For simplicity, we assume that P has positive volume.

We consider the ﬁxed budget setting. The game is as follows: there are T rounds and at each round t, the agent chooses an arm It P r Ks and observes a realization Xt νIt. The goal is to identify all of the arms whose means belong to the polyhedron. To deﬁne a performance measure, let ϵ ą 0 denote the tolerance, and deﬁne Sint P,ϵ ti P r Ks : µi P P and distpµi, BPq ě ϵu and Sout P,ϵ ti P r Ks : distpµi, Pq ą ϵu. Sint P,ϵ is the set of arms that lie in the interior of P by at least ϵ and Sout P,ϵ is the set of arms that lie outside of P by at least ϵ. Let p S Ă r Ks denote the set of arms outputted by an algorithm. We deﬁne the following error measure:

LT,P,ϵpp Sq 1tp S X Sout P,ϵ H _ p Sc X Sint P,ϵ Hu

In words, the goal is to identify all of the arms with means belonging to the polyhedron up to tolerance ϵ in the sense that an algorithm is successful if its output includes every arm i such that µi P P and distpµi, BPq ě ϵ and excludes every arm l such that distpµl, Pq ą ϵ.

We deﬁne the margin of arm i as

p P,ϵq i distpµi, BPq ϵ

" minj Pr Ms distpµi, tx : at jx bjuq ϵ : µi P P distpµi, Pq ϵ : µi R P (1)

" minj Pr Ms bj at jµi ϵ : µi P P distpµi, Pq ϵ : µi R P (2)

where line (1) follows by Lemma H.1 and line (2) follows by the closed form solution of the distance from a point to a hyperplane and }aj}2 1 (Boyd and Vandenberghe, 2004).

The complexity of an instance of the feasible arm identiﬁcation problem is deﬁned to be:

i Pr Ks r p P,ϵq i s 2.

In words, an instance has low complexity if all of the arms are far from the boundary of the polyhedron and high complexity if some of the arms are very close to the boundary. The intuition behind this complexity measure is that for an algorithm to output the correct answer about arm i, it is sufﬁcient to guarantee that an estimate pµi is within a

ball centered at µi with radius p P,ϵq i

2 (see Lemma 3). For

the sake of brevity, we usually write LT,ϵpp Sq, pϵq i , and H instead of LT,P,ϵpp Sq, p P,ϵq i , and HP,ϵ, respectively.

Our analysis assumes that each νi is a multi-dimensional sub-Gaussian distribution, which we now deﬁne (see Vershynin et al. (2017) for more details). Let X be a scalar random variable. We say that X is R-sub-Gaussian if E expp X2

R2 q ď 2. We deﬁne the sub-Gaussian norm of X as the smallest R that satisﬁes the above requirement:

}X}ψ2 inft R ą 0 : E expp X2

A random vector X P RD is sub-Gaussian if Xta is sub Gaussian for all a P RD. The sub-Gaussian norm of X is deﬁned as

}X}ψ2 sup a PSD 1

We say that a random vector X is R-sub-Gaussian if }X}ψ2 ď R. Henceforth, we assume that ν1, . . . , νK are R-sub-Gaussian. See Vershynin (2012) for a discussion of sub-Gaussian distributions.

4. Lower Bound

In this section, we establish a lower bound for the feasible arm identiﬁcation problem. Our construction takes any polyhedron P and means µ1, . . . , µK P P and produces a collection of problems such that any algorithm makes a mistake on one of the problems with probability at least on the order of expp c T

H q (where c is a constant). In fact, this lower bound holds even when the algorithm is given the distance of each arm to the boundary of the polyhedron. If A Ă RD is closed and x P RD, let Proj Apxq denote the projection of x onto A.

Theorem 1. Let P tx P RD : Ax ď bu have positive volume and ϵ ě 0 such that P ϵ tx P P : distpx, BPq ą ϵqu is nonempty. Let µ1, . . . , µK P P ϵ , τi P Proj BP pµiq for all i P r Ks, and µ1 i µi 2pτi µiq for all i P r Ks. Let νi denote the distribution Npµi, Iq and ν1 i the distribution Npµ1 i, Iq. Let B0 denote the product distribution ν1 b. . .b νK and Bi denote the product distribution

ν1 b . . . b νi 1 b ν1 i b νi 1 b . . . b νK.

Then, B0, . . . , BK have the same problem complexity

i 1 r distpµi, BPq ϵs 2

and for any algorithm,

max i Pt0,...,Ku EBip LT,ϵpp Sqq

ě expp 13 T

H 25D logp48plogp Tq 1q KDqqq.

Feasible Arm Identiﬁcation

This lower bound is equal to the lower bound of Locatelli et al. (2016) (see their Theorem 1) up to the factor of D and constants. Since D logpplogp Tq 1q KDqq grows very slowly as a function of T in comparison with T

H , the dependence on D is quite mild. We also note that the lower bound does not depend on the number of constraints M in the polyhedron P, which suggests that the number of constraints of P does not directly affect the statistical difﬁculty of the feasible arm identiﬁcation problem. Since polyhedra approximate convex sets arbitrarily well, the independence of our lower bound from M enables us to derive a nearly identical lower bound for the setting where P is convex (see the supplementary material for details).

The proof of Theorem 1 is based on a novel lower bound construction with multidimensional distributions for MABs. Often, lower bounds in the bandit literature modify scalar distributions and the main idea is to perturb the mean of a scalar distribution by making it either larger or smaller. In the feasible arm identiﬁcation problem, picking a direction to perturb the mean of a distribution is not so simple. Indeed, the direction depends on the polyhedron since for some polyhedra, changing the ﬁrst coordinate does not produce points lying outside of the polyhedron. In our construction, we interchange a distribution νi with mean µi P P with a distribution ν1 i with mean µ1 i that is shifted away from µi in the direction of its projection onto the boundary of P.

Theorem 1 also implies the following non-asymptotic minimax bound.

Corollary 1. Let P tx P RD : Ax ď bu have positive volume, ϵ ě 0 such that P ϵ is nonempty, and R ą 0. Let H ą 0 such that there exists µ1, . . . , µK P P ϵ with

i 1 r distpµi, BPq ϵs 2.

Let BP,ϵ, H,R denote the set of feasible arm identiﬁcation problems on polyhedron P, with tolerance ϵ, and with K arms such that the distributions are R-sub-Gaussian and the problem complexity is less than H. Then, T ě 25 HR2D logp48plogp Tq 1q KDqq implies that, for any algorithm,

sup BPBP,ϵ, H,R EBp LT,ϵpp Sqq ě expp 14 T HR2 q.

In words, this result says essentially that for any polyhedron P and tolerance ϵ ě 0, the induced class of feasible arm identiﬁcation problems with P and ϵ has a minimax lower bound on the order of expp c T HR2 q where c is a constant. Henceforth, we say that an algorithm is nearly optimal if for large enough T its expected loss decays as Opexpp c T HR2 qq where c is a constant.

5. Algorithms

In this section, we extend three algorithms to the feasible arm identiﬁcation problem, namely, an upper conﬁdence bound based algorithm (UCBE) (Audibert and Bubeck, 2010), a successive accepts and rejects algorithm (SAR) (Bubeck et al., 2013; Chen et al., 2014), and the Anytime Parameter-free Thresholding algorithm (APT) (Locatelli et al., 2016). The main novelty of our approach is that our algorithms estimate the distance of the mean of each arm to the boundary of the polyhedron to decide which arm to pull. To begin, we introduce some notation. Let It denote the index of the arm chosen at time t. Let Xi,j,t denote the tth realization of the jth coordinate of νi, Tiptq řt 1 s 1 1t Is iu denote the number of pulls of arm i at round t, and pµi,t denote the estimate of µi after t samples, i.e., pµi,t ppµi,1,t, . . . , pµi,D,tqt where pµi,j,t 1

t řt s 1 Xi,j,s.

The key quantity in each of these algorithms is the following empirical estimator of the margin of each arm:

p pϵq i,t " minj Pr Ms bj at j pµi,t ϵ : pµi,t P P distppµi,t, Pq ϵ : pµi,t R P

Given pµi,t, distppµi,t, Pq can be computed by solving a quadratic program and, thus, the interior point method can compute p pϵq i,t in runtime polynomial in M and D. Each of

our algorithms updates one p pϵq i,t in each round, thus solving at most T quadratic programs. Therefore, each algorithm can be implemented efﬁciently.

Algorithm 1 MD-UCBE: Multi-dimensional Upper Conﬁdence Bound Exploration algorithm

1: Input: K arms, polyhedron P, tolerance ϵ, budget T, hyperparameter a 2: for t 1, . . . , T do 3: if t ď K then 4: Sample Xt νt. 5: else 6: Choose It arg mini p pϵq i,Tiptq b

a Tiptq and sam-

ple Xt νIt. 7: end if 8: end for 9: Return: p S ti P r Ks : pµi,Tip T 1q P Pu

Next, we describe each of the algorithms and our results. Each algorithm outputs p S ti P r Ks : pµi,Tip T 1q P Pu. The algorithms differ in how they decide which arm to pull. MD-UCBE (Algorithm 1) is a modiﬁcation of the algorithm UCBE from Audibert and Bubeck (2010). At each time step t, it pulls an arm i that minimizes p pϵq i,Tiptq b

a Tiptq breaking

ties arbitrarily where a is a hyperparameter. Theorem 2 gives an upper bound on its expected loss.

Feasible Arm Identiﬁcation

Theorem 2. Let K ě 0, T ě K and ϵ ě 0. Suppose 0 ď a ď 25

H . Then, the expected loss of MD-UCBE satisﬁes:

Er LT,ϵpp Sqs ď 2plogp Tq 1q K5D expp a 1600R2 q.

Paralleling our upper bounds for MD-SAR and MD-APT, this result says that the degree of difﬁculty of a problem for MD-UCBE depends on H, i.e., the distance of the arms to the boundary of the polyhedron and the tolerance parameter ϵ. Theorem 2 suggests setting a 25

H , in which case MD-UCBE is nearly optimal. One important shortcoming of this algorithm is that H is not known in practice, so it is unclear how to set the hyperparameter a. Indeed, in our experiments, we show that the performance of MD-UCBE is highly sensitive to the selection of a.

Algorithm 2 MD-SAR: Multi-dimensional Successive Accepts and Rejects algorithm

1: Input: K arms, polyhedron P, tolerance ϵ, budget T

2: Ď logpxq 1

i , n0 0, nk Q T K Ě logp Kqp K 1 kq

pk ą 1q 3: Q r Ks 4: for k 1, . . . , K 1 do 5: Query nk nk 1 samples from all arms i P Q

6: Q ÐÝ Qz arg maxi PQ p pϵq i,nk 7: end for 8: Return: p S ti P r Ks : pµi,Tip T 1q P Pu

MD-SAR (Algorithm 2) extends the SAR algorithm from Bubeck et al. (2013). It divides the budget T into K 1 rounds. In each round, it samples all of the arms belonging to Q Ă r Ks the same number of times. At the end of each round, it removes from Q an arm i that maximizes p pϵq i,Tiptq. Intuitively, MD-SAR stops sampling from an arm i for which there is the least amount of uncertainty about whether µi P P. Theorem 3 provides an upper bound on the expected loss of MD-SAR. It depends on a different complexity term that is nevertheless related to H. Let piq denote the index of the arm with the ith smallest margin so that pϵq p1q ď pϵq p2q ď . . . ď pϵq p Kq and deﬁne the complexity parameter

H2 max i Pr Ks ir pϵq piqs 2.

The analysis of Audibert and Bubeck (2010) of the analogous quantities immediately implies that H2 ď H ď logp2Kq H2.

Theorem 3. Let K ě 0, T ě K and ϵ ě 0. Then, the

expected loss of MD-SAR satisﬁes: Er LT,ϵpp Sqs ď

2plogp Tq 1q K5D expp T K 1296 logp2Kq H2

4K35D expp T K

Similar to previous results on SAR-type algorithms in the ﬁxed budget setting (Audibert and Bubeck, 2010; Chen et al., 2014), our upper bound on MD-SAR is loose by a factor of logp Kq in the exponential. While the guarantee is not tight, it has the signiﬁcant practical advantage over MD-UCBE that it does not involve a difﬁcult-to-tune hyperparameter. On the other hand, MD-SAR has the limitation that it needs to know T in advance.

Algorithm 3 MD-APT: Multi-dimensional Anytime Parameter-Free Thresholding algorithm

1: Input: K arms, polyhedron P, tolerance ϵ, budget T 2: for t 1, . . . , T do 3: if t ď K then 4: Sample Xt νt. 5: else 6: Choose It arg mini p pϵq i,Tiptq a

Tiptq and sample Xt νIt. 7: end if 8: end for 9: Return: p S ti P r Ks : pµi,Tipt 1q P Pu

MD-APT (Algorithm 3) is a modiﬁcation of the APT algorithm in Locatelli et al. (2016). After an initialization phase in which it pulls each arm once, at each round t, it pulls an arm i that minimizes p pϵq i,Tiptq a

Tiptq. The intuition

behind the algorithm is that if the margins pϵq i were known in advance, then a nearly optimal strategy would allocate samples to the arms proportionally to the r pϵq i s 2s. For simplicity, let ϵ 0; the case ϵ ą 0 is not as clear since arms whose distance to the boundary is less than ϵ do not need to be sampled at all. Proposition 1. Let ϵ 0. A static allocation strategy with a total of T r pϵq i s2H pulls of the ith arm @i P r Ks achieves

Er LT,ϵpp Sqs ď 2K5D expp 1

Thus, such a static allocation is nearly optimal. Since the pϵq i s are unknown, MD-APT samples the arms proportionally to the estimates rp pϵq i,Tiptqs 2. Theorem 4 gives an upper bound on the expected loss of MD-APT. Theorem 4. Let K ě 0, T ě 2K, and ϵ ě 0. Then, the expected loss of MD-APT satisﬁes:

Er LT,ϵpp Sqs ď 2plogp Tq 1q K5D expp T 1296R2H q.

Feasible Arm Identiﬁcation

This Theorem implies that MD-APT is nearly optimal. Further, unlike MD-UCBE, it is parameter-free and, unlike MD-SAR, it is an anytime algorithm in the sense that MDAPT does not require knowledge of the budget T. These properties make MD-ADT practical for many applications (Jun and Nowak, 2016).

We note that although the runtime of our algorithms depends on M, our upper bounds on their statistical performance are independent of M. We leverage this result and the fact that one can approximate convex sets arbitrarily well with polyhedra to obtain a computationally inefﬁcient algorithm with nearly the same guarantee as Theorem 4 for the setting where P is convex (see the supplementary material for details).

6. Analysis

Our analyses of the three algorithms are uniﬁed through a series of lemmas. The ﬁrst key idea is a sufﬁcient condition for p pϵq i,t to concentrate around pϵq i . Lemma 1 shows that concentration of pµiptq around its mean in the norm sense is sufﬁcient. Lemma 1. Let γ ą 0, i P r Ks, and t P r Ts. If }pµi,t µi}2 ď γ, then

|p pϵq i,t pϵq i | ď 2γ.

In the scalar case, concentration of the empirical margin around the true margin often follows by the triangle inequality. In our setting, because of the more complicated relationship between pµi,t and p pϵq i,t such an argument is not sufﬁcient.

The second key idea is that with an appropriately high probability, pµi,t concentrates around its mean in the norm sense. The main tools are Hoeffding s maximal inequality (see Lemma H.2) and an ϵ-net, which we now deﬁne (Vershynin et al., 2017). Deﬁnition 1. Let A Ă RD and ϵ ą 0. N Ă A is an ϵ-net of A if @x P A, there exists y P N such that }x y}2 ď ϵ. Let N Ă A be an ϵ-net of A. We say that N is minimal if, for any other ϵ-net M of A, it holds that |M| ě |N|. Lemma 2. Let N be a minimal 1

2-net on SD 1. Let ω ą 0. Deﬁne the event

Ξ t@i, @y P N, @r P r Ts : |ytppµi.r µiq| ď

Then, on Ξ, for all i P r Ks and for all r P r Ts,

}pµi.r µi}2 ď

PrpΞq ě 1 2plogp Tq 1q K5D expp ω2

In effect, Lemma 1 and Lemma 2 together imply that with high probability, (i) pµi,t concentrates around µi in the norm sense and (ii) p pϵq i,t concentrates around pϵq i .

Finally, the third idea is the simple observation that if for all

i P r Ks, pµi,t lies in a ball centered at µi with radius pϵq i 2 , then an algorithm does not make a mistake. Lemma 3. Fix t P r Ts and i P r Ks and suppose that }pµi,t µi}2 ă 1

2 pϵq i . Then, Aµi ď b ϵ1 implies that Apµi,t ă b and distpµi, Pq ě ϵ implies that pµi,t R P.

The analysis of each algorithm then proceeds as follows. First, suppose some appropriately deﬁned variant of the event Ξ in Lemma 2. Second, by Lemmas 1 and 2, (i) pµi,t concentrates around µi in the norm sense and (ii) p pϵq i,t concentrates around pϵq i . Given these concentration results, it is shown that each algorithm pulls each arm a sufﬁcient number of times so that Lemma 3 can be applied.

7. Experiments

In this section, we conduct experiments on synthetic and real-world datasets. In addition to the algorithms MDUCBE, MD-SAR, and MD-APT, we consider a uniform allocation algorithm (UA), which samples the arms in a cyclic fashion. We consider the performance of MD-UCBE under four hyperparameter settings ai i 25

H for i P t.1, 1, 10, 100u. Let MD-UCBE[i] denote MD-UCBE with hyperparameter ai. Note that the larger i is, the more MD-UCBE[i] explores and that our theoretical guarantee in Theorem 2 only covers i ď 1. To calculate p pϵq i,t , we use the quadratic programming solver in the CVXOPT package for python. We average all experiments over 2000 trials.

7.1. Synthetic Experiments

Each experiment has 20 5-dimensional arms and is run for 2000 time steps. We use Gaussian distributions with variance 1

4. For experiments 1, 2, and 3 we use a cube P tx P R5 : 0 ď xi ď 1u. In experiments 4 and 5, we use more complicated feasibility regions. In the following, we say an arm i is irrelevant if the error measure LT,ϵp q does not depend on how i is categorized.

Experiment 1 (Four Groups with Irrelevant Arms): We set ϵ 0.075 and use µ0:1 p.8qb5, µ2:3 p.9qb5, µ4:5 p1.1qb5, µ6:7 p1.2qb5, µ8 p.975qb5, µ9 p1.025qb5, µ10:19 p.3qb5. Note that this problem has two irrelevant arms, µ8 and µ9.

Experiment 2 (Four Groups with no Irrelevant Arms): We set ϵ 0 and use µ0:1 p.8qb5, µ2:3 p.9qb5, µ4:5 p1.1qb5, µ6:7 p1.2qb5, µ8 p.95qb5, µ9 p1.05qb5, µ10:19 p.3qb5. In comparison to experiment 1, we make it slightly easier to determine whether the arms µ8 and µ9

Feasible Arm Identiﬁcation

belong to the polyhedron because otherwise the difﬁculty of the problem prevents any algorithm from achieving substantial progress after 2000 time steps.

Experiment 3 (Linear Progression with Irrelevant Arms): We set ϵ 0.075 and use µ0:3 p.75qb5 p0 : 3q ˆ .05, µ4 p.975qb5, µ5 p1.025qb5, µ6:9 p1.25qb5 p0 : 3q ˆ .05, µ10:19 p1.15qb5. Note that this problem has two irrelevant arms, µ4 and µ5.

Experiment 4 (Four Groups on the Simplex): For this experiment, we use P tx P R5 : xi ě 0, ř

i xi ď 2u. We set ϵ .1. Let c p.2qb5. We use µ0:4 c, µ5:9 1.85 c, µ10:14 2.25 c, and µ15:19 1.95 c. µ0:9 are good arms, µ10:14 are bad arms, and µ15:19 are irrelevant.

Experiment 5 (Ordered Polyhedron): For this experiment, we use P tx P R5 : xi ď xi 1@i P r4su and ϵ .1. We use µ0:3 p0, .2, .4, .6, .8qt, µ4:7 p.0, .15, .3, .45, .6qt, µ8:11 p0, .2, .15, .6, .8sqt, µ12:15 p0, .2, .05, .6, .8qt, and µ16:19 p0, .2, .4, .2, 0qt. The arms µ8:11 are irrelevant.

The performance of MD-UCBE is very sensitive to the selection of its hyperparameter. MD-UCBE[1] and MDUCBE[10] tend to do well, but MD-UCBE[100] explores too much so that it tends to perform only slightly better than UA and MD-UCBE[.1] does not explore enough. Although MD-UCBE[.1] has a theoretical guarantee, the constants are too large so that it never makes progress in solving the problems. MD-APT performs better than MD-SAR in experiments 1, 4, and 5 and worse than MD-SAR in experiments 2 and 3. In experiment 2, MD-APT pulls arm 8, which minimizes pϵq i , too frequently. It pulls arm 8 on average 904.8125 times, whereas MD-SAR more evenly spreads out its pulls, pulling arm 8 on average 317.751 times. We observe a similar phenomenon in a variant of experiment 3 where we set ϵ 0 and which we defer to the supplementary material due to lack of space. This suggests that in certain problems MD-APT focuses too much on speciﬁc arms with means near the boundary and does not allocate enough samples to other arms. On the other hand, MDSAR utilizes knowledge of the time horizon T to effectively spread out samples. MD-APT s agnosticism about T may put it a disadvantage in the regime where some of the pϵq i are very small and T is small relative to H. As suggested by experiment 1, the ϵ parameter can be used to counteract the sensitivity of MD-APT to arms with means near the boundary.

7.2. Application 1: Dose-Finding

In clinical trials, an important challenge is determining the appropriate dosage of a drug. The main difﬁculty is the trade-off that as the dosage increases, the effectiveness of the drug tends to increase, but the likelihood of adverse

effects also increases. Thus, one must ﬁnd a dosage that is sufﬁciently effective, but does not have too many side effects. We assume a situation where the side effects are mild enough not to be a concern for clinical trials, but could nevertheless be unacceptable for a ﬁnal commercial product.

We investigate this problem by considering the data in Genovese et al. (2013) (see ARCR20 in week 16 in Table 2 and Table 3). In this study, the authors examine the drug secukinumab for treating rheumatoid arthritis. They consider four dosage levels (25mg, 75mg, 150mg, 300mg) and a placebo. We design a simulation based on their data where each arm corresponds to a drug and has two attributes, the likelihood of being effective and the likelihood of causing an adverse effect. Let µi,1 denote the probability of being effective and µi,2 the probability of causing an adverse effect. Then, dosage levels 25mg, 75mg, 150mg, and 300mg have means µ1 p.34, .519qt, µ2 p.469, .612qt, µ3 p.465, .465qt, µ4 p.537, .61qt, respectively, and the placebo has mean µ5 p.36, .58qt. We suppose that a drug is considered good if the probability of success is above .4 and the probability of adverse effects is below .5 and we set ϵ 0. Thus, only arm 3 is good and all other arms are bad. We chose these thresholds so that one drug is good; we did not try other threshold settings. We run the experiment for 1000 time steps.

Figure 6 gives the results of the experiment. MD-APT and MD-UCBE[10] perform better than the rest of the algorithms. MD-UCBE[1] performs slightly worse than UA, which may be because there are only 5 arms so that UA is not that bad of a strategy and MD-UCBE[1] does not explore sufﬁciently. MD-SAR only performs slightly better than UA. This may be because the time horizon is only 1000 time steps and there are only 5 arms.

7.3. Application 2: Crowdsourcing

We use a real-world dataset for the natural language processing task of affective text analysis (Snow et al., 2008). In this task, workers are asked to rate a short headline on valence and six emotions: disgust, fear, joy, anger, sadness and surprise. A group of experts also provide such ratings for the headlines.

We consider the problem of ﬁnding workers that tend to agree with the expert views on each of the tasks. We examine the deviation of a worker s ratings with the experts ratings. We normalize this deviation onto a scale of r0, 1s. Let µi,j denote the mean of worker i on task j and let µj denote the mean of all of the workers on task j. We deem a worker i good if µi,j ď µj for all j P r7s. In words, a worker is good if for every task, he performs better than the average worker. To make this realistic, we assume that we are in a setting where the average worker performance on each task is known based on another pool of workers. We

Feasible Arm Identiﬁcation

0 250 500 750 1000 1250 1500 1750 2000 horizon

log(est. failure probability)

Figure 1. Four Groups on Cube with Irrelevant Arms

0 250 500 750 1000 1250 1500 1750 2000 horizon

log(est. failure probability)

Figure 2. Four Groups on Cube, no Irrelevant Arms

0 250 500 750 1000 1250 1500 1750 2000 horizon

log(est. failure probability)

Figure 3. Linear Progression on Cube with Irrelevant Arms

0 250 500 750 1000 1250 1500 1750 2000 horizon

log(est. failure probability)

Figure 4. Four Groups on a Simplex

0 250 500 750 1000 1250 1500 1750 2000 horizon

log(est. failure probability)

MD-APT MD-SAR UA MD-UCBE[1] MD-UCBE[10] MD-UCBE[.1] MD-UCBE[100]

Figure 5. Ordered polyhedron

0 200 400 600 800 1000 horizon

log(est. failure probability)

Figure 6. Dose-Finding Experiment

0 500 1000 1500 2000 2500 3000 3500 4000 horizon

log(est. failure probability)

Figure 7. Crowdsourcing Experiment

use a tolerance of ϵ 0.02. There is a total of 38 workers, where 30 workers are bad arms, 3 workers are good arms, and 5 workers are irrelevant. Because each worker only provides a small number (at least 20) of ratings, whenever an arm is pulled, the algorithm observes an observation chosen uniformly at random with replacement from the data associated with the arm. We run each algorithm for 4000 time steps and in each trial, we randomly permute the samples of each worker. In the supplementary material, we repeat this experiment, but we simulate each arm as a Gaussian distribution (see Section J); the results are very similar.

Figure 7 gives the results of the experiment. Until roughly time step 3000, MD-APT and MD-UCBE[10] perform the best. Afterwards, MD-SAR does substantially better than MD-APT and MD-UCBE[10]. MD-UCBE[1] and MDUCBE[100] perform only marginally better than UA.

7.4. Summary of Results

The experiments suggest that although MD-UCBE is a competitive algorithm, it is highly sensitive to hyperparameter selection, which limits its applicability in practice. MDSAR and MD-APT tend to perform dramatically better than UA. For example, in the crowdsourcing experiment, UA has a ﬁnal error rate of roughly 52%, whereas MD-SAR has a ﬁnal error rate of roughly 5%. Further, our algorithms can handle complicated polyhedra such as the polyhdron that requires that coordinates are sorted in ascending order (see experiment 5). These results suggest that MD-APT tends to perform better than MD-SAR, but in some settings

(e.g., some arms with small pϵq i and H large relative to T) MD-APT focuses too much on some of the arms with means near the boundary. Because MD-SAR more evenly spreads out its pulls among the arms, it performs better in this regime.

8. Conclusion

In this paper, we introduced the feasible arm identiﬁcation problem. This problem provides a ﬂexible framework for settings where arms are multi-dimensional and it is of interest to determine whether each arm satisﬁes user-deﬁned multi-dimensional criteria. We provided a characterization of the difﬁculty of these problems that yielded a lower bound and we provided a uniﬁed analysis of three algorithms MDUCBE, MD-SAR, and MD-APT. Our experiments suggest that by leveraging the geometry of the feasible arm identiﬁcation problem, MD-SAR and MD-APT are able to dramatically outperform a uniform allocation approach.

Our work also suggests several open directions for future research. For example, in many crowdsourcing problems, one does not ask workers to perform all tasks at once, but rather one task at a time and, yet, it may be of interest to ﬁnd workers who excel at a collection of tasks. This suggests a variant of the feasible arm identiﬁcation problem where the agent chooses one coordinate of one arm and observes a realization of the corresponding random variable in each round.

Feasible Arm Identiﬁcation

Acknowledgements

This work was supported in part by NSF grant 1422157. We thank the anonymous reviewers for their very helpful comments and Aditya Modi for his useful feedback.

J.-Y. Audibert and S. Bubeck. Best arm identiﬁcation in multi-armed bandits. Conference on Learning Theory, 2010.

P. Auer, C.-K. Chiang, R. Ortner, and M. Drugan. Pareto front identiﬁcation from stochastic bandit feedback. Artiﬁcial Intelligence and Statistics, pages 939 947, 2016.

S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University press, 2004.

S. Bubeck, R. Munos, and G. Stolz. Pure exploration in multi-armed bandits problems. Algorithmic Learning Theory, pages 23 37, 2009.

S. Bubeck, T. Wang, and N. Viswanathan. Multiple identiﬁcations in multi-armed bandits. International Conference on Machine Learning, pages 258 265, 2013.

R. Busa-Fekete, B. Szorenyi, P. Weng, and S. Mannor. Multiobjective bandits: Optimizing the generalized gini index. ICML, pages 625 634, 2017.

L. Chen, A. Gupta, J. Li, M. Qiao, and R. Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. Proceedings of Machine Learning Research, 65:1 53, 2017.

S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. Advances in Neural Information Processing Systems, pages 379 387, 2014.

M. Drugan and A. Now. Designing multi-objective multiarmed bandits algorithms: A study. Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1 8, 2013.

V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identiﬁcation: A uniﬁed approach to ﬁxed budget and ﬁxed conﬁdence. Advances in Neural Information Processing Systems, pages 3212 3220, 2012.

M. Genovese, P. Durez, H. Richards, J. Supronik, E. Dokoupilova, V. Mazurov, and J. Aelion. Efﬁcacy and safety of secukinumab in patients with rheumatoid arthritis: a phase ii, dose-ﬁnding, double-blind, randomised, placebo controlled study. Annals of the rheumatic diseases, 72: 863 869, 2013.

K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil ucb: An optimal exploration algorithm for multi-armed bandits. Conference on Learning Theory, pages 424 439, 2014.

K.-S. Jun and R. Nowak. Anytime exploration for multiarmed bandits using conﬁdence information. ICML, pages 974 982, 2016.

A. Locatelli, M. Gutzeit, and A. Carpentier. An optimal algorithm for the thresholding bandit problem. Proceedings of The 33rd International Conference on Machine Learning, pages 1690 1698, 2016.

S. Mannor and J. Tistisklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, pages 623 648, 2004.

R. Snow, B. O Connor, D. Jurafsky, and A. Ng. Cheap and fast but is it good?: evaluating non-expert annotations for natural language tasks. Proceedings of the conference on empirical methods in natural language processing, pages 254 263, 2008.

C. Tekin and E. Turgay. Multi-objective contextual multiarmed bandit problem with a dominant objective. ar Xiv preprint ar Xiv:1708.05655, 2017.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210 268. Cambridge University Press, 2012.

R. Vershynin, P. Hsu, C. Ma, J. Nelson, E. Schnoor, D. Stoger, T. Sullivan, and T. Tao. High-dimensional probability: An introduction with applications in data science. 2017.

S. Zheng, B. Waggoner, Y. Liu, and Y. Chen. Active information acquisition for linear optimization. ar Xiv preprint ar Xiv:1709.10061, 2017.