# Optimal Arms Identification with Knapsacks

Shaoang Li¹  Lan Zhang¹  Yingqi Yu¹  Xiang-Yang Li¹

¹University of Science and Technology of China, Hefei, China. Correspondence to: Lan Zhang, Xiang-Yang Li.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** Best Arm Identification (BAI) is a general online pure-exploration framework for identifying optimal decisions among candidates via sequential interactions. We pioneer the Optimal Arms identification with Knapsacks (OAK) problem, which extends the BAI setting to model resource consumption. We present a novel OAK algorithm and prove an upper bound on its error probability by exploring the relationship between selecting optimal actions and the structure of the feasible region. Our analysis introduces a new complexity measure, which builds a bridge between the OAK setting and the bandits-with-knapsacks problem. We establish an instance-dependent lower bound for the OAK problem based on this complexity measure. Our results show that the proposed algorithm achieves a near-optimal probability bound for the OAK problem. In addition, we demonstrate that our algorithm recovers or improves the state-of-the-art upper bounds for several special cases, including the simple OAK setting and some classical pure-exploration problems.

## 1. Introduction

Multi-armed bandits exemplify the exploration-exploitation trade-off in online decision-making. The decision-maker selects arms (actions, options, decisions) sequentially and learns from the rewards to maximize the expected cumulative reward over a number of trials. Many applications instead need to identify the best action among the candidates, while the rewards or losses incurred during exploration are ignored; this is the best arm identification (BAI) problem (Audibert et al., 2010). For example, in a medical trial with m candidate ingredients and T patients that can be admitted to the trial (so the exploration phase is limited by a fixed budget T), the decision-maker would like to identify the best ingredient to minimize harm to the patients or maximize the therapeutic effect. A series of efforts have been made to solve variants of the BAI problem (Xu et al., 2020; Katz-Samuels & Jamieson, 2020; Zhang & Ong, 2021; Zhong et al., 2021).

The bandits with knapsacks (BwK) framework was introduced by Badanidiyuru et al. (2013) to deal with a more general and realistic setting that takes resource consumption into consideration. In the BwK setting, the optimal fixed distribution over arms may outperform the arm with the highest expected reward. The support of the optimal distribution consists of the arms with optimal bang-per-buck, i.e., reward per unit of resource consumption, so there can be multiple optimal arms (Badanidiyuru et al., 2013). Existing works focus on maximizing the accumulated reward under the resource constraints by finding the optimal fixed distribution (Badanidiyuru et al., 2013; Agrawal & Devanur, 2014).

In this paper, we consider a common situation in which the decision-maker needs to identify all optimal arms under the knapsack constraints, and define it as the optimal arms identification (OAK) problem. In the OAK setting, there is a fixed set of arms and a hard capacity constraint for each resource; each arm is associated with an unknown reward distribution and an unknown consumption distribution. During exploration, the decision-maker chooses an arm in each round and only observes a scalar-valued reward independently sampled from the reward distribution and a resource-consumption vector independently sampled from the consumption distribution. Once one or more resource budget constraints are violated, the exploration stops.
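To make the multiple-optimal-arms phenomenon concrete, consider a small hypothetical example (our numbers, for illustration only): one non-time resource with per-round budget 0.5, arm 1 with expected reward 1 and expected consumption 1, and arm 2 with expected reward 0.6 and expected consumption 0.2. The per-round linear relaxation is

```latex
\max_{x \ge 0} \; x_1 + 0.6\,x_2
\quad\text{s.t.}\quad x_1 + 0.2\,x_2 \le 0.5, \qquad x_1 + x_2 \le 1 .
```

Playing only arm 1 yields value at most 0.5, since the resource caps x_1 at 0.5; the mixture x = (3/8, 5/8) makes both constraints tight and achieves 3/8 + 0.6·5/8 = 0.75. Both arms appear in the support of the optimal distribution, so both are optimal arms in the sense studied here.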
The decision-maker aims to maximize the probability of identifying all optimal arms under the resource knapsacks. The OAK problem encompasses a wide range of applications due to the presence of resource constraints in pure-exploration decision problems. For example, during the medical testing phase, the selection of ingredients may be constrained by the supply and monetary cost of each component, and the decision-maker would like to identify the ingredients that minimize harm to the patients or maximize the therapeutic effect. Similarly, the dynamic pricing problem involves sellers who face limited supply and aim to determine the optimal policy for maximizing expected revenue. In the dynamic procurement problem, algorithms are designed to purchase items or services while adhering to budgetary and other constraints.

Knowledge of the optimal arms, which is not limited to pure-exploration settings, is also valuable for addressing the associated regret-minimization BwK problem. Previous work demonstrates that even with enough exploration for the optimal LP solution to converge to an LP-perfect distribution while avoiding obviously suboptimal strategies, an Ω(√T) gap with the optimum remains necessary (Badanidiyuru et al., 2013). In contrast, subsequent studies present tighter regret upper bounds when the optimal arm set is known (Flajolet & Jaillet, 2015; Li et al., 2021). For instance, Li et al. (2021) prove an O(d⁴/b²) regret upper bound (omitting other problem-dependent parameters), which does not depend on T.

The OAK problem raises several challenges for designing and analyzing algorithms. One challenge is the unknown number of optimal arms, which necessitates that the algorithm explore the structure of the feasible region. Another is to estimate the fixed optimal distribution, as in the BwK problem. The expected per-round reward is no longer a reliable estimate of an arm's value.
Each pull of any arm may influence the decision-maker's estimated optimal distribution and lead to a different result, so the algorithm needs to search over all possible distributions, and the large search space of possible distributions exacerbates the difficulty of the problem. This differs from the overwhelming majority of previous pure-exploration bandit settings. Therefore, the design and analysis of algorithms require new technical tools for characterizing the complexity of the new exploration problem.

### 1.1. Our contributions

We pioneer the OAK setting under knapsack constraints in this paper. First, we propose the BASEOAK learning algorithm based on a quarter reject/accept strategy. The analysis of our algorithm depends on our observation of the relationship between selecting optimal arms (the point aspect) and the structure of the feasible domain (the global aspect). We develop a new complexity measure based on this observation. Our analysis shows that a successive elimination-style algorithm places excessive emphasis on the point aspect and veers widely from the optimal distribution. Then, we develop the FULLOAK algorithm based on BASEOAK, which strikes a balance between converging to the optimal distribution and exploring the optimality of arms. We upper bound the probability that the algorithm makes mistakes in terms of the new complexity measure. We also establish an instance-dependent lower bound for the OAK problem. The analysis shows that FULLOAK is close to optimal: the lower bound matches the error probability in the exponential term up to a constant factor. Last, we investigate some special cases of the OAK setting. We study the simple OAK setting, in which the influence between selections of different arms can be avoided. For the simple OAK problem, we present a near-optimal variant of BASEOAK (Algorithm 3).
We demonstrate that this variant recovers or improves the state-of-the-art upper bounds for many classical pure-exploration problems, including the BAI problem, the top-K best arms identification problem, and the multi-bandits best arms identification problem.

## 2. Problem Setup and Technical Preliminaries

In this paper, we use bold fonts to represent vectors and matrices. For a matrix C, we use C_{j,·} and C_{·,i} to denote the j-th row vector and the i-th column vector, respectively. For a set X, we use |X| to denote its cardinality.

### 2.1. Problem setup

We formally define the OAK problem below. Given T rounds, m arms, and d types of resources being consumed, they are indexed by [T] = {1, 2, ..., T}, [m] = {1, 2, ..., m}, and [d] = {1, 2, ..., d}, respectively. Each arm is associated with an unknown reward distribution and an unknown consumption distribution. In each round t, the algorithm plays an arm i(t) ∈ [m], then observes a scalar-valued reward r(t) ∈ [0, 1] and a resource-consumption vector c(t) ∈ [0, 1]^d, which are independently sampled from the reward/consumption distributions. The j-th component of c(t) represents the consumption of resource j. There are a fixed unknown expected reward vector µ = (µ_1, ..., µ_m) ∈ [0, 1]^m and a fixed unknown expected consumption matrix C = (C_{·,1}, ..., C_{·,m}) ∈ [0, 1]^{d×m} such that E[r(t) | i(t)] = µ_{i(t)} and E[c(t) | i(t)] = C_{·,i(t)}. We use B to denote the hard resource-constraint vector. For each resource j, there is a pre-specified knapsack B_j representing the maximum amount of consumption allowed over the whole time horizon. Once one or more resource budget constraints are violated, the exploration stops. We say the constraints are uniform if B_j = B for all resources j ∈ [d], and any OAK instance can be reduced to one with uniform constraints B = min_{j∈[d]} B_j. For notational simplicity, we focus on the uniform OAK setting with knapsack B. Let b = B/T denote the expected per-round resource constraint.
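A minimal simulation of this interaction protocol may help fix ideas. The sketch below is entirely our own (not from the paper): rewards and consumptions are Bernoulli with the stated means, and the class and parameter names are hypothetical. The learner pulls arms, observes a reward and a consumption vector, and the episode ends as soon as any knapsack, including time, is exhausted.

```python
import random

class OakEnvironment:
    """Simulate the OAK protocol with Bernoulli reward/consumption draws."""

    def __init__(self, mu, C, budget, seed=0):
        self.mu = mu            # expected reward of each arm
        self.C = C              # d x m expected-consumption matrix; row 0 is time
        self.budget = budget    # uniform knapsack B
        self.spent = [0.0] * len(C)
        self.rng = random.Random(seed)
        self.stopped = False

    def pull(self, i):
        """Play arm i; return (reward, consumption) or None once any budget is hit."""
        if self.stopped:
            return None
        reward = float(self.rng.random() < self.mu[i])
        cost = [float(self.rng.random() < self.C[j][i]) for j in range(len(self.C))]
        cost[0] = 1.0           # the time resource is consumed deterministically
        for j, c in enumerate(cost):
            self.spent[j] += c
        if any(s >= self.budget for s in self.spent):
            self.stopped = True  # a violated knapsack stops the exploration
        return reward, cost
```

With the time row consuming one unit per pull, an episode lasts at most B rounds, matching the uniform reduction B = min_j B_j above.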
Besides, we assume that the 1st resource is the time resource and that each arm deterministically consumes 1 unit of it whenever it is pulled. We assume the 1st arm is a null arm that can be played with no reward and consumes only the time resource. These assumptions are standard in the BwK literature (Badanidiyuru et al., 2013; Agrawal & Devanur, 2014; Li et al., 2021).

For a problem instance, we say an arm i is an optimal arm if it would be selected by the optimal dynamic policy in expectation. We give a more precise definition in the linear-relaxation part below. The OAK algorithm aims to maximize the probability of correctly identifying all optimal arms under the fixed resource constraint. Formally, let X* denote the index set of all optimal arms. The algorithm has to output an arm set O ⊆ [m] once any constraint is violated (including the time resource). The algorithm tries to maximize P[O = X*].

### 2.2. Linear relaxation

The OAK problem can be relaxed to the following linear program:

max µᵀx  s.t.  Cx ≤ b·𝟙,  x ≥ 0,   (1)

where x is the decision vector and x_i corresponds to the probability of selecting arm i ∈ [m]. The LP (1) always has feasible solutions because of the existence of the null arm. Let OPT_LP and x* denote the optimal value and the optimal solution of (1), respectively. An arm i ∈ [m] is an optimal arm if the corresponding variable is a basic variable in x*, i.e., x*_i > 0. Formally, let X* := {i | x*_i > 0, i ∈ [m]} and X̄* := {i | x*_i = 0, i ∈ [m]} denote the index sets of optimal basic variables and non-basic variables of x*, respectively. Then each arm i ∈ X* is an optimal arm and each arm i ∈ X̄* is a sub-optimal arm. Similarly, let Y* := {j | b − C_{j,·} x* = 0, j ∈ [d]} and Ȳ* := {j | b − C_{j,·} x* > 0, j ∈ [d]} denote the index sets of active and non-active constraints of (1). Then each constraint j ∈ Y* is an active constraint and each constraint j ∈ Ȳ* is a non-active constraint. Notice that we always have |X*| = |Y*| ≤ min{m, d}.

Assumption 2.1. The LP (1) has a unique optimal solution.
Moreover, the optimal solution is non-degenerate. This assumption is standard in the LP literature, and any LP can be made to satisfy it by an arbitrarily small perturbation (Megiddo & Chandrasekaran, 1989; Li et al., 2021).

The dual problem of (1) is

min b·𝟙ᵀw  s.t.  Cᵀw ≥ µ,  w ≥ 0.   (2)

Let w* denote the optimal solution of (2). Notice that, by complementary slackness, for each non-active constraint j ∈ Ȳ* the corresponding dual variable is non-basic, i.e., w*_j = 0.

We now introduce the sub-optimality and optimality measures used in this paper. To measure sub-optimality, we use the reduced cost R_i := (w*)ᵀ C_{·,i} − µ_i, i ∈ [m], from the LP literature, which can be regarded as the change in the objective obtained by increasing a variable by a small amount. Notice that R_i = 0 for each optimal arm i ∈ X* and R_i > 0 for each sub-optimal arm i ∈ X̄*. To measure optimality, consider the following linear program:

max µᵀx  s.t.  Cx ≤ b·𝟙,  x ≥ 0,  x_i = 0.   (3)

Let OPT⁻ⁱ_LP denote its optimal value; (3) adds the new constraint x_i = 0 for one arm i ∈ [m] compared with (1). Then define G_i := OPT_LP − OPT⁻ⁱ_LP, the reward gap caused by deleting one arm. Under Assumption 2.1, we have G_i = 0 for each sub-optimal arm i ∈ X̄* and G_i > 0 for each optimal arm i ∈ X*.

## 3. Base OAK Algorithm and Complexity Measure

This section introduces the intuition and specification of the BASEOAK algorithm (shown in Algorithm 1). We also introduce the new complexity measure and, based on it, upper bound the probability that the algorithm makes mistakes.

### 3.1. Base OAK algorithm

The algorithm splits the budget B evenly into ⌈log_{4/3} m⌉ phases and chooses the worst/best quarter of the surviving arms to reject/accept at the end of each phase. An arm is included in the final output if it is accepted during the time horizon, and excluded if it is rejected at the end of some phase. The stopping condition of BASEOAK is implicit in the choice of the number of phases and the quarter elimination.
They guarantee that BASEOAK does not exceed the budget B and that each arm is accepted or rejected before BASEOAK ends. We describe the procedure of the algorithm below. The algorithm maintains three arm sets: the accept arm set X̂_p, the reject arm set X̄_p, and the active arm set X_p. The accept/reject arm set includes all arms accepted/rejected before phase p, and the active arm set includes all remaining arms. During phase p, the algorithm pulls each surviving arm n(p) times, where n(p) is given in Algorithm 1. Let s(p) denote the number of times one surviving arm has been selected by the end of phase p, i.e., s(p) = Σ_{k=0}^{p} n(k). Let µ̂(p) and Ĉ(p) denote the empirical mean estimators of µ and C at the end of phase p, respectively. Let r_i(l) and C_{j,i}(l) denote the reward and the j-th resource consumption observed at the l-th pull of arm i. Formally,

µ̂_i(p) = (1/s(p)) Σ_{l=1}^{s(p)} r_i(l),  Ĉ_{j,i}(p) = (1/s(p)) Σ_{l=1}^{s(p)} C_{j,i}(l).

Algorithm 1 Base OAK Algorithm (BASEOAK)
Input: resource constraint B, number of arms m
1: X_0 ← [m], X̂_0 ← ∅, X̄_0 ← ∅
2: for p = 0, ..., ⌈log_{4/3} m⌉ − 1 do
3:   Pull each arm i ∈ X_p ∪ X̄_p for ⌊B / (|X_p ∪ X̄_p| ⌈log_{4/3} m⌉)⌋ times
4:   Compute the empirical estimators R̂_i and Ĝ_i for each arm i ∈ X_p
5:   if there are more basic variables in x̂*(p) then
6:     X̂_{p+1} ← X̂_p ∪ {the |X_p|/4 optimal arms in X_p with the largest Ĝ_i}
7:   else
8:     X̄_{p+1} ← X̄_p ∪ {the |X_p|/4 sub-optimal arms in X_p with the largest R̂_i}
9:   end if
10:  X_{p+1} ← X_0 \ (X̂_{p+1} ∪ X̄_{p+1})
11: end for
12: Output X̂_{⌈log_{4/3} m⌉}

At the end of each phase p, with the empirical estimators µ̂(p) and Ĉ(p), compute

max µ̂(p)ᵀx  s.t.  Ĉ(p)x ≤ b·𝟙,  x ≥ 0.   (4)

Let x̂*(p) and ŵ* denote the optimal solutions of (4) and of its dual, respectively. In the meantime, we compute the empirical estimators R̂_i and Ĝ_i for each surviving arm i from µ̂(p) and Ĉ(p): R̂_i = (ŵ*)ᵀ Ĉ_{·,i}(p) − µ̂_i(p), i ∈ [m]. Similarly, consider the LP max µ̂(p)ᵀx s.t. Ĉ(p)x ≤ b·𝟙, x ≥ 0, x_i = 0, and let ÔPT⁻ⁱ_LP denote its optimal value; then Ĝ_i = µ̂(p)ᵀ x̂*(p) − ÔPT⁻ⁱ_LP.
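Concretely, both G_i and its plug-in estimate above reduce to a pair of small LP solves. The following pure-Python sketch (brute-force vertex enumeration; the 3-arm instance and all numbers are our own illustration, not from the paper) computes OPT_LP, the optimal support, and G_i = OPT_LP − OPT⁻ⁱ_LP by re-solving with x_i forced to zero. A practical implementation would call an LP solver instead of enumerating vertices.

```python
import itertools

def solve_square(A, y):
    """Solve the square system A x = y by Gaussian elimination; None if singular."""
    n = len(A)
    M = [list(A[i]) + [y[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lp_opt(mu, C, b, banned=()):
    """Max mu.x s.t. Cx <= b, x >= 0 and x_i = 0 for i in banned,
    by enumerating basic feasible solutions (fine for tiny m)."""
    m = len(mu)
    rows = [list(r) for r in C] + [[-float(j == i) for j in range(m)] for i in range(m)]
    rhs = list(b) + [0.0] * m
    for i in banned:                      # force x_i = 0 via the constraint x_i <= 0
        rows.append([float(j == i) for j in range(m)])
        rhs.append(0.0)
    best, best_x = float("-inf"), None
    for idx in itertools.combinations(range(len(rows)), m):
        x = solve_square([rows[k] for k in idx], [rhs[k] for k in idx])
        if x is None or min(x) < -1e-9:
            continue
        if any(sum(r[j] * x[j] for j in range(m)) > t + 1e-9 for r, t in zip(rows, rhs)):
            continue
        v = sum(mu[j] * x[j] for j in range(m))
        if v > best:
            best, best_x = v, x
    return best, best_x

# Toy instance: arm 0 is the null arm; resource 0 is time (1 unit per round).
mu = [0.0, 0.9, 0.8]
C = [[1.0, 1.0, 1.0],
     [0.0, 0.8, 0.2]]
b = [1.0, 0.4]
opt, xstar = lp_opt(mu, C, b)   # opt = 5/6 at x* = (0, 1/3, 2/3)
gaps = [opt - lp_opt(mu, C, b, banned=(i,))[0] for i in range(3)]
```

Consistent with the text, the resulting gaps are zero for the sub-optimal null arm and strictly positive exactly for the two arms in the optimal support.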
If there are more basic variables (corresponding to optimal arms) in x̂*(p), the algorithm chooses the quarter of arms with the largest Ĝ_i from the active arm set X_p to accept and adds them to the accept arm set X̂_{p+1}. Otherwise, the algorithm chooses the quarter of arms with the largest R̂_i from the active arm set X_p to reject and adds them to the reject arm set X̄_{p+1}. Some arms may share the same R̂_i / Ĝ_i, making it difficult to decide which arm to reject/accept; in this case we use a random strategy, i.e., select a random arm among the tied ones to reject/accept until a quarter of the arms are eliminated. At the end of the last phase, the algorithm outputs all arms accepted during the whole time horizon. The algorithm cannot be simplified to just eliminating arms and returning the active set: it is important to maintain the active arm set and the reject arm set simultaneously to ensure that each arm is rejected/accepted only once during the game.

### 3.2. Complexity measure

We now introduce the complexity measure used in our work. Let D = {x ∈ R^m | Cx ≤ b·𝟙, x ≥ 0} denote the feasible domain of (1). Notice that D is a convex polyhedron. Let B denote the set of all vertices (which are also extreme points) of this polyhedron. We say x ∈ D is a vertex if for all λ ∈ (0, 1) and u, v ∈ D, x = λu + (1 − λ)v implies u = v. We use x(k) to denote the k-th best vertex, i.e., µᵀx* = µᵀx(1) ≥ µᵀx(2) ≥ ... ≥ µᵀx(k) ≥ ... ≥ µᵀx(|B|). Under Assumption 2.1, the number of extreme points is no less than m. We define the vertex gap Δ_i as Δ_i = µᵀx* − µᵀx(i), i ∈ [m]. Our analysis relies on the following complexity measure:

H := max_{i∈[m], i≠1} i / Δ_i²,

which is a generalization of the complexity measure for BAI.
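As a worked example of the vertex gaps and of H (our toy numbers, not from the paper): take m = 3 with the null arm, µ = (0, 0.9, 0.8), a time resource with per-round budget 1, and one extra resource with expected consumptions (0, 0.8, 0.2) and per-round budget 0.4. The polytope D has six vertices; sorted by objective value,

```latex
\mu^{\top} x(1) = \tfrac{5}{6} \text{ at } (0, \tfrac13, \tfrac23), \quad
\mu^{\top} x(2) = \tfrac{4}{5} \text{ at } (0, 0, 1), \quad
\mu^{\top} x(3) = \mu^{\top} x(4) = \tfrac{9}{20}, \quad
\mu^{\top} x(5) = \mu^{\top} x(6) = 0,
```

so

```latex
\Delta_2 = \tfrac{5}{6} - \tfrac{4}{5} = \tfrac{1}{30}, \qquad
\Delta_3 = \tfrac{5}{6} - \tfrac{9}{20} = \tfrac{23}{60}, \qquad
H = \max\Big\{\tfrac{2}{\Delta_2^2},\ \tfrac{3}{\Delta_3^2}\Big\}
  = \max\big\{1800,\ \tfrac{10800}{529}\big\} = 1800 .
```

Here H is driven entirely by the near-tie between the best and second-best vertices, which is exactly the regime in which distinguishing the optimal support is hard.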
Note that the complexity measure simultaneously captures the reduced cost of a sub-optimal arm and the influence of forbidding one optimal arm, and it builds a bridge between the point aspect (sub-optimality of arms) and the global aspect (the feasible domain of the latent structure of a problem instance, which can be induced from the BwK domain). We provide a formal description below.

Theorem 3.1. For any optimal arms identification or bandits with knapsacks problem instance, we have R_(i) ≥ Δ_{i+1}/2 and G_(i) ≥ Δ_{i+1}.

Proof Sketch. Let d_q denote the edge-direction vector from x* leading to the adjacent extreme point corresponding to an increase of the sub-optimal variable q ∈ X̄*. We use x̃(q) to denote the q-th best adjacent vertex, i.e., µᵀx* > µᵀx̃(1) ≥ ... ≥ µᵀx̃(q) ≥ ..., q ∈ X̄*. We define the adjacent gap Δ̃(q) := µᵀx* − µᵀx̃(q), q ∈ X̄*. Then we consider the relationship between Δ̃(q) and R_q. We rearrange w* = (w*_B | w*_N)ᵀ, where the basis vector w*_B ∈ R^{|Y*|} includes all basic variables and the non-basis vector is w*_N = (0, ..., 0)ᵀ. Let d_{qi} denote the i-th element of d_q, and define the step length α_q := min_{i∈X*: d_{qi}<0} {x*_i / (−d_{qi})} (the standard ratio test). We have

Δ̃(q) = −α_q µᵀd_q = α_q((w*_B)ᵀQ_{·,q} − µ_q) = α_q((w*)ᵀC_{·,q} − µ_q) = α_q R_q,

where Q_{·,q} denotes the column associated with q in the basis representation of C. From the geometry of linear programming, α_q is at most the distance between the vertices x* and x̃(q), and the diameter of the polyhedron (i.e., the feasible domain of (1)) is not more than 2. From the definitions of the vertex gap and the adjacent gap, we have Δ_{q+1} ≤ Δ̃(q). Combining the two facts, we obtain R_(i) ≥ Δ_{i+1}/2.

Then consider LP (3). Notice that the feasible domain of (3) is a subset of D and that the optimal extreme point of (3) is also a vertex of D. The feasible domain of (3) is nonempty because of the existence of the null arm. Under Assumption 2.1, the LPs (3) obtained by deleting different optimal arms have different optimal vertices, all distinct from x*. Based on the definition of the vertex gap, we conclude that G_(i) ≥ Δ_{i+1}. The details of the proof are provided in Appendix A.

### 3.3. Theoretical result

Theorem 3.2. For the OAK problem, Algorithm 1 makes errors with probability at most

O(md log m · exp(−b⁴B / (90 H max{|X*|, log m}))).

Proof Sketch. Since the total number of pulls of BASEOAK is at most B and c_j(t) ≤ 1 for every arm and resource in each round t, the algorithm never exceeds the consumption knapsack. First, we argue that at the end of each phase the optimal value of (4) is close to OPT_LP: at the end of phase p, with probability at least 1 − 2md·exp(−2δ_p² s(p)), the value µ̂(p)ᵀx̂*(p) lies within an O(δ_p) neighborhood of OPT_LP. Then, based on an analysis of the gap between the optimal solution of the dual of (4) and w*, for any optimal arm i ∈ X* and any sub-optimal arm i′ ∈ X̄*, the probability that R̂_i(p) > R̂_{i′}(p) is at most 2md·exp(−2b²R_{i′}² s(p)/9), and the probability that Ĝ_i(p) < Ĝ_{i′}(p) is at most 2md·exp(−2b²G_i² s(p)/9).

For the arms in X_p, let P_p and P̂_p denote the sets of optimal arms with respect to (1) and (4), and let Q_p and Q̂_p denote the sets of sub-optimal arms with respect to (1) and (4), respectively. Let S_p denote the (1/16)|X_p| arms with the smallest G_i and S̄_p the (1/16)|X_p| arms with the largest R_i. Define

Φ_p := max_{i∈P_p\S_p} exp(−2b²G_i² s(p)/9),  Φ̄_p := max_{i∈Q_p\S̄_p} exp(−2b²R_i² s(p)/9).

Let us start with the case |Q̂_p| > |P̂_p|. Consider the number of arms in Q̂_p ∩ (P_p\S_p):

E[|Q̂_p ∩ (P_p\S_p)|] = Σ_{i∈P_p\S_p} P[i ∈ Q̂_p] ≤ 2md · |P_p\S_p| · Φ_p.

By Markov's inequality, P[|Q̂_p ∩ (P_p\S_p)| > (1/8)|Q̂_p|] ≤ 16md · (|P_p\S_p| / |Q̂_p|) · Φ_p, so we can bound the cardinality of Q̂_p ∩ Q_p with high probability: P[|Q̂_p ∩ Q_p| ≥ (3/4)|Q̂_p|] ≥ 1 − 16md · (|P_p\S_p| / |Q̂_p|) · Φ_p. On this event, let i_p denote an optimal arm eliminated in phase p, and let N̄_p denote the number of arms in (Q̂_p ∩ Q_p)\S̄_p with larger R̂ than that of i_p; then

E[N̄_p] ≤ Σ_{i∈(Q̂_p∩Q_p)\S̄_p} P[R̂_{i_p}(p) < R̂_i(p)] ≤ 2md · |(Q̂_p ∩ Q_p)\S̄_p| · Φ̄_p.

By applying Markov's inequality again, P[N̄_p > (1/6)|Q̂_p ∩ Q_p|] ≤ 6 E[N̄_p] / |Q̂_p ∩ Q_p|. We obtain that the probability that at least one optimal arm is eliminated in phase p is at most 32md·Φ_p + 12md·Φ̄_p.
Similarly, the probability that at least one sub-optimal arm is added to X̂_p is at most 32md·Φ̄_p + 12md·Φ_p. The number of pulls of each surviving arm before phase p + 1 satisfies s(p) ≥ B / (⌈log_{4/3} m⌉ · (|X_p| + |X̄_p|)). Combining these bounds over all phases completes the proof. The details are provided in Appendix B.

## 4. Full OAK Algorithm and Lower Bound

This section develops an algorithm, called FULLOAK (shown in Algorithm 2), that solves the OAK problem based on the intuitions behind BASEOAK. We provide the main idea and theoretical analysis of the algorithm. Moreover, we provide an instance-dependent lower bound for the OAK problem.

### 4.1. Full OAK algorithm

Algorithm 2 Full OAK Algorithm (FULLOAK)
Input: resource constraint B, number of arms m
1: Pull each arm once
2: while the elapsed time is less than T/3 and the consumption of every resource is less than B/3 do
3:   for t = 1, 2, ..., m do
4:     Solve the linear program (5) and let x_t denote its solution
5:     Choose an arm to pull as an independent sample from the distribution x_t
6:   end for
7: end while
8: Obtain the estimate x̂*_{T/3} of the optimal distribution
9: X_0 ← [m], X̂_0 ← ∅, X̄_0 ← ∅
10: for p = 0, ..., ⌈log_{4/3} m⌉ − 1 do
11:   Pull each arm i ∈ X_p for ⌊B / (2|X_p ∪ X̄_p| ⌈log_{4/3} m⌉)⌋ times
12:   Compute the empirical estimators R̂_i and Ĝ_i for each arm i ∈ X_p
13:   if there are more basic variables in x̂*(p) then
14:     X̂_{p+1} ← X̂_p ∪ {the |X_p|/4 optimal arms in X_p with the largest Ĝ_i}
15:   else
16:     X̄_{p+1} ← X̄_p ∪ {the |X_p|/4 sub-optimal arms in X_p with the largest R̂_i}
17:   end if
18:   X_{p+1} ← X_0 \ (X̂_{p+1} ∪ X̄_{p+1})
19: end for
20: Output X̂_{⌈log_{4/3} m⌉}

Notice that the theoretical analysis of BASEOAK shows a dependence on |X*| in the probability that the learning strategy makes mistakes for the OAK problem. The main reason is that an accurate estimator of the fixed optimal distribution suffices to guarantee a low error probability.
However, BASEOAK performs uniform exploration over all surviving arms during each phase, which veers widely from the optimal distribution. It also means BASEOAK never deletes any optimal arm from the surviving set during the game, so its exploration ability is limited. The dependence could be avoided if the algorithm obtained an accurate estimator before the reject/accept phases. Based on these observations, we present the FULLOAK algorithm. There are two steps: the first step derives from the UCB family of algorithms and aims to converge to the optimal solution of (1); the second step is based on BASEOAK, with the difference that FULLOAK deletes all accepted arms from the surviving arm set at the end of each phase.

We provide the specification of the first step below. Let n_i(t) denote the number of pulls of arm i before round t + 1. Let µU(t) and CL(t) denote the upper-confidence-bound reward vector and lower-confidence-bound consumption matrix at round t, respectively. Formally,

µU_i(t) := proj_{[0,1]}(µ̂_i(t) + 2 f_rad(µ̂_i(t), n_i(t) + 1)),
CL_{j,i}(t) := proj_{[0,1]}(Ĉ_{j,i}(t) − 2 f_rad(Ĉ_{j,i}(t), n_i(t) + 1)),

where proj_{[0,1]} projects a real number onto the interval [0, 1] and f_rad(v, n) = √(γv/n), γ > 0, is a confidence-radius function. Then, after the selection of round t, consider the following linear program:

max (µU(t))ᵀx  s.t.  CL(t)x ≤ (1 − ϵ)b·𝟙,  x ≥ 0.   (5)

The algorithm solves this linear program and selects an arm according to its optimal solution in each round during the first step. Let n_i(T/3) denote the number of pulls of arm i during the first step. Then we obtain x̂*_{T/3} with (x̂*_{T/3})_i = n_i(T/3) / Σ_{i′∈[m]} n_{i′}(T/3).

### 4.2. Theoretical result

The following theorem gives the error bound for FULLOAK.

Theorem 4.1. For the OAK problem, with ϵ = √(3 log(mdT)·m/B) + 3 log(mdT)·m·log T / B, Algorithm 2 makes errors with probability at most O(mdT · exp(−αb²B/H)), where α is a constant.

Proof sketch.
The upper confidence bound µU(T/3) of the expected reward and the lower confidence bound CL(T/3) of the expected consumption satisfy the following properties: (1) with probability at least 1 − 2mdT·exp(−Ω(γ)), the value (µU(T/3))ᵀ x̂*_{T/3} lies within a small neighborhood of OPT_LP; (2) with probability at least 1 − 2mdT·exp(−Ω(γ)), the realized consumption during the first step remains below B/3 up to lower-order terms of order γm log T. Notice that the consumption during the second step is at most B/2, so the total consumption of FULLOAK is less than 5B/6 with high probability. For the second step, at the end of phase p, for any optimal arm i ∈ X* and any sub-optimal arm i′ ∈ X̄*, the probability that R̂_i(p) > R̂_{i′}(p) is at most 2md·exp(−α₁ b²R_{i′}² s(p)) for some constant α₁, and the probability that Ĝ_i(p) < Ĝ_{i′}(p) is at most 2md·exp(−2b²G_i² s(p)/9). Similar to the proof of Theorem 3.2, we bound the probability that the algorithm makes mistakes by ignoring the (1/16)|X_p| arms with the smallest G_i and the (1/16)|X_p| arms with the largest R_i in the active arm set, which completes the proof. The details are provided in Appendix C.

### 4.3. Lower bound

We provide an instance-dependent lower bound. Our analysis shows that any bandit strategy necessarily makes a mistake on some OAK problem instance.

Theorem 4.2. For some OAK problem instances, for any bandit algorithm that outputs an arm set O ⊆ [m] at the end of the T-th round, it holds that P(O ≠ X*) ≥ Ω(exp(−βb²B/H)), where β is a constant.

Proof. We provide the core construction below and give the detailed proof in Appendix D. Let (p_w)_{2≤w≤W} ∈ [1/4, 1/2) be W − 1 real numbers, let p_1 = 1/2, and define the gaps l_w := 1/2 − p_w. Assume m is an exact multiple of W. Each mean µ_i is then obtained by shrinking the gap l_w geometrically in ⌈(m − i)/W⌉, where w = (i mod W), i ∈ [m] (the exact expression is given in Appendix D). Let π_i denote the Bernoulli distribution with mean µ_i and π′_i the Bernoulli distribution with mean 1 − µ_i. Consider W problem instances with time horizon T, m arms, d types of resources being consumed, and knapsack b = W/m for each type of resource.
To ease the reading, assume T is a power of 2, W = Ω(√m), and d > m/W. Let w = (i mod W); in the u-th problem instance, the i-th arm is associated with the reward distribution π^u_i := π_i·1{w ≠ u} + π′_i·1{w = u}, u ∈ [W], i ∈ [m]. The consumption vector c^u_i satisfies (c^u_i)_1 = (c^u_i)_w = (c^u_i)_d = 1 and (c^u_i)_j = 0 for all j ∉ {1, w, d}.

## 5. Special Cases

In this section, we investigate some special cases of the OAK problem, including the simple OAK problem and some classical pure-exploration problems.

### 5.1. Simple OAK problem

The upper and lower bounds show a dependence on |X*| in the probability that BASEOAK makes mistakes for the general OAK problem. This dependence can be avoided for some simple OAK problems. We say an OAK problem is a simple OAK problem if the deletion of any optimal arm does not change R_i or G_i of any other arm. We provide some examples of simple OAK problems in Sec. 5.2. For the simple OAK problem, we present a variant of BASEOAK (Algorithm 3) that additionally eliminates the accepted arms from the active arm set at the end of each accept phase.

Theorem 5.1. For the simple OAK problem, Algorithm 3 makes errors with probability at most O(m² · exp(−κT / (H log m))), where κ is a constant.

We provide the specification of Algorithm 3 and the proof details of Theorem 5.1 in Appendix E.1.

### 5.2. Pure exploration problems

Example 5.2 (Best arm identification). The best arm identification problem can be modeled by the OAK problem with one resource (the time resource) and one optimal arm. For the BAI problem, Algorithm 3 makes errors with probability at most O(log m · exp(−κT / (H log m))), where κ is a constant. Notice that for the BAI problem, our complexity measure H coincides with the complexity measure introduced in (Audibert et al., 2010).

[Figure 1: accuracy of the algorithms in different environments — (a) time horizon T ∈ {5000, 10000, 20000}; (b) gap ϵ ∈ {0.01, 0.05, 0.1}; (c) per-round knapsack b ∈ {0.1, 0.2, 0.4}.]
The result recovers the tight lower bound (Carpentier & Locatelli, 2016) up to a logarithmic factor and recovers the state-of-the-art upper bound in (Karnin et al., 2013). We provide the details of the proof in Appendix E.2.

Example 5.3 (Top-K and MB problems). For any K ∈ [m], the Top-K problem can be modeled by the simple OAK problem with d = 2 and |X*| = K. The consumption vector of each arm is deterministic, (b, b/K)ᵀ. Note that the number of optimal arms K is known to the learner. Let P = {X(1), ..., X(K)} be a partition of [m]. The MB problem with P can be modeled by the simple OAK problem with d = K + 1 and |X*| = K. The consumption vector C_{·,i} of each arm i ∈ X(k), k ∈ [K], is deterministic with C_{1,i} = b, C_{k,i} = b/|X(k)|, and C_{j,i} = 0 for j ∉ {1, k}. Note that the number of optimal arms K and the partition P are known to the learner. For the Top-K problem or the MB problem with K partitions, Algorithm 3 makes errors with probability at most O(m · exp(−κT / (H log m))), where κ is a constant. Notice that the multiplicative factor in Example 5.3 is O(m), while the previous upper bounds are O(m²) for Top-K and MB in the fixed-budget setting (Bubeck et al., 2013; Chen et al., 2014). We provide the proof in Appendix E.2.

## 6. Numerical Evaluations

We consider a specific instance with four arms (m = 4), three types of resources (d = 3), an expected per-round resource constraint 0 < b ≤ 1, and a parameter 0 < ϵ ≤ b. The unknown reward vector is r = (0.5, 0.5 − ϵ, 0.5, 0.5), and the unknown expected resource consumption is represented by the matrix

C = [ 0.5  0.5  0    0
      0    0    0.5  0.5 + ϵ
      b    b    b    b ].

We begin by considering the case where the knapsack b = 0.2 and the gap ϵ = 0.01. Among the available arms, the ones with indices 1 and 3 are optimal, while the rest are sub-optimal. To evaluate the performance of different algorithms, we compare the probability of outputting the index set containing all optimal arms. All results are averages over 100 runs.
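As a sanity check on this instance, a brute-force LP solve in pure Python (our own sketch; with m = 4 the vertex enumeration below is adequate, though a real implementation would use an LP solver) confirms that arms 1 and 3 are exactly the optimal arms and that OPT_LP = 0.4:

```python
import itertools

def gauss(A, y):
    """Solve the square system A x = y; None if (near-)singular."""
    n = len(A)
    M = [list(A[i]) + [y[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * t for a, t in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lp_max(mu, C, b):
    """Maximize mu.x s.t. Cx <= b componentwise, x >= 0 (vertex enumeration)."""
    m = len(mu)
    rows = [list(r) for r in C] + [[-float(j == i) for j in range(m)] for i in range(m)]
    rhs = list(b) + [0.0] * m
    best, best_x = float("-inf"), None
    for idx in itertools.combinations(range(len(rows)), m):
        x = gauss([rows[k] for k in idx], [rhs[k] for k in idx])
        if x is None or min(x) < -1e-9:
            continue
        if any(sum(r[j] * x[j] for j in range(m)) > t + 1e-9 for r, t in zip(rows, rhs)):
            continue
        v = sum(mu[j] * x[j] for j in range(m))
        if v > best:
            best, best_x = v, x
    return best, best_x

# The Section 6 instance with b = 0.2 and eps = 0.01.
eps, pb = 0.01, 0.2
mu = [0.5, 0.5 - eps, 0.5, 0.5]
C = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5 + eps],
     [pb, pb, pb, pb]]          # last row: each pull costs b of the time-like resource
opt, x = lp_max(mu, C, [pb] * 3)
support = [i + 1 for i in range(4) if x[i] > 1e-6]   # 1-indexed arm labels
```

The optimal solution puts mass 0.4 on each of arms 1 and 3 and none on arms 2 and 4, which only differ by paying ϵ more (in reward or in consumption) within their pair.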
Note that traditional regret-minimization algorithms are not suitable for handling the OAK setting. This can be attributed to the fact that these algorithms rely on the optimism-under-uncertainty principle, and the associated confidence radius is often too large to explore the entire latent structure adequately. Consequently, the selection probability of not-too-bad arms remains high even when the game is ending. To compare our algorithms with traditional strategies, we propose a modification of the UcbBwK algorithm (Babaioff et al., 2015) that makes it more suitable for the OAK task. Specifically, we modify the algorithm to compute the LP based on the mean estimator after the last round and to output the optimal arms based on its solution. We refer to this new algorithm as MeanBwK. The algorithm maintains a distribution D(t) over arms for selecting arms during the time horizon, and we let it identify optimal arms based on the distribution of the last round, i.e., an arm i is considered optimal if and only if D_i(τ) > 10⁻³, where τ is the index of the last round before the algorithm stops.

The results (accuracy) obtained in different environments are summarized in Figure 1. Figure 1(a) provides results for different time horizons. It is noteworthy that our algorithms outperform MeanBwK due to their new designs tailored specifically to the OAK setting. To further evaluate the performance of our algorithms, we vary the value of ϵ while keeping b = 0.2 and T = 2·10⁴ fixed, and present the results in Figure 1(b). Note that for smaller ϵ, algorithms are more susceptible to errors. We also investigate the impact of different values of b on the performance of our algorithms. We conduct experiments with ϵ = 0.01 and T = 2·10⁴ fixed, while also modifying the consumption to ensure that there are at least two optimal arms. The results of these experiments are presented in Figure 1(c).
Based on the results, we observe that smaller values of b make it more challenging for the algorithms to identify all optimal arms, even with a longer time horizon, which is consistent with our theoretical analysis. The code for our algorithms is available at https://github.com/ShaoangLi/OAK-problem.git.

7. Related Work

7.1. Pure exploration problems

Best arm identification in the fixed budget setting was introduced by (Audibert et al., 2010). Subsequent works (Karnin et al., 2013; Carpentier & Locatelli, 2016; Chen et al., 2017c) establish upper and lower bounds for BAI. There are several extensions of BAI, including top-K best arms identification (Kalyanakrishnan et al., 2012; Bubeck et al., 2013; Chen et al., 2017a;b; Réda et al., 2021; Zhou & Tian, 2022), θ-threshold arms identification (Locatelli et al., 2016; Mukherjee et al., 2017; Xu et al., 2020), ϵ-best arm identification (Kano et al., 2019; Katz-Samuels & Jamieson, 2020), multi-bandits best arms identification (Gabillon et al., 2011; Bubeck et al., 2013), and other variants (Abbasi-Yadkori et al., 2018; Rizk et al., 2021; Zhang & Ong, 2021; Zhong et al., 2021; Barrier et al., 2022; Wang et al., 2022). The Feasible Arms Identification (FAI) setting (Katz-Samuels & Scott, 2018; 2019) aims to identify all feasible arms (arms whose distributions have means belonging to a known polyhedron) and the top-K feasible arms, respectively, while in the OAK problem the decision-maker aims to identify all optimal arms.
There are several fundamental differences between the OAK setting and the Feasible Arms Identification problem: (1) the expected reward vector µ is unknown in the OAK setting but known in the FAI setting; (2) the learner has to consider infinitely many candidate distributions satisfying Cx ≤ b (the consumption matrix C is unknown) in the OAK setting, while it only needs to consider finitely many (m) distributions satisfying x_i ∈ D (the feasible domain D is known) in the FAI setting.

7.2. Bandits with knapsacks

Another line of work relevant to this paper is bandits with knapsacks. The regret minimization setting of stochastic BwK was first introduced and optimally solved in (Badanidiyuru et al., 2013) to encompass application domains where the learner is limited by resource constraints. Subsequent works provide a UCB-based algorithm for the BwK problem (Agrawal & Devanur, 2014) and a black-box reduction from bandits to BwK (Immorlica et al., 2019); both achieve near-optimal worst-case regret. Several works (Flajolet & Jaillet, 2015; Sankararaman & Slivkins, 2021; Ren et al., 2021; Li et al., 2021) study the problem-dependent regret of BwK. There are other versions of BwK, including budgeted bandits (Tran-Thanh et al., 2010; 2012; Ding et al., 2013; Cayci et al., 2020; Das et al., 2022), contextual bandits with knapsacks (Badanidiyuru et al., 2014; Agrawal & Devanur, 2016; Agrawal et al., 2016; Sivakumar et al., 2022; Li & Stoltz, 2022), combinatorial semi-bandits with knapsacks (Sankararaman & Slivkins, 2018), adversarial bandits with knapsacks (Immorlica et al., 2019; Kesselheim & Singla, 2020; Castiglioni et al., 2022), other variants (Liu et al., 2022b;a; Kumar & Kleinberg, 2022), and applications (Badanidiyuru et al., 2012; Babaioff et al., 2015; Li et al., 2022). The regret minimization setting aims to trade off exploration and exploitation, while the OAK setting is a pure-exploration framework. As a result, the two settings require different techniques for proving lower and upper bounds.
To achieve near-optimal instance-dependent regret for BwK, (Li et al., 2021) provide an algorithm (phase I of their Algorithm 1) from the primal-dual perspective that can identify the optimal arm set. However, the exploration setting is different, and this primal-dual algorithm cannot handle the OAK task for two fundamental reasons. First, the theoretical result of the primal-dual algorithm (phase I) is based on the assumption that the algorithm does not exceed the resource constraint, while the OAK problem is motivated by pure exploration with resource consumption and hard knapsacks during the learning process. Second, the stopping condition of the primal-dual algorithm depends on a pre-set confidence radius due to the optimism-under-uncertainty strategy. With a fixed pre-set confidence radius and the unknown hardness of the problem instance, the stopping time of the primal-dual algorithm is unpredictable.

8. Conclusion

We consider the optimal arms identification with knapsacks problem, which extends best arm identification by considering resource consumption. We present a novel, parameter-free algorithm that returns the optimal arms with high probability. We propose a new complexity measure for the OAK problem, which builds a bridge between the OAK and BwK problems. We provide error upper and lower bounds for the general OAK problem based on the new complexity measure. We further investigate some special cases, and the results show that the proposed algorithm recovers or improves the state-of-the-art upper bounds for some classical pure exploration problems.

Acknowledgments

The research is partially supported by the National Key R&D Program of China under Grant No. 2021ZD0110400, the Innovation Program for Quantum Science and Technology 2021ZD0302900, the China National Natural Science Foundation under Grants No. 62132018 and No. 61932016, the Pioneer and Leading Goose R&D Program of Zhejiang 2023C01029, and the Fundamental Research Funds for the Central Universities WK2150110024.
References

Abbasi-Yadkori, Y., Bartlett, P., Gabillon, V., Malek, A., and Valko, M. Best of both worlds: Stochastic & adversarial best-arm identification. In Conference on Learning Theory, pp. 918–949. PMLR, 2018.
Agrawal, S. and Devanur, N. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29:3450–3458, 2016.
Agrawal, S. and Devanur, N. R. Bandits with concave rewards and convex knapsacks. In Conference on Economics and Computation, pp. 989–1006. ACM, 2014.
Agrawal, S., Devanur, N. R., and Li, L. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pp. 4–18. PMLR, 2016.
Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In Conference on Learning Theory, pp. 41–53. Citeseer, 2010.
Babaioff, M., Dughmi, S., Kleinberg, R. D., and Slivkins, A. Dynamic pricing with limited supply. ACM Trans. Economics and Comput., 3(1):4:1–4:26, 2015.
Badanidiyuru, A., Kleinberg, R., and Singer, Y. Learning on a budget: posted price mechanisms for online procurement. In Proceedings of the 13th ACM Conference on Electronic Commerce, pp. 128–145. ACM, 2012.
Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. In Symposium on Foundations of Computer Science, pp. 207–216. IEEE, 2013.
Badanidiyuru, A., Langford, J., and Slivkins, A. Resourceful contextual bandits. In Conference on Learning Theory, pp. 1109–1134. PMLR, 2014.
Barrier, A., Garivier, A., and Kocák, T. A non-asymptotic approach to best-arm identification for gaussian bandits. In International Conference on Artificial Intelligence and Statistics, volume 151, pp. 10078–10109. PMLR, 2022.
Bubeck, S., Wang, T., and Viswanathan, N. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pp. 258–265. PMLR, 2013.
Carpentier, A. and Locatelli, A. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp. 590–604. PMLR, 2016.
Castiglioni, M., Celli, A., and Kroer, C. Online learning with knapsacks: the best of both worlds. CoRR, abs/2202.13710, 2022.
Cayci, S., Eryilmaz, A., and Srikant, R. Budget-constrained bandits over general cost and reward distributions. In International Conference on Artificial Intelligence and Statistics, volume 108, pp. 4388–4398. PMLR, 2020.
Chen, J., Chen, X., Zhang, Q., and Zhou, Y. Adaptive multiple-arm identification. In International Conference on Machine Learning, pp. 722–730. PMLR, 2017a.
Chen, L., Li, J., and Qiao, M. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pp. 101–110. PMLR, 2017b.
Chen, L., Li, J., and Qiao, M. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pp. 535–592. PMLR, 2017c.
Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, volume 27, pp. 379–387, 2014.
Das, D., Jain, S., and Gujar, S. Budgeted combinatorial multi-armed bandits. In International Conference on Autonomous Agents and Multiagent Systems, pp. 345–353, 2022.
Ding, W., Qin, T., Zhang, X.-D., and Liu, T.-Y. Multi-armed bandit with budget constraint and variable costs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2013.
Flajolet, A. and Jaillet, P. Logarithmic regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800, 2015.
Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-bandit best arm identification. Advances in Neural Information Processing Systems, 24, 2011.
Immorlica, N., Sankararaman, K. A., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. In Symposium on Foundations of Computer Science, pp. 202–219. IEEE, 2019.
Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, volume 12, pp. 655–662, 2012.
Kano, H., Honda, J., Sakamaki, K., Matsuura, K., Nakamura, A., and Sugiyama, M. Good arm identification via bandit feedback. Machine Learning, 108(5):721–745, 2019.
Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246. PMLR, 2013.
Katz-Samuels, J. and Jamieson, K. The true sample complexity of identifying good arms. In International Conference on Artificial Intelligence and Statistics, pp. 1781–1791. PMLR, 2020.
Katz-Samuels, J. and Scott, C. Feasible arm identification. In International Conference on Machine Learning, pp. 2535–2543. PMLR, 2018.
Katz-Samuels, J. and Scott, C. Top feasible arm identification. In International Conference on Artificial Intelligence and Statistics, pp. 1593–1601. PMLR, 2019.
Kesselheim, T. and Singla, S. Online learning with vector costs and bandits with knapsacks. In Conference on Learning Theory, pp. 2286–2305. PMLR, 2020.
Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Dwork, C. (ed.), Symposium on Theory of Computing, pp. 681–690. ACM, 2008.
Kumar, R. and Kleinberg, R. Non-monotonic resource utilization in the bandits with knapsacks problem. In Advances in Neural Information Processing Systems, 2022.
Li, S., Zhang, L., and Li, X. Online pricing with limited supply and time-sensitive valuations. In IEEE Conference on Computer Communications, pp. 860–869. IEEE, 2022.
Li, X., Sun, C., and Ye, Y. The symmetry between arms and knapsacks: A primal-dual approach for bandits with knapsacks. In International Conference on Machine Learning, pp. 6483–6492. PMLR, 2021.
Li, Z. and Stoltz, G. Contextual bandits with knapsacks for a conversion model. In Advances in Neural Information Processing Systems, 2022.
Liu, Q., Xu, W., Wang, S., and Fang, Z. Combinatorial bandits with linear constraints: Beyond knapsacks and fairness. In Advances in Neural Information Processing Systems, 2022a.
Liu, S., Jiang, J., and Li, X. Non-stationary bandits with knapsacks. In Advances in Neural Information Processing Systems, 2022b.
Locatelli, A., Gutzeit, M., and Carpentier, A. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning, pp. 1690–1698. PMLR, 2016.
Megiddo, N. and Chandrasekaran, R. On the ε-perturbation method for avoiding degeneracy. Operations Research Letters, 8(6):305–308, 1989.
Mukherjee, S., Naveen, K. P., Sudarsanam, N., and Ravindran, B. Thresholding bandits with augmented UCB. arXiv preprint arXiv:1704.02281, 2017.
Réda, C., Kaufmann, E., and Delahaye-Duriez, A. Top-m identification for linear bandits. In International Conference on Artificial Intelligence and Statistics, volume 130, pp. 1108–1116. PMLR, 2021.
Ren, W., Liu, J., and Shroff, N. B. On logarithmic regret for bandits with knapsacks. In Conference on Information Sciences and Systems, pp. 1–6. IEEE, 2021.
Rizk, G., Thomas, A., Colin, I., Laraki, R., and Chevaleyre, Y. Best arm identification in graphical bilinear bandits. In International Conference on Machine Learning, pp. 9010–9019. PMLR, 2021.
Sankararaman, K. A. and Slivkins, A. Combinatorial semi-bandits with knapsacks. In International Conference on Artificial Intelligence and Statistics, pp. 1760–1770. PMLR, 2018.
Sankararaman, K. A. and Slivkins, A. Bandits with knapsacks beyond the worst case. Advances in Neural Information Processing Systems, 34, 2021.
Sivakumar, V., Zuo, S., and Banerjee, A. Smoothed adversarial linear contextual bandits with knapsacks. In International Conference on Machine Learning, volume 162, pp. 20253–20277. PMLR, 2022.
Tran-Thanh, L., Chapman, A., De Cote, E. M., Rogers, A., and Jennings, N. R.
Epsilon-first policies for budget-limited multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2010.
Tran-Thanh, L., Chapman, A., Rogers, A., and Jennings, N. Knapsack based optimal policies for budget-limited multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pp. 1134–1140, 2012.
Wang, Z., Wagenmaker, A. J., and Jamieson, K. G. Best arm identification with safety constraints. In International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 9114–9146. PMLR, 2022.
Xu, Y., Chen, X., Singh, A., and Dubrawski, A. Thresholding bandit problem with both duels and pulls. In International Conference on Artificial Intelligence and Statistics, pp. 2591–2600. PMLR, 2020.
Zhang, M. and Ong, C. S. Quantile bandits for best arms identification. In International Conference on Machine Learning, pp. 12513–12523. PMLR, 2021.
Zhong, Z., Cheung, W. C., and Tan, V. Probabilistic sequential shrinking: A best arm identification algorithm for stochastic bandits with corruptions. In International Conference on Machine Learning, pp. 12772–12781. PMLR, 2021.
Zhou, R. and Tian, C. Approximate top-m arm identification with heterogeneous reward variances. In International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 7483–7504. PMLR, 2022.

A. Analysis of Complexity Measure (Theorem 3.1)

After removing all non-active constraints of (1), we obtain a standard-form LP. Formally, based on the matrix C, by arranging the |X*| basic columns and the |Y*| active rows next to each other, we obtain a |Y*| × |X*| optimal basis matrix corresponding to x*; let P denote this basis matrix.
Similarly, by arranging the |X̄*| non-basic columns and the |Y*| active rows next to each other, we obtain a |Y*| × |X̄*| non-basis matrix; let Q denote it. We can then construct a matrix A := [P | Q] with |X*| linearly independent columns (rows). We use b′ to denote the vector (b, . . . , b) ∈ (0, 1]^{|X*|}. The standard-form LP is

max µ^T x, s.t. Ax = b′, x ≥ 0. (6)

It is obvious that (1) and (6) have the same feasible domain D, optimal solution x*, and optimal value OPT_LP. The dual problem of (6) is

min (b′)^T w, s.t. A^T w ≥ µ, w ≥ 0,

and w*_B is its optimal solution. We define a new m × m square matrix

M := [ P  Q ]
     [ 0  I ],

where I denotes the identity matrix. The inverse of M is

M^{-1} := [ P^{-1}  −P^{-1}Q ]
          [ 0        I       ].

Define the m-dimensional vector d_q := (M^{-1})_{·,q}, q ∈ X̄*. Notice that x* is the optimal extreme point of D. Under Assumption 2.1, there are |X̄*| neighbors of x*, and d_q is the edge direction from x* leading to the adjacent extreme point x(q) corresponding to the increase of the sub-optimal variable q ∈ X̄*. We use x(q) to denote the q-th best adjacent vertex, i.e., µ^T x* > µ^T x(1) ≥ . . . ≥ µ^T x(q) ≥ . . ., q ∈ X̄*. We define the adjacent gap ∆(q) as ∆(q) = µ^T x* − µ^T x(q), q ∈ X̄*.

Now consider the relationship between ∆(q) and R(q). Let d_{q,i} denote the i-th element of d_q, and define the step length α_q = min_{i ∈ X*: d_{q,i} < 0} x*_i/(−d_{q,i}). We have

∆(q) = −α_q µ^T d_q = −α_q (µ_q − (w*_B)^T Q_{·,q}) = −α_q (µ_q − (w*)^T C_{·,q}) = α_q R(q).

From the geometry of linear programming, α_q is the distance between the vertex x* and x(q), and the distance between any two vertices of the polyhedron (i.e., the feasible domain of (1)) is not more than 2. From the definitions of the vertex gap and the adjacent gap, we have ∆_{q+1} ≤ ∆(q). Combining the two facts, we obtain R(i) ≥ ∆_{i+1}/2.

Then consider LP (3); notice that the feasible domain of (3) is a subset of D and the optimal extreme point of (3) is also a vertex of D. The feasible domain of (3) is nonempty because of the existence of null arms. Under Assumption 2.1, different instances of LP (3) with different absent optimal arms have different vertices.
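The block-triangular inverse of M used in this argument can be verified numerically; a small sketch with toy blocks (the specific P and Q values are hypothetical, chosen only so that P is nonsingular):

```python
import numpy as np

# Toy basis / non-basis blocks (hypothetical values; P must be nonsingular).
P = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
k, s = P.shape[0], Q.shape[1]

# M = [[P, Q], [0, I]] and its claimed inverse [[P^-1, -P^-1 Q], [0, I]].
M = np.block([[P, Q], [np.zeros((s, k)), np.eye(s)]])
Pinv = np.linalg.inv(P)
Minv = np.block([[Pinv, -Pinv @ Q], [np.zeros((s, k)), np.eye(s)]])

assert np.allclose(M @ Minv, np.eye(k + s))    # the block formula is an inverse

# Column q of M^-1 (q non-basic) is the edge direction d_q: its last s
# coordinates form a unit vector, its first k coordinates are -P^-1 Q[:, q].
q = k  # first non-basic column
assert np.allclose(Minv[:k, q], -Pinv @ Q[:, 0])
```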
Based on the definition of the vertex gap, we can conclude that G(i) ≥ ∆_{i+1}.

B. Analysis of Algorithm 1 (Theorem 3.2)

In this section, we analyze the probability bound of Algorithm 1 and prove Theorem 3.2.

B.1. Main lemmas

Lemma B.1. At the end of phase p, if the algorithm makes no errors before the beginning of phase p, then with probability at least 1 − 2md exp(−2δ_p²s(p)), the optimal value of (4) satisfies µ̂^T x̂*(p) ≥ (1 − δ_p/b)OPT_LP − δ_p.

Proof. According to Hoeffding's inequality, with probability at least 1 − 2md exp(−2δ_p²s(p)), for all optimal arms i and all active constraints j in (1) and (4), we have

max(µ_i − δ_p, 0) ≤ µ̂_i(p) ≤ min(1, µ_i + δ_p),
max(C_{j,i} − δ_p, 0) ≤ Ĉ_{j,i}(p) ≤ min(1, C_{j,i} + δ_p).

Under this event, consider the following linear program:

max µ^T x, s.t. x^T C_{j,·} ≤ b − δ_p, j ∈ [d], x ≥ 0.

Let x*(b_L) and OPT_LP(b_L) denote its optimal solution and optimal value, respectively. According to (Agrawal & Devanur, 2014), we have OPT_LP(b_L) ≥ (1 − δ_p/b)OPT_LP.

Then we show that x*(b_L) is also a feasible solution for LP(µ̂, Ĉ) of phase p. Consider the j-th resource:

Σ_{i=1}^m Ĉ_{j,i} x*_i(b_L) = Σ_{i=1}^m (Ĉ_{j,i} − C_{j,i}) x*_i(b_L) + Σ_{i=1}^m C_{j,i} x*_i(b_L) ≤ δ_p + (b − δ_p) = b.

With this feasibility, we have

µ̂^T x̂*(p) ≥ µ̂^T x*(b_L) = µ^T x*(b_L) + (µ̂ − µ)^T x*(b_L) ≥ OPT_LP(b_L) − δ_p.

Then we get µ̂^T x̂*(p) ≥ (1 − δ_p/b)OPT_LP − δ_p.

Lemma B.2. At the end of phase p, if one of the optimal arms is not active in x̂*(p), then with probability at least 1 − 2md exp(−2δ_p²s(p)), the optimal value of (4) satisfies µ̂^T x̂*(p) ≤ (1 + δ_p/b)(OPT_LP − ∆_2) + δ_p.

Proof. According to Theorem 3.1, for any arm i ∈ [m], we have

OPT^{−i}_LP ≤ OPT_LP − ∆_2. (7)

The same concentration bounds as in the proof of Lemma B.1 hold with probability at least 1 − 2md exp(−2δ_p²s(p)) for all optimal arms i and all active constraints j in (1) and (4). Under this event, assume one of the optimal arms i ∈ X* is eliminated in phase p. Consider the following linear program:
max µ^T x, s.t. x^T C_{j,·} ≤ b + δ_p, j ∈ [d], x ≥ 0, x_i = 0. (8)

Let x*^{−i}(b_U) and OPT^{−i}_LP(b_U) denote its optimal solution and optimal value, respectively. According to (Agrawal & Devanur, 2014), we have

OPT^{−i}_LP ≥ (b/(b + δ_p)) OPT^{−i}_LP(b_U).

Combining this with Theorem 3.1, we get

OPT^{−i}_LP(b_U) ≤ (1 + δ_p/b)(OPT_LP − ∆_2).

Then we show that x̂*(p) is also a feasible solution for (8) of phase p. Consider the j-th resource:

Σ_{i=1}^m C_{j,i} x̂*_i(p) = Σ_{i=1}^m (C_{j,i} − Ĉ_{j,i}) x̂*_i(p) + Σ_{i=1}^m Ĉ_{j,i} x̂*_i(p) ≤ δ_p + b.

With this feasibility, we have

µ̂^T x̂*(p) = µ^T x̂*(p) + (µ̂ − µ)^T x̂*(p) ≤ OPT^{−i}_LP(b_U) + δ_p ≤ (1 + δ_p/b)(OPT_LP − ∆_2) + δ_p.

Then we complete the proof.

Lemma B.3. At the end of phase p, for any optimal arm i* ∈ X* and any active sub-optimal arm i ∈ X̄* ∩ X_p, the probability that R̂_{i*}(p) > R̂_i(p) is at most 2md exp(−2b²R_i²s(p)/9).

Proof. Consider the dual form

min b^T w, s.t. C^T w ≥ µ, w ≥ 0,

and let ŵ*(p) denote the optimal solution of its empirical counterpart. According to Lemmas B.1 and B.2, we have

(1 − δ_p/b) b^T w* − δ_p ≤ b^T ŵ*(p) ≤ (1 + δ_p/b) b^T w* + 2δ_p,

which yields upper and lower bounds on Σ_{j=1}^d (ŵ*(p))_j in terms of Σ_{j=1}^d w*_j. (9)–(10)

Consider the R values of i* and i:

R_{i*} = (w*)^T C_{·,i*} − µ_{i*} ≤ 0, R_i = (w*)^T C_{·,i} − µ_i > 0.

From (9) and (10), bounding each Ĉ_{j,i} by C_{j,i} ± δ_p and each µ̂_i by µ_i ± δ_p, we obtain

R̂_{i*}(p) = Σ_{j=1}^d (ŵ*(p))_j Ĉ_{j,i*} − µ̂_{i*} ≤ R_{i*} + δ_p(2/b + 1) + δ_p,
R̂_i(p) = Σ_{j=1}^d (ŵ*(p))_j Ĉ_{j,i} − µ̂_i ≥ R_i − δ_p(2/b + 1) − δ_p.

According to the strong duality theorem, we have Σ_{j=1}^d w*_j = (1/b) Σ_{i=1}^m µ_i x*_i ≤ 1/b. From Hoeffding's inequality and Lemmas B.1 and B.2, the event R̂_{i*}(p) > R̂_i(p) implies R_i = O(δ_p/b), hence

P[R̂_{i*}(p) > R̂_i(p)] ≤ 2md exp(−2b²R_i²s(p)/9).

Then we complete the proof.

Lemma B.4. During phase p, for any optimal arm i ∈ X* and any active sub-optimal arm i′ ∈ X̄* ∩ X_p, the probability that Ĝ_i(p) < Ĝ_{i′}(p) is at most 2md exp(−2b²G_i²s(p)/9).

Proof. During phase p, for each arm i ∈ [m], consider the linear program

max µ^T x, s.t. Cx ≤ b·1, x ≥ 0, x_i = 0,

and let OPT^{−i}_LP denote its optimal value.
According to Lemma B.1 and the proof of Lemma B.2, with probability at least 1 − 2md exp(−2δ_p²s(p)), we have

ÔPT^{−i}_LP(p) ≤ (1 + δ_p/b)(OPT_LP − G_i) + δ_p.

Then we have

P[Ĝ_i(p) < Ĝ_{i′}(p)] ≤ P[(1 − δ_p/b)OPT_LP − δ_p < (1 + δ_p/b)(OPT_LP − G_i) + δ_p] ≤ P[bG_i < δ_p(1 + 2(OPT_LP − G_i))] ≤ 2md exp(−2b²G_i²s(p)/9).

B.2. Proof of Theorem 3.2

For all arms in X_p, let P_p and P̂_p denote the sets of optimal arms for (1) and (4), and Q_p and Q̂_p the sets of sub-optimal arms for (1) and (4), respectively. Consider the first phase p such that at least one optimal arm is eliminated or one sub-optimal arm is added to X′_p. Let S′_p denote the (1/16)|X_p| arms with the smallest G_i and S″_p the (1/16)|X_p| arms with the largest R_i, respectively. Define

Φ′_p := max_{i ∈ P_p\S′_p} exp(−2b²G_i²s(p)/9), Φ″_p := max_{i ∈ Q_p\S″_p} exp(−2b²R_i²s(p)/9).

Consider the case |Q_p| > |P_p|, and consider the number of arms in Q̂_p ∩ (P_p\S′_p):

E[|Q̂_p ∩ (P_p\S′_p)|] = Σ_{i ∈ P_p\S′_p} P[Ĝ_i(p) < Ĝ_{i′}(p)] ≤ Σ_{i ∈ P_p\S′_p} 2md exp(−2b²G_i²s(p)/9) ≤ 2md |P_p\S′_p| Φ′_p.

Then we apply Markov's inequality:

P[|Q̂_p ∩ (P_p\S′_p)| > (1/8)|Q_p|] ≤ 8 E[|Q̂_p ∩ (P_p\S′_p)|]/|Q_p| ≤ 16md (|P_p\S′_p|/|Q_p|) Φ′_p.

Then we have

P[|Q̂_p ∩ Q_p| > (3/4)|Q_p|] ≥ 1 − 16md (|P_p\S′_p|/|Q_p|) Φ′_p. (12)

Based on this event, let i*_p denote the eliminated optimal arm in phase p. Consider the number of arms in (Q̂_p ∩ Q_p)\S″_p with smaller R̂ than that of the eliminated optimal arm, and let N′_p denote it; then

E[N′_p] = Σ_{i ∈ (Q̂_p ∩ Q_p)\S″_p} P[R̂_{i*_p}(p) > R̂_i(p)] ≤ 2md |(Q̂_p ∩ Q_p)\S″_p| max_{i ∈ (Q̂_p ∩ Q_p)\S″_p} exp(−2b²R_i²s(p)/9).

Then we apply Markov's inequality:

P[N′_p > (1/6)|Q̂_p ∩ Q_p|] ≤ 6 E[N′_p]/|Q̂_p ∩ Q_p| ≤ 12md max_{i ∈ (Q̂_p ∩ Q_p)\S″_p} exp(−2b²R_i²s(p)/9).

Then we obtain that the probability that at least one optimal arm is eliminated is at most 32md Φ′_p + 12md Φ″_p.

Consider the case |P_p| > |Q_p|. Similarly, we have

P[|P̂_p ∩ (Q_p\S″_p)| > (1/8)|P_p|] ≤ 16md (|Q_p\S″_p|/|P_p|) Φ″_p,
P[|P̂_p ∩ P_p| > (3/4)|P_p|] ≥ 1 − 16md (|Q_p\S″_p|/|P_p|) Φ″_p. (14)

Let i′_p denote the added sub-optimal arm in phase p.
Consider the number of arms in (P̂_p ∩ P_p)\S′_p with smaller Ĝ than that of the sub-optimal arm i′_p, and let N″_p denote it; then

E[N″_p] = Σ_{i ∈ (P̂_p ∩ P_p)\S′_p} P[Ĝ_i(p) < Ĝ_{i′_p}(p)] ≤ Σ_{i ∈ (P̂_p ∩ P_p)\S′_p} 2md exp(−2b²G_i²s(p)/9) ≤ 2md |P̂_p ∩ P_p| max_{i ∈ (P̂_p ∩ P_p)\S′_p} exp(−2b²G_i²s(p)/9).

Then we apply Markov's inequality:

P[N″_p > (1/6)|P̂_p ∩ P_p|] ≤ 6 E[N″_p]/|P̂_p ∩ P_p| ≤ 12md max_{i ∈ (P̂_p ∩ P_p)\S′_p} exp(−2b²G_i²s(p)/9).

Then the probability that at least one sub-optimal arm is added to X′_p is at most 32md Φ″_p + 12md Φ′_p. In conclusion, the probability that the algorithm makes mistakes in phase p is at most

32md (Φ′_p + Φ″_p). (15)

Next, we analyze the algorithm to bound the probability of making mistakes over all phases. Clearly, the algorithm does not exceed the budget B. The pulls for each arm in X_p before phase p + 1 satisfy

s(p) ≥ B/(|X_p ∪ X′_p| log_{4/3} m), with |X_p ∪ X′_p| ≤ (3/4)^p m.

Then, letting i_p index the arm attaining the maximum, we have

Φ′_p = max_{i ∈ P_p\S′_p} exp(−2b²G_i²s(p)/9) ≤ exp(−2b²∆²_{i_p}s(p)/9) ≤ exp(−4b²B/(9·16·H log_{4/3} m)), (16)

and similarly

Φ″_p ≤ exp(−4b²B/(9·16·H log_{4/3} m)). (17)

Combining (15), (16), and (17), we complete the proof of Theorem 3.2.

Last, we analyze the algorithm's behavior. Consider the case |Q_p| > 2|P_p|; according to (12) (which does not rely on |Q_p| > |P_p|), with probability at least 1 − 16md (|P_p\S′_p|/|Q_p|) Φ′_p, we have

|P̂_p| = |P̂_p ∩ Q_p| + |P̂_p ∩ P_p| = (|Q_p| − |Q̂_p ∩ Q_p|) + (|P_p| − |Q̂_p ∩ P_p|) < (1/4)|Q_p| + |P_p| < |Q̂_p|.

Similarly, consider the case |P_p| > 2|Q_p|; according to (14), with probability at least 1 − 16md (|Q_p\S″_p|/|P_p|) Φ″_p, we have

|Q̂_p| = |Q̂_p ∩ Q_p| + |Q̂_p ∩ P_p| < (3/4)|P_p| < |P̂_p|.

C. Analysis of Algorithm 2

Our proof is based on the following concentration inequality.

Lemma C.1 ((Kleinberg et al., 2008; Babaioff et al., 2015)). Consider a distribution with values in [0, 1], and let v and v̂ be the expectation and the average of n independent samples x₁, x₂, . . .
, xₙ from this distribution, respectively. Then for each γ > 0,

P[|v − v̂| ≤ f_rad(v̂, n) ≤ 3 f_rad(v, n)] ≥ 1 − exp(−Ω(γ)), (18)

where f_rad(v, n) = √(γv/n) + γ/n. More generally, equation (18) holds if v = (1/n) Σ_{t=1}^n E[x_t | x₁, . . . , x_{t−1}].

We first prove the clean event in which the upper confidence bound µ_U(T/3) of the expected reward and the lower confidence bound C_L(T/3) of the expected consumption satisfy the following properties:

(1) with probability at least 1 − 2mdT exp(−Ω(γ)), |(µ_U(T/3))^T x̂_{T/3} − OPT_LP| ≤ O(√(γm/B) + γm/B);

(2) with probability at least 1 − 2mdT exp(−Ω(γ)), Σ_t (C_L(t))^T x_t − c_t ≤ O(√(γmB) + γm log T)·1.

The proof is standard and similar to Lemma 7.4 in (Badanidiyuru et al., 2013) and the theoretical analysis in Appendix B.3 of (Agrawal & Devanur, 2014) for the UCB algorithm for BwK; we provide the details for completeness. Let v̂ denote the empirical average of n samples. With probability at least 1 − exp(−Ω(γ)), we have

|v̂ − v| ≤ (n/(n+1)) f_rad(v̂, n) + v/(n+1) ≤ f_rad(v̂, n+1) + v/(n+1) ≤ 2 f_rad(v̂, n+1).

By taking a union bound, with probability 1 − mT exp(−Ω(γ)), we have

Σ_t r(t) ≤ Σ_t (µ_U(t))_{i(t)} and (T/3)(µ_U(T/3))^T x̂_{T/3} ≤ Σ_t (µ_U(t))_{i(t)}. (19)

According to (Badanidiyuru et al., 2013), for any two vectors a, n ∈ R^m₊, the following inequality always holds:

Σ_{i=1}^m f_rad(a_i, n_i) n_i ≤ √(γm (a^T n)) + γm. (20)

Bounding Σ_t ((µ_U(t))_{i(t)} − µ_{i(t)}) by Σ_i 3(n_i(T/3) + 1) f_rad(µ_i, n_i(T/3) + 1) and combining (19) and (20), we obtain

Σ_t (µ_U(t))_{i(t)} ≤ Σ_t r(t) + O(√(γm Σ_t r(t)) + γm),

and hence

(µ_U(T/3))^T x̂_{T/3} ≤ (3/T) Σ_t r(t) + O(√(γm) + γm)/T. (21)

Similarly, we can also prove that with probability 1 − mdT exp(−Ω(γ)),

Σ_t (C_L(t))^T x_t − c_t ≤ O(√(γmB) + γm)·1. (22)

Combining (21) and (22) with the inequality Σ_t r(t) ≥ (1 − ϵ)OPT_LP and substituting the specification of ϵ and γ = O(log(mdT)), we obtain the desired inequalities.
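The confidence radius f_rad of Lemma C.1 is simple to compute; a minimal sketch, assuming the standard form √(γv/n) + γ/n from (Kleinberg et al., 2008):

```python
import math

def frad(v, n, gamma):
    # Confidence radius of Lemma C.1 (Kleinberg et al., 2008):
    # frad(v, n) = sqrt(gamma * v / n) + gamma / n.
    return math.sqrt(gamma * v / n) + gamma / n

# The radius shrinks as the sample count n grows ...
assert frad(0.5, 400, 4.0) < frad(0.5, 100, 4.0)
# ... and for v = 0 only the second-order term gamma / n remains.
assert frad(0.0, 100, 4.0) == 4.0 / 100
```

Note the first-order term scales with the mean v itself, which is what makes the radius tight for small means (e.g., small per-round consumption).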
Notice that the consumption during the second step is at most B/2, so the total consumption of FULLOAK is less than 5B/6 with high probability. Define the deviation term Ψ accordingly. Then we prove the following lemma.

Lemma C.2. At the end of phase p, for any optimal arm i* ∈ X* and any sub-optimal arm i ∈ X̄*, the probability that R̂_{i*}(p) > R̂_i(p) is at most 2md exp(−α₁b²R_i²s(p)) for some constant α₁.

Proof. Consider the dual form

min b^T w, s.t. C^T w ≥ µ, w ≥ 0,

and let ŵ*(p) denote the optimal solution of its empirical counterpart. We have |b^T ŵ*(p) − b^T w*| ≤ O(Ψ), which yields upper and lower bounds on Σ_{j=1}^d (ŵ*(p))_j. (23)–(24)

Consider the R values of the optimal arm i* and the sub-optimal arm i:

R_{i*} = (w*)^T C_{·,i*} − µ_{i*} ≤ 0, R_i = (w*)^T C_{·,i} − µ_i > 0.

From (23) and (24), bounding each Ĉ_{j,i} by C_{j,i} ± 3f_rad(C_{j,i}, s′(p)) and each µ̂_i by µ_i ± 3f_rad(µ_i, s′(p)), we obtain

R̂_{i*}(p) ≤ R_{i*} + (1/b)O(Ψ) + 3Σ_{j=1}^d f_rad(C_{j,i*}, s′(p)) + 3f_rad(µ_{i*}, s′(p)),
R̂_i(p) ≥ R_i − (1/b)O(Ψ) − 3Σ_{j=1}^d f_rad(C_{j,i}, s′(p)) − 3f_rad(µ_i, s′(p)).

According to the strong duality theorem, we have Σ_{j=1}^d w*_j = (1/b) Σ_{i=1}^m µ_i x*_i ≤ 1/b. Then we have

P[R̂_{i*}(p) > R̂_i(p)] ≤ P[R_i < (1 + 1/b)·O(Ψ + f_rad(1, s(p)))].

Notice that f_rad(1, s(p)) = Ω(Ψ). Combining these bounds, we complete the proof.

According to Lemma B.4, for any optimal arm i ∈ X* and any active sub-optimal arm i′ ∈ X̄* ∩ X_p, the probability that Ĝ_i(p) < Ĝ_{i′}(p) is at most 2md exp(−2b²G_i²s(p)/9) during phase p.
Similar to the proof of Theorem 3.2, we bound the probability that the algorithm makes mistakes by ignoring the (1/16)|X_p| arms with the smallest G_i and the (1/16)|X_p| arms with the largest R_i in the active arm set. For all arms in X_p, let P_p and P̂_p denote the sets of optimal arms for (1) and (4), and Q_p and Q̂_p the sets of sub-optimal arms for (1) and (4). Let S′_p denote the (1/16)|X_p| arms with the smallest G_i and S″_p the (1/16)|X_p| arms with the largest R_i, respectively. Define

Φ′_p := max_{i ∈ P_p\S′_p} exp(−2b²G_i²s(p)/9), Φ″_p := max_{i ∈ Q_p\S″_p} exp(−α₁b²R_i²s(p)).

We obtain that the probability that at least one optimal arm is eliminated is at most 32md Φ′_p + 12md Φ″_p. Similarly, the probability that at least one sub-optimal arm is added to X′_p is at most 32md Φ″_p + 12md Φ′_p. Notice that FULLOAK deletes all accepted arms from the surviving arm set at the end of each phase, so the pulls for each arm in X_p before phase p + 1 satisfy s(p) ≥ B/(|X_p| log_{4/3} m), with |X_p| ≤ (3/4)^p m. Then we have

Φ′_p = max_{i ∈ P_p\S′_p} exp(−2b²G_i²s(p)/9) ≤ exp(−2b²∆²_{i_p}s(p)/9) ≤ exp(−4b²B/(9·16·H log_{4/3} m)),

and similarly

Φ″_p = max_{i ∈ Q_p\S″_p} exp(−α₁b²R_i²s(p)) ≤ exp(−α₂b²B/(H log_{4/3} m)),

where α₂ is a constant. Combining them, we complete the proof. The proof of the clean event is similar to Lemma 7.4 in (Badanidiyuru et al., 2013) and the theoretical analysis in Appendix B.3 of (Agrawal & Devanur, 2014) for the UCB algorithm for BwK.

D. Analysis of Lower Bound (Theorem 4.2)

Let (p_w)_{2 ≤ w ≤ W} ∈ [1/4, 1/2) be W − 1 real numbers and let p₁ = 1/2. We define the quantities l_w := 1/2 − p_w. Assume m is an exact multiple of W. Then we define

µ_i := 1/2 − l_w · 2^{−(m−i)/W}, w = (i mod W), i ∈ [m].

Let π_i denote the Bernoulli distribution with mean µ_i and π′_i the Bernoulli distribution with mean 1 − µ_i. Consider W problem instances with time horizon T, m arms, d types of resources being consumed, and knapsack b = W/m for each type of resource.
To ease the reading, assume T is a power of 2, W = Ω(√m), and d > m/W. Let w = (i mod W); for the u-th problem instance, the i-th arm x^u_i is associated with the reward distribution

π^u_i := π_i 1{w ≠ u} + π′_i 1{w = u}, u ∈ [W], i ∈ [m].

The consumption vector c^u_i satisfies (c^u_i)₁ = (c^u_i)_d = (c^u_i)_w = 1 and (c^u_i)_j = 0 for all j ∉ {1, w, d}. Then there are |X*| = m/W = 1/b optimal arms, and their indices satisfy (i mod W) = u. For the hardness measure H(u) of the u-th problem instance, we have

H(u) = max_{i ∈ [m]} i ∆^{−2}_{i,u} b^{−2} ≥ 2^{−(1/b)+1} Σ_{w ≠ u} (l_w + l_u)^{−2},

where ∆_{i,u} is the vertex gap of the i-th arm for the u-th instance. Consider any algorithm A and let (T_k)_{1 ≤ k ≤ |X*|} denote the number of samples taken by A on arms with indices from (k − 1)W + 1 to kW. These quantities are random but satisfy Σ_{1 ≤ k ≤ |X*|} T_k = B. We have

P(O ≠ X*) ≥ Σ_{i ∈ X*} P(i ∉ O) ≥ max_{1 ≤ k ≤ |X*|} exp(−β₁ T_k 2^k / (m log(W) Σ_{w ≠ u} (l_w + l_u)^{−2})) ≥ exp(−2β₂ b² B / (H(u) log m)),

where β₁, β₂ are constants. The second inequality comes from Theorem 2 of (Carpentier & Locatelli, 2016). Then we complete the proof.

E. Analysis of Special Cases

E.1. Simple OAK Problem

In this section, we provide the specification of BASEOAK and prove Theorem 5.1. The algorithm (shown in Algorithm 3) also splits the budget evenly into phases and chooses the worst/best quarter of the surviving arms to reject/accept at the end of each phase. The difference is that Algorithm 3 eliminates the accepted arms from the active arm set at the end of each accept phase. We provide the proof of Theorem 5.1 below. Again, the probability that the algorithm makes mistakes in phase p is at most

32|P̂_p ∪ P_p|² (Φ′_p + Φ″_p). (25)

And the algorithm does not exceed the budget T. The pulls for each arm in X_p before phase p + 1 satisfy s(p) ≥ T/(|X_p| log_{4/3} m).

Algorithm 3 BASEOAK
Input: rounds T, number of arms m
1: X₀ ← [m], X′₀ ← ∅, X″₀ ← ∅
2: for p = 0, . . .
, log_{4/3} m − 1 do
3: Pull each arm i ∈ X_p for ⌊T/(|X_p| log_{4/3} m)⌋ times
4: Compute the empirical estimators of the reduced gap R̂_i and the deletion gap Ĝ_i for each arm i ∈ X_p
5: if there are more basic variables in x̂*(p) then
6:   X′_{p+1} ← X′_p ∪ {the set of |X_p|/4 optimal arms in X_p with the largest Ĝ_i}
7: else
8:   X″_{p+1} ← X″_p ∪ {the set of |X_p|/4 sub-optimal arms in X_p with the largest R̂_i}
9: end if
10: X_{p+1} ← X₀\(X′_{p+1} ∪ X″_{p+1})
11: end for
12: Output X′_{log_{4/3} m}

Since |X_p| ≤ (3/4)^p m, we have

Φ′_p = max_{i ∈ P_p\S′_p} exp(−2b²G_i²s(p)/9) ≤ exp(−2b²∆²_{i_p}s(p)/9) ≤ exp(−4b²T/(9·16·H log_{4/3} m)), (26)

Φ″_p = max_{i ∈ Q_p\S″_p} exp(−2b²R_i²s(p)/9) ≤ exp(−4b²T/(9·16·H log_{4/3} m)). (27)

For all phases p, we have |P̂_p ∪ P_p| ≤ (3/4)^p m. Combining (25), (26), and (27), we complete the proof.

E.2. Pure Exploration Problems

In this section, we prove the results in Examples 5.2 and 5.3. For the BAI problem, there is one optimal arm i*. According to the proof of Lemma B.3, assume the optimal arm i* is not eliminated at the end of phase p. For the optimal arm i* and any active sub-optimal arm i ∈ X̄* ∩ X_p, the probability that R̂_{i*}(p) > R̂_i(p) is bounded as in Lemma B.3. According to the proof of Theorem 3.2, the probability that the algorithm makes mistakes in phase p is at most 32(Φ′_p + Φ″_p). By equations (26) and (27) and a union bound, we complete the proof of the result in Example 5.2.

For the Top K and MB problems, due to the deterministic resource consumption, the probability that the algorithm makes mistakes in phase p is at most 32|P̂_p ∪ P_p|(Φ′_p + Φ″_p). For all phases p, we have |P̂_p ∪ P_p| ≤ (3/4)^p m. By equations (26) and (27) and a union bound, we obtain the result in Example 5.3.
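Algorithm 3's even budget split (line 3 of the listing, used throughout these proofs) never exceeds T; a small sketch of the per-phase pull counts (the quarter-per-phase shrinkage of the active set is taken from the algorithm description; the specific m and T are arbitrary):

```python
import math

def phase_schedule(m, T):
    """Per-phase pull counts for BASEOAK's even budget split (sketch)."""
    n_phases = math.ceil(math.log(m, 4 / 3))     # log_{4/3} m phases
    sizes, pulls, total = [], [], 0
    active = m
    for _ in range(n_phases):
        per_arm = T // (active * n_phases)       # floor(T / (|Xp| * log_{4/3} m))
        sizes.append(active)
        pulls.append(per_arm)
        total += active * per_arm
        active -= max(1, active // 4)            # a quarter accepted/rejected
        if active <= 0:
            break
    return sizes, pulls, total

sizes, pulls, total = phase_schedule(m=16, T=10_000)
assert total <= 10_000          # the schedule respects the budget T
assert pulls == sorted(pulls)   # fewer survivors => more pulls per arm
```

Because each phase spends at most T / log_{4/3} m pulls in total, summing over the log_{4/3} m phases stays within T, which is the budget-feasibility fact the proof of Theorem 5.1 relies on.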