# Active Learning with Safety Constraints

Romain Camilleri, Andrew Wagenmaker, Jamie Morgenstern, Lalit Jain, Kevin Jamieson
University of Washington, Seattle, WA
{camilr,ajwagen,jamiemmt,jamieson}@cs.washington.edu, lalitj@uw.edu

Abstract. Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe versus suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real-world datasets.

1 Introduction

In many problems in online decision-making, the goal of the learner is to take measurements in such a way as to learn a near-optimal policy. Oftentimes, though the space of policies may be large, the set of feasible, or safe, policies could be much smaller, effectively constraining the search space of the learner. Furthermore, these constraints may themselves depend on unknown problem parameters. For example, consider the problem of bidding sequentially in a series of auctions where the bidder bids a price $w_t$, the value of winning item $t$ is denoted $v_t$, and the utility of winning that item and paying price $p_t$ is $v_t - p_t$. The goal of the bidder is to choose an optimal strategy amongst bidding strategies $s \in \mathcal{S}$, $s : \mathbb{R} \to \mathbb{R}$. When a bidder is deciding how to choose these strategies, they often face constraints: they may have a budget $B$ they must abide by; they may wish for the auctions they win to be well-distributed across time (e.g., in the case of advertising campaigns); they may want to ensure the set of items they win satisfies some other property (e.g., for advertisements, they might want to ensure they are not over-targeting any demographic group). As another example, inventory management systems may face similar issues of deciding amongst strategies, where there is some objective function (such as revenue) and a variety of constraints at play in this choice (e.g., capacity of a set of warehouses, employee scheduling constraints, or limits on the duration of delivery lag). They also operate in markets with changing demand and other uncertainties, leading to uncertainty about which strategies are feasible or safe (satisfy constraints) and about the revenue they generate. Both of these scenarios motivate understanding the sample complexity of selecting an action or strategy which approximately maximizes an objective while also satisfying some constraints, where samples are needed both to learn the objective value of actions and to learn whether or not they satisfy said constraints. In this work, we study the active sample complexity of this task: if the learner can choose which examples to observe and have labeled, how many fewer samples might they need compared to the number needed in a passive setting?
We pose this as a best-arm identification problem in the setting of linear bandits with safety constraints, where the goal is to estimate the best arm, subject to it meeting certain (initially unknown) safety constraints. We propose an experiment design-based algorithm which efficiently learns the best safe decision, and show the efficacy of this approach in practice through several experimental examples. To the best of our knowledge, ours is the first approach to handle best-arm identification in linear bandits with safety constraints.

1.1 Linear Bandits with Safety Constraints

Let $\delta \in (0, 1)$ be a confidence parameter, let $\mathcal{X}, \mathcal{Z} \subseteq \mathbb{R}^d$ be finite known sets of vectors, and assume there exist $\theta_* \in \mathbb{R}^d$ and $\mu_* \in \mathbb{R}^{m \times d}$ unknown to the learner. For simplicity, we assume that $\|\theta_*\|_2 \le 1$, $\|\mu_{*,i}\|_2 \le 1$ for all $i \in [m]$, and $\|x\|_2 \le 1$, $\|z\|_2 \le 1$ for all $x \in \mathcal{X}$, $z \in \mathcal{Z}$. The learner plays according to the following protocol: at each time step $t$ the learner chooses some action $x_t \in \mathcal{X}$ and observes $(r_t, \{s_{t,i}\}_{i=1}^m)$, where $r_t = x_t^\top \theta_* + w_t$ and $s_{t,i} = x_t^\top \mu_{*,i} + w_{t,i}$ for all $i \in [m]$, and $w_t, w_{t,i}$ are i.i.d. mean-zero 1-sub-Gaussian noise. The choice of action $x_t$ is measurable with respect to the history $\mathcal{F}_{t-1} = \{(x_j, r_j, \{s_{j,i}\}_{i=1}^m)\}_{j=1}^{t-1}$. The learner stops at a stopping time $\tau_\delta$ which is measurable with respect to the filtration generated by $\mathcal{F}_t$, and returns $\widehat{z} \in \mathcal{Z}$. In general, when referring to any expectation $\mathbb{E}$ or probability $\mathbb{P}$, the underlying measure will be with respect to the actions, observed rewards, and internal randomness of the algorithm. We are interested in the safe transductive best-arm identification problem (STBAI), where the goal of the learner is to identify
$$z_* := \arg\max_{z \in \mathcal{Z}} z^\top \theta_* \quad \text{s.t.} \quad z^\top \mu_{*,i} \le \gamma, \ \forall i \in [m]$$
for some (known) threshold $\gamma$. In words, our goal is to identify the best safe arm in $\mathcal{Z}$, $z_*$, where we say an arm $z$ is safe if it satisfies every linear constraint: $z^\top \mu_{*,i} \le \gamma$ for all $i \in [m]$. We are interested in obtaining learners that take the fewest number of samples possible to accomplish this. In practice, we will consider a slightly easier objective. Fix some tolerance $\epsilon > 0$ and let $\mathcal{Z}_\epsilon := \{z \in \mathcal{Z} : z^\top \theta_* \ge z_*^\top \theta_* - \epsilon, \ z^\top \mu_{*,i} \le \gamma + \epsilon, \ \forall i \in [m]\}$. Then our goal is to obtain an $(\epsilon, \delta)$-PAC learner, defined as follows:

Definition 1 ($(\epsilon, \delta)$-PAC Learner). A learner is $(\epsilon, \delta)$-PAC if for any instance it returns $\widehat{z}$ such that $\mathbb{P}[\widehat{z} \in \mathcal{Z}_\epsilon] \ge 1 - \delta$.

We define the optimality gap for any $z \in \mathcal{Z}$ as $\Delta(z) := \theta_*^\top (z_* - z)$, and the safety gap for constraint $i$ as $\Delta^i_{\mathrm{safe}}(z) := \gamma - \mu_{*,i}^\top z$. Note that either $\Delta(z)$ or $\Delta^i_{\mathrm{safe}}(z)$ can be negative. If $\Delta(z) < 0$, then $z$ has larger value $z^\top \theta_*$ than the best safe arm $z_*$, which implies it must be unsafe. If $\Delta^i_{\mathrm{safe}}(z) < 0$ for some $i$, then arm $z$ is unsafe. We also define the $\epsilon$-safe optimality gap as
$$\Delta^\epsilon(z) := \max_{z' \in \mathcal{Z}} (z' - z)^\top \theta_* \quad \text{s.t.} \quad \min_{i \in [m]} \Delta^i_{\mathrm{safe}}(z') \ge \epsilon. \qquad (1)$$
$\Delta^\epsilon(z)$ is then the gap in value between arm $z$ and the best arm with minimum safety gap at least $\epsilon$.

Mathematical Notation. Let $\|x\|_A^2 = x^\top A x$ and $p(x) := \max\{x, 0\}$. $\widetilde{O}(\cdot)$ hides factors that are logarithmic in the arguments, and $\lesssim$ denotes inequality up to constants. We denote the simplex over $\mathcal{X}$ as $\triangle_{\mathcal{X}} := \{\lambda \in \mathbb{R}^{|\mathcal{X}|}_{\ge 0} : \sum_{x \in \mathcal{X}} \lambda_x = 1\}$, and for $\lambda \in \triangle_{\mathcal{X}}$ we write $A(\lambda) := \sum_{x \in \mathcal{X}} \lambda_x x x^\top$.

2 Safe Best-Arm Identification in Linear Bandits

2.1 Algorithm Definition

The main challenge in algorithm design for the safe best-arm identification problem is to efficiently balance exploration between refining our estimates of the safety gaps and refining our estimates of the optimality gaps. Our approach is given in Algorithm 1, BESIDE. BESIDE relies on a round-based adaptive experimental design approach; each round of BESIDE consists of three phases.
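To make the gap definitions above concrete, the following is a minimal sketch (our own illustration, not code from the paper) that computes $\Delta(z)$, $\Delta^i_{\mathrm{safe}}(z)$, and $\Delta^\epsilon(z)$ by brute force for a finite arm set; the array layout and function name are assumptions made here for readability.

```python
import numpy as np

def gaps(Z, theta, mu, gamma, eps):
    """Brute-force optimality, safety, and eps-safe optimality gaps for a finite arm set.

    Z:     (n, d) array, one arm per row
    theta: (d,)   reward parameter theta_*
    mu:    (m, d) safety parameters mu_{*,i}
    gamma: safety threshold; arm z is safe iff mu_{*,i}^T z <= gamma for every i
    eps:   margin used in the eps-safe optimality gap Delta^eps(z)
    """
    values = Z @ theta                      # z^T theta_* for every arm
    safe_gap = gamma - Z @ mu.T             # (n, m) matrix of Delta_safe^i(z)
    min_safe_gap = safe_gap.min(axis=1)

    safe = min_safe_gap >= 0                # arms satisfying every constraint
    opt_gap = values[safe].max() - values   # Delta(z); negative only for unsafe arms

    # Delta^eps(z): gap to the best arm whose minimum safety gap is at least eps
    margin = min_safe_gap >= eps
    eps_gap = (values[margin].max() - values) if margin.any() else np.full(len(Z), -np.inf)
    return opt_gap, safe_gap, eps_gap
```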
In the first phase, it solves an experimental design over $\lambda \in \triangle_{\mathcal{X}}$ with the goal of refining our estimates of the safety gaps, and then takes samples from $\lambda_\ell$. In the second phase, these samples are used to estimate the safety constraints, $\widehat{\mu}_{i,\ell}$, and the safety gaps of each arm, $\widehat{\Delta}^{i,\ell}_{\mathrm{safe}}(z)$. Finally, in Phase 3, an additional experimental design is solved which now aims to refine our estimates of the optimality gaps, and the estimates of the optimality gaps $\widehat{\Delta}_\ell(z)$ for each $z \in \mathcal{Z}$ are then computed. We encapsulate Phase 3 in a subroutine, RAGE$_\Delta$, which we outline in the following. We now carefully describe each phase; we begin with Phase 2 to explain how our estimator works.

Algorithm 1 Best Safe Arm Identification (BESIDE)
1: input: tolerance $\epsilon$, confidence $\delta$
2: $\ell_{\max} \leftarrow \lceil \log_2(20/\epsilon) \rceil$; $\widehat{\Delta}^{i,0}_{\mathrm{safe}}(z) \leftarrow 0$, $\widehat{\Delta}_0(z) \leftarrow 0$ for all $z \in \mathcal{Z}$
3: for $\ell = 1, 2, \ldots, \ell_{\max}$ do
4:     $\epsilon_\ell \leftarrow 20 \cdot 2^{-\ell}$
       // Phase 1: Solve design to reduce uncertainty in safety constraints
5:     Define $c_\ell(z) = \min_j |\widehat{\Delta}^{j,\ell-1}_{\mathrm{safe}}(z)| + \max_j p(-\widehat{\Delta}^{j,\ell-1}_{\mathrm{safe}}(z)) + p(\widehat{\Delta}_{\ell-1}(z))$
6:     Let $\tau_\ell$ be the minimal value of $\tau \in \mathbb{R}_+$ greater than $4 \log(4m|\mathcal{Z}|\ell^2/\delta)$ such that the objective of the following is no greater than $\epsilon_\ell/100$, and $\lambda_\ell$ the corresponding optimal distribution:
       $$\inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{z \in \mathcal{Z}} \left[ -\tfrac{1}{100}\big(c_\ell(z) + \epsilon_\ell\big) + \sqrt{\frac{\|z\|^2_{A(\lambda)^{-1}} \log(4m|\mathcal{Z}|\ell^2/\delta)}{\tau}} \right]$$
7:     Sample $x_t \sim \lambda_\ell$ and collect observations $\{(x_t, r_t, s_{t,1}, \ldots, s_{t,m})\}_{t=1}^{\tau_\ell}$
       // Phase 2: Estimate safety constraints
8:     $\{\widehat{\mu}_{i,\ell}\}_{i=1}^m \leftarrow \mathrm{RIPS}\big(\{(x_t, s_{t,i})\}_{t=1}^{\tau_\ell}, \mathcal{Z}, \tfrac{\delta}{2m\ell^2}\big)$
9:     $\widehat{\Delta}^{i,\ell}_{\mathrm{safe}}(z) \leftarrow \gamma - z^\top \widehat{\mu}_{i,\ell} + \sqrt{\|z\|^2_{A(\lambda_\ell)^{-1}} \log(4m|\mathcal{Z}|\ell^2/\delta)/\tau_\ell}$
       // Phase 3: Refine estimates of optimality gaps
10:    $\{\widehat{\Delta}_\ell(z)\}_{z \in \mathcal{Z}} \leftarrow \mathrm{RAGE}_\Delta\big(\mathcal{Z},\ \mathcal{Y}_\ell,\ \epsilon_\ell,\ \tfrac{\delta}{4\ell^2},\ \{\widehat{\Delta}^{j,\ell}_{\mathrm{safe}}(z)\}_{z \in \mathcal{Z}, j \in [m]}\big)$
// Perform final round of exploration to ensure we find an $\epsilon$-good arm
11: $\mathcal{Y}_{\mathrm{end}} \leftarrow \{z \in \mathcal{Z} : c_\ell(z) \lesssim \widehat{\Delta}^{i,\ell}_{\mathrm{safe}}(z) + \epsilon, \ \forall i \in [m]\}$
12: $\{\widehat{\Delta}_{\mathrm{end}}(z)\}_{z \in \mathcal{Y}_{\mathrm{end}}} \leftarrow \mathrm{RAGE}_\Delta\big(\mathcal{Y}_{\mathrm{end}},\ \mathcal{Y}_{\mathrm{end}},\ \epsilon,\ \delta,\ \{\widehat{\Delta}^{j,\ell}_{\mathrm{safe}}(z)\}_{z \in \mathcal{Z}, j \in [m]}\big)$
13: return $\widehat{z} = \arg\min_{z \in \mathcal{Y}_{\mathrm{end}}} \widehat{\Delta}_{\mathrm{end}}(z)$

Phase 2: In Phase 2, the algorithm uses the samples drawn from the design $\lambda_\ell$ to estimate the constraint values $z^\top \mu_{*,i}$ for each $z \in \mathcal{Z}$ and $i \in [m]$. Past works using adaptive experimental design in the linear bandits literature have utilized the least-squares estimator along with complicated rounding schemes [13], which may require an additional $\mathrm{poly}(d)$ samples each round (this $\mathrm{poly}(d)$ factor could be prohibitively large; for example, in active classification problems, $d$ is the total number of data points). We instead utilize the RIPS estimator of [6], which gives us a guarantee of the form: with probability greater than $1 - \delta$, for all $z \in \mathcal{Z}$,
$$|z^\top(\widehat{\mu}_{i,\ell} - \mu_{*,i})| \lesssim \sqrt{\frac{\|z\|^2_{A(\lambda_\ell)^{-1}} \log(4m|\mathcal{Z}|\ell^2/\delta)}{\tau_\ell}}. \qquad (2)$$
We describe the RIPS estimator in more detail in Appendix B.

Phase 1: By our definition of the experimental design on Line 6, the safety gap estimation error bound in (2) satisfies, for each $z \in \mathcal{Z}$:
$$|z^\top(\widehat{\mu}_{i,\ell} - \mu_{*,i})| \lesssim \sqrt{\frac{\|z\|^2_{A(\lambda_\ell)^{-1}} \log(4m|\mathcal{Z}|\ell^2/\delta)}{\tau_\ell}} \lesssim c_\ell(z) + \epsilon_\ell. \qquad (3)$$
Note that our design chooses an allocation that minimizes the variance of our estimate of each safety constraint (up to some tolerance), which scales as $\|z\|^2_{A(\lambda)^{-1}}$. This can be thought of as a form of $\mathcal{XY}$-design, a design of the form $\inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{y \in \mathcal{Y}} \|y\|^2_{A(\lambda)^{-1}}$, where here $\mathcal{Y} = \mathcal{Z}$ is chosen to reduce our uncertainty in estimating the safety value of each $z \in \mathcal{Z}$. We refer to such a design objective henceforth as $\mathcal{XY}_{\mathrm{safe}}$. Assume that at round $\ell - 1$ we can guarantee
$$c_\ell(z) = \min_j |\widehat{\Delta}^{j,\ell-1}_{\mathrm{safe}}(z)| + \max_j p(-\widehat{\Delta}^{j,\ell-1}_{\mathrm{safe}}(z)) + p(\widehat{\Delta}_{\ell-1}(z)) \asymp \min_j |\Delta^j_{\mathrm{safe}}(z)| + \max_j p(-\Delta^j_{\mathrm{safe}}(z)) + p(\Delta^{\epsilon_{\ell-1}}(z)) + \epsilon_\ell. \qquad (4)$$
Then, combining the above inequalities, we see that the experiment design on Line 6 aims to minimize the uncertainty in our estimate of $z^\top \mu_{*,i}$ up to a tolerance that scales as the maximum of the four terms in (4).
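As a rough illustration of Phase 2, the sketch below substitutes an ordinary least-squares fit for the RIPS estimator of [6] (whose robust construction is deferred to Appendix B) and forms the inflated safety-gap estimates of Line 9 of Algorithm 1; the helper name and the exact confidence-width constant are assumptions of this sketch.

```python
import numpy as np

def phase2_safety_gaps(X_samples, S_samples, Z, lam, X_arms, gamma, delta, m, ell):
    """Least-squares stand-in for Phase 2 of BESIDE (RIPS replaced for brevity).

    X_samples: (tau, d) actions x_t drawn from lambda_ell
    S_samples: (tau, m) safety observations s_{t,i}
    Z:         (n, d) arms whose safety gaps we estimate
    lam:       (|X|,) design weights lambda_ell over the rows of X_arms
    """
    tau = X_samples.shape[0]
    A_lam = (X_arms * lam[:, None]).T @ X_arms            # A(lambda) = sum_x lambda_x x x^T
    A_inv = np.linalg.pinv(A_lam)
    mu_hat, *_ = np.linalg.lstsq(X_samples, S_samples, rcond=None)   # (d, m) estimates of mu_{*,i}
    # width ~ sqrt(||z||^2_{A(lambda)^{-1}} * log(4 m |Z| ell^2 / delta) / tau), as in (2)
    log_term = np.log(4 * m * len(Z) * ell**2 / delta)
    width = np.sqrt(np.einsum('nd,de,ne->n', Z, A_inv, Z) * log_term / tau)
    # optimistic safety-gap estimates, mirroring Line 9 of Algorithm 1
    safe_gap_hat = gamma - Z @ mu_hat + width[:, None]
    return mu_hat, safe_gap_hat
```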
It follows that if any of these terms is large, we will only allocate a small number of samples to refining our estimate of arm $z$. Each one of these terms can be intuitively motivated by thinking through what is needed to prove that an arm $z \ne z_*$.

- $z$ has small safety gap, $\min_j |\Delta^j_{\mathrm{safe}}(z)|$: if this term is large, the minimum safety gap for $z$ is large. To show an arm is safe or unsafe, it suffices to learn each safety gap up to a tolerance within a constant factor of its value; regularizing by this term ensures we do just that.
- $z$ fails some safety constraint, $\max_j p(-\Delta^j_{\mathrm{safe}}(z))$: if this term is large, arm $z$ is very unsafe for some constraint. In this case, we can easily determine $z$ is unsafe, and therefore do not need to reduce our uncertainty in the safety gap any further.
- $z$ is suboptimal, $p(\Delta^{\epsilon_{\ell-1}}(z))$: if this term is large, $z$ is very suboptimal compared to some safe arm with safety gap at least $\epsilon_{\ell-1}$. In this case, we do not need to estimate $z$'s safety gap, as we will have already eliminated it.

It remains to ensure that (4) holds. As we show in Appendix D through a careful inductive argument, combining (3) with our guarantee on the estimates of the optimality gaps obtained in Phase 3, $\widehat{\Delta}_\ell(z)$, is sufficient to guarantee that (4) holds. In particular, if any gap is greater than $\epsilon_\ell$, it is estimated up to a constant factor, and otherwise it is estimated up to $O(\epsilon_\ell)$. This ensures that our gaps are estimated at the correct rate while guaranteeing we do not collect too many samples in each round.

Phase 3: In this phase we estimate the suboptimality gaps using RAGE$_\Delta$. RAGE$_\Delta$ is inspired by the RAGE algorithm of [13] for best-arm identification. In the interest of space, we defer the full definition of RAGE$_\Delta$ to Appendix C but provide some intuition here. After Phase 2, by (3), the arms in the set $\mathcal{Y}_\ell := \{z \in \mathcal{Z} : c_\ell(z) \lesssim \widehat{\Delta}^{i,\ell}_{\mathrm{safe}}(z), \ \forall i \in [m]\}$ are precisely the ones that we can certify are safe (note that we never need to explicitly construct such a set; we can instead maintain an implicit definition through the constraints). RAGE$_\Delta$ uses an adaptive experimental design procedure to sample in such a way as to optimally estimate the gaps $(z - \widehat{y})^\top \theta_*$, for all $z \in \mathcal{Z}$ and some $\widehat{y} \in \mathcal{Y}_\ell$, up to some (sufficient) tolerance. In particular, it also solves an $\mathcal{XY}$-design, but now on the set $\mathcal{Y} = \{z - \widehat{y} : z \in \mathcal{Z}\}$. Thus, rather than minimizing $\|z\|^2_{A(\lambda)^{-1}}$, we minimize $\|z - \widehat{y}\|^2_{A(\lambda)^{-1}}$. This design reduces uncertainty on the differences between arms, which allows us to refine our estimates of their optimality gaps. Henceforth we refer to such a design as $\mathcal{XY}_{\mathrm{diff}}$. We describe the importance of the choice of design in more detail in Section 2.4. Ultimately, if an arm $z$ has value within $\epsilon_\ell$ of the best safe arm in $\mathcal{Y}_\ell$, and if we have not yet shown arm $z$ is unsafe, then we will estimate its optimality gap up to a constant factor of $\epsilon_\ell$. If we were maintaining arm sets explicitly (similar to the original RAGE algorithm of [13]), we would eliminate arms at this point.

Remark 1 (Computational Complexity). The main computational challenge in BESIDE and RAGE$_\Delta$ is the calculation of the experimental designs (i.e., Line 6 and the corresponding design in RAGE$_\Delta$). In general, the presence of the square root implies that the resulting optimization problem may not be convex in $\lambda$. To handle this issue we note that $2\sqrt{xy} = \min_{\eta > 0}\, \eta x + y/\eta$; thus we can replace the existing design with
$$\inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{z \in \mathcal{Z}} \min_{\eta > 0} \left[ -\tfrac{1}{100}\big(c_\ell(z) + \epsilon_\ell\big) + \eta \|z\|^2_{A(\lambda)^{-1}} + \log(4m|\mathcal{Z}|\ell^2/\delta)/(\eta \tau) \right].$$
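The reparameterization in Remark 1 can be sketched as follows: for fixed $\eta$ and $\tau$ the objective is convex in $\lambda$, and a Frank-Wolfe step only requires the gradient of $\|z\|^2_{A(\lambda)^{-1}}$ with respect to $\lambda_x$, which equals $-(x^\top A(\lambda)^{-1} z)^2$. This is our own simplified illustration (fixed $\eta$ and $\tau$, a basic step size, no discretization or rounding), not the full procedure used by BESIDE.

```python
import numpy as np

def frank_wolfe_design(X_arms, Z, c, eps_l, tau, eta, log_term, iters=200):
    """Frank-Wolfe on the convexified design of Remark 1 for fixed eta and tau.

    X_arms: (|X|, d) actions; Z: (n, d) arms; c: (n,) array of c_ell(z) values.
    Minimizes over lambda in the simplex:
        max_z [ -(c(z) + eps_l)/100 + eta * ||z||^2_{A(lambda)^{-1}} + log_term / (eta * tau) ]
    """
    n_x, d = X_arms.shape
    lam = np.full(n_x, 1.0 / n_x)                        # start from the uniform design
    for t in range(iters):
        A = (X_arms * lam[:, None]).T @ X_arms + 1e-9 * np.eye(d)
        A_inv = np.linalg.inv(A)
        quad = np.einsum('nd,de,ne->n', Z, A_inv, Z)     # ||z||^2_{A(lambda)^{-1}} per arm
        obj = -(c + eps_l) / 100 + eta * quad + log_term / (eta * tau)
        z_star = Z[np.argmax(obj)]                       # arm attaining the max
        # d/d lambda_x of ||z_star||^2_{A(lambda)^{-1}} is -(x^T A(lambda)^{-1} z_star)^2
        grad = -eta * (X_arms @ A_inv @ z_star) ** 2
        vertex = np.zeros(n_x)
        vertex[np.argmin(grad)] = 1.0                    # linear minimization oracle: a simplex vertex
        lam += 2.0 / (t + 2) * (vertex - lam)            # standard Frank-Wolfe step size
    return lam
```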
By appropriately discretizing the space we search over for $\eta$ and $\tau$, we can then apply the Frank-Wolfe algorithm to minimize over $\lambda$. While computationally efficient in theory, this procedure is quite complicated and impractical for large problems. In the experiments section we provide a practical heuristic that is motivated by the above algorithm and is computationally efficient for larger problems.

2.2 Main Result

BESIDE achieves the following complexity.

Theorem 1. BESIDE is $(\epsilon, \delta)$-PAC. In other words, with probability at least $1 - \delta$, BESIDE returns an arm $\widehat{z} \in \mathcal{Z}$ such that $\Delta(\widehat{z}) \le \epsilon$ and $\min_{i \in [m]} \Delta^i_{\mathrm{safe}}(\widehat{z}) \ge -\epsilon$, and terminates after collecting at most
$$C \cdot \inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{z \in \mathcal{Z}} \frac{\|z\|^2_{A(\lambda)^{-1}} \log(\tfrac{m|\mathcal{Z}|}{\delta})}{\big(\min_j |\Delta^j_{\mathrm{safe}}(z)| + \max_j p(-\Delta^j_{\mathrm{safe}}(z)) + p(\Delta^{\tilde\epsilon}(z)) + \tilde\epsilon\big)^2} \quad \text{(safety)}$$
$$+\; C \cdot \inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{z \in \mathcal{Z}} \frac{\|z - z_*\|^2_{A(\lambda)^{-1}} \log(\tfrac{|\mathcal{Z}|}{\delta})}{\big(\max_j p(-\Delta^j_{\mathrm{safe}}(z)) + p(\Delta^{\tilde\epsilon}(z)) + \tilde\epsilon\big)^2} + C' \quad \text{(optimality)}$$
samples, for some $C = \mathrm{poly}\log(\tfrac{1}{\epsilon})$, $C' = \mathrm{poly}\log(\tfrac{1}{\epsilon}, |\mathcal{Z}|) \log\tfrac{1}{\delta}$, and a safety tolerance $\tilde\epsilon$ scaling with $\epsilon$.

The complexity bound given in Theorem 1 may, at first glance, appear rather opaque, yet it in fact yields a very intuitive interpretation. The first term in the complexity, the safety term, is the complexity needed to show each arm is safe or unsafe, if it has not otherwise been eliminated. As described in the previous section, if $p(\Delta^{\tilde\epsilon}(z))$ is large, this implies we have found an arm better than $z$, so learning its safety value is irrelevant. The second term in the complexity, the optimality term, corresponds to the difficulty of showing an arm is worse than the best arm we can guarantee is safe. Note that we can only guarantee an arm is suboptimal if we can find a safe arm with higher value. Recall the definition of $\Delta^{\tilde\epsilon}(z)$ given in (1). Intuitively, $\Delta^{\tilde\epsilon}(z)$ denotes the gap in value between arm $z$ and the best arm with safety gap at least $\tilde\epsilon$. As we make $\tilde\epsilon$ smaller, we can show additional arms are safe, which increases $\Delta^{\tilde\epsilon}(z)$. While this makes it easier to show $z$ is suboptimal, it comes at a cost: the extra samples necessary to decrease our safety tolerance, given by the first term in the complexity. BESIDE trades off between optimizing for each of these terms, gradually decreasing its tolerance on both the safety and optimality terms so as to more easily eliminate suboptimal arms, while not allocating too many samples to guaranteeing safety. To help illustrate this complexity, we consider a simple example with orthogonal arms, i.e., a multi-armed bandit example.

Example 1 (BESIDE on Multi-Armed Bandits). In the multi-armed bandit setting, we have $\mathcal{X} = \mathcal{Z} = \{e_1, \ldots, e_d\}$. Let $m = 1$, $d = 3$, and consider the settings of $\theta_*$ and $\mu_*$ given in Figure 1. Here we see that arm $e_1$ is safe and has value much higher than any other arm, so $z_* = e_1$, and it can be shown to be safe relatively easily; arm $e_2$ has near-optimal value but is very unsafe; and arm $e_3$ is unsafe with very small safety gap, but has the smallest value.

Figure 1: Multi-armed bandit instance (bar plots of the values $[\theta_*]_i$ and safety values $[\mu_*]_i$ for the three arms).

Showing $e_2$ is Suboptimal. As $e_2$ has near-optimal value, $\Delta(e_2)$ is very small and it is very difficult to show $e_2$ is suboptimal. However, $e_2$ violates the safety constraint by a constant margin, so it is very easy to show $e_2$ is unsafe. It follows that $p(-\Delta_{\mathrm{safe}}(e_2)) = \Omega(1)$, so both denominators in our complexity will always be $\Omega(1)$ for $z = e_2$: BESIDE does not attempt to show $e_2$ is suboptimal, but instead shows it is unsafe, and therefore does not pay for the small optimality gap $\Delta(e_2)$ in the complexity.

Showing $e_3$ is Suboptimal. Recall the definition $\Delta^{\epsilon}(z) = \max_{z' : \Delta_{\mathrm{safe}}(z') \ge \epsilon} \theta_*^\top (z' - z)$. In this case, for $\epsilon$ a sufficiently small constant, we will have $\Delta_{\mathrm{safe}}(e_1) \ge \epsilon$, which implies that $\Delta^{\epsilon}(e_3) = \theta_*^\top (e_1 - e_3) = \Delta(e_3) = \Omega(1)$.
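To make Example 1 concrete, here is a small numeric instantiation consistent with the description; the specific values of $\theta_*$ and $\mu_*$ are hypothetical choices of ours, since Figure 1 is not reproduced here.

```python
import numpy as np

theta = np.array([1.00, 0.95, 0.10])   # values: e1 best, e2 nearly as good, e3 worst
mu    = np.array([0.20, 1.50, 1.01])   # safety values; arm e_i is safe iff mu_i <= gamma
gamma = 1.0

safe_gap = gamma - mu                  # Delta_safe(e_i) = gamma - mu_i -> [0.80, -0.50, -0.01]
opt_gap  = theta[0] - theta            # Delta(e_i), since z_* = e1     -> [0.00,  0.05,  0.90]

# e2: Delta(e2) = 0.05 is tiny, but p(-Delta_safe(e2)) = 0.5 is a constant,
#     so BESIDE shows e2 is unsafe rather than suboptimal.
# e3: |Delta_safe(e3)| = 0.01 is tiny (hard to certify unsafe), but Delta(e3) = 0.9
#     is a constant, so BESIDE shows e3 is suboptimal instead.
print(dict(safe_gap=safe_gap, opt_gap=opt_gap))
```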
To show $e_3 \ne z_*$, we could either show it is unsafe (which is very difficult, since its safety gap is very small) or show it is suboptimal (which is very easy). Observing the sample complexity of Theorem 1, we see that the denominators of both terms will always be $\Omega(1)$ for $z = e_3$, since $\Delta^{\tilde\epsilon}(e_3) = \Omega(1)$: BESIDE never pays for the small safety gap of $e_3$; it instead takes advantage of the fact that $e_3$ can easily be shown to be suboptimal, and uses this to eliminate it. In both of these cases we see that BESIDE does the right thing, always using the easier of the two criteria, showing an arm is unsafe or showing it is suboptimal, to establish that $z \ne z_*$. Combining the above observations, for $\tilde\epsilon \le \min\{\Delta(e_3), |\Delta_{\mathrm{safe}}(e_2)|, \Delta_{\mathrm{safe}}(e_1)\}$, it follows that on this example the total sample complexity of BESIDE given by Theorem 1 scales as
$$\frac{1}{\Delta_{\mathrm{safe}}(e_1)^2} + \frac{1}{\Delta_{\mathrm{safe}}(e_2)^2} + \frac{1}{\Delta(e_3)^2},$$
where the $1/\Delta_{\mathrm{safe}}(e_1)^2$ term arises because we must also show $e_1$ is safe.

2.3 Optimality of BESIDE

Optimality in Best-Arm Identification. Consider applying BESIDE to a problem instance where $m = 1$, $\mu_{*,1} = 0$, and $\gamma = 1$. In this case, every arm is safe, and the safety constraints are essentially vacuous: every arm can easily be shown safe. We can therefore think of this as simply an instance of the best-arm identification problem. In this setting, we obtain the following corollary.

Corollary 1. Consider running BESIDE on a problem instance where $m = 1$, $\mu_{*,1} = 0$, and $\gamma = 1$, and set $\epsilon = \frac{1}{2}\min_{z \ne z_*} \theta_*^\top(z_* - z)$. Then with probability at least $1 - \delta$, BESIDE returns $z_*$ and has sample complexity bounded by
$$\inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{z \ne z_*} \frac{\|z - z_*\|^2_{A(\lambda)^{-1}}}{\Delta(z)^2} \log\frac{|\mathcal{Z}|}{\delta} \;+\; \inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{z \in \mathcal{Z}} \|z\|^2_{A(\lambda)^{-1}} \log\frac{|\mathcal{Z}|}{\delta},$$
up to $\mathrm{poly}\log(\tfrac{1}{\epsilon})$ factors.

Up to lower-order terms, this exactly matches the lower bound on best-arm identification given in [13]. Thus, in settings where the safety constraint is vacuous, BESIDE hits the optimal rate.

Worst-Case Performance of BESIDE. We next consider the worst-case performance of BESIDE in settings where $\mathcal{X} = \mathcal{Z}$. We have the following result.

Corollary 2. Assume that $\mathcal{X} = \mathcal{Z}$. Then for any $\theta_*$ and $(\mu_{*,i})_{i=1}^m$, the sample complexity of BESIDE necessary to return an $\epsilon$-good and $\epsilon$-safe arm is bounded as $\widetilde{O}\big(\tfrac{d}{\epsilon^2}(\log(m|\mathcal{X}|) + \log\tfrac{1}{\delta})\big)$.

Theorem 2 of [38] shows a worst-case lower bound of $\Omega(d^2/\epsilon^2)$ on the sample complexity of identifying an $\epsilon$-optimal arm in the standard linear bandit setting. Safe best-arm identification problems in which the safety constraint is vacuous are at least as hard as the standard best-arm identification problem, since at a minimum we need to find the best arm among all safe arms. Thus, $\Omega(d^2/\epsilon^2)$ is also a worst-case lower bound for the safe best-arm identification problem. The hard instance of [38] has $|\mathcal{X}| = O(2^d)$, so it follows that on this instance BESIDE achieves a complexity of $\widetilde{O}\big(\tfrac{d}{\epsilon^2}(d + \log\tfrac{1}{\delta})\big)$, and therefore BESIDE has optimal dimension dependence. In addition, this also implies that safe best-arm identification is, in the worst case, no harder than the standard best-arm identification problem: it is no harder to find the best safe arm, regardless of the number of safety constraints, than to find the best arm while ignoring safety constraints.

2.4 The Role of Experiment Design

We can think of the safe best-arm identification problem, in some sense, as an interpolation between the standard best-arm identification problem and the level-set estimation problem, where the goal is to identify the $z \in \mathcal{Z}$ satisfying $z^\top \mu_* \le \gamma$ [29]. In the former problem, [13] shows that the instance-optimal rate can be attained by running a round-based algorithm which at every round solves an instance of the $\mathcal{XY}_{\mathrm{diff}}$ experiment design, as defined in Section 2.1.
In the latter problem, [29] also show that a round-based algorithm can hit the instance-optimal rate, but by instead solving the $\mathcal{XY}_{\mathrm{safe}}$ problem at each round. It is natural to ask whether either of these strategies could be applied to the safe best-arm identification problem directly, or if it is necessary to alternate between them. The following result shows that, on their own, each of these designs is unable to hit the optimal rate.

Proposition 2. Fix some small enough $\epsilon > 0$. Then there exist instances of the safe best-arm identification problem, $\mathcal{I}_i = (\theta^i_*, \mu^i_*, \mathcal{X}^i, \mathcal{Z}^i)$, $i = 1, 2$, with $d = |\mathcal{X}^i| = |\mathcal{Z}^i| = 2$ and $m = 1$, such that:

- On $\mathcal{I}_1$, any $(\epsilon, \delta)$-PAC algorithm which plays only allocations minimizing $\mathcal{XY}_{\mathrm{diff}}$ must have a sample complexity polynomially larger in $1/\epsilon$, while BESIDE identifies an $\epsilon$-optimal arm after $\widetilde{O}(\tfrac{1}{\epsilon^2}\log\tfrac{1}{\delta})$ samples.
- On $\mathcal{I}_2$, any $(\epsilon, \delta)$-PAC algorithm which plays only allocations minimizing $\mathcal{XY}_{\mathrm{safe}}$ must have a sample complexity of $\Omega(\tfrac{1}{\epsilon^{3/2}}\log\tfrac{1}{\delta})$, while BESIDE identifies an $\epsilon$-optimal arm after $\widetilde{O}(\tfrac{1}{\epsilon}\log\tfrac{1}{\delta})$ samples.

Proposition 2 implies that, to solve the safe best-arm identification problem optimally, more care must be taken in exploring than either standard experiment design induces: we must trade off between $\mathcal{XY}_{\mathrm{diff}}$ and $\mathcal{XY}_{\mathrm{safe}}$, as BESIDE does. We remark briefly on the instance $\mathcal{I}_1$. On this instance we have $\mathcal{X} = \{e_1, e_2\}$ and $\mathcal{Z} = \{z_1, z_2\}$ with $z_1 = [1/4, 1/2]$ and $z_2 = [3/4, 1/2 + \epsilon]$. We set $\theta_* = [1, 0]$, $\mu_* = [0, 1]$, and $\gamma = 1/2 + \epsilon/2$. Here $z_2$ is unsafe while $z_1$ is safe, so it follows that $z_* = z_1$. As $z_2^\top \theta_* > z_1^\top \theta_*$, to show $z_2 \ne z_*$ we must show it is unsafe. However, if we solve the design $\mathcal{XY}_{\mathrm{diff}}$, we see that it places nearly all of its mass on the first coordinate. While this would be optimal if both $z_1$ and $z_2$ were safe and we simply wished to determine which has the higher value, to show $z_2$ is unsafe the optimal strategy places (roughly) the same mass on each coordinate, since each coordinate could contribute to the safety value. This is precisely the allocation BESIDE will play, so it is able to show that $z_2$ is unsafe much more efficiently than a naive $\mathcal{XY}_{\mathrm{diff}}$ approach.

3 Experiments for Safe Best-Arm Identification in Linear Bandits

We next present experimental results for BESIDE to demonstrate the advantage of experimental design, and especially of combining the $\mathcal{XY}_{\mathrm{diff}}$ and $\mathcal{XY}_{\mathrm{safe}}$ designs. As there are no existing algorithms that consider safe best-arm identification, as a benchmark we consider the naive adaptive approach BASELINE, which first solves the problem of finding the safe arms up to a desired tolerance, and then solves the problem of finding the best (safe) arm among the arms that were found to be safe. We first describe the instances on which we test BESIDE. Our experimental details and the precise implementation of BESIDE using elimination are described in Section F.

Multi-Armed Bandit. We consider a best-arm identification problem in which every arm is safe, but the arm with the highest value is very difficult to identify as safe, while the second-best arm can easily be shown safe. We vary the total number of arms and run BESIDE and BASELINE with $\epsilon = 0.5$ and $\delta = 0.1$. From Figure 2, we observe that the sample complexity of BESIDE is smaller (by a factor of up to about two for 100 arms) than the sample complexity of the baseline.

Linear Response Model. Random Instance: We also consider the more general setup where $\mathcal{X}, \mathcal{Z} \subseteq \mathbb{R}^d$, and $\theta_* \in \mathbb{R}^d$ and $\mu_* \in \mathbb{R}^d$ are randomly generated from independent Gaussian random variables with mean 0 and variance 1. We set $|\mathcal{X}| = 50$ and vary the size of $|\mathcal{Z}|$. In Figure 3, we see again that BESIDE significantly outperforms the baseline.
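As a concrete illustration of the instance $\mathcal{I}_1$ discussed above (and used in the hard-instance experiment below), the following sketch, our own illustration, compares the width $\|z_2\|_{A(\lambda)^{-1}}$ under an allocation concentrated on the first coordinate (as $\mathcal{XY}_{\mathrm{diff}}$ would play) versus a near-uniform allocation; the safety margin to certify is only $\epsilon/2$, and the near-uniform allocation yields a far smaller width.

```python
import numpy as np

eps = 0.01
X = np.eye(2)                                     # actions e1, e2
z1 = np.array([0.25, 0.50])
z2 = np.array([0.75, 0.50 + eps])
theta = np.array([1.0, 0.0])
mu    = np.array([0.0, 1.0])
gamma = 0.5 + eps / 2

def width(lam, z):
    A = (X * lam[:, None]).T @ X                  # A(lambda) is diagonal here
    return np.sqrt(z @ np.linalg.inv(A) @ z)

lam_diff_like = np.array([0.99, 0.01])            # nearly all mass on the first coordinate
lam_uniform   = np.array([0.50, 0.50])            # roughly what certifying z2 unsafe requires

print("safety margin to certify:", z2 @ mu - gamma)              # eps/2, very small
print("width, XY_diff-like allocation:", width(lam_diff_like, z2))
print("width, near-uniform allocation:", width(lam_uniform, z2))
```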
Hard Instance: We last consider the instance of Proposition 2 and benchmark against the strategy playing only allocations minimizing $\mathcal{XY}_{\mathrm{diff}}$. In Figure 4, we see again that BESIDE significantly outperforms this baseline, corroborating the theoretical result of Proposition 2.

Figure 2: Total arm pulls to termination vs. number of arms. Figure 3: Total arm pulls to termination vs. $|\mathcal{Z}|$. Figure 4: Total arm pulls to termination on the hard instance of Proposition 2.

3.1 Practical Algorithms for Active Classification Under Constraints

Next, we provide an application of the above ideas to pool-based active classification with constraints, namely adaptive sampling to learn the highest-accuracy classifier subject to a constraint on the false discovery rate (FDR). We first explain how this problem maps to the linear bandit setting. Precisely, let $\mathcal{X}$ be the example space and $\mathcal{Y} = \{0, 1\}$ the label space. Fix a hypothesis class $\mathcal{H}$ such that each $h \in \mathcal{H}$ is a classifier $h : \mathcal{X} \to \mathcal{Y}$. We represent each $h$ by an associated indicator vector $z_h \in \{0, 1\}^{|\mathcal{X}|}$ where $z_h(x) = 1 \iff h(x) = 1$. Similarly, let $\eta \in [0, 1]^{|\mathcal{X}|}$ represent the label distribution, i.e., $\eta(x) = P(Y = 1 \mid X = x)$. Then the risk of a classifier is $R(h) := \mathbb{E}_{x \sim \mathrm{Unif}(\mathcal{X}), Y \sim \mathrm{Ber}(\eta(x))}[\mathbf{1}\{h(x) \ne Y\}]$, which, up to an additive constant and scaling that do not depend on $h$, equals $z_h^\top(\mathbf{1} - 2\eta)$, and the FDR is defined as $\mathrm{FDR}(h) := (\mathbf{1} - \eta)^\top z_h / \mathbf{1}^\top z_h$. In the case when $\eta \in \{0, 1\}^{|\mathcal{X}|}$, $\mathrm{FDR}(h)$ is the proportion of examples that $h$ incorrectly labels as 1 out of all examples $h$ labels as 1. Our goal is to solve the following constrained best-arm identification problem:
$$\min_{h \in \mathcal{H}} R(h) \ \text{ s.t. } \ \mathrm{FDR}(h) \le q \quad \Longleftrightarrow \quad \min_{h \in \mathcal{H}} R(h) \ \text{ s.t. } \ \big((\mathbf{1} - \eta) - q\mathbf{1}\big)^\top z_h \le 0. \qquad (5)$$

The main challenge in running BESIDE on this problem directly is the potentially high computational cost of computing a design over an extremely large hypothesis class $\mathcal{H}$ (e.g., neural networks of bounded width). In this section we provide an alternative approach motivated by BESIDE. Algorithm 2 follows a similar design to BESIDE and relies on an oracle, CERM, that can solve (5); i.e., given a dataset it returns the highest-accuracy classifier under an FDR constraint. Such oracles are available, for example, in [1, 10]. In each round of Algorithm 2 we perform randomized exploration by perturbing the labels of our existing dataset with mean-zero Gaussian noise, and then training $k$ classifiers $\widehat{h}_i$, $i \in [k]$, on the resulting datasets. Implicitly, we are making the assumption that the loss function used to train the ERM can handle continuous labels, as is the case for the MLE of logistic regression. As described in [25], randomized exploration emulates sampling from a posterior distribution over our possible set of classifiers. We then use the labels generated by these classifiers to compute safe classifiers $h_i$, $i \in [k]$. Finally, mimicking the strategy of BESIDE, we compute $\mathcal{XY}_{\mathrm{safe}}$ and $\mathcal{XY}_{\mathrm{diff}}$ designs on these $k$ safe classifiers and repeat (note that the designs computed on Line 5 are equivalent to $\mathcal{XY}_{\mathrm{safe}}$ and $\mathcal{XY}_{\mathrm{diff}}$ in the classification setting).

Algorithm 2 Active constrained classification with randomized exploration
Require: batch size $n$, initial labeled data $x^{(0)}_1, \ldots, x^{(0)}_n$, number of rounds $L$, number of classifiers per round $k$, perturbation variance $\sigma^2$
1: for $\ell = 1, \ldots, L$ do
2:     for $i = 1, \ldots, k$ do
3:         $\widehat{h}_i = \mathrm{ERM}\big(\{(x^{(j)}_t, \ y^{(j)}_t + \xi^{(i)}_t)\}_{1 \le t \le n,\, 0 \le j < \ell}\big)$, where $\xi^{(i)}_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$
4:         $h_i = \mathrm{CERM}\big(\{(x, \widehat{h}_i(x))\}_{x \in \mathcal{X}}\big)$
5:     Compute designs: $\lambda_{\mathrm{safe}} = \arg\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{1 \le i \le k} \sum_x \frac{\mathbf{1}\{h_i(x) = 1\}}{\lambda_x}$, $\ \lambda_{\mathrm{diff}} = \arg\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{1 \le i \ne j \le k} \sum_x \frac{\mathbf{1}\{h_i(x) \ne h_j(x)\}}{\lambda_x}$
6:     Sample $x^{(\ell)}_1, \ldots, x^{(\ell)}_n$ from a uniform mixture of $\lambda_{\mathrm{safe}}$ and $\lambda_{\mathrm{diff}}$
7:     Observe corresponding labels $y^{(\ell)}_1, \ldots, y^{(\ell)}_n$
8: return $\widetilde{h} = \mathrm{CERM}\big(\{(x^{(\ell)}_t, y^{(\ell)}_t)\}_{1 \le t \le n,\, 0 \le \ell \le L}\big)$
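A compact sketch of Algorithm 2 is given below. It assumes hypothetical erm, cerm, and label_oracle callables (an unconstrained fit that accepts continuous labels, a constrained fit as in [1, 10], and a labeling interface, respectively), and for brevity it replaces the min-max designs of Line 5 with simple proportional weightings; it illustrates the structure of the algorithm rather than the exact implementation used in the experiments.

```python
import numpy as np

def active_constrained_classification(pool, label_oracle, erm, cerm,
                                      n, L, k, sigma, seed=0):
    """Sketch of Algorithm 2: randomized exploration, a CERM oracle, and design-based sampling.

    pool:         (N, d) array of unlabeled examples
    label_oracle: callable mapping an index array to {0, 1} labels
    erm / cerm:   callables returning a fitted classifier h with h(X) -> predictions
    """
    rng = np.random.default_rng(seed)
    idx = list(rng.choice(len(pool), n, replace=False))        # initial labeled batch
    y = list(label_oracle(np.array(idx)))
    for _ in range(L):
        safe_preds = []
        for _ in range(k):
            y_pert = np.array(y, dtype=float) + rng.normal(0, sigma, len(y))
            h_hat = erm(pool[np.array(idx)], y_pert)           # randomized exploration
            h_i = cerm(pool, h_hat(pool))                      # safe classifier on pseudo-labels
            safe_preds.append(h_i(pool).astype(int))
        P = np.array(safe_preds)                               # (k, N) predictions in {0, 1}
        # XY_safe-like weights: points some safe classifier labels 1
        w_safe = P.sum(axis=0) + 1e-12
        # XY_diff-like weights: points where the safe classifiers disagree
        w_diff = (P.max(axis=0) - P.min(axis=0)) + 1e-12
        lam = 0.5 * w_safe / w_safe.sum() + 0.5 * w_diff / w_diff.sum()
        new = rng.choice(len(pool), n, replace=False, p=lam / lam.sum())
        idx += list(new)
        y += list(label_oracle(new))
    return cerm(pool[np.array(idx)], np.array(y))              # final constrained fit
```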
To validate Algorithm 2, we compare against a passive baseline that selects points uniformly at random from the pool of examples $\mathcal{X}$, retrains the model using the same constrained empirical risk minimization oracle (CERM) as Algorithm 2 on its current samples, and reports the accuracy and FDR. We evaluate on two real-world datasets and one synthetic dataset, and provide additional details on the experiments in Section F.

Figure 5: FDR vs. accuracy for active (Algorithm 2) and passive sampling; ticks report the number of samples. FDR and accuracy are averaged over 5 trials.

Adult dataset. We evaluate on the Adult income dataset [27] (48,842 examples), where the goal is to predict whether someone's income is above $50k per year. We set the constraint to be FDR < 0.15 and report in Figure 5 the accuracy and the FDR obtained when varying the number of labels given to each method (the batch size is set to 25 and the initial number of queried labels is 50). We observe that for any desired accuracy, Algorithm 2 provides a classifier with lower FDR. Also, for any chosen number of total labels, such as 500, 750, or 2000 as reported in Figure 5, Algorithm 2 gives a classifier with higher accuracy and lower FDR. In general, we found that the active method needed half the number of samples of passive sampling to achieve a given FDR. This demonstrates the effectiveness of Algorithm 2 at simultaneously learning the objective (risk) and the constraint (FDR), in a manner consistent with our theoretical findings.

Figure 6: TPR vs. FDR for active (Algorithm 2) and passive sampling; ticks report the number of samples. Precision is $1 - \mathrm{FDR}$ and recall is TPR; both are averaged over 25 trials.

German Credit dataset. We consider the German Credit dataset, originally from the Statlog Project Databases [24]. The goal is to predict whether someone's credit is "bad" or "good". We report in Figure 6 the recall (TPR) and the precision ($1 - \mathrm{FDR}$) obtained when varying the number of labels given to each method. We observe that for any desired precision, Algorithm 2 provides a classifier with higher recall. Also, for any chosen number of total labels, such as 170, 270, 330, 450, or 600 as reported in Figure 6, Algorithm 2 gives a classifier with higher precision and higher recall. As for the Adult dataset, we found that the active method needed half the number of samples of passive sampling to achieve a given precision.

Half circle dataset. We consider a two-dimensional half circle dataset, visualized in Figure 7. We report in Figures 8 and 9 the precision and, respectively, the recall obtained when varying the number of labels given to each method. The confidence intervals are obtained over 25 repetitions. We observe that Algorithm 2 provides a classifier satisfying a given recall or precision in far fewer queries. This is in line with the results of [16] on one-dimensional thresholds, where the sample complexity of the active strategy is $O(\log(n))$ while the sample complexity of the passive strategy is at least of order $n$.

Figure 7: Half circle dataset. Figure 8: Precision. Figure 9: Recall.

4 Related Work

Constrained Bandits. A growing body of work seeks to address the question of safe learning in interactive environments. In particular, the majority of such works have considered the problem of regret minimization in linear bandits with linear safety constraints.
Here, the goal is to maximize online reward, $x_t^\top \theta_*$, by choosing actions $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$, while ensuring a safety constraint of the form $x_t^\top \mu_* \le \gamma$ is met at all times (either in expectation or with high probability). A variety of algorithms have been proposed, including UCB-style [23, 2, 32] and Thompson sampling [30, 31] approaches. While these works show that $\sqrt{T}$ regret is attainable, they only provide worst-case bounds (while we obtain instance-dependent bounds) and do not study the pure-exploration best-arm identification problem. To our knowledge, the only work to offer instance-dependent guarantees is [9], yet they focus exclusively on the regret setting and offer a relatively coarse notion of instance-dependence, analogous to $O(d \,\mathrm{poly}\log T / \Delta_{\min})$ bounds in the unconstrained linear bandit setting, in contrast to the more fine-grained notion of instance-dependence we provide. To our knowledge, only a handful of existing works consider the question of best-arm identification with safety constraints [36, 37, 39, 28]. The most closely related to ours is [28], which focuses on the easier problem of safe best-arm identification with known rewards and unknown constraints. Since the reward is known, the main challenge in the setting of [28] is to learn the constraints via G-optimal designs. The key and novel challenge of our framework is to carefully balance between G and $\mathcal{XY}$ designs: naively spending enough budget to either learn the reward model (via an $\mathcal{XY}$ design) or to learn the safety constraints (via a G design) will fail catastrophically (see Example 2.1 and Proposition 1). [36, 37] consider a general constrained optimization setting where the goal of the learner is to minimize some function $f(x)$ over a domain $x \in \mathcal{D}$, while only having access to noisy samples of $f(x)$, $f(x_t) + w_t$, and guaranteeing that a safety constraint $g(x_t) \le h$ is met for every query point $x_t$. While they do provide a sample complexity upper bound, they give no lower bound, and, as shown in [39], their approach can be very suboptimal. [39] considers the setting of best-arm identification in multi-armed bandits. In their setting, at every step $t$ they query a value $a_t \in \mathcal{A}$ for a particular coordinate $i_t$, and their goal is to identify the coordinate $i$ maximizing $a^i_* \mu_i$, where $a^i_*$ is the largest value respecting the safety constraint: $a^i_* = \arg\max_{a \in \mathcal{A}} a \mu_i$ s.t. $a \mu_i \le \gamma$. Similar to [36, 37], they require that the safety constraint $a_t \mu_{i_t} \le \gamma$ be met while learning. Though they do show matching upper and lower bounds, and in addition consider a slightly more general setting that allows for nonlinear (but monotonic) response functions, they treat every coordinate as independent and do not allow for information sharing between coordinates, which is the key generalization the linear bandit setting targets. We remark as well that in our setting, unlike these works, we allow the learner to query unsafe points during exploration, and only require that a safe decision be output at termination.

Best-Arm Identification in Linear Bandits. The best-arm identification problem in multi-armed bandits (without safety constraints) is a classical and well-studied problem [3, 33, 12, 5], and near-optimal algorithms exist [18, 22]. More recently, there has been growing interest in understanding the sample complexity of best-arm identification in linear bandits [35, 19, 40, 13, 20, 11]. We highlight in particular the work of [13], which proposes an experiment-design-based algorithm, RAGE, from which our approach takes inspiration.
While much progress has been made in understanding best-arm identification in linear bandits, to our knowledge no existing work considers the setting of best-arm identification in linear bandits with safety constraints, the setting of this work.

Active Classification under FDR Constraints. We finally mention one other related body of work: the problem of actively sampling to find a classifier with high accuracy or recall under precision constraints. Motivated by the experimental design approach of our main algorithm, BESIDE, we provide a heuristic algorithm for this problem with good empirical performance in Section 3.1. There is an extensive body of work on active learning (see the survey [14]), but only recently have works made the connection between best-arm identification for linear bandits and classification [21, 16, 7]. Precision constraints have been less studied in the adaptive context; we are only aware of [16, 4].

5 Conclusion

In this work we have shown that it is possible to efficiently find the best safe arm in linear bandits with a carefully designed adaptive experiment design-based approach. Our results open up several interesting directions for future work.

Instance Optimality. While BESIDE is worst-case optimal, in Appendix A we show an instance-dependent lower bound which BESIDE does not, in general, seem to hit. We conjecture that this lower bound may be loose; addressing this discrepancy and showing matching instance-dependent upper and lower bounds is an exciting direction for future work.

Safety During Exploration. Though there are many interesting applications where we may not require safety during exploration (i.e., only querying safe arms), in other cases we may need to ensure safety is met during exploration. Extending our work to this setting is an interesting open problem.

Potential Impacts. As with any algorithm making stochastic assumptions, if the assumptions are not met we cannot guarantee performance. In this case, one limitation is that if the underlying environment is changing (i.e., the constraints vary over time), the algorithm could have unexpected behavior with unintended consequences. Such a situation could lead to harmful results in examples such as the online advertising bidding example from the introduction. To mitigate this limitation of our setting, practitioners are encouraged to monitor many metrics, both short- and long-term.

Acknowledgements. The work of AW was supported by an NSF GRFP Fellowship, DGE-1762114. The work of JM was supported by an NSF CAREER award, an NSF AI institute (IFML), and the Simons collaborative grant on the foundations of fairness. The work of KJ was funded in part by the AFRL and NSF TRIPODS 2023166.

References

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification, 2018.
[2] Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 9252-9262, 2019. URL http://papers.nips.cc/paper/9124-linear-stochastic-bandits-under-safety-constraints.
[3] Robert E. Bechhofer.
A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics, 14(3):408-429, 1958.
[4] Paul N. Bennett, David M. Chickering, Christopher Meek, and Xiaojin Zhu. Algorithms for active classifier selection: Maximizing recall with precision constraints. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 711-719, 2017.
[5] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23-37. Springer, 2009.
[6] Romain Camilleri, Julian Katz-Samuels, and Kevin Jamieson. High-dimensional experimental design and kernel bandits, 2021.
[7] Romain Camilleri, Zhihan Xiong, Maryam Fazel, Lalit Jain, and Kevin Jamieson. Selective sampling for online best-arm identification, 2021.
[8] Olivier Catoni. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'IHP Probabilités et Statistiques, volume 48, pages 1148-1185, 2012.
[9] Tianrui Chen, Aditya Gangrade, and Venkatesh Saligrama. A doubly optimistic strategy for safe linear bandits, 2022. URL https://arxiv.org/abs/2209.13694.
[10] Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints, 2018. URL https://arxiv.org/abs/1807.00028.
[11] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432-2442. PMLR, 2020.
[12] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255-270. Springer, 2002.
[13] Tanner Fiez, Lalit Jain, Kevin Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits, 2019.
[14] Steve Hanneke. Theory of active learning. Foundations and Trends in Machine Learning, 7(2-3), 2014.
[15] Elad Hazan and Satyen Kale. Projection-free online learning. arXiv preprint arXiv:1206.4657, 2012.
[16] Lalit Jain and Kevin Jamieson. A new perspective on pool-based active classification and false-discovery control, 2020.
[17] Kevin Jamieson and Lalit Jain. Interactive machine learning. 2022.
[18] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423-439. PMLR, 2014.
[19] Zohar S. Karnin. Verification based solution for structured MAB problems. Advances in Neural Information Processing Systems, 29, 2016.
[20] Julian Katz-Samuels, Lalit Jain, Kevin G. Jamieson, et al. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. Advances in Neural Information Processing Systems, 33:10371-10382, 2020.
[21] Julian Katz-Samuels, Jifan Zhang, Lalit Jain, and Kevin Jamieson. Improved algorithms for agnostic pool-based active classification, 2021. URL https://arxiv.org/abs/2105.06499.
[22] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models, 2016.
[23] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi Yadkori, and Benjamin Van Roy. Conservative contextual linear bandits.
Advances in Neural Information Processing Systems, 30, 2017.
[24] E. Keogh, C. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://archive.ics.uci.edu/ml.
[25] Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits, 2019. URL https://arxiv.org/abs/1906.08947.
[26] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
[27] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[28] David Lindner, Sebastian Tschiatschek, Katja Hofmann, and Andreas Krause. Interactively learning preference constraints in linear bandits, 2022. URL https://arxiv.org/abs/2206.05255.
[29] Blake Mason, Romain Camilleri, Subhojyoti Mukherjee, Kevin Jamieson, Robert Nowak, and Lalit Jain. Nearly optimal algorithms for level set estimation. arXiv preprint arXiv:2111.01768, 2021.
[30] Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Safe linear Thompson sampling with side information, 2019. URL https://arxiv.org/abs/1911.02156.
[31] Ahmadreza Moradipari, Christos Thrampoulidis, and Mahnoosh Alizadeh. Stage-wise conservative linear bandits, 2020. doi: 10.48550/ARXIV.2010.00081. URL https://arxiv.org/abs/2010.00081.
[32] Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints, 2020. URL https://arxiv.org/abs/2006.10185.
[33] Edward Paulson. A sequential procedure for selecting the population with the largest mean from k normal populations. The Annals of Mathematical Statistics, pages 174-180, 1964.
[34] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory, pages 1794-1834. PMLR, 2017.
[35] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27:828-836, 2014.
[36] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997-1005. PMLR, 2015.
[37] Yanan Sui, Joel Burdick, Yisong Yue, et al. Stagewise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, pages 4781-4789. PMLR, 2018.
[38] Andrew Wagenmaker, Yifang Chen, Max Simchowitz, Simon S. Du, and Kevin Jamieson. Reward-free RL is no harder than reward-aware RL in linear Markov decision processes. arXiv preprint arXiv:2201.11206, 2022.
[39] Zhenlin Wang, Andrew Wagenmaker, and Kevin Jamieson. Best arm identification with safety constraints. In International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
[40] Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 843-851. PMLR, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Conclusion.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Conclusion.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] Refer to Appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Refer to Appendix.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Refer to Appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]