# Interactively Learning Preference Constraints in Linear Bandits

David Lindner 1, Sebastian Tschiatschek 2, Katja Hofmann 3, Andreas Krause 1

Abstract

We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst case. In the average case, ACOL's sample complexity bound is still significantly tighter than bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.

1 ETH Zurich, Switzerland; 2 University of Vienna, Austria; 3 Microsoft Research Cambridge, UK. Correspondence to: David Lindner.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Often, (sequential) decision-making problems are formalized as maximizing an unknown reward function that captures an expensive-to-evaluate objective, for example, user preferences (see Chapter 1 of Lattimore & Szepesvári (2020) for examples). However, in many practical situations, it can be more natural to model problems with a known reward function and unknown, expensive-to-evaluate constraints. For example, a cookie manufacturer might want to create a low-calorie cookie.⁴ The cookie should have the lowest amount of calories possible, but at least 95% of customers should like it. To evaluate this constraint, the manufacturer has to produce specific cookies and test them with customers. The reward, i.e., the amount of calories for a recipe, is easy to evaluate without producing a cookie. Because customer trials are expensive, the cookie manufacturer wants to find the best constrained solution with as few trials as possible.

As a second example, consider finding safe control parameters for an autonomous car. A car manufacturer might have a set of controllers to choose from that perform a specific task, such as reaching a target destination as quickly as possible. The ideal controller achieves this task well and drives safely and comfortably. Whereas the objective "travel time" is easy to specify as a reward function, the constraints "perceived safety" and "comfort" may require feedback from human drivers and passengers. Similarly to the previous example, the manufacturer's goal is to find the best safe controller with as few trials that involve human feedback as possible. We assume that the controllers are evaluated in a simulation, so it is acceptable to evaluate an unsafe controller during training; however, the constraints have to be satisfied during deployment.
In both examples, the decision-making problem is naturally characterized by an easy-to-evaluate part (the reward) and an expensive-to-evaluate part (the constraint). Additionally, we observe that constraints are more robust to changes in the environment and can be transferred to selecting controllers for different goals, in contrast to encoding the constraints as a penalty in the reward function (see Figure 1). Hence, in this paper, we study learning about unknown, expensive-to-evaluate constraints. Specifically, we propose a two-phase approach to solving problems with unknown constraints. In the first phase, we learn to estimate the expensive-to-evaluate constraint function well enough to solve the constrained optimization problem. In the second phase, we recommend a solution. Constraint violations are allowed in the first phase, but the final recommendation has to satisfy the constraints.

⁴ Example adapted from Gelbart et al. (2014).

[Figure 1: three driving scenarios (Base Scenario, Different Reward, Different Environment), shown for (a) a reward penalty and (b) a constraint.]
Figure 1. We want to select a controller for driving the orange car. In the base scenario (left image), the car should drive at velocity v, which we encode as reward. We model other driving rules, like "usually drive in a lane" or "don't get too close to other cars", either as a penalty on the reward function (in (a)) or as a constraint (in (b)). In the "different reward" scenario (middle image), the car should pull over to the right of the street instead of keeping the velocity. If we reuse the same reward penalty for this task, the controller does not achieve the task because the penalty is too strong. So, we would have to tune the penalty for this new task, whereas the constrained controller still completes the new task safely without any tuning. In the "different environment" scenario (right image), the goal of driving at a target velocity remains the same, but the environment changes. In this changed environment, three vehicles block the road. Here, the penalized controller trades off the penalty with achieving high reward and tries to go through the cars to keep the velocity, which is too dangerous. On the other hand, the constraint formulation does not allow violating a constraint such as "don't get too close to other cars", which makes the orange car stop before the street is blocked. In Section 4.3, we study this example in more detail.

Contributions. We formalize learning about unknown constraints to find the best constrained solution as a novel linear bandit problem (Section 3), which we call constrained linear best-arm identification (CBAI). We provide an instance-dependent sample complexity lower bound (Section 3.1) and propose Adaptive Constraint Learning (ACOL), an algorithm that almost matches this lower bound (Section 3.3). Our empirical evaluation shows that ACOL gets close to the performance of an oracle solution that has access to the true constraint function while outperforming a range of simpler baselines (Section 4.1). As a concrete application, we consider learning driving behavior in a simulation, where the constraints represent human preferences about driving behavior (Section 4.3). We demonstrate empirically that ACOL can learn these constraints, and we propose heuristic variants of the algorithm that empirically improve sample efficiency.
Additionally, we quantify the observation that learning driving preferences as constraints instead of rewards increases the robustness and transferability of the learned preferences.

2. Related Work

Learning constraints is similar to actively classifying arms as "feasible" or "infeasible"; but, in contrast to typical active learning (Settles, 2012), we do not need to classify all arms. Instead, we only want to find the best feasible arm, which can require fewer samples than classifying all arms. Our formalization as a linear multi-armed bandit best-arm identification problem (Audibert et al., 2010) is similar to Soare et al. (2014) in the unconstrained setting, but focused on learning constraints.

Much prior work on constraints in multi-armed bandits considers other notions of constraints than we do, for example, constraints holding in expectation rather than with high probability (Pacchiano et al., 2021), or constraints in the form of a lower bound (threshold) on the reward (Locatelli et al., 2016; Kazerouni et al., 2017; Kano et al., 2019; Khezeli & Bitar, 2020). Amani et al. (2019) and Moradipari et al. (2021) consider a linear bandit setting with a separate (linear) constraint function. Both differ from our work in three important ways: (1) they assume an unknown reward function whereas we assume the reward to be known, (2) they focus on cumulative regret minimization whereas we focus on best-arm identification, and (3) they require the constraints to be satisfied during exploration whereas we only require them to be satisfied for the final recommendation. These works adapt bandit algorithms based on upper confidence bounds (Amani et al., 2019) or Thompson sampling (Moradipari et al., 2021) to minimize regret in the constrained setting. To enable safe exploration, they need to assume a convex and compact set of arms; we do not require this assumption.

Wang et al. (2022) also study best-arm identification with linear constraints. In contrast to our work, they assume unknown rewards and focus on safety constraints that must be satisfied during exploration. To make this possible, they need to make more assumptions about the structure of the set of arms. In particular, they assume that the decision-maker can only query each dimension independently. Because of this, their algorithm cannot be applied to our setting without significant changes.

Our algorithm is conceptually similar to other bandit algorithms based on the principle of eliminating suboptimal arms step by step. Our theoretical analysis employs similar tools as those used for best-arm identification in unconstrained linear bandits (Soare et al., 2014; Fiez et al., 2019).

Some works in Bayesian optimization (BO) also study the problem of exploring to find the best constrained solution to a problem with an expensive-to-evaluate constraint function. Common approaches heuristically extend BO methods to incorporate an unknown constraint function (Gardner et al., 2014; Hernández-Lobato et al., 2016; Perrone et al., 2019). In contrast to this line of work, we obtain sample complexity guarantees by focusing on linear constraint functions. Similar to the bandit literature, most work on BO with constraints focuses on the setting where safety constraints must hold during exploration (e.g., Sui et al., 2015).
We apply our algorithm to the problem of learning from human preferences, which is essential for building systems with hard-to-specify goals, e.g., in robotics (Daniel et al., 2014). We use an environment by Sadigh et al. (2017), who model human preferences as rewards rather than constraints.

3. The Linear Constrained Best-Arm Identification Problem

We seek to find the best constrained solution from a discrete set of options represented by feature vectors $x \in \mathcal{X} \subset \mathbb{R}^d$. We assume that both the known reward function and the unknown constraint function are linear in $x$.

Definition 1. A constrained linear best-arm identification (CBAI) problem $\nu = (\mathcal{X}, \theta, \phi, \tau)$ consists of a finite action set $\mathcal{X} \subset \mathbb{R}^d$, a reward parameter $\theta \in \mathbb{R}^d$, a constraint parameter $\phi \in \mathbb{R}^d$, and a threshold $\tau \in \mathbb{R}$. The decision-maker knows $\mathcal{X}$ and $\theta$, but not $\phi$ and $\tau$. In each iteration, the decision-maker selects an arm $x \in \mathcal{X}$ and observes $\phi^T x + \eta_x$, where $\eta_x$ is sub-Gaussian noise. Their goal is to identify a constrained optimal arm
$$x^* \in \operatorname*{argmax}_{x \in \mathcal{X},\ \phi^T x \leq \tau} \theta^T x$$
within as few iterations as possible.

In our initial example, $\mathcal{X}$ contains all potential cookie recipes. $\theta^T x$ encodes the amount of calories of recipe $x$, which the decision-maker knows and wants to minimize. $\phi^T x$ encodes the unknown customer preferences, which the decision-maker must infer from as few experiments as possible. In the following, we assume w.l.o.g. $\tau = 0$; the generalization to $\tau \neq 0$ is straightforward. If $\tau$ is unknown, we can simply model it as a constant shift in the constraint function. To simplify notation, we omit $\tau$ and talk about a CBAI problem $\nu = (\mathcal{X}, \theta, \phi)$.

3.1. Lower Bounds

We first provide a lower bound on the sample complexity of solving a given CBAI problem. The following theorem states how many samples are necessary to distinguish a given CBAI instance from the closest instance with a different solution, which is necessary to solve an instance.

Theorem 1 (CBAI lower bound). Assume $\eta_x \sim \mathcal{N}(0, 1)$ for all $x \in \mathcal{X}$. For any CBAI problem $\nu = (\mathcal{X}, \theta, \phi)$, there exists another CBAI problem $\nu' = (\mathcal{X}, \theta, \phi')$ with the same set of actions $\mathcal{X}$ and reward parameter $\theta$ but a different constraint parameter and optimal arm, such that the expected number of iterations $\tau$ needed by any allocation strategy that can distinguish between $\nu$ and $\nu'$ with probability at least $1 - \delta$ is lower bounded as
$$\mathbb{E}[\tau] \;\geq\; 2 \log\!\left(\frac{1}{2.4\,\delta}\right) \max_{x \in \mathcal{X}_\theta(x^*_\nu)} \frac{\|x\|^2_{A_\lambda^{-1}}}{(\phi^T x)^2},$$
where $\lambda$ is the probability distribution over arms which the allocation strategy follows, i.e., $\lambda(x)$ is the probability that it pulls arm $x$, $A_\lambda = \sum_x \lambda(x)\, x x^T$ is the design matrix, and $\mathcal{X}_\theta(x^*_\nu) = \{x \in \mathcal{X} \mid \theta^T x \geq \theta^T x^*_\nu\}$ is the set of all arms with reward no less than that of $x^*_\nu$, the optimal arm for problem $\nu$.

The proof in Appendix A.1 uses a proof strategy similar to that for lower bounds for standard linear bandits (Soare et al., 2014; Soare, 2015; Fiez et al., 2019). We consider the log-likelihood ratio of making a series of observations in instance $\nu$ compared to $\nu'$, and ask how we can choose $\nu'$ to have a different solution but a small log-likelihood ratio, i.e., such that the decision-maker makes similar observations as if they were in $\nu$. In contrast to the standard linear bandit case, we need to carefully reason about the constraints when ensuring that $\nu'$ has a different solution than $\nu$. We distinguish the case that the solution of $\nu$ is infeasible in $\nu'$ and the case that an arm with larger reward is feasible in $\nu'$ but not in $\nu$. Reasoning about these two cases yields the result.
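To make Definition 1 concrete, the following minimal Python sketch sets up a CBAI instance and simulates the noisy constraint observations. It is illustrative only; the class and parameter names are ours and not taken from the paper's released implementation.

```python
import numpy as np

class CBAIInstance:
    """A constrained linear best-arm identification problem nu = (X, theta, phi, tau)."""

    def __init__(self, arms, theta, phi, tau=0.0, noise_std=0.05, seed=0):
        self.arms = np.asarray(arms, dtype=float)    # shape (n_arms, d), known to the learner
        self.theta = np.asarray(theta, dtype=float)  # known reward parameter
        self.phi = np.asarray(phi, dtype=float)      # unknown constraint parameter
        self.tau = float(tau)                        # constraint threshold
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def observe(self, arm_index):
        """Noisy constraint observation phi^T x + eta for the pulled arm (Definition 1)."""
        x = self.arms[arm_index]
        return float(self.phi @ x + self.rng.normal(0.0, self.noise_std))

    def best_feasible_arm(self):
        """Ground-truth constrained optimum, used only for evaluating algorithms."""
        feasible = self.arms @ self.phi <= self.tau
        rewards = np.where(feasible, self.arms @ self.theta, -np.inf)
        return int(np.argmax(rewards))
```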
Our lower bound has a similar form as those for best-arm identification in linear bandits (Soare et al., 2014). In particular, we have the same uncertainty term in the numerator. Instead of a suboptimality gap in the denominator, we get the distance to the constraint boundary: the problem is harder if arms are closer to the constraint boundary. However, our maximization is over individual arms instead of directions, i.e., pairs of arms, and the set $\mathcal{X}_\theta(x^*_\nu)$ does not appear in the linear bandit case.

We want to characterize the sample complexity of different algorithms for solving the CBAI problem. To this end, let us define the sample complexity of a given problem instance using the lower bound we just derived.

Definition 2 (CBAI sample complexity). We define the sample complexity of a CBAI problem $\nu$ as
$$H_{CLB}(\nu) := \min_\lambda \max_{x \in \mathcal{X}_\theta(x^*_\nu)} \frac{\|x\|^2_{A_\lambda^{-1}}}{(\phi^T x)^2}.$$

This describes the best sample complexity that any algorithm can achieve on CBAI problem $\nu$. It will also be helpful to have a worst-case upper bound on $H_{CLB}(\nu)$ as a point of comparison, which the next proposition provides.

Proposition 1. For any CBAI problem $\nu$, we have $H_{CLB}(\nu) \leq d / (C^+_{\min})^2$, where $C^+_{\min} = \min_{x \in \mathcal{X}} |\phi^T x|$. This bound is tight, i.e., there is an instance $\nu$ such that $H_{CLB}(\nu) = d / (C^+_{\min})^2$.

This result indicates that a CBAI problem is harder if it has a larger dimension $d$, or if the distance of the arm that is closest to the constraint boundary ($C^+_{\min}$) is smaller. The worst-case bound corresponds to situations where all arms are linearly independent and pulling one arm does not provide any information about any other arm.

Oracle solution. We can make the definition of sample complexity more concrete by considering an oracle solution that has access to the true constraint value to select which arms to query. The oracle selects arms by explicitly minimizing $H_{CLB}$:
$$\lambda^* \in \operatorname*{argmin}_\lambda \max_{x \in \mathcal{X}_\theta(x^*_\nu)} \frac{\|x\|^2_{A_\lambda^{-1}}}{(\phi^T x)^2}.$$
The oracle prefers arms with high uncertainty (high $\|x\|^2_{A_\lambda^{-1}}$) and arms close to the constraint boundary (low $(\phi^T x)^2$). Moreover, it focuses on reducing the uncertainty about arms that have higher reward than the true optimal arm (arms in $\mathcal{X}_\theta(x^*_\nu)$). In Appendix B.1, we show that this oracle solution has sample complexity of order $H_{CLB}(\nu)$, i.e., it is indeed optimal.

3.2. Confidence Intervals for Linear Regression

Our algorithms rely on high-probability confidence intervals on the linear constraint constructed from observations. Hence, let us briefly review how to construct such confidence intervals from observations with sub-Gaussian noise. Suppose an algorithm queried a sequence of arms $\mathbf{x}_t = (x_1, \ldots, x_t)$. For a given $x_i$, it observed $y_i = \phi^T x_i + \eta_{x_i}$, where $\phi$ is the true constraint parameter and $\eta_{x_i}$ is sub-Gaussian noise. We now aim to find confidence intervals such that $\phi^T x \in [l^t_\phi(x), u^t_\phi(x)]$ with probability at least $1 - \delta$, where
$$l^t_\phi(x) = \hat\phi^T x - \beta_t \|x\|_{A_{\mathbf{x}_t}^{-1}} \qquad \text{and} \qquad u^t_\phi(x) = \hat\phi^T x + \beta_t \|x\|_{A_{\mathbf{x}_t}^{-1}}.$$
Based on these confidence intervals, we can decide whether a given arm is likely feasible or not. If the queries follow a distribution that does not depend on the observations, it is straightforward to derive confidence intervals (e.g., Chapter 20 in Lattimore & Szepesvári (2020)).

Proposition 2. Let $\mathbf{x}_t = (x_1, \ldots, x_t)$ be a sequence of arms from a fixed allocation for which we have observed $\phi^T x_i + \eta_{x_i}$, where $\eta_{x_i}$ is independent sub-Gaussian noise. If we estimate $\hat\phi$ from the observations using least-squares regression and choose $\beta_t = \sqrt{2 \log(|\mathcal{X}|/\delta)}$, then we have
$$P\big(\exists x \in \mathcal{X} : \phi^T x \notin [l^t_\phi(x), u^t_\phi(x)]\big) \leq \delta.$$
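A short sketch of the static confidence intervals of Proposition 2, assuming 1-sub-Gaussian noise; the small ridge term is added only for numerical stability and is not part of the proposition.

```python
import numpy as np

def static_confidence_intervals(pulled_arms, observations, all_arms, delta, ridge=1e-6):
    """Least-squares estimate of phi and intervals [l(x), u(x)] as in Proposition 2."""
    X = np.asarray(pulled_arms, dtype=float)      # (t, d) arms of a fixed allocation
    y = np.asarray(observations, dtype=float)     # (t,) noisy constraint values
    arms = np.asarray(all_arms, dtype=float)      # (n, d) all arms we want intervals for
    A = X.T @ X + ridge * np.eye(X.shape[1])      # design matrix A_{x_t}
    phi_hat = np.linalg.solve(A, X.T @ y)         # least-squares estimate of phi
    beta = np.sqrt(2.0 * np.log(len(arms) / delta))
    A_inv = np.linalg.inv(A)
    widths = beta * np.sqrt(np.einsum("ij,jk,ik->i", arms, A_inv, arms))
    means = arms @ phi_hat
    return means - widths, means + widths         # lower and upper bounds for every arm
```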
However, in sequential decision-making we usually want to adapt our strategy after making observations. In this case, we need to be more careful in constructing confidence intervals, as observed by Abbasi-Yadkori et al. (2011). Unfortunately, the resulting confidence intervals are weaker than those for static allocations by a factor of roughly $\sqrt{d}$.

Proposition 3 (Theorem 2 by Abbasi-Yadkori et al. (2011)). Let $\mathbf{x}_t = (x_1, \ldots, x_t)$ be a sequence of points selected with a possibly adaptive strategy for which we have observed $\phi^T x_i + \eta_{x_i}$, where $\eta_{x_i}$ is independent sub-Gaussian noise. Assume that $\|\phi\|_2 \leq S$ and $\|x\|_2 \leq L$ for all $x \in \mathcal{X}$. If we estimate $\hat\phi$ from the observations using (regularized) least-squares regression with regularization parameter $\lambda$, then for all $t \geq 0$:
$$P\big(\exists x \in \mathcal{X} : \phi^T x \notin [l^t_\phi(x), u^t_\phi(x)]\big) \leq \delta \qquad \text{with} \qquad \beta_t = \sqrt{d \log\big((1 + t L^2 / \lambda)/\delta\big)} + \sqrt{\lambda}\, S.$$

3.3. Algorithms Using Static Confidence Intervals

To design an algorithm for solving CBAI problems, we need to decide (1) which arms to pull during exploration and (2) when we can stop the algorithm and return the correct arm with high probability. Let us first address the second question and then return to the first one.

Stopping condition. Using the past observations, we can define confidence intervals for the constraint value of each arm. Let $l^t_\phi(x)$ and $u^t_\phi(x)$ be such that we know with high probability (w.h.p.) that $\phi^T x \in [l^t_\phi(x), u^t_\phi(x)]$. Then we can also determine w.h.p. that all arms with $l^t_\phi(x) > 0$ are infeasible, and all arms with $u^t_\phi(x) \leq 0$ are feasible. Moreover, we can identify suboptimal arms by considering $r = \max_{x:\, u^t_\phi(x) \leq 0} \theta^T x$. The solutions to this optimization problem are the highest-reward arms that are feasible w.h.p. Therefore, all arms with reward less than $r$ are clearly suboptimal. Combining these observations, we can define the set of arms that we are still uncertain about, i.e., that could still be optimal:
$$\mathcal{U}_t = \{x \in \mathcal{X} \mid l^t_\phi(x) \leq 0 \text{ and } u^t_\phi(x) > 0 \text{ and } \theta^T x > r\}.$$
Note that if $\mathcal{U}_t$ is empty, we can stop and return an arm in $\operatorname{argmax}_{x:\, u^t_\phi(x) \leq 0} \theta^T x$. This arm will be optimal w.h.p.

Arm selection criterion. In each iteration, we have to decide which arm to pull. We could, e.g., combine the above stopping condition with querying uniformly random arms. This algorithm would return the correct optimal arm with high probability. However, random querying will usually not be the most sample-efficient approach. Another natural approach is to select the arms that we are most uncertain about, which is sometimes called uncertainty sampling. We could, e.g., choose a fixed allocation
$$\lambda_G \in \operatorname*{argmin}_\lambda \max_{x \in \mathcal{X}} \|x\|_{A_\lambda^{-1}}.$$
This approach is also called G-Allocation in the experimental design literature. We show in Appendix B.2 that G-Allocation matches the worst-case lower bound in Proposition 1. However, we can do better by focusing on arms that we cannot yet exclude as being certainly feasible, infeasible, or suboptimal. Concretely, we modify the G-Allocation to reduce uncertainty only about arms in $\mathcal{U}_t$:
$$\lambda_{ACOL} \in \operatorname*{argmin}_\lambda \max_{x \in \mathcal{U}_t} \|x\|_{A_\lambda^{-1}}.$$

Rounding. All algorithms implementing a static allocation require a rounding procedure to translate an allocation $\lambda$ into a finite sequence of arms $x_1, \ldots, x_n$. The experimental design literature provides various efficient rounding procedures that are $\varepsilon$-approximate. We use a standard procedure described in Chapter 12 of Friedrich (2006).
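Computing the allocations $\lambda_G$ and $\lambda_{ACOL}$ means solving a min-max design problem over the simplex. One standard way to approximate it is a Frank-Wolfe-style update on the design; the sketch below is a generic experimental-design routine written for illustration and is not necessarily the solver used in the paper's implementation.

```python
import numpy as np

def g_design_over_uncertain_set(all_arms, uncertain_idx, n_steps=200, ridge=1e-6):
    """Frank-Wolfe-style approximation of argmin_lambda max_{x in U} ||x||^2_{A_lambda^{-1}}.

    `all_arms` (n, d) is the full arm set the design may put mass on, `uncertain_idx`
    indexes the arms U whose uncertainty we want to shrink."""
    X = np.asarray(all_arms, dtype=float)
    n, d = X.shape
    lam = np.full(n, 1.0 / n)                      # start from the uniform design
    for k in range(n_steps):
        A = X.T @ (lam[:, None] * X) + ridge * np.eye(d)
        A_inv = np.linalg.inv(A)
        # arm in U with the largest predictive variance under the current design
        variances = np.einsum("ij,jk,ik->i", X[uncertain_idx], A_inv, X[uncertain_idx])
        x_star = X[uncertain_idx][np.argmax(variances)]
        # linear minimization oracle of the linearized objective:
        # add mass to the arm z that maximizes (x_star^T A^{-1} x_z)^2
        scores = (X @ (A_inv @ x_star)) ** 2
        z = int(np.argmax(scores))
        step = 2.0 / (k + 2)                       # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[z] += step
    return lam                                     # allocation, to be rounded into arm pulls
```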
Adaptive Constraint Learning (ACOL). Algorithm 1 shows the full algorithm, which we call Adaptive Constraint Learning (ACOL). The algorithm proceeds in rounds. In each round $t$, it pulls arms to reduce the uncertainty about arms in $\mathcal{U}_t$, then updates $\mathcal{U}_t$, and decides if it can stop and return a recommendation. The round length $N_t$ is chosen carefully to allow us to provide a tight sample complexity result.

Algorithm 1 Adaptive Constraint Learning (ACOL)
1: Input: significance $\delta$
2: $\mathcal{U}_1 \leftarrow \mathcal{X}$ (uncertain arms)
3: $\mathcal{F}_1 \leftarrow \emptyset$ (feasible arms)
4: $t \leftarrow 1$ (round)
5: while $\mathcal{U}_t \neq \emptyset$ do
6:   $\delta_t \leftarrow \delta^2 / t^2$
7:   $\lambda^*_t \leftarrow \operatorname{argmin}_\lambda \max_{x \in \mathcal{U}_t} \|x\|^2_{A_\lambda^{-1}}$
8:   $\rho^*_t \leftarrow \min_\lambda \max_{x \in \mathcal{U}_t} \|x\|^2_{A_\lambda^{-1}}$
9:   $N_t \leftarrow \max\big\{\big\lceil 2^{2t+3} \log(|\mathcal{X}|/\delta_t)\,(1+\varepsilon)\,\rho^*_t \big\rceil,\ r(\varepsilon)\big\}$
10:  $\mathbf{x}_{N_t} \leftarrow \text{Round}(\lambda^*_t, N_t)$
11:  Pull arms $x_1, \ldots, x_{N_t}$ and observe constraint values
12:  $t \leftarrow t + 1$
13:  Update $\hat\phi_t$ and $A$ based on the new data
14:  $l^t_\phi(x) \leftarrow \hat\phi_t^T x - \beta_t \|x\|_{A^{-1}}$ for all arms $x$
15:  $u^t_\phi(x) \leftarrow \hat\phi_t^T x + \beta_t \|x\|_{A^{-1}}$ for all arms $x$
16:  $\mathcal{F}_t \leftarrow \mathcal{F}_{t-1} \cup \{x \mid u^t_\phi(x) \leq 0\}$
17:  $r \leftarrow \max_{x \in \mathcal{F}_t} \theta^T x$
18:  $\mathcal{U}_t \leftarrow \mathcal{U}_{t-1} \setminus \{x \mid l^t_\phi(x) > 0\} \setminus \{x \mid u^t_\phi(x) \leq 0\} \setminus \{x \mid \theta^T x < r\}$
19: end while
20: return $x^* \in \operatorname{argmax}_{x \in \mathcal{F}_t} \theta^T x$

The following theorem, the main theoretical result of our paper, establishes that ACOL returns the correct optimal solution to any CBAI problem and provides an upper bound on the number of samples necessary.

Theorem 2 (ACOL sample complexity). Assume Algorithm 1 is implemented with an $\varepsilon$-approximate rounding strategy. Then, after $N$ iterations the algorithm returns an optimal arm with probability at least $1 - \delta$, and we have:
$$N \;\leq\; 8\,(1+\varepsilon) \log\!\left(\frac{|\mathcal{X}|\,\bar t^{\,2}}{\delta^2}\right) \sum_{t=1}^{\bar t} \min_\lambda \max_{x \in \mathcal{U}_t} \frac{\|x\|^2_{A_\lambda^{-1}}}{(\phi^T x)^2} \;+\; \bar t \;\leq\; 8\,(1+\varepsilon) \log\!\left(\frac{|\mathcal{X}|\,\bar t^{\,2}}{\delta^2}\right) \bar t\; \overline{H}_{CLB}(\nu) \;+\; \bar t,$$
where $\overline{H}_{CLB}(\nu) = \min_\lambda \max_{x \in \mathcal{X}} \|x\|^2_{A_\lambda^{-1}} / (\phi^T x)^2$ (note that the maximization here is over all of $\mathcal{X}$), and $\bar t = \lceil \log_2(1/C^+_{\min}) \rceil$. Moreover, $\overline{H}_{CLB}(\nu) \leq d / (C^+_{\min})^2$.

We prove the theorem in Appendix A. The key step uses Proposition 2 to show that the confidence intervals shrink exponentially. This implies that after a logarithmic number of rounds, the largest confidence interval will be smaller than $C^+_{\min}$; once this is the case, $\mathcal{U}_t$ is empty and the algorithm returns the correct solution. Combining this with the round lengths $N_t$ allows us to prove the result.

The sample complexity of ACOL is of order $H_{CLB}(\nu)$, except for logarithmic factors. Also, we show that $\overline{H}_{CLB}(\nu) \leq d/(C^+_{\min})^2$, so the bound matches the lower bound of Proposition 1 for worst-case instances, but it is much tighter for benign instances. In particular, the bound in Theorem 2 contains the same min-max problem as the instance-dependent sample complexity $H_{CLB}(\nu)$, only with the maximization being over different sets, namely $\mathcal{U}_t$ instead of $\mathcal{X}_\theta(x^*_\nu)$. Note that we cannot expect a practical algorithm to only explore arms in $\mathcal{X}_\theta(x^*_\nu)$ because we do not know $x^*_\nu$ a priori. Instead, ACOL explores in $\mathcal{U}_t$, a conservative estimate of $\mathcal{X}_\theta(x^*_\nu)$ that shrinks over time given the knowledge collected so far. Theorem 2 does not exactly match the instance-dependent lower bound, but the difference only depends on how well $\mathcal{U}_t$ approximates the set of relevant arms.
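For illustration, here is a compact Python sketch of the round structure of Algorithm 1, reusing the hypothetical helpers sketched above (`CBAIInstance`, `static_confidence_intervals`, `g_design_over_uncertain_set`). It deviates from the paper in two places that are marked in the comments: the design is realized by sampling arms from $\lambda^*_t$ instead of an $\varepsilon$-approximate rounding procedure, and round lengths are capped for practicality.

```python
import numpy as np

def acol(instance, delta=0.05, max_rounds=30, max_round_len=20000, seed=0):
    """Sketch of Algorithm 1 (ACOL): rounds of static-design sampling that shrink U_t."""
    rng = np.random.default_rng(seed)
    X, n = instance.arms, len(instance.arms)
    rewards = X @ instance.theta
    uncertain = np.ones(n, dtype=bool)            # U_t: arms we are still unsure about
    feasible = np.zeros(n, dtype=bool)            # F_t: arms known to be feasible w.h.p.
    for t in range(1, max_rounds + 1):
        if not uncertain.any():
            break
        delta_t = delta ** 2 / t ** 2
        lam = g_design_over_uncertain_set(X, np.flatnonzero(uncertain))
        A_inv = np.linalg.inv(X.T @ (lam[:, None] * X) + 1e-6 * np.eye(X.shape[1]))
        rho = np.einsum("ij,jk,ik->i", X[uncertain], A_inv, X[uncertain]).max()
        # round length from Algorithm 1, capped; rounding is replaced by sampling from lam
        n_t = min(int(np.ceil(2 ** (2 * t + 3) * np.log(n / delta_t) * rho)), max_round_len)
        pulled = rng.choice(n, size=n_t, p=lam / lam.sum())
        obs = [instance.observe(i) for i in pulled]
        lower, upper = static_confidence_intervals(X[pulled], obs, X, delta_t)
        feasible |= upper <= instance.tau         # certainly feasible arms join F_t
        r = rewards[feasible].max() if feasible.any() else -np.inf
        # keep arms that might be feasible, are not yet known feasible, and beat r
        uncertain &= (lower <= instance.tau) & (upper > instance.tau) & (rewards >= r)
    return int(np.argmax(np.where(feasible, rewards, -np.inf)))
```

A usage example under these assumptions would be `acol(CBAIInstance(arms, theta, phi, tau))`, which returns the index of the recommended arm.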
Algorithm 2 Greedy Adaptive Constraint Learning (G-ACOL)
1: Input: $\beta_t$, $\lambda$
2: Initialize $\hat\phi_1$, $\mathcal{U}_1 \leftarrow \mathcal{X}$, $\mathcal{F}_1 \leftarrow \emptyset$, $A \leftarrow \lambda I$, $t \leftarrow 1$
3: while $\mathcal{U}_t \neq \emptyset$ do
4:   $x \leftarrow \operatorname{argmax}_{x \in \mathcal{U}_t} \|x\|^2_{A^{-1}}$
5:   Pull arm $x$ and observe constraint value
6:   $t \leftarrow t + 1$, $A \leftarrow A + x x^T$
7:   $l^t_\phi(x) \leftarrow \hat\phi_t^T x - \beta_t \|x\|_{A^{-1}}$ for all arms $x$
8:   $u^t_\phi(x) \leftarrow \hat\phi_t^T x + \beta_t \|x\|_{A^{-1}}$ for all arms $x$
9:   $\mathcal{F}_t \leftarrow \mathcal{F}_{t-1} \cup \{x \mid u^t_\phi(x) \leq 0\}$
10:  $r \leftarrow \max_{x \in \mathcal{F}_t} \theta^T x$
11:  $\mathcal{U}_t \leftarrow \mathcal{U}_{t-1} \setminus \{x \mid l^t_\phi(x) > 0\} \setminus \{x \mid u^t_\phi(x) \leq 0\} \setminus \{x \mid \theta^T x < r\}$
12: end while
13: return $x^* \in \operatorname{argmax}_{x \in \mathcal{F}_t} \theta^T x$

3.4. Algorithms Using Adaptive Confidence Intervals

Whereas the algorithm we just introduced comes with a strong sample complexity guarantee, it is impractical in various ways, primarily because of its round-based structure. In particular, the algorithm requires a rounding procedure to determine a sequence of actions; it then follows this sequence for a predefined round length and cannot stop before finishing a round. Also, in between rounds, the algorithm discards all previously made observations, which is necessary to apply Proposition 2.

Next, we present an alternative version of this algorithm that uses the adaptive confidence intervals of Proposition 3. This allows us to remove the round-based structure in favor of a greedy algorithm that does not have the same limitations. This algorithm, which we call Greedy Adaptive Constraint Learning (G-ACOL), is shown in Algorithm 2. Unfortunately, for G-ACOL we can only provide significantly weaker sample complexity guarantees; but we find that it performs well empirically. Since the adaptive confidence intervals hold for all $t > 0$ simultaneously, we can now check the stopping condition after each sample. Instead of determining a static allocation that reduces uncertainty about the uncertain arms, we now greedily select the arm that reduces uncertainty within $\mathcal{U}_t$ the most. Thanks to Proposition 3, this algorithm still stops and returns the correct solution. However, it achieves worse sample complexity due to the additional factor of $d$ in Proposition 3.

Heuristic modifications. There is a variety of heuristic modifications that we can make to G-ACOL to improve its practical performance at the cost of losing some theoretical guarantees. First, we could use a different query rule within the set of uncertain arms, such as uniformly random querying, which reduces computational cost. Second, the $\beta_t$ resulting from Proposition 3 tends to be very large. In practice, we can tune $\beta_t$ to get confidence intervals that are much smaller than the ones suggested by the theory. Third, we can turn the algorithm into an anytime algorithm by defining a recommendation rule, such as recommending the best arm that is certainly feasible. Then, we can stop the algorithm after an a priori unknown budget of queries and receive a best guess for the optimal arm.

4. Experiments

We perform three experiments. First, in Section 4.1, we consider synthetic CBAI instances to evaluate ACOL and compare it to natural baselines. Additionally, we investigate the effect of various heuristic modifications to the algorithm. Second, in Section 4.2, we compare ACOL to algorithms that safely minimize regret. And, third, in Section 4.3, we consider learning constraints that represent human preferences in a simulated driving scenario. This experiment illustrates how to model preference learning problems as CBAI problems. In the driving simulation, we also demonstrate the benefits of learning constraints in terms of robustness and transferability. We provide more details on the experiments in Appendix C, and we provide the full source code to reproduce our experiments.¹ For all experiments we use a significance of $\delta = 0.05$ and, if not stated differently, observations have Gaussian noise with $\sigma = 0.05$.

4.1. Synthetic Experiments

We consider two synthetic CBAI instances and compare a range of baselines and multiple variants of ACOL / G-ACOL.

Instance 1: Irrelevant dimensions. First, we consider CBAI instances which contain a number of dimensions that are irrelevant for learning the correct constraint boundary. The problems have dimension $d$ and $d+1$ arms $x_1, \ldots, x_{d+1}$. For each $i = 1, \ldots, d-1$, we have $x_i = e_i$, whereas $x_d = (1 - \varepsilon) e_d$ and $x_{d+1} = (1 + \varepsilon) e_d$ for some $\varepsilon > 0$; $e_i$ denotes the $i$-th unit vector.
The reward and constraint parameters are both $\theta = \phi = e_d$. We define a threshold $\tau = 1$; hence, $x_1, \ldots, x_{d-1}$ are feasible but suboptimal, $x_d$ is optimal, and $x_{d+1}$ is infeasible. Importantly, the arms $x_1, \ldots, x_{d-1}$ are irrelevant to finding the correct constraint boundary between $x_d$ and $x_{d+1}$. An ideal algorithm would focus its queries primarily on $x_d$ and $x_{d+1}$. We can vary the problem difficulty by changing $\varepsilon$ (more difficult for small values) and $d$ (more difficult for large values).

Instance 2: Unit sphere. To create CBAI instances with a range of different reward and constraint functions, we sample arms $x_1, \ldots, x_n$ uniformly from a $d$-dimensional unit sphere. We also sample the reward parameter $\theta$ from the unit sphere. As constraint parameter, we choose $\phi = x_i - x_j$, where $x_i$ and $x_j$ are the two closest arms in $\ell_2$-distance. We can increase the problem difficulty by increasing the dimension $d$ and the number of arms $n$.

¹ https://github.com/lasgroup/adaptive-constraint-learning

[Figure 2: panels (a) "Irrelevant dimensions" and (b) "Unit sphere", plotting the number of samples against ε, the dimension, and the number of arms for Uniform, G-Allocation, ACOL (ours), Oracle, Adaptive Uniform (tuned), Greedy Max Var (tuned), G-ACOL (tuned), and G-ACOL (theory).]
Figure 2. For both synthetic instances, we plot the median number of iterations for finding the constrained optimal solution as a function of different parameters of the problem instance. All methods return the correct constrained optimal solution. For "irrelevant dimensions", we vary $\varepsilon$ for fixed $d = 10$, and $d$ for fixed $\varepsilon = 0.05$. For "unit sphere", we vary $n$ for fixed $d = 30$, and $d$ for fixed $n = 30$. Note that for "unit sphere", the instances are randomly sampled for each random seed, whereas the "irrelevant dimensions" instance stays the same. For legibility, we only show the median computed over 30 random seeds, and omit a few of the baselines we evaluated. For plots with all baselines that include confidence intervals, see Appendix D. Overall, ACOL is the most sample efficient approach of all algorithms that provide theoretical guarantees. In rare cases, it even needs fewer samples than the oracle. However, this is mostly an artifact of both algorithms using slightly different round lengths (cf. Appendix C). By tuning $\beta_t$, we can gain several orders of magnitude in sample efficiency at the cost of theoretical guarantees. G-ACOL remains the most sample efficient among these tuned approaches.
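The two synthetic families described above can be generated in a few lines, as in the sketch below; the threshold for the unit-sphere instance is not stated explicitly in the text, so $\tau = 0$ there is an assumption.

```python
import numpy as np

def irrelevant_dimensions_instance(d=10, eps=0.05):
    """Instance 1: d+1 arms, only the last coordinate matters for the constraint."""
    arms = np.eye(d + 1, d)                     # rows e_1, ..., e_d plus one extra row
    arms[d - 1] = (1 - eps) * np.eye(d)[d - 1]  # x_d = (1 - eps) e_d (optimal)
    arms[d] = (1 + eps) * np.eye(d)[d - 1]      # x_{d+1} = (1 + eps) e_d (infeasible)
    theta = phi = np.eye(d)[d - 1]              # theta = phi = e_d
    return arms, theta, phi, 1.0                # threshold tau = 1

def unit_sphere_instance(n=30, d=30, seed=0):
    """Instance 2: arms and reward on the unit sphere; phi separates the two closest arms."""
    rng = np.random.default_rng(seed)
    arms = rng.normal(size=(n, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)
    dists = np.linalg.norm(arms[:, None] - arms[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    i, j = np.unravel_index(np.argmin(dists), dists.shape)
    phi = arms[i] - arms[j]                     # the two closest arms in l2-distance
    return arms, theta, phi, 0.0                # tau = 0 is an assumption, not stated in the text
```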
Baselines. We compare ACOL and G-ACOL to various baselines. The Oracle solution uses knowledge of the true constraint parameter to choose the best possible static allocation (cf. Appendix B.1). In practice, we cannot implement the oracle because we do not know the constraint parameter; but it yields a performance upper bound to which we can compare other algorithms. G-Allocation uses a static allocation that uniformly reduces uncertainty (cf. Appendix B.2), whereas Uniform pulls all arms with equal probability. We also consider variants of these algorithms that use the adaptive confidence intervals of Proposition 3. We call the adaptive version of G-Allocation Greedy Max Var because it greedily selects the arm with the highest uncertainty estimate from $\mathcal{U}_t$, and we call uniform sampling with the adaptive confidence intervals Adaptive Uniform. For all algorithms that use adaptive confidence intervals, in addition to the version using Proposition 3, we test a tuned version that treats $\beta_t$ as a numeric hyperparameter instead (indicated by the name of the algorithm followed by "(tuned)"). We chose $\beta_t = 1/4$ for all experiments, which we determined from minimal tuning on the irrelevant dimensions instance for the Greedy Max Var algorithm. For clarity, we omit a few of the baselines that perform poorly from our plots. Appendix D provides the full results.

Results. Figure 2 shows our results on the synthetic CBAI instances. All algorithms find the correct solution, but their sample efficiency varies widely. Of all algorithms with theoretical guarantees, the (unrealistic) oracle solution needs the fewest iterations, as expected. But ACOL gets close to the oracle performance and outperforms G-Allocation and uniform sampling in all cases. For example, if we increase the number of irrelevant dimensions in the first experiment, G-Allocation and uniform sampling need more samples to determine which dimension is relevant. In contrast, ACOL quickly focuses on the relevant dimension. Therefore, the number of iterations it needs does not increase when adding irrelevant dimensions to the problem, similar to the oracle solution.

Methods that use adaptive confidence intervals with the $\beta_t$ suggested by Proposition 3 turn out to be less sample efficient than their round-based counterparts using static confidence intervals, with G-ACOL performing worse than ACOL. The reason for this is that the confidence interval in Proposition 3 is quite loose. We can heuristically choose smaller confidence intervals and consider $\beta_t$ as a tunable hyperparameter. We find that we can achieve orders of magnitude better sample complexity without much tuning and still always find the correct solution. Even though this approach loses the theoretical guarantees, it could be very valuable in practical applications.

4.2. Comparing ACOL to Regret Minimization

To highlight the difference between our constrained linear best-arm identification setting and regret minimization with constraints, we perform an experiment comparing G-ACOL to the approaches by Amani et al. (2019) and Moradipari et al. (2021).

Figure 3. We compare G-ACOL to Max Rew-F and Max Rew-U, which adapt regret minimization approaches to the CBAI setting. We focus on a simple 1-dimensional problem, where we ensure the set of feasible arms is connected. We find that Max Rew-F is particularly sample inefficient because it only selects arms that are certainly feasible. Max Rew-U is also less sample efficient than G-ACOL because it selects arms with high reward over other arms that would be more informative during exploration.

The algorithm by Amani et al. (2019) performs UCB and the algorithm by Moradipari et al. (2021) performs Thompson sampling, both within the set of certainly feasible arms. We can translate both approaches to our setting with known rewards by greedily selecting arms from $\mathcal{F}_t$ w.r.t. their reward. Because we do not start with a known safe arm, we add an additional phase in which we select arms randomly until $\mathcal{F}_t$ is not empty. Let us call this approach Max Rew-F. As a hybrid of this approach and ACOL, we can design an algorithm that greedily selects arms from $\mathcal{U}_t$ w.r.t. their reward. Let us call this algorithm Max Rew-U.
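For reference, the Max Rew-U selection rule amounts to a one-line change of the arm choice, as in the sketch below; Max Rew-F additionally restricts the choice to $\mathcal{F}_t$ and needs the random warm-up phase described above.

```python
import numpy as np

def max_rew_u_choice(arms, theta, uncertain_mask):
    """Pick the highest-reward arm among the still-uncertain arms (Max Rew-U)."""
    rewards = np.where(uncertain_mask, arms @ theta, -np.inf)
    return int(np.argmax(rewards))
```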
Unfortunately, Max Rew-F gets stuck in our synthetic instances because we do not make any assumptions on the safe set, such as convexity and compactness. To evaluate these algorithms, we therefore consider a third synthetic instance in which the safe set is connected. We consider 10 arms in $d = 1$ that are equally spaced between 0 and 1. The reward and constraint vectors are $\theta = \phi = 1$, and the threshold is $\tau = 0.25$. Here the safe set is connected, but we can learn the constraint boundary more efficiently if we are allowed to violate the constraint during exploration.

We compare G-ACOL to Max Rew-F and Max Rew-U in Figure 3. We find that G-ACOL explores much more efficiently than both of the other approaches. Max Rew-F is particularly sample inefficient because it ensures feasibility during exploration, which is not necessary in our case. In Appendix D, we provide results for Max Rew-U in all of our environments. We cannot provide these results for Max Rew-F because it gets stuck in all other environments.

[Figure 4: reward and constraint value of the penalized solution as a function of the reward penalty λ, for the three scenarios.]
Figure 4. We quantify our finding that learning constraints is more robust to changes in the environment than learning a penalized reward function. We consider the three scenarios from Figure 1: the base scenario, a scenario with a different goal, and a scenario with a change in the environment. We find a policy that optimizes the reward function $\theta^T x - \lambda \phi^T x$ and plot the reward and the constraint value of the solution for different values of $\lambda$. In particular, we need to choose a different value of $\lambda$ for each environment to find the best solution with a constraint value below 1. The dashed horizontal lines in the reward plot show the reward a constrained solution obtains on the corresponding instance, which does not require any tuning. For each scenario, the smallest $\lambda$ we find to yield a feasible solution still gives a worse solution in terms of reward than the constrained solution.

4.3. Preference Learning Experiments

We now consider the application that initially motivated us to define the CBAI problem. As discussed in Section 1, we are interested in situations where the reward parameter $\theta$ describes an easy-to-specify goal or metric, and the constraint parameter $\phi$ describes expensive-to-evaluate human preferences. As an example of this, we consider a driving simulator, which Sadigh et al. (2017) originally introduced to study learning reward functions that represent human preferences about driving behavior. Instead, we change the setting to have the reward $\theta$ represent an easy-to-specify goal such as "drive at velocity v", and the constraint $\phi$ represent other driving rules such as "usually drive in a lane" or "don't get too close to other cars", as shown in Figure 1. Appendix C provides more details on the environment.

The decision-maker has to select a controller to drive the car from a set of precomputed controllers $\mathcal{X}$, i.e., the set of "arms". The optimal controller $x^*$ maximizes $\theta^T x$ and satisfies $\phi^T x \leq \tau$. The decision-maker can try out individual controllers to get feedback on whether they are feasible. In contrast to our previous experiments, the feedback is binary. However, we can still model it via a sub-Gaussian noise model by ensuring the constraint values are in $[0, 1]$ and interpreting them as probabilities. Therefore, this is a CBAI problem, and we can apply the same algorithms we applied to our synthetic problems.
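The binary feedback used in this experiment can be simulated as follows, using the observation model detailed in Appendix C.2 (features normalized so that $\phi^T x \in [-1, 1]$); since the noise is bounded, it is sub-Gaussian and the estimators above apply unchanged. This is an illustrative sketch, not the experiment code.

```python
import numpy as np

def binary_constraint_feedback(phi, x, rng):
    """Binary feedback y in {-1, +1} with P(y = 1) = (phi^T x + 1) / 2.

    Assumes features are normalized so that phi^T x lies in [-1, 1]
    (the model of Appendix C.2); then E[y] = phi^T x, and the bounded
    noise y - phi^T x is sub-Gaussian."""
    p = (phi @ x + 1.0) / 2.0
    return 1.0 if rng.random() < p else -1.0
```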
Robustness of learning constraints. First, we want to quantify the observation from Figure 1 that constraints can be a particularly robust representation of human preferences. Specifically, using constraints to represent human preferences can increase robustness to changes in the environment and allow transferring the constraints to different reward functions. Constraints are more robust than modeling the same preferences as a penalty on the reward function. Figure 4 quantifies this by directly comparing the two options in terms of the reward and constraint values they achieve. In particular, we find that the magnitude of the reward penalty often has to be updated if the environment changes, whereas the constraint formulation is robust to such changes.

[Figure 5: left, number of samples needed by the algorithms with theoretical guarantees (G-Allocation, Greedy Max Var, Adaptive Uniform, ...); right, number of samples and fraction of correct solutions as a function of β_t for Adaptive Uniform, Greedy Max Var, Max Rew-U, and G-ACOL.]
Figure 5. The left chart shows the number of iterations that all algorithms with theoretical guarantees need to find the correct solution in the driving scenario. ACOL is the fastest, but it still needs about $10^5$ samples. Instead, we can use heuristic confidence intervals where we consider $\beta_t$ as a hyperparameter rather than choosing the values suggested by theory. The two right plots show the number of iterations and the percentage of times the methods return a correct solution as a function of $\beta_t$. None of these algorithms is guaranteed to return the correct solution. But, empirically, we find that for $\beta_t$ beyond the vertical line, the algorithms always return the correct solution. This again shows that tuning $\beta_t$ can drastically improve the sample efficiency while still returning the correct solution empirically.

Results of learning constraints. We consider the driving scenario as a CBAI problem and study learning the constraint function. Here, we only report results for the base scenario in Figure 1. Appendix D contains results for the other two scenarios, which are qualitatively similar. In Figure 5, we compare the performance of ACOL and other algorithms with theoretical correctness guarantees to versions of these algorithms with heuristic confidence intervals. In both cases ACOL or G-ACOL is the most sample efficient algorithm. By choosing the heuristic confidence intervals, we can reduce the number of samples necessary by two orders of magnitude, from about $10^5$ to $10^3$, at the cost of theoretical guarantees. In all cases, using ACOL is preferable over the alternatives because it finds the correct solution with fewer queries about the constraint function.

5. Conclusion

In many practical situations, it is natural to formalize sequential decision-making problems as optimizing a known reward function subject to unknown, expensive-to-evaluate constraints. We studied constrained linear best-arm identification (CBAI), a linear bandit setting for learning about constraints efficiently, and proposed Adaptive Constraint Learning (ACOL) to efficiently solve this problem.

Limitations and future work. Our theoretical analysis is limited to a single constraint function, which might not be appropriate for applications where the constraints are non-additive. It should be possible to extend the same theoretical ideas to multiple linear constraints that all have to be satisfied, which would allow applying ACOL to such situations. From the empirical perspective, we found that modelling human preferences as constraints rather than rewards can be more robust.
Future work should study using constraints to model human preferences in more practical applications.

Broader impact. Sample-efficient methods to learn about human preferences could help to avoid misspecified objectives in ML (Amodei et al., 2016). By focusing on learning constraints, it might be possible to make preference learning more robust and interpretable. Of course, such algorithms could be misused, but we are optimistic that robust methods to learn from humans will lead to safer ML methods overall.

Acknowledgements

This project has received funding from the Microsoft Swiss Joint Research Center (Swiss JRC) and from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 815943. We thank Ilija Bogunovic and Alexandru Țifrea for valuable feedback on early drafts of this paper.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.

Amani, S., Alizadeh, M., and Thrampoulidis, C. Linear stochastic bandits under safety constraints. In Advances in Neural Information Processing Systems, 2019.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv:1606.06565, 2016.

Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), 2010.

Bıyık, E., Palan, M., Landolfi, N. C., Losey, D. P., and Sadigh, D. Asking easy questions: A user-friendly approach to active reward learning. In Conference on Robot Learning (CoRL), 2020.

Daniel, C., Viering, M., Metz, J., Kroemer, O., and Peters, J. Active reward learning. In Proceedings of Robotics: Science and Systems (RSS), 2014.

Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, 2019.

Friedrich, P. Optimal Design of Experiments. SIAM, Society for Industrial and Applied Mathematics, 2006.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. Bayesian optimization with inequality constraints. In Proceedings of the International Conference on Machine Learning (ICML), 2014.

Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Uncertainty in Artificial Intelligence (UAI), 2014.

Hernández-Lobato, J. M., Gelbart, M. A., Adams, R. P., Hoffman, M. W., and Ghahramani, Z. A general framework for constrained Bayesian optimization using information-based search. Journal of Machine Learning Research, 17, 2016.

Kano, H., Honda, J., Sakamaki, K., Matsuura, K., Nakamura, A., and Sugiyama, M. Good arm identification via bandit feedback. Machine Learning, 108, 2019.

Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17, 2016.

Kazerouni, A., Ghavamzadeh, M., Abbasi-Yadkori, Y., and Van Roy, B. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, 2017.

Khezeli, K. and Bitar, E. Safe linear stochastic bandits. In AAAI Conference on Artificial Intelligence, 2020.

Kiefer, J. and Wolfowitz, J. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12, 1960.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.
Locatelli, A., Gutzeit, M., and Carpentier, A. An optimal algorithm for the thresholding bandit problem. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

Moradipari, A., Amani, S., Alizadeh, M., and Thrampoulidis, C. Safe linear Thompson sampling with side information. IEEE Transactions on Signal Processing, 2021.

Pacchiano, A., Ghavamzadeh, M., Bartlett, P., and Jiang, H. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.

Perrone, V., Shcherbatyi, I., Jenatton, R., Archambeau, C., and Seeger, M. Constrained Bayesian optimization with max-value entropy search. In NeurIPS 2019 Workshop on Meta-Learning, 2019.

Rubinstein, R. Y. and Kroese, D. P. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer, 2004.

Sadigh, D., Dragan, A. D., Sastry, S., and Seshia, S. A. Active preference-based learning of reward functions. In Proceedings of Robotics: Science and Systems (RSS), 2017.

Settles, B. Active Learning. Morgan & Claypool Publishers, 2012.

Soare, M. Sequential resource allocation in linear stochastic bandits. PhD thesis, Université Lille 1 - Sciences et Technologies, 2015.

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, 2014.

Sui, Y., Gotovos, A., Burdick, J., and Krause, A. Safe exploration for optimization with Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

Wang, Z., Wagenmaker, A., and Jamieson, K. Best arm identification with safety constraints. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

Wen, M. and Topcu, U. Constrained cross-entropy method for safe reinforcement learning. IEEE Transactions on Automatic Control, 2020.

A. Proofs

This section provides the full proofs of the key results of the paper: the sample complexity lower bound for CBAI problems (Appendix A.1) and the sample complexity of ACOL (Appendix A.2).

A.1. Lower Bounds

Theorem 1 (CBAI lower bound). Assume $\eta_x \sim \mathcal{N}(0, 1)$ for all $x \in \mathcal{X}$. For any CBAI problem $\nu = (\mathcal{X}, \theta, \phi)$, there exists another CBAI problem $\nu' = (\mathcal{X}, \theta, \phi')$ with the same set of actions $\mathcal{X}$ and reward parameter $\theta$ but a different constraint parameter and optimal arm, such that the expected number of iterations $\tau$ needed by any allocation strategy that can distinguish between $\nu$ and $\nu'$ with probability at least $1 - \delta$ is lower bounded as
$$\mathbb{E}[\tau] \;\geq\; 2 \log\!\left(\frac{1}{2.4\,\delta}\right) \max_{x \in \mathcal{X}_\theta(x^*_\nu)} \frac{\|x\|^2_{A_\lambda^{-1}}}{(\phi^T x)^2},$$
where $\lambda$ is the probability distribution over arms which the allocation strategy follows, i.e., $\lambda(x)$ is the probability that it pulls arm $x$, $A_\lambda = \sum_x \lambda(x)\, x x^T$ is the design matrix, and $\mathcal{X}_\theta(x^*_\nu) = \{x \in \mathcal{X} \mid \theta^T x \geq \theta^T x^*_\nu\}$ is the set of all arms with reward no less than that of $x^*_\nu$, the optimal arm for problem $\nu$.

Proof. Our proof has a similar structure to the proof of Theorem 3.1 by Soare (2015). Let us denote the optimal arm of problem $\nu$ by $x^*_\nu$ and the optimal arm of $\nu'$ by $x^*_{\nu'}$. Let $\mathcal{A}$ be a $\delta$-PAC algorithm for solving constrained linear bandit problems, and let $A$ be the event that $\mathcal{A}$ recommends $x^*_\nu$ as the optimal arm. If we denote by $P_\nu(A)$ the probability of $A$ happening for instance $\nu$, and by $P_{\nu'}(A)$ the probability for instance $\nu'$, we have $P_\nu(A) \geq 1 - \delta$ and $P_{\nu'}(A) \leq \delta$. Let $\varepsilon = \phi' - \phi$, and let $\tau$ be the stopping time of $\mathcal{A}$. Let (x1, . . .
, xτ) be the sequence of arms A pulls and (z1, . . . , zt) the corresponding observed noisy constraint values zi = x T i ϕ + ηxi with ηxi N(0, 1) being independent Gaussian noise. Now, consider the log-likelihood ratio of these observations under algorithm A: Pν(zs|xs) Pν (zs|xs) s=1 log Pν(zs|xs) s=1 log Pν(ηs) s=1 log exp( η2 s/2) exp( η 2 s /2) 1 2((zs x T s ϕ )2 (zs x T s ϕ)2) = 1 2(z2 s 2zsx T s ϕ + (x T s ϕ )2 z2 s + 2zsx T s ϕ (x T s ϕ)2) 1 2(2zsx T s (ϕ ϕ ) + (x T s ϕ x T s ϕ)(x T s ϕ + x T s ϕ)) = 1 2( 2zsx T s ε + x T s ε(x T s ϕ + x T s ε + x T s ϕ)) s=1 (x T s ε) 2zs + 2x T s ϕ + x T s ε 2 = s=1 (x T s ε) x T s ε 2 ηs Taking the expectation of this log-likelihood ratio gives: Eν[Lτ] = Eν s=1 (x T s ε) x T s ε 2 ηs s=1 (x T s ε)2 # Eν [ηs] | {z } =0 s=1 εT xsx T s ε x X Eν[τ]λ(x) εT xx T ε x X λ(x) εT xx T ε 2Eν[τ] εT Aλ ε Next, we can apply Lemma 19 from Kaufmann et al. (2016): 2Eν[τ] εT Aλ ε KL(Pν(A), Pν (A)) log 1 2.4δ Interactively Learning Preference Constraints in Linear Bandits Eν[τ] 2 log 1 1 εT Aλ ε (1) To obtain a lower bound, we now aim to find the smallest ε such that ν and ν have different constrained optimal arms. Let X θ (x) = {x X|θT x θT x} be the set of arms with higher reward than x. There are two ways we can modify ν to change its optimal arm. We can change ϕ to ϕ such that either, Case (i), the previous optimum x ν becomes infeasible in ν , or, Case (ii), a solution x ν X θ (x ν) that was infeasible in ν is now feasible in ν . We will consider both cases separately, and aim to find an ε for each case that minimizes εT Aλ ε. Case (i). We want to find ε that minimizes 1 2εT Aλε such that ϕ T x ν > 0, i.e., the previously optimal arm becomes infeasible. We can write this constraint equivalently as ϕ T x ν > 0 ϕT x ν ϕ T x ν < ϕT x ν εT x ν < ϕT x ν εT x ν ϕT x ν < 0 Which results in the following optimization problem: min ε 1 2εT Aλε s.t. εT x ν ϕT x ν + α 0, where α > 0. The Lagrangian is L(ε, γ) = 1 2εT Aλε γ(εT x ν ϕT x ν + α), and requiring L γ = 0 yields: ε = Aλε γx ν = 0 Aλε = γx ν A 1 2 λ ε = γA 1 γ = εT x ν ϕT x ν + α = 0 εT x ν = ϕT x ν α From the first equation, it follows that x ν T ε = x ν T A 1 1 2 λ ε = γx ν T A 1 λ x ν = γ x ν 2 A 1 λ x ν T ε = x ν T A 1 1 2 λ ε = 1 γ εT Aλε = 1 and therefore x ν T ε = x ν A 1 λ ε Aλ = ϕT x ν α ε Aλ = ϕT x ν α x ν A 1 λ > ϕT x ν x ν A 1 λ where the last inequality follows because α > 0 and Aλ is positive definite. Case (ii). We want to find ε that minimizes 1 2εT Aλε such that there exists an x X for which θT x > θT x ν and ϕ T x 0, i.e., x has higher reward than x ν and it is feasible in ν . We can write these constraints as θT x > θT x ν θT (x ν x) + α 0 ϕ T x 0 εT x + ϕT x 0 with α > 0. This results in the following optimization problem: min ε 1 2εT Aλε s.t. x : θT (x ν x) + α 0 εT x + ϕT x 0 The Lagrangian of this problem is L(ε, γ, δ) = 1 2εT Aλε γ(θT (x ν x) + α) δ(εT x + ϕT x) Interactively Learning Preference Constraints in Linear Bandits Requiring L δ = 0 results in ε = Aλε δx = 0 Aλε = δx A 1 2 λ ε = δA 1 δ = εT x + ϕT x = 0 εT x = ϕT x It follows that x T ε = x T A 1 1 2 λ ε = δx T A 1 λ x = δ x 2 A 1 λ x T ε = x T A 1 1 2 λ ε = 1 δ εT Aλε = 1 and therefore x T ε = x A 1 λ ε Aλ = ϕT x ε Aλ = ϕT x x A 1 λ Combining this result with the remaining constraint θT x > θT x ν which implies x X θ (x ν), we can conclude ε Aλ min x X θ (x ν) ϕT x x A 1 λ Combining cases (i) and (ii). 
We can conclude that the ε that minimizes ε 2 Aλ while still ensuring that ν has a different solution than ν, satisfies: ε Aλ min h ϕT x ν x ν A 1 λ | {z } Case (i) , min x X θ (x ν) ϕT x x A 1 λ | {z } Case (ii) But because x ν X θ (x ν), it is simply ε Aλ min x X θ (x ν) ϕT x x A 1 λ Combining this result with eq. (1), gives the final bound: Eν[τ] 2 log 1 1 minx X θ (x ν) ϕT x x A 1 λ 2 = 2 log 1 max x X θ (x ν) x 2 A 1 λ (ϕT x)2 Next, we derive the worst case bound on the quantity making up the CBAI lower bound. Proposition 1. For any CBAI problem ν, we have HCLB(ν) d/(C+ min)2, where C+ min = minx X |ϕT x|. This bound is tight, i.e, there is an instance ν, such that we have HCLB(ν) = d/(C+ min)2. HCLB(ν) = min λ max x X θ (x ν) x 2 A 1 λ (ϕT x)2 1 C+ min 2 min λ max x X θ (x ν) x 2 A 1 λ d where the last inequality uses the well-known result by Kiefer & Wolfowitz (1960). Equality holds, for example, if all x X are linearly independent and have the same constraint value C+ min. Interactively Learning Preference Constraints in Linear Bandits A.2. Adaptive Constraint Learning In this section, we analyse the sample complexity of ACOL and prove our main result. Theorem 2 (ACOL sample complexity). Assume Algorithm 1 is implemented with an ε-approximate rounding strategy. Then, after N iterations the algorithm returns an optimal arm with probability at least 1 δ, and we have: N 8 log |X| t2 t=1 min λ max x Ut x 2 A 1 λ (ϕT x)2 + t 8 log |X| t2 (1 + ε) t HCLB(ν) + t where HCLB(ν) = minλ maxx X x 2 A 1 λ /(ϕT x)2, and t = log2 C+ min . Moreover, HCLB(ν) d/(C+ min)2 Proof. Let Et := {Ut St} where St := {x X|ut ϕ(x) lt ϕ(x) 2 t}. So, Et is the event that all arms in Ut have confidence interval smaller than 2 t. We will first show that P(E1) 1 δ1 and P(Et|Et 1) 1 δt, which ensures that the set of arms we are uncertain about shrinks exponentially in the rounds t. Let x Ut. Then, using Proposition 2, and the ε-approximate rounding strategy, it holds with probability at least 1 δt that: ut ϕ(x) lt ϕ(x) 2 Nt x A 1 λ t Using the length of a round Nt = l 22t+3 log |X| (1 + ε)ρ t m , and that we select arms to reduce uncertainty in Ut, we get ut ϕ(x) lt ϕ(x) 2 t s min λ max x Ut x 2 A 1 λ 1 x A 1 λ t 2 t s min λ max x Ut x 2 A 1 λ 1 min λ max x Ut x A 1 λ Note, that x can only be in Ut if ut ϕ(x) > 0 and lt ϕ(x) 0. It follows that P(Et|Et 1) 1 δt. Now consider round t := l log2 1 C+ min m . We show P(U t = |E t) = 1. Assume E t, i.e., Ut St. Let x U t, then: |ϕT x| 2 t 2 log2 1/C+ min = C+ min which is a contradiction because otherwise x would have a smaller constraint value than C+ min. Consequently, the set of uncertain arms U t is empty and the algorithm returns the correct solution given E t. Lemma 1 shows that the unconditional probability of the algorithm returning the correct solution after round t is at least 1 δ. Finally, we can compute the total number of samples the algorithm needs to return the correct solution: t=1 22t+3 log |X| t=1 22t+3 log |X| (1 + ε)ρ t + t 8 log |X| t2 t=1 (2t)2ρ t + t = 8 log |X| t2 t=1 (2t)2 min λ max x Ut x 2 A 1 λ + t = 8 log |X| t2 t=1 min λ max x Ut x 2 A 1 λ (2 t)2 + t (a) 8 log |X| t2 t=1 min λ max x Ut x 2 A 1 λ (ϕT x)2 + t (b) 8 log |X| t2 t=1 min λ max x X x 2 A 1 λ (ϕT x)2 + t 8 log |X| t2 (1 + ε) t HCLB(ν) + t Interactively Learning Preference Constraints in Linear Bandits Algorithm 3 Round based algorithm with a generic allocation λ with hyperparamater v (1, 2). For λ argminλ maxx X θ (x ν) x A 1 λ /|ϕT x| this algorithm becomes the oracle solution. 
For λ argminλ maxx X x A 1 λ it becomes G-Allocation. 1: Input: static design λ , significance δ 2: U1 X (uncertain arms) 3: F1 (feasible arms) 4: t 1 (round) 5: while Ut = do 6: δt δ2/t2 7: Nt vt log(|X|/δt) 8: x Nt Round(λ , Nt) 9: Pull arms x1, . . . , x Nt and observe constraint values 10: t t + 1 11: Update ˆϕt and A based on new data 12: lt ϕ(x) ˆϕT t x βt x A 1 for all arms x X 13: ut ϕ(x) ˆϕT t x + βt x A 1 for all arms x X 14: Ft Ft 1 {x|ut ϕ(x) 0} 15: r maxx Ft θT x 16: Ut Ut 1 \ {x|lt ϕ(x) > 0} \ {x|ut ϕ(x) 0} \ {x|θT x < r} 17: end while 18: return x argmaxx Ft θT x where (a) follows because we showed that |ϕT x| 2 t w.h.p. for x Ut, and (b) follows simply because Ut X. In the last step, we defined HCLB(ν) = minλ max x X x 2 A 1 λ (ϕT x)2 HCLB(ν) = min λ max x X x 2 A 1 λ (ϕT x)2 1 C+ min 2 min λ max x X x 2 A 1 λ 1 C+ min 2 min λ max x Rd x 2 A 1 λ d using the result by Kiefer & Wolfowitz (1960). Lemma 1. Let E1, . . . , ET be a Markovian sequence of events such that P(E1) 1 δ1 and P(Et|Et 1) 1 δt for all t = 2, . . . , T, where δt = δ2/t2 and δ (0, 1). Et is independent of other events conditioned on Et 1. Then P(ET ) 1 δ. t=2 P(Et|Et 1) where the last inequality holds for 0 δ 1. B. Alternative Algorithms Given any static design λ , we can consider different round-based algorithms using the static confidence intervals from Proposition 2. Algorithm 3 shows the general algorithm. It uses the same stopping condition as ACOL but uses a more straightforward round length of vt log(|X|/δt) with v a hyperparameter, and a fixed static allocation. In this section, we analyze two versions of this generic algorithm that are of particular interest: the oracle solution (Appendix B.1) and G-Allocation (Appendix B.2). Interactively Learning Preference Constraints in Linear Bandits B.1. Oracle Solution The oracle solution allocates samples according to λ argminλ maxx X θ (x ν) x A 1 λ /|ϕT x| in Algorithm 3. Note that this design exactly matches the term in our instance dependent lower-bound in Theorem 1. Therefore, this is the ideal allocation to achieve good sample complexity. However, this oracle solution requires knowledge of ϕ, which we do not know in practice. As expected, this algorithm matches the sample complexity lower bound, i.e., it is instance-optimal apart from logarithmic factors. The following theorem formalizes this. Theorem 3 (Oracle sample complexity). The oracle algorithm finds the optimal solution to a constrained linear best-arm identification problem ν = (X, θ, ϕ) within N HCLB(ν) with probability at least 1 δ. Proof. Assuming a (1 + ε)-approximate rounding procedure, in round t we have: x 2 A 1 x Nt 1+ε Nt x 2 A 1 λ . It follows, similar to the proof of Theorem 2, that in round t, for each x X θ if ϕT x > ϕT x ν: ϕT x lt ϕ(x) p 2 log(|X|/δt) x A 1 x n p 2(1 + ε) log(|X|/δt)/Nt x A 1 λ A similar argument gives for x ν: ut ϕ(x ν) ϕT x ν p 2 log(|X|/δt) x ν A 1 x n p 2(1 + ε) log(|X|/δt)/Nt x ν A 1 λ Let us call the event that these confidence bounds hold Et. We have P(Et|Et 1) 1 δt. Now, consider round t = logv (2(1 + ε)HCLB(ν)) with length N t = 2(1 + ε) log(|X|/δt)HCLB(ν) . For all x X θ if ϕT x > ϕT x ν: ϕT x l t ϕ(x) 1 HCLB(ν) x A 1 λ v u u t (ϕT x)2 x 2 A 1 λ x A 1 λ |ϕT x| Note that x is infeasible and ϕT x, which implies l t ϕ(x) 0 and in turn x / U t. Similarly, u t ϕ(x ν) ϕT x ν |ϕT x ν|. x ν is feasible and ϕT x ν 0. Hence, u t ϕ(x ν) 0 and x ν / U t. 
B.1. Oracle Solution

The oracle solution allocates samples according to $\lambda^* \in \operatorname{argmin}_\lambda \max_{x \in \mathcal{X}_\theta(x^*_\nu)} \|x\|_{A_\lambda^{-1}}/|\phi^T x|$ in Algorithm 3. Note that this design exactly matches the term in our instance-dependent lower bound in Theorem 1. Therefore, this is the ideal allocation to achieve good sample complexity. However, this oracle solution requires knowledge of $\phi$, which we do not know in practice. As expected, this algorithm matches the sample complexity lower bound, i.e., it is instance-optimal apart from logarithmic factors. The following theorem formalizes this.

Theorem 3 (Oracle sample complexity). The oracle algorithm finds the optimal solution to a constrained linear best-arm identification problem $\nu = (\mathcal{X}, \theta, \phi)$ with probability at least $1-\delta$, using a number of samples $N$ that is of order $H_{CLB}(\nu)$ up to logarithmic factors.

Proof. Assuming a $(1+\varepsilon)$-approximate rounding procedure, in round $t$ we have $\|x\|^2_{A^{-1}_{x_{N_t}}} \leq \frac{1+\varepsilon}{N_t}\|x\|^2_{A_{\lambda^*}^{-1}}$, where $A_{x_{N_t}}$ denotes the design matrix of the $N_t$ arms pulled in the round. It follows, similarly to the proof of Theorem 2, that in round $t$, for each $x \in \mathcal{X}_\theta(x^*_\nu)$ with $\phi^T x > 0$, with probability at least $1-\delta_t$:
\[
\phi^T x - l_t^\phi(x) \;\leq\; \sqrt{2\log(|\mathcal{X}|/\delta_t)}\,\|x\|_{A^{-1}_{x_{N_t}}} \;\leq\; \sqrt{\frac{2(1+\varepsilon)\log(|\mathcal{X}|/\delta_t)}{N_t}}\,\|x\|_{A_{\lambda^*}^{-1}}.
\]
A similar argument gives, for $x^*_\nu$,
\[
u_t^\phi(x^*_\nu) - \phi^T x^*_\nu \;\leq\; \sqrt{2\log(|\mathcal{X}|/\delta_t)}\,\|x^*_\nu\|_{A^{-1}_{x_{N_t}}} \;\leq\; \sqrt{\frac{2(1+\varepsilon)\log(|\mathcal{X}|/\delta_t)}{N_t}}\,\|x^*_\nu\|_{A_{\lambda^*}^{-1}}.
\]
Let us call the event that these confidence bounds hold $E_t$. We have $P(E_t \mid E_{t-1}) \geq 1-\delta_t$. Now, consider round $\bar{t} = \lceil \log_v(2(1+\varepsilon)H_{CLB}(\nu)) \rceil$, which has length $N_{\bar{t}} \geq 2(1+\varepsilon)\log(|\mathcal{X}|/\delta_{\bar{t}})\,H_{CLB}(\nu)$. For all $x \in \mathcal{X}_\theta(x^*_\nu)$ with $\phi^T x > 0$:
\[
\phi^T x - l_{\bar{t}}^\phi(x) \;\leq\; \sqrt{\frac{1}{H_{CLB}(\nu)}}\,\|x\|_{A_{\lambda^*}^{-1}} \;\leq\; \sqrt{\frac{(\phi^T x)^2}{\|x\|^2_{A_{\lambda^*}^{-1}}}}\,\|x\|_{A_{\lambda^*}^{-1}} \;=\; |\phi^T x|.
\]
Since such an $x$ is infeasible, $\phi^T x = |\phi^T x| > 0$, which implies $l_{\bar{t}}^\phi(x) \geq 0$ and in turn $x \notin \mathcal{U}_{\bar{t}}$. Similarly, $u_{\bar{t}}^\phi(x^*_\nu) \leq \phi^T x^*_\nu + |\phi^T x^*_\nu|$. Because $x^*_\nu$ is feasible, $\phi^T x^*_\nu \leq 0$. Hence, $u_{\bar{t}}^\phi(x^*_\nu) \leq 0$ and $x^*_\nu \notin \mathcal{U}_{\bar{t}}$.

This implies that $\mathcal{U}_{\bar{t}} = \emptyset$ and, conditioned on $E_{\bar{t}}$, the oracle algorithm solves the problem in round $\bar{t}$ with probability 1. We can apply Lemma 1 to conclude that, unconditionally, the algorithm solves the problem by round $\bar{t}$ with probability at least $1-\delta$. Let us compute the total number of iterations necessary:
\[
N \;\leq\; \sum_{t=1}^{\bar{t}} \big\lceil 2(1+\varepsilon)\log(|\mathcal{X}|/\delta_t)\,H_{CLB}(\nu) \big\rceil \;\leq\; \bar{t}\,\Big(1 + 2(1+\varepsilon)\log\big(|\mathcal{X}|\,\bar{t}^2/\delta^2\big)\,H_{CLB}(\nu)\Big).
\]
So, $N$ is of order $H_{CLB}(\nu)$ up to logarithmic factors, concluding the proof.

B.2. G-Allocation

We obtain G-Allocation by choosing $\lambda^* \in \operatorname{argmin}_\lambda \max_{x \in \mathcal{X}} \|x\|_{A_\lambda^{-1}}$ in Algorithm 3. G-Allocation uniformly reduces the uncertainty about the constraint function for all arms. This is not ideal, because it does not focus on the arms that are plausible optimizers according to the known reward function. Still, the following theorem shows that G-Allocation achieves sample complexity of order $d/(C^+_{\min})^2$, so it matches the worst-case lower bound in Proposition 1.

Theorem 4. G-Allocation finds the optimal arm with probability at least $1-\delta$ within a number of iterations $N$ that is of order $d/(C^+_{\min})^2$ up to logarithmic factors.

Proof. As in the proof of Theorem 3, in round $t$ we have, for each $x \in \mathcal{X}$, with probability at least $1-\delta_t$:
\[
u_t^\phi(x) - l_t^\phi(x) \;\leq\; 2\sqrt{2\log(|\mathcal{X}|/\delta_t)}\,\|x\|_{A^{-1}_{x_{N_t}}} \;\leq\; 2\sqrt{\frac{2(1+\varepsilon)\log(|\mathcal{X}|/\delta_t)}{N_t}}\,\|x\|_{A_{\lambda^*}^{-1}}.
\]
Again, we call the event that these confidence bounds hold $E_t$, and we have $P(E_t \mid E_{t-1}) \geq 1-\delta_t$. Now, consider round
\[
\bar{t} = \Big\lceil \log_v\Big(8(1+\varepsilon)\min_\lambda \max_{x \in \mathcal{X}} \|x\|^2_{A_\lambda^{-1}}\big/(C^+_{\min})^2\Big)\Big\rceil
\quad\text{with length}\quad
N_{\bar{t}} \;\geq\; 8(1+\varepsilon)\log(|\mathcal{X}|/\delta_{\bar{t}})\,\min_\lambda \max_{x \in \mathcal{X}} \|x\|^2_{A_\lambda^{-1}}\big/(C^+_{\min})^2.
\]
For all $x \in \mathcal{U}_{\bar{t}}$, it follows that
\[
u_{\bar{t}}^\phi(x) - l_{\bar{t}}^\phi(x) \;\leq\; \sqrt{\frac{(C^+_{\min})^2}{\min_\lambda \max_{x' \in \mathcal{X}} \|x'\|^2_{A_\lambda^{-1}}}}\,\|x\|_{A_{\lambda^*}^{-1}} \;\leq\; C^+_{\min}.
\]
This implies that G-Allocation solves the problem in round $\bar{t}$ with probability 1, conditioned on $E_{\bar{t}}$, similar to the proof of Theorem 2 (every arm's confidence interval then excludes 0, since $|\phi^T x| \geq C^+_{\min}$ for all $x$). We can apply Lemma 1 to conclude that, unconditionally, the algorithm solves the problem by round $\bar{t}$ with probability at least $1-\delta$. Let us compute the total number of iterations necessary:
\[
N \;\leq\; \sum_{t=1}^{\bar{t}} \Big\lceil 8(1+\varepsilon)\log(|\mathcal{X}|/\delta_t)\,\min_\lambda \max_{x \in \mathcal{X}} \|x\|^2_{A_\lambda^{-1}}\big/(C^+_{\min})^2 \Big\rceil
\;\leq\; \bar{t}\Big(1 + 8(1+\varepsilon)\log\big(|\mathcal{X}|\,\bar{t}^2/\delta^2\big)\,\min_\lambda \max_{x \in \mathcal{X}} \|x\|^2_{A_\lambda^{-1}}\big/(C^+_{\min})^2\Big)
\;\leq\; \bar{t}\Big(1 + 8(1+\varepsilon)\log\big(|\mathcal{X}|\,\bar{t}^2/\delta^2\big)\,d\big/(C^+_{\min})^2\Big),
\]
where the last inequality uses the result by Kiefer & Wolfowitz (1960). So, $N$ is of order $d/(C^+_{\min})^2$ up to logarithmic factors.

C. Experimental Details About the Driving Environment

This section provides details on the driving environment we use in Section 4.3. We provide full source code for all of our experiments at: https://github.com/lasgroup/adaptive-constraint-learning

We extend the Driver environment proposed by Sadigh et al. (2017) and Bıyık et al. (2020) to incorporate different tasks. Here, we provide a brief description of the dynamics and features of the environment. The Driver environment uses point-mass dynamics with a continuous state and action space. The state $s = (x, y, \phi, v)$ consists of the agent's position $(x, y)$, its heading $\phi$, and its velocity $v$. The actions $a = (a_1, a_2)$ consist of a steering input and an acceleration. The environment dynamics are given by
\[
s_{t+1} = (x_{t+1}, y_{t+1}, \phi_{t+1}, v_{t+1}) = \big(x_t + \Delta x,\; y_t + \Delta y,\; \phi_t + \Delta\phi,\; \mathrm{clip}(v_t + \Delta v, -1, 1)\big),
\]
\[
(\Delta x, \Delta y, \Delta\phi, \Delta v) = \big(v\cos\phi,\; v\sin\phi,\; v a_1,\; a_2 - \alpha v\big),
\]
where $\alpha = 1$ is a friction parameter, and the velocity is clipped to $[-1, 1]$ at each timestep. The environment represents a highway with three lanes. In addition to the agent, the environment contains a second car that moves on a predefined trajectory. The reward and the constraint functions are linear in a set of features $f(s) = (f_1(s), f_2(s), f_3(s), f_4(s), f_5(s), f_6(s), f_7(s), f_8(s), 1)$, which are described in detail in Table C.1.
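For concreteness, here is a minimal NumPy sketch of one step of these point-mass dynamics; the function names and the rollout helper are our own illustration of the update above, not the released implementation.

```python
import numpy as np

ALPHA = 1.0  # friction parameter alpha from the dynamics above

def driver_step(state, action, alpha=ALPHA):
    """One step of the Driver point-mass dynamics.

    state = (x, y, phi, v) with heading phi; action = (a1, a2) = (steering, acceleration).
    The velocity is clipped to [-1, 1] at every timestep."""
    x, y, phi, v = state
    a1, a2 = action
    dx, dy = v * np.cos(phi), v * np.sin(phi)
    dphi = v * a1
    dv = a2 - alpha * v
    return (x + dx, y + dy, phi + dphi, float(np.clip(v + dv, -1.0, 1.0)))

def rollout(state, actions):
    """Roll out a fixed action sequence; the dynamics are deterministic."""
    states = [state]
    for a in actions:
        state = driver_step(state, a)
        states.append(state)
    return states
```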
The (known) rewards for the three scenarios are:

Base scenario: $\theta_1 = (1, 0, 0, 0, 0, 0, 0, 0, 0)$
Different reward: $\theta_2 = (0, 1, 0, 0, 0, 0, 0, 0, 0)$
Different environment: $\theta_3 = (1, 0, 0, 0, 0, 0, 0, 0, 0)$

The (unknown) constraint is:
$\phi = (0, 0, 0.3, 0.05, 0.02, 0.5, 0.3, 0.8)$, $\tau = 1$.

Our Driver environment uses a fixed time horizon $T = 20$, and policies are represented simply as sequences of 20 actions, because the environment is deterministic.

C.1. Cross-Entropy Method for Constrained RL

We find policies in the Driver environment for a given reward function using the cross-entropy method (Rubinstein & Kroese, 2004). For the constrained reinforcement learning problem, we use a modified cross-entropy method, proposed by Wen & Topcu (2020), that takes the feasibility of solutions into account. Algorithm 4 contains pseudocode for this method.

Algorithm 4 Cross-entropy method for (constrained) reinforcement learning. For more details on the cross-entropy method, see Rubinstein & Kroese (2004), and for the application to constrained RL, see Wen & Topcu (2020).
1: Input: $n_{\text{iter}}$, $n_{\text{samp}}$, $n_{\text{elite}}$
2: Initialize policy parameters $\mu \in \mathbb{R}^d$, $\sigma \in \mathbb{R}^d$
3: for iteration $= 1, 2, \ldots, n_{\text{iter}}$ do
4:   Sample $n_{\text{samp}}$ samples $\omega_i \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma))$
5:   Evaluate policies $\omega_1, \ldots, \omega_{n_{\text{samp}}}$ in the environment
6:   if constrained problem then
7:     Sort $\omega_i$ in ascending order of constraint value $J(\omega_i)$
8:     Let $E$ be the first $n_{\text{elite}}$ policies
9:     if $J(\omega_{n_{\text{elite}}}) \leq 0$ then
10:      Sort $\{\omega_i \mid J(\omega_i) \leq 0\}$ in descending order of return $G(\omega_i)$
11:      Let $E$ be the first $n_{\text{elite}}$ policies
12:    end if
13:  else
14:    Sort $\omega_i$ in descending order of return $G(\omega_i)$
15:    Let $E$ be the first $n_{\text{elite}}$ policies
16:  end if
17:  Fit a Gaussian distribution with mean $\mu$ and diagonal covariance $\sigma$ to $E$
18: end for
19: return $\mu$

C.2. Binary Feedback

So far, we considered numerical observations of the constraint value $\phi^T x + \eta$, where $\eta$ is subgaussian noise. In the driving environment, we use (more realistic) binary observations in $\{-1, 1\}$. If we assume that all true constraint values are in $[-1, 1]$, we can define the observation model $P(y = 1 \mid \phi, x) = (\phi^T x + 1)/2$. We can consider this as bounded, subgaussian noise on the constraint value, and so all of our analysis still applies.

To translate learning the unknown constraint function in the Driver environment into a constrained linear best-arm identification problem, we consider a set of pre-computed policies $\Pi$. This set of policies corresponds to the arms of a linear bandit problem, and both the return $G(\pi)$ of a policy and the constraint function $J(\pi)$ are linear in the expected feature counts of the policy: $G(\pi) = f(\pi)^T r$ and $J(\pi) = f(\pi)^T c$. For binary observations, we normalize the features of all policies such that all constraint values are between $-1$ and $1$.
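As a small illustration of this observation model (with variable names of our own choosing), the following sketch shows that a label drawn from $P(y = 1 \mid \phi, x) = (\phi^T x + 1)/2$ is an unbiased, bounded-noise observation of the constraint value, which is why the least-squares machinery from the numeric-feedback setting carries over.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_feedback(phi, x):
    """Sample y in {-1, +1} with P(y = 1 | phi, x) = (phi @ x + 1) / 2.
    Assumes the true constraint value phi @ x lies in [-1, 1]."""
    p = (phi @ x + 1.0) / 2.0
    return 1.0 if rng.random() < p else -1.0

# The label itself is an unbiased observation of the constraint value:
# E[y] = 1 * p + (-1) * (1 - p) = 2p - 1 = phi @ x,
# and the noise y - phi @ x is bounded (hence subgaussian).
phi = np.array([0.2, -0.5, 0.1])
x = np.array([0.5, 0.3, -0.2])
samples = [binary_feedback(phi, x) for _ in range(20000)]
print(np.mean(samples), phi @ x)  # the two values should be close
```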
D. Additional Experimental Results

Here, we provide additional results for the experiments discussed in the main paper. Full results are shown in Figure 6 for the bandit experiments and in Figure 7 for the driving scenarios. Table D.2 contains an overview of all algorithms and baselines that we evaluated. We find that methods that select arms from $\mathcal{U}$ randomly (G-ACOL Uniform) or by maximizing the reward (Max Rew-U) can perform quite well in some cases with tuned confidence intervals. Indeed, Max Rew-U outperforms G-ACOL in the unit-sphere experiment. However, this is not consistent across environments, and G-ACOL performs comparably or better in all other environments. Still, in cases where theoretical guarantees are not required, these heuristic approaches might be valuable alternatives.

Table C.1. Features for representing the reward and constraint functions in the Driver environment. The last four columns contain the reward weights for the three scenarios and the constraint weights, which are shared across scenarios.

| Feature | Description | Type | Definition | θ1 | θ2 | θ3 | φ |
|---|---|---|---|---|---|---|---|
| f1(s) | Target velocity | Numeric | −(v − 0.4)² | 1 | 0 | 1 | 0 |
| f2(s) | Target location | Numeric | −(x − xr)², xr: center of the right lane | 0 | 1 | 0 | 0 |
| f3(s) | Stay on street | Binary | 1 iff off street | 0 | 0 | 0 | 0.3 |
| f4(s) | Stay in lane | Numeric | 1/(1 + exp(−bd + a)), d: distance to the closest lane center, b = 10000, a = 10 | 0 | 0 | 0 | 0.05 |
| f5(s) | Stay aligned with street | Numeric | abs(cos(φ)), φ: heading | 0 | 0 | 0 | 0.02 |
| f6(s) | Don't drive backwards | Binary | 1 iff v < 0 | 0 | 0 | 0 | 0.5 |
| f7(s) | Stay within speed limit | Binary | 1 iff v > 0.6 | 0 | 0 | 0 | 0.3 |
| f8(s) | Don't get too close to other cars | Numeric | exp(−b(c1 dx² + c2 dy²) + ba), dx, dy: distance to the other car, a = 0.01, b = 30, c1 = 4, c2 = 1 | 0 | 0 | 0 | 0.8 |

Table D.2. Overview of all algorithms we evaluate.

| Name | Confidence Intervals | Selection Criterion | Select From Arms |
|---|---|---|---|
| Oracle | static | Oracle | All |
| G-Allocation | static | Max Var | All |
| Uniform | static | Uniform | All |
| ACOL | static | Max Var | Uncertain |
| Greedy Max Var | adaptive | Max Var | All |
| Adaptive Uniform | adaptive | Uniform | All |
| Max Rew-U | adaptive | Max Rew | Uncertain |
| Max Rew-F | adaptive | Max Rew | Feasible |
| G-ACOL | adaptive | Max Var | Uncertain |
| G-ACOL Uniform | adaptive | Uniform | Uncertain |
| Greedy Max Var (tuned) | adaptive tuned | Max Var | All |
| Adaptive Uniform (tuned) | adaptive tuned | Uniform | All |
| G-ACOL (tuned) | adaptive tuned | Max Var | Uncertain |
| G-ACOL Uniform (tuned) | adaptive tuned | Uniform | Uncertain |
| Max Rew-U (tuned) | adaptive tuned | Max Rew | Uncertain |
| Max Rew-F (tuned) | adaptive tuned | Max Rew | Feasible |

(Figure 6 shows the number of samples required by each method for (a) the irrelevant-dimensions experiment, as a function of ε and of the dimension, and (b) the unit-sphere experiment, as a function of the number of arms and of the dimension.)

Figure 6. Similar plots to Figure 2, including some additional algorithms: the non-tuned versions of the algorithms use the confidence interval from Proposition 3, and G-ACOL Uniform is G-ACOL with uniform sampling instead of the maximum-variance objective. Table D.2 provides an overview of all baselines. Moreover, the plots here show the 25th and 75th percentiles over 30 random seeds. For "irrelevant dimensions", these are close to the median, but for "unit sphere", there is much more randomness because the instances are randomly generated.
(Figure 7 plots "solution correct" against "number of samples" for (a) the base scenario, (b) the different reward scenario, and (c) the different environment scenario; methods shown: Adaptive Uniform, Greedy Max Var, G-ACOL, Max Rew-U, G-ACOL Uniform, and G-Allocation.)

Figure 7. Similar plots as in Figure 5 for all three driving scenarios from Figure 1, showing G-ACOL Uniform as an additional baseline.