# On Kernelized Multi-Armed Bandits with Constraints

Xingyu Zhou, Electrical and Computer Engineering, Wayne State University, Detroit, MI, USA. xingyu.zhou@wayne.edu
Bo Ji, Computer Science, Virginia Tech, Blacksburg, VA, USA. boji@vt.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

We study a stochastic bandit problem with a general unknown reward function and a general unknown constraint function. Both functions can be non-linear (even non-convex) and are assumed to lie in a reproducing kernel Hilbert space (RKHS) with a bounded norm. In contrast to the safety-type hard constraints studied in prior works, we consider soft constraints that may be violated in any round as long as the cumulative violations are small. Our ultimate goal is to study how to utilize the nature of soft constraints to attain a finer complexity-regret-constraint trade-off in the kernelized bandit setting. To this end, leveraging primal-dual optimization, we propose a general framework for both algorithm design and performance analysis. This framework builds upon a novel sufficient condition, which not only is satisfied under general exploration strategies, including upper confidence bound (UCB), Thompson sampling (TS), and new ones based on random exploration, but also enables a unified analysis for showing both sublinear regret and sublinear or even zero constraint violation. We demonstrate the superior performance of our proposed algorithms via numerical experiments based on both synthetic and real-world datasets. Along the way, we also make the first detailed comparison between two popular methods for analyzing constrained bandits and Markov decision processes (MDPs) by discussing the key differences and some subtleties in the analysis, which could be of independent interest to the communities.

1 Introduction

Stochastic bandit optimization of an unknown function f has recently gained increasing popularity due to its widespread real-life applications, such as recommendations [1], cloud resource configuration [2], and wireless power control [3]. At each time t, an action $x_t$ is selected and then a (noisy) bandit reward feedback $y_t$ is observed. The goal is to maximize the cumulative reward or, equivalently, to minimize the total regret due to not choosing the optimal action in hindsight. To capture a generic function, researchers have turned to nonparametric models of f via Gaussian processes or reproducing kernel Hilbert spaces (RKHSs), which are able to uniformly approximate an arbitrary continuous function over a compact set [4]. In this paper, as in [5, 6], we consider the agnostic setting (i.e., frequentist-type), where f is assumed to be a fixed function in an RKHS with a bounded norm (i.e., a measure of smoothness). We call this setting frequentist-type kernelized bandits (KB).

In addition to a generic non-linear or even non-convex function f, another common feature in practical applications is that there often exist additional constraints in the decision-making process, such as hard constraints like safety or soft constraints like cost. To this end, there have been exciting recent advances in the theoretical analysis of constrained kernelized bandits. In particular, [7–9] propose algorithms with convergence guarantees, while [10], to the best of our knowledge, is the first work that establishes regret bounds for their developed algorithm, although under the Bayesian-type¹ setting.
These algorithms mainly focus on KB with a hard constraint such as safety, i.e., the selected action in each round needs to satisfy the constraint with high probability. Hence, compared to the unconstrained case, additional computation is required to construct a safe action set in each round, which not only incurs an additional complexity burden, but also often leads to conservative performance.

Motivations. In practice, there are also many applications that involve soft constraints that may be violated in any round. The goal is to maximize the total reward while minimizing the total constraint violation. For example, in a wireless networking system, the reward could be the throughput while the constraint is that the average energy consumption stays below a threshold. In this case, the constraint can be violated at any time, and moreover, using less energy at one time instant allows one to use more at the next, i.e., compensation across time (see a formal definition in Eq. (1)). Furthermore, existing provably efficient algorithms [7–10] largely build on upper confidence bound (UCB) exploration, which often has inferior empirical performance compared to Thompson sampling (TS) exploration. Hence, another key question is whether one can design provably efficient constrained KB (CKB) algorithms with general exploration strategies. In summary, the following fundamental theoretical question remains open:

Can a finer complexity-regret-constraint trade-off be attained in CKB under general explorations?

Contributions. In this paper, we take a systematic approach to affirmatively answer the above fundamental question. In particular, we tackle the complexity-regret-constraint trade-off by formulating KB under soft constraints as a stochastic bandit problem where the objective is to maximize the cumulative reward while minimizing the cumulative constraint violation and maintaining the same computational complexity as in the unconstrained case. Our detailed contributions are as follows.

- We develop a unified framework for CKB based on primal-dual optimization, which can guarantee both sublinear reward regret and sublinear total constraint violation under a class of general exploration strategies, including UCB, TS, and new effective ones (e.g., random exploration), with the same complexity as in the unconstrained case. We also show that by introducing slackness in the dual update, one can trade regret to achieve bounded or even zero constraint violation. This framework builds upon a novel sufficient condition, which not only facilitates the design of new CKB algorithms but also provides a unified view of the performance analysis.

- We demonstrate the superior performance of our proposed algorithms via numerical experiments based on both synthetic and real-world data. In addition, we discuss the benefits of our algorithms in terms of various practical considerations such as low complexity, scalability, robustness, and flexibility.

- We also provide the first detailed comparison between two popular methods for analyzing constrained bandits in general. Specifically, the first one is based on convex optimization tools as in [11, 12], which is also the inspiration for our paper. The other one is based on a Lyapunov-drift argument as in [13–15]. We discuss the key differences in the regret and constraint violation analyses of these two methods and highlight the subtlety in applying a standard queueing technique (i.e., Hajek's lemma [16]) to bound the constraint violation in the second method. We believe this provides a clear picture of the methodology, which is of independent interest to the communities.
1.1 Related Work

In the special cases of KB, including multi-armed bandits (MAB) and linear bandits (e.g., KB with a linear kernel), there is a large body of work on bandits with different types of constraints, including knapsack bandits [17–19], conservative bandits [20–22], bandits with fairness constraints [23, 24], bandits with hard safety constraints [25–27], and bandits with cumulative soft constraints [28, 13]. Among them, the bandit setting with cumulative soft constraints is the closest to ours. In particular, [13] considers linear bandits under UCB exploration, and a zero constraint violation is attained via the Lyapunov-drift method. However, it is unclear how to generalize it to handle general exploration strategies such as TS; see a further discussion in Section 4.

¹In the Bayesian-type KB, f is assumed to be a sample from a Gaussian process and the observation noise is Gaussian. In contrast, in our considered frequentist-type KB, f is a fixed function in an RKHS and the noise can be any sub-Gaussian noise. A better regret bound can often be achieved in the easier Bayesian-type KB setting [6].

Broadly speaking, our work is also related to reinforcement learning (RL) with soft constraints, i.e., constrained MDPs. In particular, our analysis is inspired by those on constrained MDPs [11, 12] (which represent another popular method for handling constrained bandits and MDPs via convex optimization tools), but has significant differences. First, in those works, the constraint violation is $\tilde{O}(\sqrt{T})$. In contrast, ours can attain bounded and even zero constraint violation by introducing slackness in the dual update. Second, they only consider UCB-type exploration, whereas our algorithms can be equipped with various exploration strategies (including UCB), thanks to our general sufficient condition. Third, they focus on either tabular or linear function approximation settings. In contrast, both the objective and constraint functions we consider can be nonlinear. There are also recent works on constrained MDPs that claim to achieve bounded or zero constraint violation [14, 15] based on the Lyapunov-drift method. However, as in the bandit case, it is unclear how to generalize this approach to explorations beyond UCB.

Finally, the theoretical understanding of unconstrained KB (a.k.a. Gaussian process bandits) is well established, including regret upper bounds under UCB and TS [5, 29, 6] and lower bounds [30, 31]. We remark that our work is mainly a theory-guided study. In a more practical strand of KB, i.e., Bayesian optimization (BO), many BO algorithms have been developed for the constrained setting; see [32–34] and the references therein. Although these algorithms enjoy good performance in various practical settings, their theoretical performance guarantees are still unclear.

2 Problem Formulation and Preliminaries

We consider a stochastic bandit optimization problem with soft constraints. In particular, in each round $t \in \{1, 2, \ldots, T\}$, a learning agent chooses an action $x_t \in \mathcal{X}$ and receives a bandit reward feedback $r_t = f(x_t) + \epsilon_t$, where $\epsilon_t$ is zero-mean noise. The learning agent also observes a bandit constraint feedback $c_t = g(x_t) + \eta_t$, where $\eta_t$ is zero-mean noise. Under soft constraints, the goal is to maximize the cumulative reward while minimizing the cumulative constraint violation.
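To make the interaction protocol concrete, here is a minimal simulation sketch. The particular choices of $f$, $g$, the domain, and the noise scales are purely illustrative (they are not taken from the paper); Gaussian noise is used as a simple instance of sub-Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) problem instance on X = [0, 1].
f = lambda x: np.sin(3.0 * x)          # unknown reward function
g = lambda x: np.cos(3.0 * x) - 0.5    # unknown constraint function; x is "feasible" iff g(x) <= 0
R, R_tilde = 0.1, 0.1                  # noise scales for the reward and constraint feedback

def play(x_t):
    """One round of the protocol: play x_t, observe noisy bandit feedback (r_t, c_t)."""
    r_t = float(f(x_t)) + rng.normal(0.0, R)        # r_t = f(x_t) + eps_t
    c_t = float(g(x_t)) + rng.normal(0.0, R_tilde)  # c_t = g(x_t) + eta_t
    return r_t, c_t
```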
Learning Problem. Define the cumulative regret and the cumulative constraint violation as follows:

$$R(T) := T f(x^*) - \sum_{t=1}^{T} f(x_t), \qquad V(T) := \Big[\sum_{t=1}^{T} g(x_t)\Big]_+, \qquad (1)$$

where $x^* := \mathrm{argmax}_{\{x \in \mathcal{X}: g(x) \le 0\}} f(x)$ and $[z]_+ := \max\{z, 0\}$. The goal of the learning agent is to achieve both sublinear regret and sublinear constraint violation. In fact, we will establish bounds on the following stronger version of regret. Specifically, let $\pi$ be a probability distribution over the set of actions $\mathcal{X}$ (i.e., a stochastic policy), and let $\mathbb{E}_{\pi}[f(x)] := \int_{x \in \mathcal{X}} f(x)\pi(x)\,dx$ and $\mathbb{E}_{\pi}[g(x)] := \int_{x \in \mathcal{X}} g(x)\pi(x)\,dx$. We compare our achieved reward with the following baseline optimization problem: $\max_{\pi} \{\mathbb{E}_{\pi}[f(x)] : \mathbb{E}_{\pi}[g(x)] \le 0\}$, where both f and g are known, and $\pi^*$ is its optimal solution. Now, our stronger regret is defined as

$$R^+(T) := T\,\mathbb{E}_{\pi^*}[f(x)] - \sum_{t=1}^{T} f(x_t). \qquad (2)$$

Clearly, we have $R(T) \le R^+(T)$.²

²This stronger regret also allows us to use convex optimization tools, since $\mathbb{E}_{\pi}[f(x)] := \int_{x \in \mathcal{X}} f(x)\pi(x)\,dx$ is a convex function with respect to $\pi$, which will be important for our analysis.

Throughout the paper, we assume the following commonly used condition from the constrained optimization literature; see also [13, 35, 11].

Assumption 1 (Slater's condition). There is a constant $\delta > 0$ such that there exists a probability distribution $\pi_0$ that satisfies $\mathbb{E}_{\pi_0}[g(x)] \le -\delta$. Without loss of generality, we assume $\delta \le 1$.

This is a quite mild assumption since it only requires that one can find a stochastic policy under which the expected cost is less than a strictly negative value. This is in sharp contrast to existing KB algorithms for hard constraints, which typically require the existence of an initial safe action [9, 10].

In this paper, we consider the frequentist-type regularity assumption that is typically used in unconstrained KB works (e.g., [5, 6]). Specifically, we assume that f is a fixed function in an RKHS with a bounded norm. In particular, the RKHS for f is denoted by $\mathcal{H}_k$, which is completely determined by the corresponding kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Any function $h \in \mathcal{H}_k$ satisfies the reproducing property: $h(x) = \langle h, k(\cdot, x)\rangle_{\mathcal{H}_k}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ is the inner product defined on $\mathcal{H}_k$. Similarly, for the unknown constraint function g, we assume that g is a fixed function in the RKHS defined by a kernel function $\tilde{k}$, and the RKHS for g is denoted by $\mathcal{H}_{\tilde{k}}$. We assume that the following boundedness property holds throughout the paper.

Assumption 2 (Boundedness). We assume that $\|f\|_{\mathcal{H}_k} \le B$ and $k(x, x) \le 1$ for any $x \in \mathcal{X}$, and that the noise $\epsilon_t$ is i.i.d. $R$-sub-Gaussian. Similarly, we assume that $\|g\|_{\mathcal{H}_{\tilde{k}}} \le G$ and $\tilde{k}(x, x) \le 1$ for any $x \in \mathcal{X}$, and that the noise $\eta_t$ is i.i.d. $\tilde{R}$-sub-Gaussian.

Gaussian Process Surrogate Model. We use a Gaussian process (GP), denoted by $GP(0, k(\cdot, \cdot))$, as a prior for the unknown function f, and a Gaussian likelihood model for the noise variables $\epsilon_t$, which are drawn from $N(0, \lambda)$ and are independent across t. Note that this GP surrogate model is used for algorithm design only; it does not change the fact that f is a fixed function in $\mathcal{H}_k$ and that the noise $\epsilon_t$ can be sub-Gaussian (i.e., an agnostic setting [6]). Let $[t] := \{1, 2, \ldots, t\}$. Conditioned on a set of observations $H_t = \{(x_s, r_s), s \in [t]\}$, by the properties of GPs [36], the posterior distribution for f is $GP(\mu_t(\cdot), k_t(\cdot, \cdot))$, where

$$\mu_t(x) := k_t(x)^\top (K_t + \lambda I)^{-1} R_t, \qquad (3)$$
$$k_t(x, x') := k(x, x') - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x'), \qquad (4)$$

in which $k_t(x) := [k(x_1, x), \ldots, k(x_t, x)]^\top$, $K_t := [k(x_u, x_v)]_{u, v \in [t]}$, and $R_t$ is the (noisy) reward vector $[r_1, r_2, \ldots, r_t]^\top$. In particular, we also define $\sigma_t^2(x) := k_t(x, x)$. Let $K_A := [k(x, x')]_{x, x' \in A}$ for $A \subseteq \mathcal{X}$.
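For concreteness, the posterior mean and variance in (3)-(4) can be computed with a few lines of linear algebra. Below is a minimal NumPy sketch; the kernel choice, length-scale, and function names are illustrative rather than taken from the paper.

```python
import numpy as np

def se_kernel(X1, X2, length_scale=0.2):
    # Squared-exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2)), so k(x, x) = 1.
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / (2.0 * length_scale**2))

def gp_posterior(X_train, y_train, X_query, lam=1.0, length_scale=0.2):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) of Eqs. (3)-(4) at each query point."""
    K_t = se_kernel(X_train, X_train, length_scale)                 # K_t
    k_star = se_kernel(X_train, X_query, length_scale)              # columns are k_t(x), one per query x
    A = np.linalg.solve(K_t + lam * np.eye(len(X_train)), k_star)   # (K_t + lam I)^{-1} k_t(x)
    mu = A.T @ y_train                                              # Eq. (3)
    var = np.ones(len(X_query)) - np.sum(k_star * A, axis=0)        # Eq. (4) on the diagonal; k(x, x) = 1 here
    return mu, np.maximum(var, 1e-12)                               # clip tiny negatives due to round-off
```

Here `X_train` stacks the actions $x_1, \ldots, x_t$ as rows and `y_train` stacks the rewards $r_1, \ldots, r_t$; running the same routine on the constraint observations $(x_s, c_s)$ yields the posterior $(\tilde{\mu}_t, \tilde{\sigma}_t^2)$ for $g$.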
We define the maximum information gain as

$$\gamma_t(k, \mathcal{X}) := \max_{A \subset \mathcal{X}: |A| = t} \frac{1}{2} \ln \big|I_t + \lambda^{-1} K_A\big|,$$

where $I_t$ is the $t \times t$ identity matrix. The maximum information gain plays a key role in the regret bounds of GP-based algorithms. While $\gamma_t(k, \mathcal{X})$ depends on the kernel k and the domain $\mathcal{X}$, we simply write $\gamma_t$ whenever the context is clear. For instance, if $\mathcal{X}$ is compact and convex with dimension d, then we have $\gamma_t = O((\ln t)^{d+1})$ for the squared exponential kernel $k_{\mathrm{SE}}$, $\gamma_t = O(d \ln t)$ for the linear kernel [6], and $\gamma_t = \tilde{O}\big(t^{\frac{d}{2\nu + d}}\big)$ (where $\nu$ is a smoothness hyperparameter) for the Matérn kernel $k_{\mathrm{Mat\acute{e}rn}}$ [37]. Similarly, the learning agent also uses a GP surrogate model for g, i.e., a GP prior $GP(0, \tilde{k}(\cdot, \cdot))$ and Gaussian noise $N(0, \tilde{\lambda})$. Conditioned on a set of observations $\tilde{H}_t = \{(x_s, c_s), s \in [t]\}$, the posterior distribution for g is $GP(\tilde{\mu}_t(\cdot), \tilde{k}_t(\cdot, \cdot))$, where $\tilde{\mu}_t$ and $\tilde{k}_t$ are computed in the same way as in (3) and (4).

3 A Unified Framework for Constrained Kernelized Bandits

In this section, leveraging primal-dual optimization, we develop a unified framework for both algorithm design and performance analysis in constrained kernelized bandits. In particular, we first propose a master algorithm called CKB (constrained KB), which can be equipped with very general exploration strategies. Then, we develop a novel sufficient condition, which not only provides a unified analysis of regret and constraint violation, but also facilitates the design of new exploration strategies (and hence new CKB algorithms) with rigorous performance guarantees.

Algorithm. We first explain our master algorithm CKB in Algorithm 1, which is based on primal-dual optimization. Let the Lagrangian of the baseline problem $\max_{\pi} \{\mathbb{E}_{\pi}[f(x)] : \mathbb{E}_{\pi}[g(x)] \le 0\}$ be $L(\pi, \phi) := \mathbb{E}_{\pi}[f(x)] - \phi\,\mathbb{E}_{\pi}[g(x)]$, and let the associated dual problem be $D(\phi) := \max_{\pi} L(\pi, \phi)$, with the optimal dual variable $\phi^* := \mathrm{argmin}_{\phi \ge 0} D(\phi)$. Note that since both f and g are unknown, the agent first generates estimates of them (i.e., $\tilde{f}_t$ and $\tilde{g}_t$, respectively) based on exploration strategies $\mathcal{A}_f$ and $\mathcal{A}_g$, which capture the trade-off between exploration and exploitation (line 3). Then, both estimates are truncated according to the ranges of f and g, respectively (lines 4-5), where Proj is the projection operator. The truncation is necessary for our analysis, but it does not impact the regret bound since it does not lead to any loss of useful information. Lines 6-7 then correspond to the primal optimization step that approximates $D(\phi_t)$ (i.e., approximate L by $\tilde{L}$ with f and g replaced by $\tilde{f}_t$ and $\tilde{g}_t$). The reason behind line 7 is that one of the optimal solutions of $\max_{\pi} \tilde{L}(\pi, \phi_t)$ is simply $\mathrm{argmax}_x (\tilde{f}_t(x) - \phi_t \tilde{g}_t(x))$. Line 8 is the dual update that minimizes $D(\phi_t)$ with respect to $\phi$ by taking a projected gradient step with $1/V$ being the step size. The parameter $\xi$ is chosen to be larger than the optimal dual variable $\phi^*$, and hence the projection interval $[0, \xi]$ includes the optimal dual variable. This can be achieved since the optimal dual variable is bounded under Slater's condition; in particular, we have $\phi^* \le (\mathbb{E}_{\pi^*}[f(x)] - \mathbb{E}_{\pi_0}[f(x)])/\delta$ by [38, Theorem 8.42]. Finally, line 9 is the posterior update via standard GP regression for both f and g, as computed in (3) and (4), with $\sigma_t^2(x) = k_t(x, x)$ and $\tilde{\sigma}_t^2(x) = \tilde{k}_t(x, x)$.
Algorithm 1: CKB
1: Parameters: $V$, $\xi$, $\phi_1 = 0$, $\mu_0(x) = \tilde{\mu}_0(x) = 0$, $\sigma_0(x) = \tilde{\sigma}_0(x) = 1$ for all $x$; exploration strategies $\mathcal{A}_f$ and $\mathcal{A}_g$
2: for batch $t = 1, 2, \ldots$ do
3:   Based on the posterior models, generate $\tilde{f}_t$ and $\tilde{g}_t$ using $\mathcal{A}_f$ and $\mathcal{A}_g$, respectively
4:   Truncate $\tilde{f}_t$ as $\tilde{f}_t(x) = \mathrm{Proj}_{[-B, B]}\,\tilde{f}_t(x)$
5:   Truncate $\tilde{g}_t$ as $\tilde{g}_t(x) = \mathrm{Proj}_{[-G, G]}\,\tilde{g}_t(x)$
6:   Pseudo-acquisition function: $\hat{z}_{\phi_t}(x) = \tilde{f}_t(x) - \phi_t \tilde{g}_t(x)$
7:   Choose the primal action $x_t = \mathrm{argmax}_{x \in \mathcal{X}} \hat{z}_{\phi_t}(x)$; observe $r_t$ and $c_t$
8:   Update the dual variable: $\phi_{t+1} = \mathrm{Proj}_{[0, \xi]}\big(\phi_t + \tilde{g}_t(x_t)/V\big)$
9:   Posterior model: update $(\mu_t, \sigma_t)$ and $(\tilde{\mu}_t, \tilde{\sigma}_t)$ via GP regression using the new data $(x_t, r_t, c_t)$
10: end for

Remark 1 (Computational complexity). CKB enjoys the same computational complexity as the standard unconstrained case (e.g., [5]), since the additional dual update is a simple projection and the primal optimization keeps the same flavor as in the unconstrained case, i.e., without constructing a specific safe set as in existing constrained KB algorithms designed for hard constraints.

We call CKB a master algorithm as it allows us to employ different exploration strategies (also called acquisition functions), i.e., $\mathcal{A}_f$ and $\mathcal{A}_g$. Therefore, one fundamental question is: how do we design efficient exploration strategies such that favorable performance can be guaranteed? In the following, we first use UCB-type exploration as an example to gain useful insights, which in turn will facilitate the development of a novel sufficient condition. This condition is not only satisfied under very general exploration strategies, but also enables a unified analytical framework for showing both sublinear regret and sublinear constraint violation in constrained kernelized bandits.

We first introduce the standard UCB and TS explorations under a GP surrogate, as in [5].

Definition 1 (GP-UCB and GP-TS Explorations). Suppose the posterior distribution for a black-box function h in round t is given by $GP(\hat{\mu}_{t-1}(\cdot), \hat{k}_{t-1}(\cdot, \cdot))$ and $\hat{\beta}_t$ is an increasing sequence. (i) The estimate of h in round t under GP-UCB exploration is $\tilde{h}_t(\cdot) = \hat{\mu}_{t-1}(\cdot) + \hat{\beta}_t \hat{\sigma}_{t-1}(\cdot)$, where $\hat{\sigma}_{t-1}^2(x) := \hat{k}_{t-1}(x, x)$ for all $x \in \mathcal{X}$. (ii) The estimate of h in round t under GP-TS exploration is $\tilde{h}_t(\cdot) \sim GP(\hat{\mu}_{t-1}(\cdot), \hat{\beta}_t^2 \hat{k}_{t-1}(\cdot, \cdot))$.

3.1 Warm-Up: CKB with GP-UCB Exploration

In this section, we instantiate CKB with GP-UCB exploration, called CKB-UCB, as a warm-up. In particular, in CKB-UCB, $\mathcal{A}_f$ is a GP-UCB exploration (see Definition 1) with a positive $\hat{\beta}_t$ sequence (i.e., optimistic with respect to the reward), and $\mathcal{A}_g$ is a GP-UCB exploration with a negative $\hat{\beta}_t$ sequence (i.e., optimistic with respect to the cost). This instantiation enjoys the following performance guarantee.

Theorem 1. Suppose $\xi \ge 4B/\delta$ and $V = G\sqrt{T}/\xi$. Let $\mathcal{A}_f$ be a GP-UCB exploration with $\hat{\beta}_t = \beta_t := B + R\sqrt{2(\gamma_{t-1} + 1 + \ln(2/\alpha))}$, and let $\mathcal{A}_g$ be a GP-UCB exploration with $\hat{\beta}_t = -\tilde{\beta}_t := -\big(G + \tilde{R}\sqrt{2(\tilde{\gamma}_{t-1} + 1 + \ln(2/\alpha))}\big)$. Under Slater's condition in Assumption 1 and the regularity assumptions in Assumption 2, CKB-UCB achieves the following bounds simultaneously with probability at least $1 - \alpha$ for any $\alpha \in (0, 1)$:

$$R^+(T) = O\Big(\sqrt{T\gamma_T(\gamma_T + \ln(2/\alpha))} + G\sqrt{T}\Big), \qquad V(T) = O\Big(C\sqrt{T\hat{\gamma}_T(\hat{\gamma}_T + 2\ln(2/\alpha))}\Big),$$

where $C := \max\{B, G\}$ and $\hat{\gamma}_T := \max\{\gamma_T, \tilde{\gamma}_T\}$.

Remark 2. The (reward) regret here is the stronger version, i.e., $R^+(T)$. Compared to the unconstrained case, the regret bound has an additional term $G\sqrt{T}$, which roughly captures the impact of the constraint. As in the unconstrained case, one can plug in different $\gamma_T$ and $\tilde{\gamma}_T$ to see that both regret and constraint violation are sublinear for commonly used kernels. For example, for an SE kernel, both $\gamma_T$ and $\tilde{\gamma}_T$ are on the order of $(\ln T)^{d+1}$. Finally, the standard doubling trick can be used to design an anytime algorithm (i.e., one without knowledge of T) with regret and constraint violation bounds of the same order.
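To make Algorithm 1 concrete, here is a minimal sketch of the CKB loop instantiated with GP-UCB exploration (i.e., CKB-UCB), reusing the `gp_posterior` routine sketched in Section 2. All names and parameter choices are illustrative; `eps = 0` gives the basic dual update, while a positive `eps` at most $\delta/2$ corresponds to the slack variant discussed in Section 4.

```python
import numpy as np

def ckb_ucb(X, feedback, T, B, G, xi, V, beta, beta_tilde, lam=1.0, eps=0.0):
    """Sketch of Algorithm 1 with GP-UCB exploration over a finite action set.

    X: (n, d) array of candidate actions; feedback(x) returns a noisy pair (r_t, c_t) of floats;
    beta(t), beta_tilde(t): confidence-width sequences for f and g (cf. Theorem 1).
    """
    hist_x, hist_r, hist_c = [], [], []
    phi = 0.0
    for t in range(1, T + 1):
        if hist_x:
            X_past = np.array(hist_x)
            mu_f, var_f = gp_posterior(X_past, np.asarray(hist_r, dtype=float), X, lam)
            mu_g, var_g = gp_posterior(X_past, np.asarray(hist_c, dtype=float), X, lam)
        else:
            mu_f = mu_g = np.zeros(len(X))
            var_f = var_g = np.ones(len(X))
        # Line 3: GP-UCB estimates (optimistic for the reward, optimistically low for the cost);
        # lines 4-5: truncate to the known ranges [-B, B] and [-G, G].
        f_t = np.clip(mu_f + beta(t) * np.sqrt(var_f), -B, B)
        g_t = np.clip(mu_g - beta_tilde(t) * np.sqrt(var_g), -G, G)
        # Lines 6-7: maximize the pseudo-acquisition function and play the chosen action.
        idx = int(np.argmax(f_t - phi * g_t))
        r_t, c_t = feedback(X[idx])
        # Line 8: projected dual step with step size 1/V on [0, xi] (eps is the Section 4 slack).
        phi = float(np.clip(phi + (g_t[idx] + eps) / V, 0.0, xi))
        # Line 9: record the new data; the GP posteriors are recomputed at the next round.
        hist_x.append(X[idx]); hist_r.append(r_t); hist_c.append(c_t)
    return np.array(hist_x), np.array(hist_r), np.array(hist_c)
```

Replacing the two GP-UCB estimate lines with GP-TS or RandGP-UCB draws (Definitions 1 and 2) would give CKB-TS and CKB-Rand, respectively, without touching the primal or dual updates.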
Proof Sketch of Theorem 1. The key step is to establish a sublinear bound on the term $R^+(T) + \phi \sum_{t=1}^{T} g(x_t)$ for any $\phi \in [0, \xi]$. Then, with convex optimization tools, one can establish the regret and constraint violation bounds. To this end, by the dual update of CKB, we can first show that

$$R^+(T) + \phi \sum_{t=1}^{T} g(x_t) \le T_1 + T_2 + \frac{V}{2}\phi^2 + \frac{1}{2V} T G^2,$$

where

$$T_1 := \sum_{t=1}^{T} \Big[\big(\mathbb{E}_{\pi^*}[f(x)] - \phi_t\,\mathbb{E}_{\pi^*}[g(x)]\big) - \big(\tilde{f}_t(x_t) - \phi_t \tilde{g}_t(x_t)\big)\Big], \qquad (5)$$
$$T_2 := \sum_{t=1}^{T} \Big[\big(\tilde{f}_t(x_t) - f(x_t)\big) + \phi\big(g(x_t) - \tilde{g}_t(x_t)\big)\Big]. \qquad (6)$$

Thus, the analyses for different choices of exploration strategies only differ in how to bound $T_1 + T_2$. Under GP-UCB exploration, the term $T_1 \le 0$ by optimism, while $T_2$ can be upper bounded by $T_2 \le 2\beta_T \sum_{t=1}^{T} \sigma_{t-1}(x_t) + 2\phi\tilde{\beta}_T \sum_{t=1}^{T} \tilde{\sigma}_{t-1}(x_t) = O\big(\beta_T\sqrt{\gamma_T T} + \phi\tilde{\beta}_T\sqrt{\tilde{\gamma}_T T}\big)$, i.e., a standard prediction error bound. Building on the bound on $T_1 + T_2$, one can then show the required results.

3.2 A Sufficient Condition for Provably Efficient Explorations

The above analysis reveals that the key step in obtaining sublinear performance guarantees for Algorithm 1 is to find a sublinear bound on $T_1 + T_2$, which depends on the choice of exploration strategies. To go beyond GP-UCB exploration, we will establish a sufficient condition on general exploration strategies (i.e., $\mathcal{A}_f$ and $\mathcal{A}_g$) that guarantees a sublinear bound on $T_1 + T_2$ and hence sublinear regret and sublinear constraint violation.

We first present the intuition behind the key components of our sufficient condition. Inspired by [39], it suffices to focus on the following three nice events so as to bound $T_1 + T_2$ in (5)-(6):

$$E^{\mathrm{est}} := \{E^{\mathrm{est}}_f(x, t) \cap E^{\mathrm{est}}_g(x, t);\ \forall (x, t)\}, \quad E^{\mathrm{conc}}_t := \{E^{\mathrm{conc}}_{f,t}(x) \cap E^{\mathrm{conc}}_{g,t}(x);\ \forall x\}, \quad E^{\mathrm{anti}}_t := E^{\mathrm{anti}}_{f,t} \cap E^{\mathrm{anti}}_{g,t},$$

in which

$$E^{\mathrm{est}}_f(x, t) := \big\{|f(x) - \mu_{t-1}(x)| \le c^{(1)}_{f,t}\sigma_{t-1}(x)\big\}, \quad E^{\mathrm{est}}_g(x, t) := \big\{|g(x) - \tilde{\mu}_{t-1}(x)| \le c^{(1)}_{g,t}\tilde{\sigma}_{t-1}(x)\big\},$$
$$E^{\mathrm{conc}}_{f,t}(x) := \big\{|\tilde{f}_t(x) - \mu_{t-1}(x)| \le c^{(2)}_{f,t}\sigma_{t-1}(x)\big\}, \quad E^{\mathrm{conc}}_{g,t}(x) := \big\{|\tilde{g}_t(x) - \tilde{\mu}_{t-1}(x)| \le c^{(2)}_{g,t}\tilde{\sigma}_{t-1}(x)\big\},$$
$$E^{\mathrm{anti}}_{f,t} := \big\{\mathbb{E}_{\pi^*}[\tilde{f}_t(x) - \mu_{t-1}(x)] \ge c^{(1)}_{f,t}\,\mathbb{E}_{\pi^*}[\sigma_{t-1}(x)]\big\}, \quad E^{\mathrm{anti}}_{g,t} := \big\{\mathbb{E}_{\pi^*}[\tilde{g}_t(x) - \tilde{\mu}_{t-1}(x)] \le -c^{(1)}_{g,t}\,\mathbb{E}_{\pi^*}[\tilde{\sigma}_{t-1}(x)]\big\}.$$

To see the intuition, first suppose that the events $E^{\mathrm{est}}$ and $E^{\mathrm{conc}}_t$ hold with high probability. Then, it is easy to see that the estimates are close to the true functions, and hence one can derive a bound on $T_2$ in (6). Now, suppose that the events $E^{\mathrm{est}}$ and $E^{\mathrm{anti}}_t$ hold with some positive probability. Then, one can see that the estimates are optimistic with respect to the true functions when evaluated at the optimal points. This probabilistic optimism is the key to bounding $T_1$ in (5). Note that GP-UCB exploration is optimistic with probability one by definition (see Definition 1), and hence $T_1 \le 0$ always holds.

To formally state our result, define the filtration $\mathcal{F}_t$ as all the history up to the end of round t. Let $\mathbb{E}_t[\cdot] := \mathbb{E}[\cdot \mid \mathcal{F}_{t-1}]$ and $\mathbb{P}_t(\cdot) := \mathbb{P}(\cdot \mid \mathcal{F}_{t-1})$. Then, we have the following sufficient condition.

Assumption 3 (Sufficient Condition). The sufficient condition includes two parts: (1) (Probability condition) $\mathbb{P}(E^{\mathrm{est}}) \ge 1 - p_1$, $\mathbb{P}_t(E^{\mathrm{conc}}_t) \ge 1 - p_{2,t}$, and $\mathbb{P}_t(E^{\mathrm{anti}}_t) \ge p_3 > 0$ for some time-dependent sequences $c^{(1)}_{f,t}$, $c^{(1)}_{g,t}$, $c^{(2)}_{f,t}$, and $c^{(2)}_{g,t}$. (2) (Boundedness condition) (i) There exists a positive probability $p_4$ such that $1 + 2/(p_3 - p_{2,t}) \le 1/p_4$ for all t; (ii) there exist functions $c_f(T)$ and $c_g(T)$ such that $c^{(1)}_{f,t}, c^{(2)}_{f,t} \le c_f(T)$ and $c^{(1)}_{g,t}, c^{(2)}_{g,t} \le c_g(T)$ for all t; (iii) $\sum_{t=1}^{T} p_{2,t} \le C_0$ for some constant $C_0$.
Remark 3. The above sufficient condition generalizes existing similar results [39–41] in several aspects. First, existing works mainly focus on the MAB or linear bandit settings, which are special cases of our KB setting (e.g., choosing a linear kernel leads to linear bandits). Second, while existing works only establish bounds on the expected regret, we aim to establish a high-probability bound. As a result, we need the additional boundedness condition, which, however, is required only for technical reasons. Third, in contrast to existing works that consider the unconstrained case only, we consider the constrained case, which is more challenging. Specifically, it requires that $E^{\mathrm{anti}}_t$ holds under the policy $\pi^*$ (i.e., in expectation over $\pi^*$) rather than under a single optimal action $x^*$.

With the above sufficient condition, we have the following general performance bounds for CKB.

Theorem 2. Suppose $\xi \ge 4B/\delta$ and $V = G\sqrt{T}/\xi$. Let $\Psi := B + G$. Assume that CKB is equipped with exploration strategies that satisfy the sufficient condition in Assumption 3. Then, under Slater's condition in Assumption 1 and the regularity assumptions in Assumption 2, CKB achieves the following bounds on regret and constraint violation with probability at least $1 - p_1 - \alpha$ for any $\alpha \in (0, 1)$:

$$R^+(T) = O\Big(\big(\Psi + c_f(T) + c_g(T)\big)\sqrt{T\hat{\gamma}_T}\Big), \qquad V(T) = O\Big(\big(\Psi + c_f(T) + c_g(T)\big)\sqrt{T\hat{\gamma}_T}\Big),$$

where the hidden constant factors depend on $p_4$ and $C_0$ in Assumption 3.

In the following, we will show that Theorem 2 provides a unified view of the regret and constraint violation performance of various CKB algorithms, thanks to our sufficient condition. In particular, we first show that existing exploration strategies, such as GP-UCB and GP-TS, satisfy our sufficient condition, as stated in the following corollary.

Corollary 1. Let $\beta_t = B + R\sqrt{2(\gamma_{t-1} + 1 + \ln(1/\alpha_f))}$ and $\tilde{\beta}_t = G + \tilde{R}\sqrt{2(\tilde{\gamma}_{t-1} + 1 + \ln(1/\alpha_g))}$. (i) GP-UCB with $\hat{\beta}_t = \beta_t$ and $\hat{\beta}_t = -\tilde{\beta}_t$ for $\mathcal{A}_f$ and $\mathcal{A}_g$, respectively, satisfies the sufficient condition. (ii) GP-TS with $\hat{\beta}_t = \beta_t$ and $\hat{\beta}_t = \tilde{\beta}_t$ for $\mathcal{A}_f$ and $\mathcal{A}_g$, respectively, satisfies the sufficient condition when $\pi^*$ is a deterministic policy.

The sufficient condition also enables us to design CKB algorithms with new exploration strategies. In the following, inspired by [40], we propose a new GP-based exploration strategy, which aims to strike a balance between GP-UCB and GP-TS explorations.

Definition 2 (RandGP-UCB Exploration). Suppose that the posterior distribution for a black-box function h in round t is given by $GP(\hat{\mu}_{t-1}(\cdot), \hat{k}_{t-1}(\cdot, \cdot))$. Then, the estimate of h in round t under RandGP-UCB exploration is $\tilde{h}_t(\cdot) = \hat{\mu}_{t-1}(\cdot) + \hat{Z}_t \hat{\sigma}_{t-1}(\cdot)$, where $\hat{Z}_t \sim \hat{D}$ for some distribution $\hat{D}$ and $\hat{\sigma}_{t-1}^2(x) = \hat{k}_{t-1}(x, x)$ for all $x \in \mathcal{X}$.

In contrast to GP-UCB, RandGP-UCB replaces the deterministic confidence bound with a randomized one. Compared to GP-TS, RandGP-UCB uses coupled noise, in the sense that all the actions share the same noise $\hat{Z}_t$, rather than the decoupled and correlated noise in GP-TS. This subtle difference not only helps to eliminate the additional $\ln(|\mathcal{X}|)$ factor incurred in GP-TS by the union bound, but also allows us to handle a general stochastic $\pi^*$.

Corollary 2. Let $\beta_t$ and $\tilde{\beta}_t$ be the same as in Corollary 1. RandGP-UCB with $\hat{D} = N(0, \beta_t^2)$ and $\hat{D} = N(0, \tilde{\beta}_t^2)$ for $\mathcal{A}_f$ and $\mathcal{A}_g$, respectively, satisfies the sufficient condition.

Thus, one can instantiate CKB with RandGP-UCB exploration to obtain a new algorithm called CKB-Rand, with performance guarantees given by Theorem 2. Note that RandGP-UCB with other distributions $\hat{D}$ can also satisfy the sufficient condition (as discussed in Appendix C).
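The only change relative to the loop sketched in Section 3.1 is how the estimates in line 3 of Algorithm 1 are drawn. Below is a minimal sketch of the RandGP-UCB draw (with a GP-TS draw included only for contrast); the function names are illustrative.

```python
import numpy as np

def randgp_ucb_estimate(mu, var, beta_t, rng):
    """RandGP-UCB (Definition 2): h_t(.) = mu_{t-1}(.) + Z_t * sigma_{t-1}(.),
    with a single draw Z_t ~ N(0, beta_t^2) shared by all actions (coupled noise)."""
    z_t = rng.normal(0.0, beta_t)          # one draw per round, shared across actions
    return mu + z_t * np.sqrt(var)

def gp_ts_estimate(mu, cov, beta_t, rng):
    """GP-TS (Definition 1(ii)), for contrast: sample a whole function from
    GP(mu_{t-1}, beta_t^2 * k_{t-1}), i.e., correlated per-action noise; this needs
    the full posterior covariance matrix rather than just the per-action variances."""
    return rng.multivariate_normal(mu, (beta_t ** 2) * cov)
```

In CKB-Rand, one would call `randgp_ucb_estimate` with the posterior of $f$ and width $\beta_t$, and again (with a separate draw) with the posterior of $g$ and width $\tilde{\beta}_t$, following Corollary 2.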
4 Discussions

In this section, we discuss several possible questions one may have at this point about our CKB algorithm.

The first natural question is whether one can further improve the constraint violation bound. It turns out that with a minor modification of the CKB algorithm, one can achieve bounded and even zero constraint violation by trading the regret slightly (while keeping the same order as before). The modification is to introduce a slackness $\varepsilon$ in the dual update in Algorithm 1, i.e., $\phi_{t+1} = \mathrm{Proj}_{[0, \xi]}\big(\phi_t + (\tilde{g}_t(x_t) + \varepsilon)/V\big)$ with $\varepsilon \le \delta/2$. Intuitively speaking, this can be viewed as working with a new, pessimistic constraint function. After obtaining the constraint violation under this new hypothetical constraint, one can subtract $\varepsilon T$ to find the true constraint violation under the true function g. The catch is that one also needs to change the baseline problem to $\max_{\pi} \{\mathbb{E}_{\pi}[f(x)] : \mathbb{E}_{\pi}[g(x)] + \varepsilon \le 0\}$ so that it matches the new pessimistic constraint function. Let $\pi^*_{\varepsilon}$ be the optimal solution to this new problem; the obtained regret is then only with respect to $\pi^*_{\varepsilon}$ rather than the original $\pi^*$. Thus, we need to further bound the difference $T\,\mathbb{E}_{\pi^*}[f(x)] - T\,\mathbb{E}_{\pi^*_{\varepsilon}}[f(x)]$ to obtain the true regret bound, which is done in Appendix E.

Figure 1: Experimental results on constrained kernelized bandits with light-tailed data. (a) Regret on synthetic data; (b) regret on real-world data; (c) constraint violation on real-world data.

Another question one may ask is why we are able to go beyond UCB-type exploration, whereas existing works (e.g., [13–15]) only establish performance guarantees under UCB. The answer is that they resort to another popular method for constrained optimization based on the so-called Lyapunov-drift argument. However, this method relies heavily on the optimism under UCB to derive both regret and constraint violation bounds. Thus, it is difficult to generalize it to other exploration strategies, even for the simpler linear bandit problem. In contrast, our analysis relies only on the bound on $T_1 + T_2$, which allows us to consider more general exploration strategies. In Appendix F, we also discuss how to apply the Lyapunov-drift-based method to analyze UCB exploration in our setting, through which we highlight more differences and some subtleties of the Lyapunov-drift method.

5 Simulation Results

Although our work is mainly a theoretical study, we conduct simulations to compare the performance of our algorithms (i.e., CKB-UCB, CKB-TS, and CKB-Rand, that is, CKB with GP-UCB, GP-TS, and RandGP-UCB explorations, respectively) with existing safe KB algorithms on both synthetic and real-world datasets. In particular, we consider the two most recent safe KB algorithms: StageOpt [9] (which has superior performance compared to SafeOpt [7]) and SGP-UCB [10]. Our goal is to show that our proposed CKB algorithms can trade a slight loss in constraint violation (i.e., soft constraints) for improvements in reward regret, computational complexity, and implementation flexibility, i.e., a better complexity-regret-constraint trade-off.

5.1 Synthetic Data and Light-Tailed Real-World Data
Synthetic Data. The domain $\mathcal{X}$ is generated by discretizing $[0, 1]$ uniformly into 100 points. The objective function $f(\cdot) = \sum_{i=1}^{p} a_i k(\cdot, x_i)$ is generated by uniformly sampling $a_i \in [-1, 1]$ and support points $x_i \in \mathcal{X}$, with $p = 100$. The constraint function g is generated in the same manner. The kernel is $k_{\mathrm{SE}}$ with length-scale $l = 0.2$. Other parameters, including B, R, and $\gamma_t$, are set similarly to the unconstrained case (e.g., [5]).

Light-Tailed Real-World Data. We use the light sensor data collected in the CMU Intelligent Workplace in November 2005, which is available online as a Matlab structure³ and contains the locations of 41 sensors, 601 training samples, and 192 test samples. We use it in the context of finding the maximum average reading of the sensors. In particular, f is set as the empirical average of the test samples, with B set as its maximum, and k is set as the empirical covariance of the normalized training samples. The constraint is given by $g(\cdot) = -f(\cdot) + h$ with $h = B/2$. We perform 50 trials (each with T = 10,000) and plot the mean of the cumulative regret along with error bars, as shown in Fig. 1.

³http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects/

Regret. Our three CKB algorithms achieve better (or similar) regret performance compared to the existing safe BO algorithms (see Figures 1(a) and 1(b)). Among the three CKB algorithms, CKB-Rand appears to have reasonably good performance at all times.

Constraint violation. Since $V(T) = 0$ under all the algorithms, we study the total number of rounds in which the constraint is violated, denoted by N. In the synthetic data setting, our proposed CKB algorithms have $N \le 5$ over T = 10,000 rounds; in the real-world data setting, CKB-UCB enjoys N = 0 and CKB-Rand has an average N = 38 over a horizon of T = 1,000. Furthermore, we plot the stronger cumulative constraint violation given by $\sum_{t=1}^{T} [g(x_t)]_+$ in Figure 1(c), from which we can see that all CKB algorithms achieve sublinear performance even with respect to this stronger metric, i.e., even when violation cancellation across rounds is not allowed.

Practical considerations. Our proposed CKB algorithms have the same computational complexity as the unconstrained case. In particular, they scale linearly with the number of actions in the discrete-domain case⁴. On the other hand, StageOpt scales quadratically due to the construction of the safe set, and SGP-UCB requires an additional random initialization stage, which leads to linear regret at the beginning of the learning process. Moreover, standard methods for improving the scalability of unconstrained KB can be naturally applied to our CKB algorithms [42]. Finally, both StageOpt and SGP-UCB require the knowledge of a safe action (i.e., one that satisfies the constraint) in advance; moreover, StageOpt requires f to be Lipschitz and needs to estimate the Lipschitz constant, which impacts robustness. In contrast, CKB algorithms for soft constraints only require the mild Slater condition in Assumption 1, which does not require the existence of a safe action.

Figure 2: Experimental results on constrained kernelized bandits under heavy-tailed finance data. (a) Regret on finance data; (b) average regret on finance data; (c) constraint violation on finance data.
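The two constraint-violation metrics reported in this section and the next, namely the number of violating rounds N and the stronger cumulative violation $\sum_{t=1}^{T} [g(x_t)]_+$, are straightforward to compute from the sequence of played actions; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def violation_metrics(g_values):
    """g_values[t] = g(x_t) for the actions actually played in rounds t = 1, ..., T."""
    g_values = np.asarray(g_values, dtype=float)
    V_T = max(g_values.sum(), 0.0)                    # V(T) = [sum_t g(x_t)]_+ (cancellation allowed)
    strong_V_T = np.clip(g_values, 0.0, None).sum()   # sum_t [g(x_t)]_+ (no cancellation across rounds)
    N = int((g_values > 0).sum())                     # number of rounds in which the constraint is violated
    return V_T, strong_V_T, N
```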
5.2 Heavy-Tailed Real-World Data

We further compare the different constrained KB algorithms on a new real-world dataset that exhibits heavy-tailed noise. Note that sub-Gaussian noise is required in all existing theoretical works (including ours). We use this dataset to test the robustness of the various constrained KB algorithms. The experimental results suggest that our three CKB algorithms are more robust to heavy-tailed noise, which is common in practical applications. The details of this real-world dataset are deferred to Appendix D.

Regret. We plot both the cumulative regret and the time-average regret in this setting (see Figures 2(a) and 2(b)). We observe that in the presence of heavy-tailed noise, our three CKB algorithms have a significant performance gain over the existing safe KB algorithms.

Constraint violation. We focus on the strong metric, i.e., the number of rounds in which the constraint is violated, denoted by N. CKB-Rand enjoys an average N = 21 and CKB-UCB has an average N = 47 within the horizon of T = 10,000. We also plot the stronger cumulative constraint violation given by $\sum_{t=1}^{T} [g(x_t)]_+$ in Figure 2(c), from which we can see that all CKB algorithms achieve sublinear performance even with respect to this stronger metric.

6 Conclusion

We presented a general framework for constrained KB with soft constraints via primal-dual optimization. Armed with our developed sufficient condition, this framework not only allows us to design provably efficient (i.e., sublinear reward regret and sublinear total constraint violation) CKB algorithms with both UCB and TS explorations, but also presents a unified method for designing new effective ones. By introducing slackness, our algorithm can also attain a bounded or even zero constraint violation while still achieving sublinear regret. We further performed simulations on both synthetic and real-world data that corroborate our theoretical results. Along the way, we also presented the first detailed discussion of two existing methods for analyzing constrained bandits and MDPs, highlighting interesting insights. For future work, one interesting direction is to build upon recent advances in unconstrained KB that attain improved regret (e.g., [43–46]) and study the corresponding constrained case, in the hope of maintaining the same improvement.

⁴For a continuous domain, as in the unconstrained case, one can resort to heuristic solvers (e.g., a combination of (cheap) random sampling and the "L-BFGS-B" optimization method). In fact, to attain the same order of regret bound, the solution to the acquisition maximization problem need not be exact; it only needs to maximize the acquisition function within $C/\sqrt{t}$ accuracy for some constant C at each step.

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments. XZ is partially supported by NSF grant CNS-2153220. BJ is partially supported by NSF grant CNS-2112694.

References

[1] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.

[2] Brijen Thananjeyan, Kirthevasan Kandasamy, Ion Stoica, Michael I. Jordan, Ken Goldberg, and Joseph E. Gonzalez. PAC best arm identification under a deadline. arXiv preprint arXiv:2106.03221, 2021.

[3] Mung Chiang, Prashanth Hande, and Tian Lan. Power control in wireless cellular networks. Now Publishers Inc, 2008.
[4] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(12), 2006.

[5] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445, 2017.

[6] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

[7] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997-1005. PMLR, 2015.

[8] Felix Berkenkamp, Andreas Krause, and Angela P. Schoellig. Bayesian optimization with safety constraints: Safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450, 2016.

[9] Yanan Sui, Joel Burdick, Yisong Yue, et al. Stagewise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, pages 4781-4789. PMLR, 2018.

[10] Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Regret bound for safe Gaussian process bandit optimization. In Learning for Dynamics and Control, pages 158-159. PMLR, 2020.

[11] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.

[12] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304-3312. PMLR, 2021.

[13] Xin Liu, Bin Li, Pengyi Shi, and Lei Ying. An efficient pessimistic-optimistic algorithm for stochastic linear bandits with general constraints. arXiv preprint arXiv:2102.05295, 2021.

[14] Honghao Wei, Xin Liu, and Lei Ying. A provably-efficient model-free algorithm for constrained Markov decision processes. arXiv preprint arXiv:2106.01577, 2021.

[15] Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained MDPs. arXiv preprint arXiv:2106.02684, 2021.

[16] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, pages 502-525, 1982.

[17] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29:3450-3458, 2016.

[18] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207-216. IEEE, 2013.

[19] Huasen Wu, Rayadurgam Srikant, Xin Liu, and Chong Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. arXiv preprint arXiv:1504.06937, 2015.

[20] Yifan Wu, Roshan Shariff, Tor Lattimore, and Csaba Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254-1262. PMLR, 2016.

[21] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, and Benjamin Van Roy. Conservative contextual linear bandits. arXiv preprint arXiv:1611.06426, 2016.

[22] Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Improved algorithms for conservative exploration in bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3962-3969, 2020.
[23] Yifang Chen, Alex Cuellar, Haipeng Luo, Jignesh Modi, Heramb Nemlekar, and Stefanos Nikolaidis. Fair contextual multi-armed bandits: Theory and experiments. In Conference on Uncertainty in Artificial Intelligence, pages 181-190. PMLR, 2020.

[24] Fengjiao Li, Jia Liu, and Bo Ji. Combinatorial sleeping bandits with fairness constraints. IEEE Transactions on Network Science and Engineering, 7(3):1799-1813, 2019.

[25] Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. arXiv preprint arXiv:1908.05814, 2019.

[26] Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 2827-2835. PMLR, 2021.

[27] Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Safe linear Thompson sampling with side information. arXiv preprint arXiv:1911.02156, 2019.

[28] Xin Liu, Bin Li, Pengyi Shi, and Lei Ying. POND: Pessimistic-optimistic online dispatch. arXiv preprint arXiv:2010.09995, 2020.

[29] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

[30] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723-1742. PMLR, 2017.

[31] Xu Cai and Jonathan Scarlett. On lower bounds for standard and robust Gaussian process bandit optimization. In International Conference on Machine Learning, pages 1216-1226. PMLR, 2021.

[32] David Eriksson and Matthias Poloczek. Scalable constrained Bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pages 730-738. PMLR, 2021.

[33] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607, 2014.

[34] José Miguel Hernández-Lobato, Michael A. Gelbart, Ryan P. Adams, Matthew W. Hoffman, and Zoubin Ghahramani. A general framework for constrained Bayesian optimization using information-based search. MIT Press, 2016.

[35] Hao Yu, Michael J. Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. arXiv preprint arXiv:1708.03741, 2017.

[36] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63-71. Springer, 2003.

[37] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82-90. PMLR, 2021.

[38] Amir Beck. First-Order Methods in Optimization. SIAM, 2017.

[39] Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic linear bandits. arXiv preprint arXiv:1903.09132, 2019.

[40] Sharan Vaswani, Abbas Mehrabian, Audrey Durand, and Branislav Kveton. Old dog learns new tricks: Randomized UCB for bandit problems. arXiv preprint arXiv:1910.04928, 2019.

[41] Baekjin Kim and Ambuj Tewari. Randomized exploration for non-stationary stochastic linear bandits. In Conference on Uncertainty in Artificial Intelligence, pages 71-80. PMLR, 2020.

[42] Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and Lorenzo Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, pages 533-557. PMLR, 2019.
[43] Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92-107. PMLR, 2022.

[44] Sudeep Salgia, Sattar Vakili, and Qing Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. Advances in Neural Information Processing Systems, 34:28836-28847, 2021.

[45] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pages 1227-1237. PMLR, 2021.

[46] Sattar Vakili, Jonathan Scarlett, Da-shan Shiu, and Alberto Bernacchia. Improved convergence rates for sparse approximation methods in kernel-based learning. arXiv preprint arXiv:2202.04005, 2022.

[47] Sayak Ray Chowdhury and Aditya Gopalan. Bayesian optimization under heavy-tailed payoffs. In Advances in Neural Information Processing Systems, pages 13790-13801, 2019.

[48] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312-2320, 2011.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] This is rather a theoretical work.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] In the appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] Experiments were rather run on a personal computer.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]