# Misspecified Gaussian Process Bandit Optimization

Ilija Bogunovic, Andreas Krause

**Abstract.** We consider the problem of optimizing a black-box function based on noisy bandit feedback. Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem. They heavily rely on the assumption that the model is well-specified, however, and can fail without it. Instead, we introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We design efficient and practical algorithms whose performance degrades minimally in the presence of model misspecification. Specifically, we present two algorithms based on Gaussian process (GP) methods: an optimistic EC-GP-UCB algorithm that requires knowing the misspecification error, and Phased GP Uncertainty Sampling, an elimination-type algorithm that can adapt to unknown model misspecification. We provide upper bounds on their cumulative regret in terms of $\epsilon$, the time horizon, and the underlying kernel, and we show that our algorithm achieves optimal dependence on $\epsilon$ with no prior knowledge of misspecification. In addition, in a stochastic contextual setting, we show that EC-GP-UCB can be effectively combined with the regret bound balancing strategy and attain similar regret bounds despite not knowing $\epsilon$.

## 1 Introduction

Bandit optimization has been successfully used in a great number of machine learning and real-world applications, e.g., in mobile health [42], environmental monitoring [40], economics [27], and hyperparameter tuning [26], to name a few. To scale to large or continuous domains, modern bandit approaches try to model and exploit the problem structure that is often manifested as correlations in rewards of "similar" actions. Hence, the key idea of kernelized bandits is to consider only smooth reward functions of a low norm belonging to a chosen Reproducing Kernel Hilbert Space (RKHS) of functions. This permits the application of flexible nonparametric Gaussian process (GP) models and Bayesian optimization methods via a well-studied link between RKHS functions and GPs (see, e.g., [18] for a concise review).

A vast majority of previous works on nonparametric kernelized bandits have focused on designing algorithms and theoretical bounds for the standard notions of regret (see, e.g., [40, 9, 35]). However, they solely focus on the realizable (i.e., well-specified) case in which one assumes perfect knowledge of the true function class. For example, the analysis of the prominent GP-UCB [40] algorithm assumes the model to be well-specified and ignores potential misspecification issues. As the realizability assumption may be too restrictive in real applications, we focus on the case where it may only hold approximately. In practice, model misspecifications can arise for various reasons, such as an incorrect choice of kernel, consideration of an overly smooth function class, hyperparameter estimation errors, etc. Hence, an open question is to characterize the impact of model misspecification in the kernelized setting, and to design robust algorithms whose performance degrades optimally with the increasing level of misspecification.
In this paper, we study the GP bandit problem with model misspecification in which the true unknown function might be $\epsilon$-far (as measured in the max norm) from a member of the learner's assumed hypothesis class. We propose a novel GP bandit algorithm and regret bounds that depend on the misspecification error, time horizon, and underlying kernel. Specifically, we present an algorithm that is based on the classical uncertainty sampling approach that is frequently used in Bayesian optimization and experimental design. Importantly, our main presented algorithm assumes no knowledge of the misspecification error and achieves standard regret rates in the realizable case.

**Related work on GP bandits.** GP bandit algorithms have received significant attention in recent years (e.g., [40, 11, 9, 39]). While the most popular approaches in the stochastic setting rely on upper confidence bound (UCB) and Thompson sampling strategies, a number of works also consider uncertainty sampling procedures (e.g., [10, 41, 6]). Beyond the standard setting, numerous works have also considered the contextual bandit setting (e.g., [22, 44, 21]), while the case of unknown kernel hyperparameters and misspecified smoothness has been studied in, e.g., [46, 3, 47]. [46, 3] assume that the reward function is smooth as measured by the known (or partially known) kernel, while in our problem, we allow the unknown function to lie outside a given kernel's reproducing space. Related corruption-robust GP bandits, in which an adversary can additively perturb the observed rewards, were recently considered in [4]. In Section 2, we also discuss how the misspecified problem can be seen from this perspective, where the corruption function is fixed and the adversarial budget scales with the time horizon. While the focus of [4] is on protecting against an adaptive adversary, and thus on designing randomized algorithms and regret bounds that depend on the adversarial budget, we propose deterministic algorithms and analyze the impact of misspecification. Apart from corruptions, several works have considered other robust aspects in GP bandits, such as designing robust strategies against shifts in uncontrollable covariates [5, 20, 31, 38, 7]. While they report robust regret guarantees, they still assume the realizable case only. Our goal of attaining small regret despite the wrong hypothesis class requires very different techniques from these previous works.

**Related work on misspecified linear bandits.** Recently, works on reinforcement learning with misspecified linear features (e.g., [12, 45, 17]) have renewed interest in the related misspecified linear bandits (e.g., [48], [24], [30], [15]) first introduced in [16]. In [16], the authors show that standard algorithms must suffer $\Omega(\epsilon T)$ regret under an additive $\epsilon$-perturbation of the linear model. Recently, [48] propose a robust variant of OFUL [1] that requires knowing the misspecification parameter $\epsilon$. In particular, their algorithm obtains a high-probability $O(d\sqrt{T} + \epsilon\sqrt{d}\,T)$ regret bound. In [24], the authors propose another arm-elimination algorithm based on G-experimental design. Unlike the previous works, their algorithm is agnostic to the misspecification level, and its performance matches the lower bounds. As shown in [24], when the number of actions $K$ is large ($K \gg d$), the price of $\epsilon$-misspecification must grow as $\Omega(\epsilon\sqrt{d}\,T)$ (a result by [15] further shows that this can be improved to $O(\epsilon\sqrt{K}\,T)$ in the small-$K$ regime). Our main algorithm is inspired by the proposed technique for the finite-arm misspecified linear bandit setting [24].
It works in the more general kernelized setting, uses a simpler data acquisition rule often used in Bayesian optimization and experimental design, and recovers the same optimal guarantees when instantiated with the linear kernel.

Several works have recently considered the misspecified contextual linear bandit problem with unknown model misspecification $\epsilon$. [14] introduce a new family of algorithms that require access to an online oracle for square loss regression and address the case of adversarial contexts. Concurrent work of [33] solves the case when contexts / action sets are stochastic. Both works ([14] and [33]) leverage CORRAL-type aggregation [2] of contextual bandit algorithms and achieve the optimal $O(\epsilon\sqrt{d}\,T)$ regret bound. Finally, in [32], the authors present a practical master algorithm that plays base algorithms that come with a candidate regret bound that may not hold during all rounds. The master algorithm plays base algorithms in a balanced way and suitably eliminates algorithms whose regret bound is no longer valid. Similarly to the previous works that rely on CORRAL-type master algorithms, we use the balancing master algorithm of [33] together with our GP-bandit base algorithm to provide contextual misspecified regret bounds.

Around the time of the submission of this work, a related approach that among others also considers misspecified kernel bandits appeared online [8]. [8, Theorem 3] contains the same regret scaling due to misspecification as we obtain in our results. The main difference between the two works is in the proposed algorithms and analysis techniques. Our approach does not require robust estimators and simply uses the standard ones (i.e., the GP posterior / kernelized ridge regression mean and variance estimators) that can be computed in closed form. It can also handle infinite action sets similarly to the classical Bayesian optimization algorithms; it utilizes a single acquisition function that, in practice, can be maximized via standard off-the-shelf global optimization solvers. We also present a complete treatment of the misspecified problem by showing the failure of standard UCB approaches, algorithms for known and unknown $\epsilon$, an impossibility result, and extensions to the contextual bandit setting. Finally, our main algorithm (Algorithm 2) demonstrates the use of a different acquisition function in comparison to [24, 8] that relies on standard and non-robust estimators.

**Contributions.** In this paper, we systematically handle model misspecification in GP bandits. Specifically, this paper makes the following contributions:

- We introduce a misspecified kernelized bandit problem, and for known misspecification error $\epsilon$, we present the EC-GP-UCB algorithm with enlarged confidence bounds that achieves cumulative regret $R_T = O\big(B\sqrt{\gamma_T T} + \gamma_T\sqrt{T} + \epsilon T\sqrt{\gamma_T}\big)$. Our simple lower bound argument shows that $\Omega(\epsilon T)$ regret is unavoidable in the general kernelized setting.
- For when $\epsilon$ is unknown, we propose another algorithm based on uncertainty sampling and phased exploration that achieves (up to polylog factors) the previous regret rates in the misspecified setting, and standard regret guarantees in the realizable case (when $\epsilon = 0$).
- Finally, we consider a misspecified contextual kernelized problem, and show that when action sets are stochastic, our EC-GP-UCB algorithm can be effectively combined with the regret bound balancing strategy from [33] to achieve the previous regret bounds (up to some additive lower-order terms).

## 2 Problem statement

We consider the problem of sequentially maximizing some black-box reward function $f^\star: D \to \mathbb{R}$ over a known compact set of actions $D \subset \mathbb{R}^d$. To learn about $f^\star$, the learner relies on sequential noisy bandit feedback, i.e., at every round $t$, the learner selects $x_t \in D$ and obtains a noisy observation
$$y_t = f^\star(x_t) + \eta_t, \qquad (1)$$
where we assume independent (over rounds) $\sigma$-sub-Gaussian noise $\eta_t$ (see Appendix A for the definition).

Let $k(\cdot,\cdot)$ denote a positive definite kernel function defined on $D \times D$ and $\mathcal{H}_k(D)$ be its associated Reproducing Kernel Hilbert Space (RKHS) of well-behaved functions. Suppose that before making any decision, the learner is provided with a hypothesis class
$$\mathcal{F}_k(D; B) = \{f \in \mathcal{H}_k(D) : \|f\|_k \leq B\}, \qquad (2)$$
where every member function has a bounded RKHS norm $\|f\|_k = \sqrt{\langle f, f \rangle_k} \leq B$ for some $B > 0$, which measures the complexity of $f$ with respect to the kernel $k(\cdot,\cdot)$. We consider kernel functions such that $k(x, x) \leq 1$ for every $x \in D$. The most commonly used kernel functions that satisfy this property are outlined in Appendix A.

The standard setting assumes a realizable (i.e., well-specified) scenario in which $f^\star \in \mathcal{F}_k(D; B)$, i.e., the unknown function is a member of the known RKHS with bounded norm. In contrast, in the misspecified setting, we assume that the learner is informed that $f^\star$ can be uniformly approximated by a member of the given hypothesis class $\mathcal{F}_k(D; B)$, i.e.,
$$\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \leq \epsilon, \qquad (3)$$
for some $\epsilon > 0$ and max-norm $\|\cdot\|_\infty$. Here, we note that if two functions are close in RKHS norm, then they are also pointwise close, but the reverse need not be true. Hence, the true function $f^\star$ can in principle have a significantly larger RKHS norm than any $f$ from $\mathcal{F}_k(D; B)$, or it might not even belong to the considered RKHS. We also note that in the case of continuous universal kernels [28] (i.e., due to the universal function approximation property of such kernels), any continuous function $f^\star$ on $D$ satisfies the above assumption (Eq. (3)) for any $\epsilon > 0$ and $\mathcal{F}_k(D; B)$ with a suitably large RKHS norm bound $B$. This is a difference in comparison to the previously studied misspecified linear setting, and another motivation to study the misspecified kernelized problem.

As in the standard setting (which corresponds to the case when $\epsilon = 0$), we assume that $k(\cdot,\cdot)$ and $B$ are known to the learner. The goal of the learner is to minimize the cumulative regret
$$R_T = \sum_{t=1}^{T} \Big( \max_{x \in D} f^\star(x) - f^\star(x_t) \Big), \qquad (4)$$
where $T$ is a time horizon. We denote the instantaneous regret at round $t$ as $r_t = f^\star(x^\star) - f^\star(x_t)$, where $x^\star \in \arg\max_{x \in D} f^\star(x)$, and note that $R_T = \sum_{t=1}^{T} r_t$. The learner's algorithm knows $\mathcal{F}_k(D; B)$, meaning that it takes $D$, $k(\cdot,\cdot)$ and $B$ as input. Crucially, in our most general considered problem variant, the exact value of the misspecification $\epsilon$ is assumed to be unknown, and so we seek an algorithm that can adapt to any $\epsilon > 0$, including the realizable case when $\epsilon = 0$.

Alternatively, one can consider competing with a best-in-class benchmark, i.e.,
$$\bar{R}_T = \sum_{t=1}^{T} \Big( \max_{x \in D} \tilde{f}(x) - \tilde{f}(x_t) \Big), \qquad (5)$$
where
$$\tilde{f} \in \arg\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \qquad (6)$$
and $|\tilde{f}(x) - f^\star(x)| \leq \epsilon$ for every $x \in D$. Here, the goal is to minimize cumulative regret in case the true unknown objective is $\tilde{f} \in \mathcal{F}_k(D; B)$, while the noisy observations of $\tilde{f}$ that the learner receives are at most $\epsilon$-misspecified, i.e.,
$$y_t = \underbrace{\tilde{f}(x_t) + m(x_t)}_{f^\star(x_t)} + \eta_t, \qquad (7)$$
where $m(\cdot): D \to [-\epsilon, \epsilon]$ is a fixed and unknown function.
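For concreteness, the following short Python sketch simulates the misspecified feedback model of Eqs. (1) and (7) and the cumulative-regret bookkeeping of Eq. (4). The particular choices of $\tilde f$, the perturbation $m$, the noise scale, and the placeholder random policy below are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, sigma = 0.05, 0.1             # misspecification level and noise scale (assumed values)

def f_tilde(x):
    # a smooth member of the hypothesis class, standing in for the best-in-class f~ of Eq. (6)
    return np.exp(-8.0 * (x - 0.3) ** 2)

def m(x):
    # fixed perturbation with |m(x)| <= eps, as in Eq. (7)
    return eps * np.sign(np.sin(40.0 * x))

def f_star(x):
    # true reward: f* = f~ + m
    return f_tilde(x) + m(x)

def observe(x):
    # noisy bandit feedback of Eq. (1): y_t = f*(x_t) + eta_t
    return f_star(x) + sigma * rng.standard_normal()

# cumulative regret bookkeeping of Eq. (4) on a finite grid domain
D = np.linspace(0.0, 1.0, 201)
best = f_star(D).max()
actions = rng.choice(D, size=10)   # placeholder policy: uniformly random actions
regret = np.cumsum([best - f_star(x) for x in actions])
print(regret)
```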
Since $\|\tilde{f} - f^\star\|_\infty \leq \epsilon$, the absolute difference between $R_T$ and $\bar{R}_T$ is at most $2\epsilon T$, and as will become clear in Section 3.2, this difference has no significant impact on the scalings in our main bound (since the obtained bounds already scale as $\Omega(\epsilon T)$). In our theoretical analysis, we will interchangeably use both regret definitions (from Eqs. (4) and (5)).

## 3 Algorithms for misspecified kernelized bandits

We start this section by recalling a Gaussian process (GP) framework for learning RKHS functions. Then, we present different GP-bandit algorithms and theoretical regret bounds for misspecified settings of increasing levels of difficulty.

### 3.1 Learning with Gaussian Processes

In the realizable case (i.e., when the true unknown $f^\star \in \mathcal{F}_k(D; B)$ and the learner knows the true hypothesis space $\mathcal{F}_k(D; B)$) and under the noise model described in Eq. (1), uncertainty modeling and learning in standard GP-bandit algorithms can be viewed through the lens of Gaussian process models. A Gaussian process $GP(\mu(\cdot), k(\cdot,\cdot))$ over the input domain $D$ is a collection of random variables $(f(x))_{x \in D}$ where every finite number of them $(f(x_i))_{i=1}^{n}$, $n \in \mathbb{N}$, is jointly Gaussian with mean $\mathbb{E}[f(x_i)] = \mu(x_i)$ and covariance $\mathbb{E}[(f(x_i) - \mu(x_i))(f(x_j) - \mu(x_j))] = k(x_i, x_j)$ for every $1 \leq i, j \leq n$. Standard algorithms implicitly use a zero-mean $GP(0, k(\cdot,\cdot))$ as the prior distribution over $f$, i.e., $f \sim GP(0, k(\cdot,\cdot))$, and assume that the noise variables are drawn independently across $t$ from $\mathcal{N}(0, \lambda)$ with $\lambda > 0$. After collecting new data, that is, a sequence of actions $\{x_1, \ldots, x_t\}$ and their corresponding noisy observations $\{y_1, \ldots, y_t\}$, the posterior distribution under the previous assumptions is also Gaussian, with mean and variance that can be computed in closed form as:
$$\mu_t(x) = k_t(x)^\top (K_t + \lambda I_t)^{-1} Y_t, \qquad (8)$$
$$\sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \lambda I_t)^{-1} k_t(x), \qquad (9)$$
where $k_t(x) = [k(x_1, x), \ldots, k(x_t, x)]^\top \in \mathbb{R}^{t \times 1}$, $K_t = [k(x_s, x_{s'})]_{s, s' \leq t} \in \mathbb{R}^{t \times t}$ is the corresponding kernel matrix, and $Y_t := [y_1, \ldots, y_t]$ denotes the vector of observations.

These standard modeling assumptions by themselves already constitute a form of model misspecification, as GP sample paths are rougher than RKHS functions and are not contained in $\mathcal{H}_k(D)$ with high probability. Although this leads to a mismatched hypothesis space, GPs and RKHS functions are closely related (see, e.g., [18]) when used with the same kernel function, and it is possible to use GP models to infer reliable confidence intervals on the unknown $f \in \mathcal{F}_k(D; B)$. Under this assumption, popular algorithms such as GP-UCB [40] construct statistical confidence bounds that contain $f$ with high probability uniformly over the time horizon, i.e., $|f(x) - \mu_{t-1}(x)| \leq \beta_t \sigma_{t-1}(x)$ holds for every $t \geq 1$ and $x \in D$. Here, $\{\beta_t\}_{t \leq T}$ stands for the sequence of parameters that are suitably set (see Lemma 1) to (i) trade off between exploration and exploitation and (ii) ensure the validity of the confidence bounds. In every round $t$, GP-UCB then queries the unknown function at a point $x_t \in D$ that maximizes the upper confidence bound given by $\mu_{t-1}(\cdot) + \beta_t \sigma_{t-1}(\cdot)$, with $\mu_{t-1}(\cdot)$ and $\sigma_{t-1}(\cdot)$ as defined in Eqs. (8) and (9). In the standard setting, GP-UCB is no-regret, meaning that $R_T / T \to 0$ as $T \to \infty$. However, in the misspecified setting, the previous standard confidence bounds are no longer valid and one needs to consider different strategies.
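The posterior mean and variance in Eqs. (8) and (9) reduce to a kernel-matrix solve; a minimal NumPy sketch is given below. The squared-exponential kernel, its lengthscale, and the regularization value $\lambda$ are illustrative assumptions; any kernel with $k(x, x) \leq 1$ could be substituted.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    # squared-exponential kernel matrix for 1-d inputs; satisfies k(x, x) = 1
    d = np.asarray(A).reshape(-1, 1) - np.asarray(B).reshape(1, -1)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, Y, Xq, lam=0.1, kernel=se_kernel):
    """Posterior mean (Eq. (8)) and variance (Eq. (9)) at query points Xq,
    given queried actions X and their noisy observations Y."""
    X, Y, Xq = np.asarray(X, float), np.asarray(Y, float), np.asarray(Xq, float)
    if len(X) == 0:                               # prior: mu_0 = 0, sigma_0^2(x) = k(x, x)
        return np.zeros(len(Xq)), np.diag(kernel(Xq, Xq)).copy()
    K = kernel(X, X) + lam * np.eye(len(X))       # K_t + lambda * I_t
    kq = kernel(X, Xq)                            # each column is k_t(x) for one query point x
    mu = kq.T @ np.linalg.solve(K, Y)
    var = np.diag(kernel(Xq, Xq)) - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mu, np.maximum(var, 1e-12)             # clip tiny negative values from round-off

# example: posterior after three noisy observations
X, Y = np.array([0.1, 0.5, 0.9]), np.array([0.2, 0.8, 0.1])
mu, var = gp_posterior(X, Y, np.linspace(0.0, 1.0, 5))
print(mu, np.sqrt(var))
```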
Before moving to the misspecified case, we recall an important kernel-dependent quantity known as the maximum information gain $\gamma_t(k, D)$ [40] that is frequently used to characterize regret bounds (we often write $\gamma_t$ when $k(\cdot,\cdot)$ and $D$ are clear from context). It stands for the maximum amount of information that a set of noisy observations can reveal about the unknown function sampled from a zero-mean Gaussian process with kernel $k$. Specifically, for a set of points $S \subset D$, we use $f_S$ to denote the random vector $[f(x)]_{x \in S}$, and $Y_S$ to denote the corresponding noisy observations obtained as $Y_S = f_S + \eta_S$, where $\eta_S \sim \mathcal{N}(0, \lambda I)$. The maximum information gain is then defined as:
$$\gamma_t(k, D) := \max_{S \subset D : |S| = t} I(f_S; Y_S) = \max_{S \subset D : |S| = t} \tfrac{1}{2} \ln |I_t + \lambda^{-1} K_t|, \qquad (10)$$
where $I(\cdot\,;\cdot)$ stands for the mutual information between random variables, and $|\cdot|$ is the determinant. Simply put, if samples are taken "close" to each other (far from each other) as measured by the kernel, they are more correlated (less correlated) under the GP prior and provide less (more) information. As shown in [40], the maximum information gain $\gamma_t(k, D)$ scales sublinearly with $t$ for the most commonly used kernels (see Appendix A).

### 3.2 Known misspecification error and optimistic approaches

In this section, we also use the "well-specified" mean $\mu_t(\cdot)$ and variance $\sigma_t^2(\cdot)$ estimates from Eqs. (8) and (9), where we assume that the noisy observations used in Eq. (8) correspond to a function $\tilde{f} \in \arg\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty$. We note that $\sigma_t^2(\cdot)$ does not depend on the observations (i.e., $Y_t$), and additionally, we define $\mu_t^\star(\cdot)$ that depends on the noisy observations of the true $f^\star$, i.e.,
$$\mu_t^\star(x) = k_t(x)^\top (K_t + \lambda I_t)^{-1} Y_t^\star, \qquad (11)$$
where $Y_t^\star := [y_1^\star, \ldots, y_t^\star]$ and $y_i^\star = f^\star(x_i) + \eta_i$ for $1 \leq i \leq t$. The only difference between the definitions of $\mu_t^\star(x)$ and $\mu_t(x)$ comes from the used observation vector, i.e., $Y_t^\star$ and $Y_t$, respectively. We also use the following standard result from [40, 9, 13] that provides confidence bounds around the unknown function in the realizable setting.

**Lemma 1.** Let $f(\cdot)$ be a function that belongs to the space of functions $\mathcal{F}_k(D; B)$. Assume the $\sigma$-sub-Gaussian noise model as in Eq. (1), and let $Y_{t-1} := [y_1, \ldots, y_{t-1}]$ denote the vector of previous noisy observations that correspond to the queried points $(x_1, \ldots, x_{t-1})$. Then, the following holds with probability at least $1 - \delta$ simultaneously over all $t \geq 1$ and $x \in D$:
$$|f(x) - \mu_{t-1}(x)| \leq \beta_t \sigma_{t-1}(x), \qquad (12)$$
where $\mu_{t-1}(\cdot)$ and $\sigma_{t-1}(\cdot)$ are given in Eq. (8) and Eq. (9) with $\lambda > 0$, and
$$\beta_t = \frac{\sigma}{\lambda^{1/2}} \sqrt{2 \ln(1/\delta) + \sum_{t'=1}^{t-1} \ln\big(1 + \lambda^{-1} \sigma_{t'-1}^2(x_{t'})\big)} + B. \qquad (13)$$

Next, we start by addressing the misspecified setting when $\epsilon$ is known to the learner. We consider minimizing $\bar{R}_T$ (from Eq. (5)), and provide an upper confidence bound algorithm with enlarged confidence bounds, EC-GP-UCB (see also Algorithm 1):
$$x_t \in \arg\max_{x \in D} \; \mu_{t-1}^\star(x) + \Big(\beta_t + \frac{\epsilon \sqrt{t-1}}{\sqrt{\lambda}}\Big) \sigma_{t-1}(x), \qquad (15)$$
where the confidence interval enlargement is to account for the use of the biased mean estimator $\mu_{t-1}^\star(\cdot)$ (instead of $\mu_{t-1}(\cdot)$). This can be interpreted as introducing an additional exploration bonus to the standard GP-UCB algorithm [40] in case of misspecification.

**Algorithm 1** EC-GP-UCB (Enlarged Confidence GP-UCB)
1: Require: kernel function $k(\cdot,\cdot)$, domain $D$, misspecification $\epsilon$, and parameters $B, \lambda, \sigma$
2: Set $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$, for all $x \in D$
3: for $t = 1, \ldots, T$ do
4: Choose $x_t \in \arg\max_{x \in D} \; \mu_{t-1}^\star(x) + \big(\beta_t + \epsilon\sqrt{t-1}/\sqrt{\lambda}\big)\sigma_{t-1}(x) \qquad (14)$
5: Observe $y_t^\star = f^\star(x_t) + \eta_t$
6: Update to $\mu_t^\star(\cdot)$ and $\sigma_t(\cdot)$ by using $(x_t, y_t^\star)$ according to Eq. (11) and Eq. (9)
7: end for
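For concreteness, a compact sketch of Algorithm 1 on a finite grid follows; it combines the posterior of Eqs. (8), (9), and (11) with the enlarged confidence width of Eq. (14), using the $\beta_t$ form of Lemma 1. The grid domain, the squared-exponential kernel, and the default parameter values are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    d = np.asarray(A).reshape(-1, 1) - np.asarray(B).reshape(1, -1)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior(X, Y, Xq, lam):
    # mean of Eq. (8)/(11) and variance of Eq. (9); with observations of f* the mean is mu*_t
    if len(X) == 0:
        return np.zeros(len(Xq)), np.ones(len(Xq))      # k(x, x) = 1 for this kernel
    K = se_kernel(X, X) + lam * np.eye(len(X))
    kq = se_kernel(X, Xq)
    mu = kq.T @ np.linalg.solve(K, Y)
    var = 1.0 - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mu, np.maximum(var, 1e-12)

def ec_gp_ucb(observe, D_grid, T, eps, B=1.0, lam=0.1, sigma=0.1, delta=0.05):
    """Algorithm 1 (EC-GP-UCB) on a finite grid; `observe(x)` returns a noisy value of f*."""
    X, Y, log_sum = np.empty(0), np.empty(0), 0.0
    for t in range(1, T + 1):
        mu, var = posterior(X, Y, D_grid, lam)
        # beta_t of Lemma 1, with log_sum = sum_{t'<t} ln(1 + sigma_{t'-1}^2(x_{t'}) / lam)
        beta_t = B + (sigma / np.sqrt(lam)) * np.sqrt(2.0 * np.log(1.0 / delta) + log_sum)
        width = beta_t + eps * np.sqrt(t - 1) / np.sqrt(lam)    # enlarged confidence, Eq. (14)
        idx = int(np.argmax(mu + width * np.sqrt(var)))
        x_t = D_grid[idx]
        log_sum += np.log(1.0 + var[idx] / lam)
        X, Y = np.append(X, x_t), np.append(Y, observe(x_t))
    return X, Y

# usage (e.g., with `observe` from the feedback-model sketch in Section 2):
# X, Y = ec_gp_ucb(observe, np.linspace(0.0, 1.0, 201), T=50, eps=0.05)
```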
The enlargement corresponds to the difference in the mean estimators that is captured in the following lemma.

**Lemma 2.** For any $x \in D$, $t \geq 1$ and $\lambda > 0$, we have
$$|\mu_t(x) - \mu_t^\star(x)| \leq \frac{\epsilon \sqrt{t}}{\sqrt{\lambda}} \, \sigma_t(x), \qquad (16)$$
where $\mu_t(\cdot)$ and $\mu_t^\star(\cdot)$ are defined as in Eq. (8) and Eq. (11), respectively, and $\sigma_t(\cdot)$ is from Eq. (9).

Next, we upper bound the cumulative regret of the proposed algorithm.

**Theorem 1.** Suppose the learner's hypothesis class is $\mathcal{F}_k(D; B)$ for some fixed $B > 0$ and $D \subset \mathbb{R}^d$. For any $f^\star$ defined on $D$ and $\epsilon \geq 0$ such that $\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \leq \epsilon$, EC-GP-UCB with enlarged confidence (Eq. (15)) and known $\epsilon$ achieves the following regret bound with probability at least $1 - \delta$:
$$\bar{R}_T = O\Big( B\sqrt{\gamma_T T} + \sqrt{(\ln(1/\delta) + \gamma_T)\,\gamma_T T} + \epsilon T \sqrt{\gamma_T} \Big). \qquad (17)$$

As remarked before, the regret bound from Eq. (17) implies an upper bound on $R_T$ of the same order, since $\bar{R}_T$ and $R_T$ differ by at most $2\epsilon T$. The first part of the bound, i.e., $O\big(B\sqrt{\gamma_T T} + \sqrt{(\ln(1/\delta) + \gamma_T)\,\gamma_T T}\big)$, corresponds to the standard regret bound achieved by GP-UCB in the realizable scenario [40, 9], which is also known to nearly match the lower bounds in case of commonly used kernels [36]. On the other hand, in Appendix C, we also demonstrate that the $\epsilon T$ dependence is unavoidable in general for any algorithm in the misspecified kernelized setting.

A similar robust GP-UCB algorithm with enlarged confidence bounds was first considered in [4] to defend against adversarial corruptions. One can think of the misspecified setting as a corrupted one where, at every round, the corruption is bounded by $\epsilon$, yielding a total corruption budget of $C = \epsilon T$. The algorithm proposed in [4] attains a $C\sqrt{\gamma_T T}$ regret bound (due to corruptions), which is strictly suboptimal in comparison to the $\epsilon T \sqrt{\gamma_T}$ bound obtained in our Theorem 1. Finally, we note that to obtain the previous result, the algorithm requires knowledge of $\epsilon$ as input, and it is unclear how to adapt the algorithm to the unknown-$\epsilon$ case. In particular, we show in Appendix B.2 that the problem in the analysis arises since there is no effective way of controlling the uncertainty at $x^\star$ when using standard UCB-based approaches. To address this more practical setting, in the next section we propose our main algorithm that does not require the knowledge of $\epsilon$.

### 3.3 Unknown misspecification error: Phased GP Uncertainty Sampling

Our second proposed Phased GP Uncertainty Sampling algorithm, which has no knowledge of the true parameter $\epsilon$, is shown in Algorithm 2. It runs in episodes of exponentially increasing length $m_e$ and maintains a set of potentially optimal actions $D_e$. In each episode $e$, actions are selected via exploration-encouraging uncertainty sampling, i.e., an action of maximal GP epistemic uncertainty $\sigma_{t-1}(\cdot)$ is selected (with ties broken arbitrarily). The selected action is then used to update $\sigma_t(\cdot)$, which does not depend on the received observations (see Eq. (9)). We also note that at the beginning of every episode, the algorithm reverts back to the prior model (before any data is observed), i.e., by setting $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$, for all $x \in D_e$.

**Algorithm 2** Phased GP Uncertainty Sampling
1: Require: kernel function $k(\cdot,\cdot)$, domain $D$, and parameters $B, \lambda, \sigma$
2: Set episode index $e = 1$, episode length $m_e = 1$, and set of potentially optimal actions $D_e = D$
3: Set $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$, for all $x \in D_e$
4: for $t = 1, \ldots, m_e$ do
5: Choose $x_t \in \arg\max_{x \in D_e} \sigma_{t-1}(x) \qquad (19)$
6: Update to $\sigma_t(\cdot)$ by including $x_t$ according to Eq. (9)
7: end for
8: Receive $\{y_1, \ldots, y_{m_e}\}$, such that $y_t = f^\star(x_t) + \eta_t$ for $t \in \{1, \ldots, m_e\}$, and use them to compute $\mu_{m_e}^\star(\cdot)$ according to Eq. (11)
9: Set $D_{e+1} = \big\{ x \in D_e : \; \mu_{m_e}^\star(x) + \beta_{m_e+1}\sigma_{m_e}(x) \geq \max_{x' \in D_e} \; \mu_{m_e}^\star(x') - \beta_{m_e+1}\sigma_{m_e}(x') \big\}$
10: $m_{e+1} \leftarrow 2 m_e$, $e \leftarrow e + 1$, and return to step 3 (terminate after $T$ total function evaluations)

After every episode, the algorithm receives $m_e$ noisy observations, which are then used to update the mean estimate $\mu_{m_e}^\star(\cdot)$ according to Eq. (11). Finally, the algorithm eliminates actions that appear suboptimal according to the current but potentially wrong model. This is done by retaining all the actions $x \in D_e$ whose hallucinated upper confidence bound is at least as large as the maximum hallucinated lower bound, that is,
$$\mu_{m_e}^\star(x) + \beta_{m_e+1}\sigma_{m_e}(x) \geq \max_{x' \in D_e} \; \mu_{m_e}^\star(x') - \beta_{m_e+1}\sigma_{m_e}(x'),$$
where $\beta_{m_e+1}$ is set as in the realizable setting. Here, the term "hallucinated" refers to the fact that these confidence bounds might be invalid (for $f^\star$), and hence, the optimal action might be eliminated. We also note that in case $\epsilon = 0$, the hallucinated confidence bounds are actually valid. In our analysis (see Appendix D), we prove that although the optimal action can be eliminated, a "near-optimal" one is retained after every episode. Hence, we only need to characterize the difference in regret due to such possible wrong eliminations within all episodes (whose number is logarithmic in $T$).

Our Algorithm 2 bears some similarities with the Phased Elimination algorithm of [24] designed for the related linear misspecified bandit setting. Both algorithms employ an episodic action elimination strategy. However, the algorithm of [24] crucially relies on the Kiefer-Wolfowitz theorem [19] and requires computing a near-optimal design at every episode (i.e., a probability distribution over a set of currently plausibly optimal actions) that minimizes the worst-case variance of the resulting least-squares estimator. Adapting this approach to the kernelized setting and RKHS function classes of infinite dimension is a nontrivial task. Instead, we use a kernel ridge regression estimate, and we show that it is sufficient to sequentially select actions via a simple acquisition rule that does not rely on finite-dimensional feature approximations of the kernel. In particular, our algorithm, at each round $t$ (in episode $e$), selects a single action $x$ (from $D_e$) that maximizes $\|\phi(x)\|^2_{(\Phi_t^\top \Phi_t + \lambda I_k)^{-1}}$ (here, $\phi(x)$ denotes the kernel features, i.e., $k(x, x') = \langle \phi(x), \phi(x') \rangle_k$), which is equivalent to maximizing the GP posterior variance in Eq. (19) (see Appendix D for details). We also note that similar GP uncertainty sampling acquisition rules are commonly used in Bayesian optimization and experimental design (see, e.g., [10, 6, 23]), and efficient iterative updates of the posterior variance that avoid computation of a $t \times t$ kernel matrix inverse at each round are also available (see, e.g., Appendix F in [9]).

In the next theorem, we bound the cumulative regret of the proposed algorithm and use $\tilde{O}(\cdot)$ notation to hide $\mathrm{polylog}(T)$ factors.

**Theorem 2.** Suppose the learner's hypothesis class is $\mathcal{F}_k(D; B)$ for some fixed $B > 0$ and $D \subset \mathbb{R}^d$. For any $f^\star$ defined on $D$ and $\epsilon \geq 0$ such that $\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \leq \epsilon$, Phased GP Uncertainty Sampling (Algorithm 2) achieves the following regret bound with probability at least $1 - \delta$:
$$R_T = \tilde{O}\Big( B\sqrt{\gamma_T T} + \sqrt{(\ln(1/\delta) + \gamma_T)\,\gamma_T T} + \epsilon T \sqrt{\gamma_T} \Big).$$

In comparison with the previous kernel-based EC-GP-UCB algorithm, our phased uncertainty sampling algorithm attains the same regret guarantee without knowledge of $\epsilon$. Our result holds in the case of infinite action sets, and further on, we can substitute the existing upper bounds on $\gamma_T(k, D)$ to specialize it to particular kernels.
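Before specializing the bound to particular kernels, we give a minimal sketch of the episodic structure of Algorithm 2 on a finite grid: within an episode, actions maximize the posterior standard deviation (which ignores the observations), and at the end of the episode the hallucinated confidence bounds are used to prune $D_e$. The fixed confidence parameter `beta`, the grid domain, and the squared-exponential kernel below are illustrative assumptions rather than the exact choices behind Theorem 2.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    d = np.asarray(A).reshape(-1, 1) - np.asarray(B).reshape(1, -1)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior(X, Y, Xq, lam=0.1):
    # Eqs. (8)/(11) and (9); note the variance does not depend on Y
    K = se_kernel(X, X) + lam * np.eye(len(X))
    kq = se_kernel(X, Xq)
    mu = kq.T @ np.linalg.solve(K, Y)
    var = 1.0 - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mu, np.maximum(var, 1e-12)

def phased_gp_uncertainty_sampling(observe, D_grid, T, beta=2.0, lam=0.1):
    """Sketch of Algorithm 2; `beta` stands in for beta_{m_e + 1} (fixed here for simplicity)."""
    D_e = np.asarray(D_grid, float).copy()
    m_e, t_total, queried = 1, 0, []
    while t_total < T:
        X = np.empty(0)
        for _ in range(m_e):                        # uncertainty sampling within the episode
            if len(X) == 0:
                var = np.ones(len(D_e))             # prior variance k(x, x) = 1
            else:                                   # variance ignores observations: pass dummy zeros
                _, var = posterior(X, np.zeros(len(X)), D_e, lam)
            X = np.append(X, D_e[int(np.argmax(var))])
        Y = np.array([observe(x) for x in X])       # observations only arrive after the episode
        queried.extend(X.tolist())
        t_total += m_e
        mu, var = posterior(X, Y, D_e, lam)         # mu*_{m_e} and sigma_{m_e} over D_e
        lcb_max = np.max(mu - beta * np.sqrt(var))
        D_e = D_e[mu + beta * np.sqrt(var) >= lcb_max]   # keep plausibly optimal actions
        m_e *= 2                                    # episodes of doubling length
    return np.array(queried[:T])                    # truncate the (possibly overshooting) last episode
```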
For example, in the case of the linear kernel and a compact domain, we have $\gamma_T(k_{\mathrm{lin}}, D) = O(d \log T)$, while for the squared-exponential kernel it holds that $\gamma_T(k_{\mathrm{SE}}, D) = O((\log T)^{d+1})$ [40]. Moreover, when the used kernel is linear, we recover the same misspecification regret rate as [24], i.e., $\tilde{O}(\epsilon \sqrt{d}\, T)$.

## 4 Algorithm for the contextual misspecified kernelized bandit setting

In this section, we consider a contextual misspecified problem with unknown $f^\star: D \to [0, 1]$ and the same assumptions as before (see Section 2). The main difference comes from the fact that at every round $t$, the learner needs to choose an action from a possibly different action set $D_t \subseteq D$. We assume that the learner observes a context $c_t \triangleq \{x\}_{x \in D_t}$ at every round $t$, where $c_t$ is assumed to be drawn i.i.d. from some distribution (our approach can be extended to the general setting where $f(x, c)$ is defined on the joint action-context space, and where contexts are assumed to be drawn i.i.d.; e.g., user profiles are often considered as contexts in recommendation systems [25], and i.i.d. is a reasonable assumption for different users). The learner then plays $x_t \in D_t$ and observes $y_t = f^\star(x_t) + \eta_t$, with 1-sub-Gaussian noise and independence between time steps.

Since we consider the setting in which the set of actions changes with every round, we note that algorithms that employ action-elimination strategies, such as our Phased GP Uncertainty Sampling, are not easily adapted to this scenario as they require a fixed action set. On the other hand, our optimistic EC-GP-UCB algorithm requires knowing the true misspecification parameter $\epsilon$ a priori, which is also unsatisfactory. To address this, in this section, we combine our EC-GP-UCB with the regret bound balancing strategy from [32]. We measure the regret incurred by the learner by using the corresponding contextual regret definition:
$$R_T = \sum_{t=1}^{T} \Big( \max_{x \in D_t} f^\star(x) - f^\star(x_t) \Big).$$

We specialize the regret bound balancing strategy to the GP-bandit setting. The proposed algorithmic solution considers $M$ different base algorithms $(\mathrm{Base}_i)_{i=1}^{M}$, and uses an elimination-based $M$-armed bandit master algorithm (see Algorithm 3) to choose which base algorithm to play at every round.

**Algorithm 3** Regret bound balancing [33]
Require: $(\mathrm{Base}_i)_{i=1}^{M}$
Initialize: $I_1 = [M]$, $N_i(0) = 0$ for all $i$
for $t = 1, \ldots, T$ do
 Receive $D_t$
 Select $i_t \in \arg\min_{i \in I_t} R_i(N_i(t-1))$
 $\mathrm{Base}_{i_t}$ plays $x_t \in D_t$ and observes $y_t$
 Update $\mathrm{Base}_{i_t}$ and set $N_{i_t}(t) = N_{i_t}(t-1) + 1$
 Update the set of active algorithms $I_{t+1}$
end for

**Base algorithms.** At each round, the learner plays according to the suggestion of a single base algorithm $i \in \{1, \ldots, M\}$. We use $\mathcal{T}_i(t)$ to denote the set of rounds in which base algorithm $i$ is selected up to time $t$, and $N_i(t) = |\mathcal{T}_i(t)|$ to denote the number of rounds algorithm $i$ is played by the learner. After a total of $T$ rounds, the regret of algorithm $i$ is given by $R_{i,T} = \sum_{t \in \mathcal{T}_i(T)} \big( \max_{x \in D_t} f^\star(x) - f^\star(x_t) \big)$; we also note that $R_T = \sum_{i=1}^{M} R_{i,T}$.

As a good candidate for base algorithms, we use our EC-GP-UCB algorithm (from Algorithm 1), but instead of the true unknown $\epsilon$, we instantiate EC-GP-UCB with different candidate values $\hat{\epsilon}_i$. In particular, we use $M = \lceil 1 + \tfrac{1}{2} \log_2(T/\gamma_T^2) \rceil$ different EC-GP-UCB algorithms, and for every algorithm $i \in [M]$, we set $\hat{\epsilon}_i = 2^{1-i}\sqrt{\gamma_T}$. This means that as $i$ decreases, the used algorithms are more robust and can tolerate a larger misspecification error. On the other hand, overly conservative values lead to excessive exploration and large regret. We consider the true $\epsilon$ to be in $[0, 1)$, and our goal is to ideally match the performance of EC-GP-UCB that is run with such $\epsilon$.
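The master loop of Algorithm 3 can be sketched independently of the particular base learners; the candidate bounds $R_i$ and the elimination test that it relies on are detailed next. In the sketch below, each base is assumed to expose `play(D_t)` and `update(x, y)` methods, and `bounds[i]` evaluates the candidate bound $R_i(\cdot)$; these interfaces, and the constant `c`, are illustrative assumptions.

```python
import numpy as np

def balancing_master(bases, bounds, sample_context, observe, T, delta=0.05, c=1.0):
    """Sketch of Algorithm 3 (regret bound balancing)."""
    M = len(bases)
    active = set(range(M))
    N = np.zeros(M, dtype=int)       # N_i(t): number of rounds base i has been played
    J = np.zeros(M)                  # J_i(t): cumulative reward collected by base i
    for t in range(1, T + 1):
        D_t = sample_context()                                  # stochastic action set / context
        i_t = min(active, key=lambda i: bounds[i](N[i]))        # balancing rule, Eq. (23)
        x_t = bases[i_t].play(D_t)
        y_t = observe(x_t)
        bases[i_t].update(x_t, y_t)
        N[i_t] += 1
        J[i_t] += y_t

        def conf(i):                 # confidence width appearing in the elimination test
            n = max(int(N[i]), 1)
            return c * np.sqrt(n * np.log(M * np.log(n + 1) / delta))

        best_lcb = max(J[j] - conf(j) for j in active)
        # drop bases whose optimistic reward estimate falls below the best pessimistic one, Eq. (24)
        active = {i for i in active if J[i] + bounds[i](N[i]) + conf(i) >= best_lcb}
    return N, J, active
```

A candidate bound of the form of Eq. (22) can be supplied for each base as, e.g., `lambda n: min(c0 * (np.sqrt(gT * n) + eps_i * np.sqrt(gT) * n), n)`, where `c0`, `eps_i`, and the information-gain estimate `gT` are hypothetical names for the quantities discussed below.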
Each base algorithm comes with a candidate regret upper bound $R_i(t): \mathbb{N} \to \mathbb{R}_+$ (that holds with high probability) of the form given in Theorem 1. Moreover, every candidate regret bound $R_i(t)$ is non-decreasing in $t$ and $R_i(0) = 0$. Without loss of generality, we can assume that candidate bounds can increase by at most 1 per round, i.e., $0 \leq R_i(t) - R_i(t-1) \leq 1$ for all rounds $t$ and algorithms $i$. We use the following form of EC-GP-UCB candidate bounds in Algorithm 3:
$$R_i(N_i(t)) = \min\Big\{ c_0 \big( \sqrt{\gamma_T N_i(t)} + \hat{\epsilon}_i \sqrt{\gamma_T}\, N_i(t) \big), \; N_i(t) \Big\}, \qquad (22)$$
for some quantity $c_0 > 0$ that does not depend on $N_i(t)$ or $\hat{\epsilon}_i$. For a given $\hat{\epsilon}_i$, these bounds are computable for every $t \in [T]$. In particular, to evaluate them, we also need to compute $\gamma_T$ (Eq. (10)). We note that for popularly used kernels, analytical upper bounds exist as discussed in Section 3.3, while in general, we can efficiently approximate it up to a constant factor via simple greedy maximization (when applied to Eq. (10), a solution found by a simple and efficient greedy algorithm achieves at least $(1 - 1/e)\gamma_T$; this is due to submodularity of the mutual information, see, e.g., [40]).

We refer to algorithm $i$ as being consistent if $R_{i,t} \leq R_i(N_i(t))$ for all $t \in [T]$ with high probability, and otherwise as inconsistent. We also let $i^\star$ denote the first consistent algorithm $i$ for which $\hat{\epsilon}_i \geq \epsilon$. In particular, we have that $\hat{\epsilon}_i \leq \hat{\epsilon}_{i^\star}$ for every inconsistent algorithm $i$.

**Master algorithm.** We briefly recall the regret bound balancing algorithm of [32] and its main steps. At each round, the learner selects an algorithm $i_t$ according to the regret bound balancing principle:
$$i_t \in \arg\min_{i \in I_t} R_i(N_i(t-1)), \qquad (23)$$
where $I_t \subseteq [M]$ is the set of active (i.e., plausibly consistent) algorithms at round $t$. Then, the learner uses the algorithm $i_t$ to select $x_t$, observes the reward $y_t$, and increases $N_{i_t}(t) = N_{i_t}(t-1) + 1$. Intuitively, by evaluating candidate regret bounds at the respective numbers of plays, the algorithm keeps them almost equal and hence balanced, i.e., $R_i(N_i(t)) \leq R_j(N_j(t)) + 1$ for every $t$ and all active algorithms $i, j \in I_t$.

The remaining task is to update the set of active algorithms based on the received observations. We define $J_i(t) = \sum_{j \in \mathcal{T}_i(t)} y_j$ as the cumulative reward of algorithm $i$ in the first $t$ rounds. As in [32] (Section 4), we set $I_1 = [M]$, and at the end of each round, the learner eliminates every inconsistent algorithm $i \in I_t$ from $I_{t+1}$ when:
$$J_i(t) + R_i(N_i(t)) + c \sqrt{N_i(t) \ln(M \ln N_i(t)/\delta)} < \max_{j \in I_t} \; J_j(t) - c \sqrt{N_j(t) \ln(M \ln N_j(t)/\delta)}, \qquad (24)$$
where $c$ is a suitably set absolute constant (see Lemma A.1 in [32]). Crucially, the elimination condition from Eq. (24) is used in Algorithm 3, and it ensures that no consistent algorithm is eliminated (with high probability). In particular, since each $D_t$ is sampled i.i.d., it holds that $N_i(t)\,\mathbb{E}_{D_t}[\max_{x \in D_t} f^\star(x)]$ is upper/lower bounded by the left/right-hand side of Eq. (24). Hence, if the upper bound of algorithm $i$ for this quantity is smaller than the maximum lower bound, the algorithm can be classified as inconsistent and consequently eliminated from $I_{t+1}$.

**Regret bound.** By using the general regret result for Algorithm 3 from [32] (see Appendix E), we show the total regret obtained by Algorithm 3 when instantiated with our EC-GP-UCB algorithm and the candidate regret bounds provided in Eq. (22). This is formally captured in the following proposition.

**Proposition 1.** Consider $M = \lceil 1 + \tfrac{1}{2} \log_2(T/\gamma_T^2) \rceil$ EC-GP-UCB base algorithms, each with a candidate regret upper bound of the form provided in Eq. (22) with $\hat{\epsilon}_i = 2^{1-i}\sqrt{\gamma_T}$ for each $i \in [M]$.
In the misspecified kernelized contextual setting, when run with such $M$ base algorithms, Algorithm 3 achieves the following regret bound with probability at least $1 - M\delta$:
$$\tilde{O}\Big( (B\sqrt{\gamma_T} + \gamma_T)\sqrt{T} + \epsilon T \sqrt{\gamma_T} + (B\sqrt{\gamma_T} + \gamma_T)^2 \Big).$$

The obtained contextual regret bound recovers (up to poly-logarithmic factors and an additive term of lower order) the bound of the EC-GP-UCB instance (from Theorem 1) that assumes knowledge of the true $\epsilon$.

## 5 Conclusion

We have considered the GP-bandit optimization problem in the case of a misspecified hypothesis class. Our work systematically handles model misspecifications in GP bandits and provides robust and practical algorithms. We designed novel algorithms based on ideas such as enlarged confidence bounds, phased epistemic uncertainty sampling, and regret bound balancing, for the standard (with known/unknown misspecification error) and contextual settings. While there have been some theoretical [29] and practical [37] efforts to quantify the impact of misspecified priors in the related Bayesian setting [40], an interesting direction for future work is to develop algorithms and regret bounds in the case of misspecified GP priors.

## Acknowledgments and Disclosure of Funding

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme grant agreement No 815943 and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), pages 2312-2320, 2011.
[2] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12-38. PMLR, 2017.
[3] Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. No-regret Bayesian optimization with unknown hyperparameters. arXiv preprint arXiv:1901.03357, 2019.
[4] Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[5] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), pages 5760-5770, 2018.
[6] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1507-1515, 2016.
[7] Sait Cakmak, Raul Astudillo, Peter Frazier, and Enlu Zhou. Bayesian optimization of risk measures. arXiv preprint arXiv:2007.05554, 2020.
[8] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pages 1227-1237. PMLR, 2021.
[9] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844-853, 2017.
[10] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer, 2013.
[11] Nando de Freitas, Alex Smola, and Masrour Zoghi. Regret bounds for deterministic Gaussian process bandits. arXiv preprint arXiv:1203.2177, 2012.
[12] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
[13] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650-683, 2018.
[14] Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33, 2020.
[15] Dylan J Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002.04926, 2020.
[16] Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In AAAI Conference on Artificial Intelligence, 2017.
[17] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137-2143, 2020.
[18] Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
[19] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.
[20] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust Bayesian optimization. arXiv preprint arXiv:2002.09038, 2020.
[21] Johannes Kirschner and Andreas Krause. Stochastic bandits with context distributions. In Advances in Neural Information Processing Systems, pages 14113-14122, 2019.
[22] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 2447-2455, 2011.
[23] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235-284, 2008.
[24] Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. International Conference on Machine Learning, 2020.
[25] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661-670, 2010.
[26] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765-6816, 2017.
[27] Lydia T Liu, Horia Mania, and Michael Jordan. Competing bandits in matching markets. In International Conference on Artificial Intelligence and Statistics, pages 1618-1628. PMLR, 2020.
[28] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(12), 2006.
[29] Willie Neiswanger and Aaditya Ramdas. Uncertainty quantification using martingales for misspecified Gaussian processes. arXiv preprint arXiv:2006.07368, 2020.
[30] Gergely Neu and Julia Olkhovskaya. Efficient and robust algorithms for adversarial linear contextual bandits. arXiv preprint arXiv:2002.00287, 2020.
[31] Thanh Tang Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robust Bayesian quadrature optimization. arXiv preprint arXiv:2001.06814, 2020.
[32] Aldo Pacchiano, Christoph Dann, Claudio Gentile, and Peter Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020.
[33] Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704, 2020.
[34] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, 2006.
[35] Jonathan Scarlett. Tight regret bounds for Bayesian optimization in one dimension. In International Conference on Machine Learning (ICML), 2018.
[36] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory (COLT), 2017.
[37] Eric Schulz, Maarten Speekenbrink, José Miguel Hernández-Lobato, Zoubin Ghahramani, and Samuel J Gershman. Quantifying mismatch in Bayesian optimization. In NeurIPS Workshop on Bayesian Optimization: Black-box Optimization and Beyond, 2016.
[38] Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. Mixed strategies for robust optimization of unknown objectives. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[39] Shubhanshu Shekhar and Tara Javidi. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829-3874, 2018.
[40] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015-1022, 2010.
[41] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML), pages 997-1005, 2015.
[42] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495-517. Springer, 2017.
[43] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits. arXiv preprint arXiv:2009.06966, 2020.
[44] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence (UAI), 2013.
[45] Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
[46] Ziyu Wang and Nando de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758, 2014.
[47] George Wynne, Francois-Xavier Briol, and Mark Girolami. Convergence guarantees for Gaussian process means with misspecified likelihoods and smoothness. Journal of Machine Learning Research, 22(123):1-40, 2021.
[48] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.