# Misspecified Gaussian Process Bandit Optimization

Ilija Bogunovic, Andreas Krause

**Abstract.** We consider the problem of optimizing a black-box function based on noisy bandit feedback. Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem. They heavily rely on the assumption that the model is well-specified, however, and can fail without it. Instead, we introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We design efficient and practical algorithms whose performance degrades minimally in the presence of model misspecification. Specifically, we present two algorithms based on Gaussian process (GP) methods: an optimistic EC-GP-UCB algorithm that requires knowing the misspecification error, and Phased GP Uncertainty Sampling, an elimination-type algorithm that can adapt to unknown model misspecification. We provide upper bounds on their cumulative regret in terms of $\epsilon$, the time horizon, and the underlying kernel, and we show that our algorithm achieves optimal dependence on $\epsilon$ with no prior knowledge of misspecification. In addition, in a stochastic contextual setting, we show that EC-GP-UCB can be effectively combined with the regret bound balancing strategy and attain similar regret bounds despite not knowing $\epsilon$.

## 1 Introduction

Bandit optimization has been successfully used in a great number of machine learning and real-world applications, e.g., in mobile health [42], environmental monitoring [40], economics [27], and hyperparameter tuning [26], to name a few. To scale to large or continuous domains, modern bandit approaches try to model and exploit the problem structure that is often manifested as correlations in rewards of "similar" actions. Hence, the key idea of kernelized bandits is to consider only smooth reward functions of a low norm belonging to a chosen Reproducing Kernel Hilbert Space (RKHS) of functions. This permits the application of flexible nonparametric Gaussian process (GP) models and Bayesian optimization methods via a well-studied link between RKHS functions and GPs (see, e.g., [18] for a concise review).

A vast majority of previous works on nonparametric kernelized bandits have focused on designing algorithms and theoretical bounds for the standard notions of regret (see, e.g., [40, 9, 35]). However, they solely focus on the realizable (i.e., well-specified) case in which one assumes perfect knowledge of the true function class. For example, the analysis of the prominent GP-UCB [40] algorithm assumes the model to be well-specified and ignores potential misspecification issues. As the realizability assumption may be too restrictive in real applications, we focus on the case where it may only hold approximately. In practice, model misspecifications can arise for various reasons, such as an incorrect choice of kernel, consideration of an overly smooth function class, hyperparameter estimation errors, etc. Hence, an open question is to characterize the impact of model misspecification in the kernelized setting, and to design robust algorithms whose performance degrades optimally with the increasing level of misspecification.
In this paper, we study the GP bandit problem with model misspecification in which the true unknown function might be $\epsilon$-far (as measured in the max norm) from a member of the learner's assumed hypothesis class. We propose a novel GP bandit algorithm and regret bounds that depend on the misspecification error, time horizon, and underlying kernel. Specifically, we present an algorithm that is based on the classical uncertainty sampling approach that is frequently used in Bayesian optimization and experimental design. Importantly, our main presented algorithm assumes no knowledge of the misspecification error and achieves standard regret rates in the realizable case.

**Related work on GP bandits.** GP bandit algorithms have received significant attention in recent years (e.g., [40, 11, 9, 39]). While the most popular approaches in the stochastic setting rely on upper confidence bound (UCB) and Thompson sampling strategies, a number of works also consider uncertainty sampling procedures (e.g., [10, 41, 6]). Beyond the standard setting, numerous works have also considered the contextual bandit setting (e.g., [22, 44, 21]), while the case of unknown kernel hyperparameters and misspecified smoothness has been studied in, e.g., [46, 3, 47]. [46, 3] assume that the reward function is smooth as measured by the known (or partially known) kernel, while in our problem, we allow the unknown function to lie outside a given kernel's reproducing space. Related corruption-robust GP bandits, in which an adversary can additively perturb the observed rewards, were recently considered in [4]. In Section 2, we also discuss how the misspecified problem can be seen from this perspective, where the corruption function is fixed and the adversarial budget scales with the time horizon. While the focus of [4] is on protecting against an adaptive adversary, and thus on designing randomized algorithms and regret bounds that depend on the adversarial budget, we propose deterministic algorithms and analyze the impact of misspecification. Apart from corruptions, several works have considered other robust aspects in GP bandits, such as designing robust strategies against shifts in uncontrollable covariates [5, 20, 31, 38, 7]. While they report robust regret guarantees, they still assume the realizable case only. Our goal of attaining small regret despite the wrong hypothesis class requires very different techniques from these previous works.

**Related work on misspecified linear bandits.** Recently, works on reinforcement learning with misspecified linear features (e.g., [12, 45, 17]) have renewed interest in the related misspecified linear bandits (e.g., [48], [24], [30], [15]) first introduced in [16]. In [16], the authors show that standard algorithms must suffer $\Omega(\epsilon T)$ regret under an additive $\epsilon$-perturbation of the linear model. Recently, [48] propose a robust variant of OFUL [1] that requires knowing the misspecification parameter $\epsilon$. In particular, their algorithm obtains a high-probability $O(d\sqrt{T} + \epsilon\sqrt{d}\,T)$ regret bound. In [24], the authors propose another arm-elimination algorithm based on G-experimental design. Unlike the previous works, their algorithm is agnostic to the misspecification level, and its performance matches the lower bounds. As shown in [24], when the number of actions $K$ is large ($K \gg d$), the price of $\epsilon$-misspecification must grow as $\Omega(\epsilon\sqrt{d}\,T)$ (a result by [15] further shows that this can be improved to $O(\epsilon\sqrt{K}\,T)$ in the small-$K$ regime). Our main algorithm is inspired by the proposed technique for the finite-arm misspecified linear bandit setting [24].
It works in the more general kernelized setting, uses a simpler data acquisition rule often used in Bayesian optimization and experimental design, and recovers the same optimal guarantees when instantiated with the linear kernel.

Several works have recently considered the misspecified contextual linear bandit problem with unknown model misspecification $\epsilon$. [14] introduce a new family of algorithms that require access to an online oracle for square loss regression and address the case of adversarial contexts. Concurrent work of [33] solves the case when contexts / action sets are stochastic. Both works ([14] and [33]) leverage CORRAL-type aggregation [2] of contextual bandit algorithms and achieve the optimal $O(\epsilon\sqrt{d}\,T)$ regret bound. Finally, in [32], the authors present a practical master algorithm that plays base algorithms that come with a candidate regret bound that may not hold during all rounds. The master algorithm plays base algorithms in a balanced way and suitably eliminates algorithms whose regret bound is no longer valid. Similarly to the previous works that rely on CORRAL-type master algorithms, we use the balancing master algorithm of [33] together with our GP-bandit base algorithm to provide contextual misspecified regret bounds.

Around the time of the submission of this work, a related approach that among others also considers misspecified kernel bandits appeared online [8]. [8, Theorem 3] contains the same regret scaling due to misspecification as we obtain in our results. The main difference between the two works is in the proposed algorithms and analysis techniques. Our approach does not require robust estimators and simply uses the standard ones (i.e., the GP posterior / kernelized ridge regression mean and variance estimators) that can be computed in closed form. It can also handle infinite action sets similarly to the classical Bayesian optimization algorithms; it utilizes a single acquisition function that, in practice, can be maximized via standard off-the-shelf global optimization solvers. We also present a complete treatment of the misspecified problem by showing the failure of standard UCB approaches, algorithms for known and unknown $\epsilon$, an impossibility result, and extensions to the contextual bandit setting. Finally, our main algorithm (Algorithm 2) demonstrates the use of a different acquisition function in comparison to [24, 8] that relies on standard and non-robust estimators.

**Contributions.** In this paper, we systematically handle model misspecification in GP bandits. Specifically, this paper makes the following contributions:

- We introduce a misspecified kernelized bandit problem, and for known misspecification error $\epsilon$, we present the EC-GP-UCB algorithm with enlarged confidence bounds that achieves cumulative regret $R_T = O\big(B\sqrt{\gamma_T T} + \gamma_T\sqrt{T} + \epsilon T\sqrt{\gamma_T}\big)$. Our simple lower bound argument shows that $\Omega(\epsilon T)$ regret is unavoidable in the general kernelized setting.
- For when $\epsilon$ is unknown, we propose another algorithm based on uncertainty sampling and phased exploration that achieves (up to polylog factors) the previous regret rates in the misspecified setting, and standard regret guarantees in the realizable case (when $\epsilon = 0$).
- Finally, we consider a misspecified contextual kernelized problem, and show that when action sets are stochastic, our EC-GP-UCB algorithm can be effectively combined with the regret bound balancing strategy from [33] to achieve the previous regret bounds (up to some additive lower-order terms).

## 2 Problem statement

We consider the problem of sequentially maximizing some black-box reward function $f^\star: D \to \mathbb{R}$ over a known compact set of actions $D \subset \mathbb{R}^d$. To learn about $f^\star$, the learner relies on sequential noisy bandit feedback, i.e., at every round $t$, the learner selects $x_t \in D$ and obtains a noisy observation
$$y_t = f^\star(x_t) + \eta_t, \qquad (1)$$
where we assume independent (over rounds) $\sigma$-sub-Gaussian noise $\eta_t$ (see Appendix A for the definition).

Let $k(\cdot,\cdot)$ denote a positive definite kernel function defined on $D \times D$ and $\mathcal{H}_k(D)$ be its associated Reproducing Kernel Hilbert Space (RKHS) of well-behaved functions. Suppose that before making any decision, the learner is provided with a hypothesis class
$$\mathcal{F}_k(D; B) = \{f \in \mathcal{H}_k(D) : \|f\|_k \leq B\}, \qquad (2)$$
where every member function has a bounded RKHS norm $\|f\|_k = \sqrt{\langle f, f \rangle_k} \leq B$ for some $B > 0$, which measures the complexity of $f$ with respect to the kernel $k(\cdot,\cdot)$. We consider kernel functions such that $k(x, x) \leq 1$ for every $x \in D$. The most commonly used kernel functions that satisfy this property are outlined in Appendix A.

The standard setting assumes a realizable (i.e., well-specified) scenario in which $f^\star \in \mathcal{F}_k(D; B)$, i.e., the unknown function is a member of the known RKHS with bounded norm. In contrast, in the misspecified setting, we assume that the learner is informed that $f^\star$ can be uniformly approximated by a member of the given hypothesis class $\mathcal{F}_k(D; B)$, i.e.,
$$\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \leq \epsilon, \qquad (3)$$
for some $\epsilon > 0$ and max-norm $\|\cdot\|_\infty$. Here, we note that if two functions are close in RKHS norm, then they are also pointwise close, but the reverse need not be true. Hence, the true function $f^\star$ can in principle have a significantly larger RKHS norm than any $f$ from $\mathcal{F}_k(D; B)$, or it might not even belong to the considered RKHS. We also note that in the case of continuous universal kernels [28] (i.e., due to the universal function approximation property of such kernels), any continuous function $f^\star$ on $D$ satisfies the above assumption (Eq. (3)) for any $\epsilon > 0$ and $\mathcal{F}_k(D; B)$ with a suitably large RKHS norm bound $B$. This is a difference in comparison to the previously studied misspecified linear setting, and another motivation to study the misspecified kernelized problem.

As in the standard setting (which corresponds to the case when $\epsilon = 0$), we assume that $k(\cdot,\cdot)$ and $B$ are known to the learner. The goal of the learner is to minimize the cumulative regret
$$R_T = \sum_{t=1}^{T} \Big( \max_{x \in D} f^\star(x) - f^\star(x_t) \Big), \qquad (4)$$
where $T$ is a time horizon. We denote the instantaneous regret at round $t$ as $r_t = f^\star(x^\star) - f^\star(x_t)$, where $x^\star \in \arg\max_{x \in D} f^\star(x)$, and note that $R_T = \sum_{t=1}^{T} r_t$. The learner's algorithm knows $\mathcal{F}_k(D; B)$, meaning that it takes $D$, $k(\cdot,\cdot)$ and $B$ as input. Crucially, in our most general considered problem variant, the exact value of the misspecification $\epsilon$ is assumed to be unknown, and so we seek an algorithm that can adapt to any $\epsilon > 0$, including the realizable case when $\epsilon = 0$.

Alternatively, one can consider competing with a best-in-class benchmark, i.e.,
$$\bar{R}_T = \sum_{t=1}^{T} \Big( \max_{x \in D} \tilde{f}(x) - \tilde{f}(x_t) \Big), \qquad (5)$$
where
$$\tilde{f} \in \arg\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \qquad (6)$$
and $|\tilde{f}(x) - f^\star(x)| \leq \epsilon$ for every $x \in D$. Here, the goal is to minimize cumulative regret in case the true unknown objective is $\tilde{f} \in \mathcal{F}_k(D; B)$, while the noisy observations of $\tilde{f}$ that the learner receives are at most $\epsilon$-misspecified, i.e.,
$$y_t = \underbrace{\tilde{f}(x_t) + m(x_t)}_{f^\star(x_t)} + \eta_t, \qquad (7)$$
where $m(\cdot): D \to [-\epsilon, \epsilon]$ is a fixed and unknown function.
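For concreteness, the following short Python sketch simulates the misspecified feedback model of Eqs. (1) and (7) and the cumulative-regret bookkeeping of Eq. (4). The particular choices of $\tilde f$, the perturbation $m$, the noise scale, and the placeholder random policy below are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, sigma = 0.05, 0.1             # misspecification level and noise scale (assumed values)

def f_tilde(x):
    # a smooth member of the hypothesis class, standing in for the best-in-class f~ of Eq. (6)
    return np.exp(-8.0 * (x - 0.3) ** 2)

def m(x):
    # fixed perturbation with |m(x)| <= eps, as in Eq. (7)
    return eps * np.sign(np.sin(40.0 * x))

def f_star(x):
    # true reward: f* = f~ + m
    return f_tilde(x) + m(x)

def observe(x):
    # noisy bandit feedback of Eq. (1): y_t = f*(x_t) + eta_t
    return f_star(x) + sigma * rng.standard_normal()

# cumulative regret bookkeeping of Eq. (4) on a finite grid domain
D = np.linspace(0.0, 1.0, 201)
best = f_star(D).max()
actions = rng.choice(D, size=10)   # placeholder policy: uniformly random actions
regret = np.cumsum([best - f_star(x) for x in actions])
print(regret)
```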
Since $\|\tilde{f} - f^\star\|_\infty \leq \epsilon$, the absolute difference between $R_T$ and $\bar{R}_T$ is at most $2\epsilon T$, and as will become clear in Section 3.2, this difference has no significant impact on the scalings in our main bound (since the obtained bounds already scale as $\Omega(\epsilon T)$). In our theoretical analysis, we will interchangeably use both regret definitions (from Eqs. (4) and (5)).

## 3 Algorithms for misspecified kernelized bandits

We start this section by recalling a Gaussian process (GP) framework for learning RKHS functions. Then, we present different GP-bandit algorithms and theoretical regret bounds for misspecified settings of increasing levels of difficulty.

### 3.1 Learning with Gaussian Processes

In the realizable case (i.e., when the true unknown $f^\star \in \mathcal{F}_k(D; B)$ and the learner knows the true hypothesis space $\mathcal{F}_k(D; B)$) and under the noise model described in Eq. (1), uncertainty modeling and learning in standard GP-bandit algorithms can be viewed through the lens of Gaussian process models. A Gaussian process $GP(\mu(\cdot), k(\cdot,\cdot))$ over the input domain $D$ is a collection of random variables $(f(x))_{x \in D}$ where every finite number of them $(f(x_i))_{i=1}^{n}$, $n \in \mathbb{N}$, is jointly Gaussian with mean $\mathbb{E}[f(x_i)] = \mu(x_i)$ and covariance $\mathbb{E}[(f(x_i) - \mu(x_i))(f(x_j) - \mu(x_j))] = k(x_i, x_j)$ for every $1 \leq i, j \leq n$. Standard algorithms implicitly use a zero-mean $GP(0, k(\cdot,\cdot))$ as the prior distribution over $f$, i.e., $f \sim GP(0, k(\cdot,\cdot))$, and assume that the noise variables are drawn independently across $t$ from $\mathcal{N}(0, \lambda)$ with $\lambda > 0$. After collecting new data, that is, a sequence of actions $\{x_1, \ldots, x_t\}$ and their corresponding noisy observations $\{y_1, \ldots, y_t\}$, the posterior distribution under the previous assumptions is also Gaussian, with mean and variance that can be computed in closed form as:
$$\mu_t(x) = k_t(x)^\top (K_t + \lambda I_t)^{-1} Y_t, \qquad (8)$$
$$\sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \lambda I_t)^{-1} k_t(x), \qquad (9)$$
where $k_t(x) = [k(x_1, x), \ldots, k(x_t, x)]^\top \in \mathbb{R}^{t \times 1}$, $K_t = [k(x_s, x_{s'})]_{s, s' \leq t} \in \mathbb{R}^{t \times t}$ is the corresponding kernel matrix, and $Y_t := [y_1, \ldots, y_t]$ denotes the vector of observations.

These standard modeling assumptions by themselves already constitute a form of model misspecification, as GP sample paths are rougher than RKHS functions and are not contained in $\mathcal{H}_k(D)$ with high probability. Although this leads to a mismatched hypothesis space, GPs and RKHS functions are closely related (see, e.g., [18]) when used with the same kernel function, and it is possible to use GP models to infer reliable confidence intervals on the unknown $f \in \mathcal{F}_k(D; B)$. Under this assumption, popular algorithms such as GP-UCB [40] construct statistical confidence bounds that contain $f$ with high probability uniformly over the time horizon, i.e., $|f(x) - \mu_{t-1}(x)| \leq \beta_t \sigma_{t-1}(x)$ holds for every $t \geq 1$ and $x \in D$. Here, $\{\beta_t\}_{t \leq T}$ stands for the sequence of parameters that are suitably set (see Lemma 1) to (i) trade off between exploration and exploitation and (ii) ensure the validity of the confidence bounds. In every round $t$, GP-UCB then queries the unknown function at a point $x_t \in D$ that maximizes the upper confidence bound given by $\mu_{t-1}(\cdot) + \beta_t \sigma_{t-1}(\cdot)$, with $\mu_{t-1}(\cdot)$ and $\sigma_{t-1}(\cdot)$ as defined in Eqs. (8) and (9). In the standard setting, GP-UCB is no-regret, meaning that $R_T / T \to 0$ as $T \to \infty$. However, in the misspecified setting, the previous standard confidence bounds are no longer valid and one needs to consider different strategies.
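The posterior mean and variance in Eqs. (8) and (9) reduce to a kernel-matrix solve; a minimal NumPy sketch is given below. The squared-exponential kernel, its lengthscale, and the regularization value $\lambda$ are illustrative assumptions; any kernel with $k(x, x) \leq 1$ could be substituted.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    # squared-exponential kernel matrix for 1-d inputs; satisfies k(x, x) = 1
    d = np.asarray(A).reshape(-1, 1) - np.asarray(B).reshape(1, -1)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, Y, Xq, lam=0.1, kernel=se_kernel):
    """Posterior mean (Eq. (8)) and variance (Eq. (9)) at query points Xq,
    given queried actions X and their noisy observations Y."""
    X, Y, Xq = np.asarray(X, float), np.asarray(Y, float), np.asarray(Xq, float)
    if len(X) == 0:                               # prior: mu_0 = 0, sigma_0^2(x) = k(x, x)
        return np.zeros(len(Xq)), np.diag(kernel(Xq, Xq)).copy()
    K = kernel(X, X) + lam * np.eye(len(X))       # K_t + lambda * I_t
    kq = kernel(X, Xq)                            # each column is k_t(x) for one query point x
    mu = kq.T @ np.linalg.solve(K, Y)
    var = np.diag(kernel(Xq, Xq)) - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mu, np.maximum(var, 1e-12)             # clip tiny negative values from round-off

# example: posterior after three noisy observations
X, Y = np.array([0.1, 0.5, 0.9]), np.array([0.2, 0.8, 0.1])
mu, var = gp_posterior(X, Y, np.linspace(0.0, 1.0, 5))
print(mu, np.sqrt(var))
```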
Before moving to the misspecified case, we recall an important kernel-dependent quantity known as the maximum information gain $\gamma_t(k, D)$ [40] that is frequently used to characterize regret bounds (we often write $\gamma_t$ when $k(\cdot,\cdot)$ and $D$ are clear from context). It stands for the maximum amount of information that a set of noisy observations can reveal about the unknown function sampled from a zero-mean Gaussian process with kernel $k$. Specifically, for a set of points $S \subset D$, we use $f_S$ to denote the random vector $[f(x)]_{x \in S}$, and $Y_S$ to denote the corresponding noisy observations obtained as $Y_S = f_S + \eta_S$, where $\eta_S \sim \mathcal{N}(0, \lambda I)$. The maximum information gain is then defined as:
$$\gamma_t(k, D) := \max_{S \subset D : |S| = t} I(f_S; Y_S) = \max_{S \subset D : |S| = t} \tfrac{1}{2} \ln |I_t + \lambda^{-1} K_t|, \qquad (10)$$
where $I(\cdot\,;\cdot)$ stands for the mutual information between random variables, and $|\cdot|$ is the determinant. Simply put, if samples are taken "close" to each other (far from each other) as measured by the kernel, they are more correlated (less correlated) under the GP prior and provide less (more) information. As shown in [40], the maximum information gain $\gamma_t(k, D)$ scales sublinearly with $t$ for the most commonly used kernels (see Appendix A).

### 3.2 Known misspecification error and optimistic approaches

In this section, we also use the "well-specified" mean $\mu_t(\cdot)$ and variance $\sigma_t^2(\cdot)$ estimates from Eqs. (8) and (9), where we assume that the noisy observations used in Eq. (8) correspond to a function $\tilde{f} \in \arg\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty$. We note that $\sigma_t^2(\cdot)$ does not depend on the observations (i.e., $Y_t$), and additionally, we define $\mu_t^\star(\cdot)$ that depends on the noisy observations of the true $f^\star$, i.e.,
$$\mu_t^\star(x) = k_t(x)^\top (K_t + \lambda I_t)^{-1} Y_t^\star, \qquad (11)$$
where $Y_t^\star := [y_1^\star, \ldots, y_t^\star]$ and $y_i^\star = f^\star(x_i) + \eta_i$ for $1 \leq i \leq t$. The only difference between the definitions of $\mu_t^\star(x)$ and $\mu_t(x)$ comes from the used observation vector, i.e., $Y_t^\star$ and $Y_t$, respectively. We also use the following standard result from [40, 9, 13] that provides confidence bounds around the unknown function in the realizable setting.

**Lemma 1.** Let $f(\cdot)$ be a function that belongs to the space of functions $\mathcal{F}_k(D; B)$. Assume the $\sigma$-sub-Gaussian noise model as in Eq. (1), and let $Y_{t-1} := [y_1, \ldots, y_{t-1}]$ denote the vector of previous noisy observations that correspond to the queried points $(x_1, \ldots, x_{t-1})$. Then, the following holds with probability at least $1 - \delta$ simultaneously over all $t \geq 1$ and $x \in D$:
$$|f(x) - \mu_{t-1}(x)| \leq \beta_t \sigma_{t-1}(x), \qquad (12)$$
where $\mu_{t-1}(\cdot)$ and $\sigma_{t-1}(\cdot)$ are given in Eq. (8) and Eq. (9) with $\lambda > 0$, and
$$\beta_t = \frac{\sigma}{\lambda^{1/2}} \sqrt{2 \ln(1/\delta) + \sum_{t'=1}^{t-1} \ln\big(1 + \lambda^{-1} \sigma_{t'-1}^2(x_{t'})\big)} + B. \qquad (13)$$

Next, we start by addressing the misspecified setting when $\epsilon$ is known to the learner. We consider minimizing $\bar{R}_T$ (from Eq. (5)), and provide an upper confidence bound algorithm with enlarged confidence bounds, EC-GP-UCB (see also Algorithm 1):
$$x_t \in \arg\max_{x \in D} \; \mu_{t-1}^\star(x) + \Big(\beta_t + \frac{\epsilon \sqrt{t-1}}{\sqrt{\lambda}}\Big) \sigma_{t-1}(x), \qquad (15)$$
where the confidence interval enlargement is to account for the use of the biased mean estimator $\mu_{t-1}^\star(\cdot)$ (instead of $\mu_{t-1}(\cdot)$). This can be interpreted as introducing an additional exploration bonus to the standard GP-UCB algorithm [40] in case of misspecification.

**Algorithm 1** EC-GP-UCB (Enlarged Confidence GP-UCB)
1: Require: kernel function $k(\cdot,\cdot)$, domain $D$, misspecification $\epsilon$, and parameters $B, \lambda, \sigma$
2: Set $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$, for all $x \in D$
3: for $t = 1, \ldots, T$ do
4: Choose $x_t \in \arg\max_{x \in D} \; \mu_{t-1}^\star(x) + \big(\beta_t + \epsilon\sqrt{t-1}/\sqrt{\lambda}\big)\sigma_{t-1}(x) \qquad (14)$
5: Observe $y_t^\star = f^\star(x_t) + \eta_t$
6: Update to $\mu_t^\star(\cdot)$ and $\sigma_t(\cdot)$ by using $(x_t, y_t^\star)$ according to Eq. (11) and Eq. (9)
7: end for
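For concreteness, a compact sketch of Algorithm 1 on a finite grid follows; it combines the posterior of Eqs. (8), (9), and (11) with the enlarged confidence width of Eq. (14), using the $\beta_t$ form of Lemma 1. The grid domain, the squared-exponential kernel, and the default parameter values are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    d = np.asarray(A).reshape(-1, 1) - np.asarray(B).reshape(1, -1)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior(X, Y, Xq, lam):
    # mean of Eq. (8)/(11) and variance of Eq. (9); with observations of f* the mean is mu*_t
    if len(X) == 0:
        return np.zeros(len(Xq)), np.ones(len(Xq))      # k(x, x) = 1 for this kernel
    K = se_kernel(X, X) + lam * np.eye(len(X))
    kq = se_kernel(X, Xq)
    mu = kq.T @ np.linalg.solve(K, Y)
    var = 1.0 - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mu, np.maximum(var, 1e-12)

def ec_gp_ucb(observe, D_grid, T, eps, B=1.0, lam=0.1, sigma=0.1, delta=0.05):
    """Algorithm 1 (EC-GP-UCB) on a finite grid; `observe(x)` returns a noisy value of f*."""
    X, Y, log_sum = np.empty(0), np.empty(0), 0.0
    for t in range(1, T + 1):
        mu, var = posterior(X, Y, D_grid, lam)
        # beta_t of Lemma 1, with log_sum = sum_{t'<t} ln(1 + sigma_{t'-1}^2(x_{t'}) / lam)
        beta_t = B + (sigma / np.sqrt(lam)) * np.sqrt(2.0 * np.log(1.0 / delta) + log_sum)
        width = beta_t + eps * np.sqrt(t - 1) / np.sqrt(lam)    # enlarged confidence, Eq. (14)
        idx = int(np.argmax(mu + width * np.sqrt(var)))
        x_t = D_grid[idx]
        log_sum += np.log(1.0 + var[idx] / lam)
        X, Y = np.append(X, x_t), np.append(Y, observe(x_t))
    return X, Y

# usage (e.g., with `observe` from the feedback-model sketch in Section 2):
# X, Y = ec_gp_ucb(observe, np.linspace(0.0, 1.0, 201), T=50, eps=0.05)
```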
The enlargement corresponds to the difference in the mean estimators that is captured in the following lemma.

**Lemma 2.** For any $x \in D$, $t \geq 1$ and $\lambda > 0$, we have
$$|\mu_t(x) - \mu_t^\star(x)| \leq \frac{\epsilon \sqrt{t}}{\sqrt{\lambda}} \, \sigma_t(x), \qquad (16)$$
where $\mu_t(\cdot)$ and $\mu_t^\star(\cdot)$ are defined as in Eq. (8) and Eq. (11), respectively, and $\sigma_t(\cdot)$ is from Eq. (9).

Next, we upper bound the cumulative regret of the proposed algorithm.

**Theorem 1.** Suppose the learner's hypothesis class is $\mathcal{F}_k(D; B)$ for some fixed $B > 0$ and $D \subset \mathbb{R}^d$. For any $f^\star$ defined on $D$ and $\epsilon \geq 0$ such that $\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \leq \epsilon$, EC-GP-UCB with enlarged confidence (Eq. (15)) and known $\epsilon$ achieves the following regret bound with probability at least $1 - \delta$:
$$\bar{R}_T = O\Big( B\sqrt{\gamma_T T} + \sqrt{(\ln(1/\delta) + \gamma_T)\,\gamma_T T} + \epsilon T \sqrt{\gamma_T} \Big). \qquad (17)$$

As remarked before, the regret bound from Eq. (17) implies an upper bound on $R_T$ of the same order, since $\bar{R}_T$ and $R_T$ differ by at most $2\epsilon T$. The first part of the bound, i.e., $O\big(B\sqrt{\gamma_T T} + \sqrt{(\ln(1/\delta) + \gamma_T)\,\gamma_T T}\big)$, corresponds to the standard regret bound achieved by GP-UCB in the realizable scenario [40, 9], which is also known to nearly match the lower bounds in case of commonly used kernels [36]. On the other hand, in Appendix C, we also demonstrate that the $\epsilon T$ dependence is unavoidable in general for any algorithm in the misspecified kernelized setting.

A similar robust GP-UCB algorithm with enlarged confidence bounds was first considered in [4] to defend against adversarial corruptions. One can think of the misspecified setting as a corrupted one where, at every round, the corruption is bounded by $\epsilon$, yielding a total corruption budget of $C = \epsilon T$. The algorithm proposed in [4] attains a $C\sqrt{\gamma_T T}$ regret bound (due to corruptions), which is strictly suboptimal in comparison to the $\epsilon T \sqrt{\gamma_T}$ bound obtained in our Theorem 1. Finally, we note that to obtain the previous result, the algorithm requires knowledge of $\epsilon$ as input, and it is unclear how to adapt the algorithm to the unknown-$\epsilon$ case. In particular, we show in Appendix B.2 that the problem in the analysis arises since there is no effective way of controlling the uncertainty at $x^\star$ when using standard UCB-based approaches. To address this more practical setting, in the next section we propose our main algorithm that does not require the knowledge of $\epsilon$.

### 3.3 Unknown misspecification error: Phased GP Uncertainty Sampling

Our second proposed Phased GP Uncertainty Sampling algorithm, which has no knowledge of the true parameter $\epsilon$, is shown in Algorithm 2. It runs in episodes of exponentially increasing length $m_e$ and maintains a set of potentially optimal actions $D_e$. In each episode $e$, actions are selected via exploration-encouraging uncertainty sampling, i.e., an action of maximal GP epistemic uncertainty $\sigma_{t-1}(\cdot)$ is selected (with ties broken arbitrarily). The selected action is then used to update $\sigma_t(\cdot)$, which does not depend on the received observations (see Eq. (9)). We also note that at the beginning of every episode, the algorithm reverts back to the prior model (before any data is observed), i.e., by setting $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$, for all $x \in D_e$.

**Algorithm 2** Phased GP Uncertainty Sampling
1: Require: kernel function $k(\cdot,\cdot)$, domain $D$, and parameters $B, \lambda, \sigma$
2: Set episode index $e = 1$, episode length $m_e = 1$, and set of potentially optimal actions $D_e = D$
3: Set $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$, for all $x \in D_e$
4: for $t = 1, \ldots, m_e$ do
5: Choose $x_t \in \arg\max_{x \in D_e} \sigma_{t-1}(x) \qquad (19)$
6: Update to $\sigma_t(\cdot)$ by including $x_t$ according to Eq. (9)
7: end for
8: Receive $\{y_1, \ldots, y_{m_e}\}$, such that $y_t = f^\star(x_t) + \eta_t$ for $t \in \{1, \ldots, m_e\}$, and use them to compute $\mu_{m_e}^\star(\cdot)$ according to Eq. (11)
9: Set $D_{e+1} = \big\{ x \in D_e : \; \mu_{m_e}^\star(x) + \beta_{m_e+1}\sigma_{m_e}(x) \geq \max_{x' \in D_e} \; \mu_{m_e}^\star(x') - \beta_{m_e+1}\sigma_{m_e}(x') \big\}$
10: $m_{e+1} \leftarrow 2 m_e$, $e \leftarrow e + 1$, and return to step 3 (terminate after $T$ total function evaluations)

After every episode, the algorithm receives $m_e$ noisy observations, which are then used to update the mean estimate $\mu_{m_e}^\star(\cdot)$ according to Eq. (11). Finally, the algorithm eliminates actions that appear suboptimal according to the current but potentially wrong model. This is done by retaining all the actions $x \in D_e$ whose hallucinated upper confidence bound is at least as large as the maximum hallucinated lower bound, that is,
$$\mu_{m_e}^\star(x) + \beta_{m_e+1}\sigma_{m_e}(x) \geq \max_{x' \in D_e} \; \mu_{m_e}^\star(x') - \beta_{m_e+1}\sigma_{m_e}(x'),$$
where $\beta_{m_e+1}$ is set as in the realizable setting. Here, the term "hallucinated" refers to the fact that these confidence bounds might be invalid (for $f^\star$), and hence, the optimal action might be eliminated. We also note that in case $\epsilon = 0$, the hallucinated confidence bounds are actually valid. In our analysis (see Appendix D), we prove that although the optimal action can be eliminated, a "near-optimal" one is retained after every episode. Hence, we only need to characterize the difference in regret due to such possible wrong eliminations within all episodes (whose number is logarithmic in $T$).

Our Algorithm 2 bears some similarities with the Phased Elimination algorithm of [24] designed for the related linear misspecified bandit setting. Both algorithms employ an episodic action elimination strategy. However, the algorithm of [24] crucially relies on the Kiefer-Wolfowitz theorem [19] and requires computing a near-optimal design at every episode (i.e., a probability distribution over a set of currently plausibly optimal actions) that minimizes the worst-case variance of the resulting least-squares estimator. Adapting this approach to the kernelized setting and RKHS function classes of infinite dimension is a nontrivial task. Instead, we use a kernel ridge regression estimate, and we show that it is sufficient to sequentially select actions via a simple acquisition rule that does not rely on finite-dimensional feature approximations of the kernel. In particular, our algorithm, at each round $t$ (in episode $e$), selects a single action $x$ (from $D_e$) that maximizes $\|\phi(x)\|^2_{(\Phi_t^\top \Phi_t + \lambda I_k)^{-1}}$ (here, $\phi(x)$ denotes the kernel features, i.e., $k(x, x') = \langle \phi(x), \phi(x') \rangle_k$), which is equivalent to maximizing the GP posterior variance in Eq. (19) (see Appendix D for details). We also note that similar GP uncertainty sampling acquisition rules are commonly used in Bayesian optimization and experimental design (see, e.g., [10, 6, 23]), and efficient iterative updates of the posterior variance that avoid computation of a $t \times t$ kernel matrix inverse at each round are also available (see, e.g., Appendix F in [9]).

In the next theorem, we bound the cumulative regret of the proposed algorithm and use $\tilde{O}(\cdot)$ notation to hide $\mathrm{polylog}(T)$ factors.

**Theorem 2.** Suppose the learner's hypothesis class is $\mathcal{F}_k(D; B)$ for some fixed $B > 0$ and $D \subset \mathbb{R}^d$. For any $f^\star$ defined on $D$ and $\epsilon \geq 0$ such that $\min_{f \in \mathcal{F}_k(D;B)} \|f - f^\star\|_\infty \leq \epsilon$, Phased GP Uncertainty Sampling (Algorithm 2) achieves the following regret bound with probability at least $1 - \delta$:
$$R_T = \tilde{O}\Big( B\sqrt{\gamma_T T} + \sqrt{(\ln(1/\delta) + \gamma_T)\,\gamma_T T} + \epsilon T \sqrt{\gamma_T} \Big).$$

In comparison with the previous kernel-based EC-GP-UCB algorithm, our phased uncertainty sampling algorithm attains the same regret guarantee without knowledge of $\epsilon$. Our result holds in the case of infinite action sets, and further on, we can substitute the existing upper bounds on $\gamma_T(k, D)$ to specialize it to particular kernels.
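Before specializing the bound to particular kernels, we give a minimal sketch of the episodic structure of Algorithm 2 on a finite grid: within an episode, actions maximize the posterior standard deviation (which ignores the observations), and at the end of the episode the hallucinated confidence bounds are used to prune $D_e$. The fixed confidence parameter `beta`, the grid domain, and the squared-exponential kernel below are illustrative assumptions rather than the exact choices behind Theorem 2.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    d = np.asarray(A).reshape(-1, 1) - np.asarray(B).reshape(1, -1)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior(X, Y, Xq, lam=0.1):
    # Eqs. (8)/(11) and (9); note the variance does not depend on Y
    K = se_kernel(X, X) + lam * np.eye(len(X))
    kq = se_kernel(X, Xq)
    mu = kq.T @ np.linalg.solve(K, Y)
    var = 1.0 - np.sum(kq * np.linalg.solve(K, kq), axis=0)
    return mu, np.maximum(var, 1e-12)

def phased_gp_uncertainty_sampling(observe, D_grid, T, beta=2.0, lam=0.1):
    """Sketch of Algorithm 2; `beta` stands in for beta_{m_e + 1} (fixed here for simplicity)."""
    D_e = np.asarray(D_grid, float).copy()
    m_e, t_total, queried = 1, 0, []
    while t_total < T:
        X = np.empty(0)
        for _ in range(m_e):                        # uncertainty sampling within the episode
            if len(X) == 0:
                var = np.ones(len(D_e))             # prior variance k(x, x) = 1
            else:                                   # variance ignores observations: pass dummy zeros
                _, var = posterior(X, np.zeros(len(X)), D_e, lam)
            X = np.append(X, D_e[int(np.argmax(var))])
        Y = np.array([observe(x) for x in X])       # observations only arrive after the episode
        queried.extend(X.tolist())
        t_total += m_e
        mu, var = posterior(X, Y, D_e, lam)         # mu*_{m_e} and sigma_{m_e} over D_e
        lcb_max = np.max(mu - beta * np.sqrt(var))
        D_e = D_e[mu + beta * np.sqrt(var) >= lcb_max]   # keep plausibly optimal actions
        m_e *= 2                                    # episodes of doubling length
    return np.array(queried[:T])                    # truncate the (possibly overshooting) last episode
```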
For example, in the case of the linear kernel and a compact domain, we have $\gamma_T(k_{\mathrm{lin}}, D) = O(d \log T)$, while for the squared-exponential kernel it holds that $\gamma_T(k_{\mathrm{SE}}, D) = O((\log T)^{d+1})$ [40]. Moreover, when the used kernel is linear, we recover the same misspecification regret rate as [24], i.e., $\tilde{O}(\epsilon \sqrt{d}\, T)$.

## 4 Algorithm for the contextual misspecified kernelized bandit setting

In this section, we consider a contextual misspecified problem with unknown $f^\star: D \to [0, 1]$ and the same assumptions as before (see Section 2). The main difference comes from the fact that at every round $t$, the learner needs to choose an action from a possibly different action set $D_t \subseteq D$. We assume that the learner observes a context $c_t \triangleq \{x\}_{x \in D_t}$ at every round $t$, where $c_t$ is assumed to be drawn i.i.d. from some distribution (our approach can be extended to the general setting where $f(x, c)$ is defined on the joint action-context space, and where contexts are assumed to be drawn i.i.d.; e.g., user profiles are often considered as contexts in recommendation systems [25], and i.i.d. is a reasonable assumption for different users). The learner then plays $x_t \in D_t$ and observes $y_t = f^\star(x_t) + \eta_t$, with 1-sub-Gaussian noise and independence between time steps.

Since we consider the setting in which the set of actions changes with every round, we note that algorithms that employ action-elimination strategies, such as our Phased GP Uncertainty Sampling, are not easily adapted to this scenario as they require a fixed action set. On the other hand, our optimistic EC-GP-UCB algorithm requires knowing the true misspecification parameter $\epsilon$ a priori, which is also unsatisfactory. To address this, in this section, we combine our EC-GP-UCB with the regret bound balancing strategy from [32]. We measure the regret incurred by the learner by using the corresponding contextual regret definition:
$$R_T = \sum_{t=1}^{T} \Big( \max_{x \in D_t} f^\star(x) - f^\star(x_t) \Big).$$

We specialize the regret bound balancing strategy to the GP-bandit setting. The proposed algorithmic solution considers $M$ different base algorithms $(\mathrm{Base}_i)_{i=1}^{M}$, and uses an elimination-based $M$-armed bandit master algorithm (see Algorithm 3) to choose which base algorithm to play at every round.

**Algorithm 3** Regret bound balancing [33]
Require: $(\mathrm{Base}_i)_{i=1}^{M}$
Initialize: $I_1 = [M]$, $N_i(0) = 0$ for all $i$
for $t = 1, \ldots, T$ do
 Receive $D_t$
 Select $i_t \in \arg\min_{i \in I_t} R_i(N_i(t-1))$
 $\mathrm{Base}_{i_t}$ plays $x_t \in D_t$ and observes $y_t$
 Update $\mathrm{Base}_{i_t}$ and set $N_{i_t}(t) = N_{i_t}(t-1) + 1$
 Update the set of active algorithms $I_{t+1}$
end for

**Base algorithms.** At each round, the learner plays according to the suggestion of a single base algorithm $i \in \{1, \ldots, M\}$. We use $\mathcal{T}_i(t)$ to denote the set of rounds in which base algorithm $i$ is selected up to time $t$, and $N_i(t) = |\mathcal{T}_i(t)|$ to denote the number of rounds algorithm $i$ is played by the learner. After a total of $T$ rounds, the regret of algorithm $i$ is given by $R_{i,T} = \sum_{t \in \mathcal{T}_i(T)} \big( \max_{x \in D_t} f^\star(x) - f^\star(x_t) \big)$; we also note that $R_T = \sum_{i=1}^{M} R_{i,T}$.

As a good candidate for base algorithms, we use our EC-GP-UCB algorithm (from Algorithm 1), but instead of the true unknown $\epsilon$, we instantiate EC-GP-UCB with different candidate values $\hat{\epsilon}_i$. In particular, we use $M = \lceil 1 + \tfrac{1}{2} \log_2(T/\gamma_T^2) \rceil$ different EC-GP-UCB algorithms, and for every algorithm $i \in [M]$, we set $\hat{\epsilon}_i = 2^{1-i}\sqrt{\gamma_T}$. This means that as $i$ decreases, the used algorithms are more robust and can tolerate a larger misspecification error. On the other hand, overly conservative values lead to excessive exploration and large regret. We consider the true $\epsilon$ to be in $[0, 1)$, and our goal is to ideally match the performance of EC-GP-UCB that is run with such $\epsilon$.
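The master loop of Algorithm 3 can be sketched independently of the particular base learners; the candidate bounds $R_i$ and the elimination test that it relies on are detailed next. In the sketch below, each base is assumed to expose `play(D_t)` and `update(x, y)` methods, and `bounds[i]` evaluates the candidate bound $R_i(\cdot)$; these interfaces, and the constant `c`, are illustrative assumptions.

```python
import numpy as np

def balancing_master(bases, bounds, sample_context, observe, T, delta=0.05, c=1.0):
    """Sketch of Algorithm 3 (regret bound balancing)."""
    M = len(bases)
    active = set(range(M))
    N = np.zeros(M, dtype=int)       # N_i(t): number of rounds base i has been played
    J = np.zeros(M)                  # J_i(t): cumulative reward collected by base i
    for t in range(1, T + 1):
        D_t = sample_context()                                  # stochastic action set / context
        i_t = min(active, key=lambda i: bounds[i](N[i]))        # balancing rule, Eq. (23)
        x_t = bases[i_t].play(D_t)
        y_t = observe(x_t)
        bases[i_t].update(x_t, y_t)
        N[i_t] += 1
        J[i_t] += y_t

        def conf(i):                 # confidence width appearing in the elimination test
            n = max(int(N[i]), 1)
            return c * np.sqrt(n * np.log(M * np.log(n + 1) / delta))

        best_lcb = max(J[j] - conf(j) for j in active)
        # drop bases whose optimistic reward estimate falls below the best pessimistic one, Eq. (24)
        active = {i for i in active if J[i] + bounds[i](N[i]) + conf(i) >= best_lcb}
    return N, J, active
```

A candidate bound of the form of Eq. (22) can be supplied for each base as, e.g., `lambda n: min(c0 * (np.sqrt(gT * n) + eps_i * np.sqrt(gT) * n), n)`, where `c0`, `eps_i`, and the information-gain estimate `gT` are hypothetical names for the quantities discussed below.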
Each base algorithm comes with a candidate regret upper bound $R_i(t): \mathbb{N} \to \mathbb{R}_+$ (that holds with high probability) of the form given in Theorem 1. Moreover, every candidate regret bound $R_i(t)$ is non-decreasing in $t$ and $R_i(0) = 0$. Without loss of generality, we can assume that candidate bounds can increase by at most 1 per round, i.e., $0 \leq R_i(t) - R_i(t-1) \leq 1$ for all rounds $t$ and algorithms $i$. We use the following form of EC-GP-UCB candidate bounds in Algorithm 3:
$$R_i(N_i(t)) = \min\Big\{ c_0 \big( \sqrt{\gamma_T N_i(t)} + \hat{\epsilon}_i \sqrt{\gamma_T}\, N_i(t) \big), \; N_i(t) \Big\}, \qquad (22)$$
for some quantity $c_0 > 0$ that does not depend on $N_i(t)$ or $\hat{\epsilon}_i$. For a given $\hat{\epsilon}_i$, these bounds are computable for every $t \in [T]$. In particular, to evaluate them, we also need to compute $\gamma_T$ (Eq. (10)). We note that for popularly used kernels, analytical upper bounds exist as discussed in Section 3.3, while in general, we can efficiently approximate it up to a constant factor via simple greedy maximization (when applied to Eq. (10), a solution found by a simple and efficient greedy algorithm achieves at least $(1 - 1/e)\gamma_T$; this is due to submodularity of the mutual information, see, e.g., [40]).

We refer to algorithm $i$ as being consistent if $R_{i,t} \leq R_i(N_i(t))$ for all $t \in [T]$ with high probability, and otherwise as inconsistent. We also let $i^\star$ denote the first consistent algorithm $i$ for which $\hat{\epsilon}_i \geq \epsilon$. In particular, we have that $\hat{\epsilon}_i \leq \hat{\epsilon}_{i^\star}$ for every inconsistent algorithm $i$.

**Master algorithm.** We briefly recall the regret bound balancing algorithm of [32] and its main steps. At each round, the learner selects an algorithm $i_t$ according to the regret bound balancing principle:
$$i_t \in \arg\min_{i \in I_t} R_i(N_i(t-1)), \qquad (23)$$
where $I_t \subseteq [M]$ is the set of active (i.e., plausibly consistent) algorithms at round $t$. Then, the learner uses the algorithm $i_t$ to select $x_t$, observes the reward $y_t$, and increases $N_{i_t}(t) = N_{i_t}(t-1) + 1$. Intuitively, by evaluating candidate regret bounds at the respective numbers of plays, the algorithm keeps them almost equal and hence balanced, i.e., $R_i(N_i(t)) \leq R_j(N_j(t)) + 1$ for every $t$ and all active algorithms $i, j \in I_t$.

The remaining task is to update the set of active algorithms based on the received observations. We define $J_i(t) = \sum_{j \in \mathcal{T}_i(t)} y_j$ as the cumulative reward of algorithm $i$ in the first $t$ rounds. As in [32] (Section 4), we set $I_1 = [M]$, and at the end of each round, the learner eliminates every inconsistent algorithm $i \in I_t$ from $I_{t+1}$ when:
$$J_i(t) + R_i(N_i(t)) + c \sqrt{N_i(t) \ln(M \ln N_i(t)/\delta)} < \max_{j \in I_t} \; J_j(t) - c \sqrt{N_j(t) \ln(M \ln N_j(t)/\delta)}, \qquad (24)$$
where $c$ is a suitably set absolute constant (see Lemma A.1 in [32]). Crucially, the elimination condition from Eq. (24) is used in Algorithm 3, and it ensures that no consistent algorithm is eliminated (with high probability). In particular, since each $D_t$ is sampled i.i.d., it holds that $N_i(t)\,\mathbb{E}_{D_t}[\max_{x \in D_t} f^\star(x)]$ is upper/lower bounded by the left/right-hand side of Eq. (24). Hence, if the upper bound of algorithm $i$ for this quantity is smaller than the maximum lower bound, the algorithm can be classified as inconsistent and consequently eliminated from $I_{t+1}$.

**Regret bound.** By using the general regret result for Algorithm 3 from [32] (see Appendix E), we show the total regret obtained by Algorithm 3 when instantiated with our EC-GP-UCB algorithm and the candidate regret bounds provided in Eq. (22). This is formally captured in the following proposition.

**Proposition 1.** Consider $M = \lceil 1 + \tfrac{1}{2} \log_2(T/\gamma_T^2) \rceil$ EC-GP-UCB base algorithms, each with a candidate regret upper bound of the form provided in Eq. (22) with $\hat{\epsilon}_i = 2^{1-i}\sqrt{\gamma_T}$ for each $i \in [M]$.
In the misspecified kernelized contextual setting, when run with such $M$ base algorithms, Algorithm 3 achieves the following regret bound with probability at least $1 - M\delta$:
$$\tilde{O}\Big( (B\sqrt{\gamma_T} + \gamma_T)\sqrt{T} + \epsilon T \sqrt{\gamma_T} + (B\sqrt{\gamma_T} + \gamma_T)^2 \Big).$$

The obtained contextual regret bound recovers (up to poly-logarithmic factors and an additive term of lower order) the bound of the EC-GP-UCB instance (from Theorem 1) that assumes knowledge of the true $\epsilon$.

## 5 Conclusion

We have considered the GP-bandit optimization problem in the case of a misspecified hypothesis class. Our work systematically handles model misspecifications in GP bandits and provides robust and practical algorithms. We designed novel algorithms based on ideas such as enlarged confidence bounds, phased epistemic uncertainty sampling, and regret bound balancing, for the standard (with known/unknown misspecification error) and contextual settings. While there have been some theoretical [29] and practical [37] efforts to quantify the impact of misspecified priors in the related Bayesian setting [40], an interesting direction for future work is to develop algorithms and regret bounds in the case of misspecified GP priors.

## Acknowledgments and Disclosure of Funding

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme grant agreement No 815943 and ETH Zürich Postdoctoral Fellowship 19-2 FEL-47.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), pages 2312-2320, 2011.
[2] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12-38. PMLR, 2017.
[3] Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. No-regret Bayesian optimization with unknown hyperparameters. arXiv preprint arXiv:1901.03357, 2019.
[4] Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant Gaussian process bandit optimization. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[5] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), pages 5760-5770, 2018.
[6] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1507-1515, 2016.
[7] Sait Cakmak, Raul Astudillo, Peter Frazier, and Enlu Zhou. Bayesian optimization of risk measures. arXiv preprint arXiv:2007.05554, 2020.
[8] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pages 1227-1237. PMLR, 2021.
[9] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844-853, 2017.
[10] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer, 2013.
[11] Nando de Freitas, Alex Smola, and Masrour Zoghi. Regret bounds for deterministic Gaussian process bandits. arXiv preprint arXiv:1203.2177, 2012.
[12] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
[13] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650-683, 2018.
[14] Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33, 2020.
[15] Dylan J Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002.04926, 2020.
[16] Avishek Ghosh, Sayak Ray Chowdhury, and Aditya Gopalan. Misspecified linear bandits. In AAAI Conference on Artificial Intelligence, 2017.
[17] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137-2143, 2020.
[18] Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
[19] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.
[20] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust Bayesian optimization. arXiv preprint arXiv:2002.09038, 2020.
[21] Johannes Kirschner and Andreas Krause. Stochastic bandits with context distributions. In Advances in Neural Information Processing Systems, pages 14113-14122, 2019.
[22] Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 2447-2455, 2011.
[23] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235-284, 2008.
[24] Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. International Conference on Machine Learning, 2020.
[25] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661-670, 2010.
[26] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765-6816, 2017.
[27] Lydia T Liu, Horia Mania, and Michael Jordan. Competing bandits in matching markets. In International Conference on Artificial Intelligence and Statistics, pages 1618-1628. PMLR, 2020.
[28] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(12), 2006.
[29] Willie Neiswanger and Aaditya Ramdas. Uncertainty quantification using martingales for misspecified Gaussian processes. arXiv preprint arXiv:2006.07368, 2020.
[30] Gergely Neu and Julia Olkhovskaya. Efficient and robust algorithms for adversarial linear contextual bandits. arXiv preprint arXiv:2002.00287, 2020.
[31] Thanh Tang Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robust Bayesian quadrature optimization. arXiv preprint arXiv:2001.06814, 2020.
[32] Aldo Pacchiano, Christoph Dann, Claudio Gentile, and Peter Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020.
[33] Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704, 2020.
[34] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, 2006.
[35] Jonathan Scarlett. Tight regret bounds for Bayesian optimization in one dimension. In International Conference on Machine Learning (ICML), 2018.
[36] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory (COLT), 2017.
[37] Eric Schulz, Maarten Speekenbrink, José Miguel Hernández-Lobato, Zoubin Ghahramani, and Samuel J Gershman. Quantifying mismatch in Bayesian optimization. In NeurIPS Workshop on Bayesian Optimization: Black-box Optimization and Beyond, 2016.
[38] Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, and Andreas Krause. Mixed strategies for robust optimization of unknown objectives. In Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[39] Shubhanshu Shekhar and Tara Javidi. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829-3874, 2018.
[40] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015-1022, 2010.
[41] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML), pages 997-1005, 2015.
[42] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495-517. Springer, 2017.
[43] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits. arXiv preprint arXiv:2009.06966, 2020.
[44] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence (UAI), 2013.
[45] Benjamin Van Roy and Shi Dong. Comments on the Du-Kakade-Wang-Yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
[46] Ziyu Wang and Nando de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758, 2014.
[47] George Wynne, Francois-Xavier Briol, and Mark Girolami. Convergence guarantees for Gaussian process means with misspecified likelihoods and smoothness. Journal of Machine Learning Research, 22(123):1-40, 2021.
[48] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. arXiv preprint arXiv:2003.00153, 2020.