# Instance-dependent Regret Analysis of Kernelized Bandits

Shubhanshu Shekhar 1, Tara Javidi 2

1 Carnegie Mellon University. 2 University of California, San Diego. Correspondence to: Shubhanshu Shekhar.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the problem of designing an adaptive strategy for querying a noisy zeroth-order oracle to efficiently learn about the optimizer of an unknown function $f$. To make the problem tractable, we assume that $f$ lies in the reproducing kernel Hilbert space (RKHS) associated with a known kernel $K$, with its norm bounded by $M < \infty$. Prior results, working in a minimax framework, have characterized the worst-case (over all functions in the problem class) limits on regret achievable by any algorithm, and have constructed algorithms with matching (modulo polylogarithmic factors) worst-case performance for the Matérn family of kernels. These results suffer from two drawbacks. First, the minimax lower bound gives limited information about the limits of regret achievable by commonly used algorithms on a specific problem instance $f$. Second, the existing upper bound analysis fails to adapt to easier problem instances within the function class. Our work takes steps to address both these issues. First, we derive instance-dependent regret lower bounds for algorithms with uniformly (over the function class) vanishing normalized cumulative regret. Our result, valid for several practically relevant kernelized bandits algorithms, such as GP-UCB, GP-TS and SupKernelUCB, identifies a fundamental complexity measure associated with every problem instance. We then address the second issue by proposing a new minimax near-optimal algorithm which also adapts to easier problem instances.

1. Introduction

We consider the problem of optimizing a function $f : \mathcal{X} = [0,1]^d \to \mathbb{R}$ by adaptively gathering information about it via noisy zeroth-order-oracle queries. To make the problem tractable, we assume that (1) $f$ lies in the reproducing kernel Hilbert space (RKHS) associated with a given positive-definite kernel $K$, and (2) the RKHS norm of $f$ is bounded by a known constant $M < \infty$. The function $f$ can be accessed through noisy zeroth-order-oracle queries that return $y_x = f(x) + \eta$ at a query point $x \in \mathcal{X}$, with $\eta$ denoting the additive observation noise. Given a total budget $n$, the goal of an agent is to design an adaptive querying strategy, denoted by $\mathcal{A}$, to select a sequence of query points $(x_t)_{t=1}^n$ that incurs a small cumulative regret $R_n(\mathcal{A}, f)$, defined as
$$R_n(\mathcal{A}, f) := \sum_{t=1}^{n} f(x^*) - f(x_t), \qquad x^* \in \arg\max_{x \in \mathcal{X}} f(x).$$
The cumulative regret forces the agent to address the exploration-exploitation trade-off and prevents it from querying too many points from the suboptimal regions of the domain.

The problem described above is referred to as the kernelized bandit or agnostic Gaussian Process bandit problem (Srinivas et al., 2012). Prior theoretical works in this area have focused on establishing lower and upper bounds on the performance achievable by algorithms in the minimax setting. These results characterize the worst-case limits, over all functions in the given RKHS, of performance achievable by any adaptive sampling algorithm (see Section 1.1 for details). The minimax framework does not account for the fact that there may exist functions that are easier to optimize than others in the same function class. As a result, the minimax regret bounds may not accurately reflect the performance of carefully designed adaptive strategies on typical, non-adversarial, problem instances. In this paper, we take a step towards addressing this issue and present the first instance-dependent analysis for kernelized bandits.
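To fix ideas, the interaction protocol can be sketched in a few lines of Python. This is a toy illustration only: the Gaussian-noise oracle follows Assumption 2a below, while the objective, the noise level, and the uniform-sampling placeholder strategy are our own assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                                   # noise std (illustrative)
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)   # toy objective on X = [0, 1]
x_star = 0.3                                  # its maximizer

def oracle(x):
    """Noisy zeroth-order oracle: y = f(x) + eta, eta ~ N(0, sigma^2)."""
    return f(x) + sigma * rng.standard_normal()

n, cum_regret = 100, 0.0
for t in range(n):
    x_t = rng.uniform()                       # placeholder strategy A
    y_t = oracle(x_t)                         # observation available to A
    cum_regret += f(x_star) - f(x_t)          # contribution to R_n(A, f)
```

Any strategy discussed below replaces the uniform draw with an adaptive choice of $x_t$ based on past observations.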
In this paper, we focus on the RKHS corresponding to the Matérn family of kernels, denoted by $\{H_{K_\nu} : \nu > 0\}$. These kernels are highly relevant for practical applications and allow a graded control over the smoothness of their elements (through a smoothness parameter $\nu > 0$), as described by Stein (2012). We believe that our arguments can be extended to Squared-Exponential (SE) kernels by using the Fourier domain constructions of Scarlett et al. (2017), although the intermediate steps may be more complicated. Furthermore, we also restrict our attention primarily to analyzing the cumulative regret, and leave the extension of our work to the pure exploration setting for future work (see Section 5 for further discussion).

The rest of this paper is organized as follows: we discuss the limitations of the existing minimax analysis and present an overview of our contributions in Section 1.1. We formally state the problem and the required assumptions in Section 2, and in Section 3 we derive instance-dependent lower bounds on the regret achievable by a uniformly good class of algorithms (this class includes all the existing algorithms analyzed in the literature, including GP-UCB, GP-TS and SupKernelUCB). Finally, in Section 4.1, we propose a new algorithm that achieves the best of both worlds: it matches the minimax lower bounds (up to polylogarithmic factors) for the Matérn family of kernels in the worst case, and can also exploit additional structure present in the given problem instance to achieve regret tighter than the minimax lower bound on those instances.

1.1. Overview of Results

For a given class of functions $\mathcal{H}$, the minimax expected cumulative regret is defined as $R_n^*(\mathcal{H}) := \inf_{\mathcal{A}} \sup_{f \in \mathcal{H}} E[R_n(f, \mathcal{A})]$. For the RKHS associated with Matérn kernels with smoothness parameter $\nu > 0$, prior work has established a minimax rate $R_n^*(H_{K_\nu}) = \Theta\big(n^{(\nu+d)/(2\nu+d)}\big)$, ignoring polylogarithmic factors in $n$. In particular, the worst-case algorithm-independent lower bound of the order $\Omega\big(n^{(\nu+d)/(2\nu+d)}\big)$ for the Matérn kernels with smoothness parameter $\nu > 0$ was established by Scarlett et al. (2017). On the other hand, Vakili et al. (2021a) recently derived tighter bounds on the mutual information gain (or equivalently the effective dimension) associated with Matérn kernels, which in turn implied that the SupKernelUCB algorithm of Valko et al. (2013) matches (up to polylogarithmic terms) the above-stated lower bound, hence showing its near-optimality.

While the existing theoretical results provide a rather complete understanding of the minimax regret for the Matérn family of kernels, they suffer from two drawbacks:

1. The minimax lower bounds are obtained by constructing a suitable subset of hard problems, and then showing that there exists no algorithm that can perform well on all of those problems simultaneously. These results tell us that for any algorithm, there exists at least one hard problem instance on which that algorithm must incur a certain regret.
However, such results do not tell us what the limits of performance are for carefully designed, good algorithms (such as GP-UCB; the precise meaning of good is stated in Definition 2) on specific, non-adversarial, problem instances.

2. The existing analysis of most of the algorithms depends on global properties of the given function class, such as the dimension, kernel parameters and RKHS norm. Consider, for example, the GP-UCB algorithm, for which Srinivas et al. (2012) derived the regret upper bound $R_n = \tilde{O}(\sqrt{n}\,\gamma_n)$, where $\gamma_n$ is the maximum information gain for the given kernel $K$, formally defined in (2). Since $\gamma_n$ is a property associated with the kernel itself, the existing theory does not adapt to any simplifying structure that may be present in the specific problem instance.

Our main contributions make progress towards addressing the two issues stated above. In particular, we first derive an instance-dependent lower bound for algorithms with uniformly bounded normalized cumulative regret, and then we propose an algorithm that achieves near-optimal worst-case performance but can also exploit additional structure present in problem instances.

First, we consider the following question: Suppose we are given an algorithm $\mathcal{A}$ that is known to have $O(n^{a_0})$ worst-case regret over all functions in the RKHS associated with the Matérn kernel $K_\nu$ (denoted by $H_{K_\nu}$). Then, what values of expected regret can $\mathcal{A}$ achieve for a given function $f$ in $H_{K_\nu}$?

We answer this question by identifying a lower-complexity term $C_f$ that characterizes the per-instance achievable limit (precise statement in Theorem 1).

Main Result 1 (Lower Bound). Suppose an algorithm $\mathcal{A}$ has a worst-case regret $O(n^{a_0})$ over functions in $H_{K_\nu}$. Then, for a given $f$ with $\|f\|_{H_{K_\nu}} < M$, we have
$$E[R_n(f, \mathcal{A})] = \Omega\big( C_f\big( n^{-(1-a_0)} \big) \big), \quad \text{where } C_f(\Delta) := \sum_{k \ge 0} \frac{m_k}{2^{k+2}\,\Delta} \ \text{ for any } \Delta > 0,$$
and $m_k$ denotes the $2w_k = O\big((2^k \Delta)^{1/\nu}\big)$ packing number (see Definition 15 in Appendix A) of the annular set $Z_k = \{x \in \mathcal{X} : 2^k \Delta \le f(x^*) - f(x) < 2^{k+1} \Delta\}$.

To interpret the term $C_f$, let us deconstruct each element, $m_k/(2^{k+2}\Delta)$, in its defining sum. Consider a ball (denoted by $E$) of radius $w_k = O((2^k\Delta)^{1/\nu})$ contained in the region $Z_k$. By construction of $Z_k$, every point in $E$ is at most $2^{k+1}\Delta$ (and at least $2^k\Delta$) suboptimal for $f$. We first show that, if $\Delta > n^{-(1-a_0)}$, then $\mathcal{A}$ must spend at least $\Omega\big(1/(2^{k+1}\Delta)^2\big)$ queries in the region $E$. This implies that the total number of queries in the region $Z_k$ is at least $\Omega\big(m_k/(2^{k+1}\Delta)^2\big)$, owing to the fact that $m_k$ disjoint balls of radius $w_k$ can be packed in the region $Z_k$. This in turn implies that the total regret incurred by these queries is lower bounded by $\Omega\big(2^k\Delta \cdot m_k/(2^{k+1}\Delta)^2\big) = \Omega\big(m_k/(2^{k+2}\Delta)\big)$. Further details of this argument are in Section 3 and in Appendix B.1.

The previous result can be specialized to the minimax-optimal case by setting $a_0 = a^*_\nu := (\nu+d)/(2\nu+d)$, to obtain the per-instance performance limit of minimax optimal algorithms. Furthermore, with a careful choice of the problem instance $f$, we can also recover the minimax lower bound (see Remark 4). This motivates our next question: Can we construct a minimax-optimal algorithm that adapts, and incurs smaller regret on easier problem instances?

We address this question by constructing a new algorithm $\mathcal{A}_1$ in Section 4.1, for which we show the following (see Theorem 2 for a more precise statement).

Main Result 2 (Upper Bound).
We construct an algorithm, $\mathcal{A}_1$, that is minimax near-optimal for functions in $H_{K_\nu}$, and satisfies (with $a^*_\nu = (\nu+d)/(2\nu+d)$)
$$E[R_n(f, \mathcal{A}_1)] = \tilde{O}\big( \bar{C}_f\big( n^{-(1-a^*_\nu)} \big) \big), \quad \text{where } \bar{C}_f(\Delta) := \sum_{k \ge 0} \frac{\tilde{m}_k}{2^{k\xi}\,\Delta} \ \text{ for } \Delta > 0 \text{ and } \xi = \min\{1, \nu\}.$$
The term $\tilde{m}_k$ is the $2\tilde{w}_k = O\big((2^{k\xi}\Delta)^{1/\xi}\big)$ packing number of the set $\tilde{Z}_k = \{x : f(x^*) - f(x) = O(2^{k\xi}\Delta)\}$.

Similar to $C_f$, the upper-complexity term $\bar{C}_f$ can also be interpreted in terms of the number of queries made by our proposed algorithm $\mathcal{A}_1$ in the regions $\tilde{Z}_k$. In general, the term $\bar{C}_f$ is larger than $C_f$. This is mainly due to two reasons:

- We have stated our lower bound result for algorithms with a worst-case $O(n^{a_0})$ regret over the entire function class $H_{K_\nu}$. However, the proof only requires the regret guarantee to hold uniformly for a small class of local perturbations. Thus, it is possible to refine the result by imposing a uniform regret condition only in a small neighborhood of $f$ (with radius shrinking with $n$ at an appropriate rate). We leave the investigation of such refined lower bounds to future work.

- The packing radius $\tilde{w}_k$ used in $\bar{C}_f$ is smaller than the analogous term $w_k$ used in $C_f$, due to the presence of $\xi = \min\{1, \nu\}$ instead of $\nu$.

Despite these issues, in Proposition 2 we identify conditions under which $\bar{C}_f$ is strictly tighter than the minimax rate, thus proving that $\mathcal{A}_1$ can adapt to easier problem instances.

In the case of functions where the abstract packing numbers $m_k$ and $\tilde{m}_k$ can be well-estimated, the above results give us explicit (in $n$) regret bounds. For example, if the function satisfies $f(x^*) - f(x) \asymp \|x - x^*\|^b$ for some $b > \nu$ in a neighborhood of its maximizer $x^*$ (formally stated in Definition 4), then we can obtain explicit instance-dependent lower (Proposition 1) and upper (Proposition 2) bounds. In Figure 1, we plot the variation of the exponent of the regret bound (i.e., $\alpha$ if the regret is $n^\alpha$) with dimension for the existing upper bounds for different algorithms (dashed) and their corresponding instance-dependent lower bounds (solid lines, same color), as well as the instance-dependent upper bound of our algorithm $\mathcal{A}_1$, for $\nu = 1.1$ and $b = 1.2$.

Figure 1. Plot of the variation of the exponent of different regret upper and lower bounds (i.e., $\alpha$, if the regret is $n^\alpha$) with dimension, on an instance $f \in H_{K_\nu}$ satisfying $f(x^*) - f(x) \asymp \|x - x^*\|^b$ near its optimum. This local growth property of $f$, formally stated in Definition 4, allows us to obtain closed form estimates of $C_f$ and $\bar{C}_f$. The dashed lines denote upper bounds, and the solid lines of the same color represent the corresponding instance-dependent lower bounds derived in Proposition 1. Note that the solid blue line represents the instance-dependent lower bound for both $\mathcal{A}_1$ and SupKernelUCB, as they are both minimax near-optimal.

1.2. Related Work

Minimax lower bounds for kernelized bandits. Scarlett et al. (2017) derived the first algorithm-independent minimax lower bound on the regret for the kernelized bandits problem with SE and Matérn kernels. For the Matérn kernel $K_\nu$, they obtain an algorithm-independent minimax lower bound of $\Omega\big(n^{(d+\nu)/(d+2\nu)}\big)$. Their result proceeds by first constructing a finite class of hard problem instances, and then using a reduction to multiple hypothesis testing and a change of measure argument. However, the hard functions employed by Scarlett et al. (2017) are of the needle-in-haystack type, and may not be representative of the typical functions belonging to the RKHS. Thus, the resulting regret lower bound may provide a pessimistic limit for the achievable performance of algorithms on the specific problem instance encountered.
Instance-dependent lower bounds for linear bandits. Instance-dependent lower bounds have been derived for the related problem of linear bandits with finitely many arms, either in the pure-exploration (Degenne et al., 2020) or regret-minimization (Tirinzoni et al., 2020) settings. At a high level, the proofs of these results, as well as our Theorem 1, proceed in two steps: (i) given a problem instance, identify a suitable class of alternative problem instances, and (ii) obtain a lower bound on the number of queries that any uniformly good algorithm must allocate to the alternatives. For the case of finite domains considered in the above-mentioned works, the alternative classes are quite naturally defined in terms of instances with a different optimal arm. This is not the case in our setting, due to the continuous nature of the domain $\mathcal{X} = [0,1]^d$. We address this by presenting a perturbation argument to construct suitable alternatives, depending on the problem instance as well as the budget $n$, to derive our instance-dependent lower bounds.

Kernelized bandit algorithms. Srinivas et al. (2012) proposed the GP-UCB algorithm, motivated by the Upper Confidence Bound (UCB) strategy for multi-armed bandits (MABs) (Auer et al., 2002), which proceeds by selecting query points $(x_t)_{t \ge 1}$ that maximize a UCB for $f$ of the form $\mu_t(x) + \beta_t \sigma_t(x)$ over a sequence of finite discretizations of the domain $\mathcal{X}$, for suitable factors $(\beta_t)_{t \ge 1}$. For this algorithm, Srinivas et al. (2012) derived the following high-probability upper bound on $R_n$:
$$R_n = \tilde{O}\big(\sqrt{n}\,\gamma_n\big). \qquad (1)$$
In the above display, $\gamma_n$ is the maximum information gain associated with the kernel $K$, defined as
$$\gamma_n := \max_{S \subset \mathcal{X} : |S| = n} I(y_S; f), \quad \text{for } f \sim GP(0, K), \qquad (2)$$
where $y_S = (y_1, \ldots, y_n)$ is the vector of observations at points in $S = (x_1, \ldots, x_n)$ and $I(y_S; f)$ denotes the mutual information between the observations $y_S$ and the function $f$, assumed to be a sample from a zero-mean Gaussian Process $GP(0, K)$. To obtain explicit (in $n$) regret bounds from (1), Srinivas et al. (2012) also derived upper bounds on $\gamma_n$ for the SE and Matérn families. More recently, Vakili et al. (2021a) derived tighter bounds on $\gamma_n$ for these families using a different approach than that employed by Srinivas et al. (2012). In particular, for the Matérn family, this implies the following regret bound for GP-UCB: $R_n = \tilde{O}\big(n^{(3d/2+\nu)/(d+2\nu)}\big)$, where $\nu$ is the smoothness parameter. Chowdhury and Gopalan (2017) derived the same upper bound for the Thompson Sampling based randomized algorithm, GP-TS.

The regret bounds of GP-UCB and GP-TS for Matérn kernels are not sublinear when $\nu < d/2$. Janz et al. (2020) addressed this issue by proposing an algorithm (referred to as $\pi$-GP-UCB), which adaptively partitions the input space and fits independent GP models in each element of the partition. This yields an alternative bound on $\gamma_n$, and results in a sublinear regret bound of the form $R_n = \tilde{O}(n^{\tilde{\nu}})$, where $\tilde{\nu} = \frac{d(2d+3)+2\nu}{d(2d+4)+4\nu}$, for $\nu > 1$ and $d \ge 1$.
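To make the GP-UCB selection rule concrete, here is a minimal sketch of one round over a finite discretization. It is our own illustration: the kernel argument, the regularizer `lam`, and a fixed `beta_t` are assumptions, not the exact constants used in the papers above, and at least one prior observation is assumed.

```python
import numpy as np

def gp_ucb_step(K, X_grid, X_obs, y_obs, beta_t, lam=0.1):
    """One GP-UCB round: return argmax over X_grid of mu_t(x) + beta_t * sigma_t(x)."""
    t = len(X_obs)                                       # assumes t >= 1
    K_mat = np.array([[K(a, b) for b in X_obs] for a in X_obs])
    A_inv = np.linalg.inv(K_mat + lam * np.eye(t))       # (K + lam I)^-1
    best, best_ucb = None, -np.inf
    for x in X_grid:
        k_x = np.array([K(x, a) for a in X_obs])
        mu = k_x @ A_inv @ np.array(y_obs)               # posterior mean
        var = K(x, x) - k_x @ A_inv @ k_x                # posterior variance
        ucb = mu + beta_t * np.sqrt(max(var, 0.0))       # upper confidence bound
        if ucb > best_ucb:
            best, best_ucb = x, ucb
    return best
```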
Valko et al. (2013) proposed the SupKernelUCB algorithm, which proceeds by dividing the queried points into batches consisting of conditionally independent observations. Although initially proposed for finite domains, the algorithm easily extends to continuous domains by using sufficiently fine discretizations. Valko et al. (2013) showed that the SupKernelUCB algorithm achieves a regret bound of $R_n = \tilde{O}\big(\sqrt{n\,\gamma_n}\big)$, tighter than the bound for GP-UCB (and GP-TS) by a factor of $\sqrt{\gamma_n}$. By plugging in the recently derived bounds on $\gamma_n$ for Matérn kernels by Vakili et al. (2021a), this implies that the SupKernelUCB algorithm achieves a regret bound $R_n = \tilde{O}\big(n^{(d+\nu)/(d+2\nu)}\big)$, matching the minimax lower bounds. Despite being minimax near-optimal, SupKernelUCB often performs poorly in practice. Recent works, such as (Salgia et al., 2021; Li and Scarlett, 2022), address this by designing algorithms that are minimax near-optimal and also perform well on common benchmarks. Finally, we note that another minimax near-optimal kernelized bandit algorithm was recently proposed by Camilleri et al. (2021), using a sampling strategy based on G-optimal designs.

2. Preliminaries

In this section, we introduce some definitions and notations required to formally state the problem. Additional definitions are deferred to Appendix A. As stated in the introduction, we consider the problem of optimizing a black-box function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} = [0,1]^d$, that can be accessed through noisy zeroth-order-oracle queries. The task of the learner (or agent) is to design an adaptive strategy to select query points $\{x_t : 1 \le t \le n\}$ that incur a small cumulative regret $R_n(f, \mathcal{A}) = \sum_{t=1}^n f(x^*) - f(x_t)$. To formally present the problem statement, we need a precise definition of adaptive querying strategies and the exact assumptions made on the objective function and observation noise. We begin with a definition of an adaptive querying strategy.

Definition 1 (Adaptive Strategy). An adaptive querying strategy $\mathcal{A}$ consists of a sequence of mappings $(\mathcal{A}_t)_{t=1}^{\infty}$, where $\mathcal{A}_t : (\mathcal{X} \times \mathcal{Y})^{t-1} \times \mathcal{U}^t \to \mathcal{X}$, and $\mathcal{U} = [0,1]$ represents the range of additional randomness (used in randomized algorithms such as GP-TS).

For the rest of the paper, we will use the terms adaptive strategy and algorithm interchangeably. Next, we introduce the assumptions on the objective function and observation noise under which our theoretical results will be stated.

Assumption 1. The objective function $f$ lies in the RKHS $H_K$ associated with a known kernel $K$, and its RKHS norm $\|f\|_{H_K}$ is bounded by a known constant $M < \infty$.

The above assumption is standard in the kernelized bandits literature; informally, it states that the unknown function has low complexity, where the complexity is quantified in terms of the RKHS norm. As stated in the introduction, we will restrict our discussion to Matérn kernels in this paper (formally defined in Appendix A).

Assumption 2a. The zeroth-order-oracle returns $y(x_t) = f(x_t) + \eta_t$ for a query point $x_t \in \mathcal{X}$, where $\{\eta_t : t \ge 1\}$ is an i.i.d. sequence of $N(0, \sigma^2)$ random variables.

Assumption 2b. The zeroth-order-oracle returns $y(x_t) = f(x_t) + \eta_t$ for a query point $x_t \in \mathcal{X}$, where $\{\eta_t : t \ge 1\}$ is an i.i.d. sequence of $\sigma^2$-sub-Gaussian random variables.

These two assumptions state that the observation noise has light tails, and hence we can construct tight confidence intervals for the unknown function based on the noisy observations. We will use Assumption 2a in the statement of our lower bound, while the more general Assumption 2b will be used to state the upper bound result. Note that imposing the $N(0, \sigma^2)$ requirement for stating the lower bounds is primarily to simplify the presentation. This is further discussed in Remark 9 in Appendix B.2.1, and relies on the fact that we can obtain closed form expressions for the KL-divergence involving Gaussian random variables, which simplifies the expression of the lower-complexity term (see Definition 3). We now present the formal problem statement.
Problem Statement 1. Suppose Assumptions 1 and either 2a or 2b hold with known values of $M$ and $\sigma^2$. Then, given a total querying budget $n$, design an adaptive strategy $\mathcal{A}$ to select query points $x_1, \ldots, x_n$ which incur a small cumulative regret $R_n(\mathcal{A}, f) := \sum_{t=1}^n f(x^*) - f(x_t)$.

Notations. We end this section by listing some of the notations that will be used in the rest of the paper. As mentioned earlier, $\mathcal{X} = [0,1]^d$ for some $d \ge 1$ represents the domain and $f : \mathcal{X} \to \mathbb{R}$ is the unknown objective function. We will use $x^*$ to represent an optimal point of $f$, i.e., $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$. For any $x \in \mathcal{X}$ and $r > 0$, we use $B(x, r)$ to denote the $\ell_2$ open ball $\{z \in \mathcal{X} : \|z - x\|_2 < r\}$, and $B(x, r_1, r_2)$ to denote the annular region $\{z \in \mathcal{X} : r_1 < \|z - x\| < r_2\}$. For a function $f$ and constants $u > 0$ and $c > 1$, we use $\tilde{\mathcal{X}}(f, u, c)$ to denote the set $\{x \in \mathcal{X} : u \le f(x^*) - f(x) \le cu\}$.

3. Lower Bounds

In order to derive instance-dependent bounds on the regret, we need to restrict our attention to uniformly-good algorithms that perform well for all elements of the given problem class. A restriction of this kind is necessary to obtain non-trivial instance-dependent regret bounds. Otherwise, for every problem instance $f$, there exists a trivial algorithm that always queries $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$, and incurs no regret on $f$, but suffers a linear regret for all functions for which $x^*$ is strictly suboptimal.

We begin by presenting a definition of $a_0$-consistent algorithms that formally characterizes the meaning of uniformly good algorithms required for obtaining instance-dependent lower bounds. This definition is motivated by a similar notion of consistent policies used in multi-armed bandits (see Lattimore and Szepesvári, 2020, Definition 16.1).

Definition 2 ($a_0$-consistency). An algorithm $\mathcal{A}$ is said to be $a_0$-consistent over a function class $\mathcal{F}$ if, for all $a > a_0$ and $f \in \mathcal{F}$, the following holds:
$$\lim_{n \to \infty} \frac{E[R_n(\mathcal{A}, f)]}{n^a} = 0.$$

Remark 1. Note that when $\mathcal{F}$ is the RKHS associated with a Matérn kernel, all the existing algorithms discussed in Section 1.2 satisfy the condition above with some $a_0 < 1$. In particular, this condition is satisfied by GP-UCB and GP-TS with $a_0 = \min\{1, (\nu + 3d/2)/(2\nu + d)\}$, by $\pi$-GP-UCB with $a_0 = (d(2d+3) + 2\nu)/(d(2d+4) + 4\nu)$, and by SupKernelUCB with $a_0 = (\nu + d)/(2\nu + d)$.

The rest of this section is organized as follows. In Section 3.1, we first present a general lower bound that characterizes the regret achievable by an algorithm on a given function $f$, in terms of the lower-complexity term $C_f$ introduced in Definition 3. This abstract complexity term depends on the packing numbers of certain near-optimal regions associated with $f$, as well as the exponent $a_0$ of the uniform regret condition for $a_0$-consistent algorithms. In Section 3.2, we consider the special case of functions satisfying a local growth condition (Definition 4), for which the term $C_f$ can be well-estimated to get an explicit lower bound in terms of the query budget $n$.

3.1. General lower bound

We now obtain a general instance-dependent lower bound for $a_0$-consistent algorithms. First, we introduce a notion of complexity associated with a function $f \in H_{K_\nu}(M)$, which will be used to state the main result.

Definition 3 (Lower-Complexity). Let $f \in H_{K_\nu}(M)$ with $\|f\|_{K_\nu} = (1-\lambda)M$ for some $\nu > 0$, $M > 0$ and $\lambda \in (0,1)$. Fix $\Delta > 0$, and introduce the set $Z_k := \tilde{\mathcal{X}}(f, 2^k\Delta, 2) = \{x \in \mathcal{X} : 2^k\Delta \le f(x^*) - f(x) < 2^{k+1}\Delta\}$. Introduce the radius $w_k = \big(3 \cdot 2^k \Delta\, M_\nu/(\lambda M)\big)^{1/\nu}$ (with $M_\nu$ denoting the norm of the bump function introduced in Definition 17 in Appendix B.1.2), and let $m_k$ denote the $2w_k$-packing number of the set $Z_k$. Finally, define the complexity term
$$C_f(\Delta) := \sum_{k \ge 0} \frac{m_k}{2^{k+2}\,\Delta}. \qquad (3)$$
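For intuition, the complexity term (3) can be estimated numerically on a one-dimensional grid by greedily packing each annulus $Z_k$. The sketch below is our own illustration: the toy objective, the grid resolution, and the constant `cst` (standing in for $M_\nu/(\lambda M)$) are assumptions.

```python
import numpy as np

def lower_complexity(f, grid, nu, cst, delta, k_max=20):
    """Greedy estimate of C_f(delta) from eq. (3) on a 1-d grid (illustrative)."""
    f_star = f(grid).max()
    total = 0.0
    for k in range(k_max):
        lo, hi = 2 ** k * delta, 2 ** (k + 1) * delta
        gap = f_star - f(grid)
        Z_k = np.sort(grid[(gap >= lo) & (gap < hi)])   # annular set Z_k
        w_k = (3 * 2 ** k * delta * cst) ** (1.0 / nu)  # radius from Definition 3
        m_k, last = 0, -np.inf
        for x in Z_k:                                   # greedy 2*w_k packing
            if x - last >= 2 * w_k:
                m_k, last = m_k + 1, x
        total += m_k / (2 ** (k + 2) * delta)
    return total

grid = np.linspace(0.0, 1.0, 10_000)
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)             # smooth toy objective
print(lower_complexity(f, grid, nu=1.5, cst=1.0, delta=1e-2))
```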
Note that our notation $C_f(\Delta)$ suppresses the $\nu$, $M$ and $\lambda$ dependence of $C_f$. We now present an instance-dependent lower bound on the expected cumulative regret of any $a_0$-consistent algorithm $\mathcal{A}$ in terms of the complexity term introduced above.

Theorem 1. Let $f \in H_{K_\nu}(M)$ with $\|f\|_{K_\nu} = (1-\lambda)M$ for $\nu > 0$, $M > 0$ and $\lambda \in (0,1)$. Consider any $a_0$-consistent algorithm (with $a_0 < 1$) over the family of functions $H_{K_\nu}(M)$, denoted by $\mathcal{A}$, and fix any $a > a_0$. Then, the expected cumulative regret of $\mathcal{A}$ on the instance $f$ satisfies
$$E[R_n(\mathcal{A}, f)] \ \ge\ \frac{7 \log 2}{4}\, \sigma^2\, C_f\big(n^{-(1-a)}\big),$$
for $n$ large enough (exact condition in equation 13 in Appendix B).

Theorem 1 follows as a consequence of a more general statement, presented and proved in Appendix B. We also present a detailed overview of the argument involved in proving Theorem 1 in Appendix B.1. The high-level idea is to first construct a perturbed version of $f$ by adding a bump function (see Definition 17 in Appendix B.1.2) supported on a ball $E$ that is contained in a region $Z_k$ for some $k \ge 1$ (see Definition 3). Then, using the $a_0$-consistency of $\mathcal{A}$, we obtain a lower bound on the number of times $\mathcal{A}$ must query the region $E$. By identifying the total number of disjoint balls that can be packed in different regions of the input space, and adding their contributions to the regret, we obtain the final result.

Remark 2. The lower bound in Theorem 1, based on the complexity term of eq. (3), has a natural interpretation. For any $k \ge 0$, the set $Z_k$ denotes the annular region where the suboptimality $f(x^*) - f(x)$ is between $2^k\Delta_n$ and $2^{k+1}\Delta_n$, with $\Delta_n := n^{-(1-a)}$. Then, the term $w_k = \big(3 \cdot 2^k \Delta_n M_\nu/(\lambda M)\big)^{1/\nu}$ denotes the radius of the smallest ball that can support a scaled bump function (see Definition 17) which ensures that the resulting perturbed version of $f$ still remains in $H_{K_\nu}(M)$ and also satisfies the properties stated in Definition 16 in Appendix B.1.1. As we show in Appendix B.1.1, for any such perturbation of $f$, the algorithm $\mathcal{A}$ must spend roughly $1/(2^{k+1}\Delta_n)^2$ samples to distinguish between $f$ and its perturbation. The regret incurred in the process is lower bounded by $2^k\Delta_n \cdot 1/(2^{k+1}\Delta_n)^2 = 1/(2^{k+2}\Delta_n)$; that is, suboptimality ($\approx 2^k\Delta_n$) times the number of queries in that region ($\approx 1/(2^{k+1}\Delta_n)^2$). Since $m_k$ disjoint balls of radius $w_k$ can be packed into $Z_k$, the expression of the complexity term in (3) follows.

The strict inequality in the definition of the complexity term in (3) immediately implies the following weaker, but more interpretable, version of the above statement.

Corollary 1. Consider an $a_0$-consistent algorithm $\mathcal{A}$, and fix any $a > a_0$. Under the same assumptions as Theorem 1, and with $w_0 := \big(3\, n^{-(1-a)}\, M_\nu/(\lambda M)\big)^{1/\nu}$, let $m_0$ denote the $2w_0$-packing number of the set $Z_0 := \{x \in \mathcal{X} : n^{-(1-a)} \le f(x^*) - f(x) < 2n^{-(1-a)}\}$. Then, we have the following:
$$E[R_n(f, \mathcal{A})] = \Omega\big(\sigma^2\, m_0\, n^{1-a}\big).$$

Remark 3. Corollary 1 states that a key quantity characterizing the regret achievable by a uniformly good algorithm is the packing number of an annular near-optimal region associated with the given function. Informally, we can write $m_0 \approx w_0^{-\bar{d}} \approx n^{(1-a)\bar{d}/\nu}$, where $\bar{d} := \liminf_n \log(m_0)/\log(1/w_0)$. The term $\bar{d}$ is reminiscent of the concepts of near-optimality dimension and zooming dimension used in prior works on bandits in metric spaces (Bubeck et al., 2011; Kleinberg et al., 2019), as well as in Gaussian Process bandits (Shekhar and Javidi, 2018).
These works use such notions to obtain instance-dependent upper bounds on the regret of algorithms that non-uniformly discretize the domain. This connection also motivates our approach in designing Algorithm 1 in Section 4.1.

Remark 4. As a sanity check, we note that Corollary 1 recovers the minimax lower bound by considering $f$ to be a scaled bump function $g$ introduced in Definition 17, and $\mathcal{A}$ to be a minimax near-optimal algorithm. In particular, by defining $f(\cdot) = 2\Delta_n\, g(\cdot/w_0)$, the region $Z_0$ contains the whole domain with a ball of radius $w_0$ removed. This implies that $m_0 \approx w_0^{-d} = n^{(1-a)d/\nu}$ for $a > a^*_\nu = (d+\nu)/(d+2\nu)$, and thus $E[R_n(f, \mathcal{A})] = \Omega(n^\alpha)$ for any
$$\alpha < (1 - a^*_\nu)\Big(1 + \frac{d}{\nu}\Big) = \frac{\nu}{d+2\nu} \cdot \frac{d+\nu}{\nu} = \frac{d+\nu}{d+2\nu} = a^*_\nu.$$

In the next section, we specialize the above results to functions that satisfy an additional local growth condition, for which the complexity terms can be explicitly lower bounded in terms of more interpretable parameters.

3.2. Lower bound under growth condition

In this section, we consider the class of functions satisfying the following additional assumption.

Definition 4 (Growth Condition). We say that the objective function $f$ satisfies the local growth condition with parameters $(\underline{c}, \bar{c}, b, \varrho_0)$ if for all $x \in B(x^*, \varrho_0) \cap \mathcal{X}$, we have
$$\underline{c}\, \|x - x^*\|^b \ \le\ f(x^*) - f(x) \ \le\ \bar{c}\, \|x - x^*\|^b.$$
We shall denote by $\mathcal{F}(\underline{c}, \bar{c}, b, \varrho_0)$ the class of all functions satisfying this property.

Similar conditions have been used in analyzing the performance of first-order stochastic optimization algorithms by Ramdas and Singh (2013), and in characterizing the minimax rates of active learning algorithms by Castro and Nowak (2008). As an example, consider the case when the function $f$ has continuous second-order derivatives and its optimizer $x^*$ lies in the interior of the domain. Then, if the Hessian of $f$ at $x^*$ is non-singular, we can find a $\varrho_0 > 0$ such that for all $x$ in $B(x^*, \varrho_0)$, the spectral norm of the Hessian of $f$ is between $\underline{c}$ and $\bar{c}$. Then the function $f$ satisfies the growth condition with exponent $2$ and constants $\underline{c}$ and $\bar{c}$ depending on the minimum and maximum norm of the Hessian in $B(x^*, \varrho_0)$. We now state the main result of this section.

Proposition 1. Introduce the function class $\mathcal{G} := H_{K_\nu}(M) \cap \mathcal{F}(\underline{c}, \bar{c}, b, \varrho_0)$ for some $\nu > 0$ and $b > \nu$. Let $\mathcal{A}$ be an $a_0$-consistent algorithm with $a_0 < 1$ for the function class $H_{K_\nu}(M)$. Then, for any $a > a_0$ and $f \in \mathcal{G}$ with $\|f\|_{H_{K_\nu}} < M$, we have the following:
$$\liminf_{n \to \infty} \frac{E[R_n(\mathcal{A}, f)]}{n^\alpha} > 0 \quad \text{for any } \alpha < (1 - a_0)\Big(1 + \frac{d}{\nu}\Big(1 - \frac{\nu}{b}\Big)\Big).$$

Since most of the commonly used algorithms, such as GP-UCB, GP-TS, SupKernelUCB and $\pi$-GP-UCB, discussed further in Section 1.2, satisfy the $a_0$-consistency introduced in Definition 2, we can use Theorem 1 to obtain instance-dependent lower bounds for these algorithms.

Corollary 2. Proposition 1 implies that $E[R_n(\mathcal{A}, f)] = \Omega(n^\alpha)$ when $f \in \mathcal{G}$, $b > \nu$, and

- $\mathcal{A}$ is the SupKernelUCB algorithm, $\nu > 1$ and $\alpha < \frac{\nu + d(1 - \nu/b)}{2\nu + d}$;
- $\mathcal{A}$ is either GP-UCB or GP-TS, $\nu > d/2$, and $\alpha < \big(1 - \frac{d}{2\nu}\big)\,\frac{\nu + d(1 - \nu/b)}{2\nu + d}$;
- $\mathcal{A}$ is the $\pi$-GP-UCB algorithm, $\nu > 1$ and $\alpha < \big(1 + \frac{d}{2\nu}\big)\,\frac{\nu + d(1 - \nu/b)}{2\nu + d(d+2)}$.

The results of the above corollary are presented for the specific values $\nu = 1.1$ and $b = 1.2$ in Figure 1. As we can see, algorithms with a tighter uniform regret bound (i.e., smaller $a_0$) incur higher instance-dependent lower bounds.
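The exponents in Corollary 2 are easy to tabulate. The short script below reproduces the kind of comparison shown in Figure 1; it is our own illustration, and it simply restates the corollary's formulas as reconstructed above.

```python
def lb_exponents(nu, b, d):
    """Instance-dependent lower-bound exponents from Corollary 2."""
    base = (nu + d * (1 - nu / b)) / (2 * nu + d)        # SupKernelUCB
    gp_ucb = (1 - d / (2 * nu)) * base                   # GP-UCB / GP-TS (needs nu > d/2)
    pi_gp = (1 + d / (2 * nu)) * (nu + d * (1 - nu / b)) / (2 * nu + d * (d + 2))
    return base, gp_ucb, pi_gp

# nu = 1.1 > 1 and nu > d/2 hold for d = 1, 2, so all three items apply
for d in (1, 2):
    sup_ucb, gp_ucb, pi_gp = lb_exponents(nu=1.1, b=1.2, d=d)
    print(f"d={d}: SupKernelUCB<{sup_ucb:.3f}  GP-UCB<{gp_ucb:.3f}  pi-GP-UCB<{pi_gp:.3f}")
```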
4. Instance-Dependent Upper Bound

As mentioned in the introduction, the prior theoretical analyses of the existing kernelized bandits algorithms upper bound their regret in terms of quantities such as the maximum information gain $\gamma_n$, defined in (2). Since $\gamma_n$ is a property associated with the kernel $K$, such results do not adapt to the hardness of the specific problem instance within the associated RKHS: they predict the same performance guarantees for the easiest as well as the hardest problem in the class. As a result, the ability of the existing algorithms to adapt to the complexity of problems within the same problem class is unknown. Our results in this section take a step towards addressing this issue. In particular, we describe a new algorithm for kernelized bandits in Section 4.1, and show in Section 4.2 that this algorithm is minimax near-optimal and also admits tighter upper bounds for easier problem instances.

4.1. Proposed Algorithm

We first introduce a standard notion of a sequence of nested partitions of the input space, often used in prior works, such as (Bubeck et al., 2011; Munos, 2011; Wang et al., 2014; Shekhar and Javidi, 2018), to design algorithms for zeroth-order optimization.

Algorithm 1: Breadthwise Exploration with Adaptive Discretization ($\mathcal{A}_1$)

Input: $n$, the querying budget; $K_\nu$, the kernel belonging to the Matérn family; $M$, upper bound on the RKHS norm; $v_1, v_2, \rho$, parameters of the tree of partitions; $\tau$, regularization parameter used in posterior computation.

Initialize: $t = 1$, flag $\leftarrow$ True, $E_t \leftarrow \emptyset$, $Y_t \leftarrow \emptyset$, $P_t \leftarrow \{x_{0,1}\}$, $L = M \log n$ and $\xi = \min\{\nu, 1\}$.
for $t = 1, 2, \ldots, n$ do
    while flag do
        $\beta_t, \{(\mu_t(x), \sigma_t(x)) : x \in P_t\} \leftarrow$ ComputePosterior($P_t, E_t, Y_t, \tau$)
        $x_t \leftarrow \arg\max_{x \in P_t} \sigma_t(x)$; $U_t \leftarrow \max_{x \in P_t} \sigma_t(x)$
        if $1/\sqrt{n} < U_t < L (v_1 \rho^{h_t})^{\xi}$ then
            $P_t, h_t \leftarrow$ RefinePartition($P_t, \beta_t, \{(\mu_t(x), \sigma_t(x)) : x \in P_t\}$)
        else
            $y_t \leftarrow f(x_t) + \eta_t$; $E_t \leftarrow E_t \cup \{x_t\}$; $Y_t \leftarrow Y_t \cup \{y_t\}$
            flag $\leftarrow$ False
    flag $\leftarrow$ True

Definition 5 (tree of partitions). We say that a sequence of subsets of $\mathcal{X}$, denoted by $(\mathcal{X}_h)_{h \ge 0}$, forms a tree of partitions of $\mathcal{X}$ if it satisfies the following properties:

- For all $h \ge 0$, we have $\mathcal{X}_h = \{x_{h,i} : 1 \le i \le 2^h\}$. Furthermore, with every $x_{h,i} \in \mathcal{X}_h$ is associated a cell $X_{h,i}$. For $i \ne j$, $X_{h,i}$ and $X_{h,j}$ are disjoint, and $\cup_{i=1}^{2^h} X_{h,i} = \mathcal{X}$.
- Moreover, for any $h, i$ we have $\{x_{h+1,2i-1}, x_{h+1,2i}\} \subset \mathcal{X}_{h+1} \cap X_{h,i}$.
- There exist constants $0 < v_2 \le 1 \le v_1$ and $\rho \in (0,1)$ such that $B(x_{h,i}, v_2\rho^h) \subset X_{h,i} \subset B(x_{h,i}, v_1\rho^h)$.

Next, we introduce two subroutines that will be employed by the algorithm. The first subroutine, called ComputePosterior, computes the posterior mean and posterior covariance function of the surrogate Gaussian Process (GP) model used for approximating the unknown objective function in the algorithm.

Definition 6 (ComputePosterior). Given a subset $P_t \subset \mathcal{X}$, a multi-set of points belonging to $P_t$, denoted by $E_t = \{x_1, \ldots, x_{t-1}\}$, at which the function $f$ was evaluated, the corresponding noisy function evaluations $Y_t = \{y_1, \ldots, y_{t-1}\}$, and a constant $\tau > 0$, the ComputePosterior subroutine returns the terms $\beta_t$ and $\{(\mu_t(x), \sigma_t(x)) : x \in P_t\}$, which are defined as follows:
$$\Sigma_t = K_t + \tau I, \quad \text{where } K_t = [K(x_i, x_j)]_{x_i, x_j \in E_t},$$
$$k_t(x) = [K(x, x_1), \ldots, K(x, x_{t-1})]^T, \qquad y_t = [y_1, \ldots, y_{t-1}]^T,$$
$$\beta_t = \sqrt{2 \log\big(|P_t|\, n^3\big)/t},$$
$$\mu_t(x) = k_t(x)^T \Sigma_t^{-1} y_t, \qquad \sigma_t(x) = \tau^{-1/2} \sqrt{K(x, x) - k_t(x)^T \Sigma_t^{-1} k_t(x)}.$$

As we describe later, our algorithm proceeds by adaptively discretizing the input space based on the observations gathered: the granularity of the partition becomes finer in the near-optimal regions of the input space, aimed at mimicking the discretization of the space involved in defining the complexity term in Definition 3.
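A direct sketch of ComputePosterior in Python is given below. It follows the displayed formulas (including the $\beta_t$ expression as reconstructed above); the dense matrix inverse is for illustration only, and at least one prior evaluation is assumed.

```python
import numpy as np

def compute_posterior(K, P_t, E_t, Y_t, tau, n):
    """ComputePosterior (Definition 6): beta_t and (mu_t, sigma_t) at each x in P_t."""
    t = len(E_t)                                             # assumes t >= 1
    K_t = np.array([[K(a, b) for b in E_t] for a in E_t])
    Sigma_inv = np.linalg.inv(K_t + tau * np.eye(t))         # (K_t + tau I)^-1
    y_vec = np.array(Y_t)
    beta_t = np.sqrt(2.0 * np.log(len(P_t) * n ** 3) / t)
    stats = []
    for x in P_t:
        k_x = np.array([K(x, a) for a in E_t])
        mu = float(k_x @ Sigma_inv @ y_vec)                  # posterior mean mu_t(x)
        var = (K(x, x) - k_x @ Sigma_inv @ k_x) / tau        # sigma_t(x)^2
        stats.append((mu, np.sqrt(max(var, 0.0))))
    return beta_t, stats
```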
Our second subroutine, called RefinePartition, presents the formal steps involved in updating the discretization used in the algorithm.

Definition 7 (RefinePartition). This subroutine takes as inputs $P_t$, $h_t$, $\beta_t$ and $\{(\mu_t(x), \sigma_t(x)) : x \in P_t\}$, and returns an updated partition $P_t$ and level $h_t$. First, compute $l_t := \max_{x \in P_t} \mu_t(x) - \beta_t\sigma_t(x)$ and define $\tilde{P}_t = \{x \in P_t : \mu_t(x) + \beta_t\sigma_t(x) > l_t\}$. Next, define the new $P_t$ as $P_t = \cup_{x_{h_t,i} \in \tilde{P}_t} \{x_{h_t+1,2i-1}, x_{h_t+1,2i}\}$, and update $h_t \leftarrow h_t + 1$.
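In code, Definition 7 is an eliminate-then-split step. The sketch below is our own rendering, with `children(x)` standing for the map $x_{h,i} \mapsto \{x_{h+1,2i-1}, x_{h+1,2i}\}$ from Definition 5.

```python
def refine_partition(P_t, h_t, beta_t, stats, children):
    """RefinePartition (Definition 7): drop provably suboptimal cells, split the rest."""
    lcb = [mu - beta_t * s for (mu, s) in stats]       # lower confidence bounds
    ucb = [mu + beta_t * s for (mu, s) in stats]       # upper confidence bounds
    l_t = max(lcb)                                     # best lower confidence bound
    survivors = [x for x, u in zip(P_t, ucb) if u > l_t]
    new_P = [child for x in survivors for child in children(x)]
    return new_P, h_t + 1
```

Since the cell containing $x^*$ has a UCB of at least $f(x^*) \ge l_t$ with high probability (the usual confidence-interval argument), the optimizer's cell survives each refinement.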
We now present an outline of the steps of our proposed algorithm; the formal pseudocode is in Algorithm 1.

Outline of Algorithm 1. At any time $t \ge 1$, the algorithm maintains a set of active points denoted by $P_t$. These points satisfy the following two properties: (i) $P_t \subset \mathcal{X}_{h_t}$ for some $h_t \ge 0$, i.e., all the active points lie at the same depth ($h_t$) of the tree of partitions, and (ii) any optimizer $x^*$ of $f$ must lie in the region $\cup_{x_{h_t,i} \in P_t} X_{h_t,i}$. The algorithm evaluates the function at points in the active set, and computes the posterior mean and standard deviation by calling the ComputePosterior subroutine. The algorithm then compares the maximum posterior standard deviation (over points in $P_t$) with an upper bound on the variation of the function value in the cell $X_{h_t,i}$ associated with an active point $x_{h_t,i} \in P_t$. If the maximum posterior standard deviation is larger than $L(v_1\rho^{h_t})^\xi$, then the algorithm evaluates the function at the corresponding active point with the largest $\sigma_t(x)$. Otherwise, it concludes that the active points in $P_t$ have been sufficiently well explored, and it moves to the next level of the partition tree by calling the RefinePartition subroutine. The above process continues until the querying budget is exhausted.

Remark 5. Algorithm 1 carefully combines two key ideas from the kernelized bandits literature: (i) it divides the evaluated points into subsets (according to their level $h$ in the tree) which satisfy a conditional independence property, similar to SupKernelUCB of Valko et al. (2013), and (ii) it adaptively partitions the input space to zoom into the near-optimal regions, similar to the algorithms in (Shekhar and Javidi, 2018; 2020). The first property allows us to construct tighter confidence intervals, which results in the algorithm achieving the minimax regret rate. As a result, applying Theorem 1 to this algorithm provides us with the best (i.e., the highest) instance-dependent lower bounds. Additionally, the second property allows the algorithm to exploit easier problem instances when the objective function satisfies the growth condition with small $b$, and results in improved regret bounds on these problem instances. This is proved in Theorem 2 and Proposition 2.

4.2. Regret Bound for Algorithm 1 ($\mathcal{A}_1$)

In this section, we analyze the performance of Algorithm 1 on individual problem instances lying in the RKHS associated with Matérn kernels. To do so, we introduce the following complexity measure associated with a problem instance $f$.

Definition 8 (Upper-Complexity). Consider a function $f \in H_{K_\nu}$ with $\nu > 0$, and define $\xi = \min\{1, \nu\}$. For given constants $\Delta, \rho \in (0,1)$ and $c_1, c_2 > 0$, introduce the set $\tilde{Z}_k := \{x \in \mathcal{X} : f(x^*) - f(x) \le c_1 (1/\rho)^{k\xi}\, \Delta\}$ for $k \ge 0$, and let $\tilde{m}_k$ denote the $2\tilde{w}_k := 2\big(c_2 (1/\rho)^{k\xi}\, \Delta\big)^{1/\xi}$ packing number of the set $\tilde{Z}_k$. Define the upper-complexity term as follows:
$$\bar{C}_f(\Delta) := \sum_{k \ge 0} \frac{\tilde{m}_k}{(1/\rho)^{k\xi}\,\Delta}.$$
Note that the notation $\bar{C}_f(\Delta)$ suppresses the $\nu$, $\rho$, $c_1$ and $c_2$ dependence of the complexity term.

Remark 6 (Comparison of $\bar{C}_f$ and $C_f$). The upper-complexity term introduced above has a form similar to the corresponding lower-complexity term ($C_f$) introduced earlier in Definition 3: both complexity measures sum over terms involving a packing number of a near-optimal set in the numerator, and an exponentially growing term times $\Delta$ in the denominator. Despite this similarity, the upper-complexity term is, in general, larger than the corresponding lower-complexity term. This is because $\tilde{w}_k$ is proportional to $\Delta^{1/\xi}$, while $w_k$ (in Definition 3) is proportional to $\Delta^{1/\nu}$. Since $\nu \ge \xi := \min\{1, \nu\}$ and $\Delta < 1$, the term $\tilde{m}_k$ represents a much tighter packing than the corresponding term $m_k$ in Definition 3. Another, less important, factor in $\bar{C}_f$ being larger than $C_f$ is that the set $\tilde{Z}_k$ is usually larger than the corresponding annular set $Z_k$ used in defining $C_f$.

We now state the main result of this section, which shows that the algorithm $\mathcal{A}_1$ is minimax near-optimal, and that its instance-dependent regret can be characterized by the upper-complexity term defined above.

Theorem 2. For the class of functions $f \in H_{K_\nu}(M)$, the algorithm $\mathcal{A}_1$ is $a_0$-consistent with $a_0 = a^*_\nu = (\nu+d)/(2\nu+d)$. Furthermore, $\mathcal{A}_1$ also satisfies the following instance-dependent upper bound on the expected regret:
$$E[R_n(\mathcal{A}_1, f)] = \tilde{O}\big(\bar{C}_f(\Delta_n)\big), \quad \text{where } \Delta_n = \min\big\{n^{-(1-a^*_\nu)},\ \rho^{H_n \xi}\big\},$$
and $H_n$ is defined precisely in Lemma 7 in Appendix D. The $\tilde{O}(\cdot)$ term suppresses polylogarithmic factors in $n$.

Again, it is instructive to specialize the above result to the special case of an instance $f$ that also satisfies the additional local-growth condition around its optima.

Proposition 2. Consider a function $f$ satisfying Assumption 1 with $K = K_\nu$, and Assumption 2b with parameter $\sigma^2$, that also satisfies the growth condition (introduced in Definition 4) with an exponent $b$. Then, the cumulative regret of Algorithm 1 satisfies the following, with $\xi = \min\{1, \nu\}$:
$$E[R_n(\mathcal{A}_1, f)] = \tilde{O}(n^a), \quad \text{where } a := \min\left\{\frac{d+\nu}{d+2\nu},\ \frac{d(1-\xi/b)_+ + \xi}{d(1-\xi/b)_+ + 2\xi}\right\}.$$
The notation $(z)_+$ refers to $\max\{0, z\}$, and the notation $\tilde{O}$ hides the polylogarithmic factors in the upper bound.

The proof of this result is given in Appendix D. In particular, this result implies that for an instance $f$ with local growth exponent $b > 0$, for all values of $d \ge 1$ and $0 < \nu < (1 - 1/b)^{-1}$, the regret achieved by Algorithm 1 is tighter than the minimax rate (achieved by SupKernelUCB). Furthermore, for a fixed $\nu > 0$, the amount of possible improvement increases with decreasing values of $b$. While the regret achieved by $\mathcal{A}_1$ is better than those of existing algorithms, it is still larger than the corresponding instance-dependent lower bound, as shown in Figure 1. The key reason is that $\mathcal{A}_1$ uses the Hölder continuity (with exponent $\xi$) of the elements of $H_{K_\nu}$ to adaptively discretize the domain, which leads to the $\xi$ (instead of $\nu$) dependence of the complexity term $\bar{C}_f$. This looseness can be avoided if we can construct tighter $\mathcal{X}$-uniform confidence intervals (Vakili et al., 2021b) for $f$.

5. Conclusion

In this paper, we initiated the instance-dependent analysis of the kernelized bandits problem. This investigation was motivated by the practical question of identifying the limits of performance achievable by commonly used algorithms, such as GP-UCB, on typical, non-adversarial, problem instances. We first obtained a general complexity measure that characterizes the fundamental hardness of a specific problem instance.
Then, we specialized this result to problem instances satisfying an additional local growth condition, to obtain explicit lower bounds in terms of the budget $n$. Finally, we introduced a new algorithm that achieves the best of both worlds: it matches the worst-case performance limit (modulo polylogarithmic terms) established by prior work, but also has the ability to adapt to easier problem instances.

The results of this paper lead to several interesting questions for future work, and we describe two key directions below:

- A natural next step is to investigate whether we can design an algorithm that is both minimax near-optimal and instance-optimal for the Matérn family. More specifically, with $a_0 = a^*_\nu := (d+\nu)/(d+2\nu)$, can we design an algorithm that satisfies $\sup_{f \in H_{K_\nu}(M)} E[R_n(\mathcal{A}, f)] = \tilde{O}(n^{a_0})$, and $E[R_n(\mathcal{A}, f)] = \tilde{O}\big(C_f(\Delta_n)\big)$ with $\Delta_n = n^{-(1-a)}$ for any $a > a_0$, simultaneously (recall that $C_f$ denotes the complexity term introduced in Definition 3)? From a technical point of view, achieving this will be significantly aided by deriving tight time- and $\mathcal{X}$-uniform confidence intervals for the GP model.

- Another interesting line of work is to adapt the ideas used in this paper to problems such as kernel level-set estimation, and optimization in other function spaces (Singh, 2021; Liu et al., 2021). The lower-bound technique of our paper can be easily generalized to these cases, as we discuss briefly in ??. However, designing algorithms that match the so-obtained instance-dependent lower bounds may require new techniques.

Finally, the focus in this paper was on obtaining instance-dependent bounds for the cumulative regret $R_n$. It is interesting to explore whether similar results can be obtained for the simple regret in the pure exploration setting.

References

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(5), 2011.
X. Cai and J. Scarlett. On lower bounds for standard and robust Gaussian process bandit optimization. In International Conference on Machine Learning, pages 1216-1226. PMLR, 2021.

R. Camilleri, K. Jamieson, and J. Katz-Samuels. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pages 1227-1237. PMLR, 2021.

R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.

S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844-853. JMLR.org, 2017.

R. Degenne, P. Ménard, X. Shang, and M. Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432-2442. PMLR, 2020.

A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377-399, 2019.

D. Janz, D. Burt, and J. González. Bandit optimisation of functions in the Matérn kernel RKHS. In International Conference on Artificial Intelligence and Statistics, pages 2486-2495. PMLR, 2020.

M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.

R. Kleinberg, A. Slivkins, and E. Upfal. Bandits and experts in metric spaces. Journal of the ACM (JACM), 66(4):1-77, 2019.

T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92-107. PMLR, 2022.

Y. Liu, Y. Wang, and A. Singh. Smooth bandit optimization: generalization to Hölder space. In International Conference on Artificial Intelligence and Statistics, pages 2206-2214. PMLR, 2021.

R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, pages 783-791, 2011.

Y. Polyanskiy and Y. Wu. Lecture notes on information theory. Technical report, MIT, 2014. (http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf).

A. Ramdas and A. Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In Proceedings of the 30th International Conference on Machine Learning, pages 365-373. PMLR, 2013.

S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. Advances in Neural Information Processing Systems, 34, 2021.

J. Scarlett, I. Bogunovic, and V. Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723-1742, 2017.

S. Shekhar and T. Javidi. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829-3874, 2018.

S. Shekhar and T. Javidi. Multi-scale zero-order optimization of smooth functions in an RKHS. arXiv preprint arXiv:2005.04832, 2020.

S. Singh. Continuum-armed bandits: a function space perspective. In International Conference on Artificial Intelligence and Statistics, pages 2620-2628. PMLR, 2021.

N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012.

M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 2012.

A. Tirinzoni, M. Pirotta, M. Restelli, and A. Lazaric. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. Advances in Neural Information Processing Systems, 33:1417-1427, 2020.

S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82-90. PMLR, 2021a.

S. Vakili, J. Scarlett, and T. Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Conference on Learning Theory, pages 4647-4652. PMLR, 2021b.

M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, page 654, 2013.

R. van Handel. Probability in high dimension. Technical report, Princeton University, 2014. (https://web.math.princeton.edu/~rvan/APC550.pdf).

Z. Wang, B. Shakibi, L. Jin, and N. Freitas. Bayesian multi-scale optimistic optimization. In Artificial Intelligence and Statistics, pages 1005-1014. PMLR, 2014.

A. Additional Definitions

In this section, we list the definitions of several technical terms that have been used in stating the main results of the paper.
We begin with the definition of a positive-definite kernel, which will then be used in defining an RKHS.

Definition 9 (Positive Definite Kernel). For a non-empty set $\mathcal{X}$, a symmetric function $K : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ is called a positive-definite kernel if, for any $m \in \mathbb{N}$, any $x_1, \ldots, x_m \in \mathcal{X}$ and $c_1, c_2, \ldots, c_m \in \mathbb{R}$, the following is true: $\sum_{i=1}^m \sum_{j=1}^m c_i K(x_i, x_j) c_j \ge 0$.

In this paper, we focus primarily on a family of kernels, referred to as the Matérn family, parameterized by a smoothness parameter $\nu > 0$.

Definition 10 (Matérn kernels). For $\nu > 0$ and $\theta > 0$, the Matérn kernel $K_\nu : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined as
$$K_\nu(x, z) = \frac{1}{2^{\nu-1}\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - z\|}{\theta}\right)^{\nu} J_\nu\!\left(\frac{\sqrt{2\nu}\,\|x - z\|}{\theta}\right),$$
where $J_\nu$ denotes the modified Bessel function of the second kind of order $\nu$.

The RKHS associated with Matérn kernels consists of functions with a finite degree of smoothness (Kanagawa et al., 2018), as opposed to the infinitely differentiable functions lying in the RKHS associated with the SE kernels. Due to this property, the Matérn kernels are commonly used in practical problems (Stein, 2012, §1.7), as they provide a reasonable trade-off between analytical tractability and representation power. We now present a formal definition of the RKHS associated with a positive definite kernel $K$.

Definition 11 (RKHS). For a nonempty set $\mathcal{X}$ and a positive-definite kernel $K$, the RKHS associated with $K$, denoted by $H_K$, is defined as the Hilbert space of functions on $\mathcal{X}$ with an inner product $\langle \cdot, \cdot \rangle$ satisfying the following: (i) for all $x \in \mathcal{X}$, the function $K(\cdot, x) \in H_K$, and (ii) for all $x \in \mathcal{X}$ and $g \in H_K$, we have $g(x) = \langle g, K(\cdot, x) \rangle$.

The equality $g(x) = \langle g, K(\cdot, x) \rangle$ is referred to as the reproducing property, which lends its name to the RKHS. We next introduce the definition of Gaussian Processes (GPs), which are often used as a surrogate model for estimating functions lying in an RKHS.

Definition 12 (Gaussian Processes). For a positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we use $GP(0, K)$ to represent a stochastic process indexed by $\mathcal{X}$, denoted by $\{Z_x : x \in \mathcal{X}\}$, such that for any $m \in \mathbb{N}$ and $x_1, \ldots, x_m \in \mathcal{X}$, the random vector $[Z_{x_1}, \ldots, Z_{x_m}] \sim N(0, \Sigma_m)$ with $\Sigma_m = [K(x_i, x_j)]_{1 \le i,j \le m}$.

Next, we present the formal definition of the probability measure that the noisy zeroth-order-oracle and the adaptive sampling scheme induce on the space $([0,1]^d \times \mathbb{R})^n$. This term is used in Lemma 1.

Definition 13 (Induced Probability Measure). An adaptive sampling strategy $\mathcal{A}$ and a function $f$ induce a probability measure $P_{f,\mathcal{A}}$ (henceforth abbreviated as $P_f$) on the measurable space $(\Omega, \mathcal{F})$, with $\Omega = (\mathcal{X} \times \mathcal{Y})^n$ and $\mathcal{F}$ representing the Borel $\sigma$-algebra on $\Omega$. Recall that we have $\mathcal{X} = [0,1]^d$ and $\mathcal{Y} = \mathbb{R}$. This measure assigns probabilities to events $E = \prod_{t=1}^n (E_{t,\mathcal{X}} \times E_{t,\mathcal{Y}})$, where $E_{t,\mathcal{X}} \in \mathcal{B}_{\mathcal{X}}$ and $E_{t,\mathcal{Y}} \in \mathcal{B}_{\mathcal{Y}}$:
$$P_f\big((X^n, Y^n) \in E\big) = \prod_{t=1}^{n} P_f\big(Y_t \in E_{t,\mathcal{Y}} \,\big|\, X^t, Y^{t-1}\big)\, P_{\mathcal{A}}\big(X_t \in E_{t,\mathcal{X}} \,\big|\, X^{t-1}, Y^{t-1}\big).$$
Here $P_f$ represents the noisy zeroth-order-oracle and $P_{\mathcal{A}}$ is determined by the sampling strategy.

We end this section with the definitions of sub-Gaussian random variables (used in stating Assumption 2b) and packing numbers (used in defining the complexity terms $C_f$ and $\bar{C}_f$).

Definition 14 (Sub-Gaussianity). A random variable $X$ is said to be sub-Gaussian with parameter $\sigma^2$ if it satisfies the following for all $t \in \mathbb{R}$: $E[e^{tX}] \le \exp\big(\sigma^2 t^2/2\big)$.

Definition 15 (Packing number). Given a subset $S$ of a metric space $(\mathcal{X}, \ell)$, the $w$-packing number of $S$ is the cardinality of the largest $E \subset S$ such that any two $z, z' \in E$ satisfy $\ell(z, z') \ge w$.
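To make Definition 10 concrete, a minimal numerical sketch of the Matérn kernel is shown below, using SciPy's modified Bessel function of the second kind. It assumes the standard $\sqrt{2\nu}\,r/\theta$ argument written above, and the default lengthscale $\theta = 1$ is an arbitrary choice.

```python
import numpy as np
from scipy.special import gamma, kv   # kv: modified Bessel fn of the 2nd kind

def matern(x, z, nu=1.5, theta=1.0):
    """Matérn kernel K_nu(x, z) from Definition 10."""
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(z))
    if r == 0.0:
        return 1.0                     # K_nu(x, x) = 1 in the limit r -> 0
    s = np.sqrt(2.0 * nu) * r / theta
    return (2.0 ** (1.0 - nu) / gamma(nu)) * (s ** nu) * kv(nu, s)

# Sanity check: nu = 1/2 recovers the exponential kernel exp(-r/theta)
print(matern(0.0, 0.5, nu=0.5), np.exp(-0.5))
```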
B. Proof of Theorem 1

We present a detailed outline of the proof of Theorem 1 in Appendix B.1, and describe the formal steps of the proof in Appendix B.2 and Appendix B.3. In particular, Appendix B.2 contains the statement and proof of an intermediate result that gives us a single term in the lower bound of Theorem 1, and Appendix B.3 contains the steps required to obtain the statement of Theorem 1 from the intermediate result.

B.1. Overview of the argument

We begin by presenting an informal description of the key ideas involved in obtaining the main lower bound, formally stated as Theorem 1 in Section 3.1. Suppose $f$ is a function lying in $H_{K_\nu}(M)$ and let $\mathcal{A}$ be an $a_0$-consistent algorithm for this family of functions. Suppose $E_1, E_2, \ldots, E_m$ are $m$ disjoint subsets of the input space $\mathcal{X}$ for some $m \ge 1$, with the property that
$$f(x) \le f(x^*) - \Delta_i, \quad \text{for all } x \in E_i, \text{ for all } i \in [m]. \qquad (4)$$
Now, if $N_i$ denotes the (random) number of times the algorithm $\mathcal{A}$ queries points in the region $E_i$ in $n$ rounds, then we immediately have the following regret lower bound:
$$E_f[R_n(\mathcal{A}, f)] \ \ge\ \sum_{i=1}^{m} \Delta_i\, E_f[N_i]. \qquad (5)$$
The expression in (5) suggests that one way of lower bounding the regret incurred by $\mathcal{A}$ on the function $f$ is to lower bound the expected number of samples it allocates to these suboptimal regions, $E_f[N_i]$ for $1 \le i \le m$.

We approach this task by considering functions that are slightly perturbed versions of $f$, denoted by $\tilde{f}_i$, that also lie in the same RKHS. The perturbed function $\tilde{f}_i$ shall differ from $f$ only in the region $E_i$, but this difference should be substantial enough to ensure that the maximizer of $\tilde{f}_i$ lies in $E_i$. This fact makes $\tilde{f}_i$ operationally distinct from $f$, for which the region $E_i$ is at least $\Delta_i$-suboptimal, as assumed in (4). Now, since the algorithm $\mathcal{A}$ is $a_0$-consistent for the given function class, it must achieve $o(n^a)$ regret (for any $a > a_0$) for all such functions (and in particular, for $f$ and all of the $\tilde{f}_i$). These two facts will enable us to bound the number of samples that the algorithm $\mathcal{A}$ must spend (on average) in the suboptimal region $E_i$ when $f$ is the true function.

Figure 2. The two figures show a function $f$ (left) and its perturbed version $\tilde{f}$ (right). In the left figure, the optimum of $f$ is denoted by the green star. The red shaded region corresponds to the set $E$ of Definition 16, which is disjoint from a neighborhood (green shaded region) of the optimum of $f$. The gray shaded regions represent the set $\tilde{\mathcal{X}}(f, \Delta, c) = \{x : \Delta \le f(x^*) - f(x) \le c\Delta\}$ for $\Delta = 0.2$ and $c = 2.5$. The figure on the right, $\tilde{f}$, is then obtained by adding a bump function supported on $E$ to $f$. By construction, in the left figure, the red-shaded region is at least $\Delta$-suboptimal, while in the right figure the green-shaded region is at least $\Delta$-suboptimal. The two functions $f$ and $\tilde{f}$ induce statistically similar distributions, as they differ only in a small region (shown in red). Hence, if $f$ is the true objective function, any good algorithm $\mathcal{A}$ must spend some queries in the red region to discard the possibility that the objective function is $\tilde{f}$.

The rest of this section is organized as follows:

- In Appendix B.1.1, we describe the details of the argument deriving a lower bound on the expected number of samples $\mathcal{A}$ spends in a suboptimal region, for one perturbed function.
- In Appendix B.1.2, we present the details involved in constructing an appropriate collection of perturbed functions. In particular, this involves carefully balancing several trade-offs in the choice of parameters such as $\Delta_i$ and $E_i$, so that the resulting lower bound is tightest.
B.1.1. A PERTURBATION ARGUMENT

Suppose $f \in H_{K_\nu}(M)$ and $\mathcal{A}$ is an $a_0$-consistent algorithm. Let $\tilde{f}$ be another function in $H_{K_\nu}(M)$ that is an $(E, c, \Delta)$-perturbation of $f$, as defined below.

Definition 16 ($(E, c, \Delta)$-perturbation). We say that a function $\tilde{f} \in H_{K_\nu}(M)$ is an $(E, c, \Delta)$-perturbation of another function $f \in H_{K_\nu}(M)$ if it satisfies the following properties:

(P1) $\tilde{f}$ differs from $f$ only in a region $E$ of the input space, i.e., $\tilde{f}(x) = f(x)$ for all $x \in \mathcal{X} \setminus E$.

(P2) The function $f$ achieves its maximum value (denoted by $f^*$) at a point $x^* \in \mathcal{X} \setminus E$. On the other hand, $\tilde{f}$ achieves its maximum value $\tilde{f}^*$ at a point $\tilde{x}^*$ lying in the region $E$.

(P3) There exist constants $c > 1$ and $\Delta > 0$ such that the following conditions are satisfied:
$$|\tilde{f}(x) - f(x)| \le c\Delta, \quad \text{for all } x \in E, \qquad (6)$$
$$f^* - f(x) \ge \Delta, \quad \text{for all } x \in E,$$
$$\tilde{f}^* - \tilde{f}(x) \ge \Delta, \quad \text{for all } x \in \mathcal{X} \setminus E.$$

Remark 7. In this section, we do not address the issue of the existence of a function $\tilde{f}$ satisfying all the properties, or the possible values of $c$, $\Delta$ and choices of the region $E$. We present the argument under the assumption that such an $\tilde{f}$ exists for some fixed $c$, $\Delta$ and $E$. The trade-offs involved in constructing such an $\tilde{f}$ will be discussed in Appendix B.1.2.

Similar to the more general case of (5), the definition above motivates a simple decomposition of the regret in terms of the number of queries made by $\mathcal{A}$ in the region $E$ in which $f$ and $\tilde{f}$ differ. In particular, if $N$ denotes the (random) number of times $\mathcal{A}$ queries points in $E$, then we immediately have the following:
$$E_f[R_n(\mathcal{A}, f)] \ \ge\ \Delta\, E_f[N], \quad \text{and} \quad E_{\tilde{f}}\big[R_n(\mathcal{A}, \tilde{f})\big] \ \ge\ \Delta\, E_{\tilde{f}}[n - N].$$

To obtain a lower bound on $E_f[N]$, we will use the following two key properties of the pair $(f, \tilde{f})$, as encoded by the formal statements in Definition 16.

- From a statistical point of view, the two problem instances are close. In particular, $f$ and $\tilde{f}$ only differ over the region $E$, and furthermore their deviation is upper bounded by $c\Delta$. This allows us to upper bound the KL-divergence between their induced probability distributions $P_f$ and $P_{\tilde{f}}$ (see Definition 13) in terms of $E_f[N]$.
- In an operational sense, the two problem instances $f$ and $\tilde{f}$ are sufficiently distinct. This is a consequence of properties (P2) and (P3) in Definition 16, which say that the optimizer of $f$ (resp. $\tilde{f}$) lies in the region $\mathcal{X} \setminus E$ (resp. $E$), which is known to be at least $\Delta$-suboptimal for $\tilde{f}$ (resp. $f$). This, along with the $a_0$-consistency of $\mathcal{A}$, will be used to lower-bound the KL-divergence between $P_f$ and $P_{\tilde{f}}$ by a constant.

Combining the two inequalities will give us the required lower bound on $E_f[N]$, and consequently on $E_f[R_n(\mathcal{A}, f)]$. We now describe the steps.

Step 1: Upper bound on $D_{KL}(P_f, P_{\tilde{f}})$. Recall that the pairs $(f, \mathcal{A})$ and $(\tilde{f}, \mathcal{A})$ both induce a probability measure on the $n$-fold product of the input-observation space $\Omega := (\mathcal{X} \times \mathbb{R})^n$. Denote the two probability measures by $P_f$ and $P_{\tilde{f}}$. Then, assuming that the observation noise is i.i.d. $N(0, \sigma^2)$, it can be shown that
$$D_{KL}\big(P_f, P_{\tilde{f}}\big) \ \le\ \frac{1}{2\sigma^2}\, \sup_{x \in E}\big(f(x) - \tilde{f}(x)\big)^2\, E_f[N] \ \le\ \frac{c^2\Delta^2}{2\sigma^2}\, E_f[N]. \qquad (7)$$
To obtain (7), we use the closed-form expression for the KL-divergence between two univariate Gaussian random variables with the same variance, and the fact that the maximum deviation between $f$ and $\tilde{f}$ is upper bounded by $c\Delta$. The details are presented while proving Lemma 1 in Appendix B.2.1.
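For completeness, here is the one-line computation behind (7), using the chain rule for KL-divergence and the standard formula for equal-variance Gaussians; this is a sketch of the step, not the full proof of Lemma 1. Since
$$D_{KL}\big(N(\mu_1, \sigma^2),\, N(\mu_2, \sigma^2)\big) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2},$$
the divergence decomposes over the $n$ rounds, and only the queries landing in $E$ (where $f \ne \tilde{f}$) contribute:
$$D_{KL}\big(P_f, P_{\tilde{f}}\big) = E_f\left[\sum_{t=1}^{n} \frac{\big(f(x_t) - \tilde{f}(x_t)\big)^2}{2\sigma^2}\right] \ \le\ \frac{\sup_{x \in E}\big(f(x) - \tilde{f}(x)\big)^2}{2\sigma^2}\, E_f[N].$$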
The statement of (7) shows that the KL divergence between $P_f$ and $P_{\tilde f}$ is controlled by two terms: (i) the maximum deviation between $f$ and $\tilde f$, a quantity that is bounded by $c\Delta$ by assumption, and (ii) the expected number of queries made by $\mathcal{A}$ in the region $E$ under the function $f$. The latter quantity cannot be too large either, as the algorithm $\mathcal{A}$ is assumed to be $a_0$-consistent, and the region $E$ is at least $\Delta$-suboptimal for $f$, as stated in (6).

Step 2: Lower bound on $D_{KL}(P_f, P_{\tilde f})$. Since the algorithm $\mathcal{A}$ is assumed to be $a_0$-consistent, we immediately have the following two statements for any $a > a_0$:

$$\Delta\, \mathbb{E}_f[N] \leq \mathbb{E}_f[R_n(\mathcal{A}, f)] = o(n^a), \quad \text{and} \quad \Delta\, \mathbb{E}_{\tilde f}[n - N] \leq \mathbb{E}_{\tilde f}[R_n(\mathcal{A}, \tilde f)] = o(n^a).$$

Together, these two conditions imply that $\mathbb{E}_f[N]$ and $\mathbb{E}_{\tilde f}[n - N]$ cannot be too large. To see this, define a $[0,1]$-valued random variable $Z = N/n$, and let $p = \mathbb{E}_f[Z]$ and $q = \mathbb{E}_{\tilde f}[Z]$. Note that $Z$ denotes the fraction of samples spent in the region $E$ by the algorithm $\mathcal{A}$. Then, for large enough values of $n$ and fixed $\Delta$, we expect that $p \approx 0$ and $q \approx 1$. This in turn implies that the KL divergence between two Bernoulli random variables with means $p$ and $q$ is bounded away from zero. More specifically, we can show that there exists a constant $C > 0$ such that

$$0 < C \leq d_{KL}(p, q), \qquad (8)$$

where $d_{KL}(p, q) = p\log(p/q) + (1-p)\log\big((1-p)/(1-q)\big)$ denotes the KL divergence between two Bernoulli random variables with means $p$ and $q$. The next step in obtaining a lower bound on $\mathbb{E}_f[N]$ is the observation that

$$d_{KL}(p, q) \leq D_{KL}(P_f, P_{\tilde f}). \qquad (9)$$

This is a consequence of the data-processing inequality, as shown by Garivier et al. (2019).

Step 3: Lower bound on $\mathbb{E}_f[N]$. Finally, we can use (9) to link the statements of (7) and (8), and obtain the inequality

$$C \leq \frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f[N], \quad \text{which implies} \quad \mathbb{E}_f[N] \geq \frac{2C\sigma^2}{c^2\Delta^2}.$$

Multiplying this term by $\Delta$ gives us one term in the regret decomposition (5).

B.1.2. CONSTRUCTING AN $(E, c, \Delta)$-PERTURBED FUNCTION

Next, we introduce the bump function used in (Cai and Scarlett, 2021, Lemma 4), which we will use to construct the local perturbations (satisfying the conditions of Definition 16) in our lower bound proof.

Definition 17 (bump function $g$). Define the function

$$g(x) = \exp\left(1 - \frac{1}{1 - \|x\|^2}\right)\, \mathbb{1}\{\|x\| < 1\},$$

which satisfies the following properties: (i) $g$ is supported on the ball $B(0,1)$; (ii) $\sup_{x \in B(0,1)} g(x) = g(0) = 1$; (iii) $\|g\|_{\mathcal{H}_{K_\nu}} =: M_\nu < \infty$ for some constant $M_\nu$ depending on $\nu$; and (iv) if $\tilde g(\cdot) = g(\cdot/w)$ for some $w > 0$, then $\|\tilde g\|_{\mathcal{H}_{K_\nu}} \leq (1/w)^\nu M_\nu$.

We now discuss the details of constructing a function $\tilde f$ satisfying the conditions of Definition 16. Here is the summary, for a fixed $a \in (a_0, 1)$:

- The term $\Delta$ should not be smaller than $n^{-(1-a)}$.
- An appropriate choice of the set $E$ is a ball $B(z, w)$ for a point $z \in \mathcal{X} \setminus \mathcal{X}^*$ and radius $w > 0$. Recall that $\mathcal{X}^* = \{x \in \mathcal{X} : f(x) = f^* := \max_{x' \in \mathcal{X}} f(x')\}$.
- We define the perturbed function $\tilde f = f + \tilde g$, where $\tilde g(\cdot) = (c+1)\Delta\, g\big((\cdot - z)/w\big)$ for some $w > 0$. Here $g$ is the bump function introduced in Definition 17. The specific constraints on $w$ are discussed below.

We now discuss these choices in more detail.

Choice of $\Delta$. The term $\Delta$ parametrizes the amount of perturbation between $f$ and $\tilde f$; it should be large enough to ensure that $f$ and $\tilde f$ are distinguishable from the point of view of the algorithm $\mathcal{A}$. In particular, fix any $a > a_0$. Then, for this value of $a$, $\Delta$ must be larger than $n^{-(1-a)}$. This is because, if $\Delta < n^{-(1-a)}$, then the algorithm may spend all $n$ of its samples in the region $E$ under both $f$ and $\tilde f$ without violating the $o(n^a)$ requirement on the regret.

Choice of $z$ and $w$. The terms $z$ and $w$ must be such that

$$B(z, w) \subset \tilde{\mathcal{X}}(f, \Delta, c) := \{x \in \mathcal{X} : \Delta \leq f^* - f(x) \leq c\Delta\}.$$

Thus, $z$ and $w$ must be selected to ensure that the ball of radius $w$ around $z$ is fully contained in the annular region $\tilde{\mathcal{X}}(f, \Delta, c)$, in which the suboptimality of $f$ (i.e., $f^* - f(x)$) is between $\Delta$ and $c\Delta$. Furthermore, the radius $w$ must be large enough to satisfy the condition stated in (10) below.
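A direct implementation of the bump of Definition 17 is given below; the support and normalization properties (i) and (ii) can be checked numerically, while the RKHS-norm scaling in (iv) is an analytic fact that is only recorded here as a comment, since it cannot be verified from function samples alone.

```python
import numpy as np

def bump(x):
    """Bump of Definition 17: supported on B(0, 1), maximized at 0 with g(0) = 1.

    Property (iv): the rescaled bump g(./w) has Matern-RKHS norm at most
    (1/w)^nu * M_nu, so shrinking the support inflates the norm polynomially.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    r2 = np.sum(x * x, axis=-1) if x.ndim > 1 else x * x
    out = np.zeros_like(r2)
    inside = r2 < 1.0
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - r2[inside]))
    return out

# Numeric checks of properties (i) and (ii) on a grid.
u = np.linspace(-2.0, 2.0, 4001)
vals = bump(u)
assert vals.max() == bump(0.0)[0] == 1.0        # sup attained at 0, equals 1
assert np.all(vals[np.abs(u) >= 1.0] == 0.0)    # supported on B(0, 1)
print("bump checks passed; g(0) =", bump(0.0)[0])
```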
Defining $\tilde f$. Having chosen the region $E$, we construct the perturbed version of $f$ by adding a shifted and scaled bump function to it. In particular, we add $\tilde g = (c+1)\Delta\, g\big((\cdot - z)/w\big)$ to $f$. Note that, by assumption, the RKHS norm of $f$ satisfies $\|f\|_{\mathcal{H}_{K_\nu}} < M$. Since we require the perturbed function to also lie in the class $\mathcal{H}_{K_\nu}(M)$, a sufficient condition for this is

$$\|\tilde g\|_{\mathcal{H}_{K_\nu}} \leq \frac{(c+1)\Delta}{w^\nu}\, \|g\|_{\mathcal{H}_{K_\nu}} = \frac{(c+1)\Delta}{w^\nu}\, M_\nu \leq M - \|f\|_{\mathcal{H}_{K_\nu}} \iff w \geq \left(\frac{(c+1)\Delta\, M_\nu}{M - \|f\|_{\mathcal{H}_{K_\nu}}}\right)^{1/\nu}. \qquad (10)$$

Remark 8. Note that the expression for $w$ in (10) implicitly assumes that, for this value of $c$ and $\Delta$, the region $\tilde{\mathcal{X}}(f, \Delta, c)$ is large enough to contain a ball of this (or larger) radius. Our result holds under this assumption. In case this condition is violated, the result reduces to the trivial lower bound $\mathbb{E}_f[R_n(\mathcal{A}, f)] \geq 0$.

B.2. An intermediate one-step result

Proposition 3. Consider the kernelized bandit problem with a budget $n$ and objective function $f \in \mathcal{H}_{K_\nu}(M)$ with $\|f\|_{\mathcal{H}_{K_\nu}} = (1-\lambda)M$ for some $\lambda \in (0,1)$. Let $\mathcal{A}$ denote an $a_0$-consistent algorithm for the class $\mathcal{H}_{K_\nu}(M)$, and fix an $a > a_0$. For constants $\Delta \geq 16\, n^{-(1-a)}$ and $c > 1$, introduce the set $Z = \{x \in \mathcal{X} : \Delta \leq f(x^*) - f(x) < c\Delta\}$, and with $w = \big((c+1)\Delta M_\nu/(M\lambda)\big)^{1/\nu}$, use $m(Z, w)$ to denote the $2w$-packing number of $Z$. Then the following is true for $n$ large enough (exact condition in equation (13) below):

$$\mathbb{E}[R_n(\mathcal{A}, f)] \geq \frac{7\log 2}{4}\, \frac{m(Z, w)\, \sigma^2}{c^2\Delta}. \qquad (11)$$

Proof. Let $\{z_i : 1 \leq i \leq m(Z,w)\}$ denote the points that form a maximal $2w$-packing of $Z$, with cardinality $m = m(Z,w)$. By definition of $Z$, the region $B(z_i, w)$ is at least $\Delta$-suboptimal for $f$. Building upon this fact, the proof of the result follows in three steps:

- First, we show that we can construct $m$ perturbed functions, denoted by $\{f_i : 1 \leq i \leq m(Z,w)\}$, as introduced in Definition 16. The function $f_i$ differs from $f$ only in the region $B(z_i, w)$, and in fact achieves its maximum value in that region.
- Next, for each such perturbed function, we obtain a lower bound on the number of samples that the algorithm $\mathcal{A}$ must spend in $B(z_i, w)$.
- Finally, the result follows from the regret decomposition (5), again using the fact that the points in $B(z_i, w)$ are $\Delta$-suboptimal for $f$.

We now present the details of the steps outlined above. For every $i \in \{1, \ldots, m\}$, define the function $f_i = f + g_i$, where $g_i = (c+1)\Delta\, g\big((\cdot - z_i)/w\big)$ is the scaled and shifted version of the bump function introduced in Definition 17. Now, the choice of the radius (or scale parameter) $w$ in the definition of $f_i$ according to (10) implies that $\|g_i\|_{\mathcal{H}_{K_\nu}} \leq (c+1)\Delta M_\nu/w^\nu = \lambda M$. This implies the following:

- The function $f_i$ satisfies $\|f_i\|_{\mathcal{H}_{K_\nu}} \leq \|f\|_{\mathcal{H}_{K_\nu}} + \|g_i\|_{\mathcal{H}_{K_\nu}} \leq M$. Thus, the function $f_i$ lies in the class $\mathcal{H}_{K_\nu}(M)$, and hence $\mathcal{A}$ achieves regret $o(n^a)$ for any $a > a_0$ on $f_i$, for all $1 \leq i \leq m$.
- The functions $f$ and $f_i$ have well-separated optimal regions. More formally, if $x^*$ and $x_i^*$ denote the maximizers of $f$ and $f_i$ respectively, then $f(x^*) - f(x) \geq \Delta$ for all $x \in B(z_i, w)$, and $f_i(x_i^*) - f_i(x) \geq \Delta$ for all $x \notin B(z_i, w)$.
- The functions $f$ and $f_i$ differ from each other only in the region $B(z_i, w)$, and furthermore they satisfy the uniform deviation bound $\sup_{x \in \mathcal{X}} |f(x) - f_i(x)| \leq c\Delta$.

To summarize the above three points, the function $f_i$ is a $(B(z_i, w), c, \Delta)$-perturbation of $f$.
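The centers $z_i$ can be obtained by a simple greedy maximal-packing procedure: scan candidate points of $Z$ and keep any point at distance at least $2w$ from all previously kept centers. The sketch below does this on a grid for a toy one-dimensional instance; the grid resolution, the instance, and the parameter values are illustrative.

```python
import numpy as np

def greedy_packing_centers(points, radius):
    # Greedy maximal packing: keep a point if it is >= 2*radius from all kept ones.
    centers = []
    for p in points:
        if all(np.linalg.norm(p - q) >= 2.0 * radius for q in centers):
            centers.append(p)
    return np.array(centers)

# Toy 1-D instance and the annulus Z = {x : Delta <= f(x*) - f(x) < c * Delta}.
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)
Delta, c, w = 0.2, 2.5, 0.02      # w plays the role of the radius in (10)

grid = np.linspace(0.0, 1.0, 5001).reshape(-1, 1)
gaps = f(grid[:, 0]).max() - f(grid[:, 0])
in_Z = (gaps >= Delta) & (gaps < c * Delta)

centers = greedy_packing_centers(grid[in_Z], w)   # candidate z_i's
print(f"packing estimate m(Z, w) = {len(centers)}")
print("centers z_i:", centers.ravel().round(3))
# Each f_i = f + (c + 1) * Delta * g((. - z_i) / w) then perturbs f on one of
# the disjoint balls B(z_i, w), as in the proof of Proposition 3.
```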
Next, we show that any $a_0$-consistent algorithm must allocate at least a certain number of queries to the region $B(z_i, w)$ when the true underlying function is $f$, in order to gather enough evidence to reject $f_i$.

Lemma 1. Let $N_i(\mathcal{A}, n)$ denote the number of times the algorithm $\mathcal{A}$ queries the oracle at points in the region $B(z_i, w)$. Then we have the following bound:

$$\mathbb{E}_f[N_i(\mathcal{A}, n)] \geq \frac{2\sigma^2}{c^2\Delta^2} \left[ (1 - p_{n,i}) \log\left(\frac{1}{1 - q_{n,i}}\right) - \log 2 \right], \qquad (12)$$

where $p_{n,i} := \mathbb{E}_f[N_i(\mathcal{A}, n)]/n$ and $q_{n,i} := \mathbb{E}_{f_i}[N_i(\mathcal{A}, n)]/n$. In the above display, $\mathbb{E}_f$ denotes the expectation w.r.t. the probability measure induced by the pair $(f, \mathcal{A})$, and similarly $\mathbb{E}_{f_i}$ denotes the expectation w.r.t. the probability measure induced by the pair $(f_i, \mathcal{A})$, for $1 \leq i \leq m$.

The proof of (12) follows by relating the regret incurred by $\mathcal{A}$ on $f$ and $f_i$ respectively to a pair of multi-armed bandit problems with $(m+1)$ arms, and then applying the fundamental information inequality (Garivier et al., 2019, Section 2). The details of this proof are deferred to Appendix B.2.1.

Next, we simplify the expression obtained in Lemma 1 by appealing to the $a_0$-consistency of the algorithm $\mathcal{A}$. In the process, we also clarify the meaning of the assumption that $n$ is large enough in the statement of Proposition 3. In particular, we require $n$ to be large enough to ensure that the following hold simultaneously:

$$\mathbb{E}_f[R_n(\mathcal{A}, f)] \leq 2n^a, \quad \text{and} \quad \mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)] \leq 2n^a \text{ for all } i \in [m]. \qquad (13)$$

More specifically, using the notation $f_0 = f$, we note that due to the $a_0$-consistency of $\mathcal{A}$, and the fact that all the functions $\{f_i : 0 \leq i \leq m\}$ lie in $\mathcal{H}_{K_\nu}(M)$, we must have $\lim_{n \to \infty} \mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)]/n^a = 0$. Hence, there must exist a finite $n_0$ such that for all $n \geq n_0$ we have $\mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)] \leq 2n^a$ for all $i \in \{0, 1, \ldots, m\}$.

Next, we observe that

$$1 - p_{n,i} = 1 - \frac{\mathbb{E}_f[N_i(\mathcal{A}, n)]}{n} \overset{(14)}{\geq} 1 - \frac{\mathbb{E}_f[R_n(\mathcal{A}, f)]}{16\, n^{1-(1-a)}} \overset{(15)}{\geq} 1 - \frac{2n^a}{16\, n^a} = \frac{7}{8}.$$

In the above display, (14) uses the fact that $\Delta\, N_i(\mathcal{A}, n) \leq R_n(\mathcal{A}, f)$ and that $\Delta \geq 16\, n^{-(1-a)}$, and (15) uses the assumption made in (13) that $n$ is large enough to ensure that $\mathbb{E}_f[R_n(\mathcal{A}, f)] \leq 2n^a$. Similarly, for the $q_{n,i}$-dependent term, we have

$$\frac{1}{1 - q_{n,i}} = \frac{n}{n - \mathbb{E}_{f_i}[N_i(\mathcal{A}, n)]} \overset{(16)}{\geq} \frac{n\Delta}{\mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)]} \overset{(17)}{\geq} \frac{16\, n^a}{2\, n^a} = 8.$$

In the above display, (16) uses the fact that the points outside $B(z_i, w)$ are at least $\Delta$-suboptimal for $f_i$, so that $\Delta\,(n - N_i(\mathcal{A}, n)) \leq R_n(\mathcal{A}, f_i)$, and (17) uses the assumption that $n$ is large enough to ensure that $\mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)] \leq 2n^a$. Putting (15) and (17) back into (12), we get

$$\mathbb{E}_f[N_i(\mathcal{A}, n)] \geq \frac{2\sigma^2}{c^2\Delta^2}\left(\frac{7}{8}\log 8 - \log 2\right) = \frac{13\log 2}{4}\, \frac{\sigma^2}{c^2\Delta^2} \geq \frac{7\log 2}{4}\, \frac{\sigma^2}{c^2\Delta^2}.$$

Finally, the result stated in (11) follows by repeating the argument of Lemma 1 for all values of $i \in \{1, \ldots, m\}$, and noting that $\mathbb{E}_f[R_n(\mathcal{A}, f)] \geq \Delta \sum_{i=1}^m \mathbb{E}_f[N_i(\mathcal{A}, n)]$ from the decomposition inequality (5). $\square$

B.2.1. PROOF OF LEMMA 1

To prove this result, we need to introduce some additional notation. We use $H_t$ to denote the history of query points and observations up to, and including, time $t$, for $t \in \{1, 2, \ldots, n\}$. For a given $n \geq 1$, introduce the sample space $\Omega = (\mathcal{X} \times \mathcal{Y})^n$, and let $\mathcal{F}_0$ denote a sigma-algebra of subsets of $\Omega$. For a given function $f$ and a querying strategy $\mathcal{A}$, we use $P_{f,\mathcal{A}}$ to denote the probability measure on $\Omega$ induced by the pair $(f, \mathcal{A})$. We will drop the $\mathcal{A}$ dependence of $P_{f,\mathcal{A}}$ and simply write $P^{(f)}$ in the sequel.

The first step is to obtain an upper bound on the KL divergence between the measures $P^{(f)}$ and $P^{(f_i)}$ induced on the space $\Omega$ for a common algorithm $\mathcal{A}$. In particular, let $(X_1, Y_1, \ldots, X_n, Y_n)$ denote the query-observation pairs collected by the algorithm $\mathcal{A}$ up to time $n$.
Then we have the following:

$$D_n := D_{KL}\big(P^{(f)}, P^{(f_i)}\big) = D_{n-1} + D_{KL}\big(P^{(f)}_{X_n, Y_n | H_{n-1}},\, P^{(f_i)}_{X_n, Y_n | H_{n-1}}\big) \qquad (18)$$
$$= D_{n-1} + D_{KL}\big(P^{(f)}_{X_n | H_{n-1}},\, P^{(f_i)}_{X_n | H_{n-1}}\big) + D_{KL}\big(P^{(f)}_{Y_n | H_{n-1}, X_n},\, P^{(f_i)}_{Y_n | H_{n-1}, X_n}\big) \qquad (19)$$
$$= D_{n-1} + 0 + D_{KL}\big(P^{(f)}_{Y_n | H_{n-1}, X_n},\, P^{(f_i)}_{Y_n | H_{n-1}, X_n}\big) \qquad (20)$$
$$= D_{n-1} + \mathbb{E}_f\left[\frac{\big(f(X_n) - f_i(X_n)\big)^2}{2\sigma^2}\, \mathbb{1}\{X_n \in B(z_i, w)\}\right] \qquad (21)$$
$$\leq D_{n-1} + \frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f\big[\mathbb{1}\{X_n \in B(z_i, w)\}\big]. \qquad (22)$$

In the above display, (18) and (19) follow from the chain rule for KL divergence (Polyanskiy and Wu, 2014, Theorem 2.2); (20) uses the fact that, conditioned on $H_{n-1}$, the distribution of $X_n$ is the same for both problem instances, that is, in both cases $X_n$ is selected according to the mapping $\mathcal{A}_n : (\mathcal{X} \times \mathcal{Y})^{n-1} \to \mathcal{X}$, where $\mathcal{A} = (\mathcal{A}_t)_{t=1}^n$ is the common strategy; (21) uses the fact that, conditioned on $X_n$, the observation $Y_n$ is distributed as $N(f(X_n), \sigma^2)$ and $N(f_i(X_n), \sigma^2)$ under $P^{(f)}$ and $P^{(f_i)}$ respectively; and (22) uses the fact that, by construction, $f$ and $f_i$ differ only in the region $B(z_i, w)$, where $\max_{x \in B(z_i, w)} |f(x) - f_i(x)| \leq c\Delta$.

Repeating the steps involved in obtaining (22) $n-1$ more times, we get the following upper bound on the term $D_n$:

$$D_n \leq \frac{c^2\Delta^2}{2\sigma^2} \sum_{t=1}^n \mathbb{E}_f\big[\mathbb{1}\{X_t \in B(z_i, w)\}\big] = \frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f[N_i(\mathcal{A}, n)].$$

Recall that the term $N_i(\mathcal{A}, n)$ denotes the number of times the algorithm $\mathcal{A}$ queries points from the region $B(z_i, w)$ in the first $n$ rounds. Now, let $Z : \Omega \to [0,1]$ be any measurable $[0,1]$-valued random variable. Then, by (Garivier et al., 2019, Lemma 1), we get the following result:

$$D_{KL}\big(P^{(f)}, P^{(f_i)}\big) \geq d_{KL}\big(\mathbb{E}_f[Z],\, \mathbb{E}_{f_i}[Z]\big),$$

where $d_{KL}(p, q)$ for $p, q \in [0,1]$ denotes the KL divergence between two Bernoulli random variables with means $p$ and $q$ respectively. To complete the proof, we select $Z := N_i(\mathcal{A}, n)/n$, and using the fact (Garivier et al., 2019, Eq. (11)) that $d_{KL}(p, q) \geq (1-p)\log\big(1/(1-q)\big) - \log 2$, we get the required inequality

$$\frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f[N_i(\mathcal{A}, n)] \geq (1 - p_{n,i}) \log\left(\frac{1}{1 - q_{n,i}}\right) - \log 2, \quad \text{where } p_{n,i} = \mathbb{E}_f[Z] \text{ and } q_{n,i} = \mathbb{E}_{f_i}[Z]. \quad \square$$

Remark 9. The only point at which we exploit the assumption that the observation noise is distributed as $N(0, \sigma^2)$ (i.e., Assumption 2a) is in obtaining the equality (21). Due to this assumption on the noise, we get a closed-form upper bound on the per-round KL divergence in (22), namely $c^2\Delta^2/(2\sigma^2)$. In general, if we only assumed that the observation noise was $\sigma^2$-sub-Gaussian, then the same result would hold true with this closed-form upper bound replaced by the expression $\sup_{x \in B(z_i, w)} D_{KL}\big(P_{f, Y_n | X_n = x},\, P_{f_i, Y_n | X_n = x}\big)$.

B.3. Concluding Theorem 1 from Proposition 3

Theorem 1 follows by repeated application of the result in Proposition 3 to different regions of the input space. More specifically, introduce the following notation:

- As in the previous section, fix an $a > a_0$, and choose $c = 2$ and $\Delta = 16\, n^{-(1-a)}$.
- For $k \geq 0$, set $Z_k = \{x \in \mathcal{X} : 2^k\Delta \leq f(x^*) - f(x) < 2^{k+1}\Delta\}$.
- With $w_k := \big((3 \cdot 2^k \Delta\, M_\nu)/(\lambda M)\big)^{1/\nu}$, use $m_k$ to denote the $2w_k$-packing number of $Z_k$.

Since $f \in \mathcal{H}_{K_\nu}(M)$, we know that the set $Z_k$ is empty for all $k$ larger than a finite value $k_0$. More specifically, we have

$$k_0 = \left\lceil \log_2\left(\frac{\sup_{x \in \mathcal{X}} f(x^*) - f(x)}{\Delta}\right) \right\rceil \leq \left\lceil \log_2\left(\frac{2M\sqrt{K_\nu(0)}}{\Delta}\right) \right\rceil.$$

Finally, the statement of Theorem 1 follows by $k_0 + 1$ repeated applications of the intermediate statement proved in Proposition 3, using $Z = Z_k$, suboptimality level $2^k\Delta$, $c = 2$, and $w_k = \big((3 \cdot 2^k \Delta\, M_\nu)/(\lambda M)\big)^{1/\nu}$, for $k = 0, 1, \ldots, k_0$.
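Numerically, the aggregation in Appendix B.3 is a dyadic sum: each layer $Z_k$ contributes a term of order $m_k\sigma^2/(2^k\Delta)$ via Proposition 3. The sketch below assembles this sum for a toy instance, reusing a greedy packing estimate for $m_k$; the instance, the value of $\Delta$, and the ratio standing in for $M_\nu/(\lambda M)$ are all illustrative, and the absolute constants from (11) are dropped.

```python
import numpy as np

# Toy instance on [0, 1]; all constants below are illustrative only.
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)
sigma, Delta, nu = 0.5, 0.05, 1.5
Mnu_over_lamM = 0.02          # stands in for M_nu / (lambda * M) in w_k

def packing_number(points, radius):
    # Greedy estimate of the 2*radius packing number of a 1-D point set.
    centers = []
    for x in points:
        if all(abs(x - c) >= 2.0 * radius for c in centers):
            centers.append(x)
    return len(centers)

grid = np.linspace(0.0, 1.0, 5001)
gaps = f(grid).max() - f(grid)

total, k = 0.0, 0
while True:
    # Dyadic layer Z_k = {x : 2^k Delta <= f(x*) - f(x) < 2^{k+1} Delta}.
    layer = grid[(gaps >= 2**k * Delta) & (gaps < 2**(k + 1) * Delta)]
    if layer.size == 0:       # beyond k0, every layer is empty for this f
        break
    w_k = (3 * 2**k * Delta * Mnu_over_lamM) ** (1.0 / nu)
    m_k = packing_number(layer, w_k)
    term = m_k * sigma**2 / (2**k * Delta)   # layer-k contribution, cf. (11)
    print(f"k={k}:  m_k={m_k},  term={term:.2f}")
    total += term
    k += 1
print(f"aggregated lower-bound complexity (constants dropped): {total:.2f}")
```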
C. Proof of Proposition 1

To prove this statement, we appeal to the one-step result obtained in Proposition 3. In particular, we apply Proposition 3 with the following parameters:

- We set $\Delta = \Delta_n = 16\, n^{-(1-a)}$, and $c = c_n := 2^{1/d}\, \bar{c}/\underline{c}$, where $\bar{c}$ and $\underline{c}$ are the parameters introduced in Definition 4.
- The set $Z$ of Proposition 3 now becomes $\{x \in \mathcal{X} : \Delta_n \leq f(x^*) - f(x) < c_n\Delta_n\}$.
- We set the radius of the balls to $w = w_n = \big((c_n + 1)\Delta_n M_\nu/(M\lambda)\big)^{1/\nu}$, where $\lambda = 1 - \|f\|_{\mathcal{H}_{K_\nu}}/M$, and use $m(Z, w_n)$ to denote the $2w_n$-packing number of the set $Z$.

With these parameters, Proposition 3 gives us the following lower bound on the regret:

$$\mathbb{E}[R_n(\mathcal{A}, f)] = \Omega\left(\frac{\sigma^2\, m(Z, w_n)}{\Delta_n}\right). \qquad (23)$$

To conclude the statement of Proposition 1, we will show that $m = m(Z, w_n) = \Omega\big(\Delta_n^{d(1/b - 1/\nu)}\big)$. First, we introduce the following terms:

$$r_0 := \left(\frac{\Delta_n}{\underline{c}}\right)^{1/b}, \qquad r_1 := \left(\frac{c_n\Delta_n}{\bar{c}}\right)^{1/b}, \qquad \text{where } c_n = 2^{1/d}\, \bar{c}/\underline{c} \text{ as before}. \qquad (24)$$

Next, we use Definition 4 to obtain the following result about $Z$.

Lemma 2. With $r_0$ and $r_1$ introduced in (24), we have $Z \supseteq B(x^*, r_0, r_1) := \{x \in \mathcal{X} : r_0 \leq \|x - x^*\| < r_1\}$.

Proof. Suppose $x \in B(x^*, r_0, r_1)$. Then we have the following:

$$\|x - x^*\| \geq r_0 = (\Delta_n/\underline{c})^{1/b} \implies f(x^*) - f(x) \geq \underline{c}\, r_0^b = \Delta_n. \qquad (25)$$

Similarly, we also have:

$$\|x - x^*\| < r_1 = (c_n\Delta_n/\bar{c})^{1/b} \implies f(x^*) - f(x) \leq \bar{c}\, \|x - x^*\|^b < \bar{c}\, r_1^b = c_n\Delta_n. \qquad (26)$$

Together, (25) and (26) imply that if $x \in B(x^*, r_0, r_1)$, then $x \in Z$. $\square$

The above statement implies that the $2w_n$-packing number of $Z$ can be lower bounded by the $2w_n$-packing number of the smaller set $B(x^*, r_0, r_1)$. Let us denote the $2w_n$-packing number of $B(x^*, r_0, r_1)$ by $\tilde m$. Then, using the fact that the $2w_n$-packing number is lower bounded by the $2w_n$-covering number, and employing standard volume arguments (van Handel, 2014, Lemma 5.13), we conclude that there exists a constant $0 < C_1 < \infty$ such that

$$m(Z, w_n) \geq \tilde m = m\big(B(x^*, r_0, r_1),\, w_n\big) \geq C_1 \left(\frac{\Delta_n^{1/b}}{\Delta_n^{1/\nu}}\right)^d.$$

The $\Delta_n^{1/b}$ and $\Delta_n^{1/\nu}$ terms in the above display arise from the definitions of $r_1, r_0$, and of $w_n$ (defined at the beginning of this proof) respectively. Plugging the lower bound on $m(Z, w_n)$ back into (23) gives us the required result.

D. Proof of Theorem 2

We first recall a simple embedding result from Shekhar and Javidi (2020), which is crucial to the adaptive partitioning approach used in our algorithm.

Fact 1. Suppose $f \in \mathcal{H}_{K_\nu}(M)$ with $0 < M < \infty$. Then we have $|f(x) - f(z)| \leq M C_K \|x - z\|^\xi$ for $\xi := \min\{1, \nu\}$ and some constant $C_K$ depending on $K$, for all $x, z \in \mathcal{X}$.

Instead of obtaining explicit bounds on the term $C_K$ in the statement above, we assume that $n$ is large enough to ensure that $C_K \leq \log n$, to simplify the presentation. We now present the following independence result about the points queried by the algorithm.

Proposition 4. Suppose $P_t \subset \mathcal{X}_h$ is the active set of points for Algorithm 1 at some time $t$. Let $t_0 < t$ denote the time at which a point from $P_t$ was first queried, and let $E_t$ denote the multiset of points $\{x_{t_0}, \ldots, x_{t-1}\}$ queried by the algorithm. Then the collection of random variables $(y_{t_0}, \ldots, y_{t-1})$ is mutually independent, conditioned on the query points $x_{t_0}, \ldots, x_{t-1}$.

Proof. Using $P$ to represent the joint distribution of the random variables, we proceed as follows:

$$P(y_{t_0}, \ldots, y_{t-1} \mid x_{t_0}, \ldots, x_{t-1}) = \frac{P(y_{t_0}, \ldots, y_{t-1}, x_{t_0}, \ldots, x_{t-1})}{P(x_{t_0}, \ldots, x_{t-1})}$$
$$= \frac{\prod_{s=t_0}^{t-1} P(x_s \mid x_{t_0}, \ldots, x_{s-1}, y_{t_0}, \ldots, y_{s-1})\, P(y_s \mid x_{t_0}, \ldots, x_s, y_{t_0}, \ldots, y_{s-1})}{P(x_{t_0}, \ldots, x_{t-1})}$$
$$\overset{(a)}{=} \frac{\prod_{s=t_0}^{t-1} P(x_s \mid x_{t_0}, \ldots, x_{s-1})\, P(y_s \mid x_s)}{P(x_{t_0}, \ldots, x_{t-1})} = \frac{P(x_{t_0}, \ldots, x_{t-1}) \prod_{s=t_0}^{t-1} P(y_s \mid x_s)}{P(x_{t_0}, \ldots, x_{t-1})}$$
$$\overset{(b)}{=} \prod_{s=t_0}^{t-1} P(y_s \mid x_{t_0}, \ldots, x_{t-1}).$$

In the above display, (a) uses two facts: at any $s \geq t_0$, the query point $x_s$ depends only on the previous query points $x_{t_0}, \ldots, x_{s-1}$ and not on the observations; and, conditioned on $x_s$, the observation $y_s$ is independent of the query points $x_{t_0}, \ldots, x_{s-1}$ and the observations $y_{t_0}, \ldots, y_{s-1}$.
(b) uses the fact that, conditioned on $x_s$, the observation $y_s$ is independent of $x_{t_0}, \ldots, x_{s-1}, x_{s+1}, \ldots, x_{t-1}$. The equality (b) implies the conditional independence of the observations given the query points belonging to the current active set $P_t$, as required. $\square$

Having obtained the conditional independence property of the queried points, we now present the key concentration result that leads to the required regret bounds.

Lemma 3. For some $t \geq 1$, let $P_t$, $t_0$, and $E_t$ be the same as in Proposition 4. Then, for a given $\delta \in (0,1)$, the following is true:

$$P\big(\exists\, x \in P_t \text{ s.t. } |f(x) - \mu_t(x)| > \beta_t\sigma_t(x)\big) \leq \delta_t := \frac{6\delta}{t^2\pi^2}, \quad \text{with } \beta_t \geq \tau^{-1/2}M + \sqrt{2\sigma^2\log\left(\frac{|P_t|\pi^2 t^2}{3\delta}\right)}.$$

Recall that the terms $\mu_t(\cdot)$ and $\sigma_t(\cdot)$ represent the posterior mean and standard-deviation functions computed by the subroutine Compute Posterior introduced in Definition 6.

Proof. To prove this result, we rely on the following facts, derived by Valko et al. (2013) while proving their Lemma 2. For $x \in P_t$, there exist $\alpha_{t_0}, \ldots, \alpha_{t-1} \in \mathbb{R}$ (depending on $x$) such that the following holds:

$$f(x) - \mu_t(x) = \sum_{i=t_0}^{t-1} \alpha_i\big(y_i - f(x_i)\big) + B_x, \qquad \sum_{i=t_0}^{t-1} \alpha_i^2 \leq \sigma_t(x)^2, \qquad |B_x| \leq \tau^{-1/2}M\sigma_t(x), \quad \text{for all } x \in P_t.$$

Recall that $\tau$ is the regularization parameter used in the subroutine Compute Posterior, while $M$ is the upper bound on the RKHS norm of $f$. Now, we use the conditional independence property derived in Proposition 4, along with the conditional $\sigma^2$-sub-Gaussianity of the observation noise, to get the required concentration result. In particular, for a given $t \geq 1$ and a fixed $x \in P_t$, the bound on $|B_x|$ gives, with $\tilde\beta_t := \beta_t - \tau^{-1/2}M$,

$$P\big(f(x) - \mu_t(x) > \beta_t\sigma_t(x)\big) \leq P\Big(\sum_{i=t_0}^{t-1} \alpha_i\big(y_i - f(x_i)\big) > \tilde\beta_t\sigma_t(x)\Big)$$
$$\leq \mathbb{E}\left[ e^{-\lambda\tilde\beta_t\sigma_t(x)}\, \mathbb{E}\Big[\exp\Big(\lambda \sum_{i=t_0}^{t-1} \alpha_i\big(y_i - f(x_i)\big)\Big)\, \Big|\, (x_i)_{i=t_0}^{t-1}\Big]\right] \qquad (27)$$
$$= \mathbb{E}\left[ e^{-\lambda\tilde\beta_t\sigma_t(x)} \prod_{i=t_0}^{t-1} \mathbb{E}\big[\exp\big(\lambda\alpha_i(y_i - f(x_i))\big)\, \big|\, (x_i)_{i=t_0}^{t-1}\big]\right] \qquad (28)$$
$$\leq \exp\left(\frac{\lambda^2\sigma^2\sum_{i=t_0}^{t-1}\alpha_i^2}{2} - \lambda\tilde\beta_t\sigma_t(x)\right) \qquad (29)$$
$$\leq \exp\left(\frac{\lambda^2\sigma^2\sigma_t(x)^2}{2} - \lambda\tilde\beta_t\sigma_t(x)\right) \qquad (30)$$
$$= \exp\left(-\frac{\tilde\beta_t^2}{2\sigma^2}\right). \qquad (31)$$

In the above display, (27) follows by an application of Chernoff's inequality with some constant $\lambda > 0$ to be selected later, (28) follows from the conditional independence property derived in Proposition 4, (29) uses the fact that, conditioned on $x_i$, the random variable $y_i - f(x_i)$ is zero-mean and $\sigma^2$-sub-Gaussian, (30) uses the fact that $\sum_{i=t_0}^{t-1}\alpha_i^2 \leq \sigma_t(x)^2$, and (31) follows by selecting $\lambda = \tilde\beta_t/(\sigma^2\sigma_t(x))$.

Repeating the argument of the previous display with $\mu_t(x) - f(x)$ in place of $f(x) - \mu_t(x)$ gives us

$$P\big(|f(x) - \mu_t(x)| > \beta_t\sigma_t(x)\big) \leq 2\exp\left(-\frac{\tilde\beta_t^2}{2\sigma^2}\right).$$

Next, using the fact that $\tilde\beta_t \geq \sqrt{2\sigma^2\log\big(|P_t|\pi^2 t^2/(3\delta)\big)}$, we have by a union bound over $x \in P_t$:

$$P\big(\exists\, x \in P_t \text{ s.t. } |f(x) - \mu_t(x)| > \beta_t\sigma_t(x)\big) \leq \sum_{x \in P_t} 2\exp\left(-\frac{\tilde\beta_t^2}{2\sigma^2}\right) \leq 2|P_t|\exp\left(-\log\frac{|P_t|\pi^2 t^2}{3\delta}\right) = \frac{6\delta}{\pi^2 t^2} = \delta_t. \quad \square$$

Since $\sum_{t=1}^n \delta_t < \delta$, we note that the following event $\mathcal{E}$ occurs with probability at least $1 - \delta$:

$$\mathcal{E} := \bigcap_{t=1}^n \bigcap_{x \in P_t} \big\{|f(x) - \mu_t(x)| \leq \beta_t\sigma_t(x)\big\}. \qquad (32)$$

Throughout the rest of the proof, we will work under the event $\mathcal{E}$ with $\delta = 1/n$. The expected regret of the algorithm $\mathcal{A}_1$ can then be upper bounded by

$$\mathbb{E}[R_n(\mathcal{A}_1, f)] = \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}\}] + \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}^c\}] \qquad (33)$$
$$\leq \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}\}] + P(\mathcal{E}^c)\, n \sup_{x \in \mathcal{X}}\big(f(x^*) - f(x)\big) \leq \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}\}] + O(1),$$

where the last step uses $P(\mathcal{E}^c) \leq 1/n$ and the fact that the suboptimality gap is bounded by a constant depending on $\|f\|_{\mathcal{H}_{K_\nu}}$ and $\mathrm{diam}(\mathcal{X})$. Since the second term is upper bounded by a constant, it suffices to show that the required upper bounds hold for the regret incurred under the event $\mathcal{E}$.
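The confidence-width schedule in Lemma 3 is chosen precisely so that the failure probabilities $\delta_t = 6\delta/(\pi^2 t^2)$ sum to at most $\delta$ (via $\sum_t 1/t^2 = \pi^2/6$), while $\beta_t$ grows only as $O(\sqrt{\log t})$. The short sketch below checks this bookkeeping; the values of $\sigma$, $\tau$, $M$, $\delta$, and $|P_t|$ are placeholders.

```python
import numpy as np

sigma, tau, M, delta = 0.5, 1.0, 2.0, 1e-2    # placeholder parameters
P_size = lambda t: 10                          # |P_t|, fixed here for simplicity

def beta(t):
    # Width multiplier of Lemma 3: RKHS-bias part + sub-Gaussian deviation part.
    return M / np.sqrt(tau) + np.sqrt(
        2 * sigma**2 * np.log(P_size(t) * np.pi**2 * t**2 / (3 * delta)))

n = 10_000
t = np.arange(1, n + 1)
delta_t = 6 * delta / (np.pi**2 * t**2)
print(f"sum_t delta_t = {delta_t.sum():.6f} <= delta = {delta}")  # union-bound mass
assert delta_t.sum() <= delta
print("beta_t at t = 1, 100, 10000:", [round(beta(x), 3) for x in (1, 100, 10000)])
```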
We now obtain a result about the suboptimality of the points queried by the algorithm.

Lemma 4. Suppose the event $\mathcal{E}$ introduced in (32) occurs, and the algorithm queries a point $x_t \in P_t \subset \mathcal{X}_h$ for some $h \geq 1$. Then we have

$$f(x^*) - f(x_t) \leq 7Lv_1^\xi\rho^{-\xi}\, \rho^{h\xi} = O\big(\rho^{h\xi}\big).$$

Proof. Since $h \geq 1$, the set $P_t$ must have been formed by a call to the Refine Partition subroutine. Let $x_t = x_{h,i}$ for some $i \in \{1, \ldots, 2^h\}$, and denote its parent node by $x_{h',i'}$, where $h' = h - 1$ and $i' = \lceil i/2 \rceil$. Assume that the active set $P_t$ was formed by a call to the Refine Partition subroutine at some time $t_0 < t$. Then, with $V_{h'} := L(v_1\rho^{h'})^\xi$, the following must be true:

$$f(x_t) = f(x_{h,i}) \geq f(x_{h',i'}) - V_{h'} \geq \mu_{t_0}(x_{h',i'}) - \beta_{t_0}\sigma_{t_0}(x_{h',i'}) - V_{h'} \qquad (34)$$
$$\geq \mu_{t_0}(x_{h',i'}) + \beta_{t_0}\sigma_{t_0}(x_{h',i'}) - 3V_{h'} \qquad (35)$$
$$\geq \max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big) - 3V_{h'} \qquad (36)$$
$$\geq \big(f(x^*) - 4V_{h'}\big) - 3V_{h'} = f(x^*) - 7V_{h'}. \qquad (37)$$

In the above display, the first inequality in (34) uses the fact that $f$ is $(L, \xi)$-Hölder continuous, while the second inequality uses the fact that, under the event $\mathcal{E}$, we have $f(x_{h',i'}) \geq \mu_{t_0}(x_{h',i'}) - \beta_{t_0}\sigma_{t_0}(x_{h',i'})$; (35) follows by adding and subtracting $2\beta_{t_0}\sigma_{t_0}(x_{h',i'})$, and then using the fact that $2\beta_{t_0}\sigma_{t_0}(x_{h',i'})$ must be smaller than $2V_{h'}$, due to line 6 in Algorithm 1; (36) then uses the fact that $\mu_{t_0}(x_{h',i'}) + \beta_{t_0}\sigma_{t_0}(x_{h',i'})$ must be larger than the highest lower bound, $\max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big)$, in order for $x_{h,i}$ to be included in the updated active set returned by Refine Partition; and finally, (37) uses the fact that $f(x^*) - 4V_{h'} \leq \max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big)$. To see this, let $x_{h',j}$ denote the point in $P_{t_0}$ such that $x^* \in \mathcal{X}_{h',j}$. Then we must have:

$$\max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big) \geq \mu_{t_0}(x_{h',j}) - \beta_{t_0}\sigma_{t_0}(x_{h',j}) \geq f(x_{h',j}) - 2\beta_{t_0}\sigma_{t_0}(x_{h',j}) \geq f(x^*) - 2V_{h'} - 2\beta_{t_0}\sigma_{t_0}(x_{h',j}) \geq f(x^*) - 4V_{h'}. \quad \square$$

The previous lemma gives us a bound on the suboptimality of any point queried by the algorithm at a time $t$, in terms of the parameter $\rho$ and the depth $h$ of the cell. Now, let $N_h$ denote the number of times the algorithm queries a point at level $h$, i.e., lying in the subset $\mathcal{X}_h$. Then we have the following regret decomposition (assuming the event $\mathcal{E}$ defined in (32) occurs):

$$R_n(\mathcal{A}_1, f) = \sum_{t=1}^n f(x^*) - f(x_t) = O\left(\sum_h \rho^{h\xi}\, N_h\right).$$

To complete the proof, it remains to get an upper bound on the term $N_h$, which we do in two ways: one in terms of the maximum information gain $\gamma_n$, and the other in terms of the upper-complexity term $\mathcal{C}_f$ introduced in Definition 8. The former will imply the minimax near-optimality of $\mathcal{A}_1$, while the latter will lead to improved regret (beyond the minimax value) on some easier problem instances. We now present the $\gamma_n$-based upper bound on $N_h$. As an immediate consequence of this bound, we also observe that $\mathcal{A}_1$ is minimax near-optimal.

Lemma 5. The number of queries made by $\mathcal{A}_1$ at level $h$ of the tree satisfies $N_h = \tilde O\big(\rho^{-2h\xi}\gamma_n\big)$. As a consequence of this, we obtain the following upper bound on the regret:

$$\mathbb{E}[R_n(\mathcal{A}_1, f)] = \tilde O\big(\sqrt{n\gamma_n}\big) = \tilde O\big(n^{\bar a_\nu}\big), \quad \text{with } \bar a_\nu = \frac{\nu + d}{2\nu + d},$$

where the exponent follows from the bound $\gamma_n = \tilde O\big(n^{d/(2\nu+d)}\big)$ for the Matérn kernel (Vakili et al., 2021a).

Proof. This result follows by using (Valko et al., 2013, Lemmas 4 and 5) to get that $\rho^{h\xi}N_h = \tilde O\big(\beta_n\sqrt{\gamma_n N_h}\big)$. Dividing both sides by $\sqrt{N_h}$ and squaring gives the upper bound on $N_h$. Next, under the event $\mathcal{E}$, we have

$$R_n = O\left(\sum_{h=0}^{h_{\max}} \rho^{h\xi}N_h\right) = \tilde O\left(\sum_{h=0}^{h_{\max}} \beta_n\sqrt{\gamma_n N_h}\right) = \tilde O\big(h_{\max}\sqrt{\gamma_n n}\big) = \tilde O\big(\sqrt{\gamma_n n}\big).$$

In the last two equalities, we used the fact that $\beta_n$ and $h_{\max}$ are polylogarithmic in $n$, and hence are absorbed by the hidden polylogarithmic factors in the $\tilde O(\cdot)$ notation. $\square$

Lemma 6. Assume that the event $\mathcal{E}$ holds, and let $N_h$ denote the number of queries made by $\mathcal{A}_1$ at level $h$. Introduce the set $W_h := \{x \in \mathcal{X} : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{h\xi}\}$, and let $m_h = m(W_h, 2v_2\rho^h)$ denote the $2v_2\rho^h$-packing number of the set $W_h$. Then $N_h$ is upper bounded by $\tilde O\big(\rho^{-2h\xi}m_h\big)$.

Proof. Suppose a point $x_{h,i} \in \mathcal{X}_h$ is evaluated $n_{h,i}$ times before a call to Refine Partition is made. Then we must have

$$n_{h,i} = \tilde O\left(\frac{\tau\beta_t^2}{L^2 v_1^{2\xi}\rho^{2h\xi}}\right) = \tilde O\big(\rho^{-2h\xi}\big). \qquad (38)$$

This is due to the following fact: suppose that at time $t$, the point $x_{h,i}$ has been evaluated $s$ times.
Then, by (Shekhar and Javidi, 2018, Proposition 3), we know that the posterior standard deviation at $x_{h,i}$ must satisfy $\sigma_t(x_{h,i}) \leq \sqrt{\tau/s}$. Plugging $s \geq n_{h,i}$ from (38) into this bound implies that after $n_{h,i}$ evaluations, the condition $\beta_t\sigma_t(x_{h,i}) < L(v_1\rho^h)^\xi$ is satisfied, and hence the point $x_{h,i}$ will not be evaluated anymore. Furthermore, we also know that if $P_t \subset \mathcal{X}_h$, then it also satisfies the following two properties: (i) $P_t \subset Z_h := \{x \in \mathcal{X} : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{h\xi}\}$, and (ii) any two points in $P_t$ are separated by a distance of at least $2v_2\rho^h$. Together, these two facts imply that $P_t$ must be a packing set of $Z_h$, and thus we can upper bound $|P_t|$ with $m_h$, the $2v_2\rho^h$-packing number of $Z_h$. As a consequence, we have $N_h \leq \max_i n_{h,i}\, m_h = \tilde O\big(\rho^{-2h\xi}m_h\big)$. $\square$

It remains to show that the expected regret of $\mathcal{A}_1$ is upper bounded by the upper-complexity term $\mathcal{C}_f$.

Lemma 7. Introduce the term $H_n = \max\{H : \sum_{h=0}^H \rho^{-2h\xi}m_h \leq n\}$, and define $\Delta_n = \min\{n^{-(1-\bar a_\nu)}, \rho^{H_n\xi}\}$. Then we have

$$\mathbb{E}[R_n(\mathcal{A}_1, f)] = \tilde O\big(\mathcal{C}_f(\Delta_n)\big).$$

Proof. We will work under the event $\mathcal{E}$ introduced in (32), which occurs with probability at least $1 - 1/n$. The proof of this result follows by employing the upper bound on $N_h$ derived in Lemma 6, and rearranging the resulting terms to form the upper-complexity term. In particular, we note that $R_n(\mathcal{A}_1, f) = \tilde O\big(\sum_{h=0}^{H_n} \rho^{h\xi}N_h\big)$. Now, due to the upper bound on $N_h$ obtained in Lemma 6, we have $R_n(\mathcal{A}_1, f) = \tilde O\big(\sum_{h=0}^{H_n} \rho^{-h\xi}m_h\big)$. To rewrite this in terms of $\mathcal{C}_f$, note that $\Delta_n \leq \rho^{H_n\xi}$, and for $k \geq 0$ define

$$\tilde Z_k = W_{H_n - k} = \{x : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{(H_n-k)\xi}\} = \{x : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{-k\xi}\rho^{H_n\xi}\}.$$

Having defined $\tilde Z_k$, we note that the term $\tilde m_k$ in the definition of $\mathcal{C}_f$ is the same as $m_{H_n - k} = m\big(\tilde Z_k,\, 2v_2\rho^{H_n - k}\big)$: the $2v_2\rho^{H_n - k}$-packing number of $\tilde Z_k$. Finally, noting that $\rho^{(H_n - k)\xi} = \rho^{-k\xi}\rho^{H_n\xi}$ and $\Delta_n \leq \rho^{H_n\xi}$, we have the following:

$$\sum_{h=0}^{H_n} \rho^{-h\xi}m_h = \sum_{k=0}^{H_n} \rho^{-(H_n-k)\xi}\, \tilde m_k \leq \sum_{k \geq 0} \frac{\tilde m_k}{(1/\rho^\xi)^k\, \Delta_n} = \mathcal{C}_f(\Delta_n).$$

This completes the proof. $\square$

E. Proof of Proposition 2

As shown in (33), it suffices to get the bound on the regret under the probability-$(1 - 1/n)$ event $\mathcal{E}$ introduced in (32). When the objective function $f$ satisfies the local growth condition with exponent $b$, we can show that the regions $W_h = \{x : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{h\xi}\}$ are contained in a ball centered at the optimal point $x^*$. In particular, due to the local growth condition, it follows that $W_h \subset B\big(x^*, c''\rho^{h\xi/b}\big)$ for some constant $c''$. As $m_h = m(W_h, 2v_2\rho^h)$ is the $2v_2\rho^h$-packing number of the set $W_h$, we can bound it from above by using volume arguments to get $m_h = O\big(\rho^{-hd(1-\xi/b)}\big)$. Combining this with the result of Lemma 6, we get

$$R_n(\mathcal{A}_1, f) = \tilde O\left(\sum_{h=0}^{h_{\max}} \rho^{-h\xi}\rho^{-hd(1-\xi/b)}\right).$$

Introducing the term $\tilde d := d(1 - \xi/b)$, we have $R_n(\mathcal{A}_1, f) = \tilde O\big(\sum_{h=0}^{h_{\max}} \rho^{-h(\xi + \tilde d)}\big)$. Proceeding as in the proof of Lemma 7, introduce the term $\tilde H_n = \max\{H \leq h_{\max} : \sum_{h=0}^H \rho^{-h(2\xi + \tilde d)} \leq n\}$, so that $\rho^{-\tilde H_n} = \tilde O\big(n^{1/(2\xi + \tilde d)}\big)$. This gives us the following:

$$R_n(\mathcal{A}_1, f) = \tilde O\left(\sum_{h=0}^{\tilde H_n} \rho^{-h(\xi + \tilde d)} + \rho^{\tilde H_n\xi}\, n\right) = \tilde O\big(n^{(\xi + \tilde d)/(2\xi + \tilde d)}\big).$$

F. Extension to Lipschitz Bandits

We note that the ideas used in obtaining the instance-dependent lower bound in Theorem 1 can also be applied to related problems, such as Lipschitz bandits and kernelized level-set estimation. In this section, we state the analogous result for the Lipschitz bandit problem. Here, the goal is to design an adaptive querying strategy to optimize an unknown Lipschitz-continuous objective function $f$, with Lipschitz constant bounded above by $L$, via noisy zeroth-order queries. For this problem, we can prove an analog of Theorem 1.

Definition 18.
Let $f$ be a $(1-\lambda)L$-Lipschitz function for some $\lambda \in (0,1)$. Fix $\Delta > 0$, and introduce the sets $Z_k := \{x \in \mathcal{X} : 2^k\Delta \leq f(x^*) - f(x) < 2^{k+1}\Delta\}$. Introduce the radii $w_k = 3 \cdot 2^k\Delta/(\lambda L)$, and let $m_k$ denote the $2w_k$-packing number of the set $Z_k$ for $k \geq 0$. Then we can define the following complexity term:

$$\mathcal{C}_f^{(\mathrm{Lip})}(\Delta, L, \lambda) := \sum_{k \geq 0} \frac{m_k}{2^k\Delta}.$$

Then, proceeding as in the proof of Theorem 1, we can obtain the following lower bound.

Proposition 5. For a $(1-\lambda)L$-Lipschitz function $f$, the expected regret of an $a_0$-consistent (for the family of $L$-Lipschitz functions) algorithm $\mathcal{A}$ satisfies the following for any $a > a_0$:

$$\mathbb{E}[R_n(\mathcal{A}, f)] = \Omega\left(\sigma^2\, \mathcal{C}_f^{(\mathrm{Lip})}\big(n^{-(1-a)}, L, \lambda\big)\right).$$

The general steps involved in obtaining this statement are similar to those used in the proof of Theorem 1, and we omit the details. The main difference is that we now use a bump function of the form $g(x) = \max\{0,\, L(1 - \|x\|)\}$. It is easy to check that this function is $L$-Lipschitz and supported on the unit ball.

G. Additional Figures

Figure 3. The two panels plot the variation with dimension of the exponent of different regret upper and lower bounds (i.e., $\alpha$, if the regret scales as $n^\alpha$), on a problem instance $f \in \mathcal{H}_{K_\nu}$ satisfying $f(x^*) - f(x) \asymp \|x - x^*\|^b$ near its optimum.
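For readers who want to reproduce the qualitative content of Figure 3, the sketch below evaluates the two exponents appearing in our upper bounds as a function of the dimension: the minimax exponent $\bar a_\nu = (\nu + d)/(2\nu + d)$ from Lemma 5, and the instance-dependent exponent $(\xi + \tilde d)/(2\xi + \tilde d)$ with $\tilde d = d(1 - \xi/b)$ from Appendix E. The particular values $\nu = 0.5$ and $b = 2$ are illustrative; they fall in a regime where the instance-dependent exponent is the smaller (better) of the two.

```python
import numpy as np

nu, b = 0.5, 2.0           # smoothness and growth exponent (illustrative values)
xi = min(1.0, nu)          # Holder exponent from Fact 1

for d in range(1, 11):
    minimax = (nu + d) / (2 * nu + d)                # exponent from Lemma 5
    d_tilde = d * (1.0 - xi / b)                     # effective dimension, App. E
    adaptive = (xi + d_tilde) / (2 * xi + d_tilde)   # exponent from Proposition 2
    print(f"d={d:2d}:  minimax exponent={minimax:.3f},  "
          f"instance-dependent exponent={adaptive:.3f}")
```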