# Instance-dependent Regret Analysis of Kernelized Bandits

Shubhanshu Shekhar 1, Tara Javidi 2

1 Carnegie Mellon University. 2 University of California, San Diego. Correspondence to: Shubhanshu Shekhar.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the problem of designing an adaptive strategy for querying a noisy zeroth-order oracle to efficiently learn about the optimizer of an unknown function $f$. To make the problem tractable, we assume that $f$ lies in the reproducing kernel Hilbert space (RKHS) associated with a known kernel $K$, with its norm bounded by $M < \infty$. Prior results, working in a minimax framework, have characterized the worst-case (over all functions in the problem class) limits on regret achievable by any algorithm, and have constructed algorithms with matching (modulo polylogarithmic factors) worst-case performance for the Matérn family of kernels. These results suffer from two drawbacks. First, the minimax lower bound gives limited information about the limits of regret achievable by commonly used algorithms on a specific problem instance $f$. Second, the existing upper bound analysis fails to adapt to easier problem instances within the function class. Our work takes steps to address both these issues. First, we derive instance-dependent regret lower bounds for algorithms with uniformly (over the function class) vanishing normalized cumulative regret. Our result, valid for several practically relevant kernelized bandits algorithms, such as GP-UCB, GP-TS and SupKernelUCB, identifies a fundamental complexity measure associated with every problem instance. We then address the second issue by proposing a new minimax near-optimal algorithm which also adapts to easier problem instances.

1. Introduction

We consider the problem of optimizing a function $f : \mathcal{X} = [0,1]^d \to \mathbb{R}$ by adaptively gathering information about it via noisy zeroth-order-oracle queries. To make the problem tractable, we assume that (1) $f$ lies in the reproducing kernel Hilbert space (RKHS) associated with a given positive-definite kernel $K$, and (2) the RKHS norm of $f$ is bounded by a known constant $M < \infty$. The function $f$ can be accessed through noisy zeroth-order-oracle queries that return $y_x = f(x) + \eta$ at a query point $x \in \mathcal{X}$, with $\eta$ denoting the additive observation noise. Given a total budget $n$, the goal of an agent is to design an adaptive querying strategy, denoted by $\mathcal{A}$, to select a sequence of query points $(x_t)_{t=1}^n$ that incurs a small cumulative regret $R_n(\mathcal{A}, f)$, defined as
$$R_n(\mathcal{A}, f) := \sum_{t=1}^{n} f(x^*) - f(x_t), \qquad x^* \in \arg\max_{x \in \mathcal{X}} f(x).$$
The cumulative regret forces the agent to address the exploration-exploitation trade-off and prevents it from querying too many points from the suboptimal regions of the domain.

The problem described above is referred to as the kernelized bandit or agnostic Gaussian Process bandit problem (Srinivas et al., 2012). Prior theoretical works in this area have focused on establishing lower and upper bounds on the performance achievable by algorithms in the minimax setting. These results characterize the worst-case limits, over all functions in the given RKHS, of performance achievable by any adaptive sampling algorithm (see Section 1.1 for details). The minimax framework does not account for the fact that there may exist functions that are easier to optimize than others in the same function class. As a result, the minimax regret bounds may not accurately reflect the performance of carefully designed adaptive strategies on typical, non-adversarial, problem instances. In this paper, we take a step towards addressing this issue and present the first instance-dependent analysis for kernelized bandits.
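To fix ideas, the interaction protocol can be sketched in a few lines of Python. This is a toy illustration only: the Gaussian-noise oracle follows Assumption 2a below, while the objective, the noise level, and the uniform-sampling placeholder strategy are our own assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                                   # noise std (illustrative)
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)   # toy objective on X = [0, 1]
x_star = 0.3                                  # its maximizer

def oracle(x):
    """Noisy zeroth-order oracle: y = f(x) + eta, eta ~ N(0, sigma^2)."""
    return f(x) + sigma * rng.standard_normal()

n, cum_regret = 100, 0.0
for t in range(n):
    x_t = rng.uniform()                       # placeholder strategy A
    y_t = oracle(x_t)                         # observation available to A
    cum_regret += f(x_star) - f(x_t)          # contribution to R_n(A, f)
```

Any strategy discussed below replaces the uniform draw with an adaptive choice of $x_t$ based on past observations.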
In this paper, we focus on the RKHS corresponding to the Matérn family of kernels, denoted by $\{H_{K_\nu} : \nu > 0\}$. These kernels are highly relevant for practical applications and allow a graded control over the smoothness of their elements (through a smoothness parameter $\nu > 0$), as described by Stein (2012). We believe that our arguments can be extended to Squared-Exponential (SE) kernels by using the Fourier domain constructions of Scarlett et al. (2017), although the intermediate steps may be more complicated. Furthermore, we also restrict our attention primarily to analyzing the cumulative regret, and leave the extension of our work to the pure exploration setting for future work (see Section 5 for further discussion).

The rest of this paper is organized as follows: we discuss the limitations of the existing minimax analysis and present an overview of our contributions in Section 1.1. We formally state the problem and the required assumptions in Section 2, and in Section 3 we derive instance-dependent lower bounds on the regret achievable by a uniformly good class of algorithms (this class includes all the existing algorithms analyzed in the literature, including GP-UCB, GP-TS and SupKernelUCB). Finally, in Section 4.1, we propose a new algorithm that achieves the best of both worlds: it matches the minimax lower bounds (up to polylogarithmic factors) for the Matérn family of kernels in the worst case, and can also exploit additional structure present in the given problem instance to achieve regret tighter than the minimax lower bound on those instances.

1.1. Overview of Results

For a given class of functions $\mathcal{H}$, the minimax expected cumulative regret is defined as $R_n^*(\mathcal{H}) := \inf_{\mathcal{A}} \sup_{f \in \mathcal{H}} E[R_n(f, \mathcal{A})]$. For the RKHS associated with Matérn kernels with smoothness parameter $\nu > 0$, prior work has established a minimax rate $R_n^*(H_{K_\nu}) = \Theta\big(n^{(\nu+d)/(2\nu+d)}\big)$, ignoring polylogarithmic factors in $n$. In particular, the worst-case algorithm-independent lower bound of the order $\Omega\big(n^{(\nu+d)/(2\nu+d)}\big)$ for the Matérn kernels with smoothness parameter $\nu > 0$ was established by Scarlett et al. (2017). On the other hand, Vakili et al. (2021a) recently derived tighter bounds on the mutual information gain (or equivalently the effective dimension) associated with Matérn kernels, which in turn implied that the SupKernelUCB algorithm of Valko et al. (2013) matches (up to polylogarithmic terms) the above-stated lower bound, hence showing its near-optimality.

While the existing theoretical results provide a rather complete understanding of the minimax regret for the Matérn family of kernels, they suffer from two drawbacks:

1. The minimax lower bounds are obtained by constructing a suitable subset of hard problems, and then showing that there exists no algorithm that can perform well on all of those problems simultaneously. These results tell us that for any algorithm, there exists at least one hard problem instance on which that algorithm must incur a certain regret.
However, such results do not tell us what the limits of performance are for carefully designed, good algorithms (such as GP-UCB; the precise meaning of good is stated in Definition 2) on specific, non-adversarial, problem instances.

2. The existing analysis of most of the algorithms depends on global properties of the given function class, such as the dimension, kernel parameters and RKHS norm. Consider, for example, the GP-UCB algorithm, for which Srinivas et al. (2012) derived the regret upper bound $R_n = \tilde{O}(\sqrt{n}\,\gamma_n)$, where $\gamma_n$ is the maximum information gain for the given kernel $K$, formally defined in (2). Since $\gamma_n$ is a property associated with the kernel itself, the existing theory does not adapt to any simplifying structure that may be present in the specific problem instance.

Our main contributions make progress towards addressing the two issues stated above. In particular, we first derive an instance-dependent lower bound for algorithms with uniformly bounded normalized cumulative regret, and then we propose an algorithm that achieves near-optimal worst-case performance but can also exploit additional structure present in problem instances.

First, we consider the following question: Suppose we are given an algorithm $\mathcal{A}$ that is known to have $O(n^{a_0})$ worst-case regret over all functions in the RKHS associated with the Matérn kernel $K_\nu$ (denoted by $H_{K_\nu}$). Then, what values of expected regret can $\mathcal{A}$ achieve for a given function $f$ in $H_{K_\nu}$?

We answer this question by identifying a lower-complexity term $C_f$ that characterizes the per-instance achievable limit (precise statement in Theorem 1).

Main Result 1 (Lower Bound). Suppose an algorithm $\mathcal{A}$ has a worst-case regret $O(n^{a_0})$ over functions in $H_{K_\nu}$. Then, for a given $f$ with $\|f\|_{H_{K_\nu}} < M$, we have
$$E[R_n(f, \mathcal{A})] = \Omega\big( C_f\big( n^{-(1-a_0)} \big) \big), \quad \text{where } C_f(\Delta) := \sum_{k \ge 0} \frac{m_k}{2^{k+2}\,\Delta} \ \text{ for any } \Delta > 0,$$
and $m_k$ denotes the $2w_k = O\big((2^k \Delta)^{1/\nu}\big)$ packing number (see Definition 15 in Appendix A) of the annular set $Z_k = \{x \in \mathcal{X} : 2^k \Delta \le f(x^*) - f(x) < 2^{k+1} \Delta\}$.

To interpret the term $C_f$, let us deconstruct each element, $m_k/(2^{k+2}\Delta)$, in its defining sum. Consider a ball (denoted by $E$) of radius $w_k = O((2^k\Delta)^{1/\nu})$ contained in the region $Z_k$. By construction of $Z_k$, every point in $E$ is at most $2^{k+1}\Delta$ (and at least $2^k\Delta$) suboptimal for $f$. We first show that, if $\Delta > n^{-(1-a_0)}$, then $\mathcal{A}$ must spend at least $\Omega\big(1/(2^{k+1}\Delta)^2\big)$ queries in the region $E$. This implies that the total number of queries in the region $Z_k$ is at least $\Omega\big(m_k/(2^{k+1}\Delta)^2\big)$, owing to the fact that $m_k$ disjoint balls of radius $w_k$ can be packed in the region $Z_k$. This in turn implies that the total regret incurred by these queries is lower bounded by $\Omega\big(2^k\Delta \cdot m_k/(2^{k+1}\Delta)^2\big) = \Omega\big(m_k/(2^{k+2}\Delta)\big)$. Further details of this argument are in Section 3 and in Appendix B.1.

The previous result can be specialized to the minimax-optimal case by setting $a_0 = a^*_\nu := (\nu+d)/(2\nu+d)$, to obtain the per-instance performance limit of minimax optimal algorithms. Furthermore, with a careful choice of the problem instance $f$, we can also recover the minimax lower bound (see Remark 4). This motivates our next question: Can we construct a minimax-optimal algorithm that adapts, and incurs smaller regret on easier problem instances?

We address this question by constructing a new algorithm $\mathcal{A}_1$ in Section 4.1, for which we show the following (see Theorem 2 for a more precise statement).

Main Result 2 (Upper Bound).
We construct an algorithm, $\mathcal{A}_1$, that is minimax near-optimal for functions in $H_{K_\nu}$, and satisfies (with $a^*_\nu = (\nu+d)/(2\nu+d)$)
$$E[R_n(f, \mathcal{A}_1)] = \tilde{O}\big( \bar{C}_f\big( n^{-(1-a^*_\nu)} \big) \big), \quad \text{where } \bar{C}_f(\Delta) := \sum_{k \ge 0} \frac{\tilde{m}_k}{2^{k\xi}\,\Delta} \ \text{ for } \Delta > 0 \text{ and } \xi = \min\{1, \nu\}.$$
The term $\tilde{m}_k$ is the $2\tilde{w}_k = O\big((2^{k\xi}\Delta)^{1/\xi}\big)$ packing number of the set $\tilde{Z}_k = \{x : f(x^*) - f(x) = O(2^{k\xi}\Delta)\}$.

Similar to $C_f$, the upper-complexity term $\bar{C}_f$ can also be interpreted in terms of the number of queries made by our proposed algorithm $\mathcal{A}_1$ in the regions $\tilde{Z}_k$. In general, the term $\bar{C}_f$ is larger than $C_f$. This is mainly due to two reasons:

- We have stated our lower bound result for algorithms with a worst-case $O(n^{a_0})$ regret over the entire function class $H_{K_\nu}$. However, the proof only requires the regret guarantee to hold uniformly for a small class of local perturbations. Thus, it is possible to refine the result by imposing a uniform regret condition only in a small neighborhood of $f$ (with radius shrinking with $n$ at an appropriate rate). We leave the investigation of such refined lower bounds to future work.

- The packing radius $\tilde{w}_k$ used in $\bar{C}_f$ is smaller than the analogous term $w_k$ used in $C_f$, due to the presence of $\xi = \min\{1, \nu\}$ instead of $\nu$.

Despite these issues, in Proposition 2 we identify conditions under which $\bar{C}_f$ is strictly tighter than the minimax rate, thus proving that $\mathcal{A}_1$ can adapt to easier problem instances.

In the case of functions where the abstract packing numbers $m_k$ and $\tilde{m}_k$ can be well-estimated, the above results give us explicit (in $n$) regret bounds. For example, if the function satisfies $f(x^*) - f(x) \asymp \|x - x^*\|^b$ for some $b > \nu$ in a neighborhood of its maximizer $x^*$ (formally stated in Definition 4), then we can obtain explicit instance-dependent lower (Proposition 1) and upper (Proposition 2) bounds. In Figure 1, we plot the variation of the exponent of the regret bound (i.e., $\alpha$ if the regret is $n^\alpha$) with dimension for the existing upper bounds for different algorithms (dashed) and their corresponding instance-dependent lower bounds (solid lines, same color), as well as the instance-dependent upper bound of our algorithm $\mathcal{A}_1$, for $\nu = 1.1$ and $b = 1.2$.

Figure 1. Plot of the variation of the exponent of different regret upper and lower bounds (i.e., $\alpha$, if the regret is $n^\alpha$) with dimension, on an instance $f \in H_{K_\nu}$ satisfying $f(x^*) - f(x) \asymp \|x - x^*\|^b$ near its optimum. This local growth property of $f$, formally stated in Definition 4, allows us to obtain closed form estimates of $C_f$ and $\bar{C}_f$. The dashed lines denote upper bounds, and the solid lines of the same color represent the corresponding instance-dependent lower bounds derived in Proposition 1. Note that the solid blue line represents the instance-dependent lower bound for both $\mathcal{A}_1$ and SupKernelUCB, as they are both minimax near-optimal.

1.2. Related Work

Minimax lower bounds for kernelized bandits. Scarlett et al. (2017) derived the first algorithm-independent minimax lower bound on the regret for the kernelized bandits problem with SE and Matérn kernels. For the Matérn kernel $K_\nu$, they obtain an algorithm-independent minimax lower bound of $\Omega\big(n^{(d+\nu)/(d+2\nu)}\big)$. Their result proceeds by first constructing a finite class of hard problem instances, and then using a reduction to multiple hypothesis testing and a change of measure argument. However, the hard functions employed by Scarlett et al. (2017) are of the needle-in-haystack type, and may not be representative of the typical functions belonging to the RKHS. Thus, the resulting regret lower bound may provide a pessimistic limit for the achievable performance of algorithms on the specific problem instance encountered.
Instance-dependent lower bounds for linear bandits. Instance-dependent lower bounds have been derived for the related problem of linear bandits with finitely many arms, either in the pure-exploration (Degenne et al., 2020) or regret-minimization (Tirinzoni et al., 2020) settings. At a high level, the proofs of these results, as well as our Theorem 1, proceed in two steps: (i) given a problem instance, identify a suitable class of alternative problem instances, and (ii) obtain a lower bound on the number of queries that any uniformly good algorithm must allocate to the alternatives. For the case of finite domains considered in the above-mentioned works, the alternative classes are quite naturally defined in terms of instances with a different optimal arm. This is not the case in our setting, due to the continuous nature of the domain $\mathcal{X} = [0,1]^d$. We address this by presenting a perturbation argument to construct suitable alternatives, depending on the problem instance as well as the budget $n$, to derive our instance-dependent lower bounds.

Kernelized bandit algorithms. Srinivas et al. (2012) proposed the GP-UCB algorithm, motivated by the Upper Confidence Bound (UCB) strategy for multi-armed bandits (MABs) (Auer et al., 2002), which proceeds by selecting query points $(x_t)_{t \ge 1}$ that maximize a UCB for $f$ of the form $\mu_t(x) + \beta_t \sigma_t(x)$ over a sequence of finite discretizations of the domain $\mathcal{X}$, for suitable factors $(\beta_t)_{t \ge 1}$. For this algorithm, Srinivas et al. (2012) derived the following high-probability upper bound on $R_n$:
$$R_n = \tilde{O}\big(\sqrt{n}\,\gamma_n\big). \qquad (1)$$
In the above display, $\gamma_n$ is the maximum information gain associated with the kernel $K$, defined as
$$\gamma_n := \max_{S \subset \mathcal{X} : |S| = n} I(y_S; f), \quad \text{for } f \sim GP(0, K), \qquad (2)$$
where $y_S = (y_1, \ldots, y_n)$ is the vector of observations at points in $S = (x_1, \ldots, x_n)$ and $I(y_S; f)$ denotes the mutual information between the observations $y_S$ and the function $f$, assumed to be a sample from a zero-mean Gaussian Process $GP(0, K)$. To obtain explicit (in $n$) regret bounds from (1), Srinivas et al. (2012) also derived upper bounds on $\gamma_n$ for the SE and Matérn families. More recently, Vakili et al. (2021a) derived tighter bounds on $\gamma_n$ for these families using a different approach than that employed by Srinivas et al. (2012). In particular, for the Matérn family, this implies the following regret bound for GP-UCB: $R_n = \tilde{O}\big(n^{(3d/2+\nu)/(d+2\nu)}\big)$, where $\nu$ is the smoothness parameter. Chowdhury and Gopalan (2017) derived the same upper bound for the Thompson Sampling based randomized algorithm, GP-TS.

The regret bounds of GP-UCB and GP-TS for Matérn kernels are not sublinear when $\nu < d/2$. Janz et al. (2020) addressed this issue by proposing an algorithm (referred to as $\pi$-GP-UCB), which adaptively partitions the input space and fits independent GP models in each element of the partition. This yields an alternative bound on $\gamma_n$, and results in a sublinear regret bound of the form $R_n = \tilde{O}(n^{\tilde{\nu}})$, where $\tilde{\nu} = \frac{d(2d+3)+2\nu}{d(2d+4)+4\nu}$, for $\nu > 1$ and $d \ge 1$.
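To make the GP-UCB selection rule concrete, here is a minimal sketch of one round over a finite discretization. It is our own illustration: the kernel argument, the regularizer `lam`, and a fixed `beta_t` are assumptions, not the exact constants used in the papers above, and at least one prior observation is assumed.

```python
import numpy as np

def gp_ucb_step(K, X_grid, X_obs, y_obs, beta_t, lam=0.1):
    """One GP-UCB round: return argmax over X_grid of mu_t(x) + beta_t * sigma_t(x)."""
    t = len(X_obs)                                       # assumes t >= 1
    K_mat = np.array([[K(a, b) for b in X_obs] for a in X_obs])
    A_inv = np.linalg.inv(K_mat + lam * np.eye(t))       # (K + lam I)^-1
    best, best_ucb = None, -np.inf
    for x in X_grid:
        k_x = np.array([K(x, a) for a in X_obs])
        mu = k_x @ A_inv @ np.array(y_obs)               # posterior mean
        var = K(x, x) - k_x @ A_inv @ k_x                # posterior variance
        ucb = mu + beta_t * np.sqrt(max(var, 0.0))       # upper confidence bound
        if ucb > best_ucb:
            best, best_ucb = x, ucb
    return best
```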
Valko et al. (2013) proposed the SupKernelUCB algorithm, which proceeds by dividing the queried points into batches consisting of conditionally independent observations. Although initially proposed for finite domains, the algorithm easily extends to continuous domains by using sufficiently fine discretizations. Valko et al. (2013) showed that the SupKernelUCB algorithm achieves a regret bound of $R_n = \tilde{O}\big(\sqrt{n\,\gamma_n}\big)$, tighter than the bound for GP-UCB (and GP-TS) by a factor of $\sqrt{\gamma_n}$. By plugging in the recently derived bounds on $\gamma_n$ for Matérn kernels by Vakili et al. (2021a), this implies that the SupKernelUCB algorithm achieves a regret bound $R_n = \tilde{O}\big(n^{(d+\nu)/(d+2\nu)}\big)$, matching the minimax lower bounds. Despite being minimax near-optimal, SupKernelUCB often performs poorly in practice. Recent works, such as (Salgia et al., 2021; Li and Scarlett, 2022), address this by designing algorithms that are minimax near-optimal and also perform well on common benchmarks. Finally, we note that another minimax near-optimal kernelized bandit algorithm was recently proposed by Camilleri et al. (2021), using a sampling strategy based on G-optimal designs.

2. Preliminaries

In this section, we introduce some definitions and notations required to formally state the problem. Additional definitions are deferred to Appendix A. As stated in the introduction, we consider the problem of optimizing a black-box function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} = [0,1]^d$, that can be accessed through noisy zeroth-order-oracle queries. The task of the learner (or agent) is to design an adaptive strategy to select query points $\{x_t : 1 \le t \le n\}$ that incur a small cumulative regret $R_n(f, \mathcal{A}) = \sum_{t=1}^n f(x^*) - f(x_t)$. To formally present the problem statement, we need a precise definition of adaptive querying strategies and the exact assumptions made on the objective function and observation noise. We begin with a definition of an adaptive querying strategy.

Definition 1 (Adaptive Strategy). An adaptive querying strategy $\mathcal{A}$ consists of a sequence of mappings $(\mathcal{A}_t)_{t=1}^{\infty}$, where $\mathcal{A}_t : (\mathcal{X} \times \mathcal{Y})^{t-1} \times \mathcal{U}^t \to \mathcal{X}$, and $\mathcal{U} = [0,1]$ represents the range of additional randomness (used in randomized algorithms such as GP-TS).

For the rest of the paper, we will use the terms adaptive strategy and algorithm interchangeably. Next, we introduce the assumptions on the objective function and observation noise under which our theoretical results will be stated.

Assumption 1. The objective function $f$ lies in the RKHS $H_K$ associated with a known kernel $K$, and its RKHS norm $\|f\|_{H_K}$ is bounded by a known constant $M < \infty$.

The above assumption is standard in the kernelized bandits literature; informally, it states that the unknown function has low complexity, where the complexity is quantified in terms of the RKHS norm. As stated in the introduction, we will restrict our discussion to Matérn kernels in this paper (formally defined in Appendix A).

Assumption 2a. The zeroth-order-oracle returns $y(x_t) = f(x_t) + \eta_t$ for a query point $x_t \in \mathcal{X}$, where $\{\eta_t : t \ge 1\}$ is an i.i.d. sequence of $N(0, \sigma^2)$ random variables.

Assumption 2b. The zeroth-order-oracle returns $y(x_t) = f(x_t) + \eta_t$ for a query point $x_t \in \mathcal{X}$, where $\{\eta_t : t \ge 1\}$ is an i.i.d. sequence of $\sigma^2$-sub-Gaussian random variables.

These two assumptions state that the observation noise has light tails, and hence we can construct tight confidence intervals for the unknown function based on the noisy observations. We will use Assumption 2a in the statement of our lower bound, while the more general Assumption 2b will be used to state the upper bound result. Note that imposing the $N(0, \sigma^2)$ requirement for stating the lower bounds is primarily to simplify the presentation. This is further discussed in Remark 9 in Appendix B.2.1, and relies on the fact that we can obtain closed form expressions for the KL-divergence involving Gaussian random variables, which simplifies the expression of the lower-complexity term (see Definition 3). We now present the formal problem statement.
Problem Statement 1. Suppose Assumptions 1 and either 2a or 2b hold with known values of $M$ and $\sigma^2$. Then, given a total querying budget $n$, design an adaptive strategy $\mathcal{A}$ to select query points $x_1, \ldots, x_n$ which incur a small cumulative regret $R_n(\mathcal{A}, f) := \sum_{t=1}^n f(x^*) - f(x_t)$.

Notations. We end this section by listing some of the notations that will be used in the rest of the paper. As mentioned earlier, $\mathcal{X} = [0,1]^d$ for some $d \ge 1$ represents the domain and $f : \mathcal{X} \to \mathbb{R}$ is the unknown objective function. We will use $x^*$ to represent an optimal point of $f$, i.e., $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$. For any $x \in \mathcal{X}$ and $r > 0$, we use $B(x, r)$ to denote the $\ell_2$ open ball $\{z \in \mathcal{X} : \|z - x\|_2 < r\}$, and $B(x, r_1, r_2)$ to denote the annular region $\{z \in \mathcal{X} : r_1 < \|z - x\| < r_2\}$. For a function $f$ and constants $u > 0$ and $c > 1$, we use $\tilde{\mathcal{X}}(f, u, c)$ to denote the set $\{x \in \mathcal{X} : u \le f(x^*) - f(x) \le cu\}$.

3. Lower Bounds

In order to derive instance-dependent bounds on the regret, we need to restrict our attention to uniformly-good algorithms that perform well for all elements of the given problem class. A restriction of this kind is necessary to obtain non-trivial instance-dependent regret bounds. Otherwise, for every problem instance $f$, there exists a trivial algorithm that always queries $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$, and incurs no regret on $f$, but suffers a linear regret for all functions for which $x^*$ is strictly suboptimal.

We begin by presenting a definition of $a_0$-consistent algorithms that formally characterizes the meaning of uniformly good algorithms required for obtaining instance-dependent lower bounds. This definition is motivated by a similar notion of consistent policies used in multi-armed bandits (see Lattimore and Szepesvári, 2020, Definition 16.1).

Definition 2 ($a_0$-consistency). An algorithm $\mathcal{A}$ is said to be $a_0$-consistent over a function class $\mathcal{F}$ if, for all $a > a_0$ and $f \in \mathcal{F}$, the following holds:
$$\lim_{n \to \infty} \frac{E[R_n(\mathcal{A}, f)]}{n^a} = 0.$$

Remark 1. Note that when $\mathcal{F}$ is the RKHS associated with a Matérn kernel, all the existing algorithms discussed in Section 1.2 satisfy the condition above with some $a_0 < 1$. In particular, this condition is satisfied by GP-UCB and GP-TS with $a_0 = \min\{1, (\nu + 3d/2)/(2\nu + d)\}$, by $\pi$-GP-UCB with $a_0 = (d(2d+3) + 2\nu)/(d(2d+4) + 4\nu)$, and by SupKernelUCB with $a_0 = (\nu + d)/(2\nu + d)$.

The rest of this section is organized as follows. In Section 3.1, we first present a general lower bound that characterizes the regret achievable by an algorithm on a given function $f$, in terms of the lower-complexity term $C_f$ introduced in Definition 3. This abstract complexity term depends on the packing numbers of certain near-optimal regions associated with $f$, as well as the exponent $a_0$ of the uniform regret condition for $a_0$-consistent algorithms. In Section 3.2, we consider the special case of functions satisfying a local growth condition (Definition 4), for which the term $C_f$ can be well-estimated to get an explicit lower bound in terms of the query budget $n$.

3.1. General lower bound

We now obtain a general instance-dependent lower bound for $a_0$-consistent algorithms. First, we introduce a notion of complexity associated with a function $f \in H_{K_\nu}(M)$, which will be used to state the main result.

Definition 3 (Lower-Complexity). Let $f \in H_{K_\nu}(M)$ with $\|f\|_{K_\nu} = (1-\lambda)M$ for some $\nu > 0$, $M > 0$ and $\lambda \in (0,1)$. Fix $\Delta > 0$, and introduce the set $Z_k := \tilde{\mathcal{X}}(f, 2^k\Delta, 2) = \{x \in \mathcal{X} : 2^k\Delta \le f(x^*) - f(x) < 2^{k+1}\Delta\}$. Introduce the radius $w_k = \big(3 \cdot 2^k \Delta\, M_\nu/(\lambda M)\big)^{1/\nu}$ (with $M_\nu$ denoting the norm of the bump function introduced in Definition 17 in Appendix B.1.2), and let $m_k$ denote the $2w_k$-packing number of the set $Z_k$. Finally, define the complexity term
$$C_f(\Delta) := \sum_{k \ge 0} \frac{m_k}{2^{k+2}\,\Delta}. \qquad (3)$$
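For intuition, the complexity term (3) can be estimated numerically on a one-dimensional grid by greedily packing each annulus $Z_k$. The sketch below is our own illustration: the toy objective, the grid resolution, and the constant `cst` (standing in for $M_\nu/(\lambda M)$) are assumptions.

```python
import numpy as np

def lower_complexity(f, grid, nu, cst, delta, k_max=20):
    """Greedy estimate of C_f(delta) from eq. (3) on a 1-d grid (illustrative)."""
    f_star = f(grid).max()
    total = 0.0
    for k in range(k_max):
        lo, hi = 2 ** k * delta, 2 ** (k + 1) * delta
        gap = f_star - f(grid)
        Z_k = np.sort(grid[(gap >= lo) & (gap < hi)])   # annular set Z_k
        w_k = (3 * 2 ** k * delta * cst) ** (1.0 / nu)  # radius from Definition 3
        m_k, last = 0, -np.inf
        for x in Z_k:                                   # greedy 2*w_k packing
            if x - last >= 2 * w_k:
                m_k, last = m_k + 1, x
        total += m_k / (2 ** (k + 2) * delta)
    return total

grid = np.linspace(0.0, 1.0, 10_000)
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)             # smooth toy objective
print(lower_complexity(f, grid, nu=1.5, cst=1.0, delta=1e-2))
```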
Note that our notation $C_f(\Delta)$ suppresses the $\nu$, $M$ and $\lambda$ dependence of $C_f$. We now present an instance-dependent lower bound on the expected cumulative regret of any $a_0$-consistent algorithm $\mathcal{A}$ in terms of the complexity term introduced above.

Theorem 1. Let $f \in H_{K_\nu}(M)$ with $\|f\|_{K_\nu} = (1-\lambda)M$ for $\nu > 0$, $M > 0$ and $\lambda \in (0,1)$. Consider any $a_0$-consistent algorithm (with $a_0 < 1$) over the family of functions $H_{K_\nu}(M)$, denoted by $\mathcal{A}$, and fix any $a > a_0$. Then, the expected cumulative regret of $\mathcal{A}$ on the instance $f$ satisfies
$$E[R_n(\mathcal{A}, f)] \ \ge\ \frac{7 \log 2}{4}\, \sigma^2\, C_f\big(n^{-(1-a)}\big),$$
for $n$ large enough (exact condition in equation 13 in Appendix B).

Theorem 1 follows as a consequence of a more general statement, presented and proved in Appendix B. We also present a detailed overview of the argument involved in proving Theorem 1 in Appendix B.1. The high-level idea is to first construct a perturbed version of $f$ by adding a bump function (see Definition 17 in Appendix B.1.2) supported on a ball $E$ that is contained in a region $Z_k$ for some $k \ge 1$ (see Definition 3). Then, using the $a_0$-consistency of $\mathcal{A}$, we obtain a lower bound on the number of times $\mathcal{A}$ must query the region $E$. By identifying the total number of disjoint balls that can be packed in different regions of the input space, and adding their contributions to the regret, we obtain the final result.

Remark 2. The lower bound in Theorem 1, based on the complexity term of eq. (3), has a natural interpretation. For any $k \ge 0$, the set $Z_k$ denotes the annular region where the suboptimality $f(x^*) - f(x)$ is between $2^k\Delta_n$ and $2^{k+1}\Delta_n$, with $\Delta_n := n^{-(1-a)}$. Then, the term $w_k = \big(3 \cdot 2^k \Delta_n M_\nu/(\lambda M)\big)^{1/\nu}$ denotes the radius of the smallest ball that can support a scaled bump function (see Definition 17) which ensures that the resulting perturbed version of $f$ still remains in $H_{K_\nu}(M)$ and also satisfies the properties stated in Definition 16 in Appendix B.1.1. As we show in Appendix B.1.1, for any such perturbation of $f$, the algorithm $\mathcal{A}$ must spend roughly $1/(2^{k+1}\Delta_n)^2$ samples to distinguish between $f$ and its perturbation. The regret incurred in the process is lower bounded by $2^k\Delta_n \cdot 1/(2^{k+1}\Delta_n)^2 = 1/(2^{k+2}\Delta_n)$; that is, suboptimality ($\approx 2^k\Delta_n$) times the number of queries in that region ($\approx 1/(2^{k+1}\Delta_n)^2$). Since $m_k$ disjoint balls of radius $w_k$ can be packed into $Z_k$, the expression of the complexity term in (3) follows.

The strict inequality in the definition of the complexity term in (3) immediately implies the following weaker, but more interpretable, version of the above statement.

Corollary 1. Consider an $a_0$-consistent algorithm $\mathcal{A}$, and fix any $a > a_0$. Under the same assumptions as Theorem 1, and with $w_0 := \big(3\, n^{-(1-a)}\, M_\nu/(\lambda M)\big)^{1/\nu}$, let $m_0$ denote the $2w_0$-packing number of the set $Z_0 := \{x \in \mathcal{X} : n^{-(1-a)} \le f(x^*) - f(x) < 2n^{-(1-a)}\}$. Then, we have the following:
$$E[R_n(f, \mathcal{A})] = \Omega\big(\sigma^2\, m_0\, n^{1-a}\big).$$

Remark 3. Corollary 1 states that a key quantity characterizing the regret achievable by a uniformly good algorithm is the packing number of an annular near-optimal region associated with the given function. Informally, we can write $m_0 \approx w_0^{-\bar{d}} \approx n^{(1-a)\bar{d}/\nu}$, where $\bar{d} := \liminf_n \log(m_0)/\log(1/w_0)$. The term $\bar{d}$ is reminiscent of the concepts of near-optimality dimension and zooming dimension used in prior works on bandits in metric spaces (Bubeck et al., 2011; Kleinberg et al., 2019), as well as in Gaussian Process bandits (Shekhar and Javidi, 2018).
These works use such notions to obtain instance-dependent upper bounds on the regret of algorithms that non-uniformly discretize the domain. This connection also motivates our approach in designing Algorithm 1 in Section 4.1.

Remark 4. As a sanity check, we note that Corollary 1 recovers the minimax lower bound by considering $f$ to be a scaled bump function $g$ introduced in Definition 17, and $\mathcal{A}$ to be a minimax near-optimal algorithm. In particular, by defining $f(\cdot) = 2\Delta_n\, g(\cdot/w_0)$, the region $Z_0$ contains the whole domain with a ball of radius $w_0$ removed. This implies that $m_0 \approx w_0^{-d} = n^{(1-a)d/\nu}$ for $a > a^*_\nu = (d+\nu)/(d+2\nu)$, and thus $E[R_n(f, \mathcal{A})] = \Omega(n^\alpha)$ for any
$$\alpha < (1 - a^*_\nu)\Big(1 + \frac{d}{\nu}\Big) = \frac{\nu}{d+2\nu} \cdot \frac{d+\nu}{\nu} = \frac{d+\nu}{d+2\nu} = a^*_\nu.$$

In the next section, we specialize the above results to functions that satisfy an additional local growth condition, for which the complexity terms can be explicitly lower bounded in terms of more interpretable parameters.

3.2. Lower bound under growth condition

In this section, we consider the class of functions satisfying the following additional assumption.

Definition 4 (Growth Condition). We say that the objective function $f$ satisfies the local growth condition with parameters $(\underline{c}, \bar{c}, b, \varrho_0)$ if for all $x \in B(x^*, \varrho_0) \cap \mathcal{X}$, we have
$$\underline{c}\, \|x - x^*\|^b \ \le\ f(x^*) - f(x) \ \le\ \bar{c}\, \|x - x^*\|^b.$$
We shall denote by $\mathcal{F}(\underline{c}, \bar{c}, b, \varrho_0)$ the class of all functions satisfying this property.

Similar conditions have been used in analyzing the performance of first-order stochastic optimization algorithms by Ramdas and Singh (2013), and in characterizing the minimax rates of active learning algorithms by Castro and Nowak (2008). As an example, consider the case when the function $f$ has continuous second-order derivatives and its optimizer $x^*$ lies in the interior of the domain. Then, if the Hessian of $f$ at $x^*$ is non-singular, we can find a $\varrho_0 > 0$ such that for all $x$ in $B(x^*, \varrho_0)$, the spectral norm of the Hessian of $f$ is between $\underline{c}$ and $\bar{c}$. Then the function $f$ satisfies the growth condition with exponent $2$ and constants $\underline{c}$ and $\bar{c}$ depending on the minimum and maximum norm of the Hessian in $B(x^*, \varrho_0)$. We now state the main result of this section.

Proposition 1. Introduce the function class $\mathcal{G} := H_{K_\nu}(M) \cap \mathcal{F}(\underline{c}, \bar{c}, b, \varrho_0)$ for some $\nu > 0$ and $b > \nu$. Let $\mathcal{A}$ be an $a_0$-consistent algorithm with $a_0 < 1$ for the function class $H_{K_\nu}(M)$. Then, for any $a > a_0$ and $f \in \mathcal{G}$ with $\|f\|_{H_{K_\nu}} < M$, we have the following:
$$\liminf_{n \to \infty} \frac{E[R_n(\mathcal{A}, f)]}{n^\alpha} > 0 \quad \text{for any } \alpha < (1 - a_0)\Big(1 + \frac{d}{\nu}\Big(1 - \frac{\nu}{b}\Big)\Big).$$

Since most of the commonly used algorithms, such as GP-UCB, GP-TS, SupKernelUCB and $\pi$-GP-UCB, discussed further in Section 1.2, satisfy the $a_0$-consistency introduced in Definition 2, we can use Theorem 1 to obtain instance-dependent lower bounds for these algorithms.

Corollary 2. Proposition 1 implies that $E[R_n(\mathcal{A}, f)] = \Omega(n^\alpha)$ when $f \in \mathcal{G}$, $b > \nu$, and

- $\mathcal{A}$ is the SupKernelUCB algorithm, $\nu > 1$ and $\alpha < \frac{\nu + d(1 - \nu/b)}{2\nu + d}$;
- $\mathcal{A}$ is either GP-UCB or GP-TS, $\nu > d/2$, and $\alpha < \big(1 - \frac{d}{2\nu}\big)\,\frac{\nu + d(1 - \nu/b)}{2\nu + d}$;
- $\mathcal{A}$ is the $\pi$-GP-UCB algorithm, $\nu > 1$ and $\alpha < \big(1 + \frac{d}{2\nu}\big)\,\frac{\nu + d(1 - \nu/b)}{2\nu + d(d+2)}$.

The results of the above corollary are presented for the specific values $\nu = 1.1$ and $b = 1.2$ in Figure 1. As we can see, algorithms with a tighter uniform regret bound (i.e., smaller $a_0$) incur higher instance-dependent lower bounds.
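The exponents in Corollary 2 are easy to tabulate. The short script below reproduces the kind of comparison shown in Figure 1; it is our own illustration, and it simply restates the corollary's formulas as reconstructed above.

```python
def lb_exponents(nu, b, d):
    """Instance-dependent lower-bound exponents from Corollary 2."""
    base = (nu + d * (1 - nu / b)) / (2 * nu + d)        # SupKernelUCB
    gp_ucb = (1 - d / (2 * nu)) * base                   # GP-UCB / GP-TS (needs nu > d/2)
    pi_gp = (1 + d / (2 * nu)) * (nu + d * (1 - nu / b)) / (2 * nu + d * (d + 2))
    return base, gp_ucb, pi_gp

# nu = 1.1 > 1 and nu > d/2 hold for d = 1, 2, so all three items apply
for d in (1, 2):
    sup_ucb, gp_ucb, pi_gp = lb_exponents(nu=1.1, b=1.2, d=d)
    print(f"d={d}: SupKernelUCB<{sup_ucb:.3f}  GP-UCB<{gp_ucb:.3f}  pi-GP-UCB<{pi_gp:.3f}")
```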
4. Instance-Dependent Upper Bound

As mentioned in the introduction, the prior theoretical analyses of the existing kernelized bandits algorithms upper bound their regret in terms of quantities such as the maximum information gain $\gamma_n$, defined in (2). Since $\gamma_n$ is a property associated with the kernel $K$, such results do not adapt to the hardness of the specific problem instance within the associated RKHS: they predict the same performance guarantees for the easiest as well as the hardest problem in the class. As a result, the ability of the existing algorithms to adapt to the complexity of problems within the same problem class is unknown. Our results in this section take a step towards addressing this issue. In particular, we describe a new algorithm for kernelized bandits in Section 4.1, and show in Section 4.2 that this algorithm is minimax near-optimal and also admits tighter upper bounds for easier problem instances.

4.1. Proposed Algorithm

We first introduce a standard notion of a sequence of nested partitions of the input space, often used in prior works, such as (Bubeck et al., 2011; Munos, 2011; Wang et al., 2014; Shekhar and Javidi, 2018), to design algorithms for zeroth-order optimization.

Algorithm 1: Breadthwise Exploration with Adaptive Discretization ($\mathcal{A}_1$)

Input: $n$, the querying budget; $K_\nu$, the kernel belonging to the Matérn family; $M$, upper bound on the RKHS norm; $v_1, v_2, \rho$, parameters of the tree of partitions; $\tau$, regularization parameter used in posterior computation.

Initialize: $t = 1$, flag $\leftarrow$ True, $E_t \leftarrow \emptyset$, $Y_t \leftarrow \emptyset$, $P_t \leftarrow \{x_{0,1}\}$, $L = M \log n$ and $\xi = \min\{\nu, 1\}$.
for $t = 1, 2, \ldots, n$ do
    while flag do
        $\beta_t, \{(\mu_t(x), \sigma_t(x)) : x \in P_t\} \leftarrow$ ComputePosterior($P_t, E_t, Y_t, \tau$)
        $x_t \leftarrow \arg\max_{x \in P_t} \sigma_t(x)$; $U_t \leftarrow \max_{x \in P_t} \sigma_t(x)$
        if $1/\sqrt{n} < U_t < L (v_1 \rho^{h_t})^{\xi}$ then
            $P_t, h_t \leftarrow$ RefinePartition($P_t, \beta_t, \{(\mu_t(x), \sigma_t(x)) : x \in P_t\}$)
        else
            $y_t \leftarrow f(x_t) + \eta_t$; $E_t \leftarrow E_t \cup \{x_t\}$; $Y_t \leftarrow Y_t \cup \{y_t\}$
            flag $\leftarrow$ False
    flag $\leftarrow$ True

Definition 5 (tree of partitions). We say that a sequence of subsets of $\mathcal{X}$, denoted by $(\mathcal{X}_h)_{h \ge 0}$, forms a tree of partitions of $\mathcal{X}$ if it satisfies the following properties:

- For all $h \ge 0$, we have $\mathcal{X}_h = \{x_{h,i} : 1 \le i \le 2^h\}$. Furthermore, with every $x_{h,i} \in \mathcal{X}_h$ is associated a cell $X_{h,i}$. For $i \ne j$, $X_{h,i}$ and $X_{h,j}$ are disjoint, and $\cup_{i=1}^{2^h} X_{h,i} = \mathcal{X}$.
- Moreover, for any $h, i$ we have $\{x_{h+1,2i-1}, x_{h+1,2i}\} \subset \mathcal{X}_{h+1} \cap X_{h,i}$.
- There exist constants $0 < v_2 \le 1 \le v_1$ and $\rho \in (0,1)$ such that $B(x_{h,i}, v_2\rho^h) \subset X_{h,i} \subset B(x_{h,i}, v_1\rho^h)$.

Next, we introduce two subroutines that will be employed by the algorithm. The first subroutine, called ComputePosterior, computes the posterior mean and posterior covariance function of the surrogate Gaussian Process (GP) model used for approximating the unknown objective function in the algorithm.

Definition 6 (ComputePosterior). Given a subset $P_t \subset \mathcal{X}$, a multi-set of points belonging to $P_t$, denoted by $E_t = \{x_1, \ldots, x_{t-1}\}$, at which the function $f$ was evaluated, the corresponding noisy function evaluations $Y_t = \{y_1, \ldots, y_{t-1}\}$, and a constant $\tau > 0$, the ComputePosterior subroutine returns the terms $\beta_t$ and $\{(\mu_t(x), \sigma_t(x)) : x \in P_t\}$, which are defined as follows:
$$\Sigma_t = K_t + \tau I, \quad \text{where } K_t = [K(x_i, x_j)]_{x_i, x_j \in E_t},$$
$$k_t(x) = [K(x, x_1), \ldots, K(x, x_{t-1})]^T, \qquad y_t = [y_1, \ldots, y_{t-1}]^T,$$
$$\beta_t = \sqrt{2 \log\big(|P_t|\, n^3\big)/t},$$
$$\mu_t(x) = k_t(x)^T \Sigma_t^{-1} y_t, \qquad \sigma_t(x) = \tau^{-1/2} \sqrt{K(x, x) - k_t(x)^T \Sigma_t^{-1} k_t(x)}.$$

As we describe later, our algorithm proceeds by adaptively discretizing the input space based on the observations gathered: the granularity of the partition becomes finer in the near-optimal regions of the input space, aimed at mimicking the discretization of the space involved in defining the complexity term in Definition 3.
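A direct sketch of ComputePosterior in Python is given below. It follows the displayed formulas (including the $\beta_t$ expression as reconstructed above); the dense matrix inverse is for illustration only, and at least one prior evaluation is assumed.

```python
import numpy as np

def compute_posterior(K, P_t, E_t, Y_t, tau, n):
    """ComputePosterior (Definition 6): beta_t and (mu_t, sigma_t) at each x in P_t."""
    t = len(E_t)                                             # assumes t >= 1
    K_t = np.array([[K(a, b) for b in E_t] for a in E_t])
    Sigma_inv = np.linalg.inv(K_t + tau * np.eye(t))         # (K_t + tau I)^-1
    y_vec = np.array(Y_t)
    beta_t = np.sqrt(2.0 * np.log(len(P_t) * n ** 3) / t)
    stats = []
    for x in P_t:
        k_x = np.array([K(x, a) for a in E_t])
        mu = float(k_x @ Sigma_inv @ y_vec)                  # posterior mean mu_t(x)
        var = (K(x, x) - k_x @ Sigma_inv @ k_x) / tau        # sigma_t(x)^2
        stats.append((mu, np.sqrt(max(var, 0.0))))
    return beta_t, stats
```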
Our second subroutine, called RefinePartition, presents the formal steps involved in updating the discretization used in the algorithm.

Definition 7 (RefinePartition). This subroutine takes as inputs $P_t$, $h_t$, $\beta_t$ and $\{(\mu_t(x), \sigma_t(x)) : x \in P_t\}$, and returns an updated partition $P_t$ and level $h_t$. First, compute $l_t := \max_{x \in P_t} \mu_t(x) - \beta_t\sigma_t(x)$ and define $\tilde{P}_t = \{x \in P_t : \mu_t(x) + \beta_t\sigma_t(x) > l_t\}$. Next, define the new $P_t$ as $P_t = \cup_{x_{h_t,i} \in \tilde{P}_t} \{x_{h_t+1,2i-1}, x_{h_t+1,2i}\}$, and update $h_t \leftarrow h_t + 1$.
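In code, Definition 7 is an eliminate-then-split step. The sketch below is our own rendering, with `children(x)` standing for the map $x_{h,i} \mapsto \{x_{h+1,2i-1}, x_{h+1,2i}\}$ from Definition 5.

```python
def refine_partition(P_t, h_t, beta_t, stats, children):
    """RefinePartition (Definition 7): drop provably suboptimal cells, split the rest."""
    lcb = [mu - beta_t * s for (mu, s) in stats]       # lower confidence bounds
    ucb = [mu + beta_t * s for (mu, s) in stats]       # upper confidence bounds
    l_t = max(lcb)                                     # best lower confidence bound
    survivors = [x for x, u in zip(P_t, ucb) if u > l_t]
    new_P = [child for x in survivors for child in children(x)]
    return new_P, h_t + 1
```

Since the cell containing $x^*$ has a UCB of at least $f(x^*) \ge l_t$ with high probability (the usual confidence-interval argument), the optimizer's cell survives each refinement.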
We now present an outline of the steps of our proposed algorithm; the formal pseudocode is in Algorithm 1.

Outline of Algorithm 1. At any time $t \ge 1$, the algorithm maintains a set of active points denoted by $P_t$. These points satisfy the following two properties: (i) $P_t \subset \mathcal{X}_{h_t}$ for some $h_t \ge 0$, i.e., all the active points lie at the same depth ($h_t$) of the tree of partitions, and (ii) any optimizer $x^*$ of $f$ must lie in the region $\cup_{x_{h_t,i} \in P_t} X_{h_t,i}$. The algorithm evaluates the function at points in the active set, and computes the posterior mean and standard deviation by calling the ComputePosterior subroutine. The algorithm then compares the maximum posterior standard deviation (over points in $P_t$) with an upper bound on the variation of the function value in the cell $X_{h_t,i}$ associated with an active point $x_{h_t,i} \in P_t$. If the maximum posterior standard deviation is larger than $L(v_1\rho^{h_t})^\xi$, then the algorithm evaluates the function at the corresponding active point with the largest $\sigma_t(x)$. Otherwise, it concludes that the active points in $P_t$ have been sufficiently well explored, and it moves to the next level of the partition tree by calling the RefinePartition subroutine. The above process continues until the querying budget is exhausted.

Remark 5. Algorithm 1 carefully combines two key ideas from the kernelized bandits literature: (i) it divides the evaluated points into subsets (according to their level $h$ in the tree) which satisfy a conditional independence property, similar to SupKernelUCB of Valko et al. (2013), and (ii) it adaptively partitions the input space to zoom into the near-optimal regions, similar to the algorithms in (Shekhar and Javidi, 2018; 2020). The first property allows us to construct tighter confidence intervals, which results in the algorithm achieving the minimax regret rate. As a result, applying Theorem 1 to this algorithm provides us with the best (i.e., the highest) instance-dependent lower bounds. Additionally, the second property allows the algorithm to exploit easier problem instances when the objective function satisfies the growth condition with small $b$, and results in improved regret bounds on these problem instances. This is proved in Theorem 2 and Proposition 2.

4.2. Regret Bound for Algorithm 1 ($\mathcal{A}_1$)

In this section, we analyze the performance of Algorithm 1 on individual problem instances lying in the RKHS associated with Matérn kernels. To do so, we introduce the following complexity measure associated with a problem instance $f$.

Definition 8 (Upper-Complexity). Consider a function $f \in H_{K_\nu}$ with $\nu > 0$, and define $\xi = \min\{1, \nu\}$. For given constants $\Delta, \rho \in (0,1)$ and $c_1, c_2 > 0$, introduce the set $\tilde{Z}_k := \{x \in \mathcal{X} : f(x^*) - f(x) \le c_1 (1/\rho)^{k\xi}\, \Delta\}$ for $k \ge 0$, and let $\tilde{m}_k$ denote the $2\tilde{w}_k := 2\big(c_2 (1/\rho)^{k\xi}\, \Delta\big)^{1/\xi}$ packing number of the set $\tilde{Z}_k$. Define the upper-complexity term as follows:
$$\bar{C}_f(\Delta) := \sum_{k \ge 0} \frac{\tilde{m}_k}{(1/\rho)^{k\xi}\,\Delta}.$$
Note that the notation $\bar{C}_f(\Delta)$ suppresses the $\nu$, $\rho$, $c_1$ and $c_2$ dependence of the complexity term.

Remark 6 (Comparison of $\bar{C}_f$ and $C_f$). The upper-complexity term introduced above has a form similar to the corresponding lower-complexity term ($C_f$) introduced earlier in Definition 3: both complexity measures sum over terms involving a packing number of a near-optimal set in the numerator, and an exponentially growing term times $\Delta$ in the denominator. Despite this similarity, the upper-complexity term is, in general, larger than the corresponding lower-complexity term. This is because $\tilde{w}_k$ is proportional to $\Delta^{1/\xi}$, while $w_k$ (in Definition 3) is proportional to $\Delta^{1/\nu}$. Since $\nu \ge \xi := \min\{1, \nu\}$ and $\Delta < 1$, the term $\tilde{m}_k$ represents a much tighter packing than the corresponding term $m_k$ in Definition 3. Another, less important, factor in $\bar{C}_f$ being larger than $C_f$ is that the set $\tilde{Z}_k$ is usually larger than the corresponding annular set $Z_k$ used in defining $C_f$.

We now state the main result of this section, which shows that the algorithm $\mathcal{A}_1$ is minimax near-optimal, and that its instance-dependent regret can be characterized by the upper-complexity term defined above.

Theorem 2. For the class of functions $f \in H_{K_\nu}(M)$, the algorithm $\mathcal{A}_1$ is $a_0$-consistent with $a_0 = a^*_\nu = (\nu+d)/(2\nu+d)$. Furthermore, $\mathcal{A}_1$ also satisfies the following instance-dependent upper bound on the expected regret:
$$E[R_n(\mathcal{A}_1, f)] = \tilde{O}\big(\bar{C}_f(\Delta_n)\big), \quad \text{where } \Delta_n = \min\big\{n^{-(1-a^*_\nu)},\ \rho^{H_n \xi}\big\},$$
and $H_n$ is defined precisely in Lemma 7 in Appendix D. The $\tilde{O}(\cdot)$ term suppresses polylogarithmic factors in $n$.

Again, it is instructive to specialize the above result to the special case of an instance $f$ that also satisfies the additional local-growth condition around its optima.

Proposition 2. Consider a function $f$ satisfying Assumption 1 with $K = K_\nu$, and Assumption 2b with parameter $\sigma^2$, that also satisfies the growth condition (introduced in Definition 4) with an exponent $b$. Then, the cumulative regret of Algorithm 1 satisfies the following, with $\xi = \min\{1, \nu\}$:
$$E[R_n(\mathcal{A}_1, f)] = \tilde{O}(n^a), \quad \text{where } a := \min\left\{\frac{d+\nu}{d+2\nu},\ \frac{d(1-\xi/b)_+ + \xi}{d(1-\xi/b)_+ + 2\xi}\right\}.$$
The notation $(z)_+$ refers to $\max\{0, z\}$, and the notation $\tilde{O}$ hides the polylogarithmic factors in the upper bound.

The proof of this result is given in Appendix D. In particular, this result implies that for an instance $f$ with local growth exponent $b > 0$, for all values of $d \ge 1$ and $0 < \nu < (1 - 1/b)^{-1}$, the regret achieved by Algorithm 1 is tighter than the minimax rate (achieved by SupKernelUCB). Furthermore, for a fixed $\nu > 0$, the amount of possible improvement increases with decreasing values of $b$. While the regret achieved by $\mathcal{A}_1$ is better than those of existing algorithms, it is still larger than the corresponding instance-dependent lower bound, as shown in Figure 1. The key reason is that $\mathcal{A}_1$ uses the Hölder continuity (with exponent $\xi$) of the elements of $H_{K_\nu}$ to adaptively discretize the domain, which leads to the $\xi$ (instead of $\nu$) dependence of the complexity term $\bar{C}_f$. This looseness can be avoided if we can construct tighter $\mathcal{X}$-uniform confidence intervals (Vakili et al., 2021b) for $f$.

5. Conclusion

In this paper, we initiated the instance-dependent analysis of the kernelized bandits problem. This investigation was motivated by the practical question of identifying the limits of performance achievable by commonly used algorithms, such as GP-UCB, on typical, non-adversarial, problem instances. We first obtained a general complexity measure that characterizes the fundamental hardness of a specific problem instance.
Then, we specialized this result to problem instances satisfying an additional local growth condition, to obtain explicit lower bounds in terms of the budget $n$. Finally, we introduced a new algorithm that achieves the best of both worlds: it matches the worst-case performance limit (modulo polylogarithmic terms) established by prior work, but also has the ability to adapt to easier problem instances.

The results of this paper lead to several interesting questions for future work, and we describe two key directions below:

- A natural next step is to investigate whether we can design an algorithm that is both minimax near-optimal and instance-optimal for the Matérn family. More specifically, with $a_0 = a^*_\nu := (d+\nu)/(d+2\nu)$, can we design an algorithm that satisfies $\sup_{f \in H_{K_\nu}(M)} E[R_n(\mathcal{A}, f)] = \tilde{O}(n^{a_0})$, and $E[R_n(\mathcal{A}, f)] = \tilde{O}\big(C_f(\Delta_n)\big)$ with $\Delta_n = n^{-(1-a)}$ for any $a > a_0$, simultaneously (recall that $C_f$ denotes the complexity term introduced in Definition 3)? From a technical point of view, achieving this will be significantly aided by deriving tight time- and $\mathcal{X}$-uniform confidence intervals for the GP model.

- Another interesting line of work is to adapt the ideas used in this paper to problems such as kernel level-set estimation, and optimization in other function spaces (Singh, 2021; Liu et al., 2021). The lower-bound technique of our paper can be easily generalized to these cases, as we discuss briefly in ??. However, designing algorithms that match the so-obtained instance-dependent lower bounds may require new techniques.

Finally, the focus in this paper was on obtaining instance-dependent bounds for the cumulative regret $R_n$. It is interesting to explore whether similar results can be obtained for the simple regret in the pure exploration setting.

References

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(5), 2011.
X. Cai and J. Scarlett. On lower bounds for standard and robust Gaussian process bandit optimization. In International Conference on Machine Learning, pages 1216-1226. PMLR, 2021.

R. Camilleri, K. Jamieson, and J. Katz-Samuels. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pages 1227-1237. PMLR, 2021.

R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.

S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844-853. JMLR.org, 2017.

R. Degenne, P. Ménard, X. Shang, and M. Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432-2442. PMLR, 2020.

A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377-399, 2019.

D. Janz, D. Burt, and J. González. Bandit optimisation of functions in the Matérn kernel RKHS. In International Conference on Artificial Intelligence and Statistics, pages 2486-2495. PMLR, 2020.

M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.

R. Kleinberg, A. Slivkins, and E. Upfal. Bandits and experts in metric spaces. Journal of the ACM (JACM), 66(4):1-77, 2019.

T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92-107. PMLR, 2022.

Y. Liu, Y. Wang, and A. Singh. Smooth bandit optimization: generalization to Hölder space. In International Conference on Artificial Intelligence and Statistics, pages 2206-2214. PMLR, 2021.

R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, pages 783-791, 2011.

Y. Polyanskiy and Y. Wu. Lecture notes on information theory. Technical report, MIT, 2014. (http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf).

A. Ramdas and A. Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In Proceedings of the 30th International Conference on Machine Learning, pages 365-373. PMLR, 2013.

S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. Advances in Neural Information Processing Systems, 34, 2021.

J. Scarlett, I. Bogunovic, and V. Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723-1742, 2017.

S. Shekhar and T. Javidi. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829-3874, 2018.

S. Shekhar and T. Javidi. Multi-scale zero-order optimization of smooth functions in an RKHS. arXiv preprint arXiv:2005.04832, 2020.

S. Singh. Continuum-armed bandits: a function space perspective. In International Conference on Artificial Intelligence and Statistics, pages 2620-2628. PMLR, 2021.

N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012.

M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 2012.

A. Tirinzoni, M. Pirotta, M. Restelli, and A. Lazaric. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. Advances in Neural Information Processing Systems, 33:1417-1427, 2020.

S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82-90. PMLR, 2021a.

S. Vakili, J. Scarlett, and T. Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Conference on Learning Theory, pages 4647-4652. PMLR, 2021b.

M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, page 654, 2013.

R. van Handel. Probability in high dimension. Technical report, Princeton University, 2014. (https://web.math.princeton.edu/~rvan/APC550.pdf).

Z. Wang, B. Shakibi, L. Jin, and N. Freitas. Bayesian multi-scale optimistic optimization. In Artificial Intelligence and Statistics, pages 1005-1014. PMLR, 2014.

A. Additional Definitions

In this section, we list the definitions of several technical terms that have been used in stating the main results of the paper.
We begin with the definition of a positive-definite kernel, which will then be used in defining an RKHS.

Definition 9 (Positive Definite Kernel). For a non-empty set $\mathcal{X}$, a symmetric function $K : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ is called a positive-definite kernel if, for any $m \in \mathbb{N}$, any $x_1, \ldots, x_m \in \mathcal{X}$ and $c_1, c_2, \ldots, c_m \in \mathbb{R}$, the following is true: $\sum_{i=1}^m \sum_{j=1}^m c_i K(x_i, x_j) c_j \ge 0$.

In this paper, we focus primarily on a family of kernels, referred to as the Matérn family, parameterized by a smoothness parameter $\nu > 0$.

Definition 10 (Matérn kernels). For $\nu > 0$ and $\theta > 0$, the Matérn kernel $K_\nu : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined as
$$K_\nu(x, z) = \frac{1}{2^{\nu-1}\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - z\|}{\theta}\right)^{\nu} J_\nu\!\left(\frac{\sqrt{2\nu}\,\|x - z\|}{\theta}\right),$$
where $J_\nu$ denotes the modified Bessel function of the second kind of order $\nu$.

The RKHS associated with Matérn kernels consists of functions with a finite degree of smoothness (Kanagawa et al., 2018), as opposed to the infinitely differentiable functions lying in the RKHS associated with the SE kernels. Due to this property, the Matérn kernels are commonly used in practical problems (Stein, 2012, §1.7), as they provide a reasonable trade-off between analytical tractability and representation power. We now present a formal definition of the RKHS associated with a positive definite kernel $K$.

Definition 11 (RKHS). For a nonempty set $\mathcal{X}$ and a positive-definite kernel $K$, the RKHS associated with $K$, denoted by $H_K$, is defined as the Hilbert space of functions on $\mathcal{X}$ with an inner product $\langle \cdot, \cdot \rangle$ satisfying the following: (i) for all $x \in \mathcal{X}$, the function $K(\cdot, x) \in H_K$, and (ii) for all $x \in \mathcal{X}$ and $g \in H_K$, we have $g(x) = \langle g, K(\cdot, x) \rangle$.

The equality $g(x) = \langle g, K(\cdot, x) \rangle$ is referred to as the reproducing property, which lends its name to the RKHS. We next introduce the definition of Gaussian Processes (GPs), which are often used as a surrogate model for estimating functions lying in an RKHS.

Definition 12 (Gaussian Processes). For a positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we use $GP(0, K)$ to represent a stochastic process indexed by $\mathcal{X}$, denoted by $\{Z_x : x \in \mathcal{X}\}$, such that for any $m \in \mathbb{N}$ and $x_1, \ldots, x_m \in \mathcal{X}$, the random vector $[Z_{x_1}, \ldots, Z_{x_m}] \sim N(0, \Sigma_m)$ with $\Sigma_m = [K(x_i, x_j)]_{1 \le i,j \le m}$.

Next, we present the formal definition of the probability measure that the noisy zeroth-order-oracle and the adaptive sampling scheme induce on the space $([0,1]^d \times \mathbb{R})^n$. This term is used in Lemma 1.

Definition 13 (Induced Probability Measure). An adaptive sampling strategy $\mathcal{A}$ and a function $f$ induce a probability measure $P_{f,\mathcal{A}}$ (henceforth abbreviated as $P_f$) on the measurable space $(\Omega, \mathcal{F})$, with $\Omega = (\mathcal{X} \times \mathcal{Y})^n$ and $\mathcal{F}$ representing the Borel $\sigma$-algebra on $\Omega$. Recall that we have $\mathcal{X} = [0,1]^d$ and $\mathcal{Y} = \mathbb{R}$. This measure assigns probabilities to events $E = \prod_{t=1}^n (E_{t,\mathcal{X}} \times E_{t,\mathcal{Y}})$, where $E_{t,\mathcal{X}} \in \mathcal{B}_{\mathcal{X}}$ and $E_{t,\mathcal{Y}} \in \mathcal{B}_{\mathcal{Y}}$:
$$P_f\big((X^n, Y^n) \in E\big) = \prod_{t=1}^{n} P_f\big(Y_t \in E_{t,\mathcal{Y}} \,\big|\, X^t, Y^{t-1}\big)\, P_{\mathcal{A}}\big(X_t \in E_{t,\mathcal{X}} \,\big|\, X^{t-1}, Y^{t-1}\big).$$
Here $P_f$ represents the noisy zeroth-order-oracle and $P_{\mathcal{A}}$ is determined by the sampling strategy.

We end this section with the definitions of sub-Gaussian random variables (used in stating Assumption 2b) and packing numbers (used in defining the complexity terms $C_f$ and $\bar{C}_f$).

Definition 14 (Sub-Gaussianity). A random variable $X$ is said to be sub-Gaussian with parameter $\sigma^2$ if it satisfies the following for all $t \in \mathbb{R}$: $E[e^{tX}] \le \exp\big(\sigma^2 t^2/2\big)$.

Definition 15 (Packing number). Given a subset $S$ of a metric space $(\mathcal{X}, \ell)$, the $w$-packing number of $S$ is the cardinality of the largest $E \subset S$ such that any two $z, z' \in E$ satisfy $\ell(z, z') \ge w$.
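To make Definition 10 concrete, a minimal numerical sketch of the Matérn kernel is shown below, using SciPy's modified Bessel function of the second kind. It assumes the standard $\sqrt{2\nu}\,r/\theta$ argument written above, and the default lengthscale $\theta = 1$ is an arbitrary choice.

```python
import numpy as np
from scipy.special import gamma, kv   # kv: modified Bessel fn of the 2nd kind

def matern(x, z, nu=1.5, theta=1.0):
    """Matérn kernel K_nu(x, z) from Definition 10."""
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(z))
    if r == 0.0:
        return 1.0                     # K_nu(x, x) = 1 in the limit r -> 0
    s = np.sqrt(2.0 * nu) * r / theta
    return (2.0 ** (1.0 - nu) / gamma(nu)) * (s ** nu) * kv(nu, s)

# Sanity check: nu = 1/2 recovers the exponential kernel exp(-r/theta)
print(matern(0.0, 0.5, nu=0.5), np.exp(-0.5))
```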
B. Proof of Theorem 1

We present a detailed outline of the proof of Theorem 1 in Appendix B.1, and describe the formal steps of the proof in Appendix B.2 and Appendix B.3. In particular, Appendix B.2 contains the statement and proof of an intermediate result that gives us a single term in the lower bound of Theorem 1, and Appendix B.3 contains the steps required to obtain the statement of Theorem 1 from the intermediate result.

B.1. Overview of the argument

We begin by presenting an informal description of the key ideas involved in obtaining the main lower bound, formally stated as Theorem 1 in Section 3.1. Suppose $f$ is a function lying in $H_{K_\nu}(M)$ and let $\mathcal{A}$ be an $a_0$-consistent algorithm for this family of functions. Suppose $E_1, E_2, \ldots, E_m$ are $m$ disjoint subsets of the input space $\mathcal{X}$ for some $m \ge 1$, with the property that
$$f(x) \le f(x^*) - \Delta_i, \quad \text{for all } x \in E_i, \text{ for all } i \in [m]. \qquad (4)$$
Now, if $N_i$ denotes the (random) number of times the algorithm $\mathcal{A}$ queries points in the region $E_i$ in $n$ rounds, then we immediately have the following regret lower bound:
$$E_f[R_n(\mathcal{A}, f)] \ \ge\ \sum_{i=1}^{m} \Delta_i\, E_f[N_i]. \qquad (5)$$
The expression in (5) suggests that one way of lower bounding the regret incurred by $\mathcal{A}$ on the function $f$ is to lower bound the expected number of samples it allocates to these suboptimal regions, $E_f[N_i]$ for $1 \le i \le m$.

We approach this task by considering functions that are slightly perturbed versions of $f$, denoted by $\tilde{f}_i$, that also lie in the same RKHS. The perturbed function $\tilde{f}_i$ shall differ from $f$ only in the region $E_i$, but this difference should be substantial enough to ensure that the maximizer of $\tilde{f}_i$ lies in $E_i$. This fact makes $\tilde{f}_i$ operationally distinct from $f$, for which the region $E_i$ is at least $\Delta_i$-suboptimal, as assumed in (4). Now, since the algorithm $\mathcal{A}$ is $a_0$-consistent for the given function class, it must achieve $o(n^a)$ regret (for any $a > a_0$) for all such functions (and in particular, for $f$ and all of the $\tilde{f}_i$). These two facts will enable us to bound the number of samples that the algorithm $\mathcal{A}$ must spend (on average) in the suboptimal region $E_i$ when $f$ is the true function.

Figure 2. The two figures show a function $f$ (left) and its perturbed version $\tilde{f}$ (right). In the left figure, the optimum of $f$ is denoted by the green star. The red shaded region corresponds to the set $E$ of Definition 16, which is disjoint from a neighborhood (green shaded region) of the optimum of $f$. The gray shaded regions represent the set $\tilde{\mathcal{X}}(f, \Delta, c) = \{x : \Delta \le f(x^*) - f(x) \le c\Delta\}$ for $\Delta = 0.2$ and $c = 2.5$. The figure on the right, $\tilde{f}$, is then obtained by adding a bump function supported on $E$ to $f$. By construction, in the left figure, the red-shaded region is at least $\Delta$-suboptimal, while in the right figure the green-shaded region is at least $\Delta$-suboptimal. The two functions $f$ and $\tilde{f}$ induce statistically similar distributions, as they differ only in a small region (shown in red). Hence, if $f$ is the true objective function, any good algorithm $\mathcal{A}$ must spend some queries in the red region to discard the possibility that the objective function is $\tilde{f}$.

The rest of this section is organized as follows:

- In Appendix B.1.1, we describe the details of the argument deriving a lower bound on the expected number of samples $\mathcal{A}$ spends in a suboptimal region, for one perturbed function.
- In Appendix B.1.2, we present the details involved in constructing an appropriate collection of perturbed functions. In particular, this involves carefully balancing several trade-offs in the choice of parameters such as $\Delta_i$ and $E_i$, so that the resulting lower bound is tightest.
B.1.1. A PERTURBATION ARGUMENT

Suppose $f \in H_{K_\nu}(M)$ and $\mathcal{A}$ is an $a_0$-consistent algorithm. Let $\tilde{f}$ be another function in $H_{K_\nu}(M)$ that is an $(E, c, \Delta)$-perturbation of $f$, as defined below.

Definition 16 ($(E, c, \Delta)$-perturbation). We say that a function $\tilde{f} \in H_{K_\nu}(M)$ is an $(E, c, \Delta)$-perturbation of another function $f \in H_{K_\nu}(M)$ if it satisfies the following properties:

(P1) $\tilde{f}$ differs from $f$ only in a region $E$ of the input space, i.e., $\tilde{f}(x) = f(x)$ for all $x \in \mathcal{X} \setminus E$.

(P2) The function $f$ achieves its maximum value (denoted by $f^*$) at a point $x^* \in \mathcal{X} \setminus E$. On the other hand, $\tilde{f}$ achieves its maximum value $\tilde{f}^*$ at a point $\tilde{x}^*$ lying in the region $E$.

(P3) There exist constants $c > 1$ and $\Delta > 0$ such that the following conditions are satisfied:
$$|\tilde{f}(x) - f(x)| \le c\Delta, \quad \text{for all } x \in E, \qquad (6)$$
$$f^* - f(x) \ge \Delta, \quad \text{for all } x \in E,$$
$$\tilde{f}^* - \tilde{f}(x) \ge \Delta, \quad \text{for all } x \in \mathcal{X} \setminus E.$$

Remark 7. In this section, we do not address the issue of the existence of a function $\tilde{f}$ satisfying all the properties, or the possible values of $c$, $\Delta$ and choices of the region $E$. We present the argument under the assumption that such an $\tilde{f}$ exists for some fixed $c$, $\Delta$ and $E$. The trade-offs involved in constructing such an $\tilde{f}$ will be discussed in Appendix B.1.2.

Similar to the more general case of (5), the definition above motivates a simple decomposition of the regret in terms of the number of queries made by $\mathcal{A}$ in the region $E$ in which $f$ and $\tilde{f}$ differ. In particular, if $N$ denotes the (random) number of times $\mathcal{A}$ queries points in $E$, then we immediately have the following:
$$E_f[R_n(\mathcal{A}, f)] \ \ge\ \Delta\, E_f[N], \quad \text{and} \quad E_{\tilde{f}}\big[R_n(\mathcal{A}, \tilde{f})\big] \ \ge\ \Delta\, E_{\tilde{f}}[n - N].$$

To obtain a lower bound on $E_f[N]$, we will use the following two key properties of the pair $(f, \tilde{f})$, as encoded by the formal statements in Definition 16.

- From a statistical point of view, the two problem instances are close. In particular, $f$ and $\tilde{f}$ only differ over the region $E$, and furthermore their deviation is upper bounded by $c\Delta$. This allows us to upper bound the KL-divergence between their induced probability distributions $P_f$ and $P_{\tilde{f}}$ (see Definition 13) in terms of $E_f[N]$.
- In an operational sense, the two problem instances $f$ and $\tilde{f}$ are sufficiently distinct. This is a consequence of properties (P2) and (P3) in Definition 16, which say that the optimizer of $f$ (resp. $\tilde{f}$) lies in the region $\mathcal{X} \setminus E$ (resp. $E$), which is known to be at least $\Delta$-suboptimal for $\tilde{f}$ (resp. $f$). This, along with the $a_0$-consistency of $\mathcal{A}$, will be used to lower-bound the KL-divergence between $P_f$ and $P_{\tilde{f}}$ by a constant.

Combining the two inequalities will give us the required lower bound on $E_f[N]$, and consequently on $E_f[R_n(\mathcal{A}, f)]$. We now describe the steps.

Step 1: Upper bound on $D_{KL}(P_f, P_{\tilde{f}})$. Recall that the pairs $(f, \mathcal{A})$ and $(\tilde{f}, \mathcal{A})$ both induce a probability measure on the $n$-fold product of the input-observation space $\Omega := (\mathcal{X} \times \mathbb{R})^n$. Denote the two probability measures by $P_f$ and $P_{\tilde{f}}$. Then, assuming that the observation noise is i.i.d. $N(0, \sigma^2)$, it can be shown that
$$D_{KL}\big(P_f, P_{\tilde{f}}\big) \ \le\ \frac{1}{2\sigma^2}\, \sup_{x \in E}\big(f(x) - \tilde{f}(x)\big)^2\, E_f[N] \ \le\ \frac{c^2\Delta^2}{2\sigma^2}\, E_f[N]. \qquad (7)$$
To obtain (7), we use the closed-form expression for the KL-divergence between two univariate Gaussian random variables with the same variance, and the fact that the maximum deviation between $f$ and $\tilde{f}$ is upper bounded by $c\Delta$. The details are presented while proving Lemma 1 in Appendix B.2.1.
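For completeness, here is the one-line computation behind (7), using the chain rule for KL-divergence and the standard formula for equal-variance Gaussians; this is a sketch of the step, not the full proof of Lemma 1. Since
$$D_{KL}\big(N(\mu_1, \sigma^2),\, N(\mu_2, \sigma^2)\big) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2},$$
the divergence decomposes over the $n$ rounds, and only the queries landing in $E$ (where $f \ne \tilde{f}$) contribute:
$$D_{KL}\big(P_f, P_{\tilde{f}}\big) = E_f\left[\sum_{t=1}^{n} \frac{\big(f(x_t) - \tilde{f}(x_t)\big)^2}{2\sigma^2}\right] \ \le\ \frac{\sup_{x \in E}\big(f(x) - \tilde{f}(x)\big)^2}{2\sigma^2}\, E_f[N].$$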
The statement of (7) shows that the KL divergence between $P_f$ and $P_{\tilde f}$ is controlled by two terms: (i) the maximum deviation between $f$ and $\tilde f$, a quantity that is bounded by $c\Delta$ by assumption, and (ii) the expected number of queries made by $\mathcal{A}$ in the region $E$ under the function $f$. The latter quantity cannot be too large either, as the algorithm $\mathcal{A}$ is assumed to be $a_0$-consistent, and the region $E$ is at least $\Delta$-suboptimal for $f$, as stated in (6).

Step 2: Lower bound on $D_{KL}(P_f, P_{\tilde f})$. Since the algorithm $\mathcal{A}$ is assumed to be $a_0$-consistent, we immediately have the following two statements for any $a > a_0$:

$$\Delta\, \mathbb{E}_f[N] \leq \mathbb{E}_f[R_n(\mathcal{A}, f)] = o(n^a), \quad \text{and} \quad \Delta\, \mathbb{E}_{\tilde f}[n - N] \leq \mathbb{E}_{\tilde f}[R_n(\mathcal{A}, \tilde f)] = o(n^a).$$

Together, these two conditions imply that $\mathbb{E}_f[N]$ and $\mathbb{E}_{\tilde f}[n - N]$ cannot be too large. To see this, define a $[0,1]$-valued random variable $Z = N/n$, and let $p = \mathbb{E}_f[Z]$ and $q = \mathbb{E}_{\tilde f}[Z]$. Note that $Z$ denotes the fraction of samples spent in the region $E$ by the algorithm $\mathcal{A}$. Then, for large enough values of $n$ and fixed $\Delta$, we expect that $p \approx 0$ and $q \approx 1$. This in turn implies that the KL divergence between two Bernoulli random variables with means $p$ and $q$ is bounded away from zero. More specifically, we can show that there exists a constant $C > 0$ such that

$$0 < C \leq d_{KL}(p, q), \qquad (8)$$

where $d_{KL}(p, q) = p\log(p/q) + (1-p)\log\big((1-p)/(1-q)\big)$ denotes the KL divergence between two Bernoulli random variables with means $p$ and $q$. The next step in obtaining a lower bound on $\mathbb{E}_f[N]$ is the observation that

$$d_{KL}(p, q) \leq D_{KL}(P_f, P_{\tilde f}). \qquad (9)$$

This is a consequence of the data-processing inequality, as shown by Garivier et al. (2019).

Step 3: Lower bound on $\mathbb{E}_f[N]$. Finally, we can use (9) to link the statements of (7) and (8), and obtain the inequality

$$C \leq \frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f[N], \quad \text{which implies} \quad \mathbb{E}_f[N] \geq \frac{2C\sigma^2}{c^2\Delta^2}.$$

Multiplying this term by $\Delta$ gives us one term in the regret decomposition (5).

B.1.2. CONSTRUCTING AN $(E, c, \Delta)$-PERTURBED FUNCTION

Next, we introduce the bump function used in (Cai and Scarlett, 2021, Lemma 4), which we will use to construct the local perturbations (satisfying the conditions of Definition 16) in our lower bound proof.

Definition 17 (bump function $g$). Define the function

$$g(x) = \exp\left(1 - \frac{1}{1 - \|x\|^2}\right)\, \mathbb{1}\{\|x\| < 1\},$$

which satisfies the following properties: (i) $g$ is supported on the ball $B(0,1)$; (ii) $\sup_{x \in B(0,1)} g(x) = g(0) = 1$; (iii) $\|g\|_{\mathcal{H}_{K_\nu}} =: M_\nu < \infty$ for some constant $M_\nu$ depending on $\nu$; and (iv) if $\tilde g(\cdot) = g(\cdot/w)$ for some $w > 0$, then $\|\tilde g\|_{\mathcal{H}_{K_\nu}} \leq (1/w)^\nu M_\nu$.

We now discuss the details of constructing a function $\tilde f$ satisfying the conditions of Definition 16. Here is the summary, for a fixed $a \in (a_0, 1)$:

- The term $\Delta$ should not be smaller than $n^{-(1-a)}$.
- An appropriate choice of the set $E$ is a ball $B(z, w)$ for a point $z \in \mathcal{X} \setminus \mathcal{X}^*$ and radius $w > 0$. Recall that $\mathcal{X}^* = \{x \in \mathcal{X} : f(x) = f^* := \max_{x' \in \mathcal{X}} f(x')\}$.
- We define the perturbed function $\tilde f = f + \tilde g$, where $\tilde g(\cdot) = (c+1)\Delta\, g\big((\cdot - z)/w\big)$ for some $w > 0$. Here $g$ is the bump function introduced in Definition 17. The specific constraints on $w$ are discussed below.

We now discuss these choices in more detail.

Choice of $\Delta$. The term $\Delta$ parametrizes the amount of perturbation between $f$ and $\tilde f$; it should be large enough to ensure that $f$ and $\tilde f$ are distinguishable from the point of view of the algorithm $\mathcal{A}$. In particular, fix any $a > a_0$. Then, for this value of $a$, $\Delta$ must be larger than $n^{-(1-a)}$. This is because, if $\Delta < n^{-(1-a)}$, then the algorithm may spend all $n$ of its samples in the region $E$ under both $f$ and $\tilde f$ without violating the $o(n^a)$ requirement on the regret.

Choice of $z$ and $w$. The terms $z$ and $w$ must be such that

$$B(z, w) \subset \tilde{\mathcal{X}}(f, \Delta, c) := \{x \in \mathcal{X} : \Delta \leq f^* - f(x) \leq c\Delta\}.$$

Thus, $z$ and $w$ must be selected to ensure that the ball of radius $w$ around $z$ is fully contained in the annular region $\tilde{\mathcal{X}}(f, \Delta, c)$, in which the suboptimality of $f$ (i.e., $f^* - f(x)$) is between $\Delta$ and $c\Delta$. Furthermore, the radius $w$ must be large enough to satisfy the condition stated in (10) below.
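A direct implementation of the bump of Definition 17 is given below; the support and normalization properties (i) and (ii) can be checked numerically, while the RKHS-norm scaling in (iv) is an analytic fact that is only recorded here as a comment, since it cannot be verified from function samples alone.

```python
import numpy as np

def bump(x):
    """Bump of Definition 17: supported on B(0, 1), maximized at 0 with g(0) = 1.

    Property (iv): the rescaled bump g(./w) has Matern-RKHS norm at most
    (1/w)^nu * M_nu, so shrinking the support inflates the norm polynomially.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    r2 = np.sum(x * x, axis=-1) if x.ndim > 1 else x * x
    out = np.zeros_like(r2)
    inside = r2 < 1.0
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - r2[inside]))
    return out

# Numeric checks of properties (i) and (ii) on a grid.
u = np.linspace(-2.0, 2.0, 4001)
vals = bump(u)
assert vals.max() == bump(0.0)[0] == 1.0        # sup attained at 0, equals 1
assert np.all(vals[np.abs(u) >= 1.0] == 0.0)    # supported on B(0, 1)
print("bump checks passed; g(0) =", bump(0.0)[0])
```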
Defining $\tilde f$. Having chosen the region $E$, we construct the perturbed version of $f$ by adding a shifted and scaled bump function to it. In particular, we add $\tilde g = (c+1)\Delta\, g\big((\cdot - z)/w\big)$ to $f$. Note that, by assumption, the RKHS norm of $f$ satisfies $\|f\|_{\mathcal{H}_{K_\nu}} < M$. Since we require the perturbed function to also lie in the class $\mathcal{H}_{K_\nu}(M)$, a sufficient condition for this is

$$\|\tilde g\|_{\mathcal{H}_{K_\nu}} \leq \frac{(c+1)\Delta}{w^\nu}\, \|g\|_{\mathcal{H}_{K_\nu}} = \frac{(c+1)\Delta}{w^\nu}\, M_\nu \leq M - \|f\|_{\mathcal{H}_{K_\nu}} \iff w \geq \left(\frac{(c+1)\Delta\, M_\nu}{M - \|f\|_{\mathcal{H}_{K_\nu}}}\right)^{1/\nu}. \qquad (10)$$

Remark 8. Note that the expression for $w$ in (10) implicitly assumes that, for this value of $c$ and $\Delta$, the region $\tilde{\mathcal{X}}(f, \Delta, c)$ is large enough to contain a ball of this (or larger) radius. Our result holds under this assumption. In case this condition is violated, the result reduces to the trivial lower bound $\mathbb{E}_f[R_n(\mathcal{A}, f)] \geq 0$.

B.2. An intermediate one-step result

Proposition 3. Consider the kernelized bandit problem with a budget $n$ and objective function $f \in \mathcal{H}_{K_\nu}(M)$ with $\|f\|_{\mathcal{H}_{K_\nu}} = (1-\lambda)M$ for some $\lambda \in (0,1)$. Let $\mathcal{A}$ denote an $a_0$-consistent algorithm for the class $\mathcal{H}_{K_\nu}(M)$, and fix an $a > a_0$. For constants $\Delta \geq 16\, n^{-(1-a)}$ and $c > 1$, introduce the set $Z = \{x \in \mathcal{X} : \Delta \leq f(x^*) - f(x) < c\Delta\}$, and with $w = \big((c+1)\Delta M_\nu/(M\lambda)\big)^{1/\nu}$, use $m(Z, w)$ to denote the $2w$-packing number of $Z$. Then the following is true for $n$ large enough (exact condition in equation (13) below):

$$\mathbb{E}[R_n(\mathcal{A}, f)] \geq \frac{7\log 2}{4}\, \frac{m(Z, w)\, \sigma^2}{c^2\Delta}. \qquad (11)$$

Proof. Let $\{z_i : 1 \leq i \leq m(Z,w)\}$ denote the points that form a maximal $2w$-packing of $Z$, with cardinality $m = m(Z,w)$. By definition of $Z$, the region $B(z_i, w)$ is at least $\Delta$-suboptimal for $f$. Building upon this fact, the proof of the result follows in three steps:

- First, we show that we can construct $m$ perturbed functions, denoted by $\{f_i : 1 \leq i \leq m(Z,w)\}$, as introduced in Definition 16. The function $f_i$ differs from $f$ only in the region $B(z_i, w)$, and in fact achieves its maximum value in that region.
- Next, for each such perturbed function, we obtain a lower bound on the number of samples that the algorithm $\mathcal{A}$ must spend in $B(z_i, w)$.
- Finally, the result follows from the regret decomposition (5), again using the fact that the points in $B(z_i, w)$ are $\Delta$-suboptimal for $f$.

We now present the details of the steps outlined above. For every $i \in \{1, \ldots, m\}$, define the function $f_i = f + g_i$, where $g_i = (c+1)\Delta\, g\big((\cdot - z_i)/w\big)$ is the scaled and shifted version of the bump function introduced in Definition 17. Now, the choice of the radius (or scale parameter) $w$ in the definition of $f_i$ according to (10) implies that $\|g_i\|_{\mathcal{H}_{K_\nu}} \leq (c+1)\Delta M_\nu/w^\nu = \lambda M$. This implies the following:

- The function $f_i$ satisfies $\|f_i\|_{\mathcal{H}_{K_\nu}} \leq \|f\|_{\mathcal{H}_{K_\nu}} + \|g_i\|_{\mathcal{H}_{K_\nu}} \leq M$. Thus, the function $f_i$ lies in the class $\mathcal{H}_{K_\nu}(M)$, and hence $\mathcal{A}$ achieves regret $o(n^a)$ for any $a > a_0$ on $f_i$, for all $1 \leq i \leq m$.
- The functions $f$ and $f_i$ have well-separated optimal regions. More formally, if $x^*$ and $x_i^*$ denote the maximizers of $f$ and $f_i$ respectively, then $f(x^*) - f(x) \geq \Delta$ for all $x \in B(z_i, w)$, and $f_i(x_i^*) - f_i(x) \geq \Delta$ for all $x \notin B(z_i, w)$.
- The functions $f$ and $f_i$ differ from each other only in the region $B(z_i, w)$, and furthermore they satisfy the uniform deviation bound $\sup_{x \in \mathcal{X}} |f(x) - f_i(x)| \leq c\Delta$.

To summarize the above three points, the function $f_i$ is a $(B(z_i, w), c, \Delta)$-perturbation of $f$.
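The centers $z_i$ can be obtained by a simple greedy maximal-packing procedure: scan candidate points of $Z$ and keep any point at distance at least $2w$ from all previously kept centers. The sketch below does this on a grid for a toy one-dimensional instance; the grid resolution, the instance, and the parameter values are illustrative.

```python
import numpy as np

def greedy_packing_centers(points, radius):
    # Greedy maximal packing: keep a point if it is >= 2*radius from all kept ones.
    centers = []
    for p in points:
        if all(np.linalg.norm(p - q) >= 2.0 * radius for q in centers):
            centers.append(p)
    return np.array(centers)

# Toy 1-D instance and the annulus Z = {x : Delta <= f(x*) - f(x) < c * Delta}.
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)
Delta, c, w = 0.2, 2.5, 0.02      # w plays the role of the radius in (10)

grid = np.linspace(0.0, 1.0, 5001).reshape(-1, 1)
gaps = f(grid[:, 0]).max() - f(grid[:, 0])
in_Z = (gaps >= Delta) & (gaps < c * Delta)

centers = greedy_packing_centers(grid[in_Z], w)   # candidate z_i's
print(f"packing estimate m(Z, w) = {len(centers)}")
print("centers z_i:", centers.ravel().round(3))
# Each f_i = f + (c + 1) * Delta * g((. - z_i) / w) then perturbs f on one of
# the disjoint balls B(z_i, w), as in the proof of Proposition 3.
```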
Next, we show that any $a_0$-consistent algorithm must allocate at least a certain number of queries to the region $B(z_i, w)$ when the true underlying function is $f$, in order to gather enough evidence to reject $f_i$.

Lemma 1. Let $N_i(\mathcal{A}, n)$ denote the number of times the algorithm $\mathcal{A}$ queries the oracle at points in the region $B(z_i, w)$. Then we have the following bound:

$$\mathbb{E}_f[N_i(\mathcal{A}, n)] \geq \frac{2\sigma^2}{c^2\Delta^2} \left[ (1 - p_{n,i}) \log\left(\frac{1}{1 - q_{n,i}}\right) - \log 2 \right], \qquad (12)$$

where $p_{n,i} := \mathbb{E}_f[N_i(\mathcal{A}, n)]/n$ and $q_{n,i} := \mathbb{E}_{f_i}[N_i(\mathcal{A}, n)]/n$. In the above display, $\mathbb{E}_f$ denotes the expectation w.r.t. the probability measure induced by the pair $(f, \mathcal{A})$, and similarly $\mathbb{E}_{f_i}$ denotes the expectation w.r.t. the probability measure induced by the pair $(f_i, \mathcal{A})$, for $1 \leq i \leq m$.

The proof of (12) follows by relating the regret incurred by $\mathcal{A}$ on $f$ and $f_i$ respectively to a pair of multi-armed bandit problems with $(m+1)$ arms, and then applying the fundamental information inequality (Garivier et al., 2019, Section 2). The details of this proof are deferred to Appendix B.2.1.

Next, we simplify the expression obtained in Lemma 1 by appealing to the $a_0$-consistency of the algorithm $\mathcal{A}$. In the process, we also clarify the meaning of the assumption that $n$ is large enough in the statement of Proposition 3. In particular, we require $n$ to be large enough to ensure that the following hold simultaneously:

$$\mathbb{E}_f[R_n(\mathcal{A}, f)] \leq 2n^a, \quad \text{and} \quad \mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)] \leq 2n^a \text{ for all } i \in [m]. \qquad (13)$$

More specifically, using the notation $f_0 = f$, we note that due to the $a_0$-consistency of $\mathcal{A}$, and the fact that all the functions $\{f_i : 0 \leq i \leq m\}$ lie in $\mathcal{H}_{K_\nu}(M)$, we must have $\lim_{n \to \infty} \mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)]/n^a = 0$. Hence, there must exist a finite $n_0$ such that for all $n \geq n_0$ we have $\mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)] \leq 2n^a$ for all $i \in \{0, 1, \ldots, m\}$.

Next, we observe that

$$1 - p_{n,i} = 1 - \frac{\mathbb{E}_f[N_i(\mathcal{A}, n)]}{n} \overset{(14)}{\geq} 1 - \frac{\mathbb{E}_f[R_n(\mathcal{A}, f)]}{16\, n^{1-(1-a)}} \overset{(15)}{\geq} 1 - \frac{2n^a}{16\, n^a} = \frac{7}{8}.$$

In the above display, (14) uses the fact that $\Delta\, N_i(\mathcal{A}, n) \leq R_n(\mathcal{A}, f)$ and that $\Delta \geq 16\, n^{-(1-a)}$, and (15) uses the assumption made in (13) that $n$ is large enough to ensure that $\mathbb{E}_f[R_n(\mathcal{A}, f)] \leq 2n^a$. Similarly, for the $q_{n,i}$-dependent term, we have

$$\frac{1}{1 - q_{n,i}} = \frac{n}{n - \mathbb{E}_{f_i}[N_i(\mathcal{A}, n)]} \overset{(16)}{\geq} \frac{n\Delta}{\mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)]} \overset{(17)}{\geq} \frac{16\, n^a}{2\, n^a} = 8.$$

In the above display, (16) uses the fact that the points outside $B(z_i, w)$ are at least $\Delta$-suboptimal for $f_i$, so that $\Delta\,(n - N_i(\mathcal{A}, n)) \leq R_n(\mathcal{A}, f_i)$, and (17) uses the assumption that $n$ is large enough to ensure that $\mathbb{E}_{f_i}[R_n(\mathcal{A}, f_i)] \leq 2n^a$. Putting (15) and (17) back into (12), we get

$$\mathbb{E}_f[N_i(\mathcal{A}, n)] \geq \frac{2\sigma^2}{c^2\Delta^2}\left(\frac{7}{8}\log 8 - \log 2\right) = \frac{13\log 2}{4}\, \frac{\sigma^2}{c^2\Delta^2} \geq \frac{7\log 2}{4}\, \frac{\sigma^2}{c^2\Delta^2}.$$

Finally, the result stated in (11) follows by repeating the argument of Lemma 1 for all values of $i \in \{1, \ldots, m\}$, and noting that $\mathbb{E}_f[R_n(\mathcal{A}, f)] \geq \Delta \sum_{i=1}^m \mathbb{E}_f[N_i(\mathcal{A}, n)]$ from the decomposition inequality (5). $\square$

B.2.1. PROOF OF LEMMA 1

To prove this result, we need to introduce some additional notation. We use $H_t$ to denote the history of query points and observations up to, and including, time $t$, for $t \in \{1, 2, \ldots, n\}$. For a given $n \geq 1$, introduce the sample space $\Omega = (\mathcal{X} \times \mathcal{Y})^n$, and let $\mathcal{F}_0$ denote a sigma-algebra of subsets of $\Omega$. For a given function $f$ and a querying strategy $\mathcal{A}$, we use $P_{f,\mathcal{A}}$ to denote the probability measure on $\Omega$ induced by the pair $(f, \mathcal{A})$. We will drop the $\mathcal{A}$ dependence of $P_{f,\mathcal{A}}$ and simply write $P^{(f)}$ in the sequel.

The first step is to obtain an upper bound on the KL divergence between the measures $P^{(f)}$ and $P^{(f_i)}$ induced on the space $\Omega$ for a common algorithm $\mathcal{A}$. In particular, let $(X_1, Y_1, \ldots, X_n, Y_n)$ denote the query-observation pairs collected by the algorithm $\mathcal{A}$ up to time $n$.
Then we have the following:

$$D_n := D_{KL}\big(P^{(f)}, P^{(f_i)}\big) = D_{n-1} + D_{KL}\big(P^{(f)}_{X_n, Y_n | H_{n-1}},\, P^{(f_i)}_{X_n, Y_n | H_{n-1}}\big) \qquad (18)$$
$$= D_{n-1} + D_{KL}\big(P^{(f)}_{X_n | H_{n-1}},\, P^{(f_i)}_{X_n | H_{n-1}}\big) + D_{KL}\big(P^{(f)}_{Y_n | H_{n-1}, X_n},\, P^{(f_i)}_{Y_n | H_{n-1}, X_n}\big) \qquad (19)$$
$$= D_{n-1} + 0 + D_{KL}\big(P^{(f)}_{Y_n | H_{n-1}, X_n},\, P^{(f_i)}_{Y_n | H_{n-1}, X_n}\big) \qquad (20)$$
$$= D_{n-1} + \mathbb{E}_f\left[\frac{\big(f(X_n) - f_i(X_n)\big)^2}{2\sigma^2}\, \mathbb{1}\{X_n \in B(z_i, w)\}\right] \qquad (21)$$
$$\leq D_{n-1} + \frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f\big[\mathbb{1}\{X_n \in B(z_i, w)\}\big]. \qquad (22)$$

In the above display, (18) and (19) follow from the chain rule for KL divergence (Polyanskiy and Wu, 2014, Theorem 2.2); (20) uses the fact that, conditioned on $H_{n-1}$, the distribution of $X_n$ is the same for both problem instances, that is, in both cases $X_n$ is selected according to the mapping $\mathcal{A}_n : (\mathcal{X} \times \mathcal{Y})^{n-1} \to \mathcal{X}$, where $\mathcal{A} = (\mathcal{A}_t)_{t=1}^n$ is the common strategy; (21) uses the fact that, conditioned on $X_n$, the observation $Y_n$ is distributed as $N(f(X_n), \sigma^2)$ and $N(f_i(X_n), \sigma^2)$ under $P^{(f)}$ and $P^{(f_i)}$ respectively; and (22) uses the fact that, by construction, $f$ and $f_i$ differ only in the region $B(z_i, w)$, where $\max_{x \in B(z_i, w)} |f(x) - f_i(x)| \leq c\Delta$.

Repeating the steps involved in obtaining (22) $n-1$ more times, we get the following upper bound on the term $D_n$:

$$D_n \leq \frac{c^2\Delta^2}{2\sigma^2} \sum_{t=1}^n \mathbb{E}_f\big[\mathbb{1}\{X_t \in B(z_i, w)\}\big] = \frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f[N_i(\mathcal{A}, n)].$$

Recall that the term $N_i(\mathcal{A}, n)$ denotes the number of times the algorithm $\mathcal{A}$ queries points from the region $B(z_i, w)$ in the first $n$ rounds. Now, let $Z : \Omega \to [0,1]$ be any measurable $[0,1]$-valued random variable. Then, by (Garivier et al., 2019, Lemma 1), we get the following result:

$$D_{KL}\big(P^{(f)}, P^{(f_i)}\big) \geq d_{KL}\big(\mathbb{E}_f[Z],\, \mathbb{E}_{f_i}[Z]\big),$$

where $d_{KL}(p, q)$ for $p, q \in [0,1]$ denotes the KL divergence between two Bernoulli random variables with means $p$ and $q$ respectively. To complete the proof, we select $Z := N_i(\mathcal{A}, n)/n$, and using the fact (Garivier et al., 2019, Eq. (11)) that $d_{KL}(p, q) \geq (1-p)\log\big(1/(1-q)\big) - \log 2$, we get the required inequality

$$\frac{c^2\Delta^2}{2\sigma^2}\, \mathbb{E}_f[N_i(\mathcal{A}, n)] \geq (1 - p_{n,i}) \log\left(\frac{1}{1 - q_{n,i}}\right) - \log 2, \quad \text{where } p_{n,i} = \mathbb{E}_f[Z] \text{ and } q_{n,i} = \mathbb{E}_{f_i}[Z]. \quad \square$$

Remark 9. The only point at which we exploit the assumption that the observation noise is distributed as $N(0, \sigma^2)$ (i.e., Assumption 2a) is in obtaining the equality (21). Due to this assumption on the noise, we get a closed-form upper bound on the per-round KL divergence in (22), namely $c^2\Delta^2/(2\sigma^2)$. In general, if we only assumed that the observation noise was $\sigma^2$-sub-Gaussian, then the same result would hold true with this closed-form upper bound replaced by the expression $\sup_{x \in B(z_i, w)} D_{KL}\big(P_{f, Y_n | X_n = x},\, P_{f_i, Y_n | X_n = x}\big)$.

B.3. Concluding Theorem 1 from Proposition 3

Theorem 1 follows by repeated application of the result in Proposition 3 to different regions of the input space. More specifically, introduce the following notation:

- As in the previous section, fix an $a > a_0$, and choose $c = 2$ and $\Delta = 16\, n^{-(1-a)}$.
- For $k \geq 0$, set $Z_k = \{x \in \mathcal{X} : 2^k\Delta \leq f(x^*) - f(x) < 2^{k+1}\Delta\}$.
- With $w_k := \big((3 \cdot 2^k \Delta\, M_\nu)/(\lambda M)\big)^{1/\nu}$, use $m_k$ to denote the $2w_k$-packing number of $Z_k$.

Since $f \in \mathcal{H}_{K_\nu}(M)$, we know that the set $Z_k$ is empty for all $k$ larger than a finite value $k_0$. More specifically, we have

$$k_0 = \left\lceil \log_2\left(\frac{\sup_{x \in \mathcal{X}} f(x^*) - f(x)}{\Delta}\right) \right\rceil \leq \left\lceil \log_2\left(\frac{2M\sqrt{K_\nu(0)}}{\Delta}\right) \right\rceil.$$

Finally, the statement of Theorem 1 follows by $k_0 + 1$ repeated applications of the intermediate statement proved in Proposition 3, using $Z = Z_k$, suboptimality level $2^k\Delta$, $c = 2$, and $w_k = \big((3 \cdot 2^k \Delta\, M_\nu)/(\lambda M)\big)^{1/\nu}$, for $k = 0, 1, \ldots, k_0$.
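Numerically, the aggregation in Appendix B.3 is a dyadic sum: each layer $Z_k$ contributes a term of order $m_k\sigma^2/(2^k\Delta)$ via Proposition 3. The sketch below assembles this sum for a toy instance, reusing a greedy packing estimate for $m_k$; the instance, the value of $\Delta$, and the ratio standing in for $M_\nu/(\lambda M)$ are all illustrative, and the absolute constants from (11) are dropped.

```python
import numpy as np

# Toy instance on [0, 1]; all constants below are illustrative only.
f = lambda x: np.exp(-8.0 * (x - 0.3) ** 2)
sigma, Delta, nu = 0.5, 0.05, 1.5
Mnu_over_lamM = 0.02          # stands in for M_nu / (lambda * M) in w_k

def packing_number(points, radius):
    # Greedy estimate of the 2*radius packing number of a 1-D point set.
    centers = []
    for x in points:
        if all(abs(x - c) >= 2.0 * radius for c in centers):
            centers.append(x)
    return len(centers)

grid = np.linspace(0.0, 1.0, 5001)
gaps = f(grid).max() - f(grid)

total, k = 0.0, 0
while True:
    # Dyadic layer Z_k = {x : 2^k Delta <= f(x*) - f(x) < 2^{k+1} Delta}.
    layer = grid[(gaps >= 2**k * Delta) & (gaps < 2**(k + 1) * Delta)]
    if layer.size == 0:       # beyond k0, every layer is empty for this f
        break
    w_k = (3 * 2**k * Delta * Mnu_over_lamM) ** (1.0 / nu)
    m_k = packing_number(layer, w_k)
    term = m_k * sigma**2 / (2**k * Delta)   # layer-k contribution, cf. (11)
    print(f"k={k}:  m_k={m_k},  term={term:.2f}")
    total += term
    k += 1
print(f"aggregated lower-bound complexity (constants dropped): {total:.2f}")
```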
C. Proof of Proposition 1

To prove this statement, we appeal to the one-step result obtained in Proposition 3. In particular, we apply Proposition 3 with the following parameters:

- We set $\Delta = \Delta_n = 16\, n^{-(1-a)}$, and $c = c_n := 2^{1/d}\, \bar{c}/\underline{c}$, where $\bar{c}$ and $\underline{c}$ are the parameters introduced in Definition 4.
- The set $Z$ of Proposition 3 now becomes $\{x \in \mathcal{X} : \Delta_n \leq f(x^*) - f(x) < c_n\Delta_n\}$.
- We set the radius of the balls to $w = w_n = \big((c_n + 1)\Delta_n M_\nu/(M\lambda)\big)^{1/\nu}$, where $\lambda = 1 - \|f\|_{\mathcal{H}_{K_\nu}}/M$, and use $m(Z, w_n)$ to denote the $2w_n$-packing number of the set $Z$.

With these parameters, Proposition 3 gives us the following lower bound on the regret:

$$\mathbb{E}[R_n(\mathcal{A}, f)] = \Omega\left(\frac{\sigma^2\, m(Z, w_n)}{\Delta_n}\right). \qquad (23)$$

To conclude the statement of Proposition 1, we will show that $m = m(Z, w_n) = \Omega\big(\Delta_n^{d(1/b - 1/\nu)}\big)$. First, we introduce the following terms:

$$r_0 := \left(\frac{\Delta_n}{\underline{c}}\right)^{1/b}, \qquad r_1 := \left(\frac{c_n\Delta_n}{\bar{c}}\right)^{1/b}, \qquad \text{where } c_n = 2^{1/d}\, \bar{c}/\underline{c} \text{ as before}. \qquad (24)$$

Next, we use Definition 4 to obtain the following result about $Z$.

Lemma 2. With $r_0$ and $r_1$ introduced in (24), we have $Z \supseteq B(x^*, r_0, r_1) := \{x \in \mathcal{X} : r_0 \leq \|x - x^*\| < r_1\}$.

Proof. Suppose $x \in B(x^*, r_0, r_1)$. Then we have the following:

$$\|x - x^*\| \geq r_0 = (\Delta_n/\underline{c})^{1/b} \implies f(x^*) - f(x) \geq \underline{c}\, r_0^b = \Delta_n. \qquad (25)$$

Similarly, we also have:

$$\|x - x^*\| < r_1 = (c_n\Delta_n/\bar{c})^{1/b} \implies f(x^*) - f(x) \leq \bar{c}\, \|x - x^*\|^b < \bar{c}\, r_1^b = c_n\Delta_n. \qquad (26)$$

Together, (25) and (26) imply that if $x \in B(x^*, r_0, r_1)$, then $x \in Z$. $\square$

The above statement implies that the $2w_n$-packing number of $Z$ can be lower bounded by the $2w_n$-packing number of the smaller set $B(x^*, r_0, r_1)$. Let us denote the $2w_n$-packing number of $B(x^*, r_0, r_1)$ by $\tilde m$. Then, using the fact that the $2w_n$-packing number is lower bounded by the $2w_n$-covering number, and employing standard volume arguments (van Handel, 2014, Lemma 5.13), we conclude that there exists a constant $0 < C_1 < \infty$ such that

$$m(Z, w_n) \geq \tilde m = m\big(B(x^*, r_0, r_1),\, w_n\big) \geq C_1 \left(\frac{\Delta_n^{1/b}}{\Delta_n^{1/\nu}}\right)^d.$$

The $\Delta_n^{1/b}$ and $\Delta_n^{1/\nu}$ terms in the above display arise from the definitions of $r_1, r_0$, and of $w_n$ (defined at the beginning of this proof) respectively. Plugging the lower bound on $m(Z, w_n)$ back into (23) gives us the required result.

D. Proof of Theorem 2

We first recall a simple embedding result from Shekhar and Javidi (2020), which is crucial to the adaptive partitioning approach used in our algorithm.

Fact 1. Suppose $f \in \mathcal{H}_{K_\nu}(M)$ with $0 < M < \infty$. Then we have $|f(x) - f(z)| \leq M C_K \|x - z\|^\xi$ for $\xi := \min\{1, \nu\}$ and some constant $C_K$ depending on $K$, for all $x, z \in \mathcal{X}$.

Instead of obtaining explicit bounds on the term $C_K$ in the statement above, we assume that $n$ is large enough to ensure that $C_K \leq \log n$, to simplify the presentation. We now present the following independence result about the points queried by the algorithm.

Proposition 4. Suppose $P_t \subset \mathcal{X}_h$ is the active set of points for Algorithm 1 at some time $t$. Let $t_0 < t$ denote the time at which a point from $P_t$ was first queried, and let $E_t$ denote the multiset of points $\{x_{t_0}, \ldots, x_{t-1}\}$ queried by the algorithm. Then the collection of random variables $(y_{t_0}, \ldots, y_{t-1})$ is mutually independent, conditioned on the query points $x_{t_0}, \ldots, x_{t-1}$.

Proof. Using $P$ to represent the joint distribution of the random variables, we proceed as follows:

$$P(y_{t_0}, \ldots, y_{t-1} \mid x_{t_0}, \ldots, x_{t-1}) = \frac{P(y_{t_0}, \ldots, y_{t-1}, x_{t_0}, \ldots, x_{t-1})}{P(x_{t_0}, \ldots, x_{t-1})}$$
$$= \frac{\prod_{s=t_0}^{t-1} P(x_s \mid x_{t_0}, \ldots, x_{s-1}, y_{t_0}, \ldots, y_{s-1})\, P(y_s \mid x_{t_0}, \ldots, x_s, y_{t_0}, \ldots, y_{s-1})}{P(x_{t_0}, \ldots, x_{t-1})}$$
$$\overset{(a)}{=} \frac{\prod_{s=t_0}^{t-1} P(x_s \mid x_{t_0}, \ldots, x_{s-1})\, P(y_s \mid x_s)}{P(x_{t_0}, \ldots, x_{t-1})} = \frac{P(x_{t_0}, \ldots, x_{t-1}) \prod_{s=t_0}^{t-1} P(y_s \mid x_s)}{P(x_{t_0}, \ldots, x_{t-1})}$$
$$\overset{(b)}{=} \prod_{s=t_0}^{t-1} P(y_s \mid x_{t_0}, \ldots, x_{t-1}).$$

In the above display, (a) uses two facts: at any $s \geq t_0$, the query point $x_s$ depends only on the previous query points $x_{t_0}, \ldots, x_{s-1}$ and not on the observations; and, conditioned on $x_s$, the observation $y_s$ is independent of the query points $x_{t_0}, \ldots, x_{s-1}$ and the observations $y_{t_0}, \ldots, y_{s-1}$.
(b) uses the fact that, conditioned on $x_s$, the observation $y_s$ is independent of $x_{t_0}, \ldots, x_{s-1}, x_{s+1}, \ldots, x_{t-1}$. The equality (b) implies the conditional independence of the observations given the query points belonging to the current active set $P_t$, as required. $\square$

Having obtained the conditional independence property of the queried points, we now present the key concentration result that leads to the required regret bounds.

Lemma 3. For some $t \geq 1$, let $P_t$, $t_0$, and $E_t$ be the same as in Proposition 4. Then, for a given $\delta \in (0,1)$, the following is true:

$$P\big(\exists\, x \in P_t \text{ s.t. } |f(x) - \mu_t(x)| > \beta_t\sigma_t(x)\big) \leq \delta_t := \frac{6\delta}{t^2\pi^2}, \quad \text{with } \beta_t \geq \tau^{-1/2}M + \sqrt{2\sigma^2\log\left(\frac{|P_t|\pi^2 t^2}{3\delta}\right)}.$$

Recall that the terms $\mu_t(\cdot)$ and $\sigma_t(\cdot)$ represent the posterior mean and standard-deviation functions computed by the subroutine Compute Posterior introduced in Definition 6.

Proof. To prove this result, we rely on the following facts, derived by Valko et al. (2013) while proving their Lemma 2. For $x \in P_t$, there exist $\alpha_{t_0}, \ldots, \alpha_{t-1} \in \mathbb{R}$ (depending on $x$) such that the following holds:

$$f(x) - \mu_t(x) = \sum_{i=t_0}^{t-1} \alpha_i\big(y_i - f(x_i)\big) + B_x, \qquad \sum_{i=t_0}^{t-1} \alpha_i^2 \leq \sigma_t(x)^2, \qquad |B_x| \leq \tau^{-1/2}M\sigma_t(x), \quad \text{for all } x \in P_t.$$

Recall that $\tau$ is the regularization parameter used in the subroutine Compute Posterior, while $M$ is the upper bound on the RKHS norm of $f$. Now, we use the conditional independence property derived in Proposition 4, along with the conditional $\sigma^2$-sub-Gaussianity of the observation noise, to get the required concentration result. In particular, for a given $t \geq 1$ and a fixed $x \in P_t$, the bound on $|B_x|$ gives, with $\tilde\beta_t := \beta_t - \tau^{-1/2}M$,

$$P\big(f(x) - \mu_t(x) > \beta_t\sigma_t(x)\big) \leq P\Big(\sum_{i=t_0}^{t-1} \alpha_i\big(y_i - f(x_i)\big) > \tilde\beta_t\sigma_t(x)\Big)$$
$$\leq \mathbb{E}\left[ e^{-\lambda\tilde\beta_t\sigma_t(x)}\, \mathbb{E}\Big[\exp\Big(\lambda \sum_{i=t_0}^{t-1} \alpha_i\big(y_i - f(x_i)\big)\Big)\, \Big|\, (x_i)_{i=t_0}^{t-1}\Big]\right] \qquad (27)$$
$$= \mathbb{E}\left[ e^{-\lambda\tilde\beta_t\sigma_t(x)} \prod_{i=t_0}^{t-1} \mathbb{E}\big[\exp\big(\lambda\alpha_i(y_i - f(x_i))\big)\, \big|\, (x_i)_{i=t_0}^{t-1}\big]\right] \qquad (28)$$
$$\leq \exp\left(\frac{\lambda^2\sigma^2\sum_{i=t_0}^{t-1}\alpha_i^2}{2} - \lambda\tilde\beta_t\sigma_t(x)\right) \qquad (29)$$
$$\leq \exp\left(\frac{\lambda^2\sigma^2\sigma_t(x)^2}{2} - \lambda\tilde\beta_t\sigma_t(x)\right) \qquad (30)$$
$$= \exp\left(-\frac{\tilde\beta_t^2}{2\sigma^2}\right). \qquad (31)$$

In the above display, (27) follows by an application of Chernoff's inequality with some constant $\lambda > 0$ to be selected later, (28) follows from the conditional independence property derived in Proposition 4, (29) uses the fact that, conditioned on $x_i$, the random variable $y_i - f(x_i)$ is zero-mean and $\sigma^2$-sub-Gaussian, (30) uses the fact that $\sum_{i=t_0}^{t-1}\alpha_i^2 \leq \sigma_t(x)^2$, and (31) follows by selecting $\lambda = \tilde\beta_t/(\sigma^2\sigma_t(x))$.

Repeating the argument of the previous display with $\mu_t(x) - f(x)$ in place of $f(x) - \mu_t(x)$ gives us

$$P\big(|f(x) - \mu_t(x)| > \beta_t\sigma_t(x)\big) \leq 2\exp\left(-\frac{\tilde\beta_t^2}{2\sigma^2}\right).$$

Next, using the fact that $\tilde\beta_t \geq \sqrt{2\sigma^2\log\big(|P_t|\pi^2 t^2/(3\delta)\big)}$, we have by a union bound over $x \in P_t$:

$$P\big(\exists\, x \in P_t \text{ s.t. } |f(x) - \mu_t(x)| > \beta_t\sigma_t(x)\big) \leq \sum_{x \in P_t} 2\exp\left(-\frac{\tilde\beta_t^2}{2\sigma^2}\right) \leq 2|P_t|\exp\left(-\log\frac{|P_t|\pi^2 t^2}{3\delta}\right) = \frac{6\delta}{\pi^2 t^2} = \delta_t. \quad \square$$

Since $\sum_{t=1}^n \delta_t < \delta$, we note that the following event $\mathcal{E}$ occurs with probability at least $1 - \delta$:

$$\mathcal{E} := \bigcap_{t=1}^n \bigcap_{x \in P_t} \big\{|f(x) - \mu_t(x)| \leq \beta_t\sigma_t(x)\big\}. \qquad (32)$$

Throughout the rest of the proof, we will work under the event $\mathcal{E}$ with $\delta = 1/n$. The expected regret of the algorithm $\mathcal{A}_1$ can then be upper bounded by

$$\mathbb{E}[R_n(\mathcal{A}_1, f)] = \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}\}] + \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}^c\}] \qquad (33)$$
$$\leq \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}\}] + P(\mathcal{E}^c)\, n \sup_{x \in \mathcal{X}}\big(f(x^*) - f(x)\big) \leq \mathbb{E}[R_n(\mathcal{A}_1, f)\mathbb{1}\{\mathcal{E}\}] + O(1),$$

where the last step uses $P(\mathcal{E}^c) \leq 1/n$ and the fact that the suboptimality gap is bounded by a constant depending on $\|f\|_{\mathcal{H}_{K_\nu}}$ and $\mathrm{diam}(\mathcal{X})$. Since the second term is upper bounded by a constant, it suffices to show that the required upper bounds hold for the regret incurred under the event $\mathcal{E}$.
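The confidence-width schedule in Lemma 3 is chosen precisely so that the failure probabilities $\delta_t = 6\delta/(\pi^2 t^2)$ sum to at most $\delta$ (via $\sum_t 1/t^2 = \pi^2/6$), while $\beta_t$ grows only as $O(\sqrt{\log t})$. The short sketch below checks this bookkeeping; the values of $\sigma$, $\tau$, $M$, $\delta$, and $|P_t|$ are placeholders.

```python
import numpy as np

sigma, tau, M, delta = 0.5, 1.0, 2.0, 1e-2    # placeholder parameters
P_size = lambda t: 10                          # |P_t|, fixed here for simplicity

def beta(t):
    # Width multiplier of Lemma 3: RKHS-bias part + sub-Gaussian deviation part.
    return M / np.sqrt(tau) + np.sqrt(
        2 * sigma**2 * np.log(P_size(t) * np.pi**2 * t**2 / (3 * delta)))

n = 10_000
t = np.arange(1, n + 1)
delta_t = 6 * delta / (np.pi**2 * t**2)
print(f"sum_t delta_t = {delta_t.sum():.6f} <= delta = {delta}")  # union-bound mass
assert delta_t.sum() <= delta
print("beta_t at t = 1, 100, 10000:", [round(beta(x), 3) for x in (1, 100, 10000)])
```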
We now obtain a result about the suboptimality of the points queried by the algorithm.

Lemma 4. Suppose the event $\mathcal{E}$ introduced in (32) occurs, and the algorithm queries a point $x_t \in P_t \subset \mathcal{X}_h$ for some $h \geq 1$. Then we have

$$f(x^*) - f(x_t) \leq 7Lv_1^\xi\rho^{-\xi}\, \rho^{h\xi} = O\big(\rho^{h\xi}\big).$$

Proof. Since $h \geq 1$, the set $P_t$ must have been formed by a call to the Refine Partition subroutine. Let $x_t = x_{h,i}$ for some $i \in \{1, \ldots, 2^h\}$, and denote its parent node by $x_{h',i'}$, where $h' = h - 1$ and $i' = \lceil i/2 \rceil$. Assume that the active set $P_t$ was formed by a call to the Refine Partition subroutine at some time $t_0 < t$. Then, with $V_{h'} := L(v_1\rho^{h'})^\xi$, the following must be true:

$$f(x_t) = f(x_{h,i}) \geq f(x_{h',i'}) - V_{h'} \geq \mu_{t_0}(x_{h',i'}) - \beta_{t_0}\sigma_{t_0}(x_{h',i'}) - V_{h'} \qquad (34)$$
$$\geq \mu_{t_0}(x_{h',i'}) + \beta_{t_0}\sigma_{t_0}(x_{h',i'}) - 3V_{h'} \qquad (35)$$
$$\geq \max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big) - 3V_{h'} \qquad (36)$$
$$\geq \big(f(x^*) - 4V_{h'}\big) - 3V_{h'} = f(x^*) - 7V_{h'}. \qquad (37)$$

In the above display, the first inequality in (34) uses the fact that $f$ is $(L, \xi)$-Hölder continuous, while the second inequality uses the fact that, under the event $\mathcal{E}$, we have $f(x_{h',i'}) \geq \mu_{t_0}(x_{h',i'}) - \beta_{t_0}\sigma_{t_0}(x_{h',i'})$; (35) follows by adding and subtracting $2\beta_{t_0}\sigma_{t_0}(x_{h',i'})$, and then using the fact that $2\beta_{t_0}\sigma_{t_0}(x_{h',i'})$ must be smaller than $2V_{h'}$, due to line 6 in Algorithm 1; (36) then uses the fact that $\mu_{t_0}(x_{h',i'}) + \beta_{t_0}\sigma_{t_0}(x_{h',i'})$ must be larger than the highest lower bound, $\max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big)$, in order for $x_{h,i}$ to be included in the updated active set returned by Refine Partition; and finally, (37) uses the fact that $f(x^*) - 4V_{h'} \leq \max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big)$. To see this, let $x_{h',j}$ denote the point in $P_{t_0}$ such that $x^* \in \mathcal{X}_{h',j}$. Then we must have:

$$\max_{x \in P_{t_0}}\big(\mu_{t_0}(x) - \beta_{t_0}\sigma_{t_0}(x)\big) \geq \mu_{t_0}(x_{h',j}) - \beta_{t_0}\sigma_{t_0}(x_{h',j}) \geq f(x_{h',j}) - 2\beta_{t_0}\sigma_{t_0}(x_{h',j}) \geq f(x^*) - 2V_{h'} - 2\beta_{t_0}\sigma_{t_0}(x_{h',j}) \geq f(x^*) - 4V_{h'}. \quad \square$$

The previous lemma gives us a bound on the suboptimality of any point queried by the algorithm at a time $t$, in terms of the parameter $\rho$ and the depth $h$ of the cell. Now, let $N_h$ denote the number of times the algorithm queries a point at level $h$, i.e., lying in the subset $\mathcal{X}_h$. Then we have the following regret decomposition (assuming the event $\mathcal{E}$ defined in (32) occurs):

$$R_n(\mathcal{A}_1, f) = \sum_{t=1}^n f(x^*) - f(x_t) = O\left(\sum_h \rho^{h\xi}\, N_h\right).$$

To complete the proof, it remains to get an upper bound on the term $N_h$, which we do in two ways: one in terms of the maximum information gain $\gamma_n$, and the other in terms of the upper-complexity term $\mathcal{C}_f$ introduced in Definition 8. The former will imply the minimax near-optimality of $\mathcal{A}_1$, while the latter will lead to improved regret (beyond the minimax value) on some easier problem instances. We now present the $\gamma_n$-based upper bound on $N_h$. As an immediate consequence of this bound, we also observe that $\mathcal{A}_1$ is minimax near-optimal.

Lemma 5. The number of queries made by $\mathcal{A}_1$ at level $h$ of the tree satisfies $N_h = \tilde O\big(\rho^{-2h\xi}\gamma_n\big)$. As a consequence of this, we obtain the following upper bound on the regret:

$$\mathbb{E}[R_n(\mathcal{A}_1, f)] = \tilde O\big(\sqrt{n\gamma_n}\big) = \tilde O\big(n^{\bar a_\nu}\big), \quad \text{with } \bar a_\nu = \frac{\nu + d}{2\nu + d},$$

where the exponent follows from the bound $\gamma_n = \tilde O\big(n^{d/(2\nu+d)}\big)$ for the Matérn kernel (Vakili et al., 2021a).

Proof. This result follows by using (Valko et al., 2013, Lemmas 4 and 5) to get that $\rho^{h\xi}N_h = \tilde O\big(\beta_n\sqrt{\gamma_n N_h}\big)$. Dividing both sides by $\sqrt{N_h}$ and squaring gives the upper bound on $N_h$. Next, under the event $\mathcal{E}$, we have

$$R_n = O\left(\sum_{h=0}^{h_{\max}} \rho^{h\xi}N_h\right) = \tilde O\left(\sum_{h=0}^{h_{\max}} \beta_n\sqrt{\gamma_n N_h}\right) = \tilde O\big(h_{\max}\sqrt{\gamma_n n}\big) = \tilde O\big(\sqrt{\gamma_n n}\big).$$

In the last two equalities, we used the fact that $\beta_n$ and $h_{\max}$ are polylogarithmic in $n$, and hence are absorbed by the hidden polylogarithmic factors in the $\tilde O(\cdot)$ notation. $\square$

Lemma 6. Assume that the event $\mathcal{E}$ holds, and let $N_h$ denote the number of queries made by $\mathcal{A}_1$ at level $h$. Introduce the set $W_h := \{x \in \mathcal{X} : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{h\xi}\}$, and let $m_h = m(W_h, 2v_2\rho^h)$ denote the $2v_2\rho^h$-packing number of the set $W_h$. Then $N_h$ is upper bounded by $\tilde O\big(\rho^{-2h\xi}m_h\big)$.

Proof. Suppose a point $x_{h,i} \in \mathcal{X}_h$ is evaluated $n_{h,i}$ times before a call to Refine Partition is made. Then we must have

$$n_{h,i} = \tilde O\left(\frac{\tau\beta_t^2}{L^2 v_1^{2\xi}\rho^{2h\xi}}\right) = \tilde O\big(\rho^{-2h\xi}\big). \qquad (38)$$

This is due to the following fact: suppose that at time $t$, the point $x_{h,i}$ has been evaluated $s$ times.
Then, by (Shekhar and Javidi, 2018, Proposition 3), we know that the posterior standard deviation at $x_{h,i}$ must satisfy $\sigma_t(x_{h,i}) \leq \sqrt{\tau/s}$. Plugging $s \geq n_{h,i}$ from (38) into this bound implies that after $n_{h,i}$ evaluations, the condition $\beta_t\sigma_t(x_{h,i}) < L(v_1\rho^h)^\xi$ is satisfied, and hence the point $x_{h,i}$ will not be evaluated anymore. Furthermore, we also know that if $P_t \subset \mathcal{X}_h$, then it also satisfies the following two properties: (i) $P_t \subset Z_h := \{x \in \mathcal{X} : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{h\xi}\}$, and (ii) any two points in $P_t$ are separated by a distance of at least $2v_2\rho^h$. Together, these two facts imply that $P_t$ must be a packing set of $Z_h$, and thus we can upper bound $|P_t|$ with $m_h$, the $2v_2\rho^h$-packing number of $Z_h$. As a consequence, we have $N_h \leq \max_i n_{h,i}\, m_h = \tilde O\big(\rho^{-2h\xi}m_h\big)$. $\square$

It remains to show that the expected regret of $\mathcal{A}_1$ is upper bounded by the upper-complexity term $\mathcal{C}_f$.

Lemma 7. Introduce the term $H_n = \max\{H : \sum_{h=0}^H \rho^{-2h\xi}m_h \leq n\}$, and define $\Delta_n = \min\{n^{-(1-\bar a_\nu)}, \rho^{H_n\xi}\}$. Then we have

$$\mathbb{E}[R_n(\mathcal{A}_1, f)] = \tilde O\big(\mathcal{C}_f(\Delta_n)\big).$$

Proof. We will work under the event $\mathcal{E}$ introduced in (32), which occurs with probability at least $1 - 1/n$. The proof of this result follows by employing the upper bound on $N_h$ derived in Lemma 6, and rearranging the resulting terms to form the upper-complexity term. In particular, we note that $R_n(\mathcal{A}_1, f) = \tilde O\big(\sum_{h=0}^{H_n} \rho^{h\xi}N_h\big)$. Now, due to the upper bound on $N_h$ obtained in Lemma 6, we have $R_n(\mathcal{A}_1, f) = \tilde O\big(\sum_{h=0}^{H_n} \rho^{-h\xi}m_h\big)$. To rewrite this in terms of $\mathcal{C}_f$, note that $\Delta_n \leq \rho^{H_n\xi}$, and for $k \geq 0$ define

$$\tilde Z_k = W_{H_n - k} = \{x : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{(H_n-k)\xi}\} = \{x : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{-k\xi}\rho^{H_n\xi}\}.$$

Having defined $\tilde Z_k$, we note that the term $\tilde m_k$ in the definition of $\mathcal{C}_f$ is the same as $m_{H_n - k} = m\big(\tilde Z_k,\, 2v_2\rho^{H_n - k}\big)$: the $2v_2\rho^{H_n - k}$-packing number of $\tilde Z_k$. Finally, noting that $\rho^{(H_n - k)\xi} = \rho^{-k\xi}\rho^{H_n\xi}$ and $\Delta_n \leq \rho^{H_n\xi}$, we have the following:

$$\sum_{h=0}^{H_n} \rho^{-h\xi}m_h = \sum_{k=0}^{H_n} \rho^{-(H_n-k)\xi}\, \tilde m_k \leq \sum_{k \geq 0} \frac{\tilde m_k}{(1/\rho^\xi)^k\, \Delta_n} = \mathcal{C}_f(\Delta_n).$$

This completes the proof. $\square$

E. Proof of Proposition 2

As shown in (33), it suffices to get the bound on the regret under the probability-$(1 - 1/n)$ event $\mathcal{E}$ introduced in (32). When the objective function $f$ satisfies the local growth condition with exponent $b$, we can show that the regions $W_h = \{x : f(x^*) - f(x) \leq (7Lv_1^\xi\rho^{-\xi})\rho^{h\xi}\}$ are contained in a ball centered at the optimal point $x^*$. In particular, due to the local growth condition, it follows that $W_h \subset B\big(x^*, c''\rho^{h\xi/b}\big)$ for some constant $c''$. As $m_h = m(W_h, 2v_2\rho^h)$ is the $2v_2\rho^h$-packing number of the set $W_h$, we can bound it from above by using volume arguments to get $m_h = O\big(\rho^{-hd(1-\xi/b)}\big)$. Combining this with the result of Lemma 6, we get

$$R_n(\mathcal{A}_1, f) = \tilde O\left(\sum_{h=0}^{h_{\max}} \rho^{-h\xi}\rho^{-hd(1-\xi/b)}\right).$$

Introducing the term $\tilde d := d(1 - \xi/b)$, we have $R_n(\mathcal{A}_1, f) = \tilde O\big(\sum_{h=0}^{h_{\max}} \rho^{-h(\xi + \tilde d)}\big)$. Proceeding as in the proof of Lemma 7, introduce the term $\tilde H_n = \max\{H \leq h_{\max} : \sum_{h=0}^H \rho^{-h(2\xi + \tilde d)} \leq n\}$, so that $\rho^{-\tilde H_n} = \tilde O\big(n^{1/(2\xi + \tilde d)}\big)$. This gives us the following:

$$R_n(\mathcal{A}_1, f) = \tilde O\left(\sum_{h=0}^{\tilde H_n} \rho^{-h(\xi + \tilde d)} + \rho^{\tilde H_n\xi}\, n\right) = \tilde O\big(n^{(\xi + \tilde d)/(2\xi + \tilde d)}\big).$$

F. Extension to Lipschitz Bandits

We note that the ideas used in obtaining the instance-dependent lower bound in Theorem 1 can also be applied to related problems, such as Lipschitz bandits and kernelized level-set estimation. In this section, we state the analogous result for the Lipschitz bandit problem. Here, the goal is to design an adaptive querying strategy to optimize an unknown Lipschitz-continuous objective function $f$, with Lipschitz constant bounded above by $L$, via noisy zeroth-order queries. For this problem, we can prove an analog of Theorem 1.

Definition 18.
Let $f$ be a $(1-\lambda)L$-Lipschitz function for some $\lambda \in (0,1)$. Fix $\Delta > 0$, and introduce the sets $Z_k := \{x \in \mathcal{X} : 2^k\Delta \leq f(x^*) - f(x) < 2^{k+1}\Delta\}$. Introduce the radii $w_k = 3 \cdot 2^k\Delta/(\lambda L)$, and let $m_k$ denote the $2w_k$-packing number of the set $Z_k$ for $k \geq 0$. Then we can define the following complexity term:

$$\mathcal{C}_f^{(\mathrm{Lip})}(\Delta, L, \lambda) := \sum_{k \geq 0} \frac{m_k}{2^k\Delta}.$$

Then, proceeding as in the proof of Theorem 1, we can obtain the following lower bound.

Proposition 5. For a $(1-\lambda)L$-Lipschitz function $f$, the expected regret of an $a_0$-consistent (for the family of $L$-Lipschitz functions) algorithm $\mathcal{A}$ satisfies the following for any $a > a_0$:

$$\mathbb{E}[R_n(\mathcal{A}, f)] = \Omega\left(\sigma^2\, \mathcal{C}_f^{(\mathrm{Lip})}\big(n^{-(1-a)}, L, \lambda\big)\right).$$

The general steps involved in obtaining this statement are similar to those used in the proof of Theorem 1, and we omit the details. The main difference is that we now use a bump function of the form $g(x) = \max\{0,\, L(1 - \|x\|)\}$. It is easy to check that this function is $L$-Lipschitz and supported on the unit ball.

G. Additional Figures

Figure 3. The two panels plot the variation with dimension of the exponent of different regret upper and lower bounds (i.e., $\alpha$, if the regret scales as $n^\alpha$), on a problem instance $f \in \mathcal{H}_{K_\nu}$ satisfying $f(x^*) - f(x) \asymp \|x - x^*\|^b$ near its optimum.
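For readers who want to reproduce the qualitative content of Figure 3, the sketch below evaluates the two exponents appearing in our upper bounds as a function of the dimension: the minimax exponent $\bar a_\nu = (\nu + d)/(2\nu + d)$ from Lemma 5, and the instance-dependent exponent $(\xi + \tilde d)/(2\xi + \tilde d)$ with $\tilde d = d(1 - \xi/b)$ from Appendix E. The particular values $\nu = 0.5$ and $b = 2$ are illustrative; they fall in a regime where the instance-dependent exponent is the smaller (better) of the two.

```python
import numpy as np

nu, b = 0.5, 2.0           # smoothness and growth exponent (illustrative values)
xi = min(1.0, nu)          # Holder exponent from Fact 1

for d in range(1, 11):
    minimax = (nu + d) / (2 * nu + d)                # exponent from Lemma 5
    d_tilde = d * (1.0 - xi / b)                     # effective dimension, App. E
    adaptive = (xi + d_tilde) / (2 * xi + d_tilde)   # exponent from Proposition 2
    print(f"d={d:2d}:  minimax exponent={minimax:.3f},  "
          f"instance-dependent exponent={adaptive:.3f}")
```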