# On the Sublinear Regret of GP-UCB

Justin Whitehouse (Carnegie Mellon University) jwhiteho@andrew.cmu.edu
Zhiwei Steven Wu (Carnegie Mellon University) zstevenwu@cmu.edu
Aaditya Ramdas (Carnegie Mellon University) aramdas@cmu.edu

In the kernelized bandit problem, a learner aims to sequentially compute the optimum of a function lying in a reproducing kernel Hilbert space given only noisy evaluations at sequentially chosen points. In particular, the learner aims to minimize regret, which is a measure of the suboptimality of the choices made. Arguably the most popular algorithm is the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm, which involves acting based on a simple linear estimator of the unknown function. Despite its popularity, existing analyses of GP-UCB give a suboptimal regret rate, which fails to be sublinear for many commonly used kernels such as the Matérn kernel. This has led to a longstanding open question: are existing regret analyses for GP-UCB tight, or can bounds be improved by using more sophisticated analytical techniques? In this work, we resolve this open question and show that GP-UCB enjoys nearly optimal regret. In particular, our results yield sublinear regret rates for the Matérn kernel, improving over the state-of-the-art analyses and partially resolving a COLT open problem posed by Vakili et al. Our improvements rely on a key technical contribution: regularizing kernel ridge estimators in proportion to the smoothness of the underlying kernel $k$. Applying this key idea together with a largely overlooked concentration result in separable Hilbert spaces (for which we provide an independent, simplified derivation), we are able to provide a tighter analysis of the GP-UCB algorithm.

1 Introduction

An essential problem in areas such as econometrics [12, 13], medicine [22, 23], optimal control [4, 3], and advertising [21] is to optimize an unknown function given bandit feedback, in which algorithms only get to observe the outcomes for the chosen actions. Due to the bandit feedback, there is a fundamental tradeoff between exploiting what has been observed about the local behavior of the function and exploring to learn more about the function's global behavior. There has been a long line of work on bandit learning that investigates this tradeoff across different settings, including multi-armed bandits [29, 19, 37], linear bandits [2, 30], and kernelized bandits [5, 26, 32].

In this work, we focus on the kernelized bandit framework, which can be viewed as an extension of the well-studied linear bandit setting to an infinite-dimensional reproducing kernel Hilbert space (or RKHS) $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$. In this problem, there is some unknown function $f^\star : \mathcal{X} \to \mathbb{R}$ of bounded norm in $\mathcal{H}$, where $\mathcal{X} \subset \mathbb{R}^d$ is a bounded set. In each round $t \in [T]$, the learner uses previous observations to select an action $X_t \in \mathcal{X}$, and then observes feedback $Y_t := f^\star(X_t) + \epsilon_t$, where $\epsilon_t$ is a zero-mean noise variable. The learner aims to minimize (with high probability) the regret at time $T$, which is defined as
$$R_T := \sum_{t=1}^T \left( f^\star(x^\star) - f^\star(X_t) \right), \quad \text{where } x^\star := \arg\max_{x \in \mathcal{X}} f^\star(x).$$
The goal is to develop simple, efficient algorithms for the kernelized bandit problem that minimize the regret $R_T$. We make the following standard assumption. We also make assumptions on the underlying kernel $k$, which we discuss in Section 2.

Assumption 1.
We assume that (a) there is some constant $D > 0$ known to the learner such that $\|f^\star\|_\mathcal{H} \le D$, and (b) for every $t \ge 1$, $\epsilon_t$ is $\sigma$-sub-Gaussian conditioned on $\sigma(Y_{1:t-1}, X_{1:t})$.

Arguably the simplest algorithm for the kernelized bandit problem is GP-UCB (Gaussian process upper confidence bound) [31, 5]. GP-UCB works by maintaining a kernel ridge regression estimator of the unknown function $f^\star$ alongside a confidence ellipsoid, optimistically selecting in each round the action that provides the maximal payoff over all feasible functions. Not only is GP-UCB efficiently computable thanks to the kernel trick, but it also offers strong empirical guarantees [5]. The only seeming deficit of GP-UCB is its regret guarantee, as existing analyses only show that, with high probability, $R_T = \widetilde{O}(\gamma_T \sqrt{T})$, where $\gamma_T$ is a kernel-dependent measure of complexity known as the maximum information gain [31, 6]. In contrast, more complicated, less computationally efficient algorithms such as SupKernelUCB [34, 25] have been shown to obtain regret bounds of $\widetilde{O}(\sqrt{\gamma_T T})$, improving over the analysis of GP-UCB by a multiplicative factor of $\sqrt{\gamma_T}$. This gap is stark, as the bound $\widetilde{O}(\gamma_T \sqrt{T})$ fails, in general, to be sublinear for the practically relevant Matérn kernel, whereas $\widetilde{O}(\sqrt{\gamma_T T})$ is sublinear for any kernel experiencing polynomial eigendecay [32]. This discrepancy has prompted the development of many variants of GP-UCB that, while less computationally efficient, offer better regret guarantees in some situations [17, 27, 28]. (See a detailed discussion of these algorithms along with other related work in Appendix A.) However, the following question remains an open problem in online learning [33]: are existing analyses of vanilla GP-UCB tight, or can an improved analysis show GP-UCB enjoys sublinear regret?

1.1 Contributions

In this work, we show that GP-UCB obtains almost optimal, sublinear regret for any kernel experiencing polynomial eigendecay. This, in particular, implies that GP-UCB obtains sublinear regret for the commonly used Matérn family of kernels. We provide a brief roadmap of our paper below.

1. In Section 3, we provide background on self-normalized concentration in Hilbert spaces. In particular, in Theorem 1, we provide an independent, simplified derivation of a bound due to Abbasi-Yadkori [1], which concerns the self-normalized concentration of certain processes in separable Hilbert spaces. This bound has been largely overlooked in the kernel bandit literature, so we draw attention to it in hopes it can be leveraged in solving further kernel-based learning problems. As opposed to the existing bound of Chowdhury and Gopalan [5], which involves employing a complicated double mixture argument, the bound we present follows directly from applying the well-studied finite-dimensional method of mixtures alongside a simple truncation argument [7-9, 2]. These bounds are clean and show simple dependence on the regularization parameter.

2. In Section 4, we leverage the self-normalized concentration detailed in Theorem 1 to provide an improved regret analysis for GP-UCB. By carefully choosing regularization parameters based on the smoothness of the underlying kernel, we demonstrate that GP-UCB enjoys sublinear regret of $\widetilde{O}\big(T^{\frac{3+\beta}{2+2\beta}}\big)$ for any kernel experiencing $(C, \beta)$-polynomial eigendecay. As a special case of this result, we obtain regret bounds of $\widetilde{O}\big(T^{\frac{\nu+2d}{2\nu+2d}}\big)$ for the commonly used Matérn kernel with smoothness $\nu$ in dimension $d$. (A heuristic derivation of these exponents is sketched immediately below.)
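To preview where these exponents come from, the following back-of-the-envelope computation combines the regret bound of Theorem 2 with the information gain bound of Fact 2 (both stated in the sequel). It is only a heuristic sketch: constants and poly-logarithmic factors are suppressed, and the last line specializes to the Matérn kernel.

```latex
% Theorem 2 (later): R_T = O( sqrt(T * gamma_T(lambda)) * ( sqrt(gamma_T(lambda)) + sqrt(lambda) ) ).
% Fact 2 (later):    gamma_T(lambda) = O~( (T/lambda)^{1/beta} ) under (C, beta)-polynomial eigendecay.
\begin{align*}
R_T &= \widetilde{O}\!\left(\sqrt{T}\left(\tfrac{T}{\lambda}\right)^{1/\beta}
      + \sqrt{\lambda T}\left(\tfrac{T}{\lambda}\right)^{1/(2\beta)}\right)
      && \text{(plug Fact 2 into Theorem 2)} \\
    &= \widetilde{O}\!\left(\lambda \sqrt{T}\right)
      = \widetilde{O}\!\left(T^{\frac{3+\beta}{2+2\beta}}\right)
      && \text{(terms balance at } (T/\lambda)^{1/\beta} = \lambda \text{, i.e. } \lambda = T^{\frac{1}{1+\beta}}\text{)} \\
\frac{3+\beta}{2+2\beta}
    &= \frac{3 + \frac{2\nu+d}{d}}{2 + \frac{2(2\nu+d)}{d}}
      = \frac{2\nu + 4d}{4\nu + 4d}
      = \frac{\nu + 2d}{2\nu + 2d}
      && \text{(Mat\'ern kernel: } \beta = \tfrac{2\nu+d}{d} \text{, per Definition 3)}
\end{align*}
```

Note that the exponent $\frac{3+\beta}{2+2\beta}$ is strictly less than $1$ exactly when $\beta > 1$, which is why polynomial eigendecay with $\beta > 1$ suffices for sublinear regret.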
Our new analysis improves over the existing state-of-the-art analysis for GP-UCB, which fails to guarantee sublinear regret in general for the Matérn kernel family [5], and thus partially resolves an open problem posed by [33] on the suboptimality of GP-UCB. In sum, our results show that GP-UCB, the go-to algorithm for the kernelized bandit problem, is nearly optimal, coming close to the algorithm-independent lower bounds of Scarlett et al. [25]. Our work thus can be seen as providing theoretical justification for the strong empirical performance of GP-UCB [31].

Perhaps the most important message of our work is the importance of careful regularization in online learning problems. While many existing bandit works treat the regularization parameter as a small, kernel-independent constant, we are able to obtain significant improvements by carefully selecting the regularization parameter. We hope our work will encourage others to pay close attention to the selection of regularization parameters in future works.

2 Background and Problem Statement

Notation. We briefly touch on basic definitions and notational conveniences that will be used throughout our work. If $a_1, \dots, a_t \in \mathbb{R}$, we let $a_{1:t} := (a_1, \dots, a_t)^\top$. Let $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ be a reproducing kernel Hilbert space associated with a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We refer to the identity operator on $\mathcal{H}$ as $\mathrm{id}_\mathcal{H}$. This is distinct from the identity mapping on $\mathbb{R}^d$, which we will refer to as $I_d$. For elements $f, g \in \mathcal{H}$, we define their outer product as $fg^\top := f \langle g, \cdot \rangle_\mathcal{H}$ and inner product as $f^\top g := \langle f, g \rangle_\mathcal{H}$. For any $t \ge 1$ and sequence of points $x_1, \dots, x_t \in \mathcal{X}$ (which will typically be understood from context), let $\Phi_t := (k(\cdot, x_1), \dots, k(\cdot, x_t))^\top$. We can respectively define the Gram matrix $K_t : \mathbb{R}^t \to \mathbb{R}^t$ and covariance operator $V_t : \mathcal{H} \to \mathcal{H}$ as
$$K_t := (k(x_i, x_j))_{i,j \in [t]} = \Phi_t \Phi_t^\top \quad \text{and} \quad V_t := \sum_{s=1}^t k(\cdot, x_s) k(\cdot, x_s)^\top = \Phi_t^\top \Phi_t.$$
These two operators essentially encode the same information about the observed data points, the former being easier to work with when actually performing computations (by use of the well-known kernel trick) and the latter being easier to algebraically manipulate. Suppose $A : \mathcal{H} \to \mathcal{H}$ is a Hermitian operator of finite rank; enumerate its non-zero eigenvalues as $\lambda_1(A), \dots, \lambda_k(A)$. We can define the Fredholm determinant of $I + A$ as $\det(I + A) := \prod_{m=1}^k (1 + \lambda_m(A))$ [20]. For any $t \ge 1$, $\lambda > 0$, and $x_1, \dots, x_t \in \mathcal{X}$, one can check via a straightforward computation that $\det(I_t + \lambda^{-1} K_t) = \det(\mathrm{id}_\mathcal{H} + \lambda^{-1} V_t)$, where $K_t$ and $V_t$ are the Gram matrix and covariance operator defined above. We, again, will use these two quantities interchangeably in the sequel, but will typically prefer the latter in our proofs. If $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ is a (now general) separable Hilbert space and $(\varphi_n)_{n \ge 1}$ is an orthonormal basis for $\mathcal{H}$, for any $N \ge 1$ we can define the orthogonal projection operator $\Pi_N : \mathcal{H} \to \mathrm{span}\{\varphi_1, \dots, \varphi_N\} \subset \mathcal{H}$ by $\Pi_N f := \sum_{n=1}^N \langle f, \varphi_n \rangle_\mathcal{H} \varphi_n$. We can correspondingly define the projection onto the remaining basis functions to be the map $\Pi_N^\perp : \mathcal{H} \to \mathrm{span}\{\varphi_1, \dots, \varphi_N\}^\perp$ given by $\Pi_N^\perp f := f - \Pi_N f$. Lastly, if $A : \mathcal{H} \to \mathcal{H}$ is a symmetric, bounded linear operator, we let $\lambda_{\max}(A)$ denote the maximal eigenvalue of $A$, when such a value exists. In particular, $\lambda_{\max}(A)$ will exist whenever $A$ has finite rank, as will typically be the case considered in this paper.
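Because the Gram matrix and the covariance operator are used interchangeably via the determinant identity above, a quick numerical sanity check may be reassuring. The sketch below is our own illustration (not code from the paper): an explicit finite-dimensional feature map stands in for the possibly infinite-dimensional kernel embedding, and the identity is the standard Weinstein-Aronszajn fact $\det(I + AB) = \det(I + BA)$.

```python
import numpy as np

rng = np.random.default_rng(0)
t, n_feat, lam = 6, 15, 0.5                 # rounds, feature dimension, regularization
Phi = rng.normal(size=(t, n_feat))          # row s is a feature vector standing in for k(., x_s)

K = Phi @ Phi.T                             # Gram matrix      K_t = Phi_t Phi_t^T   (t x t)
V = Phi.T @ Phi                             # covariance op.   V_t = Phi_t^T Phi_t   (n_feat x n_feat)

# det(I_t + K_t / lam) = det(id + V_t / lam), as claimed above
lhs = np.linalg.det(np.eye(t) + K / lam)
rhs = np.linalg.det(np.eye(n_feat) + V / lam)
assert np.isclose(lhs, rhs)
```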
Basics on RKHSs. Let $\mathcal{X} \subset \mathbb{R}^d$ be some domain. A kernel is a positive semidefinite map $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is square-integrable, i.e. $\int_\mathcal{X} \int_\mathcal{X} |k(x, y)|^2 \, dx \, dy < \infty$. Any kernel $k$ has an associated reproducing kernel Hilbert space or RKHS $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ containing the closed span of all partial kernel evaluations $k(\cdot, x)$, $x \in \mathcal{X}$. In particular, the inner product $\langle \cdot, \cdot \rangle_\mathcal{H}$ on $\mathcal{H}$ satisfies the reproducing relationship $f(x) = \langle f, k(\cdot, x) \rangle_\mathcal{H}$ for all $x \in \mathcal{X}$. A kernel $k$ can be associated with a corresponding Hilbert-Schmidt operator, which is the Hermitian operator $T_k : L^2(\mathcal{X}) \to L^2(\mathcal{X})$ given by $(T_k f)(x) := \int_\mathcal{X} f(y) k(x, y) \, dy$ for any $x \in \mathcal{X}$. In short, $T_k$ can be thought of as smoothing out or mollifying a function $f$ according to the similarity metric induced by $k$. $T_k$ plays a key role in kernelized learning through Mercer's Theorem, which gives an explicit representation for $\mathcal{H}$ in terms of the eigenvalues and eigenfunctions of $T_k$.

Fact 1 (Mercer's Theorem). Let $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ be the RKHS associated with kernel $k$, and let $(\mu_n)_{n \ge 1}$ and $(\phi_n)_{n \ge 1}$ be the sequence of non-increasing eigenvalues and corresponding eigenfunctions for $T_k$. Let $(\varphi_n)_{n \ge 1}$ be the sequence of rescaled functions $\varphi_n := \sqrt{\mu_n} \phi_n$. Then $\mathcal{H} = \big\{ \sum_{n \ge 1} a_n \varphi_n : (a_n)_{n \ge 1} \in \ell^2 \big\}$, and $(\varphi_n)_{n \ge 1}$ forms an orthonormal basis for $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$.

We make the following assumption throughout the remainder of our work, which is standard and comes from Vakili et al. [32].

Assumption 2 (Assumption on kernel $k$). The kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfies (a) $|k(x, y)| \le L$ for all $x, y \in \mathcal{X}$, for some constant $L > 0$, and (b) $|\phi_n(x)| \le B$ for all $x \in \mathcal{X}$ and $n \ge 1$, for some $B > 0$.

Complexity of RKHSs. By the eigendecay of a kernel $k$, we really mean the rate of decay of the sequence of eigenvalues $(\mu_n)_{n \ge 1}$. In the literature, there are two common paradigms for studying the eigendecay of $k$: $(C_1, C_2, \beta)$-exponential eigendecay, under which $\mu_n \le C_1 \exp(-C_2 n^\beta)$ for all $n \ge 1$, and $(C, \beta)$-polynomial eigendecay, under which $\mu_n \le C n^{-\beta}$ for all $n \ge 1$. For kernels experiencing exponential eigendecay, of which the squared exponential is the most important example, GP-UCB is known to be optimal up to poly-logarithmic factors. However, for kernels experiencing polynomial eigendecay, of which the Matérn family is a common example, existing analyses of GP-UCB fail to yield sublinear regret. It is this latter case we focus on in this work.

Given the above representation in Fact 1, it is clear that the eigendecay of the kernel $k$ governs the complexity or size of the RKHS $\mathcal{H}$. We make this notion of complexity precise by discussing maximum information gain, a sequential, kernel-dependent quantity governing concentration and hardness of learning in RKHSs [6, 31, 32]. Let $t \ge 1$ and $\lambda > 0$ be arbitrary. The maximum information gain at time $t$ with regularization $\lambda$ is the scalar $\gamma_t(\lambda)$ given by
$$\gamma_t(\lambda) := \sup_{x_1, \dots, x_t \in \mathcal{X}} \frac{1}{2} \log\det\left( I_t + \lambda^{-1} K_t \right) = \sup_{x_1, \dots, x_t \in \mathcal{X}} \frac{1}{2} \log\det\left( \mathrm{id}_\mathcal{H} + \lambda^{-1} V_t \right).$$
Our presentation of maximum information gain differs from some previous works in that we encode the regularization parameter $\lambda$ into our notation. This inclusion is key for our results, as we obtain improvements by carefully selecting $\lambda$. Vakili et al. [32] bound the rate of growth of $\gamma_t(\lambda)$ in terms of the rate of eigendecay of the kernel $k$. We leverage the following fact in our main results.

Fact 2 (Corollary 1 in Vakili et al. [32]). Suppose that kernel $k$ satisfies Assumption 2 and experiences $(C, \beta)$-polynomial eigendecay. Then, for any $t \ge 1$, we have $\gamma_t(\lambda) = \widetilde{O}\big( (t/\lambda)^{1/\beta} \big)$, where $\widetilde{O}$ suppresses factors poly-logarithmic in $t$.

We last define the practically relevant Matérn kernel and discuss its eigendecay.

Definition/Fact 3. The Matérn kernel with bandwidth $\sigma > 0$ and smoothness $\nu > 1/2$ is given by
$$k_{\nu, \sigma}(x, y) := \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|x - y\|_2}{\sigma} \right)^{\nu} B_\nu\!\left( \frac{\sqrt{2\nu}\, \|x - y\|_2}{\sigma} \right),$$
where $\Gamma$ is the gamma function and $B_\nu$ is the modified Bessel function of the second kind.
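As a concrete illustration, the sketch below evaluates the Matérn kernel of Definition 3 and computes the empirical information gain $\frac{1}{2}\log\det(I_t + \lambda^{-1}K_t)$ on random inputs, showing how it shrinks as the regularization $\lambda$ grows. This is our own illustrative code, not code from the paper; the helper `matern` is a hypothetical name, and the check uses the known closed form of the kernel at $\nu = 3/2$.

```python
import numpy as np
from scipy.special import gamma, kv  # kv = modified Bessel function of the second kind

def matern(x, y, nu=1.5, sigma=1.0):
    """Matern kernel k_{nu,sigma}(x, y) from Definition 3; the r -> 0 limit is 1."""
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(y))
    if r == 0.0:
        return 1.0
    z = np.sqrt(2 * nu) * r / sigma
    return (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

# Sanity check: at nu = 3/2 the kernel reduces to (1 + sqrt(3) r / sigma) exp(-sqrt(3) r / sigma).
r = 0.8
assert np.isclose(matern(0.0, r), (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r))

# Empirical information gain at regularization lam: (1/2) log det(I_t + K_t / lam).
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
K = np.array([[matern(a, b) for b in X] for a in X])
for lam in (1.0, 10.0, 100.0):
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / lam)
    print(f"lam = {lam:6.1f}   gain = {0.5 * logdet:.3f}")   # shrinks as lam grows
```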
It is known that there is some constant $C > 0$, which may depend on $\sigma$ but not on $d$ or $\nu$, such that $k_{\nu, \sigma}$ experiences $\big(C, \frac{2\nu + d}{d}\big)$-polynomial eigendecay [24, 32].

Basics on martingale concentration: A filtration $(\mathcal{F}_t)_{t \ge 0}$ is a sequence of $\sigma$-algebras satisfying $\mathcal{F}_t \subseteq \mathcal{F}_{t+1}$ for all $t \ge 0$. If $(M_t)_{t \ge 0}$ is an $\mathcal{H}$-valued process, we say $(M_t)_{t \ge 0}$ is a martingale with respect to $(\mathcal{F}_t)_{t \ge 0}$ if (a) $(M_t)_{t \ge 0}$ is $(\mathcal{F}_t)_{t \ge 0}$-adapted, and (b) $\mathbb{E}(M_t \mid \mathcal{F}_{t-1}) = M_{t-1}$ for all $t \ge 1$. An $\mathbb{R}$-valued process is called a supermartingale if the equality in (b) is replaced with $\le$, i.e. supermartingales tend to decrease. Martingales are useful in many statistical applications due to their strong concentration of measure properties [15, 36]. The following fact can be leveraged to provide time-uniform bounds on the growth of any non-negative supermartingale.

Fact 4 (Ville's Inequality). Let $(M_t)_{t \ge 0}$ be a non-negative supermartingale with respect to some filtration. Suppose $\mathbb{E} M_0 = 1$. Then, for any $\delta \in (0, 1)$, we have
$$\mathbb{P}\left( \exists t \ge 0 : M_t \ge \frac{1}{\delta} \right) \le \delta.$$

See Howard et al. [14] for a self-contained proof of Ville's inequality, and many applications. If $\mathcal{F}$ is a $\sigma$-algebra and $\epsilon$ is an $\mathbb{R}$-valued random variable, we say $\epsilon$ is $\sigma$-sub-Gaussian conditioned on $\mathcal{F}$ if, for any $\lambda \in \mathbb{R}$, we have $\log \mathbb{E}\big( e^{\lambda \epsilon} \mid \mathcal{F} \big) \le \frac{\lambda^2 \sigma^2}{2}$; in particular, this condition implies that $\epsilon$ is mean zero. With this, we state the following result on self-normalized processes. To our understanding, the following result was first presented in some form as Example 4.2 of de la Peña et al. [8] (in the setting of continuous local martingales), and can be derived leveraging the argument of Theorem 1 in de la Peña et al. [9]. The exact form below was established (in the setting of discrete-time processes) in Theorem 1 of Abbasi-Yadkori et al. [2], which is commonly leveraged to construct confidence ellipsoids in the linear bandit setting.

Fact 5 (Example 4.2 from [8], Theorem 1 from [2]). Let $(\mathcal{F}_t)_{t \ge 0}$ be a filtration, let $(X_t)_{t \ge 1}$ be an $(\mathcal{F}_t)_{t \ge 0}$-predictable sequence in $\mathbb{R}^d$, and let $(\epsilon_t)_{t \ge 1}$ be a real-valued $(\mathcal{F}_t)_{t \ge 1}$-adapted sequence such that, conditional on $\mathcal{F}_{t-1}$, $\epsilon_t$ is mean zero and $\sigma$-sub-Gaussian. Then, for any $\lambda > 0$, the process $(M_t)_{t \ge 0}$ given by
$$M_t := \det\left( I_d + \lambda^{-1} V_t \right)^{-1/2} \exp\left( \frac{\big\| (\lambda I_d + V_t)^{-1/2} S_t \big\|_2^2}{2\sigma^2} \right)$$
is a non-negative supermartingale with respect to $(\mathcal{F}_t)_{t \ge 0}$, where $S_t := \sum_{s=1}^t \epsilon_s X_s$ and $V_t := \sum_{s=1}^t X_s X_s^\top$. Consequently, by Fact 4, for any confidence $\delta \in (0, 1)$, the following holds: with probability at least $1 - \delta$, simultaneously for all $t \ge 1$, we have
$$\big\| (V_t + \lambda I_d)^{-1/2} S_t \big\|_2 \le \sigma \sqrt{2 \log\left( \frac{\det(I_d + \lambda^{-1} V_t)^{1/2}}{\delta} \right)}.$$

Note the simple dependence on the regularization parameter $\lambda > 0$ in the above bound. While the regularization parameter doesn't prove important in regret analyses for linear bandits (where $\lambda$ is treated as constant), the choice of $\lambda$ will be critical in our setting. In the following section, we will discuss how Fact 5 can be extended to the setting of separable Hilbert spaces essentially verbatim (an observation first noticed by Abbasi-Yadkori [1]).
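To see Fact 5 in action, the following simulation (our own illustrative sketch, not from the paper) draws many independent runs of a $d$-dimensional process with Gaussian noise and counts how often the time-uniform bound is ever violated; by Ville's inequality, the violation rate should be at most $\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, sigma, delta = 3, 300, 1.0, 1.0, 0.05

violations, n_runs = 0, 500
for _ in range(n_runs):
    S, V = np.zeros(d), np.zeros((d, d))
    for t in range(T):
        x = rng.normal(size=d)        # covariate (i.i.d. here; Fact 5 only needs predictability)
        eps = sigma * rng.normal()    # sigma-sub-Gaussian noise
        S += eps * x
        V += np.outer(x, x)
        # LHS: ||(V_t + lam I)^{-1/2} S_t||_2, computed via S^T (V + lam I)^{-1} S
        lhs = np.sqrt(S @ np.linalg.solve(V + lam * np.eye(d), S))
        # RHS: sigma * sqrt(2 log( det(I + V/lam)^{1/2} / delta ))
        _, logdet = np.linalg.slogdet(np.eye(d) + V / lam)
        rhs = sigma * np.sqrt(logdet + 2 * np.log(1 / delta))
        if lhs > rhs:                 # the bound must hold for *all* t on a 1 - delta event
            violations += 1
            break
print(f"fraction of runs ever violating the bound: {violations / n_runs:.3f} (should be <= {delta})")
```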
3 A Remark on Self-Normalized Concentration in Hilbert Spaces

We begin by discussing a key, self-normalized concentration inequality for martingales. We use this bound in the sequel to construct simpler, more flexible confidence ellipsoids than currently exist for GP-UCB. The bound we present (in Theorem 1 below) is, more or less, equivalent to Corollary 3.5 in the thesis of Abbasi-Yadkori [1]. Our result is mildly more general in the sense that it directly argues that a target mixture process is a nonnegative supermartingale. The result in Abbasi-Yadkori [1] is more general in the sense that it allows the regularization (or shift) matrix to be non-diagonal. Either concentration result is sufficient for the regret bounds obtained in the sequel.

The aforementioned corollary in [1], quite surprisingly, has not been referenced in central works on the kernelized bandit problem, namely Chowdhury and Gopalan [5] and Vakili et al. [32, 33]. In fact, strictly weaker versions of the conclusion have been independently rediscovered in the context of kernel regression [10]. We emphasize that this result of Abbasi-Yadkori [1] (and the surrounding technical conclusions) is very general and may allow for further improvements in problems related to kernelized learning.

We now present Theorem 1, providing a brief sketch and a full proof in Appendix B. We believe our proof, which directly shows a target process is a nonnegative supermartingale, is of independent interest when compared to that of Abbasi-Yadkori [1] due to its simplicity. In particular, our proof follows from first principles, avoiding advanced topological notions of convergence (e.g. in the weak operator topology) and the existence of certain Gaussian measures on separable Hilbert spaces, which were heavily utilized in the proof of Corollary 3.5 in Abbasi-Yadkori [1].

Theorem 1 (Self-normalized concentration in Hilbert spaces). Let $(\mathcal{F}_t)_{t \ge 0}$ be a filtration, $(f_t)_{t \ge 1}$ be an $(\mathcal{F}_t)_{t \ge 0}$-predictable sequence in a separable Hilbert space¹ $\mathcal{H}$ such that $\|f_t\|_\mathcal{H} < \infty$ a.s. for all $t \ge 1$, and $(\epsilon_t)_{t \ge 1}$ be an $(\mathcal{F}_t)_{t \ge 1}$-adapted sequence in $\mathbb{R}$ such that, conditioned on $\mathcal{F}_{t-1}$, $\epsilon_t$ is mean zero and $\sigma$-sub-Gaussian. Defining $S_t := \sum_{s=1}^t \epsilon_s f_s$ and $V_t := \sum_{s=1}^t f_s f_s^\top$, we have that, for any $\lambda > 0$, the process $(M_t)_{t \ge 0}$ defined by
$$M_t := \det\left( \mathrm{id}_\mathcal{H} + \lambda^{-1} V_t \right)^{-1/2} \exp\left( \frac{\big\| (\lambda\, \mathrm{id}_\mathcal{H} + V_t)^{-1/2} S_t \big\|_\mathcal{H}^2}{2\sigma^2} \right)$$
is a nonnegative supermartingale with respect to $(\mathcal{F}_t)_{t \ge 0}$. Consequently, by Fact 4, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, simultaneously for all $t \ge 1$, we have
$$\big\| (V_t + \lambda\, \mathrm{id}_\mathcal{H})^{-1/2} S_t \big\|_\mathcal{H} \le \sigma \sqrt{2 \log\left( \frac{\det(\mathrm{id}_\mathcal{H} + \lambda^{-1} V_t)^{1/2}}{\delta} \right)}.$$

¹A space is separable if it has a countable, dense set. Separability is key, because it means we have a countable basis, whose first $N$ elements we project onto.

We can summarize our independent proof in two simple steps. First, following from Fact 5, the bound in Theorem 1 holds when we project $S_t$ and $V_t$ onto a finite number $N$ of coordinates, defining a truncated nonnegative supermartingale $M_t^{(N)}$. Second, we can make a limiting argument, showing $M_t^{(N)}$ is essentially $M_t$ for large values of $N$.

Proof Sketch for Theorem 1. Let $(\varphi_n)_{n \ge 1}$ be an orthonormal basis for $\mathcal{H}$, and, for any $N \ge 1$, let $\Pi_N$ denote the projection operator onto $\mathcal{H}_N := \mathrm{span}\{\varphi_1, \dots, \varphi_N\}$. Note that the projected process $(\Pi_N S_t)_{t \ge 1}$ is an $\mathcal{H}$-valued martingale with respect to $(\mathcal{F}_t)_{t \ge 0}$. Further, note that the projected variance process $(\Pi_N V_t \Pi_N^\top)_{t \ge 0}$ satisfies $\Pi_N V_t \Pi_N^\top = \sum_{s=1}^t (\Pi_N f_s)(\Pi_N f_s)^\top$. Since, for any $N \ge 1$, $\mathcal{H}_N$ is a finite-dimensional Hilbert space, it follows from Lemma 1 that the process $(M_t^{(N)})_{t \ge 0}$ given by
$$M_t^{(N)} := \det\left( \mathrm{id}_\mathcal{H} + \lambda^{-1} \Pi_N V_t \Pi_N^\top \right)^{-1/2} \exp\left( \frac{\big\| (\lambda\, \mathrm{id}_\mathcal{H} + \Pi_N V_t \Pi_N^\top)^{-1/2} \Pi_N S_t \big\|_\mathcal{H}^2}{2\sigma^2} \right)$$
is a nonnegative supermartingale with respect to $(\mathcal{F}_t)_{t \ge 0}$. One can check that, for any $t \ge 0$, $M_t^{(N)} \to M_t$ as $N \to \infty$. Thus, Fatou's Lemma implies
$$\mathbb{E}\left( M_t \mid \mathcal{F}_{t-1} \right) = \mathbb{E}\left( \lim_{N \to \infty} M_t^{(N)} \,\Big|\, \mathcal{F}_{t-1} \right) \le \liminf_{N \to \infty} \mathbb{E}\left( M_t^{(N)} \mid \mathcal{F}_{t-1} \right) \le \liminf_{N \to \infty} M_{t-1}^{(N)} = M_{t-1},$$
which proves the first part of the claim. The second part of the claim follows from applying Fact 4 to the defined nonnegative supermartingale and rearranging. See Appendix B for details.

The following corollary specializes Theorem 1 (and thus Corollary 3.5 of Abbasi-Yadkori [1]) to the case where $\mathcal{H}$ is an RKHS and $f_t = k(\cdot, X_t)$ for all $t \ge 1$. In this special case, we can reframe the above theorem in terms of the familiar Gram matrix $K_t$, assuming that quantity is invertible.
While we prefer the simplicity and elegance of working directly in the RKHS $\mathcal{H}$ in the sequel, the following corollary allows us to present Theorem 1 in a way that is computationally tractable.

Corollary 1. Let us assume the same setup as Theorem 1, and additionally assume that (a) $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ is an RKHS associated with some kernel $k$, and (b) there is some $\mathcal{X}$-valued $(\mathcal{F}_t)_{t \ge 0}$-predictable process $(X_t)_{t \ge 1}$ such that $(f_t)_{t \ge 1} = (k(\cdot, X_t))_{t \ge 1}$. Then, for any $\lambda > 0$ and $\delta \in (0, 1)$, we have that, with probability at least $1 - \delta$, simultaneously for all $t \ge 0$,
$$\big\| (V_t + \lambda\, \mathrm{id}_\mathcal{H})^{-1/2} S_t \big\|_\mathcal{H} \le \sigma \sqrt{2 \log\frac{1}{\delta} + \log\det\left( I_t + \lambda^{-1} K_t \right)}.$$
If, in addition, the Gram matrix $K_t = (k(X_i, X_j))_{i,j \in [t]}$ is invertible, we have the equality
$$\big\| \big( K_t (K_t + \lambda I_t)^{-1} \big)^{1/2} \epsilon_{1:t} \big\|_2 = \big\| (\lambda\, \mathrm{id}_\mathcal{H} + V_t)^{-1/2} S_t \big\|_\mathcal{H}.$$
We prove Corollary 1 in Appendix B.
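The norm equality in Corollary 1 is what makes the Hilbert-space bound computable, since its left-hand side involves only the Gram matrix and the residuals. Below is a quick finite-dimensional check of the equality; it is our own sketch, with an explicit feature map standing in for the kernel embedding $k(\cdot, X_s)$.

```python
import numpy as np

rng = np.random.default_rng(1)
t, n_feat, lam = 8, 20, 0.7
Phi = rng.normal(size=(t, n_feat))     # row s plays the role of k(., X_s)
eps = rng.normal(size=t)               # residuals eps_{1:t}

S = Phi.T @ eps                        # S_t = sum_s eps_s k(., X_s)
V = Phi.T @ Phi                        # covariance operator V_t
K = Phi @ Phi.T                        # Gram matrix K_t

# ||(K_t (K_t + lam I_t)^{-1})^{1/2} eps||_2^2  ==  ||(lam id + V_t)^{-1/2} S_t||^2
lhs = eps @ (K @ np.linalg.solve(K + lam * np.eye(t), eps))
rhs = S @ np.linalg.solve(V + lam * np.eye(n_feat), S)
assert np.isclose(lhs, rhs)
```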
With this reframing of Theorem 1, we compare the concentration results of Theorem 1 (and thus Abbasi-Yadkori [1]) to the following, commonly leveraged result from Chowdhury and Gopalan [5].

Fact 6 (Theorem 1 from Chowdhury and Gopalan [5]). Assume the same setup as Fact 5. Let $\eta > 0$ be arbitrary, and let $K_t := (k(X_i, X_j))_{i,j \in [t]}$ be the Gram matrix corresponding to observations made by time $t \ge 1$. Then, with probability at least $1 - \delta$, simultaneously for all $t \ge 1$, we have
$$\big\| \big( (K_t + \eta I_t)^{-1} + I_t \big)^{-1/2} \epsilon_{1:t} \big\|_2 \le \sigma \sqrt{2 \log\left( \frac{\det\left( (1 + \eta) I_t + K_t \right)^{1/2}}{\delta} \right)}.$$

To make the comparison with this bound clear, we parameterize the bound in the above fact in terms of $\eta > 0$ instead of $\lambda > 0$ to emphasize the following difference: both sides of the bound presented in Theorem 1 shrink as $\lambda$ is increased, whereas both sides of the bound in Fact 6 increase as $\eta$ grows. Thus, increasing $\lambda$ in Theorem 1 should be seen as decreasing $\eta$ in the bound of Chowdhury and Gopalan [5]. The bounds in Corollary 1 and Fact 6 coincide when $\lambda = 1$ and $\eta \downarrow 0$ (per Lemma 1 in Chowdhury and Gopalan [5]), but are otherwise not equivalent for other choices of $\lambda$ and $\eta$.

We believe Theorem 1 and Corollary 3.5 of Abbasi-Yadkori [1] to be significantly more usable than the result of Chowdhury and Gopalan [5] for several reasons. First, the aforementioned bounds directly extend the method of mixtures (in particular, Fact 5) to potentially infinite-dimensional Hilbert spaces. This similarity in form allows us to leverage the existing analysis of Abbasi-Yadkori et al. [2] to prove our regret bounds, with only slight modifications. This is in contrast to the more cumbersome regret analysis that leverages Fact 6, which is not only more difficult to follow, but also obtains inferior, sometimes super-linear regret guarantees. Second, we note that Theorem 1 provides a bound that has a simple dependence on $\lambda > 0$. In more detail, directly as a byproduct of the simplified bounds, Theorem 2 offers a regret bound that can readily be tuned in terms of $\lambda$. Due to their use of a double mixture technique in proving Fact 6, Chowdhury and Gopalan [5] essentially wind up with a nested, doubly-regularized matrix $((K_t + \eta I_t)^{-1} + I_t)^{-1/2}$ with which they normalize the residuals $\epsilon_{1:t}$. In particular, this more complicated normalization makes it difficult to understand how varying $\eta$ impacts regret guarantees, which we find to be essential for proving improved regret guarantees.

We note that the central bound discussed in this section does not provide an improvement in dependence on maximum information gain in the sense hypothesized by Vakili et al. [33]. In particular, the authors hypothesized the possibility of shaving a $\sqrt{\gamma_T}$ multiplicative factor off of self-normalized concentration inequalities in RKHSs. This was shown in a recent work (see Lattimore [18]) to be impossible in general. Instead, Theorem 1 and Corollary 3.5 of Abbasi-Yadkori [1] give one access to a family of bounds parameterized by the regularization parameter $\lambda > 0$. As will be seen in the sequel, by optimizing over this parameter, one can obtain significant improvements in regret.

4 An Improved Regret Analysis of GP-UCB

In this section, we provide the second of our main contributions, which is an improved regret analysis for the GP-UCB algorithm. We provide a description of GP-UCB in Algorithm 1. While we state the algorithm directly in terms of quantities in the RKHS $\mathcal{H}$, these quantities can be readily converted to those involving Gram matrices or Gaussian processes for those who prefer that perspective [5, 38]. As seen in Section 3, by carefully extending the method of mixtures technique (originally due to Robbins) of Abbasi-Yadkori et al. [2], Abbasi-Yadkori [1], and de la Peña et al. [7, 8] to Hilbert spaces, we can construct self-normalized concentration inequalities that have simple dependence on the regularization parameter $\lambda$. These simplified bounds, in conjunction with information about the eigendecay of the kernel $k$ [32], can be combined to carefully choose $\lambda$ to obtain improved regret. We now present our main result.

Theorem 2. Let $T > 0$ be a fixed time horizon, $\lambda > 0$ a regularization parameter, and assume Assumptions 1 and 2 hold. Let $\delta \in (0, 1)$, and for $t \ge 1$ define
$$U_t := \sigma \sqrt{2 \log\frac{1}{\delta} + \log\det\left( \mathrm{id}_\mathcal{H} + \lambda^{-1} V_t \right)} + \lambda^{1/2} D.$$
Then, with probability at least $1 - \delta$, the regret of Algorithm 1 run with parameters $\lambda$, $(U_t)_{t \ge 1}$, and $D$ satisfies
$$R_T = O\left( \sqrt{T \gamma_T(\lambda)} \left( \sqrt{\gamma_T(\lambda)} + \sqrt{\lambda} \right) \right),$$
where in the big-Oh notation above we treat $\delta$, $D$, $\sigma$, $B$, and $L$ as being held constant. If the kernel $k$ experiences $(C, \beta)$-polynomial eigendecay for some $C > 0$ and $\beta > 1$, taking $\lambda = O(T^{\frac{1}{1+\beta}})$ yields $R_T = \widetilde{O}\big(T^{\frac{3+\beta}{2+2\beta}}\big)$,² which is always sublinear in $T$.

²The notation $\widetilde{O}$ suppresses multiplicative, poly-logarithmic factors in $T$.

Algorithm 1: Gaussian Process Upper Confidence Bound (GP-UCB)
Input: regularization parameter $\lambda > 0$, norm bound $D$, confidence bounds $(U_t)_{t \ge 1}$, and time horizon $T$.
Set $V_0 := 0$, $\widehat{f}_0 := 0$, and $E_0 := \{ f \in \mathcal{H} : \|f\|_\mathcal{H} \le D \}$.
for $t = 1, \dots, T$ do
  Let $(X_t, \widetilde{f}_t) := \arg\max_{x \in \mathcal{X},\, f \in E_{t-1}} \langle f, k(\cdot, x) \rangle_\mathcal{H}$.
  Play action $X_t$ and observe reward $Y_t := f^\star(X_t) + \epsilon_t$.
  Set $V_t := V_{t-1} + k(\cdot, X_t) k(\cdot, X_t)^\top$ and $\widehat{f}_t := (V_t + \lambda\, \mathrm{id}_\mathcal{H})^{-1} \Phi_t^\top Y_{1:t}$.
  Set $E_t := \big\{ f \in \mathcal{H} : \big\| (V_t + \lambda\, \mathrm{id}_\mathcal{H})^{1/2} (\widehat{f}_t - f) \big\|_\mathcal{H} \le U_t \big\}$.
end for

While we present the above bound with a fixed time horizon, it can be made anytime by carefully applying a standard doubling argument (see Lattimore and Szepesvári [19], for instance). We specialize the above theorem to the case of the Matérn kernel in the following corollary.

Corollary 2. Definition 3 states that the Matérn kernel with smoothness $\nu > 1/2$ in dimension $d$ experiences $\big(C, \frac{2\nu + d}{d}\big)$-polynomial eigendecay, for some constant $C > 0$. Thus, GP-UCB obtains a regret rate of $R_T = \widetilde{O}\big(T^{\frac{\nu + 2d}{2\nu + 2d}}\big)$.
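To make Algorithm 1 concrete, below is a minimal Gram-matrix sketch of GP-UCB over a finite candidate grid. It reflects our own simplifications (1-d actions, an RBF kernel and a synthetic $f^\star$ as stand-ins; the names `k_rbf`, `f_star`, and `grid` are illustrative). It uses the standard kernel-trick identity $\|(V_t + \lambda\,\mathrm{id}_\mathcal{H})^{-1/2} k(\cdot, x)\|_\mathcal{H}^2 = \lambda^{-1}\big(k(x,x) - k_t(x)^\top (K_t + \lambda I_t)^{-1} k_t(x)\big)$, together with the fact that the joint maximization over the ellipsoid $E_{t-1}$ reduces to the UCB index $\widehat{f}_{t-1}(x) + U_{t-1} \|(V_{t-1} + \lambda\,\mathrm{id}_\mathcal{H})^{-1/2} k(\cdot, x)\|_\mathcal{H}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_rbf(A, B, ell=0.2):
    """RBF kernel on 1-d inputs; any bounded kernel (e.g. Matern) works here."""
    A, B = np.asarray(A).reshape(-1, 1), np.asarray(B).reshape(-1, 1)
    return np.exp(-((A - B.T) ** 2) / (2 * ell ** 2))

f_star = lambda x: np.sin(3 * x) * np.exp(-x)   # synthetic unknown function for the demo
grid = np.linspace(0.0, 2.0, 200)               # finite stand-in for the action set X
T, lam, sigma, D, delta = 100, 1.0, 0.1, 2.0, 0.05

X_hist = [float(grid[rng.integers(len(grid))])]               # arbitrary first action
Y_hist = [f_star(X_hist[0]) + sigma * rng.normal()]
for t in range(1, T):
    Xh = np.array(X_hist)
    K = k_rbf(Xh, Xh)
    A = K + lam * np.eye(t)                                   # K_t + lam I_t
    kx = k_rbf(grid, Xh)                                      # cross-kernel, shape (|grid|, t)
    mean = kx @ np.linalg.solve(A, np.array(Y_hist))          # kernel ridge estimate f_hat(x)
    # squared confidence width: (k(x,x) - kx (K_t + lam I_t)^{-1} kx^T) / lam, with k(x,x) = 1
    width2 = (1.0 - np.sum(kx * np.linalg.solve(A, kx.T).T, axis=1)) / lam
    _, logdet = np.linalg.slogdet(np.eye(t) + K / lam)        # log det(id + V_t / lam)
    U = sigma * np.sqrt(2 * np.log(1 / delta) + logdet) + np.sqrt(lam) * D
    x = float(grid[np.argmax(mean + U * np.sqrt(np.maximum(width2, 0.0)))])
    X_hist.append(x)
    Y_hist.append(f_star(x) + sigma * rng.normal())

regret = T * float(np.max(f_star(grid))) - float(np.sum(f_star(np.array(X_hist))))
print(f"cumulative regret over T = {T} rounds: {regret:.3f}")
```

Note how $\lambda$ enters twice: a larger $\lambda$ inflates the $\sqrt{\lambda}\, D$ term in $U_t$ but shrinks both the log-determinant and the per-action width, and Theorem 2 chooses $\lambda$ to balance these two effects.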
We note that our regret analysis is the first to show that GP-UCB attains sublinear regret for general kernels experiencing polynomial eigendecay. Of particular import is that Corollary 2 of Theorem 2 yields the first analysis of GP-UCB that implies sublinear regret for the Matérn kernel under general settings of the ambient dimension $d$ and smoothness $\nu$. A recent result by Janz [16] demonstrates that GP-UCB obtains sublinear regret for the specific case of the Matérn family when the smoothness parameter $\nu$ and the dimension $d$ satisfy a certain uniform boundedness condition. Our results are (a) more general, holding for any kernel exhibiting polynomial eigendecay, (b) do not require checking such a boundedness condition, and (c) follow from a simple regularization-based argument. In particular, the arguments of Janz [16] require advanced functional analytic and Fourier analytic machinery. We note that our analysis does not obtain optimal regret, as the theoretically interesting but computationally cumbersome SupKernelUCB algorithm [25, 34] obtains a slightly improved regret bound of $\widetilde{O}\big(T^{\frac{1+\beta}{2\beta}}\big)$ for $(C, \beta)$-polynomial eigendecay and $\widetilde{O}\big(T^{\frac{\nu + d}{2\nu + d}}\big)$ for the Matérn kernel with smoothness $\nu$ in dimension $d$. Due to the aforementioned result of Lattimore [18], which shows that improved dependence on maximum information gain cannot generally be obtained in Hilbert space concentration, we believe further improvements on the regret analysis of GP-UCB may not be possible.

To wrap up this section, we provide a proof sketch for Theorem 2. The entire proof, along with full statements and proofs of the technical lemmas, can be found in Appendix C.

Proof Sketch for Theorem 2. Letting, for any $t \in [T]$, the instantaneous regret be defined as $r_t := f^\star(x^\star) - f^\star(X_t)$, a standard argument yields that, with probability at least $1 - \delta$, simultaneously for all $t \in [T]$,
$$r_t \le 2 U_{t-1} \big\| (\lambda\, \mathrm{id}_\mathcal{H} + V_{t-1})^{-1/2} k(\cdot, X_t) \big\|_\mathcal{H}.$$
A further standard argument using Cauchy-Schwarz and an elliptical potential argument yields
$$R_T \le 2 U_T \sqrt{2 T \log\det\left( \mathrm{id}_\mathcal{H} + \lambda^{-1} V_T \right)} \le 2 \left( \sigma \sqrt{2 \log(1/\delta) + 2 \gamma_T(\lambda)} + \lambda^{1/2} D \right) \sqrt{4 T \gamma_T(\lambda)} = O\left( \sqrt{T \gamma_T(\lambda)} \left( \sqrt{\gamma_T(\lambda)} + \sqrt{\lambda} \right) \right),$$
which proves the first part of the claim. If, additionally, $k$ experiences $(C, \beta)$-polynomial eigendecay, we know that $\gamma_T(\lambda) = \widetilde{O}\big( (T/\lambda)^{1/\beta} \big)$ by Fact 2. Setting $\lambda := O(T^{\frac{1}{1+\beta}})$ thus yields $R_T = \widetilde{O}\big(T^{\frac{3+\beta}{2+2\beta}}\big)$, proving the second part of the claim.

5 Conclusion

In this work, we present an improved analysis for the GP-UCB algorithm in the kernelized bandit problem. We provide the first analysis showing that GP-UCB obtains sublinear regret when the underlying kernel $k$ experiences polynomial eigendecay, which in particular implies sublinear regret rates for the practically relevant Matérn kernel. In particular, we show GP-UCB obtains regret $\widetilde{O}\big(T^{\frac{3+\beta}{2+2\beta}}\big)$ when $k$ experiences $(C, \beta)$-polynomial eigendecay, and regret $\widetilde{O}\big(T^{\frac{\nu + 2d}{2\nu + 2d}}\big)$ for the Matérn kernel with smoothness $\nu$ in dimension $d$.

Our contributions are twofold. First, we show the importance of finding the right concentration inequality for tackling problems in online learning: in this case, the correct bound is a self-normalized inequality originally due to Abbasi-Yadkori [1]. We provide an independent proof of a result equivalent to Corollary 3.5 of Abbasi-Yadkori [1] in Theorem 1, and hope that our simplified, truncation-based analysis will make the result more accessible to researchers working on problems in kernelized learning. Second, we demonstrate the importance of regularization in the kernelized bandit problem. In particular, since the smoothness of the kernel $k$ governs the hardness of learning, by regularizing in proportion to the rate of eigendecay of $k$, one can obtain significantly improved regret bounds.

A shortcoming of our work is that, despite obtaining the first generally sublinear regret bounds for GP-UCB, our rates are not optimal. In particular, there are discretization-based algorithms, such as SupKernelUCB [34], which obtain slightly better regret bounds of $\widetilde{O}\big(T^{\frac{1+\beta}{2\beta}}\big)$ for $(C, \beta)$-polynomial eigendecay. We hypothesize that the vanilla GP-UCB algorithm, which involves constructing confidence ellipsoids directly in the RKHS $\mathcal{H}$, cannot obtain this rate.
The common line of reasoning [33] is that, because LinUCB (the equivalent algorithm in $\mathbb{R}^d$) obtains the near-optimal regret rate of $\widetilde{O}(d\sqrt{T})$ in the linear bandit setting, GP-UCB should attain optimal regret as well. In the linear bandit setting, there is no subtlety between estimating the optimal action and the unknown slope vector, as these are one and the same. In the kernel bandit setting, estimating the function and estimating the optimal action are not equivalent tasks. In particular, the former serves in essence as a nuisance parameter in estimating the latter: tight estimation of the unknown function under the Hilbert space norm implies tight estimation of the optimal action, but not the other way around. Existing optimal algorithms are successful because they discretize the input domain, which has finite metric dimension [26], and make no attempt to estimate the unknown function in RKHS norm. Since compact sets in RKHSs do not, in general, have finite metric dimension [35], this makes estimation of the unknown function a strictly more difficult task. In fact, recent work by Lattimore [18] demonstrates that self-normalized concentration in RKHSs, in general, cannot exhibit improved dependence on maximum information gain. This further supports our hypothesis that the regret analysis of GP-UCB cannot be improved beyond the gains made in this paper.

6 Acknowledgements

AR acknowledges support from NSF DMS-2310718 and NSF IIS-2229881. ZSW and JW were supported in part by the NSF CNS2120667, a CyLab 2021 grant, a Google Faculty Research Award, and a Mozilla Research Grant. JW also acknowledges support from NSF GRFP grants DGE1745016 and DGE2140739. We also would like to thank Xingyu Zhou and Johannes Kirschner for independently bringing to our attention the result from Abbasi-Yadkori [1] (Corollary 3.5) on self-normalized concentration in Hilbert spaces, which is essentially equivalent to Theorem 1. We have rewritten the paper in a way that emphasizes the importance of this result and provides proper attribution to the original author.

References

[1] Yasin Abbasi-Yadkori. Online learning for linearly parametrized control problems. PhD thesis, 2013.
[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
[3] Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926-1951, 1995.
[4] Andrew G Barto. Reinforcement learning control. Current Opinion in Neurobiology, 4(6):888-893, 1994.
[5] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844-853. PMLR, 2017.
[6] Thomas M Cover and Joy A Thomas. Information theory and statistics. Elements of Information Theory, 1(1):279-335, 1991.
[7] Victor de la Peña, Michael J Klass, and Tze Leung Lai. Self-normalized processes: Exponential inequalities, moment bounds and iterated logarithm laws. The Annals of Probability, 32, 07 2004. doi: 10.1214/009117904000000397.
[8] Victor de la Peña, Michael J Klass, and Tze Leung Lai. Pseudo-maximization and self-normalized processes. Probability Surveys, 4:172-192, 09 2007. doi: 10.1214/07-PS119.
[9] Victor H de la Peña, Michael J Klass, and Tze Leung Lai. Theory and applications of multivariate self-normalized processes. Stochastic Processes and their Applications, 119(12):4210-4227, 2009.
[10] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650-683, 2018.
[11] Rick Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
[12] Vivek Farias, Ciamac Moallemi, Tianyi Peng, and Andrew Zheng. Synthetically controlled bandits. arXiv preprint arXiv:2202.07079, 2022.
[13] Matthew Hoffman, Eric Brochu, and Nando De Freitas. Portfolio allocation for Bayesian optimization. In UAI, pages 327-336, 2011.
[14] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257-317, 2020.
[15] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2), 2021.
[16] David Janz. Sequential decision making with feature-linear models. PhD thesis, 2022.
[17] David Janz, David Burt, and Javier González. Bandit optimisation of functions in the Matérn kernel RKHS. In International Conference on Artificial Intelligence and Statistics, pages 2486-2495. PMLR, 2020.
[18] Tor Lattimore. A lower bound for linear and kernel regression with adaptive covariates. In The Thirty Sixth Annual Conference on Learning Theory, pages 2095-2113. PMLR, 2023.
[19] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[20] Peter D Lax. Functional Analysis, volume 55. John Wiley & Sons, 2002.
[21] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670, 2010.
[22] Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. Operations Research, 68(4):1132-1161, 2020.
[23] Aditya Mate, Jackson Killian, Haifeng Xu, Andrew Perrault, and Milind Tambe. Collapsing bandits and their application to public health intervention. Advances in Neural Information Processing Systems, 33:15639-15650, 2020.
[24] Gabriele Santin and Robert Schaback. Approximation of eigenfunctions in kernel-based spaces. Advances in Computational Mathematics, 42(4):973-993, 2016.
[25] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723-1742. PMLR, 2017.
[26] Shubhanshu Shekhar and Tara Javidi. Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics, 12(2):3829-3874, 2018.
[27] Shubhanshu Shekhar and Tara Javidi. Multi-scale zero-order optimization of smooth functions in an RKHS. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 288-293, 2022.
[28] Shubhanshu Shekhar and Tara Javidi. Instance dependent regret analysis of kernelized bandits. In International Conference on Machine Learning, pages 19747-19772. PMLR, 2022.
[29] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1-286, 2019.
[30] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014.
[31] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010.
[32] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82-90. PMLR, 2021.
[33] Sattar Vakili, Jonathan Scarlett, and Tara Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Conference on Learning Theory, pages 4647-4652. PMLR, 2021.
[34] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.
[35] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[36] Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society, Series B, 2023.
[37] Peter Whittle. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):143-149, 1980.
[38] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.