# Bayesian Optimization under Heavy-tailed Payoffs

Sayak Ray Chowdhury (Department of ECE, Indian Institute of Science, Bangalore, India 560012; sayak@iisc.ac.in) and Aditya Gopalan (Department of ECE, Indian Institute of Science, Bangalore, India 560012; aditya@iisc.ac.in)

## Abstract

We consider black-box optimization of an unknown function in the nonparametric Gaussian process setting when the noise in the observed function values can be heavy-tailed. This is in contrast to the existing literature, which typically assumes sub-Gaussian noise distributions for queries. Under the assumption that the unknown function belongs to the Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, we first show that an adaptation of the well-known GP-UCB algorithm with reward truncation enjoys sublinear $\tilde{O}\big(T^{\frac{2+\alpha}{2(1+\alpha)}}\big)$ regret even with only the $(1+\alpha)$-th moments, $\alpha \in (0, 1]$, of the reward distribution being bounded ($\tilde{O}$ hides logarithmic factors). However, for the common squared exponential (SE) and Matérn kernels, this is seen to be significantly larger than a fundamental $\Omega\big(T^{\frac{1}{1+\alpha}}\big)$ lower bound on regret. We resolve this gap by developing novel Bayesian optimization algorithms, based on kernel approximation techniques, with regret bounds matching the lower bound in order for the SE kernel. We numerically benchmark the algorithms on environments based on both synthetic models and real-world data sets.

## 1 Introduction

Black-box optimization of an unknown function $f: \mathbb{R}^d \to \mathbb{R}$ with expensive, noisy queries is a generic problem arising in domains such as hyper-parameter tuning for complex machine learning models [3], sensor selection [14], synthetic gene design [15], experimental design, etc. The popular Bayesian optimization (BO) approach to this problem starts with a prior distribution, typically a nonparametric Gaussian process (GP), over a function class, uses function evaluations to compute the posterior distribution over functions, and chooses the next function evaluation adaptively using a sampling strategy geared towards reaching the optimum. Popular sampling strategies include expected improvement [25], probability of improvement [40], upper confidence bounds [35], Thompson sampling [11], predictive-entropy search [17], etc.

The design and analysis of adaptive sampling strategies for BO typically involves the assumption of bounded, or at worst sub-Gaussian, distributions for the rewards (or losses) observed by the learner, which is quite light-tailed. Yet, many real-world environments are known to exhibit heavy-tailed behavior: the distribution of delays in data networks is inherently heavy-tailed, especially with highly variable or bursty traffic flows that are well modeled with heavy tails [20]; heavy-tailed price fluctuations are common in finance and insurance data [29]; properties of complex networks, such as the degree distribution, often exhibit heavy tails [37]; etc. This motivates studying methods for Bayesian optimization when observations are significantly heavier-tailed than Gaussian.

A simple version of black-box optimization, in the form of online learning in finite multi-armed bandits (MABs) with heavy-tailed payoffs, was first studied rigorously by Bubeck et al. [8], where the payoffs are assumed to have bounded $(1+\alpha)$-th moment for $\alpha \in (0, 1]$. They showed that for
MABs with only finite variances (i.e., $\alpha = 1$), by using statistical estimators that are more robust than the empirical mean, one can still recover the optimal regret rate for MABs under the sub-Gaussian assumption. Moving further, Medina and Yang [24] consider these estimators for the problem of linear (parametric) stochastic bandits under heavy-tailed rewards, and Shao et al. [34] show that almost optimal algorithms can be designed by using an optimistic, data-adaptive truncation of rewards. Other important works include pure exploration under heavy-tailed noise [43], payoffs with bounded kurtosis [23], extreme bandits [10], and heavy-tailed payoffs with $\alpha \in (0, \infty)$ [38].

Against this backdrop, we consider regret minimization with heavy-tailed reward distributions in bandits with a potentially continuous arm set, whose (unknown) expected reward function is nonparametric and assumed to have smoothness compatible with a kernel on the arm set. Here, it is unclear whether existing BO techniques, which rely on statistical confidence sets built for sub-Gaussian observations, can be made to attain nontrivial regret, since it is unlikely that these confidence sets will even remain correct. It is worth mentioning that in the finite-dimensional setting, Shao et al. [34] solve the problem almost optimally, but their results do not carry over to the general nonparametric kernelized setup, since their algorithms and regret bounds depend crucially on the finite feature dimension. We answer this question affirmatively in this work, formalizing and solving BO under heavy-tailed noise almost optimally. Specifically, this paper makes the following contributions.

- We adapt the GP-UCB algorithm to heavy-tailed payoffs by a truncation step, and show that it enjoys a regret bound of $\tilde{O}\big(\gamma_T T^{\frac{2+\alpha}{2(1+\alpha)}}\big)$, where $\gamma_T$ depends on the kernel associated with the RKHS and is generally sub-linear in $T$. This regret rate, however, is potentially sub-optimal in light of an $\Omega\big(T^{\frac{1}{1+\alpha}}\big)$ fundamental lower bound on regret that we show for two specific kernels, namely the squared exponential (SE) kernel and the Matérn kernel.
- We develop a new Bayesian optimization algorithm by truncating rewards in each direction of an approximate, finite-dimensional feature space. We show that the feature approximation can be carried out by two popular kernel approximation techniques: Quadrature Fourier features [26] and Nyström approximation [9]. The new algorithm under either approximation scheme gets regret $\tilde{O}\big(\gamma_T T^{\frac{1}{1+\alpha}}\big)$, which is optimal up to log factors for the SE kernel.
- Finally, we report numerical results based on experiments on synthetic as well as real-world-based datasets, in which the algorithms we develop are seen to perform favorably in the harsher heavy-tailed environments.

Related work. An alternative line of work approaches black-box optimization via Lipschitz-type smoothness structure [22, 7, 2, 33], which is qualitatively different from RKHS-type smoothness assumptions. Recently, Bogunovic et al. [5] consider GP optimization under an adversarial perturbation of the query points, but the observation noise there is assumed to be Gaussian, unlike our heavy-tailed environments. Kernel approximation schemes in the context of BO usually focus on reducing the cubic cost of Gram matrix inversion [39, 41, 26, 9]. In contrast, we crucially use these approximations to achieve optimal regret for BO under heavy-tailed noise, which, we believe, might not be possible without resorting to kernel approximations.
## 2 Problem formulation

Let $f: \mathcal{X} \to \mathbb{R}$ be a fixed but unknown function over a domain $\mathcal{X} \subset \mathbb{R}^d$ for some $d \in \mathbb{N}$. At every round $t$, a learner queries $f$ at a single point $x_t \in \mathcal{X}$ and observes a noisy payoff $y_t = f(x_t) + \eta_t$. Here the noise variables $\eta_t$, $t \ge 1$, are assumed to be zero-mean i.i.d. random variables such that the payoffs satisfy $\mathbb{E}\big[|y_t|^{1+\alpha} \mid \mathcal{F}_{t-1}\big] \le v$ for some $\alpha \in (0, 1]$ and $v \in (0, \infty)$, where $\mathcal{F}_{t-1} = \sigma\big(\{(x_\tau, y_\tau)\}_{\tau=1}^{t-1}, x_t\big)$ denotes the $\sigma$-algebra generated by the events so far. (If instead the moment bound holds for each $\eta_t$, then it can be translated to a moment bound for each $y_t$ using, say, a bound on $f(x)$.) Observe that this bound on the $(1+\alpha)$-th moment at best yields bounded variance for $y_t$, and does not necessarily mean that $y_t$ (or $\eta_t$) is sub-Gaussian, as is typically assumed. The query point $x_t$ at round $t$ is chosen causally, depending upon the history $\{(x_s, y_s)\}_{s=1}^{t-1}$ of query and payoff sequences available up to round $t-1$. The learner's goal is to maximize its (expected) cumulative reward $\sum_{t=1}^T f(x_t)$ over a time horizon $T$ or, equivalently, to minimize its cumulative regret $R_T = \sum_{t=1}^T \big(f(x^\star) - f(x_t)\big)$, where $x^\star \in \operatorname{argmax}_{x \in \mathcal{X}} f(x)$ is a maximum point of $f$ (assuming the maximum is attained; not necessarily unique). A sublinear growth of $R_T$ with $T$ implies that the time-average regret $R_T/T \to 0$ as $T \to \infty$.

Regularity assumptions: Attaining sub-linear regret is impossible in general for arbitrary reward functions $f$, and thus some regularity assumptions are needed. In this paper, we assume smoothness for $f$ induced by the structure of a kernel on $\mathcal{X}$. Specifically, we make the standard assumption of a p.s.d. kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, x) \le 1$ for all $x \in \mathcal{X}$, and of $f$ being an element of the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k(\mathcal{X})$ of smooth real-valued functions on $\mathcal{X}$. Moreover, the RKHS norm of $f$ is assumed to be bounded, i.e., $\lVert f \rVert_{\mathcal{H}} \le B$ for some $B < \infty$. Boundedness of $k$ along the diagonal holds for any stationary kernel, i.e., one where $k(x, x') = k(x - x')$, e.g., the Squared Exponential kernel $k_{\mathrm{SE}}$ and the Matérn kernel $k_{\text{Matérn}}$:
$$k_{\mathrm{SE}}(x, x') = \exp\Big(-\frac{r^2}{2l^2}\Big) \quad\text{and}\quad k_{\text{Matérn}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\Big(\frac{\sqrt{2\nu}\, r}{l}\Big)^{\nu} B_\nu\Big(\frac{\sqrt{2\nu}\, r}{l}\Big),$$
where $l > 0$ and $\nu > 0$ are hyperparameters of the kernels, $r = \lVert x - x' \rVert_2$ is the distance between $x$ and $x'$, and $B_\nu$ is the modified Bessel function.

## 3 Warm-up: the first algorithm

Towards designing a BO algorithm for heavy-tailed observations, we briefly recall the standard GP-UCB algorithm for the sub-Gaussian setting. GP-UCB at time $t$ chooses the point $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \mu_{t-1}(x) + \beta_t \sigma_{t-1}(x)$, where
$$\mu_t(x) = k_t(x)^T (K_t + \lambda I_t)^{-1} Y_t \quad\text{and}\quad \sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + \lambda I_t)^{-1} k_t(x)$$
are the posterior mean and variance functions after $t$ observations from a function drawn from the GP prior $GP_{\mathcal{X}}(0, k)$, with additive i.i.d. Gaussian noise $N(0, \lambda)$. Here $Y_t = [y_1, \ldots, y_t]^T$ is the vector of observations, $K_t = [k(u, v)]_{u,v \in X_t}$ is the kernel matrix computed at the set $X_t = \{x_1, \ldots, x_t\}$, $k_t(x) = [k(x_1, x), \ldots, k(x_t, x)]^T$, and $I_t$ is the identity matrix of order $t$. If the noise $\eta_t$ is assumed conditionally $R$-sub-Gaussian, i.e., $\mathbb{E}\big[e^{\gamma \eta_t} \mid \mathcal{F}_{t-1}\big] \le \exp\big(\frac{\gamma^2 R^2}{2}\big)$ for all $\gamma \in \mathbb{R}$, then using $\beta_{t+1} = O\big(R \sqrt{\ln |I_t + \lambda^{-1} K_t|}\big)$ ensures $\tilde{O}(\sqrt{T})$ regret [11], as the posterior GP concentrates rapidly on the true function $f$. However, when the sub-Gaussian assumption does not hold, we cannot expect the posterior GP to have such a nice concentration property. In fact, it is known that the ridge regression estimator $\mu_t \in \mathcal{H}_k(\mathcal{X})$ of $f$ is not robust when the noise exhibits heavy fluctuations [19]. So, in order to tackle heavy-tailed noise, one needs more robust estimates $\hat{\mu}_t$ of $f$ along with suitable confidence sets.
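For concreteness, the exact posterior computations above can be sketched in a few lines of NumPy. This is our own illustrative code (function names, the finite candidate set, and the default hyperparameters are our assumptions), not the authors' implementation; it assumes $k = k_{\mathrm{SE}}$ so that $k(x, x) = 1$.

```python
import numpy as np

def se_kernel(A, B, l=0.2):
    """SE kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * l ** 2))

def gp_posterior(X_t, Y_t, X_cand, lam=1.0, l=0.2):
    """Posterior mean mu_t and variance sigma_t^2 at all candidates, as in Section 3."""
    K_t = se_kernel(X_t, X_t, l)                    # t x t kernel matrix
    k_cand = se_kernel(X_t, X_cand, l)              # t x |X| cross-kernel matrix
    # Solve (K_t + lam I) A = [Y_t | k_cand] once, reused for mean and variance.
    A = np.linalg.solve(K_t + lam * np.eye(len(X_t)),
                        np.column_stack([Y_t, k_cand]))
    mu = k_cand.T @ A[:, 0]                         # k_t(x)^T (K_t + lam I)^{-1} Y_t
    var = 1.0 - np.sum(k_cand * A[:, 1:], axis=0)   # k(x, x) = 1 for the SE kernel
    return mu, var
```

Solving one linear system for both the mean and the variance terms avoids the explicit matrix inverse, which is the standard numerically stable choice.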
A natural idea to curb the effects of heavy fluctuations is to truncate high rewards [8]. Our first algorithm, Truncated GP-UCB (Algorithm 1), is based on this idea.

Truncated GP-UCB (TGP-UCB) algorithm: At each time $t$, we truncate the reward $y_t$ to zero if it is larger than a suitably chosen truncation level $b_t$, i.e., we set the truncated reward $\hat{y}_t = y_t \mathbb{1}_{\{|y_t| \le b_t\}}$. Then, we construct the truncated version of the posterior mean as $\hat{\mu}_t(x) = k_t(x)^T (K_t + \lambda I_t)^{-1} \hat{Y}_t$, where $\hat{Y}_t = [\hat{y}_1, \ldots, \hat{y}_t]^T$, and simply run GP-UCB with $\hat{\mu}_t$ instead of $\mu_t$. The truncation level $b_t$ can be adapted with time $t$: we choose an increasing sequence of $b_t$'s, i.e., as time progresses and the confidence interval shrinks, we truncate less and less aggressively. Finally, in order to account for the bias introduced by truncation, we blow up the confidence width $\beta_t$ of GP-UCB by a multiplicative factor of $b_t$, so that $f(x)$ is contained in the interval $\hat{\mu}_{t-1}(x) \pm \beta_t \sigma_{t-1}(x)$ with high probability. This helps us obtain a sub-linear regret bound for TGP-UCB, given in Theorem 1, with the full proof deferred to Appendix B.

Algorithm 1 Truncated GP-UCB (TGP-UCB)
- Input: Parameters $\lambda > 0$, $\{b_t\}_{t \ge 1}$, $\{\beta_t\}_{t \ge 1}$
- Set $\hat{\mu}_0(x) = 0$ and $\sigma_0^2(x) = k(x, x)$ for all $x \in \mathcal{X}$
- for $t = 1, 2, 3, \ldots$ do
  - Play $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \hat{\mu}_{t-1}(x) + \beta_t \sigma_{t-1}(x)$ and observe payoff $y_t$
  - Set $\hat{y}_t = y_t \mathbb{1}_{\{|y_t| \le b_t\}}$ and $\hat{Y}_t = [\hat{y}_1, \ldots, \hat{y}_t]^T$
  - Compute $\hat{\mu}_t(x) = k_t(x)^T (K_t + \lambda I_t)^{-1} \hat{Y}_t$ and $\sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + \lambda I_t)^{-1} k_t(x)$
- end for

Theorem 1 (Regret bound for TGP-UCB) Let $f \in \mathcal{H}_k(\mathcal{X})$, $\lVert f \rVert_{\mathcal{H}} \le B$ and $k(x, x) \le 1$ for all $x \in \mathcal{X}$. Let $\mathbb{E}\big[|y_t|^{1+\alpha} \mid \mathcal{F}_{t-1}\big] \le v < \infty$ for some $\alpha \in (0, 1]$ and for all $t \ge 1$. Then, for any $\delta \in (0, 1]$, TGP-UCB, with $b_t = v^{\frac{1}{1+\alpha}} t^{\frac{1}{2(1+\alpha)}}$ and $\beta_{t+1} = B + b_t \sqrt{3 \ln |I_t + \lambda^{-1} K_t| + 2 \ln(1/\delta)}$, enjoys, with probability at least $1 - \delta$, the regret bound
$$R_T = O\Big(B \sqrt{T \gamma_T} + v^{\frac{1}{1+\alpha}} \sqrt{\gamma_T \big(\gamma_T + \ln(1/\delta)\big)}\; T^{\frac{2+\alpha}{2(1+\alpha)}}\Big), \quad\text{where}\quad \gamma_T \equiv \gamma_T(k, \mathcal{X}) = \max_{A \subset \mathcal{X}: |A| = T} \frac{1}{2} \ln \big|I_T + \lambda^{-1} K_A\big|.$$

Here, $\gamma_T$ denotes the maximum information gain about any $f \sim GP_{\mathcal{X}}(0, k)$ after $T$ noisy observations obtained by passing $f$ through an i.i.d. Gaussian channel $N(0, \lambda)$, and measures the reduction in the uncertainty about $f$ after $T$ noisy observations. It is a property of the kernel $k$ and the domain $\mathcal{X}$; e.g., if $\mathcal{X}$ is compact and convex, then $\gamma_T = O\big((\ln T)^{d+1}\big)$ for $k_{\mathrm{SE}}$ and $O\big(T^{\frac{d(d+1)}{2\nu + d(d+1)}} \ln T\big)$ for $k_{\text{Matérn}}$ [35].

Remark 1. An $R$-sub-Gaussian environment satisfies the moment condition with $\alpha = 1$ and $v = R^2$, so the result implies a sub-linear $\tilde{O}(T^{3/4})$ regret bound for TGP-UCB in sub-Gaussian environments.
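Continuing the sketch from Section 3, one round of TGP-UCB then looks as follows. This is our illustrative code, reusing the hypothetical `se_kernel` and `gp_posterior` helpers defined there; the forms of $b_t$ and $\beta_t$ follow Theorem 1, but the snippet is a sketch rather than the authors' implementation.

```python
import numpy as np

def tgp_ucb_round(t, X_t, Y_raw, X_cand, v, alpha, B, lam=1.0, delta=0.1):
    """One round of Algorithm 1: truncate all past rewards at level b_t, rebuild the
    truncated posterior mean, and play the maximizer of the resulting UCB score."""
    b_t = v ** (1.0 / (1 + alpha)) * t ** (1.0 / (2 * (1 + alpha)))
    Y_hat = np.where(np.abs(Y_raw) <= b_t, Y_raw, 0.0)      # y_hat = y * 1{|y| <= b_t}
    mu_hat, var = gp_posterior(X_t, Y_hat, X_cand, lam)     # posterior on truncated rewards
    # Confidence width of Theorem 1: the usual GP-UCB width, blown up by b_t.
    logdet = np.linalg.slogdet(np.eye(len(X_t)) + se_kernel(X_t, X_t) / lam)[1]
    beta = B + b_t * np.sqrt(3 * logdet + 2 * np.log(1 / delta))
    ucb = mu_hat + beta * np.sqrt(np.maximum(var, 0.0))
    return X_cand[np.argmax(ucb)]
```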
## 4 Regret lower bound

Establishing lower bounds under general kernel smoothness structure is an open problem even when the payoffs are Gaussian. Similar to Scarlett et al. [31], we focus only on the SE and Matérn kernels.

Theorem 2 (Lower bound on cumulative regret) Let $\mathcal{X} = [0, 1]^d$ for some $d \in \mathbb{N}$. Fix a kernel $k \in \{k_{\mathrm{SE}}, k_{\text{Matérn}}\}$, $B > 0$, $T \in \mathbb{N}$, $\alpha \in (0, 1]$ and $v > 0$. Given any algorithm, there exists a function $f \in \mathcal{H}_k(\mathcal{X})$ with $\lVert f \rVert_{\mathcal{H}} \le B$, and a reward distribution satisfying $\mathbb{E}\big[|y_t|^{1+\alpha} \mid \mathcal{F}_{t-1}\big] \le v$ for all $t \in [T] := \{1, 2, \ldots, T\}$, such that when the algorithm is run with this $f$ and reward distribution, its regret satisfies:

1. $\mathbb{E}[R_T] = \Omega\Big(v^{\frac{1}{1+\alpha}} \big(\ln(v^{-\frac{1}{\alpha}} T)\big)^{\frac{d\alpha}{2(1+\alpha)}}\; T^{\frac{1}{1+\alpha}}\Big)$ if $k = k_{\mathrm{SE}}$;
2. $\mathbb{E}[R_T] = \Omega\Big(v^{\frac{\nu}{\nu(1+\alpha)+d\alpha}}\, B^{\frac{d\alpha}{\nu(1+\alpha)+d\alpha}}\, T^{\frac{\nu+d\alpha}{\nu(1+\alpha)+d\alpha}}\Big)$ if $k = k_{\text{Matérn}}$.

The proof argument is inspired by that of Scarlett et al. [31], which provides a lower bound for BO under i.i.d. Gaussian noise, but with nontrivial changes to account for heavy-tailed observations. The proof is based on constructing a finite subset of difficult functions in $\mathcal{H}_k(\mathcal{X})$. Specifically, we choose $f$ as a uniformly sampled function from a finite set $\{f_1, \ldots, f_M\}$, where each $f_j$ is obtained by shifting a common function $g \in \mathcal{H}_k(\mathbb{R}^d)$ by a different amount, such that each of these has a unique maximum, and then cropping to $\mathcal{X} = [0, 1]^d$. Here $g$ takes values in $[-2\epsilon, 2\epsilon]$, with the maximum attained at $x = 0$. The function $g$ is constructed properly, and the parameters $\epsilon, M$ are chosen appropriately based on the kernel $k$ and the fixed constants $B, T, \alpha, v$, such that any $\epsilon$-optimal point for $f_j$ fails to be an $\epsilon$-optimal point for any other $f_{j'}$, and such that $\lVert f_j \rVert_{\mathcal{H}} \le B$ for all $j \in [M]$. The reward takes values in $\big\{\operatorname{sgn}(f(x))\,(v/2)^{1/\alpha},\, 0\big\}$, with the former occurring with probability $(2/v)^{1/\alpha} |f(x)|$, such that, for every $x \in \mathcal{X}$, the expected reward is $f(x)$ and the $(1+\alpha)$-th raw moment is upper bounded by $v$. Now, if we can lower bound the regret averaged over $j \in [M]$, then there must exist some $f_j$ for which the bound holds. The formal proof is deferred to Appendix C.
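As a quick sanity check (ours), with the scale $(v/2)^{1/\alpha}$ and the probability $(2/v)^{1/\alpha}|f(x)|$ stated above, the claimed mean and moment properties follow directly:

```latex
% y = sgn(f(x)) (v/2)^{1/alpha} with probability (2/v)^{1/alpha} |f(x)|, and y = 0 otherwise:
\mathbb{E}[y \mid x]
  = \operatorname{sgn}(f(x)) \Big(\tfrac{v}{2}\Big)^{\frac{1}{\alpha}}
    \Big(\tfrac{2}{v}\Big)^{\frac{1}{\alpha}} |f(x)| = f(x),
\qquad
\mathbb{E}\big[|y|^{1+\alpha} \mid x\big]
  = \Big(\tfrac{v}{2}\Big)^{\frac{1+\alpha}{\alpha}}
    \Big(\tfrac{2}{v}\Big)^{\frac{1}{\alpha}} |f(x)|
  = \tfrac{v}{2}\, |f(x)| \;\le\; v\epsilon \;\le\; v,
```

using $|f(x)| \le 2\epsilon$ and $\epsilon \le 1$.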
Remark 2. Theorem 2 suggests that (a) TGP-UCB may be suboptimal, and (b) for the SE kernel, it may be possible to design algorithms recovering the $\tilde{O}(\sqrt{T})$ regret bound under finite variances ($\alpha = 1$).

## 5 An optimal algorithm under heavy-tailed rewards

In view of the gap between the regret bound for TGP-UCB and the fundamental lower bound, it is possible that TGP-UCB (Algorithm 1) does not completely mitigate the effect of heavy-tailed fluctuations, and perhaps truncation in a different domain may work better. In fact, for parametric linear bandits (i.e., BO with finite-dimensional linear kernels), it has been shown that appropriate truncation in feature space improves regret performance compared to truncating raw observations [34], and in this case the feature dimension explicitly appears in the regret bound. However, the main challenge in the more general nonparametric setting is that the feature space is infinite-dimensional, which would yield a trivial regret upper bound. If we can find an approximate feature map $\varphi: \mathcal{X} \to \mathbb{R}^m$ into a low-dimensional Euclidean inner product space $\mathbb{R}^m$ such that $k(x, y) \approx \varphi(x)^T \varphi(y)$, then we can perform the above feature-adaptive truncation effectively, as well as keep the error introduced by the approximation in control. Such a kernel approximation can be done efficiently either in a data-independent way (Fourier features approximation [28]) or in a data-dependent way (Nyström approximation [12]), and has been used in the context of BO to reduce the time complexity of GP-UCB [26, 9]. In this work, however, the approximations are crucial to obtaining optimal theoretical guarantees. We now describe our algorithm, Adaptively Truncated Approximate GP-UCB (Algorithm 2).

### 5.1 Adaptively Truncated Approximate GP-UCB (ATA-GP-UCB) algorithm

At each round $t$, we select an arm $x_t$ which maximizes the approximate (under kernel approximation) GP-UCB score $\tilde{\mu}_{t-1}(x) + \beta_t \tilde{\sigma}_{t-1}(x)$, where $\tilde{\mu}_{t-1}(x)$ and $\tilde{\sigma}_{t-1}^2(x)$ denote the approximate posterior mean and variance from the previous round, respectively, and $\beta_t$ is an appropriately chosen confidence width. Then, we update $\tilde{\mu}_t(x)$ and $\tilde{\sigma}_t^2(x)$ as follows. First, we find a feature embedding $\varphi_t \in \mathbb{R}^{m_t}$, of some appropriate dimension $m_t$, which approximates the kernel efficiently. Then, we find the rows $u_1^T, \ldots, u_{m_t}^T$ of the matrix $V_t^{-1/2} \Phi_t^T$, where $\Phi_t = [\varphi_t(x_1), \ldots, \varphi_t(x_t)]^T$ and $V_t = \Phi_t^T \Phi_t + \lambda I_{m_t}$, and use these as weight vectors for truncating the rewards in each of the $m_t$ directions, setting $\hat{r}_i = \sum_{\tau=1}^t u_{i,\tau}\, y_\tau\, \mathbb{1}_{\{|u_{i,\tau} y_\tau| \le b_t\}}$ for all $i \in [m_t]$, where $b_t$ specifies the truncation level. Then, we form our estimate of $f$ as $\tilde{\theta}_t = V_t^{-1/2} [\hat{r}_1, \ldots, \hat{r}_{m_t}]^T$. Finally, we approximate the posterior mean as $\tilde{\mu}_t(x) = \varphi_t(x)^T \tilde{\theta}_t$, and the posterior variance as (i) $\tilde{\sigma}_t^2(x) = \lambda\, \varphi_t(x)^T V_t^{-1} \varphi_t(x)$ for the Fourier features approximation, or as (ii) $\tilde{\sigma}_t^2(x) = k(x, x) - \varphi_t(x)^T \varphi_t(x) + \lambda\, \varphi_t(x)^T V_t^{-1} \varphi_t(x)$ for the Nyström approximation. Now it only remains to describe how to find the feature embeddings $\varphi_t$.

(a) Quadrature Fourier features (QFF) approximation: If $k$ is a bounded, continuous, positive definite, stationary kernel satisfying $k(x, x) = 1$, then by Bochner's theorem [4], $k$ is the Fourier transform of a probability measure $p$, i.e., $k(x, y) = \int_{\mathbb{R}^d} p(\omega) \cos\big(\omega^T (x - y)\big)\, d\omega$. For the SE kernel, this measure has density $p(\omega) = \big(\frac{l}{\sqrt{2\pi}}\big)^d e^{-\frac{l^2 \lVert \omega \rVert_2^2}{2}}$ (abusing notation for measure and density). Mutny and Krause [26] show that for any stationary kernel $k$ on $\mathbb{R}^d$ whose inverse Fourier transform decomposes product-wise, i.e., $p(\omega) = \prod_{j=1}^d p_j(\omega_j)$, we can use Gauss-Hermite quadrature [18] to approximate it. If $\mathcal{X} = [0, 1]^d$, the SE kernel is approximated as follows. Choose $\bar{m} \in \mathbb{N}$, set $m = \bar{m}^d$, and construct the $2m$-dimensional feature map
$$\varphi(x)_i = \begin{cases} \sqrt{\nu(\omega_i)}\, \cos\big(\frac{\sqrt{2}}{l} \omega_i^T x\big) & \text{if } 1 \le i \le m, \\ \sqrt{\nu(\omega_{i-m})}\, \sin\big(\frac{\sqrt{2}}{l} \omega_{i-m}^T x\big) & \text{if } m + 1 \le i \le 2m. \end{cases} \tag{1}$$
Here the set $\{\omega_1, \ldots, \omega_m\} = \underbrace{\bar{A}_{\bar{m}} \times \cdots \times \bar{A}_{\bar{m}}}_{d \text{ times}}$, where $\bar{A}_{\bar{m}}$ is the set of $\bar{m}$ (real) roots of the $\bar{m}$-th Hermite polynomial $H_{\bar{m}}$, and $\nu(z) = \prod_{j=1}^d \frac{2^{\bar{m}-1} \bar{m}!}{\bar{m}^2 H_{\bar{m}-1}(z_j)^2}$ for all $z \in \mathbb{R}^d$. For our purposes, we will have ATA-GP-UCB work with the embedding $\varphi_t(x) = \varphi(x)$ of dimension $m_t = 2m$ for all $t \ge 1$.

Remark 3. The seminal work of Rahimi and Recht [28], which develops the random Fourier feature (RFF) approximation of any stationary kernel, is based on the feature map $\varphi(x) = \frac{1}{\sqrt{m}}\big[\cos(\omega_1^T x), \ldots, \cos(\omega_m^T x), \sin(\omega_1^T x), \ldots, \sin(\omega_m^T x)\big]^T$, where each $\omega_i$ is sampled independently from $p(\omega)$. However, RFF embeddings do not appear to be useful for our purpose of achieving sublinear regret (see the discussion after Lemma 1), so we work with the QFF embedding.
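In one dimension, the weights $\nu(\omega_i)$ are exactly the Gauss-Hermite quadrature weights divided by $\sqrt{\pi}$, so the embedding (1) can be sketched directly with NumPy's Hermite routines. The snippet below is our own illustration for $d = 1$, not the authors' code:

```python
import numpy as np

def qff_features(x, m_bar=32, l=0.2):
    """QFF embedding (1) of the 1-D SE kernel: Gauss-Hermite nodes omega_i (roots of
    H_{m_bar}) with weights nu(omega_i) = w_i / sqrt(pi); output dimension is 2 * m_bar."""
    nodes, weights = np.polynomial.hermite.hermgauss(m_bar)
    nu = weights / np.sqrt(np.pi)
    arg = (np.sqrt(2.0) / l) * np.outer(np.atleast_1d(x), nodes)
    return np.concatenate([np.sqrt(nu) * np.cos(arg),
                           np.sqrt(nu) * np.sin(arg)], axis=1)

# Sanity check: phi(x)^T phi(y) should approximate k_SE(x, y) = exp(-(x - y)^2 / (2 l^2)).
x, y = 0.3, 0.7
approx = qff_features(x)[0] @ qff_features(y)[0]
exact = np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))
print(abs(approx - exact))  # negligible for m_bar = 32; cf. the error bound in Section 5.2
```

The approximation works because $\varphi(x)^T \varphi(y) = \sum_i \nu(\omega_i) \cos\big(\frac{\sqrt{2}}{l}\omega_i (x - y)\big)$ by the angle-difference identity, which is precisely the Gauss-Hermite quadrature of Bochner's integral.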
(b) Nyström approximation: Unlike the QFF approximation, where the basis functions (cosines and sines) do not depend on the data, the basis functions used by the Nyström method are data-dependent. For a set of points $X_t = \{x_1, \ldots, x_t\}$, the Nyström method [42] approximates the kernel matrix $K_t$ as follows. First, sample a random number $m_t$ of points from $X_t$ to construct a dictionary $D_t = \{x_{i_1}, \ldots, x_{i_{m_t}}\}$, $i_j \in [t]$, according to the following distribution: for each $i \in [t]$, include $x_i$ in $D_t$ independently with probability $p_{t,i} = \min\{\bar{q}\, \tilde{\sigma}_{t-1}^2(x_i), 1\}$ for a suitably chosen parameter $\bar{q}$ (which trades off between the quality and the size of the embedding). Then, compute the (approximate) finite-dimensional feature embedding $\varphi_t(x) = \big(K_{D_t}^{1/2}\big)^{+} k_{D_t}(x)$, where $K_{D_t} = [k(u, v)]_{u,v \in D_t}$, $k_{D_t}(x) = [k(x_{i_1}, x), \ldots, k(x_{i_{m_t}}, x)]^T$, and $A^{+}$ denotes the pseudo-inverse of a matrix $A$. We call the entire procedure Nyström Embedding (pseudocode in the appendix).

Algorithm 2 Adaptively Truncated Approximate GP-UCB (ATA-GP-UCB)
- Input: Parameters $\lambda > 0$, $\{b_t\}_{t \ge 1}$, $\{\beta_t\}_{t \ge 1}$, $\bar{q}$, a kernel approximation (QFF or Nyström)
- Set $\tilde{\mu}_0(x) = 0$ and $\tilde{\sigma}_0^2(x) = k(x, x)$ for all $x \in \mathcal{X}$
- for $t = 1, 2, 3, \ldots$ do
  - Play $x_t = \operatorname{argmax}_{x \in \mathcal{X}} \tilde{\mu}_{t-1}(x) + \beta_t \tilde{\sigma}_{t-1}(x)$ and observe payoff $y_t$
  - Set $\varphi_t(x) = \varphi(x)$ under the QFF approximation, or $\varphi_t(x) = $ Nyström Embedding$\big(\{(x_i, \tilde{\sigma}_{t-1}(x_i))\}_{i=1}^t, \bar{q}\big)$ under the Nyström approximation
  - Set $\Phi_t^T = [\varphi_t(x_1), \ldots, \varphi_t(x_t)]$ and $V_t = \Phi_t^T \Phi_t + \lambda I_{m_t}$, where $m_t$ is the dimension of $\varphi_t$
  - Find the rows $u_1^T, \ldots, u_{m_t}^T$ of $V_t^{-1/2} \Phi_t^T$ and set $\hat{r}_i = \sum_{\tau=1}^t u_{i,\tau}\, y_\tau\, \mathbb{1}_{\{|u_{i,\tau} y_\tau| \le b_t\}}$ for all $i \in [m_t]$
  - Set $\tilde{\theta}_t = V_t^{-1/2} [\hat{r}_1, \ldots, \hat{r}_{m_t}]^T$ and compute $\tilde{\mu}_t(x) = \varphi_t(x)^T \tilde{\theta}_t$
  - Set $\tilde{\sigma}_t^2(x) = \lambda\, \varphi_t(x)^T V_t^{-1} \varphi_t(x)$ (i) under the QFF approximation, or $\tilde{\sigma}_t^2(x) = k(x, x) - \varphi_t(x)^T \varphi_t(x) + \lambda\, \varphi_t(x)^T V_t^{-1} \varphi_t(x)$ (ii) under the Nyström approximation
- end for

Remark 4. It is well known ($\lambda$-ridge leverage score sampling [1]) that, by sampling points proportionally to their posterior variances $\sigma_t^2(x)$, one can obtain an accurate embedding $\varphi_t(x)$, which in turn gives an accurate approximation $\tilde{\sigma}_t^2(x)$. But computing $\sigma_t^2(x)$ in turn requires inverting $K_t$, which takes at most $O(t^3)$ time. So, we make use of the already computed approximations $\tilde{\sigma}_{t-1}^2(x)$ to sample points at round $t$, without significantly compromising the accuracy of the embeddings [9].

Remark 5. The choice (i) of $\tilde{\sigma}_t^2(x)$ in Algorithm 2 ensures accurate estimation of the variance of $x$ under the QFF approximation [26]. But the same choice leads to severe underestimation of the variance under the Nyström approximation, especially when $x$ is far away from $D_t$. The choice (ii) of $\tilde{\sigma}_t^2(x)$ in Algorithm 2 is known as the deterministic training conditional in the GP literature [27], and provably prevents the phenomenon of variance starvation under the Nyström approximation [9].
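The core update of Algorithm 2, per-direction truncation followed by the approximate UCB score, can be sketched as follows. This is our illustrative code (SciPy's `sqrtm` computes the matrix square root); any embedding $\varphi_t$, QFF or Nyström, can be plugged in via the feature matrix `Phi`.

```python
import numpy as np
from scipy.linalg import sqrtm

def ata_estimate(Phi, y, b_t, lam=1.0):
    """Truncated estimate theta_t of Algorithm 2 from the t x m_t feature matrix Phi and
    raw rewards y: truncate the reward contributions in each of the m_t directions."""
    m = Phi.shape[1]
    V = Phi.T @ Phi + lam * np.eye(m)
    V_inv_sqrt = np.real(sqrtm(np.linalg.inv(V)))   # V_t^{-1/2}
    U = V_inv_sqrt @ Phi.T                          # rows u_1^T, ..., u_{m_t}^T
    C = U * y[None, :]                              # contributions u_{i,tau} * y_tau
    r_hat = np.where(np.abs(C) <= b_t, C, 0.0).sum(axis=1)
    return V_inv_sqrt @ r_hat                       # theta_t = V^{-1/2} [r_1, ..., r_m]^T

def ata_ucb(Phi, y, phi_cand, b_t, beta, lam=1.0):
    """Approximate GP-UCB score mu_t(x) + beta * sigma_t(x) with the QFF variance (i)."""
    theta = ata_estimate(Phi, y, b_t, lam)
    V_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(Phi.shape[1]))
    mu = phi_cand @ theta
    var = lam * np.einsum('ij,jk,ik->i', phi_cand, V_inv, phi_cand)
    return mu + beta * np.sqrt(np.maximum(var, 0.0))
```

Note that without truncation ($b_t = \infty$), $\tilde{\theta}_t$ reduces to the ordinary ridge estimate $V_t^{-1} \Phi_t^T Y_t$; the truncation only discards individual per-direction contributions whose magnitude exceeds $b_t$.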
### 5.2 Cumulative regret of ATA-GP-UCB with QFF embeddings

The following lemma shows that the data-adaptive truncation of all the historical rewards, together with a good approximation of the kernel, helps us obtain a tighter confidence interval than TGP-UCB.

Lemma 1 (Tighter confidence sets with QFF truncation) For any $\delta \in (0, 1]$, ATA-GP-UCB with QFF approximation and parameters $b_t = \big(\frac{v}{\ln(2mT/\delta)}\big)^{\frac{1}{1+\alpha}} t^{\frac{1-\alpha}{2(1+\alpha)}}$ and $\beta_{t+1} = B + 4\sqrt{m/\lambda}\; v^{\frac{1}{1+\alpha}} \big(\ln(2mT/\delta)\big)^{\frac{\alpha}{1+\alpha}} t^{\frac{1-\alpha}{2(1+\alpha)}}$ ensures that, with probability at least $1 - \delta$, uniformly over all $t \in [T]$ and $x \in \mathcal{X}$,
$$|f(x) - \tilde{\mu}_{t-1}(x)| \le \beta_t \tilde{\sigma}_{t-1}(x) + O\big(B \varepsilon_m^{1/2} t^2\big), \tag{2}$$
where the QFF dimension $m$ is such that $\sup_{x,y \in \mathcal{X}} \big|k(x, y) - \varphi(x)^T \varphi(y)\big| =: \varepsilon_m < 1$.

Here, the scaling $t^{\frac{1-\alpha}{2(1+\alpha)}}$ of the confidence width $\beta_t$ is much smaller than the scaling $t^{\frac{1}{2(1+\alpha)}}$ in TGP-UCB, which eventually leads to a tighter confidence interval. However, in order to achieve sublinear cumulative regret, we need to ensure that the approximation error $\varepsilon_m$ decays at least as fast as $O(1/T^6)$ while the feature dimension $m$ grows no faster than $\mathrm{polylog}(T)$. This ensures that the regret accumulated due to the second term on the RHS of (2) is $O(1)$, while the contribution from the first term is $\tilde{O}\big(T^{\frac{1}{1+\alpha}}\big)$, since the sum of the approximate posterior standard deviations grows only as $O(\sqrt{mT})$. Now, the QFF embedding (1) of $k_{\mathrm{SE}}$ can be shown to achieve $\varepsilon_m \le \frac{d\, 2^{d-1}}{\bar{m}!} \big(\frac{1}{4l^2}\big)^{\bar{m}} = O\big(d\, 2^{d-1} (\bar{m} l^2)^{-\bar{m}}\big)$ [26]. The decay is exponential when $\bar{m} > 1/l^2$ and $d = O(1)$. (For most BO applications the effective dimensionality of the problem is low, e.g., for additive models [21, 30].) Setting $\bar{m} = \Theta\big(\log_{4/e} T^6\big)$, we can ensure that $\varepsilon_m^{1/2} T^3 = O(1)$ and $m = O\big((\ln T)^d\big)$, and thus, in turn, a sublinear regret bound. (Under the RFF approximation, in contrast, $\varepsilon_m = O(\sqrt{1/m})$ [36]; hence ATA-GP-UCB with RFF embeddings does not achieve sublinear regret.) The following theorem states this formally, with the full proof deferred to Appendix D.2.

Theorem 3 (Regret bound for ATA-GP-UCB with QFF embedding) Fix any $\delta \in (0, 1]$. Then, under the same hypotheses as Theorem 1, for $\mathcal{X} = [0, 1]^d$ and $k = k_{\mathrm{SE}}$, ATA-GP-UCB under QFF approximation, with parameters $b_t$ and $\beta_t$ set as in Lemma 1, and with the embedding $\varphi$ from (1) such that $\bar{m} > 1/l^2$ and $\bar{m} = \Theta\big(\log_{4/e} T^6\big)$, enjoys, with probability at least $1 - \delta$, the regret bound
$$R_T = \tilde{O}\Big(B \sqrt{T (\ln T)^{d+1}} + v^{\frac{1}{1+\alpha}} (\ln T)^d\, T^{\frac{1}{1+\alpha}}\Big),$$
where $\tilde{O}$ hides additional $\ln T$ and $\ln(1/\delta)$ factors.

Remark 6. When the variance of the rewards is finite (i.e., $\alpha = 1$), the cumulative regret of ATA-GP-UCB under QFF approximation of the SE kernel is $\tilde{O}\big((\ln T)^{d+1} \sqrt{T}\big)$, which recovers the state-of-the-art regret bound of GP-UCB under sub-Gaussian rewards [26, Corollary 2], unlike the earlier TGP-UCB. It is worth pointing out that the bound in Theorem 3 holds only for the SE kernel defined on $\mathcal{X} = [0, 1]^d$, and designing a no-regret BO strategy under the QFF approximation of any other stationary kernel remains an open question even when the rewards are sub-Gaussian [26].

### 5.3 Cumulative regret of ATA-GP-UCB with Nyström embeddings

We will now show that ATA-GP-UCB under Nyström approximation achieves optimal regret for any stationary kernel defined on $\mathcal{X} \subset \mathbb{R}^d$, without any restriction on $d$. Similar to Lemma 1, ATA-GP-UCB under Nyström approximation also maintains tighter confidence sets than TGP-UCB. As before, the confidence sets are useful only if the dimension $m_t$ of the embeddings grows no faster than $\mathrm{polylog}(t)$. Moreover, we also need to ensure that the approximate posterior variances are only a constant factor away from the exact ones. Then, since the sum of the posterior standard deviations grows only as $O(\sqrt{T \gamma_T})$, we can achieve the optimal $\tilde{O}\big(T^{\frac{1}{1+\alpha}}\big)$ regret scaling. Now, for any $\varepsilon \in (0, 1)$, setting $\bar{q} = \frac{6(1+\varepsilon)}{1-\varepsilon} \ln(2T/\delta)/\varepsilon^2$, the Nyström embeddings $\varphi_t$ can be shown to achieve $m_t \le \frac{6(1+\varepsilon)}{\lambda} \bar{q} \gamma_t$ and $\frac{1-\varepsilon}{1+\varepsilon} \sigma_t^2(x) \le \tilde{\sigma}_t^2(x) \le \frac{1+\varepsilon}{1-\varepsilon} \sigma_t^2(x)$ with probability at least $1 - \delta$ [9], which helps us achieve an optimal regret bound. The following theorem states this formally, with the full proof deferred to Appendix D.3.

Theorem 4 (Regret bound for ATA-GP-UCB with Nyström embedding) Fix any $\delta \in (0, 1]$, $\varepsilon \in (0, 1)$, and set $\rho = \frac{1+\varepsilon}{1-\varepsilon}$. Then, under the same hypotheses as Theorem 1, ATA-GP-UCB under Nyström approximation, with parameters $\bar{q} = 6\rho \ln(4T/\delta)/\varepsilon^2$, $b_t = \big(\frac{v}{\ln(4 m_t T/\delta)}\big)^{\frac{1}{1+\alpha}} t^{\frac{1-\alpha}{2(1+\alpha)}}$ and $\beta_{t+1} = B\big(1 + \frac{1}{1-\varepsilon}\big) + 4\sqrt{m_t/\lambda}\; v^{\frac{1}{1+\alpha}} \big(\ln(4 m_t T/\delta)\big)^{\frac{\alpha}{1+\alpha}} t^{\frac{1-\alpha}{2(1+\alpha)}}$, enjoys, with probability at least $1 - \delta$, the regret bound
$$R_T = \tilde{O}\Big(\rho B \big(1 + \tfrac{1}{1-\varepsilon}\big) \sqrt{T \gamma_T} + \tfrac{\rho}{\varepsilon}\, v^{\frac{1}{1+\alpha}}\, \gamma_T\, T^{\frac{1}{1+\alpha}}\Big),$$
where $\tilde{O}$ hides $\ln(T/\delta)$ factors.

Remark 7. Theorems 3 and 4 imply that ATA-GP-UCB achieves an $\tilde{O}\big(v^{\frac{1}{1+\alpha}} (\ln T)^d\, T^{\frac{1}{1+\alpha}}\big)$ regret bound for $k_{\mathrm{SE}}$, which matches the lower bound (Theorem 2) up to a factor in the exponent of $\ln T$, as well as a few extra $\ln T$ factors hidden in the $\tilde{O}$ notation. For the Matérn kernel, the bound is $\tilde{O}\Big(T^{\frac{2\nu + (2+\alpha)d(d+1)}{(1+\alpha)(2\nu + d(d+1))}}\Big)$, which is sublinear only when $\frac{d(d+1)}{2\nu} < \alpha$ (this holds, for example, for the Matérn kernel on $\mathbb{R}^2$ with $\nu = 3.5$ when the variance of the rewards is finite, $\alpha = 1$), and the gap from the lower bound is more significant in this case. It is worth mentioning that a similar gap is present even in the (easier) setting of sub-Gaussian rewards [31], and there might exist better algorithms which can bridge this gap. When the variance of the rewards is finite (i.e., $\alpha = 1$), the cumulative regret of ATA-GP-UCB under Nyström approximation is $\tilde{O}(\gamma_T \sqrt{T})$, which recovers the state-of-the-art regret bound under sub-Gaussian rewards [9, Thm. 2]. For the linear bandit setting, i.e., when the feature map is $\varphi_t(x) = x$ itself, substituting $\gamma_T = O(d \ln T)$ we find that the regret upper bound in Theorem 4 recovers the (optimal) regret bound of [34, Thm. 3] up to a logarithmic factor.
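For completeness, here is a sketch (ours, not the authors' pseudocode, which is in the appendix) of the Nyström Embedding routine from Section 5.1, with the variance-proportional inclusion probabilities $p_{t,i} = \min\{\bar{q}\, \tilde{\sigma}_{t-1}^2(x_i), 1\}$:

```python
import numpy as np
from scipy.linalg import sqrtm

def nystrom_embedding(X_t, var_prev, q_bar, kernel, seed=0):
    """Sample a dictionary D_t from X_t with p_{t,i} = min(q_bar * var_{t-1}(x_i), 1),
    then return phi_t(x) = (K_{D_t}^{1/2})^+ k_{D_t}(x) and the embedding dimension m_t."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(X_t)) < np.minimum(q_bar * var_prev, 1.0)
    D = X_t[keep]                                                  # dictionary points
    K_D_sqrt_pinv = np.linalg.pinv(np.real(sqrtm(kernel(D, D))))   # (K_{D_t}^{1/2})^+

    def phi(x):
        return K_D_sqrt_pinv @ kernel(D, np.atleast_2d(x)).ravel()

    return phi, len(D)
```

With the hypothetical `se_kernel` helper from the Section 3 snippet, `phi, m_t = nystrom_embedding(X_t, var_prev, q_bar, se_kernel)` produces the embedding used to build $\Phi_t$ in Algorithm 2.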
### 5.4 Computational complexity of ATA-GP-UCB

(a) Time complexity: Under the (data-dependent) Nyström approximation, constructing the dictionary $D_t$ takes $O(t)$ time at each step $t$. Then, we compute the embeddings $\varphi_t(x)$ for all arms in $O(m_t^3 + m_t^2 |\mathcal{X}|)$ time, where $|\mathcal{X}|$ is the cardinality of $\mathcal{X}$. Construction of $V_t$ takes $O(m_t^2 t)$ time, since we need to rebuild it from scratch; $V_t^{-1/2}$ is then computed in $O(m_t^3)$ time. We can now compute $\tilde{\mu}_t(x)$ and $\tilde{\sigma}_t^2(x)$ for all arms in $O(m_t^2 t + m_t |\mathcal{X}|)$ and $O(m_t^2 |\mathcal{X}|)$ time, respectively, using the already computed $\varphi_t(x)$ and $V_t^{-1/2}$. Thus the per-step time complexity is $O\big(m_t^2 (t + |\mathcal{X}|)\big)$, since $m_t \le t$. For continuous $\mathcal{X}$, one can approximately maximize the GP-UCB score by grid search or branch-and-bound methods such as DIRECT [6]. In fact, it can be maximized to within $O(\varepsilon)$ accuracy by making $O(\varepsilon^{-d})$ calls to it, yielding a per-step time complexity of $O\big(m_t^2 (t + \varepsilon^{-d})\big)$. Since $m_t = O(\gamma_t)$ and $\gamma_t$ is poly-logarithmic in $t$ for the SE kernel, the per-step time complexity is $\tilde{O}(t + \varepsilon^{-d})$. For the Matérn kernel, the complexity is $\tilde{O}\big(t^p (t + \varepsilon^{-d})\big)$ with $1 < p < 2$. Similarly, under the (data-independent) QFF approximation, the per-step time complexity is $O\big(m^3 + m^2 (t + \varepsilon^{-d})\big) = \tilde{O}(t + \varepsilon^{-d})$, since $m = O\big((\ln T)^d\big)$ for the SE kernel.

(b) Space complexity: Note that under the Nyström approximation, at each round $t$ we need to store all previously chosen arms, the matrix $V_t^{-1/2}$ and the vectors $\varphi_t(x)$. Hence, the per-step space complexity of ATA-GP-UCB is $O\big(t + m_t (m_t + \varepsilon^{-d})\big) = O\big(m_t (m_t + \varepsilon^{-d})\big)$ for small enough $\varepsilon$. Under the QFF approximation, the complexity is $O\big(m (m + \varepsilon^{-d})\big)$.

[Figure 1 (plots omitted): (a)-(e) Time-average regret $R_T/T$ for TGP-UCB, ATA-GP-UCB with QFF approximation (ATA-GP-UCB-QFF) and Nyström approximation (ATA-GP-UCB-Nyström) on heavy-tailed data: (a) $k_{\mathrm{SE}}$, $f$ from the RKHS, Student's-t rewards; (b) $k_{\mathrm{SE}}$, $f$ from the RKHS, Pareto rewards; (c) $k_{\text{Matérn}}$, $f$ from the RKHS, Student's-t rewards; (d) stock market data; (e) light sensor data. (f) Confidence sets ($\mu_t \pm \sigma_t$) formed by GP-UCB with and without truncation under heavy fluctuations.]

## 6 Experiments

We numerically compare the performance of TGP-UCB (Algorithm 1) and of ATA-GP-UCB with the QFF (ATA-GP-UCB-QFF) and Nyström (ATA-GP-UCB-Nyström) approximations (Algorithm 2) on both synthetic and real-world heavy-tailed environments. (Our codes are available here.) The confidence width $\beta_t$ and truncation level $b_t$ of our algorithms, and the trade-off parameter $\bar{q}$ used in the Nyström approximation, are set order-wise similar to those recommended by theory (Theorems 1, 3 and 4). We use $\lambda = 1$ in all algorithms and $\varepsilon = 0.1$ in ATA-GP-UCB-Nyström. We plot the mean and standard deviation (over independent trials) of the time-average regret $R_T/T$ in Figure 1. We use the following datasets.
1. Synthetic data: We generate the objective function $f \in \mathcal{H}_k(\mathcal{X})$ with $\mathcal{X}$ set to be a discretization of $[0, 1]$ into 100 evenly spaced points. Each $f = \sum_{i=1}^p a_i k(\cdot, x_i)$ was generated using an SE kernel with $l = 0.2$, by sampling $a_i$ uniformly from $[-1, 1]$ and support points $x_i \in \mathcal{X}$, with $p = 100$. We set $B = \max_{x \in \mathcal{X}} |f(x)|$. To generate the rewards, we first consider $y(x) = f(x) + \eta$, where the noise $\eta$ is sampled from the Student's t-distribution with 3 degrees of freedom (Figure 1a). Here the variance is bounded ($\alpha = 1$) and hence $v = B^2 + 3$. Next, we generate the rewards as samples from the Pareto distribution with shape parameter 2 and scale parameter $f(x)/2$. Here $f$ is generated similarly, except that we sample the $a_i$'s uniformly from $[0, 1]$. Then, we set $B$ as before, leading to the bound $v = \frac{B^{1+\alpha}}{2^{\alpha}(1-\alpha)}$ on the $(1+\alpha)$-th raw moments. We plot the results for $\alpha = 0.9$ (Figure 1b). We use $m = 32$ features (in accordance with Theorem 3) for ATA-GP-UCB-QFF in these experiments. Next, we generate $f$ using the Matérn kernel with $l = 0.2$ and $\nu = 2.5$, and consider the same Student's-t distribution as earlier to generate rewards. As we do not yet have the theory of ATA-GP-UCB-QFF for the Matérn kernel, we exclude it from this evaluation (Figure 1c). We perform 20 trials of $2 \times 10^4$ rounds, and for each trial we evaluate on a different $f$ (which explains the high error bars). A sketch of this synthetic setup appears after the dataset list below.

2. Stock market data: We consider a representative application of identifying the most profitable stock in a given pool of stocks. This is motivated by the practical scenario in which an investor would like to invest a fixed budget of money in a stock and get as much return as possible. We took the adjusted closing prices of 29 stocks from January 4th, 2016 to April 10th, 2019. A Kolmogorov-Smirnov (KS) test shows that the null hypothesis of stock prices following a Gaussian distribution is rejected in favor of a heavy-tailed alternative. We take the empirical mean of the stock prices as our objective function $f$, and the empirical covariance of the normalized stock prices as our kernel function $k$ (since stock behaviors are mostly correlated with one another). We consider $\alpha = 1$ and set $v$ as the empirical average of the squared prices. Since the kernel is data-dependent, we cannot run ATA-GP-UCB-QFF here. We average over 10 independent trials of the algorithms (Figure 1d).

3. Light sensor data: We use light sensor data collected in the CMU Intelligent Workplace in November 2005, containing the locations of 41 sensors, 601 train samples and 192 test samples, in the context of learning the maximum average reading of the sensors. For each sensor, we find that a KS test on its readings rejects the Gaussian in favor of a heavy-tailed alternative. We take the empirical average of the test samples as our objective $f$, and the empirical covariance of the normalized train samples as our kernel $k$. We consider $\alpha = 1$, set $v$ as the empirical mean of the squared readings, and set $B$ as the maximum of the average readings. For ATA-GP-UCB-QFF, we fit an SE kernel with $l^2 = 0.1$ on the given sensor locations and approximate it with $m = 16^2 = 256$ features (Figure 1e).
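As referenced in item 1 above, the synthetic setup can be reproduced along these lines. This is a sketch under our reading of the description, reusing the hypothetical `se_kernel` helper from the Section 3 snippet; exact sampling details may differ from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 100)[:, None]           # domain: 100 evenly spaced points in [0, 1]
S = X[rng.integers(0, 100, size=100)]         # p = 100 support points
a = rng.uniform(-1, 1, size=100)              # RKHS coefficients
f = se_kernel(X, S, l=0.2) @ a                # f = sum_i a_i k(., x_i)
B = np.max(np.abs(f))

# Student's-t rewards (alpha = 1): y = f(x) + eta with eta ~ t, 3 degrees of freedom.
y_student = f + rng.standard_t(df=3, size=f.shape)

# Pareto rewards (alpha = 0.9): regenerate f with a_i ~ U[0, 1] so that f >= 0, then sample
# y ~ Pareto(shape 2, scale f(x)/2), whose mean is exactly f(x).
f_pos = se_kernel(X, S, l=0.2) @ rng.uniform(0, 1, size=100)
y_pareto = (f_pos / 2) * (1 + rng.pareto(2, size=f_pos.shape))
```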
Observations: We find that ATA-GP-UCB outperforms TGP-UCB uniformly over all experiments, which is consistent with our theoretical results. We also see that the performance of ATA-GP-UCB under the Nyström approximation is no worse than that under the QFF approximation. Moreover, the scope of the latter is limited by its dependence on the analytical form of the kernel, whereas the former is data-adaptive and hence well suited for practical purposes.

Effect of truncation: For heavy-tailed rewards, the sub-Gaussian constant is $R = \infty$; hence, we exclude GP-UCB from the above experiments. We now demonstrate the effect of truncation on GP-UCB in the following experiment. First, we generate a function $f \in \mathcal{H}_k(\mathcal{X})$ and normalize it to $[0, 1]$. Then, we simulate rewards as $y(x) = f(x) + \eta$, where $\eta$ takes the values $-10$ and $10$, uniformly, at a single random point of $\mathcal{X}$, and is zero everywhere else. We run GP-UCB with $\beta_t = \ln t$ and see that the posterior mean after $T = 10^4$ rounds is not a good estimate of $f$. However, by truncating, at round $t$, the reward samples which exceed $t^{1/4}$ (the truncation threshold of TGP-UCB when $\alpha = 1$), we get an (almost) accurate estimator of $f$. Moreover, the confidence interval around this estimator contains $f$ at every point of $\mathcal{X}$, which in turn ensures good performance. We plot the respective confidence sets averaged over 50 such randomizations of the noise (Figure 1f).

## 7 Conclusion

To the best of our knowledge, this is the first work to formulate and solve BO under heavy-tailed observations. We have demonstrated the failure of existing BO methods and developed (almost) optimal algorithms using kernel approximation techniques, which are easy to implement, perform well in practice, and come with rigorous theoretical guarantees. It is worth noting that, by using a Bernstein-type concentration bound in each direction of the approximate feature space, we are able to obtain the near-optimal regret scaling for ATA-GP-UCB (Algorithm 2). Instead, if one could derive a Bernstein-type bound for self-normalized processes which depends on the $(1+\alpha)$-th moments of the rewards, then one might not have to resort to feature approximation to get optimal regret. Further, instead of truncating the payoffs, one could also consider building and studying a median-of-means-style estimator [8] in the (approximate) feature space and hope to develop an optimal algorithm.

Acknowledgments: The authors are grateful to the anonymous reviewers for their valuable comments. S. R. Chowdhury is supported by the Google India PhD Fellowship grant and the Tata Trusts travel grant. A. Gopalan is grateful for support from the DST INSPIRE faculty grant IFA13ENG-69.

## References

[1] Ahmed Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775-783, 2015.

[2] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In ICML, pages 1557-1565, 2014.

[3] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, February 2012.

[4] Salomon Bochner. Lectures on Fourier Integrals. Princeton University Press, 1959.

[5] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems, pages 5760-5770, 2018.

[6] Eric Brochu, Vlad M. Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[7] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(May):1655-1695, 2011.

[8] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711-7717, 2013.

[9] Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and Lorenzo Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, 2019.

[10] Alexandra Carpentier and Michal Valko. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089-1097, 2014.

[11] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits.
In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 844-853. JMLR.org, 2017.

[12] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153-2175, 2005.

[13] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. The Journal of Machine Learning Research, 19(1):650-683, 2018.

[14] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN '10, pages 209-219, New York, NY, USA, 2010. ACM.

[15] Javier Gonzalez, Joseph Longworth, David C. James, and Neil D. Lawrence. Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627, 2015.

[16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

[17] José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918-926, 2014.

[18] Francis Begnaud Hildebrand. Introduction to Numerical Analysis. Courier Corporation, 1987.

[19] Daniel Hsu and Sivan Sabato. Heavy-tailed regression with a generalized median-of-means. In International Conference on Machine Learning, pages 37-45, 2014.

[20] Krishna P. Jagannathan, Mihalis G. Markakis, Eytan Modiano, and John N. Tsitsiklis. Throughput optimal scheduling over time-varying channels in the presence of heavy-tailed traffic. IEEE Transactions on Information Theory, 60(5):2896-2909, 2014. doi: 10.1109/TIT.2014.2311125. URL https://doi.org/10.1109/TIT.2014.2311125.

[21] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295-304, 2015.

[22] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681-690. ACM, 2008.

[23] Tor Lattimore. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1584-1593, 2017.

[24] Andres Munoz Medina and Scott Yang. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642-1650, 2016.

[25] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400-404. Springer, 1975.

[26] Mojmir Mutny and Andreas Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, pages 9005-9016, 2018.

[27] Joaquin Quinonero-Candela, Carl Edward Rasmussen, and Christopher K. I. Williams. Approximation methods for Gaussian process regression. Large-Scale Kernel Machines, pages 203-224, 2007.

[28] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.

[29] Sidney I. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Science & Business Media, 2007.
[30] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. arXiv preprint arXiv:1802.07028, 2018.

[31] Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pages 1723-1742, 2017.

[32] Yevgeny Seldin, François Laviolette, Nicolo Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.

[33] Rajat Sen, Kirthevasan Kandasamy, and Sanjay Shakkottai. Noisy blackbox optimization using multi-fidelity queries: A tree search approach. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2096-2105, 2019.

[34] Han Shao, Xiaotian Yu, Irwin King, and Michael R. Lyu. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems, pages 8420-8429, 2018.

[35] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015-1022. Omnipress, 2010.

[36] Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144-1152, 2015.

[37] Steven H. Strogatz. Exploring complex networks. Nature, 410(6825):268, 2001.

[38] Sattar Vakili, Keqin Liu, and Qing Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759-767, 2013.

[39] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3627-3635. JMLR.org, 2017.

[40] Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with Gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pages 1022-1031, 2016.

[41] Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. arXiv preprint arXiv:1706.01445, 2017.

[42] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476-484, 2012.

[43] Xiaotian Yu, Han Shao, Michael R. Lyu, and Irwin King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937-946, 2018.