Optimal Rates for Random Fourier Features

Bharath K. Sriperumbudur, Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA. bks18@psu.edu
Zoltán Szabó, Gatsby Unit, CSML, UCL, Sainsbury Wellcome Centre, 25 Howland Street, London W1T 4JG, UK. zoltan.szabo@gatsby.ucl.ac.uk
(Both authors contributed equally.)

Abstract

Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, randomized constructions have recently been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis of the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1 \le r < \infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study of its approximation quality.

1 Introduction

Kernel methods [17] have enjoyed tremendous success in solving several fundamental problems of machine learning, ranging from classification, regression, feature extraction, dependency estimation and causal discovery to Bayesian inference and hypothesis testing. This success owes to their capability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows inner products between these high-dimensional feature maps $\lambda$ to be computed implicitly via a kernel function $k$: $k(x, y) = \langle \lambda(x), \lambda(y) \rangle$. However, this flexibility and richness of kernels has a price: by resorting to implicit computations these methods operate on the Gram matrix of the data, which raises serious computational challenges when dealing with large-scale data. In order to resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [25, 6, 1], explicit feature maps designed for additive kernels [23, 11], hashing [19, 9], and random Fourier features (RFF) [13] constructed for shift-invariant kernels, the focus of the current paper. RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map $\lambda$ associated with the kernel, by appealing to Bochner's theorem [24], which states that any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure, [13] proposed an explicit low-dimensional random Fourier feature map $\phi$ obtained by empirically approximating the Fourier integral so that $k(x, y) \approx \langle \phi(x), \phi(y) \rangle$. The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, thereby making it possible to handle large-scale data.
Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [13]. Another advantage of the RFF approach is that, unlike the low-rank matrix approximation approach [25, 6], which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful when dealing with out-of-sample data and also in online learning applications. The RFF technique has found wide applicability in several areas such as fast function-to-function regression [12], differential privacy preservation [2] and causal discovery [10].

Despite the success of the RFF method, surprisingly little is known about its performance guarantees. To the best of our knowledge, the only works in the machine learning literature providing theoretical insight into the accuracy of kernel approximation via RFF are [13, 22]:¹ they show that $A_m := \sup\{|k(x, y) - \langle \phi(x), \phi(y) \rangle_{\mathbb{R}^{2m}}| : x, y \in S\} = O_p(\sqrt{\log(m)/m})$ for any compact set $S \subset \mathbb{R}^d$, where $m$ is the number of random Fourier features. However, since the approximation proposed by the RFF method involves empirically approximating the Fourier integral, the RFF estimator can be thought of as an empirical characteristic function (ECF). In the probability literature, the systematic study of ECFs was initiated by [7] and followed up by [5, 4, 27]. While [7] shows the almost sure (a.s.) convergence of $A_m$ to zero, [5, Theorems 1 and 2] and [27, Theorems 6.2 and 6.3] show that the optimal rate is $m^{-1/2}$. In addition, [7] shows that almost sure convergence cannot be attained over the entire space (i.e., $\mathbb{R}^d$) if the characteristic function decays to zero at infinity. Due to this, [5, 27] study the convergence behavior of $A_m$ when the diameter of $S$ grows with $m$ and show that almost sure convergence of $A_m$ is guaranteed as long as the diameter of $S$ is $e^{o(m)}$. Unfortunately, all these results (to the best of our knowledge) are asymptotic in nature, and the only known finite-sample guarantee, by [13, 22], is non-optimal. In this paper (see Section 3), we present a finite-sample probabilistic bound for $A_m$ that holds for any $m$ and provides the optimal rate of $m^{-1/2}$ for any compact set $S$, along with guaranteeing the almost sure convergence of $A_m$ as long as the diameter of $S$ is $e^{o(m)}$. Since convergence in uniform norm might sometimes be too strong a requirement and may not be suitable to attain correct rates in the generalization bounds associated with learning algorithms involving RFF,² we also study the behavior of $k(x, y) - \langle \phi(x), \phi(y) \rangle_{\mathbb{R}^{2m}}$ in $L^r$-norm ($1 \le r < \infty$) and obtain an optimal rate of $m^{-1/2}$.

The RFF approach to approximating a translation-invariant kernel can be seen as a special case of the problem of approximating a function in the barycenter of a family (say $\mathcal{F}$) of functions, which was considered in [14]. However, the approximation guarantees in [14, Theorem 3.2] do not directly apply to RFF, as the assumptions on $\mathcal{F}$ are not satisfied by the cosine functions, which form the family used to approximate the kernel in the RFF approach. While a careful modification of the proof of [14, Theorem 3.2] could yield an $m^{-1/2}$ rate of approximation for any compact set $S$, this result would still be sub-optimal by providing a linear dependence on $|S|$, similar to the theorems in [13, 22], in contrast to the optimal logarithmic dependence on $|S|$ that is guaranteed by our results.
Traditionally, kernel-based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [28, 18], nonlinear variable selection [15, 16], (multi-task) gradient learning [26] and fitting of distributions in an infinite-dimensional exponential family [20]. Given the importance of these derivative-based kernel algorithms, in Section 4 we propose, similarly to [13], a finite-dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above-mentioned derivative-based kernel algorithms. We present a finite-sample bound that quantifies the quality of approximation in uniform and $L^r$-norms and show the rate of convergence to be $m^{-1/2}$ in both cases.

A summary of our contributions is as follows. We
1. provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives;
2. prove uniform and $L^r$ convergence on fixed compact sets with optimal rate in terms of the RFF dimension ($m$);
3. give sufficient conditions on the growth rate of compact sets that preserve a.s. convergence uniformly and in $L^r$; specializing our result, we match the best attainable asymptotic growth rate.

¹ [22] derived tighter constants compared to [13] and also considered different RFF implementations.
² For example, in applications like kernel ridge regression based on RFF, it is more appropriate to consider the approximation guarantee in the $L^2$ norm than in the uniform norm.

Various notations and definitions that are used throughout the paper are provided in Section 2, along with a brief review of the RFF approximation proposed by [13]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.

2 Notations & preliminaries

In this section, we introduce notation that is used throughout the paper and then present preliminaries on kernel approximation through random feature maps as introduced by [13].

Definitions & Notation: For a topological space $X$, $C(X)$ (resp. $C_b(X)$) denotes the space of all continuous (resp. bounded continuous) functions on $X$. For $f \in C_b(X)$, $\|f\|_X := \sup_{x \in X} |f(x)|$ is the supremum norm of $f$. $M_b(X)$ and $M^1_+(X)$ denote the sets of all finite Borel measures and probability measures on $X$, respectively. For $\mu \in M_b(X)$, $L^r(X, \mu)$ denotes the Banach space of $r$-power ($r \ge 1$) $\mu$-integrable functions. For $X \subseteq \mathbb{R}^d$, we will use $L^r(X)$ for $L^r(X, \mu)$ if $\mu$ is the Lebesgue measure on $X$. For $f \in L^r(X, \mu)$, $\|f\|_{L^r(X,\mu)} := (\int_X |f|^r \, d\mu)^{1/r}$ denotes the $L^r$-norm of $f$ for $1 \le r < \infty$, and we write it as $\|\cdot\|_{L^r(X)}$ if $X \subseteq \mathbb{R}^d$ and $\mu$ is the Lebesgue measure. For any $f \in L^1(X, \mathbb{P})$ where $\mathbb{P} \in M^1_+(X)$, we define $\mathbb{P}f := \int_X f(x)\, d\mathbb{P}(x)$ and $\mathbb{P}_m f := \frac{1}{m}\sum_{i=1}^m f(X_i)$, where $(X_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} \mathbb{P}$, $\mathbb{P}_m := \frac{1}{m}\sum_{i=1}^m \delta_{X_i}$ is the empirical measure and $\delta_x$ is the Dirac measure supported on $x \in X$. $\mathrm{supp}(\mathbb{P})$ denotes the support of $\mathbb{P}$. $\mathbb{P}^m := \otimes_{j=1}^m \mathbb{P}$ denotes the $m$-fold product measure. For $v := (v_1, \ldots, v_d) \in \mathbb{R}^d$, $\|v\|_2 := \sqrt{\sum_{i=1}^d v_i^2}$. The diameter of $A \subseteq Y$, where $(Y, \rho)$ is a metric space, is defined as $|A|_\rho := \sup\{\rho(x, y) : x, y \in A\}$. If $Y = \mathbb{R}^d$ with $\rho = \|\cdot\|_2$, we denote the diameter of $A$ by $|A|$; $|A| < \infty$ if $A$ is compact. The volume of $A \subseteq \mathbb{R}^d$ is defined as $\mathrm{vol}(A) = \int_A 1\, dx$. For $A \subseteq \mathbb{R}^d$, we define $A^\triangle := A - A = \{x - y : x, y \in A\}$. $\mathrm{conv}(A)$ is the convex hull of $A$.
For a function $g$ defined on an open set $B \subseteq \mathbb{R}^d \times \mathbb{R}^d$,
$$\partial^{p,q} g(x, y) := \frac{\partial^{|p|+|q|} g(x, y)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d}\, \partial y_1^{q_1} \cdots \partial y_d^{q_d}}, \qquad (x, y) \in B,$$
where $p, q \in \mathbb{N}^d$ are multi-indices, $|p| = \sum_{j=1}^d p_j$ and $\mathbb{N} := \{0, 1, 2, \ldots\}$. Define $v^p := \prod_{j=1}^d v_j^{p_j}$. For positive sequences $(a_n)_{n \in \mathbb{N}}$, $(b_n)_{n \in \mathbb{N}}$, $a_n = o(b_n)$ if $\lim_{n\to\infty} a_n/b_n = 0$. $X_n = O_p(r_n)$ (resp. $O_{a.s.}(r_n)$) denotes that $X_n/r_n$ is bounded in probability (resp. almost surely). $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$ is the Gamma function; $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$ and $\Gamma(t+1) = t\,\Gamma(t)$.

Random feature maps: Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a bounded, continuous, positive definite, translation-invariant kernel, i.e., there exists a positive definite function $\psi : \mathbb{R}^d \to \mathbb{R}$ such that $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$, where $\psi \in C_b(\mathbb{R}^d)$. By Bochner's theorem [24, Theorem 6.6], $\psi$ can be represented as the Fourier transform of a finite non-negative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,
$$k(x, y) = \psi(x - y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\omega^T(x-y)}\, d\Lambda(\omega) \overset{(*)}{=} \int_{\mathbb{R}^d} \cos\big(\omega^T(x-y)\big)\, d\Lambda(\omega), \qquad (1)$$
where $(*)$ follows from the fact that $\psi$ is real-valued and symmetric. Since $\Lambda(\mathbb{R}^d) = \psi(0)$, we have $k(x, y) = \psi(0)\int e^{\sqrt{-1}\,\omega^T(x-y)}\, d\mathbb{P}(\omega)$, where $\mathbb{P} := \Lambda/\psi(0) \in M^1_+(\mathbb{R}^d)$. Therefore, w.l.o.g., we assume throughout the paper that $\psi(0) = 1$ and so $\Lambda \in M^1_+(\mathbb{R}^d)$. Based on (1), [13] proposed an approximation to $k$ by replacing $\Lambda$ with its empirical measure $\Lambda_m$, constructed from $(\omega_i)_{i=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$, so that the resulting approximation can be written as the Euclidean inner product of finite-dimensional random feature maps, i.e.,
$$\hat{k}(x, y) = \frac{1}{m}\sum_{i=1}^m \cos\big(\omega_i^T(x-y)\big) \overset{(\dagger)}{=} \langle \phi(x), \phi(y) \rangle_{\mathbb{R}^{2m}}, \qquad (2)$$
where $\phi(x) = \frac{1}{\sqrt{m}}\big(\cos(\omega_1^T x), \ldots, \cos(\omega_m^T x), \sin(\omega_1^T x), \ldots, \sin(\omega_m^T x)\big)$ and $(\dagger)$ holds by the basic trigonometric identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$. This elegant approximation to $k$ is particularly useful in speeding up kernel-based algorithms, as the finite-dimensional random feature map $\phi$ can be used to solve these algorithms in the primal, thereby offering better computational complexity (than solving them in the dual) while at the same time not lacking in performance. Apart from these practical advantages, [13, Claim 1] (and similarly [22, Prop. 1]) provides a theoretical guarantee that $\|\hat{k} - k\|_{S \times S} \to 0$ as $m \to \infty$ for any compact set $S \subset \mathbb{R}^d$. Formally, [13, Claim 1] showed that for any $\epsilon > 0$ (note that (3) is slightly different from, but more precise than, the statement of Claim 1 in [13]),
$$\Lambda^m\Big(\Big\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{S \times S} \ge \epsilon\Big\}\Big) \le C_d \left(\frac{|S|\sigma}{\epsilon}\right)^{\frac{2d}{d+2}} e^{-\frac{m\epsilon^2}{4(d+2)}}, \qquad (3)$$
where $\sigma^2 := \int \|\omega\|_2^2\, d\Lambda(\omega)$ and $C_d$ is an explicit constant depending only on $d$, with $C_d \le 2^7 d^{\frac{2}{d+2}}$ when $d \ge 2$. The condition $\sigma^2 < \infty$ implies that $\psi$ (and therefore $k$) is twice differentiable. From (3) it is clear that the probability has polynomial tails if $\epsilon < |S|\sigma$ (i.e., small $\epsilon$) and Gaussian tails if $\epsilon \ge |S|\sigma$ (i.e., large $\epsilon$), and (3) can be equivalently written as
$$\Lambda^m\Big(\Big\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{S \times S} \ge C_d^{\frac{d+2}{2d}}\, |S|\sigma\, \sqrt{m^{-1}\log m}\Big\}\Big) \le m^{\frac{\alpha}{4(d+2)}}\, (\log m)^{-\frac{d}{d+2}}, \qquad (4)$$
where $\alpha := 4d - C_d^{\frac{d+2}{d}}|S|^2\sigma^2$. For $|S|$ sufficiently large (i.e., $\alpha < 0$), it follows from (4) that
$$\|\hat{k} - k\|_{S \times S} = O_p\big(|S|\sqrt{m^{-1}\log m}\big). \qquad (5)$$
While (5) shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence (i.e., $\hat{k}$ converges to $k$ uniformly over compact sets), the rate of convergence $\sqrt{(\log m)/m}$ is not optimal. In addition, the order of dependence on $|S|$ is not optimal. While a faster rate (in fact, an optimal rate) of convergence is desired (better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using $\hat{k}$), the order of dependence on $|S|$ is also important, as it determines the number of RFF features (i.e., $m$) that are needed to achieve a given approximation accuracy.
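To make the construction in (2) concrete, here is a minimal numerical sketch (ours, not from [13]) for the Gaussian kernel $k(x, y) = \exp(-\|x - y\|_2^2/2)$, whose spectral measure $\Lambda$ is the standard Gaussian on $\mathbb{R}^d$; the function name, the set $S = [-1, 1]^d$ and the choice $m = 1000$ are illustrative only.

```python
import numpy as np

def rff_features(X, omegas):
    """Map each row x of X to phi(x) = m^{-1/2}[cos(omega_i^T x), sin(omega_i^T x)]_{i=1}^m, as in (2)."""
    m = omegas.shape[0]
    proj = X @ omegas.T                          # (n, m) matrix with entries omega_i^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, n = 5, 1000, 200
omegas = rng.standard_normal((m, d))             # i.i.d. draws from Lambda = N(0, I_d)
X = rng.uniform(-1.0, 1.0, size=(n, d))          # sample points in the compact set S = [-1, 1]^d

Phi = rff_features(X, omegas)
K_hat = Phi @ Phi.T                              # <phi(x), phi(y)> approximates k(x, y)

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                      # exact Gaussian kernel, psi(z) = exp(-||z||_2^2 / 2)
print("max |k_hat - k| over the sample:", np.abs(K_hat - K).max())
```

A kernel machine can then be trained on the $2m$-dimensional features Phi with any linear solver; how the approximation error depends on $m$ and on the set $S$ is exactly what the bounds studied in this paper quantify.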
In fact, the order of dependence on $|S|$ controls the rate at which $|S|$ can be grown as a function of $m$ when $m \to \infty$ (see Remark 1(ii) for a detailed discussion of the significance of growing $|S|$). In the following section, we present an analogue of (4) (see Theorem 1) that provides optimal rates and has the correct dependence on $|S|$.

3 Main results: approximation of k

As discussed in Sections 1 and 2, while the random feature map approximation of $k$ introduced by [13] has many practical advantages, it does not seem to be theoretically well understood. The existing theoretical results on the quality of approximation do not provide a complete picture owing to their non-optimality. In this section, we first present our main result (see Theorem 1), which improves upon (4) and provides a rate of $m^{-1/2}$ with logarithmic dependence on $|S|$. We then discuss the consequences of Theorem 1, along with its optimality, in Remark 1. Next, in Corollary 2 and Theorem 3, we discuss the $L^r$-convergence ($1 \le r < \infty$) of $\hat{k}$ to $k$ over compact subsets of $\mathbb{R}^d$.

Theorem 1. Suppose $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$, where $\psi \in C_b(\mathbb{R}^d)$ is positive definite and $\sigma^2 := \int \|\omega\|_2^2\, d\Lambda(\omega) < \infty$. Then for any $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{S \times S} \ge \frac{h(d, |S|, \sigma) + \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$
where $h(d, |S|, \sigma) := 32\sqrt{2d \log(2|S| + 1)} + 32\sqrt{2d \log(\sigma + 1)} + 16\sqrt{2d\, [\log(2|S| + 1)]^{-1}}$.

Proof (sketch). Note that $\|\hat{k} - k\|_{S \times S} = \sup_{x, y \in S} |\hat{k}(x, y) - k(x, y)| = \sup_{g \in \mathcal{G}} |\Lambda_m g - \Lambda g|$, where $\mathcal{G} := \{g_{x,y}(\omega) = \cos(\omega^T(x - y)) : x, y \in S\}$, which means that the object of interest is the supremum of an empirical process indexed by $\mathcal{G}$. Instead of bounding $\sup_{g \in \mathcal{G}} |\Lambda_m g - \Lambda g|$ by using Hoeffding's inequality on a cover of $\mathcal{G}$ and then applying a union bound, as carried out in [13, 22], we use the refined technique of applying concentration via McDiarmid's inequality, followed by symmetrization, and bound the Rademacher average by the Dudley entropy bound. The result is obtained by carefully bounding the $L^2(\Lambda_m)$-covering number of $\mathcal{G}$. The details are provided in Section B.1 of the supplementary material.
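Before discussing the consequences of Theorem 1, here is a quick empirical illustration of its $m^{-1/2}$ rate (our own sketch, not part of the analysis; the Gaussian kernel, the set $S = [-1, 1]^2$ and the grid resolution are arbitrary choices, and the grid maximum only approximates the supremum over $S \times S$):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
s = np.linspace(-1.0, 1.0, 25)
grid = np.stack(np.meshgrid(s, s), axis=-1).reshape(-1, d)   # evaluation grid over S = [-1, 1]^2
diffs = grid[:, None, :] - grid[None, :, :]
K = np.exp(-(diffs ** 2).sum(-1) / 2.0)                      # exact Gaussian kernel values

for m in (100, 400, 1600, 6400):
    omegas = rng.standard_normal((m, d))                     # spectral samples, Lambda = N(0, I_d)
    proj = grid @ omegas.T
    Phi = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)
    err = np.abs(Phi @ Phi.T - K).max()                      # sup |k_hat - k| on the grid
    print(f"m = {m:5d}   error = {err:.4f}   sqrt(m) * error = {np.sqrt(m) * err:.2f}")
```

The last column stays of roughly the same order as $m$ grows, reflecting the $1/\sqrt{m}$ rate, while enlarging $S$ would affect the error only through the logarithmic terms in $h(d, |S|, \sigma)$.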
Remark 1. (i) Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence as $m \to \infty$, with the rate of a.s. convergence being $\sqrt{m^{-1}\log|S|}$ (almost sure convergence is guaranteed by the first Borel-Cantelli lemma). In comparison to (4), it is clear that Theorem 1 provides improved rates with better constants and a logarithmic, rather than linear, dependence on $|S|$. The logarithmic dependence on $|S|$ ensures that we need $m = O(\epsilon^{-2}\log|S|)$ random features instead of $O(\epsilon^{-2}|S|^2\log(|S|/\epsilon))$ random features, i.e., significantly fewer features to achieve the same approximation accuracy $\epsilon$.

(ii) Growing diameter: While Theorem 1 provides almost sure convergence uniformly over compact sets, one might wonder whether it is possible to achieve uniform convergence over $\mathbb{R}^d$. [7, Section 2] showed that such a result is possible if $\Lambda$ is a discrete measure, but not possible for $\Lambda$ that is absolutely continuous w.r.t. the Lebesgue measure (i.e., if $\Lambda$ has a density). Since uniform convergence of $\hat{k}$ to $k$ over $\mathbb{R}^d$ is not possible for many interesting $k$ (e.g., the Gaussian kernel), it is of interest to study the convergence on $S$ whose diameter grows with $m$. Therefore, as mentioned in Section 2, the order of dependence of the rates on $|S|$ is critical. Suppose $|S_m| \to \infty$ as $m \to \infty$ (we write $|S_m|$ instead of $|S|$ to show the explicit dependence on $m$). Then Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence if $m^{-1}\log|S_m| \to 0$ as $m \to \infty$ (i.e., $|S_m| = e^{o(m)}$), in contrast to the result in (4), which requires $|S_m| = o(\sqrt{m/\log m})$. In other words, Theorem 1 ensures consistency even when $|S_m|$ grows exponentially in $m$, whereas (4) ensures consistency only if $|S_m|$ does not grow faster than $\sqrt{m/\log m}$.

(iii) Optimality: Note that $\psi$ is the characteristic function of $\Lambda \in M^1_+(\mathbb{R}^d)$, since $\psi$ is the Fourier transform of $\Lambda$ (by Bochner's theorem). Therefore, the object of interest, $\|\hat{k} - k\|_{S \times S} = \|\hat{\psi} - \psi\|_{S^\triangle}$, is the uniform norm of the difference between $\psi$ and the empirical characteristic function $\hat{\psi} = \frac{1}{m}\sum_{i=1}^m \cos(\langle \omega_i, \cdot \rangle)$, when both are restricted to a compact set $S \subset \mathbb{R}^d$. The question of the convergence behavior of $\|\hat{\psi} - \psi\|_{S^\triangle}$ is not new and has been studied in great detail in the probability and statistics literature (e.g., see [7, 27] for $d = 1$ and [4, 5] for $d > 1$), where the characteristic function is not just a real-valued symmetric function (like $\psi$) but is Hermitian. [27, Theorems 6.2 and 6.3] show that the optimal rate of convergence of $\|\hat{\psi} - \psi\|_{S^\triangle}$ is $m^{-1/2}$ when $d = 1$, which matches our result in Theorem 1. Also, Theorems 1 and 2 in [5] show that the logarithmic dependence on $|S_m|$ is asymptotically optimal. In particular, [5, Theorem 1] matches the growing-diameter result in Remark 1(ii), while [5, Theorem 2] shows that if $\Lambda$ is absolutely continuous w.r.t. the Lebesgue measure and if $\limsup_m m^{-1}\log|S_m| > 0$, then there exists a positive $\varepsilon$ such that $\limsup_m \Lambda^m(\|\hat{\psi} - \psi\|_{S_m^\triangle} \ge \varepsilon) > 0$. This means that the rate $|S_m| = e^{o(m)}$ is not only the best possible in general for almost sure convergence, but that if a faster-growing sequence $|S_m|$ is considered, then even stochastic convergence cannot be retained for any characteristic function vanishing at infinity along at least one path. While these previous results match that of Theorem 1 (and its consequences), we would like to highlight the fact that all these previous results are asymptotic in nature, whereas Theorem 1 provides a finite-sample probabilistic inequality that holds for any $m$. We are not aware of any such finite-sample result except for the one in [13, 22].

Using Theorem 1, one can obtain a probabilistic inequality for the $L^r$-norm of $\hat{k} - k$ over any compact set $S \subset \mathbb{R}^d$, as given by the following result.

Corollary 2. Suppose $k$ satisfies the assumptions in Theorem 1. Then for any $1 \le r < \infty$, $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{L^r(S)} \ge \left(\frac{\pi^{d/2}|S|^d}{2^d\,\Gamma(\frac{d}{2} + 1)}\right)^{2/r} \frac{h(d, |S|, \sigma) + \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$
where $\|\hat{k} - k\|_{L^r(S)} := \|\hat{k} - k\|_{L^r(S \times S)} = \big(\int_{S \times S} |\hat{k}(x, y) - k(x, y)|^r\, dx\, dy\big)^{1/r}$.

Proof. Note that $\|\hat{k} - k\|_{L^r(S)} \le \|\hat{k} - k\|_{S \times S}\, \mathrm{vol}^{2/r}(S)$. The result follows by combining Theorem 1 with the fact that $\mathrm{vol}(S) \le \mathrm{vol}(A)$, where $A := \{x \in \mathbb{R}^d : \|x\|_2 \le \frac{|S|}{2}\}$ and $\mathrm{vol}(A) = \frac{\pi^{d/2}|S|^d}{2^d\,\Gamma(\frac{d}{2} + 1)}$ (which follows from [8, Corollary 2.55]).

Corollary 2 shows that $\|\hat{k} - k\|_{L^r(S)} = O_{a.s.}(m^{-1/2}|S|^{2d/r}\sqrt{\log|S|})$, and therefore, if $|S_m| \to \infty$ as $m \to \infty$, consistency of $\hat{k}$ in the $L^r(S_m)$-norm is achieved as long as $m^{-1/2}|S_m|^{2d/r}\sqrt{\log|S_m|} \to 0$ as $m \to \infty$. This means that, in comparison to the uniform norm in Theorem 1, where $|S_m|$ can grow exponentially in $m^\delta$ ($\delta < 1$), $|S_m|$ cannot grow faster than $m^{\frac{r}{4d}}(\log m)^{-\frac{r}{4d} - \theta}$ ($\theta > 0$) to achieve consistency in the $L^r$-norm.
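For concreteness, the $L^2(S)$-error controlled by Corollary 2 can be estimated by Monte Carlo integration over $S \times S$; the sketch below (our illustration, not part of the paper's analysis, with the Gaussian kernel and $S = [-1, 1]^2$ as arbitrary choices) also shows the error shrinking as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_mc = 2, 5000
X = rng.uniform(-1.0, 1.0, size=(n_mc, d))       # x ~ Uniform(S), S = [-1, 1]^2
Y = rng.uniform(-1.0, 1.0, size=(n_mc, d))       # y ~ Uniform(S)
Z = X - Y
k_exact = np.exp(-(Z ** 2).sum(-1) / 2.0)
vol_SxS = (2.0 ** d) ** 2                        # vol(S x S) = vol(S)^2 = 16

for m in (100, 400, 1600):
    omegas = rng.standard_normal((m, d))
    k_hat = np.cos(Z @ omegas.T).mean(axis=1)    # (1/m) sum_i cos(omega_i^T (x - y))
    # ||k_hat - k||_{L^2(S)} = ( int_{S x S} |k_hat - k|^2 dx dy )^{1/2}, via Monte Carlo
    l2_err = np.sqrt(vol_SxS * np.mean((k_hat - k_exact) ** 2))
    print(f"m = {m:5d}   estimated L2(S) error = {l2_err:.4f}")
```

Quadrupling $m$ roughly halves the estimated error, in line with the $m^{-1/2}$ rate of Corollary 2.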
Instead of using Theorem 1 to obtain a bound on $\|\hat{k} - k\|_{L^r(S)}$ (this bound may be weak, as $\|\hat{k} - k\|_{L^r(S)} \le \|\hat{k} - k\|_{S \times S}\,\mathrm{vol}^{2/r}(S)$ for any $1 \le r < \infty$), a better bound (for $2 \le r < \infty$) can be obtained by directly bounding $\|\hat{k} - k\|_{L^r(S)}$, as shown in the following result.

Theorem 3. Suppose $k(x, y) = \psi(x - y)$, $x, y \in \mathbb{R}^d$, where $\psi \in C_b(\mathbb{R}^d)$ is positive definite. Then for any $1 < r < \infty$, $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{L^r(S)} \ge \mathrm{vol}^{2/r}(S)\left(\frac{C^*_r}{m^{1 - \max\{\frac{1}{r}, \frac{1}{2}\}}} + \sqrt{\frac{2\tau}{m}}\right)\right\}\right) \le e^{-\tau},$$
where $C^*_r$ denotes the Khintchine constant, with $C^*_r = 1$ for $r \in (1, 2]$.

Proof (sketch). As in Theorem 1, we show that $\|k - \hat{k}\|_{L^r(S)}$ satisfies the bounded difference property; hence, by McDiarmid's inequality, it concentrates around its expectation $\mathbb{E}\|k - \hat{k}\|_{L^r(S)}$. By symmetrization, we then show that $\mathbb{E}\|k - \hat{k}\|_{L^r(S)}$ is upper bounded in terms of $\mathbb{E}_\varepsilon\big\|\sum_{i=1}^m \varepsilon_i \cos(\langle \omega_i, \cdot \rangle)\big\|_{L^r(S)}$, where $\varepsilon := (\varepsilon_i)_{i=1}^m$ are Rademacher random variables. The result then follows by exploiting the fact that $L^r(S)$ is a Banach space of type $\min\{r, 2\}$. The details are provided in Section B.2 of the supplementary material.

Remark 2. Theorem 3 shows an improved dependence on $|S|$, without the extra $\sqrt{\log|S|}$ factor appearing in Corollary 2, and therefore provides a better rate for $2 \le r < \infty$ when the diameter of $S$ grows: $\|\hat{k} - k\|_{L^r(S_m)} \overset{a.s.}{\to} 0$ if $|S_m| = o(m^{\frac{r}{4d}})$ as $m \to \infty$. However, for $1 < r < 2$, Theorem 3 provides a slower rate than Corollary 2, and it is therefore more appropriate to use the bound in Corollary 2 in that regime. While one might wonder why we only considered the convergence of $\|\hat{k} - k\|_{L^r(S)}$ and not of $\|\hat{k} - k\|_{L^r(\mathbb{R}^d)}$, it is important to note that the latter is not well-defined, because $\hat{k} \notin L^r(\mathbb{R}^d)$ even if $k \in L^r(\mathbb{R}^d)$.

4 Approximation of kernel derivatives

In the previous section we focused on the approximation of the kernel function: we presented uniform and $L^r$ convergence guarantees on compact sets for the random Fourier feature approximation, and discussed how fast the diameter of these sets can grow while preserving uniform and $L^r$ convergence almost surely. In this section, we propose an approximation to the derivatives of the kernel and analyze the uniform and $L^r$ convergence behavior of the proposed approximation. As motivated in Section 1, the question of approximating the derivatives of a kernel through a finite-dimensional random feature map is also important, as it makes it possible to speed up several interesting machine learning tasks that involve kernel derivatives [28, 18, 15, 16, 26, 20]; see, for example, the recent infinite-dimensional exponential family fitting technique [21], which implements this idea. To this end, we consider $k$ as in (1) and define $h_a := \cos(\frac{\pi a}{2} + \cdot)$, $a \in \mathbb{N}$ (in other words, $h_0 = \cos$, $h_1 = -\sin$, $h_2 = -\cos$, $h_3 = \sin$ and $h_a = h_{a \bmod 4}$). For $p, q \in \mathbb{N}^d$, assuming $\int |\omega^{p+q}|\, d\Lambda(\omega) < \infty$, it follows from the dominated convergence theorem that
$$\partial^{p,q} k(x, y) = \int_{\mathbb{R}^d} \omega^p (-\omega)^q\, h_{|p+q|}\big(\omega^T(x - y)\big)\, d\Lambda(\omega) = \int_{\mathbb{R}^d} \omega^{p+q}\left[h_{|p|}(\omega^T x)\, h_{|q|}(\omega^T y) + h_{3+|p|}(\omega^T x)\, h_{3+|q|}(\omega^T y)\right] d\Lambda(\omega),$$
so that $\partial^{p,q} k(x, y)$ can be approximated by replacing $\Lambda$ with $\Lambda_m$, resulting in
$$\widehat{\partial^{p,q} k}(x, y) := s_{p,q}(x, y) = \frac{1}{m}\sum_{j=1}^m \omega_j^p (-\omega_j)^q\, h_{|p+q|}\big(\omega_j^T(x - y)\big) = \langle \phi_p(x), \phi_q(y) \rangle_{\mathbb{R}^{2m}}, \qquad (6)$$
where $\phi_p(u) := \frac{1}{\sqrt{m}}\big(\omega_1^p\, h_{|p|}(\omega_1^T u), \ldots, \omega_m^p\, h_{|p|}(\omega_m^T u), \omega_1^p\, h_{3+|p|}(\omega_1^T u), \ldots, \omega_m^p\, h_{3+|p|}(\omega_m^T u)\big)$ and $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$. The goal now is to understand the behavior of $\|s_{p,q} - \partial^{p,q} k\|_{S \times S}$ and $\|s_{p,q} - \partial^{p,q} k\|_{L^r(S)}$ for $r \in [1, \infty)$, i.e., to obtain analogues of Theorems 1 and 3.
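As an illustration of (6) (our sketch; the Gaussian kernel and the multi-index $p = (1, 0, 0)$, $q = 0$ are arbitrary choices), the partial derivative $\partial k / \partial x_1$ can be approximated with the same spectral samples used for $\hat{k}$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 3, 5000
omegas = rng.standard_normal((m, d))        # i.i.d. spectral samples, Lambda = N(0, I_d)
x = np.array([0.3, -0.5, 0.1])
y = np.array([-0.2, 0.4, 0.0])
z = x - y

# s_{p,q}(x, y) from (6) with p = (1,0,0), q = (0,0,0):
#   (1/m) * sum_j omega_{j,1} * h_1(omega_j^T z)  with  h_1 = cos(pi/2 + .) = -sin,
# i.e. exactly the partial derivative of k_hat(x, y) with respect to x_1.
proj = omegas @ z
s_10 = np.mean(omegas[:, 0] * (-np.sin(proj)))

# exact derivative of the Gaussian kernel: d/dx_1 exp(-||x - y||_2^2 / 2) = -(x_1 - y_1) k(x, y)
exact = -z[0] * np.exp(-(z ** 2).sum() / 2.0)
print("RFF estimate:", s_10, "   exact:", exact)
```

Note that the summands now carry the unbounded factor $\omega_{j,1}$, which is precisely the source of the technical difficulties discussed next.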
As in the proof sketch of Theorem 1, $\|s_{p,q} - \partial^{p,q} k\|_{S \times S}$ can be analyzed as the supremum of an empirical process indexed by a suitable function class (say $\mathcal{G}$); however, some technical issues arise because $\mathcal{G}$ is not uniformly bounded. This means that McDiarmid's or Talagrand's inequality cannot be applied to achieve concentration, and bounding the Rademacher average by the Dudley entropy bound may not be reasonable. While these issues can be tackled by resorting to more technical and refined methods, in this paper we generalize Theorem 1 to derivatives (see Theorem 4, which is proved in Section B.1 of the supplement) under the restrictive assumption that $\mathrm{supp}(\Lambda)$ is bounded (note that many popular kernels, including the Gaussian, do not satisfy this assumption). We also present another result (see Theorem 5), obtained by generalizing the proof technique³ of [13] to unbounded functions, in which the boundedness assumption on $\mathrm{supp}(\Lambda)$ is relaxed, but at the expense of a worse rate (compared to Theorem 4).

³ We also correct some technical issues in the proof of [13, Claim 1], where (i) a shift-invariance argument was applied to the non-shift-invariant kernel estimator $\hat{k}(x, y) = \frac{1}{m}\sum_{j=1}^m 2\cos(\omega_j^T x + b_j)\cos(\omega_j^T y + b_j) = \frac{1}{m}\sum_{j=1}^m \big[\cos(\omega_j^T(x - y)) + \cos(\omega_j^T(x + y) + 2b_j)\big]$, (ii) the convexity of $S$ was not imposed, leading to a possibly undefined Lipschitz constant ($L$), and (iii) the randomness of the maximizer $\arg\max_{z \in S^\triangle}\|\nabla[k(z) - \hat{k}(z)]\|_2$ was not taken into account, and thus the stated upper bound on the expectation of the squared Lipschitz constant ($\mathbb{E}[L^2]$) does not hold.

Theorem 4. Let $p, q \in \mathbb{N}^d$, $T_{p,q} := \sup_{\omega \in \mathrm{supp}(\Lambda)} |\omega^{p+q}|$, $C_{p,q} := \mathbb{E}_{\omega \sim \Lambda}\big[|\omega^{p+q}|\, \|\omega\|_2^2\big]$, and assume that $C_{2p,2q} < \infty$. Suppose $\mathrm{supp}(\Lambda)$ is bounded if $p \ne 0$ or $q \ne 0$. Then for any $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\partial^{p,q} k - s_{p,q}\|_{S \times S} \ge \frac{H(d, p, q, |S|) + T_{p,q}\sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$
where $H(d, p, q, |S|)$ is an explicit constant of the same form as $h(d, |S|, \sigma)$ in Theorem 1, depending on $d$, on $U(p, q, |S|) := \log\big(2|S|\, T_{2p,2q}^{1/2} + 1\big)$ and on $\sqrt{C_{2p,2q}}$.

Remark 3. (i) Note that Theorem 4 reduces to Theorem 1 if $p = q = 0$, in which case $T_{p,q} = T_{2p,2q} = 1$. If $p \ne 0$ or $q \ne 0$, then the boundedness of $\mathrm{supp}(\Lambda)$ implies that $T_{p,q} < \infty$ and $T_{2p,2q} < \infty$.
(ii) Growth of $|S_m|$: By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that $\|\partial^{p,q} k - s_{p,q}\|_{S_m \times S_m} \overset{a.s.}{\to} 0$ if $|S_m| = e^{o(m)}$, and $\|\partial^{p,q} k - s_{p,q}\|_{L^r(S_m)} \overset{a.s.}{\to} 0$ if $m^{-1/2}|S_m|^{2d/r}\sqrt{\log|S_m|} \to 0$ (for $1 \le r < \infty$) as $m \to \infty$. An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that $\mathrm{supp}(\Lambda)$ is bounded, and it can be shown that, for $r \in [2, \infty)$, $\|\partial^{p,q} k - s_{p,q}\|_{L^r(S_m)} \overset{a.s.}{\to} 0$ if $|S_m| = o(m^{\frac{r}{4d}})$.

The following result relaxes the boundedness of $\mathrm{supp}(\Lambda)$ by imposing certain moment conditions on $\Lambda$, but at the expense of a worse rate. The proof relies on applying Bernstein's inequality at the elements of a net (which exists by the compactness of $S$), combined with a union bound, and on extending the approximation error from the anchor points by a probabilistic Lipschitz argument.

Theorem 5. Let $p, q \in \mathbb{N}^d$, let $\psi$ be continuously differentiable, let $z \mapsto \nabla_z[\partial^{p,q} k(z)]$ be continuous, let $S \subset \mathbb{R}^d$ be a non-empty compact set, and define $D_{p,q,S} := \sup_{z \in \mathrm{conv}(S^\triangle)} \|\nabla_z[\partial^{p,q} k(z)]\|_2$ and $E_{p,q} := \mathbb{E}_{\omega \sim \Lambda}\big[|\omega^{p+q}|\, \|\omega\|_2\big]$. Assume that $E_{p,q} < \infty$. Suppose there exist $L > 0$ and $\sigma > 0$ such that
$$\mathbb{E}_{\omega \sim \Lambda} |f(z; \omega)|^M \le \frac{M!\, \sigma^2 L^{M-2}}{2} \qquad (\forall\, M \ge 2,\ \forall\, z \in S^\triangle), \qquad (7)$$
where $f(z; \omega) := \partial^{p,q} k(z) - \omega^p(-\omega)^q h_{|p+q|}(\omega^T z)$. Define $F_d := d^{\frac{1}{d+1}} + d^{-\frac{d}{d+1}}$.⁴ Then
$$\Lambda^m\big(\big\{(\omega_i)_{i=1}^m : \|\partial^{p,q} k - s_{p,q}\|_{S \times S} \ge \epsilon\big\}\big) \le 2\, e^{-\frac{m\epsilon^2}{8\sigma^2\left(1 + \frac{\epsilon L}{2\sigma^2}\right)}} + F_d\, 2^{\frac{4d}{d+1}} \left(\frac{|S|(D_{p,q,S} + E_{p,q})}{\epsilon}\right)^{\frac{d}{d+1}} e^{-\frac{m\epsilon^2}{8(d+1)\sigma^2\left(1 + \frac{\epsilon L}{2\sigma^2}\right)}}. \qquad (8)$$

⁴ $F_d$ is monotonically decreasing in $d$; $F_1 = 2$.

Remark 4. (i) The compactness of $S$ implies that of $S^\triangle$. Hence, by the continuity of $z \mapsto \nabla_z[\partial^{p,q} k(z)]$, one gets $D_{p,q,S} < \infty$. Condition (7) holds if $|f(z; \omega)| \le \frac{L}{2}$ and $\mathbb{E}_{\omega \sim \Lambda} |f(z; \omega)|^2 \le \sigma^2$ ($\forall\, z \in S^\triangle$). If $\mathrm{supp}(\Lambda)$ is bounded, then the boundedness of $f$ is guaranteed (see Section B.4 in the supplement). (ii) In the special case when $p = q = 0$, our requirements reduce to the continuous differentiability of $\psi$, $E_{0,0} = \mathbb{E}_{\omega \sim \Lambda}\|\omega\|_2 < \infty$, and (7). (iii) Note that (8) is similar to (3), and therefore, based on the discussion in Section 2, one has $\|\partial^{p,q} k - s_{p,q}\|_{S \times S} = O_{a.s.}\big(|S|\sqrt{m^{-1}\log m}\big)$.
But the advantage of Theorem 5 over [13, Claim 1] and [22, Prop. 1] is that it can handle unbounded functions. Compared to Theorem 4, we obtain worse rates, and it would be of interest to improve the rates of Theorem 5 while still handling unbounded functions.

5 Discussion

In this paper, we presented the first detailed theoretical analysis of the approximation quality of random Fourier features (RFF), which were proposed by [13] in the context of improving the computational complexity of kernel machines. While [13, 22] provided a probabilistic bound on the uniform approximation (over compact subsets of $\mathbb{R}^d$) of a kernel by random features, the result is not optimal. We improved this result by providing a finite-sample bound with the optimal rate of convergence and also analyzed the quality of approximation in $L^r$-norm ($1 \le r < \infty$). We also proposed an RFF approximation for derivatives of a kernel and provided theoretical guarantees on the quality of approximation in uniform and $L^r$-norms over compact subsets of $\mathbb{R}^d$. While all the results in this paper (and also in the literature) deal with the approximation quality of RFF over only compact subsets of $\mathbb{R}^d$, it is of interest to understand its behavior over the entire $\mathbb{R}^d$. However, as discussed in Remark 1(ii) and in the paragraph following Theorem 3, RFF cannot approximate the kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$. By truncating the Taylor series expansion of the exponential function, [3] proposed a non-random finite-dimensional representation to approximate the Gaussian kernel, which also enjoys the computational advantages of RFF. However, this representation also does not approximate the Gaussian kernel uniformly over $\mathbb{R}^d$. Therefore, the question remains whether it is possible to approximate a kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$ while still retaining the computational advantages associated with RFF.

Acknowledgments

Z. Szabó wishes to thank the Gatsby Charitable Foundation for its generous support.

References

[1] A. E. Alaoui and M. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In NIPS, 2015.
[2] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
[3] A. Cotter, J. Keshet, and N. Srebro. Explicit approximations of the Gaussian kernel. Technical report, 2011. http://arxiv.org/pdf/1109.4603.pdf.
[4] S. Csörgő. Multivariate empirical characteristic functions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 55:203–229, 1981.
[5] S. Csörgő and V. Totik. On how long interval is the empirical characteristic function uniformly consistent? Acta Scientiarum Mathematicarum, 45:141–149, 1983.
[6] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
[7] A. Feuerverger and R. A. Mureika. The empirical characteristic function and its applications. Annals of Statistics, 5(1):88–98, 1977.
[8] G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley-Interscience, 1999.
[9] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:1092–1104, 2012.
[10] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. JMLR W&CP (ICML), pages 1452–1461, 2015.
[11] S. Maji, A. C. Berg, and J. Malik. Efficient classification for additive kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:66–77, 2013.
[12] J. Oliva, W. Neiswanger, B. Póczos, E. Xing, and J. Schneider. Fast function to function regression. JMLR W&CP (AISTATS), pages 717–725, 2015.
[13] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.
[14] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In Allerton, pages 555–561, 2008.
[15] L. Rosasco, M. Santoro, S. Mosci, A. Verri, and S. Villa. A regularization approach to nonlinear variable selection. JMLR W&CP (AISTATS), 9:653–660, 2010.
[16] L. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri. Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714, 2013.
[17] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[18] L. Shi, X. Guo, and D.-X. Zhou. Hermite learning with gradient data. Journal of Computational and Applied Mathematics, 233:3046–3059, 2010.
[19] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, A. Strehl, and V. Vishwanathan. Hash kernels. AISTATS, 5:496–503, 2009.
[20] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Technical report, 2014. http://arxiv.org/pdf/1312.3516.pdf.
[21] H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS, 2015.
[22] D. J. Sutherland and J. Schneider. On the error of random Fourier features. In UAI, pages 862–871, 2015.
[23] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:480–492, 2012.
[24] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2005.
[25] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.
[26] Y. Ying, Q. Wu, and C. Campbell. Learning the coordinate gradients. Advances in Computational Mathematics, 37:355–378, 2012.
[27] J. E. Yukich. Some limit theorems for the empirical process indexed by functions. Probability Theory and Related Fields, 74:71–90, 1987.
[28] D.-X. Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463, 2008.