# orlicz_random_fourier_features__5880296f.pdf Journal of Machine Learning Research 21 (2020) 1-37 Submitted 12/19; Revised 6/20; Published 6/20 Orlicz Random Fourier Features Linda Chamakh linda.chamakh@polytechnique.edu Emmanuel Gobet emmanuel.gobet@polytechnique.edu Zolt an Szab o zoltan.szabo@polytechnique.edu CMAP, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris 91128 Palaiseau, France Global Markets Quantitative Research BNP Paribas Editor: Jean-Philippe Vert Kernel techniques are among the most widely-applied and influential tools in machine learning with applications at virtually all areas of the field. To combine this expressive power with computational efficiency numerous randomized schemes have been proposed in the literature, among which probably random Fourier features (RFF) are the simplest and most popular. While RFFs were originally designed for the approximation of kernel values, recently they have been adapted to kernel derivatives, and hence to the solution of large-scale tasks involving function derivatives. Unfortunately, the understanding of the RFF scheme for the approximation of higher-order kernel derivatives is quite limited due to the challenging polynomial growing nature of the underlying function class in the empirical process. To tackle this difficulty, we establish a finite-sample deviation bound for a general class of polynomial-growth functions under α-exponential Orlicz condition on the distribution of the sample. Instantiating this result for RFFs, our finite-sample uniform guarantee implies a.s. convergence with tight rate for arbitrary kernel with α-exponential Orlicz spectrum and any order of derivative. Keywords: random Fourier features, kernel derivative, polynomial-growth functions, α-exponential Orlicz norm, unbounded empirical processes 1. Introduction Kernel machines (Taylor and Cristianini, 2004; Steinwart and Christmann, 2008; Paulsen and Raghupathi, 2016) form one of the most fundamental tools in machine learning and statistics with a wide range of successful applications. The impressive modelling power and flexibility of kernel techniques in capturing complex nonlinear relations originates from the richness of the underlying Hk function class called reproducing kernel Hilbert space (RKHS, Aronszajn 1950) associated to a k : X X R kernel. Kernels extend the classical notion of inner product on X = Rd by assuming the existence of a φ : X H feature map to a Hilbert space H such that k(x, x ) = φ(x), φ(x ) H for all x, x X. This simple equality (also called the kernel trick) forms the basis of kernel techniques and enables one to compute inner products implicitly without direct access to the feature of the points. c 2020 Linda Chamakh, Emmanuel Gobet and Zolt an Szab o. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-1031.html. Chamakh, Gobet and Szab o In applications one is often given {xn}N n=1 samples and is facing with an optimization problem expressed in terms of function values and derivatives1 { pf(xn)}n [N] p Dn , f 2 Hk where [N] = {1, . . . , N}, Dn Nd, N := {0, 1, . . 
.}, l : R1+P n [N] #Dn R is a loss function, #Dn is the cardinality of the set Dn, pf(xn) := p1+...+pdf(xn) p1 x1 pd xd and the RKHS Hk is characterized by f(x) = f, k( , x) Hk ( x X, f Hk) and k( , x) Hk ( x X).2 The first property of RKHSs is called the reproducing property, the second one describes basic elements of Hk; combining the two properties makes the canonical feature map and feature space explicit: k(x, x ) = φ(x), φ(x ) Hk where φ(x) = k( , x) Hk. For example by taking the quadratic loss, Tikhonov regularization, only function values (Dn = {0}, n [N]) and λ > 0, (1) reduces to kernel ridge regression min f Hk 1 N n [N] [f(xn) yn]2 + λ f 2 Hk . Alternatively, one can get back Hermite learning with gradient data (Zhou, 2008; Shi et al., 2010) by additionally including first-order derivatives min f Hk 1 N [f(xn) yn]2 + f (xn) y n 2 2 + λ f 2 Hk , λ > 0 where f (x) = [ e1f(x); . . . ; edf(x)] Rd is the derivative of f, ej Rd is the jth canonical basis vector, 2 is the Euclidean norm and Dn = n 0, {ej}d j=1 o (n [N]). Further examples with function derivatives are semi-supervised learning with gradient information (Zhou, 2008), nonlinear variable selection (Rosasco et al., 2010, 2013), learning of piecewisesmooth functions (Lauer et al., 2012), multi-task gradient learning (Ying et al., 2012), structure optimization in parameter-varying ARX (autoregressive with exogenous input) processes (Duijkers et al., 2014), or density estimation with infinite-dimensional exponential families (Sriperumbudur et al., 2017). An appealing property of RKHSs is that their geometry makes the optimization problem (1) defined over function spaces computationally tractable. Indeed, assuming that l is increasing in its last argument, the pf(x) = f, p,0k( , x) Hk derivative-reproducing property of kernels and the representer theorem (Zhou, 2008) guarantee that the solution of (1) has a finite-dimensional parameterization f( ) = P n [N] P p Dn an,p p,0k( , xn) (an,p R) and it is sufficient to solve q Dm am,q p,qk(xn, xm) n,m [N] p Dn,q Dm an,pam,q p,qk(xn, xm) 1. To have derivatives, in the sequel we assume that X = Rd. 2. We use the k( , x) shorthand to denote the function y X 7 k(y, x) R while keeping x X fixed. Orlicz Random Fourier Features determined by the p,qk(x, y) := Pd i=1(pi+qi)k(x,y) p1 x1 pd xd q1 y1 qd yd kernel derivatives; a = (an,p)n [N], p Dn R P n [N] #Dn. Though kernel methods show impressive modelling power at numerous areas, due to the implicit computation of feature similarities, this flexibility comes with a computational price. Several techniques have been developed in the literature to mitigate this computational challenge such as incomplete Cholesky factorization (Bach and Jordan, 2002), subsampling schemes (Williams and Seeger, 2001; Drineas and Mahoney, 2005; Rudi et al., 2017), sketching (Alaoui and Mahoney, 2015; Yang et al., 2017), random Fourier features (RFF, Rahimi and Recht 2007, 2008), their quasi-Monte Carlo (Yang et al., 2014), memoryefficient (Le et al., 2013; Dai et al., 2014; Zhang et al., 2019), orthogonal (Yu et al., 2016) or structured (Bojarski et al., 2017) variants. In this paper we study the RFF technique which is probably the conceptually simplest and most influential approach.3 By the Bochner theorem (Rudin, 1990) a continuous, bounded, shift-invariant kernel k : Rd Rd R can be written as the Fourier transform of a (finite) measure Λ, called the spectral measure k(x, y) = Z Rd cos ω (x y) dΛ(ω). 
(3) The RFF method uses this representation of k to provide an explicit low-dimensional feature map approximation for the kernel values and f k(x, x ) λ(x), λ(x ) R2M , ˆfw(x) = w, λ(x) R2M , (4) where the integral representation (3) with respect to the measure Λ is replaced by an average over random points; hence the random Fourier feature naming. As a result, one can estimate w by leveraging fast linear primal solvers. The idea has been successfully used in various contexts including differential privacy preserving (Chaudhuri et al., 2011), fast function-to-function regression (Oliva et al., 2015), learning message operators in expectation propagation (Jitkrittum et al., 2015), causal discovery (Lopez-Paz et al., 2015; Strobl et al., 2019), independence testing (Zhang et al., 2017), prediction and filtering in dynamical systems (Downey et al., 2017), bandit optimization (Li et al., 2018), or estimation of Gaussian mixture models (Keriven et al., 2018). Similarly to (4), one can consider RFF-based approximation of kernel derivatives when solving optimization tasks involving function derivatives [see (1) and (2)]. This is the strategy followed for example by Strathmann et al. (2015) to fit distributions belonging to the infinite-dimensional exponential family, which boils down to an optimization problem with third-order kernel derivatives (Sriperumbudur et al., 2017, Theorem 5). The focus of this work is to study the approximation quality of the RFF-based kernelderivative approximation \ p,qk p,qk S := sup x,y S p,qk(x, y) \ p,qk(x, y) , 3. Rahimi and Recht (2007) won the 10-year test-of-time award at NIPS-2017 as a recognition of the influence of RFFs. Chamakh, Gobet and Szab o where S Rd is a compact set. Despite the large number of successful RFF applications, quite little is understood theoretically on its approximation quality. Below we provide a brief summary with particular focus on optimal guarantees and results related to kernel derivatives. Kernel values (p = q = 0): The uniform finite-sample bounds (Rahimi and Recht, 2007; Sutherland and Schneider, 2015) have recently been improved (Sriperumbudur and Szab o, 2015) exponentially in terms of the diameter of the compact set SM (|SM|) arriving to4 k bk SM = Oa.s. log |SM| log M from k bk SM = Op |SM| log M denotes the maximum. The result shows that the diameter of the set SM can grow at a |SM| = eo(M) rate while still getting a consistent estimate; this rate is optimal as shown in the characteristic function literature (Cs org o and Totik, 1983). Kernel ridge regression: RFFs have been settled in kernel ridge regression by Rudi and Rosasco (2017) via showing that using M = o(N) = O N log N random Fourier features is sufficient to get O 1/ N generalization error. Under additional γ-capacity (γ [0, 1]) and r-range space conditions (r 1 2), the same authors showed that even faster, minimax optimal O N 2r 2r+γ rates are achievable with M = o(N) = O N 2r+γ log N RFFs. The result improves the originally proved guarantee (Rahimi and Recht, 2008) holding under the pessimistic M = O(N) setting. The sufficiency of similar sub-linear number of RFFs with minimax guarantees was also established in the kernel least squares setting when applying mini-batched stochastic gradient descent (Carratino et al., 2018). Recently the analysis of Rudi and Rosasco (2017) has been further sharpened (in terms of the number of required RFFs, Li et al. 2019) by leveraging the notion of effective degrees of freedom. 
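To make the construction concrete, the following minimal Python sketch (ours, not part of the original paper) implements the feature map λ of (4) for the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)), whose spectral measure Λ is N(0, σ⁻²I_d), and uses it to solve ridge regression in the 2M-dimensional primal with M ≪ N random features, in the spirit of the Rudi and Rosasco (2017) discussion above; all variable names are illustrative.

```python
import numpy as np

def rff_features(X, omegas):
    # Feature map lambda(x) of Eq. (4): [cos(w_m' x); sin(w_m' x)]_{m=1..M} / sqrt(M).
    proj = X @ omegas.T                                   # (N, M) projections <w_m, x_n>
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
N, d, M, sigma, lam = 5000, 5, 200, 1.0, 1e-3
X = rng.normal(size=(N, d))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=N)      # toy regression targets

# Spectral measure of exp(-||x-y||^2 / (2 sigma^2)) is N(0, sigma^{-2} I_d).
omegas = rng.normal(scale=1.0 / sigma, size=(M, d))
Z = rff_features(X, omegas)                               # (N, 2M) primal features

# Primal ridge regression: solve (Z'Z + N*lam*I) w = Z'y; cost O(N M^2) instead of O(N^3).
w = np.linalg.solve(Z.T @ Z + N * lam * np.eye(2 * M), Z.T @ y)
print("train RMSE:", float(np.sqrt(np.mean((Z @ w - y) ** 2))))
# Predictions at new points: rff_features(X_new, omegas) @ w.
```

The inner product of two such feature vectors equals the Monte Carlo average (1/M)Σ_m cos(ω_mᵀ(x − x')), so the same features can also be used to check the kernel-value approximation in (4) directly against the closed-form kernel.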
Classification with 0-1 loss: In the classification setting with the 0-1 loss and RKHSs, Gilbert et al. (2018) proved that M = o(N) = O N 2 2+c optimized RFF features optimized in the sense of Bach (2017) are sufficient to achieve a learning rate of O N c 2+c provided that the spectrum of the integral operator associated to the kernel decay polynomially at the rate of λi = O (i c) with c > 1.4 The same authors showed that the learning rate can be improved to O N 1 with M = O lnd(N) RFF-s in case of sub-exponential spectrum, where d denotes the dimension of the inputs in the classification. Kernel PCA: Sriperumbudur and Sterge (2018) have proved that the statistical performance of kernel principal component analysis (KPCA) can be matched by M = O(N2/3) (polynomial decay) or M = O( N) (exponential decay) RFFs, depending on the eigenvalue decay of the covariance operator associated to the kernel. Ullah et al. (2018) derived a similar bound for a streaming KPCA algorithm under exponential spectrum decay condition. 4. The classical O( ) notation up to logarithmic factors is denoted by O( ); the extension of O( ) in almost sure and convergence in probability sense are Op( ) and Oa.s.( ). Orlicz Random Fourier Features Kernel derivatives: If the support of the spectral measure associated to k is either bounded or it satisfies the Bernstein condition Qd j=1 ωpj+qj j n (σp,q)n dΛ(ω) n! 2 Kn 2, n = 2, 3, . . . (5) with some constant K 1 and σp,q = Rd Qd j=1 ωpj+qj j 2 dΛ(ω), then \ p,qk p,qk SM = Oa.s. log |SM| log M rate is achievable as shown by Sriperumbudur and Szab o (2015) and Szab o and Sriperumbudur (2019), respectively. As a practical example, one can consider for instance the previously mentioned infinite-dimensional exponential (IE) family fitting problem where the estimation boils to solving a linear equation system with coefficients made of third-order kernel derivatives (|p + q| = 3). The IE family is defined by a kernel for which a common choice is the Gaussian; this implies a Gaussian spectrum Λ. Unfortunately, in this case the bounded support condition is not met. Similarly, the Bernstein condition (5) only holds for |p+q| 2 with the Gaussian kernel, as it can be verified (Szab o and Sriperumbudur, 2019, Remark Higher-order derivatives in Section 3) by making use of the analytical expressions for the absolute moments of the Gaussian spectrum. These limitations (summarized in Table 1) of the popular random Fourier features technique motivate our work and the study of widely-applied kernels with unbounded spectral support for the RFF approximation of high-order kernel derivatives. A consequence of our new estimates in Theorem 1 is that the a.s. rates previously obtained under stringent conditions (on p, q or Λ) are now available for any p, q and any spectral measure Λ with α-exponential moments (as defined in (6), α > 0). Because Bernstein condition implies exponential moments, our result includes the one given by Szab o and Sriperumbudur (2019). Particularly, assuming additional smoothness on the bounded shift-invariant kernel, its derivative satisfies a representation similar to (3): p,qk(x, y)= Z j=1 ωpj j ( ωj)qj # c(Pd i=1 |pi+qi|) ω (x y) | {z } =:fx y(ω) where cn is the nth derivative of the cos( ) function. The primary difficulty is to handle the polynomial growing nature of the F = {ω 7 fx y(ω) : x, y S} function class which controls the error \ p,qk p,qk S. 
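As a quick numerical sanity check of the derivative representation above (our own sketch, assuming the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)) with spectral measure N(0, σ⁻²I_d); the variable names are illustrative), one can average f_{x−y}(ω_m) over i.i.d. spectral samples and compare the result with the closed-form first derivative of the kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, sigma, j = 3, 200_000, 1.0, 0
x, y = rng.normal(size=d), rng.normal(size=d)
z = x - y

# Spectral samples for exp(-||x-y||^2 / (2 sigma^2)): Lambda = N(0, sigma^{-2} I_d).
omegas = rng.normal(scale=1.0 / sigma, size=(M, d))

# f_{x-y}(omega) for p = e_j, q = 0: omega_j * cos^{(1)}(omega'(x-y)) = -omega_j * sin(omega'(x-y)).
estimate = np.mean(-omegas[:, j] * np.sin(omegas @ z))

# Closed form for the Gaussian kernel: d k(x, y) / dx_j = -(x_j - y_j) / sigma^2 * k(x, y).
exact = -z[j] / sigma**2 * np.exp(-np.dot(z, z) / (2 * sigma**2))
print(estimate, exact)   # agree up to O(M^{-1/2}) Monte Carlo error
```

Higher-order derivatives are estimated in the same way, by multiplying with the corresponding monomial ω^p(−ω)^q and shifting the phase of the cosine accordingly.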
We tackle this challenge by imposing the finiteness of the α-exponential Orlicz norm of the spectral measure (Λ) associated to the kernel, in other words α > 0, c > 0 such that Eω Λ ec ω α 2 < + . (6) Chamakh, Gobet and Szab o Assumption on the spectral measure Conditions Convergence rate on p, q for p,qk \ p,qk SM 2nd moment exists p = q = 0 Oa.s. log |SM| log M Ref: Sriperumbudur and Szab o (2015, Th. 1) bounded support any p, q Oa.s. log |SM| log M Ref: Sriperumbudur and Szab o (2015, Th. 4) Bernstein condition small p, q Oa.s. log |SM| log M Ref: Szab o and Sriperumbudur (2019) Orlicz condition any p, q Oa.s. log |SM| log M Table 1: Summary of RFF guarantees on kernel values and derivatives. Last line: it includes any measure Λ with a finite α-exponential moment (for some α, c > 0, Eω Λ ec ω α 2 < + ), like the Gaussian, the inverse multiquadratic, or the sech kernel, see Corollary 4. For further examples see Table 2. Kernels with α-exponential Orlicz spectrum include the popular Gaussian, the inverse multiquadric, or the sech kernel; for further examples see Table 2 and Remark 5(ii). We establish the consistency and prove finite-sample uniform guarantees of the resulting Orlicz RFF scheme for the approximation of kernel derivatives at any order, as it is briefly illustrated in the last line of Table 1. To allow this level of generality, we prove a new finite-sample deviation bound for the empirical process related to a general class of functions f with polynomial growth of the sample Xm. The distribution of the latter is assumed to have finite α-exponential Orlicz norm and consequently, the random variables f(Xm) belong to a γ-exponential Orlicz space with index γ smaller than 1. For deriving such deviation bounds, we have been inspired by the work of Adamczak (2008) which elegantly combines the Klein and Rio (2005) inequality for truncated variables, the Hoffman-Jorgensen inequality to deal with sum of residual of truncated variables, and a Talagrand (1989) inequality in γ-exponential Orlicz norms for sum of centered random variables. However, our work significantly differs from that of Adamczak (2008). First, our aims are different: Adamczak (2008) focuses on getting large deviation bounds while we are looking for all-scale deviation bounds, which leads to a different analysis (in the application of Klein-Rio inequalities for instance). Second, we are concerned by getting upper bounds with quite explicit control. In particular, this requires a careful treatment of Orlicz-type estimates since the function Ψγ(x) = exγ 1 defining the Orlicz space is not convex for γ < 1 (see Figure 1), as opposed to the usual case; see the results in Section 4. We also derive sharp estimates from the Dudley entropy integral bound (Theorem 9), which enables us to get a tight dependency w.r.t. the diameter of the parameter space. Furthermore, we clarify the use of the Talagrand inequality (Theorem 7); Orlicz Random Fourier Features in Adamczak (2008, Theorem 5) it is seemingly invoked for supremum over functions while it is related to sum over centered random variables. With this novel finite-sample deviation bound, the analysis of Orlicz RFFs readily follows, using optimized inequalities. The paper is structured as follows. Our problem is formulated in Section 2. The main result on the approximation quality of kernel derivatives with random Fourier features is presented in Section 3. Properties of the Orlicz norm are summarized in Section 4. Proofs are provided in Section 5. 
The appendix contains additional technical details (Section A), the definition of special functions (Section B) and external statements used in the proofs (Section C). 2. Problem Formulation In this section we formally define our problem after introducing a few notations. Notations: Let the set of natural, real and complex numbers, positive integers, positive reals, non-negative reals and non-positive integers be denoted by N = {0, 1, . . .}, R, C, Z+ = {1, 2, . . .}, R+ = (0, ), R 0 = [0, ) and Z 0 = {0, 1, 2, . . .}, respectively. The positive value of x R is denoted by (x)+ = x 0. The Gamma function is Γ(t) = R 0 xt 1e xdx for t C\Z 0. For x R the secant function is sech(x) = 1 cosh(x). Let a S +b = {as+b : s S} where S R and a, b R. For an N Z+, [N] = {1, . . . , N}. Let the nth derivative of the cos( ) function (n N) be cn = cos(n). For multi-indices p, q Nd and ω Rd let |p| = Pd j=1 pj, ωp = Qd j=1 ωpj j , pf(x) = |p|f(x) p1 x1 pd xd , p,qg(x, y) = |p|+|q|g(x,y) p1 x1 pd xd q1 y1 qd yd . The diameter of a compact set T Rd is denoted by |T| = supx,y T x y 2 < . Let S Rd be a Borel set. Let S = S S = {a b : a S, b S}. The set of Borel probability measures on S is written as M+ 1 (S). Let the Banach space of real-valued, r-power µ-integrable functions on S (1 r < ) be Lr(S, µ), with f Lr(S,µ) = R S |f(x)|rdµ(x) 1 r . We use the shorthand µf = R S f(x)dµ(x) where µ M+ 1 (S) and f L1 (S, µ). The product measure of µ1, . . . , µM M+ 1 (S) is M m=1µm; specifically when all the components coincide we use the shorthand µM = m [M]µ. The empirical measure is PM = 1 M PM m=1 δXm with δX being the Dirac measure concentrated on X and X1, . . . , XM M m=1µm. Let (rn)n N be a positive sequence. The boundedness of Xn rn almost surely is denoted by Xn = Oa.s.(rn). Let n R+. We say that an f : Rd R function is of polynomial growth of order n (shortly f FP(n)) if supx Rd |f(x)| 1+ x n 2 < ; FP = n R+FP(n). Let us assume that Ψ : R 0 R 0 is a continuous, strictly increasing mapping, Ψ(0) = 0 and limx Ψ(x) = . The set of Rd-valued random variables having finite Ψ-Orlicz norm is defined as LΨ = n X : X Ψ := inf n c > 0 : EΨ X 2 c 1 o < + o . Throughout the paper we will be particularly interested in (see Figure 1) Ψα : x R 0 7 exα 1 R 0 (α > 0), in other words in random variables having finite α-exponential Orlicz norm. The fact X LΨα is equivalent to the existence of an s > 0 constant such that E h es X α 2 i < . Random variables X LΨ2 and X LΨ1 are called sub-Gaussian and sub-exponential, respectively. For f FP and random variable X having α-exponential moment (X LΨα) Ef(X) < . Special functions are defined in Table 4. Chamakh, Gobet and Szab o 0 0.5 1 1.5 0 =0.1 =0.3 =1.5 Figure 1: Ψα for different α values. We proceed by formally defining our task. Let k : Rd Rd R be a continuous, bounded and shift-invariant kernel. Then, by the Bochner theorem (Rudin, 1990) one can assume w.l.o.g. the existence of a Λ M+ 1 Rd spectral measure such that k(x, y) = Z Rd cos ω (x y) dΛ(ω) Rd cos ω x cos ω y + sin ω x sin ω y dΛ(ω). Let p, q Nd and assume that R Rd |ωp+q| dΛ(ω) < . In this case p,qk(x, y) exists, and by the dominated convergence theorem one arrives at p,qk(x, y) = Z Rd p cos ω x q cos ω y + p sin ω x q sin ω y dΛ(ω). The integral can be estimated by Monte-Carlo technique replacing Λ with ΛM = 1 M PM m=1 δωm, (ωm)m [M] i.i.d. 
Λ: \ p,qk(x, y) = 1 h p cos ω mx q cos ω my + p sin ω mx q sin ω my i = λp(x), λq(y) R2M , (7) where λp(x) = 1 h p cos ω mx m [M] ; p sin ω mx i R2M; this is the RFF feature approximation λp in (4). For p = q = 0, the construction reduces to the traditional RFF technique (Rahimi and Recht, 2007). This form implies that our target quantity can be written as \ p,qk p,qk S = sup z S |(ΛM Λ)(fz)|, fz(ω) = ωp( ω)qc|p+q| ω z , (8) = sup f F |(ΛM Λ)(f)|, F = {fz : z S } , (9) thus the problem boils down to the study of supremum of empirical processes with F FP(n) where n = |p + q| + 1. In the next section we detail our main result about the fluctuation of such processes. Orlicz Random Fourier Features 3. Main Result In this section we present our main result on the supremum of empirical processes of polynomial growth, and specialize it to the approximation quality of RFFs for kernel derivatives. The proofs are given in Section 5. We investigate the concentration of the supf F | 1 M PM m=1 f(Xm)| quantity under the following assumptions: 1. Compact parameterization: F = {ft : t T} where ft : Rd R is parameterized by a compact set T Rd . 2. Lipschitz condition: There exists n R+ and function L : Rd R 0, L FP(n) such that (a) |ft0(x)| L(x) for some t0 T, (b) |ft1(x) ft2(x)| L(x)ρ ( t1 t2 2) for all x Rd, t1 T, t2 T, (c) with ρ : [0, |T|] R 0 continuous strictly increasing mapping with ρ(0) = 0 such that Iρ(|T|) := ρ (|T|) R 1 0 log 1 + 2|T| ρ 1(uρ(|T|)) du < . 3. Independence, finite α-exponential Orlicz norm: (a) (Xm)m [M] are independent Rd-valued random variables; shortly, (Xm)m [M] m [M]µm with µm M+ 1 Rd . (b) α R+ such that Xm Ψα < for all m [M]. 4. Centering: E [f(Xm)] = 0 for all f F and m [M]. Under these conditions, our main result is as follows. Theorem 1 (Concentration of processes with polynomial growth) Assume that F and (Xm)m [M] satisfy Assumptions 1-4 and γ := α n 1. Let log stand for the natural logarithm, βγ := Γ 1 + 1 γ γ , P := m [M]µm, and L L2(X1:M) := q 1 M P m [M] L2(Xm). Let Ψ(l) γ be the convexification5 of Ψγ, Aγ := Ψ(l) γ 1 (1) Ψ 1 γ (1) , Bγ := Ψ(l) γ 1(1), Cγ and CD be the constants defined in (45) and (46), and Kγ := 2 1 γ 1 1 + Aγ . Then for any ε > 0 satisfying ε 6B, B := 2CD d E h L L2(X1:M) i M Iρ(|T|), (10) m [M] ft(Xm) ε Mε 3Kγ maxm [M] supt T |ft(Xm)| Ψγ 72σ2+84c ε , (11) 5. The function Ψγ is not convex for γ < 1. We convexify Ψγ and use the Section 4(v) based integral control property holding for convex Ψ-s; for details on Ψ(l) γ see Section A.1. Chamakh, Gobet and Szab o σ2 := sup t T m [M] E f2 t (Xm) , c := max m [M] sup t T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ 8E max m [M] sup t T |ft(Xm)| [0, + ). (i) Two-sided bound: For P inft T 1 M P m [M] ft(Xm) ε the same one-sided deviation bound can be obtained by replacing ft with ft. As a result one can estimate P supt T 1 M P m [M] ft(Xm) ε by twice the bound above. (ii) Assumption (3): Assumption (3a) with Assumption (4) is weaker than being i.i.d.: for example E [ft(Xm)] = 0 holds for Xm = N 0, σ2 m and ft(x) = ctx3, but Xm-s can differ in their variance. (iii) Assumption α/n 1: This condition holds without loss of generality. Indeed, in case of α/n > 1, one can get a modified (α , n ) pair satisfying α /n 1 by either increasing n to the value n = α using that FP(n) FP(n ), or by decreasing α to the value α = n using that Xm Ψα < implies Xm Ψα < for any α (0, α). (iv) Proof-related remarks: 1. Compactness of T: This compactness with the Lipschitz property enables one to control the covering number of F. 2. 
Truncated functions: The Lipschitz property of F implies that of the truncated functions: for x Rd, s and t T |Tcft(x) Tcfs(x)| |ft(x) fs(x)| L(x)ρ ( t s 2) , (12) where Tcf(x) := f(x)1|f(x)| c + c1f(x)>c c1f(x)< c is f soft-thresholded at level c. 3. F FP(n): This property is inherited (Section 5.4) from L FP(n) by the Lipschitz conditions (2a)-(2b). 4. Finiteness of the terms in Theorem 1: maxm [M] supt T |ft(Xm)| Ψ α E maxm [M] supt T |ft(Xm)| are finite (see Section 5.4) in Theorem 1 by the Lipschitz assumption (2a)-(2b), Xm Ψα < (Assumption (3b)) and L FP(n) (Assumption (2)). (v) RFF specialization: Assuming that the α-exponential Orlicz condition holds for the spectral measure Λ associated to k ( α R+ such that ω Ψα < , ω Λ),6 one can see (Section 5.1) that RFFs are covered by choosing d = d, ft(x) fz(ω) Λfz, t z, T S , Xm ωm, 6. This requirement implies that R Rd |ωp+q| dΛ(ω) < and thus the existence of p,qk for any p, q Nd. Orlicz Random Fourier Features ρ(u) = uβ, β = 1 1 + (log|S |)+ (0, 1], n |p + q|+β. While any value of β (0, 1] would meet the assumptions, allowing β to depend on the diameter of S enables us to get optimal convergence rates w.r.t. the diameter (see Corollary 4). The terms driving the guarantee for RFF can be bounded (Section 5.6) as follows: there is a constant CRFF R+, depending only on Λ, |p + q|, but not on |S | and M, such that 1 + (log|S |)+ max m [M] sup z S gz(ωm) Ψγ CRFF, max m [M] sup z S |gz(ωm)| Ψγ CRFF [log(1 + M)]n/α , max m [M] sup z S |gz(ωm)| CRFF [log(1 + M)]n/α . Using these bounds, our finite-sample uniform guarantee on Orlicz RFFs is as follows. Corollary 3 (Orlicz RFFs for kernel derivative approximation) Let k : Rd Rd R be a continuous, bounded, shift-invariant kernel with spectral measure Λ. Suppose that Λ satisfies the α-exponential Orlicz assumption ( α R+ such that ω Ψα < , ω Λ) and let S Rd be a compact set. Let β = 1 1+(log|S |)+ (0, 1], let p, q Nd, n := |p+q|+β, and assume that γ := α n 1. Let \ p,qk be the RFF estimate of p,qk using (ωm)m [M] i.i.d. Λ samples as given in Eq. (7). Then, there exists a constant C R+ (depending only on Λ, |p + q|, but not on S and M) such that for any ϵ C 1+(log|S |)+ ΛM \ p,qk p,qk S ϵ 2e (Mε)γ C log(1+M) + e C 1+ε[log( C/ε) log(1+M)]1/γ . (14) Corollary 4 (Almost sure convergence for kernel derivative approximation) Let p, q Nd and k : Rd Rd R be a continuous, bounded, shift-invariant kernel with spectral measure Λ which satisfies the α-exponential Orlicz assumption for some α > 0. Then, for any sequence of compact sets (SM) M=2 such that (log |SM|)+ = o(M), we have \ p,qk p,qk SM = Oa.s. (log |SM|)+ log M Chamakh, Gobet and Szab o (i) Spectral measure (Λ) examples: Our result assumes the α-exponential Orlicz property of the spectral measure Λ associated to k. In Table 2 we provide various examples for Λ (with the relevant case of unbounded support) satisfying this requirement; their relation is summarized in Figure 2. While for the RFF approximation it is not necessary, in many of these examples the corresponding kernel value can also be computed, see Table 3. (ii) α-exponential Orlicz assumption for tensor product kernels: Using the αexponential Orlicz spectral measures of Table 2 on R, one can immediately construct Orlicz spectral measures on Rd. Indeed, assume that (i) k is a product kernel, i.e. k(x, y) = Q i [d] ki(xi, yi), Λ = i [d]Λi, and (ii) Λi, the spectral measure associated to ki, satisfies the αi-exponential Orlicz assumption (αi R+). 
Then ω Λ is αexponential Orlicz with α = mini [d] αi; see Section A.2. (iii) α-exponential Orlicz vs. Bernstein assumption: Our result complements Szab o and Sriperumbudur (2019) s work, where the authors showed that for d = 1 and spectral densities fλ(ω) e ω2ℓthe Bernstein condition (and hence fast rates) holds for |p+q| 2ℓ= α. Indeed, we proved under the more general α-exponential Orlicz assumption the same a.s. convergence rates for any arbitrary order (see Corollary 4) kernel derivatives. Spectrum Spectral density: fΛ(ω) Parameters α 2σ2 σ > 0 2 Laplace σ 2 e σ|ω| σ > 0 1 generalized Gaussian α 2βΓ( 1 α α > 0, β > 0 α variance Gamma σ2b|ω|b 1 2 (σ|ω|) πΓ(b)(2σ)b 1 2 σ > 0, b > 1 Weibull (S) s 2λ |ω| λ s 1 e |ω| λ s s > 0, λ > 0 s exponentiated exponential (S) α 2λ 1 e |ω| λ α 1 e |ω| λ λ > 0, α > 0 1 exponentiated Weibull (S) αs 2λ |ω| λ s 1 1 e |ω| λ s α 1 s > 0, λ > 0, α > 0 s Nakagami (S) mm Γ(m)Ωm |ω|2m 1e mω2 chi-squared (S) 1 2 s 2 +1Γ( s 2)|ω| s 2 1e |ω| Erlang (S) λs|ω|s 1e λ|ω| 2(s 1)! s Z+, λ > 0 1 Orlicz Random Fourier Features Spectrum Spectral density: fΛ(ω) Parameters α Gamma (S) 1 2Γ(s)θs |ω|s 1e |ω| θ s > 0, θ > 0 1 generalized Gamma (S) p/a D p |ω|D 1e |ω| a p a > 0, D > 0, p > 0 p Rayleigh (S) |ω| 2σ2 e ω2 2σ2 σ > 0 2 Maxwell-Boltzmann (S) 1 2a2 a3 a > 0 2 chi (S) 1 2 s 2 Γ( s 2)|ω|s 1e ω2 exponential-logarithmic (S) 1 2 log(p) β(1 p)e β|ω| 1 (1 p)e β|ω| p (0, 1), β > 0 1 Weibull-logarithmic (S) 1 2 log(p) αβ(1 p)|ω|α 1e β|ω|α 1 (1 p)e β|ω|α p (0, 1) , β > 0, α > 0 α Gamma/Gompertz (S) bseb|ω|βs 2(β 1+eb|ω|) s+1 b > 0, β > 0, s > 0 bs hyperbolic secant 1 2sech π logistic e ω s i2 s > 0 1 normal-inverse Gaussian αδK1(α δ2+ω2 eδα α > 0, δ R 1 hyperbolic 1 2δK1(δα)e α δ2+ω2 α > 0, δ R 1 generalized hyperbolic (α/δ)λ 2 λ α > 0, λ R, δ R 1 Table 2: Kernel spectrum examples in one dimension (d = 1) obeying the α-exponential Orlicz assumption. (S) stands for symmetrized. The symmetrization guarantees that the kernel associated to Λ is real-valued. Last column: Orlicz exponent. For the variance Gamma distribution the Orlicz exponent follows from the known Ku(z) p asymptotics (Barndorff-Nielsen et al., 2001, page 297) where Ku is the modified Bessel function of the second kind, as defined in Table 4. Notice that the normal-inverse Gaussian δ=σ2α, α Gaussian limit (see Figure 2) changed the Orlicz exponent from 1 to 2. 4. Properties of the Orlicz Norm In this section, for self-containedness we summarize the properties of Ψ which hold independently of the convexity/non-convexity of Ψ (unless explicitly required). Let X, X Rd be random variables, and assume that Ψ : R 0 R 0 (and similarly Φ below) is continuous, strictly increasing, Ψ(0) = 0 and limx Ψ(x) = . Chamakh, Gobet and Szab o exponentiated exponential (S) Weibull -logarithmic (S) Gamma/Gompertz (S) exponentiated Weibull (S) exponential -logarithmic (S) chi-squared (S) Erlang (S) Weibull (S) generalized Gamma (S) Gamma (S) Laplace Rayleigh (S) chi (S) variance Gamma generalized Gaussian hyperbolic secant Maxwell-Boltzmann (S) normal-inverse Gaussian Gaussian logistic generalized hyperbolic hyperbolic Nakagami (S) α = 1 β = 1 s = 1 p = D p = 1 p = 2, a = 2 s Z+ s Z+/2 s = 2 (σ = 1) s = 3 (a = 1) α = 2 δ=σ2α, Figure 2: Relation of the spectral density examples of Table 2. (S) stands for symmetrized. 
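As a numerical illustration of the α-exponential Orlicz condition (6) behind Table 2 (a sketch of ours, not taken from the paper), the Ψ_α-Orlicz norm of ‖ω‖₂ can be estimated from spectral samples by bisecting over c in the defining inequality E[exp((‖ω‖₂/c)^α)] ≤ 2; the function and variable names below are illustrative.

```python
import numpy as np

def empirical_orlicz_norm(r, alpha, tol=1e-4):
    # Empirical version of ||X||_{Psi_alpha} = inf{c > 0 : E[exp((||X||_2 / c)^alpha)] <= 2},
    # with the expectation replaced by a sample average over the radii r = ||X||_2.
    lo, hi = 1e-6, 1.0
    while np.mean(np.exp((r / hi) ** alpha)) > 2.0:   # grow the bracket until c = hi is feasible
        hi *= 2.0                                     # (overflow of np.exp to inf is harmless here)
    while hi - lo > tol:                              # bisect on c
        mid = 0.5 * (lo + hi)
        if np.mean(np.exp((r / mid) ** alpha)) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(2)
gauss = np.abs(rng.normal(size=100_000))              # Gaussian spectrum on R: alpha = 2
laplace = np.abs(rng.laplace(size=100_000))           # Laplace spectrum on R: alpha = 1
print(empirical_orlicz_norm(gauss, alpha=2.0))        # finite, stable in the sample size
print(empirical_orlicz_norm(laplace, alpha=1.0))      # finite, stable in the sample size
print(empirical_orlicz_norm(laplace, alpha=2.0))      # grows with the sample size: Psi_2 norm is infinite
```

With α matched to the spectrum (α = 2 for a Gaussian, α = 1 for a Laplace spectrum) the estimate stabilizes as the sample grows, whereas for a too-large α it keeps increasing, reflecting an infinite Orlicz norm.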
Orlicz Random Fourier Features Kernel name Kernel value: k(x, y) Spectrum Gaussian e σ2(x y)2 inverse quadric σ2 σ2+(x y)2 Laplace π Γ(1/α) Ψ1,1 1 2; 1 , [β(x y)]2 4 generalized Gaussiana inverse multiquadric h σ2 σ2+(x y)2 ib variance Gamma P n 2N ( 1) n 2 (x y)nλn s Weibull (S)b [1 2i(x y)] s 2 +[1+2i(x y)] s 2 2 chi-squared (S)b λ i s + h 1+ i(x y) 2 Erlang (S)b [1 θi(x y)] s+[1+θi(x y)] s 2 Gamma (S)b 1 σ(x y)e σ2(x y)2 2 erfi σ(x y) Rayleigh (S)b F1,1 s 2; 1 P n 2N ( 1) n 2 (x y)nΓ( n log(p)n!β α n Li n α+1(1 p) Weibull-logarithmic (S)b,c 1 2 [cΛ(x y) + cΛ(y x)], with cΛ(t) = Gamma/Gompertz (S)b = βs sb sb ti F2,1 s + 1; ti b + s + 1; 1 β sech sech(x y) hyperbolic secant πs(x y) sinh(πs(x y)) logistic α2+(x y)2 i normal-inverse Gaussian α2+(x y)2K1(δα) hyperbolic Kλ(δα) generalized hyperbolic a. The analytical computation of the characteristic function (and hence the kernel value) was carried out for α > 1 (Pog any and Nadarajah, 2010). b. In case of symmetrization (S): k(x, y) = 1 2 [cΛ(x y) + cΛ(y x)] where cΛ(t) = Eω Λ[eitω] is the characteristic function of the spectral measure (on R 0) before symmetrization; i = 1. c. The characteristic function was obtained by Ciumara and Preda (2009). Table 3: Kernel examples for the spectral densities given in Table 2. The special functions Ψ1,1, erfi, F1,1, Li, F2,1 and Kλ are defined in Table 4. Chamakh, Gobet and Szab o (i) Normalization: If X LΨ then E h Ψ X 2 (ii) Constant: For a λ R constant λ Ψ= |λ|/Ψ 1(1). (iii) Monotonicity in Ψ: Ψ Φ implies X Ψ X Φ. (iv) Monotonicity in the argument: If d = 1 and 0 X X a.s., then X Ψ X Ψ. (v) Finite Ψ implies integrability: If Ψ is convex and X LΨ, then E [ X 2] X Ψ Ψ 1(1). (vi) Generalized triangle inequality: Let X, X LΨα and α R+. Then X + X LΨα and X + X Ψα 2( 1 α 1)+ X Ψα + X Ψα (vii) Deviation inequality from Ψ: If X LΨ then P ( X 2 c) 2 Ψ(c/ X Ψ)+1 for any c 0. (viii) Maximal inequality for Ψα and α R+: for any sequence (Xm)M m=1 of random variables in LΨα, we have max m [M] Xm 2 Ψα max m [M] Xm Ψα The proofs of these properties are available in Section A.3. After introducing a few additional notations, we provide the proofs of our results and remarks presented in Sections 3 and 4. External statements used in the proofs are summarized in Section C. Notations: For γ (0, 1] and x R 0, let Iγ(x) = R x 0 e tγdt be the incomplete Gamma function. Let (Z, m) be a semi-metric space and ϵ R+. The set S Z is said to be an ϵ-net of Z if for any z Z there exists s S such that m(s, z) ϵ. The ϵ-covering number of Z is defined as the size of the smallest ϵ-net, i.e., N(ϵ, m, Z) = inf n ℓ 1 : s1, . . . , sl Z such that Z ℓ j=1Bm(sj, ϵ) o , where Bm(s, ϵ) = {z Z : m(z, s) ϵ} is the closed ball with center s Z and radius ϵ. 5.1. Proof of Remark 2(v) In view of (8)-(9), we need to check Assumptions 1-4 with the parameterized function class gz(ω) := fz(ω) Λfz = ωp( ω)qc|p+q| ω z Λfz, (z S ). Thanks to the α-exponential Orlicz condition on Λ and the i.i.d. property of (ωm)m [M] in (7), Assumption 3 is trivially fulfilled. Assumption 4 holds by the definition of gz( ) and because the distribution of ωm is Λ. Assumption 1 is satisfied since S is a compact set of Rd. Therefore, it remains to prove Assumption 2, with the existence of n R+ and L FP(n). First, notice that i [d] |ωi|pi+qi ω |p+q| 2 . (16) Orlicz Random Fourier Features Order: (16) implies that |gz(ω)| |fz(ω)| + Λ|fz| ω |p+q| 2 + Λ h |p+q| 2 i =: L1(ω). 
(17) Lipschitz condition: Let [z1, z2] = {az1 + (1 a)z2 : a [0, 1]} denote the segment connecting z1, z2 Rd. By using the mean value theorem |gz1(ω) gz2(ω)| max z [z1,z2] 2 z1 z2 2 , (18) z with fz(ω) z = ωp( ω)qc|p+q|+1 ω z ω, and by using similar computations as before, one gets gz(ω) 2 ω |p+q|+1 2 + Λ h |p+q|+1 2 i =: L2(ω). (19) As a result, to fulfill Assumption 2, we can take L(ω) = max(L1(ω), L2(ω)) and ρ(u) = u. For such L, we have n = |p + q| + 1. Refined L and ρ: We now derive refined L and ρ, by interpolating different bounds. From (17), we can obtain the crude estimate |gz1(ω) gz2(ω)| 2L1(ω), which combined with (18)-(19) gives |gz1(ω) gz2(ω)| (2L1(ω))1 βh z1 z2 2 L2(ω) iβ (20) for any β (0, 1]. Here we have used that if 0 x min(x1, x2) then x x1 β 1 xβ 2. It follows that one can take ρ(u) = uβ, n = |p + q| + β, L(ω) = max L1(ω), (2L1(ω))1 βLβ 2(ω) FP(n). (21) For β = 1, we retrieve the former choice of L and ρ. Furthermore, we have Iρ(|T|) =|T|β Z 1 log 1 + 2|T| (u|T|β)1/β du =|T|β Z 1 log 1 + 2 u1/β du < + . (22) Notice that the advantage of having the additional degree-of-freedom β is two-fold, and it is striking when β 0 (compared to β = 1). Firstly, it gives a smaller n, which has a (slight) positive impact on the control of statistical fluctuations; secondly, the dependence of Iρ(|T|) in the diameter |T| is smaller through the growth exponent. To conclude, we have proved that Orlicz RFFs fulfill the assumptions of Theorem 1. Later in Section 5.6, we will establish that Iρ(|T|) satisfies a (tight) bound w.r.t. p 1 + (log|T|)+. 5.2. Proof that Polynomial Growth Preserves the Exponential Orlicz Property We show that f(X) Ψγ < for X Ψα < , f FP(n), n R+, γ = α n. Indeed, by the definition of f FP(n), there exists C R+ such that |f(x)| C(1 + x n 2) for all x Rd. Hence for any γ > 0 |f(x)|γ Cγ (1 + x n 2)γ ( ) 2(γ 1)+Cγ (1 + x nγ 2 ) , (23) Chamakh, Gobet and Szab o where in ( ) we used that (a + b)γ 2(γ 1)+ (aγ + bγ) , a, b 0, γ > 0. (24) Since X LΨα there is some s R+ for which E h es X α 2 i < . Combining this property with (23) and recalling that nγ = α yields es |f(x)|γ es 2(γ 1)+Cγ(1+ x α 2 ) E h es |f(X)|γi es 2(γ 1)+CγE h es X α 2 i < with s = s 2(γ 1)+Cγ ; this shows that f(X) LΨγ. 5.3. Proof of Theorem 1 By introducing the Rcf(x) := f(x) Tcf(x) notation of residuals obtained at level c R+ (the value of c will be specified later), we bound the target quantity by using the subadditivity of supremum m [M] ft(Xm) | {z } Tcft(Xm)+Rcft(Xm) m [M] (Tcft(Xm) E [Tcft(Xm)] + E [Tcft(Xm)] + Rcft(Xm)) Tcft(Xm) E [Tcft(Xm)] + sup t T E m [M] Tcft(Xm) m [M] Rcft(Xm) This means that using c for which ETc ε m [M] ft(Xm) ε P ZRc ε/3 + P Z Tc ε/3 . (25) The structure of our proof is as follows. 1. Unbounded part (ZRc): Based on the Talagrand and the Hoffman-Jorgensen inequalities, for large enough c (referred to as c HJ) we will derive an exponential control over P ZRc ε/3 expressed with maxm [M] supt T |ft(Xm)| Ψγ which is finite by Section 5.4. 2. Bounded part (Z Tc): We handle this term using the Klein-Rio inequality and the Dudley entropy integral bound. In addition, this part will give rise to the constraint (10) on ε. 3. Truncation (ETc): As E[ft(Xm)] = 0, Tcft ft and E[Tcft(Xm)] 0 for large c (called cmin). The ETc ε 3 requirement can be controlled via the integral form of the expectation of non-negative random variables and the incomplete Gamma function. Orlicz Random Fourier Features The bounding of the ZRc, Z Tc and ETc quantities is detailed in the following sections. 
Plugging the (26) and (28) results of the computations into (25) gives the final bound (11). The ε constraint comes from (33), provided that c cmin c HJ. The constants cmin and c HJ are defined in (35) and (38), respectively. 5.3.1. Bounding ZRc P ZRc ε/3 is bounded as P ZRc ε/3 P m [M] |Rcft(Xm)| Mε/3 m [M] sup t T |Rcft(Xm)| Mε/3 Mε/3 P m [M] supt T |Rcft(Xm)| Ψγ Mε 3Kγ maxm [M] supt T |ft(Xm)| Ψγ where in (a) we used the the sub-additivity of the supremum, in (b) the deviation inequality Section (4)(vii) was applied, (c) holds by Section 5.5 for c c HJ (the value of c HJ is defined in Section 5.5). 5.3.2. Bounding Z Tc Below we will invoke the Klein-Rio inequality and control the expectation E h Klein-Rio inequality: Let gm,t : x Rd 7 Tcft(x) E [Tcft(Xm)] and let us define the function classes Tc F[M] := {gt := (g1,t, . . . , g M,t) : t T}, Tc F := {Tcft : t T}. gm,t [ 2c, 2c] are measurable and bounded functions. Centering: by construction E[gm,t(Xm)] = 0 ( m [M]). Countability: Since t 7 ft is continuous, the supt T can be restricted to rational numbers (T Qd), one can take T T Qd, and assume that Tc F[M] is countable. If Z Tci ε/6, (27) then the Klein-Rio inequality (Theorem 8 where the supt T and supf Tc F[M] coincide) implies that Z Tc ε/3 (27) P Z Tci ε/6 e M (ε/6)2 2( σ2+4c E[ZTc])+6c ε/6 (27) e M (ε/6)2 2 σ2+14cε/6 = e M ε2 72 σ2+84c ε e M ε2 72σ2+84c ε , (28) Chamakh, Gobet and Szab o where the weak variance σ2 is defined and bounded by σ2 := sup t T m [M] E h (Tcft(Xm) E [Tcft(Xm)])2i sup t T m [M] E h (Tcft(Xm))2i m [M] E f2 t (Xm) =: σ2. Bounding E h Z Tci : We control E h Z Tci in (27) by the Dudley entropy integral bound. In this bound the covering number of Tc F is estimated by that of the compact set T Rd with propagation relying on Assumption (2b). Dudley entropy integral bound: Slight modification (without absolute value) of van der Vaart and Wellner (1996, Lemma 2.3.1) gives Z Tci 2E [R(X1:M, Tc F)] , (29) where R(x1:M, Tc F) := Eε h supt T 1 M P m [M] εm Tcft(xm) i is the Rademacher average of Tc F, X1:M := (Xm)m [M], x1:M := (xm)m [M], ε :=(εm)m [M] contains independent Rademacher variables (i.e. P (εm = 1) = 1 2), and ε is independent of X1:M. Let Zt(x1:M) := 1 M P m [M] εm Tcft(xm), so R(x1:M, Tc F) = Eε [supt T Zt(x1:M)], and define the pseudo-metric on T as d(t, s) := 1 M P m [M] [Tcft(xm) Tcfs(xm)]2 1/2 . The {Zt : t T} process is separable since it is continuous, and T Rd is separable, centered thanks to the Rademacher variables, sub-Gaussian with respect to M 1/2d: indeed, for any λ R+ and t, s T Eε h eλ(Zt Zs)i (a) = m=1 Eεm h eεm λ M [Tcft(xm) Tcfs(xm)]i (b) 2M2 [Tcft(xm) Tcfs(xm)]2 λ2(M 1/2d(t,s)) 2 In (a) we used the independence of εm-s, (b) follows from Eεm [eaεm] = cosh(a) ( ) e a2 2 ( a R), where ( ) can be obtained from the power series expansion of the cosh( ) and the exponential function. Hence Theorem 9 can be applied: R(x1:M, Tc F) CD log(N(ε, M 1/2d, T))dε log(N(ε, d, T))dε, (30) where we used the N(ε, M 1/2d, T) = N(M1/2ε, d, T) identity, and applied an ε = M1/2ε substitution. We note that the above infinite integral can be truncated at |T|d := supt,s T d(t, s), the d-diameter of T, since N(ε, d, T) = 1 for ε |T|d. Orlicz Random Fourier Features Covering number: By (12) one can relate d(t, s) and t s 2 as m [M] L2(xm) ρ ( t s 2) := L L2(x1:M) ρ ( t s 2) , which implies N(ε, d, T) N ρ 1 ε L L2(x1:M) |T|d L L2(x1:M) sup t,s T ρ ( t s 2) L L2(x1:M) ρ(|T|). (32) In the last inequality the increasing property of ρ was exploited. 
Combining (31)-(32) with the well-known bound (van de Geer, 2000, Lemma 2.5) on the covering number7 of a compact set T Rd N ε , 2 , T 2|T| ε + 1 d , ε > 0, (30) can be estimated further as R(x1:M, Tc F) CD Z L L2(X1:M)ρ(|T|) v u u u u u tlog ρ 1 ε L L2(x1:M) d L L2(x1:M) ρ (|T|) log 1 + 2|T| ρ 1 (uρ (|T|)) where we introduced the new variable u = ε L L2(x1:M)ρ(|T|). Substituting this bound into (29) we arrive at d E h L L2(X1:M) i M Iρ(|T|) =: B. (33) To guarantee E h Z Tci ε/6, we solve B ε 6; this gives the (10) bound on ε. 5.3.3. Bounding ETc Bounding ETc by the incomplete Gamma function (Iγ): E [Tcft(Xm)] (a) = E [Rcft(Xm)] (b) E ( ft(Xm) c) 1ft(Xm) c 7. In our definition of the covering number, in its bound on compact sets in Rd (van de Geer, 2000, Lemma 2.5) and in the final Dudley entropy bound (Bartlett, 2013, Lecture 11, 14) the elements of the ϵ-net are assumed to belong to the set covered. Chamakh, Gobet and Szab o c P ( ft(Xm) y) dy. (34) In (a) we used that Tcft(x) = ft(x) Rcft(x) and E[ft(Xm)] = 0, (b) follows from Rcft(x) = [ft(x) + c]1ft(x) c + [ft(x) c]1ft(x) c [ft(x) + c]1ft(x) c. (c) holds by using that for a Z 0 random variable, E[Z] = R 0 P (Z z) dz; we choose Z = max (0, ft(Xm) c). Therefore ETc = sup t T E m [M] Tcft(Xm) (a) max m [M] sup t T c P ( ft(Xm) y) dy (b) 2 max m [M] sup t T c e y ft(Xm) Ψγ (c) = 2 max m [M] sup t T c ft(Xm) Ψγ e uγdu (d) = 2 max m [M] sup t T 0 e uγdu Z c ft(Xm) Ψγ (e) = 2 max m [M] sup t T c ft(Xm) Ψγ (f) 2 max m [M] sup t T ft(Xm) Ψγ c maxm [M] supt T ft (Xm ) Ψγ (g) 2Γ 1 + 1 c maxm [M] supt T ft(Xm) Ψγ max m [M] sup t T ft(Xm) Ψγ where (a) holds by taking maximum over m [M] and using (34), (b) follows from the P ( ft(Xm) y) 2e y ft(Xm) Ψγ deviation inequality implied by Section 4(vii). (c) was obtained from a u = y ft(Xm) Ψγ substitution, in (d) we decomposed the integral to have the incomplete Gamma function appear. (e) is a consequence of the definition of Iγ and the limit Iγ(x) = Z x 0 e tγdt = 1 0 u 1 γ 1e udu x 1 where we applied an u = tγ substitution and the Γ(z + 1) = zΓ(z) recursion. (f) comes from the monotonicity of Iγ (Iγ(x) Iγ(y) if x y). (g) follows from applying the lower bound on Iγ from Theorem 10 with x = c maxm [M] supt T ft(Xm) Ψγ . Orlicz Random Fourier Features Additional truncation level bound on c: Guaranteeing B ε 3 (and thus ETc ε 3) is equivalent to choosing c large enough such that c maxm [M] supt T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ . Because γ 1, the function h : x 7 1 (1 x) 1 γ is concave on [0, 1], and thus it is below its tangent line computed at (0, h(0)), i.e. 1 (1 x) 1 γ 1 γ x. Therefore choosing c maxm [M] supt T ft(Xm) Ψγ it is enough to use c such that c maxm [M] supt T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ . Solving this inequality for c means that c cmin := max m [M] sup t T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ 5.4. Proof of F FP(n), E maxm [M] supt T |ft(Xm)| < , and maxm [M] supt T |ft(Xm)| Ψ α By Assumption (2a)-(2b), the triangle inequality and the monotonicity of ρ, one gets |ft(x)| |ft(x) ft0(x)| + |ft0(x)| L(x) [ρ ( t t0 2) + 1] L(x)[ρ(|T|) + 1], (36) for any t T, x Rd. The individual statements now can be proved as follows. F FP(n): By L FP(n) and (36), ft FP(n) for all t T, in other words F FP(n). Finiteness of maxm [M] supt T |ft(Xm)| Ψ α n : Using (36), we get max m [M] sup t T |ft(Xm)| [ρ(|T|) + 1] X m [M] L(Xm). (37) Thanks to Section 5.2, each L(Xm) belongs to LΨ α n . Combining this with the generalized triangular inequality Section 4(vi) gives the claim. 
Finiteness of E maxm [M] supt T |ft(Xm)| : Each L(Xm) is integrable (because L has a polynomial growth and the distribution of Xm satisfies the α-Orlicz exponential assumption). Thus, the statement follows from (37). 5.5. Control if c c HJ We show that under the assumptions of Theorem 1 with c c HJ := 8E max m [M] sup t T |ft(Xm)| (38) Chamakh, Gobet and Szab o m [M] sup t T |Rcft(Xm)| max m [M] sup t T |ft(Xm)| Ψγ . (39) Notice that c HJ is finite by Section 5.4. We bound the l.h.s. of (39): m [M] sup t T |Rcft(Xm)| sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| + E sup t T |Rcft(Xm)| Ψγ (a) 2 1 γ 1 sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| Ψγ m [M] sup t T |Rcft(Xm)| (b) 2 1 γ 1 sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| + max m [M] sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| + 1 Ψ 1 γ (1)E m [M] sup t T |Rcft(Xm)| =: 2 1 γ 1 Cγ(E1 + E2) + 1 Ψ 1 γ (1)E3 In (a) we applied the generalized triangle inequality Section 4(vi) and 1 γ 1 as γ = α n (0, 1]. In (b) the Talagrand inequality (45) was invoked with the Ym := supt T |Rcft(Xm)| E [supt T |Rcft(Xm)|] centered variables and B := R, followed by taking the γ-Orlicz norm of the constant λ := E h P m [M] supt T |Rcft(Xm)| i according to Section 4(ii). We continue the derivation with bounding the E1, E2 and E3 terms in (40). Bounding E1: sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| Orlicz Random Fourier Features m [M] sup t T |Rcft(Xm)| (b) 16E max m [M] sup t T |Rcft(Xm)| (c) 16E max m [M] sup t T |ft(Xm)| (d) 16 max m [M] sup t T |ft(Xm)| Ψ(l) γ Ψ(l) γ 1 (1) (e) 16 max m [M] sup t T |ft(Xm)| Ψγ Ψ(l) γ 1 (1), where in (a) we used the triangle inequality, in (b) we applied the Hoffman-Jorgensen inequality (Theorem 6; t0 = 0, p = 1, B = R, Ym = supt T |Rcft(Xm)|) with m [M] sup t T |Rcft(Xm)| > 0 m [j] sup t T |Rcft(Xm)| > 0 = P max m [M] sup t T |ft(Xm)| > c (g) P max m [M] sup t T |ft(Xm)| c HJ 8 = 1 2 4p with p = 1. In (f) the non-negativity of Ym was exploited; in (g) c c HJ was used. We applied the Markov inequality and the definition of c HJ in (h). (c) holds by |Rcft(Xm)| |ft(Xm)|. In (d) and (e) we applied Section 4(v) with the convex Ψ(l) γ defined in Section A.1 and the monotonicity property Section 4(iii) with Ψ(l) γ Ψγ, respectively. Bounding E2: E2 = max m [M] sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| (a) max m [M] sup t T |Rcft(Xm)| + E sup t T |Rcft(Xm)| Ψγ (b) max m [M] sup t T |Rcft(Xm)| + max m [M] E sup t T |Rcft(Xm)| Ψγ (c) 2 1 γ 1 max m [M] sup t T |Rcft(Xm)| Ψγ + max m [M] E sup t T |Rcft(Xm)| Ψγ (d) 2 1 γ 1 max m [M] sup t T |Rcft(Xm)| Ψγ + 1 Ψ 1 γ (1)E max m [M] sup t T |Rcft(Xm)| ! (e) 2 1 γ 1 max m [M] sup t T |ft(Xm)| Ψγ + Ψ(l) γ 1 (1) max m [M] sup t T |ft(Xm)| Ψ(l) γ (f) 2 1 γ 1 Ψ(l) γ 1 (1) max m [M] sup t T |ft(Xm)| Ψγ . Chamakh, Gobet and Szab o In (a) we used the triangle inequality with the monotonicity Section 4(iv), (b) holds by the sub-additivity of the maximum and again the monotonicity Section 4(iv), in (c) we applied the generalized triangle inequality Section 4(vi) and that 1 γ 1 γ 1 as γ (0, 1], (d) holds by Section 4(ii) with the constant λ = E maxm [M] supt T |Rcft(Xm)| , (e) is by the monotonicity Section 4(iv) as |Rcft(Xm)| |ft(Xm)|, and by Section 4(v), (f) follows from Ψ(l) γ Ψγ combined with the monotonicity Section 4(iii). Bounding E3: By (b)-(e) of the E1 derivation we have that m [M] sup t T |Rcft(Xm)| 8 max m [M] sup t T |ft(Xm)| Ψγ Ψ(l) γ 1 (1). By adding the obtained E1, E2 and E3 bounds, we get (39) with Kγ defined in Theorem 1. 5.6. 
Bounding the Driving Terms of Theorem 1 for RFF We bound the constants of Theorem 1 in the RFF case described in Remark 2(v). The term B: It is defined in (10). Recalling the expression (22) for Iρ(|T|) and using the Cauchy-Schwarz inequality for bounding E L L2(ω1:M) by p Eω Λ [L2(ω)] gives Eω Λ [L2(ω)] M |S |β Z 1 log 1 + 2u 1/β du. We now aim at showing a tight bound for |S |β R 1 0 log 1 + 2u 1/β du w.r.t. |S | with an appropriate choice of β = β(|S |). Indeed, let β = 1 1+(log|S |)+ (0, 1]. We start by proving the bound log 1 + 2u 1/β du 4 β , β (0, 1]. (41) By the change of variable t = β log 1 + 2u 1 β (i.e. u = et/β 1 2 β ), we get (et/β 1)β+1 dt = 2β t et(1 e t/β)β+1 dt. Using the fact that 1 e t/β 2 3 on [β log(3), + ), we arrive at te tdt ( ) 9 2 β Γ 3 where the inequality ( ) is obtained by taking β = 1 in 3β+1 and β = 0 in the integral; hence (41) is proved. Now, using (41) with β = 1 1+(log|S |)+ and its |S |β = e log|S | 1+(log|S |)+ implication, we get log 1 + 2u 1/β du 4e log|S | 1+(log|S |)+ p 1 + (log|S |)+ 4e p 1 + (log|S |)+, Orlicz Random Fourier Features and therefore Eω Λ [L2(ω)] p 1 + (log|S |)+ The term σ2: It is defined in Theorem 1. Since the variance is bounded by the second moment, E g2 z(ωm) E f2 z(ωm) . Furthermore, since (ωm)M m=1 are i.i.d., the previous expectation can bounded by E h ω 2|p+q| 2 i using (16). As a result, we get σ2 Eω Λ h ω 2|p+q| 2 i . The term maxm [M] supz S gz(ωm) Ψγ with γ = α/n 1 and n = |p + q| + β: It appears in the definition of c (in Theorem 1). In view of the bound (17) which is uniform in z and using property (iv) of Section 4, we get max m [M] sup z S gz(ωm) Ψγ = max m [M] sup z S |gz(ωm)| Ψγ ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n . The term maxm [M] supz S |gz(ωm)| Ψγ: It shows up in the exponential bound (11). We invoke the maximal inequality for the γ-Orlicz norm (item (viii) of Section 4) with the previous estimate to obtain max m [M] sup z S |gz(ωm)| Ψγ log(1 + M) n/α ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n . The term E maxm [M] supz S |gz(ωm)| : It appears in the definition of c. Using properties (iii) and (v) of Section 4 and the convexification of Ψγ, we directly get max m [M] sup z S |gz(ωm)| Ψ(l) α/n 1 (1) log(1 + M) n/α ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n . Collecting the different bounds we obtain (13) by setting CRFF(n) := max Eω Λ [L2(ω)], Eω Λ h ω 2|p+q| 2 i , 1 Ψ(l) α/n 1 (1) [log(3/2)] n/α ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n Long (but standard) computations show that CRFF(n) is uniformly bounded for n [|p + q|, |p + q| + 1], and thus we can set CRFF := supn [|p+q|,|p+q|+1] CRFF(n). 5.7. Proofs of Corollary 3 and 4 Corollary 3 with the existence of C R+ is a direct consequence of Theorem 1 combined with Remark 2(v), in particular because CRFF does not depend on S and M, and Kγ can Chamakh, Gobet and Szab o be bounded uniformly in n [|p + q|, |p + q| + 1]. The Talagrand constant Cγ is uniformly bounded w.r.t. γ provided that γ is bounded away from 0, see the proof of Talagrand (1989, Theorem 3). Now let us prove Corollary 4; set εM = (1+[log |(SM) |]+) log(1+M) M . 
Observe that (i) εM satisfies the lower bound requirement on ε in Corollary 3; (ii) by assumption εM 0 as M 0 by using that |S | 2|S|; (iii) therefore #1/γ 1 + εM !#1/γ + [log(1 + M)]1/γ 2 + εM [log(1 + M)]1/γ for M large enough; As a consequence of (iii) and (iv), setting δM := 1+(log |(SM) |)+ log(1+M) 1, we get (for M large enough) M ε2 M C 1 + εM h log C log(1 + M) i1/γ 6 C log(1 + M)δM 6 C C δM[log(1+M)]1/2+1/γ = 6 log(1 + M) δM 2 + z M δM , where z M = 6 C C [log(1+M)]1/2+1/γ M M 0. Since the function δ R+ 7 δ 2+z M δ is increasing and δM 1, we get (for M large enough) M ε2 M C 1 + εM h log C log(1 + M) i1/γ 6 log(1 + M)1 On the other hand, using (iv), we easily get (MεM)γ C log(1+M) 2 log(1 + M) for M large enough. To sum up, in view of (14), we have proved (still for large enough M) ΛM \ p,qk p,qk SM ϵM 2 (1 + M)2 + 1 (1 + M)2 Orlicz Random Fourier Features and by the Borell-Cantelli lemma, we conclude to the a.s. convergence (15). Acknowledgments This research is supported by the Chair Stress Test, RISK Management and Financial Steering of the Foundation Ecole Polytechnique and by the Association Nationale de la Recherche Technique. This work is part of the first author s doctoral thesis, under the supervision of the two other authors. The authors wish to thank Radoslaw Adamczak for his detailed comments on his work. Appendix A. Additional Proofs A.1. Ψ(l) γ , the Convexification of Ψγ In the proof of Theorem 1 an integral control with convex Ψ (see Section 4(v)) is beneficial/applied. However, Ψγ is not convex for γ (0, 1). To handle this issue, we convexify Ψγ(x) = exγ 1 in case of γ (0, 1) for small values of the argument.8 By computing the derivatives of Ψγ we get that it is convex iffx xγ := 1 γ γ . Indeed, Ψ γ(x) = γxγ 1exγ, Ψ γ(x) = γexγ (γ 1)xγ 2 + xγ 1γxγ 1 Ψ γ(x) = 0 x = xγ, Ψ γ(x) > 0 x > xγ, Ψ γ(x) < 0 x < xγ. We also have to make sure that Ψ(l) γ , constructed as the line connecting (0, 0) with (x, Ψγ(x)) glued to Ψγ|[x, ), gives a convex function, for a suitable choice of x. A geometric argument shows that it is enough to choose x xγ(> 0) such that x Ψ γ(x) exγ 1 γxγexγ. Since the r.h.s. is higher order than the l.h.s., the requirement can be satisfied for large enough x; we can choose xγ := inf x xγ : exγ 1 γxγexγ , Ψ(l) γ (x) := xγ x if x [0, xγ), Ψγ(x) if x [ xγ, ). Notice that by construction Ψ(l) γ Ψγ. 8. For γ = 1, Ψ(l) γ = Ψγ. Chamakh, Gobet and Szab o A.2. Proof of Remark 5(ii) ωi LΨαi means that Eωi Λi esi|ωi|αi < for some si R+ ( i [d]). Let α = mini [d] αi and ω α := P i [d] |ωi|α 1 α . Then ω 2 d supi [d] |ωi| d ω α ( ω Rd). Notice that |ωi|α |ωi|αi if |ωi| 1 and |ωi|α 1 otherwise, i.e. we have |ωi|α |ωi|αi + 1 for any ωi R. This means that taking s = mini [d] si and s := s dα/2 > 0 gives ω α 2 dα/2 ω α α = dα/2 X i [d] |ωi|α dα/2 X i [d] (|ωi|αi + 1), Eω h e s ω α 2 i Eω h es P i [d](|ωi|αi+1)i Eωi h es|ωi|αii es Y Eωi h esi|ωi|αii es < , where we used the independence of ωi-s in ( ). We got that Eω Λ h e s ω α 2 i < which implies that ω LΨα. A.3. Proof of the Properties in Section 4 about the Orlicz Norm Properties (i)-(iv): These properties are well-known and directly follow from the definition of the Orlicz norm. Property (v): The case X Ψ= 0 gives a trivial inequality and can be discarded. Since Ψ is bounded from below by an increasing affine function, X LΨ implies that X is integrable. Combining (i) with Jensen s inequality gives Ψ E[ X 2] E h Ψ X 2 X Ψ i 1, and the result follows. 
Property (vi): It is well-known that the usual triangle inequality holds for α 1. We now focus on the case α (0, 1]. Set c := X α Ψα + X α Ψα 1/α, p := cα X α Ψα and q := cα X α Ψα , and notice that 1 q = 1. Then, combining (24) with γ = α (0, 1) and the H older inequality with the conjugate exponents (p, q) yields lim sup m + E e m X+X 2 c α lim sup m + E e m X 2 c α e m X 2 lim sup m + E em p X α 2 cα 1/p lim sup m + E em q X α 2 cα 1/q item (i) 21/p21/q = 2. Therefore, X + X LΨα and X + X Ψα c by the definition of the α-Orlicz norm. Applying (24) with γ = 1/α we get c 2( 1 α 1)+ X Ψα + X Ψα and hence the claimed result is proved. Property (vii): This is a direct consequence of the Markov inequality. Property (viii): A similar statement was proved by van der Vaart and Wellner (1996, Lemma 2.2.2), but under the assumption that Ψα is convex (which holds only if α 1) and without explicit constant. Our statement is valid for any α > 0 with explicit control. Orlicz Random Fourier Features A first inequality: Let α R+. We claim that for any x0 > 0 and any x, y 1, we have Ψα x1/α 0 x Ψα x1/α 0 y Ψα x1/α 0 Ψα x1/α 0 xy . (43) Because Ψα(x) = Ψ1(xα) where Ψ1(x) =: Ψ(x) = ex 1, the inequality for α = 1 clearly implies those for all α > 0. To prove the inequality for α = 1, let x0 and x be fixed, and set H(y) = Ψ(x0)Ψ(x0xy) Ψ(x0x) Ψ(x0y). One has H (y) = x0xΨ(x0)ex0xy x0Ψ(x0x)ex0y = x2 0xex0ex0x Ψ(x0) x0ex0 ex0x(y 1) Ψ(x0x) x0xex0x ex0(y 1) , x0ex0 = 1 e x0 0 e ux0du Z 1 0 e ux0xdu = Ψ(x0x) ex0x(y 1) ex0(y 1), where we used x0 > 0, x, y 1 at the two last inequalities. This shows that H (y) 0, and since H(1) = 0 we have H(y) 0 for any y 1. Consequently, (43) is proved. Final maximal inequality: We follow the arguments of van der Vaart and Wellner (1996, Lemma 2.2.2) with slights modifications. The inequality (43) can be rewritten as Ψα(x) Ψα x1/α 0 Ψα xy/x1/α 0 /Ψα(y), x, y x1/α 0 . (44) Set c = maxm [M] Xm Ψα /x1/α 0 and let y x1/α 0 . If maxm [M] Xm 2 cy x1/α 0 , then we have the crude bound Ψα maxm [M] Xm 2 Ψα x1/α 0 = Ψ(x0). If maxm [M] Xm 2 cy x1/α 0 , then (44) yields maxm [M] Xm 2 maxm [M] Xm 2 maxm [M] Xm Ψα m [M] Ψ(x0)Ψα Xm 2/ Xm Ψα /Ψα(y). Consequently, in both cases we have maxm [M] Xm 2 m [M] Ψα Xm 2/ Xm Ψα /Ψα(y) Taking expectation and using Property (i), we arrive at E h Ψα maxm [M] Xm 2 Ψ(x0) h 1 + M Ψα(y) i . Let us choose x0 such that Ψ(x0) < 1. In this case the choice y = x1/α 0 Ψ 1 α M 1/Ψ(x0) 1 ensures that the above bound is valid and smaller than 1. Chamakh, Gobet and Szab o Consequently, by the definition of α-Orlicz norm, we get maxm [M] Xm 2 Ψα cy. The choice x0 = log(3/2) satifies the previous requirement: Ψ(x0) = 1 2 < 1. In this case y = [log(3/2) log(1 + M)]1/α = [log(1 + M)]1/α since M 1. We have obtained the claimed Property (viii). Appendix B. Special Functions Name Definition Modified Bessel function of the first kind Ja(x) = P n N 1 n!Γ(n+a+1) x Modified Bessel function of the second kind Ka(x) = π 2 J a(x) Ja(x) sin(aπ) Fox-Wright generalized hyperbolic function Ψ1,1 ((a, A); (b, B); x) = P n N Γ(a+An) Γ(b+Bn) xn n! (Imaginary) error function erfi(x) = P n N 2 π x2n+1 n!(2n+1) Kummer s confluent hypergeometric function F1,1(a; b; x) = P n N a(n) b(n) xn n! Polylogarithm function Lia(x) = P n Z+ xn na Ordinary hyperbolic function F2,1(a, b; c; z) = P n N a(n)b(n)zn Table 4: Definition of special functions. Ja(x), Ka(x): x R and a is non-integer; when a is an integer the limit is taken. 
Appendix B. Special Functions

    Modified Bessel function of the first kind:      J_a(x) = Σ_{n ∈ N} [1 / (n! Γ(n + a + 1))] (x/2)^{2n + a}
    Modified Bessel function of the second kind:     K_a(x) = (π/2) [J_{−a}(x) − J_a(x)] / sin(aπ)
    Fox-Wright generalized hypergeometric function:  Ψ_{1,1}((a, A); (b, B); x) = Σ_{n ∈ N} [Γ(a + An) / Γ(b + Bn)] x^n / n!
    (Imaginary) error function:                      erfi(x) = Σ_{n ∈ N} (2/√π) x^{2n+1} / (n! (2n + 1))
    Kummer's confluent hypergeometric function:      F_{1,1}(a; b; x) = Σ_{n ∈ N} [a^{(n)} / b^{(n)}] x^n / n!
    Polylogarithm function:                          Li_a(x) = Σ_{n ∈ Z_+} x^n / n^a
    Ordinary hypergeometric function:                F_{2,1}(a, b; c; z) = Σ_{n ∈ N} [a^{(n)} b^{(n)} / c^{(n)}] z^n / n!

Table 4: Definition of special functions. J_a(x), K_a(x): x ∈ R and a is non-integer; when a is an integer the limit is taken. Ψ_{1,1}((a, A); (b, B); x): a ∈ R_+, b ∈ R_+, x ∈ R, A ∈ R_+, B ∈ R_+ and 1 + B > A; Ψ_{1,1}((a, 1); (b, 1); x) = [Γ(a)/Γ(b)] F_{1,1}(a; b; x). erfi(x): x ∈ R. F_{1,1}(a; b; x): a ∈ R_+, b ∈ R_+, x ∈ R. Li_a(x): a ∈ R, x ∈ R, |x| < 1. F_{2,1}(a, b; c; z): a ∈ C, b ∈ C, c ∈ C \ Z_{≤0}, z ∈ C and |z| < 1; for |z| ≥ 1 its analytical continuation is taken. For n ∈ N, a^{(n)} is the rising factorial of a defined as a^{(n)} = Γ(a + n) / Γ(a), where a ∈ C \ Z_{≤0} and a + n ∈ C \ Z_{≤0}.

Appendix C. External Statements

In this subsection we state external results which were used to derive our results. Below B stands for a separable Banach space, and L_p(B) is the space of B-valued p-integrable functions. The norm ‖·‖_{Ψ_α} is defined analogously to the R^d case by changing ‖·‖_2 to ‖·‖_B.

Theorem 6 (Hoffmann-Jørgensen inequality, Ledoux and Talagrand 2013, Proposition 6.8) Let p > 0, M ∈ Z_+, (Y_m)_{m ∈ [M]} be independent random variables in L_p(B), S_m := Σ_{j=1}^m Y_j for m ∈ [M], and t_0 := inf { t > 0 : P( max_{1 ≤ m ≤ M} ‖S_m‖_B > t ) ≤ (2 · 4^p)^{−1} }. Then

    E[ max_{m ∈ [M]} ‖S_m‖_B^p ] ≤ 2 · 4^p E[ max_{m ∈ [M]} ‖Y_m‖_B^p ] + 2 (4 t_0)^p.

Theorem 7 (Talagrand, 1989, Theorem 3) Let γ ∈ (0, 1]. Then there is a constant C_γ such that for all finite sequences (Y_m)_{m ∈ [M]} of independent, mean-zero, integrable random variables in L_{Ψ_γ}(B), we have

    ‖ Σ_{m ∈ [M]} Y_m ‖_{Ψ_γ} ≤ C_γ ( E‖ Σ_{m ∈ [M]} Y_m ‖_B + ‖ max_{m ∈ [M]} ‖Y_m‖_B ‖_{Ψ_γ} ).

Theorem 8 (Klein-Rio inequality for the supremum of empirical processes, Klein and Rio 2005, Theorems 1.1-1.2) Let M ∈ Z_+, c ∈ R_+, (X_m)_{m ∈ [M]} be independent B-valued random variables, and F a countable set of f := (f_1, . . . , f_M) measurable functions from B into [−c, c]^M such that E[f_m(X_m)] = 0 for all m ∈ [M]. Define Z := sup_{f ∈ F} (1/M) Σ_{m ∈ [M]} f_m(X_m) and σ² := (1/M) sup_{f ∈ F} E[ Σ_{m ∈ [M]} f_m²(X_m) ]. Then, for any t ≥ 0, the following right- and left-hand sided deviation inequalities hold:

    P( Z ≥ E[Z] + t ) ≤ e^{ −M t² / ( 2(σ² + 2c E[Z]) + 3c t ) },    P( Z ≤ E[Z] − t ) ≤ e^{ −M t² / ( 2(σ² + 2c E[Z]) + 2c t ) }.

Theorem 9 (Dudley entropy integral bound)⁹ Let {Z_t : t ∈ T} be a zero-mean separable stochastic process that is sub-Gaussian w.r.t. a pseudo-metric d on the indexing set T, in other words, for every λ ∈ R,

    E[ e^{λ (Z_t − Z_s)} ] ≤ e^{ λ² d(s, t)² / 2 }    (∀ s, t ∈ T).

Then there exists a universal constant C_D such that

    E[ sup_{t ∈ T} Z_t ] ≤ C_D ∫_0^∞ √( log N(ε, d, T) ) dε,    (46)

where N(ε, d, T) denotes the covering number.

9. See van der Vaart and Wellner (1996, Corollary 2.2.8) for a general statement. Regarding the numerical value of C_D, van Handel (2016, Corollary 5.25) proves that one can take C_D = 12, whereas Bartlett (2013, Lecture 14) suggests a slightly smaller constant C_D = 8√2.

Theorem 10 (Alzer 1997, Theorem 1)¹⁰ Let γ ∈ (0, 1], β_γ := Γ(1 + 1/γ)^{−γ}, x ∈ R_{≥0}, and I_γ(x) := ∫_0^x e^{−t^γ} dt. Then

    ( 1 − e^{−β_γ x^γ} )^{1/γ} ≤ I_γ(x) / Γ(1 + 1/γ) ≤ ( 1 − e^{−x^γ} )^{1/γ}.

10. The statement here follows by taking the limit of the cited result at γ = 1 and x = 0.
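As a quick sanity check of Theorem 10 as reconstructed above (an illustration added here, not part of the paper; the grid and the tolerance are arbitrary choices), the following sketch evaluates both sides of Alzer's inequality for several γ ∈ (0, 1], computing I_γ(x) by numerical quadrature.

# Numerical illustration (not from the paper): sanity check of Theorem 10 (Alzer),
#   (1 - exp(-beta_g * x^g))^(1/g) <= I_g(x)/Gamma(1+1/g) <= (1 - exp(-x^g))^(1/g),
# with beta_g = Gamma(1+1/g)^(-g) and I_g(x) = int_0^x exp(-t^g) dt (numerical quadrature).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

tol = 1e-6   # numerical slack; for g = 1 both bounds coincide with the middle term
for g in [0.3, 0.5, 0.8, 1.0]:
    beta = Gamma(1.0 + 1.0 / g) ** (-g)
    ok = True
    for x in np.linspace(0.05, 10.0, 200):
        I = quad(lambda t: np.exp(-t ** g), 0.0, x)[0]
        mid = I / Gamma(1.0 + 1.0 / g)
        lower = (1.0 - np.exp(-beta * x ** g)) ** (1.0 / g)
        upper = (1.0 - np.exp(-x ** g)) ** (1.0 / g)
        ok = ok and (lower <= mid + tol) and (mid <= upper + tol)
    print(f"gamma = {g:.1f}: inequality holds on the grid -> {ok}")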
References

Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000–1034, 2008.

Ahmed El Alaoui and Michael Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems (NIPS), pages 775–783, 2015.

Horst Alzer. On some inequalities for the incomplete Gamma function. Mathematics of Computation of the American Mathematical Society, 66(218):771–778, 1997.

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18:1–38, 2017.

Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Ole E. Barndorff-Nielsen, Thomas Mikosch, and Sidney I. Resnick. Lévy Processes: Theory and Applications. Springer Science+Business Media, 2001.

Peter Bartlett. Theoretical statistics. Available at https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/, 2013.

Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, François Fagan, Cédric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamás Sarlós, and Jamal Atif. Structured adaptive and random spinners for fast machine learning computations. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1020–1029, 2017.

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 10192–10203, 2018.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.

Roxana Ciumara and Vasile Preda. The Weibull-logarithmic distribution in lifetime analysis and its properties. In International Conference Applied Stochastic Models and Data Analysis (ASMDA), pages 395–399, 2009.

Sándor Csörgő and Vilmos Totik. On how long interval is the empirical characteristic function uniformly consistent? Acta Scientiarum Mathematicarum, 45:141–149, 1983.

Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems (NIPS), pages 3041–3049, 2014.

Carlton Downey, Ahmed Hefny, Byron Boots, Boyue Li, and Geoff Gordon. Predictive state recurrent neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 6053–6064, 2017.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

René Duijkers, Roland Tóth, Dario Piga, and Vincent Laurain. Shrinking complexity of scheduling dependencies in LS-SVM based LPV system identification. In IEEE Conference on Decision and Control, pages 2561–2566, 2014.

Anna Gilbert, Ambuj Tewari, and Yitong Sung. But how does it work in theory? Linear SVM with random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 3379–3388, 2018.

Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, Ali Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltán Szabó. Kernel-based just-in-time learning for passing expectation propagation messages. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 405–414, 2015.

Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, and Patrick Pérez. Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA, 7:447–508, 2018.

Thierry Klein and Emmanuel Rio. Concentration around the mean for maxima of empirical processes. The Annals of Probability, 33(3):1060–1077, 2005.

Fabien Lauer, Van Luong Le, and Gérard Bloch. Learning smooth models of nonsmooth functions via convex optimization. In International Workshop on Machine Learning for Signal Processing (IEEE-MLSP), 2012.

Quoc Le, Tamás Sarlós, and Alexander Smola. Fastfood - computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning (ICML), pages 244–252, 2013.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2013.
Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

Zhu Li, Jean-François Ton, Dino Oglic, and Dino Sejdinovic. A unified analysis of random Fourier features. In International Conference on Machine Learning (ICML), pages 3905–3914, 2019.

David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, and Ilya Tolstikhin. Towards a learning theory of cause-effect inference. In International Conference on Machine Learning (ICML), pages 1452–1461, 2015.

Junier Oliva, Willie Neiswanger, Barnabás Póczos, Eric Xing, and Jeff Schneider. Fast function to function regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 717–725, 2015.

Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge University Press, 2016.

Tibor K. Pogány and Saralees Nadarajah. On the characteristic function of the generalized normal distribution. Comptes Rendus Mathematique, 348(3-4):203–206, 2010.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2007.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems (NIPS), pages 1313–1320, 2008.

Lorenzo Rosasco, Matteo Santoro, Sofia Mosci, Alessandro Verri, and Silvia Villa. A regularization approach to nonlinear variable selection. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 653–660, 2010.

Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714, 2013.

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems (NIPS), pages 3218–3228, 2017.

Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems (NIPS), pages 3891–3901, 2017.

Walter Rudin. Fourier Analysis on Groups. Wiley-Interscience, 1990.

Lei Shi, Xin Guo, and Ding-Xuan Zhou. Hermite learning with gradient data. Journal of Computational and Applied Mathematics, 233:3046–3059, 2010.

Bharath Sriperumbudur and Nicholas Sterge. Approximate kernel PCA using random features: Computational vs. statistical trade-off. Technical report, Pennsylvania State University, 2018. (https://arxiv.org/abs/1706.06296).

Bharath K. Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems (NIPS), pages 1144–1152, 2015.

Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(57):1–59, 2017.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems (NIPS), pages 955–963, 2015.
Eric V. Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1):20180017, 2019. (eISSN 2193-3685; https://doi.org/10.1515/jci-2018-0017).

Dougal J. Sutherland and Jeff Schneider. On the error of random Fourier features. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 862–871, 2015.

Zoltán Szabó and Bharath K. Sriperumbudur. On kernel derivative approximation with random Fourier features. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 827–836, 2019.

Michel Talagrand. Isoperimetry and integrability of the sum of independent Banach-space valued random variables. The Annals of Probability, 17(4):1546–1570, 1989.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Enayat Ullah, Poorya Mianjy, Teodor V. Marinov, and Raman Arora. Streaming kernel PCA with Õ(√n) random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 7311–7321, 2018.

Sara van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.

Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York, 1996.

Ramon van Handel. Probability in high dimension. Available at https://web.math.princeton.edu/~rvan/APC550.pdf, 2016.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 682–688, 2001.

Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning (ICML), pages 485–493, 2014.

Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. The Annals of Statistics, 45:991–1023, 2017.

Yiming Ying, Qiang Wu, and Colin Campbell. Learning the coordinate gradients. Advances in Computational Mathematics, 37:355–378, 2012.

Felix X. Yu, Ananda Theertha Suresh, Krzysztof M. Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems (NIPS), pages 1975–1983, 2016.

Jian Zhang, Avner May, Tri Dao, and Christopher Ré. Low-precision random Fourier features for memory-constrained kernel approximation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1264–1274, 2019.

Qinyi Zhang, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, pages 1–18, 2017.

Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463, 2008.