# orlicz_random_fourier_features__5880296f.pdf Journal of Machine Learning Research 21 (2020) 1-37 Submitted 12/19; Revised 6/20; Published 6/20 Orlicz Random Fourier Features Linda Chamakh linda.chamakh@polytechnique.edu Emmanuel Gobet emmanuel.gobet@polytechnique.edu Zolt an Szab o zoltan.szabo@polytechnique.edu CMAP, CNRS, Ecole Polytechnique, Institut Polytechnique de Paris 91128 Palaiseau, France Global Markets Quantitative Research BNP Paribas Editor: Jean-Philippe Vert Kernel techniques are among the most widely-applied and influential tools in machine learning with applications at virtually all areas of the field. To combine this expressive power with computational efficiency numerous randomized schemes have been proposed in the literature, among which probably random Fourier features (RFF) are the simplest and most popular. While RFFs were originally designed for the approximation of kernel values, recently they have been adapted to kernel derivatives, and hence to the solution of large-scale tasks involving function derivatives. Unfortunately, the understanding of the RFF scheme for the approximation of higher-order kernel derivatives is quite limited due to the challenging polynomial growing nature of the underlying function class in the empirical process. To tackle this difficulty, we establish a finite-sample deviation bound for a general class of polynomial-growth functions under α-exponential Orlicz condition on the distribution of the sample. Instantiating this result for RFFs, our finite-sample uniform guarantee implies a.s. convergence with tight rate for arbitrary kernel with α-exponential Orlicz spectrum and any order of derivative. Keywords: random Fourier features, kernel derivative, polynomial-growth functions, α-exponential Orlicz norm, unbounded empirical processes 1. Introduction Kernel machines (Taylor and Cristianini, 2004; Steinwart and Christmann, 2008; Paulsen and Raghupathi, 2016) form one of the most fundamental tools in machine learning and statistics with a wide range of successful applications. The impressive modelling power and flexibility of kernel techniques in capturing complex nonlinear relations originates from the richness of the underlying Hk function class called reproducing kernel Hilbert space (RKHS, Aronszajn 1950) associated to a k : X X R kernel. Kernels extend the classical notion of inner product on X = Rd by assuming the existence of a φ : X H feature map to a Hilbert space H such that k(x, x ) = φ(x), φ(x ) H for all x, x X. This simple equality (also called the kernel trick) forms the basis of kernel techniques and enables one to compute inner products implicitly without direct access to the feature of the points. c 2020 Linda Chamakh, Emmanuel Gobet and Zolt an Szab o. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-1031.html. Chamakh, Gobet and Szab o In applications one is often given {xn}N n=1 samples and is facing with an optimization problem expressed in terms of function values and derivatives1 { pf(xn)}n [N] p Dn , f 2 Hk where [N] = {1, . . . , N}, Dn Nd, N := {0, 1, . . 
.}, l : R1+P n [N] #Dn R is a loss function, #Dn is the cardinality of the set Dn, pf(xn) := p1+...+pdf(xn) p1 x1 pd xd and the RKHS Hk is characterized by f(x) = f, k( , x) Hk ( x X, f Hk) and k( , x) Hk ( x X).2 The first property of RKHSs is called the reproducing property, the second one describes basic elements of Hk; combining the two properties makes the canonical feature map and feature space explicit: k(x, x ) = φ(x), φ(x ) Hk where φ(x) = k( , x) Hk. For example by taking the quadratic loss, Tikhonov regularization, only function values (Dn = {0}, n [N]) and λ > 0, (1) reduces to kernel ridge regression min f Hk 1 N n [N] [f(xn) yn]2 + λ f 2 Hk . Alternatively, one can get back Hermite learning with gradient data (Zhou, 2008; Shi et al., 2010) by additionally including first-order derivatives min f Hk 1 N [f(xn) yn]2 + f (xn) y n 2 2 + λ f 2 Hk , λ > 0 where f (x) = [ e1f(x); . . . ; edf(x)] Rd is the derivative of f, ej Rd is the jth canonical basis vector, 2 is the Euclidean norm and Dn = n 0, {ej}d j=1 o (n [N]). Further examples with function derivatives are semi-supervised learning with gradient information (Zhou, 2008), nonlinear variable selection (Rosasco et al., 2010, 2013), learning of piecewisesmooth functions (Lauer et al., 2012), multi-task gradient learning (Ying et al., 2012), structure optimization in parameter-varying ARX (autoregressive with exogenous input) processes (Duijkers et al., 2014), or density estimation with infinite-dimensional exponential families (Sriperumbudur et al., 2017). An appealing property of RKHSs is that their geometry makes the optimization problem (1) defined over function spaces computationally tractable. Indeed, assuming that l is increasing in its last argument, the pf(x) = f, p,0k( , x) Hk derivative-reproducing property of kernels and the representer theorem (Zhou, 2008) guarantee that the solution of (1) has a finite-dimensional parameterization f( ) = P n [N] P p Dn an,p p,0k( , xn) (an,p R) and it is sufficient to solve q Dm am,q p,qk(xn, xm) n,m [N] p Dn,q Dm an,pam,q p,qk(xn, xm) 1. To have derivatives, in the sequel we assume that X = Rd. 2. We use the k( , x) shorthand to denote the function y X 7 k(y, x) R while keeping x X fixed. Orlicz Random Fourier Features determined by the p,qk(x, y) := Pd i=1(pi+qi)k(x,y) p1 x1 pd xd q1 y1 qd yd kernel derivatives; a = (an,p)n [N], p Dn R P n [N] #Dn. Though kernel methods show impressive modelling power at numerous areas, due to the implicit computation of feature similarities, this flexibility comes with a computational price. Several techniques have been developed in the literature to mitigate this computational challenge such as incomplete Cholesky factorization (Bach and Jordan, 2002), subsampling schemes (Williams and Seeger, 2001; Drineas and Mahoney, 2005; Rudi et al., 2017), sketching (Alaoui and Mahoney, 2015; Yang et al., 2017), random Fourier features (RFF, Rahimi and Recht 2007, 2008), their quasi-Monte Carlo (Yang et al., 2014), memoryefficient (Le et al., 2013; Dai et al., 2014; Zhang et al., 2019), orthogonal (Yu et al., 2016) or structured (Bojarski et al., 2017) variants. In this paper we study the RFF technique which is probably the conceptually simplest and most influential approach.3 By the Bochner theorem (Rudin, 1990) a continuous, bounded, shift-invariant kernel k : Rd Rd R can be written as the Fourier transform of a (finite) measure Λ, called the spectral measure k(x, y) = Z Rd cos ω (x y) dΛ(ω). 
(3) The RFF method uses this representation of k to provide an explicit low-dimensional feature map approximation for the kernel values and f k(x, x ) λ(x), λ(x ) R2M , ˆfw(x) = w, λ(x) R2M , (4) where the integral representation (3) with respect to the measure Λ is replaced by an average over random points; hence the random Fourier feature naming. As a result, one can estimate w by leveraging fast linear primal solvers. The idea has been successfully used in various contexts including differential privacy preserving (Chaudhuri et al., 2011), fast function-to-function regression (Oliva et al., 2015), learning message operators in expectation propagation (Jitkrittum et al., 2015), causal discovery (Lopez-Paz et al., 2015; Strobl et al., 2019), independence testing (Zhang et al., 2017), prediction and filtering in dynamical systems (Downey et al., 2017), bandit optimization (Li et al., 2018), or estimation of Gaussian mixture models (Keriven et al., 2018). Similarly to (4), one can consider RFF-based approximation of kernel derivatives when solving optimization tasks involving function derivatives [see (1) and (2)]. This is the strategy followed for example by Strathmann et al. (2015) to fit distributions belonging to the infinite-dimensional exponential family, which boils down to an optimization problem with third-order kernel derivatives (Sriperumbudur et al., 2017, Theorem 5). The focus of this work is to study the approximation quality of the RFF-based kernelderivative approximation \ p,qk p,qk S := sup x,y S p,qk(x, y) \ p,qk(x, y) , 3. Rahimi and Recht (2007) won the 10-year test-of-time award at NIPS-2017 as a recognition of the influence of RFFs. Chamakh, Gobet and Szab o where S Rd is a compact set. Despite the large number of successful RFF applications, quite little is understood theoretically on its approximation quality. Below we provide a brief summary with particular focus on optimal guarantees and results related to kernel derivatives. Kernel values (p = q = 0): The uniform finite-sample bounds (Rahimi and Recht, 2007; Sutherland and Schneider, 2015) have recently been improved (Sriperumbudur and Szab o, 2015) exponentially in terms of the diameter of the compact set SM (|SM|) arriving to4 k bk SM = Oa.s. log |SM| log M from k bk SM = Op |SM| log M denotes the maximum. The result shows that the diameter of the set SM can grow at a |SM| = eo(M) rate while still getting a consistent estimate; this rate is optimal as shown in the characteristic function literature (Cs org o and Totik, 1983). Kernel ridge regression: RFFs have been settled in kernel ridge regression by Rudi and Rosasco (2017) via showing that using M = o(N) = O N log N random Fourier features is sufficient to get O 1/ N generalization error. Under additional γ-capacity (γ [0, 1]) and r-range space conditions (r 1 2), the same authors showed that even faster, minimax optimal O N 2r 2r+γ rates are achievable with M = o(N) = O N 2r+γ log N RFFs. The result improves the originally proved guarantee (Rahimi and Recht, 2008) holding under the pessimistic M = O(N) setting. The sufficiency of similar sub-linear number of RFFs with minimax guarantees was also established in the kernel least squares setting when applying mini-batched stochastic gradient descent (Carratino et al., 2018). Recently the analysis of Rudi and Rosasco (2017) has been further sharpened (in terms of the number of required RFFs, Li et al. 2019) by leveraging the notion of effective degrees of freedom. 
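To make the construction concrete, the following minimal Python sketch (ours, not part of the original paper) implements the feature map λ of (4) for the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)), whose spectral measure Λ is N(0, σ⁻²I_d), and uses it to solve ridge regression in the 2M-dimensional primal with M ≪ N random features, in the spirit of the Rudi and Rosasco (2017) discussion above; all variable names are illustrative.

```python
import numpy as np

def rff_features(X, omegas):
    # Feature map lambda(x) of Eq. (4): [cos(w_m' x); sin(w_m' x)]_{m=1..M} / sqrt(M).
    proj = X @ omegas.T                                   # (N, M) projections <w_m, x_n>
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
N, d, M, sigma, lam = 5000, 5, 200, 1.0, 1e-3
X = rng.normal(size=(N, d))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=N)      # toy regression targets

# Spectral measure of exp(-||x-y||^2 / (2 sigma^2)) is N(0, sigma^{-2} I_d).
omegas = rng.normal(scale=1.0 / sigma, size=(M, d))
Z = rff_features(X, omegas)                               # (N, 2M) primal features

# Primal ridge regression: solve (Z'Z + N*lam*I) w = Z'y; cost O(N M^2) instead of O(N^3).
w = np.linalg.solve(Z.T @ Z + N * lam * np.eye(2 * M), Z.T @ y)
print("train RMSE:", float(np.sqrt(np.mean((Z @ w - y) ** 2))))
# Predictions at new points: rff_features(X_new, omegas) @ w.
```

The inner product of two such feature vectors equals the Monte Carlo average (1/M)Σ_m cos(ω_mᵀ(x − x')), so the same features can also be used to check the kernel-value approximation in (4) directly against the closed-form kernel.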
Classification with 0-1 loss: In the classification setting with the 0-1 loss and RKHSs, Gilbert et al. (2018) proved that M = o(N) = O N 2 2+c optimized RFF features optimized in the sense of Bach (2017) are sufficient to achieve a learning rate of O N c 2+c provided that the spectrum of the integral operator associated to the kernel decay polynomially at the rate of λi = O (i c) with c > 1.4 The same authors showed that the learning rate can be improved to O N 1 with M = O lnd(N) RFF-s in case of sub-exponential spectrum, where d denotes the dimension of the inputs in the classification. Kernel PCA: Sriperumbudur and Sterge (2018) have proved that the statistical performance of kernel principal component analysis (KPCA) can be matched by M = O(N2/3) (polynomial decay) or M = O( N) (exponential decay) RFFs, depending on the eigenvalue decay of the covariance operator associated to the kernel. Ullah et al. (2018) derived a similar bound for a streaming KPCA algorithm under exponential spectrum decay condition. 4. The classical O( ) notation up to logarithmic factors is denoted by O( ); the extension of O( ) in almost sure and convergence in probability sense are Op( ) and Oa.s.( ). Orlicz Random Fourier Features Kernel derivatives: If the support of the spectral measure associated to k is either bounded or it satisfies the Bernstein condition Qd j=1 ωpj+qj j n (σp,q)n dΛ(ω) n! 2 Kn 2, n = 2, 3, . . . (5) with some constant K 1 and σp,q = Rd Qd j=1 ωpj+qj j 2 dΛ(ω), then \ p,qk p,qk SM = Oa.s. log |SM| log M rate is achievable as shown by Sriperumbudur and Szab o (2015) and Szab o and Sriperumbudur (2019), respectively. As a practical example, one can consider for instance the previously mentioned infinite-dimensional exponential (IE) family fitting problem where the estimation boils to solving a linear equation system with coefficients made of third-order kernel derivatives (|p + q| = 3). The IE family is defined by a kernel for which a common choice is the Gaussian; this implies a Gaussian spectrum Λ. Unfortunately, in this case the bounded support condition is not met. Similarly, the Bernstein condition (5) only holds for |p+q| 2 with the Gaussian kernel, as it can be verified (Szab o and Sriperumbudur, 2019, Remark Higher-order derivatives in Section 3) by making use of the analytical expressions for the absolute moments of the Gaussian spectrum. These limitations (summarized in Table 1) of the popular random Fourier features technique motivate our work and the study of widely-applied kernels with unbounded spectral support for the RFF approximation of high-order kernel derivatives. A consequence of our new estimates in Theorem 1 is that the a.s. rates previously obtained under stringent conditions (on p, q or Λ) are now available for any p, q and any spectral measure Λ with α-exponential moments (as defined in (6), α > 0). Because Bernstein condition implies exponential moments, our result includes the one given by Szab o and Sriperumbudur (2019). Particularly, assuming additional smoothness on the bounded shift-invariant kernel, its derivative satisfies a representation similar to (3): p,qk(x, y)= Z j=1 ωpj j ( ωj)qj # c(Pd i=1 |pi+qi|) ω (x y) | {z } =:fx y(ω) where cn is the nth derivative of the cos( ) function. The primary difficulty is to handle the polynomial growing nature of the F = {ω 7 fx y(ω) : x, y S} function class which controls the error \ p,qk p,qk S. 
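As a quick numerical sanity check of the derivative representation above (our own sketch, assuming the Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)) with spectral measure N(0, σ⁻²I_d); the variable names are illustrative), one can average f_{x−y}(ω_m) over i.i.d. spectral samples and compare the result with the closed-form first derivative of the kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, sigma, j = 3, 200_000, 1.0, 0
x, y = rng.normal(size=d), rng.normal(size=d)
z = x - y

# Spectral samples for exp(-||x-y||^2 / (2 sigma^2)): Lambda = N(0, sigma^{-2} I_d).
omegas = rng.normal(scale=1.0 / sigma, size=(M, d))

# f_{x-y}(omega) for p = e_j, q = 0: omega_j * cos^{(1)}(omega'(x-y)) = -omega_j * sin(omega'(x-y)).
estimate = np.mean(-omegas[:, j] * np.sin(omegas @ z))

# Closed form for the Gaussian kernel: d k(x, y) / dx_j = -(x_j - y_j) / sigma^2 * k(x, y).
exact = -z[j] / sigma**2 * np.exp(-np.dot(z, z) / (2 * sigma**2))
print(estimate, exact)   # agree up to O(M^{-1/2}) Monte Carlo error
```

Higher-order derivatives are estimated in the same way, by multiplying with the corresponding monomial ω^p(−ω)^q and shifting the phase of the cosine accordingly.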
We tackle this challenge by imposing the finiteness of the α-exponential Orlicz norm of the spectral measure (Λ) associated to the kernel, in other words α > 0, c > 0 such that Eω Λ ec ω α 2 < + . (6) Chamakh, Gobet and Szab o Assumption on the spectral measure Conditions Convergence rate on p, q for p,qk \ p,qk SM 2nd moment exists p = q = 0 Oa.s. log |SM| log M Ref: Sriperumbudur and Szab o (2015, Th. 1) bounded support any p, q Oa.s. log |SM| log M Ref: Sriperumbudur and Szab o (2015, Th. 4) Bernstein condition small p, q Oa.s. log |SM| log M Ref: Szab o and Sriperumbudur (2019) Orlicz condition any p, q Oa.s. log |SM| log M Table 1: Summary of RFF guarantees on kernel values and derivatives. Last line: it includes any measure Λ with a finite α-exponential moment (for some α, c > 0, Eω Λ ec ω α 2 < + ), like the Gaussian, the inverse multiquadratic, or the sech kernel, see Corollary 4. For further examples see Table 2. Kernels with α-exponential Orlicz spectrum include the popular Gaussian, the inverse multiquadric, or the sech kernel; for further examples see Table 2 and Remark 5(ii). We establish the consistency and prove finite-sample uniform guarantees of the resulting Orlicz RFF scheme for the approximation of kernel derivatives at any order, as it is briefly illustrated in the last line of Table 1. To allow this level of generality, we prove a new finite-sample deviation bound for the empirical process related to a general class of functions f with polynomial growth of the sample Xm. The distribution of the latter is assumed to have finite α-exponential Orlicz norm and consequently, the random variables f(Xm) belong to a γ-exponential Orlicz space with index γ smaller than 1. For deriving such deviation bounds, we have been inspired by the work of Adamczak (2008) which elegantly combines the Klein and Rio (2005) inequality for truncated variables, the Hoffman-Jorgensen inequality to deal with sum of residual of truncated variables, and a Talagrand (1989) inequality in γ-exponential Orlicz norms for sum of centered random variables. However, our work significantly differs from that of Adamczak (2008). First, our aims are different: Adamczak (2008) focuses on getting large deviation bounds while we are looking for all-scale deviation bounds, which leads to a different analysis (in the application of Klein-Rio inequalities for instance). Second, we are concerned by getting upper bounds with quite explicit control. In particular, this requires a careful treatment of Orlicz-type estimates since the function Ψγ(x) = exγ 1 defining the Orlicz space is not convex for γ < 1 (see Figure 1), as opposed to the usual case; see the results in Section 4. We also derive sharp estimates from the Dudley entropy integral bound (Theorem 9), which enables us to get a tight dependency w.r.t. the diameter of the parameter space. Furthermore, we clarify the use of the Talagrand inequality (Theorem 7); Orlicz Random Fourier Features in Adamczak (2008, Theorem 5) it is seemingly invoked for supremum over functions while it is related to sum over centered random variables. With this novel finite-sample deviation bound, the analysis of Orlicz RFFs readily follows, using optimized inequalities. The paper is structured as follows. Our problem is formulated in Section 2. The main result on the approximation quality of kernel derivatives with random Fourier features is presented in Section 3. Properties of the Orlicz norm are summarized in Section 4. Proofs are provided in Section 5. 
The appendix contains additional technical details (Section A), the definition of special functions (Section B) and external statements used in the proofs (Section C). 2. Problem Formulation In this section we formally define our problem after introducing a few notations. Notations: Let the set of natural, real and complex numbers, positive integers, positive reals, non-negative reals and non-positive integers be denoted by N = {0, 1, . . .}, R, C, Z+ = {1, 2, . . .}, R+ = (0, ), R 0 = [0, ) and Z 0 = {0, 1, 2, . . .}, respectively. The positive value of x R is denoted by (x)+ = x 0. The Gamma function is Γ(t) = R 0 xt 1e xdx for t C\Z 0. For x R the secant function is sech(x) = 1 cosh(x). Let a S +b = {as+b : s S} where S R and a, b R. For an N Z+, [N] = {1, . . . , N}. Let the nth derivative of the cos( ) function (n N) be cn = cos(n). For multi-indices p, q Nd and ω Rd let |p| = Pd j=1 pj, ωp = Qd j=1 ωpj j , pf(x) = |p|f(x) p1 x1 pd xd , p,qg(x, y) = |p|+|q|g(x,y) p1 x1 pd xd q1 y1 qd yd . The diameter of a compact set T Rd is denoted by |T| = supx,y T x y 2 < . Let S Rd be a Borel set. Let S = S S = {a b : a S, b S}. The set of Borel probability measures on S is written as M+ 1 (S). Let the Banach space of real-valued, r-power µ-integrable functions on S (1 r < ) be Lr(S, µ), with f Lr(S,µ) = R S |f(x)|rdµ(x) 1 r . We use the shorthand µf = R S f(x)dµ(x) where µ M+ 1 (S) and f L1 (S, µ). The product measure of µ1, . . . , µM M+ 1 (S) is M m=1µm; specifically when all the components coincide we use the shorthand µM = m [M]µ. The empirical measure is PM = 1 M PM m=1 δXm with δX being the Dirac measure concentrated on X and X1, . . . , XM M m=1µm. Let (rn)n N be a positive sequence. The boundedness of Xn rn almost surely is denoted by Xn = Oa.s.(rn). Let n R+. We say that an f : Rd R function is of polynomial growth of order n (shortly f FP(n)) if supx Rd |f(x)| 1+ x n 2 < ; FP = n R+FP(n). Let us assume that Ψ : R 0 R 0 is a continuous, strictly increasing mapping, Ψ(0) = 0 and limx Ψ(x) = . The set of Rd-valued random variables having finite Ψ-Orlicz norm is defined as LΨ = n X : X Ψ := inf n c > 0 : EΨ X 2 c 1 o < + o . Throughout the paper we will be particularly interested in (see Figure 1) Ψα : x R 0 7 exα 1 R 0 (α > 0), in other words in random variables having finite α-exponential Orlicz norm. The fact X LΨα is equivalent to the existence of an s > 0 constant such that E h es X α 2 i < . Random variables X LΨ2 and X LΨ1 are called sub-Gaussian and sub-exponential, respectively. For f FP and random variable X having α-exponential moment (X LΨα) Ef(X) < . Special functions are defined in Table 4. Chamakh, Gobet and Szab o 0 0.5 1 1.5 0 =0.1 =0.3 =1.5 Figure 1: Ψα for different α values. We proceed by formally defining our task. Let k : Rd Rd R be a continuous, bounded and shift-invariant kernel. Then, by the Bochner theorem (Rudin, 1990) one can assume w.l.o.g. the existence of a Λ M+ 1 Rd spectral measure such that k(x, y) = Z Rd cos ω (x y) dΛ(ω) Rd cos ω x cos ω y + sin ω x sin ω y dΛ(ω). Let p, q Nd and assume that R Rd |ωp+q| dΛ(ω) < . In this case p,qk(x, y) exists, and by the dominated convergence theorem one arrives at p,qk(x, y) = Z Rd p cos ω x q cos ω y + p sin ω x q sin ω y dΛ(ω). The integral can be estimated by Monte-Carlo technique replacing Λ with ΛM = 1 M PM m=1 δωm, (ωm)m [M] i.i.d. 
Λ: \ p,qk(x, y) = 1 h p cos ω mx q cos ω my + p sin ω mx q sin ω my i = λp(x), λq(y) R2M , (7) where λp(x) = 1 h p cos ω mx m [M] ; p sin ω mx i R2M; this is the RFF feature approximation λp in (4). For p = q = 0, the construction reduces to the traditional RFF technique (Rahimi and Recht, 2007). This form implies that our target quantity can be written as \ p,qk p,qk S = sup z S |(ΛM Λ)(fz)|, fz(ω) = ωp( ω)qc|p+q| ω z , (8) = sup f F |(ΛM Λ)(f)|, F = {fz : z S } , (9) thus the problem boils down to the study of supremum of empirical processes with F FP(n) where n = |p + q| + 1. In the next section we detail our main result about the fluctuation of such processes. Orlicz Random Fourier Features 3. Main Result In this section we present our main result on the supremum of empirical processes of polynomial growth, and specialize it to the approximation quality of RFFs for kernel derivatives. The proofs are given in Section 5. We investigate the concentration of the supf F | 1 M PM m=1 f(Xm)| quantity under the following assumptions: 1. Compact parameterization: F = {ft : t T} where ft : Rd R is parameterized by a compact set T Rd . 2. Lipschitz condition: There exists n R+ and function L : Rd R 0, L FP(n) such that (a) |ft0(x)| L(x) for some t0 T, (b) |ft1(x) ft2(x)| L(x)ρ ( t1 t2 2) for all x Rd, t1 T, t2 T, (c) with ρ : [0, |T|] R 0 continuous strictly increasing mapping with ρ(0) = 0 such that Iρ(|T|) := ρ (|T|) R 1 0 log 1 + 2|T| ρ 1(uρ(|T|)) du < . 3. Independence, finite α-exponential Orlicz norm: (a) (Xm)m [M] are independent Rd-valued random variables; shortly, (Xm)m [M] m [M]µm with µm M+ 1 Rd . (b) α R+ such that Xm Ψα < for all m [M]. 4. Centering: E [f(Xm)] = 0 for all f F and m [M]. Under these conditions, our main result is as follows. Theorem 1 (Concentration of processes with polynomial growth) Assume that F and (Xm)m [M] satisfy Assumptions 1-4 and γ := α n 1. Let log stand for the natural logarithm, βγ := Γ 1 + 1 γ γ , P := m [M]µm, and L L2(X1:M) := q 1 M P m [M] L2(Xm). Let Ψ(l) γ be the convexification5 of Ψγ, Aγ := Ψ(l) γ 1 (1) Ψ 1 γ (1) , Bγ := Ψ(l) γ 1(1), Cγ and CD be the constants defined in (45) and (46), and Kγ := 2 1 γ 1 1 + Aγ . Then for any ε > 0 satisfying ε 6B, B := 2CD d E h L L2(X1:M) i M Iρ(|T|), (10) m [M] ft(Xm) ε Mε 3Kγ maxm [M] supt T |ft(Xm)| Ψγ 72σ2+84c ε , (11) 5. The function Ψγ is not convex for γ < 1. We convexify Ψγ and use the Section 4(v) based integral control property holding for convex Ψ-s; for details on Ψ(l) γ see Section A.1. Chamakh, Gobet and Szab o σ2 := sup t T m [M] E f2 t (Xm) , c := max m [M] sup t T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ 8E max m [M] sup t T |ft(Xm)| [0, + ). (i) Two-sided bound: For P inft T 1 M P m [M] ft(Xm) ε the same one-sided deviation bound can be obtained by replacing ft with ft. As a result one can estimate P supt T 1 M P m [M] ft(Xm) ε by twice the bound above. (ii) Assumption (3): Assumption (3a) with Assumption (4) is weaker than being i.i.d.: for example E [ft(Xm)] = 0 holds for Xm = N 0, σ2 m and ft(x) = ctx3, but Xm-s can differ in their variance. (iii) Assumption α/n 1: This condition holds without loss of generality. Indeed, in case of α/n > 1, one can get a modified (α , n ) pair satisfying α /n 1 by either increasing n to the value n = α using that FP(n) FP(n ), or by decreasing α to the value α = n using that Xm Ψα < implies Xm Ψα < for any α (0, α). (iv) Proof-related remarks: 1. Compactness of T: This compactness with the Lipschitz property enables one to control the covering number of F. 2. 
Truncated functions: The Lipschitz property of F implies that of the truncated functions: for x Rd, s and t T |Tcft(x) Tcfs(x)| |ft(x) fs(x)| L(x)ρ ( t s 2) , (12) where Tcf(x) := f(x)1|f(x)| c + c1f(x)>c c1f(x)< c is f soft-thresholded at level c. 3. F FP(n): This property is inherited (Section 5.4) from L FP(n) by the Lipschitz conditions (2a)-(2b). 4. Finiteness of the terms in Theorem 1: maxm [M] supt T |ft(Xm)| Ψ α E maxm [M] supt T |ft(Xm)| are finite (see Section 5.4) in Theorem 1 by the Lipschitz assumption (2a)-(2b), Xm Ψα < (Assumption (3b)) and L FP(n) (Assumption (2)). (v) RFF specialization: Assuming that the α-exponential Orlicz condition holds for the spectral measure Λ associated to k ( α R+ such that ω Ψα < , ω Λ),6 one can see (Section 5.1) that RFFs are covered by choosing d = d, ft(x) fz(ω) Λfz, t z, T S , Xm ωm, 6. This requirement implies that R Rd |ωp+q| dΛ(ω) < and thus the existence of p,qk for any p, q Nd. Orlicz Random Fourier Features ρ(u) = uβ, β = 1 1 + (log|S |)+ (0, 1], n |p + q|+β. While any value of β (0, 1] would meet the assumptions, allowing β to depend on the diameter of S enables us to get optimal convergence rates w.r.t. the diameter (see Corollary 4). The terms driving the guarantee for RFF can be bounded (Section 5.6) as follows: there is a constant CRFF R+, depending only on Λ, |p + q|, but not on |S | and M, such that 1 + (log|S |)+ max m [M] sup z S gz(ωm) Ψγ CRFF, max m [M] sup z S |gz(ωm)| Ψγ CRFF [log(1 + M)]n/α , max m [M] sup z S |gz(ωm)| CRFF [log(1 + M)]n/α . Using these bounds, our finite-sample uniform guarantee on Orlicz RFFs is as follows. Corollary 3 (Orlicz RFFs for kernel derivative approximation) Let k : Rd Rd R be a continuous, bounded, shift-invariant kernel with spectral measure Λ. Suppose that Λ satisfies the α-exponential Orlicz assumption ( α R+ such that ω Ψα < , ω Λ) and let S Rd be a compact set. Let β = 1 1+(log|S |)+ (0, 1], let p, q Nd, n := |p+q|+β, and assume that γ := α n 1. Let \ p,qk be the RFF estimate of p,qk using (ωm)m [M] i.i.d. Λ samples as given in Eq. (7). Then, there exists a constant C R+ (depending only on Λ, |p + q|, but not on S and M) such that for any ϵ C 1+(log|S |)+ ΛM \ p,qk p,qk S ϵ 2e (Mε)γ C log(1+M) + e C 1+ε[log( C/ε) log(1+M)]1/γ . (14) Corollary 4 (Almost sure convergence for kernel derivative approximation) Let p, q Nd and k : Rd Rd R be a continuous, bounded, shift-invariant kernel with spectral measure Λ which satisfies the α-exponential Orlicz assumption for some α > 0. Then, for any sequence of compact sets (SM) M=2 such that (log |SM|)+ = o(M), we have \ p,qk p,qk SM = Oa.s. (log |SM|)+ log M Chamakh, Gobet and Szab o (i) Spectral measure (Λ) examples: Our result assumes the α-exponential Orlicz property of the spectral measure Λ associated to k. In Table 2 we provide various examples for Λ (with the relevant case of unbounded support) satisfying this requirement; their relation is summarized in Figure 2. While for the RFF approximation it is not necessary, in many of these examples the corresponding kernel value can also be computed, see Table 3. (ii) α-exponential Orlicz assumption for tensor product kernels: Using the αexponential Orlicz spectral measures of Table 2 on R, one can immediately construct Orlicz spectral measures on Rd. Indeed, assume that (i) k is a product kernel, i.e. k(x, y) = Q i [d] ki(xi, yi), Λ = i [d]Λi, and (ii) Λi, the spectral measure associated to ki, satisfies the αi-exponential Orlicz assumption (αi R+). 
Then ω Λ is αexponential Orlicz with α = mini [d] αi; see Section A.2. (iii) α-exponential Orlicz vs. Bernstein assumption: Our result complements Szab o and Sriperumbudur (2019) s work, where the authors showed that for d = 1 and spectral densities fλ(ω) e ω2ℓthe Bernstein condition (and hence fast rates) holds for |p+q| 2ℓ= α. Indeed, we proved under the more general α-exponential Orlicz assumption the same a.s. convergence rates for any arbitrary order (see Corollary 4) kernel derivatives. Spectrum Spectral density: fΛ(ω) Parameters α 2σ2 σ > 0 2 Laplace σ 2 e σ|ω| σ > 0 1 generalized Gaussian α 2βΓ( 1 α α > 0, β > 0 α variance Gamma σ2b|ω|b 1 2 (σ|ω|) πΓ(b)(2σ)b 1 2 σ > 0, b > 1 Weibull (S) s 2λ |ω| λ s 1 e |ω| λ s s > 0, λ > 0 s exponentiated exponential (S) α 2λ 1 e |ω| λ α 1 e |ω| λ λ > 0, α > 0 1 exponentiated Weibull (S) αs 2λ |ω| λ s 1 1 e |ω| λ s α 1 s > 0, λ > 0, α > 0 s Nakagami (S) mm Γ(m)Ωm |ω|2m 1e mω2 chi-squared (S) 1 2 s 2 +1Γ( s 2)|ω| s 2 1e |ω| Erlang (S) λs|ω|s 1e λ|ω| 2(s 1)! s Z+, λ > 0 1 Orlicz Random Fourier Features Spectrum Spectral density: fΛ(ω) Parameters α Gamma (S) 1 2Γ(s)θs |ω|s 1e |ω| θ s > 0, θ > 0 1 generalized Gamma (S) p/a D p |ω|D 1e |ω| a p a > 0, D > 0, p > 0 p Rayleigh (S) |ω| 2σ2 e ω2 2σ2 σ > 0 2 Maxwell-Boltzmann (S) 1 2a2 a3 a > 0 2 chi (S) 1 2 s 2 Γ( s 2)|ω|s 1e ω2 exponential-logarithmic (S) 1 2 log(p) β(1 p)e β|ω| 1 (1 p)e β|ω| p (0, 1), β > 0 1 Weibull-logarithmic (S) 1 2 log(p) αβ(1 p)|ω|α 1e β|ω|α 1 (1 p)e β|ω|α p (0, 1) , β > 0, α > 0 α Gamma/Gompertz (S) bseb|ω|βs 2(β 1+eb|ω|) s+1 b > 0, β > 0, s > 0 bs hyperbolic secant 1 2sech π logistic e ω s i2 s > 0 1 normal-inverse Gaussian αδK1(α δ2+ω2 eδα α > 0, δ R 1 hyperbolic 1 2δK1(δα)e α δ2+ω2 α > 0, δ R 1 generalized hyperbolic (α/δ)λ 2 λ α > 0, λ R, δ R 1 Table 2: Kernel spectrum examples in one dimension (d = 1) obeying the α-exponential Orlicz assumption. (S) stands for symmetrized. The symmetrization guarantees that the kernel associated to Λ is real-valued. Last column: Orlicz exponent. For the variance Gamma distribution the Orlicz exponent follows from the known Ku(z) p asymptotics (Barndorff-Nielsen et al., 2001, page 297) where Ku is the modified Bessel function of the second kind, as defined in Table 4. Notice that the normal-inverse Gaussian δ=σ2α, α Gaussian limit (see Figure 2) changed the Orlicz exponent from 1 to 2. 4. Properties of the Orlicz Norm In this section, for self-containedness we summarize the properties of Ψ which hold independently of the convexity/non-convexity of Ψ (unless explicitly required). Let X, X Rd be random variables, and assume that Ψ : R 0 R 0 (and similarly Φ below) is continuous, strictly increasing, Ψ(0) = 0 and limx Ψ(x) = . Chamakh, Gobet and Szab o exponentiated exponential (S) Weibull -logarithmic (S) Gamma/Gompertz (S) exponentiated Weibull (S) exponential -logarithmic (S) chi-squared (S) Erlang (S) Weibull (S) generalized Gamma (S) Gamma (S) Laplace Rayleigh (S) chi (S) variance Gamma generalized Gaussian hyperbolic secant Maxwell-Boltzmann (S) normal-inverse Gaussian Gaussian logistic generalized hyperbolic hyperbolic Nakagami (S) α = 1 β = 1 s = 1 p = D p = 1 p = 2, a = 2 s Z+ s Z+/2 s = 2 (σ = 1) s = 3 (a = 1) α = 2 δ=σ2α, Figure 2: Relation of the spectral density examples of Table 2. (S) stands for symmetrized. 
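As a numerical illustration of the α-exponential Orlicz condition (6) behind Table 2 (a sketch of ours, not taken from the paper), the Ψ_α-Orlicz norm of ‖ω‖₂ can be estimated from spectral samples by bisecting over c in the defining inequality E[exp((‖ω‖₂/c)^α)] ≤ 2; the function and variable names below are illustrative.

```python
import numpy as np

def empirical_orlicz_norm(r, alpha, tol=1e-4):
    # Empirical version of ||X||_{Psi_alpha} = inf{c > 0 : E[exp((||X||_2 / c)^alpha)] <= 2},
    # with the expectation replaced by a sample average over the radii r = ||X||_2.
    lo, hi = 1e-6, 1.0
    while np.mean(np.exp((r / hi) ** alpha)) > 2.0:   # grow the bracket until c = hi is feasible
        hi *= 2.0                                     # (overflow of np.exp to inf is harmless here)
    while hi - lo > tol:                              # bisect on c
        mid = 0.5 * (lo + hi)
        if np.mean(np.exp((r / mid) ** alpha)) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(2)
gauss = np.abs(rng.normal(size=100_000))              # Gaussian spectrum on R: alpha = 2
laplace = np.abs(rng.laplace(size=100_000))           # Laplace spectrum on R: alpha = 1
print(empirical_orlicz_norm(gauss, alpha=2.0))        # finite, stable in the sample size
print(empirical_orlicz_norm(laplace, alpha=1.0))      # finite, stable in the sample size
print(empirical_orlicz_norm(laplace, alpha=2.0))      # grows with the sample size: Psi_2 norm is infinite
```

With α matched to the spectrum (α = 2 for a Gaussian, α = 1 for a Laplace spectrum) the estimate stabilizes as the sample grows, whereas for a too-large α it keeps increasing, reflecting an infinite Orlicz norm.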
Orlicz Random Fourier Features Kernel name Kernel value: k(x, y) Spectrum Gaussian e σ2(x y)2 inverse quadric σ2 σ2+(x y)2 Laplace π Γ(1/α) Ψ1,1 1 2; 1 , [β(x y)]2 4 generalized Gaussiana inverse multiquadric h σ2 σ2+(x y)2 ib variance Gamma P n 2N ( 1) n 2 (x y)nλn s Weibull (S)b [1 2i(x y)] s 2 +[1+2i(x y)] s 2 2 chi-squared (S)b λ i s + h 1+ i(x y) 2 Erlang (S)b [1 θi(x y)] s+[1+θi(x y)] s 2 Gamma (S)b 1 σ(x y)e σ2(x y)2 2 erfi σ(x y) Rayleigh (S)b F1,1 s 2; 1 P n 2N ( 1) n 2 (x y)nΓ( n log(p)n!β α n Li n α+1(1 p) Weibull-logarithmic (S)b,c 1 2 [cΛ(x y) + cΛ(y x)], with cΛ(t) = Gamma/Gompertz (S)b = βs sb sb ti F2,1 s + 1; ti b + s + 1; 1 β sech sech(x y) hyperbolic secant πs(x y) sinh(πs(x y)) logistic α2+(x y)2 i normal-inverse Gaussian α2+(x y)2K1(δα) hyperbolic Kλ(δα) generalized hyperbolic a. The analytical computation of the characteristic function (and hence the kernel value) was carried out for α > 1 (Pog any and Nadarajah, 2010). b. In case of symmetrization (S): k(x, y) = 1 2 [cΛ(x y) + cΛ(y x)] where cΛ(t) = Eω Λ[eitω] is the characteristic function of the spectral measure (on R 0) before symmetrization; i = 1. c. The characteristic function was obtained by Ciumara and Preda (2009). Table 3: Kernel examples for the spectral densities given in Table 2. The special functions Ψ1,1, erfi, F1,1, Li, F2,1 and Kλ are defined in Table 4. Chamakh, Gobet and Szab o (i) Normalization: If X LΨ then E h Ψ X 2 (ii) Constant: For a λ R constant λ Ψ= |λ|/Ψ 1(1). (iii) Monotonicity in Ψ: Ψ Φ implies X Ψ X Φ. (iv) Monotonicity in the argument: If d = 1 and 0 X X a.s., then X Ψ X Ψ. (v) Finite Ψ implies integrability: If Ψ is convex and X LΨ, then E [ X 2] X Ψ Ψ 1(1). (vi) Generalized triangle inequality: Let X, X LΨα and α R+. Then X + X LΨα and X + X Ψα 2( 1 α 1)+ X Ψα + X Ψα (vii) Deviation inequality from Ψ: If X LΨ then P ( X 2 c) 2 Ψ(c/ X Ψ)+1 for any c 0. (viii) Maximal inequality for Ψα and α R+: for any sequence (Xm)M m=1 of random variables in LΨα, we have max m [M] Xm 2 Ψα max m [M] Xm Ψα The proofs of these properties are available in Section A.3. After introducing a few additional notations, we provide the proofs of our results and remarks presented in Sections 3 and 4. External statements used in the proofs are summarized in Section C. Notations: For γ (0, 1] and x R 0, let Iγ(x) = R x 0 e tγdt be the incomplete Gamma function. Let (Z, m) be a semi-metric space and ϵ R+. The set S Z is said to be an ϵ-net of Z if for any z Z there exists s S such that m(s, z) ϵ. The ϵ-covering number of Z is defined as the size of the smallest ϵ-net, i.e., N(ϵ, m, Z) = inf n ℓ 1 : s1, . . . , sl Z such that Z ℓ j=1Bm(sj, ϵ) o , where Bm(s, ϵ) = {z Z : m(z, s) ϵ} is the closed ball with center s Z and radius ϵ. 5.1. Proof of Remark 2(v) In view of (8)-(9), we need to check Assumptions 1-4 with the parameterized function class gz(ω) := fz(ω) Λfz = ωp( ω)qc|p+q| ω z Λfz, (z S ). Thanks to the α-exponential Orlicz condition on Λ and the i.i.d. property of (ωm)m [M] in (7), Assumption 3 is trivially fulfilled. Assumption 4 holds by the definition of gz( ) and because the distribution of ωm is Λ. Assumption 1 is satisfied since S is a compact set of Rd. Therefore, it remains to prove Assumption 2, with the existence of n R+ and L FP(n). First, notice that i [d] |ωi|pi+qi ω |p+q| 2 . (16) Orlicz Random Fourier Features Order: (16) implies that |gz(ω)| |fz(ω)| + Λ|fz| ω |p+q| 2 + Λ h |p+q| 2 i =: L1(ω). 
(17) Lipschitz condition: Let [z1, z2] = {az1 + (1 a)z2 : a [0, 1]} denote the segment connecting z1, z2 Rd. By using the mean value theorem |gz1(ω) gz2(ω)| max z [z1,z2] 2 z1 z2 2 , (18) z with fz(ω) z = ωp( ω)qc|p+q|+1 ω z ω, and by using similar computations as before, one gets gz(ω) 2 ω |p+q|+1 2 + Λ h |p+q|+1 2 i =: L2(ω). (19) As a result, to fulfill Assumption 2, we can take L(ω) = max(L1(ω), L2(ω)) and ρ(u) = u. For such L, we have n = |p + q| + 1. Refined L and ρ: We now derive refined L and ρ, by interpolating different bounds. From (17), we can obtain the crude estimate |gz1(ω) gz2(ω)| 2L1(ω), which combined with (18)-(19) gives |gz1(ω) gz2(ω)| (2L1(ω))1 βh z1 z2 2 L2(ω) iβ (20) for any β (0, 1]. Here we have used that if 0 x min(x1, x2) then x x1 β 1 xβ 2. It follows that one can take ρ(u) = uβ, n = |p + q| + β, L(ω) = max L1(ω), (2L1(ω))1 βLβ 2(ω) FP(n). (21) For β = 1, we retrieve the former choice of L and ρ. Furthermore, we have Iρ(|T|) =|T|β Z 1 log 1 + 2|T| (u|T|β)1/β du =|T|β Z 1 log 1 + 2 u1/β du < + . (22) Notice that the advantage of having the additional degree-of-freedom β is two-fold, and it is striking when β 0 (compared to β = 1). Firstly, it gives a smaller n, which has a (slight) positive impact on the control of statistical fluctuations; secondly, the dependence of Iρ(|T|) in the diameter |T| is smaller through the growth exponent. To conclude, we have proved that Orlicz RFFs fulfill the assumptions of Theorem 1. Later in Section 5.6, we will establish that Iρ(|T|) satisfies a (tight) bound w.r.t. p 1 + (log|T|)+. 5.2. Proof that Polynomial Growth Preserves the Exponential Orlicz Property We show that f(X) Ψγ < for X Ψα < , f FP(n), n R+, γ = α n. Indeed, by the definition of f FP(n), there exists C R+ such that |f(x)| C(1 + x n 2) for all x Rd. Hence for any γ > 0 |f(x)|γ Cγ (1 + x n 2)γ ( ) 2(γ 1)+Cγ (1 + x nγ 2 ) , (23) Chamakh, Gobet and Szab o where in ( ) we used that (a + b)γ 2(γ 1)+ (aγ + bγ) , a, b 0, γ > 0. (24) Since X LΨα there is some s R+ for which E h es X α 2 i < . Combining this property with (23) and recalling that nγ = α yields es |f(x)|γ es 2(γ 1)+Cγ(1+ x α 2 ) E h es |f(X)|γi es 2(γ 1)+CγE h es X α 2 i < with s = s 2(γ 1)+Cγ ; this shows that f(X) LΨγ. 5.3. Proof of Theorem 1 By introducing the Rcf(x) := f(x) Tcf(x) notation of residuals obtained at level c R+ (the value of c will be specified later), we bound the target quantity by using the subadditivity of supremum m [M] ft(Xm) | {z } Tcft(Xm)+Rcft(Xm) m [M] (Tcft(Xm) E [Tcft(Xm)] + E [Tcft(Xm)] + Rcft(Xm)) Tcft(Xm) E [Tcft(Xm)] + sup t T E m [M] Tcft(Xm) m [M] Rcft(Xm) This means that using c for which ETc ε m [M] ft(Xm) ε P ZRc ε/3 + P Z Tc ε/3 . (25) The structure of our proof is as follows. 1. Unbounded part (ZRc): Based on the Talagrand and the Hoffman-Jorgensen inequalities, for large enough c (referred to as c HJ) we will derive an exponential control over P ZRc ε/3 expressed with maxm [M] supt T |ft(Xm)| Ψγ which is finite by Section 5.4. 2. Bounded part (Z Tc): We handle this term using the Klein-Rio inequality and the Dudley entropy integral bound. In addition, this part will give rise to the constraint (10) on ε. 3. Truncation (ETc): As E[ft(Xm)] = 0, Tcft ft and E[Tcft(Xm)] 0 for large c (called cmin). The ETc ε 3 requirement can be controlled via the integral form of the expectation of non-negative random variables and the incomplete Gamma function. Orlicz Random Fourier Features The bounding of the ZRc, Z Tc and ETc quantities is detailed in the following sections. 
Plugging the (26) and (28) results of the computations into (25) gives the final bound (11). The ε constraint comes from (33), provided that c cmin c HJ. The constants cmin and c HJ are defined in (35) and (38), respectively. 5.3.1. Bounding ZRc P ZRc ε/3 is bounded as P ZRc ε/3 P m [M] |Rcft(Xm)| Mε/3 m [M] sup t T |Rcft(Xm)| Mε/3 Mε/3 P m [M] supt T |Rcft(Xm)| Ψγ Mε 3Kγ maxm [M] supt T |ft(Xm)| Ψγ where in (a) we used the the sub-additivity of the supremum, in (b) the deviation inequality Section (4)(vii) was applied, (c) holds by Section 5.5 for c c HJ (the value of c HJ is defined in Section 5.5). 5.3.2. Bounding Z Tc Below we will invoke the Klein-Rio inequality and control the expectation E h Klein-Rio inequality: Let gm,t : x Rd 7 Tcft(x) E [Tcft(Xm)] and let us define the function classes Tc F[M] := {gt := (g1,t, . . . , g M,t) : t T}, Tc F := {Tcft : t T}. gm,t [ 2c, 2c] are measurable and bounded functions. Centering: by construction E[gm,t(Xm)] = 0 ( m [M]). Countability: Since t 7 ft is continuous, the supt T can be restricted to rational numbers (T Qd), one can take T T Qd, and assume that Tc F[M] is countable. If Z Tci ε/6, (27) then the Klein-Rio inequality (Theorem 8 where the supt T and supf Tc F[M] coincide) implies that Z Tc ε/3 (27) P Z Tci ε/6 e M (ε/6)2 2( σ2+4c E[ZTc])+6c ε/6 (27) e M (ε/6)2 2 σ2+14cε/6 = e M ε2 72 σ2+84c ε e M ε2 72σ2+84c ε , (28) Chamakh, Gobet and Szab o where the weak variance σ2 is defined and bounded by σ2 := sup t T m [M] E h (Tcft(Xm) E [Tcft(Xm)])2i sup t T m [M] E h (Tcft(Xm))2i m [M] E f2 t (Xm) =: σ2. Bounding E h Z Tci : We control E h Z Tci in (27) by the Dudley entropy integral bound. In this bound the covering number of Tc F is estimated by that of the compact set T Rd with propagation relying on Assumption (2b). Dudley entropy integral bound: Slight modification (without absolute value) of van der Vaart and Wellner (1996, Lemma 2.3.1) gives Z Tci 2E [R(X1:M, Tc F)] , (29) where R(x1:M, Tc F) := Eε h supt T 1 M P m [M] εm Tcft(xm) i is the Rademacher average of Tc F, X1:M := (Xm)m [M], x1:M := (xm)m [M], ε :=(εm)m [M] contains independent Rademacher variables (i.e. P (εm = 1) = 1 2), and ε is independent of X1:M. Let Zt(x1:M) := 1 M P m [M] εm Tcft(xm), so R(x1:M, Tc F) = Eε [supt T Zt(x1:M)], and define the pseudo-metric on T as d(t, s) := 1 M P m [M] [Tcft(xm) Tcfs(xm)]2 1/2 . The {Zt : t T} process is separable since it is continuous, and T Rd is separable, centered thanks to the Rademacher variables, sub-Gaussian with respect to M 1/2d: indeed, for any λ R+ and t, s T Eε h eλ(Zt Zs)i (a) = m=1 Eεm h eεm λ M [Tcft(xm) Tcfs(xm)]i (b) 2M2 [Tcft(xm) Tcfs(xm)]2 λ2(M 1/2d(t,s)) 2 In (a) we used the independence of εm-s, (b) follows from Eεm [eaεm] = cosh(a) ( ) e a2 2 ( a R), where ( ) can be obtained from the power series expansion of the cosh( ) and the exponential function. Hence Theorem 9 can be applied: R(x1:M, Tc F) CD log(N(ε, M 1/2d, T))dε log(N(ε, d, T))dε, (30) where we used the N(ε, M 1/2d, T) = N(M1/2ε, d, T) identity, and applied an ε = M1/2ε substitution. We note that the above infinite integral can be truncated at |T|d := supt,s T d(t, s), the d-diameter of T, since N(ε, d, T) = 1 for ε |T|d. Orlicz Random Fourier Features Covering number: By (12) one can relate d(t, s) and t s 2 as m [M] L2(xm) ρ ( t s 2) := L L2(x1:M) ρ ( t s 2) , which implies N(ε, d, T) N ρ 1 ε L L2(x1:M) |T|d L L2(x1:M) sup t,s T ρ ( t s 2) L L2(x1:M) ρ(|T|). (32) In the last inequality the increasing property of ρ was exploited. 
Combining (31)-(32) with the well-known bound (van de Geer, 2000, Lemma 2.5) on the covering number7 of a compact set T Rd N ε , 2 , T 2|T| ε + 1 d , ε > 0, (30) can be estimated further as R(x1:M, Tc F) CD Z L L2(X1:M)ρ(|T|) v u u u u u tlog ρ 1 ε L L2(x1:M) d L L2(x1:M) ρ (|T|) log 1 + 2|T| ρ 1 (uρ (|T|)) where we introduced the new variable u = ε L L2(x1:M)ρ(|T|). Substituting this bound into (29) we arrive at d E h L L2(X1:M) i M Iρ(|T|) =: B. (33) To guarantee E h Z Tci ε/6, we solve B ε 6; this gives the (10) bound on ε. 5.3.3. Bounding ETc Bounding ETc by the incomplete Gamma function (Iγ): E [Tcft(Xm)] (a) = E [Rcft(Xm)] (b) E ( ft(Xm) c) 1ft(Xm) c 7. In our definition of the covering number, in its bound on compact sets in Rd (van de Geer, 2000, Lemma 2.5) and in the final Dudley entropy bound (Bartlett, 2013, Lecture 11, 14) the elements of the ϵ-net are assumed to belong to the set covered. Chamakh, Gobet and Szab o c P ( ft(Xm) y) dy. (34) In (a) we used that Tcft(x) = ft(x) Rcft(x) and E[ft(Xm)] = 0, (b) follows from Rcft(x) = [ft(x) + c]1ft(x) c + [ft(x) c]1ft(x) c [ft(x) + c]1ft(x) c. (c) holds by using that for a Z 0 random variable, E[Z] = R 0 P (Z z) dz; we choose Z = max (0, ft(Xm) c). Therefore ETc = sup t T E m [M] Tcft(Xm) (a) max m [M] sup t T c P ( ft(Xm) y) dy (b) 2 max m [M] sup t T c e y ft(Xm) Ψγ (c) = 2 max m [M] sup t T c ft(Xm) Ψγ e uγdu (d) = 2 max m [M] sup t T 0 e uγdu Z c ft(Xm) Ψγ (e) = 2 max m [M] sup t T c ft(Xm) Ψγ (f) 2 max m [M] sup t T ft(Xm) Ψγ c maxm [M] supt T ft (Xm ) Ψγ (g) 2Γ 1 + 1 c maxm [M] supt T ft(Xm) Ψγ max m [M] sup t T ft(Xm) Ψγ where (a) holds by taking maximum over m [M] and using (34), (b) follows from the P ( ft(Xm) y) 2e y ft(Xm) Ψγ deviation inequality implied by Section 4(vii). (c) was obtained from a u = y ft(Xm) Ψγ substitution, in (d) we decomposed the integral to have the incomplete Gamma function appear. (e) is a consequence of the definition of Iγ and the limit Iγ(x) = Z x 0 e tγdt = 1 0 u 1 γ 1e udu x 1 where we applied an u = tγ substitution and the Γ(z + 1) = zΓ(z) recursion. (f) comes from the monotonicity of Iγ (Iγ(x) Iγ(y) if x y). (g) follows from applying the lower bound on Iγ from Theorem 10 with x = c maxm [M] supt T ft(Xm) Ψγ . Orlicz Random Fourier Features Additional truncation level bound on c: Guaranteeing B ε 3 (and thus ETc ε 3) is equivalent to choosing c large enough such that c maxm [M] supt T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ . Because γ 1, the function h : x 7 1 (1 x) 1 γ is concave on [0, 1], and thus it is below its tangent line computed at (0, h(0)), i.e. 1 (1 x) 1 γ 1 γ x. Therefore choosing c maxm [M] supt T ft(Xm) Ψγ it is enough to use c such that c maxm [M] supt T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ . Solving this inequality for c means that c cmin := max m [M] sup t T ft(Xm) Ψγ γ maxm [M] supt T ft(Xm) Ψγ 5.4. Proof of F FP(n), E maxm [M] supt T |ft(Xm)| < , and maxm [M] supt T |ft(Xm)| Ψ α By Assumption (2a)-(2b), the triangle inequality and the monotonicity of ρ, one gets |ft(x)| |ft(x) ft0(x)| + |ft0(x)| L(x) [ρ ( t t0 2) + 1] L(x)[ρ(|T|) + 1], (36) for any t T, x Rd. The individual statements now can be proved as follows. F FP(n): By L FP(n) and (36), ft FP(n) for all t T, in other words F FP(n). Finiteness of maxm [M] supt T |ft(Xm)| Ψ α n : Using (36), we get max m [M] sup t T |ft(Xm)| [ρ(|T|) + 1] X m [M] L(Xm). (37) Thanks to Section 5.2, each L(Xm) belongs to LΨ α n . Combining this with the generalized triangular inequality Section 4(vi) gives the claim. 
Finiteness of E maxm [M] supt T |ft(Xm)| : Each L(Xm) is integrable (because L has a polynomial growth and the distribution of Xm satisfies the α-Orlicz exponential assumption). Thus, the statement follows from (37). 5.5. Control if c c HJ We show that under the assumptions of Theorem 1 with c c HJ := 8E max m [M] sup t T |ft(Xm)| (38) Chamakh, Gobet and Szab o m [M] sup t T |Rcft(Xm)| max m [M] sup t T |ft(Xm)| Ψγ . (39) Notice that c HJ is finite by Section 5.4. We bound the l.h.s. of (39): m [M] sup t T |Rcft(Xm)| sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| + E sup t T |Rcft(Xm)| Ψγ (a) 2 1 γ 1 sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| Ψγ m [M] sup t T |Rcft(Xm)| (b) 2 1 γ 1 sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| + max m [M] sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| + 1 Ψ 1 γ (1)E m [M] sup t T |Rcft(Xm)| =: 2 1 γ 1 Cγ(E1 + E2) + 1 Ψ 1 γ (1)E3 In (a) we applied the generalized triangle inequality Section 4(vi) and 1 γ 1 as γ = α n (0, 1]. In (b) the Talagrand inequality (45) was invoked with the Ym := supt T |Rcft(Xm)| E [supt T |Rcft(Xm)|] centered variables and B := R, followed by taking the γ-Orlicz norm of the constant λ := E h P m [M] supt T |Rcft(Xm)| i according to Section 4(ii). We continue the derivation with bounding the E1, E2 and E3 terms in (40). Bounding E1: sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| Orlicz Random Fourier Features m [M] sup t T |Rcft(Xm)| (b) 16E max m [M] sup t T |Rcft(Xm)| (c) 16E max m [M] sup t T |ft(Xm)| (d) 16 max m [M] sup t T |ft(Xm)| Ψ(l) γ Ψ(l) γ 1 (1) (e) 16 max m [M] sup t T |ft(Xm)| Ψγ Ψ(l) γ 1 (1), where in (a) we used the triangle inequality, in (b) we applied the Hoffman-Jorgensen inequality (Theorem 6; t0 = 0, p = 1, B = R, Ym = supt T |Rcft(Xm)|) with m [M] sup t T |Rcft(Xm)| > 0 m [j] sup t T |Rcft(Xm)| > 0 = P max m [M] sup t T |ft(Xm)| > c (g) P max m [M] sup t T |ft(Xm)| c HJ 8 = 1 2 4p with p = 1. In (f) the non-negativity of Ym was exploited; in (g) c c HJ was used. We applied the Markov inequality and the definition of c HJ in (h). (c) holds by |Rcft(Xm)| |ft(Xm)|. In (d) and (e) we applied Section 4(v) with the convex Ψ(l) γ defined in Section A.1 and the monotonicity property Section 4(iii) with Ψ(l) γ Ψγ, respectively. Bounding E2: E2 = max m [M] sup t T |Rcft(Xm)| E sup t T |Rcft(Xm)| (a) max m [M] sup t T |Rcft(Xm)| + E sup t T |Rcft(Xm)| Ψγ (b) max m [M] sup t T |Rcft(Xm)| + max m [M] E sup t T |Rcft(Xm)| Ψγ (c) 2 1 γ 1 max m [M] sup t T |Rcft(Xm)| Ψγ + max m [M] E sup t T |Rcft(Xm)| Ψγ (d) 2 1 γ 1 max m [M] sup t T |Rcft(Xm)| Ψγ + 1 Ψ 1 γ (1)E max m [M] sup t T |Rcft(Xm)| ! (e) 2 1 γ 1 max m [M] sup t T |ft(Xm)| Ψγ + Ψ(l) γ 1 (1) max m [M] sup t T |ft(Xm)| Ψ(l) γ (f) 2 1 γ 1 Ψ(l) γ 1 (1) max m [M] sup t T |ft(Xm)| Ψγ . Chamakh, Gobet and Szab o In (a) we used the triangle inequality with the monotonicity Section 4(iv), (b) holds by the sub-additivity of the maximum and again the monotonicity Section 4(iv), in (c) we applied the generalized triangle inequality Section 4(vi) and that 1 γ 1 γ 1 as γ (0, 1], (d) holds by Section 4(ii) with the constant λ = E maxm [M] supt T |Rcft(Xm)| , (e) is by the monotonicity Section 4(iv) as |Rcft(Xm)| |ft(Xm)|, and by Section 4(v), (f) follows from Ψ(l) γ Ψγ combined with the monotonicity Section 4(iii). Bounding E3: By (b)-(e) of the E1 derivation we have that m [M] sup t T |Rcft(Xm)| 8 max m [M] sup t T |ft(Xm)| Ψγ Ψ(l) γ 1 (1). By adding the obtained E1, E2 and E3 bounds, we get (39) with Kγ defined in Theorem 1. 5.6. 
Bounding the Driving Terms of Theorem 1 for RFF We bound the constants of Theorem 1 in the RFF case described in Remark 2(v). The term B: It is defined in (10). Recalling the expression (22) for Iρ(|T|) and using the Cauchy-Schwarz inequality for bounding E L L2(ω1:M) by p Eω Λ [L2(ω)] gives Eω Λ [L2(ω)] M |S |β Z 1 log 1 + 2u 1/β du. We now aim at showing a tight bound for |S |β R 1 0 log 1 + 2u 1/β du w.r.t. |S | with an appropriate choice of β = β(|S |). Indeed, let β = 1 1+(log|S |)+ (0, 1]. We start by proving the bound log 1 + 2u 1/β du 4 β , β (0, 1]. (41) By the change of variable t = β log 1 + 2u 1 β (i.e. u = et/β 1 2 β ), we get (et/β 1)β+1 dt = 2β t et(1 e t/β)β+1 dt. Using the fact that 1 e t/β 2 3 on [β log(3), + ), we arrive at te tdt ( ) 9 2 β Γ 3 where the inequality ( ) is obtained by taking β = 1 in 3β+1 and β = 0 in the integral; hence (41) is proved. Now, using (41) with β = 1 1+(log|S |)+ and its |S |β = e log|S | 1+(log|S |)+ implication, we get log 1 + 2u 1/β du 4e log|S | 1+(log|S |)+ p 1 + (log|S |)+ 4e p 1 + (log|S |)+, Orlicz Random Fourier Features and therefore Eω Λ [L2(ω)] p 1 + (log|S |)+ The term σ2: It is defined in Theorem 1. Since the variance is bounded by the second moment, E g2 z(ωm) E f2 z(ωm) . Furthermore, since (ωm)M m=1 are i.i.d., the previous expectation can bounded by E h ω 2|p+q| 2 i using (16). As a result, we get σ2 Eω Λ h ω 2|p+q| 2 i . The term maxm [M] supz S gz(ωm) Ψγ with γ = α/n 1 and n = |p + q| + β: It appears in the definition of c (in Theorem 1). In view of the bound (17) which is uniform in z and using property (iv) of Section 4, we get max m [M] sup z S gz(ωm) Ψγ = max m [M] sup z S |gz(ωm)| Ψγ ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n . The term maxm [M] supz S |gz(ωm)| Ψγ: It shows up in the exponential bound (11). We invoke the maximal inequality for the γ-Orlicz norm (item (viii) of Section 4) with the previous estimate to obtain max m [M] sup z S |gz(ωm)| Ψγ log(1 + M) n/α ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n . The term E maxm [M] supz S |gz(ωm)| : It appears in the definition of c. Using properties (iii) and (v) of Section 4 and the convexification of Ψγ, we directly get max m [M] sup z S |gz(ωm)| Ψ(l) α/n 1 (1) log(1 + M) n/α ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n . Collecting the different bounds we obtain (13) by setting CRFF(n) := max Eω Λ [L2(ω)], Eω Λ h ω 2|p+q| 2 i , 1 Ψ(l) α/n 1 (1) [log(3/2)] n/α ω |p+q| 2 + Λ h |p+q| 2 i Ψα/n Long (but standard) computations show that CRFF(n) is uniformly bounded for n [|p + q|, |p + q| + 1], and thus we can set CRFF := supn [|p+q|,|p+q|+1] CRFF(n). 5.7. Proofs of Corollary 3 and 4 Corollary 3 with the existence of C R+ is a direct consequence of Theorem 1 combined with Remark 2(v), in particular because CRFF does not depend on S and M, and Kγ can Chamakh, Gobet and Szab o be bounded uniformly in n [|p + q|, |p + q| + 1]. The Talagrand constant Cγ is uniformly bounded w.r.t. γ provided that γ is bounded away from 0, see the proof of Talagrand (1989, Theorem 3). Now let us prove Corollary 4; set εM = (1+[log |(SM) |]+) log(1+M) M . 
Observe that (i) εM satisfies the lower bound requirement on ε in Corollary 3; (ii) by assumption εM 0 as M 0 by using that |S | 2|S|; (iii) therefore #1/γ 1 + εM !#1/γ + [log(1 + M)]1/γ 2 + εM [log(1 + M)]1/γ for M large enough; As a consequence of (iii) and (iv), setting δM := 1+(log |(SM) |)+ log(1+M) 1, we get (for M large enough) M ε2 M C 1 + εM h log C log(1 + M) i1/γ 6 C log(1 + M)δM 6 C C δM[log(1+M)]1/2+1/γ = 6 log(1 + M) δM 2 + z M δM , where z M = 6 C C [log(1+M)]1/2+1/γ M M 0. Since the function δ R+ 7 δ 2+z M δ is increasing and δM 1, we get (for M large enough) M ε2 M C 1 + εM h log C log(1 + M) i1/γ 6 log(1 + M)1 On the other hand, using (iv), we easily get (MεM)γ C log(1+M) 2 log(1 + M) for M large enough. To sum up, in view of (14), we have proved (still for large enough M) ΛM \ p,qk p,qk SM ϵM 2 (1 + M)2 + 1 (1 + M)2 Orlicz Random Fourier Features and by the Borell-Cantelli lemma, we conclude to the a.s. convergence (15). Acknowledgments This research is supported by the Chair Stress Test, RISK Management and Financial Steering of the Foundation Ecole Polytechnique and by the Association Nationale de la Recherche Technique. This work is part of the first author s doctoral thesis, under the supervision of the two other authors. The authors wish to thank Radoslaw Adamczak for his detailed comments on his work. Appendix A. Additional Proofs A.1. Ψ(l) γ , the Convexification of Ψγ In the proof of Theorem 1 an integral control with convex Ψ (see Section 4(v)) is beneficial/applied. However, Ψγ is not convex for γ (0, 1). To handle this issue, we convexify Ψγ(x) = exγ 1 in case of γ (0, 1) for small values of the argument.8 By computing the derivatives of Ψγ we get that it is convex iffx xγ := 1 γ γ . Indeed, Ψ γ(x) = γxγ 1exγ, Ψ γ(x) = γexγ (γ 1)xγ 2 + xγ 1γxγ 1 Ψ γ(x) = 0 x = xγ, Ψ γ(x) > 0 x > xγ, Ψ γ(x) < 0 x < xγ. We also have to make sure that Ψ(l) γ , constructed as the line connecting (0, 0) with (x, Ψγ(x)) glued to Ψγ|[x, ), gives a convex function, for a suitable choice of x. A geometric argument shows that it is enough to choose x xγ(> 0) such that x Ψ γ(x) exγ 1 γxγexγ. Since the r.h.s. is higher order than the l.h.s., the requirement can be satisfied for large enough x; we can choose xγ := inf x xγ : exγ 1 γxγexγ , Ψ(l) γ (x) := xγ x if x [0, xγ), Ψγ(x) if x [ xγ, ). Notice that by construction Ψ(l) γ Ψγ. 8. For γ = 1, Ψ(l) γ = Ψγ. Chamakh, Gobet and Szab o A.2. Proof of Remark 5(ii) ωi LΨαi means that Eωi Λi esi|ωi|αi < for some si R+ ( i [d]). Let α = mini [d] αi and ω α := P i [d] |ωi|α 1 α . Then ω 2 d supi [d] |ωi| d ω α ( ω Rd). Notice that |ωi|α |ωi|αi if |ωi| 1 and |ωi|α 1 otherwise, i.e. we have |ωi|α |ωi|αi + 1 for any ωi R. This means that taking s = mini [d] si and s := s dα/2 > 0 gives ω α 2 dα/2 ω α α = dα/2 X i [d] |ωi|α dα/2 X i [d] (|ωi|αi + 1), Eω h e s ω α 2 i Eω h es P i [d](|ωi|αi+1)i Eωi h es|ωi|αii es Y Eωi h esi|ωi|αii es < , where we used the independence of ωi-s in ( ). We got that Eω Λ h e s ω α 2 i < which implies that ω LΨα. A.3. Proof of the Properties in Section 4 about the Orlicz Norm Properties (i)-(iv): These properties are well-known and directly follow from the definition of the Orlicz norm. Property (v): The case X Ψ= 0 gives a trivial inequality and can be discarded. Since Ψ is bounded from below by an increasing affine function, X LΨ implies that X is integrable. Combining (i) with Jensen s inequality gives Ψ E[ X 2] E h Ψ X 2 X Ψ i 1, and the result follows. 
Property (vi): It is well-known that the usual triangle inequality holds for α 1. We now focus on the case α (0, 1]. Set c := X α Ψα + X α Ψα 1/α, p := cα X α Ψα and q := cα X α Ψα , and notice that 1 q = 1. Then, combining (24) with γ = α (0, 1) and the H older inequality with the conjugate exponents (p, q) yields lim sup m + E e m X+X 2 c α lim sup m + E e m X 2 c α e m X 2 lim sup m + E em p X α 2 cα 1/p lim sup m + E em q X α 2 cα 1/q item (i) 21/p21/q = 2. Therefore, X + X LΨα and X + X Ψα c by the definition of the α-Orlicz norm. Applying (24) with γ = 1/α we get c 2( 1 α 1)+ X Ψα + X Ψα and hence the claimed result is proved. Property (vii): This is a direct consequence of the Markov inequality. Property (viii): A similar statement was proved by van der Vaart and Wellner (1996, Lemma 2.2.2), but under the assumption that Ψα is convex (which holds only if α 1) and without explicit constant. Our statement is valid for any α > 0 with explicit control. Orlicz Random Fourier Features A first inequality: Let α R+. We claim that for any x0 > 0 and any x, y 1, we have Ψα x1/α 0 x Ψα x1/α 0 y Ψα x1/α 0 Ψα x1/α 0 xy . (43) Because Ψα(x) = Ψ1(xα) where Ψ1(x) =: Ψ(x) = ex 1, the inequality for α = 1 clearly implies those for all α > 0. To prove the inequality for α = 1, let x0 and x be fixed, and set H(y) = Ψ(x0)Ψ(x0xy) Ψ(x0x) Ψ(x0y). One has H (y) = x0xΨ(x0)ex0xy x0Ψ(x0x)ex0y = x2 0xex0ex0x Ψ(x0) x0ex0 ex0x(y 1) Ψ(x0x) x0xex0x ex0(y 1) , x0ex0 = 1 e x0 0 e ux0du Z 1 0 e ux0xdu = Ψ(x0x) ex0x(y 1) ex0(y 1), where we used x0 > 0, x, y 1 at the two last inequalities. This shows that H (y) 0, and since H(1) = 0 we have H(y) 0 for any y 1. Consequently, (43) is proved. Final maximal inequality: We follow the arguments of van der Vaart and Wellner (1996, Lemma 2.2.2) with slights modifications. The inequality (43) can be rewritten as Ψα(x) Ψα x1/α 0 Ψα xy/x1/α 0 /Ψα(y), x, y x1/α 0 . (44) Set c = maxm [M] Xm Ψα /x1/α 0 and let y x1/α 0 . If maxm [M] Xm 2 cy x1/α 0 , then we have the crude bound Ψα maxm [M] Xm 2 Ψα x1/α 0 = Ψ(x0). If maxm [M] Xm 2 cy x1/α 0 , then (44) yields maxm [M] Xm 2 maxm [M] Xm 2 maxm [M] Xm Ψα m [M] Ψ(x0)Ψα Xm 2/ Xm Ψα /Ψα(y). Consequently, in both cases we have maxm [M] Xm 2 m [M] Ψα Xm 2/ Xm Ψα /Ψα(y) Taking expectation and using Property (i), we arrive at E h Ψα maxm [M] Xm 2 Ψ(x0) h 1 + M Ψα(y) i . Let us choose x0 such that Ψ(x0) < 1. In this case the choice y = x1/α 0 Ψ 1 α M 1/Ψ(x0) 1 ensures that the above bound is valid and smaller than 1. Chamakh, Gobet and Szab o Consequently, by the definition of α-Orlicz norm, we get maxm [M] Xm 2 Ψα cy. The choice x0 = log(3/2) satifies the previous requirement: Ψ(x0) = 1 2 < 1. In this case y = [log(3/2) log(1 + M)]1/α = [log(1 + M)]1/α since M 1. We have obtained the claimed Property (viii). Appendix B. Special Functions Name Definition Modified Bessel function of the first kind Ja(x) = P n N 1 n!Γ(n+a+1) x Modified Bessel function of the second kind Ka(x) = π 2 J a(x) Ja(x) sin(aπ) Fox-Wright generalized hyperbolic function Ψ1,1 ((a, A); (b, B); x) = P n N Γ(a+An) Γ(b+Bn) xn n! (Imaginary) error function erfi(x) = P n N 2 π x2n+1 n!(2n+1) Kummer s confluent hypergeometric function F1,1(a; b; x) = P n N a(n) b(n) xn n! Polylogarithm function Lia(x) = P n Z+ xn na Ordinary hyperbolic function F2,1(a, b; c; z) = P n N a(n)b(n)zn Table 4: Definition of special functions. Ja(x), Ka(x): x R and a is non-integer; when a is an integer the limit is taken. 
Appendix B. Special Functions

    Modified Bessel function of the first kind:      J_a(x) = Σ_{n ∈ N} [1 / (n! Γ(n + a + 1))] (x/2)^{2n + a}
    Modified Bessel function of the second kind:     K_a(x) = (π/2) [J_{−a}(x) − J_a(x)] / sin(aπ)
    Fox-Wright generalized hypergeometric function:  Ψ_{1,1}((a, A); (b, B); x) = Σ_{n ∈ N} [Γ(a + An) / Γ(b + Bn)] x^n / n!
    (Imaginary) error function:                      erfi(x) = Σ_{n ∈ N} (2/√π) x^{2n+1} / (n! (2n + 1))
    Kummer's confluent hypergeometric function:      F_{1,1}(a; b; x) = Σ_{n ∈ N} [a^{(n)} / b^{(n)}] x^n / n!
    Polylogarithm function:                          Li_a(x) = Σ_{n ∈ Z_+} x^n / n^a
    Ordinary hypergeometric function:                F_{2,1}(a, b; c; z) = Σ_{n ∈ N} [a^{(n)} b^{(n)} / c^{(n)}] z^n / n!

Table 4: Definition of special functions. J_a(x), K_a(x): x ∈ R and a is non-integer; when a is an integer the limit is taken. Ψ_{1,1}((a, A); (b, B); x): a ∈ R_+, b ∈ R_+, x ∈ R, A ∈ R_+, B ∈ R_+ and 1 + B > A; Ψ_{1,1}((a, 1); (b, 1); x) = [Γ(a)/Γ(b)] F_{1,1}(a; b; x). erfi(x): x ∈ R. F_{1,1}(a; b; x): a ∈ R_+, b ∈ R_+, x ∈ R. Li_a(x): a ∈ R, x ∈ R, |x| < 1. F_{2,1}(a, b; c; z): a ∈ C, b ∈ C, c ∈ C \ Z_{≤0}, z ∈ C and |z| < 1; for |z| ≥ 1 its analytical continuation is taken. For n ∈ N, a^{(n)} is the rising factorial of a defined as a^{(n)} = Γ(a + n) / Γ(a), where a ∈ C \ Z_{≤0} and a + n ∈ C \ Z_{≤0}.

Appendix C. External Statements

In this subsection we state external results which were used to derive our results. Below B stands for a separable Banach space, and L_p(B) is the space of B-valued p-integrable functions. The norm ‖·‖_{Ψ_α} is defined analogously to the R^d case by changing ‖·‖_2 to ‖·‖_B.

Theorem 6 (Hoffmann-Jørgensen inequality, Ledoux and Talagrand 2013, Proposition 6.8) Let p > 0, M ∈ Z_+, (Y_m)_{m ∈ [M]} be independent random variables in L_p(B), S_m := Σ_{j=1}^m Y_j for m ∈ [M], and t_0 := inf { t > 0 : P( max_{1 ≤ m ≤ M} ‖S_m‖_B > t ) ≤ (2 · 4^p)^{−1} }. Then

    E[ max_{m ∈ [M]} ‖S_m‖_B^p ] ≤ 2 · 4^p E[ max_{m ∈ [M]} ‖Y_m‖_B^p ] + 2 (4 t_0)^p.

Theorem 7 (Talagrand, 1989, Theorem 3) Let γ ∈ (0, 1]. Then there is a constant C_γ such that for all finite sequences (Y_m)_{m ∈ [M]} of independent, mean-zero, integrable random variables in L_{Ψ_γ}(B), we have

    ‖ Σ_{m ∈ [M]} Y_m ‖_{Ψ_γ} ≤ C_γ ( E‖ Σ_{m ∈ [M]} Y_m ‖_B + ‖ max_{m ∈ [M]} ‖Y_m‖_B ‖_{Ψ_γ} ).

Theorem 8 (Klein-Rio inequality for the supremum of empirical processes, Klein and Rio 2005, Theorems 1.1-1.2) Let M ∈ Z_+, c ∈ R_+, (X_m)_{m ∈ [M]} be independent B-valued random variables, and F a countable set of f := (f_1, . . . , f_M) measurable functions from B into [−c, c]^M such that E[f_m(X_m)] = 0 for all m ∈ [M]. Define Z := sup_{f ∈ F} (1/M) Σ_{m ∈ [M]} f_m(X_m) and σ² := (1/M) sup_{f ∈ F} E[ Σ_{m ∈ [M]} f_m²(X_m) ]. Then, for any t ≥ 0, the following right- and left-hand sided deviation inequalities hold:

    P( Z ≥ E[Z] + t ) ≤ e^{ −M t² / ( 2(σ² + 2c E[Z]) + 3c t ) },    P( Z ≤ E[Z] − t ) ≤ e^{ −M t² / ( 2(σ² + 2c E[Z]) + 2c t ) }.

Theorem 9 (Dudley entropy integral bound)⁹ Let {Z_t : t ∈ T} be a zero-mean separable stochastic process that is sub-Gaussian w.r.t. a pseudo-metric d on the indexing set T, in other words, for every λ ∈ R,

    E[ e^{λ (Z_t − Z_s)} ] ≤ e^{ λ² d(s, t)² / 2 }    (∀ s, t ∈ T).

Then there exists a universal constant C_D such that

    E[ sup_{t ∈ T} Z_t ] ≤ C_D ∫_0^∞ √( log N(ε, d, T) ) dε,    (46)

where N(ε, d, T) denotes the covering number.

9. See van der Vaart and Wellner (1996, Corollary 2.2.8) for a general statement. Regarding the numerical value of C_D, van Handel (2016, Corollary 5.25) proves that one can take C_D = 12, whereas Bartlett (2013, Lecture 14) suggests a slightly smaller constant C_D = 8√2.

Theorem 10 (Alzer 1997, Theorem 1)¹⁰ Let γ ∈ (0, 1], β_γ := Γ(1 + 1/γ)^{−γ}, x ∈ R_{≥0}, and I_γ(x) := ∫_0^x e^{−t^γ} dt. Then

    ( 1 − e^{−β_γ x^γ} )^{1/γ} ≤ I_γ(x) / Γ(1 + 1/γ) ≤ ( 1 − e^{−x^γ} )^{1/γ}.

10. The statement here follows by taking the limit of the cited result at γ = 1 and x = 0.
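As a quick sanity check of Theorem 10 as reconstructed above (an illustration added here, not part of the paper; the grid and the tolerance are arbitrary choices), the following sketch evaluates both sides of Alzer's inequality for several γ ∈ (0, 1], computing I_γ(x) by numerical quadrature.

# Numerical illustration (not from the paper): sanity check of Theorem 10 (Alzer),
#   (1 - exp(-beta_g * x^g))^(1/g) <= I_g(x)/Gamma(1+1/g) <= (1 - exp(-x^g))^(1/g),
# with beta_g = Gamma(1+1/g)^(-g) and I_g(x) = int_0^x exp(-t^g) dt (numerical quadrature).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

tol = 1e-6   # numerical slack; for g = 1 both bounds coincide with the middle term
for g in [0.3, 0.5, 0.8, 1.0]:
    beta = Gamma(1.0 + 1.0 / g) ** (-g)
    ok = True
    for x in np.linspace(0.05, 10.0, 200):
        I = quad(lambda t: np.exp(-t ** g), 0.0, x)[0]
        mid = I / Gamma(1.0 + 1.0 / g)
        lower = (1.0 - np.exp(-beta * x ** g)) ** (1.0 / g)
        upper = (1.0 - np.exp(-x ** g)) ** (1.0 / g)
        ok = ok and (lower <= mid + tol) and (mid <= upper + tol)
    print(f"gamma = {g:.1f}: inequality holds on the grid -> {ok}")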
References

Radoslaw Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000–1034, 2008.

Ahmed El Alaoui and Michael Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems (NIPS), pages 775–783, 2015.

Horst Alzer. On some inequalities for the incomplete Gamma function. Mathematics of Computation of the American Mathematical Society, 66(218):771–778, 1997.

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18:1–38, 2017.

Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Ole E. Barndorff-Nielsen, Thomas Mikosch, and Sidney I. Resnick. Lévy Processes: Theory and Applications. Springer Science+Business Media, 2001.

Peter Bartlett. Theoretical statistics. Available at https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/, 2013.

Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, François Fagan, Cédric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamás Sarlós, and Jamal Atif. Structured adaptive and random spinners for fast machine learning computations. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1020–1029, 2017.

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 10192–10203, 2018.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.

Roxana Ciumara and Vasile Preda. The Weibull-logarithmic distribution in lifetime analysis and its properties. In International Conference Applied Stochastic Models and Data Analysis (ASMDA), pages 395–399, 2009.

Sándor Csörgő and Vilmos Totik. On how long interval is the empirical characteristic function uniformly consistent? Acta Scientiarum Mathematicarum, 45:141–149, 1983.

Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems (NIPS), pages 3041–3049, 2014.

Carlton Downey, Ahmed Hefny, Byron Boots, Boyue Li, and Geoff Gordon. Predictive state recurrent neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 6053–6064, 2017.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

René Duijkers, Roland Tóth, Dario Piga, and Vincent Laurain. Shrinking complexity of scheduling dependencies in LS-SVM based LPV system identification. In IEEE Conference on Decision and Control, pages 2561–2566, 2014.

Anna Gilbert, Ambuj Tewari, and Yitong Sung. But how does it work in theory? Linear SVM with random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 3379–3388, 2018.

Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, Ali Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltán Szabó. Kernel-based just-in-time learning for passing expectation propagation messages. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 405–414, 2015.

Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, and Patrick Pérez. Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA, 7:447–508, 2018.

Thierry Klein and Emmanuel Rio. Concentration around the mean for maxima of empirical processes. The Annals of Probability, 33(3):1060–1077, 2005.

Fabien Lauer, Van Luong Le, and Gérard Bloch. Learning smooth models of nonsmooth functions via convex optimization. In International Workshop on Machine Learning for Signal Processing (IEEE-MLSP), 2012.

Quoc Le, Tamás Sarlós, and Alexander Smola. Fastfood - computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning (ICML), pages 244–252, 2013.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2013.
Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

Zhu Li, Jean-François Ton, Dino Oglic, and Dino Sejdinovic. A unified analysis of random Fourier features. In International Conference on Machine Learning (ICML), pages 3905–3914, 2019.

David Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf, and Ilya Tolstikhin. Towards a learning theory of cause-effect inference. In International Conference on Machine Learning (ICML), pages 1452–1461, 2015.

Junier Oliva, Willie Neiswanger, Barnabás Póczos, Eric Xing, and Jeff Schneider. Fast function to function regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 717–725, 2015.

Vern I. Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge University Press, 2016.

Tibor K. Pogány and Saralees Nadarajah. On the characteristic function of the generalized normal distribution. Comptes Rendus Mathematique, 348(3-4):203–206, 2010.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2007.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems (NIPS), pages 1313–1320, 2008.

Lorenzo Rosasco, Matteo Santoro, Sofia Mosci, Alessandro Verri, and Silvia Villa. A regularization approach to nonlinear variable selection. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 653–660, 2010.

Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714, 2013.

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems (NIPS), pages 3218–3228, 2017.

Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems (NIPS), pages 3891–3901, 2017.

Walter Rudin. Fourier Analysis on Groups. Wiley-Interscience, 1990.

Lei Shi, Xin Guo, and Ding-Xuan Zhou. Hermite learning with gradient data. Journal of Computational and Applied Mathematics, 233:3046–3059, 2010.

Bharath Sriperumbudur and Nicholas Sterge. Approximate kernel PCA using random features: Computational vs. statistical trade-off. Technical report, Pennsylvania State University, 2018. (https://arxiv.org/abs/1706.06296).

Bharath K. Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems (NIPS), pages 1144–1152, 2015.

Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18(57):1–59, 2017.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems (NIPS), pages 955–963, 2015.
Eric V. Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1):20180017, 2019. (eISSN 2193-3685; https://doi.org/10.1515/jci-2018-0017).

Dougal J. Sutherland and Jeff Schneider. On the error of random Fourier features. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 862–871, 2015.

Zoltán Szabó and Bharath K. Sriperumbudur. On kernel derivative approximation with random Fourier features. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 827–836, 2019.

Michel Talagrand. Isoperimetry and integrability of the sum of independent Banach-space valued random variables. The Annals of Probability, 17(4):1546–1570, 1989.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Enayat Ullah, Poorya Mianjy, Teodor V. Marinov, and Raman Arora. Streaming kernel PCA with Õ(√n) random features. In Advances in Neural Information Processing Systems (NeurIPS), pages 7311–7321, 2018.

Sara van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.

Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York, 1996.

Ramon van Handel. Probability in high dimension. Available at https://web.math.princeton.edu/~rvan/APC550.pdf, 2016.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 682–688, 2001.

Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning (ICML), pages 485–493, 2014.

Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. The Annals of Statistics, 45:991–1023, 2017.

Yiming Ying, Qiang Wu, and Colin Campbell. Learning the coordinate gradients. Advances in Computational Mathematics, 37:355–378, 2012.

Felix X. Yu, Ananda Theertha Suresh, Krzysztof M. Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems (NIPS), pages 1975–1983, 2016.

Jian Zhang, Avner May, Tri Dao, and Christopher Ré. Low-precision random Fourier features for memory-constrained kernel approximation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1264–1274, 2019.

Qinyi Zhang, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, pages 1–18, 2017.

Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463, 2008.