# Quadrature-based Features for Kernel Approximation

Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, Ivan Oseledets
Skolkovo Institute of Science and Technology, Moscow, Russia
Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow, Russia

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We consider the problem of improving kernel approximation via randomized feature maps. These maps arise as Monte Carlo approximations to integral representations of kernel functions and scale up kernel methods for larger datasets. Based on an efficient numerical integration technique, we propose a unifying approach that reinterprets previous random features methods and extends to better estimates of the kernel approximation. We derive the convergence behaviour and conduct an extensive empirical study that supports our hypothesis¹.

¹The code for this paper is available at https://github.com/maremun/quffka.

## 1 Introduction

Kernel methods have proved to be an efficient technique in numerous real-world problems. The core idea of kernel methods is the kernel trick: compute an inner product in a high-dimensional (or even infinite-dimensional) feature space by means of a kernel function k:

$$k(x, y) = \langle \psi(x), \psi(y) \rangle, \quad (1)$$

where $\psi: \mathcal{X} \to \mathcal{F}$ is a non-linear feature map transporting elements of the input space $\mathcal{X}$ into a feature space $\mathcal{F}$. It is common knowledge that kernel methods incur space and time complexity that is infeasible for large-scale datasets. For example, kernel regression has $O(N^3 + Nd^2)$ training time, $O(N^2)$ memory and $O(Nd)$ prediction time complexity for $N$ data points in the original $d$-dimensional space $\mathcal{X}$.

One of the most successful techniques to handle this problem, known as Random Fourier Features (RFF) and proposed by [29], introduces a low-dimensional randomized approximation to feature maps:

$$k(x, y) \approx \hat\Psi(x)^\top \hat\Psi(y). \quad (2)$$

This is essentially carried out by using Monte Carlo sampling to approximate the scalar product in (1). A randomized $D$-dimensional mapping $\hat\Psi(\cdot)$ applied to the original data input allows employing standard linear methods, i.e. reverting the kernel trick. In doing so one reduces the complexity to that of linear methods, e.g. a $D$-dimensional approximation admits $O(ND^2)$ training time, $O(ND)$ memory and $O(N)$ prediction time. It is well known that as $D \to \infty$, the inner product in (2) converges to the exact kernel $k(x, y)$. Recent research [35; 14; 9] aims to improve the convergence of the approximation so that a smaller $D$ can be used to obtain the same quality of approximation.

This paper considers kernels that allow the following integral representation:

$$k(x, y) = \mathbb{E}_{p(w)} f_{xy}(w) = I(f_{xy}), \qquad p(w) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{\|w\|^2}{2}}, \qquad f_{xy}(w) = \phi(w^\top x)\,\phi(w^\top y). \quad (3)$$

For example, the popular Gaussian kernel admits such a representation with $f_{xy}(w) = \phi(w^\top x)^\top \phi(w^\top y)$, where $\phi(\cdot) = [\cos(\cdot)\ \sin(\cdot)]^\top$. The class of kernels admitting the form in (3) covers shift-invariant kernels (e.g. radial basis function (RBF) kernels) and Pointwise Nonlinear Gaussian (PNG) kernels. They are widely used in practice and have interesting connections with neural networks [8; 34].

The main challenge for the construction of low-dimensional feature maps is the approximation of the expectation in (3), which is a $d$-dimensional integral with Gaussian weight. Unlike other research studies we refrain from using a simple Monte Carlo estimate of the integral; instead, we propose to use specific quadrature rules.
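To make (2)-(3) concrete, here is a minimal NumPy sketch (our own illustration, not the authors' released code linked in the footnote) of the plain Monte Carlo feature map for the Gaussian kernel: sample $w_i \sim \mathcal{N}(0, I_d)$, apply $\phi(\cdot) = [\cos(\cdot)\ \sin(\cdot)]^\top$, and normalize so that the inner product of two feature vectors is the sample average of $f_{xy}(w_i)$. The function name `rff_features` and the toy data are ours.

```python
import numpy as np

def rff_features(X, D, rng):
    """Plain Monte Carlo (RFF) feature map for the Gaussian kernel, cf. Eqs. (2)-(3)."""
    d = X.shape[1]
    W = rng.standard_normal((D, d))                 # rows w_i ~ N(0, I_d)
    P = X @ W.T                                     # projections w_i^T x, shape (N, D)
    # phi(.) = [cos(.), sin(.)]; the 1/sqrt(D) factor turns the inner product into an average
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = rff_features(X, D=5000, rng=rng)
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))   # Gaussian kernel, sigma = 1
print(np.abs(Z @ Z.T - K_exact).max())                           # Monte Carlo error, shrinks as D grows
```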
We now list our contributions:

- We propose to use spherical-radial quadrature rules to improve kernel approximation accuracy.
- We show that these quadrature rules generalize the RFF-based techniques. We also provide an analytical estimate of the error for the used quadrature rules that implies better approximation quality.
- We use structured orthogonal matrices (so-called butterfly matrices) when designing the quadrature rule, which allows fast matrix-by-vector multiplication. As a result, we speed up the approximation of the kernel function and reduce memory requirements.
- We carry out an extensive empirical study comparing our methods with state-of-the-art ones on a set of different kernels, in terms of both kernel approximation error and downstream task performance. The study supports our hypothesis about the superior accuracy of the method.

## 2 Quadrature Rules and Random Features

We start by rewriting the expectation in Equation (3) as an integral of $f_{xy}$ with respect to $p(w)$:

$$I(f_{xy}) = (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} e^{-\frac{\|w\|^2}{2}} f_{xy}(w)\, dw.$$

Integration can be performed by means of quadrature rules. The rules usually take the form of an interpolating function that is easy to integrate. Given such a rule, one may sample points from the domain of integration and calculate the value of the rule at these points. Then, the sample average of the rule values yields an approximation of the integral.

The connection between the integral approximation and the mapping $\psi$ is straightforward. In what follows we show a brief derivation of the quadrature rules that allow for an explicit mapping of the form

$$\psi(x) = \left[\, a_0\phi(0)\;\; a_1\phi(w_1^\top x)\;\; \ldots\;\; a_D\phi(w_D^\top x)\,\right],$$

where the choice of the weights $a_i$ and the points $w_i$ is dictated by the quadrature. We use the average of sampled quadrature rules developed by [18] to yield unbiased estimates of $I(f_{xy})$.

A change of coordinates is the first step to facilitate stochastic spherical-radial rules. Let $w = rz$, with $z^\top z = 1$, so that $w^\top w = r^2$ for $r \in [0, \infty)$, leaving us with (to ease the notation we substitute $f_{xy}$ with $f$)

$$I(f) = (2\pi)^{-\frac{d}{2}} \int_{U_d} \int_0^\infty e^{-\frac{r^2}{2}} r^{d-1} f(rz)\, dr\, dz = \frac{(2\pi)^{-\frac{d}{2}}}{2} \int_{U_d} \int_{-\infty}^{\infty} e^{-\frac{r^2}{2}} |r|^{d-1} f(rz)\, dr\, dz, \quad (4)$$

so $I(f)$ is now a double integral over the unit $d$-sphere $U_d = \{z : z^\top z = 1,\ z \in \mathbb{R}^d\}$ and over the radius. To account for both integration regions we apply a combination of spherical (S) and radial (R) rules known as spherical-radial (SR) rules. To provide an intuition of how the rules work, here we briefly state and discuss their form².

Stochastic radial rules of degree $2l + 1$, $R(h) = \sum_{i=0}^{l} \hat w_i \frac{h(\rho_i) + h(-\rho_i)}{2}$, have the form of weighted symmetric sums and approximate the infinite-range integral $T(h) = \int_{-\infty}^{\infty} e^{-\frac{r^2}{2}} |r|^{d-1} h(r)\, dr$. Note that when $h$ is set to the function $f$ of interest, $T(f)$ corresponds to the inner integral in (4). To get an unbiased estimate of $T(h)$, the points $\rho_i$ are sampled from specific distributions. The weights $\hat w_i$ are derived so that the rule is exact for polynomials of degree $2l + 1$ and gives an unbiased estimate for other functions.

Stochastic spherical rules $S_Q(s) = \sum_{j=1}^{p} \tilde w_j\, s(Q z_j)$, where $Q$ is a random orthogonal matrix, approximate the integral of a function $s(z)$ over the surface of the unit $d$-sphere $U_d$, where $z_j$ are points on $U_d$, i.e. $z_j^\top z_j = 1$. Recall that the outer integral in (4) has $U_d$ as its integration region. The weights $\tilde w_j$ are stochastic with a distribution such that the rule is exact for polynomials of degree $p$ and gives an unbiased estimate for other functions.

²Please see [18] for a detailed derivation of the stochastic radial (Section 2), spherical (Section 3) and spherical-radial (Section 4) rules.
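The change of variables in (4) can be checked numerically: drawing $w \sim \mathcal{N}(0, I_d)$ directly and drawing $w = rz$ with $r \sim \chi(d)$ and $z$ uniform on the unit sphere give the same value of $I(f)$. The snippet below is a small sanity check of this fact (our own illustration; the variable names are arbitrary), using the Gaussian-kernel integrand $f_{xy}(w) = \cos(w^\top(x - y))$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 200_000
x, y = rng.standard_normal(d), rng.standard_normal(d)
f = lambda w: np.cos(w @ (x - y))                     # f_xy(w) for the Gaussian kernel, w of shape (n, d)

w = rng.standard_normal((n, d))                       # direct sampling: w ~ N(0, I_d)
r = np.sqrt(rng.chisquare(d, size=n))                 # radial part: r ~ chi(d)
z = rng.standard_normal((n, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)         # spherical part: z uniform on U_d

print(f(w).mean(),                                    # Monte Carlo estimate of I(f), Gaussian sampling
      f(r[:, None] * z).mean(),                       # same estimate via the spherical-radial split (4)
      np.exp(-0.5 * ((x - y) ** 2).sum()))            # exact Gaussian kernel value
```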
Stochastic spherical-radial rules SR of degree $(2l + 1, p)$ are given by the expression

$$SR^{(2l+1,p)}_{Q,\rho} = \sum_{i=0}^{l} \hat w_i \sum_{j=1}^{p} \tilde w_j\, \frac{f(\rho_i Q z_j) + f(-\rho_i Q z_j)}{2},$$

where the distributions of the weights are such that if the degrees of the radial and spherical rules coincide, i.e. $2l + 1 = p$, then the rule is exact for polynomials of degree $2l + 1$ and gives an unbiased estimate of the integral for other functions.

### 2.1 Spherical-radial rules of degree (1, 1) are RFF

If we take a radial rule of degree 1 and a spherical rule of degree 1, we obtain the following rule:

$$SR^{(1,1)}_{Q,\rho} = \frac{f(\rho Q z) + f(-\rho Q z)}{2}, \qquad \rho \sim \chi(d).$$

It is easy to see that $\rho Q z \sim \mathcal{N}(0, I)$, and for a shift-invariant kernel $f(w) = f(-w)$; thus the rule reduces to $SR^{(1,1)}_{Q,\rho} = f(w)$, where $w \sim \mathcal{N}(0, I)$. RFF [29] approximates the RBF kernel in exactly the same way: it generates a random vector from a Gaussian distribution and calculates the corresponding feature map.

**Proposition 2.1.** Random Fourier Features for the RBF kernel are SR rules of degree (1, 1).

### 2.2 Spherical-radial rules of degree (1, 3) are ORF

Now, let us take a radial rule of degree 1 and a spherical rule of degree 3. In this case we get the following spherical-radial rule:

$$SR^{(1,3)}_{Q,\rho} = \sum_{i=1}^{d} \frac{f(\rho Q e_i) + f(-\rho Q e_i)}{2d}, \qquad \rho \sim \chi(d),$$

where $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ is the $i$-th column of the identity matrix. Let us compare $SR^{(1,3)}$ rules with Orthogonal Random Features [14] for the RBF kernel. In the ORF approach, the weight matrix $W = SQ$ is generated, where $S$ is a diagonal matrix with entries drawn independently from the $\chi(d)$ distribution and $Q$ is a random orthogonal matrix. The approximation of the kernel is then given by $\hat k_{\mathrm{ORF}}(x, y) = \frac{1}{d}\sum_{i=1}^{d} f(w_i)$, where $w_i$ is the $i$-th row of the matrix $W$. As the rows of $Q$ are orthonormal, they can be represented as $Q e_i$.

**Proposition 2.2.** Orthogonal Random Features for the RBF kernel are SR rules of degree (1, 3).

### 2.3 Spherical-radial rules of degree (3, 3)

We go further and take both spherical and radial rules of degree 3, using the original and reflected vertices $v_j$ of a randomly rotated unit-vertex regular $d$-simplex $V$ as the points on the unit sphere:

$$SR^{(3,3)}_{Q,\rho}(f) = \left(1 - \frac{d}{\rho^2}\right) f(0) + \frac{d}{d + 1} \sum_{j=1}^{d+1} \frac{f(-\rho Q v_j) + f(\rho Q v_j)}{2\rho^2}, \qquad \rho \sim \chi(d + 2). \quad (5)$$

We apply (5) to the approximation of (4) by averaging samples of $SR^{(3,3)}_{Q,\rho}$:

$$I(f) = \mathbb{E}_{Q,\rho}\!\left[SR^{(3,3)}_{Q,\rho}(f)\right] \approx \hat I(f) = \frac{1}{n} \sum_{i=1}^{n} SR^{(3,3)}_{Q_i,\rho_i}(f), \quad (6)$$

where $n$ is the number of sampled SR rules. Speaking in terms of approximate feature maps, the new feature dimension $D$ in the case of the quadrature-based approximation equals $2n(d + 1) + 1$, as we sample $n$ rules and evaluate each of them at $2(d + 1)$ random points and one zero point.

In this work we propose to modify the quadrature rule by generating $\rho_j \sim \chi(d + 2)$ for each $v_j$, i.e.

$$SR^{(3,3)}_{Q,\rho}(f) = \left(1 - \sum_{j=1}^{d+1} \frac{d}{(d + 1)\rho_j^2}\right) f(0) + \frac{d}{d + 1} \sum_{j=1}^{d+1} \frac{f(-\rho_j Q v_j) + f(\rho_j Q v_j)}{2\rho_j^2}.$$

This does not affect the quality of the approximation, while it simplifies the analysis of the quadrature-based random features.

**Explicit mapping.** We finally arrive at the map

$$\psi(x) = \left[\, a_0\phi(0)\;\; a_1\phi(w_1^\top x)\;\; \ldots\;\; a_D\phi(w_D^\top x)\,\right], \qquad a_0 = \sqrt{1 - \sum_{j=1}^{d+1} \frac{d}{(d + 1)\rho_j^2}}\,^3, \qquad a_j = \frac{1}{\rho_j}\sqrt{\frac{d}{2(d + 1)}},$$

where $w_j$ is the $j$-th row of the matrix $W = \operatorname{diag}(\rho)\begin{bmatrix} (QV)^\top \\ -(QV)^\top \end{bmatrix}$, i.e. the rows of $W$ are $\pm\rho_j Q v_j$ and $\rho$ collects the corresponding $\rho_j$. To get $D$ features one simply stacks $n = \frac{D}{2(d+1)+1}$ such matrices $W_k = \operatorname{diag}(\rho_k)\begin{bmatrix} (Q_k V)^\top \\ -(Q_k V)^\top \end{bmatrix}$ so that $W \in \mathbb{R}^{D\times d}$, where only $Q_k \in \mathbb{R}^{d\times d}$ and $\rho_k$ are generated randomly ($k = 1, \ldots, n$). For the Gaussian kernel, $\phi(\cdot) = [\cos(\cdot)\ \sin(\cdot)]^\top$. For the 0-order arc-cosine kernel, $\phi(\cdot) = \Theta(\cdot)$, where $\Theta(\cdot)$ is the Heaviside function. For the 1-order arc-cosine kernel, $\phi(\cdot) = \max(0, \cdot)$.

³To get $a_0^2 \ge 0$, one needs to sample $\rho_j$ two times on average (see the Supplementary Materials for details).
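The explicit mapping above can be written down in a few lines. The sketch below is our own NumPy illustration of the stated formulas, not the authors' quffka code; the simplex construction and the QR-based Haar sampling are standard choices we assume. It builds the SR(3,3) features for the Gaussian kernel, resampling $\rho_j$ until $a_0^2 \ge 0$ as in footnote 3.

```python
import numpy as np

def simplex_vertices(d):
    """Columns are unit vertices v_1, ..., v_{d+1} of a regular d-simplex in R^d
    (pairwise inner products equal to -1/d)."""
    E = np.eye(d + 1) - 1.0 / (d + 1)              # centered standard basis vectors of R^{d+1}
    B = np.linalg.qr(E[:, :d])[0]                  # orthonormal basis of the hyperplane sum(t) = 0
    V = B.T @ E                                    # vertex coordinates in that basis, shape (d, d+1)
    return V / np.linalg.norm(V, axis=0)

def sr33_features(X, n_rules, rng):
    """Quadrature-based SR(3,3) feature map for the Gaussian kernel (sketch).
    Q is a Haar-random orthogonal matrix here; the paper also uses butterfly matrices."""
    N, d = X.shape
    V = simplex_vertices(d)
    blocks = []
    for _ in range(n_rules):
        while True:                                          # footnote 3: resample until a_0^2 >= 0
            rho = np.sqrt(rng.chisquare(d + 2, size=d + 1))  # rho_j ~ chi(d+2), one per vertex
            a0_sq = 1.0 - np.sum(d / ((d + 1) * rho ** 2))
            if a0_sq >= 0:
                break
        G = rng.standard_normal((d, d))
        Q, R = np.linalg.qr(G)
        Q *= np.sign(np.diag(R))                             # Haar-distributed orthogonal matrix
        W = (rho * (Q @ V)).T                                # rows w_j = rho_j Q v_j, shape (d+1, d)
        a = np.sqrt(d / (2 * (d + 1))) / rho                 # weights a_j, j = 1..d+1
        P = X @ W.T                                          # projections w_j^T x, shape (N, d+1)
        blocks.append(np.hstack([
            np.full((N, 1), np.sqrt(a0_sq)),                 # a_0 * cos(0)
            np.zeros((N, 1)),                                # a_0 * sin(0)
            a * np.cos(P), a * np.sin(P),                    # a_j * phi( w_j^T x)
            a * np.cos(-P), a * np.sin(-P),                  # a_j * phi(-w_j^T x)
        ]))
    return np.hstack(blocks) / np.sqrt(n_rules)              # average of the n sampled rules, Eq. (6)
```

With this map, `Z = sr33_features(X, n_rules, rng)` gives the kernel estimate $\hat k(x, y) = \psi(x)^\top\psi(y)$ as `Z @ Z.T`; we reuse it in the experiment sketches further below.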
### 2.4 Generating uniformly random orthogonal matrices

The SR rules require a random orthogonal matrix $Q$. If $Q$ follows the Haar distribution, the averaged samples of $SR^{(3,3)}_{Q,\rho}$ rules provide an unbiased estimate for (4). Essentially, the Haar distribution means that all orthogonal matrices in the group are equiprobable, i.e. uniformly random. Methods for sampling such matrices vary in their complexity of generation and multiplication. We test two algorithms for obtaining $Q$.

The first uses a QR decomposition of a random matrix to obtain a product of a sequence of reflectors/rotators $Q = H_1 \cdots H_{n-1} D$, where $H_i$ is a random Householder/Givens matrix and $D$ is a diagonal matrix with entries such that $\mathbb{P}(d_{ii} = \pm 1) = 1/2$. It admits no fast matrix multiplication. We test both methods for random orthogonal matrix generation and, since their performance coincides, we leave this one out for cleaner figures in the Experiments section.

The other choice for $Q$ are the so-called butterfly matrices [17]. For $d = 4$,

$$B^{(4)} = \begin{bmatrix} c_1 & s_1 & 0 & 0 \\ -s_1 & c_1 & 0 & 0 \\ 0 & 0 & c_3 & s_3 \\ 0 & 0 & -s_3 & c_3 \end{bmatrix} \begin{bmatrix} c_2 & 0 & s_2 & 0 \\ 0 & c_2 & 0 & s_2 \\ -s_2 & 0 & c_2 & 0 \\ 0 & -s_2 & 0 & c_2 \end{bmatrix} = \begin{bmatrix} c_1 c_2 & s_1 c_2 & c_1 s_2 & s_1 s_2 \\ -s_1 c_2 & c_1 c_2 & -s_1 s_2 & c_1 s_2 \\ -c_3 s_2 & -s_3 s_2 & c_3 c_2 & s_3 c_2 \\ s_3 s_2 & -c_3 s_2 & -s_3 c_2 & c_3 c_2 \end{bmatrix},$$

where $s_i, c_i$ are the sine and cosine of some angle $\theta_i$, $i = 1, \ldots, d - 1$. For the definition and discussion please see the Supplementary Materials. The factors of $B^{(d)}$ are structured and allow fast matrix multiplication. The method using butterfly matrices is denoted by B in the Experiments section.
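To make the fast multiplication concrete, the following sketch (our own illustration of the structure shown above, assuming a power-of-two $d$) applies a butterfly orthogonal matrix to a vector by sweeping over $\log_2 d$ stages of 2-by-2 rotations in $O(d\log d)$ operations, without ever forming $B^{(d)}$. The angles here are drawn uniformly purely for demonstration; the distributions required for $Q$ to be (approximately) uniformly random are specified in the paper's Supplementary Materials.

```python
import numpy as np

def butterfly_apply(x, thetas):
    """Compute B^{(d)} x for a butterfly orthogonal matrix parameterized by d-1 angles,
    in O(d log d) time, without forming B explicitly (d must be a power of two)."""
    y = np.asarray(x, dtype=float).copy()
    d = len(y)
    angles = iter(thetas)                               # angle-to-block assignment order is a convention
    stride = d // 2
    while stride >= 1:                                  # one stage per sparse factor of B^{(d)}
        for block in range(0, d, 2 * stride):           # one angle per butterfly block
            theta = next(angles)
            c, s = np.cos(theta), np.sin(theta)
            i = np.arange(block, block + stride)
            j = i + stride
            y[i], y[j] = c * y[i] + s * y[j], -s * y[i] + c * y[j]
        stride //= 2
    return y

d = 8
rng = np.random.default_rng(0)
thetas = rng.uniform(0, 2 * np.pi, size=d - 1)          # illustration only; see the supplement for the exact law
B = np.stack([butterfly_apply(e, thetas) for e in np.eye(d)], axis=1)
print(np.allclose(B.T @ B, np.eye(d)))                  # True: B is orthogonal
```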
## 3 Error bounds

**Proposition 3.1.** Let $l$ be the diameter of the compact set $\mathcal{X}$ and let $p(w) = \mathcal{N}(0, \sigma_p^2 I)$ be the probability density corresponding to the kernel. Suppose that $|\phi(w^\top x)| \le \kappa$, $|\phi'(w^\top x)| \le \mu$ for all $w \in \Omega$, $x \in \mathcal{X}$, and that $\frac{|1 - f_{xy}(\rho z)|}{\rho^2} \le M$ for all $\rho \in [0, \infty)$ and all $z$ with $z^\top z = 1$. Then for the Quadrature-based Features approximation $\hat k(x, y)$ of the kernel function $k(x, y)$ and any $\varepsilon > 0$ it holds that

$$\mathbb{P}\!\left(\sup_{x,y\in\mathcal{X}} |\hat k(x, y) - k(x, y)| \ge \varepsilon\right) \le \beta_d \left(\frac{\sigma_p l \kappa\mu}{\varepsilon}\right)^{\frac{d}{d+1}} \exp\!\left(-\frac{D\varepsilon^2}{8M^2(d + 1)}\right),$$

where $\beta_d = \left(d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}}\right) 2^{\frac{6d+1}{d+1}} \left(\frac{d}{d+1}\right)^{\frac{d}{d+1}}$. Thus we can construct an approximation with error no more than $\varepsilon$ with probability at least $1 - \delta$ as long as

$$D \ge \frac{8M^2(d + 1)}{\varepsilon^2}\left[\frac{d}{d+1}\log\frac{\sigma_p l \kappa\mu}{\varepsilon} + \log\frac{\beta_d}{\delta}\right].$$

The proof of this proposition closely follows [33]; details can be found in the Supplementary Materials. The term $\beta_d$ depends on the dimension $d$; its maximum is $\beta_{86} \approx 64.7 < 65$ and $\lim_{d\to\infty} \beta_d = 64$, though it is lower for small $d$.

Let us compare this probability bound with the similar result for RFF in [33]. Under the same conditions, the number of samples required by RFF to achieve error no more than $\varepsilon$ with probability at least $1 - \delta$ has the same form, but with larger constants. For Quadrature-based Features for the RBF kernel $M = \frac12$ and $\kappa = \mu = 1$; therefore we obtain

$$D \ge \frac{2(d + 1)}{\varepsilon^2}\left[\frac{d}{d+1}\log\frac{\sigma_p l}{\varepsilon} + \log\frac{\beta_d}{\delta}\right].$$

The asymptotics is the same; however, the constants are smaller for our approach. See Section 4 for an empirical justification of the obtained result.

**Proposition 3.2** ([33]). Given a training set $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, let $h(x)$ denote the result of kernel ridge regression using the positive semi-definite training kernel matrix $K$, test kernel values $k_x$ and regularization parameter $\lambda$. Let $\hat h(x)$ be the same using a PSD approximation $\hat K$ to the training kernel matrix and test kernel values $\hat k_x$. Further, assume that the training labels are centered, $\sum_{i=1}^n y_i = 0$, and let $\sigma_y^2 = \frac{1}{n}\sum_{i=1}^n y_i^2$. Also suppose $\|k_x\|_\infty \le \kappa$. Then

$$|\hat h(x) - h(x)| \le \frac{\sigma_y \sqrt{n}}{\lambda}\,\|\hat k_x - k_x\|_2 + \frac{\kappa\sigma_y n}{\lambda^2}\,\|\hat K - K\|_2.$$

Suppose that $\sup_{x,x'} |k(x, x') - \hat k(x, x')| \le \varepsilon$ for all $x, x' \in \mathbb{R}^d$. Then $\|\hat k_x - k_x\|_2 \le \sqrt{n}\,\varepsilon$ and $\|\hat K - K\|_2 \le \|\hat K - K\|_F \le n\varepsilon$. Denoting $\lambda = n\lambda_0$, we obtain $|\hat h(x) - h(x)| \le \frac{\lambda_0 + 1}{\lambda_0^2}\,\sigma_y\varepsilon$. Therefore,

$$\mathbb{P}\!\left(|\hat h(x) - h(x)| \ge \varepsilon\right) \le \mathbb{P}\!\left(\sup_{x,x'}|\hat k(x, x') - k(x, x')| \ge \frac{\lambda_0^2\varepsilon}{\sigma_y(\lambda_0 + 1)}\right).$$

So, for the quadrature rules we can guarantee $|\hat h(x) - h(x)| \le \varepsilon$ with probability at least $1 - \delta$ as long as

$$D \ge \frac{8M^2(d + 1)\sigma_y^2(\lambda_0 + 1)^2}{\lambda_0^4\varepsilon^2}\left[\frac{d}{d+1}\log\frac{\sigma_y\sigma_p l\kappa\mu(\lambda_0 + 1)}{\lambda_0^2\varepsilon} + \log\frac{\beta_d}{\delta}\right].$$

## 4 Experiments

We extensively study the proposed method on several established benchmark datasets: Powerplant, LETTER, USPS, MNIST, CIFAR100 [23] and LEUKEMIA [20]. In Section 4.2 we report the kernel approximation error across different kernels and numbers of features. We also report the quality of SVM models with approximate kernels on the same datasets in Section 4.3.

Table 1: Space and time complexity.

| Method | Space | Time |
| --- | --- | --- |
| ORF | O(Dd) | O(Dd) |
| QMC | O(Dd) | O(Dd) |
| ROM | O(d) | O(d log d) |
| Quadrature-based | O(d) | O(d log d) |

Table 2: Experimental settings for the datasets.

| Dataset | N | d | #samples | #runs |
| --- | --- | --- | --- | --- |
| Powerplant | 9568 | 4 | 550 | 500 |
| LETTER | 20000 | 16 | 550 | 500 |
| USPS | 9298 | 256 | 550 | 500 |
| MNIST | 70000 | 784 | 550 | 100 |
| CIFAR100 | 60000 | 3072 | 50 | 50 |
| LEUKEMIA | 72 | 7129 | 10 | 10 |

[Figure 1: Kernel approximation error across three kernels (Arc-cosine 0, Arc-cosine 1, Gaussian) and 6 datasets; compared methods: G, Gort, ROM, QMC, GQ, B. Lower is better. The x-axis represents the factor by which we extend the original feature space, $n = \frac{D}{2(d+1)+1}$, where $d$ is the dimensionality of the original feature space and $D$ is the dimensionality of the new feature space.]

### 4.1 Methods

We present a comparison of our method (B) with estimators based on simple Monte Carlo, quasi-Monte Carlo [35] and Gaussian quadratures [11]. The Monte Carlo approach has a variety of ways to generate samples: unstructured Gaussian [29], structured Gaussian [14], random orthogonal matrices (ROM) [10].

Monte Carlo integration (G, Gort, ROM). The kernel is estimated as $\hat k(x, y) = \frac{1}{D}\phi(Mx)^\top\phi(My)$, where $M \in \mathbb{R}^{D\times d}$ is a random weight matrix. For the unstructured Gaussian based approximation $M = G$, where $G_{ij} \sim \mathcal{N}(0, 1)$. The structured Gaussian has $M = G_{\mathrm{ort}}$, where $G_{\mathrm{ort}} = DQ$, $Q$ is obtained from the QR decomposition of $G$, and $D$ is a diagonal matrix with diagonal elements sampled from the $\chi(d)$ distribution. In compliance with the previous work on ROM we use S-Rademacher with three blocks: $M = \prod_{i=1}^{3} S D_i$, where $S$ is a normalized Hadamard matrix and $\mathbb{P}((D_i)_{jj} = \pm 1) = 1/2$.

Quasi-Monte Carlo integration (QMC). Quasi-Monte Carlo integration boasts an improved convergence rate of $1/D$ compared to $1/\sqrt{D}$ for Monte Carlo; however, as the empirical results illustrate, its performance is poorer than that of orthogonal random features [14]. It also has a larger constant factor hidden under the O notation in its computational complexity. For QMC the weight matrix $M$ is generated as a transformation of quasi-random sequences. We run our experiments with Halton sequences, in compliance with previous work.

Gaussian quadratures (GQ). We included the subsampled dense grid method from [11] in our comparison, as it is the only data-independent approach from that paper that is shown to work well. We reimplemented the code for the paper to the best of our ability, as it is not open sourced.

### 4.2 Kernel approximation

To measure kernel approximation quality we use the relative error in the Frobenius norm $\frac{\|K - \hat K\|_F}{\|K\|_F}$, where $K$ and $\hat K$ denote the exact kernel matrix and its approximation. In line with previous work we run the kernel approximation experiments on a random subset of each dataset. Table 2 displays the settings for the experiments across the datasets. The approximation was constructed for different numbers of SR samples $n = \frac{D}{2(d+1)+1}$, where $d$ is the original feature space dimensionality and $D$ is the new one.
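For reference, the error metric of Section 4.2 and a toy comparison can be written as follows, reusing the `rff_features` and `sr33_features` sketches from above. The synthetic data and the unit Gaussian bandwidth here are purely illustrative; the paper's experiments use the datasets of Table 2 and $\gamma = 1/d$.

```python
import numpy as np

def relative_error(K, K_hat):
    """Relative kernel approximation error ||K - K_hat||_F / ||K||_F (Section 4.2)."""
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))                               # stand-in for a dataset subset
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))         # exact Gaussian kernel
d = X.shape[1]
for n in (1, 3, 5):                                              # number of SR rules
    D = n * (2 * (d + 1) + 1)                                    # RFF budget matching the SR feature count
    Z_rff = rff_features(X, D, rng)
    Z_sr = sr33_features(X, n, rng)
    print(f"n={n}: RFF {relative_error(K, Z_rff @ Z_rff.T):.4f}  "
          f"SR(3,3) {relative_error(K, Z_sr @ Z_sr.T):.4f}")
```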
For the Gaussian kernel we set the hyperparameter $\gamma = \frac{1}{2\sigma^2}$ to the default value of $\frac{1}{d}$ for all the approximants, while the arc-cosine kernels (see the definition of the arc-cosine kernel in the Supplementary Materials) have no hyperparameters. We run experiments for each [kernel, dataset, n] tuple and plot the 95% confidence interval around the mean value line.

Figure 1 shows the results for the kernel approximation error on the LETTER, MNIST, CIFAR100 and LEUKEMIA datasets. The QMC method almost always coincides with RFF, except for the arc-cosine 0 kernel. It particularly enjoys the Powerplant dataset with $d = 4$, i.e. a small number of features. A possible explanation for such behaviour is the connection with QMC quadratures: the worst-case error for QMC quadratures scales as $n^{-1}(\log n)^d$, where $d$ is the dimensionality and $n$ is the number of sample points [28]. It is also worth mentioning that for large $d$ it is a problem to construct a proper QMC point set. Thus, in higher dimensions QMC may bring little practical advantage over MC. While recent randomized QMC techniques indeed in some cases have no dependence on $d$, our approach is still computationally more efficient thanks to the structured matrices. The GQ method matches the performance of RFF as well. We omit both QMC and GQ from the experiments on the datasets with large $d = [3072, 7129]$ (CIFAR100, LEUKEMIA).

The empirical results in Figure 1 support our hypothesis about the advantages of SR quadratures applied to kernel approximation compared to the state-of-the-art methods. With the exception of a couple of cases, (Arc-cosine 0, Powerplant) and (Gaussian, USPS), our method displays clearly superior performance.

### 4.3 Classification/regression with new features

[Figure 2: Accuracy/R² score using embeddings with three kernels (Arc-cosine 0, Arc-cosine 1, Gaussian) on 3 datasets; compared methods: exact, Gort, ROM, G, B. Higher is better. The x-axis represents the factor by which we extend the original feature space, $n = \frac{D}{2(d+1)+1}$.]

We report accuracy and R² scores for the classification/regression tasks on some of the datasets (Figure 2). We examine the performance with the same setting as in the experiments for the kernel approximation error, except that now we map the whole dataset. We use Support Vector Machines to obtain predictions. Kernel approximation error does not fully determine the final prediction accuracy: the best performing kernel matrix approximant does not necessarily yield the best accuracy or R² score. However, the empirical results illustrate that our method delivers comparable and often superior quality on the downstream tasks.

### 4.4 Walltime experiment

We measure the time spent on the explicit mapping of features by running each experiment 50 times and averaging the measurements. Figure 3 demonstrates that the method scales as theoretically predicted with larger dimensions, thanks to the structured nature of the mapping.

[Figure 3: Time spent on explicit mapping (G, Gort, GQ, B). The x-axis represents the 5 datasets with increasing input number of features: LETTER, USPS, MNIST, CIFAR100 and LEUKEMIA.]
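As a minimal end-to-end illustration of the downstream setup in Section 4.3 (not the paper's exact pipeline: the dataset, scaling and SVM hyperparameters below are our own choices), the explicit features can be fed to a plain linear SVM; resetting the generator with the same seed makes train and test share the same quadrature points.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X = X / 16.0                                                        # crude scaling to a comparable range
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

Ztr = sr33_features(Xtr, n_rules=3, rng=np.random.default_rng(1))   # reuse the sketch from Section 2.3
Zte = sr33_features(Xte, n_rules=3, rng=np.random.default_rng(1))   # same seed -> same Q_k and rho_k

clf = LinearSVC(C=1.0).fit(Ztr, ytr)                                # linear model on the explicit feature map
print("test accuracy:", clf.score(Zte, yte))
```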
## 5 Related work

The most popular methods for scaling up kernel methods are based on a low-rank approximation of the kernel using either data-dependent or data-independent basis functions. The first group includes the Nyström method [12], greedy basis selection techniques [31] and incomplete Cholesky decomposition [15]. The construction of basis functions in these techniques utilizes the given training set, making them more attractive for some problems compared to the Random Fourier Features approach. In general, data-dependent approaches perform better than data-independent approaches when there is a gap in the eigen-spectrum of the kernel matrix. A rigorous study of the generalization performance of both approaches can be found in [36].

In data-independent techniques, the kernel function is approximated directly. Most of the methods that follow this idea (including the proposed approach) are based on Random Fourier Features [29]. They require a so-called weight matrix that can be generated in a number of ways. [24] forms the weight matrix as a product of structured matrices. This enables fast computation of matrix-vector products and speeds up the generation of random features. Another work [14] orthogonalizes the features by means of an orthogonal weight matrix. This leads to less correlated and more informative features, increasing the quality of approximation. They support this result both analytically and empirically. The authors also introduce matrices with special structure for fast computations. [10] proposes a generalization of the ideas from [24] and [14], delivering an analytical estimate for the mean squared error (MSE) of the approximation. All these works use simple Monte Carlo sampling.

However, the convergence can be improved by changing Monte Carlo sampling to quasi-Monte Carlo sampling. Following this idea, [35] applies quasi-Monte Carlo to Random Fourier Features. In [37] the authors make an attempt to improve the quality of the Random Fourier Features approximation by optimizing sequences conditioned on a given dataset.

Among the recent papers there are works that, similarly to our approach, use numerical integration methods to approximate kernels. While [3] carefully inspects the connection between random features and quadratures, it does not provide any practically useful explicit mappings for kernels. Leveraging this connection, [11] proposes several methods based on Gaussian quadratures. Among them three schemes are data-independent and one is data-dependent. The authors do not compare them with approaches for random feature generation other than Random Fourier Features. The data-dependent scheme optimizes the weights for the quadrature points to yield better performance. A closely related work [25] constructs features for kernel approximation by approximating a spherical-radial integral and designs QMC points to speed up the approximation and reduce memory.

## 6 Conclusion

We propose an approach to random features methods for kernel approximation that reveals a new interpretation of RFF and ORF: they are special cases of spherical-radial quadrature rules with degrees (1, 1) and (1, 3), respectively. We take this further and develop a more accurate technique for random features that preserves the time and space complexity of random orthogonal embeddings. Our experimental study confirms that for many kernels on most datasets the proposed approach delivers the best kernel approximation.
Additionally, the results showed that the quality on the downstream tasks (classification/regression) is also superior or comparable to the state-of-the-art baselines.

## Acknowledgments

This work was supported by the Ministry of Science and Education of the Russian Federation as a part of Mega Grant Research Project 14.756.31.0001.

## References

[1] Theodore W Anderson, Ingram Olkin, and Les G Underhill. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing, 8(4):625-629, 1987.
[2] Haim Avron and Vikas Sindhwani. High-performance kernel machines with implicit distributed optimization and randomization. Technometrics, 58(3):341-349, 2016.
[3] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1-38, 2017.
[4] John A Baker. Integration over spheres and the divergence theorem for balls. The American Mathematical Monthly, 104(1):36-47, 1997.
[5] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pages 115-123, 2013.
[6] Salomon Bochner. Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische Annalen, 108(1):378-410, 1933.
[7] Xixian Chen, Haiqin Yang, Irwin King, and Michael R Lyu. Training-efficient feature map for shift-invariant kernels. In IJCAI, pages 3395-3401, 2015.
[8] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342-350, 2009.
[9] Krzysztof Choromanski and Vikas Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. arXiv preprint arXiv:1605.09049, 2016.
[10] Krzysztof Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of random orthogonal embeddings. arXiv preprint arXiv:1703.00864, 2017.
[11] Tri Dao, Christopher M De Sa, and Christopher Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems, pages 6109-6119, 2017.
[12] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153-2175, 2005.
[13] Kai-Tai Fang and Run-Ze Li. Some methods for generating both an NT-net and the uniform distribution on a Stiefel manifold and their applications. Computational Statistics & Data Analysis, 24(1):29-46, 1997.
[14] X Yu Felix, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975-1983, 2016.
[15] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243-264, 2001.
[16] Alexander Forrester, Andy Keane, et al. Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley & Sons, 2008.
[17] Alan Genz. Methods for generating random orthogonal matrices. Monte Carlo and Quasi-Monte Carlo Methods, pages 199-213, 1998.
[18] Alan Genz and John Monahan. Stochastic integration rules for infinite regions. SIAM Journal on Scientific Computing, 19(2):426-439, 1998.
[19] Alan Genz and John Monahan. A stochastic algorithm for high-dimensional integrals over unbounded regions with Gaussian weight. Journal of Computational and Applied Mathematics, 112(1):71-81, 1999.
[20] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, 1999.
[21] Simon Haykin. Cognitive Dynamic Systems: Perception-Action Cycle, Radar and Radio. Cambridge University Press, 2012.
[22] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 205-209. IEEE, 2014.
[23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[24] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.
[25] Yueming Lyu. Spherical structured feature maps for kernel approximation. In International Conference on Machine Learning, pages 2256-2264, 2017.
[26] Francesco Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.
[27] John Monahan and Alan Genz. Spherical-radial integration rules for Bayesian computation. Journal of the American Statistical Association, 92(438):664-674, 1997.
[28] Art B Owen. Latin supercube sampling for very high-dimensional simulations. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):71-102, 1998.
[29] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.
[30] Walter Rudin. Fourier Analysis on Groups. Courier Dover Publications, 2017.
[31] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
[32] G. W. Stewart. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403-409, 1980. ISSN 00361429. URL http://www.jstor.org/stable/2156882.
[33] Dougal J Sutherland and Jeff Schneider. On the error of random Fourier features. arXiv preprint arXiv:1506.02785, 2015.
[34] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295-301, 1997.
[35] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 485-493, 2014.
[36] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476-484, 2012.
[37] Felix X Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.