# Data-driven Random Fourier Features using Stein Effect

Wei-Cheng Chang (LTI, CMU) wchang2@cs.cmu.edu; Chun-Liang Li (MLD, CMU) chunlial@cs.cmu.edu; Yiming Yang (LTI, CMU) yiming@cs.cmu.edu; Barnabás Póczos (MLD, CMU) bapoczos@cs.cmu.edu

Abstract

Large-scale kernel approximation is an important problem in machine learning research. Approaches using random Fourier features have become increasingly popular, where kernel approximation is treated as empirical mean estimation via Monte Carlo (MC) or Quasi-Monte Carlo (QMC) integration. A limitation of the current approaches is that all the features receive an equal weight summing to 1. In this paper, we propose a novel shrinkage estimator based on the Stein effect, which provides a data-driven weighting strategy for random features and enjoys theoretical justifications in terms of lowering the empirical risk. We further present an efficient randomized algorithm for large-scale applications of the proposed method. Our empirical results on six benchmark data sets demonstrate the advantageous performance of this approach over representative baselines in both kernel approximation and supervised learning tasks.

1 Introduction

Kernel methods offer a comprehensive collection of non-parametric techniques for modeling a broad range of problems in machine learning. However, standard kernel machines such as kernel SVM or kernel ridge regression suffer from slow training and prediction, which limits their use in real-world large-scale applications. Consider, for example, training linear regression over $n$ labeled data points $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ with $n \gg d$ and $y_i$ is a binary label. The time complexity of a standard least-squares fit is $O(nd^2)$ and the memory demand is $O(nd)$. Its kernel-based counterpart requires solving a linear system with an $n$-by-$n$ kernel matrix, which typically takes $O(n^3 + n^2 d)$ time and $O(n^2)$ memory. Such complexity is far from desirable in big-data applications.

In recent years, significant effort has been devoted to low-rank kernel approximation for enhancing the scalability of kernel methods. Among existing approaches, random features [Rahimi and Recht, 2007; Rahimi and Recht, 2009; Le et al., 2013; Bach, 2015] have drawn considerable attention and yielded state-of-the-art classification accuracy on benchmark data sets [Huang et al., 2014; Dai et al., 2014; Li and Póczos, 2016]. Specifically, inspired by Bochner's theorem [Rudin, 2011], random Fourier features have been studied for evaluating the expectation of shift-invariant kernels (i.e., $k(x, x') = g(x - x')$ for some function $g$). Rahimi and Recht [2007] proposed to use Monte Carlo (MC) methods to estimate the expectation; Yang et al. [2014] leveraged the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce integration errors.

Both the MC and QMC methods rely on the (implicit) assumption that all $M$ random features are equally important, and hence assign a uniform weight $1/M$ to the features in kernel estimation. Such a treatment, however, is arguably sub-optimal for minimizing the expected risk in kernel approximation. Avron et al. [2016b] presented a weighting strategy for minimizing a loose error bound which depends on the maximum norm of the data points. On the other hand, Bayesian Quadrature (BQ) is a powerful framework that approximates an integral by $\mathbb{E}[f(x)] \approx \sum_m \beta_m f(x_m)$, where $f(x_m)$ is the feature function evaluated at $x_m$ and the $\beta_m$ are non-uniform weights.
BQ leverages a Gaussian Process (GP) to model the prior over the function $f(\cdot)$, and the resulting estimator is Bayes optimal [Huszár and Duvenaud, 2012]. Thus, BQ could be considered a natural choice for kernel approximation. Nonetheless, a well-known limitation (or potential weakness) is that the performance of Bayesian models relies heavily on extensive hyper-parameter tuning, e.g., on the conditioning of the covariance matrix in the Gaussian prior, which makes them less suitable for large real-world applications.

To address the fundamental limitations of the aforementioned work, we propose a novel approach to data-driven feature weighting in the approximation of shift-invariant kernels, motivated by the Stein effect in the statistics literature (Section 4), and solve it with an efficient stochastic algorithm over a convex optimization objective. We also present a natural extension of BQ to kernel approximation (Section 3). The adapted BQ, together with the standard MC and QMC methods, serves as a representative baseline in our empirical evaluations on six benchmark data sets. The empirical results (Section 5) show that the proposed Stein Effect Shrinkage (SES) estimator consistently outperforms the baseline methods in terms of lowering the kernel approximation errors, and is competitive with or better than the best-performing baseline in most task-oriented evaluations (supervised regression and classification tasks).

2 Preliminaries

Let us briefly outline the related background and key concepts in random-feature based kernel approximation.

2.1 Reproducing Kernel Hilbert Space

At the heart of kernel methods lies the idea that inner products in high-dimensional feature spaces can be computed in an implicit form via a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, defined on an input data domain $\mathcal{X} \subset \mathbb{R}^d$, such that

$$k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}},$$

where $\phi : \mathcal{X} \to \mathcal{H}$ is a feature map that associates the kernel $k$ with an embedding of the input space into a high-dimensional Hilbert space $\mathcal{H}$. A key property of kernel methods is that as long as a kernel algorithm has access to $k$, one need not represent $\phi(x)$ explicitly. Most often, this means that although $\phi(x)$ can be high-dimensional or even infinite-dimensional, the inner products of feature maps can still be evaluated through $k$. This idea is known as the kernel trick.

The bottleneck of kernel methods lies in both the training and the prediction phase. The former requires computing and storing the kernel matrix $K \in \mathbb{R}^{n \times n}$ for $n$ data points, while the latter needs to evaluate the decision function $f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x)$ via the kernel trick, which sums over the nonzero support vectors $\alpha_i$. Unfortunately, Steinwart and Christmann [2008] show that the number of nonzero support vectors grows linearly in the size of the training data $n$. This poses a great challenge for designing scalable kernel methods for large-scale applications. Traditional ways to address this difficulty make trade-offs between time and space [Kivinen et al., 2004; Chang and Lin, 2011].
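As a minimal illustration of this scalability issue (not part of the original paper), the sketch below computes the exact Gaussian kernel matrix and solves kernel ridge regression; the $n \times n$ matrix and the $O(n^3)$ solve are exactly the costs that random features aim to avoid. The helper names and the ridge parameter `lam` are illustrative choices.

```python
import numpy as np

def gaussian_kernel_matrix(X, Y, sigma=1.0):
    """Exact Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def kernel_ridge_fit(X, y, sigma=1.0, lam=1e-3):
    """Exact kernel ridge regression: requires O(n^2) memory for K and O(n^3) time to solve."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, X, sigma)              # n x n kernel matrix
    return np.linalg.solve(K + lam * n * np.eye(n), y)   # dual coefficients alpha

def kernel_ridge_predict(X_train, X_test, alpha, sigma=1.0):
    """Prediction f(x) = sum_i alpha_i k(x_i, x) via the kernel trick."""
    return gaussian_kernel_matrix(X_test, X_train, sigma) @ alpha
```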
2.2 Random Features (RF)

Rahimi and Recht [2007] propose a low-dimensional feature map $z(x): \mathcal{X} \to \mathbb{R}^{2M}$ for the kernel function $k$,

$$k(x, x') \approx \langle z(x), z(x') \rangle, \qquad (1)$$

under the assumption that $k$ falls in the family of shift-invariant kernels. The starting point is a celebrated result that characterizes the class of positive definite functions:

Theorem 1 (Bochner's theorem [Rudin, 2011]). A continuous, real-valued, symmetric and shift-invariant function $k$ on $\mathbb{R}^d$ is a positive definite kernel if and only if there is a positive finite measure $\mathbb{P}(w)$ such that

$$k(x - x') = \int_{\mathbb{R}^d} \big[\cos(w^\top x)\cos(w^\top x') + \sin(w^\top x)\sin(w^\top x')\big]\, d\mathbb{P}(w). \qquad (2)$$

The most well-known kernel belonging to the shift-invariant family is the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$; the associated density $\mathbb{P}(w)$ is again Gaussian, $\mathcal{N}(0, \sigma^{-2} I_d)$. Rahimi and Recht [2007] approximate the integral representation of the kernel (2) by the Monte Carlo method as follows:

$$k(x, x') \approx \frac{1}{M}\sum_{m=1}^{M} h_{x,x'}(w_m) = \langle z(x), z(x') \rangle, \qquad (3)$$

where

$$h_{x,x'}(w) := \cos(w^\top x)\cos(w^\top x') + \sin(w^\top x)\sin(w^\top x'),$$

$$z(x) := \sqrt{\tfrac{1}{M}}\, [\phi_{w_1}(x), \ldots, \phi_{w_M}(x)], \qquad \phi_w(x) := [\cos(w^\top x), \sin(w^\top x)], \qquad (4)$$

and $w_1, \ldots, w_M$ are i.i.d. samples from $\mathbb{P}(w)$.

After obtaining the randomized feature map $z$, the training data can be transformed into $\{(z(x_i), y_i)\}_{i=1}^n$. As long as $M$ is sufficiently smaller than $n$, this leads to more scalable solutions; e.g., for regression we get back to $O(nM^2)$ training and $O(Md)$ prediction time, with $O(nM)$ memory requirements. Random features can also be applied to unsupervised learning problems, such as kernel clustering [Chitta et al., 2012].

2.3 Quasi-Monte Carlo Technique

Instead of using plain MC approximation, Yang et al. [2014] propose to use the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce the integration error in approximations of the form (3). Due to space limitations, we restrict our discussion to the background necessary for understanding the subsequent sections, and refer interested readers to the comprehensive reviews [Caflisch, 1998; Dick et al., 2013] for a more detailed exposition.

The QMC method is generally applicable to integrals over a unit cube. The procedure is therefore to first generate a low-discrepancy sequence $t_1, \ldots, t_M \in [0, 1]^d$ and transform it into a sequence $w_1, \ldots, w_M$ in $\mathbb{R}^d$. To convert the integral representation of the kernel (2) into an integral over the unit cube, a simple change of variables suffices. For $t \in [0,1]^d$, define $\Phi^{-1}(t) = [\Phi_1^{-1}(t_1), \ldots, \Phi_d^{-1}(t_d)] \in \mathbb{R}^d$, where we assume that the density function in (2) can be written as $p(w) = \prod_{j=1}^{d} p_j(w_j)$ and $\Phi_j$ is the cumulative distribution function (CDF) of $p_j$, $j = 1, \ldots, d$. By setting $w = \Phi^{-1}(t)$, the integral (2) is equivalent to

$$k(x - x') = \int_{\mathbb{R}^d} h_{x,x'}(w)\, p(w)\, dw = \int_{[0,1]^d} h_{x,x'}(\Phi^{-1}(t))\, dt.$$
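To make the construction in (3)-(4) concrete, here is a minimal NumPy sketch of the un-weighted MC feature map for the Gaussian kernel. It is an illustration of the standard construction, not the implementation used in the paper; the function names are illustrative, and the QMC variant is only indicated in a comment.

```python
import numpy as np

def rff_map(X, W):
    """Un-weighted random Fourier feature map z(x) from Eq. (4).

    X : (n, d) data matrix, W : (M, d) frequency matrix.
    Returns Z of shape (n, 2M) with <Z[i], Z[j]> approximating k(x_i, x_j).
    """
    proj = X @ W.T                                    # (n, M) of w_m^T x
    M = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(M)

def gaussian_frequencies_mc(M, d, sigma, rng):
    """MC frequencies: i.i.d. samples from P(w) = N(0, sigma^{-2} I_d)."""
    return rng.normal(scale=1.0 / sigma, size=(M, d))

# Usage: approximate the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
sigma = 2.0
W = gaussian_frequencies_mc(M=512, d=5, sigma=sigma, rng=rng)
Z = rff_map(X, W)
K_approx = Z @ Z.T   # (1/M) sum_m [cos cos + sin sin] for every pair, as in Eq. (3)
# A QMC variant would replace gaussian_frequencies_mc with, e.g., norm.ppf applied to a
# scrambled low-discrepancy sequence (scipy.stats.qmc.Halton), per Section 2.3.
```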
3 Bayesian Quadrature for Random Features

Random features constructed by either the MC [Rahimi and Recht, 2007] or the QMC [Yang et al., 2014] approach place equal weights $1/M$ on the integrand functions, as shown in (3). An important question is how to construct non-uniform weights on the integrand functions to obtain a better kernel approximation.

Given a fixed QMC sequence, Avron et al. [2016b] solve for the weights by optimizing a derived error bound based on the observed data. There are two potential weaknesses of this approach. First, it does not fully utilize the data information, because the optimization objective depends only on an upper bound on $|x_i|$ rather than on the distribution of $x_i$. Second, the error bound used by Avron et al. [2016b] is potentially loose: for technical reasons, they relax $h_{x,x'}(w)$ to $h_{x,x'}(w)\,\mathrm{sinc}(Tw)$ when deriving the bound to optimize, and it has not been studied whether this relaxation yields a proper upper bound.

On the other hand, selecting weights in this way is closely connected to Bayesian Quadrature (BQ), originally proposed by Ghahramani and Rasmussen [2002] and theoretically justified under Bayesian assumptions. BQ is a Bayesian approach that puts prior knowledge on functions to solve the integral estimation problem. We first introduce the work of Ghahramani and Rasmussen [2002], and then discuss how to apply BQ to kernel approximation.

Standard BQ considers a single integration problem of the form

$$\bar{f} = \int f(w)\, d\mathbb{P}(w). \qquad (5)$$

To utilize the information of $f$, BQ puts a Gaussian Process (GP) prior on $f$ with mean $0$ and covariance function

$$K_{GP}(w, w') = \mathrm{Cov}(f(w), f(w')) = \exp\Big(-\frac{\|w - w'\|_2^2}{2\sigma_{GP}^2}\Big).$$

After conditioning on samples $w_1, \ldots, w_M$ from $\mathbb{P}$, we obtain a closed-form GP posterior over $f$ and use it to get the optimal Bayes estimator (for the square loss):

$$\int f(w)\, d\mathbb{P}(w) \approx \gamma^\top K_{GP}^{-1} f(W) = \sum_{m=1}^{M} \beta_{BQ}^{(m)} f(w_m),$$

where $\gamma_m = \int K_{GP}(w, w_m)\, d\mathbb{P}(w)$, $\beta_{BQ}^{(m)} = (K_{GP}^{-1}\gamma)_m$, and $f(W) = [f(w_1), \ldots, f(w_M)]^\top$. Clearly, the resulting estimator is simply a new weighted combination of the empirical observations $f(w_m)$, with weights $\beta_{BQ}^{(m)}$ derived from the Bayesian perspective.

Kernel approximation can be treated as a series of such single integration problems, so BQ can be applied to it directly. Here we discuss the technical issues of applying BQ to kernel approximation. First, we need to evaluate $\gamma$. For the Gaussian kernel, a closed-form expression is available,

$$\gamma_m = \int K_{GP}(w, w_m)\, d\mathbb{P}(w) \propto \exp\Big(-\frac{\|w_m - 0\|_2^2}{2(\sigma_{GP}^2 + \sigma^{-2})}\Big),$$

where $\gamma_m$ is a random variable evaluated at $w_m$.

3.1 Data-driven by Kernel Hyperparameters

The most important issue in applying BQ to kernel approximation is that the information of $h_{x,x'}$ is modeled through the kernel $K_{GP}$. In practice, we usually use a Gaussian kernel for $K_{GP}$, so the feature information of $(x, x')$ is modeled by the kernel bandwidth. As suggested by Rasmussen [2006], one could tune this hyper-parameter either by maximizing the marginal likelihood or by cross-validation based on the metric of interest. Note that the existing BQ approximates the integral representation of a kernel function value evaluated at a single pair of data points $(x, x')$. However, it is infeasible to tune an individual $\sigma_{GP}$ for $n^2$ pairs of data. Therefore, we tune a global hyper-parameter $\sigma_{GP}$ on a subset of data pairs.
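A minimal sketch of the adapted BQ baseline described above, assuming the Gaussian $K_{GP}$ and Gaussian $\mathbb{P}(w)$ used in this section. The closed-form $\gamma$ below is the standard Gaussian-Gaussian integral (written with its normalizing constant); the function names and the `jitter` term added for numerical stability are illustrative, not from the paper's implementation.

```python
import numpy as np

def bq_weights(W, sigma_gp, sigma, jitter=1e-8):
    """BQ weights beta = K_GP^{-1} gamma for frequencies W of shape (M, d),
    with GP covariance exp(-||w - w'||^2 / (2 sigma_gp^2)) and integration
    measure P(w) = N(0, sigma^{-2} I_d) (Gaussian kernel of bandwidth sigma)."""
    M, d = W.shape
    a = sigma_gp**2          # GP covariance length-scale squared
    b = 1.0 / sigma**2       # variance of P(w)

    sq = np.sum(W**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * W @ W.T
    K_gp = np.exp(-sq_dists / (2.0 * a))

    # gamma_m = int K_GP(w, w_m) dP(w): convolution of two Gaussians.
    gamma = (a / (a + b)) ** (d / 2.0) * np.exp(-sq / (2.0 * (a + b)))

    # Solve (K_GP + jitter I) beta = gamma rather than forming an explicit inverse.
    return np.linalg.solve(K_gp + jitter * np.eye(M), gamma)

def bq_kernel_estimate(X, W, beta):
    """Weighted estimate k(x_i, x_j) ~ sum_m beta_m h_{x_i,x_j}(w_m)."""
    proj = X @ W.T                                 # (n, M)
    C, S = np.cos(proj), np.sin(proj)
    return (C * beta) @ C.T + (S * beta) @ S.T     # sum_m beta_m (cos cos + sin sin)
```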
4 Stein Effect Shrinkage (SES) Estimator

In practice, it is well known that the performance of Gaussian Process based models depends heavily on the hyper-parameter $\sigma_{GP}$, i.e., the bandwidth of the kernel function $K_{GP}$. Therefore, instead of following the Bayesian route to obtain the optimal Bayes estimator from BQ, we start from a risk-minimization perspective and use the Stein effect to derive a shrinkage estimator for random features.

The standard Monte Carlo estimator is the unbiased empirical average using equal weights $1/M$ for the $M$ samples, which sum to 1. However, the Stein effect (or James-Stein effect) suggests that a family of biased estimators can achieve lower risk than the standard empirical average. Here we extend the analysis of the non-parametric Stein effect to kernel approximation with random features by considering the risk function

$$R(k, \tilde{k}) = \mathbb{E}_{x, x', w}\big(k(x - x') - \tilde{k}(x - x'; w)\big)^2.$$

Theorem 2 (Stein effect on kernel approximation). Let $\hat{k}(x - x'; w) = \frac{1}{M}\sum_{m=1}^{M} h_{x,x'}(w_m)$. For any estimator $\mu$ which is independent of $x$ and $w$, there exists $0 < \alpha < 1$ such that $\tilde{k}(\cdot) = \alpha\mu + (1 - \alpha)\hat{k}(\cdot)$ is an estimator with lower risk, $R(k, \tilde{k}) < R(k, \hat{k})$.

Proof. The proof closely follows Muandet et al. [2014], who study a different problem under the non-parametric setting (the original Stein effect carries a Gaussian distribution assumption). By the bias-variance decomposition, we have

$$R(k, \hat{k}) = \mathbb{E}_{x,x'}\,\mathrm{Var}_w\big(\hat{k}(x - x'; w)\big),$$

$$R(k, \tilde{k}) = (1 - \alpha)^2\, \mathbb{E}_{x,x'}\,\mathrm{Var}_w\big(\hat{k}(x - x'; w)\big) + \alpha^2\, \mathbb{E}_{x,x'}\big(\mu - k(x - x')\big)^2.$$

By elementary calculation, $R(k, \tilde{k}) < R(k, \hat{k})$ when

$$0 < \alpha < \frac{2\,\mathbb{E}_{x,x'}\,\mathrm{Var}_w(\hat{k}(x - x'; w))}{\mathbb{E}_{x,x'}\,\mathrm{Var}_w(\hat{k}(x - x'; w)) + \mathbb{E}_{x,x'}(\mu - k(x - x'))^2},$$

and the optimal value is attained at

$$\alpha^* = \frac{\mathbb{E}_{x,x'}\,\mathrm{Var}_w(\hat{k}(x - x'; w))}{\mathbb{E}_{x,x'}\,\mathrm{Var}_w(\hat{k}(x - x'; w)) + \mathbb{E}_{x,x'}(\mu - k(x - x'))^2} < 1.$$

Corollary 2.1 (Shrinkage estimator). If $k(x - x') > 0$, the shrinkage estimator $\tilde{k}(\cdot) = (1 - \alpha^*)\hat{k}(\cdot)$, obtained by setting $\mu = 0$, has lower risk than the empirical mean estimator $\hat{k}(\cdot)$.

The standard shrinkage estimator derived from the Stein effect uses the equal weights $\frac{1-\alpha}{M}$ to reduce the variance; however, non-uniform weights usually give better performance in practice [Huszár and Duvenaud, 2012]. Therefore, based on the analysis of Theorem 2, we propose the estimator $\tilde{k}(x - x'; w) = \sum_{m=1}^{M} \beta_m h_{x,x'}(w_m)$, obtained by minimizing the empirical risk under the constraint $\sum_{m=1}^{M} \beta_m^2 \le c$ to shrink the weights, where $c$ is a small constant. Via a Lagrange multiplier, we arrive at the following risk-minimization formulation, which is data-driven in that it directly takes $k(x, x')$ into account:

$$\min_{\beta}\ \sum_{x, x'} \Big( k(x, x') - [z(x) \circ z(x')]^\top \beta \Big)^2 + \lambda_\beta \|\beta\|^2, \qquad (6)$$

where $\circ$ denotes the Hadamard (element-wise) product and $\lambda_\beta$ is the regularization coefficient that shrinks the weights. The objective (6) can be further rewritten as a least squares regression (LSR) problem,

$$\min_{\beta \in \mathbb{R}^M}\ \|t - Z\beta\|^2 + \lambda_\beta\, \beta^\top \beta, \qquad (7)$$

where $t \in \mathbb{R}^{n^2}$ and $Z \in \mathbb{R}^{n^2 \times M}$ are built as follows: for all $i, j \in \mathcal{P} = \{1, \ldots, n\}$ there is a pair-index mapping $\xi : (i, j) \mapsto \kappa$ such that $t_\kappa := \phi(x_i)^\top \phi(x_j)$ and $Z(\kappa, :) := z(x_i) \circ z(x_j)$.

We approximately solve the optimization problem (7) via matrix sketching techniques [Avron et al., 2013; Avron et al., 2016a; Woodruff and others, 2014]:

$$\min_{\beta \in \mathbb{R}^M}\ \|St - SZ\beta\|^2 + \lambda_\beta\, \beta^\top \beta, \qquad (8)$$

where $S \in \mathbb{R}^{r \times n^2}$ is a randomized sketching matrix. Solving the sketched LSR problem (8) costs only $O(rM^2 + T_s)$, where $T_s$ is the time cost of sketching. From the optimization perspective, let $\beta^*$ be the optimal solution of the LSR problem (7) and $\tilde{\beta}$ the solution of the sketched LSR problem (8); Wang et al. [2017] prove that if $r = O(M/\epsilon + \mathrm{poly}(M))$, then the objective value of (8) at $\tilde{\beta}$ is at most $\epsilon$ worse than the value of (7) at $\beta^*$. In practice, we use an efficient sampling-based sketch with $T_s = O(r)$, where $S$ is a diagonal matrix with $S_{i,i} = 1/p_i$ if row $i$ is included in the sample and $S_{i,i} = 0$ otherwise; that is, we form a sub-matrix of $Z$ by including each row of $Z$ independently with probability $p_i$. Woodruff and others [2014] suggest setting $p_i$ proportional to $\|Z_i\|$, the norm of the $i$-th row of $Z$, to obtain a meaningful error bound.

Algorithm 1: Data-driven random features (SES)
1: procedure SES
2:   Compute the Fourier transform $p$ of the kernel
3:   Generate a sequence $w_1, \ldots, w_M$ by QMC or MC
4:   Compute the un-weighted random feature map $z(x)$ by (4)
5:   Calculate the shrinkage weights $\beta$ by solving (8)
6:   Compute the data-driven random features $z'(x) = \mathrm{diag}(\sqrt{\beta})\, z(x)$
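Putting Algorithm 1 together, the sketch below solves a sampled version of (7)-(8) with plain ridge regression. It reuses the `rff_map` helper from the earlier sketch, treats the weights as per-coordinate (one consistent reading of $z(x) \circ z(x')$ in (6)), and samples pairs uniformly rather than by row norm; these simplifications and the function names are illustrative, and this is not the MATLAB implementation used in the experiments.

```python
import numpy as np

def ses_weights(X, Z, sigma, r, lam, rng):
    """Shrinkage weights beta via a sampled (sketched) version of the ridge
    regression in (7)-(8).

    X : (n, d) data, Z : (n, 2M) un-weighted random features,
    sigma : kernel bandwidth, r : number of sampled pairs, lam : lambda_beta.
    """
    n = X.shape[0]
    # Sample r pairs (i, j); the paper suggests norm-proportional row sampling,
    # uniform sampling is used here only to keep the sketch short.
    I = rng.integers(0, n, size=r)
    J = rng.integers(0, n, size=r)

    # Targets: exact kernel values k(x_i, x_j) for the sampled pairs.
    t = np.exp(-np.sum((X[I] - X[J]) ** 2, axis=1) / (2.0 * sigma**2))

    # Design rows: element-wise products z(x_i) o z(x_j), as in Eq. (7).
    A = Z[I] * Z[J]                                   # (r, 2M)

    # Ridge regression: beta = (A^T A + lam I)^{-1} A^T t.
    D = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(D), A.T @ t)

def ses_kernel_estimate(Z, beta):
    """Weighted kernel approximation K_ij ~ sum_l beta_l z_l(x_i) z_l(x_j)."""
    return (Z * beta) @ Z.T

# Usage (continuing the earlier example):
# beta = ses_weights(X, Z, sigma, r=4 * W.shape[0], lam=1e-2, rng=rng)
# K_ses = ses_kernel_estimate(Z, beta)
```

When all learned weights are non-negative, the same estimate can be written through an explicit feature map $z'(x) = \mathrm{diag}(\sqrt{\beta})\, z(x)$, matching step 6 of Algorithm 1.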
5 Experiments

In this section, we empirically demonstrate the benefits of our proposed method on the six data sets listed in Table 1.

| Dataset | Task | n | d |
|---|---|---|---|
| CPU | regression | 6,554 | 21 |
| Census | regression | 18,186 | 119 |
| Years | regression | 463,715 | 90 |
| Adult | classification | 32,561 | 123 |
| MNIST | classification | 60,000 | 778 |
| Covtype | classification | 581,012 | 54 |

Table 1: Data statistics. n is the number of instances and d is the instance dimension.

The features in all six data sets are scaled to zero mean and unit variance in a preprocessing step. We consider two different tasks: (i) the quality of kernel approximation and (ii) the applicability to supervised learning. Both the relative kernel approximation errors and the regression/classification errors are reported on the test sets.

5.1 Methods for Comparison

- MC: Monte Carlo approximation with uniform weights.
- QMC: Quasi-Monte Carlo approximation with uniform weights. We adapt the implementation of Yang et al. [2014] from GitHub (https://github.com/chocjy/QMC-features).
- BQ: QMC sequence with Bayesian Quadrature weights. We modify the source code from Huszár and Duvenaud [2012] (http://www.cs.toronto.edu/~duvenaud/).

The Gaussian RBF kernel is used throughout the experiments, with the kernel bandwidth $\sigma$ tuned on the set $\{2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}\}$, a choice that favors the Monte Carlo methods. For the QMC method, we use the scrambling and shifting techniques recommended in the QMC literature (see Dick et al. [2013] for more details). In the BQ framework, we consider the GP prior with covariance function $K_{GP}(w, w') = \exp(-\|w - w'\|_2^2 / 2\sigma_{GP}^2)$, where $\sigma_{GP}$ is tuned on the set $\{2^{-8}, 2^{-6}, \ldots, 2^{6}, 2^{8}\}$ over a subset of sampled data pairs $(x, x')$. Likewise, the regularization coefficient $\lambda_\beta$ in the proposed objective (6) is tuned on the set $\{2^{-8}, 2^{-6}, \ldots, 2^{6}, 2^{8}\}$ over the same subset of sampled pairs. For the regression tasks, we solve ridge regression problems where the regularization coefficient $\lambda$ is tuned by five-fold cross-validation. For the classification tasks, we solve L2-loss SVM where the regularization coefficient $C$ is tuned by five-fold cross-validation. All experiment code is written in MATLAB and run on an Intel(R) Xeon(R) 2.40GHz Linux server with 32 cores and 190GB memory.

5.2 Quality of Kernel Approximation

In our setting, the most natural metric for comparison is the quality of the approximation of the kernel matrix. We use the relative kernel approximation error $\|K - \tilde{K}\|_F / \|K\|_F$ to measure the quality, as shown in Figure 1.

[Figure 1: Kernel approximation results on the six data sets. The x-axis is the number of samples used to approximate the integral; the y-axis shows the relative kernel approximation error.]

SES outperforms the other methods on cpu, census, adult, mnist, and covtype, and is competitive with the best baseline methods on the years data set. Notice that SES is particularly strong when the number of random features is relatively small. This phenomenon confirms our theoretical analysis: the fewer the random features, the larger the variance in general, and our shrinkage estimator with the Stein effect therefore enjoys better performance than the empirical mean estimators (MC and QMC), which use uniformly weighted features without shrinkage. Although BQ does use weighted random features, its performance is still not as good as SES, because tuning the covariance matrix in the Gaussian Process is rather difficult; moreover, it is even harder to tune a global bandwidth for all the integrand functions that approximate the kernel function.
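For reference, the evaluation metric itself is a one-liner; the function name is illustrative.

```python
import numpy as np

def relative_approx_error(K_exact, K_approx):
    """Relative kernel approximation error ||K - K~||_F / ||K||_F, as reported in Figure 1."""
    return np.linalg.norm(K_exact - K_approx, "fro") / np.linalg.norm(K_exact, "fro")

# Example: compare a random-feature estimate Z @ Z.T against the exact kernel matrix K.
# err = relative_approx_error(K, Z @ Z.T)
```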
5.3 Supervised Learning

For the regression tasks we report the relative regression error $\|y - \tilde{y}\|_2 / \|y\|_2$; for the classification tasks we report the classification error. We investigate whether better kernel approximations translate into better predictions in the supervised learning tasks. The results are shown in Table 2.

| Dataset | M | MC | QMC | BQ | SES |
|---|---|---|---|---|---|
| cpu | 512 | 3.35% | 3.29% | 3.29% | 3.27% |
| census | 256 | 8.46% | 6.60% | 6.44% | 6.49% |
| years | 128 | 0.474% | 0.473% | 0.473% | 0.473% |
| adult | 64 | 16.21% | 15.57% | 15.87% | 15.45% |
| mnist | 256 | 7.33% | 7.72% | 7.96% | 7.71% |
| covtype | 128 | 23.36% | 22.88% | 23.13% | 22.67% |

Table 2: Supervised learning errors. M is the number of random features. Error rates are reported; lower is better.

We can see that on the data sets where our method approximates the kernel well (cpu, census, adult, and covtype), it also yields lower errors in most cases of the supervised learning tasks. Note that we only present partial experimental results at selected numbers of random features; at some other numbers of random features not shown here, the generalization error of our proposed method may be higher than that of other state-of-the-art methods. This situation has also been observed in other work [Avron et al., 2016b]: better kernel approximation does not always result in better supervised learning performance. Under certain conditions, fewer features (i.e., a worse kernel approximation) can play a role similar to regularization and avoid overfitting [Rudi et al., 2016]. Rudi et al. [2016] suggest addressing this issue by tuning the number of random features.

5.4 Distributions of Feature Weights

We are interested in the distributions of the random feature weights derived from BQ and SES. For different numbers of random features (M = 32, 128, 512), we plot the corresponding histograms and compare them with the equal weight 1/M used by MC and QMC, as shown in Figure 2.

[Figure 2: Distributions of weights on the adult data set, where M is the number of random features. Panels include M = 128, M = 512, and the sum of weights versus M.]

When the number of random features is small, the difference between BQ and SES is large. As the number of random features increases, the weights of each method gradually converge to a stable distribution and move increasingly closer to the uniform-weight spike. This empirical observation agrees with our theoretical analysis, namely with the role of the weight estimator in the bias-variance trade-off.

5.5 Behavior of the Randomized Optimization Solver

In theory, our randomized solver only needs to sample O(M) rows in the proposed objective (8) to obtain a sufficiently accurate solution. We verify this assertion by varying the sampled data size r in the sketching matrix S and examining how it affects the relative kernel approximation error. The results are shown in Figure 3. For the census data set, we observe that once the number of sampled rows r exceeds the number of random features M, the kernel approximation of our proposed method outperforms the state-of-the-art QMC baseline. Furthermore, the number of sampled rows required by our randomized solver remains of the same order as M even for the much larger covtype data set, with roughly 0.5 million data points. On the other hand, BQ does not behave consistently with respect to the sampled data size, which is used for tuning the hyper-parameter $\sigma_{GP}$.
This again illustrates that BQ is highly sensitive to the covariance matrix and that it is difficult to tune $\sigma_{GP}$ on only a small subset of sampled data. In practice, we found that we usually only need to sample about 2 to 4 times the number of random features M to obtain a relatively stable and good minimizer $\tilde{\beta}$. This observation coincides with the complexity analysis of our randomized solver in Section 4.

[Figure 3: A comparison of different sampled training data sizes in solving objective function (8). Panels: M = 64 and M = 256 on the census data set; M = 64 and M = 256 on the covtype data set.]

5.6 Uniform vs. Non-uniform Shrinkage Weights

We conducted experiments with SES using a uniform shrinkage weight versus non-uniform shrinkage weights on the cpu and census data sets. The results in Figure 4 confirm our conjecture that non-uniform shrinkage weights are more beneficial, although both enjoy the lower empirical risk afforded by the Stein effect. We observed similar behavior on the other data sets and omit those results.

[Figure 4: Comparison of uniform and non-uniform shrinkage weights on the cpu and census data sets.]

6 Discussion and Conclusion

Nyström methods [Williams and Seeger, 2001; Drineas and Mahoney, 2005] are another line of research on kernel approximation. Specifically, an eigendecomposition of the kernel matrix is approximated via low-rank factorization with various sampling techniques [Wang and Zhang, 2013]. We refer interested readers to the comprehensive study of Yang et al. [2012] for more detailed comparisons. Due to page limits, the scope of this paper focuses only on random Fourier features. Finally, we note that Sinha and Duchi [2016] propose a weighting scheme for random features that matches the label correlation matrix, with the goal of optimizing the supervised objective (e.g., classification error). In contrast, our work aims to optimize the kernel approximation, which does not require label information to solve for the weights and can be applied to different tasks after solving for the weights once.

To conclude, we propose a novel shrinkage estimator which enjoys the benefits of the Stein effect in terms of lowering the empirical risk compared to empirical mean estimators. Our estimator induces non-uniform weights on the random features with desirable data-driven properties. We further provide an efficient randomized solver to optimize this estimation problem for real-world large-scale applications. Extensive experimental results demonstrate the effectiveness of the proposed method.

Acknowledgments

We thank the reviewers for their helpful comments. This work is supported in part by the National Science Foundation (NSF) under grants IIS-1546329 and IIS-1563887.

References

[Avron et al., 2013] Haim Avron, Vikas Sindhwani, and David Woodruff. Sketching structured matrices for faster nonlinear regression. In NIPS, 2013.

[Avron et al., 2016a] Haim Avron, Kenneth L. Clarkson, and David P. Woodruff. Faster kernel ridge regression using sketching and preconditioning. arXiv, 2016.

[Avron et al., 2016b] Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. JMLR, 2016.

[Bach, 2015] Francis Bach. On the equivalence between quadrature rules and random features. arXiv, 2015.

[Caflisch, 1998] Russel E. Caflisch. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 1998.
[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2011.

[Chitta et al., 2012] Radha Chitta, Rong Jin, and Anil K. Jain. Efficient kernel clustering using random Fourier features. In ICDM, 2012.

[Dai et al., 2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, 2014.

[Dick et al., 2013] Josef Dick, Frances Y. Kuo, and Ian H. Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 2013.

[Drineas and Mahoney, 2005] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 2005.

[Ghahramani and Rasmussen, 2002] Zoubin Ghahramani and Carl E. Rasmussen. Bayesian Monte Carlo. In NIPS, 2002.

[Huang et al., 2014] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In ICASSP, 2014.

[Huszár and Duvenaud, 2012] Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In UAI, 2012.

[Kivinen et al., 2004] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2004.

[Le et al., 2013] Quoc Le, Tamás Sarlós, and Alexander Smola. Fastfood: computing Hilbert space expansions in loglinear time. In ICML, 2013.

[Li and Póczos, 2016] Chun-Liang Li and Barnabás Póczos. Utilize old coordinates: Faster doubly stochastic gradients for kernel methods. In UAI, 2016.

[Muandet et al., 2014] K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Kernel mean estimation and Stein effect. In ICML, 2014.

[Rahimi and Recht, 2007] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.

[Rahimi and Recht, 2009] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2009.

[Rasmussen, 2006] Carl Edward Rasmussen. Gaussian Processes for Machine Learning. 2006.

[Rudi et al., 2016] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties of learning with random features. arXiv, 2016.

[Rudin, 2011] Walter Rudin. Fourier Analysis on Groups. John Wiley & Sons, 2011.

[Sinha and Duchi, 2016] Aman Sinha and John C. Duchi. Learning kernels with random features. In NIPS, 2016.

[Steinwart and Christmann, 2008] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[Wang and Zhang, 2013] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. JMLR, 2013.

[Wang et al., 2017] Shusen Wang, Alex Gittens, and Michael W. Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. arXiv, 2017.

[Williams and Seeger, 2001] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.

[Woodruff and others, 2014] David P. Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.

[Yang et al., 2012] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, 2012.
[Yang et al., 2014] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In ICML, 2014.