# Learning Adaptive Random Features

Yanjun Li (University of Illinois at Urbana-Champaign), Kai Zhang (Temple University), Jun Wang (East China Normal University), Sanjiv Kumar (Google Research)
yli145@illinois.edu, zhang.kai@temple.edu, jwang@sei.ecnu.edu.cn, sanjivk@google.com

## Abstract

Random Fourier features are a powerful framework for approximating shift-invariant kernels with Monte Carlo integration, and have drawn considerable interest in scaling up kernel-based learning, dimensionality reduction, and information retrieval. Many sampling schemes have been proposed in the literature to improve the approximation performance. However, an interesting theoretical and algorithmic challenge remains: how should random Fourier features be designed so that they achieve good kernel approximation on any input data with a low spectral sampling rate? In this paper, we propose to compute more adaptive random Fourier features with optimized spectral samples (the $w_j$'s) and feature weights (the $p_j$'s). The learning scheme not only significantly reduces the spectral sampling rate needed for accurate kernel approximation, but also allows joint optimization with any supervised learning framework. We establish generalization bounds using Rademacher complexity and demonstrate advantages over previous methods. Moreover, our experiments show that the empirical kernel approximation provides effective regularization for supervised learning.

## Introduction

Despite the immense popularity of kernel-based learning algorithms (Schölkopf and Smola 2002), the expensive evaluation of nonlinear kernels has prohibited their application to large datasets. Low-rank kernel approximation is a powerful tool for alleviating the memory and computational cost of large kernel machines (Williams and Seeger 2001; Drineas and Mahoney 2005; Fowlkes et al. 2004; Halko, Martinsson, and Tropp 2011; Mahoney 2011; Kumar, Mohri, and Talwalkar 2012; Si, Hsieh, and Dhillon 2014). These methods adopt various sampling schemes on the rows/columns of the kernel matrix to obtain an efficient low-rank decomposition, which in turn serves as a highly compact empirical kernel map that can reduce the cost of kernel machines from cubic to linear scale.

Random Fourier features, a highly innovative feature map pioneered by Rahimi and Recht (2008), have attracted significant interest and are the focus of our work. Rather than decomposing the kernel matrix directly, the method resorts to the Fourier transform of positive semidefinite, shift-invariant kernels and obtains explicit feature maps through a Monte Carlo approximation of the Fourier representation. The features can then be written as the cosine (or sine) of the inner product between the input data and random spectral samples drawn from a density specified by the characteristic function. For example, the characteristic function of the Gaussian kernel is itself Gaussian, so the spectral samples should follow a Gaussian distribution.

In recent years there has been a continuing effort to design optimal sampling schemes for computing Fourier features with accurate approximation. Le et al. (2013) proposed Fastfood, a feature map that is faster to compute thanks to a combination of diagonal Gaussian matrices and the Hadamard transform. Yang et al.
(2014) showed that integral approximation using a low-discrepancy quasi-Monte Carlo (QMC) sequence converges faster than random Monte Carlo samples and achieves lower error, especially on high-dimensional data. Shen et al. (2017) proposed to apply moment matching to the spectral samples so that their empirical distribution is closer to the intended Gaussian. Besides shift-invariant kernels, approximate feature maps have also been considered for other nonlinear kernels, such as additive kernels (Vedaldi and Zisserman 2012) and polynomial kernels via spherical sampling (Pennington, Felix, and Kumar 2015).

Despite these recent advances, open challenges remain. First, most existing works ignore the impact of the input data distribution on the design of the feature map; the main dependency considered is the relation between the kernel and its characteristic function. Second, better kernel approximation with random Fourier features may not always translate into better generalization performance. Despite reported improvements in regression/classification, it has been found that such improvements do not correlate well with improvements in the quality of the kernel features (Avron et al. 2016; Chang et al. 2017).

In this paper, we argue that a good Fourier feature map should adapt to the input data distribution. For example, the optimal sampling scheme in the spectral domain will ideally differ when approximating the kernel on data from different distributions, in order to achieve a desired accuracy with a low sampling rate. To illustrate this, we use importance sampling (Barber 2012) as a motivating example in approximating the kernel Fourier transform, and demonstrate that the optimal proposal density should adapt to the input distribution. Based on these observations, we propose a novel method to learn adaptive Fourier features, in which the spectral samples and their corresponding feature weights are optimized jointly by minimizing the empirical kernel approximation loss (EKAL). We also propose two approximators of the EKAL, so that an efficient iterative algorithm can be designed, reducing the overall time/space complexity to scale linearly with the input sample size and dimension.

We want to emphasize some fundamental differences between our method and existing randomized algorithms. Those methods compute a sampling probability to select useful matrix columns, often involving costly operations in the input space (e.g., SVD for leverage scores (Mahoney 2011), or computing the norms of all matrix columns (Kumar, Mohri, and Talwalkar 2012)). In comparison, we do not resort to any sampling strategy; instead we explicitly optimize the spectral basis and weights in the Fourier space. Besides, our approach is significantly cheaper, and the cost of obtaining the basis can be as low as sub-linear. Recently, Bach (2017) proposed to select spectral samples by computing a discrete probability distribution over a large number of reference points. Note that their goal is function approximation in the RKHS, while ours is kernel matrix approximation; moreover, sampling in a high-dimensional space can be quite challenging, while optimizing the spectral basis with an explicit objective function is computationally more tractable and more convenient.

We establish rigorous generalization bounds tailored to the minimizers of the EKAL approximators, using the Rademacher average and McDiarmid's inequality.
Unlike the loss function in a typical statistical learning problem, the EKAL contains a small set of pairwise kernel evaluations that are not all independent. We overcome this challenge by creating independent games within statistically dependent rounds, based on round-robin tournament scheduling. Our bounds can be easily translated into guarantees for supervised learning, using results that connect low-rank kernel approximation and learning accuracy (Cortes, Mohri, and Talwalkar 2010). These theoretical findings are complemented by numerical experiments, which show the clear advantage of our learned Fourier features over previous ones in kernel approximation. Indeed, our method is the first to fully exploit the input data to significantly improve the quality of Fourier features, bridging the gap between data-driven methods and fixed-basis methods. Besides effectively reducing the dimension of the kernel map (i.e., the spectral sampling rate) needed to achieve a desired accuracy, our method can also be incorporated into any supervised learning framework. In particular, we employ EKAL minimization as a regularizer when building linear prediction models with learned Fourier features, leading to improved generalization performance.

Our main contributions are: (1) a theoretical justification of learning data-driven Fourier features, using importance sampling as an example; (2) joint optimization of spectral samples and feature weights by minimizing two types of EKAL approximators, with generalization error bounds for both; (3) a hybrid loss combining unsupervised kernel regularization with the supervised prediction loss to improve classification/regression performance; and (4) extensive experimental results in both kernel approximation and supervised learning tasks.

## Random Fourier Features

### Monte Carlo Method with Uniform Weights

Given a shift-invariant kernel function $k$, we wish to construct a feature map $Z(x)$ whose pairwise inner products approximate the kernel: $k(x_1 - x_2) \approx \langle Z(x_1), Z(x_2) \rangle$. The kernel matrix $K$ defined on $\{x_\ell\}_{\ell=1}^N$, with $K_{st} = k(x_s - x_t)$, can then be approximated by the low-rank decomposition $ZZ^\top$, where the $s$-th row of $Z$ is $Z(x_s)$. Rahimi and Recht (2008) pioneered the use of the Fourier transform for this problem, noting that any PSD shift-invariant kernel $k$ can be reconstructed using the Fourier basis sampled under the probability density defined by the characteristic function of $k$:

$$k(x_1 - x_2) = \int_{\mathbb{R}^d} e^{i w^\top (x_1 - x_2)}\, p(w)\, dw. \qquad (1)$$

For the Gaussian kernel $k(x_1 - x_2) = \exp(-\|x_1 - x_2\|^2 / (2\sigma^2))$, the density $p(w)$ is the Gaussian PDF $N(0, \sigma^{-2} I)$. Suppose the feature map is defined as

$$Z(x) := \frac{1}{\sqrt{r}} \left[ e^{i w_1^\top x}, e^{i w_2^\top x}, \ldots, e^{i w_r^\top x} \right], \qquad (2)$$

where $\{w_j\}_{j=1}^r$ are $d$-dimensional spectral samples drawn independently from $p(w)$. It can then be observed that $\langle Z(x_1), Z(x_2) \rangle = \frac{1}{r} \sum_{j=1}^r e^{i w_j^\top (x_1 - x_2)}$, which is an unbiased estimator of the Fourier representation of the kernel (see Footnote 1), and such a Monte Carlo approximation asymptotically converges to the true integral (1) (Rahimi and Recht 2008).

[Footnote 1: Since the kernel is real, one can remove the imaginary parts of (1) and (2). The real Fourier representation and features are $k(x_1 - x_2) = \int_{\mathbb{R}^d} \cos\!\big(w^\top (x_1 - x_2)\big)\, p(w)\, dw$ and $Z(x) := \frac{1}{\sqrt{r}} \big[ \cos(w_1^\top x), \ldots, \cos(w_r^\top x), \sin(w_1^\top x), \ldots, \sin(w_r^\top x) \big]$.]
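To make the construction above concrete, the following minimal sketch (our illustration, not code from the paper) builds the real-valued Monte Carlo feature map of Footnote 1 for a Gaussian kernel and checks the quality of the resulting low-rank approximation $ZZ^\top$; the toy data, the feature dimension, and the helper name `rff_features` are our own choices.

```python
# A minimal sketch (our illustration) of standard Monte Carlo random Fourier features
# for the Gaussian kernel k(x1 - x2) = exp(-||x1 - x2||^2 / (2 sigma^2)),
# whose spectral density is N(0, sigma^-2 I).
import numpy as np

def rff_features(X, r, sigma, rng):
    """Real feature map Z(x) of dimension 2r (cosine and sine parts), scaled by 1/sqrt(r)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, r))   # spectral samples w_j ~ N(0, sigma^-2 I)
    P = X @ W                                        # projections w_j^T x
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(r)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # toy inputs (placeholder data)
sigma = np.sqrt(X.shape[1] / 2)                      # kernel width with 2*sigma^2 = d, as in the experiments
Z = rff_features(X, r=500, sigma=sigma, rng=rng)

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * sigma ** 2))       # exact Gaussian kernel matrix
K_approx = Z @ Z.T                                   # low-rank approximation Z Z^T
print("relative error:", np.linalg.norm(K_approx - K_exact) / np.linalg.norm(K_exact))
```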
### Importance Sampling with Non-Uniform Weights

The sampling probability $p(w)$ in the Monte Carlo method is the characteristic function of the kernel: for a given kernel, $p(w)$ is fixed regardless of the distribution of the input samples. However, given input samples $x_i$, we believe that their distribution $P(x)$ should also have an impact on the sampling probability $p(w)$. In particular, the optimal $p(w)$ should be adaptive to $P(x)$, so that the kernel matrix defined on data from $P(x)$ is approximated accurately with as few spectral samples $w_j$ as possible.

To see this, we consider importance sampling, which is widely used in Bayesian inference to reduce the variance of approximations. Since $k$ is real, one can rewrite the real part of the Fourier representation as

$$k(x_1 - x_2) = \int_{\mathbb{R}^d} \cos\!\big(w^\top (x_1 - x_2)\big)\, \frac{p(w)}{q(w)}\, q(w)\, dw.$$

Here, $q(w)$ is an importance-weighted proposal density. If the $w_j$'s are drawn from the distribution $q(w)$, then we can approximate the above integral with $\sum_{j=1}^r p_j \cos\!\big(w_j^\top (x_1 - x_2)\big)$, where $p_j = p(w_j)/(r\, q(w_j))$ satisfies $\mathbb{E}[p_j] = 1/r$. This finite-sample estimate is the inner product $\langle \tilde Z(x_1), \tilde Z(x_2) \rangle$ between the weighted Fourier features (see Footnote 2)

$$\tilde Z(x) := \big[ \sqrt{p_1} \cos(w_1^\top x), \ldots, \sqrt{p_r} \cos(w_r^\top x),\ \sqrt{p_1} \sin(w_1^\top x), \ldots, \sqrt{p_r} \sin(w_r^\top x) \big]. \qquad (3)$$

[Footnote 2: Uniformly weighted Fourier features $Z(x)$ are a special case of $\tilde Z(x)$: $q(w) = p(w)$ and hence $p_j = 1/r$.]

The optimal importance distribution minimizes the following expected error, which depends on the input distribution:

$$\min_{q(w)}\ \mathbb{E}_{x_1, x_2}\, \mathbb{E}_{w_j} \big( \langle \tilde Z(x_1), \tilde Z(x_2) \rangle - k(x_1 - x_2) \big)^2. \qquad (4)$$

For example, when $x_1 - x_2$ is drawn from a Gaussian distribution, we can prove the following.

Claim 1. If $k(x_1 - x_2) = \exp(-\|x_1 - x_2\|^2/(2\sigma^2))$ and $x_1 - x_2$ follows the Gaussian distribution $N(0, \sigma_0^2 I)$, then the optimal $q(w)$ satisfies $q(w) \propto \big( e^{-\sigma^2 \|w\|^2} + e^{-(\sigma^2 + 2\sigma_0^2) \|w\|^2} \big)^{1/2}$.

Claim 1 shows that the optimal $q(w)$ depends on the input distribution $P(x)$ (in particular on $\sigma_0$). It follows that, in the finite-sample estimate, the choice of the samples $w_j$ should also be data-dependent. Meanwhile, the corresponding feature weights $p_j = p(w_j)/(r\, q(w_j))$ are non-uniform and should depend on the input data distribution as well. In practice, the underlying input distribution $P(x)$ is unknown, so solving (4) for the optimal importance distribution $q(w)$ is intractable. Therefore, instead of designing the optimal $q(w)$, we propose to learn the $w_j$'s (spectral samples) and $p_j$'s (corresponding feature weights) directly by minimizing the finite-sample kernel approximation error.

## Adaptive Fourier Features

### Empirical Kernel Approximation Loss

In this section, we explain how the $w_j$'s and $p_j$'s in the weighted Fourier features (3) can be learned efficiently from data. Suppose the input data $x_\ell$ belong to a compact set $\mathcal{X}$. To simplify the notation, we define the function $f: \mathcal{X} - \mathcal{X} \to \mathbb{R}$, parametrized by $\{w_j, p_j\}_{j=1}^r$:

$$f(x_1 - x_2) := \sum_{j=1}^r p_j \cos\!\big(w_j^\top (x_1 - x_2)\big).$$

The goal of kernel approximation is to learn the optimal $w_j \in \mathcal{W}$ ($1 \le j \le r$) and $p \in \Delta_{r-1}$ (the standard simplex) that minimize the mean squared error $\mathbb{E}_{x_1, x_2}\big[ (k(x_1 - x_2) - f(x_1 - x_2))^2 \big]$. If we restrict our attention to approximating the kernel matrix on a dataset $\{x_\ell\}_{\ell=1}^N$, the loss function is the mean squared error over the empirical distribution of the dataset:

$$L(f) := \frac{1}{N^2} \sum_{s=1}^N \sum_{t=1}^N \big( f(x_s - x_t) - k(x_s - x_t) \big)^2.$$

In this objective, we intend to optimize both the $w_j$'s and their weights $p_j$'s. Such an optimization adapts the resulting Fourier features to the input data, making them more likely to reach the desired approximation accuracy with a minimal sampling rate in the spectral domain. Evaluating $k(x_s - x_t)$ over the whole sample is expensive and unnecessary. An important contribution of our work is to approximately evaluate the loss function $L(f)$ using a small number $n$ of landmark data points ($n \ll N$), similar to the idea of sketching for regression (Avron, Sindhwani, and Woodruff 2013; Woodruff et al. 2014).
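As a concrete reference for the quantities involved, the sketch below (our illustration; the helper names and toy data are ours) spells out the weighted feature map of Eq. (3) and the loss $L(f)$ for a Gaussian target kernel; with uniform weights $p_j = 1/r$ it reduces to the standard random Fourier features.

```python
# A sketch (our notation) of the weighted Fourier features in Eq. (3) and the
# empirical kernel approximation loss L(f) against a Gaussian kernel.
import numpy as np

def weighted_features(X, W, p):
    """Z~(x) with entries sqrt(p_j) cos(w_j^T x) and sqrt(p_j) sin(w_j^T x),
    so that <Z~(x1), Z~(x2)> = sum_j p_j cos(w_j^T (x1 - x2)) = f(x1 - x2)."""
    P = X @ W
    sqrt_p = np.sqrt(p)
    return np.hstack([np.cos(P) * sqrt_p, np.sin(P) * sqrt_p])

def ekal(X, W, p, sigma):
    """Mean squared error between f(x_s - x_t) and k(x_s - x_t) over all pairs of rows of X."""
    Z = weighted_features(X, W, p)
    F = Z @ Z.T                                                # F[s, t] = f(x_s - x_t)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))                   # target kernel k(x_s - x_t)
    return np.mean((F - K) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d, r = X.shape[1], 50
sigma = np.sqrt(d / 2)
W = rng.normal(scale=1.0 / sigma, size=(d, r))                 # Monte Carlo initialization
p = np.full(r, 1.0 / r)                                        # uniform weights: standard RFF
print("EKAL at initialization:", ekal(X, W, p, sigma))
```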
We use the following two strategies to choose landmark points and compute the empirical kernel approximation loss (EKAL):

- Random sampling: we randomly sample $n$ points from the dataset with equal probability and without replacement, i.e., choose $\{x_{\ell_1}, \ldots, x_{\ell_n}\} \subset \{x_1, \ldots, x_N\}$, and compute

$$L_s(f) := \frac{1}{n^2} \sum_{s=1}^n \sum_{t=1}^n \big( f(x_{\ell_s} - x_{\ell_t}) - k(x_{\ell_s} - x_{\ell_t}) \big)^2.$$

- K-means clustering: we choose the landmark points as the cluster centers of the k-means algorithm. Suppose the $s$-th cluster has $N_s$ members and centroid $x^c_s$ ($1 \le s \le n$, $\sum_{s=1}^n N_s = N$). The EKAL with clustering is

$$L_c(f) := \frac{1}{N^2} \sum_{s=1}^n \sum_{t=1}^n N_s N_t \big( f(x^c_s - x^c_t) - k(x^c_s - x^c_t) \big)^2.$$

### Iterative Algorithm

We write a unified objective function subsuming both the sampling-based loss $L_s(f)$ and the clustering-based loss $L_c(f)$:

$$\min_{w_j,\, p \ge 0}\ \sum_{s=1}^n \sum_{t=1}^n q_s^2 q_t^2 \Big( \sum_{j=1}^r p_j \cos\!\big(w_j^\top (x_s - x_t)\big) - k(x_s - x_t) \Big)^2 + \lambda \|p\|^2, \qquad (5)$$

where the $x_s$'s are the landmark points and the squared landmark weights are $q_s^2 = 1/n$ for $L_s(f)$ and $q_s^2 = N_s/N$ for $L_c(f)$. Our goal is to optimize the $w_j$'s and $p_j$'s by minimizing the EKAL with weight decay on $p = [p_1, p_2, \ldots, p_r] \ge 0$. Let $X = [x_1, x_2, \ldots, x_n]^\top \in \mathbb{R}^{n \times d}$ denote the landmark points, $W = [w_1, w_2, \ldots, w_r] \in \mathbb{R}^{d \times r}$ the spectral samples, $K \in \mathbb{R}^{n \times n}$ the kernel matrix on the landmark points, and $q = [q_1, q_2, \ldots, q_n]^\top \in \mathbb{R}^n$ the landmark weights. We use $A \circ B$ and $A^{\circ 2}$ to denote the entrywise product and the entrywise square, respectively. The objective (5) can then be written equivalently as

$$\min_{W,\, p \ge 0}\ \Big\| \big( \cos(XW)\, \mathrm{diag}(p)\, \cos(XW)^\top + \sin(XW)\, \mathrm{diag}(p)\, \sin(XW)^\top - K \big) \circ (q q^\top) \Big\|_F^2 + \lambda \|p\|^2.$$

The iterative algorithm is summarized in Algorithm 1.

**Solving for the feature weights $p$.** Define $C := \cos(XW)$ and $S := \sin(XW)$. With $W$ fixed, we solve the quadratic program (QP) $\min_{p \ge 0}\ p^\top A p - 2 b^\top p$, where

$$A = \big(C^\top \mathrm{diag}^2(q)\, C\big)^{\circ 2} + \big(C^\top \mathrm{diag}^2(q)\, S\big)^{\circ 2} + \big(S^\top \mathrm{diag}^2(q)\, C\big)^{\circ 2} + \big(S^\top \mathrm{diag}^2(q)\, S\big)^{\circ 2} + \lambda I,$$
$$b = \mathrm{diag}\big( C^\top \mathrm{diag}^2(q)\, K\, \mathrm{diag}^2(q)\, C \big) + \mathrm{diag}\big( S^\top \mathrm{diag}^2(q)\, K\, \mathrm{diag}^2(q)\, S \big).$$

This is a simple non-negative QP with an $r \times r$ Hessian.

**Solving for the spectral samples $W$.** When $p$ is fixed, one can solve for $W$ via gradient descent. Define $R := \big( C\, \mathrm{diag}(p)\, C^\top + S\, \mathrm{diag}(p)\, S^\top - K \big) \circ (q q^\top)$. The gradient of the loss $L$ with respect to $W$ can then be computed as

$$\nabla_C L = 2 \big(R \circ q q^\top\big)\, C\, \mathrm{diag}(p), \quad \nabla_S L = 2 \big(R \circ q q^\top\big)\, S\, \mathrm{diag}(p), \quad \nabla_W L = X^\top \big( -S \circ \nabla_C L + C \circ \nabla_S L \big). \qquad (6)$$

Algorithm 1: Learning Fourier Features
Input: $X \in \mathbb{R}^{n \times d}$. Output: $W^{(T)} \in \mathbb{R}^{d \times r}$, $p^{(T)} \in \mathbb{R}^r$. Parameters: learning rate $\mu$; numbers of iterations $T$, $S$.
Initialize $W^{(0)}$ and $p^{(0)}$ using random Fourier features.
for $t = 1, 2, \ldots, T$ do
    $p^{(t)} \leftarrow \arg\min_{p \ge 0} L(W^{(t-1)}, p) + \lambda \|p\|^2$   // solve a QP for $p$
    $W^{(t)} \leftarrow W^{(t-1)}$
    for $s = 1, 2, \ldots, S$ do
        $W^{(t)} \leftarrow W^{(t)} - \mu\, \nabla_W L(W^{(t)}, p^{(t)})$   // update $W$ via gradient descent
    end for
end for

The time complexity of the QP is $O(nr^2 + n^2 r + r^3)$, and that of computing the gradient $\nabla_W L$ is $O(n^2 r + dnr)$; these reduce to $O(r^3)$ and $O(r^3 + dr^2)$ when $n = O(r)$. Overall, the time complexity is $O(TS(r^3 + dr^2))$, with $T$ and $S$ the numbers of outer and inner iterations and $r, n \ll N$.
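The following numpy sketch shows one outer iteration following the matrix expressions above. It is our reading of Algorithm 1, not the released implementation: it assumes a precomputed Gaussian kernel matrix on the landmarks, solves the non-negative QP with `scipy.optimize.nnls` after a Cholesky factorization of $A$, and uses a fixed step size; the helper name and all hyper-parameter values are placeholders.

```python
# A simplified sketch of one outer iteration of Algorithm 1 (our reading, with
# stated assumptions: Gaussian kernel matrix K on landmarks, fixed step size mu).
import numpy as np
from scipy.optimize import nnls

def ekal_iteration(Xl, K, W, p, q2, lam=1e-3, mu=1e-2, inner_steps=5):
    """Xl: n x d landmarks, K: n x n kernel on landmarks, q2: squared landmark weights q_s^2."""
    C, S = np.cos(Xl @ W), np.sin(Xl @ W)
    D = np.diag(q2)                                            # diag^2(q)

    # Update p: min_{p >= 0} p^T A p - 2 b^T p, the QP above.
    A = ((C.T @ D @ C) ** 2 + (C.T @ D @ S) ** 2
         + (S.T @ D @ C) ** 2 + (S.T @ D @ S) ** 2 + lam * np.eye(W.shape[1]))
    b = np.diag(C.T @ D @ K @ D @ C) + np.diag(S.T @ D @ K @ D @ S)
    L_chol = np.linalg.cholesky(A + 1e-10 * np.eye(len(A)))
    # With A = L L^T, the QP equals min_{p >= 0} ||L^T p - L^{-1} b||^2 (up to a constant).
    p, _ = nnls(L_chol.T, np.linalg.solve(L_chol, b))

    # Update W: a few gradient steps with p fixed, using the gradients in (6).
    qq = np.outer(q2, q2)                                      # entries q_s^2 q_t^2
    for _ in range(inner_steps):
        C, S = np.cos(Xl @ W), np.sin(Xl @ W)
        R = (C * p) @ C.T + (S * p) @ S.T - K                  # kernel-fit residual
        G_C = 2 * (R * qq) @ C * p                             # gradient w.r.t. C
        G_S = 2 * (R * qq) @ S * p                             # gradient w.r.t. S
        W = W - mu * (Xl.T @ (-S * G_C + C * G_S))             # chain rule through cos/sin
    return W, p
```

Passing `q2 = np.full(n, 1.0 / n)` corresponds to the sampling-based loss $L_s(f)$, while setting `q2[s] = N_s / N` with the k-means cluster sizes $N_s$ gives the clustering-based loss $L_c(f)$.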
Very recently, Chang et al. (2017) also proposed a data-driven approach that learns weights for random features. Their motivation is to sacrifice the unbiasedness of the estimator in the hope of lowering its variance. In comparison, we build both a theoretical and an algorithmic link between Fourier features and the data distribution, and our framework allows joint optimization of the samples and the weights.

## Generalization Error Analysis

In this section, we establish generalization error bounds for minimizing the two types of EKAL approximators, i.e., the sampling-based $L_s(f)$ and the clustering-based $L_c(f)$. These results guarantee that the learned Fourier features approximate the kernel not only over the selected landmark points but over the entire dataset. Suppose the spectral samples $w_j$ belong to a compact set $\mathcal{W}$. In addition, $p = [p_1, p_2, \ldots, p_r]$ resides in the standard simplex $\Delta_{r-1} = \{p : p_j \ge 0,\ \sum_{j=1}^r p_j = 1\}$ (since $\mathbb{E}[\sum_{j=1}^r p_j] = \mathbb{E}_{w \sim q}[p(w)/q(w)] = 1$). Such a simplex constraint can easily be relaxed to any compact constraint set. Define the set of candidate functions $\mathcal{F} = \{f : w_j \in \mathcal{W},\ p \in \Delta_{r-1}\}$.

Our theoretical analysis of the EKAL with sampling departs slightly from the previous section. To create independence between landmark samples, which simplifies the statistical argument, we sample $n$ landmark points $\{x_{\ell_s}\}_{s=1}^n$ from $\{x_\ell\}_{\ell=1}^N$ with replacement, and minimize the empirical loss

$$L_n(f) := \frac{2}{n(n-1)} \sum_{1 \le s < t \le n} \big( f(x_{\ell_s} - x_{\ell_t}) - k(x_{\ell_s} - x_{\ell_t}) \big)^2.$$

The additional loss incurred by learning on a small set of sampled landmarks is bounded. From Proposition 1, it appears that $n = O(dr)$ landmark points are required to achieve a small generalization error, while the empirical experiments show that $n = O(r)$ landmarks suffice (Figure 2).

Proposition 1. With probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} |L_n(f) - L(f)| \le 4 \sqrt{\frac{(dr + r - 1)\big(\log n + 2 \log(3 r_{\mathcal{W}} d_{\mathcal{X}} + 9)\big)}{n}} + 4 \sqrt{\frac{2 \log(1/\delta)}{n}},$$

where $r_{\mathcal{W}} = \sup_{w \in \mathcal{W}} \|w\|$ is the radius of $\mathcal{W}$ and $d_{\mathcal{X}} = \sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|$ is the diameter of $\mathcal{X}$.

Proof sketch. For legibility, the double subscripts in $\{x_{\ell_s}\}_{s=1}^n$ are dropped in favor of $\{x_\ell\}_{\ell=1}^n$. Define the collection of sampled landmarks $X_n := \{x_\ell\}_{\ell=1}^n$ and $\Omega(X_n) := \sup_{f \in \mathcal{F}} |L_n(f) - L(f)|$. We first bound the expectation of $\Omega(X_n)$ using a variation of the symmetrization trick (Giné and Zinn 1984). Unlike the empirical loss of a classic learning problem, the terms in $L_n(f)$ are not all independent. We think of the $n(n-1)/2$ terms as the games in a round-robin tournament (Lucas 1883). Without loss of generality, assume $n$ is even. We break the sum into $n - 1$ (not independent but identically distributed) rounds with $n/2$ independent games in each round. Then, by symmetry,

$$\mathbb{E}\, \Omega(X_n) \le \frac{4}{n}\, \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \sum_{j=1}^{n/2} z_j \big( f(x_{2j-1} - x_{2j}) - k(x_{2j-1} - x_{2j}) \big)^2 \Big|,$$

where $\{z_j\}_{j=1}^{n/2}$ are i.i.d. Rademacher random variables. Next, we bound the Rademacher average by constructing an $\epsilon$-net of $\mathcal{F}$ such that for every $f \in \mathcal{F}$ there exists an $f'$ in the net satisfying $\sup_{x_1, x_2 \in \mathcal{X}} |f(x_1 - x_2) - f'(x_1 - x_2)| \le 2\sqrt{1/n}$. By Massart's lemma (Massart 2000),

$$\mathbb{E}\, \Omega(X_n) \le 4 \sqrt{\frac{2 \log\big( (3 r_{\mathcal{W}} d_{\mathcal{X}} \sqrt{n})^{dr}\, (9 \sqrt{n})^{r-1} \big)}{n}}. \qquad (7)$$

By McDiarmid's inequality (McDiarmid 1989),

$$\Pr\big[\Omega(X_n) \ge \mathbb{E}\, \Omega(X_n) + t\big] \le \exp\!\big(-n t^2 / 32\big). \qquad (8)$$

Proposition 1 follows from $\sup_{f \in \mathcal{F}} |L_n(f) - L(f)| \le \mathbb{E}\, \Omega(X_n) + \big( \Omega(X_n) - \mathbb{E}\, \Omega(X_n) \big)$, where $\mathbb{E}\, \Omega(X_n)$ and $\Omega(X_n) - \mathbb{E}\, \Omega(X_n)$ are bounded by (7) and (8), respectively.
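To make the round-robin step concrete, the following is our own restatement of the decomposition we read the proof sketch as using, with $h_f(x, x') := \big(f(x - x') - k(x - x')\big)^2$ and $n$ even. As in a round-robin tournament schedule, the $n(n-1)/2$ pairs are partitioned into $n-1$ rounds, each a perfect matching of the $n$ landmarks into $n/2$ disjoint pairs $\{(a^m_j, b^m_j)\}_{j=1}^{n/2}$, so that

$$L_n(f) = \frac{2}{n(n-1)} \sum_{1 \le s < t \le n} h_f(x_s, x_t) = \frac{1}{n-1} \sum_{m=1}^{n-1} \frac{2}{n} \sum_{j=1}^{n/2} h_f\big(x_{a^m_j}, x_{b^m_j}\big).$$

Within each round the pairs involve disjoint, hence independent, landmarks, so symmetrization can be applied round by round; averaging over the $n-1$ rounds and moving the supremum inside the average yields the $4/n$ factor in the bound on $\mathbb{E}\,\Omega(X_n)$.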
Next, we show that the additional loss incurred by the features learned from the cluster centers approaches zero as the quantization error of k-means diminishes. The kernels we consider are Lipschitz continuous; for example, the Lipschitz constant of $k(x) = e^{-\|x\|^2/(2\sigma^2)}$ is $L_k = e^{-1/2}/\sigma$.

Proposition 2. $\sup_{f \in \mathcal{F}} |L_c(f) - L(f)| \le 8 \rho\, (r_{\mathcal{W}} + L_k)$, where $\rho = \max_\ell \|x_\ell - x^c_{j(\ell)}\|$ is the quantization error of k-means, $r_{\mathcal{W}} = \sup_{w \in \mathcal{W}} \|w\|$ is the radius of $\mathcal{W}$, and $L_k = \sup_{x_1, x_2 \in \mathcal{X} - \mathcal{X}} |k(x_1) - k(x_2)| / \|x_1 - x_2\|$ is the Lipschitz constant of the kernel $k$.

Proof sketch. Since $|k(\cdot)| \le 1$ and $|f(\cdot)| \le 1$, we have

$$\sup_{f \in \mathcal{F}} |L_c(f) - L(f)| \le 4 \max_{s,t} \sup_{f \in \mathcal{F}} \big| f(x_s - x_t) - f(x^c_{j(s)} - x^c_{j(t)}) \big| + 4 \max_{s,t} \big| k(x_s - x_t) - k(x^c_{j(s)} - x^c_{j(t)}) \big|,$$

where $x^c_{j(s)}$ is the centroid of the cluster containing $x_s$. We then bound the two terms using the Lipschitz continuity of the feature map and of the kernel, respectively.

Using the relation between low-rank kernel approximation and learning accuracy (Cortes, Mohri, and Talwalkar 2010), one can easily translate the above results into stability bounds for supervised learning algorithms, in terms of $L_n(f)$ (or $L_c(f)$) and the choice of landmarks.

## Target-Aware Fourier Features

In the literature, Fourier features are mainly designed for numerically approximating the kernel matrix. However, better kernel approximation may not always lead to better generalization (Avron et al. 2016). It is therefore desirable to improve the Fourier features with supervised information. To achieve this, we propose a hybrid loss, a combination of an unsupervised loss (kernel approximation) and a supervised loss (classification or regression error):

$$\min_{W,\, p \ge 0,\, \alpha, \beta}\ \frac{1}{N} \sum_{j=1}^N c\big( g(x_j; W, \alpha, \beta),\, y_j \big) + \gamma \|\alpha\|^2 + \eta \big( L(W, p) + \lambda \|p\|^2 \big).$$

Here, the first part is the prediction error on the labeled samples $\{x_j, y_j\}_{j=1}^N$, with weight decay on $\alpha \in \mathbb{R}^{2r}$. The predictor $g(x_j; W, \alpha, \beta)$ is chosen as a linear function of the learned Fourier features (and thus corresponds to a nonlinear function in the input space):

$$g(x_j; W, \alpha, \beta) = \alpha^\top \big[ \cos(W^\top x_j);\ \sin(W^\top x_j) \big] + \beta. \qquad (10)$$

The loss $c(g(x_i), y_i)$ can be $(g(x_i) - y_i)^2$ for regression, or the hinge loss $\max(0, 1 - y_i\, g(x_i))$ for classification. The second part, $\eta \big( L(W, p) + \lambda \|p\|^2 \big)$, is the unsupervised regularization term, namely the finite-sample kernel approximation error defined in (5). Here, the spectral samples $W$ not only appear in the predictor $g$ in (10), but must also faithfully reconstruct the kernel matrix as specified in (5). The unsupervised kernel approximation loss therefore imposes a highly informative regularization on the model $g$, which can notably improve the generalization performance, as we discuss in more detail in the experiments section.

Algorithm 2: Learning Fourier Features with Supervision
Input: $X \in \mathbb{R}^{n \times d}$. Output: $W^{(T)} \in \mathbb{R}^{d \times r}$, $p^{(T)} \in \mathbb{R}^r$, $\alpha^{(T)}$, $\beta^{(T)}$. Parameters: learning rate $\mu$; numbers of iterations $T$, $S$.
Initialize $W^{(0)}$ and $p^{(0)}$ using random Fourier features.
for $t = 1, 2, \ldots, T$ do
    $\alpha^{(t)}, \beta^{(t)} \leftarrow$ train the linear machine   // update $\alpha, \beta$ via gradient descent
    $p^{(t)} \leftarrow \arg\min_{p \ge 0} L(W^{(t-1)}, p) + \lambda \|p\|^2$   // solve a QP for $p$
    $W^{(t)} \leftarrow W^{(t-1)}$
    for $s = 1, 2, \ldots, S$ do
        $W^{(t)} \leftarrow W^{(t)} - \mu \big( \tfrac{1}{N} \sum_{j=1}^N \nabla_W c_j + \eta\, \nabla_W L \big)$   // update $W$ via gradient descent
    end for
end for

The optimization procedure is given in Algorithm 2; each iteration has three steps. First, given $W$, the linear model $g$ can be obtained using any off-the-shelf linear machine. Its parameters (weights $\alpha$, bias $\beta$) do not need to be fully optimized in each iteration; empirically, a few steps of gradient descent suffice. Second, the feature weights $p$ are optimized by the QP derived from (5). Third, the spectral samples $W$ are updated via gradient descent. The gradient of the EKAL, $\nabla_W L$, is derived in (6); the gradient of the prediction loss $c_j := c(g(x_j), y_j)$ is

$$\nabla_W c_j = \nabla_{g} c\big(g(x_j), y_j\big)\; x_j \big( -\sin(W^\top x_j) \circ \alpha_1 + \cos(W^\top x_j) \circ \alpha_2 \big)^\top,$$

where $\alpha_1, \alpha_2 \in \mathbb{R}^r$ denote the first $r$ entries (cosine part) and the last $r$ entries (sine part) of $\alpha$, respectively. Hence the gradient descent update for $W$ is $W \leftarrow W - \mu \big( \tfrac{1}{N} \sum_{j=1}^N \nabla_W c_j + \eta\, \nabla_W L \big)$. Overall, the time complexity of each gradient descent step for $\alpha$, $\beta$, and $W$ is $O(Ndr)$, which is linear in the sample size.
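The sketch below is our schematic reading of this update for the squared (regression) loss, not the authors' implementation; the helper names and the hyper-parameters $\eta$ and $\mu$ are placeholders.

```python
# A schematic numpy sketch of one gradient step on W for the hybrid objective
# (prediction loss + eta * EKAL regularizer), using the squared loss for regression.
import numpy as np

def predict(X, W, alpha, beta):
    """g(x) = alpha^T [cos(W^T x); sin(W^T x)] + beta, as in Eq. (10)."""
    P = X @ W
    return np.hstack([np.cos(P), np.sin(P)]) @ alpha + beta

def grad_W_prediction(X, y, W, alpha, beta):
    """(1/N) sum_j grad_W c_j for the squared loss c_j = (g(x_j) - y_j)^2."""
    N, r = X.shape[0], W.shape[1]
    P = X @ W
    C, S = np.cos(P), np.sin(P)
    g = C @ alpha[:r] + S @ alpha[r:] + beta
    dc_dg = 2.0 * (g - y)                                    # derivative of c w.r.t. g
    # grad_W c_j = (dc/dg)_j * x_j (-sin(W^T x_j) o alpha_1 + cos(W^T x_j) o alpha_2)^T
    M = (-S * alpha[:r] + C * alpha[r:]) * dc_dg[:, None]
    return X.T @ M / N

def hybrid_step_W(X, y, Xl, K, W, p, alpha, beta, q2, eta=1.0, mu=1e-2):
    """One step W <- W - mu * ((1/N) sum_j grad_W c_j + eta * grad_W L)."""
    C, S = np.cos(Xl @ W), np.sin(Xl @ W)
    qq = np.outer(q2, q2)
    R = (C * p) @ C.T + (S * p) @ S.T - K                    # kernel-fit residual on landmarks
    G_C, G_S = 2 * (R * qq) @ C * p, 2 * (R * qq) @ S * p
    grad_ekal = Xl.T @ (-S * G_C + C * G_S)                  # grad_W L as in (6)
    return W - mu * (grad_W_prediction(X, y, W, alpha, beta) + eta * grad_ekal)
```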
Recently, Sinha and Duchi (2016) proposed to learn the feature weights $p_j$ by kernel alignment, and Yu et al. (2015) considered optimizing the spectral samples $w_j$ using a classification loss. There are two key differences between our work and theirs. First, previous works learn either the $p_j$'s or the $w_j$'s, while we adjust both. Second, we employ a hybrid loss that incorporates EKAL minimization as a novel regularizer on the learned model, leading to improved generalization.

## Experiments

This section reports empirical evaluations on numerical kernel matrix approximation and on supervised learning tasks. We denote our Fourier features learned by minimizing the EKAL (Algorithm 1) with sampling and with clustering as SAMPLE and CLUSTER, respectively. In the supervised learning tasks, we minimize the hybrid loss with the sampling-based EKAL and call the method SUPERVISE. We compare against the following methods for constructing Fourier features:

- MC: standard Monte Carlo sampling for random Fourier features (Rahimi and Recht 2008).
- FASTFOOD: fast sampling using Hadamard matrices (Le, Sarlós, and Smola 2013).
- HALTON, SOBOL, LATTICE, DIGIT: four low-discrepancy sequences used in QMC sampling (Yang et al. 2014).
- MM: the moment matching approach (Shen, Yang, and Wang 2017).
- WEIGHT: learning feature weights for kernel approximation via linear ridge regression (Chang et al. 2017).
- ALIGN: kernel learning via alignment maximization (Sinha and Duchi 2016).

The benchmark datasets are listed in Table 1. For simplicity, we convert Covtype and MNIST to binary classification (type 1 vs. not type 1, and digit 0 vs. digit 1). All data samples are split into training/test sets (2:1), unless a split is provided with the original data. We tune the parameters via cross validation on the training set. The input data are normalized to have zero mean and unit variance in each dimension, and the Gaussian kernel width $2\sigma^2$ is chosen as the dimension $d$ of the input data, which equals $\mathbb{E}[\|x_1 - x_2\|^2/2]$ after normalization.

Table 1: Benchmark datasets

| Dataset | Task | Input Dimension | Sample Size |
|---|---|---|---|
| Wine | Regression | 11 | 4898 |
| Parkinson | Regression | 16 | 5875 |
| CPU | Regression | 21 | 8192 |
| Adult | Classification | 123 | 48842 |
| Covtype | Classification | 54 | 58101 |
| MNIST | Classification | 784 | 14780 |

### Two-Dimensional Toy Examples

In Figure 1, we plot toy examples to demonstrate the intrinsic connection between Fourier features (in particular their weights $p_j$) and the input data distribution $P(x)$. For visualization purposes, the $w_j$'s are chosen from a uniform grid and only the $p_j$'s are optimized (in the real-data experiments, the $w_j$'s and $p_j$'s are optimized together). Note that the color-coded weight map can intuitively be regarded as a distribution-sensitive importance distribution in the spectral domain.

[Figure 1: Toy 2D samples $x_i$ (input space) and the corresponding color-coded weights $p_j$ (spectral domain); panels alternate between input data and weight maps. The spectral samples $w_j$ are chosen on a grid for better visualization. Our approach generates data-dependent weighting.]

### Kernel Approximation Experiments

We use the relative error $\|\tilde{K} - K\|_F / \|K\|_F$ to quantify the performance of kernel approximation, where $K$ is the exact kernel matrix and $\tilde{K}$ is the approximation produced by Algorithm 1.
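As a small aid for reproducing this metric, the helper below (our sketch, assuming the Gaussian kernel used throughout the experiments and learned features $(W, p)$) evaluates the relative error.

```python
# Relative kernel approximation error ||K~ - K||_F / ||K||_F for learned features (W, p);
# our helper, assuming the Gaussian kernel used in the experiments.
import numpy as np

def relative_error(X, W, p, sigma):
    P = X @ W
    C, S = np.cos(P), np.sin(P)
    K_approx = (C * p) @ C.T + (S * p) @ S.T                  # K~ from the learned features
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_exact = np.exp(-sq_dists / (2 * sigma ** 2))            # exact kernel matrix
    return np.linalg.norm(K_approx - K_exact) / np.linalg.norm(K_exact)
```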
**Number of landmarks in the EKAL approximators.** We first explore how many landmarks the EKAL approximators need in order to guarantee accurate results on the entire dataset. For feature dimensions $r = 50$, $100$, and $200$, we run SAMPLE and CLUSTER with the number of landmarks $n = r/25$, $r/5$, $r$, and $5r$. The errors are shown in Figure 2. Clearly, a relatively small generalization error is already achieved for $n$ as small as $r$, so we fix $n = r$ for the remaining experiments.

[Figure 2: Relative errors (y-axis) vs. the number of landmarks $n$ (x-axis, expressed as multiples of $r$: $r/25$, $r/5$, $r$, $5r$) for varying feature dimension $r = 50, 100, 200$. Panels: (a) SAMPLE on Wine; (b) SAMPLE on Parkinson; (c) CLUSTER on Wine; (d) CLUSTER on Parkinson.]

**Relative approximation errors.** We report the relative kernel approximation errors of all competing algorithms on the six benchmark datasets for $r = 50$, $100$, $200$ (Table 2). On every dataset, at least one of our two methods, SAMPLE and CLUSTER, achieves the lowest error, which demonstrates the advantage of learned Fourier features in kernel approximation. Usually, CLUSTER performs better than SAMPLE, meaning that landmarks chosen as cluster centers yield better kernel approximation. However, when the number of landmarks $n$ is lower than the input dimension ($n \le 200 < 784 = d$ for MNIST), CLUSTER becomes worse than SAMPLE. This is because, when the number of clusters is too small, the cluster centers can be undesirably far from the actual input samples. In such a case, replacing the cluster centers with their closest samples proves an effective cure.

Table 2: Relative kernel approximation errors for varying feature dimension $r$. The errors of LATTICE on MNIST are missing because the code provided by Yang et al. (2014) cannot generate a lattice sequence of dimension larger than 250.

| Dataset | r | MC | FASTFOOD | HALTON | SOBOL | LATTICE | DIGIT | MM | WEIGHT | SAMPLE | CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wine | 50 | 0.31 | 0.34 | 0.24 | 0.27 | 0.28 | 0.23 | 0.25 | 0.24 | 0.14 | 0.13 |
| Wine | 100 | 0.19 | 0.25 | 0.18 | 0.20 | 0.19 | 0.15 | 0.17 | 0.15 | 0.08 | 0.08 |
| Wine | 200 | 0.13 | 0.16 | 0.11 | 0.14 | 0.13 | 0.11 | 0.13 | 0.10 | 0.05 | 0.05 |
| Parkinson | 50 | 0.16 | 0.13 | 0.10 | 0.18 | 0.12 | 0.08 | 0.14 | 0.10 | 0.05 | 0.04 |
| Parkinson | 100 | 0.08 | 0.14 | 0.09 | 0.12 | 0.07 | 0.05 | 0.09 | 0.05 | 0.03 | 0.02 |
| Parkinson | 200 | 0.06 | 0.08 | 0.05 | 0.09 | 0.06 | 0.04 | 0.05 | 0.03 | 0.02 | 0.01 |
| CPU | 50 | 0.17 | 0.18 | 0.14 | 0.18 | 0.14 | 0.13 | 0.13 | 0.14 | 0.11 | 0.09 |
| CPU | 100 | 0.13 | 0.13 | 0.11 | 0.12 | 0.11 | 0.09 | 0.09 | 0.10 | 0.07 | 0.05 |
| CPU | 200 | 0.08 | 0.10 | 0.07 | 0.09 | 0.07 | 0.06 | 0.06 | 0.06 | 0.04 | 0.03 |
| Adult | 50 | 0.27 | 0.29 | 0.27 | 0.28 | 0.28 | 0.26 | 0.89 | 0.26 | 0.23 | 0.25 |
| Adult | 100 | 0.19 | 0.22 | 0.18 | 0.20 | 0.20 | 0.18 | 0.24 | 0.19 | 0.14 | 0.16 |
| Adult | 200 | 0.14 | 0.16 | 0.13 | 0.14 | 0.14 | 0.11 | 0.10 | 0.13 | 0.10 | 0.11 |
| Covtype | 50 | 0.23 | 0.25 | 0.22 | 0.22 | 0.23 | 0.21 | 0.19 | 0.22 | 0.16 | 0.14 |
| Covtype | 100 | 0.16 | 0.18 | 0.15 | 0.16 | 0.16 | 0.13 | 0.12 | 0.14 | 0.09 | 0.07 |
| Covtype | 200 | 0.11 | 0.12 | 0.10 | 0.12 | 0.11 | 0.09 | 0.09 | 0.09 | 0.05 | 0.04 |
| MNIST | 50 | 0.17 | 0.21 | 0.17 | 0.18 |  | 0.21 | 1.25 | 0.17 | 0.14 | 0.26 |
| MNIST | 100 | 0.13 | 0.15 | 0.13 | 0.13 |  | 0.13 | 1.10 | 0.12 | 0.09 | 0.24 |
| MNIST | 200 | 0.09 | 0.10 | 0.09 | 0.08 |  | 0.09 | 0.84 | 0.08 | 0.06 | 0.10 |
Due to space limits, we defer the empirical study of the modified EKAL approximator (which uses the samples closest to the cluster centers as landmarks) to an extended version of this paper. Such a modification indeed improves the performance of CLUSTER in high-dimensional cases, and achieves competitive performance against both CLUSTER and SAMPLE on all datasets.

### Supervised Learning Experiments

We perform the supervised learning tasks of regression and classification; in both cases, a linear predictor is applied directly to the learned Fourier features, which corresponds to a nonlinear predictor in the input space. For regression, we use ridge regression and report the root mean square error (RMSE); for classification, we use an $\ell_2$-regularized SVM and report the classification error. All experiments use $n = r = 200$. The regression and classification results on the six benchmark datasets are reported in Table 3. When comparing methods that do not use label information to construct the Fourier features (the unshaded methods in Table 3), the results are inconclusive, i.e., each method performs better on certain tasks and worse on others. Overall, the unsupervised features learned by our methods (SAMPLE and CLUSTER) are quite comparable to the others. When labels are used in constructing the Fourier features, our method (SUPERVISE) attains the lowest generalization error on most datasets, demonstrating its superiority over both the unsupervised feature construction methods and the recently proposed supervised method (ALIGN).

Table 3: Generalization error for regression (RMSE) and classification (%). Shaded methods (ALIGN and SUPERVISE) use labels in computing the features.

| Dataset | MC | FASTFOOD | HALTON | SOBOL | LATTICE | DIGIT | MM | WEIGHT | ALIGN | SAMPLE | CLUSTER | SUPERVISE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wine | 0.712 | 0.697 | 0.697 | 0.707 | 0.709 | 0.703 | 0.716 | 0.710 | 0.703 | 0.706 | 0.703 | 0.697 |
| Parkinson | 6.931 | 6.929 | 6.963 | 6.898 | 6.959 | 6.997 | 6.946 | 6.950 | 6.895 | 6.957 | 6.931 | 6.432 |
| CPU | 7.238 | 6.632 | 7.196 | 7.471 | 6.455 | 7.415 | 7.664 | 7.533 | 6.889 | 7.495 | 7.060 | 3.687 |
| Adult | 16.02 | 15.91 | 16.01 | 15.90 | 15.94 | 15.87 | 15.75 | 16.10 | 15.94 | 15.98 | 15.96 | 15.64 |
| Covtype | 19.37 | 19.94 | 19.39 | 19.63 | 19.39 | 19.95 | 19.78 | 19.67 | 19.46 | 19.52 | 19.32 | 19.78 |
| MNIST | 0.757 | 0.757 | 0.567 | 0.615 |  | 0.709 | 0.378 | 0.851 | 0.662 | 0.757 | 0.735 | 0.378 |

## Conclusion

In this paper, we propose a novel framework for learning Fourier features that adapt to the input data. The spectral samples and weights are optimized jointly, and the learned features can further be incorporated into any supervised learning framework. Extensive theoretical and empirical results demonstrate the advantages of our method. In the future, we will study the theoretical connections between the proposed Fourier features and explicit kernel low-rank decompositions.

## Acknowledgement

Jun Wang is supported in part by the National Science Foundation of China, No. 61672236.

## References

Avron, H.; Sindhwani, V.; Yang, J.; and Mahoney, M. W. 2016. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research 17(120):1–38.

Avron, H.; Sindhwani, V.; and Woodruff, D. 2013. Sketching structured matrices for faster nonlinear regression. In Advances in Neural Information Processing Systems, 2994–3002.

Bach, F. 2017. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research 18(21):1–38.

Barber, D. 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.

Chang, W.-C.; Li, C.-L.; Yang, Y.; and Póczos, B. 2017. Data-driven random Fourier features using Stein effect. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 1497–1503.

Cortes, C.; Mohri, M.; and Talwalkar, A. 2010. On the impact of kernel approximation on learning accuracy. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 113–120.

Drineas, P., and Mahoney, M. W. 2005. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research 6(Dec):2153–2175.

Fowlkes, C.; Belongie, S.; Chung, F.; and Malik, J. 2004. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2):214–225.

Giné, E., and Zinn, J. 1984. Some limit theorems for empirical processes.
The Annals of Probability 12(4):929–989.

Halko, N.; Martinsson, P. G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53(2):217–288.

Kumar, S.; Mohri, M.; and Talwalkar, A. 2012. Sampling methods for the Nyström method. Journal of Machine Learning Research 13(Apr):981–1006.

Le, Q.; Sarlós, T.; and Smola, A. 2013. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, volume 85.

Lucas, É. 1883. Les jeux de demoiselles. In Récréations Mathématiques. Gauthier-Villars (Paris). 161–197.

Mahoney, M. W. 2011. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3(2):123–224.

Massart, P. 2000. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse: Mathématiques 9(2):245–303.

McDiarmid, C. 1989. On the method of bounded differences. Surveys in Combinatorics 141(1):148–188.

Pennington, J.; Felix, X. Y.; and Kumar, S. 2015. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems, 1846–1854.

Rahimi, A., and Recht, B. 2008. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 1177–1184.

Schölkopf, B., and Smola, A. J. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Shen, W.; Yang, Z.; and Wang, J. 2017. Random features for shift-invariant kernels with moment matching. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.

Si, S.; Hsieh, C.-J.; and Dhillon, I. 2014. Memory efficient kernel approximation. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, 701–709. PMLR.

Sinha, A., and Duchi, J. C. 2016. Learning kernels with random features. In Advances in Neural Information Processing Systems, 1298–1306.

Vedaldi, A., and Zisserman, A. 2012. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3):480–492.

Williams, C. K., and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 682–688.

Woodruff, D. P., et al. 2014. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10(1–2):1–157.

Yang, J.; Sindhwani, V.; Avron, H.; and Mahoney, M. 2014. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of the 31st International Conference on Machine Learning, 485–493.

Yu, F. X.; Kumar, S.; Rowley, H.; and Chang, S.-F. 2015. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893.