# Training-Efficient Feature Map for Shift-Invariant Kernels

Xixian Chen^{1,2}, Haiqin Yang^{1,2}, Irwin King^{1,2}, Michael R. Lyu^{1,2}
^1 Shenzhen Key Laboratory of Rich Media Big Data Analytics and Applications, Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
^2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
{xxchen, hqyang, king, lyu}@cse.cuhk.edu.hk

Abstract

Random feature maps are popularly used to scale up kernel methods. However, employing a large number of mapped features to ensure an accurate approximation still makes training time consuming. In this paper, we aim to improve the training efficiency of shift-invariant kernels by using fewer informative features without sacrificing precision. We propose a novel feature map method by extending Random Kitchen Sinks through fast data-dependent subspace embedding to generate the desired features. More specifically, we describe two algorithms with different tradeoffs between running speed and accuracy, and prove that $O(l)$ features induced by them perform as accurately as $O(l^2)$ features produced by other feature map methods. In addition, experiments conducted on real-world datasets demonstrate the superiority of the proposed algorithms.

1 Introduction

Kernel methods are powerful since they can achieve excellent performance when learning the non-linear relationships embedded in the training data. Usually, a non-linear relationship is encoded by a kernel function $k$: $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi(\cdot): \mathbb{R}^m \to \mathcal{H}$ maps the training data from the original $m$-dimensional feature space to a high-dimensional or even infinite-dimensional feature space $\mathcal{H}$. Kernel methods can then train on the kernel matrix that represents the similarity of pairs of training data points without explicitly defining the mapping function $\Phi(\cdot)$, which overcomes the curse of dimensionality. Although kernel methods have attracted extensive investigation in the past two decades due to their significant performance advantage, one main issue is that they scale poorly with the size of the training dataset. Taking kernel ridge regression (KRR) [Saunders et al., 1998] as an example, given a data matrix $X \in \mathbb{R}^{n \times m}$ consisting of $n$ data points in $m$ dimensions, KRR incurs $O(n^3 + n^2 m)$ time for training, much more than the $O(nm^2)$ of its linear counterpart when $n \gg m$. The huge time complexity therefore makes kernel methods impractical on large datasets.

To resolve the scalability issue, random feature maps have been proposed to map the data explicitly to a low-dimensional Euclidean inner product space using the Fourier transform. Through a feature map $Z: \mathbb{R}^m \to \mathbb{R}^l$, one obtains $k(x, y) = \langle \Phi(x), \Phi(y) \rangle \approx \langle Z(x), Z(y) \rangle$. Then, instead of KRR, we can directly employ linear ridge regression on the mapped features, taking $O(nl^2 + nml)$ time, which depends linearly on $n$. The number of mapped features $l$, however, exerts great influence on the effectiveness and efficiency of the random feature map. Using a small $l$ speeds up training but loses accuracy in both the kernel matrix approximation and the learning. In contrast, a large $l$ ensures precision but takes more training time. Unfortunately, as demonstrated in the literature [Rahimi and Recht, 2009; Dai et al., 2014], $l$ should be chosen on the same order as $n$.
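To make the cost comparison concrete, the following minimal NumPy sketch (our own illustrative setup, not taken from the paper) contrasts the $O(n^3)$ closed-form KRR solve with the $O(nl^2)$ linear ridge solve on an explicit $l$-dimensional feature map; the toy data, bandwidth, and regularizer are arbitrary placeholders.

```python
import numpy as np

def kernel_ridge_fit(X, y, sigma=1.0, lam=1e-2):
    """Kernel ridge regression with a Gaussian kernel: solve an n x n system, O(n^3)."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    return np.linalg.solve(K + len(y) * lam * np.eye(len(y)), y)   # dual coefficients alpha

def linear_ridge_fit(Z, y, lam=1e-2):
    """Ridge regression on mapped features Z in R^{n x l}: solve an l x l system, O(n l^2)."""
    n, l = Z.shape
    return np.linalg.solve(Z.T @ Z + n * lam * np.eye(l), Z.T @ y)  # primal weights w

# toy usage: any explicit feature map Z(x), e.g. random Fourier features, plugs into the linear solver
rng = np.random.default_rng(0)
X, y = rng.standard_normal((500, 10)), rng.standard_normal(500)
alpha = kernel_ridge_fit(X, y)                           # cost grows cubically with n
w = linear_ridge_fit(rng.standard_normal((500, 64)), y)  # cost grows linearly with n
```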
In this regard, even linear ridge regression will then take $O(n^3 + n^2 m)$ training time, a computational complexity almost the same as that of the original kernel methods. In this paper, we consider accelerating training for an important class of kernel functions that are shift-invariant, namely $k(x, y) = k(x - y)$. This kind of kernel (e.g., Gaussian, Laplacian, and Cauchy) is popular and widely used. Our contributions in this work are summarized as follows. First, we propose a training-efficient feature map (TEFM) method by extending Random Kitchen Sinks (RKS) [Rahimi and Recht, 2007; 2009] via fast data-dependent subspace embedding (FDSE). It enjoys the advantages of both RKS and FDSE: RKS is a representative and efficient feature map for shift-invariant kernels, while FDSE is a powerful dimensionality reduction tool in terms of both accuracy and speed. Second, we instantiate TEFM with two algorithms, TEFM-G and TEFM-S. The former may be slightly more accurate, while the latter runs much faster. Third, we give provable results indicating that $O(l)$ features from TEFM-G or TEFM-S perform almost as accurately as $O(l^2)$ features from other methods, which is also confirmed by extensive experiments.

The rest of the paper is organized as follows. In Section 2, we summarize related work. In Section 3, we present our method with the theoretical analysis. In Section 4, we provide empirical results. Finally, in Section 5, we conclude the whole work.

Notation. We use bold letters to denote either a vector or a matrix. The transpose of $X$ is denoted by $X^T$, $X^\dagger$ denotes its Moore-Penrose inverse, $\mathrm{Tr}(X)$ its trace, and $\mathrm{span}(X)$ its column space. $\|X\|$ and $\|X\|_F$ represent the spectral norm and the Frobenius norm of $X$, respectively. Finally, we use $X \preceq Y$ to represent that $Y - X$ is positive semidefinite.

2 Related Work

In this section, we review and discuss several existing feature map methods for shift-invariant kernels. Then, we outline subspace embedding and analyze its properties.

2.1 Random Feature Maps for Shift-Invariant Kernels

Existing random feature maps for shift-invariant kernels include the RKS and its variants. We first spend some effort on the RKS, a basic and effective method. The RKS rests on Bochner's theorem [Rudin, 1990], by which each entry of the kernel matrix $K \in \mathbb{R}^{n \times n}$ can be approximated as:

$$K_{ij} = k(x_i - x_j) = \int p(z)\, e^{i z^T (x_i - x_j)}\, dz \qquad (1)$$

$$\approx \frac{1}{l/2} \sum_{s=1}^{l/2} \big\langle e^{i z_s^T x_i}, e^{i z_s^T x_j} \big\rangle \approx \sum_{s=1}^{l/2} \Big[ \Big\langle \tfrac{1}{\sqrt{l/2}} \cos(z_s^T x_i), \tfrac{1}{\sqrt{l/2}} \cos(z_s^T x_j) \Big\rangle + \Big\langle \tfrac{1}{\sqrt{l/2}} \sin(z_s^T x_i), \tfrac{1}{\sqrt{l/2}} \sin(z_s^T x_j) \Big\rangle \Big] = \big\langle Z(x_i) \in \mathbb{R}^l,\ Z(x_j) \in \mathbb{R}^l \big\rangle, \qquad (2)$$

$$\approx \frac{1}{l} \sum_{s=1}^{l} \big\langle e^{i z_s^T x_i}, e^{i z_s^T x_j} \big\rangle \approx \sum_{s=1}^{l} \Big\langle \sqrt{\tfrac{2}{l}} \cos(z_s^T x_i + b_s), \sqrt{\tfrac{2}{l}} \cos(z_s^T x_j + b_s) \Big\rangle = \big\langle Z(x_i) \in \mathbb{R}^l,\ Z(x_j) \in \mathbb{R}^l \big\rangle, \qquad (3)$$

where $z_s$ is sampled from a proper probability density function $p(z)$, set to be the inverse Fourier transform of the shift-invariant kernel function $k(\cdot)$, and $b_s$ is uniformly sampled over $[0, 2\pi]$ [Rahimi and Recht, 2007]. Training kernel methods can then be accelerated by using linear algorithms on $\{Z(x_i)\}_{i=1}^n$, while achieving higher accuracy than linear algorithms operating directly on $\{x_i\}_{i=1}^n$. Compared with kernel methods, linear algorithms using $\{x_i\}_{i=1}^n$ run much more quickly, especially in low-dimensional data spaces, but are less accurate. This disadvantage can be alleviated by applying $\{Z(x_i)\}_{i=1}^n$, which carries the information of the kernel.
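For concreteness, here is a minimal NumPy sketch of the two RKS maps in Eq. (2) and Eq. (3) for the Gaussian kernel, for which $p(z)$ is Gaussian with covariance $\sigma^{-2} I$; the function names, seeds, and the sanity check are our own illustration rather than part of the original presentation.

```python
import numpy as np

def rks_cos_sin(X, l, sigma=1.0, seed=0):
    """Eq. (2): paired cos/sin features; l is assumed to be even."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((X.shape[1], l // 2)) / sigma   # z_s ~ p(z) = N(0, sigma^-2 I)
    P = X @ Z                                               # projections z_s^T x_i
    return np.hstack([np.cos(P), np.sin(P)]) * np.sqrt(2.0 / l)

def rks_cos_phase(X, l, sigma=1.0, seed=0):
    """Eq. (3): cosine features with random phases b_s ~ Uniform[0, 2*pi]."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((X.shape[1], l)) / sigma
    b = rng.uniform(0.0, 2 * np.pi, size=l)
    return np.sqrt(2.0 / l) * np.cos(X @ Z + b)

# sanity check: <Z(x), Z(y)> should approach exp(-||x - y||^2 / (2 sigma^2)) as l grows
rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)
F = rks_cos_sin(np.vstack([x, y]), l=4000, sigma=2.0)
print(F[0] @ F[1], np.exp(-np.sum((x - y) ** 2) / (2 * 2.0 ** 2)))
```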
The linear algorithms may still be impractical, since a large $l$ has to be chosen to ensure accuracy comparable to the kernel methods. The kernel approximation performance of RKS is given by the proven assertion that $|K_{ij} - \langle Z(x_i), Z(x_j) \rangle| \leq O(1/\sqrt{l})$ holds with constant probability for all $i, j \in [n]$ when drawing $l$ features [Rahimi and Recht, 2007]. Therefore, RKS converges at the rate $O(1/\sqrt{l})$ in the kernel approximation error. Accordingly, this kernel approximation error degrades the learning accuracy by increasing the generalization error from $O(1/\sqrt{n})$ in the kernel methods to $O(1/\sqrt{n} + 1/\sqrt{l})$ in linear algorithms trained on RKS features. This suggests that $l$ should be set on the order of $n$ to guarantee the generalization performance [Rahimi and Recht, 2009; Dai et al., 2014].

There are some follow-up works that improve RKS, but they still suffer from certain training inefficiencies. [Le et al., 2013] reduces the feature generation time of RKS from $O(nml)$ to $O(nl \log m)$ by using fast Hadamard matrices, but its kernel approximation error is larger than that of RKS with the same number of features. Hence, to guarantee the same accuracy, more features have to be employed, requiring more training time. In [Yang et al., 2014], the advanced sampling technique Quasi-Monte Carlo is introduced, but it improves the convergence rate of the kernel approximation error only when the feature number $l$ is exponential in the data dimension $m$. This condition is not satisfied in many real-world datasets, and thus more features are needed in training. In conclusion, RKS remains a better choice for reducing training time. Next, we review subspace embedding, which is applied in our proposed method.

2.2 Subspace Embedding

Subspace embedding, informally, reduces dimensionality by performing a map $f: \mathbb{R}^m \to \mathbb{R}^l$ on the dataset $X = \{x_i \in \mathbb{R}^m\}_{i=1}^n$, where $l < m$. Briefly, it can be cast into two categories: data-independent and data-dependent. In data-independent subspace embedding, such as random projection [Bingham and Mannila, 2001], $f$ is predefined without knowing how the dataset is distributed in the original space, and it retains the data information (e.g., $\|x_i\|$, $\|X\|$, $\|X\|_F$, etc.) by ensuring an approximate isometry over the embedded subspace with high probability. The whole embedding procedure can be completed quickly, but the dimensionality of the embedded subspace cannot be made sufficiently low, otherwise it is impossible to approximate an isometry accurately [Nelson and Nguyễn, 2014]. In contrast, data-dependent subspace embedding maps the dataset into important subspaces determined by the data distribution. The insight is that the dataset is typically distributed unevenly across subspaces, and most of its mass lies in only a few subspaces, which are regarded as important. Therefore, dimensionality can be substantially reduced without sacrificing much information. The issue is that finding the important subspaces usually takes much time. To reduce dimensionality both effectively and efficiently, [Halko et al., 2011] proposes a fast and accurate method to find such an important subspace of the dataset, which we refer to as FDSE in this paper. It essentially belongs to the data-dependent category; the difference is that the important subspaces are approximated by performing a data-independent subspace embedding on the data, by which the whole computation time is reduced.
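The following short sketch illustrates the FDSE idea in the spirit of [Halko et al., 2011]: a data-independent Gaussian sketch plus a QR decomposition yields an approximate basis for the important (top) row space, onto which the data is projected. The function name, the power-iteration loop, and the toy matrix are our own illustrative choices.

```python
import numpy as np

def fdse(X, l, q=1, seed=0):
    """Approximate the top-l row space of X (n x m) and project the data onto it."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((X.shape[0], l))   # data-independent random sketch
    Y = X.T @ Omega                                # m x l sample of the row space
    for _ in range(q):                             # optional power iterations sharpen the spectrum
        Y = X.T @ (X @ Y)
    Q, _ = np.linalg.qr(Y)                         # column-orthonormal basis, m x l
    return X @ Q, Q                                # embedded data (n x l) and the basis

# toy usage: a matrix with a decaying spectrum is compressed with small reconstruction error
rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 200)) * (1.0 / np.arange(1, 201))   # columns shrink in scale
Xl, Q = fdse(X, l=20, q=1)
print(np.linalg.norm(X - Xl @ Q.T) / np.linalg.norm(X))            # relative Frobenius error
```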
3 Training-Efficient Feature Map

In this section, we first propose our method together with two algorithms. Then, we provide an error analysis of the proposed algorithms for kernel approximation and the related learning tasks. Finally, we compare and discuss the time complexity of each method on the regression task.

3.1 Methods and Algorithms

We are now ready to specify our method, TEFM. [Yang et al., 2012] has pointed out that the features in RKS are randomly sampled from a distribution independent of the training data, leading to a data-independent vector representation whose information is mostly confined to certain specific subspaces. Thus, we propose to improve RKS further via FDSE: FDSE maps the RKS feature matrix $F \in \mathbb{R}^{n \times d}$ onto its $l$-dimensional important orthogonal row space, which is obtained quickly by combining random projection (e.g., a Gaussian projection matrix) with QR decomposition, and thus yields the desired feature matrix $G \in \mathbb{R}^{n \times l}$. This is expected to run faster than a standard QR decomposition, with the time complexity reduced from $O(nd^2)$ to $O(ndl + dl^2)$, where we assume $n > d > l$.

Algorithm 1: TEFM-G
Input: data $X^T = \{x_i \in \mathbb{R}^m\}_{i=1}^n$, shift-invariant kernel function $k(\cdot)$, scalars $[q, d, l]$.
Output: $G \in \mathbb{R}^{n \times l}$.
1: Form $F \in \mathbb{R}^{n \times d}$ with each row vector constructed as in Eq. (2) or Eq. (3).
2: Draw $\Theta \in \mathbb{R}^{n \times l}$ from $N(0, 1)$.
3: Construct $Y = (F^T F)^q F^T \Theta$.
4: Run QR decomposition on $Y$ such that $QR = Y$, where $Q \in \mathbb{R}^{d \times l}$ is column-orthogonal.
5: $G = FQ$.

We summarize this method in Algorithm 1. As can be seen, there are two parts in this algorithm. Step 1 generates the feature matrix $F$, where the inverse Fourier transform $p(z)$ of different $k(\cdot)$ can be found in [Rahimi and Recht, 2007]. Steps 2 to 5 perform the FDSE. The exponent $q$ is typically set to 0, 1, or 2; it speeds up the decay of the singular values of $F$ and thus improves the accuracy of the whole algorithm [Halko et al., 2011]. Moreover, steps 2 and 3 of Algorithm 1 can be further accelerated by replacing $\Theta$ with an SRHT (Subsampled Randomized Hadamard Transform) matrix, after which forming $Y$ and running the QR decomposition requires only $O(nd \log l + dl^2)$ time. For the details of SRHT, please refer to [Tropp, 2011]. We summarize this variant in Algorithm 2, which runs faster than Algorithm 1.

Algorithm 2: TEFM-S
Input: data $X^T = \{x_i \in \mathbb{R}^m\}_{i=1}^n$, shift-invariant kernel function $k(\cdot)$, scalars $[d, l]$.
Output: $G \in \mathbb{R}^{n \times l}$.
1: Run step 1 of Algorithm 1.
2: Draw an SRHT matrix $\Theta \in \mathbb{R}^{n \times l}$.
3: Construct $Y = F^T \Theta$.
4: Run steps 4 and 5 of Algorithm 1.
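As a minimal sketch (our own, with illustrative shapes and seeds), Algorithm 1 takes only a few lines of NumPy once the RKS feature matrix F from step 1 is available; Algorithm 2 differs only in that the Gaussian test matrix Θ is replaced by an SRHT matrix and q is fixed to 0.

```python
import numpy as np

def tefm_g(F, l, q=1, seed=0):
    """Sketch of Algorithm 1 (TEFM-G): compress RKS features F (n x d) into G (n x l)."""
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal((F.shape[0], l))   # step 2: Gaussian test matrix, n x l
    Y = F.T @ Theta                                # step 3 for q = 0
    for _ in range(q):                             # step 3: Y = (F^T F)^q F^T Theta
        Y = F.T @ (F @ Y)
    Q, _ = np.linalg.qr(Y)                         # step 4: Q in R^{d x l}, column-orthogonal
    return F @ Q, Q                                # step 5: G = F Q (keep Q for prediction)

# toy shape check (F here is a random stand-in for the RKS feature matrix of step 1)
rng = np.random.default_rng(3)
F = rng.standard_normal((1000, 400)) / np.sqrt(400)
G, Q = tefm_g(F, l=100, q=1)
# the gap ||F F^T - G G^T|| is governed by the (l+1)-th largest singular value of F
print(G.shape, np.linalg.norm(F @ F.T - G @ G.T, 2))
```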
3.2 Kernel Approximation Analysis

In this part, we provide a theoretical analysis showing that $O(l)$ features generated by our algorithms approximate the kernel matrix as accurately as $O(l^2)$ features induced by other existing methods such as RKS and its extensions. Before proceeding, we first state two extracted and reformulated propositions.

Proposition 1. [Tropp, 2015] Suppose $\{S_i\}_{i=1}^n$ are random square matrices with $\mathbb{E} S_i = 0$ and $\|S_i\| \leq L$. Define $Z = \sum_{i=1}^n S_i$, and let $\Xi_1$ and $\Xi_2$ satisfy $\mathbb{E}(Z Z^T) \preceq \Xi_1$ and $\mathbb{E}(Z^T Z) \preceq \Xi_2$. Let $\xi = \max\{\|\Xi_1\|, \|\Xi_2\|\}$ and $r = \mathrm{Tr}(\Xi_1 + \Xi_2)/\xi$. Then, $P\{\|Z\| \geq t\} \leq 4r \exp\big(\tfrac{-t^2/2}{\xi + Lt/3}\big)$ for $t \geq \sqrt{\xi} + L/3$.

Proposition 2. [Halko et al., 2011] Denote the SVD of $A \in \mathbb{R}^{m \times n}$ by $A = U\,\mathrm{diag}(\Sigma_1, \Sigma_2)(V_1, V_2)^T$, where $\Sigma_1 \in \mathbb{R}^{k \times k}$, $\Sigma_2 \in \mathbb{R}^{(n-k) \times (n-k)}$, $V_1 \in \mathbb{R}^{n \times k}$, $V_2 \in \mathbb{R}^{n \times (n-k)}$, and $\mathrm{diag}(\Sigma_1, \Sigma_2) \in \mathbb{R}^{n \times n}$ has $\Sigma_1$ and $\Sigma_2$ on its main diagonal. Given $\Omega \in \mathbb{R}^{n \times l}$, define $Y = (A A^T)^q A \Omega$. If $V_1^T \Omega$ has full row rank, then
$$\|(I - Y Y^\dagger) A\|^2 \leq \|(I - Y Y^\dagger)(A A^T)^q A\|^{2/(2q+1)} \leq \big( \|\Sigma_2\|^{4q+2} + \|\Sigma_2\|^{4q+2} \|(V_2^T \Omega)(V_1^T \Omega)^\dagger\|^2 \big)^{1/(2q+1)}.$$

These are our main tools for deriving the theoretical analysis. We now present our main results.

Theorem 1. Suppose we have a kernel matrix $K \in \mathbb{R}^{n \times n}$ constructed from shift-invariant functions and run Algorithm 1 to obtain features $G \in \mathbb{R}^{n \times l}$. Then, with limited failure probability,
$$\|K - G G^T\| \leq O(n/l). \qquad (4)$$

Proof. Without loss of generality, we analyze in detail the case where $F$ in step 1 corresponds to Eq. (2) and the Gaussian kernel function $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$; the theorem and analysis apply to the other cases as well. Revisiting the variables and parameters in Algorithm 1, we have
$$\|K - G G^T\| = \|K - F Q Q^T F^T\| \leq \|K - F F^T\| + \|F F^T - F Q Q^T F^T\|$$
$$= \|K - F F^T\| + \|F (I - Q Q^T)^2 F^T\| \qquad (5)$$
$$= \|K - F F^T\| + \|(I - Q Q^T) F^T\|^2 \qquad (6)$$
$$\leq O(n/\sqrt{d}) + O(n/l), \qquad (7)$$
where Eq. (5) follows from the fact that $(I - Q Q^T)$ is an orthogonal projection matrix, and Eq. (7) is proved as follows.

The first term in Eq. (7) is bounded by Proposition 1. Define $a = d/2$ and let $H_i \in \mathbb{R}^{n \times 2}$ be formed by columns $2i-1$ and $2i$ of $F$, i.e., $\{\cos(z_i^T x_j)/\sqrt{a},\ \sin(z_i^T x_j)/\sqrt{a}\}_{j=1}^n \in \mathbb{R}^{n \times 2}$, $i \in [a]$. Let $S_i = (K - a H_i H_i^T)/a$. Then $K - F F^T = \sum_{i=1}^a S_i$ and $\mathbb{E} S_i = 0$. $\mathbb{E}_{i \neq j}(S_i^T S_j) = 0$ also holds because $\{H_i\}_{i=1}^a$ are independent. Moreover, $\|K\| \leq n$ since $K_{jj} \leq 1$, and $\|H_i\|^2 = \|H_i H_i^T\| \leq (\sqrt{1/a})^2 n = n/a$. Thus, $\|S_i\| \leq 2n/a = L$. We accordingly define $\Xi_1 = \Xi_2 = \mathbb{E}(Z^T Z) = \mathbb{E}\big(\sum_{i=1}^a S_i\big)^T\big(\sum_{i=1}^a S_i\big) = a\, \mathbb{E}(S_i^T S_i)$, since $Z = \sum_{i=1}^a S_i$ is symmetric and $\mathbb{E}_{i \neq j}(S_i^T S_j) = 0$. Then
$$\xi = \|\Xi_1\| = \|a\, \mathbb{E}(K - a H_i H_i^T)^2\|/a^2 = \|a^2 \mathbb{E}(H_i H_i^T)^2 - K^2\|/a = \|a^2 \mathbb{E}(H_i H_i^T H_i H_i^T) - K^2\|/a \leq \|a^2 \mathbb{E}(\|H_i\|^2 H_i H_i^T) - K^2\|/a \leq \|n K - K^2\|/a = \|U(n\Sigma - \Sigma^2)U^T\|/a = \|n\Sigma - \Sigma^2\|/a \leq n^2/(4a),$$
where $U \Sigma U^T$ is the SVD of $K$ with the eigenvalues $\Sigma_{ii} = \sigma_i$ listed in descending order. In the above inequalities, the first one holds because $0 \preceq a^2 \mathbb{E}(H_i H_i^T H_i H_i^T) - K^2 \preceq a^2 \mathbb{E}(\|H_i\|^2 H_i H_i^T) - K^2$, due to $x^T H_i H_i^T H_i H_i^T x = \|x^T H_i H_i^T\|^2 \leq \|x^T H_i\|^2 \|H_i^T\|^2 = (x^T H_i H_i^T x)\, \|H_i^T\|^2$ for any $x \in \mathbb{R}^n$, and the last one follows from $\max_{0 \leq \sigma_i \leq n} |(n\sigma_i - \sigma_i^2)/a| = n^2/(4a)$. The ratio $r = \mathrm{Tr}(\Xi_1 + \Xi_2)/\xi$ usually satisfies $1 \leq r \leq O(n)$. We now obtain the concentration bound $P(\|K - F F^T\| \geq t) \leq 4r \exp\big\{\tfrac{-t^2/2}{n^2/(4a) + 2nt/(3a)}\big\}$. Let $\delta = 4r \exp\big\{\tfrac{-t^2/2}{n^2/(4a) + 2nt/(3a)}\big\}$ and $\theta = 4r/\delta$. Then
$$\|K - F F^T\| \leq \ln\theta \Big\{ \frac{2n}{3a} + \sqrt{\frac{4n^2}{9a^2} + \frac{n^2}{2a \ln\theta}} \Big\} = O(n/\sqrt{a}) = O(n/\sqrt{d})$$
holds with failure probability at most $\delta$, where we regard $\ln\theta$ as a small value, and one can easily check that $t \geq \sqrt{\xi} + L/3$ holds.

To bound the second term in Eq. (7), we use Proposition 2 and get
$$\|(I - Q Q^T) F^T\|^2 \leq \|(I - Y Y^\dagger) F^T\|^2 \qquad (8)$$
$$\leq \big( \|\Sigma_2\|^{4q+2} + \|\Sigma_2\|^{4q+2} \|(V_2^T \Theta)(V_1^T \Theta)^\dagger\|^2 \big)^{1/(2q+1)} \leq \big( 1 + \|(V_2^T \Theta)(V_1^T \Theta)^\dagger\|^2 \big)^{1/(2q+1)} \|\Sigma_2\|^2, \qquad (9)$$
where $F = U \Sigma V^T = U\,\mathrm{diag}(\Sigma_1, \Sigma_2)(V_1, V_2)^T$ with $\Sigma_1 \in \mathbb{R}^{k \times k}$ and $\Sigma_2 \in \mathbb{R}^{(d-k) \times (d-k)}$, and Eq. (8) holds because $\mathrm{span}(Y) \subseteq \mathrm{span}(Q)$. Notice that $\Theta$ is a centered Gaussian matrix, so $V_1^T \Theta$ has full row rank with probability one and Proposition 2 can be applied. The derivation of the upper bound on $\|(V_2^T \Theta)(V_1^T \Theta)^\dagger\|$ follows [Halko et al., 2011]. Combining and reformulating these results, we get that $\|(I - Q Q^T) F^T\|^2 \leq c_{k,q,d,p}^{1/(2q+1)}\, n/(k+1) = O(n/l)$ holds with failure probability at most $6\exp(-p)$, where $l = k + p$ with $p$ typically set to a small constant (e.g., 5), and $c_{k,q,d,p}$ denotes an expression in the variables $k, q, d, p$ that becomes smaller as $p$ increases; it is only sketched here due to limited space. $c_{k,q,d,p}^{1/(2q+1)}$ can be regarded as a small value for a slightly large $q$. By the union bound and setting $d = \Omega(l^2)$, we prove that Eq. (4) holds with failure probability at most $\delta + 6\exp(-p)$.

The bound in Eq. (7) is incurred jointly by RKS and FDSE. The first term of Eq. (7) is a spectral norm bound for the RKS features of Eq. (2) or Eq. (3), which is slightly tighter and more general than the expectation bound in [Lopez-paz et al., 2014], and reflects the same convergence rate in the number of RKS features as the element-wise bound of [Rahimi and Recht, 2007].
If $d = \Omega(l^2)$, then training on the $O(l)$ features generated by our two algorithms is much faster than training on the $O(l^2)$ features of RKS, with an approximation error on almost the same scale. In the experiments, we will examine accuracy versus the computation time spent on training and feature mapping.

Remark. [Hamid et al., 2014] approximates the polynomial kernel more concisely and accurately; our approach is similarly motivated but has important differences. First, we use FDSE instead of random projection, which approximates more accurately while maintaining computational efficiency. Second, their theoretical result shows that the extent of the approximation improvement relies on the degree of the polynomial kernel, while ours depends on the number of finally generated features, which directly demonstrates that we can train with much fewer informative features. Third, our experiments show that our method achieves an improvement for shift-invariant kernels, while [Hamid et al., 2014] even degrades the performance, which we conjecture is due to the different sampling procedures of the feature maps.

Theorem 2. Suppose we have a kernel matrix $K \in \mathbb{R}^{n \times n}$ constructed from shift-invariant functions and run Algorithm 2 to obtain features $G \in \mathbb{R}^{n \times l}$. Then, with limited failure probability,
$$\|K - G G^T\| \leq O(n/l). \qquad (10)$$

Proof. The proof is similar to that of Theorem 1 with $q = 0$ in Eq. (9). Then, with failure probability at most $c/k$, $\|(V_2^T \Theta)(V_1^T \Theta)^\dagger\|$ has a constant upper bound if $l = C k \log k$, where $c$ and $C$ are constants [Tropp, 2011]. Treating $\log k$ as a small value and using $\|\Sigma_2\|^2 \leq n/(k+1)$ completes the proof. Here, $q$ has to be zero, and $c/k$ can be larger than the $6\exp(-p)$ in Theorem 1. This implies that Algorithm 1 may be more accurate than Algorithm 2, although it runs more slowly.

3.3 Impact on Learning Tasks

In this part, we show how our features affect the learning accuracy on regression and classification tasks. The related algorithms, including KRR and SVM, can be unified into the following optimization problem [Yang et al., 2011a; 2011b]:
$$\min_{w} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell\{\hbar(w^T Z(x_i), y_i)\}, \qquad (11)$$
where $\ell(t)$ is a convex loss function such as $\ell(t) = t^2/2$, $\ell(t) = \log\{1 + \exp(-t)\}$, or $\ell(t) = \max(0, 1 - t)$, and $t = \hbar(w^T Z(x_i), y_i)$ denotes $w^T Z(x_i) - y_i$ or $w^T Z(x_i)\, y_i$. We also assume the magnitude of the gradient $\ell'(t)$ is upper bounded by $C$ (a nondifferentiable loss function such as $\ell(t) = \max(0, 1 - t)$ in SVM can be made differentiable by smoothing techniques [Nesterov, 2005; Zhang et al., 2013]).

Theorem 3. Suppose we obtain the kernel matrix $K \in \mathbb{R}^{n \times n}$ by applying shift-invariant functions to the data matrix $X^T = \{x_i \in \mathbb{R}^m\}_{i=1}^n$, and the feature matrix $G^T = \{G_i \in \mathbb{R}^l\}_{i=1}^n$ by Algorithm 1 or 2. Denote by $L(w^*)$ the optimal value of Eq. (11). Let $L(w^*_G)$ be the optimal value obtained by training on $Z(x_i) = G_i$, and $L(w^*_K)$ the optimal value obtained by training on $K$ (i.e., $Z(x_i) = \Phi(x_i)$), where $w^*_G \in \mathbb{R}^l$ and $w^*_K$ may be infinite-dimensional. Then, with limited failure probability,
$$L(w^*_G) \leq L(w^*_K) + O(1/l). \qquad (12)$$

Proof. Instead of directly applying the Representer Theorem [Schölkopf et al., 2001] to Eq. (11), a tighter result can be obtained via its Lagrangian dual.
Rewriting Eq. (11) as a constrained convex optimization problem by introducing new variables $g^T = \{g_i = \hbar(w^T Z(x_i), y_i)\}_{i=1}^n \in \mathbb{R}^{1 \times n}$, and combining the Lagrangian with the KKT conditions, we obtain an equivalent dual problem with no duality gap:
$$L(w^*) = \max_{\alpha} \; \sum_{i=1}^{n} \hat{\ell}(\alpha_i) - \frac{1}{2\lambda} \alpha^T Z(X) Z(X)^T \alpha - \alpha^T y,$$
where $\alpha^T = \{\alpha_i\}_{i=1}^n \in \mathbb{R}^{1 \times n}$, $y^T = \{y_i\}_{i=1}^n \in \mathbb{R}^{1 \times n}$, each row of $Z(X)$ is formed by $Z(x_i)$, $\hbar(w^T Z(x_i), y_i) = w^T Z(x_i) - y_i$ without loss of generality, and $\sum_{i=1}^n \hat{\ell}(\alpha_i) = \inf_{g}\{-\alpha^T g + \sum_{i=1}^n \ell(g_i)/n\}$ is assumed to be finite, with $\alpha_i = \ell'(g_i)/n$, $i \in [n]$. Then we have
$$L(w^*_G) = \max_{\alpha} \; \sum_{i=1}^{n} \hat{\ell}(\alpha_i) - \frac{1}{2\lambda} \alpha^T G G^T \alpha - \alpha^T y \leq L(w^*_K) + \max_{\alpha} \frac{1}{2\lambda} \alpha^T (K - G G^T) \alpha \leq L(w^*_K) + \frac{1}{2\lambda} \|\alpha\|^2 \|K - G G^T\| \leq L(w^*_K) + \frac{n (C/n)^2}{2\lambda}\, O\Big(\frac{n}{l}\Big) = L(w^*_K) + O\Big(\frac{C^2}{2\lambda l}\Big) = L(w^*_K) + O(1/l),$$
where the bound $\|\alpha\|^2 \leq n(C/n)^2$ follows from $|\alpha_i| = |\ell'(g_i)|/n \leq C/n$, $\lambda$ is assumed to be a certain fixed value, and the last inequality holds with limited failure probability by Theorem 1 or 2.

The above result extends to other features with different kernel approximation performance, showing that a faster convergence of the kernel approximation error leads to a faster convergence of the minimized regularized training error. This can be used as a simplified measurement to quantify the impact of kernel approximation on learning accuracy [Cortes et al., 2010; Cesa-Bianchi et al., 2014]. Therefore, applying features that ensure a good kernel approximation favors accurate learning on the learning tasks.

3.4 Computation Analysis

In this part, we compare the computational performance on the ridge regression task, written in the form of Eq. (11) with the squared loss. We use ridge regression because it gives closed-form solutions for both the mapped features and the standard kernel, so the computation time does not depend on a specific optimization technique. We summarize the results in Table 1, where $\mathrm{nnz}(\cdot)$ denotes the number of non-zero values and $t$ the number of test points. As we can see, with $d = O(l^2)$, our two algorithms take much less time and significantly reduce the overall time, with the dominant training time decreased. For RKS, only a seed is needed by the random number generator to reproduce the random sequence at test time [Dai et al., 2014]. After obtaining $w$ from Eq. (11), we simply multiply it by the $Q$ of Algorithm 1 or 2 for testing. Hence, our algorithms only store a seed and a vector $Q w$ for testing, taking storage comparable to the other methods.

| Method | Mapping | Training | Prediction |
| --- | --- | --- | --- |
| M1 (kernel method) | $O(\mathrm{nnz}(X)\, n)$ | $O(n^3)$ | $O(t m n)$ |
| M2 (RKS) | $O(\mathrm{nnz}(X)\, d)$ | $O(n d^2)$ | $O(t(m+1)d)$ |
| M3 (TEFM-G) | $O(\mathrm{nnz}(X)\, d + l^2 d + n l d)$ | $O(n l^2)$ | $O(t(m+1)d + dl)$ |
| M4 (TEFM-S) | $O(\mathrm{nnz}(X)\, d + l^2 d + n d \log l)$ | $O(n l^2)$ | $O(t(m+1)d + dl)$ |

Table 1: Time complexity of M1 to M4, i.e., the kernel method, RKS, TEFM-G, and TEFM-S.

4 Empirical Evaluation

In this section, we empirically demonstrate the superiority of the proposed methods on Gaussian kernel matrix approximation and the related regression task. We compare the following random feature map methods:

1. Random Kitchen Sinks (denoted RKS).
2. The Quasi-Monte Carlo method [Yang et al., 2014] (denoted Quasi). We use the Digital Net to generate the QMC sequence, as it yields the lowest approximation error and supports high-dimensional data [Yang et al., 2014].
3. Compact feature maps [Hamid et al., 2014] (denoted Comp). This method need not be restricted to the polynomial kernel, so we apply it to the Gaussian kernel in our experiments.
4. The Fastfood method [Le et al., 2013] (denoted Ffood). We use Hadamard features for their better performance.
5. Our proposed algorithms TEFM-G and TEFM-S.
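Before turning to the datasets, the following sketch (our own, using the phase-based map of Eq. (3), a squared loss, and arbitrary toy parameters) illustrates the pipeline analyzed in Section 3.4: train ridge regression on the compressed features G, and keep only the RKS seed plus the vector Qw for prediction.

```python
import numpy as np

def rks_map(X, d, sigma, seed):
    """RKS features as in Eq. (3); reproducible at test time from the seed alone."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((X.shape[1], d)) / sigma
    b = rng.uniform(0.0, 2 * np.pi, size=d)
    return np.sqrt(2.0 / d) * np.cos(X @ Z + b)

def tefm_ridge_train(X, y, d, l, sigma=1.0, lam=1e-2, q=1, seed=0):
    F = rks_map(X, d, sigma, seed)                       # step 1: n x d RKS features
    Y = F.T @ np.random.default_rng(seed + 1).standard_normal((F.shape[0], l))
    for _ in range(q):
        Y = F.T @ (F @ Y)
    Q, _ = np.linalg.qr(Y)                               # d x l basis
    G = F @ Q                                            # n x l compressed features
    w = np.linalg.solve(G.T @ G + len(y) * lam * np.eye(l), G.T @ y)  # O(n l^2) ridge solve
    return Q @ w                                         # only the seed and this d-vector are stored

def tefm_predict(X_test, qw, d, sigma, seed):
    return rks_map(X_test, d, sigma, seed) @ qw          # map test points; no need to rebuild Q

# toy run on synthetic data (for illustration only)
rng = np.random.default_rng(4)
X = rng.standard_normal((3000, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(3000)
qw = tefm_ridge_train(X[:2000], y[:2000], d=800, l=200)
pred = tefm_predict(X[2000:], qw, d=800, sigma=1.0, seed=0)
print(np.linalg.norm(y[2000:] - pred) / np.linalg.norm(y[2000:]))   # relative RMSE
```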
We use six publicly available real-world datasets, listed in Table 2; they can be downloaded from the LIBSVM website [Chang and Lin, 2011] or the UCI machine learning repository. All experiments are run in Matlab in single-thread mode in order to compare running times fairly.

| Dataset | # instances | # features |
| --- | --- | --- |
| Cpu | 6554 | 21 |
| A9a | 48842 | 123 |
| Blog Feedback | 60021 | 280 |
| Slice Localization | 53500 | 384 |
| UJIIndoor Loc | 21048 | 520 |
| Mnist | 70000 | 784 |

Table 2: Details of the datasets in our experiments.

4.1 Kernel Approximation Quality

We compare the features generated by the above methods on kernel approximation. We report the approximation accuracy of each method measured by $\|K - K_{\mathrm{app}}\| / \|K\|$, where $K_{\mathrm{app}}$ is the approximate kernel matrix reconstructed from the random features. To make the computation of the full kernel matrix $K$ feasible for comparison purposes, we randomly subsample the datasets so that only 6554, 8141, 8733, 8917, 6646, and 6006 instances are used, respectively. We summarize the results in Figure 1, where the x-axis is the number of features $l$.

[Figure 1: Kernel approximation error versus the number of features on the six datasets (including Blog Feedback, UJIIndoor Loc, and Slice Localization), comparing RKS, TEFM-G, TEFM-S, Comp, Quasi, and Ffood.]

In our two algorithms, the approximation error typically decreases as $d$ or $l$ increases; however, $d$ cannot be very large or even infinite, otherwise our two feature generation algorithms become time consuming. Thus, $d$ controls the tradeoff between accuracy and computation. If we keep $d = 4l$, Figure 1 shows that our algorithms incur nearly half the approximation error of the others, which almost coincides with our theoretical result. This implies that it is possible to train the learning tasks with fewer features while maintaining accuracy. In particular, our methods beat the counterpart Comp, which applies data-independent subspace embedding to the original features [Hamid et al., 2014]. This also indicates that data-independent subspace embedding is not suitable for random features of shift-invariant kernels.

4.2 Performance on KRR

First, we compare the prediction ability of the different features on KRR. The parameters are chosen by cross-validation. We report the relative root mean square error (RMSE) of each method on the test sets, defined by $\|y - y_o\| / \|y\|$, where $y_o$ is the predicted value and $y$ is the ground truth. We consider the regression dataset Slice Localization and list the results in Figure 2.

[Figure 2: Accuracy comparison on regression (Slice Localization): relative RMSE versus the number of features for RKS, TEFM-G, TEFM-S, Comp, Quasi, and Ffood.]

As we can see, using much fewer informative features, our two algorithms obtain the same accuracy as the others. Thus, training with our fewer features preserves the learning accuracy while requiring much less time. This also implies that accurate learning can be achieved by applying features that ensure a good kernel matrix approximation.
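For reference, the two error measures used in this section are straightforward to compute; the helper names below are our own, and the sketch assumes the exact kernel matrix K fits in memory, which is why the datasets are subsampled above.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Exact Gaussian kernel matrix, feasible only for the subsampled datasets."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))

def relative_kernel_error(K, Z):
    """Section 4.1 metric: ||K - K_app|| / ||K|| with K_app = Z Z^T (spectral norm, per the notation)."""
    return np.linalg.norm(K - Z @ Z.T, 2) / np.linalg.norm(K, 2)

def relative_rmse(y_true, y_pred):
    """Section 4.2 metric: ||y - y_o|| / ||y||."""
    return np.linalg.norm(y_true - y_pred) / np.linalg.norm(y_true)
```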
Second, we report the prediction error versus the total time (feature mapping time plus training time) in Figure 3. Our algorithms require slightly more feature map time. They, however, use fewer features to achieve the same approximation performance, which substantially decreases the training time. As can be seen, even when the feature map time is included, our two algorithms still outperform the other methods, achieving smaller RMSE. The advantage of our methods becomes more evident when a much smaller RMSE is required: to obtain it, the other methods must generate more features, and as the feature number grows large, the time used for training the regression task dominates the time used for the feature map.

[Figure 3: Time comparison on regression (Slice Localization): relative RMSE versus total feature mapping plus training time for TEFM-G, TEFM-S, Comp, Quasi, and the other baselines.]

5 Conclusion

In this paper, we describe an effective feature map method that generates better features so that learning tasks can be trained accurately and more efficiently. The basic idea of our method is to combine fast data-dependent subspace embedding with the RKS. We present two algorithms, TEFM-G and TEFM-S. Our theoretical analysis indicates that these two algorithms achieve better kernel approximation and faster training on learning tasks without losing precision. We report extensive empirical results on several real-world datasets, which support our analysis and demonstrate good practical performance.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. The work described in this paper was fully supported by the National Grand Fundamental Research 973 Program of China (No. 2014CB340401 and No. 2014CB340405), the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK 413213 and CUHK 14205214), and Microsoft Research Asia Regional Seed Fund in Big Data Research (Grant No. FY13-RES-SPONSOR-036).

References

[Bingham and Mannila, 2001] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 245-250. ACM, 2001.

[Cesa-Bianchi et al., 2014] Nicolò Cesa-Bianchi, Yishay Mansour, and Ohad Shamir. On the complexity of learning with kernels. arXiv preprint arXiv:1411.1158, 2014.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[Cortes et al., 2010] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. In International Conference on Artificial Intelligence and Statistics, pages 113-120, 2010.

[Dai et al., 2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041-3049, 2014.

[Halko et al., 2011] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

[Hamid et al., 2014] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. 2014.
[Le et al., 2013] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.

[Lopez-paz et al., 2014] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schoelkopf. Randomized nonlinear component analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1359-1367, 2014.

[Nelson and Nguyễn, 2014] Jelani Nelson and Huy L. Nguyễn. Lower bounds for oblivious subspace embeddings. In Automata, Languages, and Programming, pages 883-894. Springer, 2014.

[Nesterov, 2005] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[Rahimi and Recht, 2007] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2007.

[Rahimi and Recht, 2009] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313-1320, 2009.

[Rudin, 1990] Walter Rudin. Fourier Analysis on Groups. Number 12. John Wiley & Sons, 1990.

[Saunders et al., 1998] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning, pages 515-521. Morgan Kaufmann, 1998.

[Schölkopf et al., 2001] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In Computational Learning Theory, pages 416-426. Springer, 2001.

[Tropp, 2011] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011.

[Tropp, 2015] Joel A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

[Yang et al., 2011a] Haiqin Yang, Irwin King, and Michael R. Lyu. Sparse Learning Under Regularization Framework. LAP Lambert Academic Publishing, April 2011.

[Yang et al., 2011b] Haiqin Yang, Zenglin Xu, Jieping Ye, Irwin King, and Michael R. Lyu. Efficient sparse generalized multiple kernel learning. IEEE Transactions on Neural Networks, 22(3):433-446, March 2011.

[Yang et al., 2012] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476-484, 2012.

[Yang et al., 2014] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of the 31st International Conference on Machine Learning, pages 485-493, 2014.

[Zhang et al., 2013] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Recovering the optimal solution by dual random projection. In Conference on Learning Theory, pages 135-157, 2013.