Journal of Machine Learning Research 18 (2017) 1-32. Submitted 1/15; Revised 2/17; Published 3/17.

Memory Efficient Kernel Approximation

Si Si (sisidaisy@google.com), Google Research, Mountain View, CA 94043, USA*
Cho-Jui Hsieh (chohsieh@ucdavis.edu), Departments of Computer Science and Statistics, University of California, Davis, Davis, CA 95616, USA
Inderjit S. Dhillon (inderjit@cs.utexas.edu), Department of Computer Science, University of Texas at Austin, Austin, TX 78701, USA

Editor: Le Song

*This work was done before joining Google.

(c) 2017 Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-025.html.

Abstract

Scaling kernel machines to massive data sets is a major challenge due to storage and computation issues in handling large kernel matrices, which are usually dense. Recently, many papers have suggested tackling this problem by using a low-rank approximation of the kernel matrix. In this paper, we first make the observation that the structure of shift-invariant kernels changes from low-rank to block-diagonal (without any low-rank structure) when varying the scale parameter. Based on this observation, we propose a new kernel approximation framework, Memory Efficient Kernel Approximation (MEKA), which considers both the low-rank and the clustering structure of the kernel matrix. We show that the resulting algorithm outperforms state-of-the-art low-rank kernel approximation methods in terms of speed, approximation error, and memory usage. As an example, on the covtype dataset with half a million samples, MEKA takes around 70 seconds and uses less than 80 MB memory on a single machine to achieve 10% relative approximation error, while standard Nyström approximation is about 6 times slower and uses more than 400 MB memory to achieve a similar approximation. We also present extensive experiments on applying MEKA to speed up kernel ridge regression.

Keywords: kernel approximation, Nyström method, kernel methods

1. Introduction

Kernel methods (Schölkopf and Smola, 2002) are a class of machine learning algorithms that first map samples from the input space to a high-dimensional feature space. In the high-dimensional feature space, various methods can be applied depending on the machine learning task, for example, kernel support vector machines (SVM) (Cortes and Vapnik, 1995; Hsieh, Si, and Dhillon, 2014a) and kernel ridge regression (Saunders, Gammerman, and Vovk, 1998). A key issue in scaling up kernel machines is the storage and computation of the kernel matrix, which is usually dense. Storing the dense matrix takes O(n^2) space, while computing it requires O(n^2 d) operations, where n is the number of data points and d is the dimension. A common approach to achieve scalability is to approximate the kernel matrix using limited memory storage. This approach not only resolves the memory issue, but also speeds up kernel machine solvers, because the time complexity of using the kernel is usually proportional to the amount of memory used to represent it. Most kernel approximation methods aim to form a low-rank approximation G ≈ CC^T of the kernel matrix G, with C ∈ R^(n×k) and rank k ≪ n. Although it is well known that the Singular Value Decomposition (SVD) yields the best rank-k approximation, it often cannot be applied as it requires the entire kernel matrix to be computed and stored. To overcome this issue, many methods have been proposed to approximate the best rank-k approximation of a kernel matrix, including greedy basis selection techniques (Smola and Schölkopf, 2000), incomplete Cholesky decomposition (Fine and Scheinberg, 2001), and Nyström methods (Williams and Seeger, 2001).
However, it is unclear whether low-rank approximation is the most memory-efficient way to approximate a kernel matrix. In this paper, we first make the observation that for practically used shift-invariant kernels, the kernel structure varies from low-rank to block-diagonal as the scaling parameter γ varies from 0 to ∞. This observation suggests that even the best rank-k approximation can have extremely large approximation error when γ is large, so it is worth exploiting the block structure of the kernel matrix. Based on this idea, we propose a Memory Efficient Kernel Approximation (MEKA) framework to approximate the kernel matrix. Our proposed framework considers and analyzes the use of clustering in the input space to efficiently exploit the block structure of shift-invariant kernels. We show that the individual blocks generated by k-means clustering have low-rank structure, which motivates us to apply Nyström low-rank approximation to each block separately. Between-cluster blocks are then approximated in a memory-efficient manner. Our approach only needs O(nk + (ck)^2) memory to store a rank-ck approximation (where c ≪ n is the number of clusters), while traditional low-rank methods need O(nk) space to store a rank-k approximation. Therefore, using the same amount of storage, our method can achieve lower approximation error than the commonly used low-rank methods. Moreover, our proposed method takes less computation time than other low-rank methods to achieve a given approximation error. Theoretically, we show that under the same amount of storage, the error bound of our approach can be better than that of standard Nyström if the gap between the (k+1)-st and (ck+1)-st singular values of G is larger than 2‖Δ‖, where Δ is the matrix consisting of all between-cluster blocks. On real datasets, our proposed algorithm consumes less memory and computation time to achieve comparable reconstruction error.
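The low-rank-to-block-diagonal transition described above can be checked numerically on synthetic data. The following is a minimal NumPy sketch (not from the paper; `effective_rank` and its 1% threshold are our own illustrative choices):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Dense Gaussian kernel matrix K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def effective_rank(G, tol=1e-2):
    """Number of singular values above tol times the largest one."""
    s = np.linalg.svd(G, compute_uv=False)
    return int((s > tol * s[0]).sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
for gamma in (1e-3, 1e-1, 10.0):
    # small gamma: G is near rank-one (close to ee^T);
    # large gamma: G approaches the identity and the rank explodes
    print(gamma, effective_rank(gaussian_kernel(X, X, gamma)))
```

On data without cluster structure the rank simply grows with γ; on clustered data the large-γ matrix is additionally block-diagonal, which is the structure MEKA exploits.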
For example, on the covtype dataset with half a million samples, MEKA takes around 70 seconds and uses less than 80 MB memory on a single machine to achieve 10% relative approximation error, while standard Nyström approximation takes more than 400 seconds and uses more than 400 MB memory to achieve a similar approximation. Also, MEKA is faster for kernel ridge regression compared with other state-of-the-art kernel approximation methods. As an example, on the mnist2m dataset with 2 million samples, our method takes 550 seconds on a single machine using less than 500 MBytes memory to achieve accuracy comparable with standard Nyström approximation, which takes more than 2700 seconds and uses more than 2 GBytes memory on the same problem.

Parts of this paper have appeared previously in (Si, Hsieh, and Dhillon, 2014a). In this paper, we provide: (1) a more detailed survey of state-of-the-art methods; (2) much more comprehensive experimental comparisons; (3) a thorough investigation of the influence of the parameters in our method; (4) the application of the block structure of the kernel matrix to speeding up kernel SVM; and (5) more discussion, including how to achieve stable results and resolve non-PSD issues in MEKA.

The rest of the paper is outlined as follows. We first present related work in Section 2. We then explain the popular Nyström approximation method and present the motivation for our framework in Section 3. We then show the block structure of the kernel matrix and its application to speeding up kernel SVM in Section 4. Our main kernel approximation algorithm, MEKA, is proposed and analyzed in Section 5. Experimental results are given in Section 6, and conclusions and discussion are provided in Section 7.

2. Related Research

To approximate the kernel matrix using limited memory, one common way is to use a low-rank approximation.
The best rank-k approximation can be obtained by the SVD, but it is computationally prohibitive when n grows to tens of thousands. To address the scalability issue of SVD, approximate SVD solvers such as randomized SVD (Halko, Martinsson, and Tropp, 2011) have been widely used for large-scale data. To exploit the sparse structure of large-scale network data, an alternative is to apply CUR matrix decomposition (Mahoney and Drineas, 2009), which explicitly expresses the low-rank decomposition in terms of a small number of rows and columns of the original data matrix. Another approach builds a hierarchical tree to initialize a block Lanczos algorithm that efficiently computes the spectral decomposition of large-scale graphs (Si, Shin, Dhillon, and Parlett, 2014b). Unfortunately, to approximate kernel matrices, all the above approaches need to compute the entire kernel matrix, so the time complexity is at least O(dn^2). Many algorithms have been proposed to overcome the prohibitive time and space complexity of SVD for approximating kernel matrices. They can be categorized into two classes: methods that explicitly approximate kernel matrices, and methods that approximate the kernel function.

Approximating the kernel matrix. The first class of approaches approximates the kernel matrix based on a subset of sampled elements; as a result, all of them are data dependent. The Nyström method (Williams and Seeger, 2001) is the most widely used technique to approximate the kernel matrix given a sampled subset of columns. Given the sampled columns, the Nyström method requires O(nk^2 + k^3) time to form a rank-k approximation.
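The standard Nyström construction (sample columns, invert the small sampled block) can be sketched in a few lines of NumPy. This is a toy illustration under our own choice of kernel and sampling sizes; the guard against tiny singular values is an implementation detail, not part of the method's definition:

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom(X, m, k, gamma, rng):
    """Factors (C, P) of a standard Nystrom approximation G ~= C @ P @ C.T."""
    idx = rng.choice(X.shape[0], size=m, replace=False)  # uniform column sampling
    C = gaussian_kernel(X, X[idx], gamma)                # n x m columns of G
    M = C[idx]                                           # m x m block among samples
    U, s, _ = np.linalg.svd(M)
    keep = s[:k] > 1e-8 * s[0]                           # guard tiny singular values
    Uk, sk = U[:, :k][:, keep], s[:k][keep]
    return C, (Uk / sk) @ Uk.T                           # P = pseudo-inverse of M_k

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
C, P = nystrom(X, m=60, k=30, gamma=0.1, rng=rng)
G = gaussian_kernel(X, X, 0.1)
print("relative error:", np.linalg.norm(G - C @ P @ C.T) / np.linalg.norm(G))
```

Only C (n × m) and the small m × m block are ever formed, which is what gives Nyström its O(nk^2 + k^3) cost.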
Many strategies have been proposed to improve over the basic Nyström approximation, including ensemble Nyström (Kumar, Mohri, and Talwalkar, 2009), Nyström with k-means to obtain landmark points (Zhang, Tsang, and Kwok, 2008; Zhang and Kwok, 2010), randomized Nyström (Li, Kwok, and Lu, 2010), Nyström approximation with shift (Wang et al., 2014), adding pseudo landmark points to the Nyström approximation (Hsieh, Si, and Dhillon, 2014b), and fast-Nys, which uses a fast transform to generate structured landmark points to speed up Nyström approximation (Si, Hsieh, and Dhillon, 2016). Different Nyström sampling strategies are analyzed and compared in (Kumar, Mohri, and Talwalkar, 2012; Gittens and Mahoney, 2013). Besides Nyström approximation, Fine and Scheinberg (2001) use the incomplete Cholesky decomposition with pivoting for approximating kernel matrices, which requires O(nk^2 + nkd) time for computing a rank-k approximation. Bach and Jordan (2005) incorporate side information (labels) into the incomplete Cholesky decomposition, and show that the resulting problem can be solved with the same O(nk^2 + nkd) time complexity. Finally, Achlioptas, McSherry, and Schölkopf (2001) propose a sampling and reweighting approach to obtain an unbiased estimator of the kernel-vector product, and use subspace iteration to approximate the top k eigenvectors.

Approximating the kernel function. The second class of methods directly approximates the kernel function without computing elements of the kernel matrix, so the approximation does not depend on the data. To approximate the kernel function, a typical approach is to find a feature mapping Z : R^d → R^k such that the kernel function K(x, y) can be approximated by Z(x)^T Z(y). Rahimi and Recht (2007, 2008) define the random feature map for shift-invariant kernel functions based on the Fourier transform.
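For the Gaussian kernel, the Rahimi-Recht construction samples frequencies from the kernel's Fourier transform and builds cosine features. A minimal sketch (the function name and parameter choices are our own, not from the paper):

```python
import numpy as np

def random_fourier_features(X, k, gamma, rng):
    """n x k map Z with Z(x) . Z(y) ~= exp(-gamma * ||x - y||^2) in expectation."""
    d = X.shape[1]
    # frequencies sampled from the Fourier transform of the Gaussian kernel
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, k))
    b = rng.uniform(0.0, 2.0 * np.pi, size=k)            # random phases
    return np.sqrt(2.0 / k) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
Z = random_fourier_features(X, k=4000, gamma=0.5, rng=rng)
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print("max entrywise error:", np.abs(Z @ Z.T - exact).max())
```

The per-entry Monte Carlo error shrinks like 1/sqrt(k), and the map is data independent: W and b are drawn once, before seeing any samples.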
In the resulting Random Kitchen Sinks (RKS) algorithm, the main computation turns out to be the matrix-vector multiplication Wx_i for each instance x_i, where W is a Gaussian random matrix. To improve efficiency, Le, Sarlos, and Smola (2013) show that the computation of Wx_i can be sped up by the fast Hadamard transform. On the other hand, Yang et al. (2014) propose a quasi-Monte Carlo approach to improve the approximation performance of RKS. In addition to shift-invariant kernels, Kar and Karnick (2012) construct the random feature map for polynomial kernels, and Hamid et al. (2014) propose the condensed random feature map to improve performance. Besides the above approaches based on random feature maps, there are other methods that directly approximate the kernel function. Cotter, Keshet, and Srebro (2011) approximate the Gaussian kernel by the t-th order Taylor expansion, but this requires O(d^t) features, which is computationally burdensome for large d or t. Chang et al. (2010) propose to use the kernel expansion for low-degree polynomial kernels. Recently, Yang et al. (2012) showed that the Nyström method has a better generalization error bound than the RKS approach if the gap in the eigen-spectrum of the kernel matrix is large.

Most of the above methods can be viewed as faster ways to find a rank-k approximation of the kernel matrix, so the approximation error is always worse than that of the top-k SVD when using O(nk) memory. As we show in Section 3, the kernel matrix typically changes from low-rank to block structure as the scaling parameter γ increases. However, the block structure of the kernel matrix has never been considered in dense kernel approximation, although it has been studied for approximation of other types of matrices. For example, Savas and Dhillon (2011) applied Clustered Low Rank Approximation (CLRA) to approximate large and sparse social networks.
CLRA applies spectral or graph clustering to the adjacency matrix of a social network, runs SVD on each diagonal block, and uses matrix projection to capture off-diagonal information. All these steps require storage of the entire matrix, so they are infeasible for large-scale dense kernel matrices. For example, computing the entire kernel matrix of the mnist2m dataset would require about 8 TBytes of memory; moreover, the time for computing the SVD and projection steps is prohibitive. On the same dataset, our proposed approach can obtain accurate results in 10 minutes with only 500 MBytes memory (as shown in Section 6). To achieve this, we need entirely different algorithms from the ones in CLRA for clustering and approximating blocks to yield our memory-efficient scheme. More specifically, CLRA was designed to approximate sparse adjacency matrices, but cannot be directly applied to large kernel matrices, as that would require computation and storage of the entire kernel matrix at a cost of O(n^2 d) time and O(n^2) space. To overcome this problem, we propose the following innovations: (1) we perform clustering, and then apply Nyström approximation to within-cluster blocks to avoid computing all within-block entries; (2) we theoretically justify the use of k-means clustering to explore the block structure of the kernel; (3) we propose a sampling approach to capture between-block information; (4) we theoretically derive the error bound of our method and compare it with that of the traditional Nyström approach.

3. Preliminaries and Motivation

Let K(·, ·) denote the kernel function, and let G ∈ R^(n×n) be the corresponding kernel matrix with G_ij = K(x_i, x_j), where x_i, x_j ∈ R^d are data points. Computing and storing the kernel matrix G usually takes O(n^2 d) time and O(n^2) space, which is prohibitive when there are millions of samples. One way to deal with these challenges is to approximate the dense kernel matrix G by a low-rank approximation G̃.
By doing this, kernel machines are transformed into linear problems, which can be solved efficiently. The best rank-k approximation of G is given by its singular value decomposition (SVD), i.e., G ≈ U_k Σ_k U_k^T, where Σ_k is the diagonal matrix of the largest k singular values and U_k contains the corresponding singular vectors. However, computing the SVD of G is computationally prohibitive and memory intensive. Many fast kernel approximation algorithms have thus been proposed and studied. Nyström kernel approximation is a widely used approach; it uses a sample of m data points and does not need to form the entire G explicitly to generate its low-rank approximation. In standard Nyström approximation (proposed in Williams and Seeger (2001)), we first sample m data points uniformly at random and assemble the corresponding m columns of G as the n × m matrix C. Let M be the m × m kernel matrix between the m sampled points; then the standard Nyström method generates a rank-k approximation to G as

    G ≈ G̃ = C M_k^+ C^T,    (1)

where M_k is the best rank-k approximation of M (by SVD) and M_k^+ is its pseudo-inverse. Various extensions to this Nyström-based kernel approximation have been proposed. For example, k-means Nyström (Zhang, Tsang, and Kwok, 2008; Zhang and Kwok, 2010) uses cluster centroids as the landmark points to form C; ensemble Nyström (Kumar, Mohri, and Talwalkar, 2009) combines a collection of standard Nyström approximations. We will compare state-of-the-art Nyström-based methods in Section 6.

We use the Gaussian kernel as an example to discuss the structure of the kernel matrix under different scale parameters. Given two samples x_i and x_j, the Gaussian kernel is given by K(x_i, x_j) = exp(-γ ||x_i - x_j||_2^2), where γ is a scale or width parameter; the corresponding kernel matrix entries are G_ij = K(x_i, x_j). Low-rank approximation has been widely used to obtain an approximation for kernel matrices.
However, under different scale parameters, the kernel matrix has quite different structures, suggesting that different approximation strategies should be used for different γ. Let us examine two extreme cases of the Gaussian kernel: when γ → 0, G → ee^T, where e = [1, . . . , 1]^T. As a consequence, G is close to low-rank when γ is small. However, at the other extreme, as γ → ∞, G approaches the identity matrix, which has full rank with all eigenvalues equal to 1. In this case, G does not have a low-rank structure, but has a block/clustering structure. This observation motivates us to consider both the low-rank and the clustering structure of the kernel matrix. Figures 1a and 1b give an example of the structure of a Gaussian kernel with different γ on a real dataset, using 5000 samples randomly drawn from the covtype dataset. Before discussing further details, we first contrast the use of block and low-rank approximations on the same dataset. We compare approximation errors for different methods when they use the same amount of memory in Figure 1c. Clearly, low-rank approximation methods work well only for very small γ values. Block Kernel Approximation (BKA), as proposed in Section 4.1, is a simple way to use the clustering structure of G that is effective for large γ. Our proposed algorithm, MEKA, considers both the block and the low-rank structure of the kernel, and thus performs better than the others under different γ values, as seen in Figure 1c.

Figure 1: (a) The Gaussian kernel matrix with γ = 0.1 on the covtype dataset; (b) the Gaussian kernel matrix with γ = 1 on the covtype dataset; (c) comparison of different kernel approximation methods for various γ. Panels (a) and (b) show that the structure of the Gaussian kernel matrix K(x, y) = exp(-γ ||x - y||^2) for the covtype data tends to become more block-diagonal as γ increases (dark regions correspond to large values, while lighter regions correspond to smaller values).
Plot (c) shows that low-rank approximations work only for small γ and Block Kernel Approximation (BKA) works only for large γ, while our proposed method MEKA works for small as well as large γ.

4. Block Kernel Approximation

In this section, we first introduce Block Kernel Approximation (BKA), a simple way to exploit the clustering structure of kernel matrices, and then show its application to speeding up kernel SVM.

4.1 Clustering Structure of Shift-invariant Kernel Matrices

There has been substantial research on approximating shift-invariant kernels (Rahimi and Recht, 2007). A kernel function K(x_i, x_j) is shift-invariant if the kernel value depends only on x_i - x_j, that is, K(x_i, x_j) = f(η(x_i - x_j)), where f(·) is a function that maps R^d to R, and η > 0 is a constant that determines the scale of the data. η is crucial to the performance of kernel machines and is usually chosen by cross-validation. We further define g_u(t) = f(ηtu) to be a one-variable function along the direction of u, where u is a unit vector. We assume the kernel function satisfies the following property:

Assumption 1: g_u(t) is differentiable for all t ≠ 0.

Most practically used shift-invariant kernels satisfy this assumption, for example, the Gaussian kernel (K(x, y) = exp(-γ ||x - y||_2^2)) and the Laplacian kernel (K(x, y) = exp(-γ ||x - y||_1)). It is clear that η^2 is equivalent to γ for the Gaussian kernel when it is written in the form K(x, y) = f(η(x - y)).

When η is large, the off-diagonal blocks of shift-invariant kernel matrices become small, and most of the information is concentrated in the diagonal blocks. To approximate the kernel matrix by exploiting this clustering structure, we first present a simple Block Kernel Approximation (BKA) as follows. Given a good partition V_1, . . . , V_c of the data points, where each V_s is a subset of {1, . . . , n}, BKA approximates the kernel matrix by the block-diagonal matrix

    G̃ = diag(G^(1,1), G^(2,2), . . . , G^(c,c)).    (2)

Here, G^(s,s) denotes the kernel matrix for block V_s; note that this implies that the diagonal blocks satisfy G̃^(s,s) = G^(s,s), while all the off-diagonal blocks satisfy G̃^(s,t) = 0 for s ≠ t. BKA is useful when η is large. By analyzing its approximation error, we now show that k-means in the input space can be used to capture the clustering structure of shift-invariant kernels. The approximation error satisfies ||G - G̃||_F^2 = Σ_{i,j} K(x_i, x_j)^2 - Σ_{s=1}^{c} Σ_{i,j ∈ V_s} K(x_i, x_j)^2. Since the first term is fixed, minimizing the error ||G - G̃||_F^2 is the same as maximizing the second term, the sum of squared within-cluster entries D = Σ_{s=1}^{c} Σ_{i,j ∈ V_s} K(x_i, x_j)^2. However, directly maximizing D will not give a useful partition: the maximizer will assign all the data to one cluster. The same problem occurs in graph clustering (Shi and Malik, 2000; von Luxburg, 2007). A common approach is to normalize D by each cluster's size |V_s|. The resulting spectral clustering objective (also called ratio association) is:

    D_kernel({V_s}_{s=1}^{c}) = Σ_{s=1}^{c} (1/|V_s|) Σ_{i,j ∈ V_s} K(x_i, x_j)^2.    (3)

Maximizing (3) usually yields a balanced partition, but the computation is expensive because we have to compute all the entries of G. In the following theorem, we derive a lower bound for D_kernel({V_s}_{s=1}^{c}):

Theorem 1: For any shift-invariant kernel that satisfies Assumption 1,

    D_kernel({V_s}_{s=1}^{c}) ≥ C - η^2 R^2 D_kmeans({V_s}_{s=1}^{c}),    (4)

where C = n f(0)^2 / 2, R is a constant depending on the kernel function, and D_kmeans = Σ_{s=1}^{c} Σ_{i ∈ V_s} ||x_i - m_s||_2^2 is the k-means objective function, with cluster centers m_s = (Σ_{i ∈ V_s} x_i)/|V_s|, s = 1, . . . , c.

Proof: We use u to denote the unit vector in the direction of x_i - x_j (x_i ≠ x_j). By the mean value theorem, we have K(x_i, x_j) = g_u(η ||x_i - x_j||_2) = g_u(0) + η g'_u(s) ||x_i - x_j||_2 for some s ∈ (0, η ||x_i - x_j||_2). By definition, f(0) = g_u(0), so

    f(0) ≤ K(x_i, x_j) + ηR ||x_i - x_j||_2,    (5)

where

    R := sup_{θ ∈ R, ||v|| = 1} |g'_v(θ)|.    (6)

Squaring both sides of (5), we have f(0)^2 ≤ K(x_i, x_j)^2 + η^2 R^2 ||x_i - x_j||_2^2 + 2 K(x_i, x_j)(ηR ||x_i - x_j||_2).
From the classical arithmetic-geometric mean inequality, we can upper bound the last term by 2 K(x_i, x_j)(ηR ||x_i - x_j||_2) ≤ K(x_i, x_j)^2 + η^2 R^2 ||x_i - x_j||_2^2, therefore

    f(0)^2 / 2 ≤ K(x_i, x_j)^2 + η^2 R^2 ||x_i - x_j||_2^2.    (7)

Plugging (7) into (3), we have

    D_kernel({V_s}_{s=1}^{c}) ≥ Σ_{s=1}^{c} (1/|V_s|) Σ_{i,j ∈ V_s} ( f(0)^2 / 2 - η^2 R^2 ||x_i - x_j||_2^2 ),

which can be manipulated to prove the desired bound (4).

Interestingly, the right-hand side of (4) is maximized when the k-means objective function D_kmeans is minimized. Therefore, although optimal solutions for k-means and ratio association might be different (consider the two-circles data, where each circle forms a cluster), Theorem 1 shows that conducting k-means in the input space provides a reasonably good way to exploit the clustering structure of shift-invariant kernels, especially when it is infeasible to perform spectral clustering on G, which might require precomputation of the entire kernel matrix. Figure 2 shows that the partition from k-means can often work as well as spectral clustering on G, which directly optimizes D_kernel, and both are much better than random partitions. One advantage of conducting k-means is that the time complexity of each iteration is O(ndc), which is much less than computing the kernel when the dimensionality d ≪ n.

Figure 2: The Gaussian kernel approximation error of BKA using different ways to generate five partitions on 500 samples from covtype and cadata; on these data sets, k-means in the input space performs similarly to spectral clustering on the kernel matrix, but is more efficient.

4.2 Speeding up Kernel SVM with BKA

In this section we show how to use Block Kernel Approximation (BKA) to divide the kernel SVM problem into subproblems and significantly speed up the computation. Given a set of instance-label pairs (x_i, y_i), i = 1, . . .
, n, with x_i ∈ R^d and y_i ∈ {-1, +1}, the main task in training the kernel SVM is to solve the following quadratic optimization problem:

    min_α f(α) = (1/2) α^T Q α - e^T α,  s.t. 0 ≤ α ≤ C,    (8)

where e is the vector of all ones; C is the balancing parameter between loss and regularization in the SVM primal problem; α ∈ R^n is the vector of dual variables; and Q is an n × n matrix with Q_ij = y_i y_j G_ij, where G_ij = K(x_i, x_j) is the kernel value between the i-th and j-th samples. Letting α* denote the optimal solution of (8), the decision value for a test point x can be computed by

    Σ_{i=1}^{n} α*_i y_i K(x, x_i).    (9)

Due to the high computation cost of directly solving the kernel SVM, we can approximate the kernel matrix G by BKA to divide the whole kernel SVM problem into subproblems, where each subproblem can be handled efficiently and independently. To do this, we first partition the dual variables into k subsets {V_1, . . . , V_k}, where {V_1, . . . , V_k} are the partitions generated by performing k-means on the data points, and then solve the respective subproblems independently:

    min_{α^(c)} (1/2) (α^(c))^T Q^(c,c) α^(c) - e^T α^(c),  s.t. 0 ≤ α^(c) ≤ C,    (10)

where c = 1, . . . , k, α^(c) denotes the subvector {α_i | i ∈ V_c}, and Q^(c,c) is the submatrix of Q with row and column indices in V_c.

Table 1: Dataset statistics.

    dataset   training samples   testing samples   d
    ijcnn1    49,990             91,701            22
    census    159,619            39,904            409
    covtype   464,810            116,202           54

The quadratic programming problem (8) has n variables, and takes at least O(n^2) time to solve in practice. By dividing it into k subproblems (10) of equal size, the time complexity for solving the subproblems can be reduced to O(k · (n/k)^2) = O(n^2/k). Moreover, the space requirement is also reduced from O(n^2) to O(n^2/k^2). After computing all the subproblem solutions, we concatenate them to form an approximate solution for the whole problem, ᾱ = [ᾱ^(1), . . . , ᾱ^(k)], where ᾱ^(c) is the optimal solution for the c-th subproblem.
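The divide-and-conquer scheme above can be sketched end to end in NumPy. This is a toy illustration, not the paper's implementation: `solve_box_qp` is a bare projected-gradient stand-in for a real SVM dual solver, and `kmeans` is a minimal Lloyd iteration; all names are our own.

```python
import numpy as np

def kmeans(X, c, rng, iters=25):
    """Plain Lloyd's algorithm; returns a cluster label per point."""
    centers = X[rng.choice(len(X), size=c, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for s in range(c):
            if (labels == s).any():
                centers[s] = X[labels == s].mean(0)
    return labels

def solve_box_qp(Q, C, iters=500):
    """Projected gradient for min 0.5 a'Qa - e'a  s.t. 0 <= a <= C."""
    a = np.zeros(len(Q))
    step = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)
    for _ in range(iters):
        a = np.clip(a - step * (Q @ a - 1.0), 0.0, C)
    return a

def bka_svm(X, y, gamma, C, c, rng):
    """Solve one dual subproblem per k-means cluster; concatenate the solutions."""
    labels = kmeans(X, c, rng)
    alpha = np.zeros(len(X))
    for s in range(c):                          # subproblems are independent
        I = np.where(labels == s)[0]
        sq = ((X[I][:, None] - X[I][None]) ** 2).sum(-1)
        Q = np.outer(y[I], y[I]) * np.exp(-gamma * sq)
        alpha[I] = solve_box_qp(Q, C)
    return alpha

def decision(alpha, X, y, Xtest, gamma):
    """Decision values per (9), using the concatenated approximate alpha."""
    sq = ((Xtest[:, None] - X[None]) ** 2).sum(-1)
    return np.exp(-gamma * sq) @ (alpha * y)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (40, 2)), rng.normal(4.0, 0.5, (40, 2))])
y = np.array([1.0] * 40 + [-1.0] * 40)
alpha = bka_svm(X, y, gamma=0.5, C=1.0, c=2, rng=rng)
print("train accuracy:", (np.sign(decision(alpha, X, y, X, gamma=0.5)) == y).mean())
```

Each subproblem only ever touches its own |V_s| × |V_s| kernel block, which is the source of the O(n^2/k) time and O(n^2/k^2) space reductions discussed above.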
4.3 Comparing BKA-SVM with Low-rank Kernel SVM Solvers

We now compare the block-structure-based kernel SVM solver, BKA-SVM, with low-rank-structure-based kernel SVM solvers. All the experiments are conducted on an Intel 2.66 GHz CPU with 8 GB RAM. We use 3 benchmark datasets, as shown in Table 1. The three datasets can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets or the UCI data repository. We use a random 80%-20% split for covtype, and the original training/testing split for the other datasets.

4.3.1 Competing Methods

We include the following exact kernel SVM solver (LIBSVM) and approximate low-rank SVM solvers (LLSVM, Fastfood) in our comparison:

1. LIBSVM: the implementation in the LIBSVM library (Chang and Lin, 2011) with a small modification to handle SVM without the bias term; we observe that LIBSVM has similar test accuracy with and without the bias. We also include the results for LIBSVM with random 1/5 subsamples of each dataset in Table 2.

2. LLSVM: the improved Nyström method for nonlinear SVM by (Wang et al., 2011). We solve the resulting linear SVM problem by the dual coordinate descent solver in LIBLINEAR (Hsieh et al., 2008).

3. Fastfood: uses random Fourier features to approximate the kernel function (Le et al., 2013). We solve the resulting linear SVM problem by the dual coordinate descent solver in LIBLINEAR.

Both LLSVM and Fastfood use a low-rank representation to approximate the kernel matrix so as to speed up solving the kernel SVM.

Table 2: Comparison on real datasets using the RBF kernel.

                        ijcnn1               census                covtype
                        (C = 32, γ = 2)      (C = 512, γ = 2^-9)   (C = 32, γ = 32)
                        time(s)   acc(%)     time(s)   acc(%)      time(s)   acc(%)
    BKA-SVM             5         98.49      838       94.72       436       96.05
    LIBSVM              115       98.69      2920      94.2        83631     96.15
    LIBSVM (subsample)  6         98.24      641       93.2        5330      92.46
    LLSVM               38        98.23      1212      92.8        4451      84.21
    Fastfood            87        95.95      851       91.6        8550      80.1

4.3.2 Parameter Setting

We consider the RBF kernel K(x_i, x_j) = exp(-γ ||x_i - x_j||_2^2).
We use the same kernel function for both the training and test phases. We chose the balancing parameter C and the kernel parameter γ by 5-fold cross validation on a grid of points: C = [2^-10, 2^-9, . . . , 2^10] and γ = [2^-10, . . . , 2^10] for ijcnn1, census, and covtype. For BKA-SVM, we set the number of clusters to 64 for these three datasets. There is a tradeoff between the number of clusters and prediction accuracy. If we increase the number of clusters, BKA-SVM becomes faster, but the prediction accuracy mostly decreases. On the other hand, if the number of clusters is set smaller, BKA-SVM can achieve higher accuracy (in most cases), but takes more time to train. The following are the parameter settings for the other methods in Table 2: the rank is set to 3000 in LLSVM; the number of Fourier features is 3000 in Fastfood[1]; the tolerance in the stopping condition for LIBSVM is set to 10^-3 (the default setting of LIBSVM).

[1] In Fastfood we control the number of blocks so that the number of Fourier features is close to 3000 for each dataset.

Table 2 presents the time taken and test accuracies. The experimental results show that BKA-SVM achieves near-optimal test performance. We also observe that BKA-SVM performs better than the low-rank-approximation-based methods for the kernel SVM problem, showing the benefit of using the block structure of the kernel matrix. We can see that BKA exploits the block structure of the kernel matrix and can speed up the training of kernel SVM. As shown in Tandon et al. (2016), BKA can also be used to speed up the kernel ridge regression problem. Regarding the memory requirement, which is the main theme of this paper, BKA takes O(n^2/k) memory to approximate the kernel matrix, while popular low-rank-based kernel approximation methods are more memory efficient and only need linear memory to represent the kernel matrix.

5. Memory Efficient Kernel Approximation

There are two main drawbacks of the BKA approach: (i) it ignores all off-diagonal blocks, which results in large error when η is small (as seen in Figure 1(c)); (ii) for large-scale kernel approximation, it is too expensive to compute and store all the diagonal block entries. To overcome these two drawbacks, we propose to use a low-rank representation for each block of the kernel matrix.

5.1 Low-rank Structure of Each Block

To motivate the use of low-rank representations in our proposed method, we first present the following bound:

Theorem 2: Given data points x_1, . . . , x_n ∈ R^d and a partition V_1, . . . , V_c, and assuming f is Lipschitz continuous, then for any s, t (whether s = t or s ≠ t),

    ||G^(s,t) - G^(s,t)_k||_F ≤ 4 C k^(-1/d) sqrt(|V_s| |V_t|) min(r_s, r_t),

where G^(s,t)_k is the best rank-k approximation to G^(s,t); C is the Lipschitz constant of the shift-invariant function f; and r_s is the radius of the s-th cluster.

Proof: To prove this theorem, we use the ϵ-net theorem in (Cucker and Smale, 2001), which states that when all the data x_i ∈ R^d, i = 1, . . . , n, lie in a ball with radius r, there exist T = (4r/r̄)^d balls of radius r̄ that cover all the data points. If we set T to be k, then r̄ = 4r k^(-1/d). Let x_1, . . . , x_{n_s} be the data points in the s-th cluster, let y_1, . . . , y_{n_t} be the data points in the t-th cluster, and let n_s = |V_s|, n_t = |V_t|. Our goal is to show that G^(s,t) is low-rank, where G^(s,t)_{i,j} = K(x_i, y_j). Since r_t is the radius of the t-th cluster, we can find k balls with radius r̄ = 4 r_t k^(-1/d) that cover {y_j}_{j=1}^{n_t}. Let the centers of the balls for the t-th cluster be m_1, m_2, . . . , m_k; then we can form a low-rank matrix G̃^(s,t) = U V^T, where for all i = 1, . . . , n_s, j = 1, . . . , n_t, and q = 1, . . . , k,

    U_{i,q} = K(x_i, m_q)  and  V_{j,q} = 1 if y_j ∈ Ball(m_q), 0 otherwise.
Assume y_j is in ball q; then

  (G^(s,t)_{ij} − G̃^(s,t)_{ij})² = (f(xᵢ − y_j) − f(xᵢ − m_q))² ≤ C² ‖(xᵢ − y_j) − (xᵢ − m_q)‖² = C² ‖y_j − m_q‖² ≤ C² r̃².

Therefore, if (G^(s,t))_k is the best rank-k approximation for G^(s,t), then

  ‖G^(s,t) − (G^(s,t))_k‖_F ≤ ‖G^(s,t) − G̃^(s,t)‖_F ≤ 4C k^(−1/d) r_t √(n_s n_t).  (11)

Similarly, by covering the s-th cluster {xᵢ}_{i=1}^{n_s} with k balls we obtain

  ‖G^(s,t) − (G^(s,t))_k‖_F ≤ 4C k^(−1/d) r_s √(n_s n_t).  (12)

Combining (11) and (12) gives the result.

Table 3: Numerical rank of each of the 5 × 5 blocks (from a subsampled ijcnn1 data set) using different partition strategies: (a) by k-means clustering; (b) by a random partition.

(a) k-means clustering:
  16  14  13   7   7
  14  29  13   9   9
  13  13  20  10  10
   7   9  10  29  11
   7   9  10  11  28

(b) a random partition:
  139  99 101  44  45
   99 116  86  43  44
  101  86 131  46  47
   44  43  46  47  45
   45  44  47  45  49

Theorem 2 suggests that each block (diagonal or off-diagonal) of the kernel matrix will be low-rank if we find the partition by k-means in the input space and the radius of each cluster is small. In the following we present empirical confirmation of this result. In Table 3, we present the numerical rank of each block, where the numerical rank of an m × n matrix A is defined as the number of singular values with magnitude larger than max(m, n) · ‖A‖₂ · δ, with tolerance δ = 10⁻⁶. We sample 4000 data points from the ijcnn1 dataset and generate 5 clusters by k-means and by a random partition. Table 3a shows the numerical rank of each block using k-means, while Table 3b shows the numerical rank of each block when the partition is random. We observe that by using k-means, the rank of each block is fairly small.

5.2 Memory Efficient Kernel Approximation (MEKA)

Based on the above observation, we propose a fast and memory efficient scheme to approximate shift-invariant kernel matrices.
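The effect in Table 3 is easy to reproduce. The sketch below is an illustrative setup, not the paper's ijcnn1 experiment: two synthetic Gaussian clusters stand in for a k-means partition, and we compare the numerical rank of each block of a Gaussian kernel matrix under a clustered split versus a random split.

```python
import numpy as np

def numerical_rank(A, delta=1e-6):
    """Number of singular values above max(A.shape) * ||A||_2 * delta."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > max(A.shape) * s[0] * delta))

rng = np.random.default_rng(0)
# Two well-separated clusters of 200 points each in 5 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)),
               rng.normal(6.0, 1.0, (200, 5))])
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-0.1 * sq_dist)                      # Gaussian kernel, gamma = 0.1

clustered = [np.arange(200), np.arange(200, 400)]   # stand-in for k-means
perm = rng.permutation(400)
random_split = [perm[:200], perm[200:]]

for name, parts in [("clustered", clustered), ("random", random_split)]:
    ranks = [[numerical_rank(G[np.ix_(vs, vt)]) for vt in parts] for vs in parts]
    print(name, ranks)
```

On data with clear cluster structure, the blocks induced by the clustered split have markedly smaller numerical rank than those of the random split, mirroring Table 3.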
As suggested by Theorem 2, each block tends to be low-rank after k-means clustering; thus we could form a rank-k approximation for each of the c² blocks separately to achieve low error. However, this approach would require O(cnk) memory, which can be prohibitive. Therefore, our proposed method first performs k-means clustering and, after rearranging the matrix according to clusters, computes low-rank bases only for the diagonal blocks (which are more dominant than the off-diagonal blocks) and uses them to approximate the off-diagonal blocks. Empirically, we observe that the principal angles between the dominant singular subspaces of a diagonal block and an off-diagonal block are small (as shown in Figure 3). In Figure 3, we randomly sampled 1000 data points from the covtype and ijcnn1 datasets and generated 5 clusters by k-means for each dataset. The blue line shows the cosines of the principal angles between the dominant singular subspace of a diagonal block G^(s,s) and that of an off-diagonal block G^(s,t) for different ranks k, where s and t are randomly chosen. We observe that most of the cosines are close to 1, showing that there is substantial overlap between the dominant singular subspaces of the diagonal and off-diagonal blocks. Our proposed approach therefore focuses on the diagonal blocks and spends less effort on the off-diagonal blocks.

Figure 3: The cosines of the principal angles between the dominant singular subspaces of a diagonal block and an off-diagonal block of a Gaussian kernel ((a): γ = 0.1 and 1000 random samples from covtype; (b): γ = 1 and 1000 random samples from ijcnn1) with respect to different ranks. The cosines of the principal angles are close to 1, showing that the two subspaces are similar.

Assume the rank-k_s approximation of the s-th diagonal block is W^(s) L^(s,s) (W^(s))ᵀ. We form the following memory-efficient kernel approximation:

  G̃ = W L Wᵀ,  (13)

where W is the block-diagonal matrix

  W = diag(W^(1), W^(2), …, W^(c)),

and L is a link matrix consisting of c² blocks, where each k_s × k_t block L^(s,t) captures the interaction between the s-th and t-th clusters. For now, let us assume k_s = k for all s (we will discuss different strategies to choose k_s later). Note that if we were to restrict L to be a block-diagonal matrix, G̃ would still be a block-diagonal approximation of G. However, we consider the more general case where L is dense. In this case, each off-diagonal block G^(s,t) is approximated as W^(s) L^(s,t) (W^(t))ᵀ, and this approximation is memory efficient, as only O(k²) additional memory is required to represent the (s, t) off-diagonal block. If a rank-k approximation is used within each cluster, the resulting approximation has rank ck but needs only O(nk + (ck)²) total storage.

Computing W^(s). Since we aim to deal with dense kernel matrices of huge size, we use the standard Nyström approximation to compute the low-rank basis of each diagonal block. When applying the standard Nyström method to an n_s × n_s block G^(s,s), we sample m columns from G^(s,s), evaluate their kernel values, compute the rank-k pseudo-inverse of an m × m matrix, and form G^(s,s) ≈ W^(s) L^(s,s) (W^(s))ᵀ. The time required per block is O(n_s m(k + d) + m³), so our method requires a total of O(nm(k + d) + cm³) time to form W. We can replace Nyström by any other low-rank approximation method discussed in Section 2. In Figure 4 we compare using standard Nyström (Nys) (Williams and Seeger, 2001), k-means Nyström (KNys) (Zhang and Kwok, 2010), and Nyström with randomized SVD (RNys) (Li, Kwok, and Lu, 2010) to generate W^(s) on the ijcnn1 and cadata datasets. As shown in Figure 4, the standard Nyström method combined with MEKA gives excellent performance.

Figure 4: Comparison of using Nys, KNys, and RNys to obtain the basis W^(s) for the diagonal blocks in MEKA on the ijcnn1 and cadata datasets.
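The subspace-overlap observation behind Figure 3 can be sketched numerically. In this toy setup (synthetic clusters, not the covtype/ijcnn1 samples used in the paper), we project an off-diagonal block onto the top singular subspaces of the two diagonal blocks and check how little is lost:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters of 150 points each in 4 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (150, 4)),
               rng.normal(3.0, 1.0, (150, 4))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-0.1 * sq)                     # Gaussian kernel, gamma = 0.1

G_ss, G_tt, G_st = G[:150, :150], G[150:, 150:], G[:150, 150:]
k = 20
U_s = np.linalg.svd(G_ss)[0][:, :k]       # top-k basis of diagonal block s
U_t = np.linalg.svd(G_tt)[0][:, :k]       # top-k basis of diagonal block t

# Approximate the off-diagonal block using only the diagonal-block bases.
G_st_hat = U_s @ (U_s.T @ G_st @ U_t) @ U_t.T
rel_err = np.linalg.norm(G_st - G_st_hat) / np.linalg.norm(G_st)
print(f"relative error of off-diagonal block: {rel_err:.2e}")
```

A small relative error means the diagonal-block bases already span the off-diagonal block well, which is exactly what licenses sharing the bases W^(s) across all blocks in the approximation G̃ = W L Wᵀ.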
The x-axis shows the computation time and the y-axis shows the relative kernel approximation error (%).

Computing L^(s,t). The optimal least squares solution for L^(s,t) (s ≠ t) is the minimizer of the local approximation error ‖G^(s,t) − W^(s) L^(s,t) (W^(t))ᵀ‖_F. However, forming the entire G^(s,t) block can be time consuming; for example, computing the whole kernel matrix for mnist2m with 2 million data points takes more than a week. Therefore, to compute L^(s,t), we propose to randomly sample a (1 + ρ)k × (1 + ρ)k submatrix Ĝ^(s,t) from G^(s,t), and then find the L^(s,t) that minimizes the error on this submatrix. If the row/column index sets of the subsampled submatrix Ĝ^(s,t) in G^(s,t) are v_s/v_t, then L^(s,t) can be computed in closed form:

  L^(s,t) = ((W^(s)_{v_s})ᵀ W^(s)_{v_s})⁻¹ (W^(s)_{v_s})ᵀ Ĝ^(s,t) W^(t)_{v_t} ((W^(t)_{v_t})ᵀ W^(t)_{v_t})⁻¹,

where W^(s)_{v_s} and W^(t)_{v_t} are formed by the rows of W^(s) and W^(t) with row index sets v_s and v_t respectively. Since there are only k² variables in L^(s,t), we do not need many samples for each block, and the time to compute L^(s,t) is O((1 + ρ)³k³). In practice, we observe that setting ρ to 2 or 3 is enough for a good approximation, so the time complexity is O(k³).

Empirically, many values in the off-diagonal blocks are close to zero, and only a few of them have large values, as shown in Figure 1. Based on this observation, we further propose a thresholding technique to reduce the time for storing and computing L^(s,t). Since the distance between cluster centers is a good indicator of the values in an off-diagonal block, we set the whole block L^(s,t) to 0 if K(m_s, m_t) ≤ ϵ for some thresholding parameter ϵ > 0, where m_s and m_t are the k-means centroids of the s-th and t-th clusters respectively. To choose ϵ, we need to achieve a balance between speed and accuracy: when ϵ is small, we approximate more off-diagonal blocks; when ϵ is large, we set more off-diagonal blocks to 0, which increases the approximation error.
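The closed-form solution above amounts to a pair of normal-equation solves on small matrices. A minimal sketch (the helper name `link_block` is ours; it assumes the sampled rows of W^(s) and W^(t) have full column rank):

```python
import numpy as np

def link_block(Ws_vs, Wt_vt, G_hat):
    """Least-squares link block L minimizing ||G_hat - Ws_vs @ L @ Wt_vt.T||_F.
    Ws_vs: |v_s| x k_s rows of W^(s); Wt_vt: |v_t| x k_t rows of W^(t)."""
    # A = (Ws^T Ws)^{-1} Ws^T G_hat          (k_s x |v_t|)
    A = np.linalg.solve(Ws_vs.T @ Ws_vs, Ws_vs.T @ G_hat)
    # L = A Wt (Wt^T Wt)^{-1}                (k_s x k_t)
    return np.linalg.solve(Wt_vt.T @ Wt_vt, (A @ Wt_vt).T).T

# Sanity check: recover a planted L exactly when G_hat = Ws L Wt^T.
rng = np.random.default_rng(0)
Ws, Wt = rng.normal(size=(10, 3)), rng.normal(size=(12, 3))
L_true = rng.normal(size=(3, 3))
L_est = link_block(Ws, Wt, Ws @ L_true @ Wt.T)
print(np.allclose(L_est, L_true))
```

Only (1 + ρ)k rows per side enter the solves, which is where the O((1 + ρ)³k³) cost comes from.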
We test the influence of the thresholding parameter ϵ on the ijcnn1 and cadata data in Figure 5. When ϵ is large, MEKA yields a higher approximation error (because it omits more off-diagonal information), but it is faster. Conversely, for small ϵ, more off-diagonal information is considered, and we see an increase in time and a decrease in approximation error. In practice, we use cross-validation to select ϵ.

Figure 5: Time (in seconds) and kernel approximation quality of MEKA when varying the thresholding parameter ϵ for setting off-diagonal blocks in L to zero.

Choosing the rank k_s for each cluster. We need to decide the rank k_s for the s-th cluster (s = 1, …, c), which can be done in different ways: (i) the same k for all clusters; (ii) k_s proportional to the size of the s-th cluster; (iii) an eigenvalue-based approach. For (iii), suppose M^(s) is the m_s × m_s matrix consisting of the intersection of the m_s sampled columns in G^(s,s), and M is the cm × cm (with Σ_{s=1}^c m_s = cm) block-diagonal matrix with the M^(s) as diagonal blocks. We choose the k_s such that the set of top-ck eigenvalues of M is the union of the selected eigenvalues of the M^(s) over the clusters, that is,

  [σ₁(M), …, σ_{ck}(M)] = ∪_{s=1}^c [σ₁(M^(s)), …, σ_{k_s}(M^(s))].

To use (iii), we can oversample points in each cluster, e.g., sample 2k points from each cluster, perform an eigendecomposition of a 2k × 2k kernel matrix, sort the eigenvalues from the c clusters, and finally select the top-ck eigenvalues and their corresponding eigenvectors. Comparing these three strategies: (i) achieves the lowest memory usage and is fastest; (ii) is more accurate than (i) at the cost of more memory; (iii) is slowest but achieves the lowest error on the diagonal blocks. In Figure 6, we compare these three strategies for choosing k_s on the ijcnn1 and cadata datasets.
The three methods perform similarly well, and choosing the same k_s for each cluster performs slightly better than choosing k_s by cluster size or by the singular-value-based approach. In our experiments, we set all clusters to have the same rank k. We show that this simple choice of k_s already outperforms state-of-the-art kernel approximation methods.

We are now ready to present our main algorithm in Algorithm 1. In Table 4, we compare the time and space complexity of our method with SVD, standard Nyström, and RKS. We can see that MEKA is more memory efficient. For the time complexity, both T_L (time for computing the off-diagonal L) and T_C (time for clustering) are small because (1) we use thresholding to force some L^(s,t) blocks to be zero and perform least squares on small blocks, which means T_L is at most O(½ c²k³); and (2) T_C is proportional to the number of samples. For a large dataset, we sample 20000 points for k-means, and thus the clustering is more efficient than working on the entire data set.

Figure 6: Comparison of different strategies to choose the rank k_s of each cluster in MEKA on the ijcnn1 and cadata datasets.

Table 4: Memory and time analysis of various kernel approximation methods, where T_L is the time to compute the matrix L and T_C is the time for clustering in MEKA.

  Method                               | Storage        | Rank | Time Complexity
  RKS (Rahimi and Recht, 2008)         | O(cnk)         | ck   | O(cnkd)
  Nyström (Williams and Seeger, 2001)  | O(cnk)         | ck   | O(cnm(ck + d) + (cm)³)
  SVD                                  | O(cnk)         | ck   | O(n³ + n²d)
  MEKA                                 | O(nk + (ck)²)  | ck   | O(nm(k + d) + cm³ + T_L + T_C)

Algorithm 1 Memory Efficient Kernel Approximation (MEKA)
  Input: Data points {x_i}_{i=1}^n, scaling parameter γ, rank k, and number of clusters c.
  Output: The rank-ck approximation G̃ = W L Wᵀ using O(nk + (ck)²) space.
  Generate the partition V₁, …, V_c by k-means;
  for s = 1, …, c do
    Perform the rank-k approximation G^(s,s) ≈ W^(s) L^(s,s) (W^(s))ᵀ by standard Nyström;
  end
  forall (s, t) with s ≠ t do
    Sample a submatrix Ĝ^(s,t) from G^(s,t) with row index set v_s and column index set v_t;
    Form W^(s)_{v_s} by selecting the rows of W^(s) according to index set v_s;
    Form W^(t)_{v_t} by selecting the rows of W^(t) according to index set v_t;
    Solve the least squares problem Ĝ^(s,t) ≈ W^(s)_{v_s} L^(s,t) (W^(t)_{v_t})ᵀ to obtain L^(s,t);
  end

Table 5: Time (in seconds) for each step of MEKA, where T_W is the time to compute the low-rank approximations of the diagonal blocks; T_L is the time to form the link matrix L; T_C is the time for performing k-means clustering.

  Dataset  | T_C  | T_W  | T_L
  pendigit | 0.05 | 0.69 | 0.35
  ijcnn1   | 0.15 | 1.27 | 0.84
  covtype  | 1.83 | 9.82 | 12.23

In Table 5, we show the time cost of each step of MEKA on the pendigit, ijcnn1, and covtype datasets. The execution time of our proposed algorithm consists mainly of three parts: (1) the time for performing k-means clustering (T_C), (2) the time for forming the basis W from the diagonal blocks (T_W), and (3) the time to compute the link matrix L from the off-diagonal blocks (T_L). From Table 5, we observe that T_C is fairly small compared with T_W and T_L, and that T_W dominates the whole process in most cases. For the covtype data set, since we choose c to be large, T_L is slightly larger than T_W. We analyze T_W, T_L, and T_C for different c in the experimental section.

5.3 Analysis

We now bound the approximation error of our proposed method. We show that when σ_{k+1} − σ_{ck+1} is large, where σ_{k+1} and σ_{ck+1} are the (k+1)-st and (ck+1)-st singular values of G respectively, and the entries in the off-diagonal blocks are small, MEKA has a better approximation error bound than standard Nyström with similar storage.

Theorem 3 Let Δ denote the matrix consisting of all off-diagonal blocks of G, so Δ^(s,t) = G^(s,t) for s ≠ t and Δ^(s,t) = 0 when s = t.
We sample cm points from the dataset uniformly at random without replacement and split them according to the partition from k-means, so that each cluster has m_s benchmark points and Σ_{s=1}^c m_s = cm. Let G_{ck} be the best rank-ck approximation of G, and let G̃ be the rank-ck approximation from MEKA. Suppose we choose the rank k_s for each diagonal block using the eigenvalue-based approach mentioned in Section 5.1. Then with probability at least 1 − δ, the following inequalities hold for any sample of size cm:

  ‖G − G̃‖₂ ≤ ‖G − G_{ck}‖₂ + (1/√c) · (2n/√m) · G_max (1 + θ) + 2‖Δ‖₂,
  ‖G − G̃‖_F ≤ ‖G − G_{ck}‖_F + (64k/m)^(1/4) n G_max (1 + θ)^(1/2) + 2‖Δ‖_F,

where θ = √((n − m)/(n − 0.5) · (1/β(m, n)) · log(1/δ)) · d_max^G / G_max^(1/2); β(m, n) = 1 − 1/(2 max{m, n − m}); G_max = max_i G_ii; and d_max^G denotes the distance max_{ij} √(G_ii + G_jj − 2G_ij).

Proof Let B denote the block-diagonal matrix formed by the diagonal blocks of G, that is,

  B = diag(G^(1), G^(2), …, G^(c)).

By the definition of Δ, G = B + Δ. In MEKA, the error ‖G − G̃‖₂ consists of two components:

  ‖G − G̃‖₂ = ‖(B − B̃) + (Δ − Δ̃)‖₂ ≤ ‖B − B̃‖₂ + ‖Δ − Δ̃‖₂,  (16)

where B̃ and Δ̃ are the MEKA approximations of B and Δ respectively. Let us first consider the error ‖B − B̃‖₂ in approximating the diagonal blocks. Since we sample cm benchmark points from the n data points uniformly at random without replacement and distribute them according to the partition from k-means, the s-th cluster has m_s benchmark points with Σ_{s=1}^c m_s = cm. For the s-th diagonal block G^(s), we perform the rank-k_s approximation using standard Nyström, so G^(s) ≈ E^(s) (M^(s)_{k_s})⁺ (E^(s))ᵀ, where E^(s) denotes the matrix formed by the m_s sampled columns of G^(s) and M^(s) is the m_s × m_s matrix consisting of the intersection of those sampled columns. Suppose we use the singular-value-based approach of Section 5.1 to choose k_s for the s-th cluster, and consider the block-diagonal pseudo-inverse

  diag((M^(1)_{k₁})⁺, (M^(2)_{k₂})⁺, …, (M^(c)_{k_c})⁺),

where M is the cm × cm block-diagonal matrix consisting of the intersection of the sampled cm columns. Then approximating the diagonal blocks B is equivalent to directly performing standard Nyström on B, sampling cm benchmark points uniformly at random without replacement, to obtain a rank-ck approximation. The spectral and Frobenius norm error bounds for standard Nyström are given in (Kumar et al., 2009), so ‖B − B̃‖₂ can be bounded with probability at least 1 − δ as

  ‖B − B̃‖₂ ≤ ‖B − B_{ck}‖₂ + (2n/√(cm)) B_max (1 + θ_B),

where θ_B = √((n − cm)/(n − 0.5) · (1/β(cm, n)) · log(1/δ)) · d_max^B / B_max^(1/2); B_{ck} denotes the best rank-ck approximation to B; B_max = max_i B_ii; and d_max^B denotes the distance max_{ij} √(B_ii + B_jj − 2B_ij). To bound ‖Δ − Δ̃‖₂, recall that some off-diagonal blocks in MEKA are set to 0 by thresholding, and that 0 is one particular solution of the least squares problem used to compute L^(s,t); hence ‖Δ − Δ̃‖₂ ≤ ‖Δ‖₂. Furthermore, by perturbation theory (Stewart and Ji-Guang, 1990),

  ‖B − B_{ck}‖₂ ≤ ‖G − G_{ck}‖₂ + ‖Δ‖₂.  (18)

The inequality (16) combined with (18) gives a bound on ‖G − G̃‖₂. Writing θ_G for the analogue of θ_B with G in place of B (still with cm samples), we have

  ‖G − G̃‖₂ ≤ ‖B − B_{ck}‖₂ + ‖Δ‖₂ + (2n/√(cm)) B_max (1 + θ_B)
       ≤ ‖G − G_{ck}‖₂ + 2‖Δ‖₂ + (2n/√(cm)) B_max (1 + θ_B)
       ≤ ‖G − G_{ck}‖₂ + 2‖Δ‖₂ + (2n/√(cm)) G_max (1 + θ_G)
       ≤ ‖G − G_{ck}‖₂ + 2‖Δ‖₂ + (1/√c) · (2n/√m) G_max (1 + θ),

where G_{ck} denotes the best rank-ck approximation to G. The third inequality holds because G = B + Δ, B_max ≤ G_max, and d_max^B ≤ d_max^G. The last inequality holds because 2n/√(cm) = (1/√c) · 2n/√m and n − cm ≤ n − m. Similarly, using perturbation theory and the upper bound on the Frobenius error of standard Nyström, the Frobenius result follows.

When the k_s (s = 1, …, c) are balanced (meaning k_s is approximately the same for each cluster) and n is large, MEKA provides a rank-ck approximation using roughly the same amount of storage as a rank-k approximation by standard Nyström.
Interestingly, from Theorem 3, if ‖G − G_k‖₂ ≥ ‖G − G_{ck}‖₂ + 2‖Δ‖₂, then

  ‖G − G̃‖₂ ≤ ‖G − G_k‖₂ + (1/√c) · (2n/√m) · G_max (1 + θ).

The second term on the right-hand side of this inequality is only 1/√c of the corresponding term in the spectral norm error bound for standard Nyström, which uniformly samples m columns without replacement from G to obtain a rank-k approximation, as shown in (Kumar, Mohri, and Talwalkar, 2009). Thus, if there is a large enough gap between σ_{k+1} and σ_{ck+1}, the error bound for our proposed method is better than that of standard Nyström with similar storage. Furthermore, when γ is large, G tends to have better clustering structure, so the ‖Δ‖ terms in Theorem 3 are usually quite small. Note that when using the same rank k for all clusters, the above bound can be worse in some extreme cases, e.g., when all the top-ck eigenvalues come from the same cluster; in practice we do not observe such extreme situations. We also note that both ‖Δ‖_F and ‖Δ‖₂ are affected by the number of clusters c.

6. Experimental Results

In this section, we empirically demonstrate the benefits of our proposed method, MEKA, on the data sets² listed in Table 6. The experimental results here are mainly based on the Gaussian kernel, but we observe similar behavior for other shift-invariant kernels (see Section 6.2). Note that the same kernel function is used in both training and test phases.

² All the datasets are downloaded from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets and the UCI repository (Bache and Lichman, 2013). Note that the census datasets in Table 1 and Table 6 come from different sources.

Table 6: Data set statistics (n: number of samples; d: dimension of samples).

  Dataset  | n      | d   | Dataset | n         | d
  wine     | 6,497  | 11  | census  | 22,784    | 137
  cpusmall | 8,192  | 12  | ijcnn1  | 49,990    | 22
  pendigit | 10,992 | 16  | covtype | 581,012   | 54
  cadata   | 20,640 | 8   | mnist2m | 2,000,000 | 784

We compare our method with six state-of-the-art kernel approximation methods: 1.
The standard Nyström method (denoted Nys) (Williams and Seeger, 2001). In the experiments, we uniformly sample 2k columns of G without replacement and run Nyström for a rank-k approximation. 2. K-means Nyström (denoted KNys) (Zhang and Kwok, 2010), where the landmark points are the cluster centroids. As suggested in (Zhang et al., 2012), we sample 20000 points for clustering when the total number of data samples is larger than 20000. 3. Random Kitchen Sinks (denoted RKS) (Rahimi and Recht, 2008), which approximates a shift-invariant kernel based on its Fourier transform. 4. Fastfood with Hadamard features (denoted fastfood) (Le, Sarlos, and Smola, 2013). 5. Ensemble Nyström (denoted ENys) (Kumar, Mohri, and Talwalkar, 2009). Due to computational cost, we set the number of experts in ENys to 3. 6. Nyström using randomized SVD (denoted RNys) (Li, Kwok, and Lu, 2010). We set the number of power iterations q = 1 and the oversampling parameter p = 10.

We compare all methods on two tasks: kernel low-rank approximation and kernel ridge regression. We do not compare with BKA in this section because (1) the approximation error of BKA is large when γ is small (as shown in Figure 1); (2) BKA is time consuming, as it needs to compute all the diagonal-block kernel values, which takes O(n²d/c) time; and (3) BKA is not memory efficient for approximating the kernel matrix, since it needs O(n²/c) space to store the approximation. All experiments are conducted on a machine with an Intel Xeon X5440 2.83GHz CPU and 32GB RAM.

6.1 Kernel Approximation Quality

We now compare the kernel approximation quality of the above methods. Main results. The kernel approximation results are shown in Figure 7 and Table 7. We use the relative kernel approximation error ‖G − G̃‖_F / ‖G‖_F to measure quality. We randomly sampled 20000 rows of G to evaluate the relative approximation error for ijcnn1 and covtype.
In Figure 7, we show the kernel approximation performance of the different methods when varying k and γ. The rank k varies from 100 to 600 for ijcnn1 and covtype, and from 20 to 200 for the pendigit data. Figure 7 shows that our proposed approximation scheme always achieves lower error with less time and memory. The main reason is that, using a similar amount of time and memory, our method approximates the kernel matrix by a rank-ck approximation, while all other methods can only form a rank-k approximation.

Table 7: Comparison of the approximation error of our proposed method with six other state-of-the-art kernel approximation methods on real datasets, where γ is the Gaussian scaling parameter; c is the number of clusters in MEKA; k is the rank of each diagonal block in MEKA and the rank of the approximation for the six other methods. Note that for a given k, every method uses roughly the same amount of memory. All results show relative kernel approximation errors.

  Dataset  | k   | γ  | c  | Nys    | RNys   | KNys   | ENys   | RKS    | fastfood | MEKA
  pendigit | 128 | 2  | 5  | 0.1325 | 0.1361 | 0.0828 | 0.2881 | 0.4404 | 0.4726   | 0.0811
  ijcnn1   | 128 | 1  | 10 | 0.0423 | 0.0385 | 0.0234 | 0.1113 | 0.2972 | 0.2975   | 0.0082
  covtype  | 256 | 10 | 15 | 0.3700 | 0.3738 | 0.2752 | 0.5646 | 0.8825 | 0.8920   | 0.1192

In Table 7, we fix the rank k and γ so that each method uses the same amount of memory for its low-rank representation, and compare MEKA with the other methods in terms of relative approximation error. As can be seen, under the same amount of memory, our proposed method consistently yields lower approximation error than the other methods. Also, as we can see from Table 7 and Figure 8, Nyström based methods perform much better than random feature based methods (RKS and Fastfood here) in terms of kernel approximation quality; we therefore omit the latter from Figure 7 so that the difference between MEKA and the other Nyström based methods is visible.

Robustness to the Gaussian scaling parameter γ.
To show the robustness of our proposed algorithm to different γ, as explained in Section 3, we test its performance on the ijcnn1 (Figure 9a), cadata (Figure 9b), and sampled covtype (Figure 1c) datasets. The relative approximation errors for different γ values are shown in the figures using a fixed amount of memory. For large γ, the kernel matrix tends to have block structure, so our proposed method yields lower error than the other methods, and the gap grows as γ increases. Interestingly, Figure 1c shows that the approximation error of MEKA is superior even to the exact SVD, as MEKA is much more memory efficient. Even for small γ, where the kernel exhibits low-rank structure, our proposed method performs better than Nyström based methods, suggesting that it captures the low-rank structure of the kernel matrix.

Robustness to the number of clusters c. Compared with Nyström, the main extra parameter of our method is the number of clusters c. In Figure 10, we test our method with different values of c on the ijcnn1 dataset. In Figure 10a, we show how the approximation error changes as c varies from 5 to 25 (with rank k = 100 in each cluster). The memory usage is nk + (ck)², so storage increases with c. For a fair comparison, we increase the rank of the other methods as c increases, so that all methods use the same amount of memory. Figure 10a shows the performance of our proposed method for varying choices of c; for every c, MEKA performs better than the others. Furthermore, in Figure 10b we show the time cost of each step of MEKA on the ijcnn1 dataset when varying the number of clusters c; here γ is set to 1. From Figure 10b, we observe that when c is small, T_W dominates the whole process. As c increases, T_L, the time for computing the link matrix L, increases, because the number of off-diagonal blocks grows quadratically with c. Since the time complexity of k-means is O(ncd), T_C increases linearly with c.

Figure 7: Low-rank Gaussian kernel approximation results. Panels (a)–(f) show memory vs. approximation error and time vs. approximation error for pendigit (γ = 2), ijcnn1 (γ = 1), and covtype (γ = 1); panels (g)–(l) show the same for γ = 10. Methods with approximation error above the top of the y-axis are not shown.

Figure 8: Comparison between Nyström based methods (MEKA and standard Nyström) and random feature based methods (RKS and fastfood). Nyström based methods perform much better than random feature based methods for kernel approximation.

Figure 9: The kernel approximation errors for different Gaussian scaling parameters γ ((a): ijcnn1; (b): cadata).

6.2 The Performance of MEKA for Approximating the Laplacian Kernel

As mentioned in Section 5, our algorithm is suitable for approximating stationary kernels. Besides the Gaussian kernel, we can apply MEKA to approximate other kernels, e.g., the Laplacian kernel K(x, y) = e^(−γ‖x−y‖₁). Figure 11 compares our proposed method with standard Nyström (Nys), randomized Nyström (RNys), and k-means Nyström (KNys) for approximating the Laplacian kernel on the ijcnn1 data. As with the Gaussian kernel, we observe that MEKA is more memory efficient and faster than the other methods for approximating the Laplacian kernel.

(a) kernel approximation under different numbers of clusters c in MEKA.
For standard Nyström (Nys), randomized Nyström (RNys), and k-means Nyström (KNys), we use the same memory as MEKA. (b) Time for each step of MEKA when varying the number of clusters c: T_C is the time for performing k-means clustering; T_W is the time to form the basis W from the diagonal blocks; and T_L is the time to compute L from the off-diagonal blocks.

Figure 10: The kernel approximation errors and the time cost of each step of MEKA when varying c on the ijcnn1 dataset.

Figure 11: Low-rank Laplacian kernel approximation results for the ijcnn1 dataset ((a): memory vs. approximation error; (b): time vs. approximation error).

6.3 Kernel Ridge Regression

Next we compare the performance of the various methods on kernel ridge regression (Saunders, Gammerman, and Vovk, 1998):

  min_α λ αᵀα + αᵀGα − 2αᵀy,  (19)

where G is the kernel matrix formed by the training samples {x₁, …, x_l} and y ∈ ℝ^l are the targets; the minimizer satisfies (G + λI)α = y. For each kernel approximation method, we first form the approximate kernel G̃ and then solve (19) by conjugate gradient (CG). The main computation in CG is the matrix-vector product G̃v. Using a low-rank approximation, this can be computed in O(nk) flops. For our proposed method, we compute W L Wᵀv, whose cost is detailed below.

Table 8: Comparison of our proposed method with six other state-of-the-art kernel approximation methods on real datasets for kernel ridge regression, where λ is the regularization constant.

  Dataset  | k   | γ    | c  | λ   | Nys    | RNys   | KNys   | ENys   | RKS    | fastfood | MEKA
  wine     | 128 | 2⁻¹⁰ | 3  | 2⁻⁴ | 0.7514 | 0.7555 | 0.7568 | 0.7732 | 0.7459 | 0.7509   | 0.7375
  cadata   | 128 | 2²   | 5  | 2⁻³ | 0.1504 | 0.1505 | 0.1386 | 0.1462 | 0.1334 | 0.1502   | 0.1209
  cpusmall | 256 | 2²   | 5  | 2⁻⁴ | 8.8747 | 8.6973 | 6.9638 | 9.2831 | 9.6795 | 10.2601  | 6.1130
  census   | 256 | 2⁻⁴  | 5  | 2⁻⁵ | 0.0679 | 0.0653 | 0.0578 | 0.0697 | 0.0727 | 0.0732   | 0.0490
  covtype  | 256 | 2²   | 10 | 2⁻² | 0.8197 | 0.8216 | 0.8172 | 0.8454 | 0.8011 | 0.8026   | 0.7106
  mnist2m  | 256 | 2⁻⁵  | 40 | 2⁻⁵ | 0.2985 | 0.2962 | 0.2725 | 0.3018 | 0.3834 | na       | 0.2667
All the parameters are chosen by cross validation, and every method has roughly the same amount of memory as in Table 7. All results show test RMSE for regression for each k. Note that k for fastfood needs to be larger than d, so we cannot test fastfood on mnist2m when k = 256.

For the matrix-vector product W L Wᵀv in our method, Wᵀv is computed blockwise as (W^(s))ᵀ v^(s) for s = 1, …, c and requires O(nk) flops; L(Wᵀv) requires O(‖L‖₀) flops, where ‖L‖₀ is the number of nonzero entries of L after thresholding; and W(L Wᵀv) requires O(nk) flops. Therefore, the time complexity of the matrix-vector product, for both MEKA and low-rank approximation methods, is proportional to the memory used to store the approximate kernel matrix. Besides kernel approximation algorithms, we compare our method with another divide-and-conquer kernel ridge regression method (denoted DC-KRR) (Zhang et al., 2013). The basic idea of Zhang et al. (2013) is to randomly partition the data into c parts and train a kernel ridge regression model on each partition; a new data point is tested on each submodel, and the final prediction is the average of the c predictions. The parameters are chosen by five-fold cross-validation and shown in Table 8. The rank for these algorithms varies from 100 to 1000. The test root mean squared error (test RMSE) is defined as ‖y_te − G̃_te α‖, where y_te ∈ ℝ^u contains the testing labels and G̃_te ∈ ℝ^(u×l) contains the approximate kernel values between testing and training data. The covtype and mnist2m data sets are not originally designed for regression; here we set the target variables to 0 and 1 for mnist2m and to −1 and 1 for covtype. Table 8 compares the kernel ridge regression performance of our proposed scheme with the six other methods, given the same amount of memory (same k), in terms of test RMSE; our proposed method consistently performs better than the others. Figure 12 shows the time usage of the different methods for regression when varying the memory (rank k). As we can see, for the same amount of time, our proposed algorithm always achieves the lowest test RMSE.
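The CG solve with the factored matrix-vector product described above can be sketched as follows. This is an illustrative implementation, not the paper's code; it assumes W L Wᵀ + λI is symmetric positive definite (if the MEKA approximation is not PSD, the repair of Section 7.1 applies first).

```python
import numpy as np

def krr_cg(W, L, y, lam, iters=500, tol=1e-10):
    """Solve (W L W^T + lam*I) alpha = y by conjugate gradient without ever
    forming the n x n matrix; each matvec costs O(nk + (ck)^2) flops.
    Assumes W L W^T + lam*I is symmetric positive definite."""
    def matvec(v):
        return W @ (L @ (W.T @ v)) + lam * v
    alpha = np.zeros_like(y)
    r = y - matvec(alpha)          # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        alpha += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return alpha
```

The only interaction with the approximate kernel is through `matvec`, which is why the regression time tracks the memory footprint of the representation.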
The total running time consists of the time to obtain the low-rank approximation and the time for regression. The former depends on the time complexity of each method, and the latter depends on the memory required to store the low-rank matrices. As shown in the previous experiments, MEKA is faster than the other methods while achieving lower approximation error and using less memory. As a consequence, it achieves lower test RMSE in less time than the other kernel approximation methods.

Figure 12: Kernel ridge regression results (time vs. regression error) for the wine, cpusmall, cadata, census, covtype, and mnist2m data sets. Methods with regression error above the top of the y-axis are not shown. All results are averaged over five independent runs.

7. Conclusions and Discussions

In this paper, we have proposed a novel framework, Memory Efficient Kernel Approximation (MEKA), for approximating shift-invariant kernel matrices. We observe that the structure of a shift-invariant kernel matrix changes from low-rank to block-diagonal as the scale parameter is changed. Our method exploits both the low-rank and the block structure present in the kernel matrix, and thus performs better than previously proposed low-rank methods in terms of approximation and regression error, speed, and memory usage. The code for MEKA is available at www.cs.utexas.edu/~ssi/meka/. Next we discuss some typical problems encountered when using MEKA and how to deal with them.

7.1 Dealing with the Non-PSD Issue

If we sample all the entries of every off-diagonal block, the resulting approximate matrix in MEKA is positive semidefinite (PSD). The reason is as follows: if we use all the off-diagonal block entries to compute the link matrix L, the approximation becomes G̃ ≈ W L Wᵀ with L = Wᵀ G W.
Since G is PSD, L = W^T G W is also PSD, and therefore WLW^T is PSD. However, due to the sampling procedure in MEKA, the resulting approximate matrix might not be PSD, which causes problems for applications that require a PSD kernel, e.g., kernel SVM with hinge loss. There are two simple and effective ways to resolve this issue.

(1) The first method (MEKA-PSD) is to set the negative eigenvalues to 0. We first perform an eigen-decomposition of the small ck × ck link matrix, L = USU^T (where U and S are the eigenvector and eigenvalue matrices of L, respectively), in the MEKA representation, and then shrink its negative eigenvalues to 0, which forms a new eigenvalue matrix S̄. The new MEKA approximation becomes Ḡ = W U S̄ U^T W^T. Since S̄ is PSD, so is W U S̄ U^T W^T; therefore Ḡ is PSD after this shrinking operation. Due to the eigen-decomposition of L and the subsequent shrinking, both the kernel approximation error and the computation time increase slightly. We compare Nys, MEKA, and MEKA-PSD on the ijcnn1 data set in Figure 13. To achieve similar approximation quality, MEKA-PSD is slightly slower than MEKA, but it still performs better than Nys, which also generates a PSD kernel approximation.

(2) The second method is to directly add a small value ϵ to the diagonal of Ḡ, which is equivalent to adding a regularization term when applying MEKA to kernel methods. We show the corresponding kernel ridge regression results in Table 8, where we add a regularization term to MEKA. Note that to guarantee the resulting MEKA approximation is PSD, ϵ should be greater than or equal to the absolute value of the smallest negative eigenvalue of Ḡ.

Figure 13: MEKA-PSD kernel approximation results for the ijcnn1 data set.

7.2 Dealing with the Instability Issue

There are two steps in Algorithm 1 that might make the approximation unstable. One is the basis formed from each diagonal block, W_i.
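The eigenvalue-clipping step of MEKA-PSD from Section 7.1 can be sketched as follows (a minimal NumPy illustration, not the released MEKA code); L here stands for the small ck × ck link matrix:

```python
import numpy as np

def clip_link_matrix_psd(L):
    # Eigen-decompose the small symmetric link matrix L = U S U^T and
    # zero out its negative eigenvalues, so that W @ L_psd @ W.T is PSD.
    L = (L + L.T) / 2                    # symmetrize against round-off
    s, U = np.linalg.eigh(L)             # eigenvalues ascending
    s_clipped = np.maximum(s, 0.0)       # shrink negatives to 0
    return U @ np.diag(s_clipped) @ U.T
```

Since L is only ck × ck, this extra eigen-decomposition costs O((ck)^3), which is small compared to the rest of the pipeline when ck is much smaller than n.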
If the kernel matrix has a strong block-diagonal structure, choosing columns from the diagonal blocks to form W performs well; on the contrary, if the kernel matrix does not have a strong block structure, other sampling strategies can be used, for example, sampling columns from the rows corresponding to each partition to form the basis. The other step that might cause unstable results is sampling insufficient entries when forming the link matrix L. One way to resolve this issue is to sample more entries from each off-diagonal block when computation time is not the main concern.

7.3 Other Applications using MEKA

In the experiment section, we applied MEKA to kernel approximation and kernel ridge regression; besides these, the MEKA framework can be used in many other machine learning applications, such as speeding up the computation of the inverse of the kernel matrix. Many machine learning applications, e.g., Gaussian processes, need to compute the matrix inverse (G + λI)^{-1}, and inverting a dense matrix becomes the bottleneck. By approximating the kernel matrix G in MEKA form, G ≈ WLW^T, the inverse can be computed faster using the Woodbury formula:

(λI + G)^{-1} ≈ (λI + WLW^T)^{-1} = (1/λ)(I − W(λL^{-1} + W^T W)^{-1} W^T).

After the approximation, we only need to invert ck × ck matrices, which reduces the time complexity from O(n^3) for the dense inverse to O(n(ck)^2 + (ck)^3). Besides speeding up the matrix inverse, MEKA can also be used to approximate the eigendecomposition of the kernel matrix, which is useful in various machine learning applications, for instance, manifold learning and kernelized dimensionality reduction.

Acknowledgments

This research was supported by NSF grants CCF-1320746, IIS-1546452 and CCF-1564000. Cho-Jui Hsieh also acknowledges support from XSEDE startup resources.

References

D. Achlioptas, F. McSherry, and B. Schölkopf.
Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems, 2001.

F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In International Conference on Machine Learning, 2005.

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471-1490, 2010.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

A. Cotter, J. Keshet, and N. Srebro. Explicit approximations of the Gaussian kernel. arXiv:1109.4603, 2011.

F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1-49, 2001.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2001.

A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning, 2013.

N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

R. Hamid, Y. Xiao, A. Gittens, and D. DeCoste. Compact random feature maps. In International Conference on Machine Learning, 2014.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning, 2008.

C.-J. Hsieh, S. Si, and I. S. Dhillon. A divide-and-conquer solver for kernel support vector machines.
In International Conference on Machine Learning, 2014a.

C.-J. Hsieh, S. Si, and I. S. Dhillon. Fast prediction for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2014b.

P. Kar and H. Karnick. Random feature maps for dot product kernels. In International Conference on Machine Learning, 2012.

S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström methods. In Advances in Neural Information Processing Systems, 2009.

S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13:981-1006, 2012.

Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood: approximating kernel expansions in loglinear time. In International Conference on Machine Learning, 2013.

M. Li, J. T. Kwok, and B.-L. Lu. Making large-scale Nyström approximation possible. In International Conference on Machine Learning, 2010.

M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 2009.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.

A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, 2008.

C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, 1998.

B. Savas and I. S. Dhillon. Clustered low rank approximation of graphs in information science applications. In SIAM Conference on Data Mining, 2011.

B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, 2002.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

S. Si, C.-J. Hsieh, and I. S. Dhillon. Memory efficient kernel approximation.
In International Conference on Machine Learning, 2014a.

S. Si, D. Shin, I. S. Dhillon, and B. N. Parlett. Multi-scale spectral decomposition of massive graphs. In Advances in Neural Information Processing Systems, 2014b.

S. Si, C.-J. Hsieh, and I. S. Dhillon. Computationally efficient Nyström approximation using fast transforms. In International Conference on Machine Learning, pages 2655-2663, 2016.

A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, 2000.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, Boston, 1990.

R. Tandon, S. Si, P. Ravikumar, and I. S. Dhillon. Kernel ridge regression via partitioning. arXiv preprint arXiv:1608.01976, 2016.

U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.

S. Wang, C. Zhang, H. Qian, and Z. Zhang. Improving the modified Nyström method using spectral shifting. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014.

Z. Wang, N. Djuric, K. Crammer, and S. Vucetic. Trading representability for scalability: Adaptive multi-hyperplane machine for nonlinear classification. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2011.

C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2001.

J. Yang, V. Sindhwani, H. Avron, and M. W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, 2014.

T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, 2012.

K. Zhang and J. T. Kwok. Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks, 21(10):1576-1587, 2010.

K. Zhang, I. W. Tsang, and J. T.
Kwok. Improved Nyström low rank approximation and error analysis. In International Conference on Machine Learning, 2008.

K. Zhang, L. Lan, Z. Wang, and F. Moerchen. Scaling up kernel SVM on limited resources: A low-rank linearization approach. In International Conference on Artificial Intelligence and Statistics, 2012.

Y. Zhang, J. C. Duchi, and M. J. Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, 2013.