# Fast and Accurate Refined Nyström Based Kernel SVM

Zhe Li,¹ Tianbao Yang,¹ Lijun Zhang,² and Rong Jin³
¹Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA
²National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
³Alibaba Group, Seattle, WA 98101, USA
{zhe-li-1,tianbao-yang}@uiowa.edu, zhanglj@lamda.nju.edu.cn, jinrong.jr@alibaba-inc.com

## Abstract

In this paper, we focus on improving the performance of the Nyström based kernel SVM. Although the Nyström approximation has been studied extensively and its application to kernel classification has been exhibited in several studies, there still exists a potentially large gap between the performance of a classifier learned with the Nyström approximation and that learned with the original kernel. In this work, we make novel contributions to bridge this gap without increasing the training cost too much, by proposing a refined Nyström based kernel classifier. We adopt a two-step approach: in the first step we learn a sufficiently good dual solution, and in the second step we use the obtained dual solution to construct a new set of bases for the Nyström approximation and re-train a refined classifier. Our approach to learning a good dual solution is based on a sparse-regularized dual formulation with the Nyström approximation, which can be solved with the same time complexity as the standard formulation. We justify our approach by establishing a theoretical guarantee on the error of the learned dual solution in the first step with respect to the optimal dual solution under appropriate conditions. The experimental results demonstrate that (i) the dual solution obtained by our approach in the first step is closer to the optimal solution and yields improved prediction performance, and (ii) the second step, which uses the obtained dual solution to re-train the model, further improves the performance.

## Introduction

Kernel methods (Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004), e.g., the Support Vector Machine (SVM), are among the most effective learning methods widely used in classification and regression. Thanks to the kernel trick, low-dimensional features in the original space are mapped into a high-dimensional feature space without explicitly computing inner products between the high-dimensional features. However, as the scale of data continues to grow, kernel methods suffer from the severe problem of computing and maintaining a tremendously large kernel matrix, rendering it prohibitive or even impossible to learn a kernel classifier in real applications with big data. To speed up the training of a kernel classifier for big data, several fast kernel approximation methods have been developed, including the Nyström method (Williams and Seeger, 2001; Drineas and Mahoney, 2005) and random Fourier features (Rahimi and Recht, 2007), among others (Le et al., 2013; Yang et al., 2015b). Recently, the authors of (Yang et al., 2012) studied the two approximation schemes in a unified framework and demonstrated that the Nyström method can achieve better performance than random Fourier features in certain scenarios, e.g., when there is a large eigen-gap in the kernel matrix (Yang et al., 2012) or when the eigen-values follow a power-law distribution (Jin et al., 2013).
In this work, we focus on further improving the performance of the Nyström based kernel classifier while increasing the training time only by roughly a factor of two. The Nyström method for approximating a kernel matrix works by constructing a set of bases, referred to as landmark points, and then building an approximation based on the kernel similarities between the landmark points and all data points (including the landmark points themselves). It has been observed that the selection of the landmark points affects the performance of the Nyström method (Kumar et al., 2009a). Nevertheless, there still exists a gap between the performance of the Nyström based kernel classifier and that of the optimal kernel classifier, and it remains an important problem to bridge this performance gap while maintaining efficiency. Recently, a refined Nyström based kernel SVM has been proposed that first learns an approximate dual solution close enough to the optimal dual solution of the original kernel SVM and then uses the approximate dual solution to construct a set of landmark points that improve the performance of the Nyström based kernel classifier (Hsieh et al., 2014b).

In this paper, we propose an improved method to obtain a good dual solution in the first step. Our approach is motivated by the fact that the original kernel classifier usually has a small number of support vectors, indicating that the optimal dual solution is a sparse vector. However, when a Nyström approximation is used, some examples become more difficult to classify, leading to an increased number of support vectors, i.e., a denser dual solution. Therefore, in order to improve the quality of the dual solution, we introduce a sparsity-inducing regularizer into the dual formulation defined with the Nyström approximation. We justify the proposed approach by a theoretical analysis of the error bound of the obtained dual solution under incoherence and restricted eigen-value conditions. Empirically, we observe that the proposed approach achieves better performance than both the original Nyström based kernel classifier and the refined Nyström based kernel classifier that uses a divide-and-conquer approach to obtain an approximate dual solution.

The main contributions of the paper are summarized as follows: (i) we study a refined Nyström based kernel SVM and propose a new pipeline that first solves a sparse-regularized dual formulation with the approximated kernel and then utilizes the obtained dual solution to re-train a refined Nyström based kernel classifier; and (ii) we justify the proposed approach by a theoretical analysis and extensive empirical studies.

## Related Work

In this section, we review related work on approximate kernel methods, the Nyström method, sparse learning, and randomized dimensionality reduction.

Due to the exceedingly high cost of computing and maintaining a big kernel matrix for large-scale data, several fast approximate kernel methods have been developed, with random Fourier features (Rahimi and Recht, 2007), Fastfood (Le et al., 2013), and the Nyström method (Drineas and Mahoney, 2005) as representatives. Yang et al. (2012) studied random Fourier features and the Nyström method in a unified framework from the perspective of functional approximation.
They demonstrated that random Fourier features are equivalent to learning a predictive function using a set of basis functions generated independently of the data, while the Nyström method is equivalent to learning a predictive function using a set of data-dependent basis functions. The Nyström method for approximating a positive semi-definite (PSD) matrix has been studied extensively in recent years (Drineas and Mahoney, 2005; Kumar et al., 2009b; Yang et al., 2012; Zhang et al., 2008; Gittens, 2011; Talwalkar and Rostamizadeh, 2010; Gittens and Mahoney, 2013; Jin et al., 2013). Nevertheless, when employed in kernel methods for classification and regression, there still exists a gap between the performance of the Nyström based kernel classifier and the optimal kernel classifier. Recently, (Hsieh et al., 2014b) proposed a refined Nyström based kernel classifier based on a two-step approach: in the first step an approximate dual solution is learned, and in the second step a set of new landmark points is constructed using the approximate dual solution obtained in the first step. Our work differs from that work in how the approximate dual solution is learned, as described in the introduction.

Sparse learning has been researched tremendously in machine learning and statistics. Almost all existing studies center on imposing a sparsity-inducing regularizer (e.g., the ℓ1 norm) on the model (i.e., the primal solution). In this work, we impose an ℓ1 norm on the dual solution, motivated by the fact that in kernel SVM many examples can be non-support vectors, so their corresponding dual variables are zero. The most relevant work is (Xu et al., 2015), which studied sparse kernel regression with the Nyström approximation. It was brought to our attention that the proposed approach for learning a good dual solution in the first step is similar to a recent work on dual recovery analysis for randomized dimensionality reduction in high-dimensional learning problems (Yang et al., 2015a), which employed a JL transform to reduce high-dimensional examples into a low-dimensional space and then solved a sparse-regularized dual formulation. Although the two approaches share the same insight on the sparsity-inducing regularizer on the dual variables, we emphasize that the present work makes non-trivial contributions in the analysis, since the Nyström method is not a JL transform; therefore the analysis in (Yang et al., 2015a) based on the JL lemma cannot carry over to the Nyström based kernel method.

## The Problem and Proposed Algorithm

### Preliminaries and Motivation

Let $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$ denote a set of training examples, where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the feature vector and $y_i \in \{+1, -1\}$ denotes the class label. Let $\kappa(\cdot, \cdot)$ denote a valid kernel function and $\mathcal{H}_\kappa$ denote the Reproducing Kernel Hilbert Space endowed with $\kappa(\cdot, \cdot)$. The kernel SVM solves the following optimization problem:

$$\min_{f \in \mathcal{H}_\kappa}\; \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i) + \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 \tag{1}$$

where $\ell(z, y) = \max(0, 1 - yz)^p$ ($p = 1$ or $2$) is the hinge loss or the squared hinge loss. Using the convex conjugate function, the above problem can be turned into a dual problem:

$$\alpha^* = \arg\max_{\alpha \in \Omega_n}\; -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top K \alpha \tag{2}$$

where $\Omega_n$ is the domain of the dual solution, $K \in \mathbb{R}^{n\times n}$ is the kernel matrix, and $\ell_i^*(\alpha_i)$ is the convex conjugate of $\ell(z, y_i)$ with respect to $z$. For example, if $\ell(z, y_i) = \max(0, 1 - y_i z)$, then $\ell_i^*(\alpha_i) = \alpha_i y_i$ and $\Omega_n = \{\alpha \in \mathbb{R}^n : -1 \le \alpha \circ \mathbf{y} \le 0\}$.
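For completeness, the conjugate claimed above can be verified with a short calculation (our own worked derivation, not reproduced from the paper; the squared hinge loss is handled analogously):

```latex
% Conjugate of the hinge loss \ell(z, y_i) = \max(0, 1 - y_i z) with respect to z.
% Substitute u = y_i z (so z = y_i u, using y_i^2 = 1):
\ell_i^*(\alpha_i) = \sup_{z}\big(\alpha_i z - \max(0, 1 - y_i z)\big)
                   = \sup_{u}\big(\alpha_i y_i u - \max(0, 1 - u)\big).
% For u \le 1 the inner term is (\alpha_i y_i + 1)u - 1: bounded above iff \alpha_i y_i \ge -1.
% For u \ge 1 the inner term is \alpha_i y_i u: bounded above iff \alpha_i y_i \le 0.
% Both branches attain their supremum \alpha_i y_i at u = 1, hence
\ell_i^*(\alpha_i) =
  \begin{cases}
    \alpha_i y_i & \text{if } -1 \le \alpha_i y_i \le 0,\\
    +\infty      & \text{otherwise,}
  \end{cases}
\qquad \Omega_n = \{\alpha \in \mathbb{R}^n : -1 \le \alpha \circ \mathbf{y} \le 0\}.
```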
When the number $n$ of training examples is large, it is prohibitive to compute and maintain the kernel matrix $K$. The Nyström method addresses this computation and memory limitation by computing a low-rank approximation of $K$ from a small subset of columns of $K$ or from a constructed set of landmark points. In particular, let $\mathcal{L}_m = \{\mathbf{c}_1, \ldots, \mathbf{c}_m\}$, where $\mathbf{c}_i \in \mathbb{R}^d$, denote a set of $m$ landmark points, let $K_m \in \mathbb{R}^{m\times m}$ denote the sub-kernel matrix among the points in $\mathcal{L}_m$, and let $K_b \in \mathbb{R}^{n\times m}$ denote the sub-kernel matrix between all examples and the landmark points. Then the Nyström approximation of $K$ is computed by

$$\tilde{K} = K_b K_m^{\dagger} K_b^\top \tag{3}$$

where $K_m^{\dagger}$ denotes the pseudo-inverse of $K_m$. When applying the Nyström approximation to the dual problem, we obtain the following optimization problem:

$$\tilde{\alpha} = \arg\max_{\alpha\in\Omega_n}\; -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \big(K_b K_m^{\dagger} K_b^\top\big)\alpha \tag{4}$$

which is equivalent to the dual problem obtained from a short feature representation of the training examples,

$$\tilde{\mathbf{x}}_i = (K_m^{\dagger})^{1/2}\big(\kappa(\mathbf{x}_i, \mathbf{c}_1), \ldots, \kappa(\mathbf{x}_i, \mathbf{c}_m)\big)^\top, \quad i = 1, \ldots, n. \tag{5}$$

Let $\tilde{X} = (\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_n) \in \mathbb{R}^{m\times n}$; it is straightforward to verify that $\tilde{K} = \tilde{X}^\top \tilde{X}$, and problem (4) can be written as

$$\max_{\alpha\in\Omega_n}\; -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \tilde{X}^\top\tilde{X}\alpha \tag{6}$$

which can be solved efficiently using stochastic optimization algorithms developed for linear methods (Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Lin et al., 2014). The overall running time of the Nyström based kernel classifier consists of the time for computing the short feature representation of all training data, which is $O(m^2 n + m^3)$, and the time for optimization. Hence, the Nyström based kernel classifier can be trained efficiently when $m$ is relatively small. On the other hand, the generalization performance of the Nyström based kernel classifier is of order $O(1/\sqrt{m})$ for general data, which can be improved to $O(1/m)$ for some special data (Yang et al., 2012). Therefore, with a small value of $m$, there still exists a potentially large gap between the performance of the Nyström based kernel classifier and the optimal kernel classifier.

In this paper, we propose a refined Nyström based kernel SVM to bridge the gap between the Nyström based kernel classifier and the optimal kernel classifier. To motivate the proposed approach, we first note that given the optimal dual solution $\alpha^*$, the optimal kernel classifier can be written as $f^*(\cdot) = -\frac{1}{\lambda n}\sum_{i=1}^n [\alpha^*]_i\,\kappa(\mathbf{x}_i, \cdot)$. If we know that $\alpha^*$ is $m$-sparse with $m \ll n$ and choose the support vectors as landmark points, i.e., $\mathcal{L}_m^* = \{\mathbf{c}_1^*, \ldots, \mathbf{c}_m^*\} = \{\mathbf{x}_i : [\alpha^*]_i \neq 0\}$, then we can solve the following optimization problem

$$\min_{f\in\mathcal{H}_\kappa^{m}}\; \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i) + \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 \tag{7}$$

where $\mathcal{H}_\kappa^{m} = \{f : f = \sum_{i=1}^{m}\beta_i\,\kappa(\mathbf{c}_i^*, \cdot)\}$. As demonstrated in (Yang et al., 2012), problem (7) is equivalent to using the Nyström approximation constructed with the landmark points in $\mathcal{L}_m^*$. It is not difficult to show that under the discussed conditions the optimal solution to the above problem is also the optimal solution to (1); the details are given in the supplement. From another perspective, following Theorem 2 in (Hsieh et al., 2014b), the error of $\tilde{\alpha}$ is bounded by

$$\|\tilde{\alpha} - \alpha^*\|_2 \le \frac{1}{n\lambda_{nnz}}\|K_m\|_2\,(1 + \|K\|_2)\,\Delta, \qquad \Delta = \sum_{i=1}^n |[\alpha^*]_i|\,\|K_i - \tilde{K}_i\|_2 \tag{8}$$

where $\lambda_{nnz}$ is the minimum nonzero eigen-value of $K/n$ and $K_i$ denotes the $i$-th column of $K$. It indicates that the quality of $\tilde{\alpha}$ is mostly affected by the small portion of columns of $K$ with larger $|[\alpha^*]_i|$.
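As a concrete illustration of Eqns. (3)-(5), here is a minimal single-machine sketch of the Nyström feature construction (a hypothetical NumPy helper of our own, assuming an RBF kernel; the paper's actual implementation is distributed, as described in the experiments):

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, landmarks, gamma, tol=1e-10):
    """Return the n x m matrix whose i-th row is x_tilde_i^T from Eqn. (5),
    i.e. x_tilde_i = (K_m^+)^{1/2} (kappa(x_i, c_1), ..., kappa(x_i, c_m))^T,
    so that the Gram matrix of the rows equals K_b K_m^+ K_b^T from Eqn. (3)."""
    Km = rbf_kernel(landmarks, landmarks, gamma)   # m x m kernel among landmarks
    Kb = rbf_kernel(X, landmarks, gamma)           # n x m kernel to all examples
    s, U = np.linalg.eigh(Km)                      # K_m = U diag(s) U^T
    inv_sqrt = np.where(s > tol, 1.0 / np.sqrt(np.maximum(s, tol)), 0.0)
    Km_pinv_sqrt = (U * inv_sqrt) @ U.T            # symmetric square root of K_m^+
    return Kb @ Km_pinv_sqrt

# Uniform landmark selection for the standard Nystrom classifier, e.g.:
#   idx = np.random.choice(len(X), size=m, replace=False)
#   X_tilde = nystrom_features(X, X[idx], gamma)   # then train a linear SVM on X_tilde
```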
The bound in (8) suggests a two-step approach to improving the performance of the Nyström based kernel classifier: in the first step we learn an approximate dual solution that is close to $\alpha^*$, and in the second step we construct a set of landmark points aiming to minimize $\Delta$, using the approximate dual solution in place of $\alpha^*$. (Hsieh et al., 2014b) also implements this two-step approach, learning an approximate dual solution with the divide-and-conquer method (Hsieh et al., 2014a), which divides all examples into a number of groups and solves a small kernel SVM for each group. However, there is no guarantee on the quality of the dual solution obtained this way. Below, we propose a more solid approach to learning a refined Nyström based kernel SVM.

### A Refined Nyström Based Kernel SVM

Our approach is inspired by the fact that in the optimal kernel classifier the number of support vectors is usually small compared to the total number of examples, indicating that the optimal dual solution is a sparse vector. However, when a Nyström approximation is used, some examples become more difficult to classify, leading to an increased number of support vectors, i.e., a denser dual solution. Therefore, in order to improve the quality of the dual solution, we introduce a sparsity-inducing regularizer into the dual formulation defined with the Nyström approximated kernel. In particular, we solve the following formulation to obtain an improved dual solution:

$$\hat{\alpha} = \arg\max_{\alpha\in\Omega_n}\; -\frac{1}{n}\sum_{i=1}^n\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\tilde{K}\alpha - \frac{\tau}{n}\|\alpha\|_1 \tag{9}$$

It was shown in (Yang et al., 2015a) that when the loss is the hinge loss or the squared hinge loss, adding the $\ell_1$ norm on the dual variable $\alpha$ is equivalent to using a new loss function with a reduced margin $1-\tau$, as compared with the margin $1$ used in the standard hinge loss. To see this, consider the hinge loss $\ell(z, y) = \max(0, 1 - yz)$, so that $\ell_i^*(\alpha_i) = \alpha_i y_i$ and $\Omega_n = \{\alpha\in\mathbb{R}^n : -1\le\alpha\circ\mathbf{y}\le 0\}$. With a change of variables, the new problem in (9) reduces to

$$\max_{\beta\in[0,1]^n}\; \frac{1}{n}\sum_{i=1}^n\beta_i(1-\tau) - \frac{1}{2\lambda n^2}(\beta\circ\mathbf{y})^\top\tilde{K}(\beta\circ\mathbf{y})$$

which is the dual problem of

$$\min_{\mathbf{w}\in\mathbb{R}^m}\; \frac{1}{n}\sum_{i=1}^n \max\big(0, (1-\tau) - y_i\mathbf{w}^\top\tilde{\mathbf{x}}_i\big) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$

i.e., a linear SVM in the Nyström feature space with the reduced margin $1-\tau$ in the definition of the hinge loss (a minimal illustrative solver for this step is sketched at the end of this subsection). In the next subsection, we provide a theoretical analysis of the proposed sparse-regularized dual formulation with the Nyström approximation by establishing an error bound for the obtained dual solution $\hat{\alpha}$. The above equivalence also implies that the new formulation can be solved with the same time complexity as the original formulation in (6).

Next, we briefly discuss the second step, which uses the obtained dual solution $\hat{\alpha}$ to re-train a refined Nyström based kernel classifier. The methodology is to select a new set of landmark points using $\hat{\alpha}$ and then learn a Nyström based kernel classifier with the selected landmark points. In Hsieh et al. (2014b), the authors suggested an approach based on weighted k-means clustering. The rationale is that when the kernel function is stationary (i.e., $\kappa(\mathbf{x}_i,\mathbf{x}_j) = \kappa(\|\mathbf{x}_i-\mathbf{x}_j\|_2)$), the $\Delta$ in (8) is bounded by a quantity proportional to the square root of a weighted k-means objective whose weights are the squares of the optimal dual variables, i.e., $\sum_{i=1}^n[\alpha^*]_i^2\|\mathbf{x}_i - \mathbf{c}_{\pi_i}\|^2$, where $\pi_i = \arg\min_j\|\mathbf{x}_i - \mathbf{c}_j\|^2$. Thus, one can perform a weighted k-means with weights $[\hat{\alpha}]_i^2$, $i\in[n]$, and use the resulting cluster centers as landmark points to construct the Nyström approximation.
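To make the first step concrete, the following sketch (our own illustrative single-machine solver, not the distributed solver used in the experiments) solves (9) through its reduced-margin form by dual coordinate ascent over $\beta\in[0,1]^n$ and recovers the dual vector $\hat{\alpha}$ via the variable change $\alpha = -\beta\circ\mathbf{y}$, which is one consistent choice; setting $\tau = 0$ recovers the standard Nyström classifier (6):

```python
import numpy as np

def sparse_dual_nystrom_svm(X_tilde, y, lam, tau, epochs=20, seed=0):
    """Dual coordinate ascent for
        max_{beta in [0,1]^n}  (1/n) sum_i (1 - tau) beta_i
                               - 1/(2 lam n^2) || sum_i beta_i y_i x_tilde_i ||^2,
    the reduced-margin form of the sparse-regularized dual (9).
    Returns (alpha_hat, w): the dual solution and the primal weights in the
    Nystrom feature space (predict with sign(w @ x_tilde))."""
    n, m = X_tilde.shape
    rng = np.random.default_rng(seed)
    beta = np.zeros(n)
    u = np.zeros(m)                                # u = sum_i beta_i y_i x_tilde_i
    sq_norms = (X_tilde ** 2).sum(axis=1)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if sq_norms[i] == 0.0:
                continue
            # Closed-form maximizer of the concave single-coordinate subproblem,
            # clipped back to the box [0, 1].
            step = (lam * n * (1.0 - tau) - y[i] * (X_tilde[i] @ u)) / sq_norms[i]
            new_beta = min(1.0, max(0.0, beta[i] + step))
            u += (new_beta - beta[i]) * y[i] * X_tilde[i]
            beta[i] = new_beta
    w = u / (lam * n)
    alpha_hat = -beta * y                          # one choice of sign in the change of variables
    return alpha_hat, w
```

Each epoch costs $O(nm)$, the same as a dual coordinate ascent pass for a standard linear SVM in the Nyström feature space, which is the point of the reduced-margin reformulation.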
The weighted k-means approach, however, introduces the additional cost of clustering to find the centers and is restricted to stationary kernels. In this paper, we use a simple alternative based on a greedy principle: if $\alpha^*$ were given, we could select the examples with the largest $|[\alpha^*]_i|$ to minimize $\Delta$. In practice, we only have the approximate dual solution $\hat{\alpha}$, hence we opt for a probabilistic sampling approach that selects examples according to the probability distribution

$$\Pr(\mathbf{x}_i \text{ is selected}) = \frac{|[\hat{\alpha}]_i|}{\sum_{j=1}^n |[\hat{\alpha}]_j|}$$

which we observe to be more effective than the deterministic approach of simply selecting the examples with the largest $|[\hat{\alpha}]_i|$, and also competitive with the weighted k-means sampling approach.

### A Theoretical Analysis

We now analyze the error of $\hat{\alpha}$ and summarize the main result in a theorem at the end of this subsection. Let $S$ be the support set of $\alpha^*$ and $s = |S|$ the number of non-zero entries of $\alpha^*$. Denote by $\alpha_S$ the vector that only contains the elements of $\alpha$ in $S$. We assume $s \ll n$. Before presenting the analysis, we define two quantities regarding the kernel matrix:

$$\Delta_\infty = \frac{1}{n}\sum_{i=1}^n |[\alpha^*]_i|\,\|K_{*,i} - \tilde{K}_{*,i}\|_\infty, \qquad \gamma(s) = \min_{\|\alpha\|_2 = 1,\ \|\alpha\|_0\le s}\; \frac{1}{n}\alpha^\top K\alpha$$

where $\gamma(s) > 0$ is known as the restricted eigen-value condition on $K/n$ in the literature (Bickel et al., 2009). We denote by $\lambda_i$, $i\in[n]$, the eigen-values of $K/n$ ranked in descending order. In addition, we introduce the following coherence measure to facilitate our analysis. For any real PSD matrix $A\in\mathbb{R}^{N\times N}$, let $\tau_k(A)$ denote the coherence of a dominant $k$-dimensional invariant subspace of $A$, specified by

$$\tau_k(A) = \frac{N}{k}\max_i\,(P_{U_k})_{ii}$$

where $P_{U_k} = U_k U_k^\top$ denotes the projection onto the dominant $k$-dimensional invariant subspace of $A$ (i.e., $U_k$ contains the top-$k$ eigen-vectors of $A$ as its columns). The coherence measure has been used in matrix completion (Recht, 2011) and randomized matrix approximation (Gittens, 2011). To characterize the coherence of the kernel matrix $K$ with respect to any subset $\Omega$ of cardinality $m+s$, we define

$$\tau_k(m, s) = \max_{\Omega\subset[n],\,|\Omega| = m + s}\tau_k(K_{\Omega,\Omega})$$

where $K_{\Omega,\Omega}$ is the submatrix of $K$ with row and column indices in $\Omega$. We first present the following lemma, showing that $\hat{\alpha}-\alpha^*$ lies in the cone of dominant coordinates appearing in the definition of the restricted eigen-value.

Lemma 1. Let $S$ be the support set of $\alpha^*$ and $S^c$ its complement. By setting $\tau \ge \frac{2}{\lambda n}\sum_{i=1}^n |[\alpha^*]_i|\,\|K_i - \tilde{K}_i\|_\infty$, we have $\|[\hat{\alpha}-\alpha^*]_{S^c}\|_1 \le 3\|[\hat{\alpha}-\alpha^*]_S\|_1$.

Due to the limit of space, we put all proofs in the supplement. We assume $\ell_i^*(\alpha)$ is $\mu$-strongly convex with $\mu \ge 0$ (when $\mu = 0$ it is merely convex; the conjugate of the squared hinge loss is $1/2$-strongly convex). Following the optimality condition of $\hat{\alpha}$ for (9) and the optimality condition of $\alpha^*$ for (2), there exists $g\in\partial\|\hat{\alpha}\|_1$ such that

$$(\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\hat{\alpha}) + \frac{1}{\lambda n^2}\tilde{K}\hat{\alpha} + \frac{\tau}{n}g\Big) \le 0, \qquad (\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\alpha^*) + \frac{1}{\lambda n^2}K\alpha^*\Big) \ge 0$$

where $f(\alpha) = \frac{1}{n}\sum_{i=1}^n\ell_i^*(\alpha_i)$. Using the Cauchy-Schwarz inequality, we have

$$(\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\hat{\alpha}) + \frac{1}{\lambda n^2}\tilde{K}\hat{\alpha}\Big) + \frac{\tau}{n}\|[\hat{\alpha}]_{S^c}\|_1 \le \frac{\tau}{n}\|[\hat{\alpha}-\alpha^*]_S\|_1.$$

Thus, we have

$$\frac{\tau}{n}\|[\hat{\alpha}-\alpha^*]_S\|_1 - \frac{\tau}{n}\|[\hat{\alpha}]_{S^c}\|_1 \ \ge\ (\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\hat{\alpha}) + \frac{1}{\lambda n^2}\tilde{K}\hat{\alpha}\Big)$$
$$=\ (\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\alpha^*) + \frac{1}{\lambda n^2}K\alpha^*\Big) + (\hat{\alpha}-\alpha^*)^\top\big(\nabla f(\hat{\alpha}) - \nabla f(\alpha^*)\big) + \frac{1}{\lambda n^2}(\hat{\alpha}-\alpha^*)^\top\big(\tilde{K}\hat{\alpha} - K\alpha^*\big)$$
$$\ge\ \frac{\mu}{n}\|\hat{\alpha}-\alpha^*\|_2^2 + \frac{1}{\lambda n}\underbrace{\frac{1}{n}(\hat{\alpha}-\alpha^*)^\top(\tilde{K}-K)\alpha^*}_{A} + \frac{1}{\lambda n}\underbrace{\frac{1}{n}(\hat{\alpha}-\alpha^*)^\top\tilde{K}(\hat{\alpha}-\alpha^*)}_{B}$$

where the second inequality uses the optimality of $\alpha^*$ and the fact that $f(\alpha)$ is $(\mu/n)$-strongly convex. Next, we bound the two terms $A$ and $B$. For bounding $A$, we have $|A| \le \|\hat{\alpha}-\alpha^*\|_1\,\Delta_\infty$. For bounding $B$, we prove the following lemma in the supplement.

Lemma 2. If $m \ge 8k\,\tau_k(m, 16s)\big(16s\log d + \log\frac{k}{\delta}\big)$, then by setting $\tau \ge \frac{2}{\lambda n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i - \tilde{K}_i\|_\infty$, with probability at least $1-\delta$,
$$B \ \ge\ 2\Big(\gamma(16s) - \big(3 + \tfrac{32s}{m}\big)\lambda_{k+1}\Big)\|\hat{\alpha}-\alpha^*\|_2^2.$$
Given the above bounds for $A$ and $B$, we have

$$\|\hat{\alpha}-\alpha^*\|_2^2\Big(\lambda\mu + 2\gamma(16s) - \big(6 + \tfrac{64s}{m}\big)\lambda_{k+1}\Big) + (\lambda\tau - \Delta_\infty)\|[\hat{\alpha}]_{S^c}\|_1 \ \le\ (\lambda\tau + \Delta_\infty)\|[\hat{\alpha}-\alpha^*]_S\|_1.$$

If we further assume that $\tau \ge 2\Delta_\infty/\lambda$, it is not difficult to prove the error bound stated in the following theorem.

Theorem 1. Assume that for some $k$ and $\delta\in(0,1)$ the following conditions hold:
$$\lambda\mu + 2\gamma(16s) > \big(6 + \tfrac{64s}{m}\big)\lambda_{k+1}, \qquad m \ge 8k\,\tau_k(m, 16s)\Big(16s\log d + \log\tfrac{k}{\delta}\Big).$$
By setting $\tau \ge \frac{2}{\lambda n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i-\tilde{K}_i\|_\infty$, with probability at least $1-\delta$ we have
$$\|\hat{\alpha} - \alpha^*\|_2 \ \le\ \frac{1.5\,\lambda\sqrt{s}\,\tau}{\lambda\mu + 2\gamma(16s) - (6 + 64s/m)\lambda_{k+1}}.$$

Remark: It is interesting to compare our error bound for $\hat{\alpha}$ with the bound (8) for $\tilde{\alpha}$ of the original Nyström based formulation derived by (Hsieh et al., 2014b), and with the error bound of the dual solution obtained by the divide-and-conquer approach (Hsieh et al., 2014a). Considering $\tau = \Theta\big(\frac{1}{\lambda n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i-\tilde{K}_i\|_\infty\big)$ and comparing with (8), our error bound is proportional to $\frac{1}{n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i-\tilde{K}_i\|_\infty$, which is smaller than the quantity $\Delta$ in (8). Moreover, the error bound for $\tilde{\alpha}$ has an inverse dependence on the minimum non-zero eigen-value of $K/n$, which in practice can be very close to zero, leading to a potentially large error in $\tilde{\alpha}$. In contrast, our error bound is inversely proportional to $\lambda\mu + 2\gamma(16s) - (6+64s/m)\lambda_{k+1}$, which depends on the minimum restricted eigen-value. In addition, the error bound in (8) depends on $\|K_m\|_2$ and $\|K\|_2$, while our error bound only depends on $s$, making the proposed refined Nyström based kernel classifier attractive when the number of support vectors is relatively small. Compared with the error bound of the approximate solution obtained by the divide-and-conquer approach (Theorem 1 of (Hsieh et al., 2014a)), which depends on how well the data is clustered and is inversely proportional to the minimum eigen-value of the kernel matrix, the bound in Theorem 1 is better.

## Experiments

### Implementation

In our experiments, we implement both the feature construction by the Nyström method and the optimization of the linear SVM in a cluster environment. The training data is randomly partitioned over 5 nodes. Given the landmark points, we construct the short feature representation in Eqn. (5) for all training examples by running the code in parallel on the 5 nodes. To solve the linear SVM problem in a distributed fashion, we use the recently proposed distributed stochastic dual coordinate ascent algorithm (Yang, 2013; Ma et al., 2015).

Figure 1: Test accuracy of linear SVM, RBF-kernel SVM, and the Nyström based kernel classifier with different numbers of samples on the six datasets.

Figure 2: Test accuracy of the sparse-regularized Nyström based kernel classifier.

### Experimental Results

In this section, we present empirical evaluations of the proposed refined Nyström based kernel classifier on six real-world datasets, namely usps, letter, ijcnn1, webspam, cod-rna and covtype, using the versions available on the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Table 1 summarizes the statistics of these datasets.

Table 1: Statistics of datasets

| | usps | letter | ijcnn1 | webspam | cod-rna | covtype |
|---|---|---|---|---|---|---|
| #Training | 7,291 | 12,000 | 91,701 | 280,000 | 271,617 | 464,810 |
| #Testing | 2,007 | 6,000 | 49,990 | 70,000 | 59,535 | 116,202 |
| #Features | 256 | 16 | 22 | 254 | 8 | 54 |

We run linear SVM and kernel SVM using LIBLINEAR and LIBSVM, respectively. The kernel used in the experiments is the RBF kernel and the loss function is the hinge loss. Through cross-validation, we choose the best parameter $C$ from $2^{[-6:1:6]}$ and the best RBF kernel parameter $\gamma$ from $2^{[-6:2:6]}$. For the methods that involve randomness, the results are averaged over five random trials of sampling.
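Putting the pieces together, the following is a schematic single-machine sketch of the evaluated pipeline (sp-pro-nys). It reuses the hypothetical helpers `nystrom_features` and `sparse_dual_nystrom_svm` from the earlier sketches, and it assumes the retraining step uses the standard formulation ($\tau=0$); the actual experiments use the distributed implementation described above:

```python
import numpy as np

def refined_nystrom_svm(X, y, m, gamma, lam, tau, seed=0):
    """Two-step refined Nystrom SVM (sp-pro-nys), schematically:
    (1) uniform landmarks + sparse-regularized dual (9)  ->  alpha_hat;
    (2) resample m landmarks with Pr(x_i) proportional to |alpha_hat_i|, retrain."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Step 1: uniform landmarks, solve the sparse-regularized (reduced-margin) dual.
    idx0 = rng.choice(n, size=m, replace=False)
    Z0 = nystrom_features(X, X[idx0], gamma)
    alpha_hat, _ = sparse_dual_nystrom_svm(Z0, y, lam, tau)

    # Step 2: probabilistic landmark selection, Pr(x_i) = |alpha_hat_i| / sum_j |alpha_hat_j|
    # (assumes at least m nonzero entries in alpha_hat), then retrain with tau = 0.
    p = np.abs(alpha_hat)
    p = p / p.sum()
    idx1 = rng.choice(n, size=m, replace=False, p=p)
    landmarks = X[idx1]
    Z1 = nystrom_features(X, landmarks, gamma)
    _, w = sparse_dual_nystrom_svm(Z1, y, lam, tau=0.0)

    def predict(X_test):
        return np.sign(nystrom_features(X_test, landmarks, gamma) @ w)
    return predict
```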
We first compare the original Nyström based kernel classifier with different numbers of samples against linear SVM and kernel SVM, with results shown in Figure 1. For the original Nyström approximation, we use uniform sampling to select examples from the training data. We can see that as the number of samples $m$ for the Nyström approximation increases, the test accuracy increases monotonically. However, there still exists a potentially large gap in performance compared with the optimal kernel classifier. For example, the optimal kernel classifier outperforms the Nyström based kernel classifier with $m = 4096$ by 3 percent on the webspam dataset and by 4 percent on the ijcnn1 dataset.

Next, we verify the proposed sparse-regularized dual formulation with the Nyström approximation. The test accuracy is evaluated on the model constructed from the obtained dual variables; comparing it with the standard Nyström based kernel classifier allows us to see the effect of the added sparsity-inducing regularizer. The results are shown in Figure 2 with the value of $\tau$ varying from 0 to 0.9 (when $\tau > 1$, the problem yields the trivial solution with the optimal model being zero; to avoid clutter, we only show one curve on the covtype dataset). When $\tau = 0$, the formulation reduces to the standard Nyström based kernel classifier. From the results, we can see that adding the sparsity-inducing regularizer to the dual formulation with the Nyström approximation can boost the performance. For example, when $m = 4096$ the performance of the Nyström based kernel classifier is improved by 2 percent on the ijcnn1 dataset.

Next, we examine the performance of the refined Nyström based kernel SVM, which uses the dual solution of the sparse-regularized dual formulation to construct a new set of landmark points and re-train a Nyström based kernel classifier. For each value of $\tau$, we optimize the sparse-regularized dual formulation to obtain a dual solution, then use this dual solution to construct the same number of landmark points and compute a new Nyström approximation for learning a new classifier. We report the results of the probabilistic approach for constructing the landmark points (referred to as sp-pro-nys), which is described earlier. For baselines, we include the model directly constructed from the dual solution obtained in the first step (referred to as sp) and the approach that uses the divide-and-conquer method (Hsieh et al., 2014a) to obtain a dual solution and then re-trains a Nyström based kernel classifier using weighted k-means centers as the landmark points, as suggested in (Hsieh et al., 2014b); this approach is referred to as dc-wkm-nys. Note that the divide-and-conquer approach requires clustering the training data in order to obtain a partition. We follow the idea in (Hsieh et al., 2014b) and use standard k-means clustering to partition the data instead of the more expensive kernel k-means clustering suggested in (Hsieh et al., 2014a). The results are shown in Figure 3. From the results we can see that (i) the second step, which re-trains a new Nyström based kernel classifier using the dual solution obtained in the first step, further improves the performance; and (ii) the proposed new pipeline outperforms the divide-and-conquer approach followed by the weighted k-means sampling approach for constructing a new Nyström approximation.

Finally, we compare the training time of linear SVM, kernel SVM, the standard Nyström based kernel classifier and the refined Nyström based kernel classifier.
We report the results on two datasets, webspam and cod-rna, with $m = 1024$ in Figure 4. The figure shows that the training time of the Nyström based classifiers on both datasets is much less than that of kernel SVM. On the other hand, the training time of the refined Nyström based classifier is comparable to that of the standard Nyström method.

Figure 3: Test accuracy of the refined Nyström based kernel classifier (sp-pro-nys). The value of m is set to 1024 in the proposed algorithm. For the divide-and-conquer approach, the x-axis denotes 1 − nc/100, where nc denotes the number of clusters used in the divide-and-conquer approach. We did not report the result on covtype because the divide-and-conquer approach needs to solve multiple kernel SVMs, one per partition; when the number of partitions is small, each kernel SVM is still expensive.

Figure 4: Training time of linear SVM, kernel SVM, the standard Nyström based classifier, and the refined Nyström based classifier for m = 1024 on two datasets, webspam and cod-rna.

## Conclusions

In this paper, we have considered improving the performance of the Nyström based kernel SVM. We proposed a fast and accurate refined Nyström based kernel classifier that consists of two steps: in the first step we learn an accurate dual solution based on a sparse-regularized dual formulation with the Nyström approximation, and in the second step we use the obtained dual solution to re-train a Nyström based kernel classifier. We established an error bound for the dual solution obtained in the first step, which is better than previous theoretical results. The empirical evaluations on various datasets further demonstrate the effectiveness of the proposed algorithm.

## Acknowledgements

This work was partially supported by NSF IIS-1463988 and NSF IIS-1545995.

## References

Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4), 2009.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, pages 2153-2175, 2005.

Alex Gittens. The spectral norm error of the naive Nyström extension. CoRR, 2011.

Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. CoRR, abs/1303.1849, 2013.

Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. A divide-and-conquer solver for kernel support vector machines. In ICML, pages 566-574, 2014a.

Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. Fast prediction for large-scale kernel machines. In NIPS, pages 3689-3697, 2014b.

Rong Jin, Tianbao Yang, Mehrdad Mahdavi, Yu-Feng Li, and Zhi-Hua Zhou. Improved bounds for the Nyström method with application to kernel classification. IEEE Transactions on Information Theory, 59(10):6939-6949, 2013.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315-323, 2013.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In ICML, pages 553-560, 2009a.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method. In AISTATS, pages 304-311, 2009b.

Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood - approximating kernel expansions in loglinear time. In ICML, 2013.
Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In NIPS, pages 3059-3067, 2014.

Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML, pages 1973-1982, 2015.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177-1184, 2007.

Benjamin Recht. A simpler approach to matrix completion. JMLR, 12:3413-3430, 2011.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, pages 567-599, 2013.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Ameet Talwalkar and Afshin Rostamizadeh. Matrix coherence and the Nyström method. In UAI, 2010.

Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682-688, 2001.

Zenglin Xu, Rong Jin, Bin Shen, and Shenghuo Zhu. Nyström approximation for sparse kernel methods: Theoretical analysis and empirical evaluation. In AAAI, pages 3115-3121, 2015.

Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, pages 629-637, 2013.

Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pages 485-493, 2012.

Tianbao Yang, Lijun Zhang, Rong Jin, and Shenghuo Zhu. Theory of dual-sparse regularized randomized reduction. CoRR, 2015a.

Zichao Yang, Andrew Gordon Wilson, Alexander J. Smola, and Le Song. A la carte - learning fast kernels. In AISTATS, 2015b.

Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In ICML, 2008.