# Fast and Accurate Refined Nyström Based Kernel SVM

Zhe Li,¹ Tianbao Yang,¹ Lijun Zhang,² and Rong Jin³
¹Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA
²National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
³Alibaba Group, Seattle, WA 98101, USA
{zhe-li-1,tianbao-yang}@uiowa.edu, zhanglj@lamda.nju.edu.cn, jinrong.jr@alibaba-inc.com

## Abstract

In this paper, we focus on improving the performance of the Nyström based kernel SVM. Although the Nyström approximation has been studied extensively and its application to kernel classification has been exhibited in several studies, there still exists a potentially large gap between the performance of a classifier learned with the Nyström approximation and that learned with the original kernel. In this work, we make novel contributions to bridge this gap without increasing the training cost too much, by proposing a refined Nyström based kernel classifier. We adopt a two-step approach: in the first step we learn a sufficiently good dual solution, and in the second step we use the obtained dual solution to construct a new set of bases for the Nyström approximation and re-train a refined classifier. Our approach to learning a good dual solution is based on a sparse-regularized dual formulation with the Nyström approximation, which can be solved with the same time complexity as the standard formulation. We justify our approach by establishing a theoretical guarantee on the error of the learned dual solution in the first step with respect to the optimal dual solution under appropriate conditions. The experimental results demonstrate that (i) the dual solution obtained by our approach in the first step is closer to the optimal solution and yields improved prediction performance, and (ii) the second step, which uses the obtained dual solution to re-train the model, further improves the performance.

## Introduction

Kernel methods (Schölkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004), e.g., the Support Vector Machine (SVM), are among the most effective learning methods widely used in classification and regression. Thanks to the kernel trick, low-dimensional features in the original space are mapped into a high-dimensional feature space without explicitly computing inner products between the high-dimensional features. However, as the scale of data continues to grow, kernel methods suffer from the severe problem of computing and maintaining a tremendously large kernel matrix, rendering it prohibitive or even impossible to learn a kernel classifier in real applications with big data. To speed up the training of a kernel classifier for big data, several fast kernel approximation methods have been developed, including the Nyström method (Williams and Seeger, 2001; Drineas and Mahoney, 2005) and random Fourier features (Rahimi and Recht, 2007), among others (Le et al., 2013; Yang et al., 2015b). Recently, the authors of (Yang et al., 2012) studied the two approximation schemes in a unified framework and demonstrated that the Nyström method can achieve better performance than random Fourier features in certain scenarios, e.g., when there is a large eigen-gap in the kernel matrix (Yang et al., 2012) or when the eigen-values follow a power-law distribution (Jin et al., 2013).
In this work, we focus on further improving the performance of the Nyström based kernel classifier while increasing the training time only by roughly a factor of two. The Nyström method for approximating a kernel matrix works by constructing a set of bases, referred to as landmark points, and then building an approximation based on the kernel similarities between the landmark points and all data points (including the landmark points themselves). It has been observed that the selection of the landmark points affects the performance of the Nyström method (Kumar et al., 2009a). Nevertheless, there still exists a gap between the performance of the Nyström based kernel classifier and that of the optimal kernel classifier, and it remains an important problem to bridge this performance gap while maintaining efficiency. Recently, a refined Nyström based kernel SVM has been proposed that first learns an approximate dual solution close enough to the optimal dual solution of the original kernel SVM and then uses the approximate dual solution to construct a set of landmark points that improve the performance of the Nyström based kernel classifier (Hsieh et al., 2014b).

In this paper, we propose an improved method to obtain a good dual solution in the first step. Our approach is motivated by the fact that the original kernel classifier usually has a small number of support vectors, indicating that the optimal dual solution is a sparse vector. However, when a Nyström approximation is used, some examples become more difficult to classify, leading to an increased number of support vectors, i.e., a denser dual solution. Therefore, in order to improve the quality of the dual solution, we introduce a sparsity-inducing regularizer into the dual formulation defined with the Nyström approximation. We justify the proposed approach by a theoretical analysis of the error bound of the obtained dual solution under incoherence and restricted eigen-value conditions. Empirically, we observe that the proposed approach achieves better performance than both the original Nyström based kernel classifier and the refined Nyström based kernel classifier that uses a divide-and-conquer approach to obtain an approximate dual solution.

The main contributions of the paper are summarized as follows: (i) we study a refined Nyström based kernel SVM and propose a new pipeline that first solves a sparse-regularized dual formulation with the approximated kernel and then utilizes the obtained dual solution to re-train a refined Nyström based kernel classifier; and (ii) we justify the proposed approach by a theoretical analysis and extensive empirical studies.

## Related Work

In this section, we review related work on approximate kernel methods, the Nyström method, sparse learning, and randomized dimensionality reduction.

Due to the exceedingly high cost of computing and maintaining a big kernel matrix for large-scale data, several fast approximate kernel methods have been developed, with random Fourier features (Rahimi and Recht, 2007), Fastfood (Le et al., 2013), and the Nyström method (Drineas and Mahoney, 2005) as representatives. Yang et al. (2012) studied random Fourier features and the Nyström method in a unified framework from the perspective of functional approximation.
They demonstrated that random Fourier features are equivalent to learning a predictive function using a set of basis functions generated independently of the data, while the Nyström method is equivalent to learning a predictive function using a set of data-dependent basis functions. The Nyström method for approximating a positive semi-definite (PSD) matrix has been studied extensively in recent years (Drineas and Mahoney, 2005; Kumar et al., 2009b; Yang et al., 2012; Zhang et al., 2008; Gittens, 2011; Talwalkar and Rostamizadeh, 2010; Gittens and Mahoney, 2013; Jin et al., 2013). Nevertheless, when employed in kernel methods for classification and regression, there still exists a gap between the performance of the Nyström based kernel classifier and the optimal kernel classifier. Recently, (Hsieh et al., 2014b) proposed a refined Nyström based kernel classifier based on a two-step approach: in the first step an approximate dual solution is learned, and in the second step a set of new landmark points is constructed using the approximate dual solution obtained in the first step. Our work differs from that work in how the approximate dual solution is learned, as described in the introduction.

Sparse learning has been researched tremendously in machine learning and statistics. Almost all existing studies center on imposing a sparsity-inducing regularizer (e.g., the ℓ1 norm) on the model (i.e., the primal solution). In this work, we impose an ℓ1 norm on the dual solution, motivated by the fact that in kernel SVM many examples can be non-support vectors, so their corresponding dual variables are zero. The most relevant work is (Xu et al., 2015), which studied sparse kernel regression with the Nyström approximation. It was brought to our attention that the proposed approach for learning a good dual solution in the first step is similar to a recent work on dual recovery analysis for randomized dimensionality reduction in high-dimensional learning problems (Yang et al., 2015a), which employed a JL transform to reduce high-dimensional examples into a low-dimensional space and then solved a sparse-regularized dual formulation. Although the two approaches share the same insight on the sparsity-inducing regularizer on the dual variables, we emphasize that the present work makes non-trivial contributions in the analysis, since the Nyström method is not a JL transform; therefore the analysis in (Yang et al., 2015a) based on the JL lemma cannot carry over to the Nyström based kernel method.

## The Problem and Proposed Algorithm

### Preliminaries and Motivation

Let $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$ denote a set of training examples, where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the feature vector and $y_i \in \{+1, -1\}$ denotes the class label. Let $\kappa(\cdot, \cdot)$ denote a valid kernel function and $\mathcal{H}_\kappa$ denote the Reproducing Kernel Hilbert Space endowed with $\kappa(\cdot, \cdot)$. The kernel SVM solves the following optimization problem:

$$\min_{f \in \mathcal{H}_\kappa}\; \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i) + \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 \tag{1}$$

where $\ell(z, y) = \max(0, 1 - yz)^p$ ($p = 1$ or $2$) is the hinge loss or the squared hinge loss. Using the convex conjugate function, the above problem can be turned into a dual problem:

$$\alpha^* = \arg\max_{\alpha \in \Omega_n}\; -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top K \alpha \tag{2}$$

where $\Omega_n$ is the domain of the dual solution, $K \in \mathbb{R}^{n\times n}$ is the kernel matrix, and $\ell_i^*(\alpha_i)$ is the convex conjugate of $\ell(z, y_i)$ with respect to $z$. For example, if $\ell(z, y_i) = \max(0, 1 - y_i z)$, then $\ell_i^*(\alpha_i) = \alpha_i y_i$ and $\Omega_n = \{\alpha \in \mathbb{R}^n : -1 \le \alpha \circ \mathbf{y} \le 0\}$.
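For completeness, the conjugate claimed above can be verified with a short calculation (our own worked derivation, not reproduced from the paper; the squared hinge loss is handled analogously):

```latex
% Conjugate of the hinge loss \ell(z, y_i) = \max(0, 1 - y_i z) with respect to z.
% Substitute u = y_i z (so z = y_i u, using y_i^2 = 1):
\ell_i^*(\alpha_i) = \sup_{z}\big(\alpha_i z - \max(0, 1 - y_i z)\big)
                   = \sup_{u}\big(\alpha_i y_i u - \max(0, 1 - u)\big).
% For u \le 1 the inner term is (\alpha_i y_i + 1)u - 1: bounded above iff \alpha_i y_i \ge -1.
% For u \ge 1 the inner term is \alpha_i y_i u: bounded above iff \alpha_i y_i \le 0.
% Both branches attain their supremum \alpha_i y_i at u = 1, hence
\ell_i^*(\alpha_i) =
  \begin{cases}
    \alpha_i y_i & \text{if } -1 \le \alpha_i y_i \le 0,\\
    +\infty      & \text{otherwise,}
  \end{cases}
\qquad \Omega_n = \{\alpha \in \mathbb{R}^n : -1 \le \alpha \circ \mathbf{y} \le 0\}.
```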
When the number $n$ of training examples is large, it is prohibitive to compute and maintain the kernel matrix $K$. The Nyström method addresses this computation and memory limitation by computing a low-rank approximation of $K$ from a small subset of columns of $K$ or from a constructed set of landmark points. In particular, let $\mathcal{L}_m = \{\mathbf{c}_1, \ldots, \mathbf{c}_m\}$, where $\mathbf{c}_i \in \mathbb{R}^d$, denote a set of $m$ landmark points, let $K_m \in \mathbb{R}^{m\times m}$ denote the sub-kernel matrix among the points in $\mathcal{L}_m$, and let $K_b \in \mathbb{R}^{n\times m}$ denote the sub-kernel matrix between all examples and the landmark points. Then the Nyström approximation of $K$ is computed by

$$\tilde{K} = K_b K_m^{\dagger} K_b^\top \tag{3}$$

where $K_m^{\dagger}$ denotes the pseudo-inverse of $K_m$. When applying the Nyström approximation to the dual problem, we obtain the following optimization problem:

$$\tilde{\alpha} = \arg\max_{\alpha\in\Omega_n}\; -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \big(K_b K_m^{\dagger} K_b^\top\big)\alpha \tag{4}$$

which is equivalent to the dual problem obtained from a short feature representation of the training examples,

$$\tilde{\mathbf{x}}_i = (K_m^{\dagger})^{1/2}\big(\kappa(\mathbf{x}_i, \mathbf{c}_1), \ldots, \kappa(\mathbf{x}_i, \mathbf{c}_m)\big)^\top, \quad i = 1, \ldots, n. \tag{5}$$

Let $\tilde{X} = (\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_n) \in \mathbb{R}^{m\times n}$; it is straightforward to verify that $\tilde{K} = \tilde{X}^\top \tilde{X}$, and problem (4) can be written as

$$\max_{\alpha\in\Omega_n}\; -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \tilde{X}^\top\tilde{X}\alpha \tag{6}$$

which can be solved efficiently using stochastic optimization algorithms developed for linear methods (Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Lin et al., 2014). The overall running time of the Nyström based kernel classifier consists of the time for computing the short feature representation of all training data, which is $O(m^2 n + m^3)$, and the time for optimization. Hence, the Nyström based kernel classifier can be trained efficiently when $m$ is relatively small. On the other hand, the generalization performance of the Nyström based kernel classifier is of order $O(1/\sqrt{m})$ for general data, which can be improved to $O(1/m)$ for some special data (Yang et al., 2012). Therefore, with a small value of $m$, there still exists a potentially large gap between the performance of the Nyström based kernel classifier and the optimal kernel classifier.

In this paper, we propose a refined Nyström based kernel SVM to bridge the gap between the Nyström based kernel classifier and the optimal kernel classifier. To motivate the proposed approach, we first note that given the optimal dual solution $\alpha^*$, the optimal kernel classifier can be written as $f^*(\cdot) = -\frac{1}{\lambda n}\sum_{i=1}^n [\alpha^*]_i\,\kappa(\mathbf{x}_i, \cdot)$. If we know that $\alpha^*$ is $m$-sparse with $m \ll n$ and choose the support vectors as landmark points, i.e., $\mathcal{L}_m^* = \{\mathbf{c}_1^*, \ldots, \mathbf{c}_m^*\} = \{\mathbf{x}_i : [\alpha^*]_i \neq 0\}$, then we can solve the following optimization problem

$$\min_{f\in\mathcal{H}_\kappa^{m}}\; \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i) + \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 \tag{7}$$

where $\mathcal{H}_\kappa^{m} = \{f : f = \sum_{i=1}^{m}\beta_i\,\kappa(\mathbf{c}_i^*, \cdot)\}$. As demonstrated in (Yang et al., 2012), problem (7) is equivalent to using the Nyström approximation constructed with the landmark points in $\mathcal{L}_m^*$. It is not difficult to show that under the discussed conditions the optimal solution to the above problem is also the optimal solution to (1); the details are given in the supplement. From another perspective, following Theorem 2 in (Hsieh et al., 2014b), the error of $\tilde{\alpha}$ is bounded by

$$\|\tilde{\alpha} - \alpha^*\|_2 \le \frac{1}{n\lambda_{nnz}}\|K_m\|_2\,(1 + \|K\|_2)\,\Delta, \qquad \Delta = \sum_{i=1}^n |[\alpha^*]_i|\,\|K_i - \tilde{K}_i\|_2 \tag{8}$$

where $\lambda_{nnz}$ is the minimum nonzero eigen-value of $K/n$ and $K_i$ denotes the $i$-th column of $K$. It indicates that the quality of $\tilde{\alpha}$ is mostly affected by the small portion of columns of $K$ with larger $|[\alpha^*]_i|$.
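As a concrete illustration of Eqns. (3)-(5), here is a minimal single-machine sketch of the Nyström feature construction (a hypothetical NumPy helper of our own, assuming an RBF kernel; the paper's actual implementation is distributed, as described in the experiments):

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Pairwise RBF kernel kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, landmarks, gamma, tol=1e-10):
    """Return the n x m matrix whose i-th row is x_tilde_i^T from Eqn. (5),
    i.e. x_tilde_i = (K_m^+)^{1/2} (kappa(x_i, c_1), ..., kappa(x_i, c_m))^T,
    so that the Gram matrix of the rows equals K_b K_m^+ K_b^T from Eqn. (3)."""
    Km = rbf_kernel(landmarks, landmarks, gamma)   # m x m kernel among landmarks
    Kb = rbf_kernel(X, landmarks, gamma)           # n x m kernel to all examples
    s, U = np.linalg.eigh(Km)                      # K_m = U diag(s) U^T
    inv_sqrt = np.where(s > tol, 1.0 / np.sqrt(np.maximum(s, tol)), 0.0)
    Km_pinv_sqrt = (U * inv_sqrt) @ U.T            # symmetric square root of K_m^+
    return Kb @ Km_pinv_sqrt

# Uniform landmark selection for the standard Nystrom classifier, e.g.:
#   idx = np.random.choice(len(X), size=m, replace=False)
#   X_tilde = nystrom_features(X, X[idx], gamma)   # then train a linear SVM on X_tilde
```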
The bound in (8) suggests a two-step approach to improving the performance of the Nyström based kernel classifier: in the first step we learn an approximate dual solution that is close to $\alpha^*$, and in the second step we construct a set of landmark points aiming to minimize $\Delta$, using the approximate dual solution in place of $\alpha^*$. (Hsieh et al., 2014b) also implements this two-step approach, learning an approximate dual solution with the divide-and-conquer method (Hsieh et al., 2014a), which divides all examples into a number of groups and solves a small kernel SVM for each group. However, there is no guarantee on the quality of the dual solution obtained this way. Below, we propose a more solid approach to learning a refined Nyström based kernel SVM.

### A Refined Nyström Based Kernel SVM

Our approach is inspired by the fact that in the optimal kernel classifier the number of support vectors is usually small compared to the total number of examples, indicating that the optimal dual solution is a sparse vector. However, when a Nyström approximation is used, some examples become more difficult to classify, leading to an increased number of support vectors, i.e., a denser dual solution. Therefore, in order to improve the quality of the dual solution, we introduce a sparsity-inducing regularizer into the dual formulation defined with the Nyström approximated kernel. In particular, we solve the following formulation to obtain an improved dual solution:

$$\hat{\alpha} = \arg\max_{\alpha\in\Omega_n}\; -\frac{1}{n}\sum_{i=1}^n\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\tilde{K}\alpha - \frac{\tau}{n}\|\alpha\|_1 \tag{9}$$

It was shown in (Yang et al., 2015a) that when the loss is the hinge loss or the squared hinge loss, adding the $\ell_1$ norm on the dual variable $\alpha$ is equivalent to using a new loss function with a reduced margin $1-\tau$, as compared with the margin $1$ used in the standard hinge loss. To see this, consider the hinge loss $\ell(z, y) = \max(0, 1 - yz)$, so that $\ell_i^*(\alpha_i) = \alpha_i y_i$ and $\Omega_n = \{\alpha\in\mathbb{R}^n : -1\le\alpha\circ\mathbf{y}\le 0\}$. With a change of variables, the new problem in (9) reduces to

$$\max_{\beta\in[0,1]^n}\; \frac{1}{n}\sum_{i=1}^n\beta_i(1-\tau) - \frac{1}{2\lambda n^2}(\beta\circ\mathbf{y})^\top\tilde{K}(\beta\circ\mathbf{y})$$

which is the dual problem of

$$\min_{\mathbf{w}\in\mathbb{R}^m}\; \frac{1}{n}\sum_{i=1}^n \max\big(0, (1-\tau) - y_i\mathbf{w}^\top\tilde{\mathbf{x}}_i\big) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$

i.e., a linear SVM in the Nyström feature space with the reduced margin $1-\tau$ in the definition of the hinge loss (a minimal illustrative solver for this step is sketched at the end of this subsection). In the next subsection, we provide a theoretical analysis of the proposed sparse-regularized dual formulation with the Nyström approximation by establishing an error bound for the obtained dual solution $\hat{\alpha}$. The above equivalence also implies that the new formulation can be solved with the same time complexity as the original formulation in (6).

Next, we briefly discuss the second step, which uses the obtained dual solution $\hat{\alpha}$ to re-train a refined Nyström based kernel classifier. The methodology is to select a new set of landmark points using $\hat{\alpha}$ and then learn a Nyström based kernel classifier with the selected landmark points. In Hsieh et al. (2014b), the authors suggested an approach based on weighted k-means clustering. The rationale is that when the kernel function is stationary (i.e., $\kappa(\mathbf{x}_i,\mathbf{x}_j) = \kappa(\|\mathbf{x}_i-\mathbf{x}_j\|_2)$), the $\Delta$ in (8) is bounded by a quantity proportional to the square root of a weighted k-means objective whose weights are the squares of the optimal dual variables, i.e., $\sum_{i=1}^n[\alpha^*]_i^2\|\mathbf{x}_i - \mathbf{c}_{\pi_i}\|^2$, where $\pi_i = \arg\min_j\|\mathbf{x}_i - \mathbf{c}_j\|^2$. Thus, one can perform a weighted k-means with weights $[\hat{\alpha}]_i^2$, $i\in[n]$, and use the resulting cluster centers as landmark points to construct the Nyström approximation.
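To make the first step concrete, the following sketch (our own illustrative single-machine solver, not the distributed solver used in the experiments) solves (9) through its reduced-margin form by dual coordinate ascent over $\beta\in[0,1]^n$ and recovers the dual vector $\hat{\alpha}$ via the variable change $\alpha = -\beta\circ\mathbf{y}$, which is one consistent choice; setting $\tau = 0$ recovers the standard Nyström classifier (6):

```python
import numpy as np

def sparse_dual_nystrom_svm(X_tilde, y, lam, tau, epochs=20, seed=0):
    """Dual coordinate ascent for
        max_{beta in [0,1]^n}  (1/n) sum_i (1 - tau) beta_i
                               - 1/(2 lam n^2) || sum_i beta_i y_i x_tilde_i ||^2,
    the reduced-margin form of the sparse-regularized dual (9).
    Returns (alpha_hat, w): the dual solution and the primal weights in the
    Nystrom feature space (predict with sign(w @ x_tilde))."""
    n, m = X_tilde.shape
    rng = np.random.default_rng(seed)
    beta = np.zeros(n)
    u = np.zeros(m)                                # u = sum_i beta_i y_i x_tilde_i
    sq_norms = (X_tilde ** 2).sum(axis=1)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if sq_norms[i] == 0.0:
                continue
            # Closed-form maximizer of the concave single-coordinate subproblem,
            # clipped back to the box [0, 1].
            step = (lam * n * (1.0 - tau) - y[i] * (X_tilde[i] @ u)) / sq_norms[i]
            new_beta = min(1.0, max(0.0, beta[i] + step))
            u += (new_beta - beta[i]) * y[i] * X_tilde[i]
            beta[i] = new_beta
    w = u / (lam * n)
    alpha_hat = -beta * y                          # one choice of sign in the change of variables
    return alpha_hat, w
```

Each epoch costs $O(nm)$, the same as a dual coordinate ascent pass for a standard linear SVM in the Nyström feature space, which is the point of the reduced-margin reformulation.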
The weighted k-means approach, however, introduces the additional cost of clustering to find the centers and is restricted to stationary kernels. In this paper, we use a simple alternative based on a greedy principle: if $\alpha^*$ were given, we could select the examples with the largest $|[\alpha^*]_i|$ to minimize $\Delta$. In practice, we only have the approximate dual solution $\hat{\alpha}$, hence we opt for a probabilistic sampling approach that selects examples according to the probability distribution

$$\Pr(\mathbf{x}_i \text{ is selected}) = \frac{|[\hat{\alpha}]_i|}{\sum_{j=1}^n |[\hat{\alpha}]_j|}$$

which we observe to be more effective than the deterministic approach of simply selecting the examples with the largest $|[\hat{\alpha}]_i|$, and also competitive with the weighted k-means sampling approach.

### A Theoretical Analysis

We now analyze the error of $\hat{\alpha}$ and summarize the main result in a theorem at the end of this subsection. Let $S$ be the support set of $\alpha^*$ and $s = |S|$ the number of non-zero entries of $\alpha^*$. Denote by $\alpha_S$ the vector that only contains the elements of $\alpha$ in $S$. We assume $s \ll n$. Before presenting the analysis, we define two quantities regarding the kernel matrix:

$$\Delta_\infty = \frac{1}{n}\sum_{i=1}^n |[\alpha^*]_i|\,\|K_{*,i} - \tilde{K}_{*,i}\|_\infty, \qquad \gamma(s) = \min_{\|\alpha\|_2 = 1,\ \|\alpha\|_0\le s}\; \frac{1}{n}\alpha^\top K\alpha$$

where $\gamma(s) > 0$ is known as the restricted eigen-value condition on $K/n$ in the literature (Bickel et al., 2009). We denote by $\lambda_i$, $i\in[n]$, the eigen-values of $K/n$ ranked in descending order. In addition, we introduce the following coherence measure to facilitate our analysis. For any real PSD matrix $A\in\mathbb{R}^{N\times N}$, let $\tau_k(A)$ denote the coherence of a dominant $k$-dimensional invariant subspace of $A$, specified by

$$\tau_k(A) = \frac{N}{k}\max_i\,(P_{U_k})_{ii}$$

where $P_{U_k} = U_k U_k^\top$ denotes the projection onto the dominant $k$-dimensional invariant subspace of $A$ (i.e., $U_k$ contains the top-$k$ eigen-vectors of $A$ as its columns). The coherence measure has been used in matrix completion (Recht, 2011) and randomized matrix approximation (Gittens, 2011). To characterize the coherence of the kernel matrix $K$ with respect to any subset $\Omega$ of cardinality $m+s$, we define

$$\tau_k(m, s) = \max_{\Omega\subset[n],\,|\Omega| = m + s}\tau_k(K_{\Omega,\Omega})$$

where $K_{\Omega,\Omega}$ is the submatrix of $K$ with row and column indices in $\Omega$. We first present the following lemma, showing that $\hat{\alpha}-\alpha^*$ lies in the cone of dominant coordinates appearing in the definition of the restricted eigen-value.

Lemma 1. Let $S$ be the support set of $\alpha^*$ and $S^c$ its complement. By setting $\tau \ge \frac{2}{\lambda n}\sum_{i=1}^n |[\alpha^*]_i|\,\|K_i - \tilde{K}_i\|_\infty$, we have $\|[\hat{\alpha}-\alpha^*]_{S^c}\|_1 \le 3\|[\hat{\alpha}-\alpha^*]_S\|_1$.

Due to the limit of space, we put all proofs in the supplement. We assume $\ell_i^*(\alpha)$ is $\mu$-strongly convex with $\mu \ge 0$ (when $\mu = 0$ it is merely convex; the conjugate of the squared hinge loss is $1/2$-strongly convex). Following the optimality condition of $\hat{\alpha}$ for (9) and the optimality condition of $\alpha^*$ for (2), there exists $g\in\partial\|\hat{\alpha}\|_1$ such that

$$(\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\hat{\alpha}) + \frac{1}{\lambda n^2}\tilde{K}\hat{\alpha} + \frac{\tau}{n}g\Big) \le 0, \qquad (\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\alpha^*) + \frac{1}{\lambda n^2}K\alpha^*\Big) \ge 0$$

where $f(\alpha) = \frac{1}{n}\sum_{i=1}^n\ell_i^*(\alpha_i)$. Using the Cauchy-Schwarz inequality, we have

$$(\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\hat{\alpha}) + \frac{1}{\lambda n^2}\tilde{K}\hat{\alpha}\Big) + \frac{\tau}{n}\|[\hat{\alpha}]_{S^c}\|_1 \le \frac{\tau}{n}\|[\hat{\alpha}-\alpha^*]_S\|_1.$$

Thus, we have

$$\frac{\tau}{n}\|[\hat{\alpha}-\alpha^*]_S\|_1 - \frac{\tau}{n}\|[\hat{\alpha}]_{S^c}\|_1 \ \ge\ (\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\hat{\alpha}) + \frac{1}{\lambda n^2}\tilde{K}\hat{\alpha}\Big)$$
$$=\ (\hat{\alpha}-\alpha^*)^\top\Big(\nabla f(\alpha^*) + \frac{1}{\lambda n^2}K\alpha^*\Big) + (\hat{\alpha}-\alpha^*)^\top\big(\nabla f(\hat{\alpha}) - \nabla f(\alpha^*)\big) + \frac{1}{\lambda n^2}(\hat{\alpha}-\alpha^*)^\top\big(\tilde{K}\hat{\alpha} - K\alpha^*\big)$$
$$\ge\ \frac{\mu}{n}\|\hat{\alpha}-\alpha^*\|_2^2 + \frac{1}{\lambda n}\underbrace{\frac{1}{n}(\hat{\alpha}-\alpha^*)^\top(\tilde{K}-K)\alpha^*}_{A} + \frac{1}{\lambda n}\underbrace{\frac{1}{n}(\hat{\alpha}-\alpha^*)^\top\tilde{K}(\hat{\alpha}-\alpha^*)}_{B}$$

where the second inequality uses the optimality of $\alpha^*$ and the fact that $f(\alpha)$ is $(\mu/n)$-strongly convex. Next, we bound the two terms $A$ and $B$. For bounding $A$, we have $|A| \le \|\hat{\alpha}-\alpha^*\|_1\,\Delta_\infty$. For bounding $B$, we prove the following lemma in the supplement.

Lemma 2. If $m \ge 8k\,\tau_k(m, 16s)\big(16s\log d + \log\frac{k}{\delta}\big)$, then by setting $\tau \ge \frac{2}{\lambda n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i - \tilde{K}_i\|_\infty$, with probability at least $1-\delta$,
$$B \ \ge\ 2\Big(\gamma(16s) - \big(3 + \tfrac{32s}{m}\big)\lambda_{k+1}\Big)\|\hat{\alpha}-\alpha^*\|_2^2.$$
Given the above bounds for $A$ and $B$, we have

$$\|\hat{\alpha}-\alpha^*\|_2^2\Big(\lambda\mu + 2\gamma(16s) - \big(6 + \tfrac{64s}{m}\big)\lambda_{k+1}\Big) + (\lambda\tau - \Delta_\infty)\|[\hat{\alpha}]_{S^c}\|_1 \ \le\ (\lambda\tau + \Delta_\infty)\|[\hat{\alpha}-\alpha^*]_S\|_1.$$

If we further assume that $\tau \ge 2\Delta_\infty/\lambda$, it is not difficult to prove the error bound stated in the following theorem.

Theorem 1. Assume that for some $k$ and $\delta\in(0,1)$ the following conditions hold:
$$\lambda\mu + 2\gamma(16s) > \big(6 + \tfrac{64s}{m}\big)\lambda_{k+1}, \qquad m \ge 8k\,\tau_k(m, 16s)\Big(16s\log d + \log\tfrac{k}{\delta}\Big).$$
By setting $\tau \ge \frac{2}{\lambda n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i-\tilde{K}_i\|_\infty$, with probability at least $1-\delta$ we have
$$\|\hat{\alpha} - \alpha^*\|_2 \ \le\ \frac{1.5\,\lambda\sqrt{s}\,\tau}{\lambda\mu + 2\gamma(16s) - (6 + 64s/m)\lambda_{k+1}}.$$

Remark: It is interesting to compare our error bound for $\hat{\alpha}$ with the bound (8) for $\tilde{\alpha}$ of the original Nyström based formulation derived by (Hsieh et al., 2014b), and with the error bound of the dual solution obtained by the divide-and-conquer approach (Hsieh et al., 2014a). Considering $\tau = \Theta\big(\frac{1}{\lambda n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i-\tilde{K}_i\|_\infty\big)$ and comparing with (8), our error bound is proportional to $\frac{1}{n}\sum_{i=1}^n|[\alpha^*]_i|\,\|K_i-\tilde{K}_i\|_\infty$, which is smaller than the quantity $\Delta$ in (8). Moreover, the error bound for $\tilde{\alpha}$ has an inverse dependence on the minimum non-zero eigen-value of $K/n$, which in practice can be very close to zero, leading to a potentially large error in $\tilde{\alpha}$. In contrast, our error bound is inversely proportional to $\lambda\mu + 2\gamma(16s) - (6+64s/m)\lambda_{k+1}$, which depends on the minimum restricted eigen-value. In addition, the error bound in (8) depends on $\|K_m\|_2$ and $\|K\|_2$, while our error bound only depends on $s$, making the proposed refined Nyström based kernel classifier attractive when the number of support vectors is relatively small. Compared with the error bound of the approximate solution obtained by the divide-and-conquer approach (Theorem 1 of (Hsieh et al., 2014a)), which depends on how well the data is clustered and is inversely proportional to the minimum eigen-value of the kernel matrix, the bound in Theorem 1 is better.

## Experiments

### Implementation

In our experiments, we implement both the feature construction by the Nyström method and the optimization of the linear SVM in a cluster environment. The training data is randomly partitioned over 5 nodes. Given the landmark points, we construct the short feature representation in Eqn. (5) for all training examples by running the code in parallel on the 5 nodes. To solve the linear SVM problem in a distributed fashion, we use the recently proposed distributed stochastic dual coordinate ascent algorithm (Yang, 2013; Ma et al., 2015).

Figure 1: Test accuracy of linear SVM, RBF-kernel SVM, and the Nyström based kernel classifier with different numbers of samples on the six datasets.

Figure 2: Test accuracy of the sparse-regularized Nyström based kernel classifier.

### Experimental Results

In this section, we present empirical evaluations of the proposed refined Nyström based kernel classifier on six real-world datasets, namely usps, letter, ijcnn1, webspam, cod-rna and covtype, using the versions available on the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Table 1 summarizes the statistics of these datasets.

Table 1: Statistics of datasets

| | usps | letter | ijcnn1 | webspam | cod-rna | covtype |
|---|---|---|---|---|---|---|
| #Training | 7,291 | 12,000 | 91,701 | 280,000 | 271,617 | 464,810 |
| #Testing | 2,007 | 6,000 | 49,990 | 70,000 | 59,535 | 116,202 |
| #Features | 256 | 16 | 22 | 254 | 8 | 54 |

We run linear SVM and kernel SVM using LIBLINEAR and LIBSVM, respectively. The kernel used in the experiments is the RBF kernel and the loss function is the hinge loss. Through cross-validation, we choose the best parameter $C$ from $2^{[-6:1:6]}$ and the best RBF kernel parameter $\gamma$ from $2^{[-6:2:6]}$. For the methods that involve randomness, the results are averaged over five random trials of sampling.
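Putting the pieces together, the following is a schematic single-machine sketch of the evaluated pipeline (sp-pro-nys). It reuses the hypothetical helpers `nystrom_features` and `sparse_dual_nystrom_svm` from the earlier sketches, and it assumes the retraining step uses the standard formulation ($\tau=0$); the actual experiments use the distributed implementation described above:

```python
import numpy as np

def refined_nystrom_svm(X, y, m, gamma, lam, tau, seed=0):
    """Two-step refined Nystrom SVM (sp-pro-nys), schematically:
    (1) uniform landmarks + sparse-regularized dual (9)  ->  alpha_hat;
    (2) resample m landmarks with Pr(x_i) proportional to |alpha_hat_i|, retrain."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Step 1: uniform landmarks, solve the sparse-regularized (reduced-margin) dual.
    idx0 = rng.choice(n, size=m, replace=False)
    Z0 = nystrom_features(X, X[idx0], gamma)
    alpha_hat, _ = sparse_dual_nystrom_svm(Z0, y, lam, tau)

    # Step 2: probabilistic landmark selection, Pr(x_i) = |alpha_hat_i| / sum_j |alpha_hat_j|
    # (assumes at least m nonzero entries in alpha_hat), then retrain with tau = 0.
    p = np.abs(alpha_hat)
    p = p / p.sum()
    idx1 = rng.choice(n, size=m, replace=False, p=p)
    landmarks = X[idx1]
    Z1 = nystrom_features(X, landmarks, gamma)
    _, w = sparse_dual_nystrom_svm(Z1, y, lam, tau=0.0)

    def predict(X_test):
        return np.sign(nystrom_features(X_test, landmarks, gamma) @ w)
    return predict
```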
We first compare the original Nyström based kernel classifier with different numbers of samples against linear SVM and kernel SVM, with results shown in Figure 1. For the original Nyström approximation, we use uniform sampling to select examples from the training data. We can see that as the number of samples $m$ for the Nyström approximation increases, the test accuracy increases monotonically. However, there still exists a potentially large gap in performance compared with the optimal kernel classifier. For example, the optimal kernel classifier outperforms the Nyström based kernel classifier with $m = 4096$ by 3 percent on the webspam dataset and by 4 percent on the ijcnn1 dataset.

Next, we verify the proposed sparse-regularized dual formulation with the Nyström approximation. The test accuracy is evaluated on the model constructed from the obtained dual variables; comparing it with the standard Nyström based kernel classifier allows us to see the effect of the added sparsity-inducing regularizer. The results are shown in Figure 2 with the value of $\tau$ varying from 0 to 0.9 (when $\tau > 1$, the problem yields the trivial solution with the optimal model being zero; to avoid clutter, we only show one curve on the covtype dataset). When $\tau = 0$, the formulation reduces to the standard Nyström based kernel classifier. From the results, we can see that adding the sparsity-inducing regularizer to the dual formulation with the Nyström approximation can boost the performance. For example, when $m = 4096$ the performance of the Nyström based kernel classifier is improved by 2 percent on the ijcnn1 dataset.

Next, we examine the performance of the refined Nyström based kernel SVM, which uses the dual solution of the sparse-regularized dual formulation to construct a new set of landmark points and re-train a Nyström based kernel classifier. For each value of $\tau$, we optimize the sparse-regularized dual formulation to obtain a dual solution, then use this dual solution to construct the same number of landmark points and compute a new Nyström approximation for learning a new classifier. We report the results of the probabilistic approach for constructing the landmark points (referred to as sp-pro-nys), which is described earlier. For baselines, we include the model directly constructed from the dual solution obtained in the first step (referred to as sp) and the approach that uses the divide-and-conquer method (Hsieh et al., 2014a) to obtain a dual solution and then re-trains a Nyström based kernel classifier using weighted k-means centers as the landmark points, as suggested in (Hsieh et al., 2014b); this approach is referred to as dc-wkm-nys. Note that the divide-and-conquer approach requires clustering the training data in order to obtain a partition. We follow the idea in (Hsieh et al., 2014b) and use standard k-means clustering to partition the data instead of the more expensive kernel k-means clustering suggested in (Hsieh et al., 2014a). The results are shown in Figure 3. From the results we can see that (i) the second step, which re-trains a new Nyström based kernel classifier using the dual solution obtained in the first step, further improves the performance; and (ii) the proposed new pipeline outperforms the divide-and-conquer approach followed by the weighted k-means sampling approach for constructing a new Nyström approximation.

Finally, we compare the training time of linear SVM, kernel SVM, the standard Nyström based kernel classifier and the refined Nyström based kernel classifier.
We report the results on two datasets, webspam and cod-rna, with $m = 1024$ in Figure 4. The figure shows that the training time of the Nyström based classifiers on both datasets is much less than that of kernel SVM. On the other hand, the training time of the refined Nyström based classifier is comparable to that of the standard Nyström method.

Figure 3: Test accuracy of the refined Nyström based kernel classifier (sp-pro-nys). The value of m is set to 1024 in the proposed algorithm. For the divide-and-conquer approach, the x-axis denotes 1 − nc/100, where nc denotes the number of clusters used in the divide-and-conquer approach. We did not report the result on covtype because the divide-and-conquer approach needs to solve multiple kernel SVMs, one per partition; when the number of partitions is small, each kernel SVM is still expensive.

Figure 4: Training time of linear SVM, kernel SVM, the standard Nyström based classifier, and the refined Nyström based classifier for m = 1024 on two datasets, webspam and cod-rna.

## Conclusions

In this paper, we have considered improving the performance of the Nyström based kernel SVM. We proposed a fast and accurate refined Nyström based kernel classifier that consists of two steps: in the first step we learn an accurate dual solution based on a sparse-regularized dual formulation with the Nyström approximation, and in the second step we use the obtained dual solution to re-train a Nyström based kernel classifier. We established an error bound for the dual solution obtained in the first step, which is better than previous theoretical results. The empirical evaluations on various datasets further demonstrate the effectiveness of the proposed algorithm.

## Acknowledgements

This work was partially supported by NSF IIS-1463988 and NSF IIS-1545995.

## References

Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4), 2009.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, pages 2153-2175, 2005.

Alex Gittens. The spectral norm error of the naive Nyström extension. CoRR, 2011.

Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. CoRR, abs/1303.1849, 2013.

Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. A divide-and-conquer solver for kernel support vector machines. In ICML, pages 566-574, 2014a.

Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. Fast prediction for large-scale kernel machines. In NIPS, pages 3689-3697, 2014b.

Rong Jin, Tianbao Yang, Mehrdad Mahdavi, Yu-Feng Li, and Zhi-Hua Zhou. Improved bounds for the Nyström method with application to kernel classification. IEEE Transactions on Information Theory, 59(10):6939-6949, 2013.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315-323, 2013.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate spectral decomposition. In ICML, pages 553-560, 2009a.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the Nyström method. In AISTATS, pages 304-311, 2009b.

Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood - approximating kernel expansions in loglinear time. In ICML, 2013.
Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In NIPS, pages 3059-3067, 2014.

Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML, pages 1973-1982, 2015.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177-1184, 2007.

Benjamin Recht. A simpler approach to matrix completion. JMLR, 12:3413-3430, 2011.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, pages 567-599, 2013.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Ameet Talwalkar and Afshin Rostamizadeh. Matrix coherence and the Nyström method. In UAI, 2010.

Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682-688, 2001.

Zenglin Xu, Rong Jin, Bin Shen, and Shenghuo Zhu. Nyström approximation for sparse kernel methods: Theoretical analysis and empirical evaluation. In AAAI, pages 3115-3121, 2015.

Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, pages 629-637, 2013.

Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pages 485-493, 2012.

Tianbao Yang, Lijun Zhang, Rong Jin, and Shenghuo Zhu. Theory of dual-sparse regularized randomized reduction. CoRR, 2015a.

Zichao Yang, Andrew Gordon Wilson, Alexander J. Smola, and Le Song. A la carte - learning fast kernels. In AISTATS, 2015b.

Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In ICML, 2008.