# Asynchronous Doubly Stochastic Sparse Kernel Learning

Bin Gu,¹ Xin Miao,² Zhouyuan Huo,¹ Heng Huang¹*
¹Department of Electrical & Computer Engineering, University of Pittsburgh, USA
²Department of Computer Science and Engineering, University of Texas at Arlington, USA
big10@pitt.edu, xin.miao@mavs.uta.edu, zhouyuan.huo@pitt.edu, heng.huang@pitt.edu

*To whom all correspondence should be addressed.
Copyright 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Kernel methods have achieved tremendous success in the past two decades. In the current big data era, however, data collection has grown tremendously, and existing kernel methods are not scalable enough at either the training or the predicting step. To address this challenge, in this paper we first introduce a general sparse kernel learning formulation based on the random feature approximation, where the loss functions are possibly non-convex. We then propose a new asynchronous parallel doubly stochastic algorithm for large-scale sparse kernel learning (AsyDSSKL). To the best of our knowledge, AsyDSSKL is the first algorithm that combines the techniques of asynchronous parallel computation and doubly stochastic optimization. We also provide a comprehensive convergence guarantee for AsyDSSKL. Importantly, experimental results on various large-scale real-world datasets show that our AsyDSSKL method is significantly more computationally efficient at the training and predicting steps than the existing kernel methods.

Introduction

Kernel methods have achieved tremendous success in the past two decades for non-linear learning problems. There are a large number of successful and popular kernel methods (Vapnik 1998; Zhu and Hastie 2005; Zhu et al. 2004; Baudat and Anouar 2000; Vovk 2013; Li, Yang, and Xing 2005) for various learning problems. Take binary classification and regression for example. Support vector classification (SVC) (Vapnik 1998), kernel logistic regression (Zhu and Hastie 2005) and 1-norm SVC (Zhu et al. 2004) are popular kernel methods for binary classification. Support vector regression (Vapnik 1998), kernel ridge regression (Vovk 2013) and kernel Lasso (Li, Yang, and Xing 2005) are popular kernel methods for regression. These kernel methods have been successfully applied to various real-world applications, such as computational biology (Schölkopf, Tsuda, and Vert 2004) and remote sensing data analysis (Camps-Valls and Bruzzone 2009).

However, traditional kernel methods need to store and compute the kernel matrix, whose size is $O(l^2)$ where $l$ is the training sample size. When $l$ is large, the kernel matrix can neither be stored in local memory nor computed. Even worse, kernel methods normally have a computational complexity of $O(l^3)$ for training. To address the scalability issue of kernel methods at the training step, several decomposition algorithms (Takahashi and Nishi 2006) have been proposed. However, even for state-of-the-art implementations (e.g., the LIBSVM software package), the observed computational complexity is $O(l^{\kappa})$ with $1 < \kappa < 2.3$ (Chang and Lin 2011a). More related work on training kernel methods is discussed in the next section. To sum up, in the current big data era, the existing kernel methods are not scalable enough at the training step.
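To make the $O(l^2)$ storage cost concrete, the short sketch below estimates the size of a dense kernel matrix for the sample sizes of the datasets used later in the experiments; the double-precision storage and the helper name are our own assumptions. For SUSY ($l = 5{,}000{,}000$) the estimate comes to roughly 180 TiB, far beyond local memory.

```python
# Back-of-the-envelope memory cost of storing a dense l x l kernel matrix in
# double precision (8 bytes per entry -- our assumption, not stated in the paper).
def kernel_matrix_gib(l, bytes_per_entry=8):
    """Memory needed for an l-by-l dense kernel matrix, in GiB."""
    return l * l * bytes_per_entry / 2**30

# Sample sizes taken from the datasets used later in the experiments (Table 3).
for name, l in [("Aloi", 108_000), ("Covtype", 581_012), ("SUSY", 5_000_000)]:
    print(f"{name:10s} l = {l:>9,d}   kernel matrix ~ {kernel_matrix_gib(l):12,.0f} GiB")
```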
Besides the scalability issue at the training step, traditional kernel methods are also not scalable at the predicting step. Specifically, the computational complexity of predicting a single testing sample is $O(l)$, because the number of support vectors normally grows linearly with the sample size. Thus, the computational complexity of predicting a testing set of size similar to the training set is $O(l^2)$. To address the scalability issue of kernel methods at the predicting step, a compact (or sparse) model is preferred. The direct way to obtain a compact model is to add a sparse constraint (normally the 1-norm) on the coefficients of the model. For example, Zhu et al. (2004) imposed the 1-norm on the SVC formulation. Yen et al. (2014) imposed the 1-norm on the random features. Li, Yang, and Xing (2005) reformulated Lasso into a form isomorphic to SVC, which generates a sparse solution in the nonlinear feature space. However, these methods are not scalable at the training step.

To address the scalability issues of kernel methods at the training and predicting steps, in this paper we first introduce a general sparse kernel learning formulation with the random feature approximation, where the loss functions are possibly non-convex. Then, we propose a new asynchronous parallel doubly stochastic algorithm for sparse kernel learning (AsyDSSKL). We believe this is important for kernel methods for four reasons. 1) Generality: AsyDSSKL works for a general class of sparse kernel methods based on the random feature approximation, where the loss function is possibly non-convex. 2) Efficient computation: The pivotal step of AsyDSSKL is to compute the doubly stochastic gradient on a mini-batch of samples and a selected coordinate, which is quite efficient. 3) Comprehensive convergence guarantees: AsyDSSKL achieves a sublinear rate when the smooth loss functions are convex or non-convex, and a linear (geometric) convergence rate when the objective function has the optimal strong convexity. 4) Strong empirical performance: Experimental results on various large-scale real-world datasets show that AsyDSSKL is significantly more scalable and efficient at the training and predicting steps than the existing kernel methods.

Novelties. The main novelties of this paper are summarized as follows.
1. To the best of our knowledge, our AsyDSSKL is the first algorithm that combines the techniques of asynchronous parallel computation and doubly stochastic optimization. Although our objective function (4) can be solved by (Liu and Wright 2015), their method is stochastic only on coordinates, which is not scalable to large sample sizes, whereas our AsyDSSKL is stochastic both on samples and on coordinates.
2. We provide the convergence rates of AsyDSSKL in different settings, which is nontrivial even though our proofs follow (Liu and Wright 2015). In particular, we provide the convergence rate of AsyDSSKL in the non-convex setting, which is new and not included in (Liu and Wright 2015).

Notations. To make the paper easier to follow, we use the following notations.
1. $e_j$ is the vector in $\mathbb{R}^n$ whose entries are all zero except that the $j$-th coordinate equals 1.
2. $\mathcal{P}_{j,g_j}$ is the componentwise proximal operator defined as $\mathcal{P}_{j,g_j}(w') = \arg\min_{w} \frac{1}{2}\|w - w'\|^2 + g_j((w)_j)$.
3. $\|\cdot\|$ denotes the Euclidean norm, and $\|\cdot\|_\infty$ denotes the infinity norm.
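As a concrete illustration of the componentwise proximal operator $\mathcal{P}_{j,g_j}$, the sketch below instantiates it with $g_j(u) = \lambda|u|$ (an explicitly weighted version of the 1-norm sparse constraint used later in formulation (4)); in this case the operator reduces to soft-thresholding of the single coordinate $j$. The function name and the weight $\lambda$ are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def prox_j_l1(w, j, lam):
    """Componentwise proximal operator P_{j, g_j} with g_j(u) = lam * |u|:
    argmin_z 0.5 * ||z - w||^2 + lam * |z_j|.
    Only coordinate j changes (soft-thresholding); every other coordinate of
    the minimizer equals the corresponding coordinate of w."""
    z = w.copy()
    z[j] = np.sign(w[j]) * max(abs(w[j]) - lam, 0.0)
    return z

w = np.array([0.8, -0.05, 1.5])
print(prox_j_l1(w, j=1, lam=0.1))  # coordinate 1 is shrunk to 0, others unchanged
```

In Algorithm 2 below, this operator is applied with the penalty scaled by $\gamma / L_{max}$, so under this assumed form of $g$ the effective threshold on the selected coordinate would be $\gamma\lambda / L_{max}$.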
Related Works

In this section, we briefly review the techniques of parallel computation and stochastic optimization for training kernel methods.

Parallel computation. Parallel computation, as a basic big data technique, can be roughly divided into synchronous and asynchronous models according to whether a reading or writing lock is used. Most parallel algorithms for kernel methods are based on the synchronous model due to its simplicity. For example, Chang and Lin (2011b) proposed a modified LIBSVM implementation that uses the synchronous parallel technique to compute a column of the kernel matrix. Zhao and Magoules (2011) and You et al. (2015) proposed synchronous parallel SVM algorithms for the parallel environment with shared memory. Asynchronous computation is much more efficient than synchronous computation, because it keeps all computational resources busy all the time. However, the convergence analysis for asynchronous computation is much more difficult due to inconsistent reading and writing. To the best of our knowledge, the only work on asynchronous parallel kernel methods is the asynchronous parallel greedy coordinate descent algorithm for SVC (You et al. 2016). However, their work does not exploit another important big data technology, i.e., stochastic optimization, and is not scalable at the predicting step.

Stochastic optimization. Stochastic optimization is another big data technique, which can be stochastic on samples and on coordinates. Specifically, Shalev-Shwartz, Singer, and Srebro (2007) and Kivinen, Smola, and Williamson (2004) proposed stochastic algorithms for kernel SVMs which are stochastic on samples. Shalev-Shwartz and Zhang (2013) proposed a stochastic dual coordinate ascent algorithm for kernel methods which is stochastic on coordinates. However, they are not scalable enough because their computational complexities are $O(T^2)$, where $T$ is the iteration number. To address this issue, (Dai et al. 2014) proposed a doubly stochastic kernel method, which is stochastic both on samples and on random features. Random feature mapping (Rahimi and Recht 2007) (a method of approximating the kernel function) is an effective way to scale up kernel methods. However, as analyzed in (Rahimi and Recht 2009), to achieve a good generalization ability, the number of random features needs to be $O(l)$. Thus, the predicting step of (Rahimi and Recht 2009) is not scalable due to the huge number of random features in the model. In addition, (Rahimi and Recht 2009) does not exploit the parallel computation techniques mentioned above.

Asynchronous Doubly Stochastic Sparse Kernel Learning

In this section, we first give a brief review of random features, and then introduce a general sparse kernel learning formulation based on the random feature approximation, where the loss functions are possibly non-convex. Finally, we propose our AsyDSSKL algorithm.

Random Features

Given a training set $S = \{(X_i, y_i)\}_{i=1}^{l}$ with $X_i \in \mathbb{R}^n$, and $y_i \in \{+1, -1\}$ for binary classification or $y_i \in \mathbb{R}$ for regression. For kernel methods, the optimal model can be represented as $f(X_i) = \sum_{k=1}^{l} \alpha_k K(X_k, X_i)$ according to the representer theorem (Schölkopf, Herbrich, and Smola 2001). Thus, one benefit of kernel methods is that we can learn a nonlinear model directly using kernel functions, instead of explicitly computing the inner product in the kernel space $\mathcal{H}$. However, the number of non-zero $\alpha_k$ often increases linearly with the data size (Steinwart and Christmann 2008).
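The random feature mapping mentioned above (and formalized in (1)-(2) below) can be sketched for the Gaussian (RBF) kernel as follows; the choice of kernel, the bandwidth parameter, and the function name are our own illustrative assumptions, with $h = (W, b)$ playing the role of a random feature drawn from $p(h)$.

```python
import numpy as np

def rbf_random_features(X, num_features, bandwidth, seed=0):
    """Random Fourier feature map phi(x) approximating the Gaussian kernel
    k(x, x') = exp(-bandwidth * ||x - x'||^2) (Rahimi and Recht 2007).
    Frequencies are drawn from N(0, 2*bandwidth*I) and phases from U[0, 2*pi]."""
    rng = np.random.default_rng(seed)
    n_dim = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * bandwidth), size=(n_dim, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Sanity check: inner products of the features approach the exact kernel values.
X = np.random.default_rng(1).normal(size=(5, 10))
Z = rbf_random_features(X, num_features=4000, bandwidth=0.1)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
print(np.abs(Z @ Z.T - np.exp(-0.1 * sq_dists)).max())  # shrinks as num_features grows
```

A linear model on top of such features, $f(X_i) = \sum_{h \in A} w(h)\,\phi_h(X_i)$, is exactly the form that the sparse formulation in the next subsection regularizes.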
Even worse, as mentioned previously, kernel methods need to store and compute the kernel matrix with the size $O(l^2)$. To address the scalability issue, random feature mapping (Rahimi and Recht 2007) was proposed to explicitly approximate the kernel function $K(\cdot,\cdot)$. Specifically, by Mercer's theorem (Mercer 1909), for any positive semi-definite kernel $K(\cdot,\cdot)$, there exist a kernel space $\mathcal{H}$, a probability distribution $p(h)$ and a random feature mapping $\phi_h(x)$ such that

$$K(X_i, X_j) = \int_{h \in \mathcal{H}} p(h)\, \phi_h(X_i)\, \phi_h(X_j)\, dh \tag{1}$$

Based on the random features, the kernel function (1) is approximated by $K(X_i, X_j) \approx \sum_{h \in A} \phi_h(X_i)\phi_h(X_j)$, where $A$ is a set of random features drawn from the kernel space $\mathcal{H}$ according to the distribution $p(h)$. Thus, the model function can be approximated by

$$f(X_i) \approx \sum_{h \in A} w(h)\, \phi_h(X_i) \tag{2}$$

However, as analyzed in (Rahimi and Recht 2009; Yen et al. 2014), to achieve a good generalization ability, the number of random features needs to be $O(l)$. Because the model size of (2) grows linearly with $l$, it is highly desirable to have a sparse learning algorithm, which would benefit the efficiency of the predicting step.

Sparse Kernel Learning with Random Features

In the infinite-dimensional kernel space, kernel methods are normally written as (3) based on (1):

$$\min_{w(h),\, h \in \mathcal{H}} Q(w) = \frac{1}{l}\sum_{i=1}^{l} L\left(\int_{h \in \mathcal{H}} p(h)\, w(h)\, \phi_h(X_i)\, dh,\; y_i\right) + g(w) \tag{3}$$

To produce a compact model, Yen et al. (2014) proposed a sparse kernel learning formulation (4) based on the random feature approximation. They proved in Corollary 1 of (Yen et al. 2014) that $\min_{w(h),\, h \in A} P(w)$ is an $O(\frac{1}{|A|})$-approximation of $Q(w^*)$.

$$\min_{w(h),\, h \in A} P(w) = \frac{1}{l}\sum_{i=1}^{l} L\left(\sum_{h \in A} w(h)\, \phi_h(X_i),\; y_i\right) + g(w) \tag{4}$$

where $g(w)$ is a sparse constraint on $w$ imposed by the 1-norm, and $L(\cdot,\cdot)$ is a smooth and convex loss function. In this paper, we relax $L(\cdot,\cdot)$ to a smooth, possibly non-convex loss function. Thus, formulation (4) is a more general formulation which covers a large number of learning problems (e.g., regression, binary classification, multi-class classification and so on). Table 1 presents several commonly used smooth loss functions for binary classification and regression. Note that the correntropy induced loss (Feng et al. 2015) and the sigmoid loss (Lin 2004) are non-convex.

Table 1: Commonly used smooth loss functions for binary classification (BC) and regression (R).

| Type of function | Name of loss | Type of task | The loss function |
| --- | --- | --- | --- |
| Convex | Square loss | BC+R | $L(f(X_i), y_i) = (f(X_i) - y_i)^2$ |
| Convex | Logistic loss | BC | $L(f(X_i), y_i) = \log(1 + \exp(-y_i f(X_i)))$ |
| Convex | Smooth hinge loss | BC | $L(f(X_i), y_i) = \frac{1}{2} - z_i$ if $z_i \leq 0$; $\frac{1}{2}(1 - z_i)^2$ if $0 < z_i < 1$; $0$ if $z_i \geq 1$, where $z_i = y_i f(X_i)$ |
| Non-convex | Correntropy induced loss | R | $L(f(X_i), y_i) = \sigma^2\left(1 - e^{-(y_i - f(X_i))^2/\sigma^2}\right)$ |
| Non-convex | Sigmoid loss | BC | $L(f(X_i), y_i) = \frac{1}{1 + \exp(y_i f(X_i))}$ |

Yen et al. (2014) provided an iterative framework to solve the sparse kernel learning formulation (4). In order to incorporate parallel computation, we modify the framework as Algorithm 1, where the number of iterations is limited. Empirically, $M = 5$ achieves a good result. The key to Algorithm 1 is to design a scalable algorithm for the sparse kernel learning problem (4) that scales well in sample size and dimensionality simultaneously.

Algorithm 1 Asynchronous sparse random feature learning framework
1: Initialize $w_0 = 0$, working set $A_0 = \emptyset$, and $t = 0$.
2: for $t = 0, 1, 2, \ldots, M-1$ do
3:   All threads sample $h_1, h_2, \ldots, h_R$ i.i.d. from the distribution $p(h)$ in parallel, and add $h_1, h_2, \ldots, h_R$ into the set $A_t$.
4:   Asynchronously solve (4) by AsyDSSKL.
5:   Update $A_{t+1} = A_t \setminus \{h : w_t(h) = 0\}$ in parallel.
6: end for

AsyDSSKL Algorithm

In this section, we propose our AsyDSSKL algorithm to solve the sparse kernel learning problem (4). Our AsyDSSKL exploits asynchronous parallel computation and doubly stochastic optimization, which makes it significantly different from (Liu and Wright 2015), as discussed before. Specifically: 1) For asynchronous parallel computation, our AsyDSSKL works in a parallel environment with shared memory, such as multi-core processors and GPU-accelerators. All cores in the CPU or GPU can read and write the vector $w$ in the shared memory simultaneously without any lock. 2) For doubly stochastic optimization, we use the techniques of stochastic variance reduced gradient (SVRG) (Zhang 2004) and stochastic coordinate descent (SCD) (Tseng and Yun 2009; You et al. 2016). SVRG is an accelerated version of the stochastic gradient descent (SGD) algorithm (Shamir and Zhang 2013); it is stochastic on samples and scales well in sample size. SCD is stochastic on features and scales well in dimensionality. Thus, our AsyDSSKL can scale well in sample size and dimensionality simultaneously.

Specifically, following SVRG (Zhang 2004) and SCD (Tseng and Yun 2009; You et al. 2016), our AsyDSSKL has two layers of loops. The outer layer computes the full gradient $\nabla F(\tilde{w}^s) = \frac{1}{l}\sum_{i=1}^{l} \nabla f_i(\tilde{w}^s)$ in parallel, where the superscript $s$ denotes the $s$-th outer loop. The inner layer repeats the following three steps in parallel:

1. Read: Read the vector $w$ from the shared memory into the local memory without a reading lock. We use $\hat{w}^{s+1}_t$ to denote the value in the local memory, where the subscript $t$ denotes the $t$-th inner loop.
2. Compute: Randomly choose a mini-batch $B$ from $\{1, \ldots, l\}$ and a coordinate $j$ from $\{1, \ldots, n\}$, and locally compute $v^{s+1}_j = \frac{1}{|B|}\left(\sum_{i \in B} \nabla_j f_i(\hat{w}^{s+1}_t) - \sum_{i \in B} \nabla_j f_i(\tilde{w}^s)\right) + \nabla_j F(\tilde{w}^s)$.
3. Update: Update the coordinate $j$ of the vector $w$ in the shared memory as $w^{s+1}_{t+1} \leftarrow \mathcal{P}_{j, \frac{\gamma}{L_{max}} g_j}\left(w^{s+1}_t - \frac{\gamma}{L_{max}} e_j v^{s+1}_j\right)$ without a writing lock, where $\gamma$ is the steplength and $L_{max}$ is defined in the next section.

The detailed description of AsyDSSKL is presented in Algorithm 2. Note that $v^{s+1}_{j(t)}$ computed locally in line 9 of Algorithm 2 is an unbiased estimate of $\nabla_{j(t)} F(\hat{w}^{s+1}_t)$, because the expectation of $v^{s+1}_t$ over $B$ is equal to $\nabla F(\hat{w}^{s+1}_t)$, i.e., $\mathbb{E}\, v^{s+1}_t = \nabla F(\hat{w}^{s+1}_t)$.

Algorithm 2 Asynchronous doubly stochastic sparse kernel learning algorithm (AsyDSSKL)
Input: The steplength $\gamma$, the number of outer loop iterations $S$, and the number of inner loop iterations $m$.
Output: $w^S$.
1: Initialize $w^0 \in \mathbb{R}^n$.
2: for $s = 0, 1, 2, \ldots, S-1$ do
3:   $\tilde{w}^s \leftarrow w^s$
4:   All threads compute the full gradient $\nabla F(\tilde{w}^s) = \frac{1}{l}\sum_{i=1}^{l} \nabla f_i(\tilde{w}^s)$ in parallel.
5:   For each thread, do:
6:   for $t = 0, 1, 2, \ldots, m-1$ do
7:     Randomly sample a mini-batch $B$ from $\{1, \ldots, l\}$ with equal probability.
8:     Randomly choose a coordinate $j(t)$ from $\{1, \ldots, n\}$ with equal probability.
9:     Compute $v^{s+1}_{j(t)} = \frac{1}{|B|}\left(\sum_{i \in B} \nabla_{j(t)} f_i(\hat{w}^{s+1}_t) - \sum_{i \in B} \nabla_{j(t)} f_i(\tilde{w}^s)\right) + \nabla_{j(t)} F(\tilde{w}^s)$
10:    $w^{s+1}_{t+1} \leftarrow \mathcal{P}_{j(t), \frac{\gamma}{L_{max}} g_{j(t)}}\left(w^{s+1}_t - \frac{\gamma}{L_{max}} e_{j(t)} v^{s+1}_{j(t)}\right)$
11:   end for
12:   $w^{s+1} \leftarrow w^{s+1}_m$
13: end for

Convergence Rate Analysis of AsyDSSKL

Because AsyDSSKL runs asynchronously without any lock, inconsistent reading and writing of the vector $w$ in the shared memory can arise, which makes the convergence analysis of AsyDSSKL more challenging.
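To make the read/compute/update pattern of Algorithm 2 concrete, here is a minimal single-threaded sketch of one outer iteration. The gradient-oracle interface `grad_fi`, the explicitly weighted 1-norm penalty $\lambda\|w\|_1$, and all names are our own assumptions, and the lock-free multi-threaded execution that AsyDSSKL actually performs is deliberately omitted.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * max(abs(v) - t, 0.0)

def asydsskl_epoch_serial(grad_fi, w, n_samples, lam, gamma, L_max, m,
                          batch_size=16, seed=0):
    """Serial sketch of one outer iteration (index s) of Algorithm 2.
    grad_fi(w, idx) is assumed to return the per-sample loss gradients
    grad f_i(w), i in idx, stacked as a (len(idx), n) array; this interface and
    the weighted l1 penalty g(w) = lam * ||w||_1 are our assumptions."""
    rng = np.random.default_rng(seed)
    n = w.shape[0]
    w_snap = w.copy()                                               # snapshot
    full_grad = grad_fi(w_snap, np.arange(n_samples)).mean(axis=0)  # grad F at snapshot
    step = gamma / L_max
    for _ in range(m):
        B = rng.integers(0, n_samples, size=batch_size)   # random mini-batch of samples
        j = rng.integers(0, n)                            # random coordinate
        # SVRG-style variance-reduced estimate of the j-th partial derivative
        v_j = (grad_fi(w, B)[:, j].mean()
               - grad_fi(w_snap, B)[:, j].mean()
               + full_grad[j])
        # proximal coordinate update; soft-thresholding realizes P_{j, (gamma/L_max) g_j}
        w[j] = soft_threshold(w[j] - step * v_j, step * lam)
    return w
```

In the paper's shared-memory setting, every thread would run this inner loop concurrently, reading and writing $w$ without locks; the serial sketch only shows the arithmetic of a single variance-reduced proximal coordinate update.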
In this section, we assume that the delay for the vector $w$ in the shared memory is bounded (i.e., Assumption 4), and provide comprehensive convergence guarantees: AsyDSSKL achieves a sublinear convergence rate when the smooth loss functions are convex or non-convex (Theorems 1 and 2), and a linear convergence rate when the objective function has the optimal strong convexity (Theorem 1). In the following, we first give several widely used assumptions (i.e., optimal strong convexity, Lipschitz smoothness, and bound of delay), and then provide the main conclusions on the convergence rates of our AsyDSSKL.

Preliminaries

Optimal Strong Convexity: Let $P^*$ denote the optimal value of (4), and let $S$ denote the solution set of $P$ such that $P(w) = P^*$, $\forall w \in S$. Firstly, we assume that $S$ is nonempty (i.e., Assumption 1).

Assumption 1. The solution set $S$ of (4) is nonempty.

Based on $S$, we define $\mathcal{P}_S(w) = \arg\min_{y \in S} \|y - w\|^2$ as the Euclidean-norm projection of a vector $w$ onto $S$. Then, we assume that the convex function $P(w)$ has the optimal strong convexity (i.e., Assumption 2).

Assumption 2 (Optimal strong convexity).
$$P(w) - P(\mathcal{P}_S(w)) \geq \frac{\ell}{2}\|w - \mathcal{P}_S(w)\|^2 \tag{5}$$

As mentioned in (Liu and Wright 2015), the condition of optimal strong convexity is significantly weaker than the normal strong convexity condition, and several examples of optimally strongly convex functions that are not strongly convex are provided in (Liu and Wright 2015).

Lipschitz Smoothness: We define the normal Lipschitz constant ($L_{nor}$), restricted Lipschitz constant ($L_{res}$) and coordinate Lipschitz constant ($L_{max}$) as follows.

Definition 1 (Lipschitz constants). $L_{nor}$, $L_{res}$ and $L_{max}$ are the normal Lipschitz constant, restricted Lipschitz constant and coordinate Lipschitz constant, respectively, for $f_i$ ($\forall i \in \{1, \ldots, l\}$) in (4). We have
$$\|\nabla f_i(w_1) - \nabla f_i(w_2)\| \leq L_{nor}\|w_1 - w_2\|, \quad \forall w_1, w_2 \tag{6}$$
$$\|\nabla f_i(w) - \nabla f_i(w + e_j t)\| \leq L_{res}|t|, \quad \forall w, t \tag{7}$$
$$|\nabla_j f_i(w) - \nabla_j f_i(w + e_j t)| \leq L_{max}|t|, \quad \forall w, t \tag{8}$$

Based on $L_{nor}$, $L_{res}$ and $L_{max}$ as defined above, we assume that each function $f_i$ ($\forall i \in \{1, \ldots, l\}$) is Lipschitz smooth with $L_{nor}$, $L_{res}$ and $L_{max}$ (i.e., Assumption 3). In addition, we define $\Lambda_{res} = \frac{L_{res}}{L_{max}}$ and $\Lambda_{nor} = \frac{L_{nor}}{L_{max}}$.

Assumption 3 (Lipschitz smoothness). The function $f_i$ ($\forall i \in \{1, \ldots, l\}$) is Lipschitz smooth with the normal Lipschitz constant $L_{nor}$, restricted Lipschitz constant $L_{res}$ and coordinate Lipschitz constant $L_{max}$.

Bound of Delay: Because AsyDSSKL does not use a reading lock, the vector $\hat{w}^{s+1}_t$ read into the local memory may be inconsistent with the vector $w^{s+1}_t$ in the shared memory, which means that some components of $\hat{w}^{s+1}_t$ are the same as the ones in $w^{s+1}_t$, but others are different from the ones in $w^{s+1}_t$. However, we can define a set $K(t)$ of inner iterations such that
$$\hat{w}^{s+1}_t = w^{s+1}_t + \sum_{t' \in K(t)} \left(w^{s+1}_{t'+1} - w^{s+1}_{t'}\right) \tag{9}$$
where $t' \leq t - 1$. It is reasonable to assume that there exists an upper bound $\tau$ such that $\tau \geq t - \min\{t' \mid t' \in K(t)\}$ (i.e., Assumption 4).

Assumption 4 (Bound of delay). There exists an upper bound $\tau$ such that $\tau \geq t - \min\{t' \mid t' \in K(t)\}$ for all inner iterations $t$ in AsyDSSKL.

Convergence Rate Analysis

We provide the convergence rates of AsyDSSKL in the convex and non-convex settings (Theorems 1 and 2). The detailed proofs of Theorems 1 and 2 are presented in the Appendix.

Convex Setting. We first give the convergence rates of AsyDSSKL in the convex setting in Theorem 1.

Theorem 1. Let $\rho$ be a constant that satisfies $\rho > 1$, and define the quantities $\theta_1 = \frac{\rho^{1/2} - \rho^{\tau+1}}{1 - \rho^{1/2}}$, $\theta_2 = \frac{\rho^{1/2} - \rho^{m}}{1 - \rho^{1/2}}$ and $\theta = \rho^{-1}$.
Suppose the nonnegative steplength parameter $\gamma > 0$ satisfies
$$1 - \Lambda_{nor}\gamma - \frac{\gamma\tau\theta}{n} - \frac{2(\Lambda_{res}\theta_1 + \Lambda_{nor}\theta_2)\gamma}{n^{1/2}} \geq 0 \tag{10}$$
If the optimal strong convexity holds for $P(w)$ with $\ell > 0$ (i.e., Assumption 2), we have
$$\mathbb{E}P(w^s) - P^* \leq \left(\frac{1}{1 + \frac{m\gamma\ell}{n(\ell\gamma + L_{max})}}\right)^{s}\left(\frac{L_{max}}{2\gamma}\|w^0 - \mathcal{P}_S(w^0)\|^2 + P(w^0) - P^*\right)$$
If $f_i(w)$ is a general smooth convex function with Assumption 3, we have
$$\mathbb{E}P(w^s) - P^* \leq \frac{n\left(L_{max}\|w^0 - \mathcal{P}_S(w^0)\|^2 + 2\gamma\left(P(w^0) - P^*\right)\right)}{2\gamma(n + sm)} \tag{11}$$

Remark 1. Theorem 1 shows that, if the objective function $P(w)$ has the optimal strong convexity, AsyDSSKL achieves a linear convergence rate. If the loss function $f_i(w)$ is general smooth convex, AsyDSSKL achieves a sublinear convergence rate (see (11)).

Remark 2. The thread number is not explicitly considered in Theorem 1. As discussed in (Liu and Wright 2015), the parameter $\tau$ is closely related to the number of threads that can be involved in the computation without degrading the convergence performance of the algorithm. In other words, if the number of threads is small enough such that (10) holds, the convergence bounds in Theorem 1 do not depend on the number of threads, implying that a linear speedup can be expected.

Non-Convex Setting. If the function $P(w)$ is non-convex, the global optimum cannot be guaranteed. Thus, the closeness to the optimal solution (i.e., $P(w) - P^*$ and $\|w - \mathcal{P}_S(w)\|$) cannot be used for the convergence analysis. To analyze the convergence rate of AsyDSSKL in the non-convex setting, we define the expectation of a subgradient $\xi \in \partial P(w^s_t)$ as $\mathbb{E}\partial P(w^s_t)$. Specifically, $\mathbb{E}\partial P(w^s_t)$ can be written as follows:
$$\mathbb{E}\partial P(w^s_t) \overset{\text{def}}{=} \frac{L_{max}}{\gamma}\left(w^s_t - \bar{w}^s_{t+1}\right) \tag{12}$$
where $\bar{w}^s_{t+1} \overset{\text{def}}{=} \mathcal{P}_{\frac{\gamma}{L_{max}} g}\left(w^s_t - \frac{\gamma}{L_{max}} v^s_t\right)$. Based on (12), it is easy to verify that $(\bar{w}^s_{t+1})_{j(t)} = (w^{s+1}_{t+1})_{j(t)}$. Thus, we have $\mathbb{E}_{j(t)}(w^s_{t+1} - w^s_t) = \frac{1}{n}\left(\bar{w}^s_{t+1} - w^s_t\right)$, which means that $\bar{w}^s_{t+1} - w^s_t$ captures the expectation of $w^s_{t+1} - w^s_t$. It is easy to verify that $\mathbb{E}\partial P(w^s_t)$ is equal to $0$ when AsyDSSKL approaches a stationary point. Based on $\mathbb{E}\partial P(w^s_t)$, we give the convergence rate in the non-convex setting in Theorem 2.

Theorem 2. Let $\rho$ be a constant that satisfies $\rho > 1$, and define the quantities $\theta_1 = \frac{\rho^{1/2} - \rho^{\tau+1}}{1 - \rho^{1/2}}$ and $\theta_2 = \frac{\rho^{1/2} - \rho^{m}}{1 - \rho^{1/2}}$. Suppose the nonnegative steplength parameter $\gamma > 0$ satisfies
$$\gamma \leq \min\left\{\frac{n^{1/2}(1 - \rho^{-1})}{4\left(\Lambda_{res}(1+\theta_1) + \Lambda_{nor}(1+\theta_2)\right)},\ \frac{n^{1/2}}{\frac{1}{2}n^{1/2} + 2\Lambda_{nor}\theta_2 + \Lambda_{res}\theta_1}\right\}$$
Let $T$ denote the total number of inner iterations of AsyDSSKL. If $f_i(w)$ is a smooth non-convex function with Assumption 3, we have
$$\frac{1}{T}\sum_{s,t} \mathbb{E}\left\|\mathbb{E}\partial P(w^s_t)\right\|^2 \leq \frac{P(w^0) - P^*}{T\,\frac{\gamma}{n}\left(1 - \gamma\left(\frac{1}{2} + \frac{2\Lambda_{nor}\theta_2 + \Lambda_{res}\theta_1}{n^{1/2}}\right)\right)} \tag{13}$$

Remark 3. Theorem 2 shows that, if the loss function $f_i(w)$ is non-convex, AsyDSSKL converges to a stationary point at a sublinear convergence rate.

Experimental Results

Experimental Setup

Design of Experiments: To demonstrate the superiority of our AsyDSSKL in efficiency at the training step, we compare the accuracies (or errors) on the testing set vs. training time of the different kernel methods. To demonstrate the superiority of our AsyDSSKL in efficiency at the predicting step, we compare the testing time vs. the number of testing samples of the different kernel methods. The state-of-the-art kernel methods compared in the experiments are LIBSVM(P), AsyGCD and Double SGD, which are summarized in Table 2. Note that LIBSVM(P) is a modified implementation of LIBSVM that computes a column of the kernel matrix in parallel (Chang and Lin 2011b). To show the near-linear speedup obtained by asynchronous parallel computation, we also test the speedup of our AsyDSSKL on different datasets.

Table 2: The state-of-the-art kernel methods compared in our experiments.
(BC = binary classification, R = regression, Asy = asynchronous, DS = doubly stochastic, SC = sparse constraint)

| Algorithm | Reference | Problem | Parallel | Asy | DS | SC |
| --- | --- | --- | --- | --- | --- | --- |
| LIBSVM(P) | (Chang and Lin 2011b) | BC+R | Yes | No | No | No |
| AsyGCD | (You et al. 2016) | BC | Yes | Yes | No | No |
| Double SGD | (Dai et al. 2014) | BC+R | No | No | Yes | No |
| AsyDSSKL | Ours | BC+R | Yes | Yes | Yes | Yes |

Implementation Details: Our experiments are performed on a 32-core, two-socket Intel Xeon E5-2699 machine where each socket has 16 cores. We implement our AsyDSSKL in C++, where the shared-memory parallel computation is handled via OpenMP (Chandra 2001). We implement LIBSVM(P) by modifying LIBSVM with OpenMP according to the instructions¹ provided by Chang and Lin (2011b). We use the implementation² provided by (You et al. 2016) for AsyGCD. We use the implementation³ provided by (Dai et al. 2014) for Double SGD. Note that the implementation of Double SGD only works for binary classification because only the logistic loss was implemented in their code. In the experiments, we consider binary classification and regression problems. Specifically, our AsyDSSKL uses the logistic loss (see Table 1) for binary classification, and the correntropy induced loss (see Table 1) for regression. We use the accuracy as the evaluation criterion for binary classification, and the mean squared error (MSE) as the evaluation criterion for regression. In each experiment, the accuracy and MSE values are averaged over 5 trials. In the experiments, the value of the steplength $\gamma$ is selected from $\{10^2, 10, 1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. The number of inner loop iterations $m$ is set to the size of the training set, and the number of outer loop iterations $S$ is set to 10.

Datasets: Table 3 summarizes the six large-scale real-world datasets used in our experiments. They are the Covtype B, RCV1, SUSY, Covtype M, MNIST and Aloi datasets, which are available from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Covtype B and Covtype M are from the same source. Covtype M, MNIST and Aloi are originally multi-class classification datasets. In the experiments, we treat multi-class classification as a regression problem.

¹The instructions to implement LIBSVM(P) are available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f432.
²The AsyGCD code is available at https://github.com/cjhsieh/asyn_kernel_svm.
³The Double SGD code is available at https://github.com/zixu1986/Doubly_Stochastic_Gradients.

Table 3: The large-scale datasets used in the experiments, where the multi-class classification datasets are treated as regression problems.

| Task | Dataset | Features | Samples |
| --- | --- | --- | --- |
| Binary classification | Covtype B | 54 | 581,012 |
| Binary classification | RCV1 | 47,236 | 677,399 |
| Binary classification | SUSY | 18 | 5,000,000 |
| Regression | Covtype M | 54 | 581,012 |
| Regression | MNIST | 784 | 1,000,000 |
| Regression | Aloi | 128 | 108,000 |

Results and Discussions

In the experiments comparing the training time of the different algorithms, AsyGCD, LIBSVM(P) and our AsyDSSKL are run on 16 cores. Figures 1a-1c provide the results of accuracy vs. training time on the Covtype B, RCV1, and SUSY datasets, respectively. The results show that AsyDSSKL achieves the best accuracy value most of the time for a fixed training time. Especially, AsyDSSKL converges to a good accuracy in remarkably little time, mostly less than 60 seconds for the large datasets. Figures 1d-1f provide the results of MSE vs. training time on the Covtype M, MNIST, and Aloi datasets, respectively.
The results show that AsyDSSKL always achieves the smallest MSE value for a fixed training time, and that AsyDSSKL converges to a good value in remarkably little time, mostly less than 60 seconds for the large datasets. Figures 1a-1f confirm that our AsyDSSKL has a significant advantage in scalability and efficiency at the training step over the state-of-the-art kernel methods. Figures 1g-1h present the testing time vs. the number of testing samples on the Covtype B and Covtype M datasets. The results show that the testing time of our AsyDSSKL is the smallest, compared with the other methods, which do not have a sparse constraint. Figures 1g-1h confirm that our AsyDSSKL has a significant advantage in scalability and efficiency at the predicting step over the existing kernel methods.

We run AsyDSSKL on 1, 2, 4, 8 and 16 cores to observe the speedup. Figure 2 presents the speedup results of AsyDSSKL on the Covtype B and Covtype M datasets. The results show that AsyDSSKL achieves a near-linear speedup on a parallel system with shared memory. This is because we do not use any lock in the implementation of AsyDSSKL.

Conclusion

In this paper, we introduced a general sparse kernel learning formulation together with a new asynchronous parallel doubly stochastic algorithm for sparse kernel learning (AsyDSSKL). We provided a comprehensive convergence guarantee for AsyDSSKL: 1) AsyDSSKL achieves a sublinear rate when the smooth loss functions are convex or non-convex. 2) AsyDSSKL achieves a linear (geometric) convergence rate when the objective function has the optimal strong convexity. Experimental results show that our AsyDSSKL method is significantly more computationally efficient at the training and predicting steps than the existing kernel methods.

Figure 1: Comparisons between AsyDSSKL and other state-of-the-art kernel methods. (a-c) Accuracy vs. training time. (d-f) MSE vs. training time. (g-h) Testing time vs. testing samples.

Figure 2: Speedup results of AsyDSSKL. (a) Objective vs. time and (b) speedup on the Covtype B dataset; (c) objective vs. time and (d) speedup on the Covtype M dataset.

Acknowledgement

This work was partially supported by the following grants: NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, NIH R01 AG049371.

References

Baudat, G., and Anouar, F. 2000. Generalized discriminant analysis using a kernel approach. Neural Computation 12(10):2385–2404.
Camps-Valls, G., and Bruzzone, L. 2009. Kernel Methods for Remote Sensing Data Analysis. John Wiley & Sons.
Chandra, R. 2001. Parallel Programming in OpenMP. Morgan Kaufmann.
Chang, C.-C., and Lin, C.-J. 2011a. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):27.
Chang, C.-C., and Lin, C.-J. 2011b.
LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Dai, B.; Xie, B.; He, N.; Liang, Y.; Raj, A.; Balcan, M.-F. F.; and Song, L. 2014. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, 3041–3049.
Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; and Suykens, J. A. 2015. Learning with the maximum correntropy criterion induced losses for regression. Journal of Machine Learning Research 16:993–1034.
Kivinen, J.; Smola, A. J.; and Williamson, R. C. 2004. Online learning with kernels. IEEE Transactions on Signal Processing 52(8):2165–2176.
Li, F.; Yang, Y.; and Xing, E. 2005. From lasso regression to feature vector machine. In NIPS, 779–786.
Lin, Y. 2004. A note on margin-based loss functions in classification. Statistics & Probability Letters 68(1):73–82.
Liu, J., and Wright, S. J. 2015. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization 25(1):351–376.
Mercer, J. 1909. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209:415–446.
Rahimi, A., and Recht, B. 2007. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 1177–1184.
Rahimi, A., and Recht, B. 2009. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, 1313–1320.
Schölkopf, B.; Herbrich, R.; and Smola, A. J. 2001. A generalized representer theorem. In International Conference on Computational Learning Theory, 416–426. Springer.
Schölkopf, B.; Tsuda, K.; and Vert, J.-P. 2004. Kernel Methods in Computational Biology. MIT Press.
Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14(Feb):567–599.
Shalev-Shwartz, S.; Singer, Y.; and Srebro, N. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, 807–814. ACM.
Shamir, O., and Zhang, T. 2013. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, 71–79.
Steinwart, I., and Christmann, A. 2008. Support Vector Machines. Springer Science & Business Media.
Takahashi, N., and Nishi, T. 2006. Global convergence of decomposition learning methods for support vector machines. IEEE Transactions on Neural Networks 17(6):1362–1369.
Tseng, P., and Yun, S. 2009. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming 117(1-2):387–423.
Vapnik, V. N. 1998. Statistical Learning Theory, volume 1. Wiley, New York.
Vovk, V. 2013. Kernel ridge regression. In Empirical Inference. Springer. 105–116.
Yen, I. E.-H.; Lin, T.-W.; Lin, S.-D.; Ravikumar, P. K.; and Dhillon, I. S. 2014. Sparse random feature algorithm as coordinate descent in Hilbert space. In Advances in Neural Information Processing Systems, 2456–2464.
You, Y.; Fu, H.; Song, S. L.; Randles, A.; Kerbyson, D.; Marquez, A.; Yang, G.; and Hoisie, A. 2015. Scaling support vector machines on modern HPC platforms. Journal of Parallel and Distributed Computing 76:16–31.
You, Y.; Lian, X.; Liu, J.; Yu, H.-F.; Dhillon, I. S.; Demmel, J.; and Hsieh, C.-J. 2016. Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems, 4682–4690.
Zhang, T. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, 116. ACM.
Zhao, H.-x., and Magoules, F. 2011. Parallel support vector machines on multi-core and multiprocessor systems. In 11th International Conference on Artificial Intelligence and Applications (AIA 2011). IASTED.
Zhu, J., and Hastie, T. 2005. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics 14(1):185–205.
Zhu, J.; Rosset, S.; Tibshirani, R.; and Hastie, T. J. 2004. 1-norm support vector machines. In Advances in Neural Information Processing Systems, 49–56.