# Proximal SCOPE for Distributed Sparse Learning

Shen-Yi Zhao, National Key Lab. for Novel Software Tech., Dept. of Comp. Sci. and Tech., Nanjing University, Nanjing 210023, China. zhaosy@lamda.nju.edu.cn
Gong-Duo Zhang, National Key Lab. for Novel Software Tech., Dept. of Comp. Sci. and Tech., Nanjing University, Nanjing 210023, China. zhanggd@lamda.nju.edu.cn
Ming-Wei Li, National Key Lab. for Novel Software Tech., Dept. of Comp. Sci. and Tech., Nanjing University, Nanjing 210023, China. limw@lamda.nju.edu.cn
Wu-Jun Li, National Key Lab. for Novel Software Tech., Dept. of Comp. Sci. and Tech., Nanjing University, Nanjing 210023, China. liwujun@nju.edu.cn

Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use L1 regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of pSCOPE, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that a better data partition implies a faster convergence rate. Furthermore, pSCOPE is communication efficient. Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

1 Introduction

Many machine learning models can be formulated as the following regularized empirical risk minimization problem:

$$\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w) + R(w), \qquad (1)$$

where $w$ is the parameter to learn, $f_i(w)$ is the loss on training instance $i$, $n$ is the number of training instances, and $R(w)$ is a regularization term. Recently, sparse learning, which tries to learn a sparse model for prediction, has become a hot topic in machine learning. There are different ways to implement sparse learning [28, 30]. One popular way is to use L1 regularization, i.e., $R(w) = \lambda\|w\|_1$. In this paper, we focus on sparse learning with $R(w) = \lambda\|w\|_1$. Hence, in the rest of this paper, $R(w) = \lambda\|w\|_1$ unless otherwise stated.

One traditional method to solve (1) is proximal gradient descent (pGD) [2], which can be written as follows:

$$w_{t+1} = \mathrm{prox}_{R,\eta}\big(w_t - \eta \nabla F(w_t)\big), \qquad (2)$$

where $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, $w_t$ is the value of $w$ at iteration $t$, $\eta$ is the learning rate, and $\mathrm{prox}$ is the proximal mapping defined as

$$\mathrm{prox}_{R,\eta}(u) = \arg\min_{v}\Big(R(v) + \frac{1}{2\eta}\|v - u\|^2\Big). \qquad (3)$$

Recently, stochastic learning methods, including stochastic gradient descent (SGD) [18], stochastic average gradient (SAG) [22], stochastic variance reduced gradient (SVRG) [10], and stochastic dual coordinate ascent (SDCA) [24], have been proposed to speed up the learning procedure in machine learning. Inspired by the success of these stochastic learning methods, proximal stochastic methods, including proximal SGD (pSGD) [11, 6, 26, 4], proximal block coordinate descent (pBCD) [29, 31, 21], proximal SVRG (pSVRG) [32] and proximal SDCA (pSDCA) [25], have also been proposed for sparse learning in recent years.
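For $R(w) = \lambda\|w\|_1$, the proximal mapping (3) has a well-known closed form: coordinate-wise soft-thresholding at level $\eta\lambda$. The following is a minimal NumPy sketch of this operator and of one pGD step (2); it is illustrative code we add here, not code from the paper.

```python
import numpy as np

def prox_l1(u, eta, lam):
    """Proximal mapping of R(w) = lam * ||w||_1 with step size eta,
    i.e. argmin_v lam*||v||_1 + ||v - u||^2 / (2*eta).
    The minimizer is coordinate-wise soft-thresholding at level eta*lam."""
    return np.sign(u) * np.maximum(np.abs(u) - eta * lam, 0.0)

def pgd_step(w, grad_F, eta, lam):
    """One proximal gradient descent (pGD) step as in equation (2)."""
    return prox_l1(w - eta * grad_F(w), eta, lam)

print(prox_l1(np.array([0.3, -0.05, 1.2]), eta=1.0, lam=0.1))  # -> [0.2, 0.0, 1.1]
```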
All these proximal stochastic methods are sequential (serial) and implemented with a single thread. Serial proximal stochastic methods may not be efficient enough for solving large-scale sparse learning problems. Furthermore, in some applications the training set may be distributed across a cluster of multiple machines. Hence, distributed sparse learning [1] with a cluster of multiple machines has attracted much attention in recent years, especially for large-scale applications with high-dimensional data. In particular, researchers have recently proposed several distributed proximal stochastic methods for sparse learning [15, 17, 13, 16, 27]¹.

One main branch of the distributed proximal stochastic methods includes distributed pSGD (dpSGD) [15], distributed pSVRG (dpSVRG) [9, 17] and distributed SVRG (DSVRG) [13]. Both dpSGD and dpSVRG adopt a centralized framework and a mini-batch based strategy for distributed learning. One typical implementation of a centralized framework is based on Parameter Server [14, 33], which supports both synchronous and asynchronous communication strategies. One shortcoming of dpSGD and dpSVRG is their high communication cost: the communication cost of each epoch is $O(n)$, where $n$ is the number of training instances. DSVRG adopts a decentralized framework with lower communication cost than dpSGD and dpSVRG. However, in DSVRG only one worker updates parameters locally at any time, while all other workers are idle.

Another branch of the distributed proximal stochastic methods is based on block coordinate descent [3, 20, 7, 16]. Although these methods update only a block of coordinates in each iteration, they usually have to pass through the whole data set, and the partition of the data also incurs high communication cost in each iteration. Yet another branch of the distributed proximal stochastic methods is based on SDCA. One representative is PROXCOCOA+ [27]. Although PROXCOCOA+ has been theoretically proved to have a linear convergence rate with low communication cost, we find that it is not efficient enough in experiments.

In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization. pSCOPE is a proximal generalization of scalable composite optimization for learning (SCOPE) [34]. SCOPE cannot be used for sparse learning, while pSCOPE can. The contributions of pSCOPE are briefly summarized as follows:

- pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The CALL framework is communication efficient because there is no communication during the inner iterations of each epoch.
- pSCOPE is theoretically guaranteed to be convergent with a linear convergence rate if the data partition is good enough, and a better data partition implies a faster convergence rate. Hence, pSCOPE is also computation efficient.
- In pSCOPE, a recovery strategy is proposed to reduce the cost of the proximal mapping when handling high dimensional sparse data.
- Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

¹ In this paper, we mainly focus on distributed sparse learning with L1 regularization. The distributed methods for non-sparse learning, like those in [19, 5, 12], are not considered.
2 Preliminary

In this paper, we use $\|\cdot\|$ to denote the L2 norm $\|\cdot\|_2$, and $w^*$ to denote the optimal solution of (1). For a vector $a$, we use $a^{(j)}$ to denote the $j$th coordinate value of $a$. $[n]$ denotes the set $\{1, 2, \ldots, n\}$. For a function $h(a; b)$, we use $\nabla h(a; b)$ to denote the gradient of $h(a; b)$ with respect to (w.r.t.) the first argument $a$. Furthermore, we give the following definitions.

Definition 1 A function $h(\cdot)$ is called $L$-smooth if it is differentiable and there exists a positive constant $L$ such that $\forall a, b:\ h(b) \leq h(a) + \nabla h(a)^T(b - a) + \frac{L}{2}\|b - a\|^2$.

Definition 2 A function $h(\cdot)$ is called convex if there exists a constant $\mu \geq 0$ such that $\forall a, b:\ h(b) \geq h(a) + \zeta^T(b - a) + \frac{\mu}{2}\|a - b\|^2$, where $\zeta \in \partial h(a) = \{c \mid h(b) \geq h(a) + c^T(b - a), \forall a, b\}$. If $h(\cdot)$ is differentiable, then $\zeta = \nabla h(a)$. If $\mu > 0$, $h(\cdot)$ is called $\mu$-strongly convex.

Throughout this paper, we assume that $R(w)$ is convex, $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ is strongly convex and each $f_i(w)$ is smooth. We do not assume that each $f_i(w)$ is convex.

3 Proximal SCOPE

In this paper, we focus on distributed learning with one master (server) and $p$ workers in the cluster, although the algorithm and theory of this paper can also be easily extended to cases with multiple servers like the Parameter Server framework [14, 33]. The parameter $w$ is stored on the master, and the training set $D = \{(x_i, y_i)\}_{i=1}^{n}$ is partitioned into $p$ parts denoted as $D_1, D_2, \ldots, D_p$. Here, $D_k$ contains a subset of instances from $D$, $D_k$ is assigned to the $k$th worker, and $D = \bigcup_{k=1}^{p} D_k$. Based on this data partition scheme, the proximal SCOPE (pSCOPE) method for distributed sparse learning is presented in Algorithm 1.

The main task of the master is to add and average vectors received from the workers. Specifically, it calculates the full gradient $z = \nabla F(w_t) = \frac{1}{n}\sum_{k=1}^{p} z_k$ and then $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$. The main task of the workers is to update the local parameters $u_{1,m}, u_{2,m}, \ldots, u_{p,m}$, initialized with $u_{k,0} = w_t$. Specifically, after the $k$th worker gets the full gradient $z$ from the master, it calculates a stochastic gradient

$$v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z, \qquad (4)$$

and then updates its local parameter $u_{k,m}$ by a proximal mapping with learning rate $\eta$:

$$u_{k,m+1} = \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m}). \qquad (5)$$

From Algorithm 1, we can see that pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The cooperative operation is mainly the adding and averaging on the master. During the autonomous local learning procedure in each outer iteration, which contains $M$ inner iterations (see Algorithm 1), there is no communication. Hence, the communication cost for each epoch of pSCOPE is constant, which is much less than the $O(n)$ per-epoch communication cost of the mini-batch based strategy [15, 9, 17].

pSCOPE is a proximal generalization of SCOPE [34]. Although pSCOPE is mainly motivated by sparse learning with L1 regularization, the algorithm and theory of pSCOPE can also be used for smooth regularization like L2 regularization. Furthermore, when the data partition is good enough, pSCOPE can avoid the extra term $c(u_{k,m} - w_t)$ in the update rule of SCOPE, which is necessary for the convergence guarantee of SCOPE.
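To make the update rules (4) and (5) concrete, the following is a minimal single-process simulation of the pSCOPE outer iteration described above, using a squared loss and the soft-thresholding proximal operator. It is only an illustration: the data layout, function names, and hyperparameter values are ours rather than the paper's, and a real implementation would run the workers in parallel with actual communication.

```python
import numpy as np

def soft_threshold(u, t):
    # proximal mapping of t * ||.||_1 (coordinate-wise soft-thresholding)
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def grad_fi(w, x, y):
    # gradient of the squared loss f_i(w) = 0.5 * (x^T w - y)^2
    return (x @ w - y) * x

def pscope_outer_iteration(w_t, parts, eta, lam, M, rng):
    """One outer iteration of pSCOPE (cf. Algorithm 1), simulated sequentially.
    `parts` is a list of (X_k, y_k): the data assigned to each of the p workers."""
    n = sum(len(y_k) for _, y_k in parts)
    # master: full gradient z = (1/n) * sum_k z_k with z_k = sum_{i in D_k} grad f_i(w_t)
    z = sum(X_k.T @ (X_k @ w_t - y_k) for X_k, y_k in parts) / n
    u_final = []
    for X_k, y_k in parts:                          # each worker learns independently
        u = w_t.copy()
        for _ in range(M):
            i = rng.integers(len(y_k))              # random local instance
            v = grad_fi(u, X_k[i], y_k[i]) - grad_fi(w_t, X_k[i], y_k[i]) + z  # eq. (4)
            u = soft_threshold(u - eta * v, eta * lam)                         # eq. (5)
        u_final.append(u)
    return np.mean(u_final, axis=0)                 # master averages the u_{k,M}

rng = np.random.default_rng(0)
d, n, p = 20, 200, 4
w_true = np.zeros(d); w_true[:3] = [1.0, -2.0, 0.5]
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)
parts = [(X[k::p], y[k::p]) for k in range(p)]      # uniform-style split over p workers
w = np.zeros(d)
for t in range(30):
    w = pscope_outer_iteration(w, parts, eta=0.01, lam=0.01, M=100, rng=rng)
```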
4 Effect of Data Partition

In our experiments, we find that the data partition affects the convergence of the learning procedure. Hence, in this section we propose a metric to measure the goodness of a data partition, based on which the convergence of pSCOPE can be theoretically proved. Due to space limitation, the detailed proofs of Lemmas and Theorems are moved to the long version [35].

Algorithm 1 Proximal SCOPE
1: Initialize $w_0$ and the learning rate $\eta$;
2: Task of master:
3: for $t = 0, 1, 2, \ldots, T-1$ do
4:   Send $w_t$ to each worker;
5:   Wait until receiving $z_1, z_2, \ldots, z_p$ from all workers;
6:   Calculate the full gradient $z = \frac{1}{n}\sum_{k=1}^{p} z_k$ and send $z$ to each worker;
7:   Wait until receiving $u_{1,M}, u_{2,M}, \ldots, u_{p,M}$ from all workers and calculate $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$;
8: end for
9: Task of the $k$th worker:
10: for $t = 0, 1, 2, \ldots, T-1$ do
11:   Wait until receiving $w_t$ from master;
12:   Let $u_{k,0} = w_t$, calculate $z_k = \sum_{i \in D_k} \nabla f_i(w_t)$ and send $z_k$ to master;
13:   Wait until receiving $z$ from master;
14:   for $m = 0, 1, 2, \ldots, M-1$ do
15:     Randomly choose an instance $x_{i_{k,m}} \in D_k$;
16:     Calculate $v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z$;
17:     Update $u_{k,m+1} = \mathrm{prox}_{R,\eta}(u_{k,m} - \eta v_{k,m})$;
18:   end for
19:   Send $u_{k,M}$ to master;
20: end for

4.1 Partition

First, we give the following definition:

Definition 3 Define $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$. We call $\pi$ a partition w.r.t. $P(\cdot)$ if $F(w) = \frac{1}{p}\sum_{k=1}^{p} \phi_k(w)$ and each $\phi_k(\cdot)$ ($k = 1, \ldots, p$) is $\mu_k$-strongly convex and $L_k$-smooth ($\mu_k, L_k > 0$). Here, $P(\cdot)$ is defined in (1) and $F(\cdot)$ is defined in (2). We denote $\mathcal{A}(P) = \{\pi \mid \pi \text{ is a partition w.r.t. } P(\cdot)\}$.

Remark 1 Here, $\pi$ is an ordered sequence of functions. In particular, if we construct another partition $\pi'$ by permuting the $\phi_i(\cdot)$ of $\pi$, we consider them to be two different partitions. Furthermore, two functions $\phi_i(\cdot), \phi_j(\cdot)$ ($i \neq j$) in $\pi$ can be the same. Two partitions $\pi_1 = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ and $\pi_2 = [\psi_1(\cdot), \ldots, \psi_p(\cdot)]$ are considered to be equal, i.e., $\pi_1 = \pi_2$, if and only if $\phi_k(w) = \psi_k(w)$ ($k = 1, \ldots, p$), $\forall w$.

For any partition $\pi = [\phi_1(\cdot), \ldots, \phi_p(\cdot)]$ w.r.t. $P(\cdot)$, we construct new functions $P_k(\cdot;\cdot)$ as follows:

$$P_k(w; a) = \phi_k(w; a) + R(w), \quad k = 1, \ldots, p, \qquad (6)$$

where $\phi_k(w; a) = \phi_k(w) + G_k(a)^T w$, $G_k(a) = \nabla F(a) - \nabla \phi_k(a)$, and $w, a \in \mathbb{R}^d$.

In particular, given a data partition $D_1, D_2, \ldots, D_p$ of the training set $D$, let $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$, which is also called the local loss function. Assume each $F_k(\cdot)$ is strongly convex and smooth, and $F(w) = \frac{1}{p}\sum_{k=1}^{p} F_k(w)$. Then we can see that $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$. By taking expectation on $v_{k,m}$ defined in Algorithm 1, we obtain $\mathbb{E}[v_{k,m} \mid u_{k,m}] = \nabla F_k(u_{k,m}) + G_k(w_t)$. According to the theory in [32], in the inner iterations of pSCOPE each worker actually optimizes the local objective function $P_k(w; w_t)$ using proximal SVRG with initialization $w = w_t$ and training data $D_k$, rather than optimizing $F_k(w) + R(w)$. We call such a $P_k(w; a)$ the local objective function w.r.t. $\pi$. Compared to the subproblem of PROXCOCOA+ (equation (2) in [27]), $P_k(w; a)$ is simpler and contains no extra hyperparameter.

4.2 Good Partition

In general, the data distribution on each worker is different from the distribution of the whole training set. Hence, there exists a gap between each local optimal value and the global optimal value. Intuitively, the whole learning algorithm converges slowly, or may not even converge, if this gap is too large.

Definition 4 For any partition $\pi$ w.r.t. $P(\cdot)$, we define the Local-Global Gap as $l_\pi(a) = P(w^*) - \frac{1}{p}\sum_{k=1}^{p} P_k(w_k^*(a); a)$, where $w_k^*(a) = \arg\min_w P_k(w; a)$.
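The key property behind the construction in (6) can be checked numerically: with $G_k(a) = \nabla F(a) - \nabla F_k(a)$ and equally sized parts, the $G_k(a)$ average to zero, the corrected local gradient $\nabla F_k(a) + G_k(a)$ equals $\nabla F(a)$ for every worker, and the corrected local gradients average to $\nabla F(w)$ for any $w$. A small sketch under a squared-loss setup (names and data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, p = 10, 120, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
parts = [(X[k::p], y[k::p]) for k in range(p)]   # equally sized local data sets

def grad_F(w, X, y):
    # gradient of F(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2
    return X.T @ (X @ w - y) / len(y)

a, w = rng.normal(size=d), rng.normal(size=d)
G = [grad_F(a, X, y) - grad_F(a, Xk, yk) for Xk, yk in parts]   # G_k(a)

# (1/p) * sum_k G_k(a) = 0, so averaging the local surrogates recovers F up to a constant
print(np.allclose(np.mean(G, axis=0), 0.0))
# the smooth part of P_k(w; a) has gradient grad F_k(w) + G_k(a); at w = a it equals grad F(a)
print(all(np.allclose(grad_F(a, Xk, yk) + Gk, grad_F(a, X, y))
          for (Xk, yk), Gk in zip(parts, G)))
# and its average over workers equals grad F(w) for any w
avg = np.mean([grad_F(w, Xk, yk) + Gk for (Xk, yk), Gk in zip(parts, G)], axis=0)
print(np.allclose(avg, grad_F(w, X, y)))
```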
We have the following properties of the Local-Global Gap:

Lemma 1 $\forall \pi \in \mathcal{A}(P)$, $l_\pi(a) = P(w^*) + \frac{1}{p}\sum_{k=1}^{p} H_k^*(-G_k(a)) \geq l_\pi(w^*) = 0, \ \forall a$, where $H_k^*(\cdot)$ is the conjugate function of $\phi_k(\cdot) + R(\cdot)$.

Theorem 1 Let $R(w) = \lambda\|w\|_1$. $\forall \pi \in \mathcal{A}(P)$, there exists a constant $\gamma < \infty$ such that $l_\pi(a) \leq \gamma\|a - w^*\|^2, \ \forall a$.

The result in Theorem 1 can be easily extended to smooth regularization, which can be found in the long version [35]. According to Theorem 1, the local-global gap can be bounded by $\gamma\|a - w^*\|^2$. Given a specific $a$, the smaller $\gamma$ is, the smaller the local-global gap will be. Since the constant $\gamma$ only depends on the partition $\pi$, intuitively $\gamma$ can be used to evaluate the goodness of a partition $\pi$. We define a good partition as follows:

Definition 5 We call $\pi$ an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$ if $\pi \in \mathcal{A}(P)$ and

$$\gamma(\pi; \epsilon) = \sup_{\|a - w^*\|^2 \leq \epsilon} \frac{l_\pi(a)}{\|a - w^*\|^2} \leq \xi. \qquad (7)$$

In the following, we give a bound on $\gamma(\pi; \epsilon)$.

Lemma 2 Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is a partition w.r.t. $P(\cdot)$, where $F_k(w) = \frac{1}{|D_k|}\sum_{i \in D_k} f_i(w)$ is the local loss function, and each $f_i(\cdot)$ is Lipschitz continuous with bounded domain and sampled from some unknown distribution $\mathcal{P}$. If we assign these $\{f_i(\cdot)\}$ uniformly to the workers, then with high probability $\gamma(\pi; \epsilon) \leq \frac{1}{p}\sum_{k=1}^{p} O\big(1/(\epsilon\sqrt{|D_k|})\big)$. Moreover, if $l_\pi(a)$ is convex w.r.t. $a$, then $\gamma(\pi; \epsilon) \leq \frac{1}{p}\sum_{k=1}^{p} O\big(1/\sqrt{\epsilon|D_k|}\big)$. Here we ignore the log terms and the dimensionality $d$.

For example, in Lasso regression the corresponding local-global gap $l_\pi(a)$ is convex, which follows from Lemma 1 and the fact that $G_k(a)$ is an affine function in this case. Lemma 2 implies that, as long as the size of the training data is large enough, $\gamma(\pi; \epsilon)$ will be small and $\pi$ will be a good partition. Please note that "uniformly" here means that each $f_i(\cdot)$ is assigned to one of the $p$ workers with equal probability for every worker. We call the partition resulting from uniform assignment the uniform partition in this paper. With uniform partition, each worker will have almost the same number of instances. As long as the size of the training data is large enough, the uniform partition is a good partition.

5 Convergence of Proximal SCOPE

In this section, we prove the convergence of Algorithm 1 for proximal SCOPE (pSCOPE) using the results in Section 4.

Theorem 2 Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is an $(\epsilon, \xi)$-good partition w.r.t. $P(\cdot)$. For convenience, we set $\mu_k = \mu$, $L_k = L$, $k = 1, 2, \ldots, p$. If $\|w_t - w^*\|^2 \leq \epsilon$, then

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \leq \Big[(1 - \mu\eta + 2L^2\eta^2)^M + \frac{2L^2\eta + 2\xi}{\mu - 2L^2\eta}\Big]\|w_t - w^*\|^2.$$

Because a smaller $\xi$ means a better partition, and the partition $\pi$ corresponds to the data partition in Algorithm 1, we can see that a better data partition implies a faster convergence rate.

Corollary 1 Assume $\pi = [F_1(\cdot), \ldots, F_p(\cdot)]$ is an $(\epsilon, \frac{\mu}{8})$-good partition w.r.t. $P(\cdot)$. For convenience, we set $\mu_k = \mu$, $L_k = L$, $k = 1, 2, \ldots, p$. If $\|w_t - w^*\|^2 \leq \epsilon$, taking $\eta = \frac{\mu}{12L^2}$ and $M = 20\kappa^2$, where $\kappa = \frac{L}{\mu}$ is the condition number, we have $\mathbb{E}\|w_{t+1} - w^*\|^2 \leq \frac{3}{4}\|w_t - w^*\|^2$. To get an $\epsilon$-suboptimal solution, the computation complexity of each worker is $O((n/p + \kappa^2)\log(\frac{1}{\epsilon}))$.

Corollary 2 When $p = 1$, which means we only use one worker, pSCOPE degenerates to proximal SVRG [32]. Assume $F(\cdot)$ is $\mu$-strongly convex ($\mu > 0$) and $L$-smooth. Taking $\eta = \frac{\mu}{6L^2}$ and $M = 13\kappa^2$, we have $\mathbb{E}\|w_{t+1} - w^*\|^2 \leq \frac{3}{4}\|w_t - w^*\|^2$. To get an $\epsilon$-optimal solution, the computation complexity is $O((n + \kappa^2)\log(\frac{1}{\epsilon}))$.

We can see that pSCOPE has a linear convergence rate if the partition is $(\epsilon, \xi)$-good, which implies that pSCOPE is computation efficient and that we need $T = O(\log(\frac{1}{\epsilon}))$ outer iterations to get an $\epsilon$-optimal solution.
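To make the linear rate concrete: under Corollary 1 (or 2), each outer iteration shrinks $\mathbb{E}\|w_t - w^*\|^2$ by a factor of at least $3/4$, so (assuming the condition $\|w_t - w^*\|^2 \leq \epsilon$ of the corollaries remains satisfied along the way) the required number of outer iterations follows directly. This worked bound is only an illustration of the $O(\log(\frac{1}{\epsilon}))$ claim above:

$$
\mathbb{E}\|w_T - w^*\|^2 \;\leq\; \Big(\tfrac{3}{4}\Big)^{T}\|w_0 - w^*\|^2 \;\leq\; \epsilon
\quad \text{whenever} \quad
T \;\geq\; \frac{\log\big(\|w_0 - w^*\|^2/\epsilon\big)}{\log(4/3)} \;=\; O\Big(\log\tfrac{1}{\epsilon}\Big).
$$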
For all inner iterations, each worker updates $u_{k,m}$ without any communication. Hence, the communication cost is $O(\log(\frac{1}{\epsilon}))$, which is much smaller than the $O(n)$ per-epoch communication cost of the mini-batch based strategy [15, 9, 17]. Furthermore, in the above theorems and corollaries we only assume that the local loss function $F_k(\cdot)$ is strongly convex; we do not need each $f_i(\cdot)$ to be convex. Hence $M = O(\kappa^2)$, since our assumption is weaker than that in proximal SVRG [32], whose computation complexity is $O((n + \kappa)\log(\frac{1}{\epsilon}))$ when $p = 1$. In addition, without the convexity assumption for each $f_i(\cdot)$, our result for the degenerate case $p = 1$ is consistent with that in [23].

6 Handle High Dimensional Sparse Data

For the cases with high dimensional sparse data, we propose a recovery strategy to reduce the cost of the proximal mapping and hence accelerate the training procedure. Here, we adopt the widely used linear model with elastic net [36] as an example for illustration, which can be formulated as follows:

$$\min_w P(w) := \frac{1}{n}\sum_{i=1}^{n} h_i(x_i^T w) + \frac{\lambda_1}{2}\|w\|^2 + \lambda_2\|w\|_1,$$

where $h_i: \mathbb{R} \rightarrow \mathbb{R}$ is the loss function. We assume many instances in $\{x_i \in \mathbb{R}^d \mid i \in [n]\}$ are sparse vectors and let $C_i = \{j \mid x_i^{(j)} \neq 0\}$.

The full proximal mapping is unacceptable when the data dimensionality $d$ is too large, since we need to execute the conditional statements $O(Md)$ times, which is time consuming. Other methods, like proximal SGD and proximal SVRG, also suffer from this problem. Since $z^{(j)}$ is a constant during the update of the local parameter $u_{k,m}$, we design a recovery strategy that recovers coordinates of $u_{k,m}$ only when necessary. More specifically, in each inner iteration with random index $s = i_{k,m}$, we only recover $u_{k,m}^{(j)}$ for $j \in C_s$, which is enough to calculate the inner product $x_s^T u_{k,m}$ and to update $u_{k,m}^{(j)}$. For those $j \notin C_s$, we do not immediately update $u_{k,m}^{(j)}$. The basic idea of these recovery rules is that, for some coordinate $j$, we can calculate $u_{k,m_2}^{(j)}$ directly from $u_{k,m_1}^{(j)}$, rather than running the iterations from $m = m_1$ to $m_2$. Here, $0 \leq m_1 < m_2 \leq M$. At the same time, the new algorithm is exactly equivalent to Algorithm 1. It saves about $O(d(m_2 - m_1)(1 - \rho))$ conditional statements, where $\rho$ is the sparsity of $\{x_i \in \mathbb{R}^d \mid i \in [n]\}$. This reduction of computation is significant, especially for high dimensional sparse training data. Due to space limitation, the complete rules are moved to the long version [35]. Here we only give one case of our recovery rules in Lemma 3.

Lemma 3 (Recovery Rule) Define the sequence $\{\alpha_q\}$ by $\alpha_0 = 0$ and, for $q = 1, 2, \ldots$, $\alpha_q = \sum_{i=1}^{q}(1 - \lambda_1\eta)^{i-1}/(1 - \lambda_1\eta)^{q}$. For coordinate $j$ and constants $m_1, m_2$, assume $j \notin C_{i_{k,m}}$ for all $m \in [m_1, m_2 - 1]$. If $|z^{(j)}| < \lambda_2$ and $u_{k,m_1}^{(j)} > 0$, then the relation between $u_{k,m_1}^{(j)}$ and $u_{k,m_2}^{(j)}$ can be summarized as follows. Define $q_0$ such that $\alpha_{q_0}\eta(z^{(j)} + \lambda_2) \leq u_{k,m_1}^{(j)} < \alpha_{q_0+1}\eta(z^{(j)} + \lambda_2)$:

1. If $m_2 - m_1 \leq q_0$, then $u_{k,m_2}^{(j)} = (1 - \lambda_1\eta)^{m_2 - m_1}\big[u_{k,m_1}^{(j)} - \alpha_{m_2 - m_1}\eta(z^{(j)} + \lambda_2)\big]$.
2. If $m_2 - m_1 > q_0$, then $u_{k,m_2}^{(j)} = 0$.

7 Experiment

We use two sparse learning models for evaluation. One is logistic regression (LR) with elastic net [36]: $P(w) = \frac{1}{n}\sum_{i=1}^{n}\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda_1}{2}\|w\|^2 + \lambda_2\|w\|_1$. The other is Lasso regression [28]: $P(w) = \frac{1}{2n}\sum_{i=1}^{n}(x_i^T w - y_i)^2 + \lambda_2\|w\|_1$. All experiments are conducted on a cluster of multiple machines. The CPU of each machine has 12 Intel E5-2620 cores, and the memory of each machine is 96GB. The machines are connected by 10GB Ethernet. Evaluation is based on the four datasets in Table 1: cov, rcv1, avazu, and kdd2012. All of them can be downloaded from the LibSVM website².
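For reference, the two objectives used in the experiments translate directly into code; a minimal NumPy sketch (ours, not code from the paper):

```python
import numpy as np

def lr_elastic_net_objective(w, X, y, lam1, lam2):
    """P(w) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T w))
              + (lam1/2) * ||w||^2 + lam2 * ||w||_1,  with y_i in {-1, +1}."""
    margins = -y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, margins))   # stable log(1 + exp(.))
    return loss + 0.5 * lam1 * np.dot(w, w) + lam2 * np.sum(np.abs(w))

def lasso_objective(w, X, y, lam2):
    """P(w) = (1/(2n)) * sum_i (x_i^T w - y_i)^2 + lam2 * ||w||_1."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2) + lam2 * np.sum(np.abs(w))

w = np.zeros(5)
X = np.eye(5); y = np.array([1., -1., 1., 1., -1.])
print(lr_elastic_net_objective(w, X, y, 1e-5, 1e-5), lasso_objective(w, X, y, 1e-5))
```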
Table 1: Datasets.

| dataset | #instances | #features | $\lambda_1$ | $\lambda_2$ |
|---|---|---|---|---|
| cov | 581,012 | 54 | $10^{-5}$ | $10^{-5}$ |
| rcv1 | 677,399 | 47,236 | $10^{-5}$ | $10^{-5}$ |
| avazu | 23,567,843 | 1,000,000 | $10^{-7}$ | $10^{-5}$ |
| kdd2012 | 119,705,032 | 54,686,452 | $10^{-8}$ | $10^{-5}$ |

7.1 Baselines

We compare our pSCOPE with six representative baselines: the proximal gradient descent based method FISTA [2], the ADMM-type method DFAL [1], the Newton-type method mOWL-QN [8], the proximal SVRG based method AsyProx-SVRG [17], the proximal SDCA based method PROXCOCOA+ [27], and the distributed block coordinate descent method DBCD [16]. FISTA and mOWL-QN are serial; we design distributed versions of them, in which the workers compute gradients in a distributed way and the master then gathers the gradients from the workers for the parameter update. All methods use 8 workers, and one master is used when necessary. Unless otherwise stated, all methods except DBCD and PROXCOCOA+ use the same data partition, which is obtained by uniformly assigning each instance to a worker (uniform partition). Hence, different workers have almost the same number of instances. This uniform partition strategy satisfies the condition in Lemma 2, and hence it is a good partition. DBCD and PROXCOCOA+ adopt a coordinate-wise distributed strategy to partition the data.

7.2 Results

The convergence results of LR with elastic net and Lasso regression are shown in Figure 1. DBCD is too slow, and hence we separately report in Table 2 the times taken by DBCD and pSCOPE to reach a $10^{-3}$-suboptimal solution. AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present its results on cov and rcv1. From Figure 1 and Table 2, we can see that pSCOPE outperforms all the other baselines on all datasets.

Table 2: Time comparison (in seconds) between pSCOPE and DBCD.

| model | dataset | pSCOPE | DBCD |
|---|---|---|---|
| LR | cov | 0.32 | 822 |
| LR | rcv1 | 3.78 | > 1000 |
| Lasso | cov | 0.06 | 81.9 |
| Lasso | rcv1 | 3.09 | > 1000 |

7.3 Speedup

We also evaluate the speedup of pSCOPE on the four datasets for LR. We run pSCOPE and stop it when the gap $P(w) - P(w^*) \leq 10^{-6}$. The speedup is defined as: Speedup = (time using one worker) / (time using $p$ workers). We set $p = 1, 2, 4, 8$. The speedup results are shown in Figure 2 (a). We can see that pSCOPE achieves promising speedup.

7.4 Effect of Data Partition

We evaluate pSCOPE under different data partitions. We use two datasets, cov and rcv1, for illustration, since they are balanced datasets, which means the number of positive instances is almost the same as that of negative instances.

[Figure 1: Evaluation against baselines on the two models; panels (a) LR with elastic net and (b) Lasso regression, plotting objective value minus minimum against time (seconds).]
[Figure 2: (a) Speedup of pSCOPE; (b) Effect of data partition.]

² https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
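Before turning to the alternative partitions, the uniform partition $\pi_1$ used as the default above simply assigns each instance to one of the $p$ workers chosen uniformly at random, as in Lemma 2; a minimal sketch with illustrative names:

```python
import numpy as np

def uniform_partition(n, p, seed=0):
    """Assign each of the n instances to one of p workers uniformly at random,
    returning a list of index arrays (one per worker)."""
    rng = np.random.default_rng(seed)
    owner = rng.integers(p, size=n)          # worker id for every instance
    return [np.where(owner == k)[0] for k in range(p)]

parts = uniform_partition(n=581012, p=8)     # e.g. the cov dataset with 8 workers
print([len(idx) for idx in parts])           # each worker gets roughly n/p instances
```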
For each dataset, we construct four data partitions: $\pi^*$ (each worker has the whole data), $\pi_1$ (uniform partition), $\pi_2$ (75% of the positive instances and 25% of the negative instances are on the first 4 workers, and the other instances are on the last 4 workers), and $\pi_3$ (all positive instances are on the first 4 workers, and all negative instances are on the last 4 workers). The convergence results are shown in Figure 2 (b). We can see that the data partition does affect the convergence of pSCOPE. The best partition $\pi^*$ achieves the best performance³. The performance of the uniform partition $\pi_1$ is similar to that of the best partition $\pi^*$, and is better than that of the other two data partitions. In real applications with large-scale datasets, it is impractical to assign each worker the whole dataset. Hence, we prefer the uniform partition $\pi_1$ in real applications, which is also the partition adopted in the above experiments of this paper.

8 Conclusion

In this paper, we propose a novel method, called pSCOPE, for distributed sparse learning. Furthermore, we theoretically analyze how the data partition affects the convergence of pSCOPE. pSCOPE is both communication and computation efficient. Experiments on real data show that pSCOPE can outperform other state-of-the-art methods.

Acknowledgements

This work is partially supported by the Deng Feng project of Nanjing University.

³ The proof that $\pi^*$ is the best partition can be found in the long version [35].

References

[1] Necdet S. Aybat, Zi Wang, and Garud Iyengar. An asynchronous distributed proximal gradient method for composite convex optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2454-2462, 2015.
[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, pages 321-328, 2011.
[4] Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008-1031, 2016.
[5] Soham De and Tom Goldstein. Efficient distributed SGD with variance reduction. In Proceedings of the 16th IEEE International Conference on Data Mining, pages 111-120, 2016.
[6] John C. Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.
[7] Olivier Fercoq and Peter Richtárik. Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent. SIAM Review, 58(4):739-771, 2016.
[8] Pinghua Gong and Jieping Ye. A modified orthant-wise limited memory quasi-Newton method with convergence analysis. In Proceedings of the 32nd International Conference on Machine Learning, pages 276-284, 2015.
[9] Zhouyuan Huo, Bin Gu, and Heng Huang. Decoupled asynchronous proximal stochastic gradient descent with variance reduction. CoRR, abs/1609.06804, 2016.
[10] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[11] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems, pages 905-912, 2008.
[12] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: asynchronous parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 46-54, 2017.
[13] Jason D. Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18:122:1-122:43, 2017.
[14] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, pages 583-598, 2014.
[15] Yitan Li, Linli Xu, Xiaowei Zhong, and Qing Ling. Make workers work harder: decoupled asynchronous proximal stochastic gradient descent. CoRR, abs/1605.06619, 2016.
[16] Dhruv Mahajan, S. Sathiya Keerthi, and S. Sundararajan. A distributed block coordinate descent method for training l1 regularized linear classifiers. Journal of Machine Learning Research, 18:91:1-91:35, 2017.
[17] Qi Meng, Wei Chen, Jingcheng Yu, Taifeng Wang, Zhiming Ma, and Tie-Yan Liu. Asynchronous stochastic proximal optimization algorithms with variance reduction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2329-2335, 2017.
[18] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[19] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, pages 2647-2655, 2015.
[20] Peter Richtárik and Martin Takác. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433-484, 2016.
[21] Chad Scherrer, Mahantesh Halappanavar, Ambuj Tewari, and David Haglin. Scaling up coordinate descent algorithms for large l1 regularization problems. In Proceedings of the 29th International Conference on Machine Learning, pages 1407-1414, 2012.
[22] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83-112, 2017.
[23] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning, pages 747-754, 2016.
[24] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567-599, 2013.
[25] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, pages 64-72, 2014.
[26] Ziqiang Shi and Rujie Liu. Large scale optimization with proximal stochastic Newton-type gradient descent. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 691-704, 2015.
[27] Virginia Smith, Simone Forte, Michael I. Jordan, and Martin Jaggi. L1-regularized distributed optimization: a communication-efficient primal-dual framework. CoRR, abs/1512.04011, 2015.
[28] Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.
[29] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, 2001.
[30] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, pages 3636-3645, 2017.
[31] Tong T. Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224-244, 2008.
[32] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.
[33] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: a new platform for distributed machine learning on big data. In Proceedings of the 21st International Conference on Knowledge Discovery and Data Mining, pages 1335-1344, 2015.
[34] Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, and Wu-Jun Li. SCOPE: scalable composite optimization for learning on Spark. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2928-2934, 2017.
[35] Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, and Wu-Jun Li. Proximal SCOPE for distributed sparse learning: better data partition implies faster convergence rate. CoRR, abs/1803.05621, 2018.
[36] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301-320, 2005.