# Efficient Distributed Learning with Sparsity

Jialei Wang (University of Chicago, USA), Mladen Kolar (University of Chicago, USA), Nathan Srebro (Toyota Technological Institute at Chicago, USA), Tong Zhang (Tencent AI Lab, China). Correspondence to: Jialei Wang, Mladen Kolar, Nathan Srebro, Tong Zhang.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract. We propose a novel, efficient approach for distributed sparse learning with observations randomly partitioned across machines. In each round of the proposed method, worker machines compute the gradient of the loss on local data and the master machine solves a shifted $\ell_1$ regularized loss minimization problem. After a number of communication rounds that scales only logarithmically with the number of machines, and independently of other parameters of the problem, the proposed approach provably matches the estimation error bound of centralized methods.

1. Introduction

We consider learning a sparse linear regressor $\beta^*$ minimizing the population objective
$$\beta^* = \arg\min_\beta\ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\ell(Y, \langle X, \beta\rangle)\big], \qquad (1)$$
where $(X, Y)\in\mathcal{X}\times\mathcal{Y}\subseteq\mathbb{R}^p\times\mathbb{R}$ are drawn from an unknown distribution $\mathcal{D}$ and $\ell(\cdot,\cdot)$ is a convex loss function, based on $N$ i.i.d. samples $\{x_i, y_i\}_{i=1}^N$ drawn from $\mathcal{D}$, and when the support $S := \mathrm{support}(\beta^*) = \{j\in[p] \mid \beta^*_j \neq 0\}$ of $\beta^*$ is small, $|S| = s$. In a standard single-machine setting, a common empirical approach is to minimize the $\ell_1$ regularized empirical loss (see, e.g., (2) below). Here we consider a setting where data are distributed across $m$ machines and, for simplicity, assume that $N = nm$, so that each machine $j$ has access to $n$ i.i.d. observations $\{x_{ji}, y_{ji}\}_{i=1}^n$ from the same source $\mathcal{D}$ (equivalently, that the $N = nm$ samples are randomly partitioned across machines). Results in the paper easily generalize to a setting where each machine has a different number of observations.

The main contribution of the paper is a novel algorithm for estimating $\beta^*$ in a distributed setting. Our estimator is able to achieve the performance of a centralized procedure that has access to all data, while keeping computation and communication costs low. Compared to the existing one-shot estimation approach (Lee et al., 2015b), our method achieves the same statistical performance without performing the expensive debiasing step. As the number of communication rounds increases, the estimation accuracy improves until it matches the performance of a centralized procedure, which happens after a number of rounds logarithmic in the total number of machines. Furthermore, our results hold under weak assumptions on the data generating procedure.

We assume that communication occurs in rounds. In each round, machines exchange messages with the master machine. Between two rounds, each machine computes only with its local information, which includes local data and previous messages (Zhang et al., 2013b; Shamir & Srebro, 2014; Arjevani & Shamir, 2015). In a non-distributed setting, efficient estimation procedures need to balance statistical efficiency with computational efficiency (runtime). In a distributed setting, the situation is more complicated and we need to balance two resources, local runtime and number of rounds of communication, against the statistical error. The local runtime refers to the amount of work each machine needs to do. The number of rounds of communication refers to how often local machines need to exchange messages with the master machine.
We compare our procedure to other algorithms using the aforementioned metrics. We consider the following two baseline estimators of $\beta^*$. The local estimator uses data available only on the master (first) machine and ignores data available on other machines. In particular, it computes
$$\hat\beta_{\mathrm{local}} = \arg\min_\beta\ \frac{1}{n}\sum_{i=1}^n \ell(y_{1i}, \langle x_{1i}, \beta\rangle) + \lambda\|\beta\|_1 \qquad (2)$$
using locally available data. The local procedure is efficient in both communication and computation; however, the resulting estimation error is large compared to an estimator that uses all of the available data. The other, idealized baseline is the centralized estimator
$$\hat\beta_{\mathrm{centralize}} = \arg\min_\beta\ \frac{1}{mn}\sum_{j=1}^m\sum_{i=1}^n \ell(y_{ji}, \langle x_{ji}, \beta\rangle) + \lambda\|\beta\|_1.$$

| Approach | Communication ($n \gtrsim ms^2\log p$) | Computation ($n \gtrsim ms^2\log p$) | Communication ($ms^2\log p \gtrsim n \gtrsim s^2\log p$) | Computation ($ms^2\log p \gtrsim n \gtrsim s^2\log p$) |
|---|---|---|---|---|
| Centralize | $np$ | $T_{\mathrm{lasso}}(mn, p)$ | $np$ | $T_{\mathrm{lasso}}(mn, p)$ |
| Avg-Debias | $p$ | $p\, T_{\mathrm{lasso}}(n, p)$ | n/a | n/a |
| This paper (EDSL) | $p$ | $2\, T_{\mathrm{lasso}}(n, p)$ | $p\log m$ | $T_{\mathrm{lasso}}(n, p)\log m$ |

Table 1. Comparison of resources required for matching the centralized error bound of various approaches for high-dimensional distributed sparse linear regression problems, where $T_{\mathrm{lasso}}(n, p)$ is the runtime for solving a generalized lasso problem of size $n\times p$.

Unfortunately, because the data are huge and communication is expensive, we cannot compute the centralized estimator, even though it achieves the optimal statistical error. In a related setting, Lee et al. (2015b) studied a one-shot approach to learning $\beta^*$, called Avg-Debias, that is based on averaging debiased lasso estimators (Zhang & Zhang, 2013). Under strong assumptions on the data generating procedure, their approach matches the centralized error bound after one round of communication. While an encouraging result, this approach has limitations, which we list below.

- The debiasing step in Avg-Debias is computationally heavy, as it requires each local machine to estimate a $p\times p$ matrix. For example, Javanmard (2014, Section 5.1) transforms the problem of estimating the debiasing matrix $\Theta$ into $p$ generalized lasso problems. This is computationally prohibitive for high-dimensional problems (Zhang & Zhang, 2013; Javanmard & Montanari, 2014). In comparison, our procedure requires solving only one $\ell_1$ penalized objective in each iteration, which has the same time complexity as computing $\hat\beta_{\mathrm{local}}$ in (2). See Section 2 for details.
- The Avg-Debias procedure only matches the statistical error rate of the centralized procedure when the sample size per machine satisfies $n \gtrsim ms^2\log p$. Our approach improves this sample complexity to $n \gtrsim s^2\log p$.
- The Avg-Debias procedure requires strong conditions on the data generating process. For example, the data matrix is required to satisfy the generalized coherence condition for debiasing to work (the generalized coherence condition states that there exists a matrix $\Theta$ such that $\|\hat\Sigma\Theta - I_p\|_\infty \lesssim \sqrt{\log p / n}$, where $\hat\Sigma$ is the empirical covariance matrix). As we show here, such a condition is not needed for consistent high-dimensional estimation in a distributed setting. Instead, we only require a standard restricted eigenvalue condition, which is commonly assumed in the high-dimensional estimation literature.

Our method (EDSL) addresses the aforementioned issues of Avg-Debias. Table 1 summarizes the resources required for the approaches discussed above to solve distributed sparse linear regression problems.
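For concreteness, here is a minimal sketch of the two baselines for the squared loss: the local lasso fit (2) on the master's data and the idealized centralized lasso fit on all $mn$ pooled observations. This is an illustration only, not the authors' implementation; the ISTA-style inner solver and the function names are our own choices.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/(2n)) ||y - X beta||_2^2 + lam * ||beta||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / n        # gradient of (1/(2n)) ||y - X beta||^2
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Hypothetical layout: Xs[j], ys[j] hold the n observations of machine j.
def local_estimator(Xs, ys, lam):
    return lasso_ista(Xs[0], ys[0], lam)       # uses only the master's data, as in (2)

def centralized_estimator(Xs, ys, lam):
    return lasso_ista(np.vstack(Xs), np.concatenate(ys), lam)   # pools all mn observations
```

Each call to the inner solver roughly corresponds to one $T_{\mathrm{lasso}}(\cdot, p)$ unit in Table 1.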
Parallel Work. In parallel work (publicly announced on arXiv simultaneously with the results in this contribution), Jordan et al. (2016) present a method which is equivalent to the first iteration of our method, and thus achieves the same computational advantage over Avg-Debias as depicted in the left column of Table 1 and discussed in the first and third bullet points above. Jordan et al. extend the idea in ways different from and orthogonal to this submission, by also considering low-dimensional and Bayesian inference problems. Still, for high-dimensional problems, they only consider a one-shot procedure, and so do not achieve statistical optimality in the way our method does, and do not allow using $n \lesssim ms^2\log p$ samples per machine (see the right half of Table 1). The improved one-shot approach is thus a parallel contribution, made concurrently by Jordan et al. and by us, while the multi-step approach, the accompanying reduction in the required number of samples (discussed in the second bullet point above), and the improvement in statistical accuracy are distinct contributions of this submission.

Other Related Work. A large body of literature exists on distributed optimization for modern massive data sets (Dekel et al., 2012; Duchi et al., 2012; 2014; Zhang et al., 2013b; Zinkevich et al., 2010; Boyd et al., 2011; Balcan et al., 2012; Yang, 2013; Jaggi et al., 2014; Ma et al., 2015; Shamir & Srebro, 2014; Zhang & Xiao, 2015; Lee et al., 2015a; Arjevani & Shamir, 2015). A popular approach to distributed estimation is averaging estimators formed locally by different machines (McDonald et al., 2009; Zinkevich et al., 2010; Zhang et al., 2012; Huang & Huo, 2015). Divide-and-conquer procedures have also found applications in statistical inference (Zhao et al., 2014a; Cheng & Shang, 2015; Lu et al., 2016). Shamir & Srebro (2014) and Rosenblatt & Nadler (2014) showed that averaging local estimators at the end has a poor dependence on either the condition number or the dimension of the problem. Yang (2013), Jaggi et al. (2014), and Smith et al. (2016) studied distributed optimization using stochastic (dual) coordinate descent; these approaches try to find a good balance between computation and communication, but their communication complexity depends badly on the condition number. As a result, they are not better than first-order approaches, such as (proximal) accelerated gradient descent (Nesterov, 1983), in terms of communication. Shamir et al. (2014) and Zhang & Xiao (2015) proposed truly communication-efficient distributed optimization algorithms. They leveraged local second-order information and, as a result, obtained a milder dependence on the condition number compared to the first-order approaches (Boyd et al., 2011; Shamir & Srebro, 2014; Ma et al., 2015). Lower bounds were studied in Zhang et al. (2013a), Braverman et al. (2015), and Arjevani & Shamir (2015). However, it is not clear how to extend these existing approaches to problems with non-smooth objectives, including $\ell_1$ regularized problems.

Most of the above-mentioned work focuses on estimators that are (asymptotically) linear. Averaging at the end reduces the variance of these linear estimators, resulting in an estimator that matches the performance of a centralized procedure. Zhang et al. (2013c) studied averaging local estimators obtained by penalized kernel ridge regression, where the $\ell_2$ penalty is chosen smaller than usual to avoid the large bias problem. The situation in a high-dimensional setting is not so straightforward, since the sparsity inducing penalty introduces the bias in a nonlinear way.
Zhao et al. (2014b) illustrated how averaging debiased composite quantile regression estimators can be used for efficient inference in a high-dimensional setting. Averaging debiased high-dimensional estimators was subsequently used in Lee et al. (2015b) for distributed estimation, in multi-task learning (Wang et al., 2015), and in statistical inference (Battey et al., 2015).

Notation. We use $[n]$ to denote the set $\{1, \dots, n\}$. For a vector $a\in\mathbb{R}^n$, we let $\mathrm{support}(a) = \{j : a_j \neq 0\}$ be the support set, $\|a\|_q$, $q\in[1,\infty)$, the $\ell_q$-norm defined as $\|a\|_q = (\sum_{i\in[n]} |a_i|^q)^{1/q}$, and $\|a\|_\infty = \max_{i\in[n]} |a_i|$. For a matrix $A\in\mathbb{R}^{n_1\times n_2}$, we use the element-wise $\ell_\infty$ matrix norm $\|A\|_\infty = \max_{i\in[n_1], j\in[n_2]} |a_{ij}|$. $I_n$ denotes the $n\times n$ identity matrix. For two sequences of numbers $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, we use $a_n = O(b_n)$ to denote that $a_n \le C b_n$ for some finite positive constant $C$ and all $n$ large enough. If $a_n = O(b_n)$ and $b_n = O(a_n)$, we write $a_n \asymp b_n$. We also use $a_n \lesssim b_n$ for $a_n = O(b_n)$ and $a_n \gtrsim b_n$ for $b_n = O(a_n)$.

Paper Organization. We describe our method in Section 2, present the main results in the context of sparse linear regression in Section 3, and provide a generalized theory in Section 4. We demonstrate the effectiveness of the proposal via experiments in Section 5, and conclude the paper with a discussion in Section 6. In the Appendix, Section A illustrates some concrete examples of the general results in Section 4, all proofs are deferred to Section B, and more experimental results are presented in Section C.

Algorithm 1 Efficient Distributed Sparse Learning (EDSL).
    Input: data $\{x_{ji}, y_{ji}\}_{j\in[m], i\in[n]}$, loss function $\ell(\cdot,\cdot)$.
    Initialization: the master obtains $\hat\beta_0$ by minimizing (3) and broadcasts $\hat\beta_0$ to every worker.
    for $t = 0, 1, \dots$ do
        Workers: for $j = 2, 3, \dots, m$ do
            if $\hat\beta_t$ is received from the master then
                calculate the gradient $\nabla L_j(\hat\beta_t)$ and send it to the master.
        Master: if $\{\nabla L_j(\hat\beta_t)\}_{j=2}^m$ are received from all workers then
            obtain $\hat\beta_{t+1}$ by solving the shifted $\ell_1$ regularized problem in (4),
            and broadcast $\hat\beta_{t+1}$ to every worker.
    end for

2. Methodology

In this section, we detail our procedure for estimating $\beta^*$ in a distributed setting. Algorithm 1 provides an outline of the steps executed by the master and worker nodes. Let
$$L_j(\beta) = \frac{1}{n}\sum_{i=1}^n \ell(y_{ji}, \langle x_{ji}, \beta\rangle), \qquad j\in[m],$$
be the empirical loss at each machine. Our method starts by solving a local $\ell_1$ regularized M-estimation program. At iteration $t = 0$, the master (first) machine obtains $\hat\beta_0$ as a minimizer of the following program:
$$\min_\beta\ L_1(\beta) + \lambda_0\|\beta\|_1. \qquad (3)$$
The vector $\hat\beta_0$ is broadcast to all other machines, which use it to compute a gradient of the local loss at $\hat\beta_0$. In particular, each worker computes $\nabla L_j(\hat\beta_0)$ and communicates it back to the master. This constitutes one round of communication. At iteration $t + 1$, the master solves the shifted $\ell_1$ regularized problem
$$\hat\beta_{t+1} = \arg\min_\beta\ L_1(\beta) + \Big\langle \frac{1}{m}\sum_{j=1}^m \nabla L_j(\hat\beta_t) - \nabla L_1(\hat\beta_t),\ \beta\Big\rangle + \lambda_{t+1}\|\beta\|_1. \qquad (4)$$
A minimizer $\hat\beta_{t+1}$ is communicated to the other machines, which use it to compute the local gradients $\nabla L_j(\hat\beta_{t+1})$ as before.

Formulation (4) is inspired by the proposal in Shamir et al. (2014), where the authors studied distributed optimization for smooth and strongly convex empirical objectives. Compared to Shamir et al. (2014), we do not use any averaging scheme, which would require additional rounds of communication, and, moreover, we add an $\ell_1$ regularization term to ensure consistent estimation in high dimensions.
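As a complement to Algorithm 1, the following single-process sketch simulates the method for the squared loss; the helper names (`local_grad`, `shifted_lasso`, `edsl`), the data layout `Xs[j], ys[j]`, and the fixed inner-iteration budget are illustrative assumptions rather than details from the paper. The master step solves (4) with a proximal-gradient loop, and each outer iteration corresponds to one round of communication.

```python
import numpy as np

def local_grad(X, y, beta):
    """Gradient of L_j(beta) = (1/(2n)) sum_i (y_i - <x_i, beta>)^2."""
    n = X.shape[0]
    return X.T @ (X @ beta - y) / n

def shifted_lasso(X1, y1, shift, lam, n_iters=500):
    """Master step: minimize L_1(beta) + <shift, beta> + lam * ||beta||_1,
    where shift = (1/m) sum_j grad L_j(beta_t) - grad L_1(beta_t), as in (4)."""
    n, p = X1.shape
    step = n / (np.linalg.norm(X1, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iters):
        z = beta - step * (local_grad(X1, y1, beta) + shift)   # gradient step on the smooth part
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def edsl(Xs, ys, lambdas, n_rounds):
    """lambdas[t] plays the role of lambda_t; its decreasing schedule is discussed in Section 3."""
    p = Xs[0].shape[1]
    beta = shifted_lasso(Xs[0], ys[0], np.zeros(p), lambdas[0])          # initialization (3)
    for t in range(n_rounds):
        # one communication round: each worker sends its p-dimensional gradient at beta_t
        grads = [local_grad(X, y, beta) for X, y in zip(Xs, ys)]
        shift = np.mean(grads, axis=0) - grads[0]
        beta = shifted_lasso(Xs[0], ys[0], shift, lambdas[t + 1])        # shifted problem (4)
    return beta
```

Note that only the $p$-dimensional gradients cross machine boundaries, which is the source of the $O(p)$ communication per round reported in Table 1.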
Different from distributed first-order optimization approaches, the refined objective (4) leverages both global first-order information and local higher-order information. To see this, suppose we set $\lambda_{t+1} = 0$ and that $L_1(\beta)$ is a quadratic objective with an invertible Hessian. Then we have the following closed form solution for (4):
$$\hat\beta_{t+1} = \hat\beta_t - \big(\nabla^2 L_1(\hat\beta_t)\big)^{-1}\,\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\hat\beta_t),$$
which is exactly a sub-sampled Newton update rule. Unfortunately, for high-dimensional problems the Hessian is no longer invertible, and an $\ell_1$ regularization term is added to make the solution well behaved. The regularization parameter $\lambda_t$ will be chosen so that it decreases with the iteration number $t$. As a result, we will be able to show that the final estimator performs as well as the centralized solution. We discuss in detail how to choose $\lambda_t$ in the following section.

3. Main Result

We illustrate our main theoretical results in the context of the sparse linear regression model
$$y_{ji} = \langle x_{ji}, \beta^*\rangle + \epsilon_{ji}, \qquad i\in[n],\ j\in[m], \qquad (5)$$
where $x_{ji}$ is a subgaussian $p$-dimensional vector of input variables and $\epsilon_{ji}$ is i.i.d. mean-zero subgaussian noise. The loss function considered is the usual squared loss $\ell(y, \hat y) = \frac{1}{2}(y - \hat y)^2$. With this notation, the centralized approach leads to the lasso estimator (Tibshirani, 1996)
$$\hat\beta_{\mathrm{centralize}} = \arg\min_\beta\ \frac{1}{m}\sum_{j=1}^m L_j(\beta) + \lambda\|\beta\|_1,$$
where the loss at worker $j$ is $L_j(\beta) = \frac{1}{2n}\sum_{i\in[n]} (y_{ji} - \langle\beta, x_{ji}\rangle)^2$.

Before stating the main result, we provide the definition of the subgaussian norm (Vershynin, 2012).

Definition 1 (Subgaussian norm). The subgaussian norm $\|X\|_{\psi_2}$ of a subgaussian $p$-dimensional random vector $X$ is defined as
$$\|X\|_{\psi_2} = \sup_{x\in S^{p-1}}\ \sup_{q\ge 1}\ q^{-1/2}\,\big(\mathbb{E}|\langle X, x\rangle|^q\big)^{1/q},$$
where $S^{p-1}$ is the unit sphere in $\mathbb{R}^p$.

We also need an assumption on the restricted strong convexity constant (Negahban et al., 2012).

Assumption 2. We assume that there exists a $\kappa > 0$ such that for any $\Delta\in C(S, 3)$,
$$\frac{1}{2n}\|X_1\Delta\|_2^2 \ge \kappa\|\Delta\|_2^2,$$
where $C(S, 3) = \{\Delta\in\mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}$ is a restricted cone in $\mathbb{R}^p$ and $X_1 = [x_{11}^\top; x_{12}^\top; \dots; x_{1n}^\top]\in\mathbb{R}^{n\times p}$ is the data matrix on the master machine.

When the $x_{ji}$ are randomly drawn from a subgaussian distribution, Assumption 2 is satisfied with high probability as long as $n\gtrsim s\log p$ (Rudelson & Zhou, 2013). We are now ready to state the estimation error bound for $\hat\beta_{t+1}$ obtained using Algorithm 1.

Theorem 3. Assume that data are generated from the sparse linear regression model in (5) with $\|x_{ji}\|_{\psi_2}\le\sigma_X$ and $\|\epsilon_{ji}\|_{\psi_2}\le\sigma$. Let
$$\lambda_{t+1} = 2\Big\|\frac{1}{mn}\sum_{j\in[m]}\sum_{i\in[n]} x_{ji}\epsilon_{ji}\Big\|_\infty + 2L\max_{j,i}\|x_{ji}\|_\infty^2\,\sqrt{\frac{\log(p/\delta)}{n}}\,\|\hat\beta_t - \beta^*\|_1. \qquad (6)$$
Then for $t\ge 0$ we have, with probability at least $1 - 2\delta$,
$$\|\hat\beta_{t+1} - \beta^*\|_1 \lesssim a_n^{t+1}\,\|\hat\beta_0 - \beta^*\|_1 + \frac{1 - a_n^{t+1}}{1 - a_n}\cdot\frac{s\,\sigma\sigma_X}{\kappa}\,\sqrt{\frac{\log(p/\delta)}{mn}},$$
$$\|\hat\beta_{t+1} - \beta^*\|_2 \lesssim a_n^{t}\,b_n\,\|\hat\beta_0 - \beta^*\|_1 + \frac{1 - a_n^{t+1}}{1 - a_n}\cdot\frac{\sqrt{s}\,\sigma\sigma_X}{\kappa}\,\sqrt{\frac{\log(p/\delta)}{mn}},$$
where $a_n$ and $b_n$ are the contraction factors defined in Corollary 7 below; under the scaling considered here, $a_n\asymp s\sqrt{\log p/n}$ and $b_n\asymp\sqrt{s\log p/n}$.

We can simplify the bound obtained in Theorem 3 by looking at the scaling with respect to $n$, $m$, $s$, and $p$, treating $\kappa$, $\sigma$, and $\sigma_X$ as constants. Suppose $n\gtrsim s^2\log p$ and set $\lambda_t$ as in (6). The following error bounds hold for Algorithm 1:
$$\|\hat\beta_t - \beta^*\|_1 \lesssim_P s\sqrt{\frac{\log p}{mn}} + s\Big(s\sqrt{\frac{\log p}{n}}\Big)^{t}\sqrt{\frac{\log p}{n}}, \qquad \|\hat\beta_t - \beta^*\|_2 \lesssim_P \sqrt{\frac{s\log p}{mn}} + \sqrt{s}\Big(s\sqrt{\frac{\log p}{n}}\Big)^{t}\sqrt{\frac{\log p}{n}}.$$
We can compare the above bounds to the performance of the local and centralized lasso (Wainwright, 2009; Meinshausen & Yu, 2009; Bickel et al., 2009). For $\hat\beta_{\mathrm{local}}$, we have
$$\|\hat\beta_{\mathrm{local}} - \beta^*\|_1 \lesssim_P s\sqrt{\frac{\log p}{n}}, \qquad \|\hat\beta_{\mathrm{local}} - \beta^*\|_2 \lesssim_P \sqrt{\frac{s\log p}{n}}.$$
For $\hat\beta_{\mathrm{centralize}}$, we have
$$\|\hat\beta_{\mathrm{centralize}} - \beta^*\|_1 \lesssim_P s\sqrt{\frac{\log p}{mn}}, \qquad \|\hat\beta_{\mathrm{centralize}} - \beta^*\|_2 \lesssim_P \sqrt{\frac{s\log p}{mn}}.$$
We see that after one round of communication, we have
$$\|\hat\beta_1 - \beta^*\|_1 \lesssim_P s\sqrt{\frac{\log p}{mn}} + \frac{s^2\log p}{n}, \qquad \|\hat\beta_1 - \beta^*\|_2 \lesssim_P \sqrt{\frac{s\log p}{mn}} + \frac{s^{3/2}\log p}{n}.$$
These bounds match the results in Lee et al. (2015b) without the expensive debiasing step. Furthermore, when $m \lesssim \frac{n}{s^2\log p}$, they match the performance of the centralized lasso.
Finally, as long as $t\gtrsim\log m$ and $n\gtrsim s^2\log p$, it is easy to check that the second term in each of the above bounds is at most of the order $s\sqrt{\log p/(mn)}$. Therefore
$$\|\hat\beta_{t+1} - \beta^*\|_1 \lesssim_P s\sqrt{\frac{\log p}{mn}}, \qquad \|\hat\beta_{t+1} - \beta^*\|_2 \lesssim_P \sqrt{\frac{s\log p}{mn}},$$
which matches the centralized lasso performance without additional error terms. That is, as long as $n\gtrsim s^2\log p$, the number of communication rounds needed to match the centralized procedure grows only logarithmically with the number of machines and is independent of other problem parameters. In contrast, for distributed learning methods studied in the literature for minimizing smooth objectives, the number of communication rounds needed to match the centralized procedure grows polynomially with $m$ (see Table 1 in Zhang & Xiao, 2015). This is because we exploit the underlying restricted strong convexity of the empirical loss functions, while prior work on distributed minimization of smooth objectives (Shamir et al., 2014; Zhang & Xiao, 2015) only considers strong convexity induced explicitly by regularization.

4. Generalized Theory and Proof Sketch

In order to establish Theorem 3, we prove an error bound on $\hat\beta - \beta^*$ for a general loss $\ell(\cdot,\cdot)$ and $\hat\beta$ obtained using Algorithm 1. To simplify the presentation, we assume that the domain $\mathcal{X}$ is bounded and that the loss function $\ell(\cdot,\cdot)$ is smooth.

Assumption 4. The loss $\ell(\cdot,\cdot)$ is $L$-smooth with respect to the second argument:
$$|\ell'(a, b) - \ell'(a, c)| \le L|b - c|, \qquad \forall a, b, c\in\mathbb{R}.$$
Furthermore, $|\ell'''(a, b)| \le M$ for all $a, b\in\mathbb{R}$.

Commonly used loss functions in statistical learning, including the squared loss for regression and the logistic loss for classification, satisfy this assumption (Zhang et al., 2013b). Next, we state the restricted strong convexity condition for a general loss function (Negahban et al., 2012).

Assumption 5. There exists $\kappa > 0$ such that for any $\Delta\in C(S, 3)$,
$$L_1(\beta^* + \Delta) - L_1(\beta^*) - \langle\nabla L_1(\beta^*), \Delta\rangle \ge \kappa\|\Delta\|_2^2,$$
with $C(S, 3) = \{\Delta\in\mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}$.

The restricted strong convexity holds with high probability for a wide range of models and designs, and it is commonly assumed for showing consistent estimation in high dimensions (see, for example, van de Geer & Bühlmann, 2009; Negahban et al., 2012; Raskutti et al., 2010; Rudelson & Zhou, 2013, for details). Our main theoretical result establishes a recursive estimation error bound, which relates the estimation error $\|\hat\beta_{t+1} - \beta^*\|$ to that of the previous iteration, $\|\hat\beta_t - \beta^*\|_1$.

Theorem 6. Suppose Assumptions 4 and 5 hold. Let
$$\lambda_{t+1} = 2\Big(\Big\|\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\beta^*)\Big\|_\infty + 2L\max_{j,i}\|x_{ji}\|_\infty^2\,\sqrt{\frac{\log(p/\delta)}{n}}\,\|\hat\beta_t - \beta^*\|_1 + 2M\max_{j,i}\|x_{ji}\|_\infty^3\,\|\hat\beta_t - \beta^*\|_1^2\Big).$$
Then with probability at least $1 - \delta$, we have
$$\|\hat\beta_{t+1} - \beta^*\|_1 \le \frac{48s}{\kappa}\Big(\Big\|\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\beta^*)\Big\|_\infty + 2L\max_{j,i}\|x_{ji}\|_\infty^2\,\sqrt{\frac{\log(p/\delta)}{n}}\,\|\hat\beta_t - \beta^*\|_1 + M\max_{j,i}\|x_{ji}\|_\infty^3\,\|\hat\beta_t - \beta^*\|_1^2\Big),$$
$$\|\hat\beta_{t+1} - \beta^*\|_2 \le \frac{12\sqrt{s}}{\kappa}\Big(\Big\|\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\beta^*)\Big\|_\infty + 2L\max_{j,i}\|x_{ji}\|_\infty^2\,\sqrt{\frac{\log(p/\delta)}{n}}\,\|\hat\beta_t - \beta^*\|_1 + M\max_{j,i}\|x_{ji}\|_\infty^3\,\|\hat\beta_t - \beta^*\|_1^2\Big).$$

Theorem 6 upper bounds the estimation error $\|\hat\beta_{t+1} - \beta^*\|_1$ as a function of $\|\hat\beta_t - \beta^*\|_1$. Applying Theorem 6 iteratively, we immediately obtain the following estimation error bound, which depends on the quality of the local $\ell_1$ regularized estimate through $\|\hat\beta_0 - \beta^*\|_1$.

Corollary 7. Suppose the conditions of Theorem 6 are satisfied. Furthermore, suppose that for all $t$ we have
$$M\max_{j,i}\|x_{ji}\|_\infty\,\|\hat\beta_t - \beta^*\|_1 \le L. \qquad (10)$$
Then with probability at least $1 - \delta$, we have
$$\|\hat\beta_{t+1} - \beta^*\|_1 \le a_n^{t+1}\,\|\hat\beta_0 - \beta^*\|_1 + (1 - a_n)^{-1}(1 - a_n^{t+1})\,\frac{48s}{\kappa}\Big\|\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\beta^*)\Big\|_\infty,$$
$$\|\hat\beta_{t+1} - \beta^*\|_2 \le a_n^{t}\,b_n\,\|\hat\beta_0 - \beta^*\|_1 + (1 - a_n)^{-1}(1 - a_n^{t+1})\,\frac{12\sqrt{s}}{\kappa}\Big\|\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\beta^*)\Big\|_\infty,$$
where $a_n \asymp \frac{sL}{\kappa}\max_{j,i}\|x_{ji}\|_\infty^2\sqrt{\frac{\log(p/\delta)}{n}}$ and $b_n \asymp \frac{\sqrt{s}\,L}{\kappa}\max_{j,i}\|x_{ji}\|_\infty^2\sqrt{\frac{\log(p/\delta)}{n}}$.

For the quadratic loss we have $M = 0$, and the condition in (10) holds. For other types of losses, the condition in (10) will be true for $t$ large enough when $m\gtrsim s^2$, leading to a locally exponential rate of convergence until the statistically optimal region is reached.
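To see where the geometric structure of Corollary 7 comes from, the display below sketches the standard unrolling of the one-step bound in Theorem 6; the shorthand $e_t$ and $r_n$ is introduced only for this sketch, with constants absorbed, and the higher-order $M$-term is controlled using condition (10).

```latex
% Write the ell_1 bound of Theorem 6 schematically as
%   e_{t+1} <= a_n e_t + r_n,  where  e_t := ||\hat\beta_t - \beta^*||_1  and
%   r_n collects the statistical term driven by ||(1/m) \sum_j \nabla L_j(\beta^*)||_\infty.
% Iterating the recursion and summing the geometric series gives
\[
  e_{t+1} \le a_n e_t + r_n
          \le a_n\big(a_n e_{t-1} + r_n\big) + r_n
          \le \dots
          \le a_n^{t+1} e_0 + \Big(\sum_{k=0}^{t} a_n^{k}\Big) r_n
          =   a_n^{t+1} e_0 + \frac{1 - a_n^{t+1}}{1 - a_n}\, r_n ,
\]
% which is the form appearing in Corollary 7: the optimization error contracts
% geometrically, while the statistical term accumulates only up to the factor (1 - a_n)^{-1}.
```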
4.1. Proof Sketch of Theorem 6

We first analyze how the estimation error bound decreases after one round of communication. In particular, we bound $\|\hat\beta_{t+1} - \beta^*\|$ in terms of $\|\hat\beta_t - \beta^*\|$. Define
$$\widetilde L_1(\beta, \hat\beta_t) = L_1(\beta) + \Big\langle\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\hat\beta_t) - \nabla L_1(\hat\beta_t),\ \beta\Big\rangle,$$
$$\widetilde\nabla L_1(\beta, \hat\beta_t) = \nabla L_1(\beta) + \frac{1}{m}\sum_{j\in[m]}\nabla L_j(\hat\beta_t) - \nabla L_1(\hat\beta_t).$$
The following lemma bounds the $\ell_\infty$ norm of $\widetilde\nabla L_1(\beta^*, \hat\beta_t)$.

Lemma 8. With probability at least $1 - \delta$, we have
$$\big\|\widetilde\nabla L_1(\beta^*, \hat\beta_t)\big\|_\infty \le \Big\|\frac{1}{m}\sum_{j\in[m]}\nabla L_j(\beta^*)\Big\|_\infty + 2L\max_{j,i}\|x_{ji}\|_\infty^2\,\sqrt{\frac{\log(p/\delta)}{n}}\,\|\hat\beta_t - \beta^*\|_1 + M\max_{j,i}\|x_{ji}\|_\infty^3\,\|\hat\beta_t - \beta^*\|_1^2.$$

The lemma bounds the magnitude of the gradient of the shifted loss at the optimum $\beta^*$. This will be used to guide our choice of the $\ell_1$ regularization parameter $\lambda_{t+1}$ in (4). The following lemma shows that as long as $\lambda_{t+1}$ is large enough, it is guaranteed that $\hat\beta_{t+1} - \beta^*$ lies in a restricted cone.

Lemma 9. Suppose $\lambda_{t+1} \ge 2\,\|\widetilde\nabla L_1(\beta^*, \hat\beta_t)\|_\infty$. Then with probability at least $1 - \delta$, we have $\hat\beta_{t+1} - \beta^*\in C(S, 3)$.

Based on the conic condition and the restricted strong convexity condition, we can obtain the recursive error bound stated in Theorem 6 following the proof strategy of Negahban et al. (2012).

Applications. Theorem 6 can be used to establish statistical guarantees for more general sparse learning problems. For example, consider logistic regression, a popular classification model where the binary label $y_{ji}\in\{-1, 1\}$ is drawn according to a Bernoulli distribution:
$$\mathbb{P}(y_{ji}\mid x_{ji}) = \frac{\exp(y_{ji}\langle x_{ji}, \beta^*\rangle)}{\exp(y_{ji}\langle x_{ji}, \beta^*\rangle) + 1}. \qquad (12)$$
We can establish local exponential convergence when applying Algorithm 1 to estimate $\beta^*$ in the high-dimensional logistic model. Section A in the Appendix provides formal guarantees and more illustrative examples.
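As a brief illustration of how Algorithm 1 applies beyond the squared loss, the only ingredient that changes for the logistic model (12) is each worker's local gradient. The sketch below (our notation and function name, not code from the paper) uses the negative log-likelihood of (12) as the loss.

```python
import numpy as np

def local_grad_logistic(X, y, beta):
    """Gradient of L_j(beta) = (1/n) sum_i log(1 + exp(-y_i <x_i, beta>)),
    the negative log-likelihood of model (12) with labels y_i in {-1, +1}."""
    n = X.shape[0]
    margins = y * (X @ beta)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), so the chain rule gives:
    weights = -y / (1.0 + np.exp(margins))
    return X.T @ weights / n
```

With this gradient in place of the squared-loss gradient, the master's shifted $\ell_1$ step (4) is unchanged.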
Figure 1. Comparison of various algorithms for distributed sparse learning on simulated data; each panel ($m = 5, 10, 20$) shows estimation error versus rounds of communication for Local, Prox-GD, Centralize, Avg-Debias, and EDSL. First row: sparse linear regression ($n = 500$, $p = 3000$, $s = 10$, $X\sim N(0, \Sigma)$, $\Sigma_{ij} = 0.5^{|i-j|/5}$); second row: sparse logistic regression ($n = 1000$, $p = 3000$, $s = 10$, $X\sim N(0, \Sigma)$, $\Sigma_{ij} = 0.5^{|i-j|/5}$).

5. Experiments

In this section we present empirical comparisons between various approaches on both simulated and real-world datasets (please refer to Section C in the Appendix for full experimental results and more details). We run the algorithms on both distributed regression and classification problems, and compare the following algorithms: i) Local; ii) Centralize; iii) distributed proximal gradient descent (Prox-GD); iv) Avg-Debias (Lee et al., 2015b) with hard thresholding; and v) the proposed EDSL approach.

5.1. Simulations

We first examine the algorithms on simulated data. We generate $\{x_{ji}\}_{j\in[m], i\in[n]}$ from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$. The covariance $\Sigma$ controls the condition number of the problem, and we vary it to see how the performance changes. We set $\Sigma_{ij} = 0.5^{|i-j|}$ for the well-conditioned setting and $\Sigma_{ij} = 0.5^{|i-j|/5}$ for the ill-conditioned setting. The response variables $\{y_{ji}\}_{j\in[m], i\in[n]}$ are drawn from (5) and (12) for regression and classification problems, respectively. For regression, the noise $\epsilon_{ji}$ is sampled from a standard normal distribution. The true model $\beta^*$ is set to be $s$-sparse, where the first $s$ entries are sampled i.i.d. from a uniform distribution on $[0, 1]$ and the remaining entries are set to zero.

We run experiments with various $(n, p, m, s)$ settings, where $n$ is the sample size per machine, $p$ the problem dimension, $m$ the number of machines, and $s$ the true support size. The estimation error $\|\hat\beta_t - \beta^*\|_2$ is shown versus rounds of communication for Prox-GD and the proposed EDSL algorithm. We also plot the estimation error of Local, Avg-Debias, and Centralize as horizontal lines, since the communication cost is fixed for these algorithms (they have zero, one-shot, and full communication, respectively). To give a sense of the computational cost, for a problem with $n = 200$, $p = 1000$, each round of EDSL takes about 0.048s, while Avg-Debias takes about 40.334s. Figure 1 summarizes the results, averaged across 10 independent trials. We have the following observations:

- The Avg-Debias approach obtained a much better estimation error compared to Local after one round of communication and sometimes performed quite close to Centralize. However, in most cases there is still a gap compared with Centralize, especially when the problem is not well-conditioned or $m$ is large.
- Prox-GD converges very slowly when the condition number becomes bad (the $\Sigma_{ij} = 0.5^{|i-j|/5}$ case).
- As the theory suggests, EDSL obtained a solution that is competitive with Avg-Debias after one round of communication. The estimation error decreases to match the performance of Centralize within a few rounds of communication, typically fewer than 5, even though the theory suggests EDSL will match the performance of Centralize within $O(\log m)$ rounds of communication.

The above experiments illustrate our theoretical results in finite samples. As the theory suggests, when the per-machine sample size $n$ is relatively small, one round of communication is not sufficient for Avg-Debias to match the performance of the centralized procedure. EDSL, however, matches the performance of Avg-Debias after one round of communication and then further improves the estimation quality, exponentially shrinking the gap to the centralized procedure until the centralized performance is matched. Thus, the proposed EDSL improves over the Avg-Debias approach both computationally and statistically.
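For reference, the simulated-data design described above can be generated along the following lines; the function names and the `decay_scale` parametrization of $\Sigma$ are our own, and this is a sketch of the stated setup rather than the authors' experiment code.

```python
import numpy as np

def make_design(n, p, decay_scale, rng):
    """Rows x_i ~ N(0, Sigma) with Sigma_ij = 0.5 ** (|i - j| / decay_scale)."""
    idx = np.arange(p)
    Sigma = 0.5 ** (np.abs(idx[:, None] - idx[None, :]) / decay_scale)
    return rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def make_beta(p, s, rng):
    """s-sparse truth: first s entries Uniform[0, 1], remaining entries zero."""
    beta = np.zeros(p)
    beta[:s] = rng.uniform(0.0, 1.0, size=s)
    return beta

def simulate_machine(n, p, beta, decay_scale, task, rng):
    """One machine's n observations from the linear model (5) or the logistic model (12)."""
    X = make_design(n, p, decay_scale, rng)
    if task == "regression":
        y = X @ beta + rng.standard_normal(n)          # model (5), standard normal noise
    else:
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))       # model (12): P(y = +1 | x)
        y = np.where(rng.uniform(size=n) < prob, 1.0, -1.0)
    return X, y

rng = np.random.default_rng(0)
beta_star = make_beta(p=3000, s=10, rng=rng)
# e.g. the ill-conditioned regression setting: decay_scale = 5, n = 500 per machine, m = 10
machines = [simulate_machine(500, 3000, beta_star, 5.0, "regression", rng)
            for _ in range(10)]
```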
5.2. Real-world Data Evaluation

In this section, we compare the distributed sparse learning algorithms on several real-world datasets. For all data sets, we use 60% of the data for training, 20% as a held-out validation set for tuning the parameters, and the remaining 20% for testing. We randomly partition the data 10 times and report the average performance on the test set. For regression tasks, the evaluation metric is the normalized mean squared error (normalized MSE), while for classification tasks we report the misclassification error. We randomly partition the data across $m = 10$ machines.

Figure 2. Comparison of various approaches for distributed sparse regression and classification on real-world datasets (panels: mnist 1 vs 2, connect4, dna), showing classification error (%) or normalized MSE versus rounds of communication for Local, Prox-GD, Centralize, Avg-Debias, and EDSL.

A subset of the results is plotted in Figure 2; on some data sets the performance of Avg-Debias is significantly worse than the other methods (mostly because the debiasing step fails), so we omit those plots. Since there is no well-specified model on these datasets, the curves behave quite differently on different data sets. However, a large gap between the local and centralized procedures is consistent, as the latter uses 10 times more data. Avg-Debias often fails on these real datasets and performs much worse than in the simulations. The main reason might be that its assumptions, such as a well-specified model or the generalized coherence condition, do not hold; when they fail, Avg-Debias can break down completely and produce a solution that is even much worse than the local one. Nevertheless, the proposed EDSL performs quite robustly on real-world data sets, and can often output a solution that is highly competitive with the centralized model within a few rounds of communication. We also observed a slight zig-zag behavior for the EDSL approach on some data sets. For example, on the mushrooms data set, the predictive performance of EDSL is not stable. In sum, the experimental results on real-world data sets verify that the proposed EDSL method is effective for distributed sparse learning problems.

6. Conclusion and Discussion

We proposed a novel approach for distributed learning with sparsity, which is efficient in both computation and communication. Our theoretical analysis showed that the proposed method works under weaker conditions than the Avg-Debias estimator while matching its error bound with one round of communication. Furthermore, the estimation error can be improved with logarithmically many more rounds of communication until it matches the centralized procedure. Experiments on both simulated and real-world data demonstrate that the proposed method significantly improves the performance over one-shot averaging approaches and matches the centralized procedure within a few iterations.

There might be several ways to improve this work. As we saw in the real-data experiments, the proposed approach can still perform slightly worse than the centralized approach on certain datasets. It is interesting to explore how to make EDSL provably work under even weaker assumptions. For example, EDSL requires $O(s^2\log p)$ samples per machine to match the centralized method in $O(\log m)$ rounds of communication; it is not clear whether this sample size requirement can be improved while still maintaining a low communication cost. Last but not least, it would be interesting to explore how the presented ideas can improve the computational cost of communication-efficient distributed multitask learning with shared support (Wang et al., 2015).

References

Arjevani, Yossi and Shamir, Ohad. Communication complexity of distributed convex learning and optimization. arXiv preprint arXiv:1506.01900, June 2015.

Balcan, Maria-Florina, Blum, Avrim, Fine, Shai, and Mansour, Yishay. Distributed learning, communication complexity and privacy. In Mannor, Shie, Srebro, Nathan, and Williamson, Robert C. (eds.), JMLR W&CP 23: COLT 2012, volume 23, pp. 26.1-26.22, 2012.

Battey, Heather, Fan, Jianqing, Liu, Han, Lu, Junwei, and Zhu, Ziwei. Distributed estimation and inference with statistical guarantees. arXiv preprint arXiv:1509.05457, September 2015.

Bickel, Peter J., Ritov, Ya'acov, and Tsybakov, Alexandre B. Simultaneous analysis of lasso and Dantzig selector. Ann. Stat., 37(4):1705-1732, 2009. doi: 10.1214/08-AOS620.
Boyd, Stephen P., Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1-122, 2011.

Braverman, Mark, Garg, Ankit, Ma, Tengyu, Nguyen, Huy L., and Woodruff, David P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. arXiv preprint arXiv:1506.07216, June 2015.

Cheng, Guang and Shang, Zuofeng. Computational limits of divide-and-conquer method. arXiv preprint arXiv:1512.09226, December 2015.

Dekel, Ofer, Gilad-Bachrach, Ran, Shamir, Ohad, and Xiao, Lin. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13:165-202, 2012.

Duchi, John C., Agarwal, Alekh, and Wainwright, Martin J. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Automat. Control, 57(3):592-606, 2012. doi: 10.1109/TAC.2011.2161027.

Duchi, John C., Jordan, Michael I., Wainwright, Martin J., and Zhang, Yuchen. Optimality guarantees for distributed statistical estimation. arXiv preprint arXiv:1405.0782, May 2014.

Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc., 58:13-30, 1963.

Huang, Cheng and Huo, Xiaoming. A distributed one-step estimator. arXiv preprint arXiv:1511.01443, November 2015.

Jaggi, Martin, Smith, Virginia, Takáč, Martin, Terhorst, Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jordan, Michael I. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 3068-3076, 2014.

Javanmard, Adel. Inference and estimation in high-dimensional data analysis. PhD dissertation, Stanford University, 2014.

Javanmard, Adel and Montanari, Andrea. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15(Oct):2869-2909, 2014.

Jordan, Michael I., Lee, Jason D., and Yang, Yun. Communication-efficient distributed statistical learning. arXiv preprint arXiv:1605.07689, 2016.

Lee, Jason D., Lin, Qihang, Ma, Tengyu, and Yang, Tianbao. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, July 2015a.

Lee, Jason D., Sun, Yuekai, Liu, Qiang, and Taylor, Jonathan E. Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337, 2015b.

Lu, Junwei, Cheng, Guang, and Liu, Han. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, January 2016.

Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtárik, Peter, and Takáč, Martin. Adding vs. averaging in distributed primal-dual optimization. arXiv preprint arXiv:1502.03508, February 2015.

McCullagh, P. and Nelder, J. A. Generalized linear models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, second edition, 1989. doi: 10.1007/978-1-4899-3242-6.

McDonald, Ryan, Mohri, Mehryar, Silberman, Nathan, Walker, Dan, and Mann, Gideon S. Efficient large-scale distributed training of conditional maximum entropy models. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 1231-1239. Curran Associates, Inc., 2009.
Meinshausen, Nicolai and Bühlmann, Peter. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436-1462, 2006.

Meinshausen, Nicolai and Yu, Bin. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat., 37(1):246-270, 2009.

Negahban, Sahand N., Ravikumar, Pradeep, Wainwright, Martin J., and Yu, Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci., 27(4):538-557, 2012.

Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372-376, 1983.

Raskutti, Garvesh, Wainwright, Martin J., and Yu, Bin. Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res., 11:2241-2259, 2010.

Ravikumar, Pradeep, Wainwright, Martin J., and Lafferty, J. D. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Stat., 38(3):1287-1319, 2010.

Rosenblatt, Jonathan and Nadler, Boaz. On the optimality of averaging in distributed statistical learning. arXiv preprint arXiv:1407.2724, July 2014.

Rudelson, Mark and Zhou, Shuheng. Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory, 59(6):3434-3447, 2013.

Shamir, Ohad and Srebro, Nathan. Distributed stochastic optimization and learning. In 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 850-857. IEEE, 2014.

Shamir, Ohad, Srebro, Nathan, and Zhang, Tong. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, pp. 1000-1008, 2014.

Smith, Virginia, Forte, Simone, Ma, Chenxin, Takáč, Martin, Jordan, Michael I., and Jaggi, Martin. CoCoA: A general framework for communication-efficient distributed optimization. arXiv preprint arXiv:1611.02189, 2016.

Tibshirani, Robert J. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267-288, 1996.

van de Geer, Sara A. High-dimensional generalized linear models and the lasso. Ann. Stat., 36(2):614-645, 2008.

van de Geer, Sara A. and Bühlmann, Peter. On the conditions used to prove oracle results for the lasso. Electron. J. Stat., 3:1360-1392, 2009.

Vershynin, Roman. Introduction to the non-asymptotic analysis of random matrices. In Eldar, Y. C. and Kutyniok, G. (eds.), Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.

Wainwright, Martin J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Trans. Inf. Theory, 55(5):2183-2202, 2009. doi: 10.1109/TIT.2009.2016018.

Wang, Jialei, Kolar, Mladen, and Srebro, Nathan. Distributed multitask learning. arXiv preprint arXiv:1510.00633, October 2015.

Wu, Tong Tong, Chen, Yi Fang, Hastie, Trevor J., Sobel, Eric, and Lange, Kenneth L. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714-721, 2009. doi: 10.1093/bioinformatics/btp041.

Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 629-637. Curran Associates, Inc., 2013.

Yuan, M. and Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
Zhang, Cun-Hui and Zhang, Stephanie S. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. B, 76(1):217-242, Jul 2013.

Zhang, Yuchen and Xiao, Lin. Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263, 2015.

Zhang, Yuchen, Wainwright, Martin J., and Duchi, John C. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pp. 1502-1510, 2012.

Zhang, Yuchen, Duchi, John C., Jordan, Michael I., and Wainwright, Martin J. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pp. 2328-2336, 2013a.

Zhang, Yuchen, Duchi, John C., and Wainwright, Martin J. Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res., 14:3321-3363, 2013b.

Zhang, Yuchen, Duchi, John C., and Wainwright, Martin J. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. arXiv preprint arXiv:1305.5029, 2013c.

Zhao, Tianqi, Cheng, Guang, and Liu, Han. A partially linear framework for massive heterogeneous data. arXiv preprint arXiv:1410.8570, October 2014a.

Zhao, Tianqi, Kolar, Mladen, and Liu, Han. A general framework for robust testing and confidence regions in high-dimensional quantile regression. arXiv preprint arXiv:1412.8724, December 2014b.

Zhu, Ji and Hastie, Trevor J. Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3):427-443, 2004. doi: 10.1093/biostatistics/kxg046.

Zinkevich, Martin, Weimer, Markus, Smola, Alexander J., and Li, Lihong. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595-2603. Curran Associates, Inc., 2010.