# communicationefficient_sparse_regression__d8a54bf9.pdf Journal of Machine Learning Research 18 (2017) 1-30 Submitted 1/16; Revised 2/16; Published 1/17 Communication-efficient Sparse Regression Jason D. Lee Jason Lee@marshall.usc.edu Marshall School of Business University of Southern California Los Angeles, CA 90089 Qiang Liu qliu@cs.dartmouth.edu Department of Computer Science Dartmouth University Hanover, NH 02714 Yuekai Sun yuekai@umich.edu Department of Statistics University of Michigan Ann Arbor, MI 48109 Jonathan E. Taylor jonathan.taylor@stanford.edu Department of Statistics Stanford University Stanford, CA 94305 Editor: Zhihua Zhang We devise a communication-efficient approach to distributed sparse regression in the highdimensional setting. The key idea is to average debiased or desparsified lasso estimators. We show the approach converges at the same rate as the lasso as long as the dataset is not split across too many machines, and consistently estimates the support under weaker conditions than the lasso. On the computational side, we propose a new parallel and computationally-efficient algorithm to compute the approximate inverse covariance required in the debiasing approach, when the dataset is split across samples. We further extend the approach to generalized linear models. Keywords: Distributed Sparse Regression, Averaging, Debiasing, lasso, high-dimensional statistics 1. Introduction Explosive growth in the size of modern datasets has fueled interest in distributed statistical learning. For examples, we refer to Boyd et al. (2011); Dekel et al. (2012); Duchi et al. (2012); Zhang et al. (2013) and the references therein. The problem arises, for example, when working with datasets that are too large to fit on a single machine and must be distributed across multiple machines. The main bottleneck in the distributed setting is usually communication between machines/processors, so the overarching goal of algorithm design is to minimize communication costs. c 2017 Lee, Liu, Sun, and Taylor. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-002.html. In distributed statistical learning, the simplest and most popular approach is averaging: each machine forms a local estimator ˆθk with the portion of the data stored locally, and a master averages the local estimators to produce an aggregate estimator: θ = 1 m Pm k=1 ˆθk. Averaging was first studied by Mcdonald et al. (2009) for multinomial regression. They derive non-asymptotic error bounds on the estimation error that show averaging reduces the variance of the local estimators, but has no effect on the bias (from the centralized solution). In follow-up work, Zinkevich et al. (2010) studied a variant of averaging where each machine computes a local estimator with stochastic gradient descent (SGD) on a random subset of the dataset. They show, among other things, that their estimator converges to the centralized estimator. More recently, Zhang et al. (2013) studied averaged empirical risk minimization (ERM). They show that the mean squared error (MSE) of the averaged ERM decays like O N 1 2 + m N , where m is the number of machines and N is the total number of samples. Thus, so long as m N, the averaged ERM matches the N 1 2 convergence rate of the centralized ERM. Even more recently, Rosenblatt and Nadler (2014) studied the optimality of averaged ERM in two asymptotic settings: N , m, p fixed and p, n , p n µl (0, 1), where n = N m is the number of samples per machine. They show that in the n , p fixed setting, the averaged ERM is first-order equivalent to the centralized ERM. However, when p, n , the averaged ERM is suboptimal (versus the centralized ERM). We develop a divide and conquer approach to statistical learning. In the high-dimensional setting, regularization is essential. The key idea is to average debiased or desparsified regularized M-estimators. Under suitable conditions, it is possible to show that the local debiased estimators are asymptotically normal. thus the averaged estimator delivers the same statistical performance as the computationally infeasible centralized M-estimator. Formally, we show that the error of the averaged estimator decomposes into a e OP 1 asymptotically normal term and a remainder term. As long as m N s log p, where s is the sparsity of the unknown regression coefficients, the reminder term is asymptotically negligible. Thus the averaged estimator converges at the same rate as a centralized estimator. Further, the averaged estimator is model selection consistent under a weak minimum signal strength condition. In the following section, we review the theoretical properties of the lasso and debiased lasso and describe our contributions more formally. 2. A divide-and-conquer approach to sparse regression To keep things simple, we focus on sparse linear regression. Consider the sparse linear model y = Xβ + ϵ, where the rows of X Rn p are predictors, and the components of y Rn are the responses. To keep things simple, we assume (A1) the predictors x Rp are independent σx-subgaussian random vectors with whose covariance Σ has smallest eigenvalue σp(Σ) > λmin and largest eigenvalue σ1(Σ) < λmax; Communication-efficient Sparse Regression (A2) the regression coefficients β Rp are s-sparse, i.e. all but s components of β are zero; (A3) the components of the noise ϵ are independent, mean zero σy-subgaussian random variables. Given the predictors and responses, the lasso estimates β by ˆβ := arg min β Rp 1 2n y Xβ 2 2 + λ β 1. There is a well-developed theory of the lasso that says, under suitable assumptions on X, the lasso estimator ˆβ is nearly minimax optimal(e.g. see Hastie et al. (2015), Chapter 11 for an overview). More precisely, under some conditions on 1 n XT X, the MSE of the lasso estimator is roughly s log p n , which is the minimax rate. 2.1 Background on the lasso and debiasing However, the lasso estimator is also biased1. Since averaging only reduces variance, not bias, we gain (almost) nothing by averaging the biased lasso estimators. That is, it is possible to show if we naively averaged local lasso estimators, the MSE of the averaged estimator is of the same order as that of the local estimators. The key to overcoming the bias of the averaged lasso estimator is to debias the lasso estimators before averaging. The debiased lasso estimator by Javanmard and Montanari (2013a) is ˆβd := ˆβ + 1 n ˆΘXT (y X ˆβ), (1) where ˆβ is the lasso estimator and ˆΘ Rp p is an approximate inverse to ˆΣ = 1 n XT X. Intuitively, the debiased lasso estimator trades bias for variance. The trade-offis obvious when ˆΣ is non-singular: setting ˆΘ = ˆΣ 1 gives the ordinary least squares (OLS) estimator (XT X) 1XT y. Another way to interpret the debiased lasso estimator is a corrected estimator that compensates for the bias incurred by shrinkage. By the optimality conditions of the lasso, the correction term 1 n XT (y X ˆβ) is a subgradient of λ 1 at ˆβ. By adding a term proportional to the subgradient of the regularizer, the debiased lasso estimator compensates for the bias incurred by regularization. The debiased lasso estimator has previously been used to perform inference on the regression coefficients in high-dimensional regression models. We refer to the papers by Javanmard and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014); Belloni et al. (2011) for details. The choice of ˆΘ in the correction term is crucial to the performance of the debiased estimator. Javanmard and Montanari (2013a) suggest forming ˆΘ row by row: the j-th row of ˆΘ is the optimum of minimize θ Rp θT ˆΣθ subject to ˆΣθ ej δ. (2) 1. We refer to Section 2.2 in Javanmard and Montanari (2013a) for a more formal discussion of the bias of the lasso estimator. The parameter δ should large enough to keep the problem feasible, but as small as possible to keep the bias (of the debiased lasso estimator) small. As we shall see, when the rows of X are subgaussian, setting δ log p 2 is usually large enough to keep (2) feasible. Definition 1 (Generalized coherence) Given X Rn p, let ˆΣ = 1 n XT X. The generalized coherence between ˆΣ and Θ Rp p is GC(ˆΣ, Θ) = maxj [p] ˆΣΘT j ej . The preceding definition is a generalization of the usual notion of coherence as it appears in the compressed sensing literature. Assume the columns of X are normalized so that xj 2 = 1, and Θ = I. The diagonal entries of ˆΣΘ I vanish. Thus GC(ˆΣ, Θ) is the largest offdiagonal entry of ˆΣ, which is the largest inner product between columns of X: 1 n maxi =j e T i XT Xej . We recognize the preceding quantity as the coherence of X. Lemma 2 (Javanmard and Montanari (2013a)) Under (A1), when 16κσ4 xn > log p, the event EGC(ˆΣ) := n GC(ˆΣ, Σ 1) 8 c1 κσ2 x log p occurs with probability at least 1 2p 2 for some c1 > 0, where κ := λmax(Σ) λmin(Σ) is the condition number of Σ. As we shall see, the bias of the debiased lasso estimate is of higher order than its variance under suitable conditions on ˆΣ. In particular, we require ˆΣ to satisfy the restricted eigenvalue (RE) condition. Definition 3 (RE condition) For any S [p], let C(S) := { Rp | Sc 1 3 S 1}. We say ˆΣ satisfies the RE condition on the cone C(S) when T ˆΣ µl S 2 2 for some µl > 0 and any C(S). The RE condition requires ˆΣ to be positive definite on C(S). When the rows of X Rn p are i.i.d. Gaussian random vectors, Raskutti et al. (2010) show there are constants µ1, µ2 > 0 such that 1 n X 2 2 µ1 2 2 µ2 log p n 2 1 for any Rp with probability at least 1 c2 exp ( c2n) . Their result implies the RE condition holds on C(S) (for any S [p]) as long as n |S| log p, even when there are dependencies among the predictors. Their result was extended to subgaussian designs by Rudelson and Zhou (2013), also allowing for dependencies among the covariates. We state their result verbatim. Communication-efficient Sparse Regression Lemma 4 (Rudelson and Zhou (2013)) Under (A1), if n > 4000 sσ2 x log Cp s and p > s, where s := s + 25920κs = C s, the event ERE(X) = n T ˆΣ 1 2λmin(Σ) S 2 2 for any C(S) o occurs with probability at least 1 2e n 4000σ4x , where C and C are universal constants. Proof The lemma is a consequence of Rudelson and Zhou (2013), Theorem 6. In their notation, we set δ = 1 2, k0 = 3 and bound maxj [p] Aej 2 2 and K(s0, k0, Σ 1 2 ) by λmax(Σ) and λmin(Σ) 1 When the RE condition holds, the lasso and debiased lasso estimators are consistent for a suitable choice of the regularization parameter λ. The parameter λ should be large enough to dominate the empirical process part of the problem: 1 n XT y , but as small as possible to reduce the bias incurred by regularization. As we shall see, setting λ σy log p 2 is a good choice. Lemma 5 Under (A3), 1 n XT ϵ maxj [p](ˆΣj,j) 1 2 σy 3 log p with probability at least 1 ep 2 for any (non-random) X Rn p. When ˆΣ satisfies the RE condition and λ is large enough, Negahban et al. (2012) show that the lasso is consistent. Lemma 6 (Negahban et al. (2012)) If in addition to (A2) and (A3), 1. ˆΣ satisfies the RE condition on C(supp(β )) with constant µl 2. 1 n XT ϵ λ, µl sλ and ˆβ β 2 3 When the lasso estimator is consistent, the debiased lasso estimator is also consistent. Further, it is possible to show that the bias of the debiased estimator is of higher order than its variance. Similar results by Javanmard and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014); Belloni et al. (2011) are the key step in showing the asymptotic normality of the (components of) the debiased lasso estimator. The result we state is essentially Javanmard and Montanari (2013a), Theorem 2.3. Lemma 7 If in addition to (A2) and (A3), 1. ˆΣ satisfies the RE condition on C(supp(β )) with constant µl 2. 1 n XT ϵ λ, 3. (ˆΣ, ˆΘ) has generalized incoherence δ, the debiased lasso estimator has the form ˆβd = β + 1 n ˆΘXT ϵ + ˆ , Lemma 7, together with Lemmas 5 and 2, shows that the bias of the debiased lasso estimator is of higher order than its variance. In particular, setting λ and δ to be the order of the upper bounds on infΘ Rp p GC(ˆΣ, Θ) and 1 n XT ϵ given by Lemmas 5 and 2 gives a bias term ˆ that is OP s log p n . By comparison, the variance term 1 n ˆΘXT ϵ is the maximum of p subgaussian random variables with mean zero and variances of O(1), which 2 . Thus the bias term is of higher order than the variance term as long as n s2 log p. Corollary 8 If in addition to the conditions of Lemma 6, 1. (ˆΣ, ˆΘ) has generalized incoherence δ log p 2. λ = maxj [p](ˆΣj,j) 1 2 σy 3 log p δ maxj [p](ˆΣj,j) 1 2 µl σy s log p The rest of the paper is organized as follows. In the subsequent section, we describe a divide-and-conquer approach to sparse regression and derive its theoretical properties. We show that 1. the averaged estimator converges at the same rate as the communication-intensive centralized lasso estimator. In particular, a thresholded version of the averaged estimator attains the same estimation rate as the centralized lasso estimator βht β 2 s log p 2 , and only requires one round of communication. 2. a thresholded version of the averaged estimator is model selection consistent as long as the minimum signal strength is at least log p 2 . We remark that the model selection consistency result does not require X to obey an irrepresentability condition, which the centralized lasso does require. Although the divide-and-conquer approach is communication efficient, it is costly in terms of floating point operations. The parallel runtime of debiasing is roughly equivalent to the cost of evaluating p lasso estimators, due to computation of ˆΘ. In the rest of this section, we describe a more sophisticated approach to debiasing: each machine debiases p m, instead of all p, regression coefficients. Thus the parallel runtime of the more sophisticated approach is roughly m times smaller than that of a naive approach. In Section 4, we further refine the approach to reduce the sample complexity from n ms2 log p to n ms log p. In Section 6, we show via simulation experiments that Communication-efficient Sparse Regression the averaged debiased estimator outperforms averaging local lasso estimates, and performs as well as the centralized lasso. Section 5 generalizes our approach from least-squares to generalized linear models such as logistic regression. Finally in Section 7, we show the optimality of our estimator in terms of the amount of communication, and rounds of communication using recent work on communication lower bounds. We also provide a comparison of the average debiased estimator and the centralized lasso estimator. The parallel runtime of the averaging debiased estimator is only larger than the centralized lasso by a constant multiplicative factor. 2.2 Averaging debiased lassos Recall the problem setup: we are given N samples of the form (xi, yi) distributed across m machines: The k-th machine has local predictors Xk Rnk p and responses yk Rnk. To keep things simple, we assume the data is evenly distributed, i.e. n1 = = nk = n = N m. The averaged debiased lasso estimator (for lack of a better name) is k=1 ˆβd k = 1 ˆβk + ˆΘk XT k (yk Xk ˆβk) , (3) We begin by studying the error of the β in the ℓ norm. Lemma 9 Suppose the local sparse regression problem on each machine satisfies the conditions of Corollary 8, that is when m p, 1. {ˆΣk}k [m] satisfy the RE condition on C(supp(β )) with constant µl, 2. {(ˆΣk, ˆΘk)}k [m] have generalized incoherence c GC log p 3. λ1 = = λm = cΣσy 3 log p β β cσy cΩlog p µl σy s log p with probability at least 1 ep 1, where c > 0 is a universal constant, cΩ:= maxj [p], k [m]((ˆΘk ˆΣk ˆΘT k )j,j) and cΣ := maxj [p],k [m]((ˆΣk)j,j) 1 2 . Lemma 10 hints at the performance of the averaged debiased lasso. In particular, we note the first term is O log p 2 , which matches the convergence rate of the centralized estimator. When n is large enough, s log p n is negligible compared to log p 2 , and the error 2 . Next, we show the conditions of Lemma 10 occur with high probability when the rows of X are independent subgaussian random vectors. Lemma 10 Under (A1), (A2), and (A3), when m < p, p > s, 1. n > max 4000 sσ2 x log(Cp s ), 8000σ4 x log p, 3 c1 max{σ2 x, σx} log p , 2. λ1 = = λm = maxj [p],k [m]((ˆΣk j,j) 1 2 σy 3 log p 3. δ1 = = δm = 8 c1 κσ2 x log p 2 and form {ˆΘk}k [m] by (2), maxj [p] Σ 1 j,j log p 2 + κ maxj [p](Σj,j) 1 2 λmin(Σ) σ2 xσy s log p with probability at least 1 (8 + e)p 1 for some universal constant c > 0. The averaged debiased lasso has one serious drawback versus the lasso: β is usually dense. The density of β detracts from the intrepretability of the coefficients and makes the estimation error large in the ℓ2 and ℓ1 norms. To remedy both problems, we threshold the averaged debiased lasso: HTt( β) βj 1{| βj| t}, STt( β) sign( βj) max{| βj| t, 0}. As we shall see, both hard and soft-thresholding give sparse aggregates that are close to β in ℓ2 norm. Lemma 11 As long as t > β β , βht := HTt( β) satisfies 1. βht β 2t, 2. βht β 2 2 3. βht β 1 2 The analogous result also holds for βst := STt( β). Proof By the triangle inequality, βht β βht β + β β t + β β 2t. Since t > β β , βht j = 0 whenever β j = 0. Thus βht is s-sparse and βht β is 2s-sparse. By the equivalence between the ℓ and ℓ2, ℓ1 norms, The argument for βst is similar. By combining Lemma 11 with Lemma 10, we show that βht converges at the same rates as the centralized lasso. Communication-efficient Sparse Regression Theorem 12 Under the conditions of Lemma 10, hard-thresholding β at σy 4 maxj [p] Σ 1 j,j log p c2N 1 κ maxj [p](Σj,j) 1 2 λmin(Σ) σ2 xσy s log p 1. βht β P σy maxj [p] Σ 1 j,j log p N 1 κ maxj [p](Σj,j) 1 2 λmin(Σ) σ2 xσy s log p 2. βht β 2 P σy maxj [p] Σ 1 j,j s log p N 1 κ maxj [p](Σj,j) 1 2 λmin(Σ) σ2 xσy s 3 2 log p 3. βht β 1 P σy maxj [p] Σ 1 j,j s2 log p N 1 κ maxj [p](Σj,j) 1 2 λmin(Σ) σ2 xσy s2 log p Remark 13 By Theorem 12, when m n s2 log p, the variance term is dominant and the convergence rates given by the theorem simplify: 1. βht β P log p 2. βht β 2 P s log p 3. βht β 1 P s2 log p The convergence rates for the centralized lasso estimator ˆβ are identical (modulo constants): 1. ˆβ β P log p 2. ˆβ β 2 P s log p 3. ˆβ β 1 P s2 log p The estimator βht matches the convergence rates of the centralized lasso in ℓ1, ℓ2, and ℓ norms. Furthermore, βht can be evaluated in a communication-efficient manner by a one-shot averaging approach. Corollary 14 Under the conditions of Lemma 10, further assume 1. m n s2 log p, 2. β-min: |β j | log p 2 for any j supp(β ). Then supp( βht) = supp(β ). Proof As long as we threshold at t > βht β , supp( βht) supp(β ). That is, all the zero components of β are correctly estimated. Further, as long as the non-zero components of β have magnitude at least 2t, they are not set to zero by thresholding at t. By Theorem 12, there is such a t log p 3. A distributed approach to debiasing The averaged estimator we studied has the form k=1 ˆβk + ˆΘk XT k (y Xk ˆβk). The estimator requires each machine to form ˆΘk by the solution of (2). Since the dual of (2) is an ℓ1-regularized quadratic program: minimize γ Rp 1 2γT ˆΣkγ ˆΣkγ + δ γ 1 , (4) forming ˆΘk is (roughly speaking) p times as expensive as solving the local lasso problem, making it the most expensive step (in terms of floating point operations) of evaluating the averaged estimator. To trim the cost of the debiasing step, we consider an estimator that forms only a single ˆΘ : k=1 ˆβk + 1 k=1 XT k (y Xk ˆβk). (5) To evaluate (5), 1. each machine sends ˆβk and 1 n XT k (y Xk ˆβk) to a central server, 2. the central server forms 1 m Pm k=1 ˆβk and 1 N Pm k=1 XT k (y Xk ˆβk) and sends the averages to all the machines, 3. each machine, given the averages, forms p m rows of ˆΘ and debiases p m coefficients: k=1 ˆβj + ˆΘj, 1 k=1 XT k (y Xk ˆβk) , where ˆΘj, Rp is a row vector. As we shall see, each machine can perform debiasing with only the data stored locally. Thus, forming the estimator (5) requires two rounds of communication. The question that remains is how to form ˆΘj, . We consider an estimator proposed by van de Geer et al. (2013): nodewise regression on the predictors. For some j [p] that machine k is debiasing, the machine solves ˆγj := arg min γ Rp 1 1 2n Xk,j Xk, jγ 2 2 + λj γ 1, j [p], where Xk, j Rn (p 1) is Xk less its j-th column Xk,j. Implicitly, we are forming 1 ˆγ1,2 . . . ˆγ1,p ˆγ2,1 1 . . . ˆγ2,p ... ... ... ... ˆγp,1 ˆγp,2 . . . ˆγp,p Communication-efficient Sparse Regression where the components of ˆγj are indexed by k {1, . . . , j 1, j + 1, . . . , p}. We scale the rows of ˆC by diag ˆτ1, . . . , ˆτp , where n Xj X jˆγj 2 2 + λj ˆγj 1 1 to form ˆΘ = ˆT 2 ˆC. Each row of ˆΘ is given by ˆγj,1 . . . ˆγj,j 1 1 ˆγj,j+1 . . . ˆγj,p . (6) Since ˆγj and ˆτj only depend on Xk, they can be formed without any communication. Before we justify the choice of ˆΘ theoretically, we mention that it is an approximate inverse of ˆΣ (in a component-wise sense). By the optimality conditions of nodewise regression, n Xj X jˆγj 2 2 + λj ˆγj 1 n Xj X jˆγj 2 2 + 1 n(Xj X jˆγj)T XT jˆγj n Xj(Xj X jˆγ). Recalling the defintition of ˆΘ, we have 1 n ˆΘj, XT Xj = 1 1 n(Xj ˆγT j X j)T Xj = λj and 1 n ˆΘj, XT X j = 1 n(Xj ˆγT j X j)T X j λj for any j [ p ]. Thus max j [ p ] ˆΘj, ˆΣ ej λj ˆτ 2 j . (7) van de Geer et al. (2013) show that when the rows of X are i.i.d. subgaussian random vectors and the precision matrix Σ 1 is sparse, ˆΘj, converges to Σ 1 j at the usual convergence rate of the lasso. For completeness, we restate their result. We consider a sequence of regression problems indexed by the sample size N, dimension p, sparsity s0 that satisfies (A1), (A2), and (A3). As N grows to infinity, both p = p(N) and s = s(N) may also grow as a function of N. To keep notation manageable, we drop the index N. We further assume (A4) the covariance of the predictors (rows of X) has smallest eigenvalue λmin(Σ) Ω(1) and largest diagonal entry maxj [p] Σj,j O(1), (A5) the rows of Σ 1 are sparse: maxj [p] s2 j log p n o(1), where sj is the sparsity of Σ 1 j . Lemma 15 (van de Geer et al. (2013), Theorem 2.4) Under (A1) (A5), (6) with suit- able parameters λj log p ˆΘj, Σ 1 j 1 P 2 for any j [p]. We show that the averaged estimator (5) matches the convergence rate of the centralized lasso. Theorem 16 Under (A1) (A5), (5), where ˆΘ is given by (6), with suitable parameters λj, λk log p 2 , j [p], k [m] satisfies β β P log p 2 + smax log p where smax := max{s0, s1, . . . , sp}. Proof See the appendix. By combining the Lemma 11 with Theorem 16, we can show that βht := HT( β, t) for an appropriate threshold t converges to β at the same rates as the centralized lasso. Theorem 17 Under the conditions of Theorem 16, hard-thresholding β at t log p 2 + smax log p 1. βht β P log p 2 + smax log p 2. βht β 2 P s0 log p 2 + s0smax log p 3. βht β 1 P s2 0 log p 2 + s0smax log p Theorem 17 shows that for m n s2max log p, the variance term is dominant, so the convergence rates simplify: 1. βht β P log p 2. βht β 2 P smax log p 3. βht β 1 P s2 max log p Thus, estimator βht shares the advantages of βht over the centralized lasso (cf. Remark 13). It also achieves computational gains over βht by amortizing the cost of debiasing across m machines. Communication-efficient Sparse Regression 4. A sharper estimation result It is possible to obtain a sharper estimation result by forgoing the ℓ norm convergence rate. By sharper, we mean the sample complexity of the averaged estimator from m n s2 0 log p to m n s0 log p. The sharper estimation result depends on a result by Javanmard and Montanari (2013b), which we combine with Lemma 15 and restate for completeness. Before stating the results, we define the ( , l) norm of a point x Rp as x ( ,l) := max A [p],|A| l x A 2 When l = 1, the ( , l) norm of x is its ℓ norm. When l = p, the ( , l) norm is the ℓ2 norm (rescaled by 1 p). Thus the ( , l) norm interpolates between the ℓ2 and ℓ norms. Javanmard and Montanari (2013b), Theorem 2.3 shows that the bias of the debiased lasso is of order s0 log p Lemma 18 Under the conditions of Theorem 16, ˆ k ( ,c s0) P c s0 log p n for any k [m] for any c > 0, where c is a constant that depends only on c and Σ. By Lemma 18, the estimator (5) is consistent in the ( , s0) norm. The argument is similar to the proof of Theorem 16. Theorem 19 Under the conditions of Theorem 16, β β ( ,c s0) OP log p 2 + s0 log p Theorem 20 Under the conditions of Theorem 16, hard-thresholding β at t = | β|(ˆs0) for some ˆs0 s0, i.e. setting all but the largest ˆs0 debiased coefficients to zero, gives 1. βht β 2 P s0 log p 2 + s0 log p 2. βht β 1 P s2 0 log p 2 + s3/2 0 log p n . By Theorem 20, when m N s0 log p, the variance term is dominant and the convergence rates given by the theorem simplify to the convergence rates of the (centralized) lasso estimator: 1. βht β 2 P s0 log p 2. βht β 1 P s2 0 log p Thus, by forgoing estimation error in the ℓ norm, it is possible to reduce the sample complexity of the averaged estimator to m s0 log p N . When m = 1, we recover the sample complexity of the centralized lasso estimator. 5. Averaging debiased ℓ1 regularized M-estimators The distributed approach to debiasing extends readily to ℓ1 regularized M-estimators. As before, we are given N pairs (xi, yi) stored on m machines. Let ρ(yi, a) be a loss function function, which is convex in a, and ρ, ρ be its derivatives with respect to a. That is ρ(y, a) = d daρ(y, a), ρ(y, a) = d2 da2 ρ(y, a). We define ℓk(β) = 1 n Pn i=1 ρ(yi, x T i β), where the sum is only over the pairs on machine k. The averaged estimator is k=1 ˆβk + ˆΘ 1 k=1 ℓk(ˆβk) , (8) where ˆβk is the local ℓ1 regularized M-estimator: ˆβk := arg minβ Rp ℓk(β) + λk β 1. As before, we form ˆΘ by nodewise regression on the weighted design matrix Xˆβk := Wˆβk Xk, where Wˆβk is diagonal and its diagonal entries are i,i := ρ(yi, x T i ˆβk) 1 2 . That is, for some j [p] that machine k is debiasing, the machine solves ˆγj := arg min γ Rp 1 1 2n Xˆβk,j Xˆβk, jγ 2 2 + λj γ 1, j [p], and forms ˆΘj, = 1 ˆγj,1 . . . ˆγj,j 1 1 ˆγj,j+1 . . . ˆγj,p , where ˆτj = 1 n Xˆβk,j Xˆβk, jˆγj 2 2 + λj ˆγj 1 1 (B1) the pairs {(xi, yi)}i [N] are i.i.d.; the predictors are bounded: maxi [N] xi 1; the projection of Xβ ,j on R(Xβ , j) in the E 2ℓk(β ) inner product is bounded: Xβ , jγβ ,j 1 for any j [ p ], where γβ ,j := arg min γ Rp 1 E Xβ ,j Xβ , jγ 2 2 . (B2) the rows of E 2ℓk(β ) 1 are sparse: maxj [p] s2 j log p n o(1), where sj is the sparsity of E 2ℓk(β ) 1 (B3) the smallest eigenvalue of E 2ℓk(β ) is bounded away from zero and its entries are bounded. Communication-efficient Sparse Regression (B4) for any β such that β β 1 δ for some δ > 0, the diagonal entries of Wβ stays away from zero, and | ρ(y, x T β) ρ(y, x T β )| |x T (β β )|. (B5) we have 1 n Xk(ˆβk β ) 2 2 P s0λ2 k and ˆβk β 1 P s0λk. (B6) the derivatives ρ(y, a), ρ(y, a) is locally Lipschitz: maxi [N] sup|a,a x T i β | δ supy | ρ(y,a) ρ(y,a )| |a a | K for some δ > 0. maxi [N] supy | ρ(y, x T i β)| O(1), maxi [N] sup|a x T i β | δ supy | ρ(y, a)| O(1). (B7) the diagonal entries of E 2ℓk(β ) 1 E ℓk(β ) ℓk(β )T E 2ℓk(β ) 1 are bounded. The preceding assumptions deserve elaboration. Assumptions (B1), (B4), (B6), and (B7) are standard in the literature on high-dimensional regression. They ensure the various intermediate quantities, such as ρ(y, x T β) and its derivaties, remain bounded. Assumption (B2) is perhaps the most restrictive. The assumption serves to ensure that the debiasing step is effective in reducing the bias of the regularized estimator. It may be relaxed (at the cost of additional technicalities) to the rows of E 2ℓk(β ) 1 admit a sj-sparse approximation. We refer to B uhlmann and Van De Geer (2011) for the details. Assumption (B3) is a quantitative version of the usual rank condition in regression. It ensures the regression coefficients are identifiable in the limit. Assumption (B5) is not necessary; it is implied by the other assumptions. We refer to B uhlmann and Van De Geer (2011), Chapter 6 for the details. Here we state it as an assumption to simplify the exposition. We are ready to state our main results concerning the averaged estimator(8). It shows the averaged estimator achieves the convergence rate of the centralized ℓ1-regularized Mestimator. Theorem 21 Under (B1) (B7), (8) with suitable parameters λj, λk log p 2 , j [p], k [m] satisfies β β P log p 2 + smax log p where smax := max{s0, s1, . . . , sp}. Proof The averaged estimator is given by k=1 ˆβk ˆΘ ℓk(ˆβk)(ˆβk β ) β . By the smoothness of ρ, ρ(yi, x T i ˆβk) = ρ(yi, x T i β ) + ρ(yi, ai)x T i (ˆβk β ), where ai is a point between x T i ˆβk and x T i β . Thus k=1 ˆβk ˆΘ( ℓk(β ) + Qk(ˆβk β )) β m Pm k=1 ℓk(β ) + 1 I ˆΘQk (ˆβk β ). where Qk = 1 n Pn i=1 ρ(yi, ai)xix T i , where the sum is over the data points on machine k. Taking norms, we obtain m Pm k=1 ℓk(β ) + 1 I ˆΘQk (ˆβk β ) . It is possible to show that ˆΘ 1 m Pm k=1 ℓk(β ) P log p 2 , which corresponds to the first term in (9). We refer to B uhlmann and Van De Geer (2011), Chapter 6 for the details. We turn our attention to the second term. By the triangle inequality, (I ˆΘQk)(ˆβk β ) I ˆΘ 2ℓk(ˆβk) (ˆβk β ) + ˆΘ( 2ℓk(ˆβk) Qk)(ˆβk β ) maxj [p] e T j ˆΘj, 2ℓk(ˆβk) ˆβk β 1 i=1 ˆΘxi ρ(yi, x T i ˆβk) ρ(yi, ai)x T i (ˆβk β ) . We proceed term by term. By (7), maxj [p] e T j ˆΘj, 2ℓk(ˆβk) λj By van de Geer et al. (2013), Theorem 3.2, |ˆτ 2 j τ 2 j | P max{s0, sj} log p Thus maxj [p] e T j ˆΘj, 2ℓk(ˆβk) P log p 2 and, by (B5), maxj [p] e T j ˆΘj, 2ℓk(ˆβk) ˆβk β 1 P smax log p Communication-efficient Sparse Regression We turn our attention to the second term. We have ˆΘxi P 1 because ˆΘxi maxj [p] ˆΘj, XT k maxj [p] ˆΘj, XT k,β maxj [p] 1 ˆτ 2 j (Xk,β )j (Xk,β ) jˆγj . Again, by van de Geer et al. (2013), Theorem 3.2, P maxj [p] 1 τ 2 j (Xk,β )j (Xk,β ) jˆγj P maxj [p] 1 τ 2 j (Xk,β )j (Xk,β ) jγj τ 2 j (Xk,β )j (ˆγj γj) 1. which, by (B1) and van de Geer et al. (2013), Theorem 3.2, P 1 + sj log p i=1 ˆΘxi ρ(yi, x T i ˆβk) ρ(yi, ai)x T i (ˆβk β ) ρ(yi, x T i ˆβk) ρ(yi, ai)x T i (ˆβk β ) , which, by (B5) and (B6), is at most n Xk(ˆβk β ) 2 2 P s0 log p We put the pieces together to deduce 1 m Pm k=1 I ˆΘQk (ˆβk β ) P smax log p By combining the Lemma 11 with Theorem 16, we can show that βht := HT( β, t) for an appropriate threshold t converges to β at the same rates as the centralized ℓ1-regularized M-estimator. Theorem 22 Under the conditions of Theorem 21, hard-thresholding β at t log p 2 + maxj [p] sj log p 1. βht β P log p 2 + smax log p 2. βht β 2 P s0 log p 2 + s0smax log p 1 2 3 4 5 x 10 1 2 3 4 5 x 10 Global Lasso Naive Avg Avg Debiased Total Number of Samples (nk) Total Number of Samples (nk) (Σ = I, p = 104, n = 5 103) (Σij = 0.5|i j|, p = 104, n = 5 103) log10 ℓ Error log10 ℓ Error Figure 1: The estimation error (in ℓ norm) of the averaged debiased lasso estimator versus that of the centralized lasso when the predictors are Gaussian. In both settings, the estimation error of the averaged debiased estimator is comparable to that of the centralized lasso, while that of the naive averaged lasso is much worse. 3. βht β 1 P s2 0 log p 2 + s0 maxj [p] sj log p Assuming s0 smax, Theorem 22 shows when m n s2 0 log p, the variance term is dominant, so the convergence rates simplify to 1. βht β P log p 2. βht β 2 P s0 log p 3. βht β 1 P s2 0 log p 6. Simulations We validate our theoretical results with simulations. First, we study the estimation error of the averaged debiased lasso in ℓ norm. To focus on the effect of averaging, we grow the number of machines m linearly with the (total) sample size N. In other words, we fix the sample size per machine n and grow the total sample size N by adding machines. The tuning parameters were set to their oracle values stated in the Theorem 12. Figure 1 compares the estimation error (in ℓ norm) of the averaged debiased lasso estimator with that of the centralized lasso. We see the estimation error of the averaged debiased lasso estimator is comparable to that of the centralized lasso, while that of the naive averaged lasso is much worse. We conduct a second set of simulations to study the effect of the number of machines on the estimation effort of the averaged estimator. To focus on the effect of the number of Communication-efficient Sparse Regression 5 10 15 20 2.8 10 20 30 40 50 Global Lasso Naive Avg Avg Debiased Number of Machines (k) Number of Machines (k) (Σ = I, p = 104, nk = 2 105) (Σij = 0.5|i j|, p = 104, nk = 2 105) log10 ℓ Error log10 ℓ Error Figure 2: The estimation error (in ℓ norm) of the averaged estimator as the number of machines k vary. When the number of machines is small, the error is comparable to that of the centralized lasso. However, when the number of machines exceeds a certain threshold, the bias term (which grows linearly in k) is dominant, and the performance of the averaged estimator degrades. machines k, we fix the (total) sample size N and vary the number of machines the samples are distributed across. The tuning parameters were again set to the oracle values stated in the Theorem 12. Figure 2 shows how the estimation error (in ℓ norm) of the averaged estimator grows as the number of machines grows. When the number of machines is small, the estimation error of the averaged estimator is comparable to that of the centralized lasso. However, when the number of machines exceeds a certain threshold, the estimation error grows with the number of machines. This transition occurs when s log p or equivalently, when k N s2 log p 1 2 . The preceding observation is consistent with the prediction of Lemma 10: when the number of machines exceeds a certain threshold, the bias term of order s log p n becomes dominant. Since s log p n k, we expect the error to grow linearly with k, which agrees with the trends in Figure 2. We conduct a third set of simulations to study the effect of thresholding on the estimation error in ℓ2 norm. The tuning parameters were set to the oracle values stated in the Theorem 12. Figure 3 compares the estimation error incurred by the averaged estimator with and without thresholding versus that of the centralized lasso. Since the averaged estimator is usually dense, its estimation error (in ℓ2 norm) is large compared to that of the centralized lasso. However, after thresholding, the averaged estimator performs comparably versus the centralized lasso. This demonstrates the importance of the thresholding step to achieve low ℓ2 error. In practice, it is possible to set the tuning parameter δ by the bisection method in the accompanying code to Javanmard and Montanari (2013a); via bisection, they search for the smallest δ such that the optimization program in (2) is feasible. The lasso tuning parameter λ is set by first estimating the noise variance using the residuals and then using the formula 1 2 3 4 5 x 10 1 2 3 4 5 x 10 1.5 Global Lasso Avg Debiased Ht Avg Debiased Total Number of Samples (nk) Total Number of Samples (nk) (Σ = I, p = 104, n = 5 103) (Σij = 0.5|i j|, p = 104, n = 5 103) log10 ℓ2 Error log10 ℓ2 Error Figure 3: The estimation error (in ℓ2 norm) of the averaged estimator with and without thresholding versus that of the centralized lasso when the predictors are Gaussian. In both settings, thresholding reduces the estimation error by order(s) of magnitude. Although the estimation error of the averaged estimator is large compared to that of the centralized lasso, the thresholded averaged estimator performs comparably, or even better than, the centralized lasso. λ = σ 2 log p. The parameter λ can be chosen independently of σ, if we replace the lasso with the sqrt-lasso Belloni et al. (2011); all of the same theoretical guarantees still apply, since the sqrt-lasso has the same consistency guarantees. For generalized linear models, the oracle choice of λ depends on known quantities Negahban et al. (2012). 7. Summary and discussion We devised a communication-efficient approach to distributed sparse regression in the highdimensional setting. The key idea is first debiasing local lasso estimators, and then averaging the debiased estimators. We show that as long as the data is not split across too many machines, the averaged estimator achieves the convergence rate of the centralized lasso estimator. In the appendix, we show that by foregoing consistency in the ℓ norm, it is possible to further reduce the sample complexity of the averaged estimator to that of the centralized lasso estimator. Further, the distributed approach to debiasing extends readily to other ℓ1 regularized M-estimators. In concurrent work, the approach of averaging debiased M-estimators was proposed by Battey et al. (2015) for high-dimensional inference. 7.1 Communication and Computational complexity In recent years, there has a been a flurry of work on establishing communication lower bounds for mean estimation in the Gaussian distribution. In other words, they establish the minimum communication C needed to obtain ℓ2 2 risk R , where ˆβ β 2 2 R (Duchi et al., 2014; Garg et al., 2014). These results are not directly applicable to sparse linear Communication-efficient Sparse Regression regression, since they do not impose sparsity on the mean. In Braverman et al. (2015), the authors established that to obtain risk R s log p N at least Ω m min(n,p) log p bits of com- munication is required. Our approach communicates O(mp) bits to achieve risk of s log p N , so is communication-optimal when p n. To our knowledge, lowest known communication complexity for solving the lasso is at least mp σmax(Σ) σmin(Σ) log 1 ϵ, for any desired accuracy n (Agarwal et al., 2012). This communication cost is larger than our algorithm by a multiplicative factor of σmax(Σ) σmin(Σ) log 1 ϵ, which is substantially larger when the problem is poorly conditioned. In light of the fact that our approach is essentially optimal in terms of communication cost, we turn to the computational complexity of our method. The most intensive step of our approach is the computation of ˆΘ. To compute one row of ˆΘ requires solving an optimization problem whose cost is equivalent to a lasso in dimension p with n samples. For the purpose of comparison, let us assume that the lasso solver performs T iterations. Thus the complexity of solving a lasso in dimension p with n samples is O(np T). In the simple approach of Section 2 where each machine computes its own ˆΘ, the parallel runtime is O(np2T). However using the approach of Section 3, each machine is only computing p/m rows of ˆΘ. This brings down the parallel runtime to O(np2T m ). In comparison, the cost of solving the lasso using a state-of-art optimization method such as Agarwal et al. (2012) has parallel runtime O(np T), so our computational cost is larger by a factor of O( p m). It is reasonable to think of p m as a constant, since as the number of variables p increases the dataset sizes increases and we will be forced to use a larger cluster size m due to memory constraints on a single machine. Although our computational complexity is larger in the distributed setting, the dominant factor is often bottlenecked by the communication and latency limitations, rather than local computation. Appendix A. Proofs of Lemmas Proof [Proof of Lemma 2] Let zi = Σ 1 2 xi. The generalized coherence between X and Σ 1 is given by |||Σ 1 ˆΣ I||| = ||| 1 2 zi)(Σ 1 2 zi)T I||| , where |||X||| is the maximum entry of a X. Each entry of 1 n Pn i=1(Σ 1 2 zi)(Σ 1 2 zi)T I is a sum of independent subexponential random variables. Their subexponential norms are bounded by 2 zi)j(Σ 1 2 zi)k 1{j = k} ψ1 2 (Σ 1 2 zi)j(Σ 1 2 zi)k ψ1, where X ψ1 and Y ψ2 are the sub-exponential and sub-Gaussian norms of the random variables X and Y . Recall for any two subgaussian random variables X, Y, we have XY ψ1 2 X ψ2 Y ψ2 . 2 zi)j(Σ 1 2 zi)k δj,k ψ1 4 (Σ 1 2 zi)j ψ2 (Σ 1 2 zi)k ψ2 4 κσ2 x, where σx = zi ψ2. By a Bernstein-type inequality, 2 zi)j(Σ 1 2 zi)k δj,k t 2e c1 min{ nt2 where c1 > 0 is a universal constant and σ2 x := 4 κσ2 x. Since σ4 xn > log p, we set t = 2 σ2 x c1 log p 2 to obtain 2 zi)j(Σ 1 2 zi)k δj,k 2 σ2 x c1 We obtain the stated result by taking a union bound over the p2 entries of 1 n Pn i=1(Σ 1 2 zi)(Σ 1 2 zi)T I. Proof [Proof of Lemma 5] By Vershynin (2010), Proposition 5.10, n|x T j ϵ| > t e exp c2n2t2 σ2y x T j 2 2 e exp c2n2t2 σ2y maxj [p] ˆΣj,j We take a union bound over the p components of 1 n XT ϵ to obtain n XT ϵ > t e exp c2n2t2 σ2y maxj [p] ˆΣj,j + log p . We set t = maxj [p] ˆΣ 1 2 j,jσy 3 log p 2 to obtain the desired conclusion. Proof [Proof of Lemma 7] We start by substituting the linear model into (1): ˆβd = ˆβ + 1 n ˆΘXT (y X ˆβ) = β + ˆΘˆΣ(β ˆβ) + 1 By adding and subtracting ˆ = β ˆβ, we obtain ˆβd = β + 1 n ˆΘXT (y X ˆβ) = β + (ˆΘˆΣ I)(β ˆβ) + 1 We obtain the expression of ˆβd by setting ˆ . To show ˆ 3δ µ sλ, we apply H older s inequality to each component of ˆ to obtain |(ˆΘˆΣ I)(β ˆβ)| maxj ˆΣm T j ej ˆβ β 1 δ ˆβ β 1, (10) where δ is the generalized incoherence between X and ˆΘ. By Lemma 6, ˆβ β 1 3 We combine the bound on ˆβ β 1 with (10) to obtain the stated bound on ˆ . Communication-efficient Sparse Regression Proof [Proof of Lemma 10] By Lemma 7, k=1 ˆΘk XT k ϵk + 1 We take norms to obtain k=1 ˆΘk XT k ϵk + 1 We focus on bounding the first term. Let a T j := e T j ˆΘ1XT 1 . . . ˆΘm XT m . By Vershynin (2010), Proposition 5.10, N a T j ϵ > t e exp c2N2t2 for some universal constant c2 > 0. Further, k=1 Xk ˆΘT k ej 2 2 = n ˆΘk ˆΣk ˆΘT k where cΩ:= maxj [p],k [m] ˆΘk ˆΣk ˆΘT k j,j. By a union bound over j [p], Pr maxj [p] 1 N a T j ϵ > t e exp c2Nt2 cΩσ2y + log p . We set t = σy 2cΩlog p 2 to deduce Pr maxj [p] 1 N a T j ϵ σy 2cΩlog p We turn our attention to bounding the second term. By Lemma 5 and a union bound over j [p], when we set λ1 = = λm = λ := maxj [p],k [m]((ˆΣk j,j) 1 2 σy 3 log p n XT k ϵ λ for any k [m] with probability at least 1 em p2 1 ep 1. By Lemma 7, when 1. {ˆΣk}k [m] satisfy the RE condition on C(supp(β )) with constant µl, 2. {(ˆΣk, ˆΘk)}k [m] have generalized incoherence c GC log p the second term is at most 3 3 c2 c GCcΣ µl σy s log p n . We put the pieces together to obtain β β σy 2cΩlog p µl σy s log p Proof [Proof of Lemma 10] We start with the conclusion of Lemma 10: β β σy 2cΩlog p µl σy s log p First, we show that the two constants cΩ= maxj [p], k [m](ˆΘk ˆΣk ˆΘT k )j,j and cΣ := maxj [p],k [m]((ˆΣk)j,j) 1 2 are bounded with high probability. Lemma 23 Under (A1), Pr maxj [p] Σ 1 j ˆΣΣ 1 j > 2 maxj [p] Σ 1 j,j 2pe c1 min{ n for some universal constant c1 > 0. Proof [Proof of Lemma 23] We express Σ 1 j, ˆΣΣ 1 j, = Σ 1 j, ˆΣΣ 1 j, Σ 1 j,j + Σ 1 j,j = 1 i=1 (x T i Σ 1 ,j )2 Σ 1 j,j + Σ 1 j,j . Since the subgaussian norm of zi = Σ 1 2 xi is σx, x T i Σ 1 ,j is also subgaussian with subgaussian norm bounded by x T i Σ 1 ,j ψ2 zi ψ2 Σ 1 2 ,j 2 σx(Σ 1 j,j ) 1 2 . We recognize 1 n Pn i=1(x T i Σ 1 ,j )2 Σ 1 j,j as a sum of i.i.d. subexponential random variables with subexponential norm bounded by (x T i Σ 1 ,j )2 Σ 1 j,j ψ1 2 (x T i Σ 1 ,j )2 ψ1 4 x T i Σ 1 ,j 2 ψ2 4σ2 xΣ 1 j,j . By Vershynin (2010), Proposition 5.16, we have i=1 (x T i Σ 1 ,j )2 Σ 1 j,j > t 2e c1 min{ nt2 16σ2x(Σ 1 j,j )2 , nt 4σxΣ 1 j,j } for some absolute constant c1 > 0. For t = Σ 1 j,j , the bound simplifies to i=1 (x T i Σ 1 ,j )2 Σ 1 j,j > Σ 1 j,j 2e c1 min{ n 16σ2x , n 4σx }. We take a union bound over j [p] to obtain the stated result. Since we form {ˆΘk}k [m] by (2), (ˆΘk ˆΣk ˆΘT k )j,j maxj [p](Σ 1 ˆΣkΣ 1))j,j. Lemma 23 implies maxj [p](Σ 1 ˆΣkΣ 1))j,j 2 maxj [p] Σ 1 j,j for each k [m] with probability at least 1 2pe c1 min{ n Communication-efficient Sparse Regression Lemma 24 Under (A1), Pr(maxj [p](ˆΣj,j) 1 2 > 2 maxj [p](Σj,j) 1 2 ) 2pe c1 min{ n 16σ2x , n 4σx } for some universal constant c1 > 0. We put the pieces together to obtain the stated result: 1. By Lemma 23 (and a union bound over k [m]), Pr(cΩ 2 maxj Σ 1 j,j ) 2mpe c1 min{ n Since m p, when n > 3 c1 max{σ2 x, σx} log p, Pr cΩ< 2 max j Σ 1 j,j 1 2p 1. 2. By Lemma 24 (and a union bound over k [m]), 2 maxj [p](Σj,j) 1 2 ) 1 2mpe c1 min{ n 16σ2x , n 4σx }. c1 max{σ2 x, σx} log p, the right side is again at most 2p 1. 3. By Lemma 4, as long as n > max{4000 sσ2 x log(Cp s ), 8000σ4 x log p}, ˆΣ1, . . . , ˆΣm all satisfy the RE condition with probability at least 1 2me n 4000σ4x 1 2p 1. 4. By Lemma 2, Pr k [m]EGC(ˆΣk) 1 2p 2. Since m < p, the probability is at least 1 2p 1. We apply the bounds cΩ 2 maxj [p] Σ 1 j,j , cΣ 2 maxj [p](Σj,j) 1 2 , c GC = 8 c1 κσ2 x, and 2λmin(Σ) to obtain 4 maxj [p] Σ 1 j,j log p κ maxj [p](Σj,j) 1 2 λmin(Σ) σ2 xσy s log p Proof [Proof of Lemma 24] We follow a similar argument as the proof of Lemma 23: ˆΣk;j,j = ˆΣj,j = ˆΣj,j Σj,j + Σj,j = 1 i=1 x2 i,j Σj,j + Σj,j. Since the zi = Σ 1 2 xi is subgaussian with subgaussian norm σx, xi,j is also subgaussian with subgaussian norm bounded by 1 2 j, zi ψ2 σx(Σj,j) 1 2 . We recognize ˆΣj,j Σj,j = 1 n Pn i=1 x2 i,j Σj,j as a sum of i.i.d. subexponential random variables with subexponential norm bounded by ˆΣj,j Σj,j ψ1 2 x2 i,j ψ1 4 xi,j 2 ψ2 4σ2 xΣj,j. By Vershynin (2010), Proposition 5.16, we have Pr(|ˆΣj,j Σj,j| > t) 2e c1 min{ nt2 16σ2xΣ2 j,j , nt σxΣj,j } for some absolute constant c1 > 0. For t = Σj,j, the bound simplifies to Pr(|ˆΣj,j Σj,j| > Σj,j) 2e c1 min{ n 16σ2x , n 4σx }. We take a union bound over j [p] to obtain the stated result. Proof [Proof of Theorem 16] We start by substituting the linear model into (5): k=1 ˆβk ˆΘˆΣk(ˆβk β ) + 1 n ˆΘXT k ϵk k=1 ˆβk ˆΘˆΣk(ˆβk β ) + 1 Subtracting β and taking norms, we obtain k=1 (I ˆΘˆΣk)(ˆβk β ) + 1 N ˆΘXT ϵ . (11) By Vershynin (2010), Proposition 5.16, and Lemma (23), it is possible to show that N ˆΘXT ϵ P log p We turn our attention to the first term in (11). It s straightforward to see each term in the sum is bounded by (I ˆΘˆΣk)(ˆβk β ) (I Σ 1 ˆΣk)(ˆβk β ) + (Σ 1 ˆΘ)ˆΣk(ˆβk β ) maxj [p] e T j Σ 1 j ˆΣk ˆβk β 1 + Σ 1 j ˆΘj, 1 ˆΣk(ˆβk β ) . We put the pieces together to deduce each term is O smax log p Communication-efficient Sparse Regression 1. By Lemmas 4, 6, 24, ˆβk β 1 P s0λk. 2. By Lemma 15, Σ 1 j ˆΘj, 1 P sj log p 3. By the triangle inequality, ˆΣk(ˆβk β ) 1 n XT k (yk Xk ˆβk) + 1 n XT k ϵk . By the optimality conditions of the (local) lasso estimators, the first term is λk, and it is possible to show, by Lemma 23 and Vershynin (2010), Proposition 5.16, that the second term is OP log p Since λk log p 2 , by a union bound over k [m], we obtain β β OP log p 2 + smax log p where smax := max{s0, s1, . . . , sp}. Proof [Proof of Lemma 18] The result is essentially Javanmard and Montanari (2013b), Theorem 2.3 with ˆΩ= ˆΘ given by (6). Lemma 15 shows that maxj [p] ˆΘj, Σ 1 j 1 P sj log p Since maxj [p] s2 j log p n o(1), ˆΘ satisfies the conditions of Javanmard and Montanari (2013b), Theorem 2.3: ˆ k ( ,c s0) P c s0 log p n for any k [m], The bound is uniform in k [m] by a union bound for suitable parameters λk log p Proof [Proof of Theorem 19] We start by substituting the linear model into (5): k=1 ˆ k + 1 Subtracting β and taking norms, we obtain β β ( ,c s0) 1 k=1 ˆ k ( ,c s0) + 1 N ˆΘXT ϵ ( ,c s0). (12) By Lemma 18, the first (bias) term is of order c s0 log p n . We focus on showing the second (variance) term is of order log p 2 . Since the ( , l) norm is non-increasing in l, N ˆΘXT ϵ ( ,c s0) 1 By Vershynin (2010), Proposition 5.16 and Lemma 23, it is possible to show that N ˆΘXT ϵ OP log p Thus the second term in (12) is of order log p 2 . We put all the pieces together to obtain the stated conclusion. Proof [Proof of Theorem 20] Since βht β is 2s0-sparse, βht β 2 2 s0 βht β 2 ( ,c s0) or, equivalently, βht β 2 s0 βht β ( ,c s0). By the triangle inequality, βht β ( ,c s0) βht β ( ,c s0) + β β ( ,c s0) 2 β β ( ,c s0), where the second inequality is by the fact that thresholding at t = | β|(c s0) minimizes β β ( ,c s0) over c s0-sparse points β. Thus βht β 2 = OP s0 log p 2 + s0 log p To complete the proof, we observe that the estimation error bound of βht in the ℓ1 norm follows by the fact that βht β is 2s0-sparse. Alekh Agarwal, Sahand Negahban, Martin J Wainwright, et al. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40 (5):2452 2482, 2012. Heather Battey, Jianqing Fan, Han Liu, and Junwei Lu. Splitotic analysis for distributed estimation and hypothesis testing. preprint (personal communication), 2015. Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference for highdimensional sparse econometric models. ar Xiv preprint ar Xiv:1201.0220, 2011. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2011. Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. ar Xiv preprint ar Xiv:1506.07216, 2015. Communication-efficient Sparse Regression Peter B uhlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer, 2011. Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1):165 202, 2012. John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. Automatic Control, IEEE Transactions on, 57(3):592 606, 2012. John C Duchi, Michael I Jordan, Martin J Wainwright, and Yuchen Zhang. Optimality guarantees for distributed statistical estimation. ar Xiv preprint ar Xiv:1405.0782, 2014. Ankit Garg, Tengyu Ma, and Huy L Nguyen. Lower bound for high-dimensional statistical learning problem via direct-sum theorem. ar Xiv preprint ar Xiv:1405.1665, 2014. Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and its generalizations. CRC Press, 2015. Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. ar Xiv preprint ar Xiv:1306.3171, 2013a. Adel Javanmard and Andrea Montanari. Nearly optimal sample size in hypothesis testing for high-dimensional regression. ar Xiv preprint ar Xiv:1311.0274, 2013b. Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pages 1231 1239, 2009. Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science, 27(4):538 557, 2012. Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. J. Mach. Learn. Res., 11:2241 2259, 2010. Jonathan Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. ar Xiv preprint ar Xiv:1407.2724, 2014. Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. Information Theory, IEEE Transactions on, 59(6):3434 3447, 2013. Sara van de Geer, Peter B uhlmann, Ya acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. ar Xiv preprint ar Xiv:1303.0518, 2013. Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217 242, 2014. Yuchen Zhang, John C Duchi, and Martin J Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321 3363, 2013. Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595 2603, 2010.