# distributed_sparse_regression_via_penalization__f0bad669.pdf

Journal of Machine Learning Research 24 (2023) 1-62 Submitted 11/21; Revised 6/23; Published 9/23

Distributed Sparse Regression via Penalization

Yao Ji jiyao@purdue.edu School of Industrial Engineering Purdue University West Lafayette, IN 47906, USA

Gesualdo Scutari gscutari@purdue.edu School of Industrial Engineering Purdue University West Lafayette, IN 47906, USA

Ying Sun ybs5190@psu.edu School of Electrical Engineering and Computer Science The Pennsylvania State University State College, PA 16802, USA

Harsha Honnappa honnappa@purdue.edu School of Industrial Engineering Purdue University West Lafayette, IN 47906, USA

Editor: Francesco Orabona

We study sparse linear regression over a network of agents, modeled as an undirected graph (with no centralized node). The estimation problem is formulated as the minimization of the sum of the local LASSO loss functions plus a quadratic penalty of the consensus constraint the latter being instrumental to obtain distributed solution methods. While penalty-based consensus methods have been extensively studied in the optimization literature, their statistical and computational guarantees in the high dimensional setting remain unclear. This work provides an answer to this open problem. Our contribution is two-fold. First, we establish statistical consistency of the estimator: under a suitable choice of the penalty parameter, the optimal solution of the penalized problem achieves near optimal minimax rate O(s log d/N) in ℓ2-loss, where s is the sparsity value, d is the ambient dimension, and N is the total sample size in the network this matches centralized sample rates. Second, we show that the proximal-gradient algorithm applied to the penalized problem, which naturally leads to distributed implementations, converges linearly up to a tolerance of the order of the centralized statistical error the rate scales as O(d), revealing an unavoidable speed-accuracy dilemma. Numerical results demonstrate the tightness of the derived sample rate and convergence rate scalings.

Keywords: distributed optimization, penalization, high-dimension statistics, linear convergence, sparse linear regression

. Equal contribution.

c 2023 Yao Ji, Gesualdo Scutari, Ying Sun and Harsha Honnappa.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1333.html.

Ji, Scutari, Sun and Honnappa

1. Introduction

We study high-dimensional sparse estimation over a network of m agents, modeled as an undirected graph. No centralized agent is assumed in the network; agents can communicate only with their immediate neighbors. Each agent i owns a data set (yi, Xi), generated according to the linear model yi = Xiθ + wi, (1)

where yi Rn is the vector of n observations, Xi Rn d is the design matrix, wi Rn

is observation noise, and θ Rd is the unknown s-sparse parameter common to all local models. In the high-dimensional setting, as postulated here, the ambient dimension d is larger than the total sample size N = n m and s d. A standard approach to estimate θ from {(yi, Xi)}m i=1 is to solve the LASSO problem, whose Lagrangian form reads

ˆθ arg min θ Rd 1 m

1 2n yi Xiθ 2 + λ θ 1, (2)

where λ > 0 controls the sparsity of the solution ˆθ. Since the objective function involves the entire data set {(yi, Xi)}m i=1 across the network, and routing local data to other agents is infeasible (e.g., due to privacy issues) or highly ineﬃcient, Problem (2) cannot be solved by each agent i independently. This calls for the design of distributed algorithms whereby agents alternate computations, based on available local information, with communications with neighboring nodes. To this end, a widely adopted approach is to decompose (2) by introducing local estimates θi s of the common variable θ, each one controlled by one agent, and forcing consensus among the agents (e.g., Nedić et al. 2018):

min θ Rmd 1 m

1 2n yi Xiθi 2 + λ

m θ 1, subject to V θ = 0, (3)

where θ = [θ 1 , . . . , θ m] is the stack vector of all the local copies θi s, and V is a positive semideﬁnite consensus-enforcing matrix, i.e., V θ = 0 if and only if all θi s are equal. The objective function in (3) is now (additively) separable in the agents variables; however, there is still a coupling across the θi s, due to the consensus constraint V θ = 0. To resolve this coupling, a widely adopted strategy in the literature of distributed optimization is to employ an inexact penalization of the constraint via a quadratic function. This leads to the following relaxed formulation:

ˆθ arg min θ Rmd 1 m

1 2n yi Xiθi 2 + 1 2mγ θ 2 V + λ

where θ 2 V θ V θ, and γ > 0 is a free parameter controlling the violation of the consensus constraint V θ = 0. Invoking standard results of penalty methods (see, e.g., Nesterov et al. (2018)), it is not diﬃcult to check that, when γ 0, every limit point of the resulting sequence ˆθ = ˆθ(γ) is a solution of (3). This justiﬁes the use of (4) as an approximation of (3) (for suﬃciently small γ).

Distributed Sparse Regression via Penalization

Problem (4) unlocks distributed solution methods. Here, we consider the proximal gradient algorithm (Nesterov et al., 2018) that, based upon a suitable choice of the matrix V , is readily implementable over the network. This resembles the renowned Distributed Gradient Descent algorithms (DGD) (see Section 1.3), which are among the most studied distributed schemes in the literature. Motivated by the popularity of the penalized formulation (4) and associated DGD algorithms, the goal of this paper is to study the statistical properties of the estimator (4) as well as computational guarantees of the aforementioned DGD algorithm.

centralized estimation error

local estimation error

0 1,000 2,000 3,000 4,000 5,000

Average Estimation Error

γ = 0.3 γ = 0.03 γ = 0.006

Figure 1: Proximal gradient in the high-dimensional setting (4): linear convergence up to some tolerance; diﬀerent curves refer to diﬀerent values of the penalty parameter γ. Notice the speed-accuracy dilemma.

1.1 Challenges and open problems

While penalty-based formulations like (4) and related solution methods have been extensively studied in the optimization literature, the statistical properties of the solution ˆθ in the high-dimensional setting (d N) remain unknown, and so are the convergence guarantees of the proximal gradient algorithm applied to (4). Postponing to Section 1.3 a detailed review of the literature, here we point out the following. Statistics: classical sample complexity analysis of LASSO error ˆθ θ 2 for (2) (e.g., Wainwright 2019) is not directly applicable to the penalized problem (4) for instance, it is unclear whether each agent s error ˆθi θ 2

can match centralized sample complexity. Distributed optimization: When it comes to algorithms for solving (4), existing studies are of pure optimization type, lacking of statistical guarantees. If nevertheless invoked to predict convergence of the proximal gradient algorithm applied to (4), they would certify sublinear convergence of the optimization error, since the objective function in (4) is not strongly convex on the entire space (recall d > N). This results in a pessimistic prediction, as shown by the exploratory experiment in Figure 1: the average estimation error (1/m) Pm i=1 θt i θ 2 decreases linearly up to a tolerance (ﬂoor); diﬀerent curves refer to diﬀerent values of the penalty parameter γ. The ﬁgure also plots the (square) estimation error achieved by solving (2) termed as centralized estimation

Ji, Scutari, Sun and Honnappa

error and the average of the (square) estimation errors achieved by each agent solving the LASSO problem using only its local data termed as local estimation error. The experiment seems to suggest that statistical error comparable to centralized ones are still achievable where data are distributed over a network. However this requires a suﬃciently small γ, and thus results in slow convergence rates. To the best of our knowledge, a theoretical understanding of these phenomena remains an open problem; questions are abundant, such as: (i) Is centralized statistical consistency (quantiﬁed by sample complexity N = o(s log d)) provably achievable when data are distributed across the network? What is the role/impact of the network? (ii) Is it feasible for the distributed proximal gradient method to yield statistically optimal solutions while maintaining linear convergence? (iii) How do sample and convergence rates of the algorithm interact with model parameters, speciﬁcally γ, d, N, and network conﬁgurations?

1.2 Major contributions

This work addresses the above questions our contributions can be summarized as follows.

1) Statistical analysis of the penalized LASSO problem (4): We establish nonasymptotic error bounds on the estimation error averaged over the agents, (1/m) Pm i=1 ˆθi θ 2, under proper tuning of λ and γ. Our results are of two types. (i) A deterministic bound, under a strong convexity requirement on the objective in (4) restricted to certain directions containing the augmented LASSO error ˆθ 1m θ (cf. Theorem 6) this bound sheds light on the role of the network and consensus errors (via γ) into the estimation process; and (ii) a (sample) convergence rate (1/m) Pm i=1 ˆθi θ 2 = O(s log d/N) (cf. Theorem 7), which holds with high probability (w.h.p.) under standard Gaussian data generation models (cf. Assumption 1). This matches the statistical error of the LASSO estimator (2), thereby unveiling for the ﬁrst time that statistical consistency over networks is feasible under a similar order of sample size N as employed in the centralized setting, even when the number of local samples n is insuﬃcient.

2) Algorithmic guarantees: To compute such estimators in a distributed fashion, we leverage the proximal-gradient algorithm applied to (4), and study its convergence and statistical properties (cf. Theorem 10 and Theorem 13). A major result is proving that, in the setting (ii) above, the algorithm converges linearly up to a ﬁxed tolerance which can be driven below the statistical precision of the centralized LASSO problem (2). Speciﬁcally, to enter an ε-neighborhood of a statistically optimal solution, it takes

O 1 1 ρ λmax(Σ)

λmin(Σ) d m log m log 1

number of communications (iterations), where ρ [0, 1) is a measure of the connectivity of the network (the smaller ρ is, the more connected the graph); and λmax(Σ)/λmin(Σ) is the restricted condition number of the LASSO loss function [see (2)], with λmax (resp. λmin) denoting the largest (resp. smallest) eigenvalue of the covariance matrix Σ of the data (cf. Assumption 1). This shows that centralized statistical accuracy is achievable over a given network (without moving data) but at the price of a linear rate (and thus communication cost) that scales as O(d). This speed-accuracy dilemma is conﬁrmed

Distributed Sparse Regression via Penalization

by our experiments (cf. Section 5). A similar phenomenon has been observed previously in low-dimensional settings (strongly convex losses) (Yuan et al., 2016, Theorem 3). However, our results demonstrating this dilemma in the high-dimensional setting as well, imply that this appears to be a scarlet letter of DGD-like algorithms.

1.3 Related works

Statistical analysis: Statistical properties of the LASSO solution ˆθ of (2) along with several other regularized M-estimators have been extensively studied in the literature (see, e.g., (Tibshirani, 1996, Hastie et al., 2015, Wainwright, 2019) and references therein). Introducing suitably restricted notions of strong convexity of the loss e.g., (Bickel et al., 2009, Candes and Tao, 2006, de Geer and Buhlmann, 2009, Sahand et al., 2012, Wainwright, 2019) (nonasymptotic) error bounds and sample complexity for such estimators under highdimensional scaling are established. For instance, for the LASSO estimator (2), statistical errors read ˆθ θ 2 = O(s log d/N).

These conditions and results for (2) do not transfer directly to the lifted, penalized formulation (4) it is not even clear the relation between ˆθ [cf. (2)] and ˆθ1, . . . , ˆθm [cf. (4)]. A new solution and statistical analysis is needed for the augmented LASSO estimator ˆθ (4), possibly revealing the role of the network on the statistical properties of ˆθ.

Centralized optimization algorithms: Referring to solution methods for centralized sparse linear regression problems, several studies are available in the literature, including (Becker et al., 2011, Beck and Teboulle, 2009, Bredies and Lorenz, 2008, Hale et al., 2008, Tseng and Yun, 2009, Zhou and So, 2017, Wen et al., 2017, Bolte et al., 2009) and (Agarwal et al., 2012). Since (2) is not strongly convex in a global sense, classical (accelerated) ﬁrst-order methods like (Becker et al., 2011, Beck and Teboulle, 2009) are known to converge at sublinear rate; others (Bredies and Lorenz, 2008, Hale et al., 2008) are proved to achieve linear convergence if initialized in a neighborhood of the solution of (2); and (Tseng and Yun, 2009, Zhou and So, 2017, Wen et al., 2017, Bolte et al., 2009) showed linear convergence (in particular) of the proximal-gradient algorithm, invoking global regularity conditions of the loss (2), such as the Luo-Tseng s bound (Luo and Tseng, 1993) or the KL property (Bolte et al., 2009, Pan and Liu, 2018). These studies are of pure optimization type e.g., convergence focuses on iteration complexity of the optimization error, no statistical analysis of the limit points is provided. Furthermore, they are not suitable for the high-dimensional regime (i.e., d, N growing ). A closer related work is (Agarwal et al., 2012), which establishes global linear convergence of the proximal-gradient algorithm for (2) up to the statistical precision of the model, under a restricted strong convexity (RSC) and restricted smoothness (RSM) assumption. The method is not directly implementable over mesh networks, because of the lack of a centralized node. Furthermore, it is unclear whether RSC/RSM conditions hold for the penalized sum-loss in (4). On the other hand, a naive application of the RSC/RSM to each agent s loss fi(θi) = (1/2n) yi Xiθi 2 in (4) (without accounting for the penalty, coupling term (1/γ) θ 2 V ), would require a local sample scaling n = O(s log d) to hold. This conclusion is unsatisfactory because it would state that the centralized minimax error bound ˆθ θ 2 = O(s log d/N) is not achievable over networks a fact that is confuted by our theoretical ﬁndings and experiments.

Ji, Scutari, Sun and Honnappa

Divide and Conquer (D&C) methods: When it comes to decomposition methods for statistical estimation and inference, the statistical community is best acquainted with D&C methods. D&C algorithms postulate the existence of a node in the network (a.k.a. master node) connected to all the others (termed worker nodes), which combines the estimators produced by each worker using its local data set. D&C algorithms for M-estimation in low-dimension, covering the asymptotics d, N while d/N c [0, 1), have been extensively studied in the literature; representative examples include Rosenblatt and Nadler (2016), Wang et al. (2018), Chen et al. (2021), Bao and Xiong (2021), Jianqing et al. (2021). More relevant to this work are the D&C methods applicable to sparse linear regression in high-dimension, i.e., d > N and d/N , which include Lee et al. (2015), Battey et al. (2018), Wang et al. (2017), Jordan et al. (2018). Lee et al. (2015), Battey et al. (2018) devised a one-shot approach averaging at the master node debiased local LASSO estimators. Wang et al. (2017), Jordan et al. (2018) independently improved the sample complexity of Lee et al. (2015) hinging on ideas from Shamir et al. (2014) Table 1 provides the sample and communication complexity of these methods, which can be summarized as follows. By performing a single round of communication from the workers to the master node, resulting in a O(d) communication cost, these algorithms achieve the centralized statistical error O(s log d/N) as long as the local sample size is suﬃciently large, i.e., n = Ω(ms2 log d) (see Table 1). Alternatively, for ﬁxed n, this imposes a constraint on the maximum number of workers, i.e., m = O(n/(s2 log d)), which limits the range of applicability of these methods to small-size (star) networks. The dependence of n on m can be removed at the cost of multiple communication rounds; to our knowledge, the state of the art is Wang et al. (2017) showing that n = Ω(s2 log d) suﬃces under log m communication rounds, resulting in a total communication cost of O(d log m). None of these methods is directly implementable over mesh networks, because of the lack of a centralized node. Naive attempts of decentralizing D&C methods over mesh networks by replacing the exact average at the master node with local consensus updates fail to achieve centralized statistical consistency.

D&C Methods n m s2 log d m s2 log d n s2 log d Communication Cost (one round) Communication Cost (multiple rounds) Avg-Debias Lee et al. (2015) d Battey et al. (2018) d CSL Jordan et al. (2018) d 1

EDSL Wang et al. (2017) d d log m

Table 1: D&C algorithms for sparse linear regression in the high-dimensional, d > N and d/N : local sample size and communication cost to achieve the centralized statistical error O(s log d/N). For a single communication round, all methods require a condition on the minimum local sample size n; multiple communication rounds can reduce the condition on local sample size n. 1CSL Jordan et al. (2018) can be extended to multiple rounds of communication to reduce the local sample size using the similar argument as in EDSL Wang et al. (2017).

In contrast to D&C methods, the DGD-like algorithm studied in this paper to solve (4) provably achieves (near) optimal minimax rates with no conditions on the local sample size, at a total communication cost however of O(d2). This raises the question whether communication costs of O(d) are achievable in high-dimension over mesh networks by other distributed optimization algorithms, yet with no conditions on the local sample size. Motivated by this work, the study of other methods in high-dimension is the subject of

Distributed Sparse Regression via Penalization

current investigation; see, e.g., the companion work Sun et al. (2022). In fact, as discussed next, there is no study of any other existing distributed algorithm in high-dimension.

Distributed optimization algorithms: Solving the LASSO problem (2) over mesh networks falls under the umbrella of distributed optimization. The literature of distributed optimization methods is vast; given the focus of the paper, we comment next only relevant works on decentralization of the (proximal) gradient method over mesh networks modeled as undirected graphs. Distributed Gradient Descent (DGD) algorithms, including those derived by penalizing consensus constraints as in (4), have been extensively studied in the literature; see, e.g., (Nedić and Ozdaglar, 2009, Nedić et al., 2010, Chen and Sayed, 2012, Sayed, 2014, Chen and Ozdaglar, 2012, Yuan et al., 2016, Zeng and Yin, 2018, Daneshmand et al., 2020, Nedić et al., 2018). Among all, the most relevant distributed scheme to this paper is (Zeng and Yin, 2018), a proximal gradient algorithm. When applied to (4), under the additional assumption of bounded (sub)gradient of the agents losses (a fact that is not guaranteed), sublinear convergence (on the objective value) to the optimal solutions of (4) would be certiﬁed (recall that agents losses are not strongly convex globally). Furthermore, the connection between the solution of the penalized problem (4) and that of the LASSO formulation (2) remains unclear. While diﬀerent and not derived directly from (4), the other DGD-like algorithms can be roughly commented as follows: (i) when the agents loss functions are strongly convex (or the centralized loss satisﬁes the KL property (Zeng and Yin, 2018, Daneshmand et al., 2020)), diﬀerentiable, and there are no constraints, DGD-like schemes equipped with a constant stepsize, converge (only) to a neighborhood of the solution at linear rate (Yuan et al., 2016, Zeng and Yin, 2018, Yuan et al., 2020). Convergence (in objective value) to the exact solution is achieved only using diminishing stepsize rules, thus at the slower sublinear rate (see, e.g., Zeng and Yin 2018, Jakovetić et al. 2014). This speed-accuracy dilemma can be overcomed by correcting explicitly the local gradient direction so that a constant stepsize can be used still preserving convergence to the exact solution; examples include: gradient tracking methods (Qu and Li, 2017, Nedić et al., 2016, Xu et al., 2018, Lorenzo and Scutari, 2016, Sun et al., 2019) and primal-dual schemes (Jakovetić et al., 2011, Shi et al., 2014, Jakovetić, 2019, Jakovetić et al., 2013, Shi et al., 2015a,b), just to name a few.

The above review of the literature shows that there is no study of statistical/computational guarantees in the high-dimensional regime. Our comments on centralized optimization algorithms apply here for all the aforementioned distributed ones: all the convergence results are of pure optimization type and are confuted by our experiments (see Figure 1). A new analysis is needed to understand the behaviour of distributed algorithms in the high-dimensional regime. This paper represents the ﬁrst study of a DGD-like algorithm towards this direction.

1.4 Notation and paper organization

The rest of the paper is organized as follows. Section 2 introduces the assumptions on the data model and network along with some consequences. Solution analysis of the penalized LASSO (4) is addressed in Section 3 a deterministic error bound, based on a notion of restricted strong convexity, is ﬁrst established; then near optimal centralized sample complexity is proved under standard data generation models (Section 3.3). The (distributed) proximal

Ji, Scutari, Sun and Honnappa

gradient algorithms applied to (4) is studied in Section 4. Finally, Section 5 provides some experiments validating our theoretical ﬁndings while Section 6 draws some conclusions. All the proofs of the presented results are relegated to the appendix.

Notation: Let [m] {1, . . . , m}, with m N++; 1 is the vector of all ones; ei is the i-th canonical vector; Id is the d d identity matrix (when unnecessary, we omit the subscript); and denotes the Kronecker product. Given x1, . . . , xm Rd, the bold symbol x = [x 1 , . . . , x m] Rmd denotes the stack vector; for any x = [x 1 , . . . , x m] , we deﬁne its blockaverage as xav (1/m) Pm i=1 xi, and the disagreement vector x [x 1, . . . , x m] , with each x i xi xav. Similarly, given the matrices X1, . . . , Xm Rn d, we use bold notation for the stacked matrix X = [X 1 , . . . , X m] . We order the eigenvalues of any symmetric matrix A Rm m in nonincreasing fashion, i.e., λmax(A) = λ1(A) . . . λm(A) = λmin(A). We use to denote the Euclidean norm; when other norms are used, e.g., ℓ1-norm and ℓ , we will append the associate subscript to , such as 1, and . Consistently, when applied to matrices, denotes the operator norm induced by . Furthermore, we write x A (x Ax)1/2, for any symmetric positive semideﬁnite matrix. Given S [d] and y Rd, we denote by |S| the cardinality of S and by y S the |S|-dimensional vector containing the entries of y indexed by the elements of S; Sc is the complement of S. All the log in the paper are intended natural logarithms, unless otherwise stated. Given two univariate random variables X and Y , we say that Y has stochastic dominance over X if X st Y , meaning P(X t) P(Y t), for all t R (Marshall et al., 2011, p. 694).

2. Setup and Background

In this section we introduce the main assumptions on the data model and network setting underlying our analysis along with some related consequences.

2.1 Problem setting

The following quantities associated with (1) will be used throughout the paper:

S supp{θ }, s = |S|, Lmax max i [m] λmax

We collect all the local data {(yi, Xi)}m i=1 into the stacked vector measures y = [y 1 , . . . , y m] RN and matrix X = [X 1 , . . . , X m] RN d. The quadratic losses of the centralized LASSO problem (2) and of the penalized one (4) are denoted respectively by

F(θ) 1 2N y Xθ 2 and Lγ(θ) 1 2N

i=1 yi Xiθi 2 | {z } fi(θi)

+ 1 2mγ θ 2 V . (6)

We recall next the main path/assumptions used to bound the LASSO error ˆθ θ 2 in the centralized setting (2) (e.g., Wainwright 2019). In the regime d > N, F is not strongly convex the d d Hessian matrix X X has at most rank N. Nevertheless, ˆθ θ 2 can be well-controlled requiring strong convexity of F to hold along a subset of directions. The Restricted Eigenvalue (RE) condition suﬃces (Bickel et al., 2009, Candes and Tao, 2006,

Distributed Sparse Regression via Penalization

Wainwright, 2019)

1 N X 2 δc 2, C(S) { Rd | Sc 1 3 S 1}, (7)

where δc > 0 is the curvature parameter, and C(S) captures the set of sparse directions of interests. The rationale behind (7) is that, since ˆθ θ can be proved to belong to C(S), if F is strongly convex on C(S) as requested by (7) then small diﬀerences on the loss will translate into bounds on ˆθ θ 2. The RE (7) imposes conditions on the design matrix X. The following RSC implies (7).

Lemma 1 Suppose that F satisﬁes the following RSC condition with curvature µ > 0 and tolerance τ > 0: 1 N X 2 µ

2 2 1, Rd. (8)

Under µ/2 16sτ > 0, the RE (7) holds with δc = µ/2 16sτ.

The practical utility of the RSC condition (8) vs. the RE is that it can be certiﬁed with high probability by a variety of random design matrices X. Here we consider the following.

Assumption 1 (Random Gaussian model) The design matrix X RN d satisﬁes the following: (i) the rows of X are i.i.d. N(0, Σ); and (ii) Σ is positive deﬁnite, with minimum eigenvalue λmin(Σ) > 0.

Lemma 2 ((Raskutti et al., 2010, Theorem 1)) Let X RN d be a design matrix satisfying Assumption 1. Then, there exist universal constants c0, c1 > 0 such that, with probability at least 1 exp( c0N), the RSC condition (8) holds with parameters

µ = λmin(Σ) and τ = 2c1ζΣ log d

N , with ζΣ max i [d] Σii. (9)

2.2 Network setting

We model the network of m agents as an undirected graph G = (V, E), where V = [m] is the set of agents, and E is the set of the edges; {i, j} E if and only if there is a communication link between agent i and agent j. We make the blanket assumption that G is connected, which is necessary for the convergence of distributed algorithms to a consensual solution. To solve (4) over G via gradient descent, each agent should be able to compute the gradient of the objective (w.r.t. its own local variable θi) using only information from its immediate neighbours. This imposes some extra conditions on the sparsity pattern of the matrix V . We will use the following widely adopted structure for V .

Assumption 2 The matrix V = (Im W) Id, where W (wij)m i,j=1 satisﬁes the following:

(a) It is compliant with G, that is, (i) wii > 0, i [m]; (ii) wij > 0, if {i, j} E; and (iii) wij = 0 otherwise; and

(b) It is symmetric and stochastic, that is, W1 = 1 (and thus also 1 W = 1 ).

Ji, Scutari, Sun and Honnappa

It follows from the connectivity of G and Assumption 2 that

V θ = 0 iﬀ θi = θj, i = j [m],

and ρ max{|λ2(W)|, |λmin(W)|} < 1. (10)

Roughly speaking, ρ measures how fast the network mixes information (the smaller, the faster). If G is complete graph or a star, one can choose W = 11 /m, resulting in ρ = 0.

3. Solution Analysis and Statistical Guarantees

This section presents the solution analysis of the penalized LASSO problem (4), establishing nonasymptotic bounds of (1/m) Pm i=1 ˆθi θ 2. Our study builds on the following steps.

1) We ﬁrst determine a suitable restricted set of directions Cγ(S) [cf. (11)] which contains the augmented LASSO error ˆθ 1m θ under certain conditions on the sparsityenhancing parameter λ [cf. Proposition 3] the set Cγ(S) plays similar role as C(S) [cf. (7)] for the centralized LASSO (2), and sheds light on the role of the penalty parameter γ (and thus the consensus errors) on the sparsity pattern of ˆθ;

2) We then determine a RSC-like condition [cf. (15)] ensuring that, under a suitable choice of γ controlling the consensus error, the subset Cγ(S) is well-aligned with the curved directions of the loss Lγ of (4);

3) Results in the previous steps will translate into bounds on (1/m) Pm i=1 ˆθi θ 2 [cf. Theorem 6]. Quite interesting, our RSC condition holds w.h.p. under the random model in Assumption 1 (cf. Lemma 5), which yields centralized sample complexity (1/m) Pm i=1 ˆθi θ 2 = O(s log d/N) (cf. Theorem 7).

3.1 The set of (almost) sparse average directions

For each given γ (0, 1), deﬁne the set

Cγ(S) { Rmd | ( av)Sc 1 3 ( av)S 1 + h(γ, )}, (11)

mγλ 2 + 2 max i [m] w i Xi /(λn) + 2 p

The maximum of h(γ, ) is a decreasing function of γ > 0. This suggests that, the sparsity of the average component av of directions Cγ(S) can be controlled by γ; in particular, by decreasing γ one can make av arbitrary close to the cone C(S) of sparse directions of the centralized LASSO (2) [cf. (7)]. The importance of Cγ(S) is captured by the following result.

Proposition 3 Under Assumption 2 and λ satisfying

2 N X w λ, (13)

the augmented LASSO error ˆθ 1m θ lies in Cγ(S).

Distributed Sparse Regression via Penalization

Proof See Appendix A.

Therefore, the average component of the augmented LASSO error is nearly sparse for suﬃciently small γ and large λ. This will be used to pursue statistically optimal estimates.

3.2 In-network RE condition

We impose a positive curvature on the loss Lγ of (4) [cf. (6)] along suitable chosen directions in Cγ(S). The ﬁrst-order Taylor expansion of Lγ at θ along θ θ , denoted by TLγ(θ; θ ), can be lower bounded as

TLγ(θ; θ ) Lγ(θ) Lγ(θ ) Lγ(θ ), θ θ

4 X(θ θ )av 2

N | {z } curvature along average

| {z } nonconsensual component

The second term on the RHS of (14) is due to the disagreement of the θi s, and can be controlled choosing suitably small γ. In fact, we will prove that a curvature condition on the ﬁrst term of the RHS of (14) along the directions θ θ Cγ(S) is enough to establish the desired error bounds on the LASSO error (1/m) Pm i=1 ˆθi θ 2. This motivates the following deﬁnition of RSC-like property of Lγ.

Assumption 3 (In-network RE) The loss function Lγ satisﬁes the following RSC condition with curvature δ > 0 and tolerance ξ > 0:

TL(θ; θ ) δ (θ θ )av 2 ξ h2(γ, (θ θ ) ), θ θ Cγ(S). (15)

The stipulated condition mandates a positive curvature for Lγ along consensual directions in Cγ(S). Across the entire space, however, Lγ need not exhibit strong convexity, attributed largely to the tolerance term accounting for consensus errors. The following two results establish suﬃcient conditions for (15) to hold, for deterministic and random design matrices X which match those required for the centralized LASSO (see Lemma 1 and Lemma 2).

Lemma 4 Reinstate Lemma 1, under µ/2 16sτ > 0. Then (15) holds, with δ = µ/2 16sτ and ξ = τ, for any given γ (0, (1 ρ)/Lmax].

Proof See Appendix B.1.

Lemma 5 Let X RN d satisfy Assumption 1. For any N and γ such that

N c2 ζΣs log d

λmin(Σ) , and γ (0, (1 ρ)/Lmax], (16)

TLγ(θ; θ ) λmin(Σ)

4 (θ θ )av 2 λmin(Σ)

64s h2 γ, (θ θ ) , θ, θ : θ θ Cγ(S),

with probability at least 1 exp( c0N). Here, c0, c2 > 0 are universal constants.

Ji, Scutari, Sun and Honnappa

Proof See Appendix B.2.

3.3 Error bounds and statistical consistency of the LASSO error of (4)

We are ready to establish consistency and convergence rates for the augmented LASSO estimator ˆθ. Our ﬁrst result is a deterministic upper bound on the average error under the In-network RE condition (15).

Theorem 6 Consider the augmented LASSO problem (4) under Assumptions 2 and 3. For any ﬁxed λ and γ satisfying respectively

2 N X w λ and γ 2(1 ρ) 4Lmax + δ, (18)

any solution ˆθ = [ˆθ1, . . . , ˆθm] satisﬁes

i=1 ˆθi θ 2

δ2 | {z } centralized error

+ 2ξd2γ2(maxi [m] w i Xi + λn)4

δλ2n4(1 ρ)2 + 4dγ(maxi [m] w i Xi + λn)2

δn2[2(1 ρ) 4Lmaxγ δγ] | {z } cost of decentralization

Proof See Appendix C.

Theorem 6 shows the bound on the LASSO error over the network can be decoupled in two terms the ﬁrst one matches that of the centralized LASSO error (see, e.g., Wainwright 2019, Theorem 7.13) while the second one quantiﬁes the price to pay due to the decentralization of the optimization and consequent lack of consensus. The explicit dependence on γ shows that the detriment eﬀect of the consensus errors can be controlled by γ: as γ 0, the error bound above approaches that of the centralized LASSO solution. There is however no free lunch; we anticipate that γ 0 aﬀects adversarially the convergence rate of the proximal gradient algorithm applied to problem (4), determining thus a speed-accuracy dilemma. The next result provides nonasymptotic rates for the LASSO error above, under the random Gaussian model for X and the noise w in (1) optimal centralized convergence rates are achievable by a proper choice of γ.

Theorem 7 Consider the augmented LASSO problem (4) with d 2 under Assumption 2. Suppose that X satisﬁes Assumption 1 and w N(0, σ2IN); the sample size satisﬁes

N c3 ζΣs log d

λmin(Σ) ; (20)

and the parameters λ and γ are chosen according to the following

γ c5 (1 ρ) λmax(Σ)(d + log m) + λmin(Σ)dm(log m + 1), (22)

Distributed Sparse Regression via Penalization

for some t0 > 2. Then, any solution ˆθ = [ˆθ1, . . . , ˆθm] of problem (4) satisﬁes

i=1 ˆθi θ 2 c6 σ2ζΣt0 λmin(Σ)2 s log d

with probability at least 1 c7 exp( c8 log d). (24)

Here, c3, . . . , c8 are universal constants.

Proof See Appendix D.

The bound (23) matches the statistical error of the centralized LASSO estimator in (2) proving that statistical consistency over networks is achievable under the same order of the sample size N used in the centralized setting, even when the local number n of samples does not suﬃce. This is possible because agents communicate over the network the computation of such a solution and associated communication overhead is studied in the next section.

4. Distributed Gradient Descent Algorithm

To compute the statistically optimal estimator ˆθ over networks, we employ the proximal gradient algorithm applied to the penalized formulation (4), which naturally decomposes across the agents. Speciﬁcally, at iteration t, θ is updated by minimizing the ﬁrst order approximation of the objective function Lγ, which reads

θt+1 = argmin θi 1 R i [m] Lγ(θt) + Lγ(θt), θ θt + 1 2βm θ θt 2 + λ

m θ 1, (25)

where we included an extra constraint θi 1 R to regularize the iterates, and β > 0 plays the role of the stepsize. The following lemma shows that one can ﬁnd a suﬃciently large R so that the solution of (4) does not change if we add therein the norm ball constraint θi 1 R, i [m].

Lemma 8 Consider Problem (4) under Assumption 2. Further assume that (i) X satisﬁes the RSC condition (8) with δ = µ/2 16 sτ > 0; (ii) λ satisﬁes (13); and (iii) γ satisﬁes

γ (1 ρ) 2Lmax + δ + 128(d/s)δ(maxi [m] w i Xi /(λn) + 2 m)2 . (26)

Then, ˆθi 1 R, i [m], whenever R is such that

R max λs δ(1 r)

with r (0, 1).

Proof See Appendix E.

Therefore, we can focus on Problem (25) without loss of generality. Notice that the problem is separable in {θi}i [m]; hence, it can be solved distributively from each agent i. Furthermore, the solution can be computed in an explicit form, as determined next.

Ji, Scutari, Sun and Honnappa

Lemma 9 The solution to (25) reads

proxβλ 1(ψt i), if proxβλ 1(ψt i) 1 R,

ΠB 1(R)(ψt i), otherwise; (28)

where R x 7 proxh(x) R is the proximal operator (applied to ψt i component-wise), and

j=1 wijθt j γ fi(θt i)

Proof See Appendix F.

Notice that the proximal operation in (28) has a closed-form expression via soft-thresholding (Donoho, 1995) while the projection onto the ℓ1-ball can be eﬃciently computed using the procedure in (Duchi et al., 2008). To perform the update (28), each agent i only needs to receive the local estimates θt j from its immediate neighbors.

4.1 Linear convergence to statistical precision

Convergence rate of the optimization error θt ˆθ is stated under the RSC condition (8), in terms of the contraction coeﬃcient κ and initial optimality gap η0 G:

8 8sτ, κ 1 βµav

4 , and η0 G Lγ(θ0) + λ

(29) where θ0 is a ﬁxed initialization. Further, denote

i=1 ˆθi θ 2 + λ2s 1976µ2 . (30)

We can now state our convergence result.

Theorem 10 Consider the augmented LASSO problem (4) under Assumption 2. Suppose the design matrix X satisﬁes the RSC condition (8) with µ c9sτ for some suﬃciently large constant c9 > 0; and the penalty parameters λ and γ satisﬁes

λ max 2 X w

N , 64τ θ 1

γ 1 ρ 2Lmax + (µ/2 16sτ) 1 + 128(d/s)(maxi [m] w i Xi /(λn) + 2 m)2 , (32)

respectively. Let {θt i}i [m] be the sequence generated by Algorithm (28) under the following choices of tuning parameters β and R

β = γ γLmax + 1 λmin(W) (33)

Distributed Sparse Regression via Penalization

max 56λs µ 32sτ , 2 θ 1

R λ 32τ ; (34)

and initialization θ0 i Rd, i [m], such that

η0 G 4sτ ε2 stat. (35)

i=1 θt i ˆθi 2 τs

µav ε2 stat + τs

for any tolerance parameter α2 such that

α2 4sτ ε2 stat, (37)

and for all

t log2 log2

1+Lmax log 2

µav +(1 + ρ) log 2

µav + 1 + ρ

log η0 G α2

The intervals in (34) and (37) are nonempty.

Proof See Appendix G.

The theorem shows that Algorithm (28) converges at a linear rate to an optimal solution ˆθ, up to a tolerance term as speciﬁed on the right hand side of (36) the ﬁrst term therein depends on the model parameters, while the second one is controlled by α2. Theorem 13 and (see also Corollary 14) below proves that for the random Gaussian data generation model, the tolerance can be driven below the statistical precision for suﬃciently large N.

Remark 11 Observe that the condition (35) pertaining to the initialization does not truly impose a substantial constraint. Indeed, any initial point that contravenes (35) would inherently be situated within the centralized statistical error ball, i.e.,

η0 G 4sτ ε2 stat (32),Theorem 6 = O sτ λ2s .

Remark 12 Algorithm (28) is closely related to the DGD algorithm studied in the literature of distributed optimization (e.g., Zeng and Yin 2018, Nedić et al. 2018). In fact, if in (28) one choose β = γ/2, with γ satisfying (32) [note that this choice of β is compatible with (33)], the gradient step therein reduces to

i=1 wijθt j

2 fi(θt i), (39)

which can be viewed as DGD with weight matrix 1

2(I + W) and step size γ/2.

Ji, Scutari, Sun and Honnappa

Theorem 13 Consider the augmented LASSO problem (4) with d 2 under Assumption 2. Suppose the design matrix X satisﬁes Assumption 1, w N(0, σ2IN), and

N c10 ζΣ λmin(Σ)s log d. (40)

Choose the penalty parameters λ and γ satisfying respectively

λ c11 max σ

N , ζΣ s log d

γ c12 (1 ρ) λmax(Σ) (d + log m) + λmin(Σ)dm (log m + 1). (42)

for some ﬁxed t0 > 2. Let {θt i}i [m] be the sequence generated by Algorithm (28) under the following choices of tuning parameters β and R

β = γ γc13d/n + 1 λm(W), max 56λs λmin(Σ) c14sζΣ log d/N , 2s R λN c14ζΣ log d,

and initialization θ0 i Rd, i [m], such that

η0 G c15 ζΣ λmin(Σ)2 s log d

N λ2s. (44)

Then, with probability at least (24),

i=1 θt i ˆθi 2 c16 λmin(Σ)

α2 + ζΣ s log d

i=1 ˆθi θ 2 + λ2s λmin(Σ)2 + α4

for any tolerance parameter α2 such that

i=1 ˆθi θ 2 + λ2s λmin(Σ)2

and for all

+ log η0 G α2

κΣ(d + log m) + (1 + ρ) λmin(Σ)γ

Further, the range of values of R in (43) is nonempty; and the interval in (46) is nonempty as well, with probability at least (24). Here, c10, . . . , c18 are universal constants.

Proof See Appendix H.

A suitable choice of the free parameters above leads to the following simpliﬁed result, showing linear convergence up to a tolerance of a higher order than the statistical error.

Distributed Sparse Regression via Penalization

Corollary 14 Consider the augmented LASSO problem (4) with d 2 under Assumption 2. Suppose the X satisﬁes Assumption 1, w N(0, σ2IN) and the sample size satisﬁes

N c19 max ζΣs log d

λmin(Σ) , s2ζΣ log d

and d + log m

Choose the penalty parameters λ and γ satisfying respectively

and γ c21 1 ρ λmax(Σ)(d + log m) + λmin(Σ)dm (log m + 1), (50)

for some ﬁxed t0 2. Let {θt i}i [m] be the sequence generated by Algorithm (28) under the following choices of β and R:

β = nγ γc13d + n(1 λm(W)), max c22 sσ λmin(Σ)

N , 2s R c23σ

t0N ζΣ log d (51)

and initialization θ0 i Rd, i [m], such that

σζΣ λmin(Σ)

Then, with probability at least (24),

i=1 θt i ˆθi 2

c23 λmin(Σ)

α2 + ζΣ s log d

i=1 ˆθi θ 2 + ζΣ s log d

λ2 min(Σ) s log d

N + 1 σ2t0 α4 !

for any tolerance parameter α2 such that

α2 c25ζΣ s log d

i=1 ˆθi θ 2 + σ2ζΣt0 λmin(Σ)2 s log d

(54) and for all

t c26 κΣ dm (log m + 1)

( log2 log2

! + log η0 G α2

The range of value of R in (51) is nonempty; and the interval in (54) is nonempty as well, with probability at least (24).

Ji, Scutari, Sun and Honnappa

Proof See Appendix I.

It is not diﬃcult to check that, in the above setting, Theorem 7 holds (in particular, (50) implies (22); hence, by (23), we have 1 m Pm i=1 ˆθi θ 2 = O(s log d

N ). Therefore, whenever the sample size N = o(s log d) a condition that is required for statistical consistency of any centralized method by minimax results (see, e.g., Raskutti et al. 2011), the (lower bound of the) tolerance α2 in (54) and thus the overall residual error in (53) is of smaller order than the statistical error O(s log d

N ). Therefore, in this setting, a total number of communications (iterations) of

O κΣ d m log m

is suﬃcient to drive the iterates generated by Algorithm (28) within O(α2) of an optimal solution bθ (in the sense of (53)), and thus to an estimate of θ within the statistical error. This matches centralized statistical accuracy achievable by the LASSO estimator ˆθ in (2). Notice that no conditions on the local sample size n are required. The expression (56) sheds light on the impact of the problem and network parameters on the convergence. Speciﬁcally, the following comments are in order.

(i) Network dependence/scaling: The term 1/(1 ρ) > 1 captures the eﬀect of the network; as expected, weakly connected networks (i.e., as ρ 1) call for more rounds of communication to achieve the prescribed accuracy. Recall that ρ = ρ(m) is a function of the number of agents m (and the speciﬁc topology under consideration). Hence, the term m log m 1 ρ(m) (57)

shows how the number of rounds of communications on the network scales with m, for a given graph topology (determining ρ(m)). Our experiments in Section 5 seem to suggest that the dependence of the communication rounds on m as predicted by (57) is fairly tight, for diﬀerent graph topologies.

Table 2 provides some estimates of the scaling of 1/(1 ρ(m)) with m for some representative graphs, when the lazy Metropolis rule is used for the gossip matrix W (Nedić et al., 2018). Some graphs, for instance the Erdős-Rényi, exhibit a more favorable scaling than others, such as line graphs. Note that equation (57) does not encapsulate the total communication cost, which is also contingent on the density of the graph. We deﬁning one channel use as the communication occurring per edge connecting two nodes. For example, in the case of an Erdős-R enyi graph (with p = log m/m), the total channel uses (across all nodes) per communication round is O(m log m). In contrast, the complete graph necessitates O(m2) channel uses per communication round, even though both graphs display a scaling of (57) with m, and thus a total number of communication rounds of the same order.

(ii) Population condition number: The ratio λmax(Σ)

λmin(Σ) is the condition number of the covariance matrix of the data; it can be interpreted as the restricted condition number of the LASSO loss function F(θ) [see (6)]. Therefore, as expected, ill-conditioned problems call for more iterations (communication) to achieve the prescribed accuracy.

Distributed Sparse Regression via Penalization

path 2-d grid complete p-Erdős-Rényi p-Erdős-Rényi

(1 ρ) 1 O(m2) O(m log m) O(1) O(1) [p = log m/m] O(1) [p = O(1)]

Table 2: Scaling of (1 ρ(m)) 1 with agents number m, for diﬀerent graph topologies.

(iii) Speed-accuracy dilemma: We proved that centralized statistical accuracy, as for the LASSO estimator ˆθ in (2), is achievable over networks via the distributed algorithm (28). This can be accomplished even when individual agents do not possess suﬃcient data to ensure statistical consistency locally. The crucial factor making this possible is the assistive role of the network via information mixing, However, equation (56) reveals that regardless of the speed of information propagation within the network (irrespective of how small ρ is), the total number of communication rounds necessary to achieve a predetermined accuracy scales as O(d). Our forthcoming numerical results will validate the precision of such scaling. Therefore, we discern that the DGD-like scheme encounters similar speed-accuracy dilemmas in high-dimensional regimes as observed when applied to strongly convex, smooth losses in lower-dimensional cases (e.g., see Nedić et al. 2018). This issue seems to be inevitable and a direct consequence of the structural updates implemented by the algorithm.

5. Numerical Results

In this section, we provide some experiments on synthetic and real data. The former are instrumental to validate our theoretical ﬁndings. More speciﬁcally, we validate the following theoretical results. (i) On the statistical error front, we show that with a proper choice of γ, the solution of the distributed formulation (4) achieves the statistical accuracy of the centralized LASSO estimator (Theorem 7). We also validate the dependency of γ with the dimension d [cf. (22)]. On the computational front, (ii) we demonstrate that DGD-like algorithm (28) displays linear convergence up to statistical precision; (iii) we also validate the scaling of the communication rounds with the network size m, as predicted by (57). (iv) Finally, we illustrate the speed-accuracy dilemma, as shown in Theorem 13. We then proceed to experiment on real data, showing that statistical accuracy of the centralized LASSO estimator (Theorem 7) is achievable by the distributed method (4), still at the cost of a convergence rate scaling with O(d). All the experiments were run on a server equipped with Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz. Experimental setup (synthetic data): The ground truth θ is set by randomly sampling a multivariate Gaussian N(0, Id) and thresholding the smallest d s elements to zero. The noise vector w is assumed to be multivariate Gaussian N(0, 0.25IN). We construct X RN d by independently generating each row xi Rd, adopting the following procedure (Agarwal et al., 2012): let z1, . . . , zd 1 be i.i.d. standard normal random variables, set xi,1 = z1/

1 0.252 and xi,t+1 = 0.25xi,t + zt, for t = 1, 2, . . . , d 1. It can be veriﬁed that all the eigenvalues of Σ = cov(xi) lie within the interval [0.64, 2.85]. We partition (X, y) as X = [X 1 , X 2 , . . . , X m] and y = [y 1 , . . . , y m] , and agent i owns the data set portion (Xi, yi) and we have m agents in total. We simulate an undirected graph G following the

Ji, Scutari, Sun and Honnappa

Erdös-Rényi model G(m, p), where m is the number of agents and p is the probability that an edge is independently included in the graph. The coeﬃcient of the matrix W are chosen according to the Lazy Metropolis rule (Olshevsky, 2017). All results are using Monte Carlo with 30 repetitions. 1) Statistical accuracy veriﬁcation (Theorem 7). We set N = 220, m = 20, d = 400, and consider two types of graphs, namely a fully connected and a weakly connected graph, the latter generated as Erdös-Rényi graph with edge probability p = 0.1, resulting in ρ 0.973. Figure 2 plots the log-average statistical error log Pm i=1 ˆθi θ 2/m versus λ for the fullyand weakly-connected graphs (resp.), and contrasts it with the centralized LASSO log-error log ˆθ θ 2 . The following comments are in order. (i) A careful choice of γ (= 5 10 4) is required to ensure that the distributed penalty LASSO recovers the centralized ℓ2-error; however, with the same choice of γ, the solution achieved by the distributed method (4) over weakly connected graph can not recover the the statistical accuracy of the centralized LASSO estimator. (ii) The weakly connected graph requires a smaller γ (= 1 10 4) to recover the centralized statistical error; which is consistent with the dependence of γ on ρ as in (22). (iii) The range of λ guaranteeing the minimal ℓ2error in both the centralized and distributed penalty LASSO is comparable, as predicted by condition (20) on λ. 2) Validating γ = O((1 ρ)/d) in (22) (Theorem 7). Figure 3 plots the inverse of largest γ (grid-searched) that guarantees centralized statistical accuracy versus the dimension, for three choices of (N, d, s), corresponding to increasing values of d, s = log d , m = 5 and adjust N such that roughly constant s log d/N (and so the centralized statistical error). The ﬁgure shows that, as d increases, a smaller γ is required to preserve centralized statistical errors. The scaling of such a γ is roughly O(1/d), validating the dimension-dependence of recovery predicted in (22). Notice also that a weaker connected graph requires smaller γ to recover the centralized statistical error, and the slope of weakly connected graph (yellow, ρ = 0.9045) is larger than that of the fully connected (blue, ρ = 0.4897), which is consistent with the dependence of 1/γ = O(d/(1 ρ)) as proved in (22). 3) Linear convergence and the speed-accuracy dilemma (Theorem 13). Figure 5 plots the log average optimization error versus the number of iterations generated by the distributed proximal-gradient Algorithm (28), in the same setting of Figure 3. As predicted by Theorem 13, linear convergence within the centralized statistical error is achievable when d, N and s log d/N = o(1), but at a rate scaling with O(d), revealing the the speed-accuracy dilemma. 4) On the dependence of communication rounds on network size m. To underscore the aforementioned dependence, we carried out experiments across an array of network topologies. This comprised of three deterministic graphs the complete graph, path graph, and star graph and two random graphs the Erdös-Rényi graph with p = O(1) (speciﬁcally p = 0.6) and p = O(log m/m) (speciﬁcally p = log m/m), resulting in both cases a connectivity ρ roughly constant with m with high-probability (Nedić et al., 2018, Proposition 5). In each topology, we progressively augmented the number of nodes, m, in increments of 10, 25, 40, and 50, while maintaining a consistent total sample size of 200 and dimension of 400. We sought the largest γ (grid-searched) and the least number of communications for each pairing of m and graph type that attain centralized statistical errors (within 3% accuracy). Results are presented in Figure 4, where we plotted

Distributed Sparse Regression via Penalization

1 2 3 4 5 6 7

i=1 ˆθi θ 2

N = 220, d = 400, s = 5, m = 20, γ = 5 10 4

centralized distributed

1 2 3 4 5 6 7

i=1 ˆθi θ 2

N = 220, d = 400, s = 5, m = 20, γ = 1 10 3

centralized distributed

1 2 3 4 5 6 7

i=1 ˆθi θ 2

N = 220, d = 400, s = 5, m = 20, γ = 5 10 4

centralized distributed

1 2 3 4 5 6 7

i=1 ˆθi θ 2

N = 220, d = 400, s = 5, m = 20, γ = 1 10 4

centralized distributed

Figure 2: Statistical error of the estimator bθ [see (4)] and the centralized LASSO estimator ˆθ [see (2)] versus λ, using synthetic data; First row: fully connected graph (ρ = 0.4897); Second row:. Erdös-Rényi graph with p = 0.1, (ρ 0.973). Notice that our theory explains the behaviour of the curves only for values of λ 0.033 [as required by (22)].

0 500 1,000 1,500 2,000 0

dimension d

Fully connected Weakly connected

Figure 3: Validating γ = O((1 ρ)/d) as predicted in (22): Inverse of critical γ (gridsearched) to retain centralized statistical consistency versus dimension d; 1/γ scales linearly on d.

0 10 20 30 40 50 60 0

Network size m

Communications

complete graph star graph path graph Erd os-R enyi with p = log m

m Erd os-R enyi with p = 0.6

Figure 4: Validating the scaling O(m) of the communication complexity as predicted in (56): Communication rounds versus network size m to achieve centralized statistical accuracy.

Ji, Scutari, Sun and Honnappa

0 0.2 0.4 0.6 0.8 1 1.2

i=1 θt i θ 2

Fully connected graph

d = 360, γ = 8 10 3

d = 800, γ = 4 10 3

d = 1560, γ = 2 10 3

0 0.5 1 1.5 2 2.5

i=1 θt i θ 2

Weakly connected graph

d = 360, γ = 2.48 10 3

d = 800, γ = 1.24 10 3

d = 1560, γ = 6.6 10 4

Figure 5: Linear convergence of Algorithm (28) up to the centralized statistical error: estimation error generated by Algorithm (25) versus iterations (communications), using synthetic data. Left panel: fully connected graph (ρ = 0.4897); Right panel:. Erdös Rényi graph with p = 0.1, (ρ 0.9045). As predicted by our theory, the scaling of γ to recover centralized statistical consistency is γ = Θ(1/d): As d roughly doubles, going from 360 to 800, γ decreases by half. The same scaling is observed when d goes from 800 to 1560, revealing the the speed-accuracy dilemma.

such a number of communications versus m for the aforementioned topologies. Notice that the communications scaling is linear with m for the complete graph and Erdös-Rényi with p = O(log m/m) and p = O(1). Given that we approximately achieve 1/(1 ρ(m)) = O(1) for these three topologies Nedić et al. (2018), this result conﬁrms the validity of (56), which anticipates e O(m) under such settings. Experiment on real data. We test our ﬁndings on the data set eyedata in the Normal Beta-Prime package (Bai and Ghosh, 2019). This data set contains gene expression data of d = 200 genes, and N = 120 samples. Data originate from microarray experiments of mammalian eye tissue samples. We randomly divide the data set into training sample set with size Ntrain = 80 and test data set with size Ntest = 40. We partition the training data into m = 10 subsets. Each agent i owns the data set portion with size 8. We run Monte Carlo simulations, with 30 repetitions. Since we do not have access of the ground truth θ , we replace the ℓ2 statistical error and the ℓ2 optimization error with the MSE errors

MSE 1 m Ntest

i=1 y test ˆyi 2 and MSEt 1 m Ntest

i=1 y test yt i 2, (58)

respectively, where y test is the output of the test set, and ˆyi = Xiˆθi, i [m], are the model forecasts; ˆyt i = Xiθt i, i [m], are output at iteration t. m = 1 corresponds to the centralized case, with ˆy = Xˆθ. Our ﬁrst experiment is meant to check whether the solution of the penalized problem (4) matches the solution of the centralized LASSO via a proper choice of γ. Figure 6 plots the MSE (log scale) vs. γ achieved by Algorithm (28) over a fully connected graph (left panel) and a weakly Erdös-Rényi graph with p = 0.1, resulting in ρ 0.71 (right panel). The results conﬁrm what we have already observed on synthetic data. Our second experiment on real data is to validate the speed-accuracy dilemma, postulated by our theory and already validated on synthetic data (cf. Figure 5). Figure 7 plots the

Distributed Sparse Regression via Penalization

0.1 0.15 0.2 0.25 0.3

i=1 y test ˆyi 2

Ntrain = 80, Ntest = 40, d = 200, s = 5, m = 10, γ = 1 10 5

centralized distributed

0.1 0.15 0.2 0.25 0.3

i=1 y test ˆyi 2

Ntrain = 80, Ntest = 40, d = 200, s = 5, m = 10, γ = 3 10 5

centralized distributed

0.1 0.15 0.2 0.25 0.3

i=1 y test ˆyi 2

Ntrain = 80, Ntest = 40, d = 200, s = 5, m = 10, γ = 5 10 6

centralized distributed

0.1 0.15 0.2 0.25 0.3

i=1 y test ˆyi 2

Ntrain = 80, Ntest = 40, d = 200, s = 5, m = 10, γ = 1 10 5

centralized distributed

Figure 6: MSE deﬁned in (58) associated with the estimator bθ [see (4)] and the centralized LASSO estimator ˆθ [see (2)] versus λ using the data set eyedata in the Normal Beta-Prime package. First row: fully connected graph (ρ = 0.4897); Second row: Erdös-Rényi graph with p = 0.1, (ρ 0.971). Notice that our theory explains the behaviour of the curves only for values of λ 0.15 [as required by (22)].

Ji, Scutari, Sun and Honnappa

centralized MSE

0 1 2 3 4 5

i=1 y test yt i 2

Fully connected graph

centralized MSE

0 200 400 600 800

i=1 y test yt i 2

Zoom in on the fully connected graph

centralized MSE

0 1 2 3 4 5

i=1 y test yt i 2

Weakly connected graph ρ 0.96

centralized MSE

0 200 400 600 800

i=1 y test yt i 2

Zoom in on the weakly connected graph ρ 0.96

Figure 7: Linear convergence of Algorithm (28) up to the centralized statistical error, for diﬀerent values of γ, using the data set eyedata in the Normal Beta-Prime package: MSEt

deﬁned in (58) versus the number of iterations (communications). First row: fully connected graph (ρ = 0.4897); Second row: Erdös-Rényi graph with p = 0.1, (ρ 0.96). Left panel: iterations up to 5 104. Right panel: zoom in on the iterations up to 8 102.

Distributed Sparse Regression via Penalization

log average optimization error versus the number of iterations generated by Algorithm (28), in the same network setting as for Figure 6; diﬀerent curves refer to diﬀerent values of the penalty parameter γ. Since θ is no longer available when using real data, we heuristically set R in the projection (28) as R = max1 i m ˆθi 1. This max-quantity can be obtained locally by each agent by running a min-consensus algorithms, requiring a number of communications of the order of the diameter of the network. The ﬁgure still shows linear convergence up to some tolerance, which is of the order of the MSE error in (58). Even on real data the speed-accuracy dilemma is evident.

6. Concluding Remarks

We studied sparse linear regression over mesh networks. We established statistical and computational guarantees in the high-dimensional regime of a penalty-based consensus formulation and associated distributed proximal gradient method. This is the ﬁrst attempt of studying the behaviour of a distributed method in the high-dimensional regime; our interest in penalty-based formulations to decentralize the optimization/estimation was motivated by their popularity and early adoption in the literature of distributed optimization (lowdimensional regime). We proved that optimal sample complexity O(s log d/N) for the distributed estimator is achievable over networks, even when local sample size is not suﬃcient for statistical consistency. This contrasts with D&C methods which impose a condition on the local samples size (let alone they are readily implementable over mesh networks). On the computational side, such statistically optimal estimates can be achieved by the distributed proximal-gradient algorithm applied to the penalized problem, which converges at linear rate such a rate however scales as O(1/d), no matter how good the network connectivity is, resulting in a total communication cost of O(d2). We claim that this unfavorable communication cost is unavoidable for such penaltybased methods, because they lack of any mechanism mixing directly local gradients (they only average iterates). This raises the question whether communication costs of O(d) are achievable in high-dimension over mesh networks by other distributed, iterative algorithms, yet with no conditions on the local sample size. A ﬁrst study towards this direction is the companion work (Sun et al., 2022), where the projected gradient algorithm (Sun et al., 2019) based on gradient tracking is studied in the high-dimensional setting. The analysis of other distributed methods employing other forms of gradient correction, such as primal-dual method as in (Jakovetić et al., 2011, Shi et al., 2014, Jakovetić, 2019, Jakovetić et al., 2013, Shi et al., 2015a,b) remains an interesting topic for future investigation.

Acknowledgments

The authors would like to acknowledge support for this project from the Oﬃce of Naval Research (ONR), under the Grant # N00014-21-1-2673.

In this appendix we present the proofs of the results in the paper. We will use the same notation as in the paper along with the following additional deﬁnitions.

Ji, Scutari, Sun and Honnappa

Recall the statistical error ˆν ˆθ 1m θ . For any θ Rmd partitioned as θ = [θ1, . . . , θm], with each θi Rd, we deﬁne ν θ 1m θ . (59)

When needed, we decompose θ, and accordingly ν, in its average and orthogonal component

ν = 1m νav + ν , with νav = 1

i=1 νi. (60)

In particular, when θ = ˆθ, we will write for the augmented LASSO error

ˆν ˆθ 1m θ and ˆν = 1m ˆνav + ˆν , (61)

whereas when θ = θt, with θt being the iterates generated by Algorithm (25) , we will write

νt θt 1m θ and νt = 1m νt av + νt . (62)

Finally, the optimization error (along with its decomposition in average and orthogonal component) is denoted by

t θt ˆθ and t = 1m t av + t . (63)

Table 3 below summarizes all the universal constants used in the paper along with their range of values and associated constraints.

universal constant c0 c1 c2 c3 c4 c5 c6 c7

value > 0 > 0 > 0 > 0 max{2, 4 c2 2, 4 c2 2 c 1 3 } > 32 c5/32 1 free

universal constant c8 c9 c10 c11 c12 c13 c14 c15 c16

6 max{128 c1, c5} 1824 3648 c1 max{3648 c1, c5} 2731 c2 1/t0 1152 57

universal constant c17 c18 c19 c20 c21 c22 c23 c24

value 9 72 c1 c17 c1 c2 8 c17/988 8 c1 c17/ c2 8 c2 8/1976 11339 c1 c2 8 21130 c4 > 0

Table 3: Universal constants used in the Appendix.

Distributed Sparse Regression via Penalization

Appendix A. Proof of Proposition 3

For the sake of convenience, let us rewrite the objective function in (4) as

G(θ) = Lγ(θ) + λ

m θ 1, with Lγ(θ) = 1 2N

i=1 yi Xiθi 2 + 1 2mγ θ 2 V . (64)

By the optimality of ˆθ, it follows

G(ˆθ) G(1m θ )

i=1 yi Xiˆθi 2 + 1 2mγ ˆθ 2 V + λ

i=1 yi Xiθ 2 + 1 2mγ 1m θ 2 V | {z } =0 (Assumption 2)

Using yi = Xiθ + wi and the fact that θ is S-sparse, we can write

i=1 Xiˆθi Xiθ 2 2

i=1 w i Xiˆνi + 2λ

i=1 ( (ˆνi)S 1 (ˆνi)Sc 1) 1 mγ ˆθ 2 V .

Using the factorization ˆν = 1m ˆνav + ˆν , the above bounds reads

i=1 Xiˆθi Xiθ 2

i=1 w i Xi(ˆνav + ˆν i) + 2λ

i=1 ( (ˆνav)S + (ˆν i)S 1 (ˆνav)Sc + (ˆν i)Sc 1) 1 mγ ˆθ 2 V

N w Xˆνav + 2

i=1 w i Xiˆν i + 2λ

i=1 ( (ˆνav)S 1 (ˆνav)Sc 1)

i=1 ˆν i 1 1 mγ ˆν 2 V ,

where in (a) we used m P

i=1 w i Xiˆνav = w X ˆνav and ˆθ 2 V = ˆθ 1m θ 2 V = ˆν 2 V .

We bound now the two terms w Xˆνav and m P

i=1 w i Xiˆν i. We have

i=1 Xiˆθi Xiθ 2

Hölder s 2 N w X ˆνav 1 + max i [m] w i Xi 2 N

i=1 ( (ˆνav)S 1 (ˆνav)Sc 1) + 2λ

i=1 ˆν i 1 1 mγ ˆν 2 V

Ji, Scutari, Sun and Honnappa

(10),(13) 3λ (ˆνav)S 1 λ (ˆνav)Sc 1 1 ρ

mγ ˆν 2 + max i [m] w i Xi 2 N + 2λ

| {z } Term II

Finally, we bound Term II. Since we have no sparsity information on ˆν , we can only assert that ˆν i 1

d ˆν i , for all i [m]. Hence,

Term II 1 ρ

mγ ˆν 2 + max i [m] w i Xi 2 N + 2λ

d ˆν i (12) = λ h(γ, ˆν ). (66)

Using (66) in (65), we ﬁnally obtain

i=1 Xiˆθi Xiθ 2 3 (ˆνav)S 1 (ˆνav)Sc 1 + h(γ, ˆν ),

implying 3 (ˆνav)S 1 (ˆνav)Sc 1 + h(γ, ˆν ) 0, which concludes the proof.

Appendix B. Proof of Lemma 4 and Lemma 5

B.1 Proof of Lemma 4

Fix γ (0, (1 ρ)/Lmax] and let Cγ(S). Then, we have

( av)Sc 1 3 ( av)S 1 + h(γ, ),

where h(γ, ) is deﬁned in (12). Substituting the above inequality into the RSC condition (8), yields

1 N X av 2 µ

2 16sτ av 2 τh2(γ, ). (67)

Therefore, for all θ and θ , with θ θ Cγ(S), it holds

2 16sτ (θ θ )av 2 τ h2(γ, (θ θ ) ) Lmax

δ (θ θ )av 2 ξ h2(γ, (θ θ ) ), (68)

where we used γ (0, (1 ρ)/Lmax], and set δ = µ/2 16sτ, ξ = τ. This proves (15).

B.2 Proof of Lemma 5

Let X RN d be a design matrix satisfying Assumption 1. The RSC condition (Raskutti et al., 2010, Theorem 1) implies that there exist c0, c1 > 0, such that for all av Rd,

1 N X av 2 1

2 Σ 1 2 av 2 c1ζΣ log d

N av 2 1 (69)

Distributed Sparse Regression via Penalization

holds with probability at least 1 exp( c0N). Furthermore, by condition (c) of X, we have

Σ 1 2 av 2 λmin(Σ) av 2. (70)

Let Cγ(S), that is,

( av)Sc 1 3 ( av)S 1 + h(γ, ). (71)

Substituting (70) and (71) into (69), yields

2 32sc1ζΣ log d

av 2 2c1ζΣ log d

4 av 2 λmin(Σ)

64s h2(γ, ), for N 128s c1ζΣ log d

The proof follows using the above bound in (14) along with γ (0, (1 ρ)/Lmax].

Appendix C. Proof of Theorem 6

Our starting point toward the upper bound on the average LASSO error (1/m) Pm i=1 ˆνi 2 is lowerand upper-bounding the average of local errors (1/N) Pm i=1 Xiˆνi 2 while decomposing ˆθ in its average component and orthogonal one. This decomposition is instrumental to separate in the desired ﬁnal bound a term of the same order of the centralized LASSO error from the (additive) perturbation due to the lack of exact consensus.

Step 1: Establishing the upper bound of (1/N) m P

i=1 Xiˆνi 2.

We start with the optimality condition of Problem (4). By optimality of ˆθ, it follows that

Xiˆθi yi Xiˆνi λ

m( 1m θ 1 ˆθ 1) + 1 2mγ ( 1m θ 2 V | {z } =0 (Assumption 2)

ˆθ 2 V ). (72)

We can then write

i=1 Xiˆνi 2

i=1 (yi Xiθ ) Xiˆνi + 2λ

m ( 1m θ 1 ˆθ 1) 1 mγ ˆθ 2 V 1

i=1 Xiˆνi 2.

Using yi = Xiθ + wi and the fact that θ is S-sparse, we can write

i=1 Xiˆνi 2 2

i=1 w i Xiˆνi + 2λ

i=1 ( (ˆνi)S 1 (ˆνi)Sc 1) 1 mγ ˆθ 2 V 1

i=1 Xiˆνi 2.

Introducing the decomposition ˆν = 1m ˆνav + ˆν , the above bound reads

i=1 Xiˆνi 2

Ji, Scutari, Sun and Honnappa

i=1 w i Xi(ˆνav + ˆν i) + 2λ

i=1 ( (ˆνav)S + (ˆν i)S 1 (ˆνav)Sc + (ˆν i)Sc 1) 1 mγ ˆθ 2 V

N w Xˆνav + 2

i=1 w i Xiˆν i + 2λ

i=1 ( (ˆνav)S 1 (ˆνav)Sc 1)

i=1 ˆν i 1 1 mγ ˆν 2 V ,

where in (a) we used m P

i=1 w i Xiˆνav = w X ˆνav and ˆθ 2 V = ˆθ 1m θ 2 V = ˆν 2 V .

We bound now the two terms w Xˆνav and m P

i=1 w i Xiˆν i. We have

i=1 Xiˆθi Xiθ 2

Hölder s 2 N w X ˆνav 1 + max i [m] w i Xi 2 N

i=1 ( (ˆνav)S 1 (ˆνav)Sc 1) + 2λ

i=1 ˆν i 1 1 mγ ˆν 2 V

(10),(13) 3λ (ˆνav)S 1 λ (ˆνav)Sc 1 1 ρ

mγ ˆν 2 + max i [m] w i Xi 2 N + 2λ

i=1 ˆν i 1.

We further relax the bound by dropping λ (ˆνav)Sc 1 and enlarging (ˆνav)S 1 ˆνav 1 while revealing the term 9λ2s

2δ which is of the order of the centralized LASSO error:

i=1 Xiˆνi 2

δ 2 ˆνav 1 ρ

mγ ˆν 2 + max i [m] w i Xi 2 N + 2λ

N + ξh2(γ, ˆν ) 1 ρ

mγ ˆν 2 + max i [m] w i Xi 2 N + 2λ

Step 2: Establishing the lower bound of (1/N) m P

i=1 Xiˆνi 2.

Invoking the decomposition ˆνi = ˆνav + ˆν i, i [m], along with the Young s inequality, we can write

i=1 Xi(ˆνav + ˆν i) 2 1

N Xˆνav 2 2

i=1 Xiˆν i 2

2[δ ˆνav 2 ξh2(γ, ˆν )] + 1

2N Xˆνav 2 2

i=1 Xiˆν i 2.

Distributed Sparse Regression via Penalization

Step 3: Lower bound Upper bound. Chaining (73) and (74) while adding δ 2m ˆν 2 on both sides, yield

1 2δ ˆνav 2 + δ

2δ + ξh2(γ, ˆν ) + 2Lmax

ˆν 2 + max i [m] w i Xi 2 N + 2λ

2δ + ξh2 max + 2Lmax

ˆν 2 + max i [m] w i Xi 2 N + 2λ

md ˆν | {z } h1(γ, ˆν )

where in the last inequality we used Lmax = max i [m] λmax(X i Xi/n) [cf. (5)], and the following

upper bound for h(γ, ˆν )

h(γ, ˆν ) hmax dγ λ(1 ρ)

maxi [m] w i Xi n + λ 2 . (76)

Under the condition on γ as in (18), h1(γ, ˆν ) is a quadratic function of ˆν opening downward, and it can be upper bounded over R+ as

h1 max 2dγ(maxi [m] w i Xi /n + λ)2

2(1 ρ) 4Lmaxγ δγ . (77)

Using (77) in (75), we ﬁnally obtain (19).

Appendix D. Proof of Theorem 7

The proof builds on the following four steps: 1) We ﬁrst consider as source of randomness only the design matrix X (cf. Assumption 1) while keeping w ﬁxed, deriving a high-probability bound for Lmax in (5); 2) We then ﬁx X and consider the randomness coming from the noise w, providing high-probability bounds for the noise-dependent terms X w /N and max1 i m X i wi /n; 3) We then combine the previous two results via the union bound and establish a lower bound on λ for (13) to hold with high probability; 4) Finally, we use the bound in 3) to obtain the ﬁnal error bound on the ℓ2-LASSO error. Let P be a probability measure on the product sample space RN d RN. For brevity, we use the same notation for the marginal distributions on RN d and RN. Step 1: Randomness from X. We deﬁne three good events so that the largest eigenvalue of (1/n)X i Xi, smallest eigenvalue of (1/N)X X and the norm of the columns of X are well-controlled. We prove next that these events jointly occur with high probability. Speciﬁcally, let

A1 X RN d Lmax c4λmax(Σ) 1 + d + log m

Ji, Scutari, Sun and Honnappa

A2 X RN d X satisﬁes (17) ,

A3 X RN d max j=1,...,d 1

where c4 > 0 is a universal constant (see (84)), and we recall from (5) and (9) that Lmax maxi [m] λmax(X i Xi/n), and ζΣ max i [d] Σii, respectively. We proceed to bounding P(A1),

P(A2), and P(A3). (i) Bounding P(A1): Recall that X = [X 1 , . . . , X m] , and X satisﬁes Assumption 1. Thus, {Xi}i [m] are i.i.d random matrices, with i.i.d. rows drawn from N(0, Σ). By (Vershynin, 2012, Remark 5.40) it follows that the following holds with probability at least

1 2 exp{ c3t2}, (80)

for all t 0 1 n X i Xi Σ max{a, a2} Σ , where a c2

with constants c3 and c2 > 0. Given (81) and using the triangle inequality, we have 1

n X i Xi Σ + Σ λmax(Σ) max{a, a2} + λmax(Σ). (82)

Applying the union bound we obtain the following bound for Lmax

P Lmax λmax(Σ)(1 + max{a, a2}) 1 m 2 exp{ c3t2}. (83)

Setting t = q

d + c 1 3 log m, yields

d + c 1 3 log m n

Therefore, we conclude

Lmax λmax(Σ) 1 + a + a2 2λmax(Σ) 1 + a2 c4λmax(Σ) 1 + d + log m

with probability at least 1 2 exp( c3d) and

c4 = max{2, 4 c2 2, 4 c2 2 c 1 3 } 2. (85)

(ii) Bounding P(A2): It follows readily from Lemma 5: if N 128s c1ζΣ log d

λmin(Σ) and γ > 0,

P(Ac 2) exp( c0N). (86)

(iii) Bounding P(A3): Recall Assumption 1. It follows that Xej is an isotropic Gaussian random vector in RN with N(0, Σjj) entries. Hence, Xej 2/Σjj is a chi-squared random

Distributed Sparse Regression via Penalization

variable with degree N. Then, applying the standard bound for chi-squared random variables (Wainwright, 2019, Example 2.11) we have

2 1 t 2 exp( Nt2/8), for all t (0, 1). (87)

Taking t = 1

2 in (87) and applying the union bound, we obtain

P max j [d] 1 N

2 exp( N/32 + log d). (88)

Therefore, for all N c5 log d, with c5 > 32, we have

P max j [d] Xej 2

1 2 exp[ ( c5/32) log d + log d]

=1 2 exp( c6 log d), where c6 = c5/32 1 > 0. (89)

Combining the conditions on N, we have

N c9sζΣ log d

(a) max 128s c1ζΣ log d

λmin(Σ) , c5 log d , (90)

where c9 = max{128 c1, c5}, and in (a) we used s 1, ζΣ λmin(Σ). Finally, we combine (84), (86), and (89); using the union bound again, we have

P(Ac 1 Ac 2 Ac 3) P(Ac 1) + P(Ac 2) + P(Ac 3) 2 exp( c3d) + exp( c0N) + 2 exp( c6 log d).

Deﬁne A A1 A2 A3,

P(A) 1 2 exp( c3d) exp( c0N) 2 exp( c6 log d). (91)

Step 2: Randomness from w. We start with bounding X w . For ﬁxed X A, and w N(0, σ2IN), recall X = [X 1 , . . . , X m] , and w = [w 1 , . . . , w m] , where for each agent i [m], Xi Rn d is the design matrix, wi Rn is observation noise. Then, for any i [m] and j [d],

X A N 0, σ2

and w i Xiej

X A N 0, σ2

max i [m] max j [d] Xiej 2

N max j [d] 1 m

n = max j [d] Xej 2

due to 1 m Pm i=1 Xiej 2

N . By deﬁnition, for all X A A3, 2 Xej 2/(3ζΣN) 1 and, by (93), 2 Xiej 2/(3ζΣN) 1. Therefore, combining it with (92), we obtain r 2

X A N 0, σ2

, where 2 Xej 2

3ζΣN 1; and r 2

X A N 0, σ2

, where 2 Xiej 2

3ζΣN 1. (94)

Ji, Scutari, Sun and Honnappa

Denote p X(x) and p Xi(xi) as the density of p

2/(3ζΣ)w Xej/N and p

2/3ζΣw i Xiej/N, respectively. Let Z N(0, σ2/N), with density function p Z(z). Since 2 Xej 2/(3ζΣN) 1 and 2 Xiej 2/(3ζΣN) 1, we conclude p X(0) p Z(0) and p Xi(0) p Z(0). (Horn, 1988, Theorem 1) implies

st |Z|, as well as

X A P(|Z| x) 2 exp Nx2

Notice that X w /N = maxj [d]|w Xej|/N. Hence, setting x = σ p

t0 log d/N, with t0 > 2, the union bound implies

X A 2 exp 1

2(t0 2) log d . (96)

D1 w RN X w

We have P(D1 | X A) 1 2 exp 1

2(t0 2) log d . Combining it with (91), yields

=P(D1 | A)P(A)

[1 2 exp{ [(t0 2) log d]/2}][1 2 exp( c3d) exp( c0N) 2 exp( c6 log d)]

1 2 exp( c3d) exp( c0N) 2 exp( c6 log d) 2 exp{ [(t0 2) log d]/2.

It remains to bound maxi [m] X i wi . Since Xi, i [m], are independent, the columns of Xi are n dimensional i.i.d Gaussian random vectors, each element has variance at most ζΣ, and the elements of wi N(0, σ2In). Then each element of X i wi is the sum of n independent sub-exponential random variables with sub-exponential norm at most σ ζΣ. Applying random matrix theorem (Vershynin, 2012, Proposition 5.16) and the union bound, we obtain

n max i [m] X i wi t 1 2 exp c24 min t2

σ2ζΣ , t σ ζΣ

n + log md , t 0,

for some c24 > 0. Thus, under 2log md c24n and t = σ q

1 n max i [m] X i wi σ

1 2 exp c24 min 2σ2ζΣlog md

c24nσ2ζΣ , σ 2ζΣ log md c24nζΣσ

Distributed Sparse Regression via Penalization

1 2 exp ( log d) , (98)

while, under 2log md > c24n and t = 2σ ζΣ log md

n c24 , it holds

n max i [m] X i wi 2σ ζΣ log md

1 2 exp c24 min 4σ2ζΣlog2 md

c2 24n2σ2ζΣ , 2σ ζΣlog md

1 2 exp ( log d) . (99)

Combining (98) and (99), we have

1 n max i [m] X i wi σ p

1 4 exp ( log d) . (100)

D2 w RN 1 n max i [m] X i wi σ p

and D D1 D2. Then, chaining (91), (100), and (96), we ﬁnally get

1 2 exp( c3d) exp c0 c9sζΣ log d

2 exp( c6 log d) 2 exp{ [(t0 2) log d]/2

4 exp( log d)

1 11 exp( c8 log d), (101)

where c8 = min{1, c3, c6, (t0 2)/2, c0 c9}. Step 3: Suﬃcient condition on λ for (13) to hold with high probability. We ﬁrst recall (13) for convenience.

Combining it with the high probability upper bound for X w /N in (96) (Step 2), we conclude that, if λ satisﬁes

6, then (13) holds with probability at least (101). Step 4: Bounding the statistical error under (22). Recall the deterministic error bounds in Theorem 6: for any ﬁxed λ and γ satisfying (18),

Ji, Scutari, Sun and Honnappa

δ2 + 2ξd2γ2

maxi [m] w i Xi n + λ 4 + 4dγ(maxi [m] w i Xi /n + λ)2

δ[2(1 ρ) 4Lmaxγ δγ] .

In Step 3, we provided a suﬃcient condition on λ to guarantee (13) holds with probability at least as (101). Now we proceed to provide a suﬃcient condition on γ, not only to guarantee γ 2(1 ρ)/(4Lmax + δ), but also contribute to restricting the error term above within the centralized statistical error, which is of the order O(λ2s). By Lemma 5, if N 128s c1ζΣ log d/λmin(Σ), with probability at least 1 exp( c0N), the in-network RE condition holds with δ = λmin(Σ)/4 and ξ = λmin(Σ)/(64s). Combining this with the high probability upper bound derived on Lmax in (84) and the high probability upper bound derived for maxi [m] w i Xi /n in (100), we have

i=1 ˆνi 2 144λ2s

λmin(Σ)2 + d2γ2 σ ζΣ min n 2 log md

8λ2s(1 ρ)2 | {z } Term I

+ 8dγ σ ζΣ min n 2 log md

λmin(Σ) 1 ρ 4γ c4λmax(Σ) 1 + d+log m

| {z } Term II

with probability larger than (101). It remains to prove that condition (22) on γ is suﬃcient for Term I and Term II to be within O(λ2s). Notice that

Term I Term II2 λmin(Σ)2

512λ2s . (103)

Thus it is suﬃcient to bound only Term II. Enforcing Term II c7λ2s/λmin(Σ)2, where c7 is a numerical constant, we derive the following suﬃcient condition on γ to ensure Term II c7λ2s/λmin(Σ)2:

λ min n 2 log md

o + 1 i2 + 4 c4λmax(Σ)[1 + d+log m

n ] + λminΣ

(104) Thus, under (104), we have

Term II c7 λ2s λmin(Σ)2 Term I c2 7λ4s2

λmin(Σ)4 λmin(Σ)2

512λ2s = c2 7λ2s 512λmin(Σ)2 .

Therefore, the ﬁnal statistical error satisﬁes

i=1 ˆνi 2 144 + c7 + c2 7 512

c2 8σ2ζΣt0 λmin(Σ)2 s log d

N = c6 σ2ζΣt0 λmin(Σ)2 s log d

Distributed Sparse Regression via Penalization

where c6 144 + c7 + c2 7 512

c2 8. Since λ satisﬁes (102), we have

c24t0 log d . (105)

Substituting (105) into (104), we have the following suﬃcient condition for (104)

γ 8n(1 ρ) c7s

32 c4 c7sλmax(Σ)[n + (d + log m)] + λmin(Σ)n 64d h 1 c8

2m log md c24t0 log d + 1 i2 + c7s .

Notice that

c24t0 log d + 1

128d m(log m + 1)

3 c24 + 1 .

In addition,

s n(1 ρ) 4s c4λmax(Σ)[n + (d + log m)] + λmin(Σ)n [16d [m(log m + 1)/(3 c24 c7) + 1] + s/8]

c5(1 ρ) λmax(Σ)(d + log m) + λmin(Σ)dm(log m + 1), (106)

where c5 1/ max{8 c4, 32/( c24 c7)} Hence, (22) is suﬃcient for (104).

Appendix E. Proof of Lemma 8

Using (60), each ˆθi 1 can be bounded as

ˆθi 1 ˆθi θ 1 + θ 1 = ˆνav + ˆν i 1 + θ 1 ˆνav 1 +

d ˆν i + θ 1. (107)

We bound next ˆνav 1. By Proposition 3, any solution ˆθ of (4) satisﬁes

(ˆνav)Sc 1 3 (ˆνav)S 1 + h(γ, ˆν ).

ˆνav 1 4 (ˆνav)S 1 + h(γ, ˆν )

4 s ˆν m + h(γ, ˆν )

δ2 + 2τd2γ2(maxi [m] w i Xi + λn)4

δλ2n4(1 ρ)2 + 4dγ(maxi [m] w i Xi + λn)2

δn2[2(1 ρ) 4Lmaxγ δγ]

+ h(γ, ˆν )

δ dγ(maxi [m] w i Xi + λn)2

64sdγ(maxi [m] w i Xi + λn)2

δn2[2(1 ρ) 4Lmaxγ δγ]

Ji, Scutari, Sun and Honnappa

+ h(γ, ˆν ), (108)

where in (a) we used Theorem 6 and the fact that the RSC (8) implies the in-network RE (15) [cf. Lemma 4], with ξ = τ and δ = µ/2 16sτ > 0. Substituting (108) in (107) yields

δ dγ(maxi [m] w i Xi + λn)2

64sdγ(maxi [m] w i Xi + λn)2

δn2[2(1 ρ) 4Lmaxγ δγ]

+ h(γ, ˆν ) +

d ˆν | {z } h2(γ, ˆν )

δ dγ(maxi [m] w i Xi + λn)2

64sdγ(maxi [m] w i Xi + λn)2

δn2[2(1 ρ) 4Lmaxγ δγ]

+ dγ(2 maxi [m] w i Xi + (2 + m)λn)2

4λn2(1 ρ) + θ 1

δ dγ(maxi [m] w i Xi + λn)2

λn2(1 ρ) | {z } Term I= hmax

dγ(maxi [m] w i Xi + λn)2

n2[2(1 ρ) 4Lmaxγ δγ] | {z } Term II

+ dγ(maxi [m] w i Xi + 2 mλn)2

λn2(1 ρ) | {z } Term III

+ θ 1, (109)

where in (a) we bounded h2(γ, ) on R as

h2(γ, ˆν ) (12) = 1 ρ

λmγ ˆν 2 + 2 max i [m] w i Xi /(λn) + 2 p

dγ 2 maxi [m] w i Xi + (2 + m)λn 2

4λn2(1 ρ) ;

and in (b) we enlarged 2 + m 4 m. We bound now Term I Term III using condition (26) on γ. We have the following:

hmax = Term I λs 128 δ, (110)

dγ(maxi [m] w i Xi + 2 mλn)2

n2[2(1 ρ) 4Lmaxγ δγ]

Term III λ s 128 δ.

Substituting the above bounds in (109) we obtain

δ λs 128δ + λs

Distributed Sparse Regression via Penalization

(27) (1 r) R + r R = R.

This completes the proof. .

Appendix F. Proof of Lemma 9

Since Lγ(θ) 1

m Pm i=1 fi(θi) + 1 2mγ θ 2 V , we have

f1(θt 1) ... fm(θt m)

+ 1 mγ ((I W) Id) θt.

Substituting the expression of Lγ(θt) into Problem (25), it is not hard to see it is separable in the θi s, and the update of θi given as

θt+1 i = arg min θi 1 R

θi θt i + β fi(θt i) + β

j=1 wijθt j

The problem boils down to solving

min θi θi ψt i 2 + λ θi 1

s.t. θi 1 R, (111)

with λ 2βλ and ψt i deﬁned in (28). To solve (111) we ﬁrst drop the constraint θi 1 R. The minimizer of the objective function is given by

θi = prox λ

2 1(ψt i). (112)

Note that θi can be computed in closed form by soft-thresholding ψt i. Case 1: θi satisﬁes the constraint in (111), i.e., θi 1 R. We conclude that θi is a solution of (111). Case 2: θi violates the constraint in (111), i.e., θi 1 > R. Then, the constraint must be active at the optimal point of (111). Hence, Problem (111) is equivalent to

min θi θi ψt i 2

s.t. θi 1 = R, (113)

where we dropped the term λ θi 1 in the objective function, since it is constant on the constraint set. Since (112) can be computed in closed form by soft-thresholding ψt i, we conclude ψt i 1 θi 1 > R, and thus the convex problem with constraint (113) is equivalent to min θi θi ψt i 2

s.t. θi 1 R. (114)

Combining the two cases completes the proof.

Ji, Scutari, Sun and Honnappa

Appendix G. Proof of Theorem 10

Recall the factorization of the objective function by G and L as introduced in (64)

G(θ) = Lγ(θ) + λ

m θ 1, with Lγ(θ) = 1 2N

i=1 yi Xiθi 2 + 1 2mγ θ 2 V .

We begin (Step 1) proving a weaker result than Theorem 10, that is, linear convergence of the error G(θt) G(ˆθ), up to the tolerance as on the RHS of (36) this is Theorem 15 below. Then (Step 2), leveraging the curvature property of G along the trajectory of the algorithm (see Lemma 17 in Appendix J.1), we transfer the rate decay of G(θt) G(ˆθ) on that of the iterates error θt ˆθ , which completes the proof of Theorem 10.

Step 1: On linear convergence of the optimality gap G(θt) G(ˆθ). Recall the deﬁnition of ε2 stat and µav as given in (30) and (29), respectively.

Theorem 15 Instate the setting of Theorem 10. There holds:

G(θt) G(ˆθ) α2, (115)

for any tolerance parameter α2 such that

α2 4sτ ε2 stat, (116)

and for all

t log2 log2

1 + Lmax log 2

µav + (1 + ρ) log 2

µav + 1 + ρ

log η0 G α2

Furthermore, the interval in (116) is nonempty.

Proof See Appendix J.

Step 2: On linear convergence of the optimality gap θt ˆθ . We can now proceed to prove Theorem 10. Given Theorem 15, it is suﬃcient to show that (36) holds. Recall the shorthand for the optimization error, t = θt ˆθ. At high-level the idea is to construct a lower bound of G(θt) G(ˆθ) as a function of t 2 by exploiting, under the RSC condition (8), the curvature property of G along a restricted set of directions. Speciﬁcally, we use the following curvature property proved in Lemma 17 (cf. Section J.1), which holds under the more stringent setting of Theorem 151: for all t T,

µav t av 2 f( t ) G(θt) G(ˆθ) + τ

4(v2 + 8h2 max), (118)

1. Speciﬁcally, in the proof of Theorem 15, we showed that condition on R as in (34) is more stringent than (27) in Lemma 17 see Fact 1 in Appendix J.2.

Distributed Sparse Regression via Penalization

where f( t ) is deﬁned as [cf. (14)],

f( t ) = Lmax

v2 is given by [cf. (143)]

v2 = 144s ˆνav 2 2 + 4 min 2η

λ , 2R 2 , with η = α2, (119)

and hmax is deﬁned as [cf. (76)]

hmax = dγ λ(1 ρ)

maxi [m] w i Xi n + λ 2 .

We proceed now to bound the LHS and RHS of (118). The goal is to lower bound the LHS by a quantity proportional to t 2, so that (118) will provide the desired bound of t 2 in terms of the optimization gap G(θt) G(ˆθ) (up to a tolerance). The following bound of f( t ), which is a consequence of (170) serves the scope:

We also upper bound the RHS of (118) to further simplify the ﬁnal expression; speciﬁcally, we use

hmax (110) λs 64(µ 32sτ) (120)

144s ˆνav 2 + 4 min 2α2

λ , 2R 2 144s ˆν 2

where the inequality follows from ˆνav 2 ˆν 2/m and the fact that α2 Rλ/4 [cf. (37)]. Using the above bounds along with (115) in (118) yield: for all t T,

m + 8 λs 64(µ 32sτ)

α2 + 36sτ ˆν 2

1976µ2 + 4τα4

where the last inequality follows from µ c10sτ with c10 = 1824. This proves (36).

Appendix H. Proof of Theorem 13

Given the postulated random data model and the conclusions of Theorem 10, it is suﬃcient to prove the following: Step 1: Under (41) on λ, condition (31) holds with high-probability; Step 2: Condition (42) on γ is suﬃcient for (32) to hold with high probability; Step 3: Under condition (40) on N, any R in (43) satisﬁes also (34); furthermore, the interval of values of R in (43) is nonempty; Step 4: Any α2 in (46) satisﬁes (37) with high-probability;

Ji, Scutari, Sun and Honnappa

and the range for α2 in (46) is nonempty with high-probability; Step 5: Given the bound on the statistical error as in (36) and for all t satisfying (38), we conclude that (45) holds, for all t satisfying (47), with high-probability.

Step 1: Suﬃcient condition on λ for (31) to hold with high probability. To prove this result, we follow a similar path as introduced in the proof of Theorem 7.

(i) Randomness from X. This step is the same as Step 1 in the proof of Theorem 7 (cf. Appendix D), except for the deﬁnition of the event A2 replaced here with A 2, deﬁned as

A 2 X RN d X satisﬁes RSC (8) with parameters (µ, τ) = λmin(Σ), 2c1ζΣ log d

(122) where ζΣ = maxi [d] Σii. Lemma 2 implies

P(A 2) 1 exp( c0N). (123)

Deﬁne A = A1 A 2 A3, where A1 and A3 are deﬁned in (78) and (79) (cf. Appendix D), respectively, and recall here for convenience

A1 X RN d Lmax c4λmax(Σ) 1 + d + log m

A3 X RN d max j=1,...,d 1

Then, similar to under condition (90), there holds (91), since N satisﬁes (40) with c12 = max{ c9, c11} c9 = max{128 c1, c5}, and c11 = 3648 c1, we have

P(A ) 1 2 exp( c3d) exp( c0N) 2 exp( c6 log d). (124)

(ii) Randomness from w. This step follows Step 2 as in the proof of Theorem 7 (cf. Appendix D) and thus is not duplicated. In particular, recalling the deﬁnitions of D1, D2, and D therein for convenience

w RN maxi [m] X i wi n σ p

and D D1 D2. The following the same reasoning as in Step 2 of the proof of Theorem 7, we have, for all t0 2,

P(A D) 1 11 exp( c8 log d). (125)

(iii) Suﬃcient condition on λ for (31) to hold with high probability. Recall (31) for

convenience,

λ max 2 X w

N , 64τ θ 1

Distributed Sparse Regression via Penalization

Combining it with the high probability upper bound for X w /N derived in (125), we

conclude the following: suppose λ σ q

6ζΣt0 log d

N , then for any tuple (X, w) A D, and

any t0 > 2,, since 2 X w /N σ q

6ζΣt0 log d

N , it follows that λ 2 X w /N. That is,

P(A D) (125) 1 11 exp( c8 log d).

Furthermore, for any tuple (X, w) A D, (122) implies τ = 2c1ζΣ log d/N. Therefore it follows that if λ 128s c1ζΣ log d

N , then 64τ θ 1. Using (41), we conclude that for any t0 > 2,

λ c11 max σ

N , sζΣ log d

with c11 = max{

6, 128 c1}, is suﬃcient for (31) to hold with probability at least (125).

Step 2: (42) is suﬃcient for (32) to hold with high probability. Recall (32) for convenience,

γ 1 ρ 2Lmax + (µ/2 16sτ) 1 + 128(d/s)(maxi [m] w i Xi /(λn) + 2 m)2 .

In order to derive a suﬃcient condition on γ to ensure (32) holds with high probability, we leverage Step 1 (i) above, where we derived high probability bounds for Lmax, maxi [m] w i Xi /n, and λ. Speciﬁcally, substituting into (32) the bounds on Lmax [as in (84)], maxi [m] X i wi /n [as in (125)], and the explicit expression of the RSC parameters (µ, τ) [as in (122)], we conclude that if

2 c4λmax(Σ) 1 + d+log m

n + λmin(Σ)

2 32s c1ζΣ log d

N h 1 + 128d

s (g(m, d) + 2 m)2i,

where g(m, d) = σ

λ ζΣ min n 2 log md

o , then (32) holds with probability at least (24). We proceed by showing that (42) is suﬃcient for (126). Speciﬁcally, since

g(m, d) = σ

m log md 3t0 c24 log d,

and λmin(Σ)

2 32s c1ζΣ log d

we obtain the more conservative condition

2 c4λmax(Σ) 1 + d+log m

n + λmin(Σ)

m log md 3t0 c24 log d + 2 m 2 . (127)

Ji, Scutari, Sun and Honnappa

We proceed to further simplify (127). Notice that

m log md 3t0 c24 log d + 2 m

!2 (a) 1 + 128d

3 c24 (2 log m + 1) + 8 m

(b) 256dm [(2 log m + 1) / c24 + 8] , (128)

where (a) is due to 1 2 log d, for d 2 and t0 2; and in (b) we upper bound both terms by 128d [(2 log m + 1) / c24 + 8] . Using (128) and further simpliﬁcation, we have

2 c4λmax(Σ) 1 + d+log m

n + 128λmin(Σ)dm [(2 log m + 1) / c24 + 8]

c12(1 ρ) λmax(Σ) (d + log m) + λmin(Σ)dm (log m + 1), (129)

where c12 1 max{4 c4,512/ c24,2048}. Hence, Hence, under (42), (32) holds with probability at least (24). Step 3: Ensuring there exists an R fulﬁlling (34). Substituting in (34) the explicit expression of the RSC parameters (µ, τ) [under the event in (122)] as well as θ 1 = s, we conclude that (34) holds with probability at least 1 exp( c0N), whenever R satisﬁes display (43),

max 56λs λmin(Σ) 64s c1ζΣ log d/N , 2s R λN 64 c1ζΣ log d.

We now show that the interval (43) is non-empty. Since N satisﬁes (40) with c10 = c12 = max{ c5, 3648 c1} there holds

56λs λmin(Σ) 64s c1ζΣ log d/N λN 64 c1ζΣ log d. (130)

Furthermore, (41) [Step 1 (iii)] implies

2s λN 64 c1ζΣ log d. (131)

By (130) and (131), we infer that (43) is non-empty. Step 4: (46) is suﬃcient for (37) to hold with high-probability. Substituting in (37) the explicit expression of the RSC parameters (µ, τ) [under the event (122)], we conclude that (46) is in fact suﬃcient for (37) to hold with probability at least (125). It remains to prove that the (random) interval (46) is non-empty with high probability, which we do next. To this end, we upper bound the statistical error Pm i=1 ˆθi θ 2/m, under (37). Recall that, with probability at least (125), (32) holds. Therefore, we can invoke Theorem 6 to bound the statistical error, and write: with probability at least (125),

i=1 ˆθi θ 2

Distributed Sparse Regression via Penalization

Theorem 6 9λ2s

δ d2γ2(maxi [m] w i Xi + λn)4

λ2n4(1 ρ)2 | {z } (Term I)2 in (109)

δ dγ(maxi [m] w i Xi + λn)2

n2[2(1 ρ) 4Lmaxγ δγ] | {z } (Term II)2 in (109)

δ λ2s 256 δ

(a) = 9λ2s (µ/2 16sτ)2 + 2τ µ/2 16sτ

λs 128 (µ/2 16sτ)

2 + 4 µ/2 16sτ λ2s 256 (µ/2 16sτ)

(122) 1 (λmin(Σ) 64s c1ζΣ log d/N)2 36λ2s + 16

1282 2s c1ζΣ log d

N λ2s (λmin(Σ) 64s c1ζΣ log d/N) + 16

where in (a) we used ξ = τ and δ = µ/2 16sτ > 0 (due to Lemma 4). Thus, with probability at least (125), we can upper bound the lower interval bound in (46) by

8 36s c1ζΣ log d N(λmin(Σ) 64s c1ζΣ log d/N)2

1282 2s c1ζΣ log d

N λ2s (λmin(Σ) 64s c1ζΣ log d/N)

256λ2s + 8s c1ζΣ log d

N λ2s 1976(λmin(Σ) 64s c1ζΣ log d/N)2

288s c1ζΣ log d N(λmin(Σ) 64s c1ζΣ log d/N)2

s c1ζΣ log d

512N λ2s (λmin(Σ) 64s c1ζΣ log d/N) + 37λ2s .

Using the bound on N given by (40)

N c12sζΣ log d

λmin(Σ) , with c12 = max{3648 c1, c5},

we obtain 64s c1ζΣ log d/N λmin(Σ)/57. Substituting into the inequality above we have

8s c1ζΣ log d

i=1 ˆθi θ 2 + λ2s 1976λmin(Σ)2

288s c1ζΣ log d N(λmin(Σ) 64s c1ζΣ log d/N) 56

57λmin(Σ) λ2s + 37λ2s .

Applying again the lower bound on N, we further get

N(λmin(Σ) 64s c1ζΣ log d/N)

max c12sζΣ log d

λmin(Σ) (λmin(Σ) 64s c1ζΣ log d/N), N 56

57λmin(Σ) ,

8s c1ζΣ log d

i=1 ˆθi θ 2 + λ2s 1976λmin(Σ)2

Ji, Scutari, Sun and Honnappa

288s c1ζΣ log d

56 57λmin(Σ)

38λ2s min λmin(Σ) c12sζΣ log d(λmin(Σ) 64s c1ζΣ log d/N) 1, 57

56λmin(Σ) 1

min 4(λmin(Σ) 64s c1ζΣ log d/N) 1λ2s, 11339 s c1ζΣ log d

λmin(Σ)2 λ2s min λR

The last inequality follows from the conditions on R and η0 G given by (43) and (44), respectively.

Step 5: (45) holds, for all t satisfying (47), with high probability. Building on the conclusions of the previous steps and Theorem 10, to prove the statement of this step, it is suﬃcient to show that the RHS of (45) [resp. of (47)] is an upper bound of the RHS of (36) [resp. (38)] that holds with high probability. We begin with the RHS of (36): with probability at least (125), there holds,

1 µ/8 8sτ α2 + 36sτ ˆν 2

m(µ/8 8sτ) + τsλ2s 1976µ2(µ/8 8sτ) + 4τα4

λ2(µ/8 8sτ)

(a) 456 55λmin(Σ)

α2 + 72s c1ζΣ log d

i=1 ˆθi θ 2 + s c1ζΣ log d

988N λ2s λmin(Σ)2

+8s c1ζΣ log d

c16 λmin(Σ)

α2 + sζΣ log d

i=1 ˆθi θ 2 + λ2s λmin(Σ)2 + α4

where in (a) we use the following fact [which holds with probability at least (125)]

8 8sτ = λmin(Σ)

8 16s c1ζΣ log d

(40) λmin(Σ)

8 16s c1ζΣ log dλmin(Σ)

3648 c1sζΣ log d = 55λmin(Σ)

Next, we bound the RHS of (38), invoking the high probability bound for Lmax [as in (84)] and the explicit expression of the RSC parameters (µ, τ) [under (122)]. We have the following log2 log2

1 + Lmax log 2

µav + (1 + ρ) log 2

µav + 1 + ρ

log η0 G α2

1 + λmax(Σ)

λmin(Σ) 456 c4[1 + (d + log m)/n] log 2

55 + 456(1 + ρ) log 2

λmin(Σ) 456 c4[1 + (d + log m)/n]

55 + 456(1 + ρ)

log η0 G α2

c18 4 max 456 c4 log 2

55 , 456 log 2

= 1824 c4 log 2

Then, the RHS of (132) can be bounded as

+ log η0 G α2

κΣ(d + log m) + (1 + ρ) λmin(Σ)γ

This completes the proof.

Distributed Sparse Regression via Penalization

Appendix I. Proof of Corollary 14

The corollary is a customization of Theorem 13, under the (feasible) choices of N, λ and γ as in the statement of the corollary. Step 1: On the choices of λ and γ. We show that, under (48), (49) and (50) are special instances of (41) and (42), respectively. Since N satisﬁes (48), with c12 = max{3648 c1, c5} and c13 = 2731 c2 1/t0, there holds

6ζΣt0 log d

N 128 c1ζΣ s log d

Therefore, (41) reduces to

6ζΣt0 log d

which is satisﬁed by the choice of λ as in (49), with c8 being any constant such that c8

6. Consider now the condition on γ as in (42). Since we are interested in the high-dimensional regime where N d, we assume that d + log m n. Using this, we can lower bound the RHS of (42) and readily obtain the more stringent condition on γ as in (50), with c14 = 1152. Step 2: Condition on R in (51) implies (43). Using again (48), we can upper bound the lower interval of R in (43) as

56λs λmin(Σ) 64s c1ζΣ log d/N

(48) 56λs λmin(Σ) λmin(Σ)/57

(49) = 57 c8s λmin(Σ)σ

6t0ζΣ log d

Using (49), the upper interval in the same condition reads

λN 64 c1ζΣ log d = c8 64 c1 σ

6t0N ζΣ log d.

Therefore, (51) is suﬃcient for (43) to hold, with c15 = 57

6 c8 and c16 =

6 c8/(64 c1). It remains to show that the range of value of R in (51) is nonempty. By (48),

N c12sζΣ log d

6 c8 64 c1 σ

t0N ζΣ log d 57

6 c8 λmin(Σ)sσ

N c13s2ζΣ log d

σ2 N c13s2ζΣ log d

6 c8 64 c1 σ

t0N ζΣ log d 2s.

Step 3: (46) reduces to (54), under (49). The statement follows by a direct substitution of (49) in (46):

λ2s 1976λmin(Σ)2 = c2 8σ2ζΣt0s log d 1976Nλmin(Σ)2 = c21σ2ζΣt0

λmin(Σ)2 s log d

where c21 = c2 8/1976, and

N , η0 G 11339s c1ζΣ log d

Nλmin(Σ)2 λ2s = c22σ2t0

Ji, Scutari, Sun and Honnappa

with c22 = 11339 c1 c2 8. Notice that the (random) interval (54) is non-empty, with probability at least (24). This follows from the fact that (46) is nonempty with the same probability, for all R satisﬁes (43) (Theorem 13), and that (51) is suﬃcient for (43) to hold (Step 2 above). Step 4: (53) holds with high probability, for all t satisfying (55). (53) follows readily from (45) by substitution of the values of λ and γ as in (49) and (50), respectively; and deﬁning the following constants c17 = 9, c18 = 72 c1 c17, c19 = c1 c2 8 c17/988, and c20 = 8 c1 c17/ c2 8. We conclude the proof showing that (55) is a stronger condition than (132). Using (49) and (50), explicitly writen as the following

2 c4λmax(Σ) 1 + d+log m

n + 128λmin(Σ)dm [(2 log m + 1) / c24 + 8] ,

the RHS of (132) reads

! 1 + λmax(Σ)

λmin(Σ) 456 c4[1 + (d + log m)/n] log 2

( log2 log2

! log 2 + log η0 G α2

55λmin(Σ) 2 c4λmax(Σ) 1 + d+log m

n + 128λmin(Σ)dm [(2 log m + 1) / c24 + 8]

+ 456 c4[1 + (d + log m)/n]

λmin(Σ) log η0 G α2

ρ 1 log2 log2

! 1 + λmax(Σ)

λmin(Σ) 456 c4[1 + (d + log m)/n] log 2

( log2 log2

! log 2 + log η0 G α2

55 1 1 ρ 2 c4 λmax(Σ)

1 + d + log m

+ 128dm [(2 log m + 1) / c24 + 8]

+ 456 c4[1 + (d + log m)/n]

λmin(Σ) log η0 G α2

(85) log2 log2

! c4 λmax(Σ)

λmin(Σ) 456[1 + (d + log m)/n] log 2 + 55

| {z } term I

( log2 log2

! log 2 + log η0 G α2

55 1 1 ρ 4 c4 λmax(Σ)

λmin(Σ) d + log m

n + 128dm 2 log m + 1

Distributed Sparse Regression via Penalization

+ 456 c4[1 + (d + log m)/n]

λmin(Σ) log η0 G α2

. | {z } term II

Using (d + log m)/n 1 and ρ (0, 1), we can bound term I and term II as

term I (log 2) c4 λmax(Σ)

λmin(Σ) 1 1 ρ 1001

55 d + log m

term II c4 λmax(Σ)

λmin(Σ) 1 1 ρ 912

55 d + log m

n log η0 G α2

Using the above bounds along with (d + log m)/n dm, we can further bound the RHS of (134) as ( log2 log2

! log 2 + log η0 G α2

c23 1 ρ λmax(Σ)

2 log m + 1

c24 + 8 , (135)

where c26 = 22222 c4. This proves (55), and completes the proof of the corollary.

Appendix J. Proof of Theorem 15

At high level the proof is organized in the following two steps. (Step 1) Under the following event, a tolerance η > 0 and an iteration number T are given such that

G(θt) G(ˆθ) η, t T, (136)

we establish a suﬃcient decrease of the optimization error G(θt) G(ˆθ) in the form

G(θt) G(ˆθ) κt T (G(θT ) G(ˆθ)) + tolerance, t T, (137)

for suitable κ (0, 1) and tolerance > 0 this is proved in Lemma 18 (cf. Appendix J.1). Then, (Step 2) we divide the iterations t = 0, 1, 2, . . . , into a series of disjoint epochs [Tk, Tk+1), with 0 = T0 T1 , each one with associated ηk, with η0 η1 . The tuples {(ηk, Tk)} are constructed so that G(θt) G(ˆθ) ηk, for all t Tk. This permits to apply recursively (137) with smaller and smaller values of ηk, till the error G(θt) G(ˆθ) is driven below a desired threshold. This second step, formalized in Proposition 19 (cf. Appendix J.2), leverages (Agarwal et al., 2012, Th. 2).

J.1 Step 1: Suﬃcient decrease of the optimization error under (136)

The error decrease in the form (137) is formally stated in Lemma 18 below. It requires two intermediate technical results, namely: (i) Lemma 16, which restricts the (average of the) optimization error t = θt ˆθ to a set of almost sparse directions; and (ii) Lemma 17, which establishes a curvature property of G along such trajectories.

Ji, Scutari, Sun and Honnappa

Lemma 16 (On the sparsity of t av) Consider Problem (4) under Assumption 2. Further assume that (i) the design matrix X satisﬁes the RSC condition (8) with δ = µ/2 16 sτ > 0; (ii) λ satisﬁes (13); and (iii) γ satisﬁes (32). Let {θt} be the sequence generated by Algorithm (25) with R chosen such that

R max λs δ(1 r)

for some r (0, 1). Under condition (136) with parameters (T, η), the following holds: for any t T,

( t av)Sc 1 3 ( t av)S 1 + 6 (ˆνav)S 1 + 2hmax + min 2η

λ , 2R , (139)

where [cf. (76)]

hmax = dγ λ(1 ρ)

maxi [m] w i Xi n + λ 2 . (140)

Proof See Appendix K.1.

Invoking the RSC condition (8), the next lemma links the objectiveand the iterate-errors along (139).

Lemma 17 (Curvature along (139)) Instate the assumptions of Lemma 16. Under condition (136) with parameters (T, η), the following holds: for any t T,

µ 8 8τs t av 2 G(θt) G(ˆθ) + f( t ) + τ

4(v2 + 8h2 max), µ 8 8τs t av 2 TLγ(ˆθ; θt) + f( t ) + τ

4(v2 + 8h2 max), (141)

where TLγ(ˆθ; θt) is the ﬁrst order Taylor error of L at θt along the direction ˆθ θt [cf. (14)],

f( t ) Lmax

v2 144s ˆνav 2 + 4 min 2η

λ , 2R 2 . (143)

Proof See Appendix K.2.

Using Lemma 17, we are now ready to formally prove (137).

Lemma 18 (Descent of the objective function) Instate the assumptions of Lemma 16, under the stronger condition µ

8 8τs > 0 and the additional assumption that β is chosen so that

β γ γLmax + 1 λmin(W). (144)

Distributed Sparse Regression via Penalization

Under (136) with parameters (T, η), the following holds:

G(θt) G(ˆθ) κt T (G(θT ) G(ˆθ)) + τ(36s ˆνav 2 + 2h2 max + ϵ2), t T, (145)

and ϵ = min 2η

λ , 2R . (146)

Proof See Appendix K.3.

Note the structure of the tolerance term in (145): s ˆνav 2 is of the order of the statistical error; h2 max is due to the lack of consensus on the agents trajectories θt i s, it can be controlled by carefully choosing γ; and ϵ2 is a function of the threshold η. In Step 2 below we show that, since κ < 1, one can eventually driven the error ϵ2 below the threshold O(s ˆνav 2 + h2 max).

J.2 Step 2: Recursive application of Lemma 18

As anticipated, the key idea is to divide the iterations t = 0, 1, 2, . . . , into a series of disjoint epochs [Tk, Tk+1), with Tk Tk+1, each one with associated ηk, such that (i) G(θt) G(ˆθ) ηk, for all t Tk; and (ii) η0 η1 . This permits to apply recursively Lemma 18 with smaller and smaller values of ηk, till the error G(θt) G(ˆθ) is driven below the threshold 4τ(36s ˆνav 2 + 2h2 max). This construction follows the same argument as in the proof of (Agarwal et al., 2012, Th. 2) with minor adjustments (Lemma 4 therein is replaced with our Lemma 18) and thus is omitted.

Proposition 19 (Agarwal et al. 2012, Theorem 2) Instate the setting of Lemma 18. Further assume,

R λ 32τ . (147)

Then, there holds G(θt) G(ˆθ) α2,

for any tolerance α2 such that

α2 4τ(36s ˆνav 2 + 2h2 max), (148)

and for all

t log2 log2

1 + log 2 log 1/κ

+ log(η0 G/α2) log 1/κ , (149)

where η0 G = G(θ0) G(ˆθ).

Equipped with Proposition 19, we can now complete the proof of Theorem 15. It remains to show the following facts: Fact 1: The lower bound condition on R as in (34) is more stringent than that in (138), under a proper choice of r (0, 1); and the interval in (34) is nonempty; Fact 2: The range of α in (116) is contained in that of (148); and the interval (116) is nonempty;

Ji, Scutari, Sun and Honnappa

Fact 3: (117) is suﬃcient for (149).

We prove these facts next. Fact 1: Choosing r = 1/2, the lower bound condition on R in (138) reads

R max 2λs µ/2 16sτ

2τs µ/2 16sτ

Recalling µ c10sτ = 1824sτ, the following holds for the lower bound in (150):

2λs µ/2 16sτ

2τs µ/2 16sτ

56λs µ 32sτ .

max 2λs µ/2 16sτ

2τs µ/2 16sτ

max 56λs µ 32sτ , 2 θ 1

which proves the desired implication. Finally, notice that the interval in (34) is non-empty. This is a consequence of (i) the fact

56λs µ 32sτ λ 32τ ,

due to µ c10sτ = 1824sτ; and (ii) the condition λ 64τ θ 1, due to (31).

Fact 2: Using the condition on γ as in (32), we have

4τ(36s ˆνav 2 + 2h2 max) (110) 4τ 36s Pm i=1 ˆνi 2

1282 (µ/2 16 sτ)2

µ c10sτ 4sτ 36

i=1 ˆνi 2 + λ2s 1976µ2

Therefore, the range of α in (116) is included in that of (148). It remains to show that the range of α2 in (116) is nonempty, which is a consequence of the following chain of inequalities.

4τ 36s ˆν 2

Th.6 ξ=τ (Lm. 4) 144τs

9λ2s (µ/2 16sτ)2 + 2τ µ/2 16sτ d2γ2(max1 i m w i Xi + λn)4

λ2n4(1 ρ)2 | {z } =Term I2 [see (109)]

+ 4 µ/2 16sτ dγ(max1 i m w i Xi + λn)2

n2[2(1 ρ) 4Lmaxγ (µ/2 16sτ)γ] | {z } =Term II2 [see (109)]

+ λ2s 36 1976(µ 32sτ)2

Distributed Sparse Regression via Penalization

(110) 144τs 9λ2s (µ/2 16sτ)2 + τs µ/2 16sτ λ2s 8192(µ/2 16sτ)2 + λ2s 64(µ/2 16sτ)2

+ λ2s 36 7904(µ/2 16sτ)2

(a) 144τs 9λ2s (µ/2 16sτ)2 + 1 896 λ2s 8192(µ/2 16sτ)2 + λ2s 64(µ/2 16sτ)2

+ λ2s 36 7904(µ/2 16sτ)2

< 144τs (µ/2 16sτ)2 10λ λ s

(34) 1440τs 28(µ/2 16sτ) λR (b) < λR

where (a) and (b) follow from µ/2 16sτ 896sτ, due to µ c10sτ, with c10 = 1824. This together with (35) shows that the range of α2 in (116) is non-empty. Fact 3: We obtain (117) from (149) by upper bounding the right hand side of (149). To this end, we ﬁrst lower bound log(1/κ) as:

(33),(146) = log

1 γ(µ/8 8τs) γLmax+1 λmin(W)

1 γ(µ/8 8τs)

γLmax + 1 + ρ.

Using (152) in (149) and using η0 G α2 [due to (148)], we can upper bound the right hand side of (149) as log2 log2

1 + (γLmax + 1 + ρ) log 2

+ (γLmax + 1 + ρ)

γ(µ/8 8τs) log η0 G α2

which proves (117).

Appendix K. Proofs of auxiliary Lemmata in Section J

K.1 Proof of Lemma 16

Recalling the deﬁnitions of t, νt, and bν as given in (63), (62) and (61), respectively, we have t = θt ˆθ = νt ˆν. Therefore, t av = νt av ˆνav. We can then bound the desired quantity ( t av)Sc 1 as

( t av)Sc 1 (νt av)Sc 1 + (ˆνav)Sc 1. (153)

We prove below the following upper bounds for (νt av)Sc 1 and (ˆνav)Sc 1

(νt av)Sc 1 3 (νt av)S 1 + h(γ, νt ) + min 2η

(ˆνav)Sc 1 3 (ˆνav)S 1 + h(γ, ˆν ). (154)

Ji, Scutari, Sun and Honnappa

Using (154) in (153) and the triangle inequality yields the desired result

( t av)Sc 1 3( ( t av)S 1 + 2 (ˆνav)S 1) + h(γ, νt ) + h(γ, ˆν ) + min 2η

(76) 3 ( t av)S 1 + 6 (ˆνav)S 1 + 2hmax + min 2η

We prove next (154). From the optimality of ˆθalong with (136), we deduce

G(θt) G(1m θ ) η, t T. (155)

Hence, for any t T, there holds

i=1 Xiθt i yi 2 + 1 2mγ 1m θ + νt 2 V + λ

m 1m θ + νt 1

2N Xθ y 2 + 1 2mγ 1m θ 2 V + λ

m 1m θ 1 + η. (156)

Subtracting m P

N X i (Xiθ yi), νt i from both sides and rearranging terms, we obtain

N X i (Xiθ yi), νt i

+ 1 2mγ 1m θ 2 V 1 2mγ 1m θ + νt 2 V | {z } Term I

i=1 Xi(θ + νt i) yi 2 1

N X i (Xiθ yi), νt i

m( 1m θ + νt 1 1m θ 1)

m( 1m θ + νt 1 1m θ 1) | {z } Term II , (157)

where the last inequality follows from convexity of m P

i=1 Xiθi yi 2/(2N).

We proceed upper (resp. lower) bounding Term I (resp. Term II). We have

N w Xνt av + 1

i=1 w i Xiνt i 1 2mγ νt 2 V

2 νt av 1 + 1

N max i [m] X i wi νt 1 1 2mγ νt 2 V .

To lower bound Term II we decompose θ + νt i as θ + νt i = θ S + θ Sc + (νt i)S + (νt i)Sc. Then, invoking the decomposibility of the regularizer, we can write: for all i,

θ + νt i 1 θ 1 ( (νt av)Sc 1 (νt av)S 1) νt i 1. (159)

Distributed Sparse Regression via Penalization

Using (158) and (159) in (157) yields

(νt av)Sc 1 3 (νt av)S 1 + h(γ, νt ) + 2η

On the other hand, since θt i 1 R and θ 1 < R, we have

(νt av)Sc 1 νt av 1 θ 1 + θt av 1 < R + R = 2R. (161)

Using (161), we can then strengthen (160) as the ﬁrst inequality in (154). The proof of the second inequality in (154) follows the same steps and uses the fact that ˆθi 1 R, for all i [m] (Lemma 8).

K.2 Proof of Lemma 17

To bound the (average component of the) optimization error in terms of the function optimality gap we leverage the curvature property of Lγ [under the RSC condition (8)] along the trajectory of the algorithm. We explicitly use the fact that the trajectory lies in the set described by (139) [cf. Lemma 16]. Recalling that G(θ) = Lγ(θ) + λ

m θ 1, by the optimality of ˆθ, it follows

t, Lγ(ˆθ) + λ

m ˆθ 1 0. (162)

We can then write

G(θt) G(ˆθ) (162) TLγ(θt; ˆθ)

RSC (8) 1 4

2(64s t av 2 + 2v2 + 16h2 max) Lmax

where in (a) we used

(139) 4 ( t av)S 1 + 6 (ˆνav)S 1 + 2hmax + min 2η

4(4 ( t av)S 1)2 + 4(6 (ˆνav)S 1)2 + 4(2hmax)2 + 4 min 2η

64s t av 2 + 2v2 + 16h2 max,

with v2 = 144s ˆνav 2 + 4 min 2η

Reorganizing the terms in (163) yields the ﬁrst inequality in (141). Similar arguments apply to derive the second inequality in (141) by noticing that, for quadratic Lγ, we have TLγ(ˆθ; θt) = TLγ(θt; ˆθ). This concludes the proof.

Ji, Scutari, Sun and Honnappa

K.3 Proof of Lemma 18

The proof follows descent arguments (see, e.g., Nesterov 2007), suitably coupled with the curvature property established in Lemma 17 to achieve contraction up to a controllable tolerance. By deﬁnition of θt+1 in (25), we have Gt(θt+1) Gt(θ), for all feasible θ. Recalling that ˆθ is feasible (Lemma 8), θω ω ˆθ + (1 ω)θt is feasible as well, for any ω (0, 1). Therefore,

Gt(θω) = (1 ω)Lγ(θt) + ωLγ(ˆθ) ωTLγ(ˆθ; θt) + ω2

2βm t 2 + λ

(141) (1 ω)G(θt) + ωG(ˆθ) + ωf( t ) + ωτ

4(v2 + 8h2 max) + ω2

8 8τs t av 2. (164)

We proceed to relate G(θt+1) with Gt(θt+1).

= Gt(θt+1) 1 2βm θt+1 θt 2 + Lγ(θt+1) Lγ(θt) Lγ(θt), θt+1 θt | {z } = 1

2N Pm i=1 Xi(θt+1 i θt i) 2+ 1 2mγ θt+1 θt 2 V

(164) G(θt) ω(G(θt) G(ˆθ)) + ωf( t ) + ωτ

4(v2 + 8h2 max) + ω2

8 8τs t av 2 + 1

i=1 Xi(θt+1 i θt i) 2

+ 1 2mγ θt+1 θt 2 V 1 2βm θt+1 θt 2. (165)

Subtracting G(ˆθ) from both sides of the above inequality and denoting the function gap as ηt G = G(θt) G(ˆθ), we have

ηt+1 G (1 ω) ηt G + ωf( t ) + ωτ

4(v2 + 8h2 max) + ω2

2βm t 2 ω µ

8 8τs t av 2

i=1 Xi(θt+1 i θt i) 2 + 1 2mγ θt+1 θt 2 V 1 2βm θt+1 θt 2

(1 ω)ηt G + ωf( t ) + ωτ

4(v2 + 8h2 max) + ω2

8 8τs t av 2

0, for 0 ω β µ 8 8τs

m + 1 λmin(W)

| {z } 0 [condition on β in (144)]

(a) (1 ω)ηt G | {z } Term I + ωf( t ) + ω2

| {z } Term II

4(v2 + 8h2 max) | {z } Term III

Distributed Sparse Regression via Penalization

where (a) holds under the condition 0 ω β µ

8 8τs . Note that µ/8 8τs > 0 by assumption; hence the interval for ω is non-empty. Furthermore, ω (0, 1/2), due to

8 8τs (144) γ γLmax + 1 λmin(W) µ

8 8τs (b) < 1

where in (b) we used the following lower bound for (1 λmin)/γ:

(32) 1 λmin(W)

2 16sτ + 128d

2 16sτ maxi [m] w i Xi λn + 2 m

2 16sτ + 128d

2 16sτ maxi [m] w i Xi λn + 2 m

In (166), Term I captures the geometric decrease of the objective error, for any ω < 1; Term II is due to consensus errors and it is controllable by choosing a suﬃciently small network regularizer γ; ﬁnally, Term III is due to the lack of strong convexity, determining a nonzero tolerance on the achievable objective error. We choose ω to minimize the contraction factor in Term I, resulting in

8 8τs . (168)

Under this choice we can bound Term I Term III as follows. Term I:

Term I = 1 β µ

ηt G. (169)

Note that κ (0, 1/2), due to (167). Term II: Using the upper bound of γ in (26), we can bound f( t ) [cf. (142)] as

f( t ) Lmax

2m + µ 32 sτ

4m + 32d(µ 32 sτ)

sm (max i [m] w i Xi /(λn) + 2 m)2 t 2.

Term II (168),(170) β µ

4 8sτ t 2 + β

8 8τs 2 t 2

8 8τs 2 t 2 0. (171)

Ji, Scutari, Sun and Honnappa

Term III = β µ

8 8τs τ 36s ˆνav 2 2 + 2h2 max + min 2η

λ , 2R 2 . (172)

Using (169), (171), and (172) in (166), we ﬁnally obtain: for all t T,

ηt+1 G κ ηt G + β µ

8 8τs τ 36s ˆνav 2 + 2h2 max + min 2η

κt T ηT G + τ 36s ˆνav 2 + 2h2 max + min 2η

This completes the proof.

A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, pages 2452 2482, April 2012.

R. Bai and M. Ghosh. Normalbetaprime: Normal beta prime prior. Statistica Sinica, 2019. URL https://CRAN.R-project.org/package=Normal Beta Prime. R package version 2.2.

Y. Bao and W. Xiong. One-round communication eﬃcient distributed m-estimation. In International Conference on Artiﬁcial Intelligence and Statistics, April 2021.

H. Battey, F. Jianqing, L. Han, L. Junwei, and Z. Ziwei. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352 1382, June 2018.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183 202, January 2009.

S. Becker, J. Bobin, and E. J. Candes. Nesta: a fast and accurate ﬁrst-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4:1 39, April 2011.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705 1732, February 2009.

J. Bolte, A. Daniilidis, O. Ley, and L. Mazet. Characterizeations of lojasiewicz inequalities: subgradient ﬂows, talweg, convexity. The Transactions of the American Mathematical Society, 322:3319 3363, June 2009.

K. Bredies and D. A. Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14:813 837, September 2008.

E. J. Candes and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory, 52(12):5406 5425, January 2006.

Distributed Sparse Regression via Penalization

A. I. Chen and A. Ozdaglar. A fast distributed proximal-gradient method. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 601 608, April 2012.

J. Chen and A. H. Sayed. Diﬀusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289 4305, August 2012.

X. Chen, W. Liu, and Y. Zhang. First-order newton-type estimator for distributed estimation and inference. Journal of the American Statistical Association, pages 1 17, April 2021.

A. Daneshmand, G. Scutari, and V. Kungurtsev. Second-order guarantees of distributed gradient algorithms. SIAM Journal on Optimization, 30(4):3029 3068, January 2020.

S. Van de Geer and P. Buhlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3(4):1360 1392, October 2009.

D. L. Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613 627, November 1995.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Eﬃcient projections onto the ℓ1-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning, pages 272 279, July 2008.

E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: methodology and convergence. SIAM Journal on Optimization, 19:1107 1130, October 2008.

T. Hastie, R. Tibshirani, and M. J. Wainwright. Statistical Learning with Sparsity: the Lasso and Generalizations. Chapman & Hall/CRC, 2015.

P. S. Horn. On the stochastic ordering of absolute univariate Gaussian random variables. The Annals of Statistics, 16(3):1327 1329, September 1988.

D. Jakovetić. A uniﬁcation and generalization of exact distributed ﬁrst-order methods. IEEE Transactions on Signal and Information Processing over Networks, 5:31 46, September 2019.

D. Jakovetić, J. Xavier, and JMF. Moura. Cooperative convex optimization in networked systems: augmented lagrangian algorithms with directed gossip communication. IEEE Transactions on Signal Processing, 59(8):3889 3902, July 2011.

D. Jakovetić, JMF. Moura, and J. Xavier. Linear convergence rate of a class of distributed augmented lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922 936, July 2013.

D. Jakovetić, J. Xavier, and JMF. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59:1131 1146, December 2014.

F. Jianqing, G. Yongyi, and W. Kaizheng. Communication-eﬃcient accurate statistical estimation. Journal of the American Statistical Association, 0(0):1 11, August 2021.

Ji, Scutari, Sun and Honnappa

M. I. Jordan, J. D. Lee, and Y. Yang. Communication-eﬃcient distributed statistical inference. Journal of the American Statistical Association, November 2018.

J. D. Lee, Y. Sun, Q. Liu, and J.E. Taylor. Communication-eﬃcient sparse regression: a one-shot approach. Journal of Machine Learning Research, January 2015.

P. Di Lorenzo and G. Scutari. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2:1 1, February 2016.

Z. Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46:157 178, March 1993.

A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: Theory of Majorization and Its Applications, volume 143. Springer, second edition, 2011.

A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48 61, January 2009.

A. Nedić, A. Ozdaglar, and P. A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922 938, April 2010.

A. Nedić, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27:2597 2633, July 2016.

A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communicationcomputation tradeoﬀs in decentralized optimization. Proceedings of the IEEE, 106:953 976, September 2018.

Y. Nesterov. Gradient methods for minimizing composite objective function. Research Papers in Economics, January 2007.

Y. Nesterov et al. Lectures on Convex Optimization, volume 137. Springer, 2018.

A. Olshevsky. Linear time average consensus and distributed optimization on ﬁxed graphs. SIAM Journal on Control and Optimization, 55(6):3990 4014, December 2017.

S. Pan and Y. Liu. Metric subregularity of subdiﬀerential and KL property of exponent 1/2. arxiv preprint, ar Xiv:1812.00558v3, 322, 2018.

G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5:1245 1260, April 2017.

G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11:2241 2259, August 2010.

G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq -balls. IEEE Transactions on Information Theory, 57:6976 6994, November 2011.

Distributed Sparse Regression via Penalization

J. Rosenblatt and B. Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379 404, 2016.

N. Sahand, P. Ravikumar, M. J. Wainwright, and B. Yu. A uniﬁed framework for highdimensional analysis of m-estimators with decomposable regularizers. Statistical Science, 27(4):538 557, November 2012.

A. H. Sayed. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning, 7:311 801, January 2014.

O. Shamir, N. Srebro, and T. Zhang. Communication-eﬃcient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning, pages 1000 1008. PMLR, 2014.

W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the admm in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62: 1750 1761, July 2014.

W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact ﬁrst-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944 966, November 2015a.

W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013 6023, November 2015b.

Y Sun, A Daneshmand, and G Scutari. Distributed optimization based on gradient-tracking revisited: enhancing convergence rate via surrogation. ar Xiv preprint, ar Xiv:1905.02637, 2019.

Y Sun, Maros M., G Scutari, and Guang C. High-dimensional inference over networks: linearly convergence algorithms and statistical guarantees. ar Xiv:2201.08507, 2022.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267 288, January 1996.

P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387 423, March 2009.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 2012.

M. J. Wainwright. High-Dimensional Statistics: A Non-asymptotic Viewpoint. Cambridge University Press, 2019.

J. Wang, M. Kolar, N. Srebro, and T. Zhang. Eﬃcient distributed learning with sparsity. In International Conference on Machine Learning, pages 3636 3645. PMLR, May 2017.

S. Wang, F. Roosta, P. Xu, and W. W. Mahoney. Giant: Globally improved approximate newton method for distributed optimization. Advances in Neural Information Processing Systems, 31, 2018.

Ji, Scutari, Sun and Honnappa

B. Wen, X. Chen, and T. Pong. Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM Journal on Optimization, 27:124 145, December 2017.

J. Xu, S. Zhu, Y. C. Soh, and L. Xie. Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Transactions on Automatic Control, 63(2): 434 448, July 2018.

K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835 1854, January 2016.

K. Yuan, S. Alghunaim, B. Ying, and A. H. Sayed. On the inﬂuence of bias-correction on distributed stochastic optimization. IEEE Transactions on Signal Processing, 68:4352 4367, July 2020.

J. Zeng and W. Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834 2848, June 2018.

Z. Zhou and A. Man-Cho So. A uniﬁed approach to error bounds for structured convex optimization problems. Mathematical Programming, 165:689 728, December 2017.