# decentralized_sparse_linear_regression_via_gradienttracking__f1ecb298.pdf

Journal of Machine Learning Research 26 (2025) 1-56 Submitted 9/23; Revised 12/24; Published 9/25

Decentralized Sparse Linear Regression via Gradient-Tracking

Marie Maros mmaros@tamu.edu Wm Michael Barnes 64 Department of Industrial & Systems Engineering Texas A&M University College Station, TX 77843, USA

Gesualdo Scutari gscutari@purdue.edu School of Industrial Engineering School of Electrical and Computer Engineering Purdue University West Lafayette, IN 47907, USA

Ying Sun ybs5190@psu.edu School of Electrical Engineering and Computer Science The Pennsylvania State University University Park, PA 16802, USA

Guang Cheng guangcheng@ucla.edu Department of Statistics University of California Los Angeles Los Angeles, USA.

The order of the ﬁrst three authors is alphabetic.

Corresponding author.

Editor: Pradeep Ravikumar

We study sparse linear regression over a network of agents, modeled as an undirected graph without a center node. The estimation of the s-sparse parameter is formulated as a constrained LASSO problem wherein each agent owns a subset of the N total observations. We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm under high-dimensional scaling, allowing the ambient dimension d to grow with (and possibly exceed) the sample size N. Our theory shows that, under standard notions of restricted strong convexity and smoothness of the average loss functions, suitable conditions on the network connectivity and algorithm tuning, the distributed algorithm converges globally at a linear rate to an estimate that is within the centralized statistical precision of the model, O(s log d/N). When s log d/N = o(1), a condition necessary for statistical consistency, an ε-optimal solution is attained after O(κ log(1/ε)) gradient computations and O(κ/(1 ρ) log(1/ε)) communication rounds, where κ is the restricted condition number of the loss function and ρ measures the network connectivity. The computation cost matches that of the centralized projected gradient algorithm despite having data distributed; whereas the communication rounds reduce as the network connectivity improves. Overall, our study reveals interesting connections between statistical eﬃciency, network connectivity & topology, and convergence rate in the high dimensional setting.

c 2025 Marie Maros, Gesualdo Scutari, Ying Sun, and Guang Cheng.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1168.html.

Maros, Scutari, Sun, Cheng

master-worker architecture mesh network

Figure 1: Two network architectures: master-worker and mesh networks.

Keywords: High-dimensional estimation, distributed convex optimization, linear convergence, sparse linear regression, gradient-tracking1

1. Introduction

Datasets with massive sample size and high-dimensionality are ubiquitous in modern science and engineering; examples include genomics and biomedical data, social media, ﬁnancial time-series, and e-commerce data, just to name a few. The archetypal scenario is the one wherein data are produced, collected, or stored at diﬀerent times and locations. The sheer volume and spatial/temporal disparity of data render centralized processing and storage on standalone machines prohibitively expensive or outright impossible. This has motivated in recent years a surge of interest in developing distributed methods that enable analytics and computations across multiple machines (henceforth referred to as agents), connected by a communication network. Broadly speaking, two architectures have captured most of the interest of the research community, namely: (i) the master-worker topology and (ii) a general, connected, multiagent architecture (a.k.a. mesh network). They are depicted in Fig. 1 and brieﬂy commented next. In a master-worker architecture, there is typically one or more master nodes connected to all other nodes, termed workers. The workers store part of the data, execute (in parallel) intermediate computations, and communicate (iteratively) suitable outcomes to the master node, which maintains and updates the authoritative copy of the optimization variables, producing eventually the ﬁnal output. These federated architectures have been adopted in several applications to parallelize and decompose a variety of learning and optimization tasks; see, e.g., (Li et al., 2020; Mc Mahan et al., 2017; Boyd et al., 2011). The statistical community is best acquainted with divide-and-conquer (D&C) methods whereby the master node combines the estimators produced by the workers using their local datasets. The idea has been widely applied in statistical estimation and inference such as M-estimation (Zhang et al., 2013b; Chen and Xie, 2014; Shi et al., 2018), sparse high dimensional models Battey et al. (2018); Lee et al. (2017), PCA Fan et al. (2019), matrix factorization Mackey et al. (2011), quantile regression Volgushev et al. (2019); Chen et al. (2019) and nonand semiparametric methods including (Neiswanger et al., 2014; Shang and Cheng, 2017; Zhao et al.,

1. This work was conducted when Sun, Maros, and Cheng were at Purdue University.

High-Dimensional Inference over Networks

2016; Zhang et al., 2013a; Kleiner et al., 2014), just to name a few. However, such architectures may be impractical or undesirable in various scenarios. For example, in large-scale systems, the master node can become a systemic bottleneck: (i) its failure can disrupt the entire network, (ii) the node s communication resources may become insuﬃcient as the network size grows (owing to restricted bandwidth), and (iii) the implementation of some algorithms may require prohibitive computation eﬀorts on standalone machines (Hendrikx et al., 2020), particularly when processing high-dimensional and large datasets. Furthermore, there are scenarios where establishing a star topology (or a spanning tree) is simply not feasible or too resource-expensive; this is the case, e.g., of wireless networks with low-power devices, which can communicate only with nodes in their physical proximity. To circumventing these impediments, a natural approach is to eliminate master nodes, transitioning to peer-to-peer architectures wherein each node is connected exclusively to a subset of other nodes. These are the systems considered in this work.

1.1 Notation

Throughout the paper vectors are boldfaced and matrices are boldfaced and capitalized. We denote by 1 the vector of all ones, and by J = 11 the projection onto the consensus subspace; dimensions of these quantities will be clear from the context. Let p be denoteℓp norm, for p 1. When p is not speciﬁed. With a slight abuse of notation, we also denote by p the operator norm when applied to matrices, induced by the vector ℓp norm; F denotes the Frobenius norm.

1.2 Sparse linear regression over mesh networks

We study sparse linear regression problem over a mesh network of m agents, modeled as a general connected, undirected graph. Each agent i takes n linear measures of an s-sparse signal θ Rd: yi = Xiθ + ni, (1)

where yi Rn is the vector of measurements, Xi Rn d is the design matrix, and ni Rn

is the observation noise. For simplicity we assume that all agents acquire the same number n of measurements, accumulating to a total of N = m n over the network. Our focus is on the high-dimensional setting where both the ambient dimension d and the sample size N grow, with d faster than N. A standard approach to estimate θ from {(yi, Xi)}m i=1 is solving the LASSO problem, whose constrained form reads

bθ argmin θ 1 r

, where Li(θ) 1

2n yi Xiθ 2. (2)

In (2), each Li is the least squares loss, local to agent i, and the ℓ1-norm constraint (with r known to all the agents) aims at promoting sparsity on the solution bθ. As an instance of the empirical risk minimization problem with nonsmooth regularization, there is a large body of distributed ﬁrst order algorithms applicable for (2). Among these methods, the gradient tracking based ones have been demonstrated successful in improving the algorithmic eﬃciency both empirically and theoretically (Xu et al., 2018; Di Lorenzo and Scutari, 2016; Nedi c et al., 2016). As such, this technique has been widely

Maros, Scutari, Sun, Cheng

applied to design distributed algorithms in various problem setups. Based on decentralizing the proximal gradient algorithm (PGD), the DGT algorithm for problem (2) decouples the optimization by letting each agent i locally maintain an estimator θi Rd of the common variable θ. At every iteration t, each agent i [m] performs in parallel the following update:

comm. step: θt i =

j=1 wij θ t 1

2 j , gt i =

j=1 wij gt 1 j + Lj(θt j) Lj(θt 1 j ) , (3a)

comp. step: θ t+ 1

θt i γ 1gt i . (3b)

In the computation step, each agent performs a proximal (projected) gradient update, where gt i plays the role of estimating the centralized gradient L(θt i); Q θi 1 r denots the Euclidean projection onto the ℓ1 ball {θi Rd : θi 1 r}. In the communication

step, each agent updates the local variable θ t 1

2 i and gradient estimator gt 1 i based on local averaging. The update of gt i (gradient tracking step) can be explained as follows: At each iteration t, agent i subtracts the outdated gradient Li(θt 1 i ) at the previous iteration and adds the new one Li(θt i); this refreshes the memory of gi and ensures that the sum of the gt i s is equal to the sum gradient Pm i=1 Li(θt i). Then, agents perform a local averaging by computing a weighted sum using wij, enforcing thus consensus among the gi s. If both θi s and gi s are asymptotically consensual, then

j=1 gt j = 1

j=1 Lj(θt j) 1

j=1 Lj(θt i) as t . (4)

That is, gt i converges to L(θt i), as desired. Despite the vast literature on decentralized algorithms, existing results of DGT are of pure optimization type; hence they lack guarantees in the high-dimensional setting, even for the sparse linear regression problem (2). In contrast, the PGD algorithm as its centralized counterpart has been proven to converge at a fast linear rate to statistically optimal solutions (Agarwal et al., 2012). For a detailed review of the literature, we refer readers to Section 1.4. Given this disparity, it naturally raises the question of whether DGT can enjoy the same fast rate and statistical optimality of PGD in the decentralized setting, and how the network conﬁguration aﬀects its performance. This paper addresses this open question.

1.3 Major contributions

We provide the statistical and convergence analysis of the DGT algorithm for solving the sparse linear regression problem (2). Our major contributions are the following.

(i) Statistical-computational guarantees: By exploiting the Restricted Strong Convexity (RSC) and Smoothness (RSM) properties of L (Agarwal et al., 2012), we establish regularity conditions under which the sequence generated by DGT provably convergences at a linear rate to an estimate whose distance to bθ is smaller than the statistical precision bθ θ . Speciﬁcally, to enter an ε-neighborhood of such precision, it takes

O (κ log(1/ε)) (5)

High-Dimensional Inference over Networks

iterations (gradient evaluations) as long as ρ poly(m, κ) 1, where κ is the restricted condition number of L and ρ measures the network connectivity. The condition on ρ, when not met by the given network, can be enforced via multiple rounds of communication per iteration, resulting in

O κ 1 ρ log(mκ) log(1/ε) (6)

overall communication rounds. As an implication, our result shows statistically optimal estimators can be computed at a linear rate by the DGT algorithm in the limit N, d and s log d/N = o(1). Furthermore, (5) matches the iteration complexity of the centralized proximal gradient descent (PGD) (Agarwal et al., 2012).

(ii) Sample/network scaling: The communication complexity (6) remains unchanged, for s log d/N being constant and for a ﬁxed network. We also study the scalability of (6) with respect to the network size m, revealing tradeoﬀs between communication rounds and communication costs for diﬀerent graph topologies, as summarized in Table 1. This sheds light on which network architectures are more favorable for high-dimensional estimation. For instance, among the commonly used network architectures, DGT running on the Erd os-R enyi graph with edge connecting probability p = log m/m achieves the lowest communication cost at the busiest node, and ties with the implementation on the star topology in the amount of total channel use.

(iii) Technical novelties: Our analysis represents a shift from both centralized approaches in high-dimension (e.g., (Agarwal et al., 2012)) and the prevailing techniques used for studying gradient-tracking algorithms in decentralized optimization (e.g., (Nedi c et al., 2016; Sun et al., 2022c; Xu et al., 2021)), across several aspects.

(a) Inexact descent under RSC/RSM: Linear convergence of gradient-tracking methods has been proved under the assumption of global strong convexity of L and global smoothness of Li s conditions that fall short in the high-dimensional setting where d >> N. In contrast, our analysis employs the more nuanced RSC and RSM conditions better suited for high-dimensional data and known to hold with high probability across various data generation models (Wainwright, 2019). Our ﬁrst result is an inexact descent property of the estimation error (detailed in Proposition 11), achieving linear shrinkage subject to consensus and gradient-tracking errors. This can be view as a generalization of (Agarwal et al., 2012) to more complex scenarios posed by the mentioned errors (not present in (Agarwal et al., 2012)) and decentralized contexts.

(b) A new analysis of gradient tracking errors in high-dimensions: Traditional analyses of gradient tracking errors leverage the Lipschitz continuity of each Li to establish uniform bounds on the tracking errors gt i L(θt i), in terms of consensus and optimization errors (see Sec. 3.2.2). However, this approach proves inadequate in highdimensional settings, where in the asymptotics d/N , the Lipschitz constants of Li s scale as O(d), for various predictor models (Wainwright, 2019), thereby diverging as d . Fundamentally, this issue arises because the gradients Li(θi) remain dense vectors even when all θi s are sparse. This would translate in convergence rates scaling unfavorably with the ambient dimension. Our new analysis posits that, under

Maros, Scutari, Sun, Cheng

the RSC and RSM conditions, it is suﬃcient to manage the tracking errors only along sparse directions. Within these directions, gt i is proved to serves as a suﬃciently accurate estimate of L(θt i) (see Proposition 12). This crucial adjustment enables us to achieve convergence rates for DGT that are independent of the ambient dimensions.

(c) Combining error dynamics via the z-transform: The reﬁned treatment of gradienttracking errors, as discussed in (b), signiﬁcantly complicates the convergence analysis of DGT. The error dynamics at any given iteration now depend on the consensus/tracking errors and optimization residuals from all past iterates, starting from initialization. This extended dependency chain precludes the use of simpler, traditional analytical techniques, such as those in (Xu et al., 2018; Sun et al., 2022c; Nedi c et al., 2016) based on the small gain theorem, typically applied only to dynamics with single-hop time dependencies. To eﬀectively manage this historical information, we employ the z-transform for ﬁnite-length sequences and obtain a suﬃcient condition for the linear convergence of DGT. This approach represents a technique of independent interest, oﬀering potential applications to other problem of decentralized optimization.

line graph 2-d grid star complete graph Erd os-R enyi

Figure 2: Some commonly used graphs G for the communication network. Dashed line in the Erd os-R enyi graph stands for the existence of an edge with some probability.

line 2-d grid star complete push-pull (star) Erd os-R enyi Erd os-R enyi

(1 ρ) 1 O(m2) O(m log m) O(m2) 1 1 O(1) [p = log m/m] O(1) [p = O(1)] max degree 2 4 m m m O(log m) O(m) channel use/comm. O(m) O(m) O(m) O(m2) O(m) O(m log m) O(m2) max channel use e O(m2) e O(m) e O(m3) e O(m) e O(m) e O(1) e O(m) total channel use e O(m3) e O(m2) e O(m3) e O(m2) e O(m) e O(m) e O(m2)

Table 1: Scalability of the communication complexity of DGT with the network size m, for the network topology listed in Fig. 2. Max degree: maximum degree of the nodes; channel use/comm.: total channel use per communication round; max channel use: channel use of the busiest node to reach ε optimization precision; total channel use: total channel uses to reach ε optimization precision. e O hides the logarithmic factors in communication complexity. Push-pull(star) refers to the centralized PGD algorithm implemented over star-networks, we refer the readers to the description around (11) for more details.

High-Dimensional Inference over Networks

1.4 Related works

Distributed methods over master-worker systems. The last decade has witnessed a signiﬁcant surge in research on statistical methods within master-worker architectures, as highlighted in Sec. 1. Much of this research has focused on D&C strategies (Zhang et al., 2013b; Chen and Xie, 2014; Shi et al., 2018; Battey et al., 2018; Lee et al., 2017; Fan et al., 2019; Mackey et al., 2011; Volgushev et al., 2019; Chen et al., 2019; Neiswanger et al., 2014; Shang and Cheng, 2017; Zhao et al., 2016; Zhang et al., 2013a; Kleiner et al., 2014). These methods typically involve a single round of communication from clients to the server and achieve statistical optimality, but only with a limited number of client nodes, restricting their use to small networks. To address these limitations, iterative approaches have been developed that allow for larger networks by removing the cap on the number of workers; examples include (Jordan et al., 2018; Fan et al., 2021; Wang et al., 2017). At the cost of multiple communication rounds, they eﬀectively remove restrictions on the number of workers, at the expense of requiring a minimum sample size at each client to ensure the statistical optimality of the ﬁnal estimator at the server. All these algorithms heavily rely on the presence of a server node that, at each iteration, aggregates local gradients from clients into a centralized gradient and distributes updates. This setup ensures that the iterates are generated using exact gradient information, mimicking a fully centralized setting. Without such a centralized node, this mechanism vanishes, rendering these algorithms inapplicable on mesh networks. Moreover, simply replacing the exact gradient with approximations obtained by averaging neighboring local gradients is inadequate. This approach fails to achieve statistically optimal estimates or guarantee convergence, as it does not align with the assumptions and analytical frameworks of centralized methods, which do not account for gradient inaccuracies.

Distributed methods over mesh networks. Early decentralized methods over mesh networks are distributed (sub-)gradient descent (DGD) type methods including (Nedi c and Ozdaglar, 2009; Nedi c et al., 2010; Chen and Ozdaglar, 2012; Chen and Sayed, 2012; Nedi c and Olshevsky, 2014). These algorithms are not directly applicable to (2) because they cannot handle constraints. Subsequent schemes handling constraints include (Sayed, 2014; Shi et al., 2015; Di Lorenzo and Scutari, 2016; Sun et al., 2022c,b; Xu et al., 2021). Based on the order of local computation and communication, the DGD algorithm has the following two variations:

j=1 wijθt j γ 1 Li(θt i)

j=1 wij θt j γ 1 Lj(θt j)

In DGD-CTA, each agent at iteration t ﬁrst averages the neighboring variables through a consensus step and then performs a gradient step using its local gradient. Conversely, in the DGD-ATC, each agent ﬁrst executes the local gradient descent step before carrying out the consensus averaging. It is well established that DGD algorithms, even when applied to smooth and strongly convex loss functions L, converge at a sublinear rate. To overcome

Maros, Scutari, Sun, Cheng

this drawback, the DGT [cf. eq. (3)] (Xu et al., 2018; Di Lorenzo and Scutari, 2016; Sun et al., 2022c; Nedi c et al., 2016) and various primal-dual algorithms (Shi et al., 2015; Yuan et al., 2018; Xu et al., 2021) utilize an improved estimate of the centralized gradient at each agent s location. Consequently, these algorithms achieve a linear convergence rate for loss functions L that are both strongly convex and smooth. For a detailed overview of these methods, please see (Nedic, 2020; Nedi c et al., 2018). Despite intensive study, existing convergence results for these algorithms, derived purely from an optimization perspective, lead to pessimistic conclusions when applied to the highdimensional problem (2). Speciﬁcally, since N < d, the average loss function L lacks strong convexity, resulting in sublinear convergence rates for any ﬁxed ambient dimension d. Moreover, when d increases at a rate faster than N, which is the typical regime in highdimension (Wainwright, 2019), the Lipschitz constant of L will diverge as well, under standard choices of the design matrices {X}m i=1. This requires progressively smaller step sizes, ultimately hindering the convergence of the algorithms.

First order algorithms in the high-dimensional setting. It is well-recognized that convergence analysis and parameter tuning of centralized ﬁrst-order methods for (2) must exploit the landscape properties of L as determined by the underlying statistical model. Notably, linear convergence to statistical optimal estimates is established in (Agarwal et al., 2012; Wang et al., 2014; Lin and Xiao, 2014; Xiao and Zhang, 2013) for the proximal gradient algorithm applied to (2), under the RSC and RSM properties of L, which hold with high probability across various random data generation models and noise distribution (Wainwright, 2019). These conditions, coupled with a proper algorithm tuning, ensure that the algorithmic trajectory remains within a conﬁned region where L is approximately strongly convex and smooth, thereby securing linear convergence up to statistical optimality. An intriguing question is whether distributed ﬁrst-order algorithms can similarly leverage this landscape to achieve rapid convergence toward statistically optimal estimates, particularly when local agents lack suﬃcient data samples to ensure statistical consistency locally. In contrast to their centralized counterparts, the performance of distributed algorithms in high-dimension is less clear. While research is still evolving, existing studies (concurrent to this work (Sun et al., 2022a)) suggest that not all categories of distributed algorithms consistently beneﬁt from these favorable landscape properties, indicating a more complex interaction among algorithmic design, distributed nature of data, and network topology, to be revealed. For instance, the analysis of a DGD-CTA type algorithm for the Lagrangian formulation of the LASSO problem, developed in the companion paper (Ji et al., 2023a), proves that the iterates generated by the algorithm enter an ε-neighborhood of a statistically optimal estimate of θ after

O κ 1 ρ d m log m log 1

number of iterations, where ρ [0, 1) is a measure of the connectivity of the network (the smaller ρ, the more connected the graph, see Assumption 4 in Sec. 2 for a formal deﬁnition); and κ is the restricted condition number of L. This result shows that DGDCTA converges at a linear rate, but the number of iteration towards statistically optimal estimates grows as O(d), even in the asymptotic regime N, d where the statistical error O((s log d)/N) remains unchanged. In contrast, the DGD-ATC algorithm achieve statistical

High-Dimensional Inference over Networks

optimal estimates at a converges at rate (Ji et al., 2023b)

e O κ 1 ρ log(dm) log 1

which has the more favorable O(log d) scalability with d, compared to the CTA version. The key diﬀerence between the two DGD variants lies in their local optimization approaches. In the CTA version, each agent uses its local gradient evaluated at a point that is inconsistent with the iterate obtained after the communication step, necessitating a step size of O(d 1) regardless of network connectivity. Conversely, by swapping the order of the communication and optimization steps, the ATC version allows for a constant step size in well-connected networks, enhancing scalability with the ambient dimension. Building on the insights from these analyses, in a paper (Maros and Scutari, 2022) subsequent to this work, we introduce a modiﬁed version of DGD (termed DGD2) that achieves statistical optimality at comparable communication and computational cost of DGT.

1.5 Paper organization

The remainder of the paper is structured as follows: Sec. 2 outlines the primary assumptions and details the statistical-computational guarantees the proofs are reported in Sec. 3, with intermediate results available in the appendix. Numerical simulations corroborating our theoretical ﬁndings are discussed in Sec. 4. Finally, Sec. 5 draws some conclusions.

2. Convergence Analysis

We start by introducing the assumptions related to the LASSO problem (2) and the network topology, used for analyzing the convergence of DGT, given in (3).

2.1 Problem Setup and Main Assumptions

Assumption 1 (Global RSC/RSM) Given X = [X 1 , . . . , X m] RN d, with N = n m, the following conditions hold:

(Global RSC/RSM)

N µΣ u 2 τµ u 2 1,

N LΣ u 2 + τg u 2 1, u Rd, (9)

where (µΣ, τµ) and (LΣ, τg) are positive RSC/RSM parameters.

Assumption 1 is a standard requirement for ensuring linear convergence of the PGD algorithm (Agarwal et al., 2012) to statistically optimal estimates, even in centralized settings. Additionally, we impose the following local restricted smoothness condition on the agents losses, which is instrumental to ensure that the gradient tracking procedure estimates the centralized gradient accurately.

Assumption 2 (Local RSM) For each local design matrix Xi, the following holds:

n ℓΣ u 2 + τℓ u 2 1, u Rd, (10)

where (ℓΣ, τℓ) are positive RSM parameters.

Maros, Scutari, Sun, Cheng

The mesh network is modeled as an undirected graph G (V, E), with V = {1, . . . , m} being the set of m nodes and E V V being the set of edges corresponding to the communication links: (i, j) E iﬀthere exists a communication link between agent i and j. We denote by Ni = {j V : (i, j) E} {i} the set of neighbors of agent i (including the agent itself). We make the following standard Assumption 3 on the graph connectivity, which is necessary to achieve consensus over the network.

Assumption 3 (Connected network) The graph G is connected.

To achieve consensus and ensure implementation over the communication network G, standard conditions in the literature require the following constraints on the weights wij used in the DGT algorithm (3), where PK denotes the set of polynomials with degree no larger than K.

Assumption 4 [On the gossip matrices] The matrix W = (wij)m i,j=1 satisﬁes the following: (a) W = P K(W), where PK PK with PK(1) = 1, and W wij m i,j=1 has a sparsity pattern compliant with G, that is

i) wii > 0, for all i = 1, . . . , m;

ii) wij > 0, if (i, j) E; and wij = 0 otherwise. Furthermore, W is doubly stochastic, that is, 1 W = 1 and W1 = 1. (b) Let ρ W J 2; then, it holds (W J)t cmρt, for all t = 1, 2, . . . , and some cm 1.

Remark 5 Assumption 3 and Assumption 4(a) imply ρ < 1 and (W J)t 2 = ρt. Using the norm bound (W J)t m (W J)t 2, we infer that Assumption 4(b) holds with cm m.

Assumption 4(a) requires W being a consensus-forcing matrix with sparsity pattern matching the topolgy of communication network graph G. When implementing the DGT algorithm (3) with wij = wij, step (3a) can be executed via one-hop communication with each agent s immediate neighbors. Several rules of choosing W have been proposed in the literature satisfying Assumption 4, such as the Laplacian, the Metropolis-Hasting, and the maximum-degree weights rules; see, e.g., (Nedi c et al., 2018) and references therein. When the connectivity of G is relatively weak, it may be beneﬁcial to run multiple communication rounds to accelerate information propagation. This is realized by letting W = PK(W). When K > 1, K rounds of communications per iteration t are employed. For instance, by letting W = W K, the connectivity parameter can be improved from ρ = W J 2 to W K J 2 = ρK. Faster information mixing can be obtained using suitably designed polynomials PK(W), such as Chebyshev (Wien, 2011; Scaman et al., 2017) or orthogonal (a.k.a. Jacobi) (Berthier et al., 2020) polynomials. It is worth noting that the DGT algorithm contains as a special instance, the proximal gradient descent when the graph G is a star-network. In this setting, by electing the ﬁrst node as the master node (star center), the centralized PGD iterate

θt γ 1 L(θt) (11)

High-Dimensional Inference over Networks

can be naturally implemented on the star-network. Speciﬁcally, at each iteration t, the master node broadcasts θt. Then each client node j evaluates Lj(θt) and sends it to the master. The master node averages the gradients L(θt) = 1

m Pm j=1 Lj(θt) based on which the optimization step (11) is computed. The new value θt+1 is broadcast back to the clients. It is not diﬃcult to check that the above procedure can be encompassed by the DGT algorithm (3) using the mixing matrix W = W W , with W = 1/ m[1m, 0m m 1]. Note that the resulting W is the same as that of a complete graph, but implemented in a more communication eﬃcient way. We are ready to present the main convergence properties of DGT and their implications. Our statements consists of two parts. In Sec. 2.2, we establish conditions under which the network average optimization error 1 m Pm i=1 θt i bθ 2 shrinks linearly up to a tolerance o( bθ θ 2), and provide an explicit expression of the rate. In Sec. 2.3, we show that for the statistical model subsumed in (1) with random design matrix, this result holds with high probability. We also discuss the impact of the network topology on the algorithm s performance. The proofs of the theorems are deferred to Sec. 3.

2.2 Convergence under RSC/RSM

Under Assumption 1-4, deﬁne

5C2 2c2 m (1 ρ)2

τℓν2 + C1(τµ + τg)

LΣ ν2, with ν 2 bθ θ 1 +2 s bθ θ ,

λ 1 (2κ) 1 + C1s(τµ + τg)/LΣ

1 2C1sτg/LΣ , with κ LΣ

where C1 > 0, C2 > 1 are universal constants. The convergence rate of DGT is given by the following theorem.

Theorem 6 Given the linear model (1) with θ being s-sparse, consider the LASSO problem (2) over a network G satisfying Assumption 3. Suppose that the global L and local Li losses satisfy Assumption 1 and Assumption 2, respectively. Let {(θt i)m i=1} be the sequence generated by the DGT algorithm with step size γ chosen as

γ = LΣ + 4 ℓ2 Σ µΣ

c2 m ρ (1 ρ)4 , (14)

and the weight matrix W satisfying Assumption 4. Further, assume that

µΣ > 36C1s(τµ + τg) (15)

ρ 2 75c2 m C2 2 ℓ2 Σ µ2 Σ + ℓΣ

µ2 Σ 6C2 2c2 msτℓ

2 = poly m, sτℓ

Maros, Scutari, Sun, Cheng

Then, for any optimal solution bθ of the LASSO problem satisfying bθ 1 = r, we have

i=1 θt i bθ 2 B λt + stat

1 λ, t = 1, 2, . . . , (17)

for λ (0, 1), where B is a constant depending on the initialization [c.f. (69)].

The residual error stat depends on the parameters of the RSC/RSM conditions, the network connectivity ρ, and the statistical error bθ θ 2. The next corollary shows that these parameters can be chosen such that stat = o bθ θ 2 .

Corollary 7 Instate assumptions of Theorem 6; and suppose r θ 1 and

s(τµ + τg) = o(1) and s ρc2 m ℓΣ µΣ τℓ= o(1). (18)

Then, we have

i=1 θt i bθ 2 λt B + o bθ θ 2 . (19)

Iteration complexity. The contraction factor λ determining the linear decay of the optimization error depends on the restricted condition number κ and the RSC/RSM parameters. The lack of strong convexity and smoothness deteriorates the convergence rate by introducing the terms related to the tolerance parameters τµ, τg. Nevertheless, it can be veriﬁed that under condition (15), λ 1 (4κ) 1, and thus such degradation is limited. This matches the convergence rate of the centralized PGD up to a constant factor and improves on existing analyses of distributed algorithms whose convergence to a solution of (2) is certiﬁed only at sublinear rate (see Sec. 1.4). When special settings are considered, the convergence rate reduces to well-known expressions. In particular, for strongly convex and smooth losses, i.e., τµ = τg = τℓ= 0, we have stat = 0, implying that DGT converges to an ε-solution of (2) in O(κ log(1/ε)) iterations. This recovers the result in (Sun et al., 2022c). When the network is fully connected, the ﬁrst term in stat vanishes (as ρ = 0), implying that DGT converges linearly to a neighborhood of bθ of size O((τν + τg)ν2/LΣ)). This rate matches (up to constant factors) that of the centralized PGD under RSC/RSM established in (Agarwal et al., 2012).

Interplay between optimization and communication. Convergence of DGT is established, in particular, under condition (16). This reveals an interesting interplay between communication cost and hardness of the optimization problem. The quantity ℓΣ/µΣ can be viewed as the restricted condition number of the local losses Li (cf. Assumption 2); larger ℓΣ/µΣ and sτℓ/µΣ correspond to more ill conditioned functions Li. Therefore, the more ill conditioned Li, the more connected the network must be to achieve convergence the rate of centralized PGD. When the network topology is given, the associated ρ might not satisfy (16). If so, one can still enforce the desired threshold on ρ by employing multiple rounds of communications. This will be discussed in detail in the next subsection.

High-Dimensional Inference over Networks

2.3 Guarantees for sparse linear regression under random designs

We present some implications of Theorem 6 and Corollary 7 for the model (1), under the following assumption on the design matrix and noise.

Assumption 8 Consider the linear model (1). The design matrices {Xi}m i=1 are i.i.d. random matrices drawn from the Σ-Gaussian ensemble; and the elements of {ni}m i=1 are i.i.d. σ2-sub-Gaussian random variables, independent of the Xi s. Deﬁne

2σmin(Σ), LΣ 2σmax(Σ), textand ζ max j Σjj.

Theorem 9 Given the linear model (1), with θ being s-sparse, consider the LASSO problem (2) over a network G satisfying Assumption 3, and design matrices {Xi}m i=1 and the measurement noise satisfying Assumption 8. Suppose

bθ 1 = r θ 1, s log d/N < c5 µΣ

ζ , and ρ < (c6m8κ4) 1 = poly(m, κ) 1, (20)

for some constants c5, c6 > 0. Let {(θt i)m i=1} be the sequence generated by the DGT algorithm with step size γ chosen according to (14) and the weight matrix W satisfying Assumption 4. Then, the following holds:

i=1 θt i bθ 2 B 1 (2κ) 1 + C s log d/N

1 C s log d/N

bθ θ 2, (21)

with probability at least 1 c8 exp( c9 log d), for some constants C , c7, c8, c9 > 0. The expression of B > 0 is given in (109).

Theorem 9 reveals several interesting properties of DGT, as discussed next.

Scalability with respect to the problem dimension. For a ﬁxed network with connectivity ρ satisfying (20), the dependency of the convergence rate on the ambient dimension d, the total sample size N, and sparsity level s is only through the ratio s log d/N. This implies that the convergence rate (21) is invariant under the asymptotics s, N, d and s log d/N = C (C is an universal constant) resulting in the global sample scaling N = O(s log d). Near optimal sample complexity. When s log d/N = o(1), the residual error in (21) is of smaller order than the statistical precision bθ θ 2. Therefore, DGT takes O(κ log(1/ε)) iterations to reach an ε-neighborhood of a statistically optimal solution; this rate is of the same order of the centralized PGD under RSC/RSM (Agarwal et al., 2012). Note also that the condition s log d/N = o(1) (almost) matches the optimal sample complexity N = Ω(s log(d/s)) for model (1) (Wainwright, 2019), which is necessary for any centralized method to consistently estimate the d-dimensional, s-sparse θ from N samples. Since N = n m, the condition s log d/N = o(1) reveals an interesting interplay between statistical error, convergence rate, and network connectivity/communication cost, which is peculiar of the distributed setting: when agents do not have enough local samples n for statistical consistency but the overall sample size N across the network suﬃces a situation that happens, e.g., when m is large (large-scale network) DGT still achieves centralized

Maros, Scutari, Sun, Cheng

statistical accuracy at linear rate, provided that the network is suﬃciently connected, i.e., ρ poly(m, κ) 1. Therefore, the insuﬃciency of the local sample size is compensated by higher communication costs. The impact of network parameters and topology on the convergence of DGT is further elaborated next. Logarithmic scalabilty of the communication cost with the network size. When the graph G is not part of the design but given a-priori along with W (satisfying Assumption 4), the condition ρ poly(m, κ) 1 can be satisﬁed by running multiple rounds of consensus steps. More precisely, since ρ = W J < 1 (Assumption 4), one can construct W = W K such that ρ = ρK poly(m, κ) 1. This corresponds to running K consensus steps over the graph G using each time the matrix W in (3a). Note that for ﬁxed m (graph), such K is always ﬁnite. In fact, one can see that taking K = O log(m κ) (1 ρ) 1 fulﬁlls the requirement. If Chebyshev acceleration is used to employ multiple rounds of communications, coupled with a symmetric matrix W (satisfying Assumption (4)), one can show that the number of communication steps K reduces to O log(m κ) (1 ρ) 1/2 . Combining this fact with the iteration complexity (21), one can conclude that, in the setting of Theorem 9, DGT based on W = W K drives 1 m Pm i=1 θt i bθ 2 within an ε-neighborhood of a statistical optimal solution in the following number of communication steps

O κ 1 ρ log(mκ) log(1/ε) . (22)

This approach signiﬁcantly improves the communication complexity compared to DGD-like algorithms, such as those discussed in (Ji et al., 2023a,b). Unlike (7) and (8), (22) does not depend on the ambient dimension d; therefore, DGT avoids the speed-accuracy tradeoﬀinherent in DGD algorithms. This marks the ﬁrst instance of a distributed algorithm demonstrating such a desirable property in high-dimensional settings. Impact of the network topology. The network topology aﬀects the convergence rate of DGT through the quantities m and ρ. A larger m or ρ increases communication complexity, as indicated by (22). In addition, for networks of the same size m, the eﬃciency of information propagation depends on the graph s topology, which in (22) is captured by the dependency of ρ on m (Nedi c et al., 2018). Table 1 provides, for some commonly used network topologies (see Fig. 2 for some examples), the dependency of ρ on m, when the lazy Metropolis rule is adopted; for other rules, see (Charron-Bost, 2020). While (1 ρ) 1 aﬀects the total number of communications, it does not capture the overall cost of communications, which also depends on the edge-density of the graph. Hence, counting as one channel use per communication, with each edge shared between two nodes, Table 1 reports also the following quantities, capturing diﬀerent communications costs: the total channel use per communication round, the channel use of the busiest node, and the total channel uses to reach an ε neighborhood of a statistically optimal solution. From the ﬁrst row of the table and (22), we infer that DGT running over a graph with stronger connectivity (smaller ρ) requires less communication rounds to converge. On the other hand, as a trade-oﬀ, better connectivity generally results from denser graphs, and thus yields higher communications costs (heavier channel uses), as reported in the table. We deduce that DGT running on the Erd os-R enyi graph with edge connecting probability p = log m/m achieves the lowest communication cost at the busiest node, and ties with

High-Dimensional Inference over Networks

the implementation on the star ( push-pull ) in the amount of total channel use2. For the path graph and 2-d grid graph, even though the channel use per communication round is low, DGT takes more iterations to converge, and overall a higher total communication cost. Comparing the two instances of the Erd os-R enyi graphs (the last two columns in the table), one may prefer a sparser graph at the price of more communication rounds to a denser graph and less communication rounds. Finally, even though the Erd os-R enyi is relatively eﬃcient for DGT, in practice one may not be able to construct such a topology, e.g., due to geographic constraints. In such scenarios, a path/grid graph is still a valuable choice.

3. Proof of Main Results

3.1 Preliminaries

We begin rewriting the iterates (3a)-(3b) in matrix form and introducing some notation, which is convenient for our developments. Given (θt i)m i=1, (gt i)m i=1, and ( L(θt i))m i=1, we introduce the following associated matrices:

Θt [θt 1, . . . , θt m] , Θt = [ θt 1, . . . , θt m] Θt+ 1

Gt [gt 1, . . . , gt m] , and L(Θt) [ L1(θt 1), . . . , Lm(θt m)] . (23)

Furthermore, denote bΘ = [bθ, . . . , bθ | {z } m times

] . Using (23), (3a) can be rewritten in compact form:

Θt = W Θt 1 + Θt 1 , (24a)

Gt = W Gt 1 + L(Θt) L(Θt 1) . (24b)

Introducing the average quantities

i=1 θt i and gt 1

i=1 gt i, (25)

we deﬁne the consensus errors on the θ, g-vectors as

θt i, θt i θt and gt i, gt i gt, i = 1, . . . , m. (26)

Note that in our setting, θ0 i, = 0, for all i = 1, . . . , m. The matrix counterparts of (26) are

Θt [θt 1, , . . . , θt m, ] = (I J)Θt and Gt [gt 1, , . . . , gt m, ] = (I J)Gt, (27)

where J 11 /m. Using (24) and the fact that (I J)W = W J, the dynamic of (27) reads

Θt = (W J)(Θt 1 + Θt 1) (28a)

Gt = (W J)(Gt 1 + L(Θt) L(Θt 1)). (28b)

The ℓ1-norm constraint plays a key role in enforcing sparsity of feasible points. Speciﬁcally, for any θ Rd, with θ 1 r, θ bθ is approximately s-sparse, as quantiﬁed next.

2. We note that the performance on push-pull (star) are better than those on the Erd os-R enyi when the log factor in e O is explicitly accounted.

Maros, Scutari, Sun, Cheng

Lemma 10 (Agarwal et al. (2012)) Let bθ be any optimal solution of Problem (2) such that bθ 1 = r. Then for any θ 1 r, there holds

θ bθ 1 2 s θ bθ + 2 1 + 2 s | {z } ν

where bθ θ is the statistical error.

3.2 Analysis of the error dynamics

Due to the double stochasticity of W, the iterates θ t+ 1

2 i and θt i generated by DGT must be

feasible. As a result of Lemma 10, the directions θ t+ 1

2 i bθ and θt i bθ satisfy the approximate sparsity property deﬁned by (29). This adherence is a fundamental property leveraged in our proof to establish the desired convergence results, marking a notable shift from traditional proof techniques found in the literature on decentralized optimization. Based on this key property, we structure our analysis into the following steps.

(i) Inexact descent property (Sec. 3.2.1): We demonstrate that the optimization residual Θt bΘ 2 decreases linearly, up to a tracking error term arising from the discrepancy between L(θt i) and its estimator gt i, and a tolerance term related to ν due to the approximate sparsity of the direction θt i bθ.

(ii) Bounding the tracking error (Sec. 3.2.2): We establish that the tracking error can be eﬀectively bounded by an average of all historical optimization residuals, weighted by the network connectivity ρ and a ν-dependent tolerance term, akin to the one described in (i). Aiming at exploiting the local restricted smoothness of Li and the approximate sparsity of the direction θt i bθ, our approach introduces a novel line of analysis that diverges from traditional proof techniques used to study convergence of gradient-tracking algorithms.

(iii) Combining error dynamics via the z-transform (Sec. 3.2.3): We combine the two error dynamics in (i) and (ii) employing the z-transform for ﬁnite-length sequences, to eﬀectively deal with historical optimization residuals, and obtain a sufﬁcient condition for the algorithm s convergence. Our method deviates from conventional approaches commonly found in the literature on gradient-tracking methods, such as (Xu et al., 2018; Sun et al., 2022c; Nedi c et al., 2016), which often rely on the small gain theorem to analyze dynamics with only single-hop time dependencies. By employing the z-transform, we broaden the analytical framework to incorporate more complex time dependencies.

3.2.1 Inexact descent

We begin with the analysis of the optimization step (3b). Since each gi aims to estimate the local gradient L(θi), step (3b) can be regarded as an inexact version of the proximal gradient update:

2 i = argmin θ 1 r

n L(θt i) (θi θt i) + γ

2 θi θt i 2o . (30)

High-Dimensional Inference over Networks

Therefore, we base our proof on the analysis of (30) while taking into account the gradient estimation error L(θt i) gt i, due to the use in (30) of the gradient tracking vector gt i rather than L(θt i). The following proposition establishes the one-step descent of the optimization error, up to additive errors.

Proposition 11 Consider the LASSO problem (2) over the network G, under Assumptions 3 and 1; and bθ such that bθ = r. Let {θt i} be the sequence generated by Algorithm 3 under Assumption 4. Then, the following holds:

2 bΘ 2 F 1 µΣ

γ + C1s(τµ + τg)

i=1 ( L(θt i) gt i) (θ t+ 1

+C1(τµ + τg)

γ mν2, (31)

for some universal constant C1 > 0.

Proof See Appendix A.

Note that in a fully connected network, L(θt i) gt i = 0 and θt i = θt j, for all i, j = 1, . . . , m, and t = 1, . . . ,. Setting γ = LΣ, (31) recovers the convergence result of the centralized PGD algorithm, as in (Agarwal et al., 2012, Thm. 1). The subsequent analysis focuses on the dynamics of the gradient tracking error δt.

3.2.2 Bounding the tracking error in high dimension

To study the evolution of the tracking error δt, we resort to the dynamics of gt i. Recalling the deﬁnition of gt [cf. (25)] and the tracking dynamics (3a), we have

gt = gt 1 + 1

Li(θt i) Li(θt 1 i ) and hence gt = 1

i=1 Li(θt i),

where we used the initialization g0 = 1

m Pm i=1 Li(θ0 i ) . This average preserving property of the sum-gradient suggests the following decomposition of the tracking error δt:

j=1 Lj(θt j) + gt gt i

2 i bθ) (32)

j=1 Lj(θt i) 1

j=1 Lj(θt j)

gt gt i (θ t+ 1

Limitation of current proof techniques: The subsequent analysis, aiming at bounding the gradient-tracking error δt, necessities a departure from traditional techniques developed in the literature of gradient-tracking algorithms, such as (Xu et al., 2018; Sun et al., 2022c;

Maros, Scutari, Sun, Cheng

Nedi c et al., 2016). Traditional approaches often handle the error δt in (32) using the Cauchy-Schwarz inequality, which yields the following type of bound:

δt C ℓmax Θt 2 F + C Θt+ 1

2 bΘ 2 F + C Gt 2 F , (33)

where ℓmax = maxi (1/n)X i Xi 2 is the largest Lipschitz constants of the local gradients

Li. The analysis would proceed absorbing the second term Θt+ 1

2 bΘ 2 F in (33) into the LHS of (31), and using the dynamics (28a)-(28b) along with the Lipschitz continuity of Li s to obtain the following bounds on the remaining consensus and tracking errors Θt 2 F and Gt 2 F in (33) (Sun et al., 2022c):

Θt F ρ Θt 1 F + ρ Θt 1 F , (34a)

Gt F ρ Gt 1 F + ρ ℓmax Θt 1 F + ρ ℓmax Θt 1 F . (34b)

The rest of the proof in (Sun et al., 2022c) would chain the inequalities (31), (33), and (34) using small-gain arguments, yielding linear convergence rates, under a suitable choice of the stepsize γ (and the network connectivity ρ), provided that strong convexity of L is assumed see (Sun et al., 2022c, Propositions 3.6 & 3.8). This line of analysis fails in high dimensional scenarios, for the following reasons. (i) The centralized objective functions L is RSC and not strongly convex globally, and the local objective functions Li are only convex. (ii) In the typical asymptotics of high-dimensional problems d/N and s log d/N = O(1) the Lipschitz constants ℓmax of L is scale as O(d), for various predictor x i models (Wainwright, 2019), thereby diverging as d . Even postulating to be able to relax the strong convexity requirement of L to the RSC, establishing linear convergence along the path of the classical analysis will require the network connectivity ρ (or the step-size γ) to counteract the unfavorable 1/d scaling of ℓmax, ensuring ρ ℓmax = O(1), if optimal centralized statistical precision is to be achieved. This would translate in the unsatisfactory scaling of the total number of communications as log d (or d). Notice that, this issue remains even under more restrictive assumptions on the sample size, such as s log d/n = O(1). At the high level, this arises because the gradients Li(θi) s remain non-sparse vectors even when θi s are all sparse.

A new line of analysis: The challenges outlined above underscore the need of a new mechanism to control the gradient error δt in high-dimensional decentralized settings. The key observation is that the undesirable scaling of the tracking error δt with the ambient dimension d, steaming from ℓmax, is unavoidable if one pursues the statistically agnostic bound of Gt F as given in (34b), which globally relates the gradient tracking error Gt F to the consensus error. Our proposed approach capitalizes on the observation that δt primarily

depends on the component of the gradient error projected along the directions θ t+ 1

2 i bθ, which are approximately sparse vectors. This contrasts with (34b), which considers the entirety of the gradient tracking error. More precisely, our proof leverages the fact that, for any set of row-wise feasible points Θ1, Θ2, Θ3, and Θ4, the inner products

(Θ1 Θ2) ( Lj(Θ3) Lj(Θ4)), j [m] (35a)

(Θ1 Θ2) ( L(Θ3) L(Θ4)) (35b)

High-Dimensional Inference over Networks

can be controlled by quantities that do scale logarithmically with the ambient dimension at most. Therefore, in place of breaking the inner products using the Cauchy-Schwartz inequality, yielding to one-hop bounds as (34b), our approach necessitates working directly with inner product relationships. The ﬁnal result is a bound of δt that depends on the entire algorithm trajectory from the initialization. This is formalized in the next proposition.

Proposition 12 Under the setting of Proposition 11 and the local restricted smoothness Assumption 2, the gradient tracking error δt (t = 1, . . . ,) can be bounded as

1 ρ (ℓΣ + sτℓ) Θt+ 1

2 bΘ 2 F + cmℓΣ

s=0 ρt s Θs 2 F + cmℓΣ

s=0 ρt s Θs 2 F

+ cmsτℓϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cmsτℓϵ 1 t 1 X

s=0 ρt s Θs 1

+mcmρt A3 ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + cgν + ρ mcm

1 ρ τℓν2(ϵ + ϵ 1) (36)

for some C2 > 0; where ϵ > 0, Θ 1

A3 ℓΣ θ 2 + τℓ θ 2 1, cg Lj(θ ) + L(θ ) . (37)

Proof See Appendix B.

Proposition 12 reveals that the gradient tracking error δt can be bounded by the discounted cumulative sum of historical optimization errors { Θs 1/2 bΘ }t+1 s=0, historical step-length { Θs }t 1 s=0, and consensus errors { Θs }t 1 s=0. Furthermore, one can check that δt decreases if ρ, m, or the right-hand-side tolerance parameter τℓdecrease, which is a consequence of a more connected network, and larger local sample size n, respectively. We can now combine the statements obtained in Propositions 11 and 12, and write 1 1

C1sτg + C2ρ cmϵ

1 ρ (ℓΣ + sτℓ) Θt+ 1

γ + C1s(τµ + τg)

Θt bΘ 2 F 1 LΣ

Θt 2 F + C1(τµ + τg)

s=0 ρt s Θs 2 F + C2

s=0 ρt s Θs 2 F

γ cmsτℓϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + C2

γ cmsτℓϵ 1 t 1 X

s=0 ρt s Θs 1

γ mcmρt A3 ϵ 1 + s c2 gℓ 1 Σ ϵ 1 + cgν + C2

1 ρ τℓν2(ϵ + ϵ 1). (38)

Our last step bounds the consensus error Θt using { Θs }t 1 s=0, as given next. The bound deviates from (34a), as it is a function of the entirety sequence { Θs }s 0. This will allow us to exploit the term Θt 2 in (38) to control consensus errors.

Maros, Scutari, Sun, Cheng

Proposition 13 The consensus error Θt can be bounded as

Θt 2 F 2ρ2t Θ0 2 F | {z } =0

s=0 ρt s Θs 2 F . (39)

Proof The proof is standard and is provided in Appendix D for completeness.

3.2.3 Combining the error dynamics

To establish linear convergence of DGT from (38)-(39), we will employ the z-transform for ﬁnite length sequences to deal with the convolution-sum terms. Notice that small gain arguments typically used in (Nedi c et al., 2016; Sun et al., 2022c) are not directly applicable to chain the inequalities (38)-(39) in our scenario. This is because, unlike (Nedi c et al., 2016; Sun et al., 2022c), each sequence in the above inequalities is inﬂuenced by the complete historical dependencies of the others, rather than a simpler single-hop dependence. This motivates the introduction of the z transform to handle an arbitrarily long memory. We begin brieﬂy reviewing the relevant properties of the z-transform and then apply it to the system (38)-(39).

The z-transform and properties

Let {a(t)} be a sequence of nonnegative numbers. Deﬁne the length K z-transform of {a(t)} as

t=1 a(t)z t, z R++. (40)

Some properties of the z-transform instrumental to our developments are summarized next, whose proofs are provided in Appendix M.

Lemma 14 Let AK(z) be the length K z-transform of a nonnegative sequence {a(t)}, and ρ (0, 1). Then, the following hold:

(i) PK t=1 a(t + 1)z t z AK(z) a(1), z R++;

(ii) PK t=1 Pt 1 s=0 ρt sa(s) z t ρ z ρ AK(z) + a(0) , z (ρ, 1);

(iii) PK t=1 Pt 1 s=0 ρt sa(s + 1) z t z ρ z ρAK(z), z (ρ, 1).

Lemma 15 Suppose a nonnegative sequence {a(t)} satisﬁes

t=1 a(t)z t B + c

t=1 z t, for all K 1 and z ( z, 1), (41)

and some B, c > 0. Then a(t) B zt + c 1 z . (42)

When B < + , {a(t)} converges linearly at rate z up to a constant.

High-Dimensional Inference over Networks

The z-transform system

Introduce the following quantities:

t=1 z t Θt 1

2 bΘ 2, K Θ(z) =

t=1 z t Θt 2, ΘK (z) =

t=1 z t Θt 2.

(43) To apply (43) to (38) we ﬁrst bound the term Θt bΘ 2 on the RHS there as Θt bΘ 2 Θt 1

2 bΘ 2 under condition

γ µΣ C1s(τµ + τg). (44)

Further assuming

γ C1sτg + C2ρ cmϵ

1 ρ (ℓΣ + sτℓ) , (45)

we can multiply z t (z > ρ) to both sides of (38) and (39) while summing up from t = 1 to t = K and applying Lemma 14 (recall that Θ0 = 0), leading to

C1sτg + C2ρ cmϵ

1 ρ (ℓΣ + sτℓ) + C2ρ cmϵ 1sτℓ

γ + C1s(τµ + τg)

γ cmsτℓϵ 1 ρ z ρ

ϵ 2ρ 1 ρ ρ z ρ

t=1 z t + B(z),

γ cmsτℓϵ 1 ρ z ρ Θ0 bΘ 2 + C2

γ mcm A3 ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + cgν ρ z ρ

C1sτg + C2ρ cmϵ

1 ρ (ℓΣ + sτℓ) Θ1/2 bΘ 2 (47)

1 ρ τℓν2(ϵ + ϵ 1) + C1(τµ + τg)

The following conditions are suﬃcient to invoke Lemma 15 and prove the convergence of sequence {(1/m) Θt+ 1

2 bΘ 2 F }:

i) The step size condition (44) and (45) hold.

Maros, Scutari, Sun, Cheng

ii) There exists some z (ρ, 1) such that 1 1

C1sτg + C2ρ cmϵ

1 ρ (ℓΣ + sτℓ) + C2ρ cmϵ 1sτℓ

γ + C1s(τµ + τg)

γ cmsτℓϵ 1 ρ z ρ; (49a)

ϵ 2ρ 1 ρ ρ z ρ

2 > 0. (49b)

3.3 Linear convergence under RSC/RSM (Proof of Theorem 6)

The proof is organized into the following steps:

In Sec. 3.3.1 we provide choices of ϵ and γ for the existence of z > ρ satisfying (49).

Under such choices, we bound B(z) [cf. (47)] and stat [cf. (48)] in Sec. 3.3.2.

In Sec. 3.3.3 we establish conditions for the linear convergence of n 1 m Θt+ 1

t up to a residual error depending on the statistical precision bθ θ 2.

3.3.1 Choice of ϵ and γ

Lemma 16 and 17 below provide conditions and the range of z > ρ for (49b) and (49a) to hold, respectively. For convenience we set ϵ as

1 ρ 2C2cm . (50)

Lemma 16 Given γ > LΣ, condition (49b) is satisﬁed for all z > zρ, with

1 + (γ LΣ) µΣ

Proof See Appendix E.

We focus now on condition (49a). Denote for simplicity

γ + C1s(τµ+τg)

γ cmsτℓϵ 1 = γ µΣ + C1s(τµ + τg)

ℓΣ µΣ 2C2 2c2m 1 ρ sτℓ ,

γ n C1sτg + C2ρ cmϵ

1 ρ (ℓΣ + sτℓ) o

γ cmsτℓϵ 1 = γ n C1sτg + ρ

ℓΣ (ℓΣ + sτℓ) o

ℓΣ µΣ 2C2 2c2m 1 ρ sτℓ ,

D = γ µΣ + C1s(τµ + τg)

γ n C1sτg + ρ

ℓΣ (ℓΣ + sτℓ) o. (53)

High-Dimensional Inference over Networks

Note that, under (44) and (45) strictly satisﬁed, A, D > 0. Then (49a) can be rewritten succinctly as

z A + Gnet(z)

D Gnet(z), Gnet(z) ρ z ρ. (54)

Lemma 17 Given γ satisfying (44) and (45), condition (49a) is satisﬁed for all z > zσ, with

zσ σ + ρ 1 + 1

and A and D deﬁned in (52).

Proof See Appendix F.

Combining Lemma 16 and 17 we conclude that, for any given γ > LΣ, under conditions (44) and (45), (49a)-(49b) are satisﬁed by the choices of z such that

z > z, with z max{zρ, zσ}. (56)

In the remaining of this section, we will simplify the expression of z by making a speciﬁc choice of γ. We do not aim to ﬁnd the optimal value of γ but a convenient feasible choice yielding an insightful expression of the rate. We set

γ = LΣ + C3 ℓ2 Σ µΣ

c2 m ρ (1 ρ)4 , (57)

where C3 > 0 is some absolute constant (to be determined). Deﬁne

σ0 1 κ 1 + C1s(τµ + τg)/LΣ

1 C1sτg/LΣ , κ LΣ

Proposition 18 Let γ be chosen according to (57). Under the following conditions:

1 2C1sτg/LΣ > 0, (59)

µΣ > 4C1s(τµ + τg) + ℓΣ

8C2 2c2 mρ (1 ρ)2 sτℓ+ ρ µΣ

ℓΣ sτℓ, (60)

and ρ 1/2, (61)

z can be upperbounded as

2C2 C3 ρ(1 ρ) + 2C2 2 C3

σ0 + ρ 18C3c2 m ℓ2 Σ µΣLΣ + ℓΣ µΣLΣ 4C2 2c2 msτℓ+ 2µΣ ℓΣLΣ sτℓ 1 2C1sτg/LΣ

Maros, Scutari, Sun, Cheng

Proof See Appendix G.

To further simplify (62), we require the second argument in the max-expression in (62) to be no larger than 1 (2κ) 1 + C1s(τµ + τg)/LΣ

1 2C1sτg/LΣ . (63)

Corollary 19 Let γ be chosen as

γ = LΣ + 4 ℓ2 Σ µΣ

c2 m ρ (1 ρ)4 . (64)

Under the following conditions:

µΣ > 36C1s(τµ + τg), (65)

ρ 2 72c2 m ℓ2 Σ µ2 Σ + ℓΣ

µ2 Σ 4C2 2c2 msτℓ+ 2

z zup max ρ + C2ρ + C2 2 2 ρ, 1 (2κ) 1 + C1s(τµ + τg)/LΣ

1 2C1sτg/LΣ

Proof See Appendix H.

3.3.2 Bounding B(z) [cf. (47)] and stat [cf. (48)]

According to Eq. (94), under the setting of Proposition 18, any z satisfying z zup is lower bounded by

z ρ + ρ(1 ρ)3 C2 2 C3 . (68)

Substituting the expression of γ [cf. (64)] and ϵ [cf. (50)] into B(z) [cf. (47)] and stat [cf. (48)], together with (68) we obtain the following bound.

Proposition 20 Under the setting of Corollary 19 and assume C2 1, then

Θ0 bΘ 2 F + Θ 2 F + Θ1/2 bΘ 2 F + mcm ρ

cmsc2 g µΣ + cgν

5C2 2c2 m (1 ρ)2

τℓν2 + C1(τµ + τg)

LΣ ν2 stat (70)

for some C4 > 0, where cg and ν are deﬁned as

cg = max j [m] Lj(θ ) + L(θ ) and ν = 2 bθ θ 1 + 2 s bθ θ . (71)

Proof See Appendix I.

High-Dimensional Inference over Networks

3.3.3 Linear convergence up to o( bθ θ 2)

We prove Theorem 6 and Corollary 7. Recall below the network connectivity condition (16) and rate expression λ (13) given in Theorem 6 for convenience:

ρ 2 75c2 m C2 2 ℓ2 Σ µ2 Σ + ℓΣ

µ2 Σ 6C2 2c2 msτℓ

2 λ = 1 (2κ) 1 + C1s(τµ + τg)/LΣ

1 2C1sτg/LΣ .

Based on Lemma 15 and the results in Sec. 3.3.1 & 3.3.2, it suﬃces to show (16) implies (66) and zup λ < 1. (16) = (66): the claim holds since C2 > 1, ℓΣ µΣ, and cm 1. zup λ < 1: it is not diﬃcult to check that µΣ > 36C1s(τµ + τg) implies λ < 1 (4κ) 1. Next we prove

ρ + C2ρ + C2 2 2 ρ λ,

and thus zup = λ. It is suﬃcient to check that the following chain of inequalities hold:

ρ + C2ρ + C2 2 2 ρ (1 2C1sτg/LΣ)

3C2 2 ρ (16) 1 50 µ2 Σ c2mℓ2 Σ 1

2κ + C1s(τµ + τg)/LΣ, (72)

where we used ρ < 1, cm, C2 1, ℓΣ LΣ and κ 1. This completes the proof of Theorem 6. Lastly, we prove Corollary 7. Using (Agarwal et al., 2012, Lemma 5) yields bθ θ 1 2 s bθ θ . Using ρ 1/4 and conditions in (18) it is not diﬃcult to verify that stat in (12) is o bθ θ 2 .

3.4 Linear convergence for sparse vector regression (Proof of Theorem 9)

We now customize Theorem 6 for the random design setting satisfying Assumption 8. We ﬁrst state the following high probability bounds for the RSC/RSM properties.

Proposition 21 (Raskutti et al. (2010)) Given m i.i.d. distributed random matrices Xi Rn d drawn from the Σ-Gaussian ensemble. Let X = [X 1 , . . . , X m] RN d, with N = n m. Then, there exist universal constants c0, c1 > 0 such that

(Global RSC/RSM)

2 Σ1/2u 2 c1ζ log d

N 2 Σ1/2u 2 + c1ζ log d

N u 2 1, u Rd; (73)

with probability greater than 1 exp( c0N), where ζ = maxj Σjj.

By slightly modifying the proof of (Raskutti et al., 2010, Theorem 1), provided in Appendix K for completeness, we can obtain the following bound for the local RSM condition.

Maros, Scutari, Sun, Cheng

Proposition 22 Given a random matrix Xi Rn d drawn from the Σ-Gaussian ensemble, there exists universal constants c0, c1 > 0 such that

n 16m Σ1/2u 2 + c1ζ m log d

n u 2 1, u Rd, i [m] (74)

with probability greater than 1 exp( c0N), where ζ = maxj Σjj.

Next we derive a high probability bound for cg = maxj [m] Lj(θ ) + L(θ ) .

Lemma 23 Consider the linear model (1) under Assumption 8. Then

2 exp ( log d) (75)

2 exp ( log d) (76)

for some c3 > 0.

Proof See Appendix J.

According to (1) we have L(θ ) = X n/N and Lj(θ ) = X j nj/n . Using Lemma 23 we can readily obtain the following bound for c2 g.

Corollary 24 Reintate the conditions of Lemma 23; there holds:

c2 g 4ζσ2 2 c3

with probability greater than 1 4 exp ( c4 log d), for some c4 > 0.

Leveraging Theorem 6 with the parameters (µΣ, LΣ, ℓΣ), (τµ, τg, τℓ), and cg guaranteed by Proposition 21, Proposition 22, and Corollary 24 we are ready to prove Theorem 9. Speciﬁcally, let

2σmin(Σ) and τµ = c1ζ log d

N (for the RSC),

LΣ = 2σmax(Σ) and τg = c1ζ log d

N (for the RSM), (78)

ℓΣ = 16mσmax(Σ) and τℓ= c1ζm2 log d

N (for the local RSM); (79)

High-Dimensional Inference over Networks

Propositions 21 & 22 and Corollary 24 guarantee that, with the above choices for (µΣ, LΣ, ℓΣ) and (τµ, τg, τℓ), the global RSC and RSM conditions (cf. Assumption 1), the local RSM property (cf. Assumption 2), and the bound (77) on cg all hold with probability at least 1 c8 exp( c9 log d), where we invoked the union bound and used log d s log d c5 µΣ

c5N, due to ζ µΣ s log d

N < c5. The rest of the proof is to (i) show (20) implies (15)-(16) in Theorem 6 and (ii) simplify the expressions of the rate expression (13), the statistical error (12), and B in (69) using (78)-(79). The calculation is left to Appendix L.

4. Numerical Results

This section provides some numerical results that validate our theoretical ﬁndings. We consider the following problem setup. Given the distributed linear regression model (1), the unknown s-sparse vector θ is generated with the ﬁrst s elements being i.i.d. N(0, 1); the remaining coordinates are set to zero. Each row of Xi Rn d is generated according to N(0, I); each element of ni follows N(0, 0.25); and r is set r = θ 1. Unless otherwise speciﬁed, we the communication network is modeled as a Erd os-R enyi graph G(m, p), where m is the number of nodes and each edge is included in the graph with probability p, independently from the others. The speciﬁc values of problem parameters (N, d, s) and network parameters (m, p) are speciﬁed in each simulation. Aiming at validating Theorem 9, the reported experiments show the following: (i) the scalability of the algorithm with respect to the total sample size N and problem dimension d, for a ﬁxed network; and (ii) the impact of network topology on the algorithm. Finally, we validate our ﬁndings on real data sets.

4.1 Scalability with respect to problem parameters for a ﬁxed network

Theorem 9 states that, for a ﬁxed, suﬃciently connected network, the average optimization error of DGT among all agents decreases at a linear, up to a tolerance depending on the statistical precision bθ θ . Moreover, the expression of the convergence rate and the tolerance term depend exclusively on the ratio α = s log d

N . To validate numerically these theoretical ﬁndings, we conduct the following two simulations over an Erd os-R enyi graph with m = 50 nodes and edge activation probability p = 0.5. The reported results are the average of 100 trials with the same graph and independently randomly generated data sets.

(i) Fixed α and increasing dimension d: We set α 0.2 and consider three sets of problem parameters, namely: d = 5000, 10000, 20000 and s =

d . Consequently, the number of total samples N and local sample size n is set to N = 3050, 4700, 7100 and n = 61, 94, 142, respectively. Observe that N = m n with m = 50. Since the expression of the proximal parameter γ may be conservative in practice, we tune it manually as described next. We ﬁrst generate one instance of the LASSO problem with d = 5000 and run DGT using the γ that yields the fastest convergence rate. This value is selected via grid search. The same γ is used for all problem instances. Fig. 3a plots the average normalized estimation error (1/m) Pm i=1 θt i θ 2 θ 2 versus the number of iterations, for d = 5000, 10000, 20000. The dashed-line curves represent the normalized statistical precision ( bθ θ 2 θ 2 for the three sets of problems. The ﬁgure shows that, in all cases, the estimation error decays

Maros, Scutari, Sun, Cheng

0 10 20 30 40 50 60 70 80 9

m Pm i=1 θt i θ 2

d = 5000 d = 5000 d = 20000

(a) Log-normalized error for α 0.2. In these problems d {5000, 10000, 20000}, s =

d , n {61, 94, 142}.

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 6

m Pm i=1 θt i θ 2

α 0.04 α 0.2 α 1

(b) Log-normalized error for diﬀerent values of α. In these problems d = 20000, s = log(d)/2 , m = 50 and α {0.04, 0.2, 1}.

Figure 3: Scalability of DGT for ﬁxed networks. The network is composed of m = 50 agents with connectivity ρ = 0.0638. Dashed lines indicate estimation error if using bθ.

at a linear rate before reaching the statistical precision. Further, the rate is invariant to the change of d as long as the ratio α remains ﬁxed. This is exactly what Theorem 9 predicts.

(ii) Increasing α: This experiment explores the relation between the convergence rate of DGT and α. We set d = 20000, s = (log d)/2 and consider three local sample sizes, n = 1, 5, and 25. Correspondingly, we have α {1, 0.2, 0.04}. For each value of n, we choose the proximal parameter γ via grid search. The one that yields fastest convergence rate is kept ﬁxed for all independent trials. Fig. 3b shows that the algorithm s performance degrades as α increases both the convergence rate of DGT and the statistical precision get worse. Note that this behavior is in agreement with Theorem 9.

4.2 Impact of network scale and connectivity

This set of simulations investigates the scalability of DGT with respect to network parameters (m, ρ), for ﬁxed problem parameters (N, d, s). In particular, Theorem 9 established the convergence of DGT under condition ρ m 8κ 4, which requires better connectivity for a network involving more agents. To test the necessity of this condition, we run DGT on networks with increasing size m = 50, 625, 1250, 2500. Correspondingly we varied the edge connectivity probability p = 0.87, 0.4, 0.23, 0.15 in order to keep ρ roughly constant, equal to 0.18. The problem parameters are set to be N = 2500, d = 5000, s =

d . The step size is chosen via grid search and set to be the one yielding the fastest convergence. All curves reported for this simulation in one trial are conducted on the same problem instance for diﬀerent network conﬁgurations. The result is averaged over 100 trials and is reported in Fig 4a. Observe that keeping the graph s connectivity constant while increasing the number of agents leads to a degradation of the achievable rate and estimation error. Additionally, we empirically demonstrate that for Erd os-R enyi graphs with connectivity ρ = O(1/ m)) achieved by ﬁxing the network connectivity probability p DGT converges under a growing network size. The result is reported in Fig. 4b. The simulation is conducted

High-Dimensional Inference over Networks

0 10 20 30 40 50 60 70 80 90 100 8

m Pm i=1 θt i θ 2

m = 50 m = 625 m = 1250 m = 2500 Centralized solution

(a) Erd os-R enyi graphs with ρ 0.18.

0 5 10 15 20 25 30 35 40

m Pm i=1 θt i θ 2

m = 50, ρ = 0.167 m = 625, ρ = 0.052 m = 1250, ρ = 0.041 m = 2500, ρ = 0.031 Centralized solution

(b) Erd os-R enyi graphs with link probability 0.87.

Figure 4: Plot of the log-normalized estimation error for m {50, 625, 1250, 2500}. In this problem d = 5000, s =

d , n {50, 4, 2, 1} and N = 2500. All curves in each panel are obtained using the same data points. Black horizontal lines indicate the normalized estimation error if using bθ.

Erd os-R enyi (p = 0.87) Line graph m = 50 18 23775 m = 625 18 6115118 m = 1250 18 27094153 m = 2500 19 118911225

Figure 5: Number of communication rounds required for each network to satisfy ρk m 8.

under the same setting as Fig. 4a except for p set to be 0.87. The simulation suggests that the convergence condition ρ m 8κ 4 might be more restrictive than what is needed in practice, which could be an artifact of our analysis. To conclude the assessment of the algorithm performance as function of the network connectivity parameter ρ, in Table 5, we report the number of communication rounds required per iteration to satisfy the condition ρK m 8. Observe that the number remains approximately constant for Erd os-R enyi graphs while increasing dramatically for line-graphs. Similarly, in Fig. 6, we display the number of total communication rounds required to solve a problem instance with d = 5000, s =

d , m {50, 625, 1250, 2500}, and n {50, 4, 2, 1}. The results are consistent with those displayed in Table 5 and demonstrate that the quantity ρ is predictive of the number of communication rounds required for convergence.

4.3 Comparison with other distributed algorithms

We test the performance of DGT when compared to other decentralized algorithms with known guarantees in the high-dimensional regime. Speciﬁcally, we consider DGD-CTA (Ji et al., 2023a), DGD-ATC (Ji et al., 2023b), and DGD2 proposed in our companion work (Maros and Scutari, 2022). DGD-ATC, and DGD2 are known to achieve centralized statistical precision at a rate comparable that of the centralized PGD, provided the network is suﬃciently connected. Conversely, DGD-CTA tends to slow down when aiming for central-

Maros, Scutari, Sun, Cheng

0 1 2 3 4 5 6 7 8 9 8

log(communication rounds)

m Pm i=1 θt i θ 2

Erd os R enyi Graph

m=50 m = 625 m = 1250 m = 2500

(a) Erd os-R enyi graphs.

0 2 4 6 8 10 12 14 16 18 20 22 24 8

log(communication rounds)

m Pm i=1 θt i θ 2

m = 50 m = 625 m = 1250 m = 2500

(b) Line graphs.

Figure 6: Plots of the log-normalized estimation error for Erd os-R enyi and line graphs, with m {50, 625, 1250, 2500}. In this problem, d = 5000, s =

d , n {50, 4, 2, 1} and N = 2500.

ized statistical precision. Given that, except for DGD-CTA, all algorithms can potentially perform equivalently in terms of iterations contingent on meeting speciﬁc network connectivity requirements we focus on the number of communication rounds required to reach centralized statistical precision rather than just the total iteration count. For this analysis, we create problem instances with d {400, 2000, 4000, 20000}, s {5, 4, 7, 4}, and N {240, 240, 480, 360}. This yields α 0.11 for all instances. We solve each problem instance up to statistical precision using the centralized PGD and record the total number of performed iterations Tcent for each problem. In the decentralized setting, we generate a base mixing matrix W with m = 120 and p = 0.02, which results in a poorly connected base mixing matrix. For DGD-ATC, DGD2 and DGT we select a mixing matrix W = W K

with the smallest possible K that allows the algorithms to achieve centralized statistical precision in at most 2Tcent iterations. For DGD-ATC, DGD2 and DGT we measure the total communication rounds as iterations K. For DGD-CTA we set the network to be fully connected with the largest possible step-size that yields convergence to the order of centralized statistical precision and record the total number of network uses, which in this case also corresponds to the number of iterations. The results of this experiment are reported in Fig. 7. We observe that despite employing a fully connected network, DGD-CTA requires the most communication rounds of all schemes. This is because the network connectivity does not signiﬁcantly aid the performance of DGD-CTA. Meanwhile, we observe that for DGD-ATC the number of communication rounds increases moderately with the dimension d while for DGD2 and DGT the number of communication rounds remains invariant as the problem dimension increases. The statistical error achieved by DGD2 and DGT while of similar order is beneﬁcial towards DGT.

High-Dimensional Inference over Networks

Figure 7: Number of communication rounds for m = 120, d {400, 2000, 4000, 20000}, s {5, 4, 7, 4}, and N {240, 240, 480, 360}. All curves are obtained using the same data points and consist of 20 averages each.

4.4 Simulations on real data

We test the DGT on the Communities and Crime data set (UC Irvine Machine Learning Repository) where we have removed data points with missing attributes (covariates), and removed attributes corresponding to community number/name or zip-code (attributes 1 to 4 in the data set) yielding a regression problem with d = 123 and total sample size of 123 which we have split into Ntrain = 82 and Ntest=41. For the decentralized implementation, we have a network of m = 41 agents corresponding to n = 2 samples per agent. The agents communicate via a network built upon an Erd os-R enyi where the probability of an edge being present is set to p = 0.5. Because the resulting graph has connectivity ρ = 0.5236 we have agents perform three communication rounds per gradient computation yielding an eﬀective connectivity ρ = 0.1435 1 m which corresponds to the connectivity that is suﬃcient to yield centralized performance according to numerical simulations (c.f. Figure 4b). Observe that our theoretical results yield more conservative values for ρ; consequently, employing the connectivity predicted by the theory can only yield improved results. Our results are reported in Figure 8 where the performance of the centralized projected gradient descent (PGD) is reported to the left, and the performance of the DGT is reported on the right panel. Denote by (Xtrain, ytrain) the data pairs included in the training data set, and analogously by (Xtest, ytest) the data pairs included in the testing data set. The training error is measured as 1 2Ntrain Xtrainθt ytrain 2 2 where θt are iterates of PGD generated by minimizing the function 1 2Ntrain Xtrainθ ytrain 2 over the set θ 1 0.85 using step-size 0.09. The testing error is measured by 1 2Ntest Xtestθt ytest 2. The right panel reports the performance of DGT with step-size set to be 0.05. The training and testing errors are measured by 1

m Pm i=1 1 2Ntrain Xtrainθt i ytrain 2 and 1

m Pm i=1 Xtestθt i ytest 2, respectively. We report in the right panel the diﬀerence between the centralized and decentralized train and

Maros, Scutari, Sun, Cheng

Figure 8: Plots of train and test error on the Communities and Crime data set for centralized projected gradient descent (PGD) (left) and DGT (right).

test errors respectively. We conclude that the performance of centralized and decentralized schemes is practically identical (thanks to multiple but ﬁnite consensus rounds), both in training and testing errors.

5. Conclusion

We studied sparse linear regression over mesh networks. We established statistical and computational guarantees in high-dimensions for DGT, a decentralization of the PGD based on gradient tracking. We proved that (near) optimal sample complexity O(s log d/N) for the distributed estimator is achievable over networks, even when the local sample size is not suﬃcient for statistical consistency. Convergence at a linear rate was proved, which is of the same order as that of PGD. This improves on DGD-like methods (Ji et al., 2023a,b), which instead suﬀer from the speed-accuracy dilemma. Our work, together with the study of DGD (Ji et al., 2023a,b) and DGD2 (Maros and Scutari, 2022), reveal the (non)-convergence of decentralized algorithms is signiﬁcantly affected by the gradient estimator used in the local optimization step. A future investigation will be the study of other distributed methods; of particular interest is understanding the role of other forms of gradient corrections towards statistical, computation, and communication tradeoﬀs.

Acknowledgments

Maros and Scutari would like to acknowledge support for this project from the Oﬃce of Naval Research (ONR Grant N. N000142112673 and ONR Grant N. N000142412751). Cheng acknowledges support form National Science Foundation (NSF CNS 2247795), and the Oﬃce of Naval Research (ONR Grant N. N00014-22-1-2680).

High-Dimensional Inference over Networks

Appendix A. Proof of Proposition 11

First note that the optimality of θ t+ 1

2 i implies:

2 i gt i + γ θt i 0, θ 1 r, (80)

where we recall θt i θ t+ 1

2 i θt i. We use now the global RSC/RSM property of L (cf. Assumption 1) to compute the reduction of optimality gap L(θ) L(bθ) by executing step (3b).

2 i ) = L(θt i) + L(θt i) (θ t+ 1

2 i θt i) + 1

2 i θt i) X X

(RSM) L(θt i) + L(θt i) (θ t+ 1

2 i θt i) + LΣ

2 i θt i 2 + τg

2 i θt i 2 1

=L(bθ) + L(θt i) (θ t+ 1

2(θt i bθ) X X

2 i θt i 2 + τg

2 i θt i 2 1

(RSC) L(bθ) + L(θt i) (θ t+ 1

2 θt i bθ 2 + LΣ

2 θt i bθ 2 1 + τg

2 i θt i 2 1.

Letting θ = bθ in (80) and adding to the above inequality we have

2 i ) L(bθ) γ

2 i bθ 2 θt i bθ 2 + θ t+ 1

2 i θt i 2 + ( L(θt i) gt i) (θ t+ 1

2 θt i bθ 2 + LΣ

2 i θt i 2 + τµ

2 θt i bθ 2 1 + τg

2 i θt i 2 1.

Now applying the triangle inequality to the last term and Lemma 10 yields:

2 i bθ 2 θt i bθ 2 + θ t+ 1

2 i θt i 2 + ( L(θt i) gt i) (θ t+ 1

2 θt i bθ 2 + LΣ

2 i θt i 2 + τµ

2 (8s θt i bθ 2 + 2ν2)

2 (16s θ t+ 1

2 i bθ 2 + 16s θt i bθ 2 + 8ν2)

2 i bθ 2 θt i bθ 2 + θ t+ 1

2 i θt i 2 + ( L(θt i) gt i) (θ t+ 1

2 θt i bθ 2 + LΣ

2 i θt i 2 + 8sτg θ t+ 1

2 i bθ 2 + (4sτµ + 8sτg) θt i bθ 2

+ (τµ + 4τg)ν2.

Maros, Scutari, Sun, Cheng

Rearranging terms and using the fact that L(θ t+ 1

2 i ) L(bθ), yields

2 i bθ 2 1 µΣ

γ + C1s(τµ + τg)

2 i θt i 2 + 2

γ ( L(θt i) gt i) (θ t+ 1

2 i bθ) + C1(τµ + τg)

Summing over i = 1, . . . , m, completes the proof.

Appendix B. Proof of Proposition 12

The diﬀerence 1 m Pm j=1 Lj(θt i) 1

m Pm j=1 Lj(θt j) can be expressed as

j=1 Lj(θt i) 1

j=1 Lj(θt j) = 1

N X X(θt i θt) + 1

1 n X j Xj( θt θt j).

Substituting into the expression of δt we can split it into three error terms:

i=1 (θ t+ 1

N X X (θt i θt)

j=1 (θ t+ 1

n X j Xj ( θt θt j)

gt gt i (θ t+ 1

We are now ready to bound δt. For this, we substitute the expressions of Θt and Gt [cf. (28a)-(28b)] into (82). We then use the RSM (cf. Assumption 1) to upper bound δt by quantities that depend exclusively on the sequences Θt+ 1

2 bΘ and Θt . 1) Bounding T1: Iterating (28a) telescopically to t = 0, yields

Θt = (W J)Θ0 +

s=0 (W J)t s Θs, t = 1, 2, . . . , . (83)

This shows that Θt lies in the span of {Θ0 , Θ0, . . . , Θt 1}. Therefore,

j=1 ℓ(t) ij (θ0 j, ) +

s=0 ℓ(t s) ij θs j, , t = 1, 2, . . . , (84)

where ℓ(t) ij denotes the ij th element of the matrix (W J)t.

High-Dimensional Inference over Networks

Substituting (84) in the expression of T1, we can bound it as

i=1 (θ t+ 1

j=1 ℓ(t) ij θ0 j, +

s=0 ℓ(t s) ij θs j

j=1 |ℓ(t) ij | 1

s=0 |ℓ(t s) ij | 1

j=1 |ℓ(t) ij |

s=0 |ℓ(t s) ij |

where we have applied Cauchy-Schwartz and Young s inequality with ϵ > 0. Invoking the RSM (Assumption 1) along with Assumption 4 and Lemma 10, we can bound the RHS of (85) as (cf. Appendix C.1): for all t 1,

1 ρ (LΣ + 8sτg) Θt+ 1

s=0 ρt s Θs 2 F

+ cm 8sτgϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm 8sτgϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

1 ρ 4τgν2(ϵ + ϵ 1) + 1

LΣ Θ0 2 F + τg

j=1 θ0 j θ0 2 1

| {z } =0 by initialization

2) Bounding T2: Recall that

j=1 (θ t+ 1

n X j Xj ( θt θt j) | {z }

Focusing on the ij-th summand and using (84) we can write

T ij 2 = (θ t+ 1

n X j Xj m X

ℓ(t) jk θ0 k, +

ℓ(t s) jk θs k

Maros, Scutari, Sun, Cheng

Note that T ij 2 has the same form as T1 and thus following similar steps as in (??) while using the local RSM (Assumption 2), we can bound T2 as (cf. Appendix C.2): for all t 1,

1 ρ (ℓΣ + 8sτℓ) Θt+ 1

s=0 ρt s Θs 2 F

+ cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

1 ρ 4τℓν2 (ϵ + ϵ 1) + 1

ℓΣ Θ0 2 F + τℓ

k=1 θ0 k θ0 2 1

3) Bounding T3: Using (28b) we can express θ t+ 1

2 i bθ gt gt i in T3 as

2 i bθ gt gt i

ℓ(t) ij g0 j, +

ℓ(t s) ij 1

n X j Xj (θs+1 j θs j)

j=1 |ℓ(t) ij |

2 i bθ g0 j,

| {z } T3,1

j=1 |ℓ(t s) ij |

n X j Xj (θs+1 j θs j)

| {z } T3,2

Following again similar steps as in the bounds of T1 and T2, we can obtain the following bound for T3 (cf. Appendix C.3):

1 ρ (2ℓΣ + 8sτℓ) Θt+ 1

2 bΘ 2 F + cmℓΣ

s=0 ρt s Θs+1 Θs 2 F

+ cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs+1 bΘ 2 F + cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

+ mcmρt A3ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + cgν + ρ mcm

1 ρ 4τℓν2(ϵ + ϵ 1),

2 ℓΣ θ 2 + τℓ θ 2 1 + 1

mℓΣ Θ 2 F + τℓ θ 2 1 ,

cg Lj(θ ) + L(θ ) . (88)

4) Combining the bounds of T1, T2, and T3: Adding up the bounds of T1, T2, and T3, and using the following two inequalities

Θt+1 bΘ 2 F = W Θt+ 1

2 bΘ 2 F Θt+ 1

2 bΘ 2 F (89)

Θs+1 Θs 2 F (24a) = W Θs + Θs Θs 2 F 4 Θs 2 F + 2 Θs 2 F , (90)

yield the desired bound, detailed calculations are provided in Appendix C.4.

High-Dimensional Inference over Networks

Appendix C. Supplement to Appendix B

In this section, we upperbound the tracking errors T1, T2, and T3 in (82).

Appendix C.1 Bounding T1 in (85)

Invoking the RSC property (Assumption 1) and Lemma 10, we have 1

(LΣ + 8sτg) θ t+ 1

Similarly, the other two terms in (85) can be bounded respectively as 1

2 LΣ θ0 j θ0 2 + τg θ0 j θ0 2 1

2 LΣ θs j 2 + τg θ s+ 1

2 j θs j 2 1

LΣ θs j 2 + 2τg θ s+ 1

2 j bθ 2 1 + 2τg θs j bθ 2 1

LΣ θs j 2 + 16sτg θ s+ 1

2 j bθ 2 + 16sτg θs j bθ 2 + 8τgν2.

Summing up the terms and using (W J)t cmρt (due to Assumption 4), we arrive to the following bound for T1:

j=1 |ℓ(t) ij | ϵ

(LΣ + 8sτg) θ t+ 1

2 + 2τgν2 !

j=1 |ℓ(t) ij | 1

LΣ θ0 j θ0 2 + τg θ0 j θ0 2 1

s=0 |ℓ(t s) ij | ϵ

(LΣ + 8sτg) θ t+ 1

2 + 2τgν2 !

s=0 |ℓ(t s) ij | 1

LΣ θs j 2 + 16sτg θ s+ 1

2 j bθ 2 + 16sτg θs j bθ 2 + 8τgν2

2cmρt (LΣ + 8sτg) Θt+ 1

2 mcmρt 2τgν2

LΣ Θ0 2 F + τg

j=1 θ0 j θ0 2 1

s=0 cmρt s ϵ

2 (LΣ + 8sτg) Θt+ 1

s=0 mcmρt s ϵ

Maros, Scutari, Sun, Cheng

s=0 cmρt s 1

2ϵLΣ Θs 2 F +

s=0 cmρt s 1

2ϵ 16sτg Θs+ 1

s=0 cmρt s 1

2ϵ 16sτg Θs bΘ 2 F +

s=0 mcmρt s 1

Combining terms yields

1 ρ (LΣ + 8sτg) Θt+ 1

s=0 ρt s Θs 2 F

+ cm 8sτgϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm 8sτgϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

1 ρ 4τgν2(ϵ + ϵ 1) + 1

LΣ Θ0 2 F + τg

j=1 θ0 j θ0 2 1

Appendix C.2 Bounding T2 in (82)

Using the local RSM (cf. Assumption 2) and Lemma 10, we can bounds the summands in T ij 2 [cf. (86)] as

n X j Xj ℓ(t) jk θ0 k,

|ℓ(t) jk | 1 n Xj(θ t+ 1

2 i bθ) 1 n Xj(θ0 k θ0)

|ℓ(t) jk | ϵ

(ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2 + |ℓ(t) jk | 1

ℓΣ θ0 k θ0 2 + τℓ θ0 k θ0 2 1

n X j Xj ℓ(t s) jk θs k

|ℓ(t s) jk | 1 n Xj(θ t+ 1

1 n Xj θs k

|ℓ(t s) jk | ϵ

(ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2

+ |ℓ(t s) jk | 1

ℓΣ θs k 2 + 16sτℓ θ s+ 1

2 k bθ 2 + 16sτℓ θs k bθ 2 + 8τℓν2 ,

respectively. Summing over i, j, k and using (W J)t cmρt, we can bound T2 as:

k=1 |ℓ(t) jk | ϵ

(ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2

High-Dimensional Inference over Networks

k=1 |ℓ(t) jk | 1

ℓΣ θ0 k θ0 2 + τℓ θ0 k θ0 2 1

s=0 |ℓ(t s) jk | ϵ

(ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2

s=0 |ℓ(t s) jk | 1

ℓΣ θs k 2 + 16sτℓ θ s+ 1

2 k bθ 2 + 16sτℓ θs k bθ 2 + 8τℓν2 .

2 (ℓΣ + 8sτℓ) Θt+ 1

F + m2cmρt τℓν2 ϵ

2ϵℓΣ Θ0 2 F + mcmρt 1

k=1 θ0 k θ0 2 1

s=0 mcmρt s ϵ

2 (ℓΣ + 8sτℓ) Θt+ 1

s=0 m2cmρt sτℓν2 ϵ

s=0 mcmρt s 1

2ϵℓΣ Θs 2 F +

s=0 mcmρt s 1

ϵ 8sτℓ Θs+ 1

s=0 mcmρt s 1

ϵ 8sτℓ Θs bΘ 2 F +

s=0 m2cmρt s 1

1 ρ (ℓΣ + 8sτℓ) Θt+ 1

s=0 ρt s Θs 2 F

s=0 ρt s Θs+ 1

2 bΘ 2 F + mcm 1

s=0 ρt s Θs bΘ 2 F

1 ρ 4τℓν2 (ϵ + ϵ 1) + 1

ℓΣ Θ0 2 F + τℓ

k=1 θ0 k θ0 2 1

Appendix C.3 Bounding T3 in (82)

We bound ﬁrst T3,1 and T3,2. Invoking the deﬁnition of g0 j, we have

k=1 Lk(θ0 k)

2 i bθ 1 n X j (Xjθ0 j yj) 1 mn

k=1 X k (Xkθ0 k yk)

2 i bθ 1 n X j (Xjθ0 j Xjθ + Xjθ yj) 1 mn

k=1 X k (Xkθ0 k Xθ + Xθ yk)

2 i bθ 1 n X j Xj(θ0 j θ ) 1 mn

k=1 X k Xk(θ0 k θ )

Maros, Scutari, Sun, Cheng

2 i bθ Lj(θ )

2 i bθ L(θ )

2 i bθ 2 + τℓ θ t+ 1

2ϵ ℓΣ θ0 j θ 2 + τℓ θ0 j θ 2 1

2 ℓΣ θ t+ 1

2 i bθ 2 + τℓ θ t+ 1

2 i bθ 2 1 + 1

2ϵ ℓΣ θ0 k θ 2 + τℓ θ0 k θ 2 1

2 i bθ 1 ( Lj(θ ) + L(θ ) ) | {z } cg

(ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2 + 1

2ϵ ℓΣ θ0 j θ 2 + τℓ θ0 j θ 2 1

(ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2 (91)

ℓΣ Θ0 Θ 2 F + τℓ

k=1 θ0 k θ 2 1

(b) ϵ (ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2

ℓΣ θ0 j θ 2 + τℓ θ0 j θ 2 1 + 1

mℓΣ Θ0 Θ 2 F + τℓ 1 m

k=1 θ0 k θ 2 1

+ ϵℓΣ θ t+ 1

2 i bθ 2 + sc2 gℓ 1 Σ ϵ 1 + cgν

ϵ (2ℓΣ + 8sτℓ) θ t+ 1

2 i bθ 2 + 2τℓν2

ℓΣ θ0 j θ 2 + τℓ θ0 j θ 2 1 + 1

mℓΣ Θ0 Θ 2 F + τℓ 1 m

k=1 θ0 k θ 2 1

| {z } A3 + sc2 gℓ 1 Σ ϵ 1 + cgν,

where (a) follows from Lemma 10 while in (b) used the bound 2 (cg s) θ t+ 1

2 i bθ 2 + sc2 gℓ 1 Σ ϵ 1.

High-Dimensional Inference over Networks

Similarly, we can bound T3,2 as

n X j Xj (θs+1 j θs j)

2 i bθ 2 + τℓ θ t+ 1

ℓΣ θs+1 j θs j 2 + τℓ θs+1 j θs j 2 1

2 (ℓΣ + 8sτℓ) θ t+ 1

ℓΣ θs+1 j θs j 2 + 16sτℓ θs+1 j bθ 2 + 16sτℓ θs j bθ 2 + τℓν2(ϵ + 4ϵ 1)

(92) Using (91) and (92) along with (W J)t cmρt, we can bound T3 as follows:

gt gt i T (θ t+ 1

j=1 |ℓ(t) ij | T3,1 +

j=1 |ℓ(t s) ij | T3,2

j=1 |ℓ(t) ij | ϵ (2ℓΣ + 8sτℓ) θ t+ 1

j=1 |ℓ(t) ij | ϵ 2τℓν2 + ϵ 1A3 + sc2 gℓ 1 Σ ϵ 1 + cgν

j=1 |ℓ(t s) ij | ϵ

2 (ℓΣ + 8sτℓ) θ t+ 1

j=1 |ℓ(t s) ij | 1

ℓΣ θs+1 j θs j 2 + 16sτℓ θs+1 j bθ 2 + 16sτℓ θs j bθ 2

j=1 |ℓ(t s) ij |τℓν2(ϵ + 4ϵ 1)

cmρtϵ (2ℓΣ + 8sτℓ) Θt+ 1

2 bΘ 2 F + mcmρt ϵ 2τℓν2 + ϵ 1A3 + sc2 gℓ 1 Σ ϵ 1 + cgν

s=0 cmρt s ϵ

2 (ℓΣ + 8sτℓ) Θt+ 1

s=0 cmρt s 1

2ϵℓΣ Θs+1 Θs 2 F +

s=0 cmρt s 1

ϵ 8sτℓ Θs+1 bΘ 2

s=0 cmρt s 1

ϵ 8sτℓ Θs bΘ 2 +

s=0 mcmρt sτℓν2(ϵ + 4ϵ 1)

1 ρ (2ℓΣ + 8sτℓ) Θt+ 1

2 bΘ 2 F + cmℓΣ

s=0 ρt s Θs+1 Θs 2 F

Maros, Scutari, Sun, Cheng

+ cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs+1 bΘ 2 F + cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

+ mcmρt A3ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + cgν + ρ mcm

1 ρ 4τℓν2(ϵ + ϵ 1)

1 ρ (2ℓΣ + 8sτℓ) Θt+ 1

2 bΘ 2 F + cmℓΣ

s=0 ρt s Θs+1 Θs 2 F

+ cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs+1 bΘ 2 F + cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

+ mcmρt A3ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + cgν + ρ mcm

1 ρ 4τℓν2(ϵ + ϵ 1).

Appendix C.4 Bounding δt in (82)

Using the bounds on T1, T2, and T3, we can ﬁnally bound δt:

i=1 ( L(θt i) yt i) (θ t+ 1

T1 + T2 + T3

1 ρ (LΣ + 8sτg) Θt+ 1

s=0 ρt s Θs 2 F

+ cm 8sτgϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm 8sτgϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

1 ρ 4τgν2(ϵ + ϵ 1)

1 ρ (ℓΣ + 8sτℓ) Θt+ 1

s=0 ρt s Θs 2 F

+ cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

1 ρ 4τℓν2 (ϵ + ϵ 1)

1 ρ (2ℓΣ + 8sτℓ) Θt+ 1

2 bΘ 2 F + cmℓΣ

s=0 ρt s Θs+1 Θs 2 F

+ cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs+1 bΘ 2 F + cm8sτℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

+ mcmρt A3ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + 1 + ρ mcm

1 ρ 4τℓν2(ϵ + ϵ 1) + ρ mcm

1 ρ c2 gν2.

1 ρ (2ℓΣ + 8sτℓ) Θt+ 1

High-Dimensional Inference over Networks

s=0 ρt s Θs 2 F + cmℓΣ

s=0 ρt s Θs+1 Θs 2 F

+ cm24τℓϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm24τℓϵ 1 t 1 X

s=0 ρt s Θs bΘ 2 F

+ mcmρt A3 ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + 1 + ρ mcm

1 ρ 12τℓν2(ϵ + ϵ 1) + ρ mcm

1 ρ (2ℓΣ + 8sτℓ) Θt+ 1

s=0 ρt s Θs 2 F + 2cmℓΣ

s=0 ρt s Θs 2 F

+ cm24sτℓϵ 1 t 1 X

s=0 ρt s Θs+ 1

2 bΘ 2 F + cm24sτℓϵ 1 t 1 X

s=0 ρt s Θs 1

+ mcmρt A3 ϵ 1 + sc2 gℓ 1 Σ ϵ 1 + 1 + ρ mcm

1 ρ 12τℓν2(ϵ + ϵ 1) + ρ mcm

1 ρ c2 gν2,

where in (a) we used LΣ ℓΣ, τg τℓ, and (89); and in (b) we used again (89) and (90).

Appendix D. Proof of Proposition 13

Using Assumptions 3 and 4 on (83) we have

s=0 ρt s Θs .

Squaring both sides gives

s=0 ρt s Θs

2ρ2t Θ0 2 + 2

s=0 ρt s Θs

2ρ2t Θ0 2 + 2ρ 1 ρ

s=0 ρt s Θs 2,

where the last inequality follows from Cauchy-Schwartz inequality.

Maros, Scutari, Sun, Cheng

Appendix E. Proof of Lemma 16

Recall Gnet(z) = ρ z ρ. Since (49b) is quadratic in Gnet(z), we obtain the following feasible range of Gnet(z):

ϵ 2ρ 1 ρ 1 LΣ

1 + (γ LΣ) µΣ

Therefore, condition (49b) is fulﬁlled for all z in the form

z = ρ + ρ Gnet(z) ρ + 4ρ2

1 + (γ LΣ) µΣ

Appendix F. Proof of Lemma 17

We substitute Gnet(z) = ρ z ρ into (54) and solve the equation

z = A + ρ z ρ D ρ z ρ . (93)

Let denote the discriminant of the second order equation in z (54). Since

= (Dρ + ρ + A)2 4D(A 1)ρ = (Dρ + ρ A)2 + 4(A + D)ρ 0,

Eq. (93) has two real roots (z1 z2)

z1,2 = (D + 1)ρ + A p

(Dρ + ρ + A)2 4D(A 1)ρ

By inspection, we have that z1 0. Hence, (54) is satisﬁed for all z (0, z1] [z2, 1). Next we show z1 ρ and z2 ρ. As z > ρ must be satisﬁed, z [z2, 1). Suppose on the contrary that z1 > ρ, which implies that

ρ + A Dρ > p

(Dρ + ρ + A)2 4D(A 1)ρ

(ρ + A Dρ)2 > (ρ + A + Dρ)2 4D(A 1)ρ and ρ + A Dρ > 0.

After simpliﬁcation, the ﬁrst inequality becomes 4Dρ2 + 4Dρ < 0, which is impossible. Therefore we have proved that z1 ρ. As both z1 and z2 are nonnegative, we can simply upperbound z2 as z2 z1 + z2. This completes the proof.

High-Dimensional Inference over Networks

Appendix G. Proof of Proposition 18

Substituting the expression of γ given by (57) into (51) we obtain:

zρ = ρ + ρ(1 ρ)3 C2 2 C3

1 + ρ ρ (1 ρ)4

ρ + ρ(1 ρ)3 C2 2 C3

2C2 C3 ρ(1 ρ) + 2C2 2 C3

ρ(1 ρ)3 zρ.

The rest of the proof bounds zσ based on Lemma 17. Substituting γ into the expression of A and D we get

A D = LΣ µΣ + C1s(τµ + τg) + C3 ℓ2 Σ µΣ c2 m ρ (1 ρ)4

ℓΣ sτℓ+ C3 ℓ2 Σ µΣ c2m ρ (1 ρ)4

= 1 κ 1 + C1s(τµ + τg)/LΣ + C3 ℓ2 Σ µΣLΣ c2 m ρ (1 ρ)4

1 C1sτg/LΣ ρ

2 µΣ ℓΣLΣ sτℓ+ C3 ℓ2 Σ µΣLΣ c2m ρ (1 ρ)4 .

Dividing the numerator and denominator by 1 C1sτg/LΣ and by the deﬁnition of σ0:

A D = σ0 + C3 ℓ2 Σ µΣLΣ c2 m ρ (1 ρ)4 (1 C1sτg/LΣ) 1

2 µΣ ℓΣLΣ sτℓ+ C3 ℓ2 Σ µΣLΣ c2m ρ (1 ρ)4 (1 C1sτg/LΣ) 1

A D σ0 + C3 ℓ2 Σ µΣLΣ c2 m ρ (1 ρ)4 (1 C1sτg/LΣ) 1 + ρ 2κ 1 + ρ

2 µΣ ℓΣLΣ sτℓ (1 C1sτg/LΣ) 1

1 + C3 ℓ2 Σ µΣLΣ c2m ρ (1 ρ)4 (1 C1sτg/LΣ) 1

σ0 + C3 ℓ2 Σ µΣLΣ

c2 m ρ (1 ρ)4 (1 C1sτg/LΣ) 1 + ρ

κ 1 + µΣ ℓΣLΣ sτℓ

(1 C1sτg/LΣ) 1

=σ0 + ρ C3 ℓ2 Σ µΣLΣ c2 m (1 ρ)4 + 1

2 κ 1 + µΣ ℓΣLΣ sτℓ

ρ 1/2 σ0 + ρ 16C3c2 m ℓ2 Σ µΣLΣ + 1

2 κ 1 + µΣ ℓΣLΣ sτℓ

1 C1sτg/LΣ , (96)

where for A D to hold, we require 1 ρ

µΣ C1sτg + C1s(τµ + τg) + ρ

ℓΣ sτℓ. (97)

Similarly, we can bound ρ + ρ/D as

D = ρ + ρ ℓΣ µΣLΣ 2C2 2c2 m 1 ρ sτℓ

1 C1sτg/LΣ + C3 ℓ2 Σ µΣLΣ c2m ρ (1 ρ)4 n ρ 2κ 1 + ρ

2 µΣ ℓΣLΣ sτℓ o

Maros, Scutari, Sun, Cheng

ρ D ρ + ρ ℓΣ µΣLΣ 2C2 2c2 m 1 ρ sτℓ+ ρ

2 µΣ ℓΣLΣ sτℓ

1 C1sτg/LΣ ρ

2κ 1 + C3 ℓ2 Σ µΣLΣ c2m ρ (1 ρ)4 (98)

ρκ 1 1 ρ + ρ ℓΣ µΣLΣ 2C2 2c2 m 1 ρ sτℓ+ ρ

2 µΣ ℓΣLΣ sτℓ

1 2 C1sτg/LΣ + C3 ℓ2 Σ µΣLΣ c2m ρ (1 ρ)4 ,

where ρ D requires

LΣ + C3 ℓ2 Σ µΣ

c2 m ρ (1 ρ)4 ρ ℓΣ

2C2 2c2 m 1 ρ sτℓ+ C1sτg + ρ

Using the fact that ρ 1 and LΣ µΣ, it suﬃces

4C2 2c2 m 1 ρ sτℓ+ 2C1sτg + ρµΣ

ℓΣ sτℓ. (99)

Note that (60) implies both (97) and (99). Finally, we further simply the bound in (98) as follows. Multiplying both the numerator and denominator by (1 ρ)4 we get

D ρ + ρ ℓΣ µΣLΣ 2C2 2c2 msτℓ(1 ρ)3 + ρ

2 µΣ ℓΣLΣ sτℓ(1 ρ)4

2 C1sτg/LΣ)(1 ρ)4 + C3c2m ρ ℓ2 Σ µΣLΣ

Deﬁne function

(1 ρ)4 + C3c2 m ρ ℓ2 Σ µΣLΣ .

Diﬀerentiating f with respect to ρ yields

f (ρ) = 4 1

(1 ρ)3 + C3c2 m ℓ2 Σ µΣLΣ

for C3 4, due to cm 1, ℓΣ µΣ, and ℓΣ LΣ. Therefore, f(ρ) is monotonically increasing in (0, 1) and it holds

ℓΣ µΣLΣ 4C2 2c2 msτℓ+ µΣ ℓΣLΣ sτℓ 1 2C1sτg/LΣ . (100)

Combining (96) and (100), zσ can be readily bounded as

zσ σ0 + ρ 16C3c2 m ℓ2 Σ µΣLΣ + 1

2 κ 1 + µΣ ℓΣLΣ sτℓ

1 C1sτg/LΣ + ρ + ρ

ℓΣ µΣLΣ 4C2 2c2 msτℓ+ µΣ ℓΣLΣ sτℓ 1 2C1sτg/LΣ

σ0 + ρ 18C3c2 m ℓ2 Σ µΣLΣ + ℓΣ µΣLΣ 4C2 2c2 msτℓ+ 2µΣ ℓΣLΣ sτℓ 1 2C1sτg/LΣ , (101)

where in the last inequality we have used the fact that C3c2 m ℓ2 Σ µΣLΣ 1, ρ 1 and κ 1 1. To make the bound not vacuous, we need to assume 1 2C1sτg/LΣ > 0 [cf. (59)]. Combining (94) and (101) completes the proof.

High-Dimensional Inference over Networks

Appendix H. Proof of Corollary 19

Recall the deﬁnition of σ0 as in (58):

σ0 1 κ 1 + C1s(τµ + τg)/LΣ

1 C1sτg/LΣ , κ LΣ

By requiring

ρ 2κ 18C3c2 m ℓ2 Σ µΣLΣ + ℓΣ µΣLΣ 4C2 2c2 msτℓ+ 2µΣ

σ0 + ρ 18C3c2 m ℓ2 Σ µΣLΣ + ℓΣ µΣLΣ 4C2 2c2 msτℓ+ 2µΣ ℓΣLΣ sτℓ 1 2C1sτg/LΣ 1 (2κ) 1 + C1s(τµ + τg)/LΣ

1 2C1sτg/LΣ .

Choosing C3 = 4 we obtain an upper bound of RHS of (62) given by (67), under network connectivity condition . Thus far, we can conclude that (49a)-(49b) are satisﬁed by any z and γ such that z > zup, with γ and zup deﬁned in (64) and (67), respectively, as long as the following conditions hold [note that conditions (44) and (59) are already implied by (64) and (60), respectively]:

step size lower bound (45), with ϵ chosen according to (50);

RSC/RSM tolerance parameters condition (60);

and network connectivity condition (102).

Finally, we simplify these conditions. First, (45), with ϵ given by (50), is a consequence of (60) and (64). In fact, substituting the expression of γ and ϵ into (45), (45) reads

LΣ + 4 ℓ2 Σ µΣ

c2 m ρ (1 ρ)4 > C1sτg + ρ

ℓΣ (ℓΣ + sτl) = C1sτg + ρ

2LΣ ρµΣ + 8 ℓ2 Σ µΣ

c2 m ρ (1 ρ)4 > 2C1sτg + ρµΣ

Since ρ 1 and µΣ LΣ, clearly(60) implies (45). Second, we claim that (60) is implied by (102) and (65). Using the facts that cm 1 and ℓΣ/µΣ 1, (102) implies the following: ρ 1/4 and

µ2 Σ 4C2 2c2 msτℓ+ 2

Consequently, we can upperbound the second and third summands on the RHS of (60) as

8C2 2c2 mρ (1 ρ)2 sτℓ+ ρ µΣ

8C2 2c2 mρ 9/16 sτℓ+ ρ µΣ

µΣ 8C2 2c2 msτℓ+ 4µΣ

This together with (65) implies (60).

Maros, Scutari, Sun, Cheng

Appendix I. Proof of Proposition 20

We ﬁrst bound B(z). Substituting the γ [cf. (64)] and ϵ [cf. (50)] into into (47):

2C2c2 m 1 ρ ρ z ρ Θ0 bΘ 2

1 ρ + sc2 g 1 µΣ

1 ρ + cgν ρ z ρ + Θ1/2 bΘ 2

2C3c2 m ρ (1 ρ)4 Θ0 bΘ 2

LΣ mcm ρ A3 ℓΣ

1 ρ + sc2 g 1 µΣ

1 ρ + cgν C3 (1 ρ)3C2 + Θ1/2 bΘ 2.

Substituting the expression of A3 [cf. Prop. 12] in the above bound yields

2C3c2 m (1 ρ)4 Θ0 bΘ 2

+ m ρ ℓΣ µΣLΣ

2C3c2 m (1 ρ)4 ℓΣ θ 2 + τℓ θ 2 1

LΣ mcm ρ sc2 g 1 µΣ

1 ρ + cgν C3 (1 ρ)3C2 + Θ1/2 bΘ 2

2C3c2 m (1 ρ)4 Θ0 bΘ 2

+ m ρ ℓΣ µΣLΣ

2C3c2 m (1 ρ)4 ℓΣ θ 2 + sτℓ θ 2 (θ is s-sparse)

LΣ mcm ρ sc2 g 1 µΣ

1 ρ + cgν C3 (1 ρ)3C2 + Θ1/2 bΘ 2.

Since (66) is equivalent to

2 ρκ 18C3c2 m ℓ2 Σ µΣLΣ + ℓΣ µΣLΣ 4C2 2c2 msτℓ+ 2µΣ

with C3 = 4, together with the fact that κ 1, we have

ρ ℓ2 Σ µΣLΣ C3c2 m 1

36 and ρ ℓΣ µΣLΣ C2 2c2 msτℓ 1

In addition, from (66) we get ρ 1/4, and thus we obtain (69). We bound now stat. Using the expression of γ [cf. (64)] and ϵ [cf. (50)], we have

1 ρ τℓν2(ϵ + ϵ 1) + C1(τµ + τg)

1 ρ τℓν2 µΣ

1 ρ 2C2cm + ℓΣ

+ C1(τµ + τg)

2 µΣ ℓΣLΣ + ρ ℓΣ µΣLΣ

2C2 2c2 m (1 ρ)2

τℓν2 + C1(τµ + τg)

4C2 2c2 m (1 ρ)2

τℓν2 + C1(τµ + τg)

LΣ ν2 stat (given C2 1).

High-Dimensional Inference over Networks

Appendix J. Proof of Lemma 23

We ﬁrst bound maxj [m] X j nj/n . Each column of Xj is an n-dimensional Gaussian random vector with independent entries. The maximum variance of these entries is bounded by ζ = maxj [d] Σjj. Since Xj is independent of nj and the elements of nj are i.i.d. σ2sub-Gaussian, each element of X j nj is the sum of n independent sub-exponential random variables with sub-exponential norm no larger than ζσ. Applying the Bernstein s inequality and the union bound we get

2 exp c3 min t2

n + log md ,

for some c3 > 0. When log md (c3/2) n, we choose t = (2/c3) ζσ log md/n ζσ, which leads to

2 exp ( log md) .

On the other hand, when log md < (c3/2) n, we choose t = ζσ p

(2/c3) log md/n < ζσ, which leads to

2 exp ( log md) .

Combining the two cases we obtain (75). Applying the same procedure to L(θ ) , proves (76).

Appendix K. Proof of Proposition 22

The proof is a modiﬁcation of (Raskutti et al., 2010). Next, we only highlight the key diﬀerences, for completeness. Deﬁne the set

V (r) = {r Rd | Σ 1 2 v 2 = 1, v 1 r},

for ﬁxed radius r > 0. Accordingly, deﬁne the random variable

M(r, Xi) = sup v V (r)

Following similar steps as in (Raskutti et al., 2010, Lemma 1), and using the Sudakov Fernique inequality for Gaussian processes, we can bound the expectation of M(r, Xi) as

EM(r, Xi) 1 + 3ρ(Σ)

n r t(r), (104)

Maros, Scutari, Sun, Cheng

where ρ2(Σ) = maxj Σjj. Then applying the concentration inequality in the same way as in (Raskutti et al., 2010, Lemma 2) we obtain

P(M(r, Xi) > t(r) + mt(r)) 2 exp( Nt(r)2/2) 2 exp( n(( m + 1)t(r))2/8). (105)

Finally, deﬁne the event

Ti = { v Rd s.t. Σ1/2v 2 = 1 and Xiv 2/ n > 2( m + 1)t( v 1)}.

Applying (Raskutti et al., 2010, Lemma 3) with

f(v, Xi) = Xiv 2/ n, h(v) = v 1, g(r) = ( m+1)t(r), an = n/8 and µ = m,

we have P(T c i ) 1 c 2 exp( c 0N),

for some c 0 1/2. Let us use now the union bound:

i=1 P(Ti) c 2m exp( c 0N) = c 2 exp( c 0N + log m).

Xiv n 2( m + 1) Σ1/2v + 6( m + 1)ρ(Σ)

n v 1, v Rd, i [m], (106)

with probability larger than 1 c 2 exp( c 0N + log m). Since N/2 m/2 > log m, it holds 1 exp( c 0N + log m) 1 exp( c 0N), for some c 0 > 0.

Appendix L. Supplement to the proof of Theorem 9

(i) We show (20), i.e.,

s log d/N < c5 µΣ

ζ , and ρ < (c6m8κ4) 1

implies (15)-(16). Substituting the expression of τµ and τg it is not hard to verify (20) implies (15): µΣ > 36C1s(τµ + τg), with c5 suﬃciently small. Substituting τℓinto (16) we can see a suﬃcient condition for (16) is

m3κ2 + m4κ2 ζ

for some C6 > 0; (107) is implied by (20) for suﬃciently large c6 > 0. (ii) We use (78)-(79) to further simplify the expressions of the rate expression (13), the statistical error (12), and B in (69). The rate λ in (13) can be bounded as

λ 1 (2κ) 1 + 2C1c1 ζ LΣ

1 2C1c1 ζ LΣ

High-Dimensional Inference over Networks

The statistical error can be bounded as

stat = ρ 2LΣ

5C2 2c2 m (1 ρ)2

τℓν2 + C1(τµ + τg)

c1ζ m2 log d

(1 ρ)2 ζ LΣ

for some C5, c7 > 0, where in the last inequality we used c6ρm8κ4 < 1, ν2 = O(s bθ θ 2) (due to bθ θ 1 2 s bθ θ (Agarwal et al., 2012, Lemma 5)), and ρ < (3C2 2) 2

implied by (16). Finally, for B in (69), we have

Θ0 bΘ 2 + Θ 2 + Θ1/2 bΘ 2 + mcm ρ

cmsc2 g µΣ + cgν

Θ0 bΘ 2 + Θ 2 + Θ1/2 bΘ 2

where we used again bθ θ 1 2 s bθ θ .

Appendix M. Properties of The z-Transform

Proof of Lemma 14

(i): For the time-shifted sequence {a(t)}K+1 t=2 , it holds

t=1 a(t + 1)z t = z

t=1 a(t + 1)z t 1

t=1 a(t)z t + a(K + 1)z K 1 a(1)z 1 !

= z AK(z) a(1) + a(K + 1)z K z AK(z) a(1).

(ii): For the sequence {Pt 1 s=0 ρt sa(s)}K t=1, we have z (ρ, 1):

s=0 zs tρt sa(s)z s =

s=0 a(s)z s K X

t=s+1 zs tρt s ρ z ρ AK(z) + a(0) .

Maros, Scutari, Sun, Cheng

(iii): Similarly, the sequence {Pt 1 s=0 ρt sa(s + 1)}K t=1, we have

s=0 ρt sa(s + 1)z t =

s=0 a(s + 1)z s 1 K X

t=s+1 ρt sz t+s+1 z ρ z ρAK(z).

Proof of Lemma 15

Since a(t) 0, (41) implies that for all K 1 and z ( z, 1),

a(K)z K B + c

Therefore, for all z ( z, 1),

a(K) B z K + c

t=1 z K t B z K + c 1 z .

Letting z z+ concludes the proof.

Alekh Agarwal, Sahand Negahban, Martin J Wainwright, et al. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40 (5):2452 2482, 2012.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Annals of Statistics, 46(3):1352, 2018.

Raphael Berthier, Francis Bach, and Pierre Gaillard. Accelerated gossip in networks of given dimension using jacobi polynomial iterations. SIAM J. on Mathematics of Data Science, 1:24 47, 2020.

Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011.

Bernadette Charron-Bost. Geometric bounds for convergence rates of averaging algorithms. ar Xiv preprint ar Xiv:2007.04837, 2020.

Annie I. Chen and A. Ozdaglar. A fast distributed proximal-gradient method. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 601 608, October 2012.

Jianshu Chen and Ali H. Sayed. Diﬀusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289 4305, August 2012.

High-Dimensional Inference over Networks

Xi Chen, Weidong Liu, and Yichen Zhang. Quantile regression under memory constraint. The Annals of Statistics, 47(6):3244 3273, 2019.

Xueying Chen and Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655 1684, 2014.

Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120 136, June 2016.

Jianqing Fan, Dong Wang, Kaizheng Wang, and Ziwei Zhu. Distributed estimation of principal eigenspaces. Annals of statistics, 47(6):3009, 2019.

Jianqing Fan, Yongyi Guo, and Kaizheng Wang. Communication-eﬃcient accurate statistical estimation. J. of the American Statistical Association, 0(0):1 11, 2021.

Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically preconditioned accelerated gradient method for distributed optimization. In International Conference on Machine Learning, pages 4203 4227. PMLR, 2020.

Yao Ji, Gesualdo Scutari, Ying Sun, and Harsha Honnappa. Distributed sparse regression via penalization. J. of Machine Learning Research, (24):1 62, 2023a.

Yao Ji, Gesualdo Scutari, Ying Sun, and Harsha Honnappa. Distributed (ATC) gradient descent for high dimension sparse regression. IEEE Transactions on Information Theory, (69):5253 5276, 2023b.

Michael I Jordan, Jason D Lee, and Yun Yang. Communication-eﬃcient distributed statistical inference. J. of the American Statistical Association, 2018.

Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I Jordan. A scalable bootstrap for massive data. J. of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795 816, 2014.

Jason D Lee, Qiang Liu, Yuekai Sun, and Jonathan E Taylor. Communication-eﬃcient sparse regression. J. of Machine Learning Research, 18(1):115 144, 2017.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3): 50 60, 2020.

Qihang Lin and Lin Xiao. An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. In International Conference on Machine Learning, pages 73 81. PMLR, 2014.

Lester Mackey, Michael Jordan, and Ameet Talwalkar. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems, volume 24, pages 1134 1142. Curran Associates, Inc., 2011.

Maros, Scutari, Sun, Cheng

Marie Maros and Gesualdo Scutari. Dgd2: A linearly convergent distributed algorithm for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, volume 35, pages 3475 3487. PMLR, 2022.

Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-eﬃcient learning of deep networks from decentralized data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statistics, volume 54, pages 1273 1282. PMLR, 20 22 Apr 2017.

Angelia Nedic. Distributed gradient methods for convex machine learning problems in networks. IEEE Signal Processing Magazine, 10:92 101, 2020.

Angelia Nedi c and Alex Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601 615, 2014.

Angelia Nedi c and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48 61, January 2009.

Angelia Nedi c, Asuman Ozdaglar, and Pablo A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4): 922 938, April 2010.

Angelia Nedi c, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. on Optimization, 27:2597 2633, July 2016.

Angelia Nedi c, Alex Olshevsky, and G. Rabbat, Michael. Network topology and communication-computation tradeoﬀs in decentralized optimization. Proceedings of the IEEE, 106:953 976, 2018.

Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel mcmc. In Proceedings of the Thirtieth Conference on Uncertainty in Artiﬁcial Intelligence, UAI 14, page 623 632, 2014.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated gaussian designs. J. of Machine Learning Research, 11:2241 2259, 2010.

Ali H. Sayed. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning, 7:311 801, January 2014.

Kevin Scaman, Francis Bach, S ebastien Bubeck, Yin Tat Lee, and Laurent Massouli e. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3027 3036, 2017.

Zuofeng Shang and Guang Cheng. Computational limits of a distributed algorithm for smoothing spline. J. of Machine Learning Research, 18(108):1 37, 2017.

High-Dimensional Inference over Networks

Chengchun Shi, Wenbin Lu, and Rui Song. A massive data framework for M-estimators with cubic-rate. J. of the American Statistical Association, 113(524):1698 1709, 2018.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013 6023, 2015.

Ying Sun, Marie Maros, Gesualdo Scutari, and Guang Chang. High-dimensional inference over networks: Linear convergence and statistical guarantees. Arxiv preprint: ar Xiv:2201.08507, pages 1 50, Jan. 2022a. doi: https://doi.org/10.48550/ar Xiv.2201. 08507.

Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354 385, 2022b.

Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM J. on Optimization, 32(2):354 385, 2022c.

Stanislav Volgushev, Shih-Kang Chao, and Guang Cheng. Distributed inference for quantile regression processes. The Annals of Statistics, 47(3):1634 1662, 2019.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York, NY, 1st ed edition, 2019. ISBN 9781108627771.

Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Eﬃcient distributed learning with sparsity. In International Conference on Machine Learning, pages 3636 3645. PMLR, 2017.

Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of statistics, 42(6):2164, 2014.

Auzinger Wien. Iterative solution of large linear systems. Lecture Notes, TU Wien, 2011.

Lin Xiao and Tong Zhang. A proximal-gradient homotopy method for the sparse leastsquares problem. SIAM J. on Optimization, 23(2):1062 1091, 2013.

Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Transactions on Automatic Control, 63(2):434 448, 2018.

Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: Uniﬁed framework and convergence analysis. IEEE Transactions on Signal Processing, 69:3555 3570, 2021.

Kun Yuan, Bicheng Ying, Xiaochuan Zhao, and Ali H Sayed. Exact diﬀusion for distributed optimization and learning part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3):708 723, 2018.

Maros, Scutari, Sun, Cheng

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pages 592 617. PMLR, 2013a.

Yuchen Zhang, John C Duchi, and Martin J Wainwright. Communication-eﬃcient algorithms for statistical optimization. J. of Machine Learning Research, 14(1):3321 3363, 2013b.

Tianqi Zhao, Guang Cheng, and Han Liu. A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400, 2016.