# Decentralized Accelerated Proximal Gradient Descent

Haishan Ye¹ Ziang Zhou¹ Luo Luo² Tong Zhang²

¹Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
²Department of Mathematics, The Hong Kong University of Science and Technology

hsye_cs@outlook.com zhouza15@fudan.edu.cn luoluo@ust.hk tongzhang@ust.hk

Equal contribution.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Decentralized optimization has wide applications in machine learning, signal processing, and control. In this paper, we study the decentralized composite optimization problem with a non-smooth regularization term. Many proximal-gradient-based decentralized algorithms have been proposed in the past; however, these algorithms achieve neither near optimal computational complexity nor near optimal communication complexity. In this paper, we propose a new method which establishes the optimal computational complexity and a near optimal communication complexity. Our empirical study shows that the proposed algorithm outperforms existing state-of-the-art algorithms.

1 Introduction

In this paper, we consider the decentralized composite optimization problem in which $m$ agents aim to solve the composite convex problem

$$\min_{x \in \mathbb{R}^d} h(x) \triangleq f(x) + r(x), \qquad f(x) = \frac{1}{m}\sum_{i=1}^{m} f_i(x), \tag{1}$$

where each function $f_i(x)$ is the loss function of agent $i$ and is known only to that agent, and $r(x)$ is a non-smooth, convex function shared by all agents. The agents form a connected and undirected network, and can communicate with their neighbors to cooperatively solve Problem (1). Many machine learning problems can be formulated as Problem (1), such as the elastic net [29], the graphical Lasso [4], and sparse logistic regression [24].

Because centralized optimization suffers from communication bottlenecks at the central server and is less robust to network failures, decentralized optimization has become an active research topic in machine learning. Decentralized methods for solving Problem (1) have also been widely studied in the signal processing, control, and optimization communities.

Due to these wide applications, many different decentralized algorithms have been proposed in the past. We first review the existing methods for the smooth, non-proximal case where $r(x) = 0$. Some earlier primal algorithms were penalty based, and they can only reach sublinear convergence $O(1/t)$ even for strongly convex objective functions [12, 27, 7]. The methods in [20] established linear convergence for decentralized methods based on ADMM. More recently, gradient-tracking-based algorithms were proposed, all achieving linear convergence, including EXTRA [18], ESOM [11], exact diffusion [28], NIDS [9], and many others [14, 15, 3, 17, 8]. Qu & Li [15] combined the acceleration technique with gradient tracking to obtain a faster convergence rate. Some other recent works [16, 22, 5] studied dual formulations of the problem and proposed accelerated dual gradient descent to reach an optimal convergence rate for smooth problems. Recently, Ye et al. [26] proposed Mudag, which achieves the optimal computation complexity and a near optimal communication complexity for the smooth case with $r(x) = 0$.

Many gradient-tracking-based algorithms have been extended to decentralized composite optimization problems with a non-smooth regularization term, such as PG-EXTRA [19] and NIDS [9].
However, due to the non-smooth term, these algorithms can only achieve sub-linear convergence rates. Recently, the authors of [21] proposed a gradient-tracking-based method called SONATA and established a linear convergence rate under the assumption that $f(x)$ is strongly convex. In addition, Alghunaim et al. [2] proposed a primal-dual algorithm which achieves a linear convergence rate when each $f_i(x)$ is convex. The works [1, 25] recently proposed a unified framework to analyze a large group of algorithms, including EXTRA (PG-EXTRA) [18], NIDS [9], and Harnessing [14], and showed that these algorithms also achieve linear convergence rates with a non-smooth regularization term.

Despite intensive studies in the literature, the convergence rates of these previous algorithms do not match the optimal convergence rate. Moreover, the communication complexities achieved by the algorithms analyzed in the framework of Xu et al. [25] and Alghunaim et al. [1] are sub-optimal. In this paper, we propose a novel algorithm that establishes the optimal computation complexity and a near optimal communication complexity. Our method is closely related to Mudag [26], which can only handle the non-composite case with $r(x) = 0$. We summarize our contributions as follows:

1. We propose a novel decentralized primal proximal algorithm that achieves the optimal computational complexity $O\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ and an almost optimal communication complexity $O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\,\log\frac{p(M,L,\mu)}{q(M,L,\mu)}\,\log\frac{1}{\epsilon}\big)$, which matches the lower bound up to a log factor. Here $\kappa_g$ is the global condition number of $f(x)$; $M$ and $L$ are the smoothness parameters of the local $f_i(x)$ and of $f(x)$, respectively; $p$ and $q$ are polynomials in $M$, $L$, and $\mu$ of order no larger than three. To the best of our knowledge, our work gives the first near optimal decentralized proximal algorithm.

2. Our algorithm does not require each individual function to be (strongly) convex. Thus, our algorithm has a wide application range, since $f_i(x)$ may not be convex in many machine learning problems. In contrast, strong convexity of each $f_i(x)$ is required by the algorithms fitting into the framework of [25] to achieve linear convergence rates. SONATA can also achieve a linear convergence rate without this condition [21]; however, it has to construct successive convex approximation functions to deal with the non-convexity of $f_i(x)$, which incurs an extra computational burden.

2 Problem Set-up

We first introduce some properties of the objective functions that will be used in this paper. Then we introduce the gossip matrix associated with the network. Finally, we reformulate Problem (1).

2.1 Function Properties

Global L-smoothness. We say $f(x)$ is $L$-smooth if, for any $y, x \in \mathbb{R}^d$, it holds that
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2.$$

Global µ-strong convexity. We say $f(x)$ is $\mu$-strongly convex if, for any $y, x \in \mathbb{R}^d$, it holds that
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2.$$

Local M-smoothness. For each $f_i(x)$ in Eqn. (1) and any $y, x \in \mathbb{R}^d$, it holds that
$$f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{M}{2}\|y - x\|^2.$$

Local ν-strong convexity. For each $f_i(x)$ in Eqn. (1) and any $y, x \in \mathbb{R}^d$, it holds that
$$f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{\nu}{2}\|y - x\|^2.$$

Based on the smoothness and strong convexity parameters, we define the global and local condition numbers of the objective function as $\kappa_g = L/\mu$ and $\kappa_\ell = M/\nu$, respectively. It is well known that $\kappa_g \le \kappa_\ell$ and $L \le M$. Furthermore, if $f(x)$ is $\mu$-strongly convex and $r$ is convex, then one can easily check that $h(x) = f(x) + r(x)$ is also $\mu$-strongly convex.
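To make these quantities concrete, the following small numerical sketch (the quadratic losses, dimensions, and seed are illustrative assumptions, not from the paper) computes $L$, $\mu$, $M$, and $\nu$ from Hessian eigenvalues. One agent's loss is flat in one direction, so $\kappa_\ell$ is infinite while $\kappa_g$ stays finite.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 5, 4

# Hypothetical quadratic local losses f_i(x) = 0.5 * x^T H_i x.
Hs = []
for _ in range(m):
    G = rng.standard_normal((d, d))
    Hs.append(G @ G.T + 0.1 * np.eye(d))
Hs[0][-1, :] = 0.0
Hs[0][:, -1] = 0.0              # agent 0 is flat in one direction: convex but not strongly convex

H_global = np.mean(Hs, axis=0)  # Hessian of f = (1/m) * sum_i f_i

L = np.linalg.eigvalsh(H_global).max()              # global smoothness
mu = np.linalg.eigvalsh(H_global).min()             # global strong convexity
M = max(np.linalg.eigvalsh(H).max() for H in Hs)    # local smoothness
nu = min(np.linalg.eigvalsh(H).min() for H in Hs)   # local strong convexity (0 here)

kappa_g = L / mu
kappa_l = M / nu if nu > 1e-12 else float("inf")
print(kappa_g, kappa_l)         # kappa_g is finite although kappa_l is infinite
```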
2.2 Gossip Matrix

We use $W \in \mathbb{R}^{m \times m}$ to denote the gossip matrix associated with the network and $\lambda_2(W)$ to denote its second largest eigenvalue. The gossip matrix $W$ has the following important properties:

1. $W$ is symmetric, with $W_{i,j} \neq 0$ if and only if agents $i$ and $j$ are connected or $i = j$.
2. $0 \preceq W \preceq I$, $W\mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I - W) = \mathrm{span}(\mathbf{1})$.

We use $I$ to denote the $m \times m$ identity matrix and $\mathbf{1} = [1, \dots, 1]^\top \in \mathbb{R}^m$ the all-ones vector. By the above two properties, one can average the $x_i$'s across agents by several steps of local communication (that is, by multiplying by $W$). Instead of multiplying by $W$ several times directly, Liu & Morse [10] proposed a more efficient averaging scheme, described in Algorithm 2, which satisfies the following proposition.

Proposition 1. Let $\mathbf{x}_K$ be the output of Algorithm 2 and $\bar{x} = \frac{1}{m}\mathbf{1}^\top \mathbf{x}_0$. Then it holds that $\bar{x} = \frac{1}{m}\mathbf{1}^\top \mathbf{x}_K$, and
$$\|\mathbf{x}_K - \mathbf{1}\bar{x}\| \le \left(1 - \sqrt{1 - \lambda_2(W)}\right)^K \|\mathbf{x}_0 - \mathbf{1}\bar{x}\|,$$
where $\lambda_2(W)$ is the second largest eigenvalue of $W$.

2.3 Problem Reformulation

Denote by $x_i \in \mathbb{R}^d$ the local copy of the variable $x$ for agent $i$. We introduce the aggregated variable $\mathbf{x}$ and the aggregated objective function $H(\mathbf{x})$ as
$$H(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^m f_i(x_i) + \frac{1}{m}\sum_{i=1}^m r(x_i) \quad \text{with } \mathbf{x} = [x_1, \dots, x_m]^\top. \tag{2}$$
We can reformulate Problem (1) as
$$\min_{\mathbf{x} \in \mathbb{R}^{m \times d}} H(\mathbf{x}) \quad \text{subject to} \quad x_1 = x_2 = \dots = x_m. \tag{3}$$
We denote the aggregated smooth component, the aggregated non-smooth component, and $\nabla F(\mathbf{x})$ as
$$F(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^m f_i(x_i), \quad R(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^m r(x_i), \quad \nabla F(\mathbf{x}) = \frac{1}{m}[\nabla f_1(x_1), \dots, \nabla f_m(x_m)]^\top.$$
Moreover, we denote the local proximal operator and the aggregate proximal operator as
$$\mathrm{prox}_{\eta, r}(x) = \operatorname*{argmin}_{z \in \mathbb{R}^d} \; r(z) + \frac{1}{2\eta}\|z - x\|^2, \qquad \mathrm{prox}_{m\eta, R}(\mathbf{x}) = \operatorname*{argmin}_{\mathbf{z} \in \mathbb{R}^{m \times d}} \; R(\mathbf{z}) + \frac{1}{2m\eta}\|\mathbf{z} - \mathbf{x}\|^2.$$

3 Decentralized Accelerated Proximal Gradient Descent

In this section, we propose a novel decentralized proximal gradient descent algorithm achieving the optimal computational complexity and near optimal communication complexity. We describe our decentralized accelerated proximal gradient descent method (DAPG) in Algorithm 1. The use of multi-consensus (FastMix) and gradient tracking (that is, each time we track the gradient by communicating $\mathbf{s}_t + \nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t)$) is motivated by Mudag [26]. However, the actual algorithm is quite different, due to the need to handle the proximal term.

Algorithm 1 DAPG
1: Input: $x_0^{(i)} = x_0^{(j)}$ for $1 \le i, j \le m$, $\mathbf{y}_0 = \mathbf{x}_0$, $\mathbf{s}_0 = \nabla F(\mathbf{x}_0)$, $\eta = \frac{1}{L}$, $\alpha = \sqrt{\mu/L}$, and $K = \sqrt{\frac{1}{1-\lambda_2(W)}}\log\frac{p(M,L,\mu)}{q(M,L,\mu)}$, where $p, q$ are polynomials with order less than 3.
2: for $t = 0, \dots, T$ do
3:   $\mathbf{x}_{t+1} = \mathrm{FastMix}\big(\mathrm{prox}_{\eta m, R}(\mathbf{y}_t - \eta \mathbf{s}_t), K\big)$
4:   $\mathbf{y}_{t+1} = \mathrm{FastMix}\big(\mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1} - \mathbf{x}_t), K\big)$
5:   $\mathbf{s}_{t+1} = \mathrm{FastMix}\big(\mathbf{s}_t + \nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t), K\big)$
6: end for
7: Output: $\mathbf{x}_T$.

Algorithm 2 FastMix
1: Input: $\mathbf{x}_0 = \mathbf{x}_{-1}$, $K$, $W$, step size $\eta_w = 1 / \big(1 + \sqrt{1 - \lambda_2^2(W)}\big)$.
2: for $k = 0, \dots, K$ do
3:   $\mathbf{x}_{k+1} = (1 + \eta_w) W \mathbf{x}_k - \eta_w \mathbf{x}_{k-1}$;
4: end for
5: Output: $\mathbf{x}_K$.

3.1 Sketch of the Main Proof Techniques

The main idea behind DAPG is to approximate centralized accelerated proximal gradient descent by combining Nesterov's acceleration, multi-consensus, and gradient tracking. We first define the averages of the aggregated variables in Algorithm 1 as
$$\bar{x}_t = \frac{1}{m}\sum_{i=1}^m x_t^{(i)}, \quad \bar{y}_t = \frac{1}{m}\sum_{i=1}^m y_t^{(i)}, \quad \bar{s}_t = \frac{1}{m}\sum_{i=1}^m s_t^{(i)}, \quad \bar{g}_t = \frac{1}{m}\sum_{i=1}^m \nabla f_i(y_t^{(i)}), \tag{5}$$
where $x^{(i)}$ and $y^{(i)}$ denote the $i$-th rows of the matrices $\mathbf{x}$ and $\mathbf{y}$. We can regard these averaged variables as approximations of their centralized counterparts. In our method, instead of solving Problem (1) over the network directly, we minimize the reformulated objective (3). Our algorithm applies accelerated proximal gradient descent to minimize $H(\mathbf{x})$, where the acceleration helps to achieve a fast convergence rate.
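To illustrate the updates, here is a minimal NumPy sketch of Algorithms 1 and 2 (a sketch under our own assumptions, not the authors' implementation): we take $r(x) = \sigma_1\|x\|_1$, for which the aggregate proximal operator $\mathrm{prox}_{\eta m, R}$ separates into a row-wise soft-threshold, and the helper names fast_mix, soft_threshold, and grad_local are ours.

```python
import numpy as np

def fast_mix(x, W, K, lam2):
    """Algorithm 2 (FastMix): accelerated gossip averaging of the rows of x.
    x: (m, d) stacked local variables; W: (m, m) gossip matrix."""
    eta_w = 1.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))
    x_prev = x.copy()
    for _ in range(K):
        x, x_prev = (1.0 + eta_w) * (W @ x) - eta_w * x_prev, x
    return x

def soft_threshold(x, tau):
    """Row-wise prox of tau * ||.||_1 (each agent thresholds its own local copy)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def dapg(grad_local, x0, W, lam2, eta, alpha, sigma1, K, T):
    """Sketch of Algorithm 1, assuming r(x) = sigma1 * ||x||_1.
    grad_local(y) returns the (m, d) matrix whose i-th row is grad f_i(y_i);
    FastMix drives each row of s toward the average gradient (gradient tracking)."""
    x = x0.copy()
    y = x0.copy()
    g = grad_local(y)
    s = g.copy()                          # s_0: gradients at the common starting point
    for _ in range(T):
        x_new = fast_mix(soft_threshold(y - eta * s, eta * sigma1), W, K, lam2)
        y = fast_mix(x_new + (1.0 - alpha) / (1.0 + alpha) * (x_new - x), W, K, lam2)
        g_new = grad_local(y)
        s = fast_mix(s + g_new - g, W, K, lam2)   # gradient tracking step
        x, g = x_new, g_new
    return x
```

Since FastMix preserves the average of the rows (Proposition 1), the averaged iterates follow an approximate version of centralized accelerated proximal gradient descent, which is exactly the viewpoint taken in the analysis below.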
To deal with the consensus constraints $x_i = x_j$ for $1 \le i, j \le m$, we resort to the multi-consensus and gradient-tracking techniques. We then show that the variables $x_t^{(i)}$, $y_t^{(i)}$, and $s_t^{(i)}$ at agent $i$ converge to their centralized counterparts $\bar{x}_t$, $\bar{y}_t$, and $\bar{s}_t$ as $t$ increases. Once the local variables and gradients approximate their centralized counterparts, we can show that our algorithm has convergence properties similar to those of centralized accelerated proximal gradient descent. This implies that our algorithm can also achieve the optimal computational complexity [13].

Previously, multi-consensus was regarded as communication-unfriendly because, without gradient tracking, one has to increase the number of communication rounds to achieve high-precision consensus [6, 14]. However, it was shown in [26] that the combination of multi-consensus and gradient tracking leads to communication efficiency. We follow this argument and show that the proposed algorithm also achieves near optimal communication complexity in the proximal case.

3.2 Complexity Analysis

Following earlier work, we measure the computational complexity by the number of times the gradient of $f(x)$ is computed, and we measure the communication complexity by the number of local communications, each of which appears as a multiplication $W\mathbf{x}$ in our algorithm. Similar to the analysis of the accelerated proximal gradient descent method, we define the Lyapunov function
$$V_t \triangleq h(\bar{x}_t) - h(x^*) + \frac{\mu}{2}\|v_t - x^*\|^2, \tag{6}$$
where $v_t = \bar{x}_{t-1} + \frac{1}{\alpha}(\bar{x}_t - \bar{x}_{t-1})$ and $\alpha = \sqrt{\mu/L}$.

In the following lemma, we show how multi-consensus (the FastMix operation in Algorithm 2) helps to achieve consensus, and we bound the error between the local variables and their centralized counterparts.

Lemma 1. Suppose $f(x)$ is $L$-smooth and $\mu$-strongly convex. Assume each $f_i(x)$ is $M$-smooth and $r$ is a proper and lower-semicontinuous convex function. Let
$$z_t = \big[\|\mathbf{x}_t - \mathbf{1}\bar{x}_t\|, \; \|\mathbf{y}_t - \mathbf{1}\bar{y}_t\|, \; \|\mathbf{s}_t - \mathbf{1}\bar{s}_t\|\big]^\top,$$
where $\mathbf{1} = [1, \dots, 1]^\top \in \mathbb{R}^m$. It holds that
$$z_{t+1} \le A z_t + 8\rho M \sqrt{\frac{2m}{\mu}}\,\big[0, 0, \sqrt{V_t}\big]^\top, \tag{7}$$
where $\rho$ and $A$ are defined as
$$\rho \triangleq \left(1 - \sqrt{1 - \lambda_2(W)}\right)^K, \qquad A \triangleq \rho \begin{bmatrix} 0 & 2 & 2\eta \\ 2 & 4 & 4\eta \\ 2M & M(13 + 4M\eta) & 1 + 6M\eta \end{bmatrix}.$$
Furthermore, we have
$$z_{t+1} \le A^{t+1} z_0 + 8M\rho \sqrt{\frac{2m}{\mu}} \sum_{i=0}^{t} A^{t-i}\,\big[0, 0, \sqrt{V_i}\big]^\top.$$

If the spectral norm of $A$ is less than $1$ and $V_t$ converges to zero, it follows that $\|z_t\|$, which can be regarded as a measure of the total decentralized error compared to the centralized counterpart, also converges to zero. That is, the proposed method can well approximate accelerated proximal gradient descent. The following lemma shows how the Lyapunov function $V_t$ decreases in the presence of the disturbance caused by the decentralized setting.

Lemma 2. The Lyapunov function $V_t$ of Algorithm 1 defined in Eqn. (6) has the following property:
$$V_{t+1} \le (1 - \alpha)V_t + D_1 \sqrt{V_t}\,\|z_t\| + D_2 \|z_t\|^2, \tag{8}$$
where $z_t$ is defined in Lemma 1, and $D_1$ and $D_2$ are explicit constants depending only on $L$, $M$, $\mu$, $\eta$, and $m$; they are polynomials of order at most three in $L$, $M$, and $\mu$.

By Lemma 1, the convergence of $\|z_t\|$ is determined by the value of $\rho$ and by $V_t$. At the same time, Lemma 2 shows that the value of $\|z_t\|$ affects the convergence of $V_t$ in turn. In the following lemma, we give a condition ensuring $\|A\|_2 \le \frac{1}{2}$.

Lemma 3. Let $\|A\|_2$ be the spectral norm of the matrix $A$ defined in Lemma 1, and define the constant $D_3 = 9 + 6\eta + 15M + 4M^2\eta + 6M\eta$. If $\rho < \frac{1}{2D_3}$, then $\|A\|_2 < \frac{1}{2}$.
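Lemma 3 is straightforward to check numerically. The following sketch assumes our reading of the entries of $A$ from Lemma 1 and uses illustrative values of $M$ and $\eta$:

```python
import numpy as np

def spectral_norm_of_A(rho, M, eta):
    """Form the error-propagation matrix A of Lemma 1 and return ||A||_2."""
    A = rho * np.array([
        [0.0,     2.0,                        2.0 * eta],
        [2.0,     4.0,                        4.0 * eta],
        [2.0 * M, M * (13.0 + 4.0 * M * eta), 1.0 + 6.0 * M * eta],
    ])
    return np.linalg.norm(A, 2)

M, eta = 10.0, 0.1                           # illustrative values
D3 = 9 + 6 * eta + 15 * M + 4 * M ** 2 * eta + 6 * M * eta   # Lemma 3
rho = 0.9 / (2 * D3)                         # satisfies rho < 1 / (2 * D3)
print(spectral_norm_of_A(rho, M, eta) < 0.5)  # True: ||A||_2 < 1/2
```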
In the following lemma, we show that once $\alpha \le \frac{1}{2}$ and $\rho$ is chosen properly, the Lyapunov function $V_t$ converges at the rate $1 - \frac{\alpha}{2}$. This rate is slightly slower than that of vanilla accelerated proximal gradient descent, which is caused by the error introduced by the decentralized setting.

Lemma 4. Assume that $\alpha \le \frac{1}{2}$, $f(x)$ is $L$-smooth and $\mu$-strongly convex, each $f_i(x)$ is $M$-smooth, and $r(x)$ is a proper and lower-semicontinuous convex function. Then Algorithm 1 has the following convergence rate:
$$V_{t+1} \le \left(1 - \frac{\alpha}{2}\right)^{t+1} \big(V_0 + C\|z_0\|^2\big), \quad \text{with } C = \frac{D_1^2}{1 - \alpha} + 2D_2, \tag{9}$$
if $\rho$ satisfies the conditions in Lemma 3 and $\rho < \frac{\alpha}{2(D_1 D_4 + D_2 D_4^2)}$, with $D_4 = 24M\sqrt{\frac{2m}{\mu}}$. Moreover, we can bound $\|z_t\|$ as
$$\|z_t\|^2 \le 2\rho^2 D_4^2 \left(1 - \frac{\alpha}{2}\right)^{t} \big(V_0 + C\|z_0\|^2\big). \tag{10}$$

With the above core lemmas, we give the detailed computation complexity and communication complexity of our algorithm in the following theorem.

Theorem 1 (Main Theorem). Assume that $f(x)$ is $L$-smooth and $\mu$-strongly convex, each $f_i(x)$ is $M$-smooth, $\alpha \le \frac{1}{2}$, and $r$ is a proper and lower-semicontinuous convex function. Let $K$ satisfy
$$K = \sqrt{\frac{1}{1 - \lambda_2(W)}}\,\log(\rho^{-1}), \quad \text{with } \rho < \min\left\{\frac{\alpha}{2(D_1 D_4 + D_2 D_4^2)}, \frac{1}{2D_3}\right\}.$$
Then it holds that, for the output $\mathbf{x}_T$ of Algorithm 1,
$$h(\bar{x}_T) - h(x^*) \le \left(1 - \frac{\alpha}{2}\right)^T \left(h(\bar{x}_0) - h(x^*) + \frac{\mu}{2}\|\bar{x}_0 - x^*\|^2 + C\sum_{i=1}^m \|\nabla f_i(\bar{x}_0) - \nabla f(\bar{x}_0)\|^2\right).$$
To achieve $h(\bar{x}_T) - h(x^*) < \epsilon$ and $\|\mathbf{x}_T - \mathbf{1}x^*\|^2 = O(\epsilon/\mu)$, the computational and communication complexities of Algorithm 1 are
$$T = O\left(\sqrt{\kappa_g}\,\log\frac{1}{\epsilon}\right) \quad \text{and} \quad Q = O\left(\sqrt{\frac{\kappa_g}{1 - \lambda_2(W)}}\,\log\frac{p(M,L,\mu)}{q(M,L,\mu)}\,\log\frac{1}{\epsilon}\right),$$
where each $O(\cdot)$ contains a universal constant and $p(M,L,\mu)$, $q(M,L,\mu)$ are polynomials with order less than 3.

Proof. Because $\rho$ satisfies the conditions in Lemma 4, combining with the definition of $V_t$ in Eqn. (6), we obtain
$$h(\bar{x}_T) - h(x^*) \le V_T \le \left(1 - \frac{\alpha}{2}\right)^T \big(V_0 + C\|z_0\|^2\big) = \left(1 - \frac{\alpha}{2}\right)^T \left(h(\bar{x}_0) - h(x^*) + \frac{\mu}{2}\|\bar{x}_0 - x^*\|^2 + C\|z_0\|^2\right),$$
where $C$ is defined in Lemma 4. Because at $t = 0$ all nodes start from the same status, i.e., $\mathbf{x}_0 = \mathbf{1}\bar{x}_0$ and $\mathbf{y}_0 = \mathbf{1}\bar{y}_0$, we have $\|z_0\|^2 = \sum_{i=1}^m \|\nabla f_i(\bar{x}_0) - \nabla f(\bar{x}_0)\|^2$. Thus, to achieve $h(\bar{x}_T) - h(x^*) < \epsilon$, it suffices to take
$$T = 2\sqrt{\kappa_g}\,\log\frac{h(\bar{x}_0) - h(x^*) + \frac{\mu}{2}\|\bar{x}_0 - x^*\|^2 + C\sum_{i=1}^m \|\nabla f_i(\bar{x}_0) - \nabla f(\bar{x}_0)\|^2}{\epsilon} = O\left(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\right).$$
By Eqn. (10) of Lemma 4, we have $\|z_T\|^2 \le 2\rho^2 D_4^2 (1 - \frac{\alpha}{2})^T (V_0 + C\|z_0\|^2) = O(\epsilon/\mu)$. Therefore,
$$\|\mathbf{x}_T - \mathbf{1}x^*\|^2 \le 2\big(\|\mathbf{x}_T - \mathbf{1}\bar{x}_T\|^2 + \|\mathbf{1}\bar{x}_T - \mathbf{1}x^*\|^2\big) \le 2\|z_T\|^2 + \frac{4}{\mu}V_T = O(\epsilon/\mu).$$
The bound on $K$ can be obtained by Proposition 1. By the properties of $D_1, D_2, D_3, C, D_4$, it is easy to check that $D_3$ and $\frac{\alpha}{2(D_1 D_4 + D_2 D_4^2)}$ are independent of $m$. Thus, to achieve the conditions on $\rho$ in Lemmas 2, 3, and 4, $\rho$ only needs to be less than $\frac{q(M,L,\mu)}{p(M,L,\mu)}$, where $p(M,L,\mu)$ and $q(M,L,\mu)$ are polynomials of $M$, $L$, and $\mu$ with orders less than 3. By Proposition 1, we therefore only require $K \ge \sqrt{\frac{1}{1 - \lambda_2(W)}}\,\log\frac{p}{q}$. Combining with the computation complexity, we obtain the total communication complexity
$$Q = O\left(\sqrt{\frac{\kappa_g}{1 - \lambda_2(W)}}\,\log\frac{p(M,L,\mu)}{q(M,L,\mu)}\,\log\frac{1}{\epsilon}\right).$$

Remark 1. We can observe that DAPG establishes the optimal computational complexity [13] and a near optimal communication complexity which matches the lower bound up to a log term independent of $\epsilon$ [16]. This is the best computation and communication complexity that decentralized proximal algorithms can achieve. Before our work, the fastest decentralized proximal algorithm was NIDS, whose computation and communication complexities are both $O\big(\max\{\kappa_\ell, \frac{1}{1 - \lambda_2(W)}\}\log\frac{1}{\epsilon}\big)$ [9, 25]. We can observe that the complexities of DAPG are much smaller than those of NIDS.
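For intuition about the scaling in Theorem 1, a back-of-the-envelope sketch (the universal constants are suppressed and the factor $\log(p/q)$ is replaced by a placeholder value, so the outputs are order-of-magnitude only):

```python
import math

def dapg_complexity(kappa_g, gap, eps, log_pq=3.0):
    """Order-level counts from Theorem 1; gap = 1 - lambda_2(W),
    log_pq stands in for log(p(M,L,mu) / q(M,L,mu))."""
    T = math.sqrt(kappa_g) * math.log(1.0 / eps)          # gradient computations
    K = math.sqrt(1.0 / gap) * log_pq                     # FastMix steps per call
    Q = math.sqrt(kappa_g / gap) * log_pq * math.log(1.0 / eps)  # total communications
    return T, K, Q

# e.g. kappa_g = 1e4, 1 - lambda_2(W) = 0.05 (as in Section 4), eps = 1e-6:
print(dapg_complexity(1e4, 0.05, 1e-6))
```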
Figure 1: Experiment on a9a. The top row compares the computation cost (objective error versus number of gradient computations) and the bottom row compares the communication cost (versus number of communications) of DPA, PG-EXTRA, NIDS, and DAPG with K = 1, 2, 3, for σ₂ = 10⁻³, 10⁻⁴, and 10⁻⁵.

Remark 2. The complexities of DAPG depend on the global condition number $\kappa_g$ instead of $\kappa_\ell$. In contrast, the complexities of algorithms fitting into the framework of [25] are linear in $\kappa_\ell$. Note that $\kappa_\ell$ can be infinitely larger than $\kappa_g$. For example, for $f(x) = \frac{1}{2}(f_1(x) + f_2(x))$ with $x \in \mathbb{R}^2$, $f_1(x) = x_1^2$, and $f_2(x) = x_1^2 + x_2^2$, it holds that $\kappa_g = 2$ and $\kappa_\ell = \infty$. Thus, the complexities of DAPG are better than those depending on the local condition number.

Remark 3. DAPG does not require each $f_i(x)$ to be convex to achieve a fast convergence rate. Thus, DAPG has a wide application range, since $f_i(x)$ may be non-convex in some machine learning applications. SONATA can also be applied in these applications [21]. However, SONATA has to construct an SCA surrogate function of the non-convex $f_i(x)$ to deal with the local non-convexity. Furthermore, DAPG only takes a cheap proximal mapping each iteration, while SONATA has to minimize a subproblem. Hence, DAPG is a simpler and easier-to-implement algorithm.

4 Experiments

In the previous sections, we gave the theoretical analysis of our algorithm. In this section, we validate its effectiveness and computational efficiency empirically. We conduct experiments on the sparse logistic regression problem, where $f(x)$ is strongly convex. Sparse logistic regression is defined as
$$\min_x \frac{1}{m}\sum_{i=1}^m f_i(x) + \sigma_1 \|x\|_1 + \frac{\sigma_2}{2}\|x\|^2, \quad \text{with } f_i(x) = \frac{1}{n}\sum_{j=1}^n \log\big[1 + \exp(-b_j \langle a_j, x \rangle)\big], \tag{11}$$
where $a_j \in \mathbb{R}^d$ is the $j$-th input vector and $b_j \in \{-1, 1\}$ is the corresponding label.

Experiments Setting. In our experiments, we consider random networks where each pair of agents is connected with probability $p = 0.1$. We set $W = I - \frac{L}{\lambda_1(L)}$, where $L$ is the Laplacian matrix associated with a weighted graph and $\lambda_1(L)$ is its largest eigenvalue. We set $m = 100$, that is, there are 100 agents in the network. In our experiments, the gossip matrix $W$ satisfies $1 - \lambda_2(W) = 0.05$. We conduct experiments on the datasets a9a and w8a, which can be downloaded from the LIBSVM datasets. For w8a, we set $n = 497$ and $d = 300$. For a9a, we set $n = 325$ and $d = 123$.
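This network and gossip matrix can be generated as follows (a sketch: we use an unweighted Erdős–Rényi graph for simplicity, whereas the experiments use a weighted one, and the seed and helper name are ours):

```python
import numpy as np

def gossip_matrix(m=100, p=0.1, seed=0):
    """W = I - L / lambda_1(L) for an Erdos-Renyi graph (unweighted for simplicity)."""
    rng = np.random.default_rng(seed)
    A = (rng.random((m, m)) < p).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                  # symmetric adjacency, no self-loops
    Lap = np.diag(A.sum(axis=1)) - A             # graph Laplacian
    return np.eye(m) - Lap / np.linalg.eigvalsh(Lap).max()

W = gossip_matrix()
lam2 = np.sort(np.linalg.eigvalsh(W))[-2]        # second largest eigenvalue of W
print(1.0 - lam2)                                # the spectral gap 1 - lambda_2(W)
```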
We set σ₁ = 10⁻⁴ for all datasets and set σ₂ to 10⁻³, 10⁻⁴, and 10⁻⁵ to control the condition number of the objective function.

Figure 2: Experiment on w8a. The top row compares the computation cost of the algorithms and the bottom row compares the communication cost, in the same setting as Figure 1.

Comparison with Existing Works. We compare our work with the state-of-the-art algorithms PG-EXTRA [19], NIDS [9], and the Decentralized Proximal Algorithm (DPA) [2]. In the experiments, we run DAPG with K = 1, 2, 3. The parameters of all algorithms are well tuned. We report the experiment results in Figures 1 and 2. We can observe that DAPG takes a much smaller computational cost than the other algorithms, because DAPG uses Nesterov's acceleration to achieve a faster convergence rate. This matches our theoretical analysis of the computation complexity. We can further observe that the advantage of DAPG is more pronounced when σ₂ is small. This is because a small σ₂ commonly leads to a large condition number, and the computation complexity of DAPG depends on $\kappa_g$ instead of $\kappa_\ell$. DAPG also shows clear advantages over the other state-of-the-art decentralized proximal algorithms in communication cost. Although DAPG invokes FastMix three times per iteration while the other algorithms communicate only once per iteration, DAPG still requires much less communication when σ₂ is small, owing to its fast convergence rate.

5 Conclusion

In this paper, we studied the decentralized composite optimization problem with a non-smooth regularization term. We proposed a novel algorithm that achieves the optimal computation complexity and a near optimal communication complexity matching the lower bound up to a log term independent of $\epsilon$. This is the best known communication complexity that decentralized proximal gradient algorithms can achieve. Furthermore, DAPG does not require each individual function $f_i(x)$ to be convex to achieve a fast convergence rate. Hence, our algorithm has a wider range of applications in machine learning than other decentralized proximal gradient descent algorithms which require each $f_i(x)$ to be convex. Finally, the complexities of our algorithm depend on the global condition number $\kappa_g$ instead of the local one. Since $\kappa_g \le \kappa_\ell$ and $\kappa_\ell$ can be infinitely larger than $\kappa_g$, our algorithm can achieve much better performance than algorithms whose complexities depend on $\kappa_\ell$. The experiments also validate the advantages of our algorithm.

6 Broader Impact

Our work focuses on the theory of decentralized optimization and proposes a novel decentralized proximal algorithm. This will help the design of new decentralized proximal algorithms.
Because decentralized optimization has wide applications in machine learning, sensor networks, and multi-robot systems, our work may be used in these areas.

Acknowledgments and Disclosure of Funding

This work is supported by the project of Shenzhen Research Institute of Big Data (named "Automated Machine Learning") and GRF 16201320.

References

[1] Alghunaim, S. A., Ryu, E., Yuan, K., & Sayed, A. H. (2020). Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control.
[2] Alghunaim, S. A., Yuan, K., & Sayed, A. H. (2019). A linearly convergent proximal gradient algorithm for decentralized optimization. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (pp. 2844-2854).
[3] Di Lorenzo, P. & Scutari, G. (2016). NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2), 120-136.
[4] Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441.
[5] Hendrikx, H., Bach, F., & Massoulié, L. (2019). An accelerated decentralized stochastic proximal algorithm for finite sums. In Advances in Neural Information Processing Systems (pp. 952-962).
[6] Jakovetić, D., Xavier, J., & Moura, J. M. (2014). Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5), 1131-1146.
[7] Lan, G. & Monteiro, R. D. (2013). Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming, 138(1-2), 115-139.
[8] Li, B., Cen, S., Chen, Y., & Chi, Y. (2019a). Communication-efficient distributed optimization in networks with gradient tracking. arXiv preprint arXiv:1909.05844.
[9] Li, Z., Shi, W., & Yan, M. (2019b). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17), 4494-4506.
[10] Liu, J. & Morse, A. S. (2011). Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2), 160-165.
[11] Mokhtari, A., Shi, W., Ling, Q., & Ribeiro, A. (2016). A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4), 507-522.
[12] Nedic, A. & Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48-61.
[13] Nesterov, Y. (2018). Lectures on Convex Optimization, volume 137. Springer.
[14] Qu, G. & Li, N. (2017). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245-1260.
[15] Qu, G. & Li, N. (2019). Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control.
[16] Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., & Massoulié, L. (2017). Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning (pp. 3027-3036). JMLR.org.
[17] Shen, Z., Mokhtari, A., Zhou, T., Zhao, P., & Qian, H. (2018). Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In International Conference on Machine Learning (pp. 4624-4633).
[18] Shi, W., Ling, Q., Wu, G., & Yin, W. (2015a). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944-966.
[19] Shi, W., Ling, Q., Wu, G., & Yin, W. (2015b). A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22), 6013-6023.
[20] Shi, W., Ling, Q., Yuan, K., Wu, G., & Yin, W. (2014). On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7), 1750-1761.
[21] Sun, Y., Daneshmand, A., & Scutari, G. (2019). Convergence rate of distributed optimization algorithms based on gradient tracking. arXiv preprint arXiv:1905.02637.
[22] Uribe, C. A., Lee, S., Gasnikov, A., & Nedić, A. (2018). A dual approach for optimal algorithms in distributed optimization over networks. arXiv preprint arXiv:1809.00710.
[23] Xiao, L. & Boyd, S. (2004). Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1), 65-78.
[24] Xiao, L. & Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057-2075.
[25] Xu, J., Tian, Y., Sun, Y., & Scutari, G. (2020). Distributed algorithms for composite optimization: Unified and tight convergence analysis. CoRR, abs/2002.11534.
[26] Ye, H., Luo, L., Zhou, Z., & Zhang, T. (2020). Multi-consensus decentralized accelerated gradient descent. CoRR, abs/2005.00797.
[27] Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835-1854.
[28] Yuan, K., Ying, B., Zhao, X., & Sayed, A. H. (2019). Exact diffusion for distributed optimization and learning - Part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3), 708-723.
[29] Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.