# Multi-Consensus Decentralized Accelerated Gradient Descent

Journal of Machine Learning Research 24 (2023) 1–50. Submitted 10/22; Revised 7/23; Published 9/23. Editor: Ohad Shamir.

**Haishan Ye** (YEHAISHAN@XJTU.EDU.CN), Center for Intelligent Decision-Making and Machine Learning, School of Management, Xi'an Jiaotong University, Xi'an, China
**Luo Luo** (LUOLUO@FUDAN.EDU.CN), School of Data Science, Fudan University, Shanghai, China
**Ziang Zhou** (20071642R@CONNECT.POLYU.HK), Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
**Tong Zhang** (TONGZHANG@TONGZHANG-ML.ORG), Computer Science & Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China

©2023 Haishan Ye, Luo Luo, Ziang Zhou, and Tong Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1210.html.

**Abstract.** This paper considers the decentralized convex optimization problem, which has a wide range of applications in large-scale machine learning, sensor networks, and control theory. We propose novel algorithms that achieve optimal computation complexity and near-optimal communication complexity. Our theoretical results give affirmative answers to the open problem of whether there exists an algorithm whose communication complexity (nearly) matches the lower bound depending on the global condition number instead of the local one. Furthermore, the linear convergence of our algorithms depends only on the strong convexity of the global objective and does not require the local functions to be convex. The design of our methods relies on a novel integration of well-known techniques including Nesterov's acceleration, multi-consensus, and gradient tracking. Empirical studies show that our methods outperform existing approaches in machine learning applications.

**Keywords:** consensus optimization, decentralized algorithm, accelerated gradient descent, gradient tracking, composite optimization

## 1. Introduction

In this paper, we consider the decentralized optimization problem, where the objective function is composed of $m$ local functions $f_i(x)$ located on $m$ different agents. The agents form a connected and undirected network, and each of them accesses only its local function and communicates only with its neighbors. All of the agents cooperate to solve the convex optimization problem

$$\min_{x \in \mathbb{R}^d} h(x) \triangleq f(x) + r(x) \quad\text{with}\quad f(x) \triangleq \frac{1}{m}\sum_{i=1}^{m} f_i(x), \tag{1}$$

where $f(x)$ is $L$-smooth and $\mu$-strongly convex, and $r(x)$ is convex but possibly non-differentiable. Many machine learning models have the form (1), such as logistic regression and elastic-net regression. Decentralized optimization has been widely studied and applied in many applications such as large-scale machine learning (Tsianos et al., 2012; Kairouz et al., 2021), automatic control (Bullo et al., 2009; Lopes and Sayed, 2008), wireless communication (Ribeiro, 2010), and sensor networks (Rabbat and Nowak, 2004; Khan et al., 2009).

Many decentralized optimization algorithms have been proposed. One class is primal-only methods, including decentralized gradient methods (Nedic and Ozdaglar, 2009; Yuan et al., 2016), decentralized accelerated gradient methods (Jakovetić et al., 2014; Qu and Li, 2019), and EXTRA (Shi et al., 2015b; Li et al., 2019; Mokhtari and Ribeiro, 2016). They access only the gradients of $f_i(x)$ and are usually computationally efficient.
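Problem (1) is the standard composite form; as a concrete instance, a minimal NumPy sketch of ℓ1-regularized (sparse) logistic regression split across $m$ agents might look as follows (all function and variable names here are illustrative, not from the paper's implementation):

```python
import numpy as np

def local_logistic_loss(x, A_i, b_i):
    """Smooth local term f_i(x): average logistic loss on agent i's data."""
    z = b_i * (A_i @ x)
    return np.mean(np.log1p(np.exp(-z)))

def f(x, data):
    """Global smooth part f(x) = (1/m) * sum_i f_i(x)."""
    return np.mean([local_logistic_loss(x, A_i, b_i) for A_i, b_i in data])

def r(x, lam=0.1):
    """Convex, non-differentiable regularizer r(x) (here: an l1 penalty)."""
    return lam * np.abs(x).sum()

def h(x, data, lam=0.1):
    """Composite objective h(x) = f(x) + r(x) from problem (1)."""
    return f(x, data) + r(x, lam)
```

In the decentralized setting, no agent ever evaluates `f` directly; each agent only computes its own `local_logistic_loss` and exchanges iterates with its neighbors.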
Another class of algorithms is the dual-based decentralized algorithms, such as dual subgradient ascent (Terelius et al., 2011), dual gradient ascent and its accelerated version (Scaman et al., 2017; Uribe et al., 2020), primal-dual methods (Lan et al., 2020; Scaman et al., 2018; Hong et al., 2017), and ADMM (Erseghe et al., 2011; Shi et al., 2014). However, dual-based algorithms commonly incur more computation cost when the gradient of the dual function is not explicitly available.

There are several important open problems in the area of decentralized optimization. First, Scaman et al. (2017, 2019) raised the problem of whether there exists an algorithm with a (near) optimal communication complexity depending on the global condition number $\kappa_g = L/\mu$ instead of the local condition number $\kappa_\ell$ (defined in Eq. (7)). Since the data distributed on different agents can be quite different, the global condition number $\kappa_g$ could be much smaller than $\kappa_\ell$. In the extreme case, a local function $f_i$ may be non-strongly convex; then it is possible that $\kappa_\ell$ is infinitely large while $\kappa_g$ is still small. However, existing works only achieved the optimal computation and communication complexities with respect to the local condition number $\kappa_\ell$ in the case of $r(x) = 0$ (Kovalev et al., 2020; Li and Lin, 2021; Song et al., 2023; Scaman et al., 2017).

Furthermore, it is unclear whether the convexity of each individual $f_i(x)$ is essential for computation-efficient and communication-efficient decentralized algorithms. Most existing algorithms with linear convergence rates, such as EXTRA (Shi et al., 2015b) and OPAC (Kovalev et al., 2020), require each $f_i(x)$ to be (strongly) convex. Sun et al. (2022) first proposed a linearly convergent algorithm that allows some individual functions to be non-convex. However, their algorithm cannot achieve the (near) optimal computation and communication complexities.
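To make the gap between the two condition numbers concrete, here is a toy one-dimensional sketch (purely illustrative numbers): each local function is a quadratic $f_i(x) = \frac{a_i}{2}x^2$, two of the $a_i$ are negative, so the local strong-convexity constant $\nu = \min_i \nu_i$ is negative and $\kappa_\ell = M/\nu$ is meaningless, while the average $f$ is still strongly convex with a small $\kappa_g$:

```python
import numpy as np

# Local curvatures for f_i(x) = 0.5 * a_i * x^2; two agents are non-convex.
a = np.array([4.0, 3.0, -1.0, -2.0])

mu = a.mean()        # strong convexity of f(x) = (1/m) sum_i f_i(x): 1.0
L = a.mean()         # smoothness of f (a one-dimensional quadratic): also 1.0
M = np.abs(a).max()  # local smoothness constant: 4.0
nu = a.min()         # local "strong convexity": -2.0, i.e. no local convexity

kappa_g = L / mu     # global condition number: 1.0, perfectly conditioned
# kappa_l = M / nu is undefined (nu <= 0): analyses that depend on the local
# condition number simply do not apply, yet f itself is strongly convex.
```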
Finally, existing methods cannot achieve the optimal computation and (near) optimal communication complexities for non-differentiable $r(x)$ (Xu et al., 2021; Alghunaim et al., 2020, 2019; Sun et al., 2022). How to design a computation- and communication-efficient accelerated decentralized proximal gradient descent method is still an open question.

This paper addresses the theoretical issues discussed above and designs two novel decentralized algorithms, Prox Mudag and Mudag. We summarize our contributions as follows:

1. Our algorithms have the optimal computation complexity $O\big(\sqrt{\kappa_g}\log(1/\epsilon)\big)$ and the near-optimal communication complexity $O\big(\sqrt{\kappa_g/(1-\lambda_2(W))}\,\log(M\kappa_g/L)\log(1/\epsilon)\big)$, where $M$ and $L$ are the smoothness parameters of $f_i(x)$ and $f(x)$ respectively. To the best of our knowledge, this is the first (near) optimal decentralized algorithm whose complexity depends on the global condition number, which provides an affirmative answer to the open problem of whether there exists an algorithm that can achieve a communication complexity of $O\big(\sqrt{\kappa_g/(1-\lambda_2(W))}\log(1/\epsilon)\big)$, or even close to it (Scaman et al., 2017).

2. Our algorithms do not require each individual function to be convex. Hence, they can be used in a wider range of applications than existing optimal decentralized algorithms. For example, the sub-problem of fast PCA by the shift-and-invert method is non-convex.

3. The proposed Prox Mudag achieves optimal computation and (near) optimal communication complexity when $r(x)$ is convex but non-differentiable. To the best of our knowledge, it obtains the best-known communication complexity for decentralized strongly convex optimization problems with a composite objective function.

## 2. Related Work

We first review the penalty-based algorithms. Nedic and Ozdaglar (2009) proposed the well-known decentralized gradient descent method, where each agent performs a consensus step and a gradient descent step with a fixed step size related to the penalty parameter. Yuan et al.
(2016) proved the convergence rate of decentralized gradient descent and showed how the penalty parameter affects the computation complexity. To avoid the diminishing step size commonly required in penalty-based algorithms, Jakovetić et al. (2014) combined multi-consensus and Nesterov's acceleration to achieve the optimal computation complexity for minimizing non-strongly convex functions. Berahas et al. (2018) proposed to use multi-consensus to balance computation and communication complexity. Recently, Li et al. (2020b) proposed APM-C, which employs multi-consensus and properly increases the penalty parameter at each iteration. Combined with Nesterov's acceleration, APM-C achieves a linear convergence rate and a low communication complexity. Li et al. (2020a) applied multi-consensus to the network Newton method to achieve computation and communication efficiency.

Dual-based methods are another important research line. These methods introduce a Lagrangian function and work in the dual space. There are different ways to solve the reformulated problem, such as the gradient descent method (Terelius et al., 2011), the accelerated gradient method (Scaman et al., 2017; Uribe et al., 2020), primal-dual methods (Lan et al., 2020; Scaman et al., 2018), and ADMM (Shi et al., 2014; Erseghe et al., 2011). However, such methods are typically computationally inefficient. For example, using the accelerated gradient method to solve the dual counterpart of the decentralized optimization problem can achieve optimal communication complexity (Scaman et al., 2017; Uribe et al., 2020), but its computation complexity then carries an additional dependency on the eigenvalue gap of the gossip matrix (Uribe et al., 2020).

The gradient-tracking method is a popular way to reduce the computational cost (Qu and Li, 2017; Xu et al., 2015; Qu and Li, 2019; Di Lorenzo and Scutari, 2016, 2015; Sun et al., 2022; Nedic et al., 2017; Zhu and Martínez, 2010).
There are two different techniques for gradient tracking. One of them keeps a variable that estimates the average gradient and uses this estimate in the gradient descent step (Sun et al., 2022; Di Lorenzo and Scutari, 2016; Qu and Li, 2017). The other introduces two different weight matrices to track the difference of gradients (Shi et al., 2015b; Li et al., 2019). Recently, several works (Nedic et al., 2017; Li and Lin, 2020; Jakovetić, 2018; Xu et al., 2021) studied the connection between these two strategies and showed that they can be transformed into each other. Due to the tracking of history information, gradient-tracking based algorithms can achieve linear convergence rates for strongly convex objective functions (Qu and Li, 2017; Shi et al., 2015b; Nedic et al., 2017; Sun et al., 2022). However, the previously obtained convergence rates and communication complexities are much worse than the results in this paper.

| Method | Computation | Communication | Is each $f_i(x)$ convex? |
|---|---|---|---|
| Acc-DNGD (Qu and Li, 2019) | $O\big(\frac{\kappa_\ell^{5/7}}{(1-\lambda_2(W))^{1.5}}\log\frac{1}{\epsilon}\big)$ | $O\big(\frac{\kappa_\ell^{5/7}}{(1-\lambda_2(W))^{1.5}}\log\frac{1}{\epsilon}\big)$ | Yes |
| NIDS (Li et al., 2019) | $O\big(\big(\kappa_\ell+\frac{1}{1-\lambda_2(W)}\big)\log\frac{1}{\epsilon}\big)$ | $O\big(\big(\kappa_\ell+\frac{1}{1-\lambda_2(W)}\big)\log\frac{1}{\epsilon}\big)$ | Yes |
| ADA (Uribe et al., 2020) | $O\big(\frac{\kappa_\ell}{\sqrt{1-\lambda_2(W)}}\log^2\frac{1}{\epsilon}\big)$ | $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | Yes |
| APM-C (Li et al., 2020b) | $O\big(\sqrt{\kappa_\ell}\log\frac{1}{\epsilon}\big)$ | $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log^2\frac{1}{\epsilon}\big)$ | Yes |
| Acc-EXTRA (Li and Lin, 2020) | $\tilde O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | $\tilde O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | Yes |
| OPAC (Kovalev et al., 2020) | $O\big(\sqrt{\kappa_\ell}\log\frac{1}{\epsilon}\big)$ | $O\big(\sqrt{\frac{\kappa_\ell}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | Yes |
| Mudag (Algorithm 1) | $O\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ | $\tilde O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | No |
| Lower bound (Scaman et al., 2017) | $\Omega\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ | $\Omega\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | — |

Table 1: Complexity comparisons between our algorithm and existing works for smooth and strongly convex problems, that is, $r(x) = 0$ in Problem (1). The notation $O(\cdot)$ hides constant terms, and $\tilde O(\cdot)$ also hides log terms that are independent of $\epsilon$.
The communication complexities of existing works commonly depend on the local condition number $\kappa_\ell$ (Scaman et al., 2017; Li et al., 2020b, 2019; Li and Lin, 2021; Kovalev et al., 2020). Only EXTRA (Shi et al., 2015b) and DIGing (Nedic et al., 2017) achieve computation and communication complexities depending on $\hat\kappa_g$ (defined in Eq. (7)), which is still worse than our results that depend on $\kappa_g$ and $\log\hat\kappa_g$. Due to the fact that $\kappa_g \le \hat\kappa_g \le m\kappa_g$, a communication complexity depending on $\kappa_g$ is preferred in real applications.¹ We summarize the results for the case of $r(x) = 0$ in Table 1.

Acc-DNGD is the algorithm most relevant to ours: it also utilizes Nesterov's acceleration and gradient tracking (Qu and Li, 2019). However, the multi-consensus step in our algorithm mimics centralized accelerated gradient descent more closely and makes the convergence analysis almost the same as the standard analysis (Nesterov, 2018) in the centralized scenario. In contrast, Acc-DNGD does not have such a property, and it achieves neither near-optimal computation complexity nor near-optimal communication complexity. Since our algorithm effectively approximates centralized accelerated gradient descent, it does not require each individual function $f_i(x)$ to be convex but only requires $f(x)$ to be strongly convex, while this condition is required by Acc-DNGD. Finally, the convergence rate of our algorithm depends on the global condition number $\kappa_g$, while that of Acc-DNGD depends on the local condition number $\kappa_\ell$.

¹ It holds that $\kappa_g = \Omega(\kappa_\ell)$ for the instance used to prove the lower bound of communication complexity (Scaman et al., 2017).

| Method | Computation | Communication | Is each $f_i(x)$ convex? |
|---|---|---|---|
| NIDS² (Li et al., 2019) | $O\big(\big(\kappa_\ell+\frac{1}{1-\lambda_2(W)}\big)\log\frac{1}{\epsilon}\big)$ | $O\big(\big(\kappa_\ell+\frac{1}{1-\lambda_2(W)}\big)\log\frac{1}{\epsilon}\big)$ | Yes |
| D2P2 (Alghunaim et al., 2019) | $O\big(\frac{\kappa_\ell}{1-\lambda_2(W)}\log\frac{1}{\epsilon}\big)$ | $O\big(\frac{\kappa_\ell}{1-\lambda_2(W)}\log\frac{1}{\epsilon}\big)$ | Yes |
| Prox Mudag (Algorithm 3) | $O\big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\big)$ | $\tilde O\big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\log\frac{1}{\epsilon}\big)$ | No |

Table 2: Complexity comparisons between our algorithm and existing works for composite and strongly convex problems.

In Scaman et al. (2017), a lower bound on the communication complexity of the decentralized optimization problem was obtained, which is $\Omega\big(\sqrt{\kappa_g/(1-\lambda_2(W))}\log(1/\epsilon)\big)$ for strongly convex problems. A dual-based algorithm was proposed to match the lower bound. However, this method is suitable only for cases where the dual functions of the local agents are easy to compute. Hence, the computation complexity of the method in Scaman et al. (2017) deteriorates severely once the dual functions are computationally expensive to work with. Recently, Uribe et al. (2020) proposed an accelerated dual ascent algorithm which achieves the same communication complexity as Scaman et al. (2017), but with a computation complexity of $O\big(\kappa_\ell/\sqrt{1-\lambda_2(W)}\,\log^2(1/\epsilon)\big)$.

Li and Lin (2020) proposed Acc-EXTRA by applying Catalyst to accelerate EXTRA. However, due to the lack of multi-consensus, Acc-EXTRA fails to achieve the optimal computation complexity; on the other hand, its communication complexity is also no better than that of Mudag. Furthermore, Catalyst introduces an additional loop of iteration. In contrast, Mudag is simple and easy to implement. Kovalev et al. (2020) proposed OPAC, a primal-dual based algorithm. The computation and communication complexities of OPAC are $O\big(\sqrt{\kappa_\ell}\log(1/\epsilon)\big)$ and $O\big(\sqrt{\kappa_\ell/(1-\lambda_2(W))}\log(1/\epsilon)\big)$ respectively, which depend on the local condition number. Additionally, it requires each $f_i(x)$ to be strongly convex.
For the case where $r(x)$ is convex but non-differentiable, many gradient-tracking based algorithms have been extended to decentralized composite optimization problems with a non-differentiable regularization term, such as PG-EXTRA (Shi et al., 2015a) and NIDS (Li et al., 2019). However, due to the non-differentiable term, these algorithms can only achieve sub-linear convergence rates. Recently, Sun et al. (2022) proposed a gradient-tracking based method called SONATA and established a linear convergence rate under the assumption that $f(x)$ is strongly convex. In addition, Alghunaim et al. (2019) proposed a primal-dual algorithm which can achieve a linear convergence rate when each $f_i(x)$ is convex. Recently, Alghunaim et al. (2020) and Xu et al. (2021) proposed a unified framework to analyze a large group of algorithms. They showed that algorithms including EXTRA (PG-EXTRA) (Shi et al., 2015b), NIDS (Li et al., 2019), and Harnessing (Qu and Li, 2017) can also achieve linear convergence rates with a non-differentiable regularization term. Despite intensive studies in the literature, the convergence rates of these previous algorithms do not match the optimal convergence rate. Moreover, the communication complexities achieved by algorithms analyzed in the framework of Xu et al. (2021) and Alghunaim et al. (2020) are sub-optimal.

The conference version of our paper proposed DAPG, which achieves the optimal computation complexity and near-optimal communication complexity (Ye et al., 2020). However, DAPG takes three multi-consensus steps per iteration, while Prox Mudag in this paper takes only two. Thus, Prox Mudag achieves better communication efficiency than DAPG.

² Li et al. (2019) only gave a sublinear convergence rate for NIDS when $r(x)$ is convex; the linear convergence rate was proved in later works (Alghunaim et al., 2020; Xu et al., 2021).
We compare our Prox Mudag with existing state-of-the-art decentralized algorithms for composite optimization in Table 2.

## 3. Preliminaries

We let $x_i \in \mathbb{R}^d$ be the local copy of the variable $x$ for agent $i$, and we introduce the aggregate variable $\mathbf{x} \in \mathbb{R}^{m\times d}$, the aggregate objective function $F(\mathbf{x})$, and the aggregate gradient $\nabla F(\mathbf{x}) \in \mathbb{R}^{m\times d}$ as

$$F(\mathbf{x}) = \sum_{i=1}^{m} f_i(x_i) \quad\text{and}\quad \nabla F(\mathbf{x}) = \big[\nabla f_1(x_1)^\top;\ \dots;\ \nabla f_m(x_m)^\top\big]. \tag{2}$$

We denote

$$\bar x_t = \frac{1}{m}\sum_{i=1}^{m} x_t^{(i)}, \qquad \bar y_t = \frac{1}{m}\sum_{i=1}^{m} y_t^{(i)}, \qquad \bar g_t = \frac{1}{m}\sum_{i=1}^{m} \nabla f_i\big(y_t^{(i)}\big), \tag{3}$$

where $x^{(i)}$ denotes the $i$-th row of matrix $\mathbf{x}$. Moreover, we use $\|\cdot\|$ to denote the Frobenius norm of a vector or matrix and $\langle x, y\rangle$ to denote the inner product of vectors $x$ and $y$. Furthermore, we denote the aggregate regularizer as

$$R(\mathbf{x}) = \sum_{i=1}^{m} r\big(x^{(i)}\big). \tag{4}$$

Accordingly, we introduce the proximal operator and the aggregated proximal operator with respect to $r(\cdot)$ and $R(\cdot)$ as

$$\mathrm{prox}_{\eta,r}(x) = \arg\min_{z\in\mathbb{R}^d} \Big\{ r(z) + \frac{1}{2\eta}\|z-x\|^2 \Big\} \quad\text{and}\quad \mathrm{prox}_{m\eta,R}(\mathbf{x}) = \arg\min_{\mathbf{z}\in\mathbb{R}^{m\times d}} \Big\{ R(\mathbf{z}) + \frac{1}{2m\eta}\|\mathbf{z}-\mathbf{x}\|^2 \Big\}.$$

Using the above notation, we define the (aggregated) generalized gradients as

$$\mathcal{G}_t = \eta^{-1}\big(\mathbf{y}_t - \mathrm{prox}_{\eta m,R}(\mathbf{y}_t - \eta \mathbf{s}_t)\big) \quad\text{and}\quad \mathcal{G}_t^{(i)} = \eta^{-1}\big(y_t^{(i)} - \mathrm{prox}_{\eta,r}(y_t^{(i)} - \eta s_t^{(i)})\big). \tag{5}$$

We also denote

$$\bar{\mathcal{G}}_t = \frac{1}{m}\sum_{i=1}^{m} \mathcal{G}_t^{(i)}. \tag{6}$$

Then we introduce the following definitions, which will be used throughout the paper:

- $f(x)$ is $L$-smooth if for all $x, y \in \mathbb{R}^d$: $f(y) \le f(x) + \langle \nabla f(x), y-x\rangle + \frac{L}{2}\|y-x\|^2$.
- $f(x)$ is $\mu$-strongly convex if for all $x, y \in \mathbb{R}^d$: $f(y) \ge f(x) + \langle \nabla f(x), y-x\rangle + \frac{\mu}{2}\|y-x\|^2$.
- $f_i(x)$ is locally $M_i$-smooth if for all $x, y \in \mathbb{R}^d$: $f_i(y) \le f_i(x) + \langle \nabla f_i(x), y-x\rangle + \frac{M_i}{2}\|y-x\|^2$.
- $f_i(x)$ is locally $\nu_i$-strongly convex if for all $x, y \in \mathbb{R}^d$: $f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y-x\rangle + \frac{\nu_i}{2}\|y-x\|^2$.

Based on the smoothness and strong convexity, we define the global and local condition numbers of the objective function as

$$\kappa_g = \frac{L}{\mu}, \qquad \hat\kappa_g = \frac{M}{\mu}, \qquad \kappa_\ell = \frac{M}{\nu}, \tag{7}$$

where

$$M = \max_{i\in\{1,\dots,m\}} M_i \quad\text{and}\quad \nu = \min_{i\in\{1,\dots,m\}} \nu_i. \tag{8}$$

It is easy to verify that

$$L \le M \quad\text{and}\quad \kappa_g \le \hat\kappa_g \le \kappa_\ell. \tag{9}$$

For the topology of the network, we let $W$ be the weight matrix associated with the network, indicating how agents are connected to each other. We assume that the weight matrix $W$ has the following properties:
1. $W$ is symmetric, with $W_{i,j} \ne 0$ if and only if agents $i$ and $j$ are connected or $i = j$;
2. $0 \preceq W \preceq I$, $W\mathbf{1} = \mathbf{1}$, and $\mathrm{null}(I-W) = \mathrm{span}(\mathbf{1})$;

where $I$ denotes the $m\times m$ identity matrix and $\mathbf{1} = [1,\dots,1]^\top \in \mathbb{R}^m$ denotes the all-ones vector. The weight matrix has the important property that $\lim_{K\to\infty} W^K = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Xiao and Boyd, 2004). Thus, one can achieve the effect of averaging the local $x_i$ across agents by multiplying by $W$ repeatedly. Instead of directly multiplying by $W$, Liu and Morse (2011) proposed a more efficient way to achieve averaging, described in Algorithm 2 (FastMix), which has the following important property.

**Proposition 1** Let $\mathbf{x}_K$ be the output of Algorithm 2 with $\eta_w = 1/\big(1+\sqrt{1-\lambda_2^2(W)}\big)$, and denote $\bar x = \frac{1}{m}\mathbf{1}^\top \mathbf{x}_0$. Then it holds that

$$\bar x = \frac{1}{m}\mathbf{1}^\top \mathbf{x}_K \quad\text{and}\quad \big\|\mathbf{x}_K - \mathbf{1}\bar x\big\| \le \Big(1-\sqrt{1-\lambda_2(W)}\Big)^K \big\|\mathbf{x}_0 - \mathbf{1}\bar x\big\|,$$

where $\lambda_2(W)$ is the second largest eigenvalue of $W$.

## 4. Multi-Consensus Decentralized Accelerated Gradient Descent

In this section, we propose two novel decentralized algorithms achieving the optimal computation complexity and near-optimal communication complexity. The two algorithms target the case $r(x) = 0$ and the case of general convex $r(x)$, respectively.

**Algorithm 1: Mudag**
1. **Input:** $x_0$, $\eta$, $\alpha$, and $K$
2. **Initialization:** $\mathbf{x}_0 = \mathbf{1}x_0$, $\mathbf{y}_0 = \mathbf{x}_0$
3. $\mathbf{x}_1 = \mathrm{FastMix}(\mathbf{y}_0 - \eta\nabla F(\mathbf{y}_0), K)$
4. $\mathbf{y}_1 = \mathbf{x}_1 + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_1 - \mathbf{x}_0)$
5. **for** $t = 1, \dots, T$ **do**
6. &nbsp;&nbsp;&nbsp; $\mathbf{x}_{t+1} = \mathrm{FastMix}\big(\mathbf{y}_t + (\mathbf{x}_t - \mathbf{y}_{t-1}) - \eta(\nabla F(\mathbf{y}_t) - \nabla F(\mathbf{y}_{t-1})),\ K\big)$
7. &nbsp;&nbsp;&nbsp; $\mathbf{y}_{t+1} = \mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1} - \mathbf{x}_t)$
8. **end for**
9. **Output:** $\bar x_T$

**Algorithm 2: FastMix**
1. **Input:** $\mathbf{x}_0$, $K$, $W$, and $\eta_w$
2. $\mathbf{x}_{-1} = \mathbf{x}_0$
3. **for** $k = 0, \dots, K$ **do**
4. &nbsp;&nbsp;&nbsp; $\mathbf{x}_{k+1} = (1+\eta_w)W\mathbf{x}_k - \eta_w \mathbf{x}_{k-1}$
5. **end for**
6. **Output:** $\mathbf{x}_K$

### 4.1 Algorithms and Main Ideas

Our algorithms are based on multi-consensus, gradient tracking, and Nesterov's acceleration. We first introduce Prox Mudag (Algorithm 3) for solving the general composite problem (1).
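A minimal NumPy sketch of Algorithm 2 and Proposition 1 (the cycle-graph weight matrix below is an illustrative choice, not from the paper):

```python
import numpy as np

def fast_mix(x0, K, W, eta_w):
    """Accelerated gossip (Algorithm 2): iterate
    x_{k+1} = (1 + eta_w) * W @ x_k - eta_w * x_{k-1} for K rounds."""
    x_prev, x = x0, x0
    for _ in range(K):
        x_prev, x = x, (1.0 + eta_w) * (W @ x) - eta_w * x_prev
    return x

# Lazy-walk weight matrix of a 4-cycle: symmetric, doubly stochastic, 0 <= W <= I.
m = 4
W = 0.5 * np.eye(m) + 0.25 * (np.roll(np.eye(m), 1, axis=0)
                              + np.roll(np.eye(m), -1, axis=0))
lam2 = np.sort(np.linalg.eigvalsh(W))[-2]          # second largest eigenvalue
eta_w = 1.0 / (1.0 + np.sqrt(1.0 - lam2**2))       # as in Proposition 1

x0 = np.array([[4.0], [0.0], [0.0], [0.0]])        # local values, average = 1
xK = fast_mix(x0, 60, W, eta_w)                    # all entries approach 1
```

Each round preserves the average exactly (since $W$ is doubly stochastic), while the deviation from consensus contracts geometrically at rate $1 - \sqrt{1-\lambda_2(W)}$, as stated in Proposition 1.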
It has the following algorithmic procedure:

$$\mathbf{x}_{t+1} = \mathrm{prox}_{\eta m,R}(\mathbf{y}_t - \eta\mathbf{s}_t), \tag{10}$$
$$\mathbf{y}_{t+1} = \mathrm{FastMix}\Big(\mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1}-\mathbf{x}_t),\ K\Big), \tag{11}$$
$$\mathbf{s}_{t+1} = \mathrm{FastMix}\big(\mathbf{s}_t + \nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t),\ K\big), \tag{12}$$

where $\eta$ is the step size and $K$ is the number of steps in multi-consensus. We can observe that Eqs. (10) and (11) fit the algorithmic framework of accelerated proximal gradient descent, since $\mathbf{s}_t$ approximates the average gradient. In Eq. (12), we introduce $\mathbf{s}_t$ to track the gradient by using history information and gradient differences. Thus, $\mathbf{s}_t$ can well approximate the average gradient $\mathbf{1}\bar g_t$ (with $\bar g_t$ defined in Eq. (3)). Furthermore, the variable $\mathbf{y}_t$ can also approximate $\mathbf{1}\bar y_t$ well via the FastMix operator. Since both $\mathbf{y}_t$ and $\mathbf{s}_t$ approximate their averages well, we obtain $\bar g_t \approx \nabla f(\bar y_t)$. Thus, the convergence properties of our algorithm are similar to those of centralized accelerated proximal gradient descent, which is the main idea behind our approach to decentralized optimization. In other words, we combine multi-consensus with gradient tracking to approximate centralized accelerated proximal gradient descent. As we will show, this seemingly simple idea leads to a near-optimal algorithm for decentralized optimization. Note that Algorithm 3 takes only two multi-consensus steps per iteration; in contrast, the algorithm in the conference version of this paper (Ye et al., 2020, Algorithm 1) requires three multi-consensus steps per round.

**Algorithm 3: Prox Mudag**
1. **Input:** $x_0$, $\eta$, $\alpha$, $K$
2. **Initialization:** $\mathbf{x}_0 = \mathbf{1}x_0$, $\mathbf{y}_0 = \mathbf{x}_0$, $\mathbf{s}_0 = \nabla F(\mathbf{x}_0)$
3. **for** $t = 0, \dots, T$ **do**
4. &nbsp;&nbsp;&nbsp; $\mathbf{x}_{t+1} = \mathrm{prox}_{\eta m,R}(\mathbf{y}_t - \eta\mathbf{s}_t)$
5. &nbsp;&nbsp;&nbsp; $\mathbf{y}_{t+1} = \mathrm{FastMix}\big(\mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1}-\mathbf{x}_t),\ K\big)$
6. &nbsp;&nbsp;&nbsp; $\mathbf{s}_{t+1} = \mathrm{FastMix}\big(\mathbf{s}_t + \nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t),\ K\big)$
7. **end for**
8. **Output:** $\bar x_T$

Though reducing one multi-consensus step does not improve the order of the communication complexity, it requires much less communication cost and is beneficial in real applications. In the case of $r(x) = 0$, we propose Mudag (Algorithm 1), which needs only one multi-consensus step per iteration.
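A minimal NumPy sketch of one Prox Mudag round, Eqs. (10)-(12), assuming $r(x) = \lambda\|x\|_1$ so that the proximal step is elementwise soft-thresholding; `W_mix` stands in for the $\mathrm{FastMix}(\cdot, K)$ operator, and the quadratic local losses used in testing are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise
    (this is Eq. (10) specialized to r(x) = lam * ||x||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_mudag_step(x, y, s, grad_F, W_mix, eta, alpha, lam):
    """One round of Eqs. (10)-(12); x, y, s are (m, d) stacked local iterates."""
    g_old = grad_F(y)
    x_new = soft_threshold(y - eta * s, eta * lam)                     # Eq. (10)
    y_new = W_mix(x_new + (1 - alpha) / (1 + alpha) * (x_new - x))     # Eq. (11)
    s_new = W_mix(s + grad_F(y_new) - g_old)                           # Eq. (12)
    return x_new, y_new, s_new
```

Because any doubly stochastic `W_mix` preserves row averages, the tracking identity $\bar s_{t+1} = \bar s_t + \bar g_{t+1} - \bar g_t$ (Lemma 8 below) holds exactly along the iteration, regardless of how many mixing steps are used.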
Mudag has the following algorithmic procedure:

$$\mathbf{x}_{t+1} = \mathrm{FastMix}\big(\mathbf{y}_t + (\mathbf{x}_t - \mathbf{y}_{t-1}) - \eta(\nabla F(\mathbf{y}_t) - \nabla F(\mathbf{y}_{t-1})),\ K\big), \tag{13}$$
$$\mathbf{y}_{t+1} = \mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1} - \mathbf{x}_t).$$

To understand Mudag from the perspective of gradient tracking, we can reformulate the above procedure in a form similar to Eqs. (10)-(12) as follows (the reformulation is proved in Lemma 23):

$$\mathbf{x}_{t+1} = \mathrm{FastMix}(\mathbf{y}_t - \eta\mathbf{s}_t,\ K), \tag{14}$$
$$\mathbf{y}_{t+1} = \mathbf{x}_{t+1} + \frac{1-\alpha}{1+\alpha}(\mathbf{x}_{t+1}-\mathbf{x}_t), \tag{15}$$
$$\mathbf{s}_{t+1} = \mathrm{FastMix}(\mathbf{s}_t, K) + \big(\nabla F(\mathbf{y}_{t+1}) - \nabla F(\mathbf{y}_t)\big) - \eta^{-1}\big(\mathrm{FastMix}(\mathbf{y}_t, K) - \mathbf{y}_t\big). \tag{16}$$

Note that Eq. (16) is an explicit gradient-tracking step similar to Eq. (12). Comparing Eqs. (14)-(16) with Eqs. (10)-(12), we observe that the two algorithms share a similar procedure, since they share the same intuition. However, the iteration of Prox Mudag cannot be reduced to one multi-consensus step like Mudag. If we directly replace Eq. (13) by

$$\mathbf{x}_{t+1} = \mathrm{FastMix}\Big(\mathrm{prox}_{\eta m,R}\big(\mathbf{y}_t + (\mathbf{x}_t - \mathbf{y}_{t-1}) - \eta(\nabla F(\mathbf{y}_t) - \nabla F(\mathbf{y}_{t-1}))\big),\ K\Big),$$

it is easy to check that the algorithm cannot converge to the optimum. Because Mudag takes only one multi-consensus step per iteration while Prox Mudag takes two, Mudag commonly requires much less communication cost in practice when it is applicable. Thus, Mudag is a better choice than Prox Mudag in the case of $r(x) = 0$.

### 4.2 Main Results

In this work, we focus on the synchronized setting, in which the computation complexity depends on the number of gradient calls and the communication complexity depends on the rounds of local communication. We give detailed upper complexity bounds for our algorithms in the following theorems.

**Theorem 2** Let $f(x)$ be $L$-smooth and $\mu$-strongly convex, and assume each $f_i(x)$ is $M$-smooth. Set $\eta = 1/L$ and $\alpha = \sqrt{\mu\eta}$ in Algorithm 1, and let $K$ in Algorithm 1 satisfy

$$K \ge \sqrt{\frac{1}{1-\lambda_2(W)}}\,\log\frac{1}{\rho} \quad\text{with}\quad \rho \le \min\Big\{\frac{1}{43},\ \frac{9L^3}{288\,M^3}\Big\}.$$

Then the sequence $\{\bar x_t\}$ satisfies

$$f(\bar x_T) - f(x^*) \le (1-\alpha)^T \Big( f(\bar x_0) - f(x^*) + \frac{\mu}{2}\|\bar x_0 - x^*\|^2 + O\Big(\frac{1}{m}\sum_{i=1}^{m} \big\|\nabla f_i(\bar x_0) - \nabla f(\bar x_0)\big\|^2\Big) \Big),$$

where $x^*$ is the global minimum of $f(x)$.
To achieve $\bar x_T$ such that $\|\mathbf{x}_T - \mathbf{1}x^*\|^2 = O(m\epsilon/\mu)$ and $f(\bar x_T) - f(x^*) \le \epsilon$, the computation and communication complexities of Algorithm 1 are at most

$$T = O\Big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\Big) \quad\text{and}\quad Q = O\Big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\,\log\frac{M\kappa_g}{L}\,\log\frac{1}{\epsilon}\Big),$$

respectively.

**Theorem 3** Let $f(x)$ be $L$-smooth and $\mu$-strongly convex, and assume each $f_i(x)$ is $M$-smooth. Set $\eta = 1/(2L)$ and $\alpha = \sqrt{\mu\eta}$ in Algorithm 3, and let $K$ in Algorithm 3 satisfy

$$K \ge \sqrt{\frac{1}{1-\lambda_2(W)}}\,\log\frac{1}{\rho} \quad\text{with}\quad \rho \le \frac{L^6}{5.5\times 10^8\,\kappa_g^{3/2}\,M^6}.$$

Then the sequence $\{\bar x_t\}$ generated by Algorithm 3 satisfies

$$h(\bar x_T) - h(x^*) \le (1-\alpha)^T \Big( h(\bar x_0) - h(x^*) + \frac{\mu}{2}\|\bar x_0 - x^*\|^2 + O\Big(\frac{1}{m}\sum_{i=1}^{m} \big\|\nabla f_i(\bar x_0) - \nabla f(\bar x_0)\big\|^2\Big) \Big),$$

where $x^*$ is the global minimum of $h(x)$. To achieve $\bar x_T$ such that $\|\mathbf{x}_T - \mathbf{1}x^*\|^2 = O(m\epsilon/\mu)$ and $h(\bar x_T) - h(x^*) \le \epsilon$, the computation and communication complexities of Algorithm 3 are at most

$$T = O\Big(\sqrt{\kappa_g}\log\frac{1}{\epsilon}\Big) \quad\text{and}\quad Q = O\Big(\sqrt{\frac{\kappa_g}{1-\lambda_2(W)}}\,\log\frac{M\kappa_g}{L}\,\log\frac{1}{\epsilon}\Big),$$

respectively.

**Remark 4** Theorem 2 shows that Mudag achieves the same order of computation complexity as centralized Nesterov's accelerated gradient descent. At the same time, the communication complexity nearly matches the known lower bound for the decentralized optimization problem, up to a factor of $\log(M\kappa_g/L)$. We conjecture that it may be possible to remove the $\log(\kappa_g)$ factor, because this term only comes from the inequality $\|\bar y_t - x^*\| \le \sqrt{2V_t/\mu}$, where $V_t$ is defined in Eq. (17) in the proof, and this inequality may be loose.

**Remark 5** Theorems 2 and 3 only assume that $f(x)$ is $\mu$-strongly convex and $L$-smooth and that each $f_i(x)$ is $M$-smooth (note that, unlike many previous works, our dependency on $M$ is only logarithmic). Thus, our algorithms can be used when $f_i(x)$ is possibly non-convex. This kind of problem has been widely studied in recent years (Allen-Zhu, 2018; Garber et al., 2016), and one important example is fast PCA by the shift-and-invert method (Garber et al., 2016). In contrast, previous works (Scaman et al., 2017; Li et al., 2020b, 2019; Qu and Li, 2019; Kovalev et al., 2020; Li and Lin, 2021) require the (strong) convexity of each $f_i(x)$ to obtain a linear convergence rate.
**Remark 6** The computation and communication complexities of our algorithms depend on $\kappa_g$ rather than $\kappa_\ell$, which is a novel result. Before our work, it was unknown whether there exists a decentralized algorithm that can achieve a communication complexity close to the lower bound $\Omega\big(\sqrt{\kappa_g/(1-\lambda_2(W))}\log(1/\epsilon)\big)$ (Scaman et al., 2017, 2019).

**Remark 7** Observe that step 3 of Algorithm 1 resorts to multi-consensus and gradient tracking to encourage the rows $\mathbf{x}^{(i)}$ on different agents to be close to each other. Similarly, for the centralized distributed optimization problem, a consensus step is also needed, often implemented by two rounds of communication between the agents and the central server. In this view, centralized and decentralized optimization methods differ only in the way they achieve consensus, and we can regard decentralized optimization methods as approximations of their centralized counterparts.

## 5. Convergence Analysis

In this section, we give a detailed characterization of how our decentralized algorithms approximate accelerated (proximal) gradient descent. Since Mudag and Prox Mudag rely on similar ideas, we only present the convergence analysis of Prox Mudag in this section and defer the analysis of Mudag to Appendix C. Note that the analysis of Prox Mudag is somewhat more involved than that of Mudag because of the additional proximal step.

We first introduce the Lyapunov function

$$V_t = h(\bar x_t) - h(x^*) + \frac{\mu}{2}\|v_t - x^*\|^2, \tag{17}$$

where $v_t$ is defined as

$$v_t = \bar x_{t-1} + \frac{1}{\alpha}(\bar x_t - \bar x_{t-1}) \quad\text{with}\quad \alpha = \sqrt{\mu\eta}. \tag{18}$$

In the rest of this section, we will show how the Lyapunov function $V_t$ converges and how multi-consensus and gradient tracking help us approximate centralized accelerated proximal gradient descent. We first show that $\bar x_t$, $\bar y_t$, $\bar g_t$ (defined in Eq. (3) and generated by Algorithm 3) and $v_t$ (defined in Eq. (18)) fit into the framework of centralized accelerated proximal gradient descent.
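As a sanity check on this framework, the contraction of $V_t$ can be observed numerically in the exact centralized limit ($m = 1$, so all consensus-error terms vanish): the sketch below runs accelerated gradient descent on an illustrative two-dimensional quadratic and tracks $V_t$ from Eqs. (17)-(18) with $r(x) = 0$ and $x^* = 0$.

```python
import numpy as np

# Centralized sanity check of the Lyapunov contraction (the consensus-error
# perturbations of Lemma 11 are identically zero here): run accelerated
# gradient descent on f(x) = 0.5 * x^T diag(d) x and track V_t.
d = np.array([1.0, 4.0])          # curvatures: mu = 1, L = 4
eta = 1.0 / d.max()               # step size 1/L
alpha = np.sqrt(d.min() * eta)    # alpha = sqrt(mu * eta) = 1/2

def grad(x):
    return d * x

x_prev = np.array([1.0, 1.0])     # x_0
y = x_prev.copy()                 # y_0 = x_0
x = y - eta * grad(y)             # x_1
V = []
for t in range(1, 20):
    v_t = x_prev + (x - x_prev) / alpha                      # Eq. (18)
    V.append(0.5 * (d * x**2).sum() + 0.5 * d.min() * (v_t**2).sum())  # Eq. (17)
    y = x + (1 - alpha) / (1 + alpha) * (x - x_prev)
    x_prev, x = x, y - eta * grad(y)
```

On this instance every recorded step satisfies $V_{t+1} \le (1-\alpha)V_t$, matching the contraction that Lemma 11 establishes up to consensus-error perturbations.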
**Lemma 8** Let $\bar x_t$, $\bar y_t$, $\bar g_t$, and $\bar{\mathcal{G}}_t$ (defined in Eqs. (3) and (6)) be generated by Algorithm 3. Setting $\bar s_t = \frac{1}{m}\mathbf{1}^\top \mathbf{s}_t$ with $\mathbf{s}_t$ defined in Eq. (12), it holds that:

$$\bar x_{t+1} = \bar y_t - \eta\,\bar{\mathcal{G}}_t, \tag{19}$$
$$\bar y_{t+1} = \bar x_{t+1} + \frac{1-\alpha}{1+\alpha}(\bar x_{t+1} - \bar x_t), \tag{20}$$
$$\bar s_{t+1} = \bar s_t + \bar g_{t+1} - \bar g_t = \bar g_{t+1}. \tag{21}$$

**Proof** Proposition 1 provides the property of FastMix that $\frac{1}{m}\mathbf{1}^\top\,\mathrm{FastMix}(\mathbf{x}, K) = \bar x$. Combining the algorithmic procedure of Algorithm 3 with the definition of $\bar{\mathcal{G}}_t$ in Eq. (6), we obtain the first two equations. We prove the last equality by induction. For $t = 0$, we use the fact that $\mathbf{s}_0 = \nabla F(\mathbf{y}_0)$, so $\bar s_0 = \bar g_0$. Assume $\bar s_t = \bar g_t$ at time $t$. By the update equation (12) and Proposition 1, we have $\bar s_{t+1} = \bar s_t + (\bar g_{t+1} - \bar g_t) = \bar g_{t+1}$, which gives the result at time $t+1$. ∎

Lemma 8 shows that the averaged version of Eqs. (10)-(12) is almost the same as accelerated proximal gradient descent (Nesterov, 2018). Thus, if $\mathbf{s}_t$ is an accurate estimate of $\nabla f(\bar y_t)$, then Algorithm 3 has convergence properties similar to accelerated proximal gradient descent. Next, we show that $y_t^{(i)} \approx \bar y_t$ and $s_t^{(i)} \approx \bar s_t$ via the following lemma.

**Lemma 9** Let $z_t = \big[\|\mathbf{x}_t - \mathbf{1}\bar x_t\|,\ \|\mathbf{y}_t - \mathbf{1}\bar y_t\|,\ \eta\|\mathbf{s}_t - \mathbf{1}\bar s_t\|\big]^\top$ with $\mathbf{x}_t$, $\mathbf{y}_t$ and $\mathbf{s}_t$ generated by Algorithm 3. Then it holds that

$$z_{t+1} \preceq A z_t + 4\rho\sqrt{m}\,M\,\sqrt{\frac{2V_t}{\mu}}\,\mathbf{1}, \tag{22}$$

where $\rho$ and $A$ are defined as

$$\rho = \Big(1-\sqrt{1-\lambda_2(W)}\Big)^K \quad\text{and}\quad A = \begin{bmatrix} 0 & 2 & 2 \\ 2\rho & 4\rho & 4\rho \\ \rho M/L & 8\rho M^2/L^2 & 5\rho M/L \end{bmatrix}. \tag{23}$$

(The first row of $A$ carries no factor of $\rho$ because the proximal update (10) involves no mixing step.) If the spectral radius of $A$ is less than $1$ and $V_t$ converges to zero, then $z_t$ converges to zero. Note that each of $\|\mathbf{x}_t - \mathbf{1}\bar x_t\|$, $\|\mathbf{y}_t - \mathbf{1}\bar y_t\|$ and $\eta\|\mathbf{s}_t - \mathbf{1}\bar s_t\|$ is no larger than $\|z_t\|$. Hence, Algorithm 3 approximates centralized accelerated proximal gradient descent well under these conditions. Next, we establish the two conditions that lead to the convergence of $\|z_t\|$. First, the following lemma shows that the spectral radius of $A$ is at most $\frac{1}{2}$ if $\rho$ is small enough.

**Lemma 10** The matrix $A$ defined in Eq. (23) satisfies $0 < \lambda_1(A)$ and $|\lambda_3(A)| \le |\lambda_2(A)| < \lambda_1(A)$, with $\lambda_i(A)$ being the $i$-th largest eigenvalue of $A$.
Letting $\eta = 1/(2L)$ and $\rho \le L^3/(1280M^3)$, it holds that $\lambda_1(A) \le \frac{1}{2}$, the eigenvector $v$ associated with $\lambda_1(A)$ is positive, and its entries satisfy

$$0 < v(1) \le \frac{8\,v(3)}{\sqrt{\rho}}, \qquad 0 < v(2) \le 3\,v(3), \qquad 0 < v(3), \tag{24}$$

where $v(i)$ is the $i$-th entry of $v$.

Now we show that $V_t$ converges linearly, up to perturbation terms related to $z_t$.

**Lemma 11** Let $\mathbf{x}_t$, $\mathbf{y}_t$, $\mathbf{s}_t$ be generated by Algorithm 3. It holds that

$$V_{t+1} \le (1-\alpha)V_t + \frac{13\eta}{m}\big\|\mathbf{s}_t - \mathbf{1}\bar s_t\big\|^2 + \frac{20M^2\eta + 10\eta^{-1}}{m}\big\|\mathbf{y}_t - \mathbf{1}\bar y_t\big\|^2. \tag{25}$$

The above lemma shows that Algorithm 3 has a convergence property similar to accelerated proximal gradient descent, up to perturbation terms. Using the above lemmas and choosing a suitable $\rho$ via a suitable $K$, we obtain the convergence rate of Algorithm 3. Finally, we provide the proof of our main result, Theorem 3.

**Proof** It is easy to check that $\rho$ satisfies the conditions required in Lemma 10. Let $v$ be the eigenvector defined in Lemma 10, normalized so that $v(3) = 1$. Combining this with the fact that the first two entries of $z_0$ are zero, we obtain $z_0 \preceq \|z_0\|\,v$ and $[0,0,1]^\top \preceq v$. By Eq. (22), unrolling the recursion and using that $v$ is the eigenvector associated with $\lambda_1(A)$ together with $\lambda_1(A) \le \frac{1}{2}$ from Lemma 10, we obtain

$$z_{t+1} \preceq \|z_0\|\,2^{-(t+1)}v + 4\rho M\sqrt{\frac{2m}{\mu}}\,\sum_{i=0}^{t} 2^{-(t-i)}\sqrt{V_i}\;v. \tag{26}$$

Next, we prove the result by induction. For $t = 0$, we have $\mathbf{y}_0 - \mathbf{1}\bar y_0 = 0$, and

$$V_1 \overset{(25)}{\le} (1-\alpha)V_0 + \frac{13}{\eta m}\big(\eta\|\mathbf{s}_0 - \mathbf{1}\bar s_0\|\big)^2 \le (1-\alpha)V_0 + \frac{13}{\eta m}\|z_0\|^2.$$

Assume that, for $i = 1, \dots, t$, it holds that

$$V_i \le (1-\alpha)^i\Big(V_0 + \frac{52L}{m}\|z_0\|^2\Big). \tag{27}$$

Combining the induction hypothesis with Eq. (26), the accumulated consensus errors are bounded by a geometric sum: for every $t$,

$$\|z_t\| \le 2^{-(t-1)}\|z_0\| + O\Big(\rho M\sqrt{\frac{m}{\mu}}\Big)\sqrt{V_0 + \frac{52L}{m}\|z_0\|^2}. \tag{28}$$

Using the definition of $z_t$ together with the eigenvector bounds (24), the same estimate controls each consensus-error component: both $\|\mathbf{y}_t - \mathbf{1}\bar y_t\|$ and $\eta\|\mathbf{s}_t - \mathbf{1}\bar s_t\|$ are bounded by $O(\sqrt{\rho})$ multiples of the right-hand side of (28). Substituting these bounds into the perturbation terms of Eq. (25) and invoking $\rho \le L^6/\big(5.5\times 10^8\,\kappa_g^{3/2}M^6\big)$, the perturbations are absorbed into the contraction factor, which yields

$$V_{t+1} \le (1-\alpha)^{t+1}\Big(V_0 + \frac{52L}{m}\|z_0\|^2\Big),$$

completing the induction. Finally, the consensus error of the output obeys

$$\|\mathbf{x}_T - \mathbf{1}\bar x_T\| \le \|z_T\| \le 2^{-(T-1)}\|z_0\| + O\Big(\rho M\sqrt{\frac{m}{\mu}}\Big)\sqrt{V_0 + \frac{52L}{m}\|z_0\|^2},$$

which concludes the proof. ∎

## 6. Experiments

We evaluate the performance of our algorithms on (sparse) logistic regression under different settings, including the situation in which each $f_i(x)$ is strongly convex and the situation in which the local functions $f_i(x)$ may be non-convex.

### 6.1 The Setting of Networks

In our experiments, we consider random networks where each pair of agents is connected with probability $p$. We set $W = I - \mathcal{L}/\lambda_1(\mathcal{L})$, where $\mathcal{L}$ is the Laplacian matrix associated with a weighted graph and $\lambda_1(\mathcal{L})$ is its largest eigenvalue. We also set $m = 100$, that is, there are 100 agents in the network. We run the algorithms with $p = 0.1$ and $p = 0.5$, which correspond to $1-\lambda_2(W) = 0.05$ and $1-\lambda_2(W) = 0.81$ respectively.
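The weight matrix used in the experiments can be constructed directly from its definition; a small sketch (Erdős–Rényi graph with unweighted edges, for simplicity):

```python
import numpy as np

def gossip_matrix(m, p, rng):
    """W = I - L / lambda_1(L) for an Erdos-Renyi graph G(m, p),
    as in the experimental setup of Section 6.1."""
    A = (rng.random((m, m)) < p).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                              # symmetric adjacency, no self-loops
    Lap = np.diag(A.sum(axis=1)) - A         # graph Laplacian
    lam1 = np.linalg.eigvalsh(Lap)[-1]       # largest eigenvalue of the Laplacian
    return np.eye(m) - Lap / lam1

rng = np.random.default_rng(0)
W = gossip_matrix(100, 0.1, rng)
eigs = np.linalg.eigvalsh(W)                 # ascending order
spectral_gap = 1.0 - eigs[-2]                # 1 - lambda_2(W)
```

By construction $W$ is symmetric with eigenvalues in $[0, 1]$ and $W\mathbf{1} = \mathbf{1}$, so it satisfies the weight-matrix assumptions of Section 3; the exact value of `spectral_gap` depends on the random draw.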
6.2 Experiments on ℓ2-Regularized Logistic Regression We consider the ℓ2-regularized logistic regression model, whose local objective function on the i-th agent is defined as f_i(x) = (1/n) Σ_{j=1}^{n} log(1 + exp(−b_{ij}⟨a_{ij}, x⟩)) + (σ_i/2)‖x‖², (31) where a_{ij} ∈ R^d and b_{ij} ∈ {−1, 1} are the j-th input vector and the corresponding label on the i-th agent. We conduct our experiments on the real-world dataset a9a, which can be downloaded from the LIBSVM repository (Chang and Lin, 2011). We set n = 325 and d = 123. We conduct the following four experimental settings: 1. We set σ_1 = · · · = σ_m = 10^{-3}; then each f_i(x) is strongly convex. 2. We set σ_1 = · · · = σ_m = 10^{-4}; then each f_i(x) is strongly convex. 3. We set σ_1 = · · · = σ_{m−1} = −10^{-1} and σ_m = 10; then the functions f_i(x) for i < m are non-convex but f(x) is still strongly convex. 4. We set σ_1 = · · · = σ_{m−1} = −10^{-2} and σ_m = 1; then the functions f_i(x) for i < m are non-convex but f(x) is still strongly convex. We compare our algorithm (Mudag) with centralized accelerated gradient descent (AGD) (Nesterov, 2018), EXTRA (Shi et al., 2015b), NIDS (Li et al., 2019), Acc-DNGD (Qu and Li, 2019) and APM-C (Li et al., 2020b). We do not compare our algorithm with dual-based algorithms such as the accelerated dual ascent algorithm (Uribe et al., 2020; Scaman et al., 2017), because these algorithms cannot be applied when some functions f_i(x) are non-convex. The step sizes of all algorithms are well tuned to achieve their best performance. Furthermore, we set the momentum coefficient to (√L − √µ)/(√L + √µ) for Mudag, AGD and APM-C. We initialize x_0 at 0 for all the compared methods. In the setting in which each f_i(x) is strongly convex, we report the experimental results in Figure 1. Compared with AGD, our algorithm has almost the same computation cost, which validates our theoretical analysis.
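For concreteness, the local objective in Eq. (31) and its gradient can be sketched as below. This is an illustrative sketch, not the paper's implementation; the function name and the synthetic data in the check are our own choices.

```python
import numpy as np

def local_logistic(x, A_i, b_i, sigma_i):
    """Local objective of Eq. (31) and its gradient for agent i.

    A_i is the (n, d) matrix whose rows are the inputs a_ij, b_i the (n,)
    labels in {-1, +1}, and sigma_i the regularization weight (a negative
    sigma_i makes the local function non-convex, as in settings 3 and 4).
    """
    z = -b_i * (A_i @ x)                              # margins -b_ij <a_ij, x>
    loss = np.mean(np.logaddexp(0.0, z)) + 0.5 * sigma_i * (x @ x)
    grad = A_i.T @ (-b_i / (1.0 + np.exp(-z))) / len(b_i) + sigma_i * x
    return loss, grad
```

The gradient can be sanity-checked against central finite differences of the loss.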
Assuming that AGD communicates once per iteration, we can also see from Figure 1 that the communication cost of Mudag is almost the same as that of AGD when 1 − λ2(W) = 0.81, and about six times that of AGD when 1 − λ2(W) = 0.05. This matches the theoretical communication complexity of our algorithm. Furthermore, our algorithm achieves both lower computation cost and lower communication cost than the other decentralized algorithms in all settings. The advantages are more obvious for small σ_i, which also accords with the comparison of the upper bounds in related work.

[Figure 1 omitted; panels plot performance against Number of Gradient Computations and Number of Communications. Caption: Comparisons with logistic regression and random networks. Each f_i(x) is strongly convex (σ_i = 0.001 in the top row, and σ_i = 0.0001 in the bottom row). Random networks have 1 − λ2(W) = 0.81 in the left two columns and 1 − λ2(W) = 0.05 in the right two columns.]

[Figure 2 omitted; panels plot performance against Number of Gradient Computations and Number of Communications. Caption: Comparisons with logistic regression and random networks. Each local objective f_i(x) may be non-convex. In the top row, σ_i = −10^{-2} for agents i = 1, . . . , m − 1 and σ_i = 1 for the agent i = m. In the bottom row, σ_i = −10^{-1} for agents i = 1, . . . , m − 1, and σ_i = 10 for the agent i = m. Random networks have 1 − λ2(W) = 0.81 in the left two columns and 1 − λ2(W) = 0.05 in the right two columns.]

In the setting in which an individual function f_i(x) may be non-convex, we report the experimental results in Figure 2. Note that the global objective functions of the experiments reported in Figure 1 and Figure 2 are the same, but the model corresponding to Figure 2 contains some non-convex f_i(x). Comparing the curves in these two figures, we observe that the computation costs of AGD and our algorithm are not affected by the non-convexity of f_i(x), because their convergence rates depend only on κg. On the other hand, the communication cost of our algorithm increases slightly compared with the setting where each f_i(x) is convex. This is because the ratio M/L increases when we set σ_i = −10^{-1} or σ_i = −10^{-2} for agents i = 1, . . . , m − 1. Our communication complexity theory shows that M/L affects the communication cost by a log(M/L) factor. Compared with our algorithm, the performance of the other decentralized algorithms deteriorates greatly, which can be clearly observed by comparing the top-right panels of Figure 1 and Figure 2.

[Figure 3 omitted; panels plot performance against Number of Gradient Computations and Number of Communications for P2D2, PG-EXTRA, NIDS, and Prox Mudag with K = 1, 2, 3. Caption: Comparisons with sparse logistic regression and random networks. In the top row, experiments on a9a with σ_i = 10^{-3} for agents i = 1, . . . , m in the left two columns and σ_i = 10^{-4} in the right two columns. In the bottom row, experiments on w8a with σ_i = 10^{-3} in the left two columns and σ_i = 10^{-4} in the right two columns.]

6.3 Experiments on Sparse Logistic Regression We consider the sparse logistic regression model whose objective function is defined as min_x (1/m) Σ_{i=1}^{m} f_i(x) + γ‖x‖_1, where f_i(x) is defined in Eq. (31). We conduct experiments on the graph with 1 − λ2(W) = 0.05 and only consider the case where each f_i(x) is convex, since the experiments on logistic regression have already shown the advantage of our methods for non-convex f_i(x). We conduct experiments on the datasets a9a and w8a, which can be downloaded from the LIBSVM repository. For w8a, we set n = 497 and d = 300. For a9a, we set n = 325 and d = 123. We conduct the following two experimental settings: 1. We set γ = 10^{-4} and σ_1 = · · · = σ_m = 10^{-3}. 2. We set γ = 10^{-4} and σ_1 = · · · = σ_m = 10^{-4}.
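The composite step of the proximal variant applies the proximal operator of r(x) = γ‖x‖_1, which has the well-known closed-form soft-thresholding solution. The sketch below uses placeholder values of the step size η and weight γ.

```python
import numpy as np

def prox_l1(v, eta, gamma):
    """Closed-form proximal operator of r(x) = gamma * ||x||_1 with step
    size eta, i.e. argmin_z gamma*||z||_1 + ||z - v||^2 / (2*eta),
    given by entrywise soft-thresholding at level eta * gamma."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * gamma, 0.0)

v = np.array([0.3, -0.05, 0.0, -2.0])
p = prox_l1(v, eta=1.0, gamma=0.1)   # shrinks each entry toward 0 by 0.1
```

Applied row by row to the stacked iterates, the same map yields prox_{ηm,R} in Eq. (4); this is exactly the separability property recorded in Lemma 17, and the map is non-expansive, which the proofs use repeatedly.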
We compare our algorithm (Prox Mudag) with the state-of-the-art algorithms PG-EXTRA (Shi et al., 2015a), NIDS (Li et al., 2019) and the decentralized proximal algorithm P2D2 (Alghunaim et al., 2019). In the experiments, we set K = 1, K = 2 and K = 3 in Prox Mudag to evaluate how K affects the convergence behavior. The parameters of all algorithms are well tuned. We report the experimental results in Figure 3. We can observe that Prox Mudag outperforms the other algorithms in all cases. First, Prox Mudag incurs a much lower computation cost than the other algorithms, since Prox Mudag uses Nesterov's acceleration to achieve a faster convergence rate. This matches our theoretical analysis of the computation complexity. We can further observe that the advantage of Prox Mudag is more clear when σ_i is small. This is because small σ_i's commonly lead to large condition numbers, and the computation complexity of Prox Mudag depends on κg instead of the local condition number κℓ. The results also show that Prox Mudag has great advantages over the other state-of-the-art decentralized proximal algorithms in terms of communication cost. 7. Conclusion In this paper, we proposed two novel decentralized algorithms, which achieve the optimal computation complexity and near optimal communication complexity. To the best of our knowledge, this is the best communication complexity that primal-based decentralized algorithms have achieved, especially for decentralized composite optimization problems. Our results provide an affirmative answer to the open problem of whether there is a decentralized algorithm that can achieve a communication complexity of O(√(κg/(1 − λ2(W))) log(1/ϵ)), or close to this lower bound, for a strongly convex objective function. Furthermore, our algorithm does not require each individual function f_i(x) to be convex. Our experiments showed that the non-convexity of individual functions f_i(x) rarely degrades the performance of our algorithm.
Our analysis also implies that integrating multi-consensus and gradient tracking makes a decentralized optimization algorithm closely approximate its centralized counterpart. The resulting algorithms are simple to implement, effective, and achieve (near) optimal complexities. This novel perspective may also provide useful insights for developing new decentralized optimization algorithms in other settings. Acknowledgments The authors would like to thank Lesi Chen and Yuxing Liu for helpful discussions. Haishan Ye is supported by National Natural Science Foundation of China under Grant No. 12101491. Luo Luo is supported by National Natural Science Foundation of China (No. 62206058) and Shanghai Sailing Program (22YF1402900). Appendix A. Useful Lemmas Lemma 12 For any matrix x ∈ R^{m×d} and x̄ = (1/m)1ᵀx, it holds that ‖x − 1x̄‖² ≤ ‖x‖². (32) Proof It holds that ‖x − 1x̄‖² = ‖x‖² − 2⟨1x̄, x⟩ + ‖1x̄‖² = ‖x‖² − 2m‖x̄‖² + m‖x̄‖² = ‖x‖² − m‖x̄‖² ≤ ‖x‖², where we use ⟨1x̄, x⟩ = Σ_{i=1}^{m} ⟨x̄, x(i, :)⟩ = m‖x̄‖² and ‖1x̄‖² = m‖x̄‖². Lemma 13 We have ‖∇F(y) − ∇F(x)‖ ≤ M‖y − x‖, (33) and ‖ḡ_t − ∇f(ȳ_t)‖ ≤ (M/√m)‖y_t − 1ȳ_t‖. (34) Furthermore, we have the (L + 2/η)-smooth property for the generalized gradient ∇̃h(·) (defined in Eq. (42)), i.e., ‖∇̃h(x) − ∇̃h(y)‖ ≤ (2/η + L)‖x − y‖. (35) Proof The first inequality holds because each f_i(x) is M-smooth: ‖∇F(y) − ∇F(x)‖ = (Σ_{i=1}^{m} ‖∇f_i(y(i, :)) − ∇f_i(x(i, :))‖²)^{1/2} ≤ (M² Σ_{i=1}^{m} ‖y(i, :) − x(i, :)‖²)^{1/2} = M‖y − x‖. The second inequality follows from ‖ḡ_t − ∇f(ȳ_t)‖ = ‖(1/m) Σ_{i=1}^{m} (∇f_i(y_t(i, :)) − ∇f_i(ȳ_t))‖ ≤ (M/m) Σ_{i=1}^{m} ‖y_t(i, :) − ȳ_t‖ ≤ (M/m) · √m · (Σ_{i=1}^{m} ‖y_t(i, :) − ȳ_t‖²)^{1/2} = (M/√m)‖y_t − 1ȳ_t‖. Then we can prove Eq. (35) using the L-smoothness of f(x) and the non-expansiveness of the proximal operator: ‖∇̃h(x) − ∇̃h(y)‖ = (1/η)‖(x − prox_{η,r}(x − η∇f(x))) − (y − prox_{η,r}(y − η∇f(y)))‖ ≤ (1/η)‖x − y‖ + (1/η)‖prox_{η,r}(x − η∇f(x)) − prox_{η,r}(y − η∇f(y))‖ ≤ (1/η)‖x − y‖ + (1/η)‖(x − η∇f(x)) − (y − η∇f(y))‖ ≤ (2/η + L)‖x − y‖, where the last inequality is due to the L-smoothness of f(x). Lemma 14 For x̄_t, ȳ_t and v̄_t defined in Eqs. (3) and (18), we can obtain that ȳ_t − x̄_t = α(v̄_t − ȳ_t), (36) ȳ_{t+1} = (x̄_{t+1} + α v̄_{t+1})/(1 + α), (37) and v̄_{t+1} = (1 − α)v̄_t + α ȳ_t − (η/α) Ḡ_t if r(x) is convex, while v̄_{t+1} = (1 − α)v̄_t + α ȳ_t − (η/α) ḡ_t if r(x) = 0.
(38) Proof First using the definition of vt, we have xt+1 + α vt+1 (18) = xt+1 + α[ xt + 1 α( xt+1 xt)] 1 + α = xt+1 + 1 α 1 + α( xt+1 xt) Then we can have yt xt = α( vt yt). YE, LUO, ZHOU, AND ZHANG Now, we are going to prove Eq. (38) with the case r(x) is convex since r(x) = 0 is a special case of r(x) being convex. Then we have (1 α) vt + α yt η α Gt = vt α( vt yt) η α Gt = xt + vt yt η α( yt xt η Gt) = xt + 1 α( xt+1 xt) = vt+1. Lemma 15 Let f(x) be µ-strongly convex. For yt, and Vt defined in Eqs. (3) and (17) and x being the optimum, we have the following inequality, Proof Since f(x) is µ-strongly convex, h(x) in Eq. (1) is also µ-strongly convex. Thus, we obtain yt x (37) = xt + α vt 1 + α x 1 1 + α xt x + α 1 + α vt x 2(h( x) h(x )) µ + α 1 + α At the end of this section, we provide the proof of Proposition 1. Proof We let n11 , Π = Π 0 0 Π and W = (1 + ηw)W ηw W I 0 then the iteration of Algorithm 2 can be written as xk+1 xk = W xk xk 1 The property W1 = 1 directly leads to x = 1 m1 x K. It also indicates n W11 = W 1 n11 W = W 1 n11 W = W 1 which means ΠW = WΠ. (40) Consequently, we achieve W Π = (1 + ηw)W ηw W I 0 = (1 + ηw)WΠ ηw WΠ Π 0 Π W = Π 0 0 Π (1 + ηw)W ηw W I 0 = (1 + ηw)WΠ Πηw W Π 0 Π W Π = Π 0 0 Π (1 + ηw)WΠ ηw WΠ Π 0 = (1 + ηw)ΠWΠ ηwΠWΠ Π2 0 = (1 + ηw)ΠW ηwΠW Π 0 = Π W = W Π, where we use the equality (40). This implies for any K 2, we have Π W K = ( Π W) W K 1 = ( Π W Π) W K 1 = ( Π W) Π W K 1 =( Π W)2 Π W K 2 = = ( Π W)K Π = Π( W Π)K = Π W( Π W)K 1 Π = Π W( Π W)K 2 Π W Π = Π W( Π W)K 2 W Π = = Π W W K 1 Π Combining above result with Lemma 9 of Song et al. (2023), we have x K 1 x = Πx K Π x K x K 1 = Π W K x0 x 1 = Π W K Π x0 x 1 14 ρK w Πx0 = 14 ρK w x0 1 x , 1 λ2 2(W) . Since we have 1 1 + x 1 1 1 YE, LUO, ZHOU, AND ZHANG for any x [0, 1], it holds that This implies 1 λ2(W) K x0 1 x . Appendix B. Proof of Lemmas in Section 5 B.1 Collection of Lemmas We list several important lemmas that will be used in our proofs. 
Lemma 16 (Nesterov (2018)) Letting h(x) the generalized gradient of h(x) (refer to Eq. (1)) be defined as h(x) x proxη,r(x η f(x)) η , with η being the step size, (42) then it holds that h(x ) = 0 if x minimizes h(x). Lemma 17 Letting prox(i) ηm,R(x) denote the i-th row of the matrix proxηm,R(x) (defined in Eqn. (4)), we have the following equation prox(i) ηm,R(x) = proxη,r(x(i)), which implies that G(i) t equals to the i-th row of Gt defined in Eq. (5). Proof By the definition of the proximal operators, we have proxηm,R(x) = argmin z R(z) + 1 2ηm z x 2 F i=1 r(z(i)) + z(i) x(i) 2 ! i=1 r(z(i)) + z(i) x(i) 2 ! argminz r(z) + 1 argminz r(z) + 1 Therefore, we have the following equation prox(i) ηm,R(x) = proxη,r(x(i)). Lemma 18 For any x Rm d, proxmη,R( ) (defined in Eq. (4)) has the following property proxηm,R m11 proxηm,R(x) x 1 xt . (43) Proof Using Lemma 17 and non-expansiveness of the proximal mapping, we have proxηm,R( 1 m11 proxηm,R(x) i=1 proxη,r(x(i)) m1 x) proxη,r(x(i)) m1 x) proxη,r(x(i)) 2 = x 1 xt 2 . Lemma 19 Letting s(i) t be the i-th row of st and G(i) t , Gt (defined in Eqs. (5), (6)) generated by Algorithm 3, we have s(i) t f(y(i) t ) 2 2 st 1 st 2 + 8M2 yt 1 yt 2 (44) G(i) t Gt 2 18 yt 1 yt 2 + 16η2 st 1 st 2 . (45) YE, LUO, ZHOU, AND ZHANG Proof Using the inequality that (a + b)2 2a2 + 2b2, we have s(i) t f(y(i) t ) 2 2 s(i) t st 2 + 2 st f(y(i) t ) 2 s(i) t st 2 + 4 i=1 st f( yt) 2 + 4 f( yt) f(y(i) t ) 2 2 st 1 st 2 + 4M2 1 yt yt 2 + 4L2 1 yt yt 2 2 st 1 st 2 + 8M2 1 yt yt 2 , where the third inequality is from Eq. (34) and the L-smoothness of f(x), the last inequality is due to L M. 
Furthermore, it holds that G(i) t Gt 2 = ηGt η y(i) t proxη,r(y(i) t ηs(i) t ) 1 y(j) t proxη,r(y(j) t ηs(j) t ) y(i) t yt 2 + 2 proxη,r(y(i) t ηs(i) t ) 1 j=1 proxη,r(y(j) t ηs(j) t ) y(i) t yt 2 + 4 proxη,r(y(i) t ηs(i) t ) proxη,r( yt η st) 2 proxη,r( yt η st) 1 j=1 proxη,r(y(j) t ηs(j) t ) 2 yt 1 yt 2 + 4 y(i) t yt η(s(i) t st) 2 + 4 proxη,r( yt η st) proxη,r(y(i) t ηs(i) t ) 2 2 yt 1 yt 2 + 16 yt 1 yt 2 + 16η2 st 1 st 2 =18 yt 1 yt 2 + 16η2 st 1 st 2 , where the third and forth inequalities are due to the non-expansiveness of proximal mapping. Lemma 20 Letting h( yt) be defined in Eq. (42), then we have the following error bound for the estimated generalized gradient η Gt h( yt) 4 + 2Mη m yt 1 yt + 2η m st 1 st . (46) Proof It holds that η Gt η h( yt) = ηG(i) t η h( yt) ηG(i) t η h( yt) 2 y(i) t proxη,r(y(i) t ηs(i) t ) yt proxη,r( yt η f( yt)) 2 2 y(i) t yt 2 + 2 proxη,r(y(i) t ηs(i) t ) proxη,r( yt η f( yt)) 2 2 y(i) t yt 2 + 2 (y(i) t ηs(i) t ) ( yt η f( yt)) 2 2 yt 1 yt 2 + 2 ηst η1 f( yt) + yt 1 yt 2 1 m (4 yt 1 yt + 2η st 1 st + 2η 1 st 1 f( yt) ) 1 m (4 + 2Mη) yt 1 yt + 2η st 1 st , where the second inequality is due to the non-expansiveness of proximal operator, and the last inequality is from Eq. (34). B.2 Proof of Lemma 9 Proof For simplicity, we denote Fast Mix( , K) operation as T( ). From Proposition 1 we can know that T(x) 1 m11 x ρ x 1 First, we have 1 xt+1 xt+1 proxηm,R(yt ηst) 1 m11 proxηm,R(yt ηst) proxηm,R(yt ηst) proxηm,R (1( yt η st)) + proxηm,R (1( yt η st)) 1 m11 proxηm,R(yt ηst) yt 1 yt + η st 1 st + (yt ηst) 1 ( yt η st) 2 yt 1 yt + 2η st 1 st , where the third inequality is because of Lemma 18 and the non-expansiveness of proximal operator. YE, LUO, ZHOU, AND ZHANG Using the definition of yt+1 in Algorithm 1 and the property of Fast Mix operation, we have yt+1 1 yt+1 2ρ 1 + α xt+1 1 xt+1 + ρ1 α 1 + α xt 1 xt (47) 4ρ yt 1 yt + 4ρη st 1 st + ρ xt 1 xt . Now we are going to bound the value of st+1 1 st+1 . 
We have st+1 1 st+1 ρ st + F(yt+1) F(yt) 1 ( st + gt+1 gt) (32) ρ st 1 st + ρM yt+1 yt ρ st 1 st + ρM yt+1 1 yt+1 + ρM 1 yt+1 1 yt + ρM 1 yt yt (48) ρ st 1 st + ρM (4ρ yt 1 yt + 4ρη st 1 st + ρ xt 1 xt ) + ρM 1 yt+1 1 yt + ρM 1 yt yt = ρ + 4ρ2Mη st 1 st + ρ2M xt 1 xt + ρM + 4ρ2M 1 yt yt + ρM 1 yt+1 1 yt ρ(1 + 4Mη) st 1 st + ρM xt 1 xt + 5ρM yt 1 yt + ρM 1 yt+1 1 yt , where the last inequality is because of ρ 1. Then we only need to consider the term 1 yt+1 1 yt . Using the iteration of average variables illustrated in Eq. (20), we have = 2 1 + α xt+1 1 α 1 + α xt yt = 2 1 + α( yt η Gt) 1 α 1 + α xt yt 1 + α yt xt + 2η 1 + α yt x + xt x + 2η Gt h( yt) + 2η h( yt) (46) yt x + xt x + 8 + 4Mη m yt 1 yt + 4η m st 1 st + 2η h( yt) . Furthermore, by Lemma 15 and the fact that h(x ) = 0, we can obtain 2η h( yt) + xt x + yt x =2η h( yt) h(x ) + xt x + yt x =2 yt proxη,r( yt η f( yt)) (x proxη,r(x η f(x ))) + xt x + yt x (35) 2 yt x + 2 yt x + 2η f( yt) f(x ) + xt x + yt x (5 + 2ηL) yt x + xt x (39) (5 + 2ηL) µ(h( xt) h(x )) where the last equality is because of η = 1/(2L). Thus, we can obtain that 1 yt+1 1 yt (8 + 4Mη) yt 1 yt + 4η st 1 st + 7 m r 2 Combining above results, we can bound the value of η st+1 1 st+1 as follows η st+1 1 st+1 ρ (1 + 8Mη) η st 1 st + ρ Mη xt 1 xt + ρMη (13 + 4Mη) yt 1 yt + 7ρMη m r 2 L η st 1 st + ρM L2 yt 1 yt + 4ρM where the last inequality is because of η = 1/(2L) and 1 M/L. Combining Eqs. (47), (48) and (49), we can obtain zt+1 Azt + 4ρM m 0 2 2 2ρ 4ρ 4ρ ρM/L 8ρM2/L2 5ρM/L B.3 Proof of Lemma 10 Proof It is easy to check that A is non-negative and irreducible. Furthermore, every diagonal entry of A is not zero. Thus, by Perron-Frobenius theorem and Corollary 8.4.7 of Horn and Johnson (2012), A has a real-valued positive number λ1(A) which is algebraically simple and associated with a strictly positive eigenvector v. It also holds that λ1(A) is strictly larger than |λi(A)| with i = 2, 3. 
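The Perron–Frobenius facts invoked here can be checked numerically. In the sketch below, the entries of the comparison matrix follow Eqs. (53)-(55) (equivalently, the recursion derived at the end of the proof of Lemma 9), while the values of M/L and ρ are placeholder choices satisfying ρ ≤ L³/(1280M³).

```python
import numpy as np

# Placeholder constants for illustration only: the ratio M/L and the
# consensus factor rho = L^3 / (1280 M^3) required by Lemma 10.
ML = 2.0
rho = 1.0 / (1280.0 * ML ** 3)

# Comparison matrix of the recursion z_{t+1} <= A z_t + ...; its rows
# correspond to Eqs. (53)-(55).
A = np.array([
    [0.0,      2.0,               2.0],
    [2 * rho,  4 * rho,           4 * rho],
    [rho * ML, 8 * rho * ML ** 2, 5 * rho * ML],
])

eigvals, eigvecs = np.linalg.eig(A)
k = int(np.argmax(np.abs(eigvals)))
lam1 = eigvals[k].real            # Perron root: real, simple, dominant
v = eigvecs[:, k].real
v = v / v[2]                      # normalize so that v(3) = 1
print("lambda_1(A) =", lam1)
print("Perron vector v =", v)     # strictly positive entries
```

For these placeholder values the dominant eigenvalue is real, strictly larger in modulus than the other two, lies in (√ρ, 1/2], and has a strictly positive eigenvector, matching the claims of the lemma.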
We write down the characteristic polynomial p(ζ) of A, that is p(ζ) = ζp0(ζ) 32(M/L)2ρ2 + 20ρ2M/L, where p0(ζ) = ζ2 ρ (4 + 5M/L) ζ 4ρ 8ρ(M/L)2 + 5M/L + 1 5ρM/L . Let us denote = 16ρ 8ρ(M/L)2 + 5M/L + 1 5ρM/L . (50) It is easy to check that > 0. Thus, two roots of p0(ζ) are ζ1 = ρ(4 + 5M/L) + p (4 + 5M/L)2ρ2 + YE, LUO, ZHOU, AND ZHANG ζ2 = ρ(4 + 5M/L) p (4 + 5M/L)2ρ2 + ζ = 2ρ 32(M/L)2 + 2 (4 + 5M/L) + q p (ζ ) + 32(M/L)2ρ2 20ρ2M/L = 2ρ 32(M/L)2 + 2 (4 + 5M/L) + q 2ρ 32(M/L)2+2 (4+5M/L) + q 4} ρ(4+5M/L) p (4+5M/L)2ρ2+ 2ρ 32(M/L)2+2 (4+5M/L) + q 4} ρ(4+5M/L) + p (4+5M/L)2ρ2+ 2ρ 32(M/L)2 + 2 (4 + 5M/L) + q 2ρ 32(M/L)2 + 1 (4 + 5M/L) + q (4 + 5M/L)2ρ2 + )2 = 2ρ 32(M/L)2 + 2 (4 + 5M/L) + q 2ρ 32(M/L)2 + 1 (4 + 5M/L) 2 + max{ , 1 4} ((4 + 5M/L)2ρ2 + ) 2 + 2ρ 32(M/L)2 + 1 (4 + 5M/L) s 2ρ 32(M/L)2 + 2 (4 + 5M/L) + q 2ρ 32(M/L)2 + 1 (4 + 5M/L) s 2ρ 32(M/L)2 + 1 (4 + 5M/L) max{ , 1 2ρ(32(M/L)2 + 1) 5 Thus, we can obtain that p (ζ ) > 2ρ(32(M/L)2 + 1) 5 8 32(M/L)2ρ2 + 20ρ2M/L > 0. Note that p(ζ) is monotonely increasing in the range [ζ , ]. Thus, p(ζ) does not have real roots in this range. This implies λ1(A) ζ . By Eq. (50), we can obtain that if ρ satisfies the condition that ρ 1 64 (8(M/L)2 + 5M/L + 1), (51) then it holds that 1 4. If ρ also satisfies the condition that ρ 1 4 (32(M/L)2 + 2) (4 + 5M/L), (52) then we can obtain that It is easy to check that if ρ L3/(1280M3), inequalities (51) and (52) hold. Now, we begin to prove that ρ < λ1(A). We can conclude this result once it holds p( ρ) < 0. This is because p(ζ) will have a root between ρ and 1/2 and λ1(A) must be no less than this root. We have = ρp0( ρ) 32ρ2(M/L)2 + 20ρ2M/L =ρ ρ ρ(4 + 5M/L) 4 ρ 8ρ(M/L)2 + 5M/L + 1 5ρM/L 32ρ(M/L)2 + 20ρM/L =ρ (20M/L + 3) ρ ρ(4 + 5M/L) 4ρ3/2 8(M/L)2 5M/L ρ 32(M/L)2 15M/L <ρ ( (20M/L + 3) ρ ρ(4 + 5M/L)) where the first inequality is because of M/L 1. 
Since v is the eigenvector associated with λ1(A), we can obtain that Av = λ1(A)v and have the following equations 2v(2) + 2v(3) = λ1(A)v(1), (53) 2ρv(1) + 4ρv(2) + 4ρv(3) = λ1(A)v(2), (54) L v(1) + 8ρ M 2 v(2) + 5ρM L v(3) = λ1(A)v(3). (55) By Eqs. (53) and (54), we can obtain that 2ρv(1) = λ1(A)v(2) 2ρλ1(A)v(1), which implies that v(1) = λ1(A)v(2) 2ρ(1 + λ1(A)). Replacing above equation to Eq. (55), we can obtain that Mλ1(A)v(2) 2L(1 + λ1(A)) = λ1(A)v(3) 2 v(2) + 5ρM < λ1(A)v(3), YE, LUO, ZHOU, AND ZHANG which implies that v(2) 2L(1 + λ1(A))v(3) where the second inequality is because of ρ 1/2. Combining Eq. (53), we can obtain that v(1) 2 (3v(3) + v(3)) λ1(A) 8v(3) ρ , where the last inequality is because of λ1(A) ρ. B.4 Proof of Lemma 11 Before proving Lemma 11, we first give several important lemmas which are closely related to the convergence rate of Algorithm 3. Lemma 21 Letting xt, yt, st be generated by Algorithm 3, it holds that h( xt+1) h(x ) (1 α)(h( xt) h(x )) Gt, (1 α) xt + αx yt m st 1 st 2 + 20M2η + 10η 1 m yt 1 yt 2 . Proof By µ-strong convexity , L-smoothness of f(x) and the property of proximal operator, we have h(proxη,r(y(i) t ηs(i) t )) = f(proxη,r(y(i) t ηs(i) t )) + r(proxη,r(y(i) t ηs(i) t )) =f(y(i) t + proxη,r(y(i) t ηs(i) t ) y(i) t ) + r(proxη,r(y(i) t ηs(i) t )) f(y(i) t ) + f(y(i) t ) proxη,r(y(i) t ηs(i) t ) y(i) t + L proxη,r(y(i) t ηs(i) t ) y(i) t 2 η (proxη,r(y(i) t ηs(i) t ) y(i) t + ηs(i) t ) (z proxη,r(y(i) t ηs(i) t )) h(z) f(y(i) t ) (z y(i) t ) µ z y(i) t 2 + f(y(i) t ) proxη,r(y(i) t ηs(i) t ) y(i) t proxη,r(y(i) t ηs(i) t ) y(i) t 2 + 1 η (proxη,r(y(i) t ηs(i) t ) y(i) t +ηs(i) t ) (z proxη,r(y(i) t ηs(i) t )) =h(z) D f(y(i) t ), z y(i) t E µ z y(i) t 2 η D f(y(i) t ), G(i) t E + η2L + D s(i) t G(i) t , z y(i) t + ηG(i) t E =h(z) D f(y(i) t ) s(i) t + G(i) t , z y(i) t E µ η D f(y(i) t ) s(i) t , G(i) t E η 1 ηL for any given z Rd. Multiplying 1 α on both sides of Eq. 
(57) and setting z = xt, we get (1 α)h(proxη,r(y(i) t ηs(i) t )) (1 α)h( xt) (1 α) D f(y(i) t ) s(i) t + G(i) t , xt y(i) t E µ(1 α) xt y(i) t 2 (1 α)η D f(y(i) t ) s(i) t , G(i) t E (1 α)η 1 ηL Similarly, multiplying α on both sides of Eq. (57) and setting z = x , we obtain that αh(proxη,r(y(i) t ηs(i) t )) αh(x ) α D f(y(i) t ) s(i) t + G(i) t , x y(i) t E µα αη D f(y(i) t ) s(i) t , G(i) t E αη 1 ηL Adding above two inequalities, we have h(proxη,r(y(i) t ηs(i) t )) h(x ) (1 α) (h( xt) h(x )) D f(y(i) t ) s(i) t + G(i) t , (1 α) xt + αx y(i) t E η D f(y(i) t ) s(i) t , G(i) t E η 1 ηL Note that by Jensen s inequality, we can get that x y(i) t 2 . (59) Then averaging Eq. (58) from i = 1 to m and using the convexity of h(x), we have h( xt+1) h(x ) i=1 h(proxη,r(y(i) t ηs(i) t )) h(x ) (58) (1 α)(h( xt) h(x )) 1 D f(y(i) t ) s(i) t + G(i) t , (1 α) xt + αx y(i) t E D f(y(i) t ) s(i) t , G(i) t E η 1 ηL (59) (1 α)(h( xt) h(x )) 1 D f(y(i) t ) s(i) t + G(i) t , (1 α) xt + αx y(i) t E D f(y(i) t ) s(i) t , G(i) t E η 1 ηL YE, LUO, ZHOU, AND ZHANG Furthermore, we have s(i) t G(i) t f(y(i) t ) (1 α) xt + αx y(i) t s(i) t G(i) t f(y(i) t ) (1 α) xt + αx yt + yt y(i) t (21) = Gt, (1 α) xt + αx yt + 1 D s(i) t G(i) t f(y(i) t ), yt y(i) t E D s(i) t G(i) t f(y(i) t ), yt y(i) t E D s(i) t G(i) t f(y(i) t ) + Gt, yt y(i) t E D s(i) t f(y(i) t ), yt y(i) t E + 1 D Gt G(i) t , yt y(i) t E s(i) t f(y(i) t ) 2 yt y(i) t 2 G(i) t Gt 2 yt y(i) t 2 Pm i=1 s(i) t f(y(i) t ) 2 + Pm i=1 G(i) t Gt 2 mη yt 1 yt 2 (44),(45) η m 9 st 1 st 2 + 4M 2 yt 1 yt 2 + 9η 2 yt 1 yt 2 + 1 ηm yt 1 yt 2 m st 1 st 2 + 4M 2η + 10η 1 m yt 1 yt 2 , where the first inequality is because of Cauchy s inequality and the second inequality is because of 2ab ηa2 + b2/η. Combining above two inequalities, we can obtain that D s(i) t G(i) t f(y(i) t ), yt y(i) t E Gt, (1 α) xt + αx yt + 9η m st 1 st 2 + 4M2η + 10η 1 m yt 1 yt 2 . 
Moreover, we have D f(y(i) t ) s(i) t , G(i) t E f(y(i) t ) s(i) t G(i) t 2 f(y(i) t ) s(i) t 2 + 1 m st 1 st 2 + 16M2η m yt 1 yt 2 + η Combining Eqs. (60), (61) and (62), we can obtain that h( xt+1) h(x ) (1 α)(h( xt) h(x )) Gt, (1 α) xt + αx yt G(i) t 2 µα m st 1 st 2 + 20M2η + 10η 1 m yt 1 yt 2 (1 α)(h( xt) h(x )) Gt, (1 α) xt + αx yt m st 1 st 2 + 20M2η + 10η 1 m yt 1 yt 2 , where the last inequality is because of Jensen s inequality. Lemma 22 Letting xt, yt, st be generated by Algorithm 3, it holds that 2 vt+1 x 2 (1 α)µ 2 vt x 2 + αµ + Gt, (1 α) xt + αx yt + η Gt 2 . (63) Proof We have µ (1 α) vt + α yt η 2 (1 α) vt + α yt x 2 µη α Gt, (1 α) vt + α yt x + µη2 2 (1 α) vt + α yt x 2 α Gt, (1 α) vt + α yt x + η YE, LUO, ZHOU, AND ZHANG Furthermore, by Eq. (36), we have vt = yt + 1 α( yt xt) which implies that (1 α) vt + α yt = yt + 1 α α ( yt xt). Thus, it holds that α Gt, (1 α) vt + α yt x = Gt, (1 α) xt + αx yt . It also holds that (1 α) vt + α yt x 2 ((1 α) vt x + α yt x )2 (1 α) vt x 2 + α yt x 2 . Therefore, it holds that 2 vt+1 x 2 (1 α)µ 2 vt x 2 + αµ 2 yt x 2 + Gt, (1 α) xt + αx yt + η Combining above two lemmas, we can obtain the following result. Proof [Proof of Lemma 11] Using the definition of Vt, we have Vt+1 = h( xt+1) h(x ) + µ (56),(63) (1 α)Vt η 1 m st 1 st 2 + 20M 2η + 10η 1 m yt 1 yt 2 (1 α)Vt + 13η m st 1 st 2 + 20M 2η + 10η 1 m yt 1 yt 2 , where the last inequality is because of η = 1/(2L). Appendix C. Convergence Analysis of Algorithm 1 The proof of Algorithm 1 is almost the same to the one of Algorithm 3. But, without the proximal mapping which will cause extra consensus error terms, the detailed convergence analysis of Algorithm 1 is clean and easy to follow. Lemma 23 The update procedure of Algorithm 1 can be represented as xt+1 =Fast Mix (yt ηst, K) , (64) yt+1 =xt+1 + 1 α 1 + α(xt+1 xt), (65) st+1 =Fast Mix(st, K) + ( F(yt+1) F(yt)) η 1(Fast Mix(yt, K) yt), (66) with s0 = F(y0). 
Proof The proof of this reformulation is equivalent to prove that given the reformulation of xt, yt and st at iteration t, the reformulation of xt+1 holds at iteration t + 1. Therefore our induction focuses on xt+1. First, when t = 0, we can obtain that x1 = T(y0 η F(y0)) = T(y0 ηs0), (67) which implies that x1 y0 = ηT(s0) + T(y0) y0. Furthermore, have s1 = T(s0) + ( F(y2) F(y1)) η 1(T(y0) y0). Thus, we can obtain that x2 (13) = T(y1 + (x1 y0) η( F(y1) F(y0))) = T(y1 (ηT(s0) + η( F(y1) F(y0))) + T(y0) y0) = T(y1 ηs1), where the first equation is because of the update of Algorithm 1. We obtain that the result holds at t = 0. Next, we prove that if the results hold in the t-th iteration, then it also holds at the (t + 1)-th iteration. For the t-th iteration, we assume that xt+1 = T(yt ηst), which implies that xt+1 T(yt) = ηT(st). Therefore, we obtain that xt+2 (13) = T(yt+1 + xt+1 yt η( F(yt+1) F(yt))) = T(yt+1 + xt+1 T(yt) η( F(yt+1) F(yt)) + T(yt) yt) = T(yt+1 ηT(st) η( F(yt+1) F(yt)) + T(yt) yt) = T(yt+1 ηst+1). This proves the desired result. We now show that xt, yt, gt (defined in Eq. (3) and generated by Algorithm 1) and vt (defined in Eq. (18)) can be fit into the framework of the centralized Nesterov s accelerated gradient descent. Lemma 24 Let xt, yt, gt (defined in Eq. (3)) be generated by Algorithm 1. Then they satisfy the following equalities: xt+1 = yt η gt, yt+1 = xt+1 + 1 α 1 + α( xt+1 xt), st+1 = st + gt+1 gt = gt+1. YE, LUO, ZHOU, AND ZHANG Proof We first prove the last equality. First, we have 1 m11 (T(yt) yt) = 1 yt 1 yt = 0 by Lemma 1. Thus, we can obtain that st+1 = st + gt+1 gt. Furthermore, we will prove st = η gt by induction. For t = 0, we use the fact that s0 = η F(y0). Then, it holds that s0 = g0. We assume that st = gt at time t. By the update equation, we have st+1 = st + ( gt+1 gt) = gt+1. Thus, we obtain the result at time t + 1. The first two equations can be proved using Eq. (68) and Proposition 1. 
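The reformulation in Lemma 23, together with the averaged recursion of Lemma 24, suggests the following compact sketch of Algorithm 1. This is an illustrative sketch, not the paper's implementation: the function names and the toy problem in the check are our own choices, the gossip momentum eta_w is left as a parameter (Proposition 1 prescribes its value), and the three updates mirror Eqs. (64)-(66).

```python
import numpy as np

def fast_mix(x, W, K, eta_w):
    """Accelerated gossip of Algorithm 2: x_{k+1} = (1+eta_w) W x_k - eta_w W x_{k-1}.
    It preserves the column averages of x while driving all rows toward them."""
    x_prev = x
    for _ in range(K):
        x, x_prev = (1 + eta_w) * (W @ x) - eta_w * (W @ x_prev), x
    return x

def mudag(grad_F, x0, W, eta, alpha, K, eta_w, T):
    """Sketch of Algorithm 1 in the reformulation of Lemma 23.

    grad_F maps the (m, d) stack of local iterates to the stack of local
    gradients grad f_i(row i); the analysis takes eta = 1/L and
    alpha = sqrt(mu/L), with s initialized to the local gradients."""
    x = y = x0
    s = g = grad_F(y)
    beta = (1 - alpha) / (1 + alpha)
    for _ in range(T):
        x_next = fast_mix(y - eta * s, W, K, eta_w)        # Eq. (64)
        y_next = x_next + beta * (x_next - x)              # Eq. (65)
        g_next = grad_F(y_next)
        s = (fast_mix(s, W, K, eta_w) + (g_next - g)       # Eq. (66):
             - (fast_mix(y, W, K, eta_w) - y) / eta)       # gradient tracking
        x, y, g = x_next, y_next, g_next
    return x
```

On a complete graph with f_i(x) = (1/2)‖x − c_i‖² (so L = µ = 1 and the global minimizer is the average of the c_i), a few iterations drive every local row to the global optimum.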
Lemma 25 Let zt = yt 1 yt , ρ 1 xt 1 xt , M 1 st 1 st with xt and yt generated by Algorithm 1 and st defined in Eq. (66), then it holds that zt+1 Azt + 4 m 0, 0, r 2 where ρ and A are defined as 1 λ2(W) K and A 2ρ ρ 2ρMη 1 0 Mη 9Mη ρ 3ρMη Proof By the update step of yt+1 in Algorithm 1, we have yt+1 1 yt+1 2 1 + α xt+1 1 xt+1 + 1 α 1 + α xt 1 xt . Furthermore, by Eq. (64), we have 1 ρ xt+1 1 xt+1 yt 1 yt + Mη 1 M st 1 st . Therefore, we can obtain that yt+1 1 yt+1 2ρ 1 + α yt 1 yt + 1 α 1 + α xt 1 xt + 2ρη 1 + α st 1 st 2ρ yt 1 yt + ρ 1 ρ xt 1 xt + 2ρMη 1 M st 1 st . Furthermore, by Eq. (66), we have η st+1 1 st+1 η T(st) 1 st + η F(yt+1) F(yt) 1( gt+1 gt) + T(yt) yt (32) η T(st) 1 st + η F(yt+1) F(yt) + T(yt) yt (33) ρ η st 1 st + Mη yt+1 yt + T(yt) yt ρ η st 1 st + Mη yt+1 yt + 2 yt 1 yt , where the last inequality is because of T(yt) yt = T(yt) 1 yt + 1 yt yt (1 + ρ) yt 1 yt 2 yt 1 yt . By the update rule of yt+1, we have yt+1 yt = 2 1 + αxt+1 1 α (64) = 2 1 + αT(yt ηst) 1 α 2 1 + α T(yt) yt + 1 α 1 + α xt yt + 2η 1 + α T(st) 4 1 + α yt 1 yt + 1 α 1 + α( xt 1 xt + yt 1 yt + 1( yt x ) + 1( xt x ) ) + 2η 1 + α( T(st) 1 st + 1 st ) (68) 5 1 + α yt 1 yt + 2ρ 1 + α η st 1 st + 1 α 1 + α( xt 1 xt ) 1 + α ( 1( yt x ) + 1( xt x ) ) + 2η m Furthermore, by Eq. (34), we have gt gt f( yt) + f( yt) M m yt 1 yt + f( yt) . Therefore, we can obtain that 1 M st+1 1 st+1 ρ(1 + 2ρMη) 1 M st 1 st + 5 + 2Mη 1 + α + 2 Mη 1 + α xt 1 xt + 1 α 1 + α ( 1( yt x ) + 1( xt x ) ) + 2η m 1 + α f( yt) ρ(1 + 2ρMη) 1 M st 1 st + (7 + 2Mη) yt 1 yt + xt 1 xt + 1( yt x ) + 1( xt x ) + 2η m f( yt) M st 1 st + 9Mη yt 1 yt + xt 1 xt + 1( yt x ) + 1( xt x ) + 2η m f( yt) , where the last two inequalities use 1 < 1 + α, η = 1/L and L M. Furthermore, we have 1( yt x ) + 1( xt x ) + 2η m f( yt) 1( yt x ) + 1( xt x ) + 2Lη m yt x 3 m yt x + m xt x YE, LUO, ZHOU, AND ZHANG The first inequality is because of the L-smoothness of f(x). The second inequality follows from the step size η = 1/L. 
The last inequality is due to the µ-strong convexity. Thus, we can obtain that 1 M st+1 1 st+1 M st 1 st + 9Mη yt 1 yt + ρ 1 ρ xt 1 xt + 4 m r 2 By the definition of zt, we can obtain that zt+1 = Azt + 0, 0, 4 p Next, we will prove the above two conditions which guarantee the convergence of zt . In the following lemma, we show the properties of A and prove that the spectrum radius of A is less than 1 2 if ρ is small enough. Lemma 26 Matrix A defined in Lemma 25 satisfies that 0 < λ1(A) and |λ3(A)| |λ2(A)| < λ1(A), with λi(A) being the i-th largest eigenvalue of A. Let η = 1/L and ρ satisfy the condition that ρ 1 108(Mη)3 + 288(Mη)2 + 24Mη + 16, then it holds that λ1(A) 1 and the eigenvector v associated with λ1(A) is positive and its entries satisfy 18Mη, v(2) 1 18 ρMη + Mη ρ v(3), 0 < v(3), (70) where v(i) is i-th entry of v. Proof It is easy to check that A is non-negative and irreducible. Furthermore, every diagonal entry of A is not zero. Thus, by Perron-Frobenius theorem and Corollary 8.4.7 of Horn and Johnson (2012), A has a real-valued positive number λ1(A) which is algebraically simple and associated with a strictly positive eigenvector v. It also holds that λ1(A) is strictly larger than |λi(A)| with i = 2, 3. We write down the characteristic polynomial p(ζ) of A, p(ζ) = ζp0(ζ) 9(Mη)2ρ + 3ρ2Mη, where p0(ζ) = ζ2 ρ (2 + 3Mη) ζ ρ 18(Mη)2 + Mη + 1 6ρMη . Let us denote = 4ρ 18(Mη)2 + Mη + 1 6ρMη . (71) It holds that Mη =4ρ(18Mη + 1 + 1 Mη 6ρ) 4ρ (18 6) > 0. 
Thus, the two roots of $p_0(\zeta)$, denoted $\zeta_1$ and $\zeta_2$, are
$$\zeta_1 = \frac{\rho(2+3M\eta) + \sqrt{(2+3M\eta)^2\rho^2 + \Delta}}{2}, \qquad \zeta_2 = \frac{\rho(2+3M\eta) - \sqrt{(2+3M\eta)^2\rho^2 + \Delta}}{2}.$$
Define
$$\zeta^* \triangleq \frac{2\rho\big(9(M\eta)^2 + 2\big)(2+3M\eta) + \sqrt{\max\{\Delta,\,1/4\}}}{2},$$
so that $\zeta^* \ge \zeta_1$. A direct calculation, using $\max\{\Delta, 1/4\} \ge \Delta$ and $M\eta \ge 1$, shows that
$$p(\zeta^*) \ge \frac58\cdot 9(M\eta)^2\rho > 0.$$
Moreover, $p(\zeta)$ is monotonically increasing on the range $[\zeta^*, \infty)$, since $\zeta^*$ is no smaller than the largest root of $p'$. Thus, $p(\zeta)$ does not have real roots in this range, which implies $\lambda_1(A) \le \zeta^*$. By Eq. (71), if $\rho$ satisfies
$$\rho \le \Big(16\big(18(M\eta)^2 + M\eta + 1\big)\Big)^{-1},$$
then it holds that $\Delta \le \frac14$. If $\rho$ also satisfies the condition that
$$\rho \le \Big(4\big(9(M\eta)^2 + 2\big)(2+3M\eta)\Big)^{-1},$$
then we can obtain $\zeta^* \le \frac12$, and hence $\lambda_1(A) \le \frac12$. Combining the above conditions on $\rho$, we only need
$$\rho \le \frac{1}{108(M\eta)^3 + 288(M\eta)^2 + 24M\eta + 16}.$$
Now, we show that $\sqrt\rho \le \lambda_1(A)$. We can conclude this result once it holds that $p(\sqrt\rho) < 0$, because $p(\zeta)$ then has a root between $\sqrt\rho$ and $\frac12$, and $\lambda_1(A)$ must be no less than this root. Expanding
$$p(\sqrt\rho) = \sqrt\rho\,p_0(\sqrt\rho) - 9(M\eta)^2\rho + 3\rho^2 M\eta$$
and using $M\eta \ge 1$ (by Eq. (9)) together with $\rho \le \frac{1}{16}$ yields $p(\sqrt\rho) < 0$. Since $v$ is the eigenvector associated with $\lambda_1(A)$, we can obtain $Av = \lambda_1(A)v$, that is,
$$\begin{aligned} 2\rho v^{(1)} + \rho v^{(2)} + 2\rho M\eta\,v^{(3)} &= \lambda_1(A)\,v^{(1)},\\ v^{(1)} + M\eta\,v^{(3)} &= \lambda_1(A)\,v^{(2)},\\ 9M\eta\,v^{(1)} + \rho v^{(2)} + 3\rho M\eta\,v^{(3)} &= \lambda_1(A)\,v^{(3)}. \end{aligned}$$
Thus, combining with $\sqrt\rho \le \lambda_1(A) \le \frac12$, we can obtain
$$v^{(1)} = \frac{1}{9M\eta}\Big(\lambda_1(A)\,v^{(3)} - \big(\rho v^{(2)} + 3\rho M\eta\,v^{(3)}\big)\Big) \le \frac{v^{(3)}}{18M\eta} \qquad\text{and}\qquad v^{(2)} = \frac{v^{(1)} + M\eta\,v^{(3)}}{\lambda_1(A)} \le \Big(\frac{1}{18\sqrt\rho\,M\eta} + \frac{M\eta}{\sqrt\rho}\Big)v^{(3)}.$$

Lemma 27 Let $V_t$ be the Lyapunov function defined in Eq. (17) associated with Algorithm 1. Then it satisfies
$$V_{t+1} \le \Big(1-\frac{3\alpha}{4}\Big)V_t + \Big(\frac1L + \frac{18}{\mu\alpha}\Big)\frac{M^2}{m}\,\|y_t - \mathbf 1\bar y_t\|^2. \qquad (72)$$
Proof When $r(x) = 0$, $h(x)$ equals $f(x)$; thus, we use $f(x)$ directly instead of $h(x)$. By the update procedure of Algorithm 1 and the $L$-smoothness of $f$, we have
$$\begin{aligned} f(\bar x_{t+1}) &\le f(\bar y_t) - \eta\,\langle\nabla f(\bar y_t), \bar g_t\rangle + \frac{L\eta^2}{2}\|\bar g_t\|^2 \\ &= f(\bar y_t) - \eta\,\langle\bar g_t, \bar g_t\rangle + \eta\,\langle\bar g_t, \bar g_t - \nabla f(\bar y_t)\rangle + \frac{L\eta^2}{2}\|\bar g_t\|^2 \\ &= f(\bar y_t) - \frac{1}{2L}\|\bar g_t\|^2 + \frac1L\langle\bar g_t, \bar g_t - \nabla f(\bar y_t)\rangle, \end{aligned}$$
where the last equality is because $\eta = 1/L$. Furthermore, by the definition of $V_t$, we have
$$\begin{aligned} V_{t+1} &= \frac\mu2\|\bar v_{t+1} - x^*\|^2 + f(\bar x_{t+1}) - f(x^*) \\ &\le \frac\mu2\|(1-\alpha)\bar v_t + \alpha\bar y_t - x^*\|^2 - \frac{\mu}{L\alpha}\langle\bar g_t, (1-\alpha)\bar v_t + \alpha\bar y_t - x^*\rangle + \frac{\mu}{2L^2\alpha^2}\|\bar g_t\|^2 + f(\bar x_{t+1}) - f(x^*) \\ &\le \frac\mu2\|(1-\alpha)\bar v_t + \alpha\bar y_t - x^*\|^2 - \alpha\,\langle\bar g_t, (1-\alpha)\bar v_t + \alpha\bar y_t - x^*\rangle + f(\bar y_t) - f(x^*) + \frac1L\langle\bar g_t, \bar g_t - \nabla f(\bar y_t)\rangle, \end{aligned}$$
where the last step uses $\alpha^2 = \mu/L$, so that $\frac{\mu}{L\alpha} = \alpha$ and the term $\frac{\mu}{2L^2\alpha^2}\|\bar g_t\|^2 = \frac{1}{2L}\|\bar g_t\|^2$ cancels against the corresponding term in the bound on $f(\bar x_{t+1})$. Furthermore, by Eq. (37), we can obtain $\bar v_t = \bar y_t + \frac1\alpha(\bar y_t - \bar x_t)$. Then we can obtain
$$(1-\alpha)\bar v_t + \alpha\bar y_t = \bar y_t + \frac{1-\alpha}{\alpha}(\bar y_t - \bar x_t).$$
Hence, we have
$$\begin{aligned} f(\bar y_t) &- \alpha\,\langle\bar g_t, (1-\alpha)\bar v_t + \alpha\bar y_t - x^*\rangle - f(x^*) \\ &= f(\bar y_t) + \langle\bar g_t, \alpha x^* + (1-\alpha)\bar x_t - \bar y_t\rangle - f(x^*) \\ &= (\alpha + 1 - \alpha)f(\bar y_t) + \langle\nabla f(\bar y_t), \alpha(x^* - \bar y_t) + (1-\alpha)(\bar x_t - \bar y_t)\rangle - f(x^*) + \langle\bar g_t - \nabla f(\bar y_t), \alpha x^* + (1-\alpha)\bar x_t - \bar y_t\rangle \\ &\le (1-\alpha)\big(f(\bar x_t) - f(x^*)\big) - \frac{\alpha\mu}{2}\|x^* - \bar y_t\|^2 + \langle\bar g_t - \nabla f(\bar y_t), \alpha x^* + (1-\alpha)\bar x_t - \bar y_t\rangle, \end{aligned}$$
where the last inequality is because $f(x)$ is $\mu$-strongly convex. Therefore, we can obtain
$$\begin{aligned} V_{t+1} &\le \frac\mu2\|(1-\alpha)\bar v_t + \alpha\bar y_t - x^*\|^2 + (1-\alpha)\big(f(\bar x_t) - f(x^*)\big) - \frac{\alpha\mu}{2}\|x^* - \bar y_t\|^2 \\ &\qquad + \langle\bar g_t - \nabla f(\bar y_t), \alpha x^* + (1-\alpha)\bar x_t - \bar y_t\rangle + \frac1L\langle\bar g_t, \bar g_t - \nabla f(\bar y_t)\rangle \\ &\le \frac{\mu(1-\alpha)}{2}\|\bar v_t - x^*\|^2 + \frac{\mu\alpha}{2}\|\bar y_t - x^*\|^2 + (1-\alpha)\big(f(\bar x_t) - f(x^*)\big) - \frac{\alpha\mu}{2}\|x^* - \bar y_t\|^2 \\ &\qquad + \langle\bar g_t - \nabla f(\bar y_t), \alpha x^* + (1-\alpha)\bar x_t - \bar y_t\rangle + \frac1L\langle\bar g_t, \bar g_t - \nabla f(\bar y_t)\rangle \\ &= (1-\alpha)V_t + \langle\bar g_t - \nabla f(\bar y_t), \alpha x^* + (1-\alpha)\bar x_t - \bar y_t\rangle + \frac1L\langle\bar g_t, \bar g_t - \nabla f(\bar y_t)\rangle, \end{aligned}$$
where the second inequality is because of the convexity of $\|\cdot\|^2$:
$$\|(1-\alpha)\bar v_t + \alpha\bar y_t - x^*\|^2 \le (1-\alpha)\|\bar v_t - x^*\|^2 + \alpha\|\bar y_t - x^*\|^2.$$
Furthermore, we have
$$\|\alpha x^* + (1-\alpha)\bar x_t - \bar y_t\| \le (1-\alpha)\|\bar x_t - x^*\| + \|\bar y_t - x^*\| \le 2\sqrt{\frac{2V_t}{\mu}}.$$
Therefore, writing $e_t \triangleq \bar g_t - \nabla f(\bar y_t)$ and using $\|\nabla f(\bar y_t)\| \le L\|\bar y_t - x^*\| \le L\sqrt{2V_t/\mu}$, we have
$$\begin{aligned} V_{t+1} &\le (1-\alpha)V_t + \frac1L\|e_t\|^2 + \frac1L\|e_t\|\,\|\nabla f(\bar y_t)\| + 2\sqrt{\frac{2V_t}{\mu}}\,\|e_t\| \\ &\le (1-\alpha)V_t + \frac1L\|e_t\|^2 + 3\sqrt{\frac{2V_t}{\mu}}\,\|e_t\| \\ &\le \Big(1-\frac{3\alpha}{4}\Big)V_t + \Big(\frac1L + \frac{18}{\mu\alpha}\Big)\|e_t\|^2 \le \Big(1-\frac{3\alpha}{4}\Big)V_t + \Big(\frac1L + \frac{18}{\mu\alpha}\Big)\frac{M^2}{m}\|y_t - \mathbf 1\bar y_t\|^2, \end{aligned}$$
where the third inequality applies Young's inequality with parameter $\alpha/4$ and the last inequality uses Eq. (34).

Now, we provide the proof of Theorem 2. Proof Let the eigenvector $v$ be defined as in Lemma 26, and set $v^{(3)} = 1$. Combining with the fact that the first two entries of $z_0$ are zero, we can obtain $z_0 \le \|z_0\|\,v$ and $(0, 0, 1)^\top \le v$ entrywise. By Eq.
(69), we can obtain that
$$z_{t+1} \le \|z_0\|\,A^{t+1}v + 4\sqrt{\frac{2m}{\mu}}\sum_{j=0}^{t}\sqrt{V_j}\,A^{t-j}v = \Big(\|z_0\|\,\lambda_1(A)^{t+1} + 4\sqrt{\frac{2m}{\mu}}\sum_{j=0}^{t}\sqrt{V_j}\,\lambda_1(A)^{t-j}\Big)v \le \Big(2^{-(t+1)}\|z_0\| + 4\sqrt{\frac{2m}{\mu}}\sum_{j=0}^{t}2^{-(t-j)}\sqrt{V_j}\Big)v, \qquad (74)$$
where the first equality is because $v$ is the eigenvector associated with $\lambda_1(A)$ and the last inequality is because of Lemma 26 ($\lambda_1(A) \le 1/2$).

Next, we will prove our result by induction. By the initialization we have $s_0 = \eta\,\nabla F(y_0)$, and the first two entries of $z_0$ are zero because the initial values $x_0(i,:)$ are equal to each other; in particular, $\|y_0 - \mathbf 1\bar y_0\| = 0$. Then by Eq. (72), we have
$$V_1 \le \Big(1-\frac{\alpha}{4}\Big)\Big(V_0 + \frac{\mu}{288m}\|z_0\|^2\Big).$$
Next, we assume that for $i = 1, \dots, t$, it holds that
$$V_i \le \Big(1-\frac{\alpha}{4}\Big)^{i}\Big(V_0 + \frac{\mu}{288m}\|z_0\|^2\Big).$$
Combining with Eq. (74), we can obtain
$$z_t \le \Big(4\sqrt{\frac{2m}{\mu}}\sum_{j=0}^{t-1}2^{-(t-1-j)}\sqrt{V_j} + 2^{-t}\|z_0\|\Big)v \le \Big(16\sqrt{\frac{2m}{\mu}}\Big(1-\frac{\alpha}{4}\Big)^{\frac{t-1}{2}}\sqrt{V_0 + \frac{\mu}{288m}\|z_0\|^2} + 2^{-t}\|z_0\|\Big)v, \qquad (75)$$
where the last inequality uses the inductive hypothesis and bounds the geometric series using $\sqrt{1-\alpha/4} \ge \sqrt3/2$.

Now, we upper bound $\|y_t - \mathbf 1\bar y_t\|$, which by Eq. (34) controls $\|\bar s_t - \nabla f(\bar y_t)\|$. First, by Lemma 25, we can obtain that $\|y_t - \mathbf 1\bar y_t\| \le \langle[2\rho, \rho, 2\rho M\eta], z_{t-1}\rangle$, and by Eqs. (75) and (70),
$$\langle[2\rho, \rho, 2\rho M\eta], v\rangle = \rho\big(2v^{(1)} + v^{(2)} + 2M\eta\big) \le \rho\Big(\frac{1}{9M\eta} + \frac{1}{18\sqrt\rho\,M\eta} + \frac{M\eta}{\sqrt\rho} + 2M\eta\Big) \le 4\sqrt\rho\,M\eta,$$
so that
$$\|y_t - \mathbf 1\bar y_t\| \le 4\sqrt\rho\,M\eta\Big(16\sqrt{\frac{2m}{\mu}}\Big(1-\frac{\alpha}{4}\Big)^{\frac{t-2}{2}}\sqrt{V_0 + \frac{\mu}{288m}\|z_0\|^2} + 2^{-(t-1)}\|z_0\|\Big).$$
Combining the inductive hypothesis with Eq. (72), we have
$$V_{t+1} \le \Big(1-\frac{3\alpha}{4}\Big)V_t + \Big(\frac1L + \frac{18}{\mu\alpha}\Big)\frac{M^2}{m}\|y_t - \mathbf 1\bar y_t\|^2 \le \Big(1-\frac{\alpha}{4}\Big)^{t+1}\Big(V_0 + \frac{\mu}{288m}\|z_0\|^2\Big), \qquad (76)$$
where the last inequality holds because the upper bound imposed on $\rho$ makes the consensus-error term no larger than $\frac\alpha2\big(1-\frac\alpha4\big)^{t}\big(V_0 + \frac{\mu}{288m}\|z_0\|^2\big)$. Therefore, we can obtain that at the $(t+1)$-th iteration it also holds that
$$V_{t+1} \le \Big(1-\frac{\alpha}{4}\Big)^{t+1}\Big(V_0 + \frac{\mu}{288m}\|z_0\|^2\Big).$$
Furthermore, by Lemma 25 and Eqs. (75) and (70),
$$\frac1\rho\|x_t - \mathbf 1\bar x_t\| \le \langle[1, 0, M\eta], z_{t-1}\rangle \le \langle[1, 0, M\eta], v\rangle\Big(16\sqrt{\frac{2m}{\mu}}\Big(1-\frac{\alpha}{4}\Big)^{\frac{t-2}{2}}\sqrt{V_0 + \frac{\mu}{288m}\|z_0\|^2} + 2^{-(t-1)}\|z_0\|\Big),$$
with $\langle[1, 0, M\eta], v\rangle = v^{(1)} + M\eta \le 2M\eta$. Thus, we can obtain that
$$\|x_t - \mathbf 1\bar x_t\| \le 2\rho M\eta\Big(16\sqrt{\frac{2m}{\mu}}\Big(1-\frac{\alpha}{4}\Big)^{\frac{t-2}{2}}\sqrt{V_0 + \frac{\mu}{288m}\|z_0\|^2} + 2^{-(t-1)}\|z_0\|\Big).$$
This finishes our proof.

References

Sulaiman A. Alghunaim, Kun Yuan, and Ali H. Sayed. A linearly convergent proximal gradient algorithm for decentralized optimization. In NeurIPS, 2019.

Sulaiman A. Alghunaim, Ernest Ryu, Kun Yuan, and Ali H. Sayed. Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 2020.

Zeyuan Allen-Zhu.
Katyusha X: Simple momentum method for stochastic sum-of-nonconvex optimization. In ICML, 2018.

Albert S. Berahas, Raghu Bollapragada, Nitish Shirish Keskar, and Ermin Wei. Balancing communication and computation in distributed optimization. IEEE Transactions on Automatic Control, 64(8):3141–3155, 2018.

Francesco Bullo, Jorge Cortés, and Sonia Martínez. Distributed control of robotic networks: a mathematical approach to motion coordination algorithms, volume 27. Princeton University Press, 2009.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011.

Paolo Di Lorenzo and Gesualdo Scutari. Distributed nonconvex optimization over networks. In Workshop on CAMSAP, 2015.

Paolo Di Lorenzo and Gesualdo Scutari. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120–136, 2016.

Tomaso Erseghe, Davide Zennaro, Emiliano Dall'Anese, and Lorenzo Vangelista. Fast consensus by the alternating direction multipliers method. IEEE Transactions on Signal Processing, 59(11):5523–5537, 2011.

Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.

Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In ICML, 2017.

Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge University Press, 2012.

Dušan Jakovetić. A unification and generalization of exact distributed first-order methods. IEEE Transactions on Signal and Information Processing over Networks, 5(1):31–46, 2018.

Dušan Jakovetić, João Xavier, and José M. F. Moura. Fast distributed gradient methods.
IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021.

Usman A. Khan, Soummya Kar, and José M. F. Moura. DILAND: An algorithm for distributed sensor localization with noisy distance measurements. IEEE Transactions on Signal Processing, 58(3):1940–1947, 2009.

Dmitry Kovalev, Adil Salim, and Peter Richtárik. Optimal and practical algorithms for smooth and strongly convex decentralized optimization. In NeurIPS, 2020.

Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, 180(1–2):237–284, 2020.

Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. Journal of Machine Learning Research, 21:1–51, 2020a.

Huan Li and Zhouchen Lin. Revisiting EXTRA for smooth distributed optimization. SIAM Journal on Optimization, 30(3):1795–1821, 2020.

Huan Li and Zhouchen Lin. Accelerated gradient tracking over time-varying graphs for decentralized optimization. arXiv preprint arXiv:2104.02596, 2021.

Huan Li, Cong Fang, Wotao Yin, and Zhouchen Lin. Decentralized accelerated gradient methods with increasing penalty parameters. IEEE Transactions on Signal Processing, 68:4855–4870, 2020b.

Zhi Li, Wei Shi, and Ming Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17):4494–4506, 2019.

Ji Liu and A. Stephen Morse. Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2):160–165, 2011.

Cassio G. Lopes and Ali H. Sayed.
Diffusion least-mean squares over adaptive networks: Formulation and performance analysis. IEEE Transactions on Signal Processing, 56(7):3122–3136, 2008.

Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(1):2165–2199, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.

Guannan Qu and Na Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3):1245–1260, 2017.

Guannan Qu and Na Li. Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 2019.

Michael Rabbat and Robert Nowak. Distributed optimization in sensor networks. In IPSN, 2004.

Alejandro Ribeiro. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12):6369–6386, 2010.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In ICML, 2017.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In NeurIPS, 2018.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal convergence rates for convex distributed optimization in networks. Journal of Machine Learning Research, 20:1–31, 2019.

Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization.
IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015a.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015b.

Zhuoqing Song, Lei Shi, Shi Pu, and Ming Yan. Optimal gradient tracking for decentralized optimization. Mathematical Programming, pages 1–53, 2023.

Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354–385, 2022.

Håkan Terelius, Ufuk Topcu, and Richard M. Murray. Decentralized multi-agent optimization via dual decomposition. IFAC Proceedings Volumes, 44(1):11245–11251, 2011.

Konstantinos I. Tsianos, Sean Lawlor, and Michael G. Rabbat. Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning. In Allerton, 2012.

César A. Uribe, Soomin Lee, Alexander Gasnikov, and Angelia Nedić. A dual approach for optimal algorithms in distributed optimization over networks. In ITA Workshop, 2020.

Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78, 2004.

Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In CDC, 2015.

Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: unified framework and convergence analysis. IEEE Transactions on Signal Processing, 69:3555–3570, 2021.

Haishan Ye, Ziang Zhou, Luo Luo, and Tong Zhang. Decentralized accelerated proximal gradient descent. In NeurIPS, 2020.

Kun Yuan, Qing Ling, and Wotao Yin.
On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.

Minghui Zhu and Sonia Martínez. Discrete-time dynamic average consensus. Automatica, 46(2):322–329, 2010.