Decentralized Sum-of-Nonconvex Optimization

Zhuanghua Liu1,2, Bryan Kian Hsiang Low1
1Department of Computer Science, National University of Singapore
2CNRS@CREATE LTD, 1 Create Way, #08-01 CREATE Tower, Singapore 138602
liuzhuanghua9@gmail.com, lowkh@comp.nus.edu.sg

We consider the optimization problem of minimizing a sum-of-nonconvex function, i.e., a convex function that is the average of nonconvex components. Existing stochastic algorithms for this problem only address the single-machine, centralized scenario. In this paper, we study sum-of-nonconvex optimization in the decentralized setting. We present a new theoretical analysis of the PMGT-SVRG algorithm for this problem and prove its linear convergence. However, the convergence rate of PMGT-SVRG depends linearly on the condition number, which is undesirable for ill-conditioned problems. To remedy this issue, we propose an accelerated stochastic decentralized first-order algorithm that incorporates the techniques of acceleration, gradient tracking, and multi-consensus mixing into the SVRG algorithm. The convergence rate of the proposed method has a square-root dependency on the condition number. Numerical experiments validate the theoretical guarantees of our proposed algorithms on both synthetic and real-world datasets.

1 Introduction
The exponential growth of data in the past decades has sparked substantial interest in developing algorithms distributed over multiple agents. A common scenario is that each agent within some network topology owns a disjoint subset of data, and the agents collaborate to tackle a global optimization objective. The network topology in which the agents reside can be classified into two categories: client-server vs. decentralized settings.
Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

For the former setting, a central parameter server communicates with all the workers and aggregates the information collected from them (Li et al. 2014). When there is a large volume of data on each agent, the central server becomes the bottleneck of the whole network. For the latter setting, each agent communicates only with its direct neighbors to exchange information and accomplish the global task (Lian et al. 2017). This paper focuses on stochastic optimization for minimizing a sum-of-nonconvex objective function in the decentralized setting. We formulate our problem as a convex optimization problem collaboratively solved by m agents in the network. Consider the following composite optimization objective function:

min_{x ∈ R^d} F(x) := f(x) + ψ(x),   f(x) := (1/m) Σ_{i=1}^m f_i(x),   (1)

where f(x) is convex and smooth, and ψ(·) is convex but possibly non-smooth (e.g., an ℓ1-regularization term). We suppose there are m agents and the i-th agent stores the local objective function f_i(x), which can be written in the finite-sum form f_i(x) = (1/n) Σ_{j=1}^n f_{i,j}(x), where n is the number of components and each f_{i,j}(x) is smooth but possibly non-convex. Sum-of-nonconvex optimization is common in real-world applications, including 1-PCA (Saad 2011), k-PCA (Allen-Zhu and Li 2016), the online eigenvector problem (Allen-Zhu and Li 2017), and general nonconvex optimization (Agarwal et al. 2016; Carmon et al. 2018). Existing decentralized first-order optimization algorithms suffer several limitations in solving Problem (1): (i) The decentralized deterministic algorithms, such as EXTRA (Shi et al. 2015a), Exact-Diffusion (Yuan et al. 2018), P2D2 (Alghunaim, Yuan, and Sayed 2019), and SONATA (Sun, Daneshmand, and Scutari 2019), need access to the full gradient at each round. The computational cost of each iteration is prohibitively expensive on massive datasets.
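The sum-of-nonconvex structure above can be made concrete with a small illustrative example (the one-dimensional functions below are our own toy choices, not from the paper): each component may have negative curvature, yet the average is convex.

```python
import numpy as np

# Toy 1-D instance of Problem (1): each component f_{i,j} may be nonconvex,
# yet their average is convex. (Illustrative functions, not from the paper.)
f1 = lambda x: 2.0 * x**2              # convex component
f2 = lambda x: -1.0 * x**2             # concave, hence nonconvex component
f = lambda x: 0.5 * (f1(x) + f2(x))    # average: 0.5 * x**2, convex

def second_derivative(g, x, h=1e-4):
    """Central finite-difference estimate of g''(x)."""
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / h**2

# f2 has negative curvature everywhere, but the average has positive curvature.
print(second_derivative(f2, 0.3))  # close to -2
print(second_derivative(f, 0.3))   # close to +1
```

This is exactly the regime the paper targets: algorithms may exploit the convexity of the average f while each f_{i,j} individually violates convexity.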
(ii) While existing decentralized stochastic methods achieve a cheaper per-iteration cost by sampling a minibatch of samples, the theoretical analysis of these approaches is not specialized to the sum-of-nonconvex optimization problem. One class of methods (Shi et al. 2015a; Xin, Khan, and Kar 2020; Ye, Xiong, and Zhang 2020) assumes that all component functions f_{i,j}(·) are convex, so that the global objective function f(·) is convex.¹ The convergence analysis of these works does not apply to Problem (1) due to the mismatch of problem assumptions. The other class of methods (Li, Li, and Chi 2022; Luo and Ye 2022; Xin, Khan, and Kar 2022) assumes that the component functions f_{i,j}(·) are nonconvex and the global objective function f(·) is also possibly nonconvex. Consequently, the rate achieved by these methods is not optimal for Problem (1).

¹Ye, Xiong, and Zhang (2020) have claimed that each component function can be possibly nonconvex. However, Assumption 1 in their work requires each component function to be both L-smooth and convex, otherwise Eq. (3) cannot be satisfied.

The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

In this paper, we intend to design communication- and computation-efficient optimization algorithms for the sum-of-nonconvex problem in the decentralized setting. We start by presenting a new theoretical analysis of the PMGT-SVRG algorithm proposed by Ye, Xiong, and Zhang (2020) for objective function (1). It achieves a stochastic first-order oracle (SFO) complexity of O((n + √n κ) log(1/ε)) and a communication complexity of O((√n + κ) ξ log(1/ε)), where κ is the condition number and ξ is a constant depending on the underlying network structure. Notice that both the computational and communication complexities of PMGT-SVRG have a linear dependency on the condition number κ, which can be exceptionally expensive when the objective function is ill-conditioned.
To remedy this issue, we propose an accelerated stochastic variance-reduced proximal-gradient optimization method called PMGT-Katyusha X for Problem (1) to improve the dependency of the complexities on the condition number. Specifically, the vanilla Katyusha X algorithm proposed by Allen-Zhu (2018) achieves an SFO complexity with a square-root dependency on the condition number on a single machine. To extend Katyusha X to the decentralized setting, we incorporate the powerful ideas of acceleration (Allen-Zhu and Orecchia 2014), gradient tracking (Di Lorenzo and Scutari 2016; Qu and Li 2017), and multi-consensus mixing (Liu and Morse 2011) into the SVRG algorithm. The resulting PMGT-Katyusha X achieves an SFO complexity of O((n + n^{3/4} √κ) log(1/ε)) and a communication complexity of O((√n + n^{1/4} √κ) ξ log(1/ε)). It is worth noting that the SFO complexity of our proposed algorithm matches the best-known result (Allen-Zhu 2018) for a single machine. Numerical experiments on several synthetic and real-world datasets demonstrate significant improvement of our proposed PMGT-Katyusha X over existing baseline methods.

Paper Organization A review of related literature on decentralized stochastic first-order methods and stochastic sum-of-nonconvex optimization is presented in Section 2. In Section 3, we introduce the notation and problem setting of decentralized sum-of-nonconvex optimization. We present the theoretical result of PMGT-SVRG on Problem (1) in Section 4. We formally present our proposed algorithm, PMGT-Katyusha X, with the main theorem in Section 5. A proof sketch of the main theorem is provided in Section 6. Numerical results are presented in Section 7. Finally, we conclude this paper with a summary of our results in Section 8.

2 Related Work
In this section, we review related literature on decentralized stochastic first-order algorithms.
In addition, we summarize existing works on stochastic optimization for the sum-of-nonconvex problem on a single machine.

2.1 Decentralized Stochastic First-Order Methods
We review related work on decentralized stochastic first-order methods for objective functions in which each local function f_i(·) has a finite-sum structure. These methods can be divided into two categories based on the convexity of the component functions.

f_{i,j}(·) is convex The first decentralized variance-reduced method, called DSA, was proposed by Mokhtari and Ribeiro (2016); it is a combination of EXTRA (Shi et al. 2015a) and SAGA (Defazio, Bach, and Lacoste-Julien 2014). DBSA/ADFS (Hendrikx, Bach, and Massoulié 2019; Hendrikx, Bach, and Massoulié 2021; Shen et al. 2018) attempted to accelerate DSA with proximal mapping and variance reduction. Although several works (Hendrikx, Bach, and Massoulié 2020; Xin, Khan, and Kar 2020) have proposed proximal-mapping-free algorithms, the computation and communication complexities of these methods are worse than those of DBSA and ADFS. Li et al. (2020) and Ye, Xiong, and Zhang (2020) proposed decentralized stochastic algorithms that achieve a linear convergence rate by incorporating variance reduction, gradient tracking, and multi-consensus mixing.

f_{i,j}(·) is nonconvex All existing decentralized stochastic methods assume the global objective function f(·) is possibly nonconvex if the component functions f_{i,j}(·) are nonconvex. Although the analysis of these approaches can be applied to our setting, the resulting convergence rate may not be optimal for the problem studied in this paper. Sun, Lu, and Hong (2020) provided the first decentralized stochastic algorithm, D-GET, combining variance reduction and gradient tracking. Li, Li, and Chi (2022) and Xin, Khan, and Kar (2022) further proposed algorithms with improved complexity bounds.
Recently, DEAREST (Luo and Ye 2022) became the first decentralized stochastic algorithm to achieve both optimal computation and communication complexity. Due to the assumption that the global objective is nonconvex, all these approaches can obtain at most sublinear convergence rates. A comparison between our work and related works is summarized in Table 1.

2.2 Stochastic Sum-of-Nonconvex Optimization
Stochastic optimization of the sum-of-nonconvex problem is a commonly used technique for analyzing offline Principal Component Analysis (PCA) problems. Garber et al. (2016) reduced 1-PCA subproblems to the sum-of-nonconvex problem and leveraged the conventional accelerated stochastic optimization scheme to accelerate convergence. For the k-PCA problem, Allen-Zhu and Li (2016) reduced it to the sum-of-nonconvex problem and applied the accelerated stochastic technique to improve convergence. Allen-Zhu (2018) further improved the convergence by accelerating the stochastic optimization of the sum-of-nonconvex problem with the linear coupling technique (Allen-Zhu and Orecchia 2014).

| Methods | Problem | Stochastic Gradient Calls | Communication Complexity |
| --- | --- | --- | --- |
| GT-SVRG (Xin, Khan, and Kar 2020) | Sum-of-Convex, Non-Composite | Õ((n + κ² log κ / (1 − λ2(W))²) log(1/ε)) | Õ((n + κ² log κ / (1 − λ2(W))²) log(1/ε)) |
| GT-SAGA (Xin, Khan, and Kar 2020) | Sum-of-Convex, Non-Composite | Õ((n + κ² / (1 − λ2(W))²) log(1/ε)) | Õ((κ² / (1 − λ2(W))²) log(1/ε)) |
| PG-EXTRA (Shi et al. 2015b) | Sum-of-Convex, Composite | O(nκ / (1 − λ2(W)) log(1/ε)) | O(nκ / (1 − λ2(W)) log(1/ε)) |
| NIDS (Li, Shi, and Yan 2019) | Sum-of-Convex, Composite | O(n (κ + 1/(1 − λ2(W))) log(1/ε)) | O((κ + 1/(1 − λ2(W))) log(1/ε)) |
| PMGT-SVRG (Ye, Xiong, and Zhang 2020) | Sum-of-Convex, Composite | Õ((n + κ) log(1/ε)) | Õ((n + κ) / √(1 − λ2(W)) log(1/ε)) |
| PMGT-SAGA (Ye, Xiong, and Zhang 2020) | Sum-of-Convex, Composite | Õ((n + κ) log(1/ε)) | Õ((n + κ) / √(1 − λ2(W)) log(1/ε)) |
| PMGT-SVRG (This paper) | Sum-of-Nonconvex, Composite | Õ((n + √n κ) log(1/ε)) | Õ((√n + κ) / √(1 − λ2(W)) log(1/ε)) |
| PMGT-Katyusha X (This paper) | Sum-of-Nonconvex, Composite | Õ((n + n^{3/4} √κ) log(1/ε)) | Õ((√n + n^{1/4} √κ) / √(1 − λ2(W)) log(1/ε)) |

Table 1: We compare the proposed algorithms with related work on Problem (1) when the global objective function is strongly convex. We use the notation Õ to hide logarithmic factors in the complexity. Note that the condition number in this table does not distinguish between the different smoothness parameters; we present our results distinguishing L, ℓ1, and ℓ2 in Sections 4 and 5.2.

3 Preliminaries
In this section, we formalize our problem setting.

3.1 Notations
We denote by ∥·∥ the Euclidean norm for vectors and the Frobenius norm for matrices, and by ∥·∥_2 the operator norm for matrices. We use a lowercase non-bold letter x ∈ R^d for a variable of dimension d and a lowercase bold letter x = (x_1, ..., x_m)^⊤ ∈ R^{m×d} for the aggregated variable collected from the m machines. We denote the all-one vector of dimension m by 1 ∈ R^m. For simplicity, we write 1x^⊤ ∈ R^{m×d} for the Kronecker product between the all-one vector 1 and some vector x. We use x̄ for the average of the aggregated variable x, i.e., x̄ = m^{-1} 1^⊤ x. For the non-smooth function ψ(·), we define Ψ(x) = m^{-1} Σ_{i=1}^m ψ(x_i) for the aggregated variable x ∈ R^{m×d}. We also define the proximal operator for a vector x as prox_{η,ψ}(x) = argmin_{z ∈ R^d} { ψ(z) + ∥z − x∥² / (2η) } and the proximal operator for an aggregated variable x as prox_{mη,Ψ}(x) = argmin_{z ∈ R^{m×d}} { Ψ(z) + ∥z − x∥² / (2mη) }. We use x* to denote the optimal solution of F(·), i.e., x* = argmin_{x ∈ R^d} F(x).
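For the common choice ψ(·) = λ∥·∥₁ mentioned earlier (λ here is an illustrative regularization weight, not a symbol from the paper), the proximal operator prox_{η,ψ} defined above has a well-known closed form, coordinate-wise soft-thresholding. A minimal sketch:

```python
import numpy as np

def prox_l1(x, eta, lam=1.0):
    """prox_{eta, lam*||.||_1}(x): solves argmin_z lam*||z||_1 + ||z - x||^2/(2*eta).
    The minimizer is the coordinate-wise soft-thresholding of x at level eta*lam."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

x = np.array([3.0, -0.5, 1.0])
# With eta*lam = 1, entries shrink toward zero by 1 and small entries vanish.
print(prox_l1(x, eta=1.0))
```

For a general convex ψ without a closed form, the same argmin would be computed by an inner solver; algorithms in this paper only require that the prox mapping can be evaluated.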
3.2 Problem Formulation
We summarize some basic properties of convex and smooth functions below.

Definition 3.1. For a function f: R^d → R, there exist constants L, ℓ1, ℓ2 > 0 and σ ≥ 0 such that
1. f is σ-strongly convex. That is, for any x, y ∈ R^d, f(x) − f(y) − ⟨∇f(y), x − y⟩ ≥ (σ/2) ∥x − y∥²;
2. f is L-Lipschitz smooth. That is, for any x, y ∈ R^d, ∥∇f(x) − ∇f(y)∥ ≤ L ∥x − y∥;
3. f is (ℓ1, ℓ2)-smooth. That is, for any x, y ∈ R^d, −(ℓ2/2) ∥x − y∥² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (ℓ1/2) ∥x − y∥².

Recall that the global objective function (1) can be decomposed into mn nonconvex functions f_{i,j}(·). The i-th agent is given access to a disjoint subset of n functions f_{i,j}(·) for j ∈ [n]. We assume the function f(·) is convex and L-smooth, each f_{i,j}(·) is (ℓ1, ℓ2)-smooth with ℓ2 ≤ ℓ1, and ψ(·) is a proper convex function. We further assume f is σ_f-strongly convex and ψ is σ_ψ-strongly convex with σ_f ≥ 0 and σ_ψ ≥ 0. We define σ = σ_f + σ_ψ and require σ > 0.

We focus on decentralized optimization over a network in which each agent communicates only with its neighbors. The topology of the network is characterized by the gossip matrix W. We let W_{i,j} > 0 if nodes i and j are direct neighbors in the network graph, and W_{i,j} = 0 if nodes i and j are not connected. Furthermore, we assume W is a doubly stochastic matrix satisfying the following properties:

Definition 3.2. Let W be a doubly stochastic matrix. Then, (a) W is symmetric, (b) 0 ⪯ W ⪯ I and W1 = 1, and (c) null(I − W) = span(1).

4 Convergence Analysis of PMGT-SVRG
In this section, we present the convergence rate of PMGT-SVRG (Ye, Xiong, and Zhang 2020) for the objective function (1). PMGT-SVRG achieves a linear convergence rate on the sum-of-convex problem by integrating gradient tracking and multi-consensus mixing techniques into the SVRG algorithm.
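A gossip matrix satisfying Definition 3.2 is easy to construct for simple topologies. The sketch below builds lazy Metropolis-style weights on a ring (an illustrative construction of ours, not the matrix used in the paper's experiments) and checks the stated properties numerically:

```python
import numpy as np

def ring_gossip_matrix(m):
    """Symmetric doubly stochastic gossip matrix for a ring of m agents:
    each agent keeps weight 1/2 on itself and 1/4 on each of its two neighbors."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 0.5
        W[i, (i - 1) % m] = 0.25
        W[i, (i + 1) % m] = 0.25
    return W

W = ring_gossip_matrix(8)
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
# Properties of Definition 3.2: symmetry, W1 = 1 (doubly stochastic),
# spectrum in [0, 1] with a simple eigenvalue 1, i.e. null(I - W) = span(1).
print(np.allclose(W, W.T))
print(np.allclose(W @ np.ones(8), np.ones(8)))
print(eigs[0], eigs[1], eigs[-1])
```

The gap 1 − λ2(W) visible in the printed spectrum is exactly the quantity that enters the communication complexities throughout the paper.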
We remove the assumption that each function f_{i,j}(·) is convex, so that the inequality

(1/(2L)) ∥∇f_{i,j}(x) − ∇f_{i,j}(y)∥² ≤ f_{i,j}(x) − f_{i,j}(y) − ⟨∇f_{i,j}(y), x − y⟩   for any x, y ∈ R^d

in Assumption 1 of Ye, Xiong, and Zhang (2020) no longer holds. The following result shows that PMGT-SVRG can still achieve a linear convergence rate on the sum-of-nonconvex problem:

Theorem 4.1. Assume the function F(·) defined in (1) is σ-strongly convex, f(·) is L-smooth, and each component f_{i,j} is (ℓ1, ℓ2)-smooth. Additionally, assume that the underlying network matrix W is doubly stochastic, so it satisfies the properties in Definition 3.2. Under an appropriate hyperparameter setting, to obtain F(ȳ^{k+1}) − F(x*) ≤ ε, the algorithm PMGT-SVRG requires at most

O( ( n + (L + (ℓ1 ℓ2)^{1/2} n^{1/2}) / σ ) log( (F(ȳ^0) − F(x*)) / ε ) )

SFO calls and

O( ( √n + (L + √(ℓ1 ℓ2)) / σ ) (1 − λ2(W))^{−1/2} log( (F(ȳ^0) − F(x*)) / ε ) )

rounds of communication.

Remark 4.2. Compared with the theoretical results for PMGT-SVRG on the decentralized sum-of-convex problem in Table 1, the SFO complexity introduces an additional dependency on n. It can be inferred that when the condition number κ := (L + √(ℓ1 ℓ2)) / σ is larger than √n, the SFO complexity of PMGT-SVRG on the sum-of-nonconvex problem is worse than on the sum-of-convex problem. Interestingly, the communication complexity of PMGT-SVRG achieved by our analysis is better than that of Ye, Xiong, and Zhang (2020), even though our objective function is harder. The improvement comes from the introduction of the minibatch in Algorithm 3 in the appendix.

5 PMGT-Katyusha X
In this section, we present the main idea behind PMGT-Katyusha X and state the convergence theorem for this algorithm.

5.1 The Algorithm
We present the main intuition behind the PMGT-Katyusha X algorithm. The core design of the algorithm is to apply the acceleration scheme (Allen-Zhu and Orecchia 2014) to the stochastic variance reduced gradient (SVRG) method (Johnson and Zhang 2013).
Furthermore, we blend the powerful ideas of gradient tracking and multi-consensus mixing into the accelerated algorithm to develop its decentralized variant.

The backbone of our algorithm is SVRG, which adopts an outer-inner loop structure to reduce the inherent variance of stochastic gradients. Specifically, we construct a full gradient snapshot ∇f_i(w_i^0) for each agent i ∈ [m] at each epoch. At each iteration of the inner loop, a variance-reduced unbiased gradient estimator is computed as

v_i^t = ∇f_i(w_i^0) + (1/b) Σ_{j ∈ B_i^t} ( ∇f_{i,j}(w_i^t) − ∇f_{i,j}(w_i^0) ),

where B_i^t is a minibatch of b indices sampled uniformly from {1, ..., n}. Recalling that the global objective function (1) contains a convex, non-smooth function ψ(·), we apply the proximal mapping after executing one step of gradient descent:

w_i^{t+1} = prox_{η,ψ}( w_i^t − η v_i^t ),

where η is the learning rate of the SVRG algorithm. The SVRG algorithm is known to achieve a linear rate of convergence for strongly convex objective functions.

Acceleration If we naively extend the SVRG algorithm to the decentralized setting, the convergence rate of the resulting algorithm has a linear dependency on the condition number, as shown in Theorem 4.1. To obtain an improved dependency on the condition number, we employ the acceleration technique introduced by Allen-Zhu and Orecchia (2014). The acceleration is achieved by a linear coupling of a gradient descent step and a mirror descent step. In particular, denoting by x_i^k := w_i^0 and y_i^k := w_i^T the first and the last iterates of the SVRG inner loop at epoch k, we apply one step of mirror descent as follows:

q_i^k = argmin_{q ∈ R^d} { (1/2) ∥q − q_i^{k−1}∥² + ⟨ (x_i^k − y_i^k) / (2τ), q ⟩ + (τ/2) ∥q − y_i^k∥² },

which can be simplified as

q_i^k = ( q_i^{k−1} + τ y_i^k − (x_i^k − y_i^k) / (2τ) ) / (1 + τ).

The iterate x_i^{k+1} is then updated as a linear combination of y_i^k and q_i^k: x_i^{k+1} = τ q_i^k + (1 − τ) y_i^k.
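The variance-reduced estimator above can be sketched for a single agent. The quadratic components below are hypothetical stand-ins (symmetric indefinite matrices, so individual components are nonconvex); the key property illustrated is that the estimator is unbiased, and with the full batch it reduces exactly to the local full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
# Hypothetical local finite sum on one agent: f_{i,j}(w) = 0.5 * w^T A_j w with
# symmetric, generally indefinite A_j, so each component may be nonconvex.
A = rng.standard_normal((n, d, d))
A = 0.5 * (A + A.transpose(0, 2, 1))

def grad_component(j, w):
    """Gradient of the j-th component f_{i,j}."""
    return A[j] @ w

def full_grad(w):
    """Gradient of the local average f_i(w) = (1/n) sum_j f_{i,j}(w)."""
    return np.mean([grad_component(j, w) for j in range(n)], axis=0)

def svrg_estimator(w, w0, batch):
    """v = grad f_i(w0) + (1/|B|) sum_{j in B} (grad f_{i,j}(w) - grad f_{i,j}(w0))."""
    corr = np.mean([grad_component(j, w) - grad_component(j, w0) for j in batch], axis=0)
    return full_grad(w0) + corr

w0 = rng.standard_normal(d)          # snapshot point of the current epoch
w = rng.standard_normal(d)           # current inner-loop iterate
batch = rng.integers(0, n, size=4)   # minibatch sampled with replacement
v = svrg_estimator(w, w0, batch)     # unbiased estimate of full_grad(w)
# Sanity check: with the full batch the estimator equals the exact gradient.
print(np.allclose(svrg_estimator(w, w0, range(n)), full_grad(w)))  # True
```

The prox step of the text would then be applied to w − η v; the decentralized variant additionally mixes these quantities across agents, as described next.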
On top of the accelerated SVRG method, we also apply gradient tracking and multi-consensus mixing to extend the above algorithm to the decentralized setting.

Gradient Tracking Recall that our goal is to find the minimizer x* of the global objective function (1). However, gradients collected from local neighbors have large variances due to the dissimilarity between distinct local objective functions f_i(·). To alleviate this issue, we adopt the gradient tracking technique, which introduces a new variable s to track the difference between local gradient estimators:

s^{t+1} = Mix( s^t + v^{t+1} − v^t ),

where Mix(·) is some mixing protocol and v^t is the aggregated variance-reduced gradient estimator of {v_i^t}_{i=1}^m at the t-th iteration of the inner loop. The intuition behind the technique is that while the variance of the local gradient estimators can be arbitrarily large in general, the variance of the differences between local gradient estimators becomes small as the local variables approach the global minimizer x*. The technique also applies to the full gradient constructed in the outer loop.

Algorithm 1: PMGT-Katyusha X
Input: x_i^0, y_i^0, y_i^{−1}, q_i^0 and s_i^0
Parameter: FastMix parameter M, functions {f_i}_{i=1}^m and ψ, initial point x^0, mini-batch size b, learning rate η > 0, momentum parameter τ ∈ (0, 1], number K of epochs
Output: x_i^K, y_i^K, q_i^K
1: Initialize x_i^0 = x^0, y_i^0 = x^0, y_i^{−1} = x^0, q_i^0 = x^0 and ŝ_i^0 = ∇f(x^0) for each i in parallel
2: for k = 0, ..., K − 1 do
3:   x_i^{k+1} = FastMix( τ q^k + (1 − τ) y^k, M )_i
4:   ν_i^{k+1} = ŝ_i^k + ∇f_i(x_i^{k+1}) − ∇f_i(x_i^k)
5:   ŝ_i^{k+1} = FastMix( ν^{k+1}, M )_i
6:   s_i^{−1} = v_i^{−1} = ŝ_i^{k+1}
7:   μ_i = ŝ_i^{k+1}, w_i^0 = x_i^{k+1}
8:   t_0 = n/b and sample T ∼ Geom(1/t_0)
9:   for t = 0, ..., T do
10:    Let B_i^t be a batch of b indices sampled uniformly from [n] with replacement
11:    v_i^t = μ_i + (1/b) Σ_{j ∈ B_i^t} ( ∇f_{i,j}(w_i^t) − ∇f_{i,j}(w_i^0) )
12:    s_i^t = FastMix( s^{t−1} + v^t − v^{t−1}, M )_i
13:    w_i^{t+1} = FastMix( prox_{mη,Ψ}(w^t − η s^t), M )_i
14:  end for
15:  y_i^{k+1} = w_i^{T+1}
16:  q_i^{k+1} = FastMix( ( q^k + τ y^{k+1} − (x^{k+1} − y^{k+1}) / (2τ) ) / (1 + τ), M )_i
17: end for

Multi-Consensus Mixing In the decentralized learning scenario, each agent wishes to obtain the global average of the variables through communication with its local neighbors. An agent can obtain a more faithful estimate of the global average through multiple rounds of local communication. Additionally, the mixing can be accelerated to achieve even faster consensus. The complete procedure of multi-consensus mixing is given in Algorithm 2. We apply the mixing technique to both the state variables and the gradient tracking variables in both the inner loop and the outer loop. After fusing these techniques with the SVRG algorithm, the complete PMGT-Katyusha X for the decentralized setting is presented in Algorithm 1.

Algorithm 2: FastMix(x^0, M, W)
Initialize: x^{−1} = x^0 and η = 1 / (1 + √(1 − λ2²(W)))
Output: x^M
1: for k = 0, ..., M − 1 do
2:   x^{k+1} = (1 + η) W x^k − η x^{k−1}   (2)
3: end for

5.2 Main Theorem
In this subsection, we characterize the linear convergence rate of PMGT-Katyusha X on decentralized sum-of-nonconvex problems.

Theorem 5.1. Assume the function F(·) defined in (1) is σ-strongly convex, f(·) is L-smooth, and each component f_{i,j}(·) is (ℓ1, ℓ2)-smooth. We also assume that the underlying network matrix W is doubly stochastic, so it satisfies the properties in Definition 3.2. Under appropriate parameter settings, the outputs of Algorithm 1 have the following properties:

E[ F(ȳ^K) − F(x*) ] ≤ 3 (1 + τ/2)^{−K} ( F(ȳ^0) − F(x*) ),
E[ ∥x̄^K − x*∥² ] ≤ (128 L / σ) (1 + τ/2)^{−K} ∥x̄^0 − x*∥²,
E[ ∥q̄^K − x*∥² ] ≤ (1750 L / (σ τ⁴)) (1 + τ/2)^{−K} ∥x̄^0 − x*∥².

Remark 5.2.
To obtain an ε-approximate solution of Problem (1), i.e., F(ȳ^K) − F(x*) ≤ ε, the outer loop of PMGT-Katyusha X has to be executed at least

O( ( 1 + √( L b / (n σ) ) + (ℓ1 ℓ2)^{1/4} / (√σ n^{1/4}) ) log( (F(ȳ^0) − F(x*)) / ε ) )

times. Recall that at each epoch, the algorithm makes one full gradient oracle call and Tb SFO calls, where E[T] = n/b. Consequently, our algorithm makes 2n SFO calls per epoch in expectation. Additionally, as can be seen in Algorithm 1, the multi-consensus mixing takes M rounds of communication per call, and we can deduce that M = O( log(L/σ) / √(1 − λ2(W)) ). To reach ε-accuracy, Algorithm 1 therefore makes O(Kn) calls to the first-order oracle and O(Kn M / b) rounds of communication. We can bound the computation and communication complexities by setting the batch size b = √n, and the following corollary is obtained:

Corollary 5.3. Under the setting of Theorem 5.1, to obtain an ε-approximate solution ȳ^K, i.e., F(ȳ^K) − F(x*) ≤ ε, PMGT-Katyusha X requires at most

O( ( n + ( (L^{1/2} + (ℓ1 ℓ2)^{1/4}) / σ^{1/2} ) n^{3/4} ) log( (F(ȳ^0) − F(x*)) / ε ) )

SFO calls and

O( ( √n + ( (L^{1/2} + (ℓ1 ℓ2)^{1/4}) / σ^{1/2} ) n^{1/4} ) (1 − λ2(W))^{−1/2} log( (F(ȳ^0) − F(x*)) / ε ) )

rounds of communication.

6 Proof Sketch of Theorem 5.1
In this section, we provide a sketch of the proof of Theorem 5.1. Since Algorithm 1 has a double-loop structure, we prove the theorem in two stages. In the first stage, we present the analysis for one epoch of the decentralized SVRG adapted to the sum-of-nonconvex problem; in the second stage, our goal is to show that the consensus error and the convergence error of the outer loop decay linearly at the same rate.

We first define the vector of consensus errors in the inner loop as z^t = ( ∥w^t − 1 w̄^t∥², ∥s^t − 1 s̄^t∥² )^⊤. We can build a system of linear inequalities in z^t such that the spectral norm of the coefficient matrix is less than 1 if the hyperparameters are chosen appropriately. We start by reviewing the following lemma, which is essential for the analysis of the consensus error:

Lemma 6.1 (Liu and Morse (2011)).
Let w^0, w^M ∈ R^{m×d} be the input and output of Algorithm 2 and w̄ = (1/m) 1^⊤ w^0. Then it holds that

∥w^M − 1 w̄∥ ≤ ρ ∥w^0 − 1 w̄∥,   where ρ = ( 1 − √(1 − λ2(W)) )^M.

After constructing the linear system of inequalities for z^t, we obtain the following lemma, which bounds the consensus error accumulated in the inner loop:

Lemma 6.2. Given T ∼ Geom(p), the expected consensus error at the end of the inner loop satisfies

E_T ∥z^T∥ ≤ ∥z^0∥ + ((1 − p)/p) · 16 ρ² η² ℓ1 ℓ2 ∥w^0 − 1 w̄^0∥² + ((1 − p)/p) · 16 ρ² η² ℓ1 ℓ2 m · E_T ∥w̄^T − w̄^0∥².

Using the convexity and smoothness assumptions, we can obtain the following result:

Lemma 6.3. For any iteration t of the inner loop, let the average variable be w̄^t = m^{−1} 1^⊤ w^t, where w^{t+1} = FastMix( prox_{mη,Ψ}(w^t − η s^t), M ). For any u ∈ R^d,

E[ F(w̄^{t+1}) − F(u) ] ≤ ( η / (2 − η σ_f) ) ∥s̄^t − ∇f(w̄^t)∥² + ( ∥u − w̄^t∥² − (1 + σ_ψ η) ∥u − w̄^{t+1}∥² ) / (2η) + ( (1 − ηL + 2η) / (2(1 − ηL) m) ) ∥s^t − 1 s̄^t∥² + ( (σ_f m + 1 + η) / m ) ∥w^t − 1 w̄^t∥².

Moreover, one can show the following result by the (ℓ1, ℓ2)-smoothness of f_{i,j}:

Lemma 6.4. Denote s̄^t = m^{−1} Σ_{i=1}^m s_i^t with s_i^t = FastMix( s^{t−1} + v^t − v^{t−1}, M )_i. Then,

E ∥s̄^t − ∇f(w̄^t)∥² ≤ (6 ℓ1 ℓ2 / b) ∥w̄^t − w̄^0∥² + (6 ℓ1 ℓ2 / (mb)) ∥w^0 − 1 w̄^0∥².

Combining the above three lemmas, we can prove the following main result for the inner loop:

Lemma 6.5. If we choose the hyperparameters for Algorithm 1 such that η ≤ min{ 1/(2L), √(b / (ℓ1 ℓ2 t_0)) / 8 } and ρ ≤ min{ (8 b m η J t_0)^{−1/2}, (18(1 + b))^{−1/2} }, where J := J(η, b) is some positive constant, then for any u ∈ R^d,

E[ F(w̄^{T+1}) − F(u) ] ≤ E[ ( −∥w̄^{T+1} − w̄^0∥² + 2⟨w̄^0 − u, w̄^0 − w̄^{T+1}⟩ ) / (4η t_0) − (σ/8) ∥u − w̄^{T+1}∥² ] + J η² ∥s^0 − 1 s̄^0∥² + ( (J + 1) / (4η m) ) ∥w^0 − 1 w̄^0∥².

For the theoretical analysis of the outer loop, we define the vector of consensus errors ( ∥x^t − 1 x̄^t∥², ∥y^t − 1 ȳ^t∥², ∥q^t − 1 q̄^t∥², η² ∥ŝ^t − 1 s̄̂^t∥² )^⊤ and again construct a system of linear inequalities in this quantity. Using a proof technique similar to that for the inner loop to bound the consensus error of the outer loop, we can prove Theorem 5.1 by induction. The complete theoretical analysis of PMGT-Katyusha X is presented in the Appendix.
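The FastMix routine of Algorithm 2 can be sketched directly from its recursion (the ring gossip matrix below is an illustrative topology of ours; the momentum weight η follows the initialization stated in Algorithm 2):

```python
import numpy as np

def fast_mix(x0, M, W):
    """Accelerated gossip in the style of Algorithm 2:
    x^{k+1} = (1 + eta) * W x^k - eta * x^{k-1},
    with eta = 1 / (1 + sqrt(1 - lambda_2(W)^2)) and x^{-1} = x^0."""
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]   # second largest eigenvalue of W
    eta = 1.0 / (1.0 + np.sqrt(1.0 - lam2**2))
    x_prev, x = x0, x0
    for _ in range(M):
        x, x_prev = (1.0 + eta) * (W @ x) - eta * x_prev, x
    return x

# Illustrative ring gossip matrix (symmetric, doubly stochastic).
m = 16
W = np.zeros((m, m))
for i in range(m):
    W[i, i], W[i, (i - 1) % m], W[i, (i + 1) % m] = 0.5, 0.25, 0.25

x0 = np.random.default_rng(1).standard_normal(m)
avg = x0.mean()
out = fast_mix(x0, M=10, W=W)
# Doubly stochastic mixing preserves the global average while the
# consensus error ||x - 1*avg|| shrinks geometrically in M.
print(abs(out.mean() - avg))
print(np.linalg.norm(out - avg), np.linalg.norm(x0 - avg))
```

Lemma 6.1 quantifies exactly this behavior: the distance to the consensus vector contracts by a factor ρ that decays geometrically in the number of mixing rounds M.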
7 Numerical Experiments
To demonstrate the efficiency of PMGT-Katyusha X, we evaluate the proposed method on the subproblem arising when solving PCA by the shift-and-invert method. The corresponding sum-of-nonconvex optimization problem has the form

min_{x ∈ R^d} F(x) = (1/2) x^⊤ ( ( λ1 + (λ1 − λ2)/r ) I − A ) x + b^⊤ x,   (3)

where A = (mn)^{−1} Σ_{i=1}^m Σ_{j=1}^n a_{i,j} a_{i,j}^⊤ ∈ R^{d×d}, λ1 and λ2 are the largest and second largest eigenvalues of the matrix A, and r is a hyperparameter that controls the ratio for the eigengap.

We conduct our experiments on both synthetic and real-world datasets. For the synthetic dataset, we generate a Bernoulli matrix of size 60,000 × 50 with entries in {±1}; for the real-world dataset, we use Covtype, downloaded from the LIBSVM website². For the gossip matrix W underlying the decentralized network, we generate a matrix where each agent is randomly connected to two neighbors. The second largest eigenvalue of the resulting matrix is λ2(W) = 0.97.

Baselines We compare the empirical performance of our proposed method with several baselines, including PMGT-SVRG (Ye, Xiong, and Zhang 2020), PG-EXTRA (Shi et al. 2015b), and NIDS (Li, Shi, and Yan 2019). Although the theoretical results for PG-EXTRA and NIDS are developed for sum-of-convex objective functions, they still exhibit convergence on sum-of-nonconvex functions empirically.

Experiment Specifications For all of the experiments, the y-axis represents the suboptimality of the function value, i.e., F(ȳ) − F(x*), where F(x*) is taken as the lowest value achieved among all the baselines. The left and middle figures represent the suboptimality vs. gradient evaluations and communication rounds, respectively.
The right figure shows the suboptimality vs. the cost, which is a weighted average of gradient evaluations and communication rounds.

²https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Figure 1: Performance comparison between PG-EXTRA, NIDS, PMGT-SVRG, and PMGT-Katyusha X on the synthetic dataset. The left column represents results with the ratio r = 2 and the right column represents results with the ratio r = 300 defined in Problem (3). The plots of PG-EXTRA and NIDS overlap as their performance is close to each other.

Comparison Results on the Synthetic Dataset Figure 1 reports the performance comparison between our PMGT-Katyusha X and the baselines with the ratio r = 2 or r = 300. PMGT-Katyusha X outperforms the other baselines in both settings, even when the ratio is small. We point out that PG-EXTRA and NIDS have similar performance, so their curves overlap in the comparison. For each setting, PMGT-Katyusha X makes the fewest gradient evaluations among the baselines. To our surprise, it even requires fewer communication rounds to converge than NIDS, although the communication bound of PMGT-Katyusha X depends on the number of component functions n. As a result, PMGT-Katyusha X achieves the best cost among all baselines.

Comparison Results on the Real-world Dataset Figure 2 reports the performance comparison between PMGT-Katyusha X and the other baselines on the Covtype dataset with the ratio r = 2 or r = 300.
When r is small, PMGT-Katyusha X makes fewer gradient evaluations than NIDS/PG-EXTRA, but requires more communication rounds than NIDS/PG-EXTRA. When r is large, PMGT-Katyusha X outperforms the other baselines in both the number of gradient evaluations and the number of communication rounds.

Figure 2: Performance comparison between PG-EXTRA, NIDS, PMGT-SVRG, and PMGT-Katyusha X on the Covtype dataset. The left column represents results with the ratio r = 2 and the right column represents results with the ratio r = 300 defined in (3). The plots of PG-EXTRA and NIDS overlap as their performance is close to each other.

8 Conclusions and Future Work
This paper presented a new theoretical analysis of PMGT-SVRG for the decentralized sum-of-nonconvex problem, showing a linear convergence rate with a linear dependency on the condition number. To achieve a better dependency, we proposed the first accelerated stochastic first-order algorithm for the decentralized sum-of-nonconvex problem and showed that it enjoys a linear convergence rate with a square-root dependency on the condition number. The empirical evidence validates the advantages of our algorithm on both synthetic and real-world datasets. Although the computational complexity of the proposed algorithm matches that of the best-known algorithm for the centralized scenario, the communication upper bound remains unsatisfying due to the factor n. It would be interesting to study how to design a more communication-efficient algorithm for decentralized sum-of-nonconvex optimization.
Acknowledgments
This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2023-08-043T-J). This research is part of the programme DesCartes and is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme.

References
Agarwal, N.; Allen-Zhu, Z.; Bullins, B.; Hazan, E.; and Ma, T. 2016. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146.
Alghunaim, S.; Yuan, K.; and Sayed, A. H. 2019. A linearly convergent proximal gradient algorithm for decentralized optimization. Advances in Neural Information Processing Systems, 32.
Allen-Zhu, Z. 2018. Katyusha X: Practical momentum method for stochastic sum-of-nonconvex optimization. arXiv preprint arXiv:1802.03866.
Allen-Zhu, Z.; and Li, Y. 2016. LazySVD: Even faster SVD decomposition yet without agonizing pain. Advances in Neural Information Processing Systems, 29.
Allen-Zhu, Z.; and Li, Y. 2017. Follow the compressed leader: Faster online learning of eigenvectors and faster MMWU. In International Conference on Machine Learning, 116–125. PMLR.
Allen-Zhu, Z.; and Orecchia, L. 2014. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537.
Carmon, Y.; Duchi, J. C.; Hinder, O.; and Sidford, A. 2018. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2): 1751–1772.
Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27.
Di Lorenzo, P.; and Scutari, G. 2016. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2): 120–136.
Garber, D.; Hazan, E.; Jin, C.; Kakade, S. M.; Musco, C.; Netrapalli, P.; and Sidford, A. 2016. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In International Conference on Machine Learning.
Hendrikx, H.; Bach, F.; and Massoulié, L. 2019. An accelerated decentralized stochastic proximal algorithm for finite sums. Advances in Neural Information Processing Systems, 32.
Hendrikx, H.; Bach, F.; and Massoulié, L. 2020. Dual-free stochastic decentralized optimization with variance reduction. Advances in Neural Information Processing Systems, 33: 19455–19466.
Hendrikx, H.; Bach, F.; and Massoulié, L. 2021. An optimal algorithm for decentralized finite-sum optimization. SIAM Journal on Optimization, 31(4): 2753–2783.
Johnson, R.; and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26.
Li, B.; Cen, S.; Chen, Y.; and Chi, Y. 2020. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. In International Conference on Artificial Intelligence and Statistics, 1662–1672. PMLR.
Li, B.; Li, Z.; and Chi, Y. 2022. DESTRESS: Computation-optimal and communication-efficient decentralized nonconvex finite-sum optimization. SIAM Journal on Mathematics of Data Science, 4(3): 1031–1051.
Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 583–598.
Li, Z.; Shi, W.; and Yan, M. 2019. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17): 4494–4506.
Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.-J.; Zhang, W.; and Liu, J. 2017.
Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30.
Liu, J.; and Morse, A. S. 2011. Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2): 160–165.
Luo, L.; and Ye, H. 2022. An optimal stochastic algorithm for decentralized nonconvex finite-sum optimization. arXiv preprint arXiv:2210.13931.
Mokhtari, A.; and Ribeiro, A. 2016. DSA: Decentralized double stochastic averaging gradient algorithm. The Journal of Machine Learning Research, 17(1): 2165–2199.
Qu, G.; and Li, N. 2017. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3): 1245–1260.
Saad, Y. 2011. Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM.
Shen, Z.; Mokhtari, A.; Zhou, T.; Zhao, P.; and Qian, H. 2018. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In International Conference on Machine Learning, 4624–4633. PMLR.
Shi, W.; Ling, Q.; Wu, G.; and Yin, W. 2015a. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2): 944–966.
Shi, W.; Ling, Q.; Wu, G.; and Yin, W. 2015b. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22): 6013–6023.
Sun, H.; Lu, S.; and Hong, M. 2020. Improving the sample and communication complexity for decentralized nonconvex optimization: Joint gradient estimation and tracking. In International Conference on Machine Learning, 9217–9228. PMLR.
Sun, Y.; Daneshmand, A.; and Scutari, G. 2019. Convergence rate of distributed optimization algorithms based on gradient tracking. arXiv preprint arXiv:1905.02637.
Xin, R.; Khan, U. A.; and Kar, S. 2020.
Variance-reduced decentralized stochastic optimization with accelerated convergence. IEEE Transactions on Signal Processing, 68: 6255–6271.
Xin, R.; Khan, U. A.; and Kar, S. 2022. Fast decentralized nonconvex finite-sum optimization with recursive variance reduction. SIAM Journal on Optimization, 32(1): 1–28.
Ye, H.; Xiong, W.; and Zhang, T. 2020. PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction. arXiv preprint arXiv:2012.15010.
Yuan, K.; Ying, B.; Zhao, X.; and Sayed, A. H. 2018. Exact diffusion for distributed optimization and learning - Part I: Algorithm development. IEEE Transactions on Signal Processing, 67(3): 708–723.

The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)