# finding_local_minima_efficiently_in_decentralized_optimization__7412ae83.pdf

Finding Local Minima Efficiently in Decentralized Optimization

Wenhan Xian Computer Science University of Maryland College Park wxian1@umd.edu

Heng Huang Computer Science University of Maryland College Park heng@umd.edu

In this paper we study the second-order optimality of decentralized stochastic algorithm that escapes saddle point efficiently for nonconvex optimization problems. We propose a new pure gradient-based decentralized stochastic algorithm PEDESTAL with a novel convergence analysis framework to address the technical challenges unique to the decentralized stochastic setting. Our method is the first decentralized stochastic algorithm to achieve second-order optimality with non-asymptotic analysis. We provide theoretical guarantees with the gradient complexity of O(ϵ 3) to find O(ϵ, ϵ)-second-order stationary point, which matches state-of-the-art results of centralized counterparts or decentralized methods to find first-order stationary point. We also conduct two decentralized tasks in our experiments, a matrix sensing task with synthetic data and a matrix factorization task with a real-world dataset to validate the performance of our method.

1 Introduction

Decentralized optimization is a class of distributed optimization that trains models in parallel across multiple worker nodes over a decentralized communication network. Decentralized optimization has recently attracted increased attention in machine learning and emerged as a promising framework to solve large-scale tasks because of its capability to reduce communication costs. In the conventional centralized paradigm, all worker nodes need to communicate with the central node, which results in high communication cost on the central node when the number of nodes is large or the transmission between the center and some remote nodes suffers network latency. Conversely, decentralized optimization avoids these issues since each worker node only communicates with its neighbors.

Although decentralized optimization has shown advantageous performance in many previous works (Lian et al. [2017], Tang et al. [2018]), the study of second-order optimality for decentralized stochastic optimization algorithms is still limited. Escaping saddle point and finding local minima is a core problem in nonconvex optimization since saddle point is a category of first-order stationary point that can be reached by many gradient-based optimizers such as gradient descent but it is not the expected point to minimize the objective function.

Perturbed gradient descent (Jin et al. [2017]) and negative curvature descent (Xu et al. [2018], Allen-Zhu and Li [2018]) are two primary pure gradient-based methods (not involving second-order derivatives) to achieve second-order optimality. Typically, perturbed gradient descent method is composed of a descent phase and an escaping phase. If the norm of gradient is large, the algorithm will run the descent phase as normal. Otherwise it will run the escaping phase to discriminate whether the candidate first-order stationary point is a saddle point or local minimum. Negative curvature

This work was partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

descent method escapes saddle point by computing the direction of negative curvature at the candidate point. If it is categorized as a saddle point then the algorithm will update along the direction of negative curvature. Generally it involves a nested loop to perform the negative curvature subroutine.

Currently, the solution to the second-order optimality of decentralized problem in deterministic setting has been proposed. Perturbed Decentralized Gradient Tracking (PDGT) (Tziotis et al. [2020]) is a decentralized deterministic algorithm adopting the perturbed gradient descent strategy to achieve second-order stationary point. However, it is expensive to compute full gradients for large machine learning models. It is crucial to propose a stochastic algorithm to obtain second-order optimality for decentralized problems. Besides, there are some drawbacks of PDGT to make it less efficient and hard to be generalized to the stochastic setting. These drawbacks are also the key challenges to achieve second-order optimality for decentralized algorithms, which are listed as follows:

(1) PDGT runs fixed numbers of iterations in descent phase and escaping phase such that the phases of all nodes can be changed simultaneously. This strategy works because the descent is easy to be estimated in deterministic setting. Nonetheless, the exact descent of stochastic algorithm over a fixed number of iterations is hard to be bounded because of randomness and noises. If the fixed number is not large enough it is possible that the averaged model parameter is not a first-order stationary point. If the fixed number is as large as the expected number of iterations to achieve first-order stationary point, the algorithm will become less efficient as it is probably stuck at a saddle point for a long time before drawing the perturbation, especially in the second and later descent phase. Specifically, applying fixed number of iterations in each phase results in the complexity of at least O(ϵ 4.5) (see Appendix D), which is higher than O(ϵ 3) of our method. Therefore, we are motivated to propose an algorithm that can change phases adaptively (based on runtime gradient norm) and independently (not required to consider status on other nodes or notify other nodes).

(2) In PDGT the perturbations on all nodes are drawn from the same random seed. Besides, a coordinating protocol involving broadcast and aggregation is used to compute the averaged model parameter and the descent of overall loss function to discriminate the candidate point. These strategies together with the fixed number of iterations act as a hidden coordinator to make PDGT discriminate saddle point in the same way as centralized algorithms. However, when the number of worker nodes is large it is time-consuming to perform broadcast or aggregation over the whole decentralized network. Moreover, when generalized to stochastic setting the changing of phase is not guaranteed to be synchronized. Additionally, we will note in the Supplementary Material that the consensus error 1

n Pn i=1 x(i) t xt 2 is another factor to impact the effectiveness of perturbed gradient descent, which is not present in centralized problems. All above issues are theoretical difficulties to study and ensure second-order optimality for decentralized stochastic algorithms.

(Vlaski and Sayed [2020]) proves the theoretical guarantee of second-order optimality for decentralized stochastic algorithm with perturbed gradient descent. However, it does not provide a non-asymptotic analysis to estimate the convergence rate or gradient complexity. The effectiveness of the result relies on a sufficiently small learning rate, and it does not present a specific algorithm. The analysis is based on the assumption that the iteration formula can be approximated by a centralized update scheme when the learning rate is small enough. Nevertheless, in practice it is difficult to maintain an ideally small learning rate, and the iterative update process can be more complex as previously mentioned. To our best knowledge, the second-order optimality issue of decentralized stochastic algorithm with non-asymptotic analysis is still not solved. Therefore, we are motivated to study this important and challenging issue and raise the following questions:

Can we design a decentralized stochastic optimization algorithm with non-asymptotic analysis to find local minima efficiently? Is the algorithm still effective to discriminate saddle point even if each node can change its phase adaptively and independently without any coordinating protocols?

The answer is affirmative. In this paper, we propose a novel gradient-based algorithm named PErturbed DEcentralized STORM ALgorithm (PEDESTAL) which is the first decentralized stochastic algorithm to find second-order stationary point. We adopt perturbed gradient descent to ensure the second-order optimality and use STORM (Cutkosky and Orabona [2019]) estimator to accelerate the convergence. We provide completed convergence analysis to guarantee the second-order optimality theoretically. More details about the reason of choosing perturbed gradient descent and technical difficulties are discussed in Section 3.2. Next we will introduce the problem setup in this paper.

We focus on the following decentralized optimization problem:

min x f(x) = 1

i=1 fi(x), fi(x) = Eξ Di Fi(x, ξ) (1)

where n is the number of worker nodes in the decentralized network and fi is the local loss function on i-th worker node. Here fi is supposed to take the form of stochastic expectation over local data distribution Di, which covers a variety of optimization problems including finite-sum problem and online problem. Data distributions on different nodes are allowed to be heterogeneous. The objective function f is nonconvex such that saddle points probably exist.

The goal of our method is to find O(ϵ, ϵH)-second-order stationary point of problem 1, which is defined by the point x satisfying f(x) ϵ and min eig( 2f(x)) ϵH, where eig( ) represents the eigenvalues. The classic setting is ϵH = ϵ.

We summarize the contributions of this paper as follows:

We propose a novel algorithm PEDESTAL, which is the first decentralized stochastic gradient-based algorithm to achieve second-order optimality with non-asymptotic analysis. We provide a new analysis framework to support changing phases adaptively and independently on each node without any coordinating protocols involving broadcast or aggregation. We also address certain technical difficulties unique to decentralized optimization to justify the effectiveness of perturbed gradient descent in discriminating saddle point.

We prove that our PEDESTAL achieves the gradient complexity of O(ϵ 3+ϵϵ 8 H +ϵ4ϵ 11 H ) to find O(ϵ, ϵH)-second-order stationary point. Particularly, PEDESTAL achieves the gradient complexity of O(ϵ 3) in the classic setting ϵH = ϵ, which matches state-of-the-art results of centralized counterparts or decentralized methods to find first-order stationary point.

2 Related Work

In this section we will introduce the background of related works. The comparison of important features is shown in Table 1. Here O( ) refers to the big O notation that hides the logarithmic terms.

2.1 Decentralized Algorithms for First-Order Optimality Decentralized optimization is an efficient framework to solve problem 1 collaboratively by multiple worker nodes. In each iteration a worker node only needs to communicate with its neighbors. One of the best-known decentralized stochastic algorithm is D-PSGD (Lian et al. [2017]), which integrates average consensus with local stochastic gradient descent steps and shows competitive result to centralized SGD. The ability to address Non-IID data is a limitation of D-PSGD and some variants of D-PSGD are studied to tackle the data heterogeneity issue, such as D2 (Tang et al. [2018]) by storing previous status and GT-DSGD (Xin et al. [2021b]) by using gradient tracking (Xu et al. [2015], Lorenzo and Scutari [2016]). D-GET (Sun et al. [2020]) and D-SPIDER-SFO (Pan et al. [2020]) improve the gradient complexity of D-PSGD from O(ϵ 4) to O(ϵ 3) by utilizing variance reduced gradient estimator SPIDER (Fang et al. [2018]). GT-HSGD also achieves gradient complexity of O(ϵ 3) by combining gradient tracking and STORM gradient estimator (Cutkosky and Orabona [2019]). SPIDER requires a large batchsize of O(ϵ 1) on average and a mega batchsize of O(ϵ 2) periodically. In contrast, STORM only requires a large batch in the first iteration. After that the batchsize can be as small as O(1), which makes STORM more efficient to be implemented in practice.

2.2 Centralized Algorithms for Second-Order Optimality Perturbed gradient descent is a simple and effective method to escape saddle points and find local minima. PGD (Jin et al. [2017]) is the representative of this family of algorithms, which achieves second-order optimality in deterministic setting. It draws a perturbation when the gradient norm is small. If this point is a saddle point, the loss function value will decrease by a certain threshold within a specified number of iterations (i.e., breaking the escaping phase) with high probability. Otherwise, the candidate point is regarded as a second-order stationary point. In stochastic setting, Perturbed SGD perturbs every iteration and suffers a high gradient complexity of O(ϵ 8) to achieve O(ϵ, ϵ)-second-order stationary point and the gradient complexity hides a polynomial factor of dimension d. CNC-SGD requires a Correlated Negative Curvature assumption and the gradient complexity of O(ϵ 5) to achieve the classic second-order optimality. SSRGD (Li [2019]) adopts the

Name Averaged Batchsize Gradient Complexity Classic Setting

D-PSGD [12] O(1) O(ϵ 4) - GT-DSGD [22] O(1) O(ϵ 4) - D-GET [16] O(ϵ 1) O(ϵ 3) - D-SPIDER-SFO [15] O(ϵ 1) O(ϵ 3) - GT-HSGD [21] O(1) O(ϵ 3) -

SGD+Neon2 [1] O(1) O(ϵ 4 + ϵ 2ϵ 3 H + ϵ 5 H ) O(ϵ 4) SCSG+Neon2 [1] O(ϵ 0.5) O(ϵ 10/3 + ϵ 2ϵ 3 H + ϵ 5 H ) O(ϵ 3.5) Natasha2+Neon2 [1] O(ϵ 2) O(ϵ 3.25 + ϵ 3ϵ 1 H + ϵ 5 H ) O(ϵ 3.5) SPIDER-SFO+ [5] O(ϵ 1) O(ϵ 3 + ϵ 2ϵ 2 H + ϵ 5 H ) O(ϵ 3) Perturbed SGD [6] O(1) O(ϵ 4 + ϵ 16 H ) O(ϵ 8) CNC-SGD [4] O(1) O(ϵ 4 + ϵ 10 H ) O(ϵ 5) SSRGD [11] O(ϵ 1) O(ϵ 3 + ϵ 2ϵ 3 H + ϵ 1ϵ 4 H ) O(ϵ 3.5) Pullback [2] O(ϵ 1) O(ϵ 3 + ϵ 6 H ) O(ϵ 3)

PDGT [18] Full - - PEDESTAL-S (ours) O(1) O(ϵ 3), ϵH ϵ0.2 - PEDESTAL (ours) O(ϵ 3/4) O(ϵ 3 + ϵϵ 8 H + ϵ4ϵ 11 H ) O(ϵ 3)

Table 1: The comparison of important properties between related algorithms and our PEDESTAL. Column Averaged Batchsize" is computed when ϵH = ϵ. Column Classic Setting" refers to the gradient complexity under the classic condition ϵH = ϵ. The first group of algorithms are decentralized methods achieving first-order optimality. The second group of algorithms are centralized methods achieving second-order optimality. The last group of algorithms are decentralized methods achieving second-order optimality. PEDESTAL-S is a special case of PEDESTAL with O(1) batchsize. The complexity of PDGT is not shown because it is not stochastic.

same two-phase scheme as PGD but uses the moving distance as the criterion to discriminate saddle point in the escaping phase. It also takes advantage of variance reduction to improve the gradient complexity to O(ϵ 3.5). Pullback (Chen et al. [2022]) proposes a pullback step to further enhance the gradient complexity to O(ϵ 3), which matches the best result of reaching first-order stationary point.

2.3 Stochastic Gradient Descent

A branch of study of stochastic gradient descent argues that SGD can avoid saddle point under certain conditions. However, that is completely different from the problem we focus on. In this paper we propose a method that can find local minima effectively for a general problem 1, while escaping saddle point by stochastic gradient itself depends on some additional assumptions. For example, (Mertikopoulos et al. [2020]) requires the noise of gradient should be uniformly excited. According to our experimental result in Section 5, we can see in some cases stochastic gradient descent cannot escape saddle point effectively or efficiently. Besides, the gradient noise in variance reduced methods is reduced in order to accelerate the convergence. Our experimental results indicate that the gradient noise in variance reduced algorithms is not as good as SGD to serve as the perturbation to avoid saddle point. Therefore, it is necessary to study the second-order stationary point for variance reduced algorithms so as to enable both second-order optimality and fast convergence.

3.1 Algorithm

In this section, we will introduce our PEDESTAL algorithm, which is demonstrated in Algorithm 1. Suppose there are n worker nodes in the decentralized communication network connected by a weight matrix W. The initial value of model parameters on all nodes are identical and equal to x0. x(i) t , v(i) t and y(i) t are the model parameter, gradient estimator and gradient tracker on the i-th worker node in iteration t. z(i) t is the temporary model parameter that is awaiting communication. xt, vt and yt are corresponding mean values over all nodes. Counter esc(i) counts the number of iterations in the

Algorithm 1 Perturbed Decentralized STORM Algorithm (PEDESTAL)

Input: initial value x(i) 0 = x0, v(i) 1 = 0, y(i) 1 = 0, esc(i) = 1. Parameter: b0, b1, η, β, r, Cv, Cd, CT .

1: On i-th node: 2: for t = 0, 1, . . . , T 1 do 3: if t = 0 then 4: Compute v(i) 0 = Fi(x0, ξ(i) 0 ) with |ξ(i) 0 | = b0. 5: else 6: Compute v(i) t = Fi(x(i) t , ξ(i) t ) + (1 β)(v(i) t 1 Fi(x(i) t 1, ξ(i) t )) with |ξ(i) t | = b1. 7: end if 8: Communicate and update the gradient tracker: y(i) t = Pn j=1 wij(y(j) t 1 + v(j) t v(j) t 1).

9: if esc(i) = 1 and y(i) t Cv then

10: Draw a perturbation ξ B0(r) and update z(i) t = x(i) t + ξ.

11: Save x(i) t as x(i) and set esc(i) = 0. 12: else 13: Update z(i) t = x(i) t ηy(i) t . 14: end if 15: Communicate and update the model parameter: x(i) t+1 = Pn j=1 wijz(j) t .

16: if esc(i) 0 then 17: Reset esc(i) = 1 if x(i) t+1 x(i) > Cd else update esc(i) = esc(i) + 1. 18: end if 19: end for Return: xt CT if there are at least n

10 nodes satisfying esc(i) CT .

current escaping phase on the i-th worker node, which is also the indicator of current phase. When it runs the descent phase on the i-th worker node esc(i) is set to 1; otherwise esc(i) 0.

In the first iteration, the gradient estimator is computed based on a large batch size with b0. Beginning from the second iteration, the gradient estimator v(i) t is calculated by small mini-batch of samples according to the update rule of STORM, which can be formulated by line 6 in Algorithm 1 where β is a hyperparameter of STORM algorithm. Notation Fi(x(i) t , ξ(i) t ) represents the stochastic gradient obtained from a batch of samples ξ(i) t , which can be written as Fi(x(i) t , ξ(i) t ) = (1/|ξ(i) t |) P

j ξ(i) t Fi(x(i) t , j).

After calculating v(i) t , each worker node communicates with its neighbors and update the gradient tracker y(i) t . Inspired by the framework of Perturbed Gradient Descent, our PEDESTAL method also consists of two phases, the descent phase and the escaping phase. If worker node i is in the descent phase and the norm y(i) t is smaller than the given threshold Cv, then it will draw a perturbation ξ uniformly from B0(r) and update z(i) t = x(i) t + ξ. The phase is switched to escaping phase and esc(i) is set to 0. Anchor x(i) = x(i) t is saved and will be used to discriminate whether the escaping phase is broken. After this iteration counter esc(i) will be added by 1 in each following iteration until the moving distance from the anchor on i-th worker node (i.e., x(i) t x(i) ) is larger than threshold Cd for some t, which breaks the escaping phase and turn back to descent phase. If the condition of drawing perturbation is not satisfied, z(i) t is updated by z(i) t = x(i) t ηy(i) t no matter which phase is running currently.

If the i-th worker node s counter esc(i) is larger than the threshold CT , it indicates that xt CT is a candidate second-order stationary point. When at least n

10 nodes satisfy the condition esc(i) CT , the algorithm is terminated. Notice that the fraction is set to 1 10 for convenience in the convergence analysis. Our algorithm also works for other constant fractions. From Algorithm 1 we can see the decision of changing phases on each node only depends on its own status, which is adaptive and independent. Coordinating protocol including broadcast or aggregation is not required.

3.2 Discussion Here we will discuss the insight of the algorithm design and compare the differences between our method and related works. Some novel improvements are the key to the questions in Section 1.

3.2.1 Perturbed Gradient Descent or Negative Curvature Descent

Perturbed gradient descent and negative curvature descent are two of the most widely used pure first-order methods to find second-order stationary points. In PEDESTAL algorithm, we adopt the strategy of perturbed gradient descent rather than negative curvature descent because of the following reasons. First, negative curvature descent methods such as Neon (Xu et al. [2018]) and Neon2 (Allen-Zhu and Li [2018]) involves a nested loop to execute the negative curvature subroutine to recognize if a first-order stationary point is a local minimum. However, in decentralized setting, it is possible that the gradient norms on some nodes are smaller than the threshold while others are not. Therefore, some nodes will execute the negative curvature subroutine but its neighbors may not. In this case neighbor nodes need to wait for the nodes running negative curvature subroutines and there will be idle time on neighbor nodes. Besides, the analysis of negative curvature descent methods rely on the precision of the negative curvature direction. It is unknown if the theoretical results are still effective when only a fraction of nodes participate in the computation of negative curvature direction while the others use the gradient. In contrast, perturbed gradient descent only requires a simple operation of drawing perturbation, which is more suitable for decentralized algorithms.

3.2.2 Stepsize and Batchsize

In Pullback, a dynamic stepsize ηt = η/ vt in the descent phase where η = O(ϵ) and vt is the gradient estimator. This stepsize is originated from SPIDER (Fang et al. [2018]) which ensures its convergence by bounding the update distance xt+1 xt . In the escaping phase, Pullback adopts a larger stepsize of O(1) in the escaping phase and a special pullback stepsize in the last iteration, which is the key to improve the gradient complexity. Different from Pullback, in Algorithm 1 we adopt a consistent stepsize such that it keeps invariant even if phase changes and all nodes always use the same stepsize. If there is no perturbation in iteration t, we have xt+1 = xt η vt, which is important to the convergence analysis. We discard the strategy in Pullback for two reasons. First, the gradient normalization will probably cause divergence issues in decentralized optimization because in centralized algorithm the gradient direction is vt/ vt , which is equivalent to vc = Pn i=1 v(i) t / Pn i=1 v(i) t . However, in decentralized algorithm the average of v(i) t is not available on local nodes. If the gradient normalization is done locally, we will get vd = Pn i=1 v(i) t / v(i) t , which is different to vc and the error is hard to be estimated. Actually, both D-GET and D-SPIDER-SFO adopt the constant stepsize in Spider Boost (Wang et al. [2019]) to avoid performing the gradient normalization step. SPIDER needs the gradient normalization because xt+1 xt is required to be small in the proof, while Spider Boost improves the proof to bound xt+1 xt by η vt which is canceled eventually. In our analysis we also adopt the strategy in Spider Boost. Second, in our algorithm the changing of phase is occurred independently on each node. The phase-wise stepsize and pullback strategy will lead to different stepsizes among all nodes in one iteration, which will also cause potential convergence issues.

In (Chen et al. [2022]), two versions of Pullback are proposed, i.e., Pullback-SPIDER and Pullback STORM using SPIDER and STORM as the gradient estimator respectively. As introduced previously, one of the advantages of STORM is avoiding large batchsize. Nonetheless, Pullback-STORM adopt a large batchsize of O(ϵ 1) in each iteration, which violates the original intention of STORM. Besides, from Table 1 we can see all algorithms achieving second-order optimality with O(ϵ 3) gradient complexity require a large batchsize of O(ϵ 1). Therefore, we propose a small batch version named PEDESTAL-S as a special case of PEDESTAL that only requires an averaged batchsize of O(1).

3.2.3 Conditions of Termination

As a result of applying gradient tracking, we can bound 1

n Pn i=1 y(i) t yt 2 by O(ϵ2). Even though we have such an estimation, it is still possible that the norm y(i) t is as large as O( nϵ) on some nodes when the entire decentralized network has already achieved optimality. Therefore, waiting for all nodes to reach second-order stationary point is not an efficient strategy. This is the reason why we terminate our algorithm when only a fraction of worker nodes satisfy esc(i) CT .

In SSRGD and Pullback, there is an upper bound of iteration numbers in the escaping phase. If the escaping phase is not broken in this number of iterations then the candidate point is regarded as a second-order stationary point. If the escaping phase is broken, then the averaged moving distance is larger than a threshold and the loss function will be reduced by O(ϵ2) on average. This strategy guarantees that the algorithm will terminate with a certain gradient complexity. However, in our algorithm worker nodes do not enter escaping phase simultaneously and thus we do not set such an upper bound. In this case the averaged moving distance cannot be lower bounded as CT has no upper bound. Fortunately, we can complete our analysis by a different novel framework (see the proof outline in the Appendix). An alternative solution is to stop the update on the node that has run certain number of iterations in the escaping phase while the algorithm will continue. But that solution is also challenging since the relation between the first-achieved local optimal solution and the final global optimal solution is unknown and the analysis is non-trivial.

One remaining issue of the current termination strategy is that it involves the global knowledge of how many worker nodes satisfying the termination condition. One solution is to run an additional process to track this global value. The cost of transmitting Boolean values is much less expensive than broadcasting the model. Another solution is to set a maximum iteration in practice. Generally we need to evaluate the model after certain epochs to see if the training process is running smoothly and we can save a checkpoint when we find a better evaluation result. The theoretical analysis ensures that an optimal solution can be visited if the number of iterations is as large as O(ϵ 3).

3.2.4 Small Stuck Region

The theoretical guarantee of second-order optimality in SSRGD and Pullback is mainly credit to the lemma of small stuck region, which states that if there are two decoupled sequences xt and x t with identical stochastic samples, xs = x s and xs+1 x s+1 = r0e1 where e1 is the eigenvector corresponding to the smallest eigenvalue, then it satisfies max{ xt xs , x t x s } Cd for some s t s + CT with high probability. In SSRGD and Pullback, the averaged moving distance 1 t s Pt τ=s+1 xτ+1 xτ 2 is used as the criterion to discriminate saddle point because the small stuck region lemma can be applied in this way. However, in decentralized algorithm some nodes enter the escaping phase before the candidate point xs is achieved. Suppose node i enters escaping phase in iteration s , then the averaged moving distance starting from iteration s on node i cannot be well estimated because the condition of not breaking escaping phase on node i only guarantees the bound of averaged moving distance starting from s . Therefore, in our method we use the total moving distance x(i) t x(i) s as the criterion because we can obtain estimation x(i) t x(i) s 2Cd given x(i) t x(i) s Cd and x(i) s x(i) s Cd. And we can further complete our analysis by the small stuck region lemma. Actually we do not require more memory because x is the point to return in SSRGD and Pullback (hence should be saved). In practice, we can also return x(i) for any node i drawing perturbation in iteration t CT since x(i) t xt can be well bounded. Besides, we discover that the consensus error 1

n Pn i=1 x(i) t xt 2 results in an extra term when proving the small stuck region lemma, which becomes another challenge. If the consensus error is not under control, it can drive x away from xs or push x toward xs, no matter what f(x) is. In this manner, the stuck region cannot be estimated. In this work, we provide the corresponding proof to estimate this new term exclusively occurred in decentralized setting in our convergence analysis.

4 Convergence Analysis

4.1 Assumptions

In this section we will provide the main theorem of our convergence analysis. First we will introduce the assumptions used in this paper. All assumptions used in this paper are mild and commonly used in the analysis of related works.

Assumption 1. (Lower Bound) The objective f is lower bounded, i.e., infx f(x) = f > .

Assumption 2. (Bounded Variance) The stochastic gradient of each local loss function is an unbiased estimator and has bounded variance, i.e., for any i {1, 2, , n} we have

Eξ Fi(x, ξ) = fi(x), Eξ Fi(x, ξ) fi(x) 2 σ2 (2)

Assumption 3. (Lipschitz Gradient) For all ξ and i {1, 2, , n}, Fi(x, ξ) has Lipschitz gradient, i.e., for any x1 and x2 we have Fi(x1, ξ) Fi(x2, ξ) L x1 x2 with a constant L. Assumption 4. (Lipschitz Hessian) For all ξ and i {1, 2, , n}, Fi(x, ξ) has Lipschitz hessian, i.e., for any x1 and x2 we have 2Fi(x1, ξ) 2Fi(x2, ξ) ρ x1 x2 with a constant ρ.

Assumption 1, Assumption 2 and Assumption 3 are common assumptions used in the analysis of stochastic optimization algorithms. Assumption 4 is the standard assumption to find second-order optimality, which is used in all algorithms that achieves second-order stationary point in Table 1. Assumption 5. (Spectral Gap) The decentralized network is connected by a doubly-stochastic weight matrix W Rn n satisfying W1n = W T 1n = 1n and λ := W J [0, 1).

Here J is a n n matrix with all elements equal to 1

n. W is the weight matrix of the decentralized network where wij > 0 if node i and node j are connected, otherwise wij = 0. denotes the spectral norm of matrix (i.e., largest singular value). Notice that λ is a connectivity measurement of the network graph and it is also the second largest singular value of W. We do not assume W to be symmetric and hence the communication network can be both directed graph and undirected graph. The spectral gap assumption is also used commonly in the analysis of decentralized algorithms.

4.2 Main Theorems

Let ϵH = ϵα. When α 0.5, we have the following Theorem 1. Theorem 1. Assume α 0.5 and Assumption 1 to 5 are satisfied. Let θ = min{ 3 5α

2 , 1}. We set η = Θ( ϵθ

L ), β = Θ(ϵ1+θ), b0 = Θ(ϵ 2), b1 = Θ(max{ϵ2 θ 5α, 1}), r = Θ(ϵ1+θ), Cv = Θ(ϵ), CT = Θ(ϵ θ α) and Cd = Θ(ϵ1 α). Then our PEDESTAL algorithm will achieve O(ϵ, ϵH)-secondorder stationary point with O(ϵ 3) gradient complexity.

The specific constants hidden in Θ( ) will be shown in Appendix B, where the proof outline and the completed proof of Theorem 1 can also be found. From Theorem 1 we can see our PEDESTAL-S with b1 = O(1) can achieve O(ϵ, ϵH)-second-order stationary point with O(ϵ 3) gradient complexity for ϵH ϵ0.2. In the classic setting, our PEDESTAL achieves second-order stationary point with O(ϵ 3) gradient complexity. When α > 0.5, i.e., ϵH < ϵ, we have the following Theorem 2. Since the parameter settings are different and the O(1) batchsize is only available in Theorem 1, we separate these two theorems. The proof of Theorem 2 can be found in Appendix D. Theorem 2. When ϵH < ϵ (i.e., α > 0.5), we set η = Θ(ϵθ), β = Θ(ϵ1+θ), b0 = Θ(ϵ 1), b1 = Θ(ϵ max{4α 1 θ,θ+α}), r = Θ(ϵ1+θ), Cv = Θ(ϵ), CT = Θ(ϵ θ α) and Cd = Θ(ϵα) where θ = min{ 3α 1

2 , 3α 2}. Under Assumption 1 to 5, our PEDESTAL algorithm will achieve O(ϵ, ϵH)-second-order stationary point with O(ϵϵ 8 H + ϵ4ϵ 11 H ) gradient complexity.

5 Experiments

In this section we will demonstrate our experimental results to validate the performance of our method. We conduct two tasks in our experiment, a matrix sensing task on synthetic dataset and a matrix factorization task on real-world dataset. Both of these two tasks are non-spurious local minimum problems (Ge et al. [2017, 2016]), which means all local minima are global minima. Thus, we conclude an algorithm is stuck at saddle point if the loss function value does not achieve the global minimum. The source code is available in https://github.com/WH-XIAN/PEDESTAL.

5.1 Matrix Sensing

We follow the experimental setup of (Chen et al. [2022]) to solve a decentralized matrix sensing problem. The goal of this task is to recover a low-rank d d symmetric matrix M = U (U )T

where U Rd r for some small r. We set the number of worker nodes to n = 20. We generate a synthetic dataset with N sensing matrices {Ai}N i=1 and N corresponding observations bi = Ai, M . Here the inner product X, Y of two matrices X and Y is defined by the trace tr(XT Y ). The decentralized optimization problem can be formulated by

i=1 Li(U), where Li(U) = 1

j=1 ( Aij, UU T bij)2 , (3)

(a) d = 50, ring

(b) d = 50, toroidal

(c) d = 50, exponential

(d) d = 100, ring

(e) d = 100, toroidal

(f) d = 100, exponential

Figure 1: Experimental results of the decentralized matrix sensing task on different network topology for d = 50 and d = 100. Data is assigned to worker nodes by random distribution. The y-axis is the loss function value and the x-axis is the number of gradient oracles divided by the number of data N.

(a) ring, random

(b) toroidal, random

(c) exponential, random

(d) ring, Dirichlet

(e) toroidal, Dirichlet

(f) exponential, Dirichlet

Figure 2: Experimental results of the decentralized matrix factorization task on different network topology on Movie Lens-100k. The y-axis is the loss function value and the x-axis is the number of gradient oracles divided by the size of matrix N l.

where Ni is the amount of data assigned to worker node i.

The number of rows of matrix U is set to d = 50 and d = 100 respectively and the number of columns is set to r = 3. The ground truth low-rank matrix M equals U (U )T where each entry of U is generated by Gaussian distribution N(0, 1/d) independently. We randomly generate N = 20 n d samples of sensing matrices {Ai}N i=1, Ai Rd d from standard Gaussian distribution and calculate the corresponding labels bi = Ai, M . We consider two different types of data distribution, random distribution and Dirichlet distribution Dir20(0.3) to assign data to each worker node. We conduct experiments on three different types of network topology, i.e., ring topology, toroidal

topology (2-dimensional ring) and undirected exponential graph. The initial value of U is set to [u0, 0, 0] where u0 is yield from Gaussian distribution and multiplied by a scalar such that it satisfies u0 max eig(M ). We compare our PEDESTAL algorithm to decentralized baselines including D-PSGD, GTDSGD, D-GET, D-SPIDER-SFO and GTHSGD. In this experiment, the learning rate is chosen from {0.01, 0.001, 0.0001}. The batchsize is set to 10. For PEDESTAL and GTHSGD, parameter β is set to 0.01. For D-GET and D-SPIDER-SFO, the period q is 100. For PEDESTAL, threshold Cv is set to 0.0001. Perturbation radius r is set to 0.001. The threshold of moving distance Cd is set to 0.01. The experimental results are shown in Figure 1. Due to the space limit, we only show the result of random data distribution in the main manuscript and leave the result of Dirichlet distribution to Appendix A.

From the experimental result we can see all baselines are stuck at the saddle point and cannot escape it effectively. In contrast, our PEDESTAL reaches and escapes saddle points and finally find the local minimum. We also calculate the smallest eigenvalue of hessian matrix for each algorithm at the converged optimal point, which is left to the Supplementary Material because of space limit. According to the eigenvalue result, we can see the smallest eigenvalue is much closer to 0 than all baselines. Therefore, our experiment verifies that our PEDESTAL achieves the best performance to escape saddle point and find local minimum.

5.2 Matrix Factorization

The second task in our experiment is matrix factorization, which aims to approximate a given matrix M RN l by a low-rank matrix that can be decomposed to the product of two matrices U RN r and V Rl r for some small r. The optimization problem can be formulated by

min U RN r,V Rl r M UV T 2 F :=

j=1 (Mij (UV T )ij)2 (4)

where F denotes the Frobenius norm and subscript ij refers to the element at i-th row and j-th column. In our experiment we solve this problem on the Movie Lens-100k dataset (Harper and Konstan [2015]). Movie Lens-100k contains 100,000 ratings of 1682 movies provided by 943 users. Each rating is in the interval [0, 5] and scaled to [0, 1] in the experiment. This task can be regarded as an association task to predict users potential ratings for unseen movies. In our experiment we set the number of worker node to n = 50. Each node is assigned the data from different group of users. Similar to the matrix sensing task, here we also use random distribution and Dirichlet distribution respectively to distribute users to worker nodes. And we also use ring topology, toroidal topology and undirected exponential graph as the communication network. The baselines are also D-PSGD, GTDSGD, D-GET, D-SPIDER-SFO and GTHSGD. In this experiment, the number of worker nodes is 50 and the rank of the matrix M is set to 25. The learning rate is chosen from {0.01, 0.001, 0.0001}. The batchsize is set to 100. For PEDESTAL and GTHSGD, parameter β is set to 0.1. For D-GET and D-SPIDER-SFO, the period q is 100. For PEDESTAL, threshold Cv is set to 0.002. Perturbation radius r is set to 0.01. The threshold of moving distance Cd is set to 0.5. The experimental results are shown in Figure 2.

From the experimental results we can see our PEDESTAL achieves the best performance to escape saddle point and find second-order stationary point. All baselines cannot escape saddle point effectively or efficiently. Particularly, variance reduced methods D-GET and D-SPIDER-SFO shows worse performance than SGD based algorithms D-PSGD and GTDSGD, which indicates that although reducing gradient noise can accelerate convergence, it also weakens the ability to escape saddle point. Therefore, our contribution is important since we make the fast convergence of variance reduction compatible with the capability to avoid saddle point.

6 Conclusion In this paper we propose a novel algorithm PEDESTAL to find local minima in nonconvex decentralized optimization. PEDESTAL is the first decentralized stochastic algorithm to achieve second-order optimality with non-asymptotic analysis. We improve the drawbacks in previous deterministic counterpart to make phase changed independently on each node and avoid consensus protocols of broadcast or aggregation. We prove that PEDESTAL can achieve O(ϵ, ϵ)-second-order stationary point with the gradient complexity of O(ϵ 3), which matches state-of-the-art results of centralized counterpart or decentralized method to find first-order stationary point. We also conduct the matrix sensing and matrix factorization tasks in our experiments to validate the performance of PEDESTAL.

Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. Advances in Neural Information Processing Systems, 31, 2018.

Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Faster perturbed stochastic gradient methods for finding local minima. In Sanjoy Dasgupta and Nika Haghtalab, editors, Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pages 176 204. PMLR, 29 Mar 01 Apr 2022.

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. Neural Information Processing Systems (Neur IPS), 2019.

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1155 1164, 10 15 Jul 2018.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points online stochastic gradient for tensor decomposition. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 797 842, Paris, France, 03 06 Jul 2015. PMLR.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pages 1233 1242. PMLR, 2017.

F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1 19, 2015.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724 1732. PMLR, 2017.

Zhize Li. Ssrgd: Simple stochastic recursive gradient descent for escaping saddle points. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Paolo Di Lorenzo and Gesualdo Scutari. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120 136, 2016. doi: 10.1109/TSIPN.2016.2524588.

Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1117 1128. Curran Associates, Inc., 2020.

Taoxing Pan, Jun Liu, and Jie Wang. D-spider-sfo: A decentralized optimization algorithm with faster convergence rate for nonconvex problems. Proceedings of the AAAI Conference on Artificial Intelligence, 34(02):1619 1626, Apr. 2020. doi: 10.1609/aaai.v34i02.5523.

Haoran Sun, Songtao Lu, and Mingyi Hong. Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9217 9228. PMLR, 13 18 Jul 2020.

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. d2: Decentralized training over decentralized data. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4848 4856. PMLR, 10 15 Jul 2018.

Isidoros Tziotis, Constantine Caramanis, and Aryan Mokhtari. Second order optimality in decentralized non-convex optimization via perturbed gradient tracking. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21162 21173. Curran Associates, Inc., 2020.

Stefan Vlaski and Ali H Sayed. Second-order guarantees in centralized, federated and decentralized nonconvex optimization. ar Xiv preprint ar Xiv:2003.14366, 2020.

Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Ran Xin, Usman Khan, and Soummya Kar. A hybrid variance-reduced method for decentralized stochastic non-convex optimization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11459 11469. PMLR, 18 24 Jul 2021a.

Ran Xin, Usman A. Khan, and Soummya Kar. An improved convergence analysis for decentralized online stochastic non-convex optimization. IEEE Transactions on Signal Processing, 69:1842 1858, 2021b. doi: 10.1109/TSP.2021.3062553.

Jinming Xu, Shanying Zhu, Yeng Chai Soh, and Lihua Xie. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 2055 2060, 2015. doi: 10.1109/CDC.2015.7402509.

Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. Advances in neural information processing systems, 31, 2018.

A Additional Experimental Results

The experimental results of Dirichlet distribution of the matrix sensing task is shown in Figure 3. The smallest eigenvalue at the converged point for each algorithm is shown in Table 2 and Table 3.

(a) d = 50, ring

(b) d = 50, toroidal

(c) d = 50, exponential

(d) d = 100, ring

(e) d = 100, toroidal

(f) d = 100, exponential

Figure 3: Experimental results of the decentralized matrix sensing task on different network topology for d = 50 and d = 100. Data is assigned to worker nodes by Dirichlet distribution. The y-axis is the loss function value and the x-axis is the number of gradient oracles divided by the number of data N.

D-PSGD GTDSGD D-GET D-SPIDER-SFO GTHSGD PEDESTAL d = 50, ring -0.0332 -0.0327 -0.0333 -0.0328 -0.0329 1.78e 5

d = 50, toroidal -0.0331 -0.0334 -0.0334 -0.0327 -0.0329 4.18e 5

d = 50, exponential -0.0323 -0.0330 -0.0331 -0.0332 -0.0333 1.09e 6

d = 100, ring -0.0184 -0.0184 -0.0184 -0.0184 -0.0185 2.07e 6

d = 100, toroidal -0.0185 -0.0186 -0.0185 -0.0184 -0.0184 2.25e 7

d = 100, exponential -0.0184 -0.0184 -0.0186 -0.0184 -0.0184 3.07e 5

Table 2: Smallest eigenvalue of hessian matrix at the converged point (random data distribution).

D-PSGD GTDSGD D-GET D-SPIDER-SFO GTHSGD PEDESTAL d = 50, ring -0.0332 -0.0337 -0.0332 -0.0325 -0.0330 3.60e 6

d = 50, toroidal -0.0334 -0.0324 -0.0329 -0.0325 -0.0327 2.29e 5

d = 50, exponential -0.0334 -0.0326 -0.0333 -0.0330 -0.0328 3.97e 5

d = 100, ring -0.0184 -0.0184 -0.0184 -0.0185 -0.0183 4.48e 5

d = 100, toroidal -0.0184 -0.0184 -0.0184 -0.0184 -0.0185 1.24e 5

d = 100, exponential -0.0186 -0.0185 -0.0186 -0.0183 -0.0185 3.63e 6

Table 3: Smallest eigenvalue of hessian matrix at the converged point (Dirichlet data distribution).

B Proof of Theorem 1

B.1 Notation

We define matrix Xt = [x(1) t , , x(n) t ] Rd n where x(i) t is the model parameter on i-th worker node with dimension d and n is the number of worker nodes. Similarly we have Yt = [y(1) t , , y(n) t ],

Zt = [z(1) t , , z(n) t ] and Vt = [v(1) t , , v(n) t ]. Let ωt = xt+1 xt 2 and Ωt = Zt Xt. Define pt = nt/n where nt is the number of worker nodes drawing perturbation in iteration t.

B.2 Outline

In this section we will provide the proof outline of Theorem 1. First, we prove some basic lemmas to estimate gradient noise and consensus error, which will be used frequently in later proof. The gradient noise is estimated by Lemma 1, the proof of which can be found in Section C.1. The consensus error is estimated by Lemma 2, the proof of which can be found in Section C.2. Lemma 1. (Gradient Noise) Under Assumption 2 and Assumption 3 we have

i=1 v(i) t fi(x(i) t ) 2 16 log(4/δ)βσ2

b1 + 384 log(4/δ)L2

t=0 Xt Xt 2 F

+ 192 log(4/δ)L2

t=0 ωt + 2 log(4/δ)σ2

i=1 fi(x(i) t ) 2 16 log(4/δ)βσ2

nb1 + 384 log(4/δ)L2

t=1 Xt Xt 2 F

+ 192 log(4/δ)L2

t=0 ωt + 2 log(4/δ)σ2

Lemma 2. (Consensus Error) Let η (1 λ)2ϵθ

600 log(4/δ)λ2L, β = C 1 1 ϵ1+θ and b1 C1ϵ 1+θ where C1 1 is a constant. Under Assumption 2, 3 and 5 we have

t=1 Xt Xt 2 F 160000n log(4/δ)L2η2λ4

(1 λ)4 min{b1β, 1}T

t=0 ωt + 12288n log(4/δ)βη2λ4σ2

+ 2000n log(4/δ)η2λ4σ2

(1 λ)4βb0T + 128λ4η2

i=1 fi(x0) 2 +

64nλ2pt(η2C2 v + r2) (1 λ)2T

t=0 Yt Yt 2 F 4644 log(4/δ)n L2λ2

(1 λ) min{b1β, 1}T

t=0 ωt + 384 log(4/δ)nλ2βσ2

+ 50 log(4/δ)nλ2σ2

(1 λ)βb0T + 8λ2

i=1 fi(x0) 2+

150000 log(4/δ)n L2λ4pt(η2C2 v +r2) (1 λ)3 min{b1β, 1}T

Next we will prove that PEDESTAL will terminate in certain number of iterations. Under Assumption 2, 3 and 5, we can prove the following Lemma 3. The proof is demonstrated in Section C.3.

Lemma 3. (Descent) Let η (1 λ)2ϵθ

600 log(4/δ)λ2L, β = C 1 1 ϵ1+θ, b1 C1ϵ 1+θ and b0 = C1ϵ 1 where C1 1 is a constant. Under Assumption 2, 3 and 5 we have

f( x T ) f(x0) + σ2

i=1 fi(x0) 2

Dt = 1 16η ωt + (1 λ)2

i=1 x(i) t+1 x(i) t 2 + η

i=1 y(i) t 2 200ηϵ2σ2

(1 λ)2C2 1 7pt(η2C2 v + r2) 4η

Here we call Dt the descent of iteration t. We categorize all iterations into three types:

type-A: pt 1

5, type-B: pt < 1

i=1 y(i) t 2 4C2 v 5 , type-C: otherwise

When at least n

5 nodes drawing perturbation in iteration t, then it is type-A. There are two cases where pt is small: most nodes in the descent phase or most nodes in the escaping phase. An iteration is type-B if pt < 1

n Pn i=1 y(i) t 2 4C2 v 5 , which represents the case where most nodes are in the descent phase. And type-C iteration represents the case where most nodes are in the escaping phase. Next we will estimate type-A and type-C iteration with the following Lemma 4.

Lemma 4. Let η (1 λ)2ϵθ

600 log(4/δ)λ2L, β = C 1 1 ϵ1+θ, b1 C1ϵ 1+θ, b0 = C1ϵ 1, Cd = C2ηCT ϵ,

Cv = (1 λ)C2ϵ

200 and r ηCv/4 where C1 = 20000σ (1 λ)2C2 and C2 is a constant. Under Assumption 2, 3 and 5, we can find disjoint intervals I = I1 Ik such that the indexes of all type-A and type-C iterations except the last CT iterations are contained in I and the descent over I can be estimated by

t I Dt |I| (1 λ)2C2 2ηϵ2

where | | denotes the total number of the set.

Besides, for all type-B iteration t, we have the following estimation

Lemma 5. Let parameter and assumption settings be the same as Lemma 4, then for all type-B iteration t we have

Dt (1 λ)2C2 2ηϵ2

With Lemma 4, Lemma 5 and Assumption 1, we can conclude that PEDESTAL will terminate in O(ϵ 2 θ) + CT iterations. As the last two negative terms in Dt are canceled by 1

n Pn i=1 x(i) t+1 x(i) t 2 and 1

n Pn i=1 y(i) t 2 respectively in Lemma 4 and Lemma 5, we have 1

η PT 1 t=0 ωt O(1). Hence by Lemma 2 we know the consensus error 1

n Xt Xt 2 F can be bounded by O(ϵ1+θ) on average. Besides, from the parameter setting we can see Cv is Θ(ϵ), which ensures the first-order optimality of the decentralized algorithm.

Finally, we will prove PEDESTAL is able to achieve second-order stationary point. First, we will give the small stuck region lemma in decentralized setting. Recall that ϵH = ϵα is the tolerance of second-order stationary point. The proof is in Section C.6.

Lemma 6. (Small Stuck Region) Suppose ns worker nodes draw perturbation in iteration s and γ = min eig( 2f( xs)) ϵH. Let η (1 λ)2ϵθ

1000 n log(CT ) log(4/δ)λ2L, β = C 1 1 ϵ1+θ, b1

1000C1ϵ2 θ 5α, Cd = C2ηCT ϵµ and CT = log(12n Cd/r0)/(ηγ) where C1 = 20000 (1 λ)2C2 , C2

1 λ 2000 log(4/δ)ρ log(Cd) and µ = max{1, 2α}. Let Xt and X t be two coupled decentralized sequences

by running PEDESTAL from Xs with Xs = X s, x(i) s+1 = x(i)

s+1 if node i does not draw perturbation

in iteration s and x(i) s+1 = x(i)

s+1 + r0e1 otherwise. Here e1 is the eigenvector with respect to the

smallest eigenvalue γ. Define di = maxs t s+CT { x(i) t x(i) s , x(i)

t x(i) s }. Then there are at least 9n

10 nodes such that di 2Cd.

In decentralized small stuck region lemma, the consensus error will lead to a new term (see Eq. (34)) and make the proof more complicated. In our proof, we use the condition of ϵH ϵ, i.e., α 1. For smaller ϵH the batchsize b1 is required to set larger. With Lemma 6, we can prove that when PEDESTAL is terminated, it finds a local minimum with high probability.

Lemma 7. Let r0 = δr/

d where d is the dimension of model parameter. Other parameters are the same as Lemma 6. Suppose PEDESTAL is terminated in iteration s + CT . Then xs is a second-order stationary point with probability at least 1 δ.

Lemma 7 provides the guarantee of second-order optimality of PEDESTAL. When ϵH ϵ, i.e., α 0.5 (including the classic setting ϵH = ϵ), the parameter settings of all lemmas are consistent and the main theorem is proven. The total gradient complexity is

O(ϵ 2 θ ϵ 1+θ) = O(ϵ 3)

When α = 0.5, we have θ = 0.25 and b1 = Θ(ϵ 0.75). When α 0.2, we can set θ = 1 and b1 = O(1), which is result of PEDESTAL-S. In Section D we will provide the analysis of the case α > 0.5 with a different parameter setting of θ and b1. We can achieve the gradient complexity of

O(ϵ 3 + ϵϵ 8 H + ϵ4ϵ 11 H ) (5)

over all cases of ϵH.

C Proof of Lemmas

C.1 Proof of Lemma 1

Proof. According to the definition of v(i) t , we have

v(i) t+1 fi(x(i) t+1) (1 β)t+1 v(i) t fi(x(i) t ) (1 β)t = β( Fi(x(i) t+1, ξ(i) t+1) fi(x(i) t+1)) (1 β)t+1

+ ( Fi(x(i) t+1, ξ(i) t+1) fi(x(i) t+1)) ( Fi(x(i) t , ξ(i) t+1) fi(x(i) t )) (1 β)t (6)

where |ξ(i) t+1| = b1. The expectation of the right side of Eq. (6) over ξ(i) t+1 is 0. Using Cauchy-Schwartz inequality, Assumption 2 and Assumption 3 we have

β( Fi(x(i) t+1, j) fi(x(i) t+1)) (1 β)t+1 + ( Fi(x(i) t+1, j) fi(x(i) t+1)) ( Fi(x(i) t , j) fi(x(i) t )) (1 β)t 2

(1 β)2t+2 + 8L2 x(i) t+1 x(i) t 2

(1 β)2t (7)

for each j ξ(i) t+1. Thus, applying Azuma-Hoeffding inequality to Eq. (6) we can obtain

v(i) t fi(x(i) t ) (1 β)t(v(i) 0 fi(x0)) 2

b1 (2βσ2 + 8L2 t 1 X

s=0 (1 β)2(t s) x(i) s+1 x(i) s 2) (8)

with probability 1 δ. Here we use the fact that P+ s=0(1 β)s = 1

β . Using Cauchy-Schwartz inequality to Eq. (8) we have

v(i) t fi(x(i) t ) 2 16 log(4/δ)

b1 (βσ2 + 4L2 t 1 X

s=0 (1 β)2(t s) x(i) s+1 x(i) s 2)

+ 2(1 β)2t v(i) 0 fi(x0)) 2 (9)

Sum Eq. (9), we obtain

1 log(4/δ)n T

i=1 v(i) t fi(x(i) t ) 2

t=0 Xt+1 Xt 2 F + 2σ2

t=0 Xt Xt 2 F + 192L2

t=0 ωt + 2σ2

which finishes the proof of (a) in Lemma 1. In the first inequality of Eq. (10) we apply Azuma Hoeffding inequality to v(i) 0 fi(x0). In the second inequality we apply Cauchy-Schwartz inequality and use the fact x(i) t+1 x(i) t = (x(i) t+1 xt+1) (x(i) t xt) + ( xt+1 xt). Mimic above steps and we can achieve the inequality (b) in Lemma 1. The term n in the denominator is derived by the fact that ξ(i) t s on different nodes are independent.

C.2 Proof of Lemma 2

Proof. As Yt = W(Yt 1 + Vt Vt 1), we have

Yt Yt 2 F = (W J)(Yt 1 Yt 1) + (W J)(Vt Vt 1) 2 F λ2 Yt 1 Yt 1 2 F + 2 (W J)Yt, (W J)(Vt Vt 1) + λ2 Vt Vt 1 2 F

2 Yt 1 Yt 1 2 F + λ2 + λ4

1 λ2 Vt Vt 1 2 F

2 Yt 1 Yt 1 2 F + 3λ2(1 + λ2)

i=1 ( v(i) t fi(x(i) t ) 2 + v(i) t 1 fi(x(i) t 1) 2)

+ 9L2λ2(1 + λ2)

1 λ2 ( Xt Xt 2 F + Xt 1 Xt 1 2 F + nωt 1) (11)

where the first inequality is derived by Assumption 5, the second inequality is derived by Young s inequality and the last inequality is derived by Cauchy-Schwartz inequality and Assumption 3. When t = 0, by Azuma-Hoeffding inequality we can get

Y0 Y0 2 F 2λ2 n X

i=1 fi(x0) 2 + 8 log(4/δ)nλ2σ2

with probability 1 δ. As Xt+1 = W(Xt + Ωt), by Assumption 5 and Young s inequality we have

Xt+1 Xt+1 2 F

2 Xt Xt 2 F + 2λ2

1 λ2 Ωt Ωt 2 F

2 Xt Xt 2 F + 4η2λ2

1 λ2 Yt Yt 2 F + 4λ2

1 λ2 Ωt Ωt η(Yt Yt) 2 F

2 Xt Xt 2 F + 4η2λ2

1 λ2 Yt Yt 2 F + 8nλ2pt(η2C2 v + r2) 1 λ2 (13)

where the second inequality is obtained by Cauchy-Schwartz inequality and the last inequality is because when node i draws perturbation it must satisfy y(i) t Cv. Note that X0 = X0. Sum Eq. (13), we have

t=1 Xt Xt 2 F

t=0 Yt Yt 2 F + 16nλ2(η2C2 v + r2)T (1 λ2)2

288L2η2λ4(1 + λ2)

t=0 Xt Xt 2 F + 96η2λ4(1 + λ2)

i=1 v(i) t fi(x(i) t ) 2

+ 144n L2η2λ4(1 + λ2)

t=0 ωt + 16λ2η2

(1 λ2)3 Y0 Y0 2 F +

16nλ2pt(η2C2 v + r2) (1 λ2)2 (14)

where the last inequality comes from Eq. (11). When η (1 λ)2

40λ2L we have 288L2η2λ4(1+λ2)

t=1 Xt Xt 2 F

192η2λ4(1 + λ2)

i=1 v(i) t fi(x(i) t ) 2 + 288n L2η2λ4(1 + λ2)

i=1 fi(x0) 2 + 256 log(4/δ)nλ4η2σ2

(1 λ2)3b0T +

32nλ2pt(η2C2 v + r2) (1 λ2)2T

73728 log(4/δ)L2η2λ4(1 + λ2)

(1 λ2)4b1βT

t=0 Xt Xt 2 F + 24576n log(4/δ)L2η2λ4(1 + λ2)

(1 λ2)4b1βT

+ n log(4/δ)η2λ4(1 + λ2)σ2

(1 λ2)4 (3072β

βb0T ) + 288n L2η2λ4(1 + λ2)

i=1 fi(x0) 2 + 256 log(4/δ)nλ4η2σ2

(1 λ2)3b0T +

32nλ2pt(η2C2 v + r2) (1 λ2)2T (15)

where the last inequality is achieved by Lemma 1. According to the parameter setting, we have 73728 log(4/δ)L2η2λ4(1 + λ2)

(1 λ2)4b1β 1

2 Therefore, we have

t=1 Xt Xt 2 F

160000n log(4/δ)L2η2λ4

(1 λ)4 min{b1β, 1}T

t=0 ωt + 12288n log(4/δ)βη2λ4σ2

(1 λ)4b1 + 2000n log(4/δ)η2λ4σ2

i=1 fi(x0) 2 +

64nλ2pt(η2C2 v + r2) (1 λ)2T (16)

where we have used the condition λ 1 to simplify the inequality. Moreover, sum Eq. (11) and we can achieve

t=0 Yt Yt 2 F

i=1 v(i) t fi(x(i) t ) 2 + 36L2λ2

t=0 Xt Xt 2 F + 18n L2λ2

+ 2 (1 λ)T Y0 Y0 2 F

(1 λ)T (1 + 128 log(4/δ)

t=0 Xt Xt 2 F + 18n L2λ2

(1 λ)T (1 + 128 log(4/δ)

+ 192 log(4/δ)nλ2βσ2

(1 λ)b1 + 25 log(4/δ)nλ2σ2

(1 λ)βb0T + 4λ2

i=1 fi(x0) 2

4644 log(4/δ)L2λ2

(1 λ) min{b1β, 1}T

t=0 Xt Xt 2 F + 2322 log(4/δ)n L2λ2

(1 λ) min{b1β, 1}T

+ 192 log(4/δ)nλ2βσ2

(1 λ)b1 + 25 log(4/δ)nλ2σ2

(1 λ)βb0T + 4λ2

i=1 fi(x0) 2

37152 log(4/δ)L2η2λ4

(1 λ)3 min{b1β, 1}T

t=0 Yt Yt 2 F + 2322 log(4/δ)n L2λ2

(1 λ) min{b1β, 1}T

+ 192 log(4/δ)nλ2βσ2

74304 log(4/δ)n L2λ4pt(η2C2 v + r2) (1 λ)3 min{b1β, 1}T + 25 log(4/δ)nλ2σ2

i=1 fi(x0) 2 (17)

where the second inequality uses Lemma 1 and Eq. (12). The last inequality uses the sum of Eq. (13). As 37152 log(4/δ)L2η2λ4

(1 λ)3 min{b1β,1} 1

t=0 Yt Yt 2 F

4644 log(4/δ)n L2λ2

(1 λ) min{b1β, 1}T

t=0 ωt + 384 log(4/δ)nλ2βσ2

(1 λ)b1 + 50 log(4/δ)nλ2σ2

i=1 fi(x0) 2 +

150000 log(4/δ)n L2λ4pt(η2C2 v + r2) (1 λ)3 min{b1β, 1}T (18)

which finishes the proof.

C.3 Proof of Lemma 3

Proof. By Assumption 3 we have

f( xt+1) f( xt) + f( xt), xt+1 xt + L

2 xt+1 xt 2

= f( xt) + f( xt), η vt + f( xt), xt+1 xt + η vt + L

2 xt+1 xt 2

2 f( xt) 2 + η

2 vt f( xt) 2 + η

2η xt+1 xt + η vt 2 1

2η xt+1 xt + η vt η f( xt) 2 + L

2 xt+1 xt 2

2 vt f( xt) 2 + 1

2η xt+1 xt + η vt 2 + Lωt

2 vt f( xt) 2 + 1

4η ωt + η vt f( xt) 2

2 vt 2 + pt(η2C2 v + r2) η + Lωt

2 + 2η vt 1

i=1 fi(x(i) t ) 2

n Xt Xt 2 F (19)

where the first inequality is obtained by Young s inequality and the last inequality is obtained by Cauchy-Schwartz inequality, Assumption 3 and the fact that perturbation is only drawn when y(i) t Cv and nt nodes draw perturbation in iteration t. Sum Eq. (19) and apply Lemma 1, we have

f( x T ) f(x0) 1

t=0 vt 2 + (1 + 768 log(4/δ)

t=0 Xt Xt 2 F

pt(η2C2 v + r2) η + 32 log(4/δ)βηTσ2

nb1 + (1 + 384 log(4/δ)Lη

+ 4 log(4/δ)ησ2

According to the update of gradient tracker, we have yt = vt. By Lemma 10 we have

i=1 x(i) t+1 x(i) t 2 = ωt + 1

n (Xt+1 Xt+1) (Xt Xt) 2 F (21)

i=1 y(i) t 2 = yt 2 + 1

n Yt Yt 2 F (22)

Divide the term vt 2 in Eq. (20) into three portions and we get f( x T )

t=0 ωt (1 λ)2

t=0 vt 2 + (1 + 768 log(4/δ)

t=0 Xt Xt 2 F

pt(η2C2 v + r2) η + 32 log(4/δ)βηTσ2

nb1 + (1 + 384 log(4/δ)Lη

t=0 Lωt + 4 log(4/δ)ησ2

t=0 ωt (1 λ)2

i=1 y(i) t 2 + η

t=0 Yt Yt 2 F

pt(η2C2 v + r2) η + (1 + 768 log(4/δ)

t=0 Xt Xt 2 F + 32 log(4/δ)βηTσ2

+ (1 + 384 log(4/δ)Lη

t=0 Lωt + 4 log(4/δ)ησ2

t=0 ωt (1 λ)2

i=1 x(i) t+1 x(i) t 2 η

i=1 y(i) t 2

+ (1 + 768 log(4/δ)

nb1β + (1 λ)2

128L2η2 )L2η

t=0 Xt Xt 2 F + η

t=0 Yt Yt 2 F

+ (1+ 384 log(4/δ)Lη

pt(η2C2 v + r2) η + 32 log(4/δ)βηTσ2

nb1 + 4 log(4/δ)ησ2

f(x0) 1 8Lη

t=0 Lωt (1 λ)2

i=1 x(i) t+1 x(i) t 2 η

i=1 y(i) t 2

t=0 Lωt + A2 Tβησ2

b1 + A3 ησ2

pt(η2C2 v + r2) η + A5 η

i=1 fi(x0) 2 (23)

In the second inequality we use Eq. (22). In the third inequality we use Eq. (21) and Cauchy-Schwartz inequality. In the last inequality we use Lemma 2 and the coefficients are

A1 = 1+ 384 log(4/δ)Lη

nb1β +(1+ 768 log(4/δ)

nb1β + (1 λ)2

128L2η2 )160000 log(4/δ)L3η3λ4

(1 λ)4 min{b1β, 1} + 774 log(4/δ)Lηλ2

A2 = 32 log(4/δ)

n + (1 + 768 log(4/δ)

nb1β + (1 λ)2

128L2η2 )12288 log(4/δ)L2η2λ4

(1 λ)4 + 64 log(4/δ)λ2

A3 = 4 log(4/δ)

n + (1 + 768 log(4/δ)

nb1β + (1 λ)2

128L2η2 )2000 log(4/δ)L2η2λ4

(1 λ)4 + 10 log(4/δ)λ2

A4 = 1 + (1 + 768 log(4/δ)

nb1β + (1 λ)2

128L2η2 )64λ2L2η2

(1 λ)2 + 25000 log(4/δ)L2η2λ4

A5 = (1 + 768 log(4/δ)

nb1β + (1 λ)2

128L2η2 )128λ4L2η2

(1 λ)3 + 2λ2

According to the parameter setting, we have A1 1 16Lη, A2 200 log(4/δ)

(1 λ)2 , A3 40 log(4/δ)

(1 λ)2 , A4 7

4 and A5 5 1 λ. Therefore, we have

f( x T ) f(x0) + 40 log(4/δ)ησ2

(1 λ)2βb0 + 5η (1 λ)n

i=1 fi(x0) 2

i=1 fi(x0) 2

t=0 Dt (24)

Dt = 1 16η ωt + (1 λ)2

i=1 x(i) t+1 x(i) t 2 + η

i=1 y(i) t 2 200ηϵ2σ2

(1 λ)2C2 1 7pt(η2C2 v + r2) 4η (25)

which reaches the conclusion.

C.4 Proof of Lemma 4

Proof. For convenience, the iteration that draws perturbation is considered to be included in the escaping phase. If an iteration belongs to type-A, i.e., pt 1

5, then at least n/5 worker nodes are in

the escaping phase. If an iteration belongs to type-C, we have 1

n Pn i=1 y(i) t 2 4C2 v 5 . Therefore, there are at least n

5 worker nodes satisfying y(i) t Cv, which also indicates that at least n

5 worker nodes are in the escaping phase. Then if iteration t is either type-A or type-C, there must be n/5 worker nodes in the escaping phase. We denote the set of these n/5 worker nodes as Et. Furthermore, if this iteration t is not one of the last CT iterations before termination, then there must exist n/10 worker nodes out of Et such that they have not met the condition esc(i) CT and will break the escaping phase before meeting the condition because of the termination criterion in Algorithm 1. We use Bt to denote these worker nodes.

For each i Bt, we have an interval [a(i) t , b(i) t ] such that t [a(i) t , b(i) t ] and node i enters escaping phase in iteration a(i) t and breaks escaping phase in iteration b(i) t . Besides, we also have

b(i) t a(i) t CT and x(i)

b(i) t x(i)

Then by Cauchy-Schwartz inequality we have

b(i) t x(i)

a(i) t 2 CT Xb(i) t

t=a(i) t x(i) t+1 x(i) t 2 (26)

Let at = mini{a(i) t } and bt = maxi{b(i) t }. It is easy to check that bt at 2CT . Next, we will perform the refining step. If t < t are two iterations that are either type-A or type-C and t [at, bt], then we make at = at and bt = bt. Let I = t[at, bt] for all type-A and type-C iterations t. Then I can be written as disjoint union of

I = I1 I2 Ik (27)

because if at at bt then [at, bt] and [at , bt ] can be merged into one interval. Now we can see for each iteration t that is either type-A or type-C and t is not one of the last CT iterations, we have t I. Next we will estimate the descent over I. Without loss of generality, we consider an interval Ij. Ij can be expressed by union J1 Jl where Jm = [atm, btm] for some tm, m = 1, , l. Because of the refining step, we have each tm is only included in interval Jm and the intersection of any three intervals in J1, , Jl is . According to Eq. (26) we have

t Jm x(i) t+1 x(i) t 2 C2 d 10CT (28)

since |Bt| n 10. Next, we will consider the intersection of Jm and Jm+1. Notice that when estimating Eq. (28) we only add the terms x(i) t+1 x(i) t 2 on nodes i Btm and in the intervals [a(i) tm, b(i) tm]. Therefore, for any node i / Btm Btm+1, the terms used to estimate Eq. (28) will not

be added repeatedly. If i Btm Btm+1, we have [a(i) tm, b(i) tm] and [a(i) tm+1, b(i) tm+1] are disjoint because

tm+1 [a(i) tm+1, b(i) tm+1] but tm+1 / [a(i) tm, b(i) tm] and a node cannot draw perturbation before breaking the escaping phase. Hence we can sum Eq. (28) over m and achieve

t Ij x(i) t+1 x(i) t 2 l C2 d 10CT (29)

Since the length of each Jm is not larger than 2CT , we have

t Ij x(i) t+1 x(i) t 2 |Ij|C2 d 20C2 T and 1

t I x(i) t+1 x(i) t 2 |I|C2 d 20C2 T (30)

Combining Eq. (30) and Lemma 3, we can estimate the descent over I by X

t I Dt |I| (1 λ)2C2 d 5120ηC2 T 200ηϵ2σ2

(1 λ)2C2 1 7(η2C2 v + r2) 4η

|I| (1 λ)2C2 2ηϵ2

according to the parameter setting.

C.5 Proof of Lemma 5

Proof. According to Lemma 3 and the definition of type-B iteration, we have

Dt ηC2 v 20 200ηϵ2

(1 λ)2C2 1 7r2

20η ηC2 v 40 200ηϵ2σ2

(1 λ)2C2 1 (1 λ)2C2 2ηϵ2

8000000 (32)

for all type-B iteration t where we have used the parameter setting.

C.6 Proof of Lemma 6

Proof. Suppose the conclusion is not true and we will find the conflict. Thus, we have the assumption that there are at least n

10 worker nodes satisfying di 2Cd. First, we define

w(i) t = x(i) t x(i)

t , wt = xt x t, H = 2f( xs), H(i) = 2fi( xs), H(i) t = 2Fi(x(i) s , ξ(i) t )

i=1 ( Fi(x(i) t , ξ(i) t ) Fi( xt, ξ(i) t )) ( Fi(x(i)

t , ξ(i) t ) Fi( x t, ξ(i) t ))

(1 β)( Fi(x(i) t 1, ξ(i) t ) Fi( xt 1, ξ(i) t )) ( Fi(x(i)

t 1, ξ(i) t ) Fi( x t 1, ξ(i) t ))

νt = vt f( xt) ( v t f( x t)) ζt

0 ( 2f( x t + θ( xt x t)) H)dθ

(i) t = Z 1

0 ( 2fi( x t + θ( xt x t)) H(i))dθ

Then we have

wt = wt 1 η( vt 1 v t 1)

= wt 1 η( f( xt 1) f( x t 1) + vt 1 f( xt 1) v t 1 + f( x t 1))

= wt 1 η h ( xt 1 x t 1) Z 1

0 2f( x t 1 + θ( xt 1 x t 1))dθ + νt 1 + ζt 1 i

= (I ηH)wt 1 η( t 1wt 1 + νt 1 + ζt 1) (33)

Here term ζt is yield from consensus error and does not exist in centralized algorithms. Applying recursion to Eq. (33), we can obtain

wt = (I ηH)t s 1ws+1 η

τ=s+1 (I ηH)t τ 1( τwτ + ντ + ζτ) (34)

Let qt = η Pt 1 τ=s+1(I ηH)t τ 1( τwτ + ντ + ζτ). We will prove

2(1 + ηγ)t s 1psr0 (35)

which leads to 1 2(1 + ηγ)t s 1psr0 wt 3

2(1 + ηγ)t s 1psr0 (36)

because (I ηH)t s 1ws+1 = (1+ηγ)t s 1psr0 according to the definition of ws+1. We define d = maxs t s+CT { xt xs , x t xs }. Since at least n

10 nodes satisfy di 2Cd, Cd = O(ϵ1 α) and the averaged consensus error is bounded by O(ϵ2(1+θ)), we have

di 3Cd and d 1

i=1 di 3Cd (37)

To achieve Eq. (35), it is sufficient to prove

τ=s+1 (1 + ηγ)t τ 1 τwτ + ντ + ζτ 1

2(1 + ηγ)t s 1psr0 (38)

b1 (1 + ηγ)t s 1Lpsr0

t s + 1 12ηCT (1 + ηγ)t s 1psr0 (39)

ζt 8(1 + λ2

2 L psr0 + Lη(1 + ηγ)t s 1Lpsr0 b1(t s) + 1 12ηCT (1 + ηγ)t s 1psr0 (40)

which can be derived by induction. When t = s + 1, the left side of Eq. (38) is 0 and thus the inequality is satisfied. Suppose Eq. (38) holds for t t0. When t = t0 + 1, we have

τ=s+1 (1 + ηγ)t τ 1 τwτ

τ=s+1 (1 + ηγ)t s 2psr0 5ηρCd CT (1 + ηγ)t s 2psr0

6(1 + ηγ)t s 1psr0 (41)

where we use Assumption 4 and the case of t t0 in the first two inequalities. We use the parameter setting of Cd in the last inequality. Next, we will estimate the terms related to νt. By Azuma-Hoeffding inequality we know Eq. (39) is satisfied when t = s + 1. We define

ϵt,i = ( Fi( xt+1, ξ(i) t+1) fi( xt+1)) (1 β)( Fi( xt, ξt+1) fi( xt))

ϵ t,i = ( Fi( x t+1, ξ(i) t+1) fi( x t+1)) (1 β)( Fi( x t, ξt+1) fi( x t))

Then according to the definition of νt we have

νt+1 = (1 β)νt + 1

i=1 (ϵt,i ϵ t,i) = 1

τ=s (1 β)t τ n X

i=1 (ϵτ,i ϵ τ,i) (42)

(i) t,1 = Z 1

0 ( 2Fi( x t + θ( xt x t), ξ(i) t ) H(i) t )dθ

(i) t,2 = Z 1

0 ( 2Fi( x t 1 + θ( xt 1 x t 1), ξ(i) t ) H(i) t )dθ

ˆ (i) t,1 = Z 1

0 ( 2Fi(x(i)

t + θ(x(i) t x(i)

t ), ξ(i) t ) H(i) t )dθ

ˆ (i) t,2 = Z 1

0 ( 2Fi(x(i)

t 1 + θ(x(i) t 1 x(i)

t 1), ξ(i) t ) H(i) t )dθ (43)

Then we have

= H(i) t+1wt+1 + (i) t+1,1wt+1 H(i)wt+1 (i) t+1wt+1 + (1 β)(H(i)wt + (i) t wt)

(1 β)(H(i) t+1wt + (i) t+1,2wt)

= (H(i) t+1 H)(wt+1 (1 β)wt) + ( (i) t+1,1 (i) t+1)wt+1 + (1 β)( (i) t (i) t+1,2)wt (44)

According to Assumption 3 and Assumption 4, we have

ϵt,i ϵ t,i 2L wt+1 wt + (2βL + 3ρCd) wt + 3ρCd wt+1 (45)

Applying Azuma-Hoeffding inequality to Eq. (42), with Eq. (45) we can obtain

νt 2 4 log(4/δ)

τ=s [2L wτ+1 wτ + (2βL + 3ρCd) wτ + 3ρCd wτ+1 ]2

48 log(4/δ)

τ=s+1 (L2 wτ wτ 1 2 + 5ρ2C2 d wτ 2) (46)

since β is Θ(ϵ1+θ) and Cd is Θ(ϵ1 α). According to Eq. (34), we have

= L ηH(I ηH)τ s 2ws+1 η

τ =s+1 ηH(I ηH)τ s 2( τ wτ + ντ + ζτ )

+ η( τ 1wτ 1 + ντ 1 + ζτ 1)

Lηγ(1 + ηγ)τ s 2psr0 + Lηγ

2 (1 + ηγ)τ s 2psr0 + Lη τ 1wτ 1 + ντ 1 + ζτ 1

2Lηγ(1 + ηγ)τ s 2psr0 + Lη τ 1wτ 1 + ντ 1 + ζτ 1 (47)

In the first inequality, the first term is derived by the definition of ws+1. The second term is derived by the supposition that Eq. (38) holds for t t0 and the fact that Eq. (38) implies

τ=s+1 (1 + ηγ)t τ 1 τwτ + ντ + ζτ 1

2(1 + ηγ)t s 1psr0 (48)

Combining Eq. (46) and Eq. (47), we have

νt 2 48 log(4/δ)

τ=s+1 (L2 wτ wτ 1 2 + 5ρ2C2 d wτ 2)

270 log(4/δ)ρ2C2 d nb1ηγ (1 + ηγ)2(t s 1)p2 sr2 0 + 192 log(4/δ)L2ηγ

nb1 (1 + ηγ)2(t s 1)p2 sr2 0

+ 96 log(4/δ)L2η2

τ=s+1 τwτ + ντ + ζτ 2

300 log(4/δ)ρ2C2 d nb1ηγ (1 + ηγ)2(t s 1)p2 sr2 0 + 192 log(4/δ)L2ηγ

nb1 (1 + ηγ)2(t s 1)p2 sr2 0

+ 4 log(4/δ)L2

nb1ηγC2 T (1 + ηγ)2(t s 1)p2 sr2 0 + 5000 log2(4/δ)L4η2p2 sr2 0 b2 1

(1 + ηγ)2(τ s 1)

1 288η2C2 T (1 + ηγ)2(t s 1)p2 sr2 0 + 800 log2(4/δ)L2ηγ

nb1 (1 + ηγ)2(t s 1)p2 sr2 0

+ 5000 log2(4/δ)L4η2p2 sr2 0 b2 1

(1 + ηγ)2(τ s 1)

10000 log(4/δ)L4η4

b4 1γ2 (1 + ηγ)2(t s 1)L2p2 sr2 0 (t s)2 + 1 144η2C2 T (1 + ηγ)2(t s 1)p2 sr2 0

(1 + ηγ)2(t s 1)L2p2 sr2 0 (t s)2 + 1 144η2C2 T (1 + ηγ)2(t s 1)p2 sr2 0 (49)

The exponential term in Eq. (40) can be addressed by the following strategy. When t O( 1 1 λ), the term can be dominated by other terms such as 1 ηCT . When t < O( 1 1 λ), it can be bounded by

L2η2 log(4/δ)psr2 0 n(1 λ)b1 L2η2(t s)2 log(4/δ)p2 sr2 0 (1 λ)b1(t s)2 (50)

The term t s in the numerator will be bounded by η in this case and hence it can be merged to the first term in Eq. (39). In the third inequality of Eq. (49), we split the last term into two parts: τ s > Lη b1γ and τ s Lη b1γ . Since R + t dx x2 = 1

t , we can merge the case τ s > Lη b1γ into the second term and estimate the rest one where τ s is small. According to the choice of θ, we have b1 Θ(ϵ2 θ 5α) and η2C2 d C3 T b1 O(1) and hence get the estimation in Eq. (49). We should notice that we use the relation η b1γ O(1) in our proof, which is automatically satisfied. By Eq. (49) we can reach the conclusion in Eq. (39). Furthermore, we have

τ=s+1 (1 + ηγ)t τ 1 ντ

Lη(1 + ηγ)t s 1psr0(

12(1 + ηγ)t s 1psr0

Lη log(CT )(1 + ηγ)t s 1psr0 + 1

12(1 + ηγ)t s 1psr0 1

6(1 + ηγ)t s 1psr0 (51)

The last step to prove Eq. (38) is to estimate the term corresponding to ζt, which is a new term only occurred in decentralized algorithms. Recall the definitions in Eq. (43), we have

h (H(i) t + ˆ (i) t,1)w(i) t (H(i) t + (i) t,1)wt (1 β)((H(i) t + ˆ (i) t,2)w(i) t 1 (H(i) t + (i) t,2)wt 1) i

i=1 H(i) t [(w(i) t wt) (1 β)(w(i) t 1 wt 1)] + 1

i=1 ˆ (i) t,1(w(i) t wt)

i=1 ( ˆ (i) t,1 (i) t,1)wt 1 β

i=1 ˆ (i) t,2(w(i) t 1 wt 1) 1 β

i=1 ( ˆ (i) t,2 (i) t,2)wt 1 (52)

Then by Assumption 3, Assumption 4, Eq. (37), Lemma 10 and Cauchy-Schwartz inequality, we have

n ( Xt Xt (X t X t) 2 F + Xt 1 Xt 1 (X t 1 X t 1) 2 F )

+ 144ρ2C2 d( wt 2 + wt 1 2) (53)

It is sufficient to prove

L n Xt Xt (X t X t) F 2(1 + λ2

2 L psr0 + 1 48ηCT (1 + ηγ)t s 1psr0

+ Lη(1 + ηγ)t s 1Lpsr0

4 b1(t s) (54)

because of Eq. (53) and the parameter setting. Eq. (54) can also be proven by induction. When t = s+1 the condition is satisfied. Next we will estimate Xt Xt (X t X t) 2 F . By Assumption 5 and Young s inequality we have

Xt Xt (X t X t) 2 F = (W J)[(Xt 1 Xt 1 (X t 1 X t 1)) η(Yt 1 Yt 1 (Y t 1 Y t 1))] 2 F

2 Xt 1 Xt 1 (X t 1 X t 1) 2 F + 2η2λ2

1 λ2 Yt 1 Yt 1 (Y t 1 Y t 1) 2 F

τ=s+1 (1 + λ2

2 )t τ 1 Yτ Yτ (Y τ Y τ) 2 F

2 )t s 1 Xs+1 Xs+1 (X s+1 X s+1) 2 F

τ=s+1 (1 + λ2

2 )t τ 1 Yτ Yτ (Y τ Y τ) 2 F +(1 + λ2

2 )t s 1λ2(n ns)psr2 0 (55)

where we apply recursion in the second inequality and use the definition of the decoupled sequences in the last equality. Similarly, by recursion we also have

Yt Yt (Y t Y t ) 2 F

2 Yt 1 Yt 1 (Y t 1 Y t 1) 2 F + λ2 + λ4

1 λ2 Vt Vt 1 (V t V t 1) 2 F

τ=s+1 (1 + λ2

2 )t τ Vτ Vτ 1 (V τ V τ 1) 2 F (56)

Combining above two inequalities, we achieve

Xt Xt (X t X t) 2 F

τ=s+1 (1 + λ2

2 )t τ 1(t τ) Vτ Vτ 1 (V τ V τ 1) 2 F

2 )t s 1λ2(n ns)psr2 0 (57)

According to the update rule of v(i) t we have

v(i) t v(i) t 1 (v(i)

t 1) (1 β)(v(i) t 1 v(i) t 2 (v(i)

= Fi(x(i) t , ξ(i) t ) (1 β) Fi(x(i) t 1, ξ(i) t ) Fi(x(i)

t , ξ(i) t ) + (1 β) Fi(x(i)

t 1, ξ(i) t )

[ Fi(x(i) t 1, ξ(i) t 1) (1 β) Fi(x(i) t 2, ξ(i) t 1) Fi(x(i)

t 1, ξ(i) t 1)

+ (1 β) Fi(x(i)

t 2, ξ(i) t 1)] (58)

Then mimic the estimation of νt, we can obtain

Vt Vt 1 (V t V t 1) 2 F

32 log(4/δ)

i=1 [2L w(i) τ+1 w(i) τ + (2βL + 3ρCd) w(i) τ + 3ρCd w(i) τ+1 ]2

i=1 w(i) t w(i) t 1 2 + 36ρ2C2 d

i=1 w(i) t 2 (59)

Combining above inequalities and the parameter setting of β, we can obtain

1 n Xt Xt (X t X t) 2 F (1 + λ2

2 )t s 1λ2psr2 0

(2000 log(4/δ)η2ρ2C2 d(t s)λ4

(1 λ)2b1 + 72η2ρ2C2 dλ4

τ=s+1 (1 + λ2

2 )t τ 1 (t τ)

i=1 w(i) τ 2

+ (500 log(4/δ)L2η2(t s)λ4

(1 λ)2b1 + 8L2η2λ4

τ=s+1 (1 + λ2

2 )t τ 1 (t τ)

i=1 w(i) τ w(i) τ 1 2

(1 λ)2 (32 + 2000 log(4/δ)(t s)

τ=s+1 (1 + λ2

2 )t τ 1 (t τ)

i=1 Xτ Xτ (X τ X τ) 2 F

+ (2000 log(4/δ)η2ρ2C2 d(t s)λ4

(1 λ)2b1 + 72η2ρ2C2 dλ4

τ=s+1 (1 + λ2

2 )t τ 1(t τ) wτ 2

+ (500 log(4/δ)L2η2(t s)λ4

(1 λ)2b1 + 8L2η2λ4

τ=s+1 (1 + λ2

2 )t τ 1(t τ) wτ wτ 1 2 (60)

Using Eq. (36), Eq. (47) and Eq. (54) we have

n Xt Xt (X t X t) 2 F

B1t2(1 + λ2

2 )t s 1L2psr2 0 + B2(1 + ηγ)2(t s 1)L2p2 sr2 0

τ=s+1 ((1 + λ2)(1 + ηγ)2

2 )t τ 1(t τ)

+ B3(1 + ηγ)2(t s 1)L2p2 sr2 0

τ=s+1 ((1 + λ2)(1 + ηγ)2

2 )t τ 1 t τ (τ s)2

2 )t s 1L2psr2 0 (61)

(1 λ)2 (32 + 2000 log(4/δ)(t s)

b1 ) + 384L2η2(500 log(4/δ)L2η2(t s)

(1 λ)2b1 + 8L2η2

B2 = 1 72(1 λ)2C2 T (2 + 125 log(4/δ)(t s)

b1 ) + 4500 log(4/δ)η2ρ2C2 d(t s) (1 λ)2b1 + 162η2ρ2C2 d (1 λ)2

+ (8L2η2γ2 + 54L2η2ρ2C2 d + 1 6C2 T )(500 log(4/δ)L2η2(t s)

(1 λ)2b1 + 8L2η2

B3 = 48 log(4/δ)L2η2

b1 (500 log(4/δ)L2η2(t s)

(1 λ)2b1 + 8L2η2

+ L4η4(1 + ηγ)2(t s 1)

(1 λ)2b1 (32 + 2000 log(4/δ)(t s)

If t O( 1 1 λ), t2( 1+λ2

2 )t s 1 is small and the first term of Eq. (61) can be merged to the second term. Otherwise if t < O( 1 1 λ), it can be merged to the last term according to the parameter setting

of η and b1. When ϵ is small, we have (1+λ2)(1+ηγ)2

4 . Hence the second term of Eq. (61) can be bounded by Lemma 8. The third term of Eq. (61) can be estimated by Lemma 9 (the case of t < O( 1 1 λ) can be addressed by the parameter setting of η and b1). Therefore, we can prove

n Xt Xt (X t X t) 2 F 4(1 + λ2

2 )t s 1L2psr2 0 + 1 2304ηCT (1 + ηγ)2(t s 1)p2 sr2 0

+ L2η2(1 + ηγ)2(t s 1)L2p2 sr2 0 16b1(t s)2 (63)

because of the parameter setting. We should notice that here we also use the relation η b1γ O(1), which is always satisfied according to the setting of b1. Based on Eq. (53) and Eq. (63), it is easy to check that ζt satisfies Eq. (40). Moreover, we have

τ=s+1 (1 + ηγ)t τ 1 ζτ

Lη(1 + ηγ)t s 1psr0(8

τ=s+1 (4 + λ2

12(1 + ηγ)t s 1psr0

1 λ + log(CT ))(1 + ηγ)t s 1psr0 + 1

12(1 + ηγ)t s 1psr0

6(1 + ηγ)t s 1psr0 (64)

where the first inequality is derived by

4 )2 and (3 + λ2)(1 + ηγ)

Now combining Eq. (41), Eq. (51) and Eq. (64), we can reach the conclusion in Eq. (38) and finish the proof of the induction. Recall the assumption at the beginning, we have

1 2(1 + ηγ)CT psr0 w CT 2 d 6Cd (66)

since xt x t xt xs + x t xs . Eq. (66) implies that

CT log(12Cd/(psr0))

log(1 + ηγ) < 2 log(12n Cd/r0)

which conflicts with the definition of CT . Therefore, the proof of Lemma 6 is finished.

C.7 Proof of Lemma 7

Proof. If node i enters the escaping phase in iteration s before iteration s and does not break it in iteration s+CT , then for s t s+CT , we have x(i) t x(i) s x(i) t x(i) s + x(i) s x(i) s 2Cd. Therefore, there are at least n

10 worker nodes satisfying maxs t s+CT x(i) t x(i) s 2Cd.

Suppose min eig( 2f( xs)) ϵH and e1 is the corresponding eigenvector. Let Si denote the region of the perturbation on node i that PEDESTAL will terminate in iteration s + CT , i.e., n 10 workers will not break the escaping phase. Then by Lemma 6 we can conclude that there must exist one worker node such that the projection of Si onto direction e1 is smaller than r0. Since the perturbation ξi is drawn from uniform distribution, the probability of ξi Si can be bounded by

Pr(ξi Si) r0V (Ball(d 1, r))

V (Ball(d, r)) δ (68)

where V ( ) denotes the volume and Ball(d, r) denotes the d-dimensional ball with radius r. The last inequality is achieved by the definition of r0. Therefore, we can prove that xs is a second-order stationary point with probability at least 1 δ.

D Additional Theoretical Result

In this section we will provide some additional theoretical result of our PEDESTAL algorithm. First we will demonstrate the convergence analysis of the case ϵH < ϵ, i.e., α > 0.5. Next, we will discuss the strategy of using fixed number of iterations in each descent and escaping phase, which motivates the design of PEDESTAL.

D.1 Smaller Tolerance for Second-Order Optimality

When ϵH < ϵ, the conclusions of previous Lemmas are still satisfied except Lemma 4. In this case, Cd = C2ηCT ϵµ where µ = 2α > 1. Parameter Cd should be smaller than the original setting in Lemma 4, which results in more iterations to converge. Fortunately, the analysis of Lemma 4 can be adjusted and we can achieve Theorem 2. The proof is provided as follows.

Proof. The fourth term of Dt in Lemma 3 is derived by ηβσ2

b1 and at this time we will set b1 ϵ (2µ 1 θ) so that the ϵ term is replaced by ϵµ. The last term of Dt can be written as

7pt(η2C2 v + r2) 4η = 1

7(η2C2 v + r2) 4η (69)

where P is the set of all pairs of (t, i) such that node i draws perturbation in iteration t. We can divide P into two parts. P1 contains all pairs of (t, i) such that node i breaks the escaping phase within M iterations, where M is an integer to be decided later. The rest part is denoted by P2.

For any (t, i) P1, suppose node i breaks escaping phase in iteration t + m, where m M. Then node i will never draw perturbation between iteration t and iteration t + M. Mimic the steps of Eq. (26), by Cauchy-Schwartz inequality we can obtain

τ=t x(i) τ+1 x(i) τ 2 C2 d M (70)

Let M = ϵ 2 2θ+2α. Then we have

τ=t x(i) τ+1 x(i) τ 2 7(η2C2 v + r2) 4η (71)

i=1 x(i) τ+1 x(i) τ 2 1

7(η2C2 v + r2) 4η (72)

by the parameter setting of Cv. On the other hand, if (t, i) P2, then node i will not break the escaping phase in M steps and hence the perturbation step will not execute, either. Therefore, we have estimation

7(η2C2 v + r2) 4η

7(η2C2 v + r2) 4Mη (73)

With Eq. (72), Eq. (73) and the new setting of b1, the descent in Lemma 3 can be improved to

Dt = 1 16η ωt + (1 λ)2

i=1 x(i) t+1 x(i) t 2 + η

i=1 y(i) t 2 200ηϵ2µσ2

(1 λ)2C2 1 7(η2C2 v + r2) 4Mη

When θ 3α 2, we have ϵ2

M ϵ2µ and Lemma 4 still holds but the conclusion is changed to

t I Dt |I| (1 λ)2C2 2ηϵ2µ

In this case, PEDESTAL algorithm will terminate in O(ϵ θ 2µ) iterations. In Lemma 6 and Lemma 7 we need the relations

η2C2 d C3 T b1 O(1), η b1ϵH O(1) (74)

which implies b1 O(ϵ θ α). Therefore, we set b1 = Θ(ϵ max{4α 1 θ,θ+α}) with the condition θ 3α 2. When α 1, we set θ = 3α 1

2 , which satisfies θ 3α 2 and

4α 1 θ = θ + α = 5α 1

The gradient complexity in this case is

2 ) = O(ϵ 8α+1) (76)

When α > 1, we have θ = 3α 2 and b1 = Θ(ϵ (4α 2)). The gradient complexity is

O(ϵ (7α 2) ϵ (4α 2)) = O(ϵ 11α+4) (77)

which finishes the proof of Theorem 2.

Therefore, the gradient complexity over all cases of α can by written by

O(ϵ 3 + ϵϵ 8 H + ϵ4ϵ 11 H ) (78)

D.2 Phases with Fixed Number of Iterations

If a decentralized stochastic perturbed gradient descent method adopt the strategy of fixed number of iterations in each phase, the gradient complexity in the descent phase should be at least O(ϵ 3) to ensure the first-order stationary point. But the total descent of a descent phase could be small because it is possible that it is stuck at a saddle point after only a few steps. Hence we need to consider the descent in the escaping phase. According to Lemma 3 and Lemma 4 we can see the descent of an escaping phase is O( C2 d ηCT ). As the conditions ηCd CT O(1) and CT = O( 1 ηϵH ) are required in Lemma 6, we can obtain that the total descent of an escaping phase is no larger than O(ϵ3 H). In the classic setting of ϵH = ϵ, the total descent of an escaping is upper bounded by O(ϵ1.5). Consequently, the total gradient complexity to achieve (ϵ, ϵ)-second-order stationary point is at least O(ϵ 4.5), which is worse than the result of our PEDESTAL.

E Auxiliary Lemmas

Lemma 8. Let 0 < a < 1. Then we have

τ=1 τaτ 1 = 1 at

Lemma 9. Let 0 < a < 1. When t O( 1 1 a), we have

(t + 1 τ)2 8 t2(1 a)2

Proof. When τ t

2, by Lemma 8 we have

(t + 1 τ)2 4 t2(1 a)2 (79)

(t + 1 τ)2 X

τ>t/2 τaτ 1 at/2( t 2(1 a) + 1 (1 a)2 ) (80)

Therefore, we can reach the conclusion when t O( 1 1 a).

Lemma 10. (Definition of Variance) For any random variable X, we have

E[X EX]2 = EX2 (EX)2

Lemma 11. (Lemma D.1 in (Chen et al. [2022])) Let ϵ1:k Rd be a vector-valued martingale difference sequence with respect to Fk, i.e., for each k [K], E[ϵk|Fk] = 0 and ϵk Bk, then with probability 1 δ we have

k=1 ϵk 2 4 log(4/δ)

k=1 B2 k (81)