Faster Adaptive Decentralized Learning Algorithms

Feihu Huang 1 2   Jianyu Zhao 1

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. 2 MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China. Correspondence to: Feihu Huang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Decentralized learning has recently received increasing attention in machine learning due to its advantages in implementation simplicity, system robustness, and data privacy. Meanwhile, adaptive gradient methods show superior performance in many machine learning tasks such as training neural networks. Although some works study decentralized optimization algorithms with adaptive learning rates, these adaptive decentralized algorithms still suffer from high sample complexity. To fill this gap, we propose a class of faster adaptive decentralized algorithms (i.e., AdaMDOS and AdaMDOF) for distributed nonconvex stochastic and finite-sum optimization, respectively. Moreover, we provide a solid convergence analysis framework for our methods. In particular, we prove that our AdaMDOS obtains a near-optimal sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of nonconvex stochastic optimization. Meanwhile, our AdaMDOF obtains a near-optimal sample complexity of $O(\sqrt{n}\epsilon^{-2})$ for finding an $\epsilon$-stationary solution of nonconvex finite-sum optimization, where $n$ denotes the sample size. To the best of our knowledge, our AdaMDOF algorithm is the first adaptive decentralized algorithm for nonconvex finite-sum optimization. Some experimental results demonstrate the efficiency of our algorithms.

1. Introduction

With rapidly increasing dataset sizes and the high dimensionality of machine learning problems, training large-scale machine learning models has attracted growing attention. Clearly, training large-scale models on a single centralized machine has become inefficient and unscalable. To address these efficiency and scalability challenges, distributed machine learning optimization has recently been widely studied. In particular, decentralized optimization (Lian et al., 2017) has received increasing attention in machine learning in recent years because it frees the system from a centralized agent with a large communication load and privacy risk.

In this paper, we study decentralized learning algorithms to solve the distributed stochastic problem over a communication network $G=(V,E)$, defined as

$$\min_{x\in\mathbb{R}^d} F(x) \triangleq \frac{1}{m}\sum_{i=1}^m f^i(x), \quad f^i(x) = \mathbb{E}_{\xi^i}\big[f^i(x;\xi^i)\big], \qquad (1)$$

where for any $i\in[m]$, $f^i(x)$ denotes the objective function of the $i$-th client, which is differentiable and possibly nonconvex. Here $\xi^i$ for any $i\in[m]$ is an independent random variable following an unknown distribution $D_i$, and possibly $D_i \neq D_j$ for $i, j \in [m]$. $G=(V,E)$ is a communication network including $m$ computing agents, where any agents $i,j\in V$ can communicate only if $(i,j)\in E$. Meanwhile, we also consider decentralized learning algorithms for solving the distributed finite-sum problem over a communication network $G=(V,E)$, defined as

$$\min_{x\in\mathbb{R}^d} F(x) \triangleq \frac{1}{m}\sum_{i=1}^m f^i(x), \quad f^i(x) = \frac{1}{n}\sum_{k=1}^n f^i_k(x), \qquad (2)$$

where $f^i_k(x) = f^i(x;\xi^i_k)$ for $k=1,2,\ldots,n$. Here $\{\xi^i_k\}_{k=1}^n$ can be seen as $n$ samples drawn from distribution $D_i$ for $i=1,2,\ldots,m$.
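As a concrete illustration of this setup (not part of the paper; the client count, sample count, and quadratic local losses below are hypothetical choices), the following minimal NumPy sketch builds $m$ local finite-sum objectives, evaluates the global objective $F(x)$ in (2) as their average, and draws an unbiased local stochastic gradient as in Problem (1).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 100, 10          # hypothetical: m clients, n samples per client, dimension d

# Each client i holds its own data (A_i, b_i); here we use synthetic least-squares losses
# f^i_k(x) = 0.5 * (a_{i,k}^T x - b_{i,k})^2 purely for illustration.
A = [rng.normal(size=(n, d)) for _ in range(m)]
b = [rng.normal(size=n) for _ in range(m)]

def local_objective(i, x):
    """f^i(x) = (1/n) * sum_k f^i_k(x), the local finite-sum objective of client i."""
    r = A[i] @ x - b[i]
    return 0.5 * np.mean(r ** 2)

def local_stochastic_grad(i, x, batch):
    """Mini-batch stochastic gradient of f^i at x (unbiased, cf. Problem (1))."""
    r = A[i][batch] @ x - b[i][batch]
    return A[i][batch].T @ r / len(batch)

def global_objective(x):
    """F(x) = (1/m) * sum_i f^i(x); no single client can evaluate this alone."""
    return np.mean([local_objective(i, x) for i in range(m)])

x = np.zeros(d)
batch = rng.integers(n, size=10)
print(global_objective(x), np.linalg.norm(local_stochastic_grad(0, x, batch)))
```

The only point of the sketch is that $F$ and its gradient are averages of quantities held by different clients, so evaluating or descending them requires communication over $G$.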
In fact, Problems (1) and (2) frequently appear in many machine learning applications such as training Deep Neural Networks (DNNs) (Lian et al., 2017) and reinforcement learning (Chen et al., 2022). Many decentralized stochastic gradient-based algorithms have recently been developed to solve the above stochastic Problem (1). For example, (Lian et al., 2017) proposed an efficient decentralized stochastic gradient descent (D-PSGD) algorithm, which integrates average consensus with local SGD steps and outperforms the standard centralized SGD methods. Due to the inconsistency that arises under the non-i.i.d. setting, some variants of D-PSGD (Tang et al., 2018; Xin et al., 2021b) have been studied to handle the data heterogeneity issue, e.g., the D2 method (Tang et al., 2018), which stores the previous status, and the GT-DSGD method (Xin et al., 2021a), which uses the gradient tracking technique (Xu et al., 2015).

Table 1: Sample and communication complexities of representative adaptive decentralized stochastic algorithms for finding an ϵ-stationary point of Problem (1) or (2), i.e., $\mathbb{E}\|\nabla F(x)\| \leq \epsilon$ or its equivalent variants. Note that AdaRWGD (Sun et al., 2022) relies on a random walk instead of the parallel framework used in the other algorithms. For a fair comparison, here we do not consider some specific cases such as sparse stochastic gradients.

| Problem    | Algorithm         | Reference             | Sample Complexity | Communication Complexity |
|------------|-------------------|-----------------------|-------------------|--------------------------|
| Stochastic | DADAM             | (Nazari et al., 2022) | O(ϵ^{-4})         | O(ϵ^{-4})                |
| Stochastic | AdaRWGD           | (Sun et al., 2022)    | O(ϵ^{-4})         | O(ϵ^{-4})                |
| Stochastic | DAMSGrad/DAdaGrad | (Chen et al., 2023)   | O(ϵ^{-4})         | O(ϵ^{-4})                |
| Stochastic | AdaMDOS           | Ours                  | O(ϵ^{-3})         | O(ϵ^{-3})                |
| Finite-Sum | AdaMDOF           | Ours                  | O(√n ϵ^{-2})      | O(ϵ^{-2})                |

Subsequently, (Sun et al., 2020; Pan et al., 2020) proposed accelerated decentralized SGD algorithms (i.e., D-GET and D-SPIDER-SFO) by using the variance-reduced gradient estimator of SARAH/SPIDER (Nguyen et al., 2017; Fang et al., 2018), which obtain a near-optimal sample complexity of O(ϵ^{-3}) for finding a stationary solution of stochastic optimization problems. To avoid the large batch size required at each iteration, (Zhang et al., 2021; Xin et al., 2021a) proposed a class of efficient momentum-based decentralized SGD algorithms (i.e., GT-HSGD and GT-STORM) based on the momentum-based variance-reduced gradient estimator of ProxHSGD/STORM (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2022), which also obtain a near-optimal sample complexity of O(ϵ^{-3}).

Meanwhile, some decentralized stochastic gradient-based algorithms have been developed to solve the above finite-sum Problem (2). (Sun et al., 2020; Xin et al., 2022) proposed a class of efficient decentralized algorithms for nonconvex finite-sum optimization based on the variance-reduced gradient estimator of SARAH (Nguyen et al., 2017). Subsequently, (Zhan et al., 2022) presented a fast decentralized algorithm for nonconvex finite-sum optimization based on the variance-reduced gradient estimator of ZeroSARAH (Li et al., 2021), which avoids computing multiple full gradients.

It is well known that adaptive gradient methods show superior performance in many machine learning tasks such as training DNNs. More recently, (Nazari et al., 2022; Sun et al., 2022; Chen et al., 2023) proposed some adaptive decentralized algorithms for stochastic optimization based on the existing Adam algorithm (Kingma & Ba, 2014) or its variants. However, these adaptive decentralized algorithms still suffer from high sample and communication complexities in finding a stationary solution of Problem (1) (please see Table 1).
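To make the momentum-based variance-reduced (STORM-type) estimator mentioned above concrete, here is a minimal single-machine sketch; the synthetic least-squares data, step size, and momentum parameter β are hypothetical choices for illustration only. It implements the recursion $u_{t+1} = \nabla f(x_{t+1};\xi_{t+1}) + (1-\beta)\big(u_t - \nabla f(x_t;\xi_{t+1})\big)$, which is the same estimator AdaMDOS later applies on every client (line 9 of Algorithm 1).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)                       # synthetic least-squares data, purely illustrative

def sgrad(x, k):                             # stochastic gradient of one component f(x; xi_k)
    return A[k] * (A[k] @ x - b[k])

gamma, beta, T = 0.05, 0.1, 2000             # hypothetical step size and momentum parameter
x = np.zeros(d)
k = rng.integers(n)
u = sgrad(x, k)                              # u_0 = grad f(x_0; xi_0)
for t in range(T):
    x_new = x - gamma * u
    k = rng.integers(n)                      # one fresh sample per iteration
    # STORM / hybrid estimator: SARAH-style correction, damped by beta toward plain SGD.
    u = sgrad(x_new, k) + (1 - beta) * (u - sgrad(x, k))
    x = x_new

full_grad = A.T @ (A @ x - b) / n
print(np.linalg.norm(u - full_grad), np.linalg.norm(full_grad))
```

With a single fresh sample per step, the estimator's error relative to the full gradient shrinks as the iterates stabilize, which is what removes the need for large batches in the methods discussed above.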
Naturally, there still exists an open question: Could we design adaptive decentralized algorithms with lower sample and communication complexities to solve Problems (1) and (2) ? In the paper, to fill this gap, we affirmatively answer to this question, and propose a class of faster adaptive decentralized algorithms to solve Problems (1) and (2), respectively, based on the momentum-based variance-reduced and gradient tracking techniques. In particular, our methods use a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Our main contributions are several folds: (1) We propose a class of efficient adaptive decentralized optimization algorithms (i.e., Ada MDOS and Ada MDOF) to solve Problems (1) and (2), respectively, based on the momentum-based variance-reduced and gradient tracking techniques simultaneously. Moreover, we provide a convergence analysis framework for our methods. (2) We prove that our Ada MDOS algorithm reaches the near optimal sample complexity of O(ϵ 3) for finding an ϵ-stationary solution of Problem (1), which matches the lower bound of smooth nonconvex stochastic optimization (Arjevani et al., 2023). (3) We prove that our Ada MDOF algorithm reaches the near optimal sample complexity of O( nϵ 2) for finding an ϵ-stationary solution of Problem (2), which matches the lower bound of smooth nonconvex finitesum optimization (Fang et al., 2018). (4) We conduct some numerical experiments on training nonconvex machine learning tasks to verify the efficiency of our proposed algorithms. Since our algorithms use a unified adaptive matrix including various adaptive learning rates, our convergence analysis does not consider some specific cases such as sparse stochastic gradients. Despite this, our adaptive algorithms still obtain lower sample and communication complexities compared to the existing adaptive decentralized algorithms. 2. Related Works In this section, we overview some representative decentralized optimization algorithms and adaptive gradient algorithms, respectively. Faster Adaptive Decentralized Learning Algorithms 2.1. Decentralized Optimization Decentralized optimization is an efficient framework to collaboratively solve distributed problems by multiple worker nodes, where a worker node only needs to communicate with its neighbors at each iteration. The traditional decentralized optimization methods include some popular algorithms such as Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011), Dual Averaging (Duchi et al., 2011b). Subsequently, some efficient decentralized optimization algorithms have been developed, e.g., Extra (Shi et al., 2015), Next (Di Lorenzo & Scutari, 2016), Prox PDA (Hong et al., 2017). Meanwhile, (Lian et al., 2017) proposed an efficient decentralized stochastic gradient descent algorithm (i.e., D-PSGD), which shows that the decentralized SGD can outperform the parameter server-based SGD algorithms relying on high communication cost. From this, decentralized algorithms have begun to shine in machine learning such as training DNNs. Subsequently, (Tang et al., 2018) proposed an accelerated D-PSGD (i.e., D2) by using previous status. Meanwhile, (Xin et al., 2021a) further proposed an efficient decentralized SGD algorithm (i.e., GT-DSGD) by using gradient tracking technique (Xu et al., 2015). 
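The gradient tracking technique just mentioned admits a short self-contained illustration. The sketch below (a hypothetical 5-node ring with synthetic full local gradients, not any method from this paper) runs the recursion $w_{t+1} = W(w_t + u_{t+1} - u_t)$ and checks that the average of the local trackers always equals the average of the local gradient estimates, which is the property that lets each $w^i_t$ follow the global gradient direction.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, gamma, T = 5, 10, 0.05, 300                    # hypothetical sizes and step size

# Doubly stochastic mixing matrix for a ring of m agents.
W = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i, i + 1):
        W[i, j % m] = 1.0 / 3.0

A = [rng.normal(size=(50, d)) for _ in range(m)]     # synthetic local least-squares data
b = [rng.normal(size=50) for _ in range(m)]

def full_grad(i, x):                                 # deterministic local gradient, for clarity
    return A[i].T @ (A[i] @ x - b[i]) / 50

X = np.zeros((m, d))                                 # local iterates x^i
U = np.array([full_grad(i, X[i]) for i in range(m)]) # local gradient estimates u^i
Wt = W @ U                                           # trackers w^i, initialized by one mixing step

for t in range(T):
    X = W @ X - gamma * Wt                           # consensus step plus tracked-gradient step
    U_new = np.array([full_grad(i, X[i]) for i in range(m)])
    Wt = W @ (Wt + U_new - U)                        # gradient tracking: w <- W(w + u_new - u)
    U = U_new

# The trackers' average always equals the average of local gradients (cf. Lemma A.3 in the appendix).
print(np.linalg.norm(Wt.mean(axis=0) - U.mean(axis=0)))
```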
By using the variance-reduced techniques, some other accelerated decentralized SGD algorithms (Sun et al., 2020; Pan et al., 2020; Cutkosky & Orabona, 2019; Tran-Dinh et al., 2022) have been proposed, include DSPIDER-SFO (Pan et al., 2020) and GT-HSGD (Xin et al., 2021a). Meanwhile, (Nazari et al., 2022) studied the decentralized version of AMSGrad (Reddi et al., 2019) for online optimization. Moreover, (Sun et al., 2022; Chen et al., 2023) developed adaptive decentralized algorithms for stochastic optimization by using local adaptive learning rates. 2.2. Adaptive Gradient Methods Adaptive gradient methods (Duchi et al., 2011a; Kingma & Ba, 2014; Loshchilov & Hutter, 2017) recently have been successfully applied in machine learning tasks such as training DNNs. Adam (Kingma & Ba, 2014) is one of popular adaptive gradient methods by using a coordinate-wise adaptive learning rate and momentum technique to accelerate algorithm, which is the default optimization tool for training attention models (Zhang et al., 2020). Subsequently, some variants of Adam (Reddi et al., 2019; Chen et al., 2018; Guo et al., 2021) have been presented to obtain a convergence guarantee under the nonconvex setting. Due to using the coordinate-wise type of adaptive learning rates, Adam frequently shows a bad generalization performance in training DNNs. To improve the generalization performances, recently some adaptive gradient methods such as Adam W (Loshchilov & Hutter, 2017), Ada Grad (Li & Orabona, 2019) and Ada Belief (Zhuang et al., 2020) have been proposed. More recently, some accelerated adaptive gradient methods (Cutkosky & Orabona, 2019; Huang et al., 2021; Levy et al., 2021; Kavis et al., 2022) have been pro- posed based on the variance-reduced techniques. 3. Preliminaries 3.1. Notations [m] denotes the set {1, 2, , m}. denotes the ℓ2 norm for vectors and spectral norm for matrices. x, y denotes the inner product of two vectors x and y. For vectors x and y, xr (r > 0) denotes the element-wise power operation, x/y denotes the element-wise division and max(x, y) denotes the element-wise maximum. Id denotes a d-dimensional identity matrix. at = O(bt) denotes that at cbt for some constant c > 0. The notation O( ) hides logarithmic terms. ones(d, 1) denotes an all-one d-dimensional vector. 3.2. Assumptions In this subsection, we give some mild assumptions on the Problems (1) and (2). Assumption 3.1. (Smoothness) For any i [m], each component loss function f i(x; ξi) is L-smooth, such that for all x1, x2 Rd f i(x1; ξi) f i(x2; ξi) L x1 x2 . (3) Clearly, based on Assumptions 3.1, we have F(x1) F(x2) Eξi[ f i(x1; ξi)] Eξi[ f i(x2; ξi)] i=1 Eξi f i(x1; ξi) f i(x2; ξi) L x1 x2 , i.e., the global function F(x) is L-smooth as well. Assumption 3.2. (Sampling Oracle) Stochastic function f i(x; ξi) has an unbiased stochastic gradient with bounded variance for any i [m], i.e., E[ f i(x; ξi)] = f i(x), E f i(x; ξi) f i(x) 2 σ2. Assumption 3.3. (Lower Bounded) The objective function F(x) is lower bounded, i.e., F = infx Rd F(x). Assumption 3.4. (Network Protocol) The graph G = (V, E) is connected and undirected, which can be represented by a mixing matrix W Rm m: 1) Wi,j > 0 if Wi,j E and Wi,j = 0 otherwise; 2) W is doubly stochastic such that W = W T , Pm i=1 Wi,j = 1 and Pm j=1 Wi,j = 1; 3) the eigenvalues of W satisfy λm λ2 < λ1 = 1 and ν = max(|λ2|, |λm|) < 1. Assumption 3.5. 
In our algorithms, the local adaptive matrices Ai t ρId 0 for all i [m], t 1 for updating the variables x, where ρ > 0 is an appropriate positive number. Assumptions 3.1 and 3.2 are commonly used in stochastic smooth nonconvex optimization (Sun et al., 2020; Pan Faster Adaptive Decentralized Learning Algorithms Algorithm 1 Adaptive Momentum-Based Decentralized Optimization (Ada MDOS) Algorithm for Stochastic Optimization 1: Input: T > 0, tuning parameters {γ, ηt, βt}, initial inputs xi 1 Rd for all i [m]; 2: initialize: Set xi 0 = xi 0 for i [m], and draw one sample ξi 0 and then compute ui 0 = f i(xi 0; ξi 0) and wi 0 = P j Ni Wi,juj 0 for all i [m]. 3: for t = 0 to T 1 do 4: for i = 1, , m (in parallel) do 5: Generate the adaptive matrix Ai t Rd d; One example of Ai t by using update rule (ai 0 = 0, 0 < ϱ < 1, ρ > 0.) Compute ai t = ϱai t 1 + (1 ϱ)( f i(xi t; ξi t))2, Ai t = diag( p ai t + ρId); 6: xi t+1 = P j Ni Wi,jxj t γ(Ai t) 1wi t; 7: xi t+1 = xi t + ηt( xi t+1 xi t); 8: Randomly draw a sample ξi t+1 Di; 9: ui t+1 = f i(xi t+1; ξi t+1) + (1 βt+1)(ui t f i(xi t; ξi t+1)); 10: wi t+1 = P j Ni Wi,j wj t + uj t+1 uj t ; 11: end for 12: end for 13: Output: Chosen uniformly random from {xi t 1}m i=1. et al., 2020; Cutkosky & Orabona, 2019; Tran-Dinh et al., 2022). Assumption 3.3 ensures the feasibility of Problems (1) and (2). Assumption 3.4 shows the protocol properties of network G = (V, E), which is very common in the decentralized distributed optimization (Lian et al., 2017; Xin et al., 2021a). Assumption 3.5 imposes that each local adaptive matrix is positive definite, which is commonly used in many adaptive gradient methods for non-distributed optimization (Huang et al., 2021; Yun et al., 2021). 4. Adaptive Momentum-Based Decentralized Algorithms In this section, we propose a class of efficient adaptive momentum-based decentralized algorithms to solve Problems (1) and (2), respectively, which build on the momentum-based and gradient tracking techniques. 4.1. Ada MDOS Algorithm for Stochastic Optimization In this subsection, we propose a faster adaptive momentumbased decentralized (Ada MDOS) algorithm for the stochastic Problem (1) over a network, which builds on the variancereduced momentum technique of STORM (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2022) and gradient tracking technique (Xu et al., 2015). In particular, our Ada MDOS algorithm also uses the momentum iteration and uni- Algorithm 2 Adaptive Momentum-Based Decentralized Optimization (Ada MDOF) Algorithm for Finite-Sum Optimization 1: Input: T > 0, tuning parameters {γ, ηt, βt}, initial inputs xi 1 Rd for all i [m]; 2: initialize: Set xi 0 = xi 1 = xi 1, zi 1,0 = zi 2,0 = = zi n,0 = 0 and ui 0 = wi 0 = 0 for any i [m]. 3: for t = 1 to T do 4: for i = 1, , m (in parallel) do 5: Randomly draw a minibatch samples Ii t with |Ii t| = b; 6: ui t = 1 b P k Ii t f i k(xi t) f i k(xi t 1) + (1 βt)ui t 1 + βt 1 b P k Ii t f i k(xi t 1) zi k,t 1 + 1 n Pn j=1 zi j,t 1 ; 7: wi t = P j Ni Wi,j wj t 1 + uj t uj t 1 ; 8: Generate the adaptive matrix Ai t Rd d; One example of Ai t by using update rule (ai 0 = 0, 0 < ϱ < 1, ρ > 0.) Compute ai t = ϱai t 1 + (1 ϱ) 1 k Ii t f i k(xi t) 2, Ai t = diag( p ai t + ρId); 9: xi t+1 = P j Ni Wi,jxj t γ(Ai t) 1wi t; 10: xi t+1 = xi t + ηt( xi t+1 xi t); 11: zi k,t = f i k(xt) for k Ii t and zi k,t = zi k,t 1 for k / Ii t. 12: end for 13: end for 14: Output: Chosen uniformly random from {xi t 1}m i=1. fied adaptive learning rate to update variable. 
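The steps of Algorithm 1 are discussed in detail below; as a concrete reference point, the following minimal NumPy sketch simulates one run of AdaMDOS with the Adam-like adaptive matrix of line 5 on synthetic least-squares clients. The data, the ring mixing matrix, and the tuning constants $(\gamma, \eta, \beta, \varrho, \rho)$ are hypothetical choices for illustration and are not the authors' implementation or settings.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, T = 5, 10, 300
gamma, eta, beta, varrho, rho = 0.01, 0.5, 0.1, 0.9, 1.0   # hypothetical tuning parameters

W = np.zeros((m, m))                                   # doubly stochastic ring mixing matrix
for i in range(m):
    for j in (i - 1, i, i + 1):
        W[i, j % m] = 1.0 / 3.0

A_dat = [rng.normal(size=(100, d)) for _ in range(m)]  # synthetic local least-squares data
b_dat = [rng.normal(size=100) for _ in range(m)]

def sgrad(i, x, k):                                    # stochastic gradient of f^i(x; xi^i_k)
    return A_dat[i][k] * (A_dat[i][k] @ x - b_dat[i][k])

X = np.tile(rng.normal(size=d), (m, 1))                # x^i_0: same initial point on every client
k0 = rng.integers(100, size=m)                         # one sample xi^i_0 per client
U = np.array([sgrad(i, X[i], k0[i]) for i in range(m)])    # u^i_0 = grad f^i(x^i_0; xi^i_0)
Wt = W @ U                                             # w^i_0 = sum_j W_ij u^j_0
g_prev = U.copy()                                      # grad f^i(x^i_t; xi^i_t), reused at line 5
a = np.zeros((m, d))                                   # Adam-like accumulator a^i

for t in range(T):
    a = varrho * a + (1 - varrho) * g_prev ** 2        # line 5: a^i_t, A^i_t = diag(sqrt(a^i_t) + rho)
    Ainv = 1.0 / (np.sqrt(a) + rho)                    # (A^i_t)^{-1}, elementwise
    X_half = W @ X - gamma * Ainv * Wt                 # line 6: consensus + adaptive descent
    X_new = X + eta * (X_half - X)                     # line 7: momentum (convex-combination) step
    k = rng.integers(100, size=m)                      # line 8: draw xi^i_{t+1}
    g_new = np.array([sgrad(i, X_new[i], k[i]) for i in range(m)])
    g_old = np.array([sgrad(i, X[i], k[i]) for i in range(m)])
    U_new = g_new + (1 - beta) * (U - g_old)           # line 9: STORM-type estimator u^i_{t+1}
    Wt = W @ (Wt + U_new - U)                          # line 10: gradient tracking w^i_{t+1}
    X, U, g_prev = X_new, U_new, g_new

xbar = X.mean(axis=0)
grad_F = np.mean([A_dat[i].T @ (A_dat[i] @ xbar - b_dat[i]) / 100 for i in range(m)], axis=0)
print(np.linalg.norm(grad_F), np.linalg.norm(X - xbar))    # gradient norm and consensus error
```

Only one fresh sample per client is drawn at each iteration, which is what makes the single-loop structure of Algorithm 1 possible.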
Algorithm 1 provides the algorithmic framework of our Ada MDOS algorithm. At the line 5 of Algorithm 1, we generate an adaptive matrix based on the historical stochastic gradients f i(xi l; ξi l) 1 l t. And we give an Adam-like adaptive learning rate, defined as ai t = ϱai t 1 + (1 ϱ)( f i(xi t; ξi t))2 l=1 (1 ϱ)ϱt l( f i(xi l; ξi l))2, Ai t = diag( q ai t + ρId), (4) clearly, we have Ai t ρId, which satisfies Assumption 3.5. Besides one example (4), we can also generate many adaptive matrices satisfying the above Assumption 3.5. e.g., the Barzilai-Borwein-like adaptive matrix, defined as ai t = | xi t xi t 1, f i(xi t; ξi t) f i(xi t 1; ξi t) | xi t xi t 1 2 + ρ, Ai t = ai t Id ρId. (5) Faster Adaptive Decentralized Learning Algorithms At the line 6 of Algorithm 1, each client updates the local variable xi based on adaptive matrix Ai t and momentumbased gradient estimator wi t: j Ni Wi,jxj t γ(Ai t) 1wi t, (6) where the constant γ > 0. Here Ni = {j V | (i, j) E, j = i} denotes the neighborhood of the i-th client. Here each client communicates with its neighbors to update the variable x. Then we further use the momentum iteration to update the variable x at the line 7 of Algorithm 1: xi t+1 = xi t + ηt( xi t+1 xi t), (7) where ηt (0, 1). At lines 8-9 of Algorithm 1, each client uses the variancereduced momentum-based technique (Tran-Dinh et al., 2022; Cutkosky & Orabona, 2019) to update the stochastic gradients by using local data ξi t+1: for i [m] ui t+1 = f i(xi t+1; ξi t+1) + (1 βt+1)(ui t f i(xi t; ξi t+1)), (8) where βt+1 (0, 1). At the line 10 of our Algorithm 1, then each client communicates with its neighbors to compute gradient estimators wi t+1, defined as j Ni Wi,j wj t + uj t+1 uj t , (9) which uses gradient tracking technique (Xu et al., 2015; Di Lorenzo & Scutari, 2016) to reduce the consensus error. Thus, the local stochastic gradient estimator wi t+1 can track the directions of global gradients. 4.2. Ada MDOF Algorithm for Finite-Sum Optimization In this subsection, we propose a faster adaptive momentumbased decentralized (Ada MDOF) algorithm for distributed finite-sum problem (2) over a network, which builds on the variance-reduced momentum technique of Zero SARAH (Li et al., 2021) and gradient tracking technique. Algorithm 2 provides the algorithmic framework of our Ada MDOF algorithm. Algorithm 2 is fundamentally similar to Algorithm 1, differing primarily in its application of the variance-reduced momentum technique from Zero SARAH (Li et al., 2021) to update the stochastic gradients by using local data: for i [m] f i k(xi t) f i k(xi t 1) + (1 βt)ui t 1 f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 , where βt+1 (0, 1). Here the Zero SARAH technique can be seen as the combination of SARAH (Nguyen et al., 2017) and SAGA (Defazio et al., 2014) techniques. 5. Convergence Analysis In this section, under some mild assumptions, we provide the convergence properties of our Ada MDOS and Ada MDOF algorithms for Problems (1) and (2), respectively. All related proofs are provided in the following Appendix. For notational simplicity, let xt = 1 m Pm i=1 xi t for all t 1. 5.1. Convergence Properties of our Ada MDOS Algorithm For Ada MDOS algorithm, we define a useful Lyapunov function, for any t 1 Ωt = E h F( xt) + (λt 1 9γη 2ρ ) ut 1 f(xt 1) 2 i=1 ui t 1 f i(xi t 1) 2 + (θt 1 19γηL2 i=1 xi t 1 xt 1 2 i=1 wi t 1 wt 1 2 i=1 gi t 1 2i , where gi t = (Ai t) 1wi t, f(xt) = 1 m Pm i=1 f i(xi t), χt 2ρ , θt 29γηL2 4ρ and ηt = η for all t 0. Theorem 5.1. 
Suppose the sequences {xi t}m i=1 T t=1 be generated from Algorithm 1. Under the above Assumptions 3.1-3.5, and let ηt = η, 0 < βt 1 for all t 0, γ min ρ(1 ν2) 48θt , 3ρ(1 ν2)θt 3(1+ν2) Ht , γ(3+ν2) Ht with Ht = 9 2βt + 8ν2 (1 ν)2 for all t 1, we have t=1 E F( xt) (10) i=1 E[ F(xi t) + L xt xi t ] t=1 Htβ2 t v u u t 1 i=1 E Ai t 2, where G = F ( x1) F ργη + 4ν2 ρ2(1 ν) + 9 2ρ2β0 9 2ρ2 σ2. Faster Adaptive Decentralized Learning Algorithms Remark 5.2. Based on Assumption 3.1, if using Barzilai-Borwein-like adaptive matrix (5), we have q 1 T PT t=1 1 m Pm i=1 E Ai t 2 L + ρ. Let βt = 1 T 2/3 for all t 1, we have Ht = 9 2βt + 8ν2 (1 ν)2 = O(T 2/3), (11) and then we can obtain q 1 T PT t=1 Htβ2 t = O( 1 T 1/3 ). We set θt = θ 29γηL2 6ρ for all t 1. Meanwhile, we can set ρ = O(1), γ = O(1) and η = O( 1 T 1/3 ). Then we obtain G = O(T 1/3) and t=1 E F( xt) O( 1 T 1/3 + σ T 1/3 ) i=1 E Ai t 2. (12) 1 T PT t=1 1 m Pm i=1 E Ai t 2 = O(1), let t=1 E F( xt) O( 1 T 1/3 + σ T 1/3 ) ϵ, (13) we have T = O(ϵ 3). Since our Ada MDOS algorithm only use one sample at each iteration, it obtains a near-optimal sample complexity of 1 T = O(ϵ 3) for finding an ϵstationary point of Problem (1), which matches the lower bound of smooth nonconvex stochastic optimization (Arjevani et al., 2023). Assumption 5.3. (Lipschitz Continuity) For any i [m], each component loss function f i(x; ξi) is M-Lipschitz continuity, such that for all x Rd f i(x; ξi) M. (14) Assumption 5.3 is commonly used in the adaptive gradient algorithms (Reddi et al., 2019; Chen et al., 2018; Guo et al., 2021; Sun et al., 2022; Chen et al., 2023). Remark 5.4. Based on Assumption 5.3, if using Adamlike adaptive matrix given in Algorithm 1, we have q 1 T PT t=1 1 m Pm i=1 E Ai t 2 M + ρ. Based on the above Theorem 5.1, our Ada MDOS algorithm still obtains a near-optimal sample (gradient) complexity of O(ϵ 3) for finding an ϵ-stationary point of Problem (1). 5.2. Convergence Properties of our Ada MDOF Algorithm For Ada MDOF algorithm, we first define a useful Lyapunov function for any t 1 Φt = E h F( xt) + (λt 1 9γη 2ρ ) ut 1 f(xt 1) 2 i=1 ui t 1 f i(xi t 1) 2+ ργη i=1 gi t 1 2 k=1 f i k(xi t 1) zi k,t 1 2 + (θt 1 19γηL2 i=1 xi t 1 xt 1 2 + (ϑt 1 3γη i=1 wi t 1 wt 1 2i , where αt 0, χt 0, λt 9γη 2ρ , θt 29γηL2 4ρ and ηt = η for all t 0. Theorem 5.5. Suppose the sequences {xi t}m i=1 T t=1 be generated from Algorithm 2. Under the above Assumptions 3.1-3.5, and let ηt = η, 0 < βt 1 for all t 0, γ min ρ(1 ν2) 48θt , 3ρ(1 ν2)θt 6(1+ν2) Ht , γ(3+ν2) Ht with Ht = 9 bβt + 6ν2 b(1 ν)2 + 4n2β2 t b3 9 βt + 9ν2 (1 ν)2 + 3ν2 (1 ν)2 for all t 0, we have t=1 E F( xt) (15) i=1 E[ F(xi t) + L xt xi t ] i=1 E Ai t 2, where G = F ( x1) F ρ2 + 18β2 0ν2 ρ2(1 ν)2 + 3ν2 ρ2(1 ν)2 + 9 2ρ2β0 9 2ρ2 1 m Pm i=1 1 n Pn k=1 f i k(xi 0) 2 is independent on T, b and n. Remark 5.6. Based on Assumption 3.1, if using Barzilai-Borwein-like adaptive matrix (5), we have q 1 T PT t=1 1 m Pm i=1 E Ai t 2 L + ρ. Based on Assump- tion 5.3, we have q 1 T PT t=1 1 m Pm i=1 E Ai t 2 M + ρ. Let θt = θ 29γηL2 6ρ , b = n and βt = b n for all t 1, Faster Adaptive Decentralized Learning Algorithms Ht = 9 bβt + 6ν2 b(1 ν)2 + 4n2β2 t b3 9 (1 ν)2 + 3ν2 (1 ν)2 . (16) Then we have 1 Ht 1 r 6(1 + ν2) Ht , γ(3 + ν2) Ht 6(1 + ν2) , Thus we can let η = 45 + 45ν2 (1 ν)2 and γ = min ρ(1 ν2) 48θ , 3ρ(1 ν2)θ 58L2 . Note that we set βt = b n for all t 1, while we can set β0 (0, 1), which is independent on T, b and n. 
Let ρ = O(1), η = O(1) and γ = O(1), we have G = F ( x1) F ρ2 + 18β2 0ν2 ρ2(1 ν)2 + 3ν2 ρ2(1 ν)2 + 9 2ρ2β0 m Pm i=1 1 n Pn k=1 f i k(xi 0) 2 = O(1) is indepen- dent on T, b and n. Since q 1 T PT t=1 1 m Pm i=1 E Ai t 2 = O(1), set t=1 E F( xt) O( 1 we have T = O(ϵ 2). Since our Ada MDOF algorithm requires b samples, we can obtain a near-optimal sample complexity of T b = O( nϵ 2), which matches the lower bound of smooth nonconvex finite-sum optimization (Fang et al., 2018). Figure 1: Stationary gap vs epoch at w8a dataset under the ring network (Left) and the 3-regular network (Right). 6. Numerical Experiments In this section, we apply some numerical experiments to demonstrate efficiency of our Ada MDOS and Ada MDOF Figure 2: Stationary gap vs epoch at covertype dataset under the ring network (Left) and the 3-regular network (Right). algorithms. Since the Ada RWGD (Sun et al., 2022) relies on the random walk instead of parallel framework used in other algorithms, our adaptive decentralized algorithms only compare to the existing adaptive decentralized methods (i.e, DADAM (Nazari et al., 2022), DAMSGrad (Chen et al., 2023), DAda Grad (Chen et al., 2023)) given in Tabel 1. In decentralized algorithms, we consider two classical undirected networks that connect all clients, i.e., the ring and 3-regular expander networks (Hoory et al., 2006), described in the following Appendix B. 6.1. Training Logistic Model In this subsection, we consider learning a non-convex logistic model (Allen-Zhu & Hazan, 2016) for binary classification over a decentralized network of m nodes with n data samples at each node: min x Rd 1 m k=1 f i(x; ξi k) + λ x 2 , (17) where λ > 0 and f i(x; ξi k) = 1 1+exp(li k ai k,x ) is a noncon- vex sigmoid loss function. Here ξi k = (ai k, li k) denotes the k-th sample at i-th node, where ai k Rd denotes the features and li k { 1, 1} is a label. In the experiment, we set the regularization parameter λ = 10 5, and use the same initial solution x0 = xi 0 = 0.01 ones(d, 1) for all i [m] for all algorithms. We use public w8a and covertype datasets1. The w8a dataset includes 60,000 training examples, where we partitioned into 5 clients each containing 12,000 training examples. The covertype dataset includes 100,000 training examples, where we partitioned into 5 clients each containing 20,000 training examples. In the experiment, we characterize performance of the algorithms in comparison in terms of the decrease of stationary gap versus epochs, where the stationary gap is defined as F( xt) + 1 m Pm i=1 xt xi t , where xi t is the estimate of the stationary solution at the i-th node and xt = 1 m Pm i=1 xi t, and each epoch represents n component gradient computa- 1available at https://www.openml.org/ Faster Adaptive Decentralized Learning Algorithms Figure 3: Training CNN on MNIST dataset: training loss vs epoch (Left), training accuracy (%) vs epoch (Middle), and test accuracy (%) vs epoch (Right) under the ring network. Figure 4: Training CNN on MNIST dataset: training loss vs epoch (Left), training accuracy (%) vs epoch (Middle), and test accuracy (%) vs epoch (Right) under the 3-regular network. tions at each node. In the experiment, for fair comparison, we use the batch size b = 10 in all algorithms, and set β1 = β2 = 0.9 in the DADAM (Nazari et al., 2022) and DAMSGrad (Chen et al., 2023), and set β1 = 0.9 in the DAda Grad (Chen et al., 2023), and set ϱ = βt = ηt = 0.9 for all t 1 in our algorithms. 
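For reference, the following sketch implements the nonconvex sigmoid objective (17) and the stationary-gap metric reported in Figures 1 and 2. The synthetic features and labels are hypothetical stand-ins for the w8a and covertype data; only the loss, its gradient, and the gap formula follow the text above.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d, lam = 5, 200, 20, 1e-5          # synthetic stand-in for the 5-client data split

A = [rng.normal(size=(n, d)) for _ in range(m)]            # features a^i_k (hypothetical data)
L = [rng.choice([-1.0, 1.0], size=n) for _ in range(m)]    # labels l^i_k in {-1, +1}

def local_loss(i, x):
    """(1/n) sum_k 1/(1 + exp(l^i_k <a^i_k, x>)) + lam * ||x||^2, the nonconvex sigmoid loss (17)."""
    z = L[i] * (A[i] @ x)
    return np.mean(1.0 / (1.0 + np.exp(z))) + lam * x @ x

def local_grad(i, x):
    z = L[i] * (A[i] @ x)
    s = 1.0 / (1.0 + np.exp(-z))                 # sigma(z)
    coef = -s * (1.0 - s) * L[i]                 # d/dz [1/(1+e^z)] = -sigma(z) * sigma(-z)
    return A[i].T @ coef / n + 2.0 * lam * x

def stationary_gap(X):
    """||grad F(xbar)|| + (1/m) sum_i ||xbar - x^i||, the metric plotted in Figures 1 and 2."""
    xbar = X.mean(axis=0)
    g = np.mean([local_grad(i, xbar) for i in range(m)], axis=0)
    return np.linalg.norm(g) + np.mean(np.linalg.norm(X - xbar, axis=1))

X0 = np.full((m, d), 0.01)                        # x^i_0 = 0.01 * ones(d, 1), as in the experiments
print(local_loss(0, X0[0]), stationary_gap(X0))
```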
Figures 1 and 2 show that our Ada MDOS method outperforms all comparisons, while due to requiring large batchsize, our Ada MDOF is comparable with the existing adaptive methods. 6.2. Training Convolutional Neural Network In this subsection, we consider training a Convolutional Neural Network (CNN) for MNIST classification over a decentralized network. Here we use the same CNN architecture as in (Mc Mahan et al., 2017). This CNN includes two 5 5 convolution layers (the first with 32 channels, the second with 64, each followed with 2 2 max pooling), a fully connected layer with 512 units and Re Lu activation, and a final softmax output layer (1,663,370 total parameters). The MNIST dataset (Le Cun et al., 2010) consists of 10 classes of 28 28 grayscale images, which includes 60,000 training examples and 10,000 testing examples, which we partitioned into 5 clients each containing 12000 training and 2000 testing examples. In the experiment, we characterize performance of the algorithms in comparison in terms of the decrease of training loss versus epochs, where the loss denotes the objective function value in training CNN. Meanwhile, we also use the training accuracy and test accuracy, where the accuracy denotes the classification accuracy. For fair comparison, we use the batch size b = 10 in all algorithms, and set β1 = β2 = 0.9 in the DADAM (Nazari et al., 2022) and DAMSGrad (Chen et al., 2023), and set β1 = 0.9 in the DAda Grad (Chen et al., 2023), and set ϱ = βt = ηt = 0.9 for all t 1 in our algorithms. Figures 3 and 4 also show that our Ada MDOS method outperforms all comparisons, while due to requiring large batch size, our Ada MDOF is comparable with the existing adaptive methods. 6.3. Training Residual Network In this subsection, we consider training a residual neural network for Tiny-Image Net classification over a decentralized network. Here we use the Res Net-18 as in (He et al., 2016), which includes a 3 3 convolution layer followed with batch-norm and Re LU activation, eight residual blocks Faster Adaptive Decentralized Learning Algorithms (start with 64 channels, channel number doubled at third, fifth and seventh block, end with 512 channels), a 4 4 max pooling, a fully connected layer with 512 units and Re LU activation, and a final softmax output layer. Each residual block contains a shortcut and two 3 3 convolution layers, the first followed with batch-norm and Re LU activation, the second followed with batch-norm. The Tiny-Image Net dataset (Le & Yang, 2015) consists of 200 classes of 64 x 64 RGB images, which includes 100,000 training examples and 10,000 testing examples, respectively. Here we partitioned into 5 clients, where each client contains 20,000 training and 2000 testing examples, respectively. For fair comparison, we use the batch size b = 10 in all algorithms, and set β1 = β2 = 0.9 in the DADAM and DAMSGrad, and set β1 = 0.9 in the DAda Grad, and set ϱ = βt = ηt = 0.9 for all t 1 in our algorithms. In this experiment, we add two basic non-adaptive decentralized algorithms, i.e., D-PSGD (Lian et al., 2017) and D2 (Tang et al., 2018), as the comparisons. From Figure 5, we find that although DADAM and D2 methods outperform our Ada MDOS method at the beginning of the iteration, while our Ada MDOS then outperforms all comparisons. Figure 5: Training Res Net-18 on Tiny-Image Net dataset: test accuracy (%) vs epoch under the ring network (Left) and the 3-regular network (Right). 7. 
Conclusion In the paper, we studied the distributed nonconvex stochastic and finite-sum optimization problems over a network. Moreover, we proposed a faster adaptive momentum-based decentralized optimization algorithm (i.e., Ada MDOS) to solve the stochastic problems, which reaches a near-optimal sample complexity of O(ϵ 3) for nonconvex stochastic optimization. Meanwhile, we proposed a faster adaptive momentum-based decentralized optimization algorithm (i.e., Ada MDOF) to solve the finite-sum problems, which obtains a near-optimal sample complexity of O( nϵ 2) for nonconvex finite-sum optimization. In particular, our methods use a unified adaptive matrix including various types of adaptive learning rate. Acknowledgements We thank the anonymous reviewers for their helpful comments. This paper was partially supported by NSFC under Grant No. 62376125. It was also partially supported by the Fundamental Research Funds for the Central Universities NO.NJ2023032. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International conference on machine learning, pp. 699 707. PMLR, 2016. Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199 (1-2):165 214, 2023. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 3(1):1 122, 2011. Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. ar Xiv preprint ar Xiv:1808.02941, 2018. Chen, X., Karimi, B., Zhao, W., and Li, P. On the convergence of decentralized adaptive gradient methods. In Asian Conference on Machine Learning, pp. 217 232. PMLR, 2023. Chen, Z., Zhou, Y., Chen, R.-R., and Zou, S. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In International Conference on Machine Learning, pp. 3794 3834. PMLR, 2022. Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019. Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. Advances in neural information processing systems, 27, 2014. Di Lorenzo, P. and Scutari, G. Next: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120 136, 2016. Faster Adaptive Decentralized Learning Algorithms Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011a. Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control, 57(3):592 606, 2011b. Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Nearoptimal non-convex optimization via stochastic pathintegrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689 699, 2018. Guo, Z., Xu, Y., Yin, W., Jin, R., and Yang, T. 
A novel convergence analysis for algorithms of the adam family. ar Xiv preprint ar Xiv:2112.03459, 2021. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hong, M., Hajinezhad, D., and Zhao, M.-M. Prox-pda: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning, pp. 1529 1538. PMLR, 2017. Hoory, S., Linial, N., and Wigderson, A. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439 561, 2006. Huang, F., Li, J., and Huang, H. Super-adam: faster and universal framework of adaptive gradients. Advances in Neural Information Processing Systems, 34:9074 9085, 2021. Kavis, A., Skoulakis, S., Antonakopoulos, K., Dadi, L. T., and Cevher, V. Adaptive stochastic variance reduction for non-convex finite-sum minimization. Advances in Neural Information Processing Systems, 35:23524 23538, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Le, Y. and Yang, X. S. Tiny imagenet visual recognition challenge. 2015. Le Cun, Y., Cortes, C., and Burges, C. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. Levy, K., Kavis, A., and Cevher, V. Storm+: Fully adaptive sgd with recursive momentum for nonconvex optimization. Advances in Neural Information Processing Systems, 34:20571 20582, 2021. Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983 992. PMLR, 2019. Li, Z., Hanzely, S., and Richt arik, P. Zerosarah: Efficient nonconvex finite-sum optimization with zero full gradient computation. ar Xiv preprint ar Xiv:2103.01447, 2021. Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30, 2017. Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273 1282. PMLR, 2017. Nazari, P., Tarzanagh, D. A., and Michailidis, G. Dadam: A consensus-based distributed adaptive gradient method for online optimization. IEEE Transactions on Signal Processing, 70:6065 6079, 2022. Nguyen, L. M., Liu, J., Scheinberg, K., and Tak aˇc, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International conference on machine learning, pp. 2613 2621. PMLR, 2017. Pan, T., Liu, J., and Wang, J. D-spider-sfo: A decentralized optimization algorithm with faster convergence rate for nonconvex problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 1619 1626, 2020. Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. ar Xiv preprint ar Xiv:1904.09237, 2019. Shi, W., Ling, Q., Wu, G., and Yin, W. Extra: An exact firstorder algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944 966, 2015. Sun, H., Lu, S., and Hong, M. 
Improving the sample and communication complexity for decentralized non-convex optimization: Joint gradient estimation and tracking. In International conference on machine learning, pp. 9217 9228. PMLR, 2020. Sun, T., Li, D., and Wang, B. Adaptive random walk gradient descent for decentralized optimization. In International Conference on Machine Learning, pp. 20790 20809. PMLR, 2022. Faster Adaptive Decentralized Learning Algorithms Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. d2: Decentralized training over decentralized data. In International Conference on Machine Learning, pp. 4848 4856. PMLR, 2018. Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005 1071, 2022. Xin, R., Khan, U., and Kar, S. A hybrid variance-reduced method for decentralized stochastic non-convex optimization. In International Conference on Machine Learning, pp. 11459 11469. PMLR, 2021a. Xin, R., Khan, U. A., and Kar, S. An improved convergence analysis for decentralized online stochastic non-convex optimization. IEEE Transactions on Signal Processing, 69:1842 1858, 2021b. Xin, R., Khan, U. A., and Kar, S. Fast decentralized nonconvex finite-sum optimization with recursive variance reduction. SIAM Journal on Optimization, 32(1):1 28, 2022. Xu, J., Zhu, S., Soh, Y. C., and Xie, L. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes. In 2015 54th IEEE Conference on Decision and Control (CDC), pp. 2055 2060. IEEE, 2015. Yun, J., Lozano, A. C., and Yang, E. Adaptive proximal gradient methods for structured neural networks. Advances in Neural Information Processing Systems, 34: 24365 24378, 2021. Zhan, W., Wu, G., and Gao, H. Efficient decentralized stochastic gradient descent method for nonconvex finitesum optimization problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 9006 9013, 2022. Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383 15393, 2020. Zhang, X., Liu, J., Zhu, Z., and Bentley, E. S. Gt-storm: Taming sample, communication, and memory complexities in decentralized non-convex learning. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pp. 271 280, 2021. Zhuang, J., Tang, T., Ding, Y., Tatikonda, S. C., Dvornek, N., Papademetris, X., and Duncan, J. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in neural information processing systems, 33: 18795 18806, 2020. Faster Adaptive Decentralized Learning Algorithms A. Appendix In this section, we provide the detailed convergence analysis of our algorithms. We first give some notations and review some useful lemmas. For the stochastic problem (1). Let Et = Eξt,ξt 1, ,ξ1 with ξt {ξi t}m i=1. Let xt = 1 m Pm i=1 xi t, wt = 1 m Pm i=1 wi t, f i(xi t) = E[ f i(xi t; ξi t)] for all i [m], t 1 and i=1 f i(xi t), F( xt) = 1 i=1 f i( xt). (18) For the finite-sum problem (2). Let Et = EIt,It 1, ,I1 with It {Ii t}m i=1.Let xt = 1 m Pm i=1 xi t, wt = 1 m Pm i=1 wi t, f i(xi t) = 1 n Pn k=1 f i k(xi t) for all i [m] and k=1 f i k(xi t)) = 1 i=1 f i(xi t), F( xt) = 1 i=1 f i( xt). (19) Lemma A.1. 
Given m vectors {ui}m i=1, the following inequalities satisfy: ||ui + uj||2 (1 + c)||ui||2 + (1 + 1 for any c > 0, and || 1 m Pm i=1 ui||2 1 m Pm i=1 ||ui||2. Lemma A.2. Given a finite sequence {ui}m i=1, and u = 1 m Pm i=1 ui, the following inequality satisfies Pm i=1 ui u 2 Pm i=1 ui 2. Lemma A.3. The sequences {ui t 1, wi t 1}m i=1 be generated from our Algorithm 1 or 2, we have for all t 1, i=1 ui t = ut = wt = 1 i=1 wi t. (20) Proof. We proceed by induction. From our Algorithm 1, since wi 1 = P j Ni Wi,juj 1, we have i=1 wi 1 = 1 j Ni Wi,juj 1 = 1 i=1 Wi,j = 1 j=1 uj 1 = u1, (21) where the second last equality is due to Pm i=1 Wi,j = 1 from Assumption 3.4. From the line 10 of Algorithm 1, we have for all t 1 i=1 wi t+1 = 1 j Ni Wi,j wj t + uj t+1 uj t wj t + uj t+1 uj t m X i=1 Wi,j = 1 wj t + uj t+1 uj t = wt + ut+1 ut = ut+1, (22) where the second last equality is due to Ni = {j V | (i, j) E, j = i} and Pm i=1 Wi,j = 1, and the last equality holds by the inductive hypothesis, i.e., wt = ut. Lemma A.4. Suppose the sequence xi t 1, xi t 1 m i=1 be generated from Algorithm 1 or 2. Let 0 < γ ρ 4Lηt for all t 0, then we have F( xt+1) F( xt) + 2γηt ρ F( xt) ut 2 + γηt i=1 wi t wt 2 ργηt i=1 gi t 2, (23) where gi t = (Ai t) 1wi t for any i [m]. Faster Adaptive Decentralized Learning Algorithms Proof. Let gi t = (Ai t) 1wi t, we have xi t+1 = P j Ni Wi,jxj t γ(Ai t) 1wi t. Then we have i=1 xi t+1 = 1 j Ni Wi,jxj t γ 1 i=1 (Ai t) 1wi t i=1 Wi,j γ 1 i=1 gi t = xt γ gt, (24) where the last equality is due to Pm i=1 Wi,j = 1. Since Ai t ρId for all i [m], we have ρ gi t 2 Ai tgi t, gi t = wi t, gi t (i) = wi t wt, gi t + ut, gi t 2ρ wi t wt 2 + ρ 2 gi t 2 + ut, gi t , (25) where the equality (i) is due to ut = wt. So we can obtain 2ρ wi t wt 2 ρ 2 gi t 2 + ut, gi t , (26) Then we have for any i [m] 2ρ wi t wt 2 ργηt 2 gi t 2 + γηt ut, gi t . (27) Then taking an average over i from 1 to m yields that i=1 wi t wt 2 ργηt i=1 gi t 2 + γηt 1 m i=1 ut, gi t i=1 wi t wt 2 ργηt i=1 gi t 2 + γηt ut, gt . (28) According to Assumption 3.1, i.e., the function F(x) is L-smooth, we have F( xt+1) F( xt) + F( xt), xt+1 xt + L 2 xt+1 xt 2 = F( xt) + ηt F( xt), xt+1 xt + Lη2 t 2 xt+1 xt 2 = F( xt) + ηt F( xt) ut + ut, γ gt + Lγ2η2 t 2 gt 2 F( xt) + 2γηt ρ F( xt) ut 2 + ργηt 8 gt 2 γηt ut, gt + Lγ2η2 t 2 gt 2 F( xt) + 2γηt ρ F( xt) ut 2 + ργηt i=1 gi t 2 γηt ut, gt + Lγ2η2 t 2 1 m i=1 gi t 2, (29) where the second equality is due to xt+1 = xt + ηt( xt+1 xt). By summing the above inequalities (28) with (29), we can obtain F( xt+1) F( xt) + 2γηt ρ F( xt) ut 2 + γηt i=1 wi t wt 2 (3ργηt 8 Lγ2η2 t 2 ) 1 F( xt) + 2γηt ρ F( xt) ut 2 + γηt i=1 wi t wt 2 ργηt i=1 gi t 2, (30) where the last inequality is due to γ ρ 4Lηt for all t 1. Faster Adaptive Decentralized Learning Algorithms A.1. Convergence Analysis of Ada MDOS Algorithm In this subsection, we provide the convergence analysis of our Ada MDOS Algorithm for stochastic optimization. Lemma A.5. Under the above assumptions, and assume the stochastic gradient estimators ui t 1 m i=1 be generated from Algorithm 1, we have E ui t+1 f i(xt+1) 2 (1 βt+1)E ui t f i(xt) 2 + 2β2 t+1σ2 + 2L2η2 t E xi t+1 xi t 2, i [m] (31) E ut+1 f(xt+1) 2 (1 βt+1)E ut f(xt) 2 + 2β2 t+1σ2 m + 2L2η2 t m2 i=1 E xi t+1 xi t 2, (32) where f(xt) = 1 m Pm i=1 f i(xi t). Proof. Since ui t+1 = f i(xi t+1; ξi t+1) + (1 βt+1)(ui t f i(xi t; ξi t+1)) for any i [m], we have f i(xi t+1; ξi t+1) + (1 βt+1)(ui t f i(xi t; ξi t+1)) = f(xt+1; ξt+1) + (1 βt+1)( ut f(xt; ξt+1)). 
(33) Then we have E ut+1 f(xt+1) 2 ui t+1 f i(xi t+1) 2 f i(xi t+1; ξi t+1) + (1 βt+1) ui t f i(xi t; ξi t+1) f i(xi t+1) 2 f i(xi t+1; ξi t+1) f i(xi t+1) (1 βt+1) f i(xi t; ξi t+1) f i(xi t) + (1 βt+1) 1 ui t f i(xi t) 2 i=1 E f i(xi t+1; ξi t+1) f i(xi t+1) (1 βt+1) f i(xi t; ξi t+1) f i(xi t) 2 + (1 βt+1)2E ut f(xt) 2 i=1 E f i(xi t+1; ξi t+1) f i(xi t+1) f i(xi t; ξi t+1) + f i(xi t) 2 + 2β2 t+1 m2 i=1 E f i(xi t+1; ξi t+1) f i(xi t+1) 2 + (1 βt+1)2E ut f(xt) 2 i=1 E f i(xi t+1; ξi t+1) f i(xi t; ξi t+1) 2 + 2β2 t+1σ2 m + (1 βt+1)2E ut f(xt) 2 (1 βt+1)2E ut f(xt) 2 + 2β2 t+1σ2 m + 2(1 βt+1)2L2 i=1 E xi t+1 xi t 2 (1 βt+1)E ut f(xt) 2 + 2β2 t+1σ2 m + 2L2η2 t m2 i=1 E xi t+1 xi t 2, where the forth equality holds by the following fact: for any i [m], Eξi t+1 f i(xi t+1; ξi t+1) = f i(xi t+1), Eξi t+1 f i(xi t; ξi t+1)) = f i(xi t), Faster Adaptive Decentralized Learning Algorithms and for any i = j [m], ξi t+1 and ξj t+1 are independent; the second inequality holds by the inequality E ζ E[ζ] 2 E ζ 2 and Assumption 3.4; the second last inequality is due to Assumption 3.2; the last inequality holds by 0 < βt+1 1 and xi t+1 = xi t + ηt( xi t+1 xi t). Similarly, we have E ui t+1 f i(xt+1) 2 (1 βt+1)E ui t f i(xt) 2 + 2β2 t+1σ2 + 2L2η2 t E xi t+1 xi t 2, i [m]. (34) Lemma A.6. Given the sequence xi t 1, wi t 1 m i=1 be generated from Algorithm 1. We have i=1 xi t+1 xt+1 2 (1 (1 ν2)ηt i=1 xi t xt 2 + 2ηtγ2 i=1 gi t gt 2, i=1 xi t+1 xi t 2 (3 + ν2) i=1 xi t xt 2 + 2(1 + ν2) 1 ν2 γ2 m X i=1 wi t+1 wt+1 2 ν i=1 wi t wt 2 + ν2 1 ν 4β2 t+1 i=1 ui t f i(xi t) 2 + 4β2 t+1mσ2 + 8η2 t L2 m X i=1 xi t+1 xi t 2 . Proof. For notational simplicity, let xt = [(x1 t)T , , (xm t )T ]T Rmd, xt = [( x1 t)T , , ( xm t )T ]T Rmd and gt = [(g1 t )T , , (gm t )T ]T Rmd for all t 1. By using Assumption 3.4, since W1 = 1 and W = W Id, we have W(1 xt) = 1 xt. Meanwhile, we have 1T (xt 1 x) = 0 and W1 = 1. Thus, we have Wxt 1 xt 2 = W(xt 1 xt) 2 ν2 xt 1 xt 2, (35) where the above inequality holds by xt 1 xt is orthogonal to 1 that is the eigenvector corresponding to the largest eigenvalue of W, and ν denotes the second largest eigenvalue of W. Since xi t+1 = P j Ni Wi,jxj t γ(Ai t) 1wi t = P j Ni Wi,jxj t γgi t for all i [m], we have xt+1 = Wxt γgt and xt+1 = xt γ gt. Since xt+1 = xt + ηt( xt+1 xt) and xt+1 = xt + ηt( xt+1 xt), we have i=1 xi t+1 xt+1 2 = xt+1 1 xt+1 2 (36) = xt + ηt( xt+1 xt) 1 ( xt + ηt( xt+1 xt) 2 (1 + α1)(1 ηt)2 xt 1 xt 2 + (1 + 1 α1 )η2 t xt+1 1 xt+1 2 (i) =(1 ηt) xt 1 xt 2 + ηt xt+1 1 xt+1 2 = (1 ηt) xt 1 xt 2 + ηt Wxt γgt 1 xt γ gt 2 (1 ηt) xt 1 xt 2 + (1 + α2)ηt Wxt 1 xt 2 + (1 + 1 α2 )ηtγ2 gt 1 gt 2 (ii) (1 ηt) xt 1 xt 2 + (1 + ν2)ηt 2 xt 1 xt 2 + ηtγ2(1 + ν2) 1 ν2 gt 1 gt 2 (iii) (1 (1 ν2)ηt 2 ) xt 1 xt 2 + 2ηtγ2 1 ν2 gt 1 gt 2 = (1 (1 ν2)ηt i=1 xi t xt 2 + 2ηtγ2 i=1 gi t gt 2, (37) Faster Adaptive Decentralized Learning Algorithms where the above equality (i) is due to α1 = ηt 1 ηt , and the second inequality (ii) holds by α2 = 1 ν2 2ν2 and Wxt 1 xt 2 ν2 xt 1 xt 2, and the above inequality (ii) is due to 0 < ν < 1. Meanwhile, we have i=1 xi t+1 xt 2 = xt+1 1 xt 2 = Wxt γgt 1 xt 2 (1 + α2)ν2 xt 1 xt 2 + (1 + 1 α2 )γ2 gt 2 (i) = 1 + ν2 i=1 xi t xt 2 + 1 + ν2 1 ν2 γ2 m X i=1 gi t 2, (38) where the last equality (i) holds by α2 = 1 ν2 2ν2 . Then we have i=1 xi t+1 xi t 2 = xt+1 xt 2 = xt+1 1 xt + 1 xt xt 2 2 xt+1 1 xt 2 + 2 xt 1 xt 2 i=1 xi t xt 2 + 2(1 + ν2) 1 ν2 γ2 m X i=1 gi t 2. 
(39) Let wt = [(w1 t )T , (w2 t )T , , (wm t )T ]T , ut = [(u1 t)T , (u2 t)T , , (um t )T ]T and wt = 1 m Pm i=1 wi t and ut = 1 m Pm i=1 ui t. Then we have for any t 1, wt+1 = W wt + ut+1 ut . According to the above proof of Lemma A.3, we have wt+1 = wt + ut+1 ut for all t 1. Thus we have i=1 wi t+1 wt+1 2 = wt+1 1 wt+1 2 = W wt + ut+1 ut 1 wt + ut+1 ut 2 (1 + c) Wwt 1 wt 2 + (1 + 1 c ) W ut+1 ut 1 ut+1 ut 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 ut+1 ut 1 ut+1 ut 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 ut+1 ut 2, (40) where the last inequality holds by Lemma A.2. Faster Adaptive Decentralized Learning Algorithms Since ui t+1 = f i(xi t+1; ξi t+1) + (1 βt+1)(ui t f i(xi t; ξi t+1)) for any i [m] and t 1, we have ut+1 ut 2 = i=1 ui t+1 ui t 2 i=1 f i(xi t+1; ξi t+1) βt+1ui t (1 βt+1) f i(xi t; ξi t+1) 2 i=1 βt+1( f i(xi t+1; ξi t+1) f i(xi t+1)) + βt+1( f i(xi t+1) f i(xi t)) + βt+1( f i(xi t) ui t) + (1 βt+1)( f i(xi t+1; ξi t+1) f i(xi t; ξi t+1)) 2 i=1 f i(xi t+1; ξi t+1) f i(xi t+1) 2 + 4β2 t+1 i=1 f i(xi t+1) f i(xi t) 2 i=1 f i(xi t) ui t 2 + 4(1 βt+1)2 m X i=1 f i(xi t+1; ξi t+1) f i(xi t; ξi t+1) 2 i=1 ui t f i(xi t) 2 + 4β2 t+1mσ2 + 4β2 t+1L2 m X i=1 xi t+1 xi t 2 + 4(1 βt+1)2L2 m X i=1 xi t+1 xi t 2 i=1 ui t f i(xi t) 2 + 4β2 t+1mσ2 + 8η2 t L2 m X i=1 xi t+1 xi t 2, (41) where the last inequality holds by 0 < βt < 1 and xi t+1 = xi t + ηt( xi t+1 xi t). Plugging the above inequalities (41) into (40), we have i=1 wi t+1 wt+1 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 ut+1 ut 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 4β2 t+1 i=1 ui t f i(xi t) 2 + 4β2 t+1mσ2 + 8η2 t L2 m X i=1 xi t+1 xi t 2 . (42) ν 1, we have i=1 wi t+1 wt+1 2 ν i=1 wi t wt 2 + ν2 1 ν 4β2 t+1 i=1 ui t f i(xi t) 2 + 4β2 t+1mσ2 + 8η2 t L2 m X i=1 xi t+1 xi t 2 . (43) Theorem A.7. (Restatement of Theorem 5.1) Suppose the sequences {xi t}m i=1 T t=1 be generated from Algorithm 1. Under the above Assumptions 3.1-3.5, and let ηt = η, 0 < βt 1 for all t 0, γ min ρ(1 ν2) 48θt , 3ρ(1 ν2)θt Faster Adaptive Decentralized Learning Algorithms 3(1+ν2) Ht , γ(3+ν2) Ht with Ht = 9 2βt + 8ν2 (1 ν)2 for all t 1, we have t=1 E F( xt) 1 i=1 E[ F(xi t) + L xt xi t ] t=1 Htβ2 t v u u t 1 i=1 E Ai t 2. (44) where G = F ( x1) F ργη + 4ν2 ρ2(1 ν) + 9 2ρ2β0 9 2ρ2 σ2. Proof. Without loss of generality, let η = η1 = = ηT . According to Lemma A.4, we have F( xt+1) F( xt) + 2γη ρ F( xt) ut 2 + γη i=1 wi t wt 2 ργη i=1 gi t 2. (45) According to the Lemma A.5, we have E ut f(xt) 2 (1 βt)E ut 1 f(xt 1) 2 + 2β2 t σ2 i=1 E xi t xi t 1 2, (46) i=1 E ui t f i(xt) 2 (1 βt) 1 i=1 E ui t 1 f i(xt 1) 2 + 2β2 t σ2 + 2L2η2 1 i=1 E xi t xi t 1 2. (47) According to Lemma A.6, we have i=1 E wi t wt 2 ν 1 i=1 wi t 1 wt 1 2 + 4ν2 2L2η2 xi t xi t 1 2 + β2 t f i(xi t 1) ui t 1 2 + β2 t mσ2 . (48) Meanwhile, we also have i=1 xi t xt 2 (1 (1 ν2)η i=1 xi t 1 xt 1 2 + 2ηγ2 i=1 gi t 1 gt 1 2 i=1 xi t 1 xt 1 2 + 2ηγ2 i=1 ( gi t 1 2 + gt 1 2) i=1 xi t 1 xt 1 2 + 4ηγ2 i=1 gi t 1 2. (49) i=1 xi t xi t 1 2 (3 + ν2) 1 i=1 xi t 1 xt 1 2 + 2(1 + ν2) i=1 gi t 1 2. (50) Faster Adaptive Decentralized Learning Algorithms Next considering the term ut F( xt) 2, we have ut F( xt) 2 = ut f(xt) + f(xt) F( xt) 2 2 ut f(xt) 2 + 2 f(xt) F( xt) 2 2 ut f(xt) 2 + 2 1 i=1 f i(xi t) 1 i=1 f i( xt) 2 2 ut f(xt) 2 + 2L2 i=1 xi t xt 2, where the last inequality is due to Assumption 3.1. Then we can obtain ut f(xt) 2 1 2 ut F( xt) 2 + L2 i=1 xi t xt 2. 
(51) Since ut = wt for all t 1, we have i=1 wi t F(xi t) 2 = 1 i=1 wi t wt + ut F( xt) + F( xt) F(xi t) 2 i=1 wi t wt 2 + 3 ut F( xt) 2 + 3 1 i=1 F( xt) F(xi t) 2 i=1 wi t wt 2 + 3 ut F( xt) 2 + 3L2 1 i=1 xi t xt 2. Then we have ut F( xt) 2 1 i=1 wi t F(xi t) 2 + 1 i=1 wi t wt 2 + L2 i=1 xi t xt 2. (52) We define a useful Lyapunov function (i.e., potential function), for any t 1 Ωt = Et F( xt) + (λt 1 9γη 2ρ ) ut 1 f(xt 1) 2 + χt 1 1 m i=1 ui t 1 f i(xi t 1) 2 (53) + (θt 1 19γηL2 i=1 xi t 1 xt 1 2 + (ϑt 1 γη i=1 wi t 1 wt 1 2 i=1 gi t 1 2 , Faster Adaptive Decentralized Learning Algorithms where χt 1 0, λt 1 9γη 2ρ , θt 1 29γηL2 6ρ and ϑt 1 γη 4ρ for all t 1. Then we have Ωt+1 = Et+1 F( xt+1) + (λt 9γη 2ρ ) ut f(xt) 2 + χt 1 m i=1 ui t f i(xi t) 2 + (θt 29γηL2 i=1 xi t xt 2 + (ϑt 3γη i=1 wi t wt 2 + ργη (i) Et+1 h F( xt) γη 4ρ ut F( xt) 2 ργη i=1 gi t 2 + 9γηL2 i=1 xi t xt 2 + λt ut f(xt) 2 i=1 ui t f i(xi t) 2 + (θt 29γηL2 i=1 xi t xt 2 + (ϑt γη i=1 wi t wt 2i (ii) Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + λt(1 βt) ut 1 f(xt 1) 2 + 2λtβ2 t σ2 m + 2λt L2η2 i=1 xi t xi t 1 2 + χt(1 βt) 1 i=1 ui t 1 f i(xt 1) 2 + 2χtβ2 t σ2 + 2χt L2η2 1 i=1 xi t xi t 1 2 + θt(1 (1 ν2)η i=1 xi t 1 xt 1 2 + θt 4ηγ2 i=1 gi t 1 2 i=1 wi t 1 wt 1 2 + 4ϑtν2 2L2η2 xi t xi t 1 2 + β2 t f i(xi t 1) ui t 1 2 + β2 t mσ2 i , (54) where the inequality (i) is due to the above inequalities (45) and (51); and the inequality (ii) holds by the above inequalities (46), (47), (49) and (52). Faster Adaptive Decentralized Learning Algorithms Then we have Ωt+1 Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + λt(1 βt) ut 1 f(xt 1) 2 + χt χtβt + 4ϑtβ2 t ν2 i=1 ui t 1 f i(xi t 1) 2 + 2L2η2 λt + χt + 4ϑtν2 i=1 xi t xi t 1 2 + θt(1 (1 ν2)η i=1 xi t 1 xt 1 2 i=1 gi t 1 2 + ϑtν 1 i=1 wi t 1 wt 1 2 + 2λt m + 2χt + 4ϑtν2 1 ν β2 t σ2i (i) Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + λt(1 βt) ut 1 f(xt 1) 2 + χt χtβt + 4ϑtβ2 t ν2 i=1 ui t 1 f i(xi t 1) 2 + 2L2η2 λt + χt + 4ϑtν2 1 ν (3 + ν2) 1 i=1 xi t 1 xt 1 2 + 2(1 + ν2) i=1 gi t 1 2 + θt(1 (1 ν2)η i=1 xi t 1 xt 1 2 + θt 4ηγ2 i=1 gi t 1 2 i=1 wi t 1 wt 1 2 + 2λt m + 2χt + 4ϑtν2 1 ν β2 t σ2i , (55) where the inequality (i) is due to the above inequality (50). Since 0 < βt < 1 for all t 1, λt = 9γη 2ρβt 9γη 2ρ and λt λt 1, then we have λt(1 βt) λt 1 9γη 1 ν 4ϑtβtν2 1 ν and χt χt 1, we have χt χtβt + 4ϑtβ2 t ν2 1 ν χt 1. Let ϑt = γη ρ(1 ν) for all t 1, since 0 < ν < 1, we have ϑtν = ϑt (1 ν)ϑt ϑt 3γη 4ρ = ϑt 1 3γη 4ρ . Meanwhile, let θt θt 1 for all t 1, γ min ρ(1 ν2) 48θt , 3ρ(1 ν2)θt 58L2 , η min ρ 3(1+ν2) Ht , γ(3+ν2) Ht with Ht = 9 2βt + 8ν2 (1 ν)2 for all t 1, we θt(1 (1 ν2)η 2 ) + 2L2η2(3 + ν2) λt + χt + 4ϑtν2 1 ν θt 1 29γηL2 1 ν2 + 4(1 + ν2) 1 ν2 L2η2γ2 λt + χt + 4ϑtν2 Faster Adaptive Decentralized Learning Algorithms Based on the choice of these parameters and the above inequality (55), we have Ωt+1 Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + (λt 1 9γη 2ρ ) ut 1 f(xt 1) 2 + χt 1 1 m i=1 ui t 1 f i(xi t 1) 2 + (θt 1 29γηL2 i=1 xi t 1 xt 1 2 + (ϑt 1 3γη i=1 wi t 1 wt 1 2 i=1 gi t 1 2 + 2γηHt i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + 2γηHt ρ β2 t σ2. (58) Then we can obtain i=1 wi t F(xi t) 2 + L2 i=1 xi t xt 2 + 1 i=1 gi t 2 12(Ωt Ωt+1) ρ2 β2 t σ2. (59) Since x1 0 = x1 0 = = xm 0 = xm 0 , u1 0 = = um 0 and w1 0 = = wm 0 = 0, we have Ω1 = E F( x1) + (λ0 9γη 2ρ ) u0 f(x0) 2 + χ0 1 m i=1 ui 0 f i(xi 0) 2 F( x1) + χ0 + λ0 9γη = F( x1) + 4γην2 ρ(1 ν)2 + 9γη 2ρ σ2. 
(60) Let Mi t = gi t + 1 ρ F(xi t) wi t + L ρ xt xi t . Then we have Mi t = gi t + 1 ρ F(xi t) wi t + L (i) = (Ai t) 1wi t + 1 ρ F(xi t) wi t + L = 1 Ai t Ai t (Ai t) 1wi t + 1 ρ F(xi t) wi t + L 1 Ai t wi t + 1 ρ F(xi t) wi t + L (ii) 1 Ai t wi t + 1 Ai t F(xi t) wi t + L Ai t xt xi t F(xi t) + L xt xi t , (61) where the equality (i) holds by gi t = (Ai t) 1wi t, and the inequality (ii) holds by Ai t ρ for all t 1 due to Assumption 3.4. Then we have F(xi t) + L xt xi t Mi t Ai t . (62) Faster Adaptive Decentralized Learning Algorithms According to the above inequality (59), we have i=1 E[Mi t]2 1 i=1 wi t F(xi t) 2 + 3L2 i=1 xi t xt 2 + 3 36(Ωt Ωt+1) γρη 36(Ω1 F ) ρ2 β2 t σ2. (63) By using Cauchy-Schwarz inequality to the above inequality (62), we have i=1 E[ F(xi t) + L xt xi t ] 1 i=1 E Mi t Ai t i=1 E[Mi t]2 i=1 E Ai t 2. (64) By plugging the above inequalities (64) into (63), we can obtain i=1 E[ F(xi t) + L xt xi t ] 6 Ω1 F Tγρη + 12σ q 1 T PT t=1 Htβ2 t ρ v u u t 1 i=1 E Ai t 2. (65) Since F(x) is L-smooth, we have F( xt) = F( xt) F(xi t) + F(xi t) F(xi t) + L xi t xt . (66) Meanwhile, let G = F ( x1) F ργη + 4ν2 ρ2(1 ν) + 9 2ρ2β0 9 2ρ2 σ2. Then we have t=1 E F( xt) 1 i=1 E[ F(xi t) + L xt xi t ] Tγρη + 12σ q 1 T PT t=1 Htβ2 t ρ v u u t 1 i=1 E Ai t 2 1 T PT t=1 Htβ2 t ρ v u u t 1 i=1 E Ai t 2. (67) Let βt = 1 T 2/3 for all t 1, we have Ht = 9 2βt + 8ν2 (1 ν)2 = O(T 2/3). (68) We set θt = θ 29γηL2 6ρ for all t 1. Then we can set γ = O(1) and η = O( 1 T 1/3 ). Thus we have G = F ( x1) F ργη + 4ν2 ρ2(1 ν) + 9 2ρ2β0 9 2ρ2 σ2 = O(T 1/3), and then we can obtain t=1 E F( xt) 1 i=1 E[ F(xi t) + L xt xi t ] O( 1 T 1/3 + σ T 1/3 ) i=1 E Ai t 2. (69) Faster Adaptive Decentralized Learning Algorithms 1 T PT t=1 1 m Pm i=1 E Ai t 2 = O(1), set t=1 E F( xt) O( 1 T 1/3 + σ T 1/3 ) i=1 E Ai t 2 ϵ, (70) we have T = O(ϵ 3). Since our Algorithm 1, it reaches a near-optimal sample complexity of 1 T = O(ϵ 3) for finding an ϵ-stationary solution of Problem (1). A.2. Convergence Analysis of Ada MDOF Algorithm In this subsection, we provide the convergence analysis of our Ada MDOF Algorithm for finite-sum optimization. Lemma A.8. Under the above Assumptions 3.1-3.2, and assume the gradient estimators ui t 1 m i=1 be generated from Algorithm 2, we have Et ui t f i(xi t) 2 (1 βt) ui t 1 f i(xi t 1) 2 + 2β2 t b 1 n k=1 f i k(xi t 1) zi k,t 1 2 + 2L2η2 t 1 b xi t xi t 1 2, i [m] (71) Et ut f(xt) 2 (1 βt) ut 1 f(xt 1) 2 + 2L2η2 t bm i=1 xi t xi t 1 2 k=1 f i k(xi t 1) zi k,t 1 2, (72) where βt (0, 1) for all t 1. Proof. 
According the line 6 of Algorithm 2, i.e, ui t = 1 b P k Ii t f i k(xi t) f i k(xi t 1) + (1 βt)ui t 1 + Faster Adaptive Decentralized Learning Algorithms k Ii t f i k(xi t 1) zi k,t 1 + 1 n Pn j=1 zi j,t 1 , we have Et ui t f i(xi t) 2 (73) f i k(xi t) f i k(xi t 1) + (1 βt)ui t 1 + βt 1 f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 f i(xi t) 2 f i k(xi t) f i k(xi t 1) f i(xi t) + f i(xi t 1) + (1 βt)(ui t 1 f i(xi t 1)) f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 f i(xi t 1) 2 f i k(xi t) f i k(xi t 1) f i(xi t) + f i(xi t 1) + βt 1 f i k(xi t 1) zi k,t 1 j=1 zi j,t 1 f i(xi t 1) 2 + (1 βt)2 ui t 1 f i(xi t 1) 2 f i k(xi t) f i k(xi t 1) f i(xi t) + f i(xi t 1) 2 + 2β2 t Et 1 f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 f i(xi t 1) 2 + (1 βt)2 ui t 1 f i(xi t 1) 2 b xi t xi t 1 2 + 2β2 t b 1 n k=1 zi k,t 1 f i k(xi t 1) 2 + (1 βt)2 ui t 1 f i(xi t 1) 2 2L2η2 t 1 b xi t xi t 1 2 + 2β2 t b 1 n k=1 zi k,t 1 f i k(xi t 1) 2 + (1 βt) ui t 1 f i(xi t 1) 2, (74) where the third equality holds by f i k(xi t) f i k(xi t 1) = f i(xi t) f i(xi t 1), f i k(xi t 1) zi k,t 1 = f i(xi t 1) 1 j=1 zj t 1, and the second last inequality holds by the inequality E ζ E[ζ] 2 E ζ 2 and Assumption 3.1; the last inequality holds by 0 < βt 1 and xi t = xi t 1 + ηt 1( xt xi t 1). Let ut = 1 m Pm i=1 ui t, f i(xi t) = 1 n Pn k=1 f i k(xi t) for all i [m] and f(xt) = 1 m Pm i=1( 1 n Pn k=1 f i k(xi t)) = Faster Adaptive Decentralized Learning Algorithms 1 m Pm i=1 f i(xi t). We can obtain Et ut f(xt) 2 = Et 1 i=1 (ui t f i(xi t)) 2 f i k(xi t) f i k(xi t 1) f i(xi t) + f i(xi t 1) + (1 βt)(ui t 1 f i(xi t 1)) f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 f i(xi t 1) 2 (i) =(1 βt)2 ut 1 f(xt 1) 2 + Et 1 f i k(xi t) f i k(xi t 1) f i(xi t) + f i(xi t 1) f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 f i(xi t 1) 2 (1 βt)2 ut 1 f(xt 1) 2 + 2 f i k(xi t) f i k(xi t 1) f i(xi t) + f i(xi t 1) 2 f i k(xi t 1) zi k,t 1 + 1 j=1 zi j,t 1 f i(xi t 1) 2 (1 βt)2 ut 1 f(xt 1) 2 + 2L2 i=1 xi t xi t 1 2 + 2β2 t mb k=1 f i k(xi t 1) zi k,t 1 2 (ii) (1 βt) ut 1 f(xt 1) 2 + 2L2η2 t bm i=1 xi t xi t 1 2 + 2β2 t bm k=1 f i k(xi t 1) zi k,t 1 2, (75) where the above equality (i) is due to for all i [m] f i k(xi t) f i k(xi t 1) = f i(xi t) f i(xi t 1), f i k(xi t 1) zi k,t 1 = f i(xi t 1) 1 j=1 zj t 1, and {Ii t}m i=1 are independent, and the above inequality (ii) holds by 0 < βt < 1 and xi t = xi t 1 + ηt( xi t xi t 1). Lemma A.9. Under Assumption 3.1, the sequence {zi k,t} is defined in Line 11 of Algorithm 2, then we have k=1 f i k(xi t) zi k,t 2 (1 b k=1 f i k(xi t 1) zi k,t 1 2 n 1)L2η2 t 1 xi t xi t 1 2, i [m]. (76) Faster Adaptive Decentralized Learning Algorithms Proof. According to the Line 11 of Algorithm 2, we have k=1 f i k(xi t) zi k,t 2 = (1 b k=1 f i k(xi t) zi k,t 1 2 (77) k=1 f i k(xi t) f i k(xi t 1) + f i k(xi t 1) zi k,t 1 2 k=1 f i k(xi t) f i k(xi t 1) 2 + (1 + α)(1 b k=1 f i k(xi t 1) zi k,t 1 2 n)L2 xi t xi t 1 2 + (1 + α)(1 b k=1 f i k(xi t 1) zi k,t 1 2, where the last inequality is due to Assumption 3.1. Let α = b 2n, then we have i=1 f i(xi t) zi t 2 (1 b i=1 f i(xi t 1) zi t 1 2 + (2n n 1)L2 xi t xi t 1 2 (78) i=1 f i(xi t 1) zi t 1 2 + (2n n 1)L2η2 t 1 xt xi t 1 2, where the above equality holds by xi t xi t 1 = ηt 1( xt xi t 1). Lemma A.10. Given the sequence xi t 1, xi t 1, wi t 1 m i=1 be generated from Algorithm 2. 
We have i=1 xi t+1 xt+1 2 (1 (1 ν2)ηt i=1 xi t xt 2 + 2ηtγ2 i=1 gi t gt 2, i=1 xi t+1 xi t 2 (3 + ν2) i=1 xi t xt 2 + 2(1 + ν2) 1 ν2 γ2 m X i=1 gi t 2, i=1 Et+1 wi t+1 wt+1 2 ν i=1 wi t wt 2 + 3ν2 L2η2 t xi t+1 xi t 2 + β2 t+1 f i(xi t) ui t 2 + β2 t+1 bn k=1 zi k,t f i(xi t) 2 . Proof. For notational simplicity, let xt = [(x1 t)T , , (xm t )T ]T Rmd, xt = [( x1 t)T , , ( xm t )T ]T Rmd and gt = [(g1 t )T , , (gm t )T ]T Rmd for all t 1. By using Assumption 3.4, since W1 = 1 and W = W Id, we have W(1 xt) = 1 xt. Meanwhile, we have 1T (xt 1 x) = 0 and W1 = 1. Thus, we have Wxt 1 xt 2 = W(xt 1 xt) 2 ν2 xt 1 xt 2, (79) where the above inequality holds by xt 1 xt is orthogonal to 1 that is the eigenvector corresponding to the largest eigenvalue of W, and ν denotes the second largest eigenvalue of W. Since xi t+1 = P j Ni Wi,jxj t γ(Ai t) 1wi t = P j Ni Wi,jxj t γgi t for all i [m], we have xt+1 = Wxt γgt and Faster Adaptive Decentralized Learning Algorithms xt+1 = xt γ gt. Since xt+1 = xt + ηt( xt+1 xt) and xt+1 = xt + ηt( xt+1 xt), we have i=1 xi t+1 xt+1 2 = xt+1 1 xt+1 2 (80) = xt + ηt( xt+1 xt) 1 ( xt + ηt( xt+1 xt) 2 (1 + α1)(1 ηt)2 xt 1 xt 2 + (1 + 1 α1 )η2 t xt+1 1 xt+1 2 (i) =(1 ηt) xt 1 xt 2 + ηt xt+1 1 xt+1 2 = (1 ηt) xt 1 xt 2 + ηt Wxt γgt 1 xt γ gt 2 (1 ηt) xt 1 xt 2 + (1 + α2)ηt Wxt 1 xt 2 + (1 + 1 α2 )ηtγ2 gt 1 gt 2 (ii) (1 ηt) xt 1 xt 2 + (1 + ν2)ηt 2 xt 1 xt 2 + ηtγ2(1 + ν2) 1 ν2 gt 1 gt 2 (iii) (1 (1 ν2)ηt 2 ) xt 1 xt 2 + 2ηtγ2 1 ν2 gt 1 gt 2 = (1 (1 ν2)ηt i=1 xi t xt 2 + 2ηtγ2 i=1 gi t gt 2, (81) where the above equality (i) is due to α1 = ηt 1 ηt , and the second inequality (ii) holds by α2 = 1 ν2 2ν2 and Wxt 1 xt 2 ν2 xt 1 xt 2, and the above inequality (ii) is due to 0 < ν < 1. Meanwhile, we have i=1 xi t+1 xt 2 = xt+1 1 xt 2 = Wxt γgt 1 xt 2 (1 + α2)ν2 xt 1 xt 2 + (1 + 1 α2 )γ2 gt 2 (i) = 1 + ν2 i=1 xi t xt 2 + 1 + ν2 1 ν2 γ2 m X i=1 gi t 2, (82) where the last equality (i) holds by α2 = 1 ν2 2ν2 . Then we have i=1 xi t+1 xi t 2 = xt+1 xt 2 = xt+1 1 xt + 1 xt xt 2 2 xt+1 1 xt 2 + 2 xt 1 xt 2 i=1 xi t xt 2 + 2(1 + ν2) 1 ν2 γ2 m X i=1 gi t 2. (83) Let wt = [(w1 t )T , (w2 t )T , , (wm t )T ]T , ut = [(u1 t)T , (u2 t)T , , (um t )T ]T and wt = 1 m Pm i=1 wi t and ut = 1 m Pm i=1 ui t. Then we have for any t 1, wt+1 = W wt + ut+1 ut . Faster Adaptive Decentralized Learning Algorithms According to the above proof of Lemma A.3, we have wt+1 = wt + ut+1 ut for all t 1. Thus we have i=1 wi t+1 wt+1 2 = wt+1 1 wt+1 2 = W wt + ut+1 ut 1 wt + ut+1 ut 2 (1 + c) Wwt 1 wt 2 + (1 + 1 c ) W ut+1 ut 1 ut+1 ut 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 ut+1 ut 1 ut+1 ut 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 ut+1 ut 2, (84) where the last inequality holds by Lemma A.2. Since ui t+1 = 1 k Ii t+1 f i k(xt+1) f i k(xt) + (1 βt+1)ui t + βt+1 1 b P k Ii t+1 f i k(xt) zi k,t + 1 n Pn j=1 zi j,t for any i [m] and t 1, we have Et+1 ut+1 ut 2 i=1 Et+1 ui t+1 ui t 2 f i k(xi t+1) f i k(xi t) βt+1ui t + βt+1 1 f i k(xi t) zi k,t + 1 j=1 zi j,t 2 f i k(xi t+1) f i k(xi t) βt+1ui t + βt+1 f i(xi t) f i k(xi t) zi k,t + 1 j=1 zi j,t f i(xi t) 2 i=1 xi t+1 xi t 2 + 3β2 t+1 i=1 f i(xi t) ui t 2 + 3β2 t+1 bn k=1 zi k,t f i(xi t) 2 i=1 xi t+1 xi t 2 + 3β2 t+1 i=1 f i(xi t) ui t 2 + 3β2 t+1 bn k=1 zi k,t f i(xi t) 2, (85) where the second last inequality holds by Assumption 3.1 and the last equality is due to xi t+1 = xi tηt( xi t+1 xi t). 
Plugging the above inequalities (85) into (84), we have i=1 wi t+1 wt+1 2 (1 + c)ν2 wt 1 wt 2 + (1 + 1 c )ν2 ut+1 ut 2 (1 + c)ν2 wt 1 wt 2 + 3(1 + 1 L2η2 t xi t+1 xi t 2 + β2 t+1 f i(xi t) ui t 2 + β2 t+1 bn k=1 zi k,t f i(xi t) 2 . (86) ν 1, we have i=1 Et+1 wi t+1 wt+1 2 ν i=1 wi t wt 2 + 3ν2 L2η2 t xi t+1 xi t 2 + β2 t+1 f i(xi t) ui t 2 + β2 t+1 bn k=1 zi k,t f i(xi t) 2 . (87) Faster Adaptive Decentralized Learning Algorithms Theorem A.11. (Restatement of Theorem 5.5) Suppose the sequences {xi t}m i=1 T t=1 be generated from Algorithm 2. Under the above Assumptions 3.1-3.5, and let ηt = η, 0 < βt 1 for all t 0, γ min ρ(1 ν2) 48θt , 3ρ(1 ν2)θt 6(1+ν2) Ht , γ(3+ν2) Ht with Ht = 9 bβt + 6ν2 b(1 ν)2 + 4n2β2 t b3 9 βt + 9ν2 (1 ν)2 + 3ν2 (1 ν)2 for all t 0, we t=1 E F( xt) 1 i=1 E[ F(xi t) + L xt xi t ] 6 i=1 E Ai t 2, (88) where G = F ( x1) F ρ2 + 18β2 0ν2 ρ2(1 ν)2 + 3ν2 ρ2(1 ν)2 + 9 2ρ2β0 9 2ρ2 1 m Pm i=1 1 n Pn k=1 f i k(xi 0) 2 is independent on T, b and n. Proof. Without loss of generality, let η = η1 = = ηT . According to Lemma A.4, we have F( xt+1) F( xt) + 2γη ρ F( xt) ut 2 + γη i=1 wi t wt 2 ργη i=1 gi t 2. (89) According to the Lemma A.8, we have Et ut f(xt) 2 (1 βt) ut 1 f(xt 1) 2 + 2L2η2 i=1 xi t xi t 1 2 k=1 f i k(xi t 1) zi k,t 1 2, (90) i=1 Et ui t f i(xi t) 2 (1 βt) 1 i=1 ui t 1 f i(xi t 1) 2 + 2L2η2 i=1 xi t xi t 1 2 + 2β2 t b 1 m k=1 f i k(xi t 1) zi k,t 1 2. (91) According to Lemma A.9, we have for any i [m] k=1 f i k(xt) zi k,t 2 (1 b k=1 f i k(xi t 1) zi k,t 1 2 + (2n n 1)L2η2 xi t xi t 1 2 k=1 f i k(xi t 1) zi k,t 1 2 + 2n b L2η2 xi t xi t 1 2. (92) According to Lemma A.10, we have i=1 Et wi t wt 2 ν 1 i=1 wi t 1 wt 1 2 + 3ν2 L2η2 xi t xi t 1 2 + β2 t f i(xi t 1) ui t 1 2 k=1 zi k,t 1 f i(xi t 1) 2 . (93) Faster Adaptive Decentralized Learning Algorithms Meanwhile, we also have i=1 xi t xt 2 (1 (1 ν2)η i=1 xi t 1 xt 1 2 + 2ηγ2 i=1 gi t 1 gt 1 2 i=1 xi t 1 xt 1 2 + 2ηγ2 i=1 ( gi t 1 2 + gt 1 2) i=1 xi t 1 xt 1 2 + 4ηγ2 i=1 gi t 1 2. (94) i=1 xi t xi t 1 2 (3 + ν2) 1 i=1 xi t 1 xt 1 2 + 2(1 + ν2) i=1 gi t 1 2. (95) Next considering the term ut F( xt) 2, we have ut F( xt) 2 = ut f(xt) + f(xt) F( xt) 2 2 ut f(xt) 2 + 2 f(xt) F( xt) 2 2 ut f(xt) 2 + 2 1 i=1 f i(xi t) 1 i=1 f i( xt) 2 2 ut f(xt) 2 + 2L2 i=1 xi t xt 2, where the last inequality is due to Assumption 3.1. Then we can obtain ut f(xt) 2 1 2 ut F( xt) 2 + L2 i=1 xi t xt 2. (96) Since ut = wt for all t 1, we have i=1 wi t F(xi t) 2 = 1 i=1 wi t wt + ut F( xt) + F( xt) F(xi t) 2 i=1 wi t wt 2 + 3 ut F( xt) 2 + 3 1 i=1 F( xt) F(xi t) 2 i=1 wi t wt 2 + 3 ut F( xt) 2 + 3L2 1 i=1 xi t xt 2. Then we have ut F( xt) 2 1 i=1 wi t F(xi t) 2 + 1 i=1 wi t wt 2 + L2 i=1 xi t xt 2. (97) We define a useful Lyapunov function (i.e., potential function), for any t 1 Φt = Et F( xt) + (λt 1 9γη 2ρ ) ut 1 f(xt 1) 2 + χt 1 1 m i=1 ui t 1 f i(xi t 1) 2 (98) k=1 f i k(xi t 1) zi k,t 1 2 + (θt 1 19γηL2 i=1 xi t 1 xt 1 2 + (ϑt 1 3γη i=1 wi t 1 wt 1 2 + ργη i=1 gi t 1 2 , Faster Adaptive Decentralized Learning Algorithms where αt 1 0, χt 1 0, λt 1 9γη 2ρ , θt 1 29γηL2 6ρ and ϑt 1 3γη 4ρ for all t 1. 
Then we have Φt+1 = Et+1 h F( xt+1) + (λt 9γη 2ρ ) ut f(xt) 2 + χt 1 m i=1 ui t f i(xi t) 2 k=1 f i k(xi t) zi k,t 2 + (θt 29γηL2 i=1 xi t xt 2 i=1 wi t wt 2 + ργη i=1 gi t 2i (i) Et+1 h F( xt) γη 4ρ ut F( xt) 2 ργη i=1 gi t 2 + 9γηL2 i=1 xi t xt 2 + λt ut f(xt) 2 i=1 ui t f i(xi t) 2 + αt(1 b k=1 f i k(xi t 1) zi k,t 1 2 i=1 xi t xi t 1 2 + (θt 29γηL2 i=1 xi t xt 2 i=1 wi t wt 2i (ii) Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + λt(1 βt) ut 1 f(xt 1) 2 + 2λt L2η2 i=1 xi t xi t 1 2 + 2λtβ2 t bm k=1 f i k(xi t 1) zi k,t 1 2 + χt(1 βt) 1 i=1 ui t 1 f i(xi t 1) 2 + 2χt L2η2 i=1 xi t xi t 1 2 + 2χtβ2 t b 1 m k=1 f i k(xi t 1) zi k,t 1 2 + αt(1 b k=1 f i k(xi t 1) zi k,t 1 2 i=1 xi t xi t 1 2 + θt(1 (1 ν2)η i=1 xi t 1 xt 1 2 + θt 4ηγ2 i=1 gi t 1 2 i=1 wi t 1 wt 1 2 + 3ϑtν2 L2η2 xi t xi t 1 2 + β2 t f i(xi t 1) ui t 1 2 k=1 zi k,t 1 f i(xi t 1) 2 i , (99) where the inequality (i) is due to the above inequalities (89), (92) and (96); and the inequality (ii) holds by the above inequalities (90), (91), (93) and (97). Faster Adaptive Decentralized Learning Algorithms Then we have Φt+1 Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + λt(1 βt) ut 1 f(xt 1) 2 + χt χtβt + 3ϑtβ2 t ν2 i=1 ui t 1 f i(xi t 1) 2 i=1 xi t xi t 1 2 2n + 2λtβ2 t b + 2χtβ2 t b + 3β2 t ϑtν2 k=1 f i k(xi t 1) zi k,t 1 2 + θt(1 (1 ν2)η i=1 xi t 1 xt 1 2 + θt 4ηγ2 i=1 gi t 1 2 + ϑtν 1 i=1 wi t 1 wt 1 2i (i) Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + λt(1 βt) ut 1 f(xt 1) 2 + χt χtβt + 3ϑtβ2 t ν2 i=1 ui t 1 f i(xi t 1) 2 1 ν (3 + ν2) 1 i=1 xi t 1 xt 1 2 + 2(1 + ν2) i=1 gi t 1 2 2n + 2λtβ2 t b + 2χtβ2 t b + 3β2 t ϑtν2 k=1 f i k(xi t 1) zi k,t 1 2 + θt(1 (1 ν2)η i=1 xi t 1 xt 1 2 + θt 4ηγ2 i=1 gi t 1 2 + ϑtν 1 i=1 wi t 1 wt 1 2i , where the inequality (i) is due to the above inequality (95). Since 0 < βt < 1 for all t 1, λt = 9γη 2ρ and λt λt 1, then we have λt(1 βt) λt 1 9γη 2ρ . Let χt = 3ϑtν2 1 ν and χt χt 1, we have χt χtβt + 3ϑtβ2 t ν2 1 ν χt 1. Let αt = 2nβ2 t b2 (2λt + 2χt + 3ϑtν2 1 ν ) = 2nβ2 t b2 (2λt + 9ϑtν2 and αt αt 1, then we have αt αtb 2n + 2λtβ2 t b + 2χtβ2 t b + 3β2 t ϑtν2 b(1 ν) αt 1. Let ϑt = γη ρ(1 ν) for all t 1, since 0 < ν < 1, we have ϑtν = ϑt (1 ν)ϑt ϑt 1 3γη 4ρ . Meanwhile, let θt θt 1 for all t 1, γ min ρ(1 ν2) 48θt , 3ρ(1 ν2)θt 6(1+ν2) Ht , γ(3+ν2) Ht with Ht = 9 bβt + 6ν2 b(1 ν)2 + 4n2β2 t b3 9 βt + 9ν2 (1 ν)2 + 3ν2 (1 ν)2 for all t 1, we θt(1 (1 ν2)η 2 ) + L2η2(3 + ν2) 2λt 1 ν θt 1 29γηL2 1 ν2 + 2(1 + ν2) 1 ν2 L2η2γ2 2λt Faster Adaptive Decentralized Learning Algorithms Based on the choice of these parameters and the above inequality (100), we have Φt+1 Et+1 h F( xt) γη i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2 + (λt 1 9γη 2ρ ) ut 1 f(xt 1) 2 + χt 1 1 m i=1 ui t 1 f i(xi t 1) 2 k=1 f i k(xi t 1) zi k,t 1 2 + (θt 1 29γηL2 i=1 xi t 1 xt 1 2 + (ϑt 1 3γη i=1 wi t 1 wt 1 2 + ργη i=1 gi t 1 2i i=1 wi t F(xi t) 2 ργη i=1 gi t 2 γηL2 i=1 xi t xt 2. (103) Then we can obtain i=1 wi t F(xi t) 2 + L2 i=1 xi t xt 2 + 1 i=1 gi t 2 12(Φt Φt+1) γρη . (104) Since x1 0 = x1 0 = = xm 0 = xm 0 , zi 1,0 = zi 2,0 = = zi n,0 = 0 and ui 0 = wi 0 = 0 for any i [m], we have Φ1 = E F( x1) + α0 1 m k=1 f i k(xi 0) 2 + (λ0 9γη 2ρ ) f(x0) 2 + χ0 1 m i=1 f i(xi 0) 2 F( x1) + α0 + χ0 + λ0 9γη k=1 f i k(xi 0) 2 = F( x1) + 18γηβ0 ρ + 18β2 0γην2 ρ(1 ν)2 + 3γην2 ρ(1 ν)2 + 9γη k=1 f i k(xi 0) 2. (105) Let Mi t = gi t + 1 ρ F(xi t) wi t + L ρ xt xi t . 
Then we have Mi t = gi t + 1 ρ F(xi t) wi t + L (i) = (Ai t) 1wi t + 1 ρ F(xi t) wi t + L = 1 Ai t Ai t (Ai t) 1wi t + 1 ρ F(xi t) wi t + L 1 Ai t wi t + 1 ρ F(xi t) wi t + L (ii) 1 Ai t wi t + 1 Ai t F(xi t) wi t + L Ai t xt xi t F(xi t) + L xt xi t , (106) where the equality (i) holds by gi t = (Ai t) 1wi t, and the inequality (ii) holds by Ai t ρ for all t 1 due to Assumption 3.4. Then we have F(xi t) + L xt xi t Mi t Ai t . (107) Faster Adaptive Decentralized Learning Algorithms According to the above inequality (104), we have i=1 E[Mi t]2 1 i=1 wi t F(xi t) 2 + 3L2 i=1 xi t xt 2 + 3 36(Φt Φt+1) γρη 36(Φ1 F ) Tγρη . (108) By using Cauchy-Schwarz inequality, we have i=1 E[ F(xi t) + L xt xi t ] 1 i=1 E Mi t Ai t i=1 E[Mi t]2 i=1 E Ai t 2. (109) By plugging the above inequalities (109) into (108), we can obtain i=1 E[ F(xi t) + L xt xi t ] 6 Φ1 F i=1 E Ai t 2. (110) Since F(x) is L-smooth, we have F( xt) = F( xt) F(xi t) + F(xi t) F(xi t) + L xi t xt . (111) Meanwhile, let G = F ( x1) F ρ2 + 18β2 0ν2 ρ2(1 ν)2 + 3ν2 ρ2(1 ν)2 + 9 2ρ2β0 9 2ρ2 1 m Pm i=1 1 n Pn k=1 f i k(xi 0) 2. Then we have t=1 E F( xt) 1 i=1 E[ F(xi t) + L xt xi t ] i=1 E Ai t 2 i=1 E Ai t 2. (112) Let θt = θ 29γηL2 6ρ for all t 1. Let b = n and βt = b n for all t 1, we have Ht = 9 bβt + 6ν2 b(1 ν)2 + 4n2β2 t b3 9 (1 ν)2 + 3ν2 (1 ν)2 45 + 45ν2 (1 ν)2 . (113) Then we have 1 Ht 1 r (1 ν)2 , and 6(1 + ν2) Ht , γ(3 + ν2) Ht 6(1 + ν2) , Thus we can let η = min ρ 45 + 45ν2 (1 ν)2 and γ = min ρ(1 ν2) 48θ , 3ρ(1 ν2)θ 58L2 . Note that we set βt = b n for all t 1, while we can set β0 (0, 1), which is independent on T, b and n. Thus we have Faster Adaptive Decentralized Learning Algorithms G = F ( x1) F ρ2 + 18β2 0ν2 ρ2(1 ν)2 + 3ν2 ρ2(1 ν)2 + 9 2ρ2β0 9 2ρ2 1 m Pm i=1 1 n Pn k=1 f i k(xi 0) 2 is independent on T, b and n. Let ρ = O(1), η = O(1) and γ = O(1), then we have G = O(1) is independent on T, b and n. Given q 1 T PT t=1 1 m Pm i=1 E Ai t 2 = O(1), set t=1 E F( xt) 6 i=1 E Ai t 2 ϵ, (114) we have T = O(ϵ 2). Since our Algorithm 2 requires b samples, it obtains a sample complexity of Tb = O( nϵ 2). B. Additional Experiment Details In the experiments, we consider two classical undirected networks that connect all clients, i.e., the ring and 3-regular expander networks (Hoory et al., 2006), illustrated in Figures 6 and 7, respectively. Figure 6: An illustration of the ring network with 5 nodes and its mixing matrix. Figure 7: An illustration of the 3-regular expander network with 5 nodes and its mixing matrix.